TalkPHP


ETbyrne 03-02-2009 12:18 AM

help with fscanf
 
I have to read a very large text file with PHP (about 1 million lines and 45.4 MB), and speed is critical. The file is laid out like so.

Notice the extra space before and after each line.

Code:

{"one":1727294,"two":2667541,"three":9998168}
 {"one":7005310,"two":9377441,"three":4658508}
 {"one":2638549,"two":7931823,"three":992431}
 {"one":8817443,"two":1587524,"three":6495056}
 {"one":5009765,"two":4831848,"three":2782592}
 {"one":9882507,"two":4866943,"three":7389221}
 {"one":7161254,"two":281677,"three":9001464}
 {"one":6177062,"two":661010,"three":4880065}
 {"one":850830,"two":5882873,"three":4219360}
 {"one":5865173,"two":8852539,"three":6194152}

Obviously, the file is too large to just load the entire thing into PHP so I'm using fscanf to read the file line by line.

Right now I have it so I can select an entire line if I know the whole thing. Here is what I got for that:

PHP Code:

while ($data = fscanf($fh, "\n{\"one\":1054687,\"two\":8728332,\"three\":2499389}%s\n"))
{
    print_r($data);
}


What I want to do is select one entire line while only knowing part of it (ex: "three":4219360). I'm not sure how to go about this because I have very little experience with fscanf and functions like it. I've tried something like this, but it returns all of the rows:

PHP Code:

while ($data = fscanf($fh, "\n%s\"three\":2499389%s\n"))
{
    print_r($data);
}


I do NOT want to have to load every line into PHP and check it that way as that could be very slow.

Salathe 03-02-2009 01:11 AM

If you're after just a simple search, you could use fgets to grab the line and strpos to search it.

PHP Code:

// line: {"one":8405219,"two":4552659,"three":6640965}
$search = '"three":6640965';

$fp = fopen('misc.txt', 'r');

$result = 'No result';
while ( ! feof($fp))
{
    $line = fgets($fp, 70);
    if (strpos($line, $search) !== FALSE)
    {
        $result = trim($line);
        break;
    }
}
fclose($fp);

echo $result;

For me, searching through a 46MB file with lines like yours took under two seconds to not find a match (worst case scenario).

allworknoplay 03-02-2009 05:05 PM

Quote:

Originally Posted by Salathe (Post 22041)

For me, searching through a 46MB file with lines like yours took under two seconds to not find a match (worst case scenario).


Wow that's pretty fast for a file with a million records...

ETbyrne 03-02-2009 06:13 PM

@allworknoplay: That's because I don't have to load the data into PHP when just using fscanf.

@Salathe: The problem with your code is that it requires me to load every line into PHP and then check it. I need to be able to do this with only fscanf if at all possible.

allworknoplay 03-02-2009 06:51 PM

Quote:

Originally Posted by ETbyrne (Post 22053)
@allworknoplay: That's because I don't have to load the data into PHP when just using fscanf.

@Salathe: The problem with your code is that it requires me to load every line into PHP and then check it. I need to be able to do this with only fscanf if at all possible.

Thanks for bringing this functionality to light for me. I am going to check it out; I could definitely use this for my projects.

I don't necessarily trust databases for everything. Sometimes I think people go overboard with databases because of the ease of use for the SQL language, but it doesn't always make sense to use a database for everything.

allworknoplay 03-02-2009 06:56 PM

Quote:

Originally Posted by ETbyrne (Post 22053)
@allworknoplay: That's because I don't have to load the data into PHP when just using fscanf.

@Salathe: The problem with your code is that it requires me to load every line into PHP and then check it. I need to be able to do this with only fscanf if at all possible.

Ok I have an idea. You don't want to load the entire file into PHP, so I get that.

What if you were to return all of the lines you were searching for anyway, but then once you retrieve the line, you then parse the info?

This way you're only parsing what you need AFTER you get the full line, instead of trying to parse it ahead of time, which would require many more resources?

Does that make sense?
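A minimal sketch of that idea, assuming each line is a standalone JSON object as in the sample at the top of the thread. The function name findAndDecode and the file name are hypothetical; every line gets only a cheap strpos check, and json_decode runs on the matching line alone.

```php
<?php
// Hypothetical helper: scan line by line, decode only the first matching line.
function findAndDecode($path, $needle)
{
    $fp = fopen($path, 'r');
    if ($fp === false) {
        return null;
    }

    $match = null;
    while (($line = fgets($fp)) !== false) {
        // Cheap substring test first; non-matching lines are never parsed.
        if (strpos($line, $needle) !== false) {
            // Parse only the line we actually want.
            $match = json_decode(trim($line), true);
            break;
        }
    }
    fclose($fp);

    return $match;
}
```

Something like findAndDecode('misc.txt', '"three":4219360}') should give back an array with the keys one, two and three, or null if nothing matches. Including the closing brace in the search string keeps a value such as 42193601 from matching by accident.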

Salathe 03-02-2009 08:15 PM

Quote:

Originally Posted by ETbyrne (Post 22053)
@Salathe: The problem with your code is that it requires me to load every line into PHP and then check it. I need to be able to do this with only fscanf if at all possible.

You'd need to fscanf every line so whether you use that or fgets you're still reading every single line (up until the one that matches). fscanf just parses a line according to the formatting string which you don't need to do; you just need to see if the line is the one you want. Unless I'm mistaken. There's no faster way to do what you want without going through the file line-by-line.
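A small sketch to check this claim, assuming nothing beyond a plain text file: it counts how many calls each function needs to reach the end of the file. The helper name countCalls is made up for illustration. If fscanf could skip lines, it would need fewer calls than fgets.

```php
<?php
// Hypothetical helper: count the calls needed to reach end of file.
// Pass true to read with fscanf(), false to read with fgets().
function countCalls($path, $useFscanf)
{
    $fp = fopen($path, 'r');
    $calls = 0;
    while (true) {
        // Both functions return false once there is no more input to read.
        $result = $useFscanf ? fscanf($fp, '%s') : fgets($fp);
        if ($result === false) {
            break;
        }
        $calls++;
    }
    fclose($fp);

    return $calls;
}
```

On a file where every line ends with a newline, both variants should report the same number: one call per line.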

allworknoplay 03-02-2009 08:28 PM

Quote:

Originally Posted by Salathe (Post 22057)
You'd need to fscanf every line so whether you use that or fgets you're still reading every single line (up until the one that matches). fscanf just parses a line according to the formatting string which you don't need to do; you just need to see if the line is the one you want. Unless I'm mistaken. There's no faster way to do what you want without going through the file line-by-line.

hey sal:

You seem to really know your PHP. This is a bit OT from this thread, but would you happen to know of any good PHP/CSS way of creating really professional looking charts/graphs?

I'm currently using Flash graphs, which look spectacular, but I really want to get away from Flash and just use charts that can be output to PNG or JPG format.

I've seen a couple of the popular PHP ways to generate graphs and they just don't look clean and sharp. What I mean is, they look pixelated.

I know this is kind of a loaded question, but if you are familiar with any ways to make really neat-looking charts in PHP, please let me know!

I can scour around and show you the "look and feel" of what I'm looking for, and you can let me know if this is possible or not.

Salathe 03-02-2009 10:12 PM

allworknoplay, post a new topic for that.

ETbyrne 03-02-2009 10:18 PM

Quote:

Originally Posted by Salathe (Post 22057)
You'd need to fscanf every line so whether you use that or fgets you're still reading every single line (up until the one that matches). fscanf just parses a line according to the formatting string which you don't need to do; you just need to see if the line is the one you want. Unless I'm mistaken. There's no faster way to do what you want without going through the file line-by-line.

Well, all I know is that by using fscanf I can find all matching rows in a 45 MB file in 80 ms. I know because I've benchmarked it... I think that is reason to believe that fscanf does not load all that data into PHP but rather does it all in C++ or somehow in the file system. :-)

The problem is I have to know what the entire line is in order to do that.

EDIT: By looking at the comments on http://us.php.net/fscanf I found you can use regex. That could very well solve my problem.
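I'm not certain the format string accepts full regular expressions, so the sketch below sticks to plain %d conversions, which fscanf does support. It assumes every line has exactly the one/two/three layout shown at the top of the thread; the function name findByThree is hypothetical. Instead of putting the wanted number into the format, it parses the three numbers and compares the third.

```php
<?php
// Hypothetical helper: return the three numbers of the first line whose
// "three" value equals $wanted, or null when no line matches.
function findByThree($path, $wanted)
{
    // The leading space in the format skips any whitespace before the brace.
    $format = ' {"one":%d,"two":%d,"three":%d}';

    $fp = fopen($path, 'r');
    if ($fp === false) {
        return null;
    }

    $found = null;
    // fscanf() reads one line per call and returns false at end of file.
    while (($fields = fscanf($fp, $format)) !== false) {
        // Lines that do not fit the format leave some fields unset.
        if (is_array($fields) && isset($fields[2]) && $fields[2] === $wanted) {
            $found = $fields;
            break;
        }
    }
    fclose($fp);

    return $found;
}
```

This still visits every line up to the match, so it is a different way to write the search rather than a way to skip lines.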

Salathe 03-02-2009 11:23 PM

All fscanf does is parse a single line according to the formatting string provided. Show me your code which finds all matching rows in 80ms and the benchmark you did as my own experiments were nothing like that amount of time.

ETbyrne 03-03-2009 01:31 AM

1 Attachment(s)
Attached are the scripts I used to do the same test on a similar file. Takes me about 60 ms to run huge_select.php

Note that this is being run on a ~ 8.5 MB 1 million line file. I also ran these tests on the file listed above.

Run huge_create.php first, then huge_select.php. files.php is a file class I made, file::scan() is a wrapper for the fscanf function. The other php file is the class used for benchmarking.

It worked on my server and my friend's, so I'm not crazy. If it doesn't work for you, then you are doing something wrong... That or I'm doing something terribly right. :-D

Obviously fscanf was not written in PHP and thus works a lot faster than comparing each string manually with PHP.

Salathe 03-03-2009 10:40 AM

I can only reiterate what was said before: fscanf is only parsing a line into your requested format (if it can). Whether you use fscanf or fgets to read a line, PHP is still reading the file line by line behind the scenes.

Your servers must have much faster disk IO than my laptop and cheap shared hosting, which run your tests at over 2 seconds: both with fscanf and fgets.

ETbyrne 03-03-2009 09:57 PM

I don't know what to tell ya man, but I'm just running on a cheap dell... The program did take over 3 seconds when I used the wrong syntax for fscanf, and when I tried loading and checking every line. But, when I got the syntax right it took a little less than 60ms for me and my friend. Every time, and it still does. And that is on completely different hardware and software too.

I highly doubt it is my hardware, or the fact that I'm running Vista, that is making it so fast. I'll have to run these tests on my web host and my old 2000 XP computer.

Could anybody else try running these tests? This is all very interesting indeed.

Salathe 03-03-2009 10:08 PM

What were you doing wrong initially, to take over 3 seconds, and what did you do to fix it? Do you now have a working script doing what you initially wanted (find a matching line)?

ETbyrne 03-03-2009 10:21 PM

Initially, I had the format for the fscanf function messed up so it matched all of the lines in the table. It's pretty easy to mess up, like the second piece of code I posted waaay up at the top of this thread:

PHP Code:

while ($data = fscanf($fh, "\n%s\"three\":2499389%s\n"))
{
    print_r($data);
}


The code above matched all the lines in the text file, so it would print out every line. All the stuff before \"three\" (the newline and the %s thing) was put there in a sad attempt to get the data for the rest of the line. But it messed it all up.
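A short sketch of why that format lets every line through, as far as I can tell from how sscanf behaves (fscanf applies the same parsing to each line it reads). The sample line is taken from the data at the top of the thread; the format is the one above without the newlines.

```php
<?php
// One line from the sample data, with surrounding whitespace stripped.
$line = '{"one":850830,"two":5882873,"three":4219360}';

// The number in the format does NOT match this line's "three" value.
$parsed = sscanf($line, '%s"three":2499389%s');

// %s reads up to the next whitespace, so it swallows the whole line and
// the literal text after it never gets a chance to match. An array still
// comes back, and a non-empty array counts as true in a while() condition.
var_dump(is_array($parsed));
var_dump($parsed[0] === $line);
```

So the loop keeps going for every line, whether or not the line contains the number being searched for.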

allworknoplay 03-04-2009 01:01 AM

Quote:

Originally Posted by ETbyrne (Post 22079)

Could anybody else try running these tests? This is all very interesting indeed.


Yes this is a very interesting issue.

I'll test this out myself. I am running a 64-bit Vista laptop, but that part actually doesn't matter. I have VMware running with CentOS 5.2.

I can definitely help verify how quickly the script runs, because if I can get it at around 60-80ms like you, then to me those would be great benchmark speeds, since it's running on a virtualized OS...

I'll let you know what I discover...

Sakakuchi 03-04-2009 04:25 PM

Tested on a XP-machine(32bit).
Hardware:
AMD Turion 64x2
Harddrive has 5400rpm(can't remember more details :-P )
(Hope I didn't do something wrong ;-))


allworknoplay 03-04-2009 04:28 PM

how many milliseconds equals 1 second?

ETbyrne 03-04-2009 04:30 PM

Quote:

Originally Posted by Sakakuchi (Post 22083)
Tested on a XP-machine(32bit).
Hardware:
AMD Turion 64x2
Harddrive has 5400rpm(can't remember more details :-P )
(Hope I didn't do something wrong ;-))


Yes, it worked for you! Looks like you did everything right, only took 120ms to find over 11 thousand matches! :-D

@allworknoplay: A millisecond is one thousandth of a second.

