TalkPHP
 
 
Old 03-02-2009, 12:18 AM   #1 (permalink)
how quixotic are you?
 
ETbyrne's Avatar
 
Join Date: Dec 2007
Location: Lapeer, MI
Posts: 445
Thanks: 37
ETbyrne is on a distinguished road
Default help with fscanf

I have to read a very large text file with PHP (about 1 million lines and 45.4 MB) and speed is critical. The file is laid out like so.

Notice the extra space before and after each line:

Code:
 {"one":1727294,"two":2667541,"three":9998168} 
 {"one":7005310,"two":9377441,"three":4658508} 
 {"one":2638549,"two":7931823,"three":992431} 
 {"one":8817443,"two":1587524,"three":6495056} 
 {"one":5009765,"two":4831848,"three":2782592} 
 {"one":9882507,"two":4866943,"three":7389221} 
 {"one":7161254,"two":281677,"three":9001464} 
 {"one":6177062,"two":661010,"three":4880065} 
 {"one":850830,"two":5882873,"three":4219360} 
 {"one":5865173,"two":8852539,"three":6194152}
Obviously, the file is too large to just load the entire thing into PHP so I'm using fscanf to read the file line by line.

Right now I have it so I can select an entire line if I know the whole thing. Here is what I got for that:

PHP Code:
while ($data = fscanf($fh, "\n{\"one\":1054687,\"two\":8728332,\"three\":2499389}%s\n"))
{
    print_r($data);
}

What I want to do is select one entire line while knowing only part of it (e.g. "three":4219360). I'm not sure how to go about this because I have very little experience with fscanf and similar functions. I've tried something like this, but it returns all of the rows:

PHP Code:
while ($data = fscanf($fh, "\n%s\"three\":2499389%s\n"))
{
    print_r($data);
}

I do NOT want to have to load every line into PHP and check it that way as that could be very slow.
__________________
Dingo Web Systems > http://www.dingocode.com
My Website > http://www.evanbot.com
Old 03-02-2009, 01:11 AM   #2 (permalink)
Moderateur
RegEx Guru PHP Guru Top Contributor Advanced Programmer 
 
Salathe's Avatar
 
Join Date: Apr 2007
Posts: 1,393
Thanks: 5
Salathe is on a distinguished road
Default

If you're after just a simple search, you could use fgets to grab the line and strpos to search it.

PHP Code:
// line: {"one":8405219,"two":4552659,"three":6640965}
$search = '"three":6640965';

$fp = fopen('misc.txt', 'r');

$result = 'No result';
while ( ! feof($fp))
{
    $line = fgets($fp, 70);
    if (strpos($line, $search) !== FALSE)
    {
        $result = trim($line);
        break;
    }
}
fclose($fp);

echo $result;
For me, searching through a 46MB file with lines like yours took under two seconds to not find a match (worst case scenario).
Old 03-02-2009, 05:05 PM   #3 (permalink)
The Gregarious
 
allworknoplay's Avatar
 
Join Date: Feb 2009
Location: New York
Posts: 645
Thanks: 64
allworknoplay is on a distinguished road
Default

Quote:
Originally Posted by Salathe View Post

For me, searching through a 46MB file with lines like yours took under two seconds to not find a match (worst case scenario).

Wow that's pretty fast for a file with a million records...
Old 03-02-2009, 06:13 PM   #4 (permalink)
how quixotic are you?
 
ETbyrne's Avatar
 
Join Date: Dec 2007
Location: Lapeer, MI
Posts: 445
Thanks: 37
ETbyrne is on a distinguished road
Default

@allworknoplay: That's because I don't have to load the data into PHP when just using fscanf.

@Salathe: The problem with your code is that it requires me to load every line into PHP and then check it. I need to be able to do this with only fscanf if at all possible.
Old 03-02-2009, 06:51 PM   #5 (permalink)
The Gregarious
 
allworknoplay's Avatar
 
Join Date: Feb 2009
Location: New York
Posts: 645
Thanks: 64
allworknoplay is on a distinguished road
Default

Quote:
Originally Posted by ETbyrne View Post
@allworknoplay: That's because I don't have to load the data into PHP when just using fscanf.

@Salathe: The problem with your code is that it requires me to load every line into PHP and then check it. I need to be able to do this with only fscanf if at all possible.
Thanks for bringing this functionality to light for me. I am going to check it out; I could definitely use this for my projects.

I don't necessarily trust databases for everything. Sometimes I think people go overboard with databases because of the ease of use for the SQL language, but it doesn't always make sense to use a database for everything.
Old 03-02-2009, 06:56 PM   #6 (permalink)
The Gregarious
 
allworknoplay's Avatar
 
Join Date: Feb 2009
Location: New York
Posts: 645
Thanks: 64
allworknoplay is on a distinguished road
Default

Quote:
Originally Posted by ETbyrne View Post
@allworknoplay: That's because I don't have to load the data into PHP when just using fscanf.

@Salathe: The problem with your code is that it requires me to load every line into PHP and then check it. I need to be able to do this with only fscanf if at all possible.
Ok I have an idea. You don't want to load the entire file into PHP, so I get that.

What if you were to return all of the lines you were searching for anyway, but then once you retrieve the line, you then parse the info?

This way you're only parsing what you need AFTER you get the full line, instead of trying to parse it ahead of time which would require much more resources?

Does that make sense?
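
The find-first, parse-later idea could look roughly like the sketch below. It assumes each line is a standalone JSON object, as in the sample data in the first post; `find_and_decode()` is a made-up helper name, and an in-memory stream stands in for the real file. Including the closing brace in the search string avoids matching a longer number that merely starts with the same digits.

```php
<?php
// Hypothetical helper (not from the thread): look for a raw substring first,
// and only decode the one line that matches.
function find_and_decode($fp, string $needle): ?array
{
    while (($line = fgets($fp)) !== false) {
        // Cheap substring check; non-matching lines are never parsed.
        if (strpos($line, $needle) !== false) {
            // Each sample line is a JSON object padded with spaces.
            return json_decode(trim($line), true);
        }
    }
    return null; // reached end of file without a match
}

// An in-memory stream standing in for the real 45 MB file.
$fp = fopen('php://memory', 'w+');
fwrite($fp, " {\"one\":850830,\"two\":5882873,\"three\":4219360} \n");
fwrite($fp, " {\"one\":5865173,\"two\":8852539,\"three\":6194152} \n");
rewind($fp);

$row = find_and_decode($fp, '"three":6194152}');
fclose($fp);
print_r($row);
```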
Old 03-02-2009, 08:15 PM   #7 (permalink)
Moderateur
RegEx Guru PHP Guru Top Contributor Advanced Programmer 
 
Salathe's Avatar
 
Join Date: Apr 2007
Posts: 1,393
Thanks: 5
Salathe is on a distinguished road
Default

Quote:
Originally Posted by ETbyrne View Post
@Salathe: The problem with your code is that it requires me to load every line into PHP and then check it. I need to be able to do this with only fscanf if at all possible.
You'd need to fscanf every line so whether you use that or fgets you're still reading every single line (up until the one that matches). fscanf just parses a line according to the formatting string which you don't need to do; you just need to see if the line is the one you want. Unless I'm mistaken. There's no faster way to do what you want without going through the file line-by-line.
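
To make that concrete: a format string with `%d` conversions turns each line into an array of integers, and the comparison still happens once per line. A minimal sketch, assuming the line layout from the first post; `find_by_three()` is an illustrative name, not something from the thread.

```php
<?php
// Illustrative only: parse every line with fscanf, then compare the third field.
function find_by_three($fh, int $wanted): ?array
{
    // Whitespace in the format matches any run of whitespace in the input,
    // which absorbs the extra space at the start of each line.
    $format = " {\"one\":%d,\"two\":%d,\"three\":%d}";

    // fscanf returns false once the end of the file is reached.
    while (($row = fscanf($fh, $format)) !== false) {
        // Lines that do not fit the format leave unfilled (null) slots.
        if (is_array($row) && $row[2] === $wanted) {
            return $row; // [one, two, three]
        }
    }
    return null;
}

// An in-memory stream standing in for the real file.
$fh = fopen('php://memory', 'w+');
fwrite($fh, " {\"one\":850830,\"two\":5882873,\"three\":4219360} \n");
fwrite($fh, " {\"one\":5865173,\"two\":8852539,\"three\":6194152} \n");
rewind($fh);

$match = find_by_three($fh, 6194152);
fclose($fh);
print_r($match);
```

Every line up to the match is still read and parsed; the format string only changes what each line is converted into.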
Old 03-02-2009, 08:28 PM   #8 (permalink)
The Gregarious
 
allworknoplay's Avatar
 
Join Date: Feb 2009
Location: New York
Posts: 645
Thanks: 64
allworknoplay is on a distinguished road
Default

Quote:
Originally Posted by Salathe View Post
You'd need to fscanf every line so whether you use that or fgets you're still reading every single line (up until the one that matches). fscanf just parses a line according to the formatting string which you don't need to do; you just need to see if the line is the one you want. Unless I'm mistaken. There's no faster way to do what you want without going through the file line-by-line.
Hey Sal:

You seem to really know your PHP. This is a bit OT for this thread, but would you happen to know of any good PHP/CSS way of creating really professional-looking charts/graphs?

I'm currently using Flash graphs, which look spectacular, but I really want to get away from Flash and just use charts that can be output to PNG or JPG format.

I've seen a couple of the popular PHP ways to generate graphs and they just don't look clean and sharp. What I mean is, they look pixelated.

I know this is kind of a loaded question, but if you are familiar with any ways to make really neat-looking charts in PHP, please let me know!

I can scour around and show you the "look and feel" of what I'm looking for, and you can let me know if this is possible or not.
Old 03-02-2009, 10:12 PM   #9 (permalink)
Moderateur
RegEx Guru PHP Guru Top Contributor Advanced Programmer 
 
Salathe's Avatar
 
Join Date: Apr 2007
Posts: 1,393
Thanks: 5
Salathe is on a distinguished road
Default

allworknoplay, post a new topic for that.
Old 03-02-2009, 10:18 PM   #10 (permalink)
how quixotic are you?
 
ETbyrne's Avatar
 
Join Date: Dec 2007
Location: Lapeer, MI
Posts: 445
Thanks: 37
ETbyrne is on a distinguished road
Default

Quote:
Originally Posted by Salathe View Post
You'd need to fscanf every line so whether you use that or fgets you're still reading every single line (up until the one that matches). fscanf just parses a line according to the formatting string which you don't need to do; you just need to see if the line is the one you want. Unless I'm mistaken. There's no faster way to do what you want without going through the file line-by-line.
Well, all I know is that by using fscanf I can find all matching rows in a 45 MB file in 80 ms. I know because I've benchmarked it... I think that is reason to believe that fscanf does not load all that data into PHP but rather does it all in C++ or somehow in the file system.

The problem is I have to know what the entire line is in order to do that.

EDIT: By looking at the comments on http://us.php.net/fscanf I found you can use regex. That could very well solve my problem.
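
What those comments describe appears to be closer to character classes than full regular expressions: PHP's scanf family accepts `%[...]` and `%[^...]` conversions that capture a run of characters inside (or outside) a set. A rough sketch of how that might be used to grab the body of a line and then test the part that is known; `line_body()` is a made-up helper and the sample line comes from the first post.

```php
<?php
// Made-up helper: capture everything between the braces with a negated set.
function line_body(string $line): ?string
{
    // " {"     skips leading whitespace, then matches the opening brace
    // "%[^}]"  captures every character up to (not including) the closing brace
    $parts = sscanf($line, " {%[^}]");
    return (is_array($parts) && $parts[0] !== null) ? $parts[0] : null;
}

$line  = " {\"one\":850830,\"two\":5882873,\"three\":4219360} ";
$body  = line_body($line);
$known = '"three":4219360';

// The known part sits at the end of the body, so compare the tail.
$matches = $body !== null && substr($body, -strlen($known)) === $known;
var_dump($matches);
```

Note that this still runs once per line, so it changes how a line is matched, not how many lines are read.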
Old 03-02-2009, 11:23 PM   #11 (permalink)
Moderateur
RegEx Guru PHP Guru Top Contributor Advanced Programmer 
 
Salathe's Avatar
 
Join Date: Apr 2007
Posts: 1,393
Thanks: 5
Salathe is on a distinguished road
Default

All fscanf does is parse a single line according to the formatting string provided. Show me the code that finds all matching rows in 80 ms, along with the benchmark you ran, as my own experiments took nowhere near so little time.
Old 03-03-2009, 01:31 AM   #12 (permalink)
how quixotic are you?
 
ETbyrne's Avatar
 
Join Date: Dec 2007
Location: Lapeer, MI
Posts: 445
Thanks: 37
ETbyrne is on a distinguished road
Default

Attached are the scripts I used to do the same test on a similar file. Takes me about 60 ms to run huge_select.php

Note that this is being run on a ~ 8.5 MB 1 million line file. I also ran these tests on the file listed above.

Run huge_create.php first, then huge_select.php. files.php is a file class I made, file::scan() is a wrapper for the fscanf function. The other php file is the class used for benchmarking.

It worked on both my server and my friend's, so I'm not crazy. If it doesn't work for you then you are doing something wrong... That or I'm doing something terribly right.

Obviously fscanf was not written in PHP and thus works a lot faster than comparing each string manually with PHP.
Attached Files
File Type: zip filesfind.zip (3.3 KB, 12 views)
Old 03-03-2009, 10:40 AM   #13 (permalink)
Moderateur
RegEx Guru PHP Guru Top Contributor Advanced Programmer 
 
Salathe's Avatar
 
Join Date: Apr 2007
Posts: 1,393
Thanks: 5
Salathe is on a distinguished road
Default

I can only reiterate what was said before, fscanf is only parsing a line into your requested format (if it can). Whether you use fscanf to read a line or fgets, PHP is still reading the file line by line behind-the-scenes.

Your servers must have much faster disk IO than my laptop and cheap shared hosting, which run your tests at over 2 seconds: both with fscanf and fgets.
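
One way to compare the two approaches on the same machine is to time them side by side. The sketch below is a stand-in for the attached scripts, which are not reproduced here: it builds a small temporary file in the format from the first post and times a full scan with each function. The helper names, the line count, and the generated values are all arbitrary assumptions.

```php
<?php
// Build a temporary test file with $lines rows in the thread's format.
function make_test_file(int $lines): string
{
    $path = tempnam(sys_get_temp_dir(), 'scan');
    $fh = fopen($path, 'w');
    for ($i = 0; $i < $lines; $i++) {
        fwrite($fh, sprintf(" {\"one\":%d,\"two\":%d,\"three\":%d} \n", $i, $i * 2, $i % 100));
    }
    fclose($fh);
    return $path;
}

// Count lines whose "three" field equals $wanted, using fgets + strpos.
function count_with_fgets(string $path, int $wanted): int
{
    $needle = '"three":' . $wanted . '}';
    $count = 0;
    $fh = fopen($path, 'r');
    while (($line = fgets($fh)) !== false) {
        if (strpos($line, $needle) !== false) {
            $count++;
        }
    }
    fclose($fh);
    return $count;
}

// Same count, but letting fscanf parse each line into integers first.
function count_with_fscanf(string $path, int $wanted): int
{
    $count = 0;
    $fh = fopen($path, 'r');
    while (($row = fscanf($fh, " {\"one\":%d,\"two\":%d,\"three\":%d}")) !== false) {
        if (is_array($row) && $row[2] === $wanted) {
            $count++;
        }
    }
    fclose($fh);
    return $count;
}

$path = make_test_file(10000);
foreach (['count_with_fgets', 'count_with_fscanf'] as $fn) {
    $start = microtime(true);
    $found = $fn($path, 42);
    printf("%s: %d matches in %.1f ms\n", $fn, $found, (microtime(true) - $start) * 1000);
}
unlink($path);
```

Both functions read the whole file, so any large gap between them on a given machine would point at the matching step rather than at disk IO.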
Old 03-03-2009, 09:57 PM   #14 (permalink)
how quixotic are you?
 
ETbyrne's Avatar
 
Join Date: Dec 2007
Location: Lapeer, MI
Posts: 445
Thanks: 37
ETbyrne is on a distinguished road
Default

I don't know what to tell ya man, but I'm just running on a cheap Dell... The program did take over 3 seconds when I used the wrong syntax for fscanf, and when I tried loading and checking every line. But when I got the syntax right, it took a little less than 60 ms for me and my friend. Every time, and it still does. And that is on completely different hardware and software too.

I highly doubt it is my hardware, or the fact that I'm running Vista, that is making it so fast. I'll have to run these tests on my web host and my old 2000 XP computer.

Could anybody else try running these tests? This is all very interesting indeed.
Old 03-03-2009, 10:08 PM   #15 (permalink)
Moderateur
RegEx Guru PHP Guru Top Contributor Advanced Programmer 
 
Salathe's Avatar
 
Join Date: Apr 2007
Posts: 1,393
Thanks: 5
Salathe is on a distinguished road
Default

What were you doing wrong initially, to take over 3 seconds, and what did you do to fix it? Do you now have a working script doing what you initially wanted (find a matching line)?
Old 03-03-2009, 10:21 PM   #16 (permalink)
how quixotic are you?
 
ETbyrne's Avatar
 
Join Date: Dec 2007
Location: Lapeer, MI
Posts: 445
Thanks: 37
ETbyrne is on a distinguished road
Default

Initially, I had the format for the fscanf function messed up so it matched all of the lines in the table. It's pretty easy to mess up, like the second piece of code I posted waaay up at the top of this thread:

PHP Code:
while ($data = fscanf($fh, "\n%s\"three\":2499389%s\n"))
{
    print_r($data);
}

The code above matched all the lines in the text file, so it would print out every line. All the stuff before \"three\" (the newline and the %s thing) was put there in a sad attempt to get the data for the rest of the line. But it messed it all up.
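
A likely explanation for why that pattern matched every line: `%s` reads a run of non-whitespace characters, and these lines contain no spaces between the braces, so the first `%s` swallows the whole line. The literal `"three":2499389` then fails to match, but the call still hands back an array holding the one field it did fill, and a non-empty array keeps the `while` loop going. A small sketch with sscanf, which follows the same format rules, assuming a line from the sample data:

```php
<?php
// A sample line that does NOT contain "three":2499389.
$line = " {\"one\":1727294,\"two\":2667541,\"three\":9998168} ";

// Same format string as in the loop above.
$data = sscanf($line, "\n%s\"three\":2499389%s\n");

// The first %s has consumed the entire line, so the result is still a
// non-empty array even though this is not the line being searched for.
var_dump($data[0] === trim($line));
var_dump((bool) $data);
```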
Old 03-04-2009, 01:01 AM   #17 (permalink)
The Gregarious
 
allworknoplay's Avatar
 
Join Date: Feb 2009
Location: New York
Posts: 645
Thanks: 64
allworknoplay is on a distinguished road
Default

Quote:
Originally Posted by ETbyrne View Post

Could anybody else try running these tests? This is all very interesting indeed.

Yes this is a very interesting issue.

I'll test this out myself. I am running a 64-bit Vista laptop, but that part actually doesn't matter. I have VMware running with CentOS 5.2.

I can definitely help verify how quickly the script runs, because if I can get it at around 60-80 ms like you, then to me that would be great benchmark speed, since it's running on a virtualized OS...

I'll let you know what I discover...
Old 03-04-2009, 04:25 PM   #18 (permalink)
The Contributor
 
Sakakuchi's Avatar
 
Join Date: Feb 2009
Posts: 64
Thanks: 1
Sakakuchi is on a distinguished road
Default

Tested on an XP machine (32-bit).
Hardware:
AMD Turion 64x2
Hard drive runs at 5400 rpm (can't remember more details)
(Hope I didn't do something wrong)

Sakakuchi is offline  
Reply With Quote
Old 03-04-2009, 04:28 PM   #19 (permalink)
The Gregarious
 
allworknoplay's Avatar
 
Join Date: Feb 2009
Location: New York
Posts: 645
Thanks: 64
allworknoplay is on a distinguished road
Default

How many milliseconds are in one second?
Old 03-04-2009, 04:30 PM   #20 (permalink)
how quixotic are you?
 
ETbyrne's Avatar
 
Join Date: Dec 2007
Location: Lapeer, MI
Posts: 445
Thanks: 37
ETbyrne is on a distinguished road
Default

Quote:
Originally Posted by Sakakuchi View Post
Tested on an XP machine (32-bit).
Hardware:
AMD Turion 64x2
Hard drive runs at 5400 rpm (can't remember more details)
(Hope I didn't do something wrong)

Yes, it worked for you! Looks like you did everything right, only took 120ms to find over 11 thousand matches!

@allworknoplay: A millisecond is one thousandth of a second.