06-10-2010, 12:56 PM   #11
soma56
The Visitor
 
Join Date: Jun 2010
Posts: 3
Old Thread New PHP Programmer

I can see that this thread is a few years old. However, it turned up as one of the only results for connecting a proxy to file_get_contents. I've been learning PHP for the last two weeks and I've been trying a few experiments. For simplicity's sake, here's a little program I put together that uses Yahoo: it cycles through a set number of Yahoo results pages and searches each page for a word.

PHP Code:
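<?php

// assign variable to query
$query = "The colors of Serps";

// assign integer to starting Yahoo result offset (1, 11, 21, ...)
$page = 1;

// define pattern to search for
// (as written it matches "red" immediately followed by "blue";
// use "/(red|blue)/" to match either word)
$pattern = "/(red)(blue)/";

// create while loop to cycle through pages
while ($page <= 100) {

    // pause one second between requests
    usleep(1000000);

    // assign variable to Yahoo Search page for the current offset
    $yahoo = "http://ca.search.yahoo.com/search?&b=$page&p=" . urlencode($query);

    // get webpage contents (file_get_contents returns false on failure)
    $resultspage = file_get_contents($yahoo);

    // search for your pattern in Serp
    if (!empty($resultspage)) {
        $res = preg_match_all($pattern, $resultspage, $matches);
        if ($res) {
            // use a separate loop variable so $pattern is not overwritten
            foreach (array_unique($matches[0]) as $match) {
                echo $match . "<br />" . PHP_EOL;

                // push output to the browser as it is produced
                if (ob_get_level() > 0) {
                    ob_flush();
                }
                flush();

                usleep(50000);
            }
        }
    }

    // move on to the next ten results, even if this fetch came back empty
    $page = $page + 10;
}

echo "<br />";
echo "PROGRAM END<br /><br />";
exit;

?>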
I put this together myself, and as I've only been learning PHP for the last two weeks I think I've come a long way. The problem is that Yahoo eventually returns a '999' error and temporarily blocks your IP when you make too many requests in a short time, and I can understand why. That being said, the only logical solution I can see would be to have the file_get_contents function go through a proxy.
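
For reference, file_get_contents accepts a stream context as its third argument, and the HTTP wrapper's `proxy` option is one way to route a request through a proxy. The following is only a minimal sketch under assumptions: the proxy is given as `host:port`, the address `192.0.2.10:8080` is a placeholder rather than a real proxy, and the helper names `buildProxyContextOptions` and `statusCodeFromHeaders` are made up for illustration.

```php
<?php

// Hypothetical helper: build HTTP stream-context options for a "host:port" proxy.
function buildProxyContextOptions(string $proxy, int $timeoutSeconds = 10): array
{
    return [
        'http' => [
            'proxy'           => 'tcp://' . $proxy,
            'request_fulluri' => true,  // many HTTP proxies expect the full URI in the request line
            'timeout'         => $timeoutSeconds,
            'ignore_errors'   => true,  // still return the body on error statuses such as 999
        ],
    ];
}

// Hypothetical helper: read the status code from the first response header line,
// e.g. "HTTP/1.1 200 OK" gives 200. Returns 0 if the line is not recognised.
function statusCodeFromHeaders(array $headers): int
{
    if (isset($headers[0]) && preg_match('#^HTTP/\S+\s+(\d{3})#', $headers[0], $m)) {
        return (int) $m[1];
    }
    return 0;
}

// Usage outline (not executed here; the proxy address is a placeholder):
// $context = stream_context_create(buildProxyContextOptions('192.0.2.10:8080'));
// $html    = file_get_contents($url, false, $context);
// $status  = statusCodeFromHeaders($http_response_header ?? []);
// if ($status === 999) { /* this IP looks blocked; try another proxy */ }
```

Checking the status code this way would let the script notice a '999' response and switch proxies instead of continuing to send requests from a blocked address.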

I have a subscription to a page that I log into, which gives me access to a simple page of proxies that is updated with new ones every second. The list conveniently comes out in the following format:

proxy1:portA
proxy2:portB
proxy3:portC
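
A list in that shape could be turned into a PHP array with a few lines. This is a rough sketch, assuming one `host:port` entry per line with a numeric port; `parseProxyList` is a hypothetical name, and the sample addresses in the usage note are placeholders.

```php
<?php

// Hypothetical helper: turn a text page of "host:port" lines into a clean array.
function parseProxyList(string $text): array
{
    $proxies = [];
    // \R matches any line ending (\n, \r\n or \r)
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        // keep only lines shaped like host:port with a numeric port
        if (preg_match('/^[A-Za-z0-9.\-]+:\d{1,5}$/', $line)) {
            $proxies[] = $line;
        }
    }
    // drop duplicates and reindex from 0
    return array_values(array_unique($proxies));
}
```

For example, `parseProxyList("192.0.2.1:8080\n192.0.2.2:3128\n")` would give an array with those two entries, while blank or malformed lines are skipped.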

I'll be reviewing the earlier responses in this thread to see if I can figure out how to do this.

Essentially I want the script to:

1> Auto log into my proxy page
2> Grab five random proxies (or top proxies)
3> Use them to make requests as per the script above
4> Loop back to "2" and get more proxies
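
Steps 2 and 3 above can be sketched roughly as follows. This assumes the proxies are already in an array of `host:port` strings; the login in step 1 depends on the subscription site and is not shown. `pickRandomProxies` and `proxyForRequest` are hypothetical names used only for illustration.

```php
<?php

// Hypothetical helper for step 2: pick up to $count random proxies from the list.
function pickRandomProxies(array $proxies, int $count = 5): array
{
    $proxies = array_values($proxies);
    shuffle($proxies);
    return array_slice($proxies, 0, $count);
}

// Hypothetical helper for step 3: choose the proxy for the Nth request, round-robin.
function proxyForRequest(array $batch, int $requestNumber): string
{
    if (count($batch) === 0) {
        throw new InvalidArgumentException('No proxies available');
    }
    return $batch[$requestNumber % count($batch)];
}

// Outline of the full loop (not executed here):
// 1. log in and download the proxy page (site-specific, not shown)
// 2. $batch = pickRandomProxies($allProxies, 5);
// 3. for each request $i, send it through proxyForRequest($batch, $i)
// 4. when the batch is used up or a proxy gets blocked, go back to step 2
```

Round-robin rotation spreads the requests evenly across the batch; picking a random proxy per request would also work, but could hit the same address several times in a row.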

I would think this would eliminate the block by Yahoo, as the requests would be coming from multiple IPs. Am I on the right path?