• php

    Posted on November 3rd, 2008

    Written by Jose (Jossi) Fresco Benaim

    Sphider and .htaccess protection

    Sphider, the open source PHP spider (aka web crawler) and search engine, uses the fsockopen() function to fetch the files it spiders. If the site you are spidering is protected with HTTP authentication, whether through .htaccess or through the Apache directives that protect realms, those requests carry no credentials, and Sphider will return a “401 unreachable” error when attempting to fetch files during spidering and indexing.
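To see what the server actually sends back in this situation, the raw response read from the socket can be inspected for its status code and authentication challenge. The following is only a sketch: the function name inspect_auth_challenge and the sample response are hypothetical and not part of Sphider, and the parsing handles just the common case of a single WWW-Authenticate header.

```php
<?php
// Hypothetical helper, not part of Sphider: inspects a raw HTTP response
// (as read from an fsockopen() socket) and reports the status code and,
// if present, the authentication scheme and realm the server asks for.
function inspect_auth_challenge($raw_response)
{
    $result = array('status' => null, 'scheme' => null, 'realm' => null);

    // The header block ends at the first empty line.
    $parts = explode("\r\n\r\n", $raw_response, 2);
    $lines = explode("\r\n", $parts[0]);

    // The status line looks like "HTTP/1.1 401 Authorization Required".
    if (preg_match('#^HTTP/\d\.\d\s+(\d{3})#', $lines[0], $m)) {
        $result['status'] = (int) $m[1];
    }

    foreach (array_slice($lines, 1) as $line) {
        // Header names are case-insensitive, hence the /i modifier.
        if (preg_match('/^WWW-Authenticate:\s*(\w+)(?:\s+realm="([^"]*)")?/i', $line, $m)) {
            $result['scheme'] = $m[1];
            $result['realm']  = isset($m[2]) ? $m[2] : null;
            break;
        }
    }

    return $result;
}

// Made-up sample of what a protected directory might answer.
$sample = "HTTP/1.1 401 Authorization Required\r\n"
        . "WWW-Authenticate: Basic realm=\"Members Only\"\r\n"
        . "Content-Type: text/html\r\n\r\n"
        . "<html>...</html>";

print_r(inspect_auth_challenge($sample));
```

A scheme of Basic means the server accepts a user name and password sent in an Authorization header. A different scheme, such as Digest, would need a different kind of request.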

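    To enable Sphider to access files in a protected realm, we need to modify the functions url_status and getFileContents so that their requests send the credentials in an Authorization header. The change uses the Basic scheme, so the realm must be configured for Basic authentication.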

    First, create a user such as “sphider” in the password file that protects the realm (the file your .htaccess names in AuthUserFile) and assign it a password via the shell…

    htpasswd yourhtpasswdfile sphider

    … and provide a password when prompted. This user will be used exclusively for Sphider.

    Then, modify the functions in /admin/spiderfuncs.php:

    getFileContents function

    Replace:

    $request = "GET $path HTTP/1.0\r\nHost: $host$portq\r\nAccept: $all\r\nUser-Agent: $user_agent\r\n\r\n";
    

    with:

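    // Credentials of the user dedicated to Sphider
    $user="sphider";
    $pass="abc12345";
    // HTTP/1.0, as in the original request, so the server closes the
    // connection after the response and does not use chunked encoding.
    // Each header ends with one CRLF; only the final blank line ends the request.
    $request = "GET $path HTTP/1.0\r\nHost: $host$portq\r\nAuthorization: Basic "  .
               base64_encode("$user:$pass") . "\r\n" .
               "Accept: $all\r\nUser-Agent: $user_agent\r\n\r\n";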
    

    url_status function

    Replace:

    $request = "HEAD $path HTTP/1.0\r\nHost: $host$portq\r\nAccept: $all\r\nUser-Agent: $user_agent\r\n\r\n";
    

    with:

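    // Credentials of the user dedicated to Sphider
    $user="sphider";
    $pass="abc12345";
    // HTTP/1.0, as in the original request, so the server closes the
    // connection after the response and does not use chunked encoding.
    // Each header ends with one CRLF; only the final blank line ends the request.
    $request = "HEAD $path HTTP/1.0\r\nHost: $host$portq\r\nAuthorization: Basic "  .
               base64_encode("$user:$pass") . "\r\n" .
               "Accept: $all\r\nUser-Agent: $user_agent\r\n\r\n";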
    

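    The user name and password will not be visible to people using the search, as they are used solely during indexing. They are, however, stored in plain text in spiderfuncs.php and sent base64-encoded (not encrypted) with every request, which is another reason to keep this account exclusive to Sphider.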
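Because getFileContents and url_status build almost the same request, the header construction could also be kept in one place. A minimal sketch, assuming the variables Sphider already has in scope ($path, $host, $portq, $all, $user_agent); the function name build_auth_request is hypothetical and not part of Sphider, and the host and path in the usage line are placeholders.

```php
<?php
// Hypothetical helper, not part of Sphider: builds a GET or HEAD request
// that carries HTTP Basic credentials.
function build_auth_request($method, $path, $host, $portq, $all, $user_agent, $user, $pass)
{
    // Basic authentication sends "user:password", base64-encoded.
    $credentials = base64_encode("$user:$pass");

    $headers = array(
        "$method $path HTTP/1.0",
        "Host: $host$portq",
        "Authorization: Basic $credentials",
        "Accept: $all",
        "User-Agent: $user_agent",
    );

    // Header lines are separated by CRLF; one empty line ends the header block.
    return implode("\r\n", $headers) . "\r\n\r\n";
}

// Placeholder values for illustration only.
$request = build_auth_request('HEAD', '/private/index.html', 'www.example.com', '',
                              '*/*', 'Sphider', 'sphider', 'abc12345');
echo $request;
```

With a helper like this, the two functions would differ only in the method passed ('GET' or 'HEAD'), and the credentials would be defined once instead of twice.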

    Download this mod for Sphider (1.3.4)

    Download this mod for Sphider Plus (1.6)

    This entry was posted on Monday, November 3rd, 2008 at 5:01 pm and is filed under php.
  • 3 Comments

    Take a look at some of the responses we've had to this article.

    1. dave
      Posted on June 17th

      Hi, I’ve tried this on my website, which is using Webber for user account management, and their protection system uses mod_rewrite via .htaccess. Unfortunately I’m on Mosso without shell access. Is this workaround still possible?

      kind regards,

      Dave

    2. jossi fresco
      Posted on June 18th

      Yes. It should work as well with .htaccess @ Mosso

    3. Posted on July 6th

      Tested today on a website in dev, works perfectly, thanks very much.
