
Block non-human crawlers with lighttpd

2025-04-20 19:05

Recently, I put a copy of some ZIM files online with kiwix-server. I posted the URL of the site on the Fediverse and, a few days later, the little server was getting a bit overloaded. The logs showed that the site was being crawled by search engines and AI training bots, and there was no reason to let them. A robots.txt file calmed some of them, but not others.
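For reference, a blanket ban in robots.txt is only two lines (compliant crawlers honour it, the others ignore it):

User-agent: *
Disallow: /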

Analysing user agents and IP addresses is not the answer, because everything is done to make that difficult (randomised user agents, requests coming from many datacenters). I considered Cloudflare protection, Google's CAPTCHA and the open source solution Anubis, but all of them require JavaScript to be enabled in the visitor's browser.

After several tests, I have found a simple method to stop these crawlers.

The principle

When a request arrives at the web server, it checks whether the request carries a cookie. If it does not, the server redirects the browser to an HTML form that asks the user to tick a checkbox and submit it. On a correct submission, the visitor receives a cookie and is redirected to the page they originally requested. That new request carries the cookie, so the web server does its job and sends the expected content.
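Concretely, the exchange looks like this (/A/index is a placeholder path, headers are abbreviated, and the cookie name is the one set by the script shown further down).

First visit, without a cookie:

GET /A/index HTTP/1.1
Host: zim.pollux.casa

HTTP/1.1 302 Found
Location: /cookie-check.php?redirect=/A/index

The form is submitted:

POST /cookie-check.php?redirect=/A/index HTTP/1.1
Host: zim.pollux.casa

HTTP/1.1 302 Found
Set-Cookie: x-cookie-access=ok; path=/; SameSite=Lax
Location: /A/index

The browser retries with the cookie and the content is served:

GET /A/index HTTP/1.1
Host: zim.pollux.casa
Cookie: x-cookie-access=ok

HTTP/1.1 200 OK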

Lighttpd configuration


# mod_redirect provides the url.redirect directives used below
server.modules += (
   "mod_redirect",
   "mod_openssl",
   "mod_proxy"
)

$SERVER["socket"] == ":443" {
    ssl.engine                  = "enable"
    ssl.privkey                 = "/etc/acme-sh/pollux.casa_ecc/pollux.casa.key"
    ssl.pemfile                 = "/etc/acme-sh/pollux.casa_ecc/fullchain.cer"
}

include "conf.d/fastcgi.conf"

$HTTP["host"] == "zim.pollux.casa" {
    server.document-root = "/var/www/zim.pollux.casa"

    # everything except robots.txt and cookie-check.php
    $HTTP["url"] !~ "^/(robots\.txt|cookie-check\.php)" {

        # no cookie in the request (a missing Cookie header matches "")
        $HTTP["cookie"] == "" {

            # belt and braces: cookie-check.php itself must never be redirected
            $HTTP["url"] !~ "^/cookie-check\.php" {
                # redirect to the HTML form
                url.redirect-code = 302
                url.redirect = ( "^(.*)$" => "/cookie-check.php?redirect=$1" )
            }
        }


        # Proxy configuration
        # the backend can be the app of your choice
        proxy.server = (
            "" => ((
                "host" => "127.0.0.1",
                "port" => 3000
            ))
        )
        proxy.header = ( "upgrade" => "enable" )
        proxy.forwarded = ( "for" => 1, "proto" => 1, "host" => 1 )
    }

}
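In my case, the backend on 127.0.0.1:3000 is kiwix-server. With the kiwix-serve binary from kiwix-tools, starting it looks roughly like this (the ZIM path is a placeholder):

kiwix-serve --port=3000 /srv/zim/wikipedia.zim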

cookie-check.php


<?php

header("X-Robots-Tag: noindex, nofollow", true);
if (($_POST['k'] ?? '') === md5($_SERVER['HTTP_USER_AGENT'] ?? ''))
{
        // set the access cookie (any value will do, lighttpd only checks that a cookie is present)
        setcookie("x-cookie-access", "ok", [
          'path' => '/',
          'samesite' => 'Lax'
        ]);
        // find the original requested page url
        $redirect = $_GET['redirect'] ?? '/';

        // no redirection to external urls (security):
        // require a leading "/" and refuse protocol-relative "//host" urls
        if (str_starts_with($redirect, '/') && !str_starts_with($redirect, '//')) {
            header("Location: " . $redirect);
            exit;
        }

        // Fallback: default redirection
        header("Location: /");
        exit;
}
?><!DOCTYPE html>
<html lang="en">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" >
        <meta name="viewport" content="width=device-width, initial-scale=1.0" >
        <meta name="color-scheme" content="light dark" >
        <title>Human check</title>
    </head>
    <body>
        <h1>
        <strong>Verifying you are a human</strong>
        </h1>
        <main>
            <form method="POST">
                <p>
                <input type="checkbox"  name="k" value="<?=md5($_SERVER['HTTP_USER_AGENT'])?>" ><span> I am a human</span>
                <br>
                <br>
                <input type="submit">
                <br>
                <br>
                <em>This will set a dummy cookie (AI crawlers do not manage cookies).</em>
                </p>
            </form>
        </main>
    </body>
</html>
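The whole flow can be exercised from the command line; something like this should work (zim.pollux.casa stands for your own host, and the checksum has to match the User-Agent that curl sends):

# 1. no cookie: lighttpd answers with a 302 towards the form
curl -i https://zim.pollux.casa/A/index

# 2. submit the form with a checksum matching a chosen User-Agent
curl -i -A "test" --data "k=$(printf 'test' | md5sum | cut -d' ' -f1)" \
    "https://zim.pollux.casa/cookie-check.php?redirect=/A/index"

# 3. replay with the cookie: the proxied content comes back
curl -i -H "Cookie: x-cookie-access=ok" https://zim.pollux.casa/A/index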

I think this can be done easily with other web servers and other scripting languages.
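For example, a rough nginx equivalent of the lighttpd rules above could look like this (an untested sketch; these blocks go inside a server block, and the hand-off of cookie-check.php to php-fpm is left out):

location = /robots.txt {
    # served as a plain file, no cookie required
}

location = /cookie-check.php {
    # fastcgi_pass to php-fpm here
}

location / {
    if ($http_cookie = "") {
        return 302 /cookie-check.php?redirect=$request_uri;
    }
    proxy_pass http://127.0.0.1:3000;
}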
