Block non-human crawlers with lighttpd
2025-04-20 19:05
Recently, I put a copy of some ZIM files online with kiwix-server. I posted the URL of the site on the Fediverse and, a few days later, the little server was somewhat overloaded. The logs showed that the site was being crawled by search engines and AI training bots. There was no reason to let them do so. A robots.txt file calmed some of them, but not others.
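A minimal robots.txt for this purpose looks like the following; well-behaved crawlers honour it, the rest simply ignore it:

User-agent: *
Disallow: /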
Analysing user agents and IP addresses is not the answer either, because everything is done to make that hard (randomised user agents, requests coming from many datacenter IP ranges). I considered Cloudflare protection, a Google CAPTCHA, or the open-source solution Anubis, but all of them require JavaScript to be enabled in the visitor's browser.
After several tests, I found a simple method that stops these crawlers.
The principle
When a request arrives at the web server, the server checks whether it carries a cookie. If it does not, the server redirects the browser to an HTML form that asks the visitor to tick a checkbox and submit it. If the form is submitted correctly, the visitor receives a cookie and is redirected to the page originally requested. That new request comes with the cookie, so the web server does its job and sends the expected content.
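Seen from the client side, the behaviour is easy to check with curl once everything below is in place (the hostname comes from my configuration, /some/page is an arbitrary path and any cookie will do):

# no cookie: lighttpd answers with a redirect to the form
curl -sI https://zim.pollux.casa/some/page
# => 302, Location: /cookie-check.php?redirect=/some/page

# any cookie present: the request is proxied to the application as usual
curl -sI -b "any-cookie=1" https://zim.pollux.casa/some/page
# => the normal response from the backend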
Lighttpd configuration
server.modules += (
    "mod_redirect",   # url.redirect below is provided by mod_redirect
    "mod_openssl",
    "mod_proxy"
)
$SERVER["socket"] == ":443" {
ssl.engine = "enable"
ssl.privkey = "/etc/acme-sh/pollux.casa_ecc/pollux.casa.key"
ssl.pemfile = "/etc/acme-sh/pollux.casa_ecc/fullchain.cer"
}
include "conf.d/fastcgi.conf"
$HTTP["host"] == "zim.pollux.casa" {
server.document-root = "/var/www/zim.pollux.casa"
# if the requested file is not robots.txt nor cookie-check.php
$HTTP["url"] !~ "^/(robots\.txt|cookie-check\.php)" {
# is there a cookie in the request
$HTTP["cookie"] == "" {
# only cookie-check.php is not redirected
$HTTP["url"] !~ "^/cookie-check.php" {
# redirect to the HTML form
url.redirect-code = 302
url.redirect = ( "^(.*)$" => "/cookie-check.php?redirect=$1" )
}
}
# Proxy configuration
# if can be the app of your choice
proxy.server = (
"" => ((
"host" => "127.0.0.1",
"port" => 3000
))
)
proxy.header = ( "upgrade" => "enable" )
proxy.forwarded = ( "for" => 1, "proto" => 1, "host" => 1 )
}
}
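Here, the application listening on 127.0.0.1:3000 is the kiwix server mentioned in the introduction; it can be started with something along these lines (the ZIM path is just an example):

kiwix-serve --port=3000 /var/lib/zim/*.zim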
cookie-check.php
<?php
header("X-Robots-Tag: noindex, nofollow", true);
// the submitted checkbox value must match the md5 of the user agent printed in the form
if (($_POST['k'] ?? '') === md5($_SERVER['HTTP_USER_AGENT'] ?? '')) {
    // set the access cookie (its value does not matter: lighttpd only checks
    // that some cookie is present)
    setcookie("x-cookie-access", "ok", [
        'path' => '/',
        'samesite' => 'Lax'
    ]);
    // find the originally requested page url
    $redirect = $_GET['redirect'] ?? '/';
    // only allow local redirects: no external urls, no protocol-relative "//host"
    // (str_starts_with() requires PHP >= 8.0)
    if (str_starts_with($redirect, '/') && !str_starts_with($redirect, '//')) {
        header("Location: " . $redirect);
        exit;
    }
    // fallback: default redirection
    header("Location: /");
    exit;
}
?><!DOCTYPE html>
<html lang="en">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta name="color-scheme" content="light dark">
    <title>Human check</title>
</head>
<body>
    <h1>Verifying you are a human</h1>
    <main>
        <!-- no action attribute: the form posts back to the same URL,
             so the ?redirect= parameter is preserved -->
        <form method="POST">
            <p>
                <label>
                    <input type="checkbox" name="k" value="<?=md5($_SERVER['HTTP_USER_AGENT'] ?? '')?>"> I am a human
                </label>
                <br>
                <br>
                <input type="submit">
                <br>
                <br>
                <em>this will set a dummy cookie (AI crawlers do not handle cookies)</em>
            </p>
        </form>
    </main>
</body>
</html>
I think this can be done just as easily with other web servers and other scripting languages.
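For instance, here is a rough, untested sketch of the same idea for nginx (server name, certificate paths, PHP-FPM socket and document root are placeholders to adapt):

server {
    listen 443 ssl;
    server_name zim.example.org;                          # placeholder
    ssl_certificate     /etc/ssl/example/fullchain.cer;   # placeholder
    ssl_certificate_key /etc/ssl/example/key.pem;         # placeholder

    # the form itself is handled by PHP-FPM and never requires a cookie
    location = /cookie-check.php {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME /var/www/zim/cookie-check.php;
        fastcgi_pass unix:/run/php/php-fpm.sock;          # adjust to your PHP-FPM socket
    }

    # robots.txt also stays reachable without a cookie
    location = /robots.txt {
        root /var/www/zim;
    }

    # everything else: no cookie => redirect to the form, otherwise proxy
    location / {
        if ($http_cookie = "") {
            return 302 /cookie-check.php?redirect=$request_uri;
        }
        proxy_pass http://127.0.0.1:3000;
    }
}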
Comments on the Fediverse