ROFLMAO.
Claude decided to crawl one of the sites on my new server, where known bots are redirected to an iocaine maze. Claude has been in the maze for 13k requests so far, over the course of 30 minutes.
I will need to fine-tune the rate limiting, because it didn't hit any rate limits: it scanned using 902 different client IPs. So simply rate limiting by IP doesn't fly. I'll rate limit by (possibly normalized) user agent instead (they all used the same UA).
Over the course of these 30 minutes, it downloaded roughly 300 times less data than if I'd let it scrape the real thing, and each request took about a tenth of the time to serve that the real thing would have. So I saved bandwidth, saved processing time, likely saved RAM too, and served garbage to Claude.
Job well done.
Once I fix rate limiting, the savings will be even larger.
@algernon would adding a sleep before returning the response help?
@buherator Finite trees and periodically changing content would indeed reduce my costs, and would give the crawlers something different when they come back. Good points!
But I like the infinite maze: it works even for URLs it did not itself generate, and each of those simply spawns another maze. Trying to limit the depth would require keeping some state about where a crawl started, and that... complicates the generation.
To limit crawler impact, I'm using rate limiting. It wasn't effective in this case, because I rate limited by IP. But if I rate limit by agent, then Claude would've been booted after about two minutes with a 429, further reducing my costs, and encouraging it to come back later. So I'll update my rate limiting setup, and keep the infinite maze.
Periodically updating content, however, is something I have not factored in yet. That's a good idea! I'll update #iocaine to (optionally) mix a custom seed into its RNG, so if I want to update the content, I'll just update the custom seed and restart. That custom seed could be the git SHA of the NixOS configuration I deploy, for example, or the time when iocaine started, etc. Could even automate it and restart it every day, so each day serves new garbage!
I'm liking this. A lot. Thanks! Will deploy soon ;)
@wolf480pl No, that would hog my resources (an extra connection open). I want to serve garbage fast. I'll fix the rate limiting, so these bots will get a 429 sooner. Then it will be their problem of remembering to come back, and my server will just idle meanwhile.
Granted, a 429 is still some bandwidth and whatnot, possibly more than a slow socket, but I feel that a 429 would waste their time more than a sleep would.
@algernon this is awesome and I want to do it myself too. Is there a write-up or blog on how you set it up?
@Infosecben There are some notes in #iocaine's repo, here, and my exact setup is documented here (the server config is also free software).
Hope that helps! But if you have questions, feel free to @ me, I'm more than happy to help you serve garbage to the robots.
@buherator I ended up deploying a configuration where /run/current-system/boot.json (a NixOS thing) is used as part of the seed. Every time I deploy a new configuration, this file changes, and the next time iocaine restarts, it will serve new content.
I've yet to make a new deployment always trigger an iocaine restart, but that'll happen soon, too.
Thanks again for the idea!
@sparrows https://git.madhouse-project.org/algernon/iocaine
I guess my single-user server & hashtags don't federate all too well. I should include direct links more often, I suppose (note taken: in the future I will).
@buherator Will do! Though, I need to migrate more of my things to the new server. Can't easily do this on my old one. :(
@algernon Can't you rate-limit based on the URL being in the tarpit, regardless of IP and UA?
@brouhaha No, because any URL might end up in the tarpit, due to pre-routing known AI scrapers there based on user agent.
If the tarpit did the rate limiting, then yeah, it would be possible. But that would require the tarpit to keep state and stats, and I don't want it to do that - the reverse proxy can do that already.
And since it is the reverse proxy that directs visitors to the tarpit, it is in the perfect position to also apply rate limits when it does so. I just need to fine-tune the limits, is all.
@algernon just rate limit the maze. Any crawler not respecting robots.txt.
@gunstick Any URL can be in the maze, that's the whole point of it. Depending on the user agent, visiting /whatever might serve either the real /whatever or the maze.
The way this is set up is that known AI agents always end up in the maze, even for legit URLs. They don't even see a robots.txt - I want them to crawl the garbage.
I can rate limit nevertheless: whenever my reverse proxy serves from the garbage generator, rate limiting applies. I just have to configure it properly. There was rate limiting in place when Claude visited, but it was per-IP, and Claude came from 902 IPs, so it wasn't effective. I've since switched to user-agent-based rate limiting, which should work better, but I haven't had an AI visitor since.
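To sketch the idea (this is not my actual Caddy setup, just an illustration with made-up numbers and a made-up user agent string): keep a fixed-window counter per normalized user agent, and start handing out 429s once an agent goes over the limit, no matter how many IPs it comes from.

use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Fixed-window rate limiter keyed by a normalized user agent.
struct UaRateLimiter {
    window: Duration,
    limit: u32,
    counters: HashMap<String, (Instant, u32)>, // agent -> (window start, hits)
}

impl UaRateLimiter {
    fn new(window: Duration, limit: u32) -> Self {
        Self { window, limit, counters: HashMap::new() }
    }

    /// Collapse minor UA variations (versions, comments) into one key, so
    /// every variant of the same crawler counts against the same bucket.
    /// Real UA strings are longer; match this against your own bot list.
    fn normalize(user_agent: &str) -> String {
        user_agent
            .split_whitespace()
            .next()
            .unwrap_or("unknown")
            .split('/')
            .next()
            .unwrap_or("unknown")
            .to_ascii_lowercase()
    }

    /// Returns true if the request is allowed, false if it should get a 429.
    fn allow(&mut self, user_agent: &str) -> bool {
        let key = Self::normalize(user_agent);
        let now = Instant::now();
        let entry = self.counters.entry(key).or_insert((now, 0));
        if now.duration_since(entry.0) > self.window {
            *entry = (now, 0); // start a new window
        }
        entry.1 += 1;
        entry.1 <= self.limit
    }
}

fn main() {
    // Hypothetical limit: 42 requests per agent per minute.
    let mut limiter = UaRateLimiter::new(Duration::from_secs(60), 42);
    // 902 IPs, one user agent: the 43rd request in the window still gets a 429.
    for i in 1..=45 {
        let allowed = limiter.allow("ClaudeBot/1.0 (made-up example UA)");
        println!("request {i}: {}", if allowed { "served garbage" } else { "429" });
    }
}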
@algernon 902 unique IPv4? Could you share some examples out of curiosity? I've got some access logs to peek at and compare against...
@astraleureka Yep, almost a thousand unique IPv4 addresses. I will write a post-mortem next weekend or so, which will include a whole lot of data. Will be posted on my blog, and will toot about it here too.
@rj https://git.madhouse-project.org/algernon/iocaine <- this one.
I should've included a link to it in my toot, hashtags from my single-user instance don't federate well =)
@algernon @Infosecben I thought I had heard some of the bots are using fake user agents that don't identify them as crawlers at all (so your proxy config there wouldn't catch them), is that true?
@aburka @Infosecben Yep, some of them use fake user agents, and those are not caught in this trap. Yet.
I just configured my reverse proxy to direct /cgi-bin/ to the maze, and I will be adding links to it on the sites hosted there, so that crawlers will find it. I can then do some digging in the logs and figure out how to handle the misbehavers.
@algernon @Infosecben I liked an idea I saw around here of putting a link on the main page saying "if you're human don't click here", put the target URL of the link in robots.txt, and then put iocaine on the other end. That way humans won't click (at least not more than once...), well behaved crawlers will stay out, and the bastards will get caught
@aburka @Infosecben yep, that's the plan (in addition to the current setup)!
@algernon I took this post as a reason to check on my own tarpit running Nepenthes, and found that Claude has also been trying to crawl the sites I host.
Since January 17 at 23:21 UTC, it has spent roughly 395k requests on Markov chains generated from the Bee Movie script. I don't have any fancy statistics like yours beyond that, but I'm certainly satisfied with just that.
@algernon @aburka @Infosecben is there a way to do dynamic config so that any source IP that requests something suspicious gets added to the maze list? I'm thinking there are some well-known resources that it makes no sense to see a request for in the course of a normal human visiting the website...
@arichtman @aburka @Infosecben I don't know if it is possible to set that up with Caddy out of the box. If there isn't, I can always write a module.
But first things first: trapping & limiting known baddies is the first step. Leading other baddies into the maze, and limiting within the maze is the next step, and I'll iterate from there, likely by adding IP ranges or new user agents to the known baddies list.
It's a bit manual, but I'm not automating it until it turns out that automating it would save time.
@arichtman Anything that hits something robots.txt tells it not to hit is a candidate... @algernon @aburka @Infosecben
@BenAveling @arichtman @aburka @Infosecben A candidate, yes, but that in itself is far from enough indication. I think a better indicator is how much time it spends in the maze. A human won't spend much time there, and won't crawl links at lightning speed.
@algernon You could incorporate a seed into the maze URLs. Each page would use the seed in its RNG for the links it gives out.
That way from any single starting link, the crawler would always be crawling the 'same' maze.
@magpie3d iocaine already does exactly that. The visited URL is the seed. Or well, most of the seed now, 'cos a static seed is also used, so I can change the content without changing any other part of the config. The URL being the seed is how I got it to generate stable garbage without keeping state.
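Roughly, the trick looks like this (a sketch of the general idea, not iocaine's actual code; the seed values are made up): hash the static seed together with the requested URL, seed a tiny PRNG with the result, and derive everything on the page from that. Same URL, same garbage, no state kept anywhere; change the static seed and the whole maze re-rolls.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a 64-bit seed from the static seed and the requested URL.
/// DefaultHasher is just what std ships; any stable hash would do.
fn page_seed(static_seed: &str, url: &str) -> u64 {
    let mut h = DefaultHasher::new();
    static_seed.hash(&mut h);
    url.hash(&mut h);
    h.finish()
}

/// Tiny deterministic PRNG (splitmix64), so the sketch needs no crates.
struct Rng(u64);
impl Rng {
    fn next(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

/// The outgoing links for a page, fully determined by the seed and the URL.
fn page_links(static_seed: &str, url: &str, count: usize) -> Vec<String> {
    let mut rng = Rng(page_seed(static_seed, url));
    (0..count)
        .map(|_| format!("{}/{:016x}", url.trim_end_matches('/'), rng.next()))
        .collect()
}

fn main() {
    // The static seed could be anything: a deploy hash, boot.json contents, a date.
    let seed = "2025-01-19-deploy";
    // Same URL always yields the same links, with no state kept anywhere.
    assert_eq!(page_links(seed, "/whatever", 3), page_links(seed, "/whatever", 3));
    // A different static seed re-rolls the whole maze.
    assert_ne!(page_links(seed, "/whatever", 3), page_links("new-seed", "/whatever", 3));
    println!("{:#?}", page_links(seed, "/whatever", 3));
}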
@LasseGismo @mullana https://git.madhouse-project.org/algernon/iocaine/src/branch/main/docs/how-it-works.md
More details about it were posted in the thread, and I will post a detailed writeup on my blog (see bio) next weekend, probably. Maybe earlier if I find some extra time.
@algernon The Claude crawler may be Claude, an equivalent of OpenAI created by France, currently reserved for French ministries.
@xdej yes. Which is why it ended up in an infinite maze of garbage in the first place. I very deliberately sent it (and its ilk) there.
@algernon @wolf480pl In my waging of a slightly-different battle, almost no bot pays attention to 429, including the nominally grown-up #Googlebot, but 503s have slightly more effect...
@DamonHD @wolf480pl I don't mind if they don't care much about 429. My 429 has no body, so that's ~3-6kb less served / request, and the garbage generator isn't involved in it.
If they come back, good! Eventually they'll get to train on more remixed Bee Movie. The rate limiting gives the garbage generator a pause. It's mostly there to limit that, not the visitors. If the visitors obey, that's a bonus, but only a side effect.
If I wanted to remove them, I'd block their IP when they hit a rate limit in the maze. But I don't. I want them to see the garbage, and come back for more.
@algernon @wolf480pl I haven't looked at your implementation yet, but I'm tempted to build into my static sites a version of what I think you may be doing: an #MD5-URL-hash-driven maze in a few lines of #Apache config, into which any dodgy request, such as for a PHP file, flings the visitor. Won't care about the UA or anything else much, though could return 503s/429s randomly for fun.
@DamonHD @wolf480pl That works to trap them, too, yup!
I do the UA stuff so that they never even see the real content: known bots get only garbage. Unknown bots are led into a maze, and then get garbage too (and once I adjust my config, they'll get only that as well).
@algernon If you think Claude is bad, wait till Facebook finds you. They're easily 2/3rds of my tarpit traffic. Amazon is pretty horrific as well.
@aaron Oh, they found my other sites, on my old server, but for various reasons (cough legacy code cough), I can only serve a static bee movie there.
If they find anything on my newly built infra, Facebook & Amazon will get the maze treatment too - their user agents and IP ranges are already in my "bad visitor" list. I'm ready. So is the infinitely remixed Bee Movie!
Claude is back!
17.5k requests made today, between 05:00 and 18:50. 7.5k of those hit the rate limit, 10k did not. It started by requesting /robots.txt, then /, and went from there. It doesn't look like it visited any URLs it looked at in the previous scan, but I didn't do a proper comparison, just eyeballed things so far.
No other AI visitor came by yet.
I will tweak the rate limits further. And I will need to do some deep dives into the stats, there are many questions I wish to find answers for!
Hope to do that this coming weekend, and post a summary on my blog, along with a writeup of what I'm trying to accomplish, and how, and stuff like that.
Might end up posting the trap logs too, so others can dive into the data too. IP addresses will be included, as a service.
@algernon Probably a dumb question, but is it ignoring robots.txt? To be clear, I don't think it's your (or my) obligation to block specific bots in robots.txt, I'm just curious why they are fetching it.
@bremner I'm unsure whether it ignores it or not. I never gave it a chance to obey: it was blocked before the site it now tries to crawl had an A record. (And as such, even /robots.txt is generated Bee Movie garbage, and thus unparsable for the robots.)
Oh, another possibly interesting tidbit: this whole thing is hosted on a small CX22 VPS at Hetzner (2 vCPUs, 4 GB RAM, €4.18/month). It's running #NixOS, #Caddy, iocaine, and has a WireGuard tunnel connecting it to a 2014 Intel Mac Mini at home. There is no other service running here; the entire purpose of this VPS is to front for my other servers, which aren't on the public internet.
The heaviest applications on it are Caddy and Iocaine.
So far, even under the heaviest load, it didn't need to touch swap, and the 2 VCPUs were enough to do all the garbage generation. I didn't even notice Claude visiting, even though I was deploying new configurations while it was there.
I did notice that the load on the Mac Mini is a lot lower, because AI bots do not reach it. Saves a ton on bandwidth! Not only do I serve less to the crawlers, there's no traffic over WireGuard for these requests either, so that saves both VPS bandwidth and my own at home, too.
Pretty cool.
@algernon I have only one question: how do i set up an iocaine maze of my own?
@gmc This has some notes about how to deploy it in various ways (with NixOS, Docker, systemd, or without either), and includes sample reverse proxy configs for nginx and Caddy. The README has a list of available options, but only two are mandatory: a word list, and at least one source for the garbage generator.
You can also have a look at how I deployed it: the Caddy snippet I use, and the iocaine config.
Some iocaine stats from the VPS:
❯ systemctl status iocaine
● iocaine.service - iocaine, the deadliest poison known to AI
     Loaded: loaded (/etc/systemd/system/iocaine.service; enabled; preset: ignored)
     Active: active (running) since Sun 2025-01-19 19:31:11 CET; 1 day 1h ago
 Invocation: e4823dc1eb2a433e8f425accc3e63e94
   Main PID: 33574 (iocaine)
         IP: 6M in, 58M out
         IO: 0B read, 0B written
      Tasks: 3 (limit: 4567)
     Memory: 655.7M (peak: 974.3M)
        CPU: 1h 3min 44.830s
     CGroup: /system.slice/iocaine.service
             └─33574 /nix/store/63lc5mvvhjr6csmnqjpmb7lgzx7p49qg-iocaine-0.1.0-snapshot/bin/iocaine --config-file /nix/store/s18k1ks075kisf7dw04r01vm2lacjja8-iocaine.toml
The CPU and memory use are both higher than I'd be happy with, but: it's pretty much the only service on the VPS, so I can afford a bit of waste, and it's also a quick hack that has been optimized neither for memory use nor for CPU. I will be addressing both later.
I thought about throwing up a cache in front, but so far, it wouldn't help much, because only Claude is visiting, and it doesn't seem to hit the same page twice. Once there are more crawlers, caching might be worth it, but not just yet.
More aggressive rate limiting would also help, I suppose... rate limiting only 42% of Claude hits is too low, something closer to 69% would be nicer.
@algernon FWIW, the cgroups memory accounting (and thus the displayed value in systemd) does not differentiate between program memory and (fs) caches, so if that thing touches big files, you'd get a much higher value than it's actually consuming :(
@zhenech There are no big files touched: there's a 1.2 MB word list, an 85 kB Bee Movie script, and a ~1 kB config. There's no other file iocaine touches.
@zhenech It's a Rust application with liberal use of .clone() :)
I know why it eats a lot of RAM. It's just not a big priority to fix it at this time =)
I highly recommend https://en.wikipedia.org/wiki/Special:RandomPage as a source of pseudo-random text to build models from.
Each request takes you to a different article and it's already cached up the wazoo, so they won't notice any sane level of load.
@stuartyeates Hmm! I like that idea!
Every time I restart the garbage generator, it'll grab a hundred or so random pages and train from those (+ the Bee Movie script, I'm not letting that go!).
Oooooooh. I like this a lot. Thank you!
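For anyone wondering what "train from" means here: nothing fancier than a Markov chain. A rough sketch of the general idea (not iocaine's actual implementation, and the corpus, seed and chain order are made up): record which word follows each pair of words in the corpus, then walk that table to emit statistically plausible nonsense.

use std::collections::HashMap;

/// A word-level Markov chain: maps each pair of consecutive words to the
/// words observed to follow that pair in the training corpus.
struct Chain {
    table: HashMap<(String, String), Vec<String>>,
}

impl Chain {
    /// Train on any text: random Wikipedia pages, the Bee Movie script, ...
    fn train(corpus: &str) -> Self {
        let words: Vec<&str> = corpus.split_whitespace().collect();
        let mut table: HashMap<(String, String), Vec<String>> = HashMap::new();
        for w in words.windows(3) {
            table
                .entry((w[0].to_string(), w[1].to_string()))
                .or_default()
                .push(w[2].to_string());
        }
        Chain { table }
    }

    /// Generate up to `n` words of garbage. A trivial LCG stands in for a
    /// real (seeded) RNG, so the output is deterministic for a given seed.
    fn generate(&self, mut state: (String, String), n: usize, seed: u64) -> String {
        let mut out = vec![state.0.clone(), state.1.clone()];
        let mut x = seed;
        for _ in 0..n {
            let Some(next_words) = self.table.get(&state) else { break };
            x = x.wrapping_mul(6364136223846793005).wrapping_add(1); // LCG step
            let next = next_words[(x as usize) % next_words.len()].clone();
            out.push(next.clone());
            state = (state.1.clone(), next);
        }
        out.join(" ")
    }
}

fn main() {
    let corpus = "according to all known laws of aviation there is no way a bee should be able to fly";
    let chain = Chain::train(corpus);
    let start = ("according".to_string(), "to".to_string());
    println!("{}", chain.generate(start, 40, 42));
}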
Note that subbing /en/ for another language code lets you tweak things, but you need to correlate your HTML / HTTP headers etc.
@stuartyeates /en/ will be fine, it's to serve garbage to bots that don't deserve any better. I'm not going to tailor the garbage to their preferences, I'll tailor it to mine! BWAHAHAHA!
This is the same thing that is served to the bots, except on this particular host, it is served to everyone indiscriminately, for demo purposes.
@patatas I changed the hashtag to a link since, but it might have not federated your way yet (not sure how edits federate!).
Nevertheless, this is iocaine: https://git.madhouse-project.org/algernon/iocaine
It's a small tool that generates an endless maze of garbage. You can check the demo here.
In my setup (more info about that here), a reverse proxy (Caddy, in my case) is sitting in front of my sites. If it sees a known bot, it reverse proxies to iocaine, and serves them an infinite maze of garbage (rate limited at 42 requests / user agent at the moment). Otherwise, it reverse proxies the Real Site, with no rate limits.
@paperdigits At the moment, I'm using a list of known user agents, and a few IP ranges. It only works for "well behaved" bots that report themselves, and doesn't trap ones that don't. I have a few tricks up my sleeve for those too, but those aren't fully implemented yet.
Every site behind the reverse proxy has a /cgi-bin/ that also leads to the maze. I plan to litter my normal sites with links to random things there, like: "If you are an AI scraper, and want to end up in an infinite maze of garbage, please visit my Guestbook".
Then, I'll have a small tool (or Caddy plugin, I don't know yet) keep track of which user agents / IP addresses spend an unreasonable amount of time visiting links under /cgi-bin/. Anything above a threshold will get the auto-garbage treatment.
The link littering & /cgi-bin/ monitoring parts aren't implemented yet.
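The monitoring could start out embarrassingly simple, something along these lines (a sketch, assuming a combined-format access log; Caddy logs JSON by default, so the real parsing would differ, and the threshold is made up): count /cgi-bin/ hits per user agent, and flag anything over the threshold for the auto-garbage treatment.

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Count /cgi-bin/ hits per user agent in a combined-format access log and
/// report every agent above a (hypothetical) threshold.
fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("usage: cgi-bin-watch <access.log>");
    let threshold = 50u64; // hypothetical: tune to taste

    let mut hits: HashMap<String, u64> = HashMap::new();
    for line in BufReader::new(File::open(path)?).lines() {
        let line = line?;
        // Combined log format: the request line and the user agent are the
        // 2nd and 6th double-quoted fields, respectively.
        let fields: Vec<&str> = line.split('"').collect();
        let (Some(request), Some(agent)) = (fields.get(1), fields.get(5)) else {
            continue;
        };
        // Request line looks like: GET /cgi-bin/whatever HTTP/1.1
        let hit_maze = request
            .split_whitespace()
            .nth(1)
            .map_or(false, |p| p.starts_with("/cgi-bin/"));
        if hit_maze {
            *hits.entry(agent.to_string()).or_insert(0) += 1;
        }
    }

    for (agent, count) in &hits {
        if *count > threshold {
            println!("{count:>8}  {agent}  <- candidate for the auto-garbage treatment");
        }
    }
    Ok(())
}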
@algernon ah, very cool. I like the cgi-bin idea.
One of the better ones I've seen is to put a link in your robots.txt to a page on your site that doesn't exist, then forward that link into your tarpit. Well-behaved crawlers won't crawl it, people should never see it, but greedy bots will slurp it up. That also makes detection really easy.
@paperdigits Oh, yeah, I will also do that! Disallow: /cgi-bin/ should do the trick. Still want links littered over the real site, for the crawlers that ignore robots.txt.
The more places they get nudged towards the tarpit, the better.
@algernon 902 different clients? How is this not just a DDoS attack, my god
@cinebox Tbh, 902 is not even a lot, my VPS didn't even flinch, and the rate of crawling was like 10 requests / sec, basically nothing.
It sounds scary, but in context, it isn't. The problem with 902 clients is not the 902, but that they're all just scraping for the same thing, and that makes rate limiting slightly harder.
@algernon could you please elaborate on the iocaine maze concept? It sounds really interesting.
@gcsolaroli https://git.madhouse-project.org/algernon/iocaine/src/branch/main/docs/how-it-works.md is a rough overview of how it works. That, the README in the same repo, and reading up on Nepenthes (link to that in the README) should give you a rough idea what these tools are for.
But I'll have a blog post up about it on my blog (see bio) this weekend, with more details, etc. I don't have the time to elaborate further atm :/