ROFLMAO.
Claude decided to crawl one of the sites on my new server, where known bots are redirected to an iocaine maze. Claude has been in the maze for 13k requests so far, over the course of 30 minutes.
I will need to fine-tune the rate limiting, because it didn't hit any rate limits: it scanned using 902 different client IPs. So simply rate limiting by IP doesn't fly. I'll rate limit by (possibly normalized) user agent instead (they all used the same UA).
Over the course of these 30 minutes, it downloaded roughly 300 times less data than if I'd let it scrape the real thing, and each request took about a tenth of the time to serve that the real thing would have. So I saved bandwidth, saved processing time, likely saved RAM too, and served garbage to Claude.
Job well done.
Once I fix rate limiting, the savings will be even larger.
@algernon would adding a sleep before returning the response help?
@buherator Finite trees and periodically changing content would indeed reduce my costs, and would give the crawlers something different when they come back. Good points!
But I like the infinite maze: it works even for URLs it did not itself generate, and each of those simply spawns another maze. Trying to limit the depth would require keeping some state about where a crawl started, and that... complicates the generation.
To limit crawler impact, I'm using rate limiting. It wasn't effective in this case, because I rate limited by IP. But if I rate limit by agent, then Claude would've been booted after about two minutes with a 429, further reducing my costs, and encouraging it to come back later. So I'll update my rate limiting setup, and keep the infinite maze.
Periodically updating content, however, is something I have not factored in yet. That's a good idea! I'll update #iocaine to (optionally) mix a custom seed into its RNG, so if I want to update the content, I'll just update the custom seed and restart. That custom seed could be the git SHA of the NixOS configuration I deploy, for example, or the time when iocaine started, etc. Could even automate it and restart it every day, so each day serves new garbage!
I'm liking this. A lot. Thanks! Will deploy soon ;)
@wolf480pl No, that would hog my resources (an extra connection open). I want to serve garbage fast. I'll fix the rate limiting, so these bots will get a 429 sooner. Then it will be their problem of remembering to come back, and my server will just idle meanwhile.
Granted, a 429 is still some bandwidth and whatnot, possibly more than a slow socket, but I feel that a 429 would waste their time more than a sleep would.
@algernon this is awesome and I want to do it myself too. Is there a write-up or blog on how you set it up?
@Infosecben There are some notes in #iocaine's repo, here, and my exact setup is documented here (the server config is also free software).
Hope that helps! But if you have questions, feel free to @ me, I'm more than happy to help you serve garbage to the robots.
@buherator I ended up deploying a configuration where /run/current-system/boot.json (a NixOS thing) is used as part of the seed. Every time I deploy a new configuration, this file changes, and the next time iocaine restarts, it will serve new content.
I've yet to make a new deployment always trigger an iocaine restart, but that'll happen soon, too.
Thanks again for the idea!
@sparrows https://git.madhouse-project.org/algernon/iocaine
I guess my single-user server & hashtags don't federate all too well. I should include direct links more often, I suppose (note taken: in the future I will).
@buherator Will do! Though, I need to migrate more of my things to the new server. Can't easily do this on my old one. :(
@algernon Can't you rate-limit based on the URL being in the tarpit, regardless of IP and UA?
@brouhaha No, because any URL might end up in the tarpit, due to pre-routing known AI scrapers there based on user agent.
If the tarpit did the rate limiting, then yeah, it would be possible. But that would require the tarpit to keep state and stats, and I don't want it to do that - the reverse proxy can do that already.
And since it is the reverse proxy that directs visitors to the tarpit, it is in the perfect position to also apply rate limits when it does so. I just need to fine-tune the limits, is all.
@algernon just rate limit the maze. Any crawler not respecting robots.txt.
@gunstick Any URL can be in the maze, that's the whole point of it. Depending on the user agent, visiting /whatever might serve either the real /whatever or the maze.
The way this is set up is that known AI agents always end up in the maze, even for legit URLs. They don't even see a robots.txt - I want them to crawl the garbage.
I can rate limit nevertheless: whenever my reverse proxy serves from the garbage generator, rate limiting applies. I just have to configure it properly. There was rate limiting in place when Claude visited, but it was per-IP, and Claude came from 902 IPs, so it wasn't effective. I've since switched to user-agent-based rate limiting, which should work better, but I haven't had an AI visitor since.
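To sketch the idea (this is not my actual Caddy setup, just an illustration with made-up numbers and a made-up user agent string): keep a fixed-window counter per normalized user agent, and start handing out 429s once an agent goes over the limit, no matter how many IPs it comes from.

use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Fixed-window rate limiter keyed by a normalized user agent.
struct UaRateLimiter {
    window: Duration,
    limit: u32,
    counters: HashMap<String, (Instant, u32)>, // agent -> (window start, hits)
}

impl UaRateLimiter {
    fn new(window: Duration, limit: u32) -> Self {
        Self { window, limit, counters: HashMap::new() }
    }

    /// Collapse minor UA variations (versions, comments) into one key, so
    /// every variant of the same crawler counts against the same bucket.
    /// Real UA strings are longer; match this against your own bot list.
    fn normalize(user_agent: &str) -> String {
        user_agent
            .split_whitespace()
            .next()
            .unwrap_or("unknown")
            .split('/')
            .next()
            .unwrap_or("unknown")
            .to_ascii_lowercase()
    }

    /// Returns true if the request is allowed, false if it should get a 429.
    fn allow(&mut self, user_agent: &str) -> bool {
        let key = Self::normalize(user_agent);
        let now = Instant::now();
        let entry = self.counters.entry(key).or_insert((now, 0));
        if now.duration_since(entry.0) > self.window {
            *entry = (now, 0); // start a new window
        }
        entry.1 += 1;
        entry.1 <= self.limit
    }
}

fn main() {
    // Hypothetical limit: 42 requests per agent per minute.
    let mut limiter = UaRateLimiter::new(Duration::from_secs(60), 42);
    // 902 IPs, one user agent: the 43rd request in the window still gets a 429.
    for i in 1..=45 {
        let allowed = limiter.allow("ClaudeBot/1.0 (made-up example UA)");
        println!("request {i}: {}", if allowed { "served garbage" } else { "429" });
    }
}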
@algernon 902 unique IPv4? Could you share some examples out of curiosity? I've got some access logs to peek at and compare against...
@astraleureka Yep, almost a thousand unique IPv4 addresses. I will write a post-mortem next weekend or so, which will include a whole lot of data. Will be posted on my blog, and will toot about it here too.
@rj https://git.madhouse-project.org/algernon/iocaine <- this one.
I should've included a link to it in my toot, hashtags from my single-user instance don't federate well =)
@algernon @Infosecben I thought I had heard some of the bots are using fake user agents that don't identify them as crawlers at all (so your proxy config there wouldn't catch them), is that true?
@aburka @Infosecben Yep, some of them use fake user agents, and those are not caught in this trap. Yet.
I just configured my reverse proxy to direct /cgi-bin/ to the maze, and I will be adding links to it on the sites hosted there, so that crawlers will find it. I can then do some digging in the logs and figure out how to handle the misbehavers.
@algernon @Infosecben I liked an idea I saw around here of putting a link on the main page saying "if you're human don't click here", put the target URL of the link in robots.txt, and then put iocaine on the other end. That way humans won't click (at least not more than once...), well behaved crawlers will stay out, and the bastards will get caught
@aburka @Infosecben yep, that's the plan (in addition to the current setup)!
@algernon I took this post as a reason to check on my own tarpit running Nepenthes, and found that Claude has also been trying to crawl the sites I host.
Since January 17 at 23:21 UTC, it has spent roughly 395k requests on Markov chains generated from the Bee Movie script. I don't have any fancy statistics like yours beyond that, but I'm certainly satisfied with just that.
@algernon @aburka @Infosecben is there a way to do dynamic config so that any source IP that requests something suspicious gets added to the maze list? I'm thinking there are some well-known resources that it makes no sense to see a request for in the course of a normal human visiting the website...
@arichtman @aburka @Infosecben I don't know if it is possible to set that up with Caddy out of the box. If there isn't, I can always write a module.
But first things first: trapping & limiting known baddies is the first step. Leading other baddies into the maze, and limiting within the maze is the next step, and I'll iterate from there, likely by adding IP ranges or new user agents to the known baddies list.
It's a bit manual, but I'm not automating it until it turns out that automating it would save time.
@arichtman Anything that hits something robots.txt tells it not to hit is a candidate... @algernon @aburka @Infosecben
@BenAveling @arichtman @aburka @Infosecben A candidate, yes, but that in itself is far from enough indication. I think a better indicator is how much time it spends in the maze. A human won't spend much time there, and won't crawl links at lightning speed.
@algernon You could incorporate a seed into the maze URLs. Each page would use the seed in its RNG for the links it gives out.
That way from any single starting link, the crawler would always be crawling the 'same' maze.
@magpie3d iocaine already does exactly that. The visited URL is the seed. Or well, most of the seed now, 'cos a static seed is also used, so I can change the content without changing any other part of the config. The URL being the seed is how I got it to generate stable garbage without keeping state.
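Roughly, the trick looks like this (a sketch of the general idea, not iocaine's actual code; the seed values are made up): hash the static seed together with the requested URL, seed a tiny PRNG with the result, and derive everything on the page from that. Same URL, same garbage, no state kept anywhere; change the static seed and the whole maze re-rolls.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a 64-bit seed from the static seed and the requested URL.
/// DefaultHasher is just what std ships; any stable hash would do.
fn page_seed(static_seed: &str, url: &str) -> u64 {
    let mut h = DefaultHasher::new();
    static_seed.hash(&mut h);
    url.hash(&mut h);
    h.finish()
}

/// Tiny deterministic PRNG (splitmix64), so the sketch needs no crates.
struct Rng(u64);
impl Rng {
    fn next(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

/// The outgoing links for a page, fully determined by the seed and the URL.
fn page_links(static_seed: &str, url: &str, count: usize) -> Vec<String> {
    let mut rng = Rng(page_seed(static_seed, url));
    (0..count)
        .map(|_| format!("{}/{:016x}", url.trim_end_matches('/'), rng.next()))
        .collect()
}

fn main() {
    // The static seed could be anything: a deploy hash, boot.json contents, a date.
    let seed = "2025-01-19-deploy";
    // Same URL always yields the same links, with no state kept anywhere.
    assert_eq!(page_links(seed, "/whatever", 3), page_links(seed, "/whatever", 3));
    // A different static seed re-rolls the whole maze.
    assert_ne!(page_links(seed, "/whatever", 3), page_links("new-seed", "/whatever", 3));
    println!("{:#?}", page_links(seed, "/whatever", 3));
}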
@LasseGismo @mullana https://git.madhouse-project.org/algernon/iocaine/src/branch/main/docs/how-it-works.md
More details about it were posted in the thread, and I will post a detailed writeup on my blog (see bio) next weekend, probably. Maybe earlier if I find some extra time.
@algernon The Claude crawler may be Claude, an equivalent of OpenAI created by France, currently reserved for French ministries.
@xdej yes. Which is why it ended up in an infinite maze of garbage in the first place. I very deliberately sent it (and its ilk) there.
@algernon @wolf480pl In my waging of a slightly-different battle, almost no bot pays attention to 429, including the nominally grown-up #Googlebot, but 503s have slightly more effect...
@DamonHD @wolf480pl I don't mind if they don't care much about 429. My 429 has no body, so that's ~3-6kb less served / request, and the garbage generator isn't involved in it.
If they come back, good! Eventually they'll get to train on more remixed Bee Movie. The rate limiting gives the garbage generator a pause. It's mostly there to limit that, not the visitors. If the visitors obey, that's a bonus, but only a side effect.
If I wanted to remove them, I'd block their IP when they hit a rate limit in the maze. But I don't. I want them to see the garbage, and come back for more.
@algernon @wolf480pl I haven't looked at your implementation yet, but I'm tempted to build into my static sites a version of what I think you may be doing: an #MD5-URL-hash-driven maze in a few lines of #Apache config, into which any dodgy request, such as for a PHP file, flings the visitor. Won't care about the UA or anything else much, though could return 503s/429s randomly for fun.
@DamonHD @wolf480pl That works to trap them, too, yup!
I do the UA stuff so that they never even see the real content: known bots get only garbage. Unknown bots are led into a maze, and then get garbage too (and once I adjust my config, they'll get only that as well).
@algernon If you think Claude is bad, wait till Facebook finds you. They're easily 2/3rds of my tarpit traffic. Amazon is pretty horrific as well.
@aaron Oh, they found my other sites, on my old server, but for various reasons (cough legacy code cough), I can only serve a static bee movie there.
If they find anything on my newly built infra, Facebook & Amazon will get the maze treatment too - their user agents and IP ranges are already in my "bad visitor" list. I'm ready. So is the infinitely remixed Bee Movie!
Claude is back!
17.5k requests made today, between 05:00 and 18:50. 7.5k of those hit the rate limit, 10k did not. It started by requesting /robots.txt, then /, and went from there. It doesn't look like it visited any URLs it looked at in the previous scan, but I didn't do a proper comparison, just eyeballed things so far.
No other AI visitor came by yet.
I will tweak the rate limits further. And I will need to do some deep dives into the stats, there are many questions I wish to find answers for!
Hope to do that this coming weekend, and post a summary on my blog, along with a writeup of what I'm trying to accomplish, and how, and stuff like that.
Might end up posting the trap logs too, so others can dive into the data too. IP addresses will be included, as a service.
@algernon Probably a dumb question, but is it ignoring robots.txt? To be clear, I don't think it's your (or my) obligation to block specific bots in robots.txt, I'm just curious why they are fetching it.
@bremner I'm unsure whether it ignores it or not. I never gave it a chance to obey: it was blocked before the site it now tries to crawl had an A record. (And as such, even /robots.txt is generated Bee Movie garbage, and thus unparsable for the robots.)
Oh, another possibly interesting tidbit: this whole thing is hosted on a small CX22 VPS at Hetzner (2 vCPUs, 4 GB RAM, €4.18/month). It's running #NixOS, #Caddy, iocaine, and has a WireGuard tunnel connecting it to a 2014 Intel Mac Mini at home. There is no other service running here; the entire purpose of this VPS is to front for my other servers, which aren't on the public internet.
The heaviest applications on it are Caddy and Iocaine.
So far, even under the heaviest load, it didn't need to touch swap, and the 2 VCPUs were enough to do all the garbage generation. I didn't even notice Claude visiting, even though I was deploying new configurations while it was there.
I did notice that the load on the Mac Mini is a lot lower, because AI bots do not reach it. Saves a ton on bandwidth! Not only do I serve less to the crawlers, there's no traffic over WireGuard for these requests either, so that saves both VPS bandwidth and my own at home, too.
Pretty cool.
@algernon I have only one question: how do i set up an iocaine maze of my own?
@gmc This has some notes about how to deploy it in various ways (with NixOS, Docker, systemd, or without either), and includes sample reverse proxy configs for nginx and Caddy. The README has a list of available options, but only two are mandatory: a word list, and at least one source for the garbage generator.
You can also have a look at how I deployed it: the Caddy snippet I use, and the iocaine config.
Some iocaine stats from the VPS:
❯ systemctl status iocaine
● iocaine.service - iocaine, the deadliest poison known to AI
     Loaded: loaded (/etc/systemd/system/iocaine.service; enabled; preset: ignored)
     Active: active (running) since Sun 2025-01-19 19:31:11 CET; 1 day 1h ago
 Invocation: e4823dc1eb2a433e8f425accc3e63e94
   Main PID: 33574 (iocaine)
         IP: 6M in, 58M out
         IO: 0B read, 0B written
      Tasks: 3 (limit: 4567)
     Memory: 655.7M (peak: 974.3M)
        CPU: 1h 3min 44.830s
     CGroup: /system.slice/iocaine.service
             └─33574 /nix/store/63lc5mvvhjr6csmnqjpmb7lgzx7p49qg-iocaine-0.1.0-snapshot/bin/iocaine --config-file /nix/store/s18k1ks075kisf7dw04r01vm2lacjja8-iocaine.toml
The CPU and memory use are both higher than I'd be happy with, but: it's pretty much the only service on the VPS, so I can afford a bit of waste, and it's also a quick hack that has been optimized neither for memory use nor for CPU. I will be addressing both later.
I thought about throwing up a cache in front, but so far, it wouldn't help much, because only Claude is visiting, and it doesn't seem to hit the same page twice. Once there are more crawlers, caching might be worth it, but not just yet.
More aggressive rate limiting would also help, I suppose... rate limiting only 42% of Claude hits is too low, something closer to 69% would be nicer.
@algernon FWIW, the cgroups memory accounting (and thus the displayed value in systemd) does not differentiate between program memory and (fs) caches, so if that thing touches big files, you'd get a much higher value than it's actually consuming :(
@zhenech There are no big files touched: there's a 1.2 MB word list, an 85 kB Bee Movie script, and a ~1 kB config. There's no other file iocaine touches.
@zhenech It's a Rust application with liberal use of .clone() :)
I know why it eats a lot of RAM. It's just not a big priority to fix it at this time =)
I highly recommend https://en.wikipedia.org/wiki/Special:RandomPage as a source of pseudo-random text to build models from.
Each request takes you to a different article and it's already cached up the wazoo, so they won't notice any sane level of load.
@stuartyeates Hmm! I like that idea!
Every time I restart the garbage generator, it'll grab a hundred or so random pages and train from those (+ the Bee Movie script, I'm not letting that go!).
Oooooooh. I like this a lot. Thank you!
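For anyone wondering what "train from" means here: nothing fancier than a Markov chain. A rough sketch of the general idea (not iocaine's actual implementation, and the corpus, seed and chain order are made up): record which word follows each pair of words in the corpus, then walk that table to emit statistically plausible nonsense.

use std::collections::HashMap;

/// A word-level Markov chain: maps each pair of consecutive words to the
/// words observed to follow that pair in the training corpus.
struct Chain {
    table: HashMap<(String, String), Vec<String>>,
}

impl Chain {
    /// Train on any text: random Wikipedia pages, the Bee Movie script, ...
    fn train(corpus: &str) -> Self {
        let words: Vec<&str> = corpus.split_whitespace().collect();
        let mut table: HashMap<(String, String), Vec<String>> = HashMap::new();
        for w in words.windows(3) {
            table
                .entry((w[0].to_string(), w[1].to_string()))
                .or_default()
                .push(w[2].to_string());
        }
        Chain { table }
    }

    /// Generate up to `n` words of garbage. A trivial LCG stands in for a
    /// real (seeded) RNG, so the output is deterministic for a given seed.
    fn generate(&self, mut state: (String, String), n: usize, seed: u64) -> String {
        let mut out = vec![state.0.clone(), state.1.clone()];
        let mut x = seed;
        for _ in 0..n {
            let Some(next_words) = self.table.get(&state) else { break };
            x = x.wrapping_mul(6364136223846793005).wrapping_add(1); // LCG step
            let next = next_words[(x as usize) % next_words.len()].clone();
            out.push(next.clone());
            state = (state.1.clone(), next);
        }
        out.join(" ")
    }
}

fn main() {
    let corpus = "according to all known laws of aviation there is no way a bee should be able to fly";
    let chain = Chain::train(corpus);
    let start = ("according".to_string(), "to".to_string());
    println!("{}", chain.generate(start, 40, 42));
}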
Note that subbing /en/ for another language code lets you tweak things, but you need to correlate your HTML / HTTP headers etc.
@stuartyeates /en/ will be fine, it's to serve garbage to bots that don't deserve any better. I'm not going to tailor the garbage to their preferences, I'll tailor it to mine! BWAHAHAHA!
This is the same thing that is served to the bots, except on this particular host, it is served to everyone indiscriminately, for demo purposes.
@patatas I changed the hashtag to a link since, but it might have not federated your way yet (not sure how edits federate!).
Nevertheless, this is iocaine: https://git.madhouse-project.org/algernon/iocaine
It's a small tool that generates an endless maze of garbage. You can check the demo here.
In my setup (more info about that here), a reverse proxy (Caddy, in my case) is sitting in front of my sites. If it sees a known bot, it reverse proxies to iocaine, and serves them an infinite maze of garbage (rate limited at 42 requests / user agent at the moment). Otherwise, it reverse proxies the Real Site, with no rate limits.
@paperdigits At the moment, I'm using a list of known user agents, and a few IP ranges. It only works for "well behaved" bots that report themselves, and doesn't trap ones that don't. I have a few tricks up my sleeve for those too, but those aren't fully implemented yet.
Every site behind the reverse proxy has a /cgi-bin/ that also leads to the maze. I plan to litter my normal sites with links to random things there, like: "If you are an AI scraper, and want to end up in an infinite maze of garbage, please visit my Guestbook".
Then, I'll have a small tool (or Caddy plugin, I don't know yet) keep track of which user agents / IP addresses spend an unreasonable amount of time visiting links under /cgi-bin/. Anything above a threshold will get the auto-garbage treatment.
The link littering & /cgi-bin/ monitoring parts aren't implemented yet.
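The monitoring could start out embarrassingly simple, something along these lines (a sketch, assuming a combined-format access log; Caddy logs JSON by default, so the real parsing would differ, and the threshold is made up): count /cgi-bin/ hits per user agent, and flag anything over the threshold for the auto-garbage treatment.

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Count /cgi-bin/ hits per user agent in a combined-format access log and
/// report every agent above a (hypothetical) threshold.
fn main() -> std::io::Result<()> {
    let path = std::env::args().nth(1).expect("usage: cgi-bin-watch <access.log>");
    let threshold = 50u64; // hypothetical: tune to taste

    let mut hits: HashMap<String, u64> = HashMap::new();
    for line in BufReader::new(File::open(path)?).lines() {
        let line = line?;
        // Combined log format: the request line and the user agent are the
        // 2nd and 6th double-quoted fields, respectively.
        let fields: Vec<&str> = line.split('"').collect();
        let (Some(request), Some(agent)) = (fields.get(1), fields.get(5)) else {
            continue;
        };
        // Request line looks like: GET /cgi-bin/whatever HTTP/1.1
        let hit_maze = request
            .split_whitespace()
            .nth(1)
            .map_or(false, |p| p.starts_with("/cgi-bin/"));
        if hit_maze {
            *hits.entry(agent.to_string()).or_insert(0) += 1;
        }
    }

    for (agent, count) in &hits {
        if *count > threshold {
            println!("{count:>8}  {agent}  <- candidate for the auto-garbage treatment");
        }
    }
    Ok(())
}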
@algernon ah, very cool. I like the cgi-bin idea.
One of the better ones I've seen is to put a link in your robots.txt to a page on your site that doesn't exist, then forward that link into your tarpit. Well-behaved crawlers won't crawl it, people should never see it, but greedy bots will slurp it up. That also makes detection really easy.
@paperdigits Oh, yeah, I will also do that! Disallow: /cgi-bin/ should do the trick. Still want links littered over the real site, for the crawlers that ignore robots.txt.
The more places they get nudged towards the tarpit, the better.
@algernon 902 different clients? How is this not just a DDoS attack, my god
@cinebox Tbh, 902 is not even a lot, my VPS didn't even flinch, and the rate of crawling was like 10 requests / sec, basically nothing.
It sounds scary, but in context, it isn't. The problem with 902 clients is not the 902, but that they're all just scraping for the same thing, and that makes rate limiting slightly harder.
@algernon could you please elaborate on the iocaine maze concept? It sounds really interesting.
@gcsolaroli https://git.madhouse-project.org/algernon/iocaine/src/branch/main/docs/how-it-works.md is a rough overview of how it works. That, the README in the same repo, and reading up on Nepenthes (link to that in the README) should give you a rough idea what these tools are for.
But I'll have a blog post up about it on my blog (see bio) this weekend, with more details, etc. I don't have the time to elaborate further atm :/