"Ask HN: How to stop an AWS bot sending 2B requests/month" is, predictably, full of useless advice.

Person complains about too much traffic.

Half the responses: "Serve it a gzip bomb". My dear HNers! That won't do jack shit. It won't make most crawlers go away (they don't decompress during crawling), but will significantly increase outgoing traffic. Good job, you made things worse!

"Use Anubis!" - that still won't make it go away, and will cost you more outgoing traffic than sending it an empty 401.

"Use this proxy!" - that won't stop it either. Catching it isn't the asker's problem. Making it go away is. No proxy or reverse proxy will help you do that.

There are a few responses that tell the original poster to block AWS Singapore on the firewall - that is the solution. But they're drowned out by all the bad advice from people who didn't read the question.


Why, yes, I read HN. It's my guilty pleasure. I have no account there, but reading the incredible bullshit they sometimes come up with in the comments is great entertainment during lunch.


Here's the thing about these crawlers, and I'm gonna make it bold, because it is important: you cannot make them go away.

They have more money than you do. They have more bandwidth than you do. They have more resources than you do. They have more everything than you do[1]. They do not fucking care.

Anubis won't make them go away. iocaine won't make them go away. go-away won't make them go away. Nepenthes won't make them go away. None of those will. They may block access to the real contents, they may reduce the impact the crawlers have, but none of them make them go away.

Serving them an empty 401 won't make them go away, either. That reduces your outgoing traffic significantly, but the bots will keep coming back, because they - I repeat - do not fucking care.
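
A minimal sketch of the "empty 401" approach, assuming a Python WSGI app sits behind the web server. The `BLOCKED_NETWORKS` list, the `is_blocked` helper, and the example address range are hypothetical and only there for illustration; the thread does not say how the 401 is actually served.

```python
import ipaddress

# Hypothetical blocklist; 203.0.113.0/24 is a documentation-only range.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_blocked(remote_addr, networks=BLOCKED_NETWORKS):
    """Return True when remote_addr falls inside any blocked network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Unparseable address: treat as not blocked in this sketch.
        return False
    return any(addr in net for net in networks)


def app(environ, start_response):
    """WSGI app: empty 401 for blocked clients, a normal page for the rest."""
    if is_blocked(environ.get("REMOTE_ADDR", "")):
        # No body at all, so the response costs only the status line and headers.
        start_response("401 Unauthorized", [("Content-Length", "0")])
        return [b""]
    body = b"hello, human\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

This keeps the outgoing bytes per blocked request close to the minimum, but as the post says, it does nothing to stop the requests from arriving.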

If you want to make them go away, you block them on the firewall. That's not always possible, but in the case of AWS Singapore, when you don't expect legit traffic from there, blocking the entire ASN is an option.
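
One way to build such a firewall block, sketched under assumptions: rather than resolving an ASN, this uses the prefix list AWS publishes at https://ip-ranges.amazonaws.com/ip-ranges.json and filters it by region (`ap-southeast-1` is Singapore). That is a swapped-in technique, not the ASN lookup itself, and the two lists need not match exactly. The function names and the nftables set name in the comment are hypothetical.

```python
import ipaddress


def region_prefixes(ip_ranges, region="ap-southeast-1"):
    """Collect the IPv4 prefixes of one AWS region and merge adjacent ones.

    ip_ranges is the parsed ip-ranges.json document (a dict with a
    "prefixes" list of {"ip_prefix", "region", "service", ...} entries).
    """
    nets = {
        ipaddress.ip_network(p["ip_prefix"])
        for p in ip_ranges.get("prefixes", [])
        if p.get("region") == region
    }
    # collapse_addresses merges overlapping and adjacent networks.
    return list(ipaddress.collapse_addresses(nets))


def nft_elements(networks):
    """Format networks as the element list of an nftables set."""
    return ", ".join(str(n) for n in networks)


# Assumed usage, with a pre-existing interval set named "blocklist":
#   nft add element inet filter blocklist { <output of nft_elements> }
```

Loading the JSON (for example with `json.load` on a downloaded copy) and refreshing the set periodically is left out; the published ranges change over time.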

If you can't firewall them off, the best you can do is mitigate. You cannot make them go away. You can make it more expensive for them, and help the bubble burst faster by serving them garbage. But that doesn't come for free, and if you're having traffic problems, this ain't an option either.


  1. ...until the bubble bursts

@algernon I miss n-gate's webshit weekly so much :,(

@wolf480pl Some of them... possibly? Not sure what that'd accomplish, though. There are a zillion different crawlers. If you manage to crash one, you still have a zillion-1 crawlers gnawing at your :443.

I don't think it would be worth the effort, except for the entertainment value.


@algernon
I see :/

hmm do they follow redirects? if so, do they follow them immediately, or just add to the end of the queue?


@buherator Oh! Yeah, that was a thing! I loved it, and now that you mention it, I was sad when it stopped being updated.

Hm. Maybe I should turn my guilty pleasure of reading HN into blog posts. Not quite n-gate weekly, but... commentary on some of the stuff I laugh at. Hmmm, hmm.

I have an idea. This might end up being a hilariously bad experiment, and I'll blame you! :flan_set_fire:

@algernon Wait, could an LLM faithfully imitate n-gate?

@wolf480pl Some do, some don't, some do immediately, some later.

The majority of them follow them immediately, if they follow them at all, though. I have not tried external redirects, only relative ones.


@algernon And if it's THAT crawler I'm thinking of, even blocking IPs is exceptionally difficult, because it cycles through an absolutely massive amount of IPv4 addresses, using each address only once or twice in a month.

You have to drop all traffic from at least three different ASNs.


@aaron It is THAT crawler most likely, and yep, would need to block at least an entire ASN, but possibly a few more, indeed.


@buherator No, LLMs lack the sophisticated snark. But a human pretending to be an LLM...

@algernon I ran some tests and you are right: LLMs are still far from reaching n-gate-level snark

@algernon
for those that follow redirects immediately, one could do some nifty things like:

- find a host with free bandwidth, set up a slowloris server there (one which sends the response one character at a time), redirect bots to that server, so that they can't crawl other websites while stuck there (assuming they don't have a separate thread for each domain)

- redirect them to your enemy

- redirect them to a website of someone with deep pockets and many lawyers, who will go after the ISPs


@wolf480pl

  • Slowing things down doesn't seem to do anything. I tried it, the vast majority (if not all) of the bots kept the same crawling rate, I just had more connections open.

  • Redirecting them somewhere else: eh, yeah, I guess I could do that, but that'd make it someone else's problem. I might as well serve them a 401 then and not point them to places where they may find non-garbage content to crawl.

Enemy or not, I'd prefer if these things disappear. Directing them to somewhere they can maybe crawl useful training data from doesn't help that goal, not even if that somewhere else is an enemy of mine.


@algernon
> they keep the same crawling rate, just more connections open

what I'm wondering is whether that makes them have less connections open to other websites

> enemy or not

I'm not saying you should, I'm just saying they could be weaponized. Which, if true, is concerning.


@wolf480pl

> what I'm wondering is whether that makes them have less connections open to other websites

I don't think so.

> I'm not saying you should, I'm just saying they could be weaponized. Which, if true, is concerning.

Hrm. I can give that a try. I can redirect some of the crawlers I see to an entirely different host (one that I also control), and see what happens. I'll add that to my experiments TODO list!
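
A rough sketch of how such a redirect experiment could be measured, assuming each redirect carries a unique token in its path and both hosts log when a token was issued and when it first arrived. The URL layout, the function names, and the 5-second "immediate" threshold are all hypothetical choices, not something described in the thread.

```python
import secrets


def new_token():
    """Random, URL-safe token that ties a redirect to a later arrival."""
    return secrets.token_urlsafe(8)


def redirect_location(host, token):
    """Location header value pointing at the second (self-controlled) host."""
    return f"https://{host}/r/{token}"


def follow_stats(redirects, arrivals, immediate_within=5.0):
    """Summarise which redirects were followed, and how quickly.

    redirects: {token: unix time the redirect was issued on host A}
    arrivals:  {token: unix time the token was first seen on host B}
    """
    delays = [arrivals[t] - issued
              for t, issued in redirects.items() if t in arrivals]
    immediate = sum(1 for d in delays if d <= immediate_within)
    return {
        "issued": len(redirects),
        "followed": len(delays),
        "immediate": immediate,
        "later": len(delays) - immediate,
    }
```

Tokens that never show up on the second host count as "not followed", which cannot be told apart from "followed after the logs were rotated".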


@algernon
also, do you know how often the same IP address returns?


@wolf480pl Depends on the bot! The ones that try to disguise themselves rarely return. From what I'm seeing, the same IPs are used for a dozen or two requests, then they're not seen for at least a week.

My logs only go back 7 days, so I can't tell if they return after. But I've heard from other people that if they do return, it may take weeks, or even months.
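
A small sketch of how "does the same IP return?" could be checked from access logs, assuming the log has already been parsed into `(unix_time, ip)` pairs. The function name and the one-hour `session_gap` threshold are hypothetical; gaps longer than the log retention (7 days here) are invisible to this kind of analysis.

```python
from collections import defaultdict


def return_gaps(hits, session_gap=3600.0):
    """Find how long each IP stayed away before coming back.

    hits: iterable of (unix_time, ip) pairs.
    Returns {ip: [gap_seconds, ...]} with one entry per pause longer than
    session_gap. An empty list means the IP never returned within the logs.
    """
    by_ip = defaultdict(list)
    for ts, ip in hits:
        by_ip[ip].append(ts)

    gaps = {}
    for ip, times in by_ip.items():
        times.sort()
        # Differences between consecutive requests, keeping only long pauses.
        gaps[ip] = [b - a for a, b in zip(times, times[1:])
                    if b - a > session_gap]
    return gaps
```

For example, an IP that makes three requests within a minute and one more a day later yields a single gap of roughly one day, while an IP seen only in one burst yields an empty list.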


@algernon hmm so we need to make them keep an open connection for a month or more... that'd be difficult


@wolf480pl They won't keep it open for that long, for the simple reason that the residential IP they're using via malware won't keep being online that long.

And those that run from various public clouds, they'll eventually time out too.

(Also: they have more resources than we do. I wouldn't waste time on trying to exhaust theirs.)


@algernon
We can disable conntrack, and their ISP can't, because their ISP is doing NAT.

So if we could somehow force all bots from a certain ISP to hit a server we chose, we could break that ISP's CGNAT. Nobody at that ISP would have internet access anymore. That way we could force the ISPs to do something about the problem...

though they'd most likely just IP-ban the tarpit server :/


@algernon
although, we don't need all those bots to hit the same server. We just need all those bots to hit servers that keep connections open for long enough.

But then maybe the bots have timeouts and rates short enough that they'll never exhaust the CGNAT :/
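
The doubt above can be put into numbers with Little's law: the average number of simultaneously open connections is the arrival rate times the time each one stays open. The sketch below assumes one public CGNAT address with about 64512 usable ports (65536 minus the first 1024) and no port reuse across destinations; both are simplifying assumptions, and the function names are made up for this example.

```python
def concurrent_connections(rate_per_s, hold_s):
    """Little's law: average open connections = arrival rate * hold time."""
    return rate_per_s * hold_s


def min_hold_seconds(rate_per_s, ports=64512):
    """How long each connection must stay open to use up every port."""
    return ports / rate_per_s


def exhausts_nat(rate_per_s, hold_s, ports=64512):
    """True when the steady-state connection count reaches the port budget."""
    return concurrent_connections(rate_per_s, hold_s) >= ports
```

Under these assumptions, bots opening 10 connections per second through one public address would need each connection held for nearly two hours (about 6451 seconds) to fill the table, so a bot-side timeout of a minute or so leaves it at roughly 600 open connections.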
