@buherator Anti-scraping measures do not necessarily affect indexers. What does affect them is that the Big Ones are very much doing scraping too, and would display AI summaries/suggestions instead of real search results. Thus, people block them too. Not as a side effect, but as a deliberate choice.
Using pre-anti-scraping LLMs works, maybe, for a little while. But those fail to capture new content published behind anti-scraper measures, and as such are inadequate for search in the longer term.
Smaller, personal or community-run indexes, on the other hand, still yield decent results. Decent, as in, usually better than pre-AI global indexes.
Beating anti-scraper measures in the long run is not possible, because people can - and will - put their stuff behind login walls with no open registration, or take it completely offline.
The winning move would be making community-run smaller indexes a viable option. Federated, decentralized search, if you wish.
I get that there's a lot of nuance here, that's why I asked for "consideration" that can include e.g. allowing standard crawlers.
The problem is that... what do you define as a standard crawler? Googlebot & Bingbot? Or Kagi's? The first two will happily display AI summaries first, and both try to hide real results. I do not think it is worth letting them through. As far as I remember, both double as AI scrapers too.
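For what it's worth, the only lever a site has here is robots.txt user-agent tokens, and those only separate "indexing" from "AI training" where a vendor publishes distinct tokens. A rough sketch of what "allow standard crawlers, block AI use" could look like (the token names below are the vendors' published ones; whether the bots actually honor them is a separate question):

```txt
# robots.txt sketch: allow classic indexing, refuse AI use where possible

# Google's separate AI-training opt-out token
User-agent: Google-Extended
Disallow: /

# Dedicated AI scrapers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Classic search crawlers still allowed
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
```

Note the asymmetry: Bing publishes no equivalent of Google-Extended, so there is no token-level way to let Bingbot index without also feeding whatever else Microsoft does with the crawl.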
Apparently building an index is much bigger effort than I expected (based on the struggles of EU and alternative providers), so I don't think that will happen in the near future.
A global index is hard, indeed, especially if you wish to filter out AI slop. A limited, community-run or topical index on the other hand? That's a whole lot easier.
I've been using my own YaCy instance for the past two years. It yields very good results, better than any other search engine did, even prior to AI - because it is not a global index. It indexes what I tell it to: whatever I bookmark on my GtS instance or in my Readeck, and whatever else I send its way.
Yes, it is limited, and it won't find stuff I didn't put in. In those cases, I fall back to alternatives: asking on fedi, or if that fails too, DuckDuckGo. I haven't used DDG this year yet.
Now, if we had community-run smaller indexes, and topical ones, with SearXNG instances thrown on top to combine some of them, my educated guess is that we'd have viable search without much AI slop.
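The combining part already more or less exists: SearXNG ships an engine module for YaCy, so pointing a metasearch instance at a handful of community indexes is mostly configuration. A rough sketch of what that could look like in settings.yml (the base_url values are made up, and parameter names may differ between SearXNG versions, so treat this as a shape, not a recipe):

```yaml
# Hypothetical SearXNG settings.yml fragment: federate several
# small YaCy indexes behind one search box.
engines:
  - name: my personal index
    engine: yacy
    base_url: http://localhost:8090   # your own YaCy instance
    shortcut: ypi

  - name: community tech index
    engine: yacy
    base_url: https://yacy.example.org  # hypothetical community instance
    shortcut: yct
```

Each engine gets queried in parallel and the results are merged, so the "federation" here is just fan-out at the metasearch layer rather than anything the indexes themselves coordinate.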
The trouble is, such an index still takes effort, requires resources, and great care. I can do it on a personal level, because I'm a stubborn little mouse. It's not viable for normal people - hence, community indexes. But that's hard. Not global-index-level hard, but harder than it needs to be.
LLM performance will degrade for sure, but I don't think it will restore trust in traditional search or otherwise move people away from assistants once they have become dependent.
I think traditional search is dead, and has been for a while now. A global index will never yield useful results again (thanks, partly, to all the AI slop, but even before that, SEO and other algorithm cheating for the relentless pursuit of $$$ made global indexes worthless, imo).
Btw. my post was less about your work, and more about e.g. GitHub where content is no longer properly searchable either via web search or their internal search :)
My gut feeling is that GitHub search was degraded not as an anti-scraper measure, but to push people towards LLMs. That's GitHub's entire purpose lately, so why would this be an exception? :)