@buherator The paywall era is lame. I'm extracting URLs/article text from RSS feeds to do some basic AI tagging. Paywalls are still my biggest problem and require constant tweaking or reworking to overcome.
Current solution is a queue with about 4 different IPs pulling new feed entries and grabbing the full page. Works OKish as long as they don't publish too many news articles in a single day
@buherator Yup, I'm collecting the RSS as normal, maybe every 4 hours, but then I'm also trying to collect the articles themselves (full body). I'm having a lot more difficulty with the latter and hitting paywalls.
My process, boiled down:
New article entry from RSS -> store RSS available data -> push url of article to queue
Collector runs (x4 IPs)
Pulls from queue -> check wayback and collect or collect directly -> push to my AI processing queue
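Roughly, the collector step could look like the sketch below. This is not my actual code, just a minimal stdlib-only illustration: it assumes the public Wayback availability endpoint (`https://archive.org/wayback/available`), uses in-memory `queue.Queue` objects in place of a real queue, and the names `http_get`, `wayback_snapshot_url`, `collect` and `run_collector` are made up for the example. The multi-IP part is left out; each of the 4 collectors would just run the same loop.

```python
import json
import queue
import urllib.parse
import urllib.request

# Assumption: the public Wayback Machine availability endpoint.
WAYBACK_API = "https://archive.org/wayback/available"


def http_get(url, timeout=20):
    """Fetch a URL and return the body as text (hypothetical helper)."""
    req = urllib.request.Request(url, headers={"User-Agent": "feed-collector-sketch"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def wayback_snapshot_url(article_url, fetch=http_get):
    """Return the closest archived snapshot URL, or None if there is none."""
    query = urllib.parse.urlencode({"url": article_url})
    try:
        data = json.loads(fetch(f"{WAYBACK_API}?{query}"))
    except (OSError, ValueError):
        # Network error or unparsable response: treat as "no snapshot".
        return None
    closest = data.get("archived_snapshots", {}).get("closest") or {}
    return closest.get("url") if closest.get("available") else None


def collect(article_url, fetch=http_get):
    """Check wayback first, otherwise collect directly from the site."""
    snapshot = wayback_snapshot_url(article_url, fetch)
    return fetch(snapshot or article_url)


def run_collector(url_queue, processing_queue, fetch=http_get):
    """Drain the URL queue and push (url, page) pairs to the processing queue."""
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return
        try:
            processing_queue.put((url, collect(url, fetch)))
        except OSError:
            # Fetch failed (paywall block, timeout, ...): skip this URL.
            # A real collector would probably requeue with a backoff.
            pass
```

Passing `fetch` in as a parameter is only there so the logic can be exercised without touching the network; in practice each collector would call `run_collector(url_queue, processing_queue)` with the default.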
@buherator have you tried asciimoo's omnom? https://github.com/asciimoo/omnom
I host an instance at https://links.ctrlc.hu/