Conversation
Edited 4 months ago
TBF I face many more challenges saving data _from_ the Wayback Machine using the CDX API than with most of the sites I've scraped:

Most tools for offline archiving simply don't work, and although I'm *really* slow with my requests I get throttled all the time :P

Oh, and I almost forgot that it's surprisingly hard to translate IA URLs to local file paths, esp. since the URLs retrieved from the API aren't properly encoded (https://web.archive.org/.../http://example.com/...)
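For illustration, here is a rough sketch of one way to map a capture URL to a local path. It assumes the usual `https://web.archive.org/web/<timestamp>/<original URL>` layout; the function name `wayback_to_local_path`, the `dump` root directory and the `host/timestamp/path` layout are made up for this example, not anything the API prescribes. Every path segment is percent-encoded, so an unencoded original URL still yields a safe file name.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import quote, urlsplit

# https://web.archive.org/web/<timestamp><optional modifier like id_>/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{1,14})([a-z]{2}_)?/(.+)$")


def wayback_to_local_path(wayback_url: str, root: str = "dump") -> PurePosixPath:
    """Map a capture URL to <root>/<host>/<timestamp>/<encoded path> (hypothetical layout)."""
    match = WAYBACK_RE.match(wayback_url)
    if match is None:
        raise ValueError(f"not a Wayback Machine capture URL: {wayback_url!r}")
    timestamp, _modifier, original = match.groups()

    # The embedded original URL is not guaranteed to carry a scheme.
    if "://" not in original:
        original = "http://" + original
    parts = urlsplit(original)

    # Percent-encode every segment; drop "." and ".." so the path stays under root.
    segments = [
        quote(seg, safe="")
        for seg in parts.path.split("/")
        if seg and seg not in (".", "..")
    ]
    if not segments or parts.path.endswith("/"):
        segments.append("index.html")
    if parts.query:
        # Keep the query in the file name so different queries do not collide.
        segments[-1] += quote("?" + parts.query, safe="")

    return PurePosixPath(root, quote(parts.netloc, safe=""), timestamp, *segments)
```

Keeping the timestamp in the path means several captures of the same page can sit side by side; dropping it would give one file per original URL instead.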

@buherator The paywall era is lame. I'm extracting URLs/article text from RSS feeds to do some basic AI tagging. It's still my biggest problem and requires constant tweaking or reworking to overcome.

Current solution is a queue with about 4 different IPs pulling new feed entries and grabbing the full page. Works OKish as long as they don't publish too many news articles in a single day 😭

@ciaranmak I'm not sure I follow. Are you doing this via the CDX API? If there is RSS, what requires tweaking? The RSS feeds don't include the whole content, so you have to scrape them for archiving?

@buherator Yup, I'm collecting the RSS as normal, maybe every 4 hours, but then I'm also trying to collect the articles themselves (full body). I'm having a lot more difficulty with the latter and hitting paywalls.

My process, boiled down, is:
New article entry from RSS -> store RSS available data -> push URL of article to queue

Collector runs (x4 IPs)
Pulls from queue -> check Wayback and collect, or collect directly -> push to my AI processing queue
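A minimal sketch of the collector step above, assuming the "check Wayback" part is done with the Wayback Machine availability API (`https://archive.org/wayback/available`), which the thread does not confirm. The names `pick_source`, `lookup_availability`, `fetch_page` and `run_collector` are invented for this example, and in-process `queue.Queue` objects stand in for whatever queue is really used.

```python
import json
import queue
import urllib.parse
import urllib.request

AVAILABILITY_API = "https://archive.org/wayback/available"


def pick_source(article_url, availability):
    """Return the archived snapshot URL if one is usable, else the live URL."""
    closest = availability.get("archived_snapshots", {}).get("closest") or {}
    if closest.get("available") and closest.get("status") == "200":
        return closest["url"]
    return article_url


def lookup_availability(article_url, timeout=30):
    """Ask the availability API for the closest snapshot of article_url."""
    query = urllib.parse.urlencode({"url": article_url})
    with urllib.request.urlopen(f"{AVAILABILITY_API}?{query}", timeout=timeout) as resp:
        return json.load(resp)


def fetch_page(url, timeout=30):
    """Download a page body as text (no paywall or JS handling here)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def run_collector(pending, processing, lookup=lookup_availability, fetch=fetch_page):
    """Drain the pending queue: check Wayback, collect, push for processing."""
    while True:
        try:
            article_url = pending.get_nowait()
        except queue.Empty:
            return
        source = pick_source(article_url, lookup(article_url))
        processing.put((article_url, source, fetch(source)))
```

Passing `lookup` and `fetch` in as parameters keeps the decision logic testable without touching the network, and makes it easy to swap in a per-IP fetcher.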

@ciaranmak Got you! I'd say that hitting paywalls and even some JS-based UI monstrosity is the "normal" these days, which I'd expect (and I'd probably use Selenium or similar to grab it). But in the case of the Wayback Machine I'd expect a friendlier API...
@stf Thanks for the reminder, I never had the opportunity to use it! My goal is specifically to dump datasets from Wayback Machine for specific domains, so browser-based solutions are less useful for me now.
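As a rough sketch of what a per-domain dump could start from: the snippet below builds a CDX query for a whole domain and retries with exponential backoff when throttled. The endpoint and the `url`, `matchType`, `output`, `fl`, `filter`, `collapse` and `limit` parameters are from the public CDX server. Which status codes signal throttling (429 and 503 here), the delay values and all function names are assumptions for the example.

```python
import json
import time
import urllib.error
import urllib.parse
import urllib.request

CDX_API = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, limit=1000):
    """Build a CDX query listing successful captures for a domain and its subdomains."""
    params = {
        "url": domain,
        "matchType": "domain",
        "output": "json",
        "fl": "timestamp,original",
        "filter": "statuscode:200",
        "collapse": "urlkey",  # collapse adjacent rows that share a URL key
        "limit": str(limit),
    }
    return f"{CDX_API}?{urllib.parse.urlencode(params)}"


def parse_cdx_rows(rows):
    """With output=json the first row is the header; turn the rest into dicts."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def capture_url(record):
    """URL of the raw capture; the id_ modifier asks for the unrewritten content."""
    return f"https://web.archive.org/web/{record['timestamp']}id_/{record['original']}"


def fetch_json_with_backoff(url, max_tries=5, base_delay=5.0,
                            opener=urllib.request.urlopen, sleep=time.sleep):
    """GET url as JSON, doubling the wait after each throttled attempt."""
    for attempt in range(max_tries):
        try:
            with opener(url, timeout=60) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            # Assumed throttling codes; anything else, or the last try, is re-raised.
            if err.code not in (429, 503) or attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A typical use would be `parse_cdx_rows(fetch_json_with_backoff(build_cdx_url("example.com")))` followed by `capture_url` on each record; large domains would also need paging, which is left out here.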