Conversation
Archivists: suppose you have to scrape large amounts of media from an unfriendly service for preservation. How do you ensure that the retrieved media didn't get corrupted along the way? (I don't want to watch/listen to thousands of hours of A/V.)

#archiving #scraping
@13reak There are no originals, only the data from the service with some random compression. What do I compare hashes to?

Edit: you can also think of this as analog->digital conversion (which is also part of this story actually) - how do I know there were no glitches in my encoding software along the way?

@buherator

So you want to check if 1) the data got corrupted from the original source to the "unfriendly service" or 2) from the unfriendly service to you?

(for 2) hashing works)

@buherator

Well, I'd do what acquisition tools for hard disks do:

Hash the data in transit, then hash the resulting file on your hard disk again. Should be possible with tee, maybe?
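
Untested sketch of the tee variant ($url and $file are placeholders):

# hash the stream in transit while writing it to disk
curl -s "$url" | tee "$file" | md5sum | awk '{print $1}' > "$file.transit.md5"

# hash the file at rest and compare
md5sum "$file" | awk '{print $1}' > "$file.disk.md5"
cmp -s "$file.transit.md5" "$file.disk.md5" && echo "OK: $file" || echo "MISMATCH: $file"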

If that's too complicated for your use case, just download, hash, then download again and pipe into a hashing tool. On Linux I'd do something like:

# download everything
wget -r [url]

# hash everything
md5sum [files]

# download again and only hash
# maybe insert loop here
curl [url to filename] | md5sum > filename.md5

That's quick and dirty; you probably want a loop around curl, but that wouldn't be hard (something like for file in $(ls *.mp4); do ...; done should work, or better, use find instead of ls)
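
Something like this for the loop, assuming your local paths mirror the URL layout under a hypothetical $base_url:

# re-download each file, hash the stream, compare against the file on disk
find . -name '*.mp4' | while read -r f; do
    curl -s "$base_url/${f#./}" | md5sum | awk '{print $1}' > "$f.remote.md5"
    md5sum "$f" | awk '{print $1}' > "$f.local.md5"
    cmp -s "$f.remote.md5" "$f.local.md5" || echo "MISMATCH: $f"
done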

PS: before someone says md5 is not secure: yes, if you're assuming attacks, but here we only want to check that the data is identical, and md5 is faster than sha hashes.

@buherator @13reak Do you get an ETag? That should basically be an md5, although you will have to account for chunking (the digits after the dash in the ETag give the number of chunks). Or you could compare the Content-Length header?
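
Rough sketch of the header check ($url and $file are placeholders; whether the ETag is really an md5 depends on the service, S3-style backends only do this for single-part uploads):

# grab the ETag from the response headers
etag=$(curl -sI "$url" | tr -d '\r"' | awk 'tolower($1) == "etag:" {print $2}')

# compare against the local md5 (only meaningful if there is no "-N" chunk suffix)
md5=$(md5sum "$file" | awk '{print $1}')
[ "$etag" = "$md5" ] && echo "match" || echo "differs (or multipart ETag)"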

@buherator Fetch twice and compare the two copies in your storage? That works if you're only concerned about corruption on its way to you or while saving it to disk, of course.
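
For example ($url is a placeholder):

curl -s "$url" -o copy1.mp4
curl -s "$url" -o copy2.mp4
cmp copy1.mp4 copy2.mp4 && echo "identical" || echo "copies differ"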

@nieldk @13reak E-tags are a neat idea, but I'm not sure if it will reflect the hash of the compressed stream delivered to me? I'll keep this in mind though!
@infosecdj That would make a good sanity check, assuming the compression doesn't change across downloads (I can imagine the stream is optimized in lots of ways), and I'll definitely give it a shot!

My other concern though is that the service may just start streaming blank/noise/whatever randomly, and I wouldn't notice.

@buherator It should work unless the service does live media encoding. Easy to test against, too: just download twice. :)

As for your other concern, there's not much you can do about it without examining the received media in detail. To save time, you could produce a small still image every hour, for example; much easier to inspect those daily than to watch everything, but it still requires a human in the loop.
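
e.g. with ffmpeg (fps=1/3600 means one frame per hour; filenames are placeholders):

# one still per hour, for quick daily eyeballing
mkdir -p thumbs
ffmpeg -i recording.mp4 -vf fps=1/3600 -q:v 2 thumbs/frame_%04d.jpg

# optional: flag black stretches of a minute or more programmatically
ffmpeg -i recording.mp4 -vf blackdetect=d=60 -an -f null - 2>&1 | grep blackdetect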

@infosecdj That's a pretty neat idea actually; it would also catch invalid encoding programmatically, not to mention brighten my day with frames of events I care about :)

@buherator pretty sure you can find examples of how to do that, as a lot of media sites produce "previews" in a similar way. should be easy to throw together.

@infosecdj Hell, I can vibe-code a bot that posts publicly and outsource the whole verification mess :D

@buherator @infosecdj I came here to suggest random audits, which would be more robust against an adversary than periodic ones. Otherwise, the ideas in this thread sound good to me.
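
A spot check along those lines can stay pretty cheap (paths are placeholders; the random offset assumes recordings of at least an hour and bash's RANDOM):

# pick 5 random files from the archive for manual review
find archive/ -name '*.mp4' | shuf -n 5

# grab one frame at a random offset from a chosen file
ffmpeg -ss $((RANDOM % 3600)) -i "$file" -frames:v 1 audit.jpg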
