Conversation
Archivists: suppose you have to scrape large amounts of media from an unfriendly service for preservation. How do you ensure that the retrieved media didn't get corrupted along the way? (I don't want to watch/listen to thousands of hours of A/V.)

#archiving #scraping
@13reak There are no originals, only the data from the service with some random compression. What do I compare hashes to?

Edit: you can also think of this as analog->digital conversion (which is also part of this story actually) - how do I know there were no glitches in my encoding software along the way?

@buherator

So you want to check if 1) the data got corrupted from the original source to the "unfriendly service" or 2) from the unfriendly service to you?

(for 2) hashing works)

@buherator

Well, I'd do what acquisition tools for hard disks do:

Hash the data in transit, then hash the resulting file on your hard disk again. Should be possible with tee, maybe?
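
Untested sketch of the tee variant ($url and $file are placeholders):

# hash the stream in transit while writing it to disk
curl -s "$url" | tee "$file" | md5sum | awk '{print $1}' > "$file.transit.md5"

# hash the file at rest and compare
md5sum "$file" | awk '{print $1}' > "$file.disk.md5"
cmp -s "$file.transit.md5" "$file.disk.md5" && echo "OK: $file" || echo "MISMATCH: $file"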

If that's too complicated for your use case, just download, hash, then download again and pipe into a hashing tool. On Linux I'd do something like:

# download everything
wget -r [url]

# hash everything
md5sum [files]

# download again and only hash
# maybe insert loop here
curl [url to filename] | md5sum > filename.md5

That's quick and dirty; you probably want a loop around curl, but that wouldn't be hard (something like for file in $(ls *.mp4); do ...; done should work, or better, use find instead of ls)
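
Something like this for the loop, assuming your local paths mirror the URL layout under a hypothetical $base_url:

# re-download each file, hash the stream, compare against the file on disk
find . -name '*.mp4' | while read -r f; do
    curl -s "$base_url/${f#./}" | md5sum | awk '{print $1}' > "$f.remote.md5"
    md5sum "$f" | awk '{print $1}' > "$f.local.md5"
    cmp -s "$f.remote.md5" "$f.local.md5" || echo "MISMATCH: $f"
done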

PS: before someone says md5 is not secure: yes, if you're assuming attacks, but here we only want to check that the data is identical, and md5 is faster than sha hashes.

@buherator @13reak Do you get an ETag? That should basically be an md5, although you will have to account for chunking (the digits after the dash in the ETag give the number of chunks). Or you could compare the Content-Length header?
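
Rough sketch of the header check ($url and $file are placeholders; whether the ETag is really an md5 depends on the service, S3-style backends only do this for single-part uploads):

# grab the ETag from the response headers
etag=$(curl -sI "$url" | tr -d '\r"' | awk 'tolower($1) == "etag:" {print $2}')

# compare against the local md5 (only meaningful if there is no "-N" chunk suffix)
md5=$(md5sum "$file" | awk '{print $1}')
[ "$etag" = "$md5" ] && echo "match" || echo "differs (or multipart ETag)"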

@buherator Fetch twice and compare the two copies in your storage? That works if you're only concerned about corruption on its way to you or while saving it to disk, of course.
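
For example ($url is a placeholder):

curl -s "$url" -o copy1.mp4
curl -s "$url" -o copy2.mp4
cmp copy1.mp4 copy2.mp4 && echo "identical" || echo "copies differ"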

@nieldk @13reak E-tags are a neat idea, but I'm not sure if it will reflect the hash of the compressed stream delivered to me? I'll keep this in mind though!
@infosecdj That would make a good sanity check, assuming the compression doesn't change across downloads (I can imagine the stream is optimized in lots of ways), and I'll definitely give it a shot!

My other concern though is that the service may just start streaming blank/noise/whatever randomly, and I wouldn't notice.

@buherator It should work unless the service does live media encoding. Easy to test against, too: just download twice. :)

As for your other concern, there's not much you can do about it without examining the received media in detail. To save time, you could produce a small still image every hour, for example; much easier to inspect those daily than to watch everything, but it still requires a human in the loop.
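
e.g. with ffmpeg (fps=1/3600 means one frame per hour; filenames are placeholders):

# one still per hour, for quick daily eyeballing
mkdir -p thumbs
ffmpeg -i recording.mp4 -vf fps=1/3600 -q:v 2 thumbs/frame_%04d.jpg

# optional: flag black stretches of a minute or more programmatically
ffmpeg -i recording.mp4 -vf blackdetect=d=60 -an -f null - 2>&1 | grep blackdetect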

@infosecdj That's a pretty neat idea actually; it would also catch invalid encoding programmatically, not to mention brighten my day with frames of events I care about :)

@buherator pretty sure you can find examples of how to do that, as a lot of media sites produce "previews" in a similar way. should be easy to throw together.

@infosecdj Hell, I can vibe-code a bot that posts publicly and outsource the whole verification mess :D

@buherator @infosecdj I came here to suggest random audits, which would be more robust against an adversary than periodic ones. Otherwise, the ideas in this thread sound good to me.
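
A spot check along those lines can stay pretty cheap (paths are placeholders; the random offset assumes recordings of at least an hour and bash's RANDOM):

# pick 5 random files from the archive for manual review
find archive/ -name '*.mp4' | shuf -n 5

# grab one frame at a random offset from a chosen file
ffmpeg -ss $((RANDOM % 3600)) -i "$file" -frames:v 1 audit.jpg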
