Conversation

New from 404 Media: Bluesky may have said it won't use user data to train generative AI, but someone else just published a dataset of one million Bluesky posts for "machine learning research". Already a very popular dataset; your data may be scraped https://www.404media.co/someone-made-a-dataset-of-one-million-bluesky-posts-for-machine-learning-research/

@josephcox I guess you could also pull this together from Mastodon, but Bluesky is going to make readily available data much faster.

@tehstu @josephcox I'm pretty critical of Bluesky (see my timeline) but I don't see why this would be any harder or slower to do from Mastodon/fedi

@tomw @josephcox I meant in the sense that Bluesky has more users and so generates the content faster. I was trying to guess why there haven't been very public datasets to this effect from Mastodon.

@tomw @tehstu @josephcox all Bluesky data is public. Many ActivityPub posts are private or followers-only.

@josephcox one million posts? Is that a lot? Like per hour or total?

@evan Small clarification as I know you’ve avoided the Bluesky literature: Bluesky DMs are not public because they’re not part of ATProto. They’re a separate service.

@amd thanks. Updated.

Here we go again, explaining to supposedly technologically literate people that what they *publish* on the Internet can and will be scraped... Bluesky's explanation ("we can't enforce this") is on point btw.

RE: https://infosec.exchange/@josephcox/113551853623942786

@josephcox One million Bluesky posts are not nearly enough to train generative AI, and putting scare quotes around "machine learning research" doesn’t miraculously make it so.

@buherator People broadly understand this just fine. The problem is not of a technical nature, it is of a social and ethical nature, and the culture of a place absolutely *does* affect whether someone can get away with this or not.

@joepie91 Based on arguments I had over here, people definitely believe that technical measures at the publishing platform (such as limiting search) can affect this. Also, what is the point of being outraged about the single person who is open about his scraping while I guarantee you a dozen other orgs do the same rn but just don't talk about it?

@buherator Technical measures *can* affect it. The exact measures needed and their exact impact are going to vary from case to case, but yes, putting up barriers does in fact make it less likely to happen, even if it cannot fully prevent it.

As to "what is the point of being outraged": because that is how you set social norms in a community, and make clear to potential scrapers that they will be doing so at the cost of their inclusion in the community. This is how it works in all social environments and it seems to be mostly just IT nerds who think this "doesn't work", despite mountains of evidence to the contrary.

Nobody half-competent believes that there's some magical incantation to totally stop any and all scraping. But it's equally absurd to go "well, it's public, nothing you can do, it literally doesn't matter". Harm reduction is a thing, and crucially important to many vulnerable and marginalized folks.

@joepie91 Do you really think people who want to e.g. earn money with this give a flying fart if they are excluded from a community (which they weren't part of in the first place)?

@buherator They might not themselves. But the dataset itself becomes toxic, and if it's known as "that dataset from the people who didn't want it", that will make an awful lot of people think twice before using it.

@buherator The dynamic is the same as with many other forms of abuse; the group of malicious people is very small, and the only reason they can do so much damage is because they can bank on the tolerance of a much broader set of people who wouldn't do the malicious thing themselves, but also aren't going to look too closely at the background of what someone else did.

Making a stink out of the collection process sabotages the use of the dataset for that group of people, which is going to be most of them.

@joepie91 - You're still assuming you can know about the scraping in the first place
- Money doesn't stink

@buherator I'm assuming no such thing. That's the whole point of social norms; they apply and have an effect *without* needing to personally know about every single case.

@joepie91 OK, please let me know when the scraping stops because of our collective will!

@josephcox even if they don't actively train their own models, anything that is public-facing on the internet is bound to be gobbled up by the scrapers :/

you just can't reliably protect against those while allowing anonymous human visitors to see the content

@evan
Is follower-only a thing on ActivityPub? I thought everything was unencrypted and basically accessible to any server that receives it.

@mcv
> Is follower-only a thing on ActivityPub?

Yes. They're called Followers posts in Mastodon now.

> I thought everything was unencrypted and basically accessible to any server that receives it.

Yes. There's been all sorts of research into things like Object Capabilities, that could force receiving servers to do what they're told. But for now, not displaying Specific People posts publicly, or showing Followers(-only) posts only to followers, is an unenforceable handshake agreement.

@evan
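
The "handshake agreement" described above can be illustrated with a minimal sketch of the check a cooperating server might run before showing a post. This is an illustration under stated assumptions, not how Mastodon or any other implementation actually does it: `may_display`, `followers_collection` and `followers_of_author` are hypothetical names, and only the `to`, `cc` and `attributedTo` properties and the `#Public` collection URI come from ActivityStreams/ActivityPub.

```python
# Special collection URI that marks a post as public in ActivityPub addressing.
PUBLIC = "https://www.w3.org/ns/activitystreams#Public"


def may_display(post, viewer, followers_collection, followers_of_author):
    """Return True if a *cooperating* server should show `post` to `viewer`.

    post: dict with "attributedTo", "to" and "cc" (ActivityStreams properties).
    viewer: actor URI of the logged-in reader, or None for an anonymous visitor.
    followers_collection: URI of the author's followers collection (assumed known).
    followers_of_author: set of actor URIs this server believes follow the author.
    """
    audience = set(post.get("to", [])) | set(post.get("cc", []))
    if PUBLIC in audience:
        return True            # public or unlisted: anyone may read it
    if viewer is None:
        return False           # non-public posts are never shown anonymously
    if viewer == post["attributedTo"] or viewer in audience:
        return True            # the author, or someone addressed directly
    # Followers-only: addressed to the followers collection, viewer is a follower.
    return followers_collection in audience and viewer in followers_of_author
```

Nothing in the protocol forces a receiving server to run a check like this. A server that skips it still holds the full plaintext of the post, which is why the agreement is unenforceable.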

@strypey @evan

So it only works as intended if all your followers are on servers that obey the agreement. Accidentally following someone on a questionable server can create a leak.

Is it completely out of the question to add some encryption to the protocol for this sort of situation?

(1/?)

@mcv
> it only works as intended if all your followers are on servers

...whose admins and software obey the agreement ("protocol"), yes.

For now, if you want decentralised *and* E2EE, you need to switch to another network for that. Options include federated networks like XMPP+OMEMO or Matrix, or a P2P one like Jami or Tox.

If you're wanting something noob-friendly, I'd go for Element (Matrix) or Snikket (XMPP).

(Full disclosure: I've done paid contracting for Snikket)

@evanprodromou

(2/?)

@mcv
> Is it completely out of the question to add some encryption to the protocol for this sort of situation?

Opinions vary. IMHO Mastodon's decision to clone Twitter DMs was a mistake. I'm inclined to think it's wiser to clearly separate social publishing from private communications, by having them in separate apps (with some kind of SSO so we can use the same account in both).

I've even suggested using AutoCrypt, so I can check my fediverse DMs in Delta Chat;

https://codeberg.org/fediverse/fediverse-ideas/issues/72

(3/?)

But I suspect I'm in the minority. Most people seem to want the fediverse to be a thneed (see The Lorax), and work of various kinds is underway on bringing E2EE to the verse. Apparently there's a taskforce working on it;

https://socialhub.activitypub.rocks/t/end-to-end-encryption-e2ee-task-force-meeting-jul-19-2024/

... which might be this one?

https://github.com/swicg/activitypub-e2ee

Then there's @soatok@furry.engineer's efforts;

https://soatok.blog/2024/09/13/e2ee-for-the-fediverse-update-were-going-post-quantum/

(this covers some policy issues as well as technical ones and is well worth a read)

(4/4)

@helge posted a Fediverse Idea on an E2EE AP messenger last year;

https://codeberg.org/fediverse/fediverse-ideas/issues/3

... and @dansup of PixelFed and loops.video announced Sup messenger about a year ago;

https://wedistribute.org/2023/08/sup-by-pixelfed-is-coming/

The MIMI working group at the IETF included the possibility of using AP in their investigations;

https://bifurcation.github.io/mimi-aim/draft-barnes-mimi-aim.html

So in summary, a lot is going on all over the place. Maybe we need to get all these folks in a room?

@strypey

I still think the combination of usenet+email was the perfect integration of public and private communication. They weren't completely separate; usenet posts included the email address of the author.

I'd like to see something modular, like maybe XMPP+ActivityPub, or something like that.

But that still doesn't address semi-public communication, like to all your followers, or to a specific group/circle/aspect, that still guarantees (through encryption) that it's only to that group.

I imagine everybody would automatically publish their public key as part of their profile, and a limited message would be encrypted, with for each authorized recipient an attachment containing the key encrypted with their public key. Of course that could get pretty heavy with posts to lots of users, but servers could throw away attached keys that aren't for any of their own users.
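
The server-side half of this idea (dropping attached keys that are not for local users) can be sketched as plain data handling. This is an illustration only, not part of any existing protocol: `Envelope` and `prune_for_server` are hypothetical names, the wrapped keys are treated as opaque bytes, the actual encryption is deliberately left out, and matching recipients by the hostname in their actor URI is an assumption.

```python
from dataclasses import dataclass, replace
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Envelope:
    # Message body, encrypted once with a random content key (opaque here).
    ciphertext: bytes
    # Recipient actor URI -> content key encrypted to that recipient's public key.
    wrapped_keys: dict


def prune_for_server(envelope, local_host):
    """Return a copy of `envelope` keeping only keys for actors on `local_host`."""
    kept = {
        recipient: key
        for recipient, key in envelope.wrapped_keys.items()
        if urlsplit(recipient).hostname == local_host
    }
    return replace(envelope, wrapped_keys=kept)
```

For example, a server at the hypothetical host `b.example` would call `prune_for_server(envelope, "b.example")` on delivery and store only the result. The ciphertext stays the same size, so the saving applies only to the per-recipient keys.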

@mcv
> But that still doesn't address semi-public communication, like to all your followers, or to a specific group/circle/aspect

E2EE private groups are the core of Matrix, to the point that DMs are just groups with only 2 members. Delta Chat can encrypt groups with AutoCrypt, and I believe XMPP can encrypt private groups too, with MUC+OMEMO.

But from what I've read, the new MLS standard is key to doing E2EE groups efficiently. Devs from all 3 protocol networks are working on implementations.

@mcv
> There's been all sorts of research into things like Object Capabilities

FYI @cwebber wrote a bit about this, with some links for further reading, towards the end of her insightful analysis of Bluesky/ATProto;

https://dustycloud.org/blog/how-decentralized-is-bluesky/

@evan

@strypey @mcv @evanprodromou

Sadly nobody can use it, and self host. The installer script is very out of date, and broken.

https://github.com/snikket-im/snikket-selfhosted/issues/14

@SchickeSchickeSchweine
> Sadly nobody can use it, and self host. The installer script is very out of date, and broken

I know @snikket_im are keen to support a range of installation options. But I suspect progress is being slowed by funding challenges. The same thing I'm told is making it harder to deliver Matrix 2.0, MLS support, etc (funding challenges for Element).

@mcv @evanprodromou

@strypey @snikket_im @mcv @evanprodromou It turns out they updated the installer script almost immediately and I tried it and my server installed just like that. I am using it now with my friends and it's really really nice. Thank you so much to the development team.

@SchickeSchickeSchweine
> It turns out they updated the installer script almost immediately and I tried it and my server installed just like that

I love it when this happens! One advantage of being able to @mention projects when we complain about them ; )

@snikket_im @mcv @evanprodromou
