Been thinking a lot about @algernon's recent post on FLOSS and LLM training. The frustration with AI companies is spot on, but I wonder if there's a different strategic path. Instead of withdrawal, what if this is our GPL moment for AI—a chance to evolve copyleft to cover training? Tried to work through the idea here: Histomat of F/OSS: We should reclaim LLMs, not reject them.
@hongminhee I think you're giving the AI companies too much credit and goodwill. The technology itself may have its uses, but their total lack of respect is not the only problem. The energy required for training and inference, and the environmental impact of it, will not be addressed by freeing the models.
But... that's probably worth another blog post. Nevertheless, I'd like to address a few things here:
OpenAI and Anthropic have already scraped what they need.
They did not. I'm receiving 3+ million requests a day from Anthropic's ClaudeBot, and about a million a day from OpenAI: see the stats on @iocaine. If they had scraped everything they need, they wouldn't aggressively continue the practice, would they?
They need new content to "improve" the models. They need new data to scrape now more than ever, because as the internet fills up with slop, legitimate human work to train on becomes even more desirable.
Heck, I've seen scraping waves where my sites received over 100 million requests in a single day! I do not know which companies were responsible (though I have my suspicions), but they most definitely do not have all the data they need.
And I'm hosting small, tiny things, nothing of much importance. I imagine juicier targets like Codeberg receive a whole lot more of these.
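(If you want to sanity-check numbers like these yourself, a rough way is to count requests per User-Agent per day straight from your access logs. The sketch below is a minimal version of that idea; it assumes a standard nginx/Apache "combined" log format and that the crawlers keep using their advertised User-Agent strings (ClaudeBot, GPTBot, and friends). The substring list and the log format are my assumptions here, and stealthier scrapers won't show up in such a count at all.)

```python
#!/usr/bin/env python3
# Tally per-day, per-crawler request counts from a combined access log on stdin.
# Assumes the standard nginx/Apache "combined" format and self-identifying bots.
import re
import sys
from collections import Counter

# User-Agent substrings the big, self-identifying crawlers advertise.
CRAWLERS = ["ClaudeBot", "GPTBot", "CCBot", "Bytespider", "Amazonbot"]

# Combined format ends with: [day/Mon/year:HH:MM:SS zone] ... "referer" "user-agent"
LINE = re.compile(r'\[(?P<day>[^:]+):[^\]]+\].*"(?P<agent>[^"]*)"\s*$')

counts = Counter()
for line in sys.stdin:
    m = LINE.search(line)
    if not m:
        continue
    for bot in CRAWLERS:
        if bot in m.group("agent"):
            counts[(m.group("day"), bot)] += 1
            break

for (day, bot), n in sorted(counts.items()):
    print(f"{day}\t{bot}\t{n}")
```

Feed it something like "zcat access.log*.gz | python3 botcount.py" and draw your own conclusions about whether they already have everything they need.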
GitHub already has everyone's code.
GitHub has a lot of code, but far from everyone's. And we shouldn't give them more to exploit. Just because they've exploited us for the past 10 years doesn't mean we should "accept reality" and let them continue.
Then, you go and ponder licensing: it doesn't matter. See the beginning of my blog post:
None of the major models keep attribution properly, and the creators and proponents of these “tools” assert that they do not need to, either. That by the nature of the training, the models recycle and remix, and no substantial code is emitted from the original as-is, only small parts that are not copyrightable in themselves. As such, they do not constitute derived work, and no attribution is necessary.
No matter how you word your licensing, as long as they can argue that training emits only uncopyrightable fragments through remixing and recycling, your license is irrelevant.
You can try to add wording that explicitly allows training only if the weights are released - they will not care. Once they deem your code uncopyrightable, they can do whatever they want, like they've been doing all along.
You assume these companies behave ethically. They do not.
See the recent-ish Anthropic vs Authors case: Anthropic wasn't penalized because the training violated copyright, but because they sourced the books illegally. The copyright violation claim was dismissed.
Why do you think applying a different license would help, when there's existing legal precedent that it does fuck all?
Also, releasing the weights is... insufficient. Important, but insufficient. To free a model, you also need the training data, otherwise you can't reproduce it. The training data should be considered part of the model's source, precisely because the model cannot be rebuilt without it.
Good luck with that. There is no scenario where surrendering to this "new reality" plays out well.
It really is quite simple: as we do not negotiate with fascists, we do not negotiate with AI companies either.
I think we should not be trying to stop AI companies from training LLMs on F/OSS code, but demanding that they release the trained models.
Not withdrawal, but re-appropriation! Just like the GPL did.
I wrote about training copyleft: "Histomat of F/OSS: We should reclaim LLMs, not reject them" (in Korean).
@buherator Exactly. LLM users expect models to adapt to new and shiny things; if they can't regurgitate the latest APIs, they're "falling behind".
The scraping will never stop, and it will never be consensual, because they need the data more than our consent.
@hongminhee You see, the thing is, they need us more than we need them. The opportunity here is realizing this, and forcing them to respect us. Only then can we talk about anything further. And the way towards that is denying them the data they require. The path there is through rejecting the current practice of exploitation, not conceding.