Been thinking a lot about @algernon's recent post on FLOSS and LLM training. The frustration with AI companies is spot on, but I wonder if there's a different strategic path. Instead of withdrawal, what if this is our GPL moment for AI—a chance to evolve copyleft to cover training? Tried to work through the idea here: Histomat of F/OSS: We should reclaim LLMs, not reject them.
@hongminhee I think you're giving the AI companies too much credit and goodwill. The technology itself may have its uses, but their total lack of respect is not the only problem. The energy required for training and inference, and the environmental impact of it, will not be addressed by freeing the models.
But... that's probably worth another blog post. Nevertheless, I'd like to address a few things here:
OpenAI and Anthropic have already scraped what they need.
They did not. I'm receiving 3+ million requests a day from Anthropic's ClaudeBot, and about a million a day from OpenAI: see the stats on @iocaine. If they had scraped everything they need, they wouldn't aggressively continue the practice, would they?
They need new content to "improve" the models. They need new data to scrape now more than ever, because as the internet fills up with slop, legitimate human work to train on becomes even more desirable.
Heck, I've seen scraping waves where my sites received over 100 million requests in a single day! I do not know which companies were responsible (though I have my suspicions), but they most definitely do not have all the data they need.
And I'm hosting small, tiny things, nothing of much importance. I imagine juicier targets like Codeberg receive a whole lot more of these.
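(If you want to sanity-check numbers like these yourself, a rough way is to count requests per User-Agent per day straight from your access logs. The sketch below is a minimal version of that idea; it assumes a standard nginx/Apache "combined" log format and that the crawlers keep using their advertised User-Agent strings (ClaudeBot, GPTBot, and friends). The substring list and the log format are my assumptions here, and stealthier scrapers won't show up in such a count at all.)

```python
#!/usr/bin/env python3
# Tally per-day, per-crawler request counts from a combined access log on stdin.
# Assumes the standard nginx/Apache "combined" format and self-identifying bots.
import re
import sys
from collections import Counter

# User-Agent substrings the big, self-identifying crawlers advertise.
CRAWLERS = ["ClaudeBot", "GPTBot", "CCBot", "Bytespider", "Amazonbot"]

# Combined format ends with: [day/Mon/year:HH:MM:SS zone] ... "referer" "user-agent"
LINE = re.compile(r'\[(?P<day>[^:]+):[^\]]+\].*"(?P<agent>[^"]*)"\s*$')

counts = Counter()
for line in sys.stdin:
    m = LINE.search(line)
    if not m:
        continue
    for bot in CRAWLERS:
        if bot in m.group("agent"):
            counts[(m.group("day"), bot)] += 1
            break

for (day, bot), n in sorted(counts.items()):
    print(f"{day}\t{bot}\t{n}")
```

Feed it something like "zcat access.log*.gz | python3 botcount.py" and draw your own conclusions about whether they already have everything they need.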
GitHub already has everyone's code.
GitHub has a lot of code, but far from everyone's. And we shouldn't give them more to exploit. Just because they've exploited us for the past 10 years doesn't mean we should "accept reality" and let them continue.
Then, you go and ponder licensing: it doesn't matter. See the beginning of my blog post:
None of the major models keep attribution properly, and the creators and proponents of these “tools” assert that they do not need to, either. That by the nature of the training, the models recycle and remix, and no substantial code is emitted from the original as-is, only small parts that are not copyrightable in themselves. As such, they do not constitute derived work, and no attribution is necessary.
No matter how you word your licensing, as long as they can argue that training emits only uncopyrightable fragments through remixing and recycling, your license is irrelevant.
You can try to add wording that explicitly allows training only if the weights are released - they will not care. Once they deem your code uncopyrightable, they can do whatever they want, like they've been doing all along.
You assume these companies behave ethically. They do not.
See the recent-ish Anthropic vs Authors case: Anthropic wasn't penalized because the training violated copyright, but because they sourced the books illegally. The copyright violation claim was dismissed.
Why do you think applying a different license would help, when there's existing legal precedent that it does fuck all?
Also, releasing the weights is... insufficient. Important, but insufficient. To free a model, you also need the training data, otherwise you can't reproduce it. The training data should be considered part of the model's source, precisely because the model cannot be rebuilt without it.
Good luck with that. There is no scenario where surrendering to this "new reality" plays out well.
It really is quite simple: as we do not negotiate with fascists, we do not negotiate with AI companies either.
I think we should not be trying to stop AI companies from training LLMs on F/OSS code, but demanding that they release the trained models.
Not withdrawal, but re-appropriation! Just like the GPL did.
I wrote about training copyleft: "Histomat of F/OSS: We should reclaim LLMs, not reject them" (in Korean).
@buherator Exactly. LLM users expect models to adapt to new and shiny things; if they can't regurgitate the latest APIs, they're "falling behind".
The scraping will never stop, and it will never be consensual, because they need the data more than our consent.
@hongminhee You see, the thing is, they need us more than we need them. The opportunity here is realizing this, and forcing them to respect us. Only then can we talk about anything further. And the way towards that is denying them the data they require. The path there is through rejecting the current practice of exploitation, not conceding.