Opinionated article by Alexander Hanff, a computer scientist and privacy technologist who helped develop Europe’s GDPR (General Data Protection Regulation) and ePrivacy rules.
We cannot allow Big Tech to continue to ignore our fundamental human rights. Had such an approach been taken 25 years ago in relation to privacy and data protection, arguably we would not have the situation we have today, where some platforms routinely ignore their legal obligations to the detriment of society.
Legislators did not understand the impact of weak laws or weak enforcement 25 years ago, but we have enough hindsight now to ensure we don’t make the same mistakes moving forward. The time to regulate unlawful AI training is now, and we must learn from mistakes past to ensure that we provide effective deterrents and consequences to such ubiquitous law breaking in the future.
That’s stupid. The damage is still done to the owner of that data used illegally. Make them destroy it.
But when you levy such minuscule fines that are less than they stand to make from it, it’s just a cost of business. Fines could work if they were proportionate to the value derived.
Yeah, the only threat to Big Tech is that they might sink a lot of money into training material they’d have to give away later. But releasing the material into the Public Domain is not exactly an improvement for the people whose data and work have been used without consent or payment.
“Congratulations, your rights are still being violated, but now the data is free to use for everyone”.
They would actually still benefit from public-domain’ing LLMs, because they themselves also get to use the data produced by others. Everyone takes losses but also gets gains under this idea, which is much better than the current model.
That’s like saying victims of deepfake porn benefit because they get to watch themselves having sex. Nope, not buying it.
Whether rights have been violated depends on the jurisdiction, of course.
Semantics. If person A is protected by privacy rights in her jurisdiction, but her data is scraped by project B from a jurisdiction where such rights conveniently aren’t legally respected, A should still be able to expect some form of injunction.
I guess the idea is that the models themselves are not infringing copyright, but the training process DID. Some of the big players have admitted to using pirated material in training data. The rest obviously did even if they haven’t admitted it.
While language models have the capacity to produce infringing output, I don’t think the models themselves are infringing (though there are probably exceptions). I mean, gzip can reproduce infringing material too with the correct input. If producing infringing work requires both the algorithm AND specific, intentional user input, then I don’t think you should put the blame solely on the algorithm.
Either way, I don’t think existing legal frameworks are suitable to answer these questions, so I think it’s more important to think about what the law should be rather than what it currently is.
I remember stories about the RIAA suing individuals for many thousands of dollars per mp3 they downloaded. If you applied that logic to OpenAI — maximum fine for every individual work used — it’d instantly bankrupt them. Honestly, I’d love to see it. But I don’t think any copyright holder has the balls to try that against someone who can afford lawyers. They’re just bullies.
I’m still not understanding the logic. Here is a copyrighted picture. I can search for it, download it, view it, see it with my own eyeballs. My browser already downloaded the image for me, in order for me to see it in the browser. I can take that image and edit it in a photo editor. I can do whatever I want with the image on my own computer, as long as I don’t publish the image elsewhere on the internet. All of that is legal. None of it infringes on copyright.
Hell, it could be argued that if I transform the image to a significant degree, I can still publish it under Fair Use. But, that still gets into a gray area for each use case.
What is not a gray area is what AI training does. They download the image and use it in training, which is like me looking at a picture in a browser. The image isn’t republished, or stored in the published model, or represented in any way that could be reconstructed back to the source image in any reasonable form. It just changes a bunch of weights in an LLM. It’s mathematically impossible for a 4GB model to somehow store the many, many terabytes of images on the internet.
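To put a rough number on that size argument, here is a back-of-the-envelope sketch. The 4GB model size comes from the comment above; the 100 TB dataset size is purely a hypothetical stand-in for "many, many terabytes", and `compression_ratio` is an illustrative helper, not anything from a real library.

```python
def compression_ratio(dataset_bytes: int, model_bytes: int) -> float:
    """Return how many bytes of training data correspond to each byte of model."""
    return dataset_bytes / model_bytes


GB = 10**9
TB = 10**12

model_size = 4 * GB      # the 4GB figure from the comment above
dataset_size = 100 * TB  # ASSUMPTION: hypothetical stand-in for "many, many terabytes"

ratio = compression_ratio(dataset_size, model_size)
print(f"{ratio:,.0f} bytes of training data per byte of model")
```

Under those assumed sizes, every byte of the model would have to account for tens of thousands of bytes of training data. This is only an average, so on its own it does not settle whether any individual item could be retained.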
Where is the copyright infringement?
You want to use the same bullshit tactics and unreasonable math that the RIAA used in their court cases?
I agree that the models themselves are clearly transformative. That doesn’t mean it’s legal for Meta to pirate everything on earth to use for training. THAT’S where the infringement is. And they admitted they used pirated material: https://www.techspot.com/news/101507-meta-admits-using-pirated-books-train-ai-but.html
I would enjoy seeing megacorps held to at least the same standards as individuals. I would prefer for those standards to be reasonable across the board, but that’s not really on the table here.
If you take that image, copy it and then try to resell it for profit you’ll find you’re quickly in breach of copyright.
The LLM is, in most cases, being licensed out to users for a profit off of the input data without which it could not exist in its current form.
You could see it as akin to plagiarism if you think ctrl+c, ctrl+v is too extreme.
That’s not what’s happening. Did you even read my comment?
OK, if you ignore the hyperbole of my aggressive, pre-Christmas-stress start, how much of the rest do you disagree with?
Less combatively, my stance is: just make AI-generated materials exempt from copyright and you’ll at least limit mass adoption in public-facing things by big money. Doesn’t address all the issues, though.
AI-generated materials are already exempt from copyright. It falls under the same arguments as the monkey selfie. Which is great.
Crack copyright like a fucking egg. It only benefited the rich, anyway.
That’s good, and I’m glad to have been informed of it.
Thank you.
My copyright change would be 17 years from first publication. Feels maybe still a little long, but much better than what we have now.
Destroying it is both not an option, and an objectively regressive suggestion to even make.
Destruction isn’t possible because even if you deleted every bit of information from every hard drive in the world, now that we know it’s possible, someone would recreate it all in a matter of months.
Regressive because you’re literally suggesting that we destroy a new technology because we’re afraid of what it will do to the technology it replaces. Meanwhile, there’s a very decent chance that AI is our best chance at solving the energy/climate crises through advancing nuclear tech, as well as surviving the next pandemic via groundbreaking protein folding tech.
I realize AI tech makes people uncomfortable (for…so many reasons), but becoming old fashioned conservatives in response is not a solution.
I would take it a step further than public domain, though. I would also require anyone making profits from illegally trained AI to license it from the public. If you’re going to use an AI to replace workers, then you need to pay taxes to the people proportional to what you would be paying those it replaces.
I never suggested destroying the technology that is “AI”. I’m not uncomfortable about AI, I’ve even considered pivoting my career in that direction.
I suggested destroying the particular implementation that was trained on the illegitimate data. If someone can recreate it using legitimate data, GREAT. That’s what we want to happen. The tool isn’t the problem. It’s the method they’re using to train them.
Please don’t make up random ass narratives I never even hinted at, and then argue against them.
I didn’t misinterpret what you were saying; everything I said applies to the specific case you lay out. If illegal networks were somehow entirely destroyed, someone would just make them again. That’s my point: there’s no way around that, there’s just holding people accountable when they do it. IMO that takes the form of restitution to the people proportional to profits.
This is the dumb kind of “best do nothing, because nothing is perfect” approach to making sure no disincentives are ever put in place because someone somewhere else might also try to do the illegal thing that they’ll lose access to the moment they’re caught…
What the? I’m literally saying what action to take, what is happening? Is there maybe a bug where you only see the first few characters of my post? Are you able to read these characters I’m typing? Testing testing testing. Let me know how far you get. Maybe there’s just too many words for you? Test test. Say “elephant” if you can read this.
Mate, LLMs are literally gobbling up energy as if they’re working at a power plant gloryhole. It’s furthering the climate crisis, not solving it. They’re also incapable of the logic needed to make something new, so they’re not gonna invent anything. AI in general has its uses, but LLMs are not the golden goose you should bet on. And profits from them are afaik nonexistent. The money only comes from investors thinking it’ll be profitable some day, but it’s way too energy-intensive a process to be profitable.
I understand that you are familiar with the buzzword “LLM”, but let me introduce you to a different one: transformers.
Virtually all modern successful AIs are based on transformers, LLMs included. I agree that LLMs currently amount to a Chinese-room-inspired parlor trick, but the money involved has no doubt advanced all transformer-based AI research, both directly (what works for LLMs may generalize) and indirectly (the market demand for LLMs in consumer products has created a demand for power and compute hardware).
We have transformer-based AI to thank for our understanding of the COVID-19 protein, and for developing a safe and effective vaccine in a timely manner.
The massive demand for energy has convinced Microsoft, Meta, and others to invest in their own modern nuclear power plants, representing a monumental step forward in sustainable energy generation that we have been trying to convince the US government to take for decades.
Modern AI is being used to solve the hardest problems of nuclear fusion. If we can finally crack that nut, there’s no telling what’s possible.
But specifically when it comes to LLMs, profitable or not, people obviously find them useful. People aren’t using it in place of search engines, or doing all their homework with it because they don’t find it useful. My only argument is that any AI trained on public content without consent should be required to effectively buy a license from, or pay royalties to the public. If McDonald’s is going to replace their front counters with AI trained on public content, then they should have to pay taxes proportional to how much use they get from that AI.
In the theoretical extreme, if someone trains an AI on the general public’s data, and is able to create an AI that somehow replaces every job on earth, then congrats, we now live in a post-work society, we just need to reach out and take it rather than letting one person capitalize infinitely.
And at the end of the day, if you honestly believe the profits from AI are non-existent, then what are you worried about? All those companies putting all their eggs in the LLM basket are going to disappear overnight when the AI bubble finally pops, right?
There’s a reason why in my comment I talked about LLMs as bad while saying AI in general has its uses. The reason being that this post is about LLMs.
I know very well that specialized AI has a lot of uses in medical science and other fields but that’s not really what got hit with all the hype, is it? The hype is managers saw a language model give seemingly better answers to questions than John Rando from 2 blocks down the road so they’re now looking to cut out all the already low paid workers and spoiler alert we will not land in a society where the general public profits from not having work. It will be the same owners of capital profiting as per usual.
If we do nothing, sure. I’m suggesting, like the article, that we do something.
The only sentiment I took issue with was the poster above who suggested that somehow the solution would be to delete/destroy illegally trained networks. I’m just saying that’s neither practical nor progressive. AI is here to stay, we just need to create legislation that ensures it works for us, especially when it couldn’t have been built without us.
Would love to see a source for AI helping with the COVID-19 vaccine.
For sure, here you go.
I’d argue it’s not useless; rather, it would remove any financial incentive for these companies to sink who knows how much into training AI. By putting the models in the public domain, they would lose their competitive advantage over other cloud providers who could exploit them all the same, all the while not disturbing the current usage of AI.
Now, I do agree that destroying it would be even better, but I fear something like that would face too much pushback from the parts of civil society who do use AI.