I believe it was already known that anything trained on The Pile contained references to copyrighted material from scihub. It seems unlikely that folks who chose to use these sources were completely unaware of the nature of the data. Presumably, given the urgency in the last 2-3 years to be a leader in this space, a number of shortcuts were taken.
Zuck did a calculation: "Does the risk of lawsuits and bad PR outweigh the benefits of being early?".
If you remove morals from the equation, nearly every CEO in that position would have made the same decision.
You talk about morals, but did you consider that they are releasing the model as open source? Given that OpenAI and others don't, Zuck is really the only current option for a reasonably comparable open-source model. Also, did you consider that it might be more moral to create an AI model than to uphold copyright law, which many on this site actually deem immoral?
IMHO this is a moral win on Zuck's side.
I would speculate this is true of all the leading commercial LLMs. Don't have enough training data? Just steal some!
On "true for all": you'd need to split it by era, I think.
During the early Llama 1 days, The Pile dataset was in heavy use by many. A bit later, people figured out that a subset of it, Books3, was especially problematic.
I'm guessing all the big houses dropped that piece from later models, since it's extra radioactive.
What was problematic about it?
Thousands of pirated copyrighted books
Copy some*
That's exactly the difference: one does not steal in the digital world. If I could download/copy a car, I would do it ;)
Courts have yet to decide on which it is, and it might depend on how well the model can transform vs. recite.
i was under the impression that almost everyone trained on books3
Jail time? Or just multi-million-dollar fines?
Will he be allowed to lead Meta if convicted of a crime?
Stripping out the copyright notices is quite damning.
There is wrongdoing, and then there is obvious evidence that you knew what you were doing was wrong. That really limits the options for a defence.
Stupid question: I have 400,000 ebooks (yup, pirated ones). What happens if I build an LLM with them?
You would have a net worth of 1bn
You'd still ask stupid questions?
You’ll be fine. It’s like laundering money.
You go to jail forever.
Nothing.
What do you imagine could happen?
It's also reported elsewhere (in media articles linked to from Hacker News) that they torrented copyrighted material. AMAZING.
Everyone knows that LLMs are trained on a shit-ton of pirated content.
"I'm shocked, shocked to find out that piracy is going on here!"
"Your LLM, Captain Zuckerberg."
"Oh, thank you very much!"
Good. These laws are anti progress.
We're literally extracting, refining, and re-using the information, art, and thoughts of fellow humans to make billionaires money.
This isn't the '90s. Computing isn't about discovery, not in the big leagues. It's about grinding up authenticity and feeding it into a machine to convert it into shareholder value.
If they want the value, let them pay for it or release the models open source for all to benefit.
They have released all their models for free so far, unlike companies like OpenAI, who are most likely doing the same thing but keeping the result private and proprietary.
What "progress"?
Exfiltration of information from the economy