[email protected] • English • 2 months ago

“Model collapse” threatens to kill progress on generative AIs

Generative AIs start churning out nonsense when trained on synthetic data — a problem that could put a ceiling on their ability to improve.
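
For readers unfamiliar with the mechanism, here is a toy sketch of the feedback loop behind "model collapse". It is not an LLM: it only assumes a "model" that is a Gaussian refit, each generation, to a small sample drawn from the previous generation's fit. The function name, sample size, and generation count are illustrative choices, not anything from the linked article.

```python
import random
import statistics

def simulate_recursive_fitting(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples, generation after generation.

    Toy assumption: the "model" is just a (mean, std) pair, and "training"
    is estimating those two numbers from a finite sample.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 stands in for the real data
    history = [std]
    for _ in range(generations):
        # Each generation sees only samples produced by the previous fit.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)  # maximum-likelihood estimate
        history.append(std)
    return history

history = simulate_recursive_fitting()
print(f"std at generation 0: {history[0]:.3f}")
print(f"std at generation {len(history) - 1}: {history[-1]:.3e}")
```

With small samples and no fresh or filtered data entering the loop, the estimated spread tends to shrink over generations, which is the narrow sense in which a model trained only on its own output can degrade.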

@[email protected]
2•2 months ago
I mean, we’ve seen already that AI companies are forced to be reactive when people exploit loopholes in their models or some unexpected behavior occurs. Not that they aren’t smart people, but these things are very hard to predict, and hard to fix once they go wrong.

Also, what do you mean by synthetic data? If it’s made by AI, that’s how collapse happens.

The problem with curated data is that you have to, well, curate it, and that’s hard to do at scale. We no longer have a few decades’ worth of unpoisoned data to work with; the only way to guarantee training data isn’t from its own model is to make it yourself.
- FaceDeer
  1•2 months ago
  
  > Also, what do you mean by synthetic data? If it’s made by AI, that’s how collapse happens.
  
  But that’s exactly my point. Synthetic data is made by AI, but it doesn’t cause collapse. The people who keep repeating this “AI fed on AI inevitably dies!” headline are ignorant of how this actually works, and of the details that actually matter when it comes to what causes model collapse.
  
  If people want to oppose AI and wish for its downfall, fine, that’s their opinion. But they should do so based on actual data, not an imaginary story they pass around among themselves. Model collapse isn’t a real threat to the continuing development of AI. At worst, it’s just another box that AI trainers need to tick on their “am I ready to start this training run?” checklist, alongside “have I paid my electricity bill?”
  
  > The problem with curated data is that you have to, well, curate it, and that’s hard to do at scale.
  
  It was, before we had AI. Turns out that that’s another aspect of synthetic data creation that can be greatly assisted by automation.
  
  For example, the Nemotron-4 AI family that NVIDIA released a few months back is specifically intended for creating synthetic data for LLM training. Its data pipeline pairs two LLMs, Nemotron-4 Instruct (which generates the training data) and Nemotron-4 Reward (which curates it). It’s not a fully automated process yet, but the requirement for human labor is drastically reduced.
  
  > the only way to guarantee training data isn’t from its own model is to make it yourself
  
  But that guarantee isn’t needed. AI-generated data isn’t a magical poison pill that kills anything that tries to train on it. Bad data is bad, of course, but that’s true whether it’s AI-generated or not. The same process of filtering good training data from bad training data can work on either.
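
The generate-then-curate idea described in this comment can be sketched roughly as below. This is a minimal illustration under stated assumptions, not NVIDIA's pipeline or API: `filter_by_reward` and `toy_reward` are hypothetical names, and the scorer is a crude repetition penalty standing in for a real reward model.

```python
from typing import Callable, Iterable, List

def filter_by_reward(
    candidates: Iterable[str],
    reward_fn: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep only candidates whose reward score meets the threshold."""
    return [text for text in candidates if reward_fn(text) >= threshold]

def toy_reward(text: str) -> float:
    """Hypothetical scorer: fraction of distinct words (penalizes repetition)."""
    words = text.split()
    if not words:
        return 0.0  # empty samples get the lowest score
    return len(set(words)) / len(words)

# Candidates could come from a generator model or from humans;
# the filter does not need to know which.
candidates = [
    "the cat sat on the mat",
    "data data data data data",
    "",
    "filter bad samples before training",
]
kept = filter_by_reward(candidates, toy_reward, threshold=0.8)
print(kept)
```

The filter only looks at the score, never at where a sample came from, which matches the point above that the same good-versus-bad filtering can apply to AI-generated and human-written data alike.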
