ModerateImprovement@sh.itjust.works to

Technology@lemmy.worldEnglish · 9 months ago

AI trained on AI garbage spits out AI garbage.

www.technologyreview.com

446

As junk web pages written by AI proliferate, the models that rely on that data will suffer.


Admiral Patrick@dubvee.org
link
fedilink
English
arrow-up
86·
9 months ago

As junk web pages written by AI proliferate, the models that rely on that data will suffer.

Good.
Madrigal@lemmy.world
link
fedilink
English
arrow-up
80·
9 months ago
“On two occasions I have been asked, ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.” - Charles Babbage
- bionicjoey@lemmy.ca
  link
  fedilink
  English
  arrow-up
  15·
  9 months ago
  The business people adopting AI: “who cares what it’s trained on? It’s intelligent right? It’ll just sort through the garbage and magically come up with the right answers to everything”
  - RecluseRamble@lemmy.dbzer0.com
    link
    fedilink
    English
    arrow-up
    1·
    9 months ago
    Not so hard to imagine given that these people have always seen technical systems as magic.
- CookieOfFortune@lemmy.world
  link
  fedilink
  English
  arrow-up
  6·
  9 months ago
  Of course modern UX design is very much based on getting the right answer with the wrong inputs (autocorrect, etc).
  - lennivelkant@discuss.tchncs.de
    link
    fedilink
    English
    arrow-up
    1·
    9 months ago
    I believe Robustness was the term I learned years ago: the ability of a system to gracefully handle user error, make it easy to recover from or fix, clearly communicate what was wrong etc.
    
    Of course, nothing is ever perfect and humans are very creative at fucking up, and a lot of companies don’t seem to take UX too seriously. Particularly when the devs get tunnel vision and forget about user error being a thing…
Crazyslinkz@lemmy.world
link
fedilink
English
arrow-up
59·
9 months ago
Garbage in; Garbage out.
- _haha_oh_wow_@sh.itjust.works
  link
  fedilink
  English
  arrow-up
  19·
  9 months ago
  Shit-fueled ouroboros
- lemmeout@lemm.ee
  link
  fedilink
  English
  arrow-up
  4·
  9 months ago
  You can’t explain it!
- BluesF@lemmy.world
  link
  fedilink
  English
  arrow-up
  2·
  9 months ago
  Recycle the garbage that comes out… Still more garbage out.
Lvxferre [he/him]@mander.xyz
link
fedilink
English
arrow-up
40·
9 months ago
Model degeneration is an already well-known phenomenon. The article already explains well what’s going on so I won’t go into details, but note how this happens because the model does not understand what it is outputting - it’s looking for patterns, not for the meaning conveyed by said patterns.

Frankly at this rate might as well go with a neuro-symbolic approach.
- CeeBee_Eh@lemmy.world
  link
  fedilink
  English
  arrow-up
  2·
  9 months ago
  The issue with your assertion is that people don’t actually work a similar way. Have you ever met someone who was clearly taught “garbage”?
  - Lvxferre [he/him]@mander.xyz
    link
    fedilink
    English
    arrow-up
    12·
    9 months ago
    
    The issue with your assertion is that people don’t actually work a similar way.
    
    I’m talking about LLMs, not about people.
    - CeeBee_Eh@lemmy.world
      link
      fedilink
      English
      arrow-up
      2·
      9 months ago
      I know you are, but the argument that an LLM doesn’t understand context is incorrect. It’s not human level understanding, but it’s been demonstrated that they do have a level of understanding.
      
      And to be clear, I’m not talking about consciousness or sapience.
      - Lvxferre [he/him]@mander.xyz
        link
        fedilink
        English
        arrow-up
        10·
        9 months ago
        
        I know you are, but the argument that an LLM doesn’t understand context is incorrect
        
        Emphasis mine. I am talking about the textual output. I am not talking about context.
        
        It’s not human level understanding
        
        Additionally, your obnoxiously insistent comparison between LLMs and human beings boils down to a red herring.
        
        Not wasting my time further with you.
        
        [For others who might be reading this: sorry for the blatantly rude tone but I got little to no patience towards people who distort what others say, like the one above.]
        
        CeeBee_Eh@lemmy.world
        link
        fedilink
        English
        arrow-up
        1·
        9 months ago
        
        I got little to no patience towards people who distort what others say,
        
        My original reply was meant to be tongue-in-cheek, but I guess I forgot about Poe’s law. I’m not a layman, for the record. I’ve worked with AI for over a decade
        
        Not wasting my time further with you.
        
        Ditto. Have a nice day.
      - CileTheSane@lemmy.ca
        link
        fedilink
        English
        arrow-up
        3·
        9 months ago
        
        but it’s been demonstrated that they do have a level of understanding.
        
        Citation needed
        
        CeeBee_Eh@lemmy.world
        link
        fedilink
        English
        arrow-up
        1·
        9 months ago
        Here you go
        
        https://youtu.be/gQddtTdmG_8
        
        CileTheSane@lemmy.ca
        link
        fedilink
        English
        arrow-up
        4·
        edit-2
        9 months ago
        A better mathematical system of storing words does not mean the LLM understands any of them. It just has a model that represents the relation between words that it uses.
        
        If I put 10 minus 8 into my calculator I get 2. The calculator doesn’t actually understand what 2 means, or what subtracting represents; it just runs the commands that give the appropriate output.
        
        CeeBee_Eh@lemmy.world
        link
        fedilink
        English
        arrow-up
        2·
        edit-2
        9 months ago
        That’s a bad analogy, because the calculator wasn’t trained using an artificial neural network literally designed by studying biological brains (aka biological neural networks).
        
        And “understand” doesn’t equate to consciousness or sapience. For example, it is entirely and factually correct to state that an LLM is capable of reasoning. That’s not even up for debate. The accuracy of an LLM’s reasoning capability is one of the fundamental benchmarks used for evaluating its quality.
        
        But that doesn’t mean it’s “thinking” in the way most people consider.
        
        Edit: anyone up voting this CileTheSane clown is in the same boat of not comprehending how LLMs work.
  - PenisDuckCuck9001
    link
    fedilink
    English
    arrow-up
    8·
    edit-2
    9 months ago
    I’m autistic and sometimes I feel like an ai bot spewing out garbage in social situations. If I do what people normally do and make it sound believable, maybe no one will notice.
tal@lemmy.today
link
fedilink
English
arrow-up
27·
9 months ago
Removed by mod
- palordrolap@kbin.run
  link
  fedilink
  arrow-up
  21·
  9 months ago
  Hopefully archive.org has measures in place to stop people from yanking all their data too quickly. At least not without a hefty donation or something. As a user it can chug a bit, and I’m hoping that’s the rate-limiting I’m talking about and not that they’re swamped.
  - Grimy@lemmy.world
    link
    fedilink
    English
    arrow-up
    9·
    edit-2
    9 months ago
    That would go against the principle of the archive imo, but regardless, if you take away all means of acquiring data freely, you are just giving companies like OpenAI and Google, who already have copies of it, an insane advantage.
    
    AI isn’t going away; we need to make sure we have free access to it so as not to give our whole economy to a handful of companies.
Catoblepas@lemmy.blahaj.zone
link
fedilink
English
arrow-up
27·
9 months ago
AI making itself sick and worthless after flooding the internet with trash just gives me a warm glow.
kromem@lemmy.world
link
fedilink
English
arrow-up
16·
9 months ago
I’d be very wary of extrapolating too much from this paper.

Past research along these lines found that a mix of synthetic and organic data was better than organic data alone. A caveat for all the research to date is that it uses shitty cheap models, whose synthetic data is significantly degraded compared with what SotA models produce, whereas other research has found notable improvements to smaller models trained on synthetic data from SotA models.

Basically, this only really says that AI models of several types, with capabilities from a year or two ago, will collapse when recursively trained with no additional organic data.

It’s not representative of real world or emerging conditions.
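
A toy way to see the distinction being drawn here, not the paper's actual setup: the "model" below is just a Gaussian fitted to a handful of samples, and each generation is trained on samples from the previous one. The function names, sample size, generation count, and the 50% organic mix are all assumptions picked for illustration.

```python
import random
import statistics

def fit(samples):
    """'Train' a toy model: estimate mean and standard deviation."""
    return statistics.fmean(samples), statistics.pstdev(samples)

def run_generations(real_fraction, generations=200, n=10, seed=0):
    """Recursively refit a Gaussian, mixing in a fraction of organic data."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 matches the organic distribution N(0, 1)
    n_real = int(n * real_fraction)
    for _ in range(generations):
        synthetic = [rng.gauss(mean, std) for _ in range(n - n_real)]
        organic = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        mean, std = fit(synthetic + organic)
    return std

pure_synthetic = run_generations(real_fraction=0.0)
mixed = run_generations(real_fraction=0.5)
print(f"std after recursive training, synthetic only: {pure_synthetic:.3g}")
print(f"std after recursive training, 50% organic:    {mixed:.3g}")
```

With synthetic data only, the fitted spread shrinks generation after generation; keeping organic samples in every round anchors it. That says nothing about how large real models behave, only why "no additional organic data" is the load-bearing condition.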
Anarki_@lemmy.blahaj.zone
link
fedilink
English
arrow-up
16·
9 months ago
⢀⣠⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⠀⣠⣤⣶⣶
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⠀⢰⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⣀⣀⣾⣿⣿⣿⣿
⣿⣿⣿⣿⣿⡏⠉⠛⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⣿
⣿⣿⣿⣿⣿⣿⠀⠀⠀⠈⠛⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠿⠛⠉⠁⠀⣿
⣿⣿⣿⣿⣿⣿⣧⡀⠀⠀⠀⠀⠙⠿⠿⠿⠻⠿⠿⠟⠿⠛⠉⠀⠀⠀⠀⠀⣸⣿
⣿⣿⣿⣿⣿⣿⣿⣷⣄⠀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣴⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⠏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠠⣴⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⡟⠀⠀⢰⣹⡆⠀⠀⠀⠀⠀⠀⣭⣷⠀⠀⠀⠸⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⠃⠀⠀⠈⠉⠀⠀⠤⠄⠀⠀⠀⠉⠁⠀⠀⠀⠀⢿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⢾⣿⣷⠀⠀⠀⠀⡠⠤⢄⠀⠀⠀⠠⣿⣿⣷⠀⢸⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⡀⠉⠀⠀⠀⠀⠀⢄⠀⢀⠀⠀⠀⠀⠉⠉⠁⠀⠀⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣧⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢹⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⣿
- Mizule@lemm.ee
  link
  fedilink
  English
  arrow-up
  2·
  26 days ago
  the root
  - Anarki_@lemmy.blahaj.zone
    link
    fedilink
    English
    arrow-up
    1·
    26 days ago
    Welcome, weary traveler.
    
    How was your journey?
    - Mizule@lemm.ee
      link
      fedilink
      English
      arrow-up
      2·
      25 days ago
      depressing
cordlesslamp@lemmy.today
link
fedilink
English
arrow-up
16·
9 months ago
Oh no, the AI are inbreeding.
KevonLooney@lemm.ee
link
fedilink
English
arrow-up
12·
edit-2
9 months ago

provenance requires some way to filter the internet into human-generated and AI-generated content, which hasn’t been cracked yet

It doesn’t need to be filtered into human / AI content. It needs to be filtered into good (true) / bad (false) content. Or a “truth score” for each.

We don’t teach children to read by just handing them random tweets. We give them books that are made specifically for children. Our filtering mechanism for good / bad content is very robust for humans. Why can’t AI just read every piece of “classic literature”, famous speeches, popular books, good TV and movie scripts, textbooks, etc?
- Lvxferre [he/him]@mander.xyz
  link
  fedilink
  English
  arrow-up
  9·
  9 months ago
  It doesn’t need to be filtered into human / AI content. It needs to be filtered into good (true) / bad (false) content. Or a “truth score” for each.
  
  That isn’t enough because the model isn’t able to reason.
  
  I’ll give you an example. Suppose that you feed the model with both sentences:
  
  Cats have fur.
  
  Birds have feathers.
  
  Both sentences are true. And based on vocabulary of both, the model can output the following sentences:
  
  Cats have feathers.
  
  Birds have fur.
  
  Both are false but the model doesn’t “know” it. All that it knows is that “have” is allowed to go after both “cats” and “birds”, and that both “feathers” and “fur” are allowed to go after “have”.
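
The argument above can be reproduced with a first-order (bigram) word model, which is far simpler than an LLM and is used here only as an assumption-laden toy. It records which word was seen directly after which, then enumerates every sentence those pairs allow. The two-sentence corpus comes from the comment; `followers` and `generate_all` are names made up for this sketch.

```python
from collections import defaultdict

corpus = ["cats have fur", "birds have feathers"]

# Bigram table: which words were seen directly after each word.
followers = defaultdict(set)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        followers[prev].add(nxt)

def generate_all(start):
    """Enumerate every sentence the bigram table permits from `start`."""
    results = []

    def walk(path):
        options = followers.get(path[-1])
        if not options:  # no known follower: the sentence ends here
            results.append(" ".join(path))
            return
        for word in sorted(options):
            walk(path + [word])

    walk([start])
    return results

print(generate_all("cats"))   # includes "cats have feathers"
print(generate_all("birds"))  # includes "birds have fur"
```

Because the table only looks one word back, "have" forgets whether it followed "cats" or "birds", so both false sentences are permitted.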
  - KevonLooney@lemm.ee
    link
    fedilink
    English
    arrow-up
    8·
    9 months ago
    It’s not just a predictive text program. That’s been around for decades. That’s a common misconception.
    
    As I understand it, it uses statistics from the whole text to create new text. It would be very rare to output “cats have feathers” because that phrase doesn’t ever appear in the training data. Both words “have feathers” never follow “cats”.
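
A minimal sketch of the point about using more of the text, assuming a toy model that conditions on the two previous words instead of one. The corpus is the two sentences from the parent comment; `counts` and `next_word_probs` are hypothetical names, and a real LLM conditions on far more context than this.

```python
from collections import Counter, defaultdict

corpus = ["cats have fur", "birds have feathers"]

# Count which word follows each pair of preceding words.
counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(2, len(words)):
        context = (words[i - 2], words[i - 1])
        counts[context][words[i]] += 1

def next_word_probs(context):
    """Relative frequency of each word seen after the two-word context."""
    seen = counts.get(context, Counter())
    total = sum(seen.values())
    return {word: n / total for word, n in seen.items()} if total else {}

print(next_word_probs(("cats", "have")))   # only "fur" was ever seen here
print(next_word_probs(("birds", "have")))  # only "feathers" was ever seen here
```

With the extra word of context, "feathers" gets no probability after "cats have", which is the statistical effect described above. It still says nothing about whether the model knows what a cat is.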
    - skulblaka@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      9·
      9 months ago
      But the fact remains that it doesn’t know what a cat or a feather is. All of this is still based purely on statistical frequency and not at all on actual meanings.
    - vrighter@discuss.tchncs.de
      link
      fedilink
      English
      arrow-up
      5·
      edit-2
      9 months ago
      and that is exactly how a predictive text algorithm works.
      
      some tokens go in
      
      they are processed by a deterministic, static statistical model, and a set of probabilities (always the same, deterministic, remember?) comes out.
      
      pick the word with the highest probability, add it to your initial string and start over.
      
      if you want variety, add some randomness and don’t just always pick the most probable next token.
      
      Coincidentally, this is exactly how llms work. It’s a big markov chain, but with a novel lossy compression algorithm on its state transition table. The last point is also the reason why, if anyone says they can fix llm hallucinations, they’re lying.
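
The loop described above can be sketched as follows. The probability table is entirely hypothetical, and this toy `model` looks only at the last token, whereas a real LLM conditions on the whole context window and computes the probabilities on demand.

```python
# Hypothetical next-token probabilities, invented for illustration.
NEXT_TOKEN_PROBS = {
    "<start>": {"cats": 0.6, "birds": 0.4},
    "cats": {"have": 0.9, "purr": 0.1},
    "birds": {"have": 0.7, "sing": 0.3},
    "have": {"fur": 0.5, "feathers": 0.4, "<end>": 0.1},
    "fur": {"<end>": 1.0},
    "feathers": {"<end>": 1.0},
    "purr": {"<end>": 1.0},
    "sing": {"<end>": 1.0},
}

def model(tokens):
    """Deterministic: the same tokens in always give the same probabilities out."""
    return NEXT_TOKEN_PROBS[tokens[-1]]

def generate_greedy(max_tokens=10):
    """Repeatedly append the most probable next token until <end>."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        probs = model(tokens)
        best = max(probs, key=probs.get)
        if best == "<end>":
            break
        tokens.append(best)
    return tokens[1:]

print(" ".join(generate_greedy()))
```

Greedy decoding gives the same sentence on every run; variety only appears once the "pick the highest" step is replaced by random sampling.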
      - CeeBee_Eh@lemmy.world
        link
        fedilink
        English
        arrow-up
        2·
        9 months ago
        
        Coincidentally, this is exactly how llms work
        
        Everyone who says this doesn’t actually understand how LLMs work.
        
        Multivector word embeddings create emergent relationships that are new knowledge that doesn’t exist in the training dataset.
        
        Computerphile did a good video on this well before the LLM craze.
        
        vrighter@discuss.tchncs.de
        link
        fedilink
        English
        arrow-up
        1·
        edit-2
        9 months ago
        1 - a markov chain only takes previous tokens as input.
        
        2 - It uses a function (in the mathematical sense, so same input results in same output, completely stateless) to generate a set of probabilities for what the next token might be.
        
        3 - The most probable token is picked, else randomness (temperature) is inserted here to choose a different token occasionally.
        
        An LLM’s internals, the part that’s trained, are literally the function used in step 2. You could have this function implemented a number of ways, e.g. you could build a huge table and consult it. Or you could generate it somehow. You could train a big neural network that takes previous tokens as input, and outputs probabilities of tokens as output. You then enumerate its outputs for every possible permutation of inputs and there’s your table. This would take too much time and space, so we just run the function on-demand instead. Exact same result. It can be very smart and notice correlations, but ultimately it generates a (virtual) huge static table. This is a completely deterministic process. A trained NN is still a (huge) mathematical function. So the big network that they spend resources training is basically the function used in step 2.
        
        Step 3 is the cause of hallucinations. It’s the only nondeterministic part. And it’s not part of the LLM itself in any way. No matter how much smarter the neural network gets, the hallucinations are introduced mainly in step 3. So no, they won’t be solving the LLM hallucination problem anytime soon.
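
For readers who want steps 2 and 3 spelled out, here is a rough sketch under stated assumptions: the `logits` dictionary stands in for the output of the trained function in step 2 (the numbers are invented), and temperature is applied as a divisor before a softmax, which is the common convention.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Step 2 output -> probabilities; lower temperature sharpens the peak."""
    scaled = [value / temperature for value in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(logits, exps)}

def sample(probs, rng):
    """Step 3: draw one token at random, weighted by probability."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Hypothetical scores for the token after "cats have".
logits = {"fur": 2.0, "feathers": 1.0, "scales": 0.0}

cold = softmax_with_temperature(logits, 0.5)
hot = softmax_with_temperature(logits, 2.0)
print(cold)
print(hot)

rng = random.Random(0)  # seeded so the run is repeatable
print([sample(hot, rng) for _ in range(5)])
```

The probabilities are the same every time for the same input; the only randomness is in `sample`. Whether that randomness is the main source of hallucinations is the commenter's claim, and the sketch does not settle it.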
    - barsoap@lemm.ee
      link
      fedilink
      English
      arrow-up
      3·
      edit-2
      9 months ago
      
      because that phrase doesn’t ever appear in the training data.
      
      Eh but LLMs abstract. It has seen “<animal> have feathers” and “<animal> have fur” quite a lot of times. The problem isn’t that LLMs can’t reason at all, the problem is that they do employ techniques used in proper reasoning, in particular tracking context throughout the text (cross-attention) but lack techniques necessary for the whole thing, instead relying on confabulation to sound convincing regardless of the BS they spout. Suffices to emulate an Etonian but that’s not a high standard.
      - FaceDeer@fedia.io
        link
        fedilink
        arrow-up
        2·
        9 months ago
        Workarounds for those sorts of limitations have been developed, though. Chain-of-thought prompting has been around for a while now, and I recall recently seeing an article about a model that had that built right into it; it had been trained to use <thought></thought> tags to enclose invisible chunks of its output that would be hidden from the end user but would be used by the AI to work its way through a problem. So if you asked it whether cats had feathers it might respond “<thought>Feathers only grow on birds and dinosaurs. Cats are mammals.</thought> No, cats don’t have feathers.” And you’d only see the latter bit. It was a pretty neat approach to improving LLM reasoning.
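
The hiding step described above could look something like this. It is only a guess at the post-processing, assuming the model emits literal `<thought>` tags as in the comment's example; `visible_reply` is a hypothetical helper, not any product's API.

```python
import re

# Non-greedy match so several separate thought blocks are each removed.
THOUGHT_BLOCK = re.compile(r"<thought>.*?</thought>\s*", re.DOTALL)

def visible_reply(raw_output):
    """Return only the text the end user would be shown."""
    return THOUGHT_BLOCK.sub("", raw_output).strip()

raw = (
    "<thought>Feathers only grow on birds and dinosaurs. "
    "Cats are mammals.</thought> No, cats don't have feathers."
)
print(visible_reply(raw))
```

The hidden text still counts as output the model has already produced, so it stays in the context that later tokens are conditioned on; only the display is filtered.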
    - WalnutLum@lemmy.ml
      link
      fedilink
      English
      arrow-up
      1·
      edit-2
      9 months ago
      This isn’t really accurate either. At the moment of generation, an LLM only has context for the input string and the network of text tokens it’s been assigned. It pulls from a “pool” of these tokens based on what it’s already output and the input context, nothing more.
      
      Most LLMs have settings called “Top P”, “Top K” etc. These limit the pool of candidate tokens that it ends up selecting from, based on the previous token alongside the input tokens. It then randomly chooses one based on temperature settings.
      
      It’s why if you turn these models’ temperature settings really high they output pure nonsense both conceptually and grammatically, because the tenuous thread linking the previous token’s context to the next token has been widened enough that it completely loses any semblance of cohesiveness.
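
A small sketch of the two filters mentioned above, as they are commonly defined: top-k keeps a fixed number of the most probable tokens, while top-p keeps the smallest set whose cumulative probability reaches a threshold. The probabilities and function names are made up for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose total mass reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Hypothetical distribution for the token after "cats have".
probs = {"fur": 0.5, "whiskers": 0.3, "feathers": 0.15, "wheels": 0.05}

print(top_k_filter(probs, 2))    # only "fur" and "whiskers" survive
print(top_p_filter(probs, 0.9))  # "wheels" is cut off
```

A high temperature flattens the distribution before these filters run, so more unlikely tokens make it into the pool, which fits the nonsense-at-high-temperature behaviour described above.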
    - Lvxferre [he/him]@mander.xyz
      link
      fedilink
      English
      arrow-up
      1·
      9 months ago
      Your “ackshyually” is missing the point.
  - CeeBee_Eh@lemmy.world
    link
    fedilink
    English
    arrow-up
    2·
    9 months ago
    
    Both sentences are true. And based on vocabulary of both, the model can output the following sentences:
    
    Cats have feathers.
    
    Birds have fur
    
    This is not how the models are trained or work.
    
    Both are false but the model doesn’t “know” it. All that it knows is that “have” is allowed to go after both “cats” and “birds”, and that both “feathers” and “fur” are allowed to go after “have”.
    
    Demonstrably false. This isn’t how LLMs are trained or built.
    
    Just considering the contextual relationships between word embeddings that are created during training is evidence enough. Those relationships from the multi-vector fields are an emergent property that doesn’t exist in the dataset.
    
    If you want a better understanding of what I just said, take a look at this Computerphile video from four years ago. And this came out before the LLM hype and before ChatGPT 3, which was the big leap in LLMs.
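
For anyone unfamiliar with what "relationships between word embeddings" means in practice, here is the usual vector-arithmetic illustration. The three-dimensional vectors are hand-made for this sketch, so it only shows the mechanics of the lookup; it does not demonstrate that such structure emerges from training.

```python
import math

# Hand-made toy vectors; real embeddings are learned and have many more dimensions.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "cat": [0.2, 0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Find the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [word for word in EMBEDDINGS if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(EMBEDDINGS[word], target))

print(analogy("king", "man", "woman"))
```

In learned embeddings, directions like this are not written in by hand, which is the sense in which the comment calls them emergent.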
- Zos_Kia
  link
  fedilink
  English
  arrow-up
  2·
  9 months ago
  That’s what smaller models do, but it doesn’t yield great performance because there’s only so much stuff available. To get to gpt4 levels you need a lot more data, and to break the next glass ceiling you’ll need even more.
  - KevonLooney@lemm.ee
    link
    fedilink
    English
    arrow-up
    3·
    9 months ago
    Then these models are stupid. Humans don’t start as a blank slate. They have an inherent aptitude for language and communication. These models should start out with basics of language, so they don’t have to learn it from the ground up. That’s the next step. Right now they’re just well read idiots.
    - Zos_Kia
      link
      fedilink
      English
      arrow-up
      3·
      9 months ago
      
      Then these models are stupid
      
      Yup that is kind of the point. They are math functions designed to approximate human tasks.
      
      These models should start out with basics of language, so they don’t have to learn it from the ground up. That’s the next step. Right now they’re just well read idiots.
      
      I’m not sure what you’re pointing at here. How they do it right now, simplified, is you have a small model designed to cut text into tokens (“knowledge of syllables”), which are fed into a larger model which turns tokens into semantic information (“knowledge of language”), which is fed to a ridiculously fat model which “accomplishes the task” (“knowledge of things”).
      
      The first two models are small enough that they can be trained on the kind of data you describe, classic books, movie scripts etc… A couple hundred billion words maybe. But the last one requires orders of magnitude more data, in the trillions.
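
A very rough sketch of the first two stages described above, under heavy assumptions: the vocabulary, the greedy longest-match splitting, and the two-number embedding vectors are all invented here, and production tokenizers and embedding layers are learned and work differently in detail. The third, large stage is left out entirely.

```python
# Hypothetical subword vocabulary and embedding table.
VOCAB = {"<unk>": 0, "cat": 1, "s": 2, "have": 3, "fur": 4}
EMBEDDING_TABLE = [
    [0.0, 0.0],  # <unk>
    [0.9, 0.1],  # cat
    [0.1, 0.2],  # s
    [0.4, 0.5],  # have
    [0.8, 0.3],  # fur
]

def tokenize(text):
    """Stage 1: split each word into the longest known pieces, as token ids."""
    ids = []
    for word in text.lower().split():
        while word:
            for end in range(len(word), 0, -1):
                if word[:end] in VOCAB:
                    ids.append(VOCAB[word[:end]])
                    word = word[end:]
                    break
            else:  # no known piece starts here: emit <unk> and skip a character
                ids.append(VOCAB["<unk>"])
                word = word[1:]
    return ids

def embed(token_ids):
    """Stage 2: look up one vector per token id."""
    return [EMBEDDING_TABLE[i] for i in token_ids]

ids = tokenize("Cats have fur")
print(ids)
print(embed(ids))
```

The vectors from `embed` are what would be handed to the large model in the third stage.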
downpunxx@fedia.io
link
fedilink
arrow-up
9·
edit-2
8 months ago
deleted by creator
SkaveRat@discuss.tchncs.de
link
fedilink
English
arrow-up
8·
9 months ago
People are already comparing older content with Low Background Steel, as it’s uncontaminated
- FaceDeer@fedia.io
  link
  fedilink
  arrow-up
  2·
  edit-2
  9 months ago
  And they’re overlooking that radionuclide contamination of steel actually isn’t much of a problem any more, since the surge in background radionuclides caused by nuclear testing peaked in 1963 and has since gone down almost back to the original background level again.
  
  I guess it’s still a good analogy, though. People bring up Low Background Steel because they think radionuclide contamination is an unsolved problem (despite it having been basically solved), and they bring up “model collapse” because they think it’s an unsolved problem (despite it having been basically solved). It’s like newspaper stories, everyone sees the big scary front page headline but nobody pays attention to the little block of text retracting it on page 8.
superminerJG@lemmy.world
link
fedilink
English
arrow-up
6·
9 months ago
News at 11.
FlashZordon@lemmy.world
link
fedilink
English
arrow-up
5·
9 months ago
The AI art is inbreeding.
TheReturnOfPEB@reddthat.com
link
fedilink
English
arrow-up
4·
9 months ago
certainly at least a downvote to free will
