@[email protected] to

[email protected] English • 6 months ago

AI trained on AI garbage spits out AI garbage.

www.technologyreview.com

445

As junk web pages written by AI proliferate, the models that rely on that data will suffer.

@[email protected]
English
8•6 months ago
It’s not just a predictive text program. That’s been around for decades. That’s a common misconception.

As I understand it, it uses statistics from the whole text to create new text. It would be very rare for it to output “cats have feathers” because that phrase doesn’t ever appear in the training data. The two words “have feathers” never follow “cats”.
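
A rough sketch of the frequency argument above, assuming a tiny made-up corpus (the sentences and function names here are hypothetical, not from the article or any real model). It only counts which word follows each pair of words:

```python
from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration.
CORPUS = [
    "cats have fur",
    "birds have feathers",
    "cats have whiskers",
]

def build_trigram_counts(sentences):
    """Count which word follows each pair of preceding words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for a, b, c in zip(words, words[1:], words[2:]):
            counts[(a, b)][c] += 1
    return counts

def next_word_probs(counts, context):
    """Turn the raw follow-up counts for a context into probabilities."""
    followers = counts.get(context, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in followers.items()}

counts = build_trigram_counts(CORPUS)
probs = next_word_probs(counts, ("cats", "have"))
print(probs)  # "feathers" is absent: it never followed "cats have"
```

A pure count table like this gives “feathers” zero probability after “cats have”. Whether an LLM is more than such a table is what the replies argue about.
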
- skulblaka
  English
  9•6 months ago
  But the fact remains that it doesn’t know what a cat or a feather is. All of this is still based purely on statistical frequency and not at all on actual meanings.
- @[email protected]
  English
  5•
  edit-2
  6 months ago
  and that is exactly how a predictive text algorithm works.
  
  some tokens go in
  
  they are processed by a deterministic, static statistical model, and a set of probabilities (always the same, deterministic, remember?) comes out.
  
  pick the word with the highest probability, add it to your initial string and start over.
  
  if you want variety, add some randomness and don’t just always pick the most probable next token.
  
  Coincidentally, this is exactly how LLMs work. It’s a big Markov chain, but with a novel lossy compression algorithm on its state transition table. The last point is also the reason why, if anyone says they can fix LLM hallucinations, they’re lying.
  - @[email protected]
    English
    2•6 months ago
    
    Coincidentally, this is exactly how llms work
    
    Everyone who says this doesn’t actually understand how LLMs work.
    
    Multivector word embeddings create emergent relationships, and those relationships are new knowledge that doesn’t exist in the training dataset.
    
    Computerphile did a good video on this well before the LLM craze.
    - @[email protected]
      English
      1•
      edit-2
      6 months ago
      1 - A Markov chain only takes previous tokens as input.
      
      2 - It uses a function (in the mathematical sense, so same input results in same output, completely stateless) to generate a set of probabilities for what the next token might be.
      
      3 - The most probable token is picked, else randomness (temperature) is inserted here to choose a different token occasionally.
      
      An LLM’s internals, the part that’s trained, are literally the function used in step 2. You could implement this function in a number of ways. For example, you could build a huge table and consult it, or you could generate it somehow: train a big neural network that takes previous tokens as input and outputs token probabilities, then enumerate its outputs for every possible sequence of inputs, and there’s your table. This would take too much time and space, so we just run the function on demand instead. Exact same result. It can be very smart and notice correlations, but ultimately it defines a (virtual) huge static table. This is a completely deterministic process. A trained NN is still a (huge) mathematical function. So the big network that they spend resources training is basically the function used in step 2.
      
      Step 3 is the cause of hallucinations. It’s the only nondeterministic part, and it’s not part of the LLM itself in any way. No matter how much smarter the neural network gets, the hallucinations are introduced mainly in step 3. So no, they won’t be solving the LLM hallucination problem anytime soon.
- @[email protected]
  English
  3•
  edit-2
  6 months ago
  
  because that phrase doesn’t ever appear in the training data.
  
  Eh, but LLMs abstract. It has seen “<animal> have feathers” and “<animal> have fur” quite a lot of times. The problem isn’t that LLMs can’t reason at all. The problem is that they do employ techniques used in proper reasoning, in particular tracking context throughout the text (self-attention), but lack the techniques necessary for the whole thing, instead relying on confabulation to sound convincing regardless of the BS they spout. Suffices to emulate an Etonian, but that’s not a high standard.
  - FaceDeer
    2•6 months ago
    Workarounds for those sorts of limitations have been developed, though. Chain-of-thought prompting has been around for a while now, and I recall recently seeing an article about a model that had that built right into it; it had been trained to use <thought></thought> tags to enclose invisible chunks of its output that would be hidden from the end user but would be used by the AI to work its way through a problem. So if you asked it whether cats had feathers it might respond “<thought>Feathers only grow on birds and dinosaurs. Cats are mammals.</thought> No, cats don’t have feathers.” And you’d only see the latter bit. It was a pretty neat approach to improving LLM reasoning.
- Lvxferre [he/him]
  English
  1•6 months ago
  Your “ackshyually” is missing the point.
- @[email protected]
  English
  1•
  edit-2
  6 months ago
  This isn’t really accurate either. At the moment of generation, an LLM only has context for the input string and the network of text tokens it’s been assigned. It pulls from a “pool” of these tokens based on what it’s already output and the input context, nothing more.
  
  Most LLMs have sampling settings called “Top P”, “Top K”, etc. These limit the set of candidate tokens it ends up selecting from: Top K keeps a fixed number of the most likely tokens, and Top P keeps the most likely tokens up to a cumulative probability. It then randomly chooses one of them based on temperature settings.
  
  It’s why if you turn these models’ temperature settings really high they output pure nonsense both conceptually and grammatically, because the tenuous thread linking the previous token’s context to the next token has been widened enough that it completely loses any semblance of cohesiveness.
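
The sampling step that several comments in this thread describe (scores in, probabilities out, then a temperature, Top K and Top P choice) can be sketched roughly as below. This is a minimal illustration under stated assumptions: the scores in `LOGITS` and all function names are made up, and real inference libraries differ in the order and details of these filters.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; temperature must be > 0.
    Higher temperature flattens the distribution."""
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(logits, exps)}

def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the shortest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=1.0, rng=random):
    """The only random step: draw one token from the filtered distribution."""
    probs = softmax_with_temperature(logits, temperature)
    candidates = filter_top_k_top_p(probs, top_k, top_p)
    tokens = list(candidates)
    weights = [candidates[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for the token after "cats have".
LOGITS = {"fur": 4.0, "whiskers": 3.0, "claws": 2.5, "feathers": -2.0}

print(sample_next_token(LOGITS, top_k=1))          # greedy: most likely token
print(sample_next_token(LOGITS, temperature=5.0))  # flatter, more varied
```

With `top_k=1` the choice is deterministic. Raising the temperature moves probability toward unlikely tokens such as “feathers”, which fits the observation above that very high temperatures produce nonsense.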
