ylai@lemmy.ml to

Technology@lemmy.world · English · 1 year ago

In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From

514

Wondering what data OpenAI used to train its buzzy new text-to-video AI? OpenAI CTO Mira Murati seems to be wondering, too.

redditReallySucks@lemmy.dbzer0.com
link
fedilink
English
arrow-up
181·
1 year ago

I hope this is gonna become a new meme template
- driving_crooner@lemmy.eco.br
  link
  fedilink
  English
  arrow-up
  92·
  1 year ago
  She looks like she just talked to the waitress about a fake rule in eating nachos and got caught up by her date.
  - HACKthePRISONS@kolektiva.social
    link
    fedilink
    arrow-up
    80·
    1 year ago
    this is incomprehensible to me. can you try it with two or three sentences?
    - driving_crooner@lemmy.eco.br
      link
      fedilink
      English
      arrow-up
      81·
      1 year ago
      Her date was eating all the fully loaded nachos, so she went up and asked the waitress to make up a rule about how one person cannot eat all the nachos with meat and cheese. But her date knew that rule was bullshit and called her out on it. She’s trying to look confused and sad because they’re going to be too soon for the movie.
      - Uninvited Guest@lemmy.ca
        link
        fedilink
        English
        arrow-up
        57·
        1 year ago
        What?! What the hell are you talking about?!
      - RatsOffToYa@lemmy.world
        link
        fedilink
        English
        arrow-up
        53·
        1 year ago
        Not sure what’s funnier: your first comment or the comment explaining it to someone who’s obviously not part of a turbo team
        
        fjordbasa@lemmy.world
        link
        fedilink
        English
        arrow-up
        24·
        1 year ago
        Removed by mod
        
        RatsOffToYa@lemmy.world
        link
        fedilink
        English
        arrow-up
        9·
        1 year ago
        Look until you’re part of the turbo team… WALK SLOWLY
        
        fjordbasa@lemmy.world
        link
        fedilink
        English
        arrow-up
        6·
        1 year ago
        Removed by mod
      - THCDenton@lemmy.world
        link
        fedilink
        English
        arrow-up
        33·
        1 year ago
        
        Plopp@lemmy.world
        link
        fedilink
        English
        arrow-up
        13·
        1 year ago
        Lmao that’s wonderful, scrolling down from those weird ass comments only to be greeted by my own exact facial expression.
        
        Buttons@programming.dev
        link
        fedilink
        English
        arrow-up
        9·
        1 year ago
        “No… Hell no… Man, I believe you’d get your ass kicked if you said something like that…”
      - HACKthePRISONS@kolektiva.social
        link
        fedilink
        arrow-up
        24·
        1 year ago
        thank you. it must be a reference to something, but i don’t watch tv any more.
        
        datavoid@lemmy.ml
        link
        fedilink
        English
        arrow-up
        23·
        edit-2
        1 year ago
        I think you should leave…
        
        (is what you would search to find this)
        
        JWBananas@lemmy.world
        link
        fedilink
        English
        arrow-up
        10·
        1 year ago
        I’m sorry, what does this have to do with Coffin Flops. Does this mean it isn’t getting cancelled?
        
        swab148@startrek.website
        link
        fedilink
        English
        arrow-up
        8·
        1 year ago
        I DIDN’T RIG SHIT!
      - squid_slime@lemmy.world
        link
        fedilink
        English
        arrow-up
        14·
        1 year ago
        Chatgpt, you okay? 😅
- whoisearth@lemmy.ca
  link
  fedilink
  English
  arrow-up
  2·
  1 year ago
  Coffeezilla had a video in his void where he plays this back a few times. It’s hilarious seeing the guilt without stating it.
Fisk400@feddit.nu
link
fedilink
English
arrow-up
134·
1 year ago
They know what they fed the thing. Not backing up their own training data would be insane. They are not insane, just thieves
- VirtualOdour@sh.itjust.works
  link
  fedilink
  English
  arrow-up
  5·
  edit-2
  1 year ago
  That’s really not how it works though; it’s a web crawler, they’re not going to download the whole internet
  
  And a reason they don’t is that it would actually potentially be copyright infringement in some cases, whereas what they do legally isn’t (no matter how much people wish the law was set based on their emotions)
_haha_oh_wow_@sh.itjust.works
link
fedilink
English
arrow-up
100·
1 year ago
Gee, seems like something a CTO would know. I’m sure she’s not just lying, right?
- Bogasse@lemmy.ml
  link
  fedilink
  English
  arrow-up
  8·
  1 year ago
  And on the other hand it is a very obvious question to expect. If you have something to hide, how in the world are you not prepared for this question!? 🤡
- VirtualOdour@sh.itjust.works
  link
  fedilink
  English
  arrow-up
  6·
  1 year ago
  It’s a question that is based on a purposeful misunderstanding of the technology; it’s like expecting a beekeeper to know each bee’s name and bedtime. Really, it’s like asking a bricklayer where each brick in the pile came from. He can tell you the batch, but he’s not going to know that this brick came from the fourth row of the sixth pallet, two from the left. There is no reason to remember that; it’s not important to anyone.
  
  They don’t log it because it would take huge amounts of resources and gain nothing.
  - zaphod@lemmy.ca
    link
    fedilink
    English
    arrow-up
    6·
    edit-2
    1 year ago
    What?
    
    Compiling quality datasets is enormously challenging and labour intensive. OpenAI absolutely knows the provenance of the data they train on as it’s part of their secret sauce. And there’s no damn way their CTO won’t have a broad strokes understanding of the origins of those datasets.
  - Guntrigger@feddit.ch
    link
    fedilink
    English
    arrow-up
    1·
    1 year ago
    [Citation needed]
- Hotzilla@sopuli.xyz
  link
  fedilink
  English
  arrow-up
  2·
  1 year ago
  To be fair, these datasets are one of their biggest competitive edges. But telling the interviewer “I cannot tell you” is not very nice, so you can take the American politician approach and say “I don’t know/remember”, which you can never be held accountable for.
phoneymouse@lemmy.world
link
fedilink
English
arrow-up
91·
1 year ago
There is no way in hell it isn’t copyrighted material.
- abhibeckert@lemmy.world
  link
  fedilink
  English
  arrow-up
  65·
  edit-2
  1 year ago
  Every video ever created is copyrighted.
  
  The question is — do they need a license? Time will tell. This is obviously going to court.
  - Kazumara@feddit.de
    link
    fedilink
    English
    arrow-up
    39·
    1 year ago
    Don’t downvote this guy. He’s mostly right. Creative works have copyright protections from the moment they are created. The relevant question is indeed whether they have the relevant permissions for their use, not whether it had protections in the first place.
    
    Maybe some surveillance camera footage is not sufficiently creative to get protections, but that’s hardly going to be good for machine reinforcement learning.
  - iknowitwheniseeit
    link
    fedilink
    English
    arrow-up
    15·
    1 year ago
    There are definitely non copyrighted videos! Both old videos (all still black and white I think) and also things released into the public domain by copyright holders.
    
    But for sure that’s a very small subset of videos.
Buttons@programming.dev
link
fedilink
English
arrow-up
69·
1 year ago
If I were the reporter my next question would be:

“Do you feel that not knowing the most basic things about your product reflects on your competence as CTO?”
- ForgotAboutDre@lemmy.world
  link
  fedilink
  English
  arrow-up
  32·
  1 year ago
  Hilarious, but if the reporter asked this they would find it harder to get invites to events, which is a problem for journalists. Unless you’re very well regarded for your journalism, you can’t push powerful people without risking your career.
  - Aniki 🌱🌿@lemm.ee
    link
    fedilink
    English
    arrow-up
    10·
    1 year ago
    boofuckingwoo. Reporters are not supposed to be friends with the people they are writing about.
    - tb_@lemmy.world
      link
      fedilink
      English
      arrow-up
      18·
      1 year ago
      True, but if those same people they’re not supposed to be friends with are the ones inviting them to those events/granting them early access…
      
      In other words: the system is rigged.
      - Aniki 🌱🌿@lemm.ee
        link
        fedilink
        English
        arrow-up
        3·
        1 year ago
        Again - boofuckinghooo. Let the fuckers have no friends in the media. The media owners make journalists spineless advertisement sellers. I have very little respect for the profession at this point.
        
        tb_@lemmy.world
        link
        fedilink
        English
        arrow-up
        6·
        1 year ago
        What a delightful and helpful attitude.
        
        Deceptichum@sh.itjust.works
        link
        fedilink
        English
        arrow-up
        2·
        edit-2
        1 year ago
        booduckinghoo.
        
        We’re sick and tired of this shit, it will never change if people make excuses for it.
        
        MalachaiConstant@lemmy.world
        link
        fedilink
        English
        arrow-up
        3·
        1 year ago
        You’re missing the point that they need those relationships to gain access to sources. You literally cannot force people to talk to you
      - nifty@lemmy.world
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        The system is rigged.
        
        You cannot give the same criticism to a rich person vs. a poor person even if their incompetence is the same. I am not sure what’s the fix, other than the common refrain of “there should be no millionaires/billionaires”. How does society heal itself if you cannot hold people accountable?
  - Abnorc@lemm.ee
    link
    fedilink
    English
    arrow-up
    8·
    1 year ago
    That, and the reporter is there to get information, not mess with and judge people. Asking that sort of question is really just an attack. We can leave it to commentators and ourselves to judge people.
    - Aniki 🌱🌿@lemm.ee
      link
      fedilink
      English
      arrow-up
      8·
      edit-2
      1 year ago
      this is limp dick energy. If asking questions is an attack then you’re probably a piece of shit doing bad things.
      - tastysnacks@programming.dev
        link
        fedilink
        English
        arrow-up
        8·
        1 year ago
        no it isn’t. what answer to that question has any value to me as a reader?
      - Abnorc@lemm.ee
        link
        fedilink
        English
        arrow-up
        4·
        edit-2
        1 year ago
        Think about the answer you would actually get. They would dismiss the question or give some sort of nonsense answer. It’s a rhetorical question, and the only thing that it serves to do is criticize the person being asked. That’s not what reporters are there to do. If the answer would actually give some useful information to the reader, then it’s worth asking.
- RatBin@lemmy.world
  link
  fedilink
  English
  arrow-up
  4·
  1 year ago
  Also about this line:
  
  Others, meanwhile, jumped to Murati’s defense, arguing that if you’ve ever published anything to the internet, you should be perfectly fine with AI companies gobbling it up.
  
  No, I am not fine. When I wrote that stuff and that research in old phpBB forums, I did not do it with the knowledge of a future machine learning system eating it up without my consent. I never gave consent for that despite it being publicly available, because this would be a use that didn’t exist back then. Many other things are also publicly available, but some are copyrighted, on the same basis: you can publish and share content under conditions that are defined by the creator of the content. What’s that, when I use zlibrary I am evil for pirating content, but OpenAI can do it just fine due to their huge wallets? Guess what, this will eventually create a crisis of trust, a tragedy of the commons if you will, when enough AI-generated content makes up the bulk of your future Internet search! Do we even want this?
CosmoNova@lemmy.world
link
fedilink
English
arrow-up
58·
edit-2
1 year ago
I almost want to believe they legitimately do not know nor care they’re committing a gigantic data and labour heist, but the truth is they know exactly what they’re doing and they rub it under our noses.
- laxe@lemmy.world
  link
  fedilink
  English
  arrow-up
  20·
  1 year ago
  Of course they know what they’re doing. Everybody knows this, how could they be the only ones that don’t?
- Bogasse@lemmy.ml
  link
  fedilink
  English
  arrow-up
  17·
  1 year ago
  Yeah, the fact that AI progress just relies on “we will make so much money that no lawsuit will consequently alter our growth” is really infuriating. The fact that the general audience apparently doesn’t care is even more infuriating.
- A_Very_Big_Fan@lemmy.world
  link
  fedilink
  English
  arrow-up
  3·
  1 year ago
  Look guys! I’m stealing from Tolkien!
  - Guntrigger@feddit.ch
    link
    fedilink
    English
    arrow-up
    2·
    1 year ago
    I don’t think anyone’s going to pay for your version of ChatGPT
  - toddestan@lemmy.world
    link
    fedilink
    English
    arrow-up
    2·
    1 year ago
    I’d say not really, Tolkien was a writer, not an artist.
    
    What you are doing is violating the trademark Middle-Earth Enterprises has on the Gandalf character.
    - A_Very_Big_Fan@lemmy.world
      link
      fedilink
      English
      arrow-up
      2·
      1 year ago
      The point was that I absorbed that information to inform my “art”, since we’re equating training with stealing.
      
      I guess this would have been a better example lol. It’s clearly not Gandalf, but I wouldn’t have ever come up with it if I hadn’t seen that scene
stackPeek@lemmy.world
link
fedilink
English
arrow-up
49·
1 year ago
This tells you so much about what kind of company OpenAI is
- webghost0101@sopuli.xyz
  link
  fedilink
  English
  arrow-up
  21·
  1 year ago
  An Intelligence piracy company?
- jaemo@sh.itjust.works
  link
  fedilink
  English
  arrow-up
  8·
  1 year ago
  It also tells us how hypocritical we all are since absolutely every single one of us would make the same decisions they have if we were in their shoes. This shit was one bajillion percent inevitable; we are in a river and have been since we tilled soil with a plough in the Nile valley millennia ago.
  - adrian783@lemmy.world
    link
    fedilink
    English
    arrow-up
    11·
    1 year ago
    most of us would never be in their shoes because most of us are not sociopathic techbros
    - jaemo@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      2·
      1 year ago
      I guess a lot of us didn’t learn from history, or even go see ‘Oppenheimer’…
  - whoisearth@lemmy.ca
    link
    fedilink
    English
    arrow-up
    3·
    1 year ago
    Speak for yourself. Were I in their shoes no I would not. But then again my company wouldn’t be as big as theirs for that reason.
- wabafee@lemmy.world
  link
  fedilink
  English
  arrow-up
  7·
  1 year ago
  Half open or half closed?
anon_8675309@lemmy.world
link
fedilink
English
arrow-up
45·
1 year ago
CTO should definitely know this.
- ItsMeSpez@lemmy.world
  link
  fedilink
  English
  arrow-up
  47·
  1 year ago
  They do know this. They’re avoiding any legal exposure by being vague.
- blazeknave@lemmy.world
  link
  fedilink
  English
  arrow-up
  6·
  1 year ago
  I feel like at their scale, if there’s going to be a figure head marketable CTO, it’s going to be this company. If not, you’re right, and she’s lying lol
- turkishdelight@lemmy.ml
  link
  fedilink
  English
  arrow-up
  2·
  1 year ago
  Of course she knows it. She just doesn’t want to get sued.
BringMeTheDiscoKing@lemmy.ca
link
fedilink
English
arrow-up
44·
1 year ago
Did they intentionally choose a picture where she looks like she’s morphing into Elon?
- rab@lemmy.ca
  link
  fedilink
  English
  arrow-up
  12·
  1 year ago
  I was thinking Mads Mikkelsen
  - billwashere@lemmy.world
    link
    fedilink
    English
    arrow-up
    1·
    1 year ago
    Well after just finishing Death Stranding, I can’t unsee that.
- BoscoBear@lemmy.sdf.org
  link
  fedilink
  English
  arrow-up
  6·
  1 year ago
  I suspect so. It is a very slanted article.
andrew_bidlaw@sh.itjust.works
link
fedilink
English
arrow-up
43·
1 year ago
Funny she didn’t talk it out with lawyers before that. That’s a bad way to answer that.
- driving_crooner@lemmy.eco.br
  link
  fedilink
  English
  arrow-up
  35·
  1 year ago
  Or she talked and the lawyers told her to pretend ignorance.
  - QuaternionsRock@lemmy.world
    link
    fedilink
    English
    arrow-up
    9·
    1 year ago
    It probably means that they don’t scrape and preprocess training data in house. She knows they get it from a garden variety of underpaid contractors, but she doesn’t know the specific data sources beyond the stipulations of the contract (“publicly available or licensed”), and she probably doesn’t even know that for certain.
    - driving_crooner@lemmy.eco.br
      link
      fedilink
      English
      arrow-up
      3·
      1 year ago
      “Publicly available” can mean a lot of things. Is YouTube publicly available? Is public broadcasting publicly available?
  - andrew_bidlaw@sh.itjust.works
    link
    fedilink
    English
    arrow-up
    5·
    1 year ago
    Maybe, but it sounds very weak.
    - anlumo@lemmy.world
      link
      fedilink
      English
      arrow-up
      10·
      1 year ago
      Lawyers aren’t PR people.
      - andrew_bidlaw@sh.itjust.works
        link
        fedilink
        English
        arrow-up
        2·
        1 year ago
        She didn’t even address them though.
TheObviousSolution@lemm.ee
link
fedilink
English
arrow-up
26·
1 year ago
Then wipe it out and start again once you have where your data is coming from sorted out. Are we acting like you haven’t built datacenters packed full of NVIDIA processors just for this sort of retraining? They are choosing to build AI without proper sourcing; that’s not an AI limitation.
IvanOverdrive@lemm.ee
link
fedilink
English
arrow-up
22·
1 year ago
REPORTER: Where does your data come from?

CTO: Bitch, are you trying to get me sued?
من البحر إلى النهر@lemmy.world
link
fedilink
English
arrow-up
19·
1 year ago
So plagiarism?
- BoscoBear@lemmy.sdf.org
  link
  fedilink
  English
  arrow-up
  16·
  1 year ago
  I don’t think so. They aren’t reproducing the content.
  
  I think the equivalent is you reading this article, then answering questions about it.
  - A_Very_Big_Fan@lemmy.world
    link
    fedilink
    English
    arrow-up
    26·
    1 year ago
    Idk why this is such an unpopular opinion. I don’t need permission from an author to talk about their book, or permission from a singer to parody their song. I’ve never heard any good arguments for why it’s a crime to automate these things.
    
    I mean hell, we have an LLM bot in this comment section that took the article and spat 27% of it back out verbatim, yet nobody is pissing and moaning about it “stealing” the article.
    - MostlyGibberish@lemm.ee
      link
      fedilink
      English
      arrow-up
      5·
      1 year ago
      Because people are afraid of things they don’t understand. AI is a very new and very powerful technology, so people are going to see what they want to see from it. Of course, it doesn’t help that a lot of people see “a shit load of cash” from it, so companies want to shove it into anything and everything.
      
      AI models are rapidly becoming more advanced, and some of the new models are showing sparks of metacognition. Calling that “plagiarism” is being willfully ignorant of its capabilities, and it’s just not productive to the conversation.
      - A_Very_Big_Fan@lemmy.world
        link
        fedilink
        English
        arrow-up
        6·
        1 year ago
        True
        
        Of course, it doesn’t help that a lot of people see “a shit load of cash” from it, so companies want to shove it into anything and everything.
        
        And on a similar note to this, I think a lot of what it is is that OpenAI is profiting off of it and went closed-source. Lemmy being a largely anti-capitalist and pro-open-source group of communities, it’s natural to have a negative gut reaction to what’s going on, but not a single person here, nor any of my friends that accuse them of “stealing” can tell me what is being stolen, or how it’s different from me looking at art and then making my own.
        
        Like, I get that the technology is gonna be annoying and even dangerous sometimes, but maybe let’s criticize it for that instead of shit that it’s not doing.
        
        MostlyGibberish@lemm.ee
        link
        fedilink
        English
        arrow-up
        5·
        1 year ago
        I can definitely see why OpenAI is controversial. I don’t think you can argue that they didn’t do an immediate heel turn on their mission statement once they realized how much money they could make. But they’re not the only player in town. There are many open source models out there that can be run by anyone on varying levels of hardware.
        
        As far as “stealing,” I feel like people imagine GPT sitting on top of this massive collection of data and acting like a glorified search engine, just sifting through that data and handing you stuff it found that sounds like what you want, which isn’t the case. The real process is, intentionally, similar to how humans learn things. So, if you ask it for something that it’s seen before, especially if it’s seen it many times, it’s going to know what you’re talking about, even if it doesn’t have access to the real thing. That, combined with the fact that the models are trained to be as helpful as they possibly can be, means that if you tell it to plagiarize something, intentionally or not, it probably will. But, if we condemned any tool that’s capable of plagiarism without acknowledging that they’re also helpful in the creation process, we’d still be living in caves drawing stick figures on the walls.
        
        Mnemnosyne@sh.itjust.works
        link
        fedilink
        English
        arrow-up
        3·
        1 year ago
        One problem is people see those whose work may no longer be needed or as profitable, and…they rush to defend it, even if those same people claim to be opposed to capitalism.
        
        They need to go ‘yes, this will replace many artists and writers…and that’s a good thing because it gives everyone access to being able to create bespoke art for themselves.’ but at the same time realize that while this is a good thing, it also means the need for societal shift to support people outside of capitalism is needed.
        
        MostlyGibberish@lemm.ee
        link
        fedilink
        English
        arrow-up
        2·
        1 year ago
        
        it also means the need for societal shift to support people outside of capitalism is needed.
        
        Exactly. This is why I think arguing about whether AI is stealing content from human artists isn’t productive. There’s no logical argument you can really make that a theft is happening. It’s a foregone conclusion.
        
        Instead, we need to start thinking about what a world looks like where a large portion of commercially viable art doesn’t require a human to make it. Or, for that matter, what does a world look like where most jobs don’t require a human to do them? There are so many more pressing and more interesting conversations we could be having about AI, but instead we keep circling around this fundamental misunderstanding of what the technology is.
    - Hawk@lemmy.dbzer0.com
      link
      fedilink
      English
      arrow-up
      5·
      1 year ago
      What you’re giving as examples are legitimate uses for the data.
      
      If I write and sell a new book that’s just Harry Potter with names and terms switched around, I’ll definitely get in trouble.
      
      The problem is that the data CAN be used for stuff that violates copyright. And because of the nature of AI, it’s not even always clear to the user.
      
      AI can basically throw out a Harry Potter clone without you knowing because it’s trained on that data, and that’s a huge problem.
      - A_Very_Big_Fan@lemmy.world
        link
        fedilink
        English
        arrow-up
        3·
        edit-2
        1 year ago
        Out of curiosity I asked it to make a Harry Potter part 8 fan fiction, and surprisingly it did. But I really don’t think that’s problematic. There’s already an insane amount of fan fiction out there without the names swapped that I can read, and that’s all fair use.
        
        I mean hell, there are people who actually get paid to draw fictional characters in sexual situations that I’m willing to bet very few creators would prefer to exist lol. But as long as they don’t overstep the bounds of fair use, like trying to pass it off as an official work or submit it for publication, then there’s no copyright violation.
        
        The important part is that it won’t just give me the actual book (but funnily enough, it tried lol). If I meet a guy with a photographic memory and he reads my book, that’s not him stealing it or violating my copyright. But if he reproduces and distributes it, then we call it stealing or a copyright violation.
      - A_Very_Big_Fan@lemmy.world
        link
        fedilink
        English
        arrow-up
        2·
        1 year ago
        I just realized I misread what you said, so that wasn’t entirely relevant to what you said but I think it still stands so ig I won’t delete it.
        
        But I asked both GPT3.5 and GPT4 to give me Harry Potter with the names and words changed, and they can’t do that either. I can’t speak for all models, but I can at least say the two owned by the people this thread was about won’t do that.
  - ...m...@ttrpg.network
    link
    fedilink
    English
    arrow-up
    8·
    edit-2
    1 year ago
    …with the prevalence of clickbaity bottom-feeder news sites out there, i’ve learned to avoid TFAs and await user summaries instead…
    
    (clicks through)
    
    …yep, ~~seven~~ nine ads plus another pop-over, about 15% of window real estate dedicated to the actual story…
    - neptune@dmv.social
      link
      fedilink
      English
      arrow-up
      3·
      1 year ago
      The issue is that the LLMs do often just verbatim spit out things they plagiarized from other sources. The deeper issue is that even if/when they stop that from happening, the technology is clearly going to make most people agree our current copyright laws are insufficient for the times.
      - A_Very_Big_Fan@lemmy.world
        link
        fedilink
        English
        arrow-up
        2·
        1 year ago
        The model in question, plus all of the others I’ve tried, will not give you copyrighted material
        
        neptune@dmv.social
        link
        fedilink
        English
        arrow-up
        3·
        1 year ago
        That’s one example, plus I’m talking generally why this is an important question for a CEO to answer and why people think generally LLMs may infringe on copyright, be bad for creative people
        
        A_Very_Big_Fan@lemmy.world
        link
        fedilink
        English
        arrow-up
        2·
        edit-2
        1 year ago
        
        I’m talking generally why this is an important question for a CEO to answer …
        
        Right, which your only evidence for is “LLMs do often just verbatim spit out things they plagiarized from other sources” and that they aren’t trying to prevent this from happening.
        
        Which is demonstrably false, and I’ll demonstrate it with as many screenshots/examples you want. You’re just wrong about that (at least about GPT). You can also demonstrate it yourself, and if you can prove me wrong I’ll eat my shoe.
        
        neptune@dmv.social
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        https://archive.is/nrAjc
        
        Yep here you go. It’s currently a very famous lawsuit.
  - Linkerbaan@lemmy.world
    link
    fedilink
    English
    arrow-up
    4·
    1 year ago
    Actually neural networks verbatim reproduce this kind of content when you ask the right question such as “finish this book” and the creator doesn’t censor it out well.
    
    It uses an encoded version of the source material to create “new” material.
    - BoscoBear@lemmy.sdf.org
      link
      fedilink
      English
      arrow-up
      3·
      1 year ago
      Sure, if that is what the network has been trained to do, just like a librarian will if that is how they have been trained.
      - Linkerbaan@lemmy.world
        link
        fedilink
        English
        arrow-up
        1·
        edit-2
        1 year ago
        Actually it’s the opposite, you need to train a network not to reveal its training data.
        
        “Using only $200 USD worth of queries to ChatGPT (gpt-3.5-turbo), we are able to extract over 10,000 unique verbatim memorized training examples,” the researchers wrote in their paper, which was published online to the arXiv preprint server on Tuesday. “Our extrapolation to larger budgets (see below) suggests that dedicated adversaries could extract far more data.”
        
        The memorized data extracted by the researchers included academic papers and boilerplate text from websites, but also personal information from dozens of real individuals. “In total, 16.9% of generations we tested contained memorized PII [Personally Identifying Information], and 85.8% of generations that contained potential PII were actual PII.” The researchers confirmed the information is authentic by compiling their own dataset of text pulled from the internet.
        
        BoscoBear@lemmy.sdf.org
        link
        fedilink
        English
        arrow-up
        3·
        1 year ago
        Interesting article. It seems to be about a bug, not a designed behavior. It also says it exposes random excerpts from books and other training data.
        
        Linkerbaan@lemmy.world
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        It’s not designed to do that because they don’t want to reveal the training data. But factually all neural networks are a combination of their training data encoded into neurons.
        
        When given the right prompt (or image generation question) they will exactly replicate it, because that’s how they have been trained in the first place: replicating their source images with as few neurons as possible, and tweaking them when it’s not correct.
        
        BoscoBear@lemmy.sdf.org
        link
        fedilink
        English
        arrow-up
        4·
        1 year ago
        That is a little like saying every photograph is a copy of the thing. That is just factually incorrect. I have many three layer networks that are not the thing they were trained on. As a compression method they can be very lossy and in fact that is often the point.
Politically Incorrect@lemmy.world
link
fedilink
English
arrow-up
18·
1 year ago
Watching a video or reading an article isn’t copyright infringement when a human does it, so why is it infringement when an “AI” does it? I believe the copyright infringement is committed through the prompt, so by the user, not the tool.
- echo64@lemmy.world
  link
  fedilink
  English
  arrow-up
  34·
  1 year ago
  If you read an article, then copy parts of that article into a new article, that’s copyright infringement. Same with AIs.
  - anlumo@lemmy.world
    link
    fedilink
    English
    arrow-up
    6·
    1 year ago
    Depends on how much is copied; if it’s a small amount, it’s fair use.
    - echo64@lemmy.world
      link
      fedilink
      English
      arrow-up
      15·
      1 year ago
      Fair use depends on a lot, and just being a small amount doesn’t factor in. It’s the actual use. Small amounts just often fly under the nose of legal teams.
    - FireTower@lemmy.world
      link
      fedilink
      English
      arrow-up
      8·
      1 year ago
      Fair use is a four-factor test. The amount used is a factor, but a low amount being used doesn’t strictly mean something is fair use. You could use a single frame of a movie and have it not qualify as fair use.
- Drewelite
  link
  fedilink
  English
  arrow-up
  20·
  1 year ago
  This is what people fundamentally don’t understand about intelligence, artificial or otherwise. People feel like their intelligence is 100% “theirs”. While I certainly would advocate that a person owns their intelligence, It didn’t spawn from nothing.
  
  You’re standing on the shoulders of everyone that came before you. You take a prehistoric man or an alien that hasn’t had any of the same experiences you’ve had, they won’t be able to function in this world. It’s not because they are any dumber than you. It’s because you absorbed the hive mind of the society you live in. Everyone’s racing to slap their brand on stuff to copyright it to get ahead and carve out their space.
  
  “No you can’t tell that story, It’s mine.” “That art is so derivative.”
  
  But copyright was only meant to protect something for a short period in order to monetize it; to adapt the value of knowledge for our capital market. Our world can’t grow if all knowledge is owned forever and isn’t able to be used when even THINKING about new ideas.
  
  ANY VERSION OF INTELLIGENCE YOU WOULD WANT TO INTERACT WITH MUST CONSUME OUR KNOWLEDGE AND PRODUCE TRANSFORMATIONS OF IT.
  
  That’s all you do.
  
  Imagine how useless someone would be who’d never interacted with anything copyrighted, patented, or trademarked.
  - raspberriesareyummy@lemmy.world
    link
    fedilink
    English
    arrow-up
    7·
    edit-2
    1 year ago
    That’s not a very agreeable take. Just get rid of patents and copyrights altogether and your point dissolves itself into nothing. The core difference being derivative works by humans can respect the right to privacy of original creators.
    
    Deep learning bullshit software, however, will just regurgitate creators’ content, sometimes unrecognizably, but sometimes it outright steals their likeness or individual style to create content that may be associated with the original creators.
    
    What you are in effect doing is likening learning from the ideas of others to a deep learning “AI” using images for creating revenge porn, to give a drastic example.
    - Drewelite
      link
      fedilink
      English
      arrow-up
      4·
      edit-2
      1 year ago
      Yes. Your last sentence is my point exactly. LLMs haven’t replicated everything about the human brain. But the hype is here because it cracks one of our brain’s key features: how it learns. Your brain isn’t magic. It just records training data until it has enough to mash it together into different things.
      
      A child doesn’t respect copyright; they’ll draw a picture of Mario. You probably would too if I asked you to. Respecting copyright is something we learn to do in specific situations. This is called “coming up with an original idea”. But that’s bullshit. There are no original ideas.
      
      If you come up with a product that’s a cold brew cup that refrigerates its contents, I’d say that’s a very original idea. But you didn’t come up with refrigeration, you didn’t come up with cups, or cold brew, or the idea of putting technology in a cup, or the concept of a product you sell to people. Name one thing about this idea that you didn’t learn somewhere else? You can’t. Because that’s not how people work. A very real part of business, that you will learn as you put your new cup to market, is skirting around copyright. Somebody out there with a heated cup might come after you for example.
      
      This is a difficult thing to learn the precise line on. Mostly because it can’t work as a concrete rule. AI still has to be used, tested, and developed to learn the nuances here. And it will. But what baffles me is how my example above outlines how every process of invention has worked since the beginning of humanity. But if an LLM does it, people say, “That’s not a real idea. It just took a bunch of stuff it’s learned and mashed it together.” But I hear, “My brain is 🪄magic✨ I’m special.”
  - rottingleaf@lemmy.zip
    link
    fedilink
    English
    arrow-up
    3·
    1 year ago
    Yes, so how come all these arguments were not popular before the current hype about text generators?
    
    Have some integrity.
    - dezmd@lemmy.world
      link
      fedilink
      English
      arrow-up
      1·
      1 year ago
      They absolutely were, the entire time. You just didn’t have interest in hearing about it and weren’t engaged on it.
      
      Learn what integrity means if you want to use it as a snarky one liner.
      
      Have some common sense.
      - rottingleaf@lemmy.zip
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        
        They absolutely were, the entire time. You just didn’t have interest in hearing about it and weren’t engaged on it.
        
        Why express your opinion on subjects where it’s not worth anything?
        
        You are saying these mutated cryptobros cared about copyright and patent laws being obsolete and harmful before “AI”?
        
        Learn what integrity means if you want to use it as a snarky one liner.
        
        I know what every word I use means
- Uninvited Guest@lemmy.ca
  link
  fedilink
  English
  arrow-up
  7·
  edit-2
  1 year ago
  When a school professor “prompts” you to write an essay and you, the “tool”, go consume copyrighted material and plagiarize it in the production of your essay, is the infringement made by the professor?
  - Politically Incorrect@lemmy.world
    link
    fedilink
    English
    arrow-up
    12·
    1 year ago
    If you quote the sources and write it in your own words, I believe it isn’t. AFAIK “AI” already does that.
    - ominouslemon@lemm.ee
      link
      fedilink
      English
      arrow-up
      15·
      1 year ago
      Copilot lists its sources. The problem is half of them are completely made up and if you click on the links they take you to the wrong pages
    - Uninvited Guest@lemmy.ca
      link
      fedilink
      English
      arrow-up
      12·
      1 year ago
      It definitely does not cite sources and use its own words in all cases - especially in visual media generation.
      
      And in the proposed scenario I did write that the student plagiarizes the copyrighted material.
      - Politically Incorrect@lemmy.world
        link
        fedilink
        English
        arrow-up
        8·
        edit-2
        1 year ago
        If you read a book or watch a movie and get inspired by it to create something new and different, is it plagiarism and copyright infringement?
        
        If that were the case, the majority of stuff nowadays would be plagiarism and copyright infringement. I mean, generally people get inspired by someone or something.
        
        buffaloseven@fedia.io
        link
        fedilink
        arrow-up
        10·
        1 year ago
        There’s a long history of this and you might find some helpful information in looking at “transformative use” of copyrighted materials. Google Books is a famous case where the technology company won the lawsuit.
        
        The real problem is that LLMs constantly spit out copyrighted material verbatim. That’s not transformative. And it’s a near-impossible problem to solve while maintaining the utility. Because these things aren’t actually AI, they’re just monstrous statistical correlation databases generated from an enormous data set.
        
        Much of the utility from them will become targeted applications where the training comes from public/owned datasets. I don’t think the copyright case is going to end well for these companies…or at least they’re going to have to gradually chisel away parts of their training data, which will have an outsized impact as more and more AI generated material finds its way into the training data sets.
        
        stephen01king@lemmy.zip
        link
        fedilink
        English
        arrow-up
        4·
        1 year ago
        How constantly does it spit out copyrighted material? Is there data on that?
        
        buffaloseven@fedia.io
        link
        fedilink
        arrow-up
        2·
        1 year ago
        There’s more and more research starting to happen on it, but I’ve seen anywhere from 20% to 60% of responses. Here’s a recent study where they explicitly try to coerce LLMs to break copyright: https://www.patronus.ai/blog/introducing-copyright-catcher
        
        I don’t have the time to grab them right now, but in many of the lawsuits brought forward against companies developing LLMs, their openings contain some statistics gathered on how frequently they infringed by returning copyrighted material.
        
        potustheplant@feddit.nl
        link
        fedilink
        English
        arrow-up
        2·
        edit-2
        1 year ago
        You do realize that AI is just a marketing term, right? None of these models learn, have intelligence or create truly original work. As a matter of fact, if people don’t continue to create original content, these models would stagnate or enter a feedback loop that would poison themselves with their own erroneous responses.
        
        AIs don’t think. They copy with extra steps.
        
        Politically Incorrect@lemmy.world
        link
        fedilink
        English
        arrow-up
        2·
        edit-2
        1 year ago
        Removed by mod
        
        potustheplant@feddit.nl
        link
        fedilink
        English
        arrow-up
        4·
        1 year ago
        Except that the information it gives you is often objectively incorrect and it makes up sources (this happened to me a lot of times). And no, it can’t do what a human can. It doesn’t interpret the information it gets and it can’t reach new conclusions based on what it “knows”.
        
        I honestly don’t know how you can even begin to compare an LLM to the human brain.
      - Tja@programming.dev
        link
        fedilink
        English
        arrow-up
        7·
        1 year ago
        So your question is “is plagiarism plagiarism”?
        
        Uninvited Guest@lemmy.ca
        link
        fedilink
        English
        arrow-up
        3·
        1 year ago
        No, that is not the question nor a reasonable interpretation of it.
- topinambour_rex@lemmy.world
  link
  fedilink
  English
  arrow-up
  7·
  1 year ago
  What is this human going to do with this reading? Are they going to produce something using part of this book or this article?
  
  If yes, that’s copyright infringement.
- Prandom_returns@lemm.ee
  link
  fedilink
  English
  arrow-up
  2·
  edit-2
  1 year ago
  Because it’s software.
  - Drewelite
    link
    fedilink
    English
    arrow-up
    3·
    1 year ago
    How do you expect people will create AI if it can’t do the things we do, when “doing the things we do” is the whole point?
    - Prandom_returns@lemm.ee
      link
      fedilink
      English
      arrow-up
      2·
      1 year ago
      I never want software to impersonate a human.
AutoTL;DR@lemmings.worldB
link
fedilink
English
arrow-up
13·
1 year ago
This is the best summary I could come up with:

Mira Murati, OpenAI’s longtime chief technology officer, sat down with The Wall Street Journal’s Joanna Stern this week to discuss Sora, the company’s forthcoming video-generating AI.

It’s a bad look all around for OpenAI, which has drawn wide controversy — not to mention multiple copyright lawsuits, including one from The New York Times — for its data-scraping practices.

After the interview, Murati reportedly confirmed to the WSJ that Shutterstock videos were indeed included in Sora’s training set.

But when you consider the vastness of video content across the web, any clips available to OpenAI through Shutterstock are likely only a small drop in the Sora training data pond.

Others, meanwhile, jumped to Murati’s defense, arguing that if you’ve ever published anything to the internet, you should be perfectly fine with AI companies gobbling it up.

Whether Murati was keeping things close to the vest to avoid more copyright litigation or simply just didn’t know the answer, people have good reason to wonder where AI data — be it “publicly available and licensed” or not — is coming from.

The original article contains 667 words, the summary contains 178 words. Saved 73%. I’m a bot and I’m open source!
- A_Very_Big_Fan@lemmy.world
  link
  fedilink
  English
  arrow-up
  2·
  1 year ago
  Funny how we have all this pissing and moaning about stealing, yet nobody ever complains about this bot actually lifting entire articles and spitting them back out without ads or fluff. I guess it’s different when you find it useful, huh?
  
  I like the bot, but I mean y’all wanna talk about copyright violations? The argument against this bot is a hell of a lot more solid than just using data for training.
  - Guntrigger@feddit.ch
    link
    fedilink
    English
    arrow-up
    2·
    1 year ago
    Is this bot a closed system which is being used for profit? No, you know exactly what its source is (the single article it is condensing) and it even has a handy link about how it is open source at the end of every single post.
    - A_Very_Big_Fan@lemmy.world
      link
      fedilink
      English
      arrow-up
      1·
      1 year ago
      It copied all of its text from the article, and it allows me to get all the information from it I want without providing that publisher with traffic or ad revenue. That’s not fair use.
      
      I do like the bot, and personally I’d rather it stay, but no matter how you look at it this isn’t “fair use” of the article.
      - Guntrigger@feddit.ch
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        Interesting take. In all of the defences of LLMs using copyrighted material it’s very often highlighted that “fair use” allows exactly such summaries of larger texts.
        
        In reality, “fair use” is ruled on a case by case basis, so it’s impossible to judge whether something is or not without it going to court.
        
        A_Very_Big_Fan@lemmy.world
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        We’re not making legislation here, so we don’t have that level of burden of proof. But either way, when it comes to factors of fair use that every authority on the matter will list, it violates almost all of them.
        
        It’s non-commercial, and it’s using facts rather than using a more creative work, so it’s got that going for it… But it’s
        
        composed of 100% copied material
        
        it’s not transformative
        
        it’s substituting the original work
        
        it uses officially published work
        
        it specifically copies the “heart” of the work
        
        it bypasses all of the ads and impacts their traffic/metrics so it has a financial impact on them.
        
        It’s pretty obvious that there is no argument here. The factors that are violated the hardest and most indisputably are the ones that most authorities on the matter (including the one I linked) agree are the most important.
