@[email protected] to [email protected] • English • 11 months ago

In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From

cross-posted to:
[email protected]


Wondering what data OpenAI used to train its buzzy new text-to-video AI? OpenAI CTO Mira Murati seems to be wondering, too.

AutoTL;DR (bot)
English
13•11 months ago
This is the best summary I could come up with:

Mira Murati, OpenAI’s longtime chief technology officer, sat down with The Wall Street Journal’s Joanna Stern this week to discuss Sora, the company’s forthcoming video-generating AI.

It’s a bad look all around for OpenAI, which has drawn wide controversy — not to mention multiple copyright lawsuits, including one from The New York Times — for its data-scraping practices.

After the interview, Murati reportedly confirmed to the WSJ that Shutterstock videos were indeed included in Sora’s training set.

But when you consider the vastness of video content across the web, any clips available to OpenAI through Shutterstock are likely only a small drop in the Sora training data pond.

Others, meanwhile, jumped to Murati’s defense, arguing that if you’ve ever published anything to the internet, you should be perfectly fine with AI companies gobbling it up.

Whether Murati was keeping things close to the vest to avoid more copyright litigation or simply just didn’t know the answer, people have good reason to wonder where AI data — be it “publicly available and licensed” or not — is coming from.

The original article contains 667 words, the summary contains 178 words. Saved 73%. I’m a bot and I’m open source!
- @[email protected]
  English
  2•11 months ago
  Funny how we have all this pissing and moaning about stealing, yet nobody ever complains about this bot actually lifting entire articles and spitting them back out without ads or fluff. I guess it’s different when you find it useful, huh?
  
  I like the bot, but I mean y’all wanna talk about copyright violations? The argument against this bot is a hell of a lot more solid than just using data for training.
  - @[email protected]
    English
    2•11 months ago
    Is this bot a closed system that is being used for profit? No. You know exactly what its source is (the single article it is condensing), and it even has a handy link about how it is open source at the end of every single post.
    - @[email protected]
      English
      1•11 months ago
      It copied all of its text from the article, and it allows me to get all the information from it I want without providing that publisher with traffic or ad revenue. That’s not fair use.
      
      I do like the bot, and personally I’d rather it stay, but no matter how you look at it this isn’t “fair use” of the article.
      - @[email protected]
        English
        1•11 months ago
        Interesting take. In all of the defences of LLMs using copyrighted material it’s very often highlighted that “fair use” allows exactly such summaries of larger texts.
        
        In reality, “fair use” is decided on a case-by-case basis, so it’s impossible to judge whether something qualifies or not without it going to court.
        
        @[email protected]
        English
        1•11 months ago
        We’re not making legislation here, so we don’t have that level of burden of proof. But either way, when it comes to factors of fair use that every authority on the matter will list, it violates almost all of them.
        
        It’s non-commercial, and it’s using facts rather than a more creative work, so it’s got that going for it… But:
        
          - it’s composed of 100% copied material
          - it’s not transformative
          - it’s substituting the original work
          - it uses officially published work
          - it specifically copies the “heart” of the work
          - it bypasses all of the ads and impacts their traffic/metrics, so it has a financial impact on them
        
        It’s pretty obvious that there is no argument here. The factors that are violated the hardest and most indisputably are the ones that most authorities on the matter (including the one I linked) agree are the most important.
