return2ozma@lemmy.world to Technology@lemmy.worldEnglish · 1 year ago

OpenAI strikes Reddit deal to train its AI on your posts

www.theverge.com

526

Reddit’s deal with OpenAI will plug its posts into “ChatGPT and new products”

Reddit’s signed AI licensing deals with Google and OpenAI.

myliltoehurts@lemm.ee
link
fedilink
English
arrow-up
150·
1 year ago
So they filled reddit with bot generated content, and now they’re selling back the same stuff likely to the company who generated most of it.

At what point can we call an AI inbred?
- orca@orcas.enjoying.yachts
  link
  fedilink
  English
  arrow-up
  90·
  1 year ago
  This is actually a thing. It’s called “Model Collapse”. You can read about it here.
  - FaceDeer@fedia.io
    link
    fedilink
    arrow-up
    23·
    1 year ago
    “Model collapse” can be easily avoided by keeping old human data with new synthetic data in the training set. The old archives of Reddit content from before there was AI are still around.
    - Ghostalmedia@lemmy.world
      link
      fedilink
      English
      arrow-up
      15·
      1 year ago
      A model trained on jokes about bacon, narwhals, and rage comics.
      - FaceDeer@fedia.io
        link
        fedilink
        arrow-up
        2·
        1 year ago
        By “old archives” I mean everything from 2022 and earlier.
        
        BakerBagel@midwest.social
        link
        fedilink
        English
        arrow-up
        13·
        1 year ago
        Removed by mod
        
        FaceDeer@fedia.io
        link
        fedilink
        arrow-up
        1·
        1 year ago
        Existing AIs such as ChatGPT were trained in part on that data so obviously they’ve got ways to make it work. They filtered out some stuff, for example - the “glitch tokens” such as solidgoldmagikarp were evidence of that.
        
        Ghostalmedia@lemmy.world
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        I SAID RAGE COMICS
    - mint_tamas@lemmy.world
      link
      fedilink
      English
      arrow-up
      3·
      1 year ago
      That paper is yet to be peer reviewed or released. I think you are jumping to conclusions with that statement. How much can you dilute the data until it breaks again?
      - barsoap@lemm.ee
        link
        fedilink
        English
        arrow-up
        2·
        edit-2
        1 year ago
        
        That paper is yet to be peer reviewed or released.
        
        Never doing either (release as in submit to journal) isn’t uncommon in maths, physics, and CS. Not to say that it won’t be released but it’s not a proper standard to measure papers by.
        
        I think you are jumping to conclusions with that statement. How much can you dilute the data until it breaks again?
        
        Quoth:
        
        If each linear model is instead fit to the generate targets of all the preceding linear models i.e. data accumulate, then the test squared error has a finite upper bound, independent of the number of iterations. This suggests that data accumulation might be a robust solution for mitigating model collapse.
        
        Emphasis on “finite upper bound, independent of the number of iterations”, achieved by doing nothing more than keeping the non-synthetic data around each time you ingest new synthetic data. This is an empirical study, so of course it’s not proof; you’ll have to wait for theorists to have their turn for that one. But it’s darn convincing and should henceforth be the null hypothesis.
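To make the "replace vs. accumulate" distinction concrete, here is a toy sketch. It is not the paper's linear-regression setup: it swaps in a much simpler stand-in (repeatedly fitting a Gaussian to a pool of numbers and sampling from the fit), and every name in it (`fit_and_sample`, `simulate`) is made up for illustration. The only difference between the two modes is whether each generation's synthetic samples replace the pool or are added to it, so the original "human" samples either vanish after one round or stay in every later fit.

```python
import random
import statistics


def fit_and_sample(data, n, rng):
    """Fit a Gaussian to `data`, then draw `n` synthetic samples from the fit."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def simulate(generations, n, accumulate, seed=0):
    """Return (pool size, pool standard deviation) after each generation.

    accumulate=False: every generation trains only on the previous
    generation's synthetic output (the setting where collapse is reported).
    accumulate=True: synthetic output is appended, so the original real
    samples remain part of every later fit.
    """
    rng = random.Random(seed)
    pool = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for real data
    history = []
    for _ in range(generations):
        synthetic = fit_and_sample(pool, n, rng)
        pool = pool + synthetic if accumulate else synthetic
        history.append((len(pool), statistics.pstdev(pool)))
    return history


if __name__ == "__main__":
    for mode in (False, True):
        sizes_and_spread = simulate(generations=200, n=20, accumulate=mode)
        print("accumulate" if mode else "replace", sizes_and_spread[-1])
```

Running it with different seeds and comparing how the spread drifts in each mode gives a rough feel for the claim; it does not prove anything about real LLM training.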
        
        Btw did you know that no one ever proved (or at least hadn’t last I checked) that reversing, determinising, reversing, and determinising again a DFA minimises it? Not proven yet widely accepted as true, crazy, isn’t it? But, wait, no, people actually proved it on a napkin. It’s not interesting enough to do a paper about.
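The reverse, determinise, reverse, determinise construction is usually credited to Brzozowski. A minimal sketch follows, assuming an automaton is a tuple `(states, alphabet, delta, starts, finals)` where `delta` maps `(state, symbol)` to a set of states; that representation and all function names are choices made for this example. The subset construction here leaves out the dead state, so the result is a partial DFA.

```python
def reverse(automaton):
    """Flip every transition and swap start and final states."""
    states, alphabet, delta, starts, finals = automaton
    flipped = {}
    for (src, sym), dsts in delta.items():
        for dst in dsts:
            flipped.setdefault((dst, sym), set()).add(src)
    return states, alphabet, flipped, set(finals), set(starts)


def determinize(automaton):
    """Subset construction over reachable subsets only (no dead state)."""
    _, alphabet, delta, starts, finals = automaton
    start = frozenset(starts)
    ids = {start: 0}          # reachable subset -> new state number
    todo = [start]
    new_delta = {}
    while todo:
        current = todo.pop()
        for sym in alphabet:
            target = frozenset(
                t for s in current for t in delta.get((s, sym), ())
            )
            if not target:
                continue      # no successor: leave the transition undefined
            if target not in ids:
                ids[target] = len(ids)
                todo.append(target)
            new_delta[(ids[current], sym)] = {ids[target]}
    new_finals = {i for subset, i in ids.items() if subset & finals}
    return set(ids.values()), alphabet, new_delta, {0}, new_finals


def brzozowski_minimize(dfa):
    """Reverse, determinise, reverse, determinise."""
    return determinize(reverse(determinize(reverse(dfa))))


def accepts(dfa, word):
    """Run a (possibly partial) DFA on `word`."""
    _, _, delta, starts, finals = dfa
    (state,) = starts
    for sym in word:
        successors = delta.get((state, sym))
        if not successors:
            return False
        (state,) = successors
    return state in finals


# Example DFA for "strings over {a, b} ending in a"; states 0 and 2 are redundant copies.
EXAMPLE = (
    {0, 1, 2},
    ("a", "b"),
    {(0, "a"): {1}, (0, "b"): {2}, (1, "a"): {1},
     (1, "b"): {2}, (2, "a"): {1}, (2, "b"): {2}},
    {0},
    {1},
)

if __name__ == "__main__":
    minimal = brzozowski_minimize(EXAMPLE)
    print(len(EXAMPLE[0]), "states before,", len(minimal[0]), "after")
```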
        
        mint_tamas@lemmy.world
        link
        fedilink
        English
        arrow-up
        2·
        1 year ago
        Peer review, for all its flaws, is a good minimum before a paper is worth taking seriously.

        In your original comment you said that model collapse can be easily avoided with this technique, which is notably different from it being mitigated. I’m not saying that these findings are not useful, just that you are overselling them a bit with this wording.
        
        barsoap@lemm.ee
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        It was someone different who said that. There’s a chance the authors might’ve gotten some claim wrong because their maths and/or methodology is shoddy but it’s a large and diverse set of authors so that’s unlikely. Fraud in CS empirics is generally unheard of, I mean what are you going to do when challenged, claim that the dog ate the program you ran to generate the data? There’s shenanigans about the equivalent of p-hacking especially from papers from commercial actors trying to sell stuff but that’s not the case here, either.
        
        CS academics generally submit papers to journals more because of publish or perish than the additional value formal peer review offers. It’s on the internet, after all. By all means, if you spot something in the paper that’s wrong then be right on the internet.
  - noodle (he/him)@lemm.ee
    link
    fedilink
    English
    arrow-up
    5·
    1 year ago
    I prefer “Habsburg AI”.
- restingboredface@sh.itjust.works
  link
  fedilink
  English
  arrow-up
  18·
  1 year ago
  I wonder if OpenAI or any of the other firms have thought to put in any kind of stipulations about monitoring and moderating Reddit content to reduce AI-generated posts and reduce the risk of model collapse.

  Anybody who’s looked at Reddit in the past 2 years especially has seen the impact of AI pretty clearly. If I was running OpenAI I wouldn’t want that crap contaminating my models.
jordanlund@lemmy.world
link
fedilink
English
arrow-up
90·
1 year ago
Removed by mod
- return2ozma@lemmy.worldOP
  link
  fedilink
  English
  arrow-up
  18·
  1 year ago
  Know any bots or ways to perma delete all Reddit comments?
  - thejml@lemm.ee
    link
    fedilink
    English
    arrow-up
    64·
    1 year ago
    Reddit has backups, permanently isn’t an option.
    - metaStatic@kbin.social
      link
      fedilink
      arrow-up
      17·
      1 year ago
      yep they fuckin got us
      
      but it’s not like our posts are safe here either. This is the world we live in now.
      - andrew@lemmy.stuart.fun
        link
        fedilink
        English
        arrow-up
        7·
        1 year ago
        But here, the API is open and I can run my own copy and train my own LLM same as anyone else. It’s not one asshole who decides to whom and for how much he’ll sell the content we all gave him for free, so he can justify his $193 million paycheck.
        
        PseudorandomNoise@lemmy.world
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        Does that really matter? The owner of a given instance can still choose to sell everything on their server, no?
      - the_doktor@lemmy.zip
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        We have to either make AI illegal or make it accountable by giving references to where it gets its data so it can properly cite its sources.
    - db2@lemmy.world
      link
      fedilink
      English
      arrow-up
      15·
      1 year ago
      They’re not multiple though, edit it and then delete it and it’s gone. They disabled all the tools to do it though so it’s manually or nothing now.
      - Coasting0942@reddthat.com
        link
        fedilink
        English
        arrow-up
        14·
        1 year ago
        Damn. You outsmarted them well paid data jockeys. And assuming your edits change the actual comment and don’t simply hide the original.
        
        I could be an idiot too though. Reddit might have been running this whole shit show on the original version of the database system and be upselling to buyers.
      - SchmidtGenetics@lemmy.world
        link
        fedilink
        English
        arrow-up
        12·
        1 year ago
        They just reload a previous cached comment, doesn’t matter how many times you edit or delete, it’s all logged and backed up.
    - Imgonnatrythis@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      5·
      1 year ago
      Will be interesting to see if they stoop so low as to allow this. Probably wouldn’t be a super wise move as most deleted posts are likely material that would not be great to train on anyway. My first thought when I read this was, “well, not on MY posts” I’m clean off of reddit.
      - mox@lemmy.sdf.org
        link
        fedilink
        English
        arrow-up
        11·
        1 year ago
        There have already been reports of people being banned and finding their posts restored in response to their attempts to delete them.
      - FaceDeer@fedia.io
        link
        fedilink
        arrow-up
        7·
        1 year ago
        There are torrents of complete Reddit comment archives available for any random person who wants them, I’m sure Reddit themselves has a comprehensive edit history of everything.
  - bobs_monkey@lemm.ee
    link
    fedilink
    English
    arrow-up
    11·
    edit-2
    1 year ago
    I used redact.dev to mass edit all my comments, worked pretty well. Problem is that if you mass delete, they’ll restore them pretty quick, but so far they haven’t reverted my edits.
  - catloaf@lemm.ee
    link
    fedilink
    English
    arrow-up
    6·
    1 year ago
    https://github.com/j0be/PowerDeleteSuite
    - jabathekek@sopuli.xyz
      link
      fedilink
      English
      arrow-up
      1·
      1 year ago
      This is what I used awhile ago to delete/edit all my comments multiple times.
  - Rolando@lemmy.world
    link
    fedilink
    English
    arrow-up
    2·
    1 year ago
    Back when I deleted all my comments, I was told I could claim to be in Europe and make a request citing the European law that Reddit has to follow. I think Reddit had a page where you could make the request, but of course it was hard to find.
- micka190@lemmy.world
  link
  fedilink
  English
  arrow-up
  8·
  1 year ago
  Realistically, when you’re operating at Reddit’s scale, you’re probably keeping a history of each comment for analytics purposes.
- RecluseRamble@lemmy.dbzer0.com
  link
  fedilink
  English
  arrow-up
  3·
  1 year ago
  That was really my thought - future iterations of ChatGPT won’t like spez very much.
Everythingispenguins@lemmy.world
link
fedilink
English
arrow-up
46·
1 year ago
Some day historians will be able to look back at this moment and be able to determine it was what caused ChatGPT to become horny and weird.
- frickineh@lemmy.world
  link
  fedilink
  English
  arrow-up
  5·
  1 year ago
  My comment history was like 50% shitposting about the beauty industry and 50% hating on Christian fundamentalists. There’s honestly no way it won’t make AI at least a little bit worse, and I’m not mad about it.
  - Flying Squid@lemmy.world
    link
    fedilink
    English
    arrow-up
    2·
    1 year ago
    That AI is going to be super anti-Christian fundamentalist (or possibly just anti-Christian), so maybe there is an upside.
- assassin_aragorn@lemmy.world
  link
  fedilink
  English
  arrow-up
  4·
  1 year ago
  Only an idiot would decide to mindlessly trawl Reddit to train an LLM. They’ll be confused when their model suddenly is confidently wrong about everything and have no clue.
  - Everythingispenguins@lemmy.world
    link
    fedilink
    English
    arrow-up
    2·
    1 year ago
    You are a hundred percent right, but how many idiots are there out there?
    - assassin_aragorn@lemmy.world
      link
      fedilink
      English
      arrow-up
      2·
      1 year ago
      Uncountably many
      - Everythingispenguins@lemmy.world
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        Sadly looks like we have an answer
        
        https://lemmy.world/post/15712886
AlexWIWA@lemmy.ml
link
fedilink
English
arrow-up
39·
1 year ago
LLMs have been training on Reddit posts since at least 2012. Nothing really new here.
- YIj54yALOJxEsY20eU@lemm.ee
  link
  fedilink
  English
  arrow-up
  7·
  1 year ago
  Now they get to train on all the “deleted” comments/posts as well.
  - SparrowRanjitScaur@lemmy.world
    link
    fedilink
    English
    arrow-up
    3·
    edit-2
    1 year ago
    Probably not, I’m sure they’re training on Reddit’s internal data set which likely includes all deleted posts.
    - YIj54yALOJxEsY20eU@lemm.ee
      link
      fedilink
      English
      arrow-up
      12·
      1 year ago
      Did you just say probably not then agree with me?
      - SparrowRanjitScaur@lemmy.world
        link
        fedilink
        English
        arrow-up
        5·
        edit-2
        1 year ago
        Ya, lol. Sorry, I’m not sure if I replied to the wrong comment or just misread your comment earlier. I agree with you.
        
        YIj54yALOJxEsY20eU@lemm.ee
        link
        fedilink
        English
        arrow-up
        3·
        1 year ago
        Lol no worries
- UnderpantsWeevil@lemmy.world
  link
  fedilink
  English
  arrow-up
  5·
  1 year ago
  It’s ground zero for Bots training on other Bots
filister@lemmy.world
link
fedilink
English
arrow-up
32·
edit-2
1 year ago
What makes you think that they are not scraping Lemmy too? The only reason they might not be is probably how niche Lemmy and the fediverse are, but I am sure there have been people already doing it.
- Dr. Moose@lemmy.world
  link
  fedilink
  English
  arrow-up
  28·
  1 year ago
  Fediverse is designed to do exactly that. It’s free flow of information which is a good thing. Don’t let corporations hijack this beautiful concept. We all want information to be free.
- olympicyes@lemmy.world
  link
  fedilink
  English
  arrow-up
  15·
  1 year ago
  I’m not mad about the scraping. The LinkedIn scraping case pretty much cemented that there was nothing that could be done to stop it. I’m just mad that I can no longer use the app of my choice. No such problem with Lemmy.
- AlexWIWA@lemmy.ml
  link
  fedilink
  English
  arrow-up
  4·
  1 year ago
  Lemmy is even easier to scrape. Just set up your own instance, then read the database after ActivityPub pushes everything to you.
- kia@lemmy.ca
  link
  fedilink
  English
  arrow-up
  3·
  1 year ago
  I’m sure they are, but Reddit probably provides these companies with lots of personalized metadata they collect just for them which they may not get from Lemmy.
Possibly linux@lemmy.zip
link
fedilink
English
arrow-up
30·
edit-2
1 year ago
They now are paying Reddit? I thought they could just scrape for free.

Also, you can not delete anything on the internet. Once something is public there will always be a copy somewhere.
- Fetus@lemmy.world
  link
  fedilink
  English
  arrow-up
  26·
  1 year ago
  Scraping through a website at the scale they are talking about isn’t really viable. You need access to the API so that you can have very targeted requests.
  
  This is why reddit changed their API pricing and screwed over everyone using third party apps. They can make more money selling access to LLM trainers than they could from having millions of people using apps that rely on the API.
  - Dr. Moose@lemmy.world
    link
    fedilink
    English
    arrow-up
    1·
    edit-2
    1 year ago
    Scraping at scale is actually cheaper than buying API access. It’s a massive rising market, try googling “web scraping service” and there are hundreds of services that provide API to scrape any public web page and bypass the blocks for you and render all of the javascript.
    - BatrickPateman@lemmy.world
      link
      fedilink
      English
      arrow-up
      1·
      1 year ago
      Scraping is nice for static content, no doubt. But I wonder at what point it is easier to request changes to a developing thread via the API than to request the whole page with all nested content over and over to find the new answers in there.
      - Dr. Moose@lemmy.world
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        Following a developing thread is a very tiny use case I’d imagine and even then you can just scrape the backend API that is used on the public page for the same results as private API.
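A rough sketch of the trade-off in this exchange: without a "changes since" endpoint, a scraper re-fetches the whole thread on every poll and throws away whatever it has already seen. The comment dictionaries and the `new_comments` helper are invented for illustration and do not correspond to any real Reddit or Lemmy response format.

```python
def new_comments(seen_ids, fetched):
    """Return the comments not seen on earlier polls and remember their ids."""
    fresh = [comment for comment in fetched if comment["id"] not in seen_ids]
    seen_ids.update(comment["id"] for comment in fresh)
    return fresh


# Hypothetical snapshots of the same thread at two poll times.
first_poll = [{"id": 1, "text": "OP"}, {"id": 2, "text": "reply"}]
second_poll = first_poll + [{"id": 3, "text": "late reply"}]

seen = set()
new_comments(seen, first_poll)            # first visit: everything is new
fresh = new_comments(seen, second_poll)   # second visit: only the late reply
wasted = len(second_poll) - len(fresh)    # comments downloaded again for nothing

if __name__ == "__main__":
    print("new:", [c["id"] for c in fresh], "re-downloaded:", wasted)
```

The wasted share grows with thread size, which is the cost being weighed against paying for targeted API queries.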
- micka190@lemmy.world
  link
  fedilink
  English
  arrow-up
  10·
  edit-2
  1 year ago
  There’s actually legal precedent against scraping a website through unofficial channels, even if the information is public. But basically, if you scrape a website and hinder their ability to operate, it falls under “virtual trespassing”.

  I’m assuming it would be even worse now that everyone is using the cloud and that scraping their site would cause a noticeable increase in resource cost (and thus, directly cost them more money because of cloud usage fees).
  
  It’s why APIs are such a big deal. They provide you with an official, controlled, entry point to a platform’s data.
  - Dr. Moose@lemmy.world
    link
    fedilink
    English
    arrow-up
    11·
    edit-2
    1 year ago
    It’s the opposite! There’s legal precedent that scraping public data is 100% legal in the US.

    There are a few countries where scraping is illegal, though, like Japan and China. European countries often also have so-called “database protection” laws that forbid replicating public databases through scraping or any other means, but that has to be a big chunk of the overall database. There are also personally identifiable information (PII) protection laws, like GDPR, that restrict storing people’s data without their consent.

    Source: I work with anti-bot tech and we have to explain this to almost every customer who wants to “sue the web scrapers”: lol, if LinkedIn couldn’t do it, you’re not suing anyone.
    - General_Effort@lemmy.world
      link
      fedilink
      English
      arrow-up
      2·
      1 year ago
      Refreshing to see a post on this topic that has its facts straight.
      
      EU copyright allows a machine-readable opt-out from AI training (unless it’s for scientific purposes). I guess that’s behind these deals. It means they will have to pay off Reddit and the other platforms for access to the EU market. Or more accurately, EU customers will have to pay Reddit and the other platforms for access to AIs.
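One widely used machine-readable signal is `robots.txt`; whether it legally counts as the opt-out described here is an assumption, not something established in this thread. As a sketch, Python's standard `urllib.robotparser` can check whether a given crawler is disallowed. `GPTBot` is the user agent OpenAI documents for its crawler; the example file contents, the `may_crawl` helper, and the URL are illustrative.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt, user_agent, url):
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that opts out of one AI crawler only.
EXAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /
"""

if __name__ == "__main__":
    page = "https://example.com/r/technology"
    print("GPTBot allowed:", may_crawl(EXAMPLE_ROBOTS, "GPTBot", page))
    print("Other bot allowed:", may_crawl(EXAMPLE_ROBOTS, "SomeOtherBot", page))
```

A signal like this only binds crawlers that choose to honour it, which fits the point that licensing deals are the route for data behind such an opt-out.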
- nondescripthandle@lemmy.dbzer0.com
  link
  fedilink
  English
  arrow-up
  2·
  1 year ago
  My guess is reddit was cheap enough that it made sense to pay them as a sort of insurance that they don’t get sued in the future.
Dr. Moose@lemmy.world
link
fedilink
English
arrow-up
29·
1 year ago
This form of propaganda is my pet peeve. It’s not “your posts” as soon as you put something to public you don’t get to eat your cake. It’s out there, you shared it. Don’t share it if you don’t want humanity to ingest and use it.
- Dataprolet@lemmy.dbzer0.com
  link
  fedilink
  English
  arrow-up
  23·
  1 year ago
  You’re technically right, but nobody anticipated and therefore agreed on their posts being used for training LLMs.
  - SparrowRanjitScaur@lemmy.world
    link
    fedilink
    English
    arrow-up
    1·
    1 year ago
    Public information is public information.
    - Dataprolet@lemmy.dbzer0.com
      link
      fedilink
      English
      arrow-up
      1·
      1 year ago
      Oh boy have I bad news for you. You ever heard of copyright?
      - SparrowRanjitScaur@lemmy.world
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        Have you ever heard of fair use?
- Azzu@lemm.ee
  link
  fedilink
  English
  arrow-up
  6·
  edit-2
  1 year ago
  It’s not about it being used to train AI. It’s about the AI either not being open source/I don’t get access to it (i.e. not benefitting me) or reddit being paid for my comments (i.e. also not benefitting me).
  
  If this AI training would get me or the public access to the AI, or I would be paid for my comments instead of Reddit, I’d be fine with it.
  - Dr. Moose@lemmy.world
    link
    fedilink
    English
    arrow-up
    5·
    edit-2
    1 year ago
    Yeah, but you don’t get to choose that. You give away that right as soon as you participate in public discourse. It’s a zero sum game - either it’s public for everyone or for no one.
    
    Don’t get me wrong, Reddit is a bitch but I think people want to cut their noses off to spite their faces here. It’s much more important to have free information flow than to fuck reddit.
    
    My fear is that people will vote in some really dumb rules to spite AI and restrict free information flow accidentally.
    - Azzu@lemm.ee
      link
      fedilink
      English
      arrow-up
      3·
      edit-2
      1 year ago
      That’s how it is currently and maybe also your opinion. But that doesn’t mean it has to be like that in a society. It’s your opinion that everything public can go private at any time (training proprietary private AI), but we can decide as a society that’s not how we want to do things. We can require stuff that used public data to be public as well.
      
      And yeah I kinda get to choose that. As democratic society, anything that the public (i.e. including me) decides, goes. Of course, if there are people like you that don’t want stuff trained on public data to be required to be public, democracy will also work in the sense that we don’t get that, as it is currently.
Dark_Dragon@lemmy.dbzer0.com
link
fedilink
English
arrow-up
21·
edit-2
1 year ago
Reddit banned me by IP address or something. Whatever new account I create gets banned within 24 hours, even if I don’t upvote a single post or comment. I tried with 10 new accounts, each with a new email address, and all of them were banned. So I gave up and randomly changed all my good comments. I’ve shifted permanently to Lemmy. I miss some of the most niche communities, but not so much that I’d return to Reddit.

Edit: I didn’t even commit any rule violation. I took too long to switch away from a modded Reddit app. I only logged in once. That doesn’t justify blocking me from ever using Reddit.
- dumblederp@lemmy.world
  link
  fedilink
  English
  arrow-up
  2·
  1 year ago
  If you use a vpn and a disposable email you can get about a week out of an account if you need to comment, it’ll get quietly shadowbanned though.
leftzero
link
fedilink
English
arrow-up
19·
1 year ago
Meh, good luck with that.

All my Reddit comments have just said “Comment redacted in protest against Reddit’s deranged attacks against third party apps, the community, and common sense. See you’ll in Lemmy or Kbin once this embarrassment of a site is done enshittifying itself out of existence. Monetize this, u/spez, you greedy little pigboy. 🖕” since I edited them before moving here. 🤷‍♂️
- hessenjunge@discuss.tchncs.de
  link
  fedilink
  English
  arrow-up
  18·
  1 year ago
  You better double check. I just found out that only my comments with few upvotes are still that way, the others have been restored.
  
  A script replacing them with random words might do the trick.
  - ManOMorphos@lemmy.world
    link
    fedilink
    English
    arrow-up
    3·
    1 year ago
    I replaced all my comments with the same phrase before deleting them with PowerDeleteSuite. The comments were fully restored and visible through a google search (but not visible through the user page). My posts were not restored, AFAIK.
    
    This was during the whole 3rd party API thing. Maybe it was just something done during that time, but they certainly got around the edit replacement trick before.
  - FierySpectre@lemmy.world
    link
    fedilink
    English
    arrow-up
    3·
    1 year ago
    That’s assuming the old comments are actually overwritten instead of just marked as ‘old’
boatsnhos931@lemmy.worldBanned
link
fedilink
English
arrow-up
17·
edit-2
1 year ago
Removed by mod
- macrocephalic@lemmy.world
  link
  fedilink
  English
  arrow-up
  2·
  1 year ago
  All future AI will have autocorrect errors and will look like no one read it before hitting enter. You’re welcome.
  - boatsnhos931@lemmy.worldBanned
    link
    fedilink
    English
    arrow-up
    1·
    edit-2
    1 year ago
    Removed by mod
Mastengwe@lemm.ee
link
fedilink
English
arrow-up
15·
1 year ago
Isn’t this news like every month?
noorbeast@lemmy.zip
link
fedilink
English
arrow-up
13·
1 year ago
Finally found a use for MS Edge, loaded up Nuke Reddit History and removed all comments and posts: https://microsoftedge.microsoft.com/addons/detail/nuke-reddit-history/bklbcgohenjegdibgmppligaapohkgip
- gravitas_deficiency@sh.itjust.works
  link
  fedilink
  English
  arrow-up
  33·
  1 year ago
  Hate to break it to you, but the time to do that was over a year ago, and even then it wasn’t ever really a sure thing - we don’t really know what their backup policies are around that stuff.
  
  This is what the former power user community that made an exodus from Reddit roughly a year ago has been trying to communicate, but a ton of people here seem to enjoy keeping their toes in the water over there, with rather predictable consequences (literally, the post we’re commenting on).
  
  All that said: I am very much looking forward to the absolutely titanic lawsuit around GDPR I’m sure is in the works over this.
  - AlexWIWA@lemmy.ml
    link
    fedilink
    English
    arrow-up
    2·
    1 year ago
    Not even a year ago. Reddit has been used for training data for well over a decade. We used it in 2012 in an AI class.
    - gravitas_deficiency@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      2·
      1 year ago
      My point is that there was not a revenue-generating b2b contract allowing another company to exploit it at scale, while compensating Reddit directly.
      - AlexWIWA@lemmy.ml
        link
        fedilink
        English
        arrow-up
        2·
        1 year ago
        My apologies. I missed it
- humorlessrepost@lemmy.world
  link
  fedilink
  English
  arrow-up
  13·
  1 year ago
  Worth doing, but I suspect they’re sending OpenAI snapshots of the database from before you did that.
- snownyte@kbin.social
  link
  fedilink
  arrow-up
  1·
  1 year ago
  Wish I had known this beforehand in like several accounts I’ve had with that shit-ass place.
  
  Then again, it’s likely that Reddit has shit archived because Spez is one of them data-farmers like Mark is. Nothing is truly deleted from their sites. It’s just archived.
  
  There’s been lots of evidence that proves this, because people have dug up old comments, even down to who posted it originally. Then, even if your account is deleted, your comment body is still there, I know because I’ve deleted an account and checked back where I was before.
db2@lemmy.world
link
fedilink
English
arrow-up
8·
1 year ago
Not my posts. Go ahead, look at what remains. The rest was edited and then deleted.

Fuck you, Steve. Right in the ass.
- yeehaw@lemmy.ca
  link
  fedilink
  English
  arrow-up
  10·
  1 year ago
  If only snapshots and backups were a thing…
  - CeeBee@lemmy.world
    link
    fedilink
    English
    arrow-up
    6·
    1 year ago
    It’s theoretically possible, but the issue that anyone trying to do that would run into is consistency.
    
    How do you restore the snapshots of a database to recover deleted comments but also preserve other comments newer than the snapshot date?
    
    The answer is that it’s nearly impossible. Not impossible, but not worth the massive monumental effort when you can just focus on existing comments which greatly outweigh any deleted ones.
    - yeehaw@lemmy.ca
      link
      fedilink
      English
      arrow-up
      1·
      1 year ago
      It’s a piece of cake. Some code along the lines of:
      
      if ($user.modifyCommentRecentlyCount -gt 50) {  # hypothetical per-user edit counter
          Write-Output "user is nuking comments"
          $comment = $previousComment  # keep serving the pre-edit text
      }
      
      Or some shit. It can be done quite easily, trust me.
      - CeeBee@lemmy.world
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        
        It can be done quite easily, trust me.
        
        The words of every junior dev right before I have to spend a weekend undoing their crap.
        
        I’ve been there too many times.
        
        There are always edge cases you need to account for, and you can’t account for them until you run tests and then verify the results.
        
        And you’d be parsing billions upon billions of records. Not a trivial thing to do when running multiple tests to verify. And ultimately for what is a trivial payoff.
        
        You don’t screw around with infinitely invaluable prod data of your business without exhausting every single possibility of data modification.
        
        It’s a piece of cake.
        
        It hurts how often I’ve heard this and how often it’s followed by a massive screw up.
        
        yeehaw@lemmy.ca
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        
        The words of every junior dev right before I have to spend a weekend undoing their crap.
        
        There are so many ways this can be done that I think you are not thinking of. Say a user goes to “shreddit” their comments (or uses some other similar app). They likely have thousands. On every comment edit, it’s quite easy to check the last time the user edited one of their comments. All they need is a check on whether the last 10 consecutive comments were edited over hours or within milliseconds/seconds. After that, reddit could easily just tell the user it’s editing their comments when it’s not. Like a shadowban kind of method.

        Another way would be at the data structure level. We don’t know what their databases and hardware are like, but I can speculate. What if each user-edited comment is not an update query on a database, but an add/insert? Then all you need to do is restore the live comments from the revisions dated before the malicious date where username=$username.

        Not to mention that when you start talking Nimble storage and stuff like that, the storage is extremely quick to respond. Hell, I would wager it hasn’t even hit storage yet; it’s probably still on some all-flash cache or in memory.

        Another way could be at the filesystem level. Ever heard of zfs? What if each user had their own dataset or something? It’s extremely easy and quick to roll back a snapshot, or to clone the previous snapshot. There are so many ways.
        
        At the end of the day a user is triggering this action, so we don’t necessarily need to parse “billions” of records. Just the records for a single user.
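The "insert instead of update" idea can be sketched with an in-memory SQLite table. This is speculation about a design, matching the speculation in the comment, and says nothing about how Reddit actually stores comments; the `revisions` table, its columns, and the helper names are all invented for the example. Every edit appends a row, the live view is the newest revision per comment, and ignoring everything from a cutoff timestamp onward only touches one user's rows.

```python
import sqlite3


def open_store():
    """Create an in-memory, append-only revision table (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE revisions ("
        "comment_id INTEGER, author TEXT, edited_at INTEGER, body TEXT)"
    )
    return db


def save_edit(db, comment_id, author, edited_at, body):
    """An edit never overwrites anything; it just adds a newer revision."""
    db.execute(
        "INSERT INTO revisions VALUES (?, ?, ?, ?)",
        (comment_id, author, edited_at, body),
    )


def visible_comments(db, author, cutoff=None):
    """Newest revision per comment; with `cutoff`, ignore edits at or after it."""
    rows = db.execute(
        """
        SELECT r.comment_id, r.body
        FROM revisions AS r
        WHERE r.author = ?
          AND r.edited_at = (
              SELECT MAX(r2.edited_at)
              FROM revisions AS r2
              WHERE r2.comment_id = r.comment_id
                AND (? IS NULL OR r2.edited_at < ?)
          )
        """,
        (author, cutoff, cutoff),
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    db = open_store()
    save_edit(db, 1, "alice", 100, "original one")
    save_edit(db, 2, "alice", 110, "original two")
    save_edit(db, 1, "alice", 500, "redacted")   # mass edit starts at t=500
    save_edit(db, 2, "alice", 501, "redacted")
    print(visible_comments(db, "alice"))              # what the user sees
    print(visible_comments(db, "alice", cutoff=500))  # pre-purge view
```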
        
        CeeBee@lemmy.world
        link
        fedilink
        English
        arrow-up
        1·
        edit-2
        1 year ago
        
        There are so many ways this can be done that I think you are not thinking of.
        
        No, I can think of countless ways to do this. I do this kind of thing every single day.
        
        What I’m saying is that you need to account for every possibility. You need to isolate all the deleted comments that fit the criteria of the “Reddit Exodus”.
        
        How do you do that? Do you narrow it down to a timeframe?
        
        The easiest way to do this is identify all deleted accounts, find the backup with the most recent version of their profile with non-deleted comments, and insert that user back into the main database (not the prod db).
        
        Now you need to parse billions upon billions upon billions of records. And yes, it’s billions because you need the system to search through all the records to know which record fits the parameters. And you need to do that across multiple backups for each deleted profile/comment.
        
        It’s a lot of work. And what’s the payoff? A few good comments and a ton of “yes this ^” comments.
        
        I sincerely doubt it’s worth the effort.
        
        Edit: formatting
        
        yeehaw@lemmy.ca
        link
        fedilink
        English
        arrow-up
        1·
        1 year ago
        
        > How do you do that? Do you narrow it down to a timeframe?
        
        When a user edits a comment, they submit a response. When they submit a response, they trigger an action. An action can do validation steps and call methods, just like I said above, for example. When the edit action is triggered, check the timestamp against the previously edited comment’s timestamp. If the previous edit, or the previous 5, happened within a given timeframe, flag it. “Shadowban” the user. Make it look to them like their comments were updated, when in reality they’re unchanged.
        
        We’ve had detection methods for this sort of thing for a long time. Think about how spam filtering works. If you’re using some tool to scramble your data, it likely leaves patterns. To think reddit doesn’t have some means to protect itself against this is naive. It’s their whole business. All these user submitted comments are worth money.
        
        > Now you need to parse billions upon billions upon billions of records. And yes, it’s billions because you need the system to search through all the records to know which record fits the parameters. And you need to do that across multiple backups for each deleted profile/comment.
        
        This makes me think you don’t understand my meaning. I think you’re talking about reddit one day deciding to search for and restore obfuscated and deleted comments. Yes, that would be a large undertaking. This is not what I’m suggesting at all. Stop it while it’s happening, not later. Patterns and trends can easily identify when a user is running something like shreddit, then the code can act on it.
        
        > It’s a lot of work. And what’s the payoff? A few good comments and a ton of “yes this ^” comments.
        
        this
    - skulblaka@startrek.website
      link
      fedilink
      English
      arrow-up
      1·
      1 year ago
      Just collate them based on edit/deletion date… Each post will have a last-edited attribute that can be used for sorting. Even more so once the AI is bootstrapped enough to start recognizing the standard protest edit messages. At that point you hardly even need human oversight anymore, because the bot will be able to recognize “that’s a fuck spez edit, ignore that; this post looks good; that’s a Shreddit/PowerDelete edit, ignore that” and so on. Can even have it fetch the previous edit automatically when it comes across something like that, to a point where a comment removed by a PowerDelete tool is nothing more than a cover letter that states “there was once a real human-generated comment in this location”.
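
As a toy version of that filter, here is a sketch in Python that swaps the AI for plain regular expressions. The patterns, `is_protest_edit` and `last_genuine_revision` are hypothetical, and it assumes the full edit history of a comment is available as a list ordered oldest to newest.

```python
import re
from typing import Optional, Sequence

# Hypothetical patterns for the protest edits named above.
PROTEST_PATTERNS = [
    re.compile(r"fuck\s+spez", re.IGNORECASE),
    re.compile(r"\b(shreddit|power\s*delete)", re.IGNORECASE),
]


def is_protest_edit(body: str) -> bool:
    """True if the text looks like a bulk-overwrite message."""
    return any(p.search(body) for p in PROTEST_PATTERNS)


def last_genuine_revision(revisions: Sequence[str]) -> Optional[str]:
    """Walk the history newest-first and return the first body
    that is not a protest edit, or None if every revision is one."""
    for body in reversed(revisions):
        if not is_protest_edit(body):
            return body
    return None
```

A keyword match like this would also throw away genuine comments that merely talk about Shreddit, which is presumably where a trained classifier would do better than a regex.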
  - Todgerdickinson@lemmy.world
    link
    fedilink
    English
    arrow-up
    3·
    1 year ago
    Yea, that’s the problem, isn’t it. I had a great idea involving bullshit-efying my comments by editing them slowly and repeatedly over months with an LLM via a long-running script.
    
    I realised that they probably don’t delete the original text on edit anyway, which, as you say, is probably buried in a backup someplace.
    - Ace! _SL/S@ani.social
      link
      fedilink
      English
      arrow-up
      2·
      1 year ago
      I don’t think it is in backups only. My guess is they store your full edit history for each comment/post/whatever. Newest one will be shown on the frontend, rest is for data vampires
      - yeehaw@lemmy.ca
        link
        fedilink
        English
        arrow-up
        2·
        1 year ago
        This is it exactly. To us, edits look like changes. To the back end, each one is just another iteration, while the earlier ones still exist.
jeanofthedead@sh.itjust.works
link
fedilink
English
arrow-up
7·
1 year ago
Does this mean I can stop prefacing my AI requests with “According to Reddit…”?
Kyrgizion@lemmy.world
link
fedilink
English
arrow-up
6·
1 year ago
I didn’t delete my comments before nuking my account, but I’m pretty sure the vast majority were shitposts containing ample amounts of smut, gore and other ridiculous over-the-top shit. So I consider this a win.
