retrospectology@lemmy.world (banned) to Today I Learned@lemmy.world · English · 10 months ago

TIL that the entirety of Wikipedia is only ~100 GB and you can download it for offline use

524

In light of the recent CrowdStrike crash revealing how weak points in IT infrastructure can have wide-ranging effects, I figured this might be an interesting one.

The entirety of Wikipedia is periodically uploaded here, along with many other useful wikis and how-to websites (e.g. iFixit tutorials and WikiHow): https://download.kiwix.org/zim

You select the archive you want, then the language and archive version (for example, you can get an archive with no pictures to save space). For the totality of the English Wikipedia you’d select “wikipedia_en_all_maxi_2024-01.zim”.

The archives are packed as .zim files, which can be read with the Kiwix app completely offline.

I keep several USB drives that have some of these archives along with the app installer. In the event of some major catastrophe I’d at least be able to access some potentially useful information. I have no stake in Kiwix, and don’t know if there are alternative apps and schemes; I just thought it was neat.
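
For anyone scripting this, here is a minimal sketch of how a filename like the one above breaks down. The field names (project, language, selection, flavour, date) are illustrative labels rather than official terms, and the per-project subdirectory in the URL is an assumption about how the download site is laid out, so check the directory listing before relying on it.

```python
from dataclasses import dataclass

BASE_URL = "https://download.kiwix.org/zim"  # link given in the post


@dataclass
class ZimName:
    project: str    # e.g. "wikipedia"
    language: str   # e.g. "en"
    selection: str  # e.g. "all"
    flavour: str    # e.g. "maxi" (field label is an assumption)
    date: str       # e.g. "2024-01"


def parse_zim_name(filename: str) -> ZimName:
    """Split a five-part ZIM filename into its components."""
    if not filename.endswith(".zim"):
        raise ValueError(f"not a .zim filename: {filename!r}")
    stem = filename[: -len(".zim")]
    parts = stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated parts, got {len(parts)}")
    return ZimName(*parts)


def guess_url(filename: str) -> str:
    """Guess the download URL, ASSUMING one subdirectory per project."""
    name = parse_zim_name(filename)
    return f"{BASE_URL}/{name.project}/{filename}"


if __name__ == "__main__":
    print(parse_zim_name("wikipedia_en_all_maxi_2024-01.zim"))
    print(guess_url("wikipedia_en_all_maxi_2024-01.zim"))
```

Filenames that do not follow the five-part pattern raise a `ValueError` rather than being guessed at.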


bionicjoey@lemmy.ca
125·
10 months ago
The text version of Wikipedia*

The images and other media are a hell of a lot more.
- BuddyTheBeefalo@lemmy.ml
  88·
  10 months ago
  it’s 102GB with images, 53GB without
  - Silverseren@fedia.io
    49·
    10 months ago
    I presume this is images directly hosted on English Wikipedia and not the entirety of Commons where the vast majority of images are kept, right?
    - BuddyTheBeefalo@lemmy.ml
      103·
      10 months ago
      Wikimedia Commons is 373 TB of images. https://commons.m.wikimedia.org/wiki/Special:MediaStatistics
      - clearedtoland@lemmy.world
        68·
        10 months ago
        So I have to upgrade my NAS again, ay?
        
        gmtom@lemmy.world
        11·
        10 months ago
        You’re not already running a petabyte NAS???
      - maegul (he/they)@lemmy.ml
        9·
        10 months ago
        Kinda interesting at a broad level … that there’s still something to the efficiency of language.
        
        Sure storage is cheap now, but so much of the calculation of the utility of data in modern tech is the presumption of an internet connection and retrieval of information over the network.
        
        With the internet going to shit in various ways, local or decentralised computing is making more sense, at least depending on your priorities and perspective. And so all of a sudden, storage tradeoffs become a bit more meaningful. Do I need all of the pictures and media … or would a simple textual description suffice for most instances with high res media available at a more centralised archive if I’m really interested? A picture is worth 1000 words, but takes a hell of a lot more digital storage space!
        
        iknowitwheniseeit
        1·
        10 months ago
        So many home instructions are so much easier with a photograph or two, or better yet a video.
- retrospectology@lemmy.world (banned, OP)
  83·
  10 months ago
  The 100 GB version mentioned above does only have thumbnails/low-res pictures, yeah. Better than nothing for some types of articles, but not everything. The true text-only version is actually only ~53 GB though.
  - ByteOnBikes@slrpnk.net
    50·
    10 months ago
    Some of the high res photos are ridiculous.
    
    Like an 8000x9000 uncompressed image of someone’s hand that weighs about 22 MB.
    
    I know that because I use a lot of royalty free images.
    - owsei@programming.dev
      8·
      10 months ago
      Is there an index of the images or something like that?
      - morhp
        9·
        10 months ago
        https://commons.wikimedia.org/
        
        The images are categorised and there’s a search function.
        
        owsei@programming.dev
        3·
        10 months ago
        Thank you very much!
- Dasus@lemmy.world
  24·
  10 months ago
  Without images, Wikipedia is a “mere” 22.14 GB.
  
  https://en.m.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#:~:text=The total number of pages,about 22.14 GB without media.
  - Psythik@lemmy.world
    15·
    10 months ago
    I’ve installed game patches that were larger than this.
    - Valmond@lemmy.world
      1·
      10 months ago
      They should put it in a popular game patch.
lolola@lemmy.blahaj.zone
63·
10 months ago
So something akin to this joke image I saw the other day is actually feasible for Wikipedia?
- Max@lemmy.world
  19·
  10 months ago
  ChatGPT is also probably around 50-100 GB at most
  - souperk@reddthat.com
    5·
    10 months ago
    Probably a lot less. Keep in mind that whenever it answers a question the whole model is traversed multiple times; going through multiple GBs is not possible in the matter of seconds the model takes to answer.
    - Max@lemmy.world
      7·
      10 months ago
      I’d be surprised if it was significantly less. A comparable 70 billion parameter Llama model requires about 120GB to store. Supposedly the largest current ChatGPT goes up to 170 billion parameters, which would take a couple hundred GB to store. There are ways to trade off some accuracy in order to save a bunch of space, but you’re not going to get it under tens of GB.
      
      These models really are going through that many GB of parameters once for every word in the output. GPUs and tensor processors are crazy fast. For comparison, think about how much data a GPU generates for 4k60 video display. It’s like 1GB per second. And the recommended memory speed required to generate that image is like 400GB per second. Crazy fast.
  - lolola@lemmy.blahaj.zone
    5·
    10 months ago
    Plus input data?
    - jose1324@lemmy.world
      16·
      10 months ago
      No, but it’s the model after the input that you need.
  - anivia@lemmy.ml
    3·
    10 months ago
    So it would fit on a Bluray disc
- mctoasterson@reddthat.com
  15·
  10 months ago
  I mean, you can self-host your own local LLMs using something like Ollama. The performance will be bound by the disk space you have (the complexity of the model you’re able to store), and the performance of the CPU or GPU you are using to run it, but it does work just fine. Probably as good results as ChatGPT for most use cases.
  - Nooodel@lemmy.world
    3·
    10 months ago
    We do this at work (lots of sensitive data that we don’t want OpenAI to capitalize on) and it works pretty well. It’s hosted locally and was set up by an admin who is sensitive to data security and privacy, and who specifically configured it not to save any queries, even on the server. Bit slower than ChatGPT but not by much.
- Slovene@feddit.nl
  2·
  10 months ago
  https://m.youtube.com/watch?v=1lRI35gKSPA
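
The size estimates traded in the ChatGPT sub-thread above can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not a statement about how any particular model is actually stored: the 16-bit and 4-bit weight widths and the 3 bytes per pixel are assumptions, and the parameter counts are simply the ones quoted in the comments.

```python
def model_size_gb(n_params: float, bytes_per_param: float) -> float:
    """Storage for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


def video_rate_gb_per_s(width: int, height: int, fps: int,
                        bytes_per_pixel: int = 3) -> float:
    """Uncompressed video data rate in decimal gigabytes per second."""
    return width * height * bytes_per_pixel * fps / 1e9


if __name__ == "__main__":
    # Parameter counts quoted in the thread; weight widths are assumptions.
    for params in (70e9, 170e9):
        for label, width in (("16-bit", 2.0), ("4-bit", 0.5)):
            size = model_size_gb(params, width)
            print(f"{params / 1e9:.0f}B params at {label}: {size:.0f} GB")
    # 4K at 60 frames per second, 24-bit colour (assumed).
    print(f"4K60: {video_rate_gb_per_s(3840, 2160, 60):.2f} GB/s")
```

Under those assumptions, 70 billion 16-bit weights come to 140 GB and 4-bit weights to 35 GB, which is consistent with "tens of GB" being the practical floor, and uncompressed 4K60 video works out to roughly 1.5 GB per second.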
Em Adespoton@lemmy.ca
50·
10 months ago
Aside from the text clarification, this is also only the English-language version of Wikipedia.

What worries me though is that most videos linked on Wikipedia are hosted on YouTube. That’s a pretty dangerous choke point.
- superkret@feddit.org
  14·
  10 months ago
  Videos aren’t an essential part of an encyclopedia.
- AnUnusualRelic@lemmy.world
  14·
  10 months ago
  I never even noticed any videos on Wikipedia. Maybe for some cinema articles.
- ByteOnBikes@slrpnk.net
  4·
  10 months ago
  My brain immediately thought archive.org, but after the last incident I kinda feel like archive.org is going to get sued into oblivion.
  - whats_all_this_then@lemmy.world
    2·
    10 months ago
    I tried searching but found nothing. What incident?
Muffi@programming.dev
38·
10 months ago
This saved my ass at my engineering chemistry exam (still a requirement, even for software engineers) where only offline tools were allowed. Love Kiwix!
- snrkl@lemmy.sdf.org
  12·
  10 months ago
  LOL… Malicious compliance at its best…
Aatube@kbin.melroy.org
36·
10 months ago
DYK that Kiwix was actually created by Wikipedia? Back in the late 2000s there was this gigantic effort to select and improve a ton of articles to make an offline “Wikipedia 1.0” release. The only remains of that effort are Kiwix, periodic backups, and an incredibly useful article-rating system.
- felixwhynot@lemmy.world
  17·
  10 months ago
  Can you write more about the rating system you mentioned?
  - Aatube@kbin.melroy.org
    7·
    10 months ago
    
    There is a set of criteria to rate an article B, C, Start or Stub. These are called classes. Similarly, articles can be rated to be of 1 of 4 importance values to a particular WikiProject.
    
    There’s a banner on every article’s talk page. Any editor can change an article’s rating between one of the above classes boldly; if a revert happens, they discuss it according to the criteria.
    
    Some WikiProjects have their own criteria for rating articles. Some of them even have a process to make an article A-class.
    
    Before this system, Wikipedia already had processes to make an article a Good Article or Featured article.
    
    With GAs, a nominator puts a candidate onto the backlog. Later, a reviewer will scrutinize the article according to the criteria. Often, the reviewer asks the nominator to fix quite a few issues. If these issues are fixed promptly, or the reviewer thinks that there are only nitpicks, the article passes. If they aren’t fixed in a week or the reviewer thinks that there are major problems, the article fails.
    
    As with other processes, the nominator and reviewer can be anyone, though reviewers are usually experienced.
    
    With FAs, a nominator brings the candidate to a noticeboard. Editors there then come to a consensus about whether the article should pass.
    
    Both processes display a badge directly on passed articles.
    
    Both processes have an associated re-review process where editors come to a consensus on whether the article would fail if it were nominated today.
    
    There’s also an informal process called “peer review”, where someone just puts an article on a noticeboard and anyone can comment about its quality.
    
    Articles are automatically sorted into categories by their rating and importance. Editors usually look at these to decide which articles to focus on nowadays.
TheReturnOfPEB@reddthat.com
26·
10 months ago
and you should donate to wikipedia if you are gonna do that
- NewAgeOldPerson@lemmy.world
  12·
  10 months ago
  I couldn’t afford to donate for a long time but I used it near daily. So now I make a monthly, probably larger-than-average contribution to make up for sibs from other cribs that can’t afford it. Pay it forward is indeed a golden rule.
  - 𝕸𝖔𝖘𝖘@infosec.pub
    7·
    10 months ago
    Do you wear a cape? Or are you one of those who doesn’t wear one?
    - NewAgeOldPerson@lemmy.world
      6·
      10 months ago
      No cape. I’m brown so I’m on the radar bad enough as it is as soon as I leave major cities lol.
Silverseren@fedia.io
22·
10 months ago
The benefit of text not taking up much space.
Fenrisulfir@lemmy.ca
21·
10 months ago
Is there a git repo for it or do I have to redownload the whole thing to do an update?
ohwhatfollyisman@lemmy.world
20·
10 months ago
i remember a time when it was only 2gb for all of wikipedia. usain bolt had just burst onto the world stage at the time.
- Ricky Rigatoni@lemm.ee
  20·
  10 months ago
  And by now he’s exited the solar system at incomprehensible speeds.
CannedCairn@lemmy.world
13·
10 months ago
I did! I do! Also all public domain books, as part of Project Gutenberg.
clearedtoland@lemmy.world
7·
10 months ago
I know there are a few companies working on DNA storage. From the comment below about the entirety of Wikipedia and Wiki Commons, I’d say that’d be a pretty practical thing to store.

Here’s the wiki article about it.
ThatWeirdGuy1001@lemmy.world
6·
10 months ago
Imagine downloading it just after some troll changed critical information lmao
- milicent_bystandr@lemm.ee
  3·
  10 months ago
  I imagine you could also download with all the history of every article
Don_Dickle@piefed.social
4·
10 months ago
I am currently reading up on terrorists while in the States, but something tells me I will get my IP banned. I have read a shitton, though, and I highly doubt it’s just 100 GB. Otherwise you would see it more on piracy sites.
- whoreticulture@lemmy.world
  26·
  10 months ago
  But it’s freely and easily available to download, why would it be on piracy sites?
  - Serinus@lemmy.world
    2·
    10 months ago
    China is making a copy. For… reasons.
- ripcord@lemmy.world
  9·
  10 months ago
  How high were you when you wrote this?
  - Don_Dickle@piefed.social
    1·
    10 months ago
    Currently where I am working can’t get high. But can get drunk. But was neither when I wrote that. My ISP is very brutal on looking up stuff or downloading shit.
- Dasus@lemmy.world
  4·
  10 months ago
  
  Otherwise you would see it more on piracy sites.
  
  What on Earth do you mean? Piracy sites share things which aren’t available easily for free otherwise.
  
  https://en.m.wikipedia.org/wiki/Wikipedia:Database_download
  
  And the text-only version of Wikipedia is just 22.14 GB.
  
  https://en.m.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia#:~:text=The total number of pages,about 22.14 GB without media.
Farmfixit@lemmy.world
2·
10 months ago
I tried to download it but couldn’t get it to work :(
- retrospectology@lemmy.world (banned, OP)
  4·
  10 months ago
  Download the Kiwix app for whatever OS you’re using, then open Kiwix, click on the folder icon in the app and navigate to where the .zim file you downloaded is located. If you click it, it should automatically pop up and be viewable.
  
  If you did that and it’s still failing, is it giving you a specific error or anything?
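
One thing worth ruling out when a .zim will not open is a download that is not actually a ZIM file (an error page saved under the wrong name, or an empty file). A minimal sketch, assuming only that ZIM files begin with the magic number 72173914 stored as a little-endian 32-bit integer; it checks the first four bytes and nothing else, so it cannot tell whether a large download was cut off partway.

```python
import struct
import sys

ZIM_MAGIC = 72173914  # 0x044D495A; on disk the bytes read b"ZIM\x04"


def looks_like_zim(path: str) -> bool:
    """Return True if the file starts with the ZIM magic number."""
    with open(path, "rb") as f:
        head = f.read(4)
    if len(head) < 4:
        return False  # empty or severely truncated file
    (magic,) = struct.unpack("<I", head)  # little-endian unsigned 32-bit
    return magic == ZIM_MAGIC


if __name__ == "__main__":
    for path in sys.argv[1:]:
        verdict = "looks like a ZIM file" if looks_like_zim(path) else "does NOT look like a ZIM file"
        print(f"{path}: {verdict}")
```

Saved as, say, `check_zim.py` (a hypothetical name), it would be run as `python check_zim.py wikipedia_en_all_maxi_2024-01.zim`.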
Slovene@feddit.nl
1·
10 months ago
It’s already been done: https://m.youtube.com/watch?v=1lRI35gKSPA
- ripcord@lemmy.world
  3·
  10 months ago
  What’s already been done…?
  - Slovene@feddit.nl
    3·
    10 months ago
    Sorry, I meant to reply to the commenter with the ChatGPT-on-a-DVD pic saying that it’s actually feasible for Wikipedia.
