
distributed-wikipedia-mirror's Introduction

Distributed Wikipedia Mirror Project

Putting Wikipedia Snapshots on IPFS and working towards making it fully read-write.

Existing Mirrors

There are various ways one can access the mirrors: through a DNSLink, public gateway or directly with a CID.

You can read all about the available methods here.
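
For example, the Turkish mirror can be reached in any of these ways (hostnames taken from the existing mirrors; the CID below is a placeholder, use the current one from snapshot-hashes.yml):

https://tr.wikipedia-on-ipfs.org                   (DNSLink, resolved by your browser or a local IPFS node)
https://ipfs.io/ipns/tr.wikipedia-on-ipfs.org      (DNSLink via a public gateway)
ipfs://bafy...                                     (directly by CID, e.g. with IPFS Companion or Brave)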

DNSLinks

CIDs

The latest CIDs that the DNSLinks point at can be found in snapshot-hashes.yml.


Each mirror has a link to the original Kiwix ZIM archive in the footer. It can be downloaded and opened offline with the Kiwix Reader.

Table of Contents

Purpose

“We believe that information—knowledge—makes the world better. That when we ask questions, get the facts, and are able to understand all perspectives on an issue, it allows us to build the foundation for a more just and tolerant society” -- Katherine Maher, Executive Director of the Wikimedia Foundation

Wikipedia on IPFS -- Background

What does it mean to put Wikipedia on IPFS?

The idea of putting Wikipedia on IPFS has been around for a while. Every few months or so someone revives the threads. You can find such discussions in this github issue about archiving wikipedia, this issue about possible integrations with Wikipedia, and this proposal for a new project.

We have two consecutive goals regarding Wikipedia on IPFS: Our first goal is to create periodic read-only snapshots of Wikipedia. A second goal will be to create a full-fledged read-write version of Wikipedia. This second goal would connect with the Wikimedia Foundation’s bigger, longer-running conversation about decentralizing Wikipedia, which you can read about at https://strategy.wikimedia.org/wiki/Proposal:Distributed_Wikipedia

(Goal 1) Read-Only Wikipedia on IPFS

The easy way to get Wikipedia content on IPFS is to periodically -- say every week -- take snapshots of all the content and add it to IPFS. That way the majority of Wikipedia users -- who only read wikipedia and don’t edit -- could use all the information on wikipedia with all the benefits of IPFS. Users couldn't edit it, but users could download and archive swaths of articles, or even the whole thing. People could serve it to each other peer-to-peer, reducing the bandwidth load on Wikipedia servers. People could even distribute it to each other in closed, censored, or resource-constrained networks -- with IPFS, peers do not need to be connected to the original source of the content, being connected to anyone who has the content is enough. Effectively, the content can jump from computer to computer in a peer-to-peer way, and avoid having to connect to the content source or even the internet backbone. We've been in discussions with many groups about the potential of this kind of thing, and how it could help billions of people around the world to access information better -- either free of censorship, or circumventing serious bandwidth or latency constraints.

So far, we have achieved part of this goal: we have static snapshots of all of Wikipedia on IPFS. This is already a huge result that will help people access, keep, archive, cite, and distribute lots of content. In particular, we hope that this distribution helps people in Turkey, who find themselves in a tough situation. We are still working out a process for keeping these snapshots updated, and we hope to have someone at Wikimedia in the loop, as they are the authoritative source of the content. If you could help with this, please get in touch with us at wikipedia-project <AT> ipfs.io

(Goal 2) Fully Read-Write Wikipedia on IPFS

The long term goal is to get the full-fledged read-write Wikipedia to work on top of IPFS. This is much more difficult because for a read-write application like Wikipedia to leverage the distributed nature of IPFS, we need to change how the application writes data. A read-write Wikipedia on IPFS could be completely decentralized, making it extremely difficult to censor. In addition to all the benefits of the static version above, the users of a read-write Wikipedia on IPFS could write content from anywhere and publish it, even without being directly connected to any wikipedia.org servers. There would be automatic version control and version history archiving. We could allow people to view, edit, and publish in completely encrypted contexts, which is important to people in highly repressive regions of the world.

A full read-write version (2) would require a strong collaboration with Wikipedia.org itself, and finishing work on important dynamic content challenges -- we are working on all the technology (2) needs, but it's not ready for prime-time yet. We will update when it is.

How to add new Wikipedia snapshots to IPFS

The process can be almost fully automated; however, it consists of many stages, and understanding what happens during each stage is essential whenever the ZIM format changes and our build toolchain needs to be debugged and updated.

  • Manual builds are useful in debugging situations, when a specific stage needs to be executed multiple times to fix a bug.
    • mirrorzim.sh automates some steps for QA purposes and ad-hoc experimentation

Note: This is a work in progress. We intend to make it easy for anyone to create their own wikipedia snapshots and add them to IPFS, making sure those builds are deterministic and auditable, but our first emphasis has been to get the initial snapshots onto the network. This means some of the steps aren't as easy as we want them to be. If you run into trouble, seek help through a github issue, commenting in chat, or by posting a thread on https://discuss.ipfs.tech.

Manual build

If you would like to create an updated Wikipedia snapshot on IPFS, you can follow these steps.

Step 0: Clone this repository

All commands are assumed to be run inside a clone of this repository

Clone the distributed-wikipedia-mirror git repository

$ git clone https://github.com/ipfs/distributed-wikipedia-mirror.git

then cd into that directory

$ cd distributed-wikipedia-mirror

Step 1: Install dependencies

Node and yarn are required. On Mac OS X you will need sha256sum, available in coreutils.

Install the node dependencies:

$ yarn

Then, download the latest zim-tools and add zimdump to your PATH. This tool is necessary for unpacking ZIM.
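
For example, a rough sketch for a Linux x86_64 machine (the release URL and version below are assumptions; check the zim-tools download page for the current one):

$ wget https://download.openzim.org/release/zim-tools/zim-tools_linux-x86_64-3.1.0.tar.gz
$ tar xf zim-tools_linux-x86_64-3.1.0.tar.gz
$ export PATH="$PWD/zim-tools_linux-x86_64-3.1.0:$PATH"
$ zimdump --version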

Step 2: Configure your IPFS Node

It is advised to use a separate IPFS node for this:

$ export IPFS_PATH=/path/to/IPFS_PATH_WIKIPEDIA_MIRROR
$ ipfs init -p server,local-discovery,flatfs,randomports --empty-repo

Tune DHT for speed

Wikipedia has a lot of blocks. To publish them as fast as possible, enable the Accelerated DHT Client:

$ ipfs config --json Experimental.AcceleratedDHTClient true

Tune datastore for speed

Make sure repo uses flatfs with sync set to false:

$ ipfs config --json 'Datastore.Spec.mounts' "$(ipfs config 'Datastore.Spec.mounts' | jq -c '.[0].child.sync=false')"

NOTE: While the badgerv1 datastore is faster in some configurations, we avoid using it for bigger builds like English because of memory issues caused by the number of files. A potential workaround is to use the filestore, which avoids duplicating data and reuses the unpacked files as-is.
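
If you want to try the filestore route, a minimal sketch looks like this (note that --nocopy references the unpacked files in place, so they must stay where they are for as long as the node serves the data):

$ ipfs config --json Experimental.FilestoreEnabled true
$ ipfs add -r --cid-version 1 --nocopy <path-to-unpacked-website>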

HAMT sharding

Make sure you use go-ipfs 0.12 or later, which has automatic sharding of big directories.

Step 3: Download the latest snapshot from kiwix.org

The source of ZIM files is https://download.kiwix.org/zim/wikipedia/. Make sure you download the _all_maxi_ snapshots, as those include images.

To automate this, you can also use the getzim.sh script:

First, download the latest wiki lists using bash ./tools/getzim.sh cache_update

After that, create a download command using bash ./tools/getzim.sh choose; it should print an executable command, e.g.

Download command:
    $ ./tools/getzim.sh download wikipedia wikipedia tr all maxi latest

Running the command will download the chosen zim file to the ./snapshots directory.

Step 4: Unpack the ZIM snapshot

Unpack the ZIM snapshot using zimdump:

$ zimdump dump ./snapshots/wikipedia_tr_all_maxi_2021-01.zim --dir ./tmp/wikipedia_tr_all_maxi_2021-01

ℹ️ ZIM's main page

Each ZIM file has a "main page" attribute which defines the landing page set for the ZIM archive. It is often different from the "main page" of upstream Wikipedia. The Kiwix main page needs to be passed in the next step, so until there is an automated way to determine the "main page" of a ZIM file, you need to open the ZIM in a Kiwix reader (or use zimdump info, as shown in the next step) and note the name of the landing page.

Step 5: Convert the unpacked zim directory to a website with mirror info

IMPORTANT: The snapshots must say who disseminated them. This effort to mirror Wikipedia snapshots is not affiliated with the Wikimedia foundation and is not connected to the volunteers whose contributions are contained in the snapshots. The snapshots must include information explaining that they were created and disseminated by independent parties, not by Wikipedia.

The conversion to a working website and the appending of the necessary information is done by the node program under ./bin/run.

$ node ./bin/run --help

The program requires the main page of both the ZIM file and the online version as inputs. For instance, the ZIM file for Turkish Wikipedia has a main page of Kullanıcı:The_other_Kiwix_guy/Landing, but https://tr.wikipedia.org uses Anasayfa as the main page. Both must be passed to the node script.

To determine the original main page use ./tools/find_main_page_name.sh:

$ ./tools/find_main_page_name.sh tr.wikiquote.org
Anasayfa

To determine the main page in the ZIM file, open it in a Kiwix reader or use zimdump info (version 3.0.0 or later) and ignore the A/ prefix:

$ zimdump info wikipedia_tr_all_maxi_2021-01.zim
count-entries: 1088190
uuid: 840fc82f-8f14-e11e-c185-6112dba6782e
cluster count: 5288
checksum: 50113b4f4ef5ddb62596d361e0707f79
main page: A/Kullanıcı:The_other_Kiwix_guy/Landing
favicon: -/favicon

$ zimdump info wikipedia_tr_all_maxi_2021-01.zim | grep -oP 'main page: A/\K\S+'
Kullanıcı:The_other_Kiwix_guy/Landing

The conversion is done on the unpacked zim directory:

node ./bin/run ./tmp/wikipedia_tr_all_maxi_2021-02 \
  --hostingdnsdomain=tr.wikipedia-on-ipfs.org \
  --zimfile=./snapshots/wikipedia_tr_all_maxi_2021-02.zim \
  --kiwixmainpage=Kullanıcı:The_other_Kiwix_guy/Landing \
  --mainpage=Anasayfa

Step 6: Import website directory to IPFS

Increase the limitation of opening files

In some cases, you will hit an error like could not create socket: Too many open files when you add files to the IPFS store. It happens when IPFS needs to open more files than the operating system allows. You can temporarily raise this limit to avoid the error using this command:

ulimit -n 65536

Add immutable copy

Add all the data to your node using ipfs add. Use the following command, replacing $unpacked_wiki with the path to the website that you created in Step 5 (e.g. ./tmp/wikipedia_en_all_maxi_2018-10).

$ ipfs add -r --cid-version 1 --offline $unpacked_wiki

Save the last hash of the output from the above process. It is the CID of the website.
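
If you prefer to capture only the root CID, the -Q (quieter) flag prints just the final hash, e.g.:

$ export ROOT_CID=$(ipfs add -r -Q --cid-version 1 --offline $unpacked_wiki)
$ echo $ROOT_CID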

Step 7: Share the root CID

Share the CID of your new snapshot so people can access it and replicate it onto their machines.

Step 8: Update *.wikipedia-on-ipfs.org

Make sure at least two full reliable copies exist before updating DNSLink.
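
The DNSLink itself is a DNS TXT record on the _dnslink subdomain of the mirror hostname. A sketch for the Turkish mirror (the CID is a placeholder, and how the record is published depends on the DNS provider):

_dnslink.tr.wikipedia-on-ipfs.org.  TXT  "dnslink=/ipfs/bafy..."

You can verify the record with:

$ dig +short TXT _dnslink.tr.wikipedia-on-ipfs.org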

mirrorzim.sh

It is possible to automate steps 3-6 via a wrapper script named mirrorzim.sh. It will download the latest snapshot of the specified language (if needed), unpack it, and add it to IPFS.

To see how the script behaves try running it on one of the smallest wikis, such as cu:

$ ./mirrorzim.sh --languagecode=cu --wikitype=wikipedia --hostingdnsdomain=cu.wikipedia-on-ipfs.org

Docker build

A Dockerfile with all the software requirements is provided. For now it is only a handy container for running the process on non-Linux systems, or if you don't want to pollute your system with all the dependencies. In the future it will be an end-to-end black box that takes a ZIM file and spits out a CID and repo.

To build the docker image:

docker build . -t distributed-wikipedia-mirror-build

To use it as a development environment:

docker run -it -v $(pwd):/root/distributed-wikipedia-mirror --net=host --entrypoint bash distributed-wikipedia-mirror-build

How to Help

If you don't mind the command line interface and have a lot of disk space, bandwidth, or coding skills, continue reading.

Share mirror CID with people who can't trust DNS

Sharing a CID instead of a DNS name is useful when DNS is not reliable or trustworthy. The latest CID for a specific language mirror can be found via DNSLink:

$ ipfs resolve -r /ipns/tr.wikipedia-on-ipfs.org
/ipfs/bafy..

The CID can then be opened via ipfs://bafy.. in a web browser with the IPFS Companion extension resolving IPFS addresses via an IPFS Desktop node.

You can also try Brave browser, which ships with native support for IPFS.

Cohost a lazy copy

Using MFS makes it easier to protect snapshots from being garbage collected than low-level pinning, because you can assign meaningful names and it won't prefetch any blocks unless you explicitly ask.

Every mirrored Wikipedia article you visit will be added to your lazy copy and contribute to your partial mirror, so you won't need to host the entire thing.

To cohost a lazy copy, execute:

$ export LNG="tr"
$ ipfs files mkdir -p /wikipedia-mirror/$LNG
$ ipfs files cp $(ipfs resolve -r /ipns/$LNG.wikipedia-on-ipfs.org) /wikipedia-mirror/$LNG/${LNG}_$(date +%F_%T)

Then simply start browsing the $LNG.wikipedia-on-ipfs.org site via your node. Every visited page will be cached, cohosted, and protected from garbage collection.
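
To see how much of the snapshot you are cohosting so far, you can ask MFS for local-only totals (a sketch; --with-local walks the DAG, so it can take a while on large mirrors):

$ ipfs files stat --with-local /wikipedia-mirror/$LNG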

Cohost a full copy

The steps are the same as for a lazy copy, but you run an additional preload after the lazy copy is in place:

$ # export LNG="tr"
$ ipfs refs -r /ipns/$LNG.wikipedia-on-ipfs.org

Before you execute this, check if you have enough disk space to fit CumulativeSize:

$ # export LNG="tr"
$ ipfs object stat --human /ipns/$LNG.wikipedia-on-ipfs.org
NumLinks:       5
BlockSize:      281
LinksSize:      251
DataSize:       30
CumulativeSize: 15 GB
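
To compare that against what you actually have available, a quick sanity check (a sketch; adjust the path if you use a custom $IPFS_PATH):

$ ipfs repo stat --human
$ df -h ~/.ipfs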

We are working on improving deduplication between snapshots, but for now YMMV.

Code

If you would like to contribute more to this effort, look at the issues in this github repo. Especially check for issues marked with the "wishlist" label and issues marked "help wanted".

distributed-wikipedia-mirror's People

Contributors

aschmahmann, fledgexu, flyingzumwalt, hsanjuan, ipfs-mgmt-read-write[bot], jbenet, kanej, kubuxu, lidel, mkg20001, momack2, punkchameleon, victorb, web-flow


distributed-wikipedia-mirror's Issues

Add Arabic and Kurdish snapshots

Add Snapshots of http://ar.wikipedia.org and http://ku.wikipedia.org

The IPNS entries for these snapshots will be (we can update these links to point to the hash of the most recent snapshot):

When the corresponding snapshots are ready, we will update those IPNS entries to point to the snapshots. We will also announce the links to those snapshots in comments on this issue and on the ipfs blog.

Insert custom footer into snapshots

Make a body.js file in this repo that inserts our footer instead of the original social media links. When we build wikipedia snapshots we will replace the original body.js with this new body.js so the footer includes a clear statement about who generated the snapshots. (see #13)

Add the new file to #16

[BOUNTY] Fix script responsible for preparing IPFS mirror

BOUNTY: $500 (how to claim?)

Summary

  • We unpack ZIM and put it on IPFS as a regular HTML+JS+CSS website.
  • Before it's published we customize the JS so that a footer is attached to every page, informing the reader that it is an unofficial Wikipedia mirror
    • The build inserts custom footer into snapshots (#17, #15).
    • Existing scripts no longer work and need to be updated/redone before we start creating new snapshots (this is blocking #58, #61, #60)

TODO

Pick up where PR at #67 ended and update execute-changes.sh script to:

  • Ensure there are no JS errors when pages are loaded
  • Make it possible to navigate to other articles
    • ensure relative paths work
      • on https://ipfs.io/ipns/<cid>/wiki/
      • on https://<cid>.ipfs.dweb.link/wiki/
    • When unpacked with extract_zim, all pages are named ArticleName.html but they link to other article names without .html
      • idea: for every article, create ArticleName/index.html with a redirect page to ArticleName.html (similar to this; see the sketch after this list)
  • Custom footer needs to be appended to every page
  • Update footer contents
    • add link to article snapshot at original Wikipedia
      • oldid= links can be found in page sources, for example:
        href="https://tr.wikipedia.org/w/index.php?title=<title>&amp;oldid=<timestamp>"
    • add link to the source .zim file
      • for now it can be link at download.kiwix.org, in the future it will be .zim on IPFS
    • remove logos/buttons of centralized services
    • include information on takedown policy / contact (eg. if latest snapshot includes information removed in upstream wikipedia)
  • Restore original Main Page
    • every wikipedia has a Main Page under different name
      • there are scripts to find out that name and fetch original for the right snapshot, see work started in bb9f48c
      • what needs to happen is to download original, fix it up to work locally and save it as /wiki/index.html
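
A rough shell sketch of the redirect idea above (it assumes articles were unpacked as flat ArticleName.html files under A/; the paths are illustrative, and article names containing slashes would need extra handling):

for f in ./tmp/wikipedia_tr_all_maxi_2019-12/A/*.html; do
  name="$(basename "$f" .html)"
  mkdir -p "./tmp/wikipedia_tr_all_maxi_2019-12/A/$name"
  printf '<meta http-equiv="refresh" content="0; url=../%s.html">\n' "$name" \
    > "./tmp/wikipedia_tr_all_maxi_2019-12/A/$name/index.html"
done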

Acceptance Criteria

  • PR with necessary changes is submitted and merged to this repo
  • Script works and enables us to produce updated IPFS mirror of the latest Turkish snapshot: wikipedia_tr_all_maxi_2019-12.zim
    • CID of a demo output is provided

spec: worker for building snapshots

design a worker that:

  1. uses the script from #35 to
    • download data from a specified HTTP source
    • unpack the data and write it to IPFS
    • run a script (from this repo) that modifies the data (using IPFS files API)
    • pin the resulting data somewhere
  2. return the hash of the result
  3. (maybe) submit a PR to https://github.com/ipfs/distributed-wikipedia-mirror with the new hash

Optional:

Make the workers follow a queue.

links should be exactly the same as wikipedia

  • pages have “.html” at the end, can we get no “.html”? (with directories + index.html or whatever)
  • i want the links to be the same so people can just change a prefix somewhere and have everything else just work
  • are media links the same as they are on wikipedia.org?

Block internet search engines from indexing the mirror

If possible, are you able to make your mirror non-indexed by internet search engines? There is very minimal benefit for clearnet users to run across three (WMF, WikiVisually and ipfs) different copies of the Wikipedia article every time they search for something.

Rendering pages from XML sources instead of kiwix HTML dumps

ZIM files are a nice source but they are a bit limiting. The layout is fixed (and it isn't Wikipedia-like), and they are updated much more rarely than XML dumps.

Rendering the XML ourselves would be quite an effort, so I think it is a long-term goal.

They are also a good source of MediaWiki assets (as no per-language dumps of assets are available).

Add English Snapshot

This is taking longer than the other languages because the english wikipedia dump is more than 20x bigger than the others.

Add Snapshot of http://en.wikipedia.org

The IPNS entry for this snapshot will be (we can update this link to point to the hash of the most recent snapshot):

When the corresponding snapshot is ready, we will update that IPNS entry to point to the snapshot. We will also announce the link to that snapshot in comments on this issue and on the ipfs blog.

Search

Currently there is no search available in the IPFS version.

The Turkish version has 521k titles including redirects (not sure right now how many without), which weigh a total of 13 MiB (3 MiB gzip-compressed). The challenge would be writing a fuzzy search over them in ways that:

  • don't bring the browser to its knees
  • saves bandwidth (it is a one-time download if a local IPFS node is running, but downloading it right away for everyone is still a lot), so probably download the gzip version.
  • is fast; uncompressing the gzip version every time to do the search is wasteful, and the English wiki has 13M articles.
  • supports Unicode

This might be a place to investigate some precalculated data structure that could be stored in IPFS to improve the search, both speed- and bandwidth-wise. If the data structure is sharded in IPFS, there should be a button to download the whole search index.

Modify snapshots to clearly declare that they were not created by wikipedia

If we publish unmodified snapshots of Wikipedia it may confuse visitors, leading them to believe that the snapshots were published by Wikipedia or by volunteers who contribute to Wikipedia in Turkey. This could present a real risk for those volunteers, who have nothing to do with our snapshots, and may harm Wikipedia, which did not ask us to do this.

I propose modifying the snapshots to:

  1. Remove references to the Wikimedia user group in Turkey from the bottom of the page

  2. Adjust the top of the page to distinguish it from Wikipedia (by removing the puzzle globe logo and adding a prominent explanation that this is an independent project, not affiliated with Wikipedia)

This will force us to scrap all of the snapshots we've built and start over with modified code, which will delay the release of the snapshots by at least a few days.

"Error: file does not exist" when trying to add CA wikipedia

Got a dump whose hash is QmXq9FMaTYKU6sY91XZyZvuFsee165FuGCyHvWQrQrwk33

Trying to run the final step, running ./execute-changes.sh. Running it without arguments prompts me for <ipfs files root>, so I gave it the location of the root file I copied in the previous step. I end up with the command ./execute-changes.sh /root, but that gives me the Error: file does not exist error.

This is the full execution:

+ IFS='
        '
++ getopt --test
++ echo 4
+ '[' 4 -ne 4 ']'
+ LONG_OPT=help,search:,ipns:,date:,main:
+ SHORT_OPT=h
++ getopt -n ./execute-changes.sh -o h -l help,search:,ipns:,date:,main: -- /root
+ PARSED_OPTS=' -- '\''/root'\'''
+ eval set -- ' -- '\''/root'\'''
++ set -- -- /root
++ date +%Y-%m-%d
+ SNAP_DATE=2017-10-08
+ IPNS_HASH=
+ SEARCH=
+ MAIN=index.htm
+ true
+ case "$1" in
+ shift
+ break
+ '[' -z /root ']'
+ ROOT=/root
+ ipfs files stat /root/A
++ sed -e 's/{{SNAPSHOT_DATE}}/2017-10-08/g' -e 's/{{IPNS_HASH}}//g' scripts/body.js
++ ipfs add -Q
++ '[' -n '' ']'
++ cat -
+ NEW_BODYJS=QmYGpCuGLKAkF4fyENs5UgWgfx6qwLhY7yKpWSFSvSvV3F
+ ipfs-replace -/j/body.js /ipfs/QmYGpCuGLKAkF4fyENs5UgWgfx6qwLhY7yKpWSFSvSvV3F
+ ipfs files rm /root/-/j/body.js
+ true
+ ipfs files --flush=false cp /ipfs/QmYGpCuGLKAkF4fyENs5UgWgfx6qwLhY7yKpWSFSvSvV3F /root/-/j/body.js
Error: file does not exist

wrapper script for building snapshots

As a person who wants to build snapshots, I want to build a snapshot for a specific language by

  • cloning this repo
  • running a single script with a simple command (just providing the language code)

Completion state:

This repository provides a script that takes one required argument: the language code of the snapshot you want to build.
Based on that argument and the info in this repo, it

  • pulls from kiwix
  • unpacks the dump
  • modifies the dump
  • writes the result to ipfs
  • reports the resulting hash

Update tr.wikipedia-on-ipfs.org

  • create new (test) snapshot
    • couldn't extract fully, some files fail with other os error (dignifiedquire/zim#3)
    • fixed in extract_zim v0.2.0
  • ensure canonical link is correct (#48 (comment) + #65)
    • present in wikipedia_tr_all_maxi_2019-10.zim
  • ensure footer is updated (#64)
  • identify landing page parameter for execute-changes.sh
  • fix any broken JS by execute-changes.sh
  • recreate snapshot if needed
  • pin
    • set up collaborative pinning cluster (#68)
  • update DNSLink at tr.wikipedia-on-ipfs.org

This could be done manually or as a part of #58

Write a script that modifies ZIM dumps to include info about how they were generated

  1. Replace out/-/j/body.js with the /scripts/body.js from this repo
  2. In that copy of body.js, replace these placeholders with the relevant values
    • {{SNAPSHOT_DATE}} -- date that this snapshot is being generated
    • {{IPNS_HASH}} - IPNS hash for this version of wikipedia (probably corresponds to the language code. ie. tr.wikipedia.org probably has its own ipns hash)
  3. Copy /assets/wikipedia-on-ipfs.png into the root of the snapshot
  4. Copy /assets/wikipedia-on-ipfs-small-flat-cropped-offset-min.png to /out/I/s/Wikipedia-logo-v2-200px-transparent.png in the snapshot
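
A rough shell sketch of these steps (not the repo's actual script; it assumes the unpacked snapshot lives in ./out and that IPNS_HASH has been set beforehand):

SNAPSHOT_DATE=$(date +%Y-%m-%d)
sed -e "s/{{SNAPSHOT_DATE}}/$SNAPSHOT_DATE/g" \
    -e "s|{{IPNS_HASH}}|$IPNS_HASH|g" \
    scripts/body.js > ./out/-/j/body.js
cp assets/wikipedia-on-ipfs.png ./out/
cp assets/wikipedia-on-ipfs-small-flat-cropped-offset-min.png \
   ./out/I/s/Wikipedia-logo-v2-200px-transparent.png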

When you're done writing this script, please

  • add the script to this repo
  • Update the instructions in the repo's README.md to include the new step of running this script.

Add Arabic snapshot

blocked by #26 , replaces #14

Add Snapshot of http://ar.wikipedia.org

The IPNS entry for this snapshot will be (we can update this link to point to the hash of the most recent snapshot):

When the corresponding snapshot is ready, we will update that IPNS entry to point to the snapshot. We will also announce the link to that snapshot in comments on this issue and on the ipfs blog.


Lots of IPFS errors with Turkish wikipedia mirror (loading fine nonetheless)

The IPFS mirror of the Turkish wikipedia (via IPNS) loads fine, but lots of errors are popping up:

22:01:30.949 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Mameluke_Flag.svg.png: Failed to get block for zb2rhm7VzayQudATK8W2u37aGhPmEFknTou6zQXxCSHj6JYte: context canceled gateway_handler.go:548
22:01:30.966 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Flag_of_Greece_(1822-1978).svg.png: Failed to get block for QmP7hqsgd8NX2aCRUS3oGbgmM6ZSWaW21j7eXZ1ba7FVTv: context canceled gateway_handler.go:548
22:01:30.966 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Coa_Croatia_Country_History_(Fojnica_Armorial).svg.png: Failed to get block for QmVfQXFGao22h7EvPuH66dakgj4FxkN98cuYNYFKr6bCoG: context canceled gateway_handler.go:548
22:01:30.967 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Coat_of_arms_of_Hungary_(historic_design).png: Failed to get block for zb2rhX18u3QKYQQqeJtfE7kjHdNWtj9MhmZMNKNGoLLeT2Vmn: context canceled gateway_handler.go:548
22:01:30.969 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Coa_Kastrioti_Family.svg.png: Failed to get block for QmVrXEXLDLgGtdA5avsnAsfEwEFWXv1gkqYutc338TP6W8: context canceled gateway_handler.go:548
22:01:30.975 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Flag_of_Turkey.svg.png: Failed to get block for QmNhRZWFoKfeuwbnpf35J9g1NzZCzSfHqPr1WiHqb8kSVf: context canceled gateway_handler.go:548
22:02:10.846 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Botanik_Parkı.jpg: Failed to get block for QmbemscoX5aXoNXRTTJ1adbysDTTZFMbt6YwZVynkn2dhL: context canceled gateway_handler.go:548
22:02:10.850 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Harikalar_Diyari_Gulliver_Lilliputians_06034_nevit.jpg: Failed to get block for QmPvPmfSc9GcDriwZA99AD3MHqKbpTVTKiuU2KVJBdTtdr: context canceled gateway_handler.go:548
22:02:10.856 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Çayyolu_Metro_19.JPG: Failed to get block for QmXxcZjuJHq2qnxGEedooW1WNdHtzL77UmwsxHSEBpoHHj: context canceled gateway_handler.go:548
22:02:10.857 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Ankara_Central_Station_2012_03.JPG: Failed to get block for QmRfdvNsndpiNEbNTuH1ZaDfrrMsa1WNDKMNoT7HBcXkzY: context canceled gateway_handler.go:548
22:02:10.880 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Ankara_road_map.png: Failed to get block for zb2rhgcucL97ksC69WTWyfp7SF7FQPkZi6T6gz79qxNaGM2Ca: context canceled gateway_handler.go:548
22:02:10.884 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Dirus-Roads_of_Ankara.svg.png: Failed to get block for zb2rhiF3XBBBvTDhYAhw7qgcEB1FRtCrATUgZmEJ38bxQ9tjv: context canceled gateway_handler.go:548
22:02:17.185 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Desc-20.png: Failed to get block for QmXu8c6yjTpxNgiqmBnEjXCGRNVCNXpFzy6QdBPdef7q8c: context canceled gateway_handler.go:548
22:02:17.185 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/BlankMapTurkeyProvinces.png: Failed to get block for QmSDKbXfco38V8CR7niButc81cogiW9m1QaXHaxpocBTvD: context canceled gateway_handler.go:548
22:02:17.218 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Mausoleo_Mevlana.jpg: Failed to get block for zb2rhmWtXQ19xRDHSnWtHCzLk6tfVmq3wFFkf5FbUWsksFrta: context canceled gateway_handler.go:548
22:02:17.222 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Dolmabahçe_Palace_(cropped).JPG: Failed to get block for QmUXuCuWnqYE82sCYNQnn35Kk4yHgkHSgeNdcgymdWc3xj: context canceled gateway_handler.go:548
22:02:17.267 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Mustafa_Kemal_Atatürk_.jpg: Failed to get block for Qmbw36E5GkUSfnZ2jida2ufKNSEejpXwCH1eUDhgZBdB23: context canceled gateway_handler.go:548
22:02:17.323 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Recep_Tayyip_Erdogan.PNG: Failed to get block for QmaAUgvy4fEK76RdKkEVfYErR3LCF9iFbBqP4tui3kmRbE: context canceled gateway_handler.go:548
22:02:23.634 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Erdoğan_(2014_cumhurbaşkanlığı_seçim_logosu).jpg: Failed to get block for QmXqbHEuXywNnxQyygjZZpxVaaf2tWffrG7gUz6Tt4hTjS: context canceled gateway_handler.go:548
22:02:23.637 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Cari_islemler_dengesi.png: Failed to get block for QmRx4esjoYLmBV7nC4i1ZrqFSMAoqJ3bRgS5dK7fJ8LvdX: context canceled gateway_handler.go:548
22:02:23.637 ERROR core/serve: ipfs resolve -r /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/I/m/Family_photo_of_Council_of_ministers_of_Turkey_and_Spain.jpg: Failed to get block for QmQAe9RCdiuHunR6mmuMjYA2X8LGiWrYybPPLYgUy6Z57b: context canceled gateway_handler.go:548
22:02:23.704 ERROR commands/h: err: write tcp4 127.0.0.1:8080->127.0.0.1:57281: write: protocol wrong type for socket handler.go:288
22:04:34.318 ERROR core/serve: ipfs resolve -r /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/I/m/c326d317eddef3ad3e6625e018a708e290a039f6.svg: Failed to get block for zb2rhXQVxo883MW1cS2mYa3wYSv2ZLiQTmbn2EFrUXa1DSZUi: context canceled gateway_handler.go:548
22:04:34.319 ERROR core/serve: ipfs resolve -r /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/I/m/c5d0431ce231935522dc0cb52df7f2b406cdadc3.svg: Failed to get block for zb2rhfsuyFnUf9j4BJx27sNzNB31mYSkdDQHgEWCLB5vxaDq8: context canceled gateway_handler.go:548
22:04:34.320 ERROR core/serve: ipfs resolve -r /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/I/m/e1d67495288eac0fa90d5bbcad7d9a343c15ad56.svg: Failed to get block for zb2rhkXbtbZ9HpyHJixHRs6ioKjktd1ZJCiGfTNAK29BS8hjb: context canceled gateway_handler.go:548
22:04:34.320 ERROR core/serve: ipfs resolve -r /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/I/m/504dc030b18a6fcb9575b9c70b2d9314e86ece5e.svg: Failed to get block for QmemtQK1z99jDhqsHgnwihX3PjiWJpjHRA6PfXN6oCFeQV: context canceled gateway_handler.go:548
22:04:34.321 ERROR core/serve: ipfs resolve -r /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/I/m/Yüzey_sınıflandırması.jpg: Failed to get block for zb2rhnA149yrwMBERK9q35ZosesLekqEDn5ExhRjZYZi1xUn6: context canceled gateway_handler.go:548
22:04:34.322 ERROR core/serve: ipfs resolve -r /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/I/m/21de672b1953817ed423e8f4c008498a81341292.svg: Failed to get block for zb2rhaDTRHNvVM6PW88ih4J7euvotR34SozzZ1dvDdRbMe6sj: context canceled gateway_handler.go:548

Provide recommendations for using InterPlanetary Test Lab to generate wikipedia snapshots

@FrankPetrilli in order to follow the data control plan described here we need a way to spin up workers based on a configuration (or docker file) and then use those workers to

  • download data from a specified HTTP source
  • unpack the data and write it to IPFS
  • run a script (from this repo) that modifies the data (using IPFS files API)
  • pin the resulting data somewhere
  • return the hash of the result

This overlaps with the kind of stuff that we do with InterPlanetary Test Lab. How hard would it be to set this up?

Turkish Mirror Doesn't have Page about the referendum

The Turkish referendum was the inciting incident that led Erdogan to block Wikipedia. That is because it mentioned how he stole the election. It would be excellent if you could update that single page, if at all possible, before this goes live.

Make `execute-changes` script more generally usable

Make the script from #18, which is in https://github.com/ipfs/distributed-wikipedia-mirror/blob/master/execute-changes.sh more generally usable. This is needed in order to fulfill #14 and in order to make it easier for other people to generate their own snapshots.

As a hacker who wants to add new wikipedia snapshots to IPFS in the language of my choice, I should be able to follow a clear set of instructions that allow me to download a zim dump, add it to IPFS, modify it with this script and then publish the resulting hash. The instructions should be clear, the configuration should be simple and it should be easy for me to set the correct IPNS hash and snapshot date based on the language version I'm adding.

Completion Requirements

  • the shell script works with minimal pre-configuration of your system
  • the shell script's documentation clearly declares what you need to do in order to run it
  • the shell script makes it easy or completely transparent to set the correct IPNS hash and snapshot date based on the current date and the language of the current snapshot
  • the readme at the root of this repo, or a page it links to, contains complete and accurate instructions for using this script when you're creating a snapshot

Citation links broken

Assuming we are on page /ipfs/QmRoot/wiki/SomePage.html, when we try to click a citation link, instead of scrolling the page with a #cite_note-xx anchor, it tries to load /ipfs/QmRoot/wiki/SomePage#cite_note-xx, which fails, as SomePage does not exist.

Related to #2

Gather background info from other repositories and add to this one

Background info

The idea of putting Wikipedia on IPFS has been around for a while. Every few months or so someone revives the threads. You can find such discussions in this github issue about archiving wikipedia, this issue about possible integrations with Wikipedia, and this proposal for a new project.

what's missing?

There's an even bigger, longer-running conversation about decentralizing wikipedia. See https://strategy.m.wikimedia.org/wiki/Proposal:Distributed_Wikipedia

Current Work

Add OS-native installers to ipfs station

IPFS Station is an electron app that installs and runs IPFS for you. We want to make it easier to install station by adding a package builder to IPFS station that generates installers for Win, Mac and Linux. This will make it possible for everyday users to run their own copy of IPFS and access it locally without using the command line.

More info and instructions in the ipfs station repository here: ipfs/ipfs-desktop#508

Add all the other wikipedia snapshots

Many countries have blocked Wikipedia over the years, and Wikipedia has also suffered from DDOS attacks causing service drops in various regions. We should broaden our current list of Wikipedia mirrors to include other languages. I propose doing the languages with the most users - as of now there are 13 languages with more than 1M users and 6 languages with more than 2.5M users.

Language     Wiki  Users     Active Users  Good Pages  Total Pages  On IPFS
English      en    37103876  124823        5926367     48505470
Spanish      es    5543043   15373         1543777     6782414
French       fr    3544188   16349         2138049     10319452
German       de    3271376   18550         2341066     6547819
Chinese      zh    2803562   8437          1073112     5907531
Russian      ru    2589568   10331         1567312     5987865
Portuguese   pt    2297005   5696          1013328     4890332
Italian      it    1867701   8066          1551915     6360102
Arabic       ar    1709753   4342          949434      5651012
Japanese     ja    1527805   14061         1167744     3460642
Indonesian   id    1087372   2786          502154      2604650
Turkish      tr    1042570   683           333283      1666749
Dutch        nl    1018242   3866          1978041     4116417

For each:

  • create new snapshot of https://<lang>.wikipedia.org/wiki/
  • ensure canonical link is correct
    • this is blocked on upstream snapshots or updating our scripts: #48 (comment) + #65
  • pin
  • create and update DNSLink at <lang>.wikipedia-on-ipfs.org

Any updates? News? etc?

Is this still actively maintained? Is there any roadmap? When will the mirror be read-write?

Turkish IPFS mirror doesn't work with localhost redirect: core/serve error

When I deactivate my localhost redirect script, the turkish WP mirror loads fine (both ipfs snapshot & ipns current version) via the ipfs.io gateway. After enabling the userscript, the redirect to localhost works, i.e. the address has the correct hash etc., but browser & console print error messages:

00:35:07.573 ERROR core/serve: ipfs cat /ipns/QmVH1VzGBydSfmNG7rmdDjAeBZ71UVeEahVbNpFQtwZK8W/wiki/Anasayfa.html: no link named "wiki" under QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX gateway_handler.go:525

00:37:51.368 ERROR core/serve: ipfs cat /ipfs/QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX/wiki/Anasayfa.html: no link named "wiki" under QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX gateway_handler.go:525

Anyone know what's producing the error? Other ipfs/ipns address redirects seem to work fine.

EDIT: direct input of localhost addresses produces the same errors.

Handle the logo override in a cleaner way

What's in the Snapshot

The style.js sets the logo with this css style:

.globegris{background-image:url(../../I/s/Wikipedia-logo-v2-200px-transparent.png) }

the logo is offset by a background-position style in the HTML of the page:

<td class="globegris" style="background-repeat:no-repeat; background-position:-40px -15px; width:100%; border:1px solid #a7d7f9; vertical-align:top;">

Current Hack

  1. I made a version of the wikipedia-on-ipfs logo that corrects for the offset: https://github.com/ipfs/distributed-wikipedia-mirror/blob/master/assets/wikipedia-on-ipfs-offset.png
  2. The script from #18, which needs to be run on ZIM dumps before adding them to IPFS, will copy that offset logo to /out/I/s/Wikipedia-logo-v2-200px-transparent.png in the snapshot

Alternatively, we can modify the style.css, but we still need to deal with the offset unless we want to modify the style element in the HTML.

What needs to happen

  • confirm which approach we should use. If we want to use a different approach, update the assets and #18 accordingly
  • (if possible) make a version of the offset logo with better anti-aliasing on the resized text.

Automate snapshot updates

This is a placeholder issue.
Will be updated with more details when we gain better understanding of what is needed here.

In the long run, we want to introduce CI/CD automation that does something along these lines:

Then, a maintainer would review the PR and merge it.
Updating the manifest in master would trigger an update of the DNSLink under <lang>.wikipedia-on-ipfs.org, propagating the change to the collaborative cluster, etc.
