How to mirror Wikipedia

As part of CTWUG, we are always on the lookout for new services to run on our network that could prove useful to its users.

Our latest idea is to provide users with a local copy of the Wikipedia website. This would let CTWUG users view Wikipedia content at local network speeds without needing an internet connection.

After extensive research I came across a nice tutorial on how to run your own copy of Wikipedia from a database dump of the real Wikipedia website. Who would have thought that Wikipedia would offer a monthly database dump of their website for users to download? The good news is that they do.

I’ll provide the steps to follow to set up your own Wikipedia mirror. Please note that basic knowledge of Linux, bash, Apache and MySQL is required. The steps are for installing the mirror on an Ubuntu machine.

So here are the steps for setting up your own Wikipedia mirror from their database dump.

  1. Install LAMP: Linux Apache MySQL PHP
    apt-get update
    apt-get install apache2 php5 libapache2-mod-php5 mysql-server mysql-client php5-mysql phpmyadmin
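
As a quick sanity check after the install, something like the following sketch can confirm that the tools used in the later steps are on your PATH. The function name missing_commands is made up for this example, and the list of commands in the usage line (apache2, mysql, wget, bzcat) is my assumption about what the packages above and a stock Ubuntu provide.

```shell
#!/bin/sh
# Print the name of every given command that is NOT found on the PATH.
# (missing_commands is a hypothetical helper, not part of any package.)
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example: list anything still missing for the later steps.
#   missing_commands apache2 mysql wget bzcat
```

If the example prints nothing, everything it checked was found.
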
  2. Set up MySQL: You need to set your MySQL root password
    $ mysql
    mysql> USE mysql;
    mysql> UPDATE user SET Password=PASSWORD('new-password') WHERE user='root';
    mysql> FLUSH PRIVILEGES;

    You also need to create a database for your incoming Wikipedia. Go to http://localhost/ and click on phpmyadmin. Log in using your new root password. Under Create new database, enter wikidb and click Create. On the new page, click on Privileges, add the new user wikiuser and click check all, then Go.
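
If you would rather not click through phpMyAdmin, the same database and user can probably be created from the command line. This is only a sketch under a few assumptions: wiki_db_sql is a made-up helper that just prints SQL, 'change-me' is a placeholder password, and unlike "check all" in phpMyAdmin it grants wikiuser privileges on wikidb only, which I am assuming is enough for MediaWiki.

```shell
#!/bin/sh
# Print the SQL that creates the wiki database and its user.
# (wiki_db_sql is a hypothetical helper; the password must not
# contain a single quote, since it is pasted into the SQL as-is.)
# Usage: wiki_db_sql <database> <user> <password>
wiki_db_sql() {
    db=$1
    user=$2
    pass=$3
    printf "CREATE DATABASE IF NOT EXISTS %s;\n" "$db"
    printf "CREATE USER '%s'@'localhost' IDENTIFIED BY '%s';\n" "$user" "$pass"
    printf "GRANT ALL PRIVILEGES ON %s.* TO '%s'@'localhost';\n" "$db" "$user"
    printf "FLUSH PRIVILEGES;\n"
}

# Example (prompts for the MySQL root password set earlier):
#   wiki_db_sql wikidb wikiuser 'change-me' | mysql -u root -p
```

Printing the SQL first, rather than running it directly, lets you read the statements before piping them into mysql.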

  3. Download the MediaWiki software: This is the software Wikipedia runs on. Go to the MediaWiki download page. On the right, download the .tar.gz file.
    wget http://download.wikimedia.org/mediawiki/1.15/mediawiki-1.15.1.tar.gz

    Decompress it and move it to /var/www/

    tar xf mediawiki-1.15.1.tar.gz
    mv mediawiki-1.15.1 wikipedia
    sudo mv wikipedia /var/www/

    I am installing it under the directory wikipedia.
    Make the config directory writable so the installer can save LocalSettings.php there

    cd /var/www/wikipedia/
    chmod a+w config/

    Now navigate to http://localhost/wikipedia/ From here, the only things you need to put in are

    • Site name (I chose WikiMirror)
    • WikiSysop’s password (The administrator password)
    • DB password

    Now you need to move LocalSettings.php out of config.

    mv config/LocalSettings.php .

    Now you can go to http://localhost/wikipedia/ and you should see your fresh MediaWiki install!

  4. Get Wikipedia’s database dump: You can get the latest version of Wikipedia’s database dump by subscribing to http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2-rss.xml. The latest file I got was http://download.wikimedia.org/enwiki/20091009/enwiki-20091009-pages-articles.xml.bz2
    wget http://download.wikimedia.org/enwiki/20091009/enwiki-20091009-pages-articles.xml.bz2

    The file is 5.2 GB, so it should take a while to download.
    After downloading, decompress the file. It is a bzip2 file, not a tar archive, so use bunzip2

    bunzip2 enwiki-20091009-pages-articles.xml.bz2

    The uncompressed size is almost 20 GB so be sure you have enough disk space available.
    Now for the lengthy part of the process: you need to import the file into your MySQL database.
    Download mwimport.sh, save it, make it executable (chmod +x mwimport.sh) and run it like this

    cat enwiki-<date>-pages-articles.xml | ./mwimport.sh | mysql -f -u <admin name> -p <database name>

    This process takes a while: roughly 7 to 12 hours, depending on your disk speed and processor.
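
As a commenter below points out, the dump can also be streamed straight from the .bz2 file with bzcat, which avoids writing the 20 GB of XML to disk. The sketch below wraps that pipeline with two small checks. The function names has_free_space and import_dump are made up for this example, and it assumes mwimport.sh sits in the current directory and is executable. The database will still need disk space of its own; I don't have a figure for how much.

```shell
#!/bin/sh
# Succeed if the filesystem holding <dir> has at least <need_kb> KiB free.
# Usage: has_free_space <dir> <need_kb>
has_free_space() {
    dir=$1
    need_kb=$2
    # df -Pk prints a header plus one data line; column 4 is available KiB.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$need_kb" ]
}

# Stream a compressed dump through mwimport.sh into MySQL.
# mysql will prompt for the password because of the bare -p.
# Usage: import_dump <dump.xml.bz2> <mysql user> <database>
import_dump() {
    dump=$1
    user=$2
    db=$3
    if [ ! -r "$dump" ]; then
        echo "cannot read $dump" >&2
        return 1
    fi
    if [ ! -x ./mwimport.sh ]; then
        echo "./mwimport.sh is missing or not executable" >&2
        return 1
    fi
    bzcat "$dump" | ./mwimport.sh | mysql -f -u "$user" -p "$db"
}

# Example:
#   has_free_space . $((20 * 1024 * 1024)) || echo "less than 20 GB free here"
#   import_dump enwiki-20091009-pages-articles.xml.bz2 root wikidb
```

The free-space check only matters if you do decompress the file first; with the streaming import the XML never touches the disk.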

That should be all. If all went well, you will have a complete working copy of Wikipedia on your local machine. CTWUG members can look forward to this service very soon.


8 Responses to How to mirror Wikipedia

  1. Dan December 14, 2009 at 6:30 am #

    Just curious, how much memory was on the system that you did this on, and did you modify any of the settings for mySQL?

  2. Ken D'Ambrosio February 23, 2011 at 11:30 pm #

    Couple things:
    1) Yes, a tweak to /etc/mysql/my.cnf was required; I changed max_allowed_packet to 128 MB. (It was 16 MB. 128 was probably overkill… but hey, better safe than sorry.)
    2) No need to de-compress the .bz2 — and I don’t know if you could even *do* that with tar, since a .bz2 is a bzip’d file, and not a tar archive. Instead, I used the following:
    bzcat enwiki-[...]-pages-articles.xml.bz2 | mwimport | mysql -p -f -u
    3) Note that you can’t just do a “wget” on the mwimport link above — that’s a link to a mediawiki page that, in turn, has text you need to stuff into an executable, and then chmod +x on.

  3. Matt November 15, 2011 at 12:34 am #

    excellent walk through, and the only definitive guide that i could find. thank you very much.

    a pretty consistent link to download the latest wikipedia pages would be:
    http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

    in the end i had a little bit of trouble trying to get mwimport to be recognized/found. make sure to use the whole file name “mwimport.sh”

  4. Zach January 2, 2012 at 5:04 am #

    Does this include all the images as well?

  5. Matt January 2, 2012 at 9:19 am #

    unfortunately no.. i looked around for an answer to this problem but couldn’t find one.. if you find a way to get all the images, please do share!

  6. Nick January 17, 2012 at 3:59 am #

    For the two users above me, the images aren’t available. Longer explanation here:
    http://en.wikipedia.org/wiki/Wikipedia:Database_download#Where_are_images_and_uploaded_files

  7. pdh January 17, 2012 at 11:06 am #

    Is anybody hosting a wikipedia mirror that is accessible on-line?

  8. Taco April 18, 2012 at 6:29 pm #

    Dumps of images are no longer available, but you can use this automated script to download them: http://meta.wikimedia.org/wiki/Wikix
