
dmoz2db's Introduction

dmoz2db is a tool to parse the RDF-like dumps from http://rdf.dmoz.org/rdf/ and
put the contents into a database. dmoz2db is tested with MySQL but should work
with other databases as well. IT COMES WITH ABSOLUTELY NO WARRANTY OF ANY KIND. 

Instructions

To use dmoz2db you need to install SQLAlchemy 0.6.5 or higher
(http://www.sqlalchemy.org).

Your database must have utf8 support enabled. For MySQL, a description of how
to do that is available here:
http://cameronyule.com/2008/07/configuring-mysql-to-use-utf-8

The database where the dmoz data will be stored must be created manually:

mysql> create database DATABASENAME;
mysql> GRANT ALL ON DATABASENAME.* TO 'USERNAME'@'localhost';

After that you should edit db.sample.conf according to your setup and save it
as db.conf.

The database design is documented in the HTML pages in the doc folder.

Running

If the rdf files are present in your current directory you can just say
~/dmoz-dir/src $ python dmoz2db.py

but you may want to run 
~/dmoz-dir/src $ python dmoz2db.py --help

first and look at the available options. Most of them should be
self-explanatory. If you are not interested in the complete dmoz dataset, you
can specify a topic filter to ignore everything that is not under the given
category, which speeds up the import process. Take care with trailing slashes:
'Top/Computers' includes the category itself, while 'Top/Computers/' matches
only what lies under that category. Every category whose father was filtered
out gets the default father id 1.
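The trailing-slash behaviour can be illustrated with a minimal sketch. It assumes the filter is a plain string-prefix test on the topic path; the function name `passes_topic_filter` is hypothetical and not taken from dmoz2db.py, whose actual matching logic may differ.

```python
def passes_topic_filter(topic, topic_filter):
    """Return True if `topic` would be imported under `topic_filter`.

    Assumption: the filter is a simple prefix comparison on the topic path.
    """
    return topic.startswith(topic_filter)


topics = ["Top/Arts", "Top/Computers", "Top/Computers/Internet"]

# Without a trailing slash the category itself is kept.
print([t for t in topics if passes_topic_filter(t, "Top/Computers")])

# With a trailing slash only topics below the category are kept.
print([t for t in topics if passes_topic_filter(t, "Top/Computers/")])
```

Under this assumption, 'Top/Computers' passes the first filter but not the second, which is the difference the trailing slash makes.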

Debug output should be turned on only in combination with the log file option,
because every SQL statement is printed.

The import will take time, so go to lunch or find something else to do :) And
don't halloo till you're out of the wood: the first parse inserts the basic
topic structure into the db, then the father ids are generated, and after that
the second parse adds all the additional information such as related categories
or other languages. Last but not least, the content.rdf file is parsed to add
the externalpage information to the database.
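The father id step can be sketched as follows. This is only an illustration: it assumes that a topic's father is the topic path minus its last segment, and the names `assign_father_ids` and `topic_ids` are hypothetical. The default of 1 comes from the description of the topic filter above.

```python
DEFAULT_FATHER_ID = 1  # per the README: used when the father was filtered out


def assign_father_ids(topic_ids):
    """Map each topic id to the id of its father topic.

    `topic_ids` maps a topic path such as 'Top/Computers/Internet' to its
    database id. Assumption: the father is the path minus its last segment.
    """
    father_ids = {}
    for path, topic_id in topic_ids.items():
        parent_path = path.rpartition("/")[0]
        # Fall back to the default when the father is not in the database.
        father_ids[topic_id] = topic_ids.get(parent_path, DEFAULT_FATHER_ID)
    return father_ids


# Hypothetical ids after an import with the filter 'Top/Computers'.
topic_ids = {
    "Top/Computers": 2,
    "Top/Computers/Internet": 3,
    "Top/Computers/Internet/Email": 4,
}
print(assign_father_ids(topic_ids))
```

Here 'Top/Computers' gets the default father id because its father 'Top' was not imported, while the deeper topics point at their real fathers.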

On my laptop it took about 20 minutes to complete the first parse, 25 minutes
for generating father ids, 2:08 h for the second parse and 8 h for the
content.rdf file, which results in ~11 h total. One dot in the output means
10,000 processed topics; a newline is generated after 200,000 topics.
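A rough sketch of the progress output and the timing arithmetic, written for Python 3. The function `report_progress` and the constant names are hypothetical; only the intervals (10,000 and 200,000) and the reported durations are taken from the text above.

```python
import io
import sys

DOT_EVERY = 10_000       # one dot per 10,000 processed topics
NEWLINE_EVERY = 200_000  # line break after 200,000 topics


def report_progress(count, out=sys.stdout):
    """Write progress markers once `count` topics have been processed."""
    if count % DOT_EVERY == 0:
        out.write(".")
        if count % NEWLINE_EVERY == 0:
            out.write("\n")
        out.flush()


# Simulate 400,000 topics and count the dots on each output line.
buffer = io.StringIO()
for n in range(1, 400_001):
    report_progress(n, buffer)
lines = buffer.getvalue().splitlines()
print([len(line) for line in lines])

# Reported timings in minutes: first parse, father ids, second parse, content.rdf.
minutes = [20, 25, 2 * 60 + 8, 8 * 60]
total = sum(minutes)
print(f"{total // 60} h {total % 60} min")
```

Each full line therefore holds 20 dots, and the individual timings add up to 653 minutes, which matches the ~11 h quoted above.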

In the structure.rdf file entries dealing with the last editor are ignored. For
content.rdf the tags <mediadate>, <type>, <uksite>, <age> and <priority> are
ignored because they are present only in a fraction of the data.
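Skipping these tags could look roughly like the sketch below. It uses the standard library's ElementTree on a simplified, hypothetical entry; the real dump may use namespace prefixes and dmoz2db.py may parse it differently (for example with a streaming parser). The name `page_fields` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Tags the README says are skipped in content.rdf.
IGNORED_TAGS = {"mediadate", "type", "uksite", "age", "priority"}


def page_fields(page_element):
    """Return {tag: text} for the children of one page element,
    leaving out the ignored tags."""
    return {
        child.tag: (child.text or "").strip()
        for child in page_element
        if child.tag not in IGNORED_TAGS
    }


# Simplified, hypothetical entry; not copied from the real dump.
sample = """
<ExternalPage about="http://example.org/">
  <Title>Example</Title>
  <Description>An example site.</Description>
  <priority>1</priority>
  <mediadate>2010-01-01</mediadate>
</ExternalPage>
"""
print(page_fields(ET.fromstring(sample)))
```

Only the Title and Description entries survive in this example; the two ignored tags are dropped before anything would be written to the database.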

dmoz2db's People

Contributors

bar, gnrhxni, joknopp, siccovansas
