
RePEc Database Manager

Introduction

A collection of Python scripts to download, clean up, and loosely structure the RePEc dataset. The data is stored in an SQLite database, so SQL queries can be used to analyse it. For example,

SELECT count(*) FROM papers JOIN papers_jel USING (pid) WHERE year = 2010 AND code = 'D43'

will show how many papers were written in 2010 about oligopolistic markets.
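The same query can be run from Python with the standard sqlite3 module. The in-memory miniature database below only mimics the documented schema so that the snippet is self-contained; the column names are inferred from the query above, and in practice you would connect to repec.db instead:

```python
import sqlite3

# Tiny in-memory stand-in for repec.db (schema is an assumption
# inferred from the query in the README).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE papers (pid INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE papers_jel (pid INTEGER, code TEXT);
""")
con.executemany("INSERT INTO papers VALUES (?, ?)",
                [(1, 2010), (2, 2010), (3, 2011)])
con.executemany("INSERT INTO papers_jel VALUES (?, ?)",
                [(1, "D43"), (2, "L13"), (3, "D43")])

# The same query as above; with the sample rows, only paper 1 matches.
count, = con.execute(
    "SELECT count(*) FROM papers JOIN papers_jel USING (pid) "
    "WHERE year = 2010 AND code = 'D43'"
).fetchone()
print(count)  # 1
```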

Getting Started

Run

python main.py init
python main.py update

to set up an empty SQLite database, repec.db, and to download the full RePEc dataset into it (this takes a while). See

python main.py init --help
python main.py update --help

for available options.

Non-standard Dependencies

The scripts use cld2-cffi for automatic language detection and curl for downloading from FTP sites. curl is used instead of requests because requests cannot handle some of the FTP sites out there.

Update Process

The script downloads the data using a breadth-first approach:

  1. The names of all the available ReDIF files are downloaded from the RePEc FTP and saved in table repec.
  2. All the files listed in table repec are downloaded from the RePEc FTP and used to fill in table series. Among other data, table series contains the URLs where the data on particular series can be found; the unique URLs are saved in table remotes.
  3. All the unique URLs are visited to collect the listings of the final ReDIF documents. These listings are saved in table listings.
  4. All the files from table listings are downloaded, processed, and saved in tables papers, authors, and papers_jel.
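The data flow through the four stages can be traced with plain Python values. Everything below is invented for illustration (file names, URLs); the real scripts download from the RePEc FTP and persist each stage to the corresponding SQLite table:

```python
# Toy trace of the breadth-first update; names and URLs are invented.
repec = ["all/mtp.rdf"]                                       # stage 1: ReDIF file names
series = {"all/mtp.rdf": ["ftp://example.org/RePEc/mtp/"]}    # stage 2: series data
remotes = sorted({url for urls in series.values() for url in urls})
listings = {u: [u + "paper1.rdf"] for u in remotes}           # stage 3: file listings
papers = [doc for docs in listings.values() for doc in docs]  # stage 4: final documents
print(papers)  # ['ftp://example.org/RePEc/mtp/paper1.rdf']
```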

If an update is interrupted during the last stage, you can run

python main.py update --papers

and the update should resume from where it stopped.

Incremental updates are currently not supported; however, it is possible to perform a full update on an existing database. Obsolete paper records, i.e. those that can no longer be reached from the initial list of series on the RePEc FTP, are not pruned. This is deliberate: some participating websites are reachable on some days but not on others.

Downloaded records are saved as is in papers.redif (z-compressed). Additionally, the records are cleaned up and partially destructured into the respective fields. The cleanup steps include, among others:

  • stripping HTML tags;
  • auto-detecting the language (using cld2-cffi);
  • extracting JEL codes.
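Since the raw records in papers.redif are stored z-compressed, reading one back involves decompressing it first. The sketch below assumes plain zlib compression (an assumption, not confirmed by the README) and uses a toy ReDIF record:

```python
import zlib

# A toy ReDIF record; the exact compression used by the scripts is an
# assumption (zlib), and the record content is invented for illustration.
record = b"Template-Type: ReDIF-Paper 1.0\nTitle: Example\n"
stored = zlib.compress(record)     # what presumably ends up in papers.redif
restored = zlib.decompress(stored)
print(restored.decode().splitlines()[1])  # Title: Example
```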

Database

The SQLite database will contain the following tables.

Table       Description
repec       A list of ReDIF files from the RePEc FTP.
series      Content of the ReDIF files from the RePEc FTP.
remotes     A list of URLs that host RePEc data.
listings    File listings from the sites in remotes.
papers      Titles, abstracts, etc. of economic papers.
authors     Author names.
jel         JEL codes.
papers_jel  Correspondence between papers and jel.

Applications

  • The other day, I made a web page where you can check trends in economics. It's like a toy version of Google Trends, but based on words from the titles and abstracts in RePEc. Some trends are suggestive, e.g. it's all about new results.

See Also

There is also an official Perl script for downloading the data, see remi. Remi is aimed at downloading ReDIF files, whereas the current set of scripts is aimed at downloading and partially processing the files, with the idea of using an SQLite backend to track progress and to store the final results.


Contributors

andrei-dubovik


Issues

MIT Press FTP is misconfigured

Two primary ReDIF files on the MIT Press FTP server currently contain incorrect URLs, so the rest of the archive is not discoverable either. The files in question are:

ftp://ftp-mitpress.mit.edu/outgoing/RePEc/mtp/mtparch.rdf
ftp://ftp-mitpress.mit.edu/outgoing/RePEc/tpr/tprarch.rdf

Both files reference

ftp://ftp-mitpress.mit.edu/Anonymous/outgoing/RePEc/...

which does not exist. However,

ftp://ftp-mitpress.mit.edu/outgoing/RePEc/...

does.

Upon inspection, logging in to ftp-mitpress.mit.edu manually and issuing pwd gives:

ftp ftp-mitpress.mit.edu
Connected to ftp-mitpold.mit.edu.
220 Microsoft FTP Service
Name (ftp-mitpress.mit.edu:andu): anonymous
331 Anonymous access allowed, send identity (e-mail name) as password.
Password:
230 Anonymous user logged in.
Remote system type is Windows_NT.
ftp> pwd
257 "/Anonymous" is current directory.

So, mtparch.rdf and tprarch.rdf reference an internal path instead of a URL. I contacted MIT Press about this on 2021-07-01 and received a confirmation on 2021-09-14 that they are working on the issue, but as of now nothing has been resolved. It would likely be simpler to address this issue on the client side.
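A possible client-side workaround is to rewrite the broken prefix before downloading. The helper below is hypothetical (not part of the scripts) and simply drops the spurious /Anonymous path component:

```python
# Hypothetical client-side fix: rewrite the broken MIT Press URLs.
BROKEN_PREFIX = "ftp://ftp-mitpress.mit.edu/Anonymous/outgoing/RePEc/"
FIXED_PREFIX = "ftp://ftp-mitpress.mit.edu/outgoing/RePEc/"

def fix_mitpress_url(url: str) -> str:
    """Drop the spurious /Anonymous prefix from MIT Press FTP URLs."""
    if url.startswith(BROKEN_PREFIX):
        return FIXED_PREFIX + url[len(BROKEN_PREFIX):]
    return url

print(fix_mitpress_url(
    "ftp://ftp-mitpress.mit.edu/Anonymous/outgoing/RePEc/mtp/mtparch.rdf"))
# ftp://ftp-mitpress.mit.edu/outgoing/RePEc/mtp/mtparch.rdf
```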

Switch to async IO

Currently, the downloading is done with multiple threads. This makes a complete download unnecessarily slow (and specifying a much higher number of threads than the default does not improve performance on Windows). Switching to async IO should substantially improve the speed of a complete download.
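The proposed change could look roughly like the sketch below: bound the number of in-flight downloads with a semaphore and let the event loop multiplex them instead of using threads. fetch() is a stand-in for a real asynchronous FTP/HTTP download, so this is only an illustration of the pattern, not the scripts' actual API:

```python
import asyncio

# Stand-in for a real async download; a real implementation would use an
# async FTP/HTTP client here.
async def fetch(url: str) -> str:
    await asyncio.sleep(0)  # placeholder for real async I/O
    return f"contents of {url}"

async def download_all(urls, limit=10):
    sem = asyncio.Semaphore(limit)  # at most `limit` downloads in flight

    async def bounded(url):
        async with sem:
            return await fetch(url)

    # gather() preserves the input order of the results
    return await asyncio.gather(*(bounded(u) for u in urls))

results = asyncio.run(download_all(["u1", "u2", "u3"]))
print(len(results))  # 3
```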

Ditch curl?

Back when the project was started, there were quite a few FTP sites that requests could not handle but curl could. Is it still the case that requests cannot handle certain FTP sites? If so, consider reporting the problem upstream.
