dissemin's People

Contributors

a3nm, abijeet, armavica, evarin, golls, jibe-b, jilljenn, kyodralliam, monperrus, nemobis, p4bl0-, phyks, pierresenellart, raitobezarius, rgrunbla, sayashraaj, stephno, translatewiki, wetneb


dissemin's Issues

Scrape researcher lists from department websites

Following the meeting of 3 July, we need to gather the researcher lists ourselves. It would be better not to use any resource available only from inside the ENS, but only public information on the departments' websites.
We need them as a TSV file (one researcher per line, tab-separated fields), with the following columns:

  • Last name
  • First name
  • URL of the home page
  • Email address
  • Position (PhD student, Professor…)
  • Research group
  • Department

Only the names and the department are required.

Example file (some fields are left blank):

Baron-Cohen Simon   http://www.psychol.cam.ac.uk/directory/simon-baron-cohen    ******@cam.ac.uk    Academic staff      Department of Psychology
Bekinschtein    Tristan http://www.psychol.cam.ac.uk/directory/tristan-bekinschtein ******@cam.ac.uk    Academic staff      Department of Psychology

One way to extract them automatically is to scrape the HTML of department pages. The following is an example for Institut Jean Nicod.

# -*- encoding: utf-8 -*-
from __future__ import unicode_literals

from sys import stdout

from urllib2 import urlopen, HTTPError
from lxml.html import document_fromstring

from papers.name import parse_comma_name

# Index pages listing the lab members, with the role to assign to every
# person found on each page.
root_urls = [
    ('http://www.institutnicod.org/membres/membres-statutaires/?lang=fr', 'Membre statutaire'),
    ('http://www.institutnicod.org/membres/post-doctorants-35/', 'Post-doctorant·e'),
    ('http://www.institutnicod.org/membres/etudiants/doctorants/?lang=fr', 'Doctorant·e'),
]

# Collect (profile URL, name, role) triples from the index pages.
url_roles = []
for rooti, role in root_urls:
    f = urlopen(rooti).read()
    root = document_fromstring(f)
    for a in root.xpath("//li[@class='menu-entree']/a"):
        url_roles.append(('http://www.institutnicod.org/' + a.get('href'), a.text, role))

# Visit each profile page to extract the email address and home page.
for link, nom, role in url_roles:
    try:
        f = urlopen(link).read()
        mdoc = document_fromstring(f)
        groupe = 'Institut Jean Nicod'
        email = ''
        for elem in mdoc.xpath("//a"):
            if not elem.text:
                continue
            href = (elem.get('href') or '').strip()
            if elem.text in ('Contact', 'Email') or href.startswith('mailto:'):
                # Keep only the address part of a mailto: link.
                if href.startswith('mailto:'):
                    email = href[len('mailto:'):]
            if elem.text == 'Site Web':
                link = href

            # Some pages wrap the link label in a <span>; the XPath must be
            # relative to the current element, not to the document root.
            subspan = elem.xpath(".//span")
            if subspan and subspan[0].text and 'site web' in subspan[0].text.lower():
                link = href

        first, last = parse_comma_name(nom)

        print('\t'.join([last, first, link, email, role, groupe]).encode('utf-8'))
        stdout.flush()
    except HTTPError:
        pass


Change from RabbitMQ to Redis

RabbitMQ is too big for us: it runs out of memory and crashes the server. Let's keep things simple and use Redis instead.
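
A minimal sketch of what the switch could look like, assuming we keep Celery and only swap the broker (the setting names are standard Celery ones; the URLs are placeholders, not our actual configuration):

# settings.py (sketch): point Celery at a local Redis instance
# instead of RabbitMQ.
BROKER_URL = 'redis://localhost:6379/0'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'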

Publication objects: store conference name, not only proceedings title

And display it in the interface, for instance to avoid uninformative titles such as "Lecture Notes in Computer Science". Look at the "container-title" field in the CrossRef metadata:

{"subtitle":[],"issued":{"date-parts":[[2012]]},"score":1.0,"prefix":"http:\/\/id.crossref.org\/prefix\/10.1007","author":[{"family":"Abdalla","given":"Michel"},{"family":"Fouque","given":"Pierre-Alain"},{"family":"Lyubashevsky","given":"Vadim"},{"family":"Tibouchi","given":"Mehdi"}],"container-title":"Advances in Cryptology \u2013 EUROCRYPT 2012","reference-count":0,"page":"572-590","deposited":{"date-parts":[[2014,1,23]],"timestamp":1390435200000},"title":"Tightly-Secure Signatures from Lossy Identification Schemes","type":"book-chapter","DOI":"10.1007\/978-3-642-29011-4_34","ISSN":["0302-9743","1611-3349"],"ISBN":["http:\/\/id.crossref.org\/isbn\/978-3-642-29010-7","http:\/\/id.crossref.org\/isbn\/978-3-642-29011-4"],"URL":"http:\/\/dx.doi.org\/10.1007\/978-3-642-29011-4_34","source":"CrossRef","publisher":"Springer Science + Business Media","indexed":{"date-parts":[[2014,9,16]],"timestamp":1410910441629},"member":"http:\/\/id.crossref.org\/member\/297"}

Fingerprints: various improvements

The problem is that papers with generic titles such as "Introduction" or "Preface" tend to be wrongly merged, as these titles are common. Solution: include the year in the fingerprint when the title is too short ("Introduction", "Preface", and so on); see the sketch below.

Should we take into account only the first initial in the fingerprints?
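
A sketch of the proposed rule (the fingerprint format here is illustrative, not dissemin's actual one):

def fingerprint(title, year, author_last_names):
    # Normalize the title to lowercase alphanumeric words.
    words = ''.join(c if c.isalnum() else ' ' for c in title.lower()).split()
    fp = '-'.join(words + sorted(name.lower() for name in author_last_names))
    # Very short, generic titles ("Introduction", "Preface") are too
    # common: add the year to avoid collapsing unrelated papers.
    if len(words) <= 2:
        fp += '-' + str(year)
    return fp

With this rule, fingerprint('Preface.', 2010, ['Smith']) and fingerprint('Preface.', 2012, ['Smith']) no longer collide.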

Get researcher lists from the admin

Send a wake-up call to the administration, with the information we need:

Required:

  • First name
  • Last name
  • Department

Helpful:

  • Email
  • Homepage
  • ENS CRI login (usually first initial + last name)
  • Research group
  • Other affiliations

Optional:

  • Position
  • Any other information they want us to display on their profile

Implement a heuristic to merge similar papers

This can involve:

  • Improving the current fingerprint system (we don't think it is enough)
  • Adding a heuristic that compares two papers and decides whether they should be merged.

This is tricky because a simple edit distance fails: for instance, the following two papers are different even though their titles are almost identical (see the sketch below):

  • Desingularisation de metriques d'Einstein. II
  • Désingularisation de metriques d'Einstein. I

Note that the current fingerprint system already collapses different papers together:

  • Preface. (published in 2010)
  • Preface. (published in 2012)

But adding the date in the fingerprint is not an option (preprints and publication years are often different).
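
To illustrate the pitfall, a quick check with a plain Levenshtein distance (textbook dynamic programming; any edit-distance library would do) shows that the two titles above are only two edits apart even though they denote different papers:

def levenshtein(a, b):
    # Classic dynamic programming over prefixes of a and b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Desingularisation de metriques d'Einstein. II",
                  "Désingularisation de metriques d'Einstein. I"))  # -> 2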

Name unification on paper merge & new source addition

For now, if two papers from different sources have the same fingerprint, we merge them, keeping the title and authors of the first record we discovered. This is far from optimal, because the second record can actually be more precise (full first names instead of initials, for instance).
Ensure that we get the most out of our sources by implementing a name unification algorithm.

For instance: ("A. G. Erschler","Anna Erschler") -> ("Anna G. Erschler")

Parse embargo periods and update policies accordingly [$15]

SHERPA/RoMEO embeds the lengths of embargo periods in a special <num> tag.

These embargo periods are currently not stored in dissemin. This issue is about storing them in the model, and refining the "uploadability" of the papers based on that.

  • add a field to the Publisher model to store these embargoes
  • populate it via the SHERPA/RoMEO interface (a parsing sketch follows below)
  • adapt Paper.update_availability to take it into account

There is a $15 open bounty on this issue. Add to the bounty at Bountysource.
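
A rough sketch of the parsing step, assuming the RoMEO XML places each <num> tag next to a <period> element giving the unit; the exact structure is an assumption and would need checking against real API responses:

from lxml import etree

def embargo_lengths_in_months(romeo_xml):
    # Yield one embargo length (in months) per <num> tag in the response.
    root = etree.fromstring(romeo_xml)
    for num in root.iter('num'):
        length = int(num.text)
        period = num.getparent().find('period')
        units = period.get('units') if period is not None else 'month'
        yield length * 12 if units.startswith('year') else length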

Revamp the home page

The home page is terrible.
What we need is:

  • A different page structure made of smaller blocks
  • A short intro with numbers (number of departments, researchers, papers, publishers…)
  • A list of 3-4 most recent papers (+ link to the complete list)
  • A list of 3-4 most popular publishers (+ link to the complete list)
  • A list of 3-4 departments (+ link to the complete list)
  • Global access statistics (not department-wise)?
  • An ENS logo (but making sure people don't believe it is officially backed by the ENS)
  • Anything else?

Extract more metadata with scrapers

One option would be to reuse zotero/translators (properly isolated). The problem is that they do not extract author emails or affiliations. We can also use the Google Scholar meta markup, but not all publishers support it (Elsevier does not, Springer does); see the sketch below.
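
A minimal sketch of reading that markup with lxml (the citation_* names are the standard Highwire/Google Scholar meta tags):

from lxml.html import document_fromstring

def scholar_metadata(html):
    # Collect all citation_* meta tags (citation_title, citation_author,
    # citation_doi, ...); repeated tags such as authors become lists.
    doc = document_fromstring(html)
    meta = {}
    for tag in doc.xpath("//meta[starts-with(@name, 'citation_')]"):
        meta.setdefault(tag.get('name'), []).append(tag.get('content'))
    return meta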

Other scraping tools:

  • ContentMine/journal-scrapers: good for getting author affiliations, but the code is very unstable, has nasty dependencies (nodejs), and provides very few scrapers.
  • egonw/citeulike: better coverage and a stable codebase, but extracts little metadata.

Translate everything, including the FAQ

We use gettext, so this can be done by anyone!
This has to be done after the changes in the frontend, i.e. in the webdiz branch.

If you want to contribute, the file is here: locale/fr/LC_MESSAGES/django.po.
It can be regenerated with python manage.py makemessages -l fr.
And can be compiled (so that it appears on the web site) with python manage.py compilemessages.

Add HTML or QuerySet caching

Many pages could be efficiently cached as the platform is mostly read-only.
Caching is well documented for Django, but there are many decisions to make.
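
For instance, Django's per-view cache is a one-decorator change (the timeout and view here are placeholders; per-fragment or QuerySet caching are other options to weigh):

from django.http import HttpResponse
from django.views.decorators.cache import cache_page

@cache_page(60 * 15)  # cache the rendered page for 15 minutes
def paper_list(request):
    # Placeholder body; the real view would render a template.
    return HttpResponse('...')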

SWORD protocol

Find an existing SWORD server more tolerant than the HAL one;
or contact the HAL people to ask for a specification of their implementation (or for their source code);
or fuzz the HAL preprod server to reconstruct this specification.

django-cas

We "fixed" django-cas in a crappy way. We should investigate whether this fix breaks anything else. A function was called with a variable number of arguments, but the wrapper did not account for this, so we added these arguments and simply discard them.

Big number of authors

When there are lots of coauthors (LHC papers, for example), the rendering is not acceptable. We could probably print just 5 random authors (or the ENS authors?), with a way to see all the authors; see the sketch below.
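
A possible display rule, as a sketch (the name and cutoff are illustrative; picking ENS authors instead of the first few would need affiliation data):

def short_author_list(authors, limit=5):
    # Show at most `limit` names, then an ellipsis with the total count.
    if len(authors) <= limit:
        return ', '.join(authors)
    return ', '.join(authors[:limit]) + ', … (%d authors in total)' % len(authors)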

Add accessibility statistics

Display the access statistics by department, journal and publisher. A bar-based display is already implemented; we just need to make it understandable (or change it).

Installing backend requirements fails because "httplib" cannot be found in PIP

Downloading/unpacking httplib (from -r requirements_backend.txt (line 6))
  Could not find any downloads that satisfy the requirement httplib (from -r requirements_backend.txt (line 6))
  Some externally hosted files were ignored (use --allow-external httplib to allow).
Cleaning up...
No distributions at all found for httplib (from -r requirements_backend.txt (line 6))
Traceback (most recent call last):
  File "/home/a3nm/scratch/dissemin/.virtualenv/bin/pip", line 11, in 
    sys.exit(main())
  File "/home/a3nm/scratch/dissemin/.virtualenv/local/lib/python2.7/site-packages/pip/__init__.py", line 248, in main
    return command.main(cmd_args)
  File "/home/a3nm/scratch/dissemin/.virtualenv/local/lib/python2.7/site-packages/pip/basecommand.py", line 161, in main
    text = '\n'.join(complete_log)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 72: ordinal not in range(128)

(httplib is part of the Python 2 standard library and is not on PyPI; the requirement should presumably be httplib2, which the scraping code imports.)

Researcher lists are ugly

  • Spread them over two (or more) columns
  • Add letter headings to help people find a name
  • Capitalize the last name? And/or display the last name first?
