dissemin's People

Contributors

a3nm, abijeet, armavica, evarin, golls, jibe-b, jilljenn, kyodralliam, monperrus, nemobis, p4bl0-, phyks, pierresenellart, raitobezarius, rgrunbla, sayashraaj, stephno, translatewiki, wetneb


dissemin's Issues

Scrape researcher lists from department websites

Following the meeting of 3 July, we need to gather the researcher lists ourselves. It would be better not to use any resource available only from inside the ENS, but only public information on the departments' websites.
We need them as a TSV file (one researcher per line, tab-separated fields), with the following columns:

  • Last name
  • First name
  • URL of the home page
  • Email address
  • Position (PhD student, Professor…)
  • Research group
  • Department

Only the names and the department are required.

Example file (some fields are left blank):

Baron-Cohen Simon   http://www.psychol.cam.ac.uk/directory/simon-baron-cohen    ******@cam.ac.uk    Academic staff      Department of Psychology
Bekinschtein    Tristan http://www.psychol.cam.ac.uk/directory/tristan-bekinschtein ******@cam.ac.uk    Academic staff      Department of Psychology

One way to extract them automatically is to scrape the HTML of department pages. The following is an example for Institut Jean Nicod.

# -*- encoding: utf-8 -*-
from __future__ import unicode_literals

from sys import stdout

from urllib2 import urlopen, HTTPError
from lxml.html import document_fromstring

from papers.name import parse_comma_name

# Index pages listing the lab members, with the role to assign to every
# person found on each page.
root_urls = [
    ('http://www.institutnicod.org/membres/membres-statutaires/?lang=fr', 'Membre statutaire'),
    ('http://www.institutnicod.org/membres/post-doctorants-35/', 'Post-doctorant·e'),
    ('http://www.institutnicod.org/membres/etudiants/doctorants/?lang=fr', 'Doctorant·e'),
]

# Collect (profile URL, name, role) triples from the index pages.
url_roles = []
for rooti, role in root_urls:
    f = urlopen(rooti).read()
    root = document_fromstring(f)
    for a in root.xpath("//li[@class='menu-entree']/a"):
        url_roles.append(('http://www.institutnicod.org/' + a.get('href'), a.text, role))

# Visit each profile page to extract the email address and home page.
for link, nom, role in url_roles:
    try:
        f = urlopen(link).read()
        mdoc = document_fromstring(f)
        groupe = 'Institut Jean Nicod'
        email = ''
        for elem in mdoc.xpath("//a"):
            if not elem.text:
                continue
            href = (elem.get('href') or '').strip()
            if elem.text in ('Contact', 'Email') or href.startswith('mailto:'):
                # Keep only the address part of a mailto: link.
                if href.startswith('mailto:'):
                    email = href[len('mailto:'):]
            if elem.text == 'Site Web':
                link = href

            # Some pages wrap the link label in a <span>; the XPath must be
            # relative to the current element, not to the document root.
            subspan = elem.xpath(".//span")
            if subspan and subspan[0].text and 'site web' in subspan[0].text.lower():
                link = href

        first, last = parse_comma_name(nom)

        print('\t'.join([last, first, link, email, role, groupe]).encode('utf-8'))
        stdout.flush()
    except HTTPError:
        pass


Change from RabbitMQ to Redis

RabbitMQ is too big for us: it runs out of memory and crashes the server. Let's keep things simple and use Redis instead.
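
A minimal sketch of what the switch could look like, assuming we keep Celery and only swap the broker (the setting names are standard Celery ones; the URLs are placeholders, not our actual configuration):

# settings.py (sketch): point Celery at a local Redis instance
# instead of RabbitMQ.
BROKER_URL = 'redis://localhost:6379/0'
CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'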

Publication objects: store conference name, not only proceedings title

And display it in the interface, for instance to avoid uninformative titles such as "Lecture Notes in Computer Science". Look at the "container-title" field in the CrossRef metadata:

{"subtitle":[],"issued":{"date-parts":[[2012]]},"score":1.0,"prefix":"http:\/\/id.crossref.org\/prefix\/10.1007","author":[{"family":"Abdalla","given":"Michel"},{"family":"Fouque","given":"Pierre-Alain"},{"family":"Lyubashevsky","given":"Vadim"},{"family":"Tibouchi","given":"Mehdi"}],"container-title":"Advances in Cryptology \u2013 EUROCRYPT 2012","reference-count":0,"page":"572-590","deposited":{"date-parts":[[2014,1,23]],"timestamp":1390435200000},"title":"Tightly-Secure Signatures from Lossy Identification Schemes","type":"book-chapter","DOI":"10.1007\/978-3-642-29011-4_34","ISSN":["0302-9743","1611-3349"],"ISBN":["http:\/\/id.crossref.org\/isbn\/978-3-642-29010-7","http:\/\/id.crossref.org\/isbn\/978-3-642-29011-4"],"URL":"http:\/\/dx.doi.org\/10.1007\/978-3-642-29011-4_34","source":"CrossRef","publisher":"Springer Science + Business Media","indexed":{"date-parts":[[2014,9,16]],"timestamp":1410910441629},"member":"http:\/\/id.crossref.org\/member\/297"}

Fingerprints: various improvements

The problem is that papers with generic titles such as "Introduction" or "Preface" tend to be wrongly merged, as these titles are common. Solution: include the year in the fingerprint when the title is too short ("Introduction", "Preface", and so on); see the sketch below.

Should we take into account only the first initial in the fingerprints?
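
A sketch of the proposed rule (the fingerprint format here is illustrative, not dissemin's actual one):

def fingerprint(title, year, author_last_names):
    # Normalize the title to lowercase alphanumeric words.
    words = ''.join(c if c.isalnum() else ' ' for c in title.lower()).split()
    fp = '-'.join(words + sorted(name.lower() for name in author_last_names))
    # Very short, generic titles ("Introduction", "Preface") are too
    # common: add the year to avoid collapsing unrelated papers.
    if len(words) <= 2:
        fp += '-' + str(year)
    return fp

With this rule, fingerprint('Preface.', 2010, ['Smith']) and fingerprint('Preface.', 2012, ['Smith']) no longer collide.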

Get researcher lists from the admin

Send a wake-up call to the administration, with the information we need:

Required:

  • First name
  • Last name
  • Department

Helpful:

  • Email
  • Homepage
  • ENS CRI login (usually first initial + last name)
  • Research group
  • Other affiliations

Optional:

  • Position
  • Any other information they want us to display on their profile

Implement a heuristic to merge similar papers

This can involve:

  • Improving the current fingerprint system (we don't think it is enough)
  • Adding a heuristic that compares two papers and decides whether they should be merged.

This is tricky because a simple edit distance fails: for instance, the following two papers are different even though their titles are almost identical (see the sketch below):

  • Desingularisation de metriques d'Einstein. II
  • Désingularisation de metriques d'Einstein. I

Note that the current fingerprint system already collapses different papers together:

  • Preface. (published in 2010)
  • Preface. (published in 2012)

But adding the date in the fingerprint is not an option (preprints and publication years are often different).
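
To illustrate the pitfall, a quick check with a plain Levenshtein distance (textbook dynamic programming; any edit-distance library would do) shows that the two titles above are only two edits apart even though they denote different papers:

def levenshtein(a, b):
    # Classic dynamic programming over prefixes of a and b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Desingularisation de metriques d'Einstein. II",
                  "Désingularisation de metriques d'Einstein. I"))  # -> 2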

Name unification on paper merge & new source addition

For now, if two papers from different sources have the same fingerprint, we merge them, keeping the title and authors of the first record we discovered. This is far from optimal, because the second record can actually be more precise (full first names instead of initials, for instance).
Ensure that we get the most out of our sources by implementing a name unification algorithm.

For instance: ("A. G. Erschler","Anna Erschler") -> ("Anna G. Erschler")

Parse embargo periods and update policies accordingly [$15]

SHERPA/RoMEO embeds the lengths of embargo periods in a special <num> tag.

These embargo periods are currently not stored in dissemin. This issue is about storing them in the model, and refining the "uploadability" of the papers based on that.

  • add a field to the Publisher model to store these embargoes
  • populate it via the SHERPA/RoMEO interface (a parsing sketch follows below)
  • adapt Paper.update_availability to take it into account

There is a $15 open bounty on this issue. Add to the bounty at Bountysource.
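
A rough sketch of the parsing step, assuming the RoMEO XML places each <num> tag next to a <period> element giving the unit; the exact structure is an assumption and would need checking against real API responses:

from lxml import etree

def embargo_lengths_in_months(romeo_xml):
    # Yield one embargo length (in months) per <num> tag in the response.
    root = etree.fromstring(romeo_xml)
    for num in root.iter('num'):
        length = int(num.text)
        period = num.getparent().find('period')
        units = period.get('units') if period is not None else 'month'
        yield length * 12 if units.startswith('year') else length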

Revamp the home page

The home page is terrible.
What we need is:

  • A different page structure made of smaller blocks
  • A short intro with numbers (number of departments, researchers, papers, publishers…)
  • A list of 3-4 most recent papers (+ link to the complete list)
  • A list of 3-4 most popular publishers (+ link to the complete list)
  • A list of 3-4 departments (+ link to the complete list)
  • Global access statistics (not department-wise)?
  • An ENS logo (but making sure people don't believe it is officially backed by the ENS)
  • Anything else?

Extract more metadata with scrapers

One option would be to reuse zotero/translators (properly isolated). The problem is that they do not extract author emails or affiliations. We can also use the Google Scholar meta markup, but not all publishers support it (Elsevier does not, Springer does); see the sketch below.
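
A minimal sketch of reading that markup with lxml (the citation_* names are the standard Highwire/Google Scholar meta tags):

from lxml.html import document_fromstring

def scholar_metadata(html):
    # Collect all citation_* meta tags (citation_title, citation_author,
    # citation_doi, ...); repeated tags such as authors become lists.
    doc = document_fromstring(html)
    meta = {}
    for tag in doc.xpath("//meta[starts-with(@name, 'citation_')]"):
        meta.setdefault(tag.get('name'), []).append(tag.get('content'))
    return meta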

Other scraping tools:

  • ContentMine/journal-scrapers: good for getting author affiliations, but the code is very unstable, has nasty dependencies (nodejs), and provides very few scrapers.
  • egonw/citeulike: better coverage and a stable codebase, but extracts little metadata.

Translate everything, including the FAQ

We use gettext, so this can be done by anyone!
This has to be done after the changes in the frontend, i.e. in the webdiz branch.

If you want to contribute, the file is here: locale/fr/LC_MESSAGES/django.po.
It can be regenerated with python manage.py makemessages -l fr.
And can be compiled (so that it appears on the web site) with python manage.py compilemessages.

Add HTML or QuerySet caching

Many pages could be efficiently cached as the platform is mostly read-only.
Caching is well documented for Django, but there are many decisions to make.
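
For instance, Django's per-view cache is a one-decorator change (the timeout and view here are placeholders; per-fragment or QuerySet caching are other options to weigh):

from django.http import HttpResponse
from django.views.decorators.cache import cache_page

@cache_page(60 * 15)  # cache the rendered page for 15 minutes
def paper_list(request):
    # Placeholder body; the real view would render a template.
    return HttpResponse('...')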

SWORD protocol

Find an existing SWORD server more tolerant than the HAL one;
or contact the HAL people to ask for a specification of their implementation (or for their source code);
or fuzz the HAL preprod server to reconstruct this specification.

django-cas

We "fixed" django-cas in a crappy way. We should investigate whether this fix breaks anything else. A function was called with a variable number of arguments, but the wrapper did not account for this, so we added these arguments and simply discard them.

Big number of authors

When there are lots of coauthors (LHC papers, for example), the rendering is not acceptable. We could probably print just 5 random authors (or the ENS authors?), with a way to see all the authors; see the sketch below.
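
A possible display rule, as a sketch (the name and cutoff are illustrative; picking ENS authors instead of the first few would need affiliation data):

def short_author_list(authors, limit=5):
    # Show at most `limit` names, then an ellipsis with the total count.
    if len(authors) <= limit:
        return ', '.join(authors)
    return ', '.join(authors[:limit]) + ', … (%d authors in total)' % len(authors)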

Add accessibility statistics

Display the access statistics by department, journal and publisher. A bar-based display is already implemented; we just need to make it understandable (or change it).

Installing backend requirements fails because "httplib" cannot be found in PIP

Downloading/unpacking httplib (from -r requirements_backend.txt (line 6))
  Could not find any downloads that satisfy the requirement httplib (from -r requirements_backend.txt (line 6))
  Some externally hosted files were ignored (use --allow-external httplib to allow).
Cleaning up...
No distributions at all found for httplib (from -r requirements_backend.txt (line 6))
Traceback (most recent call last):
  File "/home/a3nm/scratch/dissemin/.virtualenv/bin/pip", line 11, in 
    sys.exit(main())
  File "/home/a3nm/scratch/dissemin/.virtualenv/local/lib/python2.7/site-packages/pip/__init__.py", line 248, in main
    return command.main(cmd_args)
  File "/home/a3nm/scratch/dissemin/.virtualenv/local/lib/python2.7/site-packages/pip/basecommand.py", line 161, in main
    text = '\n'.join(complete_log)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 72: ordinal not in range(128)

(httplib is part of the Python 2 standard library and is not on PyPI; the requirement should presumably be httplib2, which the scraping code imports.)

Researcher lists are ugly

  • Spread them over two (or more) columns
  • Add letter headings to help people find a name
  • Capitalize the last name? And/or display the last name first?
