dissemin / dissemin
This repository has migrated to https://gitlab.com/dissemin/dissemin
Home Page: https://dissem.in/
License: GNU Affero General Public License v3.0
Subtasks:
For instance, by first showing the affiliated authors, then a few unaffiliated ones, and then "et al.".
cf. the comments from the meeting at the ENS.
CORE has released a new API:
http://core.ac.uk/docs/
Let's connect it with dissemin :-)
```
Downloading/unpacking httplib (from -r requirements_backend.txt (line 6))
  Could not find any downloads that satisfy the requirement httplib (from -r requirements_backend.txt (line 6))
  Some externally hosted files were ignored (use --allow-external httplib to allow).
Cleaning up...
No distributions at all found for httplib (from -r requirements_backend.txt (line 6))
Traceback (most recent call last):
  File "/home/a3nm/scratch/dissemin/.virtualenv/bin/pip", line 11, in <module>
    sys.exit(main())
  File "/home/a3nm/scratch/dissemin/.virtualenv/local/lib/python2.7/site-packages/pip/__init__.py", line 248, in main
    return command.main(cmd_args)
  File "/home/a3nm/scratch/dissemin/.virtualenv/local/lib/python2.7/site-packages/pip/basecommand.py", line 161, in main
    text = '\n'.join(complete_log)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 72: ordinal not in range(128)
```
Cool stuff: let's link to that or integrate their info in dissemin!
http://howopenisit.org/
http://howopenisit.org/developers/api
One option is to use a ready-made caching tool such as requests-cache.
https://realpython.com/blog/python/caching-external-api-requests/
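A minimal sketch of how requests-cache could be wired in (the cache name and expiry below are placeholders, not actual dissemin settings):

```python
import requests
import requests_cache

# Transparently cache every HTTP request made through `requests` in a local
# SQLite database; entries expire after one day (placeholder values).
requests_cache.install_cache('dissemin_api_cache', expire_after=86400)

r1 = requests.get('https://api.crossref.org/works/10.1007/978-3-642-29011-4_34')
r2 = requests.get('https://api.crossref.org/works/10.1007/978-3-642-29011-4_34')
print(r2.from_cache)  # True: the second call never hits the network
```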
Send a mail to all the known authors of a paper asking them to upload it :-)
http://beta.ens.dissem.in/paper/1773/
arXiv record has CORE url :-(
If all the words in the title are capitalized, recapitalize it. There should be a Python package for that.
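There is at least the titlecase package on PyPI; a sketch of the recapitalization step (the all-capitalized test below is deliberately naive):

```python
from titlecase import titlecase  # assumed dependency: the `titlecase` PyPI package

def fix_title(title):
    """Recapitalize a title when every word starts with a capital (a sketch)."""
    words = title.split()
    if words and all(w[:1].isupper() for w in words):
        return titlecase(title.lower())
    return title

print(fix_title('A BRIEF HISTORY OF TIME'))  # -> "A Brief History of Time"
```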
Declare dissem.in to the CNIL (the French data protection authority)!
The problem is that papers with titles such as "Introduction" or "Preface" tend to be merged together because these titles are common. Solution: include the year in the fingerprint when the title is too short ("Introduction", "Preface", and so on).
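A sketch of that heuristic; the function shape and the two-word threshold are illustrative, not dissemin's actual fingerprint code:

```python
GENERIC_TITLE_WORDS = 2  # illustrative threshold below which a title is "too short"

def fingerprint(title, author_last_names, year):
    """Compute a paper fingerprint, adding the year when the title alone
    is too generic to be discriminating (a sketch)."""
    words = [w.lower() for w in title.split()]
    parts = words + sorted(name.lower() for name in author_last_names)
    if len(words) <= GENERIC_TITLE_WORDS:
        parts.append(str(year))  # "Introduction", "Preface", ... get the year
    return '-'.join(parts)
```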
Take into account only the first initial in the fingerprints?
Use a ready-made tool such as Solr or Elasticsearch.
Long titles / author names sometimes exceed the database limits: catch the DataError exception and ignore the corresponding papers.
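A sketch of that guard (create_paper_from_record is a hypothetical name for the import routine):

```python
from django.db import DataError

def safe_import(record):
    """Import a paper, skipping records whose title or author names
    exceed the database column limits (a sketch)."""
    try:
        return create_paper_from_record(record)  # hypothetical import routine
    except DataError:
        return None  # field too long for the column: ignore this paper
```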
RabbitMQ is too big for us: it runs out of memory and crashes the server. Let's keep things simple and use Redis instead.
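Assuming the task queue is Celery (not stated in this issue), switching the broker to Redis would be a one-line settings change:

```python
# settings.py -- point the Celery broker at a local Redis instance
# instead of RabbitMQ (the URL is a placeholder for the actual deployment).
BROKER_URL = 'redis://localhost:6379/0'
```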
This publisher should be "nok" instead of "unk":
http://beta.ens.dissem.in/publisher/404/
We might need to create a sitemap. Django's sitemap framework is well documented.
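A minimal sketch using django.contrib.sitemaps (the Paper model path and attributes are assumptions):

```python
# sitemaps.py -- a minimal sitemap for paper pages
from django.contrib.sitemaps import Sitemap
from papers.models import Paper  # assumed model location

class PaperSitemap(Sitemap):
    changefreq = 'weekly'

    def items(self):
        # Each item must provide get_absolute_url() for its sitemap entry.
        return Paper.objects.all()
```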
Follow up with the administration, listing the information we need:
Required:
Helpful:
Optional:
SHERPA/RoMEO embeds the lengths of embargo periods in a special <num> tag.
These embargo periods are currently not stored in dissemin. This issue is about storing them in the model, and refining the "uploadability" of the papers based on that.
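A sketch of pulling those lengths out of the RoMEO XML with lxml; the assumption that each <num> is immediately followed by a <period units="..."> element matches the examples we have seen but is not checked against the full schema:

```python
from lxml import etree

def embargo_months(romeo_xml):
    """Extract embargo lengths (in months) from a SHERPA/RoMEO response (a sketch)."""
    root = etree.fromstring(romeo_xml)
    lengths = []
    for num in root.iter('num'):
        months = int(num.text)
        period = num.getnext()  # assumed: <num> is directly followed by <period>
        if period is not None and period.get('units') == 'years':
            months *= 12
        lengths.append(months)
    return lengths
```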
We "fixed" django CAS in a crappy way. We should investigate to check if this fix does not break anything else. A function was called with a variable number of arguments, but the wrapper was not considering this, so we added these arguments and we just throw them.
Sources with higher priority should be preferred
When there are lots of coauthors (LHC papers, for example), the rendering is not acceptable. We could probably just print 5 random authors (or the ENS authors?), with a way to see all the authors.
For now everything is static, which is sad! Just schedule a job to update the papers; all the ingredients are ready.
BASE returns URLs with http for HAL, but we get https URLs through OAI-PMH: these should be considered equivalent.
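A sketch of one way to treat them as equivalent when deduplicating, by folding https into http before comparison (whether this is safe for every source is an assumption):

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit      # Python 2

def canonical_url(url):
    """Normalize a URL for duplicate detection (a sketch)."""
    parts = urlsplit(url)
    scheme = 'http' if parts.scheme == 'https' else parts.scheme
    return urlunsplit((scheme,) + tuple(parts[1:]))

# Illustrative identifier, not a real record:
assert canonical_url('https://hal.example.org/hal-01234567') == \
       canonical_url('http://hal.example.org/hal-01234567')
```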
Following the meeting of 3 July, we need to compile the researcher lists ourselves. It would be better not to use any resource available only from inside the ENS, but only public information on the departments' websites.
We need them as a TSV file (one researcher per line, tab-separated fields), with the following columns:
Only the names and the department are required.
Example file (some fields are left blank):
```
Baron-Cohen	Simon	http://www.psychol.cam.ac.uk/directory/simon-baron-cohen	******@cam.ac.uk	Academic staff	Department of Psychology
Bekinschtein	Tristan	http://www.psychol.cam.ac.uk/directory/tristan-bekinschtein	******@cam.ac.uk	Academic staff	Department of Psychology
```
One way to extract them automatically is to scrape the HTML of department pages. The following is an example for Institut Jean Nicod.
```python
# -*- encoding: utf-8 -*-
# Scrape the member pages of Institut Jean Nicod and emit one researcher
# per line as tab-separated values (Python 2).
from __future__ import unicode_literals
from sys import stdout
from urllib2 import urlopen, HTTPError

from lxml.html import document_fromstring

from papers.name import parse_comma_name

# Landing pages listing the members, with the role they correspond to.
root_urls = [
    ('http://www.institutnicod.org/membres/membres-statutaires/?lang=fr', 'Membre statutaire'),
    ('http://www.institutnicod.org/membres/post-doctorants-35/', 'Post-doctorant·e'),
    ('http://www.institutnicod.org/membres/etudiants/doctorants/?lang=fr', 'Doctorant·e'),
]

url_roles = []
for rooturl, role in root_urls:
    root = document_fromstring(urlopen(rooturl).read())
    # Each member has an entry in the side menu linking to a personal page.
    for a in root.xpath("//li[@class='menu-entree']/a"):
        url_roles.append(('http://www.institutnicod.org/' + a.get('href'), a.text, role))

for link, nom, role in url_roles:
    try:
        mdoc = document_fromstring(urlopen(link).read())
        groupe = 'Institut Jean Nicod'
        email = ''
        for elem in mdoc.xpath("//a"):
            if not elem.text:
                continue
            # Email addresses are hidden behind "Contact"/"Email" links or mailto: hrefs.
            if (elem.text == 'Contact' or elem.text == 'Email'
                    or elem.get('href', '').strip().startswith('mailto:')):
                email = elem.get('href', 'mailto:')[7:]  # strip the "mailto:" prefix
            # Personal home pages are linked as "Site Web", sometimes inside a <span>.
            if elem.text == 'Site Web':
                link = elem.get('href')
            subspan = elem.xpath(".//span")  # was "//span", which searched the whole page
            if subspan and subspan[0].text and 'site web' in subspan[0].text.lower():
                link = elem.get('href')
        first, last = parse_comma_name(nom)
        print('\t'.join([last, first, link, email, role, groupe]).encode('utf-8'))
        stdout.flush()
    except HTTPError:
        pass  # dead member page: skip this researcher
```
This is because the BASE interface is dumb and could be greatly improved.
This is ugly. Handle these characters properly and allow a limited subset of HTML markup (<sup>, for instance).
Examples:
http://ens.dissem.in/paper/1692/
http://ens.dissem.in/paper/13847/
http://ens.dissem.in/paper/15326/
http://ens.dissem.in/paper/16907/
And display them in the interface, for instance to avoid uninformative titles such as "Lecture Notes in Computer Science". Look at the "container-title" field in the CrossRef metadata:
{"subtitle":[],"issued":{"date-parts":[[2012]]},"score":1.0,"prefix":"http:\/\/id.crossref.org\/prefix\/10.1007","author":[{"family":"Abdalla","given":"Michel"},{"family":"Fouque","given":"Pierre-Alain"},{"family":"Lyubashevsky","given":"Vadim"},{"family":"Tibouchi","given":"Mehdi"}],"container-title":"Advances in Cryptology \u2013 EUROCRYPT 2012","reference-count":0,"page":"572-590","deposited":{"date-parts":[[2014,1,23]],"timestamp":1390435200000},"title":"Tightly-Secure Signatures from Lossy Identification Schemes","type":"book-chapter","DOI":"10.1007\/978-3-642-29011-4_34","ISSN":["0302-9743","1611-3349"],"ISBN":["http:\/\/id.crossref.org\/isbn\/978-3-642-29010-7","http:\/\/id.crossref.org\/isbn\/978-3-642-29011-4"],"URL":"http:\/\/dx.doi.org\/10.1007\/978-3-642-29011-4_34","source":"CrossRef","publisher":"Springer Science + Business Media","indexed":{"date-parts":[[2014,9,16]],"timestamp":1410910441629},"member":"http:\/\/id.crossref.org\/member\/297"}
This can involve:
This is tricky because a simple edit distance fails; for instance, the two following papers are different:
Note that the current fingerprint system already collapses different papers together:
But adding the date in the fingerprint is not an option (preprints and publication years are often different).
Looks like we've broken it by moving to lxml…
http://beta.ens.dissem.in/publisher/353/
http://beta.ens.dissem.in/publisher/473/
Create a nice version of these (keeping a similar design). That's easy.
We use gettext, so this can be done by anyone!
This has to be done after the changes in the frontend, i.e. in the webdiz branch.
If you want to contribute, the file is here: locale/fr/LC_MESSAGES/django.po.
It can be regenerated with python manage.py makemessages -l fr.
It can be compiled (so that it appears on the web site) with python manage.py compilemessages.
Display the access statistics by department, journal, and publisher. A bar-based display is already implemented; we just need to make it understandable (or change it).
This could help us to increase the coverage of our policy stats.
One option would be to reuse zotero/translators (properly isolated). The problem is that they do not extract author emails or affiliations. We can also use the Google Scholar meta markup, but not all publishers support it (Elsevier does not, Springer does).
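For publishers that do support it, harvesting those tags is simple; a sketch with lxml (the citation_* names are the standard Highwire tags Google Scholar reads, but which ones a given publisher emits varies):

```python
import requests
from lxml.html import document_fromstring

def scholar_metadata(url):
    """Collect the <meta name="citation_*"> tags from a publisher page (a sketch)."""
    doc = document_fromstring(requests.get(url).content)
    metadata = {}
    for meta in doc.xpath("//meta[starts-with(@name, 'citation_')]"):
        # Repeatable fields such as citation_author accumulate into a list.
        metadata.setdefault(meta.get('name'), []).append(meta.get('content'))
    return metadata
```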
Other scraping tools:
The home page is terrible.
What we need is:
When looking at a publications list, we don't see the difference between the two logos at first. Maybe using different colours could help.
Tons of things to do… :-(
For now, if two papers from different sources have the same fingerprint, we merge them, keeping title and authors as in the first record we have discovered. This is far from optimal, because the second record can actually be more precise (full first names instead of initials, for instance).
Ensure that we get the most of our sources by implementing a name unification algorithm.
For instance: ("A. G. Erschler","Anna Erschler") -> ("Anna G. Erschler")
This enables (among others) importing publications from Google Scholar profiles.
The papers could be suggested from the candidate list (currently invisible to regular users) and a form could be provided to add unknown papers.
This probably involves tweaking pyoai (again…)
e.g.:
To find an existing SWORD server more tolerant than the HAL one;
(To contact the HAL people to ask for a specification of their implementation / their source code)
|| (To fuzz the HAL preprod server to build this specification)
When a publisher is provided (e.g. through CrossRef), we could look up the default policy of that publisher and assign it to the journal, even when the journal itself is not found. We have to be careful though.
Many pages could be efficiently cached as the platform is mostly read-only.
Caching is well documented for Django, but there are many decisions to make.
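As a first step, per-view caching is a single decorator; a sketch using Django's cache framework (the view name, template, and timeout are placeholders):

```python
from django.shortcuts import render
from django.views.decorators.cache import cache_page

@cache_page(60 * 15)  # serve the rendered page from cache for 15 minutes (placeholder)
def paper(request, pk):
    # Hypothetical read-only view: a good candidate, since paper pages rarely change.
    return render(request, 'papers/paper.html', {'pk': pk})
```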