rufuspollock-okfn / bibserver

BibServer is open-source software that makes it easy to publish, manage and find bibliographies. BibServer is RESTful and web-friendly.

License: MIT License

Python 72.01% CSS 0.15% JavaScript 3.25% HTML 16.76% Perl 7.82%

bibserver's Issues

[super] Authentication system and its use

signup / login is #2
identify collection as belonging to me ...

Memory leak

An as yet unidentified memory leak was introduced during the refactor. It could be due to pyes, flask or a change in our own code. BibServer appears to eat about 4 GB of memory per day. The cause has not been investigated yet.

Switch bibsoup.net to refactored ES version

bibsoup.net is still running against SOLR. Move it to the refactored ES version.

Change importer to do bulk upload to index

Importer currently only indexes one record at a time. These should be batched and bulk inserted.
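
A minimal sketch of the batching idea, assuming records are plain dicts carrying an "id" key and that the index accepts Elasticsearch's newline-delimited bulk format. The names `batched` and `bulk_body` and the index name "bibserver" are hypothetical, not taken from the importer code.

```python
import json
from itertools import islice


def batched(records, size):
    """Yield lists of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_body(chunk, index="bibserver"):
    """Build a newline-delimited bulk request body for one batch.

    Each record contributes an action line followed by its source line.
    The index name is an assumption for illustration.
    """
    lines = []
    for rec in chunk:
        lines.append(json.dumps({"index": {"_index": index, "_id": rec["id"]}}))
        lines.append(json.dumps(rec))
    return "\n".join(lines) + "\n"
```

Each body returned by `bulk_body` would then be sent to the index in a single request, instead of one request per record.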

Refactored code is not combining text search value with facet values

The refactored version is not combining text search value with facets, so when a facet filter is applied the text search value is deleted. I will check this out and repair.
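
One way to picture the intended behaviour, as a sketch only: keep the search state in one structure and make applying a facet return a new state that still carries the text query. The state layout and the name `apply_facet` are assumptions for illustration, not the refactored code.

```python
def apply_facet(state, field, value):
    """Return a new search state with the facet added and `q` preserved."""
    new_state = {
        "q": state.get("q", ""),
        # Copy the facet lists so the caller's state is left untouched.
        "facets": {k: list(v) for k, v in state.get("facets", {}).items()},
    }
    values = new_state["facets"].setdefault(field, [])
    if value not in values:
        values.append(value)
    return new_state
```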

Create a setup.py script

install will be easier once we have a setup script.

Missing Search Results

Search for "Pitman" in the aldous data
http://bibsoup.net/collection/aldous9?q=Pitman&a=%257B%2522q%2522%253A%2520%257B%257D%252C%2520%2522start%2522%253A%252020%252C%2520%2522rows%2522%253A%252020%252C%2520%2522facet_field%2522%253A%2520%255B%2522journal.exact%2522%252C%2520%2522year.exact%2522%252C%2520%2522collection.exact%2522%252C%2520%2522type.exact%2522%252C%2520%2522author.exact%2522%255D%257D&submit_search=Search
returns as top entry

Weak Convergence of Random $p$-Mappings and the Exploration Process of Inhomogeneous Continuum Random Trees
Probab. Th. Rel. Fields
D.J. Aldous, G. Miermont, J. Pitman
http://xxx.arXiv.org/abs/math.PR/0401115

But search for title words "Continuum Random Trees"
http://bibsoup.net/collection/aldous9?q=Continuum+Random+Trees&a=%257B%2522q%2522%253A%2520%257B%257D%252C%2520%2522start%2522%253A%252020%252C%2520%2522rows%2522%253A%252020%252C%2520%2522facet_field%2522%253A%2520%255B%2522journal.exact%2522%252C%2520%2522year.exact%2522%252C%2520%2522collection.exact%2522%252C%2520%2522type.exact%2522%252C%2520%2522author.exact%2522%255D%257D&submit_search=Search

Returns nothing.

http://bibsoup.net/collection/aldous9?q=Continuum+Random&a=%257B%2522q%2522%253A%2520%257B%257D%252C%2520%2522start%2522%253A%252020%252C%2520%2522rows%2522%253A%252020%252C%2520%2522facet_field%2522%253A%2520%255B%2522journal.exact%2522%252C%2520%2522year.exact%2522%252C%2520%2522collection.exact%2522%252C%2520%2522type.exact%2522%252C%2520%2522author.exact%2522%255D%257D&submit_search=Search

returns 4 entries, but not the one above. These entries have "continuum random tree" in the "subjects" field, but not in the title.
It's a serious defect to miss words in the title. Maybe the title words never got into the index?
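
For debugging, it may help to see what the `a` parameter in the URLs above contains. A small sketch, assuming the value is JSON that has been percent-encoded twice (which is how it appears in the raw URLs):

```python
import json
from urllib.parse import unquote

# The `a` parameter copied from the search URLs in this issue.
A_PARAM = (
    "%257B%2522q%2522%253A%2520%257B%257D%252C%2520%2522start%2522%253A%252020"
    "%252C%2520%2522rows%2522%253A%252020%252C%2520%2522facet_field%2522%253A%2520%255B"
    "%2522journal.exact%2522%252C%2520%2522year.exact%2522%252C%2520%2522collection.exact%2522"
    "%252C%2520%2522type.exact%2522%252C%2520%2522author.exact%2522%255D%257D"
)


def decode_state(raw):
    """Undo two rounds of percent-encoding, then parse the JSON."""
    return json.loads(unquote(unquote(raw)))


state = decode_state(A_PARAM)
print(state)
```

The decoded state carries an empty "q", "start" and "rows" values of 20, and the list of facet fields. Whether any of these affects the missing result is not established here.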

set importer to check to see if local version is the same

When a user imports a file, it is stored locally in store/raw. When a file is imported again, the importer should check whether it differs from the stored copy. Presumably it will, but this check will also be required once we enable scheduled automated checks of web URLs, so it needs doing anyway.
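
A sketch of how the comparison could work, assuming the stored copy is a file on disk and comparing content digests. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def content_digest(data: bytes) -> str:
    """Hex SHA-256 digest of the raw file content."""
    return hashlib.sha256(data).hexdigest()


def has_changed(stored_path, new_data: bytes) -> bool:
    """True if there is no stored copy yet or its content differs."""
    path = Path(stored_path)
    if not path.exists():
        return True
    return content_digest(path.read_bytes()) != content_digest(new_data)
```

The importer could skip re-indexing whenever `has_changed` returns False. Storing the digest alongside the file would avoid re-reading it on every check.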

Make default theme nicer

Current theme is good but we could make it quite a bit nicer without too much work (e.g. by reusing an existing OKFN theme).

Minor Refactorings

Relocate /query to /api/search
Do not have search on every page at top (if wanted on every page put into top bar properly) - should be part of retheme #19
Get rid of option to set number of rows in query results (Why: who needs this and adds complexity)
Remove use of path segments to create implicit facets. [No - leave in]
- Main use of this was for supporting /collection/xyz?... I believe. IMO this isn't needed, at least in the first instance - people can just search and set the relevant facet (on collection)
Have 'normal' pagination using a normal page query string param (probably there in fact!)

Refactor to use Flask instead of web.py

Asciify method in resultmanager can cause unicode errors

This method needs fixing - quite a few different things cause it to trip and throw a UnicodeEncodeError
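
For reference, a sketch of an asciify that should not raise UnicodeEncodeError, assuming the input is already a unicode string. It is not the resultmanager method, just one common approach: decompose accented characters, then drop whatever is not ASCII.

```python
import unicodedata


def asciify(text):
    """Best-effort ASCII version of a unicode string.

    Accented letters are decomposed (NFKD) so the base letter survives.
    Characters with no ASCII decomposition are silently dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

Note the limitation: letters such as "ø" have no decomposition and simply disappear, so this is only suitable where a lossy result is acceptable.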

Add unit tests

There are currently no tests to run, as we have been pulling together various bits and pieces. We now need some tests to check that changes do not break functionality.

Replication of old Bibserver on Schramm data

Replication of the old bibserver capability on the benchmark Schramm dataset http://research.microsoft.com/~schramm/bibserver.bib is a first milestone. Following are a few issues I see between here and there. Probably these should be broken out into several issues, but I try to collect them here for completeness.

upload of the dataset. The upload failed for me.
The upload from a url is a post request. It should be a get request, so it can be easily bookmarked, and the data should be saved with a filename or id which is a suitably sanitized form of the url. e.g.

http://bibserver.berkeley.edu/cgi-bin/bibs7?source=http%3A%2F%2Fresearch.microsoft.com%2F~schramm%2Fbibserver.bib

User supplied titles should not be used as ids, as they will clash eventually.

First upload should create a cache. Thereafter, subsequent calls for the same url should pull from cache, except with an indication e.g. "refresh=yes" in the get string to refresh from source.
There should be no redirect required for this upload procedure from a url. Perhaps optionally, but not required.
Listings of Authors should be alphabetical, with author links, like http://bibserver.berkeley.edu/cgi-bin/bibs7?&source=http://research.microsoft.com/~schramm/bibserver.bib&index=authors or at least such a complete listing should be available.
Format in .bib or .json to produce such displays is negotiable. It's good to have a simple dropdown list of authors in the
left nav bar, but this does not replace a comprehensive author listing page.
Capability for subjects listing like http://bibserver.berkeley.edu/cgi-bin/bibs7?&source=http://research.microsoft.com/~schramm/bibserver.bib&index=subjects
should be supported, both in the data model and in display.
Similar capability for journals is also desirable.

Generally, we want for any facet (except perhaps types and years), to have both a simple dropdown in the nav, and a more
comprehensive full page listing which allows external links based on attributes in suitable entity tables for journals/subjects/people/...

Need the footer e.g. "Display created by BibServer from this bibtex source file http://research.microsoft.com/~schramm/bibserver.bib" or ".... from file uploaded by user ... now cached at .... " if its a user upload. Generally, need to display the provenance of the data, so user knows where it is coming from.
Provide "Edit source" input at the bottom.
Demonstrate display template to replicate the item display as closely as possible, and give maintainer control of details in display, e.g. what things are linked, in what order, ....
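
The caching behaviour asked for above (an id derived from the sanitized url, reuse of the cache on later calls, and an explicit refresh) could look roughly like this. It is a sketch under assumptions: `cache_id`, `load_source` and the `fetch` callable are hypothetical names, and the short hash suffix is added only to keep two different urls from sanitizing to the same id.

```python
import hashlib
import re
from pathlib import Path
from urllib.parse import urlsplit


def cache_id(url):
    """Readable, filesystem-safe id derived from the source url."""
    parts = urlsplit(url)
    readable = re.sub(r"[^A-Za-z0-9]+", "_", parts.netloc + parts.path).strip("_")
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{readable}_{digest}"


def load_source(url, cache_dir, fetch, refresh=False):
    """Return the dataset bytes, fetching only on first use or on refresh.

    `fetch` is any callable taking a url and returning bytes.
    """
    path = Path(cache_dir) / cache_id(url)
    if refresh or not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch(url))
    return path.read_bytes()
```

A get request with "refresh=yes" would then map to `refresh=True`, and every other request for the same url would be served from the cache.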

NLM XML DTD Parser

A number of sources in various fields have stabilized on NLM XML for their biblio standard.
Including
EuDML
PKP Citation Markup Assistant
We should provide a converter to BibJSON

Refactor tests to work with refactored solreyes

whilst refactoring solreyes, it is necessary to change some of the tests we have, because solreyes will not exist any more.

Repeated uploads from same source

Repeated uploads from the same source are producing multiple entries.
e.g. http://bibsoup.net/collection/hartley
has been uploaded 3 times, and there are now 600 entries instead of 200, with each entry repeated 3 times, e.g.
http://bibsoup.net/collection/hartley?q=lifting+group
This is not the desired behavior. The simplest behaviour implemented by the legacy bibserver is that
each new upload from the same url should overwrite any previous upload. That should be implemented first.
Note that this overwriting should not just be entry by entry. The entire collection should be replaced.

Pager display bug

With http://bibsoup.net/collection/chung_test
when the page first loads, there flashes by an alternate paging scheme, like

153 Results [1-10] [11-20] ....

then this is replaced by

Results of 1-10 153. Show 10 per page. (with dropdowns).

Fine to experiment with different pager options, but the flashing of one before the other is disturbing.

In the present pager, need [Next] and [Previous] buttons.

Query endpoint gives 405 when hit with JSONP request

Turns out flask does not support JSONP... will have to fix this in flask in order to make the query endpoint useful again.
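
Independent of the framework, the JSONP part itself is small. A sketch, assuming the endpoint accepts GET (JSONP requests are always GET) and receives the callback name as a query parameter. The function name and the callback pattern are assumptions for illustration.

```python
import json
import re

# Allow only plain (optionally dotted) JavaScript identifiers as callbacks,
# so arbitrary script cannot be injected through the parameter.
_CALLBACK_RE = re.compile(r"^[A-Za-z_$][A-Za-z0-9_$]*(\.[A-Za-z_$][A-Za-z0-9_$]*)*$")


def jsonp_response(payload, callback=None):
    """Return (body, content_type), wrapping the JSON when a callback is given."""
    body = json.dumps(payload)
    if callback is None:
        return body, "application/json"
    if not _CALLBACK_RE.match(callback):
        raise ValueError("invalid JSONP callback name")
    return f"{callback}({body});", "application/javascript"
```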

Enable frontend record / collection editing

this is not too hard, but requires some form of user control first. then allow people to edit records and collections.

Erdos breaks BibTexParser.py

Upload from http://bibserver.berkeley.edu/tmp/erdos.bib produces

<type 'exceptions.IndexError'> at /upload
list index out of range
Python /opt/bibserver/parsers/BibTexParser.py in read_bibitem, line 87
Web POST http://bibsoup.net/upload

Compare with http://bibserver.berkeley.edu/cgi-bin/bibs7?source=http%3A%2F%2Fbibserver.berkeley.edu%2Ftmp%2Ferdos.bib

Note that the author listing for Erdos is an excellent test of unicode conversion capabilities from tex accents, also
handling of a long author list (568 authors). The old bibserver is doing this by a crude mapping to html entities. For the new one, best to use the NUMDAM tex to unicode converter.
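
To illustrate what the conversion involves (this is not the NUMDAM converter, only a small stand-in covering a handful of accent commands), one approach is to map each TeX accent to a Unicode combining character and normalize the result:

```python
import re
import unicodedata

# TeX accent command -> Unicode combining character (small subset only).
_COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # diaeresis
    "~": "\u0303",  # tilde
    "H": "\u030b",  # double acute, as in Erd\H{o}s
    "c": "\u0327",  # cedilla
    "v": "\u030c",  # caron
}

# Symbol accents may omit the braces; letter accents must use them.
_ACCENT_RE = re.compile(r"""\\(?:(['`^"~])\{?([A-Za-z])\}?|([Hcv])\{([A-Za-z])\})""")


def tex_to_unicode(text):
    """Replace simple TeX accent commands with precomposed characters."""
    def repl(match):
        accent = match.group(1) or match.group(3)
        letter = match.group(2) or match.group(4)
        return unicodedata.normalize("NFC", letter + _COMBINING[accent])
    return _ACCENT_RE.sub(repl, text)
```

A real converter needs to handle far more (special letters, math mode, nested braces), which is why a maintained converter is the better choice for production.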

Clean up and simplify templates

Large number of small templates which are often only included in one other template (usually solreyes). In order to simplify development and keep code cleaner suggest consolidating into main template and removing as needed.

Est cost: 2h.

Move bibsoup.net to dev.okfn.org

There is a big ES instance running on dev.okfn.org.

Move bibsoup.net there, particularly to try running the medline index.

Check if anyone will be upset if I kill the ES service by trying to facet the author field on the medline index...

Put mapping to ES after creating index

DAO has an init_db method. I added a call to put_mapping, and put the mapping in the config. However put_mapping fails. Will need to find how to get pyes to put a dynamic mapping. Doing this would save having to manually create the index and put the mapping during install.

User registration and user login (US-011)

need to enable user logins so that users can own and edit collections

Move parsers into bibserver package

Also suggest renaming BibTexParser.py to bibtex.py

Control of display config

This is a breakout of item 10) of #14

Demonstrate display template to replicate the item display as closely as possible, and give maintainer control of details in display, e.g. what things are linked, in what order, ....

to which Mark replied:

Yes, this sort of thing has to be looked at more once we have user login. They are all managed by config at present, but we need to control how people will manage their own config file, which means we need to know who people are in some way.

I agree with "we need to control how people will manage their own config file"
I do not agree this implies "we need to know who people are in some way".
I think we should regard BibServer as a webservice which takes two sorts of input data

a biblio dataset
a display config file
and returns displays of the dataset according to the config file.
All a user should have to do is provide a pair of urls, one with the dataset, and one with the config file, and then BibServer should do its thing. This requires no login or identification of users. I think we should try to avoid imposing such barriers for as long as possible, and should try to encourage a culture of users posting both data and display config files with an open license.
We should maintain an index of config files we think are of good quality, and allow users to pick/choose from these with a
simple dropdown. None of this requires knowing who people are in any way.

Of course, both biblio datasets and config files should be checked in some way to see they are not malicious. I hope this
is adequate as an alternative to login gates.

Still can't upload Erdos

I just tested
http://bibsoup.net/upload?source=http://bibserver.berkeley.edu/tmp/erdos.bib&collection=erdos
and it returns an Internal Server Error

This was a previous issue (#15) but I don't seem to have permission to reopen issues so I am making it a new one.

Refactor mako templates to jinja2

Flask bundles Jinja2 by default and it seems to be preferred in the general community. However, we have working mako templates and others may prefer them. So this is a very low priority (and perhaps is a wontfix?)

View record or search

The View record or search [source] for [text] [Go]

is potentially a great feature! It looks like this has been scripted from some search resource list I had lying around.
I'd like to see how this is controlled, and work on improving the selection of search resources which should be provided
in a config file.

Note that the [text] would be better entered in a text box.
A tricky point is that best construction of the query from data will depend on the source searched.
Probably some curated searches should be offered, then the user can mix/match with others.

As an example, if the record has a DOI, e.g. 10.1214/aoms/1177705069 then the search

http://scholar.google.com/scholar?q=10.1214/aoms/1177705069

typically returns an exact match. So if an item has a DOI, this link should always be offered. Without the DOI, something like

http://scholar.google.com/scholar?hl=en&q=author%3A%22Chung%22+The+ergodic+theorem+of+information+theory
gets the item as number 3 in a list of 85. The list from this softer search is very useful, but it is good to be able to provide both soft and exact matches into Google Scholar and other resources. Where exact matches are obtained, we should try to harvest them,
and register the remote ids. e.g.
http://scholar.google.com/scholar?cluster=17976357296721002721
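
The two kinds of link described above can be generated from a record along these lines. A sketch only: the record keys "doi", "author_surname" and "title" are assumed field names, not the BibJSON model.

```python
from urllib.parse import urlencode

SCHOLAR = "http://scholar.google.com/scholar"


def scholar_links(record):
    """Build an exact (DOI) and a soft (author + title) Scholar search link."""
    links = {}
    if record.get("doi"):
        # Keep "/" unescaped so the DOI stays readable in the link.
        links["exact"] = SCHOLAR + "?" + urlencode({"q": record["doi"]}, safe="/")
    author = record.get("author_surname")
    title = record.get("title")
    if author and title:
        query = f'author:"{author}" {title}'
        links["soft"] = SCHOLAR + "?" + urlencode({"hl": "en", "q": query})
    return links
```

Other search resources would need their own query builders, which fits the point that the best query construction depends on the source searched.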

I expect that a tool for providing high quality searches and links from a particular record is something we should develop as a separate module. This needs to be customized a lot depending on the type of record (article, book, person, ... ) and the
information already available. I have had many attempts at this. A general framework for managing such searches and links would be desirable.

Add content negotiation (cf US-013)

add in content negotiation so that JSON or HTML can be easily returned. Richard to drop in the code from SSS.
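
Until the SSS code is dropped in, here is a rough sketch of what the negotiation has to decide, assuming only the Accept header is considered and only HTML and JSON are offered. It is a simplified parser (no wildcard subtypes such as text/*), and `preferred_type` is a hypothetical name.

```python
def preferred_type(accept_header, offered=("text/html", "application/json")):
    """Pick the offered media type the client prefers, or None if none fits."""
    if not accept_header:
        return offered[0]
    ranked = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        quality = 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        if quality > 0:
            # Sort by descending quality, then by order in the header.
            ranked.append((-quality, position, parts[0].lower()))
    for _, _, media in sorted(ranked):
        if media in offered:
            return media
        if media == "*/*":
            return offered[0]
    return None
```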

Document the currently available code

I need to improve the documentation for the code we have now. We have stuff that does something useful at this point, so it is time to scale up dev efforts. I will update docs and provide diagrams of how the code fits together etc.

Put urlmanager under test

urlmanager came from solreyes. It now needs a test.

Change elasticsearch mapping to use X.exact instead of X.raw

exact seems like a more sensible term instead of raw, as the unanalysed field is stored exactly as it is found.
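
As an illustration of the naming, a mapping with an X.exact sub-field could be generated like this. Caveat: the sketch uses the `text` and `keyword` field types of recent Elasticsearch versions, which is an assumption. The Elasticsearch version in use for this issue expressed the same idea with different type names.

```python
def exact_mapping(fields):
    """Mapping where each field is analysed and has an unanalysed `.exact` twin."""
    return {
        "properties": {
            name: {
                "type": "text",
                "fields": {"exact": {"type": "keyword"}},
            }
            for name in fields
        }
    }
```

With such a mapping, faceting would use "author.exact" while free-text search uses "author".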

Test with ryan running own install (US-007)

update website and check that ryan can install a bibserver

refactor solreyes to separate modules

solreyes code was built up from stuff originally started in a different project. it quickly got us off the ground, but needs factoring out. will separate it into resultmanager and urlmanager and setconfig.

subsequent tickets might see refactorings of those individual items.

PyES queries are only returning back top ten facets

The PyES queries are defaulting to only returning the most common 10 values - how do we pass through the "size" parameter to ES through PyES?
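
At the level of the raw query body (how PyES exposes this is not shown here), the terms facet takes a "size" setting per facet. A sketch, assuming the facet API of the Elasticsearch versions of that period; later versions replaced facets with aggregations. `facet_query` is a hypothetical helper.

```python
def facet_query(q, facet_fields, size=100):
    """Build a raw search body asking for `size` values per terms facet."""
    return {
        "query": {"query_string": {"query": q}} if q else {"match_all": {}},
        "facets": {
            field: {"terms": {"field": field, "size": size}}
            for field in facet_fields
        },
    }
```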

refactor manager to importer

Manager was initially envisaged as managing async uploads, but this is not required. Change manager to importer, which just handles importing content into the index.

Update tests too.

Factor out csv parser from dataset.py and put under test

Refactor dataset to parser

The dataset class is improperly named for function now, rename to parser.

This class now manages parsing of files, and uses whichever parser is requested.

Allow for disabling the frontend upload

Have added a config option and controls in web.py and index template to hide the upload page and remove upload functionality when allow_upload is set to NO.

This, in combination with using bulk_upload from the command line, allows a department to run a bibserver with only the content an administrator pushes to it.

Superuser / sysadmin auth

enable existence of a user account that can do anything

Add a Command Line Interface to Bibserver

Would like to carry out some operations from the command line, e.g.:

Cleaning the database
Adding some demo fixtures
Converting from one format to another (especially bibjson)

Create person records

Imagine this involves:

Person Domain Object
Create Person objects automatically on upload and associate them to the Record
- MM: whereas author is an attribute of a book, person is a representation of a person. therefore we will create person records for each person, and append a list of relevant persons to each record.

Import from URL list (US-017)

Bulk import json file:

{
    "collections": [
        {
            "url": "...",
            "name": "..."
        }
    ]
}

command line function: bulkimport which takes url to json file (can be file:/// of course!)
logic.py with new bulk_import(bulkimport_dict) called by this (with test)
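
A possible shape for `bulk_import`, as a sketch: it walks the "collections" list and hands each entry to an import function. The `import_one` callable stands in for whatever the importer exposes and is an assumption, as is collecting failures instead of aborting.

```python
def bulk_import(bulkimport_dict, import_one):
    """Import every collection listed in the dict.

    `import_one(url, name)` is a hypothetical importer callable.
    Returns a dict mapping collection name to the import result,
    or to the exception raised for that collection.
    """
    results = {}
    for entry in bulkimport_dict.get("collections", []):
        name, url = entry["name"], entry["url"]
        try:
            results[name] = import_one(url, name)
        except Exception as exc:  # keep going so one bad source does not stop the rest
            results[name] = exc
    return results
```

Taking the import function as a parameter keeps the function easy to put under test with a fake importer.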

Put BibTex parser under test

Switch from SOLR to elasticsearch

In DAO layer
In solreyes

NB: RP has already done quite a bit of wrapping elasticsearch in python in https://github.com/okfn/hypernotes

Source not apparent from display

From the display

http://bibsoup.net/collection/hartley

it is not apparent from what url the data was uploaded. The url, in this instance http://rsise.anu.edu.au/~hartley/hartley.bib
should be apparent, as it is in

http://bibserver.berkeley.edu/cgi-bin/bibs7?source=http://rsise.anu.edu.au/~hartley/hartley.bib

See text "Display created by BibServer from this bibtex source file http://rsise.anu.edu.au/~hartley/hartley.bib" at bottom of page
with hyperlink to the source file.

Providers of source data should be encouraged to provide metadata about their collection which can then be made part of the default BibServer display of that collection. Formatting and handling of that metadata is another issue. First let's ensure that the
use case with no metadata besides a source url is accommodated.
Allowing uploaders to name their collection uploads complicates the issue raised here. The legacy BibServer names its caches by source url, so the issue of what happens if the same url is uploaded with two different collection names does not arise. It does not seem a good idea to be making copies of the data from the same url even with different collection names, but this may require further discussion.

To summarize.

There should be just one copy of data from a given url in the BibSoup.
This entire copy should be overwritten by any re-upload of that dataset by any user.
The source url should always be apparent from the BibSoup display.
There should be a button on the display which allows refreshing of the dataset.
These are well tested and successful features of the legacy BibServer display. Let's demonstrate these features before
trying more complex things.

Add proper scheduling (i.e. asynchronous job processing)

currently, when the manager is called, it just executes an upload. need to change this to queue an upload.
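
To show the difference between executing and queueing, here is a sketch using only the standard library rather than celery. The job layout and the names `enqueue_upload` and `worker` are assumptions. A real deployment would more likely use a proper task queue as suggested.

```python
import queue
import threading

upload_queue = queue.Queue()


def enqueue_upload(source_url, collection):
    """Called by the web request: record the job and return immediately."""
    upload_queue.put({"source": source_url, "collection": collection})


def worker(do_upload, results):
    """Run queued uploads one at a time until a None sentinel arrives."""
    while True:
        job = upload_queue.get()
        if job is None:
            upload_queue.task_done()
            break
        try:
            results.append(do_upload(job))
        finally:
            upload_queue.task_done()
```

The worker would run in a background thread or separate process, so the upload request returns as soon as the job is queued.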

RP: Suggest using celery for this ...

Debug pyes issue in refactor

The refactored version runs but fails to upload on a specific dataset because upsert trips pyes when mapping.

Find error, add failing example test, and then fix.
