bibserver's People

Contributors

edchamberlain, epoz, flowsta, gpaumier, jgoldfar, loleg, ptgolden, rufuspollock, tfmorris

bibserver's Issues

Memory leak

An as-yet-unidentified memory leak was introduced during the refactor. It could be due to pyes, Flask, or a change in our own code. Bibserver appears to consume about 4 GB per day. The cause has not been investigated yet.

Missing Search Results

Search for "Pitman" in the aldous data
http://bibsoup.net/collection/aldous9?q=Pitman&a=%257B%2522q%2522%253A%2520%257B%257D%252C%2520%2522start%2522%253A%252020%252C%2520%2522rows%2522%253A%252020%252C%2520%2522facet_field%2522%253A%2520%255B%2522journal.exact%2522%252C%2520%2522year.exact%2522%252C%2520%2522collection.exact%2522%252C%2520%2522type.exact%2522%252C%2520%2522author.exact%2522%255D%257D&submit_search=Search
returns as top entry

Weak Convergence of Random $p$-Mappings and the Exploration Process of Inhomogeneous Continuum Random Trees
Probab. Th. Rel. Fields
D.J. Aldous, G. Miermont, J. Pitman
http://xxx.arXiv.org/abs/math.PR/0401115

But search for title words "Continuum Random Trees"
http://bibsoup.net/collection/aldous9?q=Continuum+Random+Trees&a=%257B%2522q%2522%253A%2520%257B%257D%252C%2520%2522start%2522%253A%252020%252C%2520%2522rows%2522%253A%252020%252C%2520%2522facet_field%2522%253A%2520%255B%2522journal.exact%2522%252C%2520%2522year.exact%2522%252C%2520%2522collection.exact%2522%252C%2520%2522type.exact%2522%252C%2520%2522author.exact%2522%255D%257D&submit_search=Search

Returns nothing.

Search

http://bibsoup.net/collection/aldous9?q=Continuum+Random&a=%257B%2522q%2522%253A%2520%257B%257D%252C%2520%2522start%2522%253A%252020%252C%2520%2522rows%2522%253A%252020%252C%2520%2522facet_field%2522%253A%2520%255B%2522journal.exact%2522%252C%2520%2522year.exact%2522%252C%2520%2522collection.exact%2522%252C%2520%2522type.exact%2522%252C%2520%2522author.exact%2522%255D%257D&submit_search=Search

returns 4 entries, but not the one above. These entries have "continuum random tree" in the "subjects" field, but not in the title.
It's a serious defect to miss words in the title. Maybe the title words never got into the index?

set importer to check to see if local version is the same

When a user imports a file, it is stored locally in store/raw. When a file is imported again, the importer should check whether it differs from the stored copy. Presumably it will, but this check will also be required when we enable scheduled automated checks of web URLs, so we need to do this anyway.
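
A minimal sketch of such a check, assuming the stored copy is compared by content hash. The function names and the idea of passing the stored path in directly are illustrative assumptions, not existing bibserver code.

```python
import hashlib
from pathlib import Path


def file_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest of the raw file content."""
    return hashlib.sha256(data).hexdigest()


def has_changed(new_data: bytes, stored_path: Path) -> bool:
    """True if there is no local copy yet or its content differs from new_data."""
    if not stored_path.exists():
        return True
    return file_digest(stored_path.read_bytes()) != file_digest(new_data)
```

The importer could skip re-indexing when `has_changed` returns False, which would also serve the scheduled URL checks.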

Make default theme nicer

Current theme is good but we could make it quite a bit nicer without too much work (e.g. by reusing an existing OKFN theme).

Minor Refactorings

  1. Relocate /query to /api/search
  2. Do not have search on every page at top (if wanted on every page put into top bar properly) - should be part of retheme #19
  3. Get rid of option to set number of rows in query results (Why: who needs this and adds complexity)
  4. Remove use of path segments to create implicit facets. [No - leave in]
    • Main use of this was for supporting /collection/xyz?... I believe. IMO this isn't needed, at least in the first instance - people can just search and set the relevant facet (on collection)
  5. Have 'normal' pagination using a normal page query string param (probably there in fact!)

Add unit tests

There are currently no tests to run, as we have been pulling together various bits and pieces. We now need some tests to check that changes do not break functionality.

Replication of old Bibserver on Schramm data

Replication of the old bibserver capability on the benchmark Schramm dataset http://research.microsoft.com/~schramm/bibserver.bib is a first milestone. Following are a few issues I see between here and there. Probably these should be broken out into several issues, but I try to collect them here for completeness.

  1. upload of the dataset. The upload failed for me.

  2. The upload from a url is a post request. It should be a get request, so it can be easily bookmarked, and the data should be saved with a filename or id which is a suitably sanitized form of the url. e.g.

http://bibserver.berkeley.edu/cgi-bin/bibs7?source=http%3A%2F%2Fresearch.microsoft.com%2F~schramm%2Fbibserver.bib

User supplied titles should not be used as ids, as they will clash eventually.

  3. First upload should create a cache. Thereafter, subsequent calls for the same url should pull from cache, except with an indication e.g. "refresh=yes" in the get string to refresh from source.

  4. There should be no redirect required for this upload procedure from a url. Perhaps optionally, but not required.

  5. Listings of authors should be alphabetical, with author links, like http://bibserver.berkeley.edu/cgi-bin/bibs7?&source=http://research.microsoft.com/~schramm/bibserver.bib&index=authors or at least such a complete listing should be available.
    Format in .bib or .json to produce such displays is negotiable. It's good to have a simple dropdown list of authors in the
    left nav bar, but this does not replace a comprehensive author listing page.

  6. Capability for subjects listing like http://bibserver.berkeley.edu/cgi-bin/bibs7?&source=http://research.microsoft.com/~schramm/bibserver.bib&index=subjects
    should be supported, both in the data model and in display.

  7. Similar capability for journals is also desirable.

Generally, we want for any facet (except perhaps types and years), to have both a simple dropdown in the nav, and a more
comprehensive full page listing which allows external links based on attributes in suitable entity tables for journals/subjects/people/...

  8. Need the footer e.g. "Display created by BibServer from this bibtex source file http://research.microsoft.com/~schramm/bibserver.bib" or ".... from file uploaded by user ... now cached at .... " if it's a user upload. Generally, need to display the provenance of the data, so the user knows where it is coming from.

  9. Provide "Edit source" input at the bottom.

  10. Demonstrate display template to replicate the item display as closely as possible, and give maintainer control of details in display, e.g. what things are linked, in what order, ....
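
The points above about naming cached data after a sanitized form of the source url, and refreshing only on request, could be sketched roughly as follows. This is an illustration under assumptions: `source_id`, `get_source`, the dict-based cache and the `fetch` callback are hypothetical names, and the sanitizing rule is one possible choice (distinct urls can collapse to the same id, so a real scheme might add a hash).

```python
import re
from typing import Callable, Dict


def source_id(url: str) -> str:
    """Derive a filesystem-safe id from a source URL (hypothetical scheme)."""
    # Replace every run of characters outside [A-Za-z0-9._-] with one underscore.
    return re.sub(r"[^A-Za-z0-9._-]+", "_", url).strip("_")


def get_source(url: str, cache: Dict[str, str],
               fetch: Callable[[str], str], refresh: bool = False) -> str:
    """Return cached data for url, fetching on first use or when refresh is set."""
    key = source_id(url)
    if refresh or key not in cache:
        cache[key] = fetch(url)
    return cache[key]
```

A GET handler could map "refresh=yes" in the query string onto the `refresh` argument, so the bookmarked url without it always reads from cache.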

NLM XML DTD Parser

A number of sources in various fields have stabilized on NLM XML for their biblio standard.
Including
EuDML
PKP Citation Markup Assistant
We should provide a converter to BibJSON

Repeated uploads from same source

Repeated uploads from the same source are producing multiple entries.
e.g. http://bibsoup.net/collection/hartley
has been uploaded 3 times, and there are now 600 entries instead of 200, with each entry repeated 3 times, e.g.
http://bibsoup.net/collection/hartley?q=lifting+group
This is not the desired behavior. The simplest behaviour implemented by the legacy bibserver is that
each new upload from the same url should overwrite any previous upload. That should be implemented first.
Note that this overwriting should not just be entry by entry. The entire collection should be replaced.
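
To make the intended behaviour concrete, here is a small sketch using an in-memory dict as a stand-in for the index. The `Store` type and `replace_collection` name are assumptions for illustration; the real index is Elasticsearch and would need a delete-then-insert for the collection.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for the index: source URL -> list of records.
Store = Dict[str, List[dict]]


def replace_collection(store: Store, source_url: str, records: List[dict]) -> None:
    """Discard any earlier upload from source_url and store the new records."""
    # Assigning a fresh list replaces the whole collection, not entry by entry.
    store[source_url] = list(records)
```

With this shape, uploading the same source three times leaves one copy of each entry, and entries dropped from the source disappear from the collection too.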

Pager display bug

With http://bibsoup.net/collection/chung_test
when the page first loads, there flashes by an alternate paging scheme, like

153 Results [1-10] [11-20] ....

then this is replaced by

Results of 1-10 153. Show 10 per page. (with dropdowns).

Fine to experiment with different pager options, but the flashing of one before the other is disturbing.

In the present pager, need [Next] and [Previous] buttons.
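
As a rough illustration of the arithmetic behind both pager styles and the [Next]/[Previous] buttons, assuming 10 results per page as in the example above. The function names are hypothetical and this is independent of how the front end renders the pager.

```python
from typing import List, Optional, Tuple


def page_ranges(total: int, per_page: int = 10) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (first, last) result ranges, one per page."""
    return [(start, min(start + per_page - 1, total))
            for start in range(1, total + 1, per_page)]


def neighbours(page: int, total: int,
               per_page: int = 10) -> Tuple[Optional[int], Optional[int]]:
    """Return (previous, next) 0-based page indexes, or None at either end."""
    last_page = (total - 1) // per_page if total > 0 else 0
    prev_page = page - 1 if page > 0 else None
    next_page = page + 1 if page < last_page else None
    return prev_page, next_page
```

Rendering a single pager from these values on the server side would avoid one scheme flashing before the other replaces it.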

Erdos breaks BibTexParser.py

Upload from http://bibserver.berkeley.edu/tmp/erdos.bib produces

<type 'exceptions.IndexError'> at /upload
list index out of range
Python /opt/bibserver/parsers/BibTexParser.py in read_bibitem, line 87
Web POST http://bibsoup.net/upload

Compare with http://bibserver.berkeley.edu/cgi-bin/bibs7?source=http%3A%2F%2Fbibserver.berkeley.edu%2Ftmp%2Ferdos.bib

Note that the author listing for Erdos is an excellent test of unicode conversion capabilities from tex accents, and also of the
handling of a long author list (568 authors). The old bibserver does this by a crude mapping to html entities. For the new one, it is best to use the NUMDAM tex to unicode converter.
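
For a sense of what the conversion involves, here is a deliberately small sketch. It is not the NUMDAM converter: it handles only a handful of accent commands on plain ASCII letters, the table and function name are assumptions, and it strips all remaining braces, which is cruder than real BibTeX grouping.

```python
import re
import unicodedata

# TeX accent command -> Unicode combining character (a small, partial table).
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # diaeresis
    "~": "\u0303",  # tilde
    "H": "\u030b",  # double acute, as in Erd\H{o}s
}

# Braced argument, e.g. \H{o} or \"{o}.
BRACED = re.compile(r"""\\(['`^"~H])\s*\{([A-Za-z])\}""")
# Bare argument, e.g. \'e (not allowed for \H, which needs a delimiter).
BARE = re.compile(r"""\\(['`^"~])([A-Za-z])""")


def tex_to_unicode(text: str) -> str:
    """Convert a few common TeX accent commands to composed Unicode letters."""
    def combine(match):
        accent, letter = match.groups()
        return letter + COMBINING[accent]

    text = BRACED.sub(combine, text)
    text = BARE.sub(combine, text)
    # Simplification: drop grouping braces such as the ones in {\'a}.
    text = text.replace("{", "").replace("}", "")
    # NFC merges base letter + combining mark into one code point where possible.
    return unicodedata.normalize("NFC", text)
```

Cases such as dotless i, ligatures and cedillas are not covered here, which is part of why a dedicated converter is the better long-term choice.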

Clean up and simplify templates

There is a large number of small templates, which are often included in only one other template (usually solreyes). To simplify development and keep the code cleaner, I suggest consolidating them into the main template and removing them as needed.

Est cost: 2h.

Move bibsoup.net to dev.okfn.org

There is a big ES instance running on dev.okfn.org.

Move bibsoup.net there, particularly to try running the medline index.

Check if anyone will be upset if I kill the ES service by trying to facet the author field on the medline index...

Put mapping to ES after creating index

DAO has an init_db method. I added a call to put_mapping, and put the mapping in the config. However put_mapping fails. Will need to find how to get pyes to put a dynamic mapping. Doing this would save having to manually create the index and put the mapping during install.

Control of display config

This is a breakout of item 10) of #14

  1. Demonstrate display template to replicate the item display as closely as possible, and give maintainer control of details in display, e.g. what things are linked, in what order, ....

to which Mark replied:

  1. Yes, this sort of thing has to be looked at more once we have user login. They are all managed by config at present, but we need to control how people will manage their own config file, which means we need to know who people are in some way.

I agree with "we need to control how people will manage their own config file"
I do not agree this implies "we need to know who people are in some way".
I think we should regard BibServer as a webservice which takes two sorts of input data

  1. a biblio dataset
  2. a display config file

and returns displays of the dataset according to the config file.

All a user should have to do is provide a pair of urls, one with the dataset, and one with the config file, and then BibServer should do its thing. This requires no login or identification of users. I think we should try to avoid imposing such barriers for as long as possible, and should try to encourage a culture of users posting both data and display config files with an open license.

We should maintain an index of config files we think are of good quality, and allow users to pick/choose from these with a
simple dropdown. None of this requires knowing who people are in any way.

Of course, both biblio datasets and config files should be checked in some way to see they are not malicious. I hope this
is adequate as an alternative to login gates.

Refactor mako templates to jinja2

Flask bundles Jinja2 by default and it seems to be preferred in the general community. However, we have working mako templates and others may prefer them. So this is a very low priority (and perhaps is a wontfix?)

View record or search

The View record or search [source] for [text] [Go]

is potentially a great feature! It looks like this has been scripted from some search resource list I had lying around.
I'd like to see how this is controlled, and work on improving the selection of search resources which should be provided
in a config file.

Note that the [text] would be better entered in a text box.
A tricky point is that best construction of the query from data will depend on the source searched.
Probably some curated searches should be offered, then the user can mix/match with others.

As an example, if the record has a DOI, e.g. 10.1214/aoms/1177705069 then the search

http://scholar.google.com/scholar?q=10.1214/aoms/1177705069

typically returns an exact match. So if an item has a DOI, this link should always be offered. Without the DOI, something like

http://scholar.google.com/scholar?hl=en&q=author%3A%22Chung%22+The+ergodic+theorem+of+information+theory
gets the item as number 3 in a list of 85. The list from this softer search is very useful, but it is good to be able to provide both soft and exact matches into Google Scholar and other resources. Where exact matches are obtained, we should try to harvest them,
and register the remote ids. e.g.
http://scholar.google.com/scholar?cluster=17976357296721002721
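
The two Google Scholar link styles above could be built from a record along these lines. This is a sketch: the `scholar_url` function and its parameters are hypothetical, and only the url patterns themselves come from the examples in this issue.

```python
from typing import Optional
from urllib.parse import quote, quote_plus

SCHOLAR = "http://scholar.google.com/scholar"


def scholar_url(title: str, author: Optional[str] = None,
                doi: Optional[str] = None) -> str:
    """Build an exact-match link when a DOI is known, else a softer search."""
    if doi:
        # Keep the slashes of the DOI readable, as in the example link.
        return SCHOLAR + "?q=" + quote(doi, safe="/")
    terms = title
    if author:
        terms = 'author:"%s" %s' % (author, title)
    return SCHOLAR + "?hl=en&q=" + quote_plus(terms)
```

Other search resources would need their own query construction, which is the argument for driving this from a config file rather than hard-coding it.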

I expect that a tool for providing high quality searches and links from a particular record is something we should develop as a separate module. This needs to be customized a lot depending on the type of record (article, book, person, ... ) and the
information already available. I have had many attempts at this. A general framework for managing such searches and links would be desirable.

Document the currently available code

I need to improve the documentation for the code we have now. We have stuff that does something useful at this point, so it is time to scale up dev efforts. I will update the docs and provide diagrams of how the code fits together etc.

refactor solreyes to separate modules

The solreyes code was built up from stuff originally started in a different project. It quickly got us off the ground, but needs factoring out. I will separate it into resultmanager, urlmanager and setconfig.

subsequent tickets might see refactorings of those individual items.

refactor manager to importer

Manager was initially envisaged as managing async uploads, but this is not required, so change manager to importer, which just handles importing of content to the index.

Update tests too.

Refactor dataset to parser

The dataset class is no longer aptly named for its function; rename it to parser.

This class now manages parsing of files, and uses whichever parser is requested.

Allow for disabling the frontend upload

Have added a config option and controls in web.py and index template to hide the upload page and remove upload functionality when allow_upload is set to NO.

This, in combination with using bulk_upload from the command line, allows a department to run a bibserver with only the content an administrator pushes to it.

Add a Command Line Interface to Bibserver

Would like to carry out some operations from the command line, e.g.:

  • Cleaning the database
  • Adding some demo fixtures
  • Converting from one format to another (especially bibjson)
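
One possible shape for such an interface, sketched with the standard library's argparse. The subcommand names (`clean`, `fixtures`, `convert`) and the `--to` option are assumptions chosen to mirror the list above, not an existing bibserver command set.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subcommand per operation listed above."""
    parser = argparse.ArgumentParser(prog="bibserver")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("clean", help="remove all records from the database")
    sub.add_parser("fixtures", help="load some demo fixtures")

    convert = sub.add_parser("convert", help="convert a file between formats")
    convert.add_argument("source", help="path of the file to convert")
    convert.add_argument("--to", dest="target_format", default="bibjson",
                         help="output format (default: bibjson)")
    return parser
```

A call such as `build_parser().parse_args(["convert", "refs.bib"])` yields a namespace whose `command` field can be dispatched to the matching function.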

Create person records

Imagine this involves:

  • Person Domain Object
  • Create Person objects automatically on upload and associate them with the Record
    • MM: Whereas author is an attribute of a book, person is a representation of a person. Therefore we will create person records for each person, and append a list of relevant persons to each record.
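
A rough sketch of the step described by MM, assuming records carry an "author" list of name strings and that person records live in a simple dict. The `attach_persons` name, the "persons" field and the use of a normalised name as id are all illustrative assumptions.

```python
from typing import Dict, List


def attach_persons(record: dict, persons: Dict[str, dict]) -> dict:
    """Create a person entry per author name and list their ids on the record."""
    ids: List[str] = []
    for name in record.get("author", []):
        # Assumption: the normalised name serves as the person id. Real data
        # would need proper disambiguation of people who share a name.
        person_id = " ".join(name.lower().split())
        persons.setdefault(person_id, {"id": person_id, "name": name})
        ids.append(person_id)
    record["persons"] = ids
    return record
```

The author attribute stays on the record untouched, so the person list is an addition rather than a replacement.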

Import from URL list (US-017)

Bulk import json file:

{
    "collections": [
        {
            "url": "...",
            "name": "..."
        }
    ]
}
  1. command line function: bulkimport which takes url to json file (can be file:/// of course!)
  2. logic.py with new bulk_import(bulkimport_dict) called by this (with test)
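
A sketch of how bulk_import might walk that structure. The `import_one` callback is a hypothetical stand-in for whatever imports a single url into a named collection; the source only specifies that the function takes the bulk import dict.

```python
from typing import Callable, Dict, List


def bulk_import(bulkimport_dict: Dict[str, List[dict]],
                import_one: Callable[[str, str], None]) -> List[str]:
    """Import every collection in the dict; return the names that were imported."""
    imported = []
    for entry in bulkimport_dict.get("collections", []):
        # import_one is an assumed hook: (url, collection name) -> None.
        import_one(entry["url"], entry["name"])
        imported.append(entry["name"])
    return imported
```

Passing the importer in as a callback keeps the function easy to test with a fake, which fits the "(with test)" note above.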

Source not apparent from display

From the display

http://bibsoup.net/collection/hartley

it is not apparent from what url the data was uploaded. The url, in this instance http://rsise.anu.edu.au/~hartley/hartley.bib
should be apparent, as it is in

http://bibserver.berkeley.edu/cgi-bin/bibs7?source=http://rsise.anu.edu.au/~hartley/hartley.bib

See text "Display created by BibServer from this bibtex source file http://rsise.anu.edu.au/~hartley/hartley.bib" at bottom of page
with hyperlink to the source file.

Providers of source data should be encouraged to provide metadata about their collection which can then be made part of the default BibServer display of that collection. Formatting and handling of that metadata is another issue. First let's ensure that the
use case with no metadata besides a source url is accommodated.
Allowing uploaders to name their collection uploads complicates the issue raised here. The legacy BibServer names its caches by source url, so the issue of what happens if the same url is uploaded with two different collection names does not arise. It does not seem a good idea to be making copies of the data from the same url even with different collection names, but this may require further discussion.

To summarize.

  • There should be just one copy of data from a given url in the BibSoup.
  • This entire copy should be overwritten by any re-upload of that dataset by any user.
  • The source url should always be apparent from the BibSoup display.
  • There should be a button on the display which allows refreshing of the dataset.

These are well tested and successful features of the legacy BibServer display. Let's demonstrate these features before
trying more complex things.

Debug pyes issue in refactor

The refactored version runs but fails to upload on a specific dataset because upsert trips pyes when mapping.

Find error, add failing example test, and then fix.
