cernopendata / opendata.cern.ch

Source code for the CERN Open Data portal

Home Page: http://opendata.cern.ch/

License: GNU General Public License v2.0

Python 43.55% CSS 2.76% JavaScript 13.26% Shell 3.01% HTML 29.99% Dockerfile 1.43% SCSS 6.00%
open-data open-science research-data research-data-repository research-data-management flask json-schema python invenio inveniosoftware

opendata.cern.ch's Introduction

CERN Open Data portal


About

This is the source code behind the CERN Open Data portal. You can access the portal at http://opendata.cern.ch/. The source code is built on the Invenio digital repository framework.

Developing

If you'd like to install a demo site locally for development, please see the developing guide for more information.

Contributing

Bug reports, feature requests and code contributions are encouraged and welcome! Please see the contributing guide for more information.

Support

You can ask questions at our Forum or get in touch via our Chatroom.

Authors

The alphabetical list of all contributors is available in the AUTHORS file.

License

GNU General Public License


opendata.cern.ch's Issues

Feedback on current design [Aug 15th]

  • no Facebook/Twitter/YouTube icons and no language settings: get rid of the uppermost dark line
  • a new logo will be proposed by Laura Rueda
  • collection tabs without radio buttons - do we need them anyway? [we will ask the experiments]
  • the detailed record view should get a "frame" leaving more space to the left. Fonts should be adjusted (e.g. the title is not prominent enough)
  • detailed record: don't show unused tabs; usage stats are of interest (but most likely closed access)

thoughts:

  • better navigation for researchers vs. citizen scientists
  • policies will be uploaded as individual records, but also put onto a "learn about" page
  • we need some design elements and experiment specific materials to show alongside data records
  • navigation elements to move from "big data" elements to "small data" elements

Meeting August 28th

  • metadata ingest will happen manually for the first 14 high-level datasets; for the mid-term future we will enable automated ingestion from a controlled list of sources
  • CMS is preparing a "guided tour"/how-to document which will accompany every dataset and analysis. This document will be the same for all primary data sets (this may change later), but it will be different for derived data sets (e.g. the instructions for the "pattuples" derived from Ana's analysis will point to the code and the instructions for how to run it). However, the structure will be the same:
  • selection
  • validation
  • how to reuse
  • limitations

These texts are being prepared by CMS, with the support of Patricia. They should be linked (initially) on the right-hand side of the individual records with a dedicated box. Patricia will investigate whether parts of this information can be referenced in the metadata to enable a tailored, dataset-specific display. This additional documentation will sit, however, on an additional page and should be exportable as a PDF. It should be a record by itself and get a DOI, incl. a citation recommendation (Action on Patricia to prepare that).

  • all of the datasets get a disclaimer, to be provided by Kati, concerning quality assurance. Location on the record page to be decided, possibly at the bottom of the page
  • there will be a set of restricted files with trigger/selection details, not visible to external users (Kati, please correct the details here!)
  • there must be an export functionality for the 14 high-level file names enabling easy integration into the config files - this needs to include the ROOT file name
  • a virtual image will be stored on the platform: it will become a standalone record with a DOI
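The file-name export mentioned above could, for instance, emit a block ready to paste into a CMSSW-style configuration. A minimal sketch, assuming the real export format is still to be decided; the function name and output shape are illustrative only:

```python
def to_vstring_block(file_names):
    """Render ROOT file names as a CMSSW-style vstring block.

    Sketch only: the actual export format for the 14 high-level
    datasets has not been fixed yet.
    """
    body = ",\n    ".join('"%s"' % name for name in file_names)
    return "fileNames = cms.untracked.vstring(\n    %s\n)" % body


# Example with placeholder file names:
print(to_vstring_block(["FILE1.root", "FILE2.root"]))
```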

Ana's analysis

  • is derived from two high-level (primary) datasets [the same is the case for Tom's examples]
  • is available on github:
    a) exercise itself https://github.com/ayrodrig/OutreachExercise2010
    b) the pattuples production https://github.com/ayrodrig/pattuples2010
  • Ana's code should become a record by itself, too - also with a DOI [following Zenodo's Github integration]
  • also these records will have their own "how to" in the box on the right [see 1-4 above]
  • there should be enough metadata to create such a record: the author lists are the same

Overall tasks and next steps

  • set up Laura's design
  • set up HTML-editing pages for additional info on GitHub
  • prepare a separate menu for additional information so that we can prepare some nice additional documentation there
  • prepare the additional boxes on the right of a detailed record page
  • check export functionalities (see comment on titles above)
  • meeting beginning of next week for documentation sprint (with Achintya and Patricia)
  • meeting beginning of next week with Pamfilos for design sprint

UX/UI testing tasks

  • navigation on the portal
  • navigation from primary and reduced data
  • one task: can you reproduce the analysis? [is the user able to find all the related information, data, code, "how-to" for the particular analysis?]

Metadata related tasks

  • compile metadata for software
  • compile metadata for the virtual image
  • populate the records for the 14 primary datasets
  • integrate Ana's analysis

pages: "For Research"

Prepare content pages for the "research" carousel links, e.g. how to download the VM, how to access datasets via XRootD.

testsuite: addition of "small data" samples

In addition to big data sets, smaller (JSON) files that will be used for event display and histogramming should be added to the test suite so that we could further develop the portal UI and the data visualisation layer.

Dataset metadata

To assign a DOI, the following metadata are required:

  • Title (ideally a human readable one)
  • Creators (will be the collaboration and authors from author XML)
  • Date (year is enough, either the date the data was published on CMS pages or the date the data moved to the portal; we could also have both, depending on preference)
  • Publisher (should that be the collaboration or more abstract the "CERN Open Data Portal"?)

Marked in bold are the items where CMS has to decide which data to use / what to put there.

It's always good to have more metadata, especially

  • Description (human readable information about the dataset)
  • Technical details (how many files, file sizes etc.)

As a MARC record, it will look like this:
0247_$$a10.1234/whatevernaming $$2DOI
245__$$aHUMAN READABLE TITLE
256__$$aNr. of Files, Filesize in total
260__$$bPUBLISHER$$cDATE
[269__$$cPREPRINTDATE; might be CMS page publication date in case 260$$c will be used for the portal]
520__$$aHuman Readable Description
540__$$aCC-0 license
700__$$aAUTHOR [filled from author XML file]
710__$$gCMS collaboration
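The record above could be assembled programmatically before export. A minimal sketch using the same placeholder values; the tuple layout and the `render_marc` helper are assumptions for illustration, not an Invenio API:

```python
# Fields as (tag-with-indicators, [(subfield code, value), ...]) tuples,
# using the same placeholder values as the textual record above.
FIELDS = [
    ("0247_", [("a", "10.1234/whatevernaming"), ("2", "DOI")]),
    ("245__", [("a", "HUMAN READABLE TITLE")]),
    ("260__", [("b", "PUBLISHER"), ("c", "DATE")]),
    ("520__", [("a", "Human Readable Description")]),
    ("540__", [("a", "CC-0 license")]),
    ("710__", [("g", "CMS collaboration")]),
]


def render_marc(fields):
    """Render fields in the textual form used above, e.g. 245__$$aTITLE."""
    lines = []
    for tag, subfields in fields:
        lines.append(tag + "".join("$$%s%s" % (code, value)
                                   for code, value in subfields))
    return "\n".join(lines)
```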

Usability test tasks

Visualise an event from the Muon primary data set and turn the view to the x-y plane.
Do you observe the curvature of the tracks?

Compare, in the event display, some events from the Mu primary data set to those in the Minimum Bias primary data set. What differences do you observe?

With the online histogrammer, plot a *** histogram from the di-muon reduced data set and make selections on **?

You liked the histogramming application and would like to use it for other purposes.
Can you find the source code, and do you understand what you would need to do to get started using it for your own application?

You liked the histogramming application and would like to use it to plot other physics objects
(e.g. events including two jets).
Can you find the source code for producing the reduced data for histogramming, and do you
understand what you would need to change in the source code to read other primary data sets
and select other physics objects?

Open a primary data set file with ROOT and find the collections for the different physics objects (muons, electrons, photons, jets).

Run the analysis example on a small number of events.

pages: "For Education"

Prepare content pages for the "education" carousel links, e.g. the event display, the histogram application, and "learn more" about HEP.

Activate the file download

Activate the file download for derived and primary data sets.
We could already foresee a warning text for the primary data sets saying that
the data sets are of TB size and the download takes time accordingly, and point
the users to the VM image.

Area for information material with easy editing

Follow-up from #41: we need an area where we can easily deposit/edit the information material (i.e. the instructions, the VM test report, the validation statement). A possibility for easy HTML editing within GitHub was mentioned in the meeting of Aug 28.

Data file listings under Research and Education

Can we have the primary data sets appear under Research and all others (derived data sets) under Education? The names of the files for the event display should reflect their content; at the moment they have the same name as the primary data set.

introduce search option

Amend "SEE ALL" to list all records, by simply pointing to /search.
Amend UI prototype to have search box visible.

Feedback mechanism for production use

We need a feedback page (unless something else is preferred) for the testing. Can it
be a button on every page which automatically picks up the page URL, and maybe proposes categories such as

  • problems with display or layout
  • problems with instructions
  • does not provide the intended functionality
  • problems with navigation
  • purpose of the page unclear
    ...
  • free text field
  • eventually a possibility for attaching a screenshot
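A feedback submission along these lines could be sketched as follows. This is a minimal sketch only: the field names and category keys are assumptions, not a settled schema:

```python
# Hypothetical category keys mirroring the list above.
CATEGORIES = {"display", "instructions", "functionality",
              "navigation", "purpose", "other"}


def make_feedback(page_url, category, text, screenshot=None):
    """Assemble one feedback submission as a dict.

    Sketch only: field names and the category list are assumptions;
    the page URL would be picked up automatically by the button.
    """
    if category not in CATEGORIES:
        category = "other"
    entry = {"url": page_url, "category": category, "text": text}
    if screenshot is not None:
        entry["screenshot"] = screenshot
    return entry
```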

search: collection facets

Introduce facets by collections after #10 and #26 are completed. For example, a search in the CMS collection would distinguish "CMS Primary Dataset" records from "CMS Reduced Dataset" records.

collection: style `/collection` pages

Now that we have some collections, see /collection/CMS or /collection/ALICE, it would be good to style the collection pages (fonts, margins, logos, remove add-to-favourites, etc.).

pages: "Latest News"

Prepare content pages for "news" carousel links, e.g. CMS open data release statement draft.

Create GitHub Organization for all ODP-related repos

As discussed at meeting of 2014-09-02T15:00+02 (@katilp, @pherterich, @pamfilos, @RaoOfPhysics present):

Request to have a GitHub Organization, which will act as a single point-of-entry for all repositories related to the CERN Open Data Portal, including:

Note that there already exists an Organization called cms-outreach where the iSpy codebase is stored.

Create GitHub Organization. Either @tiborsimko or @TimSmithCH to be owners?

dataset descriptions

The AOD datasets have names (e.g. /Mu/April21_ReReco) which need to be translated into something more meaningful to the public.

Create Tools collection on the portal

This would contain the code to use on data and subcategories could be

dataset information

If the datasets are divided into skims then information on the contents should be provided (e.g. if a muon skim then something like "this dataset was created by selecting events that contained at least one muon that passed this trigger condition which was...").

List of pages needed for additional material (continued from #41)

VM (the element in the portal is the VM image - or a link to it)
Instructions
Validation report
Known problems

Analysis example final step (the element in the portal is the code in
https://github.com/ayrodrig/OutreachExercise2010)
Input data (the files to be uploaded from Spain)
Instructions
Validation
Limitations

Analysis example - intermediate file production (the element in the portal is the code in
https://github.com/ayrodrig/pattuples2010)
Input data
Instructions
Validation
Limitations

Intermediate analysis files (the files to be uploaded from Spain, see above)
The usual metadata fields
Instructions (maybe the same as the ones above for the code)
Validation
Limitations

For the primary data sets, we could get started with the current template,
but how do you want it organized?
For example, as we think now, the text for "How the data were selected"
would be different for each sample.
The text for "Validation" is the same for all primary data sets now, but in the future
it may become different.
For "How to reuse", we would like to have a page called "Getting started",
which is a single set of instructions for all primary data sets (to start with).
Note also the remark from Tim in #41 that all data records should have the copyright statement and licence for reuse clearly marked, and the remark from Sünje that the official label for CC0, which is the one being used here (so far), is available at http://creativecommons.org/about/downloads

Can these be templated onto all elements in the Limitation/Disclaimer section?

UI: site not working well for MSIE

The site currently does not work well for MSIE users. I'm testing with MSIE 9.0 via CERN WTS:

$ alias wts="rdesktop -d cern.ch -g 1024x768 -a 16 -k en-us -T TS cernts.cern.ch"

We could:

  1. improve the layout so that it would work with MSIE; this may be worth it if the event display and the reduced data set JS visualisation work well with MSIE. @tpmccauley, have you checked?

  2. detect the usage of MSIE and say something like:

      It seems you are using MS Internet Explorer 9 for which this site
      has not been optimised. Please consider using Firefox, Chromium, 
      or Safari instead.
    

    in a gentle way.

Due to the shortage of time, let's start with option 2, and eventually implement option 1 when time permits.
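Option 2 could start from a rough User-Agent check before rendering the warning banner. A minimal sketch; the version cut-off and how the header is obtained (e.g. from the Flask request) are assumptions:

```python
def is_unsupported_msie(user_agent):
    """Return True for MSIE versions 6-9, which the site is not
    optimised for, so that a gentle warning banner can be shown.

    Sketch only: the cut-off at MSIE 9 and the substring matching
    are assumptions, not a tested browser-detection scheme.
    """
    ua = user_agent or ""
    return any("MSIE %s" % version in ua
               for version in ("6.0", "7.0", "8.0", "9.0"))
```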

testimonials: update user quotes

For demo purposes, we have added some initial young user testimonial quotes taken from past IPPOG masterclasses. It would be good to update them.

search: pressing `Enter` invokes Add-to-search rather than Search

On the search page (/search), typing cms and pressing Enter does not invoke the Search action; the "Add-to-search" action is executed instead. The default should be the former, not the latter.

Note that this was probably already fixed in latest "next", so it may be sufficient to upgrade Invenio.

CMS: setup collection to hold VM images

  • Set up new collection to hold VM images. The collection name can be: "CMS VM Images".
  • Add example record representing a VM image. Either take copy or at least link to places like http://cernvm.cern.ch/releases/CMS%20OpenData%20Latest.ova.

files: ROOT vs BIN file types

ROOT demo files (see #25) were uploaded on the SLC6 box as BIN-type ones. The "automagic" recognition of file content vs file extension needs to be checked and amended.
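One way to check content rather than extension is to look at the magic bytes: ROOT files start with the ASCII signature "root". A minimal sketch of such a check (the function names are illustrative, not the portal's actual upload hook):

```python
def looks_like_root_content(first_bytes):
    """Return True if the leading bytes carry the ROOT file
    signature: ROOT files start with the ASCII bytes b"root"."""
    return first_bytes[:4] == b"root"


def looks_like_root_file(path):
    """Sketch of a content-based check for deciding ROOT vs BIN
    at upload time, instead of trusting the file extension."""
    with open(path, "rb") as fh:
        return looks_like_root_content(fh.read(4))
```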

home page menu: introduce links to experiments

In the home page menu panel next to LATEST, the name of experiments should be links pointing to experiments' records, e.g. CMS should point to /search?cc=CMS. (Others don't have any demo data yet.)

Also, Alice should be spelled ALICE.

Also, ATLAS is missing.

installation: add `invenio-previewer-ispy`

  • add -e git+https://github.com/inveniosoftware/invenio-previewer-ispy.git to requirements.txt
  • add invenio-previewer-ispy to install_requires in setup.py
  • add 'invenio_previewer_ispy' to invenio_opendata.config:PACKAGES

prepare basic record formatting

After #1 is completed, the basic record format templates (both in brief and detailed outputs) should be adapted in order to match chosen metadata and the site style.

fixtures: collection setup

The collection setup should be amended to distinguish (i) big datasets from (ii) small samples, and the corresponding fixtures should be committed.
