
cove-ocds's People

Contributors

andylolz, bibianac, bjwebb, caprenter, dependabot[bot], duncandewhurst, edugomez, idlemoor, jpmckinney, kindly, odscjames, pre-commit-ci[bot], requires, rhiaro, robredpath, yolile

cove-ocds's Issues

Reduce test duplication / find a way to share tests when we look for the same thing in both lib-cove-ocds and cove-ocds

I've been making changes to Key Field Information, and this requires the same tests to be carried out in lib-cove-ocds and cove-ocds.

The tests are necessary in both places:

  • in lib-cove-ocds, we're checking that the calculations are carried out correctly
  • in cove-ocds, we're checking that the relevant data is being supplied, and that it's been correctly calculated

It would save a lot of lines of code if we were able to share the tests in some way, while still carrying them out in both places.

Big files: the Web result page is huge

Hello!

As we are about to publish French award data, I had to validate it (4 days ago): https://standard.open-contracting.org/review/data/55390859-63fd-453a-989b-21c612d69687

If you clicked the above link and the validation has not expired:

  1. it's going to be a little while before you see something
  2. then your browser may be struggling a bit to display the page

That's not surprising: it's trying to display 120,000+ releases.

I don't think that displaying a release table with so many rows is useful, especially when it costs so much on both the client and the server side.

Would it make sense to disable the display of the release table above a certain number of releases?

More generally, should the reviewing process be optimized for big files? That could mean changes to cove-ocds, but also the release of a command-line tool that would be run locally.
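One way the "disable the table above a certain number of releases" idea could look, as a minimal sketch: a configurable cap, checked in the view before the table is built. `RELEASE_TABLE_MAX_ROWS` and `should_render_release_table` are hypothetical names, not existing cove-ocds settings.

```python
# Hypothetical sketch: gate the per-release table behind a configurable row
# limit, so huge files skip rendering it on both server and client.
RELEASE_TABLE_MAX_ROWS = 1000  # assumed setting name and default

def should_render_release_table(release_count, limit=RELEASE_TABLE_MAX_ROWS):
    """Return True if the results page should include the per-release table."""
    return release_count <= limit

# The view would pass a flag into the template context instead of the table:
context = {"show_release_table": should_render_release_table(120_000)}
```

The template would then wrap the table in a conditional on that flag, and could show a short "table suppressed for large files" notice instead.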

KFI: Include count of unique item IDs

Feature request from OpenDataServices/cove#263

Original text:

When a file containing multiple releases is validated, the total count of items and documents is shown. For clarity, could we also include the unique count of:

  • items based on id

By showing that we found 19 mentions of a document and 4 unique document ids, publishers can check that the release has been validated correctly.
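The counting itself is straightforward; a sketch, assuming award items as the field layout (the same approach would apply to documents):

```python
# Sketch: total mentions vs. unique ids for items across all releases.
# The awards/items path is one possible location; documents would be similar.
def item_id_counts(releases):
    ids = [
        item.get("id")
        for release in releases
        for award in release.get("awards", [])
        for item in award.get("items", [])
    ]
    return len(ids), len(set(ids))

releases = [
    {"awards": [{"items": [{"id": "1"}, {"id": "2"}]}]},
    {"awards": [{"items": [{"id": "1"}]}]},
]
total, unique = item_id_counts(releases)  # 3 mentions, 2 unique ids
```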

Flattening record packages

Some quick notes from a discussion with @kindly on this today:

The primary use case we have in mind is publishers sharing draft data with the helpdesk in record package format and helpdesk analysts wanting to flatten this to help give feedback (e.g. it's easier to check all the values of a given field by looking at a spreadsheet than by reviewing a JSON file).

Records have the following components:

  • Compiled release - easiest to flatten, using existing flatten-tool functionality. Could be included in a minimal version of record package flattening.
  • Releases list, which can contain either linked releases or embedded releases. The key issue is that the field is oneOf, so it isn't possible to tell definitively which has been provided. The list could also be mixed. Would require more work.
    • Linked releases - assume we won't fetch full releases for flattening. Q: would flattening the list of linked releases (url, date and tag) be useful?
    • Embedded releases - would need to combine the release lists from all records into one list - not sure where this functionality would sit between CoVE and flatten-tool. Could require lots of work.
  • Versioned release - we don't have an approach to flattening this and it is very rarely used by publishers, so we would leave it out of the functionality for now.
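The minimal "compiled release only" option could be as simple as pulling each record's compiledRelease out of the record package and wrapping the results in a release package, which existing flatten-tool functionality already handles. A sketch; the record/compiledRelease field names follow the OCDS record package schema, while the wrapper logic is an assumption:

```python
# Sketch: extract compiled releases from a record package into a release
# package, so existing release-package flattening can be reused as-is.
def compiled_releases_to_release_package(record_package):
    releases = [
        record["compiledRelease"]
        for record in record_package.get("records", [])
        if "compiledRelease" in record
    ]
    return {
        "uri": record_package.get("uri"),
        "publishedDate": record_package.get("publishedDate"),
        "publisher": record_package.get("publisher"),
        "version": record_package.get("version"),
        "releases": releases,
    }
```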

We also discussed whether compiled and embedded releases should be flattened into the same spreadsheet.

  • Flattening to separate spreadsheets would be easier to implement and it would be easier to use for users who just wanted to work with one or the other (no filtering out would be required).
  • Flattening to one spreadsheet would be harder to implement but it would make it easier for users who wanted to do analysis across compiled and individual releases (we don't have a specific use case for that in mind).

We also discussed whether this would sit better in CoVE/flatten-tool or OCDSKit/-web (possibly via the tabulate command supporting a spreadsheet output).

Seeking feedback from @yolile @romifz @jpmckinney @mrshll1001 @pindec and others on:

  • How important is this for the helpdesk
  • Other use cases for flattening record packages
  • Views on which elements of records are important to flatten
  • Flattening records into one or multiple spreadsheets
  • Where this functionality belongs

Edits to grouping of validation errors

See OpenDataServices/cove#1117 for how we did this for 360Giving.

Right now, all of the validation errors are presented in one table. Grouping these by type (e.g. missing-but-required, format errors, 'other') helps people understand what kinds of things they need to do to make their data conform to the standard.

The task here is to:

  • Review all possible errors, see if the same groupings as we used for 360 are appropriate (they probably are)
  • Implement the splitting up of the errors
  • Write simple, appropriate text to frame the errors
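The splitting-up step could key off the JSON Schema keyword that triggered each error, the way jsonschema's `ValidationError.validator` exposes it. A sketch, assuming each error dict carries that keyword; the group names are placeholders pending the review of the 360Giving groupings:

```python
# Sketch: bucket validation errors by the schema keyword that produced them.
from collections import defaultdict

GROUPS = {  # placeholder mapping, to be confirmed against the 360 groupings
    "required": "missing",
    "format": "format",
    "pattern": "format",
}

def group_errors(errors):
    grouped = defaultdict(list)
    for error in errors:
        grouped[GROUPS.get(error["validator"], "other")].append(error)
    return grouped

grouped = group_errors([
    {"validator": "required", "message": "'ocid' is missing"},
    {"validator": "format", "message": "'date' is not a 'date-time'"},
    {"validator": "type", "message": "'amount' is not a number"},
])
```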

"Array has non-unique elements" error is missing examples

Example data

The data has duplicate values in tender/participationFees/0/methodOfPayment which is an array of strings, but no examples are shown in the first 3 examples column:

[screenshot: validation results with an empty examples column]

For arrays of strings, I think the examples should just show the whole field so that users can see that there are duplicates, e.g.

First 3 errors
DD - Demand Draft;FDR - Fixed Deposit;DD - Demand Draft;FDR - Fixed Deposit
DD - Demand Draft;FDR - Fixed Deposit;DD - Demand Draft;FDR - Fixed Deposit
DD - Demand Draft;BC - Bankers Cheque;SS - Small Savings Instrument;FDR - Fixed Deposit;DD - Demand Draft;BC - Bankers Cheque;SS - Small Savings Instrument;FDR - Fixed Deposit
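Producing that example value could be as simple as joining the whole array, rather than looking for a scalar at the error path. A sketch; the separator choice is an assumption:

```python
# Sketch: for a uniqueItems failure on an array of strings, render the whole
# array (joined on ";") as the example value, so duplicates are visible.
def array_example(values, sep=";"):
    return sep.join(values)

example = array_example(
    ["DD - Demand Draft", "FDR - Fixed Deposit", "DD - Demand Draft"]
)
# "DD - Demand Draft;FDR - Fixed Deposit;DD - Demand Draft"
```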

KFI: Improve documentation of results

Copied from OpenDataServices/cove#263 (comment)

KFI is an important part of how the Helpdesk supports publishers to understand their data, but for an unsupported user who is still learning about the standard, the terse presentation of the stats isn't helpful.

Inline docs could be useful here.

Original text:

I have a great deal of difficulty understanding how each of the numbers is calculated (using the linked file as input). For example, 8 releases have an (identical) planning field, and the number is 1.

However, I think we first need to determine how frequently this section is referenced, before investing time in improving its clarity.

Signpost the command line tool from the validator landing page

From discussion with @ColinMaudry in Georgia:

We don't signpost the command line tool anywhere from the validator landing page.

At the moment this is a beta tool but it would be good to let users know it exists.

Could we add a link?

In the future we might want to impose a limit on upload size for the web tool and direct users to the command line tool for large files (rather than returning a server error).

Truncate check results

Follow-up to #31, which yielded #35.

Problem

Large, invalid files yield very large response sizes (performance), and more information than is useful (UX), e.g. a list of 42,000 invalid entries for one error type.

File | Valid? | File size | Response size before PR | Response size after releases table PR #35
repeated_errors_repeated.json (not public data) | Invalid | 359 MB | 65.8 MB | 27.32 MB
badfile_repeated.json (script to generate) | Invalid | 341 MB | 174.18 MB | 150 MB

Test files are now here: https://github.com/open-contracting/sample-data-private/tree/master/data

Solution

I think we can have a configurable setting to limit the number of results returned.

To address performance issues, we can set a high limit that still exceeds usefulness, like 1000.

To address usability issues, we can have a smaller number like 100. We'll want to randomize the results returned, so that we're not simply reporting e.g. the first 100 errors all caused by old data and none of the errors caused by newer data (publishers who are only making improvements to new data are likely to ignore the results if they only seem to pertain to old data).
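The randomized truncation described above can be sketched in a few lines; the function name and the optional seed (useful for reproducible reports) are assumptions:

```python
# Sketch: keep at most `limit` results, drawn uniformly from the full list so
# errors in old and new data are both represented in what we report.
import random

def truncate_results(results, limit=100, seed=None):
    if len(results) <= limit:
        return results
    rng = random.Random(seed)
    return rng.sample(results, limit)

sample = truncate_results(list(range(42_000)), limit=100, seed=0)
```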

Review and test language and copy for different types of user

CoVE is used by a number of different types of users, each attempting to achieve different things.

Through our research last year, and Georg's research into personas for the OCP web presence, we can have a reasonable go at documenting a handful of these, walking through their paths to interact with the software, and improve things for each of them as we go.

Internal - see https://docs.google.com/document/d/1KQ-j4q0rC5lkhIHGk9bCC_TA50PGP5KSvTZaKUuuaaQ/edit#bookmark=id.55uduidgj3bu

DRT reports packages array has non-unique elements when the elements differ

Original data example.

Packages (schema: "A list of URIs of all the release packages that were used to create this record package") look like this:

"packages": [
       "https://budeshi.ng/api/releases/1288/planning",
       "https://budeshi.ng/api/releases/1288/tender",
       "https://budeshi.ng/api/releases/1288/planning",
       "https://budeshi.ng/api/releases/1288/contract"
   ]

DRT report shows a structural error: Array has non-unique elements
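In the quoted example the "planning" URL does appear twice, so uniqueItems fails. A sketch of surfacing which values actually repeat, which would make the error message easier to act on:

```python
# Sketch: list the values that occur more than once in the offending array,
# rather than only reporting "Array has non-unique elements".
from collections import Counter

packages = [
    "https://budeshi.ng/api/releases/1288/planning",
    "https://budeshi.ng/api/releases/1288/tender",
    "https://budeshi.ng/api/releases/1288/planning",
    "https://budeshi.ng/api/releases/1288/contract",
]
duplicates = [value for value, count in Counter(packages).items() if count > 1]
```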

Improved text on landing page

From open-contracting/cove-oc4ids#19

The form prompts can be improved so that a first-time user knows what to do without scrolling below the fold to read the instructions, for example: "Paste an OCDS release package or record package as JSON" instead of simply "Paste". I can suggest more improvements, but I recommend that either @mrshll1001 or @pindec do a quick review to pick up these types of usability issues for first-time users.


Similarly, the order and flow of content below the form doesn't put the most important or relevant information first.

For example, I would at minimum swap the positions of the "Check and Review" and "About OCDS" blocks. Ideally, "About OCDS" would come after both other blocks, as it's the least relevant content; anyone arriving at the OCDS Data Review Tool is very likely to at least know what OCDS is.

The text in each box can also be clearer, simpler, and more straightforward for first-time users. As in the issue description, I think others should do a first pass.


Noting some quick observations from Hotjar recordings. In general, drawing conclusions from recordings over a short period will be biased, as the recordings will tend to be of the same users working to implement OCDS.

  • Many users don't get past the first page. The common behavior is to scroll past the form, loiter around the content area, scroll further down, then back up to the content area.
    • Similar to my heuristic observations above, my interpretation is that people don't know what the tool does yet, so they don't know what the form is about and skip it. They get to the content area and scan it, but the content isn't ordered by priority (and there's a fair amount of it), so they scroll further down to scan the content there, realize that content is even less relevant, so scroll back up to read the content more closely.
  • Users don't seem to know what to do with the error about not identifying a package structure.
    • We can perhaps display a JSON snippet of the skeleton of a package, to show what it should look like. We might also have a truncated (e.g. first 1000 characters) copy of their indented JSON shown side-by-side.
  • One user abandoned the form.
    • Instead of just "Loading" and a spinner, we should display a message that for large inputs, the form will take some time to submit, and to please be patient.
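For the "can't identify a package structure" error, the skeleton snippet could look something like the following. This is a sketch of one possible snippet (field list subject to review against the release package schema), rendered here via json.dumps:

```python
# Sketch: minimal release package skeleton to display alongside the error,
# so users can see the expected top-level shape at a glance.
import json

skeleton = {
    "uri": "...",
    "version": "1.1",
    "publishedDate": "...",
    "publisher": {"name": "..."},
    "releases": [
        {"ocid": "...", "id": "...", "date": "...", "tag": ["..."]}
    ],
}
snippet = json.dumps(skeleton, indent=2)
```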

Moved from OpenDataServices/cove#1197

explore_release and explore_record are not in-sync

In the explore_additional_content block, explore_record has "Is structurally correct?", "Number of records", etc. which are in the headlines block in explore_release. The record template also has the Schema and Convert boxes alongside the Headlines box, whereas the release template has them beneath.

As much as possible, the two templates should be the same. We should perhaps have each inherit from (or reuse blocks from) a common template.

coverage line in CI does nothing?

 Run coverage run --source cove_ocds,cove_project manage.py test
/opt/hostedtoolcache/Python/3.6.12/x64/lib/python3.6/site-packages/cove/settings.py:31: UserWarning: SECRET_KEY should be added to Environment Variables. Random key will be used instead.
  warnings.warn('SECRET_KEY should be added to Environment Variables. Random key will be used instead.')

System check identified no issues (0 silenced).
----------------------------------------------------------------------
Ran 0 tests in 0.000s

OK

When migrating from Travis to GitHub Actions, I left it in as I wasn't certain it does nothing, and it felt like we should check that. If it really does nothing, we can remove it. I suspect the coveralls data comes from the next test line anyway.

Include context of errors (e.g. reporting the OCID)

From a helpdesk issue

Currently, when reporting structural errors, the review tool reports the location of the error relative to the package, e.g. releases/38/tender/tenderers. Since some packages are generated on the fly, and there may be complex mapping behind them, it may be helpful if it also extracted the OCID of the release with the issue, rather than just its position in the array. This would help the publisher find the offending release elsewhere to make the change.
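A sketch of the lookup: parse the release index out of the error path and read that release's ocid from the package. The function name is hypothetical.

```python
# Sketch: given an error path like "releases/38/tender/tenderers", pull out
# the release index and look up its ocid to include in the error message.
def ocid_for_error(package, error_path):
    parts = error_path.split("/")
    if parts[0] == "releases" and len(parts) > 1 and parts[1].isdigit():
        release = package["releases"][int(parts[1])]
        return release.get("ocid")
    return None  # error is not inside a release (e.g. package metadata)

package = {"releases": [{"ocid": "ocds-abc-0"}, {"ocid": "ocds-abc-1"}]}
ocid = ocid_for_error(package, "releases/1/tender/tenderers")  # "ocds-abc-1"
```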

Add a web API

Copied across from OpenDataServices/cove#320

The original issue has some discussion, but pertinent bits:

It's certainly true that we don't intend CoVE / lib-cove-ocds to be used as part of the backend for a web site - it's definitely not fast enough. If there was a demand for this, we could probably concoct something - but I'd want to know more about exactly what was required. As it stands, I'm only aware of @patxiworks wanting to use CoVE in this way, and unfortunately we weren't able to support what he wanted to do.

However, a web API to CoVE also has a role in publication systems, and potentially in data consumption applications, as part of automated processing, where response time isn't as important. Packaging up something that can be run as a 'black box' and handle requests for feedback on data that goes beyond simple validation can be useful.

and

I guess the use case for a web API is an implementer who is unable or unwilling to use lib-cove-ocds as a library. (Even if their system is implemented in another language, they can either bridge to Python, or shell out to the libcoveocds command.) Right now, the demand for that seems low, but the issue can certainly remain open.

Include package metadata in 'fields that are empty or contain only whitespaces' check

It looks like this check currently only looks at the releases, not the package metadata; at least, a blank string in publisher/scheme is not reported by the check.

The check should be carried out on both the package metadata and the contents of the package.

I'm not sure how this is implemented, so worth checking if this issue is specific to this check or whether it applies to the way additional checks work in general.
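Whatever the current implementation does, the check conceptually just needs to walk the whole document from the root, metadata included. A sketch (function name and path format are assumptions):

```python
# Sketch: recursively collect JSON-pointer-style paths to strings that are
# empty or whitespace-only, starting from the package root so that package
# metadata is covered as well as the releases.
def blank_string_paths(value, path=""):
    paths = []
    if isinstance(value, str):
        if value.strip() == "":
            paths.append(path or "/")
    elif isinstance(value, dict):
        for key, child in value.items():
            paths.extend(blank_string_paths(child, f"{path}/{key}"))
    elif isinstance(value, list):
        for i, child in enumerate(value):
            paths.extend(blank_string_paths(child, f"{path}/{i}"))
    return paths

package = {"publisher": {"scheme": " "}, "releases": [{"ocid": "ocds-1"}]}
blanks = blank_string_paths(package)  # ["/publisher/scheme"]
```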

Remove duplication of validation messages from within oneOf

Introduced as part of OpenDataServices/cove#895 when we replaced the monolithic oneOf validation messages with the individual messages for each subschema.

Errors about date and tag being required can come from either subschema, and are repeated for each.

This is only a problem for files that have a mix of assumed embedded and linked releases for different records.

Should we be using "tx pull -af"?

In https://ocds-data-review-tool.readthedocs.io/en/latest/translations.html

See https://docs.transifex.com/client/pull

-f or --force: Force the download of the translations files regardless of whether timestamps on the local computer are newer than those on the server.

I just had an issue where, because of git branch switching (I assume), it was skipping ES and I wasn't getting the latest translations. I had to add -f to make it download them.

Are there any situations where this is bad? If not, should we just add that to the docs?

Additional Fields: Collapse button overrides other panel titles

When viewing validation results, if you try to collapse the Additional Fields panel, its header and description override the titles and descriptions on all of the other panels. This doesn't happen when collapsing/restoring the other panels, only the Additional Fields panel.

Also, when there are no additional fields in use, it might be good to add a row to that panel saying that no fields were identified, rather than leaving it blank, as it looks like something might be missing.

Group validation errors by oneOf subschema

In OpenDataServices/cove#895 we now assume whether a record has linked or embedded releases, in order to use the correct subschema within the oneOf block. Text about this assumption is added to every relevant validation message.

This text is repeated for each validation error message. Instead we should group the messages by subschema used, and state the assumption only once.

List fields that are present/missing compared to the schema

Copied from OpenDataServices/cove#118

Although we provide a list of fields that are present but aren't in the schema (i.e. "additional fields"), we don't provide a list of fields that are in the schema but aren't in the file, or any other way for people to discover fields that may be relevant to them from the results page. The risk, of course, is that excessive noise is generated, but some way of allowing people to see their mapping template alongside their existing coverage could help alignment.

Write overview documentation

In this first iteration, this documentation should cover information that is unique to the DRT (i.e. we should assume that the developer knows Python, Django, JSON Schema, OCDS, etc. which are not unique to the DRT). It should cover the information that developers newly working on the DRT would be given, with respect to the responsibilities of each component (lib-cove-ocds, lib-cove, cove-ocds, cove), for example.

Related reading:

Possible contents:

  • Internal architecture of cove-ocds (where to find code for what)
  • Connection with external libs
  • How to add an additional check
  • Translations
  • The CLI
  • Template structure
  • Something about extensions

Account for #skiprows and #headerrows in spreadsheet validation error messages

The row number reported in spreadsheet validation error messages is incorrect when the #skiprows or #headerrows configuration properties are set.

For example, in the data which generated the following error message (a spreadsheet with #skiprows set to 2 and #headerrows set to 5) the error is actually on row 8, not row 2:

[screenshot: validation error message reporting row 2]

The row reported in the error message should account for the configuration properties set in the source spreadsheet.
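The correction is simple arithmetic, assuming the reported row is relative to the first data row: the actual spreadsheet row is offset by the skipped rows plus the extra header rows. With the example above (#skiprows 2, #headerrows 5), reported row 2 becomes row 8. A sketch; the function name is hypothetical:

```python
# Sketch: translate a reported row number back to the actual spreadsheet row,
# accounting for #skiprows and #headerrows. Assumes the reported row counts
# from the first data row, with one header row as the default.
def actual_row(reported_row, skip_rows=0, header_rows=1):
    return reported_row + skip_rows + (header_rows - 1)

row = actual_row(2, skip_rows=2, header_rows=5)  # 8, matching the example
```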
