
kingfisher-colab's People

Contributors

bjwebb, dependabot[bot], duncandewhurst, jpmckinney, kindly, odscjames, pindec, robredpath, yolile

kingfisher-colab's Issues

Refactor: Extend ipython-sql

Recently I've been experimenting with ipython-sql for connecting notebooks to (non-kingfisher) databases and issuing SQL commands using line and cell magics.

It provides some of the same functionality as kingfisher-colab (create_connection, get_dataframe_from_query and download_dataframe_as_csv), has some other useful features, and makes for a cleaner-looking notebook.

We should consider whether to switch to ipython-sql, which could reduce the number of functions we need to maintain in kingfisher-colab.
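For reference, a minimal sketch of the ipython-sql workflow in a notebook (the connection string, table and column names are illustrative, and each snippet below would be its own cell):

!pip install ipython-sql
%load_ext sql
%sql postgresql://user:password@host/database

%sql SELECT count(*) FROM release

%%sql releases <<
SELECT ocid
FROM release
LIMIT 10

The << form of the cell magic assigns the result set to a Python variable, which can then be converted to a pandas dataframe with its .DataFrame() method.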

Need to fix version of openpyxl in kingfisher-colab

Looks like an openpyxl upgrade means that flattening OCDS via kingfisher-colab currently fails with the following error:

"got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element".

Example OCDS JSON that fails is here

As a workaround for now, add
!pip install -U openpyxl==2.5.9
below the !pip install -U git+https://github.com/open-contracting/kingfisher-colab line.
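In other words, the install cell would look something like this (a sketch of the workaround described above):

!pip install -U git+https://github.com/open-contracting/kingfisher-colab
!pip install -U openpyxl==2.5.9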

calculate_coverage: Incorrect implementation for nested arrays

This doesn't look right (note the release_summary.field_list->>'additionalClassifications').

def test_calculate_coverage_all_many_to_many(db, capsys, tmpdir):
    field = "ALL awards/items/additionalClassifications/scheme"
    sql = calculate_coverage([field], scope="release_summary", sql=False, sql_only=True)
    assert sql == textwrap.dedent("""\
        SELECT
        count(*) AS total_release_summary,
        ROUND(SUM(CASE WHEN coalesce(release_summary.field_list->>'awards/items/additionalClassifications/scheme' =
        release_summary.field_list->>'additionalClassifications', false) THEN 1 ELSE 0 END) * 100.0 / count(*), 2) AS awards_items_additionalclassifications_scheme_percentage,
        ROUND(SUM(CASE WHEN coalesce(release_summary.field_list->>'awards/items/additionalClassifications/scheme' =
        release_summary.field_list->>'additionalClassifications', false) THEN 1 ELSE 0 END) * 100.0 / count(*), 2) AS total_percentage
        FROM release_summary
    """)  # noqa: E501
    assert capsys.readouterr().out == "WARNING: Results might be inaccurate due to nested arrays. Check that there is exactly one `awards/items` entry per `release`.\n"  # noqa: E501

Assigning to @duncandewhurst to confirm what the correct behavior should be.

Add output_flattened_gsheet function

Draft code from a notebook:

def flatten_to_gsheets(dataframe, workbook):
    import json
    import os

    drive = authenticate_pydrive()

    # save the release package to the Colab virtual machine
    if os.path.exists("release_package.json"):
        os.remove("release_package.json")
    with open("release_package.json", 'w') as f:
        json.dump(dataframe['release_package'][0], f)

    # flatten to .xlsx
    if os.path.exists("flattened.xlsx"):
        os.remove("flattened.xlsx")
    !flatten-tool flatten release_package.json --root-id=ocid --main-sheet-name releases --root-list-path=releases

    # upload to Google Drive
    uploaded = drive.CreateFile({'title': workbook + '.xlsx'})
    uploaded.SetContentFile('flattened.xlsx')
    uploaded.Upload()
    print('Uploaded file with ID {}'.format(uploaded.get('id')))
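Illustrative usage of the draft, assuming packages is a dataframe with a release_package column (for example, one returned by get_dataframe_from_query):

flatten_to_gsheets(packages, 'example-workbook')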

FYI we're using this code in 360Giving data services

Hi team!

Just an FYI to say that we're using this code in some work we're doing on 360Giving Colab notebooks. We're using specific commits so no need to adapt practice or anything - just to let you know, say thanks, and ping us if you'd like to see what we're using and talk about contributions!

flattentool 0.15.0 breaks kingfisher-colab

Steps to reproduce:

pip install --upgrade 'ocdskingfishercolab'
import json
import os
import pprint

import pandas
from google.colab.files import download
from ocdskingfishercolab import (
    create_connection, set_spreadsheet_name, list_source_ids, list_collections,
    set_search_path, execute_statement, get_dataframe_from_query,
    save_dataframe_to_sheet, save_dataframe_to_spreadsheet,
    download_data_as_json, download_package_from_ocid)
from psycopg2.sql import SQL, Identifier

Error:

ImproperlyConfigured                      Traceback (most recent call last)

/usr/local/lib/python3.6/dist-packages/flattentool/i18n.py in <module>()
     12     try:
---> 13         get_language()
     14     except ImproperlyConfigured:

10 frames

ImproperlyConfigured: Requested setting USE_I18N, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.


During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)

ImportError: 


During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)

/usr/lib/python3.6/gettext.py in translation(domain, localedir, languages, class_, fallback, codeset)
    527         if fallback:
    528             return NullTranslations()
--> 529         raise OSError(ENOENT, 'No translation file found for domain', domain)
    530     # Avoid opening, reading, and parsing the .mo file after it's been done
    531     # once.

FileNotFoundError: [Errno 2] No translation file found for domain: 'flatten-tool'

As a workaround, I've updated the data feedback template to explicitly install flattentool<0.15 before installing kingfisher-colab.
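Concretely, the workaround looks something like this in the install cell (a sketch of the change described above):

!pip install 'flattentool<0.15'
!pip install --upgrade ocdskingfishercolab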

cc @Bjwebb

calculate_coverage: Potential error in default scope

If a field is contracts/implementation/documents/format, then contracts_summary is used as the scope, instead of contract_implementation_documents_summary, because contract != contracts.
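A hypothetical call illustrating this, based only on the description above:

calculate_coverage(["contracts/implementation/documents/format"])
# the scope defaults to contracts_summary, not contract_implementation_documents_summary,
# because the table name uses the singular "contract" while the field path uses "contracts"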

If that's the intended behavior, the documentation can be updated. If not, the error can be fixed.

Had to manually upgrade ipykernel to run notebook locally

When running a notebook locally, I got this error:

ERROR:tornado.general:Uncaught exception in ZMQStream callback
Traceback (most recent call last):
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 444, in _run_callback
    callback(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    self.pre_handler_hook()
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 248, in pre_handler_hook
    self.saved_sigint_handler = signal(SIGINT, default_int_handler)
  File "/usr/lib/python3.6/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread
ERROR:tornado.general:Uncaught exception in zmqstream callback
Traceback (most recent call last):
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 462, in _handle_events
    self._handle_recv()
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 492, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 444, in _run_callback
    callback(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    self.pre_handler_hook()
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 248, in pre_handler_hook
    self.saved_sigint_handler = signal(SIGINT, default_int_handler)
  File "/usr/lib/python3.6/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread
ERROR:tornado.application:Exception in callback (<zmq.sugar.socket.Socket object at 0x7fd311ad0db0>, <function wrap.<locals>.null_wrapper at 0x7fd311aa96a8>)
Traceback (most recent call last):
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 462, in _handle_events
    self._handle_recv()
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 492, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 444, in _run_callback
    callback(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    self.pre_handler_hook()
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 248, in pre_handler_hook
    self.saved_sigint_handler = signal(SIGINT, default_int_handler)
  File "/usr/lib/python3.6/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread

These issues seem related:
ipython/ipython#11258
ipython/ipykernel#335

Upgrading ipykernel fixes it.

pip install --upgrade ipykernel

The current version of ipykernel is pinned by the google-colab library, so I'm not sure what a good fix for this would look like.

Check coverage for a collection of fields

From open-contracting/kingfisher-summarize#29 (comment):

we should look at how we can auto-generate queries with the correct joins to check coverage for a collection of fields, after which we can ask Camila to update the usability checks notebook she shared in CRM issue 5084.

There's an example of the type of query I'm thinking of in: open-contracting/kingfisher-summarize#29 (comment), though that may change depending on how the issue is resolved.

Ideally we'd have a get_coverage method which accepts a list of field paths as a parameter and returns the % of (compiled) releases in which the collection of fields is present.

I need to think through what the input and output should be when you want to measure coverage at a level other than release, e.g. awards, contracts etc.
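A minimal sketch of what such a helper might look like, assuming the field_list column provided by Kingfisher Summarize (treated here as a JSONB object keyed by field path) and the existing get_dataframe_from_query function; this is a thought experiment, not a proposed implementation:

def get_coverage(fields, scope='release_summary'):
    # percentage of rows in the given summary table in which every listed field
    # is present, using the JSONB key-existence operator (?) on field_list
    conditions = ' AND '.join("field_list ? '{}'".format(field) for field in fields)
    sql = """
        SELECT ROUND(SUM(CASE WHEN {} THEN 1 ELSE 0 END) * 100.0 / count(*), 2) AS coverage
        FROM {}
    """.format(conditions, scope)
    return get_dataframe_from_query(sql)

# e.g. get_coverage(['tender/title', 'awards/value/amount'])  # field paths are illustrative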

Consider using noteql instead of ipython-sql

@kindly has been working on noteql as an alternative to ipython-sql. The new library fixes some of the issues we have with ipython-sql (e.g. catherinedevlin/ipython-sql#177, catherinedevlin/ipython-sql#190, catherinedevlin/ipython-sql#191) and introduces some new features, including support for multiple SQL statements in one cell and support for templating using jinjasql.

noteql is still a work-in-progress but could be worth using in the future, especially as we'll be more readily able to make changes or add features.

@kindly are there any other improvements, either from an analyst perspective, or from a development/maintenance perspective, that we should take into account?

SQL style guide and linter

We write a lot of SQL code in notebooks. To make it easier to maintain and collaborate on that code, it would be good to agree on a style guide and to use a linter to enforce it.

pycodestyle_magic provides a nice integration of flake8 with the notebook interface, and https://github.com/pgjones/flake8-sql is a flake8 plugin that looks for SQL queries in Python code and checks them against an opinionated style, based on the SQL Style Guide.

It looks like we could use those to lint our existing approach of passing SQL queries as strings, although I haven't actually tried it.
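A rough sketch of how that setup might look in a notebook, assuming pycodestyle_magic's %flake8_on line magic (untested):

!pip install flake8 flake8-sql pycodestyle_magic
%load_ext pycodestyle_magic
%flake8_on

With that enabled, flake8 (and therefore flake8-sql) would check each cell as it runs.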

SQLFluff is more configurable and has a default style which is closer to our existing style, but seems to only be available as a command-line tool.

If we choose to adopt ipython-sql (#29) then some further work would be required to get linting working for cells which use the %%sql magic.

@pindec @mrshll1001 @odscrachel @yolile @romifz - does this sound like a good idea? Do you have a preferred style guide we should follow?

set_search_path: ResourceClosedError

set_search_path works, but raises a ResourceClosedError, e.g.:

schema_name = 'view_data_collection_2052_2053'

set_search_path(schema_name)

---------------------------------------------------------------------------

ResourceClosedError                       Traceback (most recent call last)

<ipython-input-12-1fd85f0312b5> in <module>()
      1 schema_name = 'view_data_collection_2052_2053'
      2 
----> 3 set_search_path(schema_name)

12 frames

<decorator-gen-116> in execute(self, line, cell, local_ns)

<decorator-gen-115> in execute(self, line, cell, local_ns)

/usr/local/lib/python3.7/dist-packages/sqlalchemy/util/compat.py in raise_(***failed resolving arguments***)
    209 
    210         try:
--> 211             raise exception
    212         finally:
    213             # credit to

ResourceClosedError: This result object does not return rows. It has been closed automatically.

Document each function in kingfisher-colab

At the moment I have to look at the code in __init__.py to work out how to use the functions.

Some documentation, perhaps in the style of the ocdskit documentation, would be really helpful (a docstring sketch follows the list below).

  • create_connection
  • authenticate_gspread
  • authenticate_pydrive
  • getResults
  • saveToCSV
  • set_spreadsheet_name
  • saveToSheets
  • downloadReleases
  • output_gsheet
  • output_flattened_gsheet
  • output_notebook
  • download_json

output_notebook stops working after a query fails

If a query run with output_notebook fails due to any SQL error (e.g. a syntax error), any subsequent call to the method fails with the following stack trace:

---------------------------------------------------------------------------
InFailedSqlTransaction                    Traceback (most recent call last)
<ipython-input-68-4a0a638ff542> in <module>()
     20 """
     21 
---> 22 buyers = output_notebook(querystring.format(collection_id))
     23 buyers.head()

/usr/local/lib/python3.6/dist-packages/kingfishercolab/__init__.py in output_notebook(sql, params)
    138 def output_notebook(sql, params=None):
    139     with conn.cursor() as cur:
--> 140         cur.execute(sql, params)
    141         return getResults(cur)
    142 

InFailedSqlTransaction: current transaction is aborted, commands ignored until end of transaction block
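A possible workaround until this is fixed (untested, and assuming the module-level connection shown in the traceback is accessible) is to roll back the aborted transaction before running further queries:

import kingfishercolab
kingfishercolab.conn.rollback()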

Move OCDS-specific methods to an ocds module

This package is also used by 360Giving, so in the longer term it’d be best to separate out the standard-specific methods.

For 0.2.x, we can import those methods into the main module, so that users will see no difference in behaviour.

In 0.3.x, that import can be removed.
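For example, the 0.2.x shim could be as simple as re-exporting from the new submodule (the module path and the choice of function are illustrative):

# in ocdskingfishercolab/__init__.py
from ocdskingfishercolab.ocds import download_package_from_ocid  # noqa: F401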

SQL syntax highlighting

I've been trying to get SQL syntax highlighting working in Google Colab with no luck.

If we start using ipython-sql (#29), then it would be helpful to also have syntax highlighting for cells which use the %%sql cell magic.

Stack Overflow suggests this is possible with a normal Jupyter notebook: https://stackoverflow.com/questions/57966403/how-to-add-syntax-highlight-to-sql-line-magic-cell-magic-and-custom-command-in

This might not fit kingfisher-colab exactly, as I expect the solution will be something of a hack involving custom JavaScript, but this seems like the closest place to log it.

ipython-sql: PostgreSQL JSON operators break %sql line magic

Using the -> or ->> operators in a query breaks the sql line magic. For example (notebook), running a cell containing %sql select data -> 'ocid' from data limit 1 results in the following error:

UsageError: unrecognized arguments: -> 'ocid' from data limit 1

The same issue affects get_ipython_sql_resultset_from_query.

This is a problem because the line magic is useful for controlling the execution of queries using Python for loops, if statements, functions, etc., and for creating template queries and substituting in table names.

I guess this is an upstream problem with ipython-sql but I'm wondering if there is a workaround I'm missing, or if there's something we can do in kingfisher-colab.

catherinedevlin/ipython-sql#190 looks like the same issue.

For context, I came across this issue when updating the templated queries in the structure and format section of the data feedback template to use the new version of kingfisher-colab. In that case, I was able to author a query that did away with the need for the templates, but in other cases it could be a blocker.

Please add jpmckinney to the kingfisher-colab PyPI project

https://pypi.org/project/kingfisher-colab/

Since I wasn't added when it was originally created, and since it was never communicated that this package was created (the release process used didn't add tags on GitHub, which would have at least been a hint, and the released package isn't used in any notebook I've seen), I ended up creating another project: https://pypi.org/project/ocdskingfishercolab/

Please either delete the project from PyPI to avoid confusion, add me as an admin, or both.

render_json: add function to render collapsible JSON in a notebook

function:

import json
from IPython.display import HTML

def render_json(jstr):
    # accept either a JSON string or a Python object
    if not isinstance(jstr, str):
        jstr = json.dumps(jstr)
    return HTML("""
<script src="https://rawgit.com/caldwell/renderjson/master/renderjson.js"></script>
<script>
renderjson.set_show_to_level(1)
document.body.appendChild(renderjson(%s))
new ResizeObserver(google.colab.output.resizeIframeToContent).observe(document.body)
</script>
""" % jstr)

usage:

render_json("""{"example_array": [1,2,3], "example_object": {"a": 1, "b":2}}""")

⊖{
    "example_array": ⊕[3 items],
    "example_object": ⊕{2 items}
}

Dependencies of google-colab 1.0.0 are now different on Google Colab compared to PyPI

Since we've pinned based on what's in PyPI (and presumably used to be on Colab), we now get these errors on Colab:

ERROR: google-colab 1.0.0 has requirement notebook~=5.3.0; python_version >= "3.0", but you'll have notebook 5.2.2 which is incompatible.
ERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.21.0 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.

This doesn't seem to break the notebook, as pip just installs one of the incompatible versions anyway. I think this may change (to a hard failure) in future pip versions.

(Screenshot of the pip error output, 2020-12-15.)
