
kingfisher-colab's People

Contributors

bjwebb, dependabot[bot], duncandewhurst, jpmckinney, kindly, odscjames, pindec, robredpath, yolile

kingfisher-colab's Issues

Refactor: Extend ipython-sql

Recently I've been experimenting with ipython-sql for connecting notebooks to (non-kingfisher) databases and issuing SQL commands using line and cell magics.

It provides some of the same functionality as kingfisher-colab (create_connection, get_dataframe_from_query and download_dataframe_as_csv), has some other useful features, and makes for a cleaner-looking notebook.

We should consider whether to switch to ipython-sql, which could reduce the number of functions we need to maintain in kingfisher-colab.
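For reference, a minimal sketch of the ipython-sql workflow in a notebook (the connection string, table and column names are illustrative, and each snippet below would be its own cell):

!pip install ipython-sql
%load_ext sql
%sql postgresql://user:password@host/database

%sql SELECT count(*) FROM release

%%sql releases <<
SELECT ocid
FROM release
LIMIT 10

The << form of the cell magic assigns the result set to a Python variable, which can then be converted to a pandas dataframe with its .DataFrame() method.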

Need to fix version of openpyxl in kingfisher-colab

Looks like an openpyxl upgrade means that flattening OCDS via kingfisher-colab currently fails with the following error:

"got invalid input value of type <class 'xml.etree.ElementTree.Element'>, expected string or Element".

Example OCDS JSON that fails is here

As a workaround for now, add
!pip install -U openpyxl==2.5.9
below the !pip install -U git+https://github.com/open-contracting/kingfisher-colab line.
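In other words, the install cell would look something like this (a sketch of the workaround described above):

!pip install -U git+https://github.com/open-contracting/kingfisher-colab
!pip install -U openpyxl==2.5.9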

calculate_coverage: Incorrect implementation for nested arrays

This doesn't look right (note the release_summary.field_list->>'additionalClassifications').

def test_calculate_coverage_all_many_to_many(db, capsys, tmpdir):
    field = "ALL awards/items/additionalClassifications/scheme"
    sql = calculate_coverage([field], scope="release_summary", sql=False, sql_only=True)
    assert sql == textwrap.dedent("""\
        SELECT
        count(*) AS total_release_summary,
        ROUND(SUM(CASE WHEN coalesce(release_summary.field_list->>'awards/items/additionalClassifications/scheme' =
        release_summary.field_list->>'additionalClassifications', false) THEN 1 ELSE 0 END) * 100.0 / count(*), 2) AS awards_items_additionalclassifications_scheme_percentage,
        ROUND(SUM(CASE WHEN coalesce(release_summary.field_list->>'awards/items/additionalClassifications/scheme' =
        release_summary.field_list->>'additionalClassifications', false) THEN 1 ELSE 0 END) * 100.0 / count(*), 2) AS total_percentage
        FROM release_summary
    """)  # noqa: E501
    assert capsys.readouterr().out == "WARNING: Results might be inaccurate due to nested arrays. Check that there is exactly one `awards/items` entry per `release`.\n"  # noqa: E501

Assigning to @duncandewhurst to confirm what the correct behavior should be.

Add output_flattened_gsheet function

Draft code from a notebook:

def flatten_to_gsheets(dataframe, workbook):
    import json
    import os

    drive = authenticate_pydrive()

    # save the release package to the Colab virtual machine
    if os.path.exists("release_package.json"):
        os.remove("release_package.json")
    with open("release_package.json", 'w') as f:
        json.dump(dataframe['release_package'][0], f)

    # flatten to .xlsx
    if os.path.exists("flattened.xlsx"):
        os.remove("flattened.xlsx")
    !flatten-tool flatten release_package.json --root-id=ocid --main-sheet-name releases --root-list-path=releases

    # upload to Google Drive
    uploaded = drive.CreateFile({'title': workbook + '.xlsx'})
    uploaded.SetContentFile('flattened.xlsx')
    uploaded.Upload()
    print('Uploaded file with ID {}'.format(uploaded.get('id')))
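Illustrative usage of the draft, assuming packages is a dataframe with a release_package column (for example, one returned by get_dataframe_from_query):

flatten_to_gsheets(packages, 'example-workbook')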

FYI we're using this code in 360Giving data services

Hi team!

Just an FYI to say that we're using this code in some work we're doing on 360Giving Colab notebooks. We're using specific commits so no need to adapt practice or anything - just to let you know, say thanks, and ping us if you'd like to see what we're using and talk about contributions!

flattentool 0.15.0 breaks kingfisher-colab

Steps to reproduce:

pip install --upgrade 'ocdskingfishercolab'
import json
import os
import pprint

import pandas
from google.colab.files import download
from ocdskingfishercolab import (
    create_connection, set_spreadsheet_name, list_source_ids, list_collections,
    set_search_path, execute_statement, get_dataframe_from_query,
    save_dataframe_to_sheet, save_dataframe_to_spreadsheet,
    download_data_as_json, download_package_from_ocid)
from psycopg2.sql import SQL, Identifier

Error:

ImproperlyConfigured                      Traceback (most recent call last)

/usr/local/lib/python3.6/dist-packages/flattentool/i18n.py in <module>()
     12     try:
---> 13         get_language()
     14     except ImproperlyConfigured:

10 frames

ImproperlyConfigured: Requested setting USE_I18N, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.


During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)

ImportError: 


During handling of the above exception, another exception occurred:

FileNotFoundError                         Traceback (most recent call last)

/usr/lib/python3.6/gettext.py in translation(domain, localedir, languages, class_, fallback, codeset)
    527         if fallback:
    528             return NullTranslations()
--> 529         raise OSError(ENOENT, 'No translation file found for domain', domain)
    530     # Avoid opening, reading, and parsing the .mo file after it's been done
    531     # once.

FileNotFoundError: [Errno 2] No translation file found for domain: 'flatten-tool'

As a workaround, I've updated the data feedback template to explicitly install flattentool<0.15 before installing kingfisher-colab.
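Concretely, the workaround looks something like this in the install cell (a sketch of the change described above):

!pip install 'flattentool<0.15'
!pip install --upgrade ocdskingfishercolab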

cc @Bjwebb

calculate_coverage: Potential error in default scope

If a field is contracts/implementation/documents/format, then contracts_summary is used as the scope, instead of contract_implementation_documents_summary, because contract != contracts.
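A hypothetical call illustrating this, based only on the description above:

calculate_coverage(["contracts/implementation/documents/format"])
# the scope defaults to contracts_summary, not contract_implementation_documents_summary,
# because the table name uses the singular "contract" while the field path uses "contracts"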

If that's the intended behavior, the documentation can be updated. If not, the error can be fixed.

Had to manually upgrade ipykernel to run notebook locally

When running a notebook locally, I got this error:

ERROR:tornado.general:Uncaught exception in ZMQStream callback
Traceback (most recent call last):
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 444, in _run_callback
    callback(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    self.pre_handler_hook()
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 248, in pre_handler_hook
    self.saved_sigint_handler = signal(SIGINT, default_int_handler)
  File "/usr/lib/python3.6/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread
ERROR:tornado.general:Uncaught exception in zmqstream callback
Traceback (most recent call last):
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 462, in _handle_events
    self._handle_recv()
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 492, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 444, in _run_callback
    callback(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    self.pre_handler_hook()
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 248, in pre_handler_hook
    self.saved_sigint_handler = signal(SIGINT, default_int_handler)
  File "/usr/lib/python3.6/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread
ERROR:tornado.application:Exception in callback (<zmq.sugar.socket.Socket object at 0x7fd311ad0db0>, <function wrap.<locals>.null_wrapper at 0x7fd311aa96a8>)
Traceback (most recent call last):
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 462, in _handle_events
    self._handle_recv()
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 492, in _handle_recv
    self._run_callback(callback, msg)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 444, in _run_callback
    callback(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    self.pre_handler_hook()
  File "/home/vagrant/open-contracting/kingfisher-colab/.ve/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 248, in pre_handler_hook
    self.saved_sigint_handler = signal(SIGINT, default_int_handler)
  File "/usr/lib/python3.6/signal.py", line 47, in signal
    handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
ValueError: signal only works in main thread

These issues seem related:
ipython/ipython#11258
ipython/ipykernel#335

Upgrading ipykernel fixes it.

pip install --upgrade ipykernel

The current version of ipykernel is pinned by the google-colab library, so I'm not sure what a good fix for this would look like.

Check coverage for a collection of fields

From open-contracting/kingfisher-summarize#29 (comment):

we should look at how we can auto-generate queries with the correct joins to check coverage for a collection of fields, after which we can ask Camila to update the usability checks notebook she shared in CRM issue 5084.

There's an example of the type of query I'm thinking of in: open-contracting/kingfisher-summarize#29 (comment), though that may change depending on how the issue is resolved.

Ideally we'd have a get_coverage method which accepts a list of field paths as a parameter and returns the % of (compiled) releases in which the collection of fields is present.

I need to think through what the input and output should be when you want to measure coverage at a level other than release, e.g. awards, contracts etc.
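A minimal sketch of what such a helper might look like, assuming the field_list column provided by Kingfisher Summarize (treated here as a JSONB object keyed by field path) and the existing get_dataframe_from_query function; this is a thought experiment, not a proposed implementation:

def get_coverage(fields, scope='release_summary'):
    # percentage of rows in the given summary table in which every listed field
    # is present, using the JSONB key-existence operator (?) on field_list
    conditions = ' AND '.join("field_list ? '{}'".format(field) for field in fields)
    sql = """
        SELECT ROUND(SUM(CASE WHEN {} THEN 1 ELSE 0 END) * 100.0 / count(*), 2) AS coverage
        FROM {}
    """.format(conditions, scope)
    return get_dataframe_from_query(sql)

# e.g. get_coverage(['tender/title', 'awards/value/amount'])  # field paths are illustrative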

Consider using noteql instead of ipython-sql

@kindly has been working on noteql as an alternative to ipython-sql. The new library fixes some of the issues we have with ipython-sql (e.g. catherinedevlin/ipython-sql#177, catherinedevlin/ipython-sql#190, catherinedevlin/ipython-sql#191) and introduces some new features, including support for multiple SQL statements in one cell and support for templating using jinjasql.

noteql is still a work-in-progress but could be worth using in the future, especially as we'll be more readily able to make changes or add features.

@kindly are there any other improvements, either from an analyst perspective, or from a development/maintenance perspective, that we should take into account?

SQL style guide and linter

We write a lot of SQL code in notebooks. To make it easier to maintain and collaborate on that code, it would be good to agree on a style guide and to use a linter to enforce it.

pycodestyle_magic provides a nice integration of flake8 with the notebook interface, and https://github.com/pgjones/flake8-sql is a flake8 plugin that looks for SQL queries in Python code and checks them against an opinionated style, based on the SQL Style Guide.

It looks like we could use those to lint our existing approach of passing SQL queries as strings, although I haven't actually tried it.
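A rough sketch of how that setup might look in a notebook, assuming pycodestyle_magic's %flake8_on line magic (untested):

!pip install flake8 flake8-sql pycodestyle_magic
%load_ext pycodestyle_magic
%flake8_on

With that enabled, flake8 (and therefore flake8-sql) would check each cell as it runs.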

SQLFluff is more configurable and has a default style which is closer to our existing style, but seems to only be available as a command-line tool.

If we choose to adopt ipython-sql (#29) then some further work would be required to get linting working for cells which use the %%sql magic.

@pindec @mrshll1001 @odscrachel @yolile @romifz - does this sound like a good idea? Do you have a preferred style guide we should follow?

set_search_path: ResourceClosedError

set_search_path works, but raises a ResourceClosedError, e.g.:

schema_name = 'view_data_collection_2052_2053'

set_search_path(schema_name)

---------------------------------------------------------------------------

ResourceClosedError                       Traceback (most recent call last)

<ipython-input-12-1fd85f0312b5> in <module>()
      1 schema_name = 'view_data_collection_2052_2053'
      2 
----> 3 set_search_path(schema_name)

12 frames

<decorator-gen-116> in execute(self, line, cell, local_ns)

<decorator-gen-115> in execute(self, line, cell, local_ns)

/usr/local/lib/python3.7/dist-packages/sqlalchemy/util/compat.py in raise_(***failed resolving arguments***)
    209 
    210         try:
--> 211             raise exception
    212         finally:
    213             # credit to

ResourceClosedError: This result object does not return rows. It has been closed automatically.

Document each function in kingfisher-colab

At the moment I have to look at the code in __init__.py to work out how to use the functions.

Some documentation, perhaps in the style of the ocdskit documentation, would be really helpful (a docstring sketch follows the list below).

  • create_connection
  • authenticate_gspread
  • authenticate_pydrive
  • getResults
  • saveToCSV
  • set_spreadsheet_name
  • saveToSheets
  • downloadReleases
  • output_gsheet
  • output_flattened_gsheet
  • output_notebook
  • download_json

output_notebook stops working after a query fails

If a query run with output_notebook fails due to any SQL error (e.g. a syntax error), any subsequent call to the method fails with the following stack trace:

---------------------------------------------------------------------------
InFailedSqlTransaction                    Traceback (most recent call last)
<ipython-input-68-4a0a638ff542> in <module>()
     20 """
     21 
---> 22 buyers = output_notebook(querystring.format(collection_id))
     23 buyers.head()

/usr/local/lib/python3.6/dist-packages/kingfishercolab/__init__.py in output_notebook(sql, params)
    138 def output_notebook(sql, params=None):
    139     with conn.cursor() as cur:
--> 140         cur.execute(sql, params)
    141         return getResults(cur)
    142 

InFailedSqlTransaction: current transaction is aborted, commands ignored until end of transaction block
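A possible workaround until this is fixed (untested, and assuming the module-level connection shown in the traceback is accessible) is to roll back the aborted transaction before running further queries:

import kingfishercolab
kingfishercolab.conn.rollback()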

Move OCDS-specific methods to an ocds module

This package is also used by 360Giving, so in the longer term it’d be best to separate out the standard-specific methods.

For 0.2.x, we can import those methods into the main module, so that users will see no difference in behaviour.

In 0.3.x, that import can be removed.
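For example, the 0.2.x shim could be as simple as re-exporting from the new submodule (the module path and the choice of function are illustrative):

# in ocdskingfishercolab/__init__.py
from ocdskingfishercolab.ocds import download_package_from_ocid  # noqa: F401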

SQL syntax highlighting

I've been trying to get SQL syntax highlighting working in Google Colab with no luck.

If we start using ipython-sql (#29), then it would be helpful to also have syntax highlighting for cells which use the %%sql cell magic.

Stack Overflow suggests this is possible with a normal Jupyter notebook: https://stackoverflow.com/questions/57966403/how-to-add-syntax-highlight-to-sql-line-magic-cell-magic-and-custom-command-in

This might not fit kingfisher-colab exactly, as I expect the solution will be something of a hack involving custom JavaScript, but this seems like the closest place to log it.

ipython-sql: PostgreSQL JSON operators break %sql line magic

Using the -> or ->> operators in a query breaks the sql line magic. For example (notebook), running a cell containing %sql select data -> 'ocid' from data limit 1 results in the following error:

UsageError: unrecognized arguments: -> 'ocid' from data limit 1

The same issue affects get_ipython_sql_resultset_from_query.

This is a problem because the line magic is useful for controlling the execution of queries using Python for loops, if statements, functions, etc., and for creating template queries and substituting in table names.

I guess this is an upstream problem with ipython-sql but I'm wondering if there is a workaround I'm missing, or if there's something we can do in kingfisher-colab.

catherinedevlin/ipython-sql#190 looks like the same issue.

For context, I came across this issue when updating the templated queries in the structure and format section of the data feedback template to use the new version of kingfisher-colab. In that case, I was able to author a query that did away with the need for the templates, but in other cases it could be a blocker.

Please add jpmckinney to the kingfisher-colab PyPI project

https://pypi.org/project/kingfisher-colab/

Since I wasn't added when it was originally created, and since it was never communicated that this package was created (the release process used didn't add tags on GitHub, which would have at least been a hint, and the released package isn't used in any notebook I've seen), I ended up creating another project: https://pypi.org/project/ocdskingfishercolab/

Please either delete the project from PyPI to avoid confusion, add me as an admin, or both.

render_json: add function to render collapsible JSON in a notebook

function:

import json
from IPython.display import HTML

def render_json(jstr):
    # accept either a JSON string or a Python object
    if not isinstance(jstr, str):
        jstr = json.dumps(jstr)
    return HTML("""
<script src="https://rawgit.com/caldwell/renderjson/master/renderjson.js"></script>
<script>
renderjson.set_show_to_level(1)
document.body.appendChild(renderjson(%s))
new ResizeObserver(google.colab.output.resizeIframeToContent).observe(document.body)
</script>
""" % jstr)

usage:

render_json("""{"example_array": [1,2,3], "example_object": {"a": 1, "b":2}}""")

⊖{
    "example_array": ⊕[3 items],
    "example_object": ⊕{2 items}
}

Dependencies of google-colab 1.0.0 are now different on Google Colab compared to PyPI

Since we've pinned based on what's in PyPI (and presumably used to be on Colab), we now get these errors on Colab:

ERROR: google-colab 1.0.0 has requirement notebook~=5.3.0; python_version >= "3.0", but you'll have notebook 5.2.2 which is incompatible.
ERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.21.0 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.

This doesn't seem to break the notebook, as pip just installs one of the incompatible versions anyway. I think this may change (to a hard failure) in future pip versions.

(Screenshot of the pip error output, 2020-12-15.)
