cjworkbench / cjworkbench

The data journalism platform with built in training

License: Other

Python 54.65% JavaScript 28.73% HTML 10.26% Shell 0.68% Dockerfile 0.33% Thrift 0.22% Makefile 0.01% C 0.05% SCSS 5.06% HCL 0.03%

data-science data-visualization data-analysis journalism data-journalism notebook

cjworkbench's Introduction

Spreadsheet, meet automation.

Welcome to Workbench!

Workbench is a platform that helps you make sense of data tables. Code like a pro -- without code.

Features include:

Steps to download, HTML-scrape, clean, analyze and visualize data.
Steps to load tables from Google Drive, Twitter and APIs.
Emailed notifications when data changes.
An integrated data-journalism training course.
Undo, so you can't make mistakes -- only experiments.
Unlimited power, with custom Python and Excel-like formulas.

Try it

To see what Workbench does, run your own server.

User Documentation

Imagining the data journalism workflow of the future
What workbench can do for data
A different approach to transparent data journalism
Data journalism made easier, faster, and more collaborative
Our knowledge base has detailed instructions for each step

Contributing

Workbench is licensed under the AGPL 3.0 license. You are free to use the code or parts of it in your own applications, even your own closed source applications. If you modify Workbench code or merge it into your own software, you must open-source the modifications.

Contact us

Always happy to hear from you:

We also welcome issue reports and pull requests :)

Credits

Workbench started as a project of Columbia Journalism School, made possible through the generous support of Krishna Bharat and the Knight Foundation.

cjworkbench's Issues

Warning Dialog while deleting Tabs

Please provide a warning when a user deletes a tab, with something like 'This action will remove the workflow tab. This action is not reversible.'

While the 'Undo/Redo' option can restore tabs, it would be good to warn users about such broad, sweeping changes.

Resources Broken Link: "Resources and inspiration to do more"

The link to additional resources is broken in the inflation tutorial:

404 https://app.workbenchdata.com/courses/en/intro-to-data-journalism/resources

cjworkbench/server/courses/en/intro-to-data-journalism/inflation.html

Line 217 in 0f51007

<a href="resources">Resources and inspiration to do more</a>

64-bit integers are rounded before displaying

Steps to reproduce:

Add a Twitter Step
Duplicate the "id" column and convert it to text

Expected results: the Text "id" column and Number "id" column have the same values
Actual results: the Text "id" values are correct; the Number ones are incorrect

The problem: we send 64-bit integers as JSON, and the browser's parser converts them to Float and then converts them back to Integer to display them -- losing precision.

Solution: let's use a BigInt-compatible library for parsing JSON table data from the server.
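The precision loss described above can be reproduced outside the browser, because JavaScript numbers are IEEE 754 doubles, the same representation as Python's float. The following is a minimal sketch: the sample id is made up, and sending ids as strings is shown only as one alternative to the BigInt-compatible parser proposed in the issue, not as the project's actual fix.

```python
import json


def survives_float_round_trip(n: int) -> bool:
    """Return True if n is unchanged after conversion to a 64-bit float and back."""
    return int(float(n)) == n


# Hypothetical 64-bit id, larger than 2**53 (the limit for exact integers in a double).
tweet_id = 1234567890123456789

# One workaround: serialize large integers as strings so no JSON parser rounds them.
payload = json.dumps({"id": str(tweet_id)})
restored = int(json.loads(payload)["id"])
```

Integers up to 2**53 pass through a double unchanged, which is why the problem only shows up on very large values such as Twitter ids.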

Support accessing private APIs

Please kindly provide support for API calls with API Key authorization. This would enable us to query APIs which are not public.

The use case for this is the Open Apparel Project, which provides information about supply chains. We are trying to create datasets by querying its API with the provided API keys.

https://openapparel.org/

https://openapparel.org/api/docs/
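As an illustration of what API key authorization usually involves, here is a minimal sketch using only Python's standard library. The endpoint, the key, and the "Token" header scheme are all assumptions made for the example; the API's own documentation defines the real scheme.

```python
import urllib.request


def build_authorized_request(url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request that carries an API key in the Authorization header."""
    # The "Token" scheme is an assumption; check the API's docs for the real one.
    return urllib.request.Request(url, headers={"Authorization": f"Token {api_key}"})


# Hypothetical endpoint and key, for illustration only.
request = build_authorized_request("https://example.org/api/items/", "my-secret-key")
# urllib.request.urlopen(request) would send it; omitted here to avoid a network call.
```

A fetch step that supports private APIs would need a way to store such a key as a secret parameter rather than in the URL.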

Problem with encoding and emoji

Hi,
when I use the Twitter module, I have no problem with emoji: they are imported correctly.

But if I use the CSV export link as a live link in another tab, there seems to be an encoding problem.

This is an example workflow https://app.workbenchdata.com/workflows/18286/ (in the first tab the twitter import, in the second you can see the emoji encoding problem)

Thank you
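One common cause of garbled emoji is UTF-8 bytes being decoded with a different charset. Whether that is what happens in the workflow above is an assumption; this sketch only shows what such a mismatch looks like and how it can be reversed.

```python
def garble(text: str) -> str:
    """Simulate a reader that decodes UTF-8 bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo the mistake by re-encoding as Latin-1 and decoding as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "open data \U0001F600"
broken = garble(original)
fixed = repair(broken)
```

If the exported CSV is served without an explicit charset, the importing side has to guess, which is where a mismatch like this can creep in.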

Changing processing limits for local instance

Where can I change processing limits (e.g. max number of importable rows) for a local CJW instance?

Storage.connect is not available on concrete implemetations

Hello, python novice here, experimenting with SQL Alchemy and tableschema.

Have this code:

from tableschema import Table
from sqlalchemy import create_engine
import json
from pathlib import Path

def builddb(maindir, typedir, schemadir):
    # Build the SQLite database path and the SQLAlchemy engine.
    dbname = maindir / 'data/mydb.sqlite'
    s_engine = 'sqlite:////.' + str(dbname)
    engine = create_engine(s_engine)

    # Load each JSON schema and save its CSV file as a table.
    for schema in schemadir.glob("*.json"):
        with open(schema) as f:
            schemadata = json.load(f)
        csvfile = typedir / schemadata["csv"]

        try:
            mytable = Table(csvfile, schema=schemadata)
            tablename = schemadata["tablename"]
            mytable.save(tablename, storage='sql', engine=engine)
        except Exception as error:
            print(error)

It always raises the error: "Storage.connect is not available on concrete implemetations"

In Storage.py, there is this code:
...
if cls is not Storage:
    message = 'Storage.connect is not available on concrete implemetations'
    raise exceptions.StorageError(message)

Out of curiosity I commented out this if statement, and my database builds as expected. Am I doing something wrong in my implementation?

Thanks very much!

Track Changes with version history

When you are collaborating with multiple people in a workflow, it might happen that one person made a change that broke the workflow. It is not possible for the other collaborators to revert the workflow to its original state.

Hence it would be good to have a 'change history' where all the recent changes on the workflow are tracked. To revert to the previous workflow, you can select a change and press revert. The functionality could be similar to Google Version History

This is a request from our project co-coordinator in Thailand.

Uploader and paste functions not working

After properly running the install instructions here I was able to successfully create a new user and log in.

When I try to make a new workflow with either the CSV file uploader or the paste data function (clicking the little arrow), nothing happens. No JS errors, no python errors. The file uploading server debugging info shows the file uploader was working, but the main table on the page never actually changes to reflect the data — it just stays blank.

There is also a continual loading icon at the bottom of the screen that never goes away. Not sure if this is related.

Question about Global Export

I am confused about the 'Export' button. Does the [Export] button in the top right-hand corner of each tab provide CSV and JSON endpoints that include all the workflow steps in the current tab?

However, if you select (highlight) a single step in the editor, the Export endpoint seems to publish the result of the currently selected step. Here is the test case:

Create a tab with 4 workflow steps.
Click and highlight the 3rd workflow step.
Click on the 'Export' button and download the CSV or JSON file.

Expected Result: The CSV or JSON contains the results of all 4 workflow steps.

Actual Result: The CSV or JSON contains only the results of workflow steps 1, 2 and 3.

Resizing plot/chart area or hiding data for viz step

Is it possible to either increase the height of the chart area or hide the data grid underneath? Something like the resizing of column widths but horizontal.

No export button on charts?

I have a column chart that is public, but in the editor, I see no export button in the upper right hand corner. There is only the embed button. Am I missing a setting somewhere?

https://app.workbenchdata.com/workflows/7700

Thanks!

Mac Mojave 10.14.1 (18B75)
This is in both Safari Version 12.0.1 (14606.2.104.1.1) and
Chrome Version 70.0.3538.102 (Official Build) (64-bit)

Twitter query: what's the query length limit?

Hi,
I have built this query "(min_retweets:10) OR (min_faves:10) opendata OR ("open data") OR ("dati aperti") OR ("dati pubblici") OR ("dato aperto") OR ("dato pubblico") OR ("données publiques") OR ("Verwaltungsdaten festgelegt") OR ("offene Verwaltungsdaten") OR ("DATOS ABIERTOS") OR ("avoin data") OR "avoindata" OR "datosabiertos" OR DadesObertes OR ("DadesObertes") dadosabertos OR ("dados abertos")", but it does not work in workbenchdata.

If I use this shorter one "(min_retweets:10) OR (min_faves:10) opendata OR ("open data") OR ("dati aperti") OR ("dati pubblici") OR ("dato aperto") OR ("dato pubblico") OR ("données publiques") OR ("Verwaltungsdaten festgelegt") OR ("offene Verwaltungsdaten") OR ("DATOS ABIERTOS") OR ("avoin data") OR "avoindata"", it works.

What's the query length limit?

Thank you
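Until the limit is documented, one client-side workaround is to split a long OR query into several shorter ones and combine the results afterwards. The sketch below assumes a hypothetical `max_len`; the real limit is not stated in this issue. Filters such as `min_retweets:10` would have to be repeated in each query.

```python
def split_or_query(terms, max_len):
    """Group search terms into ' OR '-joined queries no longer than max_len."""
    queries, current = [], ""
    for term in terms:
        candidate = term if not current else f"{current} OR {term}"
        if len(candidate) <= max_len or not current:
            # Either the term fits, or it starts a new query on its own.
            current = candidate
        else:
            queries.append(current)
            current = term
    if current:
        queries.append(current)
    return queries


terms = ['opendata', '"open data"', '"dati aperti"', '"dati pubblici"']
queries = split_or_query(terms, max_len=30)
```

A single term longer than `max_len` is kept as its own query rather than dropped, so the caller can decide how to handle it.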

`bin/dev start` error prevents server from starting

When running bin/dev start it goes through a bunch of stuff and then fails and stops running on this line.

Pulling intercom-sink (gcr.io/workbenchdata-ci/cjw-intercom-sink:38d0dfbb2dc90a1c978b4bb80ea5b145cdfe3591)...
ERROR: Head "https://gcr.io/v2/workbenchdata-ci/cjw-intercom-sink/manifests/38d0dfbb2dc90a1c978b4bb80ea5b145cdfe3591": denied: Project workbenchdata-ci has been deleted.

Any ideas?

Fetch modules only work if `version_select` is configured

In order to make a new data loader I had to add this to my YAML config:

- id_name: version_select
  type: custom
  name: Update

This should be in the docs.

Allow a longer help-string for individual module parameters

It would be useful if I could add a longer help string for each individual parameter. It would make it much less likely a user has to jump out to separate documentation in order to understand how a module works. (Perhaps this could be displayed as a tooltip alongside the name.)

Unavailable file used for course input data

In lesson VIII. Make a chart of police stops by race., the file https://app.workbenchdata.com/public/moduledata/live/76723.csv cannot be found: it throws 503 Service Unavailable

Is there some tidy - wide to long - module?

Hi,
we often have wide tables (see the image below) that need to be "tidied" into a long format like this:

Anno	Key	Value
1966	Nazione 1	Italia
1966	Nazione 2	Algeria
...	...	...

It would be great to have a tidy module - both wide to long and long to wide - in workbenchdata.

Thank you
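For reference, the wide-to-long transformation requested above can be sketched in a few lines of plain Python. The shape of the wide input is an assumption, since the original image is not reproduced here; the output matches the Anno/Key/Value rows shown in the issue. In pandas, `melt` performs the same reshaping.

```python
def wide_to_long(rows, id_column, key_name="Key", value_name="Value"):
    """Turn each non-id column of each row into its own (id, key, value) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                long_rows.append(
                    {id_column: row[id_column], key_name: column, value_name: value}
                )
    return long_rows


# Assumed wide input: one row per year, one column per country slot.
wide = [{"Anno": 1966, "Nazione 1": "Italia", "Nazione 2": "Algeria"}]
long_rows = wide_to_long(wide, id_column="Anno")
```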

HTTP 503 when downloading CSVs

When a workflow has been changed but not yet rendered (so its Steps' cached render results don't exist or are stale), requests to GET /public/moduledata/live/:id.(csv|json) will return HTTP 503.

Steps to reproduce:

Create a workflow with a "Load HTML from URL" module
Point it to https://www.nytimes.com and set auto-refresh every 5min
Look up the "API endpoint" (/public/moduledata/live/:id.csv), and then close the browser window
Six minutes later, request data from the endpoint.

Expected results: you get new data
Actual results: HTTP 503 -- but if you retry a few seconds later, you'll get data.

The problem: Workbench renders processes in the background, and a GET request is in the foreground. If the workflow isn't rendered, we can't know when it will render.

This plays badly with auto-refreshes: when auto-refreshing a step, if the workflow has no steps with notifications enabled and nobody has a web client open to the workflow, Workbench skips rendering altogether. (It will only render on-demand.)

The Workbench-side workaround: when we return HTTP 503, we schedule another render of the workflow, in case it hasn't been scheduled yet.

There are two user-side workarounds:

Enable notifications on any step in the workflow. That will force a render every time data changes -- greatly reducing the amount of time a request would lead to an HTTP 503 response.
Configure the client to retry after 10-30s upon HTTP 503.

A better solution is to let users "turn on" API endpoints instead of supplying them implicitly. API endpoints should always host valid data -- even if it's stale.
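The second user-side workaround (retrying on HTTP 503) might look like the following sketch. The `fetch` callable is a stand-in for whatever HTTP client is in use, and the retry count and delay are illustrative defaults, not values recommended by Workbench beyond the 10-30s range mentioned above.

```python
import time


def get_with_retry(fetch, retries=3, delay_seconds=10, sleep=time.sleep):
    """Call fetch() until it stops returning HTTP 503 or retries run out.

    fetch is any callable returning a (status_code, body) tuple.
    """
    status, body = fetch()
    for _ in range(retries):
        if status != 503:
            break
        sleep(delay_seconds)  # give the background render time to finish
        status, body = fetch()
    return status, body
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.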

Add link 'Help' in UI

It would be great if you could link to the 'Help Center' in the workbenchdata UI. A menu link in the drop-down options menu could be one place for it.

I wish I had had access to these pages earlier; I only discovered them after chatting with support through the Intercom chat bot.

Feature request: add "no inference mode" to add from URL

Hi,
when I add a file from a URL, workbenchdata infers the field types. It's a great feature, but it sometimes gives wrong results.

In the example here (https://app.workbenchdata.com/workflows/17120), I import an XLS file and it maps the field "CODISTAT" as a number. That is a problem, because in the source XLS file it's a text field: in workbenchdata the value "001801" becomes "1801".

It would be great to have a "no inference" option in the module that imports all fields as text fields.

Thank you
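The effect described above is easy to reproduce: as long as a cell stays text its leading zeros survive, and they disappear the moment it is converted to a number. A small sketch with made-up CSV content (only the value "001801" comes from the issue):

```python
import csv
import io

# Hypothetical CSV content; "001801" is the value quoted above.
source = "CODISTAT,Comune\n001801,Example\n"

# csv.DictReader performs no type inference: every cell stays a string.
rows = list(csv.DictReader(io.StringIO(source)))
as_text = rows[0]["CODISTAT"]

# Converting to a number is what drops the leading zeros.
as_number = int(as_text)
```

In pandas, passing `dtype=str` to `read_csv` has the same "no inference" effect.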

docker frontend: No module named httpprocessproxy

Hi,

When trying to run the docker-compose I have the following error. I cannot find any reference to "httpprocessproxy" python module beside in docker-compose.yml which uses this command:
cjwkernel/setup-sandboxes.sh only-readonly && pipenv run python -m httpprocessproxy 0.0.0.0:8000 0.0.0.0:8080 --exclude ...

# docker-compose up frontend
cjworkbench_minio_1 is up-to-date
cjworkbench_rabbitmq_1 is up-to-date
cjworkbench_database_1 is up-to-date
Starting cjworkbench_frontend_1 ... done
Attaching to cjworkbench_frontend_1
frontend_1      | /root/.local/share/virtualenvs/app-4PlAip0Q/bin/python: No module named httpprocessproxy
cjworkbench_frontend_1 exited with code 1

Edit: host platform is Windows. I guess this is unsupported, works fine under Linux.

Multiply by number other than 1 fails

You can take a look at this workflow:
https://app.workbenchdata.com/workflows/17339/

In the tab1 I try to multiply a result by a value (1000) but it silently fails and the server response in the console is the following:

Message from server:  – "ValueError: Value 10000 is not a float" – Error: Server responded with error
Error: Server responded with error
construct
o — construct.js:30
t — wrapNativeSuper.js:26
t — WorkflowWebsocket.js:13
(anonyme Funktion) — WorkflowWebsocket.js:65

not much help I assume.
I tried to search the repo for the error but could not find anything. I tried to search for multiply or calculate and could not find something either.

Hope this helps.

Docker Image

Dear all,

I try to make a working deployment of workbench on a local server without success. The guides either for setting a development environment or for deploying with kubernetes are difficult to follow even for someone with DevOps skills.

Why don't you give us a working docker image for easy deployment?

Thanks in advance.

Suggested system resources

What would be the recommended specs for running CJW on a development machine?

Creating a user when using a docker image

Dear all,

I want to run and test cjworkbench for academic purposes. I run a docker image found here: https://hub.docker.com/r/cjworkbench/cjworkbench-main

but I can't login... is there any superuser already present there?
I run the "bin/dev python ./manage.py createsuperuser" command into the docker image successfully but I can't run the "bin/dev sql -c 'UPDATE account_emailaddress SET verified = TRUE'" to verify my user. It says "sql command not found". Is there anything else I can do? I found these commands here: https://github.com/CJWorkbench/cjworkbench/wiki/Setting-up-a-development-environment

Thanks in advance,
Lazaros Vrysis
M3C Research Group
m3c.web.auth.gr

Support for Frictionless Data specs

Reproducibility and traceability is clearly a prerogative of this project. Exporting data is one thing, but how about exporting metadata and workflows? I'm not sure if this is the right place to post module ideas, but here goes.

Has anyone looked into integrating Frictionless Data support, e.g. as an import format, to export column definitions (as Table Schema), generating a complete Data Package using datapackage-py, integrating with Goodtables for validation, or even making the JSON feed compatible with dataflows?

It looks like it would be straightforward to start by developing a Python module for data import, but I can't tell if it would be possible to export in that manner as well.

GMT on alerts

Maybe feature and not bug, but would be nice for alerts to give time in local time rather than GMT.

Your workflow Oversight.gov - Inspector General Scraper/Alerts has been updated with new data on Apr 11, 2019 at 2:57 PM. Its "scrapetable" module has new output. Arrived ~11:57 AM EST.

"Creating A Module" docs are in a confusing order

A bunch of the most critical information is in the Developing a Module section which is near the bottom.

Locale not correct on social login

In fact, this issue has two different cases

Account creation on social login: locale id is not set correctly, it's always the default one. We can easily fix this by overriding allauth's social account adapter (see commit a306b1d).
Social login with existing account: locale cookie is never written (because execution does not pass from our login view for normal logins), so the locale shown after login is the one that it was before login, instead of the one saved in user profile. Since we need a response on which to set the cookie and since each social provider has different urls and views, this can't be fixed equally easily.

Save raw source data in `fetch`

There are cases where the data source module may want to allow the user to specify special formatting instructions to be applied on the raw source data. Today, because the module's fetch function is expected to return a pd.DataFrame object, the module is forced to perform the formatting inside of fetch. This means that a user would have to re-fetch to apply the formatting instructions, which is counter-intuitive and wasteful of resources on all sides. One (ugly) workaround is to store some raw data in additional DataFrame columns that are read during render and cleaned up there.

Ideally, the fetch function would store the raw data, and the render function would have access to this raw data for formatting.
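The separation proposed above could be sketched as follows. The function names mirror the issue's `fetch` and `render`, but the signatures are hypothetical and are not Workbench's actual module API.

```python
import csv
import io


def fetch(download):
    """Store the source bytes untouched; download is any callable returning bytes."""
    return download()


def render(raw: bytes, delimiter: str = ","):
    """Parse the stored bytes with the user's formatting options, without re-fetching."""
    text = raw.decode("utf-8")
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


raw = fetch(lambda: b"a;b\n1;2\n")   # fetched once
first = render(raw)                  # default options
second = render(raw, delimiter=";")  # new options, same stored bytes
```

Changing the delimiter only re-runs `render`, so the source is not contacted again.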

Allow Workbench modules to have multiple Python modules

Currently any Workbench module can only have a single Python module. This makes it difficult to write well-factored and readable code. It would be great if we could have the entrypoint discovered either by magic module name (eg. main.py) or by having it documented in the YAML config.

Custom resource urls for workbench instance on local network

Congrats on an awesome tool. I really do hope that, despite the commercial shutdown, the open source project will be taken care of, it's just too good to go!

I tried to get the project up and running following

https://github.com/CJWorkbench/cjworkbench/wiki/Setting-up-a-development-environment

I deployed the project to a server machine in my local network.

Everything seems to run smoothly, except that all webpage resources point to localhost:8003 instead of the server's network IP. Hence, there are no scripts, no CSS and no images.
Where can this be fixed / configured?

Many thanks!

Auth messages aren't translated to other languages

Steps to reproduce:

Create a new account using your main email address
Log out
Log in via Google account -- on same email address

Expected results: templates/socialaccount/messages/account_connected.txt is translated into your language
Actual results: context["i18n"] is not set when trans_html is called.

I added a workaround to trans_html(): output the default language. This lowers the severity of the problem (it no longer causes a 500 error); but it's not a fix.

My hypothesis: the problem is in middleware. cjworkbench.middleware.i18n.SetCurrentLocaleMiddleware comes after django.contrib.auth.middleware.AuthenticationMiddleware. That's broadly correct because we can't determine the user's locale before we've looked up the user. But in this one scenario, i18n isn't set because we haven't logged the user in yet when we render the message.

Brainstorming a fix: how about we split our logic into two pieces of middleware:

cjworkbench.middleware.i18n.SetCurrentLocaleFromRequestAndSessionMiddleware (comes soon after django.contrib.sessions.middleware.SessionMiddleware)
cjworkbench.middleware.i18n.SetCurrentLocaleFromLoggedInUserMiddleware (comes soon after django.contrib.auth.middleware.AuthenticationMiddleware)
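The two-middleware split brainstormed above can be sketched in Django's function-based middleware style, without importing Django. Everything here is hypothetical: the `locale_id` attribute, the session key, and the default locale are assumptions for illustration, and the real ordering relative to Django's own middleware is simplified.

```python
from types import SimpleNamespace

DEFAULT_LOCALE = "en"


def locale_from_request_and_session(get_response):
    """Runs early: pick a locale from the session, else the default."""
    def middleware(request):
        request.locale_id = request.session.get("locale_id", DEFAULT_LOCALE)
        return get_response(request)
    return middleware


def locale_from_logged_in_user(get_response):
    """Runs after authentication: a logged-in user's saved locale wins."""
    def middleware(request):
        user = getattr(request, "user", None)
        if user is not None and user.is_authenticated:
            request.locale_id = user.locale_id
        return get_response(request)
    return middleware


# Compose the chain the way Django nests middleware: first listed runs first.
handler = locale_from_request_and_session(
    locale_from_logged_in_user(lambda request: request.locale_id)
)

# A request that has a session locale but no logged-in user yet.
anonymous = SimpleNamespace(
    session={"locale_id": "el"}, user=SimpleNamespace(is_authenticated=False)
)
print(handler(anonymous))
```

With this split, a locale is already set when a message is rendered before login completes, which is the scenario described in the hypothesis.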

@giorgio93p could you please take a look?

Request to include Geopandas

I am wondering if it is possible to make the 'GeoPandas' package available in the Python script workflow module.

I believe this might attract geospatial folks to use CJ Workbench.

My simple use case for GeoPandas is validating whether a column of coordinates falls within a given country. At the moment, you have to make multiple calls to external reverse geocoding systems to do such verification. With GeoPandas, it could be done in a couple of lines.

An added bonus is that GeoPandas could generate its own maps, which could perhaps be rendered in the reports section.

Locale switcher not visible in lessons when not logged in

The hamburger menu is hidden for users who are not logged in, so they have no access to the locale switcher.

Can't load HTML from series of pages

The 'Load HTML from URL' step doesn't work when used with a series of pages. I have already reported this via Intercom.

I have also attached an example test case:
https://app.workbenchdata.com/workflows/86008/

Missing functions

Hello to all and New Year's greetings!

I have installed the workbench tool on a local server, as described in this repository.

However, the functions are not available or visible when using Workbench. Is this functionality not available at all, or does it have to be installed separately?

Best regards,
LB

Add option to include MOE from CensusReporter

It's super-critical anytime anyone is working with ACS data that they understand that the data has inherent error. I see some evidence in the code this was intended to be supported, but just hasn't been fully implemented.

County name cleanup module

It would be very useful, for local data journalism, to have a module that cleans US county names and looks up their FIPS codes. Attached is a mockup of what this might look like.

dependency error

Trying to look at this on my home server but running into some dependency issues. Following your guide, I get these errors:

WebpackError at /workflows/
            ModuleNotFoundError in 
            Module not found: Error: Can't resolve 'chartbuilder-ui/dist/styles.css' in '/Users/acer9997/Projects/cjworkbench/cjworkbench/assets/js'

webpack-stats.json issue

I'm setting up the dev environment and followed the steps from here.

python manage.py runserver starts fine, but when I open 127.0.0.1:8000 it throws:

error reading /cjworkbench/webpack-stats.json. Are you sure webpack has generated the file and the path is correct?

I guess one of the steps needs a webpack setup that I haven't done. Can you please let me know how I can fix this?

Workflow becomes read only after several hours

Encountered three times (different workflows): create a workflow and try to edit it after several hours. It works as read only, can't edit or create a new tab (tab shows but can't move to it or rename it and disappears when browser is refreshed). Last workflow tested was last edited 15 hours back. Using Chrome on Win 10.
I can still start a new workflow. Further, I found if I go back to Workflow List (my workflows) and open the read only one, it becomes editable [failed to reproduce later, may be server or connectivity or load issues].

Module docs referenced "Import from GitHub" which doesn't seem to exist

Maybe I'm blind or maybe this feature was dropped, but I can't find it anywhere. Here's the relevant section of the docs.

Placeholder of custom module parameters

In cjwstate/modules/module_spec_schema.yaml, module parameters of type custom are defined as supporting a placeholder property. However, in cjwstate/models/param_specs.py, ParamSpecCustom does not inherit from _HasPlaceholder.

Warn users when they are at risk of overwriting a module

Hello, I forked one of your modules (regexextractor) and reimported it mindlessly into my Workbench session, forgetting to change its name in the json. Result: it looks like my version of the module has replaced the original one, and it seems a bit difficult to restore it.

Perhaps users should be informed when they are about to overwrite an existing module.

Request for group collaboration through 'team/organization/group' namespace

It would be great if Workbench could have a team/organization/groups feature. This would enable an organization to create a team and invite other members.

It would also enable an organization to track the private workflows that it uses. At the moment CJ Workbench is designed for individuals' workflows.

Logout destroys language setting

Steps to reproduce:

Log in
Choose Greek language
Log out

Expected results: things are still Greek
Actual results: back to English!

This is because when the user logs out, we destroy the session.

One solution is to use a separate cookie for language. This would fit in with our solution to #149, too. On #149 I suggested two different pieces of middleware running at two different times. Now I'm also suggesting two different cookies -- one per middleware.
