
cjworkbench / cjworkbench

The data journalism platform with built in training

Home Page: http://workbenchdata.com

License: Other

Python 54.65% JavaScript 28.73% HTML 10.26% Shell 0.68% Dockerfile 0.33% Thrift 0.22% Makefile 0.01% C 0.05% SCSS 5.06% HCL 0.03%
data-science data-visualization data-analysis journalism data-journalism notebook

cjworkbench's Introduction



Spreadsheet, meet automation.

Welcome to Workbench!

Workbench is a platform that helps you make sense of data tables. Code like a pro -- without code.

Features include:

  • Steps to download, HTML-scrape, clean, analyze and visualize data.
  • Steps to load tables from Google Drive, Twitter and APIs.
  • Emailed notifications when data changes.
  • An integrated data-journalism training course.
  • Undo, so you can't make mistakes -- only experiments.
  • Unlimited power, with custom Python and Excel-like formulas.

Try it

To see what Workbench does, run your own server.


User Documentation

Contributing

Workbench is licensed under the AGPL 3.0 license. You are free to use the code or parts of it in your own applications, even your own closed source applications. If you modify Workbench code or merge it into your own software, you must open-source the modifications.

Contact us

Always happy to hear from you:

We also welcome issue reports and pull requests :)

Credits

Workbench started as a project of Columbia Journalism School, made possible through the generous support of Krishna Bharat and the Knight Foundation.

cjworkbench's People

Contributors

adamhooper, alexgonca, anothercookiecrumbles, brandonrobertz, emrig, giorgio93p, harry-tc-zhang, hydrosquall, jstray, mgerring, panospapacharalampous, pierreconti, schleppguy, tfmorris, tjdharamsi, whelpley


cjworkbench's Issues

Warning Dialog while deleting Tabs

Please provide a warning when a user deletes a tab, with something like 'This action will remove the workflow tab. This action is not reversible.'

While the 'Undo/Redo' option can restore tabs, it would be good to warn users about such broad, sweeping changes.

64-bit integers are rounded before displaying

Steps to reproduce:

  1. Add a Twitter Step
  2. Duplicate the "id" column and convert it to text

Expected results: the Text "id" column and Number "id" column have the same values
Actual results: the Text "id" values are correct; the Number ones are incorrect

The problem: we send 64-bit integers as JSON, and the browser's parser converts them to Float and then converts them back to Integer to display them -- losing precision.

Solution: let's use a BigInt-compatible library for parsing JSON table data from the server.
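
The precision loss can be reproduced outside the browser. This is a minimal Python sketch, assuming the browser parses every JSON number as an IEEE 754 double (as JavaScript's JSON.parse does). The ID value is made up for illustration.

```python
import json

# A made-up 64-bit ID. It lies beyond 2**53, past which doubles can no
# longer represent every integer exactly.
tweet_id = 1234567890123456789

payload = json.dumps({"id": tweet_id})

# Python's json module keeps integers exact by default...
exact = json.loads(payload)["id"]

# ...but parsing integers as floats mimics what a double-only parser does.
rounded = int(json.loads(payload, parse_int=float)["id"])

print(exact == tweet_id)    # True
print(rounded == tweet_id)  # False: the low digits were rounded away

# One server-side workaround (an assumption, not the fix proposed above):
# send large integers as strings so that no parser can round them.
safe_payload = json.dumps({"id": str(tweet_id)})
```

Sending IDs as strings trades type information for safety, so the client would need to know which columns are really numbers.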

Problem with encoding and emoji

Hi,
when I use the Twitter module, I have no problem with emoji; they are correctly imported.

image

But if I use the export CSV link as a live link for another tab, there seems to be a problem.

image

This is an example workflow https://app.workbenchdata.com/workflows/18286/ (in the first tab the twitter import, in the second you can see the emoji encoding problem)

Thank you
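
One common cause of this symptom is UTF-8 bytes being decoded with a single-byte encoding somewhere between export and re-import. Whether that is what happens in Workbench is an assumption. The sketch below only shows how such garbling arises and how it can be reversed when no bytes were lost.

```python
original = "caffè 😀"

# Encode as UTF-8 (what a CSV export would typically contain)...
raw = original.encode("utf-8")

# ...then decode with the wrong codec. Latin-1 maps every byte to a
# character, so this never fails; it just produces garbage.
garbled = raw.decode("latin-1")

# Reversing the wrong step recovers the text, as long as nothing was lost.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled != original)   # True
print(repaired == original)  # True
```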

Storage.connect is not available on concrete implemetations

Hello, python novice here, experimenting with SQL Alchemy and tableschema.

Have this code:

import json
from pathlib import Path

from sqlalchemy import create_engine
from tableschema import Table


def builddb(maindir, typedir, schemadir):
    # Build one SQLite table per JSON schema file found in schemadir.
    dbname = maindir / 'data/mydb.sqlite'
    s_engine = 'sqlite:////.' + str(dbname)
    engine = create_engine(s_engine)

    for schema in schemadir.glob("*.json"):
        with open(schema) as f:
            schemadata = json.load(f)
        # The schema file names the CSV it describes.
        csvfile = typedir / schemadata["csv"]

        try:
            mytable = Table(csvfile, schema=schemadata)
            tablename = schemadata["tablename"]
            mytable.save(tablename, storage='sql', engine=engine)
        except Exception as error:
            print(error)

It always raises the error: "Storage.connect is not available on concrete implemetations"

In Storage.py, there is this code:
...
if cls is not Storage:
    message = 'Storage.connect is not available on concrete implemetations'
    raise exceptions.StorageError(message)

Out of curiosity I commented out this if statement and my database builds as expected. Am I doing something wrong in my implementation?

Thanks very much!

Track Changes with version history

When you are collaborating with multiple people in a workflow, it might happen that one person made a change that broke the workflow. It is not possible for the other collaborators to revert the workflow to its original state.

Hence it would be good to have a 'change history' where all the recent changes on the workflow are tracked. To revert to the previous workflow, you can select a change and press revert. The functionality could be similar to Google Version History

This is a request from our project co-coordinator in Thailand.

Uploader and paste functions not working

After properly running the install instructions here I was able to successfully create a new user and log in.

When I try to make a new workflow with either the CSV file uploader or the paste data function (clicking the little arrow), nothing happens. No JS errors, no python errors. The file uploading server debugging info shows the file uploader was working, but the main table on the page never actually changes to reflect the data — it just stays blank.

Screen Shot 2022-02-02 at 10 15 54 AM

There is also a continual loading icon at the bottom of the screen that never goes away. Not sure if this is related.

Question about Global Export

I am confused about the 'Export' button. Does the [Export] button in the top right-hand corner of each tab provide CSV and JSON endpoints that include all the workflow steps in the current tab?

However, if you actively select/highlight a single workflow step in the editor, the Export endpoint seems to publish the result of the currently selected step. Here is the test case:

  1. Create a tab with 4 workflow steps.
  2. Click and highlight the 3rd workflow step.
  3. Click on 'Export' button and download the CSV or JSON file

Expected Result: The CSV or JSON contains the results of all 4 workflow steps.

Actual Result: The CSV or JSON contains only the results of workflow steps 1, 2 and 3.

image

No export button on charts?

I have a column chart that is public, but in the editor, I see no export button in the upper right hand corner. There is only the embed button. Am I missing a setting somewhere?

https://app.workbenchdata.com/workflows/7700

Thanks!

Mac Mojave 10.14.1 (18B75)
This is in both Safari Version 12.0.1 (14606.2.104.1.1) and
Chrome Version 70.0.3538.102 (Official Build) (64-bit)

screenshot 2018-11-20 00 45 43

Twitter query: what's the query length limit?

`bin/dev start` error prevents server from starting

When running bin/dev start it goes through a bunch of stuff and then fails and stops running on this line.

Pulling intercom-sink (gcr.io/workbenchdata-ci/cjw-intercom-sink:38d0dfbb2dc90a1c978b4bb80ea5b145cdfe3591)...
ERROR: Head "https://gcr.io/v2/workbenchdata-ci/cjw-intercom-sink/manifests/38d0dfbb2dc90a1c978b4bb80ea5b145cdfe3591": denied: Project workbenchdata-ci has been deleted.

Any ideas?

Allow a longer help-string for individual module parameters

It would be useful if I could add a longer help string for each individual parameter. It would make it much less likely a user has to jump out to separate documentation in order to understand how a module works. (Perhaps this could be displayed as a tooltip alongside the name.)

Is there some tidy - wide to long - module?

Hi,
often we have this kind of table (the image below), which needs to be "tidied" into:

Anno  Key        Value
1966  Nazione 1  Italia
1966  Nazione 2  Algeria
...   ...        ...

It would be great to have a tidy module - both wide to long and long to wide - in workbenchdata.
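
For illustration, the wide-to-long direction could look like the sketch below. The wide input is a guess based on the long table above (the original table is only in the image), and wide_to_long is a hypothetical helper, not an existing Workbench module.

```python
def wide_to_long(rows, id_column):
    """Turn each non-id column of each row into its own (id, Key, Value) row."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_column:
                continue
            long_rows.append({id_column: row[id_column], "Key": key, "Value": value})
    return long_rows


# Hypothetical wide input, guessed from the long example above.
wide = [
    {"Anno": 1966, "Nazione 1": "Italia", "Nazione 2": "Algeria"},
]

for row in wide_to_long(wide, "Anno"):
    print(row)
```

In pandas, the same reshaping is available as DataFrame.melt (with id_vars, var_name and value_name), and DataFrame.pivot goes the other way.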

Thank you

image

HTTP 503 when downloading CSVs

When a workflow has been changed but not yet rendered (so its Steps' cached render results don't exist or are stale), requests to GET /public/moduledata/live/:id.(csv|json) will return HTTP 503.

Steps to reproduce:

  1. Create a workflow with a "Load HTML from URL" module
  2. Point it to https://www.nytimes.com and set auto-refresh every 5min
  3. Look up the "API endpoint" (/public/moduledata/live/:id.csv), and then close the browser window
  4. Six minutes later, request data from the endpoint.

Expected results: you get new data
Actual results: HTTP 503 -- but if you retry a few seconds later, you'll get data.

The problem: Workbench renders in background processes, while a GET request is handled in the foreground. If the workflow isn't rendered, we can't know when it will render.

This plays badly with auto-refreshes: when auto-refreshing a step, if the workflow has no steps with notifications enabled and nobody has a web client open to the workflow, Workbench skips rendering altogether. (It will only render on-demand.)

The Workbench-side workaround: when we return HTTP 503, we schedule another render of the workflow, in case it hasn't been scheduled yet.

There are two user-side workarounds:

  1. Enable notifications on any step in the workflow. That will force a render every time data changes -- greatly reducing the amount of time a request would lead to an HTTP 503 response.
  2. Configure the client to retry after 10-30s upon HTTP 503.
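
The second workaround can be sketched with the Python standard library. The function name, the default retry count and the 15-second delay are assumptions chosen to fall within the 10-30s range above. The opener and sleep parameters exist only so the logic can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=3, delay_seconds=15,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the response body, retrying only when the server answers HTTP 503."""
    for attempt in range(retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as error:
            # Give up on any other status, or when retries are exhausted.
            if error.code != 503 or attempt == retries:
                raise
            sleep(delay_seconds)
```

A client would call it with the workflow's /public/moduledata/live/:id.csv URL. Any status other than 503 is raised immediately, so real errors are not hidden by the retry loop.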

A better solution is to let users "turn on" API endpoints instead of supplying them implicitly. API endpoints should always host valid data -- even if it's stale.

Add link 'Help' in UI

It would be great if you could link to the 'Help Center' in the workbenchdata UI. A link in the drop-down options menu could be one place.

I wish I had had access to these pages earlier; I only discovered them after chatting with support through the Intercom chat bot.

Feature request: add "no inference mode" to add from URL

Hi,
when I add a file from a URL, workbenchdata infers the field types. It's a great feature, but it sometimes gives wrong results.

For example, here (https://app.workbenchdata.com/workflows/17120) I import an XLS file and the field "CODISTAT" is mapped as a number. That is a problem, because in the source XLS file it's a text field, so in workbenchdata the value "001801" becomes "1801".

It would be great to have an option in the module to have "no inference", and have all fields as text field.
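
The effect can be shown with a small stand-alone sketch. It uses CSV rather than XLS to stay within the standard library. The NOME column, the sample row and the read_rows helper are invented for illustration, and the inference rule is deliberately naive rather than Workbench's actual one.

```python
import csv
import io

# A tiny stand-in for the imported file; CODISTAT is a code, not a quantity.
source = "CODISTAT,NOME\n001801,Example\n"


def read_rows(text, infer_types):
    """Read CSV text; with infer_types=False every cell stays a string."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if infer_types:
            # Naive inference: anything that looks like an integer becomes one.
            row = {k: int(v) if v.isdigit() else v for k, v in row.items()}
        rows.append(row)
    return rows


print(read_rows(source, infer_types=True)[0]["CODISTAT"])   # 1801
print(read_rows(source, infer_types=False)[0]["CODISTAT"])  # 001801
```

In pandas, passing dtype=str to read_csv has the same "everything is text" effect.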

Thank you

docker frontend: No module named httpprocessproxy

Hi,

When trying to run the docker-compose I have the following error. I cannot find any reference to "httpprocessproxy" python module beside in docker-compose.yml which uses this command:
cjwkernel/setup-sandboxes.sh only-readonly && pipenv run python -m httpprocessproxy 0.0.0.0:8000 0.0.0.0:8080 --exclude ...

# docker-compose up frontend
cjworkbench_minio_1 is up-to-date
cjworkbench_rabbitmq_1 is up-to-date
cjworkbench_database_1 is up-to-date
Starting cjworkbench_frontend_1 ... done
Attaching to cjworkbench_frontend_1
frontend_1      | /root/.local/share/virtualenvs/app-4PlAip0Q/bin/python: No module named httpprocessproxy
cjworkbench_frontend_1 exited with code 1

Edit: host platform is Windows. I guess this is unsupported, works fine under Linux.

Multiply by number other than 1 fails

You can take a look at this workflow:
https://app.workbenchdata.com/workflows/17339/

In the tab1 I try to multiply a result by a value (1000) but it silently fails and the server response in the console is the following:

Message from server:  – "ValueError: Value 10000 is not a float" – Error: Server responded with error
Error: Server responded with error
construct
o — construct.js:30
t — wrapNativeSuper.js:26
t — WorkflowWebsocket.js:13
(anonyme Funktion) — WorkflowWebsocket.js:65

not much help I assume.
I tried to search the repo for the error but could not find anything. I tried to search for multiply or calculate and could not find something either.

Hope this helps.

Docker Image

Dear all,

I try to make a working deployment of workbench on a local server without success. The guides either for setting a development environment or for deploying with kubernetes are difficult to follow even for someone with DevOps skills.

Why don't you give us a working docker image for easy deployment?

Thanks in advance.

Creating a user when using a docker image

Dear all,

I want to run and test cjworkbench for academic purposes. I run a docker image found here: https://hub.docker.com/r/cjworkbench/cjworkbench-main

but I can't login... is there any superuser already present there?
I run the "bin/dev python ./manage.py createsuperuser" command into the docker image successfully but I can't run the "bin/dev sql -c 'UPDATE account_emailaddress SET verified = TRUE'" to verify my user. It says "sql command not found". Is there anything else I can do? I found these commands here: https://github.com/CJWorkbench/cjworkbench/wiki/Setting-up-a-development-environment

Thanks in advance,
Lazaros Vrysis
M3C Research Group
m3c.web.auth.gr

Support for Frictionless Data specs

Reproducibility and traceability are clearly priorities of this project. Exporting data is one thing, but how about exporting metadata and workflows? I'm not sure if this is the right place to post module ideas, but here goes.

Has anyone looked into integrating Frictionless Data support, e.g. as an import format, to export column definitions (as Table Schema), generating a complete Data Package using datapackage-py, integrating with Goodtables for validation, or even making the JSON feed compatible with dataflows?

It looks like it would be straightforward to start by developing a Python module for data import, but I can't tell if it would be possible to export in that manner as well.

GMT on alerts

Maybe feature and not bug, but would be nice for alerts to give time in local time rather than GMT.

Your workflow Oversight.gov - Inspector General Scraper/Alerts has been updated with new data on Apr 11, 2019 at 2:57 PM. Its "scrapetable" module has new output. Arrived ~11:57 AM EST.

Locale not correct on social login

In fact, this issue has two different cases

  1. Account creation on social login: locale id is not set correctly, it's always the default one. We can easily fix this by overriding allauth's social account adapter (see commit a306b1d).
  2. Social login with existing account: locale cookie is never written (because execution does not pass from our login view for normal logins), so the locale shown after login is the one that it was before login, instead of the one saved in user profile. Since we need a response on which to set the cookie and since each social provider has different urls and views, this can't be fixed equally easily.

Save raw source data in `fetch`

There are cases where the data source module may want to allow the user to specify special formatting instructions to be applied on the raw source data. Today, because the module's fetch function is expected to return a pd.DataFrame object, the module is forced to perform the formatting inside of fetch. This means that a user would have to re-fetch to apply the formatting instructions, which is counter-intuitive and wasteful of resources on all sides. One (ugly) workaround is to store some raw data in additional DataFrame columns that are read during render and cleaned up there.

Ideally, the fetch function would store the raw data, and the render function would have access to this raw data for formatting.
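
A rough sketch of that split, under the assumption that fetch may return raw bytes and that render receives them along with the user's parameters. Both function signatures are hypothetical and do not match Workbench's current module interface.

```python
import csv
import io


def fetch():
    """Hypothetical fetch step: return the source bytes untouched."""
    # A real module would download this; the bytes are hard-coded here.
    return b"name;score\nAda;10\nLin;7\n"


def render(raw, delimiter):
    """Hypothetical render step: apply the user's formatting choice to raw data."""
    text = raw.decode("utf-8")
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


raw = fetch()                       # done once
wrong = render(raw, delimiter=",")  # user's first guess
right = render(raw, delimiter=";")  # corrected without fetching again
print(right)
```

Because the raw bytes are kept, changing a formatting parameter only re-runs render.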

Allow Workbench modules to have multiple Python modules

Currently any Workbench module can only have a single Python module. This makes it difficult to write well-factored and readable code. It would be great if we could have the entrypoint discovered either by magic module name (eg. main.py) or by having it documented in the YAML config.

Custom resource urls for workbench instance on local network

Congrats on an awesome tool. I really do hope that, despite the commercial shutdown, the open source project will be taken care of, it's just too good to go!

I tried to get the project up and running following

https://github.com/CJWorkbench/cjworkbench/wiki/Setting-up-a-development-environment

I deployed the project to a server machine in my local network.

Everything seems to run smoothly, except that all webpage resources point to localhost:8003 instead of the server's network IP. Hence, there are no scripts, no CSS and no images.
Where can this be fixed / configured?

Many thanks!

Auth messages aren't translated to other languages

Steps to reproduce:

  1. Create a new account using your main email address
  2. Log out
  3. Log in via Google account -- on same email address

Expected results: templates/socialaccount/messages/account_connected.txt is translated into your language
Actual results: context["i18n"] is not set when trans_html is called.

I added a workaround to trans_html(): output the default language. This lowers the severity of the problem (it no longer causes a 500 error); but it's not a fix.

My hypothesis: the problem is in middleware. cjworkbench.middleware.i18n.SetCurrentLocaleMiddleware comes after django.contrib.auth.middleware.AuthenticationMiddleware. That's broadly correct because we can't determine the user's locale before we've looked up the user. But in this one scenario, i18n isn't set because we haven't logged the user in yet when we render the message.

Brainstorming a fix: how about we split our logic into two pieces of middleware:

  1. cjworkbench.middleware.i18n.SetCurrentLocaleFromRequestAndSessionMiddleware (comes soon after django.contrib.sessions.middleware.SessionMiddleware)
  2. cjworkbench.middleware.i18n.SetCurrentLocaleFromLoggedInUserMiddleware (comes soon after django.contrib.auth.middleware.AuthenticationMiddleware)
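
To make the idea concrete, here is a framework-free sketch of the two-stage ordering. It follows Django's middleware calling convention (a callable built from get_response) but imports nothing from Django. The class names are shortened, and the locale_id attribute, the session key and the default locale are assumptions, not existing Workbench code.

```python
from types import SimpleNamespace

DEFAULT_LOCALE = "en"


class SetLocaleFromRequestAndSession:
    """Runs early: picks a locale from the session, before any user is known."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        request.locale_id = request.session.get("locale_id", DEFAULT_LOCALE)
        return self.get_response(request)


class SetLocaleFromLoggedInUser:
    """Runs after authentication: a logged-in user's saved locale wins."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        user_locale = getattr(request.user, "locale_id", None)
        if user_locale:
            request.locale_id = user_locale
        return self.get_response(request)


def view(request):
    # Stand-in for rendering a message: the locale is always set by now.
    return request.locale_id


# The outermost middleware runs first, as in Django's MIDDLEWARE ordering.
app = SetLocaleFromRequestAndSession(SetLocaleFromLoggedInUser(view))

anonymous = SimpleNamespace(session={"locale_id": "el"}, user=None)
print(app(anonymous))  # el
```

With this ordering, a message rendered before login still has a locale, and the user's saved preference overrides it once authentication has run.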

@giorgio93p could you please take a look?

Request to include Geopandas

I'm wondering whether the 'GeoPandas' package could be made available in the Python script workflow module.

I believe this might attract geospatial folks to CJ Workbench.

A simple use case for GeoPandas is validating whether a column of coordinates falls within a given country. At the moment, you have to make multiple calls to external reverse-geocoding systems for such verification. With GeoPandas it could be done in a couple of lines.

An added bonus is that GeoPandas can generate its own maps, which could perhaps be rendered in the reports section.

Missing functions

Hello to all and New Year's greetings!

I have installed the workbench tool on a local server, as described in this repository.

However, the functions are not available or visible when using Workbench. Is all this functionality not available at all, or does it have to be installed separately?

Best regards,
LB

Add option to include MOE from CensusReporter

It's super-critical anytime anyone is working with ACS data that they understand that the data has inherent error. I see some evidence in the code this was intended to be supported, but just hasn't been fully implemented.

County name cleanup module

It would be very useful, for local data journalism, to have a module that cleans US county names and looks up their FIPS codes. Attached is a mockup of what this might look like.
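
The core of such a module might be a normalize-then-look-up step like the sketch below. The function names and the normalization rules are assumptions. The three FIPS codes are real, but a real module would load the full Census list.

```python
import re

# Three real county FIPS codes, keyed by (state, normalized county name).
FIPS = {
    ("IL", "cook"): "17031",
    ("CA", "los angeles"): "06037",
    ("NY", "kings"): "36047",
}


def normalize_county(name):
    """Lowercase, collapse whitespace, and drop a trailing 'County' / 'Co.' suffix."""
    name = re.sub(r"\s+", " ", name.strip().lower())
    return re.sub(r"\s+(county|co\.?)$", "", name)


def lookup_fips(state, county):
    """Return the 5-digit FIPS code as text, or None when there is no match."""
    return FIPS.get((state.upper(), normalize_county(county)))


print(lookup_fips("il", "  Cook County "))   # 17031
print(lookup_fips("CA", "Los Angeles Co."))  # 06037
```

Codes are kept as text so that leading zeros, as in 06037, survive.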
unnamed

dependency error

Trying to look at this on my home server but running into some dependency issues. Following your guide, I get these errors:

WebpackError at /workflows/
            ModuleNotFoundError in 
            Module not found: Error: Can't resolve 'chartbuilder-ui/dist/styles.css' in '/Users/acer9997/Projects/cjworkbench/cjworkbench/assets/js'

webpack-stats.json issue

hi

I'm setting up the dev environment and followed the steps from here.

python manage.py runserver works fine, but upon hitting 127.0.0.1:8000 it throws:

error reading /cjworkbench/webpack-stats.json. Are you sure webpack has generated the file and the path is correct?

I guess one of the steps needs webpack set up and I haven't done that. Can you please let me know how I can fix this?

Workflow becomes read only after several hours

Encountered three times (different workflows): create a workflow and try to edit it after several hours. It works as read only, can't edit or create a new tab (tab shows but can't move to it or rename it and disappears when browser is refreshed). Last workflow tested was last edited 15 hours back. Using Chrome on Win 10.
I can still start a new workflow. Further, I found if I go back to Workflow List (my workflows) and open the read only one, it becomes editable [failed to reproduce later, may be server or connectivity or load issues].

Warn users when they are at risk of overwriting a module

Hello, I forked one of your modules (regexextractor) and reimported it mindlessly into my Workbench session, forgetting to change its name in the json. Result: it looks like my version of the module has replaced the original one, and it seems a bit difficult to restore it.

screenshot-app workbenchdata com-2019 02 11-14-46-27

Perhaps users should be informed when they are about to overwrite an existing module.

Logout destroys language setting

Steps to reproduce:

  1. Log in
  2. Choose Greek language
  3. Log out

Expected results: things are still Greek
Actual results: back to English!

This is because when the user logs out, we destroy the session.

One solution is to use a separate cookie for language. This would fit in with our solution to #149, too. On #149 I suggested two different pieces of middleware running at two different times. Now I'm also suggesting two different cookies -- one per middleware.
