
csv-reconcile's People

Contributors

1-byte, b2m, dependabot[bot], gitonthescene, tfmorris


csv-reconcile's Issues

Extending columns showing id and not name

When using "Add columns from reconciled values", the list to choose from shows the db column names rather than the original CSV column names. Showing the original CSV column names would look cleaner.

Problem with CSV dialect sniffing

I tested the new behaviour for CSV dialect sniffing introduced for #41 in #42 and discovered the following problems:

  1. The csv.Sniffer().sniff() method throws a "Could not determine delimiter" error when it is unsure about the delimiter.
  2. I cannot override this behaviour via the CSVKWARGS configuration because it is applied later.

dialect = csv.Sniffer().sniff(csvfile.read(1024))

csv.Sniffer() is unsure about the delimiter because a fixed-size chunk of the CSV file may end in the middle of a line, so the number of delimiters in that last line is off.

Working example with whole file:

import csv
csv.Sniffer().sniff("a,b,c\n1,2,3")

Problematic example with only part of the file (throws error):

import csv
csv.Sniffer().sniff("a,b,c\n1,2")

So I would recommend using dialect sniffing only (or additionally?) when the user has not given explicit instructions on the dialect via CSVKWARGS, and using csvfile.readline() so that no line is cut off partway.
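
A minimal sketch of that recommendation. The helper name `open_reader`, the `csvkwargs` dict standing in for the CSVKWARGS configuration, and the fallback to `csv.excel` when sniffing fails are all assumptions made for illustration, not code taken from csv-reconcile.

```python
import csv
import io


def open_reader(csvfile, csvkwargs=None, sample_lines=20):
    """Return a csv.reader, sniffing the dialect only when no explicit
    keyword arguments were supplied (hypothetical helper)."""
    if csvkwargs:
        # Explicit user settings win; no sniffing at all.
        return csv.reader(csvfile, **csvkwargs)

    # Read whole lines so the sample never ends in the middle of a row.
    sample = "".join(csvfile.readline() for _ in range(sample_lines))
    csvfile.seek(0)

    try:
        dialect = csv.Sniffer().sniff(sample)
    except csv.Error:
        # Assumption: fall back to the default comma dialect.
        dialect = csv.excel
    return csv.reader(csvfile, dialect)


if __name__ == "__main__":
    reader = open_reader(io.StringIO("a\tb\tc\n1\t2\t3\n"))
    print(next(reader))
```

Reading by line trades a fixed byte budget for a fixed row budget, so very long rows make the sample larger rather than truncated.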

localhost:5000/reconcile not displaying properly

After executing the following commands:

[Screenshot: Screen Shot 2022-05-03 at 2 23 26 PM]

http://127.0.0.1:5000/reconcile displays as

[Screenshot: Screen Shot 2022-05-03 at 2 24 29 PM]

The following commands did not result in an error message and all executed successfully. What could the issue be?

I don't believe this is the issue described in #41. To test, I created a config file with the line "CSVENCODING = " followed by the encoding of the file, and localhost:5000/reconcile did not change.

Any suggestions would be greatly appreciated! Thanks!

Thank You!!

Please do feel free to close this :) I just wanted to drop a note about how pleasantly easy this was to use to join a few columns into my tables - thanks for the clear docs and easy install!!

CSV sniffer needs more data

Hello, I have a TSV file whose lines have the following character counts (the first line is the header):

155
130
656
416
707
950
526
753
186
731
...

csv-reconcile init gives me the following error:

$ poetry run csv-reconcile init test7.tsv col1_name col2_name

...
File "/home/me/src/csv-reconcile/csv_reconcile/initdb.py", line 88, in init_db
searchidx = header.index(searchcol)
^^^^^^^^^^^^^^^^^^^^^^^
ValueError: 'col2_name' is not in list

The error goes away if I change the amount of data fed to the sniffer on this line:
dialect = csv.Sniffer().sniff(csvfile.read(10240))

where I changed the previous value of 1024 to be 10240.
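
A small sketch of a complementary safeguard: restrict the sniffer to plausible delimiters and check the requested columns right after reading the header, so that a mis-sniffed dialect produces a descriptive message instead of a bare "is not in list". The function name `read_header` and the candidate delimiter set are assumptions for illustration; `delimiters` is an existing parameter of `csv.Sniffer().sniff()`.

```python
import csv
import io


def read_header(csvfile, required, delimiters="\t,;|", sample_size=10240):
    """Sniff the dialect, read the header row and verify that the
    required column names are present (hypothetical helper)."""
    sample = csvfile.read(sample_size)
    csvfile.seek(0)

    # Only consider the listed characters as candidate delimiters.
    dialect = csv.Sniffer().sniff(sample, delimiters=delimiters)
    header = next(csv.reader(csvfile, dialect))

    missing = [col for col in required if col not in header]
    if missing:
        raise ValueError(
            f"columns {missing} not found; sniffed delimiter "
            f"{dialect.delimiter!r} produced header {header}"
        )
    return header


if __name__ == "__main__":
    data = io.StringIO("id\tname\n1\tJohn Doe\n2\tJane Roe\n")
    print(read_header(data, ["id", "name"]))
```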

The terminal message shown after starting the csv-reconcile reconciliation service has an incorrect service URL

When a reconciliation service is started from a CSV file with csv-reconcile in the terminal, it prints the message "Running on http://127.0.0.1:5000/". The URL http://127.0.0.1:5000/ gives an error when we click it or try to add it as a service in OpenRefine; the actual service URL is given in the instructions on the csv-reconcile website. Clicking the actual service URL leads to the service manifest, and that URL can be added as a service in OpenRefine successfully.
The URL in the message should redirect to the service URL (http://127.0.0.1:5000/reconcile), which would be very convenient for users.
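
One way such a redirect could look, sketched as a plain WSGI wrapper using only the standard library. The name `redirect_root` is hypothetical and this is not how csv-reconcile is implemented; it only illustrates the requested behaviour of sending requests for "/" to "/reconcile".

```python
def redirect_root(wsgi_app, target="/reconcile"):
    """Wrap a WSGI application so that requests for the root path are
    redirected to the service URL (hypothetical helper)."""

    def middleware(environ, start_response):
        if environ.get("PATH_INFO", "/") in ("", "/"):
            # Point the client at the real service endpoint.
            start_response(
                "302 Found",
                [("Location", target), ("Content-Length", "0")],
            )
            return [b""]
        # Every other path is handled by the wrapped application.
        return wsgi_app(environ, start_response)

    return middleware
```

Since a Flask application exposes its WSGI callable as `app.wsgi_app`, a wrapper of this kind would typically be attached with `app.wsgi_app = redirect_root(app.wsgi_app)`.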

Install instructions

I am a bit confused by the install instructions in the readme, which currently are (after cloning the repository):

$ python -m venv venv                                             # create virtualenv
$ venv/bin/pip install csv-reconcile                              # install package
$ source venv/bin/activate                                        # activate virtual environment
(venv) $ csv-reconcile --init-db sample/reps.tsv item itemLabel   # start the service
(venv) $ deactivate                                               # remove virtual environment

It seems to me that this installs csv-reconcile from PyPI (using the latest release) rather than from the code contained in the repository. What are the steps to run the code directly instead?

Generally speaking I expect the following workflow (but perhaps that's old-school!):

$ python -m venv venv
$ source venv/bin/activate     # activate the virtualenv before installing anything
$ pip install -r requirements.txt  # install dependencies
$ python setup.py install # install csv-reconcile itself
$ csv-reconcile …

Reorganize command line options to accommodate running in a container

When running from a container, we should initialize the database when creating the container and simply run the service when running the container. The current command line interface obscures this distinction. Deprecate current syntax for something more accommodating. Namely, use sub-commands to "init" the database and "serve" the data. Keep a deprecated "run" sub-command which mimics the current behavior.
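
The proposed layout can be sketched with the standard library's argparse. The tracebacks elsewhere on this page show that csv-reconcile itself uses click, so argparse here is a stand-in, and the argument names and help texts are assumptions for illustration only.

```python
import argparse


def build_parser():
    """Build a parser with 'init', 'serve' and deprecated 'run'
    sub-commands (illustrative sketch, not the project's CLI code)."""
    parser = argparse.ArgumentParser(prog="csv-reconcile")
    sub = parser.add_subparsers(dest="command", required=True)

    # init: build the database once, e.g. while creating the container.
    init = sub.add_parser("init", help="initialize the database from a CSV file")
    init.add_argument("csvfile")
    init.add_argument("idcol")
    init.add_argument("namecol")

    # serve: only run the service against an existing database.
    sub.add_parser("serve", help="serve an already initialized database")

    # run: deprecated, mimics the old behaviour of init followed by serve.
    run = sub.add_parser("run", help="deprecated: initialize, then serve")
    run.add_argument("csvfile")
    run.add_argument("idcol")
    run.add_argument("namecol")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args(["init", "sample/reps.tsv", "item", "itemLabel"]))
```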

Make preview service opt-in

Per the discussion in #28, make the preview service opt-in by simply pointing the manifest at it, like so:

    "preview": {
       "url": "http://localhost:5000/preview/{{id}}",
       "width": 400,
       "height": 300
    }
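
As a rough illustration of what opt-in could mean in code, the manifest could gain the "preview" entry only when a flag is set. The function `build_manifest`, its `with_preview` flag and the placeholder "name" field are hypothetical; only the "preview" structure comes from the snippet above.

```python
import json


def build_manifest(base_url, with_preview=False):
    """Return a service manifest dict; the preview entry is added
    only on request (hypothetical helper)."""
    # Placeholder field; a real manifest carries more information.
    manifest = {"name": "CSV Reconcile"}
    if with_preview:
        manifest["preview"] = {
            # {{id}} is left literal; the client substitutes the entity id.
            "url": base_url + "/preview/{{id}}",
            "width": 400,
            "height": 300,
        }
    return manifest


if __name__ == "__main__":
    print(json.dumps(build_manifest("http://localhost:5000", with_preview=True), indent=2))
```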

When used under Windows with OpenRefine 3.6.1 the service fails - The endpoint MUST return a JSON document describing the service, accessible via CORS or JSONP.

When used under Windows with OpenRefine 3.6.1 the service fails.
If I try using the OpenRefine Test Bench tab I get the following error: "The endpoint MUST return a JSON document describing the service, accessible via CORS or JSONP."

The output from the service from the test is:

* Serving Flask app 'csv-reconcile' (lazy loading)
* Environment: production
  WARNING: This is a development server. Do not use it in a production deployment.
  Use a production WSGI server instead.
* Debug mode: off
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [07/Sep/2022 21:57:05] "OPTIONS / HTTP/1.1" 404 -
127.0.0.1 - - [07/Sep/2022 21:57:05] "GET / HTTP/1.1" 404 -
127.0.0.1 - - [07/Sep/2022 21:57:05] "OPTIONS /?callback=jsonp_1662580625915_97939 HTTP/1.1" 404 -
127.0.0.1 - - [07/Sep/2022 21:57:05] "GET /?callback=jsonp_1662580625915_97939 HTTP/1.1" 404 -

Might the OpenRefine protocol have changed?

ValueError: 'item' is not in list

I was able to set up and run csv-reconcile serve, but I cannot run the example on the reps.tsv file: I get ValueError: 'item' is not in list. Similarly, when I try the progressives.tsv file I get ValueError: 'itemLabel' is not in list. The errors are otherwise identical except for the last few lines. I have tried restarting everything and cannot get the init step to work before running the serve command. Any suggestions would be appreciated.

Last few lines of the error for reps.tsv:

  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 68, in init_db
    ididx = header.index(idcol)
ValueError: 'item' is not in list

Last few lines of the error for progressives.tsv:

  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 67, in init_db
    searchidx = header.index(searchcol)
ValueError: 'itemLabel' is not in list

The full error for reps.tsv:

(venv) C:\Users\jennifer.woodward\OneDrive - USDA\myGitOneDrive\NALT4MA\csv-reconcile>csv-reconcile init sample/reps.tsv item itemLabel
Traceback (most recent call last):
  File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\jennifer.woodward\OneDrive - USDA\myGitOneDrive\NALT4MA\csv-reconcile\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 321, in main
    return cli()
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 271, in init
    return doinit(config, scorerOption, csvfile, idcol, namecol)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 259, in doinit
    initdb.init_db_with_context(csvfile, idcol, namecol)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 95, in init_db_with_context
    return init_db(db,
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 68, in init_db
    ididx = header.index(idcol)
ValueError: 'item' is not in list

Csv-reconcile geo does not suggest some close objects after OpenRefine reconciliation

I have the following object:

Point(14.6152142 50.0812828)

I am reconciling it to the outcome of this query:

https://w.wiki/3CDn

The correct match is Q64816168.
However, that suggestion does not appear in the top ten. Any ideas why? Am I missing something obvious?

Steps to replicate:

  1. insert new project from clipboard with only this content:
    Point(14.6152142 50.0812828)
  2. start a reconciliation service as I just documented in #3 , using query.tsv from the query above
  3. reconcile the column using that service - top hit is Q64816166, which is close, but farther than Q64816168
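
To check by hand which candidate is really closer, the points can be parsed from well-known text and compared by great-circle distance. This is a generic haversine sketch with hypothetical function names; it makes no claim about how csv-reconcile-geo computes its scores.

```python
import math
import re

_POINT = re.compile(
    r"\s*point\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*",
    re.IGNORECASE,
)


def parse_wkt_point(text):
    """Parse 'Point(lon lat)' into a (lon, lat) tuple of floats."""
    match = _POINT.fullmatch(text)
    if match is None:
        raise ValueError(f"not a WKT point: {text!r}")
    return float(match.group(1)), float(match.group(2))


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lon, lat) points,
    assuming a spherical Earth with a radius of 6371 km."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


if __name__ == "__main__":
    print(parse_wkt_point("Point(14.6152142 50.0812828)"))
```

Computing `haversine_km` from the query point to the coordinates of each candidate in query.tsv would show whether the ranking returned by the service matches the true distances.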

Continuous integration?

I see there is a test suite already, so perhaps it would be worth running it in a continuous integration service?

As a side effect, this is a good way to document the install process on a stock machine (since you have to script it for the CI). I actually looked for the CI configuration files as a way to solve my install problems (#8).

Provide Preview Service

I am reconciling against local CSV files with ambiguous data in the search column and without ids from external systems (Wikidata...).

ID   Search Column   Additional Data
1    John Doe        1970-01-01
2    John Doe        2020-01-01
...  ...             ...

To identify the correct match from the Reconciliation API I have to manually check the proposed results against the data in the CSV files. To speed this process up I would prefer to use a Preview Service as defined in the Reconciliation Service API Specification.
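
A sketch of what the body of such a preview could contain: a small HTML fragment listing every column of the matched row, so that the two "John Doe" rows above can be told apart by their additional data. The function name `render_preview` and the markup are assumptions, not part of csv-reconcile or of the specification.

```python
import html


def render_preview(row):
    """Render one CSV row (a dict of column name to value) as a small
    HTML page suitable for a preview pane (hypothetical helper)."""
    items = "".join(
        f"<dt>{html.escape(str(key))}</dt><dd>{html.escape(str(value))}</dd>"
        for key, value in row.items()
    )
    return f"<html><body><dl>{items}</dl></body></html>"


if __name__ == "__main__":
    print(render_preview({"ID": 1, "Search Column": "John Doe", "Additional Data": "1970-01-01"}))
```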

Incorrect encoding detection

Hello,

First of all, let me thank you for this great reconciliation tool!

I've been trying to use csv-reconcile with the csv-reconcile-geo plugin, and I am not sure where the error comes from, so feel free to direct me elsewhere if the problem is not on your side.

The problem was that the "budovy_wdqs.tsv" file I was providing used UTF-8 encoding, while the program apparently expected it to be in cp1250 for some reason. When I re-saved the .tsv in cp1250, the bug went away.

(venv) C:\...\csv-reconcile [master ≡ +4 ~0 -0 !]> csv-reconcile --init-db budovy_wdqs.tsv item coords --scorer geo --config config.txt
Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
    exec(code, run_globals)
  File "C:\...\csv-reconcile\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 195, in main
    initdb.init_db_with_context()
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 90, in init_db_with_context
    return init_db(db,
  File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 58, in init_db
    header = next(reader)
  File "C:\Python310\lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 2094: character maps to <undefined>
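
For context, Python's open() uses a locale-dependent default encoding when none is passed, which would be consistent with cp1250 showing up in the traceback above. A minimal sketch of reading the header with an explicit encoding follows; the function name `read_tsv_header` is hypothetical and the snippet is not taken from csv-reconcile.

```python
import csv


def read_tsv_header(path, encoding="utf-8"):
    """Read the header row of a TSV file using an explicit encoding
    rather than the platform default (hypothetical helper)."""
    # newline="" is what the csv module documentation recommends for files.
    with open(path, newline="", encoding=encoding) as handle:
        return next(csv.reader(handle, delimiter="\t"))
```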

Setting up csv-reconcile-geo

FWIW, if you don't mind running your own reconciliation service, I've just written a geo scoring plugin for csv-reconcile.

With this you could, say, run a SPARQL query to find the coordinate locations of the points you're looking to match against, export the result as a TSV file, and use that to run csv-reconcile.

You can get the service up and running as simply as the following:

$ python -m venv serverenv
$ source serverenv/bin/activate
$ python -m pip install csv-reconcile
$ python -m pip install csv-reconcile-geo
$ csv-reconcile --init-db query.tsv item coord --scorer geo 

Here item is the name of the column containing the QIDs and coord is the name of the coordinate column in well-known text format, the default export format for coordinates.

This was just my first pass at it. There's certainly room for improvement, but it may suit your immediate needs.

@gitonthescene Could you please assist me with this? I am a bit disoriented and I am not sure I understand the overall idea of 'my own' reconciliation service correctly. Am I right in assuming that I need to load File number 1 into OpenRefine, load File number 2 on the command line via the commands above, add the reconciliation service "http://127.0.0.1:5000/reconcile" to OpenRefine, and reconcile?

I think I was able to start virtualenv on my system (I am on Windows and "source" did not work, but I think I was able to find a solution at https://stackoverflow.com/questions/8921188/issue-with-virtualenv-cannot-activate) and then I was able to install csv-reconcile and csv-reconcile-geo. However, this is what I get when I run the program:

(venv) C:\Users\vojte\Downloads>csv-reconcile --init-db query.tsv item coord --scorer geo
c:\users\vojte\venv\lib\site-packages\normality\__init__.py:72: ICUWarning: Install 'pyicu' for better text transliteration.
  text = ascii_text(text)
Traceback (most recent call last):
  File "C:\Users\vojte\AppData\Local\Programs\Python\Python37-32\Lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\vojte\AppData\Local\Programs\Python\Python37-32\Lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\vojte\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
  File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "c:\users\vojte\venv\lib\site-packages\csv_reconcile\__init__.py", line 210, in main
    initdb.init_db()
  File "c:\users\vojte\venv\lib\site-packages\csv_reconcile\initdb.py", line 76, in init_db
    (mid, word) + tuple(matchFields))
sqlite3.IntegrityError: UNIQUE constraint failed: reconcile.id

My query.tsv is from https://w.wiki/3BV9

What do you think is happening? Sorry to spam the issue with my questions.
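
The "UNIQUE constraint failed: reconcile.id" message suggests that some value of the id column was inserted more than once. One possible cause, not confirmed in this thread, is that the exported TSV contains the same item on several rows. A quick check, with the hypothetical helper name `duplicate_ids`:

```python
import csv
import io
from collections import Counter


def duplicate_ids(csvfile, idcol, delimiter="\t"):
    """Return a dict mapping each id that occurs more than once in the
    given column to its number of occurrences (hypothetical helper)."""
    reader = csv.DictReader(csvfile, delimiter=delimiter)
    counts = Counter(row[idcol] for row in reader)
    return {value: n for value, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up sample data, only to show the call.
    sample = io.StringIO("item\tcoord\nQ1\tPoint(1 1)\nQ1\tPoint(2 2)\nQ2\tPoint(3 3)\n")
    print(duplicate_ids(sample, "item"))
```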

Originally posted by @VojtechDostal in wetneb/openrefine-wikibase#101 (comment)

Handle mismatched lines in csv files

We should expect all lines in the csv file to have the same number of entries. Skip lines which have a different number. Generally these will be trailing blank lines created by whatever generated the csv file.
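
A minimal sketch of that filtering, assuming the header row defines the expected number of fields. The generator name `consistent_rows` is hypothetical.

```python
import csv
import io


def consistent_rows(reader):
    """Yield the header and then only those rows that have the same
    number of fields as the header (hypothetical helper)."""
    header = next(reader)
    yield header
    for row in reader:
        if len(row) == len(header):
            yield row
        # Rows of any other length, including blank lines, are skipped.


if __name__ == "__main__":
    data = io.StringIO("a,b\n1,2\n\n3\n4,5\n")
    print(list(consistent_rows(csv.reader(data))))
```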
