gitonthescene / csv-reconcile
A reconciliation service for OpenRefine serving data from a given CSV file.
License: MIT License
When using "Add columns from reconciled values", you see the list of db column names to choose from and not the original csv column names. Using the original csv column names looks cleaner.
Announce the use of git flow workflow.
I tested the new behaviour for CSV dialect sniffing introduced for #41 in #42 and discovered the following problems:
CSVKWARGS configuration because it is applied later (csv-reconcile/csv_reconcile/initdb.py, line 76 in bc723c7).
The reason for csv.Sniffer() being unsure about the delimiter is that, when reading a fixed-size chunk of the csv file, the chunk might end in the middle of a csv line, so the number of delimiters in that last line is off.
Working example with whole file:
import csv
csv.Sniffer().sniff("a,b,c\n1,2,3")
Problematic example with only part of the file (throws error):
import csv
csv.Sniffer().sniff("a,b,c\n1,2")
So I would recommend using dialect sniffing only (or additionally?) when the user has not given explicit instructions on the dialect via CSVKWARGS, and using csvfile.readline() to avoid having a line cut off somewhere.
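The recommendation above could be sketched roughly as follows. This is only an illustration, not the project's code: `sniff_from_lines` and `make_reader` are hypothetical names, `csvkwargs` stands in for whatever the CSVKWARGS setting holds, and the limit of 20 lines is an arbitrary assumption.

```python
import csv
import io


def sniff_from_lines(csvfile, max_lines=20):
    """Sniff the dialect from whole lines only, then rewind the file."""
    start = csvfile.tell()
    # readline() returns "" at end of file, so short files are handled too.
    # Caveat: a quoted field containing a newline can still span two lines.
    sample = "".join(csvfile.readline() for _ in range(max_lines))
    csvfile.seek(start)
    return csv.Sniffer().sniff(sample)


def make_reader(csvfile, csvkwargs=None):
    """Use explicit settings when given; fall back to sniffing otherwise."""
    if csvkwargs:
        return csv.reader(csvfile, **csvkwargs)
    return csv.reader(csvfile, dialect=sniff_from_lines(csvfile))


if __name__ == "__main__":
    # No explicit settings: the dialect is sniffed from complete lines.
    print(next(make_reader(io.StringIO("a,b,c\n1,2,3\n4,5,6\n"))))
    # Explicit settings win and the sniffer is never consulted.
    print(next(make_reader(io.StringIO("a;b\n1;2\n"), {"delimiter": ";"})))
```

With this ordering, an explicit dialect is never overridden by a guess, and the sniffer never sees a partial row.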
After executing the following commands:
http://127.0.0.1:5000/reconcile displays as
The following commands did not result in an error message and all executed successfully. What could the issue be?
I don't believe it's the issue described in #41. To test, I created a config file with a line "CSVENCODING = " followed by the encoding of the file, and the output at localhost:5000/reconcile did not change.
Any suggestions would be greatly appreciated! Thanks!
Please do feel free to close this :) I just wanted to drop a note about how pleasantly easy this was to use to join a few columns into my tables - thanks for the clear docs and easy install!!
Hello- I have a TSV file with line character counts as follows (the first line is the header)
155
130
656
416
707
950
526
753
186
731
...
csv-reconcile init gives me the following error:
$ poetry run csv-reconcile init test7.tsv col1_name col2_name
...
File "/home/me/src/csv-reconcile/csv_reconcile/initdb.py", line 88, in init_db
searchidx = header.index(searchcol)
^^^^^^^^^^^^^^^^^^^^^^^
ValueError: 'col2_name' is not in list
The error is fixed if I change the amount of data being fed to the sniffer on this line
dialect = csv.Sniffer().sniff(csvfile.read(10240))
where I changed the previous value of 1024 to be 10240.
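Enlarging the sample works for this file, but any fixed size can still end in the middle of a row. An alternative, sketched here under the assumption that the sample is read as text, is to keep the fixed-size read and drop the trailing partial line before sniffing. The helper names are hypothetical.

```python
import csv
import io


def trim_to_last_newline(sample):
    """Drop a trailing partial line so every sampled row is complete."""
    cut = sample.rfind("\n")
    # If the sample holds no newline at all (first line longer than the
    # chunk), there is nothing sensible to trim, so return it unchanged.
    return sample[:cut + 1] if cut != -1 else sample


def sniff_chunk(csvfile, size=1024):
    """Sniff a fixed-size chunk, ignoring a row that was cut off."""
    start = csvfile.tell()
    sample = trim_to_last_newline(csvfile.read(size))
    csvfile.seek(start)
    return csv.Sniffer().sniff(sample)


if __name__ == "__main__":
    data = io.StringIO("a,b,c\n1,2,3\n4,5")  # last row is incomplete
    print(repr(sniff_chunk(data, size=16).delimiter))
```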
When running a reconciliation service from a csv file with csv-reconcile in the terminal, we get the message "Running on http://127.0.0.1:5000/". The url "http://127.0.0.1:5000/" gives an error message when we click it or when we try adding it in OpenRefine as a service; the actual service url is only given in the instructions on the csv-reconcile website. When we click on the actual service url, it leads to the service manifest, and it is also successfully added as a service in OpenRefine.
The url in the message should redirect to the service url (which is http://127.0.0.1:5000/reconcile), which would be very convenient for users.
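One way such a redirect could look is a small WSGI wrapper. The logs elsewhere on this page show that the service is a Flask app, but the sketch below uses plain WSGI so that it has no dependencies. `redirect_root` is a hypothetical helper, not part of csv-reconcile.

```python
def redirect_root(app, target="/reconcile"):
    """Wrap a WSGI app so that requests for "/" are redirected to target."""

    def wrapped(environ, start_response):
        if environ.get("PATH_INFO", "/") in ("", "/"):
            headers = [("Location", target), ("Content-Length", "0")]
            start_response("302 Found", headers)
            return [b""]
        # Every other path is passed through to the wrapped application.
        return app(environ, start_response)

    return wrapped
```

With Flask, a wrapper like this is normally attached with `app.wsgi_app = redirect_root(app.wsgi_app)`. Whether a redirect is enough for OpenRefine to accept the root URL as a service endpoint is not something this sketch establishes.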
It's awkward to have to remember to keep the same config and scorer options between the init and serve steps. It's much better to save off the config and scorer option during the init stage and use the saved versions during the serve stage.
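A minimal sketch of the idea, assuming the two values can be stored as JSON next to the database. The file layout and the function names are made up for illustration and say nothing about how csv-reconcile actually stores them.

```python
import json
from pathlib import Path


def save_settings(path, config, scorer):
    """Called at the end of the init step: remember the options used."""
    payload = {"config": config, "scorer": scorer}
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_settings(path):
    """Called at the start of the serve step: reuse the saved options."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    return payload["config"], payload["scorer"]
```

The serve step would then call `load_settings` instead of expecting the same options on the command line again.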
I am a bit confused by the install instructions in the readme, which currently are (after cloning the repository):
$ python -m venv venv # create virtualenv
$ venv/bin/pip install csv-reconcile # install package
$ source venv/bin/activate # activate virtual environment
(venv) $ csv-reconcile --init-db sample/reps.tsv item itemLabel # start the service
(venv) $ deactivate # remove virtual environment
It seems to me that this installs csv-reconcile from PyPI (using the latest release) rather than from the code contained in the repository. What are the steps to run the code directly instead?
Generally speaking I expect the following workflow (but perhaps that's old-school!):
$ python -m venv venv
$ source venv/bin/activate # activate the virtualenv before installing anything
$ pip install -r requirements.txt # install dependencies
$ python setup.py install # install csv-reconcile itself
$ csv-reconcile …
When running from a container, we should initialize the database when creating the container and simply run the service when running the container. The current command line interface obscures this distinction. Deprecate current syntax for something more accommodating. Namely, use sub-commands to "init" the database and "serve" the data. Keep a deprecated "run" sub-command which mimics the current behavior.
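The proposed command layout could be sketched like this. The tracebacks on this page show that the project uses click; this sketch uses argparse from the standard library only to stay self-contained, and the help texts are assumptions.

```python
import argparse


def build_parser():
    """Build a parser with init, serve and a deprecated run sub-command."""
    parser = argparse.ArgumentParser(prog="csv-reconcile")
    sub = parser.add_subparsers(dest="command", required=True)

    # init: load the CSV file into the database (container build time).
    init = sub.add_parser("init", help="initialize the database")
    # run: deprecated, behaves like init followed by serve.
    run = sub.add_parser("run", help="deprecated: init, then serve")
    for cmd in (init, run):
        cmd.add_argument("csvfile")
        cmd.add_argument("idcol")
        cmd.add_argument("namecol")

    # serve: only start the service (container run time).
    sub.add_parser("serve", help="serve an initialized database")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["init", "sample/reps.tsv", "item", "itemLabel"])
    print(args.command, args.csvfile, args.idcol, args.namecol)
```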
Per the discussion in #28 make the preview service opt-in by simply pointing the manifest at it like so:
"preview": {
"url": "http://localhost:5000/preview/{{id}}",
"width": 400,
"height": 300
}
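As a rough sketch of "opt-in", the manifest could gain the preview entry only when a flag is set. `build_manifest` and its `name` default are hypothetical, and a real manifest carries more fields than shown here.

```python
def build_manifest(base_url, name="CSV Reconcile", preview=False):
    """Return a (partial) service manifest, with preview only on request."""
    manifest = {"name": name}
    if preview:
        manifest["preview"] = {
            # "{{id}}" is kept literally; the client substitutes the id.
            "url": base_url + "/preview/{{id}}",
            "width": 400,
            "height": 300,
        }
    return manifest
```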
When used under Windows with OpenRefine 3.6.1 the service fails.
If I try using the OpenRefine Test Bench tab I get the following error: "The endpoint MUST return a JSON document describing the service, accessible via CORS or JSONP."
The output from the service from the test is:
* Serving Flask app 'csv-reconcile' (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [07/Sep/2022 21:57:05] "OPTIONS / HTTP/1.1" 404 -
127.0.0.1 - - [07/Sep/2022 21:57:05] "GET / HTTP/1.1" 404 -
127.0.0.1 - - [07/Sep/2022 21:57:05] "OPTIONS /?callback=jsonp_1662580625915_97939 HTTP/1.1" 404 -
127.0.0.1 - - [07/Sep/2022 21:57:05] "GET /?callback=jsonp_1662580625915_97939 HTTP/1.1" 404 -
Might the OpenRefine protocol have changed?
I was able to set up and run csv-reconcile serve, but cannot run the example on the reps.tsv file: I get ValueError: 'item' is not in list. Similarly, when I try the progressives.tsv file I get ValueError: 'itemLabel' is not in list. The errors are otherwise identical, except for the last few lines. I have tried restarting everything, and cannot get the init step to work before running the serve command. Any suggestions would be appreciated.
Last few lines of the error for reps.tsv:
File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 68, in init_db
ididx = header.index(idcol)
ValueError: 'item' is not in list
Last few lines of the error for progressives.tsv:
File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 67, in init_db
searchidx = header.index(searchcol)
ValueError: 'itemLabel' is not in list
The full error for reps.tsv:
(venv) C:\Users\jennifer.woodward\OneDrive - USDA\myGitOneDrive\NALT4MA\csv-reconcile>csv-reconcile init sample/reps.tsv item itemLabel
Traceback (most recent call last):
File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\jennifer.woodward\OneDrive - USDA\myGitOneDrive\NALT4MA\csv-reconcile\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 321, in main
return cli()
File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 271, in init
return doinit(config, scorerOption, csvfile, idcol, namecol)
File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 259, in doinit
initdb.init_db_with_context(csvfile, idcol, namecol)
File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 95, in init_db_with_context
return init_db(db,
File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 68, in init_db
ididx = header.index(idcol)
ValueError: 'item' is not in list
The importlib_metadata API becomes standard as of Python 3.10. Use the latest version of the API in all earlier versions of Python.
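A common way to do this is a version-dependent import, sketched below. On Python versions before 3.10 it assumes the `importlib_metadata` backport is installed from PyPI. The group name in the usage line is a placeholder, not the project's real entry point group.

```python
import sys

if sys.version_info >= (3, 10):
    from importlib import metadata
else:
    # Backport providing the same selectable entry_points() API.
    import importlib_metadata as metadata


def find_plugins(group):
    """Return the entry points registered under the given group."""
    return list(metadata.entry_points(group=group))


if __name__ == "__main__":
    print(find_plugins("hypothetical.plugin.group"))
```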
I have the following object:
Point(14.6152142 50.0812828)
I am reconciling it to the outcome of this query:
The correct match is Q64816168
However, that suggestion does not appear in the top ten. Any ideas why? Am I missing something obvious?
Steps to replicate:
I see there is a test suite already, so perhaps it would be worth running it in a continuous integration service?
As a side effect, this is a good way to document the install process on a stock machine (since you have to script it for the CI). I actually looked for the CI configuration files as a way to solve my install problems (#8).
I want to reconcile two CSV files. I've installed csv-reconcile and initialized the database, and I've uploaded one of the CSV files to OpenRefine, but I am not sure where to put the other file I'm trying to reconcile it against. Any help would be greatly appreciated. Thanks!!!
I am reconciling against local CSV files with ambiguous data in the search column and without ids from external systems (Wikidata...).
| ID | Search Column | Additional Data |
|---|---|---|
| 1 | John Doe | 1970-01-01 |
| 2 | John Doe | 2020-01-01 |
| ... | ... | ... |
To identify the correct match from the Reconciliation API I have to manually check the proposed results against the data in the CSV files. To speed this process up I would prefer to use a Preview Service as defined in the Reconciliation Service API Specification.
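To make the request concrete, a preview could be as simple as an HTML fragment listing every column of the matched row. This is a sketch only: `render_preview` is a hypothetical helper, and the row is assumed to arrive as a mapping from column name to value.

```python
import html


def render_preview(row):
    """Render one CSV row as an HTML definition list for a preview pane."""
    items = "".join(
        # Escape both names and values so cell contents cannot inject markup.
        f"<dt>{html.escape(str(key))}</dt><dd>{html.escape(str(value))}</dd>"
        for key, value in row.items()
    )
    return f"<dl>{items}</dl>"


if __name__ == "__main__":
    print(render_preview({"ID": "2", "Search Column": "John Doe",
                          "Additional Data": "2020-01-01"}))
```

Showing the additional data next to each candidate is what lets two "John Doe" rows be told apart without opening the CSV file.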
Use nox with nox-poetry to run tests with plugin installed in virtual environment.
Hello,
at first, let me thank you for this great reconciliation tool!
I've been trying to use csv-reconcile with the csv-reconcile-geo plugin and I am not sure where the error comes from, so feel free to direct me elsewhere if the problem is not on your side.
So, the problem was that the "budovy_wdqs.tsv" file I was providing was using UTF-8 encoding, while the program apparently expects it to be in cp-1250 for some reason. When I resaved the .tsv in cp-1250, the bug went away.
(venv) C:\...\csv-reconcile [master ≡ +4 ~0 -0 !]> csv-reconcile --init-db budovy_wdqs.tsv item coords --scorer geo --config config.txt
Traceback (most recent call last):
File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
exec(code, run_globals)
File "C:\...\csv-reconcile\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1053, in main
rv = self.invoke(ctx)
File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "C:\...\csv-reconcile\venv\lib\site-packages\click\core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 195, in main
initdb.init_db_with_context()
File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 90, in init_db_with_context
return init_db(db,
File "C:\...\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 58, in init_db
header = next(reader)
File "C:\Python310\lib\encodings\cp1250.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x88 in position 2094: character maps to <undefined>
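The traceback above ends in cp1250.py, which is consistent with the file being opened without an explicit encoding, in which case Python falls back to the locale's preferred encoding. A minimal sketch of reading the header with a fixed encoding instead follows. `read_header` is a hypothetical helper, and the UTF-8 default is an assumption rather than the project's documented behaviour.

```python
import csv


def read_header(path, encoding="utf-8", **csvkwargs):
    """Return the header row, decoding the file with an explicit encoding."""
    # newline="" is what the csv module documentation recommends for files.
    with open(path, newline="", encoding=encoding) as csvfile:
        return next(csv.reader(csvfile, **csvkwargs))
```

For example, `read_header("budovy_wdqs.tsv", delimiter="\t")` would decode the file as UTF-8 regardless of the Windows code page. Another report on this page mentions a CSVENCODING setting in the config file, which appears to serve the same purpose.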
FWIW, if you don't mind running your own reconciliation service, I've just written a geo scoring plugin for csv-reconcile.
With this you could, say run a SPARQL query to find coordinate locations of points you're looking to match against, export that as a TSV file and use that to run csv-reconcile.
You can get the service up and running as simply as the following:
$ python -m venv serverenv
$ source serverenv/bin/activate
$ python -m pip install csv-reconcile
$ python -m pip install csv-reconcile-geo
$ csv-reconcile --init-db query.tsv item coord --scorer geo
Here item is the name of the column containing the QIDs and coord is the name of the coordinate column in well-known text format, the default export format for coordinates.
This was just my first pass at it. There's certainly room for improvement, but it may suit your immediate needs.
@gitonthescene Please could you assist me with this? I am a bit disoriented and I am not sure if I understand the overall idea of 'my own' reconciliation service correctly. Am I right in assuming that I need to load File number 1 into openrefine, load File number 2 into command line via the commands above, add a reconciliation service "http://127.0.0.1:5000/reconcile" to OpenRefine and reconcile?
I think I was able to start virtualenv on my system (I am on Windows and "source" did not work, but I think I was able to find a solution at https://stackoverflow.com/questions/8921188/issue-with-virtualenv-cannot-activate) and then I was able to install csv-reconcile and csv-reconcile-geo. However, this is what I get when I run the program:
(venv) C:\Users\vojte\Downloads>csv-reconcile --init-db query.tsv item coord --scorer geo
c:\users\vojte\venv\lib\site-packages\normality\__init__.py:72: ICUWarning: Install 'pyicu' for better text transliteration.
text = ascii_text(text)
Traceback (most recent call last):
File "C:\Users\vojte\AppData\Local\Programs\Python\Python37-32\Lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\vojte\AppData\Local\Programs\Python\Python37-32\Lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\vojte\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 782, in main
rv = self.invoke(ctx)
File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "c:\users\vojte\venv\lib\site-packages\click\core.py", line 610, in invoke
return callback(*args, **kwargs)
File "c:\users\vojte\venv\lib\site-packages\csv_reconcile\__init__.py", line 210, in main
initdb.init_db()
File "c:\users\vojte\venv\lib\site-packages\csv_reconcile\initdb.py", line 76, in init_db
(mid, word) + tuple(matchFields))
sqlite3.IntegrityError: UNIQUE constraint failed: reconcile.id
My query.tsv is from https://w.wiki/3BV9
What do you think is happening? Sorry to spam the issue with my questions
Originally posted by @VojtechDostal in wetneb/openrefine-wikibase#101 (comment)
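A UNIQUE constraint failure on reconcile.id suggests that the same value appears more than once in the id column, which can happen when a query returns several rows for one item. The following sketch lists such values before running init. `duplicate_ids` is a hypothetical helper and the sample data is invented.

```python
import csv
import io
from collections import Counter


def duplicate_ids(csvfile, idcol, **csvkwargs):
    """Return {id: count} for every id that occurs more than once."""
    reader = csv.DictReader(csvfile, **csvkwargs)
    counts = Counter(row[idcol] for row in reader)
    return {key: n for key, n in counts.items() if n > 1}


if __name__ == "__main__":
    sample = io.StringIO(
        "item\tcoord\n"
        "Q1\tPoint(1 2)\n"
        "Q2\tPoint(3 4)\n"
        "Q1\tPoint(5 6)\n"
    )
    print(duplicate_ids(sample, "item", delimiter="\t"))
```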
We should expect all lines in the csv file to have the same number of entries. Skip lines which have a different number. Generally these will be trailing blank lines created by whatever generated the csv file.
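A minimal sketch of that filtering, assuming the header row defines the expected number of entries. `consistent_rows` is a hypothetical name.

```python
import csv
import io


def consistent_rows(reader):
    """Yield the header and every row with as many entries as the header."""
    header = next(reader, None)
    if header is None:  # empty input: nothing to yield
        return
    yield header
    for row in reader:
        # Blank lines come back as [] from csv.reader and are skipped here,
        # as is any other row whose length differs from the header's.
        if len(row) == len(header):
            yield row


if __name__ == "__main__":
    data = io.StringIO("a,b\n1,2\n\n3\n4,5\n")
    print(list(consistent_rows(csv.reader(data))))
```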