Comments (19)
For some reason there are two files named "20190309-icews-events.zip" in the dataverse repo (they don't contain the same events). The duplicate file names are causing issues, e.g. the download right now is by the file name/label. For me this is causing a timeout when it tries to download the file, but I'm guessing it's also what leads to the error you are getting. I'll have to change how the files are downloaded and labelled for the database.
from icews.
Looking at the metadata of the two files, they differ in at least four attributes that may help differentiate them:
https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/QI2T9A/DER3I5&version=200.0
https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/QI2T9A/5DCQQF&version=200.0
Those four attributes are:
- the MD5
- the last part of the direct link: DER3I5 vs 5DCQQF
- the deposit date: 2019-03-10 vs 2019-03-11
- the first ICEWS version in which they appear:
Those might help to form a filename if needed.
Hope this helps.
Hey, thank you. I've been (and still am) on vacation, but had a chance to look a bit more. Gonna paste this here, partly for myself. Aside from the repeated 20190309-icews-events.zip, there are also 20190409-icews-events-1.zip and 20190410-icews-events-1.zip, as well as 20190503-thru-20190519-icews-events.zip.
The files are downloaded using the dataverse package. Looking at get_file, which does the actual downloading, it seems that it is also possible to use a numeric ID instead of the file name. E.g.:
library("icews")
library("dataverse")
file_list = get_dataset(get_doi()$daily)
head(file_list$files[, c("label", "id")])
label id
1 20181004-icews-events.zip 3234868
2 20181005-icews-events.zip 3235021
3 20181006-icews-events.zip 3238491
4 20181007-icews-events.zip 3238493
5 20181008-icews-events.zip 3238584
6 20181009-icews-events.zip 3238918
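For readers outside R: the same numeric-ID download can be done directly against the Dataverse native Data Access API. A minimal Python sketch (the file ID 3234868 is taken from the listing above; the endpoint is the standard public Dataverse access route, no authentication needed for public files):

```python
# Sketch: fetch a Dataverse file by its numeric file ID instead of its
# (possibly duplicated) file name, via the standard Data Access API endpoint.

BASE = "https://dataverse.harvard.edu"

def datafile_url(file_id: int) -> str:
    """Return the direct-download URL for a Dataverse numeric file ID."""
    return f"{BASE}/api/access/datafile/{file_id}"

# Example: first file from the listing above (20181004-icews-events.zip).
url = datafile_url(3234868)
print(url)  # https://dataverse.harvard.edu/api/access/datafile/3234868

# To actually download it (requires network access):
# import urllib.request
# urllib.request.urlretrieve(url, "20181004-icews-events.zip")
```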
Right now the file name is used to reconcile the local and remote states, so that will have to switch. There are two tables, source_files and null_source_files, that list the ingested source files (and the source files that contained no new events and thus wouldn't show up in the events table's source_file column).
library("icews")
library("RSQLite")
con = connect()
dbGetQuery(con, "select * from source_files limit 5;")
dbGetQuery(con, "select * from null_source_files limit 5;")
The source file is also included in the events table (i.e. query_icews("select * from events limit 5;") %>% str() will show a source_file column at the end), but that's more for informative purposes, since querying the events table at every update would take a really long time.
I'm going to have to switch those internal tables up. Maybe have a new source_file table with something like
- source_file_id: source file integer ID from dataverse
- source_file_name: the possible duplicated source file name
- source_id: maybe another ID for this table since the dataverse source file id is a pretty big integer. Not sure.
The various state and downloader functions will need to be switched to use the integer ID instead of file name.
There's probably also going to have to be some kind of one-time upgrade functionality that implements these changes on an existing table, to avoid having to nuke and re-download everything.
Well, lesson for me to not use file names as unique IDs when there's already a perfectly good unique ID on dataverse. I hope to get to this at the end of this week or next week.
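The proposed redesign can be illustrated on a toy SQLite table. A sketch (column names follow the proposal above; the IDs are hypothetical, and this is not the package's actual schema): keying on the dataverse integer ID lets two files with the same label coexist.

```python
import sqlite3

# Sketch of the proposed source_files redesign: key on the dataverse integer
# file ID so that duplicated file names no longer collide.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE source_files (
        source_file_id   INTEGER PRIMARY KEY,  -- dataverse file ID
        source_file_name TEXT NOT NULL         -- possibly duplicated label
    )
""")
# The two distinct files that share the label "20190309-icews-events.zip"
# (IDs made up for illustration) both insert cleanly:
con.executemany(
    "INSERT INTO source_files VALUES (?, ?)",
    [(3401001, "20190309-icews-events.zip"),
     (3401002, "20190309-icews-events.zip")],
)
rows = con.execute("SELECT COUNT(*) FROM source_files").fetchone()[0]
print(rows)  # 2
```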
Make DVN to local files work again:
- create file name normalizer
- create dictionary with dataverse file labels and IDs, as well as normalized local file names
- switch get_dvn_state to new state format
- switch get_local_state to new state format
- update plan_file_changes
- update execute_plan to use file ID for download

Make local files to DB work again:
- switch get_db_state to new state format
- update plan_database_sync for local file to DB steps (without downloading new files)
- update plan_database_changes for DVN to local file to DB steps

Misc:
- check documentation downloader still works
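The reconciliation these steps describe can be sketched abstractly. A hypothetical Python illustration (function and variable names are mine, not the package's internals): key all state on the dataverse file ID rather than the label, then diff remote against local.

```python
# Hypothetical sketch of ID-based state reconciliation (names made up, not
# the icews package's actual functions): compare remote and local state
# keyed on the dataverse file ID, and plan downloads/removals accordingly.

def plan_changes(remote_ids, local_ids):
    """Return (to_download, to_remove) as sorted lists of file IDs."""
    remote, local = set(remote_ids), set(local_ids)
    return sorted(remote - local), sorted(local - remote)

remote = [3234868, 3235021, 3238491]   # IDs listed on dataverse
local = [3234868, 3230000]             # IDs already on disk
down, remove = plan_changes(remote, local)
print(down)    # [3235021, 3238491]
print(remove)  # [3230000]
```

Because the keys are integers, duplicate labels on dataverse no longer confuse the diff.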
@mayeulk can you try updating the package and seeing if it works now?
Hi, it seems to work, although I could not finish the process (full disk):
Ingesting records from 'events.2015.20180710092545.tab'
|================================================================================================================================| 100% 241 MB
Error in result_bind(res@ptr, params) : database or disk is full
Error in result_create(conn@ptr, statement) :
no such savepoint: dbWriteTable
My previous attempts led to a db of 120 MB; this one is 6 GB.
I'll confirm in a few days (getting a new disk).
Thanks! Cheers,
Mayeul
😀 that sounds about right. I just updated all the way through 15 June, and have ~8GB for the database and ~5GB for the raw ".tsv" files.
Hi, it repeatedly fails now on '20190409-icews-events.zip' and '20190409-icews-events-1.zip'
Downloading '20190406-icews-events.zip'
Ingesting records from '20190406-icews-events.tab'
Downloading '20190407-icews-events.zip'
Ingesting records from '20190407-icews-events.tab'
Downloading '20190408-icews-events.zip'
Ingesting records from '20190408-icews-events.tab'
Downloading '20190409-icews-events.zip'
Ingesting records from '20190409-icews-events.tab'
Downloading '20190409-icews-events-1.zip'
Ingesting records from '20190409-icews-events-1.tab'
Error in result_bind(res@ptr, params) :
UNIQUE constraint failed: events.event_id, events.event_date
> date()
[1] "Wed Jun 19 18:59:23 2019"
> update_icews(dryrun = FALSE); date()
Downloading '20190409-icews-events.zip'
Ingesting records from '20190409-icews-events-1.tab'
Error in result_bind(res@ptr, params) :
UNIQUE constraint failed: events.event_id, events.event_date
>
Maybe related to this: the @icews Twitter feed mentions duplicates, see:
https://twitter.com/icews?lang=en
Direct link to the Apr 7, 2019 tweet: https://t.co/W3XSnPU0Vo
ICEWS @icews Apr 7
We uploaded the 3 missing events file for Mar 26-28. Unfortunately, there will be duplicate events in these files and files from Mar 27-Apr 6, predominantly affecting Mar 28 and Apr 5. Use the Event ID field to identify these duplicates. http://bit.ly/2ORIfQX #icews #dataverse
I lowered the uniqueness requirement (primary key) as a quick, temporary fix, running this against the SQLite database:
CREATE TABLE events_copy ( event_id INTEGER NOT NULL, event_date INTEGER NOT NULL,
source_name TEXT, source_sectors TEXT, source_country TEXT, event_text TEXT, cameo_code TEXT,
intensity REAL, target_name TEXT, target_sectors TEXT, target_country TEXT, story_id INTEGER,
sentence_number INTEGER, publisher TEXT, city TEXT, district TEXT, province TEXT, country TEXT,
latitude REAL, longitude REAL, year INTEGER NOT NULL, yearmonth INTEGER NOT NULL,
source_file TEXT NOT NULL, PRIMARY KEY (event_id, event_date, source_file) );
INSERT INTO events_copy
SELECT * FROM events;
ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_copy RENAME TO events;
I kept the old data to analyse this. Running update_icews(dryrun = FALSE) again now works on this particular setup.
With my version of the database (with PRIMARY KEY (event_id, event_date, source_file)), it now works:
Ingesting records from '20190618-icews-events.tab'
Cleaning up and optimizing database, this might take a while
Complete
File and/or database update done
On this new db, I ran the following to find duplicates:
SELECT event_id, event_date, COUNT(*)
FROM events
GROUP BY event_id, event_date
HAVING COUNT(*) > 1
There are 2434 rows returned ("duplicates"), of which:
- 2319 rows are for event_date == 20190409
- 77 rows are for event_date == 20190410
- 38 rows are for event_date == 20190408
I had the exact same issue with "20190409-icews-events-1.zip" and "20190409-icews-events.zip", and thought I had managed to fix it (#46).
The two files contain the same exact set of events (by event ID), so what should happen is this:
- The version without "-1" is ingested first, and all events are added to the database.
- For the next version with "-1", write_data_to_db() should have recognized that there are potentially duplicate events, realized that all events are duplicates and thus that there is nothing to add, and in that case added the "-1" file name to a table tracking source files with no new events (i.e. all duplicates).
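The intended ingest behavior can be sketched on toy data. A simplified Python/sqlite3 illustration (write_data_to_db() is the package's R function; the table names follow the thread, the rest of the schema is stripped down to the minimum): drop incoming events whose (event_id, event_date) already exist, and if nothing survives, also record the file in null_source_files.

```python
import sqlite3

# Simplified sketch of the ingest logic described above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (event_id INTEGER, event_date INTEGER, "
            "source_file TEXT, PRIMARY KEY (event_id, event_date))")
con.execute("CREATE TABLE source_files (name TEXT PRIMARY KEY)")
con.execute("CREATE TABLE null_source_files (name TEXT PRIMARY KEY)")

def ingest(con, fname, events):
    """events: list of (event_id, event_date) tuples from one source file."""
    existing = set(con.execute("SELECT event_id, event_date FROM events"))
    new = [(i, d, fname) for (i, d) in events if (i, d) not in existing]
    con.execute("INSERT INTO source_files VALUES (?)", (fname,))
    if new:
        con.executemany("INSERT INTO events VALUES (?, ?, ?)", new)
    else:  # all duplicates: track the file so it isn't re-tried
        con.execute("INSERT INTO null_source_files VALUES (?)", (fname,))

# The "-1" file repeats the exact same event IDs as the first file:
ingest(con, "20190409-icews-events.tab", [(1, 20190409), (2, 20190409)])
ingest(con, "20190409-icews-events-1.tab", [(1, 20190409), (2, 20190409)])
print([r[0] for r in con.execute("SELECT name FROM null_source_files")])
# ['20190409-icews-events-1.tab']
```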
Could you check if you get the same results for these queries?:
SELECT name FROM null_source_files WHERE name LIKE '20190409%';
"20190409-icews-events-1.tab" only
SELECT name FROM source_files WHERE name LIKE '20190409%';
Both "20190409-icews-events.tab" and "20190409-icews-events-1.tab".
SELECT source_file, count(*) AS n_events
FROM events
WHERE source_file LIKE '20190409%'
GROUP BY source_file;
All from the file version without "-1":
source_file n_events
1 20190409-icews-events.tab 2434
Here are the results (ran on my sqlite db with duplicates):
SELECT name FROM null_source_files WHERE name LIKE '20190409%';
0 rows returned
SELECT name FROM source_files WHERE name LIKE '20190409%';
"20190409-icews-events-1.tab"
"20190409-icews-events.tab"
2 rows
SELECT source_file, count(*) AS n_events
FROM events
WHERE source_file LIKE '20190409%'
GROUP BY source_file;
----
source_file n_events
"20190409-icews-events-1.tab" "2434"
"20190409-icews-events.tab" "2434"
----
2 rows
I guess we can think of ways to remove duplicates in sql, which might be faster than in R (or not).
Here I implement in SQL a solution to remove the duplicates linked to this issue. For speed of queries in this testing phase, we do not make a full copy of the db, but only of a recent subset, into the table events_extract:
CREATE TABLE events_extract ( event_id INTEGER NOT NULL, event_date INTEGER NOT NULL,
source_name TEXT, source_sectors TEXT, source_country TEXT, event_text TEXT, cameo_code TEXT,
intensity REAL, target_name TEXT, target_sectors TEXT, target_country TEXT, story_id INTEGER,
sentence_number INTEGER, publisher TEXT, city TEXT, district TEXT, province TEXT, country TEXT,
latitude REAL, longitude REAL, year INTEGER NOT NULL, yearmonth INTEGER NOT NULL,
source_file TEXT NOT NULL, PRIMARY KEY (event_id, event_date, source_file) );
INSERT INTO events_extract
SELECT * FROM events where event_date>20190400;
-- The duplicates to fix
SELECT event_id, event_date, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) > 1
-- 2434 rows returned
-- Same, without showing the count
SELECT event_id, event_date --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) > 1
-- Full rows of duplicates
SELECT * FROM events_extract where event_date || '-' || event_id IN
(SELECT event_date || '-' || event_id --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) >1)
ORDER by event_date, event_id, source_file;
-- 4868 rows returned
-- we see that all duplicates are pairs coming from two files: -icews-events.tab and -icews-events-1.tab
-- From those duplicates, select only those loaded from -icews-events-1.tab
SELECT * FROM events_extract where event_date || '-' || event_id IN
(SELECT event_date || '-' || event_id --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) >1)
AND source_file LIKE '%-icews-events-1.tab'
ORDER by event_date, event_id, source_file;
-- 2434 rows returned
-- DELETE the duplicates
DELETE FROM events_extract where event_date || '-' || event_id IN
(SELECT event_date || '-' || event_id --, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) >1)
AND source_file LIKE '%-icews-events-1.tab';
-- 2434 rows affected
-- Check there are no duplicates left
SELECT event_id, event_date, COUNT(*)
FROM events_extract
GROUP BY event_id, event_date
HAVING COUNT(*) > 1;
-- 0 rows returned
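The DELETE pattern above can be exercised on toy data; a Python/sqlite3 sketch (the rows are made up, the query is the one above, the schema is stripped to the three relevant columns):

```python
import sqlite3

# Exercise the DELETE pattern above: for each (event_id, event_date) pair
# that appears more than once, drop the copy from the "-icews-events-1.tab"
# file and keep the original.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events_extract "
            "(event_id INTEGER, event_date INTEGER, source_file TEXT)")
con.executemany("INSERT INTO events_extract VALUES (?, ?, ?)", [
    (1, 20190409, "20190409-icews-events.tab"),
    (1, 20190409, "20190409-icews-events-1.tab"),  # duplicate pair: deleted
    (2, 20190409, "20190409-icews-events-1.tab"),  # not duplicated: kept
])
con.execute("""
    DELETE FROM events_extract WHERE event_date || '-' || event_id IN
      (SELECT event_date || '-' || event_id
       FROM events_extract
       GROUP BY event_id, event_date
       HAVING COUNT(*) > 1)
    AND source_file LIKE '%-icews-events-1.tab'
""")
print(con.execute("SELECT COUNT(*) FROM events_extract").fetchone()[0])  # 2
```

Note that a "-1" row survives when it is the only copy of its event, which is why only true duplicates are removed.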
Run over the full events table, the DELETE query takes about 15 s on my laptop:
took 15465ms, 2434 rows affected
Running the update function now puts back the dupes from the 20190409 file:
# R code
> update_icews(dryrun = FALSE)
Ingesting records from '20190409-icews-events-1.tab'
Downloading '20190622-icews-events.zip'
Ingesting records from '20190622-icews-events.tab'
Cleaning up and optimizing database, this might take a while
-- SQL code
-- The duplicates
SELECT event_id, event_date, COUNT(*)
FROM events
GROUP BY event_id, event_date
HAVING COUNT(*) > 1
-- 2434 rows returned
I've changed the title to something more readable. I believe there are two options here:
- fix the R icews package
- fix the data in the ICEWS Harvard Dataverse repository. I'm not sure the data authors would be willing to do so, but it is certainly the better option, among other things because other projects (not in R) might want to use the data too.
Hi, given the recent changes in ICEWS dataverse, I think this is not an issue anymore. Duplicate events are still a problem, but that should be taken care of when ingesting new data. Have you tried updating the data recently?
(It should work even with the previous data present, but might give some essentially ineffectual messages, #54 (comment))