icews's Issues

duplicate file names and event data in source Dataverse repository

After a fresh install on Ubuntu 18.04, the following fails with an error after downloading 151 files (73.1 MB):

library("icews")
library("DBI")
library("dplyr")
library("usethis")
setup_icews(data_dir = "/home/mk/Documents/data/icews", use_db = TRUE, keep_files = TRUE,  r_profile = TRUE)

update_icews(dryrun = TRUE)
update_icews(dryrun = FALSE)

# (...... downloads 151 files, ingesting correctly 294687 rows in sqlite database)
Downloading '20190309-icews-events.zip'
Error in writeBin(as.vector(f), tmp) : can only write vector objects

Re-running update_icews(dryrun = FALSE) repeatedly does not resolve the issue.

The following (launched after the error) might help:

> update_icews(dryrun = TRUE)
File system changes:
Found 151 local data file(s)
Downloading 84 file(s)
Removing 0 old file(s)

Database changes:
Deleting old records for 0 file(s)
Ingesting records from 84 file(s)

Plan:
Download            '20190309-icews-events.zip'
Download            '20190309-icews-events.zip'
Ingest records from '20190309-icews-events.tab'
Ingest records from '20190309-icews-events.tab'
Download            '20190311-icews-events.zip'
Ingest records from '20190311-icews-events.tab'
Download            '20190312-icews-events.zip'
Ingest records from '20190312-icews-events.tab'
Download            '20190313-icews-events.zip'
Ingest records from '20190313-icews-events.tab'
Download            '20190314-icews-events.zip'
(etc.)
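
A sketch of how the package could flag this up front, assuming the Dataverse file listing is available as a data frame with a "label" column holding file names (these names are assumptions, not the package's actual internals):

library(dplyr)

# Hypothetical helper: report file names that occur more than once in the
# Dataverse file listing, so duplicates can be skipped or reported before
# the download plan is built.
find_duplicate_labels <- function(file_list) {
  file_list %>%
    count(label) %>%
    filter(n > 1)
}

# Made-up metadata mirroring the plan above:
file_list <- tibble(label = c("20190309-icews-events.zip",
                              "20190309-icews-events.zip",
                              "20190311-icews-events.zip"))
find_duplicate_labels(file_list)
#> # A tibble: 1 x 2
#>   label                         n
#>   <chr>                     <int>
#> 1 20190309-icews-events.zip     2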

DB state does not include source files with all duplicate events

When adding events to the database, events whose event ID already exists are not added again. If all events in a ".tsv" source file are duplicates and thus none are added to the "events" table, the name of the source file is instead stored in the "null_source_files" table; the "source_files" list is derived from the "source_file" column of the "events" table, so such files would not show up there. However, the DB state getter currently does not include the null source files.
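
One possible fix, sketched under the assumption that both tables store file names in a column called "name" (the actual schema may differ), is to union the two tables when reading the DB state:

library(DBI)

# Hypothetical sketch: treat files recorded in 'null_source_files' as already
# ingested when determining the database state.
get_db_source_files <- function(con) {
  DBI::dbGetQuery(con, "
    SELECT name FROM source_files
    UNION
    SELECT name FROM null_source_files;")$name
}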

Store source_file list in separate table updated via trigger?

Getting the distinct source file list can take several seconds. Since it is often needed without any subsequent action being performed, the source file list could be kept in a second table that is updated automatically whenever the events table changes. This would trade a relatively trivial amount of additional time when deleting or inserting rows (which already takes very long) for a much faster read when determining the DB/events state.
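
A minimal sketch of what the trigger-maintained table could look like, using a simplified events schema (the real column list differs):

library(DBI)
library(RSQLite)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbExecute(con, "CREATE TABLE events (event_id INTEGER, source_file TEXT);")
DBI::dbExecute(con, "CREATE TABLE source_files (name TEXT PRIMARY KEY);")

# Keep 'source_files' in sync automatically whenever events are inserted;
# INSERT OR IGNORE avoids duplicate names.
DBI::dbExecute(con, "
  CREATE TRIGGER add_source_file AFTER INSERT ON events
  BEGIN
    INSERT OR IGNORE INTO source_files (name) VALUES (NEW.source_file);
  END;")

DBI::dbExecute(con, "INSERT INTO events VALUES (1, 'events.2007.tab');")
DBI::dbGetQuery(con, "SELECT * FROM source_files;")
DBI::dbDisconnect(con)

A corresponding AFTER DELETE trigger would also be needed, and it would have to check whether any remaining events still reference the deleted row's source file.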


DVN vs. Dataverse vs. Harvard Dataverse

Hi! I spotted this project on Twitter at https://twitter.com/andybeega/status/1103226111855607809

I noticed that you're using "DVN" in your README (e.g. "the current version on DVN") but in some contexts, "Harvard Dataverse" would be preferred. Let me try to summarize a few terms:

  • DVN: A somewhat awkward acronym for "Dataverse Network", which was the old name (in the 3.x and earlier days) for the software that is now called "Dataverse".
  • Dataverse: The current name for the software formerly known as "DVN".
  • Harvard Dataverse: One of 39 installations of Dataverse and home of ICEWS event data. Other installations of Dataverse include UNC Dataverse, Scholars Portal Dataverse, etc.

I'd be happy to make a pull request if you'd like. Please let me know. Thanks! Great project!

Also, if you're interested in helping with the "dataverse" R package, please leave a comment at IQSS/dataverse-client-r#21 😄

License of ICEWS data?

What is the license of ICEWS data? (not the license of the ICEWS R package, which is MIT license).

I cannot find the licence here: https://dataverse.harvard.edu/dataverse/icews

The latest individual file seems to imply: "For Official Use Only (FOUO), government sponsored research activities." See https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QI2T9A
Unsure whether it applies to all data. In any case, the "Terms" pane says "CC0 Public domain" (see screenshot).
Unsure what FOUO means; some users on Wikipedia (https://en.wikipedia.org/wiki/Talk:For_Official_Use_Only) say it is public domain, which is consistent with the "Terms" pane.

Still, look at this:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/28075
Terms pane: RESTRICTIONS ON USE: THESE MATERIALS ARE SUBJECT TO COPYRIGHT PROTECTION AND MAY ONLY BE USED AND COPIED FOR RESEARCH AND EDUCATIONAL PURPOSES. THE MATERIALS MAY NOT BE USED OR COPIED FOR ANY COMMERCIAL PURPOSES. © 2015 Lockheed Martin Corporation and BBN-Raytheon. All rights reserved.

Add checks to make sure local files or DB exists at a user-specified path

Add a check so this becomes more informative:

> set_icews_opts("foo", TRUE, TRUE)
> read_icews()
Error in `$<-.data.frame`(`*tmp*`, "year", value = NA_integer_) : 
  replacement has 1 row, data has 0
In addition: Warning messages:
1: Unknown or uninitialised column: 'event_date'. 
2: In read_icews_raw(find_raw(), n_max) :
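
A minimal sketch of a friendlier check (the function name and message wording are hypothetical, not the package's actual API):

check_data_dir <- function(data_dir) {
  # Fail early with a readable message instead of an obscure downstream error.
  if (is.null(data_dir) || !dir.exists(data_dir)) {
    stop("Data directory '", data_dir, "' does not exist; check the path ",
         "passed to set_icews_opts() or setup_icews().", call. = FALSE)
  }
  invisible(data_dir)
}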

download_data is not working with user-specific path

old_opts <- unset_icews_opts()
download_data(to_dir = "~/Downloads/icews_data", update = TRUE, dryrun = TRUE)
Error in find_path("raw") : Path argument is missing.
Consider setting the paths up globally with `setup_icews()`.
Ideally in your .Rprofile file; try running `dr_icews()` for help. 

Auto-retry when dataverse is slow

Often, the Dataverse server is slow and update_icews() stops with an error. It would be great to have an option to relaunch it automatically in such cases (perhaps after a delay, specified in seconds); a possible wrapper is sketched after the log below. There are at least two types of errors for which relaunching works:

  • Gateway Timeout (HTTP 504).

  • parse error: premature EOF

> update_icews(dryrun = FALSE); date()
Error in value[[3L]](cond) : 
  Something went wrong in 'dataverse' or the Dataverse API, try again. Original error message:
Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181129-icews-events.zip'
Ingesting records from '20181129-icews-events.tab'
Downloading '20181130-icews-events.zip'
Ingesting records from '20181130-icews-events.tab'
Downloading '20181203-icews-events.zip'
Ingesting records from '20181203-icews-events.tab'
Downloading '20181204-icews-events.zip'
Ingesting records from '20181204-icews-events.tab'
Downloading '20181205-icews-events.zip'
Ingesting records from '20181205-icews-events.tab'
Downloading '20181206-icews-events.zip'
Ingesting records from '20181206-icews-events.tab'
Downloading '20181207-icews-events.zip'
Ingesting records from '20181207-icews-events.tab'
Downloading '20181208-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) : 
  Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Error in value[[3L]](cond) : 
  Something went wrong in 'dataverse' or the Dataverse API, try again. Original error message:
parse error: premature EOF
                                       
                     (right here) ------^
> update_icews(dryrun = FALSE); date()
Downloading '20181208-icews-events.zip'
Ingesting records from '20181208-icews-events.tab'
Downloading '20181209-icews-events.zip'
Ingesting records from '20181209-icews-events.tab'
Downloading '20181210-icews-events.zip'
Ingesting records from '20181210-icews-events.tab'
Downloading '20181211-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) : 
  Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Error in value[[3L]](cond) : 
  Something went wrong in 'dataverse' or the Dataverse API, try again. Original error message:
Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181211-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) : 
  Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181211-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) : 
  Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181211-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) : 
  Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181211-icews-events.zip'
Ingesting records from '20181211-icews-events.tab'
Downloading '20181212-icews-events.zip'
Ingesting records from '20181212-icews-events.tab'
Downloading '20181213-icews-events.zip'
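
A minimal retry wrapper in base R that could serve as a stopgap until this is built in (the function and its arguments are hypothetical, not part of the package):

retry <- function(fn, times = 5, delay = 30) {
  for (i in seq_len(times)) {
    result <- tryCatch(fn(), error = identity)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", i, " failed: ", conditionMessage(result),
            " -- retrying in ", delay, " seconds")
    Sys.sleep(delay)
  }
  stop(result)
}

# retry(function() update_icews(dryrun = FALSE), times = 10, delay = 60)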

Add a path getter module

Something that takes the path arguments as input and returns normalized paths as output.

Why? Core behavior right now relies on arg = NULL defaults, and each user-facing function has path arguments, which requires a lot of duplicated code to substitute the correct paths when the environment variable option (ICEWS_DATA_DIR) is used.

Also use this for input validation (e.g. error if one path is NULL but the other is not).
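
A minimal sketch of such a getter, falling back to the ICEWS_DATA_DIR environment variable mentioned above (the function name and messages are hypothetical):

get_data_dir <- function(data_dir = NULL) {
  # Fall back to the environment variable when no explicit argument is given.
  if (is.null(data_dir)) {
    data_dir <- Sys.getenv("ICEWS_DATA_DIR", unset = "")
    if (!nzchar(data_dir)) {
      stop("No data directory set; pass `data_dir` or run setup_icews().",
           call. = FALSE)
    }
  }
  normalizePath(data_dir, mustWork = FALSE)
}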

Deal with TSV parsing failures

2007 has an abnormally low number of events; maybe some other years/files do as well. The cause might be parsing failures. Compare the number of lines here:

tsv2007 <- read_tsv(file.path(find_raw(), "events.2007.20150313083959.tab"))
str2007 <- read_lines(file.path(find_raw(), "events.2007.20150313083959.tab"))
tsv2008 <- read_tsv(file.path(find_raw(), "events.2008.20150313084156.tab"))
str2008 <- read_lines(file.path(find_raw(), "events.2008.20150313084156.tab"))

The 2007 TSV stops parsing after the events for February of that year. For 2008, the number of raw lines correctly matches the number of TSV records (plus 1 for the header row).

> nrow(tsv2007)
[1] 135693
> length(str2007)
[1] 1011162
> nrow(tsv2008)
[1] 980879
> length(str2008)
[1] 980880
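
readr can report where parsing went wrong; a sketch of diagnosing the 2007 file (the quote = "" idea is only a guess that stray quote characters are swallowing lines):

library(readr)

f <- file.path(find_raw(), "events.2007.20150313083959.tab")
tsv2007 <- read_tsv(f)
problems(tsv2007)    # rows and columns where parsing failed

# Guess: unbalanced quote characters can make the parser swallow many lines;
# disabling quoting may recover them.
tsv2007_noquote <- read_tsv(f, quote = "")
nrow(tsv2007_noquote)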

Remove triggers in stats tables due to speed issues

Any data update becomes painfully slow, I think because each of the potentially millions of inserts/deletes is a separate transaction that fires the trigger, meaning the stats tables are updated after every single write/remove.

Better to move this to R and manually rebuild the stats tables after the relevant operations.
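
A sketch of the manual rebuild, assuming a "stats" table with name/value columns (see the stats-table issue below; the exact schema is an assumption):

library(DBI)

# Hypothetical replacement for per-row triggers: recompute the stats table
# once, after the bulk ingest or delete has finished.
rebuild_stats <- function(con) {
  DBI::dbExecute(con, "DELETE FROM stats;")
  DBI::dbExecute(con, "INSERT INTO stats (name, value)
                       SELECT 'events_n', COUNT(*) FROM events;")
  invisible(TRUE)
}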

Add event data sample

This would be helpful for conveying the general structure, and could also be used in the vignettes and some examples.

Replace create_event_table() with equivalent sql create table

Other tables are created from SQL files, but the "events" table is not. Path dependence, probably because it was the first table I set up, or maybe because it has indices, which in any case can also be part of the CREATE TABLE SQL file.

Then just call "events.sql" with "execute_sql()" like the other tables.

update_icews fails with first file

After a fresh install on Ubuntu 18.04, the following fails with an error:

library("icews")
library("DBI")
library("dplyr")
library("usethis")
# Note: do not end the data_dir with a slash
setup_icews(data_dir = "/home/mk/Documents/data/icews", use_db = TRUE, keep_files = TRUE,  r_profile = TRUE)

update_icews(dryrun = TRUE)
update_icews(dryrun = FALSE)

The message I got after the last line of code is:

Downloading '20181004-icews-events.zip'
Ingesting records from '20181004-icews-events.tab'
Error in if (min(events$event_date) <= max_date_in_db) { : 
    valor ausente donde TRUE/FALSE es necesario

(the last line translates roughly to "missing value where TRUE/FALSE is necessary")

With keep_files = FALSE instead (after restarting R), this is the error

Ingesting records from '20181004-icews-events.tab'
Error in get_fileid.character(dataset, file, key = key, server = server,  : 
                                File not found

Same behaviour after updating all Ubuntu packages and running update.packages() in R.
R version 3.6.0 (2019-04-26)
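
The failing comparison suggests max_date_in_db was NA, perhaps because the events table was still empty when the first file was ingested; below is a guess at the kind of guard that would avoid the cryptic error (variable names are taken from the traceback, values are made up):

# Made-up values reproducing the failing condition:
events <- data.frame(event_date = as.Date("2018-10-04"))
max_date_in_db <- as.Date(NA)  # what an empty events table presumably yields

if (!is.na(max_date_in_db) && min(events$event_date) <= max_date_in_db) {
  message("overlap with existing records; delete and re-ingest")
} else {
  message("nothing in the DB yet; ingest everything")
}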

Integrate stats table and triggers

Use table "stats", for now only containing the tuple (events_n, [some number]), to store the number of rows in the main events table. This is one of the things that somewhat slows down dr_icews.

  • save the SQL create statements for the table and triggers at inst/sql
  • add functionality that can read and split SQL statements (multi-statement strings can't be executed from R in one call, I think); see the sketch after this list
  • add execution upon DB creation
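
A sketch of the read-and-split helper (splitting naively on ";" works for plain CREATE TABLE/INDEX files, but would break on trigger bodies, which contain embedded semicolons):

library(DBI)

# Hypothetical helper: read an .sql file, split it into individual statements,
# and execute them one at a time, since multi-statement strings can't be run
# in a single call from R (as noted above).
execute_sql_file <- function(con, path) {
  sql <- paste(readLines(path), collapse = "\n")
  statements <- trimws(strsplit(sql, ";", fixed = TRUE)[[1]])
  statements <- statements[nzchar(statements)]
  for (s in statements) {
    DBI::dbExecute(con, s)
  }
  invisible(length(statements))
}

# e.g. execute_sql_file(con, file.path("inst", "sql", "stats.sql"))  # hypothetical path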

Add gwcodes to events table

Right now getting gwcode-year counts and such from the DB is kind of hard since data merging has to happen in R.
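
A sketch of the R-side merge that is currently required; the lookup values below are illustrative only, and a real mapping from ICEWS country names to Gleditsch-Ward codes would be needed:

library(dplyr)

gw_lookup <- tibble(
  country = c("United States", "United Kingdom"),
  gwcode  = c(2L, 200L)
)

events_sample <- tibble(
  `Event ID` = 1:3,
  Country    = c("United States", "United Kingdom", "United States")
)

# Join the lookup and count events per gwcode, as one would for gwcode-year
# aggregates.
events_sample %>%
  left_join(gw_lookup, by = c("Country" = "country")) %>%
  count(gwcode, name = "events_n")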

404 error at download attempt

First, thank you for developing the icews package. I am trying to use the minimalist functionality and am running into an error.
This error occurs for both the update_icews() and download_data() functions when dryrun is set to FALSE. My setup has use_db = FALSE and keep_files = TRUE.

update_icews(dryrun = F)
Downloading 'events.1995.20150313082510.tab.zip'
Error in get_file(file_ref, get_doi()[[repo]]) : Not Found (HTTP 404).

I am hoping this is a common error and an answer is readily available. Thanks for your help.

Duplicate event handling

Event ID is not unique because there are duplicate events.

In all cases, the duplicate events can be distinguished by event date. And in all cases there are exactly 2 versions of each duplicate event.

events %>%
  group_by(`Event ID`) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  filter(n > 1) %>%
  group_by(`Event ID`, `Event Date`) %>%
  dplyr::summarize(n = n()) -> foo

> foo
# A tibble: 290,624 x 3
# Groups:   Event ID [?]
   `Event ID` `Event Date`     n
        <int> <date>       <int>
 1   20718170 2013-11-12       1
 2   20718170 2014-01-01       1
 3   20718171 2013-11-12       1
 4   20718171 2014-01-01       1
 5   20718172 2013-11-12       1
 6   20718172 2014-01-01       1
 7   20718173 2013-11-12       1
 8   20718173 2014-01-01       1
 9   20718174 2013-11-12       1
10   20718174 2014-01-01       1
# ... with 290,614 more rows
foo %>%
  group_by(`Event ID`) %>%
  summarize(n = n()) %>%
  group_by(n) %>%
  summarize(cases = n())

# A tibble: 1 x 2
      n  cases
  <int>  <int>
1     2 145312

What to do with these? Silently drop and keep the later date version?
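
A sketch of the "keep the later date" option, applied to the events data frame used above:

library(dplyr)

# For duplicated event IDs keep only the later-dated record; unique IDs are
# unaffected because their single date is trivially the maximum.
events_dedup <- events %>%
  group_by(`Event ID`) %>%
  filter(`Event Date` == max(`Event Date`)) %>%
  ungroup()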

Keep local and DVN file versions in sync

Sometimes a local file and associated event set will be superseded by a new version on DVN.

E.g. most likely this will occur with the current 2008 file as it expands to cover more of the year.

The file name patterns are consistent, events.[year].[yyyymmddhhmmss].tab.

Separate that into an event set (events.[year]) and a version based on the date?
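
A sketch of the split; the regular expressions only cover the yearly file pattern mentioned above:

fname <- "events.2008.20150313084156.tab"

# Event set: everything up to and including the year.
event_set <- sub("^(events\\.[0-9]{4})\\..*$", "\\1", fname)
# Version: the 14-digit timestamp between the year and the extension.
version <- sub("^events\\.[0-9]{4}\\.([0-9]{14})\\.tab$", "\\1", fname)

event_set
#> [1] "events.2008"
version
#> [1] "20150313084156"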

Can indices be built at DB creation?

Can the indices for a table be specified at DB/table creation? This would make the ingestion file by file easier (i.e. download, ingest, index, one file at a time).

Probably slower overall; what is the impact on speed of "ingest all at once, then index" versus "ingest and index file by file"?
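
Indices can be created right after the CREATE TABLE statement, before anything is ingested; a minimal sketch with a simplified schema (the real column list differs):

library(DBI)
library(RSQLite)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Create the table and its indices up front, then ingest file by file.
DBI::dbExecute(con, "CREATE TABLE events (event_id INTEGER, year INTEGER, source_file TEXT);")
DBI::dbExecute(con, "CREATE INDEX idx_events_year ON events (year);")
DBI::dbExecute(con, "CREATE INDEX idx_events_source_file ON events (source_file);")

DBI::dbDisconnect(con)

Whether this ends up slower than a single "ingest everything, then index" pass would still need benchmarking.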

print or format "icews_opts" is not showing correct values

When doing:

old_opts = unset_icews_opts()
old_opts

prints:

Options not set
data_dir: NULL
use_db: NULL
keep_files: NULL

even though old_opts has the correct values:

> str(old_opts)
List of 3
 $ data_dir  : chr "~/foo/icews_data"
 $ use_db    : logi TRUE
 $ keep_files: logi TRUE
 - attr(*, "class")= chr "icews_opts"
