icews's Issues

duplicate file names and event data in source Dataverse repository

After a fresh install on Ubuntu 18.04, the following fails with an error after downloading 151 files (73.1 MB):

library("icews")
library("DBI")
library("dplyr")
library("usethis")
setup_icews(data_dir = "/home/mk/Documents/data/icews", use_db = TRUE, keep_files = TRUE,  r_profile = TRUE)

update_icews(dryrun = TRUE)
update_icews(dryrun = FALSE)

# (...... downloads 151 files, ingesting correctly 294687 rows in sqlite database)
Downloading '20190309-icews-events.zip'
Error in writeBin(as.vector(f), tmp) : can only write vector objects

Re-running update_icews(dryrun = FALSE) repeatedly does not resolve the issue.

The following (launched after the error) might help:

> update_icews(dryrun = TRUE)
File system changes:
Found 151 local data file(s)
Downloading 84 file(s)
Removing 0 old file(s)

Database changes:
Deleting old records for 0 file(s)
Ingesting records from 84 file(s)

Plan:
Download            '20190309-icews-events.zip'
Download            '20190309-icews-events.zip'
Ingest records from '20190309-icews-events.tab'
Ingest records from '20190309-icews-events.tab'
Download            '20190311-icews-events.zip'
Ingest records from '20190311-icews-events.tab'
Download            '20190312-icews-events.zip'
Ingest records from '20190312-icews-events.tab'
Download            '20190313-icews-events.zip'
Ingest records from '20190313-icews-events.tab'
Download            '20190314-icews-events.zip'
(etc.)
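
A sketch of how the package could flag this up front, assuming the Dataverse file listing is available as a data frame with a "label" column holding file names (these names are assumptions, not the package's actual internals):

library(dplyr)

# Hypothetical helper: report file names that occur more than once in the
# Dataverse file listing, so duplicates can be skipped or reported before
# the download plan is built.
find_duplicate_labels <- function(file_list) {
  file_list %>%
    count(label) %>%
    filter(n > 1)
}

# Made-up metadata mirroring the plan above:
file_list <- tibble(label = c("20190309-icews-events.zip",
                              "20190309-icews-events.zip",
                              "20190311-icews-events.zip"))
find_duplicate_labels(file_list)
#> # A tibble: 1 x 2
#>   label                         n
#>   <chr>                     <int>
#> 1 20190309-icews-events.zip     2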

DB state does not include source files with all duplicate events

When adding events to the database, events whose event ID already exists are not added again. If all events in a ".tsv" source file are duplicates and thus none are added to the "events" table, the name of the source file is instead stored in the "null_source_files" table; the "source_files" list is derived from the "source_file" column of the "events" table, so such files would not show up there. However, the DB state getter currently does not include the null source files.
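
One possible fix, sketched under the assumption that both tables store file names in a column called "name" (the actual schema may differ), is to union the two tables when reading the DB state:

library(DBI)

# Hypothetical sketch: treat files recorded in 'null_source_files' as already
# ingested when determining the database state.
get_db_source_files <- function(con) {
  DBI::dbGetQuery(con, "
    SELECT name FROM source_files
    UNION
    SELECT name FROM null_source_files;")$name
}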

Store source_file list in separate table updated via trigger?

Getting the distinct source file list can take several seconds. Since it is often needed without any subsequent action being performed, the source file list could be kept in a second table that is updated automatically whenever the events table changes. This would trade a relatively trivial amount of additional time when deleting or inserting rows (which already takes very long) for a much faster read when determining the DB/events state.
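
A minimal sketch of what the trigger-maintained table could look like, using a simplified events schema (the real column list differs):

library(DBI)
library(RSQLite)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbExecute(con, "CREATE TABLE events (event_id INTEGER, source_file TEXT);")
DBI::dbExecute(con, "CREATE TABLE source_files (name TEXT PRIMARY KEY);")

# Keep 'source_files' in sync automatically whenever events are inserted;
# INSERT OR IGNORE avoids duplicate names.
DBI::dbExecute(con, "
  CREATE TRIGGER add_source_file AFTER INSERT ON events
  BEGIN
    INSERT OR IGNORE INTO source_files (name) VALUES (NEW.source_file);
  END;")

DBI::dbExecute(con, "INSERT INTO events VALUES (1, 'events.2007.tab');")
DBI::dbGetQuery(con, "SELECT * FROM source_files;")
DBI::dbDisconnect(con)

A corresponding AFTER DELETE trigger would also be needed, and it would have to check whether any remaining events still reference the deleted row's source file.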


DVN vs. Dataverse vs. Harvard Dataverse

Hi! I spotted this project on Twitter at https://twitter.com/andybeega/status/1103226111855607809

I noticed that you're using "DVN" in your README (e.g. "the current version on DVN") but in some contexts, "Harvard Dataverse" would be preferred. Let me try to summarize a few terms:

  • DVN: A somewhat awkward acronym for "Dataverse Network", which was the old name (in the 3.x and earlier days) for the software that is now called "Dataverse".
  • Dataverse: The current name for the software formerly known as "DVN".
  • Harvard Dataverse: One of 39 installations of Dataverse and home of ICEWS event data. Other installations of Dataverse include UNC Dataverse, Scholars Portal Dataverse, etc.

I'd be happy to make a pull request if you'd like. Please let me know. Thanks! Great project!

Also, if you're interested in helping with the "dataverse" R package, please leave a comment at IQSS/dataverse-client-r#21 😄

License of ICEWS data?

What is the license of ICEWS data? (not the license of the ICEWS R package, which is MIT license).

I cannot find the licence here: https://dataverse.harvard.edu/dataverse/icews

The latest individual file seems to imply: "For Official Use Only (FOUO), government sponsored research activities." See https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QI2T9A
Unsure whether it applies to all data. In any case, the "Terms" pane says "CC0 Public domain" (see screenshot).
Unsure what FOUO means; some users on Wikipedia (https://en.wikipedia.org/wiki/Talk:For_Official_Use_Only) say it is public domain, which is consistent with the "Terms" pane.

Still, look at this:
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/28075
Terms pane: RESTRICTIONS ON USE: THESE MATERIALS ARE SUBJECT TO COPYRIGHT PROTECTION AND MAY ONLY BE USED AND COPIED FOR RESEARCH AND EDUCATIONAL PURPOSES. THE MATERIALS MAY NOT BE USED OR COPIED FOR ANY COMMERCIAL PURPOSES. © 2015 Lockheed Martin Corporation and BBN-Raytheon. All rights reserved.

Add checks to make sure local files or DB exists at a user-specified path

Add a check so this becomes more informative:

> set_icews_opts("foo", TRUE, TRUE)
> read_icews()
Error in `$<-.data.frame`(`*tmp*`, "year", value = NA_integer_) : 
  replacement has 1 row, data has 0
In addition: Warning messages:
1: Unknown or uninitialised column: 'event_date'. 
2: In read_icews_raw(find_raw(), n_max) :
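
A minimal sketch of a friendlier check (the function name and message wording are hypothetical, not the package's actual API):

check_data_dir <- function(data_dir) {
  # Fail early with a readable message instead of an obscure downstream error.
  if (is.null(data_dir) || !dir.exists(data_dir)) {
    stop("Data directory '", data_dir, "' does not exist; check the path ",
         "passed to set_icews_opts() or setup_icews().", call. = FALSE)
  }
  invisible(data_dir)
}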

download_data is not working with user-specific path

old_opts <- unset_icews_opts()
download_data(to_dir = "~/Downloads/icews_data", update = TRUE, dryrun = TRUE)
Error in find_path("raw") : Path argument is missing.
Consider setting the paths up globally with `setup_icews()`.
Ideally in your .Rprofile file; try running `dr_icews()` for help. 

Auto-retry when dataverse is slow

Often, the Dataverse server is slow and update_icews() stops with an error. It would be great to have an option to relaunch it automatically in such cases (perhaps after a delay, specified in seconds); a possible wrapper is sketched after the log below. There are at least two types of errors for which relaunching works:

  • Gateway Timeout (HTTP 504).

  • parse error: premature EOF

> update_icews(dryrun = FALSE); date()
Error in value[[3L]](cond) : 
  Something went wrong in 'dataverse' or the Dataverse API, try again. Original error message:
Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181129-icews-events.zip'
Ingesting records from '20181129-icews-events.tab'
Downloading '20181130-icews-events.zip'
Ingesting records from '20181130-icews-events.tab'
Downloading '20181203-icews-events.zip'
Ingesting records from '20181203-icews-events.tab'
Downloading '20181204-icews-events.zip'
Ingesting records from '20181204-icews-events.tab'
Downloading '20181205-icews-events.zip'
Ingesting records from '20181205-icews-events.tab'
Downloading '20181206-icews-events.zip'
Ingesting records from '20181206-icews-events.tab'
Downloading '20181207-icews-events.zip'
Ingesting records from '20181207-icews-events.tab'
Downloading '20181208-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) : 
  Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Error in value[[3L]](cond) : 
  Something went wrong in 'dataverse' or the Dataverse API, try again. Original error message:
parse error: premature EOF
                                       
                     (right here) ------^
> update_icews(dryrun = FALSE); date()
Downloading '20181208-icews-events.zip'
Ingesting records from '20181208-icews-events.tab'
Downloading '20181209-icews-events.zip'
Ingesting records from '20181209-icews-events.tab'
Downloading '20181210-icews-events.zip'
Ingesting records from '20181210-icews-events.tab'
Downloading '20181211-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) : 
  Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Error in value[[3L]](cond) : 
  Something went wrong in 'dataverse' or the Dataverse API, try again. Original error message:
Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181211-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) : 
  Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181211-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) : 
  Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181211-icews-events.zip'
Error in get_file(file_ref, get_doi()[[repo]]) : 
  Gateway Timeout (HTTP 504).
> update_icews(dryrun = FALSE); date()
Downloading '20181211-icews-events.zip'
Ingesting records from '20181211-icews-events.tab'
Downloading '20181212-icews-events.zip'
Ingesting records from '20181212-icews-events.tab'
Downloading '20181213-icews-events.zip'
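
A minimal retry wrapper in base R that could serve as a stopgap until this is built in (the function and its arguments are hypothetical, not part of the package):

retry <- function(fn, times = 5, delay = 30) {
  for (i in seq_len(times)) {
    result <- tryCatch(fn(), error = identity)
    if (!inherits(result, "error")) {
      return(result)
    }
    message("Attempt ", i, " failed: ", conditionMessage(result),
            " -- retrying in ", delay, " seconds")
    Sys.sleep(delay)
  }
  stop(result)
}

# retry(function() update_icews(dryrun = FALSE), times = 10, delay = 60)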

Add a path getter module

Something that takes the path arguments as input and returns normalized paths as output.

Why? Core behavior right now relies on arg = NULL defaults, and each user-facing function has path arguments, which requires a lot of duplicated code to substitute the correct paths when the environment variable option (ICEWS_DATA_DIR) is used.

Also use this for input validation (e.g. error if one path is NULL but the other is not).
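
A minimal sketch of such a getter, falling back to the ICEWS_DATA_DIR environment variable mentioned above (the function name and messages are hypothetical):

get_data_dir <- function(data_dir = NULL) {
  # Fall back to the environment variable when no explicit argument is given.
  if (is.null(data_dir)) {
    data_dir <- Sys.getenv("ICEWS_DATA_DIR", unset = "")
    if (!nzchar(data_dir)) {
      stop("No data directory set; pass `data_dir` or run setup_icews().",
           call. = FALSE)
    }
  }
  normalizePath(data_dir, mustWork = FALSE)
}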

Deal with TSV parsing failures

2007 has an abnormally low number of events; maybe some other years/files do as well. The cause might be parsing failures. Compare the number of lines here:

tsv2007 <- read_tsv(file.path(find_raw(), "events.2007.20150313083959.tab"))
str2007 <- read_lines(file.path(find_raw(), "events.2007.20150313083959.tab"))
tsv2008 <- read_tsv(file.path(find_raw(), "events.2008.20150313084156.tab"))
str2008 <- read_lines(file.path(find_raw(), "events.2008.20150313084156.tab"))

The 2007 TSV stops parsing after the events for February of that year. For 2008, the number of raw lines correctly matches the number of TSV records (plus 1 for the header row).

> nrow(tsv2007)
[1] 135693
> length(str2007)
[1] 1011162
> nrow(tsv2008)
[1] 980879
> length(str2008)
[1] 980880
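
readr can report where parsing went wrong; a sketch of diagnosing the 2007 file (the quote = "" idea is only a guess that stray quote characters are swallowing lines):

library(readr)

f <- file.path(find_raw(), "events.2007.20150313083959.tab")
tsv2007 <- read_tsv(f)
problems(tsv2007)    # rows and columns where parsing failed

# Guess: unbalanced quote characters can make the parser swallow many lines;
# disabling quoting may recover them.
tsv2007_noquote <- read_tsv(f, quote = "")
nrow(tsv2007_noquote)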

Remove triggers in stats tables due to speed issues

Any data update becomes painfully slow, I think because each of the potentially millions of inserts/deletes is a separate transaction that fires the trigger, meaning the stats tables are updated after every single write/remove.

Better to move this to R and manually rebuild the stats tables after the relevant operations.
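
A sketch of the manual rebuild, assuming a "stats" table with name/value columns (see the stats-table issue below; the exact schema is an assumption):

library(DBI)

# Hypothetical replacement for per-row triggers: recompute the stats table
# once, after the bulk ingest or delete has finished.
rebuild_stats <- function(con) {
  DBI::dbExecute(con, "DELETE FROM stats;")
  DBI::dbExecute(con, "INSERT INTO stats (name, value)
                       SELECT 'events_n', COUNT(*) FROM events;")
  invisible(TRUE)
}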

Add event data sample

This would be helpful for conveying the general structure, and could also be used in the vignettes and some examples.

Replace create_event_table() with equivalent sql create table

Other tables are created from SQL files, but the "events" table is not. Path dependence, probably because it was the first table I set up, or maybe because it has indices, which in any case can also be part of the CREATE TABLE SQL file.

Then just call "events.sql" with "execute_sql()" like the other tables.

update_icews fails with first file

After a fresh install on Ubuntu 18.04, the following fails with an error:

library("icews")
library("DBI")
library("dplyr")
library("usethis")
# Note: do not end the data_dir with a slash
setup_icews(data_dir = "/home/mk/Documents/data/icews", use_db = TRUE, keep_files = TRUE,  r_profile = TRUE)

update_icews(dryrun = TRUE)
update_icews(dryrun = FALSE)

The message I got after the last line of code is:

Downloading '20181004-icews-events.zip'
Ingesting records from '20181004-icews-events.tab'
Error in if (min(events$event_date) <= max_date_in_db) { : 
    valor ausente donde TRUE/FALSE es necesario

(the last line translates roughly to "missing value where TRUE/FALSE is necessary")

With keep_files = FALSE instead (after restarting R), this is the error

Ingesting records from '20181004-icews-events.tab'
Error in get_fileid.character(dataset, file, key = key, server = server,  : 
                                File not found

Same behaviour after updating all Ubuntu packages and running update.packages() in R.
R version 3.6.0 (2019-04-26)
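
The failing comparison suggests max_date_in_db was NA, perhaps because the events table was still empty when the first file was ingested; below is a guess at the kind of guard that would avoid the cryptic error (variable names are taken from the traceback, values are made up):

# Made-up values reproducing the failing condition:
events <- data.frame(event_date = as.Date("2018-10-04"))
max_date_in_db <- as.Date(NA)  # what an empty events table presumably yields

if (!is.na(max_date_in_db) && min(events$event_date) <= max_date_in_db) {
  message("overlap with existing records; delete and re-ingest")
} else {
  message("nothing in the DB yet; ingest everything")
}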

Integrate stats table and triggers

Use table "stats", for now only containing the tuple (events_n, [some number]), to store the number of rows in the main events table. This is one of the things that somewhat slows down dr_icews.

  • save the SQL create statements for the table and triggers at inst/sql
  • add functionality that can read and split SQL statements (multi-statement strings can't be executed from R in one call, I think); see the sketch after this list
  • add execution upon DB creation
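
A sketch of the read-and-split helper (splitting naively on ";" works for plain CREATE TABLE/INDEX files, but would break on trigger bodies, which contain embedded semicolons):

library(DBI)

# Hypothetical helper: read an .sql file, split it into individual statements,
# and execute them one at a time, since multi-statement strings can't be run
# in a single call from R (as noted above).
execute_sql_file <- function(con, path) {
  sql <- paste(readLines(path), collapse = "\n")
  statements <- trimws(strsplit(sql, ";", fixed = TRUE)[[1]])
  statements <- statements[nzchar(statements)]
  for (s in statements) {
    DBI::dbExecute(con, s)
  }
  invisible(length(statements))
}

# e.g. execute_sql_file(con, file.path("inst", "sql", "stats.sql"))  # hypothetical path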

Add gwcodes to events table

Right now getting gwcode-year counts and such from the DB is kind of hard since data merging has to happen in R.
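
A sketch of the R-side merge that is currently required; the lookup values below are illustrative only, and a real mapping from ICEWS country names to Gleditsch-Ward codes would be needed:

library(dplyr)

gw_lookup <- tibble(
  country = c("United States", "United Kingdom"),
  gwcode  = c(2L, 200L)
)

events_sample <- tibble(
  `Event ID` = 1:3,
  Country    = c("United States", "United Kingdom", "United States")
)

# Join the lookup and count events per gwcode, as one would for gwcode-year
# aggregates.
events_sample %>%
  left_join(gw_lookup, by = c("Country" = "country")) %>%
  count(gwcode, name = "events_n")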

404 error at download attempt

First, thank you for developing the icews package. I am trying to use the minimalist functionality and am running into an error.
This error occurs for both the update_icews() and download_data() functions when dryrun is set to FALSE. My setup has use_db = FALSE and keep_files = TRUE.

update_icews(dryrun = F)
Downloading 'events.1995.20150313082510.tab.zip'
Error in get_file(file_ref, get_doi()[[repo]]) : Not Found (HTTP 404).

I am hoping this is a common error and an answer is readily available. Thanks for your help.

Duplicate event handling

Event ID is not unique because there are duplicate events.

In all cases, the duplicate events can be distinguished by event date. And in all cases there are exactly 2 versions of each duplicate event.

events %>%
  group_by(`Event ID`) %>%
  mutate(n = n()) %>%
  ungroup() %>%
  filter(n > 1) %>%
  group_by(`Event ID`, `Event Date`) %>%
  dplyr::summarize(n = n()) -> foo

> foo
# A tibble: 290,624 x 3
# Groups:   Event ID [?]
   `Event ID` `Event Date`     n
        <int> <date>       <int>
 1   20718170 2013-11-12       1
 2   20718170 2014-01-01       1
 3   20718171 2013-11-12       1
 4   20718171 2014-01-01       1
 5   20718172 2013-11-12       1
 6   20718172 2014-01-01       1
 7   20718173 2013-11-12       1
 8   20718173 2014-01-01       1
 9   20718174 2013-11-12       1
10   20718174 2014-01-01       1
# ... with 290,614 more rows
foo %>%
  group_by(`Event ID`) %>%
  summarize(n = n()) %>%
  group_by(n) %>%
  summarize(cases = n())

# A tibble: 1 x 2
      n  cases
  <int>  <int>
1     2 145312

What to do with these? Silently drop and keep the later date version?
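
A sketch of the "keep the later date" option, applied to the events data frame used above:

library(dplyr)

# For duplicated event IDs keep only the later-dated record; unique IDs are
# unaffected because their single date is trivially the maximum.
events_dedup <- events %>%
  group_by(`Event ID`) %>%
  filter(`Event Date` == max(`Event Date`)) %>%
  ungroup()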

Keep local and DVN file versions in sync

Sometimes a local file and associated event set will be superseded by a new version on DVN.

E.g. most likely this will occur with the current 2008 file as it expands to cover more of the year.

The file name patterns are consistent, events.[year].[yyyymmddhhmmss].tab.

Separate that into an event set (events.[year]) and a version based on the date?
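
A sketch of the split; the regular expressions only cover the yearly file pattern mentioned above:

fname <- "events.2008.20150313084156.tab"

# Event set: everything up to and including the year.
event_set <- sub("^(events\\.[0-9]{4})\\..*$", "\\1", fname)
# Version: the 14-digit timestamp between the year and the extension.
version <- sub("^events\\.[0-9]{4}\\.([0-9]{14})\\.tab$", "\\1", fname)

event_set
#> [1] "events.2008"
version
#> [1] "20150313084156"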

Can indices be built at DB creation?

Can the indices for a table be specified at DB/table creation? This would make the ingestion file by file easier (i.e. download, ingest, index, one file at a time).

Probably slower overall; what is the impact on speed of "ingest all at once, then index" versus "ingest and index file by file"?
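
Indices can be created right after the CREATE TABLE statement, before anything is ingested; a minimal sketch with a simplified schema (the real column list differs):

library(DBI)
library(RSQLite)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Create the table and its indices up front, then ingest file by file.
DBI::dbExecute(con, "CREATE TABLE events (event_id INTEGER, year INTEGER, source_file TEXT);")
DBI::dbExecute(con, "CREATE INDEX idx_events_year ON events (year);")
DBI::dbExecute(con, "CREATE INDEX idx_events_source_file ON events (source_file);")

DBI::dbDisconnect(con)

Whether this ends up slower than a single "ingest everything, then index" pass would still need benchmarking.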

print or format "icews_opts" is not showing correct values

When doing:

old_opts = unset_icews_opts()
old_opts

prints:

Options not set
data_dir: NULL
use_db: NULL
keep_files: NULL

even though old_opts has the correct values:

> str(old_opts)
List of 3
 $ data_dir  : chr "~/foo/icews_data"
 $ use_db    : logi TRUE
 $ keep_files: logi TRUE
 - attr(*, "class")= chr "icews_opts"
