nldi-crawler's Introduction

NLDI Crawler

The Crawler is used to ingest data and link it to the network if it is not already linked. The only requirement is that the source system is able to provide GeoJSON via a web request with the necessary attributes. A database table (nldi_data.crawler_source) contains metadata about the GeoJSON. Data can be linked to the network via latitude/longitude coordinates or via NHDPlus reachcode and measure.

Contributing

Contributions can be made via pull request to this file.

Current nldi_data.crawler_source table fields:

crawler_source_id: An integer used to identify the source when starting the crawler.
source_name: A human-oriented name for the source.
source_suffix: The suffix to use in NLDI service URLs to identify the source.
source_uri: A URI the crawler can use to retrieve source data to be indexed by the crawling method.
feature_id: The attribute in the returned data used to identify the feature for use in NLDI service URLs.
feature_name: A human-readable name used to label the source feature.
feature_uri: A URI that can be used to access information about the feature.
feature_reach: (conditionally optional) The attribute in the source feature data where the crawler can find a reachcode.
feature_measure: (conditionally optional) The attribute in the source feature data where the crawler can find a measure to be used with the reachcode. Strings are parsed into numbers if the measure is represented as a string.
ingest_type: Either reach or point. If reach, the feature_reach and feature_measure fields must be populated.
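
For illustration only, a new source row could be registered with a plain JDBC insert along these lines; the connection details and every column value below are hypothetical placeholders, not an existing source.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class RegisterSourceSketch {
    public static void main(String[] args) throws Exception {
        String sql = "INSERT INTO nldi_data.crawler_source "
            + "(crawler_source_id, source_name, source_suffix, source_uri, feature_id, "
            + "feature_name, feature_uri, feature_reach, feature_measure, ingest_type) "
            + "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/nldi", "nldi_user", "changeMe");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, 99);                                      // crawler_source_id
            ps.setString(2, "Example Monitoring Sites");           // source_name
            ps.setString(3, "example");                            // source_suffix
            ps.setString(4, "https://example.org/sites.geojson");  // source_uri
            ps.setString(5, "site_id");                            // feature_id
            ps.setString(6, "site_name");                          // feature_name
            ps.setString(7, "site_uri");                           // feature_uri
            ps.setString(8, null);                                 // feature_reach (unused for point)
            ps.setString(9, null);                                 // feature_measure (unused for point)
            ps.setString(10, "point");                             // ingest_type
            ps.executeUpdate();
        }
    }
}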

Developer Environment

nldi-db contains everything you need to set up a development database environment. It includes a demo database with data for the Yahara River in Wisconsin.

Configuration

To run the Crawler project you will need to create an application.yml file in the project's root directory and add the following:

nldiDbHost: <hostNameOfDatabase>
nldiDbPort: <portNumberForDatabase>
nldiDbUsername: <dbUserName>
nldiDbPassword: <dbPassword>
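
For context, properties like these are typically injected into a Spring Boot DataSource roughly as in the following sketch; this is a generic illustration rather than the crawler's actual configuration class, and the database name nldi is an assumption.

import javax.sql.DataSource;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.jdbc.DataSourceBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Generic sketch only; the crawler's real configuration class may look different.
@Configuration
public class DataSourceSketch {

    @Value("${nldiDbHost}")
    private String host;

    @Value("${nldiDbPort}")
    private int port;

    @Value("${nldiDbUsername}")
    private String username;

    @Value("${nldiDbPassword}")
    private String password;

    @Bean
    public DataSource dataSource() {
        return DataSourceBuilder.create()
                .url("jdbc:postgresql://" + host + ":" + port + "/nldi")
                .username(username)
                .password(password)
                .build();
    }
}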

Dependencies

Project dependencies can be downloaded through your preferred IDE or command line utility.

For Maven, you can use the following command.

mvn dependency:resolve

Testing

This project contains unit and integration tests.

To run unit tests, use the following command.

mvn test

To run integration tests, you will need to have Docker installed on your system; then you can use the following command.

mvn verify

Running the Crawler

There are several options to run the Crawler depending on your preferences or development environment.

Maven

To run with Maven, use the following command, replacing <crawler_source_id> with the intended integer ID.

mvn spring-boot:run -Dspring-boot.run.arguments="<crawler_source_id>"

JAR File

After packaging the project, you can run the JAR file directly with the following command.

java -jar target/nldi-crawler-<build version>.jar <crawler_source_id>

Refer to the target directory to determine the build version. For further instructions on running the Crawler via JAR file, see RUNNING.md.

Docker

To run via Docker Compose, create a secrets.env file with the following format:

nldiDbHost=<hostNameOfDatabase>
nldiDbPort=<portNumberForDatabase>
nldiDbUsername=<dbUserName>
nldiDbPassword=<dbPassword>

and run with:

docker-compose run -e CRAWLER_SOURCE_ID=<crawler_source_id> nldi-crawler

Sequence Diagram

The image below is a sequence diagram detailing how the NLDI crawler operates.

Sequence Diagram

An internal user starts the crawler with an input source value. The crawler gathers information from the database for that source. A GET request to the source URL is made to get the target GeoJSON features. That collection of features is then looped through and each one is added as a row to a database table specific to the feature source.
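
In code, that flow looks roughly like the sketch below, written with the Java 11 HttpClient and Jackson purely for illustration; the hypothetical source URI and the println stand in for the database lookup and the per-feature inserts the real crawler performs.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class CrawlFlowSketch {
    public static void main(String[] args) throws Exception {
        // Normally this URI comes from nldi_data.crawler_source for the requested source ID.
        String sourceUri = "https://example.org/sites.geojson";

        // GET the GeoJSON feature collection from the source.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(sourceUri)).GET().build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Loop over the features; the real crawler inserts each one as a row in a
        // source-specific table and links it by point or by reachcode + measure.
        JsonNode collection = new ObjectMapper().readTree(body);
        for (JsonNode feature : collection.get("features")) {
            System.out.println(feature.path("properties").toString());
        }
    }
}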

nldi-crawler's People

Contributors

abriggs-usgs, codacy-badger, danielnaab, dblodgett-usgs, dependabot-preview[bot], dependabot-support, dependabot[bot], dsteinich, kkehl-usgs, mbucknell, skaymen, ssoper-usgs

nldi-crawler's Issues

Replace Jenkins pipeline files

We have a mirror in place for this repository which will allow us to run build and deploy pipelines internally. The Jenkins files should be removed and replaced with a .gitlab-ci.yml containing an include: section that references the internal pipelines. As part of this, we can also remove the .travis.yml file.

Create crawler execution / admin app

Some ideas, some more reasonable than others:

  • social login options. maybe just github + google
  • nested RBAC w/ organization and user levels similar to github orgs
  • workspace listing each dataset
  • wizard to add dataset link, configure which columns correspond to required fields, perhaps with preview of header + first two lines of target dataset
  • option to upload dataset to some default space (subject to reasonable storage limits)
  • "refresh" buttons to initiate recrawl of any datasets that don't need to change configuration
  • Stats
    • total calls involving each dataset over time (line-graph)
    • Call counts of each dataset under user's management as originating point with each other NLDI dataset as navigation target
    • Call counts of each NLDI dataset as originating point with each dataset under user's management as navigation target

Reach goal: an option to generate URIs as geoconnex IDs using a namespace/stem text entry plus a column select for the ID, with a prompt to go finalize at the geoconnex GitHub (or a similar system, TBD).

Add inventory of dam features?

Not sure if this is the right place to ask for this feature, but when doing up/downstream queries it is often necessary to know whether there is a dam in place.

ID must be in properties.

Per https://github.com/ACWI-SSWD/nldi-crawler/blob/master/src/main/java/gov/usgs/owi/nldi/service/Ingestor.java#L160 the crawler isn't looking at an "id" that's not in the properties.

Standard GeoJSON supports an "id" at the level of the actual feature.

e.g. https://locations.newmexicowaterdata.org/collections/Things/items/4201?f=json

Seems like we could add the "id" to the properties, if there isn't an "id" in the properties already, here: https://github.com/ACWI-SSWD/nldi-crawler/blob/master/src/main/java/gov/usgs/owi/nldi/service/Ingestor.java#L170
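
One possible shape of that change, sketched here with Jackson tree nodes; the helper name is hypothetical and this is not the actual Ingestor code.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class FeatureIdSketch {
    // Hypothetical sketch: if a feature carries a top-level "id" but has no "id"
    // in its properties, copy it down so the existing feature_id lookup can find it.
    static void copyTopLevelId(ObjectNode feature) {
        JsonNode properties = feature.get("properties");
        if (feature.hasNonNull("id") && properties instanceof ObjectNode
                && !properties.has("id")) {
            ((ObjectNode) properties).set("id", feature.get("id"));
        }
    }
}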

Determine native approach to create database tables during tests

There are several tests that need a _temp table to exist and to be preloaded with data. The @DatabaseSetup annotation does not allow for table creation, only data injection. An approach needs to be found that allows these tables to be created temporarily in the CI database before the tests are run.

My initial thought would be to have a @BeforeClass (or similar) function that runs a SQL command to create the necessary table(s). I have tried a few variations of this, but did not get a successful run. Further research needs to be done to determine whether we could use an available annotation, or potentially grab the jdbc connection to run the command.
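
As one illustration of the @BeforeClass idea, untested and with placeholder connection details and table names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.junit.BeforeClass;

public class TempTableSetupSketchIT {

    // Hypothetical sketch: create the _temp table the tests expect before any test runs.
    @BeforeClass
    public static void createTempTable() throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/nldi", "nldi_user", "changeMe");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE IF NOT EXISTS nldi_data.feature_example_temp "
                + "(LIKE nldi_data.feature_example INCLUDING ALL)");
        }
    }
}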

A workaround for this issue was implemented in #141

Only keep records that have been indexed to a catchment/flowline ID.

Currently, the crawler keeps all records that are read in, whether they get indexed or not. The crawler should only keep data that indexes to a comid.

When a crawl finishes, no rows with NULL comids should remain in the NLDI database. This could be made configurable but default to drop un-indexed features.
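
A minimal sketch of that cleanup step, assuming a per-source feature table with a comid column; the helper and table name are hypothetical.

import java.sql.Connection;
import java.sql.Statement;

public class CrawlCleanupSketch {
    // Hypothetical sketch of the requested behavior: after a crawl completes, remove
    // rows that were never matched to a comid. The table name is assumed to already be
    // validated against the registered source suffixes.
    static int dropUnindexedFeatures(Connection conn, String featureTable) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            return stmt.executeUpdate("DELETE FROM " + featureTable + " WHERE comid IS NULL");
        }
    }
}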

table suffixes with dashes

The question is how to handle table names based on the source_suffix value. If the suffix has a dash in it, does that work with Postgres table names?

Prefer to raise an error rather than silently coercing the table name.
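
One way to raise that error rather than coerce, as a hedged sketch; the regex and the feature_<suffix> naming convention are assumptions.

public class SuffixValidationSketch {
    // Hypothetical sketch: fail fast on suffixes that are not safe, unquoted Postgres
    // identifiers instead of silently coercing them into a table name.
    static String tableNameForSuffix(String sourceSuffix) {
        if (sourceSuffix == null || !sourceSuffix.matches("[a-z][a-z0-9_]*")) {
            throw new IllegalArgumentException(
                "source_suffix '" + sourceSuffix + "' cannot be used as a table name suffix");
        }
        return "feature_" + sourceSuffix;
    }
}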

Rewrite crawler in more accessible language.

I want to write a dockerized version of the crawler in R so I can contribute crawler code myself. Having a pattern for both python and R would be really nice. I doubt it would be too heavy a lift but need to do a little research on how it would work out.

Also add index on comid

With the larger tables from WaDE and the census blocks, it's becoming clear that we need indexes on the feature tables.

The crawler should be adding a comid index to help improve join performance.

https://github.com/internetofwater/nldi-crawler/blob/master/src/main/resources/mybatis/mappers/ingest.xml#LL30C4-L30C4 seems to be where it is being handled during ingest.

It looks like it maybe adopts from this? https://github.com/internetofwater/nldi-db/blob/d372160066d343ddd671124049a6e2e95488c1b2/liquibase/changeLogs/nldi/nldi_data/indexes/featureId.yml#L4 ?

Maybe @EthanGrahn can pop in and give us a pointer on a quick way to add a comid index?
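
For reference, the index itself is a one-line statement; here is a hedged JDBC sketch with illustrative names, though in practice it would more likely be added in the ingest mapper or a Liquibase changelog like the one linked above.

import java.sql.Connection;
import java.sql.Statement;

public class ComidIndexSketch {
    // Hypothetical sketch: add a comid index on a source-specific feature table after
    // ingest. Schema, table, and index names follow an assumed naming convention.
    static void addComidIndex(Connection conn, String sourceSuffix) throws Exception {
        String table = "nldi_data.feature_" + sourceSuffix;
        String index = "feature_" + sourceSuffix + "_comid_idx";
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE INDEX IF NOT EXISTS " + index + " ON " + table + " (comid)");
        }
    }
}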

Handling lake data linking

Moving conversation from this issue here.

I'm generally uncomfortable with the semantics of lakes as wide rivers have many (all?) of the same characteristics from a data model point of view and need to be treated on the same spectrum of data.

I'm confused, so are you saying lakes are treated as wide rivers now? I think that would actually be a fairly reasonable way to do that. Especially if the lake catchments were calculated and up-catchment datapoints were linked to the lake itself. Doesn't really matter if the system knows it is a lake or not, or just a really wide river.

Update tests to utilize latest CI database

The Crawler integration tests had an expectation of data and tables that were already in the CI database. The CI database no longer contains any data or temp tables. Integration tests need to be updated to create the data they expect so that they can use the latest CI Docker image.

Add WaDE sites to NLDI

An NLDI-ready GeoJSON of sites from the WaDE 2.0 services is available here: https://www.hydroshare.org/resource/5f665b7b82d74476930712f7e423a0d2/

source_name: Water Data Exchange 2.0 Sites
source_suffix: wade
source_uri: https://www.hydroshare.org/resource/5f665b7b82d74476930712f7e423a0d2/data/contents/wade.geojson
feature_id: feature_id
feature_name: feature_name
feature_uri: feature_uri
feature_reach: NULL
feature_measure: NULL
ingest_type: point

If the team is not comfortable adding this to production, it would be ok in the near term to have it on beta/QA. It includes about 1M points.

Temp tables not always being cleaned up

The crawler creates a temporary table during ingestion and is supposed to drop the table once complete. I've noticed that there are temp tables in our deployed database that did not get cleaned up. (different issue from internetofwater/nldi-db#36) There needs to be investigation on what causes the cleanup to fail.
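
Until the root cause is found, leftover tables can at least be spotted with a small audit query; this sketch assumes the temp tables live in the nldi_data schema and end in _temp.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class TempTableAuditSketch {
    // Hypothetical helper: list tables in the nldi_data schema that look like leftover
    // crawler temp tables so they can be reviewed and dropped manually.
    static void listLeftoverTempTables(Connection conn) throws Exception {
        String sql = "SELECT table_name FROM information_schema.tables "
            + "WHERE table_schema = 'nldi_data' AND table_name LIKE '%\\_temp'";
        try (Statement stmt = conn.createStatement(); ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println("Leftover temp table: " + rs.getString("table_name"));
            }
        }
    }
}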
