beam-samples's Introduction

Apache Beam samples

This project is composed of several samples. Their purpose is to download and analyze GDELT Project data using Apache Beam pipelines.

The objectives are:

  1. Show how to implement Apache Beam pipelines for both streaming and batch analyses.
  2. Show how to run those pipelines on several runners.

The GDELT Project stores all news articles as "events": http://data.gdeltproject.org/events/index.html

Each day, a zip file is created containing a tab-delimited CSV file with all events, in the following format:

545037848       20150530        201505  2015    2015.4110                                                                                       JPN     TOKYO   JPN                                                             1       046     046     04      1       7.0     15      1       15      -1.06163552535792       0                                                       4       Tokyo, Tokyo, Japan     JA      JA40    35.685  139.751 -246227 4       Tokyo, Tokyo, Japan     JA      JA40    35.685  139.751 -246227 20160529        http://deadline.com/print-article/1201764227/

The format is described in the GDELT codebook: http://data.gdeltproject.org/documentation/GDELT-Data_Format_Codebook.pdf
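
As a rough illustration of the format above, the following Java sketch reads the first entry of a daily zip file and splits each line on tabs. The class GdeltEventReader, its Event type and the choice of fields are hypothetical and not part of this project. It assumes the file is tab-delimited, that the first two columns are the event ID and the day, and that the last column is the source URL, as in the sample line above. Check the codebook before relying on any column position.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipInputStream;

/** Hypothetical helper, not part of beam-samples: reads GDELT event rows from a zip. */
public class GdeltEventReader {

    /** Minimal view of one event row; the selected fields are illustrative only. */
    public static final class Event {
        public final String globalEventId;
        public final String day;
        public final String sourceUrl;

        Event(String globalEventId, String day, String sourceUrl) {
            this.globalEventId = globalEventId;
            this.day = day;
            this.sourceUrl = sourceUrl;
        }
    }

    /** Splits one tab-delimited row. Column positions are assumptions (see lead-in). */
    public static Event parseLine(String line) {
        // A limit of -1 keeps trailing empty fields, which are common in GDELT rows.
        String[] fields = line.split("\t", -1);
        if (fields.length < 3) {
            throw new IllegalArgumentException("Expected at least 3 tab-separated fields");
        }
        return new Event(fields[0], fields[1], fields[fields.length - 1]);
    }

    /** Reads every non-empty line of the first zip entry and parses it. */
    public static List<Event> readZip(InputStream in) throws IOException {
        List<Event> events = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            if (zip.getNextEntry() == null) {
                return events; // empty archive
            }
            BufferedReader reader =
                new BufferedReader(new InputStreamReader(zip, StandardCharsets.UTF_8));
            String line;
            // ZipInputStream signals end-of-stream at the end of the current entry.
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    events.add(parseLine(line));
                }
            }
        }
        return events;
    }
}
```

The samples in this project read the zip files through Beam itself; this standalone version is only meant to make the file layout concrete.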

Compiling and packaging the samples

To build the samples, run:

mvn clean package

Executing the examples

We provide Maven profiles to execute the pipelines on each runner. Activate the profile and choose the appropriate runner:

Direct Runner

mvn exec:java -Dexec.mainClass=org.apache.beam.samples.EventsByLocation -Pdirect-runner -Dexec.args="--runner=DirectRunner --input=/home/dataset/gdelt/2014-2016/201605*.zip --output=/tmp/gdelt/output/"

Spark Runner

mvn exec:java -Dexec.mainClass=org.apache.beam.samples.EventsByLocation -Pspark3-runner -Dexec.args="--runner=SparkRunner --input=/home/dataset/gdelt/2014-2016/201605*.zip --output=/tmp/gdelt/output/"

Flink Runner

mvn exec:java -Dexec.mainClass=org.apache.beam.samples.EventsByLocation -Pflink-runner -Dexec.args="--runner=FlinkRunner --input=/home/dataset/gdelt/2014-2016/201605*.zip --output=/tmp/gdelt/output/"

Google Dataflow Runner

mvn exec:java -Dexec.mainClass=org.apache.beam.samples.EventsByLocation -Pflink-runner -Dexec.args="--runner=DataflowRunner --input=/home/dataset/gdelt/2014-2016/201605*.zip --output=/tmp/gdelt/output/"

Google Dataflow Runner (blocking)

mvn exec:java -Dexec.mainClass=org.apache.beam.samples.EventsByLocation -Pflink-runner -Dexec.args="--runner=BlockingDataflowRunner --input=/home/dataset/gdelt/2014-2016/201605*.zip --output=/tmp/gdelt/output/"

Test infrastructure

Some of the samples require supporting infrastructure, e.g. brokers, filesystems and databases.

We provide a convenient way to make this infrastructure available using docker-compose.

beam-samples's People

Contributors

aromanenko-dev, coheigea, dependabot[bot], echauchot, edgarlgb, iemejia, jbonofre, psolomin

beam-samples's Issues

beam is unable to read hdfs file

hello ,

i am ajay sharma and using your code for beam integration with hadoop.
but seems its not working.

any suggestions ?

thanks
ajay

Security Policy violation SECURITY.md

This issue was automatically created by Allstar.

Security Policy Violation
Security policy not enabled.
A SECURITY.md file can give users information about what constitutes a vulnerability and how to report one securely so that information about a bug is not publicly visible. Examples of secure reporting methods include using an issue tracker with private issue support, or encrypted email with a published key.

To fix this, add a SECURITY.md file that explains how to handle vulnerabilities found in your repository. Go to https://github.com/Talend/beam-samples/security/policy to enable.

For more information, see https://docs.github.com/en/code-security/getting-started/adding-a-security-policy-to-your-repository.


This issue will auto resolve when the policy is in compliance.

Issue created by Allstar. See https://github.com/ossf/allstar/ for more information. For questions specific to the repository, please contact the owner or maintainer.

BeamSqlExample error

Exception in thread "main" java.lang.IllegalStateException: Unable to return a default Coder for MapElements3/Map/ParMultiDo(Anonymous).output [PCollection]. Correct one of the following root causes:
No Coder has been manually specified; you may do so using .setCoder().
Inferring a Coder from the CoderRegistry failed: Cannot provide a coder for a Beam Row. Please provide a schema instead using PCollection.setRowSchema.
Using the default output Coder from the producing PTransform failed: PTransform.getOutputCoder called.
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Preconditions.checkState(Preconditions.java:507)
at org.apache.beam.sdk.values.PCollection.getCoder(PCollection.java:277)
at org.apache.beam.sdk.values.PCollection.finishSpecifying(PCollection.java:114)
at org.apache.beam.sdk.runners.TransformHierarchy.finishSpecifying(TransformHierarchy.java:263)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:249)
at org.apache.beam.sdk.Pipeli

Running Sample using State & Timer

Thanks for this open source repo.
I was looking for a running example using State and Timer somewhere but didn't find any.
My goal is to minimize the number of calls to a datastore by using State & Timer to do some caching.
If someone has an example that would be great otherwise I will try to add one.

Got ClassNotFoundException: org.apache.beam.samples.EventsByLocation while running the examples

Hi,

First to say, thanks for providing this project. I wonder whether you'll be able to assist me with running the examples. When I tried

mvn exec:java -Dexec.mainClass=org.apache.beam.samples.EventsByLocation -Pdirect-runner -Dexec.args="--runner=DirectRunner --input=/home/dataset/gdelt/2014-2016/201605*.zip --output=/tmp/gdelt/output/"

I got java.lang.ClassNotFoundException: org.apache.beam.samples.EventsByLocation. I have attached the full log from the above command with the -X option added to the mvn params.
beam-example-error.txt

I'm new to Beam; it might be just a simple configuration error in my environment that is causing this exception. If you can suggest what might be wrong that would be great.

Cheers,
Simeon

Security Policy violation Security Scorecards

This issue was automatically created by Allstar.

Security Policy Violation
Project is out of compliance with Security Scorecards policy

Rule Description
This is a generic passthrough policy that runs the configured checks from Security Scorecards. Please see the Security Scorecards Documentation for more information on each check.

First 10 results from policy: Token-Permissions: non read-only tokens detected in GitHub workflows

  • .github/workflows/java11.yml[1]:no topLevel permission defined
  • .github/workflows/java11.yml[1]:not a publishing workflow: .github/workflows/java11.yml
  • .github/workflows/java11.yml[1]:not a releasing workflow: .github/workflows/java11.yml
  • .github/workflows/java11.yml[1]:not a GitHub Pages deployment workflow: .github/workflows/java11.yml
  • .github/workflows/java11.yml[1]:not a codeql workflow
  • .github/workflows/java11.yml[1]:not a codeql upload SARIF workflow
  • .github/workflows/java11.yml[9]:no jobLevel permission defined
  • .github/workflows/java17.yml[1]:no topLevel permission defined
  • .github/workflows/java17.yml[1]:not a publishing workflow: .github/workflows/java17.yml
  • .github/workflows/java17.yml[1]:not a releasing workflow: .github/workflows/java17.yml
  • Run a Scorecards scan to see full list.

This issue will auto resolve when the policy is in compliance.

Issue created by Allstar. See https://github.com/ossf/allstar/ for more information. For questions specific to the repository, please contact the owner or maintainer.

Security Policy violation Repository Administrators

This issue was automatically created by Allstar.

Security Policy Violation
Did not find any owners of this repository
This policy requires all repositories to have a user or team assigned as an administrator. A responsible party is required by organization policy to respond to security events and organization requests.

To add an administrator: from the main page of the repository, go to Settings -> Manage Access.
(For more information, see https://docs.github.com/en/organizations/managing-access-to-your-organizations-repositories)

Alternately, if this repository does not have any maintainers, archive or delete it.


This issue will auto resolve when the policy is in compliance.

Issue created by Allstar. See https://github.com/ossf/allstar/ for more information. For questions specific to the repository, please contact the owner or maintainer.
