
kafka-integration

This repository will contain the source code/documentation for streaming data using Apache Kafka and HPCC.

Consumer:

The Apache Kafka consumer runs on each node of the HPCC cluster. The consumer connects to the Kafka brokers and fetches data in batches whose size is controlled by the 'messageListSize' property defined in the DataConsumer.properties file. We use a non-blocking consumer because we only need to read a specific number of messages per topic. The consumers run in parallel, so the number of partitions for the topic should equal the number of slave nodes.
For example, if you are running a 5-node cluster (1 THOR master and 4 slaves), the topic should have 4 partitions.
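As an illustration, a DataConsumer.properties file might look like the sketch below. Only messageListSize is named in this document; the other keys (zookeeper.connect, topic, group.id) are hypothetical placeholders and should be adjusted to the property names the DataConsumer actually reads:

```properties
# Messages fetched per batch by each consumer (property named in this README)
messageListSize=1000

# Hypothetical keys, shown only for context
zookeeper.connect=zk-host:2181
topic=telematics
group.id=hpcc-consumer-group
```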

Producer:

There are no hard and fast rules for the producer, except that the number of partitions for the topic being produced should equal the number of slave nodes.
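To see why the partition count should match the slave count, consider a round-robin assignment of messages to partitions: with one partition per slave, each slave's consumer drains exactly one partition and the load stays balanced. The sketch below is a self-contained illustration of that balance, not the producer's actual partitioner:

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionSketch {
    // Number of THOR slave nodes; this README pairs one consumer per slave.
    static final int SLAVE_COUNT = 4;

    // Round-robin: message i goes to partition i % SLAVE_COUNT, so with
    // partitions == slaves every consumer owns exactly one partition.
    static int partitionFor(int messageIndex) {
        return messageIndex % SLAVE_COUNT;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = new ArrayList<>();
        for (int p = 0; p < SLAVE_COUNT; p++) partitions.add(new ArrayList<>());
        for (int i = 0; i < 12; i++) {
            partitions.get(partitionFor(i)).add("msg-" + i);
        }
        for (int p = 0; p < SLAVE_COUNT; p++) {
            System.out.println("partition " + p + " -> "
                    + partitions.get(p).size() + " messages");
        }
    }
}
```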

Apache Kafka Brokers/Zookeeper:

These must be configured according to message throughput and cluster-availability requirements.

Installation Steps:

  • See Building with Gradle below for changes to the installation steps (it supersedes the instructions that follow).
  • Make changes to DataConsumer.properties to point to Apache Kafka Cluster.
  • Copy the DataConsumer.properties and DataConsumer.class files to each node (including the THOR master) and add them to the classpath. DEPRECATED For now you need to copy these files to each node manually; this will be replaced by a script that does it for you. DEPRECATED
  • Add the Apache Kafka jars (kafka_2.8.0-0.8.0-beta1.jar, kafka-assembly-0.8.0-beta1-deps.jar) to the classpath. DEPRECATED
  • Add Log4j jar file to the classpath. DEPRECATED
  • Restart the cluster.

Usage:

The code base contains an example Apache Kafka producer (TelematicsSimulator.java) which simulates sample telematics data. On the ECL side there are two schedulers:

  • DataCollection_Scheduler.ecl: fetches the data from the Apache Kafka brokers, creates logical files and adds the logical files to a superfile.
  • BuildIndex_Scheduler.ecl: creates a base file from the data received, creates indexes, adds the indexes to superkeys, builds a package and deploys the package to ROXIE.

Before you can start the schedulers, you need to publish the queries (telematics_service_accdec.ecl and telematics_service_km_by_speed.ecl) to ROXIE.

The two schedulers are independent of each other, so if one fails the other is not affected.

DataCollection Scheduler:

Below are the high level steps that we perform for each incoming logical file:

  1. The DataConsumer returns the data fetched from the brokers as a string.
  2. A logical file is created for each iteration and added to the superfile.
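The first step above can be sketched as a non-blocking batch fetch: drain at most messageListSize messages and return them as one string. This is an illustrative stand-in, with an in-memory iterator in place of the real Kafka stream the DataConsumer reads from:

```java
import java.util.Iterator;
import java.util.List;
import java.util.StringJoiner;

public class BatchFetchSketch {
    // Take at most 'messageListSize' messages from the stream and return them
    // as a single newline-separated string. hasNext() stands in for the
    // non-blocking check: we stop as soon as the stream has nothing more.
    static String fetchBatch(Iterator<String> stream, int messageListSize) {
        StringJoiner batch = new StringJoiner("\n");
        int taken = 0;
        while (taken < messageListSize && stream.hasNext()) {
            batch.add(stream.next());
            taken++;
        }
        return batch.toString();
    }

    public static void main(String[] args) {
        Iterator<String> stream = List.of("a,1", "b,2", "c,3", "d,4").iterator();
        System.out.println(fetchBatch(stream, 3)); // first batch of 3 messages
    }
}
```

In the real scheduler, the returned string is then written out as a logical file and appended to the superfile.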

BuildIndex_Scheduler:

Below are the high level steps that we perform for each incoming file:

  1. Swap the contents of the superfile used by the DataCollection scheduler into a temporary superfile.
  2. Create a base file which contains the cleaned/parsed data.
  3. Create an index on the sub-file.
  4. Add the index to the superkey and the base file to the superfile.
  5. Create the package XML for the deployed queries and publish the new data using packages (this is done using SOAPCALL from ECL; it can also be done from the ecl command line).
  6. The ROXIE query will pick up the data from the superkey (which is deployed via the package).
  7. After a specific time interval (1 hour, 6 hours or 1 day), do the following:
    1. Consolidate all the sub-files in the superfile into a single sub-file.
    2. Build a single index from the data in the superfile.
    3. Clear the superkey and add the index built in step 7.2 to the superkey.
    4. Clear the superfile.
NOTE: Step 7 is not yet implemented and will be available in a future version.
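The superfile lifecycle above (swap, add, consolidate, clear) can be modeled with a simple in-memory sketch. This is only an illustration of the bookkeeping; the real work is done in ECL with superfile transactions, not Java:

```java
import java.util.ArrayList;
import java.util.List;

public class SuperfileSketch {
    // In-memory stand-in for an HPCC superfile: a list of sub-file names.
    private final List<String> superfile = new ArrayList<>();

    // Step 1: move all current sub-files into a temporary superfile,
    // leaving the original empty for the DataCollection scheduler.
    List<String> swapToTemp() {
        List<String> temp = new ArrayList<>(superfile);
        superfile.clear();
        return temp;
    }

    // Step 4: register a new sub-file (base file or index) in the superfile.
    void add(String subfile) {
        superfile.add(subfile);
    }

    // Step 7: replace all sub-files with a single consolidated sub-file.
    void consolidate(String consolidatedName) {
        superfile.clear();
        superfile.add(consolidatedName);
    }

    int size() { return superfile.size(); }
}
```

The consolidation step keeps the sub-file count bounded, so ROXIE queries do not have to open an ever-growing list of small sub-files.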

Building with Gradle:

In the project root directory, execute ./gradlew build. The build creates a single jar file containing this project's executable code as well as all of its dependencies.

Add this jar file to the classpath.
