Giter VIP home page Giter VIP logo

distributed-file-loader's Introduction

distributed-file-loader

Application to load large amounts of structured data into DB in a non-blocking fashion

Intuition

The idea is to separate the file read and file load. Since serial read is faster and parallel load is faster. We use one thread to load the file sequentially and create batches from the file that can be loaded into the DB. We use multiple worker threads to load these batches into DB. The number of worker threads can be configured from properties file. We additionally keep track of the line number (can be tracked since serial read), and the run number to handle updates.

Steps to Run the code

Running the code directly

Developed the code using Java Springboot. So, the prerequisites are Java, Maven and preferably IntelliJ.

The code expects the data file path as the one and only command line argument. It can be configured on IntelliJ using run configurations.

java -jar ./target/CSVLoader.jar /path/to/datafile

Once run, the DB can be accessed as long as the application is running on localhost:<server.port>/h2-console The login details and server.port values can be found in application.properties file.

Running using the docker container

Dockerfile is available to run the code as a container.

Run the following command to create a container image from the root of distributed-file-loader project.

docker build -t distributed-file-loader:latest .

Using the created container, run the code using run_container.sh script as below

./run_container.sh -t <CONTAINER_IMAGE>:<TAG> -c <ABSOLUTE/PATH/TO/PROPERTIES/FILE> -d <ABSOLUTE/PATH/TO/STORE/H2/DB> -p <ABSOLUTE/PATH/TO/DATAFILE -o <PORT/TO/MAP/ON/LOCAL>

To save the time in building the image, the container is pushed to DockerHub-amdsrinivas

To run using pre-built image:

./run_container.sh -t amdsrinivas/distributed-file-loader:latest -c <ABSOLUTE/PATH/TO/PROPERTIES/FILE> -d <ABSOLUTE/PATH/TO/STORE/H2/DB> -p <ABSOLUTE/PATH/TO/DATAFILE -o <PORT/TO/MAP/ON/LOCAL>

Tables used

  • PRODUCTS_FCT : table to load the data of all products
Schema
CREATE TABLE IF NOT EXISTS PRODUCTS_FCT(
    PRODUCT_ID INTEGER PRIMARY KEY,
    PRODUCT_NAME VARCHAR(100),
    PRODUCT_SKU VARCHAR(100),
    PRODUCT_DESCRIPTION VARCHAR(500),
    LINE_NUMBER INTEGER,
    UPDATED_ON_RUN INTEGER
) ; -- src/main/resources/sql/CREATE_PRODUCTS_FCT.sql

PRODUCTS_FCT

  • PRODUCTS_VIEW : View to get the aggregate details of all products by name and count
Schema
CREATE OR REPLACE VIEW PRODUCTS_VIEW AS
    SELECT PRODUCT_NAME, COUNT(*) AS NUMBER_OF_PRODUCTS
    FROM PRODUCTS_FCT GROUP BY PRODUCT_NAME
; -- src/main/resources/sql/CREATE_PRODUCTS_VIEW.sql

PRODUCTS_VIEW

  • RUN_LOGGER : table to keep track of the current run number. Used in updates.
Schema
CREATE TABLE IF NOT EXISTS RUN_LOGGER(
    RUN_NUMBER INTEGER
) ; 

INSERT INTO RUN_LOGGER (RUN_NUMBER)
SELECT 0
WHERE NOT EXISTS (SELECT * FROM RUN_LOGGER) ; -- src/main/resources/sql/CREATE_PRUN_LOGGER.sql

RUN_LOGGER

Point achieved

  1. The code follows OOPS concepts.
  2. The ingestion is non-blocking parallel using threads. On average, ingesting the products.csv takes 0.953 minutes. An update file with 18 rows takes 0.0352 minutes.
  3. A column called PRODUCT_ID is generated from 'sku' column using String.hashCode() method of Java. The 'sku' loaded from file is converted to PRODUCT_ID which is used for updates.
  4. All product details are ingested to single PRODUCTS_FCT table.
  5. Aggregated data is available under view PRODUCTS_VIEW.

Points unable to achieve -- NA

Improvements

  1. MySQL specific syntax used for loading batches. Would have tried for DB independent syntax.
  2. Would have tuned the number of threads, maximum batches and batch size to optimize the waiting time in few scenarios.
  3. Uses file based DB. Would have tried to use a container based DB server.
  4. Tracking the run using the current datetime instead of a separate table.

distributed-file-loader's People

Contributors

amdsrinivas avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.