Giter VIP home page Giter VIP logo

nex-de-challenge's Introduction

Nexmo Coding Challenge - Data Engineer

Andrew Austin ([email protected]) - 09/22/17

Solution Description

This solution, coded in Java and targeting MySQL, leverages the Spring Framework's Spring Boot module for quick project scaffolding and the Spring Batch module for robust out-of-the-box ETL functionality including resumable jobs, comprehensive job control behavior and performant batch prepared statement inserts. The solution could be considered a 'minimum viable product' suitible for code-review and stakeholder demonstration for the proposed scenario:

Your task is to build a data pipeline, which will process data from logs generated by an API platform and load it into a data warehouse, where the data can be accessed by business analysts using SQL and reporting tools.

Code Description

Solution code can be found in the src/main/java folder and all paths below are rooted there

  • Job control flow is orchestrated by com.nexmo.jobs.BatchConfiguration
  • Lines which fail to be parsed are processed by com.nexmo.jobs.FileVerificationSkipper
  • Lines are parsed, mapped and validated by com.nexmo.mappers.LogDataExtractorUtil and com.nexmo.mappers.LogDataLineMapper, with limited computation occurring in com.nexmo.entities.LogData
  • The solution has 95% line-test coverage for functional packages (com.nexmo.entities.* and com.nexmo.mappers.*).
  • The solution processes the test data set (10721732 lines of data across 4 files) in ~240 seconds on development hardware (example run output can be found in the logs folder).

System Requirements

This solution was built and tested on MacOSX Sierra 10.2.6 using MySQL 5.7, Java 1.8 JDK and Git 2.7.1.

Installation / Build Steps

  1. Clone this repository to your local system and CD to that directory.
  2. Install and start MySQL if not already done so (installation is out of scope for this document). Run the src/main/resources/setup.sql script from this repository within MySQL to create the target database, application user and table 'log_data'. Configure the connection string in src/main/java/resources/application.properties with your database's instance information
  3. Set permissions to execute the mvnw script (e.g. chmod 755 mvnw), then run ./mvnw package; this will build and package the source code for execution as an executable JAR.

Run Steps

Ensure Installation / Build Steps are completed first and completed with BUILD SUCCESS

To run the application from the cloned directory:

java -jar target/nexmo-0.1.0.jar --file=FILE_PATH | --dir=DIR_PATH

Where only --file OR --dir is supplied. If --dir is supplied, all files ending with .csv in the target directory will be passed to the parser. This solution can be scheduled via Cron, Taskmanager, etc, using a similar execution statement.

Data Assumptions / Expectations

  • We assume single-level nested CSV objects that don't contain the char '{' in their value
  • We assume nested CSV objects are only one-level deep; more work would need to be done in order to accommodate multi-level nested CSV objects
  • We assume all logs are being generated in the [Europe/London] timezone
  • We assume messageId and date length and positions are fixed
  • We assume currency calculations are trivial (small amounts w/ limited precision requiriments and uniform currency type); thus we leverage BigDecimal and are only trivially concerned with rounding
  • We consider all fields and sub-fields mandatory; any missing component forces the record to be logged to the console and discarded
A Note on Requirements

I asked Oscar if SQL to find revenue, gross margin, etc, was necessary. His answer was SQL was only necessary if the data was stored in some format other than listed in the scenario document. As our MySQL schema matches the listed fields, no additional SQL is provided.

Additional Questions

How would you change your solution if you had an additional week to work on it? Additional month?

The solution is currently 'MVP' and ready for review and feedback from stakeholders and engineering team members (for technical validation / suggestions / guidance). Once feedback / assumption validation (or disproval) and review has completed, I would focus on prioritizing and implementing changes driven by that feedback followed by testing a larger variety / volume of data sources (and another review round). Once base-functionality is working as intended, I would consider the following changes:

  • Refactor to store full currency information in the DB for currency / cost calculations to be performed there OR consider implementing a dedicated currency library if this currency computation was to be done by this application
  • Redirect failed lines to a flat-file for resolution / problem-pattern detection
  • Abstract out configuration points to config file
  • Potential batch size tweaking (rough-tweaking was done (10, 100, 1000, 5000, 10000); additional adjustment may provide limited performance benefits) and a performance review
  • Move Spring Batch control tables from local DB into specialized control DB (automatically stored in the base database, which may appear as 'clutter' to analysts)
  • Dev-ops related tasks - CI/CD / build management systems integration, containerization (if that is the deployment scheme) and multi-server deployment (if data volume necessitates)
  • Rudimentary notification system / integrating into an existing notification system for success/failure alert consolidation (email, etc)
  • Discussions w/ stakeholders re: long-term plan for operational workflows (change management, rejected record resolution pipelines, etc)
  • Discussions w/ stakeholders re: long-term plan for operational self-service UI for analysts (requires expertise gained from maintaining production system before design / development)

Are there any aspects of ETL your solution does not cover that would need to be covered in a production environment? How would you approach them?

  • Security configuration for DBMS and DBMS connection (currently ignoring SSL certificate validation issues)
  • Application logs are currently being output to the console, configuration to output logs to a file / directory for troubleshooting
  • No notification system for failures; see improvements section
  • Configuration point abstraction for operational maintainability

If your solution needed to process 10 times the data volume provided, how would it change? 100 times?

From an execution stability / 'will it work' perspective, the application has not been tested with files larger than those included in the test set which are not exceptionally large (max size .75GB). Testing would be required to validate handling of larger files to eliminate risk of memory pressure issues; file-splitting may be necessary as a last-resort. From a performance perspective, the solution, on development hardware co-hosting the target database, processed 10721732 lines of data across 4 files in ~240 seconds, yielding an average of roughly 2.65m records / minute. If those values were to hold in production, the anticipated load window at 10x the data would be 40 minutes and ~6 hours and 45 minutes at 100x. A 10x solution may not need performance alterations / changes (e.g. load is completed within maintenance window; time required to performance tune may be better spent on other projects), a 100x solution may require us to examine performance bottlenecks (e.g. processor bound, network bound, DB bound) and explore the following solutions accordingly:

  • Spring batch parallelism and scaled processing - simultaneous processing of multiple files
  • Distributed processing of multiple files - distributing the solution to multiple hosts and distributing the files to be processed accordingly
  • DBMS scaling issues - index manipulation, DBMS tuning, DBMS hardware enhancements

nex-de-challenge's People

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.