
welltok / dskit


This project forked from jpictor/dskit



Home Page: https://github.com/jpictor/dskit

License: MIT License

Python 86.20% Shell 13.80%


Spark SQL Data Science Kit

Introduction

This data science "kit" helps data science teams get started with exploratory data analysis and algorithm development on products built with a service or microservice architecture. In such architectures, data is distributed across a number of isolated databases, both SQL and non-SQL, which makes it hard to develop analytics that require complex joins of data across the service databases. This toolkit solves the problem by providing programs that dump each service database as JSON row files, and then load those files into a unified Spark SQL context where big-data queries and map/reduce algorithms can be applied.

Data Dump Programs

Let's say you have the following service databases: identity, notification, groups, and chat, where some are Postgres and some are MySQL. Use the data dump programs pg2json (Postgres) and my2json (MySQL) to export the data for analysis:

$ ./manage.py pg2json --host=localhost --user=postgres --password=foo identity /disk/data/identity
$ ./manage.py pg2json --host=localhost --user=postgres --password=foo notification /disk/data/notification
$ ./manage.py my2json --host=localhost --user=root --password=foo groups /disk/data/groups
$ ./manage.py my2json --host=localhost --user=root --password=foo chat /disk/data/chat

Joining Data with SparkSQLJob and Spark SQL

The SparkSQLJob class is a utility for loading the exported JSON data directory created above into a Spark SQL context. The data directory is arranged as <service>/<table>.txt files, and each file is registered in the SQL context as the table <service>_<table>. Partitioned files are supported using the convention <service>/<table>-<partition>.txt.
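For example, given the dump directories created above, the loader registers tables like this (the partitioned message files are hypothetical, shown only to illustrate the partitioning convention):

    /disk/data/identity/user.txt              ->  table identity_user
    /disk/data/notification/message.txt       ->  table notification_message
    /disk/data/notification/message-0001.txt  ->  partition of notification_message
    /disk/data/notification/message-0002.txt  ->  partition of notification_message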

Here is an example job which joins the data from the identity and notification services to count the number of notifications per user:

from spark_sql_job import SparkSQLJob

class Job(SparkSQLJob):
    app_name = 'Messages per user'

    # Tables to load from the JSON dump directory; each name follows the
    # <service>_<table> convention described above.
    load_tables = [
        'identity_user',
        'notification_message'
    ]

    def task(self):
        # Join users to their notification messages and count messages per
        # user. datetime_from_timestamp is a UDF registered automatically
        # by SparkSQLJob (see the UDF section below).
        sql = """
        select     u.id as id,
                   u.username as username,
                   u.email as email,
                   max(datetime_from_timestamp(u.last_login)) as last_login,
                   count(m.id) as message_count
        from       identity_user u
        join       notification_message m on m.user_id = u.id
        group by   u.id, u.username, u.email
        order by   message_count desc
        """

        users_rdd = self.sql_context.sql(sql)

        # Collect the result rows to the driver and write them to the path
        # given by --output-path, in the format given by --output-type.
        self.write_local_output(users_rdd.collect())

if __name__ == '__main__':
    Job().run()

This example is run with the command:

$ ./manage.sh submit examples/messages_per_user.py --output-type=csv --output-path=user_messages.csv /disk/data/

Useful Spark SQL User Defined Functions

Time is stored in databases in various ways. The example above uses the Spark SQL function datetime_from_timestamp to convert time stored as UNIX timestamp seconds into a Python datetime instance in the Spark RDD result rows. This is not a standard Spark SQL function; it is a user-defined function (UDF) registered automatically by SparkSQLJob. Three UDFs for manipulating time fields are registered (see the sketch after this list):

  • datetime_from_isodate: converts an ISO date-time string into a Python datetime instance
  • timestamp_from_isodate: converts an ISO date-time string into a UNIX timestamp, i.e. the number of seconds since Jan 1, 1970
  • datetime_from_timestamp: converts a UNIX timestamp into a Python datetime instance
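The registration code is not shown in this README; the following is a minimal sketch of how these three functions could be registered, assuming the Spark 1.x SQLContext.registerFunction API (the function bodies here are illustrative, not dskit's actual implementation):

from datetime import datetime
import calendar

import dateutil.parser
from pyspark.sql.types import LongType, TimestampType

def register_time_udfs(sql_context):
    # Sketch only: dskit's actual UDF implementations may differ.

    # ISO date-time string -> Python datetime
    sql_context.registerFunction(
        'datetime_from_isodate',
        lambda s: dateutil.parser.parse(s) if s else None,
        TimestampType())

    # ISO date-time string -> UNIX timestamp (seconds since Jan 1, 1970 UTC)
    sql_context.registerFunction(
        'timestamp_from_isodate',
        lambda s: calendar.timegm(dateutil.parser.parse(s).utctimetuple()) if s else None,
        LongType())

    # UNIX timestamp seconds -> Python datetime
    sql_context.registerFunction(
        'datetime_from_timestamp',
        lambda ts: datetime.utcfromtimestamp(ts) if ts is not None else None,
        TimestampType())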

Installation Instructions

$ ./manage.sh build_all

Run Example

There is sample data in the data/ directory from a Django database: data/pictorlabs/auth_user.txt. This data was exported from the database using the pg2json command. The auth_user.txt file contains the rows of the auth_user database table in JSON format, one row per line. Each line is a JSON object whose keys are the database column names.
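A line from such a file might look like the following (values are illustrative, not taken from the sample data; the keys are standard Django auth_user columns, with last_login stored as a UNIX timestamp):

{"id": 1, "username": "alice", "email": "alice@example.com", "is_active": true, "last_login": 1418615426.0, "date_joined": 1388534400.0}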

The example runs a Spark SQL query on the sample data and outputs a CSV file.

The example user_example.py uses the SparkSQLJob class to create a Spark SQL context pre-loaded with the JSON row data in the data directory. The SparkSQLJob loader uses the directory and file names to register the data as Spark SQL tables: the JSON record file pictorlabs/auth_user.txt is registered as the table pictorlabs_auth_user.
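The source of user_example.py is not reproduced here, but a minimal job over that table might look like this sketch (the column selection is hypothetical; the columns are standard Django auth_user columns):

from spark_sql_job import SparkSQLJob

class Job(SparkSQLJob):
    app_name = 'User example'
    load_tables = ['pictorlabs_auth_user']

    def task(self):
        # Select a few columns from the Django auth_user dump, converting
        # the UNIX last_login timestamp to a Python datetime via the UDF.
        sql = """
        select   id, username, email,
                 datetime_from_timestamp(last_login) as last_login
        from     pictorlabs_auth_user
        order by id
        """
        self.write_local_output(self.sql_context.sql(sql).collect())

if __name__ == '__main__':
    Job().run()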

Run the example using the submit command:

$ ./manage.sh submit examples/user_example.py --output-type=csv --output-path=users.csv data/

After the job completes, users.csv contains the query output.

