Giter VIP home page Giter VIP logo

sparketl's Introduction

SparkETL

主要运用spark SQL实现数据仓库etl。从extract、transform到导出到其他数据库,基本是写sql方式实现。 实现从数据库抽取数据,在spark上实现etl主逻辑,将数据仓库加工后的数据再导入到RdbMS中供后续使用。sql满足不了的再需要调用spark接口实现。

公共约定

  • 数据库、文件、程序数据处理等需要指定字符集的都是UTF-8
  • 事实表以时间分区的,分区键都是time_type,time_id

环境

  • CentOS 7.7
  • OpenJDK 1.8.0_242
  • CDH6.3.2

DB

数据

  • github:https://github.com/wuda0112/mysql-tester
  • 生成数据:java -jar mysql-tester-1.0.1.jar --mysql-username=test --mysql-password=Test123$ --user-count=1000 --max-item-per-user=100 --thread=10 --mysql-url=jdbc:mysql://127.0.0.1:3306/?serverTimezone=UTC&characterEncoding=UTF-8

调用

spark-submit --master yarn --class etl.App --driver-memory 512m --executor-memory 512m /dp/bin/etl.jar prod 1 2020-03-23

idea远程调试

spark-submit --master yarn --class etl.App --driver-memory 512m --executor-memory 512m --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5005" /dp/bin/etl.jar prod 1 2020-03-23

记录应用日志

/etc/spark/conf/log4j.properties 增加

log4j.logger.etl=DEBUG,rollingFile
log4j.appender.rollingFile=org.apache.log4j.RollingFileAppender
log4j.appender.rollingFile.Threshold=DEBUG
log4j.appender.rollingFile.ImmediateFlush=true
log4j.appender.rollingFile.Append=true
log4j.appender.rollingFile.File=/dp/log/etl.log
log4j.appender.rollingFile.MaxFileSize=50MB
log4j.appender.rollingFile.MaxBackupIndex=5
log4j.appender.rollingFile.layout=org.apache.log4j.PatternLayout
log4j.appender.rollingFile.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} [%p] %m%n

具体使用方式见

SparkETL 用Spark SQL实现ETL

sparketl's People

Contributors

dazheng avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.