datax's Introduction

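About this fork

Alibaba does not maintain DataX very promptly, yet many people (myself included) still use it. Problems inevitably turn up in use, and when I found one I would recompile and use my own build. I then saw others in the community running into the same problems, so I decided to maintain a fork and publish releases regularly, and I hope others will join in maintaining it.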

What this fork changes compared with alibaba/DataX

See the changelog for the specific changes.

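Will new commits from alibaba/DataX be merged?

This project periodically merges commits from alibaba/DataX, always using git cherry-pick -x <commit-hash>, which records the original commit hash in the commit message so users can trace each change back to its source. After merging, the code is built and published to release automatically. For details of the automated process, see "How this fork is released".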
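
Because git cherry-pick -x appends a line of the form "(cherry picked from commit <hash>)" to the commit message, the upstream commit can be recovered from the log. A minimal Python sketch of that lookup follows. The function name upstream_hashes and the sample commit message are hypothetical and used only for illustration.

```python
import re
from typing import List

# git cherry-pick -x appends a trailer line of this form to the commit message.
CHERRY_PICK_TRAILER = re.compile(r"\(cherry picked from commit ([0-9a-f]{7,40})\)")


def upstream_hashes(commit_message: str) -> List[str]:
    """Return the original commit hashes recorded in a commit message."""
    return CHERRY_PICK_TRAILER.findall(commit_message)


# Hypothetical commit message, for illustration only.
sample = (
    "Fix a writer bug\n"
    "\n"
    "(cherry picked from commit 0123456789abcdef0123456789abcdef01234567)\n"
)
print(upstream_hashes(sample))
```

Feeding it the output of git log for a commit in this fork would give the hash to look up in alibaba/DataX.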

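How to use the code in this fork

  • To use this fork: if you only want to use this fork's DataX, download the latest version directly from the release page. It is a build of nearly the latest code; see "How this fork is released".
  • To contribute to this fork: if you want to contribute, see "Want to help".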

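How this fork is released

This fork is released automatically with Github Action. Every new commit on the master branch triggers an automated release, which reaches release in about an hour. To avoid piling up release assets (one per commit), every release currently reuses the same tag (v0.0.2); see PR-20. The release page therefore shows only one release entry, but it is always a build of the latest code, so you can use it with confidence.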

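Want to help

If you find this project interesting and want to take part in building or contributing to it, you can follow this process:

  • Fork this project
  • Clone your fork locally
  • Make your changes locally and push them to your github
  • Create a PR and set the target branch to zhongjiajie/DataX
  • Click create to open the PR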

datax's People

Contributors

asdf2014, binaryworld, cch1996, heljoyliu, kevinwangcs, lw309637554, ryan-mei, sufism, trafalgarluo, wanda1416, wuchase, zhongjiajie

datax's Issues

JF1367818433

Hello, how can I use DataX to read data in Parquet format?

elasticsearchwriter's custom date format support has no effect

Output

2020-12-17 19:52:52.083 [0-0-0-writer] ERROR StdoutPluginCollector - 脏数据:
{"message":"status:[400], error: {\"type\":\"mapper_parsing_exception\",\"reason\":\"failed to parse field [created_at] of type [date] in document with id '11'. Preview of field's value: '2020-10-28T11:16:06.000+08:00'\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"failed to parse date field [2020-10-28T11:16:06.000+08:00] with format [yyyy-MM-dd HH:mm:ss]\",\"caused_by\":{\"type\":\"date_time_parse_exception\",\"reason\":\"Text '2020-10-28T11:16:06.000+08:00' could not be parsed at index 10\"}}}","record":[{"byteSize":2,"index":0,"rawData":11,"type":"LONG"},{"byteSize":7,"index":1,"rawData":"手机端淘宝首页","type":"STRING"},{"byteSize":5,"index":2,"rawData":15369,"type":"LONG"},{"byteSize":4,"index":3,"rawData":2929,"type":"LONG"},{"byteSize":8,"index":4,"rawData":1603854966000,"type":"DATE"}],"type":"writer"}
2020-12-17 19:52:52.085 [0-0-0-writer] ERROR StdoutPluginCollector - 脏数据:
{"message":"status:[400], error: {\"type\":\"mapper_parsing_exception\",\"reason\":\"failed to parse field [created_at] of type [date] in document with id '16'. Preview of field's value: '2020-10-28T11:16:06.000+08:00'\",\"caused_by\":{\"type\":\"illegal_argument_exception\",\"reason\":\"failed to parse date field [2020-10-28T11:16:06.000+08:00] with format [yyyy-MM-dd HH:mm:ss]\",\"caused_by\":{\"type\":\"date_time_parse_exception\",\"reason\":\"Text '2020-10-28T11:16:06.000+08:00' could not be parsed at index 10\"}}}","record":[{"byteSize":2,"index":0,"rawData":16,"type":"LONG"},{"byteSize":4,"index":1,"rawData":"过年海报","type":"STRING"},{"byteSize":6,"index":2,"rawData":104630,"type":"LONG"},{"byteSize":4,"index":3,"rawData":5888,"type":"LONG"},{"byteSize":8,"index":4,"rawData":1603854966000,"type":"DATE"}],"type":"writer"}
2020-12-17 19:52:52.086 [0-0-0-writer] ERROR StdoutPluginCollector - 脏数据:

Configuration

"reader": {
    "name": "mysqlreader",
    "parameter": {
        "username": "",
        "password": "",
        "connection": [
            {
                "querySql": [
                    "select id,keyword,scount,icount,created_at from t"
                ],
                "jdbcUrl": [
                    ""
                ]
            }
        ]
    }
},

"writer": {
    "name": "elasticsearchwriter",
    "parameter": {
        "endpoint": "http://127.0.0.1:9200",
        "index": "aaa",
        "type": "_doc",
        "cleanup": false,
        "discovery": false,
        "batchSize": 1000,
        "splitter": ",",
        "dynamic": true,
        "column": [
            {"name": "id", "type": "id"},
            {"name": "keyword", "type": "text", "analyzer": "ccc"},
            {"name": "scount", "type": "integer"},
            {"name": "icount", "type": "integer"},
            {"name": "created_at", "type": "date", "fromFormat": "yyyy-MM-dd HH:mm:ss"}
        ]
    }
}

ES

"mappings": {
    "properties": {
      "icount": {
          "type": "integer"
        },
      "scount": {
          "type": "integer"
        },
      "created_at": {
        "format": "yyyy-MM-dd HH:mm:ss",
        "type": "date"
      },
      "id": {
        "type": "integer"
      },
      "keyword": {
        "analyzer": "ccc",
        "type": "text"
      }
    }  
  }

Data format

insert into t ( `icount`, `pinyin`, `scount`, `created_at`, `keyword`, `updated_at`) values ( '253300', 'beijing', '14432285', '2020-10-28 11:16:06', '背景', '2020-10-28 11:16:06');
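
The log in this issue shows Elasticsearch receiving 2020-10-28T11:16:06.000+08:00 for created_at while the mapping only accepts yyyy-MM-dd HH:mm:ss, so parsing stops at index 10, the T. The Python sketch below reproduces that mismatch from the rawData value in the dirty record. It assumes the +08:00 offset seen in the log and uses %Y-%m-%d %H:%M:%S as the Python counterpart of the mapping format. The helper matches_mapping_format is hypothetical, and the sketch does not show how elasticsearchwriter itself serializes dates.

```python
from datetime import datetime, timedelta, timezone

RAW_MILLIS = 1603854966000  # rawData of the DATE column in the dirty record
MAPPING_FORMAT = "%Y-%m-%d %H:%M:%S"  # Python counterpart of yyyy-MM-dd HH:mm:ss
LOG_OFFSET = timezone(timedelta(hours=8))  # assumption: the +08:00 seen in the log


def matches_mapping_format(text: str) -> bool:
    """Hypothetical helper: True if text parses with the mapping's date format."""
    try:
        datetime.strptime(text, MAPPING_FORMAT)
        return True
    except ValueError:
        return False


value = datetime.fromtimestamp(RAW_MILLIS / 1000, tz=LOG_OFFSET)
sent = value.isoformat(timespec="milliseconds")  # ISO 8601 form, as in the log
expected = value.strftime(MAPPING_FORMAT)        # form the mapping accepts

print(sent, matches_mapping_format(sent))
print(expected, matches_mapping_format(expected))
```

If the writer keeps emitting the ISO 8601 form, one option worth testing is to let the index mapping accept both patterns, since Elasticsearch date formats can be combined with ||. Whether that suits this index is an assumption to verify.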

Self-check fails

After building from source, I went into target/datax/datax and ran python bin/datax.py job/job.json
It fails with:
2020-09-25 20:27:24.531 [main] INFO ErrorRecordChecker - percentage使用标准的百分比(配置值忽略百分号),如 [45.45%] 的配置为:"percentage": 45.45
2020-09-25 20:27:24.532 [main] INFO ErrorRecordChecker - 配置了 errorLimit.record, 其优先级高于 errorLimit.percentage 会将覆盖 errorLimit.percentage
2020-09-25 20:27:24.533 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2020-09-25 20:27:24.534 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2020-09-25 20:27:24.535 [main] INFO JobContainer - DataX jobContainer starts job.
2020-09-25 20:27:24.537 [main] INFO JobContainer - Set jobId = 0
2020-09-25 20:27:24.559 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2020-09-25 20:27:24.560 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do prepare work .
2020-09-25 20:27:24.561 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2020-09-25 20:27:24.561 [job-0] INFO JobContainer - jobContainer starts to do split ...
2020-09-25 20:27:24.568 [job-0] ERROR JobContainer - Exception when job run
com.alibaba.datax.common.exception.DataXException: Code:[Framework-03], Description:[DataX引擎配置错误,该问题通常是由于DataX安装错误引起,请联系您的运维解决 .]. - 在有总bps限速条件下,单个channel的bps值不能为空,也不能为非正数
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:26) ~[datax-common-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.job.JobContainer.adjustChannelNumber(JobContainer.java:430) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.job.JobContainer.split(JobContainer.java:387) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.job.JobContainer.start(JobContainer.java:117) ~[datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.Engine.start(Engine.java:92) [datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.Engine.entry(Engine.java:171) [datax-core-0.0.1-SNAPSHOT.jar:na]
at com.alibaba.datax.core.Engine.main(Engine.java:204) [datax-core-0.0.1-SNAPSHOT.jar:na]
2020-09-25 20:27:24.576 [job-0] INFO StandAloneJobContainerCommunicator - Total 0 records, 0 bytes | Speed 0B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 0.00%
2020-09-25 20:27:24.577 [job-0] ERROR Engine -

经DataX智能分析,该任务最可能的错误原因是:
com.alibaba.datax.common.exception.DataXException: Code:[Framework-03], Description:[DataX引擎配置错误,该问题通常是由于DataX安装错误引起,请联系您的运维解决 .]. - 在有总bps限速条件下,单个channel的bps值不能为空,也不能为非正数
at com.alibaba.datax.common.exception.DataXException.asDataXException(DataXException.java:26)
at com.alibaba.datax.core.job.JobContainer.adjustChannelNumber(JobContainer.java:430)
at com.alibaba.datax.core.job.JobContainer.split(JobContainer.java:387)
at com.alibaba.datax.core.job.JobContainer.start(JobContainer.java:117)
at com.alibaba.datax.core.Engine.start(Engine.java:92)
at com.alibaba.datax.core.Engine.entry(Engine.java:171)
at com.alibaba.datax.core.Engine.main(Engine.java:204)

Thanks
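
The exception message in this trace reads, in English, "when a total bps limit is set, the bps value of a single channel must not be empty or non-positive". A minimal Python sketch of that rule follows. It assumes the total limit comes from job.setting.speed.byte in the job file and the per-channel limit from core.transport.channel.speed.byte in the core configuration. Both key paths, the function name violates_bps_rule, and the sample values are assumptions to check against the job/job.json and conf/core.json of your own build.

```python
from typing import Any, Optional


def lookup(config: dict, dotted_path: str) -> Optional[Any]:
    """Walk a nested dict along a dotted path; return None if any key is missing."""
    node: Any = config
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def violates_bps_rule(job_conf: dict, core_conf: dict) -> bool:
    """True if a total byte limit is set but the per-channel limit is missing or <= 0."""
    total = lookup(job_conf, "job.setting.speed.byte")  # assumed key path
    if total is None or total <= 0:
        return False  # no total byte limit, so the rule does not apply
    per_channel = lookup(core_conf, "core.transport.channel.speed.byte")  # assumed key path
    return per_channel is None or per_channel <= 0


# Hypothetical values, for illustration only.
job_with_byte_limit = {"job": {"setting": {"speed": {"byte": 10485760}}}}
job_with_channel_only = {"job": {"setting": {"speed": {"channel": 1}}}}
core_unlimited = {"core": {"transport": {"channel": {"speed": {"byte": -1}}}}}
core_limited = {"core": {"transport": {"channel": {"speed": {"byte": 1048576}}}}}

print(violates_bps_rule(job_with_byte_limit, core_unlimited))
print(violates_bps_rule(job_with_byte_limit, core_limited))
print(violates_bps_rule(job_with_channel_only, core_unlimited))
```

If the check reports a violation, either giving the per-channel value a positive number or limiting the job by channel instead of byte would satisfy the rule as stated. Which one is appropriate depends on the deployment.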
