homeaway / datapull
Cloud based Data Platform based on Apache Spark
Home Page: https://homeaway.github.io/datapull
License: Other
Is your feature request related to a problem? Please describe.
DataPull, when deployed onto AWS EMR, uses the default EMR service role EMR_DefaultRole, which has much more access than necessary. We should use a custom EMR service role that has only the minimum necessary access.
Describe the solution you'd like
Create an IAM Role emr_datapull_role that has just enough access to run as the EMR service role for DataPull
Describe alternatives you've considered
N/A
Additional context
N/A
Describe the bug
As per the documentation, executing the create_user_and_roles.sh script with an admin access key and secret key should output an access key and secret key. However, it is not generating these keys.
Expected behavior
Access and secret keys should be generated.
To Reproduce
Execute the create_user_and_roles.sh script with admin access and secret key
Describe the bug
The functionality that retrieves the secret from Secrets Manager has a bug: the retrieved password is not pushed into the JSON object.
Expected behavior
When the user sets the option "secretstore":"secret_manager" and a correct secretname, the password should be retrieved and put in the platform object for further processing.
To Reproduce
Submit the job with the secret name and the secret store set to secret manager; the job fails with no password.
Errorlog
NA
Additional context
NA
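The intended behavior can be sketched in plain Scala (fetchSecret below is a hypothetical stand-in for the real AWS Secrets Manager lookup, and the real platform object carries more fields than shown):

```scala
// Sketch only: fetchSecret is a hypothetical stand-in for the real
// AWS Secrets Manager lookup used by DataPull.
def fetchSecret(secretName: String): Option[String] =
  Map("datapull/my-app" -> "s3cr3t").get(secretName)

// If secretstore is "secret_manager", resolve the password and put it
// back into the platform object for further processing.
def resolvePassword(platform: Map[String, String]): Map[String, String] =
  platform.get("secretstore") match {
    case Some("secret_manager") =>
      val password = platform.get("secretname").flatMap(fetchSecret)
      password.map(p => platform + ("password" -> p)).getOrElse(platform)
    case _ => platform
  }

val resolved = resolvePassword(
  Map("secretstore" -> "secret_manager", "secretname" -> "datapull/my-app"))
```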
Describe the bug
Ref https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html ; users tend to copy the examples when building the platform objects for the input JSON. For Elasticsearch, the "type" attribute is set to "docs", which shows up on the index as "_type". This doesn't break DataPull per se, but it breaks apps that use the indexes written by DataPull, since those apps sometimes filter on the type with the default value "_doc".
Expected behavior
Currently, a few of the options in the configuration files are redundant, and users may need to edit several configuration files, which is confusing; so there is room for optimizing the configuration files and changing the API code accordingly.
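Based on the issue's description, the fix in the input JSON would look roughly like this (a sketch; field names other than "type" follow the issue text, and the exact DataPull schema may differ):

```json
{
  "destination": {
    "platform": "elastic",
    "type": "_doc"
  }
}
```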
Is your feature request related to a problem? Please describe.
Currently, we don't support Windows-based authentication for connecting to SQL Server or other JDBC data stores. This request is to enable Windows authentication for SQL Server.
Describe the solution you'd like
We can use jtds driver for connecting to mssql using NTLM.
Describe alternatives you've considered
We tried using Microsoft MsSQL driver but couldn't connect using NTLM.
Additional context
NA
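A hedged sketch of the jTDS approach (the host, port, database, and domain values are illustrative; the driver class and the domain/useNTLMv2 URL properties are jTDS's documented ones):

```scala
// jTDS driver class for SQL Server.
val jtdsDriver = "net.sourceforge.jtds.jdbc.Driver"

// Build a jTDS connection URL that authenticates via NTLM against
// a Windows domain instead of SQL Server authentication.
def jtdsUrl(host: String, port: Int, database: String, domain: String): String =
  s"jdbc:jtds:sqlserver://$host:$port/$database;domain=$domain;useNTLMv2=true"

val url = jtdsUrl("sqlhost.example.com", 1433, "mydb", "CORP")
```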
Describe the bug
Currently, null values are being passed to some parameters of the primitive type Int, which does not accept null.
Expected behavior
The build should pass, and if users don't use the optional parameter, the code flow should still work.
To Reproduce
Builds fail simply by following the installation procedure.
Additional context
NA
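A minimal sketch of the fix, with hypothetical field names (fetchSize and its default are illustrative, not DataPull's actual parameters): model optional primitives as Option[Int] so that omitted parameters default safely instead of a null being forced into an Int:

```scala
// Sketch: optional numeric settings become Option[Int] rather than a
// primitive Int, which cannot hold null.
case class SourceConfig(name: String = "source",
                        fetchSize: Option[Int] = None)

// Callers that omit the optional parameter fall back to a safe default.
def effectiveFetchSize(cfg: SourceConfig): Int = cfg.fetchSize.getOrElse(1000)

val withValue = SourceConfig(fetchSize = Some(50))
val withoutValue = SourceConfig()
```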
Describe the bug
There is a bug in the code that assigns the driver based on the platform when running pre/post migrate commands for the Postgres data store.
Expected behavior
When there are pre/post migrate commands for Postgres, they should call rdbmsruncommand, build the URL, and initialize the driver for the respective data store.
To Reproduce
To reproduce the issue, submit any DataPull job with a pre/post migrate command against a Postgres data store.
Errorlog
exception printing: ListBuffer(java.lang.NullPointerException
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
Additional context
N/A
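The NullPointerException at Class.forName suggests the driver class name is never assigned for the Postgres branch before Class.forName is called. A minimal sketch of an explicit platform-to-driver mapping (the class names are the standard JDBC driver classes; DataPull's actual wiring may differ):

```scala
// Sketch: pick the JDBC driver class from the platform name so no
// branch can fall through with a null driver.
def jdbcDriver(platform: String): String = platform.toLowerCase match {
  case "postgres" => "org.postgresql.Driver"
  case "mysql"    => "com.mysql.cj.jdbc.Driver"
  case "mssql"    => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
  case other      => throw new IllegalArgumentException(s"Unsupported platform: $other")
}
```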
We have multiple teams that want to use SES to deliver DataPull reports via email (these reports are already delivered to CloudWatch as events). For this to happen, the EMR EC2 execution role needs access to send email via SES.
The ideal solution would be to tailor the installation so that these roles get SES send access only if SES is configured. However, until we make the installation UI-driven/dynamic, let's provide this access scoped to DataPull resources alone.
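A hedged sketch of the scoped IAM statement (the account ID and identity ARN are illustrative; ses:SendEmail and ses:SendRawEmail are the standard SES send actions):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ses:SendEmail", "ses:SendRawEmail"],
      "Resource": "arn:aws:ses:us-east-1:123456789012:identity/datapull.example.com"
    }
  ]
}
```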
Describe the bug
One of the installation steps in https://homeaway.github.io/datapull/install_on_aws/#installation-steps is to run ./ecs_deploy.sh . When running this on an Ubuntu Linux client, the script fails while compiling the Scala JAR with "Failed to collect dependencies at org.apache.hive:hive-jdbc:jar:3.1.2"
Expected behavior
The installation script should successfully compile the Scala JAR when running ./ecs_deploy.sh <env>
To Reproduce
On a linux machine, install DataPull using the instructions at https://homeaway.github.io/datapull/aws_account_setup
Errorlog
The script
./ecs_deploy.sh test
yields
Downloaded from central: https://repo.maven.apache.org/maven2/org/scala-lang/modules/scala-xml_2.11/1.3.0/scala-xml_2.11-1.3.0.pom (2.9 kB at 47 kB/s)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 11:52 min
[INFO] Finished at: 2020-07-07T04:27:56Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project DataMigrationFramework: Could not resolve dependencies for project DataMigrationTool:DataMigrationFramework:jar:1.0-SNAPSHOT: Failed to collect dependencies at org.apache.hive:hive-jdbc:jar:3.1.2 -> org.apache.hive:hive-service:jar:3.1.2 -> org.apache.hive:hive-llap-server:jar:3.1.2 -> org.apache.hbase:hbase-server:jar:2.0.0-alpha4 -> org.glassfish.web:javax.servlet.jsp:jar:2.3.2 -> org.glassfish:javax.el:jar:3.0.1-b06-SNAPSHOT: Failed to read artifact descriptor for org.glassfish:javax.el:jar:3.0.1-b06-SNAPSHOT: Could not transfer artifact org.glassfish:javax.el:pom:3.0.1-b06-SNAPSHOT from/to Neo4J (https://m2.neo4j.org/content/repositories/releases/): Failed to transfer file https://m2.neo4j.org/content/repositories/releases/org/glassfish/javax.el/3.0.1-b06-SNAPSHOT/javax.el-3.0.1-b06-SNAPSHOT.pom with status code 502 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
PROCESS FAILED
Additional context
N/A
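One common workaround for this class of failure (an assumption, not necessarily the fix the project adopted) is to exclude the unresolvable org.glassfish:javax.el snapshot from the hive-jdbc dependency tree so Maven resolves a released version instead:

```xml
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>3.1.2</version>
  <exclusions>
    <exclusion>
      <groupId>org.glassfish</groupId>
      <artifactId>javax.el</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```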
Is your feature request related to a problem? Please describe.
Currently, when writing to S3 or a file system, the destination ends up with as many files as there are parallel tasks across the executors. We need changes to control the number of partitions written to the destination.
Describe the solution you'd like
The Dataframe API can control the number of partitions written to the destination using the coalesce function.
Describe alternatives you've considered
There is a built-in function available with dataframe API.
Additional context
n/a
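The Spark call itself would look roughly like this (a sketch only: it assumes an existing SparkSession, a DataFrame `df`, and an illustrative S3 path; the partition count 5 is arbitrary):

```scala
// coalesce(n) merges existing partitions down to n without a full shuffle,
// so the subsequent write produces at most n output files.
val reduced = df.coalesce(5)
reduced.write.mode("overwrite").json("s3://my-bucket/destination/")
```

If the partition count needs to increase, or the data should be redistributed evenly, `repartition(n)` (which does shuffle) is the alternative.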
Describe the bug
The actual method signature of dataframeFromTo.dataFrameToRdbms() expects port, but at the call site it is passed vault.
Expected behavior
Correct method call.
Describe the bug
DataPull is throwing an exception "No json input available to Data Pull" when running in EMR
Expected behavior
It should run with provided JSON
To Reproduce
Describe the bug
The function that detects whether a job uses a temp location for storing intermediate datasets (such as Kafka, or Mongo with overrideconnector set to true) has a bug: it calls the pre/post migrate command function with no JSON object in it.
Expected behavior
The pre/post migrate command should be called automatically only when the source is Kafka, or Mongo with overrideconnector.
To Reproduce
Run a job with Mongo as the source and no overrideconnector.
Errorlog
java.util.NoSuchElementException: key not found: platform
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:59)
    at core.Migration.jsonSourceDestinationRunPrePostMigrationCommand(Migration.scala:483)
    at core.Migration.core$Migration$$prePostFortmpS3$1(Migration.scala:303)
    at core.Migration$$anonfun$migrate$2.apply$mcVI$sp(Migration.scala:313)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at core.Migration.migrate(Migration.scala:309)
    at core.Controller$$anonfun$performmigration$2$$anonfun$apply$mcV$sp$1.
Additional context
NA
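The intended guard can be sketched as follows (the platform names mirror the issue text; the real function would also take the job's JSON object):

```scala
// Sketch: only trigger the automatic pre/post migrate command when the
// source actually needs a temp location for intermediate datasets.
def needsTempCleanup(platform: String, overrideConnector: Boolean): Boolean =
  platform.toLowerCase match {
    case "kafka"   => true               // Kafka always stages temp files
    case "mongodb" => overrideConnector  // Mongo only with the native connector
    case _         => false
  }
```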
Is your feature request related to a problem? Please describe.
The documentation lists the three initial authors of DataPull, but since then there have been multiple other contributors to the codebase, and it is difficult to keep the list of authors updated in the documentation.
Describe the solution you'd like
Github already tracks contributions to the repo automatically, so it would be good to use that instead. As for acknowledging the contributors to the DataPull codebase while it was an inner-sourced product, we can use the list of contributors compiled while filing for this project to be open-sourced.
Describe alternatives you've considered
An alternate solution is to not list the initial authors at all; however, there is good karma to be realised by thanking everyone who has contributed.
Additional context
N/A
Is your feature request related to a problem? Please describe.
No; currently we support writing a dataframe in regular formats such as JSON, Parquet, Avro, CSV, etc., which is different from what we need.
Describe the solution you'd like
Write each partition out as a JSON file and save it to the destination.
Describe alternatives you've considered
NA
Additional context
NA
Is your feature request related to a problem? Please describe.
Although Scala apps can run on JDK 11, the official docs recommend Java 8 for compiling Scala code. Per the same documentation, full JDK 11 support is expected in June 2019.
Describe the solution you'd like
Upgrade Scala and dependencies to a version that uses OpenJDK 11+.
Describe alternatives you've considered
An alternative is to use Amazon Corretto JDK instead.
Additional context
N/A
Describe the bug
During the installation process on AWS, the script create_user_and_roles.sh creates the datapull_user, but it puts double quotes around the AWS access key ID and AWS secret access key values in ~/.aws/credentials. Because of this, the next script, ecs_deploy.sh, fails: it doesn't recognize the AWS access key ID as valid because of the extra double quotes.
Expected behavior
When the datapull_user credentials are written to the local ~/.aws/credentials file, they should show up without the double quotes, or subsequent AWS CLI calls should be able to handle credentials with double quotes.
To Reproduce
Run the installation. It will fail when you run the step that involves execution of the ecs_deploy.sh script with the following error.
Uploading core Jar to s3
+ docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_DEFAULT_REGION -e AWS_PROFILE -v /home/markomerine/githubrepos/homeawway/datapull/core:/data -v /home/markomerine/.aws:/root/.aws garland/aws-cli-docker aws s3 cp /da
ta/target/DataMigrationFramework-1.0-SNAPSHOT-jar-with-dependencies.jar s3://datapull-test2/datapull-opensource/jars/DataMigrationFramework-1.0-SNAPSHOT-jar-with-dependencies.jar
upload failed: target/DataMigrationFramework-1.0-SNAPSHOT-jar-with-dependencies.jar to s3://datapull-test2/datapull-opensource/jars/DataMigrationFramework-1.0-SNAPSHOT-jar-with-dependencies.jar An error occurred (InvalidAccessKeyId)
when calling the CreateMultipartUpload operation: The AWS Access Key Id you provided does not exist in our records.
Is your feature request related to a problem? Please describe.
This is to add HIVE as a destination
Describe the solution you'd like
We should enable Hive support for the Spark session and add the necessary Hive-related config.
Describe alternatives you've considered
NA
Additional context
NA
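A hedged sketch of enabling Hive support on the Spark session (assumes Spark with Hive dependencies on the classpath; the warehouse path and table name are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() connects Spark SQL to the Hive metastore,
// so writes via saveAsTable land in Hive tables.
val spark = SparkSession.builder()
  .appName("DataPull")
  .config("spark.sql.warehouse.dir", "s3://my-bucket/warehouse/")
  .enableHiveSupport()
  .getOrCreate()

df.write.mode("overwrite").saveAsTable("mydb.mytable")
```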
Is your feature request related to a problem? Please describe.
Currently, DataPull doesn't support unauthenticated Mongo clusters. This request is to add that support.
Describe the solution you'd like
DataPull should be able to support mongo cluster with no authentication.
Describe alternatives you've considered
N/A
Additional context
N/A
Describe the bug
One of the installation steps in https://homeaway.github.io/datapull/install_on_aws/#installation-steps is to run ./ecs_deploy.sh <env>. When running this on an Ubuntu Linux client, the script fails at the shebang line #!/usr/bin/env bash -x, because on Linux env receives "bash -x" as a single argument and cannot find a program by that name.
Expected behavior
The step ./ecs_deploy.sh <env> should not fail at the line #!/usr/bin/env bash -x
To Reproduce
On a Windows machine running WSL2 with Ubuntu , install DataPull using the instructions at https://homeaway.github.io/datapull/aws_account_setup
Errorlog
./ecs_deploy.sh test
yields
/usr/bin/env: ‘bash -x’: No such file or directory
Additional context
N/A
Describe the bug
I noticed that a password containing a dollar sign ($) would break the secret injection.
Expected behavior
Any character should be allowed.
To Reproduce
In the input file:
"password": "inlinesecret{{\"secretstore\": \"aws_secrets_manager\", \"secretname\": \"datapull/my-app\", \"secretkeyname\": \"MY_SERCRET\"}}",
MY_SERCRET = 'abc$123'
Errorlog
Exception in thread "main" java.lang.IllegalArgumentException: Illegal group reference
at java.util.regex.Matcher.appendReplacement(Matcher.java:857)
at scala.util.matching.Regex$Replacement$class.replace(Regex.scala:804)
at scala.util.matching.Regex$MatchIterator$$anon$1.replace(Regex.scala:782)
at scala.util.matching.Regex$$anonfun$replaceAllIn$1.apply(Regex.scala:473)
at scala.util.matching.Regex$$anonfun$replaceAllIn$1.apply(Regex.scala:473)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.util.matching.Regex.replaceAllIn(Regex.scala:473)
at helper.Helper.ReplaceInlineExpressions(Helper.scala:389)
at core.DataPull$$anonfun$main$1.apply$mcV$sp(DataPull.scala:163)
at scala.util.control.Breaks.breakable(Breaks.scala:38)
at core.DataPull$.main(DataPull.scala:154)
at core.DataPull.main(DataPull.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
DataPull, line 54:
breakable {
while (json.has("jsoninputfile")) {
val jsonMap = jsonObjectPropertiesToMap(json.getJSONObject("jsoninputfile"))
if (listOfS3Path.contains(jsonMap("s3path"))) {
throw new Exception("New json is pointing to same json.")
}
listOfS3Path += jsonMap("s3path")
setAWSCredentials(sparkSession, jsonMap)
json = new JSONObject(
helper.ReplaceInlineExpressions(
helper.InputFileJsonToString(
sparkSession = sparkSession,
jsonObject = json,
inputFileObjectKey = "jsoninputfile"
).getOrElse(""), sparkSession
)
)
}
}
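The "Illegal group reference" error comes from $ being treated as a capture-group reference in the replacement string, since Scala's Regex.replaceAllIn delegates to java.util.regex.Matcher. A likely fix (a sketch, not the project's actual patch) is to escape the secret with Matcher.quoteReplacement before it is substituted into the JSON:

```scala
import java.util.regex.Matcher

val secret = "abc$123"

// Naive substitution fails: a replacement string containing $ is parsed
// as a group reference, throwing IllegalArgumentException at runtime.
// Matcher.quoteReplacement escapes \ and $ so the secret is injected
// literally.
def injectSecret(template: String, placeholder: String, secret: String): String =
  template.replaceAll(placeholder, Matcher.quoteReplacement(secret))

val resolved = injectSecret("\"password\": \"PLACEHOLDER\"", "PLACEHOLDER", secret)
```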
Is your feature request related to a problem? Please describe.
There are some minor typos in the root readme.md. There are some formatting issues and the lack of diagrams in the documentation for how to install DataPull in AWS. There is also a dependency vulnerability reported by github.
Describe the solution you'd like
Need to fix the typos, fix formatting issues and add diagrams if time permits. Also fix dependency vulnerability
Describe alternatives you've considered
Considered making this 2 issues instead of one; but decided against it since this stuff is really minor.
Additional context
N/A
Describe the bug
When we use the native connector by enabling the overrideconnector flag for reading data from MongoDB, it doesn't work; it throws an internal Spark error.
Expected behavior
The data from mongodb should be retrieved as strings with the column name jsonfield.
To Reproduce
Set overrideconnector to true in the source MongoDB section of the JSON and submit the job.
Errorlog
java.lang.IllegalArgumentException: Self-suppression not permitted
    at java.lang.Throwable.addSuppressed(Throwable.java:1072)
    at java.io.BufferedWriter.close(BufferedWriter.java:266)
    at java.io.PrintWriter.close(PrintWriter.java:339)
    at org.apache.spark.scheduler.EventLoggingListener$$anonfun$stop$1.apply(EventLoggingListener.scala:242)
    at org.apache.spark.scheduler.EventLoggingListener$$anonfun$stop$1.apply(EventLoggingListener.scala:242)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:242)
    at org.apache.spark.SparkContext$$anonfun$stop$7$$anonfun$apply$mcV$sp$5.apply(SparkContext.scala:1920)
    at org.apache.spark.SparkContext$$anonfun$stop$7$$anonfun$apply$mcV$sp$5.apply(SparkContext.scala:1920)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1920)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1919)
    at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: All datanodes [DatanodeInfoWithStorage[DS-92b28fcd-5dd3-4e02-bff3-bb8b65cf1362,DISK]] are bad. Aborting...
    at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1531)
    at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1465)
    at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657)
and another one:
java.io.EOFException: Unexpected EOF while trying to read response from server
    at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:402)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
    at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1073)
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000665800000, 130416640, 0) failed; error='Cannot allocate memory' (errno=12)
Is your feature request related to a problem? Please describe.
No; currently we have to add a pre/post migrate command to delete the temp files that are created while reading data from Kafka or MongoDB.
Describe the solution you'd like
We can add a check for the above scenario, create a post migrate command dynamically, and call the pre/post migrate command function.
Describe alternatives you've considered
NA
Additional context
NA
Is your feature request related to a problem? Please describe.
No
Describe the solution you'd like
While reading the message, parse the whole message and allow users to pick whatever they want.
Describe alternatives you've considered
The proposed solution is straightforward.
Additional context
Add any other context or screenshots about the feature request here.
Is your feature request related to a problem? Please describe.
No; this is more a feature request than a problem: add support for Oracle and Teradata.
Describe the solution you'd like
We will use JDBC to connect to the Oracle and Teradata data stores the same way as MS-SQL, MySQL, and Postgres.
Describe alternatives you've considered
No; this is a straightforward and efficient way to connect to these data stores.
Additional context
N/A
Problem:
Say you have macOS running the latest version of Java (say Java SE 12): the documentation at https://github.com/homeaway/datapull#build-and-execute-within-a-dockerised-spark-environment does not work, because the project won't compile.
Solution:
Update the documentation so that the compilation happens in a dockerised maven3+jdk8 environment.