homeaway / datapull
Cloud based Data Platform based on Apache Spark
Home Page: https://homeaway.github.io/datapull
License: Other
Is your feature request related to a problem? Please describe.
DataPull, when deployed onto AWS EMR, uses the default EMR service role EMR_DefaultRole, which has much more access than necessary. We should use a custom EMR service role that has only the minimum necessary access.
Describe the solution you'd like
Create an IAM Role emr_datapull_role that has just enough access to run as the EMR service role for DataPull
Describe alternatives you've considered
N/A
Additional context
N/A
Describe the bug
As per the documentation, executing the create_user_and_roles.sh script with an admin access key and secret key should output an access key and secret key. However, it is not generating these keys.
Expected behavior
Access and secret keys should be generated.
To Reproduce
Execute the create_user_and_roles.sh script with admin access and secret key
Describe the bug
The functionality that retrieves the secret from Secrets Manager has a bug: the retrieved password is not pushed into the JSON object.
Expected behavior
When the user sets the option "secretstore":"secret_manager" and a correct secretname, the password should be retrieved and put in the platform object for further processing.
To Reproduce
Submit the job with the secret name and the secret store set to secret manager; the job fails with no password.
Errorlog
NA
Additional context
NA
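The intended behavior can be sketched in plain Scala (fetchSecret below is a hypothetical stand-in for the real AWS Secrets Manager lookup, and the real platform object carries more fields than shown):

```scala
// Sketch only: fetchSecret is a hypothetical stand-in for the real
// AWS Secrets Manager lookup used by DataPull.
def fetchSecret(secretName: String): Option[String] =
  Map("datapull/my-app" -> "s3cr3t").get(secretName)

// If secretstore is "secret_manager", resolve the password and put it
// back into the platform object for further processing.
def resolvePassword(platform: Map[String, String]): Map[String, String] =
  platform.get("secretstore") match {
    case Some("secret_manager") =>
      val password = platform.get("secretname").flatMap(fetchSecret)
      password.map(p => platform + ("password" -> p)).getOrElse(platform)
    case _ => platform
  }

val resolved = resolvePassword(
  Map("secretstore" -> "secret_manager", "secretname" -> "datapull/my-app"))
```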
Describe the bug
Ref https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html ; users tend to copy the examples when building the platform objects for the input JSON. For Elasticsearch, the "type" attribute is set to "docs", which shows up on the index as "_type". This doesn't break DataPull per se, but it breaks apps that use the indexes written by DataPull, since those apps sometimes filter on the type with the default value "_doc".
Expected behavior
Currently, a few of the options in the configuration files are redundant, and users may need to edit several configuration files, which is confusing; so there is room for optimizing the configuration files and changing the API code accordingly.
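Based on the issue's description, the fix in the input JSON would look roughly like this (a sketch; field names other than "type" follow the issue text, and the exact DataPull schema may differ):

```json
{
  "destination": {
    "platform": "elastic",
    "type": "_doc"
  }
}
```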
Is your feature request related to a problem? Please describe.
Currently, we don't support Windows-based authentication for connecting to SQL Server or other JDBC data stores. This request is to enable Windows authentication for SQL Server.
Describe the solution you'd like
We can use jtds driver for connecting to mssql using NTLM.
Describe alternatives you've considered
We tried using Microsoft MsSQL driver but couldn't connect using NTLM.
Additional context
NA
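A hedged sketch of the jTDS approach (the host, port, database, and domain values are illustrative; the driver class and the domain/useNTLMv2 URL properties are jTDS's documented ones):

```scala
// jTDS driver class for SQL Server.
val jtdsDriver = "net.sourceforge.jtds.jdbc.Driver"

// Build a jTDS connection URL that authenticates via NTLM against
// a Windows domain instead of SQL Server authentication.
def jtdsUrl(host: String, port: Int, database: String, domain: String): String =
  s"jdbc:jtds:sqlserver://$host:$port/$database;domain=$domain;useNTLMv2=true"

val url = jtdsUrl("sqlhost.example.com", 1433, "mydb", "CORP")
```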
Describe the bug
Currently, null values are being passed to some parameters of the primitive type Int, which does not accept null.
Expected behavior
The build should pass, and if users don't use the optional parameter, the code flow should still work.
To Reproduce
Builds fail simply by following the installation procedure.
Additional context
NA
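A minimal sketch of the fix, with hypothetical field names (fetchSize and its default are illustrative, not DataPull's actual parameters): model optional primitives as Option[Int] so that omitted parameters default safely instead of a null being forced into an Int:

```scala
// Sketch: optional numeric settings become Option[Int] rather than a
// primitive Int, which cannot hold null.
case class SourceConfig(name: String = "source",
                        fetchSize: Option[Int] = None)

// Callers that omit the optional parameter fall back to a safe default.
def effectiveFetchSize(cfg: SourceConfig): Int = cfg.fetchSize.getOrElse(1000)

val withValue = SourceConfig(fetchSize = Some(50))
val withoutValue = SourceConfig()
```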
Describe the bug
There is a bug in the code that assigns the driver based on the platform when running pre/post migrate commands for the Postgres data store.
Expected behavior
When there are pre/post migrate commands for Postgres, they should call rdbmsruncommand, build the URL, and initialize the driver for the respective data store.
To Reproduce
To reproduce the issue, submit any DataPull job with a pre/post migrate command against a Postgres data store.
Errorlog
exception printing: ListBuffer(java.lang.NullPointerException
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
Additional context
N/A
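The NullPointerException at Class.forName suggests the driver class name is never assigned for the Postgres branch before Class.forName is called. A minimal sketch of an explicit platform-to-driver mapping (the class names are the standard JDBC driver classes; DataPull's actual wiring may differ):

```scala
// Sketch: pick the JDBC driver class from the platform name so no
// branch can fall through with a null driver.
def jdbcDriver(platform: String): String = platform.toLowerCase match {
  case "postgres" => "org.postgresql.Driver"
  case "mysql"    => "com.mysql.cj.jdbc.Driver"
  case "mssql"    => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
  case other      => throw new IllegalArgumentException(s"Unsupported platform: $other")
}
```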
We have multiple teams that want to use SES to deliver DataPull reports via email (these reports are already delivered to CloudWatch as events). For this to happen, the EMR EC2 execution role needs access to send email via SES.
The ideal solution would be to tailor the installation so that these roles get SES send access only if SES is configured. However, until we make the installation UI-driven/dynamic, let's provide this access scoped to DataPull resources alone.
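A hedged sketch of the scoped IAM statement (the account ID and identity ARN are illustrative; ses:SendEmail and ses:SendRawEmail are the standard SES send actions):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ses:SendEmail", "ses:SendRawEmail"],
      "Resource": "arn:aws:ses:us-east-1:123456789012:identity/datapull.example.com"
    }
  ]
}
```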
Describe the bug
One of the installation steps in https://homeaway.github.io/datapull/install_on_aws/#installation-steps is to run ./ecs_deploy.sh . When running this on an Ubuntu Linux client, the script fails while compiling the Scala JAR with "Failed to collect dependencies at org.apache.hive:hive-jdbc:jar:3.1.2"
Expected behavior
The installation script should successfully compile the Scala JAR when running ./ecs_deploy.sh <env>
To Reproduce
On a linux machine, install DataPull using the instructions at https://homeaway.github.io/datapull/aws_account_setup
Errorlog
The script
./ecs_deploy.sh test
yields
Downloaded from central: https://repo.maven.apache.org/maven2/org/scala-lang/modules/scala-xml_2.11/1.3.0/scala-xml_2.11-1.3.0.pom (2.9 kB at 47 kB/s)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 11:52 min
[INFO] Finished at: 2020-07-07T04:27:56Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project DataMigrationFramework: Could not resolve dependencies for project DataMigrationTool:DataMigrationFramework:jar:1.0-SNAPSHOT: Failed to collect dependencies at org.apache.hive:hive-jdbc:jar:3.1.2 -> org.apache.hive:hive-service:jar:3.1.2 -> org.apache.hive:hive-llap-server:jar:3.1.2 -> org.apache.hbase:hbase-server:jar:2.0.0-alpha4 -> org.glassfish.web:javax.servlet.jsp:jar:2.3.2 -> org.glassfish:javax.el:jar:3.0.1-b06-SNAPSHOT: Failed to read artifact descriptor for org.glassfish:javax.el:jar:3.0.1-b06-SNAPSHOT: Could not transfer artifact org.glassfish:javax.el:pom:3.0.1-b06-SNAPSHOT from/to Neo4J (https://m2.neo4j.org/content/repositories/releases/): Failed to transfer file https://m2.neo4j.org/content/repositories/releases/org/glassfish/javax.el/3.0.1-b06-SNAPSHOT/javax.el-3.0.1-b06-SNAPSHOT.pom with status code 502 -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
PROCESS FAILED
Additional context
N/A
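One common workaround for this class of failure (an assumption, not necessarily the fix the project adopted) is to exclude the unresolvable org.glassfish:javax.el snapshot from the hive-jdbc dependency tree so Maven resolves a released version instead:

```xml
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>3.1.2</version>
  <exclusions>
    <exclusion>
      <groupId>org.glassfish</groupId>
      <artifactId>javax.el</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```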
Is your feature request related to a problem? Please describe.
Currently, when writing to S3 or a file system, the destination ends up with as many files as there are parallel tasks across the executors. We need changes to control the number of partitions written to the destination.
Describe the solution you'd like
The Dataframe API can control the number of partitions written to the destination using the coalesce function.
Describe alternatives you've considered
There is a built-in function available with dataframe API.
Additional context
n/a
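The Spark call itself would look roughly like this (a sketch only: it assumes an existing SparkSession, a DataFrame `df`, and an illustrative S3 path; the partition count 5 is arbitrary):

```scala
// coalesce(n) merges existing partitions down to n without a full shuffle,
// so the subsequent write produces at most n output files.
val reduced = df.coalesce(5)
reduced.write.mode("overwrite").json("s3://my-bucket/destination/")
```

If the partition count needs to increase, or the data should be redistributed evenly, `repartition(n)` (which does shuffle) is the alternative.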
Describe the bug
The actual method signature of dataframeFromTo.dataFrameToRdbms() expects port, but at the call site it is passed vault.
Expected behavior
Correct method call.
Describe the bug
DataPull is throwing an exception "No json input available to Data Pull" when running in EMR
Expected behavior
It should run with provided JSON
To Reproduce
Describe the bug
The function that detects whether a job uses a temp location for storing intermediate datasets (such as Kafka, or Mongo with overrideconnector set to true) has a bug: it calls the pre/post migrate command function with no JSON object in it.
Expected behavior
The pre/post migrate command should be called automatically only when the source is Kafka, or Mongo with overrideconnector.
To Reproduce
Run a job with Mongo as the source and no overrideconnector.
Errorlog
java.util.NoSuchElementException: key not found: platform
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:59)
    at core.Migration.jsonSourceDestinationRunPrePostMigrationCommand(Migration.scala:483)
    at core.Migration.core$Migration$$prePostFortmpS3$1(Migration.scala:303)
    at core.Migration$$anonfun$migrate$2.apply$mcVI$sp(Migration.scala:313)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at core.Migration.migrate(Migration.scala:309)
    at core.Controller$$anonfun$performmigration$2$$anonfun$apply$mcV$sp$1.
Additional context
NA
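The intended guard can be sketched as follows (the platform names mirror the issue text; the real function would also take the job's JSON object):

```scala
// Sketch: only trigger the automatic pre/post migrate command when the
// source actually needs a temp location for intermediate datasets.
def needsTempCleanup(platform: String, overrideConnector: Boolean): Boolean =
  platform.toLowerCase match {
    case "kafka"   => true               // Kafka always stages temp files
    case "mongodb" => overrideConnector  // Mongo only with the native connector
    case _         => false
  }
```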
Is your feature request related to a problem? Please describe.
The documentation lists the three initial authors of DataPull, but since then there have been multiple other contributors to the codebase, and it is difficult to keep the list of authors updated in the documentation.
Describe the solution you'd like
Github already tracks contributions to the repo automatically, so it would be good to use that instead. As for acknowledging the contributors to the DataPull codebase while it was an inner-sourced product, we can use the list of contributors compiled while filing for this project to be open-sourced.
Describe alternatives you've considered
An alternate solution is to not list the initial authors at all; however, there is good karma to be realised by thanking everyone who has contributed.
Additional context
N/A
Is your feature request related to a problem? Please describe.
No; currently we support writing a dataframe in regular formats such as JSON, Parquet, Avro, CSV, etc., which is different from what we need.
Describe the solution you'd like
Write each partition out as a JSON file and save it to the destination.
Describe alternatives you've considered
NA
Additional context
NA
Is your feature request related to a problem? Please describe.
Although Scala apps can run on JDK 11, the official docs recommend Java 8 for compiling Scala code. Per the same documentation, full JDK 11 support is expected in June 2019.
Describe the solution you'd like
Upgrade Scala and dependencies to a version that uses OpenJDK 11+.
Describe alternatives you've considered
An alternative is to use Amazon Corretto JDK instead.
Additional context
N/A
Describe the bug
During the installation process on AWS, the script create_user_and_roles.sh creates the datapull_user, but it puts double quotes around the AWS access key ID and AWS secret access key values in ~/.aws/credentials. Because of this, the next script, ecs_deploy.sh, fails: it doesn't recognize the AWS access key ID as valid because of the extra double quotes.
Expected behavior
When the datapull_user credentials are written to the local ~/.aws/credentials file, they should show up without the double quotes, or subsequent AWS CLI calls should be able to handle credentials with double quotes.
To Reproduce
Run the installation. It will fail when you run the step that involves execution of the ecs_deploy.sh script with the following error.
Uploading core Jar to s3
+ docker run -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -e AWS_DEFAULT_REGION -e AWS_PROFILE -v /home/markomerine/githubrepos/homeawway/datapull/core:/data -v /home/markomerine/.aws:/root/.aws garland/aws-cli-docker aws s3 cp /da
ta/target/DataMigrationFramework-1.0-SNAPSHOT-jar-with-dependencies.jar s3://datapull-test2/datapull-opensource/jars/DataMigrationFramework-1.0-SNAPSHOT-jar-with-dependencies.jar
upload failed: target/DataMigrationFramework-1.0-SNAPSHOT-jar-with-dependencies.jar to s3://datapull-test2/datapull-opensource/jars/DataMigrationFramework-1.0-SNAPSHOT-jar-with-dependencies.jar An error occurred (InvalidAccessKeyId)
when calling the CreateMultipartUpload operation: The AWS Access Key Id you provided does not exist in our records.
Is your feature request related to a problem? Please describe.
This is to add HIVE as a destination
Describe the solution you'd like
We should enable Hive support for the Spark session and add the necessary Hive-related config.
Describe alternatives you've considered
NA
Additional context
NA
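A hedged sketch of enabling Hive support on the Spark session (assumes Spark with Hive dependencies on the classpath; the warehouse path and table name are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() connects Spark SQL to the Hive metastore,
// so writes via saveAsTable land in Hive tables.
val spark = SparkSession.builder()
  .appName("DataPull")
  .config("spark.sql.warehouse.dir", "s3://my-bucket/warehouse/")
  .enableHiveSupport()
  .getOrCreate()

df.write.mode("overwrite").saveAsTable("mydb.mytable")
```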
Is your feature request related to a problem? Please describe.
Currently, DataPull doesn't support unauthenticated Mongo clusters. This request is to add that support.
Describe the solution you'd like
DataPull should be able to support mongo cluster with no authentication.
Describe alternatives you've considered
N/A
Additional context
N/A
Describe the bug
One of the installation steps in https://homeaway.github.io/datapull/install_on_aws/#installation-steps is to run ./ecs_deploy.sh <env>. When running this on an Ubuntu Linux client, the script fails at the shebang line #!/usr/bin/env bash -x, because on Linux env receives "bash -x" as a single argument and cannot find a program by that name.
Expected behavior
The step ./ecs_deploy.sh <env> should not fail at the line #!/usr/bin/env bash -x
To Reproduce
On a Windows machine running WSL2 with Ubuntu , install DataPull using the instructions at https://homeaway.github.io/datapull/aws_account_setup
Errorlog
./ecs_deploy.sh test
yields
/usr/bin/env: ‘bash -x’: No such file or directory
Additional context
N/A
Describe the bug
I noticed that a password containing a dollar sign ($) would break the secret injection.
Expected behavior
Any character should be allowed.
To Reproduce
In the input file:
"password": "inlinesecret{{\"secretstore\": \"aws_secrets_manager\", \"secretname\": \"datapull/my-app\", \"secretkeyname\": \"MY_SERCRET\"}}",
MY_SERCRET = 'abc$123'
Errorlog
Exception in thread "main" java.lang.IllegalArgumentException: Illegal group reference
at java.util.regex.Matcher.appendReplacement(Matcher.java:857)
at scala.util.matching.Regex$Replacement$class.replace(Regex.scala:804)
at scala.util.matching.Regex$MatchIterator$$anon$1.replace(Regex.scala:782)
at scala.util.matching.Regex$$anonfun$replaceAllIn$1.apply(Regex.scala:473)
at scala.util.matching.Regex$$anonfun$replaceAllIn$1.apply(Regex.scala:473)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.util.matching.Regex.replaceAllIn(Regex.scala:473)
at helper.Helper.ReplaceInlineExpressions(Helper.scala:389)
at core.DataPull$$anonfun$main$1.apply$mcV$sp(DataPull.scala:163)
at scala.util.control.Breaks.breakable(Breaks.scala:38)
at core.DataPull$.main(DataPull.scala:154)
at core.DataPull.main(DataPull.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
DataPull, line 54:
breakable {
while (json.has("jsoninputfile")) {
val jsonMap = jsonObjectPropertiesToMap(json.getJSONObject("jsoninputfile"))
if (listOfS3Path.contains(jsonMap("s3path"))) {
throw new Exception("New json is pointing to same json.")
}
listOfS3Path += jsonMap("s3path")
setAWSCredentials(sparkSession, jsonMap)
json = new JSONObject(
helper.ReplaceInlineExpressions(
helper.InputFileJsonToString(
sparkSession = sparkSession,
jsonObject = json,
inputFileObjectKey = "jsoninputfile"
).getOrElse(""), sparkSession
)
)
}
}
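The "Illegal group reference" error comes from $ being treated as a capture-group reference in the replacement string, since Scala's Regex.replaceAllIn delegates to java.util.regex.Matcher. A likely fix (a sketch, not the project's actual patch) is to escape the secret with Matcher.quoteReplacement before it is substituted into the JSON:

```scala
import java.util.regex.Matcher

val secret = "abc$123"

// Naive substitution fails: a replacement string containing $ is parsed
// as a group reference, throwing IllegalArgumentException at runtime.
// Matcher.quoteReplacement escapes \ and $ so the secret is injected
// literally.
def injectSecret(template: String, placeholder: String, secret: String): String =
  template.replaceAll(placeholder, Matcher.quoteReplacement(secret))

val resolved = injectSecret("\"password\": \"PLACEHOLDER\"", "PLACEHOLDER", secret)
```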
Is your feature request related to a problem? Please describe.
There are some minor typos in the root readme.md. There are some formatting issues and the lack of diagrams in the documentation for how to install DataPull in AWS. There is also a dependency vulnerability reported by github.
Describe the solution you'd like
Need to fix the typos, fix formatting issues and add diagrams if time permits. Also fix dependency vulnerability
Describe alternatives you've considered
Considered making this 2 issues instead of one; but decided against it since this stuff is really minor.
Additional context
N/A
Describe the bug
When we use the native connector by enabling the overrideconnector flag for reading data from MongoDB, it doesn't work; it throws an internal Spark error.
Expected behavior
The data from mongodb should be retrieved as strings with the column name jsonfield.
To Reproduce
Set overrideconnector to true in the source MongoDB section of the JSON and submit the job.
Errorlog
java.lang.IllegalArgumentException: Self-suppression not permitted
    at java.lang.Throwable.addSuppressed(Throwable.java:1072)
    at java.io.BufferedWriter.close(BufferedWriter.java:266)
    at java.io.PrintWriter.close(PrintWriter.java:339)
    at org.apache.spark.scheduler.EventLoggingListener$$anonfun$stop$1.apply(EventLoggingListener.scala:242)
    at org.apache.spark.scheduler.EventLoggingListener$$anonfun$stop$1.apply(EventLoggingListener.scala:242)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:242)
    at org.apache.spark.SparkContext$$anonfun$stop$7$$anonfun$apply$mcV$sp$5.apply(SparkContext.scala:1920)
    at org.apache.spark.SparkContext$$anonfun$stop$7$$anonfun$apply$mcV$sp$5.apply(SparkContext.scala:1920)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1920)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1919)
    at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572)
    at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
    at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: All datanodes [DatanodeInfoWithStorage[DS-92b28fcd-5dd3-4e02-bff3-bb8b65cf1362,DISK]] are bad. Aborting...
    at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1531)
    at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1465)
    at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657)
and another one:
java.io.EOFException: Unexpected EOF while trying to read response from server
    at org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:402)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
    at org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1073)
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000665800000, 130416640, 0) failed; error='Cannot allocate memory' (errno=12)
Is your feature request related to a problem? Please describe.
No; currently we have to add a pre/post migrate command to delete the temp files that are created while reading data from Kafka or MongoDB.
Describe the solution you'd like
We can add a check for the above scenario, create a post migrate command dynamically, and call the pre/post migrate command function.
Describe alternatives you've considered
NA
Additional context
NA
Is your feature request related to a problem? Please describe.
No
Describe the solution you'd like
While reading the message, parse the whole message and allow users to pick whatever they want.
Describe alternatives you've considered
The proposed solution is straightforward.
Additional context
Add any other context or screenshots about the feature request here.
Is your feature request related to a problem? Please describe.
No; this is more a feature request than a problem: add support for Oracle and Teradata.
Describe the solution you'd like
We will use JDBC to connect to the Oracle and Teradata data stores the same way as MS-SQL, MySQL, and Postgres.
Describe alternatives you've considered
No; this is a straightforward and efficient way to connect to these data stores.
Additional context
N/A
Problem:
Say you have macOS running the latest version of Java (say Java SE 12): the documentation at https://github.com/homeaway/datapull#build-and-execute-within-a-dockerised-spark-environment does not work, because the project won't compile.
Solution:
Update the documentation so that the compilation happens in a dockerised maven3+jdk8 environment.