amazon-archives / data-pipeline-samples
This repository hosts sample pipelines.
License: MIT No Attribution
Hi,
I'm moving some data to Redshift on a daily basis. The data is copied to Redshift by a shell script that uses psql to insert rows into Redshift from a CSV file.
Since the job runs every day and pulls the last week of data, a lot of duplicate rows get inserted. To avoid this I compute a hash using MD5 and, using that hash, insert only the new rows and ignore the duplicates. But psql is not computing the hash correctly: when I compute row_hash with the same query from SQLWorkbench it works fine, but not with psql.
The shell script which performs the above task is stored in S3.
Code-wise everything is fine, because when I execute the same query from SQLWorkbench I don't find any problem.
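One thing worth checking: MD5 is byte-sensitive, so any stray newline, trailing space, or encoding difference introduced while the shell script builds the SQL string yields a completely different digest, which would make psql and SQLWorkbench disagree on the same logical row. A minimal illustration (the column values and the `|` delimiter are made up, mirroring the common `MD5(a || '|' || b)` pattern):

```python
import hashlib

def row_hash(*cols):
    # Mirrors the SQL pattern MD5(col_a || '|' || col_b || ...)
    return hashlib.md5("|".join(str(c) for c in cols).encode("utf-8")).hexdigest()

clean = row_hash("42", "2016-01-01", "click")
same = row_hash("42", "2016-01-01", "click")
# A trailing newline (easy to pick up when a shell loop reads CSV lines)
# changes the digest entirely:
dirty = row_hash("42", "2016-01-01", "click\n")

print(clean == same)   # True
print(clean == dirty)  # False
```

If the hashes differ between clients, comparing the exact byte string each one feeds to MD5 usually pinpoints the discrepancy.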
Thanks in advance.
Hi,
I was testing your billing sample, but apparently it doesn't work anymore.
It breaks when trying to create a folder at this step: "directoryPath": "#{myS3StagingLoc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}"
It would be useful to have this sample fixed.
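For reference, that expression uses a Joda-style format string, so the folder name it should produce is a plain timestamp under the staging location. A sketch of the equivalent formatting in Python (the sample datetime is made up, standing in for @scheduledStartTime):

```python
from datetime import datetime

# Joda 'YYYY-MM-dd-HH-mm-ss' corresponds to strftime '%Y-%m-%d-%H-%M-%S'
scheduled_start = datetime(2016, 5, 1, 12, 30, 0)  # hypothetical @scheduledStartTime
stamp = scheduled_start.strftime("%Y-%m-%d-%H-%M-%S")
print(stamp)  # 2016-05-01-12-30-00
```

If folder creation fails, the rendered timestamp itself is rarely the culprit; the myS3StagingLoc prefix or the pipeline's S3 permissions are more likely suspects.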
Thanks for your help.
Regards,
Julien.
Hello,
When I want to access the activity logs I get this error: The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256.
Any idea?
Thank you
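That error usually means the request was signed with the older Signature Version 2 against a region or endpoint that only accepts Signature Version 4. As a hedged sketch (assuming boto3 is the client in play; the region is an assumption), SigV4 can be pinned explicitly:

```python
import boto3
from botocore.config import Config

# Force Signature Version 4 ("s3v4"); regions that accept only SigV4 reject
# older signatures with exactly the "Please use AWS4-HMAC-SHA256" message.
s3 = boto3.client(
    "s3",
    region_name="eu-central-1",  # assumption: a SigV4-only region
    config=Config(signature_version="s3v4"),
)
```

Upgrading the SDK or CLI to a recent version, which defaults to SigV4, tends to resolve the same error.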
This may be worthwhile so that we have more control over the environment we are running in.
Is there any working sample / template for loading postgresql data onto redshift?
What is the ideal way to handle schema creation, and deleted / updated data?
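On the deleted/updated-data question: Redshift has no native upsert, so a common pattern is to load into a staging table, delete matching keys from the target, then insert the staging rows. A minimal runnable sketch of that flow (using sqlite3 only so it runs anywhere; the table and column names are made up, but the SQL shape is what matters):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
con.execute("CREATE TABLE staging (id INTEGER, val TEXT)")
con.execute("INSERT INTO target VALUES (1, 'old'), (2, 'keep')")
con.execute("INSERT INTO staging VALUES (1, 'updated'), (3, 'new')")

# Merge: remove rows about to be replaced, then append the staging set.
con.execute("DELETE FROM target WHERE id IN (SELECT id FROM staging)")
con.execute("INSERT INTO target SELECT id, val FROM staging")

print(sorted(con.execute("SELECT id, val FROM target")))
# [(1, 'updated'), (2, 'keep'), (3, 'new')]
```

Hard deletes in the source need a separate signal (e.g. a tombstone flag in the extract), since the merge above only sees rows that still exist.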
I am using ShellCommandActivity to first copy the script from S3 and then execute it.
The resource is m3.xlarge, paravirtualization.
Opening a native connection fails (DataStax): dse spark-submit ....
The error is :
Exception in thread "main" java.io.IOException: Failed to open native connection to Cassandra at {xxx.xxx.xxx.xxx}:9042
The connections / port connectivity, as checked, are all good.
This is a non-EMR standalone DataStax cluster, and the above shell activity is executed on a driver machine.
I've been trying to use the script https://github.com/aws-samples/data-pipeline-samples/blob/master/samples/EFSBackup/efs-backup.sh to make my EFS backups. Even though Data Pipeline said it was healthy, the stderr file shows:
mount.nfs: remote share not in 'host:dir' format
When I did it manually, it also showed that message, and I realized that the mount command format for EFS has changed from
sudo mount -t nfs -o nfsvers=4.1 -o rsize=1048576 -o wsize=1048576 -o timeo=600 -o retrans=2 -o hard {efs-ip-addr} /backup
to
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport {efs-id}.efs.eu-west-1.amazonaws.com:/ /backup
It took me a while to realize this. Or maybe I am doing something wrong and the older command is still valid?
"username": "#{myRedshiftUsername}",
"*password": "#{*myRedshiftPassword}"
{
    "description": "The password for the above user to establish a connection to the Redshift cluster.",
    "id": "*myRedshiftPassword",
    "type": "String"
}
How do I move daily/weekly/monthly data from an on-premises Teradata server to AWS S3 storage?
This folder https://github.com/awslabs/data-pipeline-samples/tree/master/samples/rds-to-rds-copy has no JSON file.
The EFS mounts in the backup script should use the additional mount options: http://docs.aws.amazon.com/efs/latest/ug/mounting-fs-mount-cmd-general.html
rsize=1048576
wsize=1048576
hard
timeo=600
retrans=2
This sample no longer works; it fails at step 3.
-> aws datapipeline create-default-roles
usage: aws [options] [parameters]
aws: error: argument operation: Invalid choice, valid choices are:
Hello,
The rsync process gets killed for an unknown reason. Please see the log attached below. The production EFS volume has 50 GB of data; the backup volume ends up with approximately 17 GB of backup data before rsync gets killed.
Thanks
Peter
--2016-09-23 13:24:14-- https://s3-us-west-2.amazonaws.com/xxx/aws/efsbackup/efs-backup.sh
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 54.231.168.196
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|54.231.168.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2986 (2.9K) [application/x-sh]
Saving to: ‘efs-backup.sh’
0K .. 100% 90.7M=0s
2016-09-23 13:24:14 (90.7 MB/s) - ‘efs-backup.sh’ saved [2986/2986]
rm: cannot remove ‘/tmp/efs-backup.log’: No such file or directory
rsync error: received SIGINT, SIGTERM, or SIGHUP (code 20) at rsync.c(544) [sender=3.0.6]
rsync: writefd_unbuffered failed to write 97 bytes to socket [generator]: Broken pipe (32)
Data Pipeline newbie, any thoughts as to what is causing this error?
amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg : at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:275)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:227)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:430)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:366)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:463)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:479)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:759)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:697)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:636)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Job Submission failed with exception 'java.lang.NullPointerException(null)'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
Hello,
I am trying to back up an EFS of 5 TB, and Data Pipeline fails on Backup part 1 with the following error.
Unable to create resource for @EC2Resource1_2017-08-30T05:56:06 due to: Your quota allows for 0 more running instance(s). You requested at least 1 (Service: AmazonEC2; Status Code: 400; Error Code: InstanceLimitExceeded; Request ID: 0585067a-e291-472a-8581-a2a5108a2cdd)
The m3.xlarge instance limits are well within range; however, it still fails.
AMI - ami-0188776c
I am a newbie to Data Pipeline, and any guidance is appreciated.
Thanks,
Hemanth
Hey, for the "RedshiftCopyActivityFromDynamoDBTable" sample, I followed exactly the same steps as the sample. However, the pipeline always gives me the error "java.lang.RuntimeException: org.postgresql.util.PSQLException: Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections."
If I use SQL Workbench with the JDBC driver directly, the same command works. It just doesn't work on AWS Data Pipeline.
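When the same credentials work from a desktop SQL client but not from the pipeline, a frequent cause is that the EC2 resource Data Pipeline launches is not allowed through the Redshift cluster's security group. A small probe like the following, run from the pipeline host, separates plain networking failures from JDBC configuration issues (the host and port below are placeholders):

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. can_connect("my-cluster.xxxx.eu-west-1.redshift.amazonaws.com", 5439)
```

If this returns False from the pipeline's instance but True from the machine running SQL Workbench, the fix is in the security group or VPC routing, not in the pipeline definition.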
Hello
I am facing a timeout issue when applying the Data Pipeline template for EFS backups. Usually that suggests a wrong security group configuration; however, I manually launched an EC2 instance belonging to mySrcSecGroupID and myBackupSecGroupID, and accessing both EFS volumes was OK.
Attaching the StdErr.log below.
Thanks,
Peter
--2016-09-22 08:52:24-- https://s3-us-west-2.amazonaws.com/XXXXXX/efs-backup.sh
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 54.231.169.16
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|54.231.169.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2986 (2.9K) [application/x-sh]
Saving to: ‘efs-backup.sh’
0K .. 100% 93.5M=0s
2016-09-22 08:52:24 (93.5 MB/s) - ‘efs-backup.sh’ saved [2986/2986]
mount.nfs: Connection timed out
mount.nfs: Connection timed out
rm: cannot remove ‘/tmp/efs-backup.log’: No such file or directory
If I have my own standalone Spark cluster with HDFS/YARN configured, what changes are required to run this code?
Hi,
Poking around in your code looking for useful Boto examples, I noticed you are explicitly deleting S3 buckets provisioned by a CloudFormation stack.
https://github.com/awslabs/data-pipeline-samples/blob/master/setup/stacker.py#L79
if r.resource_type == "AWS::S3::Bucket":
    if not s3:
        s3 = boto3.resource("s3")
    bucket = s3.Bucket(r.physical_resource_id)
    for key in bucket.objects.all():
        key.delete()
I was wondering why you felt the need to explicitly delete S3 buckets that have been provisioned by CloudFormation. Are they not handled by Stack.Delete()?
Thanks
Terry
We had this script stop working on Data Pipeline on or around Nov 8, 2016 (using the AWS walkthrough approach with Data Pipeline). We also can't get it running on a new instance. I'm not sure what changed; still investigating. It seems the instance created by Data Pipeline can't see the mounts.
The mount commands weren't throwing any timeout errors, so I spun up an EC2 instance with the same AMI as Data Pipeline uses. The mount command works the first time, but no files appear on the share (they do on a "standard" EC2 AMI using the same mount command). If I unmount and run the command a second time, it hangs and doesn't time out (even after 20 minutes).
Will do some more investigating, but for now we just have the efs-backup.sh command running on a t2.micro as a cron job (which works fine).
Hello,
I am doing some tests using EFS with Data Pipeline, following the EFS backup sample.
I can back up EFS, but I have tried to restore EFS many times and it fails. The EFS size is only 200 MB.
Here is the EFS S3 log:
27 Jul 2017 08:42:35,923 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.taskrunner.TaskPoller: Executing: amazonaws.datapipeline.activity.ShellCommandActivity@31e1783b
27 Jul 2017 08:42:36,027 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.objects.CommandRunner: Executing command: wget https://raw.githubusercontent.com/awslabs/data-pipeline-samples/master/samples/EFSBackup/efs-restore.sh
chmod a+x efs-restore.sh
./efs-restore.sh $1 $2 $3 $4 $5
27 Jul 2017 08:42:36,042 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.objects.CommandRunner: configure ApplicationRunner with stdErr file: output/logs/df-02164142M6NFNIT11Y63/ShellCommandActivityObj/@ShellCommandActivityObj_2017-07-27T08:40:29/@ShellCommandActivityObj_2017-07-27T08:40:29_Attempt=1/StdError and stdout file :output/logs/df-02164142M6NFNIT11Y63/ShellCommandActivityObj/@ShellCommandActivityObj_2017-07-27T08:40:29/@ShellCommandActivityObj_2017-07-27T08:40:29_Attempt=1/StdOutput
27 Jul 2017 08:42:36,043 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.objects.CommandRunner: Executing command: output/tmp/df-02164142M6NFNIT11Y63-1f87ddb121394f15b1d638c67340a48e/ShellCommandActivityObj20170727T084029Attempt1_command.sh with env variables :{} with argument : [10.1.2.200:/, 10.1.2.251:/, daily, 0, backup-fs-12345678]
27 Jul 2017 08:42:38,569 [ERROR] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.connector.staging.StageFromS3Connector: Script returned with exit status 23
27 Jul 2017 08:42:38,605 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.taskrunner.LogMessageUtil: Returning tail errorMsg :--2017-07-27 08:42:36-- https://raw.githubusercontent.com/awslabs/data-pipeline-samples/master/samples/EFSBackup/efs-restore.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.52.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.52.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1474 (1.4K) [text/plain]
Saving to: ‘efs-restore.sh’
0K . 100% 127M=0s
2017-07-27 08:42:36 (127 MB/s) - ‘efs-restore.sh’ saved [1474/1474]
./efs-restore.sh: line 22: [: too many arguments
rsync: change_dir "/mnt/backups/backup-fs-12345678/daily.0" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]
27 Jul 2017 08:42:38,606 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.taskrunner.HeartBeatService: Finished waiting for heartbeat thread @ShellCommandActivityObj_2017-07-27T08:40:29_Attempt=1
27 Jul 2017 08:42:38,606 [INFO] (TaskRunnerService-resource:df-02164142M6NFNIT11Y63_@EC2ResourceObj_2017-07-27T08:40:29-0) df-02164142M6NFNIT11Y63 amazonaws.datapipeline.taskrunner.TaskPoller: Work ShellCommandActivity took 0:2 to complete
The configuration file is nothing special.
myImageID uses the Amazon Linux AMI 2017.03.1 (PV), ami-98f3e7e1.
myInstanceType is t1.micro.
Thanks for your answer
Hi guys,
Please, can you tell me what the correct CSV format is for the DynamoDBImportCSV script?
Comma-separated only?
Are headers mandatory?
Thanks for your answer ;)
Cheers