This repository hosts code samples from the AWS Big Data Blog. http://blogs.aws.amazon.com/bigdata/
License: Apache License 2.0
Not sure if the instructions are missing a step around installing/setting up the Gremlin shell?
I tried this on Ubuntu 16.04 LTS
Regards
I followed the instructions and got the error below:
gremlin> conf = new BaseConfiguration()
==>org.apache.commons.configuration.BaseConfiguration@70fab835
gremlin> conf.setProperty("storage.backend", "com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager")
==>null
gremlin> conf.setProperty("storage.dynamodb.client.endpoint", "http://localhost:4567")
==>null
gremlin> conf.setProperty("index.search.backend", "elasticsearch")
==>null
gremlin> conf.setProperty("index.search.directory", "/tmp/searchindex")
==>null
gremlin> conf.setProperty("index.search.elasticsearch.client-only", "false")
==>null
gremlin> conf.setProperty("index.search.elasticsearch.local-mode", "true")
==>null
gremlin> conf.setProperty("index.search.elasticsearch.inteface", "NODE")
==>null
gremlin> g = TitanFactory.open(conf)
No such property: TitanFactory for class: groovysh_evaluate
How can I solve this problem? Thank you.
Team,
I am trying the exercise from https://aws.amazon.com/blogs/big-data/interactive-analysis-of-genomic-datasets-using-amazon-athena/, and when I try to run vcf2adam to convert to Parquet format, it exits saying vcf2adam is not supported and only fasta2adam is supported. Can someone help?
I have already posted an issue in 1731
Krishna
It seems like the point of this is to be able to update the SampleTopology with your own rules, but it's not clear to a Storm newbie how to do so. I tried using the storm binary, but get "SLF4J: Class path contains multiple SLF4J bindings." The docs say to use Maven rather than storm to make changes locally, but it's not clear how to do that on a prebuilt image. Can we get some simple docs on how to make changes to the topology and deploy them?
Hello,
I was trying to use the same script to transform gzipped JSON data into Parquet, but I encountered the following error:
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
I'm using the EMR configuration (instance types, number, version, ...) shown in this example. I created my Hive table using ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'. An example of the DDL scripts to create and alter my table:
DROP TABLE my_events;
CREATE EXTERNAL TABLE IF NOT EXISTS my_events (text string,number int)
PARTITIONED BY(year int, month int, day int, hour int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://bucket/json/';
ALTER TABLE my_events ADD PARTITION (year=2017, month=8, day=1, hour=0) location 's3://bucket/json/2017/08/01/00';
In order for Spark to be able to read my table, I had to modify the configuration file on my master node (adding the SerDe jar to both the driver's and the executor's classpath):
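The exact configuration isn't shown above; on EMR this kind of change is typically made in spark-defaults.conf by appending the HCatalog SerDe jar to both classpath properties. The jar path below is an assumption about a standard EMR layout, not the reporter's actual setup:

```properties
# /etc/spark/conf/spark-defaults.conf (append the SerDe jar to both paths;
# jar location is an assumption about a typical EMR install)
spark.driver.extraClassPath    <existing classpath>:/usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
spark.executor.extraClassPath  <existing classpath>:/usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
```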
The Python script is practically the same. I only added a few type casts to some columns in my dataframe before calling the write2parquet function, like:
rdf = rdf.withColumn('columnName', rdf['columnName'].cast(DoubleType()))
Nevertheless, I tried without those casts and still got the same error. This is the script used, without the casts:
from concurrent.futures import ThreadPoolExecutor, as_completed
from pyspark.sql import SparkSession

def write2parquet(start):
    df = rdf.filter((rdf.day >= start) & (rdf.day <= start + 10))
    df.repartition(*partitionby).write.partitionBy(partitionby).mode("append").parquet(output, compression=codec)

partitionby = ['year', 'month', 'day', 'hour']
output = '/user/hadoop/myevents_pq'
codec = 'snappy'
hivetablename = 'default.my_events'

spark = SparkSession.builder.appName("Convert2Parquet").enableHiveSupport().getOrCreate()
rdf = spark.table(hivetablename)

futures = []
pool = ThreadPoolExecutor(1)
for i in [1]:
    futures.append(pool.submit(write2parquet, i))
for x in as_completed(futures):
    pass

spark.stop()
Is there something I'm missing here?
There is a bug in the script exportdatabase.py. Line 25,
command="hive -f "+schema+"_tables.hql -S >> "+schema+".output"
needs to be replaced with:
command="hive --database "+schema+" -f "+schema+"_tables.hql -S >> "+schema+".output"
The schema name needs to be specified when exporting from a schema other than the default.
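The fix can be stated as a small helper — a sketch of the corrected command construction, not the actual exportdatabase.py code:

```python
def build_export_command(schema):
    """Build the hive export command, passing --database so the tables
    are read from the named schema rather than the default one."""
    return ("hive --database " + schema + " -f " + schema +
            "_tables.hql -S >> " + schema + ".output")
```

For a schema named `sales`, this yields `hive --database sales -f sales_tables.hql -S >> sales.output`.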
mvn install clean FAILURE
Can someone provide the accurate versions of these plugins?
[WARNING] Some problems were encountered while building the effective model for com.awsproserv:kafkaandsparkstreaming:jar:0.0.1-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 133, column 12
[WARNING] 'build.plugins.plugin.version' for org.scala-tools:maven-scala-plugin is missing. @ line 107, column 12
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-jar-plugin is missing. @ line 160, column 12
[WARNING] Expected all dependencies to require Scala version: 2.11.8
[WARNING] com.awsproserv:kafkaandsparkstreaming:0.0.1-SNAPSHOT requires scala version: 2.11.8
[WARNING] org.scala-lang:scala-compiler:2.11.8 requires scala version: 2.11.8
[WARNING] org.scala-lang:scala-reflect:2.11.8 requires scala version: 2.11.8
[WARNING] org.scala-lang.modules:scala-xml_2.11:1.0.4 requires scala version: 2.11.4
[WARNING] Multiple versions of scala libraries detected!
[ERROR] error: scala.reflect.internal.MissingRequirementError: object java.lang.Object in compiler mirror not found.
[ERROR] at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:17)
[ERROR] at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:18)
Hi,
I am trying to test aws-blog-firehose-lambda-elasticsearch-near-real-time-discovery-platform/ with the Twitter API v2, but it seems the code doesn't work with it. Is there any update to this project for the Twitter API v2?
thanks
The Real-Time OLTP with Apache Ignite blog is missing an explanation or reference to the location of the "configureIgnite.sh" file. It doesn't seem to be part of the official Ignite distributions. Please advise where to obtain the file.
In the aws-blog-kinesis-producer-library click events are generated with the following code:
private static ClickEvent generateClickEvent() {
    byte[] id = new byte[13];
    RANDOM.nextBytes(id);
    String data = StringUtils.repeat("a", 350);
    return new ClickEvent(DatatypeConverter.printBase64Binary(id), data);
}
However, the bulk of the content is just the letter "a", which means the data sent up to the servers will be extremely compressible (i.e., each uploaded compressed payload will essentially be the 13-byte random id plus a few more bytes). It would be more representative of real performance to make a larger percentage of the event payload random.
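A sketch of that suggestion in Python: the 13-byte random id is kept as in the Java sample, but the 350-character body is drawn from random printable characters instead of repeated "a"s, so compression can no longer collapse it.

```python
import base64
import os
import random
import string

def generate_click_event():
    # 13 random bytes for the id, base64-encoded as in the original sample.
    event_id = base64.b64encode(os.urandom(13)).decode("ascii")
    # Random printable payload instead of StringUtils.repeat("a", 350),
    # so the compressed size better reflects real event data.
    data = "".join(random.choices(string.ascii_letters + string.digits, k=350))
    return event_id, data
```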
Hi there,
I just tried to use kinesis-storm-clickstream-stack.template to provision the reference application. Stack creation failed on this step:
13:10:11 UTC-0500 CREATE_FAILED AWS::CloudFormation::WaitCondition NodeJsWaitCondition WaitCondition received failed message: 'Failed to install Epoch.See epoch-install.log for details' for uniqueId: i-05d4688c
The provisioning operation was rolled back, so I'm not sure how to get to epoch-install.log.
Is the stack template out of date? Does Epoch need a version update? Any other ideas?
Thanks,
Ranjith
Hi,
I was running aws-blog-kinesis-storm-clickstream-app/static/cloudformation/kinesis-storm-clickstream-stack.template, but the template stops creating the stack because it can't find the shell script "storm-install.sh" (around line 877 in the template).
Where can I find this script?
I'm interested in the CloudFormation resources referenced in the Query and Visualize AWS Cost and Usage Data Using Amazon Athena and Amazon QuickSight post (https://aws.amazon.com/blogs/big-data/query-and-visualize-aws-cost-and-usage-data-using-amazon-athena-and-amazon-quicksight/).
I'm not seeing them in the repository, and the post doesn't seem to link to anything. Do these resources exist?
I tried to run the Firehose java sample locally by uncommenting lines from KinesisToFirehose.java. Here is the output:
$ mvn exec:java -Dexec.mainClass="com.amazonaws.proserv.lambda.KinesisToFirehose"
...
[INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ firehoseLambda ---
Firehose ACTIVE
deliverytest
Firehose ACTIVE
Firehose ACTIVE
{RecordId: wVgtYFcO2Pv0JlXI2cN4apw3R3mxMa9kW4GaWIXR/wUa5qVqMA14XNTLsxEpzPeP0rOO5w4EqxjoZ90NcOah6lvBkoLvqMCqsZcvazNsRtKQnUzA24BG+CPxiRJCorSPZnhCmo6+NyZch4CSB0lCLUT1/AuqKN9FBgZZ9DBg7C5vQB/BcP2VfIZAHRO69Q0XbNve2eFzEqsnzLcXcO2sue6qSf5dIOUB}
...
From the output, everything looks fine, but when I check the Firehose streams in the Kinesis AWS dashboard, I see the error INTERNAL_SERVER_ERROR. Now I cannot do much of anything: I don't even see manually created Firehose streams, and the S3 bucket is still empty.
I developed a pipeline in AWS where I collect the CPU temperature from my notebook via a Python 3 script (basicPubSub.py, adapted) and send it to AWS IoT Core using security protocols; it updates the shadow successfully, a rule sends the data to CloudWatch, and DynamoDB saves it. A Data Pipeline was created to save this DynamoDB data to S3, and I want to generate plots of this data with QuickSight.
However, I am not able to make QuickSight properly read the file. The S3 file looks like this:
{"timestamp":{"s":"1526819850637"},"payload":{"m":{"$Temperature":{"s":"42.000"}}}} {"timestamp":{"s":"1526819976032"},"payload":{"m":{"$Temperature":{"s":"42.000"}}}} {"timestamp":{"s":"1526819934216"},"payload":{"m":{"$Temperature":{"s":"42.000"}}}} {"timestamp":{"s":"1526817845094"},"payload":{"m":{"$Temperature":{"s":"48.000"}}}}
When I use the manifest file below, QuickSight successfully reads the data, but the 'parseJson' command in New Calculated Field disappears, making it impossible to parse the JSON:
{ "fileLocations": [ { "URIs": [ "https://s3.amazonaws.com/my-bucket2/2018-05-20-12-32-49/12345-a279-1234-2269-491212345" ] }, { "URIPrefixes": [ "https://s3.amazonaws.com/my-bucket2/2018-05-20-12-32-49/12345-a279-1234-2269-491212345" ] } ],"globalUploadSettings": { "format": "CSV","delimiter":"\n","textqualifier":"'" } }
QuickSight reads the JSON as:
{{"timestamp":{"s":"1526819850637"},"payload":{"m":{"$unknown":{"s":"42.000"}}}}}
... with no 'parseJson' command.
The data in QuickSight has no missing values, and the AWS pipeline works perfectly up to this point. What can I do?
These are the only formulas available:
avg
ceil
count
dateDiff
decimalToInt
distinct_count
epochDate
extract
floor
intToDecimal
max
min
round
sum
truncDate
... formulas like 'right', 'left', and 'parseJson' simply disappear.
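One thing that may be worth trying (an assumption on my part, not a confirmed fix): declaring the files as JSON in the manifest's globalUploadSettings instead of forcing them through the CSV path, which appears to reduce the dataset to plain text and hide the string functions. A sketch of such a manifest, with a hypothetical bucket path:

```python
import json

# Hypothetical S3 location; the "JSON" format value assumes this
# QuickSight account supports JSON in globalUploadSettings.
manifest = {
    "fileLocations": [
        {"URIPrefixes": ["https://s3.amazonaws.com/my-bucket2/2018-05-20-12-32-49/"]}
    ],
    "globalUploadSettings": {"format": "JSON"},
}
print(json.dumps(manifest, indent=2))
```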
Hi,
Thanks for the code.
Getting these errors after importing: aws-blog-hbase-on-emr
Thanks,
Hari
RetryingBatchedClickEventsToKinesis.java
Line 63:
Math.min(MAX_BACKOFF, backoff *= 2);
Should read:
backoff = Math.min(MAX_BACKOFF, backoff * 2);
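The difference matters because the original line computes the clamped value and throws it away, while `backoff *= 2` keeps growing without bound. The intent of the corrected line, sketched in Python:

```python
def next_backoff(backoff, max_backoff):
    # Double the delay but clamp it at the maximum, mirroring the
    # corrected Java line: backoff = Math.min(MAX_BACKOFF, backoff * 2)
    return min(max_backoff, backoff * 2)
```

For example, `next_backoff(100, 1000)` gives 200, while `next_backoff(800, 1000)` is clamped to 1000.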
Hello. I want to use an emr-5.10 instance (Spark 2.2, etc.).
I'd be very happy if you could show me how to solve the errors.
This is my environment
On my master emr-5.10 instance I've uploaded both files.
In s3:///jobserver/jobserver_configs/ I have:
-emr.sh:
APP_USER=hadoop
APP_GROUP=hadoop
INSTALL_DIR=/mnt/lib/spark-jobserver
LOG_DIR=/mnt/var/log/spark-jobserver
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=1G
SPARK_VERSION=2.2.0
SPARK_HOME=/usr/lib/spark
SPARK_CONF_DIR=/etc/spark/conf
HADOOP_CONF_DIR=/etc/hadoop/conf
YARN_CONF_DIR=/etc/hadoop/conf
SCALA_VERSION=2.11.0
When I execute the full_install_jobserver_BA.sh script, I get the following error related to unresolved dependencies (complete output attached):
...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: com.typesafe.akka#akka-slf4j_2.10;2.4.9: not found
[warn] :: com.typesafe.akka#akka-cluster_2.10;2.4.9: not found
[warn] :: io.spray#spray-routing-shapeless23_2.10;1.3.4: not found
[warn] :: com.typesafe.akka#akka-testkit_2.10;2.4.9: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] com.typesafe.akka:akka-slf4j_2.10:2.4.9 (/mnt/work/spark-jobserver/build.sbt#L11)
[warn] +- spark.jobserver:akka-app_2.10:0.8.0
[warn] com.typesafe.akka:akka-cluster_2.10:2.4.9 (/mnt/work/spark-jobserver/build.sbt#L11)
[warn] +- spark.jobserver:akka-app_2.10:0.8.0
[warn] io.spray:spray-routing-shapeless23_2.10:1.3.4 (/mnt/work/spark-jobserver/build.sbt#L11)
[warn] +- spark.jobserver:akka-app_2.10:0.8.0
[warn] com.typesafe.akka:akka-testkit_2.10:2.4.9 (/mnt/work/spark-jobserver/build.sbt#L11)
[warn] +- spark.jobserver:akka-app_2.10:0.8.0
[warn] Credentials file /root/.bintray/.credentials does not exist
[warn] Credentials file /root/.bintray/.credentials does not exist
sbt.ResolveException: unresolved dependency: com.typesafe.akka#akka-slf4j_2.10;2.4.9: not found
unresolved dependency: com.typesafe.akka#akka-cluster_2.10;2.4.9: not found
unresolved dependency: io.spray#spray-routing-shapeless23_2.10;1.3.4: not found
unresolved dependency: com.typesafe.akka#akka-testkit_2.10;2.4.9: not found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:313)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
...
[error] (akka-app/*:update) sbt.ResolveException: unresolved dependency: com.typesafe.akka#akka-slf4j_2.10;2.4.9: not found
[error] unresolved dependency: com.typesafe.akka#akka-cluster_2.10;2.4.9: not found
[error] unresolved dependency: io.spray#spray-routing-shapeless23_2.10;1.3.4: not found
[error] unresolved dependency: com.typesafe.akka#akka-testkit_2.10;2.4.9: not found
[error] Total time: 30 s, completed 21-ene-2018 15:15:42
Assembly failed
The user-provided path /tmp/job-server/job-server.tar.gz does not exist.
tar (child): /tmp/job-server/job-server.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
Thank you very much!!
Julián.
Hi,
According to the documentation, the KinesisSpout constructor builds an instance of the spout using AWS credentials. But how and where do I provide the AWS access key and secret key? I could not find any file where I'm supposed to put them.
Please help. Thanks!!!
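Not an authoritative answer, but with the AWS SDK for Java the default credentials provider chain usually picks up keys from environment variables, from ~/.aws/credentials, or from an EC2 instance profile — assuming the spout is wired to that chain, a file like the following (placeholder values) is often all that's needed:

```ini
# ~/.aws/credentials (placeholder values, not real keys)
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```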