This repository hosts code samples from the AWS Big Data Blog. http://blogs.aws.amazon.com/bigdata/
License: Apache License 2.0
Not sure if the instructions are missing a step around installing/setting up the Gremlin shell?
I tried this on Ubuntu 16.04 LTS
Regards
I followed the instructions and got the error below:
gremlin> conf = new BaseConfiguration()
==>org.apache.commons.configuration.BaseConfiguration@70fab835
gremlin> conf.setProperty("storage.backend", "com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager")
==>null
gremlin> conf.setProperty("storage.dynamodb.client.endpoint", "http://localhost:4567")
==>null
gremlin> conf.setProperty("index.search.backend", "elasticsearch")
==>null
gremlin> conf.setProperty("index.search.directory", "/tmp/searchindex")
==>null
gremlin> conf.setProperty("index.search.elasticsearch.client-only", "false")
==>null
gremlin> conf.setProperty("index.search.elasticsearch.local-mode", "true")
==>null
gremlin> conf.setProperty("index.search.elasticsearch.inteface", "NODE")
==>null
gremlin> g = TitanFactory.open(conf)
No such property: TitanFactory for class: groovysh_evaluate
How can I solve this problem? Thank you.
Team,
I am trying the exercise from https://aws.amazon.com/blogs/big-data/interactive-analysis-of-genomic-datasets-using-amazon-athena/, and when I try to run vcf2adam to convert to Parquet format, it exits saying vcf2adam is not supported and only fasta2adam is supported. Can someone help?
I have already posted an issue in 1731
Krishna
It seems like the point of this is to be able to update the SampleTopology with your own rules, but it's not clear to a Storm newbie how to do so. I tried using the storm binary, but get "SLF4J: Class path contains multiple SLF4J bindings." The docs say to use Maven rather than storm to make changes locally, but it's not clear how to do that on a prebuilt image. Can we get some simple docs on how to make changes to the topology and deploy them?
Hello,
I was trying to use the same script to transform gzipped JSON data into Parquet, but I encountered the following error:
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
I'm using the EMR configuration (instance types, number, version, ...) shown in this example. I created my Hive table using ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'. An example of the DDL scripts to create and alter my table:
DROP TABLE my_events;
CREATE EXTERNAL TABLE IF NOT EXISTS my_events (text string,number int)
PARTITIONED BY(year int, month int, day int, hour int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://bucket/json/';
ALTER TABLE my_events ADD PARTITION (year=2017, month=8, day=1, hour=0) location 's3://bucket/json/2017/08/01/00';
In order for Spark to be able to read my table, I had to modify the configuration file on my master node (adding the SerDe jar to both the driver's and the executor's classpath):
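The exact configuration isn't shown above; on EMR this kind of change is typically made in spark-defaults.conf by appending the HCatalog SerDe jar to both classpath properties. The jar path below is an assumption about a standard EMR layout, not the reporter's actual setup:

```properties
# /etc/spark/conf/spark-defaults.conf (append the SerDe jar to both paths;
# jar location is an assumption about a typical EMR install)
spark.driver.extraClassPath    <existing classpath>:/usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
spark.executor.extraClassPath  <existing classpath>:/usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
```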
The Python script is practically the same. I only added a few type casts to some columns in my dataframe before calling the write2parquet function, like:
rdf = rdf.withColumn('columnName', rdf['columnName'].cast(DoubleType()))
Nevertheless, I tried without those casts and still got the same error. This is the script used, without the casts:
from concurrent.futures import ThreadPoolExecutor, as_completed
from pyspark.sql import SparkSession

def write2parquet(start):
    df = rdf.filter((rdf.day >= start) & (rdf.day <= start + 10))
    df.repartition(*partitionby).write.partitionBy(partitionby).mode("append").parquet(output, compression=codec)

partitionby = ['year', 'month', 'day', 'hour']
output = '/user/hadoop/myevents_pq'
codec = 'snappy'
hivetablename = 'default.my_events'

spark = SparkSession.builder.appName("Convert2Parquet").enableHiveSupport().getOrCreate()
rdf = spark.table(hivetablename)

futures = []
pool = ThreadPoolExecutor(1)
for i in [1]:
    futures.append(pool.submit(write2parquet, i))
for x in as_completed(futures):
    pass

spark.stop()
Is there something I'm missing here?
There is a bug in the script exportdatabase.py. Line 25,
command="hive -f "+schema+"_tables.hql -S >> "+schema+".output"
needs to be replaced with:
command="hive --database "+schema+" -f "+schema+"_tables.hql -S >> "+schema+".output"
The schema name needs to be specified when exporting from a schema other than the default.
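The fix can be stated as a small helper — a sketch of the corrected command construction, not the actual exportdatabase.py code:

```python
def build_export_command(schema):
    """Build the hive export command, passing --database so the tables
    are read from the named schema rather than the default one."""
    return ("hive --database " + schema + " -f " + schema +
            "_tables.hql -S >> " + schema + ".output")
```

For a schema named `sales`, this yields `hive --database sales -f sales_tables.hql -S >> sales.output`.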
mvn install clean FAILURE
Can someone provide the accurate versions of these plugins?
[WARNING] Some problems were encountered while building the effective model for com.awsproserv:kafkaandsparkstreaming:jar:0.0.1-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 133, column 12
[WARNING] 'build.plugins.plugin.version' for org.scala-tools:maven-scala-plugin is missing. @ line 107, column 12
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-jar-plugin is missing. @ line 160, column 12
[WARNING] Expected all dependencies to require Scala version: 2.11.8
[WARNING] com.awsproserv:kafkaandsparkstreaming:0.0.1-SNAPSHOT requires scala version: 2.11.8
[WARNING] org.scala-lang:scala-compiler:2.11.8 requires scala version: 2.11.8
[WARNING] org.scala-lang:scala-reflect:2.11.8 requires scala version: 2.11.8
[WARNING] org.scala-lang.modules:scala-xml_2.11:1.0.4 requires scala version: 2.11.4
[WARNING] Multiple versions of scala libraries detected!
[ERROR] error: scala.reflect.internal.MissingRequirementError: object java.lang.Object in compiler mirror not found.
[ERROR] at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:17)
[ERROR] at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:18)
Hi,
I am trying to test aws-blog-firehose-lambda-elasticsearch-near-real-time-discovery-platform/ with the Twitter API v2, but it seems the code doesn't work with it. Is there any update to this project for the Twitter API v2?
thanks
The Real-Time OLTP with Apache Ignite blog is missing an explanation or reference to the location of the "configureIgnite.sh" file. It doesn't seem to be part of the official Ignite distributions. Please advise where to obtain the file.
In the aws-blog-kinesis-producer-library click events are generated with the following code:
private static ClickEvent generateClickEvent() {
    byte[] id = new byte[13];
    RANDOM.nextBytes(id);
    String data = StringUtils.repeat("a", 350);
    return new ClickEvent(DatatypeConverter.printBase64Binary(id), data);
}
However, the bulk of the content is just the letter "a", which means the data sent up to the servers will be extremely compressible (i.e., each uploaded compressed payload will essentially be the 13-byte random id plus a few more bytes). It would be more representative of real performance to make a larger percentage of the event payload random.
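A sketch of that suggestion in Python: the 13-byte random id is kept as in the Java sample, but the 350-character body is drawn from random printable characters instead of repeated "a"s, so compression can no longer collapse it.

```python
import base64
import os
import random
import string

def generate_click_event():
    # 13 random bytes for the id, base64-encoded as in the original sample.
    event_id = base64.b64encode(os.urandom(13)).decode("ascii")
    # Random printable payload instead of StringUtils.repeat("a", 350),
    # so the compressed size better reflects real event data.
    data = "".join(random.choices(string.ascii_letters + string.digits, k=350))
    return event_id, data
```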
Hi there,
I just tried to use kinesis-storm-clickstream-stack.template to provision the reference application. Stack creation failed on this step:
13:10:11 UTC-0500 CREATE_FAILED AWS::CloudFormation::WaitCondition NodeJsWaitCondition WaitCondition received failed message: 'Failed to install Epoch.See epoch-install.log for details' for uniqueId: i-05d4688c
The provisioning operation was rolled back, so I'm not sure how to get to epoch-install.log.
Is the stack template out of date? Does Epoch need a version update? Any other ideas?
Thanks,
Ranjith
Hi,
I was running aws-blog-kinesis-storm-clickstream-app/static/cloudformation/kinesis-storm-clickstream-stack.template, but the template stops creating the stack because it can't find the shell script "storm-install.sh" (around line 877 in the template).
Where can I find this script?
I'm interested in the CloudFormation resources referenced in the Query and Visualize AWS Cost and Usage Data Using Amazon Athena and Amazon QuickSight post (https://aws.amazon.com/blogs/big-data/query-and-visualize-aws-cost-and-usage-data-using-amazon-athena-and-amazon-quicksight/).
I'm not seeing them in the repository, and the post doesn't seem to link to anything. Do these resources exist?
I tried to run the Firehose java sample locally by uncommenting lines from KinesisToFirehose.java. Here is the output:
$ mvn exec:java -Dexec.mainClass="com.amazonaws.proserv.lambda.KinesisToFirehose"
...
[INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ firehoseLambda ---
Firehose ACTIVE
deliverytest
Firehose ACTIVE
Firehose ACTIVE
{RecordId: wVgtYFcO2Pv0JlXI2cN4apw3R3mxMa9kW4GaWIXR/wUa5qVqMA14XNTLsxEpzPeP0rOO5w4EqxjoZ90NcOah6lvBkoLvqMCqsZcvazNsRtKQnUzA24BG+CPxiRJCorSPZnhCmo6+NyZch4CSB0lCLUT1/AuqKN9FBgZZ9DBg7C5vQB/BcP2VfIZAHRO69Q0XbNve2eFzEqsnzLcXcO2sue6qSf5dIOUB}
...
From the output, everything looks fine, but when I check the Firehose streams in the Kinesis AWS dashboard, I see the error INTERNAL_SERVER_ERROR. Now I cannot do much of anything: I don't even see manually created Firehose streams, and the S3 bucket is still empty.
I developed a pipeline in AWS where I collect the CPU temperature from my notebook via a Python 3 script (basicPubSub.py, adapted) and send it to AWS IoT Core using security protocols; it updates the shadow successfully, a rule sends the data to CloudWatch, and DynamoDB saves it. A Data Pipeline was created to save this DynamoDB data to S3, and I want to generate plots of this data with QuickSight.
However, I am not able to make QuickSight properly read the file. The S3 file looks like this:
{"timestamp":{"s":"1526819850637"},"payload":{"m":{"$Temperature":{"s":"42.000"}}}} {"timestamp":{"s":"1526819976032"},"payload":{"m":{"$Temperature":{"s":"42.000"}}}} {"timestamp":{"s":"1526819934216"},"payload":{"m":{"$Temperature":{"s":"42.000"}}}} {"timestamp":{"s":"1526817845094"},"payload":{"m":{"$Temperature":{"s":"48.000"}}}}
When I use the manifest file below, QuickSight successfully reads the data, but the 'parseJson' command in New Calculated Field disappears, making it impossible to parse the JSON:
{ "fileLocations": [ { "URIs": [ "https://s3.amazonaws.com/my-bucket2/2018-05-20-12-32-49/12345-a279-1234-2269-491212345" ] }, { "URIPrefixes": [ "https://s3.amazonaws.com/my-bucket2/2018-05-20-12-32-49/12345-a279-1234-2269-491212345" ] } ],"globalUploadSettings": { "format": "CSV","delimiter":"\n","textqualifier":"'" } }
QuickSight reads the JSON as:
{{"timestamp":{"s":"1526819850637"},"payload":{"m":{"$unknown":{"s":"42.000"}}}}}
... with no 'parseJson' command.
The data in QuickSight has no missing values, and the AWS pipeline works perfectly up to this point. What can I do?
These are the only formulas available:
avg
ceil
count
dateDiff
decimalToInt
distinct_count
epochDate
extract
floor
intToDecimal
max
min
round
sum
truncDate
... formulas like 'right', 'left', and 'parseJson' simply disappear.
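One thing that may be worth trying (an assumption on my part, not a confirmed fix): declaring the files as JSON in the manifest's globalUploadSettings instead of forcing them through the CSV path, which appears to reduce the dataset to plain text and hide the string functions. A sketch of such a manifest, with a hypothetical bucket path:

```python
import json

# Hypothetical S3 location; the "JSON" format value assumes this
# QuickSight account supports JSON in globalUploadSettings.
manifest = {
    "fileLocations": [
        {"URIPrefixes": ["https://s3.amazonaws.com/my-bucket2/2018-05-20-12-32-49/"]}
    ],
    "globalUploadSettings": {"format": "JSON"},
}
print(json.dumps(manifest, indent=2))
```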
Hi,
Thanks for the code.
Getting these errors after importing: aws-blog-hbase-on-emr
Thanks,
Hari
RetryingBatchedClickEventsToKinesis.java
Line 63:
Math.min(MAX_BACKOFF, backoff *= 2);
Should read:
backoff = Math.min(MAX_BACKOFF, backoff * 2);
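The difference matters because the original line computes the clamped value and throws it away, while `backoff *= 2` keeps growing without bound. The intent of the corrected line, sketched in Python:

```python
def next_backoff(backoff, max_backoff):
    # Double the delay but clamp it at the maximum, mirroring the
    # corrected Java line: backoff = Math.min(MAX_BACKOFF, backoff * 2)
    return min(max_backoff, backoff * 2)
```

For example, `next_backoff(100, 1000)` gives 200, while `next_backoff(800, 1000)` is clamped to 1000.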
Hello. I want to use an emr-5.10 instance (Spark 2.2, etc.).
I'd be very happy if you could show me how to solve the errors.
This is my environment
On my master emr-5.10 instance I've uploaded both files.
In s3:///jobserver/jobserver_configs/ I have:
-emr.sh:
APP_USER=hadoop
APP_GROUP=hadoop
INSTALL_DIR=/mnt/lib/spark-jobserver
LOG_DIR=/mnt/var/log/spark-jobserver
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=1G
SPARK_VERSION=2.2.0
SPARK_HOME=/usr/lib/spark
SPARK_CONF_DIR=/etc/spark/conf
HADOOP_CONF_DIR=/etc/hadoop/conf
YARN_CONF_DIR=/etc/hadoop/conf
SCALA_VERSION=2.11.0
When I execute the full_install_jobserver_BA.sh script, I get the following error related to unresolved dependencies (complete output attached):
...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: com.typesafe.akka#akka-slf4j_2.10;2.4.9: not found
[warn] :: com.typesafe.akka#akka-cluster_2.10;2.4.9: not found
[warn] :: io.spray#spray-routing-shapeless23_2.10;1.3.4: not found
[warn] :: com.typesafe.akka#akka-testkit_2.10;2.4.9: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] com.typesafe.akka:akka-slf4j_2.10:2.4.9 (/mnt/work/spark-jobserver/build.sbt#L11)
[warn] +- spark.jobserver:akka-app_2.10:0.8.0
[warn] com.typesafe.akka:akka-cluster_2.10:2.4.9 (/mnt/work/spark-jobserver/build.sbt#L11)
[warn] +- spark.jobserver:akka-app_2.10:0.8.0
[warn] io.spray:spray-routing-shapeless23_2.10:1.3.4 (/mnt/work/spark-jobserver/build.sbt#L11)
[warn] +- spark.jobserver:akka-app_2.10:0.8.0
[warn] com.typesafe.akka:akka-testkit_2.10:2.4.9 (/mnt/work/spark-jobserver/build.sbt#L11)
[warn] +- spark.jobserver:akka-app_2.10:0.8.0
[warn] Credentials file /root/.bintray/.credentials does not exist
[warn] Credentials file /root/.bintray/.credentials does not exist
sbt.ResolveException: unresolved dependency: com.typesafe.akka#akka-slf4j_2.10;2.4.9: not found
unresolved dependency: com.typesafe.akka#akka-cluster_2.10;2.4.9: not found
unresolved dependency: io.spray#spray-routing-shapeless23_2.10;1.3.4: not found
unresolved dependency: com.typesafe.akka#akka-testkit_2.10;2.4.9: not found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:313)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
...
[error] (akka-app/*:update) sbt.ResolveException: unresolved dependency: com.typesafe.akka#akka-slf4j_2.10;2.4.9: not found
[error] unresolved dependency: com.typesafe.akka#akka-cluster_2.10;2.4.9: not found
[error] unresolved dependency: io.spray#spray-routing-shapeless23_2.10;1.3.4: not found
[error] unresolved dependency: com.typesafe.akka#akka-testkit_2.10;2.4.9: not found
[error] Total time: 30 s, completed 21-ene-2018 15:15:42
Assembly failed
The user-provided path /tmp/job-server/job-server.tar.gz does not exist.
tar (child): /tmp/job-server/job-server.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
Thank you very much!!
Julián.
Hi,
According to the documentation, the KinesisSpout constructor builds an instance of the spout using AWS credentials. But how and where do I provide the AWS access key and secret key? I could not find any file where I'm supposed to put them.
Please help. Thanks!!!
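Not an authoritative answer, but with the AWS SDK for Java the default credentials provider chain usually picks up keys from environment variables, from ~/.aws/credentials, or from an EC2 instance profile — assuming the spout is wired to that chain, a file like the following (placeholder values) is often all that's needed:

```ini
# ~/.aws/credentials (placeholder values, not real keys)
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```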