
aws-big-data-blog's People

Contributors

afitzgibbon, ajfriedman18, aws-joe, awsnaush, awwerth, ccrosbie, chayel, christopherbozeman, daniel-aranda, dependabot[bot], dev-chris-marshall, dgraeber, dominicmmurphy, erikswensson, gwunnava, hvivani, hyandell, ianmeyers, jeffreyksmithjr, jpeddicord, kevincdeng, ksumit, mcholste, mikedeck, nehalmehta, nickcorbett, nmukerje, stmcpherson, tbj128, ydereky

aws-big-data-blog's Issues

Get error "No such property: TitanFactory for class: groovysh_evaluate"

I followed the instructions and got the error below:

gremlin> conf = new BaseConfiguration()
==>org.apache.commons.configuration.BaseConfiguration@70fab835
gremlin> conf.setProperty("storage.backend", "com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager")
==>null
gremlin> conf.setProperty("storage.dynamodb.client.endpoint", "http://localhost:4567")
==>null
gremlin> conf.setProperty("index.search.backend", "elasticsearch")
==>null
gremlin> conf.setProperty("index.search.directory", "/tmp/searchindex")
==>null
gremlin> conf.setProperty("index.search.elasticsearch.client-only", "false")
==>null
gremlin> conf.setProperty("index.search.elasticsearch.local-mode", "true")
==>null
gremlin> conf.setProperty("index.search.elasticsearch.inteface", "NODE")
==>null
gremlin> g = TitanFactory.open(conf)
No such property: TitanFactory for class: groovysh_evaluate

How can I solve this problem? Thank you.

How to simply make changes to the SampleTopology?

It seems like the point of this is to be able to update the SampleTopology with your own rules, but it's not clear to a Storm newbie how to do so. I tried using the storm binary, but I get "SLF4J: Class path contains multiple SLF4J bindings". The docs say to use Maven rather than storm to make changes locally, but it's not clear how to do that on a prebuilt image. Can we get some simple docs on how to make changes to the topology and deploy them?

aws-blog-spark-parquet-conversion: java.lang.ClassCastException when transforming JSON into Parquet

Hello,

I was trying to use the same script to transform gzipped JSON data into Parquet, but I encountered the following error:

java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

I'm using the same EMR configuration (instance types, instance count, release version, ...) as shown in this example. I created my Hive table using ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'. An example of the DDL scripts used to create and alter my table:

createtable1.hql

DROP TABLE my_events;
CREATE EXTERNAL TABLE IF NOT EXISTS my_events (text string,number int)
PARTITIONED BY(year int, month int, day int, hour int)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
 LOCATION 's3://bucket/json/';

addpartitions1.py

ALTER TABLE my_events ADD PARTITION (year=2017, month=8, day=1, hour=0) location 's3://bucket/json/2017/08/01/00';

In order for Spark to be able to read my table, I had to make the following modifications to the configuration file on my master node (adding the SerDe to both the driver's and the executor's classpath); a quick way to verify that the settings took effect is sketched after this list:

  1. Edit /etc/spark/conf/spark-defaults.conf on the EMR master node
  2. Add /usr/lib/hive-hcatalog/share/hcatalog/* to the spark.driver.extraClassPath and spark.executor.extraClassPath properties
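
For reference, a minimal sanity check (a sketch, assuming PySpark is available on the master node) that the two classpath properties were actually picked up by a running session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CheckClasspath").enableHiveSupport().getOrCreate()
conf = spark.sparkContext.getConf()
# Both should include /usr/lib/hive-hcatalog/share/hcatalog/* after the edit above
print(conf.get("spark.driver.extraClassPath", ""))
print(conf.get("spark.executor.extraClassPath", ""))
spark.stop()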

The Python script is practically the same. I only added a few type casts to some columns in my dataframe before calling the write2parquet function, like:

from pyspark.sql.types import DoubleType  # import needed for the cast
rdf = rdf.withColumn('columnName', rdf['columnName'].cast(DoubleType()))

Nevertheless, I tried without those type casts and still got the same error. The following is the script used, without the type casting:

convert2parquet1.py

from concurrent.futures import ThreadPoolExecutor, as_completed
from pyspark.sql import SparkSession

def write2parquet(start):
    # Take a 10-day slice of the source table and append it to the Parquet output,
    # partitioned by year/month/day/hour
    df = rdf.filter((rdf.day >= start) & (rdf.day <= start + 10))
    df.repartition(*partitionby).write.partitionBy(partitionby).mode("append").parquet(output, compression=codec)

partitionby = ['year', 'month', 'day', 'hour']
output = '/user/hadoop/myevents_pq'
codec = 'snappy'
hivetablename = 'default.my_events'

spark = SparkSession.builder.appName("Convert2Parquet").enableHiveSupport().getOrCreate()
rdf = spark.table(hivetablename)

# Submit the conversion to a single worker thread and wait for it to complete
futures = []
pool = ThreadPoolExecutor(1)

for i in [1]:
    futures.append(pool.submit(write2parquet, i))
for x in as_completed(futures):
    pass

spark.stop()

Is there something I'm missing here?

aws-big-data-blog/aws-blog-athena-importing-hive-metastores/

There is a bug in the script exportdatabase.py. Line 25,

command="hive -f "+schema+"_tables.hql -S >> "+schema+".output"

needs to be replaced with the following:

command="hive --database "+schema+" -f "+schema+"_tables.hql -S >> "+schema+".output"

The schema name needs to be specified when exporting from a schema other than the default; a sketch of the corrected line in context follows.
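
For illustration, a minimal sketch of how the corrected line might sit in context, assuming the script builds the command string and runs it via os.system (the surrounding variable values here are hypothetical, not the actual exportdatabase.py contents):

import os

schema = "mydb"  # hypothetical schema name; in the real script this comes from the export loop
command = "hive --database " + schema + " -f " + schema + "_tables.hql -S >> " + schema + ".output"
os.system(command)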

Sparkstreaming-from-kafka

mvn install clean FAILURE

Can someone provide the correct versions of these plugins?

[WARNING] Some problems were encountered while building the effective model for com.awsproserv:kafkaandsparkstreaming:jar:0.0.1-SNAPSHOT
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 133, column 12
[WARNING] 'build.plugins.plugin.version' for org.scala-tools:maven-scala-plugin is missing. @ line 107, column 12
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-jar-plugin is missing. @ line 160, column 12

[WARNING] Expected all dependencies to require Scala version: 2.11.8
[WARNING] com.awsproserv:kafkaandsparkstreaming:0.0.1-SNAPSHOT requires scala version: 2.11.8
[WARNING] org.scala-lang:scala-compiler:2.11.8 requires scala version: 2.11.8
[WARNING] org.scala-lang:scala-reflect:2.11.8 requires scala version: 2.11.8
[WARNING] org.scala-lang.modules:scala-xml_2.11:1.0.4 requires scala version: 2.11.4
[WARNING] Multiple versions of scala libraries detected!

[ERROR] error: scala.reflect.internal.MissingRequirementError: object java.lang.Object in compiler mirror not found.
[ERROR] at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:17)
[ERROR] at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:18)

Doesn't seem to work with Twitter API v2

Hi,

I am trying to test aws-blog-firehose-lambda-elasticsearch-near-real-time-discovery-platform/ with the Twitter API v2, but it seems the code doesn't work with it. Is there any update to this project for the Twitter API v2?

thanks

Missing configureIgnite.sh file

The Real-Time OLTP with Apache Ignite blog is missing an explanation or reference to the location of the "configureIgnite.sh" file. It doesn't seem to be part of the official Ignite distributions. Please advise where to obtain the file.

aws-blog-kinesis-producer-library: generateClickEvent data gives misleading performance results

In aws-blog-kinesis-producer-library, click events are generated with the following code:

private static ClickEvent generateClickEvent() {
    byte[] id = new byte[13];
    RANDOM.nextBytes(id);
    String data = StringUtils.repeat("a", 350);
    return new ClickEvent(DatatypeConverter.printBase64Binary(id), data);
}

However, the bulk of the content is just the letter "a", which means the data sent to the servers is extremely compressible (i.e., each uploaded compressed payload is essentially the 13-byte random ID plus a few more bytes).

It would be more representative of real performance to make a larger percentage of the event payload random, as sketched below.
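
A minimal sketch of the idea, shown in Python for brevity (the actual change would go in the Java generateClickEvent() above); filling the payload with random bytes keeps it from compressing away:

import base64, os

def generate_click_event():
    # 13 random bytes for the ID, as in the original, plus a ~350-character random payload
    record_id = base64.b64encode(os.urandom(13)).decode("ascii")
    data = base64.b64encode(os.urandom(263)).decode("ascii")[:350]
    return record_id, data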

Amazon Kinesis Storm Clickstream CloudFormation template fails with Epoch error

Hi there,

I just tried to use kinesis-storm-clickstream-stack.template to provision the reference application. Stack creation failed on this step:

13:10:11 UTC-0500 CREATE_FAILED AWS::CloudFormation::WaitCondition NodeJsWaitCondition WaitCondition received failed message: 'Failed to install Epoch.See epoch-install.log for details' for uniqueId: i-05d4688c

The provisioning operation was rolled back, so I'm not sure how to get to epoch-install.log.

Is the stack template out of date? Does Epoch need a version update? Any other ideas?

Thanks,
Ranjith

Failed to install Storm. See storm-install.log for details

Hi,
I was running aws-blog-kinesis-storm-clickstream-app/static/cloudformation/kinesis-storm-clickstream-stack.template, but the template stops creating the stack because it can't find the shell script "storm-install.sh" (around line 877 in the template).

Where can I find this script?

Missing resources for Query and Visualize AWS Cost and Usage Data Using Amazon Athena and Amazon QuickSight

I'm interested in the CloudFormation resources referenced in the Query and Visualize AWS Cost and Usage Data Using Amazon Athena and Amazon QuickSight post (https://aws.amazon.com/blogs/big-data/query-and-visualize-aws-cost-and-usage-data-using-amazon-athena-and-amazon-quicksight/).

I'm not seeing them in the repository and the post doesn't seem to link to anything. Do these resources exist?

INTERNAL_SERVER_ERROR after running aws-blog-firehose-lambda locally

I tried to run the Firehose Java sample locally by uncommenting lines in KinesisToFirehose.java. Here is the output:

$ mvn exec:java -Dexec.mainClass="com.amazonaws.proserv.lambda.KinesisToFirehose"
...
[INFO] --- exec-maven-plugin:1.4.0:java (default-cli) @ firehoseLambda ---
Firehose ACTIVE
deliverytest
Firehose ACTIVE
Firehose ACTIVE
{RecordId: wVgtYFcO2Pv0JlXI2cN4apw3R3mxMa9kW4GaWIXR/wUa5qVqMA14XNTLsxEpzPeP0rOO5w4EqxjoZ90NcOah6lvBkoLvqMCqsZcvazNsRtKQnUzA24BG+CPxiRJCorSPZnhCmo6+NyZch4CSB0lCLUT1/AuqKN9FBgZZ9DBg7C5vQB/BcP2VfIZAHRO69Q0XbNve2eFzEqsnzLcXcO2sue6qSf5dIOUB}
...

From the output, everything looks fine, but when I check the Firehose streams in the Kinesis section of the AWS console, I see the error INTERNAL_SERVER_ERROR. Now I cannot do much of anything: I don't even see manually created delivery streams, and the S3 bucket is still empty.

Unable to parse JSON file from S3 in AWS QuickSight

I developed a pipeline in AWS where I collect the CPU temperature of my notebook via a Python 3 script (basicPubSub.py, adapted) and send it to AWS IoT Core using security protocols; the device shadow updates successfully, a rule sends the data to CloudWatch, and DynamoDB saves it. A Data Pipeline was created to save this DynamoDB data to S3, and I want to generate plots of this data with QuickSight.

However, I am not able to make QuickSight read the file properly. The S3 file looks like this:

{"timestamp":{"s":"1526819850637"},"payload":{"m":{"$Temperature":{"s":"42.000"}}}} {"timestamp":{"s":"1526819976032"},"payload":{"m":{"$Temperature":{"s":"42.000"}}}} {"timestamp":{"s":"1526819934216"},"payload":{"m":{"$Temperature":{"s":"42.000"}}}} {"timestamp":{"s":"1526817845094"},"payload":{"m":{"$Temperature":{"s":"48.000"}}}}

When I use the manifest file below, QuickSight successfully reads the data, but the 'parseJson' command in New Calculated Field disappears, making it impossible to parse the JSON:

{ "fileLocations": [ { "URIs": [ "https://s3.amazonaws.com/my-bucket2/2018-05-20-12-32-49/12345-a279-1234-2269-491212345" ] }, { "URIPrefixes": [ "https://s3.amazonaws.com/my-bucket2/2018-05-20-12-32-49/12345-a279-1234-2269-491212345" ] } ],"globalUploadSettings": { "format": "CSV","delimiter":"\n","textqualifier":"'" } }

Quick Sight reads the JSON as:

{{"timestamp":{"s":"1526819850637"},"payload":{"m":{"$unknown":{"s":"42.000"}}}}}

... with no 'parseJson' command.

The data in QuickSight has no missing values, and the AWS pipeline works perfectly up to this point. What can I do?

These are the only formulas available:

avg
ceil
count
dateDiff
decimalToInt
distinct_count
epochDate
extract
floor
intToDecimal
max
min
round
sum
truncDate

... formulas like 'right', 'left', and parseJson simply disappear.

Errors after import

Hi,

Thanks for the code.

Getting these errors after importing aws-blog-hbase-on-emr:

  1. Missing artifact jdk.tools:jdk.tools:jar:1.7.0_65
  2. Missing artifact com.amazonaws:amazon-kinesis-connectors:jar:1.1.1

Thanks,

Hari

Error using emr-5.10.0 instance

Hello. I want to use an emr-5.10 instance (Spark 2.2, etc.).
I'd be very happy if you could show me how to solve the errors.
This is my environment:
On my master emr-5.10 instance I've uploaded both files:

  • configuration.json (original content)
  • full_install_jobserver_BA.sh script, with these parameters adapted in it:

    ####### EDIT THESE - START ########
    JOBSERVER_VERSION="v0.8.0"
    EMR_SH_FILE="s3:///jobserver/jobserver_configs/emr.sh"
    EMR_CONF_FILE="s3:///jobserver/jobserver_configs/emr.conf"
    S3_BUILD_ARCHIVE_DIR="s3:///jobserver/dist"
    ####### EDIT THESE - END ########

In s3:///jobserver/jobserver_configs/ I have:

  • emr.sh:
APP_USER=hadoop
APP_GROUP=hadoop
INSTALL_DIR=/mnt/lib/spark-jobserver
LOG_DIR=/mnt/var/log/spark-jobserver
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=1G
SPARK_VERSION=2.2.0
SPARK_HOME=/usr/lib/spark
SPARK_CONF_DIR=/etc/spark/conf
HADOOP_CONF_DIR=/etc/hadoop/conf
YARN_CONF_DIR=/etc/hadoop/conf
SCALA_VERSION=2.11.0

  • emr_contexts.conf (GitHub original content)

When I execute the full_install_jobserver_BA.sh script, I get the following error about unresolved dependencies (complete output attached):
...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: com.typesafe.akka#akka-slf4j_2.10;2.4.9: not found
[warn] :: com.typesafe.akka#akka-cluster_2.10;2.4.9: not found
[warn] :: io.spray#spray-routing-shapeless23_2.10;1.3.4: not found
[warn] :: com.typesafe.akka#akka-testkit_2.10;2.4.9: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] com.typesafe.akka:akka-slf4j_2.10:2.4.9 (/mnt/work/spark-jobserver/build.sbt#L11)
[warn] +- spark.jobserver:akka-app_2.10:0.8.0
[warn] com.typesafe.akka:akka-cluster_2.10:2.4.9 (/mnt/work/spark-jobserver/build.sbt#L11)
[warn] +- spark.jobserver:akka-app_2.10:0.8.0
[warn] io.spray:spray-routing-shapeless23_2.10:1.3.4 (/mnt/work/spark-jobserver/build.sbt#L11)
[warn] +- spark.jobserver:akka-app_2.10:0.8.0
[warn] com.typesafe.akka:akka-testkit_2.10:2.4.9 (/mnt/work/spark-jobserver/build.sbt#L11)
[warn] +- spark.jobserver:akka-app_2.10:0.8.0
[warn] Credentials file /root/.bintray/.credentials does not exist
[warn] Credentials file /root/.bintray/.credentials does not exist
sbt.ResolveException: unresolved dependency: com.typesafe.akka#akka-slf4j_2.10;2.4.9: not found
unresolved dependency: com.typesafe.akka#akka-cluster_2.10;2.4.9: not found
unresolved dependency: io.spray#spray-routing-shapeless23_2.10;1.3.4: not found
unresolved dependency: com.typesafe.akka#akka-testkit_2.10;2.4.9: not found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:313)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
...
[error] (akka-app/*:update) sbt.ResolveException: unresolved dependency: com.typesafe.akka#akka-slf4j_2.10;2.4.9: not found
[error] unresolved dependency: com.typesafe.akka#akka-cluster_2.10;2.4.9: not found
[error] unresolved dependency: io.spray#spray-routing-shapeless23_2.10;1.3.4: not found
[error] unresolved dependency: com.typesafe.akka#akka-testkit_2.10;2.4.9: not found
[error] Total time: 30 s, completed 21-ene-2018 15:15:42
Assembly failed

The user-provided path /tmp/job-server/job-server.tar.gz does not exist.
tar (child): /tmp/job-server/job-server.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now

Thank you very much!!
Julián.

error.txt

How to give AWS credentials?

Hi,

According to the documentation, the KinesisSpout constructor creates an instance of the spout using AWS credentials. But how and where do I provide the AWS access key and secret key? I could not find any file where I'm supposed to put them.

Please help. Thanks !!!
