fabiogjardim / bigdata_docker

Big Data Ecosystem Docker

VBA 88.38% PLSQL 0.17% Jupyter Notebook 9.33% Batchfile 0.75% Shell 1.37%

hadoop hdfs hbase hive presto spark jupyter-notebook hue mongo metabase

bigdata_docker's Introduction

Hi I'm Fábio Jardim 👋

Welcome to my profile! With a consolidated and diversified career in the field of technology and data, I have been driving organizations to reach their maximum potential through data-driven decision-making.

Over more than 20 years, I have had the honor of holding key positions such as Data Director, Engineering Manager, and Head of Big Data, where I created and implemented solutions in analytics, machine learning, data engineering, data architecture, and big data that transformed the corporate culture towards becoming data-driven. I successfully led projects in companies of various sizes and sectors, including retail, banking, and internet.

bigdata_docker's People

Contributors

Stargazers

Watchers

Forkers

tiagonpsilva corner4world andresavs ronaldobernardi osnirmaster vagnerasilva marcelamonteiromontenegrogallo mmatheusvidal marcelomrwin helioduarte arthurxpto matheusmota jatin7 mdlucca michelmiranda armandobs14 janes emersonnaka saulofurtado igorpereirabr1 mallik-g itimes-digital rodluiz albertochong vngasp thiagonogueiramgarcia clepaula tiagolugatto grdonda sunny121li tooptoop4 dougver carloszazula tcvieira pauloaraxa willpeixoto thiago-castilho glaubercss loser007 egilgamesh alexbaptista daviddesz luisble jordanboaz jingdq nailson bruno7andrade lcscarpini wallace-noronha matcgoes ahconde stanleycruvinel andressamarcal ederpereira nandacast ortisan tandisheng rodrigo-reboucas allssdevandersoncoelho next-bigdata vskywalker elkhaddari04 denisavilamontini jeanfbd m2candre andreustimm rblbigdata b-r-u-n-o marcelomata yramamurthy anarafaelagomes devvafj janairacs sebagonella cglsoft murilo-zc moisespereira gaoyangy brunnamaiaradasilva ven2day thiagoabb rbmuller cvasani mateuscastello ybq880812131 a6santa rafaelladuarte guilherme-esplugues metaver5o legomco silviojunior520 llzimmer liangqinghai italo-github mhunesi kaiyuanxuexizhe brungius iamgrewal neosun100 jdadong

bigdata_docker's Issues

Question about Docker

Hello Fábio,

According to the ecosystem diagram, will each component run in its own container? For example, would MongoDB and Mongo Express run in separate containers or in the same one?

Thank you very much,

Daniel Adorno Gomes

Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'

Hi Fábio,

Your distribution fits my needs perfectly, thank you very much.

However, I get an error whenever I try to connect Spark to Hive. The message is
Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog'. It happens both from Jupyter and directly in pyspark inside the Spark VM.

I have already reinstalled everything (deleted all Docker images and started only bigdata_docker), checked for port conflicts, and raised Docker's resources to 4 CPUs, 16 GB of memory, and swap set to 4, and nothing changed. I found nothing relevant when searching online.

I am running on an iMac (24 GB RAM) with macOS Catalina 10.15.4 and Docker 2.2.0.5.

Everything else works: Hue, Presto, and Metabase all access Hive normally.

I would appreciate any idea of what is wrong. I have not changed any of your configuration or the images.

root@jupyter-spark:/opt/spark/conf# pyspark
Python 3.5.3 (default, Sep 27 2018, 17:25:39)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/04/11 17:11:24 WARN spark.SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
20/04/11 17:11:25 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.1
      /_/

Using Python version 3.5.3 (default, Sep 27 2018 17:25:39)
SparkSession available as 'spark'.

>>> from pyspark.sql import HiveContext
>>> sqlContext = HiveContext(sc)
>>> sqlContext.sql("show databases").show()
Traceback (most recent call last):
File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o38.sql.
: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:192)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:103)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.org$apache$spark$sql$hive$HiveSessionStateBuilder$$externalCatalog(HiveSessionStateBuilder.scala:39)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$1.apply(HiveSessionStateBuilder.scala:54)
at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anonfun$1.apply(HiveSessionStateBuilder.scala:54)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog$lzycompute(SessionCatalog.scala:90)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.externalCatalog(SessionCatalog.scala:90)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listDatabases(SessionCatalog.scala:247)
at org.apache.spark.sql.execution.command.ShowDatabasesCommand$$anonfun$2.apply(databases.scala:44)
at org.apache.spark.sql.execution.command.ShowDatabasesCommand$$anonfun$2.apply(databases.scala:44)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.command.ShowDatabasesCommand.run(databases.scala:44)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:194)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:189)
... 36 more
Caused by: java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/metadata/HiveException
at org.apache.spark.sql.hive.HiveExternalCatalog.<init>(HiveExternalCatalog.scala:71)
... 41 more
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.metadata.HiveException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 42 more

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/python/pyspark/sql/context.py", line 358, in sql
return self.sparkSession.sql(sqlQuery)
File "/opt/spark/python/pyspark/sql/session.py", line 767, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: "Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':"
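
The innermost cause in the trace above is a ClassNotFoundException for org.apache.hadoop.hive.ql.metadata.HiveException, which suggests the JVM started by pyspark cannot see Hive's classes; in stock Hive builds that class ships in the hive-exec JAR. The thread does not confirm the cause, so the following is only a diagnostic sketch. It assumes Spark's JARs live in /opt/spark/jars (inferred from the /opt/spark paths in the traceback), and `jars_providing` is a hypothetical helper name. It avoids newer syntax so that it also runs on the Python 3.5.3 shown in the session.

```python
import glob
import os
import zipfile


def jars_providing(class_name, jars_dir):
    """Return the names of JAR files in jars_dir that contain class_name."""
    # A class a.b.C is stored inside a JAR as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar_path in sorted(glob.glob(os.path.join(jars_dir, "*.jar"))):
        try:
            with zipfile.ZipFile(jar_path) as archive:
                if entry in archive.namelist():
                    matches.append(os.path.basename(jar_path))
        except zipfile.BadZipFile:
            # Skip files that are not valid archives.
            continue
    return matches


if __name__ == "__main__":
    # Assumption: /opt/spark/jars is the JAR directory of this image.
    missing = "org.apache.hadoop.hive.ql.metadata.HiveException"
    found = jars_providing(missing, "/opt/spark/jars")
    print(found or "no JAR in /opt/spark/jars provides " + missing)
```

An empty result would be consistent with a Spark build that lacks the Hive JARs on its classpath; a non-empty result would point the investigation elsewhere, for example at how the classpath is configured.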

fatal: repository 'https://github.com/fabiobjardim/bigdata_docker.git/' not found

Hello Fábio, how are you?

I tried to run this command, following the instructions on your page at https://github.com/fabiogjardim/bigdata_docker, but it did not work:

C:\docker>git clone http://github.com/fabiobjardim/bigdata_docker.git
Cloning into 'bigdata_docker'...
info: please complete authentication in your browser...
remote: Repository not found.
fatal: repository 'https://github.com/fabiobjardim/bigdata_docker.git/' not found
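
The two URLs quoted in this issue differ in the account name: the page address uses fabiogjardim, while the clone command uses fabiobjardim. Whether the typo came from the instructions or from the command as typed is not stated in the thread. A small sketch that makes the mismatch explicit (`repo_owner` is a hypothetical helper name):

```python
from urllib.parse import urlparse


def repo_owner(url):
    """Return the first path segment of a GitHub URL, i.e. the account name."""
    return urlparse(url).path.strip("/").split("/")[0]


# Both URLs are copied from the issue text above.
page_url = "https://github.com/fabiogjardim/bigdata_docker"
clone_url = "http://github.com/fabiobjardim/bigdata_docker.git"

print(repo_owner(page_url))
print(repo_owner(clone_url))
print("owners match:", repo_owner(page_url) == repo_owner(clone_url))
```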

Windows 10 Enterprise 64-bit

Hello. Unfortunately, my OS, Windows 10 Enterprise 64-bit, does not support virtualization.

I currently use Docker Desktop for Windows, which depends on Hyper-V; when Hyper-V is enabled, it is incompatible with VirtualBox.

Because this is a corporate computer, I cannot change the BIOS to enable virtualization.

Given this scenario, do you have any suggestions? I do not know much about Docker, but I think there must be some alternative.

ingest data / demo example

Hi,

Hope you are all well!

Is it possible to provide an example of ingesting a CSV file into this stack?

Thanks in advance for any insights or inputs on that issue.

Cheers,
X
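
The thread contains no answer to this request. One possible approach, sketched here under stated assumptions, is to copy the CSV into HDFS and expose it to Hive as an external table. The helper below (a hypothetical name, `hive_ddl_from_csv`) only generates the CREATE EXTERNAL TABLE statement from the file's header row; it types every column as STRING, and the table name and HDFS directory are illustrative, not taken from the repository.

```python
import csv
import re


def hive_ddl_from_csv(csv_path, table, hdfs_dir):
    """Build a CREATE EXTERNAL TABLE statement from a CSV file's header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle))
    # Replace runs of non-word characters with "_" and lower-case the names.
    columns = [re.sub(r"\W+", "_", name.strip()).lower() for name in header]
    column_sql = ",\n".join("  `{}` STRING".format(col) for col in columns)
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        "LOCATION '{location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    ).format(table=table, cols=column_sql, location=hdfs_dir)


if __name__ == "__main__":
    import os
    import tempfile

    # Write a tiny sample file so the sketch runs on its own.
    sample = os.path.join(tempfile.mkdtemp(), "sales.csv")
    with open(sample, "w", newline="", encoding="utf-8") as handle:
        handle.write("id,Customer Name,amount\n1,Ana,10.5\n")
    print(hive_ddl_from_csv(sample, "demo.sales", "/user/demo/sales"))
```

Under these assumptions, the file would first be copied into the target directory (for example with `hdfs dfs -mkdir -p /user/demo/sales` followed by `hdfs dfs -put sales.csv /user/demo/sales/`), and the generated statement would then be run from Hue or another Hive client. Plain delimited parsing does not handle quoted fields that contain commas, so files of that kind need a different row format.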

Problem starting the MySQL image

Good morning, friends,

I am trying to start the MySQL image, but after starting it restarts. Looking at the log, I see the following error:

`2020-05-29 01:57:46+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 5.7.29-1debian10 started.
2020-05-29 01:57:48+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2020-05-29 01:57:48+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 5.7.29-1debian10 started.
2020-05-29T01:57:48.720344Z 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2020-05-29T01:57:48.732126Z 0 [Note] mysqld (mysqld 5.7.29) starting as process 1 ...
2020-05-29T01:57:48.749124Z 0 [Note] InnoDB: PUNCH HOLE support available
2020-05-29T01:57:48.749141Z 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2020-05-29T01:57:48.749144Z 0 [Note] InnoDB: Uses event mutexes
2020-05-29T01:57:48.749146Z 0 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2020-05-29T01:57:48.749148Z 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2020-05-29T01:57:48.749324Z 0 [Note] InnoDB: Number of pools: 1
2020-05-29T01:57:48.749400Z 0 [Note] InnoDB: Using CPU crc32 instructions
2020-05-29T01:57:48.750569Z 0 [Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
2020-05-29T01:57:48.759106Z 0 [Note] InnoDB: Completed initialization of buffer pool
2020-05-29T01:57:48.761304Z 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2020-05-29T01:57:48.809724Z 0 [Note] InnoDB: Highest supported file format is Barracuda.
2020-05-29T01:57:48.822850Z 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 155984619
2020-05-29T01:57:48.822870Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 155984628
2020-05-29T01:57:48.822874Z 0 [Note] InnoDB: Database was not shutdown normally!
2020-05-29T01:57:48.822876Z 0 [Note] InnoDB: Starting crash recovery.
2020-05-29T01:57:49.364512Z 0 [ERROR] InnoDB: Operating system error number 1 in a file operation.
2020-05-29T01:57:49.364549Z 0 [ERROR] InnoDB: Error number 1 means 'Operation not permitted'
2020-05-29T01:57:49.364555Z 0 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html
2020-05-29T01:57:49.364559Z 0 [ERROR] InnoDB: File ./ibtmp1: 'delete' returned OS error 101.
2020-05-29T01:57:49.364563Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2020-05-29T01:57:49.365233Z 0 [ERROR] InnoDB: Operating system error number 1 in a file operation.
2020-05-29T01:57:49.365244Z 0 [ERROR] InnoDB: Error number 1 means 'Operation not permitted'
2020-05-29T01:57:49.365247Z 0 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html
2020-05-29T01:57:49.365250Z 0 [ERROR] InnoDB: File ./ibtmp1: 'stat' returned OS error 101.
2020-05-29T01:57:49.365275Z 0 [ERROR] InnoDB: os_file_get_status() failed on './ibtmp1'. Can't determine file permissions
2020-05-29T01:57:49.365278Z 0 [ERROR] InnoDB: Could not create the shared innodb_temporary.
2020-05-29T01:57:49.365280Z 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2020-05-29T01:57:49.566478Z 0 [ERROR] InnoDB: Operating system error number 1 in a file operation.
2020-05-29T01:57:49.566527Z 0 [ERROR] InnoDB: Error number 1 means 'Operation not permitted'
2020-05-29T01:57:49.566563Z 0 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html
2020-05-29T01:57:49.566568Z 0 [ERROR] InnoDB: File ./ibtmp1: 'delete' returned OS error 101.
2020-05-29T01:57:49.566573Z 0 [ERROR] Plugin 'InnoDB' init function returned error.
2020-05-29T01:57:49.566576Z 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2020-05-29T01:57:49.566581Z 0 [ERROR] Failed to initialize builtin plugins.
2020-05-29T01:57:49.566583Z 0 [ERROR] Aborting

2020-05-29T01:57:49.566587Z 0 [Note] Binlog end
2020-05-29T01:57:49.566638Z 0 [Note] Shutting down plugin 'CSV'
2020-05-29T01:57:49.569736Z 0 [Note] mysqld: Shutdown complete`

This is my configuration for the image in the .yml:

database:
  image: fjardim/mysql
  container_name: database
  hostname: database
  ports:
    - "33061:3306"
  deploy:
    resources:
      limits:
        memory: 500m
  command: mysqld --innodb-flush-method=O_DSYNC --innodb-use-native-aio=OFF --init-file /data/application/init.sql
  volumes:
    - /c/docker/bigdata_docker/data/mysql/data:/var/lib/mysql
    - /c/docker/bigdata_docker/data/init.sql:/data/application/init.sql
  environment:
    MYSQL_ROOT_USER: root
    MYSQL_ROOT_PASSWORD: secret
    MYSQL_DATABASE: hue
    MYSQL_USER: root
    MYSQL_PASSWORD: secret

Any idea what might be causing the error? Note that I am using Windows 10 and Docker Desktop to run everything.

Thank you!
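
In the log above, InnoDB fails on 'stat' and 'delete' of ./ibtmp1 with 'Operation not permitted', and the data directory /var/lib/mysql is bind-mounted from a Windows host path. The thread does not confirm the cause, but one way to narrow it down is to check whether the mounted directory allows the same three file operations. The sketch below is a generic probe (`probe_directory` is a hypothetical helper name). It assumes it is run from a container that mounts the same host directory and has Python available, which the MySQL image itself may not.

```python
import os
import sys
import uuid


def probe_directory(path):
    """Try create, stat and delete on a scratch file.

    Returns a list of (step, error) pairs; error is None when the step worked.
    """
    scratch = os.path.join(path, "probe-{}.tmp".format(uuid.uuid4().hex))
    steps = [
        ("create", lambda: open(scratch, "wb").close()),
        ("stat", lambda: os.stat(scratch)),
        ("delete", lambda: os.remove(scratch)),
    ]
    results = []
    for name, action in steps:
        try:
            action()
            results.append((name, None))
        except OSError as exc:
            results.append((name, "errno {}: {}".format(exc.errno, exc.strerror)))
    return results


if __name__ == "__main__":
    # Pass the mounted directory as the first argument, e.g. /var/lib/mysql.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for step, error in probe_directory(target):
        print("{}: {}".format(step, error or "ok"))
```

If the probe fails on the bind mount but succeeds on a path inside the container's own filesystem, that would point at the Windows file sharing layer and not at MySQL itself.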