Example Amundsen/DSE + Airflow

This project is ment to demonstrate integrating Amundsen with DSE and Airflow using Docker

Software involved

docker (docker-compose)
Amundsen
Datastax Enterprise Cassandra
Airflow

Requirements

docker, docker-compose

1. Run Amundsen

Full Installation: https://github.com/amundsen-io/amundsen/blob/main/docs/installation.md The following instructions are for setting up a version of Amundsen using Docker.

Make sure you have at least 3GB available to docker. Install docker and docker-compose.

Clone this repo and its submodules by running:

$ git clone --recursive https://github.com/amundsen-io/amundsen.git

Enter the cloned directory and run below:
```
# For Neo4j Backend
$ docker-compose -f docker-amundsen.yml up

# For Atlas
$ docker-compose -f docker-amundsen-atlas.yml up
```
If it's your first time, you may want to proactively go through troubleshooting steps, especially the first one related to heap memory for ElasticSearch and Docker engine memory allocation (leading to Docker error 137).

2. Run Airflow

Full Installation: https://airflow.apache.org/docs/apache-airflow/stable/start/docker

cd Airflow
docker-compose up airflow-init
docker-compose up

3. Populate DSE/Postgres with data

Cassandra

docker exec -it airfloworiginal_dse1_1 cqlsh

CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;
CREATE TABLE User_type(
user_type INT,
user_ID UUID,
primary key(user_ID));

Postgres

docker exec -it airfloworiginal_postgres_1 psql -U airflow

 CREATE TABLE accounts (
   user_id serial PRIMARY KEY,
   username VARCHAR ( 50 ) UNIQUE NOT NULL,
   password VARCHAR ( 50 ) NOT NULL,
   email VARCHAR ( 255 ) UNIQUE NOT NULL,
   created_on TIMESTAMP NOT NULL,
 last_login TIMESTAMP 
 );

4. Run the scripts to Extract and Publish

Transfer the scripts into the /amundsen/databuilder/example/scripts

  python cassandra_data_loader.py
  python cassaandra_no4j_es_loader.py

5. Install requirements

In the airflow_worker CLI install the dependencies

  cd dags/req
  pip install -r requirements.txt
  pip install cassandra-driver

6. Configure the DAG

In the /dags/dag.py file you need to configure the connections for Cassandra/Neo4j and ES

you should see the network

  docker network ls

example: Network ID amundsen_amundsennet bridge local 2. With this command you should be able to see all containers running on this network

  docker network inspect amundsen_amundsennet

Get the IPv4Address for this 3 containers Example:

               "Name": "airfloworiginal_dse1_1",
               "EndpointID": "3e3e13d95457c500dcf10660f0e9796b08dff4190f5893b3d1443dbff771a3f8",
               "MacAddress": "02:42:ac:15:00:09",
               "IPv4Address": "172.21.0.9/16",
               "IPv6Address": "" 

              "Name": "es_amundsen",
               "EndpointID": "dfa0fc9580d97309516add337fc4b5aa1df8e8439b7e075c28c0d3d6a990a8c4",
               "MacAddress": "02:42:ac:15:00:02",
               "IPv4Address": "172.21.0.2/16",
               "IPv6Address": ""

              "Name": "neo4j_amundsen",
               "EndpointID": "c044909c033c8f82172be6c265a70e0e077825fb1b01c960a9fd5d0373f9508f",
               "MacAddress": "02:42:ac:15:00:03",
               "IPv4Address": "172.21.0.3/16",
               "IPv6Address": ""

7. Edit the DAG file

Change the file on these 3 lines

On line 95:

  'extractor.cassandra.{}'.format(CassandraExtractor.IPS_KEY): ['172.21.0.9'],

On line 56:

  NEO4J_ENDPOINT = f'bolt://172.21.0.3:{neo_port}'

On line 51:

  {'host': '172.21.0.2', 'port': es_port},

stefanvinica / amundesn-dse-airflow Goto Github PK

amundesn-dse-airflow's Introduction

Example Amundsen/DSE + Airflow

Software involved

Requirements

1. Run Amundsen

2. Run Airflow

3. Populate DSE/Postgres with data

4. Run the scripts to Extract and Publish

5. Install requirements

6. Configure the DAG

7. Edit the DAG file

amundesn-dse-airflow's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent