# Abacuza

A simplified data processing platform.

## Prerequisites

- Docker Engine: v19.03 or above
- Docker Compose: v1.27.2 or above
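To confirm the prerequisites are met, you can check the installed versions:

```bash
docker --version           # should report 19.03 or above
docker-compose --version   # should report 1.27.2 or above
```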
## How to Build

1. Clone the repo:

   ```bash
   git clone https://github.com/daxnet/abacuza
   ```

2. Build everything with the following command:

   ```bash
   docker-compose -f docker-compose.build.yaml build
   ```
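Once the build completes, the freshly built images should be visible locally. Assuming the image names contain `abacuza` (an assumption based on the repository name), you can list them with:

```bash
docker images | grep abacuza
```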
## How to Debug (Services)

1. Start the infrastructure services, such as the database and the Redis cache:

   ```bash
   docker-compose -f docker-compose.dev.yaml up
   ```

2. Open `abacuza.sln` in Visual Studio 2019 from the `src/services` directory
3. Press F5 to debug
## How to Debug (Client App)

1. Follow the instructions in How to Debug (Services) to start the infrastructure services and the backend services
2. Go to the `src/client` directory
3. Run `npm install` to install the dependencies
4. Run `npm start` to start the Angular development server at localhost:4200
5. Navigate to http://localhost:4200 in a web browser to access the Abacuza Administrator dashboard
## How to Run

1. Execute the following command to run everything:

   ```bash
   docker-compose up
   ```

2. Navigate to http://localhost:9320 in a web browser to access the Abacuza Administrator dashboard
## Develop the Word Count Application

Microsoft provides a .NET for Apache Spark tutorial that demonstrates counting the words in a given text file. We will use that demo to show the features and data processing capabilities provided by Abacuza.

An application in Abacuza describes how data should be processed or transformed; it is usually developed by data scientists to meet their analysis needs. Applications are assigned to job runners and loaded by a job runner when a project requests a data processing session. Developing an application for Abacuza involves the following tasks:
- Create a new .NET 5 console application
- Add the `Microsoft.Spark` and `Abacuza.JobRunners.Spark.SDK` NuGet package references
- Customize the application
- Build and pack the application

1. Create a new .NET 5 console application:

   ```bash
   $ dotnet new console -f net5.0 -n WordCountApp
   ```

2. Add the NuGet package references:

   ```bash
   $ dotnet add package Microsoft.Spark --version 2.0.0
   $ dotnet add package Abacuza.JobRunners.Spark.SDK --prerelease
   ```
3. Add a new class that derives from `SparkRunnerBase`; its body is adapted from the example code provided by Microsoft (for intuition, a plain-LINQ sketch of the same pipeline follows these steps):

   ```csharp
   using Abacuza.JobRunners.Spark.SDK;
   using Microsoft.Spark.Sql;

   namespace WordCountApp
   {
       public class WordCountRunner : SparkRunnerBase
       {
           public WordCountRunner(string[] args) : base(args) { }

           protected override DataFrame RunInternal(SparkSession sparkSession, DataFrame dataFrame)
               => dataFrame
                   // Split each line into an array of words
                   .Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
                   // Flatten the arrays so that each row holds a single word
                   .Select(Functions.Explode(Functions.Col("words")).Alias("word"))
                   // Count the occurrences of each distinct word, most frequent first
                   .GroupBy("word")
                   .Count()
                   .OrderBy(Functions.Col("count").Desc());
       }
   }
   ```
4. Modify `Program.cs` so that the `Main` method simply invokes the `WordCountRunner`:

   ```csharp
   static void Main(string[] args)
   {
       new WordCountRunner(args).Run();
   }
   ```
5. Under the WordCountApp project folder, execute the following command to publish the application targeting the Linux x64 platform:

   ```bash
   $ dotnet publish -c Release -f net5.0 -r linux-x64 -o published
   ```
6. Zip the contents of the `published` folder. Note that the ZIP file should contain only the contents of the `published` folder; the `published` folder itself shouldn't be included in the archive. For example, on Linux, the following command zips the contents of the `published` folder into a ZIP file (you can verify the layout afterwards with `unzip -l WordCountApp.zip`):

   ```bash
   $ zip -rj WordCountApp.zip published/.
   ```
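To build intuition for what the Spark pipeline in step 3 computes, here is a plain-LINQ sketch of the same word count over an in-memory list. This is an illustration only, not part of the Abacuza application:

```csharp
using System;
using System.Linq;

// Plain-LINQ equivalent of the WordCountRunner pipeline, for intuition only.
var lines = new[] { "hello world", "hello spark" };
var counts = lines
    .SelectMany(line => line.Split(' '))    // "explode" each line into words
    .GroupBy(word => word)                  // group identical words together
    .Select(g => new { Word = g.Key, Count = g.Count() })
    .OrderByDescending(x => x.Count);       // most frequent words first

foreach (var wc in counts)
    Console.WriteLine($"{wc.Word}: {wc.Count}");
// Prints: hello: 2, world: 1, spark: 1
```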
## Create a Cluster Connection

Before doing the data transformation, you need to create a cluster connection in Abacuza that connects to a data processing cluster. By default, Abacuza ships with a Spark cluster implementation, which is also the one used here.

1. Determine your IP address by using `ifconfig` (or `ipconfig` on Windows)
2. Edit the `template.env` file and set the `ACCESS_HOST` environment variable to your IP address
3. Start the Abacuza services and the front-end dashboard with the following command:

   ```bash
   $ docker-compose --env-file template.env up
   ```

   For more information about running Abacuza locally, refer to the steps above.
4. Open your web browser and navigate to `http://<your-ip-address>:9320`; this opens the Abacuza dashboard
5. Log in with your credentials; by default, the username is `super` and the password is `P@ssw0rd`
for the password -
In the left pane, click
Cluster Connections
, then in theCluster Connections
page, click theAdd Connection
button to create a new cluster connection -
In the
Add Connection
dialog, fill in the name, description fields, forCluster type
choosespark
. In theSettings
text box, input the Spark settings in JSON format. To be simple, we just specify the base URL to the Spark livy. ClickSave
button to save the changes:{ "baseUrl": "http://192.168.0.110:8998" }
-
Now your cluster connection which connects to the running
Spark
instance should be ready
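Before relying on the connection, you can check that the Livy endpoint is reachable. Livy exposes a standard REST API, so a plain HTTP request should return a JSON list of sessions (replace the address with your own):

```bash
curl http://192.168.0.110:8998/sessions
```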
## Create a Job Runner

Follow the steps below to create a job runner in Abacuza.
1. Click the `Job Runners` menu, then click the `Add Job Runner` button to create a new job runner
2. In the `Create Job Runner` dialog, fill in the name and description for the job runner, and for `Cluster type`, choose `Spark`
3. Click the `Save` button; Abacuza will redirect you to the `Job Runner Details` page
page -
In the
Job Runner Details
page, under theBinaries
section, add the following two files to theJob Runner
:microsoft-spark-3-1_2.12-2.0.0.jar
- you can find it in yourpublished
folderWordCountApp.zip
- This is the Zip file you created in step 6 of chapter Develop the Word Count Application
5. Under the `Payload template` section, use the following JSON document:

   ```json
   {
     "file": "${jr:binaries:microsoft-spark-3-1_2.12-2.0.0.jar}",
     "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
     "args": [
       "${jr:binaries:WordCountApp.zip}",
       "WordCountApp",
       "${proj:input-defs}",
       "${proj:output-defs}",
       "${proj:context}"
     ]
   }
   ```
   Note that the `${jr:binaries:...}` placeholders refer to the binary files that you've uploaded to the current job runner; a hypothetical resolved payload is sketched after these steps.
6. Save the job runner
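For intuition, the payload template mirrors a Livy batch submission body (`file`, `className`, `args`), and after substitution it might look something like the following. The resolved paths here are purely hypothetical; the actual values depend on where Abacuza stages the uploaded binaries and on your project's input and output definitions:

```json
{
  "file": "/abacuza/binaries/microsoft-spark-3-1_2.12-2.0.0.jar",
  "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
  "args": [
    "/abacuza/binaries/WordCountApp.zip",
    "WordCountApp",
    "[ ...input definitions... ]",
    "[ ...output definitions... ]",
    "{ ...project context... }"
  ]
}
```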
## Create a Project

1. Click the `Projects` menu
2. On the `Projects` page, click the `Add Project` button to add a new project
button to add a new project -
In the
Add Project
dialog, fill in the name, description of the project. ForInput endpoint
, chooseText Files
; forOutput endpoint
, chooseConsole
, which means that we want the output of the data process to be shown in the console log. For theJob Runner
, choose the one that we just created in previous steps -
Save the project, the
Project Details
page will show -
Let's prepare some data. Follow the instructions described on Microsoft official site to create a
input.txt
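   The tutorial's sample file is along these lines (this content is an assumption based on Microsoft's .NET for Apache Spark tutorial; any plain text file will work):

   ```
   Hello World
   This .NET app uses .NET for Apache Spark
   This .NET app counts words with Apache Spark
   ```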
file -
On the
Project Details
page, underINPUT
tab, add theinput.txt
as the project input -
Click
Submit
button, the data processing job will be submitted to one of the clusters whose type isspark
, and on that cluster, the customized application that we developed above will be executed for data processing. You can monitor the status of the execution from theREVISIONS
tab of theProject Details
page -
Once the job is completed successfully, you can click the
log
icon to see the logs. In this example, you can see the following output in the log
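   The following is a sketch of the expected console output, assuming the sample `input.txt` above; Spark prints the resulting DataFrame as a table, and the actual words and counts depend on your input:

   ```
   +------+-----+
   |  word|count|
   +------+-----+
   |  .NET|    3|
   |  This|    2|
   |   app|    2|
   |Apache|    2|
   | Spark|    2|
   | Hello|    1|
   | World|    1|
   |  uses|    1|
   |   for|    1|
   |counts|    1|
   | words|    1|
   |  with|    1|
   +------+-----+
   ```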
## Documentation

For more information about the architecture, the design concepts, and the developer's manual, please refer to the Abacuza Documentation.