Dataset comparison did not generate a key when the compared data contains a column with a complex data type.
Expected behavior: dataset comparison has the same behavior regardless of the data type.
Test DatasetComparison (2) failed with an exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'concat_ws('|', `name`, `string`, CAST(`boolean` AS STRING), CAST(`integer` AS STRING), CAST(`date` AS STRING), CAST(`binary` AS STRING), transform(`errCol`, lambdafunction(named_struct('errType', namedlambdavariable().`errType`, 'errCode', namedlambdavariable().`errCode`, 'errMsg', namedlambdavariable().`errMsg`, 'errCol', namedlambdavariable().`errCol`, 'rawValues', namedlambdavariable().`rawValues`, 'mappings', transform(namedlambdavariable().`mappings`, lambdafunction(named_struct('mappingTableColumn', namedlambdavariable().`mappingTableColumn`, 'mappedDatasetColumn', namedlambdavariable().`mappedDatasetColumn`), namedlambdavariable()))), namedlambdavariable())), CAST(`enceladus_record_id` AS STRING))' due to data type mismatch: argument 8 requires (array<string> or string) type, however, 'transform(`errCol`, lambdafunction(named_struct('errType', namedlambdavariable().`errType`, 'errCode', namedlambdavariable().`errCode`, 'errMsg', namedlambdavariable().`errMsg`, 'errCol', namedlambdavariable().`errCol`, 'rawValues', namedlambdavariable().`rawValues`, 'mappings', transform(namedlambdavariable().`mappings`, lambdafunction(named_struct('mappingTableColumn', namedlambdavariable().`mappingTableColumn`, 'mappedDatasetColumn', namedlambdavariable().`mappedDatasetColumn`), namedlambdavariable()))), namedlambdavariable()))' is of array<struct<errType:string,errCode:string,errMsg:string,errCol:string,rawValues:array<string>,mappings:array<struct<mappingTableColumn:string,mappedDatasetColumn:string>>>> type.;;
'Project [name#0, string#1, boolean#2, integer#3, date#4, binary#5, errCol#67, enceladus_record_id#7, md5(concat_ws(|, name#0, string#1, cast(boolean#2 as string), cast(integer#3 as string), cast(date#4 as string), cast(binary#5 as string), transform(errCol#67, lambdafunction(named_struct(errType, lambda elm#89.errType, errCode, lambda elm#89.errCode, errMsg, lambda elm#89.errMsg, errCol, lambda elm#89.errCol, rawValues, lambda elm#89.rawValues, mappings, transform(lambda elm#89.mappings, lambdafunction(named_struct(mappingTableColumn, lambda elm#90.mappingTableColumn, mappedDatasetColumn, lambda elm#90.mappedDatasetColumn), lambda elm#90, false))), lambda elm#89, false)), cast(enceladus_record_id#7 as string))) AS HermesDatasetComparisonUniqueId#88]
+- Project [name#0, string#1, boolean#2, integer#3, date#4, binary#5, transform(errCol#6, lambdafunction(named_struct(errType, lambda elm#68.errType, errCode, lambda elm#68.errCode, errMsg, lambda elm#68.errMsg, errCol, lambda elm#68.errCol, rawValues, lambda elm#68.rawValues, mappings, transform(lambda elm#68.mappings, lambdafunction(named_struct(mappingTableColumn, lambda elm#69.mappingTableColumn, mappedDatasetColumn, lambda elm#69.mappedDatasetColumn), lambda elm#69, false))), lambda elm#68, false)) AS errCol#67, enceladus_record_id#7]
+- Relation[name#0,string#1,boolean#2,integer#3,date#4,binary#5,errCol#6,enceladus_record_id#7] parquet
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:116)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:108)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:281)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:281)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:280)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:278)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:93)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:105)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:104)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:116)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$2.apply(QueryPlan.scala:121)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:296)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:121)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:126)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:93)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:108)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:86)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
at org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:78)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:3412)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1340)
at org.apache.spark.sql.Dataset.withColumns(Dataset.scala:2258)
at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:2225)
at za.co.absa.hermes.datasetComparison.DatasetComparison.addKeyColumn(DatasetComparison.scala:264)
at za.co.absa.hermes.datasetComparison.DatasetComparison.compare(DatasetComparison.scala:75)
at za.co.absa.hermes.e2eRunner.plugins.DatasetComparisonPlugin.performAction(DatasetComparisonPlugin.scala:76)
at za.co.absa.hermes.e2eRunner.E2ERunnerJobExperimental$$anonfun$za$co$absa$hermes$e2eRunner$E2ERunnerJobExperimental$$tryExecute$1.apply(E2ERunnerJobExperimental.scala:105)
at za.co.absa.hermes.e2eRunner.E2ERunnerJobExperimental$$anonfun$za$co$absa$hermes$e2eRunner$E2ERunnerJobExperimental$$tryExecute$1.apply(E2ERunnerJobExperimental.scala:104)
at scala.util.Try$.apply(Try.scala:192)
at za.co.absa.hermes.e2eRunner.E2ERunnerJobExperimental$.za$co$absa$hermes$e2eRunner$E2ERunnerJobExperimental$$tryExecute(E2ERunnerJobExperimental.scala:104)
at za.co.absa.hermes.e2eRunner.E2ERunnerJobExperimental$$anonfun$runTests$1.apply(E2ERunnerJobExperimental.scala:81)
at za.co.absa.hermes.e2eRunner.E2ERunnerJobExperimental$$anonfun$runTests$1.apply(E2ERunnerJobExperimental.scala:75)
at za.co.absa.hermes.e2eRunner.TestDefinitions.fold$1(TestDefinitions.scala:61)
at za.co.absa.hermes.e2eRunner.TestDefinitions.foldLeftWithIndex(TestDefinitions.scala:63)
at za.co.absa.hermes.e2eRunner.E2ERunnerJobExperimental$.runTests(E2ERunnerJobExperimental.scala:75)
at za.co.absa.hermes.e2eRunner.E2ERunnerJobExperimental$.main(E2ERunnerJobExperimental.scala:54)
at za.co.absa.hermes.e2eRunner.E2ERunnerJobExperimental.main(E2ERunnerJobExperimental.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)