dqops / dqo

Data Quality and Observability platform for the whole data lifecycle, from profiling new data sources to full automation with Data Observability. Configure data quality checks from the UI or in YAML files, and let DQOps run them daily to detect data quality issues.

Home Page: https://dqops.com/docs/

License: Apache License 2.0

Python 18.20% Shell 0.06% Batchfile 0.05% HCL 0.01% Java 72.90% Jinja 3.98% Dockerfile 0.01% JavaScript 0.01% HTML 0.01% TypeScript 4.68% CSS 0.01% Handlebars 0.08%
data-ops data-quality data-quality-checks data-quality-measurement data-quality-monitoring data-quality-report monitoring data-observability data-profiling

dqo's Issues

Cannot create calculated column on azure blob table

After creating a calculated column on a gzipped, hive-partitioned JSON table that is read from Azure Blob Storage, DQOps throws the error below when collecting statistics on that calculated column.

The calculated column uses this query:

dayname(scraped_at::timestamp)

The same calculated column works perfectly on a hive-partitioned Parquet table stored in S3. Manually setting the column data type to STRING or VARCHAR does not seem to help.

Error stacktrace:

2024-05-13 09:41:41.120 [pool-5-thread-2] ERROR c.d.c.jobqueue.BaseDqoJobQueueImpl -- Failed to execute a job: com.dqops.execution.statistics.jobs.DqoStatisticsCollectionJobFailedException: Cannot collect statistics on the table *redacted* on the connection azure, the first error: Cannot invoke "com.dqops.metadata.sources.ColumnTypeSnapshotSpec.getColumnType()" because "typeSnapshot" is null
java.lang.NullPointerException: Cannot invoke "com.dqops.metadata.sources.ColumnTypeSnapshotSpec.getColumnType()" because "typeSnapshot" is null
	at com.dqops.metadata.sources.fileformat.TableOptionsFormatter.lambda$formatColumns$1(TableOptionsFormatter.java:97)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.base/java.util.stream.SliceOps$1$1.accept(SliceOps.java:200)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEachOrdered(ReferencePipeline.java:601)
	at com.dqops.metadata.sources.fileformat.TableOptionsFormatter.formatColumns(TableOptionsFormatter.java:95)
	at com.dqops.metadata.sources.fileformat.JsonFileFormatSpec.buildSourceTableOptionsString(JsonFileFormatSpec.java:96)
	at com.dqops.metadata.sources.fileformat.FileFormatSpec.buildTableOptionsString(FileFormatSpec.java:170)
	at com.dqops.execution.sqltemplates.rendering.JinjaTemplateRenderParameters.createFromTrimmedObjects(JinjaTemplateRenderParameters.java:161)
	at com.dqops.execution.sqltemplates.rendering.JinjaSqlTemplateSensorRunner.prepareSensor(JinjaSqlTemplateSensorRunner.java:104)
	at com.dqops.execution.sensors.DataQualitySensorRunnerImpl.prepareSensor(DataQualitySensorRunnerImpl.java:93)
	at com.dqops.execution.statistics.TableStatisticsCollectorsExecutionServiceImpl.prepareSensors(TableStatisticsCollectorsExecutionServiceImpl.java:264)
	at com.dqops.execution.statistics.TableStatisticsCollectorsExecutionServiceImpl.executeCollectorsOnTable(TableStatisticsCollectorsExecutionServiceImpl.java:154)
	at com.dqops.execution.statistics.StatisticsCollectorsExecutionServiceImpl.executeStatisticsCollectorsOnTable(StatisticsCollectorsExecutionServiceImpl.java:171)
	at com.dqops.execution.statistics.jobs.CollectStatisticsOnTableQueueJob.onExecute(CollectStatisticsOnTableQueueJob.java:82)
	... 8 common frames omitted
Wrapped by: com.dqops.execution.statistics.jobs.DqoStatisticsCollectionJobFailedException: Cannot collect statistics on the table *redacted* on the connection azure, the first error: Cannot invoke "com.dqops.metadata.sources.ColumnTypeSnapshotSpec.getColumnType()" because "typeSnapshot" is null
	at com.dqops.execution.statistics.jobs.CollectStatisticsOnTableQueueJob.onExecute(CollectStatisticsOnTableQueueJob.java:99)
	at com.dqops.execution.statistics.jobs.CollectStatisticsOnTableQueueJob.onExecute(CollectStatisticsOnTableQueueJob.java:39)
	at com.dqops.core.jobqueue.DqoQueueJob.execute(DqoQueueJob.java:128)
	... 6 common frames omitted
Wrapped by: com.dqops.core.jobqueue.exceptions.DqoQueueJobExecutionException: com.dqops.execution.statistics.jobs.DqoStatisticsCollectionJobFailedException: Cannot collect statistics on the table *redacted* on the connection azure, the first error: Cannot invoke "com.dqops.metadata.sources.ColumnTypeSnapshotSpec.getColumnType()" because "typeSnapshot" is null
	at com.dqops.core.jobqueue.DqoQueueJob.execute(DqoQueueJob.java:142)
	at com.dqops.core.jobqueue.BaseDqoJobQueueImpl.jobProcessingThreadLoop(BaseDqoJobQueueImpl.java:203)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

Steps to reproduce:

  1. Create a data source table on Azure Blob Storage from hive-partitioned, gzipped, newline-delimited JSON files in which some field contains a timestamp
  2. Create a calculated column using the expression mentioned above
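
To prepare test data for step 1, the files can be generated locally and then uploaded to the blob container. The following is only a minimal sketch: the partition key name `date`, the file name `data.json.gz`, and the helper names are assumptions made for illustration; only the `scraped_at` field comes from the report. The second helper returns the English weekday name that `dayname(scraped_at::timestamp)` would be expected to yield for one value.

```python
import gzip
import json
from datetime import datetime, timedelta
from pathlib import Path

# Fixed English names so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def write_sample_partitions(root, start, days=3):
    """Write one gzipped NDJSON file per day under root/date=YYYY-MM-DD/.

    The hive partition key "date" and the file name are hypothetical.
    """
    written = []
    for offset in range(days):
        ts = start + timedelta(days=offset)
        part_dir = Path(root) / f"date={ts:%Y-%m-%d}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "data.json.gz"
        # One JSON object per line; scraped_at is stored as a string.
        row = {"id": offset, "scraped_at": ts.isoformat(sep=" ")}
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            fh.write(json.dumps(row) + "\n")
        written.append(path)
    return written


def expected_dayname(scraped_at):
    """Weekday name expected from the calculated column for one value."""
    return DAY_NAMES[datetime.fromisoformat(scraped_at).weekday()]
```

Calling `write_sample_partitions("sample", datetime(2024, 5, 13, 9, 41, 41))` creates three partition folders under `sample/`, which can then be copied to the storage account with any upload tool.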

Bug observed in:

Some data quality incidents are empty

After running daily partition checks, DQOps sometimes creates empty data quality incidents that contain no issues. I could not find a root cause; it seems to happen randomly (in around 10% of all reported incidents).

After checking the DevTools, it seems like the API is sending an empty issue list.

Steps to reproduce:

  1. Add a few hive-partitioned tables to a single data source (S3, Parquet file format)
  2. Set up daily partition checks for those tables (some are templated by column name, since the tables share the same schema; some tables have overridden template rule values)
  3. Batch run the daily partition checks for a 45-day time period

Bug encountered in versions:

  • release v1.2.0
  • develop commit hash: dfb42f1

Error adding MS SQL SERVER

Command failed, error message: Cannot invoke "com.dqops.connectors.sqlserver.SqlServerAuthenticationMode.toString()" because "cloned.authenticationMode" is null

Please enter one of the [] values: 10
SQL Server host name (--sqlserver-host) [${SQLSERVER_HOST}]: 00000
SQL Server port number (--sqlserver-port) [${SQLSERVER_PORT}]: 1433
SQL Server database name (--sqlserver-database) [${SQLSERVER_DATABASE}]: mydatabase
SQL Server user name (--sqlserver-user) [${SQLSERVER_USER}]: myuser
SQL Server user password (--sqlserver-password) [${SQLSERVER_PASSWORD}]: mypassword
Disable SSL encryption (--sqlserver-disable-encryption) [y,N]: Y
Connection CIGAM was successfully added.
Run 'table import -c=mydatabase' to import tables.
dqo> table import -c=mydatabase
Command failed, error message: Cannot invoke "com.dqops.connectors.sqlserver.SqlServerAuthenticationMode.toString()" because "cloned.authenticationMode" is null
dqo>

Incident link redirects to the main page

Hi!
I have a link to an incident, copied from the browser URL field, like https://dqops.monitoringsystem.com/incidents/test-db/2024/5/65927154-14fc-5c70-27d9-a83cb460003e.
When the incident is already open in a tab in the DQOps navbar (even an inactive one) and I then click the link, it works properly. But if I send the link to someone who has never opened the incident before, it does not work. As a result, we cannot attach incident links to issues or send them in messages to coworkers. How can we share an incident with other people? Thank you!

Steps to reproduce:

  1. Run checks that produce a data quality incident in DQOps
  2. Open the incident and copy the link from the browser
  3. Close the incident tab in DQOps
  4. Open the link in another browser tab; it won't work
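
For reference, the copied link appears to carry everything needed to identify the incident on its own. The sketch below splits the example URL into its path segments; reading them as connection name, year, month, and incident ID is an assumption based on the URL shape, not on documented DQOps routing, and `parse_incident_link` is a hypothetical helper.

```python
from urllib.parse import urlparse


def parse_incident_link(url):
    """Split an incident link into its path components.

    Assumed layout: /incidents/<connection>/<year>/<month>/<incident id>
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 5 or parts[0] != "incidents":
        raise ValueError(f"not an incident link: {url}")
    _, connection, year, month, incident_id = parts
    return {
        "connection": connection,
        "year": int(year),
        "month": int(month),
        "incident_id": incident_id,
    }
```

If that reading is right, a freshly opened browser session should be able to resolve the incident from the URL alone, without the tab having been opened before.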

Able to host in our own GCP Org?

I've been looking at what you are offering here and think it's wonderful. Kudos for the level of detail and knowledge sharing put in across the board; it has not gone unnoticed!

I am the Director of Data Science & MLOps at an org committed to automating and proactively responding to data incidents. We are a GCP, cloud-first group.

My question is: rather than joining the DQOps GCP resources, are we able to set up shop within our own organization? Meaning not just the front end, but everything?

I haven't been able to find anything speaking to that in the resources you have shared, but if that's a possibility I'd love to learn more.

I'm guessing the initial reaction is something like: it's not just one service or tool, it's integrated widely across the board (compute, messaging, hosting, etc.), which would certainly be understandable.

That said, this is what we do on a daily basis (manage massive data and provide it to customers end-to-end). I'm not worried about having to stand it up, maintain, and sustain it on our own.

So if you could share a bit about what that would look like, that would be great, or perhaps we could jump on a call to discuss it.

Appreciate it! 🔥

Error trying to add DQOps connection to DuckDB using the CLI

dqo> connection add --duckdb-directories=<"path"="/folder1/subfolder2/my-salesforce-pipeline">

Command failed, error message: Cannot invoke "com.dqops.cli.terminal.TerminalFactory.getReader()" because "this.terminalFactory" is null

I'm not sure what the proper syntax is for the console command.

I tried to do a similar thing using the Web interface and got this error:

com.zaxxer.hikari.pool.HikariPool$PoolInitializationException: Failed to initialize pool: Invalid Input Error: Unrecognized configuration property "path"

Thanks.
