SkeletonCohortCharacterization's People

Contributors

acumarav, anthonysena, anton-abushkevich, cgreich, chrisknoll, dependabot[bot], konstjar, pavgra, ssuvorov-fls, wivern

SkeletonCohortCharacterization's Issues

Error in calculating prevalence in Cohort Characterization with criteria features.

Expected behavior

In one case the prevalence shows a count of 121 and a percentage of 3.82e-04, but the percentage should be 1.17e-03, because the denominator being used is the number of cohort episodes rather than the number of distinct people in the cohort. That is, prevalence should be distinct people with the criteria / distinct people in the cohort (see the sketch after the reproduction steps).

Actual behavior

It is calculating distinct people / cohort episode count.

Steps to reproduce behavior

  1. Create a cohort with multiple episodes per person.
  2. Create a feature that will be found in each of the episodes.
  3. Note that the prevalence is calculated as number of people in cohort / number of episodes in cohort.
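
A minimal sketch of the intended calculation, assuming a standard OMOP cohort table (cohort_definition_id, subject_id, cohort_start_date, cohort_end_date); the criteria_matches table and the @-parameters are illustrative placeholders, not the package's actual query. For orientation, working backwards from the (rounded) figures above, 121 / 3.82e-04 implies a denominator of roughly 317,000 cohort episodes, while 121 / 1.17e-03 implies roughly 103,000 distinct persons.

-- numerator: distinct persons in the cohort who satisfy the criteria feature
-- denominator: distinct persons in the cohort, not the number of cohort episodes
SELECT
  (SELECT COUNT(DISTINCT person_id) FROM criteria_matches) * 1.0
  / (SELECT COUNT(DISTINCT subject_id)
       FROM @target_database_schema.cohort
      WHERE cohort_definition_id = @cohort_id) AS prevalence;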

Prevalence criteria uses distinct people in numerator but all records in denominator

Referring to these lines in the prevalence criteria query, we see that the count of people found with the criteria is performed as a distinct person count (via the group by person_id), but the denominator is calculated with select count(*) from cohort.

The issue is that if there are multiple episodes per person, a criteria that matches every person in the cohort will still show a prevalence below 100%, because distinct people < total records.

Solution: make the totals subquery count distinct person_id.
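
As a sketch of that change (the table name below is a stand-in for whatever the totals subquery actually selects from, not the real query):

-- current: SELECT COUNT(*) AS total_cnt FROM cohort_rows        -- counts cohort episodes
-- proposed:
SELECT COUNT(DISTINCT person_id) AS total_cnt FROM cohort_rows;  -- counts distinct persons

With the proposed denominator, a criteria that matches every person in the cohort reports 100% prevalence even when persons have multiple episodes.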

Some confusion related to cohort table schema

I'm finding a few places where the cohort table schema is being confused between the result schema and temp schema:

Line 122

return getSqlQueriesToRun(createFeJsonObject(options, options.resultSchema + "." + cohortTable), cohortDefinitionId);

Line 127

String cohortTable = Optional.ofNullable(strata).map(this::getStrataCohortTable).orElse(tempSchema + "." + this.cohortTable);

This might work if the resultSchema is the same as the tempSchema, but if they differ, I believe this would break. However, looking elsewhere in the code, I see something strange when using createDefaultOptions:

Line 246

		final CohortExpressionQueryBuilder.BuildExpressionQueryOptions options = new CohortExpressionQueryBuilder.BuildExpressionQueryOptions();
		options.cdmSchema = cdmSchema;
		// Target schema
		options.resultSchema = tempSchema;
		options.cohortId = id;
		options.generateStats = false;
		return options;

One element of confusion is re-using the CohortExpressionQueryBuilder.BuildExpressionQueryOptions: this object encapsulates options for cohort definition generation, so its re-use in characterization seems strange here. But the main point of confusion is assigning options.resultSchema = tempSchema. Again, this works if resultSchema == tempSchema, but doing this sort of overwrite here is very confusing to me.

For background: the purpose of defining the cohortTable in the other analysis methods (or even as the targetTable in cohort definitions) was that we don't assume anything about the location of the cohort table: it could be in a completely different schema in the database (something other than the result or temp schema). Its purpose was that the user would have to specify the fully-qualified table name of the cohort table. Therefore, I believe the solution to the confusion here is to make the callers of the CCQueryBuilder() constructor pass the fully qualified name of the cohort table, and not attempt any tricks with concatenating 'tempSchema' or 'resultsSchema': the cohort table is simply the cohort table (including schema).

It looks like when dealing with strata, a strata-cohort table is created:

return String.format("%s.sc_%s_%d", tempSchema, sessionId, strata.getId());

In this case the generated cohort table name is fully qualified, so it would be consistent to pass 'cohortTable' in as a fully qualified name as well, instead of trying to figure out which schema the cohort table belongs to.

CC with separate vocabulary daimon

I found two Cohort Characterization SQL files that do not use @vocabulary_database_schema, for example in distributionRetrieving.sql line 29:
left join @cdm_database_schema.concept c on c.concept_id = fr.concept_id;

If rewritten as:
left join @vocabulary_database_schema.concept c on c.concept_id = fr.concept_id;
it could fetch concepts from the vocabulary daimon.

The same goes for prevalenceRetrieving.sql.

PS: I was trying to fix this on the current WebAPI release, but maybe it's better to fix it here and wait for 2.8?

Component versioning

At the moment, v1.0.0 of this project is deployed to the OHDSI Nexus repository, but no tag exists for that release. The v1.0.0 release was made at this commit point: 7cdb50f.

Additionally, the R DESCRIPTION file is out-of-sync with the versioning scheme defined in the pom.xml: https://github.com/OHDSI/SkeletonCohortCharacterization/blob/master/DESCRIPTION.

Suggested fixes:

  • Push a v1.0.0 tag to the repository at the commit mentioned above.
  • Increment the pom.xml to use 1.1.0-SNAPSHOT as the current version until we are ready to release a new version. Update the version in the R DESCRIPTION file to match (and elsewhere as necessary).
  • Merge in this work before any new PRs are introduced, as the current v1.1.0-SNAPSHOT line of development is actually ahead of the released v1.0.0 component.

Aggregate functions should be defined in this library, not WebAPI

While trying to run through the code in SkeletonCohortCharacterization as an R package, I discovered that all the aggregate functions are specified in the design JSON, but those specifications are not part of SkeletonCohortCharacterization. Instead, these aggregate types are defined as entities in WebAPI and then handled by passing the raw definitions over to SkeletonCC.

I think this is wrong: SkeletonCC should define all the aggregate functions that are supported, and not allow external parties to submit their own expressions (isn't that a SQL injection problem?). SkeletonCohortCharacterization should expose a function that returns metadata about the available aggregate types for WebAPI to consume/render, but WebAPI shouldn't be the place to define the characterization aggregate functions. This makes SkeletonCC hard to use as a stand-alone library.

Cohort Characterization - denominators for subgroups

Cross posted from OHDSI/Atlas#2420

Expected behavior
When using subgroups in cohort characterization, the denominators used for percentages should be the counts of each subgroup.

Actual behavior
When using cohort event date ranges as subgroups (Jan-Apr, May-July, etc.), the percentages use the total cohort as the denominator rather than the patients in each subgroup (see the sketch below).

Steps to reproduce behavior
Create a cohort characterization, add some features, and add some subgroups.
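
A hedged sketch of the expected behavior, assuming each subgroup (strata) is materialized with a strata_id alongside the cohort rows; all table and column names here are illustrative, not the package's actual query:

-- the percentage within each subgroup uses that subgroup's distinct persons
-- as the denominator, not the total cohort count
SELECT s.strata_id,
       COUNT(DISTINCT f.person_id) * 100.0
         / COUNT(DISTINCT s.subject_id) AS pct
  FROM strata_cohort s
  LEFT JOIN feature_matches f
    ON f.person_id = s.subject_id
   AND f.strata_id = s.strata_id
 GROUP BY s.strata_id;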

Standardize build tooling and common libs

PROBLEM ATTEMPTING TO SOLVE
Many OHDSI projects have aged libraries.

  • Vulnerabilities exist in older libraries.
  • Many projects also use different versions of the same libraries.
  • Projects like WebAPI that depend on several libraries have to create very messy workarounds to exclude library versions that create conflicts.
  • Snippets are duplicated across projects.

PROPOSED SOLUTION

  • Update versions of Maven tooling and the libraries in use.
  • Introduce spring-boot-starter-parent as a parent project where applicable.
  • Ideally, introduce an ohdsi-maven-starter-parent that can be shared across projects. This parent would contain Maven tooling, JaCoCo configuration, shared repositories, and the like.

Dropping Java imports in favor of R packages?

The skeleton now includes FeatureExtraction and SqlRender both as R packages (see DESCRIPTION) and as Java libraries (see inst/java). Why can't this skeleton be written in pure R?

The resulting R package is very big, and I don't think it is a good idea to have such big skeletons in Hydra.

Better README

I am confused about what the function of this package is.

Can I provide a cohort id and have this package describe it for me (like Herakles)?
