
annotation-tools's Introduction

Genome Nexus 🧬

Genome Nexus is a comprehensive one-stop resource for fast, automated, and high-throughput annotation and interpretation of genetic variants in cancer. Genome Nexus integrates information from a variety of existing resources, including databases that convert DNA changes to protein changes, predict the functional effects of protein mutations, and contain information about mutation frequencies, gene function, variant effects, and clinical actionability.

Documentation 📖

See the docs

Run 💻

Alternative 1 - run genome-nexus, mongoDB and genome-nexus-vep in docker containers

First, set environment variables for the Ensembl release, VEP assembly, location of the VEP cache, and species (a mouse installation is also supported). If these are not set, the default values from .env are used.

The reference genome and Ensembl release must be consistent with a version in genome-nexus-importer/data/, for example grch37_ensembl92, grch38_ensembl92, or grch38_ensembl95:

export REF_ENSEMBL_VERSION=grch38_ensembl92

If you want to set up Genome Nexus for mouse, also set the SPECIES variable to 'mus_musculus', and see the docs on creating a mouse database.

export SPECIES=mus_musculus

If you would like to do local VEP annotations instead of using the public Ensembl API, uncomment # gn_vep.region.url=http://localhost:6060/vep/human/region/VARIANT in your application.properties. This requires downloading the VEP cache files for the preferred Ensembl release and reference genome; see our documentation on downloading the Genome Nexus VEP cache. The download will take several hours.

# Set local cache dir
export VEP_CACHE=<local_vep_cache>

# GRCh38 or GRCh37
export VEP_ASSEMBLY=GRCh38

Run docker-compose to create images and containers:

docker-compose up --build -d

Run without recreating images:

docker-compose up -d

Run without Genome Nexus VEP:

# Start both the Web and DB (dependency of Web) containers
docker-compose up -d web

Stop and remove containers:

docker-compose down

Alternative 2 - run genome-nexus locally, but mongoDB in docker container

# the genomenexus/gn-mongo image comes with all the required tables imported
# change latest to a different version if necessary (this only needs to run once)
docker run --name=gn-mongo --restart=always -p 27017:27017 -d genomenexus/gn-mongo:latest 
mvn -DskipTests clean install
java -jar web/target/web-*.war

Alternative 3 - install mongoDB locally and run with local java

Install mongoDB manually. Then follow instructions in genome-nexus-importer to initialize the database.

After that, run:

mvn clean install
java -jar web/target/web-*.war

Test Status 👷‍♀️

branch | master | rc
status | Build Status | Build Status

Deploy 🚀

Deploy

annotation-tools's People

Contributors

ao508, hweej, inodb, leexgh, pieterlukasse, rmadupuri, ruslan-forostianov, sheridancbio


annotation-tools's Issues

Genome Nexus sometimes annotates SNV as ONP

  • Input:
    input.txt

  • Intermediate files: annotation-tools intermediate files. I had to add .txt at the end or GitHub wouldn't allow me to upload these. My understanding is that input.txt.temp.annotated.txt is the output from Genome Nexus. Because annotation-tools accepts a directory with a list of MAFs or VCFs, it annotates each of those files separately; processed.txt is all of these merged.
    input.txt.temp.annotated.txt
    input.txt.temp.txt

  • processed:
    processed.txt

Process VCF error

Loading data from file: .../GENIE-...-...-1.vcf
	Total records loaded 4 
Standardized MAF written to: .../processed/GENIE-...-...-1.vcf.temp
Loading data from file: .../GENIE-...-...-1.vcf
	Total records loaded 4 
Standardized MAF written to: ..._out/processed/GENIE-...-...-1.vcf.temp
Loading data from file: .../GENIE-...-...-1.vcf
DP could not be resolved for current record in VCF: {'CHROM': '#CHROM', 'POS': 'POS', 'ID': 'ID', 'REF': 'REF', 'ALT': 'ALT', 'QUAL': 'QUAL', 'FILTER': 'FILTER', 'INFO': {'INFO': ''}, 'FORMAT': ['FORMAT'], 'GENIE-...-...-1': 'GENIE-...-...-1', 'MAPPED_TUMOR_FORMAT_DATA': {'FORMAT': 'GENIE-...-...-1', 'AD': ','}} <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
[ERROR] standardizeMutationFilesFromDirectory(), error encountered while running genie-annotation-pkg/standardize_mutation_data.py

How does Genome Nexus handle Tumor Seq Allele1 and Tumor Seq Allele2

In GENIE, a "-" specifies a deletion. However, when we have rows such as:

Reference_Allele,Tumor_Seq_Allele1,Tumor_Seq_Allele2
G,C,-

Would Genome Nexus annotate this as "Missense" or "Deletion"? I have always been slightly confused by TSA1 and TSA2. Can we go into more depth on what exactly happens to these two columns?

Genome Nexus creates variants with Reference_Allele == Tumor_Seq_Allele2 == Tumor_Seq_Allele1

There are 154 unique variants whose Reference_Allele and Tumor_Seq_Allele2 differ in the input but are reannotated by Genome Nexus to have identical Reference_Allele and Tumor_Seq_Allele2 values.

Add minimal example to repo for testing

It's currently hard to review whether the scripts are working because there is no example data. Let's try to construct a minimal example test set that we can enrich over time.

Some corner cases to capture, e.g.: #39

Improve tumor vs normal sample recognition in a vcf file

Currently, when standardize_mutation_data.py reads a VCF file with two sample columns, it interprets the first as the tumor sample and the second as the normal sample.

See https://github.com/genome-nexus/annotation-tools/blob/master/standardize_mutation_data.py#L854

We work with VCF files that don't have a fixed order of sample columns.

However, our vcf header contains metadata like the following:

##normal_sample=sample_45345
##tumor_sample=sample_867657

Although this metadata does not seem to be part of the VCF specification (https://samtools.github.io/hts-specs/VCFv4.1.pdf), it is used in the wild.

My proposal is to make the get_vcf_sample_and_normal_ids(filename) function look for normal_sample and tumor_sample in the header first, falling back to the existing logic if such metadata is not found.
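
A minimal sketch of that proposal (a hypothetical helper; the repo's actual get_vcf_sample_and_normal_ids differs):

def get_tumor_and_normal_ids_from_header(filename):
    # Hypothetical header-first lookup: scan the ## meta-information lines
    # for ##tumor_sample= and ##normal_sample= before falling back to
    # column-order-based detection.
    tumor_id = normal_id = None
    with open(filename, encoding="utf-8") as vcf_file:
        for line in vcf_file:
            if not line.startswith("##"):
                break  # past the meta-information header
            if line.startswith("##tumor_sample="):
                tumor_id = line.rstrip("\n").split("=", 1)[1]
            elif line.startswith("##normal_sample="):
                normal_id = line.rstrip("\n").split("=", 1)[1]
    return tumor_id, normal_id  # (None, None) -> use the existing logic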

Variant annotation failing

We have MAF data that is failing annotation with genome nexus (see example.csv). We are wondering why it is failing.

To provide more context, we had recently removed some logic that ran before annotating with genome-nexus. The logic was: if the ref allele and alt allele do not match, and the first character of the ref allele is the same as the first character of the alt allele, then remove the first character of both alleles and shift the start position forward by 1 (sketched below).

Using the example file provided, the logic would update the start_position from 25398284 to 25398285, the reference allele from CC to C and the alt allele from CG to G. Then genome-nexus would annotate the variant with KRAS. However, after removing the logic, genome-nexus fails to annotate this variant.
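
For reference, here is a sketch of the removed trimming logic as described above (a reconstruction, not the original code):

def trim_shared_leading_base(start_position, ref_allele, alt_allele):
    # If the alleles differ but share the same first base, drop that base
    # from both and shift the start position forward by one.
    if ref_allele != alt_allele and ref_allele[:1] == alt_allele[:1]:
        return start_position + 1, ref_allele[1:], alt_allele[1:]
    return start_position, ref_allele, alt_allele

# Example from above: (25398284, "CC", "CG") -> (25398285, "C", "G")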

Center maf input isn't required to have all columns

It is GENIE convention that we require at least:
Chromosome, Start_Position, Reference_Allele, Tumor_Seq_Allele2, Tumor_Sample_Barcode, t_alt_count
A file must also have either t_ref_count or t_depth. These are the optional headers:
t_ref_count, n_depth, n_ref_count, n_alt_count
(A pre-flight header check is sketched at the end of this issue, after the logs.)

This makes standardize_mutation_data.py fail:

Loading data from input directory: ...

	Searching for files with extensions: vcf, maf, txt 

Loading data from file: .../data_mutations_extended_....txt
Traceback (most recent call last):
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1429, in <module>
    main()
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1426, in main
    generate_maf_from_input_data(input_directory, output_directory, extensions_list, center_name, sequence_source)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1394, in generate_maf_from_input_data
    maf_data = extract_maf_data_from_file(os.path.join(input_directory, filename), center_name, sequence_source)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1351, in extract_maf_data_from_file
    maf_record = create_maf_record_from_maf(data, center_name, sequence_source)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 755, in create_maf_record_from_maf
    resolve_variant_allele_data(data, maf_data)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 474, in resolve_variant_allele_data
    variant_type = resolve_variant_type(data, ref_allele, tumor_seq_allele)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 276, in resolve_variant_type
    if variant_type == "1":
UnboundLocalError: local variable 'variant_type' referenced before assignment

[ERROR] standardizeMutationFilesFromDirectory(), error encountered while running genie-annotation-pkg/standardize_mutation_data.py

I did try adding some try/except blocks to the code, but then Genome Nexus failed with errors like:

2020-04-27 02:49:44 [main] INFO  org.cbioportal.annotation.AnnotationPipeline - Starting AnnotationPipeline v1.0.0 on ip-10-5-19-203.ec2.internal with PID 25902 (/home/tyu/genome-nexus-annotation-pipeline/annotationPipeline/target/annotationPipeline-1.0.0.jar started by tyu in /home/tyu)
2020-04-27 02:49:45 [main] INFO  org.springframework.context.annotation.AnnotationConfigApplicationContext - Refreshing org.springframework.context.annotation.AnnotationConfigApplicationContext@1c5ecd10: startup date [Mon Apr 27 02:49:45 UTC 2020]; root of context hierarchy
2020-04-27 02:49:47 [main] INFO  org.springframework.integration.config.IntegrationRegistrar - No bean named 'integrationHeaderChannelRegistry' has been explicitly defined. Therefore, a default DefaultHeaderChannelRegistry will be created.
2020-04-27 02:49:47 [main] WARN  org.springframework.context.annotation.ConfigurationClassEnhancer - @Bean method ScopeConfiguration.stepScope is non-static and returns an object assignable to Spring's BeanFactoryPostProcessor interface. This will result in a failure to process annotations such as @Autowired, @Resource and @PostConstruct within the method's declaring @Configuration class. Add the 'static' modifier to this method to avoid these container lifecycle issues; see @Bean javadoc for complete details.
2020-04-27 02:49:47 [main] WARN  org.springframework.context.annotation.ConfigurationClassEnhancer - @Bean method ScopeConfiguration.jobScope is non-static and returns an object assignable to Spring's BeanFactoryPostProcessor interface. This will result in a failure to process annotations such as @Autowired, @Resource and @PostConstruct within the method's declaring @Configuration class. Add the 'static' modifier to this method to avoid these container lifecycle issues; see @Bean javadoc for complete details.
2020-04-27 02:49:47 [main] INFO  org.springframework.integration.config.DefaultConfiguringBeanFactoryPostProcessor - No bean named 'errorChannel' has been explicitly defined. Therefore, a default PublishSubscribeChannel will be created.
2020-04-27 02:49:47 [main] INFO  org.springframework.integration.config.DefaultConfiguringBeanFactoryPostProcessor - No bean named 'taskScheduler' has been explicitly defined. Therefore, a default ThreadPoolTaskScheduler will be created.
2020-04-27 02:49:47 [main] INFO  org.springframework.beans.factory.annotation.AutowiredAnnotationBeanPostProcessor - JSR-330 'javax.inject.Inject' annotation found and supported for autowiring
2020-04-27 02:49:47 [main] INFO  org.hibernate.validator.internal.util.Version - HV000001: Hibernate Validator 5.1.3.Final
2020-04-27 02:49:48 [main] INFO  org.springframework.context.support.PostProcessorRegistrationDelegate$BeanPostProcessorChecker - Bean 'org.springframework.transaction.annotation.ProxyTransactionManagementConfiguration' of type [org.springframework.transaction.annotation.ProxyTransactionManagementConfiguration$$EnhancerBySpringCGLIB$$b9bfddf0] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2020-04-27 02:49:48 [main] INFO  org.springframework.context.support.PostProcessorRegistrationDelegate$BeanPostProcessorChecker - Bean 'integrationGlobalProperties' of type [org.springframework.beans.factory.config.PropertiesFactoryBean] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2020-04-27 02:49:48 [main] INFO  org.springframework.context.support.PostProcessorRegistrationDelegate$BeanPostProcessorChecker - Bean 'integrationGlobalProperties' of type [java.util.Properties] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2020-04-27 02:49:48 [main] INFO  org.springframework.context.support.PostProcessorRegistrationDelegate$BeanPostProcessorChecker - Bean 'messageBuilderFactory' of type [org.springframework.integration.support.DefaultMessageBuilderFactory] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2020-04-27 02:49:48 [main] INFO  org.springframework.context.support.PostProcessorRegistrationDelegate$BeanPostProcessorChecker - Bean '(inner bean)#7f682304' of type [org.springframework.integration.channel.MessagePublishingErrorHandler] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2020-04-27 02:49:48 [main] INFO  org.springframework.scheduling.concurrent.ThreadPoolTaskScheduler - Initializing ExecutorService  'taskScheduler'
2020-04-27 02:49:48 [main] INFO  org.springframework.context.support.PostProcessorRegistrationDelegate$BeanPostProcessorChecker - Bean 'taskScheduler' of type [org.springframework.scheduling.concurrent.ThreadPoolTaskScheduler] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2020-04-27 02:49:48 [main] INFO  org.springframework.context.support.PostProcessorRegistrationDelegate$BeanPostProcessorChecker - Bean 'integrationHeaderChannelRegistry' of type [org.springframework.integration.channel.DefaultHeaderChannelRegistry] is not eligible for getting processed by all BeanPostProcessors (for example: not eligible for auto-proxying)
2020-04-27 02:49:49 [main] WARN  org.springframework.batch.core.listener.AbstractListenerFactoryBean - org.springframework.batch.item.ItemStreamReader is an interface.  The implementing class will not be queried for annotation based listener configurations.  If using @StepScope on a @Bean method, be sure to return the implementing class so listner annotations can be used.
2020-04-27 02:49:49 [main] WARN  org.springframework.batch.core.listener.AbstractListenerFactoryBean - org.springframework.batch.item.ItemStreamWriter is an interface.  The implementing class will not be queried for annotation based listener configurations.  If using @StepScope on a @Bean method, be sure to return the implementing class so listner annotations can be used.
2020-04-27 02:49:50 [main] INFO  org.springframework.jdbc.datasource.init.ScriptUtils - Executing SQL script from class path resource [org/springframework/batch/core/schema-hsqldb.sql]
2020-04-27 02:49:50 [main] INFO  org.springframework.jdbc.datasource.init.ScriptUtils - Executed SQL script from class path resource [org/springframework/batch/core/schema-hsqldb.sql] in 9 ms.
2020-04-27 02:49:51 [main] INFO  org.springframework.ui.velocity.SpringResourceLoader - SpringResourceLoader for Velocity: using resource loader [org.springframework.context.annotation.AnnotationConfigApplicationContext@1c5ecd10: startup date [Mon Apr 27 02:49:45 UTC 2020]; root of context hierarchy] and resource loader paths [classpath:/templates/]
2020-04-27 02:49:52 [main] INFO  org.springframework.context.support.DefaultLifecycleProcessor - Starting beans in phase -2147483648
2020-04-27 02:49:52 [main] INFO  org.springframework.context.support.DefaultLifecycleProcessor - Starting beans in phase 0
2020-04-27 02:49:52 [main] INFO  org.springframework.integration.endpoint.EventDrivenConsumer - Adding {logging-channel-adapter:_org.springframework.integration.errorLogger} as a subscriber to the 'errorChannel' channel
2020-04-27 02:49:52 [main] INFO  org.springframework.integration.channel.PublishSubscribeChannel - Channel 'application.errorChannel' has 1 subscriber(s).
2020-04-27 02:49:52 [main] INFO  org.springframework.integration.endpoint.EventDrivenConsumer - started _org.springframework.integration.errorLogger
2020-04-27 02:49:52 [main] INFO  org.cbioportal.annotation.AnnotationPipeline - Started AnnotationPipeline in 7.513 seconds (JVM running for 8.639)
2020-04-27 02:49:52 [main] INFO  org.springframework.batch.core.repository.support.JobRepositoryFactoryBean - No database type set, using meta data indicating: HSQL
2020-04-27 02:49:52 [main] INFO  org.springframework.batch.core.launch.support.SimpleJobLauncher - No TaskExecutor has been set, defaulting to synchronous executor.
2020-04-27 02:49:52 [main] INFO  org.springframework.batch.core.launch.support.SimpleJobLauncher - Job: [SimpleJob: [name=annotationJob]] launched with the following parameters: [{filename=..._out/processed/data_mutations_extended_....txt.temp, outputFilename=..._out/annotated/data_mutations_extended_....txt.temp.annotated, replace=true, isoformOverride=uniprot, errorReportLocation=null, postIntervalSize=-1}]
2020-04-27 02:49:52 [main] INFO  org.springframework.batch.core.job.SimpleStepHandler - Executing step: [step]
2020-04-27 02:49:53 [main] INFO  org.cbioportal.annotation.pipeline.MutationRecordReader - Loading records from: ..._out/processed/data_mutations_extended_....txt.temp
2020-04-27 02:49:56 [main] INFO  org.cbioportal.annotation.pipeline.MutationRecordReader - Loaded 13084 records from: ..._out/processed/data_mutations_extended_....txt.temp
2020-04-27 02:49:56 [main] INFO  org.cbioportal.annotator.internal.GenomeNexusImpl - 13084 records to annotate
2020-04-27 02:49:56 [main] ERROR org.springframework.batch.core.step.AbstractStep - Encountered an error executing step step in job annotationJob
java.lang.NumberFormatException: For input string: ""
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:592)
	at java.lang.Integer.valueOf(Integer.java:766)
	at org.cbioportal.annotator.internal.GenomeNexusImpl.extractGenomicLocation(GenomeNexusImpl.java:685)
	at org.cbioportal.annotator.internal.GenomeNexusImpl.extractGenomicLocationAsString(GenomeNexusImpl.java:674)
	at org.cbioportal.annotator.internal.GenomeNexusImpl.annotateRecord(GenomeNexusImpl.java:130)
	at org.cbioportal.annotator.internal.GenomeNexusImpl.annotateRecordsUsingGET(GenomeNexusImpl.java:164)
	at org.cbioportal.annotation.pipeline.MutationRecordReader.open(MutationRecordReader.java:90)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:333)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
	at org.springframework.aop.support.DelegatingIntroductionInterceptor.doProceed(DelegatingIntroductionInterceptor.java:133)
	at org.springframework.aop.support.DelegatingIntroductionInterceptor.invoke(DelegatingIntroductionInterceptor.java:121)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:213)
	at com.sun.proxy.$Proxy55.open(Unknown Source)
	at org.springframework.batch.item.support.CompositeItemStream.open(CompositeItemStream.java:96)
	at org.springframework.batch.core.step.tasklet.TaskletStep.open(TaskletStep.java:310)
	at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:197)
	at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:148)
	at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:392)
	at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:135)
	at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:306)
	at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:135)
	at org.springframework.core.task.SyncTaskExecutor.execute(SyncTaskExecutor.java:50)
	at org.springframework.batch.core.launch.support.SimpleJobLauncher.run(SimpleJobLauncher.java:128)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:333)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
	at org.springframework.batch.core.configuration.annotation.SimpleBatchConfiguration$PassthruAdvice.invoke(SimpleBatchConfiguration.java:127)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179)
	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:213)
	at com.sun.proxy.$Proxy52.run(Unknown Source)
	at org.cbioportal.annotation.AnnotationPipeline.launchJob(AnnotationPipeline.java:88)
	at org.cbioportal.annotation.AnnotationPipeline.main(AnnotationPipeline.java:104)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:53)
	at java.lang.Thread.run(Thread.java:748)
2020-04-27 02:49:56 [main] INFO  org.springframework.batch.core.launch.support.SimpleJobLauncher - Job: [SimpleJob: [name=annotationJob]] completed with the following parameters: [{filename=..._out/processed/data_mutations_extended_....txt.temp, outputFilename=..._out/annotated/data_mutations_extended_....txt.temp.annotated, replace=true, isoformOverride=uniprot, errorReportLocation=null, postIntervalSize=-1}] and the following status: [FAILED]
2020-04-27 02:49:56 [Thread-2] INFO  org.springframework.context.annotation.AnnotationConfigApplicationContext - Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@1c5ecd10: startup date [Mon Apr 27 02:49:45 UTC 2020]; root of context hierarchy
2020-04-27 02:49:56 [Thread-2] INFO  org.springframework.context.support.DefaultLifecycleProcessor - Stopping beans in phase 0
2020-04-27 02:49:56 [Thread-2] INFO  org.springframework.integration.endpoint.EventDrivenConsumer - Removing {logging-channel-adapter:_org.springframework.integration.errorLogger} as a subscriber to the 'errorChannel' channel
2020-04-27 02:49:56 [Thread-2] INFO  org.springframework.integration.channel.PublishSubscribeChannel - Channel 'application.errorChannel' has 0 subscriber(s).
2020-04-27 02:49:56 [Thread-2] INFO  org.springframework.integration.endpoint.EventDrivenConsumer - stopped _org.springframework.integration.errorLogger
2020-04-27 02:49:56 [Thread-2] INFO  org.springframework.context.support.DefaultLifecycleProcessor - Stopping beans in phase -2147483648
2020-04-27 02:49:56 [Thread-2] INFO  org.springframework.scheduling.concurrent.ThreadPoolTaskScheduler - Shutting down ExecutorService 'taskScheduler'
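
Given the required-column convention described at the top of this issue, a minimal pre-flight check (a sketch, not part of the repo) could surface the missing headers before either the script or the downstream pipeline runs:

import csv
import sys

REQUIRED_COLUMNS = ["Chromosome", "Start_Position", "Reference_Allele",
                    "Tumor_Seq_Allele2", "Tumor_Sample_Barcode", "t_alt_count"]

def check_maf_header(filename):
    # Report missing required columns up front instead of letting
    # downstream tools fail mid-run.
    with open(filename, encoding="utf-8") as maf:
        reader = csv.reader(maf, delimiter="\t")
        header = next(row for row in reader if row and not row[0].startswith("#"))
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if "t_ref_count" not in header and "t_depth" not in header:
        missing.append("t_ref_count or t_depth")
    if missing:
        sys.exit("Missing required MAF columns: %s" % ", ".join(missing))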

IndexError: list index out of range

Loading data from input directory: error

	Searching for files with extensions: vcf, maf, txt 
Loading data from file: error/GENIE-PHS-01f07ab5-triseq-v1.vcf
Traceback (most recent call last):
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1430, in <module>
    main()
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1427, in main
    generate_maf_from_input_data(input_directory, output_directory, extensions_list, center_name, sequence_source)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1393, in generate_maf_from_input_data
    maf_data = extract_vcf_data_from_file(os.path.join(input_directory, filename), center_name, sequence_source)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1327, in extract_vcf_data_from_file
    maf_record = create_maf_record_from_vcf(sample_id, center_name, sequence_source, vcf_data, is_germline_data, matched_normal_sample_id, tumor_sample_data_col)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1277, in create_maf_record_from_vcf
    resolve_vcf_counts_data(vcf_data, maf_data, matched_normal_sample_id, tumor_sample_data_col)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1178, in resolve_vcf_counts_data
    (t_ref_count, t_alt_count, t_depth) = resolve_vcf_allele_depth_values(tumor_sample_format_data, vcf_alleles, variant_allele_idx, vcf_data)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1132, in resolve_vcf_allele_depth_values
    alt_count = allele_depth_values[variant_allele_idx]
IndexError: list index out of range
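
The crash occurs when the AD field contains fewer entries than the index of the called variant allele. A defensive sketch (hypothetical, not the repo's code) avoids indexing past the end of the list:

def safe_alt_count(allele_depth_values, variant_allele_idx):
    # Return the alt allele depth, or "" when AD has fewer entries
    # than expected for the called variant allele.
    if 0 <= variant_allele_idx < len(allele_depth_values):
        return allele_depth_values[variant_allele_idx]
    return ""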

Some VCFs don't contain any information

Some of the VCFs are placeholders and don't actually contain any variants, so the .processed file that gets created only has the header. I'm not sure whether these are sent to Genome Nexus to be annotated, but if they are, we could save some time by skipping them (see the sketch below).
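
A cheap guard along these lines (a sketch, assuming data rows are the lines that don't start with '#') could skip such files:

def vcf_has_variants(filename):
    # True if the VCF contains at least one data line after the header.
    with open(filename, encoding="utf-8") as vcf_file:
        return any(line.strip() and not line.startswith("#")
                   for line in vcf_file)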

Unknown annotation failures reasons

Dear GN team,

We are trying to figure out why some mutation records are failing annotation by genome-nexus (version 1.0.2). The error handling in the report for the failed annotations was very useful, but there were still a lot of records with unknown failure reasons (see unknown_annotation_fails.txt). This file contains n=2000 failed annotations that had a blank FAILURE_REASON value.

Would you be able to help us make sense of why these annotations failed? Let me know if there is anything else I can provide to help troubleshoot this.

Thank you!

Genome Nexus changes the Reference Allele and Tumor_Seq_Allele2

  • Input
    input.txt

  • annotation-tools intermediate files. I had to add .txt at the end or GitHub wouldn't allow me to upload these. My understanding is that input.txt.temp.annotated.txt is the output from Genome Nexus. Because annotation-tools accepts a directory with a list of MAFs or VCFs, it annotates each of those files separately; processed.txt is all of these merged.
    input.txt.temp.annotated.txt
    input.txt.temp.txt

  • Processed
    processed.txt

vcf2maf not converting missing AD values to 0

Hi,

We're looking at using vcf2maf to prepare our VCF files for import into cBioPortal, but I've run into this issue:

$ python3 vcf2maf.py -i ../vcf-noann-FTatleast1PASS/solace2-0003.vcf --tumor-id tumour --normal-id germline
Loading data from file: ../vcf-noann-FTatleast1PASS/solace2-0003.vcf
Traceback (most recent call last):
  File "/vast/projects/staffordfox-cbioportal/genome-nexus_annotation-tools/vcf2maf.py", line 1640, in <module>
    main()
  File "/vast/projects/staffordfox-cbioportal/genome-nexus_annotation-tools/venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vast/projects/staffordfox-cbioportal/genome-nexus_annotation-tools/venv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/vast/projects/staffordfox-cbioportal/genome-nexus_annotation-tools/venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vast/projects/staffordfox-cbioportal/genome-nexus_annotation-tools/venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vast/projects/staffordfox-cbioportal/genome-nexus_annotation-tools/vcf2maf.py", line 1628, in main
    generate_maf_from_input_data(
  File "/vast/projects/staffordfox-cbioportal/genome-nexus_annotation-tools/vcf2maf.py", line 1552, in generate_maf_from_input_data
    maf_data = extract_vcf_data_from_file(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vast/projects/staffordfox-cbioportal/genome-nexus_annotation-tools/vcf2maf.py", line 1496, in extract_vcf_data_from_file
    maf_record = create_maf_record_from_vcf(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vast/projects/staffordfox-cbioportal/genome-nexus_annotation-tools/vcf2maf.py", line 1048, in create_maf_record_from_vcf
    resolve_vcf_counts_data(
  File "/vast/projects/staffordfox-cbioportal/genome-nexus_annotation-tools/vcf2maf.py", line 630, in resolve_vcf_counts_data
    (t_ref_count, t_alt_count, t_depth) = resolve_vcf_allele_depth_values(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/vast/projects/staffordfox-cbioportal/genome-nexus_annotation-tools/vcf2maf.py", line 511, in resolve_vcf_allele_depth_values
    mapped_sample_format_data["DP"] = str(sum(map(float, allele_depth_values)))
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: ''

I believe this is because .,237 is the AD value, and the . is being converted to an empty string, which is not accepted by float. I've traced this back to line 341.

        # attempt to parse values as int - if not an int then set value to empty string
        allele_depth_values = []
        for value in mapped_sample_format_data["AD"].split(","):
            if is_valid_integer(value):
                allele_depth_values.append(value)
            else:
                allele_depth_values.append("")

Should "0" be appended instead of ""? For our data, this would make sense.

Please let me know how best to resolve this, and I will create a pull request.
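
For comparison, a self-contained sketch of the proposed behavior (a hypothetical parse_allele_depths helper, substituting "0" for missing AD entries):

def parse_allele_depths(ad_field):
    # Split a VCF AD field like ".,237" and replace non-integer entries
    # (VCF uses "." for missing values) with "0" so the later
    # sum(map(float, ...)) does not fail.
    values = []
    for value in ad_field.split(","):
        values.append(value if value.strip().lstrip("-").isdigit() else "0")
    return values

# parse_allele_depths(".,237") -> ["0", "237"]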

Fail early on incorrect variants

This is more long term but it would be nice to have a script that finds incorrect variant formats early, so we can avoid trying to annotate them. That way we can also immediately get back to centers about issues with the format that's been delivered. Could be a separate command within genome-nexus-annotation-pipeline or a script part of this repo
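
As a starting point, such a validator might check allele strings and positions. This sketch assumes simple rules (alleles limited to A/C/G/T or "-", integer positions) and is not part of the repo:

import re

ALLELE_PATTERN = re.compile(r"^([ACGT]+|-)$")

def find_format_errors(record):
    # record: dict with MAF-style keys; returns a list of problems found,
    # so bad variants can be reported back to centers before annotation.
    errors = []
    if not str(record.get("Start_Position", "")).isdigit():
        errors.append("Start_Position is not an integer")
    for col in ("Reference_Allele", "Tumor_Seq_Allele2"):
        if not ALLELE_PATTERN.match(str(record.get(col, ""))):
            errors.append("%s is not a valid allele string" % col)
    return errors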

Possibly support MAF file with all upper case column headers?

Some centers submit MAF files with all-capitalized headers. Because of this, my processing code capitalizes the headers to work around the issue.

Do you know if Genome Nexus will annotate a MAF file with all-capitalized headers? I know vcf2maf also accepts all-capitalized headers.
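
One workaround is to normalize headers case-insensitively before processing; a sketch (hypothetical mapping covering a few canonical MAF column names) could be:

CANONICAL = ["Chromosome", "Start_Position", "End_Position",
             "Reference_Allele", "Tumor_Seq_Allele1", "Tumor_Seq_Allele2",
             "Tumor_Sample_Barcode", "Variant_Classification"]
UPPER_TO_CANONICAL = {name.upper(): name for name in CANONICAL}

def normalize_header(columns):
    # Map all-caps column names back to canonical MAF casing,
    # leaving unrecognized columns untouched.
    return [UPPER_TO_CANONICAL.get(col.upper(), col) for col in columns]

# normalize_header(["CHROMOSOME", "START_POSITION"])
# -> ["Chromosome", "Start_Position"]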

ValueError: Match_Norm_Seq_Allele1

- using default value of empty string...
Traceback (most recent call last):
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1482, in <module>
    main()
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1479, in main
    generate_maf_from_input_data(input_directory, output_directory, extensions_list, center_name, sequence_source)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1441, in generate_maf_from_input_data
    maf_data = extract_vcf_data_from_file(input_filename, center_name, sequence_source)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1347, in extract_vcf_data_from_file
    maf_record = create_maf_record_from_vcf(sample_id, center_name, sequence_source, vcf_data, is_germline_data, matched_normal_sample_id, tumor_sample_data_col)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1306, in create_maf_record_from_vcf
    resolve_vcf_matched_normal_allele_data(vcf_data, maf_data, matched_normal_sample_id)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1236, in resolve_vcf_matched_normal_allele_data
    maf_data["Match_Norm_Seq_Allele1"] = vcf_alleles[int(normal_sample_genotype_info[0])]
ValueError: invalid literal for int() with base 10: '.'
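
The crash comes from passing the VCF missing-value marker "." to int(). A guard like the following sketch (hypothetical, not the repo's code) would treat missing genotypes as empty alleles instead of aborting:

def resolve_normal_allele(vcf_alleles, normal_genotype_field):
    # GT fields look like "0/1" or "./."; "." means the call is missing.
    first_allele = normal_genotype_field.replace("|", "/").split("/")[0]
    if first_allele.isdigit() and int(first_allele) < len(vcf_alleles):
        return vcf_alleles[int(first_allele)]
    return ""  # fall back to an empty string for missing genotypes

# resolve_normal_allele(["T", "C"], "./.") -> ""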

How to properly handle variants that can't be annotated?

Current Behavior

When annotating a MAF file through Genome Nexus we get:

	Failed annotations summary:  409 total failed annotations
		Records with HGVSp null variant classification:  0
		Records that failed due to other unknown reason: 409

These unknown reasons could include "variants that have reference mismatches" or other issues. When looking at the annotated MAF file, I see that these 409 "failed" variants have actually been added back. This is confusing for multiple reasons:

  1. The number of columns returned from Genome Nexus can be far greater than the minimal set of MAF columns required for Genome Nexus to work (so you will have rows with empty columns)
  2. If you pass in a fully formed maf file, it is close to impossible to tell which of these variants were unannotated vs annotated

Proposed Feature

It would be nice to be able to tell the difference between the variants that were annotated properly and those that were not.

  1. Add an annotation to the annotated maf file (e.g. UNANNOTATED) so that users can look into these variants
  2. Only write out the annotatable variants and write out a separate "variants_errored" file.

Personally, I like solution 1. If we take the example above (variants that have reference mismatches), this is useful to know. One thing to add is that I have an automated process set up to run genome nexus across millions of variants. Sifting through the logs to see which variants could not be annotated is not a viable solution.
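
A post-processing sketch of solution 1 (hypothetical column names and heuristic; the pipeline's actual output format may differ) that tags rows lacking annotation output:

import csv

def flag_unannotated(in_path, out_path, marker_col="Annotation_Status"):
    # Mark rows whose annotation columns came back empty as UNANNOTATED.
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        fields = reader.fieldnames + [marker_col]
        writer = csv.DictWriter(dst, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        for row in reader:
            # Hypothetical heuristic: an empty HGVSc value means the record
            # was passed through without annotation.
            row[marker_col] = "SUCCESS" if row.get("HGVSc") else "UNANNOTATED"
            writer.writerow(row)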

DP could not be resolved for current record in VCF

I changed the values a bit here, but will send the VCF tomorrow.

Searching for files with extensions: vcf, maf, txt 
Loading data from file: error/GENIE-...
DP could not be resolved for current record in VCF: {'CHROM': '22', 'POS': '1323', 'ID': '.', 'REF': 'TG', 'ALT': 'T', 'QUAL': '.', 'FILTER': 'PASS', 'INFO': {'SOMATIC': '', 'AC': '1', 'AN': '4', 'END': '29682709', 'HOMLEN': '0', 'SVLEN': '-1', 'SVTYPE': 'DEL', 'ANNOVAR_DATE': '2018-04-16', 'Func.refGene': 'intronic', 'Gene.refGene': 'EWSR1', 'GeneDetail.refGene': '', 'ExonicFunc.refGene': '', 'AAChange.refGene': '', 'ExAC_ALL': '', 'ExAC_AFR': '', 'ExAC_AMR': '', 'ExAC_EAS': '', 'ExAC_FIN': '', 'ExAC_NFE': '', 'ExAC_OTH': '', 'ExAC_SAS': '', 'gnomAD_exome_ALL': '', 'gnomAD_exome_AFR': '', 'gnomAD_exome_AMR': '', 'gnomAD_exome_ASJ': '', 'gnomAD_exome_EAS': '', 'gnomAD_exome_FIN': '', 'gnomAD_exome_NFE': '', 'gnomAD_exome_OTH': '', 'gnomAD_exome_SAS': '', 'esp6500siv2_all': '', '1000g2015aug_all': '', 'ucsf500normT': '0.023', 'ucsf500normN': '0.00464', 'ALLELE_END': ''}, 'FORMAT': ['GT', 'AD', 'GMIMUT', 'GMIMAF', 'GMICOV', 'CallHC', 'CallUG', 'CallFB', 'CallPI', 'CallSID', 'CallMU', 'LR'], 'GENIE-...': '0/1:378,22:22:6:400:.:.:.:1:.:.:1.30598', 'GENIE-UCSF-1041-1213N': '0/0:.:.:.:.:.:.:.:.:.:.:0.0779254', 'MAPPED_TUMOR_FORMAT_DATA': {'GT': '0/1', 'AD': '378,25', 'GMIMUT': '32', 'GMIMAF': '8', 'GMICOV': '400', 'CallHC': '', 'CallUG': '', 'CallFB': '', 'CallPI': '1', 'CallSID': '', 'CallMU': '', 'LR': '1.30598', 'DP': '400.0'}, 'MAPPED_NORMAL_FORMAT_DATA': {'GT': '0/0', 'AD': ',', 'GMIMUT': '', 'GMIMAF': '', 'GMICOV': '', 'CallHC': '', 'CallUG': '', 'CallFB': '', 'CallPI': '', 'CallSID': '', 'CallMU': '', 'LR': '0.0779254'}}
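
When DP is absent, one recoverable case is a fully numeric AD field that can be summed; a sketch (hypothetical helper mirroring the warning above):

def resolve_dp_from_ad(format_data):
    # Derive DP by summing AD when every entry is numeric; otherwise
    # report it as unresolved (the "DP could not be resolved" case above,
    # where the normal sample's AD is ",").
    ad_values = format_data.get("AD", "").split(",")
    if ad_values and all(v.strip().isdigit() for v in ad_values):
        return str(sum(int(v) for v in ad_values))
    return None

# resolve_dp_from_ad({"AD": "378,22"}) -> "400"
# resolve_dp_from_ad({"AD": ","}) -> None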

MAF files that had missing column headers fail GenomeNexus annotation

  • maf_error.txt
	Failed annotations summary:  9 total failed annotations
		Records with HGVSp null variant classification:  0
		Records that failed due to other unknown reason: 9
  • maf_error2.txt
	Failed annotations summary:  3 total failed annotations
		Records with HGVSp null variant classification:  0
		Records that failed due to other unknown reason: 3

java.net.UnknownServiceException

When running the annotation_suite_wrapper.sh script, I am getting the following error:

java.net.UnknownServiceException: Unable to find acceptable protocols. isFallback=false, modes=[ConnectionSpec(cipherSuites=[TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_DHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_DHE_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_3DES_EDE_CBC_SHA], tlsVersions=[TLS_1_2, TLS_1_1, TLS_1_0], supportsTlsExtensions=true), ConnectionSpec(cipherSuites=[TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_DHE_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_DHE_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_3DES_EDE_CBC_SHA], tlsVersions=[TLS_1_0], supportsTlsExtensions=true), ConnectionSpec()], supported protocols=[TLSv1]

I am not sure how to proceed from here. Any help would be really appreciated!

Genome Nexus sometimes annotates SNV as DNP

  • input: input.txt

  • Intermediate files: annotation-tools intermediate files. I had to add .txt at the end or GitHub wouldn't allow me to upload these. My understanding is that input.txt.temp.annotated.txt is the output from Genome Nexus. Because annotation-tools accepts a directory with a list of MAFs or VCFs, it annotates each of those files separately; processed.txt is all of these merged.
    input.txt.temp.annotated.txt
    input.txt.temp.txt

  • Processed:
    processed.txt

Annotation tools doesn't calculate t_ref_count and t_alt_count from particular site's VCF field

Brief summary of the issue: we have a site that is reporting their variant counts like this:

These fields can be calculated from the AF and DP tags in the VCF (DP is t_depth in the MAF); e.g., if AF=0.05 and DP=2275, then t_alt_count is 114 and t_ref_count is 2161 (see the sketch at the end of this issue).

That being said, it seems that annotation-tools is not doing this calculation. Should this be something annotation-tools supports? Not sure if the above is a standard way of reporting the variant counts.

Input vcf: input.vcf.txt
Annotation Tool output: input.vcf.temp.txt
Genome Nexus annotation: input.vcf.temp.annotated.txt
Final output: processed.txt
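
The arithmetic described above, as a small sketch (a hypothetical helper, rounding the alt count to the nearest read):

def counts_from_af_and_dp(af, dp):
    # t_alt_count = round(AF * DP); t_ref_count = DP - t_alt_count
    t_alt_count = round(float(af) * int(dp))
    t_ref_count = int(dp) - t_alt_count
    return t_ref_count, t_alt_count

# counts_from_af_and_dp("0.05", "2275") -> (2161, 114)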

Missing reference and variant allele

We ran into an issue where genome-nexus might be annotating this de-identified record incorrectly:
renamed_maf.xlsx

The record has a Reference_Allele and variant (Tumor_Seq_Allele1) filled out, but after processing with genome-nexus, the record's values appear to be updated to blank.

Any help would be greatly appreciated! Please see the file called "renamed_maf.xlsx" to test.

some vcfs have non-ascii characters


2020-07-22T12:44:09.987-07:00 File "/root/annotation-tools/standardize_mutation_data.py", line 1578, in <module>
2020-07-22T12:44:09.987-07:00     main()
2020-07-22T12:44:09.987-07:00 File "/root/annotation-tools/standardize_mutation_data.py", line 1574, in main
2020-07-22T12:44:09.987-07:00     generate_maf_from_input_data(input_directory, output_directory, extensions_list, center_name, sequence_source)
2020-07-22T12:44:09.987-07:00 File "/root/annotation-tools/standardize_mutation_data.py", line 1538, in generate_maf_from_input_data
2020-07-22T12:44:09.987-07:00     maf_data = extract_vcf_data_from_file(input_filename, center_name, sequence_source)
2020-07-22T12:44:09.987-07:00 File "/root/annotation-tools/standardize_mutation_data.py", line 1342, in extract_vcf_data_from_file
2020-07-22T12:44:09.987-07:00     (sample_id, tumor_sample_data_col, matched_normal_sample_id) = get_vcf_sample_and_normal_ids(filename)
2020-07-22T12:44:09.987-07:00 File "/root/annotation-tools/standardize_mutation_data.py", line 808, in get_vcf_sample_and_normal_ids
2020-07-22T12:44:09.987-07:00     for line in vcf_file.readlines():
2020-07-22T12:44:09.987-07:00 File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
2020-07-22T12:44:09.987-07:00     return codecs.ascii_decode(input, self.errors)[0]
2020-07-22T12:44:09.987-07:00 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 7467: ordinal not in range(128)
2020-07-22T12:44:09.992-07:00 [ERROR] standardizeMutationFilesFromDirectory(), error encountered while running /root/annotation-tools/standardize_mutation_data.py
2020-07-22T12:44:09.999-07:00 Traceback (most recent call last):
2020-07-22T12:44:09.999-07:00 File "bin/input_to_database.py", line 170, in <module>
2020-07-22T12:44:09.999-07:00     genie_annotation_pkg=args.genie_annotation_pkg)
2020-07-22T12:44:09.999-07:00 File "bin/input_to_database.py", line 95, in main
2020-07-22T12:44:09.999-07:00     genie_annotation_pkg=genie_annotation_pkg
2020-07-22T12:44:09.999-07:00 File "/root/Genie/genie/input_to_database.py", line 851, in center_input_to_database
2020-07-22T12:44:09.999-07:00     genome_nexus_pkg=genie_annotation_pkg)
2020-07-22T12:44:09.999-07:00 File "/root/Genie/genie/input_to_database.py", line 353, in processfiles
2020-07-22T12:44:09.999-07:00     workdir=path_to_genie
2020-07-22T12:44:09.999-07:00 File "/root/Genie/genie/process_mutation.py", line 198, in process_mutation_workflow
2020-07-22T12:44:09.999-07:00     workdir=workdir
2020-07-22T12:44:09.999-07:00 File "/root/Genie/genie/process_mutation.py", line 249, in annotate_mutation
2020-07-22T12:44:09.999-07:00     subprocess.check_call(annotater_cmd)
2020-07-22T12:44:09.999-07:00 File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
2020-07-22T12:44:09.999-07:00     raise CalledProcessError(retcode, cmd)
2020-07-22T12:44:09.999-07:00 subprocess.CalledProcessError: Command '['bash', '/root/annotation-tools/annotation_suite_wrapper.sh', '-i=/root/.synapseCache/tmp_hvegy54', '-o=/root/.synapseCache/tmpe6h_3d62', '-m=/root/.synapseCache/tmpe6h_3d62/data_mutations_extended_SAGE.txt', '-c=SAGE', '-s=WXS', '-p=/root/annotation-tools']' returned non-zero exit status 1.
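
Opening the VCF with an explicit encoding and a lenient error handler would sidestep the default ascii codec; a sketch (assuming UTF-8 input and replacing undecodable bytes):

def read_vcf_lines(filename):
    # Read as UTF-8 and substitute any undecodable bytes instead of
    # raising UnicodeDecodeError, as the default ascii codec does here.
    with open(filename, encoding="utf-8", errors="replace") as vcf_file:
        return vcf_file.readlines()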


Should not require VCFs to have `AF`

bash ./annotation-tools/annotation_suite_wrapper.sh -i=./ -o=./mafs/ -m=./mafs/final.maf -c=FOO -s=WEX -p=./annotation-tools/
	INPUT_DATA_DIRECTORY=./
	OUTPUT_DATA_DIRECTORY=./mafs/
	MERGED_MUTATION_FILENAME=./mafs/final.maf
	CENTER_NAME=FOO
	SEQUENCE_SOURCE=WEX
	ANNOTATION_SUITE_SCRIPTS_HOME=./annotation-tools/
rm: cannot remove './mafs//annotated': Is a directory
rm: cannot remove './mafs//processed': Is a directory
	[INFO] standardizeMutationFilesFromDirectory(), standardized mutation files from ./ will be written to ./mafs//processed
Loading data from input directory: ./
	Searching for files with extensions: vcf, maf, txt
Loading data from file: ./Strelka_JHU2084-PN_variants_VEP.ann.vcf
Traceback (most recent call last):
  File "./annotation-tools//standardize_mutation_data.py", line 1624, in <module>
    main()
  File "./annotation-tools//standardize_mutation_data.py", line 1620, in main
    generate_maf_from_input_data(input_directory, output_directory, extensions_list, center_name, sequence_source)
  File "./annotation-tools//standardize_mutation_data.py", line 1584, in generate_maf_from_input_data
    maf_data = extract_vcf_data_from_file(input_filename, center_name, sequence_source)
  File "./annotation-tools//standardize_mutation_data.py", line 1403, in extract_vcf_data_from_file
    maf_record = create_maf_record_from_vcf(sample_id, center_name, sequence_source, vcf_data, is_germline_data, matched_normal_sample_id, tumor_sample_data_col)
  File "./annotation-tools//standardize_mutation_data.py", line 1364, in create_maf_record_from_vcf
    resolve_vcf_counts_data(vcf_data, maf_data, matched_normal_sample_id, tumor_sample_data_col)
  File "./annotation-tools//standardize_mutation_data.py", line 1259, in resolve_vcf_counts_data
    (t_ref_count, t_alt_count, t_depth) = resolve_vcf_allele_depth_values(tumor_sample_format_data, vcf_alleles, variant_allele_idx, vcf_data)
  File "./annotation-tools//standardize_mutation_data.py", line 1235, in resolve_vcf_allele_depth_values
    not is_missing_vcf_data_value(depth) and not is_missing_vcf_data_value(mapped_sample_format_data["AF"])
KeyError: 'AF'
[ERROR] standardizeMutationFilesFromDirectory(), error encountered while running ./annotation-tools//standardize_mutation_data.py
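
Guarding the AF lookup with dict.get() would avoid the KeyError when a sample's FORMAT data has no AF at all; a sketch (a hypothetical helper standing in for the repo's is_missing_vcf_data_value check):

def has_usable_af(mapped_sample_format_data):
    # Treat an absent AF key the same as a missing "." value instead of
    # raising KeyError, so VCFs without AF can still be processed.
    af_value = mapped_sample_format_data.get("AF", ".")
    return af_value not in (".", "")

# has_usable_af({"GT": "0/1"}) -> False (no AF field at all)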

invalid literal for int() with base 10:

Traceback (most recent call last):
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1430, in <module>
    main()
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1427, in main
    generate_maf_from_input_data(input_directory, output_directory, extensions_list, center_name, sequence_source)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1393, in generate_maf_from_input_data
    maf_data = extract_vcf_data_from_file(os.path.join(input_directory, filename), center_name, sequence_source)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1327, in extract_vcf_data_from_file
    maf_record = create_maf_record_from_vcf(sample_id, center_name, sequence_source, vcf_data, is_germline_data, matched_normal_sample_id, tumor_sample_data_col)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1277, in create_maf_record_from_vcf
    resolve_vcf_counts_data(vcf_data, maf_data, matched_normal_sample_id, tumor_sample_data_col)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1178, in resolve_vcf_counts_data
    (t_ref_count, t_alt_count, t_depth) = resolve_vcf_allele_depth_values(tumor_sample_format_data, vcf_alleles, variant_allele_idx, vcf_data)
  File "genie-annotation-pkg/standardize_mutation_data.py", line 1117, in resolve_vcf_allele_depth_values
    allele_depth_values[variant_allele_idx] = "%.0f" % (int(mapped_sample_format_data["FA"]) * int(mapped_sample_format_data["DP"]))
ValueError: invalid literal for int() with base 10: '0.049'

I am gathering examples of all the errors I see, but thought I would record this one.
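
The FA tag is an allele fraction, so it parses as a float rather than an int; a sketch of the corrected product (hypothetical, mirroring the failing line):

def alt_count_from_fa_and_dp(fa, dp):
    # FA is a fraction like "0.049"; multiply by read depth and format
    # with no decimal places, as the failing line intends.
    return "%.0f" % (float(fa) * int(dp))

# alt_count_from_fa_and_dp("0.049", "400") -> "20"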

Blank FAILURE_REASONs

We are experiencing some annotations that have an annotation_status = FAILED and failure_reason = blank. Is it possible to run this on your end and see what the issue is? Examples are in the attached file. I checked that the reference allele matches the reference genome and the formatting seems correct at first glance.

Note: The original data comes from a tumor-normal VCF.

Thank you!

failed-annotation.csv

java.lang.OutOfMemoryError: Java heap space error

The distribution of variant counts across the input files is:

76410
15354
13084
226854
17616
47908
6115
326934
3399
324668
55294
49259
268895
123123
889
35805
10275
2122

Some of these are combinations of VCFs, but the largest MAF has 326,934 rows. I am currently using a t3.medium instance, which has 2 CPUs and 4 GB of memory.
