nanoporetech / pinfish

Tools to annotate genomes using long read transcriptomics data

License: Other

Makefile 8.37% Go 91.63%

rna-seq cdna transcriptomics genome-annotation nanopore

pinfish's Introduction

We have a new bioinformatic resource that largely replaces the functionality of this project! See our new repository here: https://github.com/nanoporetech/pipeline-nanopore-ref-isoforms

The improved spliced_bam2gff tool is released at https://github.com/nanoporetech/spliced_bam2gff

This repository is now unsupported and we do not recommend its use. Please contact Oxford Nanopore: [email protected] for help with your application if it is not possible to upgrade to our new resources, or we are missing key features.

pinfish

Pinfish is a collection of tools helping to make sense of long transcriptomics data (long cDNA reads, direct RNA reads). The toolchain is composed of the following tools:

spliced_bam2gff - a tool for converting sorted BAM files containing spliced alignments (generated by minimap2 or GMAP) into GFF2 format. Each read is represented as a distinct transcript. This tool comes in handy for visualizing spliced reads at particular loci and for providing input to the rest of the toolchain.
cluster_gff - this tool takes a sorted GFF2 file as input, clusters together reads that have a similar exon/intron structure, and creates a rough consensus for each cluster by taking the median of the exon boundaries from all transcripts in the cluster.
polish_clusters - this tool takes the cluster definitions generated by cluster_gff and, for each cluster, creates an error-corrected read by mapping all reads onto the read with the median length (using minimap2) and polishing it using racon. The polished reads can be mapped to the genome using minimap2 or GMAP.
collapse_partials - this tool takes GFFs generated by either cluster_gff or polish_clusters and filters out transcripts that are likely to be based on RNA degradation products from the 5' end. The tool clusters the input transcripts into "loci" by their 3' ends and discards any transcript for which the same locus contains a compatible transcript with more exons.
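The "median of exon boundaries" consensus described for cluster_gff can be illustrated with a small Python sketch. This is not the tool's Go implementation; the function name `consensus_exons`, the exon representation as `(start, end)` tuples, and the use of the low median are all assumptions made for illustration.

```python
from statistics import median_low


def consensus_exons(transcripts):
    """Build a rough consensus from transcripts that share the same exon count.

    Each transcript is a list of (start, end) exon tuples. For every exon
    position the median start and median end across the cluster are taken.
    median_low is used so the result is always an observed integer coordinate
    (an assumption; the real tool may round differently).
    """
    n_exons = len(transcripts[0])
    if any(len(t) != n_exons for t in transcripts):
        raise ValueError("all transcripts in a cluster must have the same exon count")
    consensus = []
    for i in range(n_exons):
        starts = [t[i][0] for t in transcripts]
        ends = [t[i][1] for t in transcripts]
        consensus.append((median_low(starts), median_low(ends)))
    return consensus


# Hypothetical cluster of three reads with slightly different boundaries.
cluster = [
    [(100, 200), (300, 400)],
    [(102, 200), (300, 405)],
    [(98, 201), (299, 400)],
]
print(consensus_exons(cluster))
```

Taking the median rather than the mean keeps a single read with a badly placed boundary from shifting the consensus.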

Pinfish is largely inspired by the Mandalorion pipeline. It is meant to provide a quick way for generating annotations from long reads only and it is not meant to provide the same functionality as pipelines using a broader strategy for annotation (such as LoReAn).

The pinfish tools can be run via a Snakemake pipeline which handles the alignment tasks using minimap2.

Getting Started

Installation

The static Linux binaries for the x86_64 platform are included in the respective subdirectories of the source tree. To install them, simply copy them to a directory on your PATH.

The polish_clusters tool depends on the following software:

minimap2
samtools
racon - please install from source!

Dependencies and compiling from source

Compiling the tools from source requires a working Go compiler installation and the following packages installed via go get:

After installing the dependencies, simply run make in the respective subdirectory.

Usage

spliced_bam2gff

Usage of spliced_bam2gff:
  -M    Input is from minimap2.
  -V    Print out version.
  -g    Use strand tag as feature orientation then read strand if not available.
  -h    Print out help message.
  -s    Use read strand (from BAM flag) as feature orientation.
  -t int
        Number of cores to use. (default 4)

By default, the tool looks for the XS tag to determine transcript orientation. If the -M flag is specified, the input is assumed to come from minimap2 and the ts tag is used instead (with different rules to determine the final orientation).

If no orientation tag is found, then the orientation is set to ., unless the -g flag is provided, in which case the read orientation from the BAM flag is used.

If the -s flag is specified all the rules above are ignored and the orientation is set to the read strand from the BAM flag (appropriate for stranded protocols).
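The precedence of the three rules above can be summarised in a short Python sketch. It is only an illustration of the described behaviour, not the tool's code: the function and parameter names are hypothetical, and the handling of the ts tag (treated as relative to the read strand and flipped for reverse-strand reads) is an assumption about the "different rules" mentioned above.

```python
def feature_orientation(read_strand, xs_tag=None, ts_tag=None,
                        minimap2=False, fallback_to_read=False, stranded=False):
    """Pick a feature orientation following the rules described in the text.

    read_strand      -- "+" or "-", taken from the BAM flag
    xs_tag / ts_tag  -- "+" or "-" if the tag is present, else None
    minimap2         -- corresponds to -M
    fallback_to_read -- corresponds to -g
    stranded         -- corresponds to -s
    """
    if stranded:
        # -s overrides everything else.
        return read_strand
    if minimap2:
        if ts_tag in ("+", "-"):
            # Assumption: ts is relative to the read, so flip it for
            # reads aligned to the reverse strand.
            if read_strand == "+":
                return ts_tag
            return "-" if ts_tag == "+" else "+"
    elif xs_tag in ("+", "-"):
        return xs_tag
    if fallback_to_read:
        return read_strand
    return "."


print(feature_orientation("-", xs_tag="+"))             # tag wins over read strand
print(feature_orientation("-"))                         # no tag, no -g
print(feature_orientation("-", fallback_to_read=True))  # no tag, -g given
```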

Example run with minimap2 input:

spliced_bam2gff -M minimap_sorted.bam > raw_transcripts.gff

Example run with minimap2 input, stranded mode:

spliced_bam2gff -s minimap_sorted.bam > raw_transcripts.gff

Example run with GMAP input:

spliced_bam2gff gmap_sorted.bam > raw_transcripts.gff

cluster_gff

Usage of ./cluster_gff:
  -V    Print out version.
  -a string
        Write clusters in tabular format in this file.
  -c int
        Minimum cluster size. (default 10)
  -d int
        Exon boundary tolerance. (default 10)
  -e int
        Terminal exons boundary tolerance. (default 30)
  -h    Print out help message.
  -p float
        Minimum isoform percentage. (default 1)
  -prof string
        Write out CPU profiling information.
  -t int
        Number of cores to use. (default 4)

The -e parameter is the maximum distance tolerated at the start of the first exon and the end of the last exon, while -d is the tolerance for all other exon boundaries.

Transcript clusters having size less than the -c parameter are discarded. This parameter has the largest effect on the sensitivity and specificity of transcript reconstruction. Larger values usually lead to higher specificity at the expense of lowering sensitivity.
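As a rough illustration of how -d, -e and -c interact, the following Python sketch compares two transcripts boundary by boundary and then filters clusters by size. The helper names `same_structure` and `keep_clusters` are hypothetical, and requiring an equal exon count is an assumption; the actual clustering in cluster_gff may differ in detail.

```python
def same_structure(a, b, d=10, e=30):
    """Return True if two transcripts agree within the boundary tolerances.

    a, b -- lists of (start, end) exon tuples in genomic order
    d    -- tolerance for internal exon boundaries (like -d)
    e    -- tolerance for the start of the first exon and the end of the
            last exon (like -e)
    """
    if len(a) != len(b):
        return False
    last = len(a) - 1
    for i, ((a_start, a_end), (b_start, b_end)) in enumerate(zip(a, b)):
        start_tol = e if i == 0 else d
        end_tol = e if i == last else d
        if abs(a_start - b_start) > start_tol or abs(a_end - b_end) > end_tol:
            return False
    return True


def keep_clusters(clusters, min_size=10):
    """Drop clusters with fewer members than min_size (like -c)."""
    return {cid: reads for cid, reads in clusters.items() if len(reads) >= min_size}


t1 = [(100, 200), (300, 400), (500, 600)]
t2 = [(125, 205), (295, 400), (500, 628)]   # only terminal ends differ by more than d
t3 = [(100, 200), (315, 400), (500, 600)]   # internal boundary off by 15
print(same_structure(t1, t2), same_structure(t1, t3))
print(keep_clusters({"a": list(range(12)), "b": list(range(3))}))
```

Under these assumptions a larger -d or -e merges more reads into the same cluster, while a larger -c removes more of the small clusters afterwards.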

Example run with default minimum cluster size and tolerance values:

cluster_gff -a clusters.tsv raw_transcripts.gff > clustered_transcripts.gff

Example run with custom parameters:

cluster_gff -c 5 -e 50 -d 5 -a clusters.tsv raw_transcripts.gff > clustered_transcripts.gff

polish_clusters

Usage of ./polish_clusters:
  -V    Print out version.
  -a string
        Read cluster memberships in tabular format.
  -c int
        Minimum cluster size. (default 1)
  -d string
        Location of temporary directory.
  -h    Print out help message.
  -m    Do not load all reads in memory (slower).
  -o string
        Output fasta file.
  -t int
        Number of cores to use. (default 4)
  -x string
        Arguments passed to minimap2.
  -y string
        Arguments passed to racon.

Example run:

polish_clusters -a clusters.tsv -c 50 -o consensus_transcripts.fas -t 40 sorted.bam

The resulting consensus transcripts can be mapped to the genome using minimap2.

collapse_partials

Usage of ./collapse_partials:
  -M    Discard monoexonic transcripts.
  -U    Discard transcripts which are not oriented.
  -V    Print out version.
  -d int
        Internal exon boundary tolerance. (default 5)
  -e int
        Three prime exons boundary tolerance. (default 30)
  -f int
        Five prime exons boundary tolerance. (default 5000)
  -h    Print out help message.
  -prof string
        Write out CPU profiling information.
  -t int
        Number of cores to use. (default 4)

The -d parameter is the exon boundary difference tolerated at internal splice sites, while -e and -f are the tolerance values at the 3' and 5' ends respectively. Transcripts which are not oriented are all assigned to distinct "loci" and left untouched by default (but see the -U flag).
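The idea of a transcript being "compatible" with a longer one can be sketched in Python as follows. This is a simplified guess at the logic rather than the tool's algorithm: it only handles plus-strand transcripts, the function name `is_partial_of` is hypothetical, and the exact way the -f tolerance is applied to the 5' end is an assumption.

```python
def is_partial_of(short, full, d=5, e=30, f=5000):
    """Return True if `short` looks like a 5'-truncated version of `full`.

    Both transcripts are plus-strand lists of (start, end) exon tuples in
    5' to 3' order. `short` must have fewer exons than `full`; its exons are
    compared with the 3'-most exons of `full`.
    """
    if len(short) >= len(full):
        return False
    tail = full[len(full) - len(short):]
    # 3' end must agree within e.
    if abs(short[-1][1] - tail[-1][1]) > e:
        return False
    # 5' start of the partial may deviate by up to f (assumed interpretation).
    if abs(short[0][0] - tail[0][0]) > f:
        return False
    # All remaining (internal) boundaries must agree within d.
    for i in range(len(short)):
        if i > 0 and abs(short[i][0] - tail[i][0]) > d:
            return False
        if i < len(short) - 1 and abs(short[i][1] - tail[i][1]) > d:
            return False
    return True


full = [(100, 200), (300, 400), (500, 600)]
partial = [(350, 400), (500, 610)]      # starts inside exon 2, same 3' structure
different = [(350, 420), (500, 600)]    # internal splice site differs by 20
print(is_partial_of(partial, full), is_partial_of(different, full))
```

The large default for -f reflects that degradation can truncate the 5' end at an arbitrary position, whereas splice sites and the 3' end are expected to agree closely.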

Example run:

collapse_partials -d 10 -e 35 -f 1000 input.gff > collapsed_output.gff

Running tests

For running tests the following dependencies have to be installed:

Both are easy to install using bioconda. Look into the Makefiles for targets testing the tools on simulated and real data.

Help

Licence and Copyright

This Source Code Form is subject to the terms of the Mozilla Public License, v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain one at http://mozilla.org/MPL/2.0/.

FAQs and tips

The GFF2 files can be visualised using IGV.
The GFF2 files can be converted to GFF3 or GTF using the gffread utility.

References and Supporting Information

See the post announcing the tool at the Oxford Nanopore Technologies community here.

pinfish's Issues

syntax error in pinfish/polish_clusters/check_deps.go

Line 37 has the word "overalps" instead of "overlaps", this prevented me from finishing the run. It was an easy fix, just thought I'd let you know for future downloads. :)

pinfish annotation for prokaryotic genomes fails to collapse reads

I used the snakemake pinfish pipeline to annotate E. coli K12 from direct RNA-seq reads using standard parameters (only increasing sensitivity by lowering the -c parameter). After polishing and collapsing there are still many overlapping reads left. I have been playing around with the collapse parameters, but so far nothing has really worked. Any ideas?

Genome file was downloaded from the NCBI https://www.ncbi.nlm.nih.gov/nuccore/U00096.2.
GFF3:
ecoli_k12.txt

GFF output from snakemake pinfish:
clustered_transcripts_collapsed.txt
polished_transcripts_collapsed.txt
polished_transcripts.txt
clustered_transcripts.txt

Error racon - exit status 134

Dear all
Following the README file, I finished the spliced_bam2gff step and cluster_gff step. Now I want to use polish_clusters software to polish the transcript based on "clusters.tsv". The following error occurred.

$  polish_clusters -a clusters.tsv -t 12 -c 50 -o octosporus_diploid_19H.full_length.consensus_transcript.fas ../octosporus_diploid_19H.full_length.sorted.bam
polish_clusters: 13:04:48 Polishing cluster 1e6f6eaa-e7c2-4c42-bb3a-1124cff9cd21 of size 276
polish_clusters: 13:04:49 Polishing cluster e24f0d1f-8273-41e6-be26-64163d420938 of size 103
polish_clusters: 13:04:49 Polishing cluster 72f6a514-7c31-4cf4-bd8b-3751fe7a95b7 of size 71
polish_clusters: 13:04:49 Polishing cluster 0a5a9eba-24c2-40c6-a2e8-8abea6f4c504 of size 4967
polish_clusters: 13:05:21 Polishing cluster ec124557-145e-4796-875b-a11c9bfe19b0 of size 7196
polish_clusters: 13:05:22 Failed running command: racon -t 12 -q -1  /tmp/pinfish_ec124557-145e-4796-875b-a11c9bfe19b0_838333432/reads.fq /tmp/pinfish_ec124557-145e-4796-875b-a11c9bfe19b0_838333432/alignments.sam /tmp/pinfish_ec124557-145e-4796-875b-a11c9bfe19b0_838333432/reference.fq > /tmp/pinfish_ec124557-145e-4796-875b-a11c9bfe19b0_838333432/consensus.fq - exit status 134

Can anyone help me to solve this error?
Thanks~
Guo-Song

cluster_gff failed to cluster exons

Hi, I am using cluster_gff to cluster primary alignments of an ONT cDNA dataset and I notice that some exons which should be clustered together are missing in the output GFF file of cluster_gff, for example:

The upper panel is the primary alignments before cluster_gff, and the lower panel is the output gff file of cluster_gff. Most reads shown in this screenshot are filtered out in the output cluster_memberships.tsv file. And I think there should be exons in the place where I marked with question mark.

Here are my commands:
spliced_bam2gff -s -t 20 -M only_primary.sort.bam > raw.gff
cluster_gff -p 1.0 -t 20 -c 2 -d 30 -e 30 -a cluster_memberships.tsv raw.gff > clustered_transcripts.gff

How could I fine-tune the parameters to prevent these reads from being filtered out? Could you give some advice?

Thanks.

polish_clusters step

I am a little confused as to what the BAM input should be. Is it the original BAM file used in the initial step for the GFF conversion?

So far I ran following steps:

Converted the BAM of splice-aware minimap2 alignments against the indexed hg38 reference to a GFF file.

Clustered the transcripts into clustered_transcripts.gtf and generated clusters.tsv.

Now I am stuck here:

polish_clusters -a clusters.tsv -c 50 -o consensus_transcripts.fas -t 40 sorted.bam

Does this example code omit a step that maps the consensus back and generates another sorted BAM file?

Or is the sorted.bam file an example output file in addition to the fas reference file?

Sorry for my naivety.

I ran it and got the following error related to minimap2. I recently updated my minimap2 install to a new release, and perhaps this is causing a problem finding the executable.

tkx292:~ callum$ ./pinfish/polish_clusters/polish_clusters -a ~/Sync_later/clusters.tsv -c 50 -o ~/Sync_later/20180903_HDF_consensus_transcripts.fas ~/Sync_later/20180903_HDF_consensus.sorted.bam
polish_clusters: 17:43:28 Failed running command: minimap2 -h - exit status 127
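Exit status 127 is what a shell conventionally returns when a command cannot be found, so one quick check is whether the polish_clusters dependencies listed in the README are visible on PATH. The following is a minimal Python sketch of such a check; the helper name `missing_tools` is hypothetical and not part of pinfish.

```python
import shutil

# Dependencies of polish_clusters as listed in the README.
REQUIRED_TOOLS = ("minimap2", "samtools", "racon")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that cannot be found on the current PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```

If a tool is reported as missing here but is installed, the directory containing it probably needs to be added to PATH in the shell that launches polish_clusters.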

Make a bin

Can I suggest making a bin with all of the programs inside of it? As the directory structure is now, users have to add four paths to their PATH variable, or move the programs around.

Polish Cluster stops midway

Hi,
The pipeline stops midway at the polish_clusters step. It does not error out and is not killed; it simply hangs at a random point after polishing some clusters. If I run just the polish_clusters step multiple times, it hangs at different clusters, so there should not be anything inherently wrong with any particular cluster.

A few observations:

The pipeline runs successfully for a smaller genome but has issues with a bigger genome. So could this be a memory issue? I did not see an option to provide more memory.
I provide the program a scratch space of up to 1TB, and so I do not think space is an issue.
The program completes successfully when I set the threshold of cluster size to 50 or 100, but with lower threshold values like 15 or 20 the program does not run reliably.

Any guidance will be really helpful.
Thanks.

Scaffolding using long-read RNA?

Dear developers,

Any chance that this tool could be co-opted or extended to provide information about contig adjacency and scaffolding from split long-read RNA mappings?

Cheers!

cluster_gff option

Dear @bsipos,

I'd like to clarify the options provided by cluster_gff.
Do the two options below refer to a number of bases?

-d int : Exon boundary tolerance. (default 10)
-e int : Terminal exons boundary tolerance. (default 30)

If so, does the default for -d mean that 10 bp are tolerated on both sides of an exon?

Also, for the cluster option, does this refer to the number of supporting reads for that cluster?

-c int : Minimum cluster size. (default 10)

Thank you very much for your help!

Jungwoo

polish_clusters error?

Hi, I am using "pipeline pinfish analysis" to annotate a genome with ONT cDNA long reads. However, at the "polish_clusters" step I got the following error message:

polish_clusters: 15:11:47 Failed running command: samtools view -h - exit status 1

[Wed Dec 18 15:11:47 2019]

Error in rule polish_clusters:

jobid: 7

output: results/polished_transcripts.fas

conda-env: /home/work/rxl_wkdir/wkdir_N06/pipeline-pinfish-analysis/.snakemake/conda/bfe8fba3

shell:        

/home/work/rxl_wkdir/wkdir_N06/pipeline-pinfish-analysis/pinfish/polish_clusters/polish_clusters -t 30 -a results/cluster_memberships.tsv -c 10 -o results/polished_transcripts.fas alignments/reads_aln_sorted.bam    

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.

I did install minimap2, samtools, and racon as dependencies, but I don't know how this error happened. Could anybody give some advice on this? Thank you very much!

Using medaka with transcriptome data

Hi,

I am new to Pomoxis and I am trying to run a polishing step on budgerigar transcriptomic data after a reference-based assembly, but I am getting an error in the consensus generation step.

First, I successfully created a reference-based assembly (using the budgerigar genome, but with only the longest transcript per gene selected) with the following script:

#!/bin/bash
NPROC=$(nproc)
BASECALLS=data/Pool4_merged_trimmed_BC.fastq
REFERENCE=/home/martin.tesicky/Medaka_parrots/Pool4/medaka_walkthrough/Budgerigar_genome_longest_transcript_mart_export.fasta
mini_assemble -i ${BASECALLS} -r ${REFERENCE} -o draft_assm_with_reference -p assm -t ${NPROC}

Then I ran the consensus polishing step as follows:

#!/bin/bash
NPROC=$(nproc)
BASECALLS=data/Pool4_merged_trimmed_BC.fastq
DRAFT=draft_assm_with_reference/assm_final.fa
OUTDIR=medaka_consensus_with_reference_1_12
medaka_consensus -i ${BASECALLS} -d ${DRAFT} -o ${OUTDIR} -t ${NPROC} -m r941_min_h

When I run the medaka consensus step, the FASTA file with the consensus sequence is always empty, and I get "Failed to stitch consensus chunks." in the logfile plus the traceback shown below in the terminal (the complete output is in the attached files). I also tried a de novo assembly without a reference, but I got a very low number of contigs (only ca. 1500), although the polishing step worked with that approach. With the reference-based assembly I have many more contigs (ca. 12 000), but the consensus step does not work. I would also welcome any advice on which parameters I should adjust for de novo assembly, since we also have other species for which genomes are not available, @cjw85.

I would be very grateful for any help.

Attached files:
typescript.txt
medaka_consenus_runner_with_reference__01_12_logfile.txt

Traceback (most recent call last):
  File "/home/martin.tesicky/miniconda3/envs/medaka/bin/medaka", line 11, in <module>
    sys.exit(main())
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/site-packages/medaka/medaka.py", line 431, in main
    args.func(args)
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/site-packages/medaka/stitch.py", line 137, in stitch
    for contigs in executor.map(worker, regions):
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/concurrent/futures/process.py", line 496, in map
    timeout=timeout)
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/concurrent/futures/_base.py", line 575, in map
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/concurrent/futures/_base.py", line 575, in <listcomp>
    fs = [self.submit(fn, *args) for args in zip(*iterables)]
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/concurrent/futures/process.py", line 139, in _get_chunks
    chunk = tuple(itertools.islice(it, chunksize))
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/site-packages/medaka/common.py", line 610, in grouper
    batch.append(next(gen))
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/site-packages/medaka/stitch.py", line 130, in <genexpr>
    (common.Region.from_string(r) for r in args.regions),
  File "/home/martin.tesicky/miniconda3/envs/medaka/lib/python3.6/site-packages/medaka/common.py", line 469, in from_string
    start = int(bounds)
ValueError: invalid literal for int() with base 10: 'ubiquinone'
(medaka) martin.tesicky@turbacz:/Medaka_parrots/Pool4/medaka_walkthrough$ ls
assess_assembly_runner.bash
assess_assembly_stats_runner.bash
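The ValueError in the traceback is raised while a region string is being parsed. One possible cause (an assumption, not confirmed anywhere in this thread) is a contig name in the draft that contains characters such as ":" which collide with a `name:start-end` region syntax. The Python sketch below uses a toy parser, which is not medaka's implementation, to reproduce that kind of failure, together with a hypothetical `safe_contig_name` helper that simplifies FASTA headers. The example header is invented for illustration.

```python
import re


def parse_region(region):
    """Toy parser for 'name', 'name:start' or 'name:start-end' strings.

    This is NOT medaka's code; it only mimics the general idea of splitting
    on ':' and converting the bounds to integers.
    """
    if ":" not in region:
        return region, None, None
    name, bounds = region.split(":", 1)
    if "-" in bounds:
        start, end = bounds.split("-", 1)
        return name, int(start), int(end)
    return name, int(bounds), None


def safe_contig_name(header):
    """Keep the first whitespace-delimited token and replace unusual characters."""
    first_token = header.split()[0]
    return re.sub(r"[^A-Za-z0-9_.]", "_", first_token)


# Invented header resembling a gene export that contains a colon.
header = "gene1|NADH:ubiquinone oxidoreductase"
try:
    parse_region(header.split()[0])
except ValueError as err:
    print("parsing failed:", err)

clean = safe_contig_name(header)
print(clean, parse_region(clean))
```

If contig names do turn out to be the cause, renaming the sequences in the reference FASTA before assembly would avoid the clash.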

error loading or unzipping Reference Data

Hello, I am getting the error below when I execute snakemake. I have followed the tutorial instructions here: https://community.nanoporetech.com/knowledge/bioinformatics/using-pinfish-for-gene-tra/tutorial

I built the proper environment and got the tutorial to run, but when I use my own data in the RawData and ReferenceData folders after updating the config.yaml file I get:

(Pinfish) -bash-4.2$ snakemake -j 1
Building DAG of jobs...
InputFunctionException in line 73 of /mnt/home/thom1524/Pinfish/Snakefile:
KeyError: 'AMC_annotation.gff.gz'
Wildcards:
unzipFile=AMC_annotation.gff.gz

It seems to me that the data I give it are not being opened or read. Any help would be much appreciated!

polish_clusters fails with racon error

Starting from a BAM file of ONT direct RNA reads aligned to the mouse genome with minimap2 (-x splice),
I ran the first two commands of pinfish successfully:

spliced_bam2gff -s fastq_runid_a7dd2b90b03f7f2be36d2c837fd73e0272542809_sort.bam > raw_transcripts.gff

cluster_gff -a clusters.tsv raw_transcripts.gff > clustered_transcripts.gff

However, the 3rd step fails with:
polish_clusters -a clusters.tsv -c 50 -o consensus_transcripts.fas -t 10 fastq_runid_a7dd2b90b03f7f2be36d2c837fd73e0272542809_sort.bam
polish_clusters: 13:32:36 Polishing cluster 807d3c5d-18d5-4be1-a3aa-3d6e97d36d86 of size 165
polish_clusters: 13:32:37 Polishing cluster b799bd86-4133-4c78-8727-4dd097073d53 of size 62
polish_clusters: 13:32:37 Failed running command: racon -t 10 -q -1 /tmp/pinfish_b799bd86-4133-4c78-8727-4dd097073d53_281010008/reads.fq /tmp/pinfish_b799bd86-4133-4c78-8727-4dd097073d53_281010008/alignments.sam /tmp/pinfish_b799bd86-4133-4c78-8727-4dd097073d53_281010008/reference.fq > /tmp/pinfish_b799bd86-4133-4c78-8727-4dd097073d53_281010008/consensus.fq - exit status 1

I also noticed that running the same command again, different clusters are processed first, is that expected?
polish_clusters -a clusters.tsv -c 50 -o consensus_transcripts.fas -t 10 fastq_runid_a7dd2b90b03f7f2be36d2c837fd73e0272542809_sort.bam
polish_clusters: 13:44:01 Polishing cluster 2da54d30-382b-46fe-83c0-ae47c4f34ee9 of size 104
polish_clusters: 13:44:02 Polishing cluster 31b0ce79-2a54-4057-a534-599dbde2d39a of size 52
polish_clusters: 13:44:02 Failed running command: racon -t 10 -q -1 /tmp/pinfish_31b0ce79-2a54-4057-a534-599dbde2d39a_699589609/reads.fq /tmp/pinfish_31b0ce79-2a54-4057-a534-599dbde2d39a_699589609/alignments.sam /tmp/pinfish_31b0ce79-2a54-4057-a534-599dbde2d39a_699589609/reference.fq > /tmp/pinfish_31b0ce79-2a54-4057-a534-599dbde2d39a_699589609/consensus.fq - exit status 1

In /tmp/ some files are starting to be generated:
ll /tmp/pinfish_31b0ce79-2a54-4057-a534-599dbde2d39a_699589609/
40143 Dec 18 13:44 alignments.sam
0 Dec 18 13:44 consensus.fq
34064 Dec 18 13:44 reads.fq
810 Dec 18 13:44 reference.fq

To me, it looks like it starts running (first two clusters of size 165 and 62, resp., are processed), but then encounters something it does not like.
Any ideas for troubleshooting? The error log is unfortunately not really enlightening.

Many thanks,
best,
Sophia

question on cluster_gff

Dear @bsipos

Hello, while I was doing the analysis, I ran into a question about the cluster_gff step.
Would cluster_gff report clusters of reads from the two different strands separately in its output when they happen to have matching boundaries?

Thank you for your wonderful help.

Jungwoo.

Racon failed during polish_clusters run

Hi!

I have an error message when running polish_clusters (spliced_bam2gff and cluster_gff worked fine):

polish_clusters: 10:41:12 Polishing cluster 8066429e-3347-4dbd-b47b-594085368984 of size 29
polish_clusters: 10:41:12 Failed running command: racon -t 16 -q -1 /home/meitel/data/cbas/pinfish/corrected2/gmap/tmp/pinfish_8066429e-3347-4dbd-b47b-594085368984_702280846/reads.fq /home/meitel/data/cbas/pinfish/corrected2/gmap/tmp/pinfish_8066429e-3347-4dbd-b47b-594085368984_702280846/alignments.sam /home/meitel/data/cbas/pinfish/corrected2/gmap/tmp/pinfish_8066429e-3347-4dbd-b47b-594085368984_702280846/reference.fq > /home/meitel/data/cbas/pinfish/corrected2/gmap/tmp/pinfish_8066429e-3347-4dbd-b47b-594085368984_702280846/consensus.fq - exit status 134

My polish_clusters command line:

polish_clusters -d /home/cgarcia/analysis/pinfish/corrected2/gmap/tmp -a CBAS_MASURCA-2_final.genome.scf._ONT_cdna_gmap_combined_100bp_correction-2.sorted_clusters.tsv \
 -o CBAS_MASURCA-2_final.genome.scf._ONT_cdna_gmap_combined_100bp_correction-2.sorted_clustered_consensus_transcripts.fasta \
 -t 16 CBAS_MASURCA-2_final.genome.scf._ONT_cdna_gmap_combined_100bp_correction-2_fixmate_sorted.bam 2> CBAS_PIN_C-2_G_polish_clusters.log

Before running the pinfish tools I sorted the gmap sam file by reads, removed secondary alignments and unmapped reads and then sorted again (standard) using samtools:

samtools sort -n -@ 16 CBAS_MASURCA-2_final.genome.scf._ONT_cdna_gmap_combined_100bp_correction-2.sam | samtools fixmate --reference CBAS_MASURCA-2_final.genome.scf.fasta -r -@ 16 - CBAS_MASURCA-2_final.genome.scf._ONT_cdna_gmap_combined_100bp_correction-2_fixmate.bam
samtools sort -@ 16 CBAS_MASURCA-2_final.genome.scf._ONT_cdna_gmap_combined_100bp_correction-2_fixmate.bam > CBAS_MASURCA-2_final.genome.scf._ONT_cdna_gmap_combined_100bp_correction-2_fixmate_sorted.bam

Not sure if the polish_clusters error stems from racon or from my samtools processing. Any ideas?

Thanks
Michael

Question: minimum_isoform_percent

What is the minimum isoform percentage parameter? And what does the value 1 mean: 1 as 1.0 (so 100%) or 1 as 1%?

error for polish cluster

Hi @bsipos and @ksahlin,

I am running polish_clusters with my own data (spliced_bam2gff and cluster_gff worked fine),
and I am getting the following error, which seems to be related to a wrong input format for racon. Besides the commands, I am also including excerpts of the input/output files. I already tried to ask in a similar issue, #5, but I am not sure whether you receive notifications when the issue is closed.

spliced_bam2gff
spliced_bam2gff -M Pool4_merged_trimmed_BC.bam -t 24 > Pool4_merged_trimmed_BC_raw_transcripts.gff

cluster_gff
cluster_gff -t 24 -a Pool4_merged_trimmed_BC_raw_transcripts_clusters.tsv Pool4_merged_trimmed_BC_raw_transcripts.gff > Pool4_merged_trimmed_BC_raw_transcripts_clustered_transcripts.gff

My polish_clusters command line:
(pinfish) martin.tesicky@turbacz:~/Medaka_parrots/Pool4/pinfish$ polish_clusters -a Pool4_merged_trimmed_BC_raw_transcripts_clusters.tsv -c 50 -o Pool4_merged_t_rimmed_BCconsensus_transcripts.fas -t 24 Pool4_merged_trimmed_BC.bam
polish_clusters: 15:55:52 Polishing cluster 02e3c0c0-b1b5-4f5f-8bf5-05f83f97fbd3 of size 330
polish_clusters: 15:55:53 Polishing cluster c7a0033e-8bd3-4ea8-8539-8cfe406f915f of size 85
polish_clusters: 15:55:53 Polishing cluster e3b5ea34-f86a-42d8-9e2d-9c86a8d29b5c of size 53
polish_clusters: 15:55:54 Polishing cluster 8f8b2b1d-54d3-4b18-8dc5-3fc9b7eef4f9 of size 96
polish_clusters: 15:55:54 Polishing cluster f786d9ec-ecc0-4b5f-8862-2b4d1b72a8e1 of size 26109
polish_clusters: 15:59:42 Polishing cluster bb6413a0-83dd-452f-8e70-94e848c88720 of size 99
polish_clusters: 15:59:42 Polishing cluster 8db24705-3c4c-4e5a-b8bd-300688ff0bdc of size 50
polish_clusters: 15:59:43 Polishing cluster d28805d8-6cae-42fe-8138-a0d64a998608 of size 1345
polish_clusters: 15:59:50 Polishing cluster 46a7098b-2dd1-4f1b-966e-869613dfa32b of size 53
polish_clusters: 15:59:50 Polishing cluster 34c16ef3-891f-47aa-8946-32b963fb6812 of size 58
polish_clusters: 15:59:50 Polishing cluster d49642c2-0fd4-4c0c-8319-59d1863f57f8 of size 70
polish_clusters: 15:59:50 Polishing cluster 441903e3-5769-43cf-8043-bfb1df57c90e of size 2829
polish_clusters: 16:00:04 Polishing cluster 971c2378-3d77-43de-89ee-fd6c28e1947f of size 107
polish_clusters: 16:00:05 Polishing cluster e4bee383-39e1-494d-968e-ed129b7c4fc5 of size 147
polish_clusters: 16:00:06 Polishing cluster 0523fa67-f3a0-4343-8ecc-6cb957f5c26c of size 188
polish_clusters: 16:00:07 Polishing cluster cdbe623e-17c5-431a-93ea-a79e978eb475 of size 2260
polish_clusters: 16:00:20 Polishing cluster 6b535889-7107-48c3-9eab-5cb0aa6d1092 of size 258
polish_clusters: 16:00:22 Polishing cluster ca60e86e-f2cc-4ab8-ad13-d8727c344461 of size 202
polish_clusters: 16:00:23 Polishing cluster 70ed3672-75d2-4580-b9a8-8c0d79491f98 of size 306
polish_clusters: 16:00:24 Polishing cluster 53d2a843-bff9-4258-8b8b-701c4d47cd46 of size 61
polish_clusters: 16:00:24 Polishing cluster c40639a3-c3bb-4fec-8662-0e963f47d6db of size 99
polish_clusters: 16:00:25 Polishing cluster b788ec37-d32d-4f0a-baa8-5d3ec18af775 of size 130
polish_clusters: 16:00:25 Failed running command: racon -t 24 -q -1 /tmp/pinfish_b788ec37-d32d-4f0a-baa8-5d3ec18af775_916745013/reads.fq /tmp/pinfish_b788ec37-d32d-4f0a-baa8-5d3ec18af775_916745013/alignments.sam /tmp/pinfish_b788ec37-d32d-4f0a-baa8-5d3ec18af775_916745013/reference.fq > /tmp/pinfish_b788ec37-d32d-4f0a-baa8-5d3ec18af775_916745013/consensus.fq - exit status 134

And when I run only the specific command that fails:
racon -t 24 -q -1 /tmp/pinfish_30b0c438-8629-4a99-bc51-d09beba70aaf_205375420/reads.fq /tmp/pinfish_30b0c438-8629-4a99-bc51-d09beba70aaf_205375420/alignments.sam /tmp/pinfish_30b0c438-8629-4a99-bc51-d09beba70aaf_205375420/reference.fq > /tmp/pinfish_30b0c438-8629-4a99-bc51-d09beba70aaf_205375420/consensus.fq
terminate called after throwing an instance of 'std::invalid_argument'
what(): [bioparser::FastqParser] error: invalid file format!
Aborted (core dumped)
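Since racon complains about an invalid FASTQ file, it may help to look for the first malformed record in the files it was given. The following Python sketch does a basic structural check; the function name `first_fastq_problem` is hypothetical, and the check assumes the simple four-lines-per-record FASTQ layout, which may not match every valid FASTQ file.

```python
import os
import tempfile


def first_fastq_problem(path):
    """Return a description of the first problem found, or None if none is found.

    Assumes the simple four-line-per-record FASTQ layout.
    """
    with open(path) as handle:
        lines = [line.rstrip("\n") for line in handle]
    if not lines:
        return "file is empty"
    if len(lines) % 4 != 0:
        return f"{len(lines)} lines, not a multiple of 4"
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        record = i // 4 + 1
        if not header.startswith("@"):
            return f"record {record}: header does not start with '@'"
        if not plus.startswith("+"):
            return f"record {record}: separator line does not start with '+'"
        if not seq:
            return f"record {record}: empty sequence"
        if len(seq) != len(qual):
            return f"record {record}: sequence and quality lengths differ"
    return None


# Demo on a throwaway file with a deliberately short quality string.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.fq")
    with open(demo, "w") as out:
        out.write("@read1\nACGT\n+\nIII\n")
    print(first_fastq_problem(demo))
```

To inspect a failed run, the function could be pointed at the reads.fq and reference.fq files in the temporary directory named in the error message, provided that directory still exists.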

A few lines from the input/output files:
Pool4_merged_trimmed_BC_raw_transcripts_clusters.tsv
Read Cluster
1b27018c-7eb5-4359-8a0f-95d7265a1c28 640b1d2d-8a92-46a8-ac64-03fec213591b
135a6032-2c24-4f62-9948-8edae6dbd8a4 640b1d2d-8a92-46a8-ac64-03fec213591b
b1faf377-9151-48c0-9dd4-c8d8900117ba 640b1d2d-8a92-46a8-ac64-03fec213591b
5b6b8a04-bf6d-4e32-b3fa-728e582569f9 640b1d2d-8a92-46a8-ac64-03fec213591b
7046ff83-4301-4694-80a1-4d5d8596a371 640b1d2d-8a92-46a8-ac64-03fec213591b
3e1928ba-06cf-4c04-9cf3-8510bcc318d2 640b1d2d-8a92-46a8-ac64-03fec213591b
a2c4f0b8-f6ff-4fa0-9a04-9147bbcbf2ac 640b1d2d-8a92-46a8-ac64-03fec213591b
dea24b20-4eb7-478a-97df-4052a7ab435c 640b1d2d-8a92-46a8-ac64-03fec213591b
1e0226c9-2774-4475-b679-100dc3c45f65 640b1d2d-8a92-46a8-ac64-03fec213591b

Pool4_merged_trimmed_BC.bam
(pinfish) martin.tesicky@turbacz:~/Medaka_parrots/Pool4/pinfish$ samtools view Pool4_merged_trimmed_BC.bam | head -n 5
14a0e3a4-795f-4986-9975-0547ff81815d 0 ENSMUNG00000000050|ENSMUNG00000000050.1|testis 587 60 41S54M2I70M1D29M34S 0 0 GGGGACGCCGGGGCCAAGCGGTAGCAGTGCCATGAGCTGCGAGACTCTGACACGTCTCTGGGACCTGCAATGACAAGTCAGAACAGTAGCCATGCAGAGAGAAAACTGGTGAAATGTCACCAAAGCAGTCAACACCAAAAGTGCAGGTCCAGCAGTCTGTCTCCCGAAAACCATCACTATTCATTTCTCTGCATTTTGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA D:/.;=>;;&(2+---$''%)(%%&$9@/9=100%()))$3,5/..,0+)&693?3;=<96231/4?=E=.)BA=4+#$(/2:;9''+))#4%21>922'7-$83''()##569&888%@538%>>IC72GA99=4--<>&$$$37.)+?>80)'1/:9418,-1745=BKMFF>A?7>?;961,)A9E?766555556566766789::;===?@> NM:i:7 ms:i:134 AS:i:134 nn:i:0 ts:A:+ tp:A:P cm:i:28 s1:i:127 s2:i:0 de:f:0.0387 rl:i:33
c9c5990c-ad71-4ee1-a7d6-0fe05e9a2947 0 ENSMUNG00000000050|ENSMUNG00000000050.1|testis 587 60 44S54M2D30M1I7M1D8M1D19M2D38M1D5M1I4M1I2M1I85M1D8M2I10M1D10M1I18M3I25M1D34M3I89M2I98M1D15M2I5M3I7M1I36M1I24M1D90M1D61M1D7M2D42M4D35M1D14M1I29M1I25M1I17M2I12M6I8M1I39M1D22M3D19M1D69M1I11M1D51M1D3M1I3M1I1M1I13M1D18M2D19M1I14M2D22M3D1M1I25M2D6M2D9M4I3M1D11M1D3M2I2M1I14M1I38M2D51M1D27M1I2M1I5M1D31M4D34M5D26M1I35M1I25M1D5M1D27M1D22M2D23M6D15M1D7M1D22M2D56M1D20M2D5M1I3M2D39M1I28M1I10M1D21M3I1M1D20M1D2M1D7M1I33M1I14M1D5M1D44M1I5M1D11M1I3M1D3M1D21M1I107M1S * 0 0 GGGGGACGCCGAGACCTGACAGGCAGTGCCATGAGCTGCAGGTAAGACTCTGACACGTCTCTGGGACCTGCAATGACAAGTCAGAACAGTAGCCATGCAGAAAACTGGTGAAATGTCCTCAAAGCAGTCCAGCACCAAAGTGCAGTCCGGCGATCTGTCTCCCGAAACCATCACTATTCATTTCTCTGCATTTGGGAAGGAGAAGAGAGGGAGAGGGAAGAAGAGTTTAAGGAATTCCTTGATGAGGAACTAGATGACCAAAGCATTGTAACAGCACTTGAAATAAAGGAAGACCTCTGCTTGAGCATGCACTGGCCATGGTACTCTGGTCCCAGCTACCTCGCTTAACATAATGGCCAACTCTGCATCACTACTGTCTCATCACCTGCAGTTCTGCCAACTGCAGAGACTACAATAAACCTGTTGGATTCTCCTTCCACATCCCAAGTATTCAGTGCAGTGCCACTAGTCCTGTCTCCTTCATCACACTCATGTAATACAGTTGTAGCTCATCAGGTGCACCACCTGTGAGTCAGAAATCCAGCCTGTCATCCTCCCCATCCTCATCCCCTTCCAGATCAGTTGTCTGCTCTAGTGGATCATCACAGTTTCTAGTTCAGAAAACTTATTTTTAAGGGGTTTAGTCAAGTCCCTTTCAGCAGATGTGGAACCAAAAAGAACCCACCCCACCGATGAGCGCAGACAGCTAGTGAAAACCTTAGTGAAATCTCTGTCTACAGACACTTCCAAACAAGAATCTGAAACTGTGTCTTACGGGCCACCTGACCCAAACTGAACTTGCATCTGTTCAAACAGTTCACTCAACCTCGAGCTACAGGTGGTGATTCAAAACTGCCCTCGTCTCCATTAACATCTCCCTCTGACACCCGTTCCTTTAAGTACCTGAAATGGAGGCTAAAATTGAAGATACTAAAGACGCCTTTCTGGAAGTAATCTGTGAGCCTTTCCAGCTGCTCCAGTAAAATAATGGGTGATGAAAGTGACCAGCCACAGACCCAAAGAGCCTTATCTTCAGGGAGAAGTGCTTCCAGAACTCTCAAACCTTTCCAGTTTGAATGGCCATTTTGAAGCAATAACAACTACAGCATTGAAGAAGAATGTGATTCAGAGAGGACTTCTATGGAAGTGACTCCAACCTGAGCAAGAACGATCAGTCAAAAGTGGCTGAGGAGCATACAAAAAGAGACAGGGCCCAAAGCTCTCAGTCTGCAAGCACAAAGGACGTGAGTTCCAAAACGTCCTCATTAGCGAGGAAAAATGTTCGTGTCTGCATTAGCAAGCAGAGGATGAGGAGTTTTGTGGAACTTTATTCTGAACTTTTCCTTGCTGGAGGATGACTAAAACTGATAAACCTGCTGAAACTTCATGATCGAGATCACCAAAGGAGAAAATGGTACTGGCACTCCAGTAGTGGAGATGAAAAATAA
TTCCTATGAGCAACAGCCTAAAATACCAGTGAGCTTTATGCTTCTTAACGCTGTTAGTCTATGCTTACCTTAATATCCCTCTCCTAGCTACCAAAGTGGACTTTTATTTAGGAAATGGCCTTGGATTTATGATAGCTGTCTGTGTGATTTAATAACCTCACGTACTCATGAATATCTCAAATTAAAGTGTGAAAAAGCAATGGAATACAGGGAGCTCTAGACATCAAAGAACCTGAAATACTGAAGGGGATGGATGAATGAAATCTATAACACGATCAGAAACATACCATGCTACATTGACTCCTCTGTCTATGTGCGACTTGAAAAGCACCTTACGACTTTCAAAACAAAATATCTCTAGAAGAATATCACAATGAGCCAAAGCCTGAAGTCATATGTCAGCCAGAAAATCTATGGACTTACAGAGCAAAAGATTTCCCTGGTTCCTAAAGTCTGGCACGAAAACGGTTGGAGATAAAGTACCCTATTTGCATTGAACTCGCTAAACAGGATGACTTTTATGGCTAAGGCCCAGGCTGATAAAGAAGAATGCAGAGAAAAGTTATCTGCGGAAAAAAACGAGACGTGAGCAACGAAGAATCAGAAATCTCCAGGGTGGAGCAAAGTACACTAGCCAAAAGGATCCCAGTGCTTTATCTTTTGGAGGACCGGTAGGGAGAAGGAGGAATGGTTCAGAAGGTTTCTTCTCCGCATCAAGCTGAAGTCCTGAGCAAGAAGCTATCCAGTCTATGTGGGGAACAAGCCAGGGATCTTGCCAACACAGAGTAGAGCCGATAGTCAATCTGGAGTTCTCACACACAGCCGAAGCAGCAGCAAGGGAAGTGCAGAAGAGATTGCATCC 8+414)&&'''/)-)$&%%%6>8;=D/5@4(($&"#%$#285D:4992=70335496A>D=)M<FHIABKE9=;=3/7?FEF=9:18:'>>BDE;A=@653,31=>3-13+-23:-4350('+)$$?=&(-(29802H7.1+((8&336>:BA<8=A(33'#,168>/1/:>E?F8<JFF-CG4?=>>?3)/:4@=421/80(,$//5.(%0')-1(::/CBL=:;63);(A-6/197<;<7;0?<4//H48=<@k9<7=>=:,/;5992//)),++(&&00&4,14;5>928''&'','&''$$"(<=A6CB>64&%#"&'.30154.454(2%$'.-$#$#$%-%%42??G=+).BGC-3B>4-38G@@:=<9358@D<9,354A26;@;BA+&--'((%(0612:8;=>03.;<68-.1671-9G<=ID<1-.-@@>9?3-1>&,,-:>(767A;;>:I7D@B/<;67&)%6C8$$#$#%'14:670-3/-B<'9''#&02')(('$%$$%"&)(/4++92$+7,:<9++1507;7>//26-:&&;;-,<2:)%($A8DAB=<7;7&);(.($48:<A.+1%()1)@b96>>41(('$'),-.99@=B5EMF/,(@=13:;D1(5<7&%/4>G89,2N95(4==>=+''9768=;<:C58EBEC83:=;><43D'&%'9A;C716<8?:9;/46<;8FDD1'/)'8;:>1BA61:;AD.:2$+CJ:9<-.BF?>;-/32,AC110;5((#30;?8))9>%%(#'16;D<8/..((E?ABBGFI5/3%&/11=D:1+'+(/>I/A4&$.76<878/%1.;=?83;)D7,124=>=?@;<4=B)9111'''3$$%'6'94,,4$5//,,,(+,GH>E59>0,65.%%$#+637&@C=:295A/FPJ30'&&&)2/CLKF?8@@A+B@>BGI7/)('(?58>(:?$8>>7=79;:5$#(63/5=7/'%)(&&/),)3??=168899+-&).3?>,@C188=512>&;5=7241?-(4GPE?8.A>802827((&)+$$DQLA=97)43)/(&&++1.,%'()5$.,:80.+0,12<=;:?E@/-53,,/;+5A23966%%.+<((&$),&&-0FE@6@=I8>$39;=D@:<>3-->:
'18956=223$&#:+@c7+$$--10KRGBB?6-28B<ERVC)1-+48=6-<>,0/2##6>I?</03AII?B@@<31-.8701,.$&+/4'-67?541,,($.(08@511/+))2/.+;1.+,3>O87A9<2.)-+$&'&46<<;33.,6956?CA?21#1?:D1026E.'AB>5#$-())+132657=0-:34:4ABFF>&)?=&&%'75:+04)'&&+)%D:8B9-9612=CHD?;@d89@;$$$1$.994=::9??83).>//.(%&%$'(()))%%45++@+BA56'(&;DBDB@A5;<85602$##.)),.::?<992:A=D8JCA>4.-<,245;<9B7;>BC::=E<9<558FK@??E8<>;,/7=940/4=??=B>=255&$'=6::0$)+,1&&663**:794-+,,-1/>:13,-1@>/11>K@;>1$TUN?=BDAADK>93+)$621A@@m?2.1)0//;FE66'>>C;>Q::BD;2,=.IE::%%%)..,-/2;<685236;6;5->9%1@67=$@6:1.?:6/)9>4)/3<8?@);3ANAA<=;9((-1+/3=>B@</@ux.>?-:BF?CI@9B?96<)+8;85+-)D''';1=6.B7>E6-&)&98C<3:AD:C@?@;A##$%%.=D8&+,0--/''C><>?9;BC../:&$%3G@.:>2E<>8?=A3611D>B6<:7)<89:''.62,%&&'5R=<F@(($,@6<B482145415,+'234:9:,.--8:.05<KC@941/32?8==-&E<=A NM:i:217 ms:i:1655 AS:i:1655 nn:i:0 ts:A:+ tp:A:P cm:i:322 s1:i:1541 s2:i:0 de:f:0.0718 rl:i:0
4caf762c-9527-47b6-aaea-5c0d324eda2f 0 ENSMUNG00000000050|ENSMUNG00000000050.1|testis 587 41 19S27M1I8M1D19M1D37M1I63M27S * 0 0 GAAGTACCATGAAGCTCAGAGACTCTGACACGTCTCTGAGACCTTCTAATGACAATCAGAACGGTAGCCATGCAAGAAGCTTGGTGAAATAATCTCAAAGCAGTCAGCACCAAAAAGTGCAGGTCCAGCGATCTGTCTCCCGAGAAACCATCACTATTCATTTCTCTGCATTTGGAAAAAAAAAAAAAAAAAAAAAAAAAAA 9-C52=3+++)%((($%%',5;64&++.2$$783&($+52)%)%&.,217;<&&>4DC@2288C9<@<41//#$'$#%'897:2@+$$$$.;><+(%%%>9$.0%9IC,45/7502>;**I@583-.((77B2>52%&&-(9>B@E@;2522;IGD4<5/':=A@<45+@@877655555667889987766777 NM:i:14 ms:i:112 AS:i:112 nn:i:0 ts:A:+ tp:A:P cm:i:16 s1:i:73 s2:i:0 de:f:0.0886 rl:i:28

The output: Pool4_merged_t_rimmed_BCconsensus_transcripts.fas:

>02e3c0c0-b1b5-4f5f-8bf5-05f83f97fbd3|330
TTTTTTTTTTTTTCACAGTTAACAAATATTCTTTATTGTCAGGTCTCAAGACATTATCATAATGGACATTTTTGGACTGTATAAAAACTACTTTTAACTC
AGTGTAAAAGCTCCGTTGAATGTATGAATGATAGCTTAAGAAAGTTTAGAGTAGCAGTTATGGAATTCATTCACTTATTTATGAATAAGGTATAACAGGT
ACCATTCATGCTTTGATCCAAGAGCATTTACAGCTTTGTTTTTGACACTGGTTGTGCCTACAGCTTCTGTATCAGAATTGCAGAAGCACCTCCTCCACTC
CATTGCAAATTCCTGCAAGGCCGTATTGTCCTTGTTTCAATGCATGGGCCATGTGAACAACGATTCTGGCTCCAGACATTCCTATAGGATGTCCAAGAGA
GACACACCTCCATTCATGTTTACTTTTTGTGGATCAATACC
>c7a0033e-8bd3-4ea8-8539-8cfe406f915f|85
TTTTTTTTTTTTTTCAAGAACAACTGTTCTTTATTTTATTGACTGGTTGAAGCAGGACTATAAGCCAGGTATATTTCAATCAAGTGTTGGTCCACTCTTA
CCATCAAAAAGAATTTTTTTTTTTTTTTATAATAACATCAACACAAATGGAAGGAATATAAAGCGTCATAATAGGAACTTTCAACTGTACATGATATGAG
ACCATGATCAGACTGGTGCTACTTCAGTATTTATAGACTCTCCACTGTACAGTCCAGCCACACTAGTGTTATTTACCTCCAATCATTCAAGTTCTAGGTA
AAGGATCCTTCTGACTACAGCTCACATCTGAGCCACCAACATGAAATCAAAATGCCATTTGTGCCACTAGCATTGTAGTCTTGTAGAAATATTTCTATAT
TTAGCATATTACTAAAGAATAATTACTATCCTCTCTGAAGTTACATATAGCCATTATAAATATTTACATCAAACATTTACAACTGGTTCATAATATACAA
CACAAGAATTTAGCTATAGTTTCTAATCTTCCAGTGTAAAAGTTTCAAACAACATGTTGCTATCATAATTTCATCTGGTTTGCACACAGTCACAGGCAGT
ACAGGGTATTTCAAAAACTCATGTCACATAAAAAAAAAGGAATGATGTTACTTTAATAACAATTACCTTGGGGATTGATTGTTGTTGGTTTTTGTTGGGG
TTTTTAACAAGCAGTTTTCAATTTCTAAACCCCATCTGATGCTACTACTGACATTTAAAATAAGCTTAAGAGGGAAAAAAATACTTAAAAATAAATACAC
CTTTAAAAAATTCCACAATTAACCATGTCAGAATATTTGTTTTCCAGACAGCTGAAAGGAGACCTTACTCTTCATCATTTCCTTGCACTGAAACAGAGTG
CAGTTCAATCCATTCTTATCTTCTGATTTGTCATCCTATAAATAATGCTGTCGTAGCCAGGATCCATGTTCCAGAGGTTGTAGGAGATAACTATCACAGC
TAAAGCAAGACCTATCATCATCCACAAAATAATGTTGAAGATTACAGAGTAGTTGTAGTTATATGGGTAGGCAAGGTTATATGGATTGTCTGATTCACTC
TGTGAGATTGGAGAATGAGCGAGTCTTCCTTATGTTGGGAGAAATAAACGCCTTTACAGCTACCACTTCTACTACTGCATTCCCACTATACAGATTAAAC
ATCTCATCTGCAAACTTTTGCAAAGAGTCTACAAAATTTGAGAAGCATCCTTGAACTGCTGAGAGTCTTCCCCATATCGTTTTCCAACCTCTTCCAAACC
AGACAGTTCAGAGAATACAGGTCTGGAGAGTGATCTTTGGCTAAGTGCTTGTGTCGAGACAGCAGGCTTGCAATATCATGTAGGACTTGTAGTTCTGACA
GAAAAAGCAAGTCAACCTCATTGTTTCTGCTGAGGGAGTTGAGAGGAAGGGAGCCAAGAATAGAGTTGTCCTGGAATAAGCGGTTTCGCAGTTGGCGCAG
TGTGACAGACAGATCTTCAAATACAGAGTTTGCCTTACCCACCATGTATACCCTTTCCTCACTGGGGGCCAGCTGCAAGACCACAGGAGTCTCCTCAGAG
AACAAAGTATGAATAGCATTTGCAACACTGTCAAGACTGAAAGGAACAGCATTCTCAATAGGGTAAGAAACCCCTTTCACAGGCAGTGCCAGCTTGTCCA
CTCCCTTCACAGTTACCAGCACAGTAGCTCGTGGTCTGTGAAACAGATCACCCACTGCAAGGCCAGGCCAGGAAAGGTCCTCTTCAACAGAAAAGCCCAT
AGACAATGCAGCTACATCTGGGATCCGCTCACCAGGAATGGGCCAACTTCCATCTCGAAAAACAACTGACTGAGGTGATCGTAAGACACTAAATTCATCT
CCACTTACACTGGCAAGACAAGCCGATGTAACCAGCACCGCCACCTCCAGGGACCAGAGCACCCCGCCACGGCACCCCATGTCCGCCGCAGTCCCGGACA
CCGCCGCGCTAAGAGACGCCGCTGGGAAGACGCCGGGCCGCACCGTCAGACCGAGACCACC

I would be very grateful for any help.

pinfish_polish fails when run on BAM generated from fasta file

Hi,
When the BAM file is generated from a fasta rather than a fastq file, pinfish_polish fails because racon exits with code 134. Running the racon command manually gives the following error:

terminate called after throwing an instance of 'std::invalid_argument'
  what():  [bioparser::FastqParser] error: invalid file format!

This is due to the fact that the reads.fq and reference.fq generated by pinfish_polish from the BAM file have strings of spaces (\x20) as phred score lines. The underlying cause for this is that the BAM reader in biogo/hts returns as seq.Qual a byte array filled with 255, which gets encoded by biogo/seqio as spaces (see here).

However, racon does support fasta files. I've addressed the issue by modifying pinfish_polish so that it generates fasta files rather than fastq when needed. I'll open a pull request shortly.
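The fallback described in this issue can be sketched as follows. This is only an illustration in Python, not the code from the pull request: the names `has_qualities` and `format_reads` are hypothetical, and the sketch assumes each read arrives as a (name, sequence, raw quality bytes) tuple, with 0xFF in every position meaning "no qualities stored", as the BAM format defines it.

```python
MISSING_QUAL = 0xFF  # BAM stores 0xFF for every base when qualities are absent


def has_qualities(qual: bytes) -> bool:
    """Return True if the record carries at least one real Phred score."""
    return len(qual) > 0 and any(q != MISSING_QUAL for q in qual)


def format_reads(records) -> str:
    """Format reads as FASTQ if every read has qualities, otherwise as FASTA.

    `records` is an iterable of (name, sequence, quality_bytes) tuples.
    The choice is made once for the whole batch so the output never mixes
    the two formats.
    """
    records = list(records)
    use_fastq = all(has_qualities(qual) for _, _, qual in records)
    chunks = []
    for name, seq, qual in records:
        if use_fastq:
            # Phred+33 encoding; cap at 93 so the character stays printable.
            qual_line = "".join(chr(min(q, 93) + 33) for q in qual)
            chunks.append(f"@{name}\n{seq}\n+\n{qual_line}\n")
        else:
            chunks.append(f">{name}\n{seq}\n")
    return "".join(chunks)
```

Deciding per batch rather than per read is a design choice of this sketch: a file that mixes FASTA and FASTQ records would most likely be rejected by a parser just like the space-filled quality lines are.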

error from polish_clusters

Hi,
I am using polish_clusters to correct my nanopore data and get a consensus fasta,
but I encounter an error.
My command line:
/polish_clusters -a data.clusters.tsv -o data.consensus.fasta -t 10 -c 5 -d ./data/temp data.sorted.bam
why?
Hope to get your reply.
Thanks

the same full-length sequences in a different order give a different result

Hi,
I have a fasta file that contains some full-length sequences. When I changed the order of the sequences in the file, I got a different clusters.tsv file. The command lines were as follows:
spliced_bam2gff -s -M CON1.sorted.bam > CON1.raw_transcripts.gff
cluster_gff -c 10 -a CON1.clusters.tsv CON1.raw_transcripts.gff > CON1.clustered_transcripts.gff

I don't know why this happens. Could you help me?
If you need my test file, I can provide it.

quality assessment of polished data

Hi @bsipos,

I have just polished my data with polish_clusters and I would like to check their quality, e.g. the error percentage and Q score. However, I have six different species and a genome is available for only one of them. For that species, I used the assess_assembly script from Pomoxis, and Quast performs a similar task, but both tools require a reference genome and are primarily designed for genomic data. Is there any way to calculate the error rate from transcriptomic data without a reference?

I would be very grateful for any recommendations.
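Where an error rate is available (for example for the species that has a genome), it can be converted to a Q score with the standard Phred relation Q = -10 * log10(p). The snippet below is a minimal sketch of that conversion only; the function name `phred_q` is hypothetical, and it does not address the reference-free case asked about above. As sample input it uses 0.0718, the value of the `de:f:` tag (minimap2's gap-compressed per-base divergence) in one of the SAM records quoted earlier on this page.

```python
import math


def phred_q(error_rate: float) -> float:
    """Convert a per-base error rate (0 < p <= 1) to a Phred-scaled Q score."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10.0 * math.log10(error_rate)


if __name__ == "__main__":
    # Sample value taken from a de:f: tag; any per-base error rate works here.
    print(f"Q = {phred_q(0.0718):.1f}")
```

An error rate of 1% corresponds to Q20 and 0.1% to Q30, so a divergence of about 7% lands a little above Q11.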

cluster_gff not multithreading

I'm playing around with this pipeline, trying to implement it in a resource-optimised SGE pipeline (which I will gladly share once completed).
It seems that cluster_gff is only using one thread for the clustering bit.
Monitoring the job indicates it is mostly running with 1 thread (~100%), and I could see it go to ~120% when reading/writing to NVMe drives.
Could the parallelisation be implemented only for read/write operations? Apologies, I don't speak Go.

errors from polish_clusters

Hello,
I am employing pinfish to analyze Nanopore cDNA data. The tool polish_clusters returned an error:

polish_clusters: 12:56:31 Polishing cluster 70ac1799-5f4b-411b-86be-57459f716c3e of size 101
polish_clusters: 12:56:31 Failed running command: racon -t 40 -q -1 /tmp/pinfish_70ac1799-5f4b-411b-86be-57459f716c3e_960787481/reads.fq /tmp/pinfish_70ac1799-5f4b-411b-86be-57459f716c3e_960787481/alignments.sam /tmp/pinfish_70ac1799-5f4b-411b-86be-57459f716c3e_960787481/reference.fq > /tmp/pinfish_70ac1799-5f4b-411b-86be-57459f716c3e_960787481/consensus.fq - exit status 1

Here are the commands I used:
#alignment
minimap2 -t 24 -a -x splice ref.mmi pass.fq > aln.sam

#conversion
samtools sort -@ 24 -o aln_sorted.bam -O bam aln.sam

#prepare annotation
spliced_bam2gff -s aln_sorted.bam > raw_transcript.gff

cluster_gff -a cluster.tsv raw_transcript.gff > clustered_transcript.gff

polish_clusters -a cluster.tsv -c 50 -o consensus_transcripts.fas -t 40 aln_sorted.bam
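The polish_clusters log line only reports the exit status, so racon's own error message is lost. One way to get at it, as another issue on this page did, is to rerun the failing racon command by hand. The sketch below pulls the command and exit status out of a "Failed running command" line shaped like the one quoted above. The function names and the regular expression are assumptions based on that single log line, not on the pinfish source.

```python
import re
import shlex

# Pattern inferred from the log line quoted in this issue (an assumption).
LOG_PATTERN = re.compile(
    r"Failed running command: (?P<cmd>.+) - exit status (?P<status>\d+)\s*$"
)


def parse_failed_command(log_line: str):
    """Return (command_string, exit_status), or None if the line does not match."""
    match = LOG_PATTERN.search(log_line)
    if match is None:
        return None
    return match.group("cmd"), int(match.group("status"))


def split_redirect(cmd: str):
    """Split 'prog args > outfile' into (argv_list, outfile_or_None)."""
    argv = shlex.split(cmd)
    if ">" in argv:
        i = argv.index(">")
        outfile = argv[i + 1] if i + 1 < len(argv) else None
        return argv[:i], outfile
    return argv, None
```

With the argument list in hand, something like `subprocess.run(argv, capture_output=True, text=True)` would expose racon's stderr. That only works if the temporary directory named in the command still exists, which this page does not confirm.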

polish result seems quite strange

Dear @bsipos

I have run polish_clusters and collapse_partials on the clustered reads and observed somewhat unexpected results.
In my understanding, polish_clusters is intended to correct only the clustered reads.
But I found that there were more reads in the polish_clusters result than in the cluster_gff result.
I found this quite strange. The only difference in my run was the input BAM file: it is the BAM file used for GFF generation, but with only the primary reads extracted (the polish step did not work with the whole sorted BAM file).

I ran the collapse step as well, but the output did not look like it had been polished.
If you could give me any comment on this issue, I would really appreciate it.

Jungwoo
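For readers wondering what "only the primary reads were extracted" involves: in the SAM format, secondary and supplementary alignments are marked by the FLAG bits 0x100 and 0x800, and unmapped reads by 0x4. The following is a minimal sketch of such a filter on SAM text lines. The function names are hypothetical, and this is not necessarily how the poster produced their file.

```python
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def is_primary(flag: int) -> bool:
    """True for a mapped alignment that is neither secondary nor supplementary."""
    return (flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY)) == 0


def primary_sam_lines(lines):
    """Yield header lines unchanged and alignment lines that are primary."""
    for line in lines:
        if line.startswith("@"):
            yield line  # header line, always kept
            continue
        fields = line.rstrip("\n").split("\t")
        # A valid SAM alignment line has at least 11 mandatory fields.
        if len(fields) >= 11 and is_primary(int(fields[1])):
            yield line
```

The same selection is usually done directly on the BAM with `samtools view -b -F 0x904`, which excludes exactly these three flag bits.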

error from spliced_bam2gff

Hi
when I use the command:
./spliced_bam2gff -s -t 10 -M data.sorted.bam >data.raw_transcripts.gff
I get an error:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x4cd426]

goroutine 1 [running]:
main.SplicedBam2GFF(0x7fff21087b9a, 0x57, 0x52a9e0, 0xc420108008, 0xa, 0xc420108001, 0x2)
/home/OXFORDNANOLABS/bsipos/gt/pinfish/spliced_bam2gff/bam2gff.go:29 +0xa6
main.main()
/home/OXFORDNANOLABS/bsipos/gt/pinfish/spliced_bam2gff/main.go:22 +0x146
why?
Hope to get your reply!
Thanks
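The trace does not say what was nil. One thing worth ruling out before digging further is that the input path does not point to a readable BAM file. This is an assumption about a possible cause, not a diagnosis. The sketch below checks that a file exists, is gzip/BGZF-compressed, and starts with the BAM magic bytes `BAM\x01` once decompressed; the function name `looks_like_bam` is hypothetical.

```python
import gzip
import os
import zlib

BAM_MAGIC = b"BAM\x01"  # first four bytes of a decompressed BAM stream


def looks_like_bam(path: str) -> bool:
    """Cheap pre-flight check: does `path` look like a BAM file?

    This validates only the compression wrapper and the magic bytes,
    not the header or the records.
    """
    if not os.path.isfile(path):
        return False
    try:
        # BGZF is a series of gzip members, so the gzip module can read it.
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError, zlib.error):
        # Not gzip at all, truncated, or otherwise unreadable.
        return False
```

If this returns False for the file passed to spliced_bam2gff, the path or the file is the first thing to fix. If it returns True, the cause lies elsewhere.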

pinfish - amplicon clustering

I'm interested in using pinfish for non-reference amplicon clustering. Is it possible?

Many Thanks,
Azita
