Comments (9)

lbarquist commented on June 14, 2024

Hi Irilenia,

Thanks for this -- could you pass on a mini fastq file containing some of these problematic reads so we can use them for debugging?

-Lars

from bio-tradis.

irilenia commented on June 14, 2024

irilenia commented on June 14, 2024

LCrossman commented on June 14, 2024

Hello, we are also seeing a set of sample files of different lengths that are longer than they should be, and this seems to be the likely cause. Thanks for looking into it.
Best wishes,
Lisa

martinclott commented on June 14, 2024

We see the same problem. Looking through the code, it appears that CIGAR.pm, which parses the CIGAR string reported by the mapping software, is incorrect. My understanding is that the original parser was incorrect; it was changed a few months ago to fix the soft-clipping issue, but that change makes the software incorrect for other CIGAR characters, since they were erroneously grouped with the soft-clipping character in the same 'if' statement.

According to page 8 of the SAM format specification (https://samtools.github.io/hts-specs/SAMv1.pdf), only the operations {M, D, N, =, X} consume the reference, so we should only advance the coordinate upon encountering those characters.

I'm working on the software for the Quadram Institute. I have changed the 'if' statements in the CIGAR parser to:

    if ($action eq 'M' || $action eq 'D' || $action eq 'N' || $action eq '=' || $action eq 'X') {
        # Only reference-consuming operations set the start and advance the coordinate.
        $results{start} = $current_coordinate - 1 if ($results{start} == -1);
        $current_coordinate += $number;
        $results{end} = $current_coordinate - 1;
    }

We are making other changes as well, and this correction should be pushed alongside them in due course.
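
For readers who want to check the rule quoted from the SAM specification, here is a minimal standalone sketch in Python (not the Bio-Tradis Perl code). The function name `reference_span` and the 1-based inclusive coordinate convention are assumptions made for illustration; the snippet above may use a different offset.

```python
import re

# Operations that consume reference bases according to the SAM specification.
REF_CONSUMING = set("MDN=X")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(pos, cigar):
    """Return the 1-based inclusive (start, end) covered on the reference.

    pos is the 1-based leftmost mapping position (the SAM POS field).
    Soft clips (S), hard clips (H), insertions (I) and padding (P) are
    skipped because they do not consume the reference.
    """
    ref_len = sum(
        int(count)
        for count, op in CIGAR_OP.findall(cigar)
        if op in REF_CONSUMING
    )
    return pos, pos + ref_len - 1


# 20 + 10 + 3 + 8 = 41 reference bases, so the alignment ends at 140.
print(reference_span(100, "5S20M2I10M3D8M"))  # (100, 140)
```

In this sketch the leading 5S and the 2I contribute nothing to the reference span, which is the behaviour the specification describes for a plain alignment.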

lbarquist commented on June 14, 2024

Hi Martin,

Do you mean that you want to remove any handling of soft-clipped bases? If so, I think this would just reintroduce the bug I fixed in issue #120: soft-clipped bases at the beginning and end of reads need to consume bases in the reference, because TraDIS primers are designed so that the first base corresponds to the insertion site. If there's a base-calling error early in the read, this leads to soft-clipping and a miscalled insertion site, as @irilenia showed in that issue report. I have similar data showing that this is a common problem that shouldn't be ignored. It seems to be a bigger problem with the switch to bwa, as the default smalt parameters were such that reads with more than one or two mismatches were generally tossed.

I think to solve the current issue properly, we would need to track the end position of the chromosome, and forbid insertion sites that extend beyond that -- I haven't looked to see how difficult this would be.

-Lars
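
The bounds check suggested in the comment above could look roughly like the following Python sketch. It is only an illustration of the idea, not code from Bio-Tradis: the names `within_reference` and `filter_sites` are hypothetical, coordinates are assumed to be 1-based, and the reference length is assumed to be known in advance (for example from the `LN` field of an `@SQ` line in the SAM header).

```python
def within_reference(site, ref_length):
    """True if a 1-based insertion-site coordinate lies on the reference."""
    return 1 <= site <= ref_length


def filter_sites(sites, ref_length):
    """Split candidate sites into (kept, rejected) lists, preserving order."""
    kept = [s for s in sites if within_reference(s, ref_length)]
    rejected = [s for s in sites if not within_reference(s, ref_length)]
    return kept, rejected


# Hypothetical 1000 bp replicon: 0 and 1003 fall off either end.
kept, rejected = filter_sites([0, 1, 500, 1000, 1003], ref_length=1000)
print(kept)      # [1, 500, 1000]
print(rejected)  # [0, 1003]
```

Whether out-of-range sites should be dropped, clamped to the nearest end, or wrapped around for circular replicons is a design decision the thread leaves open; this sketch simply reports them separately.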

martinclott commented on June 14, 2024

lbarquist commented on June 14, 2024

Hi Martin,

I agree the X and D probably should not be grouped with soft-clipping in terms of extending the coordinates at the ends, but I don't believe this is the problem either here or with issue #120.

If you read #120, you'll see that soft-clipping has to be considered to get accurate insertion sites. Basically, the extra bases from the soft clipping need to be appended to the edge of the read first when calculating the start site; otherwise, mismatches near the 5' end of the read will lead to a shift away from the true insert site at the read start. Once this has been done, S needs to consume bases in the reference, as you'll end up with an incorrect alignment stop site otherwise.
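
As a worked illustration of the adjustment described above, the Python sketch below extends an alignment by its leading and trailing soft clips. It is one reading of the behaviour described in this comment, not the Bio-Tradis implementation; the function name `clip_extended_span` and the 1-based inclusive coordinates are assumptions.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clip_extended_span(pos, cigar):
    """Return (start, end) on the reference, extended by terminal soft clips.

    pos is the 1-based leftmost aligned position (the SAM POS field).
    Hard clips are dropped first, since they may flank the soft clips.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    lead = ops[0][0] if ops and ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    # Reference-consuming operations give the aligned length.
    aligned = sum(n for n, op in ops if op in "MDN=X")
    start = pos - lead                  # restore the clipped 5' bases
    end = pos + aligned - 1 + trail     # let trailing S consume reference too
    return start, end


# A read that truly starts at 100 but has its first 3 bases soft-clipped is
# reported at POS 103; the extension recovers 100 as the start.
print(clip_extended_span(103, "3S47M"))  # (100, 149)
```

Note that for a read aligned close to the start of the reference the same arithmetic can produce a start below 1, for example `clip_extended_span(2, "5S45M")` gives `(-3, 46)`.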

Some of the current issue may be related to the current handling of soft-clipping, but I don't think it was entirely created by my updated handling, and I suspect it's really an issue with the padding in InsertSite.pm. For example, this problem seems to predate my fix for soft-clipping; see issue #86. My guess is that the start and end of the genome aren't being tracked properly, and this probably interacts poorly with anything that modifies the alignment coordinates, particularly when you've got split alignments.

-Lars

martinclott commented on June 14, 2024
