Giter VIP home page Giter VIP logo

goldstandard's Introduction

Gold Standard Analysis

  • This repository contains semantic similarity, Partial Precision, and Partial Recall results between curator-gold standard comparisons.

  • Curators were assigned IDs for the Naive (WD_38484, AD_40674, NI_40676) and Knowledge rounds (NI_40716, WD_40717, AD_40718). IDs with the same two letter prefix belong to the same curator; hence there were 3 curators in total. Gold Standard was assigned "GS" as the ID.

  • Annotation rounds are indicated by "NR"= Naive round or "KR" = Knowledge round.

  • The following curator-GS comparisons were conducted :

    • Naive Round:

        38484 -- GS
        40674 -- GS
        38484 -- GS
      
    • Knowledge Round:

        40717 -- GS
        40718 -- GS
        40717 -- GS
      
    • For each curator-GS pair, pairwise comparisons are made between the set of EQ annotations for a particular character-state combination to the set of EQs for the same character-state combination. The best match (highest similarity) from these pairwise comparisons is the score associated to the character-state combination.

  • The following GS-charaparser comparisons were conducted for CharaParser:

    • Naive Round:

        GS -- CP_38484
        GS -- CP_40674
        GS -- CP_40676
      
    • Knowledge Round:

        GS -- CP_40716
        GS -- CP_40717
        GS -- CP_40718
      
    • Charaparser Best:

        GS -- CP_best
      

Similarity Metrics Used:

  • The results contain scores for the SimJ, NIC, Partial Precision, and Partial Recall metrics. The SimJ score for two EQ statements, EQ1 and EQ2 is normally calculated as
    (Number of Common Subsumers for EQ1, EQ2 (Intersection) / Number of All Subsumers of EQ1, EQ2 (Union))
    • The SimJ score lies in the range of [0,1]
    • If EQ1 and EQ2 are exactly the same, the SimJ score is 1
    • SimJ is not based on Information Content The IC score for EQ1,EQ2 is the IC of their least common subsumer. The IC of a term t, is calculated as -log(freq(t)/corpussize). The corpus is the combination of all the curator and charaparser annotations, all ancestors of every annotation. The corpus size is the number of annotations in the corpus. IC scores do not have a fixed range unlike SimJ scores. In order to force the IC scores to lie in the [0,1] range, the IC scores are divided by the maximum possible IC for this dataset resulting in a normalized IC score (NIC). The maximum IC is the score when the frequency of a term is 1. The mean and Median MaxNIC scores reported are the mean and median normalized IC scores. Partial Precision and Partial Recall are computed using the SimJ scores per EQ for each character state.

Ontology Used in the Analysis

  • The ontology used for this analysis is ./Ontologies/MergedOntology_GS_Relations.owl.

Pipeline

  • All the commands (in order) used to generate results for the manuscript are listed in ./src/run-goldstandard-analysis.sh

goldstandard's People

Contributors

prashanti avatar

Watchers

James Cloos avatar  avatar

Forkers

phenoscape

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.