dep2rel's People

Contributors

tuh8888

dep2rel's Issues

Coding: Connect Knowtator

Needed Scripts:

Inputs:

  • Dependency parses (CoNLL files)
  • Annotation files

Action:

  • Read into Knowtator
  • Display structure annotations
  • Structure interactivity
  • Interact with relation extraction results within Knowtator

Coding: Parameter maximization

Needed Scripts:
On a few test sets of known relations, try different combinations of parameters, and evaluate the performance. Determine the best range of parameter values.

Inputs:

  • Current CRAFT relation annotations
  • BioCreative 2017 PPI relations

Parameters:

  • Seed similarity threshold
  • Cluster similarity threshold
  • Context similarity threshold
  • Minimum support
  • Seed sets (possibly found using #8)

Action:

  • Iterate over range of thresholds
  • Evaluate performance using precision, recall, and F1 score
  • Find groups of sentences not being found

Output:

  • Evaluation results along with parameters used (as csv or edn)
  • Suggested best parameter values
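The sweep described above could be sketched roughly as follows. This is a minimal sketch, not the project's implementation: `extract` is a hypothetical stand-in for the relation extractor (it takes a dict of parameters and returns a set of predicted relations), and the parameter names, toy data, and CSV layout are assumptions made for illustration.

```python
import csv
import io
import itertools


def precision_recall_f1(predicted, gold):
    """Score a set of predicted relations against a set of gold relations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def grid_search(extract, gold, grid):
    """Evaluate `extract` on every combination of parameter values in `grid`.

    `extract` is a hypothetical callable: dict of parameters -> set of relations.
    """
    names = sorted(grid)
    rows = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        precision, recall, f1 = precision_recall_f1(extract(params), gold)
        rows.append({**params, "precision": precision, "recall": recall, "f1": f1})
    return rows


def to_csv(rows):
    """Serialize the evaluation rows (parameters plus scores) as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Toy data standing in for a real test set of known relations.
GOLD = {("chemX", "protA"), ("chemY", "protB")}
CANDIDATES = [
    (("chemX", "protA"), 0.9),
    (("chemY", "protB"), 0.6),
    (("chemX", "protB"), 0.4),
]


def toy_extract(params):
    """Toy extractor: keeps candidates scoring at or above the context threshold.

    It ignores min_support, which is only in the grid to show a two-parameter sweep.
    """
    return {pair for pair, score in CANDIDATES if score >= params["context_threshold"]}


GRID = {"context_threshold": [0.3, 0.5, 0.8], "min_support": [1, 2]}
RESULTS = grid_search(toy_extract, GOLD, GRID)
BEST = max(RESULTS, key=lambda row: row["f1"])
CSV_TEXT = to_csv(RESULTS)
```

Each row keeps the parameters next to the scores, so the same rows can be written out and also sorted to suggest a best range of values.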

Coding: BioCreative 2017 Track 5 relation extraction and evaluation

BioCreative 2017 had a subtask (Track 5) involving extracting chemical-protein interactions from text.

Needed Scripts: Write a script that does the following:

  • Parses BioCreative inputs
  • Extracts chemical-protein relations
  • Outputs to BioCreative output
  • Documents process and performance

Inputs:

  • BioCreative articles
  • Seeds #8

Action:

  • Parses BioCreative inputs

Output: what is the desired output?

  • Extracted relations
  • Model performance #9

Wiki: Potential Applications

Wiki Page:

  • Applications

Suggestions:

  • Shared task: does a sentence express a single relationship?
  • What are all of the relationships between two entities? Cluster them
  • Finding specific examples of general examples. How to pick seeds for this?

Coding: Find good seed sentences

Needed Scripts:
Determines good seeds.
Possible approaches:

  • clustering
  • confidence

Inputs:

  • Known seeds that imply a relation

Action:

  • Find a set of good sentences to use as seeds

Output:

  • Set of seed sentences
  • Seed confidences

Project Meeting -- 04/17/2019 @ 13:30

Meeting Date: April 17th, 2019
Topic: discuss current CRAFT and BioCreative Results
Attendees: @LEHunter

Proposed Agenda:

  • Describe evaluation procedure
  • Show CRAFT results (TODO: link to notebook)
  • Show BioCreative Results (TODO: link to notebook)
  • Discuss parameter choices #9
  • Discuss seed choice #8

Alternative Dependency Parses

Currently, I am using the SyntaxNet architecture trained on CRAFT to parse dependency trees. There are some alternative automatic approaches I could try.

  • Default SyntaxNet parser (Parsey McParseface)

Coding: Make CRAFT relation extraction notebook

Inputs:

  • CRAFT v3.1
  • Dependency parses
  • Knowtator annotation files

Action:

  • Parses sentences to get relations
    • Using naive method (find occurrences of seed entity pairs)
    • Using less naive method (find occurrences of sentences containing seed entity pairs that are similar to seed sentences)
    • Using bootstrap method
  • Evaluates performance using relations annotated by Mike Bada
  • Demonstrates usage of key functions

Output: what is the desired output?

  • Notebook describing
    • usage of key functions
    • the process of going from concept annotations and dependency parses to extracted relations
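For the notebook, the naive and less naive methods listed under Action might look something like the sketch below. Everything here is an assumption for illustration: the sentence dicts, the toy data, and especially the token-level Jaccard similarity, which is only a stand-in for whatever sentence similarity the project actually uses.

```python
def jaccard(tokens_a, tokens_b):
    """Token-set overlap; a stand-in for the real sentence similarity."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def naive_matches(sentences, seed_pairs):
    """Naive method: keep sentences that mention both entities of any seed pair."""
    return [
        sentence
        for sentence in sentences
        if any(set(pair) <= set(sentence["entities"]) for pair in seed_pairs)
    ]


def less_naive_matches(sentences, seed_pairs, seed_sentences, threshold, similarity=jaccard):
    """Less naive method: additionally require similarity to some seed sentence."""
    return [
        sentence
        for sentence in naive_matches(sentences, seed_pairs)
        if any(
            similarity(sentence["tokens"], seed["tokens"]) >= threshold
            for seed in seed_sentences
        )
    ]


# Hypothetical toy data.
SEED_PAIRS = [("aspirin", "COX1")]
SEED_SENTENCES = [{"tokens": ["aspirin", "inhibits", "COX1"]}]
SENTENCES = [
    {"id": 1, "entities": ["aspirin", "COX1"],
     "tokens": ["aspirin", "strongly", "inhibits", "COX1"]},
    {"id": 2, "entities": ["aspirin", "COX1"],
     "tokens": ["aspirin", "and", "COX1", "were", "purchased"]},
    {"id": 3, "entities": ["ibuprofen", "COX2"],
     "tokens": ["ibuprofen", "inhibits", "COX2"]},
]

NAIVE_IDS = [s["id"] for s in naive_matches(SENTENCES, SEED_PAIRS)]
LESS_NAIVE_IDS = [
    s["id"] for s in less_naive_matches(SENTENCES, SEED_PAIRS, SEED_SENTENCES, threshold=0.6)
]
```

In this toy data the naive method accepts sentence 2 even though it states no relation, and the similarity filter drops it.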

Coding: Visualize clusters

Needed Scripts:

Create visualizations of the clustering to help determine

  • optimal clustering threshold parameter
  • number of clusters in data (actual true vs actual false)
  • better clustering procedure

Inputs:

  • sentences
  • clustering parameter (adjustable)

Action:

  • performs clustering
  • plots scores and clusters
  • color by various metrics

Output:

  • interactive plot of the clusters

Project Meeting -- 06/10/2019 @ 12:00

Meeting Date: June 10th, 2019
Topic: NLM Training Conference presentation
Attendees: @LEHunter

  • think about drafting this into a paper. What would I need to finish to be "done"?
  • try a chemical tokenizer
  • try using a concept recognizer instead of using entity labels ("chemical", "protein", etc.)
  • evaluate the benefits of increasing the number of seeds
  • make PCA plots
  • first slide should be assertions

Dependency Distance

Problem
When one of the similarity thresholds is relaxed, the bootstrapping process will cascade and grab every possible sentence. This seems to happen when long dependency chains are matched.

These long dependency chains are usually matched because they contain repetitive phrases,
e.g. Chemical X inhibits protein A and chemical Y inhibits protein B -> Chemical X inhibits protein B ❌

According to Mike Bada, usually relations are only implied between entities that are within one or two steps along a dependency path. So it might make sense to enforce a limit on the distance we search along the dependency path.

Possible Solutions

  • Hard limit of one or two steps when generating potential sentences.
  • Penalty on similarity score for longer paths
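Both options could be prototyped along these lines. This is a rough sketch under assumptions: the parse is given as a list of head indices (-1 for the root), distance is counted in edges along the undirected dependency tree, and the geometric `decay` penalty is just one illustrative choice, not something settled in this issue.

```python
from collections import deque


def build_adjacency(heads):
    """Undirected adjacency from head indices; heads[i] == -1 marks the root."""
    adjacency = {index: set() for index in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            adjacency[child].add(head)
            adjacency[head].add(child)
    return adjacency


def dependency_distance(heads, start, end):
    """Number of edges on the dependency path between two tokens (breadth-first search)."""
    adjacency = build_adjacency(heads)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == end:
            return distance
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # tokens are not connected


def within_limit(heads, start, end, max_steps=2):
    """Option 1: hard limit on the dependency distance between two entities."""
    distance = dependency_distance(heads, start, end)
    return distance is not None and distance <= max_steps


def penalized_similarity(similarity, distance, decay=0.8):
    """Option 2: shrink the similarity score for every step beyond the first.

    The geometric decay is an assumed penalty shape, chosen only for illustration.
    """
    return similarity * decay ** max(distance - 1, 0)


# "ChemX inhibits ProtA and ChemY inhibits ProtB", with a hand-written parse:
# token:   0 ChemX  1 inhibits  2 ProtA  3 and  4 ChemY  5 inhibits  6 ProtB
HEADS = [1, -1, 1, 5, 5, 1, 5]

X_TO_A = dependency_distance(HEADS, 0, 2)  # ChemX - inhibits - ProtA
X_TO_B = dependency_distance(HEADS, 0, 6)  # ChemX - inhibits - inhibits - ProtB
```

With this parse, ChemX and ProtA are two steps apart and pass a limit of two, while the spurious ChemX and ProtB pair is three steps apart and is rejected.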

Project Meeting -- 06/12/2019 @ 13:30

Meeting Date: June 12th, 2019
Topic: Show results
Attendees: @LEHunter

  • adding negative examples to clustering
  • seem to have resolved memory leaks
  • look at PCA plot
  • look at evaluation metrics plot

Coding: re-evaluate algorithm with regard to seed similarity

Currently the algorithm has two filtering steps for the sentences to become seeds.

  1. First it finds sentences similar to the existing seeds and adds them to the seeds.
  2. The expanded set of seeds is clustered to form patterns. The second filtering step then finds sentences similar to the patterns and adds them to the seeds.

I think the first step needs to be removed because it violates one of the goals of the algorithm, which is to develop context patterns and use them to refine themselves.

I propose either removing this step or replacing it with a step that filters by the concept type of the entities in the current pattern.

  • remove seed similarity step

Coding: How to match a sentence to a pattern?

I noticed that I was able to get an F1 score of 0.2 over BioCreative VI.4 with only 5 seeds. Because each seed got its own cluster, the matching stage only involved finding similarity to one pattern that had no "dilution" due to being combined with other patterns.

This makes me think that the summing of the context vectors that make up a pattern is somehow diluting the underlying sentences too much, so they don't match as well to the next round of samples.

Maybe I should use the context vector for each sentence in the patterns on its own while looking for matches. The original BREDS did this and just checked if the good-bad ratio was greater than 1.
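A toy comparison of the two matching strategies shows the dilution effect. This is only a sketch under assumptions: context vectors are plain lists of floats, similarity is cosine, and the vectors are made up so that the members of the pattern have nothing in common. It does not reproduce the BREDS scoring.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def summed_similarity(candidate, pattern):
    """Match against the sum of the pattern's context vectors."""
    summed = [sum(components) for components in zip(*pattern)]
    return cosine(candidate, summed)


def best_member_similarity(candidate, pattern):
    """Match against each sentence's context vector on its own and keep the best."""
    return max(cosine(candidate, member) for member in pattern)


# Hypothetical pattern of three unrelated context vectors.
PATTERN = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
CANDIDATE = [1.0, 0.0, 0.0]  # identical to the first member

SUMMED = summed_similarity(CANDIDATE, PATTERN)      # 1 / sqrt(3)
PER_SENTENCE = best_member_similarity(CANDIDATE, PATTERN)
```

Here the candidate is identical to one member of the pattern, yet its similarity to the summed vector is only about 0.58, so a threshold of 0.7 would reject it. Matching per sentence gives 1.0.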
