tuh8888 / dep2rel Goto Github PK

Needed Scripts:
On a few test sets of known relations, try different combinations of parameters, and evaluate the performance. Determine the best range of parameter values.

Inputs:

Current CRAFT relation annotations
BioCreative 2017 PPI relations
Parameters
Seed similarity threshold
Cluster similarity threshold
Context similarity threshold
Minimum support
Seed sets (possibly found using #8)

Action:

Iterate over range of thresholds
Evaluate performance using precision, recall, and F1 score
Find groups of sentences not being found

Output:

Evaluation results along with parameters used (as csv or edn)
Suggested best parameter values
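
The sweep described above could be sketched roughly as follows. This is a minimal illustration and not the project's implementation: `toy_extract`, `CANDIDATES`, `GOLD`, and the grid values are hypothetical stand-ins, and the real extractor would take all four thresholds listed under Parameters.

```python
import csv
import io
import itertools


def precision_recall_f1(predicted, gold):
    """Score a set of predicted relations against a set of gold relations."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def sweep(extract, gold, grid):
    """Run `extract` once per parameter combination and record the scores."""
    names = sorted(grid)
    rows = []
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        precision, recall, f1 = precision_recall_f1(extract(**params), gold)
        rows.append({**params, "precision": precision, "recall": recall, "f1": f1})
    return rows


def to_csv(rows):
    """Serialize the result rows (parameters plus scores) as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical toy data: candidate pair -> (similarity score, support).
CANDIDATES = {
    ("aspirin", "COX1"): (0.9, 3),
    ("aspirin", "COX2"): (0.7, 1),
    ("caffeine", "COX1"): (0.4, 2),
}
GOLD = {("aspirin", "COX1"), ("aspirin", "COX2")}


def toy_extract(context_threshold, min_support):
    """Stand-in extractor: keep candidates that pass both thresholds."""
    return {
        pair
        for pair, (score, support) in CANDIDATES.items()
        if score >= context_threshold and support >= min_support
    }


rows = sweep(toy_extract, GOLD, {"context_threshold": [0.3, 0.6, 0.8], "min_support": [1, 2]})
best = max(rows, key=lambda row: row["f1"])
print(to_csv(rows))
print("best:", best)
```

Sorting the parameter names keeps the CSV columns stable between runs, which makes result files from different test sets easier to compare.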

Coding: BioCreative 2017 Track 5 relation extraction and evaluation

BioCreative 2017 had a subtask (Track 5) involving extracting chemical-protein interactions from text.

Needed Scripts: Write a script that does the following:

Parses BioCreative inputs
Extracts chemical-protein relations
Outputs to BioCreative output
Documents process and performance

Inputs:

BioCreative articles
Seeds #8

Action:

Parses BioCreative inputs

Output: what is the desired output?

Extracted relations
Model performance #9

Wiki: Potential Applications

Wiki Page:

Applications

Suggestions:

Shared task, does sentence express a single relationship
What are all of the relationships between two entities? Cluster them
Finding specific examples of general examples. How to pick seeds for this?

Allow patterns to exist for a certain number of iterations before discarding based on min-support


Originally posted by @tuh8888 in #14 (comment)
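
One way to read this suggestion is as a grace period: a pattern is only dropped for low support once it has survived a fixed number of iterations. The sketch below assumes that reading. The `Pattern` class, `prune`, and the `grace_iterations` parameter are hypothetical names for illustration, not part of dep2rel.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    name: str
    support: int = 0  # number of sentences currently matched by the pattern
    age: int = 0      # number of bootstrap iterations the pattern has existed


def prune(patterns, min_support, grace_iterations):
    """Keep patterns that meet min_support or are still inside the grace period."""
    kept = []
    for pattern in patterns:
        pattern.age += 1
        if pattern.support >= min_support or pattern.age <= grace_iterations:
            kept.append(pattern)
    return kept


patterns = [Pattern("inhibits", support=5), Pattern("binds", support=1)]
history = []
for _ in range(3):
    patterns = prune(patterns, min_support=2, grace_iterations=2)
    history.append([pattern.name for pattern in patterns])
print(history)
```

In a real bootstrap loop the support counts would be updated between iterations, so a young pattern gets a chance to accumulate matches before the minimum-support check applies to it.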

Coding: Find good seed sentences

Needed Scripts:
Determines good seeds.
Possible approaches:

clustering
confidence

Inputs:

Known seeds that imply a relation
Action:
Find a set of good sentences to use as seeds
Output:
Set of seed sentences
Seed confidences

Project Meeting -- 04/17/2019 @ 13:30

Meeting Date: April 17th, 2019
Topic: discuss current CRAFT and BioCreative Results
Attendees: @LEHunter

Proposed Agenda:

Describe evaluation procedure
Show CRAFT results (TODO: link to notebook)
Show BioCreative Results (TODO: link to notebook)
Discuss parameter choices #9
Discuss seed choice #8

Project Meeting -- 06/21/2019 @ 12:00

Meeting Date: June 21st, 2019
Topic: Results, paper reqs, and improvement ideas
Attendees: @LEHunter

Proposed Agenda:

Best results (#3)
The various matching fns (#27)
Seed/pattern selection
Paper requirements (#26)

TODO - Project Organization: invite collaborators

Task:

Invite initial collaborators.
Make sure Abra-Collaboratory framework properly implemented

Wiki: Abstract

Wiki Page:

Abstract

Suggestions:

Initial abstract

Redo building SyntaxNet with new CRAFT CoNLLs


Originally posted by @tuh8888 in #24 (comment)

Alternative Dependency Parses

Currently, I am using the SyntaxNet architecture trained on CRAFT to parse dependency trees. There are some alternative automatic approaches I could try.

Default SyntaxNet parser (Parsey McParseface)

Try using rule based method over PubMed to find seeds


Originally posted by @tuh8888 in #14 (comment)

TODO - Pubs+Presentation Task: ICBO

Pub or Presentation: Relation extraction

Submission Site: https://sites.google.com/view/icbo2019

Description:

Due Date: May 1, 2019; April 15, 2019

TODO - Pubs+Presentation Task: NLM 2019

Pub or Presentation: Presentation for NLM Informatics Training Conference 2019

Description:

Make presentation
~~- [ ] Make poster~~

Due Date: June 24

Coding: Make CRAFT relation extraction notebook

Inputs:

CRAFT v3.1
Dependency parses
Knowtator annotation files

Action:

Parses sentences to get relations
Using naive method (find occurrences of seed entity pairs)
Using less naive method (find occurrences of sentences containing seed entity pairs that are similar to seed sentences)
Using bootstrap method
Evaluates performance using relations annotated by Mike Bada
Demonstrates usage of key functions
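
As an illustration of the naive method listed above, the following sketch returns every sentence that mentions both entities of a seed pair. The data layout (sentence id mapped to a set of entity names) and the function name `naive_matches` are assumptions made for this example; the notebook itself would work from the Knowtator annotations and dependency parses.

```python
def naive_matches(sentences, seed_pairs):
    """Return (sentence id, pair) for each sentence containing both entities of a seed pair."""
    matches = []
    for sentence_id, entities in sentences.items():
        for pair in seed_pairs:
            if set(pair) <= entities:
                matches.append((sentence_id, pair))
    return matches


# Hypothetical toy input: sentence id -> entities annotated in that sentence.
sentences = {
    "s1": {"aspirin", "COX1"},
    "s2": {"aspirin", "COX2"},
    "s3": {"COX1"},
}
seed_pairs = [("aspirin", "COX1")]

print(naive_matches(sentences, seed_pairs))
```

The less naive variant would add a similarity check between each matched sentence and the seed sentences before accepting it.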

Output: what is the desired output?

Notebook describing
usage of key functions
process of going from concept annotations and dependency parses to extracted relations

Use GPT-2 as language model for coordinating conjunction separation


Check out these other two language models suggested by Negacy

Originally posted by @tuh8888 in #14 (comment)

Coding: Visualize clusters

Needed Scripts:

Create visualizations of the clustering to help determine

optimal clustering threshold parameter
number of clusters in data (actual true vs actual false)
better clustering procedure

Inputs:

sentences
clustering parameter (adjustable)

Action:

performs clustering
plots scores and clusters
color by various metrics

Output:

interactive plot of the clusters

Project Meeting -- 06/10/2019 @ 12:00

Meeting Date: June 10th, 2019
Topic: NLM Training Conference presentation
Attendees: @LEHunter

think about drafting this into a paper. What would I need to finish to be "done"?
try a chemical tokenizer
try using a concept recognizer instead of using entity labels ("chemical", "protein", etc.)
evaluate the benefits of increasing the number of seeds
make PCA plots
first slide should be assertions

Dependency Distance

Problem
When one of the similarity thresholds is relaxed, the bootstrapping process will cascade and grab every possible sentence. This seems to happen when long dependency chains are matched.

These long dependency chains are usually matched because they contain repetitive phrases
e.g. Chemical X inhibits protein A and chemical Y inhibits protein B -> Chemical X inhibits protein B ❌

According to Mike Bada, usually relations are only implied between entities that are within one or two steps along a dependency path. So it might make sense to enforce a limit on the distance we search along the dependency path.

Possible Solutions

Hard limit of one or two steps when generating potential sentences.
Penalty on similarity score for longer paths
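
Both options can be sketched on a toy dependency tree. Everything below is illustrative: the tree is a hand-built approximation of the example sentence (not real parser output), and `dependency_distance`, `penalized_similarity`, and the `decay` factor are hypothetical names and values chosen for the example.

```python
def ancestors(token, heads):
    """Return the chain from `token` up to the root, starting with the token itself."""
    chain = [token]
    while heads.get(chain[-1]) is not None:
        chain.append(heads[chain[-1]])
    return chain


def dependency_distance(a, b, heads):
    """Count the edges on the path between tokens a and b in the dependency tree."""
    depth_from_a = {node: depth for depth, node in enumerate(ancestors(a, heads))}
    for depth_from_b, node in enumerate(ancestors(b, heads)):
        if node in depth_from_a:
            return depth_from_a[node] + depth_from_b
    raise ValueError("tokens are not in the same tree")


def penalized_similarity(similarity, distance, max_distance=None, decay=0.8):
    """Apply an optional hard limit, then shrink the score for each extra step."""
    if max_distance is not None and distance > max_distance:
        return 0.0
    return similarity * decay ** max(distance - 1, 0)


# Toy tree for "Chemical X inhibits protein A and chemical Y inhibits protein B":
# each token maps to its head; the second verb hangs off the first as a conjunct.
heads = {
    "inhibits_1": None,
    "X": "inhibits_1",
    "A": "inhibits_1",
    "inhibits_2": "inhibits_1",
    "Y": "inhibits_2",
    "B": "inhibits_2",
}

print(dependency_distance("X", "A", heads))  # pair from the same clause
print(dependency_distance("X", "B", heads))  # pair spanning the conjunction
print(penalized_similarity(0.9, dependency_distance("X", "B", heads), max_distance=2))
```

In this toy tree the spurious pair (X, B) is one step further apart than the genuine pair (X, A), so a hard limit of two steps rejects it while keeping the genuine pair.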

TODO - Pubs+Presentation Task: Relation Extraction Paper

Pub or Presentation: Relation Extraction Work for Dep2Rel with CRAFT and BioCreative results.

Submission Site:

TODO

Due Date:

TODO

Project Meeting -- 05/09/2019 @ 12:00

Meeting Date: May 9th, 2019
Topic: Update on generating BioCreative Results
Attendees: @LEHunter

Proposed Agenda:

#3 (comment)
Making results notebook will require packaging to jar
Using mops in evaluation procedure

Wiki: Coordinating conjunctions problem description

Wiki Page:

describe coordinating conjunctions
describe how they are a problem
examples
ways to fix

Project Meeting -- 06/12/2019 @ 13:30

Meeting Date: June 12th, 2019
Topic: Show results
Attendees: @LEHunter

adding negative examples to clustering
seem to have resolved memory leaks
look at PCA plot
look at evaluation metrics plot

Coding: re-evaluate algorithm with regard to seed similarity

Currently the algorithm has two filtering steps for the sentences to become seeds.

First it finds sentences similar to the existing seeds and adds them to the seeds.
The enlarged seed set is then clustered to form patterns. The second filtering step finds sentences similar to the patterns and adds them to the seeds.

I think the first step needs to be removed because it violates one of the goals of the algorithm, which is to develop context patterns and use them to refine themselves.

I am proposing to either remove this step, or replace it with a step to filter by concept type of the entities in the current pattern.

remove seed similarity step
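
If the seed-similarity step were replaced by a concept-type filter, it might look something like the sketch below. The candidate layout (a dict with a `types` tuple) and the function name `filter_by_concept_types` are assumptions for illustration only.

```python
def filter_by_concept_types(candidates, pattern_types):
    """Keep candidate sentences whose entity types equal the pattern's entity types."""
    return [candidate for candidate in candidates if candidate["types"] == pattern_types]


# Hypothetical candidates: each has an id and the concept types of its two entities.
candidates = [
    {"id": "s1", "types": ("chemical", "protein")},
    {"id": "s2", "types": ("protein", "protein")},
    {"id": "s3", "types": ("chemical", "protein")},
]

kept = filter_by_concept_types(candidates, ("chemical", "protein"))
print([candidate["id"] for candidate in kept])
```

Unlike the seed-similarity step, this filter does not compare sentence contexts at all, so the patterns remain the only source of context-based selection.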

Coding: How to match a sentence to a pattern?

I noticed that I was able to get an F1 score of 0.2 over BioCreative VI.4 with only 5 seeds. Because each seed got its own cluster, the matching stage only involved finding similarity to one pattern that had no "dilution" due to being combined with other patterns.

This makes me think that the summing of the context vectors that make up a pattern is somehow diluting the underlying sentences too much, so they don't match as well to the next round of samples.

Maybe I should use the context vector for each sentence in the patterns on its own while looking for matches. The original BREDS did this and just checked if the good-bad ratio was greater than 1.
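
The dilution effect can be shown with a small example that compares matching against the summed pattern vector with matching against each member sentence on its own. This is a sketch under simple assumptions: the two-dimensional vectors, the 0.9 threshold, and the function names are made up for illustration, and it does not reproduce the BREDS good-bad ratio.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def summed_similarity(candidate, members):
    """Compare the candidate with the element-wise sum of all member vectors."""
    total = [sum(column) for column in zip(*members)]
    return cosine(candidate, total)


def best_member_similarity(candidate, members):
    """Compare the candidate with each member vector and keep the best score."""
    return max(cosine(candidate, member) for member in members)


# Hypothetical pattern made of two dissimilar sentences, plus a candidate
# that is identical to the first of them.
members = [[1.0, 0.0], [0.0, 1.0]]
candidate = [1.0, 0.0]
threshold = 0.9

print(summed_similarity(candidate, members) >= threshold)
print(best_member_similarity(candidate, members) >= threshold)
```

Here the candidate matches one member exactly but falls below the threshold against the summed vector, which is the kind of dilution described above.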
