nestauk / ai_genomics

Open-source code for innovation mapping of the AI in genomics landscape

License: MIT License

Shell 1.11% Python 96.54% Makefile 2.35%

ai_genomics's Issues

Read material related to AI and genomics

Per our kick off call on 13-06, Ada folks to send over literature reviews for our review.

We currently have access to Conor Griffin's Biotechnology & AI review paper

sample ai genomics patent abstracts

Based on the AI genomics patent IDs, develop a methodology to sample abstracts, subject to Google BigQuery limits.
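A minimal sketch of one way to approach the sampling, assuming the patent IDs are already available as a list of strings. It draws a reproducible random sample and splits it into batches so that each batch can go into a separate query. The batch size and the ID format are placeholders, not actual BigQuery limits, and no BigQuery client call is shown.

```python
import random
from typing import Iterator, List, Sequence


def sample_ids(patent_ids: Sequence[str], sample_size: int, seed: int = 42) -> List[str]:
    """Draw a reproducible random sample of unique patent IDs."""
    rng = random.Random(seed)
    # Sort after de-duplicating so the sample does not depend on input order.
    unique_ids = sorted(set(patent_ids))
    return rng.sample(unique_ids, min(sample_size, len(unique_ids)))


def batches(ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size IDs, one per query."""
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


# Hypothetical IDs, for illustration only.
all_ids = [f"ID-{i}" for i in range(100)]
sampled = sample_ids(all_ids, sample_size=10)
for batch in batches(sampled, batch_size=4):
    print(len(batch), batch)
```

Fixing the seed means the same abstracts can be pulled again later if the sample needs to be re-reviewed.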

Update openalex pipeline to work with extra data

Identify topics industry is becoming more interested in and abandoning

Research data: Influence / impact / controversy

Assess options to enrich publication data with information about influence / impact / controversy

Citation data already addressed in #4
Assess options to enrich publication data with information about social media debates from Crossref Event data:
- What information is available directly from CRED API?
Assess how controversial a paper is using scite.ai
- This will involve some costs. Estimate costs under different scenarios

JMG will be working on this w/c 28 March

Research data: institutions

Assess options to enrich OpenAlex data with institutional / geographical information from GRID

This will involve fuzzy matching OpenAlex institutions with e.g. GRID institutions
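As a rough illustration of the fuzzy matching step, the sketch below compares normalised institution names with the standard library's `difflib.SequenceMatcher`. The GRID IDs, institution names and the 0.8 threshold are made up for the example; a real pipeline would likely need blocking (e.g. by country) and a tuned threshold.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def normalise(name: str) -> str:
    """Lowercase and replace punctuation with spaces before comparing."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def best_match(
    name: str, candidates: Dict[str, str], threshold: float = 0.85
) -> Optional[Tuple[str, float]]:
    """Return (candidate_id, score) for the closest name, or None if below threshold."""
    target = normalise(name)
    best_id, best_score = None, 0.0
    for candidate_id, candidate_name in candidates.items():
        score = SequenceMatcher(None, target, normalise(candidate_name)).ratio()
        if score > best_score:
            best_id, best_score = candidate_id, score
    if best_id is None or best_score < threshold:
        return None
    return best_id, best_score


# Hypothetical GRID-style lookup table: id -> institution name.
grid = {
    "grid.0001": "University of Example",
    "grid.0002": "Example Institute of Technology",
}
print(best_match("Univ. of Example", grid, threshold=0.8))
print(best_match("Unrelated Hospital", grid))
```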

Explore impact of definitions on OpenAlex results

Identify topics academics are becoming more interested in and abandoning

Prototype strategy to enrich OpenAlex data with institutional information from Global Research Identifier (GRID)

GtR definitions file does not run

The format of the /inputs/gtr/gtr_projects-projects.json file on S3 has changed, meaning gtr_definintions.py does not run

Identify data collection tasks that can be ongoing between 06/20 - 07/01

as @Jack-Vines and I will be on OJO.

Check reliability of results under different "AI thresholds"

Write technical spec

Structure:

Context

Project goals
Research questions

Methodology

High level methodological narrative
Data sources: collection and enrichment
Analysis

Outputs
Timelines and budget

CrunchBase: assess

Assess CrunchBase dataset:
- What are the right crunchbase tags
- How long are the descriptions of CB companies
- ...

prototype patent data pipeline

can we do one for like 5 patents? filter patent ids in genomics - precondition for issue #25

Start work on getting full collection of OpenAlex

Clean up repo

close issues that have been completed
delete remote branches that have been merged

Current snapshot (present and past two years) of the funding and research environment

identify and pull relevant fields for ai genomics patent ids

Using ai genomics patent ids, query BigQuery for additional fields that will be relevant for analysis.

interrogate patent data schema to identify fields most relevant for analysis
interrogate how complete those fields are on a sample of ai genomics patent ids
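A small sketch of the completeness check, assuming the sampled patent records have been loaded as a list of dictionaries. The field names below are illustrative and are not taken from the actual patent schema.

```python
from typing import Dict, Iterable, List, Mapping


def field_completeness(records: List[Mapping], fields: Iterable[str]) -> Dict[str, float]:
    """Return the share of records with a non-empty value for each field."""
    total = len(records)
    completeness = {}
    for field in fields:
        # Treat missing keys, None, empty strings and empty lists as "not filled".
        filled = sum(1 for record in records if record.get(field) not in (None, "", []))
        completeness[field] = filled / total if total else 0.0
    return completeness


# Hypothetical sample of patent records.
sample = [
    {"title": "Method A", "abstract": "Some text", "assignee": ["Org 1"]},
    {"title": "Method B", "abstract": "", "assignee": []},
    {"title": "Method C", "abstract": None},
    {"title": "Method D", "abstract": "More text", "assignee": ["Org 2"]},
]
print(field_completeness(sample, ["title", "abstract", "assignee"]))
```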

entity extraction: DBPedia vs. OpenAlex Tagger

Establish collaborative report writing process

Based on feedback from the first epic, we need to clarify a collaborative report writing process. We will also need to schedule an onboarding call to make sure everyone knows how we will be collaborating.

Establish report writing process
Set up call with team to onboard us
implement process

outstanding admin from call this morning

Get lit reviews
Get timelines for other project streams
Get lists of terms they are using to define AI / Genomics
Get research questions / assumptions
Schedule dates for epic meetings
Create project slack and invite them (do we want to do this? :-))
Agree if there is an advisory board / its composition / how we engage with it
Decide comm strategy: where does the report live and how do we share it with others

Funding data: Analysis

evaluate ai genomics patent ids

Sanity check patent ids that should be ai and genomics.

pick N .csvs and check patent ids are relevant to ai AND genomics (perhaps pulling a sample of titles to review or even searching patent ids in Google's front end: https://patents.google.com/ for more complete info)
Investigate distribution of cpc/ipc codes in "golden-shine-355915.genomics.*" - are all codes represented to at least some extent?

Epic 1 Goals

Collect data relevant across research data, patent data, business data and funding data
Conduct EDA and DQA across the relevant datasets
Improve collective domain knowledge and subject expertise of AI and genomics
Solidify agile ways of working
Define whether this data collection work should sit within this project repo or beyond it

Look for categories that map against AI genomics

in patent, openalex, crunchbase taxonomies

baseline metrics

As a baseline, we could use the same metrics Karlis used in innovation sweet spots, and reuse his code from that repo to calculate e.g.:

# of orgs founded per topic (pending topic definition: e.g. crunchbase category, cluster etc.)
# of funding rounds per topic (pending topic definition: e.g. crunchbase category, cluster etc.)
Overall funding per topic
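The three baseline metrics above could be sketched as follows, assuming each organisation has already been assigned a single topic and each funding round references an organisation. The record layout (`org_id`, `topic`, `raised_usd`) is hypothetical and is not the layout of the CrunchBase data or of the innovation sweet spots code.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Mapping


def orgs_founded_per_topic(orgs: List[Mapping]) -> Counter:
    """Count organisations per topic label."""
    return Counter(org["topic"] for org in orgs)


def funding_per_topic(orgs: List[Mapping], rounds: List[Mapping]) -> Dict[str, Dict[str, float]]:
    """Return the number of funding rounds and total amount raised per topic."""
    topic_of = {org["org_id"]: org["topic"] for org in orgs}
    summary = defaultdict(lambda: {"n_rounds": 0, "total_raised": 0.0})
    for funding_round in rounds:
        topic = topic_of.get(funding_round["org_id"])
        if topic is None:
            continue  # round belongs to an organisation with no topic assigned
        summary[topic]["n_rounds"] += 1
        summary[topic]["total_raised"] += funding_round["raised_usd"]
    return dict(summary)


# Hypothetical records, for illustration only.
orgs = [
    {"org_id": "a", "topic": "sequencing"},
    {"org_id": "b", "topic": "sequencing"},
    {"org_id": "c", "topic": "drug discovery"},
]
rounds = [
    {"org_id": "a", "raised_usd": 1_000_000},
    {"org_id": "a", "raised_usd": 2_500_000},
    {"org_id": "c", "raised_usd": 4_000_000},
    {"org_id": "z", "raised_usd": 9_900_000},
]
print(orgs_founded_per_topic(orgs))
print(funding_per_topic(orgs, rounds))
```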

We could also join on patent assignees and look at the number of patents per topic.

After that, we could also calculate more complex emergence indicators, and look into novelpy.

Have a conversation with Liz about how she found AI companies for the AI map project

EDA of OpenAlex data

Update getters to fetch data from S3 without storing locally
Generate summary by year including
- Number of papers
- Top institutions
- Top countries
- Top concepts
Update getters to fetch data and query for specific concepts, store locally (necessary for genomics)
Explore impact of different definitions on corpus
- E.g. number of papers in intersection of AI and genomics
- Coverage of papers in literature review

Research data: collection

Assess OpenAlex. Some questions to consider:

What is its coverage?
What is its timeliness?
How will we collect it?
How will we store it?
How will we find AI / genomics papers in it?

Get Semantic Scholar API key

Develop algorithm to re-create OpenAlex abstracts from inverted indices
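A minimal sketch of the reconstruction, assuming the inverted index has the shape OpenAlex returns in its `abstract_inverted_index` field: a mapping from each token to the list of positions where it occurs. The function name and example data are illustrative.

```python
from typing import Dict, List, Optional


def rebuild_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> str:
    """Rebuild abstract text from a {token: [positions]} mapping."""
    if not inverted_index:
        return ""  # works without an abstract have no index
    # Invert the mapping so each position points at its token.
    position_to_token = {
        position: token
        for token, positions in inverted_index.items()
        for position in positions
    }
    # Join tokens in positional order; missing positions are skipped.
    return " ".join(position_to_token[i] for i in sorted(position_to_token))


example = {"AI": [0], "in": [1], "genomics": [2, 4], "and": [3]}
print(rebuild_abstract(example))  # AI in genomics and genomics
```

Original whitespace and line breaks cannot be recovered this way, since the index only stores token order.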

Identify topics strategy

start simple:

simplest approach: just use crunchbase topics or openalex topics

Think about approaches:

what about using pygram? n-grams in abstracts and look at trends -- what's emerging? (second simplest)
SPECTER embeddings of documents (just pull vectors based on DOI from Semantic Scholar) and just cluster those guys (stretch-ish goal)
topic modelling
some "minimum" threshold of topics - can we at least find these types of categories from the literature review?
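The n-gram idea in the list above can be sketched with the standard library alone. The example counts, for each year, how many abstracts contain each bigram, so that rising and falling terms can be compared across years. The tokeniser is deliberately crude and the example abstracts are invented; this is not tied to any particular n-gram library.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def ngrams(text: str, n: int = 2) -> List[str]:
    """Split text into lowercase alphanumeric tokens and return its n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts_by_year(docs: Iterable[Tuple[int, str]], n: int = 2) -> Dict[int, Counter]:
    """Count, per year, the number of documents containing each n-gram."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for year, abstract in docs:
        # Use a set so each n-gram is counted once per document.
        counts[year].update(set(ngrams(abstract, n)))
    return dict(counts)


# Invented abstracts, for illustration only.
docs = [
    (2019, "Deep learning for variant calling"),
    (2020, "Deep learning improves gene expression prediction"),
    (2020, "Gene expression and deep learning"),
]
by_year = ngram_counts_by_year(docs)
print(by_year[2019]["deep learning"], by_year[2020]["deep learning"])
print(by_year[2020]["gene expression"])
```

Dividing each count by the number of abstracts in that year would turn these into shares, which are easier to compare when corpus size changes over time.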

crunchbase --> do we do this in a qualitative way?

Research data: knowledge flow

Explore options to enrich OpenAlex with semantic scholar data including:

Citations
High impact citations
Fields of study
As part of this, request access to free + fast API (it takes some time for them to confirm access)

JMG will be working on this w/c 4 April

Chat with Luca and Sam about wikipedia topics

from Luca: yesterday I forgot that Sam is on leave until Friday so I guess we could discuss wikipedia topics for this project next week (15-08)

Funding data: sources

Options include:

Gateway to Research
NIH
EU (through Cordis)
Wellcome Trust

Dataset joining based on entity matching

As part of the project deliverable, we are also interested in joining datasets to analyse. We would then need to:

Identify common entities across all the datasets we've collected so far, e.g. a company name
Identify a strategy to join datasets based on entities, e.g. use jacchammer? Are there other entity matching algorithms we want to explore? What preprocessing will we need to do to join?
Justify why we need joins - is this to help with analysis? Which analysis? Is this to expand data collection?
Develop an evaluation strategy for joining, e.g. build a ground truth dataset of the same entity across different datasets and report how many entities we join correctly and incorrectly. How do these metrics change under different entity matching algorithms and preprocessing methods?
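The evaluation step could look something like the sketch below, assuming both the predicted joins and the ground truth are expressed as sets of (left ID, right ID) pairs. The IDs are placeholders, and precision, recall and F1 are one reasonable choice of metrics rather than an agreed standard for this project.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def evaluate_join(predicted: Set[Pair], truth: Set[Pair]) -> Dict[str, float]:
    """Compare predicted entity pairs with a ground truth set of pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "correct": true_positives,
        "incorrect": len(predicted - truth),  # joins made that are not in the truth set
        "missed": len(truth - predicted),  # true joins the matcher did not find
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Placeholder (patent assignee ID, company ID) pairs.
truth = {("p1", "c1"), ("p2", "c2"), ("p3", "c3")}
predicted = {("p1", "c1"), ("p2", "c9"), ("p3", "c3"), ("p4", "c4")}
print(evaluate_join(predicted, truth))
```

Re-running the same function for each matching algorithm and preprocessing variant gives directly comparable numbers.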

[from @Juan-Mateos] Potential joins:

orgs from patent data w/ crunchbase
expanding list of crunchbase companies based on patent join

Other ideas:

join named patent inventors w/ OpenAlex's publisher or display_name

Research data: MeSH terms

Check pipeline to enrich abstract data with MeSH terms: what is required to run this, how long does it take?

Patents: Sources

Some options:

Patent lens
Global Patent Index
USPTO using the open AI patent dataset (https://www.uspto.gov/ip-policy/economic-research/research-datasets/artificial-intelligence-patent-dataset)
Google Patents dataset on BigQuery
Check with Karlis / Discovery Hub about their plans to analyse patent data
