ai_genomics's People

Contributors

georgerichardson, india-kerle, jack-0-0, jack-vines

ai_genomics's Issues

Research data: Influence / impact / controversy

Assess options to enrich publication data with information about influence / impact / controversy

  • Citation data already addressed in #4
  • Assess options to enrich publication data with information about social media debates from Crossref Event data:
    • What information is available directly from CRED API?
  • Assess how controversial a paper is using scite.ai
    • This will involve some costs. Estimate costs under different scenarios

JMG will be working on this w/c 28 March

Research data: institutions

Assess options to enrich OpenAlex data with institutional / geographical information from GRID

  • This will involve fuzzy matching OpenAlex institutions with e.g. GRID institutions
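
A minimal sketch of what the fuzzy matching step could look like, using only the standard library's `difflib`. The institution names, the `normalise` helper and the 0.85 threshold are assumptions for illustration, not the project's actual approach; a dedicated matching library may well perform better at scale.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lowercase and replace punctuation with spaces so trivial differences do not hurt the score."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def best_match(name, candidates, threshold=0.85):
    """Return (candidate, score) for the most similar candidate, or (None, score) if below threshold."""
    target = normalise(name)
    scored = [(SequenceMatcher(None, target, normalise(c)).ratio(), c) for c in candidates]
    score, candidate = max(scored)
    return (candidate, score) if score >= threshold else (None, score)


# Hypothetical institution names, for illustration only.
grid_names = ["University of Cambridge", "Wellcome Sanger Institute", "University College London"]
print(best_match("Wellcome Sanger Institute.", grid_names))
print(best_match("Some Unlisted Lab", grid_names))
```

The threshold would need tuning against a hand-labelled sample, since a low value produces false joins and a high value misses abbreviations.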

Write technical spec

Structure:

  1. Context
    • Project goals
    • Research questions
  2. Methodology
    • High level methodological narrative
    • Data sources: collection and enrichment
    • Analysis
  3. Outputs
  4. Timelines and budget

CrunchBase: assess

  • Assess CrunchBase dataset:
    • What are the right CrunchBase tags?
    • How long are the descriptions of CB companies?
    • ...

Clean up repo

  • close issues that have been completed
  • delete remote branches that have been merged

identify and pull relevant fields for ai genomics patent ids

Using ai genomics patent ids, query BigQuery for additional fields that will be relevant for analysis.

  • interrogate patent data schema to identify fields most relevant for analysis
  • interrogate how complete those fields are on a sample of ai genomics patent ids
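
As a rough sketch of the completeness check, assuming the sampled patents have already been pulled into a list of dictionaries: the field names below are hypothetical placeholders, not the actual BigQuery schema.

```python
def field_completeness(records, fields):
    """Share of records in which each field is present and non-empty."""
    total = len(records)
    completeness = {}
    for field in fields:
        # Treat missing keys, None, empty strings and empty containers as "not filled".
        filled = sum(1 for r in records if r.get(field) not in (None, "", [], {}))
        completeness[field] = filled / total if total else 0.0
    return completeness


# Hypothetical sample of patent records, for illustration only.
sample = [
    {"publication_number": "X-1", "title": "A method", "abstract": ""},
    {"publication_number": "X-2", "title": "A system", "abstract": "Some text"},
    {"publication_number": "X-3", "title": None},
]
for field, share in field_completeness(sample, ["publication_number", "title", "abstract"]).items():
    print(f"{field}: {share:.0%}")
```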

Establish collaborative report writing process

Based on feedback from the first epic, we need to clarify a collaborative report writing process. We will also need to schedule an onboarding call to make sure everyone knows how we will be collaborating.

  • Establish report writing process
  • Set up call with team to onboard us
  • implement process

outstanding admin from call this morning

  • Get lit reviews
  • Get timelines for other project streams
  • Get lists of terms they are using to define AI / Genomics
  • Get research questions / assumptions
  • Schedule dates for epic meetings
  • Create project slack and invite them (do we want to do this? :-))
  • Agree if there is an advisory board / its composition / how we engage with it
  • Decide comm strategy: where does the report live and how do we share it with others

evaluate ai genomics patent ids

Sanity check patent ids that should be ai and genomics.

  • pick N .csvs and check patent ids are relevant to ai AND genomics (perhaps pulling a sample of titles to review or even searching patent ids in Google's front end: https://patents.google.com/ for more complete info)
  • Investigate distribution of cpc/ipc codes in "golden-shine-355915.genomics.*" - are all codes represented to at least some extent?
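
One way to sketch the code distribution check, once codes per patent id have been exported from the tables. The patent ids, the list of expected subclasses, and the choice to treat the first four characters of a code as its subclass are all assumptions for illustration.

```python
from collections import Counter


def code_distribution(patent_codes, expected_subclasses):
    """Count patents per subclass and list expected subclasses that never appear."""
    counts = Counter()
    for codes in patent_codes.values():
        # Count each subclass once per patent, however many of its codes the patent carries.
        counts.update({code[:4] for code in codes})
    missing = sorted(set(expected_subclasses) - set(counts))
    return counts, missing


# Hypothetical patent ids and codes, for illustration only.
patent_codes = {
    "P1": ["G16B40/20", "G06N3/08"],
    "P2": ["G16B20/00", "G16B40/00"],
    "P3": ["C12Q1/68"],
}
counts, missing = code_distribution(patent_codes, ["G16B", "G06N", "C12Q", "G16H"])
print(counts.most_common())
print("Not represented:", missing)
```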

Epic 1 Goals

  • Collect data relevant across research data, patent data, business data and funding data
  • Conduct EDA and DQA across the relevant datasets
  • Improve collective domain knowledge and subject expertise of AI and genomics
  • Solidify agile ways of working
  • Define whether this data collection work should sit within this project repo or beyond it

baseline metrics

As a baseline, we could use the same metrics Karlis used in Innovation Sweet Spots. We could also reuse his code from that repo to calculate, e.g.:

  • Number of orgs founded per topic (pending topic definition, e.g. CrunchBase category, cluster etc.)
  • Number of funding rounds per topic (pending topic definition, e.g. CrunchBase category, cluster etc.)
  • Overall funding per topic
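
The three metrics above could be sketched roughly as follows. This is not the Innovation Sweet Spots code; the record layout (`org_id`, `topic`, `amount_usd`) is a hypothetical simplification of the CrunchBase tables.

```python
from collections import defaultdict


def baseline_metrics(orgs, funding_rounds):
    """Per topic: number of orgs founded, number of funding rounds, total funding."""
    metrics = defaultdict(lambda: {"orgs_founded": 0, "funding_rounds": 0, "total_funding": 0.0})
    topic_of = {}
    for org in orgs:
        topic_of[org["org_id"]] = org["topic"]
        metrics[org["topic"]]["orgs_founded"] += 1
    for funding_round in funding_rounds:
        topic = topic_of.get(funding_round["org_id"])
        if topic is None:
            continue  # round belongs to an org outside our sample
        metrics[topic]["funding_rounds"] += 1
        # Undisclosed amounts are treated as zero here, which understates totals.
        metrics[topic]["total_funding"] += funding_round.get("amount_usd") or 0.0
    return dict(metrics)


# Hypothetical records, for illustration only.
orgs = [
    {"org_id": 1, "topic": "genomics"},
    {"org_id": 2, "topic": "genomics"},
    {"org_id": 3, "topic": "drug discovery"},
]
funding_rounds = [
    {"org_id": 1, "amount_usd": 2_000_000},
    {"org_id": 1, "amount_usd": None},
    {"org_id": 3, "amount_usd": 500_000},
]
print(baseline_metrics(orgs, funding_rounds))
```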

We could also join on patent assignees and look at the number of patents per topic.

After that, we could also calculate more complex indicators such as emergence metrics, and look into novelpy.

EDA of OpenAlex data

  • Update getters to fetch data from S3 without storing locally
  • Generate summary by year including
    • Number of papers
    • Top institutions
    • Top countries
    • Top concepts
  • Update getters to fetch data and query for specific concepts, store locally (necessary for genomics)
  • Explore impact of different definitions on corpus
    • E.g. number of papers in intersection of AI and genomics
    • Coverage of papers in literature review
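
A minimal sketch of the yearly summary and the AI / genomics intersection check. It assumes each paper has already been flattened into a dictionary with `year`, `institutions` and `concepts` keys, which is a hypothetical layout rather than the raw OpenAlex response; the concept sets used to define AI and genomics are placeholders too.

```python
from collections import Counter, defaultdict


def summarise_by_year(papers, top_n=3):
    """Number of papers plus the most frequent institutions and concepts for each year."""
    by_year = defaultdict(list)
    for paper in papers:
        by_year[paper["year"]].append(paper)
    summary = {}
    for year, group in sorted(by_year.items()):
        # Use set() so a value repeated within one paper is only counted once.
        institutions = Counter(i for p in group for i in set(p["institutions"]))
        concepts = Counter(c for p in group for c in set(p["concepts"]))
        summary[year] = {
            "n_papers": len(group),
            "top_institutions": institutions.most_common(top_n),
            "top_concepts": concepts.most_common(top_n),
        }
    return summary


def in_intersection(paper, ai_concepts, genomics_concepts):
    """True if the paper carries at least one AI concept and at least one genomics concept."""
    concepts = set(paper["concepts"])
    return bool(concepts & ai_concepts) and bool(concepts & genomics_concepts)


# Hypothetical papers and concept definitions, for illustration only.
papers = [
    {"year": 2021, "institutions": ["A"], "concepts": ["Machine learning", "Genomics"]},
    {"year": 2021, "institutions": ["A", "B"], "concepts": ["Genomics"]},
    {"year": 2022, "institutions": ["B"], "concepts": ["Machine learning"]},
]
print(summarise_by_year(papers, top_n=1))
print(sum(in_intersection(p, {"Machine learning"}, {"Genomics"}) for p in papers))
```

Swapping in broader or narrower concept sets and re-running the intersection count gives a quick read on how sensitive the corpus is to the definition.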

Research data: collection

Assess OpenAlex. Some questions to consider:

  • What is its coverage?
  • What is its timeliness?
  • How will we collect it?
  • How will we store it?
  • How will we find AI / genomics papers in it?

Identify topics strategy

start simple:

  • simplest approach: just use crunchbase topics or openalex topics

Think about approaches:

  • what about using pygram? n-grams in abstracts and look at trends -- what's emerging? (second simplest)
  • SPECTER embeddings of documents (just pull vectors based on DOI from Semantic Scholar) and just cluster those (stretch-ish goal)
  • topic modelling
  • some "minimum" threshold of topics - can we at least find these types of categories from the literature review?
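
To make the n-gram idea concrete, here is a small standard-library sketch (it does not use pygram). It computes, for each year, the share of abstracts containing each n-gram, so that rising shares hint at emerging terms. The tokenisation rule and the example abstracts are assumptions for illustration.

```python
import re
from collections import Counter, defaultdict


def ngrams(text, n=2):
    """Lowercase word n-grams, e.g. bigrams when n=2."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_share_by_year(abstracts, n=2):
    """Map year -> {ngram: share of that year's abstracts containing the n-gram}."""
    doc_counts = defaultdict(Counter)
    totals = Counter()
    for year, text in abstracts:
        totals[year] += 1
        # set() gives document frequency rather than raw term frequency.
        doc_counts[year].update(set(ngrams(text, n)))
    return {
        year: {gram: count / totals[year] for gram, count in counts.items()}
        for year, counts in doc_counts.items()
    }


# Hypothetical abstracts, for illustration only.
abstracts = [
    (2020, "Deep learning for variant calling"),
    (2021, "Deep learning predicts gene expression"),
    (2021, "Language models for gene expression"),
]
shares = ngram_share_by_year(abstracts)
print(shares[2021].get("gene expression"))
```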

crunchbase --> do we do this in a qualitative way?

Research data: knowledge flow

Explore options to enrich OpenAlex with semantic scholar data including:

  • Citations
  • High impact citations
  • Fields of study

As part of this, request access to the free + fast API (it takes some time for them to confirm access).

JMG will be working on this w/c 4 April

Dataset joining based on entity matching

As part of the project deliverable, we are also interested in joining datasets to analyse. We would then need to:

  • Identify common entities across all the datasets we've collected so far, e.g. a company name
  • Identify a strategy to join datasets based on an entity, e.g. use jacchammer? Are there other entity matching algorithms we want to explore? What preprocessing will we need to do before joining?
  • Justify why we need joins - is this to help with analysis? Which analysis? Is this to expand data collection?
  • Develop an evaluation strategy for joining, e.g. build a ground truth dataset of the same entities across different datasets and report how many entities we join correctly and incorrectly. How do these metrics change with different entity matching algorithms and preprocessing methods?
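
The evaluation step could be sketched as below: match names with a simple token-level Jaccard similarity, then score the predicted joins against a hand-built ground truth. This stands in for whichever matching algorithm is eventually chosen; the company names, the 0.5 threshold and the helper names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two names."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def match_entities(left_names, right_names, threshold=0.5):
    """Map each left name to its most similar right name, if the score clears the threshold."""
    matches = {}
    for left in left_names:
        score, right = max(((jaccard(left, r), r) for r in right_names), default=(0.0, None))
        if right is not None and score >= threshold:
            matches[left] = right
    return matches


def precision_recall(predicted, truth):
    """Precision and recall of predicted (left, right) pairs against ground truth pairs."""
    predicted_pairs, true_pairs = set(predicted.items()), set(truth.items())
    correct = len(predicted_pairs & true_pairs)
    precision = correct / len(predicted_pairs) if predicted_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Hypothetical names and ground truth, for illustration only.
patent_orgs = ["Acme Genomics Ltd", "Helix Bio", "Unrelated Widgets"]
crunchbase_orgs = ["Acme Genomics", "Helix Biosciences"]
truth = {"Acme Genomics Ltd": "Acme Genomics", "Helix Bio": "Helix Biosciences"}
predicted = match_entities(patent_orgs, crunchbase_orgs)
print(predicted, precision_recall(predicted, truth))
```

Re-running `precision_recall` with different thresholds, similarity functions or preprocessing steps gives the comparison the last bullet asks for.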

[from @Juan-Mateos] Potential joins:

  • orgs from patent data w/ crunchbase
  • expanding list of crunchbase companies based on patent join

Other ideas:

  • join named patent inventors w/ OpenAlex's publisher or display_name

Research data: MeSH terms

  • Check pipeline to enrich abstract data with MeSH terms: what is required to run this, how long does it take?

Research data: open source

  • Find AI + genomics papers in Papers with Code
  • Get their GitHub repos
  • Are there any benchmark data related to AI + genomics?
