Giter VIP home page Giter VIP logo

collabnext's Introduction

CollabNext

This project is hackathon challenge / contribution to Building the Prototype Open Knowledge Network (Proto-OKN), a NSF funded project.

https://new.nsf.gov/funding/opportunities/building-prototype-open-knowledge-network-proto

The goal is to build a Knowledge Graph based recommendation system for researchers, to help them find collaborators given specified constrains, i.e. topics or universities.

The first script is ETL code for fetching data from OpenAlex API, filtering it, and inserting institutions, topics, authors, and works into Neo4j graph db. Initial prototype dataset is limited to HBCU institutions, around 100 of them.. goal is to discover researchers less visible with common academic tools (i.e. giving preference to high citations counts and reinforcing Mathew effect / preferential attachment)

Current ETL starts with a

  • list of institutions (universities)
  • ... or any endpoint from the API except 📄Works (as it further expands by works API and specified filter):
  • 👩Authors, 📚Sources, 🏫Institutions, 💡Topics, 🏢Publishers, 💰Funders, 🌎Geo, 💡Concepts

It then fetches whatever is in the lkeep set, while looking through keep properties... so you can modify those. Script is generic, it filters those massive jsons based on your property list. OpenAlex has 3 billion triplets, so it is not practical to filter those for prototyping purposes and with limited resources. One could start from a single uni/author/work, and start expanding graph connections, until satisfying results/recommendations are found.. but then the graph can explode rapidly depending on connectivity of authors/institutions.. so a smarter real time filtering by i.e. topics would be practical!

ADAPTABILITY: With very little modification this code can be adapted to other types of datasets, i.e. analysis of Clinical Trials

Setup

  1. first run etl file, ingest as much data as you want and your resources allow.. modify inputs = ... and nearby details in the file
  2. run db_operations script after etl
  3. streamlit run retrieval.py

Examples

These examples are using GPT-3.5-turbo, ~20+ times cheaper model than GPT-4 models which are significantly better for complex graph queries

In progress, switching to Ollama model, open source and free

  • Q: Which author from Eastern Oregon University worked on protein supplementation?

    • A: Bruno I Cappellozza, D. W. Bohnert, M. M. Reis, Megan L Van Emon, C. S. Schauer, S. J. Falck, Reinaldo F Cooke from Eastern Oregon University worked on protein supplementation.
  • Q: Give me institutions which explore mechanical engineering.

    • A: University of Tennessee at Chattanooga, Chattanooga State Community College, Lawson State Community College explore mechanical engineering.
  • Q: Give me institution with most authors of works.

    • A: Tuskegee University has the most authors of works with a total of 3557.
  • Q: Give me institution with most authors of works on biology.

    • A: Cornell University has the most authors of works on biology.
  • Q: Give me most researched topics by Tuskegee University.

    • THIS MAY BREAK THE CYPHER; you may need to tell LLM where to look. i.e. here specify 'institution'
    • Q: Give me most researched topics by institution Tuskegee University.
      • A: Advancements in Prostate Cancer Research, Nutritional Strategies for Ruminant Health and Production, Remote Laboratories in Engineering Education.
  • Q: Give me institution with most authors of works on biology which have centrality less than 10

    • A: Delaware State University has the most authors of works on biology with centrality less than 10
  • Q: Give me institution with most different communities of authors of works on social justice

    • A: Miles College has 1 community of authors of works on social justice.

Note: Community and centrality are properties not present in openalex, but calculated on a portion of the graph ingested through ETL step.

Another note: notice some universities are not HBCU.. some authors of works probably are affiliated with those, that's how they got into the demo graph...

TODO

Sentence restructuring... sensitivity issues:

Only bold ones return what I was looking for, others return universities or nothing (to be fair, some of my examples have weird grammar!):

  • Q: give me researchers from different communities and universities on the topics of health

  • Q: give me researchers from different communities on the topics of mentalhealth

  • Q: give me researchers from different communities on the topics of mental health

  • Q: give me authors from different communities on the topics of mental health

  • Q: give me researchers on the topics of health belonging to distinct communities (somewhat o.k.)

  • Q: give me authors of works on the topic of mental health that belong to distinct communities

    • A: M. D. MacNeil, F.W.C. Neser, Rulien Grobler, F.H. De Witt, Errol D. Cason, Ockert Einkamerer, G.C. Josling, H.A. O’Neill, Mike D Fair, J. J. Baloyi
  • Q: give me most common topics of research by authors affiliated with Cornell University

  • Q: give me most common topics of research by works of authors affiliated with Cornell University

  • Q: give me most common topics associated with Cornell University

  • Q: give me most common topics of works done by Cornell University faculty

    • A: The most common topics of works done by Cornell University faculty are Genomic Landscape of Cancer and Mutational Signatures, Genetic Research on BRCA Mutations and Cancer Risk, and Advancements in Prostate Cancer Research.

Seems sentence needs to follow the pattern of the CYPHER query.. so another step should be instructing LLM to figure out how to restructure the sentence in the specified way.

Behind the scenes

Q: give me institution with most different communities of authors of works on results

LLM Generated Cypher:

MATCH (i:institution)-[:INTERACTS]->(w:work)-[:author]->(a:author)
WITH i, collect(DISTINCT a.community) as author_communities
WITH i, size(author_communities) as num_communities
RETURN i.display_name, i.id, num_communities
ORDER BY num_communities DESC
LIMIT 1

Full Context: {'i.display_name': 'Tuskegee University', 'i.id': 'https://openalex.org/I6026837', 'num_communities': 7}

Finished chain. A: Tuskegee University has the most different communities of authors of works on results.

In this small demo dataset there are only ~10 communities for all the topics combined.. label propagation was used for community detection..

collabnext's People

Contributors

beviah avatar

Stargazers

 avatar Yetmens avatar  avatar

Watchers

 avatar Kostas Georgiou avatar

Forkers

kyerafrancis

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.