
kob-llm-to-kg

This repository is for converting content about the Kuching Old Bazaar into a knowledge graph in Neo4j using Ollama, specifically the llama3 model.

Table of Contents

  • Requirements
  • Usage of Web Scraper
  • Features checklist
  • neo4j
  • Docker Compose

Requirements

  • Docker
  • Python

Usage of Web Scraper

To start the application, follow these steps:

  1. Go into app folder
    cd app
  2. Create venv
    python -m venv .venv
  3. Activate venv
    On Windows:
    .venv\Scripts\activate
    On macOS and Linux:
    source .venv/bin/activate
  4. Install requirements
    pip install -r requirements.txt
  5. Run scraper
    python main.py

The web scraper carries out the following steps (a sketch of the pipeline follows the list):

  1. The user inputs a kcholdbazaar.com page URL (get_inp_url).
  2. The page is retrieved (get_url) and the English text is extracted from it (get_contents).
  3. The English text is fed into the LLM along with the ontology (send_to_ollama). The LLM returns nodes and relationships, which then have to be processed. For each node or relationship, the returned label ID needs to be checked, as the LLM sometimes gives incorrect labels. If a label is not part of the ontology, it is checked without the ID and, if there is a match, it is returned with the proper ID (check_if_in_ontology).
  4. The nodes and relationships are written to a CSV file in the app/outputs/ folder.
  5. The nodes and relationships are stored in the Neo4j database (load_content_to_database).
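The sketch below strings these functions together. It is illustrative only: the function names follow the list above, but their signatures, the ontology variable, and the CSV layout are assumptions rather than the actual implementation in app/main.py.

# Illustrative pipeline; assumes the functions named above are defined in the project.
import csv

def run_pipeline(ontology):
    url = get_inp_url()                    # 1. ask the user for a kcholdbazaar.com URL
    page = get_url(url)                    # 2a. fetch the page
    text = get_contents(page)              # 2b. keep only the English text
    nodes, rels = send_to_ollama(text, ontology)          # 3. LLM extraction
    nodes = [check_if_in_ontology(n) for n in nodes]      # repair wrong label IDs
    rels = [check_if_in_ontology(r) for r in rels]
    with open("outputs/extracted.csv", "w", newline="") as f:   # 4. write CSV
        writer = csv.writer(f)
        writer.writerows(nodes)
        writer.writerows(rels)
    load_content_to_database(nodes, rels)  # 5. push to Neo4j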

The prompt that is currently used was arrived at through trial and error, and it is still far from perfect. Those who want to improve it do not have to start completely over: some attempted prompts and their outputs have been stored in app/outputs/prompts_in_out.md.

Features checklist

  • Dockerized application (neo4j, ollama, ollama ui, python webscraper)
  • Python web scraper
  • Allow user to use CLI to specify target website
  • Import existing ontology into neo4j
  • Import data into neo4j
  • Test prompts derived from NaLLM
  • Output data into useful format (csv)
  • Test with bigger models for potentially better results
  • Integrate knowledge graph into Kuching Old Bazaar

neo4j

In this section, the requirements and steps to set up the neo4j database are described, as well as how the data is loaded into the database and the constraints of these steps.

Automatic Setup

When the neo4j container is started, the following steps are automatically executed:

  • The APOC and neosemantics libraries are installed
  • The unique uri constraint is created (if it does not exist)

The APOC library is used to load files into the database, while the neosemantics library enables neo4j to handle RDF data. To reproduce this in a Docker environment, the plugins need to be listed in the NEO4J_PLUGINS environment variable. In a native install, the plugins need to be placed as jar files in the plugins folder of the neo4j installation. See the neo4j documentation for more information.
The unique uri constraint ensures that the uri of each node is unique. This is important for the data import, as the uri is used to identify the nodes.
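For illustration, the plugin list and the constraint could look roughly as follows; the service definition and the constraint name are assumptions, not necessarily what this repository's compose files use.

# docker-compose excerpt (illustrative)
neo4j:
  image: neo4j:5
  environment:
    NEO4J_PLUGINS: '["apoc", "n10s"]'

// Cypher: unique uri constraint, as required by neosemantics
CREATE CONSTRAINT n10s_unique_uri IF NOT EXISTS
FOR (r:Resource) REQUIRE r.uri IS UNIQUE;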

Ontology Import

To import the CIDOC-CRM ontology, follow these steps; this is required every time the data folder has been emptied:

  • start docker container
  • run the following command in the neo4j browser
CALL n10s.graphconfig.init({ handleVocabUris: 'MAP' });
CALL n10s.onto.import.fetch("https://cidoc-crm.org/rdfs/7.1.3/CIDOC_CRM_v7.1.3.rdfs","RDF/XML");

Data Import

The data is imported when the python script runs, once the data has been extracted by the LLM. The data is expected to be available in CSV format. The generated data is then split into two files, one for the nodes and one for the relationships. The rows with the nodes start with an integer value. The relationship rows are updated so that the id of the node is used instead of the table row number.
The data is then loaded into the neo4j database using the APOC library: first the nodes, then the relationships. A sketch of the Cypher involved is shown below.
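The actual Cypher is built inside the python script; the following is only a rough sketch, with assumed file names and column names (nodes.csv, relationships.csv, _labels, _type, start_id, end_id).

// Sketch: load nodes, using APOC to set the label dynamically from the CSV
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS csvLine
CALL apoc.create.node([csvLine._labels], {uri: csvLine.id, name: csvLine.name})
YIELD node
RETURN count(node);

// Sketch: load relationships, resolving start and end nodes by uri
LOAD CSV WITH HEADERS FROM 'file:///relationships.csv' AS csvLine
MATCH (a {uri: csvLine.start_id}), (b {uri: csvLine.end_id})
CALL apoc.create.relationship(a, csvLine._type, {}, b)
YIELD rel
RETURN count(rel);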
The following settings need to be set in the neo4j.conf file (if the file does not exist in the conf folder, create it) to allow the import of csv files from the file system:

dbms.directories.import=import
dbms.security.allow_csv_import_from_file_urls=true

Limitations

If the script is executed multiple times, the data is appended to the database. This could be solved by binding the UUID to a set of properties of each node, e.g. name, type and label, so that repeated runs match existing nodes instead of creating duplicates. Additionally, the nodes are not labeled properly, which may be due to a suboptimally formulated cypher query; this may be solved by replacing :n with csvLine._labels. As indicated in the code, the neo4j password should not be hardcoded.
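One possible shape for the deduplication fix is sketched below; the :Resource label, the property names, and the use of apoc.util.md5 are assumptions, not the repository's current code.

// Derive a stable uri from the node's properties, so that re-running the
// import MERGEs onto existing nodes instead of appending duplicates.
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS csvLine
MERGE (n:Resource {uri: apoc.util.md5([csvLine.name, csvLine._labels])})
SET n.name = csvLine.name;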

Docker Compose

Ollama

Using CPU for ollama

docker-compose up -d

Using AMD GPU for ollama

docker compose -f docker-compose-amd.yml up -d

Using NVIDIA GPU for ollama

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure NVIDIA Container Toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU integration
docker run --gpus all nvidia/cuda:11.5.2-base-ubuntu20.04 nvidia-smi

docker compose -f docker-compose-nvidia.yml up -d

kob-llm-to-kg's People

Contributors

meret6832, noleu, padmavathybalaji, qimolin

kob-llm-to-kg's Issues

Add existing ontology to neo4j

Ask the student for the ontology export (should be possible via neo4j aura) or figure out a way to load the rdf/json-ld into the neo4j instance. Worst case, do it manually.
