neo4j-labs / llm-graph-builder

Neo4j graph construction from unstructured data

License: Apache License 2.0

Python 27.59% HTML 0.48% TypeScript 42.49% CSS 0.45% Jupyter Notebook 28.70% JavaScript 0.28%

llm-graph-builder's People

Contributors

aashipandya prakriti-solankey kartikpersistent arora-rakshita rakshita-arora msenechal

llm-graph-builder's Issues

Frontend/Backend-Integration of LLM dropdown selection with API for all files in the Table

After selecting an LLM, the user should be able to process the files with the preferred model.

Connection modal implementation and Dropzone alignment

Add connection modal according to the new Figma
Position the dropzone according to the above design
Table CSS styling fixes

Frontend- Code Cleanup

Create a common function for calling the APIs. Also rename some functions and tidy up the logic formatting.

Backend configuration

Can we make all environment variables uppercase?

Also add a section to the backend README on configuration and the env file.

Also call out in the file and the README what is optional and what can be overridden, e.g. from the client.

It should be more aligned with the usual style of config variables that we use elsewhere:

#OPENAI_API_KEY="sk-..."
DIFFBOT_API_KEY=""
NEO4J_URI=""
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD=""
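
A minimal sketch of how the backend could read these settings. The helper name `load_config`, the choice of which keys are required, and the rule that client-supplied values win over the environment are all assumptions for illustration, not the project's actual code.

```python
import os
from typing import Mapping, Optional

# Default for an optional setting; names follow the uppercase style above.
DEFAULTS = {"NEO4J_USERNAME": "neo4j"}
# Treated as required here for illustration only (assumption).
REQUIRED = ("NEO4J_URI", "NEO4J_PASSWORD")
KEYS = ("OPENAI_API_KEY", "DIFFBOT_API_KEY", "NEO4J_URI",
        "NEO4J_USERNAME", "NEO4J_PASSWORD")


def load_config(env: Optional[Mapping[str, str]] = None,
                overrides: Optional[Mapping[str, str]] = None) -> dict:
    """Merge defaults, environment values, and client-supplied overrides."""
    env = os.environ if env is None else env
    config = dict(DEFAULTS)
    for key in KEYS:
        if env.get(key):  # skip unset or empty values
            config[key] = env[key]
    for key, value in (overrides or {}).items():
        if value:  # e.g. connection details sent by the client
            config[key] = value
    missing = [key for key in REQUIRED if not config.get(key)]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")
    return config
```

Passing the environment in as a mapping keeps the function easy to test; calling `load_config()` with no arguments reads `os.environ`.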

Duplicate entities & PDF processing fails with 422

I tried to run the app; it still creates duplicates of files with the same name.

When trying to process the file, I get a 422 error:

backend   | INFO:     172.18.0.1:44740 - "GET /sources_list HTTP/1.1" 200 OK
backend   | INFO:     172.18.0.1:44732 - "GET /health HTTP/1.1" 200 OK
backend   | INFO:     172.18.0.1:44748 - "GET /health HTTP/1.1" 200 OK
backend   | INFO:     172.18.0.1:44754 - "GET /sources_list HTTP/1.1" 200 OK
backend   | INFO:     172.18.0.1:35078 - "POST /sources HTTP/1.1" 200 OK
backend   | INFO:     172.18.0.1:55262 - "POST /extract HTTP/1.1" 422 Unprocessable Entity

Model selection state management issue

Fix the model handling bug
Change the failed-response alert position from top center to bottom left
Add a check for the disabled state of the Generate Graph button and dropdown

Backend API

change the API name from /predict to /extract

spell out knowledge graph in the description
rename the body object in the docs to something more consistent and descriptive from Body_kg_creation_predict_post

also add metadata about the file:

filename
file-size
file date, if available
store those on a :Source node (or equivalent if the graph transformer already creates a metadata node) in the graph

and in the response at least prepare the numeric processingTime and nodeCount and relationshipCount response fields
and status and errorMessage
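
One possible shape for that response, sketched as a dataclass. The field names come from the list above; the class name `ExtractResponse`, the helper `failed`, the status values, and the units are assumptions.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExtractResponse:
    """Hypothetical response body for POST /extract."""
    fileName: str
    fileSize: int                  # bytes (unit assumed)
    status: str                    # e.g. "Completed" or "Failed" (values assumed)
    processingTime: float = 0.0    # seconds (unit assumed)
    nodeCount: int = 0
    relationshipCount: int = 0
    errorMessage: Optional[str] = None


def failed(file_name: str, file_size: int, error: Exception) -> dict:
    """Build the payload to return when extraction raises an exception."""
    response = ExtractResponse(file_name, file_size, "Failed", errorMessage=str(error))
    return asdict(response)
```

Keeping the numeric fields present with zero defaults means the client can render the table row even before processing has produced real counts.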

Bug: Issue on populating data of multiple files.

Working on a bug found while testing.
When all files are processing, their records populate correctly in the table. But when, say, 3 files are being processed and I upload a new large file that stays in "New" status without starting to process, the records in the UI table get shuffled.
On refresh, the records return to their original data.

LLMs comparison in CSV

Bug Fixing for Frontend UI

Fix the white space issue by dynamically adjusting the height, so that it stays responsive even after changing the page size in the table
Fix Auto Page Shifting
New items should be shown on first page rather than last
If the File is Already Processed show it as Completed
Removed the extra check for Disabling the Dropdown and Generate Graph Button
Connection modal should display if the user is not connected to a Neo4j database
Add the Neo4j Favicon

App is not responsive, generate graph button doesn't show on narrower screen

Also, instead of "select LLM" it should say "Select Model".

Add endpoint update_similarity_graph to update graph

Create a config setting panel

Create a settings panel.
Add settings for the LLM dropdown, the access key and secret key, and the embedding checkbox.

Frontend- API Integration on File upload.

When the user uploads a file, call the [/sources] API to post the file data.

Frontend-Multiple file upload handling

Currently the user can upload only one file at a time. Allow users to add 5 files at a time.

Frontend-Add relationship column to the table

Show the relationships created during file processing in the table.

Add Access key and secret key Check

As we understand it, if the secret key and access key are already available in the source node, add a check for their existence; if they are there, show the file as available for processing/New.

App Deployment through Cloud run

Deploy the Docker containers to Google Cloud Run and generate a URL

Deduplication of entities created from LLM

front-end-backend communication

There seems to be a CORS issue.

-> ok seems to be related to the GH codespaces, need to make the backend URL public to make it work for the time being, should be resolved when running it with docker or deploying it elsewhere.

But is it also connecting to the wrong back-end? Not sure if you hard-coded it, but it should just connect to localhost:8000 on the machine where the backend is running, or to the configured base URL.

Seems you have that hard-coded

https://github.com/neo4j-labs/llm-graph-builder/blob/main/frontend/src/components/DropZone.tsx#L10

it should not be hard-coded but come from an .env file (also provide an example.env)
it should not just be hidden inside a UI component but live in a proper backend/REST API component !!
there should be a health check that validates that the backend is running correctly and indicates that to the user!

https://github.com/neo4j-labs/llm-graph-builder/issues/new
Access to XMLHttpRequest at 'https://animated-space-broccoli-jpgjg6pg59qcp7pg-8000.app.github.dev/predict' from origin 'https://studious-dollop-979pxr45x3p4p4-5173.app.github.dev' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.

POST https://animated-space-broccoli-jpgjg6pg59qcp7pg-8000.app.github.dev/predict net::ERR_FAILED 404 (Not Found)

my backend is running on: https://studious-dollop-979pxr45x3p4p4-8000.app.github.dev/docs

Later: Perhaps mid-term we can even serve them from the backend as static assets.

Data Model Cleanup

rename Source -> Document
invert the HAS_CHILD relationship: (:Chunk)-[:PART_OF]->(:Document)
add a single first relationship: (:Document)-[:FIRST_CHUNK]->(:Chunk)
create a NEXT relationship between chunks of each document
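
The target model can be sketched as a pure function that lists the relationships needed for one document. The function name and the use of plain string ids are assumptions; writing the result to Neo4j is left out.

```python
from typing import List, Tuple

# (start node id, relationship type, end node id)
Relationship = Tuple[str, str, str]


def chunk_relationships(document: str, chunk_ids: List[str]) -> List[Relationship]:
    """List the relationships the cleaned-up model needs for one document."""
    rels: List[Relationship] = []
    # Every chunk points back to its document (inverse of the old HAS_CHILD).
    for chunk_id in chunk_ids:
        rels.append((chunk_id, "PART_OF", document))
    # The document points only at its first chunk.
    if chunk_ids:
        rels.append((document, "FIRST_CHUNK", chunk_ids[0]))
    # Consecutive chunks are chained so the original order can be walked.
    for previous, current in zip(chunk_ids, chunk_ids[1:]):
        rels.append((previous, "NEXT", current))
    return rels
```

For a document with three chunks this yields three PART_OF, one FIRST_CHUNK, and two NEXT relationships.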

Create relationship between chunk node and entity node

Experimenting with different chunking strategies

Create embedding and store in neo4j

Add a filter to the status column of the Table

Frontend Connection Status

The front end should indicate if the back-end is running.

Right now it shows the file-drop area if neo4j/the backend is connected but there should be a clearer indication.

Parallelism of all APIs using asyncio

Frontend - Create LLM Selection Dropdown

Add a dropdown for the user to select the LLM of their choice.

Front-end pass neo4j connection information to backend

If there is a separate connection information provided in the front-end it should pass that to the backend in a suitable way when making requests.

e.g. for processing files the connection information of the front-end (if available) should be passed on as an extra nested payload and be used in the processing.

Same for listing sources for the table, it should use the front-end connection information.

If the backend is configured with a neo4j connection but the front-end is not connected, it should still work, automatically using the backend's own connection config.
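
A minimal sketch of that fallback, assuming the front-end sends a nested payload with hypothetical keys `uri`, `username`, and `password`, and that the backend keeps its own settings in the `NEO4J_*` environment variables. The function name `resolve_connection` is also an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_connection(payload: Optional[Mapping[str, str]] = None,
                       env: Optional[Mapping[str, str]] = None) -> dict:
    """Prefer connection details from the request; otherwise use the backend's own."""
    env = os.environ if env is None else env
    if payload and payload.get("uri"):
        # Front-end supplied its own connection details.
        return {
            "uri": payload["uri"],
            "username": payload.get("username", "neo4j"),
            "password": payload.get("password", ""),
        }
    if env.get("NEO4J_URI"):
        # Fall back to the connection configured on the backend.
        return {
            "uri": env["NEO4J_URI"],
            "username": env.get("NEO4J_USERNAME", "neo4j"),
            "password": env.get("NEO4J_PASSWORD", ""),
        }
    raise ValueError("No Neo4j connection in the request or the backend config")
```

The same helper could serve both file processing and the source listing, so both honor the front-end connection in the same way.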

Backend URL in front-end via environment

by default should come from environment variable / .env file (using dotenv)
if not set you can use the logic you have here right now, but probably want to add a check and also test for localhost
https://github.com/neo4j-labs/llm-graph-builder/blob/main/frontend/src/utils/utils.ts#L1

Backend API

When adding :Source nodes to the graph to represent the files, add a /sources/list endpoint that returns the list of sources ordered by updatedAt descending, including all the metadata that was added/updated when creating the nodes:

fileName (for the time being this can be the id - unique constraint)
fileType
fileSize
createdAt
updatedAt
processingTime
status
errorMessage
nodeCount
relationshipCount
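
The ordering can be sketched in plain Python as below. It assumes `updatedAt` arrives as an ISO 8601 string without a timezone; in practice the sort would more likely be done in the database query. The function name `order_sources` is hypothetical.

```python
from datetime import datetime
from typing import Dict, List


def order_sources(sources: List[Dict]) -> List[Dict]:
    """Return source records newest first, based on their updatedAt value."""

    def sort_key(record: Dict) -> datetime:
        value = record.get("updatedAt")
        # Records without a timestamp sort last (assumption).
        return datetime.fromisoformat(value) if value else datetime.min

    return sorted(sources, key=sort_key, reverse=True)
```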

typo in payload: `errorMessgae` of `sources/list`

Add information to the Source node and the table about which model was used for processing

Frontend- Allowing users to specify the S3 bucket name and credential details required for accessing resources.

Implement a frontend interface element allowing users to specify the S3 bucket name and credential details required for accessing resources.

s3-bucket (with path)

optional credentials (access-key, secret-key) and region (so that public buckets can be accessed without credentials)
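
A small sketch of how the bucket-with-path input could be split before it is used, assuming the user enters it in the `s3://bucket/path` form. The function name `parse_s3_url` is hypothetical, and credential handling is not covered.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_s3_url(url: str) -> Tuple[str, str]:
    """Split an s3://bucket/path URL into (bucket, key prefix)."""
    parsed = urlparse(url.strip())
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URL: {url!r}")
    # The path keeps its trailing slash so it can be used as a key prefix.
    return parsed.netloc, parsed.path.lstrip("/")
```

Rejecting anything that is not an `s3://` URL with a bucket name gives the UI a clear error to show for invalid input.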

Create and link chunk node with source node

Sum nodes and rels over all documents

https://github.com/neo4j-labs/llm-graph-builder/blob/main/backend/src/main.py#L42-L43

Also add relationship count to the UI

Modify backend code to generate KG from OpenAI via /extract api

Frontend/Backend- Table integration to get data from the GET API [/sources_list]

Update the table with the API response.

Update the readme

instructions how to run / deploy / configure
link to the public google cloud run URL + link to neo4j workspace
list of features (upload, s3/gcs, connection to neo4j, file + chunk handling, extract entities with different models, create embeddings, create kNN graph)
screenshot or short animated gif
graph model
screenshot of the graph model in neo4j workspace + query that I shared

S3 Credentials as Payload-Backend

JSON comparison of LLMs in same format as diffbot

Handle Validation Checks for S3 bucket input

Handle validation when the user adds an invalid URL.
Handle the state of the banner.
Confirm the secret key and access key params.

S3 upload- Backend API

Backend URL handling

you have an inconsistency in how you use BACKEND_URL -> url(): sometimes {url()}sources, sometimes {url()}/extract
I changed it now to always use a slash /, i.e. {url()}/extract
so you have to set the environment variable without a trailing slash, like this: export BACKEND_API_URL="https://studious-dollop-979pxr45x3p4p4-8000.app.github.dev"
ideally url() would remove trailing slashes itself
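
The normalization logic is sketched below in Python for illustration only; the real `url()` helper lives in the TypeScript front-end, and the function names here are hypothetical.

```python
def normalize_base_url(base: str) -> str:
    """Drop trailing slashes so paths can always be appended as '/path'."""
    return base.strip().rstrip("/")


def endpoint(base: str, path: str) -> str:
    """Join a base URL and an endpoint path with exactly one slash between them."""
    return f"{normalize_base_url(base)}/{path.lstrip('/')}"
```

With this in place, the environment variable works with or without a trailing slash, and callers can write either `extract` or `/extract`.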

Create a progress bar for the Generate button

Add some sort of feedback when the user clicks on "Generate Graph". The button should show that the files are processing and then indicate completion once the job is done.