snap-stanford / stark

STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases (https://stark.stanford.edu/)

Home Page: https://stark.stanford.edu/

License: MIT License

Python 82.49% Jupyter Notebook 17.51%

llm multimodal graph nlp knowledge-base information-retrieval semi-structured-data

stark's Issues

Textual properties available?

As shown in Figure 3, each node has some textual properties, such as "push-along tricycle" and "fun and safe for kids". Are these textual properties available anywhere in the repo?
So far, all I have found is the complete information for each node, in a format like:
{'product': ['title',
'dimensions',
'weight',
'description',
'features',
'reviews',
'Q&A'],
'brand': ['brand_name']}
rather than the phrase-like textual properties shown in Figure 3.

STaRK-Prime answers wrong?

During my exploration of the STaRK-Prime dataset, I looked into a few questions (specifically the human-generated ones). I found a couple of answers that seem strange: the answer to the question is the topic entity itself.

For example, check question index 47 for the STaRK-Prime dataset (human-generated): "What diseases is exposure to 2,3',4,4',5-pentachlorobiphenyl associated with?", the answer ID is 61686. The name of node 61686 is "2,3',4,4',5-pentachlorobiphenyl", which is already mentioned in the question. I see the same kind of result for question index 62.

Is this the expected behavior? If so, could you explain why? I would have expected answers that differ from the topic entity, especially in the human-generated set.

You can reproduce this by running the following code:

from stark_qa import load_qa, load_skb

dataset_name = 'prime'

qa_dataset = load_qa(dataset_name, human_generated_eval=True)
idx_split = qa_dataset.get_idx_split()

skb = load_skb(dataset_name, download_processed=False, root='.')

qa_dataset[47]
# Output
("What diseases is exposure to 2,3',4,4',5-pentachlorobiphenyl associated with?",
 47,
 [61686],
 None)

print(skb.get_doc_info(61686, add_rel=True))
# Output
- name: 2,3',4,4',5-pentachlorobiphenyl
- type: exposure
- source: CTD
- relations:
  parent-child: {exposure: (2,2',3',4,4',5-hexachlorobiphenyl, 2,4,4',5-tetrachlorobiphenyl, Endocrine Disruptors, Environmental Pollutants, Pesticides, Polychlorinated Biphenyls, 2,2',3,3',4,4',5-heptachlorobiphenyl, 2,3,3',4,4',5-hexachlorobiphenyl, 2,4,5,2',4',5'-hexachlorobiphenyl, Hydrocarbons, Chlorinated, Organic Chemicals, Thyroxine, Triiodothyronine),}
  interacts_with: {gene/protein: (TSHB, SERPINA7),biological_process: (thyroid hormone metabolic process, cognition, regulation of thyroid-stimulating hormone secretion, production of molecular mediator of immune response, regulation of bone mineralization, hypermethylation of CpG island, male meiosis chromosome separation),}
  linked_to: {disease: (osteoporosis, metabolic syndrome X, non-Hodgkin lymphoma, respiratory tract infectious disease, fatty liver disease, colorectal neoplasm),}
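
To gauge how often this pattern occurs, one could scan the QA records for answers whose node name already appears in the query text. The following is a minimal sketch, not the authors' tooling. It assumes each record is a `(query, query_id, answer_ids, meta)` tuple as in the output above; `node_name` is a hypothetical id-to-name lookup that would need to be backed by the knowledge base, and the second toy record (including its id) is made up for illustration.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Assumed record shape: (query, query_id, answer_ids, meta)
Record = Tuple[str, int, List[int], Optional[dict]]


def find_self_answering(
    records: Iterable[Record],
    node_name: Callable[[int], str],  # hypothetical lookup: node id -> name
) -> List[Tuple[int, List[int]]]:
    """Return (query_id, answer_ids) pairs where an answer's name occurs in the query."""
    flagged = []
    for query, query_id, answer_ids, _meta in records:
        query_lower = query.lower()
        hits = [a for a in answer_ids if node_name(a).lower() in query_lower]
        if hits:
            flagged.append((query_id, hits))
    return flagged


# Toy data: the first record mirrors the example above, the second is invented.
names = {61686: "2,3',4,4',5-pentachlorobiphenyl", 1: "osteoporosis"}
records = [
    ("What diseases is exposure to 2,3',4,4',5-pentachlorobiphenyl associated with?",
     47, [61686], None),
    ("Which disease weakens bones?", 0, [1], None),
]
print(find_self_answering(records, names.__getitem__))
```

Plain substring matching is deliberately crude. It will miss paraphrased entity names, so a flagged list from this check is a lower bound rather than a full audit.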

Unable to Find Synthetic Query Generation Script

STaRK is extremely relevant and well-suited to our use case. We are currently working on replicating a benchmark approach similar to STaRK, but we are unable to find scripts that can be adjusted or reused to produce synthetic queries. Are these scripts not available in the GitHub repository?

Error during processing of Amazon data from scratch

Reprocessing from the raw "amazon" data fails with an error, apparently because the code uses the bare name review_columns where a reference to self.review_columns is needed:

Traceback (most recent call last):
  File "/home/ec2-user/stark/main.py", line 7, in <module>
    kb = get_semistructured_data(dataset_name, download_processed=False)
  File "/home/ec2-user/stark/src/benchmarks/get_semistruct.py", line 9, in get_semistructured_data
    kb = AmazonSemiStruct(root=data_root,
  File "/home/ec2-user/stark/src/benchmarks/semistruct/amazon.py", line 113, in __init__
    processed_data = self._process_raw(categories)
  File "/home/ec2-user/stark/src/benchmarks/semistruct/amazon.py", line 344, in _process_raw
    node_info = self.construct_raw_node_info(df_meta_reduced, df_review_reduced, df_qa_reduced)
  File "/home/ec2-user/stark/src/benchmarks/semistruct/amazon.py", line 525, in construct_raw_node_info
    df_row_to_dict(df_i, colunm_names=review_columns \
NameError: name 'review_columns' is not defined
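
The mechanism behind this kind of NameError can be reproduced in isolation. The sketch below is a toy class, not the repository's code: `ToyKB` and its column names are hypothetical, and it only illustrates why a bare name inside a method fails while the `self.`-qualified reference works.

```python
class ToyKB:
    # Hypothetical column list standing in for the real attribute.
    review_columns = ["col_a", "col_b"]

    def broken(self):
        # A bare name is resolved in local, global, and builtin scopes only,
        # never in the class body, so this raises NameError when called.
        return review_columns

    def fixed(self):
        # Qualifying with self finds the attribute on the instance or class.
        return self.review_columns


kb = ToyKB()
try:
    kb.broken()
except NameError as err:
    print("broken:", err)
print("fixed:", kb.fixed())
```

If the attribute really is defined on the class or instance in amazon.py, changing the call site to `self.review_columns` would be the analogous fix. That has not been verified against the repository here.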

Problems with Python 3.8

Hi,

You mentioned that Python 3.8 is required. I tried this on my Mac with conda and got an error:

ERROR: Ignored the following versions that require a different python version: 1.2.0 Requires-Python >=3.9; 1.2.1 Requires-Python >=3.9; 1.2.1rc1 Requires-Python >=3.9
ERROR: Could not find a version that satisfies the requirement cupy-cuda11x==12.2.0 (from versions: none)
ERROR: No matching distribution found for cupy-cuda11x==12.2.0

Full log:

(base) tobiasoberrauch@Tobiass-MBP stark % conda create -n stark python=3.8
Channels:

defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

Package Plan

environment location: /opt/anaconda3/envs/stark

added / updated specs:
- python=3.8

The following packages will be downloaded:

package                    |            build
---------------------------|-----------------
pip-23.3.1                 |   py38hca03da5_0         2.6 MB
python-3.8.19              |       hb885b13_0        12.5 MB
setuptools-68.2.2          |   py38hca03da5_0         934 KB
wheel-0.41.2               |   py38hca03da5_0         107 KB
xz-5.4.6                   |       h80987f9_0         372 KB
------------------------------------------------------------
                                       Total:        16.5 MB

The following NEW packages will be INSTALLED:

ca-certificates pkgs/main/osx-arm64::ca-certificates-2024.3.11-hca03da5_0
libcxx pkgs/main/osx-arm64::libcxx-14.0.6-h848a8c0_0
libffi pkgs/main/osx-arm64::libffi-3.4.4-hca03da5_0
ncurses pkgs/main/osx-arm64::ncurses-6.4-h313beb8_0
openssl pkgs/main/osx-arm64::openssl-3.0.13-h1a28f6b_0
pip pkgs/main/osx-arm64::pip-23.3.1-py38hca03da5_0
python pkgs/main/osx-arm64::python-3.8.19-hb885b13_0
readline pkgs/main/osx-arm64::readline-8.2-h1a28f6b_0
setuptools pkgs/main/osx-arm64::setuptools-68.2.2-py38hca03da5_0
sqlite pkgs/main/osx-arm64::sqlite-3.41.2-h80987f9_0
tk pkgs/main/osx-arm64::tk-8.6.12-hb8d0fd4_0
wheel pkgs/main/osx-arm64::wheel-0.41.2-py38hca03da5_0
xz pkgs/main/osx-arm64::xz-5.4.6-h80987f9_0
zlib pkgs/main/osx-arm64::zlib-1.2.13-h5a0b063_0

Proceed ([y]/n)?

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: | WARNING conda.core.path_actions:verify(1055): Unable to create environments file. Path not writable.
environment location: /Users/tobiasoberrauch/.conda/environments.txt

done
Executing transaction: \ WARNING conda.core.envs_manager:register_env(66): Unable to register environment. Path not writable or missing.
environment location: /opt/anaconda3/envs/stark
registry file: /Users/tobiasoberrauch/.conda/environments.txt
done

To activate this environment, use

$ conda activate stark

To deactivate an active environment, use

$ conda deactivate

(base) tobiasoberrauch@Tobiass-MBP stark % conda activate stark
pip install -r requirements.txt
Collecting anthropic==0.25.0 (from -r requirements.txt (line 1))
Downloading anthropic-0.25.0-py3-none-any.whl.metadata (18 kB)
Collecting async-timeout==4.0.3 (from -r requirements.txt (line 2))
Using cached async_timeout-4.0.3-py3-none-any.whl.metadata (4.2 kB)
Collecting attrs==23.1.0 (from -r requirements.txt (line 3))
Downloading attrs-23.1.0-py3-none-any.whl.metadata (11 kB)
Collecting bs4==0.0.1 (from -r requirements.txt (line 4))
Downloading bs4-0.0.1.tar.gz (1.1 kB)
Preparing metadata (setup.py) ... done
Collecting certifi==2023.7.22 (from -r requirements.txt (line 5))
Downloading certifi-2023.7.22-py3-none-any.whl.metadata (2.2 kB)
Collecting click==8.1.7 (from -r requirements.txt (line 6))
Using cached click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting cmake==3.27.7 (from -r requirements.txt (line 7))
Downloading cmake-3.27.7-py2.py3-none-macosx_10_10_universal2.macosx_10_10_x86_64.macosx_11_0_arm64.macosx_11_0_universal2.whl.metadata (6.7 kB)
Collecting comm==0.1.4 (from -r requirements.txt (line 8))
Downloading comm-0.1.4-py3-none-any.whl.metadata (4.2 kB)
Collecting contourpy==1.1.1 (from -r requirements.txt (line 9))
Downloading contourpy-1.1.1-cp38-cp38-macosx_11_0_arm64.whl.metadata (5.9 kB)
ERROR: Ignored the following versions that require a different python version: 1.2.0 Requires-Python >=3.9; 1.2.1 Requires-Python >=3.9; 1.2.1rc1 Requires-Python >=3.9
ERROR: Could not find a version that satisfies the requirement cupy-cuda11x==12.2.0 (from versions: none)
ERROR: No matching distribution found for cupy-cuda11x==12.2.0
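
The "(from versions: none)" message means pip found no distribution of cupy-cuda11x that is installable on this interpreter and platform. One possible workaround on a machine without CUDA, such as an Apple-silicon Mac, is to install everything except the CUDA-specific pins. The sketch below is an assumption-laden illustration, not a supported install path: the helper `split_requirements` and the `cupy-cuda` prefix filter are hypothetical, and any code path that actually needs CuPy will still fail.

```python
from typing import Iterable, List, Tuple


def split_requirements(
    lines: Iterable[str],
    skip_prefixes: Tuple[str, ...] = ("cupy-cuda",),  # assumed CUDA-only packages
) -> Tuple[List[str], List[str]]:
    """Split requirement lines into (keep, skipped) by package-name prefix."""
    keep, skipped = [], []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        if line.lower().startswith(skip_prefixes):
            skipped.append(line)
        else:
            keep.append(line)
    return keep, skipped


# Sample lines taken from the log above.
sample = ["anthropic==0.25.0", "contourpy==1.1.1", "cupy-cuda11x==12.2.0"]
keep, skipped = split_requirements(sample)
print("install:", keep)
print("skipped:", skipped)
```

The kept lines could be written to a separate file and passed to `pip install -r`. The skipped list is worth keeping so that it stays clear which dependencies are missing from the environment.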

The answer ID of Amazon is not in the provided Amazon KG

Hi,
Could you help us check why some answer IDs in Amazon are not in the provided Amazon KG file? For example, answer ID 16.

Thanks!
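
A quick way to list every affected query is to compare the answer IDs against the set of node IDs present in the KG file. This is a minimal sketch under stated assumptions: records are taken to be `(query, query_id, answer_ids, meta)` tuples, `missing_answer_ids` is a hypothetical helper, and the toy data below is invented apart from the answer ID 16 mentioned above.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed record shape: (query, query_id, answer_ids, meta)
Record = Tuple[str, int, List[int], Optional[dict]]


def missing_answer_ids(
    records: Iterable[Record],
    known_node_ids: Iterable[int],
) -> Dict[int, List[int]]:
    """Map query_id -> answer ids that are absent from the known node ids."""
    known = set(known_node_ids)  # set gives constant-time membership tests
    missing = {}
    for _query, query_id, answer_ids, _meta in records:
        absent = [a for a in answer_ids if a not in known]
        if absent:
            missing[query_id] = absent
    return missing


# Toy data: node 16 is not among the known nodes.
toy_records = [("q0", 0, [16, 3], None), ("q1", 1, [2], None)]
toy_nodes = [1, 2, 3]
print(missing_answer_ids(toy_records, toy_nodes))
```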

Embedding generation is slow due to inefficient search

In the emb_generate.py file, for every index, the program checks whether it is already in the list of existing indices:

if idx in exisiting_indices:
    continue

This check is O(n), where n is the length of the existing-indices list, so the whole loop is quadratic in the worst case. It can easily be avoided by converting the list into a set before the loop starts. This is a minor change; hopefully the authors can take care of it in the next commit.

exisiting_indices = set(exisiting_indices)
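
The same idea as a self-contained sketch. The function name `pending` and the toy inputs are hypothetical; only the convert-once-then-test pattern is taken from the suggestion above.

```python
from typing import Iterable, List


def pending(all_indices: Iterable[int], existing_indices: Iterable[int]) -> List[int]:
    """Return the indices that still need processing, in their original order."""
    existing = set(existing_indices)  # one-time O(n) conversion
    # Each membership test is now O(1) on average instead of O(n).
    return [idx for idx in all_indices if idx not in existing]


print(pending(range(6), [1, 3, 5]))
```

Converting once outside the loop matters: rebuilding the set on every iteration would bring back the linear cost per index.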

What do the ids mean after each query?

Initially I thought they were "qualified product ids", but then I realized that the query id is also in the list.

For example:
334460,What are some Tercel women's cycling gloves made in China that you would recommend?,"[334457, 334458, 334460, 334461]"

I searched the repository but couldn't find an explanation of the data format.
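
The observation can be checked mechanically by parsing such a row. This sketch makes no claim about what the columns mean, since the source does not say. It assumes only that the row is standard CSV whose third field is a Python-style list literal, and `parse_row` is a hypothetical helper.

```python
import ast
import csv
import io
from typing import List, Tuple


def parse_row(row_text: str) -> Tuple[int, str, List[int]]:
    """Parse one CSV row into (leading id, query text, list of ids)."""
    first_id, text, id_list = next(csv.reader(io.StringIO(row_text)))
    # literal_eval safely evaluates the bracketed list without running code.
    return int(first_id), text, ast.literal_eval(id_list)


row = ('334460,What are some Tercel women\'s cycling gloves made in China '
       'that you would recommend?,"[334457, 334458, 334460, 334461]"')
first_id, text, ids = parse_row(row)
print(first_id in ids)  # True: the leading id also appears in the list
```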

Could you provide the topic entity in each sentence?

The answer ID 95886 is also not in your provided node info file. I used your latest file, so please release the full version when it's ready. This will prevent confusion among many researchers.
