
stark's Issues

Textual properties available?

As shown in Figure 3, each node appears to have textual properties such as "push-along tricycle" and "fun and safe for kids". Are these textual properties available in the repo?
So far, what I've found is the complete information for each node, in a format like:
{'product': ['title',
'dimensions',
'weight',
'description',
'features',
'reviews',
'Q&A'],
'brand': ['brand_name']}
These are full fields rather than the phrase-like textual properties shown in Figure 3.

STaRK-Prime answers wrong?

During my exploration of the STaRK-Prime dataset, I looked into a few questions (human-generated ones specifically). I've discovered a couple of answers that I find strange, where the answer to the question is the topic entity.

For example, take question index 47 in the STaRK-Prime dataset (human-generated): "What diseases is exposure to 2,3',4,4',5-pentachlorobiphenyl associated with?". The answer ID is 61686, and the name of node 61686 is "2,3',4,4',5-pentachlorobiphenyl", which is already mentioned in the question. I see the same kind of result for question index 62.

Is this behavior expected? If so, could you explain why? I would have expected answers that differ from the topic entity, especially in the human-generated set.

You can reproduce this by running the following code:

from stark_qa import load_qa, load_skb

dataset_name = 'prime'

qa_dataset = load_qa(dataset_name, human_generated_eval=True)
idx_split = qa_dataset.get_idx_split()

skb = load_skb(dataset_name, download_processed=False, root='.')

qa_dataset[47]
# Output
("What diseases is exposure to 2,3',4,4',5-pentachlorobiphenyl associated with?",
 47,
 [61686],
 None)

print(skb.get_doc_info(61686, add_rel=True))
# Output
- name: 2,3',4,4',5-pentachlorobiphenyl
- type: exposure
- source: CTD
- relations:
  parent-child: {exposure: (2,2',3',4,4',5-hexachlorobiphenyl, 2,4,4',5-tetrachlorobiphenyl, Endocrine Disruptors, Environmental Pollutants, Pesticides, Polychlorinated Biphenyls, 2,2',3,3',4,4',5-heptachlorobiphenyl, 2,3,3',4,4',5-hexachlorobiphenyl, 2,4,5,2',4',5'-hexachlorobiphenyl, Hydrocarbons, Chlorinated, Organic Chemicals, Thyroxine, Triiodothyronine),}
  interacts_with: {gene/protein: (TSHB, SERPINA7),biological_process: (thyroid hormone metabolic process, cognition, regulation of thyroid-stimulating hormone secretion, production of molecular mediator of immune response, regulation of bone mineralization, hypermethylation of CpG island, male meiosis chromosome separation),}
  linked_to: {disease: (osteoporosis, metabolic syndrome X, non-Hodgkin lymphoma, respiratory tract infectious disease, fatty liver disease, colorectal neoplasm),}
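
To look for other questions of this kind, one option is to scan for answers whose node name already appears in the query text. The following is a minimal sketch, not part of stark_qa: the helper `find_self_answering` and the `get_name` callable are hypothetical, and the sample data is copied from the output above so the snippet runs without downloading the dataset.

```python
from typing import Callable, Iterable, List, Tuple

# Each record mirrors the tuple printed above: (query, query_id, answer_ids, meta).
QARecord = Tuple[str, int, List[int], object]


def find_self_answering(
    records: Iterable[QARecord],
    get_name: Callable[[int], str],
) -> List[Tuple[int, int]]:
    """Return (query_id, answer_id) pairs whose answer node name occurs in the query."""
    hits = []
    for query, query_id, answer_ids, _meta in records:
        lowered = query.lower()
        for answer_id in answer_ids:
            # Case-insensitive substring match between node name and query text.
            if get_name(answer_id).lower() in lowered:
                hits.append((query_id, answer_id))
    return hits


# Stand-in data taken from the issue; with the real dataset one would iterate
# over qa_dataset and supply a name lookup backed by the SKB (hypothetical here).
names = {61686: "2,3',4,4',5-pentachlorobiphenyl"}
records = [
    ("What diseases is exposure to 2,3',4,4',5-pentachlorobiphenyl associated with?",
     47, [61686], None),
]
print(find_self_answering(records, names.__getitem__))  # [(47, 61686)]
```

A substring match is deliberately crude; it only flags candidates for manual inspection and says nothing about whether the flagged answers are actually wrong.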

Unable to Find Synthetic Query Generation Script

STaRK is extremely relevant and well-suited to our use case. We are currently working on replicating a benchmark approach similar to STaRK, but we are unable to find scripts that can be adjusted or reused to produce synthetic queries. Are these scripts not available in the GitHub repository?

Error during processing of Amazon data from scratch

Reprocessing from the raw "amazon" data fails with an error, apparently because a reference to self.review_columns is missing:

Traceback (most recent call last):
  File "/home/ec2-user/stark/main.py", line 7, in <module>
    kb = get_semistructured_data(dataset_name, download_processed=False)
  File "/home/ec2-user/stark/src/benchmarks/get_semistruct.py", line 9, in get_semistructured_data
    kb = AmazonSemiStruct(root=data_root,
  File "/home/ec2-user/stark/src/benchmarks/semistruct/amazon.py", line 113, in __init__
    processed_data = self._process_raw(categories)
  File "/home/ec2-user/stark/src/benchmarks/semistruct/amazon.py", line 344, in _process_raw
    node_info = self.construct_raw_node_info(df_meta_reduced, df_review_reduced, df_qa_reduced)
  File "/home/ec2-user/stark/src/benchmarks/semistruct/amazon.py", line 525, in construct_raw_node_info
    df_row_to_dict(df_i, colunm_names=review_columns \
NameError: name 'review_columns' is not defined
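
The traceback is consistent with an instance attribute being used as a bare name. The sketch below illustrates that failure mode in isolation. `NodeInfoBuilder` and its column names are hypothetical stand-ins, not the code in amazon.py, and the assumption that the attribute is set on `self` comes only from the issue text.

```python
class NodeInfoBuilder:
    """Hypothetical stand-in for the class in amazon.py; not the repository's code."""

    def __init__(self):
        # Illustrative column names; the real list is defined in the repository.
        self.review_columns = ["reviewer_id", "summary", "review_text"]

    def broken(self):
        # A bare name is resolved in local and global scope, never on the instance,
        # so this raises NameError when called.
        return list(review_columns)  # noqa: F821

    def fixed(self):
        # Going through self finds the attribute set in __init__.
        return list(self.review_columns)


builder = NodeInfoBuilder()
try:
    builder.broken()
except NameError as err:
    print("broken:", err)
print("fixed:", builder.fixed())
```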

Problems with Python 3.8

Hi,

You mentioned that Python 3.8 is required. I tried this on my Mac with conda and got an error:

ERROR: Ignored the following versions that require a different python version: 1.2.0 Requires-Python >=3.9; 1.2.1 Requires-Python >=3.9; 1.2.1rc1 Requires-Python >=3.9
ERROR: Could not find a version that satisfies the requirement cupy-cuda11x==12.2.0 (from versions: none)
ERROR: No matching distribution found for cupy-cuda11x==12.2.0

Full log

(base) tobiasoberrauch@Tobiass-MBP stark % conda create -n stark python=3.8
Channels:
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

Package Plan

environment location: /opt/anaconda3/envs/stark

added / updated specs:
- python=3.8

The following packages will be downloaded:

package                    |            build
---------------------------|-----------------
pip-23.3.1                 |   py38hca03da5_0         2.6 MB
python-3.8.19              |       hb885b13_0        12.5 MB
setuptools-68.2.2          |   py38hca03da5_0         934 KB
wheel-0.41.2               |   py38hca03da5_0         107 KB
xz-5.4.6                   |       h80987f9_0         372 KB
------------------------------------------------------------
                                       Total:        16.5 MB

The following NEW packages will be INSTALLED:

ca-certificates pkgs/main/osx-arm64::ca-certificates-2024.3.11-hca03da5_0
libcxx pkgs/main/osx-arm64::libcxx-14.0.6-h848a8c0_0
libffi pkgs/main/osx-arm64::libffi-3.4.4-hca03da5_0
ncurses pkgs/main/osx-arm64::ncurses-6.4-h313beb8_0
openssl pkgs/main/osx-arm64::openssl-3.0.13-h1a28f6b_0
pip pkgs/main/osx-arm64::pip-23.3.1-py38hca03da5_0
python pkgs/main/osx-arm64::python-3.8.19-hb885b13_0
readline pkgs/main/osx-arm64::readline-8.2-h1a28f6b_0
setuptools pkgs/main/osx-arm64::setuptools-68.2.2-py38hca03da5_0
sqlite pkgs/main/osx-arm64::sqlite-3.41.2-h80987f9_0
tk pkgs/main/osx-arm64::tk-8.6.12-hb8d0fd4_0
wheel pkgs/main/osx-arm64::wheel-0.41.2-py38hca03da5_0
xz pkgs/main/osx-arm64::xz-5.4.6-h80987f9_0
zlib pkgs/main/osx-arm64::zlib-1.2.13-h5a0b063_0

Proceed ([y]/n)?

Downloading and Extracting Packages:

Preparing transaction: done
Verifying transaction: | WARNING conda.core.path_actions:verify(1055): Unable to create environments file. Path not writable.
environment location: /Users/tobiasoberrauch/.conda/environments.txt

done
Executing transaction: \ WARNING conda.core.envs_manager:register_env(66): Unable to register environment. Path not writable or missing.
environment location: /opt/anaconda3/envs/stark
registry file: /Users/tobiasoberrauch/.conda/environments.txt
done

To activate this environment, use

$ conda activate stark

To deactivate an active environment, use

$ conda deactivate

(base) tobiasoberrauch@Tobiass-MBP stark % conda activate stark
pip install -r requirements.txt
Collecting anthropic==0.25.0 (from -r requirements.txt (line 1))
Downloading anthropic-0.25.0-py3-none-any.whl.metadata (18 kB)
Collecting async-timeout==4.0.3 (from -r requirements.txt (line 2))
Using cached async_timeout-4.0.3-py3-none-any.whl.metadata (4.2 kB)
Collecting attrs==23.1.0 (from -r requirements.txt (line 3))
Downloading attrs-23.1.0-py3-none-any.whl.metadata (11 kB)
Collecting bs4==0.0.1 (from -r requirements.txt (line 4))
Downloading bs4-0.0.1.tar.gz (1.1 kB)
Preparing metadata (setup.py) ... done
Collecting certifi==2023.7.22 (from -r requirements.txt (line 5))
Downloading certifi-2023.7.22-py3-none-any.whl.metadata (2.2 kB)
Collecting click==8.1.7 (from -r requirements.txt (line 6))
Using cached click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting cmake==3.27.7 (from -r requirements.txt (line 7))
Downloading cmake-3.27.7-py2.py3-none-macosx_10_10_universal2.macosx_10_10_x86_64.macosx_11_0_arm64.macosx_11_0_universal2.whl.metadata (6.7 kB)
Collecting comm==0.1.4 (from -r requirements.txt (line 8))
Downloading comm-0.1.4-py3-none-any.whl.metadata (4.2 kB)
Collecting contourpy==1.1.1 (from -r requirements.txt (line 9))
Downloading contourpy-1.1.1-cp38-cp38-macosx_11_0_arm64.whl.metadata (5.9 kB)
ERROR: Ignored the following versions that require a different python version: 1.2.0 Requires-Python >=3.9; 1.2.1 Requires-Python >=3.9; 1.2.1rc1 Requires-Python >=3.9
ERROR: Could not find a version that satisfies the requirement cupy-cuda11x==12.2.0 (from versions: none)
ERROR: No matching distribution found for cupy-cuda11x==12.2.0
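
The log shows pip finding no installable cupy-cuda11x distribution for this platform ("from versions: none"). One possible workaround is to install from a filtered copy of the requirements that leaves out the CUDA-specific pin. This is only a sketch under a strong assumption: that the parts of STaRK being run do not need that package, which the issue does not confirm. The helper `drop_requirements` is hypothetical.

```python
def drop_requirements(lines, unwanted_prefixes):
    """Return requirement lines whose package name does not start with an unwanted prefix."""
    kept = []
    for line in lines:
        name = line.strip().lower()
        if any(name.startswith(prefix) for prefix in unwanted_prefixes):
            continue  # skip pins such as cupy-cuda11x==12.2.0
        kept.append(line)
    return kept


# Two pins taken from the log above, used here as sample input.
requirements = ["contourpy==1.1.1", "cupy-cuda11x==12.2.0"]
print(drop_requirements(requirements, ("cupy-cuda",)))  # ['contourpy==1.1.1']
```

The filtered lines could then be written to a separate file and passed to pip install -r in place of the original requirements.txt.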

Embedding generation is slow due to inefficient search

In emb_generate.py, for every index, the program checks whether it is already in the list of existing indices:

if idx in exisiting_indices:
    continue

Each check is O(n), where n is the number of existing indices, so the whole loop is quadratic in the worst case. This is easily avoided by converting the list into a set before the loop starts, which makes each lookup O(1) on average. It is a minor change; hopefully the authors can include it in the next commit.

exisiting_indices = set(exisiting_indices)
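
For context, the same idea as a standalone function. This is a minimal sketch rather than the code in emb_generate.py: the function name `pending_indices` and its arguments are made up for illustration.

```python
def pending_indices(all_indices, existing_indices):
    """Return the indices that still need an embedding, preserving input order."""
    existing = set(existing_indices)  # one-time O(n) conversion
    # Set membership is O(1) on average, versus O(n) for a list scan.
    return [idx for idx in all_indices if idx not in existing]


print(pending_indices(range(10), [0, 1, 2, 5]))  # [3, 4, 6, 7, 8, 9]
```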

What do the ids mean after each query?

Initially I thought they were "qualified product ids", but then I realized that the query id is also in the list.

e.g.
334460,What are some Tercel women's cycling gloves made in China that you would recommend?,"[334457, 334458, 334460, 334461]"

I searched the repository files but couldn't find a description of the data format.
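
For anyone inspecting these rows programmatically, the sketch below parses the example row with the standard library and confirms the observation that the leading id also appears in the bracketed list. The names `row_id` and `id_list` are placeholders; the sketch makes no claim about what the columns mean.

```python
import ast
import csv
import io

# The example row from above, verbatim.
row_text = (
    "334460,What are some Tercel women's cycling gloves made in China "
    'that you would recommend?,"[334457, 334458, 334460, 334461]"\n'
)

# csv handles the quoted third field, which itself contains commas.
first, query, raw_list = next(csv.reader(io.StringIO(row_text)))
row_id = int(first)
id_list = ast.literal_eval(raw_list)  # safely parse the list literal

print(row_id, id_list)
print(row_id in id_list)  # True
```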
