If I just have a big ttl file, what should I do with it? about wikidata-core-for-qa HOT 9 CLOSED

philippchr commented on June 3, 2024

If I just have a big ttl file, what should I do with it?

from wikidata-core-for-qa.

Comments (9)

PhilippChr commented on June 3, 2024

Hi,
I think it would help if you could elaborate a bit more on the data you want to process (which KB? which version? from where?).
Turtle (ttl) relies on prefix declarations, so an arbitrary subset of the file is not necessarily well-defined when taken out of context.
So, in general, it is not straightforward to cut a ttl file into smaller pieces and convert those into nt files.
Regards,
Philipp

TurquoiseDM commented on June 3, 2024

Sorry for my carelessness. The data I want to process is an RDF dump of Wikidata dated January 23, 2023, in ttl format. It can be found here: https://dumps.wikimedia.org/wikidatawiki/entities/20230123/. (There is also an nt file with the same content on that site, but I do not have enough storage space for it, so I can only use the ttl file.)

PhilippChr commented on June 3, 2024

No worries!
So I understand the problem, but do not know of an immediate work-around, unfortunately.
Maybe you could try to recreate smaller ntriple files from the ttl file.
The pruning/cleaning process involves splitting the ntriples file anyway, so you could then skip this step.

Another option, depending on your specific use-case, would be to use a slightly older dump.
We provide pre-filtered dumps for download here: https://github.com/PhilippChr/wikidata-core-for-QA#Downloads.

TurquoiseDM commented on June 3, 2024

Thank you for your reply. It is really helpful. I will give it a try.

cdhx commented on June 3, 2024

Hi, thanks for publishing this valuable tool.

My question is similar to @TurquoiseDM's.

I would like to know whether the following process would work:

1. Split the big ttl file into several smaller ttl files.
2. Convert each of them to an nt file.
3. Use your code to process them one by one.

For the ttl-to-nt conversion, it seems (from my observations) that I just need to copy the prefixes into each ttl file and make sure not to split in the middle of an entity's description.

Another question: does splitting first and then processing the files one by one conflict with your code?

PhilippChr commented on June 3, 2024

Hi,
Yes, this sounds like a reasonable approach!
The prefix is indeed important.
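To illustrate the splitting step discussed above, here is a minimal Python sketch. It is not part of this repository: `split_turtle` and `max_lines` are hypothetical names, and the sketch assumes that prefix declarations start with `@prefix` or `@base` and that blank lines only occur between complete statement blocks (which should be checked against the actual dump). It cuts at blank lines only, so it never splits inside a statement block, but it does not guarantee that all blocks belonging to one entity end up in the same chunk.

```python
import io
from typing import Iterable, Iterator, List


def split_turtle(lines: Iterable[str], max_lines: int) -> Iterator[str]:
    """Yield Turtle chunks, each starting with all prefixes seen so far.

    Assumption: a blank line always separates complete statement blocks,
    so cutting at a blank line never breaks a statement in two.
    """
    prefixes: List[str] = []
    chunk: List[str] = []
    for line in lines:
        line = line.rstrip("\n") + "\n"
        stripped = line.strip()
        if stripped.startswith(("@prefix", "@base")):
            # Remember every declaration so it can be repeated per chunk.
            prefixes.append(line)
        elif stripped:
            chunk.append(line)
        elif len(chunk) >= max_lines:
            # Blank line reached and the chunk is large enough: cut here.
            yield "".join(prefixes + chunk)
            chunk = []
    if chunk:
        yield "".join(prefixes + chunk)


# Tiny in-memory example; a real dump would be streamed from disk instead.
sample = io.StringIO(
    "@prefix wd: <http://www.wikidata.org/entity/> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "\n"
    'wd:Q1 rdfs:label "universe"@en ;\n'
    '    rdfs:label "Universum"@de .\n'
    "\n"
    'wd:Q2 rdfs:label "Earth"@en .\n'
)
chunks = list(split_turtle(sample, max_lines=2))
```

Because the function only iterates over lines, it can be fed an open file handle, so the full dump never has to be held in memory. Each yielded chunk can then be written to its own ttl file and converted to nt with a Turtle parser of your choice.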

In the code, the large ntriples file is also split into several smaller ones, that are processed in parallel. We follow a naming convention for these files, so you would need to rename your ntriple files to fit the following format:

wikidata-core-for-QA/filter_wikidata.py

Line 17 in 556093d

FILES = glob.glob(os.getcwd() + "/tmp_dumps/wd*")
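As a rough sketch of the renaming step, the snippet below moves already converted nt files into a `tmp_dumps` directory under names that match the `wd*` pattern from the line quoted above. The function name `stage_ntriples`, the `*.nt` source pattern, and the numeric suffixes are assumptions made for illustration; the quoted glob only shows that the file names must start with `wd`.

```python
import shutil
from pathlib import Path
from typing import List


def stage_ntriples(src_dir: Path, dest_dir: Path) -> List[Path]:
    """Move every *.nt file from src_dir to dest_dir as wd0000, wd0001, ...

    Assumption: the filter script picks files up via the glob
    "tmp_dumps/wd*", so only the "wd" prefix is required here.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    staged: List[Path] = []
    for index, nt_file in enumerate(sorted(src_dir.glob("*.nt"))):
        target = dest_dir / f"wd{index:04d}"
        shutil.move(str(nt_file), str(target))
        staged.append(target)
    return staged
```

It could be called as `stage_ntriples(Path("converted"), Path("tmp_dumps"))` from the repository root, where `converted` is a hypothetical directory holding the nt files.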

The line to split the files in the bash script could then be dropped:

wikidata-core-for-QA/prepare_wikidata_for_qa.sh

Line 18 in 556093d

split -l $TMP_DUMP_LINES $WIKIDATA_DUMP_PATH tmp_dumps/wd

Other than that, I do not see a problem right now.

Regards,
Philipp

PhilippChr commented on June 3, 2024

You may also want to check out our public API hosted at https://clocq.mpi-inf.mpg.de,
that hosts a Wikidata dump from 2022, and provides convenient and QA-specific access to KB functionalities.

cdhx commented on June 3, 2024

Thanks for your detailed reply, it is really helpful. Happy Valentine's Day!

PhilippChr commented on June 3, 2024

Thank you, same for you! :)
