Comments (9)

PhilippChr commented on June 3, 2024

Hi,
I think it would help if you could elaborate a bit more on the data you want to process (which KB? which version? from where?).
Turtle (ttl) uses prefixes, so a subset taken out of context is not always well-defined: without the matching @prefix declarations, its prefixed names cannot be resolved.
So, in general, it is not straightforward to convert smaller ttl files into nt files.
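
To illustrate the point about prefixes, here is a minimal sketch of how a prefixed name is expanded into the full IRI that N-Triples requires. The helper names (`parse_prefixes`, `expand`) are hypothetical and only handle simple `@prefix` lines; the two namespaces are the standard Wikidata ones.

```python
import re

# Prefix declarations as they would appear at the top of a Turtle file.
HEADER = """\
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .
"""

# Matches one simple declaration: @prefix label: <namespace> .
PREFIX_RE = re.compile(r"@prefix\s+(\w*):\s+<([^>]*)>\s*\.")


def parse_prefixes(header):
    """Map each declared prefix label to its namespace IRI."""
    return dict(PREFIX_RE.findall(header))


def expand(term, prefixes):
    """Expand a prefixed name such as wd:Q42 to a full IRI in N-Triples form."""
    label, _, local = term.partition(":")
    if label not in prefixes:
        # This is what happens to a chunk that lost its header.
        raise KeyError(f"prefix {label!r} is not declared in this chunk")
    return f"<{prefixes[label]}{local}>"


prefixes = parse_prefixes(HEADER)
print(expand("wd:Q42", prefixes))  # <http://www.wikidata.org/entity/Q42>
```

A chunk cut from the middle of the file has no such header, so `expand` would fail for every term in it.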
Regards,
Philipp

from wikidata-core-for-qa.

TurquoiseDM commented on June 3, 2024

Sorry for my carelessness. The data I want to process is an RDF dump of Wikidata dated January 23, 2023, in ttl format. It can be found at https://dumps.wikimedia.org/wikidatawiki/entities/20230123/. (There is also an nt file with the same content on this website, but I do not have enough storage space for it, so I can only use the ttl file.)

PhilippChr commented on June 3, 2024

No worries!
So I understand the problem, but unfortunately I do not know of an immediate workaround.
Maybe you could try to create smaller ntriples files from the ttl file.
The pruning/cleaning process splits the ntriples file anyway, so you could then skip that step.

Another option, depending on your specific use case, would be to use a slightly older dump.
We provide pre-filtered dumps for download here: https://github.com/PhilippChr/wikidata-core-for-QA#Downloads.

TurquoiseDM commented on June 3, 2024

Thank you for your reply. It is really helpful. I will give it a try.

cdhx commented on June 3, 2024

Hi, thanks for publishing this valuable tool.

My question is similar to @TurquoiseDM's.

I would like to know if it is possible to do this through the following process:

  1. split the big ttl file into several smaller ttl files
  2. convert each of them to an nt file
  3. use your code to process them one by one?

For the ttl-to-nt conversion, it seems (from my observations) that I just need to copy the prefix declarations into each ttl file and make sure not to split in the middle of an entity's description.
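
The splitting idea could be sketched roughly as follows. This is only an illustration under two assumptions that should be checked against the actual dump: all `@prefix` lines appear before the first triple, and blank lines occur only between complete entity blocks. The function name `split_ttl` and the sample data are hypothetical.

```python
def split_ttl(lines, max_lines):
    """Yield chunks (lists of lines) that each start with the prefix header.

    Assumes every @prefix line precedes the first triple and that a blank
    line only ever separates complete entity blocks.
    """
    header, chunk = [], []
    for line in lines:
        if line.startswith("@prefix"):
            header.append(line)
            continue
        if not line.strip():
            # Blank line: an entity block just ended, so cutting here is safe.
            if len(chunk) >= max_lines:
                yield header + chunk
                chunk = []
            continue
        chunk.append(line)
    if chunk:
        # Emit whatever is left after the last cut.
        yield header + chunk


sample = [
    "@prefix wd: <http://www.wikidata.org/entity/> .\n",
    "@prefix wdt: <http://www.wikidata.org/prop/direct/> .\n",
    "\n",
    "wd:Q1 wdt:P31 wd:Q2 ;\n",
    "\twdt:P279 wd:Q3 .\n",
    "\n",
    "wd:Q4 wdt:P31 wd:Q5 .\n",
]
chunks = list(split_ttl(sample, max_lines=2))
print(len(chunks))  # 2
```

Because `split_ttl` accepts any iterable of lines, an open file handle can be passed directly, so the full dump never has to be held in memory. Each chunk can then be written to its own ttl file and converted to nt with an RDF tool of your choice.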

Another question: does splitting first and then processing the files one by one conflict with your code?

PhilippChr commented on June 3, 2024

Hi,
Yes, this sounds like a reasonable approach!
The prefix is indeed important.

In the code, the large ntriples file is also split into several smaller ones, which are processed in parallel. We follow a naming convention for these files, so you would need to rename your ntriples files to match the following pattern:

FILES = glob.glob(os.getcwd() + "/tmp_dumps/wd*")

The line to split the files in the bash script could then be dropped:

split -l $TMP_DUMP_LINES $WIKIDATA_DUMP_PATH tmp_dumps/wd
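
One way to bring self-made ntriples files into that naming scheme is sketched below. The helper `stage_ntriples`, the numbered suffix, and the example source path are assumptions made for illustration; the only requirement taken from the thread is that the files end up matching `tmp_dumps/wd*`.

```python
import glob
import os
import shutil


def stage_ntriples(src_pattern, dest_dir="tmp_dumps", stem="wd"):
    """Move files matching src_pattern to dest_dir/<stem>0000, <stem>0001, ...

    Returns the list of new paths, which all match dest_dir/<stem>*.
    """
    os.makedirs(dest_dir, exist_ok=True)
    staged = []
    for i, path in enumerate(sorted(glob.glob(src_pattern))):
        target = os.path.join(dest_dir, f"{stem}{i:04d}")
        shutil.move(path, target)
        staged.append(target)
    return staged
```

For example, `stage_ntriples("chunks/*.nt")` (with a hypothetical `chunks` directory) would be run from the repository root before starting the pipeline with the `split` line removed.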

Other than that, I do not see a problem right now.

Regards,
Philipp

PhilippChr commented on June 3, 2024

You may also want to check out our public API at https://clocq.mpi-inf.mpg.de,
which serves a Wikidata dump from 2022 and provides convenient, QA-specific access to KB functionalities.

cdhx commented on June 3, 2024

Thanks for your detailed reply; it is really helpful. Happy Valentine's Day!

PhilippChr commented on June 3, 2024

Thank you, same for you! :)
