nichtich / wikidata-taxonomy
command-line tool to extract taxonomies from Wikidata
Home Page: https://www.npmjs.org/package/wikidata-taxonomy
License: MIT License
great module!
If you have an example where wdk.simplifySparqlResults crashes as suggested by your comment, I would be happy to have a look at it :)
Sometimes a label is shown as ??? (three question marks). The given Wikidata ID does not exist when looked up on the Wikidata website, yet instances and other information exist.
Is this a bug?
For example,
wdtaxonomy Q2516517
Returns
transport sciences (Q2516517) •2 ↑
├──intelligent transportation system (Q508378) •23
├──transport economics (Q660564) •9 ↑
├──transport engineering (Q775325) •22 ↑
│ └──traffic engineering (Q1640676) •13
│ └──Technology of rail vehicles (Q2234610) •2 ↑
├──transportation geography (Q795612) •19 ↑
├──transport planning (Q1034047) •16 ×2 ↑
├──??? (Q1230796) •1 ↑
├──??? (Q1308085) •3 ↑↑
├──traffic psychology (Q1362446) •12 ↑
├──transport law (Q1996243) •8 ↑
├──effects of the automobile on societies (Q2215004) •3
├──??? (Q2516123) •2 ↑
├──traffic education (Q2516186) •2 ↑
├──Timeline of transportation technology (Q2516265) •5 ↑
├──??? (Q2516343) •1 ↑
├──??? (Q2516344) •1 ↑
├──??? (Q2516371) •1 ↑
├──??? (Q2516390) •1 ↑
├──??? (Q2516430) •1 ↑
├──transport ecology (Q2516529) •1 ↑
├──??? (Q20820139) •1
└──??? (Q20850681) •1 ↑
Big or medium output needs some ordering (order by prefLabel) for better usability. Perhaps an "order-by" option.
For example, wdtaxonomy -l pt-BR,pt,es,en -P P31 Q485258 generated an unordered list.
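One possible approach is to sort each level of the tree by label before printing. The following is only a sketch: it assumes a JSKOS-like concept shape with `prefLabel` (a language map) and `narrower` (an array), and the names `sortByLabel` and `lang` are hypothetical, not part of wdtaxonomy.

```javascript
// Hypothetical helper: recursively sort a concept tree by label.
// Assumes each concept looks like { prefLabel: { en: '...' }, narrower: [...] }.
function sortByLabel (concept, lang = 'en') {
  const label = c => (c.prefLabel && c.prefLabel[lang]) || ''
  const narrower = (concept.narrower || [])
    .map(child => sortByLabel(child, lang))
    .sort((a, b) => label(a).localeCompare(label(b), lang))
  return { ...concept, narrower }
}

const tree = {
  prefLabel: { en: 'transport sciences' },
  narrower: [
    { prefLabel: { en: 'transport law' } },
    { prefLabel: { en: 'traffic psychology' } },
    { prefLabel: {} } // items without a label in this language sort first
  ]
}
const sorted = sortByLabel(tree)
console.log(sorted.narrower.map(c => c.prefLabel.en))
```

Sorting on the client keeps the SPARQL query unchanged; an ORDER BY in the query would be an alternative, but it would not by itself order the nested tree.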
e.g. try wdtaxonomy Q2623243 -m class. Maybe like a missing DISTINCT clause?
With option --reverse
Should be generated from README.md with option --man. Also required for Debian packaging?
this is an amazing tool!
thank you for developing.
Seems to be a costly operation, unless there is a special Blazegraph service to get property usage count.
Simple message after this commandline:
node wdtaxonomy.js Q35120 --format json
When I ran it with --sparql, I got this query:
SELECT ?item ?itemLabel ?broader ?parents ?instances ?sites
WHERE {
{
SELECT ?item (count(distinct ?parent) as ?parents) {
?item wdt:P279* wd:Q35120
OPTIONAL { ?item wdt:P279 ?parent }
} GROUP BY ?item
}
{
SELECT ?item (count(distinct ?element) as ?instances) {
?item wdt:P279* wd:Q35120
OPTIONAL { ?element wdt:P31 ?item }
} GROUP BY ?item
}
{
SELECT ?item (count(distinct ?site) as ?sites) {
?item wdt:P279* wd:Q35120
OPTIONAL { ?site schema:about ?item }
} GROUP BY ?item
}
OPTIONAL { ?item wdt:P279 ?broader }
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en" .
}
and then it stopped. Not sure what I am doing wrong. The chosen root is "entity" which appears to be at least one important root (perhaps the root?) at Wikidata
Implemented in 0.3.1 (option --total), better documentation needed.
For instance
wdtaxonomy Q522190^
would result in
wdtaxonomy Q863247
because Q863247 is the only parent of Q522190.
However #8 may make this feature not necessary.
See wikidata-cli for how to implement
see tee for an example. The screenshot https://commons.wikimedia.org/wiki/File:Wdtaxonomy-example.png should be updated afterwards
Difficult to do in SPARQL, maybe repeated queries, level by level?
Before/instead of checking whether a non-class exists, classes should be queried which the item is instance of.
If an item is no class but an instance, show with the class it belongs to.
$ wdtaxonomy Q3399
Cannot read property 'endpoint' of undefined
should give a meaningful error message instead
serializeTaxonomy.txt(taxonomy, process.stdout, { colors: true });
has error
TypeError: Cannot read property 'delimiter' of undefined
I believe the error traces to (wikidata-taxonomy/lib/serialize-txt.js:24:25) where it is looking for env.chalk but none is specified.
The command line script should be a wrapper to a module
Just tried the tool after some period of time.
Unfortunately, wikidata-taxonomy fails with an error:
> wdtaxonomy Q634 -v
Error: SPARQL request failed
at XXXXX\npm\node_modules\wikidata-taxonomy\lib\query.js:32:13
at processTicksAndRejections (node:internal/process/task_queues:96:5)
Running Wikibase locally, I can generate results via curl:
curl http://localhost:8989/bigdata/sparql?SELECT%20DISTINCT%3Fp%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D
But trying to query the same endpoint with wikidata-taxonomy returns data from Wikidata instead:
node wdtaxonomy.js Q3 --sparql-endpoint http://localhost:8989/bigdata/sparql
life (Q3) •188 ↑↑↑
├──extraterrestrial life (Q181508) •81 ×1 ↑
│ ├──life on Mars (Q601319) •34 ×1
│ ├──Martian (Q913850) •25 ×4
│ ├──Life on Titan (Q2591050) •15
│ └──extraterrestrial intelligence (Q15107669) •7
├──personal life (Q2867027) •20
└──human life (Q19771042) •3
I get the same result if I install wikidata-taxonomy globally with npm install -g
It's late night so I'll toss a theory: does it implicitly depend on properties such as P279 existing in the target endpoint, and it falls back to Wikidata if the query to the specified endpoint doesn't return the expected data?
wdtaxonomy --isa Qparent Qchild
wdtaxonomy --broader Qparent Qchild
Both should check whether an instance-of/subclass-of relation exists to skip or modify (the latter requires maxlath/wikibase-edit#2). An item should never be both instance and subclass of the same other item.
I want to get the whole taxonomy of wikidata.
Is there any other method I can use to achieve that goal?
Similar to (and maybe based on) https://github.com/AngryLoki/wikidata-graph-builder
likely requires #6
The reverse switch (-r) takes the query identifier and creates a reverse tree. In other words, the query is on line 1, the superclass is on line 2, etc.
Would it be possible to provide an extra switch that would change the order of the 'reverse' switch? In other words, line 1 would be the root term (e.g. entity) and the tree would be constructed from the root to the query term.
What is the reasoning for this?
In order to see a large class structure, one might run a children (subclass) query from, for example, level 5 in the class tree. That query produces the target result where the 'top' term is the target and the children are listed below.
Now I would like to integrate the subclass results with the superclasses for the search term. But the superclass query result is not in the same form/shape as the subclass query results.
I have to do significant editing to reorder the superclass query to fit with the subclass query.
A switch -rt (reverse top) would allow the simple combination of a superclass query and a subclass query.
Thanks for creating a really useful tool.
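The requested root-first ordering could be sketched as follows. This assumes the superclass result is a single linear chain (real Wikidata classes often have several parents, which this does not handle), and the names `rootFirst`, `chain` and `subtree`, as well as the placeholder ids, are hypothetical.

```javascript
// Hypothetical helper: nest a linear chain of superclasses root-first.
// `chain` is ordered like the -r output: query item first, root last.
// `subtree` holds the query item's subclass tree, attached at the bottom.
function rootFirst (chain, subtree) {
  return chain.slice(1).reduce(
    (child, parent) => ({ ...parent, narrower: [child] }),
    { ...chain[0], ...subtree }
  )
}

// Placeholder ids, not real Wikidata items.
const chain = [{ id: 'target' }, { id: 'parent' }, { id: 'root' }]
const combined = rootFirst(chain, { narrower: [{ id: 'child' }] })
console.log(JSON.stringify(combined, null, 2))
```

The result has the same shape as a subclass tree, so both query results could be printed by the same serializer.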
The current (0.5.0) output does not distinguish mapping types because they are all stored as identifier.
As commented in #39, multiple languages can be queried e.g.
wdtaxonomy -l en,de Q2516517
This should be better documented, but there is also a bug in the language codes of JSON output, e.g.
wdtaxonomy -l en,de Q2516517 -j
...
"prefLabel": {
"en,de": "Verkehrsdidaktik"
}
...
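A possible fix would key each label by the language tag of the returned literal instead of the raw option value. The sketch below assumes labels arrive as SPARQL JSON result literals carrying an `xml:lang` field; the helper names `parseLanguages` and `prefLabelFromLiteral` are hypothetical and not taken from the wdtaxonomy source.

```javascript
// Split the comma-separated -l option value into individual language codes.
function parseLanguages (option) {
  return option.split(',').map(s => s.trim()).filter(Boolean)
}

// Hypothetical helper: key a label by the language tag of the literal
// itself (assumed to be present as "xml:lang"), not by the option string.
function prefLabelFromLiteral (literal) {
  const lang = literal['xml:lang'] || 'und' // "und" = undetermined language
  return { [lang]: literal.value }
}

console.log(parseLanguages('en,de'))
console.log(prefLabelFromLiteral({
  type: 'literal',
  'xml:lang': 'de',
  value: 'Verkehrsdidaktik'
}))
```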
Tree view is similar, have a look at it and compare
I get this output running wdtaxonomy -f csv Q634 on node v9.1.0. It works fine on v6.10.0.
previously wdtaxonomy worked perfectly
wdtaxonomy -V
0.6.6
recently I upgraded to Node.js v19.6.0
now when I run, for example:
wdtaxonomy -c Q35120
I see:
SPARQL request failed
Have I made an error (forgetting something since the last time I successfully used wdtaxonomy)?
Is some dependency causing this error?
Is there a work-around?
Do you need more information from me to debug this issue?
Thanks for your help here.
/jay
I was pulling a CSV output of the product taxonomy tree, which is quite large. It failed to parse as CSV because labels that include a quote character are not quoted themselves, so the first item to fail was:
-----,Q6109076,JTL-E .500 S&W Magnum 12",1,0,
Would it make sense for all labels to be quoted by default?
Just tried the latest release that has quoting and found that it escapes a " using \". According to the RFC this is wrong, it should use "".
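The escaping rule from RFC 4180 (wrap the field in double quotes and double any embedded double quote) can be sketched like this. The function names `csvField` and `csvRow` are illustrative and not taken from the wdtaxonomy source, and quoting every field is just one simple choice.

```javascript
// Quote a CSV field as described in RFC 4180: wrap it in double quotes
// and represent each embedded double quote as two double quotes.
function csvField (value) {
  return '"' + String(value).replace(/"/g, '""') + '"'
}

// Join already-ordered field values into one CSV record.
function csvRow (fields) {
  return fields.map(csvField).join(',')
}

// The failing row from the report above.
console.log(csvRow(['-----', 'Q6109076', 'JTL-E .500 S&W Magnum 12"', 1, 0, '']))
```

Quoting only the fields that contain a comma, quote or line break would also be valid; quoting everything avoids having to detect those cases.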
I've been using wdtaxonomy (v 0.6.6) happily for many months on my macbook running 10.14.5. Starting yesterday, every call I make (e.g., "wdtaxonomy -c Q5") produces an immediate "SPARQL request failed" message.
I tried capturing the SPARQL queries with --sparql and pasting them into the Wikidata query service web page, and they work. I also tried passing the standard query service URL with --sparql-endpoint and that did not help. I tried uninstalling and then installing again, which did not fix the problem.
Might it be due to this: https://lists.wikimedia.org/pipermail/wikidata/2019-June/013161.html ?
Any suggestions?
because concepts must be an array instead of an object.
This is a command-line tool, but with iTerm it's possible to Cmd-click on a URL and have it open in a browser. This would be very handy when scanning a taxonomy list and needing to open a few items in the list quickly.
Is there any current means to achieve something like this?
e.g. {{Q'|Q634}} ...
The "OR list" has many applications, see one example here or this analytic query...
Suggestion: use a comma-separated list such as P1709,P2888 to recognize an "OR list".
Example:
wdtaxonomy -m P1709 Q33999 |grep schema.org
is the default command, but returns empty.
wdtaxonomy -m P1709,P2888 Q33999 |grep schema.org
as default command is better, will return something.
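One way such an "OR list" could be translated is into a SPARQL property path with alternatives (`|`). This is only a sketch: the helper name `mappingPath` and the `?item`/`?mapping` variables are hypothetical and do not reflect how wdtaxonomy builds its queries.

```javascript
// Hypothetical helper: turn "P1709,P2888" into the SPARQL property path
// "wdt:P1709|wdt:P2888", which matches a statement using either property.
function mappingPath (option) {
  const ids = option.split(',').map(s => s.trim()).filter(Boolean)
  for (const id of ids) {
    if (!/^P[1-9][0-9]*$/.test(id)) {
      throw new Error(`not a property id: ${id}`)
    }
  }
  return ids.map(id => `wdt:${id}`).join('|')
}

console.log(`OPTIONAL { ?item ${mappingPath('P1709,P2888')} ?mapping }`)
```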
The "grep external ontology" has many applications, see one example here.
The problem with a simple grep is with intermediate branches...
Example of wdtaxonomy -m P1709 Q732577 | grep schema.org:
╞══news article (Q5707594) •4 ×15727 ↑ … = http://schema.org/NewsArticle
│ │ ├──atlas (Q162827) •70 ×51 ↑ = http://schema.org/Atlas
├──report (Q10870555) •30 ×7908 = http://schema.org/Report
The real branch for atlas is not news article:
├──educational material (Q6006020) •2 ×7
├──reference work (Q13136) •31 ×191 ↑↑
├──atlas (Q162827) •70 ×51 ↑ = http://schema.org/Atlas
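Instead of grepping the text output, the tree itself could be filtered so that every match keeps its real ancestors. The sketch below assumes a simplified concept shape with `label`, `mappings` and `narrower`; the names `prune` and `isSchemaOrg`, the nesting of the sample data and the "unmapped item" entry are made up for illustration.

```javascript
// Hypothetical helper: keep a concept if it matches or if any descendant
// matches, so each hit is shown below its real ancestors.
function prune (concept, matches) {
  const narrower = (concept.narrower || [])
    .map(child => prune(child, matches))
    .filter(child => child !== null)
  if (!matches(concept) && narrower.length === 0) return null
  return { ...concept, narrower }
}

const isSchemaOrg = c =>
  (c.mappings || []).some(uri => uri.startsWith('http://schema.org/'))

// Sample data; the nesting is assumed, not taken from a real query.
const taxonomy = {
  label: 'educational material',
  narrower: [
    {
      label: 'reference work',
      narrower: [{ label: 'atlas', mappings: ['http://schema.org/Atlas'] }]
    },
    { label: 'unmapped item' }
  ]
}
const pruned = prune(taxonomy, isSchemaOrg)
console.log(JSON.stringify(pruned, null, 2))
```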
This should speed up and avoid timeouts for very large hierarchies. Output in tree format could be like this:
http://www.wikidata.org/entity/Q634 •202 ×5 ↑
├──http://www.wikidata.org/entity/Q44559 •88 ×2961 ↑
in JSON format there would be no prefLabel and notation.
One-letter lowercase options available: -a, -g, -j, -k, -x, -y, -z.
Maybe -L, --no-labels?
By the way, the full class hierarchy contains more than 2 million statements so getting all would probably still not be possible.
e.g. wdtaxonomy P2561 should use property P1647 (subproperty of) to extract a taxonomy.
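Choosing the hierarchy property from the kind of identifier could look roughly like this. The helper name `hierarchyProperty` is hypothetical; P279 (subclass of) is the property used for items in the queries quoted elsewhere on this page.

```javascript
// Hypothetical helper: pick the property that links an entity to its
// broader entity. P279 (subclass of) for items, P1647 (subproperty of)
// for properties.
function hierarchyProperty (id) {
  if (/^Q[1-9][0-9]*$/.test(id)) return 'P279'
  if (/^P[1-9][0-9]*$/.test(id)) return 'P1647'
  throw new Error(`not a Wikidata item or property id: ${id}`)
}

console.log(hierarchyProperty('Q634'))
console.log(hierarchyProperty('P2561'))
```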
It would be nice to allow serializing arbitrary JSKOS data sets as tree, so factor out serialization modules. This requires to:
foaf:topic
)sites
and instances
At least the last likely requires extension of JSKOS to handle usage statistics (number of records indexed with some concept in a given database).
See https://lucaswerkmeister.github.io/wikidata-ontology-explorer/, e.g.
SELECT ?property ?propertyLabel ?count WITH {
SELECT ?property (COUNT(DISTINCT ?statement) AS ?count) WHERE {
?item wdt:P279* wd:Q6423319 ;
?p ?statement.
?property a wikibase:Property;
wikibase:claim ?p.
FILTER(?property != wd:P279)
}
GROUP BY ?property
ORDER BY DESC(?count)
LIMIT 15
} AS %results WHERE {
INCLUDE %results.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY DESC(?count)
could be applied to either classes or to instances (-i) with typical properties (-a) or typical statements (-A). Output formats: default (colored), csv, json
Requires https://www.npmjs.com/package/graphlib-dot. Maybe better factor out to new module jskos-writers?
e.g. see also is used to link properties. The list of related link properties needs to be specified by an additional option (a/g/k/x/y/z?)