
Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.

License: Apache License 2.0


QLever


QLever (pronounced "Clever") is a SPARQL engine that can efficiently index and query very large knowledge graphs with over 100 billion triples on a single standard PC or server. In particular, QLever is fast for queries that involve large intermediate or final results, which are notoriously hard for engines like Blazegraph or Virtuoso. QLever also supports search in text associated with the knowledge base, as well as SPARQL autocompletion.

Here are demos of QLever on a variety of large knowledge graphs, including the complete Wikidata, Wikimedia Commons, OpenStreetMap, UniProt, PubChem, and DBLP. Those demos also feature QLever's context-sensitive autocompletion, which makes SPARQL query construction so much easier. The knowledge graphs are updated regularly. Click on "Index Information" for a short description (with dates) and basic statistics.

If you use QLever in your research work, please cite one of the following publications: our CIKM'17 paper (combination of SPARQL and text search, with extensive evaluation), our CIKM'22 paper (QLever's autocompletion, with extensive evaluation), our 2023 book chapter (survey of knowledge graphs and basics of QLever, with many example queries).

QLever aims at full SPARQL 1.1 support and is almost there. In particular, a first version of SPARQL 1.1 Federated Query (SERVICE) has been implemented since PR #793, and a proof of concept for SPARQL 1.1 Update has been implemented since PR #916. If you find a bug in QLever or in one of our demos, or if you are missing a feature, please open an issue.
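For illustration, here is a minimal federated query of the kind that SERVICE enables (the remote endpoint is just an example):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
SELECT ?human WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    ?human wdt:P31 wd:Q5 .
  }
}
LIMIT 10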

Quickstart

Use QLever via the qlever script, following the instructions on https://github.com/ad-freiburg/qlever-control . The script lets you control everything QLever does, with all the configuration in one place: the so-called Qleverfile. The script comes with a number of example Qleverfiles (in particular, one for each of the demos mentioned above), which makes it very easy to get started and also helps you write a Qleverfile for your own data. If you use QLever via Docker (the default setting), the script pulls the most recent Docker image automatically, so you don't have to download or compile the code.

If the qlever script does not work for you for whatever reason, have a look at the Dockerfile for Ubuntu 22.04 or the Dockerfiles for older Ubuntu versions. The source code of the qlever script also provides information on how to use QLever (in particular, note the functions action_start and action_index).

Older (and no longer quite up-to-date) step-by-step instructions can be found here. QLever's advanced features are described here. For more in-depth information, see the various other .md files in this folder, some of which are outdated, though. For high-level descriptions of how QLever works and experiences with some concrete datasets, see the QLever Wiki.

Contributors

bingmann, bradenmacdonald, buchhold, ckorzen, floriankramer, graue70, greenbene, hannahbast, hendrikhenselmann, jbuerklin, joaomarques90, joburo, joka921, kcaliban, lehmann-4178656ch, manonthegithub, nickg-1, niklas88, nlohmann, qup42, realhannes, robintf, schlegan, unexenu, yarikoptic


QLever's Issues

ORDER BY changes result size when combined with DISTINCT

This was found by Tobias Matysiak: for queries with multiple columns, adding an ORDER BY clause can change the result set.

See, for example, the following query with and without the ORDER BY:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?cityname ?popcount WHERE {
    ?city fb:type.object.type fb:location.citytown .
    ?city fb:type.object.name ?cityname .
    ?city fb:location.location.containedby ?country .
    ?city fb:location.statistical_region.population ?population .
    ?population fb:measurement_unit.dated_integer.number ?popcount .
}
ORDER BY DESC(?popcount)
LIMIT 100

My guess is that DISTINCT needs to sort by all columns combined before any ORDER BY sorting if it is to give truly distinct results. I will have to check the standards documents whether this is actually what DISTINCT does. @Buchhold, any quick knowledge?
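For what it's worth, the SPARQL 1.1 spec applies DISTINCT after ORDER BY, so the two variants above should agree. Until this is fixed, one untested workaround is to isolate the DISTINCT in a subquery and order outside of it:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?cityname ?popcount WHERE {
  { SELECT DISTINCT ?cityname ?popcount WHERE {
      ?city fb:type.object.type fb:location.citytown .
      ?city fb:type.object.name ?cityname .
      ?city fb:location.location.containedby ?country .
      ?city fb:location.statistical_region.population ?population .
      ?population fb:measurement_unit.dated_integer.number ?popcount .
  } }
}
ORDER BY DESC(?popcount)
LIMIT 100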

Multi threaded version breaks on some queries (even with 1 thread)

Working on a set of about 50 queries that work fine with the multi-threaded version, I found the following query, which gives no results with the multi-threaded version but 3580 results on a8b495b (just before the addition of multithreading):

PREFIX fb: <http://rdf.freebase.com/ns/>
PREFIX fbk: <http://rdf.freebase.com/key/>
PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?diocese_name WHERE {
  ?type rdf:label "Religious Jurisdiction"@en .
  ?diocese fb:type.object.type ?type .
  ?diocese fbk:wikipedia.en_title ?diocese_name .
}

I'm still investigating, but it seems that the join of fixed-size tables is somehow broken.

Here is the debug output from the single-threaded version:
singlethread.txt

And from the multi-threaded version:
multithread.txt

Apart from the more verbose output (because of the LRUCache changes and it being in TRACE mode), the problem seems to occur at the following point (note the end):

Single:

DEBUG: Performing PSO scan for full relation: <http://rdf.freebase.com/ns/type.object.name>
DEBUG: Scan done, got 47,236,131 elements.
DEBUG: IndexScan result computation done.
DEBUG: Getting sub-result for Sort result computation...
DEBUG: Getting sub-results for join result computation...
DEBUG: IndexScan result computation...
DEBUG: Performing POS scan for full relation: <http://rdf.freebase.com/ns/type.object.type>
DEBUG: Scan done, got 266,316,909 elements.
DEBUG: IndexScan result computation done.
DEBUG: Join result computation...
DEBUG: Performing join between two fixed width tables.
DEBUG: A: witdth = 2, size = 266,316,909
DEBUG: B: witdth = 1, size = 3
DEBUG: Galloping case.
DEBUG: Join done.
DEBUG: Result: width = 2, size = 4,791

Multi:

DEBUG: Performing PSO scan for full relation: <http://rdf.freebase.com/ns/type.object.name>
DEBUG: Scan done, got 47,236,131 elements.
…
SCAN POS with P = "<http://rdf.freebase.com/ns/type.object.type>"
TRACE: We were the first to emplace, need to compute result
DEBUG: Performing POS scan for full relation: <http://rdf.freebase.com/ns/type.object.type>
DEBUG: Scan done, got 266,316,909 elements.
TRACE: Try to atomically emplace a new empty ResultTable
TRACE: Using key:
SCAN POS with P = "<http://www.w3.org/2000/01/rdf-schema#label>", O = ""Religious Jurisdiction"@en"
TRACE: Result already (being) computed
DEBUG: Performing join between two fixed width tables.
DEBUG: A: witdth = 2, size = 266,316,909
DEBUG: B: witdth = 1, size = 3
DEBUG: Galloping case.
DEBUG: Join done.
DEBUG: Result: width = 2, size = 0

Looking at the code, there is a somewhat dangerous-looking const-removal in the galloping code here, especially since the code is called with _fixedSizeData cast to std::vector<std::array> here, but I'm not sure about the root cause yet.

ad_utility::strip() crashes on stripping to empty

The following query crashes QLever during SPARQL parsing. Also, we don't have an OPTIONAL query in the end-to-end tests.

PREFIX fb: <http://rdf.freebase.com/ns/>

SELECT DISTINCT ?0 ?0name WHERE {
 fb:m.06w2sn5 fb:imdb.topic.name_id ?0 .
 OPTIONAL {?0 fb:type.object.name ?0name} .
 FILTER (?0 != fb:m.06w2sn5) 
} LIMIT 300

Query with one ql:has-predicate triple is slow if COUNT is in ORDER BY clause

The following query is very slow (46 seconds) on http://qlever.informatik.uni-freiburg.de/?backend=1 (Wikipedia+FreebaseEasy with up-to-date QLever build as of this writing):

SELECT ?predicate WHERE {
  ?x ql:has-predicate ?predicate
}
GROUP BY ?predicate
ORDER BY DESC((COUNT(?predicate) AS ?count))

However, the following query, which puts the COUNT in the SELECT clause instead of in the ORDER BY clause (and thus actually has more work to do) is much faster (~ 0.4 seconds):

SELECT ?predicate (COUNT(?predicate) AS ?count) WHERE {
  ?x ql:has-predicate ?predicate
}
GROUP BY ?predicate
ORDER BY DESC(?count)

The reason is that the pattern trick is used for the second query but not for the first. The obvious solution is to also use it for the first query.

FILTER EQ returns no results sometimes

The following query works as expected and returns two cities named "Berlin":

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?cityname ?countryname WHERE {
?city fb:type.object.type fb:location.citytown .
?city fb:type.object.name ?cityname .
?country fb:type.object.type fb:location.country .
?country fb:type.object.name ?countryname .
?city fb:location.location.containedby ?country .
FILTER(?cityname == "Berlin"@en) .
}

But filtering by the country "Germany" does not work and returns 0 results:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?cityname ?countryname WHERE {
?city fb:type.object.type fb:location.citytown .
?city fb:type.object.name ?cityname .
?country fb:type.object.type fb:location.country .
?country fb:type.object.name ?countryname .
?city fb:location.location.containedby ?country .
FILTER(?countryname == "Germany"@en) .
}

Changing the filter to NEQ (!=) seems to work, because the result size drops from 180549 (without any filter) to 174112:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?cityname ?countryname WHERE {
?city fb:type.object.type fb:location.citytown .
?city fb:type.object.name ?cityname .
?country fb:type.object.type fb:location.country .
?country fb:type.object.name ?countryname .
?city fb:location.location.containedby ?country .
FILTER(?countryname != "Germany"@en) .
}

Also, the result sizes of the filters (<=) and (>=) add up to 180549.
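As an aside, the SPARQL 1.1 standard spells equality in filters with a single = (there is also sameTerm); whether QLever treats the two spellings differently here is untested, but the spec-conformant versions of the failing filter would be:

FILTER(?countryname = "Germany"@en) .
FILTER(sameTerm(?countryname, "Germany"@en)) .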

Very slow GROUP BY query, which could be much faster with better query plan

The following query is currently very slow on Wikipedia+FreebaseEasy:

SELECT ?object WHERE {
  ?subject <is-a> ?object 
}
GROUP BY ?object
ORDER BY DESC((COUNT(?object) AS ?count))

The log shows that the PSO index is scanned and then the 130K tuples from the predicate are sorted. The sorting is used to make the GROUP BY easy.

For the query above, it would clearly be better to use the POS index. Then there would be no need to sort for the GROUP BY ?object.

Note that the sorting required by the final ORDER BY is no problem, since there are only 47K distinct objects for this query.

Inconsistent Text Query: Additional entity variable adds results

The following SPARQL + Text query retrieves 1 result row on Freebase + ClueWeb

PREFIX fb: <http://rdf.freebase.com/ns/>
PREFIX fbk: <http://rdf.freebase.com/key/>
PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT TEXT(?t) WHERE {
  ?person rdf:label "Rafael Rosell"@en .
  ?t ql:contains-entity ?person .
  ?t ql:contains-word "star" .
}

Adding another entity variable clearly should NOT give more results, yet it does:

PREFIX fb: <http://rdf.freebase.com/ns/>
PREFIX fbk: <http://rdf.freebase.com/key/>
PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT TEXT(?t) WHERE {
  ?person rdf:label "Rafael Rosell"@en .
  ?t ql:contains-entity ?person .
  ?t ql:contains-entity ?unbound .
  ?t ql:contains-word "star" .
}

I tested this on the last commit before multithreading as well as on aeaf26a (just before the OPTIONAL changes); both show the same behavior.

ql:has-relation segfaults with COUNT

On the ClueWeb+Freebase corpus (built with a version before the latest ql:has-predicate renaming), running the latest master QLever, the following query results in a SEGFAULT:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?predicate (COUNT(?predicate) as ?count) WHERE {
 ?person fb:people.person.profession ?profession .
 ?profession fb:type.object.name "Astronaut"@en .
 ?person ql:has-predicate ?predicate .
}
GROUP BY ?predicate
ORDER BY ?count

However, with the COUNT removed, the following query works:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?predicate WHERE {
 ?person fb:people.person.profession ?profession .
 ?profession fb:type.object.name "Astronaut"@en .
 ?person ql:has-predicate ?predicate .
}

Restricting just by fb:type.object.type also works:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT ?predicate (COUNT(?predicate) as ?count) WHERE {
 ?person fb:type.object.type fb:people.person .
 ?person ql:has-predicate ?predicate .
}
GROUP BY ?predicate
ORDER BY ?count

The log output is as follows:
qlever_crash_log_count_has_predicate.txt

Travis still doesn't use GCC 5

While Travis somehow does register gcc-5 as the compiler and prints the correct version in its generated compiler-version output, it doesn't actually use it, because something outside the .travis.yml exports CXX=g++. Thus cmake still uses the old compiler.

Building index on wikidata.truthy.12Nov17 could not parse URI

I tried to build an index with wikidata.truthy.12Nov17/latest-truthy.non+en.nt as input and got the following error message:

bumuellp@beli:~/qlever_stuff/QLever_binaries$ ./IndexBuilderMain -i /nfs/raid5/bumuellp/qlever/wikidat_index_non+en/ -n /nfs/raid1/wikidata/wikidata.truthy.12Nov17/latest-truthy.non+en.nt -a

IndexBuilderMain, version Dec  5 2017 00:23:17

Set locale LC_CTYPE to: en_US.utf8
Wed Dec  6 18:16:29.413 - DEBUG: Configuring STXXL...
Wed Dec  6 18:16:29.414 - DEBUG: done.
Wed Dec  6 18:16:29.414 - INFO:  Making pass over NTriples /nfs/raid1/wikidata/wikidata.truthy.12Nov17/latest-truthy.non+en.nt for vocabulary.
Wed Dec  6 18:16:43.367 - ERROR: BAD INPUT STRING (Illegal URI in : _:genid1299 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Restriction> .; in /home/bumuellp/qlever_stuff/QLever/src/parser/NTriplesParser.cpp, line 33, function bool NTriplesParser::getLine(std::array<std::__cxx11::basic_string<char>, 3ul>&))

I tested it on a small KB file. It seems to be about _:genid1299, which is a blank node label (legal in N-Triples). If it is in the first or second position of a triple in an .nt file, I get this error message. But if I put it between <>, like <_:genid1299>, the index generation works.

Tests are inadequate when it comes to standard functionality

So @joka921 and I just figured out that his change to externalize the vocabulary during index building doesn't load the final vocabulary, so the text index can't be built. This means the change completely breaks text indices. Still, it passes all tests without even a hint of a problem. Similarly, the threading change subtly broke some benchmark queries a while back (though we only have those for a full Freebase index) and also didn't trigger any test failures.

Therefore this issue tracks adding "end-to-end" tests that should fail when existing functionality breaks (at least in less subtle ways). For this we probably need a combination of true end-to-end tests, which build an index and spawn an actual server running queries against it, and unit tests that cover these processes from an API point of view.

Ideally, we also want to run the end-to-end tests on a realistic index, so they can't all run on Travis. Still, even there we should at least test with the scientists collection.

I tagged this as "help wanted" because I'd be especially interested in a good variety of queries for the scientists collection (we do have more queries for the real data sets).

FILTER with numerical constant does not work

For example, on Wikipedia+FreebaseEasy, the following query does not work as expected (the result is empty, and with > instead of <, nothing is filtered; a spec-level workaround sketch follows the query):

SELECT ?city ?country ?population WHERE {
  ?city <is-a> <City/Town/Village> .
  ?country <is-a> <Country> .
  ?city <Contained_by> ?country .
  ?city <Population> ?population .
  FILTER (?population < 1000000)
}
ORDER BY DESC(?population)
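If the population values are stored as plain literals rather than typed numbers, standard SPARQL offers an explicit cast as a possible workaround; whether the QLever build in question supports casts is a separate question. An untested sketch:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?city ?country ?population WHERE {
  ?city <is-a> <City/Town/Village> .
  ?country <is-a> <Country> .
  ?city <Contained_by> ?country .
  ?city <Population> ?population .
  FILTER (xsd:integer(?population) < 1000000)
}
ORDER BY DESC(?population)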

Code style inconsistencies

Every couple of pull requests, when running clang-format on the entire repository, I tend to get a lot of changed files.
I'd suggest adding an automated code check at some point in the pull-request verification process. Personally, I think the best place would be git's pre-commit hook, but as far as I know that cannot be synchronized automatically using git. I've attached a working pre-commit hook, though, in case anybody is interested.
pre-commit
A less elegant solution would be checking the code style during the Travis CI run. While this does not provide immediate feedback that the code style is off, it would be fully automatic and would not require any setup by the user.
To my knowledge, GitHub does not support server-side hooks for non-enterprise projects.
@niklas88, do you have any good ideas on how to improve code-style conformity?

std::bad_alloc on simple broken query

I'm trying to recreate the mediator-facts list used by Aqqu on the final Freebase. Sadly, my attempts provoked some QLever crashes. Before it crashes, it uses a lot more RAM, but it doesn't look like it quite hits out-of-memory, especially since that should go to swap or wake the mighty OOM killer.

In this query the order is wrong, so ?s and ?o have the same function; it is NOT the correct mediator-facts query (see the hypothetical correction after the query):

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?s ?left ?right ?o WHERE {
    ?cvt fb:freebase.type_hints.mediator "true" .
    ?s ?left ?cvt .
    ?o ?right ?cvt .
}
LIMIT 100
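For reference, the intended mediator-facts pattern would presumably point the second wildcard triple out of the CVT node instead of into it. This is a hypothetical correction inferred from the description above, not the query that crashed:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?s ?left ?right ?o WHERE {
    ?cvt fb:freebase.type_hints.mediator "true" .
    ?s ?left ?cvt .
    ?cvt ?right ?o .
}
LIMIT 100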

third_party/json and docker context size

The nlohmann/json submodule is quite large because it includes a large test suite. We actually only need a single file, json/json.hpp, from this repository (plus a CMakeLists.txt for setting up the include), but we still get the full size of the repository. This also blows up the Docker build context to 477 MB.

Sadly, there is currently no official git repository without the test suite (see here). So the easiest solution would probably be to include the header plus a minimal CMakeLists.txt directly in the third_party/json folder.

IndexBuilderMain: Segmentation fault

Core-dump: IndexBuilderMain.core.zip
Env: Ubuntu 17.10 / Linux 4.13.0-21-generic #24-Ubuntu SMP Mon Dec 18 17:29:16 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux


Index Statistics:

Relations: 6,653,878
Elements: 13,523,908
Blocks: 2
Theoretical size of Id triples: 324,573,792 bytes
Size of pair index: 216,382,528 bytes
Total Size: 271,253,808 bytes

Mon Jan 15 11:50:13.709 - INFO: Writing Meta data to index file...
Mon Jan 15 11:50:15.555 - INFO: Permutation done.
Mon Jan 15 11:50:15.569 - INFO: Making pass over ContextFile -l for vocabulary.
Mon Jan 15 11:50:15.570 - INFO: Pass done.
Mon Jan 15 11:50:15.570 - INFO: Creating vocabulary from set ...
Mon Jan 15 11:50:15.571 - INFO: ... sorting ...
Mon Jan 15 11:50:15.571 - INFO: Done creating vocabulary.
Mon Jan 15 11:50:15.571 - INFO: Writing vocabulary to file qindextext.text.vocabulary
Mon Jan 15 11:50:15.571 - INFO: Done writing vocabulary to file.
Mon Jan 15 11:50:15.571 - INFO: Calculating block boundaries...
Segmentation fault (core dumped)

IndexBuilderMain quits without finishing correctly or showing error message

Hello,
I tried to build an index for wikidata.2018_01_07.truthy.nt, and it just quits at the same point without showing any message. The last output from IndexBuilderMain (Debug build) was:

IndexBuilderMain, version Apr 25 2018 19:57:49

Set locale LC_CTYPE to: en_US.utf8
Thu Apr 26 01:28:17.317 - DEBUG: Configuring STXXL...
Thu Apr 26 01:28:17.396 - DEBUG: done.
Thu Apr 26 01:28:17.397 - INFO: Making pass over NTriples /nfs/raid1/wikidata/wikidata.2018_01_07/wikidata.2018_01_07.truthy.nt for vocabulary.
Thu Apr 26 01:28:52.134 - INFO: Lines processed: 10,000,000
Thu Apr 26 01:29:27.725 - INFO: Lines processed: 20,000,000
Thu Apr 26 01:30:08.776 - INFO: Lines processed: 30,000,000
Thu Apr 26 01:30:41.368 - INFO: Lines processed: 40,000,000
Thu Apr 26 01:31:34.868 - INFO: Lines processed: 50,000,000
Thu Apr 26 01:32:07.549 - INFO: Lines processed: 60,000,000
Thu Apr 26 01:32:38.931 - INFO: Lines processed: 70,000,000
Thu Apr 26 01:33:10.401 - INFO: Lines processed: 80,000,000
Thu Apr 26 01:33:39.564 - INFO: Lines processed: 90,000,000
Thu Apr 26 01:34:09.474 - INFO: Lines processed: 100,000,000
Thu Apr 26 01:34:41.064 - INFO: Lines processed: 110,000,000
Thu Apr 26 01:35:13.795 - INFO: Lines processed: 120,000,000
Thu Apr 26 01:36:27.565 - INFO: Lines processed: 130,000,000
Thu Apr 26 01:36:57.448 - INFO: Lines processed: 140,000,000
Thu Apr 26 01:37:30.877 - INFO: Lines processed: 150,000,000
Thu Apr 26 01:38:02.920 - INFO: Lines processed: 160,000,000
Thu Apr 26 01:38:36.414 - INFO: Lines processed: 170,000,000
Thu Apr 26 01:39:09.024 - INFO: Lines processed: 180,000,000
Thu Apr 26 01:39:41.438 - INFO: Lines processed: 190,000,000
Thu Apr 26 01:40:13.883 - INFO: Lines processed: 200,000,000
Thu Apr 26 01:40:44.382 - INFO: Lines processed: 210,000,000
Thu Apr 26 01:41:16.908 - INFO: Lines processed: 220,000,000
Thu Apr 26 01:41:48.740 - INFO: Lines processed: 230,000,000
Thu Apr 26 01:42:21.168 - INFO: Lines processed: 240,000,000
Thu Apr 26 01:42:53.015 - INFO: Lines processed: 250,000,000
Thu Apr 26 01:43:23.460 - INFO: Lines processed: 260,000,000
Thu Apr 26 01:43:53.399 - INFO: Lines processed: 270,000,000
Thu Apr 26 01:48:29.093 - INFO: Lines processed: 280,000,000
Thu Apr 26 01:49:47.518 - INFO: Lines processed: 290,000,000
Thu Apr 26 01:50:33.165 - INFO: Lines processed: 300,000,000
Thu Apr 26 01:51:08.086 - INFO: Lines processed: 310,000,000
Thu Apr 26 01:51:43.910 - INFO: Lines processed: 320,000,000
Thu Apr 26 01:52:14.771 - INFO: Lines processed: 330,000,000
Thu Apr 26 01:52:45.578 - INFO: Lines processed: 340,000,000
Thu Apr 26 01:53:16.245 - INFO: Lines processed: 350,000,000
Thu Apr 26 01:53:45.715 - INFO: Lines processed: 360,000,000
Thu Apr 26 01:54:18.685 - INFO: Lines processed: 370,000,000
Thu Apr 26 01:54:49.505 - INFO: Lines processed: 380,000,000
Thu Apr 26 01:55:19.311 - INFO: Lines processed: 390,000,000
Thu Apr 26 01:55:49.265 - INFO: Lines processed: 400,000,000
Thu Apr 26 01:56:21.707 - INFO: Lines processed: 410,000,000
Thu Apr 26 01:56:52.822 - INFO: Lines processed: 420,000,000
Thu Apr 26 01:57:24.288 - INFO: Lines processed: 430,000,000
Thu Apr 26 01:57:57.092 - INFO: Lines processed: 440,000,000
Thu Apr 26 01:58:30.770 - INFO: Lines processed: 450,000,000
Thu Apr 26 01:59:05.115 - INFO: Lines processed: 460,000,000
Thu Apr 26 01:59:38.426 - INFO: Lines processed: 470,000,000
Thu Apr 26 02:00:11.201 - INFO: Lines processed: 480,000,000
Thu Apr 26 02:00:42.520 - INFO: Lines processed: 490,000,000
Thu Apr 26 02:01:12.596 - INFO: Lines processed: 500,000,000
Thu Apr 26 02:01:44.377 - INFO: Lines processed: 510,000,000
Thu Apr 26 02:02:20.818 - INFO: Lines processed: 520,000,000
Thu Apr 26 02:02:51.878 - INFO: Lines processed: 530,000,000
Thu Apr 26 02:03:23.206 - INFO: Lines processed: 540,000,000
Thu Apr 26 02:03:56.946 - INFO: Lines processed: 550,000,000
Thu Apr 26 02:04:30.181 - INFO: Lines processed: 560,000,000
Thu Apr 26 02:05:04.523 - INFO: Lines processed: 570,000,000
Thu Apr 26 02:05:38.386 - INFO: Lines processed: 580,000,000
Thu Apr 26 02:06:11.430 - INFO: Lines processed: 590,000,000

The command was:
./IndexBuilderMain -n /nfs/raid1/wikidata/wikidata.2018_01_07/wikidata.2018_01_07.truthy.nt -i /local/data/bumuellp/wikidata_dump-copies/wikidata2018-a_index -a

System info:
*-cpu
product: AMD FX(tm)-8150 Eight-Core Processor
vendor: Advanced Micro Devices [AMD]
physical id: 1
bus info: cpu@0
size: 1400MHz
capacity: 3600MHz
width: 64 bits
capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp x86-64 constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf pni pclmulqdq monitor ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 nodeid_msr topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold cpufreq
*-memory
description: System memory
physical id: 0
size: 31GiB

Could this be a memory issue or do you have any other suggestions?

Thank you and best regards,
bumuellp

Crash with simple ?rel query

I'm trying to figure this out myself but wanted to document here nonetheless and maybe @Buchhold sees the problem immediately:
Executing the following query on our internal QLever instance crashes with std::bad_alloc; the same query gives normal results on Virtuoso:

PREFIX fb: <http://rdf.freebase.com/ns/>

SELECT DISTINCT ?rel
WHERE {
    fb:m.02hrh1q ?rel fb:m.02rshjk .
}

This happened after more than 27,000 queries of the same form worked, so it's pretty specific to these entities.
The output when running in GDB suggests there is something wrong with the special case in Index::scanNonFunctionalRelation.

ql:has-relation not documented

There is only one example query:

PREFIX we: <http://www.wikidata.org/entity/>
PREFIX wp: <http://www.wikidata.org/prop/direct/>
SELECT ?r (COUNT(?r) as ?count) WHERE {
  ?a wp:P106 we:Q11631 .
  ?a ql:has-relation ?r
}
GROUP BY ?r
ORDER BY DESC(?count)

Could ql:has-relation please be documented?

Filter doesn't really filter

In the following query, the last entity filtered out is actually in the result set. Checking with '==' to get only this entity doesn't work either. The first two, on the other hand, seem to get filtered out, but switching the last two doesn't filter the now-second one either.

PREFIX fb: <http://rdf.freebase.com/ns/>

SELECT DISTINCT ?1 ?o ?o2 WHERE {
 fb:m.0fkvn fb:government.government_office_category.officeholders ?0 .
 ?0 fb:government.government_position_held.jurisdiction_of_office fb:m.0vmt .
 ?0 fb:government.government_position_held.office_holder ?1 .
 FILTER (?1 != fb:m.0fkvn && ?1 != fb:m.0vmt && ?1 != fb:m.018mts) 
} LIMIT 300

Queries with FILTER depend on whether there is a . before

The following query works on Wikipedia+FreebaseEasy:

SELECT ?x WHERE {
  ?x <is-a> <Astronaut> .
  FILTER (?x <= <Akihiko_Hoshide>)
}
ORDER BY ASC(?x)

Without the DOT (.) before the FILTER, all results are shown (as if there were no FILTER).

PS: The query does NOT work with quotes around the entity names. However, it does work when the entity name does not exist, for example:

SELECT ?x WHERE {
  ?x <is-a> <Astronaut> .
  FILTER (?x >= <Neil)
}
ORDER BY ASC(?x)

Segfault

Happened for this query:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT TEXT(?c) WHERE {
 ?x fb:people.person.profession ?profession .
 ?profession fb:type.object.name.en "Actor" .
 ?y fb:people.person.profession ?profession .
 ?profession fb:type.object.name.en "Scientist" .
 ?x <in-text> ?c .
 ?y <in-text> ?c
}
LIMIT 100

The previous query was:

SELECT TEXT(?c) WHERE {
 ?x fb:people.person.profession ?profession .
 ?profession fb:type.object.name.en "Actor" .
 ?y fb:people.person.profession ?profession .
 ?profession fb:type.object.name.en "Scientist" .
 ?x <in-text> ?c .
 ?y <in-text> ?c
}
LIMIT 100
ORDER BY DESC(SCORE(?c))

Support OPTIONAL{}

This is used by Aqqu for name relations (fb:type.object.name), so it also gets entities without a name, such as dates.

As an example, with QLever I currently make the optional name mandatory and get the following query:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?0 ?0name WHERE {
   fb:m.02y1vz fb:internet.website.launched ?0 .
   ?0 fb:type.object.name ?0name . 
   FILTER (?0 != fb:m.02y1vz) 
} LIMIT 300

This doesn't find an answer, while what Aqqu really wanted to generate is:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?0 ?0name WHERE {
   fb:m.02y1vz fb:internet.website.launched ?0 .
   OPTIONAL {?0 fb:type.object.name ?0name . }
   FILTER (?0 != fb:m.02y1vz) 
} LIMIT 300

which gives the correct answer. The alternative would be to run two separate queries, but this would need more special treatment for the QLever backend.

Currently this contributes to a much worse avg-F1 (36% vs. 65%) with QLever, even when using the exact same model (actually trained using QLever).

JSON output for OPTIONAL is ambiguous

QLever's JSON API returns "" for the values of OPTIONAL and non-existing values, which is ambiguous with the empty string. It should probably return null.

Example:
The query

PREFIX fb: <http://rdf.freebase.com/ns/>

SELECT DISTINCT ?0 ?0name WHERE {
 fb:m.0jcx fb:people.person.date_of_birth ?0 .
 OPTIONAL {?0 fb:type.object.name ?0name} .
 FILTER (?0 != fb:m.0jcx)
} LIMIT 300

yields

…
"selected": ["?0", "?0name"],
"res": [
["\"1879-03-14T00:00:00\"^^<http://www.w3.org/2001/XMLSchema#dateTime>",""]
],
…

but it should be

…
"selected": ["?0", "?0name"],
"res": [
["\"1879-03-14T00:00:00\"^^<http://www.w3.org/2001/XMLSchema#dateTime>", null]
],
…

OPTIONAL and FILTER in a single query seem to create a conflict

Trying to execute the following query (generated by Aqqu when running with the QLever backend)

PREFIX fb: <http://rdf.freebase.com/ns/>

SELECT DISTINCT ?0 ?0name WHERE {
 fb:m.06w2sn5 fb:imdb.topic.name_id ?0 .
 OPTIONAL {?0 fb:type.object.name ?0name} .
 FILTER (?0 != fb:m.06w2sn5)
} LIMIT 300

QLever can't find an execution plan. Removing either the OPTIONAL or the FILTER part results in a viable plan. @floriankramer, any ideas?

Also, we are missing a query with OPTIONAL in the end-to-end tests (see also #68)!

Code duplication and precision concerns with convertIndexWordToFloat/String()

The functions convertIndexWordToFloat() and convertIndexWordToFloatString() share a lot of code. Yet we can't just use std::to_string(convertIndexWordToFloat(indexWord)), because we also store integers in the "float" index words, and the current method preserves precision because we go directly to a string representation. Maybe we can still exploit some common code?

One other thing I wonder is whether it might make sense to use double for convertIndexWordToFloat(). Afaik the current methods are called "float" only because of the common name; since they work with strings, their precision isn't actually limited to 32 bits.

HashMap is untested, broken and used inconsistently

Prompted by a discussion on hash maps with @hannahbast I had another look at our HashMap wrapper type.

I noticed that we currently have no tests for it, and it is in fact broken. The insert() method doesn't compile, as the dense_hash_map::insert() it calls takes a const value_type&, which is const std::pair<K, V>& and not const V&.

Additionally, it is used inconsistently, with smaller maps using std::unordered_map.

We may also want to consider using Abseil's absl::flat_hash_map. Instead of wrapping the methods, we may be able to get by with inheritance and using-declarations.

Support COUNT() aggregation

Aqqu generates queries using e.g. "SELECT COUNT(?var) WHERE ..." to count (distinct?) entities. To port Aqqu to QLever, the cleanest solution would be to support this. It should, however, also be possible to count client-side in Aqqu's specific use case. A sketch of the standard syntax follows.
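For reference, the standard SPARQL 1.1 spelling that such support would presumably target uses an explicit alias, with an optional DISTINCT inside the aggregate (the triple pattern here is only an example):

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT (COUNT(DISTINCT ?var) AS ?count) WHERE {
  ?var fb:type.object.type fb:people.person .
}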

Typos in queries need much better feedback / error msgs

Feel free to add to this list:

  • A typo like REFIX instead of PREFIX leads to empty results without an error.
  • Missing quotes for <in-text> objects can cause QLever to ignore parts of the object.
  • Selected variables that do not occur in the query are ignored without an error message.

Section 5 of the README does not make perfect sense

The complicated features with the text record -> doc/snippet relationship should be externalized. Have a simple "1 sentence == 1 doc == 1 text record" example instead, maybe with the complicated case as a second example.

OPTIONAL leads to no results

The following query, generated by Aqqu for the "start of Hoover's (fb:m.03kdl) presidency", yields no results:

PREFIX fb: <http://rdf.freebase.com/ns/>

SELECT  ?1 ?1name WHERE {
 fb:m.060d2 fb:government.government_office_or_title.office_holders ?0 .
 ?0 fb:government.government_position_held.office_holder fb:m.03kdl .
 ?0 fb:government.government_position_held.from ?1 .
 OPTIONAL {?1 fb:type.object.name ?1name} .
 FILTER (?1 != fb:m.060d2 && ?1 != fb:m.03kdl) 
} LIMIT 300

Removing the OPTIONAL line, however, reveals that there should be a result.

However, it doesn't seem to be that ?1 lacks the name relation entirely, because removing

fb:m.060d2 fb:government.government_office_or_title.office_holders ?0 .

which restricts to the "President of the United States", also yields the result.

This is the QLever log: qlever_log_optional_sort.txt
My guess would be that the SORT operation is not propagating the _isOptional flag correctly. I'll investigate later.

Support the Turtle format as used by Wikidata

Currently we need to convert Wikidata dumps to the N-Triples format before we can work with them, which makes things unnecessarily complicated and slow. Instead, we should support this common standard format directly. Parsing it is also a good way to gather some more parsing experience before rewriting the SPARQL parser.

Simplifying the Wikidata ⇒ QLever pipeline will also make it trivial to set up automatic weekly Wikidata updates.

Consider switching away from Travis CI

I really appreciate the free and easy-to-use service and feel that enabling Travis CI on this repository has had very positive and valuable effects.

Nevertheless, I've grown increasingly frustrated with Travis CI:

  • On a handful of occasions, Travis CI builds just failed and had to be manually restarted, most often because of unreachable apt mirrors. I truly hope Travis is running their own mirrors, but I doubt it.
  • The Travis team seems unwilling, or possibly incapable, of maintaining their build systems. It's been over 2 years since the release of Ubuntu 16.04 and 4 months since the release of Ubuntu 18.04, yet neither is available as a build environment. One could use them in Docker on Travis's 14.04 systems, but that means putting docker commands in .travis.yml, which seems to go against the whole point. Instead of listening to their users, the Travis team has locked the issue and shows no indication of actively working on this very basic feature.

Given the above, I have very little trust in Travis CI as a viable option for the future. So let's discuss alternatives.

Query with `ql:has-relation` crashes QLever when `--patterns` wasn't used

If the index wasn't built using --patterns and/or QLever wasn't started with --patterns, a ql:has-relation query will lead to a crash, as QLever still tries to use patterns.

In the log this looks as follows:

Wed Jul 25 10:13:36.719 - INFO:  Query: PREFIX fb: <http://rdf.freebase.com/ns/>
PREFIX fbk: <http://rdf.freebase.com/key/>
PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?relation (COUNT(?relation) as ?relation) WHERE {
    ?country_type rdf:label "Country"@en .
    ?country fb:type.object.type ?country_type .
    ?country ql:has-relation ?relation .
}
GROUP BY ?relation
Wed Jul 25 10:13:36.719 - DEBUG: Got 1 subplans to create.
Wed Jul 25 10:13:36.719 - DEBUG: _p: <http://www.w3.org/2000/01/rdf-schema#label>
Wed Jul 25 10:13:36.719 - DEBUG: _p: <http://rdf.freebase.com/ns/type.object.type>
Wed Jul 25 10:13:36.719 - DEBUG: _p: <QLever-internal-function/has-relation>
Wed Jul 25 10:13:36.719 - DEBUG: Using the pattern trick to answer the query.
Wed Jul 25 10:13:36.719 - DEBUG: Creating execution plan.
Wed Jul 25 10:13:36.719 - INFO:  Result already (being) computed
Plan after pattern trick:
{
  COUNT_AVAILABLE_PREDICATES (col 1)
<Empty QueryExecutionTree>
  qet-width: 2
}
Segmentation fault (core dumped)

Freebase Mappings

Hi all,

Is there any FB mapping file available? For instance, what would be the relation for "/book/author/works_written"?

[{
"type": "/book/author",
"mid": null,
"name": [{}],
"/common/topic/alias": [{}],
"/book/author/works_written": [{
"mid": null,
"/book/book/genre": "Science Fiction",
"date_of_first_publication": {
"value": null,
"optional": false
},
"name": [{}],
"/common/topic/alias": [{}],
"limit": 1
}],
}]

Thank you

Aggregate query with only one ql:has-predicate triple fails

The following example query currently fails on Wikidata EN+DE @ vulcano:

PREFIX we: <http://www.wikidata.org/entity/>
PREFIX wp: <http://www.wikidata.org/prop/direct/>
SELECT ?r (COUNT(?r) AS ?count) WHERE {
  ?a ql:has-predicate ?r
}
GROUP BY ?r
ORDER BY DESC(?count)

The reason is that the query planner decides to use the "pattern trick", which leads to an empty query after the only triple is taken out. The correct handling would be to process ql:has-predicate without the pattern trick.

GROUP BY at the end very slow for some queries

The following query is very slow on Clueweb+Freebase:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT (SAMPLE(?name) AS ?name_sample) (COUNT(?predicate) AS ?count) WHERE {
 ?x ql:has-predicate ?predicate .
 ?predicate fb:type.object.name ?name 
}
GROUP BY ?predicate
ORDER BY DESC(?count)

REASON: If the first triple gives a result with N rows, then when joining the second triple, a lookup in fb:type.object.name is needed for each of the N rows. If the GROUP BY were done before that join, the lookup would be needed only n << N times, where n is the number of distinct predicates.

Note that queries such as the above are needed for the autocompletion we are working towards. We will then also want a query like the above in combination with a FILTER on ?name.
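One way to express "GROUP BY before the join" at the query level is a manual rewrite with a subquery. This is an untested sketch, and whether the query planner handles subqueries this way is a separate question:

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT (SAMPLE(?name) AS ?name_sample) ?count WHERE {
  { SELECT ?predicate (COUNT(?predicate) AS ?count) WHERE {
      ?x ql:has-predicate ?predicate .
    } GROUP BY ?predicate }
  ?predicate fb:type.object.name ?name
}
GROUP BY ?predicate ?count
ORDER BY DESC(?count)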

Debug level TRACE is currently broken when ql:has-predicate is used

Currently, when the debug level is set to TRACE, QLever throws an exception on ql:has-predicate queries while printing the multiplicities and estimated sizes of the query execution tree. This is because these currently throw ad_semsearch::Exception::NOT_YET_IMPLEMENTED.

Error message on "too short Prefix" is unclear and misleading

The following query throws an error due to old* being too short a prefix. The error thrown, however, doesn't make this clear and sounds like a weird, unexpected bug:

PREFIX fb: <http://rdf.freebase.com/ns/>
PREFIX rdf: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?name WHERE {
  ?state fb:type.object.type fb:location.us_state .
  ?state rdf:label ?name .
  ?t ql:contains-entity ?state .
  ?t ql:contains-word "old* state" .
} ORDER BY DESC(SCORE(?t))

The error reported is

BAD QUERY (The ID Range seems to exceed the range possible given to the current min prefix size.; in 
/local/raid/ad/schnelle/QLever/src/index/TextMetaData.cpp, line 31, function const TextBlockMetaData& 
TextMetaData::getBlockInfoByWordRange(Id, Id) const)

Segfault caused by typo in query

This happened for a query for politicians who had an audience with the pope, on the ClueWeb+Freebase dataset. It would probably happen for all queries that make words look like variables.

PREFIX fb: <http://rdf.freebase.com/ns/> 
SELECT ?pol ?lead TEXT(?c) SCORE(?c) WHERE { 
  ?pol fb:people.person.profession fb:m.0fj9f . 
  ?pol <in-context> ?c . 
  ?c <in-context> audience .
  ?lead <in-context> ?c . 
  ?lead fb:people.person.profession fb:m.05hmbp4 . 
  FILTER(?pol != ?lead) 
} ORDER BY DESC(SCORE(?c))

With the typo ?audience instead of audience and missing SELECT variables, I can produce a segfault:

PREFIX fb: <http://rdf.freebase.com/ns/> 
SELECT ?pol ?lead  WHERE { 
  ?pol fb:people.person.profession fb:m.0fj9f . 
  ?pol <in-context> ?c . 
  ?c <in-context> ?audience .
  ?lead <in-context> ?c . 
  ?lead fb:people.person.profession fb:m.05hmbp4 . 
  FILTER(?pol != ?lead) 
} 

Fix this after the paper deadline!

Missing full blank node support

We just merged simple blank node support in #30, but it's still not to spec. The RDF TR calls for allowing '.' in blank node labels if it is not the first or last character (so _:a.b is legal, but _:a. is not). This sounds like a terrible idea, but some files may use it. So keep this issue as a reminder.
