pyhdt's Introduction

pyHDT

pyHDT is joining the RDFlib family as part of the rdflib 6.0 release! The development continues at rdflib-hdt, and this repository is going into archive.

Read and query HDT documents with ease in Python.

Online Documentation

Requirements

  • Python version 3.6.4 or higher
  • pip
  • gcc/clang with C++11 support
  • Python development headers

You should have the Python.h header available on your system.
For example, for Python 3.6, install the python3.6-dev package on Debian/Ubuntu systems.
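A quick way to check whether the header is visible to the interpreter you plan to build with is to ask Python where it expects its include directory to be. The snippet below is a minimal sketch that uses only the standard library; the helper names `python_header_path` and `has_python_header` are made up for this illustration and are not part of pyHDT.

```python
import os
import sysconfig


def python_header_path():
    """Return the path where this interpreter expects Python.h to live."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def has_python_header():
    """Return True if Python.h exists at the expected location."""
    return os.path.isfile(python_header_path())


if __name__ == "__main__":
    status = "found" if has_python_header() else "missing"
    print(python_header_path(), status)
```

If the header is reported as missing, installing the development package for that exact Python version is the usual fix.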

Then, install the pybind11 library:

pip install pybind11

Installation

Installation in a virtualenv is strongly advised!

Pip install (recommended)

pip install hdt

Manual installation

git clone https://github.com/Callidon/pyHDT
cd pyHDT/
./install.sh

Getting started

from hdt import HDTDocument

# Load an HDT file.
# Missing indexes are generated automatically; pass False as the second argument to disable this
document = HDTDocument("test.hdt")

# Display some metadata about the HDT document itself
print("nb triples: %i" % document.total_triples)
print("nb subjects: %i" % document.nb_subjects)
print("nb predicates: %i" % document.nb_predicates)
print("nb objects: %i" % document.nb_objects)
print("nb shared subject-object: %i" % document.nb_shared)

# Fetch all triples that match { ?s ?p ?o }
# Use empty strings ("") to indicate variables
triples, cardinality = document.search_triples("", "", "")

print("cardinality of { ?s ?p ?o }: %i" % cardinality)
for triple in triples:
  print(triple)

# Search also supports limit and offset
triples, cardinality = document.search_triples("", "", "", limit=10, offset=100)
# etc ...
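The limit and offset arguments shown above can be combined to walk through a large result set one page at a time. The following is a minimal sketch, not part of pyHDT: `iter_pages` and its `page_size` parameter are hypothetical names, and the only pyHDT call it relies on is `search_triples(..., limit=..., offset=...)` as used in the example above.

```python
def iter_pages(document, subject="", predicate="", obj="", page_size=1000):
    """Yield lists of at most page_size triples until the pattern is exhausted.

    `document` is expected to expose search_triples(s, p, o, limit=, offset=)
    returning an (iterator, cardinality) pair, as in the example above.
    """
    if page_size <= 0:
        raise ValueError("page_size must be a positive integer")
    offset = 0
    while True:
        triples, _cardinality = document.search_triples(
            subject, predicate, obj, limit=page_size, offset=offset
        )
        page = list(triples)
        if not page:
            break
        yield page
        # A short page means there is nothing left to fetch.
        if len(page) < page_size:
            break
        offset += page_size
```

With a loaded document, usage would look like `for page in iter_pages(document, page_size=500): ...`. Each page triggers a fresh search, so very small page sizes add overhead.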

Handling non UTF-8 strings in python

If the HDT document was encoded with a non-UTF-8 encoding, the previous code won't work correctly and will raise a UnicodeDecodeError. More details on how strings are converted from C++ to Python str are available here.

To handle this, we doubled the API of the HDT document by adding:

  • search_triples_bytes(...) returns an iterator of triples as (py::bytes, py::bytes, py::bytes)
  • search_join_bytes(...) returns an iterator over sets of solution mappings as py::set(py::bytes, py::bytes)
  • convert_tripleid_bytes(...) returns a triple as (py::bytes, py::bytes, py::bytes)
  • convert_id_bytes(...) returns a py::bytes

Parameters and documentation are the same as for the standard versions.

from hdt import HDTDocument

# Load an HDT file.
# Missing indexes are generated automatically; pass False as the second argument to disable this
document = HDTDocument("test.hdt")
it = document.search_triples_bytes("", "", "")

for s, p, o in it:
  print(s, p, o)  # prints b'...', b'...', b'...'
  # now decode the terms, or handle any error
  try:
    s, p, o = s.decode('UTF-8'), p.decode('UTF-8'), o.decode('UTF-8')
  except UnicodeDecodeError as err:
    # try another codec
    pass
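The "try another codec" step above can be factored into a small helper that attempts a list of encodings in order. This is a sketch under assumptions: `decode_term` is a hypothetical name, and the default fallback to latin-1 is only an illustrative choice, since the right codec depends on how the HDT file was produced.

```python
def decode_term(raw, encodings=("utf-8", "latin-1")):
    """Decode a bytes term, trying each encoding in turn.

    latin-1 maps every byte to a character, so when it is in the list
    it always succeeds and acts as a last resort.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # No encoding worked: keep what can be read and mark the rest.
    return raw.decode("utf-8", errors="replace")
```

Inside the loop above, this would be used as `s, p, o = decode_term(s), decode_term(p), decode_term(o)`.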

pyhdt's People

Contributors

callidon, folkvir, nilesh-c

pyhdt's Issues

Failed building wheel for hdt

Hi, I am not able to import hdt.
I am not sure whether the gcc/clang with C++11 support prerequisite is met.
How can I check this on Windows?
How can I download it?
Logs:
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
triple_iterator.cpp
C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.12.25827\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Iinclude -Iinclude -Iinclude/ -Ihdt-cpp-1.3.2/libhdt/include/ -Ihdt-cpp-1.3.2/libhdt/src/dictionary/ -Ihdt-cpp-1.3.2/libcds/include/ -Ihdt-cpp-1.3.2/libcds/src/static/bitsequence -Ihdt-cpp-1.3.2/libcds/src/static/coders -Ihdt-cpp-1.3.2/libcds/src/static/mapper -Ihdt-cpp-1.3.2/libcds/src/static/permutation -Ihdt-cpp-1.3.2/libcds/src/static/sequence -Ihdt-cpp-1.3.2/libcds/src/utils -Ic:\users\new\anaconda3\include -Ic:\users\new\anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.12.25827\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.12.25827\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.16299.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.16299.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.16299.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.16299.0\winrt" /EHsc /Tpsrc/tripleid_iterator.cpp /Fobuild\temp.win-amd64-3.7\Release\src/tripleid_iterator.obj -std=c++11

Finding triples matching a keyword/phrase

Is there a way of utilizing pyHDT to find triples that contain a given keyword? For instance, when working with the Semantic Web Dog Food HDT, I first load the data as below:

from hdt import HDTDocument
document = HDTDocument("swdf.hdt")

I am able to quickly find all the triples containing a given URI if I know that URI beforehand:
(triples, cardinality) = document.search_triples("", "", "http://data.semanticweb.org/person/christian-bizer")
However, there are scenarios where it would be useful to search the HDT for URIs containing the keyword/phrase of interest, such as christian-bizer. Is there a way of doing this with pyHDT?
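Absent a dedicated keyword search, one fallback is a plain linear scan: fetch every triple with the all-variables pattern and keep those where any term contains the keyword. The sketch below is not a pyHDT feature; `triples_containing` is a hypothetical helper that works on any iterable of string triples, and a full scan will be slow on large HDT files.

```python
def triples_containing(triples, keyword, case_sensitive=False):
    """Yield each triple in which at least one term contains keyword."""
    needle = keyword if case_sensitive else keyword.lower()
    for triple in triples:
        for term in triple:
            haystack = term if case_sensitive else term.lower()
            if needle in haystack:
                yield triple
                break  # one match is enough, avoid yielding the triple twice


# Intended usage with a loaded document (requires the hdt package):
#   triples, _ = document.search_triples("", "", "")
#   for triple in triples_containing(triples, "christian-bizer"):
#       print(triple)
```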

[FEATURE] ProgressListener and disabling cout/cerr debug messages from hdt-cpp

  • hdt-cpp emits a lot of cout/cerr debug output, and in Python it can be annoying to get those messages with no way to configure them. Since it's a C++ extension, I couldn't figure out any way to silence them from Python code. I wrote a workaround that redirects cerr/cout to /dev/null here.
  • Fully loading large files like DBpedia HDTs can take a while, so I used hdt-cpp's StdoutProgressListener to get some progress output from the HDTManager::loadXXX methods. This is in the same commit, here.

Let me know if you find any of those useful and I'll send a PR. :)
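Separately from the C++ workaround described above, a technique sometimes used with native extensions is to redirect the process-level file descriptors from Python, since reassigning sys.stdout does not affect output written by C++ code. The sketch below assumes the extension writes to descriptors 1 and 2; `silence_native_output` is a hypothetical name, and whether this catches all hdt-cpp output has not been verified here (output buffered on the C++ side may still appear after the block exits).

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def silence_native_output(fds=(1, 2)):
    """Temporarily point the given OS-level file descriptors at os.devnull."""
    # Flush Python-side buffers so earlier output is not lost or reordered.
    sys.stdout.flush()
    sys.stderr.flush()
    saved = [os.dup(fd) for fd in fds]
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        for fd in fds:
            os.dup2(devnull, fd)
        yield
    finally:
        # Restore the original descriptors and release the duplicates.
        for fd, backup in zip(fds, saved):
            os.dup2(backup, fd)
            os.close(backup)
        os.close(devnull)
```

Loading a document would then be wrapped as `with silence_native_output(): document = HDTDocument("test.hdt")`.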

Is there a limit on the number of triples?

I'm trying to query over a large HDT file (LOD-a-lot, 28 billion triples), where the indexes are already generated.

When I load the HDT file, it does not return any error and it is relatively fast:
document = HDTDocument("data.hdt")

However, it seems that not all triples are loaded. Specifically, I expect the following command to return 28 billion triples, but it returns around 2.5 billion:
document.total_triples

I also verified from querying the data that indeed a number of triples were not loaded.
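One hypothesis, not confirmed by anything in this report, is that the count passes through a 32-bit unsigned integer somewhere between hdt-cpp and Python. The arithmetic below only shows what such a wrap-around would do to the rounded figure of 28 billion; `wrap_to_uint32` is a hypothetical helper, not a pyHDT function.

```python
UINT32_MODULUS = 2 ** 32  # 4,294,967,296 distinct values in 32 bits


def wrap_to_uint32(value):
    """Return value as it would appear after truncation to 32 unsigned bits."""
    return value % UINT32_MODULUS


approx_total = 28_000_000_000  # rounded figure quoted in the report
print(wrap_to_uint32(approx_total))  # 2230196224
```

The wrapped value is a little over 2.2 billion, the same order of magnitude as the count observed, which makes an integer-width problem worth ruling out. It does not prove that this is the cause.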

Search is broken for resources that are not in the graph

In the newest commit, the search function is redefined.

If I have a graph containing
A,B,C
and I search for
X ? ?
I get all triples in the database as a result
A,B,C.

With version 1.1.0 and hdtSearch, I get an empty set as the result.

I suspect that the result of the resource-to-ID conversion needs to be checked before searching.
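To make the suspected failure mode concrete, here is a toy model in plain Python. It is not pyHDT or hdt-cpp code, and every name in it is hypothetical. It assumes that a variable is represented by a placeholder ID and shows why an unknown term has to be kept distinct from that placeholder: if both collapsed to the same ID, the pattern X ? ? would silently become ? ? ?.

```python
WILDCARD_ID = 0  # in this toy model, ID 0 means "match anything"


def term_to_id(dictionary, term):
    """Return the ID of term, WILDCARD_ID for "", or None if term is unknown."""
    if term == "":
        return WILDCARD_ID
    return dictionary.get(term)  # None when the term is not in the graph


def search(triples, dictionary, s, p, o):
    """Return the ID triples matching the pattern, or [] if a term is unknown."""
    ids = [term_to_id(dictionary, term) for term in (s, p, o)]
    # The check in question: an unknown term can never match anything.
    if any(term_id is None for term_id in ids):
        return []
    return [
        triple
        for triple in triples
        if all(want == WILDCARD_ID or want == got for want, got in zip(ids, triple))
    ]


dictionary = {"A": 1, "B": 2, "C": 3}
triples = [(1, 2, 3)]
print(search(triples, dictionary, "X", "", ""))  # []
print(search(triples, dictionary, "A", "", ""))  # [(1, 2, 3)]
```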
