
jkminder / data2neo


Data2Neo is a library that simplifies the conversion of data in relational format to a graph knowledge database.

Home Page: https://data2neo.jkminder.ch

License: Apache License 2.0

Python 100.00%
data-cleaning data-conversion data-engineering data2neo database-migrations graphs neo4j relational-databases remodeling

data2neo's People

Contributors

brandenberger, jkminder


data2neo's Issues

Improve runtime performance

Runtime can be optimised in two ways. First, we can skip code execution for resources that do not need to pass through the factories. Second, we can leverage batch processing by adding a buffer. The work of committing to the graph could perhaps be offloaded to a separate process.

Note: The impact on dependencies needs to be checked thoroughly. Committing to the graph will become asynchronous, so any wrapper that queries the graph must be aware of this.
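The buffering idea could be sketched roughly as follows. This is only an illustration, not data2neo code: `CommitBuffer`, its `commit` callback and `batch_size` are hypothetical names, and the callback stands in for whatever actually writes a batch to the graph.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class CommitBuffer(Generic[T]):
    """Hypothetical helper: collects items and commits them in batches."""

    def __init__(self, commit: Callable[[List[T]], None], batch_size: int = 1000) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._commit = commit          # assumed to write one batch to the graph
        self._batch_size = batch_size
        self._items: List[T] = []

    def add(self, item: T) -> None:
        """Queue one item and commit automatically once the batch is full."""
        self._items.append(item)
        if len(self._items) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Commit everything that is still pending."""
        if self._items:
            batch, self._items = self._items, []
            self._commit(batch)

    def __enter__(self) -> "CommitBuffer[T]":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Only commit the remainder if the block finished without an error.
        if exc_type is None:
            self.flush()
        return False
```

Used as `with CommitBuffer(write_batch, batch_size=500) as buf: buf.add(item)`, where `write_batch` is a placeholder for the real commit function. Under this design, a wrapper that reads from the graph would have to call `flush()` first, which is the dependency concern raised in the note.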

Crashing when similar name in yaml file

Hi all,

I ran into a really weird error when providing the yaml file to the converter, and it took me some time to find the cause.

The error looked as follows:

ValueError: The arguments list has not the same number of opening and closing brackets: "MPs.city")_birt

And I knew it was coming from one of the following lines in the yaml file:

    IF_NOT_EMPTY(NODE("City", "Location"), "BirthPlace_City") city_birth:
        + name = MPs.BirthPlace_City         
    IF_NOT_EMPTY(IF_REL_NOT_EXISTS(RELATION(person, "BORN_IN", city_birth)), "BirthPlace_City"):

but I could not understand why. I suspected that the wrappers, or their order, might be causing the issue, but I could not find a pattern. Then I finally realized it was a conflict with

    IF_NOT_EMPTY(NODE("City", "Location"), "city") city:
        + name = EXTR_CITY(MPs.city)

It seems to boil down to a naming conflict between city and city_birth, because the following now works:

    IF_NOT_EMPTY(NODE("City", "Location"), "BirthPlace_City") birth_city:
        + name = MPs.BirthPlace_City         
    IF_NOT_EMPTY(IF_REL_NOT_EXISTS(RELATION(person, "BORN_IN", birth_city)), "BirthPlace_City"):

Therefore, I believe this is a bug that should be fixed: it should not happen, and it took a lot of time to track down.

Thank you so much!

config_parser.py has trouble compiling overlapping identifiers.

Example:

ENTITY("Session"):
    NODE("Year") year:
        + year = YEAR_FROM_DATE(Session.StartDate)
        
    REQUIRED(RELATION(sessionnode, "DURING", year)):
        - name = Session.SessionName
        - type = Session.TypeName
        - date_start = DATE(Session.StartDate)
        - date_end = DATE(Session.EndDate)

    IF_END_DIFFERENT(NODE("Year")) year_end:
        + year = YEAR_FROM_DATE(Session.StartDate)
        
    IF_END_DIFFERENT(REQUIRED(RELATION(sessionnode, "DURING", year_end))):
        + name = Session.SessionName
        - type = Session.TypeName
        - date_start = DATE(Session.StartDate)
        - date_end = DATE(Session.EndDate)

The config parser/compiler has problems here because of how the identifiers (year and year_end) are parsed. The identifier year_end is sometimes parsed first as year (not always, and it is not yet clear why): the parser matches the substring year first ("year"_end), which produces confusing parser errors. Fixing this requires a rewrite of the compiler in config_parser.py.

Temporary solution:
Make sure your identifiers do not overlap, e.g. use year and yend instead of year and year_end.
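The failure mode described above, where the substring year is matched inside year_end, can be illustrated with a small standalone sketch. It does not reproduce the actual logic of config_parser.py: both function names are hypothetical, and the whole-identifier matching is just one possible way to avoid the overlap.

```python
import re


def replace_naive(text: str, identifier: str, replacement: str) -> str:
    """Plain substring replacement: also hits 'year' inside 'year_end'."""
    return text.replace(identifier, replacement)


def replace_whole_identifier(text: str, identifier: str, replacement: str) -> str:
    """Replace only occurrences not glued to other identifier characters."""
    pattern = (
        r"(?<![A-Za-z0-9_])"      # no identifier character directly before
        + re.escape(identifier)
        + r"(?![A-Za-z0-9_])"     # and none directly after
    )
    # A function replacement keeps backslashes in `replacement` literal.
    return re.sub(pattern, lambda match: replacement, text)


line = 'RELATION(sessionnode, "DURING", year_end)'
broken = replace_naive(line, "year", "<year>")               # corrupts year_end
intact = replace_whole_identifier(line, "year", "<year>")    # leaves it alone
```

Here `broken` ends in `<year>_end)`, the same shape as the `"year"_end` fragment reported in the issue, while `intact` is unchanged because year_end is treated as a single identifier.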

Nodes with same Labels are merged

With the config below, only the last Source node (BBI) is created.

ENTITY("LegislativePeriod"):
    NODE("Legislative Period") lp:
        + uid = LegislativePeriod.LegislativePeriodNumber
        - date_start = DATE(LegislativePeriod.StartDate)
        - date_end = DATE(LegislativePeriod.EndDate)
    # add the one time source creations
    NODE("Source"):
        + name = "Online DB"
        - abbrev = "ODB"
        - description = "Public OData database of swiss national parliament."
        - url = "https://ws.parlament.ch/odata.svc/"
    
    NODE("Source"):
        + name = "Amtliche Sammlung"
        - abbrev = "AS"

    NODE("Source"):
        + name = "Bundesblatt"
        - abbrev = "BBI"

If one adds identifiers, all of them are created:

ENTITY("LegislativePeriod"):
    NODE("Legislative Period") lp:
        + uid = LegislativePeriod.LegislativePeriodNumber
        - date_start = DATE(LegislativePeriod.StartDate)
        - date_end = DATE(LegislativePeriod.EndDate)
    # add the one time source creations
    NODE("Source") odb:
        + name = "Online DB"
        - abbrev = "ODB"
        - description = "Public OData database of swiss national parliament."
        - url = "https://ws.parlament.ch/odata.svc/"
    
    NODE("Source") as:
        + name = "Amtliche Sammlung"
        - abbrev = "AS"

    NODE("Source") bbi:
        + name = "Bundesblatt"
        - abbrev = "BBI"

np.int64 causes crash/better type handling

The pipeline should better communicate which types are allowed and which are converted. Maybe add a warning whenever a type is converted to a string.

E.g. np.int64 does not work with some neo4j versions. Because it is of type numbers.Number, it is not converted to a string and is passed along to neo4j unchanged, which causes the crash:
TypeError: Neo4j does not support JSON parameters of type int64
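Until the pipeline handles this itself, one possible workaround is to convert numeric values to built-in Python types before they reach the graph. The sketch below is an assumption about how that could look, not part of data2neo: `to_native` is a hypothetical helper, and it relies on NumPy registering its scalar types with the `numbers` abstract base classes.

```python
import numbers


def to_native(value):
    """Return `value` as a built-in int or float if it is a numeric scalar.

    Hypothetical helper: anything that is not an integral or real number
    (strings, None, lists, ...) is returned unchanged.
    """
    if value is None or isinstance(value, (bool, str)):
        return value                      # bool is Integral, so keep it first
    if isinstance(value, numbers.Integral):
        return int(value)                 # covers np.int64 and similar types
    if isinstance(value, numbers.Real):
        return float(value)               # covers np.float64 and similar types
    return value
```

Such a function could be applied to every attribute value inside a wrapper or before handing resources to the converter. Where exactly to hook it in depends on the pipeline and is left open here.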

Converter using YAML from a variable

When creating a Converter instance, I only found a way to supply the conversion-YAML as a file. Is it possible to supply the YAML from a str variable as well?

For example:

my_yaml = """... Multiline YAML statements here ..."""

converter = Converter(my_iterator, my_graph, from_variable=my_yaml)

The from_variable arg is just an example of a possible implementation. Another idea could, e.g., be a dedicated ConverterFromText class, etc.

Supplying the YAML from a Python variable would be useful, for example, when generating the YAML dynamically.
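As a possible workaround in the meantime, the YAML string could be written to a temporary file whose path is then passed to the Converter in the usual way. This is only a sketch: `write_config_to_tempfile` is a hypothetical helper, and it assumes the Converter accepts a file path, as described above.

```python
import os
import tempfile


def write_config_to_tempfile(config_text: str, suffix: str = ".yaml") -> str:
    """Write `config_text` to a new temporary file and return its path.

    Hypothetical helper: the caller is responsible for deleting the file
    (e.g. with os.remove) once the converter no longer needs it.
    """
    fd, path = tempfile.mkstemp(suffix=suffix, text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(config_text)
    return path
```

The returned path can be supplied wherever the config file name is expected, and removed with `os.remove(path)` after the conversion has finished.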

Relationships are lost

With very large databases and many workers, neo4j sometimes throws variations of the following errors. There is also a suspicion that some relations are lost as a result. It mainly happens when many workers send many new relations (or matches) to the server. It could not yet be reproduced in a toy setting.

Transaction failed and will be retried in 1.0245680547582983s (ForsetiClient[transactionId=333733, clientId=2845] can't acquire UpdateLock{owners=ForsetiClient[transactionId=333390, clientId=2844], ForsetiClient[transactionId=333733, clientId=2845], refCount=2} on NODE_RELATIONSHIP_GROUP_DELETE(106064) because holders of that lock are waiting for ForsetiClient[transactionId=333733, clientId=2845]

Transaction failed and will be retried in 0.8898980770768736s (ForsetiClient[transactionId=1199721, clientId=42911] can't acquire ExclusiveLock{owner=ForsetiClient[transactionId=1199388, clientId=42920]} on NODE_RELATIONSHIP_GROUP_DELETE(192231) because holders of that lock are waiting for ForsetiClient[transactionId=1199721, clientId=42911].
