Avro Data Model

Introduction

Apache Avro is a data serialization framework. It is used for data serialization (especially in the Hadoop ecosystem) and in RPC protocols. It has libraries for many languages. The libraries support code generation for statically typed languages such as Java, while for dynamic languages such as Python, code generation is not necessary.

When Avro data is deserialized in a Python environment, it is stored as a dictionary in memory. As a dictionary, it loses all the interesting features provided by the Avro schema. For example, you can overwrite an integer field with a string without getting any errors. A dictionary also doesn't provide the conveniences of a normal class: for example, if an Avro schema has firstName and lastName fields, it is not easy to define a fullName function to generate the full name.
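The limitation can be seen with plain Python alone. The snippet below is only an illustration and does not use this library; the "age" field is a hypothetical stand-in for an Avro int field.

```python
# A deserialized Avro record, as plain Python sees it: just a dict.
# The "age" field is hypothetical, standing in for an Avro "int" field.
user = {"firstName": "Alyssa", "lastName": "Yssa", "age": 30}

# Nothing ties the dict back to the schema, so this succeeds silently.
user["age"] = "thirty"


def full_name(record):
    """Free function, because a dict cannot carry record-specific methods."""
    return "{} {}".format(record["firstName"], record["lastName"])


print(type(user["age"]).__name__)  # the int field now holds a str
print(full_name(user))
```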

Use Cases of the Library

In stream processing and RPC protocols, strict data types are required to make sure the system runs correctly. In Python, Avro data is converted to a dictionary, which doesn't guarantee types and also doesn't provide a custom class hierarchy. I am looking to develop a way to build a class on top of an Avro schema, so that it keeps the correct data types and also has a class structure.

My solution is similar to what the SQLAlchemy ORM does. You need to manually create classes corresponding to Avro schemas. However, the fields of the Avro schemas are all extracted from the avsc file instead of being manually defined as in SQLAlchemy. The classes allow defining methods to introduce new properties or new validations. Please check the following examples for how to use the library.

The purpose of the library is to bridge the gap between dynamically typed Python and the use cases that require strong types. This library should be restricted to places where static types are required. Otherwise, you will lose all the happiness of playing with Python if you apply this library everywhere.
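To make the idea of a class built on top of a schema concrete, here is a rough, self-contained sketch of how such a decorator could work in principle. It is not this library's implementation: with_schema, PRIMITIVES and the Model wrapper are hypothetical names, and only a few primitive types in a flat record are handled.

```python
import json

# Hypothetical mapping from a few Avro primitive names to Python types.
PRIMITIVES = {"string": str, "int": int, "boolean": bool, "double": float}


def with_schema(schema_text):
    """Hypothetical decorator: expose the schema's fields as typed attributes."""
    schema = json.loads(schema_text)
    field_types = {f["name"]: PRIMITIVES[f["type"]] for f in schema["fields"]}

    def check(name, value):
        expected = field_types[name]
        if not isinstance(value, expected):
            raise TypeError("{} must be {}, got {}".format(
                name, expected.__name__, type(value).__name__))

    def wrap(cls):
        class Model(cls):
            def __init__(self, value):
                # Every schema field must be present with the right type.
                for name in field_types:
                    check(name, value.get(name))
                # Bypass our own __setattr__ for the internal storage dict.
                object.__setattr__(self, "_value", dict(value))

            def __getattr__(self, name):
                # Only reached when normal attribute lookup fails.
                if name in field_types:
                    return self._value[name]
                raise AttributeError(name)

            def __setattr__(self, name, value):
                # Only schema fields may be assigned, and only with valid types.
                if name not in field_types:
                    raise AttributeError(name)
                check(name, value)
                self._value[name] = value

        Model.__name__ = cls.__name__
        return Model

    return wrap


USER_SCHEMA = """
{"type": "record", "name": "User", "fields": [
  {"name": "lastName", "type": "string"},
  {"name": "firstName", "type": "string"}]}
"""


@with_schema(USER_SCHEMA)
class User(object):
    def fullname(self):
        return "{} {}".format(self.firstName, self.lastName)


user = User({"firstName": "Alyssa", "lastName": "Yssa"})
print(user.fullname())
```

In this sketch, assigning a non-string to user.firstName raises TypeError, which is the kind of guarantee a plain dictionary cannot give.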

Example

A Simple Example

User.avsc

{
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "lastName",
      "type": "string"
    },
    {
      "name": "firstName",
      "type": "string"
    }
  ]
}

The following code defines a User class associated with the schema:

@avro_schema(AvroDataNames(default_namespace="example.avro"), schema_file="User.avsc")
class User(object):
  def fullname(self):
    return "{} {}".format(self.firstName, self.lastName)

With this class definition, the full name can be obtained with the function call.

user = User({"firstName": "Alyssa", "lastName": "Yssa"})
print(user.fullname())
# Alyssa Yssa

Avro Schema with Extra Validation

In some use cases, extra validation is required. For example, Date.avsc:

{
  "name": "Date",
  "type": "record",
  "fields": [
    {
      "name": "year",
      "type": "int"
    },
    {
      "name": "month",
      "type": "int"
    },
    {
      "name": "day",
      "type": "int"
    }
  ]
}

The month and day of a date cannot be arbitrary integers. Extra validation can be done as follows:

import datetime

@avro_schema(AvroDataNames(default_namespace="example.avro"), schema_file="Date.avsc")
class Date(object):
  def __init__(self, value):
    if isinstance(value, datetime.date):
      value = {
          'year': value.year,
          'month': value.month,
          'day': value.day
      }
    super().__init__(value)

  def date(self):
    return datetime.date(self.year, self.month, self.day)

  def validate(self, data):
    return super().validate(data) \
        and datetime.date(data['year'], data['month'], data['day'])

The Date class can validate the input before assigning it to the underlying Avro schema:

date = Date({"year": 2018, "month": 12, "day": 99})
# ValueError: day is out of range for month
date = Date(datetime.date(2018, 12, 12))
# No Error
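The extra check above relies on datetime.date raising ValueError for impossible dates. As a standalone sketch that does not depend on this library, the same check can be wrapped in a helper that returns a boolean instead of raising; is_valid_date is a hypothetical name.

```python
import datetime


def is_valid_date(data):
    """Return True if data holds year/month/day values that form a real date."""
    try:
        datetime.date(data["year"], data["month"], data["day"])
    except (KeyError, TypeError, ValueError):
        # Missing field, non-integer value, or out-of-range month/day.
        return False
    return True


print(is_valid_date({"year": 2018, "month": 12, "day": 12}))
print(is_valid_date({"year": 2018, "month": 12, "day": 99}))
```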

Extract an avro schema defined in an outer schema

Sometimes an Avro schema is defined inside another schema, as in Employee.avsc:

{
  "type": "record",
  "name": "Employee",
  "namespace": "com.test",
  "fields": [
    {
      "name": "id",
      "type": "string"
    },
    {
      "name": "name",
      "type": {
        "type": "record",
        "name": "Name",
        "namespace": "com.test",
        "fields": [
          {
            "name": "lastName",
            "type": "string"
          },
          {
            "name": "firstName",
            "type": "string"
          }
        ]
      }
    }
  ]
}

The schema com.test.Name is defined in com.test.Employee. There is no Name.avsc, but you can still define a class for the schema:

# The parent schema must be defined first.
@avro_schema(
    EXAMPLE_NAMES,
    schema_file=os.path.join(DIRNAME, "Employee.avsc"))
class Employee(object):
    pass


# Full name is required
@avro_schema(EXAMPLE_NAMES, full_name="com.test.Name")
class Name(object):
    pass


name = Name({"firstName": "Alyssa", "lastName": "Yssa"})
print(name)
# {'firstName': 'Alyssa', 'lastName': 'Yssa'}
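Since the full name is required, it can help to list which named types an outer schema defines. The sketch below walks a schema held as a Python dictionary and collects the full names, following Avro's rule that a nested type inherits the enclosing namespace unless it sets its own. It uses only plain Python; named_types is a hypothetical helper, not part of this library.

```python
NAMED_KINDS = ("record", "enum", "fixed")


def named_types(schema, enclosing_namespace=None, found=None):
    """Collect full names of all named types defined inside a schema."""
    found = [] if found is None else found
    if isinstance(schema, list):
        # A union: inspect every branch.
        for branch in schema:
            named_types(branch, enclosing_namespace, found)
    elif isinstance(schema, dict):
        kind = schema.get("type")
        namespace = enclosing_namespace
        if kind in NAMED_KINDS:
            name = schema["name"]
            if "." in name:
                # A dotted name is already a full name.
                namespace, full = name.rsplit(".", 1)[0], name
            else:
                namespace = schema.get("namespace", enclosing_namespace)
                full = "{}.{}".format(namespace, name) if namespace else name
            found.append(full)
        if kind == "record":
            for field in schema["fields"]:
                named_types(field["type"], namespace, found)
        elif kind == "array":
            named_types(schema["items"], namespace, found)
        elif kind == "map":
            named_types(schema["values"], namespace, found)
    # Strings are primitives or references and define nothing new.
    return found


EMPLOYEE = {
    "type": "record",
    "name": "Employee",
    "namespace": "com.test",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "name", "type": {
            "type": "record",
            "name": "Name",
            "namespace": "com.test",
            "fields": [
                {"name": "lastName", "type": "string"},
                {"name": "firstName", "type": "string"},
            ],
        }},
    ],
}

print(named_types(EMPLOYEE))
```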

Contributing

After cloning/forking the repo, navigate to the directory and run

source init.sh

The Python environment should then be ready for you.

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

avro-data-model's People

Contributors

kun-fang

Forkers

frankfanslc

avro-data-model's Issues

No intellisense for the fields on the class

The generated class doesn't provide any auto-completion or IntelliSense for its properties. Do you think this is something that will be added in the future, as with Avro-generated classes in Java?

Enum inside record doesn't work

Tried the following minimal example:

from avro_models import avro_schema, AvroModelContainer

names = AvroModelContainer()

schema = {
    "type": "record",
    "name": "minimal",
    "namespace": "com.example",
    "fields": [
        {
            "name": "examplestring",
            "type": "string"
        },
        {
            "name": "enum1",
            "type": {
                "type": "enum",
                "symbols": ["TEST1", "TEST2"],
                "name": "TestEnum"
            }
        },
        {
            "name": "enum2",
            "type": "com.example.TestEnum"
        }
    ]
}

@avro_schema(names, schema_json=schema)
class MinimalExample(object):
    pass

me = MinimalExample({"examplestring":"Hello", "enum1": "TEST1", "enum2": "TEST2"})
print(me.examplestring)
print(me.enum1)
print(me.enum2)

Unfortunately, it doesn't work:

$ python minimal.py
Hello
Traceback (most recent call last):
  File "minimal.py", line 35, in <module>
    print(me.enum1)
  File "/Users/erfor/.virtualenvs/avro_models/lib/python3.7/site-packages/avro_models/models.py", line 60, in __getattr__
    return item_class(self._value[attr])
TypeError: 'NoneType' object is not callable

I also tried creating a union of the two schemas, with the enum defined first and the record defined second. This also failed.

I don't want to work with multiple schema files and objects because I need to submit it to the Avro schema server, which as far as I know only accepts one schema per topic.

Is there some way of making this work? I guess I could handle it as separate schemas and concatenate them into a union schema before submitting to the schema server, but that's cumbersome, and as far as I know, the above example schema is valid, so it should work with avro-data-model.

How to expose embedded items - in auto-generated schemas from Avro-tools

I'm trying to figure out how to work with nested objects that may be defined in an avdl file.

So if I take this file, orders.avdl:

@namespace("org.jeeftor.avro")
protocol TacoRequest {

enum MeatType{
  CHICKEN,
  BEEF,
  TURKEY,
  FISH
}

enum CheeseType {
  GROSS_VEGAN,
  ACTUAL_COW_CHEESE,
  GOAT_CHEESE
}

enum Toppings {
  LECHUGA,
  TOMATO,
  SAUCE
}

record Taco {
  MeatType meat;
  CheeseType cheese;
  array<Toppings> toppings;
}

record Order {
  union { string, int } order_id;
  array<Taco> tacos;
}

}

And convert it to schema with: avro-tools idl2schemata order.avdl ./parsed

it will generate the following files:

CheeseType.avsc
MeatType.avsc
Order.avsc
Taco.avsc
Toppings.avsc

So in short, an Order is made up of Tacos, each of which contains a MeatType, a CheeseType and a set of Toppings.

I'm wondering if it's possible to get "sub-classes" out of this. In your example you use external references to other records, which DOES work, but I'm trying to automate a process that starts with an avdl file, which when generating code does not add the references.

So if I run this code:

import datetime
import os

from avro_models import avro_schema, AvroModelContainer

EXAMPLE_NAMES = AvroModelContainer(default_namespace="org.jeeftor.avro")
DIRNAME = os.path.dirname(os.path.realpath(__file__))

@avro_schema(
    EXAMPLE_NAMES,
    schema_file=os.path.join(DIRNAME, "../parsed/Order.avsc"))
class Order(object):
    pass

@avro_schema(
    EXAMPLE_NAMES,
    schema_file=os.path.join(DIRNAME, "../parsed/CheeseType.avsc"))
class CheeseType(object):
    pass

I get the following error:

Traceback (most recent call last):
  File "python/models.py", line 21, in <module>
    class CheeseType(object):
  File "/Users/jstein/.virtualenvs/avro-stack-o_AuMqmg/lib/python3.7/site-packages/avro_models/core.py", line 49, in wrapper
    schema = SchemaFromJSONData(schema_json, container)
  File "/Users/jstein/.virtualenvs/avro-stack-o_AuMqmg/lib/python3.7/site-packages/avro/schema.py", line 1214, in SchemaFromJSONData
    return parser(json_data, names=names)
  File "/Users/jstein/.virtualenvs/avro-stack-o_AuMqmg/lib/python3.7/site-packages/avro/schema.py", line 1127, in _SchemaFromJSONObject
    return EnumSchema(name, namespace, symbols, names, doc, other_props)
  File "/Users/jstein/.virtualenvs/avro-stack-o_AuMqmg/lib/python3.7/site-packages/avro/schema.py", line 712, in __init__
    other_props=other_props,
  File "/Users/jstein/.virtualenvs/avro-stack-o_AuMqmg/lib/python3.7/site-packages/avro/schema.py", line 439, in __init__
    names.Register(self)
  File "/Users/jstein/.virtualenvs/avro-stack-o_AuMqmg/lib/python3.7/site-packages/avro/schema.py", line 402, in Register
    'Avro name %r already exists.' % schema.fullname)
avro.schema.SchemaParseException: Avro name 'org.jeeftor.avro.CheeseType' already exists.

So what I'm wondering is: is there a way to handle my situation without having to go in and modify the schemas to look more like:

{
  "name": "occupation",
  "type": "example.avro.Occupation"
},

Thanks
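If hand-editing the generated files is the obstacle, the rewrite itself could be scripted. The following is only a sketch in plain Python, under the assumption that each generated avsc file inlines the definitions of the types it uses. It replaces every nested named-type definition with its full name, leaving only the top-level definition in place. reference_nested is a hypothetical helper, and whether avro-data-model then loads the rewritten schemas as hoped has not been verified here.

```python
NAMED_KINDS = ("record", "enum", "fixed")


def reference_nested(schema, namespace=None, top_level=True):
    """Return a copy of schema with nested named-type definitions
    replaced by their full names; the input is left unmodified."""
    if isinstance(schema, list):
        # A union: rewrite every branch.
        return [reference_nested(s, namespace, False) for s in schema]
    if not isinstance(schema, dict):
        # A primitive name or an existing reference: keep as-is.
        return schema
    kind = schema.get("type")
    if kind in NAMED_KINDS:
        name = schema["name"]
        if "." in name:
            ns, full = name.rsplit(".", 1)[0], name
        else:
            ns = schema.get("namespace", namespace)
            full = "{}.{}".format(ns, name) if ns else name
        if not top_level:
            # Nested definition: refer to it by name instead.
            return full
        result = dict(schema)
        if kind == "record":
            result["fields"] = [
                dict(f, type=reference_nested(f["type"], ns, False))
                for f in schema["fields"]
            ]
        return result
    result = dict(schema)
    if kind == "array":
        result["items"] = reference_nested(schema["items"], namespace, False)
    elif kind == "map":
        result["values"] = reference_nested(schema["values"], namespace, False)
    return result


# Hypothetical, shortened version of a generated Taco.avsc.
TACO = {
    "type": "record",
    "name": "Taco",
    "namespace": "org.jeeftor.avro",
    "fields": [
        {"name": "meat", "type": {
            "type": "enum", "name": "MeatType",
            "symbols": ["CHICKEN", "BEEF", "TURKEY", "FISH"]}},
        {"name": "toppings", "type": {
            "type": "array",
            "items": {"type": "enum", "name": "Toppings",
                      "symbols": ["LECHUGA", "TOMATO", "SAUCE"]}}},
    ],
}

rewritten = reference_nested(TACO)
print(rewritten["fields"][0]["type"])
print(rewritten["fields"][1]["type"])
```

With schemas rewritten this way, each referenced type would have to be registered before any schema that refers to it, so the files would need to be loaded in dependency order.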
