
py2k's Introduction

Welcome to Py2k


A high-level Python-to-Kafka API with Schema Registry compatibility and automatic Avro schema creation.

  • Free software: Apache2 license

Installation

Py2K is currently available on PyPI:

pip install py2k

Documentation

You can view additional documentation on our website.

Contributing

Please see the Contribution Guide for more information.

Usage

Minimal Example

from py2k.record import PandasToRecordsTransformer
from py2k.writer import KafkaWriter

# assuming we have a pandas DataFrame, df
records = PandasToRecordsTransformer(df=df, record_name='test_model').from_pandas()

writer = KafkaWriter(
    topic="topic_name",
    schema_registry_config=schema_registry_config,
    producer_config=producer_config
)

writer.write(records)
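
Both configuration dictionaries are passed through to the underlying confluent-kafka clients. A minimal sketch, assuming a local broker and Schema Registry (replace the placeholder addresses with your own):

producer_config = {'bootstrap.servers': 'localhost:9092'}
schema_registry_config = {'url': 'http://localhost:8081'}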

For additional examples, please see the examples folder.

Features

  • Schema Registry Integration
  • Automatic Avro Serialization from pandas DataFrames
  • Automatic Avro Schema generation from pandas DataFrames and Pydantic objects

License

Copyright 2021 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

py2k's People

Contributors

danwertheimer · dependabot[bot] · vesely-david


py2k's Issues

KafkaModel and DynamicKafkaModel

The current usage example in the README is shown below.

from py2k.models import DynamicKafkaModel
from py2k.writer import KafkaWriter

# assuming we have a pandas DataFrame, df
serialized_df = DynamicKafkaModel(df=df, model_name='test_model').from_pandas()

writer = KafkaWriter(
    topic="topic_name",
    schema_registry_config=schema_registry_config,
    producer_config=producer_config
)

writer.write(serialized_df)

The output from DynamicKafkaModel is called serialized_df, but it is not actually serialized in the context of what the library does, as pointed out in #34.

Also, DynamicKafkaModel only converts pandas DataFrames, whatever their schemas, into key-value records that are then serialized as Avro and dispatched to Kafka.

Given that, having Model as part of the name might be slightly misleading. Something like KafkaFormatter or PandasToKafkaTransformer might be more informative to both users and contributors.

Users should be able to define their own schema namespace

Currently, users have their namespace automatically created for them as python.kafka.<model name>:

py2k/py2k/record.py

Lines 66 to 80 in 14d82e5

@staticmethod
def schema_extra(schema: Dict[str, Any],
                 model: Type['KafkaRecord']) -> None:
    schema['type'] = 'record'
    schema['name'] = schema.pop('title')
    schema['namespace'] = (f'python.kafka.'
                           f'{schema["name"].lower()}')
    schema = process_properties(schema)
    schema.pop('properties')
    # Dynamically generated schemas might not have this field,
    # which is removed anyway.
    if 'required' in schema:
        schema.pop('required')
    update_optional_schema(schema=schema, model=model)

We should allow users to define their own namespace.

Something like:

class MyRecord(KafkaRecord):
    name: str

    @property
    def namespace(self) -> str:
        return 'my.name.space'

or as an argument to PandasToRecordsTransformer, as sketched below.
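
A hypothetical sketch of the argument-based option; the namespace parameter does not exist in the current API:

from py2k.record import PandasToRecordsTransformer

# 'namespace' is a hypothetical argument proposed by this issue;
# it is not part of the current API.
records = PandasToRecordsTransformer(
    df=df,
    record_name='test_model',
    namespace='my.name.space'
).from_pandas()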

Add number of tries while pushing to Kafka

As of now, we push to Kafka in an infinite loop, which might cause problems in the future.
We should consider defining a number of tries after which an exception will be raised.

py2k/py2k/producer.py

Lines 24 to 39 in dfe8d5e

def produce(self, record):
    while True:
        try:
            self._producer.produce(
                topic=self._topic,
                key=record.key_to_avro_dict(),
                value=record.value_to_avro_dict(),
                on_delivery=self._delivery_report
            )
            self._producer.poll(0)
            break
        except BufferError as e:
            print(
                f'Failed to send on attempt {record}. '
                f'Error received {str(e)}')
            self._producer.poll(1)
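
A sketch of a bounded retry loop, assuming a hypothetical max_tries parameter is introduced:

def produce(self, record, max_tries=5):
    # Retry on BufferError up to max_tries times, then give up
    # instead of looping forever.
    for attempt in range(1, max_tries + 1):
        try:
            self._producer.produce(
                topic=self._topic,
                key=record.key_to_avro_dict(),
                value=record.value_to_avro_dict(),
                on_delivery=self._delivery_report
            )
            self._producer.poll(0)
            return
        except BufferError as e:
            print(f'Failed to send on attempt {attempt}. '
                  f'Error received {str(e)}')
            self._producer.poll(1)
    raise BufferError(f'Giving up after {max_tries} attempts')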

Users should be able to update the topic on the writer to a new topic

For example, if I want to write raw data and processed data onto Kafka, into two separate topics, but with the same producer config and schema registry config.

Suggested API

from py2k.writer import KafkaWriter

writer = KafkaWriter(topic='topic1',
                     producer_config=some_producer_config,
                     schema_registry_config=some_sr_config)
writer.write(some_records)

writer.topic = 'new_topic'
# OR
writer.update_topic('new_topic')

writer.write(other_records)
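
A hypothetical implementation sketch, assuming the producer is (re)created on each write() call (as the traceback in a later issue suggests), so only the stored topic needs to change:

class KafkaWriter:
    # ... existing __init__ and write() unchanged ...

    @property
    def topic(self):
        return self._topic

    @topic.setter
    def topic(self, new_topic):
        # Producer and schema registry configs are reused;
        # only the destination topic changes.
        self._topic = new_topic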

Adjust the concept of serialization

In the README, the currently available example is shown below.

from py2k.models import DynamicKafkaModel
from py2k.writer import KafkaWriter

# assuming we have a pandas DataFrame, df
serialized_df = DynamicKafkaModel(df=df, model_name='test_model').from_pandas()

writer = KafkaWriter(
    topic="topic_name",
    schema_registry_config=schema_registry_config,
    producer_config=producer_config
)

writer.write(serialized_df)

This library is about Avro, and in Avro parlance, serialization refers to formatting data as Avro.

In the example above, serialized_df is not in Avro format, so calling it serialized is misleading.

Also, it might be worth changing names across the whole library so that serialization refers only to converting data to Avro, in order to avoid confusing users and contributors.

Work with iterables instead of lists

We've hit memory issues, and the main problem seems to be creating a list of Pydantic models out of a large pandas DataFrame.

The suggestion is to exchange lists for generators as we discussed earlier.
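
A sketch of the generator approach, using a hypothetical from_pandas_iter helper that yields records one row at a time instead of materialising the whole list (KafkaWriter.write would also need to accept any iterable, not just a list):

from typing import Iterator

def from_pandas_iter(df, record_class) -> Iterator:
    # Hypothetical generator variant of from_pandas: yields one
    # record per DataFrame row, keeping memory usage flat.
    for row in df.itertuples(index=False):
        yield record_class(**row._asdict())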

KafkaWriter fails to write data for boolean types

Checks

  • I added a descriptive title to this issue
  • I have searched (google, github) for similar issues and couldn't find anything

Bug

  • py2k version: 1.8.0
  • Python version: 3.9
  • Operating System: macOS

Expected Result

100%|████████████████████| 4/4 [00:00<00:00, 10.09it/s]

Actual Result

Traceback (most recent call last):
  File "/examples/basic_write_from_pandas.py", line 25, in <module>
    writer.write(records)
  File "/folder/anaconda3/envs/py2k/lib/python3.9/site-packages/py2k-1.8.0-py3.9.egg/py2k/writer.py", line 57, in write
    self._create_producer(records)
  File "/folder/anaconda3/envs/py2k/lib/python3.9/site-packages/py2k-1.8.0-py3.9.egg/py2k/writer.py", line 49, in _create_producer
    self._producer = KafkaProducer(self._topic, producer_config)
  File "/folder/anaconda3/envs/py2k/lib/python3.9/site-packages/py2k-1.8.0-py3.9.egg/py2k/producer.py", line 22, in __init__
    self._producer = SerializingProducer(producer_config.dict)
  File "/folder/anaconda3/envs/py2k/lib/python3.9/site-packages/py2k-1.8.0-py3.9.egg/py2k/producer_config.py", line 35, in dict
    config_build['value.serializer'] = self._serializer.value_serializer()
  File "/folder/anaconda3/envs/py2k/lib/python3.9/site-packages/py2k-1.8.0-py3.9.egg/py2k/serializer.py", line 31, in value_serializer
    return AvroSerializer(
  File "/folder/anaconda3/envs/py2k/lib/python3.9/site-packages/confluent_kafka/schema_registry/avro.py", line 174, in __init__
    schema_dict = loads(schema.schema_str)
  File "/folder/anaconda3/envs/py2k/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/folder/anaconda3/envs/py2k/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/folder/anaconda3/envs/py2k/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 158 (char 157)

Code Snippet

from py2k.record import PandasToRecordsTransformer
from py2k.writer import KafkaWriter
import pandas as pd


df = pd.DataFrame({'name': ['Daniel', 'David', 'Felipe', 'Ruslan'],
                   'is_cool': [False, True, True, True],
                   'value': [27.1, 100.0, 0, 9000.0]})

record_transformer = PandasToRecordsTransformer(
    df, record_name='KafkaRecord')

records = record_transformer.from_pandas()


topic = 'py2k-test-topic'
producer_config = {'bootstrap.servers': '...'}
schema_registry_config = {'url': '...'}

writer = KafkaWriter(topic=topic,
                     schema_registry_config=schema_registry_config,
                     producer_config=producer_config)

writer.write(records)

Schema generation of defaults is incompatible with Avro schema

Checks

  • I added a descriptive title to this issue
  • I have searched (google, github) for similar issues and couldn't find anything

Bug

  • py2k version: 1.9.1
  • Python version: Any
  • Operating System: Any

Expected Result

When generating a schema field with a default, it should look like:

{"name": "xxx", "type": ["null", "boolean"], "default": null}

Actual Result

{"name": "xxx", "type": "boolean", "default": null}

Users cannot inspect the py2k version using py2k.__version__

Checks

  • I added a descriptive title to this issue
  • I have searched (google, github) for similar issues and couldn't find anything

Bug

  • py2k version: 1.8.2
  • Python version: 3.x
  • Operating System: OSX

Expected Result

>>> import py2k

>>> print(py2k.__version__)
'1.8.2'

Actual Result

>>> import py2k
>>> print(py2k.__version__)
AttributeError: module 'py2k' has no attribute '__version__'

Implementation

In __init__.py:

__version__ = '1.8.2'

We should find a way to include this in bump2version; a configuration sketch follows.
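
A sketch of a .bumpversion.cfg file section that would keep the string in sync, assuming bump2version is configured in that file:

[bumpversion]
current_version = 1.8.2

[bumpversion:file:py2k/__init__.py]
search = __version__ = '{current_version}'
replace = __version__ = '{new_version}'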

Possibility to specify whether the key will be included in the value

The current implementation always includes the key data in the value, with no way to choose whether that is necessary.

We propose adding a boolean include_key flag to KafkaWriter so that users can choose whether key data is included in the value.

The default should be False.
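
A hypothetical sketch of the proposed API; include_key is the proposed flag and is not part of the current KafkaWriter signature:

writer = KafkaWriter(
    topic='topic_name',
    schema_registry_config=schema_registry_config,
    producer_config=producer_config,
    include_key=False  # proposed flag: keep key data out of the value
)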
