
declarativepython's Introduction

Declarative Python

A calculation engine for automatically stitching together functions, and walking the dependency tree.

This python file is the best example of how this package can be used to simplify your calculations.

Two major use cases are:

  1. Letting you focus on the individual functions and not the structure of the program.
  2. Writing timeseries projection models whose highly interconnected logic would be a pain to structure in a standard program.

How To Use

Install package

python -m pip install git+https://github.com/hearnderek/DeclarativePython

Oh yeah. I know. But I want to be able to use this on my remote system and not have to deal with PyPI just yet.

Basic usage

import declarative

def f() -> str:
    print('f')
    return 'hello'
    
def g(f: str) -> str:
    print('g')
    return f + ' world'
    
def output(f: str, g: str):
    print(f)
    print(g)

if __name__ == '__main__':
    declarative.Run()
~$ python hello_declarative.py
f
g
hello
hello world
~$ 

Okay, what just happened?

In the above example we have three functions. f and g return values, and g uses the value returned by f. Since the parameter name is exactly the same as the function name, this package -- declarative -- does all of the plumbing work to make that happen. In the third function, output, we take f and g and print their return values. Our functions were executed in the order f -> g -> output.

You may have noticed when looking at the output that every function is only executed once. Every function's output is memoized, or in other words saved in memory for later use. This makes sure your code runs efficiently without any extra effort.

Forward Projection Calculations

import declarative

def count_up(t, count_up):
    if t == 0:
        return 0
    else:
        return count_up[t-1] + 1

if __name__ == '__main__':
    df = declarative.Run(t=10)
    print(df)
~$ python forward_projection.py
             count_up
result_id t
0         0         0
          1         1
          2         2
          3         3
          4         4
          5         5
          6         6
          7         7
          8         8
          9         9
~$  

Woah woah woah. What?

t is a special parameter in this system that tells the engine you are doing calculations with distinct timesteps. You tell the Run function how many time steps you need, and within your functions you can calculate forward through 0..n. This type of programming is super common in Excel: "using the result of the cell above, do a calculation." You can now easily convert those Excel calculations into highly similar code. Let the declarative package handle the loops; you handle the logic.

For the data savvy, you may have noticed that a pandas DataFrame was returned by the Run function. You can write out your projections in standard Python, then do your analysis in pandas.
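
As a further illustration, here is a small sketch of an Excel-style running balance written the same way as count_up above. The balance name is made up for this example, and it only assumes the wiring rules already shown (the special t parameter and name[t-1] for the previous timestep).

import declarative

def balance(t, balance):
    if t == 0:
        return 100.0
    # Excel-style: "previous cell times (1 + interest) plus a deposit"
    return balance[t-1] * 1.01 + 10.0

if __name__ == '__main__':
    df = declarative.Run(t=5)
    print(df)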

Cool, but I still want normal functions

import declarative

@declarative.ignore
def print_helper(s):
    print(s + ' world')

def f():
    print_helper('hello')

if __name__ == '__main__':
    declarative.Run()
~$ python ignoreme.py
hello world
~$

Do IO functions block everything else?

Naw, I got you

import declarative
import time

@declarative.io_bound
def slow_one():
    time.sleep(1)
    return 1

@declarative.io_bound
def slow_two():
    time.sleep(1)
    return 2
    
@declarative.io_bound
def slow_three(slow_one, slow_two):
    time.sleep(1)
    return slow_one + slow_two

def output(slow_one, slow_two, slow_three):
    print(slow_one + slow_two + slow_three)

if __name__ == '__main__':
    # takes 2 seconds not 3
    declarative.Run()
~$ python slow_io.py
6
~$

Warnings

  • I am thoroughly abusing Python within this package. Use at your own risk.
  • I have not implemented any garbage collection, so all function results must be able to fit in memory.
  • You can only reliably use [t], [t-1], and [t+1] when accessing values being passed around in projections. (sorry)

declarativepython's People

Contributors

dependabot[bot], hearnderek


declarativepython's Issues

load in tmp_{col}.py on first run IF there are no changes to the originating module.

How to tell if there was a change to the file:

Hacky Idea 1:

When generating python file

  1. Generate MD5 hash of the user module. (only imagining a single file)
  2. Place hash into generated flat script as a standard string variable user_module_hash.

When initializing engine

  1. search for matching flat script within working directory
  2. Generate MD5 hash of the user module.
  3. test user_module_hash against generated hash
  4. if match use flat script
  5. if not match ignore flat script
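
A minimal sketch of that check, assuming the generated flat script contains a plain user_module_hash = "..." assignment near the top (the helper names here are illustrative, not existing engine methods):

import hashlib
import re

def module_hash(path):
    # MD5 of the user's module file on disk
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

def flat_script_is_current(flat_script_path, user_module_path):
    # scan for the embedded hash without importing the (potentially huge) flat script
    pattern = re.compile(r'^user_module_hash\s*=\s*[\'"]([0-9a-f]{32})[\'"]')
    with open(flat_script_path) as f:
        for line in f:
            match = pattern.match(line)
            if match:
                return match.group(1) == module_hash(user_module_path)
    return False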

Adjacent Idea 1:

Once we start generating these files, it would be nice to have a central place to keep them, out of the user's way.
Store the flat file in a sqlite database, with its hash, create date, etc.
Alternatively we could store the files in a zipfile which acts as a document store.
This then means we have a place to store our results as well (which would then mean looking into optimizing writes to sqlite).
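
A minimal sketch of the sqlite variant, assuming a simple table layout (the flat_scripts table and its columns are made up for illustration):

import hashlib
import sqlite3
import time

def store_flat_script(db_path, source):
    # keyed on the script's own hash, with a creation timestamp alongside it
    script_hash = hashlib.md5(source.encode()).hexdigest()
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS flat_scripts "
            "(hash TEXT PRIMARY KEY, created REAL, source TEXT)")
        conn.execute(
            "INSERT OR REPLACE INTO flat_scripts VALUES (?, ?, ?)",
            (script_hash, time.time(), source))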

Refactor the Engine class.

There are a lot of unneeded members, and some logic should be broken out into separate classes.

Specifically:

  1. Remove all unneeded members
  2. Rename remaining members
  3. Break out flat file generation into its own class for easier refactoring.
  4. Add method for choosing optimization style

home_economics

I'm trying to build a forward projection system which shows you what your net worth will be in x years based on your income, expenses, tax, debt, and investments.

This can basically expand indefinitely. Income tax on its own is a non-trivial topic. I don't yet consider car payments, mortgages, the complexities of buying new cars, unexpected expenses, bankruptcy, pay raises correlated with age, or randomness at all.

Decide on a storage medium.

It would be nice to be able to save work directly to an output file.
It would also be nice to use said output files when working with multi-process or multi-machine workloads.
It would also be important when dealing with high memory workloads.

Ideas:

  1. results_to_dataframe() -> dataframe_to_csv()
    • limited to what can be stored as a string
  2. results_to_json()
  3. results_to_sqlite()
  4. result_to_zipdb()
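
None of the methods above exist yet; as a baseline, ideas 1-3 can already be approximated through the pandas DataFrame that Run returns, for example using the forward projection from the README:

import sqlite3
import declarative

df = declarative.Run(t=10)
flat = df.reset_index()  # flatten the (result_id, t) MultiIndex into columns

flat.to_csv('results.csv', index=False)          # idea 1: everything round-trips through strings
flat.to_json('results.json', orient='records')   # idea 2
with sqlite3.connect('results.db') as conn:      # idea 3
    flat.to_sql('results', conn, if_exists='replace', index=False)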

Trim down generated flat python to keep LoC under 1,000,000.

Looks like my notes didn't save.
Basically, Python scripts and functions are limited to 1,000,000 lines of code; there is a PEP related to this.
In my profiling test I ran into this limit. Importing also takes longer when there are a ton of lines of code.

The basic idea behind this fix is to take the unrolled loops and convert them back into loops.

Where should I do this?

  1. Within a new object which walks the graph without executing it?
  2. While creating the code?
  3. Just before writing to file?
  4. Just before importing?

My gut feeling is that the easier options are 2 and 3, but in the long term 1 would be the most maintainable since get_calc is quite bloated.
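
To make the idea concrete, here is a rough sketch. The exact shape of the generated flat script isn't recorded here, so the unrolled form below is an assumption; the point is only that a run of near-identical timestep lines can be collapsed back into a loop.

# assumed unrolled output -- one assignment per timestep, which is what
# blows past 1,000,000 lines for large t:
#   results['count_up'][0] = 0
#   results['count_up'][1] = results['count_up'][0] + 1
#   results['count_up'][2] = results['count_up'][1] + 1
#   ...

# re-rolled equivalent the generator could emit instead:
n = 10
results = {'count_up': [None] * n}
results['count_up'][0] = 0
for t in range(1, n):
    results['count_up'][t] = results['count_up'][t-1] + 1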

Observe what gains could be made by running on a dedicated CPU

I used psutil to set our process priority to the highest possible and then pin the process to a single CPU.

import os
import psutil

p = psutil.Process(os.getpid())
p.nice(psutil.REALTIME_PRIORITY_CLASS)  # highest priority (Windows-only constant)
p.cpu_affinity([1])                     # pin to a single CPU core

While there were some gains, it was not a significant enough increase to warrant the addition of psutil to our dependencies.

Add in multi-process capabilities to IterativeEngine

Since this system is by design simple to run in parallel, let's build that into our IterativeEngine.

The basic idea is

# serial (initial design)
engine = Engine(...)
foreach (row) in input_rows:
    engine.calculate(row)
result = engine.results_to_df()

# parallel local (pseudo code of intended result of this issue)
parallel_foreach (input_rows, engine) in divided_work:
    foreach row in input_rows:
        engine.calculate(row)
result = collect(divided_work)

# parallel distributed (needed to run at production scale)
foreach (input_rows, engine) in divided_work:
    api.send(input_rows, engine, lambda x: receive_results(x))
result = collect(divided_work)
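
For the "parallel local" case, a minimal sketch with the standard library might look like the following; make_engine, the chunking, and collect-by-concatenation are assumptions layered on the pseudo code above rather than the current IterativeEngine API:

from multiprocessing import Pool

def calculate_chunk(args):
    make_engine, rows = args
    engine = make_engine()      # build a fresh engine inside each worker process
    return [engine.calculate(row) for row in rows]

def run_parallel(make_engine, input_rows, processes=4):
    # divide the work into one chunk per process
    chunk_size = max(1, len(input_rows) // processes)
    chunks = [(make_engine, input_rows[i:i + chunk_size])
              for i in range(0, len(input_rows), chunk_size)]
    with Pool(processes) as pool:
        per_chunk = pool.map(calculate_chunk, chunks)
    # collect: flatten the per-chunk results back into one list
    return [result for results in per_chunk for result in results]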
