dsp-uga / einstein

Solar Irradiance Prediction using PySpark

Python 88.78% Shell 11.22%

pyspark regression-analysis

einstein's People

Contributors

aashishyadavally jayant1234 anirudh-kakarlapudi

einstein's Issues

GCP Initialization Action

@Anirudh-Kakarlapudi: As you observed, I tried setting up GCP with an Initialization Action, and it is failing. Can you share the gist of the logs so that we can try to debug it?

Also try GCP's dataproc-initialization-action script to see if that helps.

Created a Base Class which can be used in all regression models.

Comment about the necessary additions

I think it would be best if @Anirudh-Kakarlapudi and @Jayant1234 started pushing their models to the code base and testing them on a generic dataset like 'AutoMPG'.
I will add a binder script which lets you run the models by passing test=True as a flag.

We can set a deadline of, say, 04/20/2019 (open to discussion), so that we have sufficient time to also train the models on the Solar Data and tune their hyper-parameters.

Dataset for models

@Jayant1234 , @Anirudh-Kakarlapudi : I am currently exploring BigQuery to create a Table containing the dataset, which can then be retrieved from the cluster.
I intend to create a Dataset class that retrieves this Table, so that its data can be used in the models.

Unit tests for models

@Anirudh-Kakarlapudi , @Jayant1234 : This should be a good starting point on how to unit test our Model class.

https://stackoverflow.com/questions/1323455/python-unit-test-with-base-and-sub-class
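One way to follow the base-and-subclass testing pattern is sketched below. It is only an illustration: `Model`, `MeanModel`, `ModelContract` and `run_suite` are hypothetical names, not the project's actual classes. The shared tests live in a plain mixin, so unittest runs them only through the concrete test classes that inherit from it.

```python
import unittest


class Model:
    """Hypothetical stand-in for the project's base Model class."""

    def fit(self, xs, ys):
        raise NotImplementedError

    def predict(self, xs):
        raise NotImplementedError


class MeanModel(Model):
    """Toy subclass: always predicts the mean of the training targets."""

    def fit(self, xs, ys):
        self.mean_ = sum(ys) / len(ys)
        return self

    def predict(self, xs):
        return [self.mean_ for _ in xs]


class ModelContract:
    """Shared tests. Not a TestCase itself, so unittest does not collect it."""

    model_cls = None  # set by each concrete test class

    def test_fit_returns_self(self):
        model = self.model_cls()
        self.assertIs(model.fit([1, 2, 3], [2.0, 4.0, 6.0]), model)

    def test_predict_length_matches_input(self):
        model = self.model_cls().fit([1, 2, 3], [2.0, 4.0, 6.0])
        self.assertEqual(len(model.predict([10, 20])), 2)


class TestMeanModel(ModelContract, unittest.TestCase):
    """Concrete test class: inherits the shared tests and adds its own."""

    model_cls = MeanModel

    def test_predicts_training_mean(self):
        model = self.model_cls().fit([1, 2, 3], [2.0, 4.0, 6.0])
        self.assertEqual(model.predict([0]), [4.0])


def run_suite():
    """Run the concrete test class and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMeanModel)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Each new model would then need only a small test class that sets `model_cls` and adds any model-specific checks.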

Encryption of the csv file for data security

@aashishyadavally , @Jayant1234 : I know that the Google Storage bucket can restrict access to the data if permissions are granted only to us. I think we can add an extra layer of protection by encrypting the file with a Python package and decrypting it with a key or password every time we use the data. Any suggestions?

defining pipelines

I think it will be quite tedious to define many transformations and estimators separately for each model. Instead, we should define them in one class and use subsets of them in each specific implementation, such as linear regression.
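A minimal sketch of that idea follows, with plain Python functions standing in for Spark ML transformers and estimators. All names here (`STAGES`, `MODEL_STAGES`, `run_pipeline`, and the individual stages) are assumptions made for illustration, not code from the repository.

```python
def drop_missing(rows):
    """Discard rows that contain a None value."""
    return [row for row in rows if None not in row.values()]


def to_float(rows):
    """Cast every value to float (e.g. strings read from a CSV)."""
    return [{key: float(value) for key, value in row.items()} for row in rows]


def min_max_scale(rows):
    """Rescale each column to the [0, 1] range."""
    if not rows:
        return rows
    scaled = [dict(row) for row in rows]
    for key in rows[0]:
        values = [row[key] for row in rows]
        low, high = min(values), max(values)
        span = (high - low) or 1.0  # avoid division by zero on constant columns
        for row in scaled:
            row[key] = (row[key] - low) / span
    return scaled


# Single place where every reusable stage is defined.
STAGES = {
    "drop_missing": drop_missing,
    "to_float": to_float,
    "min_max_scale": min_max_scale,
}

# Each model only names the subset (and order) of stages it needs.
MODEL_STAGES = {
    "linear_regression": ["drop_missing", "to_float", "min_max_scale"],
    "decision_tree": ["drop_missing", "to_float"],
}


def run_pipeline(model_name, rows):
    """Apply the stages registered for `model_name`, in order."""
    for stage_name in MODEL_STAGES[model_name]:
        rows = STAGES[stage_name](rows)
    return rows
```

With Spark, the same registry could presumably hold stage objects instead of functions, and the selected subset would be passed to a pipeline; that mapping is left open here.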

Class/Variable names

I think it's better if we model our class names or folder structure for the Models API along the lines of SparkML or SKLearn, which will be a decent convention to follow, and the code-base will look pretty neat.

To be more specific, SparkML has all its regression model classes bundled in a regression file, while SKLearn bundles its regression models in sub-categories, such as linear, trees, ensemble etc.

Looking forward to your inputs so that we can finalize the design choice sooner rather than later.

Error while testing

@Anirudh-Kakarlapudi : I am getting the error `Data type StringType of column 'xyz' not supported` for all the columns when I deploy the project onto the GCP cluster.

Can you please check that as well? Do check it on the version in the code-bind branch, which carries a few changes that I made to the code.

Adding a cross validation step

Whether `Model` class in base.py should remain abstract or not

I think defining an abstract base class is not needed. We have to reuse many transformations and estimators across models, so we need to define them in one place. Deriving a class from the abstract class in base.py, only for it to become a template for other classes, seems wasteful. So I am arguing that we should drop the abstract class in favor of a non-abstract one in which we define a pipeline flow with all the reusable transformations and estimators.

PVLib ForecastModel class

@Anirudh-Kakarlapudi , @Jayant1234 :

We currently use `from einstein.models import pvlib`, which in turn uses `from pvlib.forecast import ForecastModel`. The problem is that pvlib currently warns that the ForecastModel API we are importing will need maintenance, as they are planning to move the whole API into an io.

Thus, each time we run the module, the ForecastModel warning pops up. To avoid this, I am thinking of moving the `from einstein.models import pvlib` import into the --model=pvlib branch of the switch-case construct in the __main__.py script, which will ensure that the warning pops up only when the user chooses to run the PVLib models through einstein.

What do you think?
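The deferred import could look roughly like the sketch below. It is an assumption-laden illustration rather than the project's actual `__main__.py`: the `MODEL_MODULES` mapping, the `einstein.models.linear` entry, and the function names are hypothetical, and only the `pvlib` entry comes from the discussion above. Because the module is imported inside the function, any import-time warning appears only when that model is selected.

```python
import argparse
import importlib

# Hypothetical mapping from --model values to module paths. Only the
# 'pvlib' entry is taken from the issue; 'linear' is a made-up example.
MODEL_MODULES = {
    "pvlib": "einstein.models.pvlib",
    "linear": "einstein.models.linear",
}


def parse_args(argv=None):
    """Parse the command line; --model selects which module to load."""
    parser = argparse.ArgumentParser(prog="einstein")
    parser.add_argument("--model", choices=sorted(MODEL_MODULES), required=True)
    return parser.parse_args(argv)


def load_model_module(name, registry=MODEL_MODULES):
    """Import the module for `name` only at the moment it is requested."""
    return importlib.import_module(registry[name])


def main(argv=None):
    """Load just the selected model module and hand it back to the caller."""
    args = parse_args(argv)
    return load_model_module(args.model)
```

A dictionary lookup replaces the switch-case chain here, but the same effect can be had by placing the `import` statement inside the relevant `if` branch.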

Lock on 'develop' branch

How about putting a 'Lock' on the 'develop' branch as well?
It would help us do code reviews when we push changes.
During the pull request itself, we could highlight parts of the code that need improvement and can be changed further, and Issues would let us track those changes.

Technically, we do have a 'Lock' on the 'master' branch, but considering we only intend to use it as a deployment branch, I thought this would be a fair thing to do.

Comments, @Anirudh-Kakarlapudi , @Jayant1234 ?

need a parameter model to return a dictionary also

Metrics to track model performance

@Anirudh-Kakarlapudi , @Jayant1234 : For now, as also described in the proposal, I can only think of Mean Absolute Error (MAE) and R-squared. Would you suggest any other metrics?
If so, I think the base class should also include an abstract method which tracks the performance of models. Comments?
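As a rough sketch of the two metrics and of an abstract evaluation hook, the snippet below uses plain Python lists. The names `Model`, `ConstantModel` and `evaluate` are hypothetical and are not taken from base.py. On the cluster, `pyspark.ml.evaluation.RegressionEvaluator` with `metricName` set to `"mae"` or `"r2"` would be the likely replacement for the helper functions.

```python
from abc import ABC, abstractmethod


def mean_absolute_error(y_true, y_pred):
    """MAE: average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


class Model(ABC):
    """Hypothetical base class with an abstract performance-tracking method."""

    @abstractmethod
    def predict(self, xs):
        """Return one prediction per input."""

    @abstractmethod
    def evaluate(self, xs, y_true):
        """Return a dict of metric name -> value."""


class ConstantModel(Model):
    """Toy model that always predicts the same value."""

    def __init__(self, value):
        self.value = value

    def predict(self, xs):
        return [self.value for _ in xs]

    def evaluate(self, xs, y_true):
        y_pred = self.predict(xs)
        return {
            "mae": mean_absolute_error(y_true, y_pred),
            "r2": r_squared(y_true, y_pred),
        }
```

If every model ends up computing the same metrics, `evaluate` could instead be a concrete method on the base class, which ties into the question of whether the base class should stay abstract.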