einstein's People

Contributors

aashishyadavally, anirudh-kakarlapudi, jayant1234

einstein's Issues

GCP Initialization Action

@Anirudh-Kakarlapudi: As you observed, I tried setting up the GCP with an Initialization Action, and it is failing. Can you share a gist of the logs so that we can try to debug it?

Also, try GCP's dataproc-initialization-action script to see if that helps.

Continuous integration

I think it would be best if @Anirudh-Kakarlapudi and @Jayant1234 started pushing their models to the code base and tested them using a generic dataset such as 'AutoMPG'.
I will add a binder script that lets you run the models by passing test=True as a flag.

We can set a deadline of, say, 04/20/2019 (open to discussion), so that we have sufficient time to also train the models on the Solar Data and tune their hyper-parameters.

Dataset for models

@Jayant1234, @Anirudh-Kakarlapudi: I am currently exploring BigQuery to create a Table containing the dataset, which can then be retrieved from the cluster.
I intend to create a `Dataset` class that retrieves this Table and makes its data available to the models.

Encryption of the csv file for data security

@aashishyadavally, @Jayant1234: I know that the Google storage bucket can restrict access to the data if permissions are granted only to us. I think we can add an extra layer of protection by encrypting the file with a Python package and decrypting it with a key or password every time we use the data. Any suggestions?

defining pipelines

I think it will be quite tedious to include many transformations and estimators separately for each model. We should find a way to define them in one class, so that each specific implementation, such as linear regression, can use a subset of these transformations and estimators.
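
One possible shape for this idea is a single catalogue of named stages from which each model picks a subset. The sketch below is only an illustration in plain Python: the stage names, the `STAGES` dictionary and `build_pipeline` are hypothetical, and in the actual project the stages would presumably be SparkML transformers and estimators rather than plain functions.

```python
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, Optional[float]]
Stage = Callable[[List[Row]], List[Row]]


def drop_missing(rows: List[Row]) -> List[Row]:
    """Remove rows that contain a missing (None) value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def min_max_scale(rows: List[Row]) -> List[Row]:
    """Scale every column to the range [0, 1]."""
    if not rows:
        return []
    scaled = [dict(r) for r in rows]
    for col in rows[0]:
        values = [r[col] for r in rows]
        lo, hi = min(values), max(values)
        span = hi - lo
        for r in scaled:
            # A constant column has no spread, so map it to 0.0.
            r[col] = (r[col] - lo) / span if span else 0.0
    return scaled


# Hypothetical catalogue: every reusable stage is defined exactly once.
STAGES: Dict[str, Stage] = {
    "drop_missing": drop_missing,
    "min_max_scale": min_max_scale,
}


def build_pipeline(names: Iterable[str]) -> Stage:
    """Compose the chosen subset of stages, in the order given."""
    stages = [STAGES[name] for name in names]

    def run(rows: List[Row]) -> List[Row]:
        for stage in stages:
            rows = stage(rows)
        return rows

    return run
```

A linear regression model could then request `build_pipeline(["drop_missing", "min_max_scale"])`, while another model asks for a different subset without redefining any stage.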

Class/Variable names

I think it's better if we model our class names or folder structure for the Models API along the lines of SparkML or SKLearn, which will be a decent convention to follow, and the code-base will look pretty neat.

To be more descriptive, SparkML has all its regression model classes bundled in a regression file, while SKLearn bundles its regression models in sub-categories, such as linear, trees, ensemble etc.

Looking forward to your inputs so that we can finalize the design choice sooner rather than later.

Error while testing

@Anirudh-Kakarlapudi: I am getting the error `Data type StringType of column 'xyz' not supported` for all the columns when I deploy the project onto the GCP cluster.

Can you please check that as well? Please check it against the version in the code-bind branch, which carries a few changes that I made to the code.

Whether `Model` class in base.py should remain abstract or not

I don't think we need to define an abstract base class. Since many models reuse the same transformations and estimators, we need to define them in one place for better reuse. Deriving a class from the abstract class in base.py, only for it to become a template for other classes, seems wasteful. So, I am arguing that we should drop the abstract class in favor of a non-abstract one in which we can define the pipeline flow with all the reusable transformations and estimators.
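
To make the proposal concrete, here is a rough sketch of what a non-abstract base class could look like. Everything in it is an assumption for illustration: the method names, the toy data format of (feature, target) pairs and the `LinearRegression` subclass are not taken from base.py.

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[float, Optional[float]]  # (feature, target); target may be missing


class Model:
    """Non-abstract base class (hypothetical): the shared flow lives here."""

    def preprocess(self, data: Sequence[Pair]) -> List[Tuple[float, float]]:
        # Shared transformation: drop pairs whose target is missing.
        return [(x, y) for x, y in data if y is not None]

    def fit(self, data: Sequence[Pair]) -> "Model":
        # Default estimator: remember the mean target.
        clean = self.preprocess(data)
        self.mean_ = sum(y for _, y in clean) / len(clean)
        return self

    def predict(self, x: float) -> float:
        return self.mean_


class LinearRegression(Model):
    """Reuses preprocess() and replaces only the estimator."""

    def fit(self, data: Sequence[Pair]) -> "LinearRegression":
        clean = self.preprocess(data)
        n = len(clean)
        mean_x = sum(x for x, _ in clean) / n
        mean_y = sum(y for _, y in clean) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in clean)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in clean)
        # Ordinary least squares for a single feature.
        self.slope_ = sxy / sxx
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self

    def predict(self, x: float) -> float:
        return self.intercept_ + self.slope_ * x
```

Here the base class is usable on its own as a mean baseline, and the subclass only overrides the parts that differ. The trade-off is that nothing forces a subclass to override `fit`, which is the guarantee an abstract class would give.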

PVLib ForecastModel class

@Anirudh-Kakarlapudi, @Jayant1234:

We are using `from einstein.models import pvlib`, which in turn uses `from pvlib.forecast import ForecastModel`. The problem is that pvlib currently says the ForecastModel API that we are importing will need maintenance, as they are planning to move the whole API into an io.

Thus, each time we run the module, the ForecastModel warning pops up. To avoid this, I was thinking of moving the `from einstein.models import pvlib` import into the `--model=pvlib` branch of the switch-case construct in the `__main__.py` script, which will ensure that the warning pops up only when the user chooses to run the PVLib models through einstein.

What do you think?
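
A minimal sketch of the deferred import, assuming `__main__.py` parses a `--model` option with argparse. The function names and the default value "linear" are hypothetical; only the `from einstein.models import pvlib` statement comes from the discussion above.

```python
import argparse


def load_model_module(name):
    """Import optional model modules only when they are requested."""
    if name == "pvlib":
        # Deferred import: this line (and the ForecastModel warning it
        # triggers) runs only when the user selects --model=pvlib.
        from einstein.models import pvlib
        return pvlib
    # In this sketch, other models need no optional import.
    return None


def main(argv=None):
    parser = argparse.ArgumentParser(prog="einstein")
    # "linear" is a placeholder default, not a confirmed model name.
    parser.add_argument("--model", default="linear")
    args = parser.parse_args(argv)
    module = load_model_module(args.model)
    return args.model, module
```

Because Python executes an `import` statement only when control reaches it, any other `--model` value never touches the pvlib module.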

Lock on 'develop' branch

How about putting a 'Lock' on the 'develop' branch as well?
It would help us do code reviews when we push changes.
During the pull request itself, we could highlight parts of the code that need improvement, and Issues would be able to track the resulting changes.

Technically, we already have a 'Lock' on the 'master' branch, but since we only intend to use it as a deployment branch, I think locking 'develop' as well is a fair thing to do.

Comments, @Anirudh-Kakarlapudi, @Jayant1234?

Metrics to track model performance

@Anirudh-Kakarlapudi, @Jayant1234: For now, as described in the proposal, I can only think of Mean Absolute Error (MAE) and R-squared. Would you suggest any other metrics?
If so, I think the base class should also include an abstract method that tracks the performance of the models. Comments?
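
For reference, the two metrics named above can be sketched in plain Python as follows, together with one way an abstract tracking method might look. The class names and the `evaluate` method are hypothetical and not part of the current code base.

```python
from abc import ABC, abstractmethod
from typing import Dict, Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """MAE: the average of |true - predicted|."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined for constant targets")
    return 1 - ss_res / ss_tot


class Model(ABC):
    """Hypothetical base class with an abstract performance-tracking method."""

    @abstractmethod
    def evaluate(self, y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
        """Return the metrics tracked for this model."""


class RegressionModel(Model):
    """Hypothetical subclass reporting the two metrics from the proposal."""

    def evaluate(self, y_true, y_pred):
        return {
            "mae": mean_absolute_error(y_true, y_pred),
            "r2": r_squared(y_true, y_pred),
        }
```

Returning a dictionary keeps the method signature stable if further metrics are agreed on later.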