dsp-uga / einstein
Solar Irradiance Prediction using PySpark
@Anirudh-Kakarlapudi: As you observed, I tried setting up the GCP cluster with an Initialization Action, and it is failing. Can you share a gist of the logs so that we can try to debug it?
Also, try GCP's dataproc-initialization-action script to see if that helps.
I think it would be best if @Anirudh-Kakarlapudi and @Jayant1234 started pushing their models to the code base and tested them using a generic dataset like 'AutoMPG'.
I will add a binder script that lets you run the models by passing test=True as a flag.
We can set a deadline of, say, 04/20/2019 (open to discussion), so that we have sufficient time to also train the models on the Solar Data and do hyper-parameter tuning on the same.
@Jayant1234, @Anirudh-Kakarlapudi: I am currently exploring BigQuery to create a table containing the dataset, which can then be retrieved from the cluster.
So, I intend to create a Dataset class that retrieves this table and makes its data available to the models.
@Anirudh-Kakarlapudi, @Jayant1234: This should be a good starting point on how to unit test our Model class.
https://stackoverflow.com/questions/1323455/python-unit-test-with-base-and-sub-class
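One pattern that comes up for this kind of base/sub-class testing is to keep the shared checks in a plain mixin and attach unittest.TestCase only in the concrete test classes, so the shared checks are never collected on their own. A minimal sketch, assuming two toy models (MeanModel and LastValueModel are hypothetical stand-ins, not classes from einstein):

```python
import unittest


class MeanModel:
    """Hypothetical stand-in model: always predicts the training mean."""

    def fit(self, xs, ys):
        self.value_ = sum(ys) / len(ys)
        return self

    def predict(self, xs):
        return [self.value_ for _ in xs]


class LastValueModel:
    """Hypothetical stand-in model: always predicts the last training target."""

    def fit(self, xs, ys):
        self.value_ = ys[-1]
        return self

    def predict(self, xs):
        return [self.value_ for _ in xs]


class ModelContractTests:
    """Shared checks. Not a TestCase, so unittest never runs it directly."""

    model_cls = None  # set by each concrete test class

    def setUp(self):
        self.model = self.model_cls().fit([1, 2, 3], [2.0, 4.0, 6.0])

    def test_predict_length_matches_input(self):
        self.assertEqual(len(self.model.predict([4, 5])), 2)

    def test_predictions_are_floats(self):
        for prediction in self.model.predict([4, 5]):
            self.assertIsInstance(prediction, float)


class TestMeanModel(ModelContractTests, unittest.TestCase):
    model_cls = MeanModel


class TestLastValueModel(ModelContractTests, unittest.TestCase):
    model_cls = LastValueModel


def run_suite():
    """Run the shared checks against every concrete model test class."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for case in (TestMeanModel, TestLastValueModel):
        suite.addTests(loader.loadTestsFromTestCase(case))
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Adding a new model then only needs a two-line test class that sets model_cls.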
@aashishyadavally, @Jayant1234: I know that the Google storage bucket can restrict access to the data if permissions are granted only to us. I think we can add an extra layer of protection by encrypting the data with a Python package and decrypting it with a key or password every time we use it. Any suggestions?
I think it will be quite tedious to repeat many transformations and estimators for each model. So, we should find a way to define them in one class and use subsets of these transformations and estimators for each specific implementation, such as linear regression.
I think it's better if we model our class names or folder structure for the Models API along the lines of SparkML or SKLearn, which will be a decent convention to follow, and the code-base will look pretty neat.
To be more descriptive, SparkML bundles all of its regression model classes in a single regression file, while SKLearn groups its regression models into sub-categories such as linear, trees, ensemble, etc.
Looking forward to your inputs so that we can finalize the design choice sooner rather than later.
@Anirudh-Kakarlapudi: I am getting the error "Data type StringType of column 'xyz' not supported" for all the columns when I deploy the project onto the GCP cluster.
Can you please check that as well? Please check it against the version in the code-bind branch, which carries a few changes that I made to the code.
I don't think we need to define an abstract base class. Since many models reuse the same transformations and estimators, we should define them in one place for better reuse. Deriving a class from the abstract class in base.py, only for it to become a template for other classes, seems wasteful. So, I am arguing that we should drop the abstract class in favor of a non-abstract one in which we define the pipeline flow with all the reusable transformations and estimators.
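To make the proposal concrete, here is a rough sketch of what such a non-abstract base class could look like. It uses plain Python functions as stand-ins for the Spark transformations and estimators, and every name in it (PipelineModel, STAGES, stage_names, the two stage functions) is hypothetical rather than taken from the code base:

```python
def drop_missing(rows):
    """Stand-in transformation: discard rows that contain None."""
    return [row for row in rows if None not in row]


def min_max_scale(rows):
    """Stand-in transformation: rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    # Fall back to a span of 1.0 so constant columns do not divide by zero.
    spans = [(max(column) - min(column)) or 1.0 for column in columns]
    return [
        tuple((value - low) / span for value, low, span in zip(row, lows, spans))
        for row in rows
    ]


class PipelineModel:
    """Non-abstract base: all reusable stages are registered in one place."""

    STAGES = {
        "drop_missing": drop_missing,
        "min_max_scale": min_max_scale,
    }

    # Subclasses pick the subset (and order) of stages they need.
    stage_names = ()

    def transform(self, rows):
        """Apply the selected stages in order and return the result."""
        for name in self.stage_names:
            rows = self.STAGES[name](rows)
        return rows


class LinearRegressionModel(PipelineModel):
    stage_names = ("drop_missing", "min_max_scale")


class TreeModel(PipelineModel):
    # Assumed here not to need feature scaling.
    stage_names = ("drop_missing",)
```

With this layout, a new model only declares which stages it wants, and a change to a shared stage is made once in the base class.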
@Anirudh-Kakarlapudi, @Jayant1234:
We are currently using from einstein.models import pvlib, which in turn uses 'from pvlib.forecast import ForecastModel'. The problem is that pvlib currently warns that the ForecastModel API we are importing will need maintenance, as they are planning to move the whole API into an io module.
Thus, each time we run the module, the ForecastModel warning pops up. To avoid this, I was thinking of moving the from einstein.models import pvlib import into the --model=pvlib branch of the switch-case construct in the __main__.py script, which will ensure that the warning pops up only when the user chooses to run the PVLib models through einstein.
What do you think?
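A rough sketch of the deferred import idea, assuming the model choice arrives through a --model argument as described above. The MODEL_MODULES mapping, the helper names, and the "einstein.models.linear" path are hypothetical; only the einstein.models.pvlib path comes from the discussion:

```python
import argparse
import importlib

# Hypothetical mapping from --model values to module paths.
MODEL_MODULES = {
    "pvlib": "einstein.models.pvlib",
    "linear": "einstein.models.linear",
}


def parse_args(argv=None):
    """Parse the command line; only the --model switch is sketched here."""
    parser = argparse.ArgumentParser(prog="einstein")
    parser.add_argument("--model", required=True, choices=sorted(MODEL_MODULES))
    return parser.parse_args(argv)


def load_model_module(model_name, modules=MODEL_MODULES):
    """Import the module for the chosen model only when it is requested.

    Because the import happens inside this function, any import-time
    warning (such as the pvlib ForecastModel one) is emitted only when
    that model is actually selected.
    """
    try:
        module_path = modules[model_name]
    except KeyError:
        raise ValueError(f"unknown model: {model_name!r}") from None
    return importlib.import_module(module_path)
```

In __main__.py this would be used roughly as load_model_module(parse_args().model), with no model imports left at the top of the file.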
How about putting a 'Lock' on the 'develop' branch as well?
It would help us do code reviews when we push changes.
During the pull request itself, we could highlight parts of the code that need improvement, which can then be changed, and Issues will be able to track those changes.
Technically, we do have a 'Lock' on the 'master' branch, but considering we only intend to use it as a deployment branch, I thought this would be a fair thing to do.
Comments, @Anirudh-Kakarlapudi, @Jayant1234?
@Anirudh-Kakarlapudi , @Jayant1234 : For now, as also described in the proposal, I can only think of Mean Absolute Error (MAE) and R-squared. Do you suggest any other metrics?
If so, I think the base class should also include an abstract method which tracks performance of models. Comments?
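For reference, both metrics are small enough to write out by hand. The sketch below is plain Python and only illustrates the idea of an abstract evaluation method on the base class; the Model and ConstantModel classes and the evaluate name are hypothetical, not the project's actual API:

```python
from abc import ABC, abstractmethod


def mean_absolute_error(y_true, y_pred):
    """MAE: the average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


class Model(ABC):
    """Hypothetical base class with an abstract performance-tracking method."""

    @abstractmethod
    def predict(self, xs):
        """Return one prediction per input."""

    @abstractmethod
    def evaluate(self, xs, y_true):
        """Return a dict of metric name to value for this model."""


class ConstantModel(Model):
    """Toy subclass used only to show how evaluate could be implemented."""

    def __init__(self, value):
        self.value = value

    def predict(self, xs):
        return [self.value for _ in xs]

    def evaluate(self, xs, y_true):
        y_pred = self.predict(xs)
        return {
            "mae": mean_absolute_error(y_true, y_pred),
            "r2": r_squared(y_true, y_pred),
        }
```

Returning a dict keeps the method open to further metrics later without changing its signature.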