orie-4741-class-project's Issues

Final Peer Review

The group wanted to predict the number of bikes that will be rented at each station. They explored factors that contribute to bike sharing using the SF Bay Area Bike Share dataset. The project could provide bike-sharing companies with insights for optimizing and improving their business model.

Things I like:
The group explained the data cleaning process extensively, so it is easy to follow the rest of the discussion.
The group used imputation to replace the missing precipitation data and used the Google API to fix the corrupted zip code data (a sketch of this kind of fix appears below).
The group explained the reasoning behind choosing one particular operation over another.
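
As a concrete illustration of the Google API zip-code fix praised here and in several other reviews, here is a minimal sketch of what such a correction could look like. It assumes the `googlemaps` Python client and a hypothetical station table with `lat`, `long`, and `zip_code` columns; it illustrates the general approach, not the group's actual code.

```python
# Sketch: replace corrupted zip codes by reverse-geocoding station coordinates.
# The column names and the bad-zip set are hypothetical.
import googlemaps
import pandas as pd

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder API key

def zip_from_coords(lat, lng):
    """Reverse-geocode a coordinate pair and extract the postal code, if any."""
    results = gmaps.reverse_geocode((lat, lng))
    for component in results[0]["address_components"]:
        if "postal_code" in component["types"]:
            return component["long_name"]
    return None

def fix_corrupted_zips(stations: pd.DataFrame, bad_zips: set) -> pd.DataFrame:
    """Overwrite zip codes found in `bad_zips` with the reverse-geocoded value."""
    mask = stations["zip_code"].isin(bad_zips)
    stations.loc[mask, "zip_code"] = [
        zip_from_coords(row.lat, row.long) for row in stations.loc[mask].itertuples()
    ]
    return stations
```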

Things to improve:
The overall presentation, structure, and wording of the report could be improved considerably. The report is missing a title, and it lacks an opening paragraph that would help the reader understand the purpose of the study. There are also stray characters and symbols that did not render in the PDF, which suggests the group did not proofread their work before submission.
It is unclear where the group obtained the data, as there is no reference or citation for the data source. There is also no reference to any background or related work they have read.
When reporting the regression findings, instead of plainly listing all the variables with positive or negative coefficients, the group should present their findings in a table that summarizes the variables, coefficients, test statistics, etc. In addition, the group should interpret the results; otherwise the findings are of little use.
The group could elaborate more on areas for further research. In addition, the concluding finding of the project is very intuitive: it is well known that precipitation negatively impacts bike sharing, so I don't see why the group would highlight this result to conclude their project. Instead, they could report the similarities among stations that have positive and negative coefficients. Since they already used latitude and longitude to impute the missing and corrupted zip codes, they could have used latitude and longitude to explore the similarities and factors that positively contribute to bike sharing.

Final peer review

Summary:
This project attempts to predict the number of bike-sharing events given a set of features including weather and time.

What I like about this project:

  • You did a good job describing the dataset and handling the missing data.
  • You also spotted that the zip code data is corrupted and fixed it.
  • You have a very detailed description of all parameters you tried.

What I believe could be improved:

  • The writing of the report could be improved; there are minor errors, including grammar and wording issues.
  • You said that there is a clear relationship between precipitation and the number of bike-sharing events, but the plots themselves do not indicate this. The very low number of bike-sharing events when precipitation is greater than 0 could be a result of the infrequency of precipitation.
  • You didn't explain why you only used weather-related features when training the first model.
  • In addition, your explanation of why you used a decision tree is not very clear. It is also not clear why, when all of the previous models are regression models, you used a decision tree for classification, or what exactly the decision tree is classifying (see the sketch after this list).
  • If you doubt the validity of your dataset, you probably should not have chosen this dataset for your project.
  • It would be better to include a conclusion about the performance of all your models, along with your comments and analysis of their performance.
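
If the goal is still to predict a count, one way to keep the tree model consistent with the earlier regression models is a regression tree rather than a classifier. This is a minimal sketch under that assumption, using synthetic stand-in data rather than the group's actual features:

```python
# Sketch: use a regression tree so the model predicts rental counts directly.
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=15, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

tree = DecisionTreeRegressor(max_depth=10, random_state=0)  # depth chosen for illustration only
tree.fit(X_train, y_train)

print("test MAE:", mean_absolute_error(y_test, tree.predict(X_test)))
```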

Peer review-Adarsh

This project looks at two different datasets: one with information about NYC rides, and one with information about Bay Area bike-sharing rides. The data contains ride durations along with surrounding information about the rider, the weather, and so on. This is a data analysis project that aims to see how these surrounding factors influence ride duration.

One thing I like about this project is that they have multiple datasets. For a project as narrow as this, multiple datasets are important. I also like the idea of assessing ride duration by comparing it to Google Maps' expected times. Lastly, I like how the datasets cover four consecutive years, from 2013 to 2017.

In terms of areas for improvement, one concern is that you may not have enough features to accurately predict ride length; bringing in some additional datasets might help. Additionally, since your two datasets come from different years, I am unsure how you can combine their features. Finally, given the limited number of features, I think you will need to perform a lot of feature transformations.

Midterm Peer Review-yz2772

Summary
This project is about predicting the number of bikes rented at a specific station during a specific hour. The two datasets they plan to use are the NYC bike-sharing data and the SF Bay Area bike-sharing data.

Things I like

  1. I really like that you used the Google Maps API to generate the correct zip code based on each station's latitude and longitude.
  2. The graphs in the analysis part of your report clearly showcase the relationship between precipitation, temperature, and day of the week and the number of bikes rented.
  3. I think it's a great idea to build multiple models that fit the data using different sets of features. This makes it easier to see how features affect the predictions of a given model, how they correlate with each other, and what changes in results can be expected by adding or removing some features. Overall, this is a great way to do feature selection.

Things to be improved

  1. So far you have shown visualizations of date, precipitation, and temperature. However, since you are predicting bike rentals based on station and hour, I think it is also important to include visualizations of these features in your report. I feel that station/location and hour might play a more important role than weather in your dataset.
  2. In your report you ran a preliminary regression analysis, but you did not report the MSE, MAE, or RMSE values for your training and test sets. I think it is necessary to look at these numbers to see if the model is underfitting or overfitting (a short sketch of this kind of error reporting appears after this list). You could also include statistical results for your regression model, such as R-squared and p-values, to determine whether your features account for the majority of the variation in your data and whether they are statistically significant.
  3. Besides the regression model, you could also try clustering your stations based on their locations. Another thing I want to point out is that some station locations and hours may be strongly correlated. For example, if station A is located in downtown SF and station B is in a residential area south of downtown, then during morning rush hours station B will almost certainly see an increase in the number of bikes rented, while at the same time station A will have basically zero bikes rented, because people who rent a bike at station B ride it to work, and almost nobody rents a bike at station A because nobody lives there. During afternoon rush hours, the situation will be the opposite. You may also want to take the locations of BART/metro and bus stations into consideration, because to my knowledge many people ride a bike from home to a nearby public transportation station every day, so bike stations next to these public transportation stations might see a larger increase during rush hours than other bike stations.
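
On point 2, here is a minimal sketch of the kind of error reporting being suggested, using scikit-learn. The data is a synthetic stand-in, since the group's actual feature matrix is not shown; for coefficient p-values, `statsmodels`' OLS summary is one common option.

```python
# Sketch: report train/test error metrics for a fitted regression model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=15, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

for split, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(Xs)
    mse = mean_squared_error(ys, pred)
    print(f"{split}: MSE={mse:.1f}  RMSE={np.sqrt(mse):.1f}  "
          f"MAE={mean_absolute_error(ys, pred):.1f}  R^2={r2_score(ys, pred):.3f}")
```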

Midterm Peer Review - ja497

This project is about predicting the number of bike rentals from stations around San Francisco based on time, location, and external factors such as the weather at different bike rental stations. The dataset they currently use is an SF-based bike-sharing dataset collected in a ride-by-ride format, which requires a little preprocessing to obtain the rides-per-hour information this project aims to predict.

  1. I really like the logic behind preprocessing and preparing the data. It makes sense to me how you fix the outlier zip codes, and dropping a small amount of data seems fine since there is probably little correlation between individual bike rentals.

  2. I like your initial visualizations, which seem to show some surprising results. For instance, the graph of precipitation vs. bike riders essentially shows that close to zero riders rent bikes in even the lightest rain. As a side note, since this is the case, perhaps encoding precipitation as a boolean (rather than the amount of rainfall) might lead to better model performance too.

  3. I like your group’s interpretation of the features used; the interpretable coefficients from linear regression might be useful for developing future models too. Do explain what features such as 94107 mean though!

  4. Something that could be improved upon is the feature selection for modeling. To further expand on choosing features that make sense, it would probably be worthwhile to analyze which features are actually useful, for example using forward/backward selection or even best-subset selection (see the sketch after this list). Using PCA from last lecture might help too.

  5. It would also be a good idea to expand the set of models you consider. Your current and future plans all seem to be focused on regression (particularly linear regression), and while it indeed gives interpretable results, given the geographical nature of the data it might be helpful to check whether methods that exploit geographic proximity, such as k-nearest neighbors or clustering stations by location, lead to better performance, since stations that are close to each other geographically might have correlated numbers of bike rentals.

  6. I also noticed that you did not include the performance of your models; it might be worth sharing how accurate your models are using just linear regression before moving to other forms of regression. Further analysis including R squared to figure out how much explanatory power your model has might also be useful.

  7. It seems that your current features are mostly the time, the weather, and the location of the bike rental in terms of zip code. My expectation would be that, location-wise, the socioeconomic characteristics of a particular area (such as median income or whether it is urban or suburban) would greatly affect how successful a station is, since SF is a big city with drastically different median incomes across neighborhoods. Zip codes alone might not encode this location information well, so I would expect that using the zip codes to join in socioeconomic information from zip-code datasets found online might improve your models' performance.
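
On point 4, here is a small sketch of the kind of automated feature screening being suggested. It uses scikit-learn's `SequentialFeatureSelector` on synthetic stand-in data, not the group's actual feature set:

```python
# Sketch: forward feature selection with cross-validation.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=500, n_features=12, n_informative=5, noise=10, random_state=0)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=5,   # illustrative; the right number depends on validation error
    direction="forward",      # or "backward" for backward elimination
    cv=5,
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```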

Peer Review

This project wishes to look deeper into Bike Sharing services in two major cities: New York City and SF Bay Area. The end goal is to understand how long it will take a rider to go from start to finish based on various features such as age, gender, etc.

Positive

  1. It is an interesting problem that could give greater insight into how certain demographics interact with these services
  2. Using the datasets individually as well as in aggregate seems like it will be fruitful for your analyses
  3. Problem statement is clear and concise

Negative

  1. I feel like the business implications of understanding travel times could be fleshed out more
  2. I think the motivation could be fleshed out more: why should people care about travel times for bike sharing?
  3. Flesh out more why you think the data is sufficient to solve your problem

Final Peer Review smh367

The goal of the project was to predict the number of bikes that will be rented from a specific station in a certain zip code each hour in San Francisco. They used a dataset that compiled information on zip codes, weather, number of bikes rented, and other features that may affect the number of bikes rented. The project hoped to tell people the best times and places to rent a bike in the San Francisco Bay Area.

Things I like:
1.) I like that your group was realistic and acknowledged that it would not be best to put this model into production, since the data may be corrupted. I also like how you handled identifiably corrupted data by using the Google Maps API.
2.) I like that you attempted to use various models such as linear, lasso, and ridge regression with various regularizers to test if you overfitted.
3.) I like that you managed to achieve a very high test accuracy with your final model (decision trees).

Ways to improve:
1.) I think your analysis would have benefitted from more visualizations. The visualizations you have do a good job at describing the data, but you can improve by using different types of visualizations, different colors, and different data such as model accuracies as you changed parameters.
2.) I think your analysis would have benefitted from testing more models. Your decision tree classifier achieved very high accuracy, but you never explored other classifiers such as random forest, logistic regression, or neural networks (a comparison sketch appears after this list).
3.) I think your analysis would have benefitted from a thorough examination of which features seemed to be most significant in your predictions. It would add some interesting insight into your data.
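
On point 2, here is a minimal sketch of how additional classifiers could be compared under a common cross-validation setup. It uses synthetic stand-in data and illustrative hyperparameters rather than the group's actual pipeline:

```python
# Sketch: compare the decision tree against a few other classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)  # stand-in data

models = {
    "decision tree": DecisionTreeClassifier(max_depth=15, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```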

Peer Review - ajs692

The goal of this project is to predict how long it takes a bike to reach its destination based on the rider's demographics and the weather conditions. The authors are using two datasets, NYC Bike Sharing and SF Bay Area Bike Share, from opposite sides of the country. Both being from large cities, the datasets should be adequately large and distinct enough to be appropriate for training and testing sets.

I like how the authors plan to use datasets from large cities with potentially very different weather and demographics, as this will challenge their models and really test how accurate they are. They have clearly thought about what different features could be included in these models. Additionally, they have put thought into how they plan to train each model separately on one dataset or the other, and then test it on the other dataset.

I would like to see more concrete goals and questions, as “we plan on building many models in order to make accurate predictions” seems rather vague. Additionally, I am concerned that determining the expected duration via Google Maps for such large datasets would be extremely time consuming, unless this data is provided ahead of time. I’d also like to see consideration of things such as traffic patterns, and a more in-depth treatment of rider demographics. Two 70-year-old women could have vastly different travel times; one could be a cyclist or avid runner and therefore in better shape than a 50-year-old man, and unless the dataset accounts for this, I think there could be some prediction errors.

proposal peer review - alh323

This project is looking into analyzing bike-share programs in two major US cities. They plan to focus on predicting the length of time a person spends on a bike-share bike, using data from New York City and the Bay Area.

They plan on using data from two major cities on opposite sides of the country, which I think can help make their analysis broader and more widely applicable, as these cities are culturally very different. I like that they're also extending it by incorporating the expected Google Maps time into their analysis. I additionally like that the data covers such a long period of time, which again can make the analysis more widely applicable.

I'm worried about the limitation of the NYC data set, which only has start/stop times and locations, gender, and age, whereas there is much more information in the Bay Area data set. This could cause some variation in which models you're able to fit, and perhaps create challenges when trying to combine the data in some way. Relatedly, I'm worried that there simply isn't enough information in the data. It may be difficult to figure out the weather for each day, or what other factors could affect the ride time, since there is such a limited number of features provided. Are you able to see exactly where the bike went during the ride, or just the start/end locations? I'm sure quite a lot of people started and ended in the same place if they just wanted a casual bike ride. Lastly, I'm not quite sure how novel or important this project could be: I would expect the result to be simply that the average time to travel between places equals the Google Maps time, or depends heavily on what Google Maps says, and I'm not sure you can answer your question with just the data you have.

Peer Review - wmb87

This team’s project objective is to find out whether the trip time of a bike-share ride can be accurately predicted. They are using two datasets: bike-sharing ride information from NYC in 2015-2017 and from the Bay Area in 2013-2015. This information includes demographic data about the riders, their trip times, and the weather at the time of each ride. Through their data analysis, they hope to uncover which factors are closely related to a rider’s travel time.

I liked how you will compare your prediction results with those of Google Maps to see how useful your models would be in practice and to better evaluate your results. I think you consider a wide variety of factors in your datasets, which may lead to interesting findings. Also, by using data from cities in different regions of the country, I think you will strengthen your model's ability to generalize across different US cities.

I think you could explain more clearly what your project is useful for. Is it to provide riders with an estimate of their trip time? If so, think about how a customer would react to their age or gender being used to predict a longer trip time; those factors may affect averages, but for many users they may be irrelevant. Another concern I have is that your data may not be enough to make accurate predictions. Whether a rider is taking a leisurely ride, riding with others, or doing a workout may have a far greater impact than demographic data. I also think you could add more detail about how large your dataset is and what you will use as training and testing data.

Midterm Peer Review - jha62

The goal of this project is to predict the number of bikes taken from a station. Their data is from the San Francisco Bay Area Bike Share Dataset and includes information about where the station is located, the hour of the day, the weather, and other related features. Using these features they can run regressions and model the number of bikes they expect to be taken from a station for a given hour of the day and set of circumstances.

Likes:

  1. I like the method they used to handle the missing and corrupted data. Instead of simply getting rid of it they used the Google Maps API to match stations with the correct zip code when the data was corrupted.
  2. Plotting a few bar graphs to visualize the impact certain features have on the number of riders was a good addition. It helped me understand the data better.
  3. Starting off with a simple and easy-to-understand regression model was a good idea. I think using it to further understand how the features affect the predicted outcomes is useful. I liked the short analysis of what the coefficients of the preliminary model mean.

Improvements:

  1. The paper mentioned a few error metrics but did not go into detail on how the preliminary model performed. I would like to see some concrete analysis on the performance of the model thus far.
  2. Adding some more visualizations could be helpful. Specifically, plotting the stations on a map along with the average number of bikes rented from each station would be a useful way to visualize the data. I would also like to see a plot of predicted values against actual values.
  3. Using some form of clustering to group stations that are close to each other may help improve model performance (see the sketch below). Additionally, it may be easier and more interesting to predict rides for "morning", "afternoon", "evening", and "night" instead of each hour.
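
On point 3, here is a small sketch of grouping nearby stations with k-means on their coordinates. The station table and column names are made-up stand-ins, and the number of clusters is illustrative:

```python
# Sketch: cluster stations by location so nearby stations share a group label.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in for the real station table (hypothetical lat/long columns).
rng = np.random.default_rng(0)
stations = pd.DataFrame({
    "station_id": range(70),
    "lat": 37.77 + 0.05 * rng.standard_normal(70),
    "long": -122.41 + 0.05 * rng.standard_normal(70),
})

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
stations["region"] = kmeans.fit_predict(stations[["lat", "long"]])
print(stations.groupby("region").size())
```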

Final Peer Review

The goal of the project is to predict the number of bikes that will be rented from a specific station in a specific hour. The SF Bay Area Bike Share dataset was used; it includes bike station information and the weather for each day.

Things I like
The team identified corrupted zip code data and used the Google Maps API to replace it with the correct zip codes.
The team tried multiple models, including linear, lasso, and ridge regression with different regularizers, and compared the performance of the different models.
The description of the dataset is detailed and easy to understand. They explained the feature selection process and the reason for choosing and dropping certain columns.

Ways to improve
It would be useful to explain why the R^2 of 0.33 is low. Which outliers affect the performance of the model? Is the relationship between the variables linear?
More graphs and data visualizations would help the reader understand the models better. Visualizations of the linear, lasso, and ridge regression results, similar to those shown for the decision tree model, would help.
It would be useful to explain why they chose to use the decision tree.

Midterm Peer Review - atp44

For this midterm report, I thought it was very cool that you used the Google Maps API to impute the correct zip codes for the rows that had a corrupted Oregon zip code instead of just dropping those rows. I also liked how, in your discussion of the missing values, you gave very clear and reasonable reasoning for dropping the precipitation_inches rows that didn’t have a float value. Finally, I appreciated the very clear interpretation of the coefficients of your first linear model; it was obvious to the reader how to interpret your preliminary model.

However, I felt that this report could have had better visualizations, because the analysis tied to these plots was lacking. For example, for the second plot you could have removed the very large outliers on the x-axis to create a more ‘zoomed-in’ and therefore more telling visualization (a small sketch of this appears below). I would also suggest adding more discussion of the relationships shown in the second and third graphs; you mention that they display a clear relationship, but that relationship is never elaborated on. Additionally, for the “Future Plans” section, I would suggest delving deeper into what you mean by ‘regression model’, because there are many different types of models that could be used, yet you never mention other possible model types you are considering.
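
As an illustration of the ‘zoomed-in’ plot suggestion, one simple option is to cap the x-axis at a high quantile before plotting. This sketch uses made-up precipitation-style data rather than the group's actual dataset:

```python
# Sketch: trim extreme x-values so the bulk of the distribution is visible.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "precipitation": rng.exponential(scale=0.05, size=2000),  # stand-in data
    "rides": rng.poisson(lam=40, size=2000),
})

cutoff = df["precipitation"].quantile(0.99)  # drop the top 1% as outliers
subset = df[df["precipitation"] <= cutoff]

plt.scatter(subset["precipitation"], subset["rides"], s=5, alpha=0.4)
plt.xlabel("precipitation (inches)")
plt.ylabel("number of rides")
plt.title("Rides vs. precipitation (x-axis trimmed at the 99th percentile)")
plt.show()
```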

Midterm Review rl447

Summary:
This project aims to apply data analysis models and techniques to SF Bay Area bike-share data in order to predict the number of bikes rented from a station. Each row of the data describes a specific station and starting hour on a specific day, and for each row the model will predict the number of bike rides taken at that station during that hour.

Things I Like:
The level of detail in your introduction is fully descriptive without being overly wordy, and the methodology that you have chosen is very clearly stated and allows me to even visualize the dataset in my head by the last bit.
This initial exploration of the data to find the correlation between different features and the number of rides taken on a particular day is extremely interesting. I saw that there is a significant dropoff within the first 0.1 inches of precipitation, and its magnitude, as seen in the visualization of bike riders vs. precipitation, confirms the intuition that people don't want to ride bikes in the rain; it is an effective data visualization.
I like the idea of doing mini regressions on sets of features such as temperature, humidity, and visibility. It would have been very easy to see the relationship between any one of these and the number of bikes rented from a particular station, but it is clearer when we see all three together.
Areas for Improvement:
Regarding the number of riders lost to precipitation, I do not think that the regression coefficient of losing 1 rider for every 2 additional inches of rain makes sense, given the enormous spike in the earlier histogram, which does not look linear. Could a nonlinear regression model show different results? (A small sketch appears after this list.)
In your future plan, you mention that you will run regression to predict a 16-vector output. Is there any particular reason why this number was chosen?
How will you account for potential overfitting with so many covariate features, some of which you mention have negative coefficients? For example, time and isRushHour have quite a bit of overlap.
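
On the first point, here is a minimal sketch of the kind of nonlinear fit that could be compared against the linear coefficient. It fits a polynomial basis on stand-in precipitation data, not the group's actual dataset:

```python
# Sketch: compare a linear fit with a polynomial fit on a single feature.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
precip = rng.exponential(scale=0.1, size=1000).reshape(-1, 1)       # stand-in data
rides = 50 * np.exp(-20 * precip[:, 0]) + rng.normal(0, 3, 1000)    # sharp nonlinear drop-off

linear = LinearRegression().fit(precip, rides)
poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(precip, rides)

print("linear R^2:    ", round(r2_score(rides, linear.predict(precip)), 3))
print("polynomial R^2:", round(r2_score(rides, poly.predict(precip)), 3))
```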

Final Peer Review

This project aims to predict the number of shared bikes that will be rented from a location in San Francisco within a certain period. This helps bike renters choose when to rent bikes, and helps bike-sharing companies distribute bikes across locations. The dataset comes from the San Francisco Bay Area Bike Share data and includes information such as rental location, weather, etc.

Things I like about this project:

  1. I think the team did a great job cleaning data. For example, when identifying corrupted zip codes, instead of dropping the data, they chose to use the Google API to find the correct entries.
  2. Another thing that impressed me is that they chose to analyze the problem with different models (like regularized linear models, decision trees, etc.) and also tried out different loss functions and hyperparameters.
  3. Finally, the description of how the data cleaning and feature engineering processes work is very clear and easy to follow. The report also gave detailed reasons for why they chose to conduct the data manipulations in those ways.

Potential areas for improvement:

  1. It would be better if your report contains more graphs to visualize the results, as this would help us understand how your models perform.
  2. Also, I noticed that the R-squared of your last linear regression model is only 0.33. I think it would be better if you explained further what the potential reasons could be; for example, did you test whether there was any problem with the dataset itself, or whether there were outliers, etc.?
  3. I think the reason you gave for choosing 15 as the maximum decision tree depth is not sufficient. Moreover, a maximum depth of 15 does not guarantee optimal generalization performance (see the tuning sketch after this list).
  4. Accuracy is not the only way to tell whether a model is good. In particular, for this task, measures like precision and recall may also be worth discussing.
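
On point 3, here is a small sketch of choosing the maximum depth by cross-validation rather than fixing it at 15; the data is a synthetic stand-in. The same machinery can also report precision and recall via the `scoring` argument.

```python
# Sketch: tune decision-tree depth with cross-validation instead of fixing it.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)  # stand-in data

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 7, 10, 15, 20, None]},
    cv=5,
)
search.fit(X, y)
print("best depth:", search.best_params_, " CV accuracy:", round(search.best_score_, 3))
```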

Peer review - midterm

This project is about predicting the number of bikes rented per station for a specified station at a specific hour.
I liked:

  • Good description of dataset and dealing with missing/corrupted data
  • I liked that you explained choices for features
  • Good visualizations that explain choices
  • Good explanations from modeling

Improvement areas:

  • It would be great if we could see some visualizations/graphs for the regression modeling part (a predicted-vs-actual sketch appears after this list)
  • The last paragraph in the results from regression modeling section was confusing because there wasn’t much interpretation done
  • In the future plan, I don’t see how you will progress from this to actually predicting the number of bikes per station.
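
For the first point, one common visualization for the regression part is a predicted-vs-actual scatter plot. This is a minimal sketch with stand-in data:

```python
# Sketch: predicted vs. actual plot for a regression model.
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=800, n_features=8, noise=20, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pred = LinearRegression().fit(X_train, y_train).predict(X_test)

plt.scatter(y_test, pred, s=8, alpha=0.5)
lims = [min(y_test.min(), pred.min()), max(y_test.max(), pred.max())]
plt.plot(lims, lims, linestyle="--")  # 45-degree reference line
plt.xlabel("actual number of rentals")
plt.ylabel("predicted number of rentals")
plt.show()
```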

Final Peer Review rt359

This project aims to predict the number of bikes that will be rented from a particular station in San Francisco. The goal is to help riders determine the best times to rent bikes, and to help the bike company decide how many bikes to provide at certain times and locations, as well as gain insight into how they could better serve their customers.

Things I like:

  1. I like that the group imputes missing precipitation level columns instead of averaging or dropping those rows and uses a variety of loss functions in the process. I also like how the group dealt with corrupted zip code values.
  2. I like that the group considers a decision tree model in addition to the linear regression models and analyzes it at a variety of depths.
  3. The description of the group’s methodology in selecting different regularizers and loss functions is thorough and displays a clear understanding of the concepts discussed.

Areas of improvement:

  1. It would have been useful to see more visuals, for example graphs from the regression modeling, or a better way of presenting which variables had positive or negative coefficients and the implications of these (a coefficient-table sketch appears after this list).
  2. It would have been useful to consider more implications of the model, for example identifying unusual demand patterns at specific locations, or unusual patterns in ridership around certain times of the year. The group's most notable insight at the moment is that precipitation has a negative coefficient, which seems intuitive.
  3. For next steps, it would have been useful to say why the group was considering exploring neural nets. Other next steps could include incorporating additional datasets, or seeing how applicable the findings would be for a bike-share system in a different city.
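
On point 1, one lightweight way to present which variables have positive or negative coefficients is a sorted coefficient table. This sketch uses stand-in data and hypothetical feature names:

```python
# Sketch: summarize regression coefficients in a sorted table.
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Stand-in data and hypothetical feature names.
feature_names = ["precipitation", "temperature", "humidity", "visibility",
                 "is_weekend", "hour", "wind_speed", "is_rush_hour"]
X, y = make_regression(n_samples=500, n_features=len(feature_names), noise=10, random_state=0)

model = Ridge(alpha=1.0).fit(X, y)

coef_table = (
    pd.DataFrame({"feature": feature_names, "coefficient": model.coef_})
    .sort_values("coefficient", ascending=False)
)
print(coef_table.to_string(index=False))
```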
