orie4741 / projectsfall2020
Repository for Fall 2020 ORIE 4741 projects.
“Predicting 911 Call Urgency in New York City” predicts the severity level of a 911 call and the time it would take emergency services to reach a given caller. The objective of this project is to gather valuable information about the performance of emergency services so that dispatch operators can better allocate their resources in the future. The project uses EMS Dispatch Data from NYC OpenData, which includes features such as the severity level code, response time, and location information.
I love the team’s effort in making sure the data selected is representative of the entire dataset. For example, the team chose data from 2019 and plotted the average response time grouped by zip code to confirm that 2019 is representative of the other years. Additionally, the team did a great job reasoning about the correct loss function and regularizer to use: after identifying a large number of outliers, they switched to ℓ1 loss. Furthermore, I’m impressed by the application of an AutoML pipeline. This shows that the team spent a lot of time understanding the class material and improving their model.
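The robustness argument behind that switch to ℓ1 loss can be sketched quickly. The snippet below is a toy illustration (not the team's actual pipeline or data): a quadratic loss is pulled toward a handful of extreme response times, while an ℓ1 fit, approximated here by iteratively reweighted least squares, stays close to the bulk of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 * x + rng.normal(0, 1, n)
# inject heavy outliers, like a response-time column with extreme delays
y[:10] += 80

X = np.column_stack([x, np.ones(n)])

# quadratic loss: closed-form least squares, pulled toward the outliers
w_l2, *_ = np.linalg.lstsq(X, y, rcond=None)

# l1 loss: iteratively reweighted least squares approximation
w = w_l2.copy()
for _ in range(50):
    r = np.abs(y - X @ w).clip(min=1e-8)
    W = 1.0 / r  # downweight points with large residuals
    w = np.linalg.solve((X.T * W) @ X, (X.T * W) @ y)

# the l1 slope and intercept end up close to the true values (2 and 0),
# while the quadratic fit absorbs the outliers
```

The ℓ1 fit effectively tracks the conditional median rather than the mean, which is why a small fraction of extreme values barely moves it.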
I think the team could have specified which features they decided to use instead of mentioning “some of the data columns we believed to be most relevant.” Additionally, even though encoding location information as one-hot vectors is certainly enough for the scope of the project, the team might explore other encodings as a future step. Lastly, even though the model choices are great, I would like to see more explanation of why linear approximation is emphasized in the report.
Overall, I think the report is well written and the group has done great work.
The objective of this project is to use a U.S. national traffic accident dataset to identify the significant factors contributing to accident severity. They have also compared their results across five different U.S. states.
Three things I really like about this report:
• I really like how they visualized their dataset and presented the results. It looks really clear to me.
• They gave a detailed account of their data preprocessing (such as dealing with different data types), how they selected the models, and how they tuned the hyperparameters for one model they chose. This makes their experiment reproducible.
• I find the topic really interesting, especially their choice of five states including South Carolina, which had the highest weighted number of car accidents.
Three things I think may need some improvement:
• They could give the reasoning behind why they chose certain models.
• Some conclusions may be assumptions not supported by any authoritative source, particularly where they discuss why certain factors are more important in certain states.
• They could include their perspective on what can be done next and what models or approaches could be tried on this dataset in the future.
Summary:
The project was to see if the group could determine the quality of a wine given 13 characteristics, some of which were nominal but most of which were real-valued. The group ran some preliminary models and found that most wines were rated around 5 and 6, with sparse ratings at the extrema (3 and 9). Much of the report, as a result, detailed ways they tried to mitigate bias from this imbalance in the data, including bootstrapping.
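As a rough sketch of how bootstrapping can address that imbalance (a toy stand-in, not the group's actual data or procedure), the rare rating classes can be resampled with replacement until every class is equally represented:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy stand-in for the wine ratings: heavily concentrated at 5 and 6
ratings = rng.choice([3, 4, 5, 6, 7, 8, 9], size=1000,
                     p=[0.01, 0.05, 0.40, 0.40, 0.10, 0.03, 0.01])

# bootstrap each rating class up to the size of the largest one
classes, counts = np.unique(ratings, return_counts=True)
target = counts.max()
balanced_idx = np.concatenate([
    rng.choice(np.flatnonzero(ratings == c), size=target, replace=True)
    for c in classes
])

# balanced_idx indexes the original rows, so the same indices
# can be used to resample both the features and the labels
```

The trade-off is that the resampled extremes are duplicates of a few real wines, so any validation should still be done against the original, unresampled distribution.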
What I liked:
Things to improve on:
I think the report is well developed, and I liked how they took numerous factors into consideration. The group correctly used accident ratios rather than raw counts, since certain states are bigger or more populous and will naturally see more accidents. They used ordinal values and multi-class classification and imputed missing data values. I liked that there were multiple graphs describing the problem and a thorough exploration of models.
Three areas to work on, in my opinion: first, if possible, consider how light it was outside (in terms of time of day rather than simple day/night encodings). This could lead to more accuracy, provided that information can be encoded easily and there is enough data to back each class of the encoding. Second, more explanation of latitude and longitude would be useful. I am assuming it's because certain locations are more urban, but having that affirmed would be useful. And finally, the final report's predictions could include probabilities for the types of accidents that are likely to occur. Overall I thought it was very compelling.
This project is about predicting the number of bikes rented at a specified station during a specific hour.
I liked:
Improvement areas:
The “Solutions to Mental Health in Tech” group plans to build a model that can tell whether people have mental health issues, even if they don’t seek treatment. People in the tech industry work in pressure-cooker environments, but often don’t seek help when they need it. The ability to tell whether employees need help with mental health issues can help companies prevent personal catastrophes for their employees and better construct mental health & wellness resources for them.
Things I like:
Things I don’t like:
This project looks at the data collected from a Columbia University speed dating study and performs a data analysis task. More concretely, they want to assess what factors lead to a match between two partners, specifically focusing on how attractiveness figures into potential matches. The data contains additional information about each individual's interests, along with those of the people they speed dated.
Because the dataset contains information about the interests/views of people along with ratings of their
attractiveness, I find it interesting to separate similarity of interests from attractiveness. I like how you guys
listed out your overall motivation for why you undertook this project, contextualizing it in the growing world
of speed dating. I also like how there are multiple paths you guys can take in terms of proceeding with this project:
even if you find that attractiveness cannot effectively predict pairing success, you guys can fall back on constructing
a general model to predict pairing success using a host of features.
In terms of areas for improvement, I don't know if one dataset is enough to fit a model that you think will generalize to other speed dating datasets. I also do not know how valuable someone's 'interests' are for predicting pairing success, which feeds into my suggestion of joining this dataset with other datasets to add more features. Also, I would encourage you guys to be more concrete about how you are going to quantify the success of a 'match'.
This is a financial data analysis project about IPOs. The project is trying to find the factors that can help predict a successful IPO. In my understanding, they are trying to build a model that can help them find an optimal parameter set. The data they are using is a Kaggle dataset called Stocks IPO information.
I really like this project! It focuses on a domain that is highly profitable but also unstable and seemingly unpredictable. Therefore, the problem of IPOs is very suitable for a big messy data analysis project. The dataset they chose is also very good and very informative; it covers almost all the factors that can affect an IPO. I also like the criteria for a successful IPO they set up in the problem statement, which helps them clearly explain their project results.
There are three parts I am a little bit concerned about:
This project aims to elucidate the difference in performance of NBA players in contract years vs non-contract years. Additionally, they seek to create a quantification of the "contract year effect" for each player using what they call a "slippery index." They have season statistics for 100 players over 989 total seasons at their disposal to accomplish this goal.
Things I liked
Areas for improvement
Overall, I think this is a very clean and thorough report. It was especially informative given that this is the first time I've been exposed to this project, and I was able to easily follow the commentary. The team is analyzing how Airbnb prices have been affected by the COVID-19 pandemic, analyzing pricing data in NYC in 2020 and comparing it to 2019 pricing information.
The first thing I liked was that the data cleaning all made sense, and I don't think any information was lost or misinterpreted, which is always an important concern when cleaning a dataset. The two pairs of histograms were also very informative and make it easy to visualize the initial findings regarding the team's hypothesis. Lastly, I also think the geographic map comparing September 2019 to September 2020 does a good job of framing the issue of volume for Airbnb as well, and will give great context when this is further analyzed.
One thing I think would have been helpful is a histogram for the volume of each type of rental. You have the geographic map visualizing this, but it does a better job of showing the overall trend than how each subtype is changing. The histograms you have for pricing are solid; doing the same for volume would provide some more clarity at a more granular level. One potential suggestion for evaluating the impact of the COVID-19 pandemic on price would be a brief comparison to a different virus outbreak (I think H1N1 is the only other one that has happened since Airbnb was founded), to see if there was a substantial difference there. It could be that Airbnb learned from that smaller outbreak, and that's why prices haven't changed so much. The last suggestion I have would be to analyze the effect of the pandemic on volume. Revenue issues for a company stem from pricing and/or volume decreases. It may very well be that prices aren't affected by COVID-19, but I bet there might be a more telling trend from March through October in the volume of listings. Given the wealth of COVID-19 data as well, you may be able to formulate a model that predicts how the volume of rentals changes given a change in COVID-19 infection levels.
Overall I think this is great progress thus far. Everything makes complete sense, and you have some really interesting data you have found.
This group analyzed NFL game data to attempt to create a model that predicts NFL game point spreads more accurately than Vegas lines. They created models for both a season-average-only (SAO) dataset and a season-and-moving (SAM) dataset. Using predictive models including linear regression, ridge regression, lasso regression, and EBM, they were able to produce predictions that compared favorably to Vegas lines in many scenarios.
I really appreciated how well-written this group’s report was and how the logic of all their decisions was clear and easy to follow. I also liked the use of the EBM model we learned about from our guest lecturer, which strengthened your analysis with the use of a very effective non-linear model. Finally, I thought the evaluation of your model by comparing to Vegas lines from the actual NFL games your model is predicting was excellent and did a great job showing the potential usefulness of your model in practice. For an area to explore further for these models, I think the effect of incorporating player injury would be very interesting to see, as you mentioned in your introduction that injuries can play a major role in swinging a game towards another team. This was a fascinating project to learn about and I thought you did an excellent job in getting useful predictions.
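To make the contrast between the regularized linear models concrete, here is a toy sketch (with made-up features and simulated spreads, not the group's SAO/SAM data): ridge has a closed-form solution that shrinks all coefficients a little, while lasso, fit below by coordinate descent with soft-thresholding, drives the coefficients of irrelevant features to zero.

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical stand-ins for season-average features; only 3 of 6 matter
n, d = 500, 6
X = rng.normal(size=(n, d))
true_w = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0])
spread = X @ true_w + rng.normal(0.0, 2.0, n)  # simulated point spreads

# ridge regression: closed form (X'X + lam*I)^{-1} X'y
lam_ridge = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam_ridge * np.eye(d), X.T @ spread)

# lasso: coordinate descent with soft-thresholding on the objective
# (1/2n)*||y - Xw||^2 + lam*||w||_1
def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

lam_lasso = 0.2
w_lasso = np.zeros(d)
for _ in range(200):
    for j in range(d):
        # partial residual excluding feature j's contribution
        partial = spread - X @ w_lasso + X[:, j] * w_lasso[j]
        w_lasso[j] = (soft(X[:, j] @ partial / n, lam_lasso)
                      / (X[:, j] @ X[:, j] / n))

# w_ridge keeps all six coefficients nonzero; w_lasso zeros out
# (or nearly zeros) the three irrelevant ones
```

On dense synthetic features like these the two fits differ little; the interesting case for feature-heavy datasets like the report's is when many columns carry no signal and lasso prunes them automatically.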
This project is about predicting the number of fantasy points a fantasy football player will score. The data used for this project is historical fantasy football data from 1970-1999, obtained from Fantasy Football Data Pros. The goal of this project is to determine the best projection model for predicting the number of fantasy points a player will score in this year's season.
Things I like about the proposal:
This project looks at two different datasets: one with information about NYC rides, and one with information about Bay Area bike sharing rides. The data contains ride durations along with surrounding information on the rider, the weather, etc. This is a data analysis project that hopes to see how these surrounding factors influence ride duration.
One thing I like about this project is that they have multiple datasets. For a project as narrow as this, multiple datasets are important. I also like the idea of assessing ride duration by comparing it to Google Maps expected times. Lastly, I like how the datasets cover consecutive years, from 2013 to 2017.
In terms of areas of improvement, one thing I am worried about is that you may not have enough features to accurately predict ride length; maybe bringing in some additional datasets would be helpful. Additionally, since your two datasets come from different years, I am unsure how you can combine their features. Finally, given the limited number of features, I think you would need to perform a lot of feature transformations.