orie4741 / projectsfall2020
Repository for Fall 2020 ORIE 4741 projects.
“Predicting 911 Call Urgency in New York City” predicts the severity level of a 911 call and the time it would take emergency services to reach a given caller. The objective of this project is to gather valuable information about the performance of emergency services so that dispatch operators can better allocate their resources in the future. The project uses EMS Dispatch Data from NYC OpenData, which includes features such as the severity level code, response time, and location information.
I love the team’s effort in making sure the data selected is representative of the entire dataset. For example, the team chose data from 2019 and plotted the average response time grouped by zip code to confirm that 2019 is representative of the other years. Additionally, the team did a great job reasoning about the correct loss function and regularizer to use: after identifying a large number of outliers, they switched to ℓ1 loss. Furthermore, I’m impressed by the application of an AutoML pipeline. This shows that the team spent a lot of time understanding the class material and improving their model.
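The robustness argument behind that switch to ℓ1 loss can be sketched quickly. The snippet below is a toy illustration (not the team's actual pipeline or data): a quadratic loss is pulled toward a handful of extreme response times, while an ℓ1 fit, approximated here by iteratively reweighted least squares, stays close to the bulk of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 * x + rng.normal(0, 1, n)
# inject heavy outliers, like a response-time column with extreme delays
y[:10] += 80

X = np.column_stack([x, np.ones(n)])

# quadratic loss: closed-form least squares, pulled toward the outliers
w_l2, *_ = np.linalg.lstsq(X, y, rcond=None)

# l1 loss: iteratively reweighted least squares approximation
w = w_l2.copy()
for _ in range(50):
    r = np.abs(y - X @ w).clip(min=1e-8)
    W = 1.0 / r  # downweight points with large residuals
    w = np.linalg.solve((X.T * W) @ X, (X.T * W) @ y)

# the l1 slope and intercept end up close to the true values (2 and 0),
# while the quadratic fit absorbs the outliers
```

The ℓ1 fit effectively tracks the conditional median rather than the mean, which is why a small fraction of extreme values barely moves it.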
I think the team could have specified which features they decided to use instead of mentioning “some of the data columns we believed to be most relevant.” Additionally, even though encoding location information as one-hot vectors is certainly enough for the scope of the project, the team might explore other encodings as a future step. Lastly, even though the model choices are great, I would like to see more explanation of why linear approximation is emphasized in the report.
Overall, I think the report is well written and the group has done great work.
The objective of this project is to use a U.S. national traffic accident dataset to identify the significant factors contributing to accident severity. They have also compared their results across five different U.S. states.
Three things I really like about this report:
• I really like how they visualized their dataset and presented the results. It looks really clear to me.
• They gave a detailed account of their data preprocessing (such as dealing with different data types), how they selected the models, and how they tuned the hyperparameters for one model they chose. This makes their experiment reproducible.
• I find the topic really interesting, especially their choice of five states including South Carolina, which had the highest weighted number of car accidents.
Three things I think may need some improvement:
• They could give the reasoning behind why they chose certain models.
• Some conclusions may be assumptions not supported by any authoritative source, particularly where they discuss why certain factors are more important in certain states.
• They could include their perspective on what can be done next and what models or approaches could be tried on this dataset in the future.
Summary:
The project was to see if the group could determine the quality of a wine given 13 characteristics, some of which were nominal but most of which were real-valued. The group ran some preliminary models and found that most wines were rated around 5 and 6, with sparse ratings at the extrema (3 and 9). Much of the report, as a result, detailed ways they tried to mitigate bias from this imbalance in the data, including bootstrapping.
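As a rough sketch of how bootstrapping can address that imbalance (a toy stand-in, not the group's actual data or procedure), the rare rating classes can be resampled with replacement until every class is equally represented:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy stand-in for the wine ratings: heavily concentrated at 5 and 6
ratings = rng.choice([3, 4, 5, 6, 7, 8, 9], size=1000,
                     p=[0.01, 0.05, 0.40, 0.40, 0.10, 0.03, 0.01])

# bootstrap each rating class up to the size of the largest one
classes, counts = np.unique(ratings, return_counts=True)
target = counts.max()
balanced_idx = np.concatenate([
    rng.choice(np.flatnonzero(ratings == c), size=target, replace=True)
    for c in classes
])

# balanced_idx indexes the original rows, so the same indices
# can be used to resample both the features and the labels
```

The trade-off is that the resampled extremes are duplicates of a few real wines, so any validation should still be done against the original, unresampled distribution.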
What I liked:
Things to improve on:
I think the report is well developed, and I liked how they took numerous factors into consideration. The group correctly used accident ratios rather than raw counts, since certain states are bigger or more populous and will naturally see more accidents. They used ordinal values and multi-class classification and imputed missing data values. I liked that there were multiple graphs describing the problem and a thorough exploration of models.
Three areas to work on, in my opinion: first, if possible, consider how light it was outside (in terms of time of day rather than simple day/night encodings). This could lead to more accuracy, provided that information can be encoded easily and there is enough data to back each class of the encoding. Second, more explanation of latitude and longitude would be useful. I am assuming it's because certain locations are more urban, but having that affirmed would be useful. And finally, the final report's predictions could include probabilities for the types of accidents that are likely to occur. Overall I thought it was very compelling.
This project is about predicting the number of bikes rented at a specified station during a specific hour.
I liked:
Improvement areas:
The “Solutions to Mental Health in Tech” group plans to build a model that can tell whether people have mental health issues, even if they don’t seek treatment. People in the tech industry work in pressure-cooker environments, but often don’t seek help when they need it. The ability to tell whether employees need help with mental health issues can help companies prevent personal catastrophes for their employees and better construct mental health & wellness resources for them.
Things I like:
Things I don’t like:
This project looks at the data collected from a Columbia University speed dating study and performs a data analysis task. More concretely, they want to assess what factors lead to a match between two partners, specifically focusing on how attractiveness figures into potential matches. The data contains additional information about each individual's interests, along with those of the people they speed dated.
Because the dataset contains information about the interests/views of people along with ratings of their
attractiveness, I find it interesting to separate similarity of interests from attractiveness. I like how you guys
listed out your overall motivation for why you undertook this project, contextualizing it in the growing world
of speed dating. I also like how there are multiple paths you guys can take in terms of proceeding with this project:
even if you find that attractiveness cannot effectively predict pairing success, you guys can fall back on constructing
a general model to predict pairing success using a host of features.
In terms of areas for improvement, I don't know if one dataset is enough to fit a model that you think will generalize to other speed dating datasets. I also do not know how valuable someone's 'interests' are for predicting pairing success, which feeds into my suggestion of joining this dataset with other datasets to add more features. Also, I would encourage you guys to be more concrete about how you are going to quantify the success of a 'match'.
This is a financial data analysis project about IPOs. The project is trying to find the factors that can help predict a successful IPO. In my understanding, they are trying to build a model that can help them find an optimal parameter set. The data they are using is a Kaggle dataset called Stocks IPO information.
I really like this project! It focuses on a domain that is highly profitable but also unstable and seemingly unpredictable. Therefore, the problem of IPOs is very suitable for a big messy data analysis project. The dataset they chose is also very good and very informative; it covers almost all the factors that can affect an IPO. I also like the criteria for a successful IPO they set up in the problem statement, which helps them clearly explain their project results.
There are three parts I am a little bit concerned about:
This project aims to elucidate the difference in performance of NBA players in contract years vs non-contract years. Additionally, they seek to create a quantification of the "contract year effect" for each player using what they call a "slippery index." They have season statistics for 100 players over 989 total seasons at their disposal to accomplish this goal.
Things I liked
Areas for improvement
Overall, I think this is a very clean and thorough report. It was especially informative given that this is the first time I've been exposed to this project, and I was able to easily follow the commentary. The team is analyzing how Airbnb prices have been affected by the COVID-19 pandemic, analyzing pricing data in NYC in 2020 and comparing it to 2019 pricing information.
The first thing I liked was that the data cleaning all made sense, and I don't think any information was lost or misinterpreted, which is always an important concern when cleaning a dataset. The two pairs of histograms were also very informative and make it easy to visualize the initial findings regarding the team's hypothesis. Lastly, I also think the geographic map comparing September 2019 to September 2020 does a good job of framing the issue of volume for Airbnb as well, and will give great context when this is further analyzed.
One thing I think would have been helpful is a histogram for the volume of each type of rental. You have the geographic map visualizing this, but it does a better job of showing the overall trend than how each subtype is changing. The histograms you have for pricing are solid; doing the same for volume would provide some more clarity at a more granular level. One potential suggestion for evaluating the impact of the COVID-19 pandemic on price would be a brief comparison to a different virus outbreak (I think H1N1 is the only other one that has happened since Airbnb was founded), to see if there was a substantial difference there. It could be that Airbnb learned from that smaller outbreak, and that's why prices haven't changed so much. The last suggestion I have would be to analyze the effect of the pandemic on volume. Revenue issues for a company stem from pricing and/or volume decreases. It may very well be that prices aren't affected by COVID-19, but I bet there might be a more telling trend from March through October in the volume of listings. Given the wealth of COVID-19 data as well, you may be able to formulate a model that predicts how the volume of rentals changes given a change in COVID-19 infection levels.
Overall I think this is great progress thus far. Everything makes complete sense, and you have some really interesting data you have found.
This group analyzed NFL game data to attempt to create a model that predicts NFL game point spreads more accurately than Vegas lines. They created models for both a season-average-only (SAO) dataset and a season-and-moving (SAM) dataset. Using predictive models including linear regression, ridge regression, lasso regression, and EBM, they were able to produce predictions that compared favorably to Vegas lines in many scenarios.
I really appreciated how well-written this group’s report was and how the logic of all their decisions was clear and easy to follow. I also liked the use of the EBM model we learned about from our guest lecturer, which strengthened your analysis with the use of a very effective non-linear model. Finally, I thought the evaluation of your model by comparing to Vegas lines from the actual NFL games your model is predicting was excellent and did a great job showing the potential usefulness of your model in practice. For an area to explore further for these models, I think the effect of incorporating player injury would be very interesting to see, as you mentioned in your introduction that injuries can play a major role in swinging a game towards another team. This was a fascinating project to learn about and I thought you did an excellent job in getting useful predictions.
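To make the contrast between the regularized linear models concrete, here is a toy sketch (with made-up features and simulated spreads, not the group's SAO/SAM data): ridge has a closed-form solution that shrinks all coefficients a little, while lasso, fit below by coordinate descent with soft-thresholding, drives the coefficients of irrelevant features to zero.

```python
import numpy as np

rng = np.random.default_rng(2)

# hypothetical stand-ins for season-average features; only 3 of 6 matter
n, d = 500, 6
X = rng.normal(size=(n, d))
true_w = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0])
spread = X @ true_w + rng.normal(0.0, 2.0, n)  # simulated point spreads

# ridge regression: closed form (X'X + lam*I)^{-1} X'y
lam_ridge = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam_ridge * np.eye(d), X.T @ spread)

# lasso: coordinate descent with soft-thresholding on the objective
# (1/2n)*||y - Xw||^2 + lam*||w||_1
def soft(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

lam_lasso = 0.2
w_lasso = np.zeros(d)
for _ in range(200):
    for j in range(d):
        # partial residual excluding feature j's contribution
        partial = spread - X @ w_lasso + X[:, j] * w_lasso[j]
        w_lasso[j] = (soft(X[:, j] @ partial / n, lam_lasso)
                      / (X[:, j] @ X[:, j] / n))

# w_ridge keeps all six coefficients nonzero; w_lasso zeros out
# (or nearly zeros) the three irrelevant ones
```

On dense synthetic features like these the two fits differ little; the interesting case for feature-heavy datasets like the report's is when many columns carry no signal and lasso prunes them automatically.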
This project is about predicting the number of fantasy points a fantasy football player will score. The data used for this project is historical fantasy football data from 1970-1999, obtained from Fantasy Football Data Pros. The goal of this project is to determine the best projection model for predicting the number of fantasy points a player will score in this year's season.
Things I like about the proposal:
This project looks at two different datasets: one with information about NYC rides, and one with information about Bay Area bike sharing rides. The data contains ride durations along with surrounding information on the rider, the weather, etc. This is a data analysis project that hopes to see how these surrounding factors influence ride duration.
One thing I like about this project is that they have multiple datasets. For a project as narrow as this, multiple datasets are important. I also like the idea of assessing ride duration by comparing it to Google Maps expected times. Lastly, I like how the datasets cover consecutive years, from 2013 to 2017.
In terms of areas of improvement, one thing I am worried about is that you may not have enough features to accurately predict ride length; maybe bringing in some additional datasets would be helpful. Additionally, since your two datasets come from different years, I am unsure how you can combine their features. Finally, given the limited number of features, I think you would need to perform a lot of feature transformations.