
prediction-of-us-traffic-accidents's People

Contributors

heruwang743


Forkers

jingxuan-l

prediction-of-us-traffic-accidents's Issues

Peer Review

The group wants to analyze US traffic accident data and identify the most important elements contributing to traffic accidents. The research question revolves around finding relations between accident likelihood/frequency and natural/social elements. The data comes from a countrywide traffic dataset and includes several fields such as location, a natural-language description, etc.

What I like:

  • Dataset is definitely very extensive. This will help get around the problems of data bias.
  • I like the impact that this could potentially have if the exploratory data analysis reveals insights or trends about traffic data.
  • The research question is broad enough to allow for a wide range of subquestions.

My concerns:

  • The second research question is somewhat vague. What advice would be helpful and realistically implementable? Also, how many features are really controllable? This might be an interesting question to address.
  • There's a fair amount of textual analysis, which might require analyzing data using methods that we haven't learned in class (e.g., NLP). Make sure you scope the project accordingly.
  • Selecting a subset of features is hard with so many columns.

Midterm Peer Review

Summary: This project aims to find the important factors that cause traffic accidents in each state. Different factors like the weather, air pressure, and time can affect the severity of an accident. Using a gradient boosting algorithm, this project explores the dataset of accidents that occur in each state.

Positives: I like how you explained in detail each of the features you're going to explore and what type of values they take. Explaining the encoding process helps tell the reader how these features will be incorporated into the algorithm. I also liked how you explained the way you were going to counter the overfitting problem by trying different models, eventually settling on a gradient boosting algorithm. Explaining the model selection process was really helpful. Finally, I liked the initial analysis provided in the midterm report. Explaining the different features that had a significant impact on traffic accidents highlights the progress that has already been made with the selected model.

Suggestions: I would suggest including more visual data analysis to back up your initial claims about the features you deemed important when it comes to traffic accidents. That would give you more evidence to support your model selection. Also, I would explore principal component analysis for your project; it could be an extra way to find the best covariates for your model. Finally, I would see if you can explore more of the bias-variance tradeoff that occurs in your model selections, so that you can more deeply analyze the errors that you observe.
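
To make the PCA suggestion concrete, here is a minimal sketch using scikit-learn. The feature matrix is a random stand-in for the group's encoded accident features, and the 95% variance threshold is an illustrative choice, not something taken from the report.

```python
# Minimal PCA sketch (hypothetical: random data stands in for the
# group's encoded accident features).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(1000, 40)          # placeholder feature matrix

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"{pca.n_components_} components explain "
      f"{pca.explained_variance_ratio_.sum():.1%} of the variance")
```

Selecting components by explained variance, rather than a fixed count, is a common default when there is no strong prior on how many covariates matter.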

Peer Review #1

This project aims to analyze US traffic accident data to identify the key factors that cause traffic accidents. They will be looking at features like weather, humidity, distance from walking areas, and population density to evaluate each feature’s influence on the traffic accident rate. The topic of this project is quite interesting, but it also worries me that there will be a lot of features/factors they would need to look at to get a well-fitting model in the end. On the other hand, having a lot of features suits the project requirement of 'big messy data' well. Overall, this project seems significant, as it could greatly help urban planners and the government if actually implemented to reduce traffic accident rates.

Final peer review

The project I reviewed is about predicting the severity of US traffic accidents. The project used different methods, including OLS/LASSO and a gradient boosting machine, to predict the data and compared their performance. The project also implemented state-specific models and uncovered some insights, including feature importance.

Three things I specifically like about this project: (1) The topic was very interesting and the scope was comprehensive. The dataset and feature engineering were well described, and the model was well established. (2) The project fit the “messy data” topic well, as it incorporated a large dataset and described the challenges the group faced with computational power. (3) After the general model, they fitted state-specific models and uncovered some insights, which was very impressive.

Three things I think this project could improve: (1) In terms of computational power, the model might benefit from techniques like parallel computing to improve efficiency (of course this would be hard, and it is just a direction that may work). (2) In the OLS/LASSO model fitting part, I thought the reason these models failed was not justified in detail. OLS/LASSO would be appropriate if the underlying relationship were linear, and after many one-hot encoded feature transformations, linearity may be hard to maintain. Using scatter plots to justify linearity/non-linearity would make the model analysis more comprehensive (a sketch follows below). (3) The use of latitude and longitude features was pretty confusing, especially since they were the most important features selected by XGB (which I found counter-intuitive). I wonder whether they are proxies for county or urban/rural areas; it might be good to add those features and see whether they can replace latitude and longitude.
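
To illustrate point (2), here is a minimal sketch of the kind of scatter/residual check I have in mind, using synthetic data; the column names Temperature and Severity are hypothetical stand-ins for the group's actual features and target.

```python
# Sketch: eyeballing linearity with a feature-vs-target scatter plot and an
# OLS residual plot. Synthetic data; column names are hypothetical stand-ins.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({"Temperature": rng.uniform(-10, 40, 5_000)})
df["Severity"] = 2 + 0.02 * df["Temperature"] + rng.normal(0, 0.5, 5_000)

X, y = df[["Temperature"]], df["Severity"]
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(df["Temperature"], y, s=2, alpha=0.3)
ax1.set(xlabel="Temperature", ylabel="Severity", title="Feature vs. target")
ax2.scatter(model.predict(X), residuals, s=2, alpha=0.3)
ax2.axhline(0, color="red")
ax2.set(xlabel="Fitted values", ylabel="Residuals",
        title="Residuals (visible structure suggests non-linearity)")
plt.tight_layout()
plt.show()
```

A curved or funnel-shaped residual plot would support the claim that OLS/LASSO are mis-specified here.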

Midterm Peer Review

This project aims to find the most relevant factors in determining what causes automobile accidents in the United States, using a dataset with approximately 3.5 million accident records covering 49 states. Initial analysis showed that California had a large number of accidents overall, but the figure was much more reasonable when the ratio of accidents to number of automobiles was taken. In comparison, South Carolina had a large ratio of accidents to number of automobiles, but a relatively low absolute number of accidents. In making comparisons, the group found that latitude and longitude had by far the largest importance in predicting whether a car crash was likely, followed by street, temperature, weather, humidity, and more.

Some points I really enjoyed about this report were:

  • This report was really thorough. There was a ton of work done, well beyond initial model fitting. I thoroughly enjoyed reading through it and seeing the work put into the description of the dataset and analysis. Great job!
  • The report was also really well written. I was easily able to follow along with the content of the report, even though I wasn't familiar with 100% of the concepts. The report also pointed to further materials I could consult if I wanted to learn more.
  • I think the feature transformations were well thought out and made a lot of sense, and reflected the kinds of transformations we discussed in class really well.

Some things I think could be improved upon are:

  • I appreciate the gradient-boosting algorithm, but how much time was spent working on models that were covered in class? I was expecting to read at least something about linear regression and least-squares fitting, but found no such description in this report (a baseline sketch follows this list).
  • The project abstract states that a goal of this project is to discuss the changes that can be made in order to prevent accidents from happening, but there wasn't a lot of discussion about that in this paper. Also, a lot of the most important factors seemed to be something that couldn't easily be changed by human intervention. How do you plan to use your results? What result corresponds to what recommendation?
  • Some of the graphs could be labeled better. Many were missing x and y axis labels. Notably, the graph visualizing the number of crashes per state was hard to read: I have no idea which states correspond to which index, so I couldn't interpret it without outside help.
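
On the first suggestion, here is a hedged sketch of the sort of least-squares baseline I was expecting; the data is synthetic and stands in for the group's encoded features, so this is not their pipeline.

```python
# Sketch: a plain least-squares baseline to compare against gradient boosting.
# Synthetic data stands in for the group's encoded accident features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((10_000, 20))                          # placeholder feature matrix
y = X @ rng.random(20) + rng.normal(0, 0.1, 10_000)   # placeholder severity score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
ols = LinearRegression().fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, ols.predict(X_test)))
```

Even if the baseline loses to boosting, reporting its error puts the more complex model's gains in context.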

Midterm Peer Review

This project seeks to determine what factors can impact the severity of a car accident. The team is using a dataset from Kaggle with 3.5 million accident records, which covers 49 states. California was found to have the largest number of accidents, whereas South Carolina had the highest ratio of number of accidents to number of automobiles.

What I thought was great:

  • Dataset is big and has plenty of data points to work with!
  • The report was really well written. I appreciated that your team walked the reader through initial goals, processing, analysis, and initial findings.
  • The Data Preprocessing section was well thought out, and the one-hot encoding made sense (a minimal sketch follows this list).
  • A lot of work was put into this report. Initial data analysis was interesting - I was surprised to see that visibility and traffic signals are not as important as latitude and longitude.
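
To illustrate the one-hot encoding praised above, here is a minimal pandas sketch; the column names are hypothetical stand-ins for the dataset's categorical fields, not the group's exact code.

```python
# Sketch: one-hot encoding categorical accident features with pandas.
# Column names are hypothetical stand-ins for the dataset's fields.
import pandas as pd

df = pd.DataFrame({
    "Weather_Condition": ["Rain", "Clear", "Snow", "Clear"],
    "State": ["CA", "SC", "CA", "TX"],
    "Temperature": [55.0, 72.0, 28.0, 90.0],
})

# get_dummies expands each categorical column into 0/1 indicator columns
# and leaves numeric columns untouched.
encoded = pd.get_dummies(df, columns=["Weather_Condition", "State"])
print(encoded.columns.tolist())
```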

Some things I'd like to suggest:

  • It seems that some of the factors (Lat/Lng) that are not controllable by authorities played a big part in increasing the severity of a car crash. Are there other ways to recommend authorities decrease the severity of car crashes other than weather monitoring? California doesn't have much (if any) snow. How can authorities there decrease the severity of car crashes?
  • I'm not following your reasoning for using a neural network. Is it necessary? Are there other options that we learned in class that your team could use?
  • In your proposal, you mentioned that you wanted to analyze social elements that may factor into traffic accidents. How does your team plan to do this?

Midterm Peer Review

The project aims to predict the severity of traffic accidents in the US based on a number of factors including geography, environmental stimuli, and others. Their goal is to be able to provide a basis for some actions at the levels of drivers and government that could prevent or limit the impact of car accidents. They have at their disposal a dataset of traffic accidents from 49 states with lots of descriptive features for each accident.

Things I liked

  • I liked how a new type of model was used, it was cool to learn a bit about that.
  • I liked the description of the one-hot encoding process.
  • The parameter tuning section was very well done, I liked how you tried a variety of parameters and provided visualizations.

Areas for improvement

  • It would be nice to have a deeper discussion on what exactly the "severity" of an accident is capturing. Is this just referring to traffic delays associated with the accident?
  • It was unclear to me why you suspected significant overfitting (beginning of section 2.1).
  • If computation power is a concern to the point that it is limiting your choice of model, you could consider truncating your dataset (a subsampling sketch follows this list). You could also look into accessing GPU power remotely through AWS or other resources, as I think this will be much faster than running on a CPU.
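
On the truncation point, here is a sketch of one way to shrink the dataset while preserving the class balance of the label; the Severity column and the 10% fraction are hypothetical choices.

```python
# Sketch: truncating the dataset by stratified subsampling so the distribution
# of the (hypothetical) Severity label is preserved, unlike a blind head(n).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Severity": rng.choice([1, 2, 3, 4], size=100_000, p=[0.1, 0.6, 0.25, 0.05]),
    "Temperature": rng.uniform(-10, 40, 100_000),
})

# Keep 10% of each severity class.
sample = df.groupby("Severity").sample(frac=0.10, random_state=0)
print(sample["Severity"].value_counts(normalize=True))
```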

Final Peer Review #ll576

This project applied machine learning techniques to the U.S. national traffic accident dataset to identify the significant factors (such as demographics and environmental stimuli) that influence the severity of car accidents and explore how these might vary across states, with the purpose of informing governments and drivers about possible ways to avoid or reduce the impact of car accidents.

Things I like about this project:

  1. Dataset processing was well described, with details on how they dealt with missing values and ordinal- and nominal-valued entries. They also looked at the relative importance of features when fitting the XGB model.
  2. The group described in detail the problem of overfitting they encountered when using OLS and LASSO regression, and how they then moved to a gradient boosting method.
  3. The group gave a detailed description of how they tuned the parameters of different models, and justified their final choice of tuning parameters by comparing the number and size of errors of each model.
  4. I like how they displayed the accuracy and F1-score of all three methods together in a table; this lets the reader easily compare and evaluate the performance of each.

Possible areas of improvement:

  1. The visualizations included are the number of accidents and the ratio of accidents to number of automobiles in each state; it would be nice to see the distribution of accident severity levels in the states with the highest weighted accident ratios.
  2. More description of the attempt to use k-fold validation could be included to justify the final decision to shuffle the data before the training/testing split instead of using k-fold (a k-fold sketch appears at the end of this review).
  3. The group mentioned that OLS/LASSO could not reduce overfitting in this case; perhaps the group could explore some other linear models and try different regularizers as well.
  4. The group fitted state-specific models, described how significant features vary across states, and made a separate set of recommendations for each state. I think it would be better if the group went into detail for perhaps three states, but fitted models to more than five states and summarized the results and findings in a table, so that we get a more holistic idea of the important features influencing car accident severity across the U.S.

Overall, this project is very well written and provides the reader with insights into the factors influencing the severity of car accidents in different states.
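
To make point 2 concrete, here is a minimal k-fold sketch; the placeholder data and scikit-learn's GradientBoostingClassifier stand in for the group's features and XGB pipeline.

```python
# Sketch: 5-fold cross-validation as an alternative to a single shuffled split.
# Placeholder data and model; substitute the group's features and XGB pipeline.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.random((2_000, 15))
y = rng.integers(1, 5, size=2_000)        # severity levels 1-4

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(GradientBoostingClassifier(), X, y,
                         cv=cv, scoring="f1_macro")
print(f"macro F1 per fold: {scores.round(3)}, mean {scores.mean():.3f}")
```

Reporting the per-fold spread would also show how sensitive the results are to any particular shuffle.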

Peer Appraisal and Suggestions^_^

The team aims to analyze the US Traffic Accident data and identify the important elements that could have caused traffic accidents. Features like natural elements (weather, humidity, etc.) and social elements (distance from convenience stores, population density, etc.) will be analyzed by the team.

What I like about this project is the application and motivation behind it. The outcome of a prediction model based on the day's weather and social elements could give government bodies a sense of the risk level of traffic accidents and allow stakeholders to develop counteractive measures to mitigate the risk and save lives.

Another good thing about the dataset is the large sample size (3 million), which will ease training of the model and cross-validation (if the group chooses to do so). The sample size is large relative to the number of features (40+), which can help prevent underfitting of the data.

One suggestion: the team could consider 2 different approaches for the prediction model.
(1) Analyzing the effect of natural elements and social elements on traffic accident levels (same as what you have).
(2) Analyzing the trend of traffic accidents as a time series (a sketch follows below). We might find that accidents peak at certain times of the day or certain seasons of the year, and relate the trend to some assumptions about human traffic during that time/season. This could allow government officials to introduce measures (e.g., deploying traffic marshals) to mitigate risk if there is a trend.
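
A minimal sketch of the time-series angle in suggestion (2), using synthetic timestamps; the column name Start_Time is an assumption about the dataset, not confirmed from the report.

```python
# Sketch: counting accidents by hour of day to look for rush-hour peaks.
# Synthetic timestamps; "Start_Time" is an assumed column name.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
times = pd.to_datetime("2019-01-01") + pd.to_timedelta(
    rng.integers(0, 365 * 24 * 60, size=50_000), unit="m")
df = pd.DataFrame({"Start_Time": times})

by_hour = df["Start_Time"].dt.hour.value_counts().sort_index()
print(by_hour)   # peaks at specific hours would suggest rush-hour effects
```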

Peer Review

The group wants to analyze US Traffic Accident data to figure out what elements, such as weather conditions, population density, and location, cause accidents. They also want to provide policy recommendations for the government and personal safety recommendations to individuals to lower the chance of accidents. The dataset they are using is a traffic accident dataset of accident reports from Kaggle. This project could have a really important impact on public policy to reduce the chances of accidents and save lives.

3 things I like

  1. The project could have an important impact, as it may help save lives by providing suggestions to individuals and governments
  2. The research question is interesting and doesn’t have an obvious, simple answer
  3. The data seems very detailed and can definitely help answer your first question of understanding the elements that cause accidents.

3 areas of improvement

  1. It might be difficult to provide policy or personal recommendations because that is more subjective and the model may not be able to provide a clear answer
  2. The data could be really messy and difficult to work with, especially because the descriptions are in plain text and NLP is out of scope for this class. It could be hard to mix NLP methods with the regression methods we have covered in class.
  3. There are 49 columns in the dataset, so it might be difficult to figure out which are the most important for predicting the probability of an accident and which combinations of features would work best (a feature selection sketch follows this list).
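
On point 3, one hedged way to narrow 49 columns down is model-based feature selection; the random forest and median threshold here are illustrative choices, not the group's method.

```python
# Sketch: model-based feature selection to reduce 49 columns to the useful few.
# The estimator and threshold are illustrative, not the group's actual method.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(3)
X = rng.random((5_000, 49))               # placeholder for the 49 columns
y = (X[:, 0] + X[:, 7] > 1).astype(int)   # synthetic accident label

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median").fit(X, y)
print("columns kept:", np.flatnonzero(selector.get_support()))
```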

Final Peer Review (gs484)

This project is centered around exploring which factors can be most effectively used to predict the severity of traffic accidents in the United States. Data for 49 U.S. states was obtained from Kaggle, dating back to February 2016 and covering a wide variety of features such as the time of each accident; the latitude and longitude of each accident location; and the temperature, humidity, and visibility at the time of each accident. By analyzing this data, the group intended to develop a concrete set of recommendations on how the federal government, state governments, and drivers could help decrease the rate of traffic accidents throughout the United States.

To begin with, the group's emphasis on backing all conclusions with data tables and visualizations should be applauded. For example, including succinct data tables recording each model's overall accuracy, balanced accuracy, and F1 score made it very clear which models out of ordinary least squares regression (OLS), multinomial LASSO regression, and the gradient boosting algorithm (XGB) performed the best for each state the group focused on. Speaking of states, the group also demonstrated an extensive range in their study by not only performing analysis on countrywide data, but also individually considering different states from different regions of the United States (i.e., California from the West Coast, Minnesota from the Midwest, Texas from the South, etc.) This allowed the group to provide detailed information on how the most significant factors behind traffic accidents changed from state to state, and thereby make customized recommendations for each state on how to better improve its traffic accident rate. Similarly, the group performed a comprehensive examination into an impressively large set of features, considering everything from street type (highway vs. local), longitude, latitude, weather conditions, visibility, time of day, etc. Focusing on such an expansive list of features makes the group's models both more detailed and more believable.

However, the report itself was somewhat difficult to follow, with various typos, vague wording, and grammatical errors interfering with the audience's ability to easily follow the logic being discussed. The "more north/south distance than east/west of [California]", for instance, was somewhat questionably worded; and even if one were to correctly guess what the authors were referring to here, there is no information given on how such distances were measured, nor any statistics specifying what these distances even are. Similarly, while discussing model selection, the authors "predict that there will be significant overfitting issues with the data," without providing specific justification--and simply throwing in an unexplained but mathematically dense overview of the gradient boosting algorithm does not do anything to help the audience understand the model being developed. Finally, the authors briefly mention investigating the "correlation between each state['s] geographic center's latitude with natural cases relative importance" during the report's conclusion. The wording here arguably could have been clearer; and perhaps more troublingly, no detail is given as to how exactly this correlation is calculated. Including even a one-sentence overview here, or perhaps a scatter plot visualization, would have significantly helped the audience follow the authors' argument here.

Final Peer Review -mp668

Overview:
The goal of this project is to determine what key factors affect the severity of traffic accidents in the US. This project is relevant since its results can be used as ways to educate governments and drivers and hopefully change their behaviors for the better.

Positives:
• Good initial visualizations and data preprocessing, and a good account of how the group dealt with missing values and irregular data. I also agree with their additional note on how to shuffle and split into train/test.
• In-depth analysis of the gradient boosting algorithm to avoid overfitting. This shows that the group has referred to the literature and done extensive research on the matter.
• The project is extremely relevant, and I enjoyed reading about the comparisons broken down by state and believe they provide very interesting insights.

Areas of improvement:
• I can’t seem to find the actual data split. Is it 80:20, and what was this based on, given that the 10-fold did not work? (A stratified 80:20 split is sketched after this list.)
• Did the group explore any other linear models or losses, such as least-squares regression with one-hot and many-hot features, or quadratic or l1 loss with certain features and regularizers?
• The discussion of Weapons of Math Destruction and fairness is quite brief; I would like to see it expanded a little to show that they have considered all possible angles.
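
On the first point, here is a sketch of the standard stratified 80:20 split I had in mind; the arrays are placeholders for the group's feature matrix and severity labels.

```python
# Sketch: an explicit, reproducible 80:20 split, stratified on severity.
# Placeholder arrays; substitute the group's feature matrix and labels.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.random((10_000, 30))
y = rng.integers(1, 5, size=10_000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, shuffle=True, stratify=y, random_state=0)
print(len(X_train), len(X_test))
```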

Peer Review

I understand you are doing a project on traffic accidents in the USA. You are using a dataset from Kaggle that has over 3.5 million entries since 2016. You want to see how certain conditions may affect the likelihood of an accident, and then use this to potentially recommend solutions or warnings about dangerous conditions.

I like the size of the data; you can definitely get some cool results with that much information.
I like the impact; it is a really neat area with potentially significant outcomes if you find a certain set of conditions that are far more hazardous than the rest.
I like the overall direction y’all are going in; it will be a cool project when it’s done!

Maybe you can consider time of day, and see if there are more accidents at night? At lunch? Etc.
I’d be a little concerned with overlapping data: for instance, if it’s raining then it’s definitely humid and probably lower visibility. So is there a way to separate these and see whether it is the rain causing the accident, the humidity, or something else? (A correlation sketch follows below.)
Is there a way to rank the severity of the accident? Is it just a small bump, or was it a multi-car accident that shut down a highway? Can you figure that out and take it into account in your identification of dangerous conditions?
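
On the overlapping-conditions point, here is a minimal sketch of a correlation check; the column names and the deliberately correlated synthetic data are hypothetical.

```python
# Sketch: checking how much weather features overlap via a correlation matrix.
# Hypothetical column names; the data is synthetic and correlated on purpose.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
rain = rng.random(20_000)
df = pd.DataFrame({
    "Precipitation": rain,
    "Humidity": 0.7 * rain + 0.3 * rng.random(20_000),
    "Visibility": 1.0 - 0.6 * rain + 0.4 * rng.random(20_000),
})

# Pairwise Pearson r: values near +/-1 mean two features carry
# largely the same signal and may confound attribution.
print(df.corr().round(2))
```

Strongly correlated pairs could then be dropped or combined before asking which condition actually drives accidents.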
