
hospitalfinder's People

Contributors: ataki, petousis

hospitalfinder's Issues

Naive Bayes Feature Selection Issues

A lot of cross-validation runs are returning 0.0, implying that it's often hard for feature selection to distinguish between feature sets.

Tested with only 500 examples and 50 features; not sure how long the entire dataset with all features would take, but I want to debug before scaling up to that.

Question: Is it normal for forward search CV to return somewhat different sets of features between runs (i.e. 45-50% of the features change between runs of forward search)?
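For reference, here is a minimal sketch of the forward-search loop I have in mind; `cv_error` is a hypothetical stand-in for our Naive Bayes k-fold CV routine, so treat this as an outline rather than the actual implementation:

```python
import numpy as np

def forward_search(X, y, cv_error, max_features=None):
    """Greedy forward selection: repeatedly add the feature whose
    addition gives the lowest cross-validation error."""
    selected, best_err = [], np.inf
    remaining = set(range(X.shape[1]))
    while remaining:
        # Score every candidate feature added to the current set.
        errs = {j: cv_error(X[:, selected + [j]], y) for j in remaining}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:      # no improvement: stop
            break
        best_err = errs[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
        if max_features and len(selected) >= max_features:
            break
    return selected, best_err
```

One observation: if many candidate feature sets tie at exactly 0.0, the argmin above is effectively arbitrary, and any randomness in the fold splits then changes which features win; that alone could plausibly account for 45-50% of the selection differing between runs.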

Logistic Regression: Calculating Errors

Just wondering if the following is a legitimate way of calculating errors:

The way I calculate errors for feature selection is by treating Naive Bayes' discrete predictions as continuous values rather than as discrete labels.

So, for example, Naive Bayes might predict Y* where the actual value is Y. Let's say the maximum obtainable error is Ymax.

  1. I calculate the test error as:

sum(square((Y-Y*) / Ymax)) / m

This measures the difference between predicted and actual values as a fraction of the maximum obtainable error.

  2. Another possible approach is to just do:

sum(square(Y-Y*)) / m

This makes it easier for feature selection to distinguish the results of cross-validation, because the values are farther apart.

_Which one is more legitimate for the milestone submission?_
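For concreteness, here are both options in numpy form, assuming Ymax is a single known constant:

```python
import numpy as np

def error_option_1(y, y_pred, y_max):
    """Squared difference as a fraction of the max obtainable error."""
    m = len(y)
    return np.sum(((y - y_pred) / y_max) ** 2) / m

def error_option_2(y, y_pred):
    """Plain mean squared error."""
    m = len(y)
    return np.sum((y - y_pred) ** 2) / m
```

Note that if Ymax really is one constant, option 1 is just option 2 divided by Ymax², so the two rank feature sets identically; only the scale of the CV values differs.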

EM algorithm

Implementing the EM algorithm for the detection of outliers.
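A minimal sketch of what I'm planning, using scikit-learn's GaussianMixture (which runs EM under the hood) and flagging the lowest-likelihood samples; the two-component choice and the 1% cutoff are assumptions, not settled decisions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def em_outliers(X, n_components=2, quantile=0.01, seed=0):
    """Fit a Gaussian mixture with EM; flag low-likelihood points as outliers."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(X)                          # EM runs inside fit()
    log_lik = gmm.score_samples(X)      # per-sample log-likelihood
    threshold = np.quantile(log_lik, quantile)
    return log_lik < threshold          # boolean outlier mask
```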

Datasets do not have exactly the same feature mappings starting around index 800

tagging @scottcheng @petousis so you guys can see this.

Discovered this bug earlier tonight; kind of painstaking to have to fix it.

Basically, the 2009 and 2010 datasets aren't exactly the same; a few fields present in 2010 are missing from 2009, and this causes errors in field translations.

As an example: field indices 857-859 mean revenue from Medicare in 2010, but the same field sits at 856-858 in 2009.

This only happens starting at around index 800, so fields before that are still ok.

As a result, I need to correct mapping.py 😦 However, this only applies to feature selection, and should not block you guys; just wanted to make you aware in case you were starting to do FS.

The fix will be in by early tomorrow afternoon.
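For context, here's a hypothetical sketch of the kind of translation mapping.py needs; the cutoff (800) and the +1 offset come from the example above, and the real fix may need a per-range table once all the mismatches are catalogued:

```python
OFFSET_START = 800  # fields before this index agree across years

def to_2010_index(index, year):
    """Translate a raw field index from `year` into 2010 coordinates."""
    if year == 2009 and index >= OFFSET_START:
        return index + 1  # e.g. 2009's 856-858 -> 2010's 857-859
    return index
```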

Linear regression produces similar error for all features

I noticed that running forward search yields a very different set of features each time. It turns out this is because the errors generated by training with each of the features are amazingly close to each other.

Below is a log of running simple CV using 1 feature at a time. m here is 1000.
Each line contains the feature name, unweighted theta, and CV error.

dayOfWeek                         [ 19.58808887  -0.18769851] 6.58927488135
age                               [ 17.1282098    0.04089841] 6.70841792543
sex                               [ 19.57499247  -0.51025653] 6.8315913638
injury                            [ 17.7779104    0.22871192] 6.91217155097
seenBefore                        [ 18.36513984   0.43531626] 7.89425402349
totalServices                     [ 15.01136392   0.83372447] 6.85801535243
medicationProvided                [ 17.58766118   1.42125569] 6.94321645391
numberOfNewMedicationsCoded       [ 19.04929861  -0.24200571] 6.82832722998
numberOfContinuedMedicationsCoded [ 18.15531682   0.28367161] 7.01625326977
region                            [ 22.5996675   -2.00732449] 7.12600328373
physicianCode                     [  2.02885506e+01  -1.66753915e-02] 6.78736045563
officeSetting                     [ 18.73424023   0.10645313] 6.69207874527
physicianEmploymentStatus         [ 14.32694882   2.85618371] 6.77561833019
typeOfPracticeOwner               [ 18.03660926   0.2730096 ] 6.78011422404
eveningWeekendVistsAllowed        [ 15.98499104   1.91778674] 6.98476190476
hasComputerForPatientInfo         [ 17.86804529   0.88110341] 6.99539174526
hasComputerForPrescriptionOrders  [ 21.57264484  -2.05802031] 7.27203728936
hasComputerForLabTestOrders       [ 22.09071697  -2.44649809] 7.1204142794
hasComputerForViewingLabTestOrders[ 24.44718216  -4.64164495] 6.7487972848
hasComputerForViewingImageResults [ 20.53747697  -1.19480268] 6.83220968645
routineAppointmentSetupTime       [ 18.79300158   0.02789137] 6.84205464392
whoCompletedForm                  [ 18.31694209   0.20467146] 7.44680475593

It turns out all the features have a very large theta[0] (intercept term) and a very small theta[1]. I think this means these features are actually not very relevant to Y, which could be why the errors generated by these features are so close to each other.
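To illustrate, here is a self-contained sketch of the per-feature fit (least squares with an intercept, plus a simple 70/30 hold-out error; the split ratio and the synthetic data are assumptions, used only to reproduce the large-intercept/small-slope pattern from the log):

```python
import numpy as np

def single_feature_cv(x, y, train_frac=0.7):
    """Fit y ~ theta0 + theta1*x by least squares; return theta and hold-out MSE."""
    m = len(y)
    split = int(m * train_frac)
    X = np.column_stack([np.ones(m), x])          # prepend intercept column
    theta, *_ = np.linalg.lstsq(X[:split], y[:split], rcond=None)
    resid = X[split:] @ theta - y[split:]
    return theta, float(np.mean(resid ** 2))

# Synthetic check: a weakly relevant feature yields a large theta[0]
# and a small theta[1], matching the log above.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 18 + 0.1 * x + rng.normal(scale=2.6, size=1000)
theta, err = single_feature_cv(x, y)
print(theta, err)
```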

Ensemble methods

Learned today about something called "ensemble methods" for supervised learning. Given the scenario that we have several weaker predictors with largely orthogonal features, we can use ensemble methods to aggregate these weaker predictors into a stronger predictor.

The standard method is "bagging," which, given an input, "polls" all predictors and averages their results. Bagging places equal weight on each predictor while polling.

Thought it might be useful to consider in case our individual models aren't achieving much improved performance on their own.
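(Bagging is short for "bootstrap aggregating": each predictor is trained on a bootstrap resample before the equal-weight averaging described above.) Here's a minimal sketch, under the assumption that our models expose a fit function returning a predict callable:

```python
import numpy as np

def bag_predict(X_train, y_train, X_test, fit_fn, n_models=10, seed=0):
    """Train `fit_fn` on bootstrap resamples and average the predictions."""
    rng = np.random.default_rng(seed)
    m = len(y_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, m, size=m)          # bootstrap sample
        predict = fit_fn(X_train[idx], y_train[idx])
        preds.append(predict(X_test))
    return np.mean(preds, axis=0)                 # equal-weight average
```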

What to do with missing data (-9) and invalid data?

I realized that there is a lot of invalid data in our dataset that we should be aware of when tweaking our algorithms. Below is the BMI (body mass index) of 50 data samples:

[  27.73  28.16  26.21  -9.    30.12  23.82  32.28  26.25  19.27  23.79
   -9.    25.12  -7.    -9.    27.89  -9.    25.68  -9.    27.05  -7.    -9.
   -7.    -7.    -9.    -9.    33.2   -9.    -9.    34.69  -9.    18.29
   34.55  -9.    38.45  -9.    42.43  21.8   29.19  31.53  33.44  29.99
   22.44  -9.    -9.    29.23  17.56  24.02  30.86  20.22  29.53]

-9, -8 and -7 are typical invalid values. For linear regression I'm thinking about just eliminating the samples with invalid values.
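A minimal sketch of that filtering step; the sentinel set is an assumption based on the codes listed above:

```python
import numpy as np

SENTINELS = {-9.0, -8.0, -7.0}  # codes for missing/invalid values

def drop_invalid(X, y, col):
    """Drop samples whose value in column `col` is a sentinel code."""
    mask = ~np.isin(X[:, col], list(SENTINELS))
    return X[mask], y[mask]
```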
