
hospitalfinder's People

Contributors: ataki, petousis

hospitalfinder's Issues

Naive Bayes Feature Selection Issues

A lot of cross-validation runs are returning 0.0, implying that it's often hard for feature selection to distinguish between feature sets.

Tested with only 500 examples and 50 features; not sure how long the entire dataset with all features would take, but I want to debug before scaling up to that.

Question: Is it normal for forward search CV to return somewhat different sets of features between runs (i.e. 45-50% of the features change between runs of forward search)?
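For reference, here is a minimal sketch of the forward-search loop I have in mind; `cv_error` is a hypothetical stand-in for our Naive Bayes k-fold CV routine, so treat this as an outline rather than the actual implementation:

```python
import numpy as np

def forward_search(X, y, cv_error, max_features=None):
    """Greedy forward selection: repeatedly add the feature whose
    addition gives the lowest cross-validation error."""
    selected, best_err = [], np.inf
    remaining = set(range(X.shape[1]))
    while remaining:
        # Score every candidate feature added to the current set.
        errs = {j: cv_error(X[:, selected + [j]], y) for j in remaining}
        j_best = min(errs, key=errs.get)
        if errs[j_best] >= best_err:      # no improvement: stop
            break
        best_err = errs[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
        if max_features and len(selected) >= max_features:
            break
    return selected, best_err
```

One observation: if many candidate feature sets tie at exactly 0.0, the argmin above is effectively arbitrary, and any randomness in the fold splits then changes which features win; that alone could plausibly account for 45-50% of the selection differing between runs.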

Logistic Regression: Calculating Errors

Just wondering if the following is a legitimate way of calculating errors:

The way I calculate errors for feature selection is by treating Naive Bayes' discrete predictions as continuous values rather than as discrete labels.

So, for example, Naive Bayes might predict Y* where the actual value is Y. Let's say the maximum obtainable error is Ymax.

  1. I calculate the test error as:

sum(square((Y-Y*) / Ymax)) / m

This measures the difference between predicted and actual values as a fraction of the maximum obtainable error.

  2. Another possible approach is to just do:

sum(square(Y-Y*)) / m

This makes it easier for feature selection to distinguish the results of cross-validation, because the values are farther apart.

_Which one is more legitimate for the milestone submission?_
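For concreteness, here are both options in numpy form, assuming Ymax is a single known constant:

```python
import numpy as np

def error_option_1(y, y_pred, y_max):
    """Squared difference as a fraction of the max obtainable error."""
    m = len(y)
    return np.sum(((y - y_pred) / y_max) ** 2) / m

def error_option_2(y, y_pred):
    """Plain mean squared error."""
    m = len(y)
    return np.sum((y - y_pred) ** 2) / m
```

Note that if Ymax really is one constant, option 1 is just option 2 divided by Ymax², so the two rank feature sets identically; only the scale of the CV values differs.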

EM algorithm

Implementing the EM algorithm for the detection of outliers.
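A minimal sketch of what I'm planning, using scikit-learn's GaussianMixture (which runs EM under the hood) and flagging the lowest-likelihood samples; the two-component choice and the 1% cutoff are assumptions, not settled decisions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def em_outliers(X, n_components=2, quantile=0.01, seed=0):
    """Fit a Gaussian mixture with EM; flag low-likelihood points as outliers."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed)
    gmm.fit(X)                          # EM runs inside fit()
    log_lik = gmm.score_samples(X)      # per-sample log-likelihood
    threshold = np.quantile(log_lik, quantile)
    return log_lik < threshold          # boolean outlier mask
```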

Datasets do not have exactly the same feature mappings starting around index 800

tagging @scottcheng @petousis so you guys can see this.

Discovered this bug earlier tonight; kind of painstaking to have to fix it.

Basically, the 2009 and 2010 datasets aren't exactly the same; a few fields present in 2010 are missing from 2009, and this causes errors in field translations.

As an example: field indices 857-859 mean revenue from Medicare in 2010, but the same field sits at 856-858 in 2009.

This only happens starting at around index 800, so fields before that are still ok.

As a result, I need to correct mapping.py 😦 However, this only applies to feature selection, and should not block you guys; just wanted to make you aware in case you were starting to do FS.

The fix will be in by early tomorrow afternoon.
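For context, here's a hypothetical sketch of the kind of translation mapping.py needs; the cutoff (800) and the +1 offset come from the example above, and the real fix may need a per-range table once all the mismatches are catalogued:

```python
OFFSET_START = 800  # fields before this index agree across years

def to_2010_index(index, year):
    """Translate a raw field index from `year` into 2010 coordinates."""
    if year == 2009 and index >= OFFSET_START:
        return index + 1  # e.g. 2009's 856-858 -> 2010's 857-859
    return index
```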

Linear regression produces similar error for all features

I noticed that running forward search yields a very different set of features each time. It turns out this is because the errors generated by training with each of the features are amazingly close to each other.

Below is a log of running simple CV using 1 feature at a time. m here is 1000.
Each line contains the feature name, unweighted theta, and CV error.

dayOfWeek                         [ 19.58808887  -0.18769851] 6.58927488135
age                               [ 17.1282098    0.04089841] 6.70841792543
sex                               [ 19.57499247  -0.51025653] 6.8315913638
injury                            [ 17.7779104    0.22871192] 6.91217155097
seenBefore                        [ 18.36513984   0.43531626] 7.89425402349
totalServices                     [ 15.01136392   0.83372447] 6.85801535243
medicationProvided                [ 17.58766118   1.42125569] 6.94321645391
numberOfNewMedicationsCoded       [ 19.04929861  -0.24200571] 6.82832722998
numberOfContinuedMedicationsCoded [ 18.15531682   0.28367161] 7.01625326977
region                            [ 22.5996675   -2.00732449] 7.12600328373
physicianCode                     [  2.02885506e+01  -1.66753915e-02] 6.78736045563
officeSetting                     [ 18.73424023   0.10645313] 6.69207874527
physicianEmploymentStatus         [ 14.32694882   2.85618371] 6.77561833019
typeOfPracticeOwner               [ 18.03660926   0.2730096 ] 6.78011422404
eveningWeekendVistsAllowed        [ 15.98499104   1.91778674] 6.98476190476
hasComputerForPatientInfo         [ 17.86804529   0.88110341] 6.99539174526
hasComputerForPrescriptionOrders  [ 21.57264484  -2.05802031] 7.27203728936
hasComputerForLabTestOrders       [ 22.09071697  -2.44649809] 7.1204142794
hasComputerForViewingLabTestOrders[ 24.44718216  -4.64164495] 6.7487972848
hasComputerForViewingImageResults [ 20.53747697  -1.19480268] 6.83220968645
routineAppointmentSetupTime       [ 18.79300158   0.02789137] 6.84205464392
whoCompletedForm                  [ 18.31694209   0.20467146] 7.44680475593

It turns out all the features have a very large theta[0] (intercept term) and a very small theta[1]. I think this means these features are actually not very relevant to Y, which could be why the errors generated by these features are so close to each other.
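To illustrate, here is a self-contained sketch of the per-feature fit (least squares with an intercept, plus a simple 70/30 hold-out error; the split ratio and the synthetic data are assumptions, used only to reproduce the large-intercept/small-slope pattern from the log):

```python
import numpy as np

def single_feature_cv(x, y, train_frac=0.7):
    """Fit y ~ theta0 + theta1*x by least squares; return theta and hold-out MSE."""
    m = len(y)
    split = int(m * train_frac)
    X = np.column_stack([np.ones(m), x])          # prepend intercept column
    theta, *_ = np.linalg.lstsq(X[:split], y[:split], rcond=None)
    resid = X[split:] @ theta - y[split:]
    return theta, float(np.mean(resid ** 2))

# Synthetic check: a weakly relevant feature yields a large theta[0]
# and a small theta[1], matching the log above.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 18 + 0.1 * x + rng.normal(scale=2.6, size=1000)
theta, err = single_feature_cv(x, y)
print(theta, err)
```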

Ensemble methods

Learned today about something called "ensemble methods" for supervised learning. Given the scenario that we have several weaker predictors with largely orthogonal features, we can use ensemble methods to aggregate these weaker predictors into a stronger predictor.

The standard method is "bagging," which, given an input, "polls" all predictors and averages their results. Bagging places equal weight on each predictor while polling.

Thought it might be useful to consider in case our individual models aren't achieving much improved performance on their own.
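(Bagging is short for "bootstrap aggregating": each predictor is trained on a bootstrap resample before the equal-weight averaging described above.) Here's a minimal sketch, under the assumption that our models expose a fit function returning a predict callable:

```python
import numpy as np

def bag_predict(X_train, y_train, X_test, fit_fn, n_models=10, seed=0):
    """Train `fit_fn` on bootstrap resamples and average the predictions."""
    rng = np.random.default_rng(seed)
    m = len(y_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, m, size=m)          # bootstrap sample
        predict = fit_fn(X_train[idx], y_train[idx])
        preds.append(predict(X_test))
    return np.mean(preds, axis=0)                 # equal-weight average
```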

What to do with missing data (-9) and invalid data?

I realized that there is a lot of invalid data in our dataset that we should be aware of when tweaking our algorithms. Below is the BMI (body mass index) of 50 data samples:

[  27.73  28.16  26.21  -9.    30.12  23.82  32.28  26.25  19.27  23.79
   -9.    25.12  -7.    -9.    27.89  -9.    25.68  -9.    27.05  -7.    -9.
   -7.    -7.    -9.    -9.    33.2   -9.    -9.    34.69  -9.    18.29
   34.55  -9.    38.45  -9.    42.43  21.8   29.19  31.53  33.44  29.99
   22.44  -9.    -9.    29.23  17.56  24.02  30.86  20.22  29.53]

-9, -8 and -7 are typical invalid values. For linear regression I'm thinking about just eliminating the samples with invalid values.
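A minimal sketch of that filtering step; the sentinel set is an assumption based on the codes listed above:

```python
import numpy as np

SENTINELS = {-9.0, -8.0, -7.0}  # codes for missing/invalid values

def drop_invalid(X, y, col):
    """Drop samples whose value in column `col` is a sentinel code."""
    mask = ~np.isin(X[:, col], list(SENTINELS))
    return X[mask], y[mask]
```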
