ataki / hospitalfinder
Use ML to predict best hospitals for patients
Might get better results by binning several features
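A minimal sketch of what binning a feature could look like, using `np.digitize`; the feature name and bucket edges here are illustrative, not the project's actual choices:

```python
import numpy as np

# Hypothetical example: bin a continuous feature (e.g. age) into a few
# discrete buckets before feeding it to the model.
ages = np.array([5, 17, 23, 41, 67, 80])
edges = [18, 40, 65]               # assumed bucket boundaries, not tuned
binned = np.digitize(ages, edges)  # 0 = <18, 1 = 18-39, 2 = 40-64, 3 = 65+
print(binned)                      # [0 0 1 2 3 3]
```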
A lot of cross-validation runs are returning 0.0, which suggests it's often hard for feature selection to distinguish between feature sets.
Tested with only 500 examples and 50 features; not sure how long the entire dataset + features would take, but want to debug before then.
Question: Is it normal for forward search CV to return somewhat different sets of features (i.e. 45-50% of the features change between runs of forward search)?
Just wondering if the following is a legitimate way of calculating errors:
The way I calculate errors for feature selection is by treating discrete naive Bayes' predicted result as continuous rather than discrete.
So for example, naive Bayes might predict Y*, and the actual value is Y. Let's say the maximum obtainable error is Ymax.
sum(square((Y-Y*) / Ymax)) / m
Thus, I find the percent of the difference (between expected and actual) relative to the maximum possible obtainable error.
sum(square(Y-Y*)) / Ymax
This makes it easier for feature selection to distinguish the results of cross validation, because the values are farther apart.
_Which one is more legitimate for the milestone submission?_
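For concreteness, here's a minimal sketch of both formulas as I read them (the second one taken as dividing the total squared error by Ymax); the values of Y, Y*, and Ymax below are made up for illustration:

```python
import numpy as np

# Illustrative values only; Y is the true label, Ystar the (discrete)
# naive Bayes prediction treated as a continuous number.
Y     = np.array([3.0, 1.0, 4.0, 2.0])
Ystar = np.array([2.0, 1.0, 5.0, 4.0])
Ymax  = 5.0            # assumed maximum obtainable error
m     = len(Y)

# First formula: squared error normalized by Ymax, averaged over m.
err1 = np.sum(((Y - Ystar) / Ymax) ** 2) / m
# Second formula: total squared error divided by Ymax (no averaging),
# which spreads the values out more.
err2 = np.sum((Y - Ystar) ** 2) / Ymax
```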
Implementing the EM algorithm for the detection of outliers.
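One way this could look (a sketch only, not the project's implementation): fit a two-component 1-D Gaussian mixture with EM and flag points assigned to the smaller component as outliers. The data, component count, and 0.5 threshold are all assumptions for illustration:

```python
import numpy as np

# Toy data: 200 "inliers" near 0 plus 10 "outliers" near 8.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(8, 1, 10)])

# Initialize mixing weights, means, and variances of the two components.
w = np.array([0.5, 0.5])
mu = np.array([x.min(), x.max()])
var = np.array([1.0, 1.0])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibility of each component for each point.
    r = w * gauss(x[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, variances from responsibilities.
    n = r.sum(axis=0)
    w = n / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n

# Points mostly explained by the smaller component are flagged as outliers.
outlier_comp = np.argmin(w)
outliers = x[r[:, outlier_comp] > 0.5]
```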
tagging @scottcheng @petousis so you guys can see this.
Discovered this bug earlier tonight; kind of painstaking to have to fix it.
Basically, the 2009 and 2010 datasets aren't exactly the same; a few fields present in 2010 are missing from 2009, and this causes errors in field translations.
As an example: fields (857-859) mean revenue from Medicare in 2010 but are actually (856-858) in 2009.
This only happens starting at around index 800, so fields before that are still ok.
As a result, I need to correct mapping.py
However, this only applies to feature selection, and should not block you guys; just wanted to make you aware in case you were starting to do FS.
This will happen by early tomorrow afternoon.
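Roughly, the fix is the kind of index translation sketched below. This is hypothetical, not the actual mapping.py code; the shift point (~800) and offset (1) come from the Medicare-revenue example above, and the real mapping may need several such adjustments:

```python
# Hypothetical helper: fields missing from the 2009 file shift every
# later field's index down relative to 2010.
def to_2009_index(index_2010, shift_start=800, offset=1):
    """Translate a 2010 field index to its 2009 position."""
    if index_2010 >= shift_start:
        return index_2010 - offset
    return index_2010

# e.g. the Medicare revenue field: 857 in 2010 -> 856 in 2009,
# while fields before the shift point are unchanged.
```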
I noticed that running forward search yields a very different set of features each time. It turns out this is because the errors generated by training with each of the features are amazingly close to each other.
Below is a log of running simple CV using 1 feature at a time. m here is 1000.
Each line contains the feature name, unweighted theta, and CV error.
dayOfWeek [ 19.58808887 -0.18769851] 6.58927488135
age [ 17.1282098 0.04089841] 6.70841792543
sex [ 19.57499247 -0.51025653] 6.8315913638
injury [ 17.7779104 0.22871192] 6.91217155097
seenBefore [ 18.36513984 0.43531626] 7.89425402349
totalServices [ 15.01136392 0.83372447] 6.85801535243
medicationProvided [ 17.58766118 1.42125569] 6.94321645391
numberOfNewMedicationsCoded [ 19.04929861 -0.24200571] 6.82832722998
numberOfContinuedMedicationsCoded [ 18.15531682 0.28367161] 7.01625326977
region [ 22.5996675 -2.00732449] 7.12600328373
physicianCode [ 2.02885506e+01 -1.66753915e-02] 6.78736045563
officeSetting [ 18.73424023 0.10645313] 6.69207874527
physicianEmploymentStatus [ 14.32694882 2.85618371] 6.77561833019
typeOfPracticeOwner [ 18.03660926 0.2730096 ] 6.78011422404
eveningWeekendVistsAllowed [ 15.98499104 1.91778674] 6.98476190476
hasComputerForPatientInfo [ 17.86804529 0.88110341] 6.99539174526
hasComputerForPrescriptionOrders [ 21.57264484 -2.05802031] 7.27203728936
hasComputerForLabTestOrders [ 22.09071697 -2.44649809] 7.1204142794
hasComputerForViewingLabTestOrders [ 24.44718216 -4.64164495] 6.7487972848
hasComputerForViewingImageResults [ 20.53747697 -1.19480268] 6.83220968645
routineAppointmentSetupTime [ 18.79300158 0.02789137] 6.84205464392
whoCompletedForm [ 18.31694209 0.20467146] 7.44680475593
Turns out all the features have a very large theta[0] (intercept term) and a very small theta[1]. I think this means these features are actually not very relevant to Y. This could be the reason why the errors generated by each of these features are so close to each other.
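The pattern is easy to reproduce on synthetic data (illustrative only, not our dataset): when a feature barely correlates with Y, the least-squares fit puts almost everything in the intercept, so single-feature CV errors end up nearly identical:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
x = rng.integers(1, 8, m).astype(float)   # e.g. a dayOfWeek-like feature
y = 18 + 0.1 * x + rng.normal(0, 2.5, m)  # Y depends only weakly on x

X = np.column_stack([np.ones(m), x])      # design matrix with intercept
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
# theta[0] (intercept) dominates; theta[1] stays near zero, mirroring
# the large-theta[0]/small-theta[1] pattern in the log above.
```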
Learned today about something called "ensemble methods" for supervised learning. Given the scenario that we have several weaker predictors with largely orthogonal features, we can use ensemble methods to aggregate these weaker predictors into a stronger predictor.
The standard method is via "bagging," which, given an input, "polls" all predictors and averages together their result. Bagging places equal weights on each predictor while polling.
Thought it might be useful to think about in case our individual models separately aren't achieving much improved performance.
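A minimal bagging sketch, assuming a toy regression setup: train several predictors on bootstrap resamples and "poll" them by averaging with equal weights. The base model (plain least squares) and all the data here are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.3, 500)

def fit(X, y):
    # Least-squares fit with an intercept column.
    Xb = np.column_stack([np.ones(len(X)), X])
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return theta

def predict(theta, X):
    return np.column_stack([np.ones(len(X)), X]) @ theta

# Bagging: each predictor sees a bootstrap resample of the training set.
models = []
for _ in range(10):
    idx = rng.integers(0, len(X), len(X))
    models.append(fit(X[idx], y[idx]))

# "Poll" all predictors and average their outputs (equal weights).
y_hat = np.mean([predict(t, X) for t in models], axis=0)
```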
How generic do you want this tool to be?
Right now I'm only writing it for my model - let me know here if you want to be able to run this for your models.
Supposedly L2 regularization is better when you have a large number of examples compared to features.
http://metaoptimize.com/qa/questions/5205/when-to-use-l1-regularization-and-when-l2
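One practical point in L2's favor: ridge regression has a closed-form solution, while L1 needs an iterative solver (e.g. coordinate descent). A sketch of the ridge normal equation, with an illustrative lambda and toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(0, 0.1, 200)

lam = 1.0                      # assumed regularization strength
n = X.shape[1]
# Ridge closed form: theta = (X^T X + lambda I)^-1 X^T y
theta = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```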
I realized that there is a lot of invalid data in our dataset that we might want to be aware of when tweaking our algorithms. Below is the BMI (body mass index) of 50 data samples:
[ 27.73 28.16 26.21 -9. 30.12 23.82 32.28 26.25 19.27 23.79
-9. 25.12 -7. -9. 27.89 -9. 25.68 -9. 27.05 -7. -9.
-7. -7. -9. -9. 33.2 -9. -9. 34.69 -9. 18.29
34.55 -9. 38.45 -9. 42.43 21.8 29.19 31.53 33.44 29.99
22.44 -9. -9. 29.23 17.56 24.02 30.86 20.22 29.53]
-9, -8 and -7 are typical invalid values. For linear regression I'm thinking about just eliminating the samples with invalid values.
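A sketch of what that filtering could look like with a numpy mask; the `bmi` array below is a toy slice, not the real data:

```python
import numpy as np

# Drop samples carrying sentinel invalid values (-7, -8, -9) before
# fitting linear regression.
bmi = np.array([27.73, 28.16, -9., 30.12, -7., 25.68, -8., 27.05])
valid = ~np.isin(bmi, [-7., -8., -9.])
clean = bmi[valid]
print(clean)   # [27.73 28.16 30.12 25.68 27.05]
```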