johnmyleswhite / ML_for_Hackers
Code accompanying the book "Machine Learning for Hackers"
Home Page: http://shop.oreilly.com/product/0636920018483.do
In code snippet #20 in chapter06.R, glmnet() wants a matrix with 2 or more columns but throws an error because the matrix has only one column. I thought about wrangling it into a two-column matrix, but that might not be in line with the original intent of the snippet.
How can I get the data used in this book?
Since the Google Social Graph API is no longer available, can you recommend any other currently active sources of data for practicing the network-analysis methods discussed in Chapter 11?
Thanks.
Google's social graph API no longer exists. I love this book and have used it as a template to start many other projects.
I need to make social graphs now and having skimmed this chapter I'm not sure if it's usable given the decommissioning of Google's API, as well as the changes to Twitter.
I'm currently trying to get started from some other social-graph tutorials online. Could someone more familiar with this chapter let me know whether it is still worth reading in full and adapting the code around these changes, or whether that would be a waste of time?
Thanks!
library(tm)
library(ggplot2)
#tm is the text mining package of R
#ggplot is for visualization
#there are 2 sets of files for each type of mail and one will be used for training while other will be for testing
spam.path<-"data/spam/"
spam2.path<-"data/spam_2/"
easyham.path<-"data/easy_ham/"
easyham2.path<-"data/easy_ham_2/"
hardham.path<-"data/hard_ham/"
hardham2.path<-"data/hard_ham_2/"
get.msg<-function(path){
print(path)
connection<-file(path,open="rt", encoding="Latin1")
text<-readLines(connection)
#the message begins after a full line break
t<-which(text=="")[1]+1
print(length(text))
print(t)
msg<-text[seq(t, length(text))]
#print(msg)
close(connection)
return (paste(msg, collapse="\n"))
}
#tdm=term document matrix
get.tdm<-function(doc.vec){
doc.corpus<-Corpus(VectorSource(doc.vec))
control<-list(stopwords=TRUE, removePunctuation=TRUE, removeNumbers=TRUE, minDocFreq=2)
doc.dtm<-TermDocumentMatrix(doc.corpus, control)
return (doc.dtm)
}
# create a vector of emails
#use apply function
spam.docs<-dir(spam.path)
#this returns a list of file names in the directory
spam.docs<-spam.docs[seq(1,length(spam.docs)-1)]
#spam.docs<-spam.docs[which(spam.docs!="")]
#cmds file is a UNIX file which we don't need
#spam.docs<-spam.docs[!startsWith(spam.docs, "cmds")]
all.spam<-sapply(spam.docs, function(p) get.msg(paste(spam.path,p, sep="")))
spam.tdm<-get.tdm(all.spam)
#use the command below for inspection
#head(all.spam)
#z<-TermDocumentMatrix(Corpus(VectorSource(all.spam)), list(stopwords=TRUE, removeNumbers=TRUE, removePunctuation=TRUE, minDocFreq=2))
spam.matrix<- as.matrix(spam.tdm)
spam.counts<-rowSums(spam.matrix)
spam.df<-data.frame(cbind(names(spam.counts), as.numeric(spam.counts)), stringsAsFactors=FALSE)
names(spam.df)<-c("term", "frequency")
spam.df$frequency<-as.numeric(spam.df$frequency)
spam.occurence<-sapply(1:nrow(spam.matrix),
                       function(i){
                         length(which(spam.matrix[i,]>0))/ncol(spam.matrix)
                       })
spam.density<-spam.df$frequency/sum(spam.df$frequency)
spam.df<-transform(spam.df, density=spam.density, occurence=spam.occurence)
head(spam.df[with(spam.df,order(-occurence)), ])
#construction of Ham dataset
easy_ham.docs<-dir(easyham.path)
#this returns a list of file names in the directory
easy_ham.docs<-easy_ham.docs[seq(1,500)]
#spam.docs<-spam.docs[which(spam.docs!="")]
#cmds file is a UNIX file which we don't need
#spam.docs<-spam.docs[!startsWith(spam.docs, "cmds")]
all.easy_ham<-sapply(easy_ham.docs, function(p) get.msg(paste(easyham.path,p, sep="")))
easy_ham.tdm<-get.tdm(all.easy_ham)
#use the command below for inspection
#head(all.spam)
#z<-TermDocumentMatrix(Corpus(VectorSource(all.spam)), list(stopwords=TRUE, removeNumbers=TRUE, removePunctuation=TRUE, minDocFreq=2))
easy_ham.matrix<- as.matrix(easy_ham.tdm)
easy_ham.counts<-rowSums(easy_ham.matrix)
easy_ham.df<-data.frame(cbind(names(easy_ham.counts), as.numeric(easy_ham.counts)), stringsAsFactors=FALSE)
names(easy_ham.df)<-c("term", "frequency")
easy_ham.df$frequency<-as.numeric(easy_ham.df$frequency)
easy_ham.occurence<-sapply(1:nrow(easy_ham.matrix),
                           function(i){
                             length(which(easy_ham.matrix[i,]>0))/ncol(easy_ham.matrix)
                           })
easy_ham.density<-easy_ham.df$frequency/sum(easy_ham.df$frequency)
easy_ham.df<-transform(easy_ham.df, density=easy_ham.density, occurence=easy_ham.occurence)
easy_ham.df$NA.<-NULL
head(easy_ham.df[with(easy_ham.df,order(-occurence)), ])
#Classification function
classify.email<-function(path, training.df, prior=0.5, c=1e-6){
msg<-get.msg(path)
msg.tdm<-get.tdm(msg)
msg.freq<-rowSums(as.matrix(msg.tdm))
#Find intersection of words
msg.match<-intersect(names(msg.freq), training.df$term)
if(length(msg.match)<1){
return (prior*c^(length(msg.freq)))
}
else{
match.probs<-training.df$occurence[match(msg.match, training.df$term)]
return (prior*prod(match.probs) * c^(length(msg.freq)-length(msg.match)))
}
}
hardham.docs<-dir(hardham.path)
hardham.docs<-hardham.docs[seq(1:length(hardham.docs))]
hardham.spamtest<-sapply(hardham.docs, function(p) classify.email(paste(hardham.path,p, sep=""),
training.df = easy_ham.df))
hardham.hamtest<-sapply(hardham.docs, function(p) classify.email(paste(hardham.path, p, sep=""), training.df = easy_ham.df))
hardham.res<-ifelse(hardham.spamtest>hardham.hamtest, TRUE, FALSE)
summary(hardham.res)
This code returns FALSE for every message.
Are there any known issues in using the arm package on Mac?
On running package_installer.R, there were errors in compiling BRugs, which isn't a direct dependency of the arm package, but is suggested for R2WinBUGS, which is.
Are there issues in running any of the book's code on Mac, if it's depending on OpenBUGS and WinBUGS, which seem to be Windows-specific libraries? Or are these libraries never used?
I added library(reshape) before running melt.
library(reshape)
from.weight <- melt(with(priority.train, table(From.EMail)),
value.name="Freq")
Then the melt() call worked.
On lines 123 and 128 of the code from Chapter 3 you have a constant, c, being exponentiated.
I can't follow the logic behind this and I see in the user-contributed unconfirmed errata (http://www.oreilly.com/catalog/errataunconfirmed.csp?isbn=0636920018483) there's an entry suggesting that the ^ operator should be replaced by the * operator.
Could you confirm if this is accurate?
Thanks
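For what it's worth, the exponent is consistent with a naive Bayes reading of the code: every term in the message that never occurs in the training data is assigned the small constant probability c, and independent probabilities multiply, so k unseen terms contribute a factor of c^k, not c*k. A self-contained sketch (all numbers invented for illustration; `const` stands in for the book's `c` to avoid shadowing base::c):

```r
# Numbers below are made up purely to illustrate the algebra.
const <- 1e-6                 # the book's c: probability assigned to an unseen term
prior <- 0.5
match.probs <- c(0.30, 0.25)  # hypothetical occurrence rates of matched terms
n.unseen <- 3                 # hypothetical count of terms never seen in training

via.power   <- prior * prod(match.probs) * const ^ n.unseen
via.product <- prior * prod(match.probs) * prod(rep(const, n.unseen))
all.equal(via.power, via.product)  # TRUE: ^ is shorthand for the repeated product
```

So if the `^` were replaced by `*`, each additional unseen term would no longer multiply in another factor of c, which changes the model, not just the arithmetic.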
In chapter 6 I am at the part where I am supposed to be executing
dtm <- DocumentTermMatrix(corpus)
However it fails out with the following error:
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "try-error"
In addition: Warning message:
In mclapply(unname(content(x)), termFreq, control) :
all scheduled cores encountered errors in user code
StackOverflow suggested installing SnowballC and also trying
corpus <- tm_map(corpus, content_transformer(tolower), lazy = TRUE)
Neither of these solutions worked and I am thus flummoxed.
Here is my session info:
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] glmnet_1.9-8 Matrix_1.1-4 mgcv_1.8-4 nlme_3.1-118 plyr_1.8.1 ggplot2_1.0.0 tm_0.6 NLP_0.1-5
loaded via a namespace (and not attached):
[1] colorspace_1.2-4 digest_0.6.8 grid_3.1.2 gtable_0.1.2 labeling_0.3 lattice_0.20-29 MASS_7.3-35 munsell_0.4.2 parallel_3.1.2 proto_0.3-10
[11] Rcpp_0.11.3 reshape2_1.4.1 scales_0.2.4 slam_0.1-32 stringr_0.6.2 tools_3.1.2
For every ggplot() call, the "legend = FALSE" setting/parameter needs to be changed to guide = "none".
Also, when plotting figure 9-4 in chapter 9, the latest ggplot2 requires that library(scales) be loaded, and scale_size() needs to be changed from:
scale_size(to=c(2,2))
to
scale_size(range=c(2,2))
The code was run from RStudio with R 3.3.0 under OS X 10.11.
issue 1:
line 48 in email_classify.R:
geom_hline(aes(yintercept = c(10,30)), linetype = 2)
yintercept needs to be placed outside the aes() function, like this:
geom_hline(yintercept = c(10,30), linetype = 2)
issue 2:
An error occurs when reading messages with sapply at lines 139-140:
all.spam <- sapply(spam.docs,
function(p) get.msg(file.path(spam.path, p)))
here is the traceback
Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
'from' cannot be NA, NaN or infinite
7 stop("'from' cannot be NA, NaN or infinite")
6 seq.default(which(text == "")[1] + 1, length(text), 1)
5 seq(which(text == "")[1] + 1, length(text), 1)
4 get.msg(file.path(spam.path, p))
3 FUN(X[[i]], ...)
2 lapply(X = X, FUN = FUN, ...)
1 sapply(spam.docs, function(p) get.msg(file.path(spam.path, p)))
It seems some file does not have a blank line separating the header from the body.
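One way to guard against that, sketched as a standalone helper (the function name and fallback behaviour are my own suggestion, not the book's): if no blank line is found, keep the whole text instead of letting seq() fail on an NA `from`.

```r
# Extract the message body from a vector of lines; the body normally
# starts after the first blank line.
extract.body <- function(text)
{
  first.blank <- which(text == "")[1]
  if (is.na(first.blank) || first.blank >= length(text))
  {
    # No header/body split found: fall back to the full text.
    return(paste(text, collapse = "\n"))
  }
  paste(text[seq(first.blank + 1, length(text), 1)], collapse = "\n")
}

extract.body(c("From: a@b", "", "hello", "world"))  # "hello\nworld"
extract.body(c("no", "blank", "line"))              # falls back to the full text
```

get.msg() could call this on the result of readLines() instead of indexing with seq() directly.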
spam.path <- file.path("C:\03-Classification\data", "spam")
spam2.path <- file.path("C:\03-Classification\data", "spam_2")
easyham.path <- file.path("C:\03-Classification\data", "easy_ham")
easyham2.path <- file.path("C:\03-Classification\data", "easy_ham_2")
hardham.path <- file.path("C:\03-Classification\data", "hard_ham")
hardham2.path <- file.path("C:\03-Classification\data", "hard_ham_2")
x <- runif(1000, 0, 40)
y1 <- cbind(runif(100, 0, 10), 1)
y2 <- cbind(runif(800, 10, 30), 2)
y3 <- cbind(runif(100, 30, 40), 1)
val <- data.frame(cbind(x, rbind(y1, y2, y3)),
stringsAsFactors = TRUE)
ex1 <- ggplot(val, aes(x, V2)) +
    geom_jitter(position = position_jitter(height = 2))
ggsave(plot = ex1,
filename = file.path("C:\\03-Classification\\images", "00_Ex1.pdf"),
height = 10,
width = 10)
Error in grDevices::pdf(..., version = version) :
cannot open file 'C:\03-Classification\images/00_Ex1.pdf'
ggsave(plot = ex1,
filename = file.path("C:\\03-Classification\\images\\00_Ex1.pdf"),
height = 10,
width = 10)
Error: Aesthetics must be either length 1 or the same as the data (1000): yintercept
getwd()
[1] "C:/Users/mm/Documents"
When I repeat the code, I get the following result:
...
hardham.res <- ifelse(hardham.spamtest > hardham.hamtest,
TRUE,
FALSE)
summary(hardham.res)
...
the result is :
Mode FALSE TRUE NA's
logical 243 6 0
I also try:
hardham.res <- ifelse(hardham.spamtest == hardham.hamtest,
TRUE,
FALSE)
the result is:
Mode FALSE TRUE NA's
logical 21 228 0
That means most of the results are equal, so I suspect floating-point underflow. I then changed the classify.email function as below:
classify.email <- function(path, training.df, prior = 0.5, c = 1e-6)
{
msg <- get.msg(path)
msg.tdm <- get.tdm(msg)
msg.freq <- rowSums(as.matrix(msg.tdm))
msg.match <- intersect(names(msg.freq), training.df$term)
if(length(msg.match) < 1)
{
return(log10(prior) + length(msg.freq) * log10(c)) # return(prior * c ^ (length(msg.freq)))
}
else
{
match.probs <- training.df$occurrence[match(msg.match, training.df$term)]
return(log10(prior) + sum(log10(match.probs)) + (length(msg.freq) - length(msg.match)) * log10(c)) # return(prior * prod(match.probs) * c ^ (length(msg.freq) - length(msg.match)))
}
}
this time I get the result:
hardham.res <- ifelse(hardham.spamtest > hardham.hamtest,
                      TRUE,
                      FALSE)
summary(hardham.res)
Mode FALSE TRUE NA's
logical 80 169 0
My god, the conclusion is still wrong.
Has anyone encountered the same problem?
Where have I made a mistake?
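The underflow suspicion itself is easy to confirm in isolation, which supports moving the comparison into log space (the counts below are invented just to show the scale of the problem):

```r
# 200 unseen-term factors of 1e-6 underflow to exactly zero in raw
# probability space, but the log10 sum stays finite and comparable.
raw    <- 0.5 * prod(rep(1e-6, 200))
logged <- log10(0.5) + 200 * log10(1e-6)

raw     # 0
logged  # about -1200.3
```

Once both scores underflow to 0, `spamtest > hamtest` is FALSE and `spamtest == hamtest` is TRUE for every message, which matches the summaries above; so the log-space rewrite is the right direction, and the remaining discrepancy must come from something else in the pipeline.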
The SVM examples require you to load library('e1071'), but this library is not in the list of packages in chapter 1, table 1-2. I only noticed because I'd installed the packages by hand from that table rather than using the provided install script.
Also, chapter 12 uses the melt() function, but library('reshape') isn't mentioned in the chapter text, nor is it loaded at the head of the chapter12.R script.
I have a problem with the following code:
get.msg <- function(path)
{
con <- file(path, open = "rt", encoding = "latin1")
text <- readLines(con)
msg <- text[seq(which(text == "")[1] + 1, length(text), 1)]
close(con)
return(paste(msg, collapse = "\n"))
}
What can I do? Please, somebody help me!
I'm new to R, so I do not have a great ability to debug issues yet. After setting up the R environment on Xubuntu and OSX, I keep running into the same issues when running fast_check.R as well as the script for the first chapter.
Checking Chapter 1 - Introduction
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Error in strsplit(unitspec, " ") : non-character argument
Calls: source ... fullseq.Date -> seq -> floor_date -> parse_unit_spec -> strsplit
In addition: Warning message:
Removed 1 rows containing non-finite values (stat_bin).
Execution halted
Here's my R version:
R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Is there a preferred environment/version for running the sample code, or am I really missing something?
In Chapter 3 we construct a spam filter based on the data in the folder:
ML_for_Hackers/03-Classification/data/spam
In the book, the terms in these emails are ordered by occurrence with the command below. The book lists the following table with html at the top:
head(spam.df[with(spam.df, order(-occurrence)),])
| | term | frequency | density | occurrence |
|---|---|---|---|---|
| 2122 | html | 377 | 0.005665595 | 0.338 |
| 538 | body | 324 | 0.004869105 | 0.298 |
| 4313 | table | 1182 | 0.017763217 | 0.284 |
| 1435 | | 661 | 0.009933576 | 0.262 |
| 1736 | font | 867 | 0.013029365 | 0.262 |
| 1942 | head | 254 | 0.003817138 | 0.246 |
When running the code directly, this does not match the output I get, which has email at the top:

| | term | frequency | density | occurrence |
|---|---|---|---|---|
| 7781 | email | 813 | 0.005853680 | 0.566 |
| 18809 | please | 425 | 0.003060042 | 0.508 |
| 14720 | list | 409 | 0.002944840 | 0.444 |
| 27309 | will | 828 | 0.005961681 | 0.422 |
| 3060 | body | 379 | 0.002728837 | 0.408 |
| 9457 | free | 539 | 0.003880853 | 0.390 |
This seems to be explained by the way the document vectors are processed with the removePunctuation setting. The punctuation is removed, and any terms that were separated only by punctuation are fused into a new term: for example, `<html><head>` becomes htmlhead. The result is that instead of html being listed as a common term across many of the emails, we get lots of low-frequency fusions of html with other HTML tag keywords.
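A quick base-R illustration of that effect (gsub with the `[[:punct:]]` class mirrors what tm's removePunctuation does to angle brackets):

```r
# Stripping punctuation from raw HTML fuses adjacent tag names
# into new pseudo-terms instead of separating them.
strip.punct <- function(x) gsub("[[:punct:]]", "", x)

strip.punct("<html><head>")   # "htmlhead"
strip.punct("</head><body>")  # "headbody"
```

Removing the HTML markup before building the term-document matrix would avoid these fused terms.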
Hi! Looking at @drewconway's email_classify.R, I follow the reference to http://spamassassin.apache.org/publiccorpus/ to find the data, but there doesn't seem to be a hard_ham_2 anywhere. Is there hard_ham_2?
Hello guys,
Great book :-)
Right now, I am in the 3rd chapter (e-mail classification).
I am executing the R commands one by one and I am having a problem getting the list of spam documents (page 81).
The command is : all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path,p,sep="")))
and the error I get is:
Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
invalid (to - from)/by in seq(.)
Any clue?
Thank you very much
The following command in email_classify.R fails
ggsave(plot = class.plot,
filename = file.path("images", "03_final_classification.pdf"),
height = 10,
width = 10)
Throws this error:
Error in seq.default(min, max, by = by) :
invalid (to - from)/by in seq(.)
This is using R version 2.15.0 and OSX 10.7.3
This is about the first chapter, where a string length of 8 is used to deal with malformed date data. After filtering by string length, I found "19940000" in DateOccurred, which is converted to NA by ufo$DateOccurred <- as.Date(ufo$DateOccurred, format = "%Y%m%d"). Isn't that also malformed data? I also found that the way R reads the input has an error, e.g. line 756:
19950704 19950706 Orlando, FL 4-5 min I would like toreport three yellow oval lights which passed over Orlando,Florida on July 4, 1995 at aproximately 21:30 (9:30 pm). These were the sizeof Venus (which they passed close by). Two of them traveled one after the otherat exactly the same speed and path heading south-southeast. The third oneappeared about a minute later following the same path as the other two. Thewhole sighting lasted about 4-5 minute. There were 4 other witnesses oldenough to report the sighting. My 4 year old and 5 year old children were theones who called my attention to the "moving stars". These objects moved fasterthan an airplane and did not resemble anaircraft, and were moving much slowerthan a shooting star. As for them being fireworks, their path was too regularand coordinated. If anybody else saw this phenomenon, please contact me at: [email protected]
After reading in by the function in the book:
> ufo <- read.delim(file.path("data", "ufo", "ufo_awesome.tsv"),
+ sep = "\t",
+ stringsAsFactors = FALSE,
+ header = FALSE,
+ na.strings = "")
it's separated into two lines:
> ufo[756,]
V1 V2 V3 V4 V5 V6
756 [email protected] <NA> <NA> <NA> <NA> <NA>
> ufo[755,]
V1 V2 V3 V4 V5
755 19950704 19950706 Orlando, FL <NA> 4-5 min
V6
755 I would like to report three yellow oval lights which passed over Orlando,Florida on July 4, 1995 at aproximately 21:30 (9:30 pm). These were the sizeof Venus (which they passed close by). Two of them traveled one after the otherat exactly the same speed and path heading south- southeast. The third oneappeared about a minute later following the same path as the other two. Thewhole sighting lasted about 4-5 minutes. There were 4 other witnesses oldenough to report the sighting. My 4 year old and 5 year old children were theones who called my attention to the "moving stars". These objects moved fasterthan an airplane and did not resemble an aircraft, and were moving much slowerthan a shooting star. As for them being fireworks, their path was too regularand coordinated. If anybody else saw this phenomenon, please contact me at:
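Both problems raised here can be caught with one validity filter, sketched below on invented stand-in rows (assuming, as in chapter 1, that a well-formed record starts with an 8-digit YYYYMMDD sighting date):

```r
# Invented stand-ins for the real TSV: a good row, a continuation row
# produced by the bad split, and the "19940000" case.
ufo.sample <- data.frame(V1 = c("19950704", "please contact me at:", "19940000"),
                         stringsAsFactors = FALSE)

# Drop continuation rows: V1 is not an 8-digit date string.
good.rows <- grepl("^[0-9]{8}$", ufo.sample$V1)
ufo.clean <- ufo.sample[good.rows, , drop = FALSE]

# "19940000" survives the length test but is still malformed:
# as.Date() turns it into NA, so a second filter is needed.
parsed <- as.Date(ufo.clean$V1, format = "%Y%m%d")
ufo.clean <- ufo.clean[!is.na(parsed), , drop = FALSE]

ufo.clean$V1  # "19950704"
```

So the book's length-8 filter does leave "19940000"-style dates behind; the NA check after as.Date() is what finally removes them.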
Hello,
Now that ggplot for Python has been around for a while (a few months, anyway), I am converting the R examples into Python for fun (via IPython Notebooks) using the expected libraries: numpy, scipy, pandas, ggplot, statsmodels, etc. (maybe a few others).
My questions are the following:
Depending on the answer to (2), I will try to document my code accordingly. I'm currently done with Chapters 1+2 and 1/2 of 3. I suspect the rest of the code might take me another two weeks if I am doing it by myself.
Thanks,
Joe Misiti
@josephmisiti
In the following snippet, strsplit does not raise an error when there is no comma in ufo$Location.
As a result, the tryCatch around strsplit cannot separate a bare "City" from a proper "City, State".
split.location <- tryCatch(strsplit(l, ",")[[1]],
error = function(e) return(c(NA, NA)))
I suggest revising it to:
get.location<-function(l)
{
split.location<-strsplit(l,",")[[1]]
clean.location <- gsub("^ ","",split.location)
if(length(clean.location)!=2)
{
return(c(NA,NA))
}
else
{
return(clean.location)
}
}
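To see concretely why the tryCatch branch never fires: strsplit on a comma-free string just returns the whole string, so it is the length test in the suggested function, not the error handler, that catches the malformed case. A quick check:

```r
# strsplit with no match returns the input unchanged -- no error is raised,
# so error = function(e) ... in the tryCatch never runs.
no.comma <- strsplit("Iowa City", ",")[[1]]

no.comma          # "Iowa City"
length(no.comma)  # 1
```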
It is really weak that the repository does not have the source code that is the one talked about in the book. Instead of actually learning when working through examples, I have to sit down and search your repository for definitions that have changed (e.g. 'abb.state' in Chapter 1). Just mindnumbingly weak.
While trying to install the necessary packages using source('package_installer.R'), I run into these errors:
Error in library.dynam(lib, package, package.lib) :
shared object ‘digest.so’ not found
ERROR: lazy loading failed for package ‘memoise’
* removing ‘/usr/local/Cellar/r/2.14.1/R.framework/Versions/2.14/Resources/library/memoise’
ERROR: dependency ‘memoise’ is not available for package ‘ggplot2’
* removing ‘/usr/local/Cellar/r/2.14.1/R.framework/Versions/2.14/Resources/library/ggplot2’
Is there an issue in my R install, that may be stopping the digest package dependency from installing correctly?