Random Forests in Clojure

A simple implementation of Random Forests for classification and regression in Clojure.

Features:

  • Supports categorical, continuous and text features (as bag of words)
  • Supports classification
  • Supports regression
  • Estimates out-of-sample error during training

Limitations:

  • All training examples must fit into memory
  • Forests are not serialized

A description of random forests can be found at: http://www.stat.berkeley.edu/~breiman/RandomForests/.

Decision trees are constructed recursively as anonymous functions, choosing at each splitting node the split that minimizes the Gini impurity. A textual representation of each generated tree is stored as metadata on the function.
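
As an illustration of the split criterion, here is a minimal sketch of Gini impurity and of scoring a candidate split. The names gini-impurity and split-impurity are hypothetical and are not taken from this library's source; the sketch assumes each example is a vector with the target in the last position.

```clojure
;; Hypothetical helpers for illustration; not the library's own code.

(defn gini-impurity
  "Gini impurity of a collection of class labels:
  1 minus the sum of squared class proportions."
  [labels]
  (let [n (count labels)]
    (if (zero? n)
      0.0
      (- 1.0 (reduce + (map (fn [c] (let [p (/ c (double n))] (* p p)))
                            (vals (frequencies labels))))))))

(defn split-impurity
  "Weighted Gini impurity of the two partitions of examples produced by pred.
  The target is assumed to be the last element of each example."
  [pred examples]
  (let [n (count examples)
        {left true right false} (group-by (comp boolean pred) examples)
        weighted (fn [part]
                   (* (/ (count part) (double n))
                      (gini-impurity (map last part))))]
    (+ (weighted left) (weighted right))))

(def sample-examples
  [["M" "<25" 1] ["M" "<40" 0] ["F" "<35" 1] ["F" "<30" 1]])

(gini-impurity (map last sample-examples))             ;; => 0.375
(split-impurity #(= "<40" (nth % 1)) sample-examples)  ;; => 0.0
(split-impurity #(= "M" (nth % 0)) sample-examples)    ;; => 0.25
```

A lower value means purer partitions, so among these two candidates the split on age would be preferred.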

Usage

To use the library, add the following dependency to your project.clj:

    [random-forests-clj "0.2.0"]

Example

Features are represented by their index in the training example. A forest can be built with the build-random-forest function by providing the training examples and the features to use, each identified by its index.

    (use 'random-forests.core)

    ;; target is in the last position
    (def examples (list ["M" "<25" 1] ["M" "<40" 0] ["F" "<35" 1] ["F" "<30" 1]))

    ;; features can be :continuous, :categorical or :text
    (def features (set (list (feature "gender" 0 :categorical) (feature "age" 1 :categorical))))

    ;; return a lazy sequence of decision trees with:
    ;; - 2 random feature values to determine split per splitting node
    ;; - a bootstrap resample of 3 examples per tree
    (def t (first (build-random-forest examples features 2 3)))

    ;; meta data on the Clojure function t contains a text representation of the tree
    (:tree (meta t)) ;; => "if(age==<40){0}else{1}"

    ;; also predictions on held out data are included
    (:eval (meta t)) ;; => {["M" "<25" 1] [1]}
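
The bootstrap resample and the held-out examples mentioned in the comments above can be pictured with the following sketch. The function bootstrap is a hypothetical name used only for illustration, and the library's own sampling may differ; the sketch assumes examples are non-nil values such as vectors.

```clojure
;; Hypothetical helper for illustration; not the library's own code.

(defn bootstrap
  "Draws size examples with replacement. Returns the sample together with
  the examples that were never drawn (the held-out, or out-of-bag, examples)."
  [examples size]
  (let [pool   (vec examples)
        sample (vec (repeatedly size #(rand-nth pool)))
        drawn  (set sample)]
    {:sample   sample
     :held-out (vec (remove drawn pool))}))

(def pool-examples
  [["M" "<25" 1] ["M" "<40" 0] ["F" "<35" 1] ["F" "<30" 1]])

;; The result is random: three examples drawn with replacement,
;; plus whichever examples were not drawn.
(bootstrap pool-examples 3)
```

Because at most three of the four examples can be drawn here, at least one example is always held out and can be used to estimate out-of-sample error for that tree.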

Each tree is a function, and new examples can be classified by calling the function:

    (t ["M" "<20"]) ;; => 1
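
Since each tree is an ordinary function, the predictions of several trees can be combined with plain sequence functions. The sketch below uses hypothetical helpers (classify, regress) and hand-written stub trees in place of a real forest; none of these names are part of the library. It takes the first n trees because build-random-forest returns a lazy sequence.

```clojure
;; Hypothetical helpers for illustration; not the library's own code.

(defn classify
  "Majority vote over the predictions of the first n trees in forest."
  [forest n example]
  (->> (take n forest)
       (map #(% example))
       frequencies
       (apply max-key val)
       key))

(defn regress
  "Mean prediction of the first n trees in forest."
  [forest n example]
  (let [preds (map #(% example) (take n forest))]
    (/ (reduce + preds) (double (count preds)))))

;; Stub trees standing in for the functions returned by build-random-forest.
(def stub-forest
  [(fn [[_ age]] (if (= age "<40") 0 1))
   (fn [[gender _]] (if (= gender "M") 0 1))
   (fn [_] 1)])

;; The three stub trees vote 1, 0 and 1.
(classify stub-forest 3 ["M" "<20"]) ;; => 1
```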

Feature Encoding

Categorical features can be left as strings, continuous features should be stored as doubles, and text features should be stored as sets of tokens:

    (use '[random-forests.encoding :only (text-tokens)])

    ;; text features should be encoded as sets of tokens, using text-tokens for stemming
    (set (text-tokens "this is a text training example")) ;; => #{"train" "text" "exampl"}

    ;; training examples with categorical, continuous and text features
    ;; (target is the last element)
    (def examples (list ["a" 4.5 #{"foo" "bar"} 1] ["b" 4.6 #{"bar" "baz"} 0]))
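
When the raw data arrives as strings, for example from a CSV row, the encoding rules above can be applied per column. This is a sketch only: encode-value and encode-row are hypothetical names, and the :text branch uses a plain lowercase word split rather than the stemming done by text-tokens.

```clojure
;; Hypothetical helpers for illustration; not the library's own code.
(require '[clojure.string :as str])

(defn encode-value
  "Encodes one raw string according to its feature type."
  [type raw]
  (case type
    :categorical raw
    :continuous  (Double/parseDouble raw)
    ;; simple word split, no stemming (unlike text-tokens)
    :text        (set (remove str/blank?
                              (str/split (str/lower-case raw) #"\W+")))))

(defn encode-row
  "Encodes the feature columns of a row and appends the target unchanged."
  [types row target]
  (conj (mapv encode-value types row) target))

(encode-row [:categorical :continuous :text] ["a" "4.5" "Foo bar"] 1)
;; => ["a" 4.5 #{"foo" "bar"} 1]
```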

Command Line Usage

Models can be built from the command line using lein run:

Usage:

    Switches                   Default  Desc
    --------                   -------  ----
    -h, --no-help, --help      false    Show help
    -f, --features             []       Features specification (matching CSV header): name=continuous,foo=text
    -s, --size                 1000     Size of bootstrap sample per tree
    -m, --split                100      Number of features to sample for each split
    -o, --output                        Write detailed training error output in CSV format to output file
    -t, --target                        Prediction target name
    -b, --no-binary, --binary  false    Perform binary classification of target (measures AUC loss)
    -l, --limit                100      Number of trees to build

To build a binary classifier on the provided test data set using a forest of 500 trees:

    lein run -f V1=categorical,V2=categorical,V3=categorical,V4=categorical,V5=categorical,V6=categorical,V7=categorical,V8=categorical,V9=categorical \
             -l 500 \
             -t target=continuous \
             -b \
             test/data/cancer.csv

This prints the out-of-sample AUC loss for the entire forest as each tree is added to the forest:

    1: 0.875000
    2: 0.843000
    3: 0.824000
    4: 0.798000
    5: 0.843000
    6: 0.855000
    7: 0.855000
    8: 0.878000
    9: 0.864000
    10: 0.883000
    11: 0.879000
    12: 0.892000
    13: 0.906000
    14: 0.906000
    15: 0.935000
    ...
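
For readers unfamiliar with AUC, the sketch below computes the area under the ROC curve from scored binary examples as the fraction of positive/negative pairs ranked correctly, counting ties as half. The function auc is a hypothetical name, and how the tool derives scores from the forest or defines "AUC loss" is not specified here; a score such as the fraction of trees voting for the positive class is one possible assumption.

```clojure
;; Hypothetical helper for illustration; not the library's own code.

(defn auc
  "Area under the ROC curve for a collection of [score label] pairs,
  where label 1 marks a positive and label 0 a negative example.
  Assumes at least one positive and one negative example."
  [scored]
  (let [pos  (map first (filter #(= 1 (second %)) scored))
        neg  (map first (filter #(= 0 (second %)) scored))
        wins (for [p pos, n neg]
               (cond (> p n)  1.0
                     (== p n) 0.5
                     :else    0.0))]
    (/ (reduce + wins) (* (count pos) (count neg)))))

;; Five of the six positive/negative pairs are ranked correctly.
(auc [[0.9 1] [0.8 1] [0.4 0] [0.6 1] [0.7 0]]) ;; => 5/6, about 0.833
```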

License

Copyright (C) 2010-2012 Erik Andrejko

Distributed under the Eclipse Public License, the same as Clojure.

