java-data-frame's Introduction

java-data-frame

This package provides the core data frame implementation for numerical computation.

Features

  • Load data frame from CSV file
  • Load libsvm format files
  • Create data frame using data sampling

More supported formats will be added in the future.

Install

Add the following dependency to your POM file:

<dependency>
  <groupId>com.github.chen0040</groupId>
  <artifactId>java-data-frame</artifactId>
  <version>1.0.11</version>
</dependency>

Usage

Create a data frame manually

The sample code below shows how to create a data frame manually:

DataFrame dataFrame = new BasicDataFrame();

DataRow row = dataFrame.newRow();
row.setCell("inputColumn1", 0.1);
row.setCategoricalCell("inputColumn2", "Hello");
row.setTargetCell("numericOutput", 0.2);
row.setCategoricalTargetCell("categoricalOutput", "YES");

dataFrame.addRow(row);

// add more rows here

// call lock to perform aggregation and prevent further addition of new rows
dataFrame.lock();

Note that you need to call "dataFrame.lock()" after you finish adding rows so that aggregation can be performed. After this API call, the data frame prevents further addition of new rows. To start adding rows again, call "dataFrame.unlock()" first.
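The lock/unlock life cycle described above can be illustrated with a small stand-alone sketch. The LockableColumn class below is hypothetical and is not part of java-data-frame; it assumes that "aggregation" means computing summary statistics (here, just a mean) at lock time and that adding to a locked container raises an exception. The library's actual behaviour may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a lockable container; NOT the library's implementation.
class LockableColumn {
    private final List<Double> values = new ArrayList<>();
    private boolean locked = false;
    private double mean = Double.NaN; // aggregate, refreshed on every lock()

    // Appends a value; assumed to be rejected while the column is locked.
    void add(double value) {
        if (locked) {
            throw new IllegalStateException("column is locked; call unlock() first");
        }
        values.add(value);
    }

    // Computes the aggregate and blocks further additions.
    void lock() {
        mean = values.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
        locked = true;
    }

    // Allows additions again; the aggregate is stale until the next lock().
    void unlock() {
        locked = false;
    }

    double mean() {
        return mean;
    }
}
```

The point of the pattern is that aggregates are computed once, when the data is known to be complete, rather than on every insertion.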

Create a data frame using Sampler

The sample code below shows how to create a data frame using the Sampler class:

DataQuery.DataFrameQueryBuilder schema = DataQuery.blank()
      .newInput("x1")
      .newInput("x2")
      .newOutput("y")
      .end();

// y = 4 + 0.5 * x1 + 0.2 * x2
Sampler.DataSampleBuilder sampler = new Sampler()
      .forColumn("x1").generate((name, index) -> randn() * 0.3 + index)
      .forColumn("x2").generate((name, index) -> randn() * 0.3 + index * index)
      .forColumn("y").generate((name, index) -> 4 + 0.5 * index + 0.2 * index * index + randn() * 0.3)
      .end();

DataFrame dataFrame = schema.build();

dataFrame = sampler.sample(dataFrame, 200);

The sample code above creates a data frame consisting of 200 rows and 3 columns ("x1", "x2", "y").
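To see what kind of data these generators produce without the library, here is a minimal plain-Java sketch of the same formulas. It assumes that randn() draws from a standard normal distribution and that the row index starts at 0; neither is confirmed by the text above. The SyntheticSample class and its seed parameter are made up for illustration.

```java
import java.util.Random;

// Hypothetical helper that mirrors the three column generators above.
class SyntheticSample {
    // Returns rowCount rows of {x1, x2, y}.
    static double[][] generate(int rowCount, long seed) {
        Random random = new Random(seed);
        double[][] rows = new double[rowCount][];
        for (int index = 0; index < rowCount; index++) {
            // nextGaussian() stands in for randn() (assumed standard normal)
            double x1 = random.nextGaussian() * 0.3 + index;
            double x2 = random.nextGaussian() * 0.3 + index * index;
            double y = 4 + 0.5 * index + 0.2 * index * index + random.nextGaussian() * 0.3;
            rows[index] = new double[] {x1, x2, y};
        }
        return rows;
    }

    public static void main(String[] args) {
        double[][] rows = generate(200, 42L);
        System.out.println(rows.length + " rows, " + rows[0].length + " columns");
    }
}
```

Because x1 is close to index and x2 is close to index * index, y follows 4 + 0.5 * x1 + 0.2 * x2 up to noise.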

Print contents in a data frame

The sample code below shows how to print the content in the data frame:

// print the first 2 rows
System.out.println(dataFrame.head(2));

// iterate over all rows, either as a stream or with a for-each loop
dataFrame.stream().forEach(r -> System.out.println("row: " + r));
for (DataRow r : dataFrame) {
    System.out.println("row: " + r);
}

Filtering

The sample code below creates a new data frame from the old data frame using a filter predicate:

DataFrame filtered = oldDataFrame.filter(row -> { ... });

Clone

The sample code below creates a copy of the old data frame:

DataFrame clone = oldDataFrame.makeCopy();

Sample and split

To shuffle the contents of a data frame:

dataFrame.shuffle();

To split a data frame into two data frames:

TupleTwo<DataFrame, DataFrame> miniFrames = dataFrame.split(0.9);
DataFrame frame1 = miniFrames._1();
DataFrame frame2 = miniFrames._2();

The frame1 contains 90% of the rows in the original data frame, while frame2 contains the other 10% of the rows in the original data frame.
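The 90/10 split can be pictured with plain Java lists. The sketch below is not the library's implementation: SplitSketch is a hypothetical class, and rounding the cut point with Math.round is an assumption about how fractional row counts might be handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a ratio split over an ordered list of rows.
class SplitSketch {
    // The first part receives about `ratio` of the rows, the second part the rest.
    static <T> List<List<T>> split(List<T> rows, double ratio) {
        int cut = (int) Math.round(rows.size() * ratio);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(rows.subList(0, cut)));
        parts.add(new ArrayList<>(rows.subList(cut, rows.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 200; i++) {
            rows.add(i);
        }
        // Shuffling first avoids a split that follows the original row order.
        Collections.shuffle(rows, new Random(42L));
        List<List<Integer>> parts = split(rows, 0.9);
        System.out.println(parts.get(0).size() + " / " + parts.get(1).size());
    }
}
```

Shuffling before splitting is the usual way to get a training part and a test part that are both representative of the whole data set.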

Convert numerical columns to categorical columns

For algorithms that need to treat numerical columns as categorical columns, the library provides the KMeansDiscretizer to do this conversion.

The following lines transform a data frame that has numerical columns into a data frame in which those columns have been converted to categorical columns:

KMeansDiscretizer discretizer = new KMeansDiscretizer();
discretizer.setMaxLevelCount(12); // set number of discrete values for each numerical column
// discretizer.setMaxIters(500); // specifies the number of iterations to run k-means

DataFrame newFrame = discretizer.fitAndTransform(dataFrame);

The sample code below is a complete example of this operation:

InputStream inputStream = FileUtils.getResource("carmileage.dat");

DataQuery.DataFrameQueryBuilder schema = DataQuery.csv().from(inputStream)
      .skipRows(29)
      .selectColumn(0).asCategory().asInput("MAKE/MODEL")
      .selectColumn(1).asNumeric().asInput("VOL")
      .selectColumn(2).asNumeric().asInput("HP")
      .selectColumn(3).asNumeric().asOutput("MPG")
      .selectColumn(4).asNumeric().asInput("SP")
      .selectColumn(5).asNumeric().asInput("WT");

DataFrame dataFrame = schema.build();
System.out.println(dataFrame.head(10));
System.out.println("categorical column count: " + dataFrame.getAllColumns().stream().filter(DataColumn::isCategorical).count());
System.out.println("numerical column count: " + dataFrame.getAllColumns().stream().filter(DataColumn::isNumerical).count());

KMeansDiscretizer discretizer = new KMeansDiscretizer();
discretizer.setMaxLevelCount(12); // set number of discrete values for each numerical column

DataFrame newFrame = discretizer.fitAndTransform(dataFrame);

System.out.println(newFrame.head(10));
System.out.println("categorical column count: " + newFrame.getAllColumns().stream().filter(DataColumn::isCategorical).count());
System.out.println("numerical column count: " + newFrame.getAllColumns().stream().filter(DataColumn::isNumerical).count());
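For intuition about what a k-means based discretizer does to a single numeric column, here is a minimal one-dimensional sketch. It is not the KMeansDiscretizer source: the KMeans1D class, its evenly spaced initialisation, and its handling of empty clusters are all assumptions made for illustration. It expects a non-empty input array.

```java
import java.util.Arrays;

// Hypothetical 1-D k-means (Lloyd's algorithm) used to map numbers to discrete levels.
class KMeans1D {
    // Fits `levels` cluster centres to one numeric column.
    static double[] fit(double[] values, int levels, int maxIters) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        // Initialise the centres at evenly spaced positions of the sorted values.
        double[] centres = new double[levels];
        for (int k = 0; k < levels; k++) {
            centres[k] = sorted[(int) ((long) k * (sorted.length - 1) / Math.max(1, levels - 1))];
        }
        for (int iter = 0; iter < maxIters; iter++) {
            double[] sums = new double[levels];
            int[] counts = new int[levels];
            for (double v : values) {
                int k = nearest(centres, v);
                sums[k] += v;
                counts[k]++;
            }
            boolean changed = false;
            for (int k = 0; k < levels; k++) {
                if (counts[k] == 0) {
                    continue; // an empty cluster keeps its previous centre
                }
                double updated = sums[k] / counts[k];
                if (updated != centres[k]) {
                    changed = true;
                }
                centres[k] = updated;
            }
            if (!changed) {
                break; // assignments are stable, so the centres have converged
            }
        }
        return centres;
    }

    // Index of the centre closest to v; this index serves as the discrete level.
    static int nearest(double[] centres, double v) {
        int best = 0;
        for (int k = 1; k < centres.length; k++) {
            if (Math.abs(v - centres[k]) < Math.abs(v - centres[best])) {
                best = k;
            }
        }
        return best;
    }
}
```

After fitting, each numeric value is replaced by the index of its nearest centre, so a column with many distinct values collapses to at most `levels` categories.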

Load from CSV file

Suppose you have a csv file named contraception.csv that has the following file format:

"","woman","district","use","livch","age","urban"
"1","1","1","N","3+",18.44,"Y"
"2","2","1","N","0",-5.5599,"Y"
"3","3","1","N","2",1.44,"Y"
"4","4","1","N","3+",8.44,"Y"
"5","5","1","N","0",-13.559,"Y"
"6","6","1","N","0",-11.56,"Y"

An example of java code to create a data frame from the above CSV file:

import com.github.chen0040.data.frame.DataFrame;
import com.github.chen0040.data.frame.DataQuery;
import com.github.chen0040.data.utils.StringUtils;
import java.io.FileInputStream;
import java.io.InputStream;

int column_use = 3;
int column_livch = 4;
int column_age = 5;
int column_urban = 6;
boolean skipFirstLine = true;
String columnSplitter = ",";
InputStream inputStream = new FileInputStream("contraception.csv");
DataFrame frame = DataQuery.csv(columnSplitter, skipFirstLine)
        .from(inputStream)
        .selectColumn(column_livch).asCategory().asInput("livch")
        .selectColumn(column_age).asNumeric().asInput("age")
        .selectColumn(column_age).transform(cell -> Math.pow(StringUtils.parseDouble(cell), 2)).asInput("age^2")
        .selectColumn(column_urban).asCategory().asInput("urban")
        .selectColumn(column_use).transform(cell -> cell.equals("Y") ? 1.0 : 0.0).asOutput("use")
        .build();

The code above creates a data frame with the following columns:

  • livch1 (input): value = 1 if the "livch" column of the CSV contains value 1; 0 otherwise
  • livch2 (input): value = 1 if the "livch" column of the CSV contains value 2; 0 otherwise
  • livch3 (input): value = 1 if the "livch" column of the CSV contains value 3+; 0 otherwise
  • age (input): value = numeric value in the "age" column of the CSV
  • age^2 (input): value = square of numeric value in the "age" column of the CSV
  • urban (input): value = 1 if the "urban" column of the CSV has value "Y"; 0 otherwise
  • use (output): value = 1 if the "use" column of the CSV has value "Y"; 0 otherwise
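The expansion of the categorical "livch" column into the 0/1 columns listed above can be sketched in plain Java. The IndicatorEncoding class is hypothetical and only mirrors the column list; how the library names and orders these columns internally is not stated here, so the level-to-name mapping is passed in explicitly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of indicator (one-hot style) encoding.
class IndicatorEncoding {
    // For each (level -> column name) entry, emits 1.0 if value equals the level, else 0.0.
    static Map<String, Double> encode(Map<String, String> columnNameByLevel, String value) {
        Map<String, Double> indicators = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : columnNameByLevel.entrySet()) {
            indicators.put(entry.getValue(), entry.getKey().equals(value) ? 1.0 : 0.0);
        }
        return indicators;
    }

    public static void main(String[] args) {
        Map<String, String> livch = new LinkedHashMap<>();
        livch.put("1", "livch1");
        livch.put("2", "livch2");
        livch.put("3+", "livch3");
        System.out.println(encode(livch, "3+"));
        System.out.println(encode(livch, "0"));
    }
}
```

With this mapping, the CSV value "0" has no column of its own: it is represented by all three indicators being 0, which matches the three livch columns in the list above.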

In the above case, the output of the data frame is numerical. The code sample below shows how to load a data frame whose output is categorical:

InputStream irisStream = new FileInputStream("iris.data");
DataFrame irisData = DataQuery.csv(",", false)
      .from(irisStream)
      .selectColumn(0).asNumeric().asInput("Sepal Length")
      .selectColumn(1).asNumeric().asInput("Sepal Width")
      .selectColumn(2).asNumeric().asInput("Petal Length")
      .selectColumn(3).asNumeric().asInput("Petal Width")
      .selectColumn(4).asCategory().asOutput("Iris Type")
      .build();

Load libsvm formatted file

The sample code below shows how a data frame can be created from "heart_scale.txt", which is in libsvm format:

DataFrame frame = DataQuery.libsvm().from(new FileInputStream("heart_scale.txt")).build();
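For readers unfamiliar with the libsvm format, each line holds a label followed by sparse "index:value" pairs, with feature indices starting at 1 and omitted indices meaning zero. The sketch below parses one such line in plain Java; the LibsvmLine class is hypothetical, it is not how DataQuery.libsvm() is implemented, and the example line is made up rather than taken from heart_scale.txt.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical parser for a single line in libsvm format: "<label> <index>:<value> ..."
class LibsvmLine {
    final double label;
    final Map<Integer, Double> features = new TreeMap<>(); // feature index -> value

    LibsvmLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        label = Double.parseDouble(tokens[0]);
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            int index = Integer.parseInt(tokens[i].substring(0, colon));
            double value = Double.parseDouble(tokens[i].substring(colon + 1));
            features.put(index, value);
        }
    }

    public static void main(String[] args) {
        LibsvmLine parsed = new LibsvmLine("+1 1:0.5 3:-0.25"); // made-up example line
        System.out.println(parsed.label + " " + parsed.features);
    }
}
```

Feature 2 is absent from the example line, so a dense representation would treat it as 0.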

java-data-frame's People

Contributors

chen0040

java-data-frame's Issues

Creation of Data Frame

This is my dataset and I don't know how to create data frame for this.

-0.208582834331337 0.0793062200956938 -0.0438155136268344 2
0.262688089773017 -0.0316019417475728 0.0475750577367206 4
0.0408863533010873 0.556117290192113 0.00897459165154265 3
-0.185185185185185 0.0785714285714286 -0.0270479134466770 0
0.117788847822594 0.130968622100955 0.0192388451443570 1

Please show me how to do this in code.
