
dflib's Introduction


DFLib

DFLib ("DataFrame Library") is a lightweight pure Java implementation of a common DataFrame data structure. DataFrames exist in Python (pandas), R, Spark and other languages and frameworks. DFLib's DataFrame is specifically intended for Java and JVM languages.

With the DataFrame API, you get essentially the same data manipulation capabilities you may be used to in SQL (joins, etc.), only you apply them in-memory and over dynamically defined "tables". While SQL is "declarative", DataFrame allows step-by-step transformations that are somewhat easier to understand and much easier to compose.

DataFrame is extremely versatile and can be used to model a variety of data tasks. ETL, log analysis and spreadsheet processing are just some of the examples. DFLib comes with connectors for many data formats: CSV, Excel, RDBMS, Avro, JSON, and can be easily adapted to other formats (e.g. web-service-based ones like Google Sheets).


dflib's People

Contributors

andrus, azapasnik, dependabot[bot], stariy95, sullis, vintikjeny


dflib's Issues

"dflib-jupyter" - integration into Jupyter notebook

I am successfully running DFLib in Jupyter using the iJava kernel. However, to make it smooth, we need a bit more integration:

  • switching to tabular renderers for DataFrame and Series
  • adding an API to change max rows and width for the renderers
  • making it all easy to bootstrap within a notebook

So we need a module dflib-jupyter that will install the renderers and provide all this API. Each notebook will need to do something like this to install DFLib support:

// import once in the notebook. This puts all DFLib modules as well as "slf4j-simple" 
// on classpath, so you no longer need to use %maven magic for DFLib.
%maven com.nhl.dflib:dflib-jupyter:0.6-SNAPSHOT
import com.nhl.dflib.jupyter.*;
DFLibJupyter.init(getKernelInstance());

// can change display parameters at any time
DFLibJupyter.setMaxDisplayRows(20);
DFLibJupyter.setMaxDisplayColumnWidth(30);

Per the discussion under the iJava project, it makes more sense to write this integration against jupyter-jvm-basekernel from the same authors, so it will work with other JVM languages.

Primitive Series "select" methods must account for nulls

IntSeries and other primitive Series delegate "select(IntSeries)" to selectInt(..), etc. This means the resulting Series can contain no nulls: nulls are replaced with 0 / "false". This is a problem, e.g. when DFLib uses "select" during outer "join" operations, we end up with zeros instead of nulls, which is obviously wrong.

Going to remove selectXXX primitive variants, and implement select to build either primitive or object series depending on the presence of nulls in the selected values.
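A rough sketch of what the new "select" could look like, assuming outer joins mark missing rows with negative positions and that IntSeries already implements Series<Integer> (names like selectAsIntSeries and ArraySeries are illustrative, not the final API):

// Sketch only: switch between a primitive and an object-based result
// depending on whether any of the requested positions are "missing".
public Series<Integer> select(IntSeries positions) {
    int len = positions.size();

    boolean hasMissing = false;
    for (int i = 0; i < len; i++) {
        // a negative position means "no matching row", e.g. from an outer join
        if (positions.getInt(i) < 0) {
            hasMissing = true;
            break;
        }
    }

    if (!hasMissing) {
        // stays primitive, no boxing, no nulls needed
        return selectAsIntSeries(positions);
    }

    // fall back to a nullable object Series
    Integer[] data = new Integer[len];
    for (int i = 0; i < len; i++) {
        int pos = positions.getInt(i);
        data[i] = pos < 0 ? null : getInt(pos);
    }
    return new ArraySeries<>(data);
}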

JDBC saver for DataFrames

Add support for saving DataFrames in DB tables via JDBC. The following functionality will be implemented here:

  • Append DataFrame data to an existing table
  • Remove data from an existing table before storing the DataFrame
  • Allow storing DataFrame row number in a separate column (may work as PK generator)

"Series.unique" - produce Series with non-repeating values

Very often it is useful to remove duplicates from a Series. So let's implement the following:

Series<T> unique();

For object Series, the fastest method seems to be using an intermediary HashSet to dedupe the data, though this requires significant extra space.
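A minimal sketch of the HashSet-based approach over a plain Object[] (illustrative; the real implementation would work against the Series internals):

import java.util.LinkedHashSet;
import java.util.Set;

// LinkedHashSet removes duplicates while preserving first-occurrence order.
static Object[] dedupe(Object[] values) {
    Set<Object> seen = new LinkedHashSet<>();
    for (Object v : values) {
        seen.add(v);
    }

    // if seen.size() == values.length, nothing was removed and the caller
    // can return the original Series unchanged
    return seen.toArray();
}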

"indicator" column for joins

Often DF joins are done to figure out which rows are in both DFs, and which only exist on the left or the right. Currently this can be done by checking for nulls on the left and right, but this is not very reliable in the general case. Instead, let's allow adding an optional "indicator" column that will provide this information in the form of an enum:

enum JoinIndicator { both, left_only, right_only }

var df = df.fullJoin().on(0).indicator("ind").with(df1);

IntSeries / IntMutableList for sort and group

Switching to IntSeries allowed us to speed up hash join and filter operations. Hopefully sort and group-by can also be sped up with this approach. For this to work, we'll need an internal TimSort implementation that supports a primitive int comparator:

interface IntComparator {
   int compare(int i1, int i2);
}

class IntSeries {
    IntSeries sort(IntComparator c);
}

IntSeries / IntMutableList for joins and filters

Let's create IntMutableList (an appendable collection of primitive "int" values) that can be converted to IntSeries, which is immutable.

While working with collections of primitives in Java is painful, there can be real performance gains. My prototype of the data structures above speeds up joins by ~ 25-30% when used for indexing joined DataFrames.

This task will switch joins and filters to int-based implementation. Sorters and groupers will be switched separately, as this requires our own custom sorter.
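A minimal sketch of what IntMutableList might look like (class names, including the array-backed IntArraySeries, are illustrative):

// An appendable collection of primitive ints, convertible to an immutable IntSeries.
class IntMutableList {

    private int[] data;
    private int size;

    IntMutableList(int capacity) {
        this.data = new int[Math.max(capacity, 8)];
    }

    void add(int value) {
        if (size == data.length) {
            // grow the backing array; appends stay amortized O(1)
            data = java.util.Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;
    }

    IntSeries toIntSeries() {
        // trim to the actual size so the resulting series holds no extra capacity
        return new IntArraySeries(java.util.Arrays.copyOf(data, size));
    }
}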

Series.locate(ValuePredicate)

We need a Series.locate(ValuePredicate) method, complementary to Series.replace, that would produce a BooleanSeries for a given condition. Primitive Series will have locateInt, locateDouble, etc. variants. E.g., here is how we could replace values matching a condition.

Example:

DataFrame df;

// replacing a column with new values
IntSeries c = df.getColumnAsInt("c");
df
    .dropColumn("c")
    .addColumn("c", c.replace(c.locateInt(i -> i < 0), 0));

A shorthand notation, Series.replace(ValuePredicate, T), may also be provided:

df
    .dropColumn("c")
    .addColumn("c", c.replace(i -> i < 0, 0));

Related: #31, #32, #52

LongSeries

Let's implement LongSeries to handle primitive longs, similar to IntSeries, BooleanSeries and DoubleSeries.

Make IntSeries compatible with Series<Integer>

IntSeries is a much more efficient data structure compared to ArraySeries<Integer>, as #26 and #27 have shown. This task is about making IntSeries API-compatible with Series<Integer> so that we can use it as a column in a DataFrame.

Once we do that, we also need a way to convert DataFrame columns to IntSeries, and access them as such.

GroupBy.head(int) and GroupBy.toDataFrame() functions

Let's implement GroupBy.head(int) and GroupBy.toDataFrame() functions. The former truncates groups to a predefined size. The latter recombines the GroupBy back into a DataFrame (preserving the effects of group sorting and truncation). This will allow answering questions like "top 2 products by sales in each category" or "top 3 scoring teams in each division":

DataFrame top2 = all.group("division")
    .sort("score", true)
    .head(3)
    .toDataFrame();

Series.fillNulls / DataFrame.fillNulls

A method to fill missing data in a Series using either the previous or the next non-null value, or a specified value:

s = s.fillNulls("A");
s = s.fillNullsBackwards();
s = s.fillNullsForward();

Same for DataFrame:

df = df.fillNulls("A");
df = df.fillNulls("C1", "A");
df = df.fillNullsBackwards("C1");
df = df.fillNullsForward("C1");
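A sketch of how the forward fill could work over a plain Object[] (illustrative, not the actual Series internals):

// Each null is replaced with the closest preceding non-null value;
// leading nulls stay null since there is nothing to copy from.
static Object[] fillNullsForward(Object[] values) {
    Object[] filled = java.util.Arrays.copyOf(values, values.length);
    Object last = null;
    for (int i = 0; i < filled.length; i++) {
        if (filled[i] != null) {
            last = filled[i];
        } else {
            filled[i] = last;
        }
    }
    return filled;
}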

A good use case for this is log analysis when we may try to tag log lines belonging to the same transaction by a unique "transaction id". E.g.:

logs.sortByColumn("thread", true)
   .sortByColumn("ts", true)
   .addColumn("txid", /* (end_tx_marker_condition) ? random_id : null */)
   .fillNullsBackwards("txid");

Hash Join builder

Hash join methods on DataFrame already have too many arguments (and permutations of them), and we'll add more. So it's time to refactor this into a builder. Example:

 DataFrame df = df2.rightJoin()
    .on("c", "a")
    .with(df1);

Multicolumn:

 DataFrame df = df2.rightJoin()
    .on("c", "a")
    .on("d", "e")
    .with(df1);

Load DataFrames from CSV

Need to start implementing some DataFrame load/save strategies. The first one will be based on the CSV format (as the one most closely matching the structure of a DataFrame), using the Apache commons-csv library. This task is about loading DataFrames from CSV.
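A rough sketch of what the commons-csv based loading might do internally (the DFLib-facing API is not designed yet, so everything below is illustrative):

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

static void loadSketch(String path) throws IOException {
    try (Reader in = new FileReader(path)) {
        // the first record becomes the DataFrame Index (column names),
        // each following record becomes a DataFrame row
        for (CSVRecord record : CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in)) {
            Object[] row = new Object[record.size()];
            for (int i = 0; i < record.size(); i++) {
                row[i] = record.get(i);
            }
            // ... append "row" to the DataFrame being built
        }
    }
}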

Simplifying join API

Let's add simpler forms of the join API for the most common kinds of joins:

// single Hasher when both source and target column have the same name or position
DataFrame.join(DataFrame another, Hasher commonHasher, ..);

// name based equi-join
DataFrame.join(DataFrame another, String leftColumn, String rightColumn, ...)
DataFrame.join(DataFrame another, String column, ..)
DataFrame.join(DataFrame another, int leftColumn, int rightColumn, ..)

Series.replace, Series.replaceNoMatch, DataFrame.nullify, DataFrame.nullifyNoMatch

Will implement "boolean indexing" operations on a DataFrame and Series. All of them will take a "condition" (either a DataFrame or a Series of booleans):

Series.replace(BooleanSeries condition, T with);
Series.replaceNoMatch(BooleanSeries condition, T with);

// DataFrame may have different types of columns, so providing a simpler version of replace
// that has no type - "nullify":
DataFrame.nullify(DataFrame condition)
DataFrame.nullifyNoMatch(DataFrame condition)

"replace"/"nullify" will replace all values matching "true" positions in condition, "replaceNoMatch" / "nullifyNoMatch" - the values in positions matching "false".

E.g.:

DataFrame df1, df2;

// 'condition' is a DataFrame of BooleanSeries per #32
DataFrame condition = df1.eq(df2);

DataFrame r = df1.nullify(condition);
DataFrame m = df1.nullifyNoMatch(condition);

Note that pandas has a similar set of ops, but with different semantics.

Column-oriented DataFrame

Now that we've hidden direct access to DataFrame rows (Object[]'s) behind an API layer, we can experiment with alternative, more optimal DataFrame designs. The one that seems most promising is a ColumnarDataFrame that would store data one Object[] per column. It has a lot of advantages over a row-based DataFrame, as long as the DataFrame's height is much bigger than its width (which is always the case with typical data sets). Our preliminary memory tests show up to 75% memory savings (and we haven't even explored deeper column-oriented optimizations, like primitive arrays, bitsets, etc.)

This task is about the first cut of ColumnarDataFrame.
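A minimal sketch of the column-oriented layout (illustrative; the actual ColumnarDataFrame design may differ):

// Each column is stored as its own array; a "row" is just an index shared by all columns.
class ColumnarStore {

    private final String[] columnNames;
    private final Object[][] columns; // one Object[] per column, all of equal length

    ColumnarStore(String[] columnNames, Object[][] columns) {
        this.columnNames = columnNames;
        this.columns = columns;
    }

    Object get(int row, int column) {
        return columns[column][row];
    }

    int height() {
        return columns.length > 0 ? columns[0].length : 0;
    }
}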

Support for 'groupBy'

Implement support for a "group by" operation that creates an intermediate GroupBy object, which can later be resolved to any number of aggregated DataFrames. E.g.:

DataFrame df2 = df1
    .groupBy("b")
    .agg(Aggregator.first("b"), Aggregator.sum("a"), Aggregator.median("a"));

Column converter to enums for categorical data

Add the following DataFrame API:

<E extends Enum<E>> DataFrame toEnumFromStringColumn(String columnLabel, Class<E> enumType)
<E extends Enum<E>> DataFrame toEnumFromStringColumn(int pos, Class<E> enumType)
<E extends Enum<E>> DataFrame toEnumFromNumColumn(String columnLabel, Class<E> enumType)
<E extends Enum<E>> DataFrame toEnumFromNumColumn(int pos, Class<E> enumType)

This will be useful to optimize columns with categorical data for memory use and access speed.

Tabular and Inline printers should show head and tail

Currently printers only show the top X rows up to the preset limit. It would be more informative to show both the DataFrame / Series head and tail (of course constrained by the same limit). Now that both "head()" and "tail()" methods are universally available and fairly performant, there's no reason not to do it.

Scalar: helper accessor to DataFrame values

There's inevitably a need to peek at DataFrame cell contents or convert between objects and DataFrames. So it would be nice to have the ability to write a simple descriptor for each DataFrame and avoid string references and type casts. We may solve this problem with an API similar to Cayenne's Property object. E.g.:

interface MyDF {
    Scalar<String> A = Scalar.at(Columns.a);
    Scalar<String> B = Scalar.at(Columns.b);
    Scalar<String> C = Scalar.at(Columns.c);

    int width = Columns.values().length;
    Index index = Index.withNames(Columns.class);

    enum Columns {a, b, c}
}

Scalar in turn can have methods to build Aggregators, JoinPredicates, Hashers, etc.

DataFrame.mapColumn

A method to produce a Series that is the result of applying a function to the DataFrame row by row. Essentially the same as addColumn, only the Series is returned to the caller instead of being added to the DataFrame:

<T> Series<T> mapColumn(RowToValueMapper<T> rowMapper);
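A hypothetical usage example (column names are made up; assumes the mapper receives a row accessor with get(String)):

// derive a "total" Series from two existing columns without modifying the DataFrame
Series<Integer> total = df.mapColumn(r -> (Integer) r.get("quantity") * (Integer) r.get("price"));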

'vConcat' and 'hConcat': support for concatenating DataFrames

Let's implement a special form of joins - horizontal and vertical concatenation of DataFrames. As opposed to "data joins" (aka regular joins), these new operations will not be comparing row data. Instead they will simply combine DataFrames side by side, matching rows by row number (hConcat, formerly "zip"), or stack them vertically, matching columns by name (vConcat):

DataFrame hConcat(DataFrame df)
DataFrame hConcat(JoinType how, DataFrame df)

DataFrame vConcat(DataFrame... dfs)
DataFrame vConcat(JoinType how, DataFrame... dfs)

Note how both forms of concatenation can take a JoinType parameter. It defines what to do with unmatched columns or rows; the semantics are the same as with DB joins.

Additionally, vConcat allows concatenating more than two frames.
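Hypothetical usage (DataFrame variables and JoinType values are illustrative):

// stack two frames on top of each other, keeping only the columns present in both
DataFrame tall = df1.vConcat(JoinType.inner, df2);

// attach df2's columns to the right of df1, matching rows by row number
DataFrame wide = df1.hConcat(df2);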

dflib-jdbc: saver must support Month, Year, enums, etc

When saving data to a DB via TableSaver, we are currently doing conversions for LocalDate, LocalDateTime, and LocalTime. We'll need to add conversions for a few more types:

  • Month - int
  • Year - int
  • DayOfWeek (this is an enum, so I guess we just need general support for enums)
  • any other enum (either as String or a number)

GroupBy.rowNumbers()

Add a GroupBy.rowNumbers() "window" function that would return a Series of the same size and order as the original pre-GroupBy DataFrame, containing each row's number within its group. E.g.:

df = df.addColumn("row_number", df.group("a").sort("a").rowNumbers())

In the future we can add more ranking functions similar to Spark "rank" and "dense_rank".

Add LocalDate and LocalDateTime converters

Add the following common converters to ValueMapper:

static ValueMapper<String, LocalDate> stringToDate()
static ValueMapper<String, LocalDate> stringToDate(DateTimeFormatter formatter)
static ValueMapper<String, LocalDateTime> stringToDateTime()
static ValueMapper<String, LocalDateTime> stringToDateTime(DateTimeFormatter formatter)

DoubleSeries

In addition to IntSeries let's implement another efficient Series object - DoubleSeries.

"DataFrame.select" operation for selecting rows

In addition to DataFrame.selectColumns, add a new DataFrame.select(..). This will allow sampling DataFrame contents, implementing more window functions, etc. The selector should support filtering, reordering and duplicating rows.

df1 = df.select(1, 5, 3);
df2 = df.select(asList(1, 5, 3));

Series<Integer> s = ..
df2 = df.select(s);

RowProxy - "flyweight" API to hide access to Object[]

Will introduce a new API to hide direct access to Object[] during operations like map, sort, etc. The read version of this API will be called RowProxy, and the writable one - RowBuilder. This will allow us in the future to restructure DataFrames from row-oriented to column-oriented without changing the API.

My initial performance tests show that with RowProxy and RowBuilder being "flyweights", there's no negative performance impact. In fact, I saw small performance improvements in a few places.

The API generally becomes cleaner, as RowProxy has access to the DataFrame index, so classes like MapContext are no longer needed.
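A minimal sketch of the read-only side of the API (illustrative; the actual interface may include more methods):

// A single RowProxy instance is repositioned over rows during iteration,
// so no per-row Object[] needs to be materialized.
public interface RowProxy {

    // the column index shared by all rows of the DataFrame
    Index getIndex();

    Object get(int columnPos);

    Object get(String columnName);
}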

dflib-jdbc: TableSaver to merge data by key

One of the save modes for TableSaver should be "upsert" (insert or update). This merging should be done by one or more designated "key" columns (by default this should be the table primary key).

DataFrame.addColumn(Series)

Implement adding a column to a DataFrame as a Series:

Series<String> s = Series.from("a", "b", "c");
df.addColumn("new_column", s);

This operation will be useful for recombining DataFrames with the results of calculations (such as ranking functions done on GroupBy, etc.)

If df.height() != s.size(), an exception should be thrown.

'sort' and 'sortByColumns' to reorder DataFrames

Let's implement DataFrame sorting:

<V extends Comparable<? super V>> DataFrame sort(DataRowToValueMapper<V> m)

DataFrame sortByColumn(String column, boolean ascending)
DataFrame sortByColumn(int column, boolean ascending)

DataFrame sortByColumns(String[] columns, boolean[] ascending)
DataFrame sortByColumns(int[] columns, boolean[] ascending)

Series.map

Somehow we are still missing a Series.map method. Need to add it:

<V> Series<V> map(ValueMapper<T, V> mapper);

DataFrame.stack operation

Allow to "stack" a DataFrame, producing a 3-column DataFrame that contains all values from the original DataFrame in a single column, indexed by the original row and column.

DataFrame stacked = df.stack();
DataFrame stacked = df.stackIncludeNulls();

Unlike pandas, we don't have multi-index Series, so "stack" would produce a DataFrame with three columns - "row", "column", "value". Those are required for "stack" to be lossless (i.e. the original DataFrame can be recreated from the stacked DataFrame).

a b    c
- ---- -
1 2    3 
4 null 6

would become

row column value
--- ------ -----
0   a      1
0   b      2
0   c      3
1   a      4
1   c      6

dflib-jdbc: alternative way for TableSaver to find the types of PreparedStatement parameters

Currently TableSaver relies on parameter metadata from PreparedStatement. This works on Derby DB, but doesn't work on MySQL. With generateSimpleParameterMetadata=true in the JDBC URL, MySQL returns all parameter types as VARCHAR (not very helpful), and with this omitted (or set to "false") an exception is thrown:

java.sql.SQLException: Parameter metadata not available for the given statement

So we have to use something more reliable to determine value types. I think table metadata may be an alternative; it should be available across databases.
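A sketch of the table-metadata alternative using standard JDBC (DatabaseMetaData should behave consistently across drivers):

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;

// Reads column names and java.sql.Types codes for a table; these types can then be used
// to bind PreparedStatement parameters without relying on parameter metadata.
static void readColumnTypes(Connection connection, String table) throws SQLException {
    DatabaseMetaData metaData = connection.getMetaData();
    try (ResultSet rs = metaData.getColumns(null, null, table, "%")) {
        while (rs.next()) {
            String name = rs.getString("COLUMN_NAME");
            int jdbcType = rs.getInt("DATA_TYPE"); // java.sql.Types constant
            // ... remember the name -> jdbcType mapping for the saver
        }
    }
}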

Also this brings up a question of testing across DBs using Docker with Testcontainers.

CsvLoader to provide shortcuts for datetime and number conversions

Since converting columns to LocalDate(Time) and various number types is a very common and unavoidable part of working with CSV, let's make it a bit easier:

CsvLoader numColumn(int column, Class<? extends Number> type)
CsvLoader numColumn(String column, Class<? extends Number> type) 
CsvLoader dateColumn(int column) 
CsvLoader dateColumn(String column)
CsvLoader dateTimeColumn(int column)
CsvLoader dateTimeColumn(String column)
CsvLoader dateColumn(int column, DateTimeFormatter formatter)
CsvLoader dateColumn(String column, DateTimeFormatter formatter)
CsvLoader dateTimeColumn(int column, DateTimeFormatter formatter)
CsvLoader dateTimeColumn(String column, DateTimeFormatter formatter)
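Hypothetical usage of the proposed shortcuts (the file, the column names and the final load(..) call are made up for illustration):

DataFrame df = new CsvLoader()
    .dateColumn("created_on", DateTimeFormatter.ofPattern("MM/dd/yyyy"))
    .numColumn("amount", Double.class)
    .load("transactions.csv");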

Series/DataFrame element-wise equality operations

Let's define a few element-wise logical operations for DataFrames and Series, all returning a DataFrame / Series of booleans (that can later be used as value replacement conditions per #31):

BooleanSeries mask = s1.eq(s2);
BooleanSeries mask = s1.ne(s2);

DataFrame mask = df1.eq(df2);
DataFrame mask = df1.ne(df2);

DataFrames being compared must have the same number of rows and columns and the same column labels.

BooleanSeries

In addition to IntSeries and DoubleSeries let's implement another efficient Series object - BooleanSeries. It may be used as a representation of two DataFrames comparison, etc.

Hash equijoins should include only one copy of each join column

Currently the joined DataFrame contains both the source and target columns of the hash join, which is fine in the general case (as the Hasher condition is not necessarily per-column equality). However, in the case of equi-joins (when the condition is indeed per-column equality), repeating columns are useless and only take up memory and processing time.

We need to figure out semantics to request deduplication of join columns. In SQL-92 this is achieved via a USING clause.

Move interfaces with public static methods for creating lambdas to the root package

When working in Jupyter or JShell it would be nice to minimize the number of imports, as unlike in a true IDE, they are managed by hand.

So interfaces, enums and classes that are likely to be used by developers directly should be located in the root package of their corresponding module:

  • Aggregator
  • Hasher
  • JoinType
  • RowProxy
  • RowBuilder
  • *Predicate
  • *ValueMapper
  • JdbcConnector

Another benefit of this change is that all the "public" API is easy to discover without having to wade through subpackages.

JDBC loader for DataFrames

Implement a loader of DataFrames that can get data from a database via JDBC. Notes (a usage sketch follows the list):

  • It must be an optional dependency loaded via a standalone module - dflib-jdbc
  • It must support loading a DataFrame from a table, with auto-detection of column names. It should also include the ability to filter the columns to load, and to specify an optional limit on the number of rows
  • It must support loading data via parameterized SQL query
  • It should be initializable from an app-provided DataSource, but also support bootstrapping an internal (non-pooling) DataSource from DB URL, user name, password
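A hypothetical usage sketch covering the notes above (class and method names are illustrative, not the final dflib-jdbc API):

// connector bootstrapped from an app-provided DataSource (or from URL / user / password)
JdbcConnector connector = JdbcConnector.create(dataSource);

// load a whole table, limiting the columns and the number of rows
DataFrame people = connector.tableLoader("person")
    .includeColumns("name", "age")
    .maxRows(100)
    .load();

// load via a parameterized SQL query
DataFrame adults = connector.sqlLoader("SELECT name, age FROM person WHERE age > ?").load(21);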

dflib-jdbc: TableSaver to create target table if missing

TableSaver should optionally create the target table if it is missing. There are a few challenges to solve here:

  • DDL for table creation is often DB-specific (particularly column types, such as dates, times, etc)
  • We don't have a robust API to determine the type of each column in the DataFrame. A "brute-force" implementation will require scanning every cell, and coming up with the JDBC type that would accommodate all values in a given column.
  • Determining proper length of VARCHAR / VARBINARY

As a quick workaround, we need an API for the caller to provide the desired column definitions in the target DB format. Long-term, we'll need DB-specific JdbcConnectors and better DataFrame type introspection facilities.

EnumSeries for categorical data

Just like we have special optimized Series for primitives, let's create one for Enums:

enum X {a, b, c}

// regular series; no optimization
Series<X> s1 = Series.forData(X.a, X.b, X.c);

// optimized enum series
Series<X> s2 = Series.forEnums(X.a, X.b, X.c);

Other than how they are created, EnumSeries have no special API compared to regular object Series.
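One possible storage scheme, keeping only enum ordinals in a primitive array (illustrative, not the final EnumSeries design):

// Stores ordinals instead of object references; values are reconstructed on access.
class EnumStore<E extends Enum<E>> {

    private final E[] constants;  // cached once per series
    private final int[] ordinals; // one int per row

    EnumStore(Class<E> type, int[] ordinals) {
        this.constants = type.getEnumConstants();
        this.ordinals = ordinals;
    }

    E get(int i) {
        return constants[ordinals[i]];
    }
}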

Tabular printer should right-align numbers and booleans

We can provide a visual hint to the caller on whether a given column is a number or a String by right-aligning the numbers (and keeping Strings left-aligned).

While we don't have explicit type tracking in Series and DataFrames, I think primitive-based Series can already be switched to this approach, as those are explicit.
