dataframe's People

Contributors

adriankhl, alllex, ark-1, belovrv, cmelchior, devcrocod, ermolenkodev, fb64, ileasile, jimexist, jmrsnt, jolanrensen, kantis, koperagen, kopilov, lananovikova10, leandroc89, matthewwiese, mgroth0, nikitinas, njacobs5074, pacher, poslavskysv, rayshade, sarahhaggarty, sullis, tairinjane, vhuc, zaleslaw

dataframe's Issues

Data aggregation bug

Example:

df.groupBy { A }.aggregate { mean() into "mean" }

This only works when the column is specified explicitly; selecting it by String name fails:

df.groupBy { A }.aggregate { mean("C") into "mean" }

Review exceptions

Add meaningful messages to all thrown exceptions and check exception types.

expose `DataFrameSchema` types

Hey there, thanks for this project. It looks very much like what I was looking for.

The feature I'm looking for is the ability to query the schema of a DataFrame. I need this to convert the DataFrame to a LeapFrame (the dataframe in the MLeap project) to create input for an ML model. The LeapFrame represents the column schema with its own type, which is basically a name-and-type tuple. The type is represented as DataType, which is a compound of isNullable: Bool and the DataShape, which is either a Scalar, Tensor or List, and BasicType, which is one of Int, Long, Double, String, etc.

I was looking at DataFrameSchema, which is a good approximation of the LeapFrame schema. The only problem is that this Kotlin type is part of an internal package and not public. Would it be possible to make DataFrameSchema (and the related types) available in the public API?
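
For reference, a minimal sketch of the desired usage, assuming schema() and its types were made public (the columns property here reflects my understanding of the current internal API):

val schema = df.schema() // DataFrameSchema
for ((name, columnSchema) in schema.columns) {
    // columnSchema carries the Kotlin type and nullability needed to build a LeapFrame DataType
    println("$name -> $columnSchema")
}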

Generate enum classes as part of DataSchema

Imagine a CSV with a "day_of_week" column containing string values like "monday", "friday", etc. If you could convert this column to an enum, you could use code completion to, for example, filter on it.
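
For illustration, the generated code might look something like this (a purely hypothetical sketch; the interface name and enum values are invented):

// hypothetical generated enum and schema
enum class DayOfWeek { MONDAY, FRIDAY /* ... */ }

@DataSchema
interface Schedule {
    val day_of_week: DayOfWeek
}

// completion-friendly filtering:
df.filter { day_of_week == DayOfWeek.MONDAY }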

It can be done the same way as generating data schemas:

  1. after cell execution in the notebooks
  2. on data schema import in gradle project

There are some design questions:

  1. What if I don't need an enum?

  2. What about normalization? "monday" and "Monday" aren't the same thing.
    In Jupyter, you can normalize values however you want and get a nice enum.
    In a Gradle project, code generation happens once, at build time, so your values have to be normalized up front. How?

  3. How many values are too many for an enum?

  4. What if not all possible values are present in the column? What should happen if the generated schema knows about 2 enum values, but the actual column at runtime has more?

Insert column after actually inserts it before

I have the following dataframe:

⌌---------------------------------------------------------------------------------------------------------------⌍
|  |                  scenario| Configuration: 1.6.20| Execution: 1.6.20| Configuration: 1.7.0| Execution: 1.7.0|
|--|--------------------------|----------------------|------------------|---------------------|-----------------|
| 0| Spring server clean build|                   454|              7483|                  528|             6990|
⌎---------------------------------------------------------------------------------------------------------------⌏

And I want to insert two more columns, so I call this for both of them (the names differ):

df.insert("Configuration diff from stable release") {
        val stableReleaseConfiguration = column<Int>("Configuration: 1.6.20").getValue(this)
        val currentReleaseConfiguration = column<Int>("Configuration: 1.7.0").getValue(this)
        val percent = currentReleaseConfiguration * 100 / stableReleaseConfiguration
        "${percent}%"
}
.after("Configuration: 1.7.0")

But the resulting dataframe is:

⌌-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------⌍
|  |                  Scenario| Configuration diff from stable release| Configuration: 1.6.20| Configuration: 1.7.0| Execution diff from stable release| Execution: 1.6.20| Execution: 1.7.0|
|--|--------------------------|---------------------------------------|----------------------|---------------------|-----------------------------------|------------------|-----------------|
| 0| Spring server clean build|                                   116%|                   454|                  528|                                93%|              7483|             6990|
⌎-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------⌏

It seems .after(...) actually acts as .before(...).

kotlin-dataframe version is 0.8.0-dev-952

Support Excel I/O

Expected arguments:

writeExcel:

  • columns: ColumnSelector - columns to write
  • sheetName: String - name of sheet
  • naStr: String - missing data representation
  • header: Boolean - whether to write header

readExcel:

  • path: String/URL/File
  • sheetName: String - name of the sheet to read
  • sheetIndex: Int - index of the sheet to read
  • rowsCount: Int - number of rows to read
  • columns: String - comma-separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F")
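
A sketch of how the proposed functions could be used (argument names as listed above; the overall shape is a suggestion, not a final design):

// hypothetical usage of the proposed API
df.writeExcel("report.xlsx", sheetName = "Results", naStr = "", header = true)
val df2 = DataFrame.readExcel("report.xlsx", sheetName = "Results", columns = "A:E")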

Plugin dependency broken for 0.8.0-rc-3

Hi,

thank you for the awesome library. I have a problem when using Gradle:

plugins {
    id("org.jetbrains.kotlin.plugin.dataframe") version "0.8.0-rc-3"
}

dependencies {
    implementation("org.jetbrains.kotlinx:dataframe:0.8.0-rc-3")
}

I get the following error:

* What went wrong:
Execution failed for task ':kspKotlin'.
> Error while evaluating property 'filteredArgumentsMap' of task ':kspKotlin'
   > Failed to query the value of task ':kspKotlin' property 'options'.
      > Could not resolve all files for configuration ':kspKotlinProcessorClasspath'.
         > Could not find org.jetbrains.kotlinx.dataframe:symbol-processor:0.8.0-dev-813.
           Searched in the following locations:
             - https://repo.maven.apache.org/maven2/org/jetbrains/kotlinx/dataframe/symbol-processor/0.8.0-dev-813/symbol-processor-0.8.0-dev-813.pom
             - https://jitpack.io/org/jetbrains/kotlinx/dataframe/symbol-processor/0.8.0-dev-813/symbol-processor-0.8.0-dev-813.pom
           Required by:
               project :

With version dev-808 it works fine.

Also, it may be worth moving the plugin under kotlinx, because right now it's confusing:

org.jetbrains.kotlin.plugin.dataframe
org.jetbrains.kotlinx:dataframe

Simplify DataFrame interface?

The primary DataFrame interface seems over-complicated. A lot of methods have only default implementations and could be moved to extensions. I propose to simplify it significantly, as I've done here. That would make it easier to add and maintain features; for example, it would allow adding row-based DataFrames.

Generate constructor for schema interface

Support a GenerateConstructor annotation on the companion object of a schema interface to generate a default implementation of that interface and an append overload for DataFrame:

@DataSchema
interface Record {
    val a: Int
    val b: Int

    @GenerateConstructor
    companion object
}

// region Generated Code

operator fun Record.Companion.invoke(a: Int, b: Int): Record =
    object: Record {
        override val a = a
        override val b = b
    }

fun DataFrame<Record>.append(vararg rows: Record) = concat(rows.asIterable().toDataFrame())

// endregion

// usage:

listOf(Record(1,2), Record(3,4))
  .toDataFrame()
  .append(Record(5,6))
  .add("sum") { a + b }

Documentation for `NA` term

This term is used in valueCounts and statistical operations: std, mean, varianceAndMean
https://kotlin.github.io/dataframe/valuecounts.html
The documentation does not explain what it means. What's the difference from NaN?
In the following function, skipNA is used when the value is NaN:

public fun Sequence<Float>.mean(skipNA: Boolean = skipNA_default): Double {
    var count = 0
    var sum: Double = 0.toDouble()
    for (element in this) {
        if (element.isNaN()) {
            if (skipNA) continue
            else return Double.NaN
        }
        sum += element
        count++
    }
    return if (count > 0) sum / count else Double.NaN
}

Broken build for Java 8 target

Hi,

version 0.8.0-rc-9 breaks builds for Java 8 targets:

➜  git:(master) ✗ ./gradlew clean build
> Task :kspKotlin FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':kspKotlin'.
> Error while evaluating property 'filteredArgumentsMap' of task ':kspKotlin'
   > Could not resolve all files for configuration ':compileClasspath'.
      > Could not resolve org.jetbrains.kotlinx:dataframe:0.8.0-rc-9.
        Required by:
            project :
         > No matching variant of org.jetbrains.kotlinx:dataframe:0.8.0-rc-9 was found. The consumer was configured to find an API of a library compatible with Java 8, preferably in the form of class files, preferably optimized for standard JVMs, and its dependencies declared externally, as well as attribute 'org.jetbrains.kotlin.platform.type' with value 'jvm' but:
             - Variant 'apiElements' capability org.jetbrains.kotlinx:dataframe:0.8.0-rc-9 declares an API of a library, packaged as a jar, preferably optimized for standard JVMs, and its dependencies declared externally, as well as attribute 'org.jetbrains.kotlin.platform.type' with value 'jvm':
                 - Incompatible because this component declares a component compatible with Java 11 and the consumer needed a component compatible with Java 8
             - Variant 'javadocElements' capability org.jetbrains.kotlinx:dataframe:0.8.0-rc-9 declares a runtime of a component, and its dependencies declared externally:
                 - Incompatible because this component declares documentation and the consumer needed a library
                 - Other compatible attributes:
                     - Doesn't say anything about its target Java environment (preferred optimized for standard JVMs)
                     - Doesn't say anything about its target Java version (required compatibility with Java 8)
                     - Doesn't say anything about its elements (required them preferably in the form of class files)
                     - Doesn't say anything about org.jetbrains.kotlin.platform.type (required 'jvm')
             - Variant 'runtimeElements' capability org.jetbrains.kotlinx:dataframe:0.8.0-rc-9 declares a runtime of a library, packaged as a jar, preferably optimized for standard JVMs, and its dependencies declared externally, as well as attribute 'org.jetbrains.kotlin.platform.type' with value 'jvm':
                 - Incompatible because this component declares a component compatible with Java 11 and the consumer needed a component compatible with Java 8
             - Variant 'sourcesElements' capability org.jetbrains.kotlinx:dataframe:0.8.0-rc-9 declares a runtime of a component, and its dependencies declared externally:
                 - Incompatible because this component declares documentation and the consumer needed a library
                 - Other compatible attributes:
                     - Doesn't say anything about its target Java environment (preferred optimized for standard JVMs)
                     - Doesn't say anything about its target Java version (required compatibility with Java 8)
                     - Doesn't say anything about its elements (required them preferably in the form of class files)
                     - Doesn't say anything about org.jetbrains.kotlin.platform.type (required 'jvm')

Is it possible to fix this? Many production environments still use Java 8 and have no way to upgrade to a newer JVM.

Excessive schema interface generation

%use dataframe
@DataSchema
interface A { val x: List<*> }

@DataSchema
interface B: A
val df = dataFrameOf("x")(listOf(1))
%trackExecution
1

Executing:

1

Executing:

@DataSchema(isOpen = false)
interface _DataFrameType1 : Line_4.B

val org.jetbrains.kotlinx.dataframe.ColumnsContainer<_DataFrameType1>.x: org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>> @JvmName("_DataFrameType1_x") get() = this["x"] as org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>>
val org.jetbrains.kotlinx.dataframe.DataRow<_DataFrameType1>.x: kotlin.collections.List<kotlin.Int> @JvmName("_DataFrameType1_x") get() = this["x"] as kotlin.collections.List<kotlin.Int>
df.cast<_DataFrameType1>()

Executing:

val df = res10
1

Executing:

1

Executing:

@DataSchema(isOpen = false)
interface _DataFrameType2 : Line_4.B

val org.jetbrains.kotlinx.dataframe.ColumnsContainer<_DataFrameType2>.x: org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>> @JvmName("_DataFrameType2_x") get() = this["x"] as org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>>
val org.jetbrains.kotlinx.dataframe.DataRow<_DataFrameType2>.x: kotlin.collections.List<kotlin.Int> @JvmName("_DataFrameType2_x") get() = this["x"] as kotlin.collections.List<kotlin.Int>
df.cast<_DataFrameType2>()

Executing:

val df = res13

null cannot be cast to non-null

If there are nullable values in the dataframe, then an NPE is thrown when using "update".
Example:

val df = dataFrameOf("name", "value")("Alice", 1, null, 2)
df.update { name }.at(0).with("ALICE")

I get:

Caused by: java.lang.NullPointerException: null cannot be cast to non-null type kotlin.String
	at Line_346$$special$$inlined$with$1.invoke(update.kt:77)
	at Line_346$$special$$inlined$with$1.invoke(update.kt)
	at org.jetbrains.dataframe.UpdateKt$doUpdate$$inlined$map$lambda$1.invoke(update.kt:48)
	at org.jetbrains.dataframe.UpdateKt$doUpdate$$inlined$map$lambda$1.invoke(update.kt)
	at org.jetbrains.dataframe.ForEachKt.forEach(forEach.kt:19)
	at org.jetbrains.dataframe.UpdateKt.doUpdate(update.kt:47)

Read hierarchical dataframe from csv

We can write a hierarchical dataframe to CSV, but when we read it back, we get a flat dataframe whose columns are all of String type:

val df = mapOf("name" to listOf("a","b","b","c","a","a","b","b","c","a"),
              "number" to listOf(1,2,3,1,3,2,3,2,1,3)).toDataFrame()

val df1 = df.groupBy("name").pivot("number", inward = false).aggregate { 
                count() into "count"
                mean() into "mean"
            }

If we look at the df1 schema, it looks like this:

name: String
1:
    count: Int?
    mean:
        number: Double?
3:
    count: Int?
    mean:
        number: Double?
2:
    count: Int?
    mean:
        number: Double?

Let's write df1 to csv and read from it:

df1.writeCSV("sample.csv")

DataFrame.read("sample.csv").schema().print()

The schema of the dataframe read back from CSV is:

name: String
1: String
3: String
2: String

Support Multiplatform

Most of the library code is common; the exceptions are the IO parts and the Jupyter integration. We could support KMP (at least Kotlin/JS) for this library.

Display more than 20 DataFrame rows in Kotlin Jupyter

When using take(n), takeLast(n) or head(n) with n > 20, the output is still truncated to 20 rows.

Displayed tables are showing a limit of 20 rows with the note ... only showing top 20 of 25 rows.

Is it possible to configure this library to display more than 20 rows in Kotlin Jupyter?
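
If I understand the Jupyter integration correctly, something like the following could work (treat the property path as an assumption, not confirmed API):

// after %use dataframe in a notebook cell
dataFrameConfig.display.rowsLimit = 50
df.take(25) // would then render all 25 rows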

Multiply two columns and save to another column

Hi guys,

thank you for the great library. Maybe I just can't find this in the docs (sorry if that's the case), but how can I, e.g., multiply two columns or take a square root?
I have two columns with double values, count1 and count2, and I want to do:

df.map { count1 * count2 into "mul" }
// and
df.map { sqrt(count1 * count2) into "geomMean" }

The only thing I can do is multiply by a fixed number:

df.map { count1 * 2 into "smth" }

Thanks

Add df.info() method

In Pandas there's a df.info() method that prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

With a single method, I can see the number of rows, the number of columns, and info about them, such as types.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

In the case of our Kotlin DataFrame, it would be valuable to see at a glance the types of the different columns. For instance, if I have a few columns that I expect to contain Int values, and I can see that those columns contain Strings, this can mean there were missing values in them. Having a single method to get such an overview of all columns at once could be as helpful as the head() method.
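
A rough sketch of what such a method could look like in Kotlin (entirely hypothetical; info() is not part of the library):

// hypothetical extension: prints shape plus per-column type and null counts
fun AnyFrame.info() {
    println("rows: ${rowsCount()}, columns: ${columnsCount()}")
    columns().forEach { col ->
        val nulls = col.values().count { it == null }
        println("${col.name()}: ${col.type()}, nulls: $nulls")
    }
}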

Dataframe not reading embedded JSON primitive

Hey folks! A picture is worth a thousand words on this one:

[screenshot: Screen Shot 2022-06-24 at 10 25 06 AM]

I first noticed this when reading a file, but I reduced it to the simplest possible example. It's a pity, as the functionality is otherwise amazing.

Cannot access class CSVFormat when writing to a csv file

The project declares only the following dependency on this library:

implementation("org.jetbrains.kotlinx:dataframe:0.8.0-rc-2")

When I try to compile a file that has the following line:

dataFrame.writeCSV(benchmarkCsvFile)

compilation fails with the following message:

Cannot access class 'org.apache.commons.csv.CSVFormat'. Check your module classpath for missing or conflicting dependencies
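
A possible workaround until this is fixed is to add Apache Commons CSV to the project directly (the version below is an assumption; any version providing CSVFormat should do):

dependencies {
    implementation("org.jetbrains.kotlinx:dataframe:0.8.0-rc-2")
    implementation("org.apache.commons:commons-csv:1.9.0") // makes CSVFormat visible to the compiler
}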

Windowed operations (rolling average, etc.)

Hello,

Kotlin offers the windowed function over iterables to create sub-lists of items along the data. It can be used to calculate various metrics, such as moving averages.

Is this use case supported in dataframe right now? Is it being considered? It would be great to have a pretty way (even less tricky than using windowed) to quickly calculate moving averages or other functions over rolling windows.
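
For reference, a minimal stdlib-only sketch of a rolling average using windowed (plain Kotlin, not DataFrame API):

val prices = listOf(10.0, 12.0, 11.0, 13.0, 14.0)
val movingAvg = prices.windowed(size = 3) { it.average() }
// movingAvg == [11.0, 12.0, 12.666666666666666]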

dataframe writeCSV adds blank line

Whenever I output a csv, a blank line is always written at the top of the file. This happens whether I'm creating a dataframe from a list of objects, or reading a csv file into a dataframe, processing it in some manner, and writing it back out. I always get a blank line as the first line. Is this intentional? How can I prevent it?

I am running a Gradle project, using dataframe:0.8.0-dev-952.
I have also seen this with the rc-7 build.

Easiest way to reproduce:

  1. Read any comma-separated file into a dataframe,
  2. Write the dataframe out to another file

val inputFile = "..." // path elided in the original report
val myDf = DataFrame.readCSV(inputFile)

val outputFile = "..." // path elided in the original report
myDf.writeCSV(outputFile)

Create DataFrame from list of rows where each row is Map

Hi guys,

it would be nice to add a method for creating a DataFrame from a list of rows represented as plain Maps. Right now when I do:

val rows : List<Map<String, Any?>>

val df = rows.toDataFrame()

I get a weird result - a DataFrame with columns derived from the properties of the Map class. It would be more intuitive to get a DataFrame with columns derived from the keys of the Maps. Does that make sense to you?
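
A workaround sketch, building columns from the map keys (this assumes every map has the same keys, and that a dataFrameOf overload taking a header plus a fill lambda exists as I recall - treat both as assumptions):

val keys = rows.first().keys.toList()
val df = dataFrameOf(keys) { key -> rows.map { it[key] } }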

java.lang.ClassFormatError: Duplicate method name "_DataFrameType_x" with signature "(Lorg.jetbrains.kotlinx.dataframe.ColumnsContainer;)Lorg.jetbrains.kotlinx.dataframe.DataColumn;"

Jupyter notebook:

%use dataframe
@DataSchema
interface A {
    val x : List<*>
}


@DataSchema
interface B: A 

@DataSchema
interface C: A 
val df = dataFrameOf("x")(listOf(1))

The problem is found in one of the loaded libraries: check library converters (fields callbacks)
Error compiling code:

@DataSchema(isOpen = false)
interface _DataFrameType : Line_6.B, Line_6.C

val org.jetbrains.kotlinx.dataframe.ColumnsContainer<_DataFrameType>.x: org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>> @JvmName("_DataFrameType_x") get() = this["x"] as org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>>
val org.jetbrains.kotlinx.dataframe.DataRow<_DataFrameType>.x: kotlin.collections.List<kotlin.Int> @JvmName("_DataFrameType_x") get() = this["x"] as kotlin.collections.List<kotlin.Int>
val org.jetbrains.kotlinx.dataframe.ColumnsContainer<_DataFrameType>.x: org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>> @JvmName("_DataFrameType_x") get() = this["x"] as org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>>
val org.jetbrains.kotlinx.dataframe.DataRow<_DataFrameType>.x: kotlin.collections.List<kotlin.Int> @JvmName("_DataFrameType_x") get() = this["x"] as kotlin.collections.List<kotlin.Int>
df.cast<_DataFrameType>()

Reading csv file produces ArrayIndexOutOfBoundsException

I am getting the following exception when reading a CSV file:

java.lang.ArrayIndexOutOfBoundsException: Index 10 out of bounds for length 10
        at org.apache.commons.csv.CSVRecord.get(CSVRecord.java:89)
        at org.jetbrains.kotlinx.dataframe.io.CsvKt.readDelim(csv.kt:292)
        at org.jetbrains.kotlinx.dataframe.io.CsvKt.readDelim(csv.kt:224)
        at org.jetbrains.kotlinx.dataframe.io.CsvKt.readCSV(csv.kt:111)
        at org.jetbrains.kotlinx.dataframe.io.CsvKt.readCSV$default(csv.kt:100)

CSV file content (note that the rows from "warm-up build #5" onward contain only 10 comma-separated values, while the header row has 11 - matching the out-of-bounds index in the stack trace):

scenario,Spring server clean build,Spring server clean build,Spring client clean build,Spring client clean build,Ktor client clean build,Ktor client clean build,Incremental Spring server build with ABI change in FederatedSchemaGenerator,Incremental Spring server build with ABI change in FederatedSchemaGenerator,Incremental Spring client build with ABI change in GraphQLClient,Incremental Spring client build with ABI change in GraphQLClient
version,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4
tasks,:graphql-kotlin-spring-server:assemble,:graphql-kotlin-spring-server:assemble,:graphql-kotlin-spring-client:assemble,:graphql-kotlin-spring-client:assemble,:graphql-kotlin-ktor-client:assemble,:graphql-kotlin-ktor-client:assemble,:graphql-kotlin-spring-server:assemble,:graphql-kotlin-spring-server:assemble,:graphql-kotlin-spring-server:assemble,:graphql-kotlin-spring-server:assemble
value,total execution time,task start,total execution time,task start,total execution time,task start,total execution time,task start,total execution time,task start
warm-up build #1,29872,1220,9190,1109,12037,1101,28652,7446,21952,7762
measured build #1,9214,496,2843,719,3206,546,1659,744,1014,739
measured build #2,8137,492,2286,501,2432,476,1341,604,841,602
measured build #3,7602,487,1896,467,2134,448,1335,568,801,587
warm-up build #5,,1802,455,2057,404,1183,510,780,588
warm-up build #6,,1592,391,2095,384,1346,549,738,563
measured build #1,,1729,407,1895,378,1291,630,680,516
measured build #2,,1831,527,1915,394,1047,451,667,505
measured build #3,,1901,422,1908,352,1010,436,710,525
measured build #4,,1577,391,1705,354,1050,455,668,481
measured build #5,,1428,344,1660,327,987,445,612,459
measured build #6,,1408,357,1619,303,1012,424,645,494
measured build #7,,1307,324,1691,337,906,392,607,445
measured build #8,,1288,337,1696,317,987,425,564,427
measured build #9,,1337,331,1589,307,916,400,566,429
measured build #10,,1246,286,1459,299,1038,484,640,492

kotlin-dataframe version: 0.8.0-rc-7

Document API

There is currently no documentation for interfaces like DataColumn. That makes it hard to add custom implementations for them. For example, it is not clear what ndistinct means.

Possible to read csv to nested data class hierarchy?

Is it possible to have something like this:

data class Parent(
  val child: Child
)

data class Child(
  val age: Int
)

val df = DataFrame
        .readCSV(inputCsv)
        .convertTo<Parent>()

Maybe there is some way to specify how the fields are nested using the headers = parameter?
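
One possible direction (a sketch; I haven't confirmed it works for this case) is to group the flat columns into a column group first and then convert:

val df = DataFrame.readCSV(inputCsv)
    .group("age").into("child") // nest the flat "age" column under a "child" group
    .convertTo<Parent>()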

Arrow Support

Hi, I can't find any indication that dataframe supports Arrow as an internal serialization format / backend.

Is this something you're working on?

Enhance support for @DataSchema in gradle projects

Currently, a KSP preprocessor generates the DataSchema boilerplate.
We want to address the following:

  1. You need to build a project to generate boilerplate. We want the same experience as kotlinx.serialization (generate as you type, rename / move refactorings)
  2. Better error reporting. Report warning for local classes annotated with @DataSchema
  3. Add companion object to a data schema interface automatically #113
  4. Resolution of the generated code without extra gradle configurations on the user side.

If 1 and 4 are solved on the KSP side with the release of K2, then we only need a small compiler plugin for 2 and 3. Otherwise, codegen should be moved to a compiler plugin as well.

Support both `text/html` and `text/plain` for Jupyter output

Standard Jupyter output is a JSON object that has the following structure:

{
    mimetype1: value1,
    mimetype2: value2,
    ...
}
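
For example, a dataframe render could produce both representations (illustrative values):

{
    "text/html": "<table>...</table>",
    "text/plain": "   a  b\n0  1  2"
}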

A Jupyter client chooses one of these mimetypes for rendering. Web-based clients support both text/html and text/plain, but prefer HTML over plain text. Console clients support only text/plain, so they choose it. The Jupyter client in DataSpell/IDEA currently requires BOTH - this is a bug and I filed an issue about it: https://youtrack.jetbrains.com/issue/DS-2920. But it really makes sense to pass both variants. BTW, it's not hard to achieve, as we already have support for rendering to both HTML and plain text. We only need some modifications to the HtmlData class, I believe.

Just a note: pandas passes both HTML and plain data for tables, take a look at this notebook, for example: https://raw.githubusercontent.com/StrikingLoo/pandas_workshop/master/generate_dataset.ipynb

Change function signatures for reading csv files

Change the signatures of the functions for reading csv files.

  1. Hide Apache Commons.
    When trying to read a csv file without headers in Jupyter, you need to import CSVFormat, and it looks like this:

%use dataframe
import org.apache.commons.csv.CSVFormat
val df = DataFrame.readCSV("path_to_file", format = CSVFormat.DEFAULT.withHeader("version", "downloads", "freq").withIgnoreSurroundingSpaces())

  2. Change encoding: String to encoding: Charset.
    Then it will be possible to use kotlin.text.Charsets or java.nio.charset.StandardCharsets.

Pivot table horizontal grouping, is it supported?

Thanks for creating this great library 👍 !

I followed this documentation to create a pivot table.

Is it possible to make a horizontal grouping like in this example (add gender to rows)?

pivot { col1 then col2 }.groupBy { row1 and row2 }

col1 and col2 are grouped hierarchically, but row1 and row2 are not.

Remove CSVFormat from public api of writeCSV

CSVFormat is part of Apache Commons CSV, which is not exported as API, so its symbols are not resolved in users' projects. As a result, you can't pass a format argument to writeCSV unless your project has a direct dependency on Apache Commons CSV.

Add primitive array column wrappers

Primitive array columns are required for optimized big-data applications. It is also possible to add numerical DataFrame integration with MultiK or KMath.
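
For illustration, a minimal sketch of the idea (names invented; a real implementation would plug into the DataColumn hierarchy):

// hypothetical unboxed column backed by an IntArray
class IntArrayColumn(val name: String, private val data: IntArray) {
    operator fun get(index: Int): Int = data[index] // no boxing
    val size: Int get() = data.size
}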

Move Arrow support to separate module

With this change, we can no longer use an enum for all supported formats. Instead, I propose declaring a SupportedFormat interface and loading all implementations via ServiceLoader.
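
A minimal sketch of that pattern (the interface members are illustrative, not a design):

import java.io.InputStream
import java.util.ServiceLoader

interface SupportedFormat {
    fun matchesExtension(ext: String): Boolean
    fun read(stream: InputStream): AnyFrame
}

// each IO module would register its implementation in META-INF/services
val formats: List<SupportedFormat> = ServiceLoader.load(SupportedFormat::class.java).toList()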

Remove dataframe annotation preprocessor from root module dependencies

Due to Gradle limitations, and the fact that the preprocessor depends on the DataFrame library itself, the preprocessor cannot be resolved in the root module once the library version is upgraded to 0.9.0.
The proposed solution is to move the tests that use the preprocessor to the dataframe-tests module. For DataSchema interfaces inside main, we should write accessors manually.

Columns are in a different order

data class Person(val name: String, val age: Int)
val persons = listOf(Person("Alice", 20), Person("Bob", 23))
persons.toDataFrame()

actual:

age name
20 Alice
23 Bob

expected:

name age
Alice 20
Bob 23

Gradle dependency doesn't work.

The artifact org.jetbrains.kotlin:dataframe:0.7.3 is not found.
The artifact org.jetbrains.kotlinx:dataframe:0.7.3-dev-275 is found, but its transitive dependency com.github.jkcclemens:khttp is again not found.

Order properties that are present in the class primary constructor

We cannot infer the property order at runtime for this class:

class A {
    val x: Int = 0
    val b: String = ""
}

But we can do it for this class:

data class B(val x: Int, val b: String)

To achieve this, we can list the primary constructor parameter names and map them to property names:

KClass<*>.primaryConstructor?.parameters?.map { it.name }
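
Expanding that into a runnable sketch with kotlin-reflect (the helper name is invented):

import kotlin.reflect.KClass
import kotlin.reflect.full.memberProperties
import kotlin.reflect.full.primaryConstructor

// orders properties by their position in the primary constructor, if one exists
fun KClass<*>.orderedProperties() =
    primaryConstructor?.parameters?.mapNotNull { param ->
        memberProperties.firstOrNull { it.name == param.name }
    } ?: memberProperties.toList()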
