dataframe's People

Contributors

adriankhl, alllex, ark-1, belovrv, cmelchior, devcrocod, ermolenkodev, fb64, ileasile, jimexist, jmrsnt, jolanrensen, kantis, koperagen, kopilov, lananovikova10, leandroc89, matthewwiese, mgroth0, nikitinas, njacobs5074, pacher, poslavskysv, rayshade, sarahhaggarty, sullis, tairinjane, vhuc, zaleslaw

dataframe's Issues

Data aggregation bug

Example:

df.groupBy { A }.aggregate { mean() into "mean" }

This only works when the column is specified explicitly; selecting it by String name fails:

df.groupBy { A }.aggregate { mean("C") into "mean" }

Review exceptions

Add meaningful messages to all thrown exceptions and check exception types.

expose `DataFrameSchema` types

Hey there, thanks for this project. It looks very much like what I was looking for.

The feature I'm looking for is the ability to query the schema of a DataFrame. I need this to convert the DataFrame to a LeapFrame (the dataframe in the MLeap project) to create input for an ML model. The LeapFrame represents the column schema with its own type, which is basically a name-and-type tuple. The type is represented as DataType, which is a compound of isNullable: Bool and the DataShape, which is either a Scalar, Tensor or List, and BasicType, which is one of Int, Long, Double, String, etc.

I was looking at DataFrameSchema, which is a good approximation of the LeapFrame schema. The only problem is that this Kotlin type is part of an internal package and not public. Would it be possible to make DataFrameSchema (and the related types) available in the public API?
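
For reference, a minimal sketch of the desired usage, assuming schema() and its types were made public (the columns property here reflects my understanding of the current internal API):

val schema = df.schema() // DataFrameSchema
for ((name, columnSchema) in schema.columns) {
    // columnSchema carries the Kotlin type and nullability needed to build a LeapFrame DataType
    println("$name -> $columnSchema")
}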

Generate enum classes as part of DataSchema

Imagine a CSV with a "day_of_week" column containing string values like "monday", "friday", etc. If you could convert this column to an enum, you could use code completion to, for example, filter on it.
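
For illustration, the generated code might look something like this (a purely hypothetical sketch; the interface name and enum values are invented):

// hypothetical generated enum and schema
enum class DayOfWeek { MONDAY, FRIDAY /* ... */ }

@DataSchema
interface Schedule {
    val day_of_week: DayOfWeek
}

// completion-friendly filtering:
df.filter { day_of_week == DayOfWeek.MONDAY }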

It can be done the same way as generating data schemas:

  1. after cell execution in the notebooks
  2. on data schema import in gradle project

There are some design questions:

  1. What if I don't need an enum?

  2. What about normalization? "monday" and "Monday" aren't the same thing.
    In Jupyter, you can normalize values however you want and get a nice enum.
    In a Gradle project, code generation happens once, at build time, so your values have to be normalized up front. How?

  3. How many values are too many for an enum?

  4. What if not all possible values are present in the column? What should happen if the generated schema knows about 2 enum values, but the actual column at runtime has more?

Insert column after actually inserts it before

I have the following dataframe:

⌌---------------------------------------------------------------------------------------------------------------⌍
|  |                  scenario| Configuration: 1.6.20| Execution: 1.6.20| Configuration: 1.7.0| Execution: 1.7.0|
|--|--------------------------|----------------------|------------------|---------------------|-----------------|
| 0| Spring server clean build|                   454|              7483|                  528|             6990|
⌎---------------------------------------------------------------------------------------------------------------⌏

And I want to insert two more columns, so I call this for both of them (the names differ):

df.insert("Configuration diff from stable release") {
        val stableReleaseConfiguration = column<Int>("Configuration: 1.6.20").getValue(this)
        val currentReleaseConfiguration = column<Int>("Configuration: 1.7.0").getValue(this)
        val percent = currentReleaseConfiguration * 100 / stableReleaseConfiguration
        "${percent}%"
}
.after("Configuration: 1.7.0")

But the resulting dataframe is:

⌌-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------⌍
|  |                  Scenario| Configuration diff from stable release| Configuration: 1.6.20| Configuration: 1.7.0| Execution diff from stable release| Execution: 1.6.20| Execution: 1.7.0|
|--|--------------------------|---------------------------------------|----------------------|---------------------|-----------------------------------|------------------|-----------------|
| 0| Spring server clean build|                                   116%|                   454|                  528|                                93%|              7483|             6990|
⌎-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------⌏

It seems .after(...) actually acts as .before(...).

kotlin-dataframe version is 0.8.0-dev-952

Support Excel I/O

Expected arguments:

writeExcel:

  • columns: ColumnSelector - columns to write
  • sheetName: String - name of sheet
  • naStr: String - missing data representation
  • header: Boolean - whether to write header

readExcel:

  • path: String/URL/File
  • sheetName: String - name of the sheet to read
  • sheetIndex: Int - index of the sheet to read
  • rowsCount: Int - number of rows to read
  • columns: String - comma-separated list of Excel column letters and column ranges (e.g. "A:E" or "A,C,E:F")
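
A sketch of how the proposed functions could be used (argument names as listed above; the overall shape is a suggestion, not a final design):

// hypothetical usage of the proposed API
df.writeExcel("report.xlsx", sheetName = "Results", naStr = "", header = true)
val df2 = DataFrame.readExcel("report.xlsx", sheetName = "Results", columns = "A:E")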

Plugin dependency broken for 0.8.0-rc-3

Hi,

thank you for the awesome library. I have a problem when using Gradle:

plugins {
    id("org.jetbrains.kotlin.plugin.dataframe") version "0.8.0-rc-3"
}

dependencies {
    implementation("org.jetbrains.kotlinx:dataframe:0.8.0-rc-3")
}

I get the following error:

* What went wrong:
Execution failed for task ':kspKotlin'.
> Error while evaluating property 'filteredArgumentsMap' of task ':kspKotlin'
   > Failed to query the value of task ':kspKotlin' property 'options'.
      > Could not resolve all files for configuration ':kspKotlinProcessorClasspath'.
         > Could not find org.jetbrains.kotlinx.dataframe:symbol-processor:0.8.0-dev-813.
           Searched in the following locations:
             - https://repo.maven.apache.org/maven2/org/jetbrains/kotlinx/dataframe/symbol-processor/0.8.0-dev-813/symbol-processor-0.8.0-dev-813.pom
             - https://jitpack.io/org/jetbrains/kotlinx/dataframe/symbol-processor/0.8.0-dev-813/symbol-processor-0.8.0-dev-813.pom
           Required by:
               project :

With version dev-808 it works fine.

Also, it may be worth moving the plugin under kotlinx, because right now it's confusing:

org.jetbrains.kotlin.plugin.dataframe
org.jetbrains.kotlinx:dataframe

Simplify DataFrame interface?

The primary DataFrame interface seems over-complicated. A lot of methods have only default implementations and could be moved to extensions. I propose to simplify it significantly, as I've done here. That would make it easier to add and maintain features; for example, it would allow adding row-based DataFrames.

Generate constructor for schema interface

Support a GenerateConstructor annotation on the companion object of a schema interface to generate a default implementation of that interface and an append overload for DataFrame:

@DataSchema
interface Record {
    val a: Int
    val b: Int

    @GenerateConstructor
    companion object
}

// region Generated Code

operator fun Record.Companion.invoke(a: Int, b: Int): Record =
    object: Record {
        override val a = a
        override val b = b
    }

fun DataFrame<Record>.append(vararg rows: Record) = concat(rows.asIterable().toDataFrame())

// endregion

// usage:

listOf(Record(1,2), Record(3,4))
  .toDataFrame()
  .append(Record(5,6))
  .add("sum") { a + b }

Documentation for `NA` term

This term is used in valueCounts and statistical operations: std, mean, varianceAndMean
https://kotlin.github.io/dataframe/valuecounts.html
The documentation does not explain what it means. What's the difference from NaN?
In the following function, skipNA is used when the value is NaN:

public fun Sequence<Float>.mean(skipNA: Boolean = skipNA_default): Double {
    var count = 0
    var sum: Double = 0.toDouble()
    for (element in this) {
        if (element.isNaN()) {
            if (skipNA) continue
            else return Double.NaN
        }
        sum += element
        count++
    }
    return if (count > 0) sum / count else Double.NaN
}

Broken build for Java 8 target

Hi,

version 0.8.0-rc-9 breaks builds for Java 8 targets:

➜  git:(master) ✗ ./gradlew clean build
> Task :kspKotlin FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':kspKotlin'.
> Error while evaluating property 'filteredArgumentsMap' of task ':kspKotlin'
   > Could not resolve all files for configuration ':compileClasspath'.
      > Could not resolve org.jetbrains.kotlinx:dataframe:0.8.0-rc-9.
        Required by:
            project :
         > No matching variant of org.jetbrains.kotlinx:dataframe:0.8.0-rc-9 was found. The consumer was configured to find an API of a library compatible with Java 8, preferably in the form of class files, preferably optimized for standard JVMs, and its dependencies declared externally, as well as attribute 'org.jetbrains.kotlin.platform.type' with value 'jvm' but:
             - Variant 'apiElements' capability org.jetbrains.kotlinx:dataframe:0.8.0-rc-9 declares an API of a library, packaged as a jar, preferably optimized for standard JVMs, and its dependencies declared externally, as well as attribute 'org.jetbrains.kotlin.platform.type' with value 'jvm':
                 - Incompatible because this component declares a component compatible with Java 11 and the consumer needed a component compatible with Java 8
             - Variant 'javadocElements' capability org.jetbrains.kotlinx:dataframe:0.8.0-rc-9 declares a runtime of a component, and its dependencies declared externally:
                 - Incompatible because this component declares documentation and the consumer needed a library
                 - Other compatible attributes:
                     - Doesn't say anything about its target Java environment (preferred optimized for standard JVMs)
                     - Doesn't say anything about its target Java version (required compatibility with Java 8)
                     - Doesn't say anything about its elements (required them preferably in the form of class files)
                     - Doesn't say anything about org.jetbrains.kotlin.platform.type (required 'jvm')
             - Variant 'runtimeElements' capability org.jetbrains.kotlinx:dataframe:0.8.0-rc-9 declares a runtime of a library, packaged as a jar, preferably optimized for standard JVMs, and its dependencies declared externally, as well as attribute 'org.jetbrains.kotlin.platform.type' with value 'jvm':
                 - Incompatible because this component declares a component compatible with Java 11 and the consumer needed a component compatible with Java 8
             - Variant 'sourcesElements' capability org.jetbrains.kotlinx:dataframe:0.8.0-rc-9 declares a runtime of a component, and its dependencies declared externally:
                 - Incompatible because this component declares documentation and the consumer needed a library
                 - Other compatible attributes:
                     - Doesn't say anything about its target Java environment (preferred optimized for standard JVMs)
                     - Doesn't say anything about its target Java version (required compatibility with Java 8)
                     - Doesn't say anything about its elements (required them preferably in the form of class files)
                     - Doesn't say anything about org.jetbrains.kotlin.platform.type (required 'jvm')

Is it possible to fix this? Many production environments still use Java 8 and have no way to upgrade to a newer JVM.

Excessive schema interface generation

%use dataframe
@DataSchema
interface A { val x: List<*> }

@DataSchema
interface B: A
val df = dataFrameOf("x")(listOf(1))
%trackExecution
1

Executing:

1

Executing:

@DataSchema(isOpen = false)
interface _DataFrameType1 : Line_4.B

val org.jetbrains.kotlinx.dataframe.ColumnsContainer<_DataFrameType1>.x: org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>> @JvmName("_DataFrameType1_x") get() = this["x"] as org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>>
val org.jetbrains.kotlinx.dataframe.DataRow<_DataFrameType1>.x: kotlin.collections.List<kotlin.Int> @JvmName("_DataFrameType1_x") get() = this["x"] as kotlin.collections.List<kotlin.Int>
df.cast<_DataFrameType1>()

Executing:

val df = res10
1

Executing:

1

Executing:

@DataSchema(isOpen = false)
interface _DataFrameType2 : Line_4.B

val org.jetbrains.kotlinx.dataframe.ColumnsContainer<_DataFrameType2>.x: org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>> @JvmName("_DataFrameType2_x") get() = this["x"] as org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>>
val org.jetbrains.kotlinx.dataframe.DataRow<_DataFrameType2>.x: kotlin.collections.List<kotlin.Int> @JvmName("_DataFrameType2_x") get() = this["x"] as kotlin.collections.List<kotlin.Int>
df.cast<_DataFrameType2>()

Executing:

val df = res13

null cannot be cast to non-null

If there are nullable values in the dataframe, then an NPE is thrown when using "update".
Example:

val df = dataFrameOf("name", "value")("Alice", 1, null, 2)
df.update { name }.at(0).with("ALICE")

I get:

Caused by: java.lang.NullPointerException: null cannot be cast to non-null type kotlin.String
	at Line_346$$special$$inlined$with$1.invoke(update.kt:77)
	at Line_346$$special$$inlined$with$1.invoke(update.kt)
	at org.jetbrains.dataframe.UpdateKt$doUpdate$$inlined$map$lambda$1.invoke(update.kt:48)
	at org.jetbrains.dataframe.UpdateKt$doUpdate$$inlined$map$lambda$1.invoke(update.kt)
	at org.jetbrains.dataframe.ForEachKt.forEach(forEach.kt:19)
	at org.jetbrains.dataframe.UpdateKt.doUpdate(update.kt:47)

Read hierarchical dataframe from csv

We can write a hierarchical dataframe to CSV, but when we read it back, we get a flat dataframe whose columns are all of String type:

val df = mapOf("name" to listOf("a","b","b","c","a","a","b","b","c","a"),
              "number" to listOf(1,2,3,1,3,2,3,2,1,3)).toDataFrame()

val df1 = df.groupBy("name").pivot("number", inward = false).aggregate { 
                count() into "count"
                mean() into "mean"
            }

If we look at the df1 schema, it looks like this:

name: String
1:
    count: Int?
    mean:
        number: Double?
3:
    count: Int?
    mean:
        number: Double?
2:
    count: Int?
    mean:
        number: Double?

Let's write df1 to csv and read from it:

df1.writeCSV("sample.csv")

DataFrame.read("sample.csv").schema().print()

The schema of the dataframe read back from CSV is:

name: String
1: String
3: String
2: String

Support Multiplatform

Most of the library code is common; the exceptions are the IO parts and the Jupyter integration. We could support KMP (at least Kotlin/JS) for this library.

Display more than 20 DataFrame rows in Kotlin Jupyter

When using take(n), takeLast(n) or head(n) with n > 20, the output is still truncated to 20 rows.

Displayed tables are showing a limit of 20 rows with the note ... only showing top 20 of 25 rows.

Is it possible to configure this library to display more than 20 rows in Kotlin Jupyter?
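
If I understand the Jupyter integration correctly, something like the following could work (treat the property path as an assumption, not confirmed API):

// after %use dataframe in a notebook cell
dataFrameConfig.display.rowsLimit = 50
df.take(25) // would then render all 25 rows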

Multiply two columns and save to another column

Hi guys,

thank you for the great library. Maybe I just can't find this in the docs (sorry if that's the case), but how can I, e.g., multiply two columns or take a square root?
I have two columns with double values, count1 and count2, and I want to do:

df.map { count1 * count2 into "mul" }
// and
df.map { sqrt(count1 * count2) into "geomMean" }

The only thing I can do is multiply by a fixed number:

df.map { count1 * 2 into "smth" }

Thanks

Add df.info() method

In Pandas there's a df.info() method that prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

With a single method, I can see the number of rows, the number of columns, and info about them, such as types.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html

In the case of our Kotlin DataFrame, it would be valuable to see at a glance the types of the different columns. For instance, if I have a few columns that I expect to contain Int values, and I can see that those columns contain Strings, this can mean there were missing values in them. Having a single method to get such an overview of all columns at once could be as helpful as the head() method.
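
A rough sketch of what such a method could look like in Kotlin (entirely hypothetical; info() is not part of the library):

// hypothetical extension: prints shape plus per-column type and null counts
fun AnyFrame.info() {
    println("rows: ${rowsCount()}, columns: ${columnsCount()}")
    columns().forEach { col ->
        val nulls = col.values().count { it == null }
        println("${col.name()}: ${col.type()}, nulls: $nulls")
    }
}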

Dataframe not reading embedded JSON primitive

Hey folks! A picture is worth a thousand words on this one:

[screenshot: Screen Shot 2022-06-24 at 10 25 06 AM]

I first noticed this when reading a file, but I reduced it to the simplest possible example. It's a pity, as the functionality is otherwise amazing.

Cannot access class CSVFormat when writing to a csv file

The project declares only the following dependency on this library:

implementation("org.jetbrains.kotlinx:dataframe:0.8.0-rc-2")

When I try to compile a file that has the following line:

dataFrame.writeCSV(benchmarkCsvFile)

compilation fails with the following message:

Cannot access class 'org.apache.commons.csv.CSVFormat'. Check your module classpath for missing or conflicting dependencies
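
A possible workaround until this is fixed is to add Apache Commons CSV to the project directly (the version below is an assumption; any version providing CSVFormat should do):

dependencies {
    implementation("org.jetbrains.kotlinx:dataframe:0.8.0-rc-2")
    implementation("org.apache.commons:commons-csv:1.9.0") // makes CSVFormat visible to the compiler
}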

Windowed operations (rolling average, etc.)

Hello,

Kotlin offers the windowed function over iterables to create sub-lists of items along the data. It can be used to calculate various metrics, such as moving averages.

Is this use case supported in dataframe right now? Is it being considered? It would be great to have a pretty way (even less tricky than using windowed) to quickly calculate moving averages or other functions over rolling windows.
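
For reference, a minimal stdlib-only sketch of a rolling average using windowed (plain Kotlin, not DataFrame API):

val prices = listOf(10.0, 12.0, 11.0, 13.0, 14.0)
val movingAvg = prices.windowed(size = 3) { it.average() }
// movingAvg == [11.0, 12.0, 12.666666666666666]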

dataframe writeCSV adds blank line

Whenever I output a csv, a blank line is always written at the top of the file. This happens whether I'm creating a dataframe from a list of objects, or reading a csv file into a dataframe, processing it in some manner, and writing it back out. I always get a blank line as the first line. Is this intentional? How can I prevent it?

I am running a Gradle project, using dataframe:0.8.0-dev-952.
I have also seen this with the rc-7 build.

Easiest way to reproduce:

  1. Read any comma-separated file into a dataframe,
  2. Write the dataframe out to another file

val inputFile = "..." // path elided in the original report
val myDf = DataFrame.readCSV(inputFile)

val outputFile = "..." // path elided in the original report
myDf.writeCSV(outputFile)

Create DataFrame from list of rows where each row is Map

Hi guys,

it would be nice to add a method for creating a DataFrame from a list of rows represented as plain Maps. Right now when I do:

val rows : List<Map<String, Any?>>

val df = rows.toDataFrame()

I get a weird result - a DataFrame with columns derived from the properties of the Map class. It would be more intuitive to get a DataFrame with columns derived from the keys of the Maps. Does that make sense to you?
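
A workaround sketch, building columns from the map keys (this assumes every map has the same keys, and that a dataFrameOf overload taking a header plus a fill lambda exists as I recall - treat both as assumptions):

val keys = rows.first().keys.toList()
val df = dataFrameOf(keys) { key -> rows.map { it[key] } }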

java.lang.ClassFormatError: Duplicate method name "_DataFrameType_x" with signature "(Lorg.jetbrains.kotlinx.dataframe.ColumnsContainer;)Lorg.jetbrains.kotlinx.dataframe.DataColumn;"

Jupyter notebook:

%use dataframe
@DataSchema
interface A {
    val x : List<*>
}


@DataSchema
interface B: A 

@DataSchema
interface C: A 
val df = dataFrameOf("x")(listOf(1))

The problem is found in one of the loaded libraries: check library converters (fields callbacks)
Error compiling code:

@DataSchema(isOpen = false)
interface _DataFrameType : Line_6.B, Line_6.C

val org.jetbrains.kotlinx.dataframe.ColumnsContainer<_DataFrameType>.x: org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>> @JvmName("_DataFrameType_x") get() = this["x"] as org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>>
val org.jetbrains.kotlinx.dataframe.DataRow<_DataFrameType>.x: kotlin.collections.List<kotlin.Int> @JvmName("_DataFrameType_x") get() = this["x"] as kotlin.collections.List<kotlin.Int>
val org.jetbrains.kotlinx.dataframe.ColumnsContainer<_DataFrameType>.x: org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>> @JvmName("_DataFrameType_x") get() = this["x"] as org.jetbrains.kotlinx.dataframe.DataColumn<kotlin.collections.List<kotlin.Int>>
val org.jetbrains.kotlinx.dataframe.DataRow<_DataFrameType>.x: kotlin.collections.List<kotlin.Int> @JvmName("_DataFrameType_x") get() = this["x"] as kotlin.collections.List<kotlin.Int>
df.cast<_DataFrameType>()

Reading csv file produces ArrayIndexOutOfBoundsException

I am getting the following exception when reading a CSV file:

java.lang.ArrayIndexOutOfBoundsException: Index 10 out of bounds for length 10
        at org.apache.commons.csv.CSVRecord.get(CSVRecord.java:89)
        at org.jetbrains.kotlinx.dataframe.io.CsvKt.readDelim(csv.kt:292)
        at org.jetbrains.kotlinx.dataframe.io.CsvKt.readDelim(csv.kt:224)
        at org.jetbrains.kotlinx.dataframe.io.CsvKt.readCSV(csv.kt:111)
        at org.jetbrains.kotlinx.dataframe.io.CsvKt.readCSV$default(csv.kt:100)

CSV file content (note that the rows from "warm-up build #5" onward contain only 10 comma-separated values, while the header row has 11 - matching the out-of-bounds index in the stack trace):

scenario,Spring server clean build,Spring server clean build,Spring client clean build,Spring client clean build,Ktor client clean build,Ktor client clean build,Incremental Spring server build with ABI change in FederatedSchemaGenerator,Incremental Spring server build with ABI change in FederatedSchemaGenerator,Incremental Spring client build with ABI change in GraphQLClient,Incremental Spring client build with ABI change in GraphQLClient
version,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4,Gradle 7.4
tasks,:graphql-kotlin-spring-server:assemble,:graphql-kotlin-spring-server:assemble,:graphql-kotlin-spring-client:assemble,:graphql-kotlin-spring-client:assemble,:graphql-kotlin-ktor-client:assemble,:graphql-kotlin-ktor-client:assemble,:graphql-kotlin-spring-server:assemble,:graphql-kotlin-spring-server:assemble,:graphql-kotlin-spring-server:assemble,:graphql-kotlin-spring-server:assemble
value,total execution time,task start,total execution time,task start,total execution time,task start,total execution time,task start,total execution time,task start
warm-up build #1,29872,1220,9190,1109,12037,1101,28652,7446,21952,7762
measured build #1,9214,496,2843,719,3206,546,1659,744,1014,739
measured build #2,8137,492,2286,501,2432,476,1341,604,841,602
measured build #3,7602,487,1896,467,2134,448,1335,568,801,587
warm-up build #5,,1802,455,2057,404,1183,510,780,588
warm-up build #6,,1592,391,2095,384,1346,549,738,563
measured build #1,,1729,407,1895,378,1291,630,680,516
measured build #2,,1831,527,1915,394,1047,451,667,505
measured build #3,,1901,422,1908,352,1010,436,710,525
measured build #4,,1577,391,1705,354,1050,455,668,481
measured build #5,,1428,344,1660,327,987,445,612,459
measured build #6,,1408,357,1619,303,1012,424,645,494
measured build #7,,1307,324,1691,337,906,392,607,445
measured build #8,,1288,337,1696,317,987,425,564,427
measured build #9,,1337,331,1589,307,916,400,566,429
measured build #10,,1246,286,1459,299,1038,484,640,492

kotlin-dataframe version: 0.8.0-rc-7

Document API

There is currently no documentation for interfaces like DataColumn. That makes it hard to add custom implementations for them. For example, it is not clear what ndistinct means.

Possible to read csv to nested data class hierarchy?

Is it possible to have something like this:

data class Parent(
  val child: Child
)

data class Child(
  val age: Int
)

val df = DataFrame
        .readCSV(inputCsv)
        .convertTo<Parent>()

Maybe there is some way to specify how the fields are nested using the headers = parameter?
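
One possible direction (a sketch; I haven't confirmed it works for this case) is to group the flat columns into a column group first and then convert:

val df = DataFrame.readCSV(inputCsv)
    .group("age").into("child") // nest the flat "age" column under a "child" group
    .convertTo<Parent>()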

Arrow Support

Hi, I can't find any indication that dataframe supports Arrow as an internal serialization format / backend.

Is this something you're working on?

Enhance support for @DataSchema in gradle projects

Currently, a KSP preprocessor generates the DataSchema boilerplate.
We want to address the following:

  1. You need to build a project to generate boilerplate. We want the same experience as kotlinx.serialization (generate as you type, rename / move refactorings)
  2. Better error reporting. Report warning for local classes annotated with @DataSchema
  3. Add companion object to a data schema interface automatically #113
  4. Resolution of the generated code without extra gradle configurations on the user side.

If 1 and 4 are solved on the KSP side with the release of K2, then we only need a small compiler plugin for 2 and 3. Otherwise, codegen should be moved to a compiler plugin as well.

Support both `text/html` and `text/plain` for Jupyter output

Standard Jupyter output is a JSON object that has the following structure:

{
    mimetype1: value1,
    mimetype2: value2,
    ...
}
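
For example, a dataframe render could produce both representations (illustrative values):

{
    "text/html": "<table>...</table>",
    "text/plain": "   a  b\n0  1  2"
}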

A Jupyter client chooses one of these mimetypes for rendering. Web-based clients support both text/html and text/plain, but prefer HTML over plain text. Console clients support only text/plain, so they choose it. The Jupyter client in DataSpell/IDEA currently requires BOTH - this is a bug and I filed an issue about it: https://youtrack.jetbrains.com/issue/DS-2920. But it really makes sense to pass both variants. BTW, it's not hard to achieve, as we already have support for rendering to both HTML and plain text. We only need some modifications to the HtmlData class, I believe.

Just a note: pandas passes both HTML and plain data for tables, take a look at this notebook, for example: https://raw.githubusercontent.com/StrikingLoo/pandas_workshop/master/generate_dataset.ipynb

Change function signatures for reading csv files

Change the signatures of the functions for reading csv files.

  1. Hide Apache Commons.
    When trying to read a csv file without headers in Jupyter, you need to import CSVFormat, and it looks like this:

%use dataframe
import org.apache.commons.csv.CSVFormat
val df = DataFrame.readCSV("path_to_file", format = CSVFormat.DEFAULT.withHeader("version", "downloads", "freq").withIgnoreSurroundingSpaces())

  2. Change encoding: String to encoding: Charset.
    Then it will be possible to use kotlin.text.Charsets or java.nio.charset.StandardCharsets.

Pivot table horizontal grouping, is it supported?

Thanks for creating this great library 👍 !

I followed this documentation to create a pivot table.

Is it possible to make a horizontal grouping like in this example (add gender to rows)?

pivot { col1 then col2 }.groupBy { row1 and row2 }

col1 and col2 are grouped hierarchically, but row1 and row2 are not.

Remove CSVFormat from public api of writeCSV

CSVFormat is part of Apache Commons CSV, which is not exported as API, so its symbols are not resolved in users' projects. As a result, you can't pass a format argument to writeCSV unless your project has a direct dependency on Apache Commons CSV.

Add primitive array column wrappers

Primitive array columns are required for optimized big-data applications. It is also possible to add numerical DataFrame integration with MultiK or KMath.
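
For illustration, a minimal sketch of the idea (names invented; a real implementation would plug into the DataColumn hierarchy):

// hypothetical unboxed column backed by an IntArray
class IntArrayColumn(val name: String, private val data: IntArray) {
    operator fun get(index: Int): Int = data[index] // no boxing
    val size: Int get() = data.size
}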

Move Arrow support to separate module

With this change, we can no longer use an enum for all supported formats. Instead, I propose declaring a SupportedFormat interface and loading all implementations via ServiceLoader.
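
A minimal sketch of that pattern (the interface members are illustrative, not a design):

import java.io.InputStream
import java.util.ServiceLoader

interface SupportedFormat {
    fun matchesExtension(ext: String): Boolean
    fun read(stream: InputStream): AnyFrame
}

// each IO module would register its implementation in META-INF/services
val formats: List<SupportedFormat> = ServiceLoader.load(SupportedFormat::class.java).toList()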

Remove dataframe annotation preprocessor from root module dependencies

Due to Gradle limitations, and the fact that the preprocessor depends on the DataFrame library itself, the preprocessor cannot be resolved in the root module once the library version is upgraded to 0.9.0.
The proposed solution is to move the tests that use the preprocessor to the dataframe-tests module. For DataSchema interfaces inside main, we should write accessors manually.

Columns are in a different order

data class Person(val name: String, val age: Int)
val persons = listOf(Person("Alice", 20), Person("Bob", 23))
persons.toDataFrame()

actual:

age name
20 Alice
23 Bob

expected:

name age
Alice 20
Bob 23

Gradle dependency doesn't work.

The artifact org.jetbrains.kotlin:dataframe:0.7.3 is not found.
The artifact org.jetbrains.kotlinx:dataframe:0.7.3-dev-275 is found, but its transitive dependency com.github.jkcclemens:khttp is again not found.

Order properties that are present in the class primary constructor

We cannot infer the property order at runtime for this class:

class A {
    val x: Int = 0
    val b: String = ""
}

But we can do it for this class:

data class B(val x: Int, val b: String)

To achieve this, we can list the primary constructor parameter names and map them to property names:

KClass<*>.primaryConstructor?.parameters?.map { it.name }
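
Expanding that into a runnable sketch with kotlin-reflect (the helper name is invented):

import kotlin.reflect.KClass
import kotlin.reflect.full.memberProperties
import kotlin.reflect.full.primaryConstructor

// orders properties by their position in the primary constructor, if one exists
fun KClass<*>.orderedProperties() =
    primaryConstructor?.parameters?.mapNotNull { param ->
        memberProperties.firstOrNull { it.name == param.name }
    } ?: memberProperties.toList()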
