csv-dataset

CsvDataset helps to read a csv file and create descriptive and efficient input pipelines for deep learning.

CsvDataset iterates the records of the csv file in a streaming fashion, so the full dataset does not need to fit into memory.

Install

$ pip install csv-dataset

Usage

Suppose we have a csv file whose absolute path is filepath:

open_time,open,high,low,close,volume
1576771200000,7145.99,7150.0,7141.01,7142.33,21.094283
1576771260000,7142.89,7142.99,7120.7,7125.73,118.279931
1576771320000,7125.76,7134.46,7123.12,7123.12,41.03628
1576771380000,7123.74,7128.06,7117.12,7126.57,39.885367
1576771440000,7127.34,7137.84,7126.71,7134.99,25.138154
1576771500000,7134.99,7144.13,7132.84,7141.64,26.467308
...

from csv_dataset import (
    Dataset,
    CsvReader
)

# `filepath` is the absolute path of the csv file shown above
dataset = Dataset(
    CsvReader(
        filepath,
        float,
        # Abandon the first column and only pick the following
        indexes=[1, 2, 3, 4, 5],
        # The first line of the file is a header line
        header=True
    )
).window(3, 1).batch(2)

for element in dataset:
    print(element)

The following shows the output of one print call, which is one batch of two windows:

[[[7145.99,  7150.0,   7141.01,  7142.33,   21.094283]
  [7142.89,  7142.99,  7120.7,   7125.73,  118.279931]
  [7125.76,  7134.46,  7123.12,  7123.12,   41.03628 ]]

 [[7142.89,  7142.99,  7120.7,   7125.73,  118.279931]
  [7125.76,  7134.46,  7123.12,  7123.12,   41.03628 ]
  [7123.74,  7128.06,  7117.12,  7126.57,   39.885367]]]

...

Dataset(reader: AbstractReader)

dataset.window(size: int, shift: int = None, stride: int = 1) -> self

Defines the window size, shift and stride.

The default window size is 1, which means the dataset has no window.

Parameter explanation

Suppose we have a raw data set:

[ 1  2  3  4  5  6  7  8  9 ... ]

The following shows the windows produced by (size=4, shift=3, stride=2):

          |-------------- size:4 --------------|
          |- stride:2 -|                       |
          |            |                       |
win 0:  [ 1            3           5           7  ] --------|-----
                                                       shift:3
win 1:  [ 4            6           8           10 ] --------|-----

win 2:  [ 7            9           11          13 ]

...
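The diagram above can be reproduced with a minimal pure-Python sketch. It does not use csv-dataset itself; the function name `iter_windows` is hypothetical, and `shift` is required here because the meaning of the library's `shift=None` default is not stated in this document.

```python
from typing import Iterator, List, Sequence


def iter_windows(
    data: Sequence[int], size: int, shift: int, stride: int = 1
) -> Iterator[List[int]]:
    """Yield windows of `size` items, picking every `stride`-th item,
    and moving the window start forward by `shift` items each time."""
    # Number of raw items one window stretches over
    span = (size - 1) * stride + 1
    start = 0
    while start + span <= len(data):
        yield [data[start + i * stride] for i in range(size)]
        start += shift


raw = list(range(1, 17))  # [1, 2, ..., 16]
windows = list(iter_windows(raw, size=4, shift=3, stride=2))

for index, window in enumerate(windows):
    print(f"win {index}: {window}")
```

With 16 raw items this prints four windows, the first three of which match win 0, win 1 and win 2 in the diagram.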

dataset.batch(batch: int) -> self

Defines batch size.

The default batch size of the dataset is 1, which means it is single-batch.

If batch is 2:

batch 0:  [[ 1            3           5           7  ]
           [ 4            6           8           10 ]]

batch 1:  [[ 7            9           11          13 ]
           [ 10           12          14          16 ]]

...
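Batching as pictured above amounts to grouping consecutive windows. The following is a minimal sketch in plain Python, not the library's implementation; `iter_batches` is a hypothetical name, and dropping an incomplete trailing batch is an assumption made only for this example.

```python
from typing import Iterable, Iterator, List


def iter_batches(
    windows: Iterable[List[int]], batch: int
) -> Iterator[List[List[int]]]:
    """Group consecutive windows into lists of `batch` windows."""
    current: List[List[int]] = []
    for window in windows:
        current.append(window)
        if len(current) == batch:
            yield current
            current = []
    # Assumption: an incomplete trailing batch is dropped in this sketch.


# The windows of (size=4, shift=3, stride=2) over [1 2 3 ... 16]
windows = [
    [1, 3, 5, 7],
    [4, 6, 8, 10],
    [7, 9, 11, 13],
    [10, 12, 14, 16],
]
batches = list(iter_batches(windows, batch=2))

for index, item in enumerate(batches):
    print(f"batch {index}: {item}")
```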

dataset.get() -> Optional[np.ndarray]

Gets the data of the next batch.

dataset.reset() -> self

Resets the dataset.

dataset.read(amount: int, reset_buffer: bool = False)

Reads multiple batches at a time.

  • amount the maximum length of data the dataset will read
  • reset_buffer if True, the dataset will reset the data of the previous window in the buffer

If reset_buffer is True, the next read will not use existing data in the buffer, and the result will have no overlap with the last read.

dataset.reset_buffer() -> None

Resets the buffer, so that the next read will have no overlap with the last one.

dataset.lines_need(reads: int) -> int

Calculates and returns how many lines of the underlying data are needed to read `reads` times.
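To make the arithmetic concrete, here is a sketch of how such a count could be derived from the window and batch settings. This is an assumption about the calculation, not the library's code: `lines_needed` is a hypothetical function, and it assumes consecutive reads reuse overlapping lines from the buffer.

```python
def lines_needed(
    reads: int, size: int, shift: int, stride: int = 1, batch: int = 1
) -> int:
    """Lines required to produce `reads` batches of `batch` windows each."""
    if reads <= 0:
        return 0
    windows = reads * batch
    # Lines covered by a single window
    span = (size - 1) * stride + 1
    # Each additional window starts `shift` lines after the previous one
    return (windows - 1) * shift + span


one_read = lines_needed(1, size=4, shift=3, stride=2, batch=2)
two_reads = lines_needed(2, size=4, shift=3, stride=2, batch=2)
print(one_read, two_reads)  # prints "10 16"
```

These numbers agree with the batch diagram: batch 0 ends at item 10 and batch 1 ends at item 16.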

dataset.max_reads(max_lines: int) -> int | None

Calculates how many reads `max_lines` lines can afford.

dataset.max_reads() -> int | None

Calculates how many reads the current reader can afford.

If max_lines of the current reader is unset, it returns None.
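The inverse calculation can be sketched the same way. Again this is an illustration under assumptions rather than the library's implementation: `max_reads_for` is a hypothetical function, `None` stands in for an unset limit, and only complete batches are counted.

```python
from typing import Optional


def max_reads_for(
    max_lines: Optional[int],
    size: int,
    shift: int,
    stride: int = 1,
    batch: int = 1,
) -> Optional[int]:
    """Number of complete batches that `max_lines` lines can supply."""
    if max_lines is None:
        # No limit set, so the number of reads is unknown
        return None
    # Lines covered by a single window
    span = (size - 1) * stride + 1
    if max_lines < span:
        return 0
    windows = (max_lines - span) // shift + 1
    return windows // batch


print(max_reads_for(16, size=4, shift=3, stride=2, batch=2))  # prints "2"
print(max_reads_for(15, size=4, shift=3, stride=2, batch=2))  # prints "1"
print(max_reads_for(None, size=4, shift=3, stride=2, batch=2))  # prints "None"
```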

CsvReader(filepath, dtype, indexes, **kwargs)

  • filepath str absolute path of the csv file
  • dtype Callable data type. We should only use float or int for this argument.
  • indexes List[int] column indexes to pick from the lines of the csv file
  • kwargs
    • header bool = False whether the csv file has a header line that should be skipped.
    • splitter str = ',' the column splitter of the csv file
    • normalizer List[NormalizerProtocol] list of normalizers used to normalize each column of data. A NormalizerProtocol should contain two methods: normalize(float) -> float to normalize the given datum, and restore(float) -> float to restore the normalized datum.
    • max_lines int = -1 max lines of the csv file to be read. Defaults to -1 which means no limit.
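As an illustration of the two-method protocol described above, here is a minimal sketch of a normalizer. The class name `MinMaxNormalizer` and its constructor arguments are hypothetical and not part of csv-dataset; only the `normalize` and `restore` method signatures come from the parameter description.

```python
class MinMaxNormalizer:
    """Hypothetical normalizer that scales values from [low, high] to [0, 1]."""

    def __init__(self, low: float, high: float) -> None:
        if high <= low:
            raise ValueError("high must be greater than low")
        self._low = low
        self._range = high - low

    def normalize(self, datum: float) -> float:
        # Map the raw value into the [0, 1] range
        return (datum - self._low) / self._range

    def restore(self, datum: float) -> float:
        # Invert normalize() to get the raw value back
        return datum * self._range + self._low


price = MinMaxNormalizer(7000.0, 8000.0)
scaled = price.normalize(7145.99)
restored = price.restore(scaled)
print(scaled, restored)
```

Since the normalizer kwarg is a list that normalizes each column, one such object per picked column would presumably be supplied, though the exact pairing is not spelled out here.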

reader.reset()

Resets the reader position.

property reader.max_lines

Gets max_lines

setter reader.max_lines = lines

Changes max_lines

reader.readline() -> list

Returns the converted values of the next line.
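The conversion of one line can be pictured with the following sketch, which combines the dtype, indexes and splitter arguments of CsvReader. It is an assumption about the behavior, not the library's code, and `convert_line` is a hypothetical helper.

```python
from typing import Callable, List, Sequence, Union

Number = Union[int, float]


def convert_line(
    line: str,
    dtype: Callable[[str], Number],
    indexes: Sequence[int],
    splitter: str = ",",
) -> List[Number]:
    """Split a csv line, keep the columns in `indexes`, and convert each cell."""
    cells = line.rstrip("\r\n").split(splitter)
    return [dtype(cells[index]) for index in indexes]


# The first data line of the example file, without the open_time column
row = convert_line(
    "1576771200000,7145.99,7150.0,7141.01,7142.33,21.094283\n",
    float,
    [1, 2, 3, 4, 5],
)
print(row)  # prints "[7145.99, 7150.0, 7141.01, 7142.33, 21.094283]"
```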

property reader.lines

Returns the number of lines that have been read.

License

MIT
