
csv2's Introduction

csv2

CSV Reader

#include <csv2/reader.hpp>

int main() {
  csv2::Reader<csv2::delimiter<','>, 
               csv2::quote_character<'"'>, 
               csv2::first_row_is_header<true>,
               csv2::trim_policy::trim_whitespace> csv;
               
  if (csv.mmap("foo.csv")) {
    const auto header = csv.header();
    for (const auto row: csv) {
      for (const auto cell: row) {
        // Do something with cell value
        // std::string value;
        // cell.read_value(value);
      }
    }
  }
}

Performance Benchmark

This benchmark measures the average execution time (of 5 runs after 3 warmup runs) for csv2 to memory-map the input CSV file and iterate over every cell in the CSV. See benchmark/main.cpp for more details.

cd benchmark
g++ -I../include -O3 -std=c++11 -o main main.cpp
./main <csv_file>

System Details

| Type | Value |
|---|---|
| Processor | 11th Gen Intel(R) Core(TM) i9-11900KF @ 3.50GHz 3.50 GHz |
| Installed RAM | 32.0 GB (31.9 GB usable) |
| SSD | ADATA SX8200PNP |
| OS | Ubuntu 20.04 LTS running on WSL in Windows 11 |
| C++ Compiler | g++ (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0 |

Results (as of 23 SEP 2022)

| Dataset | File Size | Rows | Cols | Time |
|---|---|---|---|---|
| Denver Crime Data | 111 MB | 479,100 | 19 | 0.102s |
| AirBnb Paris Listings | 196 MB | 141,730 | 96 | 0.170s |
| 2015 Flight Delays and Cancellations | 574 MB | 5,819,079 | 31 | 0.603s |
| StackLite: Stack Overflow questions | 870 MB | 17,203,824 | 7 | 0.911s |
| Used Cars Dataset | 1.4 GB | 539,768 | 25 | 0.947s |
| Title-Based Semantic Subject Indexing | 3.7 GB | 12,834,026 | 4 | 2.867s |
| Bitcoin tweets - 16M tweets | 4 GB | 47,478,748 | 9 | 3.290s |
| DDoS Balanced Dataset | 6.3 GB | 12,794,627 | 85 | 6.963s |
| Seattle Checkouts by Title | 7.1 GB | 34,892,623 | 11 | 7.698s |
| SHA-1 password hash dump | 11 GB | 262,974,241 | 2 | 10.775s |
| DOHUI NOH scaled_data | 16 GB | 496,782 | 3213 | 16.553s |

Reader API

Here is the public API available to you:

template <class delimiter = delimiter<','>, 
          class quote_character = quote_character<'"'>,
          class first_row_is_header = first_row_is_header<true>,
          class trim_policy = trim_policy::trim_whitespace>
class Reader {
public:
  
  // Use this if you'd like to mmap and read from file
  bool mmap(string_type filename);

  // Use this if you have the CSV contents in std::string already
  bool parse(string_type contents);

  // Shape
  size_t rows() const;
  size_t cols() const;
  
  // Row iterator
  // If first_row_is_header, row iteration will start
  // from the second row
  RowIterator begin() const;
  RowIterator end() const;

  // Access the first row of the CSV
  Row header() const;
};

Here's the Row class:

// Row class
class Row {
public:
  // Get raw contents of the row
  void read_raw_value(Container& value) const;
  
  // Cell iterator
  CellIterator begin() const;
  CellIterator end() const;
};

and here's the Cell class:

// Cell class
class Cell {
public:
  // Get raw contents of the cell
  void read_raw_value(Container& value) const;
  
  // Get converted contents of the cell
  // Handles escaped content, e.g., 
  // """foo""" => ""foo""
  void read_value(Container& value) const;
};
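
To make the escaped-content comment above concrete, here is a minimal standalone sketch of that conversion. It is not csv2's implementation: `collapse_doubled_quotes` is a hypothetical helper that only assumes the behaviour shown in the comment, namely that each pair of consecutive quote characters becomes a single one and nothing else is stripped.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of csv2: copies `raw` while collapsing each
// pair of consecutive quote characters into a single quote character.
std::string collapse_doubled_quotes(const std::string& raw, char quote = '"') {
  std::string out;
  out.reserve(raw.size());
  for (std::size_t i = 0; i < raw.size(); ++i) {
    if (raw[i] == quote && i + 1 < raw.size() && raw[i + 1] == quote) {
      ++i;  // skip the first quote of the pair, keep the second
    }
    out.push_back(raw[i]);
  }
  return out;
}
```

With this sketch, the nine-character input `"""foo"""` yields `""foo""`, which matches the example in the Cell class comment.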

CSV Writer

This library also provides a basic csv2::Writer class - one that can be used to write CSV rows to file. Here's a basic usage:

#include <csv2/writer.hpp>
#include <fstream>
#include <vector>
#include <string>
using namespace csv2;

int main() {
    std::ofstream stream("foo.csv");
    Writer<delimiter<','>> writer(stream);

    std::vector<std::vector<std::string>> rows = 
        {
            {"a", "b", "c"},
            {"1", "2", "3"},
            {"4", "5", "6"}
        };

    writer.write_rows(rows);
    stream.close();
}

Writer API

Here is the public API available to you:

template <class delimiter = delimiter<','>>
class Writer {
public:
  
  // Construct using an std::ofstream
  Writer(output_file_stream stream);

  // Use this to write a single row to file
  void write_row(container_of_strings row);

  // Use this to write a list of rows to file
  void write_rows(container_of_rows rows);
};

Compiling Tests

mkdir build && cd build
cmake -DCSV2_BUILD_TESTS=ON ..
make
cd test
./csv2_test

Generating Single Header

python3 utils/amalgamate/amalgamate.py -c single_include.json -s .

Contributing

Contributions are welcome. Have a look at the CONTRIBUTING.md document for more information.

License

The project is available under the MIT license.

csv2's People

Contributors

amirint, angusbarnes, finger563, haibarapink, keichi, niclasr, p-ranav, reder2000


csv2's Issues

Compiler error due to C++17 feature being used

The library is declared as being C++11 compatible. Unfortunately, in one of the recent commits f18b1dc, a C++17 feature has been used:
f18b1dc#diff-11cb52805aba3ae1bec78f314ffd50874111a067ca9662de968ede56cce18337R44

g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) complains:

include/csv2/writer.hpp:35: error: constexpr if is a C++17 extension [-Werror,-Wc++17-extensions]
    if constexpr (has_close<Stream, void()>::value) {
       ^

Would love to see a fix for this (which should be simple). :-)
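
One common C++11 replacement for `if constexpr` is tag dispatch on a type trait. The sketch below is an illustration only and not the fix that was applied to the library: `has_close_member`, `close_impl`, `close_if_possible`, `DemoFile` and `DemoBuffer` are hypothetical names, and the detector is a simplified stand-in for the `has_close` trait mentioned in the error message.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical detector: `type` is std::true_type when `s.close()` is a valid
// expression for an lvalue of type Stream, std::false_type otherwise.
template <class Stream>
struct has_close_member {
private:
  template <class S>
  static auto test(int) -> decltype(std::declval<S&>().close(), std::true_type());
  template <class>
  static std::false_type test(...);

public:
  using type = decltype(test<Stream>(0));
};

// Overload chosen when the stream has close().
template <class Stream>
void close_impl(Stream& stream, std::true_type) { stream.close(); }

// Overload chosen when it does not; nothing to do.
template <class Stream>
void close_impl(Stream&, std::false_type) {}

// C++11 equivalent of: if constexpr (has close) { stream.close(); }
template <class Stream>
void close_if_possible(Stream& stream) {
  close_impl(stream, typename has_close_member<Stream>::type());
}

// Hypothetical stream-like types used only to exercise the sketch.
struct DemoFile {
  bool closed = false;
  void close() { closed = true; }
};
struct DemoBuffer {};
```

Only the overload that matches the trait is instantiated, so the call to `close()` never has to compile for a type that lacks it.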

Empty quoted data consumes next cell

I have a postgres csv file which contains a number of quoted empty cells like the following:
"cell1","","cell3",NULL,"cell5"

In this case, although five fields are present in the row, only four are seen:

  • "cell1"
  • "","cell3"
  • NULL
  • "cell5"

I've made a patch for this, attached below. In the patched line, I amend the check for an escape sequence to also verify that last_quote_location wasn't the first character of the cell and that the following character isn't a delimiter, i.e. that we don't have a cell like the second cell above.

csv2.patch
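
For reference, the expected splitting of the row above can be sketched with a small standalone state machine. This is not csv2's parser and not the attached patch: `split_record` is a hypothetical function, and unlike the behaviour described in the report it strips the surrounding quotes from each field.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative splitter, not csv2's parser: splits one CSV record into fields.
// Quoted fields may be empty ("") and may contain delimiters or doubled quotes.
std::vector<std::string> split_record(const std::string& line,
                                      char delimiter = ',', char quote = '"') {
  std::vector<std::string> fields;
  std::string current;
  bool in_quotes = false;
  for (std::size_t i = 0; i < line.size(); ++i) {
    const char c = line[i];
    if (in_quotes) {
      if (c == quote && i + 1 < line.size() && line[i + 1] == quote) {
        current.push_back(quote);  // doubled quote inside a quoted field
        ++i;
      } else if (c == quote) {
        in_quotes = false;         // closing quote
      } else {
        current.push_back(c);
      }
    } else if (c == quote) {
      in_quotes = true;            // opening quote
    } else if (c == delimiter) {
      fields.push_back(current);   // end of a field, possibly an empty one
      current.clear();
    } else {
      current.push_back(c);
    }
  }
  fields.push_back(current);       // last field, even if the line ends in a delimiter
  return fields;
}
```

Because the closing quote of `""` is recognised before the following delimiter is examined, the empty quoted cell ends up as its own field and the row splits into five fields.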

trailing missing value unrecognized

First, thanks for this wonderful library. It's very fast compared to other CSV parsers.

The issue: when the reader parses the following line (note that this is the last line of the file), the column count is 4. The parser fails to find the trailing missing column:

ABC,123,123,123,\0

If there's an additional line, column size will be 5 which is correct:

ABC,123,123,123,\n\0

How to solve problem? error: no matching function for call to.

In file included from /tmp/tmp.RkhsHRAu07/main.cpp:2:0:
/tmp/tmp.RkhsHRAu07/csv2/reader.hpp: In instantiation of ‘bool csv2::Reader<delimiter, quote_character, first_row_is_header, trim_policy>::mmap(StringType&&) [with StringType = const char (&)[8]; delimiter = csv2::delimiter<','>; quote_character = csv2::quote_character<'"'>; first_row_is_header = csv2::first_row_is_header; trim_policy = csv2::trim_policy::trim_characters<' ', '\011'>]’:
/tmp/tmp.RkhsHRAu07/main.cpp:13:27: required from here
/tmp/tmp.RkhsHRAu07/csv2/reader.hpp:24:11: error: no matching function for call to ‘mio::basic_mmap<(mio::access_mode)0, char>::basic_mmap(const char [8])’
mmap_ = mio::mmap_source(filename);
^
/tmp/tmp.RkhsHRAu07/csv2/reader.hpp:24:11: note: candidates are:
In file included from /tmp/tmp.RkhsHRAu07/csv2/reader.hpp:4:0,
from /tmp/tmp.RkhsHRAu07/main.cpp:2:
/tmp/tmp.RkhsHRAu07/include/csv2/mio.hpp:216:3: note: mio::basic_mmap<AccessMode, ByteT>::basic_mmap(mio::basic_mmap<AccessMode, ByteT>&&) [with mio::access_mode AccessMode = (mio::access_mode)0; ByteT = char]
basic_mmap(basic_mmap &&);
^
/tmp/tmp.RkhsHRAu07/include/csv2/mio.hpp:216:3: note: no known conversion for argument 1 from ‘const char [8]’ to ‘mio::basic_mmap<(mio::access_mode)0, char>&&’
/tmp/tmp.RkhsHRAu07/include/csv2/mio.hpp:178:3: note: mio::basic_mmap<AccessMode, ByteT>::basic_mmap() [with mio::access_mode AccessMode = (mio::access_mode)0; ByteT = char]
basic_mmap() = default;
^
/tmp/tmp.RkhsHRAu07/include/csv2/mio.hpp:178:3: note: candidate expects 0 arguments, 1 provided
gmake[3]: *** [CMakeFiles/TestReadBigFile.dir/main.o] Error 1
gmake[2]: *** [CMakeFiles/TestReadBigFile.dir/all] Error 2
gmake[1]: *** [CMakeFiles/TestReadBigFile.dir/rule] Error 2
gmake: *** [TestReadBigFile] Error 2

Benchmarking "Denver Crime Data" database

Hello,

Right now this particular file (crime.csv) is about 101 MB, has 399573 rows and 20 columns (that results in total: 20*399573=7991460 cells).
However CSV2 (benchmark/main.cpp) calculates only 7762529 cells.

Any ideas?

(btw, 7991460 was checked by alternative means).
(still not sure, though)

Any plans for adding csv-writer?

I'm using your deprecated CSV library, and I quite like it. I was wondering if you are planning to add a sort of a writer to this one as well?

Empty line ends on segfault

If the CSV file has an empty line, parsing ends in a segfault.
At the very least, the trailing newline should either be skipped silently or give rise to an explicit error rather than a segmentation fault.

Minimal example (can be copy/pasted into test/main.cpp):

TEST_CASE("Parse a SCSV string with column headers and trailing newline, using iterator-based loop" *
        test_suite("Reader")) {

  Reader<delimiter<' '>, quote_character<'"'>, first_row_is_header<true>> csv;
  const std::string buffer = "a b\nd 2 3\ne 5 6.7\n";

  csv.parse(buffer);

  const std::vector<std::string> expected_row_names{"d", "e"};
  const std::vector<double> expected_cell_values{2, 3, 5, 6.7};

  size_t rows=0, cells=0;
  for (auto row : csv) {
    auto icell = std::begin(row);
    std::string rname;
    (*icell).read_value(rname); // FIXME an operator-> would be expected to exist.
    REQUIRE(rname == expected_row_names[rows]);
    rows++;

    ++icell; // FIXME a postfix operator++ would be expected.
    for (; icell != std::end(row); ++icell) {
      std::string str;
      (*icell).read_raw_value(str);
      const double value = std::atof(str.c_str());
      REQUIRE(value == expected_cell_values[cells]);
      cells++;
    }
  }
  size_t cols = cells / rows;
  REQUIRE(rows == 2);
  REQUIRE(cols == 2);
}

Note that the code above demonstrates a use case that is not documented: parsing a table that has row headers.
In that case, iterator-based loops make sense. But the CellIterator interface, while functional, lacks two expected members: operator-> and a postfix operator++.
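
As an illustration of the two missing members, here is a minimal sketch of how an iterator that produces its elements by value can still offer `operator->` (through a small proxy) and a postfix `operator++`. All names (`DemoCell`, `DemoCellIterator`, `ArrowProxy`) are hypothetical, and the iterator walks fixed-width slices of a buffer rather than real CSV cells.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a cell: a non-owning slice of a buffer.
struct DemoCell {
  const char* data;
  std::size_t size;
  std::string str() const { return std::string(data, size); }
};

// Hypothetical iterator over fixed-width slices of a character buffer.
class DemoCellIterator {
public:
  DemoCellIterator(const char* pos, std::size_t width) : pos_(pos), width_(width) {}

  // Cells are created on the fly and returned by value.
  DemoCell operator*() const { return DemoCell{pos_, width_}; }

  // operator-> must return something that itself supports ->. Since there is
  // no stored cell to point at, a proxy keeps a temporary alive for the call.
  struct ArrowProxy {
    DemoCell cell;
    const DemoCell* operator->() const { return &cell; }
  };
  ArrowProxy operator->() const { return ArrowProxy{**this}; }

  // Prefix increment: advance and return the updated iterator.
  DemoCellIterator& operator++() {
    pos_ += width_;
    return *this;
  }

  // Postfix increment: advance, but return a copy of the old position.
  DemoCellIterator operator++(int) {
    DemoCellIterator previous = *this;
    ++*this;
    return previous;
  }

  bool operator==(const DemoCellIterator& other) const { return pos_ == other.pos_; }
  bool operator!=(const DemoCellIterator& other) const { return !(*this == other); }

private:
  const char* pos_;
  std::size_t width_;
};
```

With these members, `it->str()` and `it++` work in addition to `(*it).str()` and `++it`.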

warning when compiled with clang 11

/n1/env_centos/7.6/include/csv.hpp:6313:20: warning: explicitly defaulted move assignment operator is implicitly deleted [-Wdefaulted-function-deleted]
[build]         CSVReader& operator=(CSVReader&& other) = default;
[build]                    ^
[build] /n1/env_centos/7.6/include/csv.hpp:6379:23: note: move assignment operator of 'CSVReader' is implicitly deleted because field 'records' has a deleted move assignment operator
[build]         RowCollection records = RowCollection(100);
[build]                       ^
[build] /n1/env_centos/7.6/include/csv.hpp:5984:24: note: copy assignment operator of 'ThreadSafeDeque<csv::CSVRow>' is implicitly deleted because field '_lock' has a deleted copy assignment operator
[build]             std::mutex _lock;
[build]                        ^
[build] /opt/rh/devtoolset-9/root/usr/lib/gcc/x86_64-redhat-linux/9/../../../../include/c++/9/bits/std_mutex.h:95:12: note: 'operator=' has been explicitly marked deleted here
[build]     mutex& operator=(const mutex&) = delete;

Does it matter?

Read a single cell. Can you provide sample

Hello Pranav,
can you provide an example of how to read the contents of a single cell?
How to extract the contents that can be a string, an int or a double?
I share the request of Amirmasoudabdol, that is to be able to access each cell to read/write the content.
Thank you very much.
Sergio
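
A possible approach, sketched under assumptions: read the cell text into a std::string with cell.read_value(value), as in the reader example, and then convert the string with standard library functions. `parse_long` and `parse_double` below are hypothetical helpers, not part of csv2.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of csv2: returns true only when the whole
// text is a valid base-10 integer that fits in a long. Leading whitespace is
// accepted because std::strtol skips it.
bool parse_long(const std::string& text, long& out) {
  if (text.empty()) return false;
  char* end = nullptr;
  errno = 0;
  const long value = std::strtol(text.c_str(), &end, 10);
  if (errno == ERANGE || end != text.c_str() + text.size()) return false;
  out = value;
  return true;
}

// Same idea for floating-point values, using std::strtod.
bool parse_double(const std::string& text, double& out) {
  if (text.empty()) return false;
  char* end = nullptr;
  errno = 0;
  const double value = std::strtod(text.c_str(), &end);
  if (errno == ERANGE || end != text.c_str() + text.size()) return false;
  out = value;
  return true;
}
```

A caller could try `parse_long` first, then `parse_double`, and fall back to treating the cell as a plain string when both return false.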

Initial feedback after migrating to csv2

Prior to using csv2, I was using a different aria csv parser. That parser was simple but did not support a write API. After switching over, here are the problems I have encountered so far:

I am unable to access column values by index, so taking advantage of C++ structured bindings is not possible. The following structured binding trick was possible with the previous parser, but unfortunately it is not possible with csv2, even using iterators:

        std::ifstream ifs(rImportFQPN);
        auto parser = std::make_unique<aria::csv::CsvParser>(ifs);
        auto bHeader = false;
        // track the cumulative distance and the lat/lons
        /*double distanceFromStart = 0.0;*/
        try {
            for (const auto& row : *parser) {
                // skip the header row
                if (!bHeader) {
                    bHeader = true;
                } else {
                    // Extract tsSecsStr(0), utc(1), callSign(2), position(3), altitude(4), speed(5), direction(6) columns
                    // C++17 structured bindings.
                    const auto&[utcSecsStr, utcStr, callSign, latLonStr,
                        altitudeStr, speedKtsStr, directionStr] =
                            std::make_tuple(row[0], row[1], row[2],
                                row[3], row[4], row[5], row[6]);

Instead, I have to do something along the following lines:

            // make sure we can parse the file
            if (csvReader->mmap(rImportFQPN.generic_string())) {
                const auto header = csvReader->header();
                // Extract tsSecsStr(0), utc(1), callSign(2), position(3),
                // altitude(4), speed(5), direction(6) columns
                std::string utcSecsStr, utcStr, callSign, latLonStr;
                std::string altitudeStr, speedKtsStr, directionStr;
                for (const auto& row : *csvReader) {
                    int col = 0;
                    for (const auto& cell : row) {
                        std::string value;
                        switch (col) {
                        case 0:
                            cell.read_value(utcSecsStr);
                            break;
                        case 1:
                            cell.read_value(utcStr);
                            break;
                        case 2:
                            cell.read_value(callSign);
                            break;
                        case 3:
                            // double quotes will be escaped otherwise
                            cell.read_value(latLonStr);
                            break;
                        case 4:
                            cell.read_value(altitudeStr);
                            break;
                        case 5:
                            cell.read_value(speedKtsStr);
                            break;
                        case 6:
                            cell.read_value(directionStr);
                            break;
                        default:
                            // no need to read more cols
                            break;
                        }
                        col++;
                    }

The row and cell iterators do not support operator+=(int) so unfortunately I cannot replace

                     // C++17 structured bindings.
                    const auto&[utcSecsStr, utcStr, callSign, latLonStr,
                        altitudeStr, speedKtsStr, directionStr] =
                            std::make_tuple(row[0], row[1], row[2],
                                row[3], row[4], row[5], row[6]);

with

                    const auto&[utcSecsStr, utcStr, callSign, latLonStr,
                        altitudeStr, speedKtsStr, directionStr] =
                            std::make_tuple(*row.begin(), *(row.begin()+1), *(row.begin()+2),
                                *(row.begin()+3), *(row.begin()+4), *(row.begin()+5), *(row.begin()+6));

Also, each of the iterators should have cbegin() and cend() const variants. Visual Studio shows performance warnings (squiggles) suggesting that the range-based for loop should bind the returned value to a const auto& row instead of making a potentially large copy of the data returned from the csv2 API. In other words, the following should be possible, except that the API returns a Row and not a const Row& (ditto for Column).

                for (const auto& row : *csvReader) {
                    int col = 0;
                    for (const auto& cell : row) {
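
Until index access exists, one workaround is to copy a row's cells into a std::vector and index that. The sketch below relies only on the documented interface (begin()/end() on the row, read_value on the cell). `collect_cells` is a hypothetical helper, and `FakeCell` and `FakeRow` are stand-ins so the example compiles without csv2.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: copies every cell of a row into a vector so columns
// can be read by index. Works with any row type that offers begin()/end()
// over cells providing read_value(std::string&).
template <class RowType>
std::vector<std::string> collect_cells(const RowType& row) {
  std::vector<std::string> values;
  for (const auto& cell : row) {
    std::string value;  // fresh string per cell, so leftovers never mix in
    cell.read_value(value);
    values.push_back(value);
  }
  return values;
}

// Stand-ins for csv2's Row and Cell, for illustration only.
struct FakeCell {
  std::string text;
  void read_value(std::string& out) const { out += text; }
};
struct FakeRow {
  std::vector<FakeCell> cells;
  std::vector<FakeCell>::const_iterator begin() const { return cells.begin(); }
  std::vector<FakeCell>::const_iterator end() const { return cells.end(); }
};
```

This copies each cell, so it gives up some of the zero-copy benefit of the memory-mapped reader in exchange for `values[0]`, `values[1]` and so on.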

Compiling Tests fails

I wanted to compile and run the tests, but encountered a warning:

-- The CXX compiler identification is GNU 11.3.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done (1.4s)
-- Generating done (0.0s)
CMake Warning:
  Manually-specified variables were not used by the project:

    CSV2_TEST


-- Build files have been written to: /home/amirint/projects/csv2/build

I guess the flag should be "CSV2_BUILD_TESTS", as CSV2_TEST is not used anywhere in the CMakeLists.txt file. The README file may need to be updated.

Compiling test failed with `error: size of array ‘altStackMem’ is not an integral constant-expression`

GCC version:
gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

CMake version:
cmake version 3.26.0

OS:
Ubuntu 22.04 on WSL2 (Windows 11)

Steps to reproduce:
Follow the "Compiling Tests" section in the README file.
The make command errors out with:

error: size of array ‘altStackMem’ is not an integral constant-expression

I simply replaced the macro SIGSTKSZ with its value 8192 in doctest.hpp, and I could successfully generate and run the tests, but that's not the solution. There are related issues like this, but at least I couldn't find anyone directly addressing and fixing it.

The Reader Bug?

I tried to read the Denver Crime Data example file crime.csv, and the first line gets the wrong column count: the header has 19 columns, but only 18 are read.
After debugging the source, this line in reader.hpp:
quote_opened = escaped || (buffer_[i + 1] != delimiter::value);
seems wrong. Maybe it should be:
quote_opened = escaped && (buffer_[i + 1] != delimiter::value);
Or is my usage wrong?

could you make a package? (tag/release)

A tagged release would make it easy to integrate with xmake package management.
It would be good for CMake FetchContent, too.

Example: https://github.com/xmake-io/xmake-repo/blob/master/packages/a/abseil/xmake.lua

I have already written a config. If you make the package, I will send it as a PR to the xmake repo:
package("csv2")
set_urls("https://github.com/p-ranav/csv2.git")
set_homepage("https://github.com/p-ranav/csv2")
set_description("A CSV parser library")
set_license("MIT")
// add_version()

on_install(function (package)
    os.cp("include/csv2", package:installdir("include"))
end)
on_test(function (package)
    assert(package:has_cxxtypes("csv2::Reader<csv2::delimiter<','>, csv2::quote_character<'\"'>, csv2::first_row_is_header<false>>", 
    {configs = {languages = "c++11"}, includes = "csv2/reader.hpp"}))
end)

package_end()

Empty file can not be parsed

I use csv.mmap(filename) to read csv file.

When the input file is empty (size 0), I expect a zero length output.

  if (csv.mmap(fileName)) {
    for (const auto row : csv)  // should not loop

but I get the following error

libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: Invalid argument
fish: Job 1, '../bin/mp20_client.exe -in 0....' terminated by signal SIGABRT (Abort)
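
As a caller-side workaround, one could check that the file has content before calling csv.mmap(fileName). The sketch below uses only the standard library. `stream_has_content` and `file_has_content` are hypothetical helpers, not part of csv2.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical pre-check: true when at least one character can be read from
// the stream. peek() does not consume the character.
bool stream_has_content(std::istream& in) {
  return in.peek() != std::char_traits<char>::eof();
}

// True when the file can be opened and is not empty.
bool file_has_content(const std::string& path) {
  std::ifstream in(path.c_str(), std::ios::binary);
  return in.is_open() && stream_has_content(in);
}
```

The reader would then only be asked to map the file when `file_has_content(fileName)` returns true, and an empty file could be treated as zero rows.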

adding a cell view

Rather than building a full string, returning a string view may be more efficient in some cases (suppose one wants to drop a column).
Something like:

class Cell { [...]
std::string_view read_view() const {
return std::string_view(buffer_ + start_, end_ - start_);
}

[...]
};

Even for conversion, it's probably enough.
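
Since std::string_view is a C++17 type and the library is described as C++11 compatible, the same idea can be sketched with a plain pointer-and-length pair. `CellView` and `make_cell_view` are hypothetical names. The `buffer`, `start` and `end` parameters mirror the members used in the snippet above.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical C++11 stand-in for std::string_view: it points into an
// existing buffer and owns nothing, so no characters are copied.
struct CellView {
  const char* data;
  std::size_t size;

  // Materialise a string only when one is actually needed.
  std::string to_string() const { return std::string(data, size); }

  // Compare against a C string without allocating.
  bool equals(const char* text) const {
    return std::strlen(text) == size && std::memcmp(data, text, size) == 0;
  }
};

// Builds a view of buffer[start, end). The buffer must outlive the view.
inline CellView make_cell_view(const char* buffer, std::size_t start, std::size_t end) {
  return CellView{buffer + start, end - start};
}
```

The view stays valid only as long as the underlying buffer does, which for a memory-mapped file means as long as the mapping is open.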

why template class Reader ?

Recall that:
template <class delimiter = delimiter<','>, class quote_character = quote_character<'"'>,
class first_row_is_header = first_row_is_header<true>,
class trim_policy = trim_policy::trim_whitespace> class Reader

I am not sure what the exact benefit of this is.
For instance, delimiter is a parameter for Cell only, and so is quote_character.
first_row_is_header is used only once, in RowIterator begin().
is_trim_char could just as well be implemented as a std::string plus a contains check on the tested char.

I don't think speed would suffer much, and flexibility would be better.

Any opinions?
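
To make the trade-off concrete, here is a small sketch of the two styles side by side. It is not csv2 code: `static_delimiter`, `count_fields_static` and `count_fields_runtime` are hypothetical, and both functions ignore quoting for brevity.

```cpp
#include <cstddef>
#include <string>

// Hypothetical policy type in the style of the template parameters above.
template <char C>
struct static_delimiter {
  static constexpr char value = C;
};

// Compile-time variant: the delimiter is part of the type, so the comparison
// is against a constant and each delimiter gets its own instantiation.
template <class Delimiter>
std::size_t count_fields_static(const std::string& line) {
  std::size_t fields = 1;
  for (const char c : line) {
    if (c == Delimiter::value) ++fields;
  }
  return fields;
}

// Run-time variant: one function for all delimiters, chosen by the caller
// at run time (for example from a configuration file).
std::size_t count_fields_runtime(const std::string& line, char delimiter) {
  std::size_t fields = 1;
  for (const char c : line) {
    if (c == delimiter) ++fields;
  }
  return fields;
}
```

Both return the same counts. Whether the compile-time form is measurably faster would have to be benchmarked; the sketch only shows the difference in flexibility.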

Parsing error

problem1

int,string
1,

One column too few is parsed; the last column should be parsed as an empty (null) value.

problem2

int,string,int
1,"",123

There are 3 columns in total, but only 2 are parsed: '",123' is treated as one column.

qt types compatibility

Hello

Does it make sense to prepare a pull request with Qt types support (or at least to try)?
I believe some SFINAE magic should be enough for this.
Or should the library stay STL-compatible only? Just asking to save everyone's time :)

Quote character is not getting parsed

Hello,

I copy-pasted the default example to read a CSV file into a variable, which works fine, but the quote characters are not removed by std::string value; cell.read_value(value);

My dev environment:
OS: Windows 10
Compiler: MSVC 2019

example csv:
a,b,c
"Hello", 0.123, "World"
