martijn / xsv Goto Github PK
View Code? Open in Web Editor NEWHigh performance, lightweight .xlsx parser for Ruby that provides nothing a CSV parser wouldn't
Home Page: https://storck.io/posts/announcing-xsv-1-0-0/
License: MIT License
High performance, lightweight .xlsx parser for Ruby that provides nothing a CSV parser wouldn't
Home Page: https://storck.io/posts/announcing-xsv-1-0-0/
License: MIT License
Asian strings such as Japanese can contain phonetic string data (furigana/yomigana)
In this case, the data structure will be as follows, and SharedStringsParser will not initialize @current_string
, which will cause an error.
In order to deal with this, I would like to add a process to skip when an rPh is received.
Can I create a PR for this?
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="35" uniqueCount="33">
<rPh><t>phonetic</t></rPh>
<si><t>Some strings</t></si>
irb(main):006:0> @file = File.open("test/files/phonetic.xml")
irb(main):007:0> strings = Xsv::SharedStringsParser.parse(@file)
Traceback (most recent call last):
7: from bin/console:14:in `<main>'
6: from (irb):7
5: from /xsv/lib/xsv/shared_strings_parser.rb:9:in `parse'
4: from /xsv/lib/xsv/sax_parser.rb:19:in `parse'
3: from /xsv/lib/xsv/sax_parser.rb:19:in `loop'
2: from //xsv/lib/xsv/sax_parser.rb:43:in `block in parse'
1: from /xsv/lib/xsv/shared_strings_parser.rb:33:in `characters'
NoMethodError (undefined method `+' for nil:NilClass)
To reproduce, just create a new XLSX with a single cell with content Zé
:
Encoding::UndefinedConversionError ("\xC3" from ASCII-8BIT to UTF-8)
Also, after having input the date 05/05/1995
, xsv gives me 1995-05-05
. Is this expected? And is there a way to override this behavior?
I'm using version 1.0.0.pre
Thanks!
I have a file with values: 0.0%, 0.1%, -0.3%
These are being interpreted as 0.000275, 0.000827, 0.003448
Is there a solution for this? I just need 0.0, 0.1, -0.3
.
Hello there! Thank you for the gem, you made a really good job!
I wanna suggest one feature that I can implement as a contributor if we'll decide to do that. So, there's the subject.
In a gem called roo
(definitely you know about it), there's a very good feature that allows passing a hash by a set of headers. My team uses roo
for parsing datasheets with headers in Russian, and then we use the content of a datasheet to create some AR entities for example.
In roo
it looks like:
SET_OF_HEADERS = {
name: /Название организации|Название/i,
inn: /ИНН/i,
kpp: /КПП/i
}.freeze
xlsx = Roo::Excelx.new(filepath)
raw_data = xlsx.sheet(0).parse(SET_OF_HEADERS)
raw_data.first.keys # => [:name, :inn, :kpp]
That allows you to define the keys of your data items so there is no need to transform the keys of every hash to pass data to the next method for example. And with that feature, you can also automatically detect the offset between the first significant row of your data and some blank space, because sometimes docs that we parse looks like this:
As you can see, there are two rows that shouldn't be present in parsed data – it just the information to one who works with this template on how to fill rows.
So if this interesting for you I could contribute some time to implement that feature in xsv
too.
Best regards.
I have a file containing multiple decimal values.
0.001 | 0.01 | 0.1
when parsing the data as described in README the first decimal is not returned correctly!
This is what I do:
x = Xsv::Workbook.open("sheet.xlsx")
sheet = x.sheets[0]
sheet.each_row do |row|
p '---row', row
end
This is what I see in the console (the first value is just 1 instead of 0.001):
[1, 0.01, 0.1]
Cannot parse empty excel with code:
workbook = Xsv::Workbook.open(@io, trim_empty_rows: true)
sheet = workbook.sheets[0]
sheet.parse_headers!
sheet.each_row do |row|
row # => ["header1", "header2"], etc.
end
because of error:
NoMethodError: undefined method `zip' for nil:NilClass
from /Users/me/.rbenv/versions/2.7.1/lib/ruby/gems/2.7.0/gems/xsv-0.3.16/lib/xsv/sheet.rb:123:in `empty_row'
Caused by CSV::MalformedCSVError: Invalid byte sequence in UTF-8 in line 1.
from /Users/me/.rbenv/versions/2.7.1/lib/ruby/2.7.0/csv/parser.rb:346:in `rescue in parse'
Caused by CSV::Parser::InvalidEncoding: CSV::Parser::InvalidEncoding
from /Users/me/.rbenv/versions/2.7.1/lib/ruby/2.7.0/csv/parser.rb:229:in `read_chunk'
Thanks for help.
It is possible to write the following for array mode but for hash mode one is forced to create intermediate variables.
Xsv::Workbook.open('file.xlsx').sheets[0].each_row do |row|
row # => ["value1", "value2"]
end
If there was a :mode
argument to open
function or if the function parse_headers!
would return self
instead of a fixed true
(or both? 🤔) this would let the user do one of the following:
Xsv::Workbook.open('file.xlsx', mode: :hash).sheets[0].each_row do |row|
row # => {"header1" => "value1", "header2" => "value2"}
end
Xsv::Workbook.open('file.xlsx').sheets[0].parse_headers!.each_row do |row|
row # => {"header1" => "value1", "header2" => "value2"}
end
It works great in my development environment and specs run successfully, but it keeps throwing that message in the AWS deployment. I'm using docker-compose to manage the deployments and I have erased the project images and containers multiple times.
Previosuly, I was using Roo Gem in this Job without loading problems.
Can you give me a hand please?
# With this call it throws _**NameError: uninitialized constant Xsv**_
x = ::Xsv::Workbook.open(report_result.document.url)
# With this call it throws _**NameError: uninitialized constant ImportWorker::Xsv**_
x = Xsv::Workbook.open(report_result.document.url)
**NameError: uninitialized constant Xsv**
/app/app/workers/import_worker.rb:16:in `perform'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:196:in `execute_job'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:164:in `block (2 levels) in process'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/middleware/chain.rb:138:in `block in invoke'
/usr/local/bundle/gems/sidekiq-failures-1.0.0/lib/sidekiq/failures/middleware.rb:9:in `call'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/middleware/chain.rb:140:in `block in invoke'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/middleware/chain.rb:143:in `invoke'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:163:in `block in process'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:136:in `block (6 levels) in dispatch'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/job_retry.rb:111:in `local'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:135:in `block (5 levels) in dispatch'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/rails.rb:43:in `block in call'
/usr/local/bundle/gems/activesupport-5.2.4.4/lib/active_support/execution_wrapper.rb:87:in `wrap'
/usr/local/bundle/gems/activesupport-5.2.4.4/lib/active_support/reloader.rb:73:in `block in wrap'
/usr/local/bundle/gems/activesupport-5.2.4.4/lib/active_support/execution_wrapper.rb:87:in `wrap'
/usr/local/bundle/gems/activesupport-5.2.4.4/lib/active_support/reloader.rb:72:in `wrap'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/rails.rb:42:in `call'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:131:in `block (4 levels) in dispatch'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:257:in `stats'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:126:in `block (3 levels) in dispatch'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/job_logger.rb:13:in `call'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:125:in `block (2 levels) in dispatch'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/job_retry.rb:78:in `global'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:124:in `block in dispatch'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/logger.rb:10:in `with'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/job_logger.rb:33:in `prepare'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:123:in `dispatch'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:162:in `process'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:78:in `process_one'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/processor.rb:68:in `run'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/util.rb:15:in `watchdog'
/usr/local/bundle/gems/sidekiq-6.0.7/lib/sidekiq/util.rb:24:in `block in safe_thread'
For example
<c r="P894" s="17"><v>43797</v></c>
Is 11/28/19
It turns out Excel completely skips the <c />
element for cells without a value as illustrated in cell D3 and E3 in the example below. This calls for some more sophisticated row loading
<row r="3" spans="1:18" x14ac:dyDescent="0.3">
<c r="A3" s="4" t="str"><f t="shared" si="0"/><v>ABCD</v></c>
<c r="B3" t="s"><v>97</v></c>
<c r="C3"><v>16</v></c>
<c r="F3" t="s"><v>147</v></c>
<c r="G3" t="s"><v>8</v></c>
Staring with version 1.0.2 I'm seeing the following error:
NoMethodError:
undefined method `scan' for nil:NilClass
# ./.bundle/ruby/2.7.0/gems/xsv-1.0.2/lib/xsv/helpers.rb:108:in `parse_number_format'
# ./.bundle/ruby/2.7.0/gems/xsv-1.0.2/lib/xsv/sheet_rows_handler.rb:104:in `format_cell'
# ./.bundle/ruby/2.7.0/gems/xsv-1.0.2/lib/xsv/sheet_rows_handler.rb:60:in `end_element'
# ./.bundle/ruby/2.7.0/gems/xsv-1.0.2/lib/xsv/sax_parser.rb:68:in `block in parse'
# ./.bundle/ruby/2.7.0/gems/xsv-1.0.2/lib/xsv/sax_parser.rb:19:in `loop'
# ./.bundle/ruby/2.7.0/gems/xsv-1.0.2/lib/xsv/sax_parser.rb:19:in `parse'
# ./.bundle/ruby/2.7.0/gems/xsv-1.0.2/lib/xsv/sheet.rb:64:in `each_row'
I'm guessing this was introduced in c653bb8
It seems the format
passed to parse_number_format
isn't always a string.
The document was generated by caxlsx.
Xsv::Workbook.open
does not accept Tempfile, so its not possible to parse ActiveStorage::Blob files, its return
NoMethodError: undefined method `start_with?' for #<Tempfile: (closed)>
I saw this issue #10 and i guess is the same error but for a different Column type. I have the stack from the sidekiq job:
2021-07-03T01:51:22.084Z 43432 TID-ovi6i5kkw WARN: Xsv::Error: Encountered unknown column type d
2021-07-03T01:51:22.084Z 43432 TID-ovi6i5kkw WARN: /Users/christian/.rvm/gems/ruby-2.6.1/gems/xsv-1.0.0/lib/xsv/sheet_rows_handler.rb:111:in `format_cell'
/Users/christian/.rvm/gems/ruby-2.6.1/gems/xsv-1.0.0/lib/xsv/sheet_rows_handler.rb:60:in `end_element'
/Users/christian/.rvm/gems/ruby-2.6.1/gems/xsv-1.0.0/lib/xsv/sax_parser.rb:68:in `block in parse'
/Users/christian/.rvm/gems/ruby-2.6.1/gems/xsv-1.0.0/lib/xsv/sax_parser.rb:19:in `loop'
/Users/christian/.rvm/gems/ruby-2.6.1/gems/xsv-1.0.0/lib/xsv/sax_parser.rb:19:in `parse'
/Users/christian/.rvm/gems/ruby-2.6.1/gems/xsv-1.0.0/lib/xsv/sheet.rb:64:in `each_row'
...
I put a log inside the format_cell
method and i saw the problem was with the column of type custom
with a date value "2021-07-01T05:11:26.00000026635825825"
.
I Attatch a sample excel file.
Hi, I happened to hit upon an Excel file that has cells without a column name. Crazy, right? Anyway, a few quick fixes sorted that out. What do you think?
helpers.rb:
# Return the index number for the given Excel column name (i.e. "A1" => 0)
def column_index(col)
return if col == nil # <-- if receiving a nil column name, return
col.each_codepoint.reduce(0) do |sum, n|
break sum - 1 if n < A_CODEPOINT # reached a number
sum * 26 + (n - A_CODEPOINT + 1)
end
end
sheet_bounds_handler.rb:
def start_element(name, attrs)
case name
#...
when "v"
col = column_index(@cell)
@max_column = col if col != nil && col > @max_column # <-- Don't update @max_column
@max_row = @row if @row > @max_row
#...
sheet_rows_handler.rb:
def initialize(mode, empty_row, workbook, row_skip, last_row, &block)
#...
@current_cell = {}
@current_cell_number = 0 # <-- Trackable cell number
#...
def start_element(name, attrs)
case name
when "c"
@current_cell = attrs
@current_cell_number += 1 # <-- Advance cell number
@current_value.clear
when "v", "is", "t"
@store_characters = true
when "row"
@current_row = @empty_row.dup
@current_cell_number = 0 # <-- Reset cell number
@current_row_number = attrs[:r].to_i
end
end
def end_element(name)
case name
#...
when "c"
col_index = column_index(@current_cell[:r])
col_index = @current_cell_number unless col_index != nil # <-- use tracked cell number if there is no column name
#...
Sorry for the quick code drop, don't have time for a pull request. Does this help?
Close zip, drop sheets, etc.
multiple spaces in cell value converted to 1 space
had cell value with "foo--bar" where each "-" was a space.
when read, xsv reported the cell value as "foo-bar", having only a single space.
windows, ruby 2.5.8p224, xsv (0.3.13)
The hash parsing the F2022-10.xlsx is wrong:
$ pry
[1] pry(main)> require 'xsv'
=> true
[2] pry(main)> x = Xsv.open 'F2022-10.xlsx', :parse_headers => true
=> #<Xsv::Workbook:300 sheets=1 trim_empty_rows=false>
[3] pry(main)> s = x.sheets.first
=> #<Xsv::Sheet:320 mode=hash>
[4] pry(main)> s[0]
=> {"Interní číslo"=>"2022009",
nil=>61,
"Typ faktury"=>"FAKTURA",
"Název firmy nebo jméno osoby"=>"Foo bar",
"IČO"=>1234567,
"DIČ"=>"CZ1234567",
"Ulice"=>nil,
"PSČ"=>"Velehradská 1",
"Město"=>"686 03",
"Datum vystavení"=>"Staré Město",
"Datum splatnosti"=>nil,
"Měna"=>#<Date: 2022-08-01 ((2459793j,0s,0n),+0s,2299161j)>,
"Pořadí"=>#<Date: 2022-08-15 ((2459807j,0s,0n),+0s,2299161j)>,
"Název"=>nil,
"Množství"=>"EUR",
"Cena za MJ"=>nil}
The correct output should be:
--- wrong
+++ correct
@@ -7,18 +7,17 @@
=> #<Xsv::Sheet:320 mode=hash>
[4] pry(main)> s[0]
=> {"Interní číslo"=>"2022009",
- nil=>61,
"Typ faktury"=>"FAKTURA",
"Název firmy nebo jméno osoby"=>"Foo bar",
"IČO"=>1234567,
"DIČ"=>"CZ1234567",
- "Ulice"=>nil,
- "PSČ"=>"Velehradská 1",
- "Město"=>"686 03",
- "Datum vystavení"=>"Staré Město",
- "Datum splatnosti"=>nil,
- "Měna"=>#<Date: 2022-08-01 ((2459793j,0s,0n),+0s,2299161j)>,
- "Pořadí"=>#<Date: 2022-08-15 ((2459807j,0s,0n),+0s,2299161j)>,
- "Název"=>nil,
- "Množství"=>"EUR",
- "Cena za MJ"=>nil}
+ "Ulice"=>"Velehradská 1",
+ "PSČ"=>"686 03",
+ "Město"=>"Staré Město",
+ "Datum vystavení"=>#<Date: 2022-08-01 ((2459793j,0s,0n),+0s,2299161j)>,
+ "Datum splatnosti"=>#<Date: 2022-08-15 ((2459807j,0s,0n),+0s,2299161j)>,
+ "Měna"=>"EUR",
+ "Pořadí"=>1,
+ "Název"=>"Item 1,
+ "Množství"=>2,
+ "Cena za MJ"=>61}
I think that the parser is confused by the empty columns.
Hello when reading the following file only the numeric values are getting parsed correctly. all text-values are parsed to nil
formatting_issue.xlsx
this is what I did:
x = Xsv::Workbook.open(file_path)
sheet = x.sheets[0]
# Iterate over rows
sheet.each_row do |row|
p '-----------------', row # => ["header1", "header2"], etc.
end
This is what is printed:
"-----------------"
[nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil]
"-----------------"
[nil, nil, nil, nil, nil, 0, nil, nil, nil, nil, 1, nil, nil, nil, nil, nil, nil, nil, 1, nil, nil, nil, nil, nil, nil, nil, nil, 1, nil]
`
when copying all data from the xlsx into a simple txt-editor and then copying it back into excel. the data is getting parsed correctly:
"-----------------"
["PartNumber", "Brand", "SubBrand", "PopulationStatus", "PartType", "UniversalPart", "BasePartNumber", "Gtin", "GtinQualifierCode", "GtinQualifierName", "QuantitySize", "QuantitySizeUom", "ContainerType", "QuantityPerApplication", "QuantityPerApplicationUom", "QuantityPerApplicationQualifier", "EffectiveDate", "AvailableDate", "MinimumOrderQuantity", "MinimumOrderQuantityUom", "PartGroup", "PartSubGroup", "PartCategoryCode", "Unspsc", "VmrsCode", "Position", "DefaultMfrLabel", "DefaultApplicationQty", "Tags"]
"-----------------"
["P76550M-2", 1234, 1234, "Active (Publish to receiver database)", "Fuel Pump Module Assembly", 0, nil, 19826103970, "UP", "UPC", 1, "EA", "BX", 1, "EA", "MAX", nil, Fri, 23 Aug 2019, 1, "EA", nil, nil, 550303, 40151532, 44003258, nil, nil, 1, nil]
Hi,
I'am try upload a xlsx file and i'm getting a erro on format cell:
"Encountered unknown column type n":["/opt/rubies/ruby-2.6.5/lib/ruby/gems/2.6.0/bundler/gems/xsv-027d612d4f74/lib/xsv/sheet_rows_handler.rb:28:in format_cell'","/opt/rubies/ruby-2.6.5/lib/ruby/gems/2.6.0/bundler/gems/xsv-027d612d4f74/lib/xsv/sheet_rows_handler.rb:98:in
end_element'","/opt/rubies/ruby-2.6.5/lib/ruby/gems/2.6.0/bundler/gems/xsv-027d612d4f74/lib/xsv/sheet.rb:62:in `sax_parse'","/opt/rubies/ruby-2.6.5/lib/ruby/gems/2.6.0/bundler/gems/xsv-027d612d4f74/lib/xsv/sheet.rb:62:in
Can you give me a light to fix it?
It would fit well with other hash returning functions like JSON.parse and would be of use.
I am not sure but I suspect there would be less memory allocation since currently, each row allocates a new string key for the header labels in the :hash
mode.
Similar to #24 this could be done in two ways (or both? 🤔) with a symbolize_names: true
argument i) to open
function and/or ii) to parse_headers!
function.
To reproduce:
Parsing the following file complex-headers.xlsx
@sheet = Xsv.open("./complex-headers.xlsx").sheets[0]
@sheet.parse_headers!
p @sheet[0]
Expected
{nil=>nil, "Customer nr."=>537066, "Customer name"=>"Unilever", "Item number"=>391790102, "Description"=>"Soap", "Qty outst."=>6, "Another Ref"=>"18-01-2024", "Reference"=>"#JL-65568", "Site address"=>"5 Cross Street", "Zdepot"=>"GG"}
Actual
=> {nil=>"GG", "Customer nr."=>nil, "Customer name"=>537066, "Item number"=>"Unilever", "Description"=>391790102, "Qty outst."=>"Soap", "Another Ref"=>6, "Reference"=>"18-01-2024", "Site address"=>"#JL-65568", "Zdepot"=>"5 Cross Street"}
See #55 for a failing test case.
Generic HTML entities in the source XML such as —
or —
are not supported, but passed through unchanged.
I came across an XLSX file that had german umlauts encoded this way in it. They were not decoded to the actual characters but were output as ä
(should be ä
).
Hi, I tried to parse this xlsx file, but I am getting this error:
x = Xsv::Workbook.open(file)
NoMethodError: undefined method
get_input_stream' for nil:NilClass from /Users/tomasdurcak/.rbenv/versions/2.6.6/lib/ruby/gems/2.6.0/gems/xsv-0.3.14/lib/xsv/workbook.rb:78:in
fetch_shared_strings'
Caused by NoMethodError: undefined methodget_input_stream' for nil:NilClass from /Users/tomasdurcak/.rbenv/versions/2.6.6/lib/ruby/gems/2.6.0/gems/xsv-0.3.14/lib/xsv/workbook.rb:78:in
fetch_shared_strings'
Caused by CSV::MalformedCSVError: Illegal quoting in line 2.
from /Users/tomasdurcak/.rbenv/versions/2.6.6/lib/ruby/2.6.0/csv/parser.rb:879:in `parse_quotable_robust'
I'm using xsv v1.1.0, and trying to open this file.
The error I get is:
Xsv::Workbook.open(path_to_file)
NoMethodError: undefined method `[]' for nil:NilClass
from /Users/fionasanggang/.asdf/installs/ruby/2.7.1/lib/ruby/gems/2.7.0/gems/xsv-1.1.0/lib/xsv/sheet.rb:36:in `initialize'
I believe this document was generated using the Open XML SDK 2.5 for Office, due to the <x:worksheet>
, <x:sheetData>
elements in the sheet.xml file.
Do you have any hints or suggestions?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.