dilshod / xlsx2csv Goto Github PK
View Code? Open in Web Editor NEWConvert xslx to csv, it is fast, and works for huge xlsx files
License: MIT License
Convert xslx to csv, it is fast, and works for huge xlsx files
License: MIT License
For this file https://dl.dropboxusercontent.com/u/16118108/01-11-2013-BeCompact-USD%3D321.xlsx i get such output:
"line1: line 1
Line2: line2"
I expect a bit different:
"line1: line 1\nLine2: line2"
May be it will be better to escape newline characters in output?
Traceback (most recent call last):
File "/usr/local/bin/xlsx2csv", line 550, in
xlsx2csv.convert(outfile, sheetid)
File "/usr/local/bin/xlsx2csv", line 162, in convert
self._convert(sheetid, outfile)
File "/usr/local/bin/xlsx2csv", line 200, in _convert
sheet.to_csv(writer)
File "/usr/local/bin/xlsx2csv", line 362, in to_csv
self.parser.ParseFile(self.filehandle)
File "/usr/local/bin/xlsx2csv", line 377, in handleCharData
xfs_numfmt = self.styles.cellXfs[s]
IndexError: list index out of range
Dilshod,
I added an outfile.close() to the finally statement in xlsx2csv() because garbage collection (I believe) wasn't closing the file quickly enough for my use case. I'm just letting you know so that you can add it to master if you think other's would benefit as well.
Thanks,
Nathan
In some cases cell data is not parsed correctly.
Investigation and debug output shows that these cells contain shared string and use 2 exec of handleCharData function for each. But when script resolves shared data address to string it uses data of last exec not self.collected_string (or self.data). In result you have almost random data in cell.
Change in 296th line of script
int(data) > int(self.data)
Diff:
diff xlsx2csv.py xlsx2csv_fixed.py
296c296
< self.data = self.sharedStrings[int(data)]
---
> self.data = self.sharedStrings[int(self.data)]
Looks like this is fix after "Issue #15 (Misinterpretation of &". Data parsing is fixed.
If there is an ampersand in a cell the tool will split up the cell contents.
Example:
Cell: Hello & World
will be parsed to
Hello
World
I made a temporary hack in my version by:
def handleStartElement(self, name, attrs):
self.collectedString = ""
and
def handleCharData(self, data):
if self.in_cell_value: # defines that there is a value between a set of nodes
self.collectedString += data
self.data = self.collectedString # default value
Hi,
It would nice if the repository were tagged on release in order to (1) diff between releases (2) use automated tools to package releases in regular basis (http://githubredir.debian.net/).
Use of "release/" prefix helps to find tags by name easily with git tag -l
:
release/1.0
release/1.1
release/1.10
release/1.2
The Freecode is kinda "yellow pages" of Open Source software. It would be useful if the project were listed there; just open an account and you're set.
Once set up, there is nothing to maintain, unless you want to issue release announcements for the subsribers of project watchers.
An xlsx file I'm processing contains a date cell that looks like "9/15/2011 15:22" in Excel. I haven't been able to figure out what format to pass to xlsx2csv in order to resolve this.
(1) [no format passed]: m/15/11 h:09
(2) %Y/%m/%d %hh:%mm : 2011/09/15 Seph:09m
(3) %Y/%m/%d %H:%M : 2011/09/15 00:00
The last seem closest, but I would like to see the hours:minutes resolved.
Yes, it's ancient Python, but still:
line 780: parser.add_argument("-s", "--sheet", dest="sheetid", default=1, type=int, help="sheet number to convert")
produces a
Traceback (most recent call last):
File "xlsx2csv.py", line 780, in ?
help="sheet number to convert")
File "/usr/lib/python2.3/optparse.py", line 820, in add_option
option = self.option_class(*args, **kwargs)
File "/usr/lib/python2.3/optparse.py", line 430, in __init__
checker(self)
File "/usr/lib/python2.3/optparse.py", line 499, in _check_type
raise OptionError("invalid option type: %r" % self.type, self)```
Change it to
line 780: parser.add_argument("-s", "--sheet", dest="sheetid", default=1, type="int", help="sheet number to convert")
and it works. (Note the "" around int.)
Excel 2007 file with content
8001990029363
8001990029387
8001990028694
8001990028663
8001990028649
8001990028854
8001990048081
8001990029370
8001990029394
8001990048180
8001990027710
Code:
from openpyxl.reader.excel import load_workbook
book = load_workbook('error.xlsx', use_iterators=True)
sheet = book.get_active_sheet()
for row in sheet.iter_rows():
print [cell.internal_value for cell in row]
Crash with error:
Traceback (most recent call last):
File "test.py", line 6, in <module>
for row in sheet.iter_rows():
File "C:\Python\lib\site-packages\openpyxl\reader\iter_worksheet.py", line 236, in get_squared_range
cell = cell._replace(internal_value=shared_date.from_julian(float(cell.internal_value)))
File "C:\Python\lib\site-packages\openpyxl\shared\date_time.py", line 162, in from_julian
return EPOCH + datetime.timedelta(days=utc_days)
OverflowError: Python int too large to convert to C long
I executed in a Windows command shell the following command:
python d:\yy\ww\py\lib\site-packages\xlsx2csv.py "D:\yy\ww\XLS2CSV Converter\source\xxx.xlsx" "D:\yy\xxx\XLS2CSV Converter\distination\vv.csv"
and get the following error message:
Traceback (most recent call last):
File "d:\yy\ww\py\lib\site-packages\xlsx2csv.py", line 847, in
xlsx2csv.convert(outfile, sheetid)
File "d:\yy\ww\py\lib\site-packages\xlsx2csv.py", line 178, in convert
self._convert(sheetid, outfile)
File "d:\yy\ww\py\lib\site-packages\xlsx2csv.py", line 247, in _convert
sheet.to_csv(writer)
File "d:\yy\ww\py\lib\site-packages\xlsx2csv.py", line 558, in to_csv
self.parser.ParseFile(self.filehandle)
File "d:\yy\ww\py\lib\site-packages\xlsx2csv.py", line 706, in handleEndElement
self.writer.writerow(d)
UnicodeEncodeError: 'cp932' codec can't encode character '\xd4' in position 215: illegal multibyte sequence
I'm using Python 3.4.0 and xlsx2csv-0.6.1 installed via pip.
What am I doing wrong?
In the excel file "Sheet2" and "Sheet3" are blank so I'd like to exclude them, I keep passing the following command:
xlsx2csv -a -E"Sheet2" -E"Sheet3" file.xlsx
but the produced output still contains delimiters for these two sheets. I've also tried putting a space between the -E option and the names, and I've tried excluding the quotation marks, but to no avail.
xlsx to csv works great but fails with the following for xls files. xls might not even be supported but if you could add xls support, that had be awesome! Thank you!
Traceback (most recent call last):
File "/home/thinkcode/libs/python/xlsx2csv.py", line 344, in ?
xlsx2csv(args[0], f, **kwargs)
File "/home/thinkcode/libs/python/xlsx2csv.py", line 91, in xlsx2csv
ziphandle = zipfile.ZipFile(infilepath)
File "/usr/lib64/python2.4/zipfile.py", line 210, in init
self._GetContents()
File "/usr/lib64/python2.4/zipfile.py", line 230, in _GetContents
self._RealGetContents()
File "/usr/lib64/python2.4/zipfile.py", line 242, in _RealGetContents
raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file
When converting an XLSX file that has emtpy cells at the end of the row, the corresponding row in the CSV is missing the trailing commas needed for using the CSV as an import into a database orr other system that validates the number of expected columns are present.
Excel has capability holds phonetic for cell value as as . element placed inner tag. So, the current implements shows mixed value and phonetic.
I've avoid the problem like below code. I hope apply the code.
def handleStartElement(self, name, attrs):
if name == 'si':
self.si = True
self.value = ""
elif name == 't' and self.rPh:
self.t = False
elif name == 't' and self.si:
self.t = True
elif name == 'rPh':
self.rPh = True
def handleEndElement(self, name):
if name == 'si':
self.si = False
self.strings.append(self.value)
elif name == 't':
self.t = False
elif name == 'rPh':
self.rPh = False
Getting "null-pointer exceptions" from xlsx2csv when parsing some files:
Traceback (most recent call last):
File "/usr/local/bin/xlsx2csv", line 847, in <module>
xlsx2csv.convert(outfile, sheetid)
File "/usr/local/bin/xlsx2csv", line 178, in convert
self._convert(sheetid, outfile)
File "/usr/local/bin/xlsx2csv", line 247, in _convert
sheet.to_csv(writer)
File "/usr/local/bin/xlsx2csv", line 558, in to_csv
self.parser.ParseFile(self.filehandle)
File "/usr/local/bin/xlsx2csv", line 660, in handleStartElement
startCol = start.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
An example where this problem occurs is here.
According to pip, my currently installed version is the latest, but xlsx2csv -v
only outputs the script's name (xslx2csv
- want me to open a separate ticket for that?). I see I do have the --merge-cells
options available from the script.
I have this error :
Traceback (most recent call last):
File "/usr/local/bin/xlsx2csv", line 5, in
pkg_resources.run_script('xlsx2csv==0.6', 'xlsx2csv')
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 499, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1235, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/xlsx2csv-0.6-py2.7.egg/EGG-INFO/scripts/xlsx2csv", line 548, in
xlsx2csv = Xlsx2csv(options.infile, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/xlsx2csv-0.6-py2.7.egg/EGG-INFO/scripts/xlsx2csv", line 155, in init
self.workbook = self._parse(Workbook, "xl/workbook.xml")
File "/usr/local/lib/python2.7/dist-packages/xlsx2csv-0.6-py2.7.egg/EGG-INFO/scripts/xlsx2csv", line 219, in _parse
instance.parse(filehandle)
File "/usr/local/lib/python2.7/dist-packages/xlsx2csv-0.6-py2.7.egg/EGG-INFO/scripts/xlsx2csv", line 239, in parse
sheets = workbookDoc.firstChild.getElementsByTagName("sheets")[0]
IndexError: list index out of range
you can find excel file at :
http://members.tsetmc.com/tsev2/excel/MarketWatchPlus.aspx?d=0
thanks
I can't covert xlsx documents from Apache POI . No error messages, just making a csv file that contains row numbers.
My server is a solaris10 server. I need help on this Converting xlsx file from my office 2010 exel works fine
If a cell contains something like "21.10", xlsx2csv interprets it as a number and truncates the zero. This is bad if the cell is intended to represent a version number or something like that.
These cells are formatted as numbers in Excel, with a fixed number of digits after the decimal point. I have hundreds of files, so I can't change the formats by hand.
The best fix would be to simply retain leading and trailing zeros in the csv.
Another fix would be to interpret numbers with trailing or leading zeros as strings. (maybe with a flag)
Another fix would be a flag to treat all cells as strings, rather than numbers.
Using with --all or -s 0 causes:
Traceback (most recent call last):
File "/usr/local/bin/xlsx2csv", line 550, in
xlsx2csv.convert(outfile, sheetid)
File "/usr/local/bin/xlsx2csv", line 168, in convert
if cmd:
NameError: global name 'cmd' is not defined
Hi, great tool! Consider adding it to PyPI so people can install it with pip, and also possibly exposing it additionally as a library. Thanks!
Hi,
The parser seems to turn into troubles with xlsx files with empty row 1.
It returns:
...
self.colNum = cellId[:len(cellId)-len(self.rowNum)]
TypeError: object of type 'NoneType' has no len()
Without success, I tried to fix it by "ignoring" the cell with NoneType and go on with empty value, but this makes the data in following rows to be missing in the CSV output.
It only happens if the input XLSX has an entirely empty row 1.
Do you have an idea how to deal with this issue?
Thanks for help and thanks for the great .py!
Hi,
I've got a directory filled with xlsx's that I wanted to convert to csv. Is there any easy way to batch convert?
The tool exports minutes < 10 as single characters, for example 14:05 is exported as 14:5. For some routines which import this value, it is converted to 0145, because these only read the digits.
If you can change line 545 to this: self.data = str(t / 60) + ":" + ('0' + str(t % 60))[-2:]
then the minutes will always have two digits, even when 0.
Привет. Имеется проблема с разбором многостраничных документов. Приблизительно проблема заключается в том, что id документа в .../worksheets/sheet*.xml не обязательно совпадает с номером страницы.
ps. Я не Питонист, да и в формате Excel 2007 не разбираюсь, сейчас поковыряю, может смогу разобраться.
Hi,
In the Alternatives section, kindly change this entry
Bash:
http://kirk.webfinish.com/2009/12/xlsx2csv/
TO
Bash:
http://kirk.webfinish.com/?p=91 <-- direct link to bash script which works now.
Also, please add the perl alternative too, if possible.
And which is achived like this:
wget http://repo.iotti.biz/CentOS/6/noarch/xls2csv-1.06-16.el6.lux.1.noarch.rpm
yum localinstall xls2csv-1.06-16.el6.lux.1.noarch.rpm -y
Thanks.
I get this error, when running on Windows 7:
C:\Users\jh>python.exe C:\1\xlsx2csv.py
File "C:\1\xlsx2csv.py", line 286
date = datetime.datetime(1904, 01, 01) + datetime.timedelta(float(data))
^
SyntaxError: invalid token
Am I doing it wrong?
I have problem with float.
In xlsx-file i have 0.103 (general or number). After convert in csv i got 0.10299999999999999
The same, 0.276 -> 0.27600000000000002
Hi! Thank you for the tool!
Unfortunately, it throws a traceback, while tries to convert one some files:
Traceback (most recent call last):
File "./xlsx2csv.py", line 477, in <module>
xlsx2csv(args[0], None, args[1], **kwargs)
File "./xlsx2csv.py", line 121, in xlsx2csv
styles = parse(ziphandle, Styles, "xl/styles.xml")
File "./xlsx2csv.py", line 179, in parse
instance.parse(f)
File "./xlsx2csv.py", line 222, in parse
numFmtId = int(numFmt._attrs['numFmtId'].value)
AttributeError: Text instance has no attribute '_attrs'
Here is file itself: http://www.sima-land.ru/files-local/price/igrushki.xlsx
Please strip CR (carriage return, \r, 0x0D) from output lines. Now:
$ xlsx2csv test.xlsx | cat -A | head -2
test, 1^M$
test, 2^M$
Something like this to detect if running under modern Windows (pseudo code):
STRIP_OUTPUT = true
# example: ProgramData=C:\ProgramData
if <environment variable "ProgramData" exists>
AND
<it matches regexp "[a-z]:\">
then
STRIP_OUTPUT = false
endif
given the headers A,B,C and data stuff,more stuff, [empty] the csv output doesn't not include a comma after "more stuff".
Using the version in pip, I found this error on some of my workbooks:
Traceback (most recent call last):
File "/usr/local/bin/xlsx2csv", line 548, in <module>
xlsx2csv = Xlsx2csv(options.infile, **kwargs)
File "/usr/local/bin/xlsx2csv", line 155, in __init__
self.workbook = self._parse(Workbook, "xl/workbook.xml")
File "/usr/local/bin/xlsx2csv", line 219, in _parse
instance.parse(filehandle)
File "/usr/local/bin/xlsx2csv", line 233, in parse
self.appName = workbookDoc.firstChild.getElementsByTagName("fileVersion")[0]._attrs['appName'].value
KeyError: 'appName'
I tried to use the script but gave me the above error . find details below :
[xlsx2csv master] $ ./xlsx2csv.py /home/rtcoms/Desktop/bank_wise_csv_ifsc_codes/IFCB2009_02.xls /home/rtcoms/Desktop/code/personal_projects/ifcs_details/tmp/1.csv
Traceback (most recent call last):
File "./xlsx2csv.py", line 443, in
xlsx2csv(args[0], outfile, **kwargs)
File "./xlsx2csv.py", line 115, in xlsx2csv
ziphandle = zipfile.ZipFile(infilepath)
File "/usr/lib/python2.6/zipfile.py", line 696, in init
self._GetContents()
File "/usr/lib/python2.6/zipfile.py", line 716, in _GetContents
self._RealGetContents()
File "/usr/lib/python2.6/zipfile.py", line 728, in _RealGetContents
raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file
Hi,
I am facing a problem with xlsx files that have number such as 9.810070626E+019 (which shows in excel as 98100707530000000000) --> with xlsx2csv this becomes 98100707530000007168 when viewed with text editor. However, with OpenOffice xls-to-csv macro the value is shown as it should as 98100707530000000000.
Both of these are shown correctly when the csv is viewed in e.g. OpenOffice, but I'm taking the csv into different software, e.g. Pentaho Data Integration and there the values show as in a text editor..
Any help appreciated
--patrik
First, thank you for this amazing library.
The date formatting seems to be really bad though. It converts "4/5/2014" into "04-05-14". This format makes it very hard to convert back into a date. PHP strtotime() interprets it as May 14, 2004. At the very least, the year should be 4 digits, that would help a lot.
Additionally, Excel sometimes stores dates as simply an integer representing the number of days since January 1, 1900 (or January 1, 1904 on Mac for some reason). It would be excellent if those could also be converted to date strings as well.
Excel exported csv Line.
;G3;Dachbox;320;;;;Schwarz Metallic;Arjes;;;133.8;72;37.5;Dach
xlsx2csv exported csv Line
G3Dachbox320~~~~Schwarz MetallicArjes~~~133.800000000000017237.5~Dach
Found some more samples in my Files.
Excel field format: General
Original Value: 133.8
CSV Value: 133.80000000000001
I Also found: 7.4999999999999997E-2 instead of 0.075
Got this error today with a xlsx file (tested with 0.6.1):
Traceback (most recent call last):
File "./xlsx2csv.py", line 843, in
xlsx2csv = Xlsx2csv(options.infile, **kwargs)
File "./xlsx2csv.py", line 166, in init
self.workbook = self._parse(Workbook, "xl/workbook.xml")
File "./xlsx2csv.py", line 268, in _parse
instance.parse(filehandle)
File "./xlsx2csv.py", line 314, in parse
if 'r:id' in attrs: id = int(attrs["r:id"].value[3:])
ValueError: invalid literal for int() with base 10: ''
I am trying to convert some large xlsx files in the range of 60,000 to 85,000 lines.
The program fails with the error:
$ xlsx2csv.py -s 1 60656.xlsx 60656.csv
Traceback (most recent call last):
File "/home/Alex/mybin/xlsx2csv.py", line 412, in
xlsx2csv(args[0], outfile, **kwargs)
File "/home/Alex/mybin/xlsx2csv.py", line 98, in xlsx2csv
workbook = parse(ziphandle, Workbook, "xl/workbook.xml")
File "/home/Alex/mybin/xlsx2csv.py", line 125, in parse
instance.parse(ziphandle.read(filename))
File "/home/Alex/mybin/xlsx2csv.py", line 135, in parse
self.appName = workbookDoc.firstChild.getElementsByTagName("fileVersion")[0]._attrs['appName'].value
IndexError: list index out of range
I am using python 2.7 under linux.
I have tested the script with other xlsx files (smaller) and had no issus.
Thanks
Alex.
If the correct Python is somewhere else in the path, you can't run xlsx2csv.py directly.
I am getting the following error from an excel file I have (Microsoft Excel 97-2004 workbook):
Invalid xlsx file: ...
When I looked at the source code, it seems to be coming from the following line:
self.ziphandle = zipfile.ZipFile(xlsxfile)
Any idea why I get this error and any work-around?
I tried : python xlsx2csv.py myfile.xlsx myfile.csv
Complete error message :
Traceback (most recent call last):
File "xlsx2csv.py", line 164, in
xlsx2csv(sys.argv[1], f)
File "xlsx2csv.py", line 22, in xlsx2csv
Sheet(shared_strings, ziphandle.read("xl/worksheets/sheet1.xml"), writer)
File "xlsx2csv.py", line 94, in init
self.parser.Parse(data)
File "xlsx2csv.py", line 154, in handleEndElement
self.writer.writerow(d)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)
So in the UK, where dates are typically of this type, they were converted to numbers (as in the number excel uses to store a date).
To fix this, I added:
'dd/mm/yyyy' : 'date',
on line 49
The conversion from myfile.xslx to csv works fine on Linux but with the same options (defaults ones :o) ) the conversion on windows doubles the number of lines : it adds one blank line every lines !
Do you have an idea of what's wrong in the way i use xlsx2csv ?
Is there any option that may solve this pb ?
Thanks.
I ran xlsx2csv with xlsx that had multiple sheets. But, xlsx2csv converted only one sheet [first sheet] to csv.
I am not sure if this is strictly in the purview of this program.
I was trying to use the program to convert an excel sheet A.xlsx which had a link to B.xlsx.
I changed B.xlsx and then ran xlsx2csv.py A.xlsx A.csv
As expected, the change done in B.xlsx did not get reflected in A.csv
But note A.xlsx, it is already set to "Automatically update links"
The only way it works is this way - Open A.xlsx and just do a save. Then do the conversion.
Is there a way in python to force an xlsx to save itself (at command prompt) - before doing the conversion.
All this is being done in a windows machine - but I guess any trick available, I should be able to map from linux to windows.
Kind Regards
I ran across a xlsx file using the inlineStr method instead of the more common sharedStrings approach. The conversion failed because the inlineStr are stored in a tag instead of the normal tag.
Adding this to the if then block in the handleStartElement method allows for proper conversion of inlineStr. Please test and add to next version:
elif self.in_cell and name == 'is':
self.in_cell_value = True
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.