Monday, June 3, 2013

XLRD vs OPENPYXL, Round II

[UPDATE 2013-12-02] The major issue discussed in this post, RE: charts not read, worksheets skipped and out of order, was resolved and pulled into the latest release 1.7.0 as well as many other bug fixes. With this latest version, I think that OpenPyXL can be considered the dominant OOXML (post 2007) Excel reader and writer. Note that OpenPyXL is the default Excel reader for Pandas the rapidly growing Python data analysis toolset.

This is a continuation of the previous post on reading Excel from Python. Uh, I might have called it too early! XLRD pulls ahead, but will it win the bout? Read on ...

Reading the contents

Assume we have a sample Excel spreadsheet with 3 worksheets and 2 charts on sheets. In OpenPyXL, you load the workbook, but right away you notice something wrong with the sheet names. Where's 'Sheet3'?
>>> wb_openpyxl = load_workbook(sample)
>>> wb_openpyxl.get_sheet_names()
['Sheet1',
 'Sheet2',
 'Sheet3']
Loading the sheets with XLRD get's it right.
>>> wb_xlrd = open_workbook(sample)
>>> wb_xlrd.sheet_names()
[u'Sheet1',
 u'Sheet2',
 u'Sheet3']
XLRD returns the sheets in the same order as they are visible in the actual spreadsheet, and omits the charts which don't actually contain any data. Unfortunately OpenPyXL can't tell charts from sheets just yet, and is actually naming some of the sheets incorrectly after the charts.
'Sheet1' --> 'Sheet1'
'Sheet2' --> 'Sheet2'
'ChartA' --> 'Sheet3'
This would be OK, since all of the sheets are there, and you could use the sheets' indices, but if you don't only know their names and not the order, then this is an issue. It has been reported in issues #179, #165 and #209. Unfortunately, this issues affects the optimized reader as well. I sent a pull request with a proposed fix for it that has already been merged with master. This issue was resolved and pulled into the current release, OpenPyXL-1.7.0.

Reading Cells

OpenPyXL can use the Excel format, EG: 'A3' or by row & column.
>>> ws1_openpyxl = wb_openpyxl.get_sheet_by_name('Sheet1')
>>> ws1_openpyxl.cell('A3').value
XLRD only reads cells by (row, column).
>>> ws1_xlrd = wb_xlrd.get_sheet_by_name('Sheet1')
>>> ws1_xlrd.cell_value(2, 0)
Both can let you slice the data, but OpenPyXL also allows ranges using Excel format.
>>> ws1_openpyxl.range('A1:C2')
Weird thing about the optimized reader in OpenPyXL, is that it only allows reading sheet contents using the iter_rows() function, which in a way defeats the purpose of the optimized reader, since you have to read in all of the columns in each row!
>>> all_rows = [r for r in ws1_openpyxl.iter_rows()]

The Winner

I think XLRD wins this round, because even though its documentation is sparse, it's not rocket science, and it get's the worksheets, relatively quickly, and more importantly correctly! The screw up with the charts is kind of a non-starter for OpenPyXL.

And another thing occurred to me during Round II. XLRD can open any Excel spreadsheet dating back to like 1995, but OpenPyXL is only for Excel 2007 and newer, which if you didn't know is a zipped XML file.

Finally, even though XLRD doesn't let you use the easy Excel cell reference notation, it is generally faster. And the iter_rows() limitation for the optimized reader in OpenPyXL is a bit annoying, since you're forced to read in many columns that you might not have wanted to read!

No comments:

Post a Comment

Fork me on GitHub