This is a continuation of the previous post on reading Excel from Python. Uh, I might have called it too early! XLRD pulls ahead, but will it win the bout? Read on ...
Reading the contents
Assume we have a sample Excel spreadsheet with 3 worksheets and 2 charts on sheets. In OpenPyXL, you load the workbook, but right away you notice something wrong with the sheet names. Where's 'Sheet3'?
>>> wb_openpyxl = load_workbook(sample) >>> wb_openpyxl.get_sheet_names() ['Sheet1', 'Sheet2', 'Sheet3']Loading the sheets with XLRD get's it right.
>>> wb_xlrd = open_workbook(sample) >>> wb_xlrd.sheet_names() [u'Sheet1', u'Sheet2', u'Sheet3']XLRD returns the sheets in the same order as they are visible in the actual spreadsheet, and omits the charts which don't actually contain any data.
'Sheet1' --> 'Sheet1'
'Sheet2' --> 'Sheet2'
'ChartA' --> 'Sheet3'
Reading Cells
OpenPyXL can use the Excel format, EG: 'A3' or by row & column.>>> ws1_openpyxl = wb_openpyxl.get_sheet_by_name('Sheet1') >>> ws1_openpyxl.cell('A3').valueXLRD only reads cells by (row, column).
>>> ws1_xlrd = wb_xlrd.get_sheet_by_name('Sheet1') >>> ws1_xlrd.cell_value(2, 0)Both can let you slice the data, but OpenPyXL also allows ranges using Excel format.
>>> ws1_openpyxl.range('A1:C2')Weird thing about the optimized reader in OpenPyXL, is that it only allows reading sheet contents using the iter_rows() function, which in a way defeats the purpose of the optimized reader, since you have to read in all of the columns in each row!
>>> all_rows = [r for r in ws1_openpyxl.iter_rows()]
The Winner
I think XLRD wins this round, because even though its documentation is sparse, it's not rocket science, and it get's the worksheets, relatively quickly, and more importantly correctly! The screw up with the charts is kind of a non-starter for OpenPyXL.And another thing occurred to me during Round II. XLRD can open any Excel spreadsheet dating back to like 1995, but OpenPyXL is only for Excel 2007 and newer, which if you didn't know is a zipped XML file.
Finally, even though XLRD doesn't let you use the easy Excel cell reference notation, it is generally faster. And the iter_rows() limitation for the optimized reader in OpenPyXL is a bit annoying, since you're forced to read in many columns that you might not have wanted to read!
No comments:
Post a Comment