>>> print 'Reading %(Lolita)s in %(Tehran)s' % {'Lolita': 'Excel', 'Tehran': 'Python'}That was really a stretch. It's actually not funny or witty at all. In fact now it's just terribly awkward. Although it would of made a little sense if Excel and Python were switched, IE: Reading Python in Excel right, but that is not exactly what we mean. Now I'm just makein it worse.
Reading Excel in Python
Main Contenders
Moving on, ever tried to read an Excel file into Python? Numpy only supports csv, so you Google and up pop the main contenders:Python ← Excel
- xlrd - Version 0.9.2 as of this post was released 2013-04-09 available at the cheese factory and on Github, BSD license.
- openpyxl - Version 1.6.2 released 2013-04-18, an auspicious day, available at the cheese factory and from Bitbucket, MIT license.
- xlsxWriter - version 0.3.3 released 2013-06-10 available at PyPI with source on Github, BSD license.
- json2xlsx - version 1.2.5, last updated 2013-06-05 at PyPI and Github, BSD.
Python → Excel
- DataNitro, formerly IronSpread, is an excel plugin that replaces allows Python to be used for add-ins and macros in lieu of VBA, proprietary commercial/personal licensing. I didn't review this one, but I plan to revisit it later.
- PyXLL, version 2.0.3, also lets you use Python to make Excel add-ins using decorators.
openpxl
This project is for Excel 2007 and newer, it is a port of ExcelPHP and is super easy to use, made even easier by its Sphinx docs at read-the-docs. For some weird reason, pypi and bitbucket both point to old (v-1.5.7) documentation @ pythonhosted, but don't be fooled; get the latest which is much expanded!
$ pip install openpyxl
works fine for any platform since there are no extensions, yay!
$ python
>>> from openpyxl import load_workbook
>>> big_wb = openpyxl('C:\path\to\spreadsheet.xlsx')
seems too take awhile and there doesn't seem any logging or progress available. So use the lazy approach by setting use_iterators=True, which is much faster!
>>> big_wb_fast = openpyxl('C:\path\to\spreadsheet.xlsx', use_iterators=True)
This approach uses an optimized reader, but the workbook loaded is read only.
xlrd
This project covers all Excel spreadsheets, don't be mislead by the 0 major version, this is a relatively mature project. It is part of a triad of Excel targeted packages, including xlwt and xlutils for writing and other basic utilities. The documentation is in one giant html page, which has a lot of info, but is frankly more challenging to read than the nice Sphinx documentation of openpyxl. There is also a tutorial in the form of a pdf.
Install all 3 packages using pip works fine as they are all pure Python. Then in an interpreter ...
>>> from xlrd import open_workbook
>>> import sys
>>> big_wb_xlrd = open_workbook('C:\path\to\spreadsheet.xlsx'',on_demand=True, logfile=sys.stdout, verbosity=1)
This is where better documentation might help as there is no indication what the different levels of verbosity are, although maybe it's obvious. False, 0 or None means none, True or 1 gives you lite feedback, just right, but verbosity=2 gives you a deluge of information - probably best to output this to a file instead of STDOUT.
Now here comes the sad part. If you are using Excel 2007 or newer, on_demand is not supported. If you have logging enabled and verbosity>0, then you will see this message:
WARNING *** on_demand=True not yet implemented; falling back to False
Boo Hoo! Because on_demand is the xlrd equivalent of use_iterators or the openpyxl optimized reader which enhanced speed so much. I will say, that xlrd read a large file slightly faster than openpyxl with use_iterators=False, but definitely slower than the optimized reader.
Conclusion
- For xls files, use xlrd with on_demand=True .
- For xlsx files, use openpyxl with use_iterators=True.
- IMHO openpyxl could be improved with logging to give feedback and xlrd/xlwt could use some better documentation.
In my next post, Round II, I play with reading in actual cell contents. Let's see if openpyxl stays in front.
No comments:
Post a Comment