Friday, May 31, 2013

>>> print 'Reading %(Lolita)s in %(Tehran)s' % {'Lolita': 'Excel', 'Tehran': 'Python'}

[UPDATE 2013-06-14] The google group python-excel is a great source of information on this topic. That's where I discovered xlsxWriter and PyXLL. I also searched the PyPI database with keyword: "xlsx", and I got several more hits, which I've added below in addition to these two. Really thought I'd already had checked PyPI - must be losing my mind; need more sleep.
>>> print 'Reading %(Lolita)s in %(Tehran)s' % {'Lolita': 'Excel', 'Tehran': 'Python'}
Reading 
Excel in Python
That was really a stretch. It's actually not funny or witty at all. In fact now it's just terribly awkward. Although it would of made a little sense if Excel and Python were switched, IE: Reading Python in Excel right, but that is not exactly what we mean. Now I'm just makein it worse.

Main Contenders

Moving on, ever tried to read an Excel file into Python? Numpy only supports csv, so you Google and up pop the main contenders:

Python ← Excel

Python → Excel

  • DataNitro, formerly IronSpread, is an excel plugin that replaces allows Python to be used for add-ins and macros in lieu of VBA, proprietary commercial/personal licensing. I didn't review this one, but I plan to revisit it later.
  • PyXLL, version 2.0.3, also lets you use Python to make Excel add-ins using decorators.

openpxl

This project is for Excel 2007 and newer, it is a port of ExcelPHP and is super easy to use, made even easier by its Sphinx docs at read-the-docs. For some weird reason, pypi and bitbucket both point to old (v-1.5.7) documentation @ pythonhosted, but don't be fooled; get the latest which is much expanded!
$ pip install openpyxl
works fine for any platform since there are no extensions, yay!
$ python
>>> from openpyxl import load_workbook
>>> big_wb = openpyxl('C:\path\to\spreadsheet.xlsx')
seems too take awhile and there doesn't seem any logging or progress available. So use the lazy approach by setting use_iterators=True, which is much faster!
>>> big_wb_fast = openpyxl('C:\path\to\spreadsheet.xlsx', use_iterators=True)
This approach uses an optimized reader, but the workbook loaded is read only.

There is a google group, issue tracking, a blog and an old blog.

xlrd

This project covers all Excel spreadsheets, don't be mislead by the 0 major version, this is a relatively mature project. It is part of a triad of Excel targeted packages, including xlwt and xlutils for writing and other basic utilities. The documentation is in one giant html page, which has a lot of info, but is frankly more challenging to read than the nice Sphinx documentation of openpyxl. There is also a tutorial in the form of a pdf.

Install all 3 packages using pip works fine as they are all pure Python. Then in an interpreter ...
>>> from xlrd import open_workbook
>>> import sys
>>> big_wb_xlrd = open_workbook('C:\path\to\spreadsheet.xlsx'',on_demand=True, logfile=sys.stdout, verbosity=1)
This is where better documentation might help as there is no indication what the different levels of verbosity are, although maybe it's obvious. False, 0 or None means none, True or 1 gives you lite feedback, just right, but verbosity=2 gives you a deluge of information - probably best to output this to a file instead of STDOUT.

Now here comes the sad part. If you are using Excel 2007 or newer, on_demand is not supported. If you have logging enabled and verbosity>0, then you will see this message:
WARNING *** on_demand=True not yet implemented; falling back to False
Boo Hoo! Because on_demand is the xlrd equivalent of use_iterators or the openpyxl optimized reader which enhanced speed so much. I will say, that xlrd read a large file slightly faster than openpyxl with use_iterators=False, but definitely slower than the optimized reader.

Conclusion

  • For xls files, use xlrd with on_demand=True .
  • For xlsx files, use openpyxl with use_iterators=True.
  • IMHO openpyxl could be improved with logging to give feedback and xlrd/xlwt could use some better documentation.
In my next post, Round II, I play with reading in actual cell contents. Let's see if openpyxl stays in front.
Fork me on GitHub