Thursday, October 3, 2013

Turn off Objective-C for m-files, Turn on MATLAB

I just posted about how cool Sublime Text 2 editor is, in Markdown builder for Sublime Text 2, but it's not all ice cream and apple pie. One thing that's a drag is that it always chooses Objective-C syntax for MATLAB m-files. It's a pretty quick fix. Open up the MATLAB language file, found under Packages in the preferences folder, and there's this comment about file types:
 <key>fileTypes</key>
 <array>
  <!-- Actually, it's generally .m, but that's taken by Objective-C. It needs an entry to show up in the syntax list. -->
  <string>matlab</string>
 </array>
Look in the Objective-C language file and you see this is indeed true, m-files are treated as Objective-C instead of MATLAB.
 <key>fileTypes</key>
 <array>
  <string>m</string>
  <string>h</string>
 </array>
Cut and paste <string>m</string> from the Objective-C file to the MATLAB file, and voila, it's all ice cream and apple pie!
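If you'd rather script the edit than cut and paste, here is a minimal sketch in Python 3 using the standard plistlib module. It assumes the language files are XML property lists, which the key/array snippets above suggest. The helper name move_file_type and the two tiny stand-in plists are made up for illustration, and re-serializing with plistlib drops XML comments, so keep a backup of the real files.

```python
import plistlib


def move_file_type(src_plist, dst_plist, ext):
    """Move extension `ext` from one grammar's fileTypes to another's.

    Both arguments are XML plist documents as bytes; returns the two
    updated documents as bytes. Hypothetical helper, for illustration.
    """
    src = plistlib.loads(src_plist)
    dst = plistlib.loads(dst_plist)
    if ext in src.get("fileTypes", []):
        src["fileTypes"].remove(ext)
    if ext not in dst.setdefault("fileTypes", []):
        dst["fileTypes"].append(ext)
    return plistlib.dumps(src), plistlib.dumps(dst)


# stand-ins for the real Objective-C and MATLAB language files
objc = plistlib.dumps({"name": "Objective-C", "fileTypes": ["m", "h"]})
matlab = plistlib.dumps({"name": "MATLAB", "fileTypes": ["matlab"]})

new_objc, new_matlab = move_file_type(objc, matlab, "m")
print(plistlib.loads(new_objc)["fileTypes"])    # ['h']
print(plistlib.loads(new_matlab)["fileTypes"])  # ['matlab', 'm']
```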

Markdown builder for Sublime Text 2

By far the coolest editor around is Sublime Text 2. Set it to build Markdown using PyMarkdown
{
    "cmd": ["markdown_py.bat", "$file", "-f", "$file.html"],
    "selector": "text.html.markdown"
}
Save it in your Sublime preferences folder under Packages/User/Markdown.sublime-build. Now write some Markdown, hit ctrl-B and voila, HTML generated from your Markdown.

Monday, September 30, 2013

Are MATLAB containers.Map more efficient?

UPDATE 2015-02-10 I learned something interesting recently that sheds some light on the answer to this question. I believe that MATLAB's containers.Map is a wrapper around Java's java.util.HashMap. A hash map or hash table is a look-up table that uses a hash of the key to work out where in memory the corresponding value is stored. Therefore it barely matters how big the hash map or hash table is: on average MATLAB (or Java) can go more or less straight to the contents given the key. Hashes are not guaranteed to be unique, so two keys can occasionally collide, but the table resolves a collision by comparing the keys themselves, and with a decent hash function collisions are rare. So the answer to the question, "Are MATLAB containers.Map more efficient?" is, at least for key look-up in large maps, a loud and resounding, "Yes!" QED
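To make the idea concrete, here is a toy hash table in Python 3. It is only a sketch of the general technique, not how containers.Map or java.util.HashMap is actually implemented; the class name TinyHashMap and the bucket count are made up. Note that different keys can land in the same bucket, which is why the look-up still compares keys inside the bucket.

```python
class TinyHashMap(object):
    """Toy hash table with separate chaining (illustration only)."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # the hash picks the bucket; different keys may share a bucket
        return self.buckets[hash(key) % len(self.buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite existing key
                return
        bucket.append((key, value))

    def __getitem__(self, key):
        # only one bucket is searched, however many keys are stored
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)


m = TinyHashMap()
m[1] = 'one'
m[9] = 'nine'   # hash(9) % 8 == hash(1) % 8, so both share a bucket
m['k1'] = 'value'
print(m[1], m[9], m['k1'])  # one nine value
```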

MATLAB's containers.Map's are similar to a Java HashMap or a Python dictionary. They have been around at least since 2008b believe it or not. They were billed as
Fast Key Lookup Provided with New Map Data Structure
when they were introduced. See Programming Fundamentals in 2008b Release Notes. How much memory do they really take up? MATLAB doesn't say. It seems like they're smaller, but you'll soon see that every containers.Map is always 112 Bytes. How much faster can they really be? My guess is they are really only more efficient for really large data, that are only nested with other containers.Map's of primitive types (IE integers, doubles, &c.).

They have some limitations in subassignment and indexing.

For example:
  • If the value of map('k1') is also a containers.Map,
  • map('k1')('k2')
    Error: ()-indexing must appear last in an index expression.
    then this is invalid because MATLAB only allows one pair of parentheses which must appear last. So you'll have to split this into two commands. This also allows assignment into the nested containers.Map ...
    val_is_map = map('k1') % make a temp handle to the top level containers.Map
    val_is_map('k2') = new_value % assignment using the handle also changes the containers.Map it points to
    test = map('k1') % make another handle to test the original containers.Map was updated
    test('k2') == new_value % true
    ... because containers.Map's are handles (IE pointers) so changing the value of a copied containers.Map changes the source.
  • If the value of map('k1') is a structure,
  • map('k1').field = 'foo'
    Error using containers.Map/subsasgn
    Only one level of indexing is supported by a containers.Map.
    then this is also not valid because only the top level of containers.Map can be assigned.
  • Finally if map('k1') is a cell array, then good luck trying to index it.
>> dict = containers.Map({'all','amend','author','msg'}, ...
       {{'-a','--all',true}, ...
       {[],'--amend',true}, ...
       {[],'--author',true}, ...
       {'-m','--message',false}})

dict = 

  Map with properties:

        Count: 4
      KeyType: char
    ValueType: any

>> s = struct('all',{'-a','--all',true}, ...
       'amend',{[],'--amend',true}, ...
       'author',{[],'--author',true}, ...
       'msg',{'-m','--message',false})

s = 

1x3 struct array with fields:

    all
    amend
    author
    msg

>> c1 = {{'all','-a','--all',true}, ...
       {'amend',[],'--amend',true}, ...
       {'author',[],'--author',false}, ...
       {'msg','-m','--message',false}}

c1 = 

    {1x4 cell}    {1x4 cell}    {1x4 cell}    {1x4 cell}

>> c2 = {'all','-a','--all',true; ...
       'amend',[],'--amend',true; ...
       'author',[],'--author',false; ...
       'msg','-m','--message',false}

c2 = 

    'all'       '-a'    '--all'        [1]
    'amend'       []    '--amend'      [1]
    'author'      []    '--author'     [0]
    'msg'       '-m'    '--message'    [0]

>> whos
  Name      Size            Bytes  Class

  dict      4x1               112  containers.Map
  s         1x3              1670  struct
  c1        1x4              2344  cell
  c2        4x4              1896  cell
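Since a containers.Map is similar to a Python dictionary, the handle behavior described in the list above can be sketched in Python 3, where dictionaries are also passed around by reference. This is only an analogy, not MATLAB code, and the key names just mirror the example. Unlike MATLAB, Python allows the chained indexing directly.

```python
# nested dictionaries, analogous to a containers.Map holding another Map
outer = {'k1': {'k2': 'old_value'}}

inner = outer['k1']          # a reference to the nested dict, not a copy
inner['k2'] = 'new_value'    # assigning through the reference ...
print(outer['k1']['k2'])     # new_value  (... also changes the original)

# unlike MATLAB, chained indexing and nested assignment work directly
outer['k1']['k2'] = 'newer_value'
print(inner['k2'])           # newer_value
```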

Sunday, September 29, 2013

MATLAB syntax for Java inner objects

MATLAB is a Java (and .NET) interpreter, yay!
using-java-libraries-in-matlab
But calling inner objects can be tricky in MATLAB. Use the `javaMethod` and `javaObject` builtins. All the examples are from org.eclipse.jgit
javamethod
javaobject

Constants

These are the easiest of all. Though not technically an inner anything, a constant could be confusing, but it is called exactly as it would be in Java or MATLAB.
    filesystem = org.eclipse.jgit.util.FS.DETECTED
FS is the filesystem class in the org.eclipse.jgit.util package. Its constant DETECTED can be accessed using regular dot notation.
org/eclipse/jgit/util/FS

Enumeration of an inner class

This is where it starts to get tricky. A nested or inner class is created in a separate class file preceded with a dollar sign. MATLAB uses the same notation, but only as a string in the javaMethod command.
    NOTRACK = javaMethod('valueOf', ...
        'org.eclipse.jgit.api.CreateBranchCommand$SetupUpstreamMode', ...
        'NOTRACK')
The `CreateBranchCommand` class has a nested class called `SetupUpstreamMode`. Access it in MATLAB with a dollar symbol, "$", instead of dot notation, and do so through `javaMethod`. For example it has several enumerations. `NOTRACK` is an enumeration of `SetupUpstreamMode`. Calling the `valueOf()` method of `SetupUpstreamMode` and passing it the string, "NOTRACK" inside the MATLAB builtin `javaMethod` does the trick.
org/eclipse/jgit/api/CreateBranchCommand
org/eclipse/jgit/api/CreateBranchCommand.SetupUpstreamMode
org/eclipse/jgit/api/CreateBranchCommand.SetupUpstreamMode.html#NOTRACK

Construct an inner class object

This is also easy.
    user = javaObject('org.eclipse.jgit.transport.CredentialItem$Username')
Username is a static nested class of CredentialItem. Access it using the dollar sign instead of dot notation in a call to `javaObject`.
org/eclipse/jgit/transport/CredentialItem.Username

And that's pretty much that. There are some other Java tools, like javaArray, javaMethodEDT & javaObjectEDT. I'll update this more later. Promise.

Credit for MATLAB brush: Will Schleter. Thanks!

Friday, September 20, 2013

IPOPT as non-linear solver for MATLAB

You have a non-linear problem to solve but not the MATLAB Optimization Toolbox?
  1. First download the IPOPT mex and m-files, and extract to your MATLAB search path.
  2. Make an m-file that defines your objective and constraints, gradient and Jacobian.
Notes:
  • Credit for MATLAB brush: Will Schleter. Thanks!
  • Other non-linear solvers for MATLAB
  • Perfect for solving a staggered mesh.
  • If you were using Python you could use optimization routines in SciPy.
function [x,info] = myNonLinearProblem(x0, auxdata)
%MYNONLINEARPROBLEM Solves my non-linear problem
% [X,INFO] = MYNONLINEARPROBLEM(X0,AUXDATA)
% returns solution, X, and INFO to my non-linear problem
% X0 is the initial guess
% AUXDATA are any additional arguments my non-linear problem
% requires

x0 = x0(:); % initial guess

%% constraints
%set all of your equations as constraints
funcs.constraints = @constraints;
% constraints require a Jacobian
funcs.jacobian = @jacobian;
% Jacobians require a sparsity structure
funcs.jacobianstructure = @jacobianstructure;
% set upper and lower bounds of constraints to zero
options.cl = zeros(size(x0)); % lower bounds
options.cu = zeros(size(x0)); % upper bounds

%% objective
% set the objective sum of the squares
funcs.objective = @objective;
% objective requires a gradient
funcs.gradient = @gradient;

%% options
% set Quasi-Newton option
options.ipopt.hessian_approximation = 'limited-memory';

%% solve
% ipopt_auxdata reads the extra arguments from options.auxdata
options.auxdata = auxdata;
[x,info] = ipopt_auxdata(x0,funcs,options);
end

% x & auxdata must be passed as only args to
% objective, gradient, constraints and jacobian

function f = objective(x,auxdata)
% objective is the sum of the squares of the residuals
f = residual(x,auxdata);
f = f(:);
f = f'*f;
end

function g = gradient(x,auxdata)
[f,j] = residual(x,auxdata);
f = f(:);
g = 2*f'*j; % d(f'*f)/dx = 2*f'*J
end

function f = constraints(x,auxdata)
f = residual(x,auxdata);
f = f(:);
end

% Jacobian and Jacobian structure must be sparse

function j = jacobian(x,auxdata)
% rows correspond to each constraint
% columns correspond to each x
[~,j] = residual(x,auxdata);
j = sparse(j);
end

function j = jacobianstructure(x)
% x is the only arg passed to Jacobian structure
% assuming closed system of equations
% # of constraints = degrees of freedom
% therefore Jacobian matrix will be square
j = sparse(ones(size(x)));
end

% put the equations and their derivatives here
% unfortunately IPOPT is not a Jacobianless solver :(

function [f,j] = residual(x,auxdata)
% calculate your residuals here
% f1(x,auxdata) = lhs1(x,auxdata) - rhs1(x,auxdata) = 0
% f2 = ...
% j1(x,auxdata) = [df1/dx1, df1/dx2, ...]
% j2 = ...
end
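Because IPOPT is not Jacobian-free, it is worth checking your derivatives before handing them to the solver. For a sum-of-squares objective F = f'*f the gradient is 2*J'*f. Here is a minimal sketch of such a check in plain Python 3; the two residual equations are made up for illustration, and the analytic gradient is compared against central finite differences.

```python
def residual(x):
    """Made-up residuals f(x) and their Jacobian j[i][k] = df_i/dx_k."""
    x1, x2 = x
    f = [x1 ** 2 + x2 - 3.0, x1 - x2 ** 2 + 1.0]
    j = [[2.0 * x1, 1.0],
         [1.0, -2.0 * x2]]
    return f, j


def objective(x):
    f, _ = residual(x)
    return sum(fi * fi for fi in f)  # F = f' * f


def gradient(x):
    f, j = residual(x)
    # dF/dx_k = 2 * sum_i f_i * df_i/dx_k, i.e. g = 2 * J' * f
    return [2.0 * sum(f[i] * j[i][k] for i in range(len(f)))
            for k in range(len(x))]


def finite_difference(func, x, h=1e-6):
    """Central-difference approximation of the gradient of func at x."""
    g = []
    for k in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[k] += h
        xm[k] -= h
        g.append((func(xp) - func(xm)) / (2.0 * h))
    return g


x0 = [0.5, -1.5]
analytic = gradient(x0)
numeric = finite_difference(objective, x0)
print(max(abs(a - n) for a, n in zip(analytic, numeric)) < 1e-5)  # True
```

The same check works for each row of the Jacobian: perturb one x at a time and compare against the analytic column.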

Friday, August 23, 2013

Jedi Nation

In Star Wars when Alderaan, Leia's homeworld, is destroyed by the Death Star, Obi Wan slumps over in the Millennium Falcon and utters something like, "I feel as if ten thousand souls suddenly screamed out, and then were silenced."

Hmm, is Obi Wan using his Jedi powers to snoop on the universe? One could see it that way, but the verb "snoop" carries the connotation that he did not have the snoopee's best interests at heart. And Obi Wan was all heart.

Hence the Jedi Nation.

Let us take the current climate of anti-terrorist snooping. It brings to mind images of kicked down doors in the middle of night, subjects disappearing based on flimsy wiretap evidence. But what if we take the Jedi approach? What if the next day, Obi Wan shows up to the subject's apartment, offers to buy her some coffee and says, "I got a feeling from the Force that you're feeling a bit unhappy with the current government and some other stuff. You know all these wars over oil piss me off too. Want to talk to someone about it? Are you in a bad spot right now? I could help you with your rent for a while. Are you looking for work? Let me measure your mitochondria. You're a bit old to start the training, but the Force runs strong in you. We'll make a Jedi out of you yet!"

If she still resists, Obi Wan can try Jedi mind tricks. "These are not the innocent civilians you are looking for." If she tries to kill Obi Wan, then he might be forced to slice her up with his light-saber in self defense.

Wednesday, August 14, 2013

Flatten nested if statements with the opposite condition

Have you ever found yourself nesting if, for and other flow-control statements seemingly endlessly into the dreaded arrow formation?
EG:
if this:
    if that:
        for x in X:
            if yuck:
                ...
            else:
                ...
                continue
    else:
        while Y:
            if foobar:
                ...
            else:
                ...
                break
else:
    ...
One of the easiest ways to collapse if statements is using the opposite condition and return, continue or break.
EG:
if not this:
    ...
    return
if not that:
    while Y:
        if not foobar:
            ...
            break
        ...
    return
for x in X:
    if not yuck:
        ...
        continue
    ...
Some refactoring tools will help you do this, but it is good programming practice to avoid nested flow-control statements when possible. Even though in this case it only eliminated 2 lines, the result is more compact and the lines are shorter, so wrapping/continuing/splitting commands is not an issue. Each branch also exits as soon as its work is done, by returning, continuing or breaking, so it is easier to see how much code runs in each case.
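Here is a small runnable version of the same idea in Python 3. The order dictionaries and the function names are made up for illustration; the point is that both functions return the same answers, but the flat one tests the opposite condition and bails out early.

```python
def classify_nested(order):
    """Arrow-shaped version: every check adds a level of indentation."""
    if order is not None:
        if order['items']:
            if order['paid']:
                return 'ship'
            else:
                return 'await payment'
        else:
            return 'empty'
    else:
        return 'missing'


def classify_flat(order):
    """Same logic, flattened by testing the opposite condition first."""
    if order is None:
        return 'missing'
    if not order['items']:
        return 'empty'
    if not order['paid']:
        return 'await payment'
    return 'ship'


orders = [None,
          {'items': [], 'paid': False},
          {'items': ['book'], 'paid': False},
          {'items': ['book'], 'paid': True}]
print([classify_flat(o) for o in orders])
# ['missing', 'empty', 'await payment', 'ship']
```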

Monday, August 12, 2013

SifterClient

Wow! Someone took an interest in one of my side projects. A long time ago, we started using Sifter at work, so I decided to make an Android app, called SifterReader to access it from my phone. It was a fun little side project, but we stopped using Sifter so I abandoned it. Now it's been revived. That's nice!

Thursday, August 1, 2013

importing UML models from Modelio to Papyrus

Skipping the discussion of class diagrams, UML and round trip engineering, let's just say after dismissing Umlet and Violet, Argo, NClass and StarUML, and all of the non-Windows or commercial offerings you have settled between Modelio and Papyrus. You like the ease of Modelio, but its freemium service means you can't generate code, and Papyrus + Acceleo does just that.
  1. Make your diagram in Modelio.
  2. From the model explorer, right click and select XMI --> export.
  3. Browse to a location outside of your workspace and save it as EMF UML 3.0.0, but uncheck adding Modelio notations please! Also save it with the *.uml extension instead of *.xmi.
  4. Now start Eclipse and show the Papyrus perspective.
  5. If you already have a project, that's fine, just start a new Papyrus model.
  6. Create a folder in your project called XMI and import the files you exported from Modelio - there should be 2 of them, a profile and your model.
  7. Double click the model that you just created (e.g. model.di) and it should open in the main window, and hopefully you will see it in the model explorer too.
  8. Right click and select import, then import package from user model.
  9. In the window that opens, select your profile and model packages, and then import all. I'm not 100% positive that you need to import the profile, but it doesn't hurt, and I had issues when I tried to just copy the model components from the Modelio UML model to Papyrus.
  10. Ta-daa, you should now see two packages in your model explorer one is the default profile and the other is your Modelio model.
  11. Now at the top level model, create a class diagram (or any of the 9 UML diagrams). This is the most important step!
  12. Drag and drop the first class into the diagram window.
  13. Expand that class in the model explorer and select the class attributes and drag and drop them into the first section of the class below the title.
  14. Repeat for the operations, but drop them into the second box.
  15. Drag over any associations.
  16. You may need to remove and re-add any stereotypes as for some reason they were not showing up in the diagram.
Voila! you've transferred your model and your diagram!

Tuesday, July 30, 2013

Setuptools scandal derails distribute with wheels?

Distribute is dead. Setuptools was a zombie for a while but has now merged with the distribute fork and is the newest direction in Python packaging. But that's not all, now there are also wheels, which roll truer than eggs but might be a bit wobbly until there are a few more of them and they gain momentum.

the old Distribute graphic
This is a 180 from the new hotness.
If you use pip, uninstall distribute and setuptools.
$ pip uninstall distribute setuptools
Then install the newest version of setuptools and pip. Finally install wheel, the package that adds wheel support.
$ pip install setuptools pip wheel
Distribute-0.7.x, which was meant as a transition between setuptools-0.6 and 0.8, is no longer necessary since setuptools-0.9.8 is out. Pip is at 1.4 with support for both setuptools and wheels.

Tuesday, July 2, 2013

world wide widgets

Lots of great Tkinter widgets abound on the web and even in the Python source code under demos. Here are my newest favs and some others I've posted previously:
  1. scrolled canvas: uses canvas windows with frames and a scrollbar. credit: python source code demo "canvas-with-scrollbars.py"
  2. treeview table: uses a ttk.Treeview to make a table - and they said it couldn't be done. credit: daniweb
  3. rascally resize tk scrollbars
  4. ttk Notebook demo for Py2
Please feel free to copy these and use them for good. They are covered by this license.

Thursday, June 27, 2013

What is a scope for lambda?

Okay computer science students out there, riddle me this.
A scope is home for a function,
>>> def g(i):
...     def f():
...         return i
...     return f
...
>>> print [f() for f in (g(i) for i in xrange(3))]
[0, 1, 2]
but what is a scope for a lambda?
>>> print [f() for f in [lambda: i for i in xrange(3)]]
[2, 2, 2]
A scope is home for a generator,
>>> print [f() for f in (lambda: i for i in xrange(3))]
[0, 1, 2]
and a default parameter is a hack for a lack of a scope,
>>> print [f() for f in [lambda a=i: a for i in xrange(3)]]
[0, 1, 2]
but a new scope is home for a lambda.
>>> print [f() for f in ((lambda a=i: lambda: a)() for i in xrange(3))]
[0, 1, 2]
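The riddle comes down to when i is looked up. Here is the same set of experiments as a Python 3 script (range instead of xrange, print as a function), plus functools.partial as one more way to freeze the value; that last variant is my addition, not part of the riddle above.

```python
from functools import partial

# a lambda looks up i when it is *called*, not when it is defined
late = [lambda: i for i in range(3)]
print([f() for f in late])        # [2, 2, 2]

# a generator yields each lambda while i still has that value
lazy = [f() for f in (lambda: i for i in range(3))]
print(lazy)                       # [0, 1, 2]

# a default argument is evaluated at definition time
bound = [lambda a=i: a for i in range(3)]
print([f() for f in bound])       # [0, 1, 2]

# functools.partial also freezes the current value
frozen = [partial(lambda a: a, i) for i in range(3)]
print([f() for f in frozen])      # [0, 1, 2]
```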

And while we're on the topic of weird Python hacks and weird comprehensions.
What the heck is this?
>>> foobar = [(1, 2, 3), (4, 5), (6, 7, 8, 9), (0, )]
>>> # foo are elements in bar which are elements in foobar
>>> [foo for bar in foobar for foo in bar]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
Well this was really just an excuse to rock the new syntax highlighter.
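For the record, that double comprehension reads left to right like nested for-loops. A quick sketch in Python 3 that spells it out, and compares it with itertools.chain.from_iterable from the standard library, which flattens one level the same way:

```python
from itertools import chain

foobar = [(1, 2, 3), (4, 5), (6, 7, 8, 9), (0, )]

# the comprehension reads left to right like nested for-loops
flat = []
for bar in foobar:        # outer loop comes first
    for foo in bar:       # inner loop comes second
        flat.append(foo)

print(flat == [foo for bar in foobar for foo in bar])     # True
print(flat == list(chain.from_iterable(foobar)))          # True
```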

Tuesday, June 25, 2013

syntax sensation: A comparison of syntax highlighters

define CSS styles in html

I added 2 CSS definitions, .block-code, which I'm using here, and .inline-code. The definitions are in this Gist: Which looks like this:
To see the XKCD Python comic, type: import antigravity
>>> import this
The Zen of Python, by Tim Peters
 
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
I have been using <pre></pre> tags for blocks of code and <span></span> for inline code instead of <code></code> and that seems to work great. Some downsides of this approach are that there are no line numbers, no syntax highlighting, and it doesn't scroll.

google-code-prettify

This is a very simple syntax highlighter. It works similar to mathjax, the embedded Gist in the previous section and some of my NPR posts, by loading a js script from the web.
<script src="https://google-code-prettify.googlecode.com/svn/loader/run_prettify.js"></script>
It automatically detects languages, e.g. Python, there are several skins (this is sons-of-obsidian), and you can add line numbers. It will format <pre></pre> and <code></code> tags that contain class="prettyprint", but it doesn't use <span></span> tags. For line numbers use class="prettyprint linenums".
#!/usr/bin/python

def fib():
  '''
  a generator that produces the elements of the fibonacci series
  '''

  a = 1
  b = 1
  while True:
    a, b = a + b, a
    yield a

def nth(series, n):
  '''
  returns the nth element of a series,
  consuming the earlier elements of the series
  '''

  for x in series:
    n = n - 1
    if n <= 0: return x

print nth(fib(), 10)

SyntaxHighlighter

This is the most ubiquitous and snazzy highlighter. Same as prettify, you load javascript.
<link href='http://alexgorbatchev.com/pub/sh/current/styles/shCore.css' rel='stylesheet' type='text/css'/>
<link href="http://alexgorbatchev.com/pub/sh/current/styles/shThemeFadeToGrey.css" rel="stylesheet" type="text/css" />
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shCore.js" type="text/javascript"></script>
<script src="http://alexgorbatchev.com/pub/sh/current/scripts/shAutoloader.js" type="text/javascript"></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushPython.js' type='text/javascript'></script>
<script src='http://alexgorbatchev.com/pub/sh/current/scripts/shBrushXml.js' type='text/javascript'></script>
<script language='javascript'>
SyntaxHighlighter.config.bloggerMode = true;
SyntaxHighlighter.all();
</script>
Then use <pre class="brush: python">your Python code goes here</pre>. It scrolls nicely, and there are several themes and supported languages - just load the appropriate brush script, e.g.: "shBrushPython.js" script, then specify the language in lower case, i.e.: "python". The <script></script> mode doesn't work on Blogger, and note the bloggerMode = true configuration in the script. Also, there is no inline mode, but there is a nice scrollbar.
#!/usr/bin/python

def fib():
  '''
  a generator that produces the elements of the fibonacci series
  '''

  a = 1
  b = 1
  while True:
    a, b = a + b, a
    yield a

def nth(series, n):
  '''
  returns the nth element of a series,
  consuming the earlier elements of the series
  '''

  for x in series:
    n = n - 1
    if n <= 0: return x

print nth(fib(), 10)

Monday, June 24, 2013

rascally resize tk scrollbars

First, is <pre></pre> the coolest html tag ever? I don't have to worry about non-breaking spaces or line-breaks with pre-formatted text, I just copy and paste my code.
Second, in case you were wondering, I've added some CSS for inline and block code to my blogger posts. They're in this Gist.
Third, here's a demo of scrollbars and listbox that resize with the window. The trick is in the row and column weights. Set it to a positive number to resize, by using either columnconfigure or rowconfigure.
#! /usr/bin/env python

from Tkinter import *
from ttk import *
import calendar

root = Tk()
root.title('Listy')

master = Frame(root)
master.pack(expand=True, fill=BOTH)
master.columnconfigure(0, weight=1)
# keep scrollbar same width, ie don't resize!
master.columnconfigure(1, weight=0)
master.rowconfigure(0, weight=1)

# y-scrollbar
scrolly = Scrollbar(master, orient=VERTICAL)
scrolly.grid(row=0, column=1, sticky=N+S)

# listbox
listy = Listbox(master)
listy.grid(row=0, column=0, sticky=N+S+E+W)

# content
for m in calendar.month_name:
    listy.insert(END, m)
for d in calendar.day_name:
    listy.insert(END, d)

# bind scrollbar to listbox
listy.config(yscrollcommand=scrolly.set)
scrolly.config(command=listy.yview)

if __name__ == '__main__':
    master.mainloop()
You could do this with the packer geometry manager, by using pack(expand=YES, fill=BOTH) for the listbox and pack(fill=Y) for the scrollbar. The trick is expand which causes the listbox to resize, but not the scrollbar.
#! /usr/bin/env python

from Tkinter import *
from ttk import *
import calendar

root = Tk()
root.title('Listy')

master = Frame(root)
master.pack(expand=True, fill=BOTH)

# y-scrollbar
scrolly = Scrollbar(master, orient=VERTICAL)
scrolly.pack(side=RIGHT, fill=Y)

# listbox
listy = Listbox(master)
listy.pack(side=LEFT, expand=YES, fill=BOTH)

# content
for m in calendar.month_name:
    listy.insert(END, m)
for d in calendar.day_name:
    listy.insert(END, d)

# bind scrollbar to listbox
listy.config(yscrollcommand=scrolly.set)
scrolly.config(command=listy.yview)

if __name__ == '__main__':
    master.mainloop()

ttk Notebook demo for Py2

There is a very nice ttk Notebook demo on a very cleverly named blog called Py in my eye. Note: there are very few differences between the Python 3 version of this demo and the Python 2 version, other than
For a Python 2 version of Jane's demo, see this Gist. To help myself understand what was going on, I forced myself to decompose that stellar example into this super easy demo:
#! /usr/bin/env python

from Tkinter import *
from ttk import *

root = Tk() # create a top-level window

master = Frame(root, name='master') # create Frame in "root"
master.pack(fill=BOTH) # fill both sides of the parent

root.title('EZ') # title for top-level window
# quit if the window is deleted
root.protocol("WM_DELETE_WINDOW", master.quit)

nb = Notebook(master, name='nb') # create Notebook in "master"
nb.pack(fill=BOTH, padx=2, pady=3) # fill "master" but pad sides

# create each Notebook tab in a Frame
master_foo = Frame(nb, name='master-foo')
Label(master_foo, text="this is foo").pack(side=LEFT)
# Button to quit app on right
btn = Button(master_foo, text="foo", command=master.quit)
btn.pack(side=RIGHT)
nb.add(master_foo, text="foo") # add tab to Notebook

# repeat for each tab
master_bar = Frame(nb, name='master-bar')
Label(master_bar, text="this is bar").pack(side=LEFT)
btn = Button(master_bar, text="bar", command=master.quit)
btn.pack(side=RIGHT)
nb.add(master_bar, text="bar")

# start the app
if __name__ == "__main__":
    master.mainloop() # call master's Frame.mainloop() method.
    #root.destroy() # if mainloop quits, destroy window
Some notes:
  • The original demo puts the notebook in a frame, in another frame inside the top-level window, but you can just go nb->frame->root, and skip the extra frame. Not sure what you gain or lose.
  • If you want to see the demo decomposed, I converted the original demo as a script in the Gist.
  • You don't have to call Tk() to create a top-level window, Frame will do it for you. Then you can access the window via the master attribute of Frame.
  • If you call the Frame's quit() method, it will destroy the window for you, so the last line, root.destroy(), is not necessary.
  • If you don't bind the "WM_DELETE_WINDOW" protocol to Frame's quit() method, you will get a traceback when root.destroy() is called, saying that it can't destroy the window because it's already been deleted.
  • Use fill=BOTH if your labels and buttons are smaller than the parents they occupy if you want them to extend to both sides.
  • All of these demos are included in your Python distribution. On MS Windows it is here: C:\Python27\tcl\tk8.5\demos
Enjoy!!!

Wednesday, June 19, 2013

Dates and Datetimes


NumPy Datetimes

NumPy has datetimes, called datetime64 to avoid confusion with the Python datetime module and class. But it only uses ISO 8601 formats for text entries, e.g.: 2013-06-19T16:14:32.00-0700. It will also take a Python datetime.datetime() or numpy.datetime64() as an argument, but NumPy will always shift the date/time to the local timezone. If the Python datetime.datetime() object is naive (IE no tzinfo) then NumPy will assume it is UTC (Zulu, GMT or +0000). Calling numpy.datetime64().item() will return the UTC equivalent Python datetime.datetime() object.

Examples with np.datetime64 dtype:

>>> import numpy as np
>>> from datetime import datetime
>>> np.datetime64(datetime.today().isoformat())
numpy.datetime64('2013-06-19T16:17:27.612000-0700')

Examples with np.array:

>>> dt = np.dtype([('dates', 'datetime64[D]'), ('dni', float)])
>>> data = [('2001-01-01', 834.34),
...         ('2001-01-02', 635.12)]
>>> npdata = np.array(data, dt)
>>> npdata
array([(datetime.date(2001, 1, 1), 834.34),
       (datetime.date(2001, 1, 2), 635.12)],
      dtype=[('dates', '<M8[D]'), ('dni', '<f8')])

Repeat that with a datetime using Zulu time.

>>> dt = np.dtype([('dates', 'datetime64[m]'), ('dni', float)])
>>> data = [('2001-01-01T00:30Z', 834.34),
...         ('2001-01-01T01:30Z', 635.12)]
>>> npdata = np.array(data, dt)
>>> npdata
array([(datetime.datetime(2001, 1, 1, 0, 30), 834.34),
       (datetime.datetime(2001, 1, 1, 1, 30), 635.12)],
      dtype=[('dates', '<M8[m]'), ('dni', '<f8')])

Repeat that with a datetime using UTC offset (+0000) for Zulu.

>>> dt = np.dtype([('dates', 'datetime64[m]'), ('dni', float)])
>>> data = [('2001-01-01T00:30-0000', 834.34),
...         ('2001-01-01T01:30-0000', 635.12)]
>>> npdata = np.array(data, dt)
>>> npdata
array([(datetime.datetime(2001, 1, 1, 0, 30), 834.34),
       (datetime.datetime(2001, 1, 1, 1, 30), 635.12)],
      dtype=[('dates', '<M8[m]'), ('dni', '<f8')])
n.b.: NumPy converts strings for you, so you don't have to use np.datetime64 to cast them as datetime64 dtypes. Also it converts them to Python datetime.datetime or datetime.date, depending on your date units and shifts them to GMT (or Zulu) time. NumPy seems to handle dates and datetimes with the default units of day, e.g.: [D], but for structured arrays you must specify the datetime units e.g.: [D], [m], [s] or [ms] (see datetime units) in addition to datetime64 as the dtype or NumPy gives you this cryptic error:
ValueError: Cannot create a NumPy datetime other than NaT with generic units
Thanks to this answer on SO for unriddling that puzzle. If you make a NumPy datetime with nothing, you'll discover that NaT means "Not a time". In addition you may get this error, which is a bit more informative.
TypeError: Cannot cast datetime.datetime object from metadata [us] to [D] according to the rule 'same_kind'
This is because datetime.datetime uses micro-seconds [us] as its default, but NumPy uses days [D]. Specify the dtype using [us] or some form of seconds units, e.g.: [s], [ms], and it should work.

Matplotlib.dates.datestr2num()

This is an undocumented function that uses dateutil to convert a string to a floating decimal number that matplotlib uses to treat dates, similar to MATLAB and Excel. The function can also be imported via pylab.
>>> import pylab
>>> import pytz
>>> pst = pytz.timezone('US/Pacific')
>>> some_date = pylab.datestr2num('1998-1-1 12:15-0800')
>>> pylab.num2date(some_date, pst)
datetime.datetime(1998, 1, 1, 12, 15, tzinfo=<DstTzInfo 'US/Pacific' PST-1 day, 16:00:00 STD>)
>>> same_date = pylab.datestr2num('1/1/1998 12:15-0800')
>>> pylab.num2date(same_date, pst)
datetime.datetime(1998, 1, 1, 12, 15, tzinfo=<DstTzInfo 'US/Pacific' PST-1 day, 16:00:00 STD>)
Pretty nifty! Works better than I thought! In fact I like it way better than NumPy or Python for that matter. Note how using pytz helps matplotlib set the timezone; the tz class from dateutil can also be used to set tzinfo. Also note that if we hadn't set the UTC offset in the string, then it would have output 4:15 AM instead of 12:15, since it would have assumed GMT. Also matplotlib is pretty smart about determining the format; the default is month/day/year.

Python datetime

The lame way to do this is with Python's datetime.strptime(), but it doesn't support the %z directive for UTC offset (that's only for strftime() functions of datetime, date and time instances), and it only has an abstract class tzinfo for timezone, which can be replaced by pytz.
>>> dt = datetime.strptime('1998-1-1 12:15', '%Y-%m-%d %H:%M')  # naive datetime instance
>>> print dt
1998-01-01 12:15:00
>>> new_dt = datetime(*dt.timetuple()[0:6], tzinfo=pst)  # aware datetime instance
>>> print new_dt
1998-01-01 12:15:00-08:00
>>> new_dt.toordinal()
729390
The ordinal of the datetime is the date part only, a whole number of days, whereas matplotlib also outputs the fractional sub-day portion. The strptime() classmethod lets you set the format, which is very nice.
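To see the difference between the whole-day ordinal and a date number with a fractional part, here is a minimal sketch in Python 3 using only the standard library. The name date_number is made up, and this only illustrates the idea of ordinal plus fraction of a day; it is not a claim about how matplotlib computes its numbers internally.

```python
from datetime import datetime

dt = datetime.strptime('1998-1-1 12:15', '%Y-%m-%d %H:%M')

# toordinal() counts whole days only, so the time of day is dropped
print(dt.toordinal())  # 729390

# add the elapsed fraction of the day to get a floating-point date number
seconds = dt.hour * 3600 + dt.minute * 60 + dt.second
date_number = dt.toordinal() + seconds / 86400.0
print(round(date_number, 4))  # 729390.5104
```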

Time

When working with times only, Python's datetime will assume 1900-01-01, while NumPy assumes 1970 and matplotlib will default to today's date. But it's actually hard to create a time-only value in NumPy the way you can with Python's datetime.time. Maybe there is a correct way to do it, but I could only make datetime.date and datetime.datetime with numpy.datetime64.

NumPy time example:

>>> np.array(datetime.time(8, 30), dtype='datetime64[m]')
Could not convert object to NumPy datetime
The only way I could do it was with timedelta64.
>>> t = np.timedelta64(8, 'h') + np.timedelta64(30, 'm')
>>> print t
510 minutes
>>> np.array(t, dtype='datetime64')
array(datetime.datetime(1970, 1, 1, 8, 30), dtype='datetime64[m]')
So you can see, NumPy just went with 1970, the Unix epoch, for the year! As I said, Python reverts to 1900. For these examples I have to use datetime.time, so reimport datetime by itself. Also I assume that pytz was imported and pst is a pytz.timezone instance of 'US/Pacific' as in the sections above.

Python datetime.time example:

>>> import datetime
>>> t = datetime.time(8, 30, tzinfo=pst)
>>> t.strftime('%Y-%m-%d %H:%M %Z')
'1900-01-01 08:30 US/Pacific'
The timezone name, %Z, worked but I couldn't get %z to show the UTC offset. But both worked for datetime.datetime.
>>> dt = datetime.datetime.strptime('8:30','%H:%M')
>>> dt.replace(tzinfo=pst).strftime('%m/%d/%Y %H:%M:%S %z (%Z)')
'01/01/1900 08:30:00 -0800 (PST)'
Finally, as I said, matplotlib assumes the current date. Notice too that it knows that it's currently daylight saving time!

matplotlib time example:

>>> md = pylab.num2date(pylab.datestr2num('8:30 -0700'), pst)
>>> md
datetime.datetime(2013, 6, 19, 8, 30, tzinfo=<DstTzInfo 'US/Pacific' PDT-1 day, 17:00:00 DST>)
You can use datetime.replace() to swap out the date for whatever you want.
>>> print md.replace(1998, 1, 1)
1998-01-01 08:30:00-07:00
This works for tzinfo too.
>>> print dt.replace(tzinfo=pst)  # aware datetime instance
1998-01-01 12:15:00-08:00
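Here is a small standard-library sketch of the same ideas, written for Python 3. It swaps pytz for a fixed-offset datetime.timezone, which is my substitution and knows nothing about daylight saving, and uses datetime.combine() as another way to attach a date to a time. The name pst_fixed is hypothetical.

```python
import datetime

# Fixed UTC-8 offset labelled 'PST'; unlike pytz it has no DST rules.
pst_fixed = datetime.timezone(datetime.timedelta(hours=-8), 'PST')

# A time-only value: strftime() fills in 1900-01-01 for the missing date.
t = datetime.time(8, 30, tzinfo=pst_fixed)
stamp = t.strftime('%Y-%m-%d %H:%M %z (%Z)')
print(stamp)

# combine() pairs the time with a real date and keeps the time's tzinfo.
dt = datetime.datetime.combine(datetime.date(1998, 1, 1), t)
print(dt.isoformat())
```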

ISO 8601 Format

All of the datetime classes, and therefore the datetimes that matplotlib returns too, have an isoformat() method.
>>> md.isoformat()
'2013-06-19T08:30:00-07:00'

Timezone

Hope you noticed that there is lots of timezone info.
>>> t.tzname()
'US/Pacific'
>>> md.tzname()
'PDT'
Please see the pytz and dateutil packages for complete details on using timezones, as there are package-specific methods other than replacing tzinfo. For example, pytz exposes the localize method to create a datetime directly from a timezone object.
>>> pst.localize(datetime.datetime(2013,4,20,12,30))
datetime.datetime(2013, 4, 20, 12, 30, tzinfo=<DstTzInfo 'US/Pacific' PDT-1 day, 17:00:00 DST>)

Python time

This is a separate module for timing CPU operations. Also IPython has its own magical time, which can be called using %time. You can use time.clock() to measure how fast code runs on most platforms, and time.sleep() will make it pause. Wow! I hope I can remember all of this!
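A minimal timing sketch, written for Python 3, where time.clock() was removed in 3.8 and time.perf_counter() is the usual replacement. The busy_work function is a hypothetical stand-in for whatever you want to time.

```python
import time


def busy_work(n=10000):
    """Hypothetical workload: sum of squares below n."""
    return sum(i * i for i in range(n))


start = time.perf_counter()   # high-resolution clock, includes sleep time
result = busy_work()
time.sleep(0.01)              # pause for roughly 10 ms
elapsed = time.perf_counter() - start
print('took %.4f s' % elapsed)
```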

Saturday, June 15, 2013

[Python] read formatted input

I have come to the same conclusion as this blog and this "physics forum thread" (is this a real forum? or a copy of another forum?).
There is no Python equivalent of C/C++ fscanf or MATLAB fscanf, sscanf or textscan.
Here are some alternatives that I have found.
  1. numpy.genfromtxt() does more or less exactly the same thing. It reads strings, via StringIO, and files. There is a nice section on importing data in the NumPy User Guide.
    • instead of format specifiers like '%8f%4s%2d' use delimiter=(8, 4, 2) and set dtype=(float, str, int). Voila!
    • But genfromtxt() does so much more! Using dtypes you can also set field names. There are options for skipping headers and footers; see the documentation.
  2. parse 1.6.1 offers parse(), the opposite of format() on PyPI. I haven't tried it, and I wish there was more documentation, specifically examples of multiple parsed tokens, but it does seem to be a python version of textscan, but for strings only.
  3. The re module in the standard Python reference is an obvious choice to parse tokens from strings. There is even a section on simulating scanf that offers recipes for %f and other formatters.
  4. For simple delimiters, one can use either of the following:
    • csv module from the standard Python reference
    • numpy.loadtxt() which has the added advantage of reading in data as NumPy arrays.
    • str.split() obviously
There are probably many other methods, but for MATLAB converts, once they move from disbelief and denial onto acceptance, it's a pretty straightforward issue to resolve.
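To make a couple of the alternatives above concrete, here is a standard-library sketch in Python 3 that reads records laid out like '%8f%4s%2d', once by slicing fixed widths and once with a regular expression in the spirit of the scanf recipes. The sample records and the helper names read_fixed and read_regex are made up for illustration.

```python
import re

# Two hypothetical records: 8-char float, 4-char string, 2-char integer.
lines = ["  3.1416 abc 7", "  2.7183 xyz42"]


def read_fixed(line, widths=(8, 4, 2), types=(float, str, int)):
    """Slice a line into fixed-width fields and convert each one."""
    fields, start = [], 0
    for width, convert in zip(widths, types):
        fields.append(convert(line[start:start + width].strip()))
        start += width
    return tuple(fields)


# A float, then a word, then an integer, with optional whitespace between.
SCANF = re.compile(r"\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*([A-Za-z]+)\s*(\d+)")


def read_regex(line):
    """Parse the same three tokens with a regular expression."""
    number, word, count = SCANF.match(line).groups()
    return float(number), word, int(count)


fixed_rows = [read_fixed(line) for line in lines]
regex_rows = [read_regex(line) for line in lines]
print(fixed_rows)
print(regex_rows)
```

Slicing is the closest match to fixed-width format specifiers, while the regex version tolerates ragged spacing.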

Wednesday, June 12, 2013

use generators, yield, xrange and obj.iter*() whenever possible

Python has a very cool feature. Java also has it, but Java isn't cool. Probably C and its variants have it, and I'm sure that super cool Ruby has it too. What about MATLAB, anyone? Bueller? ... Bueller?

iterators

An iterator hands out its items one at a time, so it uses less memory and is generally faster than an iterable that holds everything at once, like a list.

Generators

Instead of returning a list from a function, turn it into an iterator by using yield

iterable

def listy(x=5):
    return range(x)

iterator

def genny(x=5):
    idx = 0
    while idx < x:
        yield idx
        idx += 1

xrange

The example above is the same as the difference between range and xrange.

iterable

>>> listy = range(5)
>>> print listy
[0, 1, 2, 3, 4]

iterator

>>> for idx in xrange(5):
...     print idx,
0 1 2 3 4

Generator Expression

Just like you can make a list on the fly with a list comprehension, you can make a generator on the fly with a generator expression!

iterable

>>> listy = [idx for idx in range(5)]
>>> print listy
[0, 1, 2, 3, 4]

iterator

>>> animals = {'dog': 'spot', 'cat': 'felix'}
>>> genny = ('%s is a %s' % (name, animal) for animal, name in animals.iteritems())
>>> for animal_name in genny:
...     print animal_name,
spot is a dog felix is a cat
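To put a rough number on the memory difference, here is a Python 3 sketch comparing a list comprehension with the equivalent generator expression. Note that sys.getsizeof() only measures the container itself, not the integers it refers to, so treat the figures as indicative only.

```python
import sys

n = 100000
squares_list = [i * i for i in range(n)]   # every item is built up front
squares_gen = (i * i for i in range(n))    # items are built only on demand

list_bytes = sys.getsizeof(squares_list)   # grows with n
gen_bytes = sys.getsizeof(squares_gen)     # small and independent of n
print(list_bytes, gen_bytes)

# Both give the same answer when consumed once.
total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)
```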

obj.iter*()

This actually links to a great section in the tutorial on looping techniques, where you will see examples of xrange(), enumerate(), and iteritems(). You can see in the generator expression above that a dictionary has a method called iteritems() that produces an iterator (generator in Pythonish) instead of an iterable (like a Python list).

iterable

>>> animals = {'dog': 'spot', 'cat': 'felix'}
>>> listy = animals.items()
>>> print listy
[('dog', 'spot'), ('cat', 'felix')]

iterator

>>> animals = {'dog': 'spot', 'cat': 'felix'}
>>> genny = animals.iteritems()
>>> for animal in genny:
...     print animal,
('dog', 'spot') ('cat', 'felix')

Last Word

Just like in Java, an iterator is a one-use container. Each time its next() method is called it advances to the next item until it reaches the end, and then it raises a StopIteration exception. When the loop catches the exception, it exits gracefully. In order to reuse it you would have to create a new generator, but if you find that you need to use it multiple times, then you are better off using an iterable instead. In Summary ...
  • iterator: one time use, faster and uses less memory, good when only need to iterate through items one time
  • iterable: slower, uses more memory, good if you need to use any list methods or if you need to iterate through the same items many times.
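The one-time-use behavior is easy to see in a short Python 3 sketch, where the built-in next() replaces the next() method. The genny function repeats the generator from earlier in this post, and the other names are just for illustration.

```python
def genny(x=5):
    """Yield 0, 1, ..., x - 1 one at a time."""
    idx = 0
    while idx < x:
        yield idx
        idx += 1


gen = genny(3)
first = next(gen)        # advances to the first item
rest = list(gen)         # consumes whatever is left
leftover = list(gen)     # already exhausted, so nothing comes out

try:
    next(gen)
    exhausted = False
except StopIteration:    # what a for loop catches behind the scenes
    exhausted = True

# A list can be walked as many times as you like.
listy = list(genny(3))
twice = [sum(listy), sum(listy)]
print(first, rest, leftover, exhausted, twice)
```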

Tuesday, June 11, 2013

Religion can be philosophy, aristocracy or imaginary friend

Disclaimer: I know this is a touchy subject, and I am not the most tactful person, so please stop reading now if you have strong views on religion. What I am about to express is my opinion, and is not meant to influence or offend anyone. If you are already offended then I apologize, and hope that by turning back now you can avert any more offense.

Another theory; religion could be dissected as either a philosophy or an aristocracy. Wait, hear me out. I know that is way oversimplifying something so complex and evidently intricately woven into the human condition. But I am expressly thinking about understanding the will of each religion's deity/deities - what is done with those instructions is a different topic.

So let's consider, hypothetically, that a religion has a deity or some deities. This is essentially what defines a religion, right? That it has gods? Perhaps a religion has no gods, merely guidelines that were divined by a group of humans that are now revered for their amazing insight. That sounds a bit like a philosophical cult, which is my first proposition, but I'm getting ahead of myself. Now that we've established a god or gods, how does information exchange occur?

  1. There is an elite class of god listeners who alone hear god's messages and then repeat them to the rest of that god's followers.
  2. All of that god's followers attempt to divine their god's meaning and then share and debate their theories to come up with some consensus.
Number one is clearly a form of aristocracy because the god has divined who shall be the people who receive the message and make decisions for others just as a monarch is generally chosen through some cosmic means. However if the god-listener class is elected it might possibly be considered some form of democracy. More likely the god-listener class takes that right through an exertion of their power (either by force or through influence), which might make it either a tyranny or oligarchy. Perhaps these types of institutions are all called republics - a small group represents a larger group. But it can go terribly wrong if the larger group doesn't question the validity of the small group. I wonder is it a sin to question the pope? Or a pastor's interpretation of the bible. Even a simple Sunday morning comic strip has multiple interpretations; isn't it more likely that a literary work of unknown origin transcribed multiple times may have ambiguous meaning? I'm not saying that we can't ultimately come to a consensus, there is meaning everywhere that we can all agree on, e.g. that murder is generally bad is agreed by all. What I mean is that unquestioning acceptance of religious, governmental or scientific dogma is both lazy and very dangerous. As Socrates said in the Apology, "an unexamined life is not worth living."

Number two sounds like a philosophy to me. I like it.

OK, let's take this a few steps further. Now replace god or gods with some other belief. Say god = the universe? Or gods = scientific theories, because let's be frank, no matter how much proof we have, even acceptance of a law is still merely just a belief. We believe that electrons tunnel through energy barriers because we have seen so much evidence that suggests convincingly that it may be true. But Einstein and Copernicus and Galileo can attest to the fact that even "scientific" theories and laws evolve and shift as new evidence comes to light. So I digress, my point with this last exercise is that religion has many societal parallels.

I also realized, while talking with my wife about suffering and grief, that even if you can't hear your deity's or deities' message, when you are in need, merely believing that they exist and love you may be a solace, and I like that too. The universe is a cold hard place, and it's always nice to have a friend.

Equality increases self esteem

Similar to my post on "the second law of infodynamics" this post proposes another completely hypothetical theory of human social interaction.
Equality increases self esteem.
Right now my son loves wearing pink, and says today, "I'm wearing a dress." Next week it will be a different color. He is completely innocent. These issues seem trivial to us today in the socially enlightened 21st century. In fact we view it as a victory over the absurd and old-fashioned dogma about distinct gender roles.

So let's examine that dogma. What was its motivation? I propose that it was a defensive coping mechanism. To cope with what? What could possibly happen if a boy did wear a dress? Or a woman was a combatant? Or a same sex marriage occurred? Did that mean that I might start wearing dresses, because secretly I wanted to but my society told me it was wrong so I felt insecure and bad about myself? If I was secure about my individuality, why would I care what another person did? Merely being irked or irritated is not a reason for outrage, is it? No, there has to be a deeper reason. Our prejudices are manifestations of our inward fears. We are racist because we seek to dehumanize and therefore justify the luxuries we take for granted at the expense of others' suffering. We are sexist and homophobic for the same reasons.

But what happens when we remove these barriers? Then we don't have to be defensive. There is nothing to cope with. We can feel good about ourselves whoever we are. Equality increases our self esteem.

Monday, June 10, 2013

quantities and units

[UPDATED 2013-07-24] add buckingham.py

Main Contenders

Looking at doing calculations with units? Let's see what's out there. Start by doing a quick Google search with Python + Units. The first site that looks like a match is Python Units.

Python units 0.06

  • Last updated: 2013-2-25
  • Download: PyPI
  • Documentation: None [1]
  • Repository: Bitbucket
  • Last commit: 2013-02-24
  • Owner: Aran Donohue
Then there's a few SO hits and some personal blog entries similar to pp. Python Quantities seems to be a recurring theme.

Python quantities 0.10.1

A little further down is a new contender called Pint.

Python Pint 0.2

With some digging a few more packages pop up. A relative newcomer is Python-numericalunits.

Python numericalunits 1.11

  • Last updated: 2013-02-21
  • Download: PyPI
  • Documentation: None [1]
  • Repository: Github
  • Last commit: 2013-02-22
  • Owner: Steve Byrnes
One package that was really hard to find, which I only saw in an SO post, was Unum.

Python Unum 4.1.1

  • Last updated: 2010-06-19
  • Download: PyPI
  • Documentation: linked to from here
  • Repository: Bitbucket
  • Last commit: 2012-03-25
  • Owners: Chris MacLeod, Pierre Denis

Others

There are probably several others, but I think these are the main contenders. I found some by using search within PyPI, eg: magnitude-0.9.1  (c. 2007). Several are listed in this SO question including buckingham.py. Finally, DimPy (c. 2008) just randomly appeared way down the list when I Googled how to add a new unit to quantities, which is possible, but not well documented.
>>> import quantities as pq
>>> US_cent = pq.UnitCurrency('cent', 1, u_symbol=u'¢')
>>> US_dollar = pq.UnitCurrency('dollar', 100 * US_cent,
...                             'cent', u_symbol=u'$')
>>> cost = 10 * US_cent / pq.kWh
>>> print cost

SciPy.constants

I think it's important to note that SciPy does have many physical constants and conversion-factors to SI units. In fact it's a bit disappointing to see such a flagrant violation of the DRY principle with numerous physical constants and CODATA files floating around. But SciPy does not really have a good representation of units and a framework for using units in calculations.

Usage

Most of the packages work the same way: multiplying a number by a unit creates a new instance of a quantity class. Here's a snippet from Pint's documentation:
>>> from pint import UnitRegistry
>>> ureg = UnitRegistry()
>>> distance = 24.0 * ureg.meter
>>> print(distance)
24.0 meter
>>> time = 8.0 * ureg.second
>>> print(time)
8.0 second
>>> speed = distance / time
>>> print(speed)
3.0 meter / second
The exception to this pattern is Python-units which uses a call to create objects.
>>> meters = unit('m')
>>> distance = meters(10)
Python-quantities is the only package with dependencies; it depends on NumPy, which really doesn't matter to me. Pint also supports NumPy arrays, which is important.
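To show what "multiplication by the units creates a new instance" means in practice, here is a toy Python 3 sketch. It is not how Pint, quantities or any of the other packages is implemented. The Quantity class and everything in it is hypothetical, and it does no unit conversion at all.

```python
class Quantity(object):
    """Toy value-with-units; purely illustrative."""

    def __init__(self, value, units):
        self.value = value
        # Map each unit name to its exponent, e.g. {'meter': 1, 'second': -1}.
        self.units = {name: exp for name, exp in units.items() if exp != 0}

    def _combine(self, other, sign):
        """Add (sign=1) or subtract (sign=-1) the other's unit exponents."""
        units = dict(self.units)
        for name, exp in other.units.items():
            units[name] = units.get(name, 0) + sign * exp
        return units

    def __mul__(self, other):
        if isinstance(other, Quantity):
            return Quantity(self.value * other.value, self._combine(other, 1))
        return Quantity(self.value * other, self.units)

    __rmul__ = __mul__  # lets ``24.0 * meter`` work

    def __truediv__(self, other):
        if isinstance(other, Quantity):
            return Quantity(self.value / other.value, self._combine(other, -1))
        return Quantity(self.value / other, self.units)

    def __repr__(self):
        parts = ['%s**%d' % item for item in sorted(self.units.items())]
        return '%r %s' % (self.value, ' * '.join(parts))


meter = Quantity(1.0, {'meter': 1})
second = Quantity(1.0, {'second': 1})

distance = 24.0 * meter      # a plain number times a unit gives a Quantity
duration = 8.0 * second
speed = distance / duration  # exponents subtract on division
print(speed)
```

The real packages add the parts that matter, such as conversion between compatible units and NumPy array support.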

Snap Decision

Difficult to compare and decide without trying them all out. Who has time for that? So I think units and numericalunits are both too undocumented for my taste. Unum looks like it is unsupported and/or not active anymore. That leaves Pint and quantities. Pint looks really slick, I like their design principles and it looks like their 0.3 release is coming out soon. It looks like quantities has been around for a while, and there are both positive and negative reviews, although to be fair that post about temperature conversions from C to F is the main reason SciPy doesn't have support for units conversions, even though it does have a great constants class with units. So I think I'll try quantities first, but keep my eye on Pint too. I hope to have a part II with some comparisons between these two soon.

Footnote

[1] There is some documentation for both units and numericalunits on their PyPI sites.

Tuesday, June 4, 2013

Sphinx with NumPyDoc and Consolidated Fields

[UPDATE 2014-01-31] This is a major update - Sphinx-1.3 now packages Napoleon, allowing you to use Google or NumPy style documentation and have them produce Sphinx formatted documentation.

Sphinx documentation is awesome as is, although IMHO the unformatted docstring is not easy to read (EG: using help(my_fun) to get help on my_fun will show the Sphinx ReST roles and directives). A couple of cool tweaks are consolidated fields and the NumPyDoc extension. For contrast I've also included the Google recommended Python documentation style for unformatted docstrings.
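As a rough illustration of the readability difference, here is the same made-up function documented three ways: plain Sphinx field lists, NumPy style and Google style. The function names and parameters are hypothetical, and the sketch leaves out consolidated fields.

```python
def area_sphinx(width, height):
    """Return the area of a rectangle.

    :param width: width of the rectangle
    :type width: float
    :param height: height of the rectangle
    :type height: float
    :returns: the area
    :rtype: float
    """
    return width * height


def area_numpy(width, height):
    """Return the area of a rectangle.

    Parameters
    ----------
    width : float
        Width of the rectangle.
    height : float
        Height of the rectangle.

    Returns
    -------
    float
        The area.
    """
    return width * height


def area_google(width, height):
    """Return the area of a rectangle.

    Args:
        width (float): Width of the rectangle.
        height (float): Height of the rectangle.

    Returns:
        float: The area.
    """
    return width * height


# help() shows each docstring exactly as written, markup and all.
help(area_numpy)
```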

Monday, June 3, 2013

XLRD vs OPENPYXL, Round II

[UPDATE 2013-12-02] The major issue discussed in this post, RE: charts not read, worksheets skipped and out of order, was resolved and pulled into the latest release 1.7.0 as well as many other bug fixes. With this latest version, I think that OpenPyXL can be considered the dominant OOXML (post 2007) Excel reader and writer. Note that OpenPyXL is the default Excel reader for Pandas the rapidly growing Python data analysis toolset.

This is a continuation of the previous post on reading Excel from Python. Uh, I might have called it too early! XLRD pulls ahead, but will it win the bout? Read on ...

Reading the contents

Assume we have a sample Excel spreadsheet with 3 worksheets and 2 charts on sheets. In OpenPyXL, you load the workbook, but right away you notice something wrong with the sheet names. Where's 'Sheet3'?
>>> wb_openpyxl = load_workbook(sample)
>>> wb_openpyxl.get_sheet_names()
['Sheet1',
 'Sheet2',
 'Sheet3']
Loading the sheets with XLRD gets it right.
>>> wb_xlrd = open_workbook(sample)
>>> wb_xlrd.sheet_names()
[u'Sheet1',
 u'Sheet2',
 u'Sheet3']
XLRD returns the sheets in the same order as they are visible in the actual spreadsheet, and omits the charts which don't actually contain any data. Unfortunately OpenPyXL can't tell charts from sheets just yet, and is actually naming some of the sheets incorrectly after the charts.
'Sheet1' --> 'Sheet1'
'Sheet2' --> 'Sheet2'
'ChartA' --> 'Sheet3'
This would be OK, since all of the sheets are there, and you could use the sheets' indices, but if you only know their names and not the order, then this is an issue. It has been reported in issues #179, #165 and #209. Unfortunately, this issue affects the optimized reader as well. I sent a pull request with a proposed fix for it that has already been merged with master. This issue was resolved and pulled into the current release, OpenPyXL-1.7.0.

Reading Cells

OpenPyXL can use the Excel format, EG: 'A3' or by row & column.
>>> ws1_openpyxl = wb_openpyxl.get_sheet_by_name('Sheet1')
>>> ws1_openpyxl.cell('A3').value
XLRD only reads cells by (row, column).
>>> ws1_xlrd = wb_xlrd.sheet_by_name('Sheet1')
>>> ws1_xlrd.cell_value(2, 0)
Both can let you slice the data, but OpenPyXL also allows ranges using Excel format.
>>> ws1_openpyxl.range('A1:C2')
Weird thing about the optimized reader in OpenPyXL, is that it only allows reading sheet contents using the iter_rows() function, which in a way defeats the purpose of the optimized reader, since you have to read in all of the columns in each row!
>>> all_rows = [r for r in ws1_openpyxl.iter_rows()]

The Winner

I think XLRD wins this round, because even though its documentation is sparse, it's not rocket science, and it gets the worksheets relatively quickly, and more importantly, correctly! The screw-up with the charts is kind of a non-starter for OpenPyXL.

And another thing occurred to me during Round II. XLRD can open any Excel spreadsheet dating back to like 1995, but OpenPyXL is only for Excel 2007 and newer, which if you didn't know is a zipped XML file.

Finally, even though XLRD doesn't let you use the easy Excel cell reference notation, it is generally faster. And the iter_rows() limitation for the optimized reader in OpenPyXL is a bit annoying, since you're forced to read in many columns that you might not have wanted to read!

please start your project major version at zero

This post is probably a repeat of something on Coding Horror, but in case it hasn't been stated before, I'd like to make the case for starting your version numbering at zero. Why, well, because some code is available before it has been thoroughly tested and depends on user feedback to determine when the code has enough of the bugs worked out to consider it mature. During this time its version number may be changing rapidly, but the one thing that distinguishes a fledgling package from a robust one is the zero in front of the version number. It says, hey I was just born. I might not be complete. I might have bugs. I might break or die. I'll let you all know when I get the code equivalent of a bar-mitzvah by changing my zero to a one.
if major_version:
    print "I'm mature.",
    print "This is release #%d." % major_version
else:
    print "I'm newish,",
    print "so I might still have some big issues."
Is that so hard?

Friday, May 31, 2013

>>> print 'Reading %(Lolita)s in %(Tehran)s' % {'Lolita': 'Excel', 'Tehran': 'Python'}

[UPDATE 2013-06-14] The google group python-excel is a great source of information on this topic. That's where I discovered xlsxWriter and PyXLL. I also searched the PyPI database with keyword: "xlsx", and I got several more hits, which I've added below in addition to these two. Really thought I'd already checked PyPI - must be losing my mind; need more sleep.
>>> print 'Reading %(Lolita)s in %(Tehran)s' % {'Lolita': 'Excel', 'Tehran': 'Python'}
Reading Excel in Python
That was really a stretch. It's actually not funny or witty at all. In fact now it's just terribly awkward. Although it would have made a little sense if Excel and Python were switched, IE: Reading Python in Excel, right, but that is not exactly what we mean. Now I'm just making it worse.

Main Contenders

Moving on, ever tried to read an Excel file into Python? Numpy only supports csv, so you Google and up pop the main contenders:

Python ← Excel

Python → Excel

  • DataNitro, formerly IronSpread, is an Excel plugin that allows Python to be used for add-ins and macros in lieu of VBA, with proprietary commercial/personal licensing. I didn't review this one, but I plan to revisit it later.
  • PyXLL, version 2.0.3, also lets you use Python to make Excel add-ins using decorators.

openpyxl

This project is for Excel 2007 and newer. It is a port of PHPExcel and is super easy to use, made even easier by its Sphinx docs at read-the-docs. For some weird reason, pypi and bitbucket both point to old (v-1.5.7) documentation @ pythonhosted, but don't be fooled; get the latest, which is much expanded!
$ pip install openpyxl
works fine for any platform since there are no extensions, yay!
$ python
>>> from openpyxl import load_workbook
>>> big_wb = load_workbook(r'C:\path\to\spreadsheet.xlsx')  # raw string keeps the backslashes literal
seems to take a while and there doesn't seem to be any logging or progress available. So use the lazy approach by setting use_iterators=True, which is much faster!
>>> big_wb_fast = load_workbook(r'C:\path\to\spreadsheet.xlsx', use_iterators=True)
This approach uses an optimized reader, but the workbook loaded is read only.

There is a google group, issue tracking, a blog and an old blog.

xlrd

This project covers all Excel spreadsheets. Don't be misled by the 0 major version; this is a relatively mature project. It is part of a triad of Excel targeted packages, including xlwt and xlutils for writing and other basic utilities. The documentation is in one giant html page, which has a lot of info, but is frankly more challenging to read than the nice Sphinx documentation of openpyxl. There is also a tutorial in the form of a pdf.

Installing all 3 packages using pip works fine as they are all pure Python. Then in an interpreter ...
>>> from xlrd import open_workbook
>>> import sys
>>> big_wb_xlrd = open_workbook(r'C:\path\to\spreadsheet.xlsx', on_demand=True, logfile=sys.stdout, verbosity=1)
This is where better documentation might help as there is no indication what the different levels of verbosity are, although maybe it's obvious. False, 0 or None means none, True or 1 gives you lite feedback, just right, but verbosity=2 gives you a deluge of information - probably best to output this to a file instead of STDOUT.

Now here comes the sad part. If you are using Excel 2007 or newer, on_demand is not supported. If you have logging enabled and verbosity>0, then you will see this message:
WARNING *** on_demand=True not yet implemented; falling back to False
Boo Hoo! Because on_demand is the xlrd equivalent of use_iterators or the openpyxl optimized reader which enhanced speed so much. I will say, that xlrd read a large file slightly faster than openpyxl with use_iterators=False, but definitely slower than the optimized reader.

Conclusion

  • For xls files, use xlrd with on_demand=True .
  • For xlsx files, use openpyxl with use_iterators=True.
  • IMHO openpyxl could be improved with logging to give feedback and xlrd/xlwt could use some better documentation.
In my next post, Round II, I play with reading in actual cell contents. Let's see if openpyxl stays in front.