Wednesday, April 22, 2015

Git big media on Windows

[UPDATE 2015-06-09] I have switched to git-fat 0.5.0 [2015-05-06], the fork on PyPI maintained on GitHub by Alan Braithwaite of Cyan Inc. Unfortunately, I could never figure out how to clone a git-media repo, so as far as I'm concerned git-media sucks.

Git-Fat - just works

Basically git-fat comes with everything you need to work on Windows, Mac or Linux. It uses rsync for transport; the rest is mostly written in Python, with a few dependencies that are standard on Linux and have mature ports on Windows and Mac. One of the major benefits of git-fat over git-media is that it uses a .gitfat config file, which updates your .git/config when git fat init is run. This is similar to git submodules and makes repos portable. In general there are more features than git-media: for example, you can list the files managed by git-fat, check for orphans, and pull data from or push data to your remote storage. The only catch is that the wheel file on PyPI is tagged win32 rather than amd64. That is easy to fix, but I think there are a couple of use cases that differ from how the distribution was implemented.


If you look at a Linux install, the repository has a symlink in bin called git-fat. Why not just bootstrap git-fat, since we only really need one file? Dump everything into a single folder, rename the file to git-fat, make sure the first line is the shebang #! /usr/bin/env python (which git seems to prefer), then stick it on your path. This works for both msys-git and the Windows cmd shell.
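A minimal sketch of that bootstrap. The printf below only fabricates a stand-in script so the demo is self-contained; really you would copy the git-fat file out of the PyPI/GitHub distribution, and you would use your own bin folder rather than a temp dir:

```shell
#!/bin/sh
set -e
# Bootstrap: one file named git-fat, first on PATH, with the shebang git
# prefers. git_fat.py is a stand-in for the real script here.
workdir=$(mktemp -d); cd "$workdir"
bindir=$(mktemp -d)               # use "$HOME/bin" for real
printf '#! /usr/bin/env python\nprint("hello from git-fat")\n' > git_fat.py
cp git_fat.py "$bindir/git-fat"   # drop the .py: `git fat` invokes `git-fat`
chmod +x "$bindir/git-fat"
export PATH="$bindir:$PATH"
head -1 "$bindir/git-fat"         # -> #! /usr/bin/env python
```

Once the file is on your path, `git fat` subcommands resolve to it from both Git Bash and cmd.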


msysgit comes with Git Bash, a POSIX shell that includes many Linux tools ported to Windows, such as gawk and ssh. Unfortunately it does not come with rsync, but you can get an msys build of rsync from mingw-w64 (that's where I got it), from msys2, from the original MinGW project, from MinGW-builds, and from lots of other places. You could even get it from Cygwin. I usually stick files like this in my local bin folder, which is always first on my path in Git Bash. You'll also need to grab the iconv, intl, lzma and popt msys libraries, which rsync depends on. Since you now have these libraries, you don't need the ones bundled in the wheel; however, git-fat is written to look for those bundled files when it detects that your platform is Windows, so just comment out those lines. You will also need to change awk to gawk, since msys's awk is a shell script that calls gawk. Again, you can bootstrap this file, i.e. put it in your local bin folder, or install it into your Python site-packages and/or Scripts folder.
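Copying rsync and its libraries over is just a loop. The stand-in files below only make the demo self-contained; in practice the source dir is wherever your msys build unpacked, the destination is your local bin, and the exact dll file names vary by build (mingw-w64, msys2, Cygwin, ...):

```shell
#!/bin/sh
set -e
# Drop rsync plus the iconv/intl/lzma/popt msys libraries it needs into
# the bin folder that Git Bash puts first on PATH. File names below are
# illustrative; check what your particular msys build actually ships.
src=$(mktemp -d); dest=$(mktemp -d)   # really: src=<msys bin>, dest="$HOME/bin"
for f in rsync.exe msys-iconv-2.dll msys-intl-8.dll msys-lzma-1.dll msys-popt-0.dll; do
    : > "$src/$f"                     # stand-in file; msys ships the real one
    cp "$src/$f" "$dest/"
done
ls "$dest"
```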


This is the way I ended up using it. You can download my version here and install it with pip. I put the Windows libraries into the site-packages git-fat folder instead of Scripts, and then, in the git-fat script, added that site-packages folder to the shell's path. I also call the git-fat module as a script by adding a file to it that imports and runs it via Python's -m option, though you could just as easily call the module as a script directly. This keeps the extra libraries bundled together rather than dumped into the Scripts folder with everything else, and since I mostly use Git Bash, it doesn't put git-fat's copies of gawk and ssh ahead of git's.
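The "call the module as a script" trick is just a __main__.py inside the package, so `python -m pkg` runs it. A toy sketch (the package and function names here are made up for the demo; git-fat's real module layout may differ):

```shell
#!/bin/sh
set -e
# A __main__.py makes a package runnable with `python -m`.
top=$(mktemp -d)
mkdir -p "$top/gitfat_demo"
printf 'def main():\n    print("git-fat demo")\n' > "$top/gitfat_demo/__init__.py"
printf 'from gitfat_demo import main\nmain()\n' > "$top/gitfat_demo/__main__.py"
PYTHONPATH="$top" python3 -m gitfat_demo    # prints: git-fat demo
```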


Usage is extremely easy compared to git-media, which is a plus! Note these instructions are for the msysgit Git Bash; in a Windows cmd window, replace git fat with git-fat everywhere. Both methods should work fine.

  1. Clone a repo that uses git-fat: git clone my://remote/repo.git
  2. At this point you only have placeholders: files with the same names that contain just the SHAs that tell git-fat which files to grab from your remote storage.
  3. Run git fat init, which sets up the filters and smudges that tell your local repo how to use git-fat with the .gitattributes file that is already part of the repo.
  4. Run git fat pull, which downloads your files from the remote storage specified in .gitfat, also already in the repo.
  5. Run git fat list to see a list of managed files.
  6. Run git fat status to see a list of orphans waiting to be pulled/pushed (I think).
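The "filters and smudges" in step 3 are git's stock clean/smudge mechanism; git fat init registers a filter named fat whose clean step swaps file contents for a SHA stub and whose smudge step swaps the real bytes back. A toy filter (rot13 standing in for git-fat's stub rewriting) shows the moving parts with plain git:

```shell
#!/bin/sh
set -e
# Toy clean/smudge filter: same wiring git-fat uses, with rot13 instead
# of sha1 placeholder stubs.
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
git config filter.demo.clean  'tr a-zA-Z n-za-mN-ZA-M'  # on add: working tree -> repo
git config filter.demo.smudge 'tr n-za-mN-ZA-M a-zA-Z'  # on checkout: repo -> working tree
echo '*.bin filter=demo' > .gitattributes
printf 'hello' > blob.bin
git add .gitattributes blob.bin
git show :blob.bin   # what git actually stored: uryyb
cat blob.bin         # the working tree still reads: hello
```

With git-fat, "what git actually stored" would be the small stub, which is why step 2 above leaves you with placeholders until git fat pull runs the smudge side.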

Creating a repo and setting it up to use git-fat is also easy. There is great help in the readme at PyPI and the readme at GitHub.

  1. Create a .gitfat file that specifies where rsync should store files. Note there are no indents. A Windows UNC path seems to work fine.
    remote = //server/share/repo/fat
  2. Create a .gitattributes file to specify which files to store at the remote
  3. Commit the .gitfat and .gitattributes files
  4. Run git fat init to set up your local .git/config
  5. Hack, commit, push, etc.
  6. Run git fat push to send stuff to your remote
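Steps 1 through 3 can be done with plain git; only the last two steps need git-fat on your path. A sketch, where the remote path and the *.zip/*.mp4 patterns are examples and the [rsync] section header follows the git-fat readme:

```shell
#!/bin/sh
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

printf '[rsync]\nremote = //server/share/repo/fat\n' > .gitfat          # step 1
printf '*.zip filter=fat -crlf\n*.mp4 filter=fat -crlf\n' > .gitattributes  # step 2
git add .gitfat .gitattributes
git commit -qm 'store big binaries with git-fat'                        # step 3
# the rest needs git-fat installed:
#   git fat init    # step 4: writes the fat filter into .git/config
#   git fat push    # step 6: rsyncs blobs to the remote after you commit
git log --oneline
```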

Git-Media - sucky

Finally time to install Ruby. You're going to need it if you want to use git-media, which lets you mix big binary files into your git repo but store them on some remote host, which could be Google Drive, Amazon S3, another server via ssh/scp, or a network share. Why don't you want to store big binary files in your git repo? Since git stores a full snapshot of each revision rather than deltas, the repo quickly blows up as you make new commits.
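You can watch the blow-up with plain git: commit two versions of an incompressible 1 MiB file and the object store roughly doubles, since each revision gets a whole new blob (random data neither compresses nor deltas):

```shell
#!/bin/sh
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
head -c 1048576 /dev/urandom > big.bin
git add big.bin && git commit -qm v1
head -c 1048576 /dev/urandom > big.bin
git add big.bin && git commit -qm v2
git count-objects -v | grep '^size:'   # KiB of loose objects: roughly 2048+
```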


Super easy: they recommend Ruby 2.1, no admin rights required, and it unzips into c:\ just like Python. I checked all of the options: Tk/Tcl, add Ruby to path, and... what was the last option? Then I ran gem update.

git-media gem

gem install git-media trollop right_aws

right_aws OpenSSL::Digest issue

There is a tiny issue with right_aws where it outputs the message:

    Digest::Digest is deprecated; use Digest

which is easily fixed by following the comments in git-media issue #3 or right_aws pull request #176.

git-media setup

The readme on the github overview page has everything you need to know.

other large file storage

Wednesday, April 15, 2015

Recommended Python Project Layout

[UPDATE 2018-09-04] Links to Cookiecutter and Bootstrap a Scientific Python Library from the National Synchrotron Light Source II (NSLS-II).

[UPDATE 2016-07-19] Lately I've preferred using core instead of lib for the main package modules.

[UPDATE 2015-06-04] Create top level package to bundle all sub-packages and package-data together for install.

Been looking for a good, comprehensive, credible guide:

  1. Pretty good links in this SO Q&A:
    1. What is the best project structure for a Python application?
    2. Especially this one:
      1. Open Sourcing a Python Project the Right Way
  2. And maybe, maybe these ones:
    1. Learn Python The Hard Way Exercise 46: A Project Skeleton
    2. The Hitchhiker’s Guide to Python! Structuring Your Project by Kenneth Reitz
      1. Repository Structure and Python also by Kenneth Reitz
    3. How to Package Your Python Code: Minimal Structure by Scott Torborg
    4. Interesting Things, Largely Python and Twisted Related: Filesystem structure of a Python project by Jean-Paul Calderone
  3. Of course understanding Python Modules and Packages
    1. The Python Tutorial: 6. Modules
  4. An understanding of how to install packages, and roughly how pip and setuptools interact with distutils, is good
    1. Python Documentation: Installing Python Modules
  5. Way later down the line it helps to understand distutils and setuptools for deploying packages
    1. Python Packaging User Guide
    2. Setuptools
    3. Python Documentation: Distributing Python Modules
    4. How to Package Your Python Code by Scott Torborg
    5. The Hitchhiker’s Guide to Packaging
  6. There are also packages that will create a boilerplate project layout for you, but I wouldn't recommend them except as reference guides - the tutorial by NSLS-II being the notable exception, PTAL!
    1. Bootstrap a Scientific Python Library: This is a tutorial with a template for packaging, testing, documenting, and publishing scientific Python code.
    2. Cookiecutter: A command-line utility that creates projects from cookiecutters (project templates), e.g. creating a Python package project from a Python package project template.
    3. PyPI: Python Boilerplate Template

It's hard to pin a standard style down. Here’s mine:

MyProject/ <- git repository
+- .gitignore <- *.pyc, IDE files, venv/, build/, dist/, doc/_build, etc.
+- requirements.txt <- to install into a virtualenv
+- setup.py <- use setuptools, include packages, extensions, scripts and data
+- MANIFEST.in <- files to include in or exclude from sdist
+- readme.rst <- incorporate into setup.py and docs
+- changes.rst <- release notes, incorporate into setup.py and docs
+- myproject_script.py <- script to run myproject from command line, use Python
|                         argparse for command line arguments, put shebang
|                         `#! /usr/bin/env python` on 1st line and end with an
|                         `if __name__ == "__main__":` section, include in
|                         setup.py scripts section for install
+- scripts/ <- scripts for configuration, documentation generation
|              or downloading assets, etc., include in MANIFEST.in
+- venv/ <- virtual environment to run tests, validate, development
+- myproject/ <- top level package keeps sub-packages and package-data together
   |             for install
   +- __init__.py <- contains __version__, an API by importing key modules,
   |                 classes, functions and constants, __all__ for easy import
   +- docs/ <- use Sphinx to auto-generate documentation
   +- tests/ <- use nose to perform unit tests
   +- other_package_data/ <- images, data files, include in MANIFEST.in
   +- core/ <- main source code for myproject, sometimes called `lib`
   |  |
   |  +- __init__.py <- necessary to make core a sub-package
   |  |
   |  +- … <- the rest of the folders and files in myproject
   +- related_project/ <- a GUI library that uses myproject or tools that
      |                   myproject depends on that's bundled together, etc.
      +- __init__.py <- necessary to make related_project a sub-package
      +- … <- the rest of the folders and files in the related project
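The skeleton can be stamped out with a few lines of shell. This is a sketch: the concrete names (setup.py, MANIFEST.in, the __init__.py files) are the usual setuptools ones implied by the comments in the tree, and MyProject/myproject are placeholders for your own names:

```shell
#!/bin/sh
set -e
# Scaffold the minimal layout; add docs/tests content as the project grows.
root=$(mktemp -d)/MyProject            # use a real path, not a temp dir
mkdir -p "$root/myproject/core" "$root/myproject/docs" \
         "$root/myproject/tests" "$root/scripts"
touch "$root/.gitignore" "$root/requirements.txt" "$root/setup.py" \
      "$root/MANIFEST.in" "$root/readme.rst" "$root/changes.rst"
printf '__version__ = "0.0.1"\n' > "$root/myproject/__init__.py"
touch "$root/myproject/core/__init__.py"   # makes core a sub-package
find "$root" | sort
```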