Blocking large file commits in git

21 Jun 2013

In my previous blog post on Locating large objects in a git repository, I covered some of the git utilities that can be used to identify large files in your repository. In this post I’d like to go over how to create a server-side hook that will block large files before they get into your repo. Some of the same utilities from the post referenced above will be used in creating this hook.

What are server-side hooks?

Git hooks allow you to perform actions at different points in the git push workflow. Server-side hooks are executed at certain specific phases of the push process:

  • pre-receive

    Fires before any refs are updated on the server, and is run once. This hook receives a list of changes on stdin in the format of: old-ref new-ref refname, one line for each ref to be updated. Both stderr and stdout are forwarded back to the user for messaging. A non-zero exit will prevent any refs from being updated.

  • update

    Fires before the updating the refs on the server, and is executed once for each ref that is being updated. It takes 3 arguments: ref-name old-ref new-ref, and will pass stderr and stdout back to the client. A non-zero exit status will only prevent the effected ref from being updated on the server.

  • post-receive

    Fires once after all refs have been updated, and gets the same input on stdin as the pre-receive hook. As with the hooks before it, stdout and stderr are passed back to the client for messaging. Generally this type of hook is only used for notification and/or messaging to clients about the push that just occured.

  • post-update

    As with the post-receive hook, this fires once after the refs have been updated, and takes a variable number of arguments. Each parameter that is passed in, is the name of a ref that was successfully updated. Similarly to post-receive this hook is also used primarily for notification, as it cannot effect the status of the push.

On the git server these hooks live inside the hooks directory of the bare repository. They must to be named to match the type of hook they are.

Which hook type should I use?

Since we’re looking to block an actual commit from happening, we would want to use either a pre-receive or update hook. The distinction here is that the pre-receive will block all updates on a non-zero exit, whereas the update will only block the specific update for the ref which returns a non-zero exit. For this example I will choose to use pre-receive because I want to block the entire push if we find an object that is over the file size limit.

How does it work?

The basic flow of this hook is as follows:

  1. Read stdin line by line for push information (old-ref new-ref refname)
  2. For each line you receive, obtain a list of files that have changed.
  3. Inspect each file’s object and get its size in bytes.
  4. Compare that size to a limit that you have set.
  5. Provide useful feedback to the user about any exceptions that are rasied.
  6. Exit accordingly

We also need to be aware of a couple of exceptional cases: branch creation and branch deletion. On a branch creation the old-ref will be a null SHA1 (0000000000000000000000000000000000000000). Similarly on a branch deletion the new-ref will be a null SHA1. For the purposes of this hook, there should be nothing happening in a branch deletion that is of any importance to us, so we will want to skip looking at those types of operations.

For this example I will use a bash script which calls out to the git utilities. I am choosing to do it this way so that we can discuss how each step works, and what is happening. Similar hooks could be written using modules designed to interact with git, such as GitPython or Grit.

What does the script look like?

Breakdown

Let’s explore the script a little bit, and go through the operations happening to make all this work. The first git operation, which is at line 21 is performing a git-diff with several options.

  • --stat is making diff display a diffstat
  • --name-only makes the output only contain the name of the files changed
  • --diff-filter=ACMRT limits the output of the diff to only contain additive operations.
  • ${oldref}..${newref} generates the diff based on changes to be applied to oldref by newref.

The next git operation is on line 23, where git cat-file is used to obtain the object size of a file in a commit. When we encounter a file that is above our defined MAXSIZE (line 27), the script outputs some actionable information to the user. When the filesize condition is met, the EXIT variable is set to a non-zero value, and we exit with that at the end of execution.

Conclusion

As you can see, it’s trivial to put together a simple bash script that can enforce file size restrictions on pushes. Alternatively, it would be a relatively simple modification to allow this script to operate on total commit size. Knowing the core concepts required to perform a task like this makes it easier to port it to using a git module in a different scripting language.

comments powered by Disqus