Posts tagged ‘gitifyhg’

A Treatise On The History Of Distributed Version Control

This is not yet another git vs Mercurial debate. I admit bias towards git, which I use whenever I have a choice. This is most of the time now that gitifyhg is awesome. However, I have been using Mercurial for my day job for about a year. I am more familiar with Mercurial and it’s extensions than many developers who prefer it. I consider myself an advanced git user (not an expert) and an intermediate Mercurial user.

I therefore have the background to claim that Mercurial and git are equally capable. Mercurial doesn’t have certain features of git that I miss, but those features are implementable with development time. Sometimes git’s interface isn’t as easy to use or teach as I would like, but aliases and projects like git extras alleviate this issue.

This article is about philosophy, not technology. Mercurial’s documentation, mailing lists, and stack overflow questions are littered with dire warnings that extensions that rewrite history are dangerous and best avoided. Git, on the other hand, takes a “consenting adults” approach to history rewriting. While it acknowledges that rewriting history can be dangerous and should be avoided in certain circumstances, it also allows the coder to choose when and how to apply this rule.

To avoid comparing the two systems, I’ll refer to the two styles as “permanent history” and “mutable history”. Both git and Mercurial are fully capable of maintaining both styles of history. However, Mercurial users tend to prefer “permanent,” while git users typically adopt a “mutable” approach.

The permanent history philosophy emphasizes that a changeset cannot be altered once it has been committed. It is important to record exactly what state the repository was in when the commit was made. If that state is not acceptable, then a new commit is made to correct it. Future readers of the history in question will see that a commit was made and that later, it was amended in another commit. Permanent history is analogous to a captain’s log or accountant’s general journal. Every action should be recorded separately.

The goal of committing in the permanent history paradigm is to record a specific state of the repository.

The mutable history philosophy, in contrast, sees changesets as individual paragraphs in a living story. It can and should be edited to ensure it tells the story as effectively and coherently as possible. Each changeset should have a topic sentence (the commit message) and supporting sentences (the patch). When a commit is initially made, the book is never assumed to be in the final draft that will go to the publisher.

The goal of committing in the mutable history paradigm is to record a related set of changes.

The different stages of code history

There are several stages that changesets go through as a program is written. These stages are perceived differently by the two styles.

  1. Working directory changes have not yet been committed.
  2. Local changesets have been committed but not yet pushed to any public repository.
  3. Public changesets have been published to a public repository and are available to other coders.

The working directory stage is treated identically by the two philosophies. If it hasn’t been committed, both styles have an “anything goes” attitude. If you screw something up and fix it, you are not expected to commit the bad code for posterity. If you leave a debugging statement in the code but catch it in a git|hg diff command, just delete it before committing. If your tests aren’t passing during the uncommitted changes stage, edit the files to make sure they do pass.

The attitudes diverge slightly when it is time to commit the changes in the working directory. Neither style requires committing all of the changes that are in the working directory. For example, if you edit a test that is not related to the feature you are about to commit, you can separate the two diverse changes into separate commits. However, this practice is more common in mutable history circles than permanent, largely because permanent history coders want to record the current state of the repository while mutable history followers are focusing on related changes.

The two histories have polar opposite beliefs about the local changeset stage. Permanent history maintains that a committed changeset should not be altered in any way. Some proponents may allow amending or rolling back the most recent commit, provided it has not been pushed publicly. They frown upon editing the “second last” or earlier commits, even if they haven’t been published.

Mutable history, on the other hand, takes the same “anything goes” approach that applies to the working directory stage. If changes have never been pushed publicly, then the mutable historian will comfortably rearrange and reorder them, move patch hunks from one changeset to another, or squash relevant commits together.

Permanent history fans may be surprised to learn that their philosophy on public commits is the same as mutable history’s. Once a commit has been pushed to the permanent public repository, both philosophies consider that it should not be changed, ever. If mutable history is likened to writing a story with coherent chapters, then public commits are like a published book. Once it has been published, the book should not be altered.

Let me reiterate: altering permanent published history is considered a Bad Thing by both philosophies.

There is a fourth stage available in the mutable style that permanent history does not allow. This “temporary public” stage lies between the local changesets and public changesets phases. At this stage, other people can see your changesets before they are moved into permanent public history. They may be rearranged and edited as if they are local history, but there must be agreement between all viewers that this section of history is still considered mutable. This is akin to sharing a draft of the book with a proofreader or copy-editor before it is published.

The source code for git itself is managed in this way, as discussed in maintain-git.txt. While it contains permanent public branches that canot be altered, it also describes a “pu” branch that is temporary public. This branch is used to share the state of upcoming changesets; other developers can provide feedback, not only on the quality of the code, but also on the quality of individual patches, commit messages, and ordering.

History Is Communication

There are several reasons to maintain code history. Some examples include:

  1. Preserve a record of past state in case you need to return to it.
  2. Compare two versions of a code base and to find the specific code that introduced a new bug.
  3. Concurrent development via patching and merging is virtually impossible without it.

However, the primary purpose of code history is to communicate. Each changeset implicitly communicates that the developer had some reason to take a snapshot of the repository at that time. It communicates exactly what the state of the repository was when the snapshot was taken, and is even able to communicate what changed between that snapshot and the previous one. The commit message describes these changes in English, preferably with a one line summary followed by a complete description of what changed and why.

While the two ideologies agree that history is extremely useful for communication, mutable and permanent history disagree as to what should be communicated.

Permanent history’s main purpose is to communicate “honestly” what happened, for all posterity to see. Each snapshot shows exactly what occurred in the repository. Two developers created different changesets in parallel and then at some specified point, they merged them. Someone forgot to delete a debugging statement and made a second commit to fix it.

Mutable history prefers to communicate “effectively”. The goal is to make local changesets as readable as possible before pushing them. Each changeset ideally contains a single related set of changes. Related changesets are further grouped together on individual branches. If this is not the case, they are modified or moved before being made public.

If you catch a problem in mutable history after committing but before pushing publicly, fix the commit. If two distinct changesets actually communicate a single idea, squash them together. If a single changeset contains two ideas, split them apart or move one to a different branch so the current one only contains cogent changes.

The permanent history crowd suggests that this rewriting of local changes before they are pushed is dishonest, or lying. However, it is easy to lie at the working directory stage in the permanent history paradigm. If you run an hg|git diff and notice that you forgot to delete a debugging statement committing, then it is perfectly acceptable to delete that line and “lie” about having forgotten it.

If they truly wanted to record what “honestly” occurred, permanent history tools would track every single change at the text editor or IDE level.

I think we all agree that this is ridiculous. In truth, permanent history shares mutable history’s desire to have clean, communicative commits. The primary difference is deciding when it’s “too late” to change them. In permanent history, once committed, you can’t change it. In mutable history, you can and should change it up until the point it is pushed to a permanent public repository.

The ability to change history before pushing allows the developer to separate the two distinct tasks of “coding” and “organizing”. Often, when coding, we encounter a separate issue that needs to be addressed, a missing feature, a bug, documentation that needs writing. In strict permanent history paradigm, your only “honest” option is to commit both features in a single changeset. However, permanent history rules are relaxed before the first commit has been made, so two other available options are:

  • shelve/stash the existing changes, write and commit the second feature, and then unshelve/stash apply
  • write the two distinct features in the same working directory and use git’s index or Mercurial’s crecord extension to commit them as separate patches.

These options are commonly used by mutable history developers, but they also have another option: Commit the two features and continue coding. Then reorganize or split the changesets into a sensible series of commits appropriate to good communication before pushing the features to a permanent public repository.

To create clean, well-ordered commits, the permanent history style demands that we think about one thing at a time and decide what the most relevant history communication path is before we start coding.

The mutable history style understands that programming doesn’t work this way. It is common and acceptable to begin work on one feature and discover a bad comment or FIXME you had forgotten and perform a psychological context switch to work on that.

One of my colleagues once informed me that “Mercurial is for people who don’t need to hide their mistakes.” This is bullshit for two reasons. First, Mercurial, like git, is perfectly capable of hiding mistakes. It’s easy to edit local unpublished changesets in Mercurial before pushing them live. There are numerous extensions — both third party and built in — that allow this kind of operation.

Second, this statement deliberately misrepresents the purpose of history rewriting. We don’t rewrite history to hide our mistakes. We do it for the benefit of future readers of our git|hg log. Reorganizing history when we write it greatly reduces the cognitive overhead for readers trying to understand what we did and why. History, like code, is meant to be read more often than it is written. Crafting it before pushing it publicly eases the amount of work for future readers of that history.

It is difficult to understand a series of cumulative changesets that keep undoing themselves or refactor large sections of code. It is better to order these changesets such that they make sense. I’m not saying that only the final product should be committed, if other changesets are able to communicate useful information. There are legitimate reasons, regardless of history ideology, to record mistakes in the permanent record of the repository. If the mistake has already been pushed publicly, the best thing to do is admit that you did not communicate as effectively as you had intended and make a new changeset to fix the problem. This is akin to providing an errata to a published book.

Another good reason is to record that some experiment you attempted was a failure. Perhaps it made the system unbearably slow or it arbitrarily deletes data. These commits can live in a branch in permanent history, forever documenting that this experiment was attempted, that it failed (so people don’t waste time trying it again), and how it failed (in case someone else wants to improve upon your design). Neither philosophy advocates the hiding of this kind of mistake. However, mutable history does expect that the failed commits be well ordered with commit messages that effectively communicate what you did and what went wrong.

History is a form of documentation. Like any documentation, it should be well-crafted and report the evolution of the system effectively. For example, one of the best early pieces of advice I received when I started using git (and inadvertantly learned the mutable history philosophy) is to “never use the word ‘and’ in a commit message.” The word “and” is a sign that you are trying to communicate two different ideas or changes in a single changeset.

There are other motivators for this piece of advice in addition to disseminating useful information. If you ever want to revert certain related changes, it is easier to do so if those changes exist in a single patch or consecutive set of patches. There is no need to extract the changes that you want to keep from combined snapshots. Both DVCS’s provide commands to trivially reverse earlier changesets, but this is only useful if the changesets contain only the idea you wish to reverse. It also eases communication when a commit message says “reverse the changes from revision X,” compared to “reverse some of the changes that do this and this as committed in revision X, but allow the lines that perform an unrelated operation alone.”

Further, if you want to apply an individual set of changes to a different branch of the project without merging an entire branch, it is a trifling matter if those changes are part of a single coherent, cohesive, consecutive set of patches.

Finally, if you don’t have write access to a project, single idea changesets make it easy for the person who is integrating your patches to see what you did and what you intended. Mutable history integrators will generally reject your patches if they do not communicate effectively. You are unlikely to ever get write access to an upstream project (if it follows the mutable history paradigm) if you do not prove that you adhere to the single concept per changeset guideline.

Admittedly, it takes more effort to alter history than to just take a snapshot when you feel the code is in a semi-acceptable state or you want to have a backup available. This effort has a huge long-term payoff. When people say that they don’t care about maintaining clean history, I get the same sense of distaste as “I don’t bother with writing tests” or “I tend to put documentation off to the last minute”. Not maintaining clean history is a sign of a lazy developer. You may be saving yourself time by just committing, but you are adding overhead to everyone who ever has to interpret your commit history, including yourself.

Code Review

Code review is an extremely useful tool for improving the state of a source code repository. It’s a simple concept: other members of the team review each changeset and make suggestions for future improvement.

Reviewers using the permanent history philosophy can improve the quality of the future codebase, but they cannot suggest improvements to patches already under review. They do not have an opportunity to improve the communication quality of those patches if they have been pushed publicly, which is normally the case if you want to share patches for review.

The mutable history style changes the point at which modifications are not allowed from “in the local directory” to “publicly pushed to the permanent repository”. Code can be pushed to a temporary public location for review purposes. Other team members can review it and comment on the quality, not only of the code, but also of the individual changesets. Once review is completed, and any suggestions integrated into the change history, the temporary repository can be safely deleted.

Thus, the code review phase becomes more than a review of the code, it is also a patch review, a history review. Code review gives other developers the chance to say, “This patch could communicate more effectively if…”.

Incremental Merging

The most dangerous moments in version control occur when two different branches of development that touch overlapping pieces of code have to be merged.Someone has to figure out what the two original sets of changes did and then figure out what the combined code has to do to accommodate both ideas. If the branches have been divergent for a long period of time, this job is nearly as difficult as rewriting both features entirely from scratch.

This is compounded by the fact that normally, the two different branches were written by different developers. While the person doing the merge may be intimately familiar with their own work, they have to become just as well-versed in the alternate branch before they can merge it safely.

Worse, all of these changes get combined into a large “merge commit” that basically includes the entire modified history of the two feature branches squished into a single gigantic diff. This is horrible for communication. If just one line of code was inappropriately merged, it becomes a nightmare to answer the question, “why did this work on the feature branch but fail after the merge”?

In the permanent history paradigm, it is common to attempt to alleviate this problem by merging frequently. This way, a smaller subset of changes can be covered in each merge. Unfortunately, this is a terrific way to introduce malicous or erronious changes into the history. People tend to assume merge commits “did the right thing” and don’t review them as closely as the patches being merged.

Moreover, the history becomes much less readable as these unnecessary merge commits clutter up the intention of both feature branches. Such commits do not serve any communication purpose other than, “the developer of this branch was afraid that divergent changes would be too hard to merge at a later, more appropriate time.”

The cognitive overhead is greatly reduced in the mutable history paradigm. Mutable history encourages rebasing over unnecessary merging. Instead of merging two branches of commits, rebasing makes one branch appear linearly after another branch. When rebased, the branch contains a series of commits that make sense in a linear order with no confusing merges in the history.

When you rebases a branch, each individual changeset is applied against the upstream branch, one at a time. Because the changesets contain small unit of changes, they are less likely to conflict, and therefore apply cleanly. When there is a conflict, it is easier to tell (from both the commit message and the code) how the code needs to be written to apply that changeset against new changes. This “one change at a time” process is much easier to apply than a single large merge commit. In addition, if you have previously rebased a branch and reordered commits for optimal communication, you will find that future merges or rebases onto other branches are even easier.

This means that, compared to a merge, you do not need to be as intimate with upstream changes from other developers. You figure out how you would have written each of your own changeset if you had been applying them directly to the upstream branch. This is not nearly so mentally fatiguing as trying to unravel two parallel sets of changes a la “ok, they did this, and I did this so I need to do this to get things back into a sane state”.

While I am avoiding a git vs Mercurial debate here, I’d like to point out that the various Mercurial utilities for rebasing and history editing are not very effective as compared to git’s tools. The rebase extesion, histedit, Mercurial queues, and pbranch all do the job, but they require more effort than in git. They don’t have git’s famous rerere functionality, they have potential to lose or obliterate history altogether, and they are neither as well integrated nor maintained as git’s tools. I do not say this to convince you to switch to git, but to point out that if you have tried these tools and found them lacking, it is not because the concept of rebasing and mutable history is a bad thing, but because the tools require further development.

Note that while mutable history users avoid unnecessary merges whose soul purpose is to reducing merge fatigue, they are not averse to merge commits that communicate useful information. So it is perfectly sensible to create a merge commit that demonstrate that a feature branch (usually containing a linear set of related changesets) has been merged into default. However, before the merge occurs, the feature branch should have its history edited in such a way that the entire branch will apply cleanly and no merge conflicts will occur that require any diff to be committed with the merge.

Have your cake and eat it, too

Distributed version control systems allow us to have multiple copies of repos in different states. If you feel strongly that an “honest” permanent history is important, perhaps it would be a good idea to keep this honest copy of history in a separate repository or a different branch. But for the sake of effective, coherent communication, maintain the main history in a mutable style.

In git, this can be done with branches. Simply do the development work on one branch (maybe give it a name like permanent/branchname to identify it as such). Push all changes to this branch as they occur. However, keep your master branch clean for most effective communication. When a feature is ready to go live, create a new branch from the commits on the permanent branch and rebase them onto master in a well-ordered manner that communicates clearly.

I’m not sure how viable this would be in Mercurial, since it’s not easy to copy commits between branches. More likely, it would be suitable to have two repositories; one that contains the permanent historical record, and one that contains the edited history.

I don’t personally believe this is necessary. The mutable history paradigm communicates everything I need it to. However, if you are unsure if you are ready to make the switch, I want to make it clear that it is possible to maintain both styles for a period while you experiment with the idea. If it turns out you don’t like the mutable history paradigm, you can always delete the offending mutable branches or repository… though, of course, this would be a mutation of history in itself.

I expect people who perform this experiment to realize that well-crafted history is worth the small amount of extra up-front effort required to maintain it.

Gitifyhg is now awesome.

Gitifyhg is a git client for pushing to and pulling from Mercurial repositories. I described the first implementation, which used hg-git internally, last month. However, I found that it didn’t work as well in practice as my initial tests had indicated, and I morosely reverted to using hg directly.

While researching ways to improve the app, I stumbled across git-remote-hg by Felipe Contreras. It claimed to be a cure-all to the git-hg bridging problem that “just worked”. So I downloaded and tried it, but it didn’t live up to the author’s hype. While it could handle basic cloning, pushing to and pulling from the default branch, it failed for me when working with named upstream branches, something I have to do regularly in my day job. I submitted several issues and pull requests, most of which were ignored. The more deeply involved I became with the code, the more I felt a complete rewrite was in order.

I had some free time during the holiday season, and started my version of a git remote as an exercise, with no intent to create anything useful. To my surprise, something useful emerged. In fact, I believe gitifyhg is the most robust and functional git to hg client currently available. More importantly, it is eminently hackable code: well tested and fairly well documented. I hope this will make it easy to contribute to, and that my inbox will soon be full of pull requests.

This is the real deal. The best part is that you don’t have to learn any new commands. It’s just basic git with mercurial as a remote. The only command that has changed from normal git usage is clone:

pip install gitifyhg
git clone gitifyhg::http://selenic.com/hg
cd hg

By adding gitifyhg:: before the mercurial url, you can git clone most mercurial repositories. If you can’t, it’s a bug. Other complex repositories I have successfully cloned include py.test and pypy.

You can easily use gitifyhg in the fashion of git-svn. All named branches are available as remote branches. default maps to master. Other branches map to branches/<branchname>.

If you want to commit to the master branch, I suggest a workflow like this:

git clone gitifyhg::<any mercurial url>
cd repo_name
git checkout -b working  # make a local branch so master stays prestine
# hack and commit, hack and commit
git checkout master
git pull  # Any new commits that other people have added to upstream mercurial are now on master
git rebase master working  # rebase the working branch onto the end of master
git checkout master
git push

Working on a named mercurial branch, for example feature1, is easy:

git checkout --track origin/branches/feature1
git checkout -b working  # make a local branch so feature1 stays prestine for easy pulling and rebasing
# hack and commit, hack and commit
git checkout branches/feature1
git pull  # New commits from upstream on the feature1 branch
git rebase branches/feature1 working  # rebase the working branch onto the end of feature1
git checkout master
git push  #push your changes back to the upstream feature1 branch

It is even possible to create new named branches (assuming my_new_branch doesn’t exist yet in Mercurial):

git checkout -b "branches/my_new_branch"
# hack add commit
git push --set_upstream origin branches/my_new_branch

These basic workflows have been working flawlessly for me all week. In contrast to my previous attempts to use git to hg bridges, I have found it easier to use gitifyhg than to work in the mercurial commands that I have become expert with, but not used to, in the past year.

Gitify hg is not yet perfect. There are a few issues that still need to be ironed out. There are failing tests for most of these in the gitifyhg test suite if you would like to contribute with some low-hanging fruit:

  • Anonymous branches are dropped when cloned. Only the tip of a named branch is kept.
  • Tags can be cloned and pulled, but not pushed.
  • Bookmarks can be cloned and pushed, but not pulled reliably. I suspect this is related to the anonymous branch issue.

So give it a shot. Gitifyhg is just one easy_install gitifyhg away.

Gitifyhg: Accessing Mercurial repos from GIT

My company uses Mercurial for internal hosting. Other than that, it’s an absolutely terrific place to work.

About three quarters of my colleagues are sick of hearing the rest of us complain about Mercurial’s inadequacies. They mention tutorials like this one that simply don’t work in real life. I’ve done my research, and I have not been able to find a viable git to hg pipeline anywhere. There is a git-hg repo, but it appears to be un-maintained and not overly well documented. It’s also written in bash, so I’m not eager to take over maintenance.

Instead, I’ve been spending some of my non-working hours trying to make a python wrapper around hg-git named gitifyhg. It started out as an automated version of the hg-git tutorials, but is now expanding to provide additional support and tools.

Right now I’m following a git-svn style of workflow, where your git master branch can be synced up with the hg default branch. It seems to be working somewhat ok. I think at this point, the pain of using gitifyhg is about equal to the pain of using hg alone. Hopefully future improvements will reduce that pain, and as always, patches are most welcome!

The instructions are pretty clear and the unit tests provide a pretty good description of basic usage. Non-basic usage probably doesn’t work (yet). Here’s a tutorial of what does work:

Acme corporation uses Mercurial for all their repositories. Fred and Wilma are developing a project named AcmeAdmin together. Fred is content to use Mercurial for hosting, but Wilma is a git advocate who is going crazy with the restrictions Mercurial places on her. She decides she’s going to try gitifyhg on this new project to see how much trouble it is.

Fred starts the project by creating a Mercurial repo that will serve as the remote repo for both of them:

mkdir acmeadmin
cd acmeadmin
hg init

Fred and Wilma are doing a very strange sort of pair programming where they have separate repositories on the same machine. They are doing this so if you were reading along, you could type in the same commands they are typing and the demo would work. Rest assured that in spite of this queer practice, Fred and Wilma are hotshot programmers. So Fred now clones this remote repo and starts working. In the meantime, Wilma is browsing the gitifyhg README, so she hasn’t cloned anything yet.

cd ..
hg clone acmeadmin fredacme
cd fredacme
echo "Write Documentation" >> TODO
hg add TODO
hg commit -m "write the documentation for this project."
echo "Write Unit Tests" >>TODO
hg commit -m "unit tests are done"
hg push

Wilma’s caught up now and eager to try gitifyhg. She takes over the keyboard and clones the remote repository, which already contains Fred’s changes. Then she runs gitifyhg and verifies that a git repository exists with Fred’s two commits.

cd ..
hg clone acmeadmin wilmaacme
cd wilmaacme
gitifyhg
git log

Fred’s gone for coffee. Wilma now starts her own hacking and pushes the commits using gitifyhg.

echo "Implement code until tests pass" >>TODO
git commit -m "code implemented" -a
git hgpush

Wilma got kind of lucky here because she implemented her code on the master branch and there were no conflicts with Fred’s work. If there had been, I’m not sure what would have happened. Fred’s finished his coffee and pulls in Wilma’s change via Mercurial. He then commits some more changes.

cd ../fredacme
hg pull -u
echo "deploy" >> TODO 
hg commit -m "it's up and running"
hg push

Meanwhile, Wilma is working on a separate feature. She remembers to create a new git branch this time, which is good because she’s going to want to do some rebasing when Fred’s commit comes in.

cd ../wilmaacme
git checkout -b feature_branch
echo "new feature documentation" >> FEATURE
git add FEATURE
git commit -m "start new feature by documenting it"

Before pushing her changes, Wilma pulls to see if Fred has done anything. He has! But the hgpull merges all that into her master branch. She checks out her working branch and rebases it onto master. Then she pushes it upstream.

git hgpull
git checkout feature_branch
git rebase master
git checkout master
git merge feature_branch
git hgpush

And that’s the basic workflow! I’m probably going to add an hgrebase command to take care of that checkout rebase checkout merge step, since I expect it to be common. It should probably use git rebase -i, which is probably the most amazing thing that ever happened to version control.

Like I said, this is how gitifyhg works when all goes well. When things don’t go well it’s still a bit of a mess. I’ve had to manually sync up the hg and git working directories using git reset --hard and hg update -C a couple times. Once, my hgpush ended up putting a bookmarked branch on my hg repo that I didn’t want to go public (luckily, hg push fails by default when there are multiple heads); I ended up having to use hg strip to clean it up. I’m hoping to automate or prevent some of this in the future. For now, try it out and submit patches or at least issues!