.. _hgmozilla_dag: ===================== The DAG and Mercurial ===================== Distributed version control systems (DVCS) like Mercurial (and Git) utilize a directed acyclic graph (DAG) for representing commits. This DAG is how the *history* of a repository is stored and often represented. To help understand how this works, we'll be using Mercurial commands to show and manipulate the DAG. We start by creating an empty Mercurial repository with an empty graph:: $ hg init repo $ cd repo The Mercurial command for inspecting the repository history (and the DAG by extension) is ``hg log``. Let's run it on our empty repository:: $ hg log -G No output. This is confirmation that the graph is empty. .. note:: We use ``-G`` throughout this article because it is necessary to print a visual representation of the DAG. Adding Nodes ============ We'll need to create nodes to make our DAG interesting. The way we do this is by *committing*. But first, we need something to commit. Let's create a file:: $ echo 1 > file Then we tell Mercurial that we're interested in storing the history of this file:: $ hg add file We then use ``hg commit`` to create a new node (*changeset* in Mercurial parlance) and add it to the DAG:: $ hg commit -m A (``-m A`` says to use ``A`` for the commit message). Now let's take a look a the DAG:: $ hg log -G -T '{desc}' @ A .. note:: ``-T '{desc}'`` tells Mercurial to only print the *description* or *commit message* from the changeset/node. Without it, output would be much more verbose. It looks like we have a single node - denoted by ``@`` and ``A``. This graph / repository state isn't very interesting. So, we perform another commit:: $ echo 2 >> file $ hg commit -m B And then inspect the DAG:: $ hg log -G -T '{desc}' @ B | o A This is slightly more interesting! The ``B`` node has been introduced. That is pretty obvious. ``A`` has its node represented by a ``o``. ``B`` has its node represented by ``@``. Mercurial uses ``@`` to represent the node currently attached to the *working directory*. This will be discussed later. The first column has grown a vertical pipe character (``|``) between the two entries. In graph terms, this is an *edge*. While the visualization doesn't indicate it, ``B`` internally stores a pointer back to ``A``. This constitutes the *edge* and since the edge is directional (``B`` to ``A``), that makes the graph *directed* (the *D* from *DAG*). Directed graphs borrow terminology from biology to represent relationships between nodes: parent node A node that came before another (is referred to by another node). child node A node that derives directly from another. root node A node that has no parents. head node A node that has no children. descendants All the children, the children's children, the children's children's children, and so on of a node. ancestors The parents, parents' parents, parents' parents' parents, and so on of a node. In our graph so far: * ``A`` is the parent of ``B`` * ``B`` is a child of ``A`` * ``A`` is a root * ``B`` is a head * The descendants of ``A`` are just ``B``. * The ancestors of ``B`` are just ``A``. If you don't understand this, let's try committing a few more nodes to help your understanding.:: $ echo 3 >> file $ hg commit -m C $ echo 4 >> file $ hg commit -m D $ hg log -G -T '{desc}' @ D | o C | o B | o A ``A`` is still the root node. Since ``B`` has children, it is no longer a *head*. Instead, ``D`` is now our head node. If all you do is ``hg commit`` like we've been doing so far, your repository's DAG will be a linear chain of nodes, just like we have constructed above. 1 head. Every node has 1 parent (except the root). .. important:: The important takeaway from this section is that the *history* of Mercurial repositories is stored as a DAG. ``hg commit`` creates a changeset and appends a node to a graph. A DAG node and a Mercurial changeset are effectively the same thing. Nodes are Hashes of Content =========================== Up to this point, we've been using our single letter commit messages (``A``, ``B``, etc) to represent nodes in our DAG. This is good for human understanding, but it hides an important detail of how Mercurial actually works. Mercurial uses a SHA-1 digest to identify each node/changeset in the repository/DAG. The SHA-1 digest is obtained by hashing the content of the changeset. Essentially, Mercurial changesets consist of various pieces of data including but not limited to: 1. The parent node(s) 2. The set of files and their state 3. The author 4. The date 5. The commit message Mercurial assembles all these pieces of data in a well-defined manner, feeds the result into a SHA-1 hasher, and uses the digest of the result as the node/changeset ID. SHA-1 digests are 20 bytes or 40 hex characters. They look like ``835dbd9444dbed0cdc2ca27e23839f05a58e1dc1``. For readability, these are almost always abbreviated to 12 characters in user-facing interfaces. e.g. ``835dbd9444db``. We can ask Mercurial to render these SHA-1 digests instead of the commit messages:: $ hg log -G -T '{node}' @ 2bf9b23b2d0379540038866a72699a8ce5e92e84 | o 0f165760af41ddde6470860088f421c1efcc5a5f | o 7175417717e87c88e4cf61ab2f76f2c54c76fa4b | o 8febb2b7339e5843832ab893ca2a002cd4394a03 Or we can ask for the short versions:: $ hg log -G -T '{node|short} {desc}' @ 2bf9b23b2d03 D | o 0f165760af41 C | o 7175417717e8 B | o 8febb2b7339e A .. note:: We start to use some more capabilities of Mercurial's *templates* feature. This allows output from Mercurial commands to be customized. See ``hg help templates`` for more. Because SHA-1s (even their short versions) are difficult to remember, we'll continue using commit messages and single letters throughout this article to aid comprehension. Important Properties from Using Hashing --------------------------------------- Since node IDs are derived by hashing content, this means that changing **any** of that content will result in the node ID changing. Change a file: new node ID. Change the commit message: new node ID. Change the parent of a node: new node ID. Changing the content of a changeset and thus its node ID is referred to as *history rewriting* because it changes the *history* of a repository/DAG. *History rewriting* is an important topic, but it won't be discussed quite yet. The important thing to know is that if you change anything that's part of the changeset, the node ID changes. Moving Between Nodes ==================== Looking at the state of our Mercurial repository on the filesystem, we see two entries:: $ ls -A .hg/ file The ``.hg`` directory contains all the files managed by Mercurial. It should be treated as a black box. Everything else in this directory (currently just the ``file`` file and the current directory) is referred to as the *working directory* or *working copy* (both terms used interchangeably). The *working directory* is based on the state of the files in a repository at a specific changeset/node. We say *based on* because you can obviously change file contents. But initially, the *working directory* matches exactly what is stored in a specific changeset/node. The ``hg update`` (frequently ``hg up``) command is used to change which node in the DAG the *working directory* corresponds to. If you ``hg up 7175417717e8``, the *working directory* will assume the state of the files from changeset/node ``7175417717e8...``. If you ``hg up 2bf9b23b2d03``, state will be changed to ``2bf9b23b2d03...``. The ability to move between nodes in the DAG introduces the possibility to... Creating DAG Branches ===================== Up until this point, we've examined perfectly linear DAGs. As a refresher:: $ hg log -G -T '{node|short} {desc}' @ 2bf9b23b2d03 D | o 0f165760af41 C | o 7175417717e8 B | o 8febb2b7339e A Every node (except the root, ``A``/``8febb2b7339e``) has 1 parent node. And the graph as a whole has a single head (``D``/``2bf9b23b2d03``). Let's do something a bit more advanced. We start by switching the *working directory* to a different changeset/node:: $ hg up 7175417717e8 1 files updated, 0 files merged, 0 files removed, 0 files unresolved $ hg log -G -T '{node|short} {desc}' o 2bf9b23b2d03 D | o 0f165760af41 C | @ 7175417717e8 B | o 8febb2b7339e A (Note how ``@`` - the representation of the active changeset/node in the *working directory* - moved from ``D`` to ``B``) Now let's commit a new changeset/node:: $ echo 5 >> file $ hg commit -m E created new head That *created new head* message is a hint that our DAG has changed. Can you guess what happened? Let's take a look:: $ hg log -G -T '{node|short} {desc}' @ 4a3687e9313a E | | o 2bf9b23b2d03 D | | | o 0f165760af41 C |/ o 7175417717e8 B | o 8febb2b7339e A ``B`` now has multiple direct children nodes, ``C`` and ``E``. In graph terminology, we refer to this as a *branch point*. ``E`` has no children, so it is a *head* node (``D`` is still a head node as well). Because the visualization of the graph can resemble a tree (from nature, not your computer science textbooks), small *protrusions* from the main *trunk* are referred to as *branches* from the perspective of the DAG. (Mercurial has overloaded *branch* to convey additional semantics, so try not to confuse a *DAG branch* with a *Mercurial branch*.) The *created new head* message was Mercurial telling us that we created not only a new *DAG head* but also a new *DAG branch*. Because your commit is taking the repository in a different *direction* (very non-scientific word), this act of creating new DAG branches is sometimes referred to as *divergence* or *diverging*. DAG branches turn out to be an excellent way to work on separate and isolated units of change. These are often referred to as *feature branches* because each DAG branch consists of a specific feature. For more, see :ref:`hgmozilla_workflows`. It's worth noting that ``hg commit`` **always** produces a new head node because the node being created never has any children. However, it may not create a new DAG branch: a new DAG branch is only created when the parent node of the commit isn't a head node. Before we go on let's commit a new changeset on top of ``E`` to make the DAG branch more pronounced:: $ echo 6 >> file $ hg commit -m F $ hg log -G -T '{node|short} {desc}' @ da36621d7a94 F | o 4a3687e9313a E | | o 2bf9b23b2d03 D | | | o 0f165760af41 C |/ o 7175417717e8 B | o 8febb2b7339e A Merging DAG Branches ==================== Now that we have multiple DAG branches, it is sometimes desirable to *merge* them back into one. The Mercurial command for performing this action is ``hg merge``. Let's change our working directory to the changeset that we want to merge *into*. We choose ``D``, since it was our original head.:: $ hg up 2bf9b23b2d03 1 files updated, 0 files merged, 0 files removed, 0 files unresolved Now we tell Mercurial to bring the changes from ``F``'s head into ``D``'s:: $ hg merge da36621d7a94 0 files updated, 1 files merged, 0 files removed, 0 files unresolved (branch merge, don't forget to commit) $ hg commit -m G Visualizing the result:: $ hg log -G -T '{node|short} {desc}' @ 19c6c94d7bb2 G |\ | o da36621d7a94 F | | | o 4a3687e9313a E | | o | 2bf9b23b2d03 D | | o | 0f165760af41 C |/ o 7175417717e8 B | o 8febb2b7339e A ``G``/``19c6c94d7bb2`` is what is referred to as a *merge commit*. It is the result of a commit operation that merged 2 nodes. From the perspective of the DAG, it is a node with 2 parents, not 1. Conclusion ========== These are the basics of how Mercurial uses a directed acyclic graph (DAG) to represent repository history. If you would like to learn more about how distributed version control systems (like Mercurial) use DAGs, please read `this article `_. For more on workflows that build upon this knowledge, see :ref:`hgmozilla_workflows`.