Merkle DAGs offer a straightforward way of achieving data deduplication, efficiently storing data by encoding redundant sections as links. This can take place on both small and large scales.
For an example of small-scale data duplication, consider the use case of tracking changes files in a directory over time (this is often called versioning). One change we could make to this directory is to delete the "fish" directory, replacing it with a directory called "dogs". These changes result in a new DAG, representing an updated state of the directory. However, all of the nodes representing the "cats" directory and its files are common to both DAGs. Therefore, we can reuse them, as depicted below, where the orange nodes represent nodes that are only used in the original DAG, the green nodes represent those that are common to both, and the blue nodes represent the extra nodes needed for the new state.
This means that we can actually store both versions of the "pics" directory, without taking up twice as much space as it takes to store a single version! Git, a common version control system, uses Merkle DAGs in a very similar way to track changes to source code in software projects!
Deduplication makes an even bigger difference when you expand this concept to a larger scale. For example, consider the use case of browsing the web! When a person visits a web page using a browser, the browser must first download any resources associated with the page, including images, text, and styling markup. You may have noticed that a lot of web pages actually look fairly similar, using slightly different variants of a common theme. Often, there's a lot of redundancy here—the themes are mostly the same, with tweaks here and there in the data that describes them.
Content addressing enables us to form something of a global, distributed file system. Using Merkle DAGs, you can "store" a massive dataset without really having to store it: as long as you have an Internet connection, you can retrieve the pieces you need at any given point in time. In fact, nobody has to store the entire dataset! CIDs allow us to seamlessly link and build data collections across computers, helping everybody to use storage space more efficiently (though, it should be noted, there are also often times we want to duplicate data for redundancy).
We also aren’t limited in the granularity of our DAGs: rather than working with large nodes that correspond to entire files, we can split a file up into chunks and make a DAG out of them! When we do so, we can often find ways to deduplicate the contents of similar files!
Feeling stuck? We'd love to hear what's confusing so we can improve this lesson. Please share your questions and feedback.