Merkle DAGs | Lesson 1 of 8

Data has structure!

In our Content Addressing on the Decentralized Web tutorial, we discussed the concept of a content identifier, or CID. (If you haven’t reviewed that lesson yet, we’d highly recommend you do so, as this tutorial builds on the concepts it introduces!) A CID functions like a fingerprint for a blob of data, and consists primarily of a cryptographic hash of the data itself. We can use this identifier as a unique and succinct name to point to that data. Because the name is unique, we can use it as a link, replacing location-based identifiers, like URLs, with ones based on the content of the data itself.

However, links aren’t just used for identifying specific content; they’re fundamental tools for representing, organizing, and traversing structured information. In all kinds of objects and systems in our daily lives—telephone directories, bibliographies, mind maps, taxonomies, and more— we find data with structure, and links form a critical part of that structure.

Data Structures

Whether you're a programmer or not, you're surrounded by structured data. Lists, dictionaries, and catalogs all help us organize information and take into account the relationships between various pieces of data.

At a certain level of detail, it starts to become necessary to formally describe the properties of the data we work with, giving rise to concrete specifications called, appropriately enough, data structures! From Wikipedia:

In computer science, a data structure is a data organization, management and storage format that enables efficient access and modification. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data.

In programming, data structures are everywhere. The way you organize data into variables in order to use them in programs involves anywhere from dozens to millions of data structures. If you're a developer, you're probably familiar with common data structures like arrays, objects, graphs, etc. These structures are often linked together: for example, in a common data structure known as a linked list, each item indicates where to find the next item in a computer's memory.

On the decentralized web we access data directly from our peers, rather than from a central authority. Within an isolated environment, such as your own laptop, you can have a great degree of trust in the data structures you work with in memory or on disk. However, in a decentralized system you have less, or possibly zero, trust among peers. To fit this environment, we need an efficient way to link data structures together, while still preserving our ability to verify their integrity (a crucial property of CIDs). To continue the previous example, if we're traversing a linked list that we got from the distributed web, we don't want a bad actor to be able to insert items in the middle undetected.

In this tutorial, we'll explore this notion, learning how to develop a specialized data structure, called a Merkle DAG, that meets these needs. As we'll see, this structure provides the foundation for a trustworthy, distributed web of interlinked data.