This post outlines the vision of web3.storage. It reflects where the product will be in Q4 2022. For instance, our new upload API, which natively uses UCAN for auth, is currently in beta, and we will incorporate it into the web3.storage client and website in the coming weeks. In the meantime, some of what is outlined in this post might not yet be incorporated into the core web3.storage and NFT.Storage products.
Check out Part 2 here!
At DAG House, we’ve been building services that allow developers and end users to easily utilize IPFS content addressing with their data:
- We built easy-to-use clients in NFT.Storage and web3.storage for web developers to make data available on IPFS and stored on Filecoin in two lines of JavaScript
- We wrote Elastic IPFS, our own open source, cloud-native implementation of IPFS built for scalability, to give our users faster and more reliable data uploads and storage (a 99.9% decrease in 5xx errors on uploads, and 99th-percentile upload time reduced from 1-10 min to less than 10 s)
- We created IPFS HTTP gateway infrastructure that takes advantage of IPFS’s natural affinity for caching, which has resulted in ~10x faster reads of data available on IPFS
This has allowed users to utilize content addressing in production use cases where it is the best solution, without the performance compromises often associated with peer-to-peer technology. As a result, many developers use NFT.Storage and web3.storage to take advantage of IPFS’s immutability guarantees. Much of this traction has been in the blockchain space (e.g., to store and reference off-chain NFT metadata and images), but this is just the tip of the iceberg of what content addressing offers once it is performant and reliable.
But making content addressing accessible and production-ready has only been the first step in our vision to bring web3 to the web. We’ve been working hard to execute on a larger vision: coupling IPFS with UCAN (a self-contained, verifiable authorization primitive that uses decentralized identifiers for identity) to transform web3.storage into a developer platform that allows apps to break out of siloed services. This improves efficiency, eliminates lock-in, and opens up a broader set of application architectures. By natively utilizing these open, cryptographically verifiable, and portable protocols, the platform allows:
- Apps and users to trustlessly interact with (download, link, reference) data as long as it’s available via IPFS simply by pointing to its content address - we call this data anywhere, which breaks down data silos, reconnecting the web and improving bandwidth efficiency
- Users, apps, services, etc. to utilize permissioned services (e.g., store data, run compute workloads) portably using their own identity - we call this the serverless bazaar, which allows users to take their workloads anywhere (i.e., they can take the exact same request and trustlessly plug it into any other service) with no intermediaries
By combining the ease-of-use of a developer platform with these verifiable, decentralized protocols, developers can start interacting with their “data layer” separately from their “infrastructure layer.” Developers can choose to store data or run workloads wherever it makes the most sense - with web3.storage, locally, with other cloud providers, in blockchains - without changing how they interact across these services. This makes the benefits of web3 tangible - it opens up the possibility of any application architecture, not just traditional ones, so you can pick what solves your users’ needs the best. It also limits vendor lock-in, forcing transparency and accountability in cloud services.
The Data Layer
We like to say that web3.storage helps unlock your “data layer.” What does this mean?
The Data Layer refers to data itself, independent of where it is physically stored. It is fluid and flexible: any entity on the web - end-users, applications, infrastructure, organizations, and more - can interact with it (store it, read it, process it, send it, and more) without having to worry about whether the data they care about is controlled by someone else.
Overview of key protocols
We’ll get more into the technical details later, but for now, it’s useful to know that the Data Layer is fundamentally enabled by three decentralized protocols:
- IPFS: Ability to reference data using a unique identifier specific to that data (a “content identifier,” or CID). The Data Layer references everything using IPFS CIDs.
  - Any actor on the web can address exactly the piece of data they intend to by using its CID, which is derived from a cryptographic hash of the data.
  - They can read the data independent of where it physically is as long as it’s on the network (from cloud and decentralized storage, to peers in the network, to locally).
  - Data can internally reference other CIDs, creating “pointers” to other data that roll up into the overall CID without needing the data itself (via a complementary protocol called IPLD).
- DID: A cryptographic identifier. The types of DID we use allow the holder to prove they have control over that identity without a centralized source of truth. Every actor that needs an identity and is interacting with the Data Layer should have a DID.
- UCAN: An authorization mechanism building on top of DIDs where “everything a user is allowed to do is captured directly” in a token. UCAN is a powerful auth mechanism that really magnifies the power of the Data Layer.
  - The token contains verifiable cryptographic signatures validating that the true DID holder signed it
  - It is sent to authenticated APIs so the service can verify the token without checking any internal or external source-of-truth
  - A DID holder can also delegate any permissions they have to other DIDs so they can directly interact with authenticated services on the holder’s behalf
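To make the content-addressing idea concrete, here is a minimal sketch in Python. It is an illustration under loud assumptions, not how IPFS is implemented: a plain hex SHA-256 digest stands in for a real CID (which also carries version, codec, and hash-function prefixes), and `BlockStore` and `put_node` are hypothetical names invented for this example. It shows three properties described above: the identifier comes from the data itself, a reader can verify what it gets back, and one block can point at others by identifier alone.

```python
import hashlib
import json


def content_id(data: bytes) -> str:
    """Stand-in for a CID: the hex SHA-256 digest of the bytes.

    Assumption: real CIDs add version, codec, and multihash prefixes.
    """
    return hashlib.sha256(data).hexdigest()


class BlockStore:
    """Hypothetical in-memory store keyed by content identifier."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        cid = content_id(data)
        self.blocks[cid] = data  # identical bytes always land on the same key
        return cid

    def get(self, cid: str) -> bytes:
        data = self.blocks[cid]
        # The reader re-hashes the bytes, so it does not need to trust the store.
        if content_id(data) != cid:
            raise ValueError("block does not match its identifier")
        return data


def put_node(store: BlockStore, links: dict[str, str]) -> str:
    """Store a node that points at other blocks by identifier only."""
    encoded = json.dumps(links, sort_keys=True).encode("utf-8")
    return store.put(encoded)


store = BlockStore()
image_cid = store.put(b"image bytes")
meta_cid = store.put(b'{"name": "example"}')
root_cid = put_node(store, {"image": image_cid, "metadata": meta_cid})

# Storing the same bytes again yields the same identifier and no new block.
assert store.put(b"image bytes") == image_cid
assert len(store.blocks) == 3

# A block that was altered in storage is detected when it is read.
store.blocks[meta_cid] = b'{"name": "tampered"}'
try:
    store.get(meta_cid)
    tamper_detected = False
except ValueError:
    tamper_detected = True
```

Because the root node contains only the identifiers of its children, changing any child would change the root identifier too, which is the "roll up" behavior the IPFS bullet describes.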
Benefits of the data layer
So the Data Layer is built on top of all these fancy data and identity protocols that use cryptography! But what does the Data Layer unlock, exactly?
- One powerful thing is verifiability.
  - Because everything that interacts with the Data Layer references data using IPFS CIDs, actors have guarantees that the data they receive is really what they’re looking for without needing to trust others in the network.
  - Because any authenticated interaction uses UCAN, actors can verify for themselves that those making requests have permission to do so.
- Another is openness.
  - IPFS allows anyone plugged into the network to get data with a corresponding CID - there are no gatekeepers or silos in the Data Layer.
  - Even with this openness, things like private data can be secured via encryption. Since every actor has an identity based on cryptography, user flows can be designed to enable private data use cases.
  - In permissioned interactions, there is no central authority determining who has permissions to do what since UCAN validation is self-contained.
- This leads to composability, which unlocks improved efficiency and speed.
  - Data can be referenced and linked simply by using CIDs (hash-linked).
  - Rather than users and applications needing to include the entirety of the data in a payload, they can just reference the CID of the data itself, or even a CID that references a schema that contains the data.
  - Blocks of data that are the same and share a CID no longer have to be stored multiple times just because they are in two different data sets - data can be structured to point to these blocks without actually reuploading the blocks themselves, deduplicating the data.
  - Further, CIDs are infinitely cacheable, since CIDs are unique to their underlying data, so only one copy of the data needs to be kept for popular content.
- Verifiability, openness, and composability create portability, limiting vendor lock-in.
  - Because data is referenced using IPFS CIDs, you can easily switch hosted storage providers across the Data Layer. Just start writing to the new one - since you are referencing data by what it is, not where it’s stored (as with HTTP URLs), everything should continue to work seamlessly.
  - You can continue keeping your data on the old provider, or portably migrate your data to your new provider by simply giving them a list of CIDs.
  - Any workloads you’re running over your data can also be decentralized. Run workloads and generate CIDs for the input, function, and output. This suddenly makes all work portable and verifiable, and commoditizes your compute provider.
  - Because UCAN invocations are self-contained, any competing services can recognize whatever permissions actors have with competitors. This means users can seamlessly take the same invocation to another service if they are unhappy with their current provider.
- Users have self-sovereign identity since every actor has a DID and the entire Data Layer is verifiable and open, making things user-centric.
  - Anything that plugs into the Data Layer can and should have a DID, from end-users, to applications, to accounts within services, to services themselves.
  - With their identity, any actor can take who they are and their data to any interaction with anyone else on the web.
  - Companies no longer control a user’s identity or data that’s important to them. Because a user locally generates an account DID, they are in charge of it - companies merely execute UCAN invocations, acting on the user’s behalf.
  - Instead of companies creating walled gardens for user data and sharing some limited access to that data, it can be the other way around.
  - Novel use cases are unlocked (e.g., multiple user DIDs directly crowdfunding an account DID).
- Serverless application structures and data flows are enabled due to content addressing and DIDs, meaning anyone can interact directly with any other part of the Data Layer.
  - Self-sovereign identity means that a “user” is no longer defined by a service’s backend server.
  - UCAN tokens can be delegated to actors to interact directly with permissioned services; data does not have to flow through backend servers as a proxy.
  - With content addressing, storage can be on-device, cloud, and CDN - all without the developer leaving the web browser.
  - This portability extends even to workloads run on top of data, since the output of a workload can easily reference the CIDs of the input and function (creating a verifiable proof of the workload).
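The idea of generating identifiers for a workload's input, function, and output can be sketched in a few lines of Python. This is an illustration only, not the web3.storage implementation: it assumes a deterministic workload, uses a hex SHA-256 digest as a stand-in for a real CID, and the names `run_with_receipt`, `receipt_matches`, and `WORD_COUNT_SOURCE` are hypothetical ones chosen for this example.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Stand-in for a CID (assumption: real CIDs carry extra prefixes)."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical: the bytes that identify the function, e.g. its source code.
WORD_COUNT_SOURCE = b"def word_count(data): return str(len(data.split())).encode()"


def word_count(data: bytes) -> bytes:
    """A deterministic example workload: count whitespace-separated words."""
    return str(len(data.split())).encode("utf-8")


def run_with_receipt(function_source: bytes, function, input_data: bytes):
    """Run the workload and describe it purely by content identifiers."""
    output = function(input_data)
    receipt = {
        "function": content_id(function_source),
        "input": content_id(input_data),
        "output": content_id(output),
    }
    return output, receipt


def receipt_matches(receipt, function_source: bytes, function, input_data: bytes) -> bool:
    """Re-run the workload (for example, at another provider) and compare."""
    _, recomputed = run_with_receipt(function_source, function, input_data)
    return recomputed == receipt


text = b"data anywhere and the serverless bazaar"
output, receipt = run_with_receipt(WORD_COUNT_SOURCE, word_count, text)

# Anyone holding the same input and function can check the claimed output.
same_result = receipt_matches(receipt, WORD_COUNT_SOURCE, word_count, text)
```

Since the receipt names everything by what it is rather than where it ran, the same check works no matter which provider executed the workload. Non-deterministic workloads would need a different verification approach than simple re-execution.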
These advantages apply to anyone on the web, regardless of whether you’re used to traditional application-server models and APIs, or on the forefront of web3.
Next up: The web3.storage stack
In this post, we laid the foundation by describing what the Data Layer is. In the next post, we will discuss the web3.storage stack - more about the protocols we use, and the products we’ve built on top of them!