Data is an essential resource in our digital economy. Unfortunately, we currently live in an era of internet data silos, where data is captured by individual applications and platforms and is only shared sporadically. This status quo has many causes, but it can mostly be attributed to the absence of a global, standardized and trusted infrastructure that facilitates and incentivizes the exchange of data and algorithms between businesses, humans and machines at scale. Here we dive deeper into the functional requirements of a global data exchange layer, such as decentralization, data provenance and data pricing. Once these are in place, we could see the true emergence of digital ecosystems with interoperable services, built-in data monetization and the volume and variety of data needed to build powerful algorithms.
Enabling factors such as the advent of mobile, user-friendly interfaces and the cloud helped generate enormous amounts of user data. This Cambrian explosion of data, combined with rapid progress in storage and computing, allowed machine learning to undergo a renaissance over the past decade. This gave us better statistical models and algorithms, resulting in more powerful services (e.g. recommender systems, image recognition, voice assistants), which in turn lead to increased usage and, hence, even more data.
Unfortunately, most of this data currently sits in business silos. On the one hand this is caused by big platforms that keep “their” data to themselves to keep competitors at bay; on the other hand it is caused by the absence of an advanced infrastructure that could facilitate and incentivize the exchange of data and algorithms. Most of the data exchange that currently takes place happens through narrow APIs that are fully controlled by the applications themselves, or by means of ad hoc data dumps via primitive data marketplaces. Consequently, we have still not realized the full potential of our data economy. In the best-case scenario, any dataset, whether from Google or from a small business, could be modelled by any talented data scientist in the world, applied in other contexts and for different purposes, and aggregated and fused with datasets from other sources at scale.
For such a situation to materialize, however, the internet would benefit from the development of a global data and AI exchange layer with a few important conditions in place. First of all, this data exchange protocol layer should have the characteristics of a utility, providing trust, security, openness, privacy and neutrality in a multi-stakeholder environment.
In a previous note, we discussed how the use of decentralized open-source protocols such as permissionless blockchains could be an important enabler of these prerequisites. A centrally managed and owned data exchange platform, by contrast, could lead to distrust, as it would introduce a middleman that could become disproportionately powerful and/or be gamed both internally and externally. Secondly, a data exchange layer requires some form of data provenance/lineage, i.e. keeping track of the origin of a piece of data, the processes it undergoes, who processed it and where it goes over time. Keeping track of the life cycle of a piece of data is crucial for making the data value chain transparent, but more importantly for treating data as a real commodity with a state, an owner and a value. Data provenance in turn enables other important functionality, such as the pricing and transacting of data units and the creation and real-time verification of more fine-grained service agreements in the form of smart contracts.
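To make the idea of data provenance more concrete, here is a minimal sketch, assuming a hash-linked chain of lineage records with SHA-256 content addressing; the field names (owner, process, parent_hash) and the overall structure are illustrative assumptions on our part, not the design of any specific project:

```python
# Illustrative sketch of a provenance/lineage record (not any project's actual schema).
import hashlib
import json
import time
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class ProvenanceRecord:
    """One step in a dataset's lineage: who did what to which input."""
    data_hash: str              # content hash of the dataset after this step
    owner: str                  # party holding the data at this point
    process: str                # transformation that produced this version
    parent_hash: Optional[str]  # data_hash of the input dataset, if any
    timestamp: float = field(default_factory=time.time)

    def record_id(self) -> str:
        # Hash the whole record so later steps (or a registry) can reference it immutably.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()


def derive(parent: ProvenanceRecord, new_data: bytes,
           owner: str, process: str) -> ProvenanceRecord:
    """Create the next lineage step from an existing record."""
    return ProvenanceRecord(
        data_hash=hashlib.sha256(new_data).hexdigest(),
        owner=owner,
        process=process,
        parent_hash=parent.data_hash,
    )


# Example: raw data captured by one party, then cleaned by another.
raw = b"sensor readings, 2021-01-01"
origin = ProvenanceRecord(
    data_hash=hashlib.sha256(raw).hexdigest(),
    owner="sensor-operator",
    process="raw capture",
    parent_hash=None,
)
cleaned = derive(origin, b"cleaned readings",
                 owner="analytics-co", process="outlier removal")
print(origin.record_id(), "->", cleaned.record_id())
```

Because each record points to its parent by content hash, any party holding the chain can verify the order and authorship of the transformations a dataset has undergone, which is what would make downstream features such as per-unit pricing or smart-contract-based service agreements verifiable in practice.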
These functionalities are currently being developed by a plethora of projects, each competing to become a significant part of the solution in the data exchange layer. Several examples of such (complementary) projects are provided in the observations section above. Crucially, these projects have announced that they will collaborate on other, higher-level features so that they remain interoperable and to prevent the emergence of competing standards that would, once again, lead to the siloing of data.