Introducing Zarrchitecture on dClimate

UPDATE: You can directly access 30+ terabytes of institutional-grade climate data via our Data Marketplace.

With the upcoming release of a new Web 2.0 API and decentralized marketplace for gridded datasets, we at dClimate want to explain the architectural decisions that make it possible to serve up huge climate datasets in a decentralized fashion. In this blog post, we will describe the challenges and successes we faced in storing and exposing the data, as well as provide some color on the types of data that will be accessible.

If you are interested in using the new dClimate API, you can access it here.

Overview

dClimate will begin using Zarr as the storage backend for gridded climate data served over the dClimate marketplace and accompanying API.

Gridded datasets like ERA5 and CPC provide well-understood and consistent climate variable readings across the entirety of a dataset’s temporal and spatial coverage area. This consistency, coupled with the high-quality climate science underlying their measurements, has made gridded datasets the cornerstone of many climate models, risk products, and other common applications of climate data. But their size and complexity make them hard to manage and harder to analyze. By repackaging these datasets into a new, cloud-native file format called Zarr, we can serve them quickly in recognizable formats within dClimate and thereby improve their accessibility for casual and institutional users alike.

Zarrs have several advantages over traditional file formats. Specifically, they:

  • Save memory by loading and serving only the “chunks” of data needed for a specific analysis
  • Have a very flexible data model adaptable to IPFS’s decentralized paradigm
  • Integrate well with Dask and Xarray, complementary tools for processing and analyzing datasets, respectively

Working with Zarrs may be better, but that does not mean it is easy. Beyond the obvious challenge of mastering a new tool, we had to adapt it to dClimate’s decentralized storage model.

  • When parsing data we first ensured that each individual unit of IPFS data for a dataset (confusingly known as IPFS chunks) was equivalent to one day’s worth of data to enable data deduplication.
  • We leveraged Zarr’s flexible key: value store data model to serve individual data chunks over IPFS; the chunk’s index is the key and its IPFS Content Identifier (CID) is the value.
  • When many chunks resulted in Zarr metadata files exceeding IPFS’s 2 MB block size limit, we originally created a nested tree structure to store groups of chunk key:values as IPFS hashes (with a plan to move to a HAMT; see: Future Work). Update: We now use a HAMT for the Zarr metadata via this library

The end result is a more performant dClimate backend that not only makes our API faster, stabler, and easier to maintain, but also serves data natively in analysis-ready formats like NetCDF.

Climate data and dimensionality

Gridded climate datasets are some of the most valuable datasets we store on dClimate. The “grid” in their name reflects how they store data points in multi-dimensional matrices — essentially spreadsheets where each cell represents one point in space, time, and possibly other dimensions (more on that below). Just like in the grids in Figures 1 and 3, each of these cells is equally sized, meaning moving X cells up, down, right, or left will always yield the same change in location and/or time. Data points for each of these cells are therefore said to be regular, since they’re evenly spaced, and continuous, since they cover the entire area or time span defined by the grid.

By contrast, readings from fixed measurement points like weather buoys or stations may provide more accurate inputs for the locations measured, but because they are neither regular nor continuous analysis in their “blind spots” may be low quality. Consider a hypothetical precipitation model built to insure Alaskan backcountry tour operators against the risk of storms preventing bush plane flights to remote locations. A model built on weather station data might perform well in Anchorage, where many weather stations cluster, but prove unreliable around actual tour destinations like Lake Clark National Park, where station density is lower. A gridded dataset would be more appropriate as it provides relatively reliable readings at regular intervals.

Figure 1: Comparing station data to 0.25 square degree gridded data around Anchorage, Alaska. A gridded dataset has values for every grid cell whereas a station dataset has values only for the stations (in red).

Station locations are indicative and taken from the FAA’s Surface Weather Observation Stations (SWOS) data at https://www.faa.gov/air_traffic/weather/asos/?state=AK. Background map from Stamen Design.

On the flip side, gridded climate datasets pose special challenges for data engineering and analysis. Most gridded climate datasets are n-dimensional, meaning that beyond the two spatial dimensions of latitude and longitude, they are indexed along further dimensions like time, water depth, or atmospheric column height. Besides complicating data manipulation (there is no such thing as a 4D Excel spreadsheet) these added dimensions bulk up n-dimensional datasets: higher-resolution gridded datasets can easily stretch into multiple terabytes for a single variable (e.g. daily temperature readings).

Figure 2: An example 3 dimensional dataset — latitude, longitude, and time — for 2 data variables — temperature and precipitation. Note that datasets sometimes extend into additional dimensions like water or atmospheric column depth. Source: XArray, https://xarray.dev/
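To make the structure in Figure 2 concrete, here is a minimal sketch of such a dataset built with Xarray. The coordinates, variable names, and random values are illustrative placeholders, not dClimate data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two data variables (temperature, precipitation) indexed by time, lat, lon.
lat = np.arange(60.0, 62.0, 0.25)
lon = np.arange(-152.0, -150.0, 0.25)
time = pd.date_range("2021-01-01", periods=10, freq="D")

ds = xr.Dataset(
    data_vars={
        "temperature": (("time", "lat", "lon"),
                        np.random.rand(len(time), len(lat), len(lon))),
        "precipitation": (("time", "lat", "lon"),
                          np.random.rand(len(time), len(lat), len(lon))),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)
print(ds)  # prints the dimensions, coordinates, and data variables
```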

These multi-terabyte datasets are difficult to download, let alone open, rearrange, and analyze in memory. As a result, working with n-dimensional datasets at scale has historically been the preserve of a few researchers at specialized academic and research institutions with access to strong internet connections, high performance computing clusters, and patient analysis delivery timelines.

With climate events increasingly impacting economies and societies and climate data driving political and economic decisions of greater size on much shorter time scales — for example, climate risk insurance payouts at Arbol — a new paradigm and set of tools is needed to make gridded climate data usable for a broader, less well-resourced set of users. At the same time the greater availability of cloud computing resources has lowered the cost and complexity of transforming n-dimensional data into a more accessible format. Adopting cloud native tools and file formats for processing gridded data — and adapting them to dClimate’s decentralized paradigm — therefore holds incredible promise for helping more actors make smart, data driven decisions based on past, present, and future climate trends.

Storing and serving data with Zarrs

Enter Zarr. Zarr is a cloud-optimized format for intelligently breaking down and storing n-dimensional gridded data. By pre-packaging small, contiguous segments of data as chunks and serving only the chunks needed for an operation, Zarr makes it possible to work with datasets too large to fit comfortably into memory, and it allows clients to make requests directly to the data layer (e.g. spatiotemporal requests without an API!). These chunks can live in various forms of permanent storage, including on disk, in the cloud, or, in dClimate’s case, on IPFS (more on this later).

Figure 3: A simplified example of chunking. By dividing an African dataset into chunks, an analysis for East Africa would only need to load 1/6th the data of a non-chunked dataset. Note that dClimate’s chunks also extend along the time dimension, meaning that an accurate visual would extend “backwards” into the third dimension. Source: Climate Data Operator’s User Guide 2012, http://www.idris.fr/media/ada/cdo.pdf
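As a rough sketch of what chunking looks like in code, the snippet below creates a Zarr array split into one-day chunks and reads back a small slice; only the chunks covering that slice are touched. The shape and chunk sizes are illustrative, not dClimate’s actual layout.

```python
import numpy as np
import zarr

# A year of hourly data on a 180 x 360 grid, chunked into 24-hour slabs.
z = zarr.open(
    "example.zarr",
    mode="w",
    shape=(8760, 180, 360),
    chunks=(24, 180, 360),
    dtype="f4",
)
z[:24] = np.random.rand(24, 180, 360).astype("f4")  # write the first day

# Reading one day's data only loads the chunk(s) that cover it.
day_one = z[:24, 100:110, 200:210]
print(day_one.shape)  # (24, 10, 10)
```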

Zarr stands in contrast to older file formats for storing gridded data like NetCDF and GRIB that don’t allow for access to only the desired portions of data. These formats tend to use large individual files; in one common implementation, each file represents the whole spatial domain of the data for a single timestep. This might be useful for data calculations that do not involve multiple points in time, like calculating the average temperature of a city, county, or country at a single time step. However, other data operations, such as calculating the average temperature for a city at all time steps, would require every large file to be first downloaded and then accessed — a prohibitively expensive operation.

On the other hand, Zarrs, with their small, easily-manipulated chunks can more efficiently handle a variety of operations. When a user wants to access a portion of a Zarr, they first request metadata that provides an overview of the dataset and references to where chunks are stored based on the grid coordinates of the dataset (e.g. latitude, longitude and time). Once this metadata is loaded into an object stored in memory, the user can treat this object as though it contains the full dataset — requests to the object for data will be handled lazily, with only the chunks necessary for completing the request ever being loaded into memory. It’s as if you could access only the page you needed in a PDF without downloading and opening the whole thing.
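In Python this lazy pattern looks roughly like the following; the store path, variable name, and coordinate ranges are placeholders.

```python
import xarray as xr

ds = xr.open_zarr("example.zarr")       # loads only the Zarr metadata
subset = ds["temperature"].sel(
    lat=slice(60.0, 62.0),
    lon=slice(-152.0, -150.0),
)                                       # still lazy: no chunks read yet
values = subset.values                  # only the needed chunks are fetched
```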

Lastly, Zarrs are agnostic about how their chunks are stored — whether chunks are stored as binary, compressed files, or otherwise, the only true requirement for representing a Zarr is a structured metadata file detailing where to find those chunks. As a result, the actual method for accessing these chunks is highly customizable to your preferred storage media. This made Zarr an ideal choice for dClimate, given our mandate to store gridded data in a non-traditional, decentralized fashion; the flexibility of how Zarr chunks can be stored allowed us to use IPFS as our datastore while still conforming to a widely-used gridded data format.

Zarrs on IPFS

In order to understand the value of having Zarrs on IPFS it is important to first understand “Why IPFS?” As described in the IPFS docs,

IPFS is a distributed system for storing and accessing files, websites, applications, and data.

Files are served up in IPFS by a network of connected nodes that together create a decentralized network which is both censorship resistant and robust. Replication among various nodes increases data redundancy, while permanence is achieved through pinning and services such as Filecoin (major 🔑: your files will never be lost).

Additionally, files are content addressable and not addressed by location:

Instead of being location-based, IPFS addresses a file by what’s in it, or by its content. The content identifier (CID) is a cryptographic hash of the content at that address. The hash is unique to the content that it came from, even though it may look short compared to the original content. It also allows you to verify that you got what you asked for — bad actors can’t just hand you content that doesn’t match.

This means the file you requested is the file you receive. Coupled with the liveness offered by a decentralized network, this makes it trivial to share data for Open Science: instead of sharing files via unreliable FTP servers, data is always available with guaranteed integrity, which helps address the problems of transparency and reproducibility in research, one of the core tenets of DeSci.

The key technology that enables Zarrs to be stored on IPFS is IPLD, an extension of object formats such as JSON and CBOR. In addition to supporting the data types you’d expect to be included in an object format, like strings, integers, and lists, IPLD objects may contain CIDs that link to other IPLD objects or files on IPFS. Because Zarr metadata is agnostic about how we map from coordinates to chunks, we can use an IPLD object within valid Zarr metadata. We simply store the chunks in files on IPFS and link to them by their CIDs, e.g. {'0.0.0': <CID Object>, '0.0.1': <CID Object>, etc.}

Figure 4: A visual overview of Zarr metadata. The coordinate indices (0.0, 0.1, etc.) map to individual data chunks. We store each chunk separately on IPFS and reference it with a CID.
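Conceptually, the mapping in Figure 4 boils down to something like the dictionary below (a simplification, not the actual ipldstore wire format): the keys are Zarr chunk coordinates and the values are CID links, so resolving a key is a single IPFS lookup. The CID strings are placeholders.

```python
# Simplified view of the IPLD-backed Zarr metadata: chunk coordinates -> CIDs.
chunk_index = {
    "temperature/0.0.0": "bafy...chunk000",  # placeholder CIDs
    "temperature/0.0.1": "bafy...chunk001",
    "temperature/0.1.0": "bafy...chunk010",
    # ...one entry per chunk; in IPLD the values are real CID links.
}
```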

A Python implementation of IPLD objects serving as zarr metadata called ipldstore can be found at https://github.com/dClimate/ipldstore. We forked and modified the (excellent) original repo at https://github.com/d70-t/ipldstore to make data access more performant and to allow metadata and chunks to transfer over the IPFS network.

The library works by conforming to the Python zarr implementation’s requirement that Python objects which store zarr data must be mutable mappings — maps from strings (the coordinate-based keys of zarr metadata) to bytes (the content of zarr chunks). Whereas a more standard zarr datastore would obtain these bytes from files stored on disk or in the cloud, ipldstore tells zarr to make ipfs cat calls to an IPFS node running on the same machine as the library.
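A heavily simplified, read-only sketch of that idea is shown below (the real ipldstore is more involved): it exposes the mutable-mapping interface zarr expects and resolves each chunk key to a CID that it cats from a local IPFS node’s HTTP RPC API. The key_to_cid mapping stands in for the metadata structure described in this post.

```python
from collections.abc import MutableMapping
import requests

class IPFSChunkStore(MutableMapping):
    """Read-only Zarr store sketch backed by a local IPFS node."""

    def __init__(self, key_to_cid, api="http://127.0.0.1:5001/api/v0"):
        self.key_to_cid = key_to_cid   # e.g. {"temperature/0.0.0": "bafy..."}
        self.api = api

    def __getitem__(self, key):
        cid = self.key_to_cid[key]
        resp = requests.post(f"{self.api}/cat", params={"arg": cid})
        resp.raise_for_status()
        return resp.content            # raw chunk bytes handed back to zarr

    def __setitem__(self, key, value):
        raise NotImplementedError("write path omitted in this sketch")

    def __delitem__(self, key):
        raise NotImplementedError

    def __iter__(self):
        return iter(self.key_to_cid)

    def __len__(self):
        return len(self.key_to_cid)
```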

To make ipldstore ready for production, we had to make some thorny implementation decisions. IPFS imposes a ~ 2 MB size limit on blocks transferred over the IPFS network. However, for large zarrs, the metadata can be tens of MBs, precluding the possibility of storing metadata as a single IPLD object (see: Future Work section for improvements). Our solution was to break down the metadata from a flat key-value store into a tree with the leaf nodes detailing IPFS hashes where zarr chunks can be found (Update: We converted this structure to a HAMT!). Another issue we encountered was performance — by default, zarr will make the ipfs cat calls needed to complete a query in sequence, creating a bottleneck in the data retrieval process. By parallelizing these requests, we achieved order-of-magnitude speedups compared to the single-threaded implementation. Finally, we had to leave our Zarr data uncompressed for IPFS’s de-duplication to work — otherwise updates containing existing data would have duplicated bytes.
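The parallelization is conceptually simple: because the ipfs cat calls are I/O bound, issuing them from a thread pool rather than sequentially yields large speedups. A hedged sketch, with placeholder CIDs and the default local RPC endpoint:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

API = "http://127.0.0.1:5001/api/v0/cat"

def cat(cid: str) -> bytes:
    resp = requests.post(API, params={"arg": cid})
    resp.raise_for_status()
    return resp.content

def cat_many(cids, max_workers=16):
    # Fetch many chunks concurrently instead of one `ipfs cat` at a time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(cat, cids))
```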

These changes are key for storing and serving the gridded datasets over the dClimate marketplace. Moreover, the resulting package serves data quickly over a web2 API which will be released to the public in the coming weeks. This API leverages zarr’s data access patterns to enable a variety of queries for geospatial data over HTTP, such as all data contained within a polygon or a weekly rollup of an hourly dataset.

Zarr plays nice with friends

The API we’re building for gridded data takes advantage of two powerful tools in the Python ecosystem that integrate well with Zarrs: Dask and Xarray. Both Dask and Xarray solve hard problems elegantly and have active user and support communities, making it easier to design, maintain, and extend dClimate’s API.

Dask provides a Python-native interface for parallelizing computations across many CPUs or entire computing clusters, drastically speeding up calculations while avoiding memory blowouts. In the new dClimate API, Dask is used to perform data processing and retrieval in parallel, allowing users to pull gridded data over large regions, as well as aggregate the values with operations such as mean and standard deviation. In addition, Dask is the key technology that allowed us to put zarrs on IPFS in the first place — we used Dask to transform source data in formats like GRIB or NetCDF into zarr using multiple threads and CPUs. This allowed for parsing decades of hourly data into a Zarr in about a day, a transformation that would be much slower without the parallelization of Dask.
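A minimal sketch of that conversion pattern, with placeholder file names and chunk sizes: open the source file lazily with Dask-backed chunks, then write it out as a Zarr, letting Dask process the chunks in parallel.

```python
import xarray as xr

# Open a source NetCDF lazily; chunks={...} makes the arrays Dask-backed.
ds = xr.open_dataset("era5_hourly_temperature.nc", chunks={"time": 24})

# Writing to Zarr streams through the data chunk by chunk, in parallel
# across whatever Dask scheduler (threads, processes, cluster) is attached.
ds.to_zarr("era5_hourly_temperature.zarr", mode="w")
```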

Xarray provides a Python-native interface for exploring and analyzing n-dimensional datasets. It functions as a layer of abstraction around zarrs, allowing the user to perform queries and calculations that would be too complex to do otherwise. For example, Xarray enables us to pull all data in a polygon on the Earth’s surface from a zarr. In this query, Xarray handles the transformation of the polygon to the latitude/longitude pairs that zarr should access, letting the user ignore implementation details like chunks and zarr metadata.
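As a simplified example of that kind of query (the production API handles arbitrary polygons; here we approximate one with its bounding box, and the dataset path and names are placeholders):

```python
import xarray as xr

ds = xr.open_zarr("era5_hourly_temperature.zarr")

# Select everything inside the polygon's bounding box...
box = ds["temperature"].sel(lat=slice(58.0, 62.0), lon=slice(-156.0, -149.0))

# ...then aggregate, e.g. a spatial mean per timestep.
regional_mean = box.mean(dim=["lat", "lon"]).compute()
```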

To facilitate querying within memory-constrained browser contexts, we also created a fork of the zarr.js library, https://github.com/dClimate/zarr-ipfs, so that clients can directly interface with files on the IPFS network without the need for a server or an API. It is critical to note that this zarr architecture opens the door to bringing climate data on-chain in a completely decentralized fashion, without the need for centralized servers or APIs, only thin clients that can run on oracle services such as Chainlink external adapters. This ensures that an external API failure cannot prevent climate data access on an oracle, as there is no reliance on an API, only IPFS :) In other words, truly decentralized climate primitives can finally exist on-chain, along with products built on top of them.

Redistributing data

With the work we have done with Zarr on IPFS, our hope is that it will be used not only by those in the geospatial sector but also in other industries already exploring Zarr, such as Open Microscopy, genetics research (1000 Genomes), and many more (Fun fact: Zarr was created by an Oxford geneticist! @alimanfoo). As mentioned previously, coupling Zarr with IPFS gives one the ability to make queries against an n-dimensional dataset directly from a local client to a decentralized network with no middleware (or servers that can fail), while also getting immutability and data integrity out of the box to improve research reproducibility.

We envision a future where IPFS nodes running in space can collect satellite data in situ, transmitting it to where it’s needed for queries and analysis on demand. We also hope for collaboration across disciplines, where Zarr visualization work in the geospatial field can assist biological imaging or viewing SNPs in genetics research, and vice versa.

You can begin making climate data queries directly against IPFS by setting up a local IPFS node and swarm connecting to our node, using a Docker image, or using this Ansible playbook (coming soon) if you wish to deploy it onto a server elsewhere. Additionally, here is a Google Colab Jupyter notebook with a template for using this climate data stored on IPFS (it runs an IPFS node within the notebook itself, peered to the dClimate node!), with another example building off of it here.
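As a rough sketch of the swarm-connect step, assuming a local Kubo node with its RPC API on the default port; the peer multiaddr below is a placeholder for the address published in our documentation:

```python
import requests

# Placeholder multiaddr; substitute the dClimate node's published address.
PEER = "/dns4/<dclimate-node>/tcp/4001/p2p/<peer-id>"

resp = requests.post(
    "http://127.0.0.1:5001/api/v0/swarm/connect",
    params={"arg": PEER},
)
print(resp.json())  # e.g. {"Strings": ["connect <peer-id> success"]}
```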

Conclusion

Our new Zarrchitecture will make working with gridded climate datasets on the dClimate API and Marketplace a faster, stabler, and much more enjoyable experience. For data consumers, it will also speed up and enable a number of exciting tools we’re developing on the visualization and analysis side. For ourselves, it will put dClimate on the cutting edge not just of distributed data storage but of climate data provision in general, aligning us much more closely with the Pangeo community’s exciting, pioneering climate data engineering work.

UPDATE

The new Zarr-optimized API is live here. You can also access 30+ TBs of institutional-grade climate data directly via our Data Marketplace.

Future Work

Listed below are some of the many initiatives we are working on. As our work is completely open source, we encourage the community to contribute, so please reach out to community@dclimate.net if any of the below interests you or if you have any other ideas.

  • Translate IPLDStore to TypeScript or Wasm (Rust) for running within browser contexts
  • Visualization Libraries to illustrate Zarr on IPFS Data within Browser Contexts


Do you want to learn more about the decentralized and open climate data ecosystem we are building?