An integral part of the engineering ethos at Quantcast is that hard problems are worth seeking out and worth solving well. This philosophy has led us to build highly efficient products that empower brands, agencies, and publishers to know and grow their audiences on the open internet.

At the center of the Quantcast Platform is Ara, our AI and machine learning engine. One of the core components of Ara is Kamke, a custom OLAP-style database of over a trillion data points that allows the platform to deliver flexible insights on composite audiences at interactive speeds. Like many modern analytics platforms[1], such as Druid, we use the Roaring bitmap encoding format to efficiently manipulate integer sets. Over the last year, we’ve repeatedly gone back to the low-level Roaring code to look for ways to improve the performance of our system. This post focuses on the ARM64-specific changes that we made.

A Roaring Bitmap Primer

Roaring bitmaps are an encoding format for subsets of 32-bit integers[2] that provides industry-leading compression, a standardized serialization format, and efficient set operations. Like many other databases, Kamke uses Roaring to efficiently encode the dataset and perform set calculations. When migrating Kamke to ARM, we were able to achieve significant performance improvements with small modifications to the Go library.

It’s worth starting with a brief overview of the Roaring data structure, as some of the optimizations were performed on the internal structures. If you’re interested in further reading, roaringbitmap.org has a number of excellent resources.

Roaring Data Structure

The main strategy of Roaring bitmaps is to decompose the 32-bit integer space into a two-level trie, mapping the top short (upper 16 bits) to a container, which in turn encodes the set of bottom shorts (lower 16 bits) for those values with a given top short. Part of the efficiency of Roaring is that it implements three different container structures, and uses whichever one minimizes storage costs. The three containers’ serialization formats can all be operated against directly and have implementations for a variety of common operations (such as union, intersection, difference, and count).
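
In code, the decomposition is just a couple of bit operations. The following is an illustrative sketch rather than the library's actual API:

```go
// split is an illustrative sketch (not the roaring library's API) of the
// two-level decomposition: the top short keys the container, and the
// container stores only the bottom short.
func split(x uint32) (top, bottom uint16) {
	return uint16(x >> 16), uint16(x)
}

// combine rebuilds the original 32-bit value from its two shorts.
func combine(top, bottom uint16) uint32 {
	return uint32(top)<<16 | uint32(bottom)
}
```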

Container Types

Array Container

A sorted array of distinct shorts. Takes 2 * cardinality bytes to encode. The vast majority of containers in Kamke (over 99.97%) are encoded as array containers.

Run-Length Encoded Container (run container)

A sorted array of pairs of (start, length-1) representing all the ranges that are present in the set. The storage cost is 4 bytes for every range of values present. 

Bitmap Container

A bitmap of 65,536 bits[3]. Takes 8,192 bytes regardless of contents.

In general, Roaring algorithms will use whichever container format has the smallest serialized size.
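
As a rough sketch of that selection heuristic (ignoring the small per-container headers that the real serialization formats also carry), the comparison looks like this:

```go
// Approximate serialized sizes of the three container types, ignoring the
// small per-container headers the real formats also carry. Illustrative only,
// not the library's code.
func arrayBytes(cardinality int) int { return 2 * cardinality }
func runBytes(numRuns int) int       { return 4 * numRuns }
func bitmapBytes() int               { return 8192 }

// cheapestContainer returns which encoding is smallest for a set with the
// given cardinality and number of runs.
func cheapestContainer(cardinality, numRuns int) string {
	best, kind := arrayBytes(cardinality), "array"
	if b := runBytes(numRuns); b < best {
		best, kind = b, "run"
	}
	if bitmapBytes() < best {
		kind = "bitmap"
	}
	return kind
}
```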

Optimizing Roaring Bitmaps In High Performance Go

Kamke is backed by an immense number of Roaring bitmaps, with the full dataset comprising over 100 billion distinct bitmaps. We use so many bitmaps that even if the data stayed within memory-mapped files, merely instantiating all the bitmap structures would lead to heap allocations in excess of the total stored data. As a consequence, all allocation must happen within the lifecycle of a query request, which creates and manipulates tens of millions of bitmaps while still responding in under a second.

In order to achieve this performance, we have repeatedly investigated bottlenecks within our system and implemented more optimized versions. For brevity, this post is focused on those optimizations related to the new AWS Graviton 2 machines, which run on 64-bit ARM Neoverse N1 cores.

To measure the impact of these changes, we selected five real queries from across our product suite as a benchmark. Each query was run ten times on a single worker with several key metrics measured. All numbers are given relative to the baseline code running on an x86 machine (r5d.2xlarge, to be precise).

Direct Port From R5d To R6gd Instance Classes

Kamke makes heavy use of both RAM and attached storage, so we have typically favored the R-class memory-optimized machines with attached storage. One of the advantages of using Go for our system is that it provides reliable compilation across multiple architectures. We were able to compile and deploy against ARM just by setting GOARCH to arm64 and spinning up a cluster on the r6gd class of machines. Doing this showed immediate benefits: despite the r6gd machines being 18% less expensive, we saw about a 15% improvement in both total thread-seconds utilized and overall wall-clock time.

| Query Type | CPU[4] | MALLOCS[5] | TALLOCS[6] | WALL CLOCK[7] |
|---|---|---|---|---|
| Custom Audience | 88.90% | 820.40% | 248.85% | 90.64% |
| Word Cloud | 87.65% | 482.20% | 189.28% | 88.36% |
| Postal Code | 76.40% | 190.97% | 141.59% | 77.27% |
| Cities | 80.04% | 445.49% | 169.02% | 81.08% |
| Demographics | 78.51% | 592.57% | 193.88% | 79.53% |

However, the total number of allocations (mallocs) and bytes allocated (tallocs) per query increased markedly. We’ve found that large amounts of allocation can degrade the performance of the system, as they limit the concurrency of the calculations. Using the pprof profiling tools, we were quickly able to identify the main culprit.
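
For reference, the malloc and talloc figures in these tables come from Go's runtime memory statistics. A minimal sketch of capturing them around a unit of work might look like the following, where measure is an illustrative helper rather than our actual instrumentation:

```go
package main

import (
	"fmt"
	"runtime"
)

// measure reports the heap allocations (mallocs) and bytes allocated (tallocs)
// made while fn runs. Illustrative only; our real instrumentation wraps the
// query lifecycle rather than a single function call.
func measure(fn func()) (mallocs, tallocs uint64) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	fn()
	runtime.ReadMemStats(&after)
	return after.Mallocs - before.Mallocs, after.TotalAlloc - before.TotalAlloc
}

func main() {
	m, t := measure(func() { _ = make([]byte, 1<<20) })
	fmt.Printf("mallocs=%d, tallocs=%d bytes\n", m, t)
}
```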

Little-Endian Encoding

A common trick for high-performance Go programs is to use the unsafe package to convert from one slice type to another without moving the underlying data. For example, this method is used to convert a slice of bytes to a slice of two-byte shorts when reading out an array container. However, this is only safe if the underlying architecture is little-endian. Prior to version 0.5.4, the library was very pessimistic about the potential little-endianness of a given platform, so it would fall back to safer but allocation-heavy methods to read bitmaps off of disk. Treating ARM64 as little-endian fixed all of the allocation problems and further improved the speedups.
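
The trick looks roughly like the sketch below. The function name is ours for illustration, the cast is only correct on little-endian architectures such as amd64 and arm64, and unsafe.Slice requires Go 1.17+ (older code built the slice header by hand):

```go
package kamkedemo

import "unsafe"

// bytesAsUint16s reinterprets b as a []uint16 without copying the data.
// This is a sketch, not the roaring library's exact code, and it is only
// correct on little-endian architectures.
func bytesAsUint16s(b []byte) []uint16 {
	if len(b) < 2 {
		return nil
	}
	return unsafe.Slice((*uint16)(unsafe.Pointer(&b[0])), len(b)/2)
}
```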

| Query Type | CPU | MALLOCS | TALLOCS | WALL CLOCK |
|---|---|---|---|---|
| Custom Audience | 77.41% | 100.00% | 100.00% | 84.39% |
| Word Cloud | 74.91% | 100.00% | 100.00% | 76.12% |
| Postal Code | 67.42% | 100.00% | 100.03% | 72.73% |
| Cities | 72.12% | 100.00% | 99.98% | 72.51% |
| Demographics | 70.73% | 100.00% | 100.01% | 71.18% |

Assembly Algorithms

Achieving latency improvements of 15-30% just by switching machines and changing a couple of lines in our compile commands was pretty great, particularly since the cluster was 18% cheaper. But we still found that our code was spending a lot of time in a couple of specific functions, and one in particular was a bottleneck: union2by2, which reads two sorted short arrays and writes their union to a third slice. Despite our best efforts, we couldn’t improve on the existing implementation, at least not while writing standard Go.
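
For context, a scalar version of that merge in ordinary Go looks roughly like the following sketch (not the library's exact implementation):

```go
// union2by2 is a scalar sketch of the merge described above, not the
// library's exact code. a and b are sorted []uint16 slices without
// duplicates; buf must have room for len(a)+len(b) elements.
// It returns the number of shorts written.
func union2by2(a, b, buf []uint16) int {
	k1, k2, pos := 0, 0, 0
	for k1 < len(a) && k2 < len(b) {
		s1, s2 := a[k1], b[k2]
		switch {
		case s1 < s2:
			buf[pos] = s1
			k1++
		case s1 > s2:
			buf[pos] = s2
			k2++
		default: // s1 == s2: emit once, advance both inputs
			buf[pos] = s1
			k1++
			k2++
		}
		pos++
	}
	// One input is exhausted; copy the tail of the other.
	pos += copy(buf[pos:], a[k1:])
	pos += copy(buf[pos:], b[k2:])
	return pos
}
```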

Initially we were going to attempt to implement the vectorized algorithms described in Roaring Bitmaps: Implementation of an Optimized Software Library[8]. However, these algorithms had not yet been implemented in the ARM64 NEON instruction set anywhere, and it seemed prudent to first just reimplement the basic algorithm. To that end, we ported the algorithm over to ARM64 assembly instructions. Using `go build -gcflags -S`, we can see exactly where we get performance improvements. Most mundanely, we avoid various bounds checks and stack unwinding (this function will not fail cleanly). However, much of the improvement comes from two features of the ARM v8 instruction set: condition flags and post-increment addressing for load and store operations.

Condition Flags

While in normal code a comparison returns either true or false, on ARM v8 the CMP instruction sets four condition flag bits, which subsequent branch instructions (BEQ, BLS, etc.) can read. This means we can route to the three cases (x<y, x=y, x>y) off of a single comparison. The code generated by Go compilers prior to 1.16 made two CMP calls per loop iteration, although the latest version does use this functionality.

Post-Increment

If you look at the generic code, you’ll notice that as it moves down the two slices, it both increments the indices k1 and k2 and reassigns the values of s1 and s2. The assembly code, however, can avoid doing this by leveraging instructions like `MOVHU.P 2(R0), R6`, which moves the short at location R0 into R6 and increments R0 by 2 in a single instruction. By calculating the address of the final element of each list, we can test whether we’ve reached the end by comparing directly against the read pointer, rather than separately tracking how far along each slice we are.
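
To make these two features concrete, here is a small illustrative sketch in Go's ARM64 assembler syntax. It is a toy function, not the actual union2by2 port: it assumes a companion arm64-only Go file declaring `func classify(s []uint16, x uint16) (lt, eq, gt int)`, and it counts how many elements of a []uint16 fall below, at, or above a target using one post-incrementing load and one CMP per element:

```
#include "textflag.h"

// func classify(s []uint16, x uint16) (lt, eq, gt int)
// Illustrative sketch only: one MOVHU.P load and one CMP per element, with
// the condition flags routing to all three cases.
TEXT ·classify(SB), NOSPLIT, $0-56
	MOVD  s_base+0(FP), R0   // read pointer into the slice data
	MOVD  s_len+8(FP), R1    // elements remaining
	MOVHU x+24(FP), R2       // target value
	MOVD  ZR, R3             // lt count
	MOVD  ZR, R4             // eq count
	MOVD  ZR, R5             // gt count
	CBZ   R1, done
loop:
	MOVHU.P 2(R0), R6        // load the next short and advance R0 by 2 bytes
	CMP   R2, R6             // one comparison sets flags for all three branches
	BEQ   is_eq
	BHI   is_gt              // unsigned: element > target
	ADD   $1, R3, R3         // fall through: element < target
	B     next
is_eq:
	ADD   $1, R4, R4
	B     next
is_gt:
	ADD   $1, R5, R5
next:
	SUB   $1, R1, R1
	CBNZ  R1, loop
done:
	MOVD  R3, lt+32(FP)
	MOVD  R4, eq+40(FP)
	MOVD  R5, gt+48(FP)
	RET
```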

Minimizing Bounds Checks

Another benefit of the hand-written assembly is that it can eliminate bounds checks more aggressively than the Go compiler. However, Daniel Lemire[9], one of the creators of Roaring, ran an experiment to see whether getting the compiled code to avoid them would help and did not observe any measurable impact.

The PR introducing the assembly algorithms showed improvements of as much as 40% in the execution time of benchmarks that heavily leverage union2by2. Since union2by2 is only one part of our calculations, our improvements were less drastic, but still notable.

| Query Type | CPU | MALLOCS | TALLOCS | WALL CLOCK |
|---|---|---|---|---|
| Custom Audience | 71.43% | 100.00% | 99.98% | 79.09% |
| Word Cloud | 71.81% | 100.00% | 100.00% | 72.09% |
| Postal Code | 61.80% | 100.00% | 100.04% | 64.07% |
| Cities | 69.28% | 100.00% | 99.98% | 69.46% |
| Demographics | 68.73% | 100.00% | 100.01% | 69.86% |

High Performance ARM Is Still Evolving

It’s only been a little over a year since high performance ARM machines became generally available. As such, optimizing systems for execution on ARM is still a work in progress. All of our previous tests had been on Go 1.13, and we typically hadn’t seen performance improvements from version bumps, so we hadn’t been upgrading to new releases in a timely manner. However, AWS’s documentation indicated that Go 1.16 included some meaningful performance improvements on ARM. After a bit of fiddling to get our code compiling on it, we saw the benefits of these improvements.

| Query Type | CPU | MALLOCS | TALLOCS | WALL CLOCK |
|---|---|---|---|---|
| Custom Audience | 67.64% | 99.99% | 93.94% | 71.76% |
| Word Cloud | 70.34% | 100.00% | 91.64% | 69.84% |
| Postal Code | 59.55% | 100.00% | 87.54% | 61.82% |
| Cities | 67.94% | 100.00% | 90.91% | 67.28% |
| Demographics | 67.55% | 100.00% | 87.45% | 67.14% |

Conclusion

In summary, the Graviton 2 machines are the real deal. If you’re doing computation at scale, it’s worth your time to migrate your workloads over. However, it is a different architecture, and you’ll want to do your due diligence in ensuring that you’re getting all the benefits of cheaper, faster computers.

  1. Druid, Elasticsearch, Apache Pinot, among many others
  2. There also exist 64-bit extensions in several languages, but the implementation is much less standardized across languages.
  3. 2^16, the number of 16-bit integers (shorts)
  4. The amount of actual CPU core-seconds used across all threads, according to the proc filesystem. Kamke is designed to both fully leverage all the cores in a single query and handle multiple requests concurrently. The amount of CPU used by a query puts a hard constraint on the amount of concurrency supported, so reducing core-seconds helps us scale.
  5. The total number of allocations made during the query. A single Kamke query can make a billion memory allocations to the runtime, to the point where the volume of those requests determines the overall query performance. Go’s MemStats provides the capability to exactly track the number of allocations.
  6. The size (in bytes) of allocations to the heap during query execution. The frequency of garbage collection is determined by the size of the heap, and query performance suffers during garbage collection because the write barrier is turned on. 
  7. The overall time it takes for the query to run.
  8. Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O’Hara, François Saint-Jacques, Gregory Ssi-Yan-Kai. Roaring Bitmaps: Implementation of an Optimized Software Library. Software: Practice and Experience 48(4), 2018: https://arxiv.org/abs/1709.07821
  9. Daniel Lemire, a professor of computer science, is one of the creators of Roaring bitmaps and the primary maintainer for several of the Roaring implementations on GitHub, including the Go repository that we use. He’s been kind enough to review most of the code discussed here and provide valuable advice on some of the scaling issues we ran into.