Probabilistic Counting

Probabilistic data structures are one of the most overlooked topics in computer science: Bloom filters, HyperLogLog counters, approximate nearest neighbors and random projections should be as familiar to programmers as sorting algorithms, hashtables or heaps. It’s a field that can make a real difference, in many cases, in terms of performance and resource savings.

Maybe the “probability” part frightens programmers, who are often good at logic but not so much at mathematics. You shouldn’t be afraid of these algorithms: they are not that hard to understand, and a good intuition helps a lot. A solid understanding of random number generation, hashing and collisions goes a long way.

In this article I’ll write about how to count events drawn from a very large vocabulary, i.e. how to store a lot of counts. We’ll start with the Count-Min Sketch, one of the early probabilistic structures, then move on to the logarithmic counting scheme known as Morris counters, and finally to the approaches that led us to our own proposal, the Count-Min Tree Sketch.

The straightforward way of counting

First, a little bit of popular science: in computer science, what is the way to go if you want to store counts? For instance, if you want to know how many times posts have been liked, or how many times hashtags have been tweeted during a day?

The structure that one would normally use is a (key,value) collection or associative array, where the key contains the label of the thing being counted, and the value contains the number itself. To be able to actually count, you need a collection that can be modified, with the ability to access the value in constant or logarithmic time. To do that, you would use a hashtable or a tree map.
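To make this concrete, here is a minimal Python sketch of the exact approach, using a plain dictionary as the associative array (the helper names are mine, not part of any library):

    from collections import defaultdict

    # Exact counting with an associative array: one integer per distinct key.
    counts = defaultdict(int)

    def update(item, value=1):
        counts[item] += value

    def query(item):
        return counts[item]

    for tag in ["#cat", "#dog", "#cat"]:
        update(tag)

    print(query("#cat"))   # 2
    print(query("#bird"))  # 0 (never seen)

The problem, of course, is that this stores one full key and one full counter per distinct item, which is exactly what the structures below try to avoid.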

Counting approximately with Count-Min Sketches

In 2005, a paper by Cormode & Muthukrishnan described the Count-Min Sketch (CMS), a probabilistic data structure, with some inspiration taken from the Bloom filter (and its counting variant), that can very efficiently store counts associated with items. The main difference between the Bloom approach and the CMS approach is that the CMS separates the memory into several layers and uses a different hash for each layer, while the Bloom approach uses different hashes within a single memory segment.

There are two operations associated with a CMS: 

  • Update(item,value) and 
  • Query(item).

Update() adds a value to the count stored for the item, while Query() returns an estimate of the count for that item.

The great idea behind the CMS is that, even if there are some collisions (and there necessarily will be), one can minimize the estimation error by keeping only the MIN value stored in the cells corresponding to an item. Indeed, all the other cells necessarily overestimate the real count.

The illustration above shows a CMS with 3 layers of 20 cells. It has been updated by counting two items: X and Y. X has been seen 2 times, Y 3 times. Because the hashes of X and Y are the same for layer 0, there is a collision, and thus the cell contains the sum of the X and Y counts. When querying the CMS for the count of X, the values extracted for X are (5,2,2). Thanks to the MIN rule, the CMS returns only the minimum value, and consequently 2 is correctly returned.
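To make the mechanics concrete, here is a minimal, illustrative Python sketch of a CMS with the MIN rule. The hashing is deliberately naive (Python’s built-in hash salted per layer); a real implementation would use a proper hash family such as MurmurHash.

    import random

    class CountMinSketch:
        def __init__(self, depth=3, width=20, seed=42):
            self.depth, self.width = depth, width
            rng = random.Random(seed)
            # One independent hash seed per layer.
            self.seeds = [rng.getrandbits(64) for _ in range(depth)]
            self.cells = [[0] * width for _ in range(depth)]

        def _index(self, item, layer):
            # Toy per-layer hash; replace with a real hash family in practice.
            return hash((self.seeds[layer], item)) % self.width

        def update(self, item, value=1):
            for layer in range(self.depth):
                self.cells[layer][self._index(item, layer)] += value

        def query(self, item):
            # MIN rule: every cell overestimates, so keep the smallest one.
            return min(self.cells[layer][self._index(item, layer)]
                       for layer in range(self.depth))

    cms = CountMinSketch(depth=3, width=20)
    cms.update("X", 2)
    cms.update("Y", 3)
    print(cms.query("X"))  # 2, unless X collides with Y in every layer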

The CMS can support positive and negative updates. If we add the constraint of positive-only updates, we can use the Conservative Update variant. In this variant, only the cells which are already at the minimum value are incremented (at least if we only do increments of one; otherwise the rule is slightly more complicated).
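As an illustration, here is one common way to write the general conservative update rule on the toy CountMinSketch above: raise each touched cell only as far as the current estimate plus the increment. This is a sketch, not the exact code used in our experiments.

    def conservative_update(cms, item, value=1):
        # Conservative update (positive increments only).
        indexes = [cms._index(item, layer) for layer in range(cms.depth)]
        estimate = min(cms.cells[layer][i] for layer, i in enumerate(indexes))
        target = estimate + value
        for layer, i in enumerate(indexes):
            # Cells already above the target (overestimating) are left alone.
            cms.cells[layer][i] = max(cms.cells[layer][i], target)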

Evaluation

The following graphs show the estimation error for the original CMS and the CMS with conservative update (CMS-CU) both with 3 layers, with size varying from 12.5% to 800% of the perfect size (the size of an array containing just the counts, which would be doable with a perfect hash function).

The first graph shows the absolute error (RMSE) after counting a million words of a Wikipedia dump, plus bigrams (i.e., for the sentence “the little cat”, we actually count “the”, “little”, “cat”, “the little” and “little cat”). These million words end up creating a hashtable with 523563 entries (30MiB based on our estimation). The perfect size is computed with 32-bit counters, totaling 2MiB. Consequently, the storage required for this hashtable is 1500% of the perfect size.

In this graph, the middle vertical line shows the perfect size. Measuring absolute error is not a neutral choice, because most of the values when counting unigrams and bigrams are at one (1), since a lot of bigrams are seen only once. When measuring absolute error, estimating 101 instead of 1 counts the same as estimating 100100 instead of 100000 (and you may agree that it’s actually not the same kind of error).

So we also need to measure the average relative error:

It’s close to the first graph, but it doesn’t mean the same thing AT ALL! At the perfect size, a CMS has an average relative error of 1 (100%), which means that, on average, instead of 1 it will estimate 2, instead of 100 it will estimate 200, etc. For a relative error of 1% (i.e. probably enough for any practical purpose), you need 800% of the perfect size. That’s half the memory requirement of a hashtable, which is great. But is it good enough?
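As a quick reference, this is roughly how the two error measures behind these graphs are computed from pairs of (true, estimated) counts; the exact averaging used for the plots may differ slightly.

    import math

    def rmse(pairs):
        # Root mean squared error over (true, estimated) count pairs.
        return math.sqrt(sum((est - true) ** 2 for true, est in pairs) / len(pairs))

    def average_relative_error(pairs):
        # Relative error weights a miss on a rare item as much as on a frequent one.
        return sum(abs(est - true) / true for true, est in pairs) / len(pairs)

    pairs = [(1, 2), (100, 200)]
    print(rmse(pairs))                    # ~70.7: dominated by the large count
    print(average_relative_error(pairs))  # 1.0: "twice the true value on average"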

In 2014, we started to work on a streamed version of our text mining tool, eXenGine. As a consequence we took an interest in the Count-Min Sketch, but quickly became disappointed by the size gain: a Count-Min Sketch can approximate values with a relative error of 1% at half the size of a hashtable, which is interesting, but not a game changer.

Counting approximately, approximately

It wasn’t good enough for us. Given the number of items that we wanted to count, halving the memory requirement was not a good enough target. So we started to ponder on better systems, with two objectives in mind: 

  1. the error measurement should be the Average Relative Error, because we will either use the estimation for cut-off (ignoring very rare events), or for computing the Pointwise Mutual Information or the TF-IDF, both of which take a logarithm of the counts at some point.
  2. the distribution of the values is highly skewed (text data usually follow the Zipf Law), so storing small values should be super-efficient.

At first (in 2014), we came up with the idea of using approximate counters (also known as Morris counters, analyzed by the great Flajolet, the same scientist who gave us HyperLogLog). It seemed really strange to me at the time that nobody had applied this idea before. Turns out (as often) that I hadn’t searched hard enough: David Talbot already had the idea in 2009. Unaware of this, we published a paper (Guillaume Pitel and Geoffroy Fouquier) about the idea. At least, it inspired Seif Lotfy to work on various implementations of CMS. Our method is significantly different from Talbot and Osborne’s, since they basically store a unary chain approximating the log-count in a Bloom filter (see below). We haven’t tried their method because it seemed to me overly penalizing in terms of memory access.

What we proposed (and what David Talbot proposed before us) was to use smaller counters (let’s say 8 bits instead of 32) which would approximate the real value. To do that, the key is to allow a probabilistic increment. For instance, to count approximately in base 2, you would have a 50% chance to increment from 1 to 2, 25% to increment from 2 to 3, etc. With this rule, the approximated value is (almost) 2^x, so counting up to 2^32 with a base-2 approximate counter only requires 5 bits (because you can count up to 32 with 5 bits). We gave it the name “Count-Min Log” because it uses logarithmic counters.
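Here is a minimal sketch of such an approximate (Morris) counter, with the probabilistic increment rule and the usual unbiased estimate; the class and parameter names are mine.

    import random

    class MorrisCounter:
        """Approximate (Morris) counter: stores a small exponent, not the count."""
        def __init__(self, base=2.0):
            self.base = base
            self.exponent = 0  # small integer, e.g. fits in 8 bits

        def increment(self, rng=random):
            # Probability of moving from exponent x to x+1 is base**(-x),
            # so the stored exponent grows logarithmically with the true count.
            if rng.random() < self.base ** (-self.exponent):
                self.exponent += 1

        def estimate(self):
            # Standard unbiased estimate of the number of increments:
            # (base**x - 1) / (base - 1), i.e. (almost) 2**x for base 2.
            return (self.base ** self.exponent - 1) / (self.base - 1)

    c = MorrisCounter(base=1.08)   # with base 1.08, 8-bit exponents cover huge counts
    for _ in range(100000):
        c.increment()
    print(round(c.estimate()))     # roughly 100000, with some relative error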

Obviously, with a base-2 counter, the error due to the approximate counting is quite big. With 8-bit counters, however, things become good enough to be usable.

Just have a look at the graph below: it shows the Average Relative Error of the CMS, the Count-Min Log with 8 bits (CMLS8), and the Count-Min Log with 16 bits (CMLS16), all with the Conservative Update rule. Notice that CMLS8 stops improving at some point; this is due to the precision limit of the 8-bit approximate counters. CMLS16 also stops improving at some point, but it’s not shown on the graph.

That should make us happy: we’ve halved the memory requirement compared to the CMS. But do not rush to implement the Count-Min Log sketch just yet; let’s have a look at the simple RMSE:

This is a disaster! Because of the errors on the highest values, the average absolute error is enormous and doesn’t improve with the sketch size. However, there’s a reason why we are mostly interested in the ARE: in text mining, we mostly use the logarithm of the counts, so we don’t really care about the RMSE. Instead, let’s have a look at the estimation error on the PMI; for each bigram of the corpus, let’s compute the PMI and its absolute error with respect to the real PMI:
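For readers who want the formula, here is the usual way to compute the PMI of a bigram from raw (or sketched) counts; the numbers in the example are purely hypothetical.

    import math

    def pmi(count_x, count_y, count_xy, total_tokens):
        # Pointwise Mutual Information of a bigram (x, y):
        #   log2( p(x, y) / (p(x) * p(y)) )
        # computed from counts; with a sketch, each count is replaced by its estimate.
        return math.log2((count_xy * total_tokens) / (count_x * count_y))

    # Hypothetical numbers, just to show the shape of the computation:
    print(pmi(count_x=5000, count_y=3000, count_xy=400, total_tokens=1_000_000))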

To give you a good grasp of the meaning of the PMI, here are the histograms of the PMI estimated on a smaller corpus with the reference, the CMS, and the CMLS8:

As you can see, the PMI is typically between 0 and 10, and centered around 4 or 5. So an estimate with an error of 0.1 on the PMI would be close to the real value.

Using logarithmic counters allowed us to improve significantly on our objective, but still we felt unsatisfied because we couldn’t take into account the particularly high skewness of our data.

From Count-Min Sketch to Count-Min Tree Sketch

We’ve seen previously that using approximate (logarithmic) counters improves the relative error of the CMS, and the absolute error of the Pointwise Mutual Information. Why? Mostly because we trade precision (which we don’t care that much about) for size, so that without losing much on the error, we can double or quadruple the number of counters.

Why can we do that? Is it really because our counts follow a Zipf law? The answer is both yes and no. The true answer is that we can trade relative precision (i.e. good precision for small values and low precision for high values) because we care about the PMI, which takes the logarithm of the frequency. This, in turn, is directly caused by the fact that word counts follow the Zipf law.

But fundamentally, with logarithmic counters, we didn’t take direct advantage of the counts’ distribution by allocating fewer bits to low values. We were still allocating the same number of bits for every event; these bits were simply used in a different way.

Huffman? Not really

Some of you, readers, probably thought about Huffman coding, but really it’s the opposite. In Huffman coding, a very frequent code, for instance the bit sequence 0b01100101 (ASCII code for “e”), may be coded with 3 bits because it frequently occurs, while 0b01011010 (“Z”) may be coded with 10 bits because it is much less frequent.

We, however, are not compressing bit streams: instead we count occurrences. Consequently the most frequent events will need many bits for counting, while many low-frequency items will only need a few bits.

Talbot & Osborne

Let’s first look back at the Talbot & Osborne idea. It basically consists in storing a chain of ones that represents the log-count of an item in a unary (base-1) representation.

Let’s say you want to count the item “the”. You have a family of independent hash functions h_i(x) that give you a position in a bit array. You first look up the bit at h_1(“the”); if it’s one, you look up the bit at h_2(“the”), and so on, until you encounter a 0. Finally, the last i gives you the log-value of your Morris counter.
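A minimal sketch of that lookup could look like this, assuming a bit array bits (a Python list of 0/1) and an ad-hoc hash family; this is only an illustration of the idea, not Talbot & Osborne’s actual code.

    import hashlib

    def h(item, i, nbits):
        # i-th hash of the item, mapped to a position in a bit array of size nbits.
        digest = hashlib.blake2b(f"{i}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % nbits

    def unary_query(bits, item, max_chain=11):
        # Walk h_1(item), h_2(item), ... until a 0 bit: the chain length
        # is the (Morris) log-count of the item.
        log_count = 0
        for i in range(1, max_chain + 1):
            if not bits[h(item, i, len(bits))]:
                break
            log_count = i
        return log_count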

In order to avoid false positives, one could also use several hash functions for each i, in exchange for an increase in memory usage.

One (huge) drawback of the method: suppose you let the unary chains grow up to 11 bits; the maximum value will be 2^11, i.e. 2048. As a consequence, reading this value would require eleven accesses in memory. That’s a huge cost, at least if you consider this solution for online counting (i.e. not just storing the values), because it will especially affect high-frequency items.

The solution Talbot proposes to avoid this consists in probing the unary chain of bits probabilistically. Basically, the idea is that you don’t have to keep looking for the next bit if you’ve already given up on the idea of doing the increment. And the thing is, the longer your bit chain, the less likely you are to extend it. The bound they propose for this solution is however not so sexy.

TOMB counters: Van Durme & Lall

The TOMB counters combine the Talbot & Osborne idea, the Morris counters and the Bloom counter approach. 

Instead of storing the value in a base-1 representation (i.e. a chain of ones terminated by a zero), it uses several levels of storage, each having a small Morris counter.

The idea is that you start by updating one of the Morris counter cells at the bottom layer. When the cell reaches its maximum value, you move to the next layer. Computing the stored value of an item consists in summing the values of each layer up to the first layer with a non-maximum value. While this improves significantly, in many ways, on the Talbot and Osborne idea, there are several possible improvements left to be made: the “length” of the value is encoded in a suboptimal way.
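A rough sketch of the corresponding query, assuming one array per layer, a per-cell maximum counter_max, and a hypothetical hash_fn(item, layer, width) helper (none of these names come from the TOMB paper):

    def tomb_query(layers, item, counter_max, hash_fn):
        # Sum the small Morris counters layer by layer, stopping at the
        # first layer whose cell is not saturated.
        total = 0
        for layer_id, layer in enumerate(layers):
            cell = layer[hash_fn(item, layer_id, len(layer))]
            total += cell
            if cell < counter_max:
                break
        return total  # an approximate log-count, as in a Morris counter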

Count-Min Tree Sketches

In 2016, we (Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand and Abdul Muhamadsultane) designed and implemented the Count-Min Tree Sketch (CMTS), a tree structure mixing counting bits and barrier bits. The objective that motivated this work was to have a very efficient way to count Zipfian data, where low frequency items vastly outnumber high frequency items. The main characteristics of the CMTS are:

  • The bits coding the length of a counter are separated from the bits of the counter
  • The length of a counter is encoded in base-1
  • Both the length storage and the counter storage are “shared”

In summary, we have two kinds of bits: counting bits and barrier bits for coding the length.

An example of CMTS with 4 layers and a spire.

As illustrated in the figure above, a counter starts at the bottom with a counting bit. If its first barrier bit equals one, then the next counting bit is also considered; otherwise the counter just uses one counting bit. The counter stops at the bit right after the last contiguously activated barrier. A barrier bit, once set to 1, can never be changed. At the very top, after the last barrier bit, a spire holds the remaining counting bits, mostly for performance reasons: our goal was to store a “super-cell” in a typical CPU cache line of 64 bytes (512 bits), allowing for a 128-bit base. Ideally a super-cell takes 2*128+2*64+… = 510 bits with only 8 levels, allowing values of up to 765 to be stored in the base tree itself. The spire is stored in another memory segment to encode values above this threshold. With this setting, we obtain 128 counters for 512+32 bits, i.e. 4.25 bits per counter on average for a single layer.

The general algorithm for updating and querying the sketch is unchanged compared to a classical Count-Min Sketch. However, given the complexity of the shared-bits structure, the algorithms for getting the value, incrementing, or merging the counters in the case of distributed computing are more complex than for a Count-Min Sketch. Obtaining a competitive implementation has been a major challenge.

Getting the value of counters in a CMTS with 4 layers and a 4 bits spire.

The principle of getting the value of a CMTS’s counter is the following, and is illustrated in the figure above:

  1. The binary values of the barrier are gathered from the bottom cell of the counter, up to the first zero barrier, giving the barrier length b. If there are 2 barrier bits contiguously set (like counter 0 in Figure 2), then b = 2.
  2. (b + 1) value bits are gathered into the counting value c. If the bottom value bit is 0 and the two others are 1, then c = 110b = 6.
  3. The real value can finally be computed: c + 2 × (2^b − 1) = 12.
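In code, the decoding step boils down to a one-liner; this is a simplified sketch that ignores the shared-bit layout and the spire.

    def cmts_decode(b, c):
        # b: number of contiguously set barrier bits (counter "length")
        # c: integer read from the b + 1 counting bits
        return c + 2 * (2 ** b - 1)

    # Worked example from the text: 2 barriers set, counting bits 110b = 6.
    print(cmts_decode(2, 0b110))  # 12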

Incrementing can be implemented by getting the real value, incrementing it, then putting the result back into the CMTS, by setting the correct barrier and counting bits.

The principle of setting the value of a CMTS’s counter is the following:

  1. The new barrier length nb is computed from the post-increment value nv using the following formula: nb = min(nb_layers, lsb((nv + 2) / 4)), where lsb(x) gives the bit length of x, i.e. the position of the highest set bit plus one. With the previous example, the new value of counter 0 is 13, nb_layers is 4 and lsb((13 + 2) / 4) = 2. Hence, nb = 2.
  2. The new counting value is then computed as nc = nv − 2 × (2^nb − 1). Here nc = 7, i.e. 111b.
  3. Finally, the new barrier and the new value are set: nb bits are set to one for the barrier, and nb + 1 bits are set for the value using the new counting value. In our example the number of barrier bits set to one remains the same (2), and nb + 1 = 3 bits are set for the value using the new counting value nc = 111b, thus the first three counting bits are set to 1. In this case only the first one changes (from 0 to 1).

The same basic get and set algorithms can be used for merging counters, even though many errors can be avoided by taking into account the possible overflows that can happen due to the conflicts in the tree structure.

Evaluation

We have verified this hypothesis empirically in the following setting: we count unigrams and bigrams of 140 million words of the English Wikipedia corpus. The corpus analyzed contains 14.7 million distinct tokens.

In the following, the ideal perfect count storage size corresponds, for a given number of elements, to the minimal amount of memory needed to store them perfectly, in an ideal setting. A high-pressure setting is one where the memory footprint is lower than the ideal perfect count storage size for the same number of elements. For our experiment, this size is 59 MiB.

All sketches use an implementation which is faster than a native dictionary structure (we use the C++ STL std::unordered_map). The memory footprint of this structure is approximately 815 MiB, i.e. 1380% of the ideal perfect count storage size.

We compare the estimates of four sketches (all use the conservative update mode):

  • CMS-CU is the classical linear Count-Min Sketch.
  • CMLS16-CU is the Count-Min Log Sketch using a logarithmic base of 1.00025 and 16-bit counters.
  • CMLS8-CU is the Count-Min Log Sketch using a logarithmic base of 1.08 and 8-bit counters.
  • CMTS-CU is the Count-Min Tree Sketch. Parameters are: 128-bit base, 32-bit spire.

First we compare the Average Relative Error:

Then we compare the RMSE on the Pointwise Mutual Information: we use the same stored values, i.e. counts of unigrams and bigrams, and compute the PMI of each bigram, which uses three values: count(x), count(y) and count(x_y).

One can see that with a “reasonable” size around the perfect size, the CMTS vastly outperforms the other sketches. Its error does however tend to grow faster than that of the other sketches under very high “pressure”, but since we’re only interested in settings where the ARE is below 1%, such high-pressure regimes are not acceptable anyway.

What’s next?

An important difference between Count-Min sketches and regular data structures is that you have to set their size once and for all. If you mispredict the target vocabulary size, you can be in serious trouble, as error rates grow with the pressure.

We usually have a good predictor of the vocabulary size based on the size of the corpus we are counting, so for our use case it’s not really a problem. But we would like to eventually publish a super robust version of the CMTS that would automatically adapt the data structure as more data gets ingested.

We have already explored two main directions:

  • Adaptive subsampling: when a cell or a sketch is under pressure, divide the counts by two and start counting only at half rate (then ¼ rate and so on). Dividing the counts is however challenging.
  • Adaptive growth: when a sketch is under pressure, create a sketch with twice the size, and on subsequent updates, initialize the counts in the new sketch the first time you update an item.

Let’s hope we’ll find some time for it.