Join me below the fold for some bit-counting and fun with variable length coding.
From the exercises, we had a string like "muddle-the-middle-muddled-mud", and we used the method in class to come up with a compressed message of:
(4,1)(3,1)(11,1)(1,1)(9,1)(13,1)(7,1)(10,1)(15,1)(4,2)(11,1)(18,1)(10,1)(1,1)(11,3)(7,1)(18,5)(3,1)(7,4)
How have we done in compressing our original message? Well, let's observe that the original message was roughly 30B in size (29 characters, plus the end-of-string character).
How big is the compressed message? Well, we have 19 pairs of integers - how big is that? It depends on our representation method. If we're using 32-bit unsigned integers, the message is 19*2*4B=152B.
But wait! It's even worse than that. Why? Well, along with our message, we also needed to send our dictionary as raw text. If the dictionary contained 9 characters (as we said in the workshop question), plus a sentinel end-of-string to indicate that we're done sending the dictionary, this is another 10B that we need to send as part of our compressed message.
Using 32-bit unsigned integers is a terrible idea, which is obvious because the message takes up far more space than the original uncompressed text. But why is it terrible? Well, our 32-bit unsigned integer allows us to represent any number between 0 and 4294967295. But the actual values that we want to store are all between 1 and 18! That's clearly a lot of space that we're wasting.
How many bits do we actually need to represent a number between 1 and 18? 5 bits. (1 is 00001, 2 is 00010, 3 is 00011, ... 18 is 10010.) If we send each pair of values as 2*5b=1.25B, then our message is now 19*1.25=23.75B. But now we also need to send another part of the message: the number of bits per value, which indicates to the listener how wide each codeword (pair of values) in the message is. (We don't know this before we start compressing, so neither does the decoder.) If this number is sent as an 8-bit unsigned integer (presuming that our codewords will never be longer than 255 bits in length ... that seems like a pretty reasonable assumption to me!), then this strategy requires 23.75B (for the message) + 10B (for the dictionary) + 1B (for the codeword width) = 34.75B. That's still longer than the original message! Hmm...
Well, we can do better than 5b per value. How? Well, observe that the values come in pairs: the first number of each pair ranges from 1 to 18, and will need 5b; the second number only runs from 1 to 5, so we need just 3 bits to transmit it. (1 is 001, ... 5 is 101.) And now we also need to send two 8-bit unsigned ints to tell the decoder how many bits each value in the codeword pair will take up. So, the message is 19*(5b+3b)=19B; the dictionary is 10B; we need another 2B for the codeword widths... 31B total. Still longer than the original message! Argh!
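(As an aside, 5b + 3b is exactly one byte, so each pair packs neatly into a single byte. Here's a minimal sketch of that packing in Python - the function name and layout are my own invention, just to make the arithmetic concrete.)

```python
def pack_pairs(pairs):
    """Pack (offset, length) pairs, with offset in 1..31 and length in 1..7,
    into one byte each: offset in the high 5 bits, length in the low 3 bits."""
    out = bytearray()
    for offset, length in pairs:
        assert 1 <= offset < 2**5 and 1 <= length < 2**3
        out.append((offset << 3) | length)
    return bytes(out)

pairs = [(4,1), (3,1), (11,1), (1,1), (9,1), (13,1), (7,1), (10,1), (15,1),
         (4,2), (11,1), (18,1), (10,1), (1,1), (11,3), (7,1), (18,5), (3,1), (7,4)]
print(len(pack_pairs(pairs)))   # 19 bytes for the 19 pairs
```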
Why can't we compress the original message? The main problem: it's too short. Observe that the dictionary (10B) is fully one-third the size of the entire original message (30B) - that's going to be a difficult hurdle to overcome. If the string was much longer, say 100KB, the size of the dictionary would probably only increase to roughly 256B, and make up a much smaller proportion of the total message.
The other nice property about longer strings is that, as we go along, we need to read back into the dictionary less and less often (because we've seen a given character before), and the likelihood of longer matches from the characters before increases. (Observe that 15 of the 19 pairs are single-character matches, including the first 9 - which are almost all dictionary reads. Even for this contrived string!) The reason that we get longer matches as we go along is that natural language is self-similar[citation needed] and that the distribution of characters that can follow any given character (or sequence of characters) is quite peaky. For example, in English text, when we see the letter "q" it is very frequently followed by a "u". (But there are counter-examples! Although, in a document about "Qantas" or "Compaq", even those unusual sequences of letters are going to occur very frequently, and therefore be amenable to compression.)
There's one more very important notion that we use in real-life coding that will also make life easier: the notion of variable-length coding.
First of all, observe that, with the {5-bit, 3-bit} strategy I used above, we require a complete pass over the output to determine what the size of the codewords is going to be before we can start outputting the codes. This two-pass strategy is very undesirable in real applications. (In fact, we've already defined a two-pass strategy - one pass to find the characters in our alphabet, and one to code - so this is actually a three-pass strategy, which is woeful. LZ implementations in the real world are actually one-pass, and they build up the dictionary as they go.)
One-, two-, and three-pass algorithms aside, let's think about a good way to compress this data. What's the most frequent value? It's obviously 1 - that occurs much more often than any other value. (Over longer documents, the next most common is typically 2, then 3, etc. Our document is too short to see this though.) So, let's think of a coding scheme that uses fewer bits for smaller (common) numbers, and more bits for larger (and hopefully rare!) numbers.
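(If you don't believe that 1 dominates, here's a quick sanity check over the message above; plain Python, nothing fancy.)

```python
from collections import Counter

pairs = [(4,1), (3,1), (11,1), (1,1), (9,1), (13,1), (7,1), (10,1), (15,1),
         (4,2), (11,1), (18,1), (10,1), (1,1), (11,3), (7,1), (18,5), (3,1), (7,4)]

# Count every value that appears in the message (both halves of each pair).
counts = Counter(v for pair in pairs for v in pair)
print(counts[1], "copies of the value 1, out of", sum(counts.values()), "values")
# 17 copies of the value 1, out of 38 values; no other value appears more than 3 times
```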
A good place to start here is Unary coding. What is unary? Well, it's the oldest coding system of them all. Imagine that I have 5 or fewer apples, and I wanted to indicate to someone - with whom I don't share a language - how many apples I had. What will I do? Well, I'd simply hold up one finger per apple: one finger for one apple, two fingers for two apples, and so on.
I've just described Unary coding. What's the codeword for the value 0? 0. For 1? 10. For 2? 110. For 3? 1110. And so on. (There are a few variations of Unary coding; this isn't the only one.) Basically, to encode a number n, I write n "1"s, followed by a terminating "0". So 10 becomes 11111111110.
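(In code, this is about as simple as a coder gets - a one-liner in Python, with a name I've made up.)

```python
def unary(n):
    """Unary code as described above: n ones followed by a terminating zero."""
    return "1" * n + "0"

for n in (0, 1, 2, 3, 10, 18):
    print(n, unary(n))   # 0 -> "0", 1 -> "10", 2 -> "110", ..., 18 -> a 19-bit string
```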
It turns out that this is also a bad code to use for our LZ message, because, even though there are lots of "1"s, there are a few larger numbers too, e.g. "18" - and the fact that this takes 19 bits (18 ones plus the terminating zero) means that I lose out on most of my compression from all those "1"s.
So, what are we going to do? Well, it turns out that there is a very effective alternative called Elias-gamma coding, developed in the 70s for a problem similar to this one. How does Elias-gamma work? Well, it represents a number as two parts: an exponent part (to the base 2), and a mantissa part.
For example, say I have the number 25. This is equal to 2^4+9 (there are other ways to split it, of course, but 2^4 is the largest power of two that is at most 25). 65? 2^6+1. 64? 2^6+0. 63? 2^5+31. And so on. The exponent is always pretty small, so I'm going to encode it in unary. The mantissa is between 0 and 2^e-1 (where e is the corresponding exponent value), so I'm going to encode that as e-bit binary.
Let's look at 25=2^4+9 again. I'm going to represent the 4 (in the exponent) as 11110 and the 9 (in the mantissa) as 1001 (in 4-bit binary). Overall, the codeword for 25 is 111101001.
65=2^6+1: 1111110000001.
64=2^6+0: 1111110000000.
63=2^5+31: 11111011111.
Observe that larger numbers need more bits to represent, but it doesn't go up very quickly. How about small numbers?
1=2^0+0: 0
2=2^1+0: 100
3=2^1+1: 101
4=2^2+0: 11000
These increase in bits pretty fast, but it does have the nice property that I can code my "1"s - the most common value - in just 1 bit!
(Note also that we can represent any number in Elias-gamma. 1234567? 11111111111111111111000101101011010000111. It's long, but we can do it. This means that we don't need an extra pass through the data to determine the largest value we want to encode.)
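(Here's a small sketch of this coder in Python - my own naming and phrasing of the variant described above, so treat it as illustrative rather than canonical. It reproduces the codewords listed above.)

```python
def elias_gamma(n):
    """Gamma-style code as described above: the exponent e (the largest e with
    2**e <= n) in unary (e ones then a zero), followed by the mantissa
    n - 2**e as e-bit binary. Defined for any n >= 1."""
    assert n >= 1
    e = n.bit_length() - 1                        # exponent part
    mantissa = n - (1 << e)                       # 0 <= mantissa < 2**e
    binary = format(mantissa, "0{}b".format(e)) if e else ""
    return "1" * e + "0" + binary

for n in (1, 2, 3, 4, 25, 63, 64, 65, 1234567):
    print(n, elias_gamma(n))   # matches the codewords worked out above
```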
Let's take another look at our LZ-coded message:
(4,1)(3,1)(11,1)(1,1)(9,1)(13,1)(7,1)(10,1)(15,1)(4,2)(11,1)(18,1)(10,1)(1,1)(11,3)(7,1)(18,5)(3,1)(7,4)
How many bits does the 4 take to encode in Elias-gamma? 5. 1? 1. 3? 3. etc. Let's count up how much space (in bits) we need to represent the message now:
(5+1)+(3+1)+(7+1)+(1+1)+(7+1)+(7+1)+(5+1)+(7+1)+(7+1)+(5+3)+(7+1)+(9+1)+(7+1)+(1+1)+(7+3)+(5+1)+(9+5)+(3+1)+(5+5)
(I've added parentheses to make the pairs clearer.) In total, this is 138b=17.25B. Plus the dictionary (10B), this is 27.25B in total. ... We've done it! We've compressed the string by 22 precious bits!
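(And, to check the arithmetic, here's the same count done with the elias_gamma sketch from above.)

```python
pairs = [(4,1), (3,1), (11,1), (1,1), (9,1), (13,1), (7,1), (10,1), (15,1),
         (4,2), (11,1), (18,1), (10,1), (1,1), (11,3), (7,1), (18,5), (3,1), (7,4)]

# Sum the codeword lengths for both halves of every pair.
total_bits = sum(len(elias_gamma(a)) + len(elias_gamma(b)) for a, b in pairs)
print(total_bits, total_bits / 8)   # 138 bits, i.e. 17.25 bytes
```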