WTF is Middle Out Encoding?

Chris Zhu
4 min read · Jul 15, 2017


If you’ve ever seen HBO’s Silicon Valley, you’ve probably seen that scene. But what exactly is middle out encoding?

I don’t know. It turns out it doesn’t exist. But let’s talk about a real compression algorithm that was briefly mentioned in that scene: bottom up encoding.

All the code for this blog example can be found here.

Bottom Up — Huffman Encoding

Let’s build bottom up encoding from scratch. First, let’s explain how the encoding works. Say we have a string like:

bbaadaacaab

Then we can build a table of frequency counts:

a: 6
b: 3
c: 1
d: 1

From this frequency count, we want to build this tree step by step:

Step 1: Combine the two smallest nodes, c and d, and create a new node c_d with the added counts of c and d.

Step 2: Combine the two smallest nodes from the previous iteration, b and c_d, and create a new node b_c_d.

Step 3: Combine the two smallest nodes from the previous iteration, a and b_c_d, and create a new node a_b_c_d. Now that only one root node exists, the encoding is complete.

Once we have this tree, we are done. All that’s left is to assign, on the way down, a 0 to each branch on the left and a 1 to each branch on the right, like so:

        a_b_c_d (11)
       0/         \1
     a (6)     b_c_d (5)
              0/       \1
            b (3)     c_d (2)
                     0/     \1
                   c (1)   d (1)

This is why it’s called bottom up encoding: we build the tree from the leaf nodes at the bottom up to the root node.

From this tree, we can get the encoding table by following the path to each leaf node. For example, to obtain the binary encoding for c, we walk down the tree right, right, left, reading off 1, 1, 0.

Therefore, the binary encoding map is:

a: 0
b: 10
c: 110
d: 111

A few things are worth noting about this compression. As you can see, the Huffman tree we built assigns shorter bit encodings to characters that appear more frequently, like a and b, while characters that appear less frequently, like c and d, are assigned longer encodings.

Now, if we encode bbaadaacaab with the table above:

bbaadaacaab = 101000111001100010

The original string contained 11 characters, which is 88 bits in size (1 char = 1 byte). Our encoding produces just 18 bits in total, a reduction of roughly 79%.
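The arithmetic above can be double-checked in a few lines of Python, using the encoding table we just derived:

```python
# Encoding table from the Huffman tree above.
table = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Encode each character and concatenate the bit strings.
encoded = "".join(table[ch] for ch in "bbaadaacaab")

print(encoded)                 # 101000111001100010
print(len("bbaadaacaab") * 8)  # 88 bits at one byte per character
print(len(encoded))            # 18 bits encoded
```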

Proof

I wrote a separate post on just the proof of the Huffman encoding. If you don’t care about the proof, feel free to skip ahead to the implementation.

Find that here.

Implementation

Now that we understand how bottom up encoding works, let’s implement it in Python.

Let’s use the text of Hamlet as an example input, which can be found at http://erdani.com/tdpl/hamlet.txt. Once we read in the text, we want to count every character in it.
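The counting code itself lives in the linked repo rather than inline here; a minimal sketch with collections.Counter looks like this (shown on the short example string — for Hamlet you would read the downloaded file instead):

```python
from collections import Counter

# For the real input: text = open("hamlet.txt", encoding="utf-8").read()
text = "bbaadaacaab"

# Counter maps each character to the number of times it occurs.
counter = Counter(text)
print(counter["a"], counter["b"], counter["c"], counter["d"])  # 6 3 1 1
```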

Next, lets define our node class:
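The original class isn’t reproduced inline, so here is a minimal sketch of what it needs to hold: a frequency count, an optional character (leaves only), and the two children:

```python
class Node:
    """A node in the Huffman tree."""

    def __init__(self, count, char=None, left=None, right=None):
        self.count = count  # combined frequency of the characters under this node
        self.char = char    # set for leaf nodes only
        self.left = left    # child reached with bit 0
        self.right = right  # child reached with bit 1
```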

Next, we can build the Huffman tree:
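Again, the gist isn’t shown inline; here is a sketch following the structure the post describes, with the helper names (nodify, build_node, build_huffman_tree) taken from the explanation below. Re-sorting the list every iteration is O(n²) — a heap would be faster, but this keeps the sketch close to the description:

```python
# (A minimal Node, as defined above, so this sketch runs on its own.)
class Node:
    def __init__(self, count, char=None, left=None, right=None):
        self.count, self.char, self.left, self.right = count, char, left, right


def nodify(counter):
    # One leaf node per character.
    return [Node(count, char) for char, count in counter.items()]


def build_node(a, b):
    # A parent node whose count is the sum of its two children.
    return Node(a.count + b.count, left=a, right=b)


def build_huffman_tree(counter):
    nodes = nodify(counter)
    while len(nodes) > 1:
        nodes.sort(key=lambda n: n.count)  # smallest counts first
        a, b = nodes.pop(0), nodes.pop(0)  # take the two smallest nodes...
        nodes.append(build_node(a, b))     # ...and put their parent back
    return nodes[0]                        # the last node standing is the root
```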

The core of the logic is in build_huffman_tree. It takes the counter we built previously and converts it into a list of nodes with nodify. Then, in the while loop, it pops the two smallest nodes off the list, builds a new node from them via build_node, and adds that node back into the list of nodes.

Now that we have these helper functions, we can write the full encode function:
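The full encoder isn’t reproduced inline either; here is a self-contained sketch. The build_table helper and the tuple-returning encode signature are my own choices, not necessarily the original code’s:

```python
from collections import Counter


# (Repeating the Node class and tree-building from above so this runs standalone.)
class Node:
    def __init__(self, count, char=None, left=None, right=None):
        self.count, self.char, self.left, self.right = count, char, left, right


def build_huffman_tree(counter):
    nodes = [Node(count, char) for char, count in counter.items()]
    while len(nodes) > 1:
        nodes.sort(key=lambda n: n.count)
        a, b = nodes.pop(0), nodes.pop(0)
        nodes.append(Node(a.count + b.count, left=a, right=b))
    return nodes[0]


def build_table(node, prefix="", table=None):
    # Walk the tree: left edges contribute a 0, right edges a 1.
    if table is None:
        table = {}
    if node.char is not None:
        table[node.char] = prefix or "0"  # lone-character edge case
    else:
        build_table(node.left, prefix + "0", table)
        build_table(node.right, prefix + "1", table)
    return table


def encode(text):
    root = build_huffman_tree(Counter(text))
    table = build_table(root)
    return "".join(table[ch] for ch in text), table
```

Printing the resulting table for the Hamlet text produces output like the block below. Note that ties between equal counts may be broken differently than in the walkthrough above, so the exact bit patterns can vary, but the code lengths (and therefore the compressed size) stay the same.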

: 00010
!: 011001100
: 10
,: 011100
.: 011110
...v: 0001110
y: 011101
x: 0110010101
z: 00011011110
Original Data: 1533872 bits
Encoded Data: 843328 bits
Achieved a 54% compression

So the encoded Hamlet text is about 54% of the size of the original.

Now, let’s make sure that we can actually decode this encoding back to the original text. Luckily, the decode function is much more straightforward: all we need to decode the encoding is the table it was built with:
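A sketch of such a decode function — since Huffman codes are prefix-free, we can consume bits one at a time and emit a character as soon as the buffer matches a table entry:

```python
def decode(encoded, table):
    # Invert the table: bit pattern -> character.
    inverse = {bits: char for char, bits in table.items()}
    decoded, buffer = [], ""
    for bit in encoded:
        buffer += bit
        if buffer in inverse:  # prefix-free codes: the first match is unambiguous
            decoded.append(inverse[buffer])
            buffer = ""
    return "".join(decoded)


table = {"a": "0", "b": "10", "c": "110", "d": "111"}
print(decode("101000111001100010", table))  # bbaadaacaab
```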

Decoded input is correct!

Caveats

If you paid attention, you probably noticed that we didn’t actually “compress” the data, since we are using strings of “0” and “1” characters to represent the encoded binary data instead of actual bits.
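To get real compression, the bit string would need to be packed into bytes. One possible sketch (the padding count has to be stored somewhere so decoding knows where the real bits end):

```python
def bits_to_bytes(bits):
    # Pad to a whole number of bytes, then pack 8 bits into each byte.
    padding = (8 - len(bits) % 8) % 8
    padded = bits + "0" * padding
    packed = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return packed, padding


packed, padding = bits_to_bytes("101000111001100010")
print(len(packed), padding)  # 3 bytes (with 6 padding bits) instead of an 18-char string
```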

There is more complexity in building a general-purpose Huffman encoder: instead of building a tree based on the frequency of characters, we would want to build one based on the frequency of bytes.

To create and save a compressed file, we need to save the table alongside the data. We can imagine an ad hoc file structure to hold this information, like:

|-- size of table --|-- size of data --|-- table --|-- data --|

And then to decompress this file, we could write a program that:

  1. Reads the first 4 bytes as TABLE_SIZE
  2. Then reads the next 4 bytes as DATA_SIZE
  3. Reads the next TABLE_SIZE bytes to get the encoding table
  4. Uses this encoding table to read the next DATA_SIZE bytes and decompress the data.
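The layout above can be sketched with Python’s struct module for the two 4-byte sizes; serializing the table as JSON is just one convenient choice, not part of any standard format:

```python
import json
import struct


def pack_file(table, data):
    # | table size (4 bytes) | data size (4 bytes) | table | data |
    table_bytes = json.dumps(table).encode("utf-8")
    header = struct.pack("<II", len(table_bytes), len(data))
    return header + table_bytes + data


def unpack_file(blob):
    table_size, data_size = struct.unpack("<II", blob[:8])
    table = json.loads(blob[8:8 + table_size])
    data = blob[8 + table_size:8 + table_size + data_size]
    return table, data
```

A quick round trip — pack_file followed by unpack_file — should return the same table and data it was given.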

Cheers!
