Stonehill
College
Data Structures and Discrete Mathematics Learning
Community
Professors Ralph Bravaco and
Shai Simonson
Text Compression and Huffman Trees
Consider a simple “message” (or file) which is stored as a
string of characters using the ASCII code.
Each character in the message requires eight bits of
storage. For example character “A,” (ASCII code 65)
would be stored as 010000001, character “B” (ASCII 66) would bet stored as 010000010 etc. In fact, this very paragraph would require 3472 bits of storage
(434 characters and eight bits for each character).
Data compression is a technique which reduces the size of
a file or data collection without affecting the information
contained in the file. A compressed
file, obviously, requires less storage space.
Moreover, electronic transfer of a compressed file is
faster than the transfer of an uncompressed file and a smaller
file is consequently less susceptible to transmission errors.
One of the simplest but efficient methods of data compression
was introduced by David Huffman in his paper ``A method for the
construction of minimum redundancy codes.''
Huffman’s method is very intuitive.
Unlike the ASCII code which uses exactly eight
bits to encode each character, a Huffman code uses a variable
number of bits for each character. With Huffman’s scheme, a
common character such as “a” or “e” might be encoded with perhaps just two bits and less
frequently used characters like “q” or “#” would be encoded with
a longer bit string. Consequently,
the most frequently occurring characters are encoded with a
minimal number of bits. (Morse code
uses the same technique, “E” is encodes as a single dot, “T” just
one dash, but “Q” is
dashdashdotdash). Because ASCII encodes
every character with eight bits we say that ASCII code is a fixed
length code. Huffman’s code
is a variable length code.
We’ll begin with a trivial example.
Suppose we wish to encode the string
“SHESELLSSEASHELLS”
The ASCII code would require 160 bits (20 characters each
requiring 8 bits).
Notice, however, that the characters
which
appear in the string have the following frequencies ( and
relative frequencies)
A
1
.05
H
2
.10

3
.15
E
4
.20
L
4
.20
S
6
.30
A possible variable length code which uses just 2 bits to encode
frequently used characters like “S” or “L “ might be :
Character
Codeword
Frequency of Character
A
0000
1
H
0001
2

001
3
E
10
4
L
11
4
S
01
6
We will refer to this code as Code 1.
We will also use the term “codeword” to denote an
individual character code. Thus,
0000 and 001 are codewords.
Notice that “S” which occurs 6 times (or 30% of the time) is
encoded using just 2 bits and “A” which occurs only once is
encoded with 4 bits.
With this code, the string, “SHESELLSSEASHELLS,”
can
be encoded with just 49 bits:
0100011000101101111010010110000000101000110111101
You may wonder why we didn’t use a “more efficient” code like:
S
0
L
1
E
00
 11
H
01
A
10
( Code 2)
Using Code 2 we can encode the string “SHESELLSSEASHELLS” as
001001100011011000101101000110
which uses a mere 30 bits. Certainly,
compressing
the string to 30 bits is more efficient than storing it as 49
bits. So, why not use Code 2? It does the job with fewer bits.
To answer this question, consider just the simple substring,
“SHE.” Using the
ASCII coding scheme, “SHE” is encoded as
010100110100100001000101
Now, if we wish to decode (or translate this bit string back to
ordinary, readable text), we simply
look at each successive group of exactly 8 bits:
01010011  (83)
– “S”
01001000  (72)
– “H”
01000101  (69)
– “E”
That’s all there is to it.
Now, let’s encode “SHE” using our “more efficient” Code 2:
00100.
Wow! Just 5 bits!
Now, given this five bit string, how do we decode it? Do we decode 00100 as “EAS”
(00 10 0)
or as “ SSLE”
( 0 0 1 00 ) or perhaps as
“SSLSS” (0 0 1
0 0) ? You really cannot tell
without putting some marker between the codewords for each
letter.
On the other hand, using Code 1 we encode “SHE” as
01000110.
To decode this bit string , simply scan
the
string from left to right until you
find a complete codeword. You have
just decoded the first character. Continue
the
process until you have scanned the entire bit string. You will see there is no ambiguity.
01 0001
10
S H E
Exercise 1.
a. Decode each of the following bit strings which are
encoded with Code 1:
01000011001011011110100110101101
00010000110010001000001001000110101101
b. Now,
using the following bit string, show that Code 2 is ambiguous. In other words, find two possible
translations of the following bit string:
01011100011011001000
c.
Explain why Code 1 is not ambiguous. What
property
does Code 1 have which ensures that each bit string has a unique
translation?
Prefix Codes
For arbitrary variable length codes, we may not be able
determine the end of one codeword . For
example, using Code 2, do we interpret the bit string 0101 as
0.10.1 (SAL) or 01.01 (HH) or 0.1.0.1 (SLSL)? Is
the initial letter encoded as 0 (S) or
01 (H)? Without some kind of marker between codewords, it is
impossible to determine where each codeword ends.
Code 1 however, does not have this drawback. The bit string 00010000011 can have
only one interpretation:
0001.0000.001.
Let’s look at this string. As we scan from left to right we see:
0 is not a valid code. Continue
scanning
00 is not a valid code. Continue
000 is not a valid code. Continue
0001 is the code for H. Done.
OK, we have unearthed an “H.” But how do we know that there is
no code that starts with 0001 ? How
do we know that 00011 is not the code for some other
letter?
If you look closely at the codewords of Code 1, you will notice
that no codeword is a prefix of another codeword.
The codeword for “H” is 0001. No
other codeword starts with 0001. The
codeword for “S” is 01. Nothing
else begins with 01. Thus, using a
leftright scan, once we read 0001 we recognize the codeword for
“H.” Further, we know that 0001 is
not the start of the codeword for some other character, since no
other character code begins with 0001.
Definition.
A code is a prefix code (
or has the prefix property) if
no
codeword is the prefix of another codeword.
Again, Code 1 has the prefix property but Code 2 does not. The codes which we will develop
will have the prefix property.
A Huffman code turns out to be an “optimal” prefix code. We will be more precise about the
meaning of “optimal” a bit later. First,
let’s see how the Huffman code is constructed.
Once we look at the mechanics, we will take a closer look
underneath the hood.
Codes and Binary Trees:
Let’s look at Code 1 again:
Character
Codeword
Frequency
Relative Frequency
A
0000
1
.05
H
0001
2
.10

001
3
.15
E
10
4
.20
L
11
4
.20
S
01
6
.30
20
1.00
We can represent this code with a binary tree.
On such a tree,
each left branch is labeled 0,
each right branch is labeled 1
the encoded characters are placed in the leaves
Notice that the concatenation of the branch labels on the path
from the root to any leaf gives the code word for the symbol
stored in the leaf. For example,
the path from the root to the leaf labeled “H’ is
leftleftleftright and the codeword for “H” is 0001. The path to “L” is
rightright , so the codeword for “L” is 11.
Huffman’s Algorithm constructs such a binary tree in a bottom up
manner. The Huffman tree yields an optimal prefix code for the symbols contained in the
leaves.
Building the Huffman Tree.
Given the frequencies of each character in a file, the
Huffman tree can be constructed as follows:
For each symbol to be encoded, make the binary tree consisting
of a single root. Each of these root nodes should hold
the frequency (or relative frequency)
of one symbol. Ultimately, these nodes will be the leaves
of our Huffman tree.
This can be implemented as a priority queue. The
priorities are the frequencies.
2. Choose the two trees with the smallest root values and merge
these into a new tree, as in the
following diagram. Here the trees labeled A and H are chosen. Priorities/frequencies are 1 and 2. . The
root of the new tree holds the value 3 , obtained as the sum of
the values 1 and 2. Place this in the queue.
3. Continue the process, always
forming a new tree from the two trees with the lowest root
values. Here we merge the two trees each with 3 stored in the
root into a tree with root value 6. Place the new tree in
the queue. Four trees remain in the priority queue
with root values 4,4,6 and 6.
4. Continue this process.. Remember always merge the two
subtrees which have the smallest root values. These are
the two trees with root values 4 and 4. Merge them into a
single tree with root 8 and place the new tree in the queue.
Now merge the two trees with root values 6 into a tree with root
12. Place the new tree in the queue so that two trees
remain:
Merge the two trees into one final tree. Here is the final
tree, with each branch labeled either 0 or 1.
The following is a general algorithm:
For each character c to be encoded
create a binary tree, consisting of a single root node r such
that content (r) =
f(c), the frequency of c
Place each very simple tree into a priority queue S
where the priority is the frequency.
for i = 1 to n1
{
From the priority queue
,S, remove the two trees, x and y,
with minimal values, f_{1 }and f_{2} in the
root
create a tree with root z
such that
x is the left child of z
y is the right child of z
content(z) = f_{1} _{ }+
f_{2
}add
this new tree with root z to S.
}
Using the Huffman tree to determine the codewords.
Once the Huffman tree is created, it is easy to determine
the codeword for each character.
Method 1: Bottom Up
As we’ve seen, each leaf represents one character. If we begin at a leaf and proceed up
ttree to the root, we can “trace out” (in
reverse) the codeword for a given symbol:
For each leaf a
while a is not the root of
the tree
if a is a left subtree
output
‘0’
else
output ‘1’
a = parent(a)
Notice that the algorithm will output
codewords in reverse order.
Example:
Let a be leaf “ .”
a is a right subtree; output ‘1’
a = parent(a)
a is a left subtree; output ‘0’
a = parent(a)
a is a left subtree; output ‘0’
a = parent(a)
a is the root; done
Output: 100; codeword for ““ is 001
Method 2 : Top Down  this is easier
An ordinary, recursive tree traversal with an additional
parameter can also be used to determine the codewords:
void traverse(string s, tree root)
Each time that the function is called recursively, pass it
string s concatenated with either
‘0’ or ‘1.’ When visiting a leaf, simply output string s – s is the codeword for character
associated with the leaf.
Encoding the file:
Simply translate each character using a table of codewords:
Example:
Character
Codeword
A
0000
H
0001

001
E
10
L
11
S
01
Translate: “SHESELLS.......”
010001100010110111101
S H E 
S E L L S
Decoding a compressed file:
When decoding a compressed file, you will once again use the
Huffman tree. Consequently, the frequency table must once again
be available. When the tree is
constructed, the text characters are stored in the leaves. Each codeword, determines a unique
path from the root of the tree to to the leaf holding the
translation of the code word. Once
we reach a leaf, we begin the decoding process again at the
root.
Example: Decode : 0100011.....
Begin at the root:
0100011 – move left
0100011 – move
right
A leaf—“S” (01 is the codeword
for S)
Go back to the root and continue the process.
0100011 etc.
Exercise 2:
a. Construct a Huffman
tree for the alphabet A B C D E F where the relative frequencies
of each character are:
A
.22
B
.13
C
.33
D
.10
E
.20
F
.02
b. Use
the tree to determine the code word for each of the character of
this short alphabet
c.
Encode the string “AECBCAF”
Program 1. Encoding
a text file.
Write a program which will encode a text
file using the Huffman algorithm.
The user interface should be very simple.
Just provide the user with two prompts:
Input file:
Output file:
so that the user may supply the
name of the input (text) file as well as a name for the
compressed file. Your output should
actually be two files : the compressed file, and also a file with the frequency table . This second file should have the same
name as the compressed file and an extension of feq. For example if the compressed file is called
something.zip then the second file should be something.feq
To encode text using a Huffman code you must
First
build a frequency chart for all characters in your text
Build a Huffman
tree
Make a list of
all codewords
Using the list of
codewords, encode your text.
Here are a few suggestions to help you with the implementation.
Since ASCII values fall in the range 0
...127, use a simple array (int freq[128]) to store frequencies.
For example, if 'A'
has frequency 13, then freq[65] will have
the value 13. Of course, not all ASCII characters will appear in
your text, so many entries in the frequency array will be 0.
When you build the Huffman tree, you must make a number of
design decisions.
As you know, the algorithm builds the tree incrementally. At each step, the subtrees with
minimum roots must be chosen. The
simplest method for determining the these two minimum valued
trees would be to use a priority queue.
If you plan to build your table of codewords in
a bottom up fashion, you will need to maintain a table which
holds the address of each leaf on the tree.
Also, each node must contain the address of its
parent. The top down method is easier and is really just
an inorder traversal which keeps track of the codewords as you
do the traversal.
Simply keep the codewords in an array indexed from 0 to 127 so
that you may access the a character’s codeword directly, i.e.,
without a search. Save your
frequency table to a file – simply list the 128 frequencies in
the array.
Program
2: Decoding a compressed file.
Write a program which will decode a
compressed file. Your program should prompt the
user for two file names:
Compressed
file:
Text file:
If the name of the compressed file is something.zip,
your program should use that file and also the file something.feq
as input.
Exercise 3:
Use the programs in the previous
exercises to compress( and decode)
a. The Declaration of
Independence
b. The US Constitution
c. Shakespeare’s Hamlet
The text of these files can be downloaded.
Real Life Compression
For convenience, you
encoded your text into characters (bytes), with each
byte equal to either ‘0’ or ‘1’. However,
storing each bit in its own byte will result in a file 8 times
larger than it needs to be. In order to realize the true
"reallife"compression, you must convert the bytes into
bits. To do this, every eight '0' or '1' characters
(bytes) is packed into a single byte. Pseudocode to
accomplish this is given below:
While there are more 0/1 characters (bytes)
{
byte temp = 0;
for i = 1 to 8 //
This loop calculates the decimal value of 8 zeros and ones
and stores it in temp. If the number of bytes
in the file is not divisible by 8, just pad the file with a
few extra bytes.
{
b =
getnext_0_or_1_character(); // get the next bit (stored
in a byte)
temp = temp* 2 +
b;
// note that b
should be 0 or 1 (not the ascii value of b which would be 48
or 49)
// Note that to
do this correctly in Java, you will need to cast the
expressions to bytes, since Java automatically upcasts bytes
to int when doing arithmetic
}
// store temp in a byte array
}
For example, given the bits 10010011,
the byte "temp" grows from 0 like this: 0, 1, 2, 4, 9, 18, 36,
73, 147.
These numbers represent the decimal value of the 8bit
binary number 10010011 as it grows from left to right.
(Note that Java stores byte values as two's complement signed
numbers, so if you try to print out these bytes, it will display
negative numbers for any unsigned value greater than 127.
This will not affect your compression at all.)
FileOutputStream fos = new FileOutputStream("pathname");
fos.write(myByteArray);
fos.close();
or follow instructions here:
Compete the following chart:
SIZE
IN BYTES

Text
file 
winzip 
Huffman 
Declaration
of Independence 



Constitution 



Hamlet 



You do not need to write any programs to
find these values. You can get all
this information via Windows and Word.
Huffman Codes—mathematical questions.
Now that we have looked at the
Huffman algorithm, let’s take a closer look at some of the
mathematical properties of a Huffman tree.
Exercise 4:
We have stated that the Huffman’s Algorithm produces a code with
the prefix property. Explain why
this is the case. You do not have
to give a rigorous mathematical proof—just think about how the
tree is constructed and give a plausible explanation. A few sentences will suffice.
Optimal codes
We have stated the the Huffman code is
“optimal” so perhaps it is now time to give a more
precise definition of this term.
We first begin by defining the cost of the code or equivalently
the tree.
Definition. Suppose T
is a tree representing a prefix code for a file.
The cost of T, C(T), is
the number of bits necessary to encode the file.
Thus
C(T) = Sum[
frequency(c)*length(c)] for each character c
where length(c) is the length of the path from the root
of T to the leaf representing c.
Note: length(c) is also the number of bits in the
codeword for character c.
Example:
Consider the following Huffman tree, T which we built to encode
the message “SHESELLSSEASHELLS”
C(T) = frequency(‘A’)length(‘A’)+ frequency(‘H’)length(‘H’)
+frequency(‘’)length(‘’)+
frequency(‘S’)length(‘S’)+
frequency(‘E’)length(‘E’)+ frequency(‘L’)length(‘L’) =
1*4 + 2*4+3*3 + 6*2 + 4*2 + 4*2 = 4
+ 8 + 9 + 12 + 8 +8 = 49
Definition. Suppose
T is a tree representing a prefix code for a file.
The code is optimal if C(T)
is minimal.
Note an optimal code is not unique – for a given file more than
one optimal code may exist. (Trivially, just change all 0’s to
1’s and 1’s to 0’s in any optimal code and you have another
optimal code).
With a precise definition of "optimal" in hand, we will prove
(in the next several exercises) that a Huffman tree is optimal.
Exercise 5:
A full binary tree is a binary tree in
which ever nonleaf node has two children.
Notice that the Huffman trees in our examples have been full
trees.
Prove: An optimal prefix code for a
file is always represented by a full binary tree
Hint: Suppose there is a nonfull
tree that did represent an optimal code. Find
a
contradiction by producing a better code.
Exercise 6:
Show by means of a counter example that a
full binary tree may represent a prefix code which is not
optimal, even if the letters with lower frequencies have the
longer codewords.
Suggestion: Build another full tree for “SHESELLSSEASHELLS”
which has a higher cost than the tree of our previous examples.
Exercise 7:
Let x and y be two characters with the lowest frequencies.
(Arbitrarily, assume frequency(x) <= frequency(y)). Prove that there is an optimal tree
where x and y are siblings at the
maximum depth of the tree.
Hint:
Assume that you have an optimal prefix tree, T, such that a and b are at the maximum depth.
(You can assume frequency (a) <=
frequency(b)). Here is a partial
view:
Now, because x and y have the two smallest frequencies
frequency(x) <= frequency(a) and frequency(y) <=
frequency(b). Also, length(a)
>= length(x) and length(b) >= length(y).
Now switch a and x, resulting in a new tree T’.
Show C(T’) <= C(T)?. Now, in T’,
switch y and b. Show C(T’’)
<= C(T’)?
Exercise 8:
Prove that Huffman’s algorithm produces an optimal prefix
code tree.
Outline of proof: Use mathematical induction on n, the
number of characters/leaves.
For n = 1, Huffman’s algorithm produces a tree with one
node, obviously optimal.
Now assume that for n leaves, Huffman’s algorithm
produces an optimal tree. Show that Huffman’s algorithm
produces an optimal tree for n+1 leaves.
Let T_{n+1} be an optimal tree with n+1
leaves. We may assume that in an optimal tree the two
characters, say x and y, with the lowest
frequencies are siblings at the lowest level of the tree. (Why
may we assume that?). Now remove x
and y and replace them with a “new character z
with frequency(z) = frequency(x) + frequency(y). Call the new tree T_{n}. Let T’_{n} be the Huffman tree constructed from these n
characters Now finish the
proof.
Exercise 9:
The Huffman tree for frequencies equal to the
first n Fibonacci numbers results in a very
unbalanced encoding of n characters.
Suppose that the frequencies for the characters a, b, c, d, e, f, g, and
h are the first 8 Fibonacci numbers: 1,1,2,3,5,8,13,and 21.
Construct the Huffman tree.
You found the cost for n = 8. Now you should
determine the general cost of the Huffman tree for n
characters when the frequencies are the first n
Fibonacci numbers.
a. Discover a
formula for the sum of the first n Fibonacci numbers,
and prove it by induction..
Fill in this chart to help you get a good conjecture. Hint: Look at Fibonacci(n + 2).
n 
Fibonacci(n) 
SumFib(n) = Fibonacci(1) +
Fibonacci(2) + ... + Fibonacci(n) 
1 
1 
1 
2 
1 
2 
3 
2 
4 
4 
3 
7 
5 
5 

6 
8 

7 
13 

8 
21 

9 
34 
b. Prove that C(n) = C(n1) + SumFib(n), where C(1) = 0.
You can prove this by looking carefully
at the structure of the tree. Draw the trees. The
next tree is built inductively from the previous tree.
Fill in this chart to help you with your
proof. Hint: Line up the sums in the chart,
and examine the difference between two rows.
n 
Fibonacci(n) 
C(n) = Cost of Huffman tree for the
first n Fibonacci numbers 
1 
1 
0*1 = 0 
2 
1 
1*1 + 1*1 = 2 
3 
2 
2*1 + 2*1 + 1*2 = 6 
4 
3 
3*1 + 3*1 + 2*2 + 1*3 = 13 
5 
5 

6 
8 

7 
13 

8 
21 

9 
34 