Huffman encoding¶
This module implements functionalities relating to Huffman encoding and decoding.
AUTHOR:
- Nathann Cohen (2010-05): initial version.
Classes and functions¶
-
class
sage.coding.source_coding.huffman.
Huffman
(source)¶ Bases:
sage.structure.sage_object.SageObject
This class implements the basic functionalities of Huffman codes.
It can build a Huffman code from a given string, or from the information of a dictionary associating to each key (the elements of the alphabet) a weight (most of the time, a probability value or a number of occurrences).
INPUT:
source
– can be either- A string from which the Huffman encoding should be created.
- A dictionary that associates to each symbol of an alphabet a numeric
value. If we consider the frequency of each alphabetic symbol, then
source
is considered as the frequency table of the alphabet with each numeric (non-negative integer) value being the number of occurrences of a symbol. The numeric values can also represent weights of the symbols. In that case, the numeric values are not necessarily integers, but can be real numbers.
In order to construct a Huffman code for an alphabet, we use exactly one of the following methods:
- Let
source
be a string of symbols over an alphabet and feedsource
to the constructor of this class. Based on the input string, a frequency table is constructed that contains the frequency of each unique symbol insource
. The alphabet in question is then all the unique symbols insource
. A significant implication of this is that any subsequent string that we want to encode must contain only symbols that can be found insource
. - Let
source
be the frequency table of an alphabet. We can feed this table to the constructor of this class. The tablesource
can be a table of frequencies or a table of weights.
In either case, the alphabet must consist of at least two symbols.
EXAMPLES:
sage: from sage.coding.source_coding.huffman import Huffman, frequency_table sage: h1 = Huffman("There once was a french fry") sage: for letter, code in sorted(h1.encoding_table().items()): ....: print("'{}' : {}".format(letter, code)) ' ' : 00 'T' : 11100 'a' : 0111 'c' : 1010 'e' : 100 'f' : 1011 'h' : 1100 'n' : 1101 'o' : 11101 'r' : 010 's' : 11110 'w' : 11111 'y' : 0110
We can obtain the same result by “training” the Huffman code with the following table of frequency:
sage: ft = frequency_table("There once was a french fry") sage: sorted(ft.items()) [(' ', 5), ('T', 1), ('a', 2), ('c', 2), ('e', 4), ('f', 2), ('h', 2), ('n', 2), ('o', 1), ('r', 3), ('s', 1), ('w', 1), ('y', 1)] sage: h2 = Huffman(ft)
Once
h1
has been trained, and hence possesses an encoding table, it is possible to obtain the Huffman encoding of any string (possibly the same) using this code:sage: encoded = h1.encode("There once was a french fry"); encoded '11100110010001010000111011101101010000111110111111100001110010110101001101101011000010110100110'
We can decode the above encoded string in the following way:
sage: h1.decode(encoded) 'There once was a french fry'
Obviously, if we try to decode a string using a Huffman instance which has been trained on a different sample (and hence has a different encoding table), we are likely to get some random-looking string:
sage: h3 = Huffman("There once were two french fries") sage: h3.decode(encoded) ' eierhffcoeft TfewrnwrTrsc'
This does not look like our original string.
Instead of using frequency, we can assign weights to each alphabetic symbol:
sage: from sage.coding.source_coding.huffman import Huffman sage: T = {"a":45, "b":13, "c":12, "d":16, "e":9, "f":5} sage: H = Huffman(T) sage: L = ["deaf", "bead", "fab", "bee"] sage: E = [] sage: for e in L: ....: E.append(H.encode(e)) ....: print(E[-1]) 111110101100 10111010111 11000101 10111011101 sage: D = [] sage: for e in E: ....: D.append(H.decode(e)) ....: print(D[-1]) deaf bead fab bee sage: D == L True
-
decode
(string)¶ Decode the given string using the current encoding table.
INPUT:
string
– a string of Huffman encodings.
OUTPUT:
- The Huffman decoding of
string
.
EXAMPLES:
This is how a string is encoded and then decoded:
sage: from sage.coding.source_coding.huffman import Huffman sage: str = "Sage is my most favorite general purpose computer algebra system" sage: h = Huffman(str) sage: encoded = h.encode(str); encoded '11000011010001010101100001111101001110011101001101101111011110111001111010000101101110100000111010101000101000000010111011011000110100101001011100010011011110101011100100110001100101001001110101110101110110001000101011000111101101101111110011111101110100011' sage: h.decode(encoded) 'Sage is my most favorite general purpose computer algebra system'
-
encode
(string)¶ Encode the given string based on the current encoding table.
INPUT:
string
– a string of symbols over an alphabet.
OUTPUT:
- A Huffman encoding of
string
.
EXAMPLES:
This is how a string is encoded and then decoded:
sage: from sage.coding.source_coding.huffman import Huffman sage: str = "Sage is my most favorite general purpose computer algebra system" sage: h = Huffman(str) sage: encoded = h.encode(str); encoded '11000011010001010101100001111101001110011101001101101111011110111001111010000101101110100000111010101000101000000010111011011000110100101001011100010011011110101011100100110001100101001001110101110101110110001000101011000111101101101111110011111101110100011' sage: h.decode(encoded) 'Sage is my most favorite general purpose computer algebra system'
-
encoding_table
()¶ Returns the current encoding table.
INPUT:
- None.
OUTPUT:
- A dictionary associating an alphabetic symbol to a Huffman encoding.
EXAMPLES:
sage: from sage.coding.source_coding.huffman import Huffman sage: str = "Sage is my most favorite general purpose computer algebra system" sage: h = Huffman(str) sage: T = sorted(h.encoding_table().items()) sage: for symbol, code in T: ....: print("{} {}".format(symbol, code)) 101 S 110000 a 1101 b 110001 c 110010 e 010 f 110011 g 0001 i 10000 l 10001 m 0011 n 00000 o 0110 p 0010 r 1110 s 1111 t 0111 u 10010 v 00001 y 10011
-
tree
()¶ Returns the Huffman tree corresponding to the current encoding.
INPUT:
- None.
OUTPUT:
- The binary tree representing a Huffman code.
EXAMPLES:
sage: from sage.coding.source_coding.huffman import Huffman sage: str = "Sage is my most favorite general purpose computer algebra system" sage: h = Huffman(str) sage: T = h.tree(); T Digraph on 39 vertices sage: T.show(figsize=[20,20])
-
sage.coding.source_coding.huffman.
frequency_table
(string)¶ Return the frequency table corresponding to the given string.
INPUT:
string
– a string of symbols over some alphabet.
OUTPUT:
- A table of frequency of each unique symbol in
string
. Ifstring
is an empty string, return an empty table.
EXAMPLES:
The frequency table of a non-empty string:
sage: from sage.coding.source_coding.huffman import frequency_table sage: str = "Stop counting my characters!" sage: T = sorted(frequency_table(str).items()) sage: for symbol, code in T: ....: print("{} {}".format(symbol, code)) 3 ! 1 S 1 a 2 c 3 e 1 g 1 h 1 i 1 m 1 n 2 o 2 p 1 r 2 s 1 t 3 u 1 y 1
The frequency of an empty string:
sage: frequency_table("") defaultdict(<... 'int'>, {})