by Prasanth » Wed Jun 22, 2011 6:30 pm
Data compression is the process of converting an input data stream or the source stream or the original raw data into another data stream that has a smaller size. Data compression is popular because of two reasons
1) People like to accumulate data and hate to throw anything away. No matter however large a storage device may be, sooner or later it is going to overflow. Data compression seems useful because it delays this inevitability
2) People hate to wait a long time for data transfer. There are many known methods of data compression. They are based on the different ideas and are suitable for different types of data. They produce different results, but they are all based on the same basic principle that they compress data by removing the redundancy from the original data in the source file. The idea of compression by reducing redundancy suggests the general law of data compression, which is to "assign short codes to common events and long codes to rare events". Data compression is done by changing its representation from inefficient to efficient form.
The main aim of the field of data compression is of course to develop methods for better and better compression. Experience shows that fine tuning an algorithm to squeeze out the last remaining bits of redundancy from the data gives diminishing returns. Data compression has become so important that some researches have proposed the "simplicity and power theory". Specifically it says, data compression may be interpreted as a process of removing unnecessary complexity in information and thus maximizing the simplicity while preserving as much as possible of its non redundant descriptive power.
BASIC TYPES OF DATA COMPRESSION
There are two basic types of data compression.
1. Lossy compression
2. Lossless compression
LOSSY COMPRESSION
In lossy compression some information is lost during the processing, where the image data is stored into important and unimportant data. The system then discards the unimportant data
It provides much higher compression rates but there will be some loss of information compared to the original source file. The main advantage is that the loss cannot be visible to eye or it is visually lossless. Visually lossless compression is based on knowledge about colour images and human perception.
LOSSLESS COMPRESSION
In this type of compression no information is lost during the compression and the decompression process. Here the reconstructed image is mathematically and visually identical to the original one. It achieves only about a 2:1 compression ratio. This type of compression technique looks for patterns in strings of bits and then expresses them more concisely.
TECHNIQUES OF DATA COMPRESSION
There are three important techniques of data compression.
1) basic technique
2) statistical technique
3) dictionary method
BASIC TECHNIQUES
These are the techniques, which have been used only in the past. The important basic techniques are run length encoding and move to front encoding.
STATISTICAL TECHNIQUES
They are based on the statistical model of the data. Under this statistical techniques there comes three important techniques
• Shannon Fano coding
• Huffman coding
• Arithmetic coding
DICTIONARY METHODS
This method select strings of symbols and encodes each string as a token using a dictionary. The important dictionary methods are
• LZ77 (sliding window)
• LZRW1
BASIC TECHNIQUES
1. 1.RUN LENGTH ENCODING
The basic idea behind this approach to data compression is this: if a data item occurs n consecutive times in the input stream replace the n occurences with a single pair <n d> . the n consecutive occurences of a data item are called run length of n and this approach is called run length encoding or RLE.
RLE IMAGE COMPRESSION
RLE is a natural candidate for compressing graphical data. A digital image consists of small dots called pixels. Each pixel can be either one bit indicating a black or white dot or several bits indicating one of several colours or shades of gray. We assume that this pixels are stored in an array called bitmap in the memory. Pixels are normally arranged in the bit map in scan lines. So the first bit map pixel is the dot at the top left corner of the image and the last pixel is the one at the bottom right corner. Compressing an image using RLE is based on the observation that if we select a pixel in the image at random there is a good chance that its neighbours will have the same colour. The compressor thus scans the bit map row by row looking for runs of pixels of same colour.
Consider the grayscale bitmap -
12,12,12, 12, 12, 12, 12, 12, 12, 35,76,112,67,87,8787,5, 5, 5, 5, 5, 5,1- - - - - -
Compressed Form --
9, 12, 35, 76, 112, 67, 3, 87, 6, 5, 1- - - - - - - - - - -
1. MOVE TO FRONT CODING
The basic idea of this method is to maintain the alphabet A of symbols as a list where frequently occuring symbols are located near the front. A symbol 'a' is encoded as the no of symbols that precede it in this list. Thus if A=('t','h','e','s') and the next symbol in the input stream to be encoded is 'e', it will be encoded as '2' since it is preceded by two symbols. The next step is that after encoding 'e' the alphabet is modified to A=('e','t','h','s') . This move to front step reflects the hope that once 'e' has been read from the input stream it will read many more times and will at least for a while be a common symbol.
Let A = (“t”, “h”, “e”, “s” )
After encoding the symbol “e”, A is modified.
Modified Form:-
A = (“e”, “t”, “h”, “s” )
ADVANTAGE
This method is locally adaptive since it adapts itself to the frequencies of the symbol in the local areas of input stream. This method produces good results if the input stream satisfies this hope that is if the local frequency of symbols changes significantly from area to area in the input stream.
STATISTICAL TECHNIQUES
1. SHANNON FANO CODING
Shannon fano coding was the first method developed for finding good variable size codes. We start with a set of n symbols with known probabilities of occurences. The symbols are first arranged in the descending order of the probabilities. The set of symbols is then divided into two subsets that have the same probabilities. All symbols of one subset are assigned codes that start with a zero while the codes of the symbols in the other subset start with a one. Each subset is then recursively divided into two. The second bit of all codes is determined in a similar way. When a subset contains just two symbols their codes are distinguished by adding one more bit to each. The process continues until no subset remains.
Consider a set of seven symbols, whose probabilities are given. They are arranged in the descending order of the probabilities. The two symbols in the first subset are assigned codes that start with 1, so their final codes are 11 and 10. The second subset is divided in the second step, into two symbols and three symbols. Step 3 divides last three symbols into 1 and 2.
Shannon-Fano Example
Prob. Steps Final
____________________________________________________________
1. 0.25 1 1 :11
2. 0.20 1 0 :10
3. 0.15 0 1 1 :011
4. 0.15 0 1 0 :010
5. 0.10 0 0 1 :001
6. 0.10 0 0 0 1 :0001
7. 0.05 0 0 0 0 :0000
The average size of this code is
= 0.25 x 2 + 0.20x2 + 0.15 x3 + 0.15 x 3 + 0.10 x 3 + 0.10 x 4 + 0.05 x 4
= 2.7 bits / symbol.
This is a good result because the entropy is ≈ 2.67.
ADVANTAGE
The advantage of this method is that it is very easy to implement.
2. HUFFMAN CODING
A commonly used method for data compression is huffman coding. The method starts by building a list of all the alphabet symbols in descending order of their probabilities. It then constructs a tree with a symbol at every leaf from the bottom up. This is done in steps where at each step the two symbols with smallest probabilities are selected, added to the top of partial tree, deleted from the list and replaced with an auxiliary symbol representing both of them. When the list is reduced to just one auxiliary symbol the tree is complete. The tree is then traversed to determine the codes of the symbols.
The huffman method is somewhat similar to shannon fano method. The main difference between the two methods is that shannon fano constructs its codes from top to bottom while huffman constructs a code tree from bottom up.
This is best illustrated by an example. Given five symbols with probabilities as shown in Figure. They are paired in the following order:
1. a4 is combined with a5 and both are replaced by the combined symbol a45, whose probability is 0.2.
2. There are now four symbols left, a1, with probability 0.4, and a2, a3, and a45, with probabilities 0.2 each. We arbitrarily select a3 and a45 combine them and replace them with the auxiliary symbol a345, whose probability is 0.4.
3. Three symbols are now left, a1, a2, and a345, with probabilities 0.4, 0.2, and 0.4 respectively. We arbitrarily select a2 and a345, combine them and replace them with the auxiliary symbol a2345, whose probability is 0.6.
4. Finally, we combine the two remaining symbols a1, and a2345, and replace them with a12345 with probability 1.
The tree is now complete, “lying on its side” with the root on the right and the five leaves on the left. To assign the codes, we arbitrarily assign a bit of 1 to the top edge, and a bit of 0 to the bottom edge of every pair of edges. This results in the codes 0, 10, 111, 1101, and 1100. The assignments of bits to the edges is arbitrary.
The average size of this code is 0.4 x 1 + 0.2 x 2 + 0.2 x 3 + 0.1 x 4 + 0.1 x 4 = 2.2 bits / symbol, but even more importantly, the Huffman code is not unique.
APPLICATION IN IMAGE COMPRESSION
Now the following approaches illustrates how all these fore said techniques are applied to image compression. Photographic digital images generate a lot of data taking up large amounts of storage space and this is one of the main problems encountered in digital imaging. To rectify this problem image compression is used depending on the type of data, text, graphics, photographic or video. Image compression reduces image data by identifying patterns in the bit strings, describing pixel values then replacing them with a short code.
BASIC PRINCIPLE OF IMAGE COMPRESSION
The idea of losing image information becomes more palatable when we consider how digital images are created. Here are three examples: (1) A real-life image may be scanned from a photograph or a painting and digitized (converted to pixels). (2) An image may be recorded by a video camera that creates pixels and stores them directly in memory. (3) An image may be painted on the screen by means of a paint program. In all these cases, some information is lost when the image is digitized. The fact that the viewer is willing to accept this loss suggests that further loss of information night be tolerable if done properly.
Digitizing an image involves two steps: sampling and quantization. Sampling an image is the process of dividing the two-dimensional original image into small regions: pixels. Quantization is the process of assigning an integer value to each pixel. Notice that digitizing sound involves the same two steps, with the difference that sound is one-dimensional.
Here is a simple process to determine qualitatively the amount of data loss in a compressed image. Given an image A, (1) compress it to B, (2) decompress B to C, and (3) subtract d = C – A. if a was compressed without any loss and decompressed properly, then C should be identical to A and image D should be uniformly white. The more data was lost in the compression, the farther will D be from uniformly white.
The main principles discussed so far were RLE, scalar quantization, statistical methods, and dictionary-based methods. By itself, none is very satisfactory for color or grayscale images.
RLE can be used for (lossless or lossy) compression of an image. This is simple, and it is used by certain parts of JPEG, especially by its lossless mode. In general, however, the other principles used by JPEG produce much better compression than does RLE alone. Facsimile compression uses RLE combined with Huffman coding and gets good results, but only for bi-level images.
Scalar quantization can be used to compress images, but its performance is mediocre. Imagine an image with 8-bit pixels. It can be compressed with scalar quantization by cutting off the four least-significant bits of each pixel. This yields a compression ratio of 0.5, not very impressive, and at the same time reduces the number of colors (or grayscales) from 256 to just 16. Such a reduction not only degrades the overall quality of the reconstructed image, but may also create bands of different colors which is a noticeable and annoying effect.
Statistical methods work best when the symbols being compressed have different probabilities. An input stream where all symbols have the same probabilities will not compress, even though it may not necessarily be random. It turns out that for continuous-tone color or grayscale image, the different colors or shades often have roughly the same probabilities. This is why statistical methods are not good choice for compressing such images, and why new approaches for images with color discontinuities, where adjacent pixels have widely different colors compress better with statistical methods, but it is not easy to predict, just by looking at an image, whether it has enough color discontinuities.
Dictionary-based compression methods also tend to be unsuccessful in dealing with continuous-tone images. Such an image typically contains adjacent pixels with similar colors, but does not contain repeating patterns. Even an image that contains repeated patterns such as vertical lines may lose them when digitized. A vertical line in the original image may become slightly slanted when the image is digitized, so the pixels in a scan row may end up having slightly different colors from those in adjacent rows, resulting in a dictionary with short strings.
Another problem with dictionary compression of images is that such methods scan the image row by row, and may thus miss vertical correlations between pixels. Traditional methods are therefore unsatisfactory for image compression, so we turn on to novel approaches. They are all different, but they remove redundancy from an image by using the following principle.
Image compression is based on the fact that neighbouring pixels are highly correlated.
APPROACH 1
This is used for bi level images. A pixel in such an image is represented by 1 bit. Applying the principle of image compression to it therefore means that the immediate neighbours of a pixel 'p' tends to be similar to 'p'. Thus it makes sense to use run length encoding to compress the image. A compression method for such an image may scan it row by row and compute the run length of black and white pixels. A compression method for such an image may scan it in raster ie, row by row and compute the lengths of runs of black and white pixels. They are encoded by variable size codes and are written on the compressor. An example of such a method is facsimile compression.
Data compression is especially important when images are transmitted over a communication line because the user is typically waiting at the receiver, eager to see something quickly. Documents transferred between fax machines are sent as bitmaps, so a standard data compression method was needed when those were developed and proposed by the ‘International Telecommunications Union’. Although it has no power enforcement, the standards it recommends are generally accepted and adopted by industry.
The first data compression standard developed by the ITU were T2 and T3. These are now obsolete and have been replaced by T4 and T6. They have typical speeds of 64 k band. Both methods can produce compression ratios of 10:1 or better, reducing the transmission time of a typical pate to about a minute with former and a few seconds with the later.