Text Compression Flashcards

(26 cards)

1
Q

why is text compression different from other forms of data compression?

A

the original text must be exactly reconstructable

this requires lossless coding

2
Q

lossless compression

A

class of algorithms allowing original data to be perfectly reconstructed from compressed data

3
Q

lossy compression

A

achieves data reduction by discarding information

this is suitable for certain media types such as images, video, and audio

4
Q

dictionary methods

A

work by replacing a word or text fragment with an index to an entry in a dictionary
digram coding is an example (a sketch follows)
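
a minimal sketch of digram coding in Python (the digram list is a made-up example, not from the source):

  # toy digram coder: replace common character pairs with dictionary indexes
  DIGRAMS = ["th", "he", "in", "er", "an"]   # hypothetical dictionary
  INDEX = {d: i for i, d in enumerate(DIGRAMS)}

  def digram_encode(text):
      out, i = [], 0
      while i < len(text):
          pair = text[i:i + 2]
          if pair in INDEX:
              out.append(("D", INDEX[pair]))   # reference to a dictionary entry
              i += 2
          else:
              out.append(("C", text[i]))       # literal character
              i += 1
      return out

  print(digram_encode("the"))   # [('D', 0), ('C', 'e')]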

5
Q

symbolwise methods

A

work by estimating the probabilities of symbols and coding one symbol at a time, using shorter codewords for the more likely symbols
relies on a modeling step and a coding step

6
Q

modeling step

A

estimation of the probabilities of the symbols in the text (the better the estimates, the higher the compression that can be achieved)

7
Q

coding step

A

conversion of probabilities from a model into a bitstream

8
Q

static method

A

uses a fixed model or fixed dictionary derived in advance of any text to be compressed

9
Q

semi-static

A

uses the current text to build a model or dictionary during one pass, then applies it in a second pass

10
Q

adaptive method

A

builds the model/dictionary adaptively during a single pass

11
Q

information content

A

the number of bits in which a symbol s should be coded is its information content I(s)
I(s) = -log₂ P[s], where P[s] is the predicted probability of s
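
a quick numeric check in Python (the probability is a made-up example):

  import math
  # a symbol predicted with probability 1/8 should be coded in 3 bits
  print(-math.log2(1 / 8))   # 3.0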

12
Q

entropy

A

the average information per symbol over the whole alphabet is given by:
H = Σ_s P[s] · I(s) = -Σ_s P[s] log₂ P[s]
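
a minimal sketch in Python for a hypothetical zero-order model:

  import math

  P = {"a": 0.5, "b": 0.25, "c": 0.25}   # assumed probabilities, summing to 1

  # H = sum over s of P[s] * I(s), with I(s) = -log2 P[s]
  H = sum(p * -math.log2(p) for p in P.values())
  print(H)   # 1.5 bits per symbol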

13
Q

zero-order models

A

ignore preceding context

14
Q

what does the preceding context influence?

A

probability of encountering a given symbol at a particular place in a text

15
Q

static modeling

A

derives and then uses a single model for all texts.

will perform poorly on texts that differ from those used to construct the model.

16
Q

adaptive modeling

A

derives model during encoding

  • begins with a base probability distribution
  • refines the model as more symbols are encountered (a sketch follows)
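
a minimal sketch of such a model in Python (starting every count at 1 as the base distribution is an assumption, not from the source):

  class AdaptiveModel:
      def __init__(self, alphabet):
          # base probability distribution: every symbol starts with count 1
          self.counts = {s: 1 for s in alphabet}
          self.total = len(alphabet)

      def prob(self, sym):
          return self.counts[sym] / self.total

      def update(self, sym):
          # refine the model as each symbol is encountered;
          # the decoder applies the same updates to stay in sync
          self.counts[sym] += 1
          self.total += 1

  m = AdaptiveModel("ab")
  print(m.prob("a"))   # 0.5 before any text is seen
  m.update("a")
  print(m.prob("a"))   # 2/3 after one 'a'
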
17
Q

semi-static modeling

A

derives a model for the file in the first pass

better suited to the text than a static model, but inefficient because it must make two passes over the text and transmit the model

18
Q

coding

A

the task of determining the output representation of a symbol given the probability distributions

19
Q

Huffman coding

A

uses a binary code tree for encoding and decoding (branches labelled 0 and 1)
each leaf node is a symbol of the alphabet
The Algorithm:
- the code tree is constructed bottom-up from the probabilistic model according to the following steps (a sketch follows)
1. probabilities are associated with the leaf nodes
2. identify the two nodes with the smallest probabilities -> join them under a parent node whose probability is their sum
3. repeat step 2 until only one node remains
4. 0s and 1s are assigned to the two branches of each internal node
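
a minimal sketch of this construction in Python, using a heap to find the two smallest-probability nodes (the probabilities are a made-up example):

  import heapq, itertools

  def huffman_codes(probs):
      # one heap entry per subtree: (probability, tiebreak, symbols it contains)
      count = itertools.count()
      heap = [(p, next(count), [s]) for s, p in probs.items()]
      heapq.heapify(heap)
      codes = {s: "" for s in probs}
      while len(heap) > 1:
          # step 2: join the two smallest-probability nodes under a parent
          p1, _, syms1 = heapq.heappop(heap)
          p2, _, syms2 = heapq.heappop(heap)
          # step 4: prepend 0 on one branch and 1 on the other
          for s in syms1:
              codes[s] = "0" + codes[s]
          for s in syms2:
              codes[s] = "1" + codes[s]
          heapq.heappush(heap, (p1 + p2, next(count), syms1 + syms2))
      return codes

  print(huffman_codes({"a": 0.5, "b": 0.25, "c": 0.25}))
  # {'a': '0', 'b': '10', 'c': '11'}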

20
Q

prefix-free

A

no codeword is the prefix of another

21
Q

canonical Huffman coding

A

addresses the storage issue of Huffman coding (high memory cost for codebooks): the codes are generated in a standardized format
all codes of a given code length are assigned values sequentially

advantages over Huffman coding:

  • efficient storage of codebooks
  • more efficient coding algorithm
22
Q

arithmetic coding

A

a coding mechanism, primarily used for images, that can code arbitrarily close to the entropy and is thus optimal in terms of compression; common in adaptive modeling
output: a stream of bits (a sketch follows)
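
a toy illustration of the interval-narrowing idea in Python (floating point only works for short inputs; real coders use integer arithmetic and emit bits incrementally; the model is a made-up example):

  # cumulative intervals for an assumed model: P[a]=0.5, P[b]=0.25, P[c]=0.25
  INTERVALS = {"a": (0.0, 0.5), "b": (0.5, 0.75), "c": (0.75, 1.0)}

  def arithmetic_encode(text):
      low, high = 0.0, 1.0
      for sym in text:
          s_low, s_high = INTERVALS[sym]
          span = high - low
          # narrow the interval to the sub-range assigned to this symbol
          low, high = low + span * s_low, low + span * s_high
      # any number inside [low, high) identifies the whole text
      return (low + high) / 2

  print(arithmetic_encode("aab"))   # 0.15625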

23
Q

disadvantages of arithmetic coding

A
  • slower than Huffman coding
  • not easy to start decoding in the middle of a compressed stream
  • therefore less suitable for text retrieval
24
Q

LZ77

A

the encoder output is a series of triples (an example decoder follows)
1st: how far back in the already-decoded output to look for the next phrase
2nd: the length of that phrase
3rd: the next character from the input
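
a minimal decoder sketch in Python for such triples (the offset counts back from the current end of the output):

  def lz77_decode(triples):
      out = []
      for offset, length, char in triples:
          start = len(out) - offset
          # copy one character at a time so overlapping phrases work
          for i in range(length):
              out.append(out[start + i])
          out.append(char)   # the next character from the input
      return "".join(out)

  print(lz77_decode([(0, 0, "a"), (1, 1, "b")]))   # "aab"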

25
Q

synchronisation points

A

assume the document is the smallest unit of random access in the compressed archive; either store the bit offset of each document or ensure each document ends on a byte boundary

26
Q

what is the algorithm of canonical Huffman coding?

A

1. calculate the code length for each symbol (determined from the symbol probabilities)
2. group symbols with the same code length into blocks and order them within each block (e.g. alphabetically)
3. assign codes by counting up (a sketch follows the steps)
  • code length 2: 00, 01
  • code length 3: 100, 101, 110
  • code length 4: 1110, 1111
4. store/transmit the code
  • the sequence of symbols
  • the number of symbols at each code length
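
a minimal sketch of step 3 in Python, assuming the code lengths from steps 1-2 are already known:

  def canonical_codes(lengths):
      # sort by (code length, symbol), as in step 2
      symbols = sorted(lengths, key=lambda s: (lengths[s], s))
      codes, code, prev_len = {}, 0, 0
      for s in symbols:
          # count up, shifting left whenever the code length grows
          code <<= lengths[s] - prev_len
          codes[s] = format(code, "0{}b".format(lengths[s]))
          code += 1
          prev_len = lengths[s]
      return codes

  print(canonical_codes({"a": 2, "b": 2, "c": 3, "d": 3, "e": 3, "f": 4, "g": 4}))
  # {'a': '00', 'b': '01', 'c': '100', 'd': '101', 'e': '110',
  #  'f': '1110', 'g': '1111'}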