# 10 - Strings and Text 1 Flashcards

What is the terminology for this topic?

an alphabet = {A, C, G, T} (DNA)

= ASCII, Unicode, etc

a string = ACAGTCCGGGACT

all strings indexed from position 1

Let S,T be two strings

|S| is length of S

S(i) is indexing S S(i .. j) denotes S(i)S(i+1) ..

S(j) ST is concatenation of S and T

What is a substring?

ACAGTCCGG**GACTG**ACG

a string contained within a string a substring of S is S(i .. j) (i

What is a common substring?

TCCACT**GACTG**CTGC

ACAGTCCGG**GACTG**ACG

GACTG is a common substring U is a common substring of S and T if U is a substring of both S and T

What is a subsequence?

AC**A**G**T**CC**G**GGA**C**T**GA**CG

obtained by deleting 0 or more characters from the string

What is a common subsequence?

A**CA**G**T**CC**G**GG**ACTG**A**CG**

T**C**C**A**C**TGACTGC**T**G**C

U is a common subsequence of S and T if U is a subsequence of both S and T

What is a prefix?

**ACAGT**CCGGGACTGACG

S(1 . . k) for some k (1 k n), where n = |S|

What is a suffix?

ACAGTCCGGG**ACTGACG**

S(k . . n) for some k (1 k n), where n = |S|

What is star / *?

* = {,A,C,G,T,AA,AC,AG,AT,CA,…}

set of all strings composed of symbols from the alphabet

What is the leaf terminology?

a leaf node: no children

a branch node: 1 or more children

a unary node: exactly 1 child

a binary node: exactly 2 children

What is the naive solution to finding longest repeated substring?

- String S of length n, build n x n array A with A(i, j) = 1 if S(i) = S(j) and A(i, j) = 0 otherwise
- repeated substrings are represnted by sequences of 1’s on a diagonal
- scan all diagonals to find longest repeat
- O(n
^{2}) time and space - repeated substrings may overlap

What is naive solution to common substrings?

- Find longest common substring of two strings
- two strings S, T of lengths m, n
- build similar m x n array of 0’s and 1’s
- gives longest common substring of T and S in O(mn) time and space

Using trees for suffixes

- Let S be string of length n
- k
^{th}suffix of S is the suffix S(k..n) - S has n suffixes, including S itself

- k
- Data structures to store the n suffixes of S
- Suffix Trie - O(n
^{2}) space - Suffix Tree - O(n) space

- Suffix Trie - O(n

What problems use a suffix tree to solve them?

- Longest repeat of S
- O(n) time/spacespace

- Longest common substring
- O(m+n) time/space for 2 strings S,T where m = |S| and n = |T|

- Multiple string searching
- O(n+r) time/space, for long peice of text T (n = |T|) and strings of total length r

What is a Suffix Trie?

- A trie is a multiway branching tree T
- used to store strings C over an alphabet Σ

What are the properties of a Suffix Trie?

- suffix trie T for string S is a trie used to store all suffixes of S
- T is rooted at some vertex v
- Each edge of T is labelled with some σ∈Σ
- No 2 children of a node have same edge label
- Each node w corresponds to a string S∈Σ* (concatentation of edge labels on the path from v to w) - S is the path label of w
- Each node is marked according to whether ir corresponds to a string S∈C
- each suffix of S represented by unique leaf node of T
- if some suffix of S is a prefix of another suffix
- then a suffix trie for S may not exist, e.g. consider S = queue

- can ensure the suffix trie exists by appending to s a character $ not appearing in S, before constructing T