Flashcards in Topic A: Computer Science Deck (78):

1

## Shortest path between two vertices (from a weighted undirected graph).

### Can be solved by Dijkstra’s shortes path algorithm. This algorithm can fail if the edge weights are negative. In this algorithm, you initalize all distances for a node as infinity, then choose a node that you wish to find the shortest path for. Calculate the distance from that node to all of its neighbors and update the distnace values. Se the node as visited and don’t visit it again. Choose the node with the smalledst value and repeat until all nodes have been visited

2

##
Polynomial running time algorithm

### Asymptotic analysis allows us to analyze the computational complexity of an algorithm, ignoring the effects of what computer we are using, how good our compiler is, etc. If an algorithm runs in polynomial time, it has O(nk) for k > 1 where bigh Oh notation denotes the asymptotic upper bound of complexity

3

## Perfect binary tree (graph theoretic)

### A binary tree is a tree where each parent has at most two children. A full binary tree is a binary tree where every node is a leaf or has two children. A perfect binary tree is a full binary tree where all the leaves are at the same tree depth.

4

## Tail recursion (algorithms)

### blank

5

## View (relational database)

### blank

6

## Regularization (machine learning)

### blank

7

## Collision (hashing function)

### blank

8

## Dynamic memory allocation

### blank

9

##
Explain what is over fitting with respect to training an algorithm and how can we detect it. For linear regression using L2 penalty, write the function optimized and give at least one technique that can be used to avoid over fitting in this scenario (5 points).

### blank

10

## Define model selection as it pertains to machine learning

### blank

11

## Suppose we have k different letters distributed independently and identically with frequencies, f1 ... fk, in a string. Define the entropy of this string. Describe the kind of string with maximum entropy.

### blank

12

## Describe the elements of a Hidden Markov Model. That is, what are the variables, parameters, observations, etc.?

### blank

13

##
Briefly describe bootstrap resampling distribution.

### blank

14

##
What are the differences between stacks, queues, and priority queues?

### blank

15

## Compare quicksort and mergesort and list pros and cons.

### blank

16

## Write three mathematical conditions that imply an undirected graph G=(V,E) is a tree.

### blank

17

## If a query string P is length | P | and database size is S, what is the computational complexity of exact search for P in a database with a Burroughs-Wheeler Transform based indexing scheme?

###
blank

18

## For a strictly unrooted binary tree (all vertices have degree one or degree three), how many edges are there for a tree with 10 leaves?

### blank

19

##
Define SNP using the word population.

### blank

20

##
Briefly define False Discovery Rate.

### blank

21

##
Briefly define codon bias.

### blank

22

##
Define maximum likelihood (ML) and maximum parsimony (MP) criteria for phylogeny recon- struction. List three advantages of ML over MP, and three advantages of MP over ML.

### blank

23

## Please define what the following statement means: In vivo protein folding problem.

### blank

24

##
Please define what the following statement means: Local alignment problem (pairwise sequence alignment).

### blank

25

##
We wish to describe a dynamic system with periodic dynamics to model circadian rhythm. We constructed a differential equation system with all real-valued eigenvalues. Is this an appropriate model for circadian rhythm? Describe the rationale for your answer.

### blank

26

## Briefly describe a normalization method for RNA sequencing based expression quantification and its rationale.

### blank

27

## Define polymorphism and give one example. Overloading?Overwriting?

### blank

28

## Give an example of a programming language that is (a) statically typed and one that is (b) dynamically typed.

### blank

29

## Chado is a generic genomic schema. It is purposefully ”weak-typed”. What does that mean?

### blank

30

## In memory allocation, what is garbage collection? List one advantage and one disadvantage of garbage collection.

### blank

31

## In which of the following problems can you estimate the parameters a and b using linear regres-sion A) y = ax1 + bx2, B) y = ax12 + bex2 , C) y = abx1 + bx2, D) y = ax1 + sin bx2.

### blank

32

## Define a “functional language,” give an example of one, and give one advantage of using such a language.

### blank

33

##
Give two reasons that Java or C++ is superior to Perl for large projects.

### blank

34

## Describe two data structures for representing an acyclic graph. Compare their pros and cons.

### blank

35

## A min-heap with no duplicate values is a binary tree in which the value in each node is smaller than the values in its children’s nodes. Suppose the following values are inserted sequentially into an empty heap in the following order: 21, 65, 42, 6, 71, 69. What will the contents of the min-heap be after the values are inserted?

### blank

36

## A k-ary tree is a tree in which every node has at most k children. Give a (reasonably tight) upper bound for the maximum number of leaves in a k-ary tree with n nodes and height h.

### blank

37

## Briefly explain a “hashing” function.

### blank

38

## Write three mathematical conditions that imply an undirected graph G=(V,E) is a tree.

### blank

39

## How many possible unrooted leaf-labeled binary trees (i.e., the type of trees in phylogenies) are there when we have 10 leaves (i.e., 10 terminal vertices)?

### blank

40

## What is a binary search tree?

### blank

41

## A dictionary is a data structure that stores a set of integers and allows the operations of insert, delete, and find. Describe an implementation of the dictionary structure and state the running time of each of these operations in a dictionary with n items.

### blank

42

## A queue is a first-in-first-out data structure that allows the operation enqueue and dequeue. Explain potential computational efficiency problems with the following simple implementation: The elements of the queue are stored in an array and the indices of the front and back of the queue are stored in two variables, front and back, which are used receptively by dequeue and enqueue operations.

### blank

43

##
What does it mean for a problem to be in NP? Is the problem of determining if there is a path

of length at most L from vertex s to vertex t in a weighted directed graph in NP?

### blank

44

## Define O(f(n)).

### blank

45

## Briefly define the NP-complete class of problems. Give one example of a problem that is NP-complete and one that is not.

### blank

46

## What is a witness or a certificate for a NP class problem?

### blank

47

## Briefly define the SAT (satisfiability) problem in algorithm analysis.

### blank

48

## Describe in words the “knapsack” problem.

### blank

49

## Describe in words the “set cover” problem.

### blank

50

## Briefly describe a O(n) clustering algorithm.

### blank

51

## What is a “foreign key”?

### a field or attribute in one table that refers to a primary key in another table

52

## Name two key differences between an Excel spreadsheet application and a relational database management system.

### blank

53

## What’s a denormalized schema and why would you use it?

### blank

54

## You have a table of gene IDs and names and a table of transcript IDs and descriptions. Given that a gene can have multiple transcripts, write down a schema for these capturing their relationship.

### blank

55

## Prove by induction that any tree with n nodes contains n − 1 edges.

###
base case: tree with one node has zero edges and 1-1 = 0

induction hypothesis: assume every tree with n nodes has n-1 edges

inductive step: there exists a tree with n+1 nodes and more than n edges (by contradiction, this doesn’t exist)

56

## Mathematically define the problem that Dijkstra’s Shortest Path algorithm aims to solve. Give an example condition where the algorithm can fail.

### Dijkstra’s shortest path algorithm aims to solve the problem of finding the shortest path from a node to every other node in a weighted acyclic graph. The algorithm can fail if the edge weights are negative. The algorithm involves associating with each node a ”distance” value initialized to infinity. Then choose a node at random (or choose node we wish to find the shortest path for) and calculate the distance from that node to all of its neighbors and update the distance values to that calculated value. Then set the arbitrary node chosen to be ”visited” and do not visit the node again. Then chose the node with the smallest distance value and repeat, meaning we update its distance value with the weights for all its neighbors and set the node to visited. We do this for all nodes (until all nodes have been visited).

57

## Briefly describe a O(n log n) complexity algorithm to sort n elements in an array.

### blank

58

## Suppose we want to determine the largest and second-largest integers in a set of n integers. Describe a procedure that performs at most n + log2 n comparisons to do this.

### We will find MAX first by using the tournament method. Elements are paired off and compared in rounds. In each round after the first one, the winners from the preceding round are paired off and compared. (If at any round the number of keys is odd, then one of them simply waits for the next round.) We can describe this tournament by a tree, each leaf contains an element, and at each subsequent level the parent of each pair contains the winner. The root contains MAX. We have n-1 comparisons in total. In the process of finding MAX, every element except MAX loses in one compar- ison. The second largest element must lose directly to MAX. Since MAX is involved in at most log n comparisons, the second largest must be one of at most log n elements. We find the maximum of these by log n - 1 comparisons, that’s the second largest element. Thus the total number of comparisons is n + log n - 2.

59

## Describe an O(n) algorithm to find a random permutation of the integers from 1 to n. You may assume that you have a procedure Random (j) that generates an integer uniformly at random between 1 and j (inclusive). Show why every one of the n! permutations is equally likely to be generated by your procedure.

###
FisherYates shuffle

To shuffle an array a of n elements (indices 0..n-1):

forifromn 1downto1do

j ← Random(i)

exchange a[j] and a[i]

With a high-quality unbiased random number source, the algorithm is also guaranteed to produce unbiased results.

60

## Quicksort has running time O(n2) in the worst case but O(nlogn) on average. True or false? Please explain your reasoning.

###
Yes this is true

worst case is when you always pick the highest or lowest value as your pivot – typically you will choose something in the middle

Actual implementation chooses a random pivot

61

## Name three sorting algorithms that are O(nlogn)

###
Merge sort

Quick Sort

Heap Sort

62

## For what kind of data might the merge sort algorithm be faster than quicksort? When might it be slower?

###
Both merge sort and quicksort have a average time complexity: O(n log n)

If the data is well-sorted, and the pivot picking is also in the order of the data, it is the worst case scenario for quicksort. (Although this can be avoid by randomly selecting the pivot.) In the worst case, we would still get O(nlogn) for merge sort, and O(n2) for quicksort.

Quicksort is often faster as it has smaller hidden constant for it’s time complexity.

Besides, quicksort is in-space where merge sort depends on the structure you’re sorting (list: require

constant space, array: need O(n) space).

63

## How to design a recursive algorithm?

### First determine the base case, then determine general case. Combine the two to make one recursivie algorithm. Each general case should call the recursive algorithm, and make the program smaller, and closer to reaching the base case. Finally, the base case should terminate the algorithm (base case will not call the the recursive algorthm)

64

## What are the worst-case running times of QuickSort, MergeSort, and InsertionSort? Why is QuickSort often the method of choice?

###
worst case these algorithms are O(n2), O(nlogn), O(n2), respectively

quick sort is often the method of choice because in real situations you are usually not that unlucky to always choose the max or the min for the pivot. Also, it usually smaller and it require less space to implement compare to merge sort.

The average of quicksort is O(n log n), but it is optimal for randomly sorted keys

65

##
(a) What does it mean that a sorting algorithm is stable?

(b) Why is stability for sorting important?

###
(a) Maintain the relative order of records with equal keys (i.e., values). Ex. If R appears before S in the original list, then R will always appear before S in the sorted list.

(b)The order of the first sort is preserved when there are tie in the second sort and so forth Stable sort will always return same solution

66

## Suppose we have the problem of finding the string ACT in a database of DNA strings. State the decision version of this problem.

### Given a database of DNA strings S, is ACT a substring of S?

67

## The Burrows-Wheeler Transform also implements what data structure involving strings (i.e., what important string data structure can be also obtained from BWT)?

### suffix array.

68

## In order to quickly search for exact matches to a query string Q in a database S, we can use the strategy of constructing an index to S by k-mer seeds. We can also create a gapped-seed where instead of consecutive k letters, we use k letters, each separated by a gap of size q. Briefly describe the advantage of such gapped seeds.

###
1. The database S such human genome is usually highly redundant. We might have some indices that map to many positions, greatly reducing efficiency. By placing gaps in the index, we can help avoid the redundancies that are often spatially adjacent.

For example, a mammalian genome has hundreds of thousands of short repeated sequences. A gapped p-mer whose total span is greater than the typical size of the repeated sequences has a better chance of covering unique sequences that yield unique positions.

2. By using multiple gapped p-mer with different gap sizes, we can implement a multiple filtration procedure. In the search, we can retrieve all positions that satisfy multiple gapped indices.

69

## Running time for sorting algorithms have a theoretical lower bound of O(n log n), but the running time of the radix sort algorithm is O(n). There are some details missing in the seemingly contradictory statement; fill in the details so it is correct.

### The lower bound of O(n log n) is for comparison sorts, which only use the comparisons to deter- mine the relative order of elements. However, radix sort is not comparison sort algorithm.

70

## In a rooted tree, the leaves are labeled with integers. We want an algorithm that will label each node v with the maximum label on any leaf that is in the subtree of v. What traversal algorithm would you use to solve this problem? Briefly explain how the algorithm would work.

### Depth first search. I would do a post-order traversal of the tree so that I would explore the left subtree and the right subtree and return the max of the children to the sub-root so the final step would be taking the max of the root’s left and right child

71

## Give an example of a O(n log n) sorting algorithm.

###
Merge sort: divide-and-conquer algorithm

recursively divide the list, sort and merge

72

## Briefly outline an algorithm to optimize a quadratic real-valued function of two variables.

###
The Newton-Raphson Method

73

## Given n numbers, give an O(k) algorithm to sample k distinct numbers at random from it. Show that you pick any subset with probability exactly 1/C(n,k).

###
for i = 1 to k{ \\o(k)

r = random(i, n) \\o(1)

swap(i, r) \\O(1)

}

As we randomly picked k elements, if we picked it with an order, the probability is 1/P (n, k). Here we pick it without order concern, there will be k! permutations for the same selected set, so the probability is k!/P(n,k) = 1/C(n,k)

74

## Give an example algorithm that uses divide and conquer.

###
An example of an algorithm that uses divide and conquer is quick sort, an algorithm for sorting a list of n elements. First, put the elements into an array. Then pick an arbitrary pivot point and put all elements smaller than that pivot into a ”left” array and all elements greater than that pivot into a ”right” array and then return the concatenated array after calling the same sort of algorithm recur- sively on the left and right array. Each iteration divides the problem into smaller pieces and works on the smaller pieces of the problem, hence the name ”divide and conquer”.

75

## Briefly describe the difference between an approximation algorithm and a heuristic algorithm.

###
Approximation: guaranteed to be within some factor of the optimal solution

Heuristic: no guarantee that within certain factor of optimal solution. Something can be reasonable to solve the problem, but we cannot prove it

76

## Briefly describe the idea behind branch-and-bound algorithmic strategy and describe an example.

###

used to find optimal solutions of various optimization problems, especially in discrete and combina- torial optimization. It consists of a systematic enumeration of all candidate solutions, where large subsets of fruitless candidates are discarded all at once, by using upper and lower estimated bounds of the quantity being optimized.

This is often used for finding the top local alignments in a database (BLAST) – stop investigating that alignment once it has reached a score that is not as good as current

77

## Briefly explain why dynamic programming is important, and give one example of the use of dynamic programming in computational biology.

###
The dynamic programming approach seeks to solve each subproblem only once, thus reducing the number of computations. This is especially useful when the number of repeating subproblems is expo- nentially large. Usually dynamic programming keep the result of each subproblem in record to avoid recalculating.

Example: BLAST

78