# INTRO+DATASETS Flashcards

1
Q

What is Data Science

A

The process of building, cleaning, and structuring datasets in order to analyse them and extract meaning

2
Q

Process of Data Science

A
2. Get data
3. Explore data
4. Model data
5. Visualize and communicate results
3
Q

key principles in DS

A
• get many data sources
• understand how the data were collected
• use statistical models
• understand correlations
• have good communication skills
4
Q

What does the discussion of probability include

A

Random experiments that produce a set of possible outcomes (there can be infinitely many outcomes)

5
Q

elements of probability model (uncertainty of experiment)

A
• sample space (Ω, omega): the set that contains all possible outcomes. Outcomes are mutually exclusive and collectively exhaustive. An event is a collection of one or more outcomes, i.e. a subset of the sample space.
• probability function P(A): assigns event A a number between 0 and 1. The complement of event A is A^c, with P(A^c) = 1 - P(A).
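
A minimal R sketch of the two elements above. The fair six-sided die, the event "roll is even", and the equally-likely-outcomes assumption are illustrative choices, not part of the original card.

```r
# Hypothetical experiment: one roll of a fair six-sided die
omega <- 1:6                  # sample space: all possible outcomes
A <- c(2, 4, 6)               # event "roll is even", a subset of omega

# Assuming equally likely outcomes, P(A) = |A| / |omega|
p_A <- length(A) / length(omega)

A_c <- setdiff(omega, A)      # complement: outcomes in omega but not in A
p_A_c <- length(A_c) / length(omega)

print(p_A)                                # 0.5
print(p_A_c)                              # 0.5
print(isTRUE(all.equal(p_A_c, 1 - p_A)))  # TRUE
```
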
6
Q

conditional probability

A

Probability of event A given that event B has occurred. P(B) is the denominator: P(A|B) = P(A ∩ B) / P(B).
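
A small worked example in R may help here. It assumes a fair six-sided die with equally likely outcomes; the events A and B and the helper `prob()` are hypothetical names chosen for illustration.

```r
# Hypothetical experiment: one roll of a fair six-sided die
omega <- 1:6
A <- c(4, 5, 6)   # event "roll is greater than 3"
B <- c(2, 4, 6)   # event "roll is even"

# Helper (illustrative): probability of an event under equally likely outcomes
prob <- function(event) length(event) / length(omega)

# P(A | B) = P(A and B) / P(B); P(B) sits in the denominator
p_A_given_B <- prob(intersect(A, B)) / prob(B)
print(p_A_given_B)   # 2/3, printed as 0.6666667
```

Knowing the roll is even narrows the sample space to {2, 4, 6}, of which two outcomes are greater than 3.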

7
Q

independent

A

A and B are independent if the occurrence of B provides no information about A. Equivalently, P(A ∩ B) = P(A) * P(B).
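
The product rule can be checked numerically. This sketch assumes two tosses of a fair coin with four equally likely outcomes; the events are illustrative.

```r
# Hypothetical experiment: two tosses of a fair coin (4 equally likely outcomes)
omega <- expand.grid(first = c("H", "T"), second = c("H", "T"))

A <- omega$first == "H"    # event "first toss is heads"
B <- omega$second == "H"   # event "second toss is heads"

# With equally likely outcomes, a probability is the share of outcomes in the event
p_A  <- mean(A)
p_B  <- mean(B)
p_AB <- mean(A & B)

print(p_AB)               # 0.25
print(p_A * p_B)          # 0.25
print(p_AB / p_B == p_A)  # TRUE: P(A | B) equals P(A), so B tells us nothing about A
```
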

8
Q

Variable?

A

A variable is any characteristic observed in a study. It summarises ALL outcomes of a random process.

9
Q

quantitative variable

A

there is meaningful distance between any 2 points of data

10
Q

types of categorical variable

A
• ordinal
• nominal

11
Q

types of quantitative variable

A
• discrete (separate numbers)
• continuous (possible values form an interval)

12
Q

distribution of a variable (probability distribution)

A

List of the possible outcomes and their associated probabilities

13
Q

Cumulative probability distribution

A

probability that the discrete variable is less than or equal to a particular value.
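
As a sketch of both a probability distribution and its cumulative version, the R code below tabulates the number of heads in two fair coin tosses. That experiment is an assumption made for illustration.

```r
# Hypothetical discrete variable: number of heads in two fair coin tosses
outcomes <- 0:2
probs <- c(0.25, 0.5, 0.25)   # probability of each outcome; sums to 1

dist <- data.frame(
  x     = outcomes,
  p     = probs,              # probability distribution: P(X = x)
  cum_p = cumsum(probs)       # cumulative distribution: P(X <= x)
)
print(dist)

# The same numbers from R's built-in binomial functions
print(dbinom(outcomes, size = 2, prob = 0.5))
print(pbinom(outcomes, size = 2, prob = 0.5))
```
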

14
Q

probability density function (used for continuous variables, as it is impossible to list every value and its probability)

A

Probability density function (PDF) is a function whose area under the curve over an interval gives the probability that the value of a continuous variable falls within that interval.

15
Q

cumulative distribution function

A

Cumulative distribution function (CDF) is the probability that the variable is less than or equal to a particular value.
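
To connect the PDF and CDF cards, here is a short R sketch. The standard normal distribution and the interval [0, 1.5] are assumptions picked only for illustration.

```r
# Assumed distribution for illustration: standard normal (mean 0, SD 1)
a <- 0
b <- 1.5

# dnorm() is the density. Its value at a point is NOT a probability.
print(dnorm(0))

# Probability of falling in [a, b] = area under the density over [a, b]
area <- integrate(dnorm, lower = a, upper = b)$value

# The same probability from the CDF: P(X <= b) - P(X <= a)
from_cdf <- pnorm(b) - pnorm(a)

print(area)
print(from_cdf)
```

The two printed probabilities should agree up to numerical integration error.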

16
Q

modal category?

A

category with the highest frequency

17
Q

Bar plot (common way to display categorical variable)

A

One vertical bar for each possible category that could occur, with the height proportional to the frequency of that category.
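
A possible R sketch of the modal category and a bar plot, using a made-up vector of colours (the data and variable names are hypothetical).

```r
# Made-up categorical data for illustration
colours <- c("red", "blue", "red", "green", "red", "blue")

counts <- table(colours)                    # frequency of each category
modal  <- names(counts)[which.max(counts)]  # category with the highest frequency

print(counts)
print(modal)    # "red"

# One bar per category, height proportional to its frequency
barplot(counts, xlab = "colour", ylab = "frequency")
```
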

18
Q

Histogram(quantitative variable)

A
• Divide the range of data into intervals of equal width.
• Count the number of observations that fall within each interval.
• Label the intervals on the x-axis.
• Draw a bar over each interval
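
The steps above can be sketched in R. The data and the interval width of 2 are made up for illustration; `hist()` with `plot = FALSE` returns the counts without drawing.

```r
# Made-up quantitative data for illustration
x <- c(1, 2, 2, 3, 5, 6, 6, 7, 9, 10)

# Step 1: divide the range into intervals of equal width (here width 2)
breaks <- seq(0, 10, by = 2)

# Step 2: count observations per interval (no plot yet)
h <- hist(x, breaks = breaks, plot = FALSE)
print(h$counts)   # 3 1 3 1 2

# Steps 3 and 4: label the x-axis and draw a bar over each interval
hist(x, breaks = breaks, xlab = "x", main = "Histogram of x")
```

By default `hist()` treats intervals as closed on the right, so a value of 2 is counted in the first interval.
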
19
Q

Weakness of range?

A

sensitive to extreme observations

20
Q

variance definition

A

Average of the squared deviations from the mean
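
A quick R check of the definition, on made-up data. Note that R's built-in `var()` divides by n - 1 (the sample variance), so it differs slightly from the plain average of squared deviations.

```r
# Made-up data for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

deviations <- x - mean(x)        # distance of each point from the mean (mean is 5)
pop_var <- mean(deviations^2)    # average squared deviation (divides by n)

n <- length(x)
sample_var <- sum(deviations^2) / (n - 1)   # what var() computes

print(pop_var)      # 4
print(sample_var)
print(var(x))       # same as sample_var
print(sqrt(pop_var))  # 2, the corresponding standard deviation
```
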

21
Q

empirical rules of SD

A
• about 68% of observations fall within ±1 SD of the mean (for bell-shaped distributions)
• about 95% fall within ±2 SD
• almost all fall within ±3 SD (check for outliers beyond this)
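
These percentages can be reproduced in R under the assumption that the data follow a normal distribution, which is where the rule comes from.

```r
# Assumption: the variable is normally distributed.
# pnorm(k) - pnorm(-k) is the probability of landing within k SDs of the mean.
within <- sapply(1:3, function(k) pnorm(k) - pnorm(-k))

print(round(within, 4))   # 0.6827 0.9545 0.9973
```
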
22
Q

interquartile range

A

Difference between the upper and lower quartiles, Q3 - Q1 (robust to outliers)

23
Q

5 number summary

A

min, lower quartile, median (x_0.5), upper quartile, max (min and max are taken WITHOUT considering outliers)
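
A hedged R sketch comparing the range, the interquartile range, and the five-number summary on made-up data, with and without one extreme observation. Note that R's `fivenum()` reports the actual minimum and maximum, outliers included; `boxplot.stats()` is one way to get whisker ends that exclude them.

```r
# Made-up data, then the same data with one extreme observation added
x     <- c(1, 3, 5, 7, 9, 11, 13)
x_out <- c(x, 100)

# Range reacts strongly to the extreme value
print(diff(range(x)))       # 12
print(diff(range(x_out)))   # 99

# Interquartile range barely moves
print(IQR(x))               # 6
print(IQR(x_out))           # 7

# Five-number summary: min, lower quartile, median, upper quartile, max
print(fivenum(x))           # 1 4 7 10 13

# Whisker ends that leave out flagged outliers, plus the outliers themselves
bs <- boxplot.stats(x_out)
print(bs$stats)
print(bs$out)               # 100
```
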

24
Q

when does an association exist?

A

When a particular value of one variable (the response/dependent variable) is more likely to occur with certain values of another variable (the explanatory/independent variable)

25
Q

covariance

A

measures the extent to which two variables move in the same direction

26
Q

correlation

A

covariance between two variables divided by the product of their standard deviations
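
The relationship between covariance and correlation can be verified in R. The two vectors below are made up for illustration.

```r
# Made-up paired data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

cov_xy <- cov(x, y)                   # positive: x and y tend to move together
cor_manual <- cov_xy / (sd(x) * sd(y))  # covariance divided by product of SDs

print(cov_xy)       # 1.5
print(cor_manual)
print(cor(x, y))    # same value as cor_manual
```

Unlike covariance, correlation has no units and always lies between -1 and 1.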

27
Q

get the current working directory

A

getwd()

28
Q

get data types

A

class(a) (assuming a has been assigned a value)

29
Q

true or false class?

A

logical

30
Q

Creating a vector of numbers and name it x

A

x=c(1,2,3,4) (x prints as 1 2 3 4; its class is numeric)

31
Q

length of vector

A

length(x)
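
Putting the last few cards together, a short R session might look like this. The variable `flag` is a hypothetical name added for illustration.

```r
x <- c(1, 2, 3, 4)    # create a numeric vector named x

print(x)              # 1 2 3 4
print(class(x))       # "numeric"
print(length(x))      # 4

flag <- x > 2         # comparison yields TRUE/FALSE values
print(flag)           # FALSE FALSE TRUE TRUE
print(class(flag))    # "logical"
```
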

32
Q

alternative ways to write

x = matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

A

x=matrix(c(1,2,3,4),2,2)

x=matrix(1:4,2,2)

33
Q

fill the matrix by row first?

A

y=matrix(1:4,2,2,byrow=TRUE)

34
Q

class of matrix?

A

"matrix" "array"

35
Q

dimension of matrix

A

dim(x) returns 2 2 (rows, then columns)
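
The matrix cards above can be tried together. This is a sketch; the class output shown assumes R 4.0 or later, where matrices also report "array".

```r
# Default: values fill the matrix column by column
x <- matrix(1:4, nrow = 2, ncol = 2)
print(x)
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4

# byrow = TRUE fills row by row instead
y <- matrix(1:4, 2, 2, byrow = TRUE)
print(y)
#      [,1] [,2]
# [1,]    1    2
# [2,]    3    4

print(dim(x))     # 2 2 (rows, then columns)
print(class(x))   # "matrix" "array" on R >= 4.0
```
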

36
Q

extract component from row 2, column 3 of matrix A

A

A[2,3]

37
Q

obtain the first row of A

A

A[1,]

38
Q

delete first row of A

A

A[-1,] (returns A without its first row; assign the result, e.g. A=A[-1,], to actually change A)
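
A small R sketch of the three indexing cards above, using a made-up 2 x 3 matrix. The `drop = FALSE` variant is an extra shown for illustration: without it, a single remaining row is simplified to a plain vector.

```r
# Made-up 2 x 3 matrix, filled column by column
A <- matrix(1:6, nrow = 2, ncol = 3)

print(A[2, 3])    # 6: element in row 2, column 3
print(A[1, ])     # 1 3 5: the whole first row
print(A[-1, ])    # 2 4 6: everything except the first row

# A itself is unchanged until the result is assigned
print(dim(A))     # 2 3

# Keep the result as a matrix rather than a vector
B <- A[-1, , drop = FALSE]
print(dim(B))     # 1 3
```
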

39
Q

list out all objects?

A

ls()

40
Q

remove one or all objects?

A
• rm(x)
• rm(list=ls()) (list must be a character vector of object names; ls() supplies all of them)
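
To try `ls()` and `rm()` without wiping a real workspace, this sketch works inside a scratch environment. The environment `e` and the `envir` arguments are additions for safety; at the console the plain `ls()`, `rm(x)`, and `rm(list=ls())` forms act on the global workspace instead.

```r
# Scratch environment so nothing in the real workspace is touched
e <- new.env()
assign("x", 1, envir = e)
assign("y", 2, envir = e)

print(ls(e))                    # "x" "y": list all objects

rm("x", envir = e)              # remove one object by name
print(ls(e))                    # "y"

rm(list = ls(e), envir = e)     # remove everything that ls() reports
print(length(ls(e)))            # 0
```
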