MSCDSA04 - Data Exploration Flashcards

1
Q

In Python, how do you define a function?

A

Use ‘def’, then the function name, then the inputs it expects in parentheses, and finish with ‘:’. For example:

def myFunction(x, y):
    z = x + y
    return z
2
Q

What are descriptive statistics?

A

Statistics that summarise the data concisely, together with different ways of visualising the data. For example:

Scatter graphs
Mean, median, mode, skewness, etc.

3
Q

What is Exploratory data analysis?

A

Looking for patterns, differences, and other features that address the questions we are interested in.

At the same time, checking for inconsistencies and identifying limitations.

4
Q

What is a cross-sectional study vs a longitudinal study?

A

A cross-sectional study captures a snapshot of a group at a single point in time.
Everyone in the population should have an equal chance of being selected.
A longitudinal study observes the same group repeatedly over a period of time.

5
Q

What is the name for people who participate in a survey?

A

Respondents

6
Q

In Colab, how do you load a module?

A

Use the ‘import’ keyword, e.g. from collections import ..
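
As a hedged illustration, the usual Python import forms look like this in a Colab code cell. The modules chosen here (math and collections) are only examples and are not specified by the card.

```python
# Three common ways to load a module (or part of one) in Python / Colab.
import math                      # import the whole module
import collections as col        # import the module under a shorter alias
from collections import Counter  # import a single name from a module

# Each form gives access to the same underlying objects.
print(math.sqrt(16))             # 4.0
word_counts = Counter(["a", "b", "a"])
print(word_counts["a"])          # 2
print(col.Counter is Counter)    # True
```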

7
Q

In Python, how do you get an index number for each item in a list?

A

Use a for loop and the ‘enumerate’ function:

for myIndexNumber, i in enumerate(myListName):
    print(myIndexNumber, "\t", i)

The above iterates through a list and prints each index number, a tab, and the list value at that index, e.g.:

0 0
1 23
2 34
3 17

8
Q

In Python, how do you sort a list of items?

A

The sorted() function returns a sorted list of the specified iterable object.

You can specify ascending or descending order. Strings are sorted alphabetically, and numbers are sorted numerically.

a = ("h", "b", "a", "c", "f", "d", "e", "g")
x = sorted(a, reverse=True)
print(x)

produces: ['h', 'g', 'f', 'e', 'd', 'c', 'b', 'a']

9
Q

What comes first? Machine learning or data exploration?

A

Data exploration comes before machine learning.

You cannot do the machine learning until you understand something about what the data has to tell you.

10
Q

What are the 5 basic terms in statistics?

A

1) Population = everything (the whole group) we are interested in
2) Sample = the group selected from the population
3) Statistic = a quantity we calculate from the sample data
4) Parameter = a number that is a property of the population. A statistic is an estimate of a parameter, e.g. the sample mean of a variable (say ‘height’) estimates the population mean
5) Variable = a characteristic of interest for each person or thing in the population. It is what is measured, and can be numeric (weight, time, etc.) or categorical (eye colour, gender, ethnicity)

Data are the actual values of the variable: may be numbers or words.
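
The relationship between these terms can be sketched in Python. This is only an illustration: the population of heights is simulated, and the sizes (1,000 people, a sample of 50) are assumptions made for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of 1,000 people. 'Height' is the variable.
population = [random.gauss(170, 10) for _ in range(1000)]

# Parameter: a property of the whole population.
population_mean = statistics.mean(population)

# Sample: a random subset of the population.
sample = random.sample(population, 50)

# Statistic: calculated from the sample; it estimates the parameter.
sample_mean = statistics.mean(sample)

print(f"parameter (population mean): {population_mean:.1f}")
print(f"statistic (sample mean):     {sample_mean:.1f}")
```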

11
Q

What are inferential statistics?

A

Methods that rely on probability to draw conclusions from sample data, for example to:

1) test (verify) a hypothesis
2) find a line of best fit, etc.
3) test whether there is a relationship between variables.

12
Q

What happens to the mean if the data is skewed?

A

Skew distorts the mean. Because the mean averages every value, a few very high or very low outliers pull it towards them, so it can be too big or too small to represent a typical value.

13
Q

Which is more sensitive to skewness: mean, median, or mode?

A

The mean is the most sensitive.

The mode and median are less sensitive.
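
A small sketch of this effect, using made-up salary figures (the numbers are purely hypothetical): adding one extreme value moves the mean a long way, while the median and mode barely change.

```python
import statistics

# Hypothetical salaries (in thousands); the second list adds one extreme value.
salaries = [20, 22, 22, 25, 27, 30]
salaries_with_outlier = salaries + [500]

for label, data in [("original", salaries), ("with outlier", salaries_with_outlier)]:
    print(label,
          "mean:", round(statistics.mean(data), 1),
          "median:", statistics.median(data),
          "mode:", statistics.mode(data))
```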

14
Q

What is a ‘repository’ in GitHub and what is it for?

A

A repository is usually used to organize a single project.

Repositories can contain folders and files, images, videos, spreadsheets, and data sets – anything your project needs.

GitHub recommends including a README, a file with information about your project.

15
Q

What are ‘Pull Requests’ in GitHub? What are they for and how do they work?

A

Pull Requests are the heart of collaboration on GitHub.

When you open a pull request, you’re proposing your changes and requesting that someone review your contribution and merge it into their branch.

Pull requests show ‘diffs’, or differences, of the content from both branches. The changes, additions, and subtractions are shown in green and red.

As soon as you make a commit, you can open a pull request and start a discussion, even before the code is finished.

By using GitHub’s @mention system in your pull request message, you can ask for feedback from specific people or teams, whether they’re down the hall or 10 time zones away.

You can even open pull requests in your own repository and merge them yourself.

16
Q

Who was the Swedish statistician who presented ‘The Joy of Stats’?

A

Professor Hans Rosling

17
Q

Who was the doctor who presented ‘The Joy of Data’?

A

Dr Hannah Fry

18
Q

Describe 5 key steps of the data exploration and analysis process.

A

1) Sample the population.
2) Pick the population parameters desired.
3) Choose variables (numeric or categorical). Understand the type(s) of data (NOIR).
4) Use descriptive statistics such as mean, median and mode, variance, standard deviation.
5) Use Inferential Statistics to draw conclusions such as relationships, or to test hypotheses.

19
Q

What are the differences between the different types of statistical data?

A

N) Nominal = Qualitative; variables are ‘named categories’ with no mathematical properties

O) Ordinal = Qualitative with a rank order; the variable may be coded as a number but indicates only RANK order and can NOT be used for calculations (e.g. means)

I) Interval = Quantitative; the variable is numbers with equal sized steps, but no true ZERO point. Scores can be added, means calculated.

R) Ratio = Quantitative; same as interval but the variable is a numeric scale with a true ZERO point (e.g. age, miles per hour, etc). Ratios can be calculated.

20
Q

Demonstrate how the different types of data influence the types of analysis?

A

Nominal or ordinal data could be displayed in a frequency table, pie chart, or bar graph. Interval or ratio data can also be summarised with calculations such as the mean.
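
A minimal sketch of how the data type steers the analysis, using invented values: categorical (nominal) data is counted into a frequency table, whereas ratio data supports arithmetic such as a mean.

```python
from collections import Counter
import statistics

# Nominal data (hypothetical eye colours): count how often each category occurs.
eye_colours = ["brown", "blue", "brown", "green", "brown", "blue"]
frequency_table = Counter(eye_colours)
for colour, count in frequency_table.most_common():
    print(f"{colour:<6} {count}")

# Ratio data (hypothetical ages): arithmetic summaries are meaningful.
ages = [21, 25, 25, 30, 34]
print("mean age:", statistics.mean(ages))  # mean age: 27
```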

21
Q

What does NOIR stand for and what does it mean?

A

Different types of data in statistics:

N) Nominal = Qualitative; variables are ‘named categories’ with no mathematical properties

O) Ordinal = Qualitative with a rank order; the variable may be coded as a number but indicates only RANK order and can NOT be used for calculations (e.g. means)

I) Interval = Quantitative; the variable is numbers with equal sized steps, but no true ZERO point. Scores can be added, means calculated (e.g. a temperature scale): differences make sense but ratios do not (20˚ / 10˚)

R) Ratio = Quantitative; same as interval but the variable is a numeric scale with a ZERO point. (e.g. age, miles per hour, etc). Ratios can be calculated (e.g. exam scores)

22
Q

What sort of data comes from counting or measuring?

A

Quantitative data. It is always the result of counting or measuring.

23
Q

What are the 2 types of Quantitative data, and what are their properties?

A

Discrete and Continuous

Discrete data are the result of counting items (whole numbers)

Continuous data are the result of measuring (assuming accurate measurement), e.g. height or speed

24
Q

What 2 categories can statistical analysis be separated into?

A

1) Descriptive Statistics
2) Inferential Statistics

1) Descriptive statistics is about organising and summarising data, through:
a) Plotting / visualising data
b) Using numbers (e.g. finding averages, etc.)

2) Inferential statistics uses methods from probability theory to draw conclusions from data, to:
a) Test whether there is a relationship between variables
b) Test a hypothesis about a parameter of the population

25
Q

What are the 3 measures of central tendency?

A

1) Mean = the sum of the values of x divided by the number of values n, or
mean of x = (1/n) * ∑x

2) Median = the middle value, or the mean of the middle 2 values where n is even
3) Mode = the most common data value
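
As a quick check of these definitions, here is a sketch using Python's standard statistics module on made-up values (the data list is an assumption for illustration only).

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # hypothetical values

mean_manual = sum(data) / len(data)     # (1/n) * sum of x
median_value = statistics.median(data)  # n is even: mean of the middle two values
mode_value = statistics.mode(data)      # most common value

print(mean_manual, median_value, mode_value)  # 5.0 4.0 3
```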

26
Q

What is the ‘range’ of a data set and how is it calculated?

A

The ‘Range’ is a measure of dispersion or ‘spread’.

It is the difference between the largest and smallest values in the set: range = largest value − smallest value.
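
For instance, with a small made-up list of values:

```python
data = [4, 9, 1, 7, 12]  # hypothetical values

data_range = max(data) - min(data)  # largest value minus smallest value
print(data_range)  # 11
```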

27
Q

What is the ‘Variance’ of a population, how is it calculated?

A

The variance is a measure of dispersion. It is written σ² (lowercase sigma, squared).

1) Find the difference between each data point and the mean of the data set
2) Square the differences
3) Calculate the arithmetic mean of the squared differences

The variance is in squared units, so it is hard to interpret on its own. Taking its square root gives the Standard Deviation, which is in the same units as the data.
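
The three steps can be sketched directly in Python. The data values are invented for illustration, and statistics.pvariance (the standard library's population variance) is used only as a cross-check.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical population values

mean = sum(data) / len(data)              # 5.0
differences = [x - mean for x in data]    # step 1: difference from the mean
squared = [d ** 2 for d in differences]   # step 2: square the differences
variance = sum(squared) / len(squared)    # step 3: mean of the squared differences

print(variance)                           # 4.0
print(statistics.pvariance(data))         # same value from the standard library
```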

28
Q

What is the standard Deviation of a population, how is it calculated?

A

The Standard Deviation is a measure of dispersion. It is the square root of the variance.

In symbols: σ = √(σ²), where σ² is the variance.
(σ is lowercase sigma; the capital ∑ is the summation sign, not the standard deviation.)

1) Calculate the variance
2) Take the Square root of the variance
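
A short sketch of the two steps, reusing a made-up population (the values are assumptions for illustration). statistics.pstdev is the standard library's population standard deviation.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical population values

variance = statistics.pvariance(data)  # step 1: population variance
std_dev = math.sqrt(variance)          # step 2: square root of the variance

print(std_dev)                         # 2.0
print(statistics.pstdev(data))         # library shortcut for the same calculation
```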

29
Q

What is stratification, how can it improve on random sampling, and in what cases?

A

Stratification means dividing the population into classes (strata) and sampling from each one. It is possible, in some instances, to improve on simple random sampling by stratification of the population.

This is particularly true where the population is heterogeneous (i.e. made up of dissimilar groups) and the population can be stratified into homogeneous (i.e. similar) classes. These classes should define mutually exclusive categories.
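
One way to sketch stratified sampling in Python is shown below. The population, the three strata, and the helper name stratified_sample are all hypothetical, and drawing the same fraction from each stratum (proportional allocation) is just one possible design.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical heterogeneous population split into mutually exclusive strata.
population = (
    [("undergraduate", i) for i in range(600)]
    + [("postgraduate", i) for i in range(300)]
    + [("staff", i) for i in range(100)]
)

def stratified_sample(people, fraction):
    """Draw the same fraction at random from every stratum."""
    strata = {}
    for person in people:
        strata.setdefault(person[0], []).append(person)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, round(len(members) * fraction)))
    return sample

sample = stratified_sample(population, 0.1)
counts = {}
for group, _ in sample:
    counts[group] = counts.get(group, 0) + 1
print(counts)  # {'undergraduate': 60, 'postgraduate': 30, 'staff': 10}
```

Every stratum is guaranteed to appear in the sample in proportion to its size, which a simple random sample of 100 people would not guarantee.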

30
Q

What does heterogeneous mean?

A

Heterogeneous means made up of dissimilar groups.

31
Q

What does homogeneous mean?

A

Homogeneous means similar.

E.g. the population can be stratified into homogeneous (i.e. similar) classes

32
Q

In Pandas, what is a DataFrame? What is it used for?

A

A DataFrame is a tabular collection of values of potentially different types.

It is used for data manipulation
———-
Imagine it as a relational data table, with rows and named columns.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in Spark and R.

33
Q

In Pandas, what is a Series?

A

In Pandas, a Series is a single column of data.

A DataFrame contains one or more Series and a name for each Series.
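
A minimal sketch of the DataFrame / Series relationship, assuming pandas is installed (it is available by default in Colab). The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a named column (a Series).
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "height_cm": [165.0, 180.5, 172.0],
    "smoker": [False, True, False],
})

heights = df["height_cm"]      # selecting one column returns a Series
print(type(heights).__name__)  # Series
print(df.dtypes)               # each column can hold a different type
```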

34
Q

In Pandas, what DataFrame.method can be used to produce a set of descriptive statistics automatically?

A

dfName.describe()

35
Q

What two Pandas DataFrame.methods show the:

1) The first few records in the DataFrame
2) The last few records in the DataFrame

A

1) dfName.head()
2) dfName.tail()

In both cases the default is 5 records, but you can pass the number of records as a parameter, e.g. dfName.head(10)
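
The following sketch exercises head(), tail() and describe() on a small made-up DataFrame. It assumes pandas is installed; the column name "score" is hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame with 10 rows.
df = pd.DataFrame({"score": range(1, 11)})

print(df.head())      # first 5 rows by default
print(df.tail(3))     # last 3 rows
print(df.describe())  # count, mean, std, min, quartiles, max
```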

36
Q

What is NumPy?

A

NumPy is a popular toolkit for scientific computing.

Pandas Series can be used as arguments to most NumPy functions.
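
A brief sketch of passing a Series to NumPy functions, assuming both numpy and pandas are installed. The values are made up.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 4.0, 9.0])  # hypothetical data

roots = np.sqrt(values)  # element-wise; the result is still a Series
total = np.sum(values)   # reduces the Series to a single number

print(roots.tolist())    # [1.0, 2.0, 3.0]
print(total)             # 14.0
```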

37
Q

In Python, what is lambda for?

A

The lambda keyword in Python provides a shortcut for declaring small anonymous functions, that can operate ‘in line’.

Lambdas have only an implicit single return statement (the result of the calculation) so some people refer to lambdas as ‘single expression functions’

Like other functions, lambdas are closures: they can remember variables from the scope in which they were created, even after that scope has finished executing.
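
These points can be illustrated with a short sketch. The names add, words and make_multiplier are hypothetical, chosen only for the example.

```python
# A lambda is a single expression whose value is returned implicitly.
add = lambda x, y: x + y
print(add(2, 3))  # 5

# Lambdas are often passed 'in line', here as a sort key.
words = ["pear", "fig", "banana"]
print(sorted(words, key=lambda w: len(w)))  # ['fig', 'pear', 'banana']

# A lambda remembers variables from the scope where it was created.
def make_multiplier(factor):
    return lambda value: value * factor

double = make_multiplier(2)
print(double(21))  # 42
```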

38
Q

What are the 4 elementary terms in probability

A

The four terms are:

1) experiment (e.g. tossing a coin)
2) outcome (an observable result of an experiment)
3) event (an outcome, or set of outcomes, of an experiment that is of interest to the experimenter)
4) sample space (the set of all possible outcomes of an experiment)

For example, if we throw a die then the sample space is {1, 2, 3, 4, 5, 6} and two possible events are

(a) a score of 3 or more, represented by the set: {3, 4, 5, 6}
(b) a score which is even, represented by the set: {2, 4, 6}.
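
The die example maps naturally onto Python sets. The sketch below additionally assumes a fair die (equally likely outcomes), which the card does not state, in order to attach a probability to each event.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}                     # all outcomes of one throw
three_or_more = {s for s in sample_space if s >= 3}   # event (a)
even_score = {s for s in sample_space if s % 2 == 0}  # event (b)

def probability(event):
    """Probability of an event, assuming equally likely outcomes (a fair die)."""
    return Fraction(len(event), len(sample_space))

print(probability(three_or_more))  # 2/3
print(probability(even_score))     # 1/2
```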

39
Q

Which takes ‘account of order’; a combination or a permutation?

A

Permutation takes account of order

Combination does not take account of order
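
A small sketch of the difference using Python's standard library (math.perm and math.comb need Python 3.8 or later). The three letters are an arbitrary example.

```python
from itertools import combinations, permutations
import math

letters = ["A", "B", "C"]  # hypothetical items

ordered = list(permutations(letters, 2))    # ('A', 'B') and ('B', 'A') both appear
unordered = list(combinations(letters, 2))  # ('A', 'B') appears only once

print(len(ordered), math.perm(3, 2))    # 6 6
print(len(unordered), math.comb(3, 2))  # 3 3
```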

40
Q

Which distribution goes with which type:

1) Discrete distribution
2) Continuous distribution

With

A) Measurement
B) Counting

A

1) goes with B): a discrete distribution comes from counting

2) goes with A): a continuous distribution comes from measurement

41
Q

In Colab, how can you render mathematical symbols?

A

You can add math to text cells using LaTeX

It will be rendered by MathJax.

Just place the statement within a pair of $ signs.

For example, $\sqrt{3x-1}+(1+x)^2$ is rendered as √(3x−1) + (1+x)².

42
Q

The alpha α symbol in t tests means what?

A

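α is the significance level that we’re testing a hypothesis against: the threshold the p-value must fall below for the result to count as statistically significant (0.05 is a common choice).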

43
Q

In many statistical calculations, what does ‘degrees of freedom’ actually mean?

A

It's the count of data points in the sample in question, usually minus 1 (i.e. n − 1).
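
One common place where n − 1 appears is the sample variance. The sketch below uses made-up data and compares dividing by n − 1 with dividing by n. It is an illustration of that one case, not a general definition.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(sample)
mean = sum(sample) / n
sum_sq = sum((x - mean) ** 2 for x in sample)

sample_variance = sum_sq / (n - 1)  # divides by the degrees of freedom, n - 1
population_variance = sum_sq / n    # divides by n

print(sample_variance)              # about 4.571
print(population_variance)          # 4.0
```

statistics.variance(sample) uses the same n − 1 divisor, while statistics.pvariance(sample) divides by n.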