# MSCDSA04 - Data Exploration Flashcards

In Python, how do you define a function?

Use ‘def’, then the function name, then the inputs it expects in parentheses, and finish the line with a colon ‘:’. The function body goes on the following lines, indented, e.g.

    def myFunction(x, y):
        z = x + y
        return z
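Once defined, the function is called by name with its arguments. A minimal sketch, reusing the `myFunction` example from this card; the argument values are arbitrary choices for illustration:

```python
def myFunction(x, y):
    """Return the sum of the two inputs."""
    z = x + y
    return z


# Call with positional arguments.
result = myFunction(2, 3)
print(result)  # 5

# Call with keyword arguments; the order no longer matters.
same_result = myFunction(y=3, x=2)
print(same_result)  # 5
```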

What are descriptive statistics?

Statistics that summarise the data concisely, together with different ways of visualising it, for example:

Scatter graphs

Mean, median, mode, skewness, etc.
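The numerical summaries above can be computed with Python's built-in `statistics` module. This is a small sketch; the `heights_cm` values are made up for illustration and are not from the course material:

```python
import statistics

# Made-up sample of heights in centimetres (illustrative only).
heights_cm = [150, 160, 160, 165, 170, 175, 210]

mean_height = statistics.mean(heights_cm)      # arithmetic average
median_height = statistics.median(heights_cm)  # middle value when sorted
mode_height = statistics.mode(heights_cm)      # most frequent value

print(mean_height)    # 170
print(median_height)  # 165
print(mode_height)    # 160
```

Here the mean sits above the median, which hints that the data are skewed towards the high end by the 210 value.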

What is Exploratory data analysis?

Looking for patterns, differences, and other features that address the questions we are interested in.

At the same time, checking for inconsistencies and identifying limitations.

What is a cross-sectional study vs a longitudinal study?

A cross-sectional study captures a snapshot of a group at a single point in time.

Everyone in the population should have an equal chance of being selected.

A longitudinal study observes a group repeatedly over a period of time.
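Giving everyone an equal chance of selection can be sketched with `random.sample`, which draws without replacement. The `population` list, the sample size of 10, and the seed are all assumptions made for this illustration:

```python
import random

# Hypothetical population of 100 people (names are placeholders).
population = [f"person_{i}" for i in range(1, 101)]

# A seeded generator makes the draw repeatable; the seed value is arbitrary.
rng = random.Random(42)

# Simple random sample: each person is equally likely to be picked,
# and nobody is picked twice.
respondents = rng.sample(population, k=10)
print(len(respondents))  # 10
```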

What is the name for people who participate in a survey?

Respondents

In Colab, how do you load a module?

Use an ‘import’ statement, e.g. from collections import ..
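A concrete version of the import statement might look like the sketch below. The notes leave the imported name elided, so `Counter` is only an assumed example of something `collections` provides, and the `eye_colours` data is made up:

```python
# Import one name from a module (Counter is an assumed example).
from collections import Counter

# Import a whole module, optionally under a shorter alias.
import statistics as stats

eye_colours = ["brown", "blue", "brown", "green", "brown"]

# Counter tallies how often each value appears.
counts = Counter(eye_colours)
print(counts["brown"])  # 3

# Names from an aliased module are reached through the alias.
print(stats.mode(eye_colours))  # brown
```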

In Python, how do you get an index number for each item in a list?

Use a for loop with the ‘enumerate’ function:

    for myIndexNumber, i in enumerate(myListName):
        print(myIndexNumber, "\t", i)

The above iterates through the list and prints the index number, a tab, and the list value at that index:

0 0

1 23

2 34

3 17
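A self-contained version of the loop, using the list values shown in the output above. The snake_case names and the `start=1` variant are additions for illustration:

```python
my_list_name = [0, 23, 34, 17]

# enumerate yields (index, value) pairs, counting from 0 by default.
for my_index_number, item in enumerate(my_list_name):
    print(my_index_number, "\t", item)

# The optional start argument changes the first index number.
pairs = list(enumerate(my_list_name, start=1))
print(pairs)  # [(1, 0), (2, 23), (3, 34), (4, 17)]
```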

In Python, how do you sort a list of items?

The sorted() function returns a sorted list of the specified iterable object.

You can specify ascending or descending order. Strings are sorted alphabetically, and numbers are sorted numerically.

    a = ("h", "b", "a", "c", "f", "d", "e", "g")
    x = sorted(a, reverse=True)
    print(x)

produces: ['h', 'g', 'f', 'e', 'd', 'c', 'b', 'a']
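A few more variations on `sorted()`, as a sketch: ascending order is the default, numbers sort numerically, and the optional `key` argument controls what is compared. The sample values are made up for illustration:

```python
letters = ("h", "b", "a", "c")
ascending = sorted(letters)  # default order is ascending
print(ascending)  # ['a', 'b', 'c', 'h']

numbers = [23, 0, 34, 17]
descending = sorted(numbers, reverse=True)
print(descending)  # [34, 23, 17, 0]

# sorted() returns a new list; the original is left unchanged.
print(numbers)  # [23, 0, 34, 17]

# key=str.lower compares strings without regard to upper/lower case.
words = ["banana", "Cherry", "apple"]
case_insensitive = sorted(words, key=str.lower)
print(case_insensitive)  # ['apple', 'banana', 'Cherry']
```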

What comes first: machine learning or data exploration?

Data exploration comes before machine learning.

You cannot do the machine learning until you understand something about what the data has to tell you.

What are the 5 basic terms in statistics?

1) Population = the entire group of people or things we want to study.

2) Sample = the group selected from the population.

3) Statistic = a quantity we calculate from the sample data.

4) Parameter = a number that is a property of the population, e.g. the mean of a variable (say ‘height’) across the whole population. A statistic is an estimate of a parameter.

5) Variable = a characteristic of interest for each person or thing in the population. It is what is measured, and can be numeric (weight, time, etc.) or categorical (eye colour, gender, ethnicity).

Data are the actual values of the variable: may be numbers or words.
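The relationship between these terms can be sketched with simulated data. Everything here is an assumption for illustration: the population is generated artificially (heights with mean 170 and standard deviation 10), and the sample size and seed are arbitrary:

```python
import random
import statistics

rng = random.Random(0)  # arbitrary seed so the run is repeatable

# Population: every value of the variable 'height' (simulated, not real data).
population_heights = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: a property of the whole population.
parameter_mean = statistics.mean(population_heights)

# Sample: a group selected from the population.
sample_heights = rng.sample(population_heights, k=100)

# Statistic: calculated from the sample, used as an estimate of the parameter.
statistic_mean = statistics.mean(sample_heights)

print(round(parameter_mean, 1), round(statistic_mean, 1))
```

The two printed values should be close but not identical, which is the sense in which a statistic estimates a parameter.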

What are inferential statistics?

1) Verify a hypothesis.

2) Find a line of best fit, etc. (relies on probability).

3) Test whether there is a relationship.
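Points 2 and 3 can be sketched with a least-squares line of best fit and a correlation coefficient, computed by hand so that no extra libraries are needed. The `hours_studied` and `exam_score` values are made up for illustration, and the sketch stops short of a formal hypothesis test:

```python
# Made-up paired observations (illustrative only).
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 64, 68]

n = len(hours_studied)
mean_x = sum(hours_studied) / n
mean_y = sum(exam_score) / n

# Sums of squares and cross-products around the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours_studied, exam_score))
sxx = sum((x - mean_x) ** 2 for x in hours_studied)
syy = sum((y - mean_y) ** 2 for y in exam_score)

# Least-squares line of best fit: y = intercept + slope * x
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Pearson correlation: values near +1 or -1 suggest a strong linear relationship.
r = sxy / (sxx * syy) ** 0.5

print(round(slope, 2), round(intercept, 2), round(r, 3))
```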

What happens to the mean if the data is skewed?

The mean becomes less representative. Because it is an average of every value, a few very high or very low outliers can pull it too high or too low.

Which is more sensitive to skewness: mean, median, or mode?

The mean is more sensitive.

The mode and median are less sensitive.
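The difference in sensitivity shows up when a single extreme value is added to a small data set. A sketch with made-up salary figures (in thousands):

```python
import statistics

# Made-up salaries in thousands (illustrative only).
salaries = [20, 22, 22, 25, 26]
with_outlier = salaries + [500]  # add one extreme value

print(statistics.mean(salaries), statistics.mean(with_outlier))      # 23 102.5
print(statistics.median(salaries), statistics.median(with_outlier))  # 22 23.5
print(statistics.mode(salaries), statistics.mode(with_outlier))      # 22 22
```

One outlier moves the mean from 23 to 102.5, while the median shifts only slightly and the mode does not change.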

What is a ‘repository’ in GitHub and what is it for?

A repository is usually used to organize a single project.

Repositories can contain folders and files, images, videos, spreadsheets, and data sets – anything your project needs.

GitHub recommends including a README, or a file with information about your project.

What are ‘Pull Requests’ in GitHub? What are they for and how do they work?

Pull Requests are the heart of collaboration on GitHub.

When you open a pull request, you’re proposing your changes and requesting that someone review your contribution and merge it into their branch.

Pull requests show ‘diffs’, or differences, of the content from both branches. The changes, additions, and subtractions are shown in green and red.

As soon as you make a commit, you can open a pull request and start a discussion, even before the code is finished.

By using GitHub’s @mention system in your pull request message, you can ask for feedback from specific people or teams, whether they’re down the hall or 10 time zones away.

You can even open pull requests in your own repository and merge them yourself.

Who was the Swedish statistician who presented ‘The Joy of Stats’?

Professor Hans Rosling

Who was the doctor who presented ‘The Joy of Data’?

Dr Hannah Fry