STAT 102 Midterm Flashcards
Jane Jacobs (add more from lecture 2)
Cities are about the relationships and interactivity between people.
Safety through eyes on the street.
Wanted mixed use neighborhoods.
Available urban data (add the questions at beginning of lecture)
Crime
Economics
Demographics
Land use
Variable
Characteristic that takes on different values for different individuals.
Categorical variables
Place an individual into one of several groups.
Example: race, gender, types of crime.
Can’t do direct mathematical operations on categorical variables.
Continuous variables
Take on numerical values across an entire range.
Examples: height, income, population
Can do direct math with these.
How to visualize categorical data
Bar plot and pie chart.
Bar plot
Best way to visualize categorical data.
X axis is the categories
Y axis is the frequency or relative frequency of each category.
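A minimal sketch of the numbers behind a bar plot, assuming a made-up list of crime types (the data and variable names are hypothetical, not from lecture). It counts the frequency of each category, converts to relative frequency, and prints a rough text version of the bars.

```python
from collections import Counter

# Hypothetical categorical data: type of crime for eight reported incidents.
crimes = ["theft", "assault", "theft", "burglary",
          "theft", "assault", "theft", "burglary"]

# Frequency: how many observations fall in each category.
freq = Counter(crimes)

# Relative frequency: each count divided by the total number of observations.
n = len(crimes)
rel_freq = {category: count / n for category, count in freq.items()}

# Crude text bar plot: one '#' per observation in the category.
for category, count in freq.most_common():
    print(f"{category:<10}{'#' * count}  ({rel_freq[category]:.2f})")
```

Relative frequencies always sum to 1, which makes bars comparable across data sets of different sizes.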
Pie chart
Another way to visualize categorical variables.
Problem is it is harder to see the actual totals of each category and distinguish small categories.
Distribution of a continuous variable
Describes what values a variable takes and how frequently these values occur.
Described in terms of center, spread, shape, outliers.
How to visualize continuous variables
Histogram, box plot, maps, linear regression
Box plots are better for center and spread and identifying outliers
Histograms are better for looking at the shape –> skewness and multi-modality.
Histogram
Same idea as bar plots, but for continuous variables.
Divide the x axis into bins. Y axis is the frequency (count) of values in each bin.
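A rough sketch of the binning step, assuming hypothetical heights and an arbitrary choice of three bins of width 5 (both are assumptions for illustration). Each value is assigned to a bin and the counts are what the histogram bars would show.

```python
# Hypothetical continuous data: heights in inches.
heights = [61.5, 63.0, 64.2, 65.1, 65.8, 66.0, 67.3, 68.9, 70.4, 73.5]

# Assumed bins: [60, 65), [65, 70), [70, 75).
start = 60.0
bin_width = 5.0
n_bins = 3

# Count how many values land in each bin.
counts = [0] * n_bins
for h in heights:
    index = int((h - start) // bin_width)
    counts[index] += 1

# Crude text histogram: one '#' per value in the bin.
for i, count in enumerate(counts):
    low = start + i * bin_width
    high = low + bin_width
    print(f"[{low:.0f}, {high:.0f})  {'#' * count}")
```

Changing the bin width can change the apparent shape, so it is worth trying more than one.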
Boxplot
Box contains central 50% of values
Line in middle is median, 50% of values on each side.
Whiskers have rest of distribution except for outliers.
Outliers are suspiciously large or small values (below Q1 - 1.5*IQR or above Q3 + 1.5*IQR).
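A sketch of the quantities a box plot draws, using Python's statistics module on made-up data. The quartile method ("inclusive") is an assumption; software packages use slightly different quartile conventions, so Q1 and Q3 may differ a little elsewhere.

```python
import statistics

# Hypothetical data with one suspiciously large value.
data = [2, 4, 5, 6, 7, 8, 9, 10, 12, 40]

# Quartiles: the box runs from Q1 to Q3, with a line at the median.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences for flagging outliers: 1.5 * IQR beyond the edges of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, median, Q3:", q1, median, q3)
print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```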
Left vs right skewed
Skewed toward the side with the longer tail.
Left skewed means long left tail. When left skewed, mean < median.
Right skewed means long right tail. Income is typically right skewed.
When right skewed, mean > median.
Mean vs median
Mean used to measure center for approx normal dist.
Sum of data / n
Affected by large outliers and asymmetry so use median if skewed.
Median is the middle value, more robust measure of center. 50th percentile. Use for skewed data.
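A small illustration of why the median is more robust, assuming hypothetical incomes (in thousands of dollars) with one very large value. The numbers are invented for the example.

```python
import statistics

# Hypothetical incomes in thousands of dollars; the last one is an outlier.
incomes = [30, 35, 40, 45, 50, 55, 400]

mean_income = statistics.mean(incomes)      # sum of data / n
median_income = statistics.median(incomes)  # middle value of the sorted data

# Same data with the outlier dropped, for comparison.
trimmed = incomes[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print("with outlier:    mean =", round(mean_income, 1), " median =", median_income)
print("without outlier: mean =", mean_trimmed, " median =", median_trimmed)
```

The single large value pulls the mean far to the right (mean > median, as expected for right skew), while the median barely moves.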
How to measure spread of dist
Variance, standard deviation, IQR.
Use IQR for skewed data; variance and SD for approximately normal data.
Variance / standard deviation
Spread for a normal dist.
Variance is the average of the squared deviations of each observation from the mean.
SUM (x - mean)^2 / (n-1)
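Standard deviation is the square root of variance. Roughly the typical distance of an observation from the mean; for approximately normal data, about 68% of values lie within 1 SD of the mean.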
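A worked sketch of the formula SUM (x - mean)^2 / (n-1) on made-up data, checked against the statistics module's sample variance and standard deviation. The data values are hypothetical.

```python
import math
import statistics

# Hypothetical data.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample variance divides by n - 1, not n.
variance = sum(squared_deviations) / (n - 1)
sd = math.sqrt(variance)

print("mean:", mean)
print("variance:", variance, "(library:", statistics.variance(data), ")")
print("sd:", sd, "(library:", statistics.stdev(data), ")")
```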
IQR
Used for skewed data. Robust measure of spread.
Q3-Q1
75th percentile - 25th percentile.
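A sketch of what "robust" means here, assuming two hypothetical data sets that differ only in their largest value. The helper name iqr and the "inclusive" quartile convention are assumptions for illustration.

```python
import statistics

def iqr(values):
    """Q3 - Q1, using the 'inclusive' quartile convention (one of several)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical data: identical except for the largest value.
base = [30, 35, 40, 45, 50, 55, 60]
with_outlier = [30, 35, 40, 45, 50, 55, 400]

print("IQR:", iqr(base), "vs", iqr(with_outlier))
print("SD: ", round(statistics.stdev(base), 1), "vs",
      round(statistics.stdev(with_outlier), 1))
```

The IQR is unchanged by the extreme value because it only depends on the middle 50% of the data, while the SD grows a lot.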
Log transformation of data
Used to make right-skewed distributions more symmetric (closer to normal). Works on the order of magnitude of a value rather than the value itself.
Log = natural logarithm
Log 10 = log with base 10
Preserves the ordering of the values –> the median observation is still the median after the transformation. Not the case with the mean: the mean of the logs is not the log of the mean.
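A quick check of the median and mean claims on made-up right-skewed values, using log base 10. An odd number of values is assumed so that the median is an actual data point.

```python
import math
import statistics

# Hypothetical right-skewed data (odd count, so the median is one of the values).
values = [1, 10, 100, 1000, 100000]
logs = [math.log10(v) for v in values]

# Ordering is preserved, so the median observation stays the median.
median_of_logs = statistics.median(logs)
log_of_median = math.log10(statistics.median(values))

# The mean does not carry over in the same way.
mean_of_logs = statistics.mean(logs)
log_of_mean = math.log10(statistics.mean(values))

print("median of logs:", median_of_logs, " log of median:", log_of_median)
print("mean of logs:  ", mean_of_logs, " log of mean:  ", round(log_of_mean, 3))
```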
Scatter plots
The primary way to visualize the relationship between two continuous variables.
Correlation
Value between -1 and 1 that provides the sign and strength of a linear association between 2 variables.
Correlation coefficient –> how much things vary around a line.
Only appropriate for linear relationships.
R
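A sketch of how the correlation coefficient can be computed by hand, assuming the usual Pearson formula (the function name correlation and all data are hypothetical). The last example shows why it is only appropriate for linear relationships: a perfect curved relationship can still give a correlation of 0.

```python
import math

def correlation(xs, ys):
    """Pearson correlation: co-variation of x and y, scaled to lie in [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(ss_x * ss_y)

x = [1, 2, 3, 4, 5]
y_up = [3, 5, 7, 9, 11]     # points exactly on an increasing line
y_down = [10, 8, 6, 4, 2]   # points exactly on a decreasing line

x_sym = [-2, -1, 0, 1, 2]
y_curve = [v ** 2 for v in x_sym]  # perfect U-shaped (non-linear) relationship

print("increasing line:", correlation(x, y_up))
print("decreasing line:", correlation(x, y_down))
print("U-shaped curve: ", correlation(x_sym, y_curve))
```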
Side by side box plots
Good way to visualize how a continuous variable changes across different categories.
Categories on x axis and the values of the continuous variable on y axis.
See if differences are significant by using probability modeling.
Random variables
Used to represent an uncertain quantity or data point. Each value of a random variable has a certain probability between 0 and 1.
Marginal vs conditional prob
Marginal probabilities are used to model the uncertainty in a single random variable.
P(X=x) is the prob that X will take on the value x
Conditional probabilities are used to model how the distribution of one random variable changes based on the value of another random variable.
P(X=x | Y=y) is the prob that X will take on the value x given that Y has the value y.
P(below poverty line | children > 0) = P(BPL and c > 0) / P(c > 0)
P(A|B)=P(A&B)/P(B)
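A worked sketch of P(A|B) = P(A&B)/P(B) for the poverty example, using invented household counts (the numbers are hypothetical, not real data).

```python
# Hypothetical counts of households, cross-classified by two variables.
counts = {
    ("below", "children"): 30,
    ("below", "no children"): 20,
    ("above", "children"): 170,
    ("above", "no children"): 280,
}
total = sum(counts.values())

# Marginal probabilities: one variable at a time.
p_below = (counts[("below", "children")] + counts[("below", "no children")]) / total
p_children = (counts[("below", "children")] + counts[("above", "children")]) / total

# Joint probability: both events together.
p_below_and_children = counts[("below", "children")] / total

# Conditional probability: P(A | B) = P(A & B) / P(B).
p_below_given_children = p_below_and_children / p_children

print("P(BPL) =", p_below)
print("P(c > 0) =", p_children)
print("P(BPL | c > 0) =", p_below_given_children)
```

In this made-up table the conditional probability (0.15) is higher than the marginal (0.10), meaning poverty is more common among households with children.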
Normal distribution
Used to model a continuous distribution in order to obtain probabilities to the left or right of a value or within a certain range. Standard normal has mean 0 and standard deviation 1. Denoted N(mu, sigma), where mu is the mean and sigma is the standard deviation.
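A sketch of getting left, right, and within-range probabilities from a normal distribution, using NormalDist from Python's statistics module. The height example (mean 66, SD 3) is a hypothetical assumption.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
z = NormalDist(mu=0, sigma=1)

p_left = z.cdf(1.0)                   # probability to the left of 1
p_right = 1 - z.cdf(1.0)              # probability to the right of 1
p_within = z.cdf(1.0) - z.cdf(-1.0)   # probability between -1 and 1

# Hypothetical N(66, 3) model for heights in inches.
heights = NormalDist(mu=66, sigma=3)
p_60_to_72 = heights.cdf(72) - heights.cdf(60)  # within 2 SD of the mean

print("P(Z < 1) =", round(p_left, 4))
print("P(Z > 1) =", round(p_right, 4))
print("P(-1 < Z < 1) =", round(p_within, 4))
print("P(60 < height < 72) =", round(p_60_to_72, 4))
```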