Domain 3 - Data Flashcards

1
Q

Completeness

A

Are all the fields of the data complete?

2
Q

Correctness

A

Is the data accurate?

3
Q

Consistency

A

Is the data provided under a given field and for a given concept consistent with the definition of that field and concept?

4
Q

Currency

A

Is the data obsolete?

5
Q

Collaborative

A

Is the data based on one opinion or on a consensus of experts in the relevant area?

6
Q

Confidential

A

Is the data secure from unauthorized use by individuals other than the decision maker?

7
Q

Clarity

A

Is the data legible and comprehensible?

8
Q

Common Format

A

Is the data in a format easily used in the application for which it is intended?

9
Q

Convenient

A

Can the data be conveniently and quickly accessed by the intended user, in a time frame that allows it to be used effectively?

10
Q

Cost-effective

A

Is the cost of collecting and using the data commensurate with its value?

11
Q

Data warehouses typically describe (three things)

A
  1. A Staging area
  2. Data integration
  3. Access Layers
12
Q

Data warehouse staging area

A

The operational data sets from which the information is extracted

13
Q

Data integration

A

The centralized source where the data is conveniently stored

14
Q

Access layers

A

Multiple OLAP data marts that store the data in a form that is easy for the analyst to retrieve

15
Q

Data mart

A

A subset of the data warehouse organized along a single point of view (e.g., time, product type, geography) for efficient data retrieval.

Usually oriented to a specific business line or team. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department.

16
Q

Data marts allow analysts to… (five things)

A
  1. Slice Data
  2. Dice Data
  3. Drill-down/up
  4. Roll-up
  5. Pivot
17
Q

Slice data

A

filtering data by picking a specific subset of the data-cube and choosing a single value for one of its dimensions

18
Q

Dice data

A

grouping data by picking specific values for multiple dimensions

19
Q

Drill-down/up

A

allow the user to navigate between the most summarized view (drill-up) and the most detailed view (drill-down)

20
Q

Roll-up

A

summarize the data along a dimension (e.g., computing totals or using some other formula)

21
Q

Pivot

A

interchange rows and columns ('rotate the cube')

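
The slice, dice, and roll-up operations above can be sketched on a toy data cube. This is an illustrative example only — the cube, dimension names, and helper functions are invented for the sketch, not a production OLAP engine:

```python
from collections import defaultdict

# Toy data cube with three dimensions: (region, product, year) -> sales.
# All names and numbers here are invented for illustration.
cube = {
    ("East", "Widget", 2023): 100,
    ("East", "Gadget", 2023): 150,
    ("West", "Widget", 2023): 200,
    ("West", "Widget", 2024): 250,
}

def slice_cube(cube, dim, value):
    """Slice: choose a single value for one dimension."""
    return {k: v for k, v in cube.items() if k[dim] == value}

def dice_cube(cube, criteria):
    """Dice: pick specific values for multiple dimensions.
    `criteria` maps a dimension index to the set of allowed values."""
    return {k: v for k, v in cube.items()
            if all(k[d] in allowed for d, allowed in criteria.items())}

def roll_up(cube, dim):
    """Roll-up: summarize along a dimension by dropping it and totalling."""
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[key[:dim] + key[dim + 1:]] += value
    return dict(totals)
```

Drill-down is the inverse navigation (from rolled-up totals back to the detailed cube), and pivot simply re-orders which dimensions appear as rows versus columns.
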
22
Q

Fact tables

A

used to record measurements or metrics for specific events at a fairly granular level of detail

23
Q

Transaction fact tables

A

record facts about specific events (like sales events)

24
Q

Snapshot fact tables

A

record facts at a given point in time (like account details at month end)

25
Q

Accumulating snapshot tables

A

record aggregate facts at a given point in time

26
Q

Dimension tables

A

Have a smaller number of records compared to fact tables, although each record may have a very large number of attributes. Dimension tables include time dimension tables, geography dimension tables, product dimension tables, employee dimension tables, and range dimension tables.

27
Q

What to do with missing data (4 things)

A
  1. Deletion of record
  2. Deletion when necessary
  3. Imputation
  4. Imputation at random

28
Q

Filtering

A

Filtering can involve using relational algebra projection and selection to add or remove data based on its value. Filtering usually involves outlier removal, exponential smoothing, and the use of Gaussian or median filters.

29
Q

Filling in missing data with imputation

A

If other observations in the dataset can be used, then values for missing data can be generated using random sampling or Markov chain Monte Carlo (MCMC) methods. To avoid using other observations, imputation can be done using the mean, regression models, or statistical distributions based on existing observations.

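
A minimal sketch of mean imputation, the simplest of the approaches above (function name and data are invented for illustration; random-sampling and MCMC imputation would call for a statistics library):

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# e.g., impute_mean([1.0, None, 3.0]) fills the gap with 2.0
```
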
30
Q

Dimensionality reduction options for structured data

A

Principal component analysis or factor analysis can help determine whether there is correlation across different dimensions in the data

31
Q

Dimensionality reduction options for unstructured text data

A

Term frequency-inverse document frequency (tf-idf): a numerical statistic intended to reflect how important a word is to a document in a collection or corpus

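
A sketch of the tf-idf computation, assuming documents are pre-tokenized lists of words. This is the smoothing-free textbook form; real libraries (e.g., scikit-learn) use slightly different variants:

```python
import math

def tf_idf(term, document, corpus):
    """Term frequency (share of the document) times log inverse
    document frequency (rarity across the corpus)."""
    tf = document.count(term) / len(document)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf
```

A term that appears in every document gets idf = log(1) = 0, so ubiquitous words score zero no matter how often they occur.
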
32
Q

Feature hashing

A

A dimensionality reduction technique for when data has a variable number of features. Feature hashing is an efficient method for creating a fixed number of features that form the indices of an array

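
A sketch of the hashing trick for token features. The bucket count and function names are invented for illustration; production implementations typically use a faster hash (e.g., MurmurHash) and also hash a sign to reduce collision bias:

```python
import hashlib

def hash_features(tokens, n_buckets=8):
    """Map a variable-length list of tokens into a fixed-length
    count vector: each token's hash picks an array index."""
    vector = [0] * n_buckets
    for token in tokens:
        index = int(hashlib.md5(token.encode()).hexdigest(), 16) % n_buckets
        vector[index] += 1
    return vector
```

Different tokens can collide in the same bucket; that loss of exactness is the accepted trade-off for a fixed dimensionality.
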
33
Q

Sensitivity analysis and wrapper methods

A

Used when you don't know which features of your data are important. Wrapper methods involve identifying a set of features on a small sample and then testing that set on a holdout sample.

34
Q

Self-organizing maps and Bayes nets

A

Used to understand the probability distribution of the data

35
Q

Normalization

A

Used to ensure data stays within common ranges. Prevents the scales of the data from obscuring interpretation and analysis

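
A minimal sketch of min-max normalization, one common way to bring features into a shared range (z-score standardization is the usual alternative):

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# e.g., min_max_normalize([10, 20, 30]) -> [0.0, 0.5, 1.0]
```
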
36
Q

When is format conversion used?

A

When data is in binary format

37
Q

When are fast Fourier transforms and discrete wavelet transforms used?

A

With frequency data

38
Q

When are coordinate transformations used?

A

For geometric data defined over a Euclidean space.

39
Q

Connectivity-based clustering methods

A

AKA hierarchical clustering | Generates an ordered set of clusters with variable precision

40
Q

Hierarchical clustering

A

AKA connectivity-based methods | Generates an ordered set of clusters with variable precision

41
Q

Centroid-based clustering methods

A

When the number of clusters is known, k-means is a popular technique. When the number is unknown, x-means is a useful extension of k-means that both creates clusters and searches for the optimal number of clusters. Canopy clustering is an alternate way of enhancing k-means when the number of clusters is unknown.

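
A minimal one-dimensional k-means sketch showing the assign/update loop. Initial centroids are supplied by hand here; real use would rely on a library, random restarts, and a convergence check:

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate two steps: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep a centroid in place if its cluster came up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids
```

x-means and canopy clustering wrap a loop like this one with a search over the number of clusters, which this sketch takes as given.
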
42
Q

Distribution-based clustering methods

A

Gaussian mixture models, which typically use the expectation-maximization (EM) algorithm. Used if you want any data element's membership in a segment to be 'soft'.

43
Q

Density-based methods

A

Clustering method for non-elliptical clusters; fractal clustering and DBSCAN can be used.

44
Q

Graph-based methods

A

Clustering method for when you have knowledge of how one item is connected to another. Looks for cliques and semi-cliques.

45
Q

Topic modelling

A

Clustering method for text data

46
Q

How to determine important variables when the structure of the data is unknown?

A

Tree-based methods

47
Q

How to determine important variables when statistical measures of importance are needed?

A

GLM models

48
Q

How to determine important variables when statistical measures of importance are not needed?

A

Regression with shrinkage (e.g., LASSO, elastic net) and stepwise regression

49
Q

How to classify data into existing groups when unsure of feature importance?

A

Neural nets and random forests are helpful

50
Q

How to classify data into existing groups when unsure of feature importance but a transparent model is required?

A

Decision trees (e.g., CART, CHAID)

51
Q

Key problem with neural nets and random forests

A

Difficult to explain; they are "black box" models, less transparent than decision trees

52
Q

How to classify data into existing groups with fewer than 20 dimensions?

A

K-nearest neighbours

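
A minimal k-nearest-neighbours sketch (Euclidean distance, majority vote; the example data are invented). In high dimensions distances concentrate and neighbours become uninformative, which is why the guideline above suggests fewer than 20 dimensions:

```python
from collections import Counter

def knn_classify(query, examples, k=3):
    """Label a query point by majority vote among its k nearest
    labelled examples, using Euclidean distance.
    `examples` is a list of (point, label) pairs."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(examples, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```
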
53
Q

When to use Naive Bayes?

A

When you have a large dataset with an unknown classification signal

54
Q

When to use hidden Markov chains?

A

When estimating an unobservable state based on observable values