Prelims Flashcards

(70 cards)

1
Q

Methods of Data Collection:

A

Observation
Interview
Questionnaire

2
Q

Seeks to ascertain what people think and do by watching them in action as they express themselves in various situations and activities.

A

Observation

3
Q

Instead of writing the response, the interviewee (subject) gives the needed information verbally in a face-to-face relationship.

A

Interview

4
Q

Provides a speedy and simple technique for gathering data about groups of individuals scattered over a wide and extended field.

A questionnaire form is usually sent by post or online.

A

Survey Questionnaire

5
Q

Sampling Techniques

A
  • Probability (Random) Sampling
  • Non-Probability (Non-Random) Sampling
6
Q

Start with a complete sampling frame of all eligible individuals from which you select your sample. (unbiased)

A

Probability Sampling Method

7
Q

Probability Sampling Method

A
  • Simple Random
  • Systematic Sampling
  • Stratified Sampling
  • Cluster Sampling
8
Q

Each individual or unit in the population has an equal chance of being selected. Selection is done randomly, using methods like a lottery system or a random number generator, which ensures unbiased representation.

A

Simple Random Sampling
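
A minimal sketch of the "random number generator" approach, using only the Python standard library. The sampling frame of 100 unit IDs and the function name simple_random_sample are hypothetical, chosen for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units; every unit has the same chance of selection."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical sampling frame: unit IDs 1..100
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```

random.sample draws without replacement, so no unit can appear twice in the sample.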

9
Q

A method where elements are selected from an ordered population at regular intervals. The first sample is chosen randomly within the first interval, and subsequent samples follow a fixed pattern.

(e.g., every 5th or 10th individual).

A

Systematic Sampling
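
The procedure can be sketched as follows: pick a random start inside the first interval, then take every k-th element. The population of 50 IDs and the interval k = 5 are assumptions for illustration.

```python
import random

def systematic_sample(ordered_population, k, seed=None):
    """Select every k-th element after a random start in the first interval."""
    rng = random.Random(seed)
    start = rng.randrange(k)           # random position among the first k elements
    return ordered_population[start::k]

# Hypothetical ordered population: IDs 1..50, sampling every 5th individual
population = list(range(1, 51))
sample = systematic_sample(population, 5, seed=1)
print(sample)
```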

10
Q

Population is divided into distinct subgroups (strata) based on shared characteristics

(e.g., age, income level, education).

A

Stratified Sampling
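
A rough sketch of one common way to finish the procedure: after grouping units into strata, draw a simple random sample from each stratum. Proportional allocation (the same fraction from every stratum) is an assumption here, as are the field name age_group and the sample data.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Group units into strata, then randomly sample the same fraction of each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical population: 60 "young" and 40 "senior" individuals
people = [{"id": i, "age_group": "young" if i < 60 else "senior"} for i in range(100)]
sample = stratified_sample(people, lambda p: p["age_group"], 0.1, seed=0)
print(len(sample))
```

With a 10% fraction, the sample keeps the 60/40 split of the population (6 young, 4 senior).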

11
Q

Population is divided into naturally occurring groups (clusters).

(e.g., schools, neighborhoods, or companies).

A

Cluster Sampling

12
Q

All members of the selected clusters are surveyed

A

One-Stage Cluster Sampling

13
Q

A random subset of individuals within the selected clusters is surveyed.

A

Two-Stage Cluster Sampling
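
The difference between the one-stage and two-stage variants can be sketched in a single function. The school data and the per_cluster parameter are hypothetical: leaving per_cluster as None surveys every member of each chosen cluster (one-stage), while a number draws a random subset within each chosen cluster (two-stage).

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """clusters maps a cluster name to its list of members.

    per_cluster=None -> one-stage: all members of the chosen clusters.
    per_cluster=m    -> two-stage: a random subset of m members per cluster.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1: pick clusters
    sample = {}
    for name in chosen:
        members = clusters[name]
        if per_cluster is None:
            sample[name] = list(members)
        else:
            # stage 2: pick individuals inside the cluster
            sample[name] = rng.sample(members, min(per_cluster, len(members)))
    return sample

# Hypothetical clusters: 5 schools with 20 students each
schools = {f"school_{i}": [f"s{i}_{j}" for j in range(20)] for i in range(5)}
one_stage = cluster_sample(schools, 2, seed=3)
two_stage = cluster_sample(schools, 2, per_cluster=5, seed=3)
print({name: len(m) for name, m in one_stage.items()})
print({name: len(m) for name, m in two_stage.items()})
```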

14
Q

Not all members of the population have an equal chance of being selected. Instead, selection is based on subjective judgment, convenience, or specific criteria set by the researcher.

A

Non-Probability Sampling

15
Q

Used in exploratory research, qualitative studies, or when probability sampling is impractical due to resource constraints.

A

Non-Probability Sampling

16
Q

Non-Probability Sampling

A
  • Convenience Sampling
  • Quota Sampling
  • Purposive Sampling
  • Snowball Sampling
17
Q

Participants are selected based on their availability, accessibility, and willingness to participate rather than random selection.

A

Convenience Sampling

18
Q

Used in pilot studies, exploratory research, or when time and resources are limited.

A

Convenience Sampling

19
Q

Researchers deliberately select individuals who are most relevant to the research topic based on predefined criteria, expertise, or specific characteristics.

A

Purposive Sampling

20
Q

Commonly used in qualitative research.

A

Purposive Sampling

21
Q

Researchers divide the population into subgroups (quotas) and select participants from each subgroup to ensure representation based on characteristics like age, gender, or income. The selection process is not random, so not all population members have an equal chance of participating.

A

Quota Sampling
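
A small sketch of the non-random selection step, assuming participants are simply accepted in the order they become available until each subgroup's quota is filled. The arrival list and the quotas are hypothetical.

```python
def quota_sample(arrivals, group_of, quotas):
    """Accept participants in arrival order until every quota is full."""
    remaining = dict(quotas)
    sample = []
    for person in arrivals:
        group = group_of(person)
        if remaining.get(group, 0) > 0:   # this subgroup still has open slots
            sample.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):   # all quotas filled
            break
    return sample

# Hypothetical participants as (name, gender), in order of availability
arrivals = [("p1", "F"), ("p2", "F"), ("p3", "M"), ("p4", "F"), ("p5", "M"), ("p6", "M")]
sample = quota_sample(arrivals, lambda p: p[1], {"F": 2, "M": 2})
print(sample)
```

Because selection depends on arrival order rather than a random draw, p4 and p6 never get a chance to be included.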

22
Q

Existing participants recruit future participants from their network, particularly useful for studying hard-to-reach or specialized populations.

A

Snowball Sampling

23
Q

Starts with a few known individuals who refer others, creating a chain of referrals.

A

Snowball Sampling
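
The chain of referrals can be modeled, as a simplification, as a breadth-first walk over a referral network. The network below and the names in it are made up for illustration.

```python
from collections import deque

def snowball_sample(referrals, seeds, max_size):
    """Follow referrals outward from the seed participants, up to max_size people."""
    sample = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        for contact in referrals.get(person, []):
            if contact not in seen:       # do not recruit the same person twice
                seen.add(contact)
                queue.append(contact)
    return sample

# Hypothetical referral network: each participant names people they know
referrals = {"ana": ["ben", "cy"], "ben": ["dee"], "cy": ["ana", "eli"], "dee": []}
sample = snowball_sample(referrals, ["ana"], 4)
print(sample)
```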

24
Q

Methods in Data Presentation

A
  • Textual Method
  • Tabular Method
  • Graphical Method
25
Using written or descriptive explanations without tables or graphs. Involves narrating findings in sentences and paragraphs to convey insights.
Textual Method
26
A structured way of presenting data in rows and columns using a table; organizes numerical or categorical data systematically for easy comprehension.
Tabular Method
27
Visual representation of data using charts, graphs, or diagrams to simplify complex information and highlight trends, patterns, or relationships.
Graphical Method
30
Data types
  • Numeric, categorical
  • Static, dynamic (temporal)
31
Other kinds of data
  • Distributed data
  • Text, Web, metadata
  • Images, audio/video
32
Missing attribute values, lacking certain attributes of interest, or containing only aggregate data (e.g., occupation=“”)
Incomplete data
33
Containing errors or outliers (e.g., Salary=“-10”)
Noisy data
34
Containing discrepancies in codes or names (e.g., Age=“42” but Birthday=“03/07/1997”; rating was “1, 2, 3”, now “A, B, C”)
Inconsistent data
35
Why is Data Processing important?
  • No quality data, no quality mining results
  • Quality decisions must be based on quality data
  • Duplicate or missing data may cause incorrect or even misleading statistics
36
Multi-Dimensional Measure of Data Quality
  • Accuracy
  • Completeness
  • Consistency
  • Timeliness
  • Believability
  • Value added
  • Interpretability
  • Accessibility
37
Major Tasks in Data Processing
  • Data Cleaning
  • Data Integration
  • Data Transformation
  • Data Reduction
  • Data Discretization
38
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies. Considered the number one problem in data warehousing.
Data Cleaning
39
Data Cleaning Tasks:
  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data
  • Resolve redundancy caused by data integration
40
Data is not always available (many tuples have no recorded values for several attributes, such as customer income in sales data)
Missing Data
41
Missing data causes:
  • Equipment malfunction
  • Data inconsistent with other recorded data and thus deleted
  • Data not entered due to misunderstanding
  • Certain data not considered important at the time of entry
  • History or changes of the data not registered
42
Handling missing data
  • Ignore the tuple
  • Fill in missing values manually (tedious and often infeasible)
  • Fill them in automatically
43
Fill missing data with
  • A global constant, e.g., "unknown"
  • The attribute mean
  • The most probable value: inference-based, such as a Bayesian formula, decision tree, or EM algorithm
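
The first two fill strategies can be sketched with the standard library, assuming missing values are stored as None. The income and occupation lists are hypothetical; the inference-based option is not shown because it needs a fitted model.

```python
from statistics import mean

def fill_with_constant(values, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace every missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]

# Hypothetical attribute columns with missing entries
occupations = ["nurse", None, "clerk"]
incomes = [30000, None, 50000, 40000, None]

print(fill_with_constant(occupations))
print(fill_with_mean(incomes))
```

The mean fill only makes sense for numeric attributes; a constant such as "unknown" suits categorical ones.
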
44
Random error or variance in a measured variable
Noisy Data
45
Incorrect attribute values may be due to:
  • Faulty data collection instruments
  • Data entry problems
  • Data transmission problems
46
Handling Noisy Data
  • Binning method
  • Clustering
  • Combined computer and human inspection
47
When reducing noise and trend analysis is needed
Smoothing by bin means
48
When keeping real-world constraints and preserving limits is important
Smoothing by bin boundaries
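
A short sketch of the two bin-smoothing variants: smoothing by bin means replaces each value with its bin's average, while smoothing by bin boundaries snaps each value to the nearer of its bin's minimum and maximum, so the original limits are preserved. The price list and the bin size of 3 are assumptions for illustration.

```python
def equal_frequency_bins(values, size):
    """Sort the values and cut them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Replace every value with the mean of its bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Hypothetical sorted prices
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```
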
49
Detect and remove outliers (data points inconsistent with the majority of the data)
Clustering
50
Integration of multiple databases or files
Data Integration
51
Integrate metadata from different sources. Entity identification problem: identify real-world entities from multiple data sources.
Schema Integration
52
Removing noise from data
Smoothing
53
Values are scaled to fall within a small, specified range
Normalization
54
summarization
Aggregation
55
concept hierarchy climbing
Generalization
56
Normalization
  • Min-max normalization
  • Z-score normalization
  • Normalization by decimal scaling
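
The three normalization methods can be sketched as below. The formulas are the usual textbook ones; using the population standard deviation (pstdev) for the z-score is an assumption, and min_max assumes the values are not all identical.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max([200, 300, 400]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 917]))
```
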
57
Obtains a reduced representation of the data that is much smaller in volume but produces the same or similar analytical results. Used when the data is too big to work with.
Data Reduction
58
Data Reduction Strategies
  • Dimension reduction: remove unimportant attributes
  • Aggregation and clustering
  • Sampling
59
Feature selection (i.e., attribute subset selection): Select a minimum set of attributes (features) that is sufficient for the data mining task.
Dimension Reduction
60
Popular reduction technique. Divide data into buckets and store average (sum) for each bucket
Histograms
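
A minimal sketch of histogram-based reduction, assuming equal-width buckets: only the count, sum, and average of each bucket are stored in place of the raw values. The data and the bucket width of 10 are hypothetical.

```python
from collections import defaultdict

def histogram_summary(values, width):
    """Group values into equal-width buckets and keep only per-bucket statistics."""
    buckets = defaultdict(list)
    for v in values:
        buckets[(v // width) * width].append(v)  # key = lower edge of the bucket
    return {
        start: {"count": len(vs), "sum": sum(vs), "avg": sum(vs) / len(vs)}
        for start, vs in sorted(buckets.items())
    }

# Hypothetical raw values
prices = [1, 1, 5, 5, 5, 8, 10, 10, 12, 14, 14, 15, 18, 20, 21, 25]
summary = histogram_summary(prices, 10)
print(summary)
```

Sixteen raw values are reduced to three bucket summaries.
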
61
Choose a representative subset of data
Sampling
62
Data discretization: three types of attributes
  • Nominal
  • Ordinal
  • Continuous
63
Data discretization, three types of attributes: values from an unordered set
Nominal
64
Data discretization, three types of attributes: values from an ordered set
Ordinal
65
Data discretization, three types of attributes: real numbers
Continuous
66
Data discretization techniques
  • Binning method: equal-width, equal-frequency
  • Entropy-based methods
67
Binning with a fixed bin width, e.g., 10
Equi-width binning
68
Binning with a fixed bin density (number of values per bin), e.g., 3
Equi-frequency binning
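
The two binning schemes can be compared on the same data. This is a sketch with hypothetical values: equal-width bins cover fixed intervals of width 10, while equal-frequency bins each hold 3 values.

```python
def equal_width_bins(values, width):
    """Group values into intervals [k*width, (k+1)*width)."""
    bins = {}
    for v in sorted(values):
        start = (v // width) * width
        bins.setdefault((start, start + width), []).append(v)
    return bins

def equal_frequency_bins(values, density):
    """Sort the values and cut them into bins holding `density` values each."""
    ordered = sorted(values)
    return [ordered[i:i + density] for i in range(0, len(ordered), density)]

# Hypothetical values
data = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width_bins(data, 10))
print(equal_frequency_bins(data, 3))
```

Equal-width bins may end up with very different numbers of values (here 2, 4, and 3), whereas equal-frequency bins are balanced by construction.
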
69
Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values
Discretization
70
Reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior)
Concept Hierarchies
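
As an illustration, numeric ages can be replaced with the higher-level concepts named above. The cut-off ages (39 and 59) are assumptions, since the card does not define them.

```python
def age_concept(age, young_max=39, middle_max=59):
    """Map a numeric age to a higher-level concept (cut-offs are hypothetical)."""
    if age <= young_max:
        return "young"
    if age <= middle_max:
        return "middle-aged"
    return "senior"

ages = [23, 45, 67]
print([age_concept(a) for a in ages])
```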