Exam 1 Flashcards
(84 cards)
Where is the most effort put in data mining?
Data preparation and cleaning
What are the data-related steps in the CRISP-DM guide?
Select/find data, clean the data, prepare the data, integrate the data, and format the data
How is data represented?
Numeric: Continuous attributes. Measurements, int, float data types
Nominal - values are symbolic labels (sunny, old, yellow); only equality checks are possible. Categorical coding may use numbers such as “1”, but they carry no arithmetic meaning.
Ratio - the measurement scheme defines a true zero point (e.g., a distance, or a temperature differential but not the temperature itself); math operations are valid.
Ordinal - rank order: “cold, cool, warm, hot” or “good, better, best”. No defined distance between values; can perform equality checks.
Interval - ordered and measured in fixed units.
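A minimal sketch of these scale types in code, assuming Python with pandas (the tooling is my assumption, not part of the course material):

```python
import pandas as pd

# Nominal: symbolic labels; only equality checks are meaningful.
weather = pd.Series(["sunny", "rainy", "sunny"], dtype="category")
print((weather == "sunny").sum())        # equality check -> 2

# Ordinal: ordered categories, but no defined distance between them.
temps = pd.Categorical(["cool", "hot", "warm"],
                       categories=["cold", "cool", "warm", "hot"],
                       ordered=True)
print(temps.min())                       # order is defined -> 'cool'

# Interval: ordered, fixed units, no true zero (e.g., degrees Celsius).
celsius = pd.Series([10.0, 20.0, 30.0])
print(celsius.diff())                    # differences are meaningful

# Ratio: true zero point, so ratios and all arithmetic are valid.
distances_km = pd.Series([5.0, 10.0])
print(distances_km[1] / distances_km[0]) # "twice as far" -> 2.0
```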
When is numeric data easy to interpret?
When defined ranges exist.
How do you measure something like “good”, “bad”, or “healthy”?
You need a domain expert.
What are some cautions on data cleaning?
Document what you do, work carefully, don’t make assumptions, be aware of bias.
What are some ways to introduce bias?
Language - different terms or grammars to describe the domain, data attributes, or the problem.
Search - the chosen search strategy may miss better solutions; look at other search options.
Overfitting - results provide a solution based on bad assumptions/patterns, or the search stops too soon.
Actions already performed on the data.
How the data was gathered (how questions were asked, how responses were interpreted, who asked the questions, how samples were selected).
Note that “bias” here is essentially a synonym for “error”.
What are some examples of data cleaning?
Handling invalid values, duplicates, missing data, data entry errors, converting data to specific values in order to perform correct measurements.
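A hedged sketch of a few of these cleaning steps, assuming Python with pandas; the column names and value ranges are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical customer records with several of the problems listed above.
df = pd.DataFrame({
    "age":  [34, -5, 34, np.nan, 210],                 # invalid + missing values
    "city": ["New York", "NY", "New York", "Boston", "Boston"],
})

df = df.drop_duplicates()                              # remove exact duplicates
df.loc[(df["age"] < 0) | (df["age"] > 120), "age"] = np.nan  # invalid -> missing
df["city"] = df["city"].replace({"NY": "New York"})    # unify coding schemes
print(df)
```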
What is meant by dirty data?
Data that is incorrect, inaccurate, irrelevant or incomplete.
Data needing to be converted (nominal to numeric; see the sketch after this list)
Data with different formats or coding schemes (such as dates)
Data from >1 file with different field delimiters
Data that is coded
Data that must be summarized (“rolled up”)
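For the nominal-to-numeric item above, one common conversion is one-hot encoding; a minimal sketch, assuming Python with pandas and a made-up “outlook” attribute:

```python
import pandas as pd

df = pd.DataFrame({"outlook": ["sunny", "rainy", "overcast", "sunny"]})

# Each label becomes its own 0/1 column, so no false ordering is implied.
encoded = pd.get_dummies(df, columns=["outlook"])
print(encoded)
```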
How does data get “dirty”?
Inconsistent definitions, meanings (especially when combining different sources)
Data entry mistakes
Collection errors
Corrupted data transmissions
Conversion errors.
What are some data issues?
Out of range entries
Unknown, unrecorded or irrelevant data
Missing values
Language translation issues
Unavailable readings
Inapplicable data (asking a male if pregnant)
Customer provided incorrect data
Duplicate data
Stale data
Unavailable data
Data may be available but not in electronic form.
Data associated with the wrong person
User provided wrong data.
Consider representing dates as YYYYMM or YYYYMMDD. What’s good about this formatting? What is the limitation?
Good: You can sort the data.
Limitation: Does not preserve intervals (e.g., 20040201 - 20040131 = 70 as integers, yet the dates are only one day apart).
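A quick illustration of that limitation in Python (the language choice is my assumption):

```python
from datetime import date

# YYYYMMDD integers sort correctly, but integer subtraction is meaningless.
print(20040201 - 20040131)            # 70, not 1

# Converting to real dates preserves intervals.
d1, d2 = date(2004, 2, 1), date(2004, 1, 31)
print((d1 - d2).days)                 # 1
```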
What are some legacy issues when it comes to dates?
Y2K - 2-digit years. Is year 02 1902 or 2002? It depends on context (a child’s birthday vs. the year a house was built). The typical approach is to set a cutoff year: if YY < cutoff, then 20YY, else 19YY.
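A sketch of that cutoff rule in Python; the cutoff value of 30 is an arbitrary assumption, so pick one that fits your domain:

```python
CUTOFF = 30  # assumed value; choose based on the data's context

def expand_year(yy: int) -> int:
    """Map a 2-digit year to a 4-digit year using a cutoff."""
    return 2000 + yy if yy < CUTOFF else 1900 + yy

print(expand_year(2))   # 2002
print(expand_year(85))  # 1985
```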
What are some reasons values may be missing?
- They are unknown, unrecorded, irrelevant data
- Malfunctioning equipment
- Changes in the design
- Collation/merge of different datasets.
- Unavailable data
- Removals because of security or privacy issues.
- Translation issues (especially languages)
- The data being used for a different purpose than originally planned (ethical/legal issues)
- self-reporting - people may omit if the input mechanism does not require an input.
How should one deal with missing values?
- Ignore the attribute or entire instances. (May throw out the needle in the haystack!)
- Try to estimate or predict: use mean, mode, or median values. Relatively easy and not bad on average.
- Treat missing as a separate value
- Look for sentinel values such as 0, “.”, 999, or N/A. Decide on a standard and create a new value.
- Does missing imply a default value?
- Compute the value based on previous values.
- If inserting zeros for missing values, think about what it has done to the mean and standard deviation.
- Be careful when using tools (some have default operations to handle missing data)
- Randomly select values from current distribution (pro: won’t change overall shape of the curve - little impact on the mean).
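A hedged sketch of two of the options above (median imputation and random sampling from the current distribution), assuming Python with pandas/NumPy and made-up income data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series([40_000, 52_000, np.nan, 61_000, np.nan, 48_000])

# Option 1: impute with the median -- easy and "not bad on average".
median_filled = income.fillna(income.median())

# Option 2: randomly sample from the observed distribution, which keeps
# the overall shape of the curve and has little impact on the mean.
observed = income.dropna().to_numpy()
sampled = income.copy()
mask = sampled.isna()
sampled[mask] = rng.choice(observed, size=mask.sum())

print(median_filled.mean(), sampled.mean())
```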
Again, what are some sources of inaccurate data? :)
- Data entry mistakes
- Measurement errors
- Outliers previously removed
- Duplicates
- Stale data
- Different representations of the same value: New York, NY, N.Y.
How can you find inaccurate data?
Look for the obvious (run statistical tools) and look for nonsensical data (a negative grade or age).
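A minimal sketch of both checks, assuming Python with pandas and hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({"age": [21, 19, -3, 22], "grade": [88, 95, 102, -10]})

# Run simple statistics first: min/max in describe() expose the obvious.
print(df.describe())

# Then flag nonsensical, out-of-range rows explicitly.
bad = df[(df["age"] < 0) | (df["grade"] < 0) | (df["grade"] > 100)]
print(bad)
```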
What is discretization?
- Binning: converting continuous values into a set of discrete ranges
- Useful for generating summary data
- Produces discrete values
What is one issue that can come from binning with equal-width?
It can result in clumping. For example, if 99% of employees earn $0-200,000 and the owner makes $2,000,000, then with a bin width of 200,000 nearly everyone lands in the lowest bin and only one person (the owner) is in the upper bin.
How can we even out the distribution?
By binning with equal-height. Instead of defining bin sizes of range N, assign N values to each bin.
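A sketch of equal-width vs. equal-height binning, assuming Python with pandas (pd.cut vs. pd.qcut) and the salary scenario from the previous card:

```python
import pandas as pd

# Mostly low salaries plus one huge outlier, as in the example above.
salaries = pd.Series([30_000, 45_000, 60_000, 75_000, 90_000,
                      120_000, 150_000, 180_000, 2_000_000])

# Equal-width: 4 bins of equal range; the outlier clumps nearly
# everyone into the lowest bin.
print(pd.cut(salaries, bins=4).value_counts().sort_index())

# Equal-height (equal-frequency): roughly the same count in each bin.
print(pd.qcut(salaries, q=4).value_counts().sort_index())
```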
When binning…
- Do not split repeated values across bins
- Create a separate bin for special values.
Talk about the considerations with equal-width and equal-height binning.
Equal-width is simplest and works well in many situations, but equal-height is usually preferred and tends to give better results because it avoids clumping.
Are my bins okay?
After you create bins, create a histogram of the values and look at its general shape. A jagged shape may indicate a weakness in the way the bins were formed, so try a different number of bins and different boundaries (shift the ranges).
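One way to eyeball this check, sketched in Python with pandas and made-up values; printing bin counts stands in for a plotted histogram:

```python
import pandas as pd

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 7, 8, 8, 9, 15])

# Try a few bin counts and compare the shape of each distribution.
for n_bins in (3, 4, 6):
    print(f"--- {n_bins} bins ---")
    print(pd.cut(values, bins=n_bins).value_counts().sort_index())
```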
Why use rollup?
It can help reduce the complexity of your model by aggregating detailed records into summary values.
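A small rollup sketch, assuming Python with pandas and hypothetical daily sales data aggregated to monthly totals:

```python
import pandas as pd

daily = pd.DataFrame({
    "date":  pd.to_datetime(["2004-01-30", "2004-01-31", "2004-02-01"]),
    "sales": [120, 95, 140],
})

# Rolling up daily rows to monthly totals shrinks the data and
# simplifies any model built on it.
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)
```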