Revision Flashcards
(180 cards)
What is a Data Cube?
- Data cubes are the building blocks of multidimensional models on which a data warehouse is based
- A data cube allows data to be modelled and viewed in multiple dimensions.
- A data cube is defined by dimensions and facts.
- Dimensions are perspectives or entities that an organization wants to keep records on. For example, if our data model was related to Sales, we might have dimension tables such as item (item_name, brand, type) or time (day, week, month, quarter, year).
- The fact table will contain measures such as Euros_sold and keys to each of the related dimension tables. A data cube is really a lattice of cuboids.
What is a base cuboid?
The n-dimensional cuboid, which holds the lowest level of summarization, e.g. (time, item, location, supplier).
What is the apex cuboid?
Topmost 0-D cuboid which holds the highest level of summarization (all).
What is a data cube in terms of cuboids?
A lattice of cuboids.
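The lattice for an n-dimensional cube contains 2^n cuboids, one per subset of dimensions. A minimal sketch in Python (the dimension names are just the illustrative ones from the earlier card):

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Enumerate every cuboid (i.e. every group-by subset of dimensions),
    from the 0-D apex cuboid () up to the n-D base cuboid."""
    return [combo
            for k in range(len(dimensions) + 1)
            for combo in combinations(dimensions, k)]

dims = ("time", "item", "location", "supplier")
lattice = cuboid_lattice(dims)

print(len(lattice))   # 16 cuboids, i.e. 2^4
print(lattice[0])     # () -- the apex cuboid (highest summarization)
print(lattice[-1])    # ('time', 'item', 'location', 'supplier') -- base cuboid
```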
What is a Fact Table?
A fact table is a large central table that contains the bulk of the data with no redundancy, and is connected to a set of smaller attendant tables (dimension tables).
What is a distributive measure?
If the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning, e.g. count(), sum(), min(), max()
What is an algebraic measure?
If it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function, e.g. avg(), min_N(), standard_deviation().
What is a holistic measure?
If there is no constant bound on the storage size needed to describe a sub-aggregate. In other words, we need to look at all the data. E.g. median(), mode(), rank()
A distributive measure
An aggregate function is distributive if it can be computed in a distributed manner as follows. Suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to the entire dataset (without partitioning), the function can be computed in a distributed manner.
An example of a distributive measure.
For example, sum() can be computed for a data cube by first partitioning the cube into a set of subcubes, computing sum() for each subcube, and then summing up the values obtained for each subcube. Hence, sum() is a distributive aggregate function. For the same reason, count(), min(), and max() are distributive aggregate functions.
A measure is distributive if…
it is obtained by applying a distributive aggregate function.
Can distributive measures be computed efficiently?
Yes, because of the way the computation can be partitioned.
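The partitioned computation described above can be sketched in a few lines of Python (the data and partition size are illustrative):

```python
# Sketch: sum() computed in a distributed manner.
data = list(range(1, 101))

# Partition the data into n sets.
partitions = [data[i:i + 25] for i in range(0, len(data), 25)]

# Apply sum() to each partition, then apply sum() to the n partial aggregates.
partial_sums = [sum(p) for p in partitions]
distributed = sum(partial_sums)

# Same result as applying sum() to the entire data set without partitioning.
assert distributed == sum(data)  # 5050
```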
An example of an algebraic measure.
For example, avg() can be computed by sum()/count() where both sum() and count() are distributive aggregate functions.
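The avg() = sum()/count() decomposition can be sketched as follows; each partition only contributes two bounded values (its sum and its count), which is what makes the measure algebraic (data and partition size are illustrative):

```python
# Sketch: avg() computed from two distributive aggregates, sum() and count().
data = list(range(1, 101))
partitions = [data[i:i + 25] for i in range(0, len(data), 25)]

total = sum(sum(p) for p in partitions)   # distributive sum()
count = sum(len(p) for p in partitions)   # distributive count()
avg = total / count

assert avg == sum(data) / len(data)  # 50.5
```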
Another example of an algebraic measure.
standard_deviation()
A measure is algebraic if…
it is obtained by applying an algebraic aggregate function.
Describe a holistic function.
An aggregate function is holistic if there is no constant bound on the storage size needed to describe a subaggregate. That is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank().
A measure is holistic if…
it is obtained by applying a holistic aggregate function.
Explain why holistic measures are not desirable when designing a data warehouse.
Most large data cube applications require efficient computation of distributive and algebraic measures. Many efficient techniques exist for this. In contrast, it is difficult to compute holistic measures efficiently. Efficient techniques to approximate the computation of some holistic measures, however, do exist. For example, rather than computing the exact median(), techniques can be used to estimate the approximate median value for a large data set. In many cases, such techniques are sufficient to overcome the difficulties of efficient computation of holistic measures.
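A quick sketch of why median() is holistic: unlike sum() or avg(), combining fixed-size per-partition summaries (such as each partition's median) does not in general recover the exact answer, so all the data must be examined. The numbers below are contrived to make the mismatch visible:

```python
import statistics

# Sketch: median() cannot be computed from per-partition medians.
partitions = [[1, 2, 3], [1, 8, 9], [1, 8, 9]]
data = [x for p in partitions for x in p]

partition_medians = [statistics.median(p) for p in partitions]  # [2, 8, 8]
median_of_medians = statistics.median(partition_medians)        # 8

true_median = statistics.median(data)                           # 3
assert median_of_medians != true_median
```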
What are 2 costs of the Apriori algorithm?
The bottleneck of Apriori is candidate generation. Two costs:
- Possibly huge candidate sets
- Requires multiple scans of the database to count supports for each candidate by pattern matching
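The scale of the first cost can be sketched with a back-of-the-envelope calculation (the figure of 10^4 frequent items is an illustrative assumption): even generating only the candidate 2-itemsets is already expensive.

```python
from math import comb

# Sketch of the candidate-generation cost: with 10^4 frequent 1-itemsets,
# Apriori may generate on the order of C(10^4, 2) candidate 2-itemsets.
frequent_items = 10_000
candidate_pairs = comb(frequent_items, 2)
print(candidate_pairs)  # 49995000 -- nearly 5 * 10^7 candidates
```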
How does the Frequent Pattern Growth approach avoid the two costly problems of Apriori?
- Compresses a large database into a compact frequent pattern tree structure - highly condensed but complete for frequent pattern mining
- Avoids costly, repeated database scans
- Avoids candidate generation: sub-database test only
Why is the FP growth method compact?
- Reduces irrelevant information - infrequent items are gone.
- Frequency descending ordering - more frequent items are more likely to be shared
- Can never be larger than the original database (not counting node-links and counts)
Notion of ‘closeness’ in K-NN.
A measure of distance between the test object and each training observation; the K nearest observations are selected. For numeric attributes, it is usually Euclidean distance.
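A minimal K-NN classifier sketch using Euclidean distance; the training points, labels, and the helper name `knn_predict` are all illustrative assumptions:

```python
import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    """Classify test_point by majority vote among its k nearest
    training observations, using Euclidean distance."""
    by_distance = sorted(train,
                         key=lambda obs: math.dist(obs[0], test_point))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy training data: (numeric attributes, class label)
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.5, 4.5), "B"), ((0.9, 1.1), "A")]

print(knn_predict(train, (1.1, 1.0), k=3))  # A
```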
What are the 3 main steps of a Data Mining process?
Assuming we have defined the problem for which we are developing a DM solution, we can describe the 3 main steps of DM as:
- Data gathering + preparation
- Model building and evaluation
- Knowledge deployment
Describe the tasks that need to be performed at each step of the DM process.
- Data gathering + preparation = data access, data sampling, data transformation
- Model building + evaluation = create model, test model, evaluate model, interpret model
- Knowledge deployment = Apply model, custom reports, external applications