Micro partitions and Clustering Flashcards
(30 cards)
What is Microparitioning?
It is a way how SF physically stores the data. Incoming source data is divided into smaller chunks, their metadata collected and stored.
Why there is a need for micropartitioning?
For query efficiency. This avoids full table scan by using query pruning. SF collects some metadata on micro partitions. So when a query runs, based on the metadata it prunes away some partitions, so full table scan is avoided.
What is the size of micro partition?
50 to 500MB before compression. Data is always stored as compressed format in SF.
What is query pruning?
It is simply a concept where unnecessary partitions are avoided to be scanned for that query. It is based on the metadata collected while creating micro partitions.
What metadata is collected on micropartitions?
Range of values.
distinct values.
additional properties that SF collects for query performance which is not revealed
Micro partitions are collected by users? Can it be changed by users?
It is not created by users. SF automatically does it as data is coming in. It cannot be changed by users.
Users can do clustering from their end which will improve performance.
What is clustering depth?
Once micro partitions are created, SF also collects metadata on how many micro partitions overlap… This overlapping is called clustering depth. Lower the depth it is good.
What is the clustering depth of empty table?
0
What are the disadvantages of static partitioning done in traditional databases which Microparitioning takes care of ?
- maintenance (as data gets updated, static partitioning can go outdated)
- Data skew if partition columns are not uniformly distributed.
Is micro partitions logical storage or physical storage?
They are physical storage - how tables are physically stored at the back end - data is stored in columnar format.
When do you need clustering/setting up cluster keys?
When clustering depth is large on micropartitions, (because ordering at the time of the insert was not perfect)
when lot of DMLs has happened on the tables and existing micropartitions are not causing efficient query pruning etc.
Clustering can be done on what object types?
Tables and materialized views
Clustering is recommended for what tables?
large tables where queries are running slower.
large tables where DML is happening frequently.
generally tables in Terabyte range
How do you choose columns for clustering?
Clustering will create new micropartitions (MPs) so it will add to storage cost so you have to be careful about how many columns and what columns you choose.
- 3 to 4 columns is ok.
- Choose columns used in filters and then used in joins.
Choose columns
- that have lot of distinct values so many MPs can be ignored using query pruning
- as well as it should not be so unique that MPs cannot group them.
What columns should not be chosen for clustering?
Extremely high cardinal (most unique values) like id columns should not be chosen….because it cannot be grouped together for creating a cluster.
Extremely low cardinal columns (less unique values) like gender should not be chosen as well….because query pruning cannot be done efficiently.
For high cardinal columns like timestamp, you can do expressions like to_date conversion to do efficient clustering.
When choosing multiple columns for clustering - what order to be followed?
Choose from lowest cardinality to highest.
What is Reclustering?
when lot of DMLs has happened on clustered tables over time, SF automatically does Reclustering to keep clustering/query performance efficient.
Does reclustering end up in additional costs?
Yes, because it will create new MPs. Old MPs will be reserved for time travel and fail safe.
Can you do expressions on clustering key columns?
Yes, some limited expressions like timestamp to date conversions can be done.
Does changing the clustering key using ALTER impact anything to the table immediately?
There will be no impact unless there is another reclustering.
When does SF does reclustering?
When SF determines there is a benefit in doing this….it is doing automatically.
No need to monitor clustered table current state by users.
No need for users to designate any WHs for reclustering.
Everything is done by SF.
What are benefits of automated reclustering from SF?
- Users need not allocate resources/WHs for it. SF does it dynamically when it sees the benefit.
- Users need not monitor the current state.
- SF gives full control in the sense - users can even suspend reclustering OR even remove the cluster key.
- It does not block DML operations while reclustering is done.
Do you need to provide virtual WH for reclustering?
No need…as it is done by SF automatically dynamically.
What privileges are needed to add clustering to the table?
USAGE/OWNERSHIP privileges on the database and schema of that table.