What is a page?
SQL Server internally organizes data in a data file in pages. A page is an 8KB unit and belongs to a single object; for example, to a table or an index. A page is the smallest unit of reading and writing.
Pages are physical structures.
What is an extent?
An extent consists of eight consecutive/contiguous pages. Pages from an extent can belong to a single object (uniform extent) or to multiple objects (mixed extent). SQL Server stores the first 8 pages of an object in mixed extents. When an object exceeds 8 pages, SQL Server allocates additional uniform extents for this object. With this type of organization, small objects waste less space and big objects are less fragmented.
What are the two types of logical structures SQL Server uses organize data in pages?
Heaps or Balanced Trees (B-Trees)
What is a heap?
The default structure for organizing pages is called a heap. Heaps occur when a B-Tree is not used to organize the data pages in a table. Specifically, if you do not create a clustered index explicitly (or implicitly through a PK or UNQ constraint), then a table is organized as a heap.
SQL Server can find data in a heap only by scanning the whole heap. SQL Server uses IAM pages to scan heaps in physical order or allocation order. Even if your query wants to retrieve only a single row, SQL Server has to scan the entire heap.
SQL Server stores new rows anywhere in the heap. They are inserted without an enforced order. It can store a new row in an existing page if the pages is not full or allocate a new page or extent for the object.
SQL Server does not allocate any pages for a table when you create it. It allocates the first page and also the first IAM page, when you insert the first row in the table.
What is an IAM page?
IAM stands for Index Allocation Map. These pages allow SQL Server to trace which pages and extents belong to an object. Every table or index has at least one IAM page called “first IAM”. A single IAM page can point to approximately 4 GB of space. Large objects can have more than one IAM page. IAM pages are organized as a doubly linked list: each page has a pointer to its descendent and antecedent.
What does the sys.indexes catalog view provide?
sys.indexes provides general information about tables and indexes. The type column stores a value of 0 for heaps, 1 for clustered tables (indexes), and 2 for nonclustered indexes.
What does the sys.dm_db_index_physical_stats DMV provide?
These objects tell you how many pages are allocated for an object, page space (KB), row counts, internal/external fragmentation, etc.
For clustered indexes, it shows the index depth, index level, and page count per level.
What is internal fragmentation?
Internal fragmentation means that pages are not full. The more rows you have stored on a single page, the fewer pages SQL Server must read to retrieve these rows, and the less memory it uses for cached pages for the same number of rows.
In heaps you do not get much internal fragmentation because SQL Server stores new rows in existing pages if there is enough room.
However, in clustered indexes, when you insert a new row into a full page, SQL Server has to split the page into two pages and move half of the rows into the second page. This happens because SQL Server needs to maintain the logic order of rows. This way, you get some internal fragmentation.
What is external fragmentation?
Also, known as logical fragmentation, external fragmentation occurs when pages are physically out of order.
External fragmentation can slow down full or partial scans in logical order.
What is a balanced tree?
You organize a table as a balanced tree when you create a clustered index. The structure is called a balanced tree because it resembles an inverse tree. Every balanced tree has a single root page and at least one or more leaf pages. In addition, it can have zero or more intermediate levels.
All data in a clustered table is stored in leaf pages. Data is stored in logical order of the clustering key and is not physically ordered. SQL Server still uses IAM pages to follow the physical allocation.
Pages above the leaf level point to leaf-level pages. A row in a page above the leaf level contains a clustering key value and a pointer to a page where this value starts in logically ordered leaf level. SQL Server creates intermediate level pages which point to leaf level pages as necessary when it can no longer reference all leaf level pages.
Pages on the same level are organized as a doubly linked list; therefore, SQL Server can find the previous and the next page in logical order.
What is a clustering key?
A clustering key can consist of a single column or multiple columns. If the key consists of multiple columns, then this is a composite key.
You can have up to 16 columns in a key; the size of all columns together in a composite key must not exceed 900 bytes.
Can you use non-unique values as the key of a clustered index?
Yes. You can use a column or columns with unique or non-unique values for a key of a clustered index. However, SQL Server internally always maintains uniqueness of the clustering key. It adds a uniquifier value which is a sequential integer to the repeating values. The first value is stored without the uniquifier; the first repeating value gets a uniquifier with a value of 1, the second 2, and so on.
How can you control internal and external fragmentation?
You can control the internal fragmentation with the FILLFACTOR option for the leaf-level pages and with the PAD_INDEX option for the higher-level pages of the CREATE INDEX statement.
You can also rebuild or reorganize an index to get rid of the external fragmentation by using the ALTER INDEX…REORGANIZE or ALTER INDEX…REBUILD statements.
Why should you use a short clustering key?
A short clustering key means that more rows can fit on pages above the leaf level. Therefore, fewer levels are potentially needed. Fewer levels means a more efficient index because SQL Server needs to read fewer pages to find a row. A uniquifier extends the key; therefore, having a short and unique key is preferred for seeks.
What are some of the advantages of clustered indexes over heaps?
A clustered index scan can be done in logical order or, when the logical order is not needed, in physical order or allocation order. In addition, SQL Server can perform a partial scan if sequential rows in the order of the clustering key are requested by your query.
What are some disadvantages of clustered indexes compared to heaps?
When using a clustered index, when you insert a new row into a full page, SQL Server has to split the page into two pages and move half of the rows into the second page. This happens because SQL Server needs to maintain the logical order of rows. This way, you get some internal fragmentation which you cannot get in a heap.
What is a scan and a seek?
An index scan retrieves all rows from the table. Index seek retrieves selective rows from the table.
What type of clustering key should you select for different business scenarios such as OLTP and DW?
For OLTP applications, selecting a sequential integer as the clustering key is typically a very good choice.
However, in DW scenarios, many queries read huge amounts of data, typically ordered. For example, many DW queries search for rows in order of a date or datetime column. If this is the case, then you might prefer to support such a partial scan and create a clustered index on the date column.
How do randomly generated GUIDs (NEWID) measure up as clustering keys?
GUIDs for clustering keys can lead to quite inefficient indexes.
There will be multiple page splits and no page will physically be in the correct logical order, e.g. external fragmentation. The external fragmentation mainly slows down scans which should not be that frequent in OLTP environments; however, they are very important to data warehousing area.
Internal fragmentation is a problem in both scenarios because the table is much bigger than it would be with a sequential key.
What is the difference between reorganizing an index vs rebuilding an index?
You can get rid of fragmentation if you rebuild or reorganize the index.
Reorganizing an index is a slower but less intrusive process than rebuilding an index.
Reorganizing physically reorganizes the leaf nodes of the index while a rebuild drops the existing Index and recreates the index.
Rebuilding uses more CPU and it locks the database resources.
As a general guideline, you should reorganize an index when the external fragmentation is less than 30 percent and rebuild it if it is greater than 30 percent.
What type of clustering key would you select for an OLTP environment?
For an OLTP environment, a short, unique, and sequential clustering key might be the best choice.
What is a nonclustered index?
A nonclustered index does not contain all of the data and serves as a pointer to table rows for quick seeks.
Nonclustered indexes have a very similar structure to clustered ones. Actually, the root and intermediate levels look the same as in a clustered index. The leaf level is different because it does not hold all of the data. What is stored on the leaf level of a nonclustered index depends on the underlying table organization whether it is organized as a heap or as a balanced tree.
The leaf level of a nonclustered index contains the index keys and “row locators”. A row locator points to a row in the underlying table.
If the table is a heap, then the row locator is called a row identifier (RID). This is an 8-byte pointer containing the database file ID, page ID of the target row, and the target row ID on that page. The operation of retrieving the row from the heap is called “RID lookup”.
If the table is organized as a balanced tree, then the row locator is the clustering key. This means that when SQL Server seeks for a row, it has to traverse all levels on a nonclustered index and then also all levels of a clustered index. This operation is called a “key lookup”.
What is a filtered non clustered index?
There are atypical situations where including all of the records in a table is less than ideal.To assist in these scenarios, nonclustered indexes can be filtered to reduce the number of records they contain.
Filtered nonclustered indexes are useful when some values in a column occur rarely whereas other values occur frequently. In such cases, you would create a filtered index over the rare values only. SQL Server uses this index for seeks of rare values, but performs scans for frequent values.
Filtered indexes are inexpensive to maintain, because SQL Server has to update them for changes in the rare values only.
You create a filtered index by adding a WHERE clause to the CREATE INDEX statement.
How can you use a filtered nonclustered index to enforce a filtered uniqueness?
For example, imagine that a column has NULLs in multiple rows; however, known values must be unique. You cannot create a filtered PK or UQ constraint; however, you could create a filtered unique nonclustered index from known values only which would allow multiple NULLs and reject duplicate known values.
What is a columnstore index?
In addition to regular row storage, SQL Server 2012 can store index data column by column in what’s called a columnstore index. Columnstore indexes can speed up data warehousing queries by a large factor, from 10 to even 100 times.
A columnstore index is just another noclustered index on a table. The SQL Server Query Optimizer considers using the columnstore index during the query optimization phase just as it does any other index. All you have to do to take advantage of this feature is create a columnstore index on a table.
A columnstore index is stored compressed. The compression factor can be up to 10 times the original size of the index. When a query references a single column that is a part of a columnstore index, then SQL Server fetches only that column from disk; it doesn’t fetch entire rows as with row storage. This also reduces disk I/O and memory cache consumption. Columnstore indexes use their own compression algorithm; you cannot use Row or Page compression on a columnstore index.
On the other hand, SQL Server has to return rows. Therefore, rows must be reconstructed when you execute a query. This row reconstruction takes some time and uses some CPU memory resources.
How does the performance of a columnstore index measure up?
Columnstore indexes accelerate data warehouse queries, not OLTP workloads.
Because of the row reconstruction issues and other overhead when you update compressed data, tables containing a columnstore index become read-only. If you want to update a table containing a columnstore index, you must first drop the index.
How are columnstore indexes organized?
The columnstore index is divided into units called segments. Segments are stored as large objects and consist of multiple pages. Segments are the unit of transfer from disk to memory. Each segment has metadata that stores the min and max value of each column for that segment. This enables early segment elimination in the storage engine. SQL Server loads only those segments requested by a query into memory.
What are indexed views?
You can optimize queries that aggregate data and perform multiple joins by permanently storing the aggregated and joined data.
You can create a view with a query that joins and aggregates data. Then you can index the view to get an indexed view. With indexing, you are materializing a view.
Note that the view must be created with the SCHEMABINDING option if you want to index it. In addition, you must use the COUNT_BIG aggregate function.
Nevertheless, after you create the view and the index, execute the aggregate query again and measure the I/O.
How many clustered indexes can you create on a table?
There can only be 1 clustered index because this is the table itself organized as a balanced tree.
How many nonclustered indexes can you create on a table?
What is the row locator when a table is stored as a balanced tree?
What is the row locator when a table is stored as a heap?
What are the different types of levels in a balanced tree?
Root level, intermediate level, leaf level.
What is one of the most important parts of a query that can benefit from an index?
The WHERE clause is one of the most important parts of a query that can benefit from an index.
You can check whether an index was used by displaying the estimated or actual execution plan. You can also track index usage by querying the sys.dm_db_index_usage_stats DMV.
What happens when your query does not include a WHERE clause and returns all rows from a table?
The execution plan will show that SQL Server used (for example) a clustered index scan. The whole table was scanned, regardless of how many indexes there are.
In the execution plan, what does the Ordered bit indicate?
When the Ordered property of the operator is set to False, the scan is unordered - also known as an allocation scan.
Remember that order is not guaranteed if you do not include the ORDER BY clause.
Does adding a WHERE clause guarantee that an index is going to be used?
Adding a WHERE clause to a query does not guarantee that an index is going to be used. The clause has to be supported by an appropriate index, and it must be selective enough.
If the query returns too many rows, it is less expensive for SQL Server to perform a table or clustered index scan than to do a nonclustered index seek and then RID or key lookups.
Can the JOIN clause benefit from appropriate indexes as well?
If you query aggregates data and uses the GROUP BY clause, should you consider supporting this clause with an index?
Yes. SQL Server can aggregate data by using a hash or a stream aggregate operator. The stream aggregate is faster; however, it needs sorted input.
The hash match (aggregate) operator is used when an aggregate query is not supported by an index.
An aggregate query can benefit from an index even if it does not include the GROUP BY clause. For example, if you use the MIN aggregate function and you have an appropriate index, then SQL Server can seek for the first value of an index only, and does not have to scan the entire table.
If you include the ORDER BY clause, should you consider supporting it with an index?
Yes. If there is no appropriate index for the ORDER BY clause, SQL Server must sort data before returning it. Sorting large data sets could be a big performance hit on SQL Server. The data needs to be sorted in memory or must be spilled to tempdb if it does not fit in memory.
What is meat by a “covering” index?
When an index contains all the columns referenced by a query it is typically referred to as covering the query.
Covered queries are very efficient.
What does the INCLUDE clause do with regards to the CREATE INDEX statement?
In an attempt to cover more queries with a non clustered index, you could try to add more columns. However, with a longer key, the index would become less efficient.
SQL Server 2012 allows you to include a column in a nonclustered index on the leaf level only and not as part of a key.
You can do this by using the INCLUDE clause of the CREATE INDEX statement.
The included column is not part of the key, and SQL Server does not use it for seeks. Included columns help cover queries.
An index with non key columns can significantly improve query performance when all columns in the query are included in the index as key or nonkey columns. Performance gains are achieved because the query optimizer can locate all the column values within the index, resulting in less I/O.
However, you should be careful not to include too many columns. For example, if you included all columns of a table, you would actually copy the table.
What is a SARG? Why do you need it in order for an index to be used?
A SARG (searchable argument) in a predicate helps the Query Optimizer decide to use an index.
To write an appropriate SARG, you must ensure that a column that has an index on it appears in the predicate alone and not as a function parameter.
The column name is alone on one side of the expression, and the constant or calculated value appears on the other side.
Inclusive operators include eq, lt, gt, gte, lte, BETWEEN, LIKE.
However, the LIKE operator is only inclusive if you don’t use a wildcard % or _ at the beginning of the string you are comparing the column to.
How does the Query Optimizer convert the arguments provided in an IN clause?
The Query Optimizer converts the IN operator to OR with a separate comparison to each element from the IN operator list.
What clauses of a query should you consider supporting with an index?
WHERE, JOIN, GROUP BY, and ORDER BY.
How does the Query Optimizer handle ANDs vs ORs?
Using the AND operator in the WHERE clause predicate means that each part of the predicate limits the result set even more than the previous part. The Query Optimizer understands how the logical AND operator works, and can use appropriate indexes.
However, the logical OR operator is inclusive.
If the two conditions use two different columns, then SQL Server conservatively takes the worst case and estimates that the query would return the max number of rows.
Having multiple conditions in a predicate connected with OR operator lowers the possibility for SQL Server to use indexes.
You should consider rewriting the predicate to a logically equivalent predicate that uses the AND operator.
How can you support the SELECT clause of a query by using a nonclustered index that is already used for the WHERE clause?
You could modify the index that is already used to INCLUDE the columns from the SELECT list that are not part of the key.
Where does SQL Server sort data if a sort is needed?
SQL Server sorts data in memory or spills the data to tempdb if it does not fit in memory.
You create an index to support the WHERE clause of a query. However, SQL Server does not use the index. What are some possible reasons?
(1) The arguments in the predicate are not searchable, (2) The predicate is not selective enough
What are statistics and how are the used?
SQL Server maintains statistics of the distribution of key values in special system statistical pages. The Query Optimizer uses these statistics to estimate the cardinality, or number of rows, in the query result set. In other words, it helps the Query Optimizer, produce an efficient query execution plan.
By default, SQL Server creates statistics automatically for each index and for searchable non-key columns (used as searchable arguments) during query execution.
Each statistics object is stored in a statistics binary large object and is created on one or more columns.
Statistics include a header with metadata about the statistics and a density vector to measure cross column correlation.
Statistics also include a histogram with the distribution of values in the first column.
What are the 3 database options that influence the automatic creation of statistics?
(1) AUTO_CREATE_STATISTICS - When this option is set to on, SQL Server creates statistics automatically. This option is on by default and you should leave this option on in the vast majority of cases.
(2) AUTO_UPDATE_STATISTICS - When this option is set to on, SQL Server automatically updates statistics when there are enough changes in the underlying tables and indexes. With this option on, SQL Server also updates an out-of-date statistics during query optimization. SQL Server checks for outdated statistics before compiling a query and before executing a cached query. In general, you should leave this option turned on.
(3) AUTO_UPDATE_STATISTICS_ASYNC - This option determines whether SQL Server uses synchronous or asynchronous statistics updates during query optimization. If the statistics are updated asynchronously, SQL Server cannot use them for the optimization of the query that triggered the update; however, SQL Server does not wait for the statistics update during the optimization phase.
What is the limit for the number of steps in a histogram?
A statistic can have maximally 200 steps.
What catalog views can you query to get information about statistics?
You can get information about statistics by querying the sys.stats and sys.stats_columns catalog views.
How can you get detailed information about statistics?
You can get detailed information about statistics with the DBCC SHOW_STATISTICS command.
Note that you pass in the table name and the name of the statistic.
DBCC SHOW_STATISTICS(N’Sales.Orders’, N’idx_nc_empid’);
By default, you get all statistics information, including the header, density vector, and histogram.
From the header, you can get information like when the statistics were last updated.
The WITH STAT_HEADER option simply returns the header.
The WITH HISTOGRAM option simply shows the histogram of the statistics.
What commands are available for manually maintaining statistics?
You can manually maintain statistics with the CREATE, DROP, and UPDATE statistics commands.
You can also use the sys.sp_updatestats system procedure to manually update statics for all tables in a database. Note that this stored procedure can take a long time to execute and use a lot of resources.
What does the STATS_DATE() system function do?
STATS_DATE() provides information about when the statics were last updated.
What are filtered statistics?
Similar to filtered indexes, you can also create filtered statistics. Statistics created by SQL Server automatically are always created on all rows of a table. If queries frequently select from a subset of rows that has a unique data distribution, filtered statistics can improve query plans.
What should you do if you get a warning in the execution plan that a particular statistic is missing?
You can create this statistic manually. However, before creating it manually, you should verify that AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS database options are turned on and that the database is not read-only. If the database is read-only, the Query Optimizer cannot save statistics.
When should you consider updating statistics manually?
(1) When query execution times are slow, and you know that the queries are written correctly and supported with appropriate indexes. Before you use query hints, update statistics. SQL Server does not consider using the index with outdated statistics. Check also whether auto-updating statistics is turned off for the database.
(2) When insert operations occur on asc or desc key columns. Statistics are not updated for every single row; therefore, the number of rows inserted might be too small to trigger a statistics update. If queries select from the recently added rows, the current statistics might not have cardinality estimates for these new values. In addition bulk inserting rows to a table or truncating can change the distribution of data a lot. Queries executed right after these operations might get a suboptimal execution plan.
(3) After an upgrade from a previous version of SQL Server. Statistics information can change with a new version of SQL Server. To be on the safe side, you should update the statistics for the upgraded database.
How would you quickly update statistics for the whole database after an upgrade?
You should use the sys.sp_updatestats system stored procedure.
What are some of the reasons to create statistics manually?
One example is when a query predicate contains multiple columns that have cross-column relationships; statistics on the multiple columns can help improve the query plan.
Statistics on multiple columns contain cross column densities that are not available in single-column statistics. However, if the columns are already in the same index, the multi-column statistics object already exist, so you should not create an additional one manually.