What is a data analysis function?
A data analysis function is a function applied to a set of rows and it returns a single value, e.g. the SUM aggregate function.
When does a query become a “grouped query”?
When you use an aggregate function, a GROUP BY clause, or both.
What happens when you invoke a group function, but don’t include an explicit GROUP BY clause?
All rows are arranged in one group and then the group function operates on the rows in that single group.
What happens when you invoke a group function and include an explicit GROUP BY clause?
Rows are arrange in one or more groups according to the grouping set of expressions and the group function operates on each group.
What is the HAVING clause?
The HAVING clause uses a predicate but evaluates the predicate per group as opposed to per row. This means that you can refer to aggregate computations because the data has already been grouped.
What is the difference between the HAVING and WHERE clauses?
WHERE is evaluated per row; HAVING is evaluated per group.
What are the general set functions supported by SQL?
(1) COUNT, (2) SUM, (3) AVG, (4) MIN, and (5) MAX. General set functions are applied to an expression and ignore NULLs.
What is the difference between COUNT(col1) and COUNT(*)?
The former ignores NULLs and therefore the counts are less than or equal to those produced by the latter.
When using general set functions, how can you work with distinct occurrences?
You can specify a DISTINCT clause before the expression, e.g. COUNT(DISTINCT shippeddate).
When using a GROUP BY clause, what must be done to expressions in the clauses that follow, namely, HAVING, SELECT, and ORDER BY?
All expressions that appear in those clauses must guarantee a single result value per group. There’s no problem referring directly to elements that appear in the GROUP BY clause because each of those already return one distinct value per group. For other elements from the underlying table, you must apply an aggregate function.
What are the clauses that allow you to define multiple grouping sets?
(1) GROUPING SETS, (2) CUBE, (3) ROLLUP. You use these in the GROUP BY clause.
What is the GROUPING SETS clause and how do you use it?
You can use the GROUPING SETS clause to list all grouping sets that you want to define in the query. You list the grouping sets separated by commas within the outer pair of parenthesis. You use an inner pair of parenthesis to enclose each grouping set. If you don’t use inner parenthesis, each individual element is considered a separate grouping set. e.g. GROUP BY GROUPING SETS ( (shupperid, YEAR(shippeddate)), (shipperid), (YEAR(shippeddate)), ());
Can a grouping set be empty?
Yes. This results in one group with all rows for computation of grand aggregates.
When using grouping sets, what happens in the output when an element isn’t part of the grouping set?
NULLs are used as placeholders in rows where an element isn’t part of the grouping set.
What is the CUBE clause and how do you use it?
The CUBE clause accepts a list of expressions as inputs and defines all possible grouping sets that can be generated from the inputs - including the empty grouping set., e.g. GROUP BY CUBE (shipperid, YEAR(shippeddate)); This produces 4 grouping sets: (1) shipperid, (2) YEAR(shippeddate), (3) shipperid, YEAR(shippeddate), (4) (Empty)
What is the ROLLUP clause and how do you use it?
The ROLLUP clause accepts a list of expressions as inputs and defines a hierarchy formed by the input elements such as a location hierarchy (country, region, city), .e.g GROUP BY ROLLUP (country, region, city) produces 4 grouping sets: (1) country, region, city, (2) country, region, (3) country, (4) (Empty).
How can you tell whether a NULL in the grouped results represents a placeholder or an original NULL from the table?
T-SQL provides 2 functions: GROUPING and GROUPING_ID.
What is the GROUPING function and how does it work?
Tells whether a NULL in the grouped results represents a placeholder or an original NULL. GROUPING accepts a single element as input and returns 0 when the element is part of the grouping set and 1 when it isn’t, e.g. GROUPING(country) => 0/1
What is the GROUPING_ID function and how does it work?
GROUPING_ID accepts the list of grouped columns as inputs and returns an integer representing a bitmap. The rightmost bit represents the rightmost input. The bit is 0 when the respective element is part of the grouping set and 1 when it isn’t. The result integer is the sum of the values representing elements that are not part of the grouping set because their bits are turned on., e.g. GROUPING_ID(country, region, city), 7 would represent the empty grouping set - none of the 3 elements is part of the grouping set. Therefore, the respective bits (1, 2, 4 => 7) are turned on.
Can you specify multiple GROUPING SETS, CUBE, and ROLLUP clauses in the GROUP BY clause?
Yes; however, by doing so, you achieve a multiplication effect. For example, CUBE(a,b,c) yields 8 grouping sets. ROLLUP(x,y,z) defines 4 grouping sets. By specifying a column between them as in CUBE(a,b,c), ROLLUP(x,y,z) you multiply them and get 32 grouping sets.
What are the clauses that you can use to define multiple grouping sets in the same query?
GROUPING SETS, CUBE, and ROLLUP.
What does COUNT(col1) return when all inputs are NULLs?
It returns 0 when all inputs are NULLs whereas other general set functions like MIN, MAX, SUM, and AVG return NULL in such as case.
Can the GROUPING and GROUPING_ID functions be used to sort data based on grouping set association - that is, first detail and then aggregates?
These functions can be used for sorting data because they return a 0 bit for a detail element and a 1 bit for an aggregated element. So, if you want to see detail first, sort by the result of the function in ascending order.
What is pivoting data?
Pivoting is a technique that groups and aggregates data, transitioning it from a state of rows to a state of columns.
What are the three things you need to identify in all pivot queries?
(1) What do you want to see on rows? This element is known as the “on rows” or “grouping element” (2) What do you want to see on columns? This element is known as the “on cols” or “spreading element” (3) What do you want to see in the intersection of each distinct row and column value? This element is known as the “data” or “aggregation element”.
What is the general form for a pivot query?
WITH PivotData AS ( SELECT < grouping column >, < spreading column >, < aggregation column > FROM < source table > )
SELECT < select list >
PIVOT ( < aggregate function >(< aggregation column >)
FOR < spreading column > IN () ) as P;
What are the steps to a pivot query?
(1) Define a table expression (like the one named PivotData) that returns the 3 elements necessary for pivoting. (2) Issue an outer query against the table expression and apply the PIVOT operator to that table expression. The PIVOT operator returns a table result. Assign an alias to the table result (P). (3) Specify the aggregate function for the PIVOT operator (e.g. SUM) (4) Then specify the FOR clause followed by the spreading column. (5) Then specify the IN clause followed by the list of distinct values that appear in the spreading element, separated by commas.
What is an example of the pivot query?
WITH PivotData AS ( SELECT custid, -- grouping column shipperid, -- spreading column freight, -- aggregation column FROM Sales.Orders ) SELECT custid, , ,  FROM PivotData PIVOT(SUM(freight) FOR shipperid IN (, , )) AS P;
The aggregation and spreading elements are indicated in the PIVOT clause, but how is the grouping element identified?
Process of elimination - it’s what’s left from the queried table besides the aggregation and spreading elements.
Why should you prepare a table expression for the pivot operator to return only the three elements needed for the pivot task?
Because all elements besides the aggregation and spreading elements are implicitly used for grouping. By using a table expression, you control which columns are used for grouping.
What are the four limitations to the PIVOT operator?
(1) The aggregation and spreading elements cannot be the results of expressions - they must be column names from the queried table. You can however apply expressions in the query defining the table expression, assign aliases to those expressions, and then use the aliases in PIVOT (2) The COUNT(*) function isn’t allowed as an aggregate function used by PIVOT. You must use COUNT( < col name >) There is a work around using the table expression (3) PIVOT is used to using only one aggregate function (4) The IN clause of the PIVOT operator accepts a static list of spreading values. It doesn’t support a subquery as input. You need to know ahead of time what the distinct values are in the spreading column. You can use dynamic SQL to work around this.
What is unpivoting?
String with pivoted data, when unpivoting, you rotate the input data from a state of columns to a state of rows.
What types of “functions” are PIVOT and UNPIVOT?
They are table operators similar to JOIN, etc.
What are the three elements in every unpivoting task?
(1) The set of source columns that you’re unpivoting, (2) The name you want to assign to the target values column (e.g. “freight”), (3) The name you want to assign to the target names column (“shipperid”).
What is the general form for unpivot?
SELECT < column list >, < names column >, < values column >
FROM < source table >
UNPIVOT( < values column > FOR < names column > IN ( )) AS U;
What is an example of unpivot?
SELECT custid, shipperid, freight
UNPIVOT(freight FOR shipperid IN (,,)) AS U;
Does the UNPIVOT operator filter out rows with NULLs in the value column?
Yes. The assumption is that those represent inapplicable cases. There’s no reason to keep a row for a certain customer-shipper pair if it’s not applicable.
What are the types of the names and values columns?
The names colum is defined as a nvarchar(128) and the values column is defined with the same type as the type of the source columns that were unpivoted.
What is the difference between pivot and unpivot?
Pivot rotates data from a state of rows to a state of columns. Unpivot rotates the data from columns to rows.
What is a window function?
You define a set of rows per function and then return one result value per each underlying row and function. The window is defined with respect to the current row.
What is a group function?
You use grouped queries to arrange the queried rows in groups and then the group functions are applied to each group. You get one result row per group - not per underlying row.
What are the three types of window functions?
Aggregate, ranking, and offset.
What is one benefit of window functions over group functions?
Windowed queries do not hide detail - they return a row for every underlying query’s row. This means you can mix detail and aggregated elements in the same query.
What clause do you use to define a window?
You use an OVER clause to define a window. When using empty parenthesis, the OVER clause represents the entire underlying query’s result set, e.g. SUM(val) OVER () represents the grand total sum over all rows - it’s treated as one partition. You can use a window function partition clause to restrict the window, e.g. SUM(val) OVER (PARTITION BY custid) represents the current customer’s total.
What is framing?
Framing is a filtering option available to window aggregate functions. You define ordering within the partition by using a window order clause, and then based on that order you can confine a frame of rows between two delimiters.
What are the options for window framing units?
ROWS or RANGE
What are the delimiter options available for the ROWS window frame unit?
(1) UNBOUNDED PRECEDING or FOLLOWING, (2) CURRENT ROW, (3) < n > ROWS PRECEDING or FOLLOWING
What is the shorter form of the frame extent: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW?
ROWS UNBOUNDED PRECEDING
What clauses are window functions allowed to be used?
SELECT and ORDER BY - if you need to refer to the result of a window function in any clause evaluated before SELECT, you need to use a table expression (CTE).
What frame extent clause would you use to include only the last three rows?
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW.
What are the differences between ROWS and RANGE window frame extent?
RANGE is based on logical offsets from the current row’s sort key. ROWS is based on physical offsets in terms of number of rows from the current row. SQL 2012 has a very limited implementation of RANGE and supports only UNBOUNDED PRECEDING or FOLLOWING and CURRENT ROW as delimiters. One difference between ROWS and RANGE when using the same delimiters is that the former doesn’t include tied rows in terms of the sort key and the latter does.
What is the default when you define a window with a window order clause, but no window frame clause?
The default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Therefore, if you are after the special behavior you get from RANGE that includes peers, make sure you explicitly define the ROWS option.
Which window option ROWS or RANGE gets optimized better?
The ROWS option usually gets optimized much better than RANGE when using the same delimiters.
What are the ranking functions supported in T-SQL?
ROW_NUMBER, RANK, DENSE_RANK, and NTILE.
Is the ORDER BY clause mandatory with the window ranking functions?
Yes.But the window order clause only determines ordering for the window function’s computation not for presentation. There’s no guarantee that the rows will be presented in the same order as the window function’s ordering. If you need such a guarantee, add a presentation ORDER BY clause.
What is the ROW_NUMBER function?
The ROW_NUMBER function computes a unique sequential integer starting with 1 within the window partition based on the window ordering. Note that if the ordering isn’t unique, the ROW_NUMBER function is not deterministic. If there’s no “tie breaker” in the ordering, the choice of which row gets the higher number is arbitrary - optimization dependent.
What is the RANK function?
The RANK function returns the number of rows that have a lower ordering value than the current plus 1. Can have gaps between ranking values.
What is the DENSE_RANK function?
The DENSE_RANK function returns the number of distinct ordering values that are lower than the current plus 1. Cannot have gaps between ranking values.
What is the NTILE function?
The NTILE function allows you to arrange the rows within the partition into a requested number of equally sized tiles based on the specified ordering. You specify the desired number of tiles as input to the function, e.g. NTILE(100). If there are 830 rows in the result set, the tile size is 830 / 100 = 8 with a rem of 30. Because there is a rem, the first 30 tiles are assigned an extra row.
What are the 4 different window offset functions available in T-SQL?
LAG, LEAD, FIRST_VALUE, and LAST_VALUE.
In general, what purpose do the window offset functions serve?
Return an element from a single row that is in a given offset from the current row in the window partition, or from the first or last row in the window frame.
What are the LAG and LEAD functions?
The LAG and LEAD functions support window partition and ordering clauses. They don’t support a window frame clause. The LAG function returns an element from the row in the current partition that is a requested number of rows before the current row with 1 assumed as the default offset. The LEAD function returns an element from the row that is in the requested offset after the current row. If no explicit offset is specified, it uses a default of 1. If you want a different offset, you specify it as the second argument, e.g. LAG(val,3) If a row doesn’t exist in the requested offset, NULL is returned. If you want to return something different, specify it as the third argument, e.g. LAG(val,3,0).
What is an example of LAG/LEAD?
SELECT custid, orderid, orderdate, val,
LAG/LEAD(val) OVER (PARTITION BY custid ORDER BY orderdate, orderid)
What are the FIRST_VALUE and LAST_VALUE functions?
The FIRST_VALUE and LAST_VALUE functions return a value expression from the first or last rows in the window frame respectively. These functions support both window partition, order, and frame clauses.
What is an example of FIRST_VALUE and LAST_VALUE?
SELECT custid, orderid, orderdate, value,
FIRST_VALUE(val) OVER (PARTITION BY custid ORDER BY orderdate, orderid ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as first_val
For performance reasons, is it generally recommended to avoid the RANGE option?
Yes. You need to be explicit and use the ROWS clause.
What frame do you need to specify if you’re after the FIRST_VALUE?
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW - you need the first row in the partition.
What frame do you need to specify if you’re after the LAST_VALUE?
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING - you need the last row in the partition.
What are the clauses that the different types of window functions support?
Partitioning, ordering, and framing clauses.
What do the delimiters UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING represent?
The beginning and the end of the partition.
Why are window functions allowed only in the SELECT and ORDER BY clauses of a query?
They are supported to operate on the underlying query’s result which is achieved when logical query processing gets to the SELECT phase.