SQL: from 0 to hero Flashcards by José Cruz

What is the difference between create table and create view?

Table 2: departments (dept_id, de

Table 1: employees (id, name, dept_id)

easy

create table: Stores data physically; faster for repeated access but uses storage.

Use: Static data or precomputed results.
Avoid: Dynamic data needing frequent updates.

create view: Virtual table; reflects live data but slower for complex queries.

Use: Dynamic, query-based data.
Avoid: Performance-intensive queries.

How well did you know this?

Not at all

Perfectly

What is the difference between INNER JOIN and OUTER JOIN?

T2: departments (dept_id, dept_name)

T1: employees (id, name, dept_id), T2: departments (dept_id, dept_name)

easy

INNER JOIN returns matching rows from both tables, while OUTER JOIN includes unmatched rows from one or both tables.

– INNER JOIN: Only matching rows

SELECT e.name, d.dept_name 
FROM employees e
INNER JOIN departments d ON e.dept_id = d.dept_id;

– OUTER JOIN: Includes unmatched rows

SELECT e.name, d.dept_name 
FROM employees e
LEFT OUTER JOIN departments d ON e.dept_id = d.dept_id;

How well did you know this?

Not at all

Perfectly

How can you **find duplicate records **in a table?

Table: employees (id, name, dept_id)

difficulty: easy

Use GROUP BY on columns and HAVING COUNT(*) > 1

SELECT name, COUNT(*) 
FROM employees 
GROUP BY name 
HAVING COUNT(*) > 1;

How well did you know this?

Not at all

Perfectly

How do window functions differ from aggregate functions?

employees (id, name, dept_id, salary)

**difficulty: ** medium

Window functions operate over a subset of rows defined by OVER() without collapsing them into one row, unlike aggregate functions.

-- Window Function
SELECT name, salary, SUM(salary) OVER (PARTITION BY dept_id) AS dept_salary 
FROM employees;

-- Aggregate Function
SELECT dept_id, SUM(salary) AS dept_salary 
FROM employees 
GROUP BY dept_id;

How well did you know this?

Not at all

Perfectly

What is a Common Table Expression (CTE), and when would you use it?

Table: sales (region, sales)

**difficulty: ** medium

A CTE is a temporary result set used for modular, readable queries (e.g., recursive operations).

WITH SalesCTE AS (
  SELECT region, SUM(sales) AS total_sales 
  FROM sales 
  GROUP BY region
)
SELECT region, total_sales 
FROM SalesCTE;

How well did you know this?

Not at all

Perfectly

How can you calculate the **cumulative sum **of a column?

sales (name, sales, sales_date)

**difficulty: ** medium

Use SUM(column) OVER (ORDER BY column) in a window function.

SELECT name, sales, 
       SUM(sales) OVER (ORDER BY sales_date) AS cumulative_sales 
FROM sales;

How well did you know this?

Not at all

Perfectly

What is the purpose of the EXPLAIN statement?

Table: employees (id, name, dept_id)

**difficulty: ** medium

It shows the query execution plan to optimize performance.

How well did you know this?

Not at all

Perfectly

How do you remove duplicate rows from a table?

employees (id, name, dept_id)

**difficulty: ** medium

Use DELETE with ROW_NUMBER() or DISTINCT.

DELETE FROM employees 
WHERE ROW_NUMBER() OVER (PARTITION BY name ORDER BY id) > 1

How well did you know this?

Not at all

Perfectly

Write a query to get the second highest salary from an employees table.

Table: employees (id, name, salary)

**difficulty: ** medium

SELECT MAX(salary) 
FROM employees 
WHERE salary < (SELECT MAX(salary) FROM employees);

How well did you know this?

Not at all

Perfectly

What is the difference between HAVING and WHERE?

employees (id, name, dept_id, salary)

**difficulty: ** easy

WHERE filters rows before grouping and HAVING filters groups after aggregation.

-- WHERE
SELECT * 
FROM employees 
WHERE salary > 50000;

-- HAVING
SELECT dept_id, AVG(salary) 
FROM employees 
GROUP BY dept_id 
HAVING AVG(salary) > 60000;

How well did you know this?

Not at all

Perfectly

How do you handle nulls in SQL when performing calculations?

Table: employees (id, name, salary)

**difficulty: ** easy

Use COALESCE() or ISNULL() to replace nulls with default values.

SELECT name, COALESCE(salary, 0) AS salary 
FROM employees; -- COALESCE replaces NULL with a value

How well did you know this?

Not at all

Perfectly

How do you find the 90th percentile in a dataset?

Table: employees (id, name, salary)

**difficulty: ** hard

With PERCENTILE_CONT

SELECT PERCENTILE_CONT(0.9) 
WITHIN GROUP (ORDER BY salary) AS percentile_salary 
FROM employees;

How well did you know this?

Not at all

Perfectly

How do you **rank rows **in a table?

Table: employees (id, name, salary)

**difficulty: ** medium

SELECT name, salary, 
       RANK() OVER (ORDER BY salary DESC) AS rank 
FROM employees;

How well did you know this?

Not at all

Perfectly

What is the difference between TRUNCATE and DELETE?

Table: employees (id, name, dept_id)

**difficulty: ** easy

TRUNCATE TABLE employees; -- Deletes all rows

DELETE FROM employees WHERE dept_id = 101; -- Deletes specific rows

Table: employees (id, name, dept_id)

How well did you know this?

Not at all

Perfectly

What is a primary key, and why is it important? Create a table with a primary key.

Table: employees (id, name)

**difficulty: ** easy

The primary key is a column in a relational database table that’s distinctive for each record. Primary keys must contain UNIQUE values, and cannot contain NULL values.

CREATE TABLE employees (
  id INT PRIMARY KEY,
  name VARCHAR(50)
)

Learn more about primary keys.

How well did you know this?

Not at all

Perfectly

How do you pivot data in SQL?

Table: employees (id, dept_id, gender)

**difficulty: ** medium

SELECT dept_id,
       SUM(CASE WHEN gender = 'Male' THEN 1 ELSE 0 END) AS male_count,
       SUM(CASE WHEN gender = 'Female' THEN 1 ELSE 0 END) AS female_count
FROM employees 
GROUP BY dept_id

How well did you know this?

Not at all

Perfectly

Write a query to calculate the percentage of total sales per region?

Table: sales (region, sales)

**difficulty: ** hard

SELECT region, 
       SUM(sales) / SUM(SUM(sales)) OVER () * 100 AS percentage 
FROM sales 
GROUP BY region;

How well did you know this?

Not at all

Perfectly

What is the purpose of normalization in SQL?

employees (id, name, dept_id)

Tables: departments (dept_id, dept_name),

**difficulty: ** easy

To reduce redundancy and dependency by organizing data into logical tables.

CREATE TABLE departments (dept_id INT PRIMARY KEY, dept_name VARCHAR(50));
CREATE TABLE employees (id INT PRIMARY KEY, name VARCHAR(50), dept_id INT);

How well did you know this?

Not at all

Perfectly

How do you get the total row count of a table or grouped result set?

Table: employees (id, name, dept_id)

**difficulty: ** easy

Use COUNT(*) for the table or COUNT(column) for grouped subsets.

-- Total row count of the table
SELECT COUNT(*) AS total_rows 
FROM employees;

-- Row count grouped by department
SELECT dept_id, COUNT(*) AS dept_row_count 
FROM employees 
GROUP BY dept_id;

How well did you know this?

Not at all

Perfectly

What is the difference between RANK(), DENSE_RANK(), and ROW_NUMBER()?

Table: employees (id, name, salary)

**difficulty: ** medium

RANK() assigns ranks with gaps for tied values.
DENSE_RANK() assigns consecutive ranks without gaps.
ROW_NUMBER() assigns unique row numbers regardless of ties.

SELECT name, salary, 
       RANK() OVER (ORDER BY salary DESC) AS rank,
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_number
FROM employees;

How well did you know this?

Not at all

Perfectly

How can you replace rows based on a condition while updating a table?

Table: employees (id, name, salary, dept_id)

**difficulty: ** medium

Use the UPDATE statement with a CASE expression in the SET clause to selectively update rows based on conditions.

UPDATE employees
SET salary = CASE 
               WHEN dept_id = 101 THEN salary * 1.1 
               ELSE salary 
               END
WHERE dept_id IN (101, 102);

How well did you know this?

Not at all

Perfectly

What are common SQL data types, and how are they used?

sales (id INT, product STRING, amount DOUBLE, sale_date TIMESTAMP

**difficulty: ** easy

INT: Whole numbers
STRING: Text
DOUBLE: Decimal numbers
TIMESTAMP: Date and time

CREATE TABLE sales (
  id INT,
  product STRING,
  amount DOUBLE,
  sale_date TIMESTAMP
);

How well did you know this?

Not at all

Perfectly

How do you query across multiple databases in Databricks?

Tables: sales, campaigns

Databases: sales_db, marketing_db

**difficulty: ** medium

-- Joining tables across databases
SELECT s.id, s.amount, c.campaign_name
FROM sales_db.sales s
JOIN marketing_db.campaigns c 
ON s.id = c.sale_id;

How well did you know this?

Not at all

Perfectly

What is advanced SQL functions like NTILE?

Table: sales (id, amount, sale_date)

**difficulty: ** medium

NTILE: Divide rows into quartiles

SELECT id, amount, 
       NTILE(4) OVER (ORDER BY amount) AS quartile
FROM sales;

How well did you know this?

Not at all

Perfectly

What is the advanced SQL function `LEAD`? | Table: sales (id, amount, sale_date) ## Footnote **difficulty: ** medium

`LEAD`: Access the next row's value ```SELECT id, amount, LEAD(amount) OVER (ORDER BY sale_date) AS next_amount FROM sales;```

How can you use `CASE` for advanced query logic? | Table: sales (id, amount) ## Footnote **difficulty: ** easy

Use `CASE` to create conditional columns or filters in queries. ```SELECT id, amount, CASE WHEN amount > 100 THEN 'High' WHEN amount > 50 THEN 'Medium' ELSE 'Low' END AS sales_category FROM sales;```

What does the `LAG()` function do in SQL? | Table: sales (id, amount, sale_date) ## Footnote **difficulty: ** medium

`LAG()` retrieves the value of a column from the previous row in a window. ```SELECT id, amount, sale_date, LAG(amount) OVER (ORDER BY sale_date) AS prev_sale FROM sales;```

What does `UNBOUNDED PRECEDING` mean in window functions? | sales (id, amount, sale_date) ## Footnote **difficulty: ** hard

It refers to the start of the window, including all rows before the current one. To create a cumulative sum with `UNBOUNDED PRECEDING`: ```SELECT id, amount, SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sales FROM sales;```

How does `PARTITION BY` work in SQL window functions? | Table: sales (id, amount, sale_date, region) ## Footnote **difficulty: ** medium

`PARTITION BY` divides data into partitions to apply window functions separately to each group. ```SELECT region, id, amount, RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS region_rank FROM sales;```

What is the difference between `RANGE` and `ROWS` in windowing? | Table: sales (id, amount, sale_date) ## Footnote **difficulty: ** hard

`ROWS` operates on a physical set of rows (e.g., current and previous). `RANGE` operates on a logical range of values (e.g., same amount). -- Using ROWS ```SELECT id, amount, SUM(amount) OVER (ORDER BY sale_date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS moving_sum FROM sales;``` -- Using RANGE ```SELECT id, amount, SUM(amount) OVER (ORDER BY sale_date RANGE BETWEEN INTERVAL 1 DAY PRECEDING AND CURRENT ROW) AS moving_sum FROM sales;```

How do `FIRST_VALUE()` and `LAST_VALUE()` work in SQL? | Table: sales (id, amount, sale_date) ## Footnote **difficulty: ** medium

`FIRST_VALUE()` returns the first value in a window; `LAST_VALUE()` returns the last value. ```SELECT id, amount, sale_date, FIRST_VALUE(amount) OVER (ORDER BY sale_date) AS first_sale, LAST_VALUE(amount) OVER (ORDER BY sale_date ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS last_sale FROM sales;```

How do you **calculate a moving average** in SQL using window functions? | Table: sales (id, amount, sale_date) ## Footnote **difficulty: ** hard

Use `AVG()` over a window frame to calculate a moving average. -- 3-day moving average ```SELECT id, amount, sale_date, AVG(amount) OVER (ORDER BY sale_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg FROM sales;```

How can you **calculate percentiles** with SQL window functions? ## Footnote **difficulty: ** medium

Use `PERCENT_RANK()` or `CUME_DIST()` to calculate percentile rankings. ```SELECT id, amount, PERCENT_RANK() OVER (ORDER BY amount) AS percent_rank, CUME_DIST() OVER (ORDER BY amount) AS cumulative_dist FROM sales;```

What is the `QUALIFY` statement in SQL, and how is it used? | sales (id, amount, sale_date) ## Footnote **difficulty: ** medium

The `QUALIFY` statement is used to filter rows after applying window functions, simplifying complex filtering logic. In the example below, we only keep the row with the highest rank: ```SELECT id, amount, RANK() OVER (ORDER BY amount DESC) AS rank FROM sales QUALIFY rank = 1;```

# Exercise How would you calculate the **total revenue** for each product for **the past 7 days**? **Table Name:** `orders` - `order_id` INT - `product_id` INT - `order_date` TIMESTAMP - `status` STRING (values: 'completed', 'pending', 'cancelled') - `total_amount` DOUBLE **Table Name:** `products` - `product_id` INT - `product_name` STRING - `category` STRING - `price` DOUBLE ## Footnote **medium**

```SELECT p.product_name, SUM(o.total_amount) AS total_revenue FROM orders o JOIN products p ON o.product_id = p.product_id WHERE o.status = 'completed' AND o.order_date >= CURRENT_DATE - INTERVAL 7 DAY GROUP BY p.product_name;```

How would you calculate the **average order value** for each product **category** in the **last 60 days**? **Table Name:** `orders` - `order_id` INT - `product_id` INT - `order_date` TIMESTAMP - `status` STRING (values: 'completed', 'pending', 'cancelled') - `total_amount` DOUBLE **Table Name:** `products` - `product_id` INT - `category` STRING ## Footnote **medium**

```SELECT p.category, AVG(o.total_amount) AS avg_order_value FROM orders o JOIN products p ON o.product_id = p.product_id WHERE o.status = 'completed' AND o.order_date >= CURRENT_DATE - INTERVAL 60 DAY GROUP BY p.category```

How would you find the **most recent order** placed by each customer? **Table Name:** `orders` - `order_id` INT - `customer_id` INT - `order_date` TIMESTAMP - `total_amount` DOUBLE ## Footnote **medium**

```SELECT customer_id, MAX(order_date) AS latest_order FROM orders GROUP BY customer_id;```

How would you count the **total number of orders** by **order status** for the **last 90 days**? **Table Name:** `orders` - `order_id` INT - `order_date` TIMESTAMP - `status` STRING (values: 'completed', 'pending', 'cancelled')

```SELECT status, COUNT(order_id) AS total_orders FROM orders WHERE order_date >= CURRENT_DATE - INTERVAL 90 DAY GROUP BY status```

How would you calculate the **7-day rolling average** of total sales for each product category? **Table Name:** `orders` - `order_id` INT - `product_id` INT - `order_date` TIMESTAMP - `total_amount` DOUBLE **Table Name:** `products` - `product_id` INT - `category` STRING ## Footnote **advanced**

```SELECT p.category, o.order_date, AVG(o.total_amount) OVER (PARTITION BY p.category ORDER BY o.order_date RANGE BETWEEN INTERVAL 6 DAY PRECEDING AND CURRENT ROW) AS rolling_avg FROM orders o JOIN products p ON o.product_id = p.product_id;```

How would you compare the **current order total** with the **previous order total** for each customer and calculate the difference? **Table Name:** `orders` - `order_id` INT - `customer_id` INT - `order_date` TIMESTAMP - `total_amount` DOUBLE ## Footnote **advanced**

```SELECT customer_id, order_id, total_amount, LAG(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS previous_order, total_amount - LAG(total_amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS difference FROM orders;```

How would you calculate the **cumulative sales total** for each product category? **Table Name:** `orders` - `order_id` INT - `product_id` INT - `order_date` TIMESTAMP - `total_amount` DOUBLE **Table Name:** `products` - `product_id` INT - `category` STRING ## Footnote **advanced**

```SELECT p.category, o.order_date, SUM(o.total_amount) OVER (PARTITION BY p.category ORDER BY o.order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sales FROM orders o JOIN products p ON o.product_id = p.product_id```

How would you retrieve the **first order** placed by each customer? **Table Name:** `orders` - `order_id` INT - `customer_id` INT - `order_date` TIMESTAMP - `total_amount` DOUBLE ## Footnote **advanced**

```WITH ranked_orders AS ( SELECT customer_id, order_id, order_date, total_amount, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS row_num FROM orders ) SELECT customer_id, order_id, order_date, total_amount FROM ranked_orders WHERE row_num = 1```

How would you rank **products by total sales** within each category for the last 30 days? **Table Name:** `orders` - `order_id` INT - `product_id` INT - `order_date` TIMESTAMP - `total_amount` DOUBLE **Table Name:** `products` - `product_id` INT - `category` STRING ## Footnote **advanced**

```SELECT p.category, p.product_id, SUM(o.total_amount) AS total_sales, RANK() OVER (PARTITION BY p.category ORDER BY SUM(o.total_amount) DESC) AS sales_rank FROM orders o JOIN products p ON o.product_id = p.product_id WHERE o.order_date >= CURRENT_DATE - INTERVAL 30 DAY GROUP BY p.category, p.product_id```

How would you **rank products** by their **sales total** in each category, ensuring no gaps in ranking? **Table Name:** `orders` - `order_id` INT - `product_id` INT - `order_date` TIMESTAMP - `total_amount` DOUBLE **Table Name:** `products` - `product_id` INT - `category` STRING ## Footnote **advanced**

```SELECT p.category, p.product_id, SUM(o.total_amount) AS total_sales, DENSE_RANK() OVER (PARTITION BY p.category ORDER BY SUM(o.total_amount) DESC) AS sales_rank FROM orders o JOIN products p ON o.product_id = p.product_id GROUP BY p.category, p.product_id```

How would you **divide customers into 4 quartiles** based on their total order value over the last 90 days? **Table Name:** `orders` - `order_id` INT - `customer_id` INT - `order_date` TIMESTAMP - `total_amount` DOUBLE ## Footnote **advanced**

```SELECT customer_id, SUM(total_amount) AS total_spent, NTILE(4) OVER (ORDER BY SUM(total_amount) DESC) AS quartile FROM orders WHERE order_date >= CURRENT_DATE - INTERVAL 90 DAY GROUP BY customer_id```

How would you calculate the **total sales by category** and **overall sales** using **GROUPING SETS**? **Table Name:** `orders` - `order_id` INT - `product_id` INT - `order_date` TIMESTAMP - `total_amount` DOUBLE **Table Name:** `products` - `product_id` INT - `category` STRING ## Footnote **advanced**

```SELECT p.category, SUM(o.total_amount) AS total_sales FROM orders o JOIN products p ON o.product_id = p.product_id GROUP BY GROUPING SETS (p.category, ())```

How would you generate a **list of all possible combinations** of products and customers? **Table Name:** `orders` - `order_id` INT - `customer_id` INT - `product_id` INT **Table Name:** `products` - `product_id` INT - `category` STRING **Table Name:** `customers` - `customer_id` INT - `customer_name` STRING ## Footnote **advanced**

```SELECT c.customer_id, p.product_id FROM customers c CROSS JOIN products p```

How would you get a **list of all customers** with **their order details** (including those who didn't place any order)? **Table Name:** `orders` - `order_id` INT - `customer_id` INT - `order_date` TIMESTAMP - `total_amount` DOUBLE **Table Name:** `customers` - `customer_id` INT - `customer_name` STRING ## Footnote **advanced**

```SELECT c.customer_name, o.order_id, o.total_amount FROM customers c FULL OUTER JOIN orders o ON c.customer_id = o.customer_id```

How would you extract **values from a JSON column** to find the **product name** from a JSON structure in the **orders** table? **Table Name:** `orders` - `order_id` INT - `order_date` TIMESTAMP - `product_info` STRING (JSON format: `{"product_id": 1, "product_name": "Widget"}`) ## Footnote **advanced**

```SELECT order_id, JSON_EXTRACT(product_info, '$.product_name') AS product_name FROM orders```

How would you **retrieve only the latest order** for each customer using the **QUALIFY** statement? **Table Name:** `orders` - `order_id` INT - `customer_id` INT - `order_date` TIMESTAMP - `total_amount` DOUBLE ## Footnote **advanced**

```SELECT customer_id, order_id, order_date, total_amount FROM ( SELECT customer_id, order_id, order_date, total_amount, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num FROM orders ) QUALIFY row_num = 1```

How would you **flatten an array** of product IDs from the `orders` table to see each individual product in the order? **Table Name:** `orders` - `order_id` INT - `customer_id` INT - `order_date` TIMESTAMP - `product_ids` ARRAY (example: `[1, 2, 3]`) ## Footnote **medium**

```SELECT order_id, customer_id, order_date, EXplode(product_ids) AS product_id FROM orders```

How would you **join orders** with a list of **products** using an exploded field for `product_ids` in the `orders` table? **Table Name:** `orders` - `order_id` INT - `customer_id` INT - `order_date` TIMESTAMP - `product_ids` ARRAY (example: `[1, 2, 3]`) **Table Name:** `products` - `product_id` INT - `product_name` STRING ## Footnote **advanced**

```SELECT o.order_id, o.customer_id, o.order_date, p.product_name FROM orders o LATERAL VIEW EXPLODE(o.product_ids) exploded AS product_id JOIN products p ON o.product_id = p.product_id```

How would you **join orders with products** and **filter** only the orders containing a specific product (e.g., product_id = 2) from the exploded `product_ids` field? **Table Name:** `orders` - `order_id` INT - `customer_id` INT - `order_date` TIMESTAMP - `product_ids` ARRAY (example: `[1, 2, 3]`) **Table Name:** `products` - `product_id` INT - `product_name` STRING ## Footnote **advanced**

```SELECT o.order_id, o.customer_id, o.order_date, p.product_name FROM orders o LATERAL VIEW EXPLODE(o.product_ids) exploded AS product_id JOIN products p ON exploded.product_id = p.product_id WHERE exploded.product_id = 2```

How would you **check the number of NULL values** in the `email` column of the `users` table? **Table Name:** `users` - `user_id` INT - `email` STRING ## Footnote **medium**

```SELECT COUNT(*) AS null_count FROM users WHERE email IS NULL```

How would you identify **rows with missing values** in the `first_name` or `last_name` columns? **Table Name:** `users` - `user_id` INT - `first_name` STRING - `last_name` STRING ## Footnote **medium**

```SELECT user_id, first_name, last_name FROM users WHERE first_name IS NULL OR last_name IS NULL;```

How would you **check for duplicate entries** in the `email` column of the `users` table? **Table Name:** `users` - `user_id` INT - `email` STRING ## Footnote **easy**

```SELECT email, COUNT(*) AS duplicate_count FROM users GROUP BY email HAVING COUNT(*) > 1```

How would you find **non-numeric values** in a column `amount` which should contain numeric data? **Table Name:** `transactions` - `transaction_id` INT - `amount` STRING ## Footnote **advanced**

```SELECT transaction_id, amount FROM transactions WHERE NOT amount REGEXP '^[0-9]+(\.[0-9]+)?$'```

How would you identify if there are **duplicate rows** based on the combination of `email` and `user_id` in the `users` table? **Table Name:** `users` - `user_id` INT - `email` STRING ## Footnote **medium**

```SELECT email, user_id, COUNT(*) AS duplicate_count FROM users GROUP BY email, user_id HAVING COUNT(*) > 1```

How would you detect **invalid date formats** in the `birth_date` column (which should contain valid `DATE` values)? **Table Name:** `users` - `user_id` INT - `birth_date` STRING

```SELECT user_id, birth_date FROM users WHERE NOT ISDATE(birth_date)```

How would you check if the **values in the `product_id`** column in the `orders` table match the valid IDs in the `products` table? **Table Name:** `orders` - `order_id` INT - `product_id` INT **Table Name:** `products` - `product_id` INT - `product_name` STRING

```SELECT o.order_id, o.product_id FROM orders o LEFT JOIN products p ON o.product_id = p.product_id WHERE p.product_id IS NULL```

How would you identify **orders with unusually high amounts** greater than 1000 in the `orders` table? **Table Name:** `orders` - `order_id` INT - `total_amount` DOUBLE

```SELECT order_id, total_amount FROM orders WHERE total_amount > 1000```

# Handling Missing Values (Imputation Techniques) How would you **fill in missing `total_amount` values** in the `orders` table with the **average** `total_amount`? **Table Name:** `orders` - `order_id` INT - `total_amount` DOUBLE

```SELECT order_id, COALESCE(total_amount, (SELECT AVG(total_amount) FROM orders)) AS filled_total_amount FROM orders```

# Handling Outliers in Numerical Data How would you **identify outliers** in the `total_amount` column of the `orders` table using the **Interquartile Range (IQR)** method? **Table Name:** `orders` - `order_id` INT - `total_amount` DOUBLE ## Footnote **advanced**

```WITH percentiles AS ( SELECT PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY total_amount) AS Q1, PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY total_amount) AS Q3 FROM orders ) SELECT order_id, total_amount FROM orders, percentiles WHERE total_amount < Q1 - 1.5 * (Q3 - Q1) OR total_amount > Q3 + 1.5 * (Q3 - Q1)```

# Removing Duplicates in a Large Dataset How would you **remove duplicate entries** in the `users` table, keeping only the first occurrence based on `email`? **Table Name:** `users` - `user_id` INT - `email` STRING ## Footnote **advanced**

```WITH ranked_users AS ( SELECT user_id, email, ROW_NUMBER() OVER (PARTITION BY email ORDER BY user_id) AS row_num FROM users ) DELETE FROM ranked_users WHERE row_num > 1;```

# Data Type Conversions and Implicit Casting How would you **convert** the `amount` column from **STRING** to **FLOAT** in the `transactions` table and check for any conversion errors? **Table Name:** `transactions` - `transaction_id` INT - `amount` STRING

```SELECT transaction_id, CAST(amount AS FLOAT) AS amount_float FROM transactions WHERE TRY_CAST(amount AS FLOAT) IS NULL```

# Empty Strings vs NULL Values How would you find **rows where the `email` is either NULL or an empty string** in the `users` table? **Table Name:** `users` - `user_id` INT - `email` STRING

```SELECT user_id, email FROM users WHERE email IS NULL OR email = ''```

# Data Redundancy How would you identify **redundant rows** where the same `email` appears multiple times in the `users` table? **Table Name:** `users` - `user_id` INT - `email` STRING

```SELECT email, COUNT(*) AS duplicate_count FROM users GROUP BY email HAVING COUNT(*) > 1```

A product manager needs to update the **price** of a product, insert a new product, and delete an outdated product in the `products` table. 1. **Insert** a new product with `product_id` = 101, `product_name` = 'Wireless Mouse', and `price` = 29.99. 2. **Update** the price of the product with `product_id` = 101 to 24.99. 3. **Delete** the product with `product_id` = 101 as it’s outdated. **Table Name:** `products` - `product_id` INT - `product_name` STRING - `price` DOUBLE ## Footnote **easy**

**insert** ```INSERT INTO products (product_id, product_name, price) VALUES (101, 'Wireless Mouse', 29.99);``` ** update** ```UPDATE products SET price = 24.99 WHERE product_id = 101;``` **delete** ```DELETE FROM products WHERE product_id = 101;```

SQL: from 0 to hero Flashcards

Learn active concepts about SQL to succeed in product data science interviews. (68 cards)