SQL Flashcards
(28 cards)
what code for skipping rows in the returned result?
OFFSET
SELECT column_names
FROM table_name
ORDER BY column_name
LIMIT number_of_rows
OFFSET offset_value;
(OFFSET 9, skips first 9 rows)
-used for skipping specified rows, returning chunks for exporting, focusing on data in a particular range
-always use order by first
syntax for using NOT?
SELECT column1, column2, …
FROM table_name
WHERE NOT condition;
e.g.
where not country = ‘spain’
when combining with something else then put with that
e.g.
where country not in (‘’)
how does between work?
inclusive
e.g.
where number between 1 and 5
whats the syntax for window functions
SELECT start_terminal,
duration_seconds,
SUM(duration_seconds) OVER
(PARTITION BY start_terminal ORDER BY start_time rows between 13 preceding and current row)
AS running_total
FROM tutorial.dc_bikeshare_q1_2012
-can do without the partition and it will do over the whole dataframe, otherwise partitions over the sections and goes up in order so dont forget the order by
-the 13 means consider the current row and the previous 13 in the calculations
what functions apart from the standard aggregations can be run on window functions and what do they do?
ROW_NUMBER() -no arg accepted, starts at 1 at each partition
RANK() - starts at 1, if same occurs twice gives both same number and then skips next number
DENSE_RANK() - starts at 1, if same occurs twice gives both same number and then goes to next number
PERCENT_RANK() - relative rank of the current row, rank/partition - ranges from 0 to 1 - gives percentile ranking
NTILE(# of tiles) - e.g. NTILE(4) gives quartiles, NTILE(100) gives percentiles. careful with NTILE(100) when small number of entries
LAG(column, 1) - gives entry of specified column in the row before
LAG(column 3) -gives entry of specified column 3 rows ahead
how do you alias a window function when you are using multiple with the same window
SELECT start_terminal,
duration_seconds,
NTILE(4) OVER ntile_window AS quartile,
NTILE(5) OVER ntile_window AS quintile,
NTILE(100) OVER ntile_window AS percentile
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < ‘2012-01-08’
WINDOW ntile_window AS
(PARTITION BY start_terminal ORDER BY duration_seconds)
ORDER BY start_terminal, duration_seconds
-window clause must come after where cause
how do you create a ‘difference’ column? like finding the difference between a value and the value before it
SELECT start_terminal,
duration_seconds,
duration_seconds -LAG(duration_seconds, 1) OVER
(PARTITION BY start_terminal ORDER BY duration_seconds)
AS difference
FROM tutorial.dc_bikeshare_q1_2012
WHERE start_time < ‘2012-01-08’
ORDER BY start_terminal, duration_seconds
Datetypes in SQL:
DATE: YYYY-MM-DD
TIME: HH:MM:SS
TIMESTAMP: YYYY-MM-DD HH:MM:SS
INTERVAL: interval between two date/time values ‘x year x month…’
What functions to get current date/time?
NOW(): current date, time and Timezone
CURRENT_DATE: current date
CURRENT_TIME: current time and Timezone
CURRENT_TIMESTAMP: current date, time and Timezone
how do you calculate the difference between 2 times or dates
just use minus operator
date - date
returns an interval
how do you add or subtract a certain amount of time from a date or time
using interval
date + INTERVAL ‘1 day’
date - INTERVAL ‘1 month’
etc
how can you extract different parts of a date?
select date_part(‘year’, date)
select extract(year from date)
-both extract a part of the date, extract can handle timezones
date_trunc(‘year’, date)
-truncates date to certain part
how do you convert dates to strings and strings to dates?
date to specified string format, choose whichever string format you want:
to_char(date, ‘YY/MM/DD’)
string to specified date format. here you need to specify the format that the string is in e.g.:
to_date(‘01/02/2024’, ‘DD/MM/YYYY)
to_timestamp (same as above but includes time)
when to use in and exists?
select * from …
where x in (select * from ….)
-checks all rows for the in statement
select * from …
where exists
(select 1 from … where x=x)
-checks if the row with the condition exists, stops as soon as it does - better for large datasets
what are some ways of creating conditional logic and the syntaxes? ie new columns with logic conditional on other columns
select x, y,
case when
condition then result
condtion2 then result 2
else result3
end as new_column
end
from …
coalesce returns the first non-null value:
select coalesce(col1, col2, ‘newoption’) as newcolumn
from …
NULLIF returns null if two expressions are equal
select nulliff(col1, 0) from …
-can be used to prevent division by 0 error
how do you use conditional logic to calculate things like counting active users, or sales from US customers
select count(case when x = ‘active’ then 1 else 0) as active_users,
count(*) as total_users
from …
SELECT
SUM(CASE WHEN country = ‘US’ THEN revenue ELSE 0 END) AS us_revenue,
SUM(revenue) AS total_revenue
FROM sales;
what are the 2 ways to cast types in postgres
select 123::integer
select now()::date
OR
select cast(123 as integer)
select cast(‘hello’ as text)
when and how to pivot data?
pivot = turn rows into columns
e.g. show sales per month as columns like ‘feb sales’ ‘march sales’
SELECT
product,
SUM(CASE WHEN month = ‘Jan’ THEN sales ELSE 0 END) AS jan_sales,
SUM(CASE WHEN month = ‘Feb’ THEN sales ELSE 0 END) AS feb_sales
FROM sales_data
GROUP BY product;
group by the column you want to pivot on, then use case when to group
how and when to unpivot?
unpivot = turn columns into rows
e.g. turn a table with columns of KPIs into one KPI name column and one value column
SELECT product, ‘Jan’ AS month, jan_sales AS sales FROM monthly_sales
UNION ALL
SELECT product, ‘Feb’ AS month, feb_sales FROM monthly_sales;
manually create each name/value column and union all
what is normalisation and denormalisation?
normalisation = storing data to reduce redundancy. data is split into multiple related tables. reduces duplicates - is more efficient for storage. harder to read - need joins
denormalisation = combine tables to reduce joins and make more readable. improves read performance at the cost of some redundancy
what is collation and what are the different sensitivity types?
collation is the set of rules around how string data is stored
case sensitive/insensitive - a and A
accent sensitive/insensitive - e and é
-others for different languages
what and how to use histogram buckets
histogram buckets use to group continuous data into buckets
-organises values into fixed width intervals and count how many fall into each one
SELECT
width_bucket(age, 0, 100, 10) AS bucket_number, – 10 buckets from 0 to 100
FROM users
width_bucket(value, min, max, num_buckets)
-returns which bucket the value belongs to. buckets are equal width and span from min to max]
e.g. segment users by age, group orders by price etc.
how to select random rows
…
order by random ()
limit 10
how do you calculate something when its asking you say the percentage of the total dataframe. e.g. find percentage of users where a certain column = a certain value
select
count(case when close_date = ‘2020-01-01’ then 1 end) * 100 / count(*)
as pc
from accounts
use aggegate with case when