ADV SQL QRIES
Remind me, how did that expression go?
- CASE
- TRY queries and CATCH errors
- Subqueries
- Common Table Expressions (CTE)
- Window Functions - They all use OVER()
  - RANK()
  - DENSE_RANK()
  - Giving rows a number with ROW_NUMBER()
  - Framing for sliding windows using RANGE BETWEEN…AND or ROWS BETWEEN…AND
  - Fetching relative rows: LAG() and LEAD()
  - Fetching absolute rows: FIRST_VALUE() and LAST_VALUE()
  - PARTITION BY subdivides processing ranges in a column
  - Paging with NTILE() - subdivide a table into equal-length pages
- Pivoting with the CROSSTAB() Extension
- Group total rows with GROUP BY ROLLUP() and CUBE()
- Fill Null values with COALESCE()
- String Manipulation
  - Compress a Column into a single String with STRING_AGG()
  - Concatenating Multiple Columns with || and CONCAT()
  - Changing Case with UPPER(), LOWER(), INITCAP()
  - Replace Characters with REPLACE()
  - Trim Characters with TRIM(), LTRIM(), RTRIM()
  - Pad Characters with PAD(), LPAD(), RPAD()
  - Reverse char order with REVERSE()
  - Character Count with CHAR_LENGTH() or LENGTH()
  - Find Char Position with STRPOS() or POSITION()
  - Clip Strings from the Ends with LEFT(), RIGHT(), and SUBSTRING()
- Data Types
- User-Defined Functions (PostgreSQL)
- Text Search and NLP (PostgreSQL)
- Drop Duplicate Rows (based on Specific Columns)
- Time Manipulation
- Transactions (MS SQL SERVER)
- Special Case: Accessing ARRAYS
- SQL query in Python Pandas
- Table Management
- Extension Management (PostgreSQL)
The following post consists of notes I kept on my drive and had been returning to again and again to remind myself of concepts and SQL syntax.
Of course, the original tables and outputs are not available here; the point is to provide a lookup for structures and building blocks.
CASE
The CASE statement can create a new column based on multiple conditions in a specified column.
The structure always follows CASE WHEN … THEN … ELSE … END AS. The ELSE and AS parts are optional.
SELECT
m.date,
t.team_long_name AS opponent,
CASE WHEN m.home_goal < m.away_goal THEN 'Barcelona win!'
WHEN m.home_goal > m.away_goal THEN 'Barcelona loss :('
ELSE 'Tie' END AS outcome
FROM matches_spain AS m
LEFT JOIN teams_spain AS t
ON m.hometeam_id = t.team_api_id
WHERE m.awayteam_id = 8634;
CASE and aggregate function (SUM)
The following query creates multiple new columns with a summation aggregate function.
SELECT
c.name AS country,
SUM(CASE WHEN m.season = '2012/2013' AND m.home_goal > m.away_goal
THEN 1 ELSE 0 END) AS matches_2012_2013,
SUM(CASE WHEN m.season = '2013/2014' AND m.home_goal > m.away_goal
THEN 1 ELSE 0 END) AS matches_2013_2014,
SUM(CASE WHEN m.season = '2014/2015' AND m.home_goal > m.away_goal
THEN 1 ELSE 0 END) AS matches_2014_2015
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
GROUP BY country;
Percentages with AVG and CASE
For each country, calculate the percentage of ties (home goals = away goals). Use the AVG and CASE (for 1/0 assignment) functions.
SELECT
c.name AS country,
ROUND(AVG(CASE WHEN m.season='2013/2014' AND m.home_goal = m.away_goal THEN 1
WHEN m.season='2013/2014' AND m.home_goal != m.away_goal THEN 0
END),2) AS pct_ties_2013_2014,
ROUND(AVG(CASE WHEN m.season='2014/2015' AND m.home_goal = m.away_goal THEN 1
WHEN m.season='2014/2015' AND m.home_goal != m.away_goal THEN 0
END),2) AS pct_ties_2014_2015
FROM country AS c
LEFT JOIN matches AS m
ON c.id = m.country_id
GROUP BY country;
TRY queries and CATCH errors
As in most languages, SQL can make use of TRY/CATCH (sometimes called exception/error) blocks.
BEGIN TRY
ALTER TABLE products
ADD CONSTRAINT CHK_Stock CHECK (stock >= 0);
END TRY
BEGIN CATCH
SELECT 'There was an error adding the constraint!';
END CATCH
Nested TRY/CATCH
It is also possible to have a nested TRY block inside the CATCH block, which lets you attempt a second query if the first failed.
Use example: the nested (inner) TRY block could be used to write an error message into a table that the outer TRY failed.
BEGIN TRY
INSERT INTO bikes (make, model, year, size)
VALUES ('Nukeproof', 'Grizzlybear', '2021', 'S');
END TRY
BEGIN CATCH
SELECT 'An error occurred inserting the bike!';
BEGIN TRY
INSERT INTO errors
VALUES ('Error inserting a bike');
SELECT 'Error inserted correctly!';
END TRY
BEGIN CATCH
SELECT 'An error occurred inserting the error!';
END CATCH
END CATCH
CATCH Error Details
The following query writes a message to an ‘errors’ table if the outer TRY block fails. If that insert fails too, then the error details are queried in the inner CATCH block.
BEGIN TRY
INSERT INTO bikes (make, model, year, size)
VALUES ('Nukeproof', 'Grizzlybear', '2021', 'S');
END TRY
BEGIN CATCH
SELECT 'An error occurred inserting the product!';
BEGIN TRY
INSERT INTO errors
VALUES ('Error inserting a product');
END TRY
BEGIN CATCH
SELECT ERROR_NUMBER() AS number,
ERROR_SEVERITY() AS severity_level,
ERROR_STATE() AS state,
ERROR_LINE() AS line,
ERROR_MESSAGE() AS message;
END CATCH
END CATCH
Custom Errors, RAISERROR
‘RAISERROR’ lets you define a custom error and parameters. Syntax:
RAISERROR ( { msg_str | msg_id | @local_variable_message },
severity,
state,
[ argument [ ,...n ] ] )
[ WITH option [ ,...n ] ]
DECLARE @product_id INT = 56747;
BEGIN TRY
IF NOT EXISTS (SELECT * FROM products WHERE product_id = @product_id)
RAISERROR('No product with id %d.', 11, 1, @product_id);
ELSE
SELECT * FROM products WHERE product_id = @product_id;
END TRY
BEGIN CATCH
SELECT ERROR_NUMBER() AS number,
ERROR_SEVERITY() AS severity_level,
ERROR_STATE() AS state,
ERROR_LINE() AS line,
ERROR_MESSAGE() AS message;
END CATCH
Subqueries
While regular subqueries can be evaluated on their own, correlated subqueries depend on results from the outer query and cannot. Queries with multiple subqueries get complicated quickly.
Subqueries in WHERE clause
SELECT
team_long_name,
team_short_name
FROM team
WHERE team_api_id IN
(SELECT hometeam_id
FROM match
WHERE home_goal >= 8);
Subqueries in FROM
SELECT
country,
date,
home_goal,
away_goal
FROM
(SELECT c.name AS country,
m.date,
m.home_goal,
m.away_goal,
(m.home_goal + m.away_goal) AS total_goals
FROM match AS m
LEFT JOIN country AS c
ON m.country_id = c.id) AS subq
WHERE total_goals >= 10;
Subqueries in SELECT
Subqueries in SELECT statements generate a single value!
SELECT
l.name AS league,
ROUND(AVG(m.home_goal + m.away_goal), 2) AS avg_goals,
(SELECT ROUND(AVG(home_goal + away_goal), 2)
FROM match
WHERE season = '2013/2014') AS overall_avg
FROM league AS l
LEFT JOIN match AS m
ON l.country_id = m.country_id
WHERE season = '2013/2014'
GROUP BY league;
Correlated Subqueries
Correlated subqueries reference one or more columns in the main query. They depend on information in the main query to run and thus cannot be executed on their own. Correlated subqueries are evaluated once per row of data retrieved - a process that takes a lot more computing power and time than a simple subquery.
The following WHERE clause filters for entries with total goals higher than three times the average.
SELECT
main.country_id,
date,
main.home_goal,
main.away_goal
FROM match AS main
WHERE
(home_goal + away_goal) >
(SELECT AVG((sub.home_goal + sub.away_goal) * 3)
FROM match AS sub
WHERE main.country_id = sub.country_id);
Nested Subqueries in FROM
Now it’s getting complicated. It is best to design such queries from the inside out.
SELECT
c.name AS country,
AVG(outer_s.matches) AS avg_seasonal_high_scores
FROM country AS c
LEFT JOIN (
SELECT country_id, season,
COUNT(id) AS matches
FROM (
SELECT country_id, season, id
FROM match
WHERE home_goal >= 5 OR away_goal >= 5) AS inner_s
GROUP BY country_id, season) AS outer_s
ON c.id = outer_s.country_id
GROUP BY country;
Common Table Expressions (CTE)
CTEs are declared and executed ahead of the main query and help organize queries with complicated subqueries.
They always precede the main query: WITH cte AS (SELECT …).
WITH home AS (
SELECT m.id,
m.date,
t.team_long_name AS hometeam,
m.home_goal
FROM match AS m
LEFT JOIN team AS t
ON m.hometeam_id = t.team_api_id),
away AS (
SELECT m.id,
m.date,
t.team_long_name AS awayteam,
m.away_goal
FROM match AS m
LEFT JOIN team AS t
ON m.awayteam_id = t.team_api_id)
SELECT
home.date,
home.hometeam,
away.awayteam,
home.home_goal,
away.away_goal
FROM home
INNER JOIN away
ON home.id = away.id;
Window Functions - They all use OVER()
The OVER() clause allows you to pass an aggregate function down a data set, similar to subqueries in SELECT, but it runs faster.
An empty OVER() generates a column with an aggregate over the complete column, i.e. the same value in each row.
SELECT
m.id,
c.name AS country,
m.season,
m.home_goal,
m.away_goal,
AVG(m.home_goal + m.away_goal) OVER() AS overall_avg
FROM match AS m
LEFT JOIN country AS c ON m.country_id = c.id;
The following query with OVER(ORDER BY ...) calculates the running total/sum up to the current row.
SELECT
athlete,
medals,
SUM(medals) OVER(ORDER BY athlete ASC) AS sum_medals
FROM athlete_medals
ORDER BY athlete ASC;
RANK()
Calculate a ranking and also order the query by rank.
Note: Assigns the same number to rows with identical values and skips over the next numbers. E.g.: 1,2,3,3,5,6,6,8…
The following query calculates the average goals per game and groups by league. The averages are ranked by league.
SELECT
l.name AS league,
AVG(m.home_goal + m.away_goal) AS avg_goals,
RANK() OVER(ORDER BY AVG(m.home_goal + m.away_goal) DESC) AS league_rank
FROM league AS l
LEFT JOIN match AS m
ON l.id = m.country_id
WHERE m.season = '2011/2012'
GROUP BY l.name
ORDER BY league_rank;
SELECT
athlete,
medals,
RANK() OVER (ORDER BY medals DESC) AS Rank_N
FROM Athlete_Medals
ORDER BY Medals DESC;
DENSE_RANK()
Note: Assigns same number to rows with identical values but does NOT skip over the next numbers. E.g.: 1,2,3,3,4,5,5,6…
SELECT
country,
athlete,
DENSE_RANK() OVER (PARTITION BY country ORDER BY medals DESC) AS rank_n
FROM athlete_medals
ORDER BY country ASC, rank_n ASC;
Giving rows a number with ROW_NUMBER()
Generate a column that numbers the rows with ROW_NUMBER().
SELECT
*,
ROW_NUMBER() OVER() AS Row_N
FROM Summer_Medals
ORDER BY Row_N ASC;
The ORDER BY inside the OVER() clause takes precedence over the outer ORDER BY. The following query numbers only the distinct years, deduplicated first in a subquery.
SELECT
Year,
ROW_NUMBER() OVER(ORDER BY Year ASC) AS Row_N
FROM (
SELECT DISTINCT Year
FROM Summer_Medals
) AS Years;
Framing for sliding windows using RANGE BETWEEN…AND or ROWS BETWEEN…AND
Keywords for sliding windows are UNBOUNDED, n PRECEDING, n FOLLOWING, and CURRENT ROW.
The window can run sums and averages between the current row and the top of the table or the bottom of the table. Both directions arrive at the same cumulative sum, but the way to get there is different.
Difference RANGE vs. ROWS: RANGE treats duplicates in OVER(ORDER BY ...) as a single entity, which is a special case; ROWS behaves normally (see the sketch after the table).
Medals | Rows Running Total | Range Running Total |
---|---|---|
10 | 10 | 10 |
50 | 60 | 110 |
50 | 110 | 110 |
60 | 170 | 230 |
60 | 230 | 230 |
70 | 300 | 300 |
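A minimal sketch that would produce the two running totals above side by side (assuming a single-column medals table, here called medals_table):
SELECT
medals,
SUM(medals) OVER(ORDER BY medals
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_running_total,
SUM(medals) OVER(ORDER BY medals
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_running_total
FROM medals_table;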
The following query calculates the Running Total (RT) and Running Average (RA). The window uses ascending date ordering (oldest date at the top) -> the window runs from the top of the table (oldest date) to the current row.
SELECT
date,
home_goal,
away_goal,
SUM(home_goal) OVER(ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
AVG(home_goal) OVER(ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg
FROM match
WHERE
hometeam_id = 9908
AND season = '2011/2012';
The following query calculates the RT and RA with descending date ordering (newest date at the top) -> the window runs from the current row to the bottom of the table (oldest date).
SELECT
date,
home_goal,
away_goal,
SUM(home_goal) OVER(ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_total,
AVG(home_goal) OVER(ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_avg
FROM match
WHERE
awayteam_id = 9908
AND season = '2011/2012';
Moving average (MA). Calculate a column with the 3-year (2 preceding + current) average.
SELECT
year,
medals,
AVG(medals) OVER(ORDER BY year ASC
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS medals_ma
FROM russian_medals
ORDER BY year ASC;
Fetching relative rows: LAG() and LEAD()
The LAG(column, n) function fetches a single value from BEFORE (above) the current row (n rows back) to create a new column.
The following query determines (1) which country won gold at each year’s Olympics and (2) which country was the previous (reigning) champion.
SELECT
Year, Champion,
LAG(Champion) OVER(ORDER BY year ASC) AS Last_Champion
FROM weightlifting_gold
ORDER BY year ASC;
The LEAD(column, n) function fetches a single value from AFTER (below) the current row (n rows ahead) to create a new column.
SELECT
year,
athlete,
LEAD(athlete, 3) OVER (ORDER BY year ASC) AS future_champion
FROM discus_medalists
ORDER BY year ASC;
Fetching absolute rows: FIRST_VALUE() and LAST_VALUE()
These functions do not depend on the current row’s position: FIRST_VALUE(column) and LAST_VALUE(column).
The following query fetches the gold-winning athlete and a second column with the first athlete in alphabetical order.
SELECT
athlete,
FIRST_VALUE(athlete) OVER(ORDER BY athlete ASC) AS First_Athlete
FROM all_male_medalists;
Fetching the last value is more complicated, since it requires defining the frame RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
Otherwise the last value of the window is the current row, and that’s not what we mean.
SELECT
year,
city,
LAST_VALUE(city) OVER(ORDER BY year ASC
RANGE BETWEEN
UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING
) AS last_city
FROM hosts
ORDER BY Year ASC;
PARTITION BY subdivides processing ranges in a column
PARTITION BY splits the rows of a column into separate processing sections (partitions). This prevents window functions such as ROW_NUMBER() or LAG() from leaking over from one section into another.
The following query partitions the table by gender so that LAG() does not spill the ‘male’ gold winners into the ‘female’ rows. At each partition boundary a null value is displayed instead.
SELECT
gender,
year,
country AS champion,
LAG(country,1) OVER(PARTITION BY gender ORDER BY year ASC) AS last_champion
FROM athletics_gold
ORDER BY gender ASC, year ASC;
Partition by multiple columns (gender and event).
SELECT
gender,
year,
event,
country AS champion,
LAG(country,1) OVER (PARTITION BY gender, event ORDER BY year ASC) AS last_champion
FROM athletics_gold
ORDER BY event ASC, gender ASC, year ASC;
Paging with NTILE()- subdivide a table into equal-length pages
NTILE(num_pages) OVER(ORDER BY column) AS new_column
divides the table into num_pages pages based on the values in the given column. The new column contains the page number each row was assigned to. These pages are effectively quantiles on which statistical analysis can be performed.
The following query subdivides the ordered medals into three pages.
SELECT
athlete,
medals,
NTILE(3) OVER(ORDER BY medals DESC) AS third
FROM athlete_medals
ORDER BY medals DESC, athlete ASC;
Perform aggregate function analysis on the pages/quantiles.
The following query gets the average medals earned in each page.
WITH thirds AS (
SELECT
athlete,
medals,
NTILE(3) OVER(ORDER BY medals DESC) AS third
FROM athlete_medals)
SELECT
third,
AVG(medals) AS avg_medals
FROM thirds
GROUP BY third
ORDER BY third ASC;
Pivoting with the CROSSTAB() Extension
Pivoting with CROSSTAB requires the tablefunc extension, which needs to be created before querying.
A regular query is wrapped in SELECT * FROM CROSSTAB($$ ... $$) AS ct (column COLTYPE, ...), where the new pivoted column names and types are defined.
CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT * FROM CROSSTAB($$
SELECT
gender,
year,
country
FROM summer_medals
WHERE
year IN (2008, 2012)
AND medal = 'gold'
AND event = 'pole vault'
ORDER BY gender ASC, year ASC;
$$) AS ct (gender VARCHAR,
"2008" VARCHAR,
"2012" VARCHAR)
ORDER BY gender ASC;
Original:
gender | year | country |
---|---|---|
Men | 2008 | AUS |
Men | 2012 | FRA |
Women | 2008 | RUS |
Women | 2012 | USA |
Pivot by year:
gender | 2008 | 2012 |
---|---|---|
Men | AUS | FRA |
Women | RUS | USA |
Group total rows with GROUP BY ROLLUP() and CUBE()
ROLLUP() vs. CUBE()
Use ROLLUP() for hierarchical data such as date parts (year, quarter) when you don’t want all possible group-level aggregations. Use CUBE() for all possible group-level aggregations.
Null entries in the result rows signify the total rows.
The following query counts medals by country and gender. At the gender group level there is a total row (male + female), produced by the ROLLUP(gender) clause.
SELECT
country,
gender,
COUNT(*) AS gold_awards
FROM summer_medals
WHERE
year = 2004
AND medal = 'Gold'
AND country IN ('DEN', 'NOR', 'SWE')
GROUP BY country, ROLLUP(gender)
ORDER BY country ASC, gender ASC;
Group totals by all possible groups with CUBE(). Note: all columns have to go into the CUBE() or it will fail.
SELECT
gender,
medal,
COUNT(*) AS awards
FROM Summer_Medals
WHERE
year = 2012
AND country = 'RUS'
GROUP BY CUBE(gender, medal)
ORDER BY gender ASC, medal ASC;
Fill Null values with COALESCE()
Useful for any function that produces nulls, such as ROLLUP(), CUBE(), pivoting, or the first/last rows of LAG() and LEAD().
Use in the SELECT section of the query: COALESCE(column, 'null filler string') AS column.
SELECT
COALESCE(country, 'all countries') AS country,
COALESCE(gender, 'all genders') AS gender,
COUNT(*) AS awards
FROM summer_medals
WHERE
year = 2004
AND medal = 'gold'
AND country IN ('DEN', 'NOR', 'SWE')
GROUP BY ROLLUP(country, gender)
ORDER BY country ASC, gender ASC;
String Manipulation
Compress a Column into a single String with STRING_AGG()
Use in the SELECT statement as such: STRING_AGG(column, 'separator')
SELECT
STRING_AGG(country, ', ')
FROM country_ranks
WHERE rank_col <= 3;
Input Table:
Country | Rank |
---|---|
USA | 1 |
RUS | 2 |
AUS | 3 |
… | … |
Output string: USA, RUS, AUS
Concatenating Multiple Columns with || and CONCAT()
Merge strings with ||
SELECT
first_name || ' ' || last_name AS full_name
FROM
film;
CHARLOTTE HUNTER
Another example:
SELECT
customerID || ': ' || first_name || ' ' || last_name AS full_name;
1: CHARLOTTE HUNTER
Merge strings with PostgreSQL’s CONCAT()
SELECT
CONCAT(first_name,' ', last_name) AS full_name
FROM
film;
CHARLOTTE HUNTER
Changing Case with UPPER(), LOWER(), INITCAP()
SELECT
UPPER(email)
FROM customer;
SELECT
LOWER(email)
FROM customer;
SELECT
INITCAP(title)
FROM film;
Police Academy
Replace Characters with REPLACE()
SELECT
REPLACE(email, 'hedgeAThog.com', 'hedge@hog.com') AS email
FROM customer;
Trim Characters with TRIM(), LTRIM(), RTRIM()
TRIM([leading | trailing | both] [characters] FROM string)
If only the column/string is passed to TRIM(), all leading and trailing space characters are removed by default.
If no trim character string is passed, the space character is the default.
SELECT
TRIM(' hedge@hog.com '),
LTRIM(' hedge@hog.com '),
RTRIM(' hedge@hog.com ');
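A quick sketch of the full syntax with explicit trim characters (the literal strings are only for illustration):
SELECT
TRIM(both '#' FROM '##hedge@hog.com##'), -- removes the # characters from both ends
TRIM(leading ' ' FROM ' hedge@hog.com'); -- removes only the leading spaces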
Pad Characters with PAD(), LPAD(), RPAD()
Syntax: LPAD(input_string, length_after_pad, lpad_string)
Note that if the input string is already longer than the target length, it is truncated rather than padded. If no padding string is passed, the space character is the default.
SELECT
LPAD('hedge@hog.com', 20, '#'),
RPAD('hedge@hog.com', 17, '$')
'#######hedge@hog.com', 'hedge@hog.com$$$$'
Reverse char order with REVERSE()
SELECT
email,
REVERSE(email)
FROM customer;
hedge@hog.com, moc.goh@egdeh
Character Count with CHAR_LENGTH() or LENGTH()
SELECT
title,
CHAR_LENGTH(title),
LENGTH(title)
FROM film;
Police Academy, 14, 14
Find Char Position with STRPOS() or POSITION()
Remember, that SQL indexing starts at 1, not 0!
SELECT
email,
STRPOS(email, '@'),
POSITION('@' IN email)
FROM customer;
hedge@hog.com, 6, 6
Clip Strings from the Ends with LEFT(), RIGHT(), and SUBSTRING()
LEFT(), RIGHT(): Specify how many characters to be clipped starting from the ends.
SELECT
RIGHT(column_name, 50),
LEFT(column_name, 10)
FROM customer;
SUBSTRING()/SUBSTR(): Specify the start index and length.
SELECT
SUBSTRING(column_name, 10, 50),
SUBSTR(colum_name, 10, 50)
FROM customer;
SUBSTRING(): Use functions to determine start/end index with FROM…FOR expression.
SELECT
SUBSTRING(email FROM 0 FOR POSITION('@' IN email)), -- name part of address up to @
SUBSTRING(email FROM POSITION('@' IN email)+1 FOR CHAR_LENGTH(email)) -- provider part of address starting after @
FROM customer;
hedge, hog.com
Clip a column to 50 characters but do not clip a word in half.
SELECT
LEFT(description, 50 - POSITION(' ' IN REVERSE(LEFT(description, 50)))) -- subtract the position of the first whitespace character, counted from the back
FROM film;
Data Types
Casting Data Types
PostgreSQL native casting with two colons: column::data-type
SELECT
NOW()::timestamp;
Standard SQL casting with the CAST(column AS data-type) function.
SELECT
CAST(NOW() AS timestamp);
Creating Data Types (PostgreSQL)
CREATE TYPE dayofweek AS ENUM(
'Monday',
'Tuesday',
'Wednesday',
'Thursday',
'Friday',
'Saturday',
'Sunday'
);
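Once created, the type can be used like any built-in data type, for example in a column definition (hypothetical table):
CREATE TABLE appointments (
id integer,
weekday dayofweek -- column uses the custom ENUM type
);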
Getting Information about Data Types (PostgreSQL)
SELECT
typname,
typcategory
FROM pg_type
WHERE typname = 'dayofweek';
User-Defined Functions (PostgreSQL)
A User-Defined Function is the equivalent of a Stored Procedure.
A function can bundle several SQL queries together into a package using CREATE FUNCTION.
CREATE FUNCTION squared (i integer) RETURNS integer AS $$
BEGIN
RETURN i*i;
END;
$$ language plpgsql; --define language
SELECT
squared(10);
100
Text Search and NLP (PostgreSQL)
Full-Text Search (PostgreSQL)
PostgreSQL extends the normal WHERE column LIKE '%mystring%' wildcard (%, _) matching.
Additional features are stemming, fuzzy matching (spelling mistakes), and ranking.
SELECT
title,
description
FROM film
WHERE to_tsvector(title) @@ to_tsquery('elf');
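Ranking can be added with ts_rank(), which scores how well each row matches the query (a minimal sketch on the same film table):
SELECT
title,
ts_rank(to_tsvector(title), to_tsquery('elf')) AS rank_score
FROM film
WHERE to_tsvector(title) @@ to_tsquery('elf')
ORDER BY rank_score DESC;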
Compare Two Strings (PostgreSQL)
Calculate the Levenshtein Distance (number of edits required between two words).
SELECT
levenshtein('hugo', 'hagolm');
3
Calculate similarity (0-1) with the pg_trgm extension.
SELECT
similarity('hugo', 'hagolm');
Drop Duplicate Rows (based on Specific Columns)
Drop rows that are duplicated in specific columns only, not across the entire row.
Use a Common Table Expression (CTE) with a ROW_NUMBER() window function partitioned (PARTITION BY) on those columns, which creates a row number per window.
Any duplicates get a row number >= 2. Then query the CTE and only keep rows with row number = 1.
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY
col_1,
col_2,
col_3
ORDER BY
col_1,
col_2,
col_3
) row_num
FROM orders_table
)
SELECT *
FROM cte
WHERE row_num = 1;
id | sales | worker_firstname | worker_lastname | registration_date | row_num |
---|---|---|---|---|---|
000000119 | 13 | Johnny | Smith | 2014-01-31 | 1 |
000000127 | 14 | Brian | Merkel | 2014-02-28 | 1 |
000000120 | 29 | Karin | Harris | 2014-01-31 | 1 |
000000128 | 16 | Carol | Schulze | 2014-02-28 | 1 |
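To actually remove the duplicates rather than just select the unique rows, MS SQL Server allows deleting through the CTE (a sketch of that variant; PostgreSQL does not support deleting from a CTE and needs a different approach):
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY col_1, col_2, col_3
ORDER BY col_1, col_2, col_3
) row_num
FROM orders_table
)
DELETE FROM cte
WHERE row_num > 1;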
Time Manipulation
- NOW() and CURRENT_TIMESTAMP() fetch the current time (microsecond precision) with timezone! CURRENT_TIMESTAMP(2) can be passed an argument to round to a number of fractional digits. If the timezone is not required, these two functions can be cast to the timestamp data type.
- CURRENT_DATE only fetches the date portion of the CURRENT_TIMESTAMP().
- CURRENT_TIME only fetches the time portion (with timezone) of the CURRENT_TIMESTAMP(). It can also be passed a rounding parameter.
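A minimal example showing these side by side (PostgreSQL; the column aliases are only for illustration):
SELECT
NOW() AS now_with_tz, -- timestamp with time zone
CURRENT_TIMESTAMP(2) AS now_rounded, -- fractional seconds rounded to 2 digits
NOW()::timestamp AS now_without_tz, -- the cast drops the time zone
CURRENT_DATE AS today, -- date portion only
CURRENT_TIME AS time_of_day; -- time portion, with time zone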
Intervals and Time Differences
This query selects a timestamp-style column and performs arithmetic on it with an INTERVAL. The INTERVAL adds 3 days to the rental date so we can check whether the return date was on time.
SELECT
rental_date,
return_date,
rental_date + INTERVAL '3 days' AS expected_return_date
FROM rental;
Convert integers to time intervals.
SELECT
title,
INTERVAL '1' day * rental_duration AS rental_duration,
return_date - rental_date AS days_rented
FROM film
WHERE return_date IS NOT NULL
The AGE() function calculates an INTERVAL between a given timestamp/date and NOW() if ONE parameter is given. If TWO parameters are given, it returns the interval between those two.
SELECT
AGE(rental_date)
FROM rental;
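And the two-parameter form, which is reused in the 90-day example further below:
SELECT
AGE(return_date, rental_date) AS rental_duration
FROM rental;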
Extracting from Timestamps
Two useful functions are EXTRACT(field FROM source) and DATE_PART('field', source), which produce interchangeable results.
Timestamp parts/fields to extract:
Field (long) | Field (short) | Description |
---|---|---|
year | y | Year |
quarter | | Quarter |
month | | Month |
day | d | Day of Month |
dow | | Day of Week |
week | w | Week |
hour | h | Hour |
minute | m | Minute |
second | s | Second |
millisecond | ms | Millisecond |
Examples for the quarter of the year in a timestamp:
SELECT
EXTRACT(quarter FROM TIMESTAMP '2009-02-12 05:18:00') AS quarter;
1
SELECT
DATE_PART('quarter', TIMESTAMP '2009-02-12 05:18:00') AS quarter;
1
SELECT
EXTRACT(dow FROM rental_date) AS dayofweek,
COUNT(*) as rentals
FROM rental
GROUP BY 1; -- column called by positional index instead of names
The following query extracts quarter and year from a timestamp column and sums the payment amount by quarter and year.
SELECT
EXTRACT(quarter FROM payment_date) AS quarter,
EXTRACT(year FROM payment_date) AS year,
SUM(amount) AS total_payments
FROM payment
GROUP BY 1, 2;
The following query extracts the day of week and the rental duration (rental_days) for rentals in a 90-day interval starting from 2005-05-01.
SELECT
EXTRACT(dow FROM r.rental_date) AS dayofweek,
AGE(return_date, rental_date) AS rental_days
FROM rental AS r
WHERE rental_date
BETWEEN CAST('2005-05-01' AS TIMESTAMP)
AND CAST('2005-05-01' AS TIMESTAMP) + INTERVAL '90 day';
Truncating Timestamps
The DATE_TRUNC() function truncates timestamp or interval data types from fine detail (e.g., minutes) to low detail (e.g., year). The timestamp’s length remains the same, but the truncated fields are set to the lowest logical value (01 for days/months, 00 for hours/minutes/seconds).
SELECT
DATE_TRUNC('year', TIMESTAMP '2009-02-12 05:18:00');
2009-01-01 00:00:00
SELECT
DATE_TRUNC('month', TIMESTAMP '2009-02-12 05:18:00');
2009-02-01 00:00:00
The following query truncates to a day and counts the number of entries (rentals) by day.
SELECT
DATE_TRUNC('day', rental_date) AS rental_day,
COUNT(*) AS rentals
FROM rental
GROUP BY 1;
Transactions (MS SQL SERVER)
Transactions move values between owners, for example money between bank accounts. This means a transaction contains a subtraction and an addition that both have to succeed.
The concept of “atomic transactions” rolls back a transaction if any of its parts raised an error. SQL Server has a specific ‘TRAN’ keyword that can be put inside a ‘TRY’ block. If the ‘TRAN’ raises an error, the ‘CATCH’ block calls a ‘ROLLBACK TRAN’.
With TRY/CATCH
BEGIN TRY
BEGIN TRAN;
UPDATE accounts SET current_balance = current_balance - 100 WHERE account_id = 4451;
INSERT INTO transactions VALUES (4451, -100, GETDATE());
UPDATE accounts SET current_balance = current_balance + 100 WHERE account_id = 6743;
INSERT INTO transactions VALUES (6743, 100, GETDATE());
COMMIT TRAN;
END TRY
BEGIN CATCH
SELECT 'Rolling back the transaction';
ROLLBACK TRAN;
END CATCH
With XACT_ABORT
With ‘SET XACT_ABORT ON;’ errors in a ‘TRAN’ block will always be rolled back. No ‘TRY’ block needed.
Note: ‘THROW’ error still works with XACT_ABORT.
SET XACT_ABORT ON;
BEGIN TRAN;
UPDATE accounts set current_balance = current_balance - current_balance * 0.01 / 100
WHERE current_balance > 5000000;
IF @@ROWCOUNT <= 10
THROW 55000, 'Not enough wealthy customers!', 1;
ELSE
COMMIT TRAN;
Special Case: Accessing ARRAYS
It is not necessary to fetch the entire array of a column. We can select specific elements with square brackets. Indexing in PostgreSQL starts at 1, not 0 as in many other languages.
Search for text in an ARRAY at a specific position.
SELECT
email[1][1] AS type, -- select from first address the first element (address type/description)
email[1][2] AS address, -- select from first address the second element (email address)
test_scores[1] -- select only first element of the array
FROM funky_table
WHERE email[1][1] = 'work'; -- only search in array element 1, sub-element 1
Search for text in any ARRAY position with the ANY or @> operators. The results are equivalent.
SELECT
email[2][1] AS type, -- select from second address the first element (address type/description)
email[2][2] AS address, -- select from second address the second element (email address)
test_scores[1] -- select only first element of the array
FROM funky_table
WHERE 'home' = ANY (email); -- select where the column `email` contains an array element named 'home' in any array position
SELECT
email[2][1] AS type,
email[2][2] AS address,
test_scores[1]
FROM funky_table
WHERE email @> ARRAY['home'];
SQL query in Python Pandas
The SQL query is passed as a single string. The triple quotation mark wrapping (""") is useful for multi-line strings in Python. The table name goes in quotation marks, and the SQL database connection has to be passed after the query.
Connection URI (Uniform Resource Identifier) for PostgreSQL: postgresql://[user[:password]@][host][:port][/database]
import sqlalchemy
connection_uri = "postgresql://repl:password@localhost:5432/pagila"
db_engine = sqlalchemy.create_engine(connection_uri)
import pandas as pd
df = pd.read_sql("""
SELECT
first_name,
last_name
FROM "Customer"
ORDER BY last_name, first_name
""", db_engine)
df.head(3)
 | first_name | last_name |
---|---|---|
0 | Connagh | Bailey |
1 | Brook | Bloom |
2 | Ann | Dalton |
ETL process
- For the “Extraction” step from the source (OLTP) database, the following function queries an entire SQL table and loads it into a pandas DataFrame.
import sqlalchemy
import pandas as pd
# custom extraction function
def extract_table_to_pandas(tablename, db_engine):
query = "SELECT * FROM {}".format(tablename)
return pd.read_sql(query, db_engine)
connection_uri = "postgresql://repl:password@localhost:5432/pagila"
db_engine = sqlalchemy.create_engine(connection_uri)
extract_table_to_pandas("film", db_engine)
Then do some column transformations, e.g. in Spark: convert float to string and split on a pattern into multiple columns.
Loading into analytical PostgreSQL database
connection_uri = "postgresql://repl:password@localhost:5432/dwh"
db_engine_dwh = sqlalchemy.create_engine(connection_uri)
film_pdf_joined.to_sql("film", db_engine_dwh, schema="store", if_exists="replace")
# check the table in the SQL db
pd.read_sql("SELECT film_id, recommended_film_ids FROM store.film", db_engine_dwh)
Table Management
Create Table
Creating a table schema works with [column name] [data-type]. ARRAYS can be created with square brackets. Multiple square brackets are nested arrays.
CREATE TABLE funky_table (
column_1 text,
column_2 integer,
test_scores integer[],
employee_id integer,
email text[][]
);
Insert Values into Columns
Inserting into only specific columns works by naming those columns.
INSERT INTO funky_table
(column_1, column_2) VALUES ('string for column_1', 12);
Inserting into all columns at once can be done without column names, but the value order has to match the columns. ARRAYs are inserted with curly braces {} and array elements separated by commas.
INSERT INTO funky_table
VALUES ('Prickly Hedgehog',
2,
'{99,83,18,811}', -- standard array with four elements
2224,
'{{"work", "hedge@work.com"}, {"home", "hedge@hog.com"}}' -- nested two-dimensional array (placeholder addresses)
);
System Database Information Gathering
Which tables are in the system database?
SELECT *
FROM
information_schema.tables
WHERE
table_schema = 'public';
Which columns are in the ‘actor’ table and what are their details?
SELECT *
FROM
information_schema.columns
WHERE
table_name = 'actor';
Column descriptions are detailed: table_catalog, table_schema, table_name, column_name, ordinal_position, column_default, is_nullable, data_type, character_maximum_length, … and so on.
Extension Management (PostgreSQL)
Show Extensions
Available in the distribution ready to be installed.
SELECT
name
FROM pg_available_extensions;
dblink, pg_stat_statements, pg_trgm
Currently installed and ready for use.
SELECT
extname
FROM pg_extension;
plpgsql
Install/Import Extensions
First install and then check if the installation succeeded.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
SELECT
extname
FROM pg_extension;