ADV SQL QRIES
Remind me, how did that expression go?
- CASE
- TRY queries and CATCH errors
- Subqueries
- Common Table Expressions (CTE)
- Window Functions - They all use OVER()
  - RANK()
  - DENSE_RANK()
  - Giving rows a number with ROW_NUMBER()
  - Framing for sliding windows using RANGE BETWEEN…AND or ROWS BETWEEN…AND
  - Fetching relative rows: LAG() and LEAD()
  - Fetching absolute rows: FIRST_VALUE() and LAST_VALUE()
  - PARTITION BY subdivides processing ranges in a column
  - Paging with NTILE() - subdivide a table into equal-length pages
- Pivoting with the CROSSTAB() Extension
- Group total rows with GROUP BY ROLLUP() and CUBE()
- Fill Null values with COALESCE()
- String Manipulation
  - Compress a Column into a single String with STRING_AGG()
  - Concatenating Multiple Columns with || and CONCAT()
  - Changing Case with UPPER(), LOWER(), INITCAP()
  - Replace Characters with REPLACE()
  - Trim Characters with TRIM(), LTRIM(), RTRIM()
  - Pad Characters with PAD(), LPAD(), RPAD()
  - Reverse char order with REVERSE()
  - Character Count with CHAR_LENGTH() or LENGTH()
  - Find Char Position with STRPOS() or POSITION()
  - Clip Strings from the Ends with LEFT(), RIGHT(), and SUBSTRING()
- Data Types
- User-Defined Functions (PostgreSQL)
- Text Search and NLP (PostgreSQL)
- Drop Duplicate Rows (based on Specific Columns)
- Time Manipulation
- Transactions (MS SQL SERVER)
- Special Case: Accessing ARRAYS
- SQL query in Python Pandas
- Table Management
- Extension Management (PostgreSQL)
The following post consists of notes I kept on my drive and had been returning to again and again to remind myself of concepts and SQL syntax.
Of course, the original tables and outputs are not available here; the point is to provide a lookup for structures and building blocks.
CASE
The CASE statement can create a new column based on multiple conditions in a specified column.
The structure always follows CASE WHEN … THEN … ELSE … END AS. The ELSE and AS parts are optional.
SELECT
m.date,
t.team_long_name AS opponent,
CASE WHEN m.home_goal < m.away_goal THEN 'Barcelona win!'
WHEN m.home_goal > m.away_goal THEN 'Barcelona loss :('
ELSE 'Tie' END AS outcome
FROM matches_spain AS m
LEFT JOIN teams_spain AS t
ON m.hometeam_id = t.team_api_id
WHERE m.awayteam_id = 8634;
CASE and aggregate function (SUM)
The following query creates multiple new columns with a summation aggregate function.
SELECT
c.name AS country,
SUM(CASE WHEN m.season = '2012/2013' AND m.home_goal > m.away_goal
THEN 1 ELSE 0 END) AS matches_2012_2013,
SUM(CASE WHEN m.season = '2013/2014' AND m.home_goal > m.away_goal
THEN 1 ELSE 0 END) AS matches_2013_2014,
SUM(CASE WHEN m.season = '2014/2015' AND m.home_goal > m.away_goal
THEN 1 ELSE 0 END) AS matches_2014_2015
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
GROUP BY country;
Percentages with AVG and CASE
For each country, calculate the percentage of ties (home goals = away goals). Use the AVG and CASE (for 1/0 assignment) functions.
SELECT
c.name AS country,
ROUND(AVG(CASE WHEN m.season='2013/2014' AND m.home_goal = m.away_goal THEN 1
WHEN m.season='2013/2014' AND m.home_goal != m.away_goal THEN 0
END),2) AS pct_ties_2013_2014,
ROUND(AVG(CASE WHEN m.season='2014/2015' AND m.home_goal = m.away_goal THEN 1
WHEN m.season='2014/2015' AND m.home_goal != m.away_goal THEN 0
END),2) AS pct_ties_2014_2015
FROM country AS c
LEFT JOIN matches AS m
ON c.id = m.country_id
GROUP BY country;
TRY queries and CATCH errors
As in most languages, SQL can make use of TRY/CATCH (sometimes called exception/error) blocks.
BEGIN TRY
ALTER TABLE products
ADD CONSTRAINT CHK_Stock CHECK (stock >= 0);
END TRY
BEGIN CATCH
SELECT 'There was an error adding the constraint!';
END CATCH
Nested TRY/CATCH
It is also possible to have a nested TRY block inside the CATCH block, which lets you attempt a second query if the first failed.
Use example: the nested (inner) TRY block could be used to write an error message into a table that the outer TRY failed.
BEGIN TRY
INSERT INTO bikes (make, model, year, size)
VALUES ('Nukeproof', 'Grizzlybear', '2021', 'S');
END TRY
BEGIN CATCH
SELECT 'An error occurred inserting the bike!';
BEGIN TRY
INSERT INTO errors
VALUES ('Error inserting a bike');
SELECT 'Error inserted correctly!';
END TRY
BEGIN CATCH
SELECT 'An error occurred inserting the error!';
END CATCH
END CATCH
CATCH Error Details
The following query writes a message to an ‘errors’ table if the outer TRY block fails. If that insert fails too, then the error details are queried in the inner CATCH block.
BEGIN TRY
INSERT INTO bikes (make, model, year, size)
VALUES ('Nukeproof', 'Grizzlybear', '2021', 'S');
END TRY
BEGIN CATCH
SELECT 'An error occurred inserting the product!';
BEGIN TRY
INSERT INTO errors
VALUES ('Error inserting a product');
END TRY
BEGIN CATCH
SELECT ERROR_NUMBER() AS number,
ERROR_SEVERITY() AS severity_level,
ERROR_STATE() AS state,
ERROR_LINE() AS line,
ERROR_MESSAGE() AS message;
END CATCH
END CATCH
Custom Errors, RAISERROR
‘RAISERROR’ lets you define a custom error and parameters. Syntax:
RAISERROR ( { msg_str | msg_id | @local_variable_message },
severity,
state,
[ argument [ ,...n ] ] )
[ WITH option [ ,...n ] ]
DECLARE @product_id INT = 56747;
BEGIN TRY
IF NOT EXISTS (SELECT * FROM products WHERE product_id = @product_id)
RAISERROR('No product with id %d.', 11, 1, @product_id);
ELSE
SELECT * FROM products WHERE product_id = @product_id;
END TRY
BEGIN CATCH
SELECT ERROR_NUMBER() AS number,
ERROR_SEVERITY() AS severity_level,
ERROR_STATE() AS state,
ERROR_LINE() AS line,
ERROR_MESSAGE() AS message;
END CATCH
Subqueries
While regular subqueries can be evaluated on their own, correlated subqueries depend on results from the outer query and cannot. Queries with multiple subqueries get complicated quickly.
Subqueries in WHERE clause
SELECT
team_long_name,
team_short_name
FROM team
WHERE team_api_id IN
(SELECT hometeam_id
FROM match
WHERE home_goal >= 8);
Subqueries in FROM
SELECT
country,
date,
home_goal,
away_goal
FROM
(SELECT c.name AS country,
m.date,
m.home_goal,
m.away_goal,
(m.home_goal + m.away_goal) AS total_goals
FROM match AS m
LEFT JOIN country AS c
ON m.country_id = c.id) AS subq
WHERE total_goals >= 10;
Subqueries in SELECT
Subqueries in SELECT statements generate a single value!
SELECT
l.name AS league,
ROUND(AVG(m.home_goal + m.away_goal), 2) AS avg_goals,
(SELECT ROUND(AVG(home_goal + away_goal), 2)
FROM match
WHERE season = '2013/2014') AS overall_avg
FROM league AS l
LEFT JOIN match AS m
ON l.country_id = m.country_id
WHERE season = '2013/2014'
GROUP BY league;
Correlated Subqueries
Correlated subqueries reference one or more columns in the main query. They depend on information in the main query to run and thus cannot be executed on their own. Correlated subqueries are evaluated once per row of data retrieved - a process that takes a lot more computing power and time than a simple subquery.
The following WHERE clause filters for entries with total goals higher than three times the average.
SELECT
main.country_id,
date,
main.home_goal,
main.away_goal
FROM match AS main
WHERE
(home_goal + away_goal) >
(SELECT AVG((sub.home_goal + sub.away_goal) * 3)
FROM match AS sub
WHERE main.country_id = sub.country_id);
Nested Subqueries in FROM
Now it’s getting complicated. It is best to design such queries from the inside out.
SELECT
c.name AS country,
AVG(outer_s.matches) AS avg_seasonal_high_scores
FROM country AS c
LEFT JOIN (
SELECT country_id, season,
COUNT(id) AS matches
FROM (
SELECT country_id, season, id
FROM match
WHERE home_goal >= 5 OR away_goal >= 5) AS inner_s
GROUP BY country_id, season) AS outer_s
ON c.id = outer_s.country_id
GROUP BY country;
Common Table Expressions (CTE)
CTEs are declared and executed ahead of the main query and help organize queries with complicated subqueries.
They always precede the main query: WITH cte AS (SELECT …).
WITH home AS (
SELECT m.id,
m.date,
t.team_long_name AS hometeam,
m.home_goal
FROM match AS m
LEFT JOIN team AS t
ON m.hometeam_id = t.team_api_id),
away AS (
SELECT m.id,
m.date,
t.team_long_name AS awayteam,
m.away_goal
FROM match AS m
LEFT JOIN team AS t
ON m.awayteam_id = t.team_api_id)
SELECT
home.date,
home.hometeam,
away.awayteam,
home.home_goal,
away.away_goal
FROM home
INNER JOIN away
ON home.id = away.id;
Window Functions - They all use OVER()
The OVER() clause allows you to pass an aggregate function down a data set, similar to subqueries in SELECT, but it runs faster.
An empty OVER() generates a column with an aggregate over the complete column, i.e. the same value in each row.
SELECT
m.id,
c.name AS country,
m.season,
m.home_goal,
m.away_goal,
AVG(m.home_goal + m.away_goal) OVER() AS overall_avg
FROM match AS m
LEFT JOIN country AS c ON m.country_id = c.id;
The following query with OVER(ORDER BY ...) calculates the running total/sum up to the current row.
SELECT
athlete,
medals,
SUM(medals) OVER(ORDER BY athlete ASC) AS sum_medals
FROM athlete_medals
ORDER BY athlete ASC;
RANK()
Calculate a ranking and also order the query by rank.
Note: Assigns the same number to rows with identical values and skips over the next numbers. E.g.: 1,2,3,3,5,6,6,8…
The following query calculates the average goals per game and groups by league. The averages are ranked by league.
SELECT
l.name AS league,
AVG(m.home_goal + m.away_goal) AS avg_goals,
RANK() OVER(ORDER BY AVG(m.home_goal + m.away_goal) DESC) AS league_rank
FROM league AS l
LEFT JOIN match AS m
ON l.id = m.country_id
WHERE m.season = '2011/2012'
GROUP BY l.name
ORDER BY league_rank;
SELECT
athlete,
medals,
RANK() OVER (ORDER BY medals DESC) AS Rank_N
FROM Athlete_Medals
ORDER BY Medals DESC;
DENSE_RANK()
Note: Assigns same number to rows with identical values but does NOT skip over the next numbers. E.g.: 1,2,3,3,4,5,5,6…
SELECT
country,
athlete,
DENSE_RANK() OVER (PARTITION BY country ORDER BY medals DESC) AS rank_n
FROM athlete_medals
ORDER BY country ASC, rank_n ASC;
Giving rows a number with ROW_NUMBER()
Generate a column that numbers the rows with ROW_NUMBER().
SELECT
*,
ROW_NUMBER() OVER() AS Row_N
FROM Summer_Medals
ORDER BY Row_N ASC;
The ORDER BY inside the OVER() clause takes precedence over the outer ORDER BY. The following query numbers only the distinct years, deduplicated first in a subquery.
SELECT
Year,
ROW_NUMBER() OVER(ORDER BY Year ASC) AS Row_N
FROM (
SELECT DISTINCT Year
FROM Summer_Medals
) AS Years;
Framing for sliding windows using RANGE BETWEEN…AND or ROWS BETWEEN…AND
Keywords for sliding windows are UNBOUNDED, n PRECEDING, n FOLLOWING, and CURRENT ROW.
The window can run sums and averages between the current row and the top of the table or the bottom of the table. Both directions arrive at the same cumulative sum, but the way to get there is different.
Difference RANGE vs. ROWS: RANGE treats duplicates in OVER(ORDER BY ...) as a single entity, which is a special case; ROWS behaves normally (see the sketch after the table).
Medals | Rows Running Total | Range Running Total |
---|---|---|
10 | 10 | 10 |
50 | 60 | 110 |
50 | 110 | 110 |
60 | 170 | 230 |
60 | 230 | 230 |
70 | 300 | 300 |
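A minimal sketch that would produce the two running totals above side by side (assuming a single-column medals table, here called medals_table):
SELECT
medals,
SUM(medals) OVER(ORDER BY medals
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_running_total,
SUM(medals) OVER(ORDER BY medals
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_running_total
FROM medals_table;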
The following query calculates the Running Total (RT) and Running Average (RA). The window uses ascending date ordering (oldest date at the top) -> the window runs from the top of the table (oldest date) to the current row.
SELECT
date,
home_goal,
away_goal,
SUM(home_goal) OVER(ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total,
AVG(home_goal) OVER(ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_avg
FROM match
WHERE
hometeam_id = 9908
AND season = '2011/2012';
The following query calculates the RT and RA with descending date ordering (newest date at the top) -> the window runs from the current row to the bottom of the table (oldest date).
SELECT
date,
home_goal,
away_goal,
SUM(home_goal) OVER(ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_total,
AVG(home_goal) OVER(ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_avg
FROM match
WHERE
awayteam_id = 9908
AND season = '2011/2012';
Moving average (MA). Calculate a column with the 3-year (2 preceding + current) average.
SELECT
year,
medals,
AVG(medals) OVER(ORDER BY year ASC
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS medals_ma
FROM russian_medals
ORDER BY year ASC;
Fetching relative rows: LAG() and LEAD()
The LAG(column, n) function fetches a single value from BEFORE (above) the current row (n rows back) to create a new column.
The following query determines (1) which country won gold at each year’s Olympics and (2) which country was the previous (reigning) champion.
SELECT
Year, Champion,
LAG(Champion) OVER(ORDER BY year ASC) AS Last_Champion
FROM weightlifting_gold
ORDER BY year ASC;
The LEAD(column, n) function fetches a single value from AFTER (below) the current row (n rows ahead) to create a new column.
SELECT
year,
athlete,
LEAD(athlete, 3) OVER (ORDER BY year ASC) AS future_champion
FROM discus_medalists
ORDER BY year ASC;
Fetching absolute rows: FIRST_VALUE() and LAST_VALUE()
These functions do not depend on the current row’s position: FIRST_VALUE(column) and LAST_VALUE(column).
The following query fetches the gold-winning athlete and a second column with the first athlete in alphabetical order.
SELECT
athlete,
FIRST_VALUE(athlete) OVER(ORDER BY athlete ASC) AS First_Athlete
FROM all_male_medalists;
Fetching the last value is more complicated, since it requires defining the frame RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
Otherwise the last value of the window is the current row, and that’s not what we mean.
SELECT
year,
city,
LAST_VALUE(city) OVER(ORDER BY year ASC
RANGE BETWEEN
UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING
) AS last_city
FROM hosts
ORDER BY Year ASC;
PARTITION BY subdivides processing ranges in a column
PARTITION BY splits the rows of a column into separate processing sections (partitions). This prevents window functions such as ROW_NUMBER() or LAG() from leaking over from one section into another.
The following query partitions the table by gender so that LAG() does not spill the ‘male’ gold winners into the ‘female’ rows. At each partition boundary a null value is displayed instead.
SELECT
gender,
year,
country AS champion,
LAG(country,1) OVER(PARTITION BY gender ORDER BY year ASC) AS last_champion
FROM athletics_gold
ORDER BY gender ASC, year ASC;
Partition by multiple columns (gender and event).
SELECT
gender,
year,
event,
country AS champion,
LAG(country,1) OVER (PARTITION BY gender, event ORDER BY year ASC) AS last_champion
FROM athletics_gold
ORDER BY event ASC, gender ASC, year ASC;
Paging with NTILE()- subdivide a table into equal-length pages
NTILE(num_pages) OVER(ORDER BY column) AS new_column
divides the table into num_pages pages based on the values in the given column. The new column contains the page number each row was assigned to. These pages are effectively quantiles on which statistical analysis can be performed.
The following query subdivides the ordered medals into three pages.
SELECT
athlete,
medals,
NTILE(3) OVER(ORDER BY medals DESC) AS third
FROM athlete_medals
ORDER BY medals DESC, athlete ASC;
Perform aggregate function analysis on the pages/quantiles.
The following query gets the average medals earned in each page.
WITH thirds AS (
SELECT
athlete,
medals,
NTILE(3) OVER(ORDER BY medals DESC) AS third
FROM athlete_medals)
SELECT
third,
AVG(medals) AS avg_medals
FROM thirds
GROUP BY third
ORDER BY third ASC;
Pivoting with the CROSSTAB() Extension
Pivoting with CROSSTAB requires the tablefunc extension, which needs to be created before querying.
A regular query is wrapped in SELECT * FROM CROSSTAB($$ ... $$) AS ct (column COLTYPE, ...), where the new pivoted column names and types are defined.
CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT * FROM CROSSTAB($$
SELECT
gender,
year,
country
FROM summer_medals
WHERE
year IN (2008, 2012)
AND medal = 'gold'
AND event = 'pole vault'
ORDER BY gender ASC, year ASC;
$$) AS ct (gender VARCHAR,
"2008" VARCHAR,
"2012" VARCHAR)
ORDER BY gender ASC;
Original:
gender | year | country |
---|---|---|
Men | 2008 | AUS |
Men | 2012 | FRA |
Women | 2008 | RUS |
Women | 2012 | USA |
Pivot by year:
gender | 2008 | 2012 |
---|---|---|
Men | AUS | FRA |
Women | RUS | USA |
Group total rows with GROUP BY ROLLUP() and CUBE()
ROLLUP() vs. CUBE()
Use ROLLUP() for hierarchical data such as date parts (year, quarter) when you don’t want all possible group-level aggregations. Use CUBE() for all possible group-level aggregations.
Null entries in the result rows signify the total rows.
The following query counts medals by country and gender. At the gender group level there is a total row (male + female), produced by the ROLLUP(gender) clause.
SELECT
country,
gender,
COUNT(*) AS gold_awards
FROM summer_medals
WHERE
year = 2004
AND medal = 'Gold'
AND country IN ('DEN', 'NOR', 'SWE')
GROUP BY country, ROLLUP(gender)
ORDER BY country ASC, gender ASC;
Group totals by all possible groups with CUBE(). Note: all columns have to go into the CUBE() or it will fail.
SELECT
gender,
medal,
COUNT(*) AS awards
FROM Summer_Medals
WHERE
year = 2012
AND country = 'RUS'
GROUP BY CUBE(gender, medal)
ORDER BY gender ASC, medal ASC;
Fill Null values with COALESCE()
Useful for any function that produces nulls, such as ROLLUP(), CUBE(), pivoting, or the first/last rows of LAG() and LEAD().
Use in the SELECT section of the query: COALESCE(column, 'null filler string') AS column.
SELECT
COALESCE(country, 'all countries') AS country,
COALESCE(gender, 'all genders') AS gender,
COUNT(*) AS awards
FROM summer_medals
WHERE
year = 2004
AND medal = 'gold'
AND country IN ('DEN', 'NOR', 'SWE')
GROUP BY ROLLUP(country, gender)
ORDER BY country ASC, gender ASC;
String Manipulation
Compress a Column into a single String with STRING_AGG()
Use in the SELECT statement as such: STRING_AGG(column, 'separator')
SELECT
STRING_AGG(country, ', ')
FROM country_ranks
WHERE rank_col <= 3;
Input Table:
Country | Rank |
---|---|
USA | 1 |
RUS | 2 |
AUS | 3 |
… | … |
Output string: USA, RUS, AUS
Concatenating Multiple Columns with || and CONCAT()
Merge strings with ||
SELECT
first_name || ' ' || last_name AS full_name
FROM
film;
CHARLOTTE HUNTER
Another example:
SELECT
customerID || ': ' || first_name || ' ' || last_name AS full_name;
1: CHARLOTTE HUNTER
Merge strings with PostgreSQL’s CONCAT()
SELECT
CONCAT(first_name,' ', last_name) AS full_name
FROM
film;
CHARLOTTE HUNTER
Changing Case with UPPER(), LOWER(), INITCAP()
SELECT
UPPER(email)
FROM customer;
SELECT
LOWER(email)
FROM customer;
SELECT
INITCAP(title)
FROM film;
Police Academy
Replace Characters with REPLACE()
SELECT
REPLACE(email, 'hedgeAThog.com', 'hedge@hog.com') AS email
FROM customer;
Trim Characters with TRIM(), LTRIM(), RTRIM()
TRIM([leading | trailing | both] [characters] FROM string)
If only the column/string is passed to TRIM(), all leading and trailing space characters are removed by default.
If no trim character string is passed, the space character is the default.
SELECT
TRIM(' hedge@hog.com '),
LTRIM(' hedge@hog.com '),
RTRIM(' hedge@hog.com ');
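A quick sketch of the full syntax with explicit trim characters (the literal strings are only for illustration):
SELECT
TRIM(both '#' FROM '##hedge@hog.com##'), -- removes the # characters from both ends
TRIM(leading ' ' FROM ' hedge@hog.com'); -- removes only the leading spaces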
Pad Characters with PAD(), LPAD(), RPAD()
Syntax: LPAD(input_string, length_after_pad, lpad_string)
Note that if the input string is already longer than the target length, it is truncated rather than padded. If no padding string is passed, the space character is the default.
SELECT
LPAD('hedge@hog.com', 20, '#'),
RPAD('hedge@hog.com', 17, '$')
'#######hedge@hog.com', 'hedge@hog.com$$$$'
Reverse char order with REVERSE()
SELECT
email,
REVERSE(email)
FROM customer;
hedge@hog.com, moc.goh@egdeh
Character Count with CHAR_LENGTH() or LENGTH()
SELECT
title,
CHAR_LENGTH(title),
LENGTH(title)
FROM film;
Police Academy, 14, 14
Find Char Position with STRPOS() or POSITION()
Remember, that SQL indexing starts at 1, not 0!
SELECT
email,
STRPOS(email, '@'),
POSITION('@' IN email)
FROM customer;
hedge@hog.com, 6, 6
Clip Strings from the Ends with LEFT(), RIGHT(), and SUBSTRING()
LEFT(), RIGHT(): Specify how many characters to be clipped starting from the ends.
SELECT
RIGHT(column_name, 50),
LEFT(column_name, 10)
FROM customer;
SUBSTRING()/SUBSTR(): Specify the start index and length.
SELECT
SUBSTRING(column_name, 10, 50),
SUBSTR(colum_name, 10, 50)
FROM customer;
SUBSTRING(): Use functions to determine start/end index with FROM…FOR expression.
SELECT
SUBSTRING(email FROM 0 FOR POSITION('@' IN email)), -- name part of address up to @
SUBSTRING(email FROM POSITION('@' IN email)+1 FOR CHAR_LENGTH(email)) -- provider part of address starting after @
FROM customer;
hedge, hog.com
Clip a column to 50 characters but do not clip a word in half.
SELECT
LEFT(description, 50 - POSITION(' ' IN REVERSE(LEFT(description, 50)))) -- subtract the position of the first whitespace character, counted from the back
FROM film;
Data Types
Casting Data Types
PostgreSQL native casting with two colons: column::data-type
SELECT
NOW()::timestamp;
Standard SQL casting with the CAST(column AS data-type) function.
SELECT
CAST(NOW() AS timestamp);
Creating Data Types (PostgreSQL)
CREATE TYPE dayofweek AS ENUM(
'Monday',
'Tuesday',
'Wednesday',
'Thursday',
'Friday',
'Saturday',
'Sunday'
);
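Once created, the type can be used like any built-in data type, for example in a column definition (hypothetical table):
CREATE TABLE appointments (
id integer,
weekday dayofweek -- column uses the custom ENUM type
);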
Getting Information about Data Types (PostgreSQL)
SELECT
typname,
typcategory
FROM pg_type
WHERE typname = 'dayofweek';
User-Defined Functions (PostgreSQL)
A User-Defined Function is the equivalent of a Stored Procedure.
A function can bundle several SQL queries together into a package using CREATE FUNCTION.
CREATE FUNCTION squared (i integer) RETURNS integer AS $$
BEGIN
RETURN i*i;
END;
$$ language plpgsql; --define language
SELECT
squared(10);
100
Text Search and NLP (PostgreSQL)
Full-Text Search (PostgreSQL)
PostgreSQL extends the normal WHERE column LIKE '%mystring%' wildcard (%, _) matching.
Additional features are stemming, fuzzy matching (spelling mistakes), and ranking.
SELECT
title,
description
FROM film
WHERE to_tsvector(title) @@ to_tsquery('elf');
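Ranking can be added with ts_rank(), which scores how well each row matches the query (a minimal sketch on the same film table):
SELECT
title,
ts_rank(to_tsvector(title), to_tsquery('elf')) AS rank_score
FROM film
WHERE to_tsvector(title) @@ to_tsquery('elf')
ORDER BY rank_score DESC;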
Compare Two Strings (PostgreSQL)
Calculate the Levenshtein Distance (number of edits required between two words).
SELECT
levenshtein('hugo', 'hagolm');
3
Calculate similarity (0-1) with the pg_trgm extension.
SELECT
similarity('hugo', 'hagolm');
Drop Duplicate Rows (based on Specific Columns)
Drop rows that are duplicated in specific columns only, not across the entire row.
Use a Common Table Expression (CTE) with a ROW_NUMBER() window function partitioned (PARTITION BY) on those columns, which creates a row number per window.
Any duplicates get a row number >= 2. Then query the CTE and only keep rows with row number = 1.
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY
col_1,
col_2,
col_3
ORDER BY
col_1,
col_2,
col_3
) row_num
FROM orders_table
)
SELECT *
FROM cte
WHERE row_num = 1;
id | sales | worker_firstname | worker_lastname | registration_date | row_num |
---|---|---|---|---|---|
000000119 | 13 | Johnny | Smith | 2014-01-31 | 1 |
000000127 | 14 | Brian | Merkel | 2014-02-28 | 1 |
000000120 | 29 | Karin | Harris | 2014-01-31 | 1 |
000000128 | 16 | Carol | Schulze | 2014-02-28 | 1 |
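To actually remove the duplicates rather than just select the unique rows, MS SQL Server allows deleting through the CTE (a sketch of that variant; PostgreSQL does not support deleting from a CTE and needs a different approach):
WITH cte AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY col_1, col_2, col_3
ORDER BY col_1, col_2, col_3
) row_num
FROM orders_table
)
DELETE FROM cte
WHERE row_num > 1;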
Time Manipulation
- NOW() and CURRENT_TIMESTAMP() fetch the current time (microsecond precision) with timezone! CURRENT_TIMESTAMP(2) can be passed an argument to round to a number of fractional digits. If the timezone is not required, these two functions can be cast to the timestamp data type.
- CURRENT_DATE only fetches the date portion of the CURRENT_TIMESTAMP().
- CURRENT_TIME only fetches the time portion (with timezone) of the CURRENT_TIMESTAMP(). It can also be passed a rounding parameter.
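A minimal example showing these side by side (PostgreSQL; the column aliases are only for illustration):
SELECT
NOW() AS now_with_tz, -- timestamp with time zone
CURRENT_TIMESTAMP(2) AS now_rounded, -- fractional seconds rounded to 2 digits
NOW()::timestamp AS now_without_tz, -- the cast drops the time zone
CURRENT_DATE AS today, -- date portion only
CURRENT_TIME AS time_of_day; -- time portion, with time zone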
Intervals and Time Differences
This query selects a timestamp-style column and performs arithmetic on it with an INTERVAL. The INTERVAL adds 3 days to the rental date so we can check whether the return date was on time.
SELECT
rental_date,
return_date,
rental_date + INTERVAL '3 days' AS expected_return_date
FROM rental;
Convert integers to time intervals.
SELECT
title,
INTERVAL '1' day * rental_duration AS rental_duration,
return_date - rental_date AS days_rented
FROM film
WHERE return_date IS NOT NULL
The AGE() function calculates an INTERVAL between a given timestamp/date and NOW() if ONE parameter is given. If TWO parameters are given, it returns the interval between those two.
SELECT
AGE(rental_date)
FROM rental;
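And the two-parameter form, which is reused in the 90-day example further below:
SELECT
AGE(return_date, rental_date) AS rental_duration
FROM rental;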
Extracting from Timestamps
Two useful functions are EXTRACT(field FROM source) and DATE_PART('field', source), which produce interchangeable results.
Timestamp parts/fields to extract:
Field (long) | Field (short) | Description |
---|---|---|
year | y | Year |
quarter | | Quarter |
month | | Month |
day | d | Day of Month |
dow | | Day of Week |
week | w | Week |
hour | h | Hour |
minute | m | Minute |
second | s | Second |
millisecond | ms | Millisecond |
Examples for the quarter of the year in a timestamp:
SELECT
EXTRACT(quarter FROM TIMESTAMP '2009-02-12 05:18:00') AS quarter;
1
SELECT
DATE_PART('quarter', TIMESTAMP '2009-02-12 05:18:00') AS quarter;
1
SELECT
EXTRACT(dow FROM rental_date) AS dayofweek,
COUNT(*) as rentals
FROM rental
GROUP BY 1; -- column called by positional index instead of names
The following query extracts quarter and year from a timestamp column and sums the payment amount by quarter and year.
SELECT
EXTRACT(quarter FROM payment_date) AS quarter,
EXTRACT(year FROM payment_date) AS year,
SUM(amount) AS total_payments
FROM payment
GROUP BY 1, 2;
The following query extracts the day of week and the rental duration (rental_days) for rentals in a 90-day interval starting from 2005-05-01.
SELECT
EXTRACT(dow FROM r.rental_date) AS dayofweek,
AGE(return_date, rental_date) AS rental_days
FROM rental AS r
WHERE rental_date
BETWEEN CAST('2005-05-01' AS TIMESTAMP)
AND CAST('2005-05-01' AS TIMESTAMP) + INTERVAL '90 day';
Truncating Timestamps
The DATE_TRUNC() function truncates timestamp or interval data types from fine detail (e.g., minutes) to low detail (e.g., year). The timestamp’s length remains the same, but the truncated fields are set to the lowest logical value (01 for days/months, 00 for hours/minutes/seconds).
SELECT
DATE_TRUNC('year', TIMESTAMP '2009-02-12 05:18:00');
2009-01-01 00:00:00
SELECT
DATE_TRUNC('month', TIMESTAMP '2009-02-12 05:18:00');
2009-02-01 00:00:00
The following query truncates to a day and counts the number of entries (rentals) by day.
SELECT
DATE_TRUNC('day', rental_date) AS rental_day,
COUNT(*) AS rentals
FROM rental
GROUP BY 1;
Transactions (MS SQL SERVER)
Transactions move values between owners, for example money between bank accounts. This means a transaction contains a subtraction and an addition that both have to succeed.
The concept of “atomic transactions” rolls back a transaction if any of its parts raised an error. SQL Server has a specific ‘TRAN’ keyword that can be put inside a ‘TRY’ block. If the ‘TRAN’ raises an error, the ‘CATCH’ block calls a ‘ROLLBACK TRAN’.
With TRY/CATCH
BEGIN TRY
BEGIN TRAN;
UPDATE accounts SET current_balance = current_balance - 100 WHERE account_id = 4451;
INSERT INTO transactions VALUES (4451, -100, GETDATE());
UPDATE accounts SET current_balance = current_balance + 100 WHERE account_id = 6743;
INSERT INTO transactions VALUES (6743, 100, GETDATE());
COMMIT TRAN;
END TRY
BEGIN CATCH
SELECT 'Rolling back the transaction';
ROLLBACK TRAN;
END CATCH
With XACT_ABORT
With ‘SET XACT_ABORT ON;’ errors in a ‘TRAN’ block will always be rolled back. No ‘TRY’ block needed.
Note: ‘THROW’ error still works with XACT_ABORT.
SET XACT_ABORT ON;
BEGIN TRAN;
UPDATE accounts set current_balance = current_balance - current_balance * 0.01 / 100
WHERE current_balance > 5000000;
IF @@ROWCOUNT <= 10
THROW 55000, 'Not enough wealthy customers!', 1;
ELSE
COMMIT TRAN;
Special Case: Accessing ARRAYS
It is not necessary to fetch the entire array of a column. We can select specific elements with square brackets. Indexing in PostgreSQL starts at 1, not 0 as in many other languages.
Search for text in an ARRAY at a specific position.
SELECT
email[1][1] AS type, -- select from first address the first element (address type/description)
email[1][2] AS address, -- select from first address the second element (email address)
test_scores[1] -- select only first element of the array
FROM funky_table
WHERE email[1][1] = 'work'; -- only search in array element 1, sub-element 1
Search for text in any ARRAY position with the ANY or @> operators. The results are equivalent.
SELECT
email[2][1] AS type, -- select from second address the first element (address type/description)
email[2][2] AS address, -- select from second address the second element (email address)
test_scores[1] -- select only first element of the array
FROM funky_table
WHERE 'home' = ANY (email); -- select where the column `email` contains an array element named 'home' in any array position
SELECT
email[2][1] AS type,
email[2][2] AS address,
test_scores[1]
FROM funky_table
WHERE email @> ARRAY['home'];
SQL query in Python Pandas
The SQL query is passed as a single string. The triple quotation mark wrapping (""") is useful for multi-line strings in Python. The table name goes in quotation marks, and the SQL database connection has to be passed after the query.
Connection URI (Uniform Resource Identifier) for PostgreSQL: postgresql://[user[:password]@][host][:port][/database]
import sqlalchemy
connection_uri = "postgresql://repl:password@localhost:5432/pagila"
db_engine = sqlalchemy.create_engine(connection_uri)
import pandas as pd
df = pd.read_sql("""
SELECT
first_name,
last_name
FROM "Customer"
ORDER BY last_name, first_name
""", db_engine)
df.head(3)
 | first_name | last_name |
---|---|---|
0 | Connagh | Bailey |
1 | Brook | Bloom |
2 | Ann | Dalton |
ETL process
- For the “Extraction” step from the source (OLTP) database, the following function queries an entire SQL table and loads it into a pandas DataFrame.
import sqlalchemy
import pandas as pd
# custom extraction function
def extract_table_to_pandas(tablename, db_engine):
query = "SELECT * FROM {}".format(tablename)
return pd.read_sql(query, db_engine)
connection_uri = "postgresql://repl:password@localhost:5432/pagila"
db_engine = sqlalchemy.create_engine(connection_uri)
extract_table_to_pandas("film", db_engine)
Then do some column transformations, e.g. in Spark: convert float to string and split on a pattern into multiple columns.
Loading into analytical PostgreSQL database
connection_uri = "postgresql://repl:password@localhost:5432/dwh"
db_engine_dwh = sqlalchemy.create_engine(connection_uri)
film_pdf_joined.to_sql("film", db_engine_dwh, schema="store", if_exists="replace")
# check the table in the SQL db
pd.read_sql("SELECT film_id, recommended_film_ids FROM store.film", db_engine_dwh)
Table Management
Create Table
Creating a table schema works with [column name] [data-type]. ARRAYS can be created with square brackets. Multiple square brackets are nested arrays.
CREATE TABLE funky_table (
column_1 text,
column_2 integer,
test_scores integer[],
employee_id integer,
email text[][]
);
Insert Values into Columns
Inserting into only specific columns works by naming those columns.
INSERT INTO funky_table
(column_1, column_2) VALUES ('string for column_1', 12);
Inserting into all columns at once can be done without column names, but the value order has to match the columns. ARRAYs are inserted with curly braces {} and array elements separated by commas.
INSERT INTO funky_table
VALUES ('Prickly Hedgehog',
2,
'{99,83,18,811}', -- standard array with four elements
2224,
'{{"work", "hedge@work.com"}, {"home", "hedge@hog.com"}}' -- nested two-dimensional array (placeholder addresses)
);
System Database Information Gathering
Which tables are in the system database?
SELECT *
FROM
information_schema.tables
WHERE
table_schema = 'public';
Which columns are in the ‘actor’ table and what are their details?
SELECT *
FROM
information_schema.columns
WHERE
table_name = 'actor';
Column descriptions are detailed: table_catalog, table_schema, table_name, column_name, ordinal_position, column_default, is_nullable, data_type, character_maximum_length, … and so on.
Extension Management (PostgreSQL)
Show Extensions
Available in the distribution ready to be installed.
SELECT
name
FROM pg_available_extensions;
dblink, pg_stat_statements, pg_trgm
Currently installed and ready for use.
SELECT
extname
FROM pg_extension;
plpgsql
Install/Import Extensions
First install and then check if the installation succeeded.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
SELECT
extname
FROM pg_extension;