SQL – Nextra

Generic SQL notes

Query

BETWEEN

Both values are inclusive.

COUNT

COUNT(*) counts rows — it includes rows where columns are NULL.
COUNT(column) counts non-NULL values in that column — NULLs are ignored.
COUNT(DISTINCT column) counts distinct non-NULL values.

ORDER BY

NULL values will always come first.

JOIN

Here’s the SQL JOIN guide formatted in Markdown.

Key idea: joins combine rows from two tables (A and B) using a matching condition (usually A.key = B.key).

INNER JOIN

Returns only rows where the join condition matches in both A and B.
Use when you want rows that have related records in both tables.
SQL example:

SELECT *
FROM A
INNER JOIN B ON A.id = B.a_id;

LEFT (OUTER) JOIN

Returns all rows from A plus matching rows from B; where B has no match, B columns are NULL.
Use when you need every A row regardless of whether it has a match.
SQL example:

SELECT *
FROM A
LEFT JOIN B ON A.id = B.a_id;

RIGHT (OUTER) JOIN

Mirror of LEFT JOIN: returns all rows from B plus matching rows from A; unmatched A columns are NULL.
Less common; supported by many DBs.
SQL example:

SELECT *
FROM A
RIGHT JOIN B ON A.id = B.a_id;

FULL (OUTER) JOIN

Returns rows that match plus non-matching rows from both A and B; unmatched side columns are NULL.
Use when you need union of both tables with preserved unmatched rows.
SQL example:

SELECT *
FROM A
FULL OUTER JOIN B ON A.id = B.a_id;

CROSS JOIN

Cartesian product: every row of A combined with every row of B; no join condition.
Use rarely (e.g., combinatorial generation).
SQL example:

SELECT *
FROM A
CROSS JOIN B;

SELF JOIN

A table joined to itself (using aliases) to compare rows within the same table.
SQL example:

SELECT a.*, b.*
FROM employees a
JOIN employees b ON a.manager_id = b.id;

ANTI-JOIN (NOT EXISTS / LEFT JOIN ... WHERE NULL)

Returns rows in A that have no matching row in B.
Implementations:

-- NOT EXISTS version (preferred for clarity & performance)
SELECT *
FROM A
WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.a_id = A.id);
 
-- LEFT JOIN + IS NULL alternative
SELECT A.*
FROM A
LEFT JOIN B ON A.id = B.a_id
WHERE B.a_id IS NULL;

SEMI-JOIN (EXISTS / IN)

Returns rows from A that have at least one match in B, but does not duplicate A rows per matching B row.
Implementations:

-- EXISTS (preferred)
SELECT *
FROM A
WHERE EXISTS (SELECT 1 FROM B WHERE B.a_id = A.id);
 
-- IN alternative
SELECT *
FROM A
WHERE A.id IN (SELECT a_id FROM B);

JOIN - Performance & tips

Use INNER JOIN when you only want matching pairs (fastest in many cases).
Prefer EXISTS/NOT EXISTS for semi/anti-joins; LEFT JOIN ... IS NULL can give incorrect results if B.a_id is nullable.
Use appropriate indexes on join keys.
Avoid CROSS JOINs unless intentional.
FULL OUTER JOIN may be unsupported or slower; consider UNION of LEFT and RIGHT if needed.

Transaction

Atomicity

Consistency

Isolation (opens in a new tab)

Read phenomenon

In ascending order of seriousness:

Dirty read

Read uncommitted changes made by another transaction
Non-repeatable read

A non-repeatable read occurs when, during the course of a transaction, a row is retrieved twice and the values within the row differ between reads.
Phantom read

A phantom read occurs when, in the course of a transaction, new rows are added or removed by another transaction to the records being read.
Serialization anomaly

The result of successfully committing a group of transactions is inconsistent with all possible orderings of running those transactions one at a time.

Isolation levels

In ascending order of strictness:

Serializable (no phenomenon)
Repeatable read (phantom read)
Read committed (non-repeatable read, phantom read)
Read uncommitted (dirty read, non-repeatable read, phantom read)

Isolation level	Dirty read	Non-repeatable read	Phantom read	Serialization anomaly
`Read uncommitted`	Allowed, but not in PG	Possible	Possible	Possible
`Read committed`	Not possible	Possible	Possible	Possible
`Repeatable read`	Not possible	Not possible	Allowed, but not in PG	Possible
`Serializable`	Not possible	Not possible	Not possible	Not possible

Resources
- PostgreSQL: Documentation - 13.2. Transaction Isolation (opens in a new tab)

Durability

Database Design

ID Generation

Auto Increment

UUID

128-bit
Versions
- Version 1 & 2 (date-time & MAC address)
  - The uniqueness of v1 and v2 UUIDs based on network-card MAC addresses also depends on network-card manufacturers properly assigning unique MAC addresses to their cards, which like other manufacturing processes is subject to error. Additionally some operating system permit the end user to customise the MAC address, notably OpenWRT.[
  - Usage of the node's network card MAC address for the node ID means that a version-1 UUID can be tracked back to the computer that created it.
- Version 3 & 5 (namespace name-based)
  - v3: MD5
  - v5: SHA-1
- Version 4 (random number)
Pros
- Sharding
- Merge/Replicate
- Scaling
- Flexibility
- Custom format
- Does not need DB to generate a key
Cons
- Performance
Resources
- RFC 4122: A Universally Unique IDentifier (UUID) URN Namespace (opens in a new tab)

Reference

Comparison of relational database management systems (opens in a new tab)

RDBMS

Index

Clustered vs Non-clustered Index

	Clustered Index	Non-clustered Index
Storage	The actual data rows are physically sorted and stored in the order of the clustered index key.	The index contains a copy of the indexed columns and a pointer to the actual data row.
Sorting	Only one clustered index can be created on a table.	Multiple non-clustered indexes can be created on a table.
Key Uniqueness	A clustered index key must be unique.	A non-clustered index key can have duplicate values.
Impact on Data Modification	Data modification operations (inserts, updates, deletes) can be slower since the physical order of the rows may need to be rearranged.	Data modification operations are generally faster as they only update the index , not the actual data rows.
Index Seek vs Scan	Clustered indexes are efficient for range queries and large data retrieval since the data is physically sorted.	Non-clustered indexes are efficient for specific lookups or single-row retrievals, as they provide a quick lookup path to the data.
Included Columns	Clustered indexes include all columns of the table by default	Non-clustered indexes can include additional columns to cover specific query needs.
Primary Key Usage	By default, a primary key constraint creates a clustered index.	A primary key can create either a clustered or non-clustered index, depending on the configuration.

View

Materialized View

PostgreSQL: Documentation: 16: 41.3. Materialized Views (opens in a new tab)

Update materialized view periodically or in an event-driven manner.

CTE

Technique	Advantages	Disadvantages	Suitable Use Case
CTEs	Temporary scope within a single query No storage or maintenance required Improves readability by modularizing code	Limited to the query in which they are defined	Organizing complex queries, temporary transformations, and breaking down multi-step operations
Derived Tables	Simplifies nested subqueries No need for permanent storage	Harder to read/maintain for complex queries List cannot be reused multiple times within a query	Quick, single-use transformations and aggregations within a query
Views	Reusable across queries Can enhance security by restricting data access	Requires maintenance and can affect multiple queries Complex views can impact performance	Long-term reusable logic and data access control

Performance

Query Optimization

Resources
- Databases.Illuminated 4th Edition - Chapter 10 Relational Query Optimization
Heuristics
1. Do selection as early as possible. If appropriate, use cascading of selects, commutativity of selects with projects, joins, and products; and distributivity of select over set union, set intersection, and set difference to move the selection as far down the query tree as possible.
2. Use associativity of join, product, union, and intersection to rearrange relations in the query tree so that the selection operation that will produce the smallest table will be executed first.
3. If a product appears as an argument for a selection, where the selection involves attributes of the tables in the product, transform the product into a join. If the selection involves attributes of only one of the tables in the product, apply the selection to that table first.
4. Do projection early. Insertion of projections can be used to reduce the size of intermediate results. If appropriate, use cascading of projects; distributivity of projection over product and set union, intersection, and difference; and commutativity of selection and projection to move the projection as far down the query tree as possible. Examine all projections to see if some are unnecessary.
5. If there is a sequence of selections and/or projections with the same argument, use commutativity or cascading to combine them into one selection, one projection, or a selection followed by a projection.
6. If a subexpression appears more than once in the query tree, and the result it produces is not too large, compute it once and save it (materialized CTE). This technique is especially useful when querying on views, because the same subexpression must be used to construct the view each time.

General Tips

Reduce the size of the intermediate tables, (use LIMIT and OFFSET)
- Perform selection early, only retrieve the needed rows.
- Perform projection early, only retrieve the needed columns.
Use EXPLAIN to understand how the query is executed
EXPLAIN will only generate a plan for the given statement.
EXPLAIN ANALYZE will return the plan, run the query, and show the actual runtime and row count for each step.
When using EXPLAIN ANALYZE, because the query will be executed actually, be careful with INSERT, UPDATE, and DELETE.
Minimize locks to tables
Use materialized views to store the results of a query
Convert a big query of multiple tables into a series of smaller queries of single tables. Cache and join results on the application level to enhance scalability.
Replace reusable subqueries with CTE

Query benchmark

PostgreSQL

Bruce Momjian: Postgres Blog - Timing a Query (opens in a new tab)

time PGOPTIONS="-c log_min_duration_statement=0 \
-c client_min_messages=log" psql <<END
\timing
SELECT 1;
END

SVN Cheatsheet Software Testing