SQL

Generic SQL notes

Query

BETWEEN

  • Both values are inclusive.

COUNT

  • COUNT(*) counts rows — it includes rows where columns are NULL.
  • COUNT(column) counts non-NULL values in that column — NULLs are ignored.
  • COUNT(DISTINCT column) counts distinct non-NULL values.

ORDER BY

  • NULL values will always come first.

JOIN

Here’s the SQL JOIN guide formatted in Markdown.

Key idea: joins combine rows from two tables (A and B) using a matching condition (usually A.key = B.key).

INNER JOIN
  • Returns only rows where the join condition matches in both A and B.
  • Use when you want rows that have related records in both tables.
  • SQL example:
SELECT *
FROM A
INNER JOIN B ON A.id = B.a_id;
LEFT (OUTER) JOIN
  • Returns all rows from A plus matching rows from B; where B has no match, B columns are NULL.
  • Use when you need every A row regardless of whether it has a match.
  • SQL example:
SELECT *
FROM A
LEFT JOIN B ON A.id = B.a_id;
RIGHT (OUTER) JOIN
  • Mirror of LEFT JOIN: returns all rows from B plus matching rows from A; unmatched A columns are NULL.
  • Less common; supported by many DBs.
  • SQL example:
SELECT *
FROM A
RIGHT JOIN B ON A.id = B.a_id;
FULL (OUTER) JOIN
  • Returns rows that match plus non-matching rows from both A and B; unmatched side columns are NULL.
  • Use when you need union of both tables with preserved unmatched rows.
  • SQL example:
SELECT *
FROM A
FULL OUTER JOIN B ON A.id = B.a_id;
CROSS JOIN
  • Cartesian product: every row of A combined with every row of B; no join condition.
  • Use rarely (e.g., combinatorial generation).
  • SQL example:
SELECT *
FROM A
CROSS JOIN B;
SELF JOIN
  • A table joined to itself (using aliases) to compare rows within the same table.
  • SQL example:
SELECT a.*, b.*
FROM employees a
JOIN employees b ON a.manager_id = b.id;
ANTI-JOIN (NOT EXISTS / LEFT JOIN ... WHERE NULL)
  • Returns rows in A that have no matching row in B.
  • Implementations:
-- NOT EXISTS version (preferred for clarity & performance)
SELECT *
FROM A
WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.a_id = A.id);
 
-- LEFT JOIN + IS NULL alternative
SELECT A.*
FROM A
LEFT JOIN B ON A.id = B.a_id
WHERE B.a_id IS NULL;
SEMI-JOIN (EXISTS / IN)
  • Returns rows from A that have at least one match in B, but does not duplicate A rows per matching B row.
  • Implementations:
-- EXISTS (preferred)
SELECT *
FROM A
WHERE EXISTS (SELECT 1 FROM B WHERE B.a_id = A.id);
 
-- IN alternative
SELECT *
FROM A
WHERE A.id IN (SELECT a_id FROM B);
JOIN - Performance & tips
  • Use INNER JOIN when you only want matching pairs (fastest in many cases).
  • Prefer EXISTS/NOT EXISTS for semi/anti-joins; LEFT JOIN ... IS NULL can give incorrect results if B.a_id is nullable.
  • Use appropriate indexes on join keys.
  • Avoid CROSS JOINs unless intentional.
  • FULL OUTER JOIN may be unsupported or slower; consider UNION of LEFT and RIGHT if needed.

Transaction

Atomicity

Consistency

Isolation (opens in a new tab)

Read phenomenon

In ascending order of seriousness:

  1. Dirty read

    Read uncommitted changes made by another transaction

  2. Non-repeatable read

    A non-repeatable read occurs when, during the course of a transaction, a row is retrieved twice and the values within the row differ between reads.

  3. Phantom read

    A phantom read occurs when, in the course of a transaction, new rows are added or removed by another transaction to the records being read.

  4. Serialization anomaly

    The result of successfully committing a group of transactions is inconsistent with all possible orderings of running those transactions one at a time.

Isolation levels

In ascending order of strictness:

  1. Serializable (no phenomenon)
  2. Repeatable read (phantom read)
  3. Read committed (non-repeatable read, phantom read)
  4. Read uncommitted (dirty read, non-repeatable read, phantom read)
Isolation levelDirty readNon-repeatable readPhantom readSerialization anomaly
Read uncommittedAllowed, but not in PGPossiblePossiblePossible
Read committedNot possiblePossiblePossiblePossible
Repeatable readNot possibleNot possibleAllowed, but not in PGPossible
SerializableNot possibleNot possibleNot possibleNot possible

Durability

Database Design

ID Generation

Auto Increment

UUID

  • 128-bit

  • Versions

    • Version 1 & 2 (date-time & MAC address)

      • The uniqueness of v1 and v2 UUIDs based on network-card MAC addresses also depends on network-card manufacturers properly assigning unique MAC addresses to their cards, which like other manufacturing processes is subject to error. Additionally some operating system permit the end user to customise the MAC address, notably OpenWRT.[
      • Usage of the node's network card MAC address for the node ID means that a version-1 UUID can be tracked back to the computer that created it.
    • Version 3 & 5 (namespace name-based)

      • v3: MD5
      • v5: SHA-1
    • Version 4 (random number)

  • Pros

    • Sharding
    • Merge/Replicate
    • Scaling
    • Flexibility
    • Custom format
    • Does not need DB to generate a key
  • Cons

    • Performance
  • Resources

Reference

RDBMS

Index

Clustered vs Non-clustered Index

Clustered IndexNon-clustered Index
StorageThe actual data rows are physically sorted and stored in the order of the clustered index key.The index contains a copy of the indexed columns and a pointer to the actual data row.
SortingOnly one clustered index can be created on a table.Multiple non-clustered indexes can be created on a table.
Key UniquenessA clustered index key must be unique.A non-clustered index key can have duplicate values.
Impact on Data ModificationData modification operations (inserts, updates, deletes) can be slower since the physical order of the rows may need to be rearranged.Data modification operations are generally faster as they only update the index , not the actual data rows.
Index Seek vs ScanClustered indexes are efficient for range queries and large data retrieval since the data is physically sorted.Non-clustered indexes are efficient for specific lookups or single-row retrievals, as they provide a quick lookup path to the data.
Included ColumnsClustered indexes include all columns of the table by defaultNon-clustered indexes can include additional columns to cover specific query needs.
Primary Key UsageBy default, a primary key constraint creates a clustered index.A primary key can create either a clustered or non-clustered index, depending on the configuration.

View

Materialized View

CTE

TechniqueAdvantagesDisadvantagesSuitable Use Case
CTEsTemporary scope within a single query
No storage or maintenance required
Improves readability by modularizing code
Limited to the query in which they are definedOrganizing complex queries, temporary transformations, and breaking down multi-step operations
Derived TablesSimplifies nested subqueries
No need for permanent storage
Harder to read/maintain for complex queries
List cannot be reused multiple times within a query
Quick, single-use transformations and aggregations within a query
ViewsReusable across queries
Can enhance security by restricting data access
Requires maintenance and can affect multiple queries
Complex views can impact performance
Long-term reusable logic and data access control

Performance

Query Optimization

  • Resources

    • Databases.Illuminated 4th Edition - Chapter 10 Relational Query Optimization
  • Heuristics

    1. Do selection as early as possible. If appropriate, use cascading of selects, commutativity of selects with projects, joins, and products; and distributivity of select over set union, set intersection, and set difference to move the selection as far down the query tree as possible.

    2. Use associativity of join, product, union, and intersection to rearrange relations in the query tree so that the selection operation that will produce the smallest table will be executed first.

    3. If a product appears as an argument for a selection, where the selection involves attributes of the tables in the product, transform the product into a join. If the selection involves attributes of only one of the tables in the product, apply the selection to that table first.

    4. Do projection early. Insertion of projections can be used to reduce the size of intermediate results. If appropriate, use cascading of projects; distributivity of projection over product and set union, intersection, and difference; and commutativity of selection and projection to move the projection as far down the query tree as possible. Examine all projections to see if some are unnecessary.

    5. If there is a sequence of selections and/or projections with the same argument, use commutativity or cascading to combine them into one selection, one projection, or a selection followed by a projection.

    6. If a subexpression appears more than once in the query tree, and the result it produces is not too large, compute it once and save it (materialized CTE). This technique is especially useful when querying on views, because the same subexpression must be used to construct the view each time.

General Tips

  • Reduce the size of the intermediate tables, (use LIMIT and OFFSET)
    • Perform selection early, only retrieve the needed rows.
    • Perform projection early, only retrieve the needed columns.
  • Use EXPLAIN to understand how the query is executed
  • EXPLAIN will only generate a plan for the given statement.
  • EXPLAIN ANALYZE will return the plan, run the query, and show the actual runtime and row count for each step.
  • When using EXPLAIN ANALYZE, because the query will be executed actually, be careful with INSERT, UPDATE, and DELETE.
  • Minimize locks to tables
  • Use materialized views to store the results of a query
  • Convert a big query of multiple tables into a series of smaller queries of single tables. Cache and join results on the application level to enhance scalability.
  • Replace reusable subqueries with CTE

Query benchmark