Notes

Search

Search IconIcon to open search

Introduction to Databases

Last updated Feb 14, 2023

The simplest database we can have is a bunch of CSV files that we can manage with our own code. For every entity, we’ll store them in our own file (album.csv, artists.csv). The application will parse / read / write.

There are a lot of issues with this approach:

For these problems and others, we want to offload that complex logic. We use a Database Management System (DBMS). We can define, create, query, update, administer databases (we assume DB’s are on disk).

Early DBMs were hard to build and maintain. Tight coupling between logical and physical layers. You have to know what queries your app would execute before you deployed the database.

Edgar Codd proposed the relational model. The first paper came out on 1969 and the one that most people read was in 1970. He proposed:

There are other data models, which are ways we can describe the data in a database.

At the top we have the relational model. Than we have NoSQL, which are key/value, document/json, graph models, column/family. Array / Matrix databases are used in machine learning, we mostly use dataframes. There are hierachical and Network datamodels but are esoteric, obsolete/rare, and are seen in legacy applications. They are remnants of old software, don’t use them.

Relational Model

There are three components of the relational model:

A relation is an unordered set that contains the relationship of attributes that represent entities (ARtist(name, year, country).

A tuple is a set of attribute values (also known as its domain) in the relation.

An n-ary relation means a Table with $n$ columns.

A relation’s primary key uniquely identifies a single tuple. Some DBMSs automatically create an internal primary key if you don’t define one. Auto-generation of a unique integer primary keys:

A foreign key specifies that an attribute from one relation has to map to a tuple in another relation.

Data Manipulation Languages (DML)

How do we store and retrieve information from a database?

Relational Algebra

Edgar Codd proposed 7 fundamental operators in relational algebra to retrieve and manipulate tuples in a relation.

Each operator takes one or more relations as its inputs and outputs a new relation.

Select

The SELECT operator chooses a subset of the tuples from a relation that satisfies a selection predictate

The syntax: $ \sigma_{\textnormal{predictate}}(R). $

In SQL this would be:

1
SELECT * FROM R WHERE a_id = 'a2' OR b_id = '103';

Projection

Generate a relation with tuples that contains only the specified attributes.

Syntax: $ \Pi_{\textnormal{attributes}}(R) $

In SQL this would be:

1
SELECT b_id-100, aid FROM R WHERE a_id = 'a2';

Union

Generate a relation that contains all tuples that appear in either only one or both input relations.

Syntax: $ R \cup S $

In SQL this would be:

1
2
3
SELECT * FROM R
UNION ALL
SELECT * FROM S;

Intersection

Generate a relation that contains all tuples that appear in both input relations.

Syntax: $ R \cap S $

In SQL this would be:

1
2
3
SELECT * FROM R
INTERSECT
SELECT * FROM S;

Difference

Generate a relation that contains all tuples that appear in the first relation but not the second.

Syntax: $ R - S $

In SQL this would be:

1
2
3
SELECT * FROM R
EXCEPT
SELECT * FROM S;

Product

Generate a relation that contains all possible combinations of tuples from the input relation (cross product, cartesian product).

Syntax: $ R \times S $

This might sound stupid but it’s useful for something like testing where you want to test every possible combination of inputs / configurations.

Join

Generate a relation that contains all tuples that are a combination of two tuples (one from each input relation) with a common value for one or more atttributes.

Syntax: $ R \bowtie S $

In SQL this would be:

1
 SELECT * FROM R NATURAL JOIN S;

More operators

There are additional operators that people have added to the relational algebra.

Observation

Relational algebra still defines the high-level steps of how to compute a query. But this can still be inefficient because even though two queries might yield the same result, the performance of the two queries might be different because of the implementation details.

A better approach is to state the high-level answer that you want the DBMS to compute.

The relational model is indepnende ot any query language implementation.

Conclusions

Databases are ubiquitous.

Relational algebra defines the primitives for processing queries on a relational database.

We will see relational algebra again when we talk about query optimization + execution.