Introduction to Databases

Let’s explore the basics of databases and DBMS (Database Management Systems) with practical examples. Understand data models, flat file limitations, and how modern systems ensure data integrity and efficiency.

This article is based on the lessons I teach about DBMS at Fondazione PIN in Prato in 2025.

Introduction
#

Let’s start by answering the most important question, what’s a database? Think about it for a second, Are SQLite and PostgreSQL a database? And what about your phone’s contacts list?

A database is a collection of information that models some aspect of the real world.

So your phone’s contacts list, is a database. SQLite or PostgreSQL are not, they are a DBMS, a Database Management System, to be specific a relational one.

A Database Management System (DBMS) is software that allows applications to store and analyze information in a database. A generic DBMS supports the definition, creation, querying, updating and administration of a database according to a given data model. And it allows us to manage and organize that data in a structured and efficient way.

But is it really necessary to use a DBMS?

Let’s pretend we have a .csv file called where we collect some user responses from a questionary, our boss ask us to get calculate the average experience of every language.

How do we do it?

timestamp,language,experience
06/01/2025 11.39.03,Java,1
06/01/2025 11.40.33,Python,4
06/01/2025 11.40.41,C,1
06/01/2025 11.41.06,Java,2
06/01/2025 11.41.13,Java,3
06/01/2025 11.41.18,C,1
06/01/2025 11.41.25,Python,5
06/01/2025 11.41.32,Java,3
20/01/2025 9.46.58,Python,3
20/01/2025 9.47.17,Python,3
20/01/2025 9.47.21,C,4
20/01/2025 9.47.27,Python,3
20/01/2025 9.47.47,C,4
20/01/2025 9.47.47,Python,3
20/01/2025 9.47.47,Python,3
20/01/2025 9.47.50,Python,3
20/01/2025 9.47.50,C,1
20/01/2025 9.48.14,Python,3
20/01/2025 9.48.20,Python,3
20/01/2025 9.48.33,C,4
20/01/2025 9.48.45,Python,3
20/01/2025 9.48.45,Python,3
20/01/2025 9.48.47,Java,3
20/01/2025 9.48.47,Java,2
20/01/2025 9.49.08,Python,2
20/01/2025 9.49.19,Python,1
20/01/2025 9.49.26,Python,2
20/01/2025 9.49.28,Python,3
20/01/2025 9.49.37,Python,2

We could use Python:

import csv

with open("responses.csv") as file:
    reader = csv.DictReader(file)
    counts = {}
    total_experience = {}

	for row in reader:
        favorite = row["language"]
        if favorite in counts:
            counts[favorite] += 1
            total_experience[favorite] += int(row["experience"])
        else:
            counts[favorite] = 1
            total_experience[favorite] = int(row["experience"])

for favorite in sorted(counts, key=counts.get, reverse=True):
    print(
        f"{favorite}: {counts[favorite]} | experience {total_experience[favorite]/counts[favorite]}"
    )

Or we could achieve the same result using a DBMS like SQLite.

sqlite3 response.db

Let’s create a table

 CREATE TABLE response (timestamp text, language text, experience int);

Import csv file

.mode csv

.import --skip 1 responses.csv response

What’s the average experience of every language?

SELECT language, count(*) AS n, AVG(experience) as experience FROM response GROUP BY language ORDER BY n DESC;

We achieve the same result, so why bother? Let’s talk about a flat file format like csv.

Flat file
#

The information is saved comma separated (CSV) and the files are managed by our application.

use one file per entity
the application must loop through all the files each time it wants to read or update a record.

Artist

name, year, country
"Linkin Park", 1996, "USA"
"Adele", 2006, "UK"
"Harry Styles", 2010, "UK"

Album

name, artist, year
"Hybrid Theory", "Linking Park", 2000
"21", "Adele", 2011
"Fine Line", "Harry Styles", 2019

If we had something like this we could search in what year Hybrid Theory was released.

for row in reader:
	favorite = row["name"]
	if favorite == "Hybrid Theory":
		print(int(row["year"]))

Correct right? Or did you spot the error?

Linkin Park is written without the final ‘g’, our record inside the Albums’ file is wrong.

What are the problems with this implementation?

Data integrity:

How can we be sure that the artist is the same for each album?
What if someone changes the year to an invalid value?
How do we handle multiple artists per album?
How do we handle deletions?

Implementation:

How do we find a specific record?
What if multiple applications need to access the same data set? What if they run on different machines?

Durability:

What if our machine crashes while our program is writing/updating a record?
How can we replicate it across multiple machines to increase availability in case of crashes?

Database Management System
#

We already define what’s a DBMS, as a software that allows applications to store and analyze information in a database. A DBMS is responsible for:

Managing all access to the database, ensuring that only authorized users can access the data.
Applying authorization controls and validation procedures, to ensure the integrity and security of the information.
Providing a standard interface for data manipulation.
Managing concurrency, avoiding conflicts when multiple users access and modify data at the same time.
Ensuring data persistence and recovery, using backup and recovery techniques in the event of failures or errors.
Optimizing performance, using indexes and caching techniques to speed up data access.

Data Model
#

While a DBMS handles the practical tasks of storing, securing, and retrieving data, a data model is a set of concepts for describing data in a database.

Think of a data model as the set of ideas or rules that helps a database know how to organize and understand all the information it holds. It’s like the blueprint for data, outlining concepts like relationships and data types. How that data is organized, that’s a schema. So a schema is a specific description of how a particular collection of data fits into that blueprint—it’s the detailed plan showing exactly what data is stored and how it’s structured.

There are different types of data model:

Type	Description
Relational	e.g. MySQL, PostgreSQL, SQL Server
Key/value	Used for caching, e.g. Redis
Graph	NoSQL
Document / JSON / XML / Object	NoSQL, such as MongoDB
Wide-Column / Column-family	NoSQL, Google BigTable or Cassandra
Array (vector, matrix, tensor)	Machine Learning
Hierarchical	Obsolete
Network	Obsolete
Semantic	Obsolete
Entity-Relation	Obsolete

During the following lessons we will take a look at the Relational one using SQLite and also key/value with Redis and Document with MongoDB.

Introduction#

Flat file#

Database Management System#

Data Model#

Introduction
#

Flat file
#

Database Management System
#

Data Model
#