Database Normal Forms and SQL Query Performance

Goal

In this study, the effects of database normalization (first three normal forms - 1NF, 2NF, and 3NF) on the SQL query performance were put to the test.

SQL performance was assesed following:

query (successfully) completion within a 30-minute timeout
query duration (in seconds), for the completed queries.

Research Design

The database

Three (sub)schemas of the TPC-H benchmark database were created and populated (using the DBGen utility provided by the TPC-H) for:

three database servers MS-SQL Server, MySQL, and PostgreSQL
two scale factors (0.1 GB and 1GB)
three normal forms (1NF, 2NF, and 3NF) of the TPC-H database.

The SQL query set

For each scale factor:

an initial 1,000 SQL query set was created (randomly) for the original (3NF) TPC-H schema
this initial query set was adapted for execution in 2NF and 1NF (mainly by removing the unnecessary joins)
all the queris were tweaked for execution in all three DBMSs (there are some differences among the three SQL dialects).

Directory queries containts the queries executed in PostgreSQL for 1NF, 2NF and 3NF and scale factors of 0.1GB and 1GB.

Data collection

Query execution results (query completion and query duratiion) were collected using JMeter.

For each query, a 30-minute timeout was set (and controlled through JMeter).

Directory data containts the files with query parameters and query execution results.

Variables

dbserver (MS-SQL Server, MySQL, PostgreSQL)
scale_factor (for this study, the TPC-H database was populated with data of 0.1GB and 1GB)
normal_form (1NF, 2NF, or 3NF)
various parameters describing to query complexity (e.g., the number of joins, number of predicated in WHERE, etc.)
query_completion (whether the query execution was completed during the 30 minute timeout)
duration_sec (query execution time in seconds for each completed query).

Method

exploratory data analysis (see scripts 1a..., 1b..., and 1c... in directory scripts)
statistical tests for analysing the association between query completion and the normal form (see scripts 2a... and 2b... in directory scripts)
Machine Learning models based on random forest and extreme gradient boosting algoritms (see scripts 3a... and 3b... in directory scripts) for:
- prediction of odds of query_completion (classification)
- prediction of query duration_sec (scoring)
- estimating the importance (among all predictors) of the predictor normal_form in the outcome variability (query_completion and duration_sec_
- examination of feature effects on the outcome (using techniques of interpretable ML).

Results

Results raise serious concerns about the conventional consensus on the performance gains incurred by the reduced number of table joins. Even for small-sized databases, the penalties due to the extra volume caused by redundancy associated with lower normal forms seem larger than the performance gains due to the reduced number of joins.

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
data		data
figures		figures
queries		queries
scripts		scripts
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Database Normal Forms and SQL Query Performance

Goal

Research Design

The database

The SQL query set

Data collection

Variables

Method

Results

About

Uh oh!

Releases

Packages

Languages

marinfotache/normal_forms_and_sql_query_performance

Folders and files

Latest commit

History

Repository files navigation

Database Normal Forms and SQL Query Performance

Goal

Research Design

The database

The SQL query set

Data collection

Variables

Method

Results

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages