In this study, the effects of database normalization (first three normal forms - 1NF, 2NF, and 3NF) on the SQL query performance were put to the test.
SQL performance was assesed following:
- query (successfully) completion within a 30-minute timeout
- query duration (in seconds), for the completed queries.
Three (sub)schemas of the TPC-H benchmark database were created and populated (using the DBGen utility provided by the TPC-H) for:
- three database servers MS-SQL Server, MySQL, and PostgreSQL
- two scale factors (0.1 GB and 1GB)
- three normal forms (1NF, 2NF, and 3NF) of the TPC-H database.
For each scale factor:
- an initial 1,000 SQL query set was created (randomly) for the original (3NF) TPC-H schema
- this initial query set was adapted for execution in 2NF and 1NF (mainly by removing the unnecessary joins)
- all the queris were tweaked for execution in all three DBMSs (there are some differences among the three SQL dialects).
Directory queries containts the queries executed in PostgreSQL for 1NF, 2NF and 3NF and scale factors of 0.1GB and 1GB.
Query execution results (query completion and query duratiion) were collected using JMeter.
For each query, a 30-minute timeout was set (and controlled through JMeter).
Directory data containts the files with query parameters and query execution results.
- dbserver (MS-SQL Server, MySQL, PostgreSQL)
- scale_factor (for this study, the TPC-H database was populated with data of 0.1GB and 1GB)
- normal_form (1NF, 2NF, or 3NF)
- various parameters describing to query complexity (e.g., the number of joins, number of predicated in WHERE, etc.)
- query_completion (whether the query execution was completed during the 30 minute timeout)
- duration_sec (query execution time in seconds for each completed query).
- exploratory data analysis (see scripts 1a..., 1b..., and 1c... in directory scripts)
- statistical tests for analysing the association between query completion and the normal form (see scripts 2a... and 2b... in directory scripts)
- Machine Learning models based on random forest and extreme gradient boosting algoritms (see scripts 3a... and 3b... in directory scripts) for:
- prediction of odds of query_completion (classification)
- prediction of query duration_sec (scoring)
- estimating the importance (among all predictors) of the predictor normal_form in the outcome variability (query_completion and duration_sec_
- examination of feature effects on the outcome (using techniques of interpretable ML).
Results raise serious concerns about the conventional consensus on the performance gains incurred by the reduced number of table joins. Even for small-sized databases, the penalties due to the extra volume caused by redundancy associated with lower normal forms seem larger than the performance gains due to the reduced number of joins.