Optimizing SQL – Part 1
Design approaches for multi-lingual data
In this example, we’re assuming a dataset consisting of 6 multi-lingual fields. Each field holds strings with an average length of 200 characters in the 4 locales en, de, zh and ru. A fallback function is implemented using the CASE expression (doesn’t add much overhead and is faster than fetching data in two locales and implementing the fallback at application level).
This is how your queries will look like, if you use the “dictionary approach” (master data in dict_books and locale specific strings in the general table dict_dictionary, which is shared between all master tables):
SELECT book_id, CASE WHEN loc0.dict_value IS NOT NULL THEN loc0.dict_value ELSE org0.dict_value END AS trans_author_id, CASE WHEN loc1.dict_value IS NOT NULL THEN loc1.dict_value ELSE org1.dict_value END AS trans_title_id, CASE WHEN loc2.dict_value IS NOT NULL THEN loc2.dict_value ELSE org2.dict_value END AS trans_subtitle_id, CASE WHEN loc3.dict_value IS NOT NULL THEN loc3.dict_value ELSE org3.dict_value END AS trans_abstract_id, CASE WHEN loc4.dict_value IS NOT NULL THEN loc4.dict_value ELSE org4.dict_value END AS trans_copyright_id, CASE WHEN loc5.dict_value IS NOT NULL THEN loc5.dict_value ELSE org5.dict_value END AS trans_body_id FROM dict_books a JOIN dict_dictionary org0 ON a.author_id = org0.dict_id AND org0.dict_locale = 'en' LEFT JOIN dict_dictionary loc0 ON a.author_id = loc0.dict_id AND loc0.dict_locale = 'ru' JOIN dict_dictionary org1 ON a.title_id = org1.dict_id AND org1.dict_locale = 'en' LEFT JOIN dict_dictionary loc1 ON a.title_id = loc1.dict_id AND loc1.dict_locale = 'ru' JOIN dict_dictionary org2 ON a.subtitle_id = org2.dict_id AND org2.dict_locale = 'en' LEFT JOIN dict_dictionary loc2 ON a.subtitle_id = loc2.dict_id AND loc2.dict_locale = 'ru' JOIN dict_dictionary org3 ON a.abstract_id = org3.dict_id AND org3.dict_locale = 'en' LEFT JOIN dict_dictionary loc3 ON a.abstract_id = loc3.dict_id AND loc3.dict_locale = 'ru' JOIN dict_dictionary org4 ON a.copyright_id = org4.dict_id AND org4.dict_locale = 'en' LEFT JOIN dict_dictionary loc4 ON a.copyright_id = loc4.dict_id AND loc4.dict_locale = 'ru' JOIN dict_dictionary org5 ON a.body_id = org5.dict_id AND org5.dict_locale = 'en' LEFT JOIN dict_dictionary loc5 ON a.body_id = loc5.dict_id AND loc5.dict_locale = 'ru'
First, I would like to mention that inserting data in those tables is a mess, because you’ll need 25 INSERTs for each dataset (compared to 5 for the relational design below).
The query for the “true relational design” with one table (books) for the master data and one for the locale specific strings (books_locale) looks like:
SELECT a.book_id, CASE WHEN c.author IS NOT NULL THEN c.author ELSE b.author END AS author, CASE WHEN c.title IS NOT NULL THEN c.title ELSE b.title END AS title, CASE WHEN c.subtitle IS NOT NULL THEN c.subtitle ELSE b.subtitle END AS subtitle, CASE WHEN c.abstract IS NOT NULL THEN c.abstract ELSE b.abstract END AS abstract, CASE WHEN c.copyright IS NOT NULL THEN c.copyright ELSE b.copyright END AS copyright, CASE WHEN c.body IS NOT NULL THEN c.body ELSE b.body END AS body FROM books a JOIN books_locale b ON a.book_id = b.book_id AND b.locale = 'en' LEFT JOIN books_locale c ON a.book_id = c.book_id AND c.locale = 'ru';
Now, wonder what’s the difference in execution time? The first query takes 0.30s and the second just 0.06s (1000 rows). If do the query with 10,000 books instead, the execution time is 3.07s against 0.56s. Conclusion: The relational design is about 5.5 times faster than the dictionary type of design.
There’s a third option, which is using a single table with a primary key consisting of book_id and book_locale. The query would look like
SELECT b.book_id, CASE WHEN c.author IS NOT NULL THEN c.author ELSE b.author END AS author, CASE WHEN c.title IS NOT NULL THEN c.title ELSE b.title END AS title, CASE WHEN c.subtitle IS NOT NULL THEN c.subtitle ELSE b.subtitle END AS subtitle, CASE WHEN c.abstract IS NOT NULL THEN c.abstract ELSE b.abstract END AS abstract, CASE WHEN c.copyright IS NOT NULL THEN c.copyright ELSE b.copyright END AS copyright, CASE WHEN c.body IS NOT NULL THEN c.body ELSE b.body END AS body FROM books b LEFT JOIN books c ON b.book_id = c.book_id AND c.book_locale = 'ru' WHERE b.book_locale = 'en'
And the execution time for 10.000 entries is… just 0.38s! (0.05s for 1.000) That is below the relational design and makes me a bit angry, but anyways,… Inserting data is also faster, though not that safe because you can get in trouble with foreign keys. I think it’s time for database vendors to integrate multi-lingual features in their products. Like they did with XML and GIS before.
Why use subselects?
Compare those two queries:
SELECT a.article_id, article_name, b.effective_from_date, b.price FROM articles a JOIN price_history b WHERE a.article_id = b.article_id AND b.effective_from_date = ( SELECT max(effective_from_date) FROM price_history c WHERE a.article_id = c.article_id AND c.effective_from_date <= NOW() ) ORDER BY a.article_id; SELECT a.article_id, article_name, b.effective_from_date, b.price FROM articles a JOIN price_history b ON a.article_id = b.article_id AND b.effective_from_date <= NOW() LEFT JOIN price_history c ON b.article_id = c.article_id AND b.effective_from_date < c.effective_from_date AND c.effective_from_date <= NOW() WHERE c.effective_from_date IS NULL ORDER BY a.article_id;
The first query takes about 0.27s and the second 6.26s (1000 articles with 100 prices in the history table each) – this is why subselects are often not that bad!