Thursday, October 6, 2011

MySQL Schema and Index Optimization Example

Transaction logging is an essential activity in the telco industry. These records are important for resolving customer disputes and for technical support. This post discusses MySQL schema and index optimization on a table that stores a huge volume of transaction records (20 million+ per day in a single table).



Gathered Information

1. Shown below is the table schema for the transaction log table.
CREATE TABLE `transactionlog` (
  `datetime` varchar(45) NOT NULL DEFAULT '',
  `version` int(10) DEFAULT NULL,
  `category` varchar(45) DEFAULT NULL,
  `src_acc_name` varchar(45) DEFAULT NULL, 
  `dst_acc_name` varchar(45) DEFAULT NULL,
  `ib_msg_id` varchar(100) DEFAULT NULL,
  `ob_msg_id` varchar(100) DEFAULT NULL,
  `receiver_mpno` varchar(45) DEFAULT NULL, 
  `sender_mpno` varchar(45) DEFAULT NULL,   
  `country_code` varchar(5) DEFAULT NULL,
  `operator_id` int(10) DEFAULT NULL, 
  `submitted_datetime` varchar(45) NOT NULL DEFAULT '',
 
  ----- WITH MORE COLUMNS ----
 
  KEY `index1` (`submitted_datetime`(20),`category`),
  KEY `index2` (`submitted_datetime`(20),`receiver_mpno`(20)),
  KEY `index3` (`submitted_datetime`(20),`src_acc_name`,`category`),
  KEY `index4` (`submitted_datetime`(20),`operator_id`),
  KEY `index5` (`ib_msg_id`(25))
) ENGINE=MyISAM DEFAULT CHARSET=latin1

2. Shown below are the SQL statements that use the table above, as extracted from the code.
SELECT * FROM transactionlog
  LEFT JOIN cfg.operator ON (transactionlog.operator_id  = cfg.operator.operator_id)
  LEFT JOIN cfg.country ON (transactionlog.country_code  = cfg.country.country_code)
  WHERE submitted_datetime >= '2009-09-10 20:00:00'
    AND submitted_datetime <= '2009-09-10 22:00:00'
    AND category IN ('Category 1', 'Category 2', 'Category 3') ORDER BY datetime ASC;

SELECT * FROM transactionlog
  LEFT JOIN cfg.operator ON (transactionlog.operator_id  = cfg.operator.operator_id)
  LEFT JOIN cfg.country ON (transactionlog.country_code  = cfg.country.country_code)
  WHERE submitted_datetime >= '2009-09-10 20:00:00'
    AND submitted_datetime <= '2009-09-10 22:00:00'
    AND (receiver_mpno = '69012641245' OR sender_mpno = '69012641245')
  ORDER BY datetime ASC;

SELECT * FROM transactionlog
  LEFT JOIN cfg.operator ON (transactionlog.operator_id  = cfg.operator.operator_id)
  LEFT JOIN cfg.country ON (transactionlog.country_code  = cfg.country.country_code)
  WHERE submitted_datetime >= '2009-09-10 20:00:00'
    AND submitted_datetime <= '2009-09-10 22:00:00'
    AND (src_acc_name = 'carrier11' OR dst_acc_name = 'carrier11')
    AND category IN ('Category 1', 'Category 2', 'Category 3')
  ORDER BY datetime ASC;

SELECT * FROM transactionlog
  LEFT JOIN cfg.operator ON (transactionlog.operator_id  = cfg.operator.operator_id)
  LEFT JOIN cfg.country ON (transactionlog.country_code  = cfg.country.country_code)
  WHERE submitted_datetime >= '2009-09-10 20:00:00'
    AND submitted_datetime <= '2009-09-10 22:00:00'
    AND transactionlog.operator_id = 3366
  ORDER BY datetime ASC;

SELECT * FROM transactionlog
  LEFT JOIN cfg.operator ON (transactionlog.operator_id  = cfg.operator.operator_id)
  LEFT JOIN cfg.country ON (transactionlog.country_code  = cfg.country.country_code)
  WHERE submitted_datetime >= '2009-09-10 20:00:00'
    AND submitted_datetime <= '2009-09-10 22:00:00'
    AND (ib_msg_id = 'message03400123123413' OR ob_msg_id = 'message03400123123413')
  ORDER BY datetime ASC;

Analysis and Solution

OK, let's start with the table schema definition. Below are a few issues I spotted.

1. datetime & submitted_datetime

Problem
Both columns are varchar(45), which doesn't make sense. Why store date/time information in a varchar rather than a datetime column? Based on http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html, the datetime data type consumes 8 bytes (the current varchar(45) uses up to 46 bytes). In addition, index creation and index filtering on a datetime column are definitely faster than on a varchar.

Solution
Change the data type of both columns to datetime.
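
A minimal sketch of the conversion, assuming the existing varchar values are already in 'YYYY-MM-DD HH:MM:SS' format (as the queries above suggest):

-- The old prefix indexes on submitted_datetime must be dropped first,
-- since prefix lengths are not allowed on a datetime column (they are
-- replaced by the single composite index proposed later anyway).
ALTER TABLE transactionlog
  DROP INDEX `index1`, DROP INDEX `index2`,
  DROP INDEX `index3`, DROP INDEX `index4`;

-- Convert both columns in one ALTER so the table is rebuilt only once.
-- Values that cannot be parsed would end up as '0000-00-00 00:00:00'.
ALTER TABLE transactionlog
  MODIFY COLUMN `datetime` datetime NOT NULL,
  MODIFY COLUMN `submitted_datetime` datetime NOT NULL;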

2. Category as constant

Problem
After studying the code, I found that the categories are a fixed set of constant strings, whereas in the code those categories are identified by integers.

Solution
Change the data type of category from varchar(45) to tinyint(3). This raises another issue: data is inserted into this table with an INSERT ... SELECT statement, and the category column of the source table is also a varchar. To solve this, I recommended creating a MySQL stored routine that maps the string to an integer and using it in the INSERT ... SELECT statement (a sketch follows below). The pro of this method is that no code change is needed; the con is that whenever a new category is added, the routine must be updated.
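
A sketch of such a mapping, using a stored function so it can be called inline in the SELECT. The category names and IDs, the function name category_to_id, and the source table name sourcelog are placeholders and would need to match the real constants and tables:

DELIMITER //
CREATE FUNCTION category_to_id(cat VARCHAR(45)) RETURNS TINYINT
DETERMINISTIC
BEGIN
  -- Hypothetical mapping; extend with the real category constants.
  RETURN CASE cat
    WHEN 'Category 1' THEN 1
    WHEN 'Category 2' THEN 2
    WHEN 'Category 3' THEN 3
    ELSE 0
  END;
END//
DELIMITER ;

-- Used inside the existing INSERT ... SELECT (column list abbreviated,
-- sourcelog is a placeholder for the real source table):
INSERT INTO transactionlog (submitted_datetime, category /* , ... */)
SELECT submitted_datetime, category_to_id(category) /* , ... */
FROM sourcelog;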

Next, let's look into the indexes on this table. Below are a few issues I spotted.

3. Table index definition

Problem
  1. From the provided SQL, ob_msg_id and dst_acc_name appear in the WHERE clauses but are not covered by any index.
  2. From the provided SQL, every query follows the same pattern: it starts with a range on submitted_datetime and then adds other criteria. If that is the case, there is no need to create so many indexes to optimize the filtering. With the correct column sequence, one index is enough; please refer to http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html
Solution
Create an index as follows, then modify the code that generates the SQL WHERE clauses so that the columns used follow the index sequence below.

KEY `index1` (`submitted_datetime`, `category`, `receiver_mpno`(20), `sender_mpno`(20), `src_acc_name`, `dst_acc_name`, `operator_id`, `ib_msg_id`(25), `ob_msg_id`(25), `datetime`) USING BTREE
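
Since a composite index is only usable through its leftmost prefix, one way to check that a rewritten query actually picks up the new index is EXPLAIN. A simplified example (joins omitted; the plan shown will depend on the actual data):

EXPLAIN SELECT * FROM transactionlog
  WHERE submitted_datetime >= '2009-09-10 20:00:00'
    AND submitted_datetime <= '2009-09-10 22:00:00'
    AND category IN (1, 2, 3);
-- The "key" column of the plan should report index1, and "rows" gives a
-- rough estimate of how many rows will be examined for the time range.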




Proposed Solution Summary

Table Schema
CREATE TABLE `transactionlog` (
  `datetime` datetime NOT NULL,
  `version` int(10) DEFAULT 0,
  `category` tinyint(3) DEFAULT 0,
  `src_acc_name` varchar(45) DEFAULT NULL,
  `dst_acc_name` varchar(45) DEFAULT NULL,
  `ib_msg_id` varchar(100) DEFAULT NULL,
  `ob_msg_id` varchar(100) DEFAULT NULL,
  `receiver_mpno` varchar(45) DEFAULT NULL,
  `sender_mpno` varchar(45) DEFAULT NULL,
  `country_code` varchar(5) DEFAULT NULL,
  `operator_id` int(10) DEFAULT 0,
  `submitted_datetime` datetime NOT NULL,

  ----- WITH MORE COLUMNS ----

  KEY `index1` (`submitted_datetime`, `category`, `receiver_mpno`(20), `sender_mpno`(20), `src_acc_name`, `dst_acc_name`, `operator_id`, `ib_msg_id`(25), `ob_msg_id`(25), `datetime`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=latin1
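
If altering the existing table (20M+ rows) in place is too slow or risky, a copy-and-rename migration is an alternative. A rough sketch, assuming the definition above has been created as a new table named transactionlog_new (a placeholder name) and reusing the hypothetical category_to_id() function sketched earlier; the column list is abbreviated:

-- Copy the old rows, converting the datetime strings and category names.
-- Assumes the varchar values are in 'YYYY-MM-DD HH:MM:SS' format.
INSERT INTO transactionlog_new (`datetime`, `category`, `submitted_datetime` /* , ... */)
SELECT STR_TO_DATE(`datetime`, '%Y-%m-%d %H:%i:%s'),
       category_to_id(`category`),
       STR_TO_DATE(`submitted_datetime`, '%Y-%m-%d %H:%i:%s') /* , ... */
FROM transactionlog;

-- Swap the tables once the copy has been verified.
RENAME TABLE transactionlog TO transactionlog_old,
             transactionlog_new TO transactionlog;
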
Select Statements
SELECT * FROM transactionlog
  LEFT JOIN cfg.operator ON (transactionlog.operator_id  = cfg.operator.operator_id)
  LEFT JOIN cfg.country ON (transactionlog.country_code  = cfg.country.country_code)
  WHERE submitted_datetime >= '2009-09-10 20:00:00'
    AND submitted_datetime <= '2009-09-10 22:00:00'
    AND category IN (1, 2, 3) ORDER BY datetime ASC;

SELECT * FROM transactionlog
  LEFT JOIN cfg.operator ON (transactionlog.operator_id  = cfg.operator.operator_id)
  LEFT JOIN cfg.country ON (transactionlog.country_code  = cfg.country.country_code)
  WHERE submitted_datetime >= '2009-09-10 20:00:00'
    AND submitted_datetime <= '2009-09-10 22:00:00'
    AND category IN (1, 2, 3)
    AND (src_acc_name = 'carrier11' OR dst_acc_name = 'carrier11')
  ORDER BY datetime ASC;

Optimization Testing and Its Results

Environment

Platform: Virtual Machine 32bit
CPU: 1 core, Intel(R) Core(TM)2 Duo CPU     T6670  @ 2.20GHz
Memory: 1024M
OS version : CentOS release 5.4 (Final)
MySQL version: Ver 14.14 Distrib 5.5.15, for Linux (i686) using readline 5.1
MySQL config: installation default

Result


Task (transactionlog table contains 20M records initially)    Before (s)   After (s)   Improvement (%)
Select insert 32K+ records and insert into transactionlog table   422.52        5.97            98.59
Search by category                                                 318.51       91.11            71.39
Search by receiver/sender number                                   314.77       74.38            76.36
Search by account name                                             326.67       96.09            70.58
Search by operator id                                              327.49       79.79            75.64
Search by message ID                                               319.46       86.92            72.79



Conclusion

Bravo, the results generally improve by over 70%. As shown in the table above, more indexes (or separate indexes) do not necessarily mean better performance. Maintaining indexes is very expensive: it eats up CPU time and I/O wait time. Secondly, varchar is convenient but poor for indexing; it is slower to compare and bigger in size.
