query_script_split

QueryScript Flow Control: split statement

SYNOPSIS

Single table operations, autodetect mode:

split (statement operating on single table)
  statement;
Multiple tables operations; explicit declaration of splitting table:
split (schema_name.table_name: statement operating on multiple tables)
  statement;
Statementless split, manual mode:
split (schema_name.table_name)
  statement;
Provide parameters to split operation:
split ({foo:bar}: statement)
  statement;
Discussion on the various flavors of split follows.

DESCRIPTION

split automagically breaks a query into subparts -- smaller chunks -- and works these in steps. It alleviates the load caused by large operations by turning them into smaller ones.

Consider the following query: we realize we must UPDATE a column on all rows:

split (UPDATE sakila.rental SET rental_date = rental_date + INTERVAL 6 HOUR)
  pass;
To execute a "normal" UPDATE of the above form would mean, assuming the table is very large, issuing a very large transaction. Such transaction could take hours to complete, by which time locks are accumulating, performance is degrading, and an attempt attempt at rollback can make for an even larger overhead.

A solution to such a problem is in the form of chunking: splitting the query into many smaller ones, each operating on a distinct group of rows. Not only the query, but the transaction itself (assuming AUTOCOMMIT=1) is broken into smaller transactions. Each such transaction is quick to complete, and has better chance at not making for any locks. One may choose to "rest" in between chunks, making for the availability of system resources.

The above split code does just that: it automagically breaks the query, by:

  • Analyzing the query, detecting, if possible, the table on which the split is done (with a multi-table query explicit instruction is required)
  • Analyzing the table, detecting best method of splitting it up into smaller parts. This is done by choosing the best UNIQUE KEY by which to work out the splitting process.
  • Rewriting the query so as to add a filtering condition placeholder: the expression which limits.
  • Determining the particular chunks by running over the actual rows, and issuing query on each chunk.
Thus, split works by selecting a particular table used by the SQL action statement, and by breaking it apart. This table is called the splitting table or the chunking table.

split looks like a looping construct: the statement gets executed once per each chunk of the original query. As with looping constructs, it respects, among others, the following statements:

  • break: terminate split execution: this means skipping any remaining chunks.
  • throttle: control loop execution time by sleeping in between iterations, time of sleep proportional to time of execution.

split defaults to chunks of 1,000 rows each. This can be configured via the size parameter.

Use of expanded variables is allowed within the split statement, as well as within the table definition and the query params. See EXAMPLES for more on this.

Magic variables

split introduces magic variables, which are available within a split iteration statement. These are:

  • $split_columns: comma separated list of columns by which the split algorithm splits the table.
  • $split_index: name of index used by this split operation (this index implies the names of columns as presented in $split_columns).
  • $split_min: minimum values of $split_columns. This is the starting point for the split operation.
  • $split_max: minimum values of $split_columns. This is the ending point for the split operation.
  • $split_range_start: per chunk, values of $split_columns indicating chunk's lower boundary.
  • $split_range_end: per chunk, values of $split_columns indicating chunk's upper boundary.
  • $split_step: iteration counter. Value is 1 for 1st iteration 2 for 2nd iteration, etc.
  • $split_rowcount: the number of rows affected by current split step.
  • $split_total_rowcount: the total number of rows affected so far by the split statement.
    This is an accumulation of $split_rowcount
  • $split_clause: the computed filtering clause. See following discussion.
  • $split_total_elapsed_time: total number of seconds elapsed since split operation started. This includes possible throttling or sleeping time.
  • $split_table_schema: schema for splitting table.
  • $split_table_name: the splitting table, by which split works out the smaller steps.

Flavors

Single table, autodetect mode

split can analyze statements involving single table, and automatically identify the referenced table. This makes for the simplest and cleanest syntax:

split (DELETE FROM sakila.rental WHERE rental_date < NOW() - INTERVAL 5 YEAR)
  SELECT $split_total_rowcount AS 'rows deleted so far';
create table world.City_dup like world.City;
split (insert into world.City_dup select * from world.City)
{
  throttle 2;
}

Multiple tables operations; explicit declaration of splitting table

When multiple tables are involved, the user must specify the splitting table:

split (sakila.film: UPDATE sakila.film, sakila.film_category SET film.rental_rate = film.rental_rate * 1.10 WHERE film.film_id = film_category.film_id AND film_category.category_id = 3)
  sleep 0.5;
The user is always allowed to specify the splitting table, even on single table operations:
split (sakila.rental: DELETE FROM sakila.rental WHERE rental_date < NOW() - INTERVAL 5 YEAR) {
  SELECT $split_total_rowcount AS 'rows deleted so far';
}
The user may still specify the splitting table even when the statement operates on a single table; in which case the specified table must indeed be the table being operated on. However, with statement operating on single table it is best to let split() figure out the table name.

Statementless split, manual mode

split can also accept just the splitting table, without a query. In this "manual mode" the table is being split and iterated, but no "actual" query is issued.

The loop construct, however, is iterated, and the magic variables are available. This allows the user to manually execute what would have been automatic, or otherwise act in unconventional manner. Consider:

split (sakila.rental) {
  DELETE FROM sakila.rental WHERE rental_date < NOW() - INTERVAL 5 YEAR AND :${split_statement};
}
In the above example, the user builds the splitting of the DELETE query manually.
In other use cases, the user may be interested in the metadata of the splitting process (see EXAMPLES). The metadata is provided by split's magic variables.

Providing parameters

split accepts parameters for operation. Normally, split does everything on its own, and does not require instruction. However, the user is given the choice of fine tuning split's operation by providing any combination of the following paramaters:

  • index: explicit name of index to use for splitting. The index must exist under given name and must be UNIQUE, or else an error is thrown.
    By default split chooses the best index for splitting the table without hint.
  • size: number of rows used in each step (minimum: 100; maximum: 50,000; default: 1,000)
  • start: starting point for the split operation. This is a comma delimited list of values (quoted on multiple values).
    The count of values must match the number of columns in the index picked by split. Thus, if split picks an AUTO_INCREMENT PRIMARY KEY for the operation, the value is merely a single integer.
    It makes most sense to use this parameters in conjunction with index.
    Compound values such as '2013-07-05,12,smith' are valid.
    Values do not have to strictly exist in the table: the split operation will begin as of these values, meaning starting at the first row with these exact values or larger.
    An error is thrown when the number of values specified does not match the number of columns covered by the splitting key.
    All data types are supported, including textual.
  • stop: ending point for the split operation. This is a comma delimited list of values (quoted on multiple values).
    The count of values must match the number of columns in the index picked by split. Thus, if split picks an AUTO_INCREMENT PRIMARY KEY for the operation, the value is merely a single integer.
    It makes most sense to use this parameters in conjunction with index.
    Compound values such as '2013-07-05,12,smith' are valid.
    Values do not have to strictly exist in the table: the split operation will run up to these values, meaning stopping at the first row with these exact values and excluding any larger.
    An error is thrown when the number of values specified does not match the number of columns covered by the splitting key.
    All data types are supported, including textual.
  • table: explicit table & schema name, when multiple statements are used. This parameter is not required, though allowed, when the statement operates on a single table.

In the following example, the rental table has an AUTO_INCREMENT PRIMARY KEY column called rental_id. The split operation starts with rental_id value of 1200, works till the end of table, and uses chunks of 500 rows at a time.

split ({start: 1200, size: 500} : DELETE FROM sakila.rental WHERE rental_date < NOW() - INTERVAL 5 YEAR)
  throttle 2;
In the above example, the user builds the splitting of the DELETE query manually.

The following example shows how to use the start parameter in the case of a compound key. The film_actor table has a compound PRIMARY KEY on (`actor_id`,`film_id`). The split operation starts with actor_id = 12 & film_id = 631.

split({start:'12,631'} : UPDATE sakila.film_actor set last_update=NOW())
  throttle 2;
In the above the values '12,631' are quoted: quotes must be added for multiple values; they are allowed but not required on single values (as can be seen on the previous example).
Since there is no limitation on the type of columns, it is possible that they are textual. The case where you would have a comma (",") in one of your start/stop column values is not supported; any comma is interpreted as a columns separator.

LIMITATIONS

split accepts these types of statements:

  • DELETE FROM table_name ...
  • DELETE FROM table_name USING <multi table syntax> ...
  • UPDATE table_name SET ...
  • UPDATE <multiple tables> SET ...
  • INSERT INTO some_table SELECT ... FROM <single or multiple tables> ...
  • REPLACE INTO some_table SELECT ... FROM <single or multiple tables> ...
  • SELECT ... FROM <multiple tables> ...

The following limitations apply to the split statement:

  • You should avoid using index hints on the splitting table.
  • At current, split does not accept the DELETE FROM tbl.* ... syntax. Use DELETE FROM tbl ... instead.
  • Statements with DISTICT will probably result with unexpected results. Statements with GROUP BY may also behave unexpectedly, depending on the statement.

split is furthermore subject to the following limitations:

  • A split statement cannot be nested within another split statement. To clarify, there is no problem with nesting other loop constructs such as while, foreach etc.
  • Aliasing the splitting table is not allowed.
  • For table autodetection to work, the statement must work on a single table only, and must not contain index hints, derived tables or subqueries. You may always choose to explicitly declare the splitting table using the split (schema_name.table_name: the statement) {...} variation.

EXAMPLES

Mike is resigned. Assign all mike's issues to Jon:

call run("
  split (update sakila.rental set staff_id = 2 where staff_id = 1)
    select $split_total_rowcount as 'processed issues';
");
  
+------------------+
| processed issues |
+------------------+
|              479 |
+------------------+
1 row in set (0.07 sec)

+------------------+
| processed issues |
+------------------+
|              983 |
+------------------+
1 row in set (0.09 sec)

...

+------------------+
| processed issues |
+------------------+
|             8040 |
+------------------+
1 row in set (0.40 sec)

Create denormalized table, fill it:

CREATE TABLE sakila.denormalized_film_category (
  film_id smallint unsigned NOT NULL,
  category_id tinyint unsigned NOT NULL,
  last_update timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  film_title varchar(255),
  category_name varchar(255),
  PRIMARY KEY (film_id,category_id)
);

split (sakila.film_category:
  INSERT INTO sakila.denormalized_film_category
  SELECT
    film_id,
    category_id,
    film_category.last_update,
    film.title,
    category.name
  FROM
    sakila.film_category
    JOIN sakila.film USING (film_id)
    JOIN sakila.category USING (category_id)
  ) 
{
  SELECT $split_total_rowcount AS 'total rows generated so far';
  throttle 2;
}
The above uses the sample sakila database. It just so happens that the number of rows in sakila.film_category is exactly 1,000, which makes for a single step.

Walk through a table (no particular statement to execute); watch the magic variables:

call run("
  split(sakila.film_actor) { 
    select 
      $split_step as step, $split_columns as columns, 
      $split_min as min_value, $split_max as max_value, 
      $split_range_start as range_start, $split_range_end as range_end
  }
");

+------+----------------------+-----------+-------------+-------------+------------+
| step | columns              | min_value | max_value   | range_start | range_end  |
+------+----------------------+-----------+-------------+-------------+------------+
|    1 | `actor_id`,`film_id` | '1','1'   | '200','993' | '1','1'     | '39','293' |
+------+----------------------+-----------+-------------+-------------+------------+

+------+----------------------+-----------+-------------+-------------+------------+
| step | columns              | min_value | max_value   | range_start | range_end  |
+------+----------------------+-----------+-------------+-------------+------------+
|    2 | `actor_id`,`film_id` | '1','1'   | '200','993' | '39','293'  | '76','234' |
+------+----------------------+-----------+-------------+-------------+------------+

+------+----------------------+-----------+-------------+-------------+-------------+
| step | columns              | min_value | max_value   | range_start | range_end   |
+------+----------------------+-----------+-------------+-------------+-------------+
|    3 | `actor_id`,`film_id` | '1','1'   | '200','993' | '76','234'  | '110','513' |
+------+----------------------+-----------+-------------+-------------+-------------+

+------+----------------------+-----------+-------------+-------------+-------------+
| step | columns              | min_value | max_value   | range_start | range_end   |
+------+----------------------+-----------+-------------+-------------+-------------+
|    4 | `actor_id`,`film_id` | '1','1'   | '200','993' | '110','513' | '146','278' |
+------+----------------------+-----------+-------------+-------------+-------------+

+------+----------------------+-----------+-------------+-------------+-------------+
| step | columns              | min_value | max_value   | range_start | range_end   |
+------+----------------------+-----------+-------------+-------------+-------------+
|    5 | `actor_id`,`film_id` | '1','1'   | '200','993' | '146','278' | '183','862' |
+------+----------------------+-----------+-------------+-------------+-------------+

+------+----------------------+-----------+-------------+-------------+-------------+
| step | columns              | min_value | max_value   | range_start | range_end   |
+------+----------------------+-----------+-------------+-------------+-------------+
|    6 | `actor_id`,`film_id` | '1','1'   | '200','993' | '183','862' | '200','993' |
+------+----------------------+-----------+-------------+-------------+-------------+

Use expanded variables as table & schema names in split statement. In this example we update all tables called rental in any database.

foreach($tbl, $scm: table like rental) {
  split(update :${scm}.:${tbl} set rental_date = rental_date + interval 1 day) {
    throttle 1;
  }
}

SEE ALSO

foreach, break, throttle, candidate_keys

AUTHOR

Shlomi Noach
 
common_schema documentation