oak-chunk-update

NAME

oak-chunk-update: Perform a long, non-blocking UPDATE/DELETE operation in automatically managed small chunks

SYNOPSIS

Delete rows from world.City where population is small:
oak-chunk-update --database=world --execute="DELETE FROM City WHERE Population < 10000000 AND OAK_CHUNK(City)"
Same as above, provide fully qualified table names, use 100 rows chunk size:
oak-chunk-update --execute="DELETE FROM world.City WHERE Population < 10000000 AND OAK_CHUNK(world.City)" --chunk-size=100
Same as above, sleep for 200 milliseconds between chunks, verbose:
oak-chunk-update --database=world --execute="DELETE FROM City WHERE Population < 10000000 AND OAK_CHUNK(City)" --chunk-size=100 --sleep=200 --verbose
Same as above, quit when no rows are affected:
oak-chunk-update --database=world --execute="DELETE FROM City WHERE Population < 10000000 AND OAK_CHUNK(City)" --chunk-size=100 --sleep=200 --verbose --terminate-on-not-found
Perform a multi-table UPDATE operation, choose world.City as chunking table, sleep twice the query time:
oak-chunk-update --execute="UPDATE City, Country SET City.District = 'unknown' WHERE City.CountryCode = Country.Code AND Country.Continent = 'Africa' AND OAK_CHUNK(City)" --sleep-ratio=2
Same as above, avoid using INFORMATION_SCHEMA by specifying chunk key:
oak-chunk-update --execute="UPDATE City, Country SET City.District = 'unknown' WHERE City.CountryCode = Country.Code AND Country.Continent = 'Africa' AND OAK_CHUNK(City)" --force-chunking-column=id:integer
Same as above, specify constant start position and end position:
oak-chunk-update --execute="UPDATE City, Country SET City.District = 'unknown' WHERE City.CountryCode = Country.Code AND Country.Continent = 'Africa' AND OAK_CHUNK(City)" --force-chunking-column=id:integer --start-with=1074 --end-with=2990
Same as above, specify query start position, skip locks:
oak-chunk-update --execute="UPDATE City, Country SET City.District = 'unknown' WHERE City.CountryCode = Country.Code AND Country.Continent = 'Africa' AND OAK_CHUNK(City)" --force-chunking-column=id:integer --start-with="SELECT MIN(id) FROM world.City WHERE District=''" --skip-lock-tables
Provide connection parameters. Prompt for password:
oak-chunk-update --user=root --ask-pass --socket=/tmp/mysql.sock --database=world --execute="DELETE FROM City WHERE Population < 10000000 AND OAK_CHUNK(City)"
Use a defaults file for connection parameters:
oak-chunk-update --defaults-file=/home/myuser/.my-oak.cnf --database=world --execute="DELETE FROM City WHERE Population < 10000000 AND OAK_CHUNK(City)"

DESCRIPTION

This utility allows for splitting long running or non-indexed UPDATE/DELETE operations, optionally multi-table ones. Long running updating queries are common. Some examples:

  • Purging old table records (e.g. purging old logs).
  • Updating a column across an entire table.
  • Deleting or updating a small number of rows, but with a non-indexed search condition.
  • Inserting into one table aggregated values from other tables.

oak-chunk-update splits such long running tasks into small chunks, and allows for sleep time between chunks. This results in shorter lock times, better replication responsiveness (less lag), and less stress on system resources (CPU, I/O). To do so, the utility relies on a UNIQUE KEY on a given table, which is used for the splitting process. Note that the query may involve multiple (JOINed) tables, in which case one of the tables must have a UNIQUE KEY. The utility therefore requires:

  • At least one of the tables participating in the UPDATE/DELETE query has a UNIQUE KEY.
  • The query must indicate to the utility the table for which the UNIQUE KEY is used.

The query must include a hint in the form OAK_CHUNK(table_name) or OAK_CHUNK(database_name.table_name). See SYNOPSIS for examples. The table indicated in the OAK_CHUNK clause is the table which must contain a UNIQUE KEY, which is used for splitting the query. The utility rewrites the query by iteratively replacing the OAK_CHUNK(...) clause with appropriate values from the UNIQUE KEY. In case more than one UNIQUE KEY is available on the table, the utility chooses in the following order:

  • If there is a PRIMARY KEY, it is the selected key.
  • A key whose first column is non-textual is preferable to a key whose first column is textual.
  • A key with a smaller numeric data type takes precedence.
  • A key with fewer columns takes precedence.
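This preference order can be sketched as a Python sort key. This is an illustration only; the dictionary field names below are hypothetical and do not reflect the utility's actual internals:

```python
# Hypothetical sketch of the key-preference order described above.
# Each candidate key is represented as a dict; field names are illustrative.
def key_preference(key):
    return (
        0 if key["name"] == "PRIMARY" else 1,     # a PRIMARY KEY wins outright
        1 if key["first_column_textual"] else 0,  # non-textual first column preferred
        key["first_column_byte_size"],            # smaller numeric data type preferred
        key["num_columns"],                       # fewer columns preferred
    )

candidate_keys = [
    {"name": "uq_name", "first_column_textual": True,
     "first_column_byte_size": 255, "num_columns": 1},
    {"name": "PRIMARY", "first_column_textual": False,
     "first_column_byte_size": 4, "num_columns": 1},
]
chosen = min(candidate_keys, key=key_preference)
```

Here the PRIMARY KEY is chosen over the textual unique key, matching the first rule above.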

The tool auto-selects the chunking key by querying INFORMATION_SCHEMA. Reading from INFORMATION_SCHEMA can be risky on large, busy servers. It is possible to instruct the tool to use a specific column instead, by adding --force-chunking-column.
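The iterative rewrite of the OAK_CHUNK(...) clause can be sketched roughly as follows. This is a simplified illustration assuming a single-column integer chunking key (here named ID, a hypothetical column); the real utility also handles multi-column and textual keys:

```python
def rewrite_chunk_query(query, table, key_column, lower, upper):
    """Replace the OAK_CHUNK(table) hint with a range condition on the
    chunking key, restricting the statement to rows in [lower, upper)."""
    range_condition = "(%s.%s >= %d AND %s.%s < %d)" % (
        table, key_column, lower, table, key_column, upper)
    return query.replace("OAK_CHUNK(%s)" % table, range_condition)

query = "DELETE FROM City WHERE Population < 10000000 AND OAK_CHUNK(City)"
chunk_query = rewrite_chunk_query(query, "City", "ID", 1, 1001)
# chunk_query is now:
# DELETE FROM City WHERE Population < 10000000 AND (City.ID >= 1 AND City.ID < 1001)
```

Each iteration advances the range boundaries by the chunk size and executes the rewritten statement.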

Limiting the number of scanned rows is done via --start-with and --end-with. Neither, either, or both can be specified, each as a constant or as a SELECT query returning a single integer value. In the latter case it is required that the chunking key is itself an integer. --terminate-on-not-found is another way of limiting scanned rows, by instructing the tool to quit the first time a chunk affects no rows. This works well for monotonic columns, e.g. TIMESTAMP columns with increasing values.
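The chunking loop, including the behavior of --start-with, --end-with, --terminate-on-not-found, and --sleep, can be sketched as follows. This is a simplified model, not the utility's actual code; execute_chunk stands in for running the rewritten statement and returning the affected row count:

```python
import time

def run_in_chunks(execute_chunk, start, end, chunk_size,
                  terminate_on_not_found=False, sleep_millis=0):
    """Walk the chunking key range [start, end], chunk_size values at a time.
    execute_chunk(lower, upper) runs one chunk and returns affected rows."""
    lower = start
    while lower <= end:
        upper = min(lower + chunk_size, end + 1)
        affected = execute_chunk(lower, upper)
        if terminate_on_not_found and affected == 0:
            return  # first chunk with no effect: quit early
        if sleep_millis:
            time.sleep(sleep_millis / 1000.0)
        lower = upper
```

With --sleep-ratio, the fixed sleep would instead be proportional to the measured chunk execution time.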

OPTIONS

--ask-pass

Prompt for password.

-c CHUNK_SIZE, --chunk-size=CHUNK_SIZE

Number of rows to act on per chunk (default: 1000). 0 means all rows are updated in a single operation. The lower the number, the shorter any locks are held, but the more operations are required and the longer the total running time.

-d DATABASE, --database=DATABASE

Database name (required unless table is fully qualified)

--defaults-file=DEFAULTS_FILE

Read from MySQL configuration file. Overrides --user, --password, --socket, --port.

Configuration needs to be in the following format:

[client]
user=my_user
password=my_pass
socket=/tmp/mysql.sock
port=3306

--end-with=END_WITH

Assuming chunking on numeric field (e.g. AUTO_INCREMENT), end chunking with this value. Either provide a constant or a query returning a single integer value.

-e EXECUTE_QUERY, --execute=EXECUTE_QUERY

Query to execute, which contains a chunk placeholder in the form of OAK_CHUNK(table_name) (required)

--force-chunking-column=FORCED_CHUNKING_COLUMN

Column(s) to chunk by; avoids querying INFORMATION_SCHEMA. Format: either column_name:type, where type is one of integer/text/temporal, for single-column keys; or column1_name,column2_name,... for multi-column keys, with no type.

-H HOST, --host=HOST

MySQL host (default: localhost)

--no-log-bin

Do not write to the binary log (actions will not replicate). This may be useful when a slave already finds it hard to keep up with its master. The utility may then be spawned manually on the slave machines, utilizing more than one CPU core on those machines and making the replication process faster due to parallelism.

-p PASSWORD, --password=PASSWORD

MySQL password

-P PORT, --port=PORT

TCP/IP port (default: 3306)

--print-progress

Show number of affected rows during utility runtime

--sleep=SLEEP_MILLIS

Number of milliseconds to sleep between chunks. Default: 0

--sleep-ratio=SLEEP_RATIO

Ratio of sleep time to execution time. Default: 0
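As an illustration of the ratio arithmetic (assumed behavior, not the utility's exact code): with --sleep-ratio=2, a chunk that took 0.3 seconds to execute is followed by a 0.6 second sleep:

```python
import time

def sleep_after_chunk(chunk_execution_seconds, sleep_ratio):
    # Sleep proportionally to the time the chunk query took to run.
    sleep_seconds = chunk_execution_seconds * sleep_ratio
    if sleep_seconds > 0:
        time.sleep(sleep_seconds)
    return sleep_seconds
```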

--skip-lock-tables

Do not issue a LOCK TABLES READ. May be required when using queries within --start-with or --end-with

--skip-retry-chunk

Avoid retrying a chunk operation on error. Default: false

-S SOCKET, --socket=SOCKET

MySQL socket file. Only applies when host is localhost

--start-with=START_WITH

Assuming chunking on numeric field (e.g. AUTO_INCREMENT), start chunking from this value and onward. Either provide a constant or a query returning a single integer value.

--terminate-on-not-found

Terminate on first occurrence where chunking did not affect any rows (default: False)

-u USER, --user=USER

MySQL user

-v, --verbose

Print user-friendly messages

ENVIRONMENT

Requires MySQL 5.0 or newer and Python 2.3 or newer. python-mysqldb must be installed in order to use this tool. You can install it with:
apt-get install python-mysqldb
or
yum install mysql-python

SEE ALSO

oak-repeat-query

LICENSE

This tool is released under the BSD license.
Copyright (c) 2008 - 2010, Shlomi Noach
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of the organization nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT 
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR 
TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

AUTHOR

Shlomi Noach
 
openark kit documentation