|
Q and A
Asked and Answered
We have tables that are fairly large (more than 50M rows). We can't
purge data for business reasons, but the probability of accessing a
piece of data is far greater for recent data than it is for older data.
Should we split the data into two partitions (old and recent) so
that
the partition containing the subset of rows likely to be accessed is
smaller?
Robert Catterall responds:
That's a good question. If by "partition," you mean a partition or
partitions in a DB2 for OS/390 partitioned tablespace, I don't know if I
would recommend that approach if you never purge from the table (and a
never-purge table makes me a little nervous). If you do purge, but only
after a considerable period of time (some number of months or years),
you might want to try
partitioning on a date basis. For example, if you
need to retain data in the table for three years, you could go with 156
partitions in a tablespace one for each week in a three-year period.
I would not recommend making a date column the partitioning key, because
partitioning by a continuously ascending key is problematic at present.
Instead, you could have a partition number column, with values from 1
through 156. A row going into a particular partition is assigned the
associated partition number by
way of this column. A separate table that
maps dates to partition numbers would be used on insert to determine the
correct partition number, and on retrieval to translate a requested date
into a partition number. In this way, you can "wrap" around these 156
partitions as time progresses. If you have a brief "window" during which
the data can be unavailable at the beginning of a new week, you can
clear out an "old" partition very efficiently via a load utility REPLACE
DD DUMMY operation. Otherwise, you'd
clear out old data via SQL DELETE
(following archive, if you need to archive the old data).
If you really never will purge data from this table, you might try
replacing the one table with multiple tables. You could do this in a
couple of ways:
1) Have a "current period" table holding whatever you consider to be the
"current" data (a month's worth, three month's worth, whatever) and a
history table (holding data older than that belonging to the "current"
time period).
2) Have a table
per time period (for example, a January 2002 table, a
February 2002 table, and so on).
The nice thing about option 1 is that the table names are fixed, making
coding pretty easy. The main disadvantage is the CPU expense of moving
rows from the one table to the other as they age. If a row will be
accessed multiple times while it's "current," the improved retrieval
efficiency (better locality of reference, better buffer pool hit ratio)
may out weigh the table move cost.
The second option
could require more application code, in that an
application may have to figure out which table to access. On top of
that, you can't name a table via a host variable in an SQL statement;
so, you either have to have a lot of alternative static statements in
your program, and execute the right one based on the time period of
interest, or you go with dynamic SQL, figuring out the table name and
building the statement at execution time. I'll tell you that I'm much
less opposed to dynamic SQL than I used to be. You
might want to read
the column I wrote about dynamic SQL last year for
DB2 Magazine
(
http://www.db2mag.com/db_area/archives/2001/q2/db2dba.shtml
).
Still referring to option 2, an alternative to smarter application
programs is a smarter (and more high-maintenance) database. You could
ensure that the current period table always has the same name by way of
the DB2 table rename function. Trouble is, when you do the rename, you
have to rebind
programs referring to the target table. Thus, the rename
approach might well require that the change be effected during a
maintenance window.
Still another alternative would be to stick with your current scheme and
have zero free space (or as close as you can get to that in the
tablespace. That way, all new data would go to the end of the tablespace
or partition, and you could get good locality of reference because
current data rows are close to each other. The clusteredness of your
table might
deteriorate quite a bit, but would that be bad? Maybe not,
depending on the nature of accesses to data in the table.
See a
complete archive of reader/author Q&As
.
|