
Automatically Expire Data In HBase

Did you know that you can set up a time-to-live (TTL) on column families in HBase? The feature is especially nice if you are dealing with sensitive data such as network packet captures. It also saves you the time you would otherwise spend writing and scheduling maintenance scripts, as you might with other tools. As the HBase documentation puts it:

ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached. This applies to all versions of a row - even the current one. The TTL time encoded in the HBase for the row is specified in UTC.

To configure a table with a TTL of 15 seconds, for instance, open the HBase shell:

create 'test-table', {NAME => 'column-family1', TTL => 15}
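
To double-check that the TTL took effect, you can describe the table; the column family attributes, including TTL, are listed in the output:

describe 'test-table'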

Now store a row:

put 'test-table', 'some-row-key', 'column-family1:test', 'just-a-value'

If you retrieve that record with get 'test-table', 'some-row-key', 'column-family1:test' a couple of times, you will see that it magically disappears after 15 seconds.
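
A minimal shell session might look roughly like this (the exact output formatting depends on your HBase version):

get 'test-table', 'some-row-key', 'column-family1:test'
# returns the cell with value 'just-a-value'

# ...wait at least 15 seconds, then run the same get again:
get 'test-table', 'some-row-key', 'column-family1:test'
# returns 0 row(s) - the cell has expired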

If we are to trust what is discussed here (2011), there is no way to set a single TTL for a whole table. A TTL can only be applied to one column family at a time, as shown above. That makes sense, since it is implemented in HColumnDescriptor.
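
In practice that means a table with several column families needs a TTL specified per family. As a quick sketch (the table and family names below are made up for illustration):

create 'capture-table', {NAME => 'packets', TTL => 900}, {NAME => 'metadata', TTL => 86400}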

There's another aspect as well: versions. I don't know about you, but have you ever tried implementing some kind of versioning system in MySQL? The great thing in this case is that HBase has it built in.
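
As a small sketch of what that looks like in the shell (again, the table and column names are just for illustration), you can keep several versions of a cell and read them back in one call:

create 'versioned-table', {NAME => 'cf', VERSIONS => 3}
put 'versioned-table', 'row1', 'cf:field', 'first-value'
put 'versioned-table', 'row1', 'cf:field', 'second-value'
# returns both stored versions, newest first
get 'versioned-table', 'row1', {COLUMN => 'cf:field', VERSIONS => 3}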