Tuesday, October 4, 2016

Experiments with HBase, Phoenix, and SQL at Scale

By Lars Hofhansl

For a short time we have a large Hadoop/HBase/Phoenix cluster at our disposal before it is handed off to its designated use: many 100's of machines, many 1000's of cores, many dozens of TB of aggregate RAM. (We are not at liberty to state the exact size of the cluster, nor its exact hardware configuration.)

We decided to use this opportunity to perform some semi-scientific tests of Phoenix and HBase at scale to see how much of the hardware we can actually utilize and drive for single queries.

For that we loaded the TPC-H LINEITEM table with 600m and 6bn rows, respectively.

The 600m row table comes to about 372GB of data (176GB on disk after FAST_DIFF encoding); the 6bn row table measured 4.3TB (1.65TB on disk). While this is not a lot of data for a cluster of this type (it all fits easily into the aggregate HBase block cache), it nonetheless lets us gauge how Phoenix and HBase can scale their workloads across the cluster.

We performed the following queries over the data:

  • SELECT COUNT(*) FROM T;
    With this we measure the raw scan performance
  • SELECT COUNT(*) FROM T WHERE l_quantity < 1000;
    Here we measure scan performance with a simple aggregate
  • SELECT * FROM T ORDER BY l_extendedprice;
    Terasort anyone? (We canceled the query as soon as the first result was returned, in order to measure just the sort performance.)
  • SELECT COUNT(*) FROM T, J WHERE l_extendedprice = k1;
    Measure how Phoenix' join operator scales
  • SELECT COUNT(*) FROM T WHERE l_extendedprice = <x>;
    With and without a local index.

The Good

  1. We were able to drive 1000's of cores by sizing Phoenix' client thread pool accordingly. HBase does not have a fully supported asynchronous client, so many client threads are needed that do no actual work; they just trigger work on the RegionServers and wait for the results.
  2. Phoenix uses guide-posts to break work down into chunks smaller than a region, and as we increased the number of threads the performance scaled linearly with them, utilizing the cores on the servers - for queries that can be scaled that way, that is.
  3. HBase did a good job of automatically splitting the data and distributing it over the cluster. We found that as long as the data fits into the aggregate cache it did not matter how many regions (and hence servers) were involved, indicating there is potential for crunching _way_ larger data sets with a single client. There was room to run larger queries and multiple queries at the same time.
  4. Phoenix can sort 4.3TB in 26.4s, engaging 1000's of cores on 100's of machines.
  5. A LOCAL index over 4.3TB is created in 1 minute. Even with many regions it was still effective in reducing query times significantly.
  6. Due to the chunking and chunk queuing when a query is executing, resource allocation is surprisingly fair.


First we varied the number of driver threads on the client for the first query.
Notice that we can scan through 6bn rows, or 4.3TB of data in 6.55s.

Same with a simple filter applied. We can search 6bn rows/4.3TB of data in 8.36s.

Next we performed sorting. As soon as the query returns the first result we cancel it.
(Note that with an added LIMIT clause Phoenix does an O(N) scan instead of an O(N*log(N)) sort.)
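
To make the methodology concrete, here is a rough sketch of how one can time the sort with the Phoenix JDBC driver. This is not the exact tool we used; the JDBC URL and table name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SortTimer {
  public static void main(String[] args) throws Exception {
    // Placeholder JDBC URL; point it at your ZooKeeper quorum.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zkhost");
         Statement stmt = conn.createStatement()) {
      long start = System.currentTimeMillis();
      try (ResultSet rs = stmt.executeQuery(
          "SELECT * FROM LINEITEM ORDER BY l_extendedprice")) {
        // The first call to next() returns only after the servers have sorted
        // their chunks, so this roughly measures the distributed sort time.
        if (rs.next()) {
          System.out.println("First row after " +
              (System.currentTimeMillis() - start) + " ms");
        }
        // Stop here instead of draining the result set - the equivalent of
        // canceling the query after the first result.
      }
    }
  }
}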

As mentioned, a single client can drive the cluster to sort 4.3TB in 26.4s!

Joins... This time we fixed the thread pool size at around 4000 and varied the number of join keys. In the 128k case we performed 786 trillion comparisons of a BIGINT value in 26s (131,072 keys x 6bn rows is roughly 7.9x10^14 comparisons).

Lastly, we also created LOCAL indexes on the data. Due to their local nature they are created in just a few seconds, and even at many 1000's of regions they were still effective in reducing query times significantly.

Index creation on the 6bn row/4.3TB table took almost exactly 1 minute.
It brought a scan selecting a few 1000 of the 6bn rows from 8s down to about 10ms - and note that this is a table with many 1000's of regions, where each region needs to be pinged with "do you have these keys?"
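
For reference, a minimal sketch of what that looks like through the Phoenix JDBC driver (the index name, table name, and lookup value are illustrative, not the exact DDL and query we ran):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LocalIndexExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zkhost");
         Statement stmt = conn.createStatement()) {
      // LOCAL index: index data is co-located with the table's regions,
      // which is why creation is fast even for very large tables.
      stmt.execute(
          "CREATE LOCAL INDEX lineitem_price_idx ON LINEITEM (l_extendedprice)");

      // With the index in place this query no longer scans the whole table;
      // each region is only probed for the requested key.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT COUNT(*) FROM LINEITEM WHERE l_extendedprice = 42.0")) {
        if (rs.next()) {
          System.out.println("count = " + rs.getLong(1));
        }
      }
    }
  }
}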

The Bad

Joins naturally do not scale well. Phoenix now has a merge join option, which sorts the relations at the servers and then performs a final merge on the client. At the sizes we're talking about this can't be done realistically. Perhaps joins with highly selective filters will work, something we'll test soon.

There is a lot of work to do. In order to be efficient Phoenix/HBase need to work on denormalized data - which should not really come as a surprise.

So we measured joins only for situations where they can be executed as a broadcast hash join. With the default configuration this failed at 2^18 (256k) join keys.

LOCAL indexes of a table are all mixed into the same single column family. Creating many of them makes that column family the dominant one in terms of size, driving HBase's size-based split decisions.

Unlike MapReduce, Spark and some distributed databases, Phoenix has no resource allocation or restarting framework, other than what's provided by HBase. Phoenix is not a useful choice for queries that run for more than (say) 1h, or where partial failure on some machines is likely.

And the Ugly

  • When queries naturally do not scale, they simply "never" (as in days) return, churning the client until they are canceled or eventually time out and fail.
  • Some results seemed surprising at first. OFFSET queries become very slow with increasing OFFSETs. Upon inspection that makes sense: there is no positional index, so the first OFFSET rows have to be scanned and discarded every time.
  • Distinct queries on high cardinality tables churned for a long time and then failed.
  • There is no cost based optimizer; Phoenix uses simple heuristics to determine how to execute a query. For simple queries that works well, but it breaks down for complex joins.
  • Dropping a local index is very expensive as each entry is marked for deletion individually.


To drive large queries you must configure the Phoenix client accordingly:
Increase phoenix.query.threadPoolSize (1000, 2000, or 4000) and phoenix.query.queueSize (maybe 100000).
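
One way to pass these settings is as connection properties from the Phoenix JDBC client. This is a sketch only; the same keys can also go into the client-side hbase-site.xml, and the exact values need tuning for your cluster:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class LargeQueryClient {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // Number of client threads that issue the chunked scans to the RegionServers.
    props.setProperty("phoenix.query.threadPoolSize", "4000");
    // The queue feeding those threads has to be sized up as well.
    props.setProperty("phoenix.query.queueSize", "100000");
    // Placeholder JDBC URL; point it at your ZooKeeper quorum.
    try (Connection conn =
             DriverManager.getConnection("jdbc:phoenix:zkhost", props)) {
      // ... run the large queries on this connection ...
    }
  }
}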

Semi- (or un-)scientific as these tests were, they are more of a qualitative check of whether a larger cluster can actually be utilized to execute a single query. Phoenix/HBase do quite well in terms of scaling. We will continue to do more tests and to drive the fixed costs down, to allow Phoenix/HBase to utilize the machines even better.

We'll have this cluster just a little bit longer. More to come. And if you have any other Phoenix/HBase testing you'd like us to do, please comment, and I'll see what we can do.

Wednesday, February 17, 2016

HBase - Compression vs Block Encoding

By Lars Hofhansl

(Updated on February 28th, 2016, to correct spelling errors, add clarifications, and add a section about single threaded scanning.)

HBase has many options to encode or compress its data on disk.

In this excellent blog post Doug Meil and Thomas Murphy outline the effects of block encoding and compression on the storage footprint.

In this post I will explore the effects of encoding and compression options on (read) performance. I performed the Scan performance tests with the excellent Apache Phoenix, the relational layer over HBase. The Get tests use a straight HBase client, without Phoenix.


In short:
Block encoding refers to the ability to encode KeyValues or Cells on-the-fly as blocks are written or read, exploiting duplicate information between consecutive Cells. The main advantage is that the blocks can be cached in encoded form and decoded as needed in a streaming way. The encodings used most commonly in HBase are FAST_DIFF and (to a lesser extent) PREFIX encoding.

Compression here means full compression of HBase blocks using SNAPPY, GZIP, and others. Newer versions of HBase have the ability to cache blocks in compressed form, but I did not test that here.

(Compression and block encoding for HBase are described in more detail here.)

The most important property to note is that the cost of decompression is proportional to the number of blocks that have to be decompressed, whereas the cost of decoding is proportional to the number of Cells visited. Keep that in mind when looking at the numbers below.

The Setup

For my test I created Phoenix tables with the following schema:

create table test (pk integer primary key, v1 float, v2 float);

Then I seeded the pk from a sequence (created beforehand, e.g. with CREATE SEQUENCE pkey), and v1/v2 with random values like so:

upsert into test values(next value for pkey, rand(), rand());

followed by a few of these (each doubling the size of the table):

upsert into test select next value for pkey, rand(), rand() from test;

This data set, except for the monotonically increasing key, would not compress well.

With that I created tables with 32m rows, or 128m Cells altogether (including the one that Phoenix creates for book-keeping), on a single machine.

Let's first look at the footprint on disk:


So, the first takeaway: You have to use some kind of compression or encoding. Since every column needs to be stored as its own Cell, it clocks in at about 24 bytes per column uncompressed - or about 100 bytes per "row". (In a relational database with a dense row format that entire row would only take about 12 bytes.)

I tested the following four scenarios, on a 12 core machine with 3GB RAM and 4 DDR3 memory channels:
  • from disk (7200rpm drive, 8ms seek time or about 125 IOPS, about 120MB/s)
  • from SSD (about 800MB/s, about 2500 IOPS)
  • from OS cache
  • from HBase cache

The region server was configured with:
-Xmn2g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCompressedOops -XX:+UseCMSInitiatingOccupancyOnly -XX:MaxGCPauseMillis=5000 -XX:GCTimeLimit=90 -XX:GCHeapFreeLimit=10

Short Circuit Reads were enabled for HDFS.

Now let's do some scanning

In order to avoid caching in HBase I used the NO_CACHE hint in Phoenix that I added a while ago (PHOENIX-13)

 select /*+ NO_CACHE */ count(*) from test;
   (This will actually issue 11 scan requests to HBase in parallel)

I cleared the OS Cache with

 echo 3 | sudo tee /proc/sys/vm/drop_caches

We can see that reading from spinning disk is (not surprisingly) dominated by the amount of data to be read. The extra CPU time consumed by decompression is negligible; we are clearly IO bound.
We also see that in all cases - even GZIP - HBase was able to push the drive close to its limit. In fact in all cases I saw 90-95MB/s read from disk; the performance gain from compression is close to proportional to the compression ratio.

Let's look at the same data without the HDD numbers to see better what's going on:

Here we see that, because the decoders incur a per-Cell overhead, the HBase block cache does not help for FAST_DIFF or PREFIX. For PREFIX encoding the HBase block cache actually does worse than the OS cache; that is something I need to investigate further.

We also see that for Scans the block cache has no/little advantage over the OS cache, or even over reading straight from SSD - and in fact HBase wasn't able to keep the SSD fully busy. In other words, for SSDs HBase scans are CPU bound. This is interesting; I will also investigate this part further.

All schemes except FAST_DIFF/PREFIX store the blocks in the block cache in uncompressed form (note that newer versions of HBase have the option to cache blocks in their compressed form; I did not test that). In this test the uncompressed data fits into the block cache, so the comparison is not entirely fair to FAST_DIFF and PREFIX.

So use FAST_DIFF only when you expect your working set to fit into the aggregate block cache in its encoded form, but not in its unencoded form.

We can now also see the relative CPU cost of the compression and encoding schemes. FAST_DIFF is actually the worst(!), followed by GZIP, PREFIX, and then SNAPPY, which is the most efficient.
GZIP consumed the most CPU to decompress a block, but after a block is decompressed, scanning goes through it without any extra work.

Since for scans the HBase block cache generally does not buy much over Linux' disk block caching ("OS Cache" in the graph) - we're CPU bound anyway - it might be better in these situations to disable block caching (setCacheBlocks(false)) on the Scan object, even when the entire scanned data set would fit into the HBase block cache.
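
With the plain HBase client API that would look roughly like this (a sketch; the table name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class LargeScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("TEST"))) {
      Scan scan = new Scan();
      // Do not pollute the block cache with blocks we will only touch once.
      scan.setCacheBlocks(false);
      // Fetch rows in larger batches to cut down on RPC round trips.
      scan.setCaching(1000);
      try (ResultScanner scanner = table.getScanner(scan)) {
        long rows = 0;
        for (Result r : scanner) {
          rows++;
        }
        System.out.println("scanned " + rows + " rows");
      }
    }
  }
}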

February 28, 2016

Note that Phoenix uses parallel scans across regions, with equal-width "guide posts" that determine how Phoenix parallelizes the work across HBase region servers. In the Scan examples Phoenix chose 11-way parallelization to execute the queries. I have since added a SERIAL hint to Phoenix to force serial execution of a query. See PHOENIX-2697.

What does performance look like when all scanning is single threaded?
Now this is interesting. Without compression we still run into the spinning disk's throughput limit. Everything else is mostly CPU bound.

Side note, memory bandwidth:

In the next chart I mapped the scan throughput out of the OS cache seen by 1 thread and by 11 threads. (I used the OS cache, and not the HBase block cache, as this shows the internal end-to-end performance of HBase.):

MB/s is the actual throughput seen out of the OS cache. "raw" indicates the throughput in terms of uncompressed data. We see that the throughput in terms of raw uncompressed bytes is very similar across all schemes.

We run into some internal HBase/Phoenix limit around 710MB/s.

Now look at the single threaded performance!
The fastest is (of course) uncompressed scanning, which utilizes about 168MB/s of the available memory bandwidth. It seems we need to throw a lot of cores at the problem to utilize the available memory bandwidth. Luckily Apache Phoenix does that automatically where possible; but this is still something to look into in HBase land.

In order to put this into context... According to spec this machine has 4 DDR memory channels for a total maximum bandwidth of 51.2 GB/s.

I wrote a simple bandwidth tester in C++. It allocates 1GB of memory and then accesses it sequentially as bytes or as 64bit longs. (For good measure I wrote the same in Java and came to the same numbers, within 10%.) Please note that I am not a hardware guy.
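
For illustration, a minimal sketch of the Java variant (not my exact tool): allocate 1GB, then time one sequential pass over it reading 64-bit longs; a byte-wise pass works the same way with get().

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BandwidthTest {
  public static void main(String[] args) {
    final int size = 1 << 30; // 1GB; may need -XX:MaxDirectMemorySize=2g
    ByteBuffer buf =
        ByteBuffer.allocateDirect(size).order(ByteOrder.nativeOrder());
    // Touch the memory once so the pages are actually allocated.
    for (int i = 0; i < size; i += 4096) {
      buf.put(i, (byte) 1);
    }
    long sum = 0;
    long start = System.nanoTime();
    // Sequential pass over the buffer, 8 bytes at a time.
    for (int i = 0; i + 8 <= size; i += 8) {
      sum += buf.getLong(i);
    }
    double secs = (System.nanoTime() - start) / 1e9;
    System.out.printf("%.2f GB/s (checksum %d)%n",
        (size / secs) / (1 << 30), sum);
  }
}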

Here's what I measured (bandwidth in GB/s, for 1 process and for 4 processes):
[table: measured bandwidth for byte and 64-bit long access, 1 process vs. 4 processes]

So the HBase numbers are quite a bit lower. We see that the maximum throughput (710MB/s) is close to 4x the single threaded throughput (168MB/s), indicating that this is probably related to memory throughput and not CPU cycles (otherwise we should have seen an 11x speed up; the machine has enough cores).

Now, this is from the Linux block cache, so we need to account for the fact that the blocks are copied at least 2 times (into a direct byte buffer, and from there into a byte[]), but since the block cache is only marginally better, this looks to be all internal Phoenix and HBase friction.

OK. That was scanning. Let's try some random gets.

I wanted to explore the decompression cost more. Here I wrote a little tool that issues 1000 random Gets. Each Get will incur a block read followed by decompression (if the block is not already in the block cache), or a seek plus decoding.
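
A minimal sketch of such a Get tool with the plain HBase client (table name, key format, and key space are placeholders; the real tool picks keys that actually exist in the table):

import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomGets {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Random rnd = new Random();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("TEST"))) {
      long start = System.currentTimeMillis();
      for (int i = 0; i < 1000; i++) {
        // Each random key typically lands in a different block, so each Get
        // pays for one block load (and decompression) unless it is cached.
        Get get = new Get(Bytes.toBytes(rnd.nextInt(32_000_000)));
        table.get(get);
      }
      System.out.println("1000 Gets in " +
          (System.currentTimeMillis() - start) + " ms");
    }
  }
}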

The numbers from HDDs are absolutely dominated by the seek time. In all cases this was close to 6.2s, which is close to what you would expect from doing 1000 seeks on a 7200rpm drive (actually you would expect about 8s for 1000 Gets; presumably it is a little faster since some Gets hit the same blocks again).
Use SSDs for random Gets, or make sure that your working set fits into the OS cache or the block cache. Otherwise each Get will require HBase to perform a disk seek (about 4ms for a 15k drive, 8ms for a 7.2k drive, 12ms for a 5k drive).

So here we see the cost of the decompression again. This time HBase needs to load and decompress a block for each Get - unless that block is found in the HBase block cache. We can estimate that the "seek time" on the SSD is about 0.4ms, or that the drive can do about 2500 IOPS (which does seem low).

We also see that FAST_DIFF is not doing as well with the data in the block cache, again because it stores blocks in the block cache in their encoded form - and the cost of decoding is proportional to the number of Cells traversed.

When the data is on SSD or in the OS cache, we do see that the encoders do slightly better. They do not have to decompress an entire block, but only decode the portion needed to reach the Cells we're interested in.

Most prominently for Gets we do see the value of the OS Cache and especially the HBase block cache - assuming the working set will likely fit into the aggregate block cache across the cluster.


  1. Not using compression or encoding is not an option. I think HBase should not even allow that. Both SNAPPY and FAST_DIFF are good all around options.
  2. If you regularly scan large data sets from spinning disk, you're best off with GZIP (but watch the write speed, which I haven't tested here).
  3. For large scans, consider not using the HBase block cache. (Scan.setCacheBlocks(false)), even if the whole scan could fit into the block cache.
  4. If you mostly perform large scans you might even want to consider running HBase with a much smaller heap and size the block cache down, to only rely on the OS Cache. This will alleviate some garbage collection related issues.
  5. We need to throw a lot of cores at a Scan to utilize the available memory bandwidth. When spec'ing machines for HBase, do not skimp on cores; HBase needs them. Apache Phoenix makes it easy to utilize more cores to increase scan performance.
  6. If there is a chance that your data set would fit into the block cache, FAST_DIFF is a good option. It will allow more data to fit into the block cache, since the data is cached in its encoded form.
  7. While for Scans the HBase block cache shows fairly little advantage, for Gets it is quite important to have your data set cached.
  8. If you do lots of random gets, make sure you use SSDs, or that your working set fits into RAM (either OS cache or the HBase block cache), or performance will be truly terrible.
What I didn't test:
  1. CPU constrained environments.
  2. Write performance
What I also didn't test:
  • Multiple different scans in parallel. I'd expect the cached scenarios would behave similarly w.r.t. each other. Obviously scanning and Gets from rotating disks will do much worse.

Friday, January 22, 2016

HBaseCon 2016 is on!

HBaseCon 2016 is on!

I am lucky enough to be on the Program Committee again this year.

HBaseCon will take place May 24, 2016 at
The Village
969 Market St.
San Francisco, CA 94103

The Call For Papers is open. If you have anything interesting to say about HBase - what you do with HBase, what other technologies you use with HBase, how you solved XYZ with HBase, how you hacked HBase, what we should improve in HBase, etc. - and want to talk about it at HBaseCon, please let us know.

The CFP closes Feb 28th.

Monday, December 21, 2015

Yet more on HBase GC tuning with CMS

By Lars Hofhansl
Some of my previous articles delve into GC tuning and more tuning for scanning.

We have since performed more tests with heavy write loads and I need to revise my previous recommendations based on the findings.

I wrote a simple test tool that generates 5 million random keys of approximately 200 bytes; it then starts 50 threads that each repeatedly pick random batches of these keys (100k at a time) and write them to HBase. I then start multiple instances of this tool and point them at an HBase cluster.
The result is that we see a lot of churn in HBase as new versions of Cells are written, but the overall size of the data is kept more or less constant by compactions - so I can test very large write loads with limited disk space.
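
A rough sketch of such a writer (not the actual tool; the table and column names are made up, and error handling is minimal):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteChurn {
  static final int NUM_KEYS = 5_000_000;
  static final int BATCH = 100_000;

  // Derive the i-th "random" ~200 byte key deterministically, so all writers
  // keep hitting the same fixed key space (which is what creates the churn).
  static byte[] key(int i) {
    byte[] k = new byte[200];
    new Random(i).nextBytes(k);
    return k;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    final Connection conn = ConnectionFactory.createConnection(conf);
    Runnable worker = () -> {
      Random rnd = new Random();
      try (Table table = conn.getTable(TableName.valueOf("CHURN"))) {
        while (true) {
          List<Put> batch = new ArrayList<>(BATCH);
          for (int i = 0; i < BATCH; i++) {
            Put put = new Put(key(rnd.nextInt(NUM_KEYS)));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("v"),
                Bytes.toBytes(rnd.nextLong()));
            batch.add(put);
          }
          table.put(batch); // new versions of existing Cells -> lots of churn
        }
      } catch (Exception e) {
        e.printStackTrace();
      }
    };
    for (int t = 0; t < 50; t++) {
      new Thread(worker).start();
    }
    // Workers run until the process is killed; the shared connection stays open.
  }
}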

We find that for these kinds of loads the young generation of 512MB that I had recommended before is hopelessly undersized. We see lots of premature promotions into the tenured space, followed by minute-long full collection pauses that eventually cause the region servers to shut down.

For the tests I ran with a 16G heap and found that a young gen of 2G is good.

Note that in CMS the size of the young gen is one of the knobs to trade latency for throughput. The smaller the young gen is sized, the smaller the pauses tend to be... if all short-lived garbage fits into the young gen, that is. If it does not, short-lived objects get promoted into the tenured space and that will eventually lead to very long (minutes with 30G heaps) full collection pauses.

So the goal is to size the young gen just large enough to fit the short-lived objects. In HBase we have some guide posts for this: 40% of the heap (by default) is dedicated to the memstores, another 40% (again by default) to the block cache, and the rest (20%) to "day-to-day" garbage.

I now recommend the following for the young generation (a worked example follows the list):
  • at least 512MB
  • after that 10-15% of the heap
  • but at most 3GB
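
As a quick worked example using the heap sizes mentioned in this post: for the 16G heap, 10-15% comes out to 1.6G-2.4G, so -Xmn2g sits right in that range; for a 30G heap, 10% would already be 3G, so the 3GB cap applies and -Xmn3g would be the upper end.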

For most setups the following should cover all the bases:

-Xmn2g - small eden space (or even -Xmn3g for very large heaps)
-XX:+UseParNewGC - collect eden in parallel
-XX:+UseConcMarkSweepGC - use the non-moving CMS collector
-XX:CMSInitiatingOccupancyFraction=70 - start collecting when 70% of the tenured gen is full, to avoid collecting under memory pressure
-XX:+UseCMSInitiatingOccupancyOnly - use only the configured occupancy fraction as the trigger; do not let the JVM adjust it

In the end you have to test with your own work loads. Start with the smallest young gen size that you think you can get away with. Then watch the GC log. If you see lots of "promotion failed" type messages, you need to increase eden (-Xmn); do that until the promotion failures stop.

We'll be doing testing with G1GC soon.

Friday, May 8, 2015

My HBaseCon talk about HBase Performance Tuning

By Lars Hofhansl

HBaseCon 2015 was a blast as always. All the presentations and videos will be online soon.

In the meanwhile my presentation on "HBase Performance Tuning" can be found on SlideShare.

Monday, April 20, 2015

HBaseCon 2015

Don't forget to come to HBaseCon, the yearly get-together for all things HBase in San Francisco. May 7th, 2015.

We have a great collection of sessions this year:

  • A highly-trafficked HBase cluster with an uptime of sixteen months
  • An HBase deploy that spans three datacenters doing master-master replication between thousands of HBase nodes in each
  • Some nuggets on how Bigtable does it (and HBase could too)
  • How HBase is being used to record patient telemetry 24/7 on behalf of the Michael J. Fox Foundation to enable breakthroughs in Parkinson's disease research
  • How Pinterest and Microsoft are doing it in the cloud, how FINRA and Bloomberg are doing it out east, and how Rocketfuel, Yahoo! and Flipboard, etc., are representing the west

Among many others!

I'll be talking about HBase Tuning and have a brief cameo in the HBase 2.0 panel, talking about semantic versioning. Feel free to find me afterwards.

Sunday, March 8, 2015

HBase Scanning Internals - SEEK vs SKIP

By Lars Hofhansl, March 8th, 2015

Recently we ran into a problem where a mapper that scanned a region of about 10GB with a timerange that did not include any Cell (KeyValue) took upwards of 20 minutes to finish; it processed only about 8MB/s.

It turns out this was a known problem that has eluded a fix for a while: deciding at the scanner level whether to SEEK ahead past potentially many Cells or whether to power through and repeatedly SKIP the next Cell until the next useful Cell is reached.


Scanning forward through a file, HBase has no knowledge about how many columns are to follow for the current row or how many versions there are for the current column (remember that every version of every column in HBase has its own Cell in the HFiles).

In order to deal with many columns or versions, HBase can issue SEEKs to the next row (seeking past all versions for all remaining columns for the row) or the next column (seeking past all remaining versions). HBase errs on the side of SEEK'ing frequently since SKIP'ing potentially 1000's or 100000's of times can be disastrous for performance (imagine a row with 100 columns and 10000 versions each - unlikely, but possible; that would be a million Cells for a single row).

The problem is: SEEK'ing is about 10x as expensive as a single SKIP - depending on how many seek pointers into HFiles have to be reset.

Yet, in many cases we have rows with only a few or even just one column and just one version each. Now the SEEKs will cause a significant slowdown.

After much trying there is finally a proposed solution:
HBASE-13109 Make better SEEK vs SKIP decisions during scanning
(0.98.12 and later)

How does it work?

HFiles are organized like B-Trees, and it is possible to determine the start key of the next block in each file.

The heuristic is now:
Will the SEEK we are about to execute get us into the next block of the HFile that is at the top of the heap used for the merge sort between the HFiles?

If so, we will definitely benefit from seeking (the repeated SKIPs would eventually exhaust the current block and load the next one anyway).

If not, we'll likely benefit from repeated SKIP'ing. This is a heuristic only; the SEEK might allow us to seek past many Cells in HFiles not at the top of the heap, but that is unlikely.
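
In Java the idea looks roughly like this. This is a sketch of the heuristic only, not the actual HBASE-13109 code, and the class and parameter names are made up:

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellComparator;

public class SeekVsSkipHeuristic {
  /**
   * Decide whether to honor a requested SEEK or to keep SKIP'ing instead.
   *
   * @param seekKey           the key the upper layers asked us to seek to
   * @param nextBlockFirstKey first key of the next block of the HFile that is
   *                          currently at the top of the merge heap (null if unknown)
   */
  public static boolean shouldSeek(Cell seekKey, Cell nextBlockFirstKey,
      CellComparator comparator) {
    if (nextBlockFirstKey == null) {
      return true; // no index information available; fall back to seeking
    }
    // If the seek target lies at or beyond the start of the next block, the
    // repeated SKIPs would exhaust the current block anyway - a real SEEK wins.
    // Otherwise the target is still inside the current block and cheap SKIPs
    // get us there faster than resetting the seek pointers in the HFiles.
    return comparator.compare(seekKey, nextBlockFirstKey) >= 0;
  }
}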

In all tests I did this performs equal to or (much) better than the current behavior.

The upshot is that now the HBase plumbing (filters, coprocessors, delete marker handling, skipping columns, etc) can continue to issue SEEKs where that is logically possible, and then at the scanner level it can decide whether to act upon the SEEK or to keep SKIP'ing inside the current block, with almost optimal performance.

HBASE-13109 allows many queries against HBase to execute 2-3x faster, especially those that select specific columns, those that filter on time ranges, or those where many deleted columns or column families are encountered.
Queries that request all columns and that do not filter the data in any way will not benefit from this change.

You do not need to do anything to get that benefit (other than upgrading your HBase to at least the upcoming 0.98.12).