The Cost of Analytical Query Processing in the Cloud, Part 1
This is the first in the series of blog posts on the topic. I will look at database options and examine each in terms of data scalability, query processing capability, and cost.
Any major analytical database solution in the cloud needs to support large databases and complex analytical queries. To refine the meaning of “large”, let us use the following database sizes as reference points: 100 gigabyte, 1 terabyte, 10 terabyte, and 100 terabyte. For analytical database and analytical queries example, let us use TPC-H benchmark.
Pricing
Before writing this blog post, I looked at current cloud pricing to understand the cost of query processing in the cloud.
First, users pay for data transfer. Amazon charges ten cents per gigabyte of data transferred in or out of its Elastic Compute Cloud (EC2) and one cent per gigabyte of data transferred between “local zones” or between different public IP addresses inside EC2, however, there are no charges for data transfer between private IP addresses within the same availability zone. See Amazon EC2 and EBS descriptions for definitions.
Second, users pay for data storage and data retrieval (I/O): in case of Amazon ESB, ten cents per gigabyte per month, plus 10 cents for every million of I/O requests.
Row-Oriented Databases, Column-Oriented Databases, Key-Value Databases…
Generally, databases fall into one of three categories: row-oriented, column-oriented, and key-store. Many key-value databases rely on row-oriented back-ends.
I would recommend reading an excellent ReadWriteWeb blog post by Tony Bain on key-value databases. Such databases are not intended for complex analytical query processing and thus not covered here.
All row-oriented databases employ data sharding or hash based data partitioning to overcome single-server limitations and leverage data-parallelism. There are tons of resources on sharding and partitioning - from excellent books such as “High Performance MySQL” to blogs such as Database Sharding Blog at CodeFutures or MySQL Performance Blog. In particular, I would recommend the recent excellent blog post by Dare Obasanjo. Data sharding has been in use for ages. For example, in Oracle Parallel Query feature in Oracle 7.1.
Recently, column-oriented databases have started exploring hash-based data partitioning schemes as well, e.g. Vertica for the Cloud.
Analytical Query Processing with Sharding
Star Schema design is the simplest possible analytical schema. In case of a single large “fact” table and several small “dimension” tables, sharding is straightforward: hash-partition the fact table, copy and cache each of the dimension tables in every shard.
When caching of dimension tables is impractical or when there two or more large tables, things get more complicated, as many types of analytical queries cannot be represented as queries against individual partitions followed by some result-combining step.
For example, in the TPC-H benchmark schema, the two largest tables, LINEITEM and ORDERS are unavoidable candidates for sharding. Over half of the 22 TPC-H benchmark queries contain joins of two or more tables. If you calculate ORDERS partition numbers by hashing o_orderkey values (e.g., some big prime number divided by o_orderkey, modulo the number of partitions); and LINEITEM partition number by performing the same calculation on l_orderkey, then all joinable ORDERS and LINEITEM table rows will come from corresponding partitions. The two tables can be joined on (o_orderkey=l_orderkey) without any need to exchange data between shards.
Most implementations of analytical join query processing with sharding are based on this simple observation. If individual servers in the parallel server array have enough memory (or at least storage) for caching six smaller TPC-H tables, then even multi-table TPC-H joins can be executed without exchanging data between partitions.
Why do we have to worry about data exchange between partitions related to database join processing?
Suppose your schema includes two or more fact tables. Executing a complex join query involving both fact tables, which is not equijoin on the partitioning keys, necessitates data exchange between partitions. In the general case, you will have n*(n-1)/2 intra-query data exchanges.
In case of 50 nodes, 1225 data exchange links; in case of 200 nodes, 19900 data exchange links; in case of 1000 nodes, 499500 data exchange links.
As all data exchanges need to be asynchronous, your MPP database server engine will need to employ tuple queues at each of the data exchanges. Thus, if the server allocates, say, 512K for each incoming and outgoing tuple queue, then these tuple queues will require over 1GB of memory at each server node in case of 50 nodes; almost 20GB of memory at each server in case of 200 nodes; and almost 500GB of RAM at each server node in case of 1000 nodes.
This example illustrates scalability limitations of analytic query processing in databases that employ “shared-nothing” MPP architecture.
The bad news is that cloud computing, at least the way Amazon structures it, both, imposes and expects this architecture pattern.
Thus, economics of analytical processing in the cloud is a complex puzzle. Data storage cost is just a piece in this puzzle.

Recent Comments