New England Database Day 2009 at MIT
On Friday I attended the second New England Database Day at the Massachusetts Institute of Technology, an all-day, conference-style event where participants, mostly from the research community in the New England area, came together to present ideas and discuss their research.
The conference format included two keynote talks, a single track of six technical session talks, and a poster session. Almost 30 posters were displayed at the poster session, most of them presented by graduate students on their research topics.
The first keynote speaker, Michael Franklin of UC Berkeley and Truviso, spoke about using stream processing for “continuous analytics”. Stream processing queries rely on SQL extensions. The Truviso database, which is a modified version of Postgres, captures “sliding window” snapshot tables of streaming data and allows these tables to be joined with persistent data tables in analytical queries. This is certainly an interesting idea. However, the need to use SQL language extensions lowers the value proposition. Perhaps, in true real-time stream processing use cases, you have no choice. In all other cases, there are simpler and more scalable alternatives. For example, you could package streaming data in small batches and load these batches into an Infobright table. As loaded data gets appended, you could delete the oldest data-pack row as soon as the new data-pack row is loaded, thus maintaining a “sliding data window” containing a constant number of data-pack rows. Note that you will need IEE (Infobright Enterprise Edition) for the “delete” part.
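The batch-based sliding window described above can be sketched in a few lines of Python. This is only an in-memory illustration under my own assumptions: the class name `SlidingBatchWindow` and its methods are hypothetical, and nothing here is Infobright or Truviso API. Each loaded batch stands in for one data-pack row, and the oldest batch is dropped as soon as the window exceeds its fixed size.

```python
from collections import deque


class SlidingBatchWindow:
    """Hypothetical in-memory stand-in for a table that keeps a fixed
    number of loaded batches (one batch ~ one data-pack row)."""

    def __init__(self, max_batches):
        if max_batches < 1:
            raise ValueError("max_batches must be at least 1")
        self.max_batches = max_batches
        self.batches = deque()

    def load(self, batch):
        """Append a new batch; return the evicted oldest batch, if any."""
        self.batches.append(list(batch))
        if len(self.batches) > self.max_batches:
            # Corresponds to the "delete the oldest data-pack row" step.
            return self.batches.popleft()
        return None

    def rows(self):
        """All rows currently inside the window, oldest first."""
        return [row for batch in self.batches for row in batch]


window = SlidingBatchWindow(max_batches=2)
window.load([1, 2])
window.load([3])
evicted = window.load([4, 5])  # the window is full, so [1, 2] is dropped
print(evicted, window.rows())
```

In a real deployment the append would be a bulk load and the eviction a delete statement, but the bookkeeping is the same: one batch in, one batch out.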
The second keynote speaker, Alon Halevy, heads the “Deep Web” search initiative at Google. The Shallow Web consists of all Web content (about 5 million pages) that you can search with Yahoo or Google. The Deep Web is everything else, including data behind forms; it is estimated to be 500 times the size of the Shallow Web. Deep Web search comes in three flavors: vertical search (sneaking into databases that back online forms or have online interfaces), search for anything (semantic search), and product search. Alon Halevy listed several analytical database application challenges: schema auto-complete, synonym discovery, creating entity lists, association between instances and aspects, and data-level synonym discovery. Interestingly, all these challenges are classic problems addressed by rough set theory, the foundation of Infobright technology.
It makes me wonder whether Deep Web search is a natural application domain for Infobright. Unfortunately, Google will have to solve these problems the hard way, as Google uses the Big Table database and MapReduce framework for everything search related. Incidentally, Google also uses MySQL, albeit only for serving ads and other related applications.
For me, Devesh Agrawal’s graduate research presentation on index design for flash-based embedded systems was the most interesting of the six technical session talks. Devesh is a graduate student at the University of Massachusetts Amherst. Even though the talk was about index design, the research findings and the proposed algorithm are applicable whenever a flash memory read or write operation goes through a sequence of intermediate reads or redirections. His Lazy-Adaptive algorithm provides a provably efficient mechanism for “lazily” performing writes in order to get faster response for lookups. This makes it better suited for flash-memory-based storage devices. For index applications, his algorithm provides 2x to 8x I/O performance gains in most cases over a range of workloads, datasets, and memory constraints.
A year ago David DeWitt published his much-talked-about blog post “MapReduce: A Major Step Backwards”. This post led to a heated discussion at last year’s New England Database Day conference. Continuing this discussion, Professor Daniel Abadi of Yale University talked about cloud computing and why, in his view, analytical data processing will eventually migrate away from datacenters to the cloud in spite of the obvious limitations in the areas of transaction support and data security. Responding to a question about the feasibility and cost of loading analytical and BI data to table partitions deployed on cloud-based servers, he admitted that cloud computing was for analytical data already existing on the Web, because loading terabytes of data over the Web would be prohibitively costly and slow. What about complex analytical queries? If you join two partitioned data tables in the cloud, then in all cases besides an equijoin on the partitioning fields, it is necessary to send data to other partitions in order to evaluate join conditions; for n partitions, that means n*(n-1)/2 node-to-node data exchanges. Apparently, only analytical queries not requiring such data exchanges can be processed in the cloud. So, what kind of analytical database or data warehouse can be deployed in the cloud if you can’t even load your own BI and analytical data, let alone execute complex analytical queries?
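To make the n*(n-1)/2 figure concrete, here is a small Python sketch that enumerates the distinct pairs of partition nodes that would have to exchange data. The function names are my own, and the sketch assumes one exchange per unordered pair of nodes, which is how I read the formula; it says nothing about how much data each exchange moves.

```python
from itertools import combinations


def node_pairs(n):
    """Every unordered pair of distinct nodes among n partitions."""
    return list(combinations(range(n), 2))


def exchange_count(n):
    """Closed form for the number of pairs: n*(n-1)/2."""
    return n * (n - 1) // 2


for n in (2, 4, 8, 16, 32):
    # The enumeration and the closed form must agree.
    assert len(node_pairs(n)) == exchange_count(n)
    print(n, "partitions:", exchange_count(n), "exchanges")
```

The count grows quadratically, so doubling the number of partitions roughly quadruples the number of exchanges.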
One last observation. There was no mention of open source databases at the conference. Perhaps some members of the academic community still underestimate the industry-changing, innovative, revolutionary products and technologies from companies such as Infobright.

February 3rd, 2009 at 2:06 am
[...] Google uses the Big Table database and MapReduce framework for everything search related, notes Alex Esterkin, Chief Architect at Infobright, Inc., a company delivering open source data warehousing solutions. [...]