Facebook: Scuba

Origins

Originally, we relied on pre-aggregated graphs and a carefully managed, hand-coded, set of scripts over a MySQL database of performance data.

We now use Scuba for most real-time, ad-hoc analysis of arbitrary data.

Scuba runs on hundreds of servers each with 144 GB RAM in a shared-nothing cluster. It stores around 70 TB of compressed data for over 1000 tables in memory, distributed by partitioning each table randomly across all of the servers.

In addition to a SQL query interface (for a subset of SQL including grouping and aggregations but not joins), Scuba provides a GUI that produces time series graphs, pie charts, distributions of column values, and a dozen other visualizations of data besides tables with text.

SCUBA USE CASES

  • Performance Monitoring
  • Trend Analysis
  • Pattern Mining

Data model

Scuba’s data model differs from the standard relational model in two ways. First, there is no create table statement; a table is formed on each leaf node whenever the leaf first receives data for it.
Since the leaves receive data at different times, the table may exist only on some leaves and may have a different schema.
Scuba presents a single table image to its users, however, despite the different schemas, by treating any missing columns as null values. Second, the columns within the table’s rows may be sparsely populated; it is common for there to be 2 or 3 different row schemas within a table or for a column to change its type over time (usually to achieve better compression). Together, these two differences let Scuba adapt tables to the needs of its users without any complex schema evolution commands or workflows. Such adaptation is one of Scuba’s strengths.

Query model

SELECT column, column, ...,
aggregate(column), aggregate(column), ...
FROM table
WHERE time >= min-timestamp
AND time <= max-timestamp
[AND condition ...]
GROUP BY column, column, ...
ORDER BY aggregate(column)
LIMIT number

The aggregate functions supported include the traditional count, min, max, sum, and average functions, as well as other useful functions such as sum/minute, percentiles, and histograms. The WHERE clause must contain a time range, although other conditions are optional. The LIMIT clause defaults to 100,000 rows to avoid memory issues in grouping and rendering problems at the client. The GROUP BY and ORDER BY clauses are wholly optional.

Joins are not supported in Scuba. When combining data from multiple sources is necessary, joining is usually done before importing the data into Scuba.