Serverless OLAP with Colab + chdb

Serverless, Persistent ClickHouse on Colab + Google Drive

In our previous episodes, we introduced chdb, a 100% compatible, ClickHouse-powered engine for embedded OLAP workloads. Like DuckDB, but for ClickHouse.

One of our personal goals was to create a completely serverless ClickHouse: the ability to run ClickHouse queries without any server setup, cloud services or idle costs. Project chdb makes this, and so much more, finally possible.

ClickHouse & Colab Notebooks

Amongst its many features, chdb allows executing fully-fledged ClickHouse SQL queries from Google Colab Notebooks. This was previously impossible due to the requirement for a ClickHouse server, but thanks to chdb it only takes a couple of lines of Python to get your favourite OLAP engine running inside your Notebooks.
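To see just how little it takes, here's a minimal stateless query using the chdb Python API, with no session or storage involved:

import chdb

# Run a one-off ClickHouse query in-process, no server required
print(chdb.query("SELECT version()", "Pretty"))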

There are many examples in the chdb repository leveraging Notebooks, but I noticed they were all stateless. Our embedded rocket on a bicycle can also support persistent sessions, and with Colab's support for Google Drive as storage, that gives us a free* and safe place for our data, with full access control and security.

Making an OLAP ColabHouse

Surprisingly, all it takes is a few lines of code and no external services!

πŸ‘‰ TLDR; Here's a Colab Notebook to try right now!

Step By Step

Let's start a new Colab project and install chdb using pip:

!pip install chdb --upgrade --quiet

Next, let's mount our Google Drive as database storage for our chdb session:

from google.colab import drive
drive.mount('/mount', force_remount=True)

Once Drive access is authorized, we can start a chdb session using it for persistence:

from chdb import session as chs

# Create a Session using Google Drive storage for persistence
sess = chs.Session('/mount/MyDrive/chdb')

# Create a database and a MergeTree table for our sample data
sess.query("CREATE DATABASE IF NOT EXISTS db_xxx ENGINE = Atomic;")
sess.query("CREATE TABLE db_xxx.download (when DateTime, userid UInt32, bytes Float32) ENGINE = MergeTree PARTITION BY toYYYYMM(when) ORDER BY (userid, when);")

# Insert random download rows for a fixed user and for random users
sess.query("INSERT INTO db_xxx.download SELECT now() + number * 60 AS when, 25, rand() % 100000000 FROM system.numbers LIMIT 150;")
sess.query("INSERT INTO db_xxx.download SELECT now() + number * 60 AS when, randUniform(1, 24), rand() % 100000000 FROM system.numbers LIMIT 150;")

# Build a Materialized View summarizing daily downloads per user
sess.query("CREATE MATERIALIZED VIEW db_xxx.download_daily_mv ENGINE = SummingMergeTree PARTITION BY toYYYYMM(day) ORDER BY (userid, day) POPULATE AS SELECT toStartOfDay(when) AS day, userid, count() AS downloads, sum(bytes) AS bytes FROM db_xxx.download GROUP BY userid, day;")

It's that simple. Did it work? It did, of course! Let's go and read our data the cool way, using our new serverless Materialized View to get a summary of our dataset:

print(sess.query("SELECT * FROM db_xxx.download_daily_mv FINAL ORDER BY bytes DESC LIMIT 5", "Pretty"))

┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃                 day ┃ userid ┃ downloads ┃      bytes ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ 2023-10-12 00:00:00 │     25 │       150 │ 6934346798 │
├─────────────────────┼────────┼───────────┼────────────┤
│ 2023-10-12 00:00:00 │     13 │        10 │  583895472 │
├─────────────────────┼────────┼───────────┼────────────┤
│ 2023-10-12 00:00:00 │     20 │         9 │  501303012 │
├─────────────────────┼────────┼───────────┼────────────┤
│ 2023-10-12 00:00:00 │     21 │         7 │  500541088 │
├─────────────────────┼────────┼───────────┼────────────┤
│ 2023-10-12 00:00:00 │      8 │        12 │  493559005 │
└─────────────────────┴────────┴───────────┴────────────┘

The above is just an example to get you started; make it do whatever you want. SQL queries, Engines and Functions are 100% compatible with ClickHouse™️.
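Since we're inside a Notebook, it's also natural to pull results straight into Pandas for plotting. A minimal sketch, assuming the Dataframe output format available in recent chdb releases:

# Fetch the Materialized View summary as a Pandas DataFrame
df = sess.query("SELECT * FROM db_xxx.download_daily_mv FINAL ORDER BY bytes DESC", "Dataframe")

# Plot total bytes per user right inside Colab
df.plot.bar(x='userid', y='bytes')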

Sweet and Simple 🍯

Our Colab queries run transparently, and our chdb sessions persist across executions of the same notebook or from other Colab notebook projects.

Let's take a look at the data structure chdb/ClickHouse has created in our drive:

import os
os.listdir('/mount/MyDrive/chdb')

['user_defined',
 'data',
 'metadata',
 'metadata_dropped',
 'tmp',
 'user_scripts',
 'store']

There's our ClickHouse/chdb data, ready to be reattached for our next execution!
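Reattaching in a fresh runtime is just a matter of mounting Drive again and pointing a new session at the same path:

from google.colab import drive
from chdb import session as chs

# Remount Drive and reopen the same persistent session directory
drive.mount('/mount', force_remount=True)
sess = chs.Session('/mount/MyDrive/chdb')

# Our tables and Materialized Views are right where we left them
print(sess.query("SHOW TABLES FROM db_xxx", "Pretty"))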

Note this approach is not necessarily intended for storing lots of data, but rather metadata referring to remote tables mounted on S3/R2 or URL engine resources we might want to access, and aggregations from Materialized Views we might want to store.
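Here's a sketch of that pattern: only the table definition and a compact aggregate live on Drive, while the raw data stays remote. The bucket URL below is hypothetical; point it at your own S3/R2 data:

# Hypothetical remote dataset: only this table's metadata is stored on Drive
sess.query("CREATE TABLE IF NOT EXISTS db_xxx.remote_logs (when DateTime, bytes UInt64) ENGINE = S3('https://mybucket.s3.amazonaws.com/logs/*.parquet', 'Parquet');")

# Aggregate the remote data and persist only the small daily summary locally
sess.query("CREATE TABLE IF NOT EXISTS db_xxx.remote_daily ENGINE = SummingMergeTree ORDER BY day AS SELECT toStartOfDay(when) AS day, sum(bytes) AS bytes FROM db_xxx.remote_logs GROUP BY day;")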

You can name your session directories however you like, or hash values to produce session folders from names or HTTP authentication tokens, as we do in our chdb-server API emulator.
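A minimal sketch of that idea, deriving a stable session folder from a token (the token below is just a placeholder):

import hashlib
from chdb import session as chs

# Placeholder token: hash it into a stable, opaque folder name
token = 'my-secret-api-token'
folder = hashlib.sha256(token.encode()).hexdigest()[:16]

# Each token gets its own isolated, persistent chdb session on Drive
sess = chs.Session('/mount/MyDrive/chdb-sessions/' + folder)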

TLDR; Here's a ready notebook you can run with your Google account.

Thanks, chdb! This is the cloud service we dreamed of. Go enjoy your free ClickHouse instance with persistent storage and total data access control!
