Wrapping a Serverless ClickHouse pt.3
Introducing chdb: in-memory OLAP Engine based on ClickHouse
Meet chdb: the in-memory OLAP Engine based on ClickHouse
Built on the solid foundations of ClickHouse, chdb is a lightning-fast OLAP engine designed for in-memory processing and native binding into popular languages.
Think of DuckDB, but ClickHouse-powered.
In a nutshell, chdb offers all of the ClickHouse columnar functions to retrieve, transform, and analyze local and remote data using SQL, and it works with popular data formats such as Parquet, Arrow, CSV, and JSON. Distributed as a library, it requires no server installation and offers native bindings for running native-speed queries from Python, Go, Rust, Node and Bun.
How fast is it? Check out the ClickBench benchmark for a detailed answer.
Features
In-process SQL OLAP Engine, powered by ClickHouse
No need to install ClickHouse
Minimized data copy from C++ to Python with Python memoryview
Input and output support for Parquet, CSV, JSON, Arrow, ORC and 60+ more formats
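To make the memoryview point above a little more concrete, here is a small standard-library sketch. It does not touch chdb at all; the buffer contents and variable names are made up for illustration. It only shows why handing out a memoryview is cheaper than handing out a copy: the view shares memory with the underlying buffer.

```python
# Stand-in for a result buffer owned by native code (hypothetical contents).
buf = bytearray(b"id,name\n1,alice\n2,bob\n")

view = memoryview(buf)
header_view = view[:7]        # slicing a memoryview shares memory, no copy
header_copy = bytes(buf[:7])  # slicing the bytearray and calling bytes() copies

# Mutate the underlying buffer in place (same length, so the view stays valid).
buf[0:2] = b"ID"

print(bytes(header_view))  # the view sees the change
print(header_copy)         # the copy does not
```

The same idea applies to large query results: reading through a view avoids duplicating the data in Python memory until you actually need a copy.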
Backstory
Many have tried and (mostly) failed to build a library binding around ClickHouse. If you follow our blog, you know we've been building our own musl-based clickhouse-local version too. Our friend Auxten, however, patiently pieced together a working binding builder and library wrapper, paving the way for chdb and libchdb.
Well done Auxten, and thanks for letting us join the chdb development group!
Python
chdb is available as a module for Python 3.7+ on macOS and Linux (amd64/aarch64)
pip install chdb
That's it. You can immediately start using chdb from your Python code:
import chdb
res = chdb.query('select version()', 'Pretty')
print(res.data())
You can use chdb to instantly access and analyze Parquet, CSV, DataFrame, or any other supported format, either from local files or from any S3-compatible object storage:
# Query local files in any format
res = chdb.query('select * from file("data.parquet", Parquet)', 'JSON')
print(res.data())
# For large results, use get_memview() to avoid an extra data copy.
res = chdb.query('select * from file("data.csv", CSV)', 'CSV')
print(str(res.get_memview().tobytes()))
# Use the Dataframe output format natively from Python
chdb.query('select * from file("data.parquet", Parquet)', 'Dataframe')
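The snippets above only read local files. As a rough sketch of how the same call could be pointed at object storage, the helper below builds either a `file()` or an `s3()` table expression from a path or URL. The helper names (`table_expr`, `count_rows`) and the example bucket URL are hypothetical, and treating every HTTP(S) URL as an S3-compatible endpoint is a simplifying assumption. Private buckets would also need credentials, which this sketch does not handle.

```python
def table_expr(source, fmt):
    """Build a ClickHouse table-function call for a local path or an S3-style URL.

    Hypothetical helper: any http(s) URL is assumed to be S3-compatible storage.
    """
    if not fmt.isalnum():
        raise ValueError("unexpected format name: %r" % fmt)
    # Escape backslashes and single quotes for a ClickHouse string literal.
    quoted = "'" + source.replace("\\", "\\\\").replace("'", "\\'") + "'"
    if source.startswith(("http://", "https://")):
        return "s3(%s, '%s')" % (quoted, fmt)
    return "file(%s, '%s')" % (quoted, fmt)


def count_rows(source, fmt, output="CSV"):
    """Run a row count over the source with chdb and return the raw result data."""
    import chdb  # imported lazily so table_expr() can be used without chdb installed

    return chdb.query("select count(*) from " + table_expr(source, fmt), output).data()
```

For example, `table_expr("data.parquet", "Parquet")` yields `file('data.parquet', 'Parquet')`, while `table_expr("https://example-bucket.s3.amazonaws.com/data.csv", "CSV")` yields an `s3(...)` call over the same kind of query.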
You can also invoke chdb from the CLI using Python. Let's query a remote ClickHouse server:
python3 -m chdb "select count(*) from remoteSecure('play.clickhouse.com:9440','default.covid','play','play')"
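If you want to drive that CLI from a script, a minimal sketch could look like the following. The function names are hypothetical; the only thing taken from above is the `python3 -m chdb "<sql>"` invocation. Passing the query as a single list element sidesteps shell quoting, which matters for queries full of nested quotes like the `remoteSecure(...)` call above.

```python
import subprocess
import sys


def chdb_cli_command(sql, python=sys.executable):
    """Return the argv list equivalent to: python3 -m chdb "<sql>"."""
    return [python, "-m", "chdb", sql]


def run_chdb_cli(sql):
    """Run the chdb CLI and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(
        chdb_cli_command(sql),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.7+)
        check=True,
    )
    return completed.stdout
```

Calling `run_chdb_cli("select version()")` requires chdb to be installed for the interpreter in `sys.executable`.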
If you love ClickHouse and want to use its powerful functions, the sky's the limit!
More Bindings: Go, Rust, Node, Bun
Not into Python? No worries! chdb bindings are also available for Go, Rust, Node and Bun.
Before using the bindings, you will have to install libchdb on your system using the provided deb or rpm packages (currently only for Linux/amd64 with a modern GLIBC).
The bindings invoke the library for full-speed processing even when using NodeJS. Using ClickHouse from within your application has never been easier!
Public Demo
You can use chdb to build anything, including serverless query processors. Our public demo runs on a 100% free 1xCPU/256MB RAM instance on fly.io and pretends to be a ClickHouse HTTP server, including the "Play" user interface.
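The demo's own code is not shown here. As a rough idea of what "pretending to be a ClickHouse HTTP server" can mean, here is a minimal standard-library sketch that accepts a `query` URL parameter or a POST body and answers with chdb's CSV output. All names (`extract_sql`, `ChdbHandler`, `serve`) are hypothetical, the fixed CSV output is a simplification, and there is no authentication, no "Play" UI, and no query sandboxing.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def extract_sql(path, body=b""):
    """Take SQL from the ?query= parameter, falling back to the request body."""
    params = parse_qs(urlparse(path).query)
    from_url = params.get("query", [""])[0]
    return (from_url or body.decode("utf-8")).strip()


class ChdbHandler(BaseHTTPRequestHandler):
    def _respond(self, body=b""):
        sql = extract_sql(self.path, body)
        status, payload = 200, b"Ok.\n"  # reply used when no query is given
        if sql:
            try:
                import chdb  # lazy import: only needed once a query arrives

                data = chdb.query(sql, "CSV").data() or b""
                payload = data.encode("utf-8") if isinstance(data, str) else bytes(data)
            except Exception as exc:  # report any failure to the client as text
                status, payload = 500, (str(exc) + "\n").encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self):
        self._respond()

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._respond(self.rfile.read(length))


def serve(host="127.0.0.1", port=8123):
    """Block forever, serving queries on ClickHouse's usual HTTP port."""
    HTTPServer((host, port), ChdbHandler).serve_forever()
```

Call `serve()` to start it, then try something like `curl 'http://127.0.0.1:8123/?query=select%20version()'`.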
Project Status
chdb is still experimental and under heavy development, but already fully functional. The library size is not negligible at ~100MB compressed (~380MB uncompressed), but it's optimized for low memory usage and operates at native speed with a minimal footprint of just ~100MB, allowing it to run in places where ClickHouse cannot.
There's so much more on the horizon for chdb and the ClickHouse community!
Are you skilled in C/C++ or interested in helping evolve the Go/Rust/Node bindings? Join us and help with the development of chdb at https://github.com/chdb-io