Shared resource lock powered by Clickhouse

Background

One of the core features of the qryn project is simplicity of setup. We do our best to keep the infrastructural complexity of the project as low as possible. The desired setup is just Clickhouse and one (or many) stateless workers connected to it. From this perspective, that means a set of components we would normally rely on has to be avoided or replaced:

  • service registry
  • shared context storage
  • message queue

Let's talk about one of the service registry's building blocks, the resource lock, and our implementation of it on top of Clickhouse.

Resource lock


First of all, let's recap what a resource lock is.

Let's assume there's a shared resource, a printer for example, and a user Jack wants to print three documents. The documents must be printed strictly in a certain order: document 1, document 2, document 3, with no other files printed in between.

What does Jack have to do? He puts a big sign saying "Please don't touch the printer" in front of it, then sends all three documents, and finally removes the sign for the next person. Of course he should also reboot the printer and cancel his jobs if something goes wrong, but let's not dig into transactions. Just locks.

Let's call this sign a shared resource lock. Inside a single application there's a standard primitive for this purpose called a mutex, but for cross-application resource locking things get noticeably harder.
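For comparison, here is a minimal sketch of the in-process case using Go's standard sync.Mutex (the printer names and functions are purely illustrative):

package main

import (
    "fmt"
    "sync"
)

var printerMu sync.Mutex // the in-process "sign" on the printer

// printAll holds the lock for the whole batch, so no other goroutine
// can sneak a file in between Jack's documents.
func printAll(docs []string) {
    printerMu.Lock()         // put up the sign
    defer printerMu.Unlock() // take it down when done
    for _, d := range docs {
        fmt.Println("printing", d)
    }
}

func main() {
    printAll([]string{"document 1", "document 2", "document 3"})
}

This only works inside one process; the moment two separate applications share the printer, sync.Mutex can no longer help.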

And there's a challenge: to implement a shared resource lock in a database not designed to have any locks at all.

Good old PostgreSQL days


I came to Clickhouse from PostgreSQL, which has a fast and handy feature for exactly this purpose: the advisory lock. A user can claim a lock on any bigint with the pg_advisory_lock function (a quick sketch follows the list below), and that bigint is considered "locked" until:

  • the user unlocks it with pg_advisory_unlock
  • the user's session is disconnected
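A minimal sketch of how that looks from Go, assuming database/sql with the pgx stdlib driver (the DSN and the lock key 42 are illustrative):

package main

import (
    "context"
    "database/sql"
    "log"

    _ "github.com/jackc/pgx/v5/stdlib" // assumed Postgres driver; any database/sql driver works
)

func main() {
    db, err := sql.Open("pgx", "postgres://user:pass@localhost:5432/db") // illustrative DSN
    if err != nil {
        log.Fatal(err)
    }
    // Advisory locks are session-scoped, so pin a single connection.
    conn, err := db.Conn(context.Background())
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // Blocks until lock 42 is ours; released by pg_advisory_unlock or when the session ends.
    if _, err := conn.ExecContext(context.Background(), "SELECT pg_advisory_lock(42)"); err != nil {
        log.Fatal(err)
    }
    // ... use the shared resource ...
    conn.ExecContext(context.Background(), "SELECT pg_advisory_unlock(42)")
}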

Unfortunately Clickhouse doesn't have this feature and it's not planned.

What can be locked in Clickhouse


But how can we lock anything at all, or at least get two simultaneous requests queued one after another?

After a close look at the documentation I found a table engine named TinyLog that locks writes. It was not a bad start, so I created something resembling a locking queue, but the module turned out to be monstrous.

It was a quadruple-buffered message queue with a custom timeout-based cleanup mechanism. "Not great, but better than nothing," I thought, and packaged it into a module, but the simple and elegant solution was too obvious to see.

The final solution


Of course, there is another mechanism in Clickhouse that involves locking: the table namespace itself:

  • two simultaneous CREATE TABLE requests for the same name cannot both succeed; one of them fails with a "table already exists" error
  • the presence of a table can therefore serve as a flag that the resource is busy.

And that can be used as a lock mechanism. There's one remaining problem: the locking table will not be removed if the locking client crashes in the middle, so we can end up in a deadlock.

The cure is another experimental Clickhouse feature named Live View. A live view created WITH TIMEOUT is dropped automatically a given number of seconds after its last watcher disconnects.

So the overall plan is as follows:

  1. Create an empty table _none with one integer field that nobody will ever write to.
  2. When you want to claim a resource, run: CREATE LIVE VIEW _my_resource_lock WITH TIMEOUT 2 AS SELECT * FROM _none;
  3. If the CREATE request succeeds, run WATCH _my_resource_lock to keep the view alive while you hold the lock.
  4. If the CREATE request returns a "table already exists" error, the lock is taken; retry after a short delay.
  5. If the CREATE request returns any other error, something went wrong.

The code of the solution in Go:

import (
    "context"
    "fmt"
    "strings"
    "time"

    "github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

var (
    db driver.Conn
)

// Init stores the connection and creates the empty _none table
// that every lock's live view will select from.
func Init(_db driver.Conn) {
    db = _db
    err := db.Exec(context.Background(), "CREATE TABLE IF NOT EXISTS _none (i Int64) ENGINE = Null")
    if err != nil {
        panic(err)
    }
}


// Mutex is a cross-application lock backed by a Clickhouse live view.
type Mutex struct {
    Id     string
    cancel context.CancelFunc
}

// Lock blocks until the live view _mtx_<Id> is created, then keeps a WATCH
// query open so the view stays alive until Unlock is called.
func (m *Mutex) Lock(ctx context.Context) error {
    if ctx == nil {
        ctx = context.Background()
    }
    createLock := fmt.Sprintf("CREATE LIVE VIEW _mtx_%s WITH TIMEOUT 2 AS SELECT * FROM _none", m.Id)
    err := db.Exec(ctx, createLock)
    for err != nil {
        // "already exists" means the lock is held by someone else; anything else is unexpected
        if !strings.Contains(err.Error(), "already exists") {
            fmt.Printf("mutex error: %v\n", err)
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(time.Millisecond * 100):
            err = db.Exec(ctx, createLock)
        }
    }
    // The WATCH query keeps the live view from expiring while we hold the lock.
    queryCtx, cancel := context.WithCancel(context.Background())
    m.cancel = cancel
    rows, err := db.Query(queryCtx, fmt.Sprintf("WATCH _mtx_%s", m.Id))
    if err != nil {
        cancel()
        return err
    }
    go func() {
        for rows.Next() {
        }
    }()
    return nil
}

// Unlock stops the WATCH query and drops the live view so other clients can claim the lock.
func (m *Mutex) Unlock() error {
    m.cancel()
    to, cancel := context.WithTimeout(context.Background(), time.Millisecond*100)
    defer cancel()
    db.Exec(to, fmt.Sprintf("DROP VIEW _mtx_%s", m.Id))
    return nil
}
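
A minimal usage sketch on top of the module above (printDocuments and the "printer" id are illustrative; Init must be called with a Clickhouse connection first):

func printDocuments(ctx context.Context) error {
    mtx := Mutex{Id: "printer"} // becomes the live view _mtx_printer
    if err := mtx.Lock(ctx); err != nil {
        return err // ctx was cancelled before the lock could be taken
    }
    defer mtx.Unlock()
    // ... send document 1, 2, 3 to the shared printer ...
    return nil
}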

Testing

Let's do some load testing to check whether this is an acceptable solution and how many resources it consumes:

import (
    "context"
    "fmt"
    clickhouse_v2 "github.com/ClickHouse/clickhouse-go/v2"
    "sync"
    "testing"
    "time"
)

func TestClickhouseLock(t *testing.T) {
    opt := &clickhouse_v2.Options{
        Addr:            []string{"127.0.0.1:9000"},
        Debug:           false,
        DialTimeout:     time.Second,
        MaxOpenConns:    100,
        MaxIdleConns:    2,
        ConnMaxLifetime: time.Hour,
        Settings: map[string]interface{}{
            // live views are experimental, so they have to be enabled explicitly
            "allow_experimental_live_view": "1",
        },
    }
    db, err := clickhouse_v2.Open(opt)
    if err != nil {
        t.Fatal(err)
    }
    Init(db)
    mtx := Mutex{Id: "mtx1"}
    // j must never be observed above zero if the lock really serializes the goroutines.
    var j int
    wg := sync.WaitGroup{}
    for i := 0; i < 100; i++ {
        wg.Add(1)
        go func(_i int) {
            defer wg.Done()
            if err := mtx.Lock(context.Background()); err != nil {
                panic(err)
            }
            defer mtx.Unlock()
            if j > 0 {
                panic("LOCKER DOESN'T WORK!!!!!!")
            }
            fmt.Printf("Inc %d\n", _i)
            j++
            j--
            fmt.Printf("Dec %d\n", _i)
        }(i)
    }
    wg.Wait()
}

go test results:

$ go test
Inc 0
Dec 0
Inc 24
Dec 24
Inc 7
Dec 7
Inc 62
Dec 62
Inc 76
Dec 76
...
PASS
ok      2.431s

The output of top | egrep during the test:

$ top | egrep '(go|clickhouse)'
  36231 systemd+  20   0 8330960 618928 381500 S   3,3   1,9   0:16.32 clickhouse-serv                                                                                                                             
  36231 systemd+  20   0 8355536 536360 381560 S  29,2   1,6   0:17.21 clickhouse-serv                                                                                                                             
  37280 akvlad    20   0 1750984  24040  10708 S  18,0   0,1   0:00.55 go                                                                                                                                          
  36231 systemd+  20   0 8355536 537348 381560 S  73,8   1,6   0:19.44 clickhouse-serv 
  36231 systemd+  20   0 8355536 534060 381560 S   9,9   1,6   0:19.74 clickhouse-serv

As we can see, the Clickhouse CPU usage jumped from 29% to 73% while sorting out 100 simultaneous locking clients. Not great, but not bad for a workaround. Used carefully, it can work.

Conclusion

There are a lot of interesting undocumented, unconventional ways to use Clickhouse that can make developers' lives easier. And now qryn has a nice cross-application mutex.

That's all for today. We are still in search of a shared context storage and a message queue over Clickhouse, so there will be new challenges and new blog posts.
