
Generalize listen/notify as a pubsub backend #6805

@pedro-psb

Description


This summarizes the known state of the problem with postgres listen/notify and presents some approaches for solving it. Please share thoughts on any missing parts and give opinions on which route to take.

Related issues/jiras:

Summary

Problems with listen/notify:

  1. Degrades connection pooling effectiveness (ref-00): listening on a postgres channel causes connection pinning on RDS Proxy. Notifying doesn't.
  2. Notify might be interfering with performance (ref-01): NOTIFY acquires a global lock, which might degrade performance with multiple concurrent writes. We don't know whether Pulp is being hit by this. Also, because we can't target notifications at specific workers (everybody receives the data), the pubsub might be generating too much db activity (not proven to be a problem).

Current channels: ref-02

  • pulp_worker_wakeup: workers get notified to wake up when resources become available. Notify can be called by any component that can dispatch tasks (every Pulp component) or request task cancelation (only the API?). In the worker, notify is called on task completion or in the canceling logic.
  • pulp_worker_metrics_heartbeat: only used when otel is enabled. Notify can be called by any worker at every heartbeat; workers race for the metrics notify lock.
  • pulp_worker_cancel: Notify is called by the API.

Currently, only worker components listen for notifications. A minimal sketch of how this flow looks at the database level is shown below.
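For context, this is illustrative only (not Pulp's actual code), assuming a psycopg 3 connection and the channel names from the list above:

import psycopg

conn = psycopg.connect("dbname=pulp", autocommit=True)
conn.execute("LISTEN pulp_worker_wakeup")  # this LISTEN is what pins the connection on RDS Proxy

# Any component holding a db connection can notify; notifying does not pin.
conn.execute("NOTIFY pulp_worker_wakeup")

# The worker blocks on the notification stream and wakes up on each message.
for notification in conn.notifies():
    ...  # e.g. look for unblocked tasks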

Additional context on RDS Proxy

RDS Proxy sits between the client (in our case, the tasking workers) and the database. Connections between RDS Proxy and the database are called database connections; connections between the client and RDS Proxy are called client connections. The whole idea is to multiplex client connections over database connections. Under some conditions (e.g. listening on a postgres channel), a client connection gets pinned to a database connection, preventing that database connection from being reused by other client connections.

Learn more in ref-00.

General Approach

Define a general API for pubsub which can use different backends.
That allows installations that don't need connection pooling to keep using postgres listen/notify.

An example is provided below:

# Start the pubsub client on a thread.
# Defaults to the pg listen/notify implementation.
pubsub = PGPubSub()
# Alternative backends:
# pubsub = RedisPubSub()
# pubsub = EtcdPubSub()

pubsub.subscribe("channel", callback)
pubsub.unsubscribe("channel")
pubsub.publish("channel", "optional-msg")
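In practice, the general API would likely be an abstract base class that each backend implements. A hedged sketch, with illustrative names only (nothing here is an agreed-upon interface):

from abc import ABC, abstractmethod
from typing import Callable

class PubSub(ABC):
    """Backend-agnostic pubsub interface; backends run their own listener thread."""

    @abstractmethod
    def subscribe(self, channel: str, callback: Callable[[str], None]) -> None:
        """Invoke callback with the payload of every message on channel."""

    @abstractmethod
    def unsubscribe(self, channel: str) -> None:
        """Stop delivering messages for channel."""

    @abstractmethod
    def publish(self, channel: str, message: str = "") -> None:
        """Send an optional message to all subscribers of channel."""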

Specific Implementations considered

There are some centralized solutions, some distributed ones, and some mixed.

1. redis pubsub

https://redis.io/docs/latest/develop/pubsub/

Workers listen/notify on redis channels; see the sketch after the pros/cons list below.

  • Pros:
    • known technology
    • simple
  • Cons:
    • single-point of failure
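A minimal sketch of the redis backend, assuming the redis-py client (host, port, channel and payload are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379)
p = r.pubsub()

def on_wakeup(message):
    ...  # message["data"] carries the payload

# Deliver messages to the callback from a background thread.
p.subscribe(**{"pulp_worker_wakeup": on_wakeup})
thread = p.run_in_thread(sleep_time=0.5)

# Any component can publish without holding a subscription.
r.publish("pulp_worker_wakeup", "resources-available")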

2. etcd cluster

https://python-etcd3.readthedocs.io/en/latest/readme.html

Workers watch/update specific keys (used as channels); see the sketch after the pros/cons list below. The setup requires an etcd service running on each node to handle the kv-store replication. etcd stores a write-ahead log on disk and exposes an API that the worker talks to (directly or via a client).

It's overkill just for listen/notify, but its capabilities can make it easier to offload and improve other coordination tasks, such as various locks (unblocking, recording metrics, scheduling), and possibly reduce workers racing for writes and task lookups on the db.

  • Pros:
    • empowers the tasking system with distributed coordination primitives (e.g. builtin locks, leases and (re)election)
    • high availability / fault tolerance (logs replicated on each node through Raft)
    • Relatively lightweight and robust
  • Cons:
    • unfamiliar technology
    • adds deployment complexity: all components would require access to an etcd instance on their node (to be able to 'notify', for example)
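A minimal sketch of the etcd backend, assuming the python-etcd3 client linked above (key names are illustrative; a watch on a key plays the role of a channel subscription):

import etcd3

client = etcd3.client(host="localhost", port=2379)

def on_event(event):
    ...  # called for every change to the watched key

# Subscribing is watching a key; publishing is a plain put on it.
watch_id = client.add_watch_callback("/pulp/worker_wakeup", on_event)
client.put("/pulp/worker_wakeup", "resources-available")

# Later, to unsubscribe:
client.cancel_watch(watch_id)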

There are other similar distributed services that use consensus algorithms.

3. postgres-websockets

https://github.com/diogob/postgres-websockets

Workers talk to the websocket middleware provided by the postgres-websockets service. It uses postgres listen/notify under the hood, but since workers talk to the middleware instead of postgres, it doesn't degrade connection pooling. A client-side sketch follows the pros/cons list below.

  • Pros:
    • We continue to use notify as we do today
  • Cons:
    • single point of failure
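A rough sketch of the worker side, assuming the websockets Python library and a postgres-websockets server on localhost:3000 (the URI layout is an assumption and auth is omitted; see the project's README):

import asyncio
import websockets

async def listen(channel: str):
    # JWT/auth handling omitted; see the postgres-websockets README.
    uri = f"ws://localhost:3000/{channel}"
    async with websockets.connect(uri) as ws:
        async for message in ws:
            ...  # hand the payload to the pubsub callback

asyncio.run(listen("pulp_worker_wakeup"))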

4. custom websocket-based solution

Have a leader connected to all workers through websockets. The leader is the only component listening for postgres notifications. Its connection will be 'pinned' by the connection pool, but there is always just the one, no matter how many workers. A sketch of the leader side follows the pros/cons list below.

As a variant without any postgres listen/notify, components that need to notify
could be given access to the workers' network and talk directly to the current leader.

  • Pros:
    • No new components required
    • high availability (assuming a robust re-election implementation)
  • Cons:
    • Complexity of handling leader election and consensus
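A rough sketch of the leader side, again assuming the websockets library (leader election and the postgres LISTEN loop are omitted; everything here is illustrative):

import asyncio
import websockets

connected = set()

async def handler(ws):
    # Each worker holds one connection to the leader.
    connected.add(ws)
    try:
        await ws.wait_closed()
    finally:
        connected.discard(ws)

def fan_out(payload: str):
    # Relay a postgres notification to every connected worker.
    websockets.broadcast(connected, payload)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever; the LISTEN loop would call fan_out()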

Other options considered

  • Worker polling for updates: this brings us back to the problem of a herd of workers stressing the database.
