6 changes: 6 additions & 0 deletions ydb/docs/en/core/concepts/glossary.md
@@ -440,6 +440,12 @@ A **shared cache** is an [actor](#actor) that stores data pages recently accesse

A **memory controller** is an [actor](#actor) that manages {{ ydb-short-name }} [memory limits](../reference/configuration/index.md#memory-controller).

### Spilling {#spilling}

**Spilling** is a memory management mechanism in {{ ydb-short-name }} that temporarily offloads intermediate query data to external storage when that data exceeds the available RAM on a node. In {{ ydb-short-name }}, disk storage is currently used for spilling.

For more details on spilling, see [{#T}](spilling.md).

### Tablet types {#tablet-types}

[Tablets](#tablet) can be considered a framework for building reliable components operating in a distributed system. {{ ydb-short-name }} has multiple components implemented using this framework, listed below.
85 changes: 85 additions & 0 deletions ydb/docs/en/core/concepts/spilling.md
@@ -0,0 +1,85 @@
# Spilling

## Spilling in General

**Spilling** is a memory management mechanism that temporarily offloads intermediate computation data to external storage when that data no longer fits in the available RAM on a node. In {{ ydb-short-name }}, disk storage is currently used for spilling. Spilling makes it possible to execute user queries that process data volumes larger than the available node memory.

In data processing systems, including {{ ydb-short-name }}, spilling is essential for:

- processing queries with large data volumes when intermediate results do not fit in RAM
- executing complex analytical operations (aggregations, table joins) over large datasets
- optimizing query performance by materializing part of the intermediate data in external storage, which in some scenarios reduces overall execution time


Spilling operates based on the memory hierarchy principle:

1. **Random Access Memory (RAM)** — fast but limited.
2. **External storage** — slower but more capacious.

When memory usage approaches the limit, the system:

- serializes part of the data
- saves it to external storage
- frees the corresponding memory
- continues processing the query using data remaining in memory
- loads data back into memory, when necessary, to continue computations (a simplified code sketch of this cycle follows the list)
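
The cycle above can be sketched in C++ as follows. This is a minimal, synchronous sketch for illustration only: the `ISpillingBackend` and `TSpillableBuffer` names and the byte-based limit are assumptions rather than actual {{ ydb-short-name }} classes, and the real interaction with external storage in {{ ydb-short-name }} is asynchronous.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Illustrative external-storage interface; not the actual Spilling Service API.
struct ISpillingBackend {
    virtual ~ISpillingBackend() = default;
    virtual std::uint64_t Save(std::string blob) = 0;    // store a blob, return its id
    virtual std::string Load(std::uint64_t blobId) = 0;  // read the blob back
};

// A buffer of serialized rows that offloads data once its memory limit is exceeded.
class TSpillableBuffer {
public:
    TSpillableBuffer(ISpillingBackend& backend, std::size_t memoryLimitBytes)
        : Backend(backend)
        , MemoryLimit(memoryLimitBytes)
    {}

    void Add(std::string row) {
        UsedBytes += row.size();
        InMemory.push_back(std::move(row));
        while (UsedBytes > MemoryLimit && !InMemory.empty()) {
            // Take a chunk out of memory, hand it to external storage, free the memory.
            std::string chunk = std::move(InMemory.front());
            InMemory.pop_front();
            UsedBytes -= chunk.size();
            SpilledIds.push_back(Backend.Save(std::move(chunk)));
        }
        // Processing continues with the rows that are still in memory.
    }

    // When the offloaded data is needed again, it is loaded back on demand.
    std::vector<std::string> RestoreSpilled() {
        std::vector<std::string> rows;
        for (std::uint64_t id : SpilledIds) {
            rows.push_back(Backend.Load(id));
        }
        SpilledIds.clear();
        return rows;
    }

private:
    ISpillingBackend& Backend;
    std::size_t MemoryLimit;
    std::size_t UsedBytes = 0;
    std::deque<std::string> InMemory;
    std::vector<std::uint64_t> SpilledIds;
};
```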


## Spilling in {{ ydb-short-name }}

{{ ydb-short-name }} implements the spilling mechanism through the **Spilling Service**, an [actor service](glossary.md#actor-service) that provides temporary storage for data blobs. Spilling is only performed on [database nodes](glossary.md#database-node). Detailed technical information about it is available in [{#T}](../contributor/spilling-service.md).

### Types of Spilling in {{ ydb-short-name }}

{{ ydb-short-name }} implements two primary types of spilling that operate at different levels of the computational process:

* [Computation Spilling](#computation-spilling)
* [Transport Spilling](#transport-spilling)

These types work independently and can activate simultaneously within a single query, providing comprehensive memory management.

#### Computation Spilling {#computation-spilling}

{{ ydb-short-name }} compute cores automatically offload intermediate data to disk when executing operations that require significant memory. This type of spilling is implemented at the level of individual computational operations and activates when memory limits are reached.

Main usage scenarios:

* **Aggregations** — when grouping large data volumes, the system offloads intermediate hash tables to disk.
* **Join operations** — when joining large tables, the [Grace Hash Join](https://en.wikipedia.org/wiki/Hash_join#Grace_hash_join) algorithm is used with data partitioning and offloading to disk.

##### Operation Mechanism

Compute nodes contain specialized objects for monitoring memory usage. When the data volume approaches the configured limit, the following happens (a simplified sketch is shown after the list):

1. The system switches to spilling mode.
2. Data is serialized and divided into blocks (buckets).
3. Some of the blocks are transferred to the Spilling Service for disk storage.
4. Metadata about the location of the offloaded data is kept in memory.
5. The system continues processing the data remaining in memory, which frees additional space.
6. When necessary, the offloaded data is loaded back and processed.
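
Below is a minimal, synchronous C++ sketch of this bucket-based approach, in the spirit of the partitioning used by Grace Hash Join. The `TFakeSpillingService` and `TPartitionedState` types, the row-count limit, and the spill-the-largest-bucket policy are illustrative assumptions rather than the actual {{ ydb-short-name }} implementation, where the Spilling Service is accessed asynchronously.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-in for the Spilling Service: stores serialized buckets by id.
// The real service is asynchronous and stores blobs in files on disk.
struct TFakeSpillingService {
    std::uint64_t Save(std::vector<std::string> bucket) {
        Blobs[NextId] = std::move(bucket);
        return NextId++;
    }
    std::vector<std::string> Load(std::uint64_t id) {
        return Blobs.at(id);
    }

    std::uint64_t NextId = 0;
    std::map<std::uint64_t, std::vector<std::string>> Blobs;
};

// Hash-partitions rows into buckets and offloads whole buckets when the number of
// rows held in memory exceeds the limit. Only metadata (bucket index -> blob ids)
// is kept in memory for the offloaded part.
class TPartitionedState {
public:
    TPartitionedState(TFakeSpillingService& service, std::size_t bucketCount, std::size_t rowLimit)
        : Service(service)
        , Buckets(bucketCount)
        , RowLimit(rowLimit)
    {}

    void Add(std::string row) {
        std::size_t idx = std::hash<std::string>{}(row) % Buckets.size();
        Buckets[idx].push_back(std::move(row));
        if (++RowsInMemory > RowLimit) {
            SpillLargestBucket();
        }
    }

    // In-memory data is processed as is; spilled parts are restored bucket by bucket.
    std::vector<std::string> RestoreBucket(std::size_t idx) {
        std::vector<std::string> rows = Buckets[idx];
        for (std::uint64_t id : SpilledParts[idx]) {
            std::vector<std::string> part = Service.Load(id);
            rows.insert(rows.end(), part.begin(), part.end());
        }
        return rows;
    }

private:
    void SpillLargestBucket() {
        std::size_t victim = 0;
        for (std::size_t i = 1; i < Buckets.size(); ++i) {
            if (Buckets[i].size() > Buckets[victim].size()) {
                victim = i;
            }
        }
        RowsInMemory -= Buckets[victim].size();
        SpilledParts[victim].push_back(Service.Save(std::exchange(Buckets[victim], {})));
    }

    TFakeSpillingService& Service;
    std::vector<std::vector<std::string>> Buckets;
    std::size_t RowLimit;
    std::size_t RowsInMemory = 0;
    std::map<std::size_t, std::vector<std::uint64_t>> SpilledParts;
};
```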


#### Transport Spilling {#transport-spilling}

This type of spilling operates at the level of data transfer between query execution stages. The system automatically buffers data and offloads it when the transfer buffers fill up. This avoids blocking data-producing operations even when the receiving operations are not yet ready to accept data.

##### Operation Mechanism

The data transfer system continuously monitors its own state (a simplified sketch is shown after the list):

1. **Buffering**: Incoming data accumulates in the transfer system's internal buffers.
2. **Fill control**: The system tracks buffer fill levels.
3. **Automatic spilling**: When limits are reached, data is automatically serialized and transferred to the Spilling Service.
4. **Continued operation**: The transfer system continues accepting new data after memory has been freed.
5. **Recovery**: When the next stage is ready, data is read from external storage and passed on.
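
A minimal, synchronous C++ sketch of such a buffer is shown below. The `TChannelBuffer` and `TFakeSpillingService` names, the byte-based limit, and the spill-from-the-back policy are illustrative assumptions; in {{ ydb-short-name }}, the transfer buffers interact with the Spilling Service asynchronously through actor events.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <string>
#include <utility>
#include <variant>

// Illustrative stand-in for the Spilling Service (synchronous here; the real
// service is asynchronous and event-based).
struct TFakeSpillingService {
    std::uint64_t Save(std::string blob) {
        Blobs[NextId] = std::move(blob);
        return NextId++;
    }
    std::string Load(std::uint64_t id) {
        return Blobs.at(id);
    }

    std::uint64_t NextId = 0;
    std::map<std::uint64_t, std::string> Blobs;
};

// FIFO buffer between two query stages. The producer can keep pushing even when
// the consumer is slow: once the in-memory part exceeds the limit, buffered
// chunks are offloaded and only their blob ids stay in the queue.
class TChannelBuffer {
public:
    TChannelBuffer(TFakeSpillingService& service, std::size_t memoryLimitBytes)
        : Service(service)
        , MemoryLimit(memoryLimitBytes)
    {}

    void Push(std::string chunk) {
        InMemoryBytes += chunk.size();
        Queue.emplace_back(std::move(chunk));
        // Offload from the back of the queue so that its head stays in memory
        // and the consumer can read it without waiting for a disk read.
        for (auto it = Queue.rbegin(); InMemoryBytes > MemoryLimit && it != Queue.rend(); ++it) {
            if (std::string* data = std::get_if<std::string>(&*it)) {
                InMemoryBytes -= data->size();
                *it = Service.Save(std::move(*data));
            }
        }
    }

    bool Pop(std::string& out) {
        if (Queue.empty()) {
            return false;  // nothing buffered yet
        }
        if (std::string* data = std::get_if<std::string>(&Queue.front())) {
            InMemoryBytes -= data->size();
            out = std::move(*data);
        } else {
            out = Service.Load(std::get<std::uint64_t>(Queue.front()));
        }
        Queue.pop_front();
        return true;
    }

private:
    TFakeSpillingService& Service;
    std::size_t MemoryLimit;
    std::size_t InMemoryBytes = 0;
    std::deque<std::variant<std::string, std::uint64_t>> Queue;
};
```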

## Interaction with Memory Controller

When executing queries, {{ ydb-short-name }} tries to stay within the memory limit set by the [memory controller](../reference/configuration/index.md#memory-controller). Spilling is used to keep fitting within this limit as the volume of intermediate data grows. For more details, see the [Memory Management section](../reference/configuration/table_service_config.md#memory-management).

## See Also

- [Spilling Service](../contributor/spilling-service.md)
- [Spilling configuration](../reference/configuration/table_service_config.md)
- [{{ ydb-short-name }} monitoring](../devops/observability/monitoring.md)
- [Performance diagnostics](../troubleshooting/performance/index.md)
2 changes: 2 additions & 0 deletions ydb/docs/en/core/concepts/toc_i.yaml
@@ -16,6 +16,8 @@ items:
href: secondary_indexes.md
- name: Vector search
href: vector_search.md
- name: Spilling
href: spilling.md
- name: Change Data Capture (CDC)
href: cdc.md
when: feature_changefeed
86 changes: 86 additions & 0 deletions ydb/docs/en/core/contributor/spilling-service.md
@@ -0,0 +1,86 @@
# Spilling Service

## Overview

The **Spilling Service** is an [actor service](../concepts/glossary.md#actor-service) that provides temporary storage for data blobs in the {{ ydb-short-name }} system. The service operates as a key-value store where clients can save data using a unique identifier and later retrieve it with that identifier.
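
A possible C++ shape of this contract is sketched below. The `ISpillingStorage`, `TBlobId`, `Put`, and `Get` names are illustrative assumptions: the real service is accessed through actor-system events rather than direct function calls, and the asynchronous completion is modeled here with `std::future`.

```cpp
#include <cstdint>
#include <future>
#include <string>

// Hypothetical shape of the contract the Spilling Service offers to its clients;
// the names below are illustrative only.
struct TBlobId {
    std::uint64_t Value = 0;
};

class ISpillingStorage {
public:
    virtual ~ISpillingStorage() = default;

    // Store a serialized blob under a client-generated identifier; completes asynchronously.
    virtual std::future<void> Put(TBlobId id, std::string blob) = 0;

    // Read a previously stored blob back by its identifier.
    virtual std::future<std::string> Get(TBlobId id) = 0;
};
```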

## Architecture

### Main Components

- **Task queue**: The service maintains an internal queue of read and write operations. All spilling requests are placed in this queue and processed asynchronously.
- **Thread pool**: A pool of worker threads is used to perform I/O operations. The number of threads is [configurable](../reference/configuration/table_service_config.md#workerscount) and affects service performance.
- **File management**: The service automatically creates, deletes, and manages files on disk.
- **Resource monitoring**: The service monitors disk space usage, the number of active operations, and other performance metrics.

### Data Storage

Data is saved in files on the local file system. The Spilling Service ensures:

* distribution of records across files
* file deletion
* data lifecycle management

After an unexpected restart, obsolete files left over from the previous run are automatically deleted.

## Component Interaction

System components integrate with the Spilling Service and interact with it through actor-system events, as explained below.

### Memory State Monitoring

Compute nodes continuously monitor the memory state through the allocator, which notifies them when the amount of free memory decreases. The system does not wait for complete memory exhaustion, because the spilling process itself requires additional memory for serialization and buffering.

### Event Dispatch

When spilling is required, the compute component (data transfer system or compute core) performs the following actions (sketched in code after the list):

1. Serializes data into a blob.
2. Generates a unique identifier for the blob.
3. Creates a spilling request with the blob and the generated identifier.
4. Sends the request to the Spilling Service.
5. Releases resources and enters waiting mode, allowing other tasks to utilize computational resources.
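
This client-side flow can be sketched in C++ as follows. The `IAsyncBlobStore` interface, `TSpillHandle`, and the newline-based serialization are illustrative assumptions standing in for the actual actor-system events and serialization format of {{ ydb-short-name }}.

```cpp
#include <atomic>
#include <cstdint>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical asynchronous storage contract; the real request is an actor-system
// event sent to the Spilling Service, not a virtual function call.
struct IAsyncBlobStore {
    virtual ~IAsyncBlobStore() = default;
    virtual std::future<void> Put(std::uint64_t blobId, std::string blob) = 0;
    virtual std::future<std::string> Get(std::uint64_t blobId) = 0;
};

// Handle returned to the compute component: it can do other work and resume only
// once the write confirmation arrives.
struct TSpillHandle {
    std::uint64_t BlobId = 0;
    std::future<void> Written;  // becomes ready on confirmation, carries an exception on error
};

// Trivial illustrative serialization: join rows into a single blob.
inline std::string SerializeRows(const std::vector<std::string>& rows) {
    std::string blob;
    for (const std::string& row : rows) {
        blob += row;
        blob += '\n';
    }
    return blob;
}

inline TSpillHandle SpillRows(IAsyncBlobStore& store, const std::vector<std::string>& rows) {
    static std::atomic<std::uint64_t> counter{0};
    std::string blob = SerializeRows(rows);                           // 1. serialize data into a blob
    const std::uint64_t blobId = counter++;                           // 2. generate a unique identifier
    std::future<void> written = store.Put(blobId, std::move(blob));   // 3-4. build and send the request
    // 5. The caller releases the rows and yields; waiting on `written` later returns
    //    either a write confirmation (ready future) or an error (exception).
    return TSpillHandle{blobId, std::move(written)};
}
```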

### Waiting for Results

After sending the request, the compute component releases resources for other tasks and enters waiting mode, allowing the system to optimally utilize cluster computing resources until the external storage write is complete.

### Response Handling

The Spilling Service processes the request and returns a write confirmation for the specified identifier or an error message. The compute component can continue only after receiving confirmation.

### Data Reading

When data recovery is needed, the component sends a read request with the blob identifier. The Spilling Service reads data from external storage and returns a response with the recovered data. During data loading, freed computational resources are utilized to process other tasks.

## Spilling Workflow Diagram

```mermaid
sequenceDiagram
participant CN as Compute node
participant SS as Spilling Service
participant ES as External storage

Note over CN: Memory full
CN->>SS: Send data (asynchronously)
Note over CN: Continues working with other data
SS->>ES: Save data
SS-->>CN: Write confirmation (asynchronously)

Note over CN: Need saved data
CN->>SS: Request data (asynchronously)
Note over CN: Continues working with other data
SS->>ES: Read data
SS-->>CN: Return data (asynchronously)
```

## Configuration

Detailed information about configuring the Spilling Service is available in the [Spilling configuration](../reference/configuration/table_service_config.md) section.

## See Also

- [Spilling Concept](../concepts/spilling.md)
- [Spilling configuration](../reference/configuration/table_service_config.md)
- [{{ ydb-short-name }} monitoring](../devops/observability/monitoring.md)
- [Performance diagnostics](../troubleshooting/performance/index.md)
2 changes: 2 additions & 0 deletions ydb/docs/en/core/contributor/toc_i.yaml
@@ -28,6 +28,8 @@ items:
href: datashard-locks-and-change-visibility.md
- name: Distributed transactions
href: datashard-distributed-txs.md
- name: Spilling Service
href: spilling-service.md
- name: Testing with load actors
href: load-actors-overview.md
items:
2 changes: 2 additions & 0 deletions ydb/docs/en/core/devops/concepts/system-requirements.md
@@ -28,6 +28,8 @@ Prefer to use physical local disk drives for {{ ydb-short-name }} instead of vir

When planning disk capacity, remember that {{ ydb-short-name }} uses some disk space for internal needs. For example, on a medium-sized cluster of 8 nodes, you can expect approximately 100 GB to be consumed for a static group on the whole cluster. On a large cluster with more than 1500 nodes, this will be about 200 GB. There are also 25.6 GB of logs on each PDisk and a system area on each PDisk; its size depends on the size of the PDisk but is no less than 0.2 GB.

The disk is also used for [spilling](../../concepts/glossary.md#spilling), a memory management mechanism that temporarily saves intermediate query execution results to disk when RAM is insufficient. This is important to consider when planning disk capacity. Detailed spilling configuration is described in the [Spilling Configuration](../../reference/configuration/table_service_config.md) section.

## Software Configuration {#software}

A {{ ydb-short-name }} server can be run on servers with a Linux operating system, kernel 4.19 and higher, and libc 2.30. For example, Ubuntu 20.04, Debian 11, Fedora 34, or newer releases. {{ ydb-short-name }} uses the [TCMalloc](https://google.github.io/tcmalloc) memory allocator. To make it efficient, [enable](https://google.github.io/tcmalloc/tuning.html#system-level-optimizations) Transparent Huge Pages and Memory overcommitment.
@@ -77,6 +77,27 @@ To ensure that {{ ydb-short-name }} can access block disks, add the user that wi
sudo usermod -aG disk ydb
```

## Configure File Descriptor Limits {#file-descriptors}

For proper operation of {{ ydb-short-name }}, especially when using [spilling](../../../concepts/spilling.md) in multi-node clusters, it is recommended to increase the limit of simultaneously open file descriptors.

To change the file descriptor limit, add the following lines to the `/etc/security/limits.conf` file:

```bash
ydb soft nofile 10000
ydb hard nofile 10000
```

Where `ydb` is the username under which `ydbd` runs.

After changing the file, you need to reboot the system or log in again to apply the new limits.

{% note info %}

For more information about spilling configuration and its relationship with file descriptors, see the [Spilling Configuration](../../../reference/configuration/table_service_config.md#file-system-requirements) section.

{% endnote %}

## Install {{ ydb-short-name }} Software on Each Server {#install-binaries}

1. Download and unpack an archive with the `ydbd` executable and the libraries required for {{ ydb-short-name }} to run:
2 changes: 1 addition & 1 deletion ydb/docs/en/core/reference/configuration/index.md
@@ -557,7 +557,7 @@ The activity components include:

- KQP

The memory limit for each activity component specifies the maximum amount of memory it can attempt to use. However, to prevent the {{ ydb-short-name }} process from exceeding the soft memory limit, the total consumption of activity components is further constrained by an additional limit known as the activities memory limit. If the total memory usage of the activity components exceeds this limit, any additional memory requests will be denied.
The memory limit for each activity component specifies the maximum amount of memory it can attempt to use. However, to prevent the {{ ydb-short-name }} process from exceeding the soft memory limit, the total consumption of activity components is further constrained by an additional limit known as the activities memory limit. If the total memory usage of the activity components exceeds this limit, any additional memory requests will be denied. When query execution approaches memory limits, {{ ydb-short-name }} activates [spilling](../../concepts/spilling.md) to temporarily save intermediate data to disk, preventing memory limit violations.

As a result, while the combined individual limits of the activity components might collectively exceed the activities memory limit, each component's individual limit should be less than this overall cap. Additionally, the sum of the minimum memory limits for the cache components, plus the activities memory limit, must be less than the soft memory limit.
