SaSPartitioner is a self-adaptive stream partitioning framework that leverages deep reinforcement learning driven by real runtime metrics.
- Java 11
- Apache Flink 1.20
- Ray 2.40.0
We use a modified version of Flink 1.20 that can collect metrics at custom intervals. Compile this modified Flink and deploy it on every machine in your cluster.
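Assuming the modified Flink keeps the standard Flink Maven build, compiling and deploying it might look like the following sketch (all paths and host names are placeholders):

```shell
# Build the modified Flink from source; the standard Flink build
# produces a usable distribution under build-target/.
cd /path/to/modified-flink
mvn clean install -DskipTests

# Copy the distribution to each machine in the cluster (placeholder host).
scp -r build-target/ user@worker1:/opt/flink
```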
The system contains two main components: the Flink partitioner written in Java, and the reinforcement learning agent implemented with Ray RLlib.
- For the Java code, point `flink.source.path` in `pom.xml` to our modified Flink, then compile with `mvn package`.
- For the RL agent, install the required Python packages in `scripts/rl/requirements.txt`.
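Assuming Maven and pip are available, the two build steps above can be sketched as:

```shell
# Build the Java partitioner after pointing flink.source.path in pom.xml
# to the modified Flink.
mvn package

# Install the RL agent's Python dependencies.
pip install -r scripts/rl/requirements.txt
```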
The parameters are configured in `src/main/resources/params.yaml` and `scripts/rl/configurations.py`, respectively. Example YAML and Python configuration files are provided in `params/` and `scripts/rl/configuration_pool/`.
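For orientation, a minimal `params.yaml` sketch using only the keys mentioned in this README; whether both keys appear together, and any other keys, may differ (see `params/` for complete examples):

```yaml
# params.yaml (sketch; only keys mentioned in this README)
learningPartitioner: dalton-offline   # or saspartitioner, for training
partitioner: saspartitioner           # used for throughput tests and baselines
```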
- Set the `learningPartitioner` in `params.yaml` to `dalton-offline`.
- Offline data collection: run the Java class `cn.edu.zju.daily.metricflux.task.wordcount.WordCountStaticDistRouteTrainingExperiment` to collect the offline data.
- Configure the `log_folder` and `data_path` in `configurations.py`, and change `run_mode` to `offline`.
- Run `scripts/rl/offline_online_train_remote_n.py` to obtain the pre-trained model.
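As a sketch of the offline settings in `configurations.py` (the variable names come from the steps above; treating them as simple module-level assignments, and all paths, are assumptions):

```python
# configurations.py (sketch): settings for offline pre-training.
log_folder = "/path/to/logs"         # where training logs are written
data_path = "/path/to/offline-data"  # data collected by the Java job
run_mode = "offline"                 # switch the trainer to offline mode
```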
- Set the `learningPartitioner` in `params.yaml` to `saspartitioner`.
- Set the `checkpoint_path` in `configurations.py` to point to the offline model.
- Run `scripts/rl/offline_online_train_remote_n.py` to start the RL agent server.
- Run the Java class `cn.edu.zju.daily.metricflux.task.wordcount.WordCountStaticDistRouteTrainingExperiment` to start the online training process.
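Assuming the agent script is started directly and the Flink job is submitted with the standard `flink run` CLI, the online phase might be driven as follows (the jar path is a placeholder):

```shell
# Start the RL agent server; it reads checkpoint_path from configurations.py.
python scripts/rl/offline_online_train_remote_n.py &

# Submit the online training job to the Flink cluster (placeholder jar path).
flink run -c cn.edu.zju.daily.metricflux.task.wordcount.WordCountStaticDistRouteTrainingExperiment \
    /path/to/saspartitioner.jar
```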
To test the maximum throughput of the system:
- Set the `checkpoint_path` in `configurations.py` to point to the online model.
- Run `scripts/rl/offline_online_train_remote_n.py` to start the RL agent server.
- Set the `partitioner` in `params.yaml` to `saspartitioner`.
- Run the Java class `cn.edu.zju.daily.metricflux.task.wordcount.WordCountThroughputExperimentV2` to start the test. The source data rate increases gradually until backpressure is detected, and the maximum throughput is logged.
You can run the following baselines to compare with SaSPartitioner by setting the `partitioner` in `params.yaml` to one of these values:
- Hash: `hash`
- cAM: `cam`
- DAGreedy: `dagreedy`
- FlexD: `flexd`
- Dalton: `dalton-original`
- Dalton with collected metrics as observations: `dalton-metrics`
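For example, to run the cAM baseline, the corresponding `params.yaml` entry would be (sketch):

```yaml
partitioner: cam
```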
There are some bash scripts in `bin/` to facilitate batch experiments. You can refer to these scripts for the TDigest task.