Commit 92c4b05

2 parents 1cf0760 + 5a2043e commit 92c4b05

8 files changed: +861 -35 lines changed

README.md

Lines changed: 17 additions & 6 deletions
@@ -141,7 +141,10 @@ Create a `.env` file with the following environment variables:
 - `DISCORD_DEBUG_CLUSTER_STAGING_ID` : The ID of the "staging" server you want to connect to
 - `DISCORD_CLUSTER_STAGING_ID` : The ID of the "production" server you want to connect to
 - `GITHUB_TOKEN` : A Github token with permissions to trigger workflows, for now only new branches from [discord-cluster-manager](https://github.com/gpu-mode/discord-cluster-manager) are tested, since the bot triggers workflows on your behalf
+- `GITHUB_REPO` : The repository where the cluster manager is hosted.
+- `GITHUB_WORKFLOW_BRANCH` : The branch to start the GitHub Actions jobs from when submitting a task.
 - `DATABASE_URL` : The URL you use to connect to Postgres.
+- `DISABLE_SSL` : (Optional) set if you want to disable SSL when connecting to Postgres.

 Below is where to find these environment variables:

@@ -169,14 +172,22 @@ Below is where to find these environment variables:
 <img width="1440" alt="Screenshot 2024-12-30 at 8 51 59 AM" src="https://github.com/user-attachments/assets/e3467871-bd2c-4f94-b0c5-c8a6ef5ce89e">
 </details>

+- `GITHUB_REPO`: This should be set to this repository, which is usually `gpu-mode/discord-cluster-manager`.
+
+- `GITHUB_WORKFLOW_BRANCH`: Usually `main` or the branch you are working from.
+
 - `DATABASE_URL`: This contains the connection details for your local database, and has the form `postgresql://user:password@localhost/clusterdev`.

+- `DISABLE_SSL`: Set to `1` when developing.
+
 ### Verify Setup

+Install the kernel bot as editable using `pip install -e .`
+
 Run the following command to run the bot:

 ```
-python src/discord-cluster-manager/bot.py --debug
+python src/kernelbot/main.py --debug
 ```

 Then in your staging server, use the `/verifyruns` command to test basic functionalities of the bot and the `/verifydb` command to check database connectivity.
@@ -232,7 +243,7 @@ specify the available GPUs that the leaderboard evaluates on.
 The Discord bot internally contains an `eval.py` script that handles the correctness and timing
 analysis for the leaderboard. The `reference_code` that the leaderboard creator submits must have
 the following function signatures with their implementations filled out. `InputType` and
-`OutputType` are generics that could be a `torch.Tensor`, `List[torch.Tensor]`, etc.
+`OutputType` are generics that could be a `torch.Tensor`, `List[torch.Tensor]`, etc.
 depending on the reference code specifications. We leave this flexibility to the leaderboard creator.

 ```python
@@ -257,8 +268,8 @@ handle the typing system for tensors. The `reference.cu` that the leaderboard cr
 the following function signatures with their implementations filled out:

 The main difference is we now need to define an alias for the type that the input / outputs are. A
-simple and common example is a list of FP32 tensors, which can be defined using a pre-defined array of
-`const int`s called `N_SIZES`, then define an array of containers, e.g.
+simple and common example is a list of FP32 tensors, which can be defined using a pre-defined array of
+`const int`s called `N_SIZES`, then define an array of containers, e.g.
 `std::array<std::vector<float>, N_SIZES>`.

 ```cuda
@@ -293,7 +304,7 @@ bool check_implementation(output_t out, output_t ref) {
 ```

 The leaderboard submission for _Python code_ requires the following function signatures, where
-`InputType` and `OutputType` are generics that could be a `torch.Tensor`, `List[torch.Tensor]`,
+`InputType` and `OutputType` are generics that could be a `torch.Tensor`, `List[torch.Tensor]`,
 etc. depending on the reference code specifications.

 ```python
@@ -354,7 +365,7 @@ If you'd like to donate a GPU to our efforts, we can make you a CI admin in Gith

 ## Citation

-If you used our software please cite it as
+If you used our software please cite it as

 ```
 @misc{kernelbot2025,
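
The README hunks above introduce `GITHUB_REPO`, `GITHUB_WORKFLOW_BRANCH` and `DISABLE_SSL`. The code that reads them is not part of the shown diff, so the following is only a minimal sketch of how such settings could be loaded. `BotSettings` and `load_settings` are hypothetical names, and treating any non-empty `DISABLE_SSL` value as "disabled" is an assumption based on the README wording ("set if you want to disable SSL").

```python
import os
from dataclasses import dataclass


@dataclass
class BotSettings:
    """Hypothetical container for the variables documented in the README."""

    github_repo: str
    workflow_branch: str
    database_url: str
    disable_ssl: bool


def load_settings(env=None) -> BotSettings:
    """Read the settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    return BotSettings(
        github_repo=env["GITHUB_REPO"],
        workflow_branch=env["GITHUB_WORKFLOW_BRANCH"],
        database_url=env["DATABASE_URL"],
        # Assumption: the variable is optional and any non-empty value disables SSL.
        disable_ssl=bool(env.get("DISABLE_SSL")),
    )


dev_settings = load_settings(
    {
        "GITHUB_REPO": "gpu-mode/discord-cluster-manager",
        "GITHUB_WORKFLOW_BRANCH": "main",
        "DATABASE_URL": "postgresql://user:password@localhost/clusterdev",
        "DISABLE_SSL": "1",
    }
)
```

Passing the mapping explicitly keeps the sketch testable without touching the real environment.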

docker-compose.test.yml

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+services:
+  postgres-test:
+    image: postgres:15
+    container_name: postgres_db_test
+    environment:
+      POSTGRES_USER: postgres
+      POSTGRES_PASSWORD: postgres
+      POSTGRES_DB: clusterdev
+    ports:
+      - "5433:5432"
+    tmpfs:
+      - /var/lib/postgresql/data
+    healthcheck:
+      test: ["CMD-SHELL", "pg_isready -U postgres -d clusterdev"]
+      interval: 5s
+      timeout: 3s
+      retries: 10
+
+  migrate-test:
+    image: python:3.11-slim
+    depends_on:
+      postgres-test:
+        condition: service_healthy
+    volumes:
+      - ./src/migrations:/migrations
+    command: >
+      sh -c "
+        pip install yoyo-migrations psycopg2-binary &&
+        yoyo apply -b -d 'postgresql://postgres:postgres@postgres-test:5432/clusterdev' -v /migrations/ &&
+        yoyo list -d 'postgresql://postgres:postgres@postgres-test:5432/clusterdev' /migrations/
+      "
+    restart: "no"
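
The compose file publishes the container's port 5432 as 5433 on the host, so the URL that `migrate-test` uses inside the compose network differs from the one a test process on the host would need. A minimal sketch of that translation, assuming the tests run on the host against `localhost`; `host_side_url` is a hypothetical helper and not part of this commit:

```python
from urllib.parse import urlsplit

# URL used by the migrate-test service inside the compose network.
IN_NETWORK_URL = "postgresql://postgres:postgres@postgres-test:5432/clusterdev"


def host_side_url(url: str, host: str = "localhost", port: int = 5433) -> str:
    """Rewrite host and port so the URL points at the published port."""
    parts = urlsplit(url)
    netloc = f"{parts.username}:{parts.password}@{host}:{port}"
    return parts._replace(netloc=netloc).geturl()


TEST_DATABASE_URL = host_side_url(IN_NETWORK_URL)
```

Because the data directory is a `tmpfs` mount, the database starts empty on every run, so tests presumably rely on `migrate-test` having applied the schema first.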

pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -44,6 +44,8 @@ relative_files = true
 exclude_lines = [
     "pragma: no cover",
     "raise NotImplementedError",
+    # For now, don't require coverage of db errors
+    "except psycopg2.Error"
 ]

 [tool.pytest.ini_options]
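
coverage.py treats each `exclude_lines` entry as a regular expression and excludes any source line that contains a match; when the matched line opens a block, the whole block is excluded. The sketch below only approximates the line-matching step with `re.search`. `is_excluded` is a hypothetical helper, not coverage.py's API.

```python
import re

# The patterns from the hunk above, interpreted as regular expressions.
EXCLUDE_PATTERNS = [
    "pragma: no cover",
    "raise NotImplementedError",
    "except psycopg2.Error",
]


def is_excluded(source_line: str) -> bool:
    """Return True if any pattern matches somewhere in the line."""
    return any(re.search(pattern, source_line) for pattern in EXCLUDE_PATTERNS)


handler_line = "        except psycopg2.Error as e:"
ordinary_line = "        self.connection.rollback()"
# The unescaped "." is a regex wildcard, so this unrelated spelling matches too.
lookalike_line = "except psycopg2_Error:"
```

Escaping the dot (`except psycopg2\\.Error` inside a TOML basic string) would make the pattern exact, although the looser form is unlikely to matter in practice.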

src/kernelbot/cogs/admin_cog.py

Lines changed: 1 addition & 1 deletion
@@ -184,7 +184,7 @@ async def leaderboard_create_local(
         try:
             old_lb = db.get_leaderboard(leaderboard_name)
         except LeaderboardDoesNotExist:
-            pass
+            old_lb = None
         db.delete_leaderboard(leaderboard_name, force=True)

         # get existing forum thread or create new one
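
With the old `pass`, `old_lb` stayed unassigned whenever the leaderboard did not exist, so any later read of it would raise `UnboundLocalError`. A self-contained sketch of the difference, using stand-in names (`fetch_leaderboard` and a local `LeaderboardDoesNotExist`) rather than the cog's actual code:

```python
class LeaderboardDoesNotExist(Exception):
    """Stand-in for the bot's exception of the same name."""


def fetch_leaderboard(name):
    # Hypothetical lookup that always fails, to exercise the except branch.
    raise LeaderboardDoesNotExist(name)


def lookup_old(name):
    try:
        old_lb = fetch_leaderboard(name)
    except LeaderboardDoesNotExist:
        pass  # old_lb is never bound on this path
    return old_lb


def lookup_new(name):
    try:
        old_lb = fetch_leaderboard(name)
    except LeaderboardDoesNotExist:
        old_lb = None  # explicit "no previous leaderboard"
    return old_lb


try:
    lookup_old("missing")
    old_outcome = "returned"
except UnboundLocalError:
    old_outcome = "UnboundLocalError"

new_outcome = lookup_new("missing")
```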

src/kernelbot/cogs/verify_run_cog.py

Lines changed: 5 additions & 5 deletions
@@ -53,20 +53,20 @@ async def trigger_run(self, interaction: discord.Interaction, gpu: GPU, reporter
             sub_code = create_mock_attachment(
                 "submission.py", Path("examples/identity_py/submission.py").read_text()
             )
-            task = make_task_definition("examples/identity_py")
+            leaderboard = make_task_definition("examples/identity_py")
         else:
             sub_code = create_mock_attachment(
                 "test.cu", Path("examples/identity_cuda/submission.cu").read_text()
             )
-            task = make_task_definition("examples/identity_cuda")
+            leaderboard = make_task_definition("examples/identity_cuda")

         return await submit_leaderboard(
             interaction,
             -1,
             sub_code,
             gpu,
             reporter=reporter,
-            task=task,
+            task=leaderboard.task,
             mode=SubmissionMode.TEST,
             seed=None,
         )
@@ -292,8 +292,8 @@ async def verify_runs(self, interaction: discord.Interaction):
         amd = get_gpu_by_name("mi300")
         t4 = get_gpu_by_name("T4")

-        reporter = MultiProgressReporterDiscord("Verifying")
-        await reporter.show(interaction)
+        reporter = MultiProgressReporterDiscord(interaction)
+        await reporter.show("Verifying")

         results = await asyncio.gather(
             self.verify_github_run(interaction, nvidia, reporter.add_run("NVIDIA-PY"), "py"),

src/libkernelbot/leaderboard_db.py

Lines changed: 69 additions & 23 deletions
@@ -87,6 +87,8 @@ def create_leaderboard(
         forum_id: int,
         gpu_types: list | str,
     ) -> int:
+        # to prevent surprises, ensure we have specified a timezone
+        assert deadline.tzinfo is not None
         try:
             task = definition.task
             self.cursor.execute(
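
The new assertion guards against naive datetimes: `astimezone` on a naive value silently assumes the machine's local timezone, so the stored deadline would depend on where the bot runs. A minimal sketch of the check with a hypothetical `to_utc` helper (the real method performs the conversion inline):

```python
import datetime

UTC = datetime.timezone.utc


def to_utc(deadline: datetime.datetime) -> datetime.datetime:
    """Convert an aware datetime to UTC, rejecting naive input."""
    # Same guard as in the hunk above.
    assert deadline.tzinfo is not None
    return deadline.astimezone(UTC)


plus_two = datetime.timezone(datetime.timedelta(hours=2))
aware = datetime.datetime(2025, 1, 1, 12, 0, tzinfo=plus_two)
naive = datetime.datetime(2025, 1, 1, 12, 0)

converted = to_utc(aware)  # 12:00 at UTC+2 is 10:00 UTC

try:
    to_utc(naive)
    naive_rejected = False
except AssertionError:
    naive_rejected = True
```

Note that `assert` statements are skipped when Python runs with `-O`, so the guard only holds in non-optimized runs.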
@@ -127,10 +129,10 @@ def create_leaderboard(
             self.name_cache.invalidate() # Invalidate autocomplete cache
             return leaderboard_id
         except psycopg2.Error as e:
-            logger.exception("Error in leaderboard creation.", e)
+            logger.exception("Error in leaderboard creation.", exc_info=e)
             if isinstance(e, psycopg2.errors.UniqueViolation):
                 raise KernelBotError(
-                    "Error: Tried to create a leaderboard " f'"{name}" that already exists.'
+                    f"Error: Tried to create a leaderboard '{name}' that already exists."
                 ) from e
             self.connection.rollback() # Ensure rollback if error occurs
             raise KernelBotError("Error in leaderboard creation.") from e
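
The `logger.exception` change matters because positional arguments after the message are `%`-format arguments. The message has no placeholders, so formatting the old record fails and the standard handlers report a logging error instead of the intended line. A small sketch that captures both records with a hypothetical `ListHandler` and a made-up logger name:

```python
import logging


class ListHandler(logging.Handler):
    """Collects records unformatted so they can be inspected afterwards."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("kernelbot_sketch")
logger.setLevel(logging.ERROR)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

try:
    raise ValueError("boom")
except ValueError as e:
    logger.exception("Error in leaderboard creation.", e)  # old call
    logger.exception("Error in leaderboard creation.", exc_info=e)  # new call

old_record, new_record = handler.records

try:
    old_record.getMessage()
    old_formats_cleanly = True
except TypeError:
    # "not all arguments converted during string formatting"
    old_formats_cleanly = False

new_message = new_record.getMessage()
```

Inside an `except` block, `logger.exception` already attaches the active exception, so `exc_info=e` mainly makes the intent explicit.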
@@ -140,31 +142,21 @@ def update_leaderboard(
     ):
         task = definition.task
         try:
+            lb_id = self.get_leaderboard_id(name)
             self.cursor.execute(
                 """
                 UPDATE leaderboard.leaderboard
                 SET deadline = %s, task = %s, description = %s
-                WHERE name = %s;
+                WHERE id = %s;
                 """,
                 (
                     deadline.astimezone(datetime.timezone.utc),
                     task.to_str(),
                     definition.description,
-                    name,
+                    lb_id,
                 ),
             )

-            self.cursor.execute(
-                """
-                SELECT id
-                FROM leaderboard.leaderboard
-                WHERE name = %s
-                """,
-                (name,),
-            )
-
-            lb_id = self.cursor.fetchone()[0]
-
             # replace templates
             self.cursor.execute(
                 """
@@ -235,8 +227,13 @@ def delete_leaderboard(self, leaderboard_name: str, force: bool = False):
             self.name_cache.invalidate() # Invalidate autocomplete cache
         except psycopg2.Error as e:
             self.connection.rollback()
+            if isinstance(e, psycopg2.errors.ForeignKeyViolation):
+                raise KernelBotError(
+                    f"Could not delete leaderboard `{leaderboard_name}` with existing submissions."
+                ) from e
+
             logger.exception("Could not delete leaderboard %s.", leaderboard_name, exc_info=e)
-            raise KernelBotError(f"Could not delete leaderboard {leaderboard_name}.") from e
+            raise KernelBotError(f"Could not delete leaderboard `{leaderboard_name}`.") from e

     def create_submission(
         self,
@@ -260,7 +257,7 @@ def create_submission(

             code_id = None
             for candidate in self.cursor.fetchall():
-                if candidate[1] == code:
+                if bytes(candidate[1]).decode("utf-8") == code:
                     code_id = candidate[0]
                     break
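
The `bytes(...).decode("utf-8")` change suggests the stored code comes back as a binary buffer rather than a `str`. Assuming the column is `BYTEA` (the schema is not shown in this diff), psycopg2 returns a `memoryview`, and comparing that to a `str` is always `False`, so the old lookup could never find an existing copy of the code. A sketch that simulates the fetched value without a database:

```python
# Simulated value as psycopg2 would return it for a BYTEA column (assumption).
submitted_code = "def custom_kernel(x):\n    return x\n"
stored_value = memoryview(submitted_code.encode("utf-8"))

# Old comparison: a memoryview never equals a str, even with identical content.
old_match = stored_value == submitted_code

# New comparison: decode the buffer first, then compare text with text.
new_match = bytes(stored_value).decode("utf-8") == submitted_code
```

The same decoding is applied in `get_submission_by_id` further down, so callers receive `str` in both places.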

@@ -357,6 +354,24 @@ def create_submission_run(
             if compilation is not None:
                 compilation = json.dumps(dataclasses.asdict(compilation))

+            # check validity
+            self.cursor.execute(
+                """
+                SELECT done FROM leaderboard.submission WHERE id = %s
+                """,
+                (submission,),
+            )
+            if self.cursor.fetchone()[0]:
+                logger.error(
+                    "Submission '%s' is already marked as done when trying to add %s run.",
+                    submission,
+                    mode,
+                )
+                raise KernelBotError(
+                    "Internal error: Attempted to add run, "
+                    "but submission was already marked as done."
+                )
+
             meta = {
                 k: result.__dict__[k]
                 for k in ["stdout", "stderr", "success", "exit_code", "command", "duration"]
@@ -408,7 +423,7 @@ def get_leaderboard_names(self, active_only: bool = False) -> list[str]:
     def get_leaderboards(self) -> list["LeaderboardItem"]:
         self.cursor.execute(
             """
-            SELECT id, name, deadline, task, creator_id, forum_id, description
+            SELECT id, name, deadline, task, creator_id, forum_id, description, secret_seed
             FROM leaderboard.leaderboard
             """
         )
@@ -432,6 +447,7 @@ def get_leaderboards(self) -> list["LeaderboardItem"]:
                     creator_id=lb[4],
                     forum_id=lb[5],
                     description=lb[6],
+                    secret_seed=lb[7],
                 )
             )

@@ -461,7 +477,7 @@ def get_leaderboard_gpu_types(self, leaderboard_name: str) -> List[str]:

         return [x[0] for x in self.cursor.fetchall()]

-    def get_leaderboard_templates(self, leaderboard_name: str) -> Dict[str, str]:
+    def get_leaderboard_id(self, leaderboard_name: str) -> int:
         self.cursor.execute(
             """
             SELECT id
@@ -473,14 +489,18 @@ def get_leaderboard_templates(self, leaderboard_name: str) -> Dict[str, str]:
         lb_id = self.cursor.fetchone()
         if lb_id is None:
             raise LeaderboardDoesNotExist(leaderboard_name)
+        return lb_id[0]
+
+    def get_leaderboard_templates(self, leaderboard_name: str) -> Dict[str, str]:
+        lb_id = self.get_leaderboard_id(leaderboard_name)

         self.cursor.execute(
             """
             SELECT lang, code
             FROM leaderboard.templates
             WHERE leaderboard_id = %s
             """,
-            (lb_id[0],),
+            (lb_id,),
         )

         return {x[0]: x[1] for x in self.cursor.fetchall()}
@@ -585,7 +605,7 @@ def get_leaderboard_submissions(

         self.cursor.execute(query, args)

-        return [
+        result = [
             LeaderboardRankedEntry(
                 submission_name=submission[0],
                 submission_id=submission[1],
@@ -599,6 +619,19 @@ def get_leaderboard_submissions(
             )
             for submission in self.cursor.fetchall()
         ]
+        if len(result) == 0:
+            # try to diagnose why we didn't get anything
+            # this will raise if the LB does not exist at all.
+            self.get_leaderboard_id(leaderboard_name)
+
+            # did we specify a valid GPU?
+            gpus = self.get_leaderboard_gpu_types(leaderboard_name)
+            if gpu_name not in gpus:
+                raise KernelBotError(
+                    f"Invalid GPU type '{gpu_name}' for leaderboard '{leaderboard_name}'"
+                )
+
+        return result

     def generate_stats(self, last_day: bool):
         try:
@@ -784,7 +817,7 @@ def get_submission_by_id(self, submission_id: int) -> Optional["SubmissionItem"]
             user_id=submission[3],
             submission_time=submission[4],
             done=submission[5],
-            code=submission[6],
+            code=bytes(submission[6]).decode("utf-8"),
             runs=runs,
         )

@@ -824,7 +857,20 @@ def get_leaderboard_submission_count(
         args = (leaderboard_name, gpu_name)

         self.cursor.execute(query, args)
-        return self.cursor.fetchone()[0]
+        count = self.cursor.fetchone()[0]
+        if count == 0:
+            # try to diagnose why we didn't get anything
+            # this will raise if the LB does not exist at all.
+            self.get_leaderboard_id(leaderboard_name)
+
+            # did we specify a valid GPU?
+            gpus = self.get_leaderboard_gpu_types(leaderboard_name)
+            if gpu_name not in gpus:
+                raise KernelBotError(
+                    f"Invalid GPU type '{gpu_name}' for leaderboard '{leaderboard_name}'"
+                )
+
+        return count

     def init_user_from_cli(self, cli_id: str, auth_provider: str):
         """
