fix(autograd): include traced keys in HDF5 hash input #2875
Open
yaugenst-flex wants to merge 1 commit into develop from FXC-3376-hash-must-include-traced-keys-remote-autograd-cache-bug
+110 −49
@@ -0,0 +1,76 @@ (new file)

"""Typed containers for autograd traced field metadata."""

from __future__ import annotations

import json
from typing import Any, Callable

import pydantic.v1 as pydantic

from tidy3d.components.autograd.types import AutogradFieldMap, dict_ag
from tidy3d.components.base import Tidy3dBaseModel
from tidy3d.components.types import ArrayLike, tidycomplex


class Tracer(Tidy3dBaseModel):
    """Representation of a single traced element within a model."""

    path: tuple[Any, ...] = pydantic.Field(
        ...,
        title="Path to the traced object in the model dictionary.",
    )
    data: float | tidycomplex | ArrayLike = pydantic.Field(..., title="Tracing data")


class FieldMap(Tidy3dBaseModel):
    """Collection of traced elements."""

    tracers: tuple[Tracer, ...] = pydantic.Field(
        ...,
        title="Collection of Tracers.",
    )

    @property
    def to_autograd_field_map(self) -> AutogradFieldMap:
        """Convert to ``AutogradFieldMap`` autograd dictionary."""
        return dict_ag({tracer.path: tracer.data for tracer in self.tracers})

    @classmethod
    def from_autograd_field_map(cls, autograd_field_map: AutogradFieldMap) -> FieldMap:
        """Initialize from an ``AutogradFieldMap`` autograd dictionary."""
        tracers = []
        for path, data in autograd_field_map.items():
            tracers.append(Tracer(path=path, data=data))
        return cls(tracers=tuple(tracers))


def _encoded_path(path: tuple[Any, ...]) -> str:
    """Return a stable JSON representation for a traced path."""
    return json.dumps(list(path), separators=(",", ":"), ensure_ascii=True)


class TracerKeys(Tidy3dBaseModel):
    """Collection of traced field paths."""

    keys: tuple[tuple[Any, ...], ...] = pydantic.Field(
        ...,
        title="Collection of tracer keys.",
    )

    def encoded_keys(self) -> list[str]:
        """Return the JSON-encoded representation of keys."""
        return [_encoded_path(path) for path in self.keys]

    @classmethod
    def from_field_mapping(
        cls,
        field_mapping: AutogradFieldMap,
        *,
        sort_key: Callable[[tuple[Any, ...]], str] | None = None,
    ) -> TracerKeys:
        """Construct keys from an autograd field mapping."""
        if sort_key is None:
            sort_key = _encoded_path

        sorted_paths = tuple(sorted(field_mapping.keys(), key=sort_key))
        return cls(keys=sorted_paths)
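For context, a minimal sketch of how these containers could feed a deterministic cache hash. The field-map contents, the sha256 choice, and the printout are illustrative; only TracerKeys and its methods come from the diff above.

import hashlib

# A plain dict stands in for an AutogradFieldMap here; only its keys are used.
field_mapping = {
    ("structures", 0, "geometry", "size"): 1.0,
    ("structures", 0, "medium", "permittivity"): 2.0,
}

keys = TracerKeys.from_field_mapping(field_mapping)
# encoded_keys() is deterministic: paths are sorted by their canonical JSON
# encoding, so the same set of traced fields always hashes the same way.
hasher = hashlib.sha256()
for encoded in keys.encoded_keys():
    hasher.update(encoded.encode("utf-8"))
print(hasher.hexdigest())

Two simulations that differ only in which fields are traced then produce different hashes, which is the collision the branch name suggests this PR fixes.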
Wondering if it makes a bit more sense to use a new dataset (like we do for the simulation json) rather than the attrs directly. Feels like a bit of an abuse of even the hdf5 attrs purpose.

For the simulation json, I remember Tyler first tried putting it in attrs too, but it can be a problem when it becomes large. Here it's unlikely that the json will be very big, but still potentially better to just make a dataset?
I would actually argue that this here is metadata and exactly the kind of thing that should go in hdf5 attrs 😄 - it's a short list of paths, it never grows beyond a couple of bytes, never needs random access, and hdf5 attrs are intended for exactly that style of short descriptive information. I think it's different from the simulation json, because that really isn't metadata, and is also, as you said, potentially quite large.
Alright, yeah, fine by me.
hm, but I also feel uneasy about this. what is the disadvantage to just making a dataset? I feel it makes things more explicit that this is some data that we care about. It's not purely metadata.
it's a fair point, maybe the line between metadata and model data is a bit fuzzy here. i lean toward calling it metadata, because it's a deterministic byproduct of the actual model and the user inputs. if the simulation is rebuilt from scratch, the same traced paths will be "rediscovered", so it's closer to a property than to part of the model.

including it as a dataset also might require some additional plumbing for the hdf5 reader/writer (haven't checked), since attrs don't show up in the json reconstruction. that being said, the difference is mostly semantic 🤷
ok I asked chat gpt and it agrees with yannick :D so I'm fine with the current implementation:
Short answer:
- Use HDF5 attributes if the new info is small (a few strings/ints/bools), describes the run, and you'll read it whole.
- Use a dataset only if it's large, tabular/array-like, needs chunking/compression, or will be queried/sliced.
- This info is metadata (provenance/config that affects processing, not the simulation arrays themselves).

Why attrs fit here:
- Attributes are meant for descriptive/run-level info (schema versions, flags, tool versions).
- They are cheap to read, need no chunking, and live with the object they describe (usually the root group).
- Keep them small; if the info grows or becomes structured, promote it to a dataset or a small /cache_key group.

Cache/hash tips (to avoid fragile keys):
- Scope the key: hash a defined subset only (inputs + this new flag/signature + relevant versions), not the whole file.
- Canonicalization: store under /cache_key (either as attributes or tiny datasets). When building the hash, read the keys, sort by name, encode deterministically (e.g., canonical JSON with sorted keys and UTF-8, or CBOR), and hash.
- Versioning: include cache_key_version and producer/tool version fields. Bump them when semantics change.
- Defaults for back-compat: treat a missing field as an explicit default (e.g., "backend_signature": "v0"), so old files don't slip through incorrectly.
- Auditability: also store the computed cache_key back into the file (as an attr) so you can explain cache hits later.

Rule of thumb:
- Attrs: small, descriptive, per-run constants → ✅
- Dataset: big/structured, per-step/per-record data or anything you might slice/compress → ✅

Given your description, I'd put the new field(s) as root attributes (or attributes under /cache_key) and include them in the deterministic hash. Yes, consider it metadata.
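A compact sketch of the hashing recipe suggested above; every field name here is illustrative, not from the PR.

import hashlib
import json

fields = {
    "cache_key_version": "1",
    "backend_signature": "v0",  # explicit default when the field is missing (back-compat)
    "traced_field_keys": ['["structures",0,"medium","permittivity"]'],
}

# Canonical JSON (sorted keys, no whitespace, ASCII-only) gives stable bytes,
# so the same inputs always produce the same cache key.
canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
cache_key = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
print(cache_key)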