Skip to content

Conversation

@tsmathis
Copy link
Collaborator

should just work™️

@tsmathis
Copy link
Collaborator Author

tsmathis commented Oct 23, 2025

some notes:

1. depends on #1021
2. could supersede #974
3. this will need to be addressed for the progress bar to be accurate in the case were has_gnome_access=False:

# TODO: Update tasks (+ others?) resource to have emmet-api BatchIdQuery operator
# -> need to modify BatchIdQuery operator to handle root level
# batch_id, not only builder_meta.batch_id
# if not has_gnome_access:
# num_docs_needed = self.count(
# {"batch_id_neq_any": SETTINGS.ACCESS_CONTROLLED_BATCH_IDS}
# )

the count can be retrieved from s3, but the COUNT(*) ... WHERE NOT IN ... is slow
4. wasn't sure how to emit messages to the user, warnings might not be the best choice:
warnings.warn(
f"Dataset for {suffix} already exists at {target_path}, delete or move existing dataset "
"or re-run search query with MPRester(force_renew=True)",
MPLocalDatasetWarning,

warnings.warn(
f"Dataset for {suffix} written to {target_path}. It is recommended to optimize "
"the table according to your usage patterns prior to running intensive workloads, "
"see: https://delta-io.github.io/delta-rs/delta-lake-best-practices/#optimizing-table-layout",
MPLocalDatasetWarning,
)

5. On the fence if MPDataset should inherit user's choice of use_document_model or default to False, its extra overhead when True
use_document_model=self.use_document_model,

6. re: document model's, wasn't sure if making an MPDataDoc model was the right route so the emmet model is just passed through now.
7. @esoteric-ephemera, is this how coercing user input to AlphaIDs should go? Do you want to do something different?
as_alpha = str(AlphaID(task_id, padlen=8)).split("-")[-1]

8. Is MPAPIClientSettings the right place for these? Not sure if the user has the ability to adjust these if needed:
LOCAL_DATASET_CACHE: str = Field(
os.path.expanduser("~") + "/mp_datasets",
description="Target directory for downloading full datasets",
)
DATASET_FLUSH_THRESHOLD: int = Field(
100000,
description="Threshold number of rows to accumulate in memory before flushing dataset to disk",
)
ACCESS_CONTROLLED_BATCH_IDS: list[str] = Field(
["gnome_r2scan_statics"], description="Batch ids with access restrictions"
)

@tsmathis
Copy link
Collaborator Author

ah and based on the failing test for trajectories, I assumed returning the pymatgen object was correct, should the dict be returned? @esoteric-ephemera

return RelaxTrajectory(**traj_data[0]).to_pmg()

@esoteric-ephemera
Copy link
Collaborator

esoteric-ephemera commented Oct 23, 2025

@tsmathis think the API was set up to return the jsanitized trajectory info:
https://github.com/materialsproject/emmet/blob/3447c5af4746d539f1f4faf26b97715cb119c85d/emmet-api/emmet/api/routes/materials/tasks/query_operators.py#L73

Either way yeah I guess it returned the as_dict but we don't need to keep with that paradigm

For the AlphaID, to handle either the no prefix/separator ("aaaaaaft") and with prefix/separator ("mp-aaaaaaft") cases, both of these should work, but I can also just save the "padded identifier" as an attr on it to make this cleaner - I'll do that in the PR you linked:

"a"*(x._padlen-len(x._identifier)) + x._identifier

or

if (alpha := AlphaID(task_id, padlen=8))._separator:
  padded = str(alpha).rsplit(alpha._separator)[-1] 
else:
  padded = str(alpha)

@tsmathis
Copy link
Collaborator Author

tsmathis commented Oct 23, 2025

For the AlphaID, to handle either the no prefix/separator ("aaaaaaft") and with prefix/separator ("mp-aaaaaaft") cases, both of these should work, but I can also just save the "padded identifier" as an attr on it to make this cleaner - I'll do that in the PR you linked:

either way on this works for me, just want to make sure I stick to the intended usage (edit: or that we're at least consistent across the client)

Either way yeah I guess it returned the as_dict but we don't need to keep with that paradigm

Was going to say we could stick to whatever the frontend was expecting, but looking now the frontend doesn't even use the tasks.get_trajectory(...) function so it will need to be rewritten either way. The frontend does end up making a dataframe from the trajectory dict, so maybe just returning the dict will be best

@codecov-commenter
Copy link

codecov-commenter commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 42.20183% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.92%. Comparing base (5ecebec) to head (33b787f).

Files with missing lines Patch % Lines
mp_api/client/core/client.py 16.66% 45 Missing ⚠️
mp_api/client/core/utils.py 55.00% 18 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1023      +/-   ##
==========================================
- Coverage   66.85%   65.92%   -0.94%     
==========================================
  Files          50       50              
  Lines        2767     2870     +103     
==========================================
+ Hits         1850     1892      +42     
- Misses        917      978      +61     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@tschaume tschaume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Looking forward to rolling this out 😄

Copy link
Member

@tschaume tschaume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the updates @tsmathis ! Found a few more potential issues/improvements

@tsmathis
Copy link
Collaborator Author

tsmathis commented Nov 5, 2025

good catches @tschaume

you're obviously free to keep adding changes for testing, but you can also just ping me if you want me to update things as changes come in upstream

@tsmathis
Copy link
Collaborator Author

tsmathis commented Nov 5, 2025

and FYI, tests are mostly good locally with my endpoint set to preview.
these two fail regardless once the env gets reset during testing:

FAILED tests/test_mprester.py::TestMPRester::test_get_api_key_endpoint_from_env_var - KeyError: 'access_controlled_batch_ids'
FAILED tests/test_mprester.py::TestMPRester::test_get_default_api_key_endpoint - KeyError: 'access_controlled_batch_ids'

once the new deployment is live that will go away, and since this is a feature PR I'm not too concerned about the tests atm.

But I can also change the heartbeat meta ping to be more fault tolerant and use .get(...) and return nulls be default rather than directly accessing those keys :shrug:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants