Introduce dataframe to ExperimentData (step2) #1253
Conversation
…ct provides better reusability of the processed curve data.
…ined in the single method _create_figures. This allows a subclass to flexibly modify the figure generation without overriding the entire _run_analysis.
Some clues for the review below.
```python
# Create convenient function to compute residual of the models.
partial_residuals = []
valid_uncertainty = np.all(np.isfinite(curve_data.yerr.to_numpy()))
for i, sub_data in list(curve_data.groupby("model_id")):
```
Filtering for each subset is now implemented with the dataframe, but this creates another dataframe per subset. Running this in the objective function, which is called many times by the optimizer, degrades performance. Instead, this logic filters the subsets in advance and creates partially bound functions that evaluate the residual. This alleviates the performance bottleneck.
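The pre-binding idea can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function names (`build_partial_residuals`, `objective`) and column names (`xval`, `yval`) are hypothetical stand-ins for the real implementation.

```python
from functools import partial

import numpy as np
import pandas as pd


def residual(params, x, y, model):
    """Residual of a single model on pre-extracted numpy arrays."""
    return model(x, **params) - y


def build_partial_residuals(curve_data, models):
    """Filter each model's subset ONCE, outside the optimizer loop."""
    partial_residuals = []
    for i, sub_data in curve_data.groupby("model_id"):
        partial_residuals.append(
            partial(
                residual,
                x=sub_data["xval"].to_numpy(),
                y=sub_data["yval"].to_numpy(),
                model=models[i],
            )
        )
    return partial_residuals


def objective(params, partial_residuals):
    # Called many times by the optimizer; no dataframe filtering happens here.
    return np.concatenate([f(params) for f in partial_residuals])
```

The dataframe `groupby` cost is paid once up front; each optimizer iteration only touches plain numpy arrays.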
```python
# Prepare for fitting
self._initialize(experiment_data)

curve_data = self._format_data(self._run_data_processing(experiment_data.data()))
```
This curve_data object is now a ScatterTable. This table can store both processed_data and formatted_data, and the code becomes much simpler.
```python
model_fit[:, 4] = i
# type
model_fit[:, 6] = "fit"
curve_data = curve_data.append_list_values(
```
This curve_data also stores the fit values. Note that we can regenerate the figure object from this data, i.e. lazy figure generation (though metadata such as chi-squared and fit parameters must be provided separately). In a follow-up we could update the internals of the curve plotter to take this object directly.
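The lazy-figure-generation idea can be sketched with plain pandas. This is an illustration only: the column names (`xval`, `yval`, `category`) and the helper `regenerate_figure_data` are hypothetical, not the ScatterTable API.

```python
import numpy as np
import pandas as pd

# Hypothetical table holding both measured points and fitted curve samples,
# distinguished by a category column.
x_fit = np.linspace(0.0, 1.0, 5)
table = pd.concat(
    [
        pd.DataFrame({"xval": [0.0, 1.0], "yval": [0.1, 0.9], "category": "formatted"}),
        pd.DataFrame({"xval": x_fit, "yval": 0.8 * x_fit + 0.1, "category": "fitted"}),
    ],
    ignore_index=True,
)


def regenerate_figure_data(table: pd.DataFrame):
    """A figure can be rebuilt later from the stored rows alone."""
    points = table[table.category == "formatted"]  # scatter points
    curve = table[table.category == "fitted"]      # fit line samples
    return points, curve
```

Because everything needed to draw the plot lives in one table, the figure does not have to be rendered at analysis time.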
```python
) -> Tuple[List[AnalysisResultData], List["matplotlib.figure.Figure"]]:

    analysis_results = []
    figures = []
```
This is almost a copy of the CurveAnalysis class, but it will be completely dropped once we move ScatterTable to the artifact. Eventually I want to write:

```python
component_analysis_results = []
for analysis in self._analyses:
    analysis.set_options(plot=False)
    sub_results, _ = analysis._run_analysis(experiment_data)
    component_analysis_results.extend(sub_results)

# --- I need this logic ---
# 1. get ScatterData of the sub analysis from the artifact
# 2. combine sub scatter data
figures = self._create_figures(combined_scatter_data)

return component_analysis_results + composite_results
```

but currently the scatter data is discarded. Note that composite curve analysis usually creates a single plot including curves from the sub-analyses, c.f. HamiltonianTomo. The current implementation just keeps the scatter data of the sub-analyses, so it does not need careful review.
```python
    return wrapfunc(x=x, **sub_params)


def shot_weighted_average(
```
These functions are tied to the ScatterTable column structure, so I cannot reuse the existing ones.
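A shot-weighted average over such a column structure might look like the sketch below. This is an assumption-laden illustration, not the PR's implementation: it assumes columns named `yval`, `yerr`, and `shots`, and propagates uncertainty as a standard weighted combination of independent errors.

```python
import numpy as np
import pandas as pd


def shot_weighted_average(group: pd.DataFrame) -> pd.Series:
    """Average y values weighted by shot counts (hypothetical column names)."""
    shots = group["shots"].to_numpy()
    weights = shots / shots.sum()
    return pd.Series(
        {
            # Weighted mean of the y values.
            "yval": np.sum(weights * group["yval"].to_numpy()),
            # Error propagation for a weighted sum of independent values.
            "yerr": np.sqrt(np.sum(weights**2 * group["yerr"].to_numpy() ** 2)),
            "shots": shots.sum(),
        }
    )
```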
I'm getting an error when running experiments with
Thanks @nkanazawa1989, this looks good overall, and modularizing the create-figures functionality is nice.

The `CurveData` mentions in the Curve Analysis tutorial also need to be updated. As for the `auto_save` bug mentioned in my last comment, I think it's due to `AnalysisResultTable.add_entry()` returning the result of `DefaultColumnsMixIn.add_entry()`, which has no return value (missing a return?).
```python
LOG = logging.getLogger(__name__)


class ScatterTable(pd.DataFrame, DefaultColumnsMixIn):
```
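Subclassing `pd.DataFrame`, as the class declaration above does, usually requires one extra hook: pandas consults the `_constructor` property to decide which type to return from slicing and filtering operations. A minimal sketch (the class name, columns, and `filter_category` helper are hypothetical, not the real ScatterTable):

```python
import pandas as pd


class ScatterTableSketch(pd.DataFrame):
    """Minimal DataFrame subclass sketch."""

    # pandas calls _constructor to build result objects, so overriding it
    # makes slicing/filtering return this subclass instead of a plain DataFrame.
    @property
    def _constructor(self):
        return ScatterTableSketch

    def filter_category(self, category: str) -> "ScatterTableSketch":
        return self[self["category"] == category]
```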
```
lmfit
rustworkx
pandas>=1.1.5
```
Now that pandas is a requirement, the `requires_pandas()` decorator can be removed, and so can the "install pandas" note in `tutorials/calibrations.rst`.
It seems the decorator is not used in the rst file, but the text is updated and the decorator is removed from the entire package. 46d350e
…ntry returns no data and get_entry must be explicitly called.
- raw -> processed
- fit-ready -> formatted
- fit -> fitted
Thanks @coruscating for the detailed review and testing. The problem of auto saving was due to the previous PR #1252, in which I removed the return value because returning the added …

I also cleaned up the scatter table columns, namely, I dropped a builtin column of …

I'm still struggling with removing superclass (DataFrame) members from the docstring. I guess I need to write a custom Sphinx directive that is a counterpart of …
- Avoid add_data call in data processing. Creating a dataframe from bulk data is much faster than appending rows one by one.
- Group-sort with model_id instead of model_name. String search is slow in pandas.
- Groupby in the format subroutine is replaced with Python's native groupby function, which is faster than pandas groupby.
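The first point above can be demonstrated with generic pandas (an illustration with made-up column names, not the PR's code): growing a frame row by row copies the whole frame on every step, while bulk construction allocates once.

```python
import pandas as pd

rows = [{"xval": float(i), "yval": float(i) ** 2} for i in range(100)]

# Row-by-row growth (slow): each concat copies the entire accumulated frame,
# so building n rows costs O(n^2) copying overall.
slow = None
for row in rows:
    chunk = pd.DataFrame([row])
    slow = chunk if slow is None else pd.concat([slow, chunk], ignore_index=True)

# Bulk creation (fast): collect the data first, then allocate one frame.
fast = pd.DataFrame(
    {
        "xval": [r["xval"] for r in rows],
        "yval": [r["yval"] for r in rows],
    }
)

assert slow.equals(fast)
```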
Performance optimization is done in a0bcded. The performance is slightly worse (by roughly several ms) due to the creation of the dataframe object, but this overhead cannot be avoided. I think this is acceptable considering the improved handling of the scatter data set. In the future we can move away from pandas if this tiny performance regression ever really matters.
Thanks for the performance improvement, this looks good to me now!
### Summary

This PR modifies `ScatterTable`, which was introduced in #1253. This change resolves some code issues in #1315 and #1243.

### Details and comments

In the original design `ScatterTable` is tied to the fit models, and the columns contain `model_name` (str) and `model_id` (int). Also, the fit module only allows three data categories: "processed", "formatted", "fitted". However, #1243 breaks this assumption: the `StarkRamseyXYAmpScanAnalysis` fitter defines two fit models which are not directly mapped to the results data. The data fed into the models is synthesized by consuming the input results data. The fitter needs to manage four data categories: "raw", "ramsey" (raw results), "phase" (synthesized data for fit), and "fitted".

This PR relaxes the tight coupling of data to the fit model. In the above example, "raw" and "ramsey" category data can fill the new fields `name` (formerly `model_name`) and `class_id` (formerly `model_id`) without indicating a particular fit model. Usually, raw category data is just classified according to the `data_subfit_map` definition, and the map doesn't need to match the fit models. The connection to fit models is only introduced in a particular category defined by the new option value `fit_category`. This option defaults to "formatted", but the `StarkRamseyXYAmpScanAnalysis` fitter would set "phase" instead. Thus fit model assignment is effectively delayed until the formatter function.

Also, the original scatter table was designed to store all circuit metadata, which causes problems in data formatting, especially when it tries to average the data over the same x value in a group. Non-numeric data is averaged via the builtin set operation, but this assumes the metadata value is a hashable object, which is not generally true. This PR therefore also drops all metadata from the scatter table.

Note that the important metadata fields for curve analysis are the ones used for model classification (classifier fields); the other fields just decorate the table with unnecessary memory footprint. The classifier fields and `name` (`class_id`) are somewhat duplicated information. This implies the `name` and `class_id` fields are enough for end-users to reuse the table data for further analysis once it is saved as an artifact.

---------

Co-authored-by: Will Shanks <[email protected]>
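The hashability problem described above is easy to reproduce with plain Python (illustrative values only): reducing metadata with the builtin `set` works for strings but raises `TypeError` for dict-valued circuit metadata.

```python
# Hashable metadata values reduce fine with the builtin set.
hashable_meta = ["x90p", "x90p", "x90m"]
unique = set(hashable_meta)

# Circuit metadata is often a dict, which is unhashable, so the
# same reduction fails.
unhashable_meta = [{"qubit": 0}, {"qubit": 0}]
try:
    set(unhashable_meta)
    raised = False
except TypeError:
    raised = True
```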
### Summary

Executing the approved design proposal in https://github.com/Qiskit/rfcs/blob/master/0007-experiment-dataframe.md. This PR replaces the representation of curve data points with a data frame. This object will be added to the artifact in a follow-up PR.

### Details and comments

In this PR, the representation of the intermediate data for `CurveAnalysis` is replaced with the data frame `ScatterTable`. Experimentalists want easier access to the XY data points after curve analysis, with various motivations, e.g. plotting data with their own code, rerunning analysis outside the `CurveAnalysis` framework, or analyzing the time series of curve data.

Curve data consists of not only x, y values, but also metadata such as the associated fit model, the processing status (raw, formatted, fitted), and circuit metadata per data point. The data frame representation is convenient for managing such a complicated data set, and also allows us to record all information in a single object.

In addition, `CurveAnalysis` gains a `_create_figures` method thanks to the `ScatterTable` representation, which cannot be implemented with the conventional `CurveData` object. This allows a curve analysis subclass to override the method to flexibly customize figure generation. For example, the current `StarkRamseyXYAmpScanAnalysis` [overwrites the entire `_run_analysis` method](https://github.com/Qiskit-Extensions/qiskit-experiments/blob/c01b0fad86a42ffb3437757a146b79d501992cf4/qiskit_experiments/library/characterization/analysis/ramsey_xy_analysis.py#L539-L685) just to add a second axis to the figure.