Skip to content

Conversation

nkanazawa1989
Copy link
Collaborator

@nkanazawa1989 nkanazawa1989 commented Aug 18, 2023

Summary

Executing approved design proposal in https://github.com/Qiskit/rfcs/blob/master/0007-experiment-dataframe.md. This PR replaces the representation of curve data points with data frame. This object will be added to artifact in a follow up PR.

Details and comments

In this PR, representation of the intermediate data for CurveAnalysis is replaced with the data frame ScatterTable. Experimentalists want easier access to the XY data points after curve analysis with various motivations, e.g. plotting data with their own code, rerunning analysis outside the CurveAnalysis framework, or analyzing the time series of curve data.

A curve data consists of not only x, y values, but also multiple metadata such as associated fit model, process status (raw, formatted, fitted), and some circuit metadata per each data point. The data frame representation is convenient to manage such complicated data set, and also allows us to record all information in a single object.

In addition, CurveAnalysis gains _create_figures method thanks to ScatterTable representation, which cannot be implemented with conventional CurveData object. This allows a curve analysis subclass to overwrites the method to flexibly customize figure generation. For example, current StarkRamseyXYAmpScanAnalysis overwrites entire _run_analysis method just to add second axis to the figure.

…ct provides better reusability of the processed curve data.
@nkanazawa1989 nkanazawa1989 force-pushed the feature/dataframe-pr2 branch from 3c5d9fb to 38b44b6 Compare August 31, 2023 10:20
…ined in the single method _create_figures. This allows subclass to flexibly modify the figure generation without overwriting the entire _run_analysis.
@nkanazawa1989 nkanazawa1989 force-pushed the feature/dataframe-pr2 branch from 38b44b6 to 3f55ad9 Compare August 31, 2023 10:33
Copy link
Collaborator Author

@nkanazawa1989 nkanazawa1989 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some clue for review.

# Create convenient function to compute residual of the models.
partial_residuals = []
valid_uncertainty = np.all(np.isfinite(curve_data.yerr.to_numpy()))
for i, sub_data in list(curve_data.groupby("model_id")):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filtering for each subset is now implemented by dataframe, but this creates another dataframe for subset. Running this in the objective function which is called many times by the optimizer degrades performance. Instead, this logic filters subset in advance and create partial-bound functions that evaluate the residual. This alleviates the performance bottleneck.

# Prepare for fitting
self._initialize(experiment_data)

curve_data = self._format_data(self._run_data_processing(experiment_data.data()))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This curve_data object is now ScatterTable. This table can store both processed_data and formatted_data and code becomes much simpler.

model_fit[:, 4] = i
# type
model_fit[:, 6] = "fit"
curve_data = curve_data.append_list_values(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This curve_data also stores fit values. Note that we can regenerate figure object from this data, i.e. lazy figure gen (but metadata such as chisq and fit parameters must be separately provided). In follow-up maybe we can update internals of curve plotter to directly take this object.

) -> Tuple[List[AnalysisResultData], List["matplotlib.figure.Figure"]]:

analysis_results = []
figures = []
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost copy of CurveAnalysis class, but will be completely dropped once after we move ScatterTable to artifact. Eventually I want to write

component_analysis_results = []
for analysis in self._analyses:
    analysis.set_options(plot=False)
    sub_results, _ = analysis._run_analysis(experiment_data)
    component_analysis_results.extend(sub_results)

# --- I need this logic ---
# 1. get ScatterData of the sub analysis from the artifact
# 2. combine sub scatter data
figures = self._create_figures(combined_scatter_data)

return component_analysis_results + composite_results

but currently scatter data is discarded. Note that composite curve analysis usually creates a single plot including curves from sub analysis, c.f. HamiltonianTomo. Current implementation is just to keep scatter data of the sub analyses, so do not need to review carefully.

return wrapfunc(x=x, **sub_params)


def shot_weighted_average(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are tied to ScatterTable column structure. I cannot reuse existing ones.

@coruscating
Copy link
Collaborator

I'm getting an error when running experiments with auto_save turned on:

  File "qiskit_experiments/framework/experiment_data.py", line 1423, in add_analysis_results
    service_result = _series_to_service_result(
  File "qiskit_experiments/framework/experiment_data.py", line 2656, in _series_to_service_result
    qe_result = AnalysisResultData.from_table_element(**series.replace({np.nan: None}).to_dict())
AttributeError: 'NoneType' object has no attribute 'replace'

Copy link
Collaborator

@coruscating coruscating left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nkanazawa1989, this looks good overall, and modularizing the create figures functionality is nice.

The CurveData mentions in the Curve Analysis tutorial also need to be updated. As for the auto_save bug mentioned in my last comment, I think it's due to AnalysisResultTable.add_entry() returning the results of DefaultColumnsMixIn.add_entry(), which has no return value (missing a return?).

LOG = logging.getLogger(__name__)


class ScatterTable(pd.DataFrame, DefaultColumnsMixIn):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the API docs seem to be building everything inherited from DataFrame:
image
Can we remove the built-in attributes and methods and link to the DataFrame docs through intersphinx instead?

lmfit
rustworkx
rustworkx
pandas>=1.1.5
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that pandas is a requirement, the requires_pandas() decorator can be removed and so can the install pandas note in tutorials/calibrations.rst.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the decorator is not used in the rst file, but text is updated and the decorator is removed from the entire package. 46d350e

@nkanazawa1989
Copy link
Collaborator Author

Thanks @coruscating for detailed review and testing. The problem of auto saving was due to previous PR of #1252, in which I removed the return value because returning the added Series actually creates new object from the table. This could be a performance bottleneck. The problem is fixed in 7b07559.

I also cleaned up the scatter table columns, namely, I dropped a builtin column of formatted because this is almost overlapped with table index. This slightly reduces memory footprint; 5c5b3b6.

I'm still struggling with removing superclass (DataFrame) members from the docstring. I guess I need to write a custom Sphinx directive that is counterpart of :exclude-member:, or maybe another directive that only renders members of current class.

- Avoid add_data call in data processing. Create dataframe with a bulk data is much faster than appending one by one.
- Group sort with model_id instead of model_name. String search is slow in pandas.
- Groupby in format subroutine is replaced with python native groupby function which is faster than pandas groupby.
@nkanazawa1989
Copy link
Collaborator Author

Performance optimization is done in a0bcded. The performance is slightly worse (roughly several ms) due to creation of dataframe object, but this overhead cannot be avoided. I think this is acceptable considering the improved handling of the scatter data set. In future we can move away from pandas when this tiny performance regression becomes really matter.

Copy link
Collaborator

@coruscating coruscating left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the performance improvement, this looks good to me now!

@coruscating coruscating added this pull request to the merge queue Sep 27, 2023
Merged via the queue into qiskit-community:main with commit f33bed7 Sep 27, 2023
github-merge-queue bot pushed a commit that referenced this pull request Dec 22, 2023
### Summary

This PR modifies `ScatterTable` which is introduced in #1253.

This change resolves some code issues in #1315 and #1243.

### Details and comments

In the original design `ScatterTable` is tied to the fit models, and the
columns contains `model_name` (str) and `model_id` (int). Also the fit
module only allows to have three categorical data; "processed",
"formatted", "fitted". However, #1243 breaks this assumption, namely,
the `StarkRamseyXYAmpScanAnalysis` fitter defines two fit models which
are not directly mapped to the results data. The data fed into the
models is synthesized by consuming the input results data. The fitter
needs to manage four categorical data; "raw", "ramsey" (raw results),
"phase" (synthesized data for fit), and "fitted".

This PR relaxes the tight coupling of data to the fit model. In above
example, "raw" and "ramsey" category data can fill new fields `name`
(formally model_name) and `class_id` (model_id) without indicating a
particular fit model. Usually, raw category data is just classified
according to the `data_subfit_map` definition, and the map doesn't need
to match with the fit models. The connection to fit models is only
introduced in a particular category defined by new option value
`fit_category`. This option defaults to "formatted", but
`StarkRamseyXYAmpScanAnalysis` fitter would set "phase" instead. Thus
fit model assignment is effectively delayed until the formatter
function.

Also the original scatter table is designed to store all circuit
metadata which causes some problem in data formatting, especially when
it tries to average the data over the same x value in the group.
Non-numeric data is averaged by builtin set operation, but this assumes
the metadata value is hashable object, which is not generally true. This
PR also drops all metadata from the scatter table. Note that important
metadata fields for the curve analysis are one used for model
classification (classifier fields), and other fields just decorate the
table with unnecessary memory footprint requirements. The classifier
fields and `name` (`class_id`) are sort of duplicated information. This
implies the `name` and `class_id` fields are enough for end-users to
reuse the table data for further analysis once after it's saved as an
artifact.

---------

Co-authored-by: Will Shanks <[email protected]>
nkanazawa1989 added a commit to nkanazawa1989/qiskit-experiments that referenced this pull request Jan 10, 2024
### Summary

Executing approved design proposal in
https://github.com/Qiskit/rfcs/blob/master/0007-experiment-dataframe.md.
This PR replaces the representation of curve data points with data
frame. This object will be added to artifact in a follow up PR.

### Details and comments

In this PR, representation of the intermediate data for `CurveAnalysis`
is replaced with the data frame `ScatterTable`. Experimentalists want
easier access to the XY data points after curve analysis with various
motivations, e.g. plotting data with their own code, rerunning analysis
outside the CurveAnalysis framework, or analyzing the time series of
curve data.

A curve data consists of not only x, y values, but also multiple
metadata such as associated fit model, process status (raw, formatted,
fitted), and some circuit metadata per each data point. The data frame
representation is convenient to manage such complicated data set, and
also allows us to record all information in a single object.

In addition, `CurveAnalysis` gains `_create_figures` method thanks to
`ScatterTable` representation, which cannot be implemented with
conventional `CurveData` object. This allows a curve analysis subclass
to overwrites the method to flexibly customize figure generation. For
example, current `StarkRamseyXYAmpScanAnalysis` [overwrites entire
`_run_analysis`
method](https://github.com/Qiskit-Extensions/qiskit-experiments/blob/c01b0fad86a42ffb3437757a146b79d501992cf4/qiskit_experiments/library/characterization/analysis/ramsey_xy_analysis.py#L539-L685)
just to add second axis to the figure.
nkanazawa1989 added a commit to nkanazawa1989/qiskit-experiments that referenced this pull request Jan 17, 2024
### Summary

This PR modifies `ScatterTable` which is introduced in qiskit-community#1253.

This change resolves some code issues in qiskit-community#1315 and qiskit-community#1243.

### Details and comments

In the original design `ScatterTable` is tied to the fit models, and the
columns contains `model_name` (str) and `model_id` (int). Also the fit
module only allows to have three categorical data; "processed",
"formatted", "fitted". However, qiskit-community#1243 breaks this assumption, namely,
the `StarkRamseyXYAmpScanAnalysis` fitter defines two fit models which
are not directly mapped to the results data. The data fed into the
models is synthesized by consuming the input results data. The fitter
needs to manage four categorical data; "raw", "ramsey" (raw results),
"phase" (synthesized data for fit), and "fitted".

This PR relaxes the tight coupling of data to the fit model. In above
example, "raw" and "ramsey" category data can fill new fields `name`
(formally model_name) and `class_id` (model_id) without indicating a
particular fit model. Usually, raw category data is just classified
according to the `data_subfit_map` definition, and the map doesn't need
to match with the fit models. The connection to fit models is only
introduced in a particular category defined by new option value
`fit_category`. This option defaults to "formatted", but
`StarkRamseyXYAmpScanAnalysis` fitter would set "phase" instead. Thus
fit model assignment is effectively delayed until the formatter
function.

Also the original scatter table is designed to store all circuit
metadata which causes some problem in data formatting, especially when
it tries to average the data over the same x value in the group.
Non-numeric data is averaged by builtin set operation, but this assumes
the metadata value is hashable object, which is not generally true. This
PR also drops all metadata from the scatter table. Note that important
metadata fields for the curve analysis are one used for model
classification (classifier fields), and other fields just decorate the
table with unnecessary memory footprint requirements. The classifier
fields and `name` (`class_id`) are sort of duplicated information. This
implies the `name` and `class_id` fields are enough for end-users to
reuse the table data for further analysis once after it's saved as an
artifact.

---------

Co-authored-by: Will Shanks <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants