Introduce dataframe to ExperimentData (step2) #1253
Conversation
…ct provides better reusability of the processed curve data.
…ined in the single method _create_figures. This allows a subclass to flexibly modify the figure generation without overriding the entire _run_analysis.
Some clues for the review below.
```python
# Create convenient function to compute residual of the models.
partial_residuals = []
valid_uncertainty = np.all(np.isfinite(curve_data.yerr.to_numpy()))
for i, sub_data in list(curve_data.groupby("model_id")):
```
Filtering for each subset is now implemented with the dataframe, but this creates another dataframe per subset. Running this in the objective function, which is called many times by the optimizer, degrades performance. Instead, this logic filters the subsets in advance and creates partially bound functions that evaluate the residual. This alleviates the performance bottleneck.
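The pre-binding idea can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function names (`build_partial_residuals`, `objective`) and column names (`xval`, `yval`) are hypothetical stand-ins for the real implementation.

```python
from functools import partial

import numpy as np
import pandas as pd


def residual(params, x, y, model):
    """Residual of a single model on pre-extracted numpy arrays."""
    return model(x, **params) - y


def build_partial_residuals(curve_data, models):
    """Filter each model's subset ONCE, outside the optimizer loop."""
    partial_residuals = []
    for i, sub_data in curve_data.groupby("model_id"):
        partial_residuals.append(
            partial(
                residual,
                x=sub_data["xval"].to_numpy(),
                y=sub_data["yval"].to_numpy(),
                model=models[i],
            )
        )
    return partial_residuals


def objective(params, partial_residuals):
    # Called many times by the optimizer; no dataframe filtering happens here.
    return np.concatenate([f(params) for f in partial_residuals])
```

The dataframe `groupby` cost is paid once up front; each optimizer iteration only touches plain numpy arrays.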
```python
# Prepare for fitting
self._initialize(experiment_data)

curve_data = self._format_data(self._run_data_processing(experiment_data.data()))
```
This curve_data object is now a ScatterTable. This table can store both processed_data and formatted_data, and the code becomes much simpler.
```python
model_fit[:, 4] = i
# type
model_fit[:, 6] = "fit"
curve_data = curve_data.append_list_values(
```
This curve_data also stores the fit values. Note that we can regenerate the figure object from this data, i.e. lazy figure generation (though metadata such as chi-squared and fit parameters must be provided separately). In a follow-up we could update the internals of the curve plotter to take this object directly.
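The lazy-figure-generation idea can be sketched with plain pandas. This is an illustration only: the column names (`xval`, `yval`, `category`) and the helper `regenerate_figure_data` are hypothetical, not the ScatterTable API.

```python
import numpy as np
import pandas as pd

# Hypothetical table holding both measured points and fitted curve samples,
# distinguished by a category column.
x_fit = np.linspace(0.0, 1.0, 5)
table = pd.concat(
    [
        pd.DataFrame({"xval": [0.0, 1.0], "yval": [0.1, 0.9], "category": "formatted"}),
        pd.DataFrame({"xval": x_fit, "yval": 0.8 * x_fit + 0.1, "category": "fitted"}),
    ],
    ignore_index=True,
)


def regenerate_figure_data(table: pd.DataFrame):
    """A figure can be rebuilt later from the stored rows alone."""
    points = table[table.category == "formatted"]  # scatter points
    curve = table[table.category == "fitted"]      # fit line samples
    return points, curve
```

Because everything needed to draw the plot lives in one table, the figure does not have to be rendered at analysis time.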
```python
) -> Tuple[List[AnalysisResultData], List["matplotlib.figure.Figure"]]:

    analysis_results = []
    figures = []
```
This is almost a copy of the CurveAnalysis class, but it will be completely dropped once we move ScatterTable to the artifact. Eventually I want to write:

```python
component_analysis_results = []
for analysis in self._analyses:
    analysis.set_options(plot=False)
    sub_results, _ = analysis._run_analysis(experiment_data)
    component_analysis_results.extend(sub_results)

# --- I need this logic ---
# 1. get ScatterData of the sub analysis from the artifact
# 2. combine sub scatter data
figures = self._create_figures(combined_scatter_data)

return component_analysis_results + composite_results
```

but currently the scatter data is discarded. Note that composite curve analysis usually creates a single plot including curves from the sub-analyses, c.f. HamiltonianTomo. The current implementation just keeps the scatter data of the sub-analyses, so it does not need careful review.
```python
    return wrapfunc(x=x, **sub_params)


def shot_weighted_average(
```
These functions are tied to the ScatterTable column structure, so I cannot reuse the existing ones.
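A shot-weighted average over such a column structure might look like the sketch below. This is an assumption-laden illustration, not the PR's implementation: it assumes columns named `yval`, `yerr`, and `shots`, and propagates uncertainty as a standard weighted combination of independent errors.

```python
import numpy as np
import pandas as pd


def shot_weighted_average(group: pd.DataFrame) -> pd.Series:
    """Average y values weighted by shot counts (hypothetical column names)."""
    shots = group["shots"].to_numpy()
    weights = shots / shots.sum()
    return pd.Series(
        {
            # Weighted mean of the y values.
            "yval": np.sum(weights * group["yval"].to_numpy()),
            # Error propagation for a weighted sum of independent values.
            "yerr": np.sqrt(np.sum(weights**2 * group["yerr"].to_numpy() ** 2)),
            "shots": shots.sum(),
        }
    )
```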
I'm getting an error when running experiments with
Thanks @nkanazawa1989, this looks good overall, and modularizing the create-figures functionality is nice.

The `CurveData` mentions in the Curve Analysis tutorial also need to be updated. As for the `auto_save` bug mentioned in my last comment, I think it's due to `AnalysisResultTable.add_entry()` returning the result of `DefaultColumnsMixIn.add_entry()`, which has no return value (missing a return?).
```python
LOG = logging.getLogger(__name__)


class ScatterTable(pd.DataFrame, DefaultColumnsMixIn):
```
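Subclassing `pd.DataFrame`, as the class declaration above does, usually requires one extra hook: pandas consults the `_constructor` property to decide which type to return from slicing and filtering operations. A minimal sketch (the class name, columns, and `filter_category` helper are hypothetical, not the real ScatterTable):

```python
import pandas as pd


class ScatterTableSketch(pd.DataFrame):
    """Minimal DataFrame subclass sketch."""

    # pandas calls _constructor to build result objects, so overriding it
    # makes slicing/filtering return this subclass instead of a plain DataFrame.
    @property
    def _constructor(self):
        return ScatterTableSketch

    def filter_category(self, category: str) -> "ScatterTableSketch":
        return self[self["category"] == category]
```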
```
lmfit
rustworkx
pandas>=1.1.5
```
Now that pandas is a requirement, the `requires_pandas()` decorator can be removed, and so can the "install pandas" note in `tutorials/calibrations.rst`.
It seems the decorator is not used in the rst file, but the text is updated and the decorator is removed from the entire package. 46d350e
…ntry returns no data and get_entry must be explicitly called.
- raw -> processed
- fit-ready -> formatted
- fit -> fitted
Thanks @coruscating for the detailed review and testing. The problem of auto saving was due to the previous PR #1252, in which I removed the return value because returning the added …

I also cleaned up the scatter table columns, namely, I dropped a builtin column of …

I'm still struggling with removing superclass (DataFrame) members from the docstring. I guess I need to write a custom Sphinx directive that is a counterpart of …
- Avoid add_data call in data processing. Creating a dataframe from bulk data is much faster than appending rows one by one.
- Group-sort with model_id instead of model_name. String search is slow in pandas.
- Groupby in the format subroutine is replaced with Python's native groupby function, which is faster than pandas groupby.
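The first point above can be demonstrated with generic pandas (an illustration with made-up column names, not the PR's code): growing a frame row by row copies the whole frame on every step, while bulk construction allocates once.

```python
import pandas as pd

rows = [{"xval": float(i), "yval": float(i) ** 2} for i in range(100)]

# Row-by-row growth (slow): each concat copies the entire accumulated frame,
# so building n rows costs O(n^2) copying overall.
slow = None
for row in rows:
    chunk = pd.DataFrame([row])
    slow = chunk if slow is None else pd.concat([slow, chunk], ignore_index=True)

# Bulk creation (fast): collect the data first, then allocate one frame.
fast = pd.DataFrame(
    {
        "xval": [r["xval"] for r in rows],
        "yval": [r["yval"] for r in rows],
    }
)

assert slow.equals(fast)
```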
Performance optimization is done in a0bcded. The performance is slightly worse (by roughly several ms) due to the creation of the dataframe object, but this overhead cannot be avoided. I think this is acceptable considering the improved handling of the scatter data set. In the future we can move away from pandas if this tiny performance regression ever really matters.
Thanks for the performance improvement, this looks good to me now!
### Summary

This PR modifies `ScatterTable`, which was introduced in #1253. This change resolves some code issues in #1315 and #1243.

### Details and comments

In the original design `ScatterTable` is tied to the fit models, and the columns contain `model_name` (str) and `model_id` (int). Also, the fit module only allows three data categories: "processed", "formatted", "fitted". However, #1243 breaks this assumption: the `StarkRamseyXYAmpScanAnalysis` fitter defines two fit models which are not directly mapped to the results data. The data fed into the models is synthesized by consuming the input results data. The fitter needs to manage four data categories: "raw", "ramsey" (raw results), "phase" (synthesized data for fit), and "fitted".

This PR relaxes the tight coupling of data to the fit model. In the above example, "raw" and "ramsey" category data can fill the new fields `name` (formerly `model_name`) and `class_id` (formerly `model_id`) without indicating a particular fit model. Usually, raw category data is just classified according to the `data_subfit_map` definition, and the map doesn't need to match the fit models. The connection to fit models is only introduced in a particular category defined by the new option value `fit_category`. This option defaults to "formatted", but the `StarkRamseyXYAmpScanAnalysis` fitter would set "phase" instead. Thus fit model assignment is effectively delayed until the formatter function.

Also, the original scatter table was designed to store all circuit metadata, which causes problems in data formatting, especially when it tries to average the data over the same x value in a group. Non-numeric data is averaged via the builtin set operation, but this assumes the metadata value is a hashable object, which is not generally true. This PR therefore also drops all metadata from the scatter table.

Note that the important metadata fields for curve analysis are the ones used for model classification (classifier fields); the other fields just decorate the table with unnecessary memory footprint. The classifier fields and `name` (`class_id`) are somewhat duplicated information. This implies the `name` and `class_id` fields are enough for end-users to reuse the table data for further analysis once it is saved as an artifact.

---------

Co-authored-by: Will Shanks <[email protected]>
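The hashability problem described above is easy to reproduce with plain Python (illustrative values only): reducing metadata with the builtin `set` works for strings but raises `TypeError` for dict-valued circuit metadata.

```python
# Hashable metadata values reduce fine with the builtin set.
hashable_meta = ["x90p", "x90p", "x90m"]
unique = set(hashable_meta)

# Circuit metadata is often a dict, which is unhashable, so the
# same reduction fails.
unhashable_meta = [{"qubit": 0}, {"qubit": 0}]
try:
    set(unhashable_meta)
    raised = False
except TypeError:
    raised = True
```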
### Summary

Executing the approved design proposal in https://github.com/Qiskit/rfcs/blob/master/0007-experiment-dataframe.md. This PR replaces the representation of curve data points with a data frame. This object will be added to the artifact in a follow-up PR.

### Details and comments

In this PR, the representation of the intermediate data for `CurveAnalysis` is replaced with the data frame `ScatterTable`. Experimentalists want easier access to the XY data points after curve analysis, with various motivations, e.g. plotting data with their own code, rerunning analysis outside the `CurveAnalysis` framework, or analyzing the time series of curve data.

Curve data consists of not only x, y values, but also metadata such as the associated fit model, the processing status (raw, formatted, fitted), and circuit metadata per data point. The data frame representation is convenient for managing such a complicated data set, and also allows us to record all information in a single object.

In addition, `CurveAnalysis` gains a `_create_figures` method thanks to the `ScatterTable` representation, which cannot be implemented with the conventional `CurveData` object. This allows a curve analysis subclass to override the method to flexibly customize figure generation. For example, the current `StarkRamseyXYAmpScanAnalysis` [overwrites the entire `_run_analysis` method](https://github.com/Qiskit-Extensions/qiskit-experiments/blob/c01b0fad86a42ffb3437757a146b79d501992cf4/qiskit_experiments/library/characterization/analysis/ramsey_xy_analysis.py#L539-L685) just to add a second axis to the figure.