## Installation Instructions \[Standard\]
The basic environmental setup is shown below. A virtual / conda environment may be constructed; however, the requirements are quite lightweight and this is probably not needed.
We assume that any specified virtual / conda environment has been activated for all subsequent code snippets.
# Quick Start Guides
## Convenience: Automatic Test Instantiation
For convenience, you can automatically select between STEP and Lai (a baseline method) depending on the value of `n_max` using the factory function in `auto.py`:
```python
from sequentialized_barnard_tests import get_mirrored_test

test = get_mirrored_test(n_max, alternative, alpha, verbose=True, ...)
```
If `n_max > 500`, this will instantiate a `MirroredLaiTest`, a computationally efficient baseline whose performance is comparable to `MirroredStepTest` at sufficiently large sample sizes; otherwise, it will use the more powerful `MirroredStepTest`, which can take longer to synthesize the decision rule. All shared and class-specific arguments can be passed as keyword arguments.
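As a conceptual sketch of the selection rule just described (the classes below are placeholder stand-ins for illustration, not the package's real implementations, and the function name `select_mirrored_test` is hypothetical):

```python
# Placeholder stand-ins for the real test classes, for illustration only.
class MirroredStepTest:
    def __init__(self, **kwargs):
        self.kwargs = kwargs

class MirroredLaiTest:
    def __init__(self, **kwargs):
        self.kwargs = kwargs

def select_mirrored_test(n_max, alternative, alpha, **kwargs):
    """Sketch of the documented rule: use the Lai baseline beyond
    n_max = 500, and the more powerful STEP test otherwise."""
    cls = MirroredLaiTest if n_max > 500 else MirroredStepTest
    return cls(n_max=n_max, alternative=alternative, alpha=alpha, **kwargs)
```

All keyword arguments are simply forwarded to whichever class is selected, which matches the factory's documented behavior.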
## Example Usage
Below are minimal code examples with different policy evaluation data, leading to three distinct evaluation results.
### Case 1: Test yields `AcceptAlternative`
```python
67
+
from sequentialized_barnard_tests import get_mirrored_test, Hypothesis
68
+
69
+
n_max =100# maximum sample size is 100 (per policy)
70
+
alternative = Hypothesis.P0LessThanP1 # we want to test if "success rate of the first policy < success rate of the second policy"
71
+
alpha =0.05# false positive rate is 5%
72
+
73
+
test = get_mirrored_test(n_max=n_max, alternative=alternative, alpha=alpha)
74
+
75
+
success_array_policy_0 = [False] *10# the first policy failed 10 times
76
+
success_array_policy_1 = [True] *10# the second policy succeeded 10 times
77
+
78
+
result = test.run_on_sequence(success_array_policy_0, success_array_policy_1)
79
+
decision = result.decision
80
+
print(decision) # AcceptAlternative: success rate of the first policy < success rate of the second policy with 95% confidence
81
+
```
### Case 2: Test yields `AcceptNull`
```python
from sequentialized_barnard_tests import get_mirrored_test, Hypothesis

n_max = 100  # maximum sample size is 100 (per policy)
alternative = Hypothesis.P0LessThanP1  # test whether "success rate of the first policy < success rate of the second policy"
alpha = 0.05  # false positive rate is 5%

test = get_mirrored_test(n_max=n_max, alternative=alternative, alpha=alpha)

success_array_policy_0 = [True] * 10  # the first policy succeeded 10 times
success_array_policy_1 = [False] * 10  # the second policy failed 10 times

result = test.run_on_sequence(success_array_policy_0, success_array_policy_1)
decision = result.decision
print(decision)  # AcceptNull: success rate of the first policy > success rate of the second policy with 95% confidence
```
Note: `AcceptNull` is a valid decision only for "mirrored" tests. In our terminology, a mirrored test is one that runs two one-sided tests simultaneously, with the null and the alternative flipped from each other. (Because of the monotonicity of the test statistic, mirrored tests suffer no penalty for running two tests simultaneously, and therefore essentially dominate one-sided tests.) In the example above, the alternative is `Hypothesis.P0LessThanP1` and the decision is `Decision.AcceptNull`, which should be interpreted as accepting `Hypothesis.P0MoreThanP1`. If you would rather have a more conventional one-sided test, you can instantiate one by calling `get_test` instead of `get_mirrored_test`.
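To make the mirroring idea concrete, here is a minimal, self-contained sketch. This is not the package's implementation: the `toy_one_sided` rule below is a hypothetical placeholder for a real sequential test, and only the two-directional combination logic reflects the text above.

```python
from enum import Enum, auto

class Decision(Enum):
    AcceptAlternative = auto()  # accept H1: p0 < p1
    AcceptNull = auto()         # accept the flipped alternative: p0 > p1
    FailToDecide = auto()

def toy_one_sided(successes_a, successes_b, margin=5):
    """Hypothetical placeholder for a one-sided test of 'rate(a) < rate(b)':
    reject the null only if b leads a by at least `margin` successes."""
    return sum(successes_b) - sum(successes_a) >= margin

def mirrored(successes_0, successes_1, one_sided=toy_one_sided):
    """Run the one-sided test in both directions, as a mirrored test does."""
    if one_sided(successes_0, successes_1):
        return Decision.AcceptAlternative
    if one_sided(successes_1, successes_0):
        return Decision.AcceptNull
    return Decision.FailToDecide
```

With `[False] * 10` versus `[True] * 10` this sketch returns `Decision.AcceptAlternative`; with the arrays swapped it returns `Decision.AcceptNull`, illustrating how one mirrored run covers both directions.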
### Case 3: Test yields `FailToDecide`
```python
from sequentialized_barnard_tests import get_mirrored_test, Hypothesis

n_max = 100  # maximum sample size is 100 (per policy)
alternative = Hypothesis.P0LessThanP1  # test whether "success rate of the first policy < success rate of the second policy"
alpha = 0.05  # false positive rate is 5%

test = get_mirrored_test(n_max=n_max, alternative=alternative, alpha=alpha)

success_array_policy_0 = [True, False, False, True]  # the first policy succeeded 2 out of 4 times
success_array_policy_1 = [False, True, True, True]  # the second policy succeeded 3 out of 4 times

result = test.run_on_sequence(success_array_policy_0, success_array_policy_1)
decision = result.decision
print(decision)  # FailToDecide: difference was not statistically separable; the user can collect 100 - 4 = 96 more rollouts per policy and re-run the test
```
## Key Notes for Understanding the Core Ideas of STEP Code
We include key notes for understanding the core ideas of the STEP code. Quick-start resources are included in both shell script and notebook form.
### (1A) Understanding the Accepted Shape Parameters
In order to synthesize a STEP policy for specific values of `n_max` and `alpha`, one additional set of parametric decisions is required: the user must set the risk budget shape, which is specified by a choice of function family (p-norm vs. zeta-function) and a particular shape parameter. The shape parameter is real-valued; it is used directly for zeta functions and is exponentiated for p-norms.
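As an illustrative, hypothetical sketch of how such a shape parameter could generate a monotone risk budget (the formulas below are our own guesses for exposition, not the package's actual families; the only properties taken from the text are that the parameter is exponentiated for p-norms, used directly for zeta functions, and that $`0.0`$ recovers a linear budget):

```python
import math
from itertools import accumulate

def risk_budget(n_max, alpha, family, shape=0.0):
    """Illustrative monotone risk budgets ending at alpha (hypothetical formulas)."""
    if family == "p_norm":
        p = math.exp(shape)  # shape parameter is exponentiated for p-norms
        return [alpha * (n / n_max) ** p for n in range(1, n_max + 1)]
    if family == "zeta":
        terms = [k ** (-shape) for k in range(1, n_max + 1)]  # shape used directly
        cum = list(accumulate(terms))
        return [alpha * c / cum[-1] for c in cum]
    raise ValueError(family)
```

At `shape=0.0` both hypothetical families reduce to the same linear budget `alpha * n / n_max`, consistent with the default described in section (2B) below.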
Having decided an appropriate form for the risk budget shape, policy synthesis is straightforward to run. From the base directory, the general command would be:
Note: This script will be called automatically upon instantiation of a test object if the corresponding policy file is missing from `sequentialized_barnard_tests/policies/`.
### (2B) What If I Don't Know the Right Risk Budget?
We recommend using the default linear risk budget, which is the shape *used in the paper*. This corresponds to \{shape_parameter\}$`= 0.0`$ for each shape family, so *either family with shape parameter 0.0 constructs the same policy*. For example:
```bash
$ python sequentialized_barnard_tests/scripts/synthesize_general_step_policy.py -n {n_max} -a {alpha}
```
- At present, we have not tested extensively beyond \{n_max\}$`=500`$. Going beyond this limit may lead to issues, and such issues become more likely as \{n_max\} grows. The code also requires increasing amounts of RAM as \{n_max\} is increased.
## Script-Based Evaluation on Real Data
We now assume that a STEP policy has been constructed for the target problem. This can either be one of the default policies, or a newly constructed one following the recipe in the preceding section.