# Enterprise Classifier Integration Tests

This document describes the integration test suite for enterprise classifiers hosted on Hugging Face Hub.

## Overview

The integration tests (`tests/test_enterprise_classifiers_integration.py`) verify that all 17 enterprise classifiers maintain their expected performance and behavior. These tests serve as regression tests to ensure code changes don't break the published models.

## Test Coverage

The integration test suite covers:

- **Model Loading**: Can each model be loaded from the Hugging Face Hub?
- **Prediction Functionality**: Do models make valid predictions?
- **k-Parameter Consistency**: Do k=1 and k=2 produce consistent results? (regression test for the k-parameter bug)
- **Prediction Stability**: Are repeated predictions consistent?
- **Performance**: Does inference complete within a reasonable time?
- **Class Coverage**: Do models know about all expected classes?
- **Health Check**: Overall ecosystem health assessment
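
The k-parameter consistency check can be sketched as follows. This is an illustrative sketch, not the suite's actual code: `predict` here is a stand-in for a classifier's predict method, assumed to return `(label, score)` pairs ranked by score.

```python
def top_prediction(predictions):
    """Return the label of the highest-scoring (label, score) pair."""
    return max(predictions, key=lambda pair: pair[1])[0]

def check_k_consistency(predict, text):
    """Assert that k=1 and k=2 agree on the top prediction.

    `predict(text, k)` stands in for a classifier's predict method;
    the (label, score) return format is an assumption.
    """
    top_k1 = top_prediction(predict(text, k=1))
    top_k2 = top_prediction(predict(text, k=2))
    assert top_k1 == top_k2, (
        f"k=1 predicted {top_k1!r} but k=2 predicted {top_k2!r}"
    )

# Toy stand-in classifier, used only for illustration.
def fake_predict(text, k):
    scores = {"fraud": 0.9, "legitimate": 0.1}
    ranked = sorted(scores.items(), key=lambda p: p[1], reverse=True)
    return ranked[:k]

check_k_consistency(fake_predict, "Wire transfer to an unknown offshore account")
```

A regression in the k parameter shows up here as the two calls disagreeing on the top label, which is exactly what `test_k_parameter_consistency` guards against.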

## Running Integration Tests

### Run All Integration Tests
```bash
pytest tests/test_enterprise_classifiers_integration.py -v
```

### Run Only Unit Tests (Skip Integration)
```bash
pytest tests/ -m "not integration" -v
```

### Run Specific Test for One Classifier
```bash
pytest tests/test_enterprise_classifiers_integration.py -k "fraud-detection" -v
```

### Run Specific Test Type
```bash
# Test k-parameter consistency for all classifiers
pytest tests/test_enterprise_classifiers_integration.py::TestEnterpriseClassifiers::test_k_parameter_consistency -v

# Test model loading for all classifiers
pytest tests/test_enterprise_classifiers_integration.py::TestEnterpriseClassifiers::test_model_loading -v
```

## CI/CD Integration

The CI/CD pipeline runs integration tests automatically:

1. **Unit Tests Job**: Runs all unit tests first
2. **Integration Tests Job**: Runs only if unit tests pass
   - 30-minute timeout for model downloads
   - Tests all 17 enterprise classifiers
   - Reports detailed results
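
In a GitHub Actions pipeline, the two-job structure above might look like the following sketch. The job names, setup steps, and install command are illustrative assumptions, not the repository's actual workflow file:

```yaml
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e ".[dev]"          # install command is assumed
      - run: pytest tests/ -m "not integration" -v

  integration-tests:
    needs: unit-tests              # runs only if unit tests pass
    runs-on: ubuntu-latest
    timeout-minutes: 30            # allow time for model downloads
    steps:
      - uses: actions/checkout@v4
      - run: pip install -e ".[dev]"
      - run: pytest tests/test_enterprise_classifiers_integration.py -v
```

The `needs:` key enforces the ordering between the two jobs, and `timeout-minutes` caps the integration job to the 30-minute budget described above.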

## Tested Classifiers

The following 17 enterprise classifiers are tested:

| Classifier | Expected Accuracy | Classes | Use Case |
|------------|------------------|---------|----------|
| business-sentiment | 98.8% | 4 | Business text sentiment analysis |
| compliance-classification | 65.3% | 5 | Regulatory compliance categorization |
| content-moderation | 100.0% | 3 | Content filtering and moderation |
| customer-intent | 85.2% | 4 | Customer service intent detection |
| document-quality | 100.0% | 2 | Document quality assessment |
| document-type | 98.0% | 5 | Document type classification |
| email-priority | 83.9% | 3 | Email priority triage |
| email-security | 93.8% | 4 | Email security threat detection |
| escalation-detection | 97.6% | 2 | Support ticket escalation detection |
| expense-category | 84.2% | 5 | Business expense categorization |
| fraud-detection | 92.7% | 2 | Financial fraud detection |
| language-detection | 100.0% | 4 | Text language identification |
| pii-detection | 100.0% | 2 | Personal information detection |
| product-category | 85.2% | 4 | E-commerce product categorization |
| risk-assessment | 75.6% | 2 | Security risk assessment |
| support-ticket | 82.9% | 4 | Support ticket categorization |
| vendor-classification | 92.7% | 2 | Vendor relationship classification |

## Test Assertions

Each classifier is tested against:

- **Minimum Accuracy Thresholds**: Must meet or exceed defined minimums
- **k-Parameter Consistency**: k=1 and k=2 must produce identical top predictions
- **Response Time**: Inference must complete within 2 seconds
- **Class Completeness**: Must know about all expected classes
- **Prediction Validity**: All predictions must be properly formatted
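
The validity and response-time assertions can be sketched as below. The exact field checks in `is_valid_prediction` (known labels, scores in [0, 1]) are assumptions for illustration, not the suite's actual code:

```python
import time

def is_valid_prediction(predictions, expected_classes):
    """Check that predictions are (label, score) pairs with known
    labels and scores in [0, 1] (an assumed format)."""
    if not predictions:
        return False
    for label, score in predictions:
        if label not in expected_classes:
            return False
        if not (0.0 <= score <= 1.0):
            return False
    return True

def assert_fast_enough(predict, text, budget_seconds=2.0):
    """Assert that a single inference finishes within the 2-second budget."""
    start = time.perf_counter()
    predict(text)
    elapsed = time.perf_counter() - start
    assert elapsed <= budget_seconds, f"inference took {elapsed:.2f}s"
```

Timing a single call with `time.perf_counter()` is a simplification; a real benchmark would likely warm up the model and average over several runs.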

## Failure Modes

Tests will fail if:

- The model cannot be loaded from the Hugging Face Hub
- Accuracy drops below the minimum threshold
- k=1 and k=2 produce different top predictions (a regression)
- Inference takes longer than 2 seconds
- Predicted classes don't match the expected class set
- The prediction format is invalid

## Adding New Enterprise Classifiers

To add a new enterprise classifier to the test suite:

1. Update the `CLASSIFIER_METRICS` dictionary with the expected metrics
2. Add domain-specific test sentences to `TEST_SENTENCES`
3. Ensure the model is available on the Hugging Face Hub as `adaptive-classifier/{name}`
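
The first two steps above amount to one entry in each dictionary. The classifier name and the field names inside the metrics entry (`accuracy`, `classes`) are hypothetical; match the fields the suite actually uses:

```python
# Hypothetical entries for a new "invoice-approval" classifier.
CLASSIFIER_METRICS = {
    "invoice-approval": {
        "accuracy": 0.90,                  # minimum accuracy threshold (assumed field)
        "classes": ["approve", "reject"],  # expected class set (assumed field)
    },
}

TEST_SENTENCES = {
    "invoice-approval": [
        "Invoice matches the purchase order and is within budget.",
        "Duplicate invoice number with no matching purchase order.",
    ],
}
```

Keeping the dictionary keys identical to the Hub model name (`adaptive-classifier/invoice-approval` here) lets the parametrized tests pick the new classifier up automatically.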

## Debugging Test Failures

### Model Loading Failures
- Check that the model exists on the Hugging Face Hub
- Verify network connectivity
- Check for authentication issues (if the model is private)

### Accuracy Failures
- Compare actual vs. expected accuracy in the test output
- Check whether the model was retrained recently
- Verify that the test sentences are appropriate for the domain

### k-Parameter Inconsistencies
- These indicate a regression in the k-parameter fix
- Check the prediction logic in the `_predict_regular` method
- Verify how prototype and neural head predictions are combined
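
Combining the two prediction sources can be sketched as a weighted merge of per-class scores. The function name, the 0.7/0.3 weights, and the dict-based score format are illustrative assumptions, not the library's actual implementation:

```python
def combine_scores(prototype_scores, head_scores,
                   proto_weight=0.7, head_weight=0.3):
    """Merge prototype and neural-head per-class scores into one ranking.

    The weights are assumptions for illustration; the real combination
    logic lives in the `_predict_regular` method.
    """
    labels = set(prototype_scores) | set(head_scores)
    combined = {
        label: proto_weight * prototype_scores.get(label, 0.0)
               + head_weight * head_scores.get(label, 0.0)
        for label in labels
    }
    # Ranking once and slicing to k afterwards keeps the top prediction
    # identical for every k, which is what the consistency test checks.
    return sorted(combined.items(), key=lambda p: p[1], reverse=True)

ranked = combine_scores({"fraud": 0.8, "legitimate": 0.2},
                        {"fraud": 0.6, "legitimate": 0.4})
```

The point of the comment in the middle is the debugging hint: a k-parameter inconsistency usually means the combination or truncation step depends on k, rather than ranking the full class set first.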

### Performance Issues
- Check system resources during testing
- Consider network latency for model downloads
- Verify that the model size hasn't increased significantly

## Local Development

For faster local testing, you can:

```bash
# Skip integration tests during development
pytest tests/ -m "not integration"

# Test a specific classifier during debugging
pytest tests/test_enterprise_classifiers_integration.py -k "fraud-detection" -v -s
```

## Maintenance

The integration test suite should be updated when:

- New enterprise classifiers are published
- Expected accuracy thresholds change
- New test dimensions are needed
- Test sentences need updating for better coverage

This ensures the test suite remains comprehensive and valuable for regression detection.