1 change: 1 addition & 0 deletions .gitignore
@@ -162,3 +162,4 @@ cython_debug/

db-dump/
postgres-data/
+.vscode/
2 changes: 1 addition & 1 deletion README.md
@@ -1,5 +1,5 @@
<div align="center">
-<img alt="OpenHEXA Logo" src="https://raw.githubusercontent.com/BLSQ/openhexa-app/main/hexa/static/img/logo/logo_with_text_grey.svg" height="80">
+<img alt="OpenHEXA Logo" src="https://raw.githubusercontent.com/BLSQ/openhexa-app/main/backend/hexa/static/img/logo/logo_with_text_black.svg" height="80">
</div>
<p align="center">
<em>Open-source Data integration platform</em>
242 changes: 242 additions & 0 deletions docs/expectations.md
@@ -0,0 +1,242 @@
# Expectations Module

## Overview

The `Expectations` class provides a structured way to **validate datasets** against defined data quality rules using [Great Expectations](https://greatexpectations.io/).

It supports both **DataFrame-level** and **Column-level** checks, with validation rules defined in an external `expectations.yml` file.

The class supports datasets in both **pandas** and **polars**, automatically normalizing to pandas for validation.

### Features

- **DataFrame-level checks**
  - Validate row/column count
  - Validate emptiness/non-emptiness
- **Column-level checks**
  - Column existence
  - Data type enforcement
  - Numeric range validation
  - Nullability checks
  - Allowed categorical values
  - String length validation (fixed length or range)

---

## Installation Requirements

Ensure the following dependencies are installed:

```bash
pip install pandas polars pyyaml great-expectations
```

---

## Class: `Expectations`

### Initialization

```python
Expectations(
    dataset: pd.DataFrame | pl.DataFrame,
    expectations_yml_file: str | None = None,
)
```

#### Parameters

* **dataset** (`pd.DataFrame | pl.DataFrame`)
  The dataset to validate, as either a pandas or a polars DataFrame.
  A `polars.DataFrame` is converted to pandas before validation.

* **expectations\_yml\_file** (`str | None`, optional)
  Path to the expectations YAML file.
  If omitted, defaults to `expectations.yml` in the caller's directory.

#### Raises

* `ValueError`
  * If `dataset` is not a pandas or polars DataFrame
  * If `expectations_yml_file` is provided but is not a string
* `FileNotFoundError`
  If the expectations YAML file does not exist
* `yaml.YAMLError` or `ValueError`
  If the YAML file cannot be parsed or is missing required sections
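
The path rules above can be sketched in isolation. This is a minimal illustration, not the module's actual code: `resolve_expectations_path` and its `caller_file` argument are hypothetical names, and looking up the caller through `inspect.stack()` is an assumption about how "the caller's directory" might be determined.

```python
import inspect
from pathlib import Path


def resolve_expectations_path(expectations_yml_file=None, caller_file=None):
    """Return the expectations file to load (hypothetical helper)."""
    if expectations_yml_file is not None:
        if not isinstance(expectations_yml_file, str):
            raise ValueError("expectations_yml_file must be a string")
        path = Path(expectations_yml_file)
    else:
        if caller_file is None:
            # Assumption: frame 1 is the script that called this helper.
            caller_file = inspect.stack()[1].filename
        # Default: expectations.yml next to the calling script.
        path = Path(caller_file).parent / "expectations.yml"
    if not path.is_file():
        raise FileNotFoundError(f"Expectations file not found: {path}")
    return path
```

Passing `caller_file` explicitly keeps the helper easy to test; in a pipeline script it would normally be left as `None`.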

---

## YAML Expectations Schema

The `expectations.yml` file must define **two sections**, `dataframe` and `columns`:

```yaml
dataframe:
  size: not empty # or empty
  no_columns: 5
  no_rows: 3

columns:
  age:
    type: int64
    minimum: 18
    maximum: 70
    not-null: true

  height:
    type: int64
    minimum: 5
    maximum: 8
    not-null: false

  gender:
    type: object
    classes:
      - male
      - female
      - other
    not-null: false

  phone:
    type: object
    not-null: false
    length-between:
      - 10
      - 13

  shirt_size:
    type: object
    classes:
      - s
      - m
      - l
    not-null: true
    length-between:
      - 1 # exact length of 1
```

---

## Supported Expectations Mapping

The YAML configuration is translated into the following **Great Expectations classes**:

| YAML Key | Great Expectations Class |
| ---------------------------- | ----------------------------------------------------------------------------- |
| `type` | `ExpectColumnValuesToBeOfType` |
| `minimum` / `maximum` | `ExpectColumnValuesToBeBetween` (only for numeric types) |
| `not-null: true` | `ExpectColumnValuesToNotBeNull` |
| `classes` | `ExpectColumnDistinctValuesToBeInSet` |
| `length-between: [N]` | `ExpectColumnValueLengthsToEqual` (exact length `N`) |
| `length-between: [min, max]` | `ExpectColumnValueLengthsToBeBetween` (string length between `min` and `max`) |
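
To make the table concrete, here is a rough sketch of how one column's YAML entry could be turned into a list of expectation names and keyword arguments. It is not the module's implementation: `plan_column_expectations` and `NUMERIC_TYPES` are hypothetical, the rule for what counts as a "numeric type" is an assumption, and the keyword names (`type_`, `min_value`, `max_value`, `value_set`, `value`) should be checked against the Great Expectations version in use. Class names are returned as strings so the sketch runs without Great Expectations installed.

```python
# Assumption: which `type` values count as numeric for minimum/maximum.
NUMERIC_TYPES = {"int64", "float64"}


def plan_column_expectations(column, spec):
    """Map one column's YAML entry to (class name, kwargs) pairs."""
    planned = []
    if "type" in spec:
        planned.append(
            ("ExpectColumnValuesToBeOfType", {"column": column, "type_": spec["type"]})
        )
    has_bounds = "minimum" in spec or "maximum" in spec
    if has_bounds and spec.get("type") in NUMERIC_TYPES:
        planned.append(
            (
                "ExpectColumnValuesToBeBetween",
                {
                    "column": column,
                    "min_value": spec.get("minimum"),
                    "max_value": spec.get("maximum"),
                },
            )
        )
    if spec.get("not-null") is True:
        planned.append(("ExpectColumnValuesToNotBeNull", {"column": column}))
    if "classes" in spec:
        planned.append(
            (
                "ExpectColumnDistinctValuesToBeInSet",
                {"column": column, "value_set": list(spec["classes"])},
            )
        )
    bounds = spec.get("length-between")
    if bounds:
        if len(bounds) == 1:  # a single entry means an exact length
            planned.append(
                ("ExpectColumnValueLengthsToEqual", {"column": column, "value": bounds[0]})
            )
        elif len(bounds) == 2:  # two entries mean [min, max]
            planned.append(
                (
                    "ExpectColumnValueLengthsToBeBetween",
                    {"column": column, "min_value": bounds[0], "max_value": bounds[1]},
                )
            )
        else:
            raise ValueError("length-between takes one or two values")
    return planned
```

With the `shirt_size` entry from the schema above, the plan contains a type check, a not-null check, a value-set check, and an exact-length check of 1.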

At the **DataFrame level** (the `dataframe` section, outside `columns`), the following checks are enforced internally:

* `size: not empty` → raises `ValueError` if the DataFrame is empty
* `size: empty` → raises `ValueError` if the DataFrame is not empty
* `no_columns` → raises `ValueError` if the column count differs from the given value
* `no_rows` → raises `ValueError` if the row count differs from the given value
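
The DataFrame-level rules are simple enough to sketch without pandas. The helper below is hypothetical (`check_dataframe_shape` is not part of the module) and takes the row and column counts directly. It also assumes that "empty" means zero rows, which may differ from how the class defines it.

```python
def check_dataframe_shape(n_rows, n_columns, spec):
    """Apply the `dataframe` section of the YAML to a (rows, columns) shape."""
    size = spec.get("size")
    # Assumption: a DataFrame with zero rows counts as empty.
    if size == "not empty" and n_rows == 0:
        raise ValueError("DataFrame is empty but expected to be not empty")
    if size == "empty" and n_rows != 0:
        raise ValueError("DataFrame is not empty but expected to be empty")
    if "no_columns" in spec and n_columns != spec["no_columns"]:
        raise ValueError(
            f"Expected {spec['no_columns']} columns, found {n_columns}"
        )
    if "no_rows" in spec and n_rows != spec["no_rows"]:
        raise ValueError(f"Expected {spec['no_rows']} rows, found {n_rows}")
```

For a pandas DataFrame `df`, the two counts would come from `df.shape`.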

---

## Methods

### `_read_definitions() -> dict`

Loads and validates expectations from the YAML file.

#### Returns

* Dictionary containing expectations.

#### Raises

* `FileNotFoundError`: If YAML file not found
* `ValueError`: If YAML is invalid or missing required sections (`dataframe`, `columns`)
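
The required-sections check can be illustrated on an already parsed document, for example the result of `yaml.safe_load`. This is a sketch under that assumption; `validate_definitions` is a hypothetical name and the error messages are invented.

```python
REQUIRED_SECTIONS = ("dataframe", "columns")


def validate_definitions(definitions):
    """Check that parsed YAML content has the two required sections."""
    if not isinstance(definitions, dict):
        raise ValueError("Expectations file must contain a mapping at the top level")
    missing = [name for name in REQUIRED_SECTIONS if name not in definitions]
    if missing:
        raise ValueError(f"Missing required section(s): {', '.join(missing)}")
    return definitions
```

An empty YAML file parses to `None`, so the `isinstance` check also covers that case.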

---

### `validate_expectations()`

Validates the dataset against defined expectations.

#### Performs

* **DataFrame-level checks**

  * Enforces `size` (empty / not empty)
  * Enforces `no_rows` and `no_columns`

* **Column-level checks**

  * Ensures required columns exist
  * Validates data types
  * Validates numeric ranges
  * Enforces `not-null`
  * Restricts categorical values
  * Validates string length constraints

#### Raises

* `ValueError`: If the dataset does not meet the DataFrame-level expectations
* `Exception`: If the Great Expectations checkpoint reports a failed validation

---

## Example Usage

```python
import polars as pl
from expectations import Expectations

# Example dataset
df = pl.DataFrame(
    {
        "age": [19, 20, 30],
        "height": [7, 5, 6],
        "gender": ["male", "female", None],
        "phone": ["0711222333", "0722111333", "+256744123432"],
        "shirt_size": ["s", "m", "l"],
    }
)

# Initialize with default expectations.yml in caller's directory
validator = Expectations(df)

# Run validation
validator.validate_expectations()
```

---

## Output

On execution, a Great Expectations **checkpoint run report** is generated.
If validation fails, an exception is raised with a detailed message.

Example success log:

```text
INFO:root:Data passed validation check.
```

Example failure:

```text
Exception: Data failed validation check!
{
"success": false,
"results": [...]
}
```
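
The two outcomes above could be produced by logic along these lines. This is only a sketch: `report_validation` is a hypothetical function, and it assumes the checkpoint result is available as a plain dictionary with a `success` key, as in the failure example.

```python
import json
import logging


def report_validation(result):
    """Log success, or raise with the serialized result on failure."""
    if result.get("success"):
        logging.info("Data passed validation check.")
        return
    raise Exception(
        "Data failed validation check!\n" + json.dumps(result, indent=2)
    )
```

The `INFO:root:` prefix in the success log is what the root logger prints with the default `logging.basicConfig` format, once the level is set to `INFO`.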

---

## Best Practices

* Store `expectations.yml` alongside your pipeline scripts for maintainability.
* Version control `expectations.yml` to track schema changes over time.
* Start with broad rules (row/column counts, non-null constraints) and refine incrementally.
* Use `polars` for data wrangling when performance is critical; the class converts the data to pandas for validation.

16 changes: 10 additions & 6 deletions openhexa/toolbox/dhis2/dhis2.py
@@ -183,9 +183,11 @@ def organisation_unit_groups(

 def format_unit_group(group: Dict[str, Any], fields: str) -> Dict[str, Any]:
     return {
-        key: group.get(key)
-        if key != "organisationUnits"
-        else [ou.get("id") for ou in group.get("organisationUnits", [])]
+        key: (
+            group.get(key)
+            if key != "organisationUnits"
+            else [ou.get("id") for ou in group.get("organisationUnits", [])]
+        )
         for key in fields.split(",")
     }

@@ -494,9 +496,11 @@ def indicator_groups(

 def format_group(group: Dict[str, Any], fields: str) -> Dict[str, Any]:
     return {
-        key: group.get(key)
-        if key != "indicators"
-        else [indicator.get("id") for indicator in group.get("indicators", [])]
+        key: (
+            group.get(key)
+            if key != "indicators"
+            else [indicator.get("id") for indicator in group.get("indicators", [])]
+        )
         for key in fields.split(",")
     }
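
For reviewers, a small standalone run of the parenthesized form shows what the helper returns. The function body is copied from the first hunk and lifted to module level for illustration; the sample `group` dictionary is made up. Wrapping the conditional expression in parentheses only affects formatting, since a conditional expression used as a dict-comprehension value evaluates the same way with or without them.

```python
from typing import Any, Dict


def format_unit_group(group: Dict[str, Any], fields: str) -> Dict[str, Any]:
    return {
        key: (
            group.get(key)
            if key != "organisationUnits"
            else [ou.get("id") for ou in group.get("organisationUnits", [])]
        )
        for key in fields.split(",")
    }


# Hypothetical API payload, for illustration only.
group = {
    "id": "g1",
    "name": "Hospitals",
    "organisationUnits": [{"id": "ou1"}, {"id": "ou2"}],
}
formatted = format_unit_group(group, "id,name,organisationUnits")
print(formatted)
```

Nested organisation units are reduced to their IDs, and any requested field that is absent from the payload comes back as `None`.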
