@@ -1,5 +1,18 @@
# CHANGELOG

## [Unreleased]

### Changed

- **BREAKING**: Made SVG processing optional to avoid installation issues with pycairo dependency
- SVG support (`svglib`) moved to optional dependencies. Install with `pip install llama-index-readers-confluence[svg]`
- SVG attachments will be skipped with a warning if optional dependencies are not installed
- Pinned svglib to <1.6.0 to avoid breaking changes in newer versions

### Fixed

- Fixed installation failures on Debian/Ubuntu systems due to pycairo compilation issues

## [0.1.8] - 2024-08-20

- Added observability events for ConfluenceReader
@@ -0,0 +1,207 @@
# Migration Guide: SVG Support Changes

## Overview

Starting from version 0.4.5, SVG processing support has been moved to an optional dependency to address installation issues on systems where the `pycairo` package cannot be compiled (particularly Debian/Ubuntu systems without C compilers or Cairo development libraries).

## What Changed?

### Before (versions < 0.4.5)

- `svglib` was a required dependency
- All users had to install `pycairo` even if they didn't need SVG support
- Installation could fail on systems without proper build tools

### After (versions >= 0.4.5)

- `svglib` is now an optional dependency
- SVG processing is skipped by default with a warning if optional dependencies are not installed
- Base installation works on all systems without requiring C compilers
- `svglib` version pinned to `<1.6.0` to avoid breaking changes in newer releases

## Migration Paths

### Option 1: Continue Using Built-in SVG Support (Recommended if SVG is needed)

If you need SVG processing and can install the required system dependencies:

```bash
# Uninstall current version
pip uninstall llama-index-readers-confluence

# Install with SVG support
pip install 'llama-index-readers-confluence[svg]'
```

**System Requirements for SVG Support:**

- On Debian/Ubuntu: `sudo apt-get install gcc python3-dev libcairo2-dev`
- On macOS: `brew install cairo`
- On Windows: Install Visual C++ Build Tools

### Option 2: Skip SVG Processing (Recommended for Docker/CI environments)

If you don't need SVG processing or want to avoid installation issues:

```bash
# Install without SVG support (default)
pip install llama-index-readers-confluence
```

SVG attachments will be skipped with a warning in the logs. All other functionality remains unchanged.

### Option 3: Use Custom SVG Parser

If you need SVG processing but cannot install pycairo, use a custom parser:

```python
import xml.etree.ElementTree as ET

from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document
from llama_index.readers.confluence import ConfluenceReader
from llama_index.readers.confluence.event import FileType


# Simple text extraction from SVG (no OCR)
class SimpleSVGParser(BaseReader):
    def load_data(self, file_path, **kwargs):
        # ET.parse honors the encoding declared in the file
        root = ET.parse(file_path).getroot()

        # Extract text elements from SVG. SVG elements are namespaced, so
        # "{*}text" (Python 3.8+) matches <text> in any or no namespace;
        # itertext() also picks up text nested in <tspan> children.
        texts = [
            "".join(elem.itertext()).strip()
            for elem in root.findall(".//{*}text")
        ]
        extracted_text = " ".join(t for t in texts if t) or "[SVG Image]"

        return [
            Document(text=extracted_text, metadata={"file_path": file_path})
        ]


reader = ConfluenceReader(
    base_url="https://yoursite.atlassian.com/wiki",
    api_token="your_token",
    custom_parsers={FileType.SVG: SimpleSVGParser()},
)
```

See `examples/svg_parsing_examples.py` for more custom parser examples.
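
The text-extraction step of a custom parser can be tried on its own, without Confluence or LlamaIndex installed. The following is a minimal sketch using only the standard library; `extract_svg_text` and the sample SVG are hypothetical names made up for illustration and are not part of the package.

```python
import xml.etree.ElementTree as ET

# Namespace that standards-compliant SVG files declare on the root element.
SVG_NS = "{http://www.w3.org/2000/svg}"


def extract_svg_text(svg_source: str) -> str:
    """Return the text content of all <text> elements, joined by spaces."""
    root = ET.fromstring(svg_source)
    chunks = []
    for elem in root.iter():
        # Accept <text> with or without the SVG namespace.
        if elem.tag in (f"{SVG_NS}text", "text"):
            # itertext() includes text nested in <tspan> children;
            # split()/join() collapses runs of whitespace.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                chunks.append(text)
    return " ".join(chunks)


sample = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    "<text>Hello <tspan>world</tspan></text>"
    "<text>Diagram A</text>"
    "</svg>"
)
print(extract_svg_text(sample))
```

Text that has been converted to paths or embedded as a raster image carries no `<text>` elements, so this approach returns an empty string for such files.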

### Option 4: Filter Out SVG Attachments

If you want to explicitly skip SVG files without warnings:

```python
from llama_index.readers.confluence import ConfluenceReader


def attachment_filter(
    media_type: str, file_size: int, title: str
) -> tuple[bool, str]:
    # Returning False skips the attachment; the string is the reason
    if media_type == "image/svg+xml":
        return False, "SVG processing disabled"
    return True, ""


reader = ConfluenceReader(
    base_url="https://yoursite.atlassian.com/wiki",
    api_token="your_token",
    process_attachment_callback=attachment_filter,
)
```
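
If several media types need to be excluded, the same idea can be wrapped in a small factory. This is only a sketch: `make_media_type_filter` is a hypothetical helper, and it assumes the callback signature shown in Option 4 (media type, file size, title).

```python
from typing import Callable, Iterable


def make_media_type_filter(
    blocked: Iterable[str],
) -> Callable[[str, int, str], tuple[bool, str]]:
    """Build a callback that rejects attachments with a blocked media type."""
    blocked_set = {m.lower() for m in blocked}

    def attachment_filter(
        media_type: str, file_size: int, title: str
    ) -> tuple[bool, str]:
        if media_type.lower() in blocked_set:
            return False, f"{media_type} processing disabled"
        return True, ""

    return attachment_filter


skip_svg = make_media_type_filter(["image/svg+xml"])
print(skip_svg("image/svg+xml", 2048, "diagram.svg"))
print(skip_svg("application/pdf", 2048, "report.pdf"))
```

The resulting function can be passed as `process_attachment_callback` in the same way as `attachment_filter` above.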

## Docker/Container Deployments

### Before (versions < 0.4.5)

```dockerfile
FROM python:3.11-slim

# Required system dependencies for pycairo
RUN apt-get update && apt-get install -y \
    gcc \
    python3-dev \
    libcairo2-dev \
    && rm -rf /var/lib/apt/lists/*

RUN pip install llama-index-readers-confluence
```

### After (versions >= 0.4.5) - Without SVG Support

```dockerfile
FROM python:3.11-slim

# No system dependencies needed!
RUN pip install llama-index-readers-confluence
```

### After (versions >= 0.4.5) - With SVG Support

```dockerfile
FROM python:3.11-slim

# Only if you need SVG support
RUN apt-get update && apt-get install -y \
    gcc \
    python3-dev \
    libcairo2-dev \
    && rm -rf /var/lib/apt/lists/*

RUN pip install 'llama-index-readers-confluence[svg]'
```

## FAQ

### Q: Will my existing code break?

**A:** Your existing code will continue to run without changes. However, if you relied on SVG processing and do not install the `[svg]` extra, SVG attachments will be skipped with a warning instead of being parsed, so their text will no longer appear in the loaded documents.

### Q: How do I know if SVG dependencies are installed?

**A:** Check the logs. If you see warnings like "SVG processing skipped: Optional dependencies not installed", then SVG dependencies are not available.
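
As a complement to reading the logs, you can check whether `svglib` is importable in the current environment. This is a rough sketch: `module_available` is a hypothetical helper, and the reader's own check may look at more than `svglib` alone.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


print("svglib importable:", module_available("svglib"))
```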

### Q: Can I use a different OCR engine for SVG?

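**A:** Yes. Use the custom parser approach (Option 3) and implement your own SVG-to-text conversion logic. You could use libraries like `cairosvg`, `pdf2image`, or pure XML parsing depending on your needs. Note that `cairosvg` also needs the Cairo system library at runtime, so pure XML parsing is the option with the fewest system requirements.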

### Q: Why was this change made?

**A:** The `pycairo` dependency (required by `svglib`) requires C compilation and system libraries (Cairo). This caused installation failures in:

- Docker containers based on slim images
- CI/CD pipelines without build tools
- Systems managed by users without admin rights
- Environments where SVG support isn't needed

Making it optional allows the package to work everywhere while still supporting SVG for users who need it.
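
The general pattern behind an optional dependency looks roughly like the sketch below. It is an illustration, not the package's actual implementation: `describe_svg` and its `converter` parameter are hypothetical, while `svg2rlg` is the conversion function that `svglib` provides.

```python
import logging

logger = logging.getLogger(__name__)

try:
    # Only importable when the optional [svg] extra is installed.
    from svglib.svglib import svg2rlg
except ImportError:
    svg2rlg = None


def describe_svg(path: str, converter=svg2rlg) -> str:
    """Convert an SVG if a converter is available, otherwise skip with a warning."""
    if converter is None:
        logger.warning(
            "SVG processing skipped: Optional dependencies not installed"
        )
        return ""
    drawing = converter(path)
    return f"{type(drawing).__name__} from {path}"


# Simulate an environment without the optional dependency.
print(repr(describe_svg("diagram.svg", converter=None)))
```

Guarding the import at module load time keeps the base installation free of compiled dependencies, and the cost of the missing feature is a single warning per skipped file.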

### Q: What if I encounter other issues?

**A:** Please file an issue on GitHub with:

1. Your Python version
2. Your operating system
3. Whether you installed with `[svg]` extra
4. The full error message
5. Output of `pip list` showing installed packages
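
Most of the details above can be gathered with a short script. This is a sketch under the assumption that the three distribution names below are the relevant ones; `collect_report` is a hypothetical helper, and you should still attach the full error message by hand.

```python
import platform
import sys
from importlib import metadata


def collect_report(
    packages=("llama-index-readers-confluence", "svglib", "pycairo"),
) -> str:
    """Summarize Python version, OS, and installed versions of key packages."""
    lines = [
        f"Python: {sys.version.split()[0]}",
        f"OS: {platform.platform()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


print(collect_report())
```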

## Testing Your Migration

After migrating, test your setup:

```python
from llama_index.readers.confluence import ConfluenceReader
import logging

# Enable logging to see SVG warnings
logging.basicConfig(level=logging.INFO)

reader = ConfluenceReader(
base_url="https://yoursite.atlassian.com/wiki",
api_token="your_token",
)

# Try loading data
documents = reader.load_data(space_key="MYSPACE", include_attachments=True)

# Check logs for any SVG-related warnings
print(f"Loaded {len(documents)} documents")
```

If you see "SVG processing skipped" warnings but didn't expect them, you may need to install the `[svg]` extra.
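
To check for these warnings programmatically, for example in a CI job, a logging handler can collect them. The sketch below uses only the standard `logging` module; `SVGWarningCollector` is a hypothetical class, and the log call at the end stands in for a real `reader.load_data(...)` run.

```python
import logging


class SVGWarningCollector(logging.Handler):
    """Collect WARNING-or-higher log messages that mention SVG."""

    def __init__(self) -> None:
        super().__init__(level=logging.WARNING)
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        message = record.getMessage()
        if "svg" in message.lower():
            self.messages.append(message)


collector = SVGWarningCollector()
logging.getLogger().addHandler(collector)

# Stand-in for reader.load_data(...): emit the kind of warning described above.
logging.getLogger("demo").warning(
    "SVG processing skipped: Optional dependencies not installed"
)

print(f"SVG warnings seen: {len(collector.messages)}")
```

Attaching the handler to the root logger catches the warning regardless of which module logger emits it, as long as that logger propagates to the root.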
@@ -51,6 +51,23 @@ include attachments, this is set to `False` by default, if set to `True` all att
ConfluenceReader will extract the text from the attachments and add it to the Document object.
Currently supported attachment types are: PDF, PNG, JPEG/JPG, SVG, Word and Excel.

### Optional Dependencies

**SVG Support**: SVG processing requires additional dependencies that can cause installation issues on some systems.
To enable SVG attachment processing, install with the `svg` extra:

```bash
pip install llama-index-readers-confluence[svg]
```

If SVG dependencies are not installed, SVG attachments will be skipped with a warning in the logs, but all other
functionality will work normally. This allows the package to be installed on systems where the SVG dependencies
(svglib and its transitive dependency pycairo) cannot be built.

**Migration Note for Existing Users**: If you were previously using SVG processing and want to continue doing so,
you need to install the svg extra as shown above. Alternatively, you can provide a custom SVG parser using the
`custom_parsers` parameter (see Advanced Configuration section and `examples/svg_parsing_examples.py` for details).

## Advanced Configuration

The ConfluenceReader supports several advanced configuration options for customizing the reading behavior:
@@ -98,7 +115,8 @@ confluence_parsers = {
# ConfluenceFileType.CSV: CSVParser(),
# ConfluenceFileType.SPREADSHEET: ExcelParser(),
# ConfluenceFileType.MARKDOWN: MarkdownParser(),
# ConfluenceFileType.TEXT: TextParser()
# ConfluenceFileType.TEXT: TextParser(),
# ConfluenceFileType.SVG: CustomSVGParser(), # Custom SVG parser to avoid pycairo issues
}

reader = ConfluenceReader(
@@ -108,6 +126,10 @@ reader = ConfluenceReader(
)
```

For SVG parsing examples including alternatives to the built-in parser, see `examples/svg_parsing_examples.py`.

````

**Processing Callbacks**:

- `process_attachment_callback`: A callback function to control which attachments should be processed. The function receives the media type and file size as parameters and should return a tuple of `(should_process: bool, reason: str)`.
@@ -425,3 +447,4 @@ print(f"Processing completed. Total documents: {len(documents)}")
```

This loader is designed to be used as a way to load data into [LlamaIndex](https://github.com/run-llama/llama_index/).
````