Smart Diff Processing

**Referenced Files in This Document**

- [src/main.rs](file://src/main.rs)

Table of Contents

  1. Introduction
  2. Diff Parsing and Sanitization
  3. Preprocessing for Context Preservation
  4. Integration with LLM Prompts
  5. Performance Considerations
  6. Raw vs Processed Diffs
  7. Customization and Extensibility

Introduction

Smart Diff Processing is a core feature of the aicommit tool. It improves commit message quality by filtering noise out of code diffs and highlighting the semantically meaningful changes. Raw git diff output is parsed, sanitized, and condensed so that irrelevant modifications, such as formatting changes and comments, are removed while essential context is preserved. The processed diff is then embedded in the LLM prompt to maximize the signal-to-noise ratio and produce higher-quality commit messages. This document describes the implementation based on an analysis of src/main.rs.

Diff Parsing and Sanitization

The Smart Diff Processing system parses git diff output using regex patterns and heuristic rules to identify and filter out non-essential changes. Processing begins by splitting the diff into per-file sections with the pattern (?m)^diff --git, which matches the start of each file change block. Each section is then analyzed to determine whether it contains meaningful semantic changes or merely cosmetic alterations.
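The splitting step can be sketched without the regex crate. The function below is a minimal, standard-library-only illustration that cuts the input at every line beginning with "diff --git", which is the same boundary the multi-line pattern would match. The name split_file_sections is hypothetical and is not taken from src/main.rs.

```rust
/// Splits a raw `git diff` into per-file sections (illustrative sketch).
/// A section starts at each line beginning with "diff --git"; text before
/// the first header, if any, is returned as its own leading section.
fn split_file_sections(diff: &str) -> Vec<&str> {
    const HEADER: &str = "diff --git";

    // Byte offsets of every line that begins a new file section.
    let mut starts = Vec::new();
    let mut offset = 0;
    for line in diff.split_inclusive('\n') {
        if line.starts_with(HEADER) {
            starts.push(offset);
        }
        offset += line.len();
    }

    // Keep any preamble before the first header as its own section.
    if starts.first() != Some(&0) && !diff.is_empty() {
        starts.insert(0, 0);
    }

    // Each section runs from its start offset to the next one (or the end).
    starts
        .iter()
        .enumerate()
        .map(|(i, &start)| {
            let end = starts.get(i + 1).copied().unwrap_or(diff.len());
            &diff[start..end]
        })
        .collect()
}
```

Because the sections are borrowed slices of the input, concatenating them reproduces the original diff byte for byte.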

The system employs several sanitization strategies:

This parsing approach ensures that only semantically significant changes are retained for further processing, reducing noise in the final output.

flowchart TD
A[Raw Git Diff] --> B{Size ≤ MAX_DIFF_CHARS?}
B --> |Yes| C[Return Unmodified]
B --> |No| D[Split into File Sections]
D --> E[Process Each Section]
E --> F{Section Size > MAX_FILE_DIFF_CHARS?}
F --> |Yes| G[Truncate with Header Preservation]
F --> |No| H[Include Full Section]
G --> I[Add Truncation Notice]
H --> J[Append to Result]
I --> K[Reassemble Processed Diff]
J --> K
K --> L{Final Size > MAX_DIFF_CHARS?}
L --> |Yes| M[Safely Truncate with Overall Notice]
L --> |No| N[Return Processed Diff]
M --> O[Return Truncated Result]
N --> P[Complete]

Preprocessing for Context Preservation

The preprocessing stage focuses on maintaining contextual integrity around changes while minimizing token usage. The system implements a two-tiered truncation strategy defined by constants MAX_DIFF_CHARS (15,000 characters) and MAX_FILE_DIFF_CHARS (3,000 characters per file). When a file’s diff exceeds the per-file limit, the system preserves the header portion (typically 4-5 lines containing file name, index, and ---/+++ lines) while truncating the detailed change content.
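A minimal sketch of the per-file tier follows. It assumes that the header is every line before the first hunk marker (@@), that limits are measured in bytes, and that a notice is appended after truncation. The notice wording and the helper names safe_len and truncate_file_section are illustrative assumptions, not the tool's actual code.

```rust
/// Per-file limit from the text. Measuring it in bytes (`str::len`) is an
/// assumption of this sketch.
const MAX_FILE_DIFF_CHARS: usize = 3_000;
/// Hypothetical notice text; the real wording is not given in the source.
const TRUNCATION_NOTICE: &str = "\n[... file diff truncated ...]\n";

/// Largest length `<= max_len` that falls on a UTF-8 character boundary.
fn safe_len(s: &str, max_len: usize) -> usize {
    let mut len = max_len.min(s.len());
    while !s.is_char_boundary(len) {
        len -= 1;
    }
    len
}

/// Keeps the header intact and spends the remaining budget on the first
/// part of the change content, then appends a truncation notice.
fn truncate_file_section(section: &str) -> String {
    if section.len() <= MAX_FILE_DIFF_CHARS {
        return section.to_string();
    }

    // Treat every line before the first hunk marker (`@@`) as header.
    let mut header_len = 0;
    for line in section.split_inclusive('\n') {
        if line.starts_with("@@") {
            break;
        }
        header_len += line.len();
    }

    // Guard against a pathological header that alone exceeds the limit.
    let header_len = safe_len(section, header_len.min(MAX_FILE_DIFF_CHARS));
    let (header, body) = section.split_at(header_len);
    let body_len = safe_len(body, MAX_FILE_DIFF_CHARS - header_len);

    let mut out = String::with_capacity(MAX_FILE_DIFF_CHARS + TRUNCATION_NOTICE.len());
    out.push_str(header);
    out.push_str(&body[..body_len]);
    out.push_str(TRUNCATION_NOTICE);
    out
}
```

In this sketch the output can exceed the limit by the length of the notice; whether the original budgets for the notice is not stated in the text.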

A critical aspect of preprocessing is the use of UTF-8 boundary-aware slicing through the get_safe_slice_length helper function, which ensures that string truncation respects character boundaries and prevents encoding corruption. This function iteratively reduces the slice length until it finds a valid UTF-8 character boundary, guaranteeing that the processed diff remains textually intact.
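The boundary search described above can be sketched with the standard library's str::is_char_boundary. The signature shown here (a string and a maximum byte length, returning a usable length) is an assumption; only the function name and the step-back behavior come from the text.

```rust
/// Returns the largest length `<= max_len` at which `s` can be sliced
/// without splitting a multi-byte UTF-8 character (assumed signature).
fn get_safe_slice_length(s: &str, max_len: usize) -> usize {
    if max_len >= s.len() {
        return s.len();
    }
    let mut len = max_len;
    // Step back one byte at a time until the cut lands between characters.
    // Index 0 is always a boundary, so the loop cannot underflow.
    while len > 0 && !s.is_char_boundary(len) {
        len -= 1;
    }
    len
}
```

Slicing with &s[..get_safe_slice_length(s, n)] then never panics, because a UTF-8 character is at most four bytes long and the loop backs off by at most three.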

The preprocessing also maintains structural elements of the diff format, ensuring that the resulting output remains compatible with standard diff parsers while being significantly more concise than the original.

Integration with LLM Prompts

Processed diffs are integrated into LLM prompts to maximize the signal-to-noise ratio in commit message generation. The system replaces direct inclusion of raw diffs with the sanitized processed_diff variable in all provider-specific message generation functions. This integration occurs in multiple LLM provider implementations including OpenRouter, Ollama, OpenAI Compatible, and Simple Free OpenRouter configurations.

The prompt structure follows a consistent pattern across providers:

This standardized prompt architecture ensures consistency in output format regardless of the underlying LLM provider, while the preprocessed diff input guarantees that the model receives high-signal, low-noise data for analysis.
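As a rough illustration of how a processed diff might be embedded in a prompt, the sketch below builds a single prompt string. The function name build_commit_prompt, the instruction wording, and the BEGIN DIFF / END DIFF delimiters are all hypothetical; the actual prompt text used by aicommit is not reproduced in this document.

```rust
/// Hypothetical prompt builder. The wording and delimiters are illustrative
/// only and are not taken from the tool.
fn build_commit_prompt(processed_diff: &str) -> String {
    format!(
        "Generate a concise git commit message for the changes below.\n\
         Reply with the commit message only.\n\n\
         BEGIN DIFF\n{}\nEND DIFF\n",
        // Trailing newlines are trimmed so the closing delimiter sits
        // directly after the last diff line.
        processed_diff.trim_end()
    )
}
```

Keeping prompt construction in one function that takes the processed diff as its only input is one way to make every provider receive identical content.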

sequenceDiagram
participant User as "User"
participant Tool as "aicommit"
participant LLM as "LLM Provider"
User->>Tool : Request Commit Message
Tool->>Tool : Retrieve Raw Git Diff
Tool->>Tool : Process Diff via process_git_diff_output()
Tool->>Tool : Construct Prompt with Processed Diff
Tool->>LLM : Send API Request with Prompt
LLM-->>Tool : Return Generated Message
Tool->>Tool : Validate and Clean Response
Tool-->>User : Display Final Commit Message

Performance Considerations

The Smart Diff Processing system incorporates several performance optimizations to handle large diffs efficiently. The primary constraint is memory efficiency, addressed through streaming-like processing that operates on string slices rather than loading entire diff contents into memory structures.

Key performance features include:

The system processes diffs in a single pass, splitting the input and processing each file section sequentially. For extremely large diffs, the final safety check ensures the total output does not exceed MAX_DIFF_CHARS, preventing excessive API usage and potential rate limiting.
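The single pass and the final safety check can be sketched as follows. This is a simplified illustration under stated assumptions: limits are measured in bytes, each oversized section is cut from the tail (which keeps its header, since the header comes first), and the notice strings are invented. The names process_diff_sketch, push_capped, and safe_len are hypothetical and do not reproduce process_git_diff_output.

```rust
const MAX_DIFF_CHARS: usize = 15_000;
const MAX_FILE_DIFF_CHARS: usize = 3_000;
// Hypothetical notice texts; the real wording is not given in the source.
const FILE_NOTICE: &str = "\n[... file diff truncated ...]\n";
const TOTAL_NOTICE: &str = "\n[... diff truncated ...]\n";

/// Largest length `<= max_len` that falls on a UTF-8 character boundary.
fn safe_len(s: &str, max_len: usize) -> usize {
    let mut len = max_len.min(s.len());
    while !s.is_char_boundary(len) {
        len -= 1;
    }
    len
}

/// Appends `text` to `out`, cutting it at `limit` bytes and adding `notice`
/// when it is too long.
fn push_capped(out: &mut String, text: &str, limit: usize, notice: &str) {
    if text.len() <= limit {
        out.push_str(text);
    } else {
        out.push_str(&text[..safe_len(text, limit)]);
        out.push_str(notice);
    }
}

fn process_diff_sketch(raw: &str) -> String {
    // Small diffs are returned unmodified.
    if raw.len() <= MAX_DIFF_CHARS {
        return raw.to_string();
    }

    // Single pass: walk the lines once and flush a section whenever the
    // next file header is reached. Sections are slices of `raw`.
    let mut result = String::new();
    let (mut start, mut offset) = (0, 0);
    for line in raw.split_inclusive('\n') {
        if line.starts_with("diff --git") && offset > start {
            push_capped(&mut result, &raw[start..offset], MAX_FILE_DIFF_CHARS, FILE_NOTICE);
            start = offset;
        }
        offset += line.len();
    }
    push_capped(&mut result, &raw[start..], MAX_FILE_DIFF_CHARS, FILE_NOTICE);

    // Final safety check on the reassembled output.
    if result.len() > MAX_DIFF_CHARS {
        let mut capped = String::new();
        push_capped(&mut capped, &result, MAX_DIFF_CHARS, TOTAL_NOTICE);
        return capped;
    }
    result
}
```

With these assumptions the output never exceeds MAX_DIFF_CHARS plus the length of one notice, no matter how large the input is.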

Memory for the raw input scales linearly with diff size, while the processed output is bounded by MAX_DIFF_CHARS (plus a short truncation notice). This keeps the size of what reaches the LLM predictable regardless of repository size.

Raw vs Processed Diffs

The transformation from raw to processed diffs demonstrates significant reduction in noise while preserving semantic meaning. For example, a raw diff containing extensive formatting changes across multiple files would be condensed to show only the file headers and minimal context, with clear truncation notices indicating omitted content.

When comparing raw and processed outputs:

This transformation enables LLMs to focus on actual code modifications rather than parsing through irrelevant changes, leading to more accurate and meaningful commit message generation.

Customization and Extensibility

While the current implementation provides fixed thresholds for diff processing, the architecture allows for future customization possibilities. The constants MAX_DIFF_CHARS and MAX_FILE_DIFF_CHARS could be exposed through configuration options, enabling users to adjust sensitivity based on their specific needs.
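One possible shape for such configuration is sketched below. The DiffLimits struct, its field names, and the with_overrides constructor are hypothetical; only the two default values come from the constants described earlier. The clamp that keeps the per-file limit at or below the total limit is a design choice of this sketch.

```rust
/// Hypothetical configuration holder for the two truncation thresholds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DiffLimits {
    max_diff_chars: usize,
    max_file_diff_chars: usize,
}

impl Default for DiffLimits {
    /// Defaults mirror MAX_DIFF_CHARS and MAX_FILE_DIFF_CHARS.
    fn default() -> Self {
        Self {
            max_diff_chars: 15_000,
            max_file_diff_chars: 3_000,
        }
    }
}

impl DiffLimits {
    /// Applies optional user overrides on top of the defaults and keeps the
    /// per-file limit from exceeding the overall limit.
    fn with_overrides(total: Option<usize>, per_file: Option<usize>) -> Self {
        let defaults = Self::default();
        let max_diff_chars = total.unwrap_or(defaults.max_diff_chars);
        let max_file_diff_chars = per_file
            .unwrap_or(defaults.max_file_diff_chars)
            .min(max_diff_chars);
        Self {
            max_diff_chars,
            max_file_diff_chars,
        }
    }
}
```

Passing a value like this into the processing function, instead of reading global constants, would let existing behavior stay the default while still allowing per-user tuning.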

Potential extensibility hooks include:

The modular design of the process_git_diff_output function makes it suitable for extension with additional preprocessing rules without affecting the core commit generation workflow.
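A small sketch of what an additional preprocessing rule could look like follows, assuming rules are plain functions from diff text to diff text that run before truncation. The DiffRule alias, apply_rules, and the example rule drop_index_lines are hypothetical and are not present in the described implementation.

```rust
/// Hypothetical rule type: takes diff text and returns transformed diff text.
type DiffRule = fn(&str) -> String;

/// Example rule (illustrative only): drops the `index <hash>..<hash>` header
/// lines. Content lines inside hunks start with '+', '-', or ' ', so they
/// are never matched by this prefix check.
fn drop_index_lines(diff: &str) -> String {
    diff.split_inclusive('\n')
        .filter(|line| !line.starts_with("index "))
        .collect()
}

/// Applies each rule in order, feeding the output of one into the next.
fn apply_rules(diff: &str, rules: &[DiffRule]) -> String {
    rules
        .iter()
        .fold(diff.to_string(), |acc, rule| rule(&acc))
}
```

With an empty rule list the input is returned unchanged, so a hook like this would not alter current behavior unless rules are registered.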
