Network Resilience
**Referenced Files in This Document** - [main.rs](file://src/main.rs) - [readme.md](file://readme.md)Table of Contents
- Introduction
- Retry Mechanisms with Exponential Backoff
- Timeout Configurations
- Error Classification Logic
- Integration with Provider Request Loops and Model Jail Decisions
- Trade-offs Between Retry Aggressiveness and Latency
- Configuration and Tuning Guidance
Introduction
This document details the network resilience strategies implemented in the aicommit application to ensure reliable communication with AI providers despite intermittent connectivity or service outages. The system employs sophisticated retry mechanisms, timeout configurations, and error classification logic to maintain functionality across varying network conditions. These strategies are particularly important for the Simple Free OpenRouter provider, which dynamically manages access to free AI models through a model jail system that tracks performance and availability.
The implementation leverages tokio-based async delays and reqwest middleware to handle network operations efficiently. The system is designed to balance reliability with user-perceived latency, providing configurable parameters that can be tuned for different environments such as CI/CD pipelines versus local development.
Retry Mechanisms with Exponential Backoff
The application implements a configurable retry mechanism for API requests to AI providers, using tokio-based async delays to prevent overwhelming servers during transient failures. When a request fails, the system waits before attempting again, with a fixed delay between attempts rather than exponential backoff.
The retry behavior is controlled by the retry_attempts configuration parameter, which defaults to 3 attempts. After each failed attempt, the system sleeps for exactly 5 seconds using tokio::time::sleep(tokio::time::Duration::from_secs(5)). This fixed delay provides predictable recovery behavior while preventing excessive load on the AI providers.
flowchart TD
Start([Request Initiated]) --> Attempt["Send API Request"]
Attempt --> Success{"Request Successful?"}
Success --> |Yes| Complete["Return Response"]
Success --> |No| CheckAttempts["Check Attempt Count"]
CheckAttempts --> Limit{"Attempt < retry_attempts?"}
Limit --> |Yes| Wait["Wait 5 Seconds"]
Wait --> Attempt
Limit --> |No| Fail["Return Error"]
Complete --> End([Success])
Fail --> End
Diagram sources
Section sources
Timeout Configurations
The application implements multiple layers of timeout configurations to prevent hanging requests and ensure responsive behavior. These timeouts are applied at both the client and request levels using the reqwest library.
For general API operations, the system uses a 10-second timeout when creating the HTTP client:
let client = reqwest::Client::builder()
.timeout(std::time::Duration::from_secs(10))
.build()
.unwrap_or_default();
When fetching available models from the OpenRouter API, a 15-second timeout is applied using tokio’s timeout utility:
match tokio::time::timeout(
std::time::Duration::from_secs(15),
client.get("https://openrouter.ai/api/v1/models")
.send()
).await
For generating commit messages, a more generous 30-second timeout is used:
match tokio::time::timeout(std::time::Duration::from_secs(30), make_request).await
These tiered timeout values reflect the different criticality and expected response times for various operations, with model fetching requiring less time than full message generation.
Section sources
Error Classification Logic
The system implements sophisticated error classification logic that distinguishes between transient failures and permanent ones, enabling appropriate recovery strategies. This classification is crucial for determining whether to retry requests or mark models as unavailable.
Transient failures include:
- Network timeouts (detected via tokio::time::timeout)
- 5xx server errors from the API
- Connection issues
- Temporary rate limiting
Permanent failures include:
- 401 Unauthorized errors (invalid API keys)
- 404 Not Found errors
- Invalid model specifications
- Empty or malformed responses
The system also includes contextual intelligence to avoid penalizing models for network issues. When a timeout occurs, it checks the model’s history to determine if the failure is likely due to network issues rather than model problems:
flowchart TD
Start{Request Failed} --> Timeout{Timeout?}
Timeout --> |Yes| CheckHistory["Check Model History"]
CheckHistory --> RecentSuccess{"Recent Success?"}
RecentSuccess --> |Yes| NetworkIssue["Likely Network Issue"]
RecentSuccess --> |No| ModelFailure["Potential Model Failure"]
Timeout --> |No| StatusCode{Status Code}
StatusCode --> ServerError{"5xx Error?"}
ServerError --> |Yes| Transient["Transient Failure"]
ServerError --> |No| ClientError{"4xx Error?"}
ClientError --> |Yes| Permanent["Permanent Failure"]
ClientError --> |No| Other["Other Error Type"]
Diagram sources
Section sources
Integration with Provider Request Loops and Model Jail Decisions
The network resilience strategies are tightly integrated with the provider request loops and the model jail decision system. When a request fails, the system not only handles the immediate retry but also updates the model’s status in the jail system based on the nature of the failure.
The model jail system tracks several metrics for each model:
- Success and failure counts
- Last success and failure timestamps
- Jail status and expiration time
- Blacklist status
When a model fails consecutively (defined as MAX_CONSECUTIVE_FAILURES = 3), it is placed in “jail” with an exponentially increasing duration:
- Initial jail: 24 hours
- Subsequent jails: Multiplied by JAIL_TIME_MULTIPLIER (2)
- Maximum jail: 168 hours (7 days)
After BLACKLIST_AFTER_JAIL_COUNT (3) jail periods, the model is blacklisted for BLACKLIST_RETRY_DAYS (7) days before being reconsidered.
The request loop integrates with this system by checking a model’s availability before use:
sequenceDiagram
participant User as "User"
participant System as "aicommit System"
participant Provider as "AI Provider"
User->>System : Request Commit Message
System->>System : Find Best Available Model
alt Model Available
System->>Provider : Send Request
Provider-->>System : Response
System->>System : Record Success
System->>User : Return Commit Message
else Model Unavailable
System->>System : Select Alternative Model
System->>Provider : Send Request
Provider-->>System : Error
System->>System : Record Failure
System->>System : Update Model Status
alt Retry Available
System->>System : Wait 5s, Retry
else Max Attempts Reached
System->>User : Return Error
end
end
Diagram sources
Section sources
Trade-offs Between Retry Aggressiveness and Latency
The system balances the aggressiveness of retries against user-perceived latency through configurable parameters and intelligent defaults. The current implementation uses a fixed 5-second delay between attempts rather than exponential backoff to provide predictable wait times.
With the default configuration of 3 retry attempts, users may experience up to 10 seconds of additional latency (2 intervals × 5 seconds) before a final failure is reported. This represents a deliberate trade-off favoring reliability over responsiveness, as the system prioritizes successful completion over speed.
The fixed delay approach has several advantages:
- Predictable user experience
- Simpler debugging and monitoring
- Reduced risk of cascading failures during provider outages
However, it also has potential drawbacks:
- Less adaptive to varying network conditions
- May be too aggressive for brief, transient outages
- Could contribute to provider overload during widespread issues
Users can adjust the retry_attempts parameter in the configuration file to tune this balance according to their specific needs and tolerance for latency versus reliability.
Section sources
Configuration and Tuning Guidance
The network resilience parameters can be configured to suit different environments and connectivity conditions. The primary configuration options are available in the global settings of the .aicommit.json configuration file.
For CI/CD environments:
- Set
retry_attemptsto 1-2 to minimize pipeline duration - Accept faster failures in exchange for quicker feedback
- Prioritize reliability of the overall pipeline over individual commit message generation
For local development:
- Use the default
retry_attemptsof 3 for maximum reliability - Tolerate longer wait times for better success rates
- Benefit from the model jail system’s learning across multiple sessions
The system also provides command-line flags for testing and simulation:
--simulate-offline: Forces use of fallback model list--verbose: Shows detailed retry progress and timing information
Users experiencing frequent timeouts may want to consider:
- Reducing
retry_attemptsin poor connectivity areas - Checking API key validity to avoid 401 errors
- Monitoring model jail status with
--jail-status
The configuration is stored in ~/.aicommit.json and can be edited directly or through the interactive setup process.
Section sources