RFC: ISLETS
RFC Number: TBD
Version: 1.0.0
Title: ISLETS
Author: rsampaio
Date: February 5, 2026
Updated: February 12, 2026
Status: Draft
Overview
This RFC defines the standard architecture and operational requirements for inter-system communication using an event-driven approach. It specifies the components, infrastructure, and best practices for reliable, secure, and observable event processing across systems.
ISLETS (Inter-System Lightweight Event Transport Standard) provides the foundational platform capability that enables stream-aligned product teams to publish and consume business events without managing messaging infrastructure complexity.
Purpose and Motivation
To ensure scalable, decoupled, and resilient communication between systems, a unified event-driven architecture is required. This RFC establishes conventions and requirements to promote interoperability, reliability, and maintainability.
Scope
This RFC applies to all new systems that require knowledge about business events, such as:
- ListingSold
- ListingReceivedProposal
- ListingRegistered
- ListingPriceChanged
Any system that needs to publish, consume, or process such business events must adhere to the requirements and guidelines defined in this RFC.
Non-Goals
This RFC does not:
- Propose a timeline for legacy system migration to the event-driven architecture
- Mandate a specific architecture for all systems; it only describes requirements for inter-system communication, not internal system design
Definitions
- System: An application or service participating in event exchange.
- Topic: A logical channel for publishing and subscribing to messages.
- Event: A relevant business fact that has occurred; represented as a Message sent to a given Topic.
- Queue: A buffer for messages awaiting processing by an Event Processor. Belongs to the Consumer System.
- Message: The data payload exchanged between systems. Represents an Event.
- Event Processor: A consumer of messages, typically implemented as a serverless function or container.
- Producer/Upstream System: A system that publishes messages to a Topic.
- Consumer/Downstream System: A system that consumes messages from a Topic.
- Dead Letter Queue (DLQ): Stores failed messages for later inspection.
- Event Audit Store: A database of all events transported via Islets for debugging and replaying purposes.
Architecture
Architecture Overview
Inter-system communication MUST use an event-driven architecture composed of Systems, Topics, Queues, Messages, and Event Processors.
A supporting diagram showing an example application of this architecture is available at [diagram].
Producers and Consumers
- Systems publishing messages to Topics are Producers (Upstream Systems).
- Systems consuming messages from Topics are Consumers (Downstream Systems).
- Systems MAY act as both Producer and Consumer.
Message Flow
- Producers publish Events as Messages to Topics (AWS SNS)
- Topics fan-out to subscribed Queues (AWS SQS Standard)
- Queues buffer messages for Event Processors
- Event Processors (Lambda/Fargate) poll and process messages from Queues
- Failed messages (after 3 retries with backoff) are moved to the Dead Letter Queue
Queue Type: Standard SQS queues MUST be used by default for scalability and cost-effectiveness. SQS FIFO queues MAY be used only when strict in-order processing of events from a single aggregate is a documented business requirement, acknowledging the trade-offs in throughput (300 TPS limit) and higher costs.
Topics MAY accept messages via:
- AWS SDK (IAM-authenticated) for internal systems
- Webhook URL for third-party systems (e.g., HubSpot, Stripe)
Reliability
- Queues MUST have associated Dead Letter Queues (DLQs) for failed messages
- Messages MUST be retried exactly 3 times before moving to DLQ
- Retry backoff intervals: 1 second, 3 seconds, 5 seconds
- Retry attempts MUST be logged with failure reason
- When DLQ is not empty, an alert MUST be issued and remediation actions undertaken by the responsible Product Squad
- DLQ messages MUST be retained for at least 14 days for investigation
Idempotency and Ordering
- Event Processors MUST be idempotent and handle out-of-order events by inspecting message timestamps.
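For illustration, the sketch below shows one way to satisfy the deduplication half of this requirement using a DynamoDB conditional write as a dedupe guard; the table name, key layout, and helper signature are assumptions, not part of the pyslets SDK:

# Minimal idempotency guard, assuming a dedupe table keyed by event_id.
import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical dedupe table: partition key "event_id" (string).
processed = dynamodb.Table("my-service-processed-events")

def process_once(event: dict, business_logic) -> bool:
    """Run business_logic(event) at most once per event_id.

    Returns True if the event was processed, False if it was a duplicate.
    """
    try:
        # The conditional put fails atomically if this event_id was already
        # recorded, so duplicate SQS deliveries become no-ops.
        processed.put_item(
            Item={"event_id": event["event_id"],
                  "timestamp": event["timestamp"]},
            ConditionExpression="attribute_not_exists(event_id)",
        )
    except processed.meta.client.exceptions.ConditionalCheckFailedException:
        return False  # duplicate delivery; already handled
    business_logic(event)
    return True

Out-of-order handling builds on the same idea: before applying an event, compare its timestamp against the timestamp of the last applied state and discard stale updates.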
Monitoring and Health
- Queue sizes and Event Processor logs MUST be monitored in Datadog.
- Every System MUST expose a health check route.
Standards
Event Definition Criteria
Events MUST represent business-significant state changes that:
- Multiple systems need to know about
- Are immutable facts (already happened)
- Have clear business value for tracking/auditing
- Represent meaningful domain transitions
Topics MUST be agreed upon at weekly Design Reviews.
Schema Management
- Event schemas MUST be stored in the events/ directory at repository root
- Directory structure MUST be flat, organized by aggregate: events/listing.yaml, events/company.yaml, events/agent.yaml, etc.
- Each aggregate MUST have its own YAML file (e.g., listing.yaml, company.yaml)
- Schema files MUST include ownership metadata in the header:
  # Owner: <Squad Name> (@slack-handle)
  # DLQ Alerts: #<squad-alert-channel>
  # Consumers: See consumers section below
- Schemas MUST use JSON Schema format for validation
- SDKs MUST auto-discover and load schemas at runtime from the events/ directory
- Schema location MAY be overridden via the ISLETS_EVENTS_CONFIG_PATH environment variable
- A Topic is fully defined when it has a name and a message schema
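As a minimal sketch of what runtime auto-discovery could look like (the loader below and its return shape are illustrative assumptions; pyslets' actual implementation may differ):

# Discover aggregate schema files and validate a payload before publishing.
import os
from pathlib import Path

import yaml                    # PyYAML
from jsonschema import validate

def load_schemas(base_dir=None):
    """Load every aggregate schema file from the events/ directory."""
    base = Path(base_dir or os.environ.get("ISLETS_EVENTS_CONFIG_PATH", "events"))
    schemas = {}
    for path in base.glob("*.yaml"):
        doc = yaml.safe_load(path.read_text())
        for event_type, spec in doc["events"].items():
            schemas[(doc["aggregate"], event_type)] = spec["schema"]
    return schemas

# Usage: validate a listing.sold payload against its JSON Schema.
schemas = load_schemas()
validate(
    instance={"listing_id": "L-1", "price": 500_000.0,
              "agent_id": "A-9", "sold_at": "2026-02-12T10:00:00Z"},
    schema=schemas[("listing", "sold")],
)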
Consumer Registry
Event schemas MUST document known consumers for impact analysis:
# events/listing.yaml
# Owner: Listing Management Squad (@listing-mgmt-team)
# DLQ Alerts: #listing-mgmt-alerts
aggregate: listing
events:
  sold:
    description: "Emitted when listing completes sale transaction"
    consumers:
      - team: Listings Portal Squad
        service: listings-portal-api
        queue: portal-listing-queue
        critical: true  # Breaking change blocks user-facing workflow
      - team: Agents Squad
        service: agent-notifications
        queue: agents-listing-queue
        critical: false  # Nice-to-have notification
    schema:
      type: object
      required: [listing_id, price, agent_id, sold_at]
      properties:
        listing_id: {type: string}
        price: {type: number}
        agent_id: {type: string}
        sold_at: {type: string, format: date-time}
Purpose: Enables fast flow by providing visibility into breaking change impact before making schema changes.
Maintenance: Teams self-register when they start consuming an event (add entry via PR to schema file).
Naming Conventions
- Event names MUST use past-tense verbs (e.g., ListingSold, not SellListing)
- Event names MUST follow the pattern <Aggregate><PastTenseVerb> (e.g., CompanyScoreChanged)
- Topic names MUST follow the pattern <aggregate>-events (e.g., listing-events, company-events)
- Queue names MUST follow the pattern <consumer-name>-<aggregate>-queue
- Event type identifiers MUST use snake_case (e.g., sold, score_changed)
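These patterns lend themselves to mechanical lint checks. The regexes below are illustrative assumptions, not part of any SDK; note that past tense itself cannot be verified mechanically, only the shape of the name:

# Illustrative lint checks for the naming patterns above.
import re

PASCAL_EVENT = re.compile(r"^(?:[A-Z][a-z0-9]+){2,}$")   # <Aggregate><PastTenseVerb>
TOPIC = re.compile(r"^[a-z][a-z0-9]*-events$")           # <aggregate>-events
QUEUE = re.compile(r"^[a-z0-9-]+-[a-z0-9]+-queue$")      # <consumer-name>-<aggregate>-queue
EVENT_TYPE = re.compile(r"^[a-z]+(?:_[a-z]+)*$")         # snake_case identifiers

assert PASCAL_EVENT.match("CompanyScoreChanged")
assert TOPIC.match("listing-events")
assert QUEUE.match("my-service-listing-queue")
assert EVENT_TYPE.match("score_changed")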
Standard Message Attributes
Every message published to SNS MUST include the following MessageAttributes:
- event_type: The event type identifier (e.g., sold, score_changed)
- aggregate: The aggregate name (e.g., listing, company)
- schema_version: The message schema version (semantic versioning)
- producer_system: The name of the service/system that published the event (e.g., listing-management-api)
Message payloads SHOULD include (when applicable):
- event_id: A globally unique identifier for the specific event instance
- timestamp: The exact time the event occurred (ISO 8601 format)
- correlation_id: Trace identifier for end-to-end distributed tracing
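A publish call satisfying these requirements might look like the boto3 sketch below; the topic ARN and payload values are placeholders:

# Publish a listing.sold event with the required MessageAttributes.
import json
import uuid
from datetime import datetime, timezone

import boto3

sns = boto3.client("sns")

sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:listing-events",  # placeholder
    Message=json.dumps({
        "event_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "correlation_id": "req-abc123",  # propagated from the originating request
        "listing_id": "L-1",
        "price": 500_000.0,
        "agent_id": "A-9",
        "sold_at": "2026-02-12T10:00:00Z",
    }),
    MessageAttributes={
        "event_type":      {"DataType": "String", "StringValue": "sold"},
        "aggregate":       {"DataType": "String", "StringValue": "listing"},
        "schema_version":  {"DataType": "String", "StringValue": "1.0.0"},
        "producer_system": {"DataType": "String", "StringValue": "listing-management-api"},
    },
)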
Implementation
Security
- Webhooks MUST implement signature verification using provider-specific algorithms
- Webhooks MUST use TLS (HTTPS only)
- AWS SDK calls MUST use IAM roles with least-privilege policies
- SNS topics MUST be encrypted at rest
- Sensitive data in events MUST be encrypted or tokenized
- API keys and secrets MUST be stored in AWS Secrets Manager
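For example, a handler might resolve its webhook secret from Secrets Manager at cold start, as in this sketch (the secret name is a placeholder):

# Read a webhook secret once per Lambda container, per the rule above.
import boto3

secrets = boto3.client("secretsmanager")

def get_webhook_secret() -> str:
    resp = secrets.get_secret_value(SecretId="islets/webhooks/hubspot")  # placeholder
    return resp["SecretString"]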
Logging and Observability
- Queue sizes MUST be monitored in Datadog
- DLQ size MUST trigger alerts when non-zero
- Logs MUST be standardized and sent to Datadog (see RFC - Standard Logging)
- Distributed tracing MUST propagate the correlation_id defined in the message schema, with traces sent to Datadog (see RFC - Message Schema)
- Event Processor invocations, durations, and error rates MUST be tracked
- SLO: p99 message processing latency SHOULD be < 5 seconds
Event Audit Store
- All published events MUST be persisted to the Audit Store for compliance and debugging
- Audit Store MUST retain events for minimum 1 year
- Events MUST be queryable by: event_id, correlation_id, timestamp, event_type, aggregate, producer_system
- Platform Team MUST provide replay tooling for disaster recovery scenarios
- Audit Store MUST support efficient time-range queries for debugging
- Recommended Implementation: S3 + Athena or dedicated event store (e.g., Amazon EventBridge Archive)
- Rationale: These options provide cost-effective long-term storage of immutable events, efficient time-series queries, and scale better than operational databases like MongoDB for append-only audit logs
- Decision MUST be finalized before production deployment as this underpins compliance and disaster recovery capabilities
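Assuming the recommended S3 + Athena option, a correlation_id lookup could be sketched as below; the database, table, and results bucket names are illustrative assumptions:

# Query the audit store for all events sharing a correlation_id.
import time

import boto3

athena = boto3.client("athena")

def find_events(correlation_id: str):
    # Note: a real implementation should use parameterized queries,
    # not string interpolation, to avoid SQL injection.
    qid = athena.start_query_execution(
        QueryString=(
            "SELECT event_id, event_type, aggregate, producer_system, timestamp "
            "FROM events_audit "                                   # assumed table
            f"WHERE correlation_id = '{correlation_id}' "
            "ORDER BY timestamp"
        ),
        QueryExecutionContext={"Database": "islets_audit"},        # assumed database
        ResultConfiguration={"OutputLocation": "s3://islets-athena-results/"},
    )["QueryExecutionId"]
    # Poll until the query leaves the QUEUED/RUNNING states.
    while athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
        time.sleep(1)
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]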
Versioning and Change Management
- All message schemas MUST use semantic versioning (MAJOR.MINOR.PATCH)
- MAJOR version increments indicate breaking changes
- MINOR version increments indicate backward-compatible additions
- PATCH version increments indicate backward-compatible fixes
- Breaking changes require:
  - New schema version
  - Design Review meeting approval
  - Migration plan for existing consumers
  - Deprecation notice (30-day minimum)
  - Pull request approval from at least 2 engineers familiar with the affected systems
- Producers and consumers MUST support backward compatibility for at least one previous MAJOR schema version
- Schema changes MUST be communicated via:
  - Pull request description with impact analysis
  - Design Review meeting notes
  - Release notes / changelog
  - Slack announcement in #engineering channel
- Deprecated schemas MUST be clearly marked in schema files with sunset date
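A consumer-side version gate might look like the sketch below, which accepts any supported MAJOR version and fails fast (and therefore routes the message to the DLQ) on anything else; the supported set is an assumption:

# Reject messages whose schema MAJOR version this consumer does not support.
SUPPORTED_MAJORS = {1}  # assumed: this consumer understands schema v1.x

def check_schema_version(message_attributes: dict) -> None:
    version = message_attributes["schema_version"]["StringValue"]
    major = int(version.split(".")[0])
    if major not in SUPPORTED_MAJORS:
        raise ValueError(
            f"Unsupported schema version {version}; "
            f"expected MAJOR in {sorted(SUPPORTED_MAJORS)}")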
Testing and Validation
- Event Processors MUST include automated unit and integration tests
- Tests MUST validate:
  - Message handling and business logic
  - Idempotency (duplicate message handling)
  - Error scenarios and retry behavior
  - Schema validation
- Message schemas MUST be validated against JSON Schema before deployment
- End-to-end tests SHOULD be implemented for critical event flows
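A minimal idempotency test could be structured like this self-contained sketch, which exercises a dedupe guard against a duplicate delivery (the guard stands in for the service's real handler):

# Idempotency test: delivering the same event twice records one sale.
def test_duplicate_delivery_is_noop():
    seen, sales = set(), []

    def record_sale_once(event):
        if event["event_id"] in seen:  # dedupe guard under test
            return
        seen.add(event["event_id"])
        sales.append(event)

    event = {"event_id": "evt-1", "listing_id": "L-1", "price": 500_000.0}
    record_sale_once(event)
    record_sale_once(event)  # SQS at-least-once delivery duplicate

    assert len(sales) == 1   # exactly one sale recorded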
Webhook Integration (Third-Party Systems)
For integrating external systems (e.g., HubSpot, Stripe) that publish events via webhooks:
- Webhook endpoints MUST validate signatures using provider-specific algorithms
- Webhook handlers MUST transform third-party payloads into domain events
- Webhook endpoints MUST return 200 OK immediately (process asynchronously)
- Webhook processing MUST be fire-and-forget (publisher doesn't wait for completion)
- Failed webhook deliveries SHOULD be retrievable via provider's API/dashboard
- The webhooks SDK module MUST provide validators for common providers
- Webhook infrastructure MUST be deployed via Pulumi components:
  - WebhookApiGateway - Creates API Gateway with custom domain
  - WebhookLambdaHandler - Creates Lambda with signature validation
- Example workflow (a condensed sketch follows this list):
  1. Third-party calls webhook endpoint
  2. Lambda validates signature
  3. Lambda transforms payload to domain event
  4. Lambda publishes to appropriate SNS topic
  5. Lambda returns 200 OK
  6. Normal event processing flow continues
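The sketch below condenses that workflow for a generic HMAC-signed provider; the signature header, secret resolution, and payload mapping are assumptions, since real providers each have their own scheme:

# Webhook ingestion Lambda: validate, transform, publish, return 200.
import hashlib
import hmac
import json

import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:listing-events"  # placeholder
WEBHOOK_SECRET = "resolved-from-secrets-manager"                 # placeholder

def lambda_handler(event, context):
    # Steps 1-2: third party calls the endpoint; validate the HMAC signature.
    body = event["body"]
    sent = event["headers"].get("x-provider-signature", "")  # assumed header
    expected = hmac.new(WEBHOOK_SECRET.encode(), body.encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sent, expected):
        return {"statusCode": 401, "body": "invalid signature"}

    # Step 3: transform the third-party payload into a domain event (mapping assumed).
    payload = json.loads(body)
    domain_event = {"listing_id": payload["objectId"], "price": payload["amount"]}

    # Step 4: publish to the aggregate's SNS topic with standard attributes.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps(domain_event),
        MessageAttributes={
            "event_type":      {"DataType": "String", "StringValue": "sold"},
            "aggregate":       {"DataType": "String", "StringValue": "listing"},
            "schema_version":  {"DataType": "String", "StringValue": "1.0.0"},
            "producer_system": {"DataType": "String", "StringValue": "webhook-gateway"},
        },
    )
    # Step 5: return 200 OK immediately; processing continues asynchronously.
    return {"statusCode": 200, "body": ""}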
Failure Scenarios and Recovery
- DLQ size MUST trigger alerts when non-zero (visible to all engineers in Datadog)
- Any engineer MAY investigate and remediate DLQ messages
- Common remediation steps:
  - Review DLQ message and error logs
  - Fix bug in consumer code if needed
  - Redrive messages from DLQ after fix deployed
  - Document incident and root cause
- Catastrophic failures (e.g., data loss) MUST trigger incident response
- Event replay from audit database MUST be available for disaster recovery
- Recovery procedures MUST be documented and periodically tested (quarterly)
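A manual redrive after a fix ships could be sketched as below; the queue URLs are placeholders, and the SQS console's built-in DLQ redrive is an alternative:

# Move every message from the DLQ back to the main queue after a fix.
import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-service-listing-dlq"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-service-listing-queue"

while True:
    resp = sqs.receive_message(QueueUrl=DLQ_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=1)
    messages = resp.get("Messages", [])
    if not messages:
        break  # DLQ drained
    for msg in messages:
        # Re-enqueue to the main queue, then remove from the DLQ.
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=msg["Body"])
        sqs.delete_message(QueueUrl=DLQ_URL, ReceiptHandle=msg["ReceiptHandle"])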
Consumer Integration
To consume events, services MUST declare their subscriptions using Pulumi components provided by the pyslets SDK. Each service owns and manages its consumer infrastructure in its own repository.
Workflow for New Services
Step 1: Write Handler Function (Application Code)
# my-service/src/handlers.py
from pyslets.events import ListingSold

def handle_listing_sold(event: ListingSold):
    """Process listing sale event"""
    # Type-safe access to event fields
    print(f"Processing sale: {event.listing_id} for ${event.price}")

    # Business logic
    database.record_sale(
        listing_id=event.listing_id,
        price=event.price,
        agent_id=event.agent_id,
        sold_at=event.sold_at,
    )
Step 2: Declare Consumer (Service Infrastructure)
# my-service/infra/consumers.py
from pyslets.pulumi_components import Consumer

Consumer(
    service="my-service",
    event="listing.sold",                               # Event type to subscribe to
    handler="my_service.handlers:handle_listing_sold",  # Path to handler function
    deployment="lambda",                                # or "fargate" for long-running services
    timeout=30,                                         # Optional: handler timeout in seconds
    memory=256,                                         # Optional: Lambda memory in MB
)
Step 3: Deploy
cd my-service/infra
pulumi up
What Gets Created:
- SQS queue: my-service-listing-queue (per the <consumer-name>-<aggregate>-queue convention)
- SQS DLQ: my-service-listing-dlq
- SNS subscription: listing-events topic → queue
- Lambda function with handler code
- IAM roles with least-privilege policies
- CloudWatch alarms for DLQ size
- Datadog monitoring integration
At Runtime:
1. Events published to listing-events topic
2. SNS fans out to my-service-listing-queue
3. Lambda polls queue automatically (AWS manages polling)
4. Lambda deserializes message to ListingSold object
5. Lambda calls handle_listing_sold(event)
6. Success: message deleted; Failure: retry 3x → DLQ
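The generated entrypoint is not specified by this RFC; a hypothetical shape, which unwraps the SNS envelope inside each SQS record before invoking the handler, might be:

# Hypothetical pyslets-generated Lambda entrypoint (illustrative only).
import json

from my_service.handlers import handle_listing_sold  # user-supplied handler
from pyslets.events import ListingSold               # generated event class

def lambda_handler(event, context):
    for record in event["Records"]:            # SQS batch
        envelope = json.loads(record["body"])  # SNS envelope
        payload = json.loads(envelope["Message"])
        handle_listing_sold(ListingSold(**payload))
    # An exception from any record fails the batch, triggering the
    # SQS retry path and, after 3 attempts, the DLQ.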
Workflow for Legacy Systems
For systems that cannot easily adopt Lambda or prefer webhook integration:
Infrastructure Declaration:
# my-service/infra/consumers.py
from pyslets.pulumi_components import Consumer, WebhookHandler
from pyslets.pulumi_components.types import Service, Events

# Set secret first:
# $ pulumi config set --secret legacy_webhook_secret <secret-value>
Consumer(
    service=Service.MY_SERVICE,
    event=Events.Listing.SOLD,
    handler=WebhookHandler(
        url="https://legacy-api.company.com/api/webhooks/listing-sold",
        secret_config_key="legacy_webhook_secret",
        headers={  # Optional: custom headers
            "X-API-Key": "from-config"
        },
        timeout=30,
    ),
)
Legacy System (Receives Webhook):
# legacy-api/webhooks.py
from flask import request
from pyslets.webhooks import verify_hmac_signature

@app.route('/api/webhooks/listing-sold', methods=['POST'])
def handle_listing_sold():
    # Verify HMAC signature from forwarder Lambda
    if not verify_hmac_signature(request, secret=WEBHOOK_SECRET):
        return 'Unauthorized', 401

    data = request.json
    # Call existing business logic (no changes needed)
    existing_module.process_sale(
        listing_id=data['listing_id'],
        price=data['price']
    )
    return '', 200
A Consumer declared with a WebhookHandler creates a forwarder Lambda (sketched below) that:
- Polls SQS queue
- POSTs JSON to webhook URL with HMAC signature
- Handles retries and DLQ routing
- Zero code changes needed in legacy system
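A sketch of what such a forwarder might do per SQS record follows; it uses only the standard library, and the signature header name and secret resolution are assumptions:

# Forwarder: POST the event payload to the legacy webhook with an HMAC signature.
import hashlib
import hmac
import json
import urllib.request

WEBHOOK_URL = "https://legacy-api.company.com/api/webhooks/listing-sold"
SECRET = "resolved-from-pulumi-config-secret"  # placeholder

def forward(record):
    body = json.loads(record["body"])["Message"]  # unwrap the SNS envelope
    signature = hmac.new(SECRET.encode(), body.encode(),
                         hashlib.sha256).hexdigest()
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=body.encode(),
        headers={"Content-Type": "application/json",
                 "X-Islets-Signature": signature},  # assumed header name
        method="POST",
    )
    # A non-2xx response raises, so the message is retried and
    # eventually routed to the DLQ.
    with urllib.request.urlopen(req, timeout=30) as resp:
        if resp.status >= 300:
            raise RuntimeError(f"webhook returned {resp.status}")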
Multiple Events, Single Service
# my-service/infra/consumers.py
from pyslets.pulumi_components import Consumer

# Subscribe to multiple events
Consumer(service="my-service", event="listing.sold",
         handler="my_service.handlers:handle_listing_sold")
Consumer(service="my-service", event="company.score_changed",
         handler="my_service.handlers:handle_company_score")
Infrastructure Ownership
Islets Repository Manages:
- Event schemas (events/ directory)
- SNS topics (one per aggregate)
- Pulumi component libraries (pyslets.pulumi_components)
- Python SDK (pyslets)
Consuming Service Manages:
- Handler functions (application code)
- Consumer declarations (Pulumi in service's repo)
- SQS queues and DLQs
- Lambda functions or webhook forwarders
- Service-specific monitoring and alerts
This separation ensures each service owns its infrastructure end-to-end while using standardized components from islets.
Consumer Registration in Schema
When subscribing to an event, teams SHOULD add themselves to the consumer registry:
# events/listing.yaml (in islets repo)
events:
  sold:
    consumers:
      - team: My Service Team
        service: my-service
        queue: my-service-listing-queue
        critical: true  # Breaking changes block our workflow
This enables impact analysis when schemas change.
Multi-Language SDK Support
- Event schemas are language-agnostic (JSON Schema in YAML files)
- SDKs for multiple languages SHARE the same schema repository (events/ directory)
- Each SDK MUST implement dynamic event class generation from schemas at runtime
- SDKs MUST maintain API consistency across languages where idiomatic
- Currently implemented: Python SDK (pyslets)
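As a toy sketch of the dynamic class generation each SDK must implement (pyslets' internals may differ; the naming and field handling here are assumptions):

# Build a typed event class (e.g., ListingSold) from its JSON Schema.
from dataclasses import make_dataclass

def event_class_from_schema(aggregate: str, event_type: str, schema: dict):
    # listing + sold -> ListingSold, per the naming conventions above.
    name = aggregate.capitalize() + "".join(
        part.capitalize() for part in event_type.split("_"))
    fields = list(schema["properties"].keys())
    return make_dataclass(name, fields)

# Usage: generate the class at load time, then construct typed events.
ListingSold = event_class_from_schema("listing", "sold", {
    "type": "object",
    "properties": {"listing_id": {}, "price": {}, "agent_id": {}, "sold_at": {}},
})
evt = ListingSold(listing_id="L-1", price=500_000.0, agent_id="A-9",
                  sold_at="2026-02-12T10:00:00Z")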