sdi/sdi-saas-architecture-blueprint.md
austindebest d62468adf9 Initial commit: SDI SaaS Platform foundation
- Complete monorepo structure with pnpm workspaces
- Prisma database schema with 20+ entities
- NestJS API with 9 core modules
- BullMQ orchestration worker
- AWS and Azure provider adapters
- Docker Compose infrastructure
- Complete documentation
2026-04-20 00:00:59 +01:00


# SDI SaaS Architecture Blueprint
## Overview
This document outlines a production-oriented architecture blueprint for building a software-defined interconnection (SDI) SaaS platform similar in product direction to Console Connect. Console Connect presents itself as a software-defined interconnection platform that enables enterprises to provision and manage private connections between clouds, data centres, applications, and partners through a portal and APIs.[cite:11][cite:16] MEF's Lifecycle Service Orchestration (LSO) framework is intended to standardize automation across service ordering, inventory, billing, and multi-provider orchestration, making it a strong reference model for an interconnection platform intended to federate with partners and carriers.[cite:21][cite:24][cite:36]
The recommended product approach is a multi-tenant SaaS control plane with a customer portal, admin portal, orchestration engine, provider/cloud adapters, billing subsystem, and standards-aligned API layer. A TypeScript-first implementation matches the user's existing strengths in Node.js, TypeScript, Prisma, Docker, BullMQ, Vue.js, and production deployment on Ubuntu and Kubernetes.[cite:1][cite:2][cite:3][cite:4][cite:5][cite:7][cite:8]
## Product Scope
An SDI platform of this type acts as a digital control plane for ordering, provisioning, modifying, monitoring, and billing private connectivity services. Public material describing Console Connect emphasizes software-defined interconnection, private connectivity, self-service provisioning, and API-driven automation instead of relying solely on slow manual provisioning.[cite:11][cite:14][cite:17][cite:18]
The initial commercial service catalog should focus on a limited set of high-value product types:
- Cloud-to-data-centre private interconnect.
- Multi-cloud connectivity between AWS and Azure.
- Partner-to-partner private interconnection.
- On-demand bandwidth changes for supported services.
- Service inventory, usage, billing, and lifecycle management.
## Architecture Principles
The architecture should follow a few hard rules from the beginning:
- Keep a canonical internal domain model independent of any single provider or standards body.
- Treat provisioning as an asynchronous workflow, not a request-response transaction.
- Separate the orchestration core from provider-specific adapter code.
- Persist every service-state transition for auditability and recovery.
- Expose APIs as a core product capability, not as a later add-on.[cite:12][cite:15][cite:18]
- Align external B2B APIs with MEF LSO concepts where possible so federation with partners is easier later.[cite:24][cite:27][cite:36]
## System Landscape
The platform should be organized into the following top-level systems:
| System | Purpose |
|---|---|
| Customer portal | Order services, manage inventory, monitor status, billing, teams |
| Admin portal | Provider onboarding, pricing, manual intervention, audits, NOC tooling |
| Public API | Customer automation, API keys, webhooks, partner integrations |
| Core domain API | Tenants, catalog, orders, services, billing, audit |
| Orchestration engine | Long-running workflows, retries, rollback, dependency sequencing |
| Provider adapters | AWS, Azure, carrier, IX, and data-centre integration |
| Event backbone | Async job processing and event fan-out |
| Observability stack | Logs, metrics, traces, alerts, SLOs |
## Recommended Tech Stack
The core implementation should use a TypeScript-first stack so that the portal, APIs, shared contracts, and workflow logic live in one strongly typed ecosystem. That aligns well with the user's known experience in TypeScript, Node.js, Prisma, Vue.js, Docker, Kubernetes, and BullMQ.[cite:2][cite:3][cite:4][cite:5][cite:7]
| Layer | Recommended stack |
|---|---|
| Frontend | Vue 3, Nuxt 3, TypeScript, Tailwind CSS, Pinia, TanStack Query |
| API backend | NestJS or Fastify with TypeScript |
| Data | PostgreSQL |
| ORM | Prisma[cite:3] |
| Queue and jobs | Redis + BullMQ[cite:7] |
| Search/log analytics | OpenSearch |
| Object storage | S3-compatible storage or MinIO |
| Realtime | SSE first, WebSockets where needed |
| Auth | Keycloak, Auth0, or Ory |
| Infra | Docker, Kubernetes, Helm, Terraform[cite:5][cite:8] |
| Observability | Prometheus, Grafana, Loki, OpenTelemetry |
| High-performance adapters | Go for selected components when concurrency or low-level control demands it |
## Repository and Module Structure
A production-friendly repository model should separate applications from shared packages:
```txt
apps/
  customer-portal/
  admin-portal/
  api/
  worker/
  realtime-gateway/
packages/
  domain-core/
  shared-types/
  auth-sdk/
  billing-engine/
  event-contracts/
  adapter-aws/
  adapter-azure/
  adapter-mef-partner/
  adapter-carrier-x/
infra/
  terraform/
  helm/
  k8s/
```
Inside the main API, use bounded modules rather than one giant service layer:
- auth
- tenants
- users
- roles
- catalog
- endpoints
- quotes
- orders
- services
- provisioning
- inventory
- billing
- audit
- notifications
- webhooks
- providerAccounts
- incidents
## Domain Model
The internal canonical model should normalize the language across clouds, carriers, exchanges, and partners. This prevents the entire system from becoming tightly coupled to AWS Direct Connect, Azure ExpressRoute, or any one MEF payload shape.
### Core entities
- Tenant
- User
- Role
- Provider
- ProviderAccount
- Endpoint
- ProductOffering
- Quote
- Order
- Service
- ProvisioningTask
- InventoryRecord
- UsageRecord
- Invoice
- ApiKey
- WebhookEndpoint
- AuditEvent
- Incident
### Example service order type
```ts
export type ServiceOrderStatus =
  | 'draft'
  | 'submitted'
  | 'validating'
  | 'quoted'
  | 'approved'
  | 'queued'
  | 'provisioning'
  | 'active'
  | 'failed'
  | 'suspended'
  | 'terminated';

export interface ServiceOrder {
  id: string;
  tenantId: string;
  productOfferingId: string;
  providerId: string;
  sourceEndpointId: string;
  targetEndpointId: string;
  bandwidthMbps: number;
  status: ServiceOrderStatus;
  externalReference?: string;
  createdAt: Date;
  updatedAt: Date;
}
```
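A status union like this is easier to keep honest when every change passes through an explicit transition table. The sketch below is one possible guard, not a prescribed state machine: the allowed edges in `ALLOWED_TRANSITIONS` and the helper names `canTransition` and `assertTransition` are assumptions made for illustration. The status union is repeated so the snippet stands alone.

```ts
// Hypothetical transition guard. The edges below are illustrative assumptions,
// not rules taken from this blueprint.
type ServiceOrderStatus =
  | 'draft' | 'submitted' | 'validating' | 'quoted' | 'approved' | 'queued'
  | 'provisioning' | 'active' | 'failed' | 'suspended' | 'terminated';

const ALLOWED_TRANSITIONS: Record<ServiceOrderStatus, readonly ServiceOrderStatus[]> = {
  draft: ['submitted'],
  submitted: ['validating'],
  // The direct validating -> provisioning edge covers flows that skip quoting.
  validating: ['quoted', 'provisioning', 'failed'],
  quoted: ['approved', 'terminated'],
  approved: ['queued'],
  queued: ['provisioning', 'failed'],
  provisioning: ['active', 'failed'],
  active: ['suspended', 'terminated'],
  suspended: ['active', 'terminated'],
  failed: ['queued', 'terminated'],
  terminated: [],
};

// Returns true when the table lists `to` as a legal successor of `from`.
function canTransition(from: ServiceOrderStatus, to: ServiceOrderStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Throws on an illegal move so callers cannot persist an invalid state.
function assertTransition(from: ServiceOrderStatus, to: ServiceOrderStatus): void {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal status transition: ${from} -> ${to}`);
  }
}
```

Calling `assertTransition` inside the repository method that updates status would keep the rule in one place rather than scattered across workers and controllers.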
## MEF LSO Mapping
MEF's LSO standards are useful as the interoperability layer for B2B and partner automation. MEF materials describe API domains around service qualification, quoting, ordering, inventory, billing, and multi-provider automation under the LSO framework.[cite:24][cite:36][cite:39]
| Internal module | MEF-aligned domain | Purpose |
|---|---|---|
| endpoints / serviceability | Address validation, service qualification | Check whether a service can be delivered |
| quotes | Quote management | Generate price and commercial terms |
| orders | Product ordering | Accept and track service orders |
| inventory | Product inventory | Return active services and asset state |
| incidents | Trouble ticketing | Lifecycle of faults and support cases |
| billing | Billing management | Charges, invoices, reconciliation |
| partner gateway | Sonata / Cantata style inter-provider APIs | Federation with partners and providers |
| internal resource automation | Presto-like orchestration patterns | Domain-level provisioning inside the provider environment |
The clean implementation pattern is to keep the internal canonical objects stable, then map them to MEF-compliant payloads in a translation layer. That lets the SaaS expose MEF-shaped APIs externally without forcing the whole internal system into external standard payloads.[cite:24][cite:33][cite:36]
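As a rough illustration of that translation layer, the sketch below maps an internal order onto an external payload shape. `PartnerOrderPayload` and its field names are placeholders invented for this example and are not taken from any MEF schema; a real mapper would target the payload definitions published with the relevant LSO API.

```ts
// Internal canonical order (subset of the ServiceOrder fields in this document).
interface InternalOrder {
  id: string;
  tenantId: string;
  productOfferingId: string;
  sourceEndpointId: string;
  targetEndpointId: string;
  bandwidthMbps: number;
}

// Hypothetical external shape; field names are assumptions, not MEF-defined.
interface PartnerOrderPayload {
  externalId: string;
  items: Array<{
    action: 'add';
    offeringId: string;
    bandwidth: { amount: number; units: 'MBPS' | 'GBPS' };
    endpoints: { a: string; z: string };
  }>;
}

// Pure mapping function: no I/O, so it can be unit-tested in isolation.
function toPartnerOrder(order: InternalOrder): PartnerOrderPayload {
  // Express whole multiples of 1000 Mbps as Gbps; keep everything else in Mbps.
  const useGbps = order.bandwidthMbps >= 1000 && order.bandwidthMbps % 1000 === 0;
  return {
    externalId: order.id,
    items: [
      {
        action: 'add',
        offeringId: order.productOfferingId,
        bandwidth: useGbps
          ? { amount: order.bandwidthMbps / 1000, units: 'GBPS' }
          : { amount: order.bandwidthMbps, units: 'MBPS' },
        endpoints: { a: order.sourceEndpointId, z: order.targetEndpointId },
      },
    ],
  };
}
```

Keeping the mapper pure means a change in the external schema touches only this function and its tests, while the internal model stays untouched.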
## API Design
The platform should expose three API families:
1. Customer API for self-service automation.
2. Partner API for inter-provider federation and standards alignment.
3. Internal service APIs for adapters, orchestration, billing, and observability.
### Example external endpoints
```txt
POST /v1/quotes
GET /v1/quotes/:id
POST /v1/orders
GET /v1/orders/:id
POST /v1/orders/:id/cancel
GET /v1/services
GET /v1/services/:id
POST /v1/services/:id/modify
POST /v1/services/:id/suspend
POST /v1/services/:id/terminate
GET /v1/inventory
GET /v1/billing/invoices
POST /v1/webhooks/test
```
### Webhook events
```txt
quote.ready
order.accepted
order.rejected
service.provisioning.started
service.provider.pending
service.active
service.failed
service.modified
service.suspended
service.terminated
invoice.generated
incident.created
```
## Provisioning Architecture
Provisioning must be implemented as a stateful orchestration flow. AWS Direct Connect and Azure ExpressRoute are both private-connectivity services managed through provider tooling and automation interfaces, and each has external dependencies, location constraints, routing details, and lifecycle operations that make asynchronous orchestration necessary.[cite:0][cite:1]
### Provisioning flow
1. Customer submits intent through the portal or API.
2. Core API validates payload, tenant permissions, and serviceability.
3. Order record is created in PostgreSQL.
4. An event is emitted to the orchestration queue.
5. Orchestrator resolves dependency graph and selects provider adapter.
6. Adapter calls downstream cloud or partner APIs.
7. Status is tracked through callbacks or polling.
8. State transitions are persisted.
9. Realtime gateway streams updates to the portal.
10. Billing metering starts after activation.
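Step 2 of the flow above can be sketched as a small, dependency-free validator for the order request body. The `CreateOrderRequest` shape and the specific rules (non-empty IDs, positive integer bandwidth, distinct endpoints) are assumptions for illustration; tenant permission and serviceability checks would sit behind this in the real API.

```ts
// Hypothetical request body for POST /v1/orders.
interface CreateOrderRequest {
  productOfferingId: string;
  sourceEndpointId: string;
  targetEndpointId: string;
  bandwidthMbps: number;
}

type ValidationOutcome =
  | { ok: true; value: CreateOrderRequest }
  | { ok: false; errors: string[] };

// Structural validation only: collects every problem instead of failing fast.
function validateCreateOrder(input: unknown): ValidationOutcome {
  if (typeof input !== 'object' || input === null) {
    return { ok: false, errors: ['body must be an object'] };
  }
  const body = input as Record<string, unknown>;
  const errors: string[] = [];

  for (const field of ['productOfferingId', 'sourceEndpointId', 'targetEndpointId']) {
    const value = body[field];
    if (typeof value !== 'string' || value.length === 0) {
      errors.push(`${field} must be a non-empty string`);
    }
  }

  const bandwidth = body['bandwidthMbps'];
  if (typeof bandwidth !== 'number' || !Number.isInteger(bandwidth) || bandwidth <= 0) {
    errors.push('bandwidthMbps must be a positive integer');
  }

  const source = body['sourceEndpointId'];
  if (typeof source === 'string' && source.length > 0 && source === body['targetEndpointId']) {
    errors.push('source and target endpoints must differ');
  }

  if (errors.length > 0) return { ok: false, errors };
  return {
    ok: true,
    value: {
      productOfferingId: body['productOfferingId'] as string,
      sourceEndpointId: body['sourceEndpointId'] as string,
      targetEndpointId: body['targetEndpointId'] as string,
      bandwidthMbps: bandwidth as number,
    },
  };
}
```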
### Example adapter contract
```ts
export interface ProviderAdapter {
  validate(payload: ServiceIntent): Promise<ValidationResult>;
  quote(payload: ServiceIntent): Promise<QuoteResult>;
  provision(payload: ProvisionRequest): Promise<ProvisionResponse>;
  getStatus(externalId: string): Promise<ServiceStatus>;
  modify(payload: ModifyRequest): Promise<ModifyResponse>;
  suspend(externalId: string): Promise<ActionResult>;
  terminate(externalId: string): Promise<ActionResult>;
  syncInventory?(): Promise<void>;
}
```
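The orchestrator needs a way to look up the right adapter for a provider. One minimal way to do that is a typed registry keyed by provider ID, sketched below. The `AdapterRegistry` class and the `StatusOnlyAdapter` stub are hypothetical names used only to keep the example self-contained; in practice the type parameter would be the full adapter contract.

```ts
// Generic registry keyed by provider ID. `A` would be the adapter contract type.
class AdapterRegistry<A> {
  private readonly adapters = new Map<string, A>();

  // Registering the same provider twice is treated as a configuration error.
  register(providerId: string, adapter: A): void {
    if (this.adapters.has(providerId)) {
      throw new Error(`Adapter already registered for provider ${providerId}`);
    }
    this.adapters.set(providerId, adapter);
  }

  // Fails loudly so an order never silently falls through to no adapter.
  get(providerId: string): A {
    const adapter = this.adapters.get(providerId);
    if (!adapter) {
      throw new Error(`No adapter registered for provider ${providerId}`);
    }
    return adapter;
  }
}

// Stub adapter exposing only one method, purely for demonstration.
interface StatusOnlyAdapter {
  getStatus(externalId: string): Promise<string>;
}

const registry = new AdapterRegistry<StatusOnlyAdapter>();
registry.register('provider-aws', {
  getStatus: async (_externalId: string) => 'pending',
});
```

Throwing on a missing adapter surfaces misconfiguration at the start of a workflow rather than halfway through provisioning.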
### Example orchestration logic
```ts
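// orderRepo, adapterRegistry, billing, audit, and toServiceIntent are
// application-level dependencies assumed to be in scope.
async function provisionOrder(orderId: string): Promise<void> {
  const order = await orderRepo.getById(orderId);
  if (!order) {
    throw new Error(`Order ${orderId} not found`);
  }

  // Step 1: validate the intent against the selected provider.
  await orderRepo.updateStatus(orderId, 'validating');
  const adapter = adapterRegistry.get(order.providerId);
  const validation = await adapter.validate(toServiceIntent(order));
  if (!validation.ok) {
    await orderRepo.updateStatus(orderId, 'failed');
    await audit.log(orderId, 'validation_failed', validation.errors);
    return;
  }

  // Step 2: call the provider through its adapter.
  await orderRepo.updateStatus(orderId, 'provisioning');
  const result = await adapter.provision({
    sourceEndpointId: order.sourceEndpointId,
    targetEndpointId: order.targetEndpointId,
    bandwidthMbps: order.bandwidthMbps,
  });
  if (!result.success) {
    await orderRepo.updateStatus(orderId, 'failed');
    await audit.log(orderId, 'provision_failed', result.error);
    return;
  }

  // Step 3: record the provider reference, activate, and start metering.
  // This simplified path treats a successful provision call as completion;
  // with asynchronous providers, activation would follow a callback or poll.
  await orderRepo.updateExternalReference(orderId, result.externalServiceId);
  await orderRepo.updateStatus(orderId, 'active');
  await billing.activateMetering(orderId);
  await audit.log(orderId, 'service_active', result);
}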
```
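Where a provider completes asynchronously, step 7 of the provisioning flow (callbacks or polling) has to happen before the order is marked active. The following is a minimal polling sketch with capped exponential backoff. The names `pollDelayMs` and `pollUntilSettled`, the `ProviderStatus` values, and the default delays are all assumptions; a BullMQ-based worker might instead express this as delayed or repeated jobs.

```ts
type ProviderStatus = 'pending' | 'active' | 'failed';

// Capped exponential backoff: base, 2x base, 4x base, ... up to maxMs.
function pollDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

interface PollOptions {
  maxAttempts: number;
  baseMs?: number;
  maxMs?: number;
  // Injectable sleep so tests can run without real timers.
  sleep?: (ms: number) => Promise<void>;
}

// Polls until the provider reports a settled state or attempts run out.
// Returns 'pending' if the attempt budget is exhausted, so the caller can
// decide whether to reschedule or escalate to manual intervention.
async function pollUntilSettled(
  getStatus: () => Promise<ProviderStatus>,
  opts: PollOptions,
): Promise<ProviderStatus> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    const status = await getStatus();
    if (status !== 'pending') return status;
    await sleep(pollDelayMs(attempt, opts.baseMs, opts.maxMs));
  }
  return 'pending';
}
```

The function returns `'pending'` instead of throwing on exhaustion, which leaves room for the orchestrator to route long-running orders to a manual intervention queue.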
## AWS Adapter Design
AWS Direct Connect is presented by AWS as a private connectivity service with global availability and deployment options such as dedicated and hosted connections, and AWS supports managing it through the console, CLI, and API.[cite:1] The AWS adapter should therefore encapsulate AWS-specific service qualification, connection creation, lifecycle changes, and status retrieval while exposing a provider-neutral interface to the orchestration engine.
### AWS adapter responsibilities
- Maintain metadata for supported Direct Connect locations and regions.
- Validate feasible source and target combinations.
- Create or manage Direct Connect-related service components through AWS APIs.
- Persist AWS external identifiers and status codes.
- Store BGP and routing-related metadata where relevant.
- Support bandwidth changes, suspend or terminate operations, and inventory synchronization.
### AWS flow
1. Resolve available interconnection location.
2. Validate tenant entitlement and provider account mapping.
3. Generate quote using price book and optional AWS-linked commercial rules.
4. Submit provisioning call through the adapter.
5. Persist external identifiers and poll or subscribe for status changes.
6. Mark service active and start metering when all required conditions are met.
## Azure Adapter Design
Microsoft documents Azure ExpressRoute as private connectivity into Microsoft cloud services through a connectivity provider, exchange, or direct model, with BGP-based routing, redundant connections, and multiple automation paths including portal, PowerShell, CLI, ARM, Terraform, and Bicep.[cite:0] The Azure adapter should mirror the same provider-neutral contract used by AWS while capturing Azure-specific concepts such as ExpressRoute circuit metadata, peering state, redundancy, and regional capability.
### Azure adapter responsibilities
- Maintain ExpressRoute locations, supported providers, SKUs, and bandwidth options.
- Validate location and provider compatibility.
- Create or update circuits and peering-related metadata through Azure automation interfaces.
- Store circuit IDs, provisioning state, and change history.
- Support modify, suspend, terminate, and inventory-sync operations.
### Azure flow
1. Validate source site and target Azure region.
2. Resolve supported ExpressRoute location and provider path.
3. Generate quote and order summary.
4. Provision through adapter calls.
5. Persist circuit references and route state.
6. Move service to active only after dependency checks complete.
## Multi-Cloud Connectivity Pattern
A realistic AWS-to-Azure service is usually implemented through an intermediary fabric, exchange, or provider edge, rather than by directly linking two cloud-native constructs in isolation. Megaport's guidance explains connecting AWS Direct Connect and Azure ExpressRoute through a data-centre or interconnection hub where routing can be exchanged and BGP established across the private path.[cite:22][cite:25]
In the SaaS, that should be modeled as a composite service consisting of multiple linked sub-services:
- AWS-side connectivity leg.
- Exchange or partner-fabric leg.
- Azure-side connectivity leg.
- Composite service object exposed to the customer.
This is important because failures, billing, and lifecycle changes may occur on one leg without affecting the others equally. The orchestration engine therefore needs dependency-aware workflows and partial-failure handling.
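One way to reason about partial failure is to derive the composite status from the statuses of its legs. The sketch below is an assumption-laden illustration: the `LegStatus` and `CompositeStatus` values and the precedence rules are invented here and would need to match the real product semantics.

```ts
type LegStatus = 'pending' | 'active' | 'failed' | 'terminated';
type CompositeStatus = 'provisioning' | 'active' | 'degraded' | 'failed' | 'terminated';

// Derives a customer-facing status from the per-leg statuses.
// Rules (illustrative assumptions):
//   - all legs terminated          -> terminated
//   - all legs active              -> active
//   - some leg down, some active   -> degraded
//   - some leg down, none active   -> failed
//   - otherwise (pending + active) -> provisioning
function deriveCompositeStatus(legs: readonly LegStatus[]): CompositeStatus {
  if (legs.length === 0) {
    throw new Error('A composite service needs at least one leg');
  }
  if (legs.every((s) => s === 'terminated')) return 'terminated';
  if (legs.every((s) => s === 'active')) return 'active';

  const anyDown = legs.some((s) => s === 'failed' || s === 'terminated');
  const anyActive = legs.some((s) => s === 'active');
  if (anyDown) return anyActive ? 'degraded' : 'failed';
  return 'provisioning';
}
```

For an AWS leg, a fabric leg, and an Azure leg, `deriveCompositeStatus(['active', 'failed', 'active'])` yields `'degraded'`, which signals that rollback or repair of a single leg is needed without hiding the healthy ones.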
## Realtime Backend Design
The real-time backend should use event-driven components around a durable transactional core. The user already uses BullMQ and TypeScript background job patterns, which makes Redis-backed orchestration a practical early-stage choice.[cite:7]
### Recommended real-time stack
- PostgreSQL for source-of-truth transactional data.
- Redis for queues, locks, and transient workflow state.
- BullMQ workers for orchestration.
- SSE for customer-facing live state changes.
- WebSockets only where bidirectional traffic is needed.
- OpenTelemetry traces across API, worker, and adapter boundaries.
- Grafana, Prometheus, and Loki for operations visibility.
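For the SSE path, each state change is written to the response stream as a text frame made of `id:`, `event:`, and `data:` lines followed by a blank line. A minimal formatter might look like the sketch below; the function name `formatSseEvent` and the example payload are assumptions.

```ts
// Serializes one server-sent event frame.
// JSON.stringify without indentation emits no raw newlines, so a single
// `data:` line is sufficient for JSON-serializable payloads.
function formatSseEvent(event: string, data: unknown, id?: string): string {
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  lines.push(`event: ${event}`);
  lines.push(`data: ${JSON.stringify(data)}`);
  // A blank line terminates the frame.
  return lines.join('\n') + '\n\n';
}

// Example: a frame the realtime gateway could write for an activated service.
const frame = formatSseEvent('service.active', { orderId: 'o1' }, '7');
```

Including an `id` lets browsers send `Last-Event-ID` on reconnect, so the gateway can replay missed transitions from the persisted audit trail.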
### Event model
```txt
order.created
order.validated
quote.generated
order.approved
provisioning.started
provider.request.sent
provider.pending
provider.completed
service.active
service.failed
billing.metering.started
inventory.sync.completed
incident.opened
```
### Scaling strategy
- Scale API pods horizontally behind an ingress controller.
- Scale workers independently from the API.
- Isolate slow or noisy providers into dedicated adapter deployments.
- Use idempotency keys for retries and duplicate-callback protection.
- Introduce Kafka or NATS later if event volume significantly exceeds the practical comfort zone of Redis-backed orchestration.
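The idempotency point above can be illustrated with a deterministic key plus a first-seen check. This is a sketch under stated assumptions: the key fields, the helper names `idempotencyKey` and `SeenKeys`, and the in-memory set are illustrative only. A production system would back the check with something durable, such as a unique constraint in PostgreSQL or an atomic set-if-absent in Redis.

```ts
import { createHash } from 'node:crypto';

// Deterministic key: the same order, operation, and scope always hash the same.
// JSON.stringify on an array avoids ambiguity from delimiter characters.
function idempotencyKey(orderId: string, operation: string, scope: string): string {
  return createHash('sha256')
    .update(JSON.stringify([orderId, operation, scope]))
    .digest('hex');
}

// In-memory stand-in for a durable deduplication store (illustration only).
class SeenKeys {
  private readonly seen = new Set<string>();

  // Returns true the first time a key is observed, false on every repeat.
  firstTime(key: string): boolean {
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}

// Usage: skip a duplicate provider callback.
const seen = new SeenKeys();
const key = idempotencyKey('order-123', 'provider.completed', 'callback-1');
const shouldProcess = seen.firstTime(key); // true on first delivery
const isDuplicate = !seen.firstTime(key);  // true on redelivery
```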
## Database Schema Skeleton
A relational schema should center on strong referential integrity and auditability.
```sql
create table tenants (
  id uuid primary key,
  name text not null,
  created_at timestamptz not null default now()
);

create table providers (
  id uuid primary key,
  name text not null,
  type text not null,
  created_at timestamptz not null default now()
);

create table endpoints (
  id uuid primary key,
  provider_id uuid references providers(id),
  kind text not null,
  region text,
  metro text,
  metadata jsonb not null default '{}'::jsonb
);

create table service_orders (
  id uuid primary key,
  tenant_id uuid references tenants(id),
  provider_id uuid references providers(id),
  source_endpoint_id uuid references endpoints(id),
  target_endpoint_id uuid references endpoints(id),
  status text not null,
  bandwidth_mbps integer not null,
  external_reference text,
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);

create table services (
  id uuid primary key,
  order_id uuid references service_orders(id),
  tenant_id uuid references tenants(id),
  status text not null,
  activated_at timestamptz,
  terminated_at timestamptz
);

create table audit_events (
  id uuid primary key,
  aggregate_type text not null,
  aggregate_id uuid not null,
  event_type text not null,
  payload jsonb not null,
  created_at timestamptz not null default now()
);
```
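To show how the `audit_events` table could capture state transitions, the sketch below builds a row object for a status change. The `AuditEventRow` type mirrors the column names above, but the `status_changed` event name, the `service_order` aggregate label, and the payload shape are assumptions. `created_at` is left to the database default.

```ts
import { randomUUID } from 'node:crypto';

// Mirrors the audit_events columns; created_at is filled by the DB default.
interface AuditEventRow {
  id: string;
  aggregate_type: string;
  aggregate_id: string;
  event_type: string;
  payload: Record<string, unknown>;
}

// Builds an audit row for an order status change (hypothetical event naming).
function statusChangeAuditEvent(orderId: string, from: string, to: string): AuditEventRow {
  return {
    id: randomUUID(),
    aggregate_type: 'service_order',
    aggregate_id: orderId,
    event_type: 'status_changed',
    payload: { from, to },
  };
}
```

Writing this row in the same database transaction as the status update keeps the audit trail and the order state from drifting apart.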
## Security Model
Security should be designed for enterprise customers from the start:
- Multi-tenant data isolation.
- Strong RBAC with least-privilege roles.
- SSO and MFA for enterprise tenants.
- API keys with scopes and rotation.
- Signed webhooks.
- End-to-end audit trails.
- Secret storage outside application code.
- Encryption in transit and at rest.
- Per-tenant rate limiting and anomaly detection.
Because the product controls private connectivity, service lifecycle, and billing, its security posture must be closer to that of enterprise infrastructure software than to that of a lightweight self-serve SaaS.
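Signed webhooks are commonly implemented with an HMAC over a timestamp and the raw body, verified with a constant-time comparison. The sketch below shows one such scheme. The `timestamp.body` signing format, the five-minute tolerance, and the function names are assumptions for illustration, not a defined contract of this platform.

```ts
import { createHmac, timingSafeEqual } from 'node:crypto';

// Signs `${timestamp}.${body}` with HMAC-SHA256 and returns a hex digest.
function signWebhook(secret: string, timestampSeconds: number, body: string): string {
  return createHmac('sha256', secret).update(`${timestampSeconds}.${body}`).digest('hex');
}

// Verifies the signature and rejects stale timestamps to limit replay.
function verifyWebhook(
  secret: string,
  timestampSeconds: number,
  body: string,
  signatureHex: string,
  nowSeconds: number,
  toleranceSeconds = 300,
): boolean {
  if (Math.abs(nowSeconds - timestampSeconds) > toleranceSeconds) return false;
  const expected = Buffer.from(signWebhook(secret, timestampSeconds, body), 'hex');
  const received = Buffer.from(signatureHex, 'hex');
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (expected.length !== received.length) return false;
  return timingSafeEqual(expected, received);
}
```

Receivers should verify against the raw request bytes before JSON parsing, since re-serializing the body can change whitespace and invalidate the signature.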
## Billing and Commercial Engine
The billing engine should support:
- One-time provisioning fees.
- Recurring monthly port or service fees.
- Usage-based bandwidth charges.
- Regional price books.
- Contract discounts and credits.
- Taxes and invoice generation.
- Reconciliation against provider-side usage records.
MEF's billing-related API work reinforces the value of having a distinct billing domain rather than embedding pricing logic deep inside the provisioning code.[cite:24][cite:36]
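A distinct billing domain can start as a pure function that turns a price-book entry and a usage figure into invoice lines. The sketch below is illustrative only: the `PriceBookEntry` fields, per-GB usage pricing, and percentage discount model are assumptions, and amounts are kept in integer cents to avoid floating-point drift.

```ts
// Hypothetical price-book entry; all amounts in integer cents.
interface PriceBookEntry {
  setupFeeCents: number;
  monthlyFeeCents: number;
  usageCentsPerGb: number;
}

interface InvoiceLine {
  description: string;
  amountCents: number;
}

// Builds invoice lines: one-time fee (first invoice only), recurring fee,
// usage charge, then an optional contract discount applied to the subtotal.
function buildInvoiceLines(
  price: PriceBookEntry,
  usageGb: number,
  opts: { firstInvoice: boolean; discountPercent?: number },
): InvoiceLine[] {
  const lines: InvoiceLine[] = [];
  if (opts.firstInvoice) {
    lines.push({ description: 'One-time provisioning fee', amountCents: price.setupFeeCents });
  }
  lines.push({ description: 'Monthly service fee', amountCents: price.monthlyFeeCents });
  lines.push({
    description: `Usage (${usageGb} GB)`,
    amountCents: Math.round(usageGb * price.usageCentsPerGb),
  });

  const subtotal = lines.reduce((sum, line) => sum + line.amountCents, 0);
  const discountPercent = opts.discountPercent ?? 0;
  if (discountPercent > 0) {
    lines.push({
      description: `Contract discount (${discountPercent}%)`,
      amountCents: -Math.round((subtotal * discountPercent) / 100),
    });
  }
  return lines;
}

// Sums all lines, including negative discount lines.
function invoiceTotalCents(lines: readonly InvoiceLine[]): number {
  return lines.reduce((sum, line) => sum + line.amountCents, 0);
}
```

As a worked example, a $500 setup fee, a $1,200 monthly fee, and 1,500 GB at $0.02 per GB give a subtotal of $1,730.00; a 10% discount removes $173.00, leaving $1,557.00.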
## Deployment Topology
A production launch should use a cloud-native deployment model with staged environments.
### Environments
- Local development with Docker Compose.
- Shared development cluster.
- Staging cluster with provider sandbox integrations.
- Production cluster in one primary region first.
- Optional secondary region for disaster recovery and later geo-expansion.
### Production components
- Kubernetes for API, worker, realtime gateway, and adapter services.[cite:8]
- Managed PostgreSQL or HA PostgreSQL.
- Managed Redis or HA Redis.
- Object storage for documents, invoices, and exports.
- Ingress controller with TLS termination and WAF.
- Prometheus, Grafana, Loki, and tracing backend.
## Phased Rollout Plan
### Phase 1: MVP
- Multi-tenant auth and RBAC.
- Customer portal and admin portal.
- Catalog of endpoints and product offerings.
- Quote and order APIs.
- One AWS adapter.
- One Azure adapter.
- Orchestration engine with retries and audit logs.
- Basic billing and invoices.
- Realtime order-status updates.
### Phase 2: Serious v1
- Composite multi-cloud services.
- Provider inventory sync.
- Customer API keys and webhooks.
- Incident and support workflows.
- Enhanced billing and reporting.
- More provider adapters.
- Manual intervention queue and NOC tooling.
### Phase 3: Federation and standards
- MEF-aligned partner APIs.
- Inter-provider automation and external order exchange.
- Broader inventory and trouble-ticket interoperability.
- Expanded SLA, compliance, and reporting features.
### Phase 4: Global scale
- Multi-region deployment.
- Advanced traffic engineering integrations.
- Stronger commercial routing and partner settlement.
- Regional data and operational segmentation.
## Open-Source Building Blocks
There is no exact open-source clone of Console Connect, but several open-source systems can accelerate a similar build. OpenDaylight is an open-source SDN platform for programmable network control, Faucet is an open-source SDN controller oriented toward production environments, and the MEF LSO Sonata SDK provides useful artifacts for standards-aligned integration work.[cite:29][cite:33][cite:38] ONOS is also commonly evaluated alongside OpenDaylight for service-provider and controller use cases.[cite:23]
| Tool | Role |
|---|---|
| OpenDaylight | Southbound SDN control and programmable network integration |
| ONOS | Carrier-style SDN control plane option |
| Faucet | Production-focused open SDN controller |
| MEF LSO Sonata SDK | Standards-oriented API artifacts and examples |
| Terraform | Infrastructure and cloud automation, including Azure-friendly workflows[cite:0] |
| OpenSearch | Search, event analytics, and operational visibility |
## Cost Model
Public 2026 cost estimates for SaaS products place complex platforms broadly in the six-figure range, with enterprise-grade systems frequently reaching $500,000 or more depending on integrations and compliance.[cite:31][cite:34][cite:37][cite:40] A global SDI SaaS should therefore be budgeted more like enterprise infrastructure software than a lightweight B2C web app.
| Stage | Build estimate | Notes |
|---|---|---|
| MVP | $80k-$180k | Limited providers, customer portal, admin panel, core orchestration[cite:31][cite:37] |
| Serious v1 | $180k-$500k | Better reliability, billing, AWS and Azure integration, audit, observability[cite:31][cite:34] |
| Global launch | $500k-$1M+ | Multi-region ops, partner federation, NOC tooling, compliance, sales engineering[cite:31][cite:34] |
Operating cost should be budgeted separately for infrastructure, observability, support, security, partner onboarding, and continuous product iteration.[cite:40]
## Recommended Next Build Sequence
A sensible execution sequence for this project is:
1. Model tenants, providers, endpoints, quotes, orders, services, and audit events.
2. Build auth, RBAC, tenant isolation, and audit logging.
3. Build quote and order workflows with a persisted state machine.
4. Add Redis and BullMQ orchestration.
5. Implement AWS and Azure adapters behind a common interface.
6. Add SSE-based realtime order tracking.
7. Add billing, invoices, and usage metering.
8. Expose customer API and webhooks.
9. Add provider inventory sync and support workflows.
10. Layer in MEF-aligned partner APIs once the internal model is stable.
## Final Recommendation
The strongest path is to start as a modular, API-first, TypeScript-based interconnection control plane rather than trying to recreate a full carrier-grade network fabric on day one. Public information about Console Connect, AWS Direct Connect, Azure ExpressRoute, and MEF LSO points to the same conclusion: the hardest and most defensible part of the product is not the dashboard UI, but the orchestration, standards alignment, adapter design, and operational reliability of real-world service delivery.[cite:11][cite:15][cite:18][cite:0][cite:1][cite:24]