Deployment

Kubernetes operator

The pgroles operator watches PostgresPolicy custom resources and continuously reconciles your PostgreSQL databases to match the declared state.


For the internal controller design, see the operator architecture page.

Overview

The operator brings the same convergent model as the CLI into Kubernetes. Instead of running pgroles apply manually, you declare a PostgresPolicy resource and the operator reconciles on a configurable interval.

  • Same manifest semantics — profiles, schemas, grants, retirements all work identically
  • Database credentials referenced via Kubernetes Secrets
  • Status conditions and change summaries on the custom resource
  • Finalizer-based cleanup on resource deletion

Production-focused controller

The operator is intended for production use. The current API is still v1alpha1, so the remaining work is primarily around API hardening and lifecycle polish rather than basic controller viability.

Installation

Helm

helm install pgroles-operator oci://ghcr.io/hardbyte/charts/pgroles-operator

Rust crate

Install from crates.io:

[dependencies]
pgroles-operator = "0.1.5"

If you are embedding the reconciler or CRD types directly from source, depend on the repository in your Cargo.toml:

[dependencies]
pgroles-operator = { git = "https://github.com/hardbyte/pgroles", tag = "v0.1.5" }

Configuration

Key values you can override:

# values.yaml
installCRDs: true

operator:
  image:
    repository: ghcr.io/hardbyte/pgroles-operator
    tag: ""  # defaults to Chart.appVersion

  env:
    - name: RUST_LOG
      value: "info,pgroles_operator=debug"

  resources:
    requests:
      cpu: 50m
      memory: 64Mi
    limits:
      cpu: 200m
      memory: 128Mi

The operator runs as nobody (UID 65534) with a read-only root filesystem, no capabilities, and seccomp enabled by default.
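In Pod terms, those defaults correspond to a container security context along these lines (a sketch of equivalent settings, not the chart's literal rendered values — the chart applies them for you):

```yaml
# Sketch of the equivalent container securityContext
securityContext:
  runAsNonRoot: true
  runAsUser: 65534              # nobody
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault
```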

Operational guidance

  • Use one PostgresPolicy per database and credential boundary.
  • Prefer a dedicated management role rather than an application login for reconciliation.
  • Validate and review the manifest with the CLI before handing it to the operator.
  • Treat deletion as "stop managing", not "revert the database".
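For example, two databases behind separate credential boundaries would get two policies, each with its own management Secret (names and secrets here are illustrative):

```yaml
apiVersion: pgroles.io/v1alpha1
kind: PostgresPolicy
metadata:
  name: orders-db
spec:
  connection:
    secretRef:
      name: orders-db-credentials   # dedicated management role for orders
  # ... rest of the policy for the orders database
---
apiVersion: pgroles.io/v1alpha1
kind: PostgresPolicy
metadata:
  name: billing-db
spec:
  connection:
    secretRef:
      name: billing-db-credentials  # dedicated management role for billing
  # ... rest of the policy for the billing database
```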

Production roadmap

The production-readiness foundations are now on main. The remaining work is mostly about API evolution and maintaining the stronger validation profile already in CI.

Implemented foundations

  • Canonical database identity and ownership claims let the controller detect overlapping PostgresPolicy resources targeting the same database.
  • Conflicting policies are rejected instead of allowing last-writer-wins behavior.
  • Status records managed database identity, owned role/schema summaries, lastAttemptedGeneration, lastSuccessfulReconcileTime, and the last error message.
  • Reconciliation is serialized per database target:
    • in-process locking prevents concurrent reconciles within one operator replica
    • PostgreSQL advisory locking prevents concurrent reconciles across multiple replicas
  • Retry behavior is failure-aware:
    • transient operational failures use exponential backoff with jitter
    • invalid specs, conflicts, and unsafe role-drop workflows fall back to the normal reconcile interval
    • lock contention keeps its own short retry path
  • The operator exposes:
    • /livez
    • /readyz
  • Metrics are exported via OpenTelemetry OTLP with the OpenTelemetry Collector as the intended Kubernetes sink.
  • Transition-based Kubernetes Events are emitted for notable policy state changes.

Current validation profile

CI covers:

  • multiple policies targeting the same database with conflicting ownership
  • multiple non-overlapping policies targeting the same database
  • shared-secret churn across multiple policies targeting the same database
  • invalid specs
  • missing secrets
  • insufficient database privileges
  • rotated secrets and connection recovery after secret repair
  • transition-based Kubernetes Event delivery for warning and recovery states

Default PR CI validates:

  • generated policies spanning 2 databases
  • 30 managed schemas total
  • 60 generated roles total
  • schema, table, and sequence privilege checks on both database targets

Scheduled fairness/load coverage on main additionally exercises:

  • 5 generated policies across 3 databases
  • 100 managed schemas total
  • 200 generated roles total
  • repeated shared-secret churn across 3 same-database policies
  • targeted secret churn on a separate database to verify isolation
  • latency reporting in the workflow summary for initial convergence and full churn completion

Remaining work

  • Carry the current controller semantics into the next CRD revision rather than leaving them as implementation-only conventions.
  • Promote the API beyond v1alpha1 only after the compatibility and upgrade story is explicit.
  • Keep the validation profile current as the manifest surface and operator behavior evolve.

Custom resource

A PostgresPolicy spec mirrors the CLI manifest format with added Kubernetes-specific fields for connection and scheduling.

apiVersion: pgroles.io/v1alpha1
kind: PostgresPolicy
metadata:
  name: myapp-roles
  namespace: default
spec:
  connection:
    secretRef:
      name: mydb-credentials
    secretKey: DATABASE_URL  # optional, defaults to DATABASE_URL

  interval: "5m"   # reconciliation interval (supports 5m, 1h, 30s, 1h30m)
  suspend: false   # set true to pause reconciliation
  mode: apply      # apply changes, or use plan for non-mutating drift preview

  default_owner: app_owner

  profiles:
    editor:
      grants:
        - privileges: [USAGE]
          'on': { type: schema }
        - privileges: [SELECT, INSERT, UPDATE, DELETE, REFERENCES, TRIGGER]
          'on': { type: table, name: "*" }
        - privileges: [USAGE, SELECT, UPDATE]
          'on': { type: sequence, name: "*" }
        - privileges: [EXECUTE]
          'on': { type: function, name: "*" }
      default_privileges:
        - privileges: [SELECT, INSERT, UPDATE, DELETE, REFERENCES, TRIGGER]
          on_type: table
        - privileges: [USAGE, SELECT, UPDATE]
          on_type: sequence
        - privileges: [EXECUTE]
          on_type: function

  schemas:
    - name: inventory
      profiles: [editor]

  roles:
    - name: app-service
      login: true
      comment: "Application service account"

  grants:
    - role: app-service
      privileges: [CONNECT]
      'on': { type: database, name: mydb }

  memberships:
    - role: inventory-editor
      members:
        - name: app-service

  retirements:
    - role: legacy_app
      reassign_owned_to: app_owner
      drop_owned: true

Database secret

Create a Secret containing your PostgreSQL connection string:

kubectl create secret generic mydb-credentials \
  --from-literal=DATABASE_URL='postgresql://user:password@host:5432/database'

The operator reads the Secret from the same namespace as the PostgresPolicy resource. When the Secret's resourceVersion changes (e.g. credential rotation), the operator automatically reconnects with updated credentials.
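Equivalently, the Secret can be declared as a manifest (connection values are placeholders):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mydb-credentials
  namespace: default        # must match the PostgresPolicy namespace
type: Opaque
stringData:
  DATABASE_URL: postgresql://user:password@host:5432/database
```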

The controller also emits Kubernetes Events for notable state transitions. These are intended for kubectl describe and quick operational debugging, not as a durable audit trail or alerting mechanism.

Reconciliation

The operator reconciles on three paths:

  • PostgresPolicy spec changes
  • referenced Secret changes
  • the normal periodic interval

Each reconcile inspects the current database state, computes a diff from the policy, and then either applies it or publishes a non-mutating plan depending on spec.mode. Same-database policies are serialized, and status-only updates do not retrigger the controller.

Use this page for the external behavior and operating model. For the internal controller pipeline and locking model, see the operator architecture page.

1. Read policy and Secret: load the PostgresPolicy, fetch DATABASE_URL from the referenced Secret, and refresh the cached pool when credentials change.
2. Build desired state: convert the CRD to the shared PolicyManifest model, then expand profiles and schemas into concrete roles, grants, and memberships.
3. Inspect PostgreSQL: query the live database state that matters for this policy, including managed roles, privileges, memberships, and provider-specific constraints.
4. Diff and safety checks: compute the convergent change plan, detect conflicts, and enforce per-database locking before any mutation is attempted.
5. Apply in one transaction: execute the rendered SQL statements inside a single transaction so the reconcile either commits fully or rolls back cleanly.
6. Patch status and emit telemetry: write conditions, summaries, and last-error state back to Kubernetes, and export OTLP metrics for runtime visibility.

Insufficient privileges

If the operator can connect to PostgreSQL but the management role cannot inspect or apply the requested changes, the policy settles to a non-ready state instead of hot-looping as if the failure were transient.

Current behavior:

  • Ready=False
  • reason InsufficientPrivileges
  • last_error contains the PostgreSQL error message, for example permission denied to create role
  • the policy retries on its normal reconcile interval rather than exponential transient backoff

This is the expected state when the database credential is valid but under-privileged for the requested manifest.

Interval

The interval field controls how often the operator re-reconciles, even when the resource hasn't changed. This catches drift from manual SQL changes. Supports durations like 30s, 5m, 1h, or compound forms like 1h30m. Defaults to 5m.

Suspending

Set suspend: true to pause reconciliation without deleting the resource. The operator will skip the resource until suspend is set back to false.

Plan mode

Set mode: plan to let the operator inspect the database, compute the diff, and publish the planned SQL without executing it.

spec:
  connection:
    secretRef:
      name: postgres-credentials
  mode: plan
  roles:
    - name: preview-user
      login: true

Plan mode is useful when you want the operator to stay in-cluster but you are not ready to trust it with mutations yet.

Current behavior in plan mode:

  • the operator connects to the database and computes the full diff normally
  • no SQL is executed
  • status.change_summary records the pending changes
  • status.planned_sql stores the rendered SQL, truncated if needed for status size safety
  • Ready=True with reason Planned
  • Drifted=True when changes are pending, Drifted=False when the database is already in sync

Use suspend when you want the controller to stop reconciling entirely. Use plan when you want it to keep inspecting and showing you what it would do.
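For the plan-mode policy above, the resulting status might look like this (values are illustrative; field names are from this page, and the SQL shown is a rough rendering, not the operator's exact output):

```yaml
status:
  conditions:
    - type: Ready
      status: "True"
      reason: Planned
    - type: Drifted
      status: "True"
  change_summary:
    roles_created: 1
    total: 1
  planned_sql: |
    -- illustrative rendering only
    CREATE ROLE "preview-user" WITH LOGIN;
```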

Health and telemetry

The operator exposes health probes on its internal HTTP port:

  • /livez
  • /readyz

The Helm chart configures these probes automatically. Metrics are exported via OpenTelemetry OTLP when standard OTel endpoint environment variables are set, for example:

operator:
  env:
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: http://otel-collector.observability.svc.cluster.local:4317
    - name: OTEL_METRICS_EXPORTER
      value: otlp

The intended deployment model is operator -> OpenTelemetry Collector -> your metrics backend.
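A minimal Collector pipeline for that model receives OTLP from the operator and exposes the metrics for Prometheus scraping (a sketch; swap receivers/exporters for your backend):

```yaml
# OpenTelemetry Collector config sketch
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```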

The operator also emits transition-based Kubernetes Events such as:

  • ConflictDetected
  • ConflictResolved
  • Suspended
  • Reconciled
  • Recovered
  • SecretFetchFailed
  • DriftDetected
  • PlanClean
  • DatabaseConnectionFailed
  • InsufficientPrivileges
  • UnsafeRoleDropsBlocked

Deletion behaviour

When a PostgresPolicy resource is deleted, the operator does not revoke grants or drop roles. The database is left as-is. This is intentional — resource deletion means "stop managing", not "undo everything".

Status

The operator reports status on the custom resource:

status:
  conditions:
    - type: Ready
      status: "True"
      reason: Reconciled
      message: "Applied 5 changes"
      last_transition_time: "2026-03-06T10:30:00Z"
  observed_generation: 3
  last_reconcile_time: "2026-03-06T10:30:00Z"
  transient_failure_count: 0
  change_summary:
    roles_created: 2
    roles_altered: 0
    roles_dropped: 0
    grants_added: 3
    grants_revoked: 0
    default_privileges_set: 2
    default_privileges_revoked: 0
    members_added: 1
    members_removed: 0
    total: 8

An insufficient-privilege failure looks more like:

status:
  conditions:
    - type: Ready
      status: "False"
      reason: InsufficientPrivileges
      message: "error returned from database: permission denied to create role"
    - type: Degraded
      status: "True"
      reason: InsufficientPrivileges
  last_error: "error returned from database: permission denied to create role"
  transient_failure_count: 0

Conditions

  • Ready: True when the last reconciliation succeeded
  • Drifted: True when plan mode found pending changes
  • Reconciling: True while a reconciliation is in progress
  • Degraded: True when the last reconciliation failed (includes error detail)

On failure, the operator chooses a retry path based on the failure mode:

  • lock contention: short jittered retry
  • transient operational failures: exponential backoff with jitter
  • invalid specs, conflicts, and unsafe role-drop blockers: normal reconcile interval

RBAC

The operator requires a ClusterRole with these permissions:

  • postgrespolicies: get, list, watch, patch, update
  • postgrespolicies/status: get, patch, update
  • postgrespolicies/finalizers: update
  • secrets: get, list, watch
  • events: create, patch

The Helm chart creates the ClusterRole, ClusterRoleBinding, and ServiceAccount automatically.
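If you manage RBAC yourself, the permissions above translate to a ClusterRole along these lines (API group inferred from the CRD's apiVersion on this page; a sketch, not the chart's rendered output):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pgroles-operator
rules:
  - apiGroups: ["pgroles.io"]
    resources: ["postgrespolicies"]
    verbs: ["get", "list", "watch", "patch", "update"]
  - apiGroups: ["pgroles.io"]
    resources: ["postgrespolicies/status"]
    verbs: ["get", "patch", "update"]
  - apiGroups: ["pgroles.io"]
    resources: ["postgrespolicies/finalizers"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]
```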
