# Initial import

`docs/ops/deployment.md` (new file):
# Deployment Plan

## Chosen target

Deploy on one VPS with Docker Compose.

## Why this target

- The system has multiple long-lived components: web, worker, bot, database, and reverse proxy.
- Compose gives predictable service boundaries, easier upgrades, and easier recovery than manually managed host processes.
- It keeps the path open for later separation of web, worker, and bot without reworking the repository layout.

## Expected services

- `migrate`: one-shot schema bootstrap job run before app services start
- `web`: Next.js app serving the site, dashboard, admin UI, and API routes
- `worker`: background job processor
- `bot`: Telegram admin bot runtime
- `postgres`: primary database
- `caddy`: TLS termination and reverse proxy
- optional `minio`: self-hosted object storage for single-server deployments

## Deployment notes

- Run one Compose project on a single server.
- Keep persistent data in named volumes or external storage.
- Keep secrets in server-side environment files or a secret manager.
- Back up PostgreSQL and object storage separately.
- Prefer Telegram long polling in the MVP to avoid an extra public webhook surface for the bot.

## Upgrade strategy

- Build new images.
- Run the one-shot database schema job.
- Restart `web`, `worker`, and `bot` in the same Compose project.
- Roll back by redeploying the previous image set if schema changes are backward compatible.

## Current database bootstrap state

- The current Compose template runs a `migrate` service before `web`, `worker`, and `bot`.
- The job runs `prisma migrate deploy` from the committed migration history.
- The same bootstrap job also ensures the default MVP `SubscriptionPlan` row exists after migrations.
- Schema changes must land with a new committed Prisma migration before deployment.
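
The seed step that follows `prisma migrate deploy` can be sketched as a small idempotent function. `PlanStore` is a hypothetical stand-in for Prisma Client access to the `SubscriptionPlan` model; the plan code and name here are invented for illustration, not the project's actual schema:

```typescript
// Hypothetical post-migration bootstrap: make sure the default MVP plan row
// exists without duplicating it on repeated runs.
interface PlanStore {
  findByCode(code: string): Promise<{ code: string } | null>;
  create(plan: { code: string; name: string }): Promise<void>;
}

async function ensureDefaultPlan(store: PlanStore): Promise<"created" | "exists"> {
  const code = "mvp-default"; // assumed identifier for the default plan
  if (await store.findByCode(code)) return "exists";
  await store.create({ code, name: "MVP Default" });
  return "created";
}
```

Because the step is idempotent, rerunning the `migrate` service on every deploy is safe.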
## Initial operational checklist

- provision VPS
- install Docker and Compose plugin
- provision DNS and TLS
- provision PostgreSQL storage
- provision S3-compatible storage or enable local MinIO
- create `.env`
- deploy Compose stack
- run database migration job
- verify web health, worker job loop, and bot polling

---

`docs/ops/provider-key-pool.md` (new file):

# Provider Key Pool

## Purpose

Route generation traffic through multiple provider API keys while hiding transient failures from end users.

## Key selection

- Only keys in the `active` state are eligible for first-pass routing.
- Requests start from the next active key, selected round-robin.
- A single request must not attempt the same key twice.
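
A minimal sketch of the selection rule, assuming an in-memory key list and a round-robin cursor kept by the caller (all names here are illustrative, not the worker's actual API):

```typescript
type KeyState = "active" | "cooldown" | "out_of_funds" | "manual_review" | "disabled";

interface ProviderKey {
  id: string;
  state: KeyState;
}

// Build the attempt order for one request: active keys only, starting at the
// round-robin cursor, each key appearing at most once.
function attemptOrder(keys: ProviderKey[], cursor: number): string[] {
  const active = keys.filter((k) => k.state === "active");
  if (active.length === 0) return [];
  const start = cursor % active.length;
  // Rotate the active list so the scan begins at the cursor.
  return [...active.slice(start), ...active.slice(0, start)].map((k) => k.id);
}
```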

## Optional proxy behavior

- A key may have one optional proxy attached.
- If a proxy exists, the first attempt uses the proxy.
- If the proxy path fails with a transport error, retry the same key directly.
- Direct fallback does not bypass other business checks.
- Current runtime policy reads the cooldown and manual-review thresholds from the environment:
  - `KEY_COOLDOWN_MINUTES`
  - `KEY_FAILURES_BEFORE_MANUAL_REVIEW`
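
The proxy-then-direct rule for a single key attempt can be sketched as below. `call` abstracts the actual provider request, `TransportError` marks network-level failures, and the returned flags mirror the `GenerationAttempt.usedProxy` / `directFallbackUsed` audit fields the worker records; the function shape itself is an assumption:

```typescript
// Transport-level failure (network/connection), as opposed to a provider
// rejection; only transport failures trigger the direct retry.
class TransportError extends Error {}

interface AttemptAudit {
  usedProxy: boolean;
  directFallbackUsed: boolean;
}

// One provider-key attempt: go through the proxy when one is attached, and on
// a transport error retry the same key directly. Business errors propagate.
async function attemptWithKey<T>(
  hasProxy: boolean,
  call: (viaProxy: boolean) => Promise<T>,
): Promise<{ result: T; audit: AttemptAudit }> {
  const audit: AttemptAudit = { usedProxy: hasProxy, directFallbackUsed: false };
  if (!hasProxy) return { result: await call(false), audit };
  try {
    return { result: await call(true), audit };
  } catch (err) {
    if (!(err instanceof TransportError)) throw err; // not a transport problem
    audit.directFallbackUsed = true;
    return { result: await call(false), audit };
  }
}
```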

## Retry rules

Retry on the next key only for:

- network errors
- connection failures
- timeouts
- provider `5xx`

Do not retry on the next key for:

- validation errors
- unsupported inputs
- policy rejections
- other user-caused provider `4xx`
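
The two lists above reduce to a small predicate. The failure categories here are illustrative labels, not the worker's actual error taxonomy:

```typescript
type Failure =
  | { kind: "network" }
  | { kind: "connection" }
  | { kind: "timeout" }
  | { kind: "provider_status"; status: number };

// True when the request may move on to the next key in the pool.
function retryOnNextKey(failure: Failure): boolean {
  switch (failure.kind) {
    case "network":
    case "connection":
    case "timeout":
      return true; // transport problems are the key's fault, not the user's
    case "provider_status":
      // Provider 5xx rotates; 4xx is treated as user-caused and surfaces as-is.
      return failure.status >= 500 && failure.status <= 599;
  }
}
```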

## States

- `active`
- `cooldown`
- `out_of_funds`
- `manual_review`
- `disabled`

## Transitions

- `active -> cooldown` on retryable failures
- `cooldown -> active` after successful automatic recheck
- `cooldown -> manual_review` after more than 10 consecutive retryable failures across recovery cycles
- `active|cooldown -> out_of_funds` on confirmed insufficient funds
- `out_of_funds -> active` only by manual admin action
- `manual_review -> active` only by manual admin action
- `active -> disabled` by manual admin action
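
The transition list above, written as a pure function. The event names are invented for this sketch; unmatched state/event pairs return `null` to mean "no transition":

```typescript
type KeyState = "active" | "cooldown" | "out_of_funds" | "manual_review" | "disabled";

type KeyEvent =
  | "retryable_failure"   // network/connection/timeout/provider 5xx
  | "recheck_succeeded"   // automatic cooldown recheck passed
  | "failure_threshold"   // > 10 consecutive retryable failures
  | "insufficient_funds"  // confirmed by the provider
  | "admin_reactivate"    // manual admin action
  | "admin_disable";      // manual admin action

function nextState(state: KeyState, event: KeyEvent): KeyState | null {
  switch (event) {
    case "retryable_failure": return state === "active" ? "cooldown" : null;
    case "recheck_succeeded": return state === "cooldown" ? "active" : null;
    case "failure_threshold": return state === "cooldown" ? "manual_review" : null;
    case "insufficient_funds":
      return state === "active" || state === "cooldown" ? "out_of_funds" : null;
    case "admin_reactivate":
      return state === "out_of_funds" || state === "manual_review" ? "active" : null;
    case "admin_disable": return state === "active" ? "disabled" : null;
  }
}
```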

## Current runtime note

- The current worker already applies proxy-first, then direct fallback, within one provider-key attempt.
- It writes `GenerationAttempt.usedProxy` and `GenerationAttempt.directFallbackUsed` for auditability.
- It also runs a background cooldown-recovery sweep and returns keys to `active` after `cooldownUntil` passes.
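
One pass of that recovery sweep might look like the following; the record shape is an assumption mirroring the `cooldownUntil` field mentioned in the note:

```typescript
interface PoolKey {
  id: string;
  state: "active" | "cooldown" | "out_of_funds" | "manual_review" | "disabled";
  cooldownUntil: number | null; // epoch millis, set when the key enters cooldown
}

// Background sweep: any key whose cooldown window has elapsed returns to
// `active`; every other state is left untouched.
function sweepCooldowns(keys: PoolKey[], now: number): PoolKey[] {
  return keys.map((k) =>
    k.state === "cooldown" && k.cooldownUntil !== null && k.cooldownUntil <= now
      ? { ...k, state: "active" as const, cooldownUntil: null }
      : k,
  );
}
```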

## Balance tracking

- The primary source of truth is the provider balance API.
- Balance refresh runs periodically and also after relevant failures.
- Telegram admin output must show per-key balance snapshots and the count of keys in `out_of_funds`.

## Admin expectations

Web admin and Telegram admin must both be able to:

- inspect key state
- inspect the last error category and code
- inspect the balance snapshot and refresh time
- enable or disable a key
- return a key from `manual_review`
- return a key from `out_of_funds`
- add a new key

---

`docs/ops/telegram-pairing.md` (new file):

# Telegram Pairing Flow

## Goal

Allow a new Telegram admin to be approved from the server console without editing the database manually.

## Runtime behavior

### Unpaired user

1. A user opens the Telegram bot.
2. The bot checks whether `telegram_user_id` is present in the allowlist.
3. If not present, the bot creates a pending pairing record with:
   - Telegram user ID
   - Telegram username and display name snapshot
   - pairing code hash
   - expiration timestamp
   - status `pending`
4. The bot replies with a message telling the user to run `nproxy pair <code>` on the server.

Current runtime note:

- The current bot runtime uses Telegram long polling.
- On each message from an unpaired user, the bot rotates any previous pending code and issues a fresh pairing code.
- Pending pairing creation writes an audit-log entry with actor type `system`.

### Pair completion

1. An operator runs `nproxy pair <code>` on the server.
2. The CLI looks up the pending pairing by code.
3. The CLI prints the target Telegram identity and asks for confirmation.
4. On confirmation, the CLI adds the Telegram user to the allowlist.
5. The CLI marks the pending pairing record as `completed`.
6. The CLI writes an admin action log entry.

## Required CLI commands

- `nproxy pair <code>`
- `nproxy pair list`
- `nproxy pair revoke <telegram-user-id>`
- `nproxy pair cleanup`

## Current CLI behavior

- `nproxy pair <code>` prints the Telegram identity and requires explicit confirmation unless `--yes` is provided.
- `nproxy pair list` prints active allowlist entries and pending pairing records.
- `nproxy pair revoke <telegram-user-id>` requires explicit confirmation unless `--yes` is provided.
- `nproxy pair cleanup` marks expired pending pairing records as `expired` and writes an audit-log entry.

## Security rules

- Pairing codes expire.
- Pairing codes are stored hashed, not in plaintext.
- Only the server-side CLI can complete a pairing.
- Telegram bot access is denied until allowlist membership exists.
- Every pairing and revocation action is auditable.
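
The expiry and hashed-storage rules can be sketched with Node's standard `crypto` module. The code length, TTL, and record shape are assumptions for illustration, not the bot's actual implementation:

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

const CODE_TTL_MS = 10 * 60 * 1000; // assumed 10-minute code lifetime

interface PendingPairing {
  codeHash: string;  // only the hash is persisted, never the code itself
  expiresAt: number; // epoch millis
  status: "pending";
}

// Issue a fresh code: return the plaintext once (for the Telegram reply)
// alongside the record to persist.
function issuePairingCode(now: number): { code: string; record: PendingPairing } {
  const code = randomBytes(6).toString("hex");
  const codeHash = createHash("sha256").update(code).digest("hex");
  return { code, record: { codeHash, expiresAt: now + CODE_TTL_MS, status: "pending" } };
}

// Server-side check used by `nproxy pair <code>`: reject expired records and
// compare hashes in constant time.
function verifyPairingCode(record: PendingPairing, code: string, now: number): boolean {
  if (now >= record.expiresAt) return false;
  const given = createHash("sha256").update(code).digest("hex");
  return timingSafeEqual(Buffer.from(given), Buffer.from(record.codeHash));
}
```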