Part 6: Production
Chapter 24: Error Handling

Production systems fail. APIs return 500s. LLMs time out. Database connections drop. The question isn’t whether failures happen — it’s whether you can detect them, understand them, and recover from them.

The Error Logging Architecture

Astrelo has a single canonical error sink: the logError() function in src/infrastructure/logging/logger.ts. Every error in the system — API routes, cron jobs, background processors, tool executions — flows through this one function.

// src/infrastructure/logging/logger.ts
export function logError(params: ErrorLogParams): void {
  // Derive the message string (assumed derivation; elided in the excerpt)
  const errorMessage = params.error?.message ?? 'Unknown error';

  // 1. Console output (immediate visibility in dev)
  logger.error(`[${params.source || params.endpoint}] ${errorMessage}`, {
    userId: params.userId,
    statusCode: params.statusCode,
  });

  // 2. Truncate stack trace (max 4,000 chars)
  const errorStack = params.error?.stack?.slice(0, MAX_STACK_LENGTH);

  // 3. Sanitize request context (remove secrets, cap at 8,192 bytes)
  const sanitized = sanitizeContext(params.requestContext);
  const contextJson =
    JSON.stringify(sanitized).length > MAX_CONTEXT_BYTES
      ? JSON.stringify({ _truncated: true, _size: JSON.stringify(sanitized).length })
      : JSON.stringify(sanitized);

  // 4. Persist to error_logs table (fire-and-forget)
  pool.query(
    `INSERT INTO error_logs
       (user_id, endpoint, method, status_code, error_message,
        error_stack, error_code, request_context, source)
     VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)`,
    [params.userId, params.endpoint, params.method, params.statusCode,
     errorMessage, errorStack, params.errorCode, contextJson, params.source]
  ).catch((err) => {
    // If we can't even log the error, at least console.warn it
    console.warn('[ErrorLog] Failed to persist error log:', err);
  });
}
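
The parameter shape never appears in the excerpt, but it can be inferred from the call sites in this chapter. A sketch, assuming every field is optional (the actual definition lives in logger.ts and may differ):

// Reconstructed from usage; fields are optional because cron jobs log
// without a user and middleware logs without a resolved route.
interface ErrorLogParams {
  error?: Error;                            // the thrown (or normalized) error
  userId?: string;                          // absent for cron/system errors
  endpoint?: string;                        // e.g. '/api/some-endpoint'
  method?: string;
  statusCode?: number;
  errorCode?: string;                       // e.g. 'RATE_LIMITED'
  source?: string;                          // filterable origin tag
  requestContext?: Record<string, unknown>; // sanitized before persisting
}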

Fire-and-Forget Design

The database insert is fire-and-forget — logError never throws, never awaits, and never blocks the caller. The .catch() on the query ensures that even if the database is down, the error logger itself doesn’t crash the process.

This is important: if logError threw an exception, it would mask the original error. The error-logging system should be the most reliable part of the stack — and the simplest way to be reliable is to never fail.
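
At a call site the contract looks like this; processOrder, chargeCustomer, and the endpoint are hypothetical stand-ins:

// Minimal sketch of the call-site contract. logError returns void, so the
// caller never awaits it and never wraps it in its own try/catch.
async function processOrder(userId: string): Promise<void> {
  try {
    await chargeCustomer(userId); // hypothetical operation that may throw
  } catch (error) {
    logError({
      userId,
      endpoint: '/api/orders', // hypothetical endpoint
      method: 'POST',
      statusCode: 500,
      error: error instanceof Error ? error : new Error(String(error)),
      source: 'orders',
    });
    throw error; // the original failure still propagates to the route's handler
  }
}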

Sensitive Data Sanitization

Before persisting, sanitizeContext() strips secrets from request context:

const SENSITIVE_KEYS = new Set([
  'password', 'password_hash',
  'access_token', 'refresh_token', 'bot_token',
  'api_key', 'apiKey', 'x-api-key',
  'secret', 'authorization',
  'cookie', 'set-cookie',
]);

const SENSITIVE_PATTERN = /secret|token|key|auth|pass|cred|cookie/i;

function sanitizeContext(context: Record<string, unknown>, depth = 0): Record<string, unknown> {
  if (depth > 5) return { _maxDepthReached: true };

  const sanitized: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(context)) {
    if (SENSITIVE_KEYS.has(key.toLowerCase()) || SENSITIVE_PATTERN.test(key)) {
      sanitized[key] = '[REDACTED]';
    } else if (typeof value === 'object' && value !== null) {
      sanitized[key] = sanitizeContext(value as Record<string, unknown>, depth + 1);
    } else {
      sanitized[key] = value;
    }
  }
  return sanitized;
}

Two layers of protection: an exact blocklist (SENSITIVE_KEYS) catches known secrets, and a regex pattern (SENSITIVE_PATTERN) catches novel key names like hubspot_access_token or api_key_hash. Recursion is capped at depth 5 to prevent stack overflow on deeply nested objects.

The redacted value is '[REDACTED]', not omitted entirely. This lets you see that the key existed (helpful for debugging) without exposing the value.
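
To make the behavior concrete, here is a hypothetical input/output pair (all values invented for illustration):

// Hypothetical request context, before and after sanitization
const context = {
  query: { companyId: 'abc-123' },               // untouched: no match
  headers: {
    authorization: 'Bearer eyJhbGciOi...',       // exact blocklist hit
    'x-request-id': 'req_42',                    // untouched
  },
  body: { hubspot_access_token: 'pat-na1-...' }, // regex hit: /token/i
};

sanitizeContext(context);
// => {
//      query: { companyId: 'abc-123' },
//      headers: { authorization: '[REDACTED]', 'x-request-id': 'req_42' },
//      body: { hubspot_access_token: '[REDACTED]' },
//    }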

Size Gates

Two size limits prevent error logs from becoming a storage problem:

  MAX_STACK_LENGTH  = 4,000 characters (stack traces)
  MAX_CONTEXT_BYTES = 8,192 bytes (request context JSON)

If context exceeds 8KB, it’s replaced with { _truncated: true, _size: 12345 }. You know something was there, and how big it was, without storing a 50KB request body.
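
Expressed as constants (values from the text above, names matching the logger excerpt), with one caveat worth knowing: JSON.stringify(...).length counts UTF-16 code units, so the 8 KB gate is exact only for ASCII payloads:

const MAX_STACK_LENGTH = 4_000;  // characters of stack trace retained
const MAX_CONTEXT_BYTES = 8_192; // budget for the serialized request context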

Two-Layer Error Handling in API Routes

Every API route has two layers of error protection:

Layer 1: The Handler’s Try/Catch

async function handler(req: AuthenticatedRequest, res: NextApiResponse) {
  if (req.method !== 'GET') {
    return res.status(405).json({ error: 'Method not allowed' });
  }

  const userId = req.userId!;

  try {
    const result = await someOperation(userId);
    return res.status(200).json(result);
  } catch (error) {
    logError({
      userId,
      endpoint: '/api/some-endpoint',
      method: req.method,
      statusCode: 500,
      error: error instanceof Error ? error : new Error(String(error)),
      source: 'some-endpoint',
    });
    return res.status(500).json({ error: 'Internal server error' });
  }
}

Expected errors (400-level) are handled explicitly: missing parameters → 400, not found → 404, unauthorized → 401. These are normal flow control, not exceptions.

Unexpected errors (500-level) are caught by the try/catch, logged, and returned as a generic “Internal server error”. The user never sees stack traces or database error messages.
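
Concretely, the expected-error returns sit at the top of the try block, and only truly unexpected failures fall through to the catch; a sketch with hypothetical names (companyId, getCompany):

// Expected failure modes return early with specific status codes.
const { companyId } = req.query;
if (!companyId || typeof companyId !== 'string') {
  return res.status(400).json({ error: 'companyId is required' }); // missing parameter
}

const company = await getCompany(userId, companyId); // hypothetical lookup
if (!company) {
  return res.status(404).json({ error: 'Company not found' }); // not found
}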

Layer 2: The Middleware Safety Net

The requireAuth middleware wraps the entire handler:

// src/infrastructure/auth/middleware.ts
export function requireAuth(handler: AuthenticatedHandler) {
  return async (req: AuthenticatedRequest, res: NextApiResponse) => {
    try {
      // ... JWT verification, token parsing ...
      return await handler(req, res);
    } catch (error) {
      logError({
        endpoint: req.url || 'unknown',
        method: req.method || 'unknown',
        statusCode: 500,
        error: error instanceof Error ? error : new Error(String(error)),
        source: 'auth-middleware',
      });
      return res.status(500).json({ error: 'Internal server error' });
    }
  };
}

If the handler’s try/catch somehow doesn’t cover an error (e.g., one thrown outside the try block, or a handler that was written without one), the middleware catches it. Double protection.
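
Both layers also normalize whatever was thrown via error instanceof Error ? error : new Error(String(error)), because JavaScript lets code throw any value, not just Error instances. A small illustration (not from the codebase):

try {
  // Some libraries throw plain objects or strings rather than Error instances.
  throw { code: 'ECONNRESET' };
} catch (error) {
  const normalized = error instanceof Error ? error : new Error(String(error));
  // normalized.message is '[object Object]': crude, but .message and .stack
  // now exist, so logError has real fields to persist.
}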

Cron Job Error Handling

Cron handlers follow a specific pattern because they process multiple items:

// Standard cron error pattern
export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  // Auth check
  if (CRON_SECRET && req.headers.authorization !== `Bearer ${CRON_SECRET}`) {
    return res.status(401).json({ error: 'Unauthorized' });
  }

  const startTime = Date.now();
  let processed = 0, failed = 0, timedOut = false;

  try {
    for (const item of items) {
      // Budget check — stop before hitting Lambda timeout
      if (Date.now() - startTime > REQUEST_DEADLINE_MS) {
        timedOut = true;
        break;
      }
      try {
        await processItem(item);
        processed++;
      } catch (itemError) {
        failed++;
        logError({
          source: 'cron-name',
          error: itemError instanceof Error ? itemError : new Error(String(itemError)),
        });
        // Continue to next item — don't let one failure kill the batch
      }
    }
    return res.status(200).json({
      processed, failed, timedOut, durationMs: Date.now() - startTime,
    });
  } catch (outerError) {
    logError({
      source: 'cron-name',
      error: outerError instanceof Error ? outerError : new Error(String(outerError)),
    });
    return res.status(500).json({ error: 'Internal server error' });
  }
}

Three levels of protection:

  1. Per-item try/catch — a single failing item doesn’t kill the batch
  2. Timeout budget — the cron stops gracefully before hitting infrastructure limits
  3. Outer try/catch — catastrophic errors (database down, query syntax error) are caught

The response always includes { processed, failed, timedOut, durationMs } — operational metrics that monitoring systems can alert on.
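
A hypothetical response from a run that skipped two bad items (the numbers are invented):

{ "processed": 47, "failed": 2, "timedOut": false, "durationMs": 18432 }

A monitor can page when failed climbs, or when timedOut is true on consecutive runs, without parsing logs.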

The Error Logs Table

error_logs (11 columns):

  id               UUID PK
  user_id          UUID FK (nullable — cron jobs have no user)
  endpoint         VARCHAR(500)
  method           VARCHAR(10)
  status_code      INT DEFAULT 500
  error_message    TEXT
  error_stack      TEXT (max 4,000 chars)
  error_code       VARCHAR(100)
  request_context  JSONB (max 8,192 bytes, sanitized)
  source           VARCHAR(100)
  created_at       TIMESTAMPTZ

This table is append-only — errors are never updated or deleted. It serves as an audit trail: “what went wrong, when, for whom, and what was the context?”

The source column is particularly useful for filtering: WHERE source LIKE 'cron-%' shows all cron failures. WHERE source = 'auth-middleware' shows authentication failures. WHERE error_code = 'RATE_LIMITED' shows capacity problems.
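
For instance, a triage query over the schema above, using the same pg pool as the logger (a sketch; the column names come from this chapter):

// Error volume by source and code over the last 24 hours
const { rows } = await pool.query(
  `SELECT source, error_code, COUNT(*) AS occurrences
     FROM error_logs
    WHERE created_at > NOW() - INTERVAL '24 hours'
    GROUP BY source, error_code
    ORDER BY occurrences DESC`
);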

Graceful Degradation

Throughout the codebase, a pattern emerges: when a non-critical operation fails, the system degrades gracefully rather than crashing:

// Example: AI content generation failure
try {
  const aiContent = await generateAIContent(alert);
  await saveAIContent(alertId, aiContent);
} catch (error) {
  // AI content is a nice-to-have — alert still works without it
  console.warn(`[AI Content] Failed for alert ${alertId}:`, error);
  // Don't re-throw — the alert is already persisted and visible
}
// Example: Slack delivery failure
try {
  await deliverToSlack(event, userId, payload);
} catch (error) {
  logError({ source: 'slack-delivery', error });
  // Alert is still in the feed — Slack is supplementary
}

The principle: core functionality should never be blocked by supplementary features. If AI content generation fails, the alert still appears. If Slack delivery fails, the in-app notification still works. If enrichment fails, the company record still exists with whatever data it has.
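
If you wanted to name the pattern, it could be a small wrapper; this is a sketch, not something the codebase necessarily provides:

// Run a supplementary step, log on failure, never propagate to the caller.
async function bestEffort<T>(source: string, fn: () => Promise<T>): Promise<T | undefined> {
  try {
    return await fn();
  } catch (error) {
    logError({
      source,
      error: error instanceof Error ? error : new Error(String(error)),
    });
    return undefined; // the core flow proceeds without the supplementary result
  }
}

// Usage: the alert is already persisted; Slack delivery is best-effort.
await bestEffort('slack-delivery', () => deliverToSlack(event, userId, payload));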

Key Takeaways

  1. Single error sink (logError) ensures consistent logging, sanitization, and persistence across the entire codebase.

  2. Fire-and-forget logging means the error logger never crashes the process it’s trying to protect.

  3. Sensitive data sanitization uses two layers (exact blocklist + regex pattern) to prevent secrets from reaching the database.

  4. Two-layer API error handling (handler try/catch + middleware safety net) guarantees clean error responses.

  5. Cron jobs use three-level protection (per-item, timeout budget, outer catch) to maximize throughput even when individual items fail.

  6. Graceful degradation keeps core features working when supplementary systems fail.

Next chapter: the cron job system — how 18 scheduled jobs keep the platform running in the background.
