Part 6: Production · Ch 25: Cron Jobs

Astrelo runs 18 scheduled jobs that power everything from alert evaluation to pipeline snapshots. These jobs are invisible to the user but essential — without them, alerts wouldn’t fire, predictions wouldn’t update, and the daily digest wouldn’t arrive.

The Infrastructure: AWS EventBridge → API Routes

Cron jobs aren’t traditional Unix cron. They’re AWS EventBridge rules that send HTTP POST requests to Next.js API routes:

```
AWS EventBridge Rule (schedule: rate(1 minute))
    ↓ HTTP POST /api/cron/process-alert-evaluation
      Header: Authorization: Bearer ${CronSecret}
    ↓ Next.js API route handler
    ↓ Database operations, LLM calls, external API calls
```

The deploy/cron-schedules.yml CloudFormation template provisions the full infrastructure:

```yaml
# EventBridge connection with auth header
CronConnection:
  Type: AWS::Events::Connection
  Properties:
    AuthorizationType: API_KEY
    AuthParameters:
      ApiKeyAuthParameters:
        ApiKeyName: Authorization
        ApiKeyValue: !Sub 'Bearer ${CronSecret}'

# API destination with rate limit
CronApiDestination:
  Type: AWS::Events::ApiDestination
  Properties:
    ConnectionArn: !GetAtt CronConnection.Arn
    HttpMethod: POST
    InvocationEndpoint: !Sub '${AppUrl}/api/cron/*'
    InvocationRateLimitPerSecond: 5

# Individual rules (one per cron job)
ProcessAlertEvaluation:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: 'rate(1 minute)'
    Targets:
      - Arn: !GetAtt CronApiDestination.Arn
        HttpParameters:
          PathParameterValues:
            - 'process-alert-evaluation'
```

Why HTTP routes instead of Lambda functions? Because the codebase is a Next.js monolith deployed on AWS Amplify. API routes share the same database pool, same service layer, same utility functions. A separate Lambda would need to duplicate all of that. HTTP-based cron keeps everything in one deployable unit.

The 18 Scheduled Jobs

| Job | Frequency | What It Does |
| --- | --- | --- |
| process-alert-evaluation | Every 1 min | Claims and processes alert_evaluation jobs from the queue |
| process-alert-ai-content | Every 1 min | Generates AI analysis for alerts via 70B LLM |
| process-webhooks | Every 1 min | Processes queued webhook events |
| process-playbooks | Every 5 min | Executes pending playbook steps |
| proactive-triggers | Every 15 min | Evaluates proactive trigger conditions |
| execute-pending-actions | Every 5 min | Auto-executes approved autonomous actions |
| behavioral-triggers | Every 4 hours | Detects activity slumps and behavioral patterns |
| rebuild-rep-profiles | 2:00 AM daily | Recalculates rep performance profiles from deal history |
| cleanup-alert-data | 3:00 AM daily | Expires old alerts, cleans stale predictions |
| discovery-daily | 3:00 AM daily | Discovers new prospect companies via ML |
| rebuild-persona-profiles | 3:30 AM daily | Recalculates persona winning profiles |
| evaluate-segments | 4:00 AM daily | Evaluates exploratory segments for absorption/abandonment |
| recalculate-learned-weights | 4:30 AM daily | Re-learns per-user intent weights from deal outcomes |
| pipeline-snapshot | 5:00 AM daily | Captures daily pipeline state for trending |
| process-deal-predictions | 5:30 AM daily | Generates slip/loss/champion predictions |
| revenue-at-risk | 6:00 AM daily | Calculates revenue at risk |
| cfo-briefing | 7:30 AM daily | Generates executive briefing |
| tasks-daily-digest | 8:00 AM daily | Sends daily task digest to Slack |

The daily jobs run in a specific order: rebuild profiles → clean up → discover → snapshot → predict → alert. Each depends on the previous step’s output: you rebuild profiles before predicting, predict before alerting, alert before digesting.

Authentication Pattern

Every cron endpoint verifies the secret:

```typescript
import type { NextApiRequest, NextApiResponse } from 'next';

const CRON_SECRET = process.env.CRON_SECRET;

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  if (req.method !== 'POST') {
    return res.status(405).json({ error: 'Method not allowed' });
  }

  const authHeader = req.headers.authorization;
  if (CRON_SECRET && authHeader !== `Bearer ${CRON_SECRET}`) {
    return res.status(401).json({ error: 'Unauthorized' });
  }

  // ... process jobs
}
```

The CRON_SECRET && check means: if the secret isn’t configured (local dev), skip auth. In production, the secret is always set.
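That conditional can be factored into a small pure helper, which makes the dev-vs-prod behavior easy to unit-test (a sketch; `isCronAuthorized` is a hypothetical name, not from the codebase):

```typescript
// Returns true when the request may proceed.
// If no secret is configured (local dev), auth is skipped entirely;
// in production the header must match exactly.
function isCronAuthorized(
  authHeader: string | undefined,
  cronSecret: string | undefined
): boolean {
  if (!cronSecret) return true; // local dev: no secret configured
  return authHeader === `Bearer ${cronSecret}`;
}

isCronAuthorized(undefined, undefined);      // true  (dev, no secret set)
isCronAuthorized('Bearer s3cret', 's3cret'); // true  (prod, match)
isCronAuthorized('Bearer wrong', 's3cret');  // false (prod, mismatch)
```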

Job Claiming: SKIP LOCKED

The critical pattern for concurrent-safe job processing:

```sql
UPDATE jobs
SET status = 'processing', started_at = NOW()
WHERE id IN (
  SELECT id FROM jobs
  WHERE job_type = 'alert_evaluation'
    AND status = 'pending'
    AND created_at > NOW() - INTERVAL '1 hour'
  ORDER BY created_at ASC
  FOR UPDATE SKIP LOCKED
  LIMIT 10
)
RETURNING *
```

Let’s break this apart:

  1. FOR UPDATE — Locks selected rows so no other transaction can claim them
  2. SKIP LOCKED — If a row is already locked by another transaction, skip it instead of waiting
  3. LIMIT 10 — Claim a batch of 10 jobs at a time

This is an atomic claim-and-update in a single query. If two cron invocations overlap (EventBridge fires before the previous run completes), they safely claim non-overlapping batches. No race conditions, no double-processing.
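The non-overlap guarantee can be sketched with a toy in-memory model of the claim (purely illustrative; in reality the atomicity comes from Postgres row locks, not application code):

```typescript
interface Job {
  id: number;
  status: 'pending' | 'processing';
  locked: boolean;
}

// Toy model of the claim: take up to `limit` pending, unlocked rows,
// locking each one as it is taken (FOR UPDATE) and skipping rows another
// worker already holds (SKIP LOCKED).
function claimBatch(jobs: Job[], limit: number): number[] {
  const claimed: number[] = [];
  for (const job of jobs) {
    if (claimed.length >= limit) break;
    if (job.status !== 'pending' || job.locked) continue; // SKIP LOCKED
    job.locked = true;                                    // FOR UPDATE
    job.status = 'processing';                            // SET status = 'processing'
    claimed.push(job.id);
  }
  return claimed;
}

const queue: Job[] = Array.from({ length: 25 }, (_, i) => ({
  id: i,
  status: 'pending',
  locked: false,
}));

// Two overlapping cron invocations claim disjoint batches:
const workerA = claimBatch(queue, 10); // claims ids 0-9
const workerB = claimBatch(queue, 10); // claims ids 10-19, no overlap
```

Five jobs remain pending for the next invocation, exactly as in the real queue.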

The 1-Hour Staleness Guard

```sql
AND created_at > NOW() - INTERVAL '1 hour'
```

Jobs older than 1 hour are ignored. This prevents the system from processing stale events that no longer reflect the current state of the CRM. If a webhook event is an hour old, the deal may have changed again β€” processing the old event could produce a misleading alert.
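The same guard, expressed as a standalone predicate (a sketch; in the real handlers the filter lives in the SQL itself):

```typescript
const STALENESS_WINDOW_MS = 60 * 60 * 1000; // mirrors INTERVAL '1 hour'

// A job is worth processing only while its triggering event is still fresh.
function isFresh(createdAt: Date, now: Date = new Date()): boolean {
  return now.getTime() - createdAt.getTime() < STALENESS_WINDOW_MS;
}

isFresh(new Date(Date.now() - 30 * 60 * 1000));     // true:  30 minutes old
isFresh(new Date(Date.now() - 2 * 60 * 60 * 1000)); // false: 2 hours old
```

Rows that fail the check are simply never claimed.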

Retry Logic

When a job fails, it’s retried up to 3 times:

```typescript
const attempts = (job.attempts || 0) + 1;
const maxAttempts = job.max_attempts || 3;

await pool.query(
  `UPDATE jobs
   SET status = CASE WHEN $2 >= $3 THEN 'failed' ELSE 'pending' END,
       attempts = $2,
       error = $4,
       completed_at = CASE WHEN $2 >= $3 THEN NOW() ELSE NULL END
   WHERE id = $1`,
  [job.id, attempts, maxAttempts, errorMessage]
);
```

On failure: if attempts < max, status goes back to 'pending' (will be retried next cron cycle). If attempts >= max, status becomes 'failed' (permanently). The error column stores the last error message for debugging.
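The CASE expressions map onto a small pure function (a sketch; `resolveRetry` is a hypothetical helper name):

```typescript
interface RetryResult {
  status: 'pending' | 'failed';
  attempts: number;
  finished: boolean; // whether completed_at should be set
}

// Mirrors the CASE logic in the UPDATE: retry until max attempts,
// then mark the job permanently failed.
function resolveRetry(priorAttempts: number, maxAttempts = 3): RetryResult {
  const attempts = priorAttempts + 1;
  const exhausted = attempts >= maxAttempts;
  return {
    status: exhausted ? 'failed' : 'pending',
    attempts,
    finished: exhausted,
  };
}

resolveRetry(0); // { status: 'pending', attempts: 1, finished: false }
resolveRetry(2); // { status: 'failed',  attempts: 3, finished: true }
```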

Timeout Budgets: The Three-Tier Pattern

AWS Amplify serverless functions have a 30-second timeout. Cron handlers use a three-tier budget to stay within this limit:

```typescript
const REQUEST_DEADLINE_MS = 25_000; // Total cron budget: 25 seconds
const USER_TIMEOUT_MS = 20_000;     // Per-user timeout: 20 seconds
const PER_JOB_TIMEOUT_MS = 10_000;  // Per-job timeout: 10 seconds
```

Tier 1 — Request deadline: Check before each loop iteration:

```typescript
for (const user of users) {
  if (Date.now() - startTime > REQUEST_DEADLINE_MS) {
    timedOut = true;
    break; // Stop gracefully — remaining users will be processed next invocation
  }
  // ... process user
}
```

Tier 2 — Per-user timeout: Promise.race ensures no single user can hang the entire run:

```typescript
await Promise.race([
  processUserPredictions(userId),
  new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`User ${userId} timeout`)), USER_TIMEOUT_MS)
  ),
]);
```

Promise.race returns whichever promise settles first. If processUserPredictions takes 25 seconds, the timeout promise rejects after 20 seconds, and we move to the next user.
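The race can be wrapped in a reusable helper that also clears the losing timer, so a fast user doesn't leave a dangling timeout behind (a sketch; `withTimeout` is a hypothetical utility, not necessarily what the codebase uses):

```typescript
// Resolves with the result of `work`, or rejects with `label` after `ms`.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(label)), ms);
  });
  // Clear the timer either way so it can't fire (or keep the process alive) later.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Usage matching the pattern above:
// await withTimeout(processUserPredictions(userId), USER_TIMEOUT_MS, `User ${userId} timeout`);
```

Note that Promise.race does not cancel the loser: a slow `processUserPredictions` keeps running in the background; the cron simply stops waiting for it and moves on.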

Tier 3 — Per-job timeout: Inside the per-user processing, individual jobs also have timeouts (10 seconds for alert evaluation, 8 seconds for AI content generation).

Why Three Tiers?

Consider a scenario with 50 users. Without the request deadline, the cron might process 10 users and then get killed at 30 seconds by AWS — losing any partially-processed work. With the deadline, it processes as many users as it can in 25 seconds, responds with { timedOut: true }, and the remaining users are processed in the next invocation (1 minute later).

Without the per-user timeout, a single user with 500 stalled deals could consume the entire 25-second budget. The per-user cap ensures fair distribution across users.

Multi-User Processing Loop

Jobs that process all users follow a consistent pattern:

```typescript
// Fetch all active users
const usersRes = await pool.query(
  `SELECT DISTINCT user_id FROM [relevant_table] WHERE [relevant_conditions]`
);

let processed = 0, failed = 0, timedOut = false;

for (const { user_id: userId } of usersRes.rows) {
  if (Date.now() - startTime > REQUEST_DEADLINE_MS) {
    timedOut = true;
    break;
  }
  try {
    const count = await Promise.race([
      processUser(userId),
      timeoutPromise(USER_TIMEOUT_MS),
    ]);
    processed += count;
  } catch (error) {
    failed++;
    logError({ source: 'cron-name', userId, error });
    // Continue — don't let one user's failure block others
  }
}

return res.status(200).json({
  processed,
  failed,
  timedOut,
  usersProcessed: usersRes.rows.length - (timedOut ? remainingUsers : 0),
  durationMs: Date.now() - startTime,
});
```

The response always includes operational metrics. A monitoring system can alert when failed > 0 or timedOut: true becomes chronic.
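A check over that response shape might look like this (illustrative only; the `CronRunMetrics` shape, `needsAttention` name, and thresholds are assumptions, not from the codebase):

```typescript
interface CronRunMetrics {
  processed: number;
  failed: number;
  timedOut: boolean;
  durationMs: number;
}

// One timeout is normal backpressure; any failure, or timeouts on every
// recent run, is worth alerting on. `recentRuns` is ordered newest-first.
function needsAttention(recentRuns: CronRunMetrics[]): boolean {
  if (recentRuns.some(r => r.failed > 0)) return true;
  // "Chronic" timeouts: the last three runs all hit the deadline.
  const lastThree = recentRuns.slice(0, 3);
  return lastThree.length === 3 && lastThree.every(r => r.timedOut);
}
```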

The Daily Job Sequence

Daily jobs are staggered across the early morning hours to avoid resource contention:

```
2:00 AM — rebuild-rep-profiles        (recalculate from deal history)
3:00 AM — cleanup-alert-data          (expire old alerts, clean stale data)
3:00 AM — discovery-daily             (ML prospect discovery)
3:30 AM — rebuild-persona-profiles    (persona winning profiles)
4:00 AM — evaluate-segments           (exploratory segment lifecycle)
4:30 AM — recalculate-learned-weights (intent weight learning)
5:00 AM — pipeline-snapshot           (capture daily state)
5:30 AM — process-deal-predictions    (generate predictions)
6:00 AM — revenue-at-risk             (risk calculations)
7:30 AM — cfo-briefing                (executive summary)
8:00 AM — tasks-daily-digest          (Slack digest)
9:00 AM — scheduled-reports           (generated reports)
```

The staggered schedule (mostly 30-minute gaps) gives each job time to complete before the next one starts. Rebuild → clean → discover → learn → snapshot → predict → brief → digest. Data flows downstream through the day.

Key Takeaways

  1. 18 scheduled jobs range from every-minute processing (alert evaluation) to daily batches (pipeline snapshots).

  2. EventBridge → HTTP POST keeps cron handlers in the same codebase as the rest of the application — shared code, shared database pool.

  3. SKIP LOCKED enables concurrent-safe job claiming without race conditions.

  4. Three-tier timeout budgets (request → user → job) ensure graceful degradation under infrastructure limits.

  5. Retry with max attempts gives transient failures a chance to recover while permanently marking persistent failures.

  6. Daily jobs are staggered with 30-minute gaps to avoid resource contention and establish data flow dependencies.

Next chapter: rate limiting and quotas — how the system protects itself from abuse.
