Part 6: Production · Ch 25: Cron Jobs

Astrelo runs 18 scheduled jobs that power everything from alert evaluation to pipeline snapshots. These jobs are invisible to the user but essential — without them, alerts wouldn’t fire, predictions wouldn’t update, and the daily digest wouldn’t arrive.

The Infrastructure: AWS EventBridge → API Routes

Cron jobs aren’t traditional Unix cron. They’re AWS EventBridge rules that send HTTP POST requests to Next.js API routes:

```
AWS EventBridge Rule (schedule: rate(1 minute))
    ↓ HTTP POST /api/cron/process-alert-evaluation
      Header: Authorization: Bearer ${CronSecret}
    ↓ Next.js API route handler
    ↓ Database operations, LLM calls, external API calls
```

The deploy/cron-schedules.yml CloudFormation template provisions the full infrastructure:

```yaml
# EventBridge connection with auth header
CronConnection:
  Type: AWS::Events::Connection
  Properties:
    AuthorizationType: API_KEY
    AuthParameters:
      ApiKeyAuthParameters:
        ApiKeyName: Authorization
        ApiKeyValue: !Sub 'Bearer ${CronSecret}'

# API destination with rate limit
CronApiDestination:
  Type: AWS::Events::ApiDestination
  Properties:
    ConnectionArn: !GetAtt CronConnection.Arn
    HttpMethod: POST
    InvocationEndpoint: !Sub '${AppUrl}/api/cron/*'
    InvocationRateLimitPerSecond: 5

# Individual rules (one per cron job)
ProcessAlertEvaluation:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: 'rate(1 minute)'
    Targets:
      - Arn: !GetAtt CronApiDestination.Arn
        HttpParameters:
          PathParameterValues:
            - 'process-alert-evaluation'
```

Why HTTP routes instead of Lambda functions? Because the codebase is a Next.js monolith deployed on AWS Amplify. API routes share the same database pool, same service layer, same utility functions. A separate Lambda would need to duplicate all of that. HTTP-based cron keeps everything in one deployable unit.

The 18 Scheduled Jobs

| Job | Frequency | What It Does |
| --- | --- | --- |
| process-alert-evaluation | Every 1 min | Claims and processes alert_evaluation jobs from the queue |
| process-alert-ai-content | Every 1 min | Generates AI analysis for alerts via 70B LLM |
| process-webhooks | Every 1 min | Processes queued webhook events |
| process-playbooks | Every 5 min | Executes pending playbook steps |
| proactive-triggers | Every 15 min | Evaluates proactive trigger conditions |
| execute-pending-actions | Every 5 min | Auto-executes approved autonomous actions |
| behavioral-triggers | Every 4 hours | Detects activity slumps and behavioral patterns |
| rebuild-rep-profiles | 2:00 AM daily | Recalculates rep performance profiles from deal history |
| cleanup-alert-data | 3:00 AM daily | Expires old alerts, cleans stale predictions |
| discovery-daily | 3:00 AM daily | Discovers new prospect companies via ML |
| rebuild-persona-profiles | 3:30 AM daily | Recalculates persona winning profiles |
| evaluate-segments | 4:00 AM daily | Evaluates exploratory segments for absorption/abandonment |
| recalculate-learned-weights | 4:30 AM daily | Re-learns per-user intent weights from deal outcomes |
| pipeline-snapshot | 5:00 AM daily | Captures daily pipeline state for trending |
| process-deal-predictions | 5:30 AM daily | Generates slip/loss/champion predictions |
| revenue-at-risk | 6:00 AM daily | Calculates revenue at risk |
| cfo-briefing | 7:30 AM daily | Generates executive briefing |
| tasks-daily-digest | 8:00 AM daily | Sends daily task digest to Slack |

The daily jobs run in a specific order: rebuild profiles → clean up → discover → snapshot → predict → alert. Each depends on the previous step’s output: you rebuild profiles before predicting, predict before alerting, alert before digesting.

Authentication Pattern

Every cron endpoint verifies the secret:

```typescript
import type { NextApiRequest, NextApiResponse } from 'next';

const CRON_SECRET = process.env.CRON_SECRET;

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  if (req.method !== 'POST') {
    return res.status(405).json({ error: 'Method not allowed' });
  }

  const authHeader = req.headers.authorization;
  if (CRON_SECRET && authHeader !== `Bearer ${CRON_SECRET}`) {
    return res.status(401).json({ error: 'Unauthorized' });
  }

  // ... process jobs
}
```

The CRON_SECRET && check means: if the secret isn’t configured (local dev), skip auth. In production, the secret is always set.
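That conditional can be factored into a small pure helper, which makes the dev-vs-prod behavior easy to unit-test (a sketch; `isCronAuthorized` is a hypothetical name, not from the codebase):

```typescript
// Returns true when the request may proceed.
// If no secret is configured (local dev), auth is skipped entirely;
// in production the header must match exactly.
function isCronAuthorized(
  authHeader: string | undefined,
  cronSecret: string | undefined
): boolean {
  if (!cronSecret) return true; // local dev: no secret configured
  return authHeader === `Bearer ${cronSecret}`;
}

isCronAuthorized(undefined, undefined);      // true  (dev, no secret set)
isCronAuthorized('Bearer s3cret', 's3cret'); // true  (prod, match)
isCronAuthorized('Bearer wrong', 's3cret');  // false (prod, mismatch)
```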

Job Claiming: SKIP LOCKED

The critical pattern for concurrent-safe job processing:

```sql
UPDATE jobs
SET status = 'processing', started_at = NOW()
WHERE id IN (
  SELECT id FROM jobs
  WHERE job_type = 'alert_evaluation'
    AND status = 'pending'
    AND created_at > NOW() - INTERVAL '1 hour'
  ORDER BY created_at ASC
  FOR UPDATE SKIP LOCKED
  LIMIT 10
)
RETURNING *
```

Let’s break this apart:

  1. FOR UPDATE — Locks selected rows so no other transaction can claim them
  2. SKIP LOCKED — If a row is already locked by another transaction, skip it instead of waiting
  3. LIMIT 10 — Claim a batch of 10 jobs at a time

This is an atomic claim-and-update in a single query. If two cron invocations overlap (EventBridge fires before the previous run completes), they safely claim non-overlapping batches. No race conditions, no double-processing.
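The non-overlap guarantee can be sketched with a toy in-memory model of the claim (purely illustrative; in reality the atomicity comes from Postgres row locks, not application code):

```typescript
interface Job {
  id: number;
  status: 'pending' | 'processing';
  locked: boolean;
}

// Toy model of the claim: take up to `limit` pending, unlocked rows,
// locking each one as it is taken (FOR UPDATE) and skipping rows another
// worker already holds (SKIP LOCKED).
function claimBatch(jobs: Job[], limit: number): number[] {
  const claimed: number[] = [];
  for (const job of jobs) {
    if (claimed.length >= limit) break;
    if (job.status !== 'pending' || job.locked) continue; // SKIP LOCKED
    job.locked = true;                                    // FOR UPDATE
    job.status = 'processing';                            // SET status = 'processing'
    claimed.push(job.id);
  }
  return claimed;
}

const queue: Job[] = Array.from({ length: 25 }, (_, i) => ({
  id: i,
  status: 'pending',
  locked: false,
}));

// Two overlapping cron invocations claim disjoint batches:
const workerA = claimBatch(queue, 10); // claims ids 0-9
const workerB = claimBatch(queue, 10); // claims ids 10-19, no overlap
```

Five jobs remain pending for the next invocation, exactly as in the real queue.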

The 1-Hour Staleness Guard

```sql
AND created_at > NOW() - INTERVAL '1 hour'
```

Jobs older than 1 hour are ignored. This prevents the system from processing stale events that no longer reflect the current state of the CRM. If a webhook event is an hour old, the deal may have changed again β€” processing the old event could produce a misleading alert.
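The same guard, expressed as a standalone predicate (a sketch; in the real handlers the filter lives in the SQL itself):

```typescript
const STALENESS_WINDOW_MS = 60 * 60 * 1000; // mirrors INTERVAL '1 hour'

// A job is worth processing only while its triggering event is still fresh.
function isFresh(createdAt: Date, now: Date = new Date()): boolean {
  return now.getTime() - createdAt.getTime() < STALENESS_WINDOW_MS;
}

isFresh(new Date(Date.now() - 30 * 60 * 1000));     // true:  30 minutes old
isFresh(new Date(Date.now() - 2 * 60 * 60 * 1000)); // false: 2 hours old
```

Rows that fail the check are simply never claimed.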

Retry Logic

When a job fails, it’s retried up to 3 times:

```typescript
const attempts = (job.attempts || 0) + 1;
const maxAttempts = job.max_attempts || 3;

await pool.query(
  `UPDATE jobs
   SET status = CASE WHEN $2 >= $3 THEN 'failed' ELSE 'pending' END,
       attempts = $2,
       error = $4,
       completed_at = CASE WHEN $2 >= $3 THEN NOW() ELSE NULL END
   WHERE id = $1`,
  [job.id, attempts, maxAttempts, errorMessage]
);
```

On failure: if attempts < max, status goes back to 'pending' (will be retried next cron cycle). If attempts >= max, status becomes 'failed' (permanently). The error column stores the last error message for debugging.
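The CASE expressions map onto a small pure function (a sketch; `resolveRetry` is a hypothetical helper name):

```typescript
interface RetryResult {
  status: 'pending' | 'failed';
  attempts: number;
  finished: boolean; // whether completed_at should be set
}

// Mirrors the CASE logic in the UPDATE: retry until max attempts,
// then mark the job permanently failed.
function resolveRetry(priorAttempts: number, maxAttempts = 3): RetryResult {
  const attempts = priorAttempts + 1;
  const exhausted = attempts >= maxAttempts;
  return {
    status: exhausted ? 'failed' : 'pending',
    attempts,
    finished: exhausted,
  };
}

resolveRetry(0); // { status: 'pending', attempts: 1, finished: false }
resolveRetry(2); // { status: 'failed',  attempts: 3, finished: true }
```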

Timeout Budgets: The Three-Tier Pattern

AWS Amplify serverless functions have a 30-second timeout. Cron handlers use a three-tier budget to stay within this limit:

```typescript
const REQUEST_DEADLINE_MS = 25_000; // Total cron budget: 25 seconds
const USER_TIMEOUT_MS = 20_000;     // Per-user timeout: 20 seconds
const PER_JOB_TIMEOUT_MS = 10_000;  // Per-job timeout: 10 seconds
```

Tier 1 — Request deadline: Check before each loop iteration:

```typescript
for (const user of users) {
  if (Date.now() - startTime > REQUEST_DEADLINE_MS) {
    timedOut = true;
    break; // Stop gracefully — remaining users will be processed next invocation
  }
  // ... process user
}
```

Tier 2 — Per-user timeout: Promise.race ensures no single user can hang the entire run:

```typescript
await Promise.race([
  processUserPredictions(userId),
  new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error(`User ${userId} timeout`)), USER_TIMEOUT_MS)
  ),
]);
```

Promise.race returns whichever promise settles first. If processUserPredictions takes 25 seconds, the timeout promise rejects after 20 seconds, and we move to the next user.
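The race can be wrapped in a reusable helper that also clears the losing timer, so a fast user doesn't leave a dangling timeout behind (a sketch; `withTimeout` is a hypothetical utility, not necessarily what the codebase uses):

```typescript
// Resolves with the result of `work`, or rejects with `label` after `ms`.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(label)), ms);
  });
  // Clear the timer either way so it can't fire (or keep the process alive) later.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Usage matching the pattern above:
// await withTimeout(processUserPredictions(userId), USER_TIMEOUT_MS, `User ${userId} timeout`);
```

Note that Promise.race does not cancel the loser: a slow `processUserPredictions` keeps running in the background; the cron simply stops waiting for it and moves on.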

Tier 3 — Per-job timeout: Inside the per-user processing, individual jobs also have timeouts (10 seconds for alert evaluation, 8 seconds for AI content generation).

Why Three Tiers?

Consider a scenario with 50 users. Without the request deadline, the cron might process 10 users and then get killed at 30 seconds by AWS — losing any partially-processed work. With the deadline, it processes as many users as it can in 25 seconds, responds with { timedOut: true }, and the remaining users are processed in the next invocation (1 minute later).

Without the per-user timeout, a single user with 500 stalled deals could consume the entire 25-second budget. The per-user cap ensures fair distribution across users.

Multi-User Processing Loop

Jobs that process all users follow a consistent pattern:

```typescript
// Fetch all active users
const usersRes = await pool.query(
  `SELECT DISTINCT user_id FROM [relevant_table] WHERE [relevant_conditions]`
);

let processed = 0, failed = 0, timedOut = false;

for (const { user_id: userId } of usersRes.rows) {
  if (Date.now() - startTime > REQUEST_DEADLINE_MS) {
    timedOut = true;
    break;
  }
  try {
    const count = await Promise.race([
      processUser(userId),
      timeoutPromise(USER_TIMEOUT_MS),
    ]);
    processed += count;
  } catch (error) {
    failed++;
    logError({ source: 'cron-name', userId, error });
    // Continue — don't let one user's failure block others
  }
}

return res.status(200).json({
  processed,
  failed,
  timedOut,
  usersProcessed: usersRes.rows.length - (timedOut ? remainingUsers : 0),
  durationMs: Date.now() - startTime,
});
```

The response always includes operational metrics. A monitoring system can alert when failed > 0 or timedOut: true becomes chronic.
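A check over that response shape might look like this (illustrative only; the `CronRunMetrics` shape, `needsAttention` name, and thresholds are assumptions, not from the codebase):

```typescript
interface CronRunMetrics {
  processed: number;
  failed: number;
  timedOut: boolean;
  durationMs: number;
}

// One timeout is normal backpressure; any failure, or timeouts on every
// recent run, is worth alerting on. `recentRuns` is ordered newest-first.
function needsAttention(recentRuns: CronRunMetrics[]): boolean {
  if (recentRuns.some(r => r.failed > 0)) return true;
  // "Chronic" timeouts: the last three runs all hit the deadline.
  const lastThree = recentRuns.slice(0, 3);
  return lastThree.length === 3 && lastThree.every(r => r.timedOut);
}
```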

The Daily Job Sequence

Daily jobs are staggered across the early morning hours to avoid resource contention:

```
2:00 AM — rebuild-rep-profiles        (recalculate from deal history)
3:00 AM — cleanup-alert-data          (expire old alerts, clean stale data)
3:00 AM — discovery-daily             (ML prospect discovery)
3:30 AM — rebuild-persona-profiles    (persona winning profiles)
4:00 AM — evaluate-segments           (exploratory segment lifecycle)
4:30 AM — recalculate-learned-weights (intent weight learning)
5:00 AM — pipeline-snapshot           (capture daily state)
5:30 AM — process-deal-predictions    (generate predictions)
6:00 AM — revenue-at-risk             (risk calculations)
7:30 AM — cfo-briefing                (executive summary)
8:00 AM — tasks-daily-digest          (Slack digest)
9:00 AM — scheduled-reports           (generated reports)
```

The staggered schedule (mostly 30-minute gaps) gives each job time to complete before the next one starts. Rebuild → clean → discover → learn → snapshot → predict → brief → digest. Data flows downstream through the day.

Key Takeaways

  1. 18 scheduled jobs range from every-minute processing (alert evaluation) to daily batches (pipeline snapshots).

  2. EventBridge → HTTP POST keeps cron handlers in the same codebase as the rest of the application — shared code, shared database pool.

  3. SKIP LOCKED enables concurrent-safe job claiming without race conditions.

  4. Three-tier timeout budgets (request → user → job) ensure graceful degradation under infrastructure limits.

  5. Retry with max attempts gives transient failures a chance to recover while permanently marking persistent failures.

  6. Daily jobs are staggered with 30-minute gaps to avoid resource contention and establish data flow dependencies.

Next chapter: rate limiting and quotas — how the system protects itself from abuse.
