Astrelo doesn’t train its own ML models. It uses Groq — an inference provider that runs open-source LLMs (Llama 3.1 8B and Llama 3.3 70B) at high speed and low cost. Every AI-powered feature in Astrelo — from chat responses to industry classification to alert content generation — routes through a single Groq client.

Two Models, Two Jobs

Astrelo uses two Llama models for different tasks:

| Model | Size | Use Case | Speed | Cost |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B | 8 billion params | Intent classification, NAICS mapping, quick analysis | ~500 tokens/sec | Very cheap |
| Llama 3.3 70B | 70 billion params | Cosmo responses, Goldilocks recommendations, complex reasoning | ~200 tokens/sec | Moderate |

The 8B model handles high-volume, low-complexity tasks. The 70B model handles tasks that need nuanced reasoning. This split keeps costs low while maintaining quality where it matters.
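
As a rough sketch of how that routing could look in code (the task categories and helper are illustrative, not Astrelo's actual routing logic; the model IDs follow Groq's public naming):

// Illustrative only: a minimal router that picks a model by task type.
// The task categories are assumptions for this sketch.
type LlmTask = 'intent-classification' | 'naics-mapping' | 'chat-response' | 'recommendation';

const FAST_MODEL = 'llama-3.1-8b-instant';      // high-volume, low-complexity work
const SMART_MODEL = 'llama-3.3-70b-versatile';  // nuanced, user-facing reasoning

function pickModel(task: LlmTask): string {
  switch (task) {
    case 'intent-classification':
    case 'naics-mapping':
      return FAST_MODEL;
    default:
      return SMART_MODEL;
  }
}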

The Groq Client

All LLM calls go through a single client class:

// src/infrastructure/providers/groq/client.ts, lines 99-126
export class GroqAPIClient {
  private rateLimiter: RateLimiter;

  constructor(config?: GroqClientConfig) {
    this.apiKey = config?.apiKey || LLM.GROQ.API_KEY;
    this.model = config?.model || LLM.GROQ.MODEL;
    this.rateLimiter = new RateLimiter(
      LLM.GROQ.RATE_LIMIT.REQUESTS_PER_MINUTE,
      LLM.GROQ.RATE_LIMIT.TOKENS_PER_MINUTE
    );
    this.client = axios.create({
      baseURL: this.baseUrl,
      timeout: config?.timeout || LLM.GROQ.TIMEOUT,
      headers: {
        'Authorization': `Bearer ${this.apiKey}`,
        'Content-Type': 'application/json',
      },
    });
  }
}
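
A caller constructs the client once and reuses it. A minimal instantiation might look like this (the config values shown are illustrative):

// Illustrative construction; in practice the defaults come from the LLM config constants.
const groqClient = new GroqAPIClient({
  model: 'llama-3.3-70b-versatile', // Groq's ID for the 70B model
  timeout: 30_000,                  // ms; assumed value for this sketch
});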

Rate Limiting

Groq has rate limits (requests per minute and tokens per minute). The client enforces these locally before making API calls:

await this.rateLimiter.waitForSlot();

waitForSlot() either returns immediately (if under the limit) or pauses until a slot is available. This prevents 429 errors from Groq and ensures fair usage across concurrent requests.
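
For intuition, a requests-per-minute limiter with this behavior could be sketched as follows (the real RateLimiter also budgets tokens per minute; the class below and its internals are assumptions):

// Simplified sketch of a requests-per-minute limiter. Names and internals are
// illustrative; the production class also tracks tokens per minute.
class SimpleRateLimiter {
  private timestamps: number[] = [];

  constructor(private requestsPerMinute: number) {}

  async waitForSlot(): Promise<void> {
    const windowMs = 60_000;
    const now = Date.now();
    // Drop requests that have aged out of the one-minute window.
    this.timestamps = this.timestamps.filter((t) => now - t < windowMs);

    if (this.timestamps.length >= this.requestsPerMinute) {
      // Sleep until the oldest request leaves the window, then re-check.
      const waitMs = windowMs - (now - this.timestamps[0]);
      await new Promise((resolve) => setTimeout(resolve, waitMs));
      return this.waitForSlot();
    }

    this.timestamps.push(Date.now());
  }
}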

The Completion Method

Every LLM call follows the same pattern:

// src/infrastructure/providers/groq/client.ts, lines 142-238
async createCompletion(
  messages: GroqMessage[],
  options?: GroqCompletionOptions
): Promise<{ content: string; usage: GroqUsageStats }> {
  const model = options?.model || this.model;
  const temperature = options?.temperature ?? LLM.GROQ.DEFAULTS.TEMPERATURE;
  const maxTokens = options?.maxTokens ?? LLM.GROQ.DEFAULTS.MAX_TOKENS;

  // 1. Check Redis cache
  const cacheKey = this.generateCacheKey(messages, model, temperature);
  if (this.enableCaching) {
    const cached = await redisCacheService.get(CACHE_KEYS.completion(cacheKey));
    if (cached) return cached;
  }

  // 2. Wait for rate limit
  await this.rateLimiter.waitForSlot();

  // 3. Build request
  const requestBody = {
    model,
    messages,
    temperature,
    max_tokens: maxTokens,
    stream: false,
  };

  // 4. Add JSON mode if requested
  if (options?.responseFormat === 'json') {
    requestBody.response_format = { type: 'json_object' };
  }

  // 5. Call Groq API
  const response = await this.client.post('/chat/completions', requestBody);
  const content = response.data.choices[0]?.message?.content || '';
  const usage = parseUsage(response.data.usage);

  // 6. Cache result
  if (this.enableCaching) {
    await redisCacheService.set(CACHE_KEYS.completion(cacheKey), { content, usage });
  }

  return { content, usage };
}

Caching LLM Responses

Identical prompts produce identical results (at temperature 0). The client caches responses in Redis:

Cache key: groq:completion:{hash of messages + model + temperature}
TTL: varies by use case (24h for scoring, 7d for topic matching)

This is significant for scoring. When re-scoring 500 companies, many will hit the same industry classification prompt (e.g., “What NAICS code matches ‘Software’?”). Caching turns a $0.50 operation into a $0.001 Redis lookup.
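
A deterministic key like that can be derived by hashing the inputs that affect the completion. This sketch uses Node's built-in crypto module; the helper name and exact hashing scheme are assumptions:

import { createHash } from 'crypto';

// Illustrative sketch: derive a deterministic cache key from everything that
// influences the completion. The hashing scheme Astrelo actually uses may differ.
function completionCacheKey(
  messages: { role: string; content: string }[],
  model: string,
  temperature: number
): string {
  const hash = createHash('sha256')
    .update(JSON.stringify({ messages, model, temperature }))
    .digest('hex');
  return `groq:completion:${hash}`;
}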

JSON Mode: Structured LLM Output

Many features need structured data from the LLM, not free-form text. Groq’s JSON mode guarantees the response is valid JSON:

async createJsonCompletion<T>(
  messages: GroqMessage[],
  options?: GroqCompletionOptions
): Promise<{ data: T; usage: GroqUsageStats }> {
  const { content, usage } = await this.createCompletion(messages, {
    ...options,
    responseFormat: 'json',
  });
  const data = JSON.parse(content) as T;
  return { data, usage };
}

Why JSON mode matters: Without it, the LLM might return text like:

The NAICS code is 541512, which stands for Computer Systems Design.

With JSON mode:

{ "naicsCode": "541512", "title": "Computer Systems Design Services", "confidence": 92 }

The structured response can be directly parsed and used by the scoring engine. No regex extraction, no “I hope the LLM formatted it correctly” — the API guarantees valid JSON.
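
A NAICS classification call might then look like the following sketch (the prompt wording and result interface are illustrative, not the production prompt):

// Illustrative usage of createJsonCompletion<T>; the result shape mirrors the
// JSON example above, and the prompts are placeholders.
interface NaicsResult {
  naicsCode: string;
  title: string;
  confidence: number;
}

async function classifyIndustry(client: GroqAPIClient, industry: string): Promise<NaicsResult> {
  const { data } = await client.createJsonCompletion<NaicsResult>([
    { role: 'system', content: 'Return JSON with fields naicsCode, title, and confidence (0-100).' },
    { role: 'user', content: `What NAICS code matches "${industry}"?` },
  ]);
  return data;
}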

Error Handling: Graceful Degradation

The client handles every Groq error type:

// src/infrastructure/providers/groq/client.ts, lines 368-401
private handleError(error: unknown): never {
  if (axios.isAxiosError(error)) {
    const status = error.response?.status;
    // Error type string from Groq's error payload (definition elided in the excerpt)
    const errorType = String(error.response?.data?.error?.type ?? '');

    if (status === 401)
      throw this.createError('INVALID_API_KEY', 'Invalid Groq API key');
    if (status === 429)
      throw this.createError('RATE_LIMIT_EXCEEDED', 'Rate limit exceeded');
    if (status === 400 && errorType.includes('context_length'))
      throw this.createError('CONTEXT_LENGTH_EXCEEDED', 'Request exceeds context window');
    if (status === 503 || status === 502)
      throw this.createError('SERVICE_UNAVAILABLE', 'Groq service temporarily unavailable');
  }

  throw this.createError('UNKNOWN_ERROR', String(error));
}

Every caller wraps LLM calls in try-catch with fallback behavior:

  • Scoring: Falls back to non-semantic matching (exact string comparison)
  • Chat: Returns a friendly “I’m having trouble right now” message
  • Alerts: Queues AI content generation for retry later

The system never crashes because Groq is down. It degrades gracefully.
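
A caller-side fallback for scoring could be sketched like this (the helper names are hypothetical; classifyIndustry refers to the JSON-mode sketch above):

// Hypothetical caller: semantic matching via the LLM, with an exact-string
// fallback if the Groq call fails for any reason.
async function industriesMatch(client: GroqAPIClient, a: string, b: string): Promise<boolean> {
  try {
    // Semantic comparison: map both industry labels to NAICS codes via the LLM.
    const [codeA, codeB] = await Promise.all([
      classifyIndustry(client, a),
      classifyIndustry(client, b),
    ]);
    return codeA.naicsCode === codeB.naicsCode;
  } catch {
    // Groq down or rate limited: degrade to exact string comparison.
    return a.trim().toLowerCase() === b.trim().toLowerCase();
  }
}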

The Two-Model Pattern in Cosmo

Cosmo’s chat pipeline uses both models in sequence:

User message
  → Llama 3.1 8B: "What tools does this request need?" (classification)
  → Execute tools (database queries, API calls)
  → Llama 3.3 70B: "Generate a helpful response using this data" (generation)

The 8B model is fast and cheap — perfect for the simple classification task. The 70B model is slower but much better at nuanced language generation — perfect for the user-facing response. This architecture minimizes cost (most tokens are processed by the cheap model) while maximizing quality (the visible output uses the best model).
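
Put together, the pipeline could be sketched as follows (the prompts, model IDs, and executeTools placeholder are illustrative, not Cosmo's actual implementation):

// Illustrative two-step pipeline; tool execution is reduced to a placeholder.
declare function executeTools(tools: string[]): Promise<unknown>;

async function answerWithCosmo(client: GroqAPIClient, userMessage: string): Promise<string> {
  // Step 1: cheap, fast classification of which tools the request needs.
  const { data: plan } = await client.createJsonCompletion<{ tools: string[] }>(
    [
      { role: 'system', content: 'List the tools needed for this request as JSON: {"tools": [...]}.' },
      { role: 'user', content: userMessage },
    ],
    { model: 'llama-3.1-8b-instant' }
  );

  // Step 2: run the selected tools (database queries, API calls).
  const toolResults = await executeTools(plan.tools);

  // Step 3: the larger model writes the user-facing answer from the tool output.
  const { content } = await client.createCompletion(
    [
      { role: 'system', content: 'Generate a helpful response using this data.' },
      { role: 'user', content: `${userMessage}\n\nData: ${JSON.stringify(toolResults)}` },
    ],
    { model: 'llama-3.3-70b-versatile' }
  );

  return content;
}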

Key Takeaways

  1. Two models, two jobs. 8B for classification/routing, 70B for generation/reasoning. Cost-optimized architecture.
  2. Redis caching prevents duplicate LLM calls. Same prompt = same result = cache hit.
  3. JSON mode guarantees structured output. No parsing guesswork.
  4. Rate limiting is enforced client-side. The client waits for a slot before calling Groq.
  5. Graceful degradation means Groq outages don’t crash the application. Every feature has a fallback.

Next chapter: we enter Part 4 — Proactive Intelligence. Starting with the event-driven architecture that turns CRM webhooks into alerts.
