Astrelo doesn’t train its own ML models. It uses Groq — an inference provider that runs open-source LLMs (Llama 3.1 8B and Llama 3.3 70B) at high speed and low cost. Every AI-powered feature in Astrelo — from chat responses to industry classification to alert content generation — routes through a single Groq client.

Two Models, Two Jobs

Astrelo uses two Llama models for different tasks:

| Model | Size | Use Case | Speed | Cost |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B | 8 billion params | Intent classification, NAICS mapping, quick analysis | ~500 tokens/sec | Very cheap |
| Llama 3.3 70B | 70 billion params | Cosmo responses, Goldilocks recommendations, complex reasoning | ~200 tokens/sec | Moderate |

The 8B model handles high-volume, low-complexity tasks. The 70B model handles tasks that need nuanced reasoning. This split keeps costs low while maintaining quality where it matters.
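
As a rough sketch of how that routing could look in code (the task categories and helper are illustrative, not Astrelo's actual routing logic; the model IDs follow Groq's public naming):

// Illustrative only: a minimal router that picks a model by task type.
// The task categories are assumptions for this sketch.
type LlmTask = 'intent-classification' | 'naics-mapping' | 'chat-response' | 'recommendation';

const FAST_MODEL = 'llama-3.1-8b-instant';      // high-volume, low-complexity work
const SMART_MODEL = 'llama-3.3-70b-versatile';  // nuanced, user-facing reasoning

function pickModel(task: LlmTask): string {
  switch (task) {
    case 'intent-classification':
    case 'naics-mapping':
      return FAST_MODEL;
    default:
      return SMART_MODEL;
  }
}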

The Groq Client

All LLM calls go through a single client class:

// src/infrastructure/providers/groq/client.ts, lines 99-126
export class GroqAPIClient {
  private rateLimiter: RateLimiter;

  constructor(config?: GroqClientConfig) {
    this.apiKey = config?.apiKey || LLM.GROQ.API_KEY;
    this.model = config?.model || LLM.GROQ.MODEL;
    this.rateLimiter = new RateLimiter(
      LLM.GROQ.RATE_LIMIT.REQUESTS_PER_MINUTE,
      LLM.GROQ.RATE_LIMIT.TOKENS_PER_MINUTE
    );
    this.client = axios.create({
      baseURL: this.baseUrl,
      timeout: config?.timeout || LLM.GROQ.TIMEOUT,
      headers: {
        'Authorization': `Bearer ${this.apiKey}`,
        'Content-Type': 'application/json',
      },
    });
  }
}
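
A caller constructs the client once and reuses it. A minimal instantiation might look like this (the config values shown are illustrative):

// Illustrative construction; in practice the defaults come from the LLM config constants.
const groqClient = new GroqAPIClient({
  model: 'llama-3.3-70b-versatile', // Groq's ID for the 70B model
  timeout: 30_000,                  // ms; assumed value for this sketch
});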

Rate Limiting

Groq has rate limits (requests per minute and tokens per minute). The client enforces these locally before making API calls:

await this.rateLimiter.waitForSlot();

waitForSlot() either returns immediately (if under the limit) or pauses until a slot is available. This prevents 429 errors from Groq and ensures fair usage across concurrent requests.
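
For intuition, a requests-per-minute limiter with this behavior could be sketched as follows (the real RateLimiter also budgets tokens per minute; the class below and its internals are assumptions):

// Simplified sketch of a requests-per-minute limiter. Names and internals are
// illustrative; the production class also tracks tokens per minute.
class SimpleRateLimiter {
  private timestamps: number[] = [];

  constructor(private requestsPerMinute: number) {}

  async waitForSlot(): Promise<void> {
    const windowMs = 60_000;
    const now = Date.now();
    // Drop requests that have aged out of the one-minute window.
    this.timestamps = this.timestamps.filter((t) => now - t < windowMs);

    if (this.timestamps.length >= this.requestsPerMinute) {
      // Sleep until the oldest request leaves the window, then re-check.
      const waitMs = windowMs - (now - this.timestamps[0]);
      await new Promise((resolve) => setTimeout(resolve, waitMs));
      return this.waitForSlot();
    }

    this.timestamps.push(Date.now());
  }
}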

The Completion Method

Every LLM call follows the same pattern:

// src/infrastructure/providers/groq/client.ts, lines 142-238
async createCompletion(
  messages: GroqMessage[],
  options?: GroqCompletionOptions
): Promise<{ content: string; usage: GroqUsageStats }> {
  const model = options?.model || this.model;
  const temperature = options?.temperature ?? LLM.GROQ.DEFAULTS.TEMPERATURE;
  const maxTokens = options?.maxTokens ?? LLM.GROQ.DEFAULTS.MAX_TOKENS;

  // 1. Check Redis cache
  const cacheKey = this.generateCacheKey(messages, model, temperature);
  if (this.enableCaching) {
    const cached = await redisCacheService.get(CACHE_KEYS.completion(cacheKey));
    if (cached) return cached;
  }

  // 2. Wait for rate limit
  await this.rateLimiter.waitForSlot();

  // 3. Build request
  const requestBody = {
    model,
    messages,
    temperature,
    max_tokens: maxTokens,
    stream: false,
  };

  // 4. Add JSON mode if requested
  if (options?.responseFormat === 'json') {
    requestBody.response_format = { type: 'json_object' };
  }

  // 5. Call Groq API
  const response = await this.client.post('/chat/completions', requestBody);
  const content = response.data.choices[0]?.message?.content || '';
  const usage = parseUsage(response.data.usage);

  // 6. Cache result
  if (this.enableCaching) {
    await redisCacheService.set(CACHE_KEYS.completion(cacheKey), { content, usage });
  }

  return { content, usage };
}

Caching LLM Responses

Identical prompts produce identical results (at temperature 0). The client caches responses in Redis:

Cache key: groq:completion:{hash of messages + model + temperature}
TTL: varies by use case (24h for scoring, 7d for topic matching)

This is significant for scoring. When re-scoring 500 companies, many will hit the same industry classification prompt (e.g., “What NAICS code matches ‘Software’?”). Caching turns a $0.50 operation into a $0.001 Redis lookup.
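
A deterministic key like that can be derived by hashing the inputs that affect the completion. This sketch uses Node's built-in crypto module; the helper name and exact hashing scheme are assumptions:

import { createHash } from 'crypto';

// Illustrative sketch: derive a deterministic cache key from everything that
// influences the completion. The hashing scheme Astrelo actually uses may differ.
function completionCacheKey(
  messages: { role: string; content: string }[],
  model: string,
  temperature: number
): string {
  const hash = createHash('sha256')
    .update(JSON.stringify({ messages, model, temperature }))
    .digest('hex');
  return `groq:completion:${hash}`;
}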

JSON Mode: Structured LLM Output

Many features need structured data from the LLM, not free-form text. Groq’s JSON mode guarantees the response is valid JSON:

async createJsonCompletion<T>(
  messages: GroqMessage[],
  options?: GroqCompletionOptions
): Promise<{ data: T; usage: GroqUsageStats }> {
  const { content, usage } = await this.createCompletion(messages, {
    ...options,
    responseFormat: 'json',
  });
  const data = JSON.parse(content) as T;
  return { data, usage };
}

Why JSON mode matters: Without it, the LLM might return text like:

The NAICS code is 541512, which stands for Computer Systems Design.

With JSON mode:

{ "naicsCode": "541512", "title": "Computer Systems Design Services", "confidence": 92 }

The structured response can be directly parsed and used by the scoring engine. No regex extraction, no “I hope the LLM formatted it correctly” — the API guarantees valid JSON.
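
A NAICS classification call might then look like the following sketch (the prompt wording and result interface are illustrative, not the production prompt):

// Illustrative usage of createJsonCompletion<T>; the result shape mirrors the
// JSON example above, and the prompts are placeholders.
interface NaicsResult {
  naicsCode: string;
  title: string;
  confidence: number;
}

async function classifyIndustry(client: GroqAPIClient, industry: string): Promise<NaicsResult> {
  const { data } = await client.createJsonCompletion<NaicsResult>([
    { role: 'system', content: 'Return JSON with fields naicsCode, title, and confidence (0-100).' },
    { role: 'user', content: `What NAICS code matches "${industry}"?` },
  ]);
  return data;
}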

Error Handling: Graceful Degradation

The client handles every Groq error type:

// src/infrastructure/providers/groq/client.ts, lines 368-401
private handleError(error: unknown): never {
  if (axios.isAxiosError(error)) {
    const status = error.response?.status;
    // Error type string from Groq's error payload (definition elided in the excerpt)
    const errorType = String(error.response?.data?.error?.type ?? '');

    if (status === 401)
      throw this.createError('INVALID_API_KEY', 'Invalid Groq API key');
    if (status === 429)
      throw this.createError('RATE_LIMIT_EXCEEDED', 'Rate limit exceeded');
    if (status === 400 && errorType.includes('context_length'))
      throw this.createError('CONTEXT_LENGTH_EXCEEDED', 'Request exceeds context window');
    if (status === 503 || status === 502)
      throw this.createError('SERVICE_UNAVAILABLE', 'Groq service temporarily unavailable');
  }

  throw this.createError('UNKNOWN_ERROR', String(error));
}

Every caller wraps LLM calls in try-catch with fallback behavior:

  • Scoring: Falls back to non-semantic matching (exact string comparison)
  • Chat: Returns a friendly “I’m having trouble right now” message
  • Alerts: Queues AI content generation for retry later

The system never crashes because Groq is down. It degrades gracefully.
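
A caller-side fallback for scoring could be sketched like this (the helper names are hypothetical; classifyIndustry refers to the JSON-mode sketch above):

// Hypothetical caller: semantic matching via the LLM, with an exact-string
// fallback if the Groq call fails for any reason.
async function industriesMatch(client: GroqAPIClient, a: string, b: string): Promise<boolean> {
  try {
    // Semantic comparison: map both industry labels to NAICS codes via the LLM.
    const [codeA, codeB] = await Promise.all([
      classifyIndustry(client, a),
      classifyIndustry(client, b),
    ]);
    return codeA.naicsCode === codeB.naicsCode;
  } catch {
    // Groq down or rate limited: degrade to exact string comparison.
    return a.trim().toLowerCase() === b.trim().toLowerCase();
  }
}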

The Two-Model Pattern in Cosmo

Cosmo’s chat pipeline uses both models in sequence:

User message
  → Llama 3.1 8B: "What tools does this request need?" (classification)
  → Execute tools (database queries, API calls)
  → Llama 3.3 70B: "Generate a helpful response using this data" (generation)

The 8B model is fast and cheap — perfect for the simple classification task. The 70B model is slower but much better at nuanced language generation — perfect for the user-facing response. This architecture minimizes cost (most tokens are processed by the cheap model) while maximizing quality (the visible output uses the best model).
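
Put together, the pipeline could be sketched as follows (the prompts, model IDs, and executeTools placeholder are illustrative, not Cosmo's actual implementation):

// Illustrative two-step pipeline; tool execution is reduced to a placeholder.
declare function executeTools(tools: string[]): Promise<unknown>;

async function answerWithCosmo(client: GroqAPIClient, userMessage: string): Promise<string> {
  // Step 1: cheap, fast classification of which tools the request needs.
  const { data: plan } = await client.createJsonCompletion<{ tools: string[] }>(
    [
      { role: 'system', content: 'List the tools needed for this request as JSON: {"tools": [...]}.' },
      { role: 'user', content: userMessage },
    ],
    { model: 'llama-3.1-8b-instant' }
  );

  // Step 2: run the selected tools (database queries, API calls).
  const toolResults = await executeTools(plan.tools);

  // Step 3: the larger model writes the user-facing answer from the tool output.
  const { content } = await client.createCompletion(
    [
      { role: 'system', content: 'Generate a helpful response using this data.' },
      { role: 'user', content: `${userMessage}\n\nData: ${JSON.stringify(toolResults)}` },
    ],
    { model: 'llama-3.3-70b-versatile' }
  );

  return content;
}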

Key Takeaways

  1. Two models, two jobs. 8B for classification/routing, 70B for generation/reasoning. Cost-optimized architecture.
  2. Redis caching prevents duplicate LLM calls. Same prompt = same result = cache hit.
  3. JSON mode guarantees structured output. No parsing guesswork.
  4. Rate limiting is enforced client-side. The client waits for a slot before calling Groq.
  5. Graceful degradation means Groq outages don’t crash the application. Every feature has a fallback.

Next chapter: we enter Part 4 — Proactive Intelligence. Starting with the event-driven architecture that turns CRM webhooks into alerts.
