Astrelo doesn't train its own ML models. It uses Groq, an inference provider that runs open-source LLMs (Llama 3.1 8B and Llama 3.3 70B) at high speed and low cost. Every AI-powered feature in Astrelo, from chat responses to industry classification to alert content generation, routes through a single Groq client.
Two Models, Two Jobs
Astrelo uses two Llama models for different tasks:
| Model | Size | Use Case | Speed | Cost |
|---|---|---|---|---|
| Llama 3.1 8B | 8 billion params | Intent classification, NAICS mapping, quick analysis | ~500 tokens/sec | Very cheap |
| Llama 3.3 70B | 70 billion params | Cosmo responses, Goldilocks recommendations, complex reasoning | ~200 tokens/sec | Moderate |
The 8B model handles high-volume, low-complexity tasks. The 70B model handles tasks that need nuanced reasoning. This split keeps costs low while maintaining quality where it matters.
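One way to picture that split is a small task-to-model routing map. The constant names and Groq model IDs below are illustrative assumptions, not excerpts from the repo:

```typescript
// Illustrative model routing: cheap 8B for high-volume classification,
// 70B for user-facing reasoning. Names and IDs are assumptions, not repo code.
const MODEL_FOR_TASK = {
  intentClassification: 'llama-3.1-8b-instant',
  naicsMapping: 'llama-3.1-8b-instant',
  cosmoResponse: 'llama-3.3-70b-versatile',
  goldilocksRecommendation: 'llama-3.3-70b-versatile',
} as const;

type Task = keyof typeof MODEL_FOR_TASK;

function modelForTask(task: Task): string {
  return MODEL_FOR_TASK[task];
}
```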
The Groq Client
All LLM calls go through a single client class:
```typescript
// src/infrastructure/providers/groq/client.ts, lines 99-126
export class GroqAPIClient {
  private rateLimiter: RateLimiter;

  constructor(config?: GroqClientConfig) {
    this.apiKey = config?.apiKey || LLM.GROQ.API_KEY;
    this.model = config?.model || LLM.GROQ.MODEL;
    this.rateLimiter = new RateLimiter(
      LLM.GROQ.RATE_LIMIT.REQUESTS_PER_MINUTE,
      LLM.GROQ.RATE_LIMIT.TOKENS_PER_MINUTE
    );
    this.client = axios.create({
      baseURL: this.baseUrl,
      timeout: config?.timeout || LLM.GROQ.TIMEOUT,
      headers: {
        'Authorization': `Bearer ${this.apiKey}`,
        'Content-Type': 'application/json',
      },
    });
  }
}
```

Rate Limiting
Groq has rate limits (requests per minute and tokens per minute). The client enforces these locally before making API calls:
```typescript
await this.rateLimiter.waitForSlot();
```

waitForSlot() either returns immediately (if under the limit) or pauses until a slot is available. This prevents 429 errors from Groq and ensures fair usage across concurrent requests.
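The RateLimiter implementation isn't shown in the excerpt above. A minimal sketch of what waitForSlot() could look like, assuming a simple sliding-window count of requests per minute (the token budget would work the same way):

```typescript
// Hypothetical sliding-window rate limiter with the same interface the client
// uses. Window size and field names are assumptions, not repo code.
class RateLimiter {
  private requestTimestamps: number[] = [];

  constructor(
    private requestsPerMinute: number,
    private tokensPerMinute: number // tracked the same way; omitted here for brevity
  ) {}

  async waitForSlot(): Promise<void> {
    const windowMs = 60_000;
    const now = Date.now();
    // Drop requests that have aged out of the one-minute window
    this.requestTimestamps = this.requestTimestamps.filter((t) => now - t < windowMs);

    if (this.requestTimestamps.length >= this.requestsPerMinute) {
      // Sleep until the oldest request in the window expires, then re-check
      const waitMs = windowMs - (now - this.requestTimestamps[0]);
      await new Promise((resolve) => setTimeout(resolve, waitMs));
      return this.waitForSlot();
    }

    this.requestTimestamps.push(now);
  }
}
```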
The Completion Method
Every LLM call follows the same pattern:
```typescript
// src/infrastructure/providers/groq/client.ts, lines 142-238
async createCompletion(
  messages: GroqMessage[],
  options?: GroqCompletionOptions
): Promise<{ content: string; usage: GroqUsageStats }> {
  const model = options?.model || this.model;
  const temperature = options?.temperature ?? LLM.GROQ.DEFAULTS.TEMPERATURE;
  const maxTokens = options?.maxTokens ?? LLM.GROQ.DEFAULTS.MAX_TOKENS;

  // 1. Check Redis cache
  const cacheKey = this.generateCacheKey(messages, model, temperature);
  if (this.enableCaching) {
    const cached = await redisCacheService.get(CACHE_KEYS.completion(cacheKey));
    if (cached) return cached;
  }

  // 2. Wait for rate limit
  await this.rateLimiter.waitForSlot();

  // 3. Build request
  const requestBody: Record<string, unknown> = {
    model,
    messages,
    temperature,
    max_tokens: maxTokens,
    stream: false,
  };

  // 4. Add JSON mode if requested
  if (options?.responseFormat === 'json') {
    requestBody.response_format = { type: 'json_object' };
  }

  // 5. Call Groq API
  const response = await this.client.post('/chat/completions', requestBody);
  const content = response.data.choices[0]?.message?.content || '';
  const usage = parseUsage(response.data.usage);

  // 6. Cache result
  if (this.enableCaching) {
    await redisCacheService.set(CACHE_KEYS.completion(cacheKey), { content, usage });
  }

  return { content, usage };
}
```

Caching LLM Responses
Identical prompts produce identical results (at temperature 0). The client caches responses in Redis:
```
Cache key: groq:completion:{hash of messages + model + temperature}
TTL: varies by use case (24h for scoring, 7d for topic matching)
```

This is significant for scoring. When re-scoring 500 companies, many will hit the same industry classification prompt (e.g., "What NAICS code matches 'Software'?"). Caching turns a $0.50 operation into a $0.001 Redis lookup.
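The generateCacheKey() helper isn't shown in the excerpt. A plausible sketch, assuming the key is a SHA-256 hash over the serialized messages plus the parameters that change the output (the exact hashing scheme is an assumption):

```typescript
import { createHash } from 'crypto';

// Hypothetical cache-key helper: hash everything that affects the completion,
// so any variation in prompt, model, or temperature yields a different key.
function generateCacheKey(
  messages: { role: string; content: string }[],
  model: string,
  temperature: number
): string {
  const payload = JSON.stringify({ messages, model, temperature });
  return createHash('sha256').update(payload).digest('hex');
}

// The Redis key then becomes: `groq:completion:${generateCacheKey(...)}`
```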
JSON Mode: Structured LLM Output
Many features need structured data from the LLM, not free-form text. Groq's JSON mode guarantees the response is valid JSON:

```typescript
async createJsonCompletion<T>(
  messages: GroqMessage[],
  options?: GroqCompletionOptions
): Promise<{ data: T; usage: GroqUsageStats }> {
  const { content, usage } = await this.createCompletion(messages, {
    ...options,
    responseFormat: 'json',
  });
  const data = JSON.parse(content) as T;
  return { data, usage };
}
```

Why JSON mode matters: without it, the LLM might return text like:
```
The NAICS code is 541512, which stands for Computer Systems Design.
```

With JSON mode:

```json
{ "naicsCode": "541512", "title": "Computer Systems Design Services", "confidence": 92 }
```

The structured response can be directly parsed and used by the scoring engine. No regex extraction, no "I hope the LLM formatted it correctly": the API guarantees valid JSON.
Error Handling: Graceful Degradation
The client handles every Groq error type:
```typescript
// src/infrastructure/providers/groq/client.ts, lines 368-401
private handleError(error: unknown): never {
  if (axios.isAxiosError(error)) {
    const status = error.response?.status;
    // Groq returns an OpenAI-compatible error body; its type/code identifies context-length errors
    const errorType: string =
      error.response?.data?.error?.type || error.response?.data?.error?.code || '';
    if (status === 401) throw this.createError('INVALID_API_KEY', 'Invalid Groq API key');
    if (status === 429) throw this.createError('RATE_LIMIT_EXCEEDED', 'Rate limit exceeded');
    if (status === 400 && errorType.includes('context_length'))
      throw this.createError('CONTEXT_LENGTH_EXCEEDED', 'Request exceeds context window');
    if (status === 503 || status === 502)
      throw this.createError('SERVICE_UNAVAILABLE', 'Groq service temporarily unavailable');
  }
  throw this.createError('UNKNOWN_ERROR', String(error));
}
```

Every caller wraps LLM calls in try-catch with fallback behavior:
- Scoring: Falls back to non-semantic matching (exact string comparison)
- Chat: Returns a friendly "I'm having trouble right now" message
- Alerts: Queues AI content generation for retry later
The system never crashes just because Groq is down; it degrades gracefully.
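As an illustration of the scoring fallback, a caller could look like this. semanticIndustryMatch is a hypothetical stand-in for the LLM-backed matcher, not a repo identifier:

```typescript
// Hypothetical caller-side fallback: try the LLM-backed match, and on any
// Groq failure fall back to exact string comparison instead of crashing.
declare function semanticIndustryMatch(a: string, b: string): Promise<boolean>;

async function matchIndustry(company: string, target: string): Promise<boolean> {
  try {
    // LLM-backed comparison via the Groq client
    return await semanticIndustryMatch(company, target);
  } catch {
    // Graceful degradation: exact string comparison, no LLM required
    return company.trim().toLowerCase() === target.trim().toLowerCase();
  }
}
```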
The Two-Model Pattern in Cosmo
Cosmo's chat pipeline uses both models in sequence:

```
User message
  ↓ Llama 3.1 8B: "What tools does this request need?" (classification)
  ↓ Execute tools (database queries, API calls)
  ↓ Llama 3.3 70B: "Generate a helpful response using this data" (generation)
```

The 8B model is fast and cheap, perfect for the simple classification task. The 70B model is slower but much better at nuanced language generation, perfect for the user-facing response. This architecture minimizes cost (most tokens are processed by the cheap model) while maximizing quality (the visible output uses the best model).
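A schematic of that two-stage flow in code. The model IDs follow Groq's public naming; the prompt text and the runTools helper are illustrative assumptions layered on the client methods shown earlier:

```typescript
// Illustrative two-stage pipeline: cheap model for routing, large model for
// the user-facing answer.
const FAST_MODEL = 'llama-3.1-8b-instant';
const SMART_MODEL = 'llama-3.3-70b-versatile';

declare function runTools(tools: string[]): Promise<unknown>; // stand-in for the tool executor

async function answerUser(client: GroqAPIClient, userMessage: string): Promise<string> {
  // Stage 1: the 8B model decides which tools the request needs (structured output)
  const { data: plan } = await client.createJsonCompletion<{ tools: string[] }>(
    [
      { role: 'system', content: 'Return JSON: { "tools": [...] } listing the tools this request needs.' },
      { role: 'user', content: userMessage },
    ],
    { model: FAST_MODEL }
  );

  // Stage 2: run the selected tools, then let the 70B model compose the reply
  const toolResults = await runTools(plan.tools);
  const { content } = await client.createCompletion(
    [
      { role: 'system', content: 'Answer the user using the provided data.' },
      { role: 'user', content: `${userMessage}\n\nData: ${JSON.stringify(toolResults)}` },
    ],
    { model: SMART_MODEL }
  );
  return content;
}
```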
Key Takeaways
- Two models, two jobs. 8B for classification/routing, 70B for generation/reasoning. Cost-optimized architecture.
- Redis caching prevents duplicate LLM calls. Same prompt = same result = cache hit.
- JSON mode guarantees structured output. No parsing guesswork.
- Rate limiting is enforced client-side. The client waits for a slot before calling Groq.
- Graceful degradation means Groq outages don't crash the application. Every feature has a fallback.
Next chapter: we enter Part 4, Proactive Intelligence, starting with the event-driven architecture that turns CRM webhooks into alerts.