The scoring engine ranks companies already in your CRM. But what about companies you haven't found yet? The Discovery Engine searches for new prospects that match your winning profile, and it deliberately explores outside that profile to prevent tunnel vision.
The Explore/Exploit Dilemma
This is a fundamental problem in machine learning and decision theory: should you exploit what you know works, or explore something new?
Exploit only: You find more companies exactly like your past winners. Safe and predictable, but you'll never discover new markets. If your winning profile says "100-200 person SaaS companies in North America," you'll never find the 500-person European fintech that's actually a great fit.
Explore only: You cast a wide net, testing every possible company type. You'll discover new segments, but waste enormous time on bad-fit prospects. Most explorations fail.
The solution: an 80/20 split. 80% of discovered companies match your winning profile (exploit); 20% are deliberate experiments outside that profile (explore). This is inspired by Google's "20% time" and the multi-armed bandit problem in statistics.
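In code, that split can be sketched as a simple slot allocator. This is a minimal, hypothetical sketch (the type and function names are illustrative, not Astrelo's actual API):

```typescript
// Illustrative 80/20 allocator: decide how many discovery slots go to
// exploiting the known profile vs. exploring outside it.
type Strategy = "exploit" | "explore";

// Allocate n discovery slots, roughly 80% exploit / 20% explore, with at
// least one explore slot so experimentation never drops to zero.
function allocateSlots(n: number, exploreRatio = 0.2): Strategy[] {
  const explore = Math.max(1, Math.round(n * exploreRatio));
  const exploit = n - explore;
  return [
    ...Array<Strategy>(exploit).fill("exploit"),
    ...Array<Strategy>(explore).fill("explore"),
  ];
}
```

For a batch of 10 discovered companies this yields 8 exploit slots and 2 explore slots, matching the split described above.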
How Discovery Works
```typescript
// src/domain/scoring/services/discovery/mlProspectDiscoveryService.ts (simplified)
async discoverProspectsForUser(options: DiscoverProspectsOptions): Promise<DiscoveryResult> {
  // 1. Get/build winning profile from closed deals
  const winningProfile = await mlFitScoreService.getOrBuildWinningProfile(options.userId);
  if (!winningProfile || winningProfile.confidence === 'insufficient') {
    return {
      success: false,
      error: 'Insufficient deal data. Need at least 5 closed deals for discovery.',
    };
  }

  // 2. Get existing domains to exclude (don't discover companies already in CRM)
  const existingDomains = await getExistingDomains(options.userId);

  // 3. Build LLM prompt from winning profile
  const promptContext = await this.buildPromptContextFromPatternAnalyzer(
    options.userId, winningProfile, config, existingDomains, options
  );

  // 4. Run discovery strategies in parallel
  // ... 7 strategies, each using Groq LLM
}
```

The Safety Check: Minimum 5 Closed Deals
Discovery requires at least 5 closed deals. Why? The winning profile is built from statistical analysis (weighted means, standard deviations, embedding centroids). With fewer than 5 data points, those statistics are unreliable: your "winning profile" might just be random noise. Five deals is the minimum for a meaningful pattern.
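A minimal sketch of the kind of statistic the profile depends on, plus the n ≥ 5 guard. This is a hypothetical helper for illustration; the real service computes weighted variants and embedding centroids as well:

```typescript
// Illustrative sample statistics with the minimum-data guard.
// Below 5 data points, the sample standard deviation is too noisy to trust.
function sampleStats(values: number[]): { mean: number; stdDev: number } {
  if (values.length < 5) {
    throw new Error("Insufficient deal data. Need at least 5 closed deals.");
  }
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  // Sample variance (n - 1 denominator), then standard deviation.
  const variance =
    values.reduce((a, v) => a + (v - mean) ** 2, 0) / (values.length - 1);
  return { mean, stdDev: Math.sqrt(variance) };
}
```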
Domain Deduplication
Before running discovery, we load every domain already in the user's CRM. Discovered companies are filtered against this set to prevent duplicates. If "acme.com" is already in your pipeline, the LLM won't suggest it again.
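The dedup filter itself is a set-membership check. A sketch with illustrative types (the real service's shapes differ):

```typescript
// Illustrative discovered-company shape.
interface DiscoveredCompany {
  name: string;
  domain: string;
}

// Drop any candidate whose domain is already in the CRM.
function excludeExisting(
  candidates: DiscoveredCompany[],
  existingDomains: Set<string>,
): DiscoveredCompany[] {
  // Normalize case so "Acme.com" and "acme.com" count as the same domain.
  return candidates.filter(
    (c) => !existingDomains.has(c.domain.toLowerCase()),
  );
}
```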
The Seven Discovery Strategies
Discovery runs seven strategies, each targeting a different dimension:
Exploit Strategies (80%)
1. Profile Matches: companies that closely match your winning profile across all dimensions (industry, size, geography).
2. Goldilocks Matches: companies that would score Composite ≥ 70 based on available signals. These are the "high probability" prospects.
Explore Strategies (20%)
3. Smaller Companies: companies at 50-75% of your typical deal size. Tests whether your solution works at a lower price point.
4. Larger Companies: companies at 150-300% of your typical deal size. Tests whether you can sell upmarket.
5. Adjacent Industries: companies in industries related to but not identical to your winners. If you win in "Computer Systems Design," this explores "Data Processing" and "IT Consulting."
6. New Geographies: companies in regions where you haven't closed deals. If all your wins are in North America, this explores European or APAC companies.
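The two size-based explore bands above can be sketched as pure functions of the typical winning company size. Here I assume size is expressed as an employee count; the function names are illustrative, not Astrelo's:

```typescript
// Illustrative size bands for the "smaller" and "larger" explore strategies.
interface Band {
  min: number;
  max: number;
}

// 50-75% of the typical size: tests a lower price point.
function smallerBand(typicalEmployees: number): Band {
  return {
    min: Math.round(typicalEmployees * 0.5),
    max: Math.round(typicalEmployees * 0.75),
  };
}

// 150-300% of the typical size: tests selling upmarket.
function largerBand(typicalEmployees: number): Band {
  return {
    min: Math.round(typicalEmployees * 1.5),
    max: Math.round(typicalEmployees * 3.0),
  };
}
```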
Each explore strategy has a minimum of 2 companies, enough to test the segment without over-investing:
```typescript
const MIN_PER_CATEGORY = 2;  // Minimum 2 per exploration category
const EXPLORE_OVERFETCH = 4; // Request 4, expect 2 after filtering
```

We request 4 and keep 2 because the LLM sometimes suggests duplicates or companies that fail domain verification.
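The over-fetch-then-filter step might look like this sketch. The constants mirror the text; `keepVerified` and its `isVerified` callback are hypothetical stand-ins for the real dedup and domain-verification logic:

```typescript
const MIN_PER_CATEGORY = 2;  // Minimum 2 per exploration category
const EXPLORE_OVERFETCH = 4; // Request 4, expect 2 after filtering

// Walk the over-fetched suggestions in order, dropping duplicates and
// unverifiable domains, and stop once the category minimum is met.
function keepVerified<T extends { domain: string }>(
  suggestions: T[],
  isVerified: (domain: string) => boolean,
): T[] {
  const seen = new Set<string>();
  const kept: T[] = [];
  for (const s of suggestions) {
    const d = s.domain.toLowerCase();
    if (seen.has(d) || !isVerified(d)) continue; // dupe or dead domain
    seen.add(d);
    kept.push(s);
    if (kept.length === MIN_PER_CATEGORY) break; // 2 is enough per category
  }
  return kept;
}
```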
The LLM Prompt
The discovery engine uses Groq (Llama 3.1/3.3) to generate prospect lists. The prompt is built from your winning profile:
```
You are a B2B sales intelligence expert. Find companies matching this profile:

WINNING PROFILE:
- Top industries: Computer Systems Design (45% of wins), Software Publishing (30%)
- Employee range: 50-300 (sweet spot: 150)
- Revenue range: $5M-$50M (sweet spot: $20M)
- Top regions: California (35%), New York (20%), Texas (15%)
- Average deal value: $85,000
- Win rate: 42%

EXCLUDE these domains (already in CRM):
acme.com, bigcorp.io, techstart.com, ...

Return 10 companies matching this profile. For each, provide:
- Company name, domain, industry, employee count, revenue range
- Why they match the winning profile
- A confidence score (0-100)

Respond in JSON format.
```

The LLM returns structured JSON (using Groq's JSON mode), which is parsed, validated, and scored before being stored.
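Parsing and validating that JSON could look like this sketch. The response shape and field names are assumptions based on the prompt above, not Groq's or Astrelo's exact schema:

```typescript
// Illustrative shape of one LLM suggestion.
interface ProspectSuggestion {
  name: string;
  domain: string;
  industry: string;
  confidence: number; // 0-100, per the prompt
}

// Parse the raw JSON string and keep only well-formed suggestions.
// Assumes the model returns either a bare array or { companies: [...] }.
function parseSuggestions(raw: string): ProspectSuggestion[] {
  const data = JSON.parse(raw);
  const items = Array.isArray(data) ? data : data.companies ?? [];
  return items.filter(
    (x: any) =>
      typeof x?.name === "string" &&
      typeof x?.domain === "string" &&
      typeof x?.confidence === "number" &&
      x.confidence >= 0 &&
      x.confidence <= 100,
  );
}
```

Malformed entries (missing fields, out-of-range confidence) are silently dropped rather than failing the whole batch, which matches the over-fetch strategy described earlier.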
Tracking Exploration Outcomes
The explore/exploit split isn't static. It adapts based on results:
```
-- exploratory_segments table tracks each exploration
exploratory_segments (20 cols)
├── segment_type        -- e.g., 'industry', 'employee_size', 'geography'
├── segment_value       -- e.g., 'Fintech', '>500 employees', 'Europe'
├── times_shown INT     -- How many times this segment was presented
├── times_acted_on INT  -- How many times the user engaged
├── deals_created INT   -- How many deals came from this segment
├── deals_won INT       -- How many were won
├── deals_lost INT      -- How many were lost
├── total_deal_value NUMERIC
├── status              -- 'exploring', 'absorbed', 'abandoned'
└── confidence_score NUMERIC
```

Each exploration segment accumulates data over time:
- Exploring: the system is still testing this segment
- Absorbed: the segment proved successful and is now part of the winning profile (the ICP expands)
- Abandoned: after enough data, the segment clearly doesn't work (too many losses or no engagement)
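The status decision can be sketched as a pure function over the segment's counters. The thresholds here are invented for illustration; the chapter doesn't specify the real ones:

```typescript
type SegmentStatus = "exploring" | "absorbed" | "abandoned";

// Counters mirroring the exploratory_segments columns above.
interface SegmentStats {
  timesShown: number;
  timesActedOn: number;
  dealsWon: number;
  dealsLost: number;
}

// Hypothetical decision rule: absorb on enough wins at a healthy rate,
// abandon on clear failure, otherwise keep exploring.
function nextStatus(s: SegmentStats): SegmentStatus {
  const closed = s.dealsWon + s.dealsLost;
  if (s.dealsWon >= 3 && s.dealsWon / Math.max(closed, 1) >= 0.4) {
    return "absorbed"; // proven: fold this segment into the ICP
  }
  if (s.timesShown >= 20 && s.timesActedOn === 0) {
    return "abandoned"; // plenty of exposure, zero engagement
  }
  if (closed >= 5 && s.dealsWon === 0) {
    return "abandoned"; // tested and clearly not working
  }
  return "exploring"; // not enough evidence either way
}
```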
Bell Curve Shifts
When an exploratory segment is "absorbed," it shifts the winning profile:
```
-- bell_curve_shifts table records how the ICP changed
bell_curve_shifts (12 cols)
├── shift_type           -- e.g., 'expansion', 'contraction'
├── dimension            -- e.g., 'employee_size', 'industry'
├── previous_range JSONB -- { min: 50, max: 300 }
├── new_range JSONB      -- { min: 50, max: 500 } (expanded!)
├── trigger_reason       -- e.g., "3 deals won in 300-500 employee segment"
└── supporting_deals INT -- How many deals support this shift
```

If you start exploring larger companies (300-500 employees) and win 3 deals there, the system proposes expanding your ICP's employee range from 50-300 to 50-500. The "bell curve" of your winning profile literally shifts to accommodate the new data.
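The expansion proposal can be sketched as a range union gated on supporting deals. This is a hypothetical helper; the 3-deal threshold follows the example above:

```typescript
interface Range {
  min: number;
  max: number;
}

// Propose expanding the ICP range to cover a won exploratory segment,
// but only once enough closed-won deals back the shift.
function proposeExpansion(
  current: Range,
  wonSegment: Range,
  supportingDeals: number,
  minSupport = 3,
): { previousRange: Range; newRange: Range } | null {
  if (supportingDeals < minSupport) return null; // not enough evidence yet
  return {
    previousRange: current,
    newRange: {
      min: Math.min(current.min, wonSegment.min),
      max: Math.max(current.max, wonSegment.max),
    },
  };
}
```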
This is the learning loop in action:
- Winning profile → Discovery finds companies
- Some are explored (outside the profile)
- User converts some explores to deals
- Deals close → Winning profile updates
- Updated profile → Better discovery
The Discovery Pipeline
Discovery results go through a pipeline before reaching the user:
```
LLM generates companies
  ↓ Domain verification (does the domain actually exist?)
  ↓ Dedup against existing CRM data
  ↓ Score each company (quick fit estimate)
  ↓ Filter by minimum confidence (>50)
  ↓ Store in discovery_results table
  ↓ Present in "Ready to Engage" queue
```

The discovery_results table tracks the lifecycle:
```
discovery_results (37 cols)
├── status DEFAULT 'new'  -- new → viewed → saved/dismissed/converted
├── viewed_at             -- When the user first saw it
├── saved_at              -- When saved for later
├── dismissed_at          -- When rejected
├── dismiss_reason TEXT   -- "Bad fit", "Already contacted", etc.
├── converted_at          -- When added to the CRM pipeline
├── ml_fit_score NUMERIC  -- Quick fit estimate at discovery time
├── discovery_source      -- 'groq_llm', 'web_search', etc.
├── explored_segment      -- Which explore category, if any
├── explore_category      -- 'smaller', 'larger', 'adjacent_industry', etc.
└── deviation_reason TEXT -- Why this deviates from the profile
```

The Ready to Engage Queue
Discovered companies surface in the Command Center as a prioritized queue:
```
┌───────────────────────────────────────────┐
│  Ready to Engage                          │
│                                           │
│  ★ CloudTech Solutions   Fit: 87  New     │
│    Computer Systems Design, 200 emp       │
│    "Strong profile match: industry,       │
│     size, and geography align"            │
│    [Save] [Dismiss] [Add to Pipeline]     │
│                                           │
│  ○ FinanceAI Corp        Fit: 72  Explore │
│    Financial Software, 450 emp            │
│    "Testing larger company segment"       │
│    [Save] [Dismiss] [Add to Pipeline]     │
│                                           │
└───────────────────────────────────────────┘
```

Profile matches (★) are shown first. Explore picks (○) are labeled so the user knows they're experimental. The dismiss and save actions feed back into the segment tracking system.
Key Takeaways
- 80/20 explore/exploit balances known winners with deliberate experimentation. Without exploration, your pipeline ossifies. Without exploitation, it's random.
- Seven strategies cover profile matches, Goldilocks prospects, and four exploration dimensions (size up, size down, adjacent industries, new geographies).
- Minimum 5 closed deals are required. The winning profile is statistical; it needs data.
- Exploration outcomes are tracked. Successful experiments get absorbed into the ICP. Failed ones are abandoned. The system learns over time.
- Bell curve shifts are the payoff: the winning profile literally expands when explorations prove successful. Discovery makes your scoring smarter over time.
Next chapter: we leave the scoring engine and enter the integrations layer, starting with how Astrelo connects to HubSpot through OAuth.