Part 2: Intelligence Engine · Ch 6: Embeddings

When a human sees “SaaS company” and “cloud software vendor,” they instantly know these are similar. A computer sees two unrelated strings. Embeddings bridge this gap — they convert text into numerical vectors where similar concepts are close together in mathematical space.

This chapter explains the foundational math behind Astrelo’s ML scoring engine.

What Is an Embedding?

An embedding is an array of numbers that represents a concept. For Astrelo, every industry (NAICS code) becomes a 384-dimensional vector:

// src/domain/scoring/services/embeddings/naicsEmbeddingService.ts, lines 21-29
export interface NaicsEmbedding {
  naicsCode: string;
  embedding: number[]; // 384-dimensional vector
  title?: string;
  description?: string;
  parentCode?: string;
  hierarchyLevel: number;
  relatedCodes?: string[];
}

A 384-dimensional vector is just an array of 384 numbers. Think of it as coordinates in 384-dimensional space. In 2D space, a point has an (x, y) position. In 3D, it’s (x, y, z). In 384D, it’s (x₁, x₂, x₃, …, x₃₈₄).

Why 384 dimensions? This is the output size of the embedding model (Llama 3.1 8B’s embedding layer). More dimensions capture more nuance, but at the cost of storage and computation. 384 is a sweet spot — enough to distinguish “automotive manufacturing” from “pharmaceutical manufacturing” while keeping the vectors manageable.

Here’s a simplified example in 3 dimensions to build intuition:

"SaaS company" → [0.82, 0.15, 0.91] "Cloud software" → [0.80, 0.18, 0.89] ← Very close to SaaS! "Steel manufacturing"→ [0.12, 0.87, 0.23] ← Very far from SaaS

The embedding model places similar concepts at similar coordinates. The distance between points IS the semantic distance between concepts.

Cosine Similarity: Measuring Closeness

How do you measure “closeness” between two vectors? Astrelo uses cosine similarity — the cosine of the angle between two vectors:

// src/domain/scoring/utils/embeddingUtils.ts, lines 17-32
export function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) return 0;

  let dotProduct = 0;
  let normA = 0;
  let normB = 0;

  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  const magnitude = Math.sqrt(normA) * Math.sqrt(normB);
  return magnitude === 0 ? 0 : dotProduct / magnitude;
}

The formula: similarity = dot(a, b) / (||a|| × ||b||)

Let’s trace this with our 3D example:

a = [0.82, 0.15, 0.91]   (SaaS)
b = [0.80, 0.18, 0.89]   (Cloud software)

dotProduct = (0.82 × 0.80) + (0.15 × 0.18) + (0.91 × 0.89)
           = 0.656 + 0.027 + 0.8099
           = 1.4929

normA = sqrt(0.82² + 0.15² + 0.91²) = sqrt(0.6724 + 0.0225 + 0.8281) = sqrt(1.523) = 1.234
normB = sqrt(0.80² + 0.18² + 0.89²) = sqrt(0.64 + 0.0324 + 0.7921) = sqrt(1.4645) = 1.210

similarity = 1.4929 / (1.234 × 1.210) = 1.4929 / 1.493 = 0.9997

0.9997 — almost 1.0 (perfect similarity). SaaS and cloud software are almost identical in the embedding space.

Now compare SaaS to steel manufacturing:

a = [0.82, 0.15, 0.91]   (SaaS)
c = [0.12, 0.87, 0.23]   (Steel manufacturing)

dotProduct = (0.82 × 0.12) + (0.15 × 0.87) + (0.91 × 0.23)
           = 0.0984 + 0.1305 + 0.2093
           = 0.4382

normA = 1.234 (same as before)
normC = sqrt(0.0144 + 0.7569 + 0.0529) = sqrt(0.8242) = 0.908

similarity = 0.4382 / (1.234 × 0.908) = 0.4382 / 1.121 = 0.391

0.391 — much lower. SaaS and steel manufacturing are far apart semantically.

Why cosine instead of Euclidean distance? Cosine similarity measures the angle between vectors, not the distance. This makes it insensitive to magnitude — a vector [0.82, 0.15, 0.91] and [1.64, 0.30, 1.82] (same direction, double length) have cosine similarity 1.0. For embeddings, direction encodes meaning and magnitude is an artifact of the model. Cosine captures what we care about.
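To see these numbers fall out of the code, here is a quick check that calls the cosineSimilarity function from above on the toy 3D vectors (a sketch for intuition; real embeddings have 384 dimensions):

const saas  = [0.82, 0.15, 0.91];
const cloud = [0.80, 0.18, 0.89];
const steel = [0.12, 0.87, 0.23];
const saasScaled = saas.map(x => x * 2); // same direction, double the magnitude

console.log(cosineSimilarity(saas, cloud));      // ≈ 0.9997 (near-identical concepts)
console.log(cosineSimilarity(saas, steel));      // ≈ 0.391  (semantically distant)
console.log(cosineSimilarity(saas, saasScaled)); // ≈ 1.0    (magnitude is ignored)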

Where Embeddings Are Stored

Embeddings are stored in the naics_embeddings table:

naics_embeddings (10 cols)
├── naics_code VARCHAR(20) PRIMARY KEY  -- e.g., "541512"
├── naics_title                         -- e.g., "Computer Systems Design"
├── parent_code                         -- e.g., "5415"
├── embedding FLOAT8[]                  -- 384 numbers
├── naics_description TEXT
├── hierarchy_level INT                 -- 2-digit, 4-digit, 6-digit
├── related_codes JSONB
├── embedding_dimension INT             -- Always 384
├── created_at
└── updated_at

The embedding column stores the vector as a PostgreSQL array of 64-bit floats. This is compact: 384 × 8 bytes = 3,072 bytes per row. With ~1,200 NAICS codes, that’s about 3.6 MB — trivially small.
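You can sanity-check that footprint from psql; pg_column_size reports the stored size of a single value (a quick diagnostic query, not part of the scoring code):

SELECT
  pg_column_size(embedding) AS bytes_per_vector,  -- ~3,072 bytes of floats + small array header
  pg_size_pretty(pg_total_relation_size('naics_embeddings')) AS total_table_size
FROM naics_embeddings
LIMIT 1;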

Vector Search: Finding Similar Industries

When scoring a company, we need to find which NAICS codes are similar to the company’s industry. There are two paths:

// src/domain/scoring/services/embeddings/vectorSearchService.ts, lines 81-95
export async function findSimilarNaics(
  queryEmbedding: number[],
  excludeCode: string,
  limit: number = 10,
  minSimilarity: number = 0
): Promise<VectorSearchResult[]> {
  const usePgvector = await isPgvectorAvailable();
  if (usePgvector) {
    return findSimilarNaicsPgvector(queryEmbedding, excludeCode, limit, minSimilarity);
  }
  return findSimilarNaicsFallback(queryEmbedding, excludeCode, limit, minSimilarity);
}

Path 1: pgvector (Fast)

If the pgvector PostgreSQL extension is installed, we use its native distance operator:

SELECT
  naics_code,
  naics_title,
  1 - (embedding_vec <=> $1::vector) AS similarity
FROM naics_embeddings
WHERE naics_code != $2
  AND embedding_vec IS NOT NULL
ORDER BY embedding_vec <=> $1::vector
LIMIT $3

The <=> operator computes cosine distance between vectors at the database level (pgvector's <-> operator is the one that computes L2/Euclidean distance). Backed by an HNSW (Hierarchical Navigable Small World) index, the query runs as an approximate nearest neighbor search in O(log n) instead of O(n). For 1,200 NAICS codes, this is near-instant.

1 - distance converts cosine distance back into cosine similarity (smaller distance = higher similarity).
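The speed comes from an index built once, ahead of any queries. A minimal sketch, assuming pgvector 0.5+ (which added HNSW support) and the embedding_vec column from the query above:

-- Enable the extension (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- HNSW index over cosine distance, matching the <=> operator used in the query
CREATE INDEX IF NOT EXISTS naics_embedding_vec_hnsw_idx
  ON naics_embeddings
  USING hnsw (embedding_vec vector_cosine_ops);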

Path 2: In-Memory Fallback (Works Everywhere)

If pgvector isn’t available (e.g., in development), we load all embeddings into memory and compute cosine similarity in JavaScript:

// Fallback: fetch all embeddings, compute similarity in JS
const allEmbeddings = await pool.query(
  'SELECT naics_code, naics_title, embedding FROM naics_embeddings WHERE embedding IS NOT NULL'
);

const results = allEmbeddings.rows
  .filter(row => row.naics_code !== excludeCode)
  .map(row => ({
    naicsCode: row.naics_code,
    title: row.naics_title,
    similarity: cosineSimilarity(queryEmbedding, row.embedding),
  }))
  .filter(r => r.similarity >= minSimilarity)
  .sort((a, b) => b.similarity - a.similarity)
  .slice(0, limit);

This is O(n) — it checks every NAICS code. With 1,200 codes and 384 dimensions, that’s ~460,800 multiplications per search. On modern hardware, this takes about 5ms — fast enough for development.

How Companies Get Matched to NAICS Codes

Not every company in your CRM has a NAICS code. The industry classification service fills the gap:

  1. The company has an industry field from the CRM (e.g., “Cloud Computing”)
  2. The LLM is asked: “What NAICS code best matches ‘Cloud Computing’?”
  3. The LLM returns a code (e.g., “541512 — Computer Systems Design Services”)
  4. The code’s embedding is looked up from naics_embeddings
  5. This embedding is compared to the winning profile’s embedding

If the LLM can’t classify the industry, the system falls back to text-based matching against NAICS titles.
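Put together, the flow might look like the sketch below. classifyWithLlm and matchByTitleText are hypothetical stand-ins for the LLM call (steps 2-3) and the text fallback; pool is the same PostgreSQL pool used in the fallback search:

// Hypothetical sketch of the classification flow; helper names are illustrative,
// not the actual service API.
async function resolveIndustryEmbedding(industry: string): Promise<number[] | null> {
  // Steps 2-3: ask the LLM for the best-matching NAICS code
  const llmCode = await classifyWithLlm(industry); // e.g., "541512"

  // Fallback: text-based matching against NAICS titles
  const naicsCode = llmCode ?? (await matchByTitleText(industry));
  if (!naicsCode) return null;

  // Step 4: look up the code's stored embedding
  const result = await pool.query(
    'SELECT embedding FROM naics_embeddings WHERE naics_code = $1',
    [naicsCode]
  );
  // Step 5 happens downstream: compare this vector to the winning profile's
  return result.rows[0]?.embedding ?? null;
}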

Title Embeddings: The Persona Dimension

Industry isn’t the only thing that gets embedded. Job titles also become vectors, stored in title_embeddings:

title_embeddings (10 cols)
├── id UUID PRIMARY KEY
├── title VARCHAR(255) UNIQUE           -- e.g., "VP of Engineering"
├── embedding JSONB                     -- 384 numbers (stored as JSONB)
├── embedding_dimension INT DEFAULT 384
├── normalized_title                    -- e.g., "vp engineering"
├── seniority_inferred                  -- e.g., "VP"
├── department_inferred                 -- e.g., "Engineering"
├── model_version DEFAULT 'llama-3.1-8b-instant'
├── created_at
└── updated_at

This enables persona scoring — measuring how well a contact’s job title matches your winning persona profile. “VP of Engineering” and “Head of Engineering” have different strings but similar embeddings (high cosine similarity), so they score similarly.
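In code, persona scoring reduces to the same cosine comparison. A sketch, where getTitleEmbedding is a hypothetical lookup against title_embeddings:

// Hypothetical: fetch (or generate) the stored vector for a job title,
// then compare it against the winning persona profile's vector.
async function personaScore(contactTitle: string, personaEmbedding: number[]): Promise<number> {
  const titleEmbedding = await getTitleEmbedding(contactTitle); // from title_embeddings
  return cosineSimilarity(titleEmbedding, personaEmbedding);    // near 1.0 = strong match
}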

The Embedding Constant

Throughout the scoring engine, one constant appears everywhere:

// src/domain/scoring/constants/index.ts, line 80
export const EMBEDDING_DIMENSION = 384;

This is used for:

  • Validating embedding length before computation
  • Initializing zero vectors (for companies without embeddings)
  • Array allocation during embedding generation

If the embedding model ever changes to 768 dimensions (like some larger models), this single constant controls the migration.
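As a rough illustration of the first two uses (hypothetical helpers and import path, not the actual scoring code):

import { EMBEDDING_DIMENSION } from '../constants';

// Reject vectors of the wrong length before any math runs on them
function isValidEmbedding(v: number[]): boolean {
  return v.length === EMBEDDING_DIMENSION;
}

// Zero vector for companies that have no embedding yet
function zeroEmbedding(): number[] {
  return new Array(EMBEDDING_DIMENSION).fill(0);
}

Note how this pairs with the guard in cosineSimilarity earlier: a zero vector has zero magnitude, so comparing against it returns 0 rather than dividing by zero.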

Key Takeaways

  1. Embeddings convert text to numbers. “Computer Systems Design” becomes a 384-number array that captures its semantic meaning.

  2. Cosine similarity measures semantic closeness. Two vectors pointing in the same direction (angle near 0°, cosine near 1.0) represent similar concepts.

  3. pgvector makes vector search fast. HNSW indexes enable O(log n) nearest neighbor search in PostgreSQL. The in-memory fallback works for smaller datasets.

  4. Everything gets embedded. NAICS codes (industry), job titles (personas), and company descriptions all become vectors. This creates a unified mathematical space for comparison.

  5. 384 dimensions is the sweet spot. Enough nuance to distinguish subtle differences, small enough to store and compute efficiently.

Next chapter: we’ll use these embeddings to calculate the Fit Score — how well a prospect matches your historical winners.
