A Custom AI Pipeline for Scoring Senior Roles Against My Profile
I built a two-stage AI pipeline to score Director-level job descriptions against my own candidate profile, deliberately mirroring the way companies score resumes against an ideal candidate profile. It runs in a Google Colab notebook, uses Claude Sonnet 4.5 as the underlying model, and has scored 34 roles to date. The prompts were iteratively refined across the first five JDs in the corpus and stabilized after that. The pipeline tells me which roles are genuinely worth pursuing, which are worth applying to with caveats, and which to skip, and it does the work in minutes rather than hours.
I'm publishing this because it's a working demonstration of AI literacy at the level senior operators actually need: scoping a problem, designing a schema, iterating on prompts, debugging tooling, and producing useful output. Most operators talk about AI; fewer have built things with it that they actually use. Also because the pipeline is an example of the operating instinct described elsewhere on this site — the same impulse that produces operating cadences and scoring rubrics for workforce planning produces them for personal use, too.
Two stages, five weighted dimensions, three structured outputs.
Stage 01: Extraction
Takes raw JD text and parses it into a structured JSON schema across 18 fields. The schema covers the things that actually matter for fit assessment: canonical job title, company stage, organization size, the type of organizational problem the role is solving for (Scaling, Efficiency, Strategy, or Transformation), whether the role frames its scope as internal consulting versus operational ownership, and the level of AI experience the JD signals. The extraction step normalizes inconsistent JD language into a consistent representation that the next stage can score against. Fields are explicitly distinguished as either extract-directly or infer-from-signals, with the prompt instructed to return null rather than hallucinate when data isn't present.
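For concreteness, here is a sketch of what the extraction target looks like. The field names and pseudo-type notation are illustrative reconstructions from the description above (the real schema has 18 fields; only a handful are shown), but the extract-versus-infer split and the null rule are the pipeline's own.

```python
# Illustrative subset of the 18-field extraction schema. Field names and
# the pseudo-type notation are reconstructions, not the notebook's exact text.
EXTRACTION_SCHEMA = {
    # Extract-directly fields: must come from explicit JD language.
    "canonical_job_title": "string",
    "company_stage": "string | null",
    "org_size": "string | null",
    "remote_flexibility": "'Remote' | 'Hybrid' | 'Onsite' | null",
    # Infer-from-signals fields: the model may infer, from stated signals only.
    "org_problem_type": {
        "primary": "'Scaling' | 'Efficiency' | 'Strategy' | 'Transformation'",
        "secondary": "same set | null",
    },
    "internal_consultant_framing": "boolean | null",
    "ai_experience_signal": "string | null",
}

# The anti-hallucination rule, paraphrased:
NULL_RULE = "If the JD does not contain the information, return null. Never guess."
```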
Stage 02: Scoring
Takes the extracted JD and scores it against my candidate profile across five weighted dimensions. The largest weight (40%) goes to required qualifications coverage: the gating question of whether I can credibly defend fit during a recruiter screen. The next-largest weight (25%) goes to organizational problem fit: whether the role's underlying challenge maps to the kinds of problems I've actually solved. Smaller weights cover seniority alignment, organization size, and AI signal fit. Cross-industry pivot (the extent to which a role does or doesn't value cross-industry experience) is tracked as an unweighted watch flag rather than a penalty. The pivot itself isn't disqualifying; it just affects how the application materials need to be written. Remote flexibility is another unweighted watch flag.
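As code, the composite is a plain weighted sum. A minimal sketch: the 40% and 25% weights are the documented ones, while the split of the remaining 35% across the three smaller dimensions is an illustrative assumption, as is the 0-to-5 per-dimension scale.

```python
# Weights for the five dimensions. The first two are the documented values;
# the split of the remaining 35% is illustrative only.
WEIGHTS = {
    "required_quals_coverage": 0.40,
    "org_problem_fit": 0.25,
    "seniority_alignment": 0.15,  # illustrative
    "org_size_fit": 0.10,         # illustrative
    "ai_signal_fit": 0.10,        # illustrative
}

def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each assumed to be on a 0-5 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return round(sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS), 2)

# Watch flags are surfaced alongside the score but never weighted into it.
WATCH_FLAGS = ("cross_industry_pivot", "remote_flexibility")
```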
The pipeline produces three outputs: a numeric composite score, a verdict (Pursue / Apply with caveats / Low priority), and a structured breakdown of which qualifications are covered, partially covered, or not covered. The breakdown is the part I actually use most. Every covered item maps to specific work in my profile and becomes a resume bullet or cover letter proof point. Every partial item is a framing opportunity, again mapped to the most relevant experience in my profile. Every not-covered item is either a gap to address head-on or a reason to deprioritize the application.
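And a sketch of the other two outputs. The verdict thresholds are assumptions (a 4.3 is cited below as a Pursue, which implies a 0-to-5 scale), and the breakdown entry's field names are placeholders; only the verdict labels and the three coverage states come from the pipeline itself.

```python
def verdict(score: float) -> str:
    """Map the composite score to a verdict. Thresholds are illustrative."""
    if score >= 4.0:
        return "Pursue"
    if score >= 3.0:
        return "Apply with caveats"
    return "Low priority"

# Shape of one qualifications-breakdown entry (field names are assumed):
example_entry = {
    "requirement": "<requirement text from the JD>",
    "coverage": "partial",  # covered | partial | not_covered
    "profile_evidence": "<the most relevant experience from the profile>",
    "framing_note": "<how to position the match in resume or cover letter>",
}
```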
Three reasons, in increasing order of importance.
The first is volume management. At Director level, the application calculus is brutal: most roles already have favored candidates before they're posted, the average online application converts at 2–5%, and a senior search routinely takes four to six months. Spending real effort on every JD I see is wasted motion. Spending zero effort means I miss the ones that genuinely fit. A pipeline that produces a defensible verdict in under a minute lets me allocate effort where it matters.
The second is consistency. Without a structured rubric, my evaluation of any single JD drifts based on how I'm feeling that day, how exciting the company sounds, or how recent the rejection from the last application was. A pipeline removes most of that noise. The criteria are locked; the weighting is documented; the verdict is reproducible across roles. What's more, the JSON logs are automatically saved to Google Drive, where I can reference them anytime.
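The logging step itself is a few lines in Colab. A minimal sketch, assuming the standard Drive mount; the directory name is a placeholder, not the notebook's actual path.

```python
import datetime
import json
import pathlib

from google.colab import drive  # Colab-only; mounts Drive into the runtime

drive.mount("/content/drive")
LOG_DIR = pathlib.Path("/content/drive/MyDrive/jd_pipeline_logs")  # placeholder path
LOG_DIR.mkdir(parents=True, exist_ok=True)

def log_result(company: str, result: dict) -> None:
    """Persist one scored JD as a timestamped JSON file on Drive."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    (LOG_DIR / f"{company}-{stamp}.json").write_text(json.dumps(result, indent=2))
```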
The third is the most interesting one to me. Building the pipeline forced me to articulate, in machine-readable form, what I actually want and what I can actually defend. A schema field for internal_consultant_framing means I had to decide what counts as that framing and what doesn't. A weight of 40% on required qualifications coverage means I had to commit to gating my own pursuit of roles where I can't credibly defend fit. The act of building the system clarified my own search in ways no amount of journaling would have.
Three lessons that surfaced in the iterations.
Prompts that look unambiguous to humans frequently aren't to language models.
My first version of the extraction prompt produced a Hybrid value for remote flexibility on JDs that simply listed multiple office locations. The model was inferring policy from geography rather than extracting it from explicit policy language. The fix was to restate the rule narrowly and explicitly: extract only from policy language, and ignore office-location lists. The lesson: when an LLM gets something wrong, the prompt almost always has more ambiguity than I thought.
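A paraphrase of the rule before and after the fix (not the notebook's exact wording):

```python
# Before: the rule looked unambiguous to me, but left the model room to
# infer policy from geography.
RULE_V1 = "remote_flexibility: Remote, Hybrid, or Onsite."

# After: the rule names the failure mode explicitly.
RULE_V2 = """remote_flexibility: extract ONLY from explicit policy language
(e.g. 'fully remote', 'hybrid, 3 days in office'). A list of office
locations is NOT a policy statement; if the JD only lists locations,
return null."""
```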
Single-value fields collapse multi-dimensional reality.
My early schema had org_problem_type as a single value (Scaling / Efficiency / Strategy / Transformation), and on three of four early JDs the role was clearly two of those things at once. I changed the field to a {primary, secondary} object, and the scoring became more accurate immediately. Schemas should be evolved, not pre-perfected.
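The change, sketched with illustrative values:

```python
# Before: a single enum value forced a lossy choice on roles that were
# clearly two things at once.
jd_v1 = {"org_problem_type": "Scaling"}

# After: a {primary, secondary} object represents a scaling role with an
# explicit efficiency mandate faithfully. secondary may be None.
jd_v2 = {"org_problem_type": {"primary": "Scaling", "secondary": "Efficiency"}}
```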
The pipeline correctly identifies fit, but the candidate has to identify trajectory.
One JD scored 4.3 (Pursue) but was a lateral move into work I'd already mastered. The pipeline didn't know that; only I did. The lesson: scoring infrastructure improves judgment, but doesn't replace it. The verdict is an input to the decision, not the decision itself.
The technical and operational footprint.
The notebook lives in my Google Drive and runs against the Anthropic API. The candidate profile, scoring weights, and prompt iteration log are all version-controlled in the notebook itself. As of April 2026, the corpus includes 34 scored JDs across tech, financial services, and mission-driven organizations, and 12 submitted applications.
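Each stage boils down to one call to the Anthropic Messages API. A minimal sketch using the standard Python SDK; the prompt contents, token limit, and helper name are placeholders.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_stage(system_prompt: str, payload: str) -> str:
    """One pipeline stage: send a system prompt plus the JD (or extracted JSON)."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # Claude Sonnet 4.5
        max_tokens=2048,
        system=system_prompt,
        messages=[{"role": "user", "content": payload}],
    )
    return response.content[0].text
```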
If you're building something similar — for yourself, for a recruiting team, or as part of a talent infrastructure product — I'd be glad to compare notes.
damienmatthewspower@gmail.com