How the graph was built

Extraction process · weighting · visualization

The knowledge graph is a structural representation of the conceptual space across my published work. It is constructed automatically from the text of each paper using a language model, then visualized with a force-directed layout. This page documents the pipeline in full, including the exact prompt sent to the model for each paper.

What the graph represents

Each node is a scientific construct, theory, method, mechanism, or domain that has independent standing in the research literature — the kind of term that would appear in the Introduction or Discussion section of a review article written by a different lab, in a field ontology, or as a search term someone might use to learn about this area of research.

Each edge encodes the theoretical coupling between two concepts as judged by the model. Edge strength reflects how tightly the two concepts are logically or empirically entangled in the published literature, not just whether they co-occur in a single paper.

Nodes are clustered into seven research areas by the model at extraction time:

0 · Self-regulation
1 · Predictive brain
2 · Generative AI
3 · Interventions
4 · Neuroeconomics
5 · Social neuroscience
6 · Consumer & preference

Three-pass extraction

The graph is built in three passes over the publication list. In each pass, the extracted text of each paper (capped at 10,000 characters) is sent to Claude (claude-sonnet-4-6), which is asked to return a JSON object with nodes, edges, and paper metadata. The model is given the full list of existing node IDs and instructed to reuse them rather than creating near-synonyms.

Pass	Papers	Role
1 · Anchors	Selected publications (high-weight: journal articles, first-author, recent)	Establishes the core vocabulary of the graph; nodes extracted here define the ID space all later passes must reuse
2 · Fill	Remaining publications in the corpus	Adds nodes not yet represented; boosts weights of existing nodes that recur across papers
3 · Drafts	Preprints and unpublished manuscripts	Written to a separate review file; manually approved before merging into the public graph

Draft proposals are never automatically added to the public graph. They go to a gitignored file and are reviewed before any merge.

Paper weighting

Each paper is assigned a weight (0–1) before extraction. This weight is sent to the model and is used to scale the raw node weight the model assigns. The final node weight is raw_score × paper_weight, clamped to [0.1, 1.0]. A central construct in a lower-weight paper ends up with a lower node weight than the same construct in a high-weight anchor paper.

Factor	Values
Publication type	Journal 1.0 · Conf-full 0.85 · Preprint 0.70 · Conf-workshop 0.65 · Other 0.40 · Science comm 0.30
Author position	First / Shared-first 1.0 · Last 0.8 · Second 0.6 · Middle 0.4
Recency	≥ 2020: 1.0 · 2015–2019: 0.8 · < 2015: 0.6
Final weight	pubType × authorPosition × recency (rounded to 2 dp)
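
The weighting rules in the table can be written out as a minimal sketch. The dictionary keys and function names here are hypothetical labels chosen for illustration; only the numeric factors, the 2 dp rounding, and the [0.1, 1.0] clamp come from the text above.

```python
# Factor values copied from the table; the key spellings are assumed labels.
PUB_TYPE = {
    "journal": 1.0,
    "conf-full": 0.85,
    "preprint": 0.70,
    "conf-workshop": 0.65,
    "other": 0.40,
    "science-comm": 0.30,
}
AUTHOR_POSITION = {
    "first": 1.0,
    "shared-first": 1.0,
    "last": 0.8,
    "second": 0.6,
    "middle": 0.4,
}


def recency_factor(year: int) -> float:
    """Recency band: >= 2020, 2015-2019, or earlier."""
    if year >= 2020:
        return 1.0
    if year >= 2015:
        return 0.8
    return 0.6


def paper_weight(pub_type: str, author_position: str, year: int) -> float:
    """pubType x authorPosition x recency, rounded to 2 decimal places."""
    product = PUB_TYPE[pub_type] * AUTHOR_POSITION[author_position] * recency_factor(year)
    return round(product, 2)


def node_weight(raw_score: float, weight: float) -> float:
    """raw_score x paper_weight, clamped to [0.1, 1.0]."""
    return min(1.0, max(0.1, raw_score * weight))
```

For example, a second-author preprint from 2017 gets 0.70 × 0.6 × 0.8 = 0.336, which rounds to 0.34. A construct the model scores at 0.9 in that paper ends up near 0.31, while the same raw score in a recent first-author journal article stays at 0.9.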

The extraction prompt

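Below is the exact system prompt sent to the model for each paper. The {existingIds}, {clusterGuide}, and {paperWeight} placeholders are filled at runtime with the current node ID list, the cluster name map, and the paper's weight respectively; {high|moderate|lower} is replaced with one of the three importance labels.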

system prompt · claude-sonnet-4-6

You are a research knowledge-graph builder for a cognitive neuroscientist's published work. Extract concepts that represent durable scientific knowledge — constructs, theories, and mechanisms that appear in the Introduction and Discussion of review articles in this research area.

EXISTING NODE IDs — reuse wherever semantically appropriate.
Do NOT create a new node if an existing one covers the same concept, even if the wording differs.
Examples of what NOT to do: adding "risk_perception" when "perceived_risk" exists; adding "belief_updating" when "belief_revision" exists; adding "episodic_simulation" when "episodic_sim" exists.
When in doubt, REUSE the existing node and add edges to/from it instead.

{existingIds}

Cluster guide: {clusterGuide}
Use cluster 0 as default for anything that doesn't fit neatly.

PAPER WEIGHT: {paperWeight} (scale 0–1, reflecting publication type, author position, and recency)
This paper's importance to the researcher's intellectual identity is {high|moderate|lower}.
Scale your raw node weight scores accordingly — a weight of 1.0 in a paperWeight=0.5 paper
should translate to a node weight of ~0.5 in the final graph.
So: final_node_weight = your_raw_score × {paperWeight} (clamped to 0.1–1.0).

Return ONLY valid JSON, no markdown fences:
{
  "nodes": [{"id":"snake_case_max_3_words","label":"display\nlabel","weight":0.0-1.0,"cluster":0-6,"level":"construct"}],
  "edges": [{"a":"node_id","b":"node_id","strength":0.0-1.0}],
  "paper": {"title":"...","year":2024,"venue":"...","doi":"..."}
}

EXTRACTION RULE: Extract what the study contributes, not how it was designed.

A node belongs in the graph only if it has independent scientific standing — it would appear in a review article's Introduction or Discussion written by a different lab, in a field ontology, or as a useful search term for learning about this research area.

Study artifacts — stimuli, IVs, DV operationalizations, domain context descriptors, study-specific task names — should almost never become nodes. For each one, first ask: is there a more general version with independent scientific standing? If yes, extract that instead. Skip only when no meaningful generalization exists.

Lifting examples (specific → general):
- Defendant race used as IV → decision_bias (specific IV → construct it operationalizes)
- Neurosynth decoding used in analysis → neural_decoding (specific tool → general method)
- Design outcome differentiability (DV) → creative_divergence (DV label → established construct)
- Large infrequent purchase (domain framing) → choice_prediction (context → scientific contribution)
- OXTR genotype (specific variant) → imaging_genetics (variable → research approach)

Skip examples (no generalizable concept exists):
- Powertrain features shown to participants → skip (pure stimulus choice)
- Mock juror task → skip (extract legal_decision_making as the construct instead)

For methods: extract if the technique has cross-lab standing and is central to the paper's scientific argument (fMRI, conjoint analysis, drift-diffusion modeling, neural decoding). Skip tools that are incidental to the analysis.

LEVEL — assign exactly one:
"theory"     Frameworks/models spanning multiple findings. e.g. predictive processing, drift-diffusion model
"construct"  Scientific concepts generalizing across labs and studies. e.g. temporal discounting, cognitive flexibility, imaging genetics. NOT study-specific variables.
"method"     Cross-study instruments and analysis techniques. e.g. fMRI, conjoint analysis, eye tracking. NOT study-specific tasks.
"mechanism"  Biological substrates: brain regions, circuits, neurochemicals, genetic variants. e.g. amygdala, dopamine, oxytocin system
"domain"     Application areas or populations. e.g. adolescent development, consumer choice

QUANTITY: 5–12 new nodes maximum; prefer fewer, more central nodes. weight = raw centrality × paperWeight. strength = theoretical coupling tightness. label: lowercase, use \n if > 12 chars. id: snake_case, max 3 words, unique.
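
Because the prompt asks for bare JSON with a fixed shape, a response can be checked mechanically before anything is merged. The following is a minimal sketch of such a check, not the pipeline's actual code: the function name validate_response and the specific checks are assumptions derived from the schema, level list, and QUANTITY line in the prompt.

```python
import json
import re

LEVELS = {"theory", "construct", "method", "mechanism", "domain"}
# snake_case, at most three words (per the "id" rule in the prompt)
ID_PATTERN = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+){0,2}")
MAX_NEW_NODES = 12


def in_unit_range(value) -> bool:
    """True for a number between 0 and 1 inclusive."""
    return isinstance(value, (int, float)) and 0.0 <= value <= 1.0


def validate_response(raw: str, existing_ids: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the response passed.

    Assumes every entry in "nodes" and "edges" is a JSON object.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    nodes = data.get("nodes", [])
    known = set(existing_ids)
    for node in nodes:
        node_id = node.get("id", "")
        if not ID_PATTERN.fullmatch(node_id):
            problems.append(f"bad id: {node_id!r}")
        if node.get("level") not in LEVELS:
            problems.append(f"bad level on {node_id!r}")
        if node.get("cluster") not in range(7):
            problems.append(f"bad cluster on {node_id!r}")
        if not in_unit_range(node.get("weight")):
            problems.append(f"bad weight on {node_id!r}")
        known.add(node_id)

    new_ids = {n.get("id") for n in nodes} - set(existing_ids)
    if len(new_ids) > MAX_NEW_NODES:
        problems.append(f"{len(new_ids)} new nodes exceeds {MAX_NEW_NODES}")

    for edge in data.get("edges", []):
        for end in ("a", "b"):
            if edge.get(end) not in known:
                problems.append(f"edge references unknown node {edge.get(end)!r}")
        if not in_unit_range(edge.get("strength")):
            problems.append("bad edge strength")
    return problems
```

A check like this catches structural errors only. Whether a new node is a near-synonym of an existing one still needs the reuse instruction in the prompt or a manual review.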

Visualization algorithm

The graph uses a force-directed layout computed offline (400 iterations, cluster gravity seeds) and stored in graph.json. On page load, a subset of nodes is selected for display using a multi-seed BFS:

Filter by level. Only construct and theory nodes are eligible. Methods, mechanisms, and domains are hidden by default — they clutter the semantic view without adding conceptual structure.

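Seed one node per cluster. The highest-weight eligible node from each of the 7 clusters is selected as a seed. This gives every research area a starting point in the selection even when clusters are weakly connected to each other, although the largest-component filter below can still drop a seed whose neighborhood stays disconnected from the rest.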

BFS expansion. From each seed, the graph is expanded greedily: at each step, the highest-weight neighbor not yet in the selection is added, so the traversal is best-first by node weight rather than strictly level by level. Expansion continues until a hard cap of 12 nodes is reached (configurable via HARD_CAP).

Largest-component filter. Multi-seed BFS can produce disconnected islands. A final pass keeps only the largest connected component of the selected set, eliminating floating node pairs.

Spreading activation. Clicking a node activates it; activation decays and spreads along edges (Collins & Loftus 1975 model, decay=0.22, spread=0.22). The topographic contour overlay renders the same activation field as a landscape, the way a loss surface is drawn as a contour map for gradient descent.
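
The selection steps and the click-driven activation can be sketched as follows. This is an illustration under stated assumptions, not the site's code: the data shapes and the names select_nodes, build_adjacency, and spread_activation are hypothetical, the cap is assumed to apply to the whole selection rather than per seed, and the activation update rule is one plausible reading of "decays and spreads along edges".

```python
from collections import defaultdict

ELIGIBLE_LEVELS = {"construct", "theory"}
HARD_CAP = 12  # name from the text; assumed here to cap the whole selection


def build_adjacency(edges, allowed):
    """Undirected adjacency {node: {neighbor: strength}} over allowed ids."""
    adjacency = defaultdict(dict)
    for a, b, strength in edges:
        if a in allowed and b in allowed:
            adjacency[a][b] = strength
            adjacency[b][a] = strength
    return adjacency


def select_nodes(nodes, edges, hard_cap=HARD_CAP):
    """nodes: {id: {"weight", "cluster", "level"}}; edges: [(a, b, strength)]."""
    def rank(nid):
        return (-nodes[nid]["weight"], nid)  # heaviest first, id breaks ties

    # 1. Filter by level.
    eligible = {nid for nid, n in nodes.items() if n["level"] in ELIGIBLE_LEVELS}
    adjacency = build_adjacency(edges, eligible)

    # 2. Seed with the highest-weight eligible node of each cluster.
    seeds = {}
    for nid in sorted(eligible, key=rank):
        seeds.setdefault(nodes[nid]["cluster"], nid)
    selected = set(seeds.values())

    # 3. Greedy expansion: add the heaviest unselected neighbor until the cap.
    while len(selected) < hard_cap:
        frontier = {nb for nid in selected for nb in adjacency.get(nid, {})} - selected
        if not frontier:
            break
        selected.add(min(frontier, key=rank))

    # 4. Keep only the largest connected component of the selection.
    best, seen = set(), set()
    for start in sorted(selected):
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            for nb in adjacency.get(stack.pop(), {}):
                if nb in selected and nb not in component:
                    component.add(nb)
                    stack.append(nb)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def spread_activation(adjacency, clicked, steps=3, decay=0.22, spread=0.22):
    """Assumed rule: keep (1 - decay) of own activation, add spread x weighted input."""
    activation = {nid: 0.0 for nid in adjacency}
    activation[clicked] = 1.0
    for _ in range(steps):
        activation = {
            nid: min(1.0, (1.0 - decay) * level + spread * sum(
                activation.get(nb, 0.0) * s for nb, s in adjacency.get(nid, {}).items()))
            for nid, level in activation.items()
        }
    return activation
```

With seven seeds and a global cap of 12, this reading leaves room for only five expansion steps. A per-seed cap would give a larger view, and the text does not say which one the site uses.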

Limitations & future work

The current pipeline uses an LLM for the full extraction task: deciding what matters in a paper (a statistical judgment) and naming it at the right abstraction level (a semantic judgment). These are separate problems, and the prompt engineering above mostly teaches the model to do the second one reliably. A more principled future architecture would use extractive keyword methods (TF-IDF, KeyBERT) for candidate generation, which are fast and cheap and can only return terms that actually occur in the paper, and would narrow the LLM's role to lifting candidates to the right abstraction level and assigning a cluster.
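
To make the candidate-generation idea concrete, here is a minimal standard-library sketch of TF-IDF ranking over single words. It is an illustration only: the function names are hypothetical, the tokenizer is deliberately crude, and a real version would likely use n-grams and an established library rather than this hand-rolled scoring.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphabetic tokens; a stand-in for real preprocessing."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_candidates(papers: dict[str, str], paper_id: str, top_k: int = 5) -> list[str]:
    """Rank the terms of one paper by term frequency x log(N / document frequency)."""
    counts_by_paper = {pid: Counter(tokenize(text)) for pid, text in papers.items()}
    n_papers = len(counts_by_paper)

    # Document frequency: in how many papers each term occurs.
    doc_freq = Counter()
    for counts in counts_by_paper.values():
        doc_freq.update(counts.keys())

    counts = counts_by_paper[paper_id]
    total = sum(counts.values())
    scores = {
        term: (count / total) * math.log(n_papers / doc_freq[term])
        for term, count in counts.items()
    }
    # Highest score first; alphabetical order breaks ties.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_k]
```

Terms that occur in every paper score zero, so corpus-wide vocabulary drops out and paper-specific terms rise to the top. Those terms would then be handed to the LLM for lifting to the right abstraction level.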

The graph is rebuilt from scratch periodically as the publication list grows. Node IDs stay stable across rebuilds because each rebuild reuses the existing ID vocabulary; weights and edges are recomputed each time.