How the graph was built
Extraction process · weighting · visualization
The knowledge graph is a structural representation of the conceptual space across my published work. It is constructed automatically from the full text of each paper using a language model, then visualized with a force-directed layout. This page documents the pipeline in full, including the exact prompt sent to the model for each paper.
What the graph represents
Each node is a scientific construct, theory, method, mechanism, or domain that has independent standing in the research literature — the kind of term that would appear in the Introduction or Discussion section of a review article written by a different lab, in a field ontology, or as a search term someone might use to learn about this area of research.
Each edge encodes the theoretical coupling between two concepts as judged by the model. Edge strength reflects how tightly the two concepts are logically or empirically entangled in the published literature, not just whether they co-occur in a single paper.
The model assigns each node to one of seven research areas (clusters 0–6) at extraction time.
Three-pass extraction
The graph is built in three passes over the publication list. Each pass sends the extracted text of a paper (up to 10,000 characters) to Claude (claude-sonnet-4-6) and asks it to return a JSON object with nodes and edges. The model is given the full list of existing node IDs and instructed to reuse them rather than creating near-synonyms.
Draft proposals are never automatically added to the public graph. They go to a gitignored file and are reviewed before any merge.
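The bookkeeping around a single extraction call can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the names `MAX_CHARS`, `prepare_text`, and `split_nodes` are hypothetical, the model call itself is omitted, and taking the first 10,000 characters is an assumption about how the limit is applied.

```python
import json

MAX_CHARS = 10_000  # per-paper text limit stated above


def prepare_text(full_text: str) -> str:
    """Truncate a paper's extracted text (assumed: keep the first MAX_CHARS)."""
    return full_text[:MAX_CHARS]


def split_nodes(reply: str, existing_ids: set) -> tuple:
    """Parse the model's JSON reply and separate reused nodes from new proposals."""
    data = json.loads(reply)
    reused, proposed = [], []
    for node in data.get("nodes", []):
        # A node whose ID is already in the graph is a reuse, not a new proposal.
        (reused if node["id"] in existing_ids else proposed).append(node)
    return reused, proposed


if __name__ == "__main__":
    reply = '{"nodes": [{"id": "belief_revision"}, {"id": "neural_decoding"}], "edges": []}'
    reused, proposed = split_nodes(reply, {"belief_revision"})
    print([n["id"] for n in reused], [n["id"] for n in proposed])
```

Keeping reused and proposed nodes apart makes it straightforward to route only the new proposals to a review file before anything is merged.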
Paper weighting
Each paper is assigned a weight (0–1) before extraction. This weight is sent to the model and is used to scale the raw node weight the model assigns. The final node weight is raw_score × paper_weight, clamped to [0.1, 1.0]. A central construct in a lower-weight paper ends up with a lower node weight than the same construct in a high-weight anchor paper.
| Factor | Values |
|---|---|
| Publication type | Journal 1.0 · Conf-full 0.85 · Preprint 0.70 · Conf-workshop 0.65 · Other 0.40 · Science comm 0.30 |
| Author position | First / Shared-first 1.0 · Last 0.8 · Second 0.6 · Middle 0.4 |
| Recency | ≥ 2020: 1.0 · 2015–2019: 0.8 · < 2015: 0.6 |
| Final weight | pubType × authorPosition × recency (rounded to 2 dp) |
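As a worked illustration of the table and the clamping rule, the sketch below computes a paper weight and a final node weight. The dictionary keys and function names are hypothetical labels chosen for this example; only the numeric factors, the product formula, the 2 dp rounding, and the [0.1, 1.0] clamp come from the description above.

```python
# Factor values copied from the table; the key strings are illustrative.
PUB_TYPE = {
    "journal": 1.0,
    "conf-full": 0.85,
    "preprint": 0.70,
    "conf-workshop": 0.65,
    "other": 0.40,
    "science-comm": 0.30,
}
AUTHOR_POSITION = {
    "first": 1.0,
    "shared-first": 1.0,
    "last": 0.8,
    "second": 0.6,
    "middle": 0.4,
}


def recency_factor(year: int) -> float:
    """1.0 for 2020 onward, 0.8 for 2015-2019, 0.6 before 2015."""
    if year >= 2020:
        return 1.0
    if year >= 2015:
        return 0.8
    return 0.6


def paper_weight(pub_type: str, position: str, year: int) -> float:
    """pubType x authorPosition x recency, rounded to 2 decimal places."""
    return round(PUB_TYPE[pub_type] * AUTHOR_POSITION[position] * recency_factor(year), 2)


def node_weight(raw_score: float, weight: float) -> float:
    """raw_score x paper_weight, clamped to [0.1, 1.0]."""
    return min(1.0, max(0.1, raw_score * weight))


if __name__ == "__main__":
    w = paper_weight("preprint", "last", 2018)  # 0.70 * 0.8 * 0.8 -> 0.45
    print(w, node_weight(0.9, w), node_weight(0.1, w))
```

For example, a last-author preprint from 2018 gets a paper weight of 0.45, so even a construct the model scores at 0.9 ends up near 0.4, while a peripheral construct scored at 0.1 is lifted to the 0.1 floor by the clamp.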
The extraction prompt
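Below is the exact system prompt sent to the model for each paper. The {existingIds}, {clusterGuide}, and {paperWeight} placeholders are filled at runtime with the current node ID list, the cluster name map, and the paper's weight, respectively; {high|moderate|lower} is replaced by one of the three labels.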
You are a research knowledge-graph builder for a cognitive neuroscientist's published work. Extract concepts that represent durable scientific knowledge — constructs, theories, and mechanisms that appear in the Introduction and Discussion of review articles in this research area.
EXISTING NODE IDs — reuse wherever semantically appropriate.
Do NOT create a new node if an existing one covers the same concept, even if the wording differs.
Examples of what NOT to do: adding "risk_perception" when "perceived_risk" exists; adding "belief_updating" when "belief_revision" exists; adding "episodic_simulation" when "episodic_sim" exists.
When in doubt, REUSE the existing node and add edges to/from it instead.
{existingIds}
Cluster guide: {clusterGuide}
Use cluster 0 as default for anything that doesn't fit neatly.
PAPER WEIGHT: {paperWeight} (scale 0–1, reflecting publication type, author position, and recency)
This paper's importance to the researcher's intellectual identity is {high|moderate|lower}.
Scale your raw node weight scores accordingly — a weight of 1.0 in a paperWeight=0.5 paper
should translate to a node weight of ~0.5 in the final graph.
So: final_node_weight = your_raw_score × {paperWeight} (clamped to 0.1–1.0).
Return ONLY valid JSON, no markdown fences:
{
"nodes": [{"id":"snake_case_max_3_words","label":"display\nlabel","weight":0.0-1.0,"cluster":0-6,"level":"construct"}],
"edges": [{"a":"node_id","b":"node_id","strength":0.0-1.0}],
"paper": {"title":"...","year":2024,"venue":"...","doi":"..."}
}
EXTRACTION RULE: Extract what the study contributes, not how it was designed.
A node belongs in the graph only if it has independent scientific standing — it would appear in a review article's Introduction or Discussion written by a different lab, in a field ontology, or as a useful search term for learning about this research area.
Study artifacts — stimuli, IVs, DV operationalizations, domain context descriptors, study-specific task names — should almost never become nodes. For each one, first ask: is there a more general version with independent scientific standing? If yes, extract that instead. Skip only when no meaningful generalization exists.
Lifting examples (specific → general):
- Defendant race used as IV → decision_bias (specific IV → construct it operationalizes)
- Neurosynth decoding used in analysis → neural_decoding (specific tool → general method)
- Design outcome differentiability (DV) → creative_divergence (DV label → established construct)
- Large infrequent purchase (domain framing) → choice_prediction (context → scientific contribution)
- OXTR genotype (specific variant) → imaging_genetics (variable → research approach)
Skip examples (no generalizable concept exists):
- Powertrain features shown to participants → skip (pure stimulus choice)
- Mock juror task → skip (extract legal_decision_making as the construct instead)
For methods: extract if the technique has cross-lab standing and is central to the paper's scientific argument (fMRI, conjoint analysis, drift-diffusion modeling, neural decoding). Skip tools that are incidental to the analysis.
LEVEL — assign exactly one:
"theory" Frameworks/models spanning multiple findings. e.g. predictive processing, drift-diffusion model
"construct" Scientific concepts generalizing across labs and studies. e.g. temporal discounting, cognitive flexibility, imaging genetics. NOT study-specific variables.
"method" Cross-study instruments and analysis techniques. e.g. fMRI, conjoint analysis, eye tracking. NOT study-specific tasks.
"mechanism" Biological substrates: brain regions, circuits, neurochemicals, genetic variants. e.g. amygdala, dopamine, oxytocin system
"domain" Application areas or populations. e.g. adolescent development, consumer choice
QUANTITY: 5–12 new nodes maximum; prefer fewer, more central nodes. weight = raw centrality × paperWeight. strength = theoretical coupling tightness. label: lowercase, use \n if > 12 chars. id: snake_case, max 3 words, unique.
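The constraints the prompt places on the returned JSON (ID format, value ranges, allowed levels, node count) can be checked mechanically before anything is reviewed or merged. The sketch below is one way to do that; whether the pipeline runs such a check is not documented here, and `validate_extraction` and its helpers are hypothetical names.

```python
import re

LEVELS = {"theory", "construct", "method", "mechanism", "domain"}
# snake_case with at most three words, as the prompt requires
ID_PATTERN = re.compile(r"^[a-z0-9]+(?:_[a-z0-9]+){0,2}$")
MAX_NEW_NODES = 12


def in_unit_interval(value) -> bool:
    """True if value is a number between 0.0 and 1.0 inclusive."""
    return isinstance(value, (int, float)) and 0.0 <= value <= 1.0


def validate_extraction(data: dict, existing_ids: set) -> list:
    """Return human-readable problems found in a parsed response (empty if none)."""
    problems = []
    nodes = data.get("nodes", [])
    node_ids = {n.get("id") for n in nodes}
    for n in nodes:
        nid = n.get("id")
        if not isinstance(nid, str) or not ID_PATTERN.match(nid):
            problems.append(f"invalid id: {nid!r}")
        if not in_unit_interval(n.get("weight")):
            problems.append(f"{nid}: weight outside 0-1")
        if n.get("cluster") not in range(7):
            problems.append(f"{nid}: cluster outside 0-6")
        if n.get("level") not in LEVELS:
            problems.append(f"{nid}: unknown level")
    if len(node_ids - existing_ids) > MAX_NEW_NODES:
        problems.append("more than 12 new nodes")
    # Edges may point at nodes from this response or at nodes already in the graph.
    known = node_ids | existing_ids
    for e in data.get("edges", []):
        if e.get("a") not in known or e.get("b") not in known:
            problems.append(f"edge references unknown node: {e.get('a')} - {e.get('b')}")
        if not in_unit_interval(e.get("strength")):
            problems.append("edge strength outside 0-1")
    return problems
```

A response that passes returns an empty list; anything else can be logged alongside the draft so that the reviewer sees the problems before deciding on a merge.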
Visualization algorithm
The graph uses a force-directed layout computed offline (400 iterations, cluster gravity seeds) and stored in graph.json. On page load, a subset of nodes is selected for display using a multi-seed BFS:
Only construct and theory nodes are eligible. Methods, mechanisms, and domains are hidden by default; they clutter the semantic view without adding conceptual structure.
Selection stops at a fixed maximum number of nodes (HARD_CAP).
Limitations & future work
The current pipeline uses an LLM for the full extraction task: deciding what matters in a paper (a statistical judgment) and naming it at the right abstraction level (a semantic judgment). These are separate problems, and the prompt engineering above mostly teaches the model to do the second one reliably. A more principled future architecture would use classic methods (TF-IDF, KeyBERT) for candidate generation, which is fast, cheap, and free of hallucination, and would narrow the LLM's role to lifting candidates to the right abstraction level and assigning a cluster.
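To make the candidate-generation idea concrete, here is a minimal TF-IDF sketch using only the Python standard library. It is an illustration under simplifying assumptions (unigram tokens, no stop-word list, unsmoothed IDF) and not part of the current pipeline; the function names are hypothetical, and a real implementation would more likely rank multi-word phrases with an established library.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_candidates(docs: list, doc_index: int, top_k: int = 5) -> list:
    """Rank the terms of one document by TF-IDF against the whole corpus."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    target = tokenized[doc_index]
    tf = Counter(target)
    n_docs = len(docs)
    scores = {
        term: (count / len(target)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


if __name__ == "__main__":
    corpus = [
        "temporal discounting predicts choice",
        "neural decoding predicts choice",
        "temporal discounting in adolescents",
    ]
    print(tfidf_candidates(corpus, doc_index=1, top_k=2))
```

Every candidate produced this way is a term that literally occurs in the paper, which is what keeps hallucination out of this stage; lifting "decoding" to a node such as neural_decoding would remain the LLM's job.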
The graph is rebuilt from scratch periodically as the publication list grows. Node IDs are stable across rebuilds because extraction reuses the existing ID vocabulary; weights and edges are recomputed each time.