Grep doesn’t understand feelings. You search for “fear” and it dutifully returns every literal occurrence of those four characters. Meanwhile, paragraphs discussing “anxiety”, “dread”, and “terror” slip right through. Grep is extremely good at its job — it’s just that its job is pattern matching, not comprehension.
This becomes a problem when you’re analyzing text that’s semantically rich. Like, say, psychological journals.
The Use Case
I’ve been digging through psychology papers and therapy notes, looking for patterns in how emotions are described. The thing is, humans are wonderfully inconsistent. One author writes about “fear responses,” another discusses “anxiety disorders,” a third mentions “panic states.” They’re all circling the same conceptual territory, but grep treats them as completely unrelated strings.
What I needed was a grep that had been to therapy itself: something that understands that words have meaning.
Enter Word2Vec
Word2Vec is one of those “why didn’t we think of this sooner” ideas. Train a neural network on massive amounts of text, and it learns to represent words as vectors in high-dimensional space. Words that appear in similar contexts end up close together. “King” minus “man” plus “woman” equals “queen” — that kind of magic.
The practical upshot: you can measure semantic similarity between words. And if you can do that, you can build a grep that finds not just what you typed, but what you meant.
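To make that concrete: load a pretrained vector file into something like gensim and similarity is one function call away. This is a minimal sketch, assuming gensim and a FastText-style .vec file on disk; the model path is a placeholder, not something mcp-w2vgrep ships.

# Minimal sketch of word-level semantic similarity with gensim.
# The model path is a placeholder; any word2vec/FastText .vec file works.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("cc.en.300.vec", binary=False)

# Cosine similarity between two words
print(vectors.similarity("fear", "anxiety"))      # high, they share contexts
print(vectors.similarity("fear", "spreadsheet"))  # low, they don't

# Nearest neighbours in embedding space
for word, score in vectors.most_similar("fear", topn=5):
    print(f"{word}  {score:.2f}")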
mcp-w2vgrep
So I built mcp-w2vgrep — an MCP server that combines Word2Vec embeddings with ripgrep’s speed. You give it a search term, it finds semantically similar words, then searches for all of them in your files.
Search for “fear” and you’ll get results containing “anxiety”, “terror”, “dread”, “panic”, and “horror”. The similarity scores tell you how close each match is to your original query.
Docker: Because Dependencies Are a Horror Story
Here’s the thing about Word2Vec: it requires models. Big ones. The FastText models I’m using are about 2.3GB each. Plus you need the w2vgrep binary, ripgrep, Node.js, and various other pieces.
Installing all of this on your machine is possible. It’s also the kind of experience that makes you question your life choices.
So everything runs in Docker. One docker compose build and you’re done. The models even auto-download on first run if you set DOWNLOAD_MODELS=english (or russian).
Setup
Create a docker-compose.yml:
services:
  mcp-w2vgrep:
    build: https://github.com/astartsky/mcp-w2vgrep.git
    stdin_open: true
    volumes:
      - ~/.mcp-w2vgrep:/data/models
      - ~/Documents/research:/search
    environment:
      - DOWNLOAD_MODELS=english
The first volume stores the downloaded models (so you don’t re-download 2.3GB every time). The second mounts whatever directory you want to search.
Build it:
docker compose build
Then configure your MCP client. For Claude Code, add to ~/.claude/settings.json:
{
  "mcpServers": {
    "w2vgrep": {
      "type": "stdio",
      "command": "docker",
      "args": [
        "compose",
        "-f",
        "/path/to/docker-compose.yml",
        "run",
        "--rm",
        "-i",
        "mcp-w2vgrep"
      ]
    }
  }
}
Usage
Once configured, you can ask Claude to search semantically:
“Search my research notes for content related to ‘anxiety’”
Under the hood, mcp-w2vgrep expands “anxiety” to include similar terms, then ripgreps through your files. Results come back with similarity scores:
fear (0.82): /search/notes/session-03.md:47
  The patient described intense fear when approaching the building...
dread (0.76): /search/notes/session-07.md:23
  A sense of dread that builds throughout the workday...
panic (0.71): /search/notes/session-12.md:89
  Reported three panic episodes this week...
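The real server has its own plumbing, but the core move is small: ask the model for neighbours above a threshold, join them into one alternation pattern, hand that to ripgrep. Here is a rough Python sketch of that idea, assuming gensim and rg on PATH; none of these names come from the mcp-w2vgrep codebase.

# Rough sketch of expand-then-grep; not the actual mcp-w2vgrep implementation.
# Assumes a vector file on disk and ripgrep (rg) on PATH.
import re
import subprocess

from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("cc.en.300.vec", binary=False)

def semantic_grep(query: str, path: str, threshold: float = 0.7) -> str:
    # Keep the query plus every neighbour above the similarity threshold
    terms = [query] + [w for w, s in vectors.most_similar(query, topn=50) if s >= threshold]
    pattern = "|".join(re.escape(t) for t in terms)  # one alternation for ripgrep
    result = subprocess.run(
        ["rg", "--ignore-case", "--line-number", pattern, path],
        capture_output=True, text=True,
    )
    return result.stdout

print(semantic_grep("anxiety", "/search"))

The actual tool also carries the similarity score of each expanded term through to the output, which is where the numbers next to each match above come from.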
Tuning the Threshold
The threshold parameter controls how similar a word needs to be to count as a match:
- 0.7 (default): Strict. Only very close semantic matches.
- 0.5-0.6: Balanced. Good for exploratory searches.
- 0.4: Cast a wider net. Expect some noise.
- Below 0.4: You might get millions of characters of output. The tool will warn you, but it won't stop you from shooting yourself in the foot.
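If you want a feel for what a given threshold lets through before running a real search, you can inspect the neighbour list directly. Another local sketch against the same kind of model file, not the server's own API:

# See how many neighbours survive each threshold (results depend on the model).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("cc.en.300.vec", binary=False)
neighbours = vectors.most_similar("fear", topn=50)

for t in (0.7, 0.6, 0.5, 0.4):
    kept = [w for w, s in neighbours if s >= t]
    print(f"threshold {t}: {len(kept)} terms, e.g. {kept[:5]}")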
Limitation: Single Words Only
One important limitation: queries must be single words. No phrases, no multi-word concepts. This is a constraint of the underlying w2vgrep tool — Word2Vec embeddings are per-word, and there’s no phrase composition happening under the hood.
If you need to search for a concept that spans multiple words, find the closest single-word synonym. “Cognitive behavioral therapy” becomes “CBT” or just “therapy”. “Panic attack” becomes “panic”. It’s a constraint, but one that forces clarity about what you’re actually looking for.
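You can see why directly in the embedding vocabulary: per-word models simply have no entry for a multi-word phrase. A quick check, assuming the same kind of locally loaded model as in the earlier sketches:

# Per-word embeddings have no vocabulary entry for multi-word phrases.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("cc.en.300.vec", binary=False)

print("panic" in vectors.key_to_index)         # True for any common word
print("panic attack" in vectors.key_to_index)  # False; the phrase isn't a token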
Closing Thoughts
Regular grep is a precision instrument. You know exactly what you’re looking for, you write the pattern, you get the matches. It’s deterministic and reliable.
Semantic search is fuzzier. It’s for when you’re exploring, when you want to cast a net around a concept and see what you catch. It’s not a replacement for grep — it’s a different tool for a different job.
The source code is on GitHub. MIT licensed, because knowledge should be free and I’m too lazy to enforce anything else.