There’s a long-held belief out there that MCP’s are bad, and that you should actually use CLI’s instead, if you want to save on token spend when it comes to LLM tool use. To me that argument never really made any sense, because you still have to provide the LLM context so it knows what CLI to run, how to run it and how to interpret the result of it. Things that the MCP does for you. How come the CLI takes fewer tokens?

Sure, there’s probably some communication overhead, but I can’t see how it would be something significant enough to entirely forgo the benefits of a standardized communication protocol that is the MCP and to instead favor the wild west of self-made CLI integrations, which are more costly to create and to maintain, because you have to make those, whereas MCP servers are ready-made for LLM integration.

Talk is cheap: experiment time

Instead of talking in hypotheticals, let’s run an actual experiment, shall we? At work I help build the CodeScene CodeHealth MCP, and it has a tool called code_health_review. Simply put, it takes a file and analyses its code health.

What it does is not all that important, but what is important is that this MCP tool is just a thin wrapper over the CodeScene CLI’s cs review {file} call, which makes it ideal to use for our experiment.

I will be using the exact same context for the CLI test as is in the MCP’s tool description, that way the context the LLM has is identical in knowing what input it needs to give and what output it can expect, and the only difference we should see is what we’re actually interested in - the communication overhead difference between the MCP and the CLI.

For this test I’ve created a little script that uses Claude Sonnet 4.6.

The MCP approach is to run the code_health_review tool with full description + JSON schema input_schema in the tools array, have the LLM emit a tool_use block with {"file_path": "/path/to/repo/calculator.py"}, return the tool result and have the LLM summarize.

The CLI approach is to use the equivalent instructions of the MCP in the system prompt (how to run cs review, what it returns, score interpretation), have a generic bash tool in tools array for command execution, have the LLM emit a tool_use block with ‌{"command": "cs review --output-format=json /path/to/repo/calculator.py"}, return the tool result and have the LLM summarize.

Experiment result: no difference

Based on this little experiment it seems my gut feeling was right - there’s no significant difference between the MCP and the CLI when it comes to direct tool calling.

The MCP used 5 fewer input tokens on the initial request (900 vs 905). It seems Anthropic’s internal tool schema representation is approximately the same cost as equivalent free-form text in a system prompt.

The per-call overhead is negligible. The assistant’s tool call + the tool result being appended to the context took 2 additional tokens over the CLI (93 for the MCP vs 91 for the CLI).

The MCP tool definition (name + description + JSON schema) seems to tokenize to roughly the same as equivalent system prompt instructions + a bash tool definition. The JSON schema structure of the MCP (type, properties, required) is offset by the CLI approach needing both the instructional text AND a generic bash tool definition.

The LLM summary output token spend is inconclusive because it differs on each run. Sometimes the CLI takes more tokens, sometimes the MCP does. Generally the difference here doesn’t seem to exceed more than ~70 tokens, and this is irrelevant to our experiment anyway as the summary is based on the prompt given, which in our case is identical.

The only real argument that could be made against the MCP’s is that MCP’s will load all their tools to the context window at all times, whereas with the CLI you can either pick and choose what you add to context window, or use SKILL’s, which can be deferred and thus loaded on-demand.

Except that argument also falls flat on its face, because MCP tools are also deferred. By now most mainstream clients use some sort of MCP tool deferral mechanism. OpenAI does, Anthropic does, with many others in progress of adding it.

SKILL’s vs MCP’s

SKILL’s and MCP’s are two different things, aimed at solving different problems, but that doesn’t stop people from using one over the other or one as the other. In a world where both use deferred search, the initial context usage does not really differ, and so you might think it doesn’t matter which you use for your use case, but it does.

SKILL’s, once loaded into the context window, use less tokens because they are only loaded once into the context window, but this comes at a cost of accuracy over subsequent requests (i.e context pollution) and compaction processes.

MCP’s, once loaded into the context window, use more tokens because they are added to every request, but this means that they keep their accuracy and are more reliable than SKILL’s. Keep in mind that only the MCP tools that the tool search found are loaded into the context window, not all tools of any given MCP server.

The two are behaviorally different, so I would really consider your use case before going ahead and using one as the other. They do complement each other well though.

Conclusion

There is virtually no difference between MCP’s, CLI’s or a mix of CLI’s with SKILL’s when it comes to input token spend. There is behavioral differences going forward in how SKILL’s and MCP’s differ, which come with their own pro’s and con’s, and it’s clear that one is not really a substitute for the other.

The narrative that MCP’s consume significantly more tokens is, well, contextual. It is only once the MCP tools have been searched and loaded into the context window that they do, but as you go about your day otherwise, they won’t. If you have 1000 MCP tools it will NOT load 1000 tools into the context window.