
Hermes Agent and Ollama: a UK consultant's view of the local AI agent stack in 2026

By The AI Consultancy
Image: a technical workstation in a small UK consultancy office in late afternoon light, with a Mac mini in sharp focus and an external monitor showing terminal output.

At a glance

  • What the stack is: an open-source agent framework (Hermes Agent, Nous Research, MIT licensed), a local inference engine (Ollama), and an open-weight model (the Qwen 3.5 family is the current best pick for tool-calling reliability), running on Apple Silicon hardware (Mac mini M4 Pro for solo and small-team deployments, Mac Studio M4 Max for larger ones).
  • What it gives the buyer: a persistent, learning, multi-channel AI assistant that operates over email, Telegram, Slack, Signal, and existing business tools, with no client data leaving the local network in normal operation.
  • What the operational burden looks like: the agent and model layers release approximately weekly. CVE monitoring, regression testing, and skill curation take 4 to 8 hours per month for a single deployment. Self-managed deployments tend to drift out of date inside the first quarter.
  • Where the gap to cloud LLMs sits: long-context analysis above 100K tokens, complex code generation, and the very hardest reasoning tasks. For ordinary professional-services drafting, summarisation, and structured extraction, the gap is workable.
  • Where this fits commercially: as a distinct service line for UK regulated professionals. We deliver it as Private AI Concierge, with a one-off implementation plus monthly retainer.

What is the local AI agent stack in 2026?

The local AI agent stack is the combination of an open-source agent framework, a local inference engine, and an open-weight large language model, running on hardware that the buyer owns and operates. Each layer is replaceable; the architecture is layered specifically so individual components can be swapped as the open-source ecosystem moves.

The reason this matters in 2026, and did not seriously matter in 2023 or 2024, is that the capability gap between hosted frontier models and the best open-weight models has narrowed enough that on-premises deployment is now a workable answer for most professional-services workloads. It is not the right answer for every buyer. For UK SMEs handling routine business data, a Claude or ChatGPT rollout under a standard Data Processing Agreement remains the correct default. For UK regulated professionals whose default answer to "has client data left the building" must be no, the local stack is now a serious commercial option rather than a research curiosity.

This article describes the stack we use, why we chose each component, the hardware footprint for a typical deployment, the integration patterns we apply, and the operational burden that comes with running it. The audience is UK technology buyers and IT leads who are evaluating local AI for their own practice. The bias is toward what we have actually deployed, not toward whatever is most fashionable on the open-source AI Twitter timeline.

Why these components, and not others

The stack we install consists of three named components plus a hardware platform. Each was chosen against a small number of decision criteria and we revisit the choices at every monthly retainer cycle.

Hermes Agent (Nous Research, MIT licensed)

Hermes Agent is the agent framework: the layer that turns a language model into a persistent, multi-channel assistant with tool use, scheduling, memory, and skill creation. It sits one layer above the inference engine and is the part of the system that actually orchestrates a workflow.

The decision criteria we apply at this layer are licence, security posture, release discipline, and architectural fit with multi-channel deployment.

  • Licence. MIT is the cleanest licence for commercial UK deployment. It allows redistribution, modification, and use in proprietary contexts without copyleft contamination, and it carries no AGPL-style network-use clauses that could trigger downstream disclosure obligations.
  • Security posture. Hermes Agent ships container-hardened by default. Software selection at the agent layer matters because the agent has access to email accounts, calendars, messaging platforms, and business systems. We do not install community agent frameworks with active high-severity CVEs in the trailing 12 months. The candidate pool is narrower than buyers expect once that filter is applied.
  • Release discipline. The Hermes Agent release cadence is currently approximately weekly. Releases are documented in a changelog with security-relevant changes flagged. This level of discipline is a precondition for production deployment in a regulated UK practice.
  • Architectural fit. Hermes Agent natively supports multi-channel deployment (email, Telegram, Slack, Signal, SMS gateways), persistent memory, autonomous skill creation, and Model Context Protocol (MCP) servers. These are the four features that determine whether the deployment feels like a personal AI or a chatbot.

We expect the agent layer to keep moving. We do not assume the framework we install today will be the framework we run in 18 months. The retainer covers re-evaluation at each quarterly architecture review.

Ollama as the local inference engine

Ollama is the local inference layer: the runtime that takes an open-weight model file and serves it as an API to the agent. We use Ollama because of three properties that matter operationally.

  • Apple Silicon optimisation. Ollama uses Metal acceleration on Apple Silicon and benefits directly from the unified memory architecture of M-series chips. On a Mac mini M4 Pro with 64GB of unified memory, this is the difference between workable and not workable for the model sizes we deploy.
  • Model swap simplicity. Pulling a new model is a single command. Switching the agent between models is a configuration change, not a redeploy. This matters when the open-weight model layer is releasing major new versions every few months.
  • Operational maturity. Ollama has been the default open-source local inference engine for long enough that the tooling around it (Homebrew packaging, systemd-equivalent agents on macOS, log capture, monitoring) is settled.

The alternatives we considered include running models directly through llama.cpp, vLLM, or LM Studio. Each has merits in specific contexts. For a managed, retainer-supported deployment on Apple Silicon, Ollama wins on operational simplicity and stability of interface.
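
For readers who want to see what that interface looks like, here is a minimal sketch of the agent-to-Ollama boundary: a single HTTP call to Ollama's local /api/chat endpoint. The model tag is illustrative; in our deployments it is a configuration value, which is why swapping models is a configuration change rather than a redeploy.

```python
# Minimal sketch: the agent layer only needs Ollama's local HTTP API.
# The model tag below is illustrative; use whatever tag `ollama pull` installed.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint
MODEL = "qwen3.5:32b"  # illustrative tag, read from the agent's config in practice

def ask(prompt: str) -> str:
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,  # return one JSON object rather than a token stream
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

if __name__ == "__main__":
    print(ask("Summarise today's priorities in three bullet points."))
```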

The Qwen 3.5 family for tool-calling reliability

The model layer is where the open-weight ecosystem moves fastest. Our current default is the Qwen 3.5 family. The decision is driven by tool-calling reliability rather than raw benchmark scores. An agent framework needs the model to emit structured tool calls correctly and consistently. Models that benchmark well on chat-style evaluation but stumble on structured output are not workable agent backends, regardless of their headline numbers.
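
What "tool-calling reliability" means in practice: given a tool schema, the model must return a parseable, correctly named tool call, run after run. The sketch below shows the kind of check we run, assuming an Ollama build and model that support the tools field of /api/chat; the model tag and the calendar tool are illustrative.

```python
# Sketch of the property we test for: given a tool schema, does the model
# emit a well-formed structured tool call? Model tag and tool are illustrative.
import requests

tools = [{
    "type": "function",
    "function": {
        "name": "create_calendar_event",
        "description": "Create a calendar event for the user",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "start": {"type": "string", "description": "ISO 8601 start time"},
                "duration_minutes": {"type": "integer"},
            },
            "required": ["title", "start"],
        },
    },
}]

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3.5:32b",  # illustrative tag
        "messages": [{"role": "user",
                      "content": "Book a 30 minute review call tomorrow at 10am."}],
        "tools": tools,
        "stream": False,
    },
    timeout=120,
).json()

calls = resp["message"].get("tool_calls", [])
# A workable agent backend returns a parseable, correctly named call here, consistently.
assert calls and calls[0]["function"]["name"] == "create_calendar_event", \
    "model failed to emit a structured tool call"
print(calls[0]["function"]["arguments"])
```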

For a Mac mini M4 Pro with 48 to 64GB of unified memory, the practical model size ceiling is in the 30 to 70 billion parameter range, depending on quantisation level. The Qwen 3.5 family covers this band well and the smaller variants run with comfortable memory headroom for the agent framework, the OS, and the buffers that messaging channels and MCP servers need.
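
The back-of-envelope arithmetic behind that ceiling, assuming roughly 4-bit quantisation at around 0.57 bytes per parameter once quantisation overhead is included; the figures are indicative, not a substitute for testing on the actual machine.

```python
# Rough weight footprint at ~4-bit quantisation (~0.57 bytes per parameter).
def weight_footprint_gib(params_billions: float, bytes_per_param: float = 0.57) -> float:
    return params_billions * 1e9 * bytes_per_param / 1024**3

for size in (14, 32, 70):
    print(f"{size}B parameters ≈ {weight_footprint_gib(size):.0f} GiB of weights")

# Roughly: 14B ≈ 7 GiB, 32B ≈ 17 GiB, 70B ≈ 37 GiB of weights alone.
# On a 64GB machine the 70B figure leaves limited headroom once the KV cache,
# the agent framework, macOS, and the messaging/MCP processes are counted,
# which is why 30 to 70 billion parameters is the practical band.
```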

We expect the model choice to move. Qwen, Llama, Mistral, and Hermes-tuned variants are each on independent release schedules. The retainer includes regression testing of new releases against your real workflows before any swap-in, which means a model upgrade is a commercial decision rather than a maintenance task the client has to manage.

Hardware fit on Apple Silicon

The hardware platform matters because the entire commercial argument for on-premises AI rests on the assumption that the buyer can run capable models on consumer-grade hardware they already understand operationally. Apple Silicon Macs are the practical answer in 2026 because of the unified memory architecture.

Mac mini M4 Pro for Solo and Practice tiers

The Mac mini with M4 Pro silicon, configured with 48GB or 64GB of unified memory, is the default platform for sole practitioners and small teams. The fit is driven by three factors:

  • Memory bandwidth. Unified memory means the GPU and CPU share the same physical RAM at high bandwidth. For LLM inference, this is materially better than the equivalent x86 CPU plus discrete GPU configuration at the same price point.
  • Power profile. The Mac mini draws under 100 watts under sustained inference load. It can sit on a small shelf in a working office with no special ventilation or cooling requirements.
  • Operational footprint. macOS is a familiar operating system to buyers and to their existing IT providers. FileVault, automatic unlock for headless operation, Touch ID-equivalent authentication, and standard mac-management tooling are all available without additional engineering work.

Indicative cost as of mid-2026: GBP 1,799 to GBP 2,499 depending on memory configuration.

Mac Studio M4 Max for the Chambers tier

For larger regulated practices and multi-partner firms, the Mac Studio with M4 Max silicon is the appropriate platform. The Studio supports up to 128GB of unified memory, which opens up larger models and parallel session capacity for multiple concurrent users. The same operational characteristics apply: low-noise operation, standard mac management, and Apple Silicon performance per watt.

Indicative cost: GBP 2,099 to GBP 4,000+ depending on memory and storage configuration.

Why not a server-class GPU box

The obvious alternative is a workstation or rack server with one or two consumer GPUs. We have evaluated this configuration and reject it as a default for UK professional-services deployment. The reasons are practical:

  • Power and cooling. A 350-watt GPU under sustained load needs ventilation that most professional-services offices are not designed for.
  • Operational complexity. Linux server administration is a different skill set from the buyer's existing IT support. Adding it to the engagement increases ongoing operational risk.
  • Cost. A workstation with adequate GPU memory for the same model sizes costs more than a Mac mini, before factoring in the additional administrative overhead.

For larger deployments where the workload genuinely requires GPU-class throughput, the answer is usually a managed off-site colocation rather than an on-premises server. We discuss that route only where the buyer specifically requests it.

MCP wiring patterns

The Model Context Protocol (MCP), introduced by Anthropic in late 2024 and now adopted across most major agent frameworks, is the standard way to connect an agent to business tools. For a Private AI Concierge deployment, the typical MCP server set we wire up looks like this:

  • Email. An MCP server for the buyer's email account (Microsoft 365 or Google Workspace), giving the agent the ability to read, draft, and send under explicit user control.
  • Calendar. Read access for awareness, write access only for confirmed bookings. The boundary is set in workflow design.
  • Document repository. Read access to a folder or library structure on the buyer's existing storage, typically OneDrive, Google Drive, or a local NAS. We do not migrate the document store; we connect to where it already lives.
  • Practice management or matter management software. Where the buyer uses a system with an MCP-compatible API or SDK, we wire it. Where they do not, we build a thin adapter that exposes the operations the agent needs.
  • Messaging channels. Telegram, Slack, Signal, SMS gateway. These are agent input channels rather than tool servers; the agent listens and responds across them as a multi-channel assistant.

The MCP wiring decisions are made in the workflow design sprint, not at install time. Connecting more tools is not always better. An agent with too many sources of context becomes harder to audit and harder to trust. We connect what is necessary for the named workflows, document the boundary, and review it at quarterly architecture reviews.
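
As an illustration of the thin-adapter pattern mentioned above, here is a minimal sketch of a custom MCP server built with the official MCP Python SDK (FastMCP). The practice management client and its methods are hypothetical stand-ins for whatever API or SDK the buyer's system actually provides; the point is that the agent only ever sees the operations we choose to expose.

```python
# Sketch of a thin MCP adapter: expose only the operations the agent needs.
# PracticeClient and its methods are hypothetical stand-ins for the vendor API.
from mcp.server.fastmcp import FastMCP

from practice_api import PracticeClient  # hypothetical wrapper around the vendor API

mcp = FastMCP("practice-management")
client = PracticeClient()

@mcp.tool()
def list_open_matters(fee_earner: str) -> list[dict]:
    """Return open matters for a named fee earner, with key dates."""
    return client.open_matters(fee_earner=fee_earner)

@mcp.tool()
def draft_attendance_note(matter_ref: str, note_text: str) -> str:
    """Save a draft attendance note against a matter; nothing is filed without review."""
    return client.save_draft_note(matter_ref, note_text)

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; the agent connects to it as an MCP server
```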

What the assistant actually does

The above is plumbing. The commercial value of the deployment comes from the named skills that the assistant runs repeatedly, configured to the buyer's specific workflow. A typical Solo-tier deployment for a sole practitioner solicitor might include skills for:

  • Drafting client correspondence from a structured brief, in the buyer's house style
  • Generating attendance notes from rough dictation
  • Triaging inbox into urgent, requires response, and FYI
  • Preparing a daily summary of matters due in the next 14 days
  • Producing first-draft time recording narratives at end of day
  • Handling the initial steps of new-client intake before a paid call is booked

For an IFA practice, the named skill set would be different: suitability review preparation, fact-find structured note-taking, ongoing review summaries, FCA Consumer Duty fair-value documentation. For a private dental practice, different again: consultation note drafting from dictation, referral letter generation, patient communication triage, treatment plan summarisation. The agent framework is general-purpose; the value comes from configuring it specifically.
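
To make "named skill" concrete, the sketch below shows the kind of information a skill definition captures: a trigger, a house-style prompt, the tools it may call, and whether output needs human review before anything leaves the building. This is purely illustrative structure, not Hermes Agent's actual schema; the field names are hypothetical.

```python
# Illustrative only: what a named skill captures. Not Hermes Agent's schema.
from dataclasses import dataclass, field

@dataclass
class NamedSkill:
    name: str
    trigger: str                  # how it is invoked: channel command, schedule, etc.
    prompt_template: str          # the house-style instructions the model is given
    allowed_tools: list[str] = field(default_factory=list)  # MCP tools it may call
    requires_review: bool = True  # nothing is sent or filed without human sign-off

daily_matters_summary = NamedSkill(
    name="daily-matters-summary",
    trigger="every weekday at 08:00",
    prompt_template=(
        "Summarise all matters with key dates in the next 14 days. "
        "Group by urgency. Keep each item to one sentence."
    ),
    allowed_tools=["list_open_matters"],
)
```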

This is the part of the deployment that makes the difference between a productive personal AI and a generic chatbot, and it is the part that the retainer is most directly buying. Skills drift, workflows change, and the buyer's preferences refine over time. A skill installed on day one that is not refined at month three is usually being used at half its potential or quietly abandoned.

The operational burden, and why a retainer matters

The operational burden of running a local AI agent stack in production is the single most underestimated element by buyers approaching this for the first time. The components are open source and the licence costs are zero. The labour cost is not zero.

For a single-deployment Private AI Concierge instance, the recurring operational work is roughly:

  • Hermes Agent patching: approximately weekly; 30 to 90 minutes per cycle
  • CVE monitoring across the agent, model runtime, and OS layers: continuous, with active response; 1 to 2 hours per month on average
  • Open-weight model upgrades with regression testing: every 6 to 12 weeks; 2 to 4 hours per cycle
  • Skill refinement based on usage: monthly; 1 to 2 hours
  • Monthly usage review and written summary: monthly; 1 hour
  • Quarterly architecture review: quarterly; 2 to 4 hours
  • Incident response (averaged across deployments): variable; 0 to several hours

That is approximately 4 to 8 hours of skilled work per deployment per month, on average, in a year where nothing dramatic happens. It is more in months where a major Hermes release lands, a model swap is due, or an OS-level CVE forces immediate patching.

For a buyer running a single deployment for their own practice, this work sits outside their professional core competency. For an IT support provider used to managing a dozen Microsoft 365 tenants, it is a different skill set. For a one-person firm that installed the stack itself over a weekend, it is the work that quietly stops happening after about three months.

The retainer is not a margin-extraction exercise. It is the structural answer to the question of who keeps the system current. Self-managed deployments without a retainer tend to drift out of date inside the first quarter, and a drifted local AI deployment with internet-connected channels is a security liability rather than an asset.

Software selection discipline

One of the recurring questions in client conversations is "why this stack and not the other one being talked about on social media this month". The answer in most cases is one of three things: licence incompatibility, an unacceptable CVE history, or an architecture that does not fit multi-channel deployment.

The CVE point is worth being explicit about. The local AI agent ecosystem in 2025 and 2026 has produced more than one high-profile framework with serious security flaws, including some agent-takeover-from-a-visited-webpage class vulnerabilities. The headline "MIT-licensed, capable, popular" framework is not always the right framework to install on a device that has access to a UK regulated practitioner's email and case management system.

Our software selection discipline, applied at every monthly retainer cycle:

  • The agent framework must be MIT, BSD, or Apache 2.0 licensed.
  • The framework must not have active high-severity CVEs (CVSS 7.0 and above) disclosed in the trailing 12 months that remain materially unmitigated.
  • Framework releases must follow a documented changelog with security-relevant changes flagged.
  • The framework must support the multi-channel architecture the buyer needs.

This filter has the effect of narrowing the candidate pool considerably, which is the intended effect.
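
Stated mechanically, the filter looks something like the sketch below. The candidate record and its fields are hypothetical; in practice the CVE check runs against the NVD and the project's own security advisories rather than a hard-coded structure.

```python
# Illustrative restatement of the selection filter; the candidate structure is hypothetical.
ACCEPTED_LICENCES = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0"}

def passes_selection_filter(candidate: dict) -> bool:
    if candidate["licence"] not in ACCEPTED_LICENCES:
        return False
    # No high-severity CVEs (CVSS >= 7.0) in the trailing 12 months left unmitigated
    if any(cve["cvss"] >= 7.0 and not cve["mitigated"]
           for cve in candidate["cves_last_12_months"]):
        return False
    if not candidate["documented_changelog"]:
        return False
    if not candidate["supports_multichannel"]:
        return False
    return True
```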

What the buyer should expect from us

The Private AI Concierge engagement is structured around four stages: a free 30-minute scoping call, a paid workflow design sprint of 1 to 2 weeks, an implementation phase of 2 to 4 weeks, and an ongoing monthly retainer.

The implementation phase delivers hardware procurement, install, configuration, the named skill set, channel and MCP wiring, security hardening, written documentation, and two handover sessions. The retainer takes over from there. The first three months tend to involve more skill refinement than the steady state, as the buyer's actual use patterns become visible.

Pricing is published at tier-band level on the Private AI Concierge service page. Solo, Practice, and Chambers tiers are differentiated by team size, hardware platform, and concurrent-user capacity.

Where local AI does not fit

It is worth closing with the cases where the local AI stack is not the right answer.

  • Where the buyer's data sensitivity already permits a cloud LLM rollout under a standard DPA, the cloud route is simpler, cheaper to run, and has access to the strongest current frontier models. Claude Implementation is the appropriate service line.
  • Where the workload genuinely needs frontier-model-class capability for long-context reasoning above 100K tokens, complex code generation, or the very hardest planning tasks, local models will be the limiting factor. Hybrid mode with Claude API fallback is one answer (a minimal routing sketch follows this list). A pure cloud rollout is another.
  • Where the buyer wants to install something themselves over a weekend and not pay an ongoing retainer, the operational burden will eat the deployment. We tell buyers this directly during the scoping call and decline engagements where the retainer is not commercially feasible.
  • Where the buyer's data volume is low enough that an off-the-shelf consumer subscription suffices, the entire on-premises argument is overhead.
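
The hybrid pattern referenced above is, at its simplest, a routing decision: stay local by default, and fall back to a cloud API only above a context threshold and only where the buyer has approved that class of data leaving the network. A minimal sketch, with illustrative model names and an illustrative threshold:

```python
# Sketch of the hybrid pattern: local by default, cloud fallback only where approved.
# Model identifiers and the threshold are illustrative.
import requests
import anthropic

LOCAL_URL = "http://localhost:11434/api/chat"
LOCAL_MODEL = "qwen3.5:32b"        # illustrative local tag
CLOUD_MODEL = "claude-sonnet-4-5"  # illustrative cloud model id
CONTEXT_THRESHOLD_TOKENS = 100_000

def rough_token_count(text: str) -> int:
    return len(text) // 4  # crude heuristic: roughly 4 characters per token

def run(prompt: str, cloud_approved: bool = False) -> str:
    if rough_token_count(prompt) < CONTEXT_THRESHOLD_TOKENS or not cloud_approved:
        resp = requests.post(LOCAL_URL, json={
            "model": LOCAL_MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        }, timeout=300)
        resp.raise_for_status()
        return resp.json()["message"]["content"]
    # Cloud fallback: data leaves the network, so this branch is opt-in per workload.
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model=CLOUD_MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text
```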

For UK regulated professionals where cloud is the wrong answer, local AI is now a serious commercial option. The components are mature enough, the hardware is workable, and the operational pattern is repeatable. What is missing in most cases is not technology; it is the consulting and custodianship layer that turns a stack of open-source components into a system that a sole practitioner can rely on for their daily work.

If you are evaluating an on-premises AI assistant for a UK regulated practice, see the Private AI Concierge service page for the engagement structure, pricing tiers, and the routes into a free 30-minute scoping call.

Frequently asked questions

What is the local AI agent stack in plain language?
An open-source agent framework, a local inference engine, and an open-weight model running on hardware the buyer owns. The agent framework turns the model into a persistent multi-channel assistant. The inference engine actually runs the model. The hardware platform sits on the buyer's network. In our default configuration these are Hermes Agent, Ollama, the Qwen 3.5 family, and Apple Silicon Macs.
Why Hermes Agent specifically?
Hermes Agent is MIT licensed, container-hardened, has a documented release discipline, and natively supports multi-channel deployment, persistent memory, autonomous skill creation, and MCP servers. Those four features are what determine whether the deployment feels like a personal AI rather than a chatbot. We do not install community agent frameworks with active high-severity CVEs in the trailing 12 months, which narrows the candidate pool considerably.
Why Apple Silicon rather than a Linux server with GPUs?
The unified memory architecture of M-series chips suits LLM inference well. A Mac mini M4 Pro with 64GB of unified memory will run the model sizes most professional-services workloads need, in a low-power, low-noise device that can sit on a small shelf in a working office. A workstation with a discrete GPU at equivalent capability needs more power, more cooling, and a different IT support skill set. For larger deployments we use the Mac Studio M4 Max, which supports up to 128GB of unified memory.
Why the Qwen 3.5 family rather than Llama, Mistral, or a Hermes-tuned model?
Tool-calling reliability is the deciding factor at the agent backend. The Qwen 3.5 family currently emits structured tool calls correctly and consistently at the model sizes that fit on a Mac mini M4 Pro. Models that benchmark well on chat-style evaluation but stumble on structured output are not workable agent backends. We expect the choice to move; the retainer covers regression testing of new releases against your real workflows before any swap-in.
How much operational work is involved in keeping it running?
Approximately 4 to 8 hours per month of skilled work per deployment in a year where nothing dramatic happens. That covers weekly Hermes Agent patching, CVE monitoring, model upgrades with regression testing, skill refinement, and the monthly usage review. Self-managed deployments tend to drift out of date inside the first quarter, which is why we deliver this as a retainer service rather than a one-off install.
Can a buyer install this themselves and skip the retainer?
Technically yes. Commercially we do not recommend it. The labour cost of staying current with a fast-moving open-source agent and model ecosystem is non-trivial, and a drifted local AI deployment with email and case-management access is a security liability rather than an asset. We tell buyers this directly during the scoping call and decline engagements where the retainer is not commercially feasible.
