Running local models on Macs gets faster with Ollama's MLX support

Photo: Ollama
Thirty-two gigabytes of RAM is the minimum needed to test the latest capabilities of the Ollama platform on a Mac. The popular tool for running large language models locally has introduced support for MLX, Apple's open-source framework that dramatically optimizes machine-learning workloads on M1-series chips and newer. Version 0.19, currently available as a preview, also adds support for Nvidia's NVFP4 format, enabling significantly more efficient model compression and better cache management. For end users and developers this is a breakthrough in working with demanding models such as Qwen2.5-Coder-32B, the first to take full advantage of the new architecture. At a time of growing frustration with API limits and the high subscription costs of tools like Claude Code or ChatGPT, running AI locally is becoming a viable alternative. Thanks to deeper integration with Visual Studio Code, developers gain powerful support directly on their own hardware while maintaining full data privacy and independence from the cloud. Shifting the computational load to local Apple Silicon chips is no longer the domain of hobbyists alone; it is becoming a professional standard in everyday creative and programming work.
A breakthrough in the performance of local language models on Mac computers has become a reality. Ollama, currently the most popular runtime for running LLMs (Large Language Models) on personal computers, has announced support for the MLX framework. This open-source solution from Apple, designed specifically for machine learning, squeezes the maximum out of the Apple Silicon architecture. Combined with new compression methods and cache optimization, the change dramatically pushes the boundaries of what users can achieve without relying on the cloud.
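To ground what "running a model locally" looks like in practice, here is a minimal sketch that queries a locally served model through Ollama's HTTP API. It assumes Ollama is running on its default port (11434) and that the model referenced below has already been pulled; the model tag is illustrative, not an official recommendation from the release.

```python
# Minimal sketch: prompting a model served by a local Ollama instance.
# Assumes the Ollama server is running on localhost:11434 and the model
# tag below has been pulled beforehand (the tag is illustrative).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5-coder:32b",   # illustrative local model tag
        "prompt": "Write a Swift function that reverses a string.",
        "stream": False,                # return a single JSON object
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```

Nothing leaves the machine: the request, the weights, and the generated text all stay on the local host.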
The timing of this update is no accident. Interest in local AI instances is surging, driven by the success of projects such as OpenClaw. That project gained over 300,000 stars on GitHub in record time and became the foundation for high-profile experiments like Moltbook. It caused a particular stir in China, but the wave is spreading across the globe, prompting professionals to look for alternatives to paid subscriptions and the limits imposed by the AI sector's giants.
MLX Architecture and the End of Resource Waste
The key to Ollama's new performance is deep integration with MLX. Until now, many AI tools on macOS operated in a generic way that did not always take full advantage of the unified memory in M-series chips. With MLX, Ollama can communicate with the GPU and the Neural Engine in a nearly native way. This translates not only into more tokens generated per second but, above all, into smarter resource management when multiple tasks run at once.
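For readers unfamiliar with MLX, a minimal sketch (separate from Ollama's internal implementation, which is not public in this form) illustrates the two traits the integration leans on: arrays live in Apple Silicon's unified memory, so the CPU and GPU share them without explicit copies, and computation is lazy until it is explicitly evaluated.

```python
# A minimal MLX sketch, not Ollama's actual code: unified-memory arrays
# plus lazy evaluation on Apple Silicon.
import mlx.core as mx

a = mx.random.normal((4096, 4096))
b = mx.random.normal((4096, 4096))
c = mx.matmul(a, b)   # builds a lazy computation graph, nothing runs yet
mx.eval(c)            # materializes the result on the default (GPU) device
print(c.shape, c.dtype)
```

Because there is a single memory pool, an LLM runtime built on MLX does not pay the host-to-device copy cost that dominates on systems with discrete GPUs.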
In parallel with support for Apple's framework, Ollama has introduced support for Nvidia's NVFP4 format. This is an advanced model compression (quantization) method that significantly reduces VRAM requirements while maintaining high response precision. For Mac users, it means that models which previously required massive amounts of RAM now fit into smaller hardware configurations, and they run faster thanks to an improved data-caching system.
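A rough back-of-envelope calculation shows why 4-bit formats such as NVFP4 matter on memory-constrained machines. The sketch below counts weights only and ignores the KV cache and runtime overhead, so real footprints are somewhat higher.

```python
# Back-of-envelope memory estimate for a ~32B-parameter model at different
# weight precisions (weights only; KV cache and overhead excluded).
params = 32e9

for name, bits in [("FP16", 16), ("8-bit", 8), ("4-bit (e.g. NVFP4)", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{name}: ~{gib:.0f} GiB")

# Prints roughly: FP16 ~60 GiB, 8-bit ~30 GiB, 4-bit ~15 GiB
```

The jump from 16-bit to 4-bit weights is what moves a 30B-class model from "workstation only" into the reach of a 32GB laptop.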
Qwen3.5 and the Entry Barrier for Professionals
The new functionality debuted in Ollama 0.19, which currently carries preview status. For now, the list of supported models that exploit MLX's full potential is short but impressive, headed by a 35-billion-parameter variant of Alibaba's Qwen3.5 model. The choice is no accident; the Qwen family has earned recognition for its excellent quality-to-size ratio, especially in tasks involving logic and programming.
- Hardware Requirements: Mac computer with Apple Silicon processor (M1, M2, M3, M4 or newer).
- RAM: Minimum 32GB of unified memory for the Qwen3.5 35B model.
- Software Version: Ollama 0.19 (Preview).
- Key Technologies: MLX Framework, NVFP4 compression, improved caching.
While a 32GB RAM requirement may seem high for the average home user, it is standard for professionals working in data analysis or programming. Ollama recognizes this trend, as evidenced by its recently expanded integration with Visual Studio Code. Developers are increasingly moving away from tools like Claude Code or ChatGPT Codex in favor of local solutions, avoiding high subscription costs and restrictive rate limits that can paralyze work at the least convenient moment.
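As a sketch of what such an editor integration typically does under the hood: Ollama exposes an OpenAI-compatible endpoint, so existing tooling can be pointed at the local server instead of a paid cloud API. The model tag below is illustrative, and the `openai` Python package is assumed to be installed.

```python
# Sketch: pointing OpenAI-compatible tooling at a local Ollama server
# instead of a cloud endpoint. Model tag is illustrative.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # local Ollama server
    api_key="ollama",                      # required by the client, unused locally
)

completion = client.chat.completions.create(
    model="qwen2.5-coder:32b",             # any locally pulled model tag
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Explain Swift optionals in two sentences."},
    ],
)
print(completion.choices[0].message.content)
```

Swapping the base URL is often the entire migration: no rate limits, no per-token billing, and no code leaving the machine.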

Privacy and Independence Drive Change
The development of local models is not just a matter of pure performance, but above all, data sovereignty. Companies and independent creators are increasingly willing to invest in more powerful Apple hardware, knowing that their source code or confidential documents will never leave the local disk. The success of OpenClaw proved that there is a huge demand for tools that give the user full control over the model's inference process.
Thanks to the new optimizations in Ollama, the line between a model running in the cloud and one running locally is beginning to blur. The ability to run a model with a scale of 35 billion parameters on a laptop with the fluidity offered by MLX is a game-changer. The Apple Silicon architecture, which was designed from the beginning with energy efficiency and memory bandwidth in mind, has finally seen software that fully exploits its unique features in the context of generative artificial intelligence.
MLX support in Ollama is just the beginning of a broader consolidation of AI tools around dedicated hardware. As the library of supported models expands, Apple Silicon will become the default platform for AI developers who value mobility without compromising on performance. Local artificial intelligence is ceasing to be the domain of enthusiasts building powerful workstations with multiple graphics cards and is becoming a real work tool available within reach of every MacBook owner with an adequate supply of RAM.