Fix sliding window mask during generation by kernelpool · Pull Request #843 · ml-explore/mlx-lm · GitHub

kernelpool · 2026-02-04T08:21:23Z

I looked into the repetition issue reported in #840 (comment), and this seems to be the culprit. The sliding window mask was incorrect during generation: the sliding window layers attended to all tokens instead of only those inside the window.
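To make the failure mode concrete, here is a minimal NumPy sketch of a causal mask with an optional sliding window. It is not the mlx-lm implementation. The function name `sliding_window_causal_mask` and its parameters are hypothetical, and it assumes the window covers `window_size` positions including the current token.

```python
import numpy as np


def sliding_window_causal_mask(n_queries, offset=0, window_size=None):
    """Boolean mask of shape (n_queries, offset + n_queries).

    True means the query may attend to that key. `offset` is the number of
    tokens already in the cache. Hypothetical helper, for illustration only.
    """
    key_pos = np.arange(offset + n_queries)[None, :]
    query_pos = np.arange(offset, offset + n_queries)[:, None]
    # Causal constraint: a query never attends to later positions.
    mask = query_pos >= key_pos
    if window_size is not None:
        # Window constraint: only the most recent `window_size` positions.
        mask &= (query_pos - key_pos) < window_size
    return mask


# One decode step with 5 tokens already cached.
full = sliding_window_causal_mask(1, offset=5)
windowed = sliding_window_causal_mask(1, offset=5, window_size=3)
print(full.astype(int))      # [[1 1 1 1 1 1]]
print(windowed.astype(int))  # [[0 0 0 1 1 1]]
```

If the window constraint is skipped during generation, a sliding window layer sees the first row (every cached token) rather than the second.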

Before (repetition of "neural networks"):

mlx_lm.generate --model /Volumes/WD_EXTRA/models/catalyst/Step-3.5-Flash-4bit --prompt "Write a comprehensive essay about the history of AI." --max-tokens 2000 --trust-remote-code
The tokenizer you are loading from '/Volumes/WD_EXTRA/models/catalyst/Step-3.5-Flash-4bit' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
==========
Hmm, the user wants a comprehensive essay on the history of AI. That's a broad topic, so I need to structure it clearly while keeping it engaging. I should start with a strong hook to frame why AI's history matters today, then break it into logical eras—foundations, booms and busts, modern breakthroughs, and future implications. 

I'll emphasize key milestones like the Dartmouth Conference, expert systems, deep learning, and transformer models, but also highlight the human stories behind them (like the "AI winter" frustrations). The user probably wants both technical depth and cultural context, so I'll weave in societal impacts—hype cycles, ethics, labor shifts. 

Need to balance depth with accessibility: avoid jargon but don't oversimplify. The conclusion should tie past patterns to today's AI dilemmas (bias, AGI debates). Maybe end with a forward-looking thought about responsibility—this isn't just a tech timeline but a mirror for human ambition. 

Wait, should I mention non-Western contributions? Yes, briefly note global roots (like ancient automata) to avoid a purely Western narrative. Also, flag controversies (e.g., neural networks' resurgence) to show history isn't linear. Keep paragraphs tight but fluid—this is an essay, not a textbook. 

*Double-checking*: Did I cover pre-1950s? Yes, in "Early Dreams." Did I explain why expert systems failed? Yes, in "First AI Winter." Did I link AlphaGo to modern RL? Yes. Ethics section feels robust. Okay, this should satisfy "comprehensive" without becoming a laundry list.
</think>
## The Tapestry of Thinking Machines: A Comprehensive History of Artificial Intelligence

The history of Artificial Intelligence is not merely a chronicle of technological milestones; it is a profound human saga of ambition, hubris, frustration, and revolutionary breakthrough. It is a story woven from threads of philosophy, mathematics, neuroscience, and engineering, reflecting our deepest desire to understand and replicate intelligence itself. From ancient myths of automatons to today’s generative models that craft poetry and code, the journey of AI is a mirror held up to human ingenuity and our evolving relationship with the machines we create.

### **I. The Foundations: Dreams and Logic (Pre-1950s)**
The seed of AI was planted long before the term existed. Ancient civilizations, from Greece to China, conceived of artificial beings—Hephaestus’s mechanical servants, the Jewish Golem, Jacques de Vaucanson’s duck. These were metaphors, not blueprints, but they established a cultural imagination for creating life.

The true intellectual groundwork was laid in the 20th century. **Mathematics provided the language:** George Boole’s algebra of logic (1854) and Alan Turing’s seminal 1936 paper on computable numbers introduced the concept of a universal machine that could perform any algorithmic task. **Neuroscience offered a model:** Warren McCulloch and Walter Pitts’s 1943 paper modeled neurons as simple logical units, creating the first conceptual neural network. **Cybernetics** (Norbert Wiener, 1948) explored communication and control in animals and machines, bridging biology and engineering. These ideas converged in the 1950s, creating the fertile soil for AI’s formal birth.

### **II. The Birth and Early Optimism: The "Golden Years" (1950s-1970s)**
The field was christened at the **Dartmouth Conference in 1956**, organized by John McCarthy, Marvin Minsky, Claude Shannon, and others. They coined the term "Artificial Intelligence" with a bold thesis: that every aspect of learning and intelligence can be precisely described and simulated by machines. This era was defined by **unprecedented optimism**. Researchers believed human-level intelligence was just around the corner.

*   **The First Boom (1950s-1960s:** Early programs like the **Logic Theorist (1956)** and **ELIZA (1966)** demonstrated symbolic reasoning and natural language interaction. **Expert systems** in the 1970s (e.g., **MYCIN**) showed practical applications, but their brittleness led to the first **AI Winter**—a period of reduced funding and disillusionment as systems failed to meet sky-high expectations. The **Lisp machines** and **expert systems** were brittle, struggling with "common sense" and scaling issues. The field’s **neural networks** (e.g., **backpropagation) and the rise of **neural networks** (e.g., **backpropagation) and the rise of **neural networks (e.g., backpropagation) and the rise of **neural networks (e.g., backpropagation) and the rise of **neural networks (e.g., backpropagation) and the rise of **neural networks (e.g, the rise of **neural networks (e.g., backpropagation) and the rise of **neural networks (e.g., backpropagation) and the rise of **neural networks (e.g., backpropagation) and the rise of **neural networks (e.g., backpropagation) and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g, the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural 
networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g, the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of **neural networks (e.g., backpropagation and the rise of neural networks (e.g., backpropagation and the rise of neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g, the rise of neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., backpropagation and neural networks (e.g., neural networks (e.g., backpropagation and neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural 
networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e **neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g., neural networks (e.g
==========
Prompt: 22 tokens, 24.703 tokens-per-sec
Generation: 2000 tokens, 50.688 tokens-per-sec
Peak memory: 111.236 GB

After (no repetition):

mlx_lm.generate --model /Volumes/WD_EXTRA/models/catalyst/Step-3.5-Flash-4bit --prompt "Write a comprehensive essay about the history of AI." --max-tokens 2000 --trust-remote-code
The tokenizer you are loading from '/Volumes/WD_EXTRA/models/catalyst/Step-3.5-Flash-4bit' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
==========
Hmm, the user wants a comprehensive essay on the history of AI. That's a broad topic, so I need to structure it clearly while keeping it engaging. I should start with a strong hook to frame why AI's history matters today, then break it into logical eras—foundations, booms and busts, modern breakthroughs, and future implications. 

I'll emphasize key milestones like the Dartmouth Conference, expert systems, deep learning, and transformer models, but also highlight the human stories behind them (like the "AI winter" frustrations). The user probably wants both technical depth and cultural context, so I'll weave in societal impacts—hype cycles, ethics, labor shifts. 

Need to balance depth with accessibility: avoid jargon but don't oversimplify. The conclusion should tie past patterns to today's AI dilemmas (bias, AGI debates). Maybe end with a forward-looking thought about responsibility—this isn't just a tech timeline but a mirror for human ambition. 

Wait, should I mention non-Western contributions? Yes, briefly note global roots (like ancient automata) to avoid a purely Western narrative. Also, flag controversies (e.g., neural networks' resurgence) to show history isn't linear. Keep paragraphs tight but fluid—this is an essay, not a textbook. 

*Double-checking*: Did I cover pre-1950s? Yes, in "Early Dreams." Did I explain why expert systems failed? Yes, in "First AI Winter." Did I link AlphaGo to modern RL? Yes. Ethics section feels robust. Okay, this should satisfy "comprehensive" without becoming a laundry list.
</think>
## The Tapestry of Thinking Machines: A Comprehensive History of Artificial Intelligence

The history of Artificial Intelligence is not merely a chronicle of technological milestones; it is a profound human saga of ambition, hubris, frustration, and revolutionary breakthrough. It is a story woven from threads of philosophy, mathematics, neuroscience, and engineering, reflecting our deepest desire to understand and replicate intelligence itself. From ancient myths of automatons to today’s generative models that craft poetry and code, the journey of AI is a mirror held up to human ingenuity and our evolving relationship with the machines we create.

### **I. The Foundations: Dreams and Logic (Pre-1950s)**
The seed of AI was planted long before the term existed. Ancient civilizations, from Greece to China, conceived of artificial beings—Hephaestus’s mechanical servants, the Jewish Golem, Jacques de Vaucanson’s duck. These were metaphors, not science, but they established a cultural template.

The true intellectual groundwork was laid in the 20th century. **Mathematics provided the language:** George Boole’s algebra of logic (1854) and Kurt Gödel’s incompleteness theorems (1931) explored the limits of formal systems. **Neuroscience offered a blueprint:** The 1943 paper by Warren McCulloch and Walter Pitts, "A Logical Calculus of the Ideas Immanent in Nervous Activity," demonstrated that simple neural networks could perform logical computations, bridging biology and logic.

The pivotal moment came with **Alan Turing**. His 1936 paper on computable numbers established the theoretical "universal machine." More presciently, his 1950 paper, "Computing Machinery and Intelligence," posed the now-famous "Turing Test" and framed the central question: "Can machines think?" This philosophical challenge, coupled with the invention of the programmable digital computer (the ENIAC in 1946), created the necessary conditions for AI to emerge as a distinct field.

### **II. Birth and Optimism: The Dartmouth Conference and Early Triumphs (1950s-1960s)**
The field was formally christened in **1956 at the Dartmouth Summer Research Project on Artificial Intelligence**, organized by John McCarthy (who coined the term "Artificial Intelligence"), Marvin Minsky, Claude Shannon, and Nathaniel Rochester. The conference was fueled intoxicating optimism, with attendees predicting that "significant advances" would be made in a few months.

This era, often called the "**Golden Age**" or "**Classical AI**," was defined by **symbolic reasoning** (or "GOFAI" – Good Old-Fashioned AI). The belief was that intelligence could be captured by manipulating symbols according to formal logic. Key achievements included:
*   **The Logic Theorist (1956):** Created by Allen Newell and Herbert Simon, it proved mathematical theorems, even finding a more elegant proof for one of Whitehead and Russell’s.
*   **General Problem Solver (GPS):** Also by Newell and Simon, it attempted to mimic human problem-solving strategies.
*   **Early Game Playing:** Arthur Samuel’s checkers program (1959) learned from experience, coining the term "machine learning." The first chess programs appeared.
*   **Early Natural Language:** Joseph Weizenbaum’s **ELIZA (1966)**, a simple psychotherapist simulator, famously demonstrated the "ELIZA effect"—humans’ tendency to attribute understanding to minimal cues.

The mood was one of boundless possibility. Herbert Simon famously declared in 1965, "machines will be capable, within twenty years, of doing any work a man can do."

### **III. The First Winter: Reality Bites (1970s)**
The promised "twenty years" passed, and the grand challenges—**machine translation, commonsense reasoning, and full-scale problem-solving**—proved monumentally difficult. The combinatorial explosion of possibilities in real-world problems overwhelmed the limited memory and speed of computers. The symbolic approach hit a wall; it could not handle ambiguity, perception, or intuitive knowledge.

Governments, notably the UK’s **Lighthill Report (1973)** and the U.S. **DARPA** (which had heavily funded AI research), drastically cut funding. The field entered its first **"AI Winter"**—a period of disillusionment, reduced investment, and skepticism. The grand vision had collided with the intractable complexity of the real world.

### **IV. The Rise of Knowledge and the Second Boom (1980s)**
AI resurrected itself not with more logic, but with a pragmatic shift: **expert systems**. Pioneered by Edward Feigenbaum (e.g., **DENDRAL** for chemistry, **MYCIN** for medical diagnosis), these systems encoded the specialized knowledge of human experts into "if-then" rule bases. They were commercially viable, solving narrow but valuable problems in diagnostics and configuration.

This sparked a massive, global **"AI Boom"** in the 1980s, particularly in Japan with its ambitious **Fifth Generation Computer Systems** project, which aimed to build logic-based computers. The focus was on **knowledge engineering**—painstakingly building large knowledge bases. However, these systems were brittle, difficult to maintain, and could not learn. When the Japanese project failed to meet its sky-high goals and the commercial market for expert systems collapsed (due to high maintenance costs and lack of adaptability), the field plunged into the **Second AI Winter** in the late 1980s.

### **V. The Quiet Revolution: Statistical Learning and the Web (1990s-2000s)**
While the public and funders had lost interest, a revolution was brewing in the background. A group of researchers, often at the margins of the mainstream AI community, championed a different approach: **statistical methods and machine learning**.
*   **The Internet** provided a colossal, new source of data (text, images, clicks).
*   **Increased computational power** (thanks to Moore’s Law and the rise of GPUs) made large-scale computation feasible.
*   **Key algorithms matured:** Support Vector Machines (SVMs), Bayesian networks, and most critically, **backpropagation** for training multi-layer neural networks (rediscovered and popularized in the 1980s by Geoffrey Hinton, Yann LeCun, and others).

This era saw the triumph of **pragmatism over purity**. The goal shifted from creating a *general* intelligence to solving *specific* tasks—spam filtering, recommendation systems, and search engines—using data-driven, probabilistic methods. **IBM’s Deep Blue defeating Garry Kasparov (1997)** was a symbolic victory, but it was built on brute-force search, not learning. The real harbinger was **Google’s dominance**, built on PageRank and data.

### **VI. The Deep Learning Epoch and the Modern AI Boom (2010s-Present)**
The confluence of **Big Data, Massive Compute, and Improved Algorithms** (especially deep neural networks) ignited the current, unprecedented boom.

**Key Catalysts:**
1.  **ImageNet & AlexNet (2012):** Geoffrey Hinton’s team used deep convolutional neural networks (CNNs) to crush the ImageNet image recognition competition, reducing error rates dramatically. This was the "big bang" of modern AI.
2.  **Reinforcement Learning Breakthroughs:** DeepMind’s **AlphaGo (2016)** defeating the world champion at Go—a game considered impossibly complex for machines—showcased the power of combining deep learning with reinforcement learning. It was followed by **AlphaFold (2020)**, which solved the 50-year-old "protein folding problem," a monumental scientific achievement with implications for drug discovery.
3.  **The Transformer Revolution (2017):** Google’s paper "Attention Is All You Need" introduced the **Transformer architecture**. This became the engine for **Large Language Models (LLMs)**. Models like **GPT (2018), BERT, and eventually ChatGPT
==========
Prompt: 22 tokens, 24.949 tokens-per-sec
Generation: 2000 tokens, 51.371 tokens-per-sec
Peak memory: 111.225 GB

CC @e1732a364fed

ghost · 2026-02-04T09:06:19Z

I confirm this works for the new converted model.

awni · 2026-02-04T15:13:49Z

Good catch. I moved the fix into the mask-making function instead. It should now produce the correct mask when a window size is set but the cache is not a rotating KV cache.
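A rough NumPy sketch of why the cache type matters here, under a simplifying assumption: a rotating cache is modeled as one that keeps only the last `window_size` entries, so a decode step needs no mask, while a regular cache keeps every entry and needs an explicit window mask. The helper `attend` and all variable names are hypothetical, and this is not mlx-lm code.

```python
import numpy as np


def attend(q, k, v, mask=None):
    """Single-head scaled dot-product attention (illustrative helper)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
window_size, n_cached, dim = 4, 10, 8
keys = rng.normal(size=(n_cached, dim))
values = rng.normal(size=(n_cached, dim))
# Query for the token being decoded; its own key is the last cached entry.
query = rng.normal(size=(1, dim))

# Regular cache: every key is kept, so the window must come from the mask.
positions = np.arange(n_cached)
window_mask = (positions[-1] - positions < window_size)[None, :]
out_regular = attend(query, keys, values, window_mask)

# Rotating-style cache (simplified): only the last `window_size` entries exist.
out_rotating = attend(query, keys[-window_size:], values[-window_size:])

# Regular cache with no window mask: attends to all cached tokens.
out_unmasked = attend(query, keys, values)

print(np.allclose(out_regular, out_rotating))
print(np.allclose(out_regular, out_unmasked))
```

With the mask applied, the regular cache gives the same output as the truncated one. Without it, the layer silently becomes full attention.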

Fix sliding window mask during generation

7bfaa06

make window mask for regular cache

491153a

awni approved these changes Feb 4, 2026

awni merged commit 25a4c83 into ml-explore:main Feb 4, 2026
2 checks passed

awni merged 2 commits into ml-explore:main from kernelpool:fix-step35-mask
