Algorithms

This section contains links to information, examples, use cases, etc. for the various algorithms we support. Click the individual links to learn more. The algorithms are grouped by use case.

For papers, videos, and books related to machine learning in general, see Machine Learning Resources.

General advice

The main goal of Apache Mahout is to be useful to practitioners. This means implementations should be easy to use from within Java applications. It should be close to trivial to deploy the trained models. Scaling to include more and more diverse data should be simple.

If you are starting a data science project, do not begin by looking for an algorithm you barely know about except for one cool talk you attended recently. Instead, try to find out what your real problem setting is. From there, check out one of the sections below to learn more about what Mahout can do for you. Chances are that decent feature engineering combined with a larger amount of data will do much more for your business case than investing your time only in finding the best algorithm. For more background, also check out the following slide deck by one of the committers:

Which Algorithms Really Matter from Ted Dunning

Note that, as a result, getting new algorithms into Mahout is pretty hard, in marked contrast to getting modifications, improvements, and better documentation committed. If you absolutely want to see your favourite algorithm included, it is up to you to make a case for replacing one of the existing implementations with your proposal.

Classification

A general introduction to the most common text classification algorithms can be found at Google Answers (http://answers.google.com/answers/main?cmd=threadview&id=225316). For information on the algorithms implemented in Mahout (or scheduled for implementation), please visit the following pages.

Fully supported:

Logistic Regression (SGD) - model parameter selection can be done in Hadoop
Naive Bayes / Complementary Naive Bayes - training runs on Hadoop
Random Forests (MAHOUT-122, MAHOUT-140, MAHOUT-145) - training is done in Hadoop
Hidden Markov Models (see MAHOUT-627, MAHOUT-396, MAHOUT-734) - training is done in MapReduce

Deprecated or drafts only:

Support Vector Machines (see MAHOUT-14, MAHOUT-232 and MAHOUT-334)
Perceptron and Winnow (see MAHOUT-85)
Neural Network (see MAHOUT-228)
Restricted Boltzmann Machines (see MAHOUT-375)
Online Passive Aggressive (see MAHOUT-702)
Boosting (see MAHOUT-716)

Clustering

For a more detailed explanation, see the Wikipedia page or check out our Reference Reading.

Fully supported:

Canopy Clustering (MAHOUT-3) - runs on Hadoop
K-Means Clustering (MAHOUT-5) - runs on Hadoop
Fuzzy K-Means (MAHOUT-74) - runs on Hadoop
Expectation Maximization (MAHOUT-28) - runs on Hadoop
Mean Shift Clustering (MAHOUT-15) - runs on Hadoop
Dirichlet Process Clustering (MAHOUT-30) - runs on Hadoop
Latent Dirichlet Allocation (MAHOUT-123) - runs on Hadoop
Minhash Clustering (MAHOUT-344) - runs on Hadoop
kMeans++ streaming clustering - documentation missing

Deprecated or drafts only:

Dimension reduction

Fully supported:

Singular Value Decomposition and other Dimension Reduction Techniques (available since 0.3)
Stochastic Singular Value Decomposition with PCA workflow

Deprecated or drafts only:

Principal Components Analysis (PCA)
Gaussian Discriminant Analysis (GDA)

Evolutionary Algorithms

NOTE: Watchmaker support has been removed as of 0.7

Recommenders / Collaborative Filtering

Mahout contains both simple non-distributed recommender implementations and distributed Hadoop-based recommenders.

First-timer FAQ
Non-distributed recommenders ("Taste")
Distributed Item-Based Collaborative Filtering
Collaborative Filtering using a parallel matrix factorization

Other

Fully supported:

RowSimilarityJob -- Builds an inverted index and then computes distances between items that have co-occurrences. This is a fully distributed calculation.
VectorDistanceJob -- Does a map side join between a set of "seed" vectors and all of the input vectors.
Collocations -- Finds co-locations of tokens in text. Runs on Hadoop.
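
To make the inverted-index idea behind RowSimilarityJob more concrete, here is a minimal in-memory sketch in plain Java. It is not Mahout's implementation and does not use Hadoop or any Mahout API. The class name CooccurrenceSimilaritySketch, its method names, and the choice of Jaccard similarity as the measure are assumptions made purely for illustration. The sketch inverts rows into a column-to-rows index and then scores only those row pairs that share at least one column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Illustration only: a single-machine sketch, not Mahout's RowSimilarityJob. */
final class CooccurrenceSimilaritySketch {

    /** Builds the inverted index: column id -> ids of the rows containing that column. */
    static Map<Integer, List<Integer>> invert(Map<Integer, Set<Integer>> rows) {
        Map<Integer, List<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, Set<Integer>> row : rows.entrySet()) {
            for (int column : row.getValue()) {
                index.computeIfAbsent(column, k -> new ArrayList<>()).add(row.getKey());
            }
        }
        return index;
    }

    /**
     * Returns the Jaccard similarity for every pair of rows that co-occur in at
     * least one column. Keys have the form "smallerRowId,largerRowId".
     * Jaccard is an assumed measure chosen for this sketch.
     */
    static Map<String, Double> similarities(Map<Integer, Set<Integer>> rows) {
        // Count shared columns per row pair by walking each posting list once.
        Map<String, Integer> shared = new HashMap<>();
        for (List<Integer> posting : invert(rows).values()) {
            for (int i = 0; i < posting.size(); i++) {
                for (int j = i + 1; j < posting.size(); j++) {
                    int a = Math.min(posting.get(i), posting.get(j));
                    int b = Math.max(posting.get(i), posting.get(j));
                    shared.merge(a + "," + b, 1, Integer::sum);
                }
            }
        }
        // Turn the co-occurrence counts into Jaccard scores: |A and B| / |A or B|.
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : shared.entrySet()) {
            String[] ids = e.getKey().split(",");
            int sizeA = rows.get(Integer.parseInt(ids[0])).size();
            int sizeB = rows.get(Integer.parseInt(ids[1])).size();
            int intersection = e.getValue();
            result.put(e.getKey(), intersection / (double) (sizeA + sizeB - intersection));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> rows = new HashMap<>();
        rows.put(1, new HashSet<>(Arrays.asList(10, 20)));
        rows.put(2, new HashSet<>(Arrays.asList(20, 30)));
        rows.put(3, new HashSet<>(Arrays.asList(40)));
        System.out.println(similarities(rows));
    }
}
```

In the sample data, rows 1 and 2 share column 20, so their Jaccard similarity is 1/3. Row 3 shares no column with any other row and therefore never appears in a pair, which is the point of going through the inverted index: pairs without co-occurrences are never compared.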

Deprecated or drafts only:

Pattern mining: Parallel FP Growth Algorithm (also known as frequent itemset mining)
