Algorithms
This section contains links to information, examples, use cases, etc. for the various algorithms we support. Click the individual links to learn more. The algorithms are grouped by use case.
For papers, videos and books related to machine learning in general, see Machine Learning Resources.
General advice
The main goal of Apache Mahout is to be useful to practitioners. This means implementations should be easy to use from within Java applications. It should be close to trivial to deploy the trained models. Scaling to include more and more diverse data should be simple.
If you are starting a data science project, do not begin by looking for an algorithm you barely know apart from one cool talk you attended recently. Instead, first find out what your real problem setting is. From there, check out one of the sections below to learn more about what Mahout can do for you. Chances are that decent feature engineering combined with more data will do much more for your business case than investing your time only in finding the best algorithm. For more background, also check out the following slide deck by one of the committers:
Note that, as a result, getting new algorithms into Mahout is pretty hard, in contrast to getting modifications, improvements and better documentation committed. If you absolutely want to see your favourite algorithm included, it is up to you to make a case for replacing one of the existing implementations with your proposal.
Classification
A general introduction to the most common text classification algorithms can be found at Google Answers (http://answers.google.com/answers/main?cmd=threadview&id=225316). For information on the algorithms implemented in Mahout (or scheduled for implementation), please visit the following pages.
Fully supported:
- Logistic Regression (SGD) - model parameter selection can be done in Hadoop
- Naive Bayes / Complementary Naive Bayes - training runs on Hadoop
- Random Forests (see MAHOUT-122, MAHOUT-140, MAHOUT-145) - training is done in Hadoop
- Hidden Markov Models (see MAHOUT-627, MAHOUT-396, MAHOUT-734) - training is done in Map-Reduce
Deprecated or drafts only:
- Support Vector Machines (see MAHOUT-14, MAHOUT-232 and MAHOUT-334)
- Perceptron and Winnow (see MAHOUT-85)
- Neural Network (see MAHOUT-228)
- Restricted Boltzmann Machines (see MAHOUT-375)
- Online Passive Aggressive (see MAHOUT-702)
- Boosting (see MAHOUT-716)
Clustering
For a more detailed explanation, see the Wikipedia page or check out our Reference Reading.
Fully supported:
- Canopy Clustering (MAHOUT-3) - runs on Hadoop
- K-Means Clustering (MAHOUT-5) - runs on Hadoop
- Fuzzy K-Means (MAHOUT-74) - runs on Hadoop
- Expectation Maximization (MAHOUT-28) - runs on Hadoop
- Mean Shift Clustering (MAHOUT-15) - runs on Hadoop
- Dirichlet Process Clustering (MAHOUT-30) - runs on Hadoop
- Latent Dirichlet Allocation (MAHOUT-123) - runs on Hadoop
- Minhash Clustering (MAHOUT-344) - runs on Hadoop
- kMeans++ streaming clustering - documentation missing
Deprecated or drafts only:
Dimension reduction
Fully supported:
- Singular Value Decomposition and other Dimension Reduction Techniques (available since 0.3)
- Stochastic Singular Value Decomposition with PCA workflow
Deprecated or drafts only:
Evolutionary Algorithms
NOTE: Watchmaker support has been removed as of 0.7
see also: MAHOUT-56
You will find here information, examples, use cases, etc. related to Evolutionary Algorithms.
Introductions and Tutorials:
Example: Traveling Salesman
Recommenders / Collaborative Filtering
Mahout contains both simple non-distributed recommender implementations and distributed Hadoop-based recommenders.
- First-timer FAQ
- Non-distributed recommenders ("Taste")
- Distributed Item-Based Collaborative Filtering
- Collaborative Filtering using a parallel matrix factorization
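To make the idea of a simple non-distributed recommender concrete, here is a minimal user-based collaborative filtering sketch in plain Java. It deliberately does not use Mahout's own recommender classes: the class name `UserBasedRecommenderSketch`, its methods and the sample data are all hypothetical, and the choice of cosine similarity over co-rated items is an assumption made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not part of Mahout. All names are hypothetical.
class UserBasedRecommenderSketch {

    /** Cosine similarity over the items both users have rated; 0 if they share none. */
    static double similarity(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
                normA += e.getValue() * e.getValue();
                normB += other * other;
            }
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /**
     * Estimates a user's preference for an item as the similarity-weighted
     * average of the ratings other users gave that item.
     * Returns NaN when no estimate is possible.
     */
    static double estimate(Map<String, Map<String, Double>> prefs, String user, String item) {
        Map<String, Double> own = prefs.get(user);
        if (own == null) {
            return Double.NaN; // unknown user
        }
        double weighted = 0.0, totalWeight = 0.0;
        for (Map.Entry<String, Map<String, Double>> other : prefs.entrySet()) {
            Double rating = other.getValue().get(item);
            if (other.getKey().equals(user) || rating == null) {
                continue; // skip the user themselves and users who never rated the item
            }
            double sim = similarity(own, other.getValue());
            if (sim > 0.0) {
                weighted += sim * rating;
                totalWeight += sim;
            }
        }
        return totalWeight == 0.0 ? Double.NaN : weighted / totalWeight;
    }

    public static void main(String[] args) {
        // Made-up sample preferences: user id -> (item id -> rating).
        Map<String, Map<String, Double>> prefs = new HashMap<>();
        Map<String, Double> alice = new HashMap<>();
        alice.put("item1", 5.0);
        alice.put("item2", 3.0);
        Map<String, Double> bob = new HashMap<>();
        bob.put("item1", 5.0);
        bob.put("item2", 3.0);
        bob.put("item3", 4.0);
        prefs.put("alice", alice);
        prefs.put("bob", bob);
        System.out.println("Estimated rating: " + estimate(prefs, "alice", "item3"));
    }
}
```

Everything here runs in memory on a single machine, which is the trade-off the non-distributed recommenders make. The distributed variants listed above target data sets too large for that.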
Other
Fully supported:
- RowSimilarityJob -- Builds an inverted index and then computes distances between items that have co-occurrences. This is a fully distributed calculation.
- VectorDistanceJob -- Does a map side join between a set of "seed" vectors and all of the input vectors.
- Collocations -- Finds collocations of tokens in text. Runs on Hadoop.
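The inverted-index approach described for RowSimilarityJob can be sketched on a single machine as follows. This is not the Mahout implementation: the class `RowSimilaritySketch` and its method are hypothetical, and cosine similarity is assumed as the measure purely for illustration. The point it shows is that grouping entries by column means only rows that co-occur in at least one column are ever compared.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative single-machine sketch; not the distributed Mahout job.
class RowSimilaritySketch {

    /**
     * rows: row id -> (column id -> non-zero value).
     * Returns "rowA|rowB" (ids in lexicographic order) -> cosine similarity,
     * only for pairs sharing at least one column.
     * Assumes row ids do not contain the '|' character.
     */
    static Map<String, Double> cosineSimilarities(Map<String, Map<String, Double>> rows) {
        // Step 1: build the inverted index (column id -> entries of rows using it)
        // and record each row's Euclidean norm along the way.
        Map<String, List<Map.Entry<String, Double>>> inverted = new HashMap<>();
        Map<String, Double> norms = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> row : rows.entrySet()) {
            double sumSq = 0.0;
            for (Map.Entry<String, Double> cell : row.getValue().entrySet()) {
                inverted.computeIfAbsent(cell.getKey(), k -> new ArrayList<>())
                        .add(new AbstractMap.SimpleImmutableEntry<>(row.getKey(), cell.getValue()));
                sumSq += cell.getValue() * cell.getValue();
            }
            norms.put(row.getKey(), Math.sqrt(sumSq));
        }

        // Step 2: accumulate dot products, touching only co-occurring row pairs.
        Map<String, Double> dots = new TreeMap<>();
        for (List<Map.Entry<String, Double>> postings : inverted.values()) {
            for (int i = 0; i < postings.size(); i++) {
                for (int j = i + 1; j < postings.size(); j++) {
                    Map.Entry<String, Double> a = postings.get(i);
                    Map.Entry<String, Double> b = postings.get(j);
                    String key = a.getKey().compareTo(b.getKey()) < 0
                            ? a.getKey() + "|" + b.getKey()
                            : b.getKey() + "|" + a.getKey();
                    dots.merge(key, a.getValue() * b.getValue(), Double::sum);
                }
            }
        }

        // Step 3: normalize each dot product by the two row norms.
        Map<String, Double> sims = new TreeMap<>();
        for (Map.Entry<String, Double> e : dots.entrySet()) {
            String[] ids = e.getKey().split("\\|");
            sims.put(e.getKey(), e.getValue() / (norms.get(ids[0]) * norms.get(ids[1])));
        }
        return sims;
    }

    public static void main(String[] args) {
        // Made-up sample matrix with three rows; "c" shares no column with the others.
        Map<String, Map<String, Double>> rows = new HashMap<>();
        Map<String, Double> a = new HashMap<>();
        a.put("x", 1.0);
        a.put("y", 1.0);
        Map<String, Double> b = new HashMap<>();
        b.put("x", 1.0);
        Map<String, Double> c = new HashMap<>();
        c.put("z", 1.0);
        rows.put("a", a);
        rows.put("b", b);
        rows.put("c", c);
        System.out.println(cosineSimilarities(rows));
    }
}
```

In a distributed setting, the grouping by column in step 1 is the kind of work a shuffle phase would do, with the pairwise products of step 2 computed per column group. The sketch keeps everything in memory to stay short.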
Deprecated or drafts only:
- Pattern mining: Parallel FP Growth Algorithm (Also known as Frequent Itemset mining)