Algorithms

This section contains links to information, examples, use cases, etc. for the various algorithms we support. Click the individual links to learn more. The algorithms are grouped by use case.

For papers, videos and books related to machine learning in general, see Machine Learning Resources.

General advice

The main goal of Apache Mahout is to be useful to practitioners. This means implementations should be easy to use from within Java applications. It should be close to trivial to deploy the trained models. Scaling to include more and more diverse data should be simple.

If you are starting a data science project, try to find out what your real problem setting is rather than looking for an algorithm you barely know about except for one cool talk you attended recently. From there, check out one of the sections below to learn more about what Mahout can do for you. Chances are that decent feature engineering combined with a larger amount of data will do much more for your business case than investing your time only in finding the best algorithm. For more background, also check out the following slide deck by one of the committers:

Which Algorithms Really Matter from Ted Dunning

Note that, as a result, getting new algorithms into Mahout is pretty hard, much in contrast to getting modifications, improvements and better documentation committed. If you absolutely do want to see your favourite algorithm in Mahout, it's up to you to make a case for replacing one of the existing implementations with your proposal.

Classification

A general introduction to the most common text classification algorithms can be found at Google Answers: http://answers.google.com/answers/main?cmd=threadview&id=225316 For information on the algorithms implemented in Mahout (or scheduled for implementation), please visit the following pages.

Fully supported:

Deprecated or drafts only:

Clustering

For a more detailed explanation, see the Wikipedia page or check out our Reference Reading.

Fully supported:

Deprecated or drafts only:

Dimension reduction

Fully supported:

Deprecated or drafts only:

Evolutionary Algorithms

NOTE: Watchmaker support has been removed as of 0.7

See also: MAHOUT-56

Here you will find information, examples, use cases, etc. related to Evolutionary Algorithms.

Introductions and Tutorials:

Example: Traveling Salesman

Recommenders / Collaborative Filtering

Mahout contains both simple non-distributed recommender implementations and distributed Hadoop-based recommenders.

Other

Fully supported:

  • RowSimilarityJob -- Builds an inverted index and then computes distances between items that have co-occurrences. This is a fully distributed calculation.
  • VectorDistanceJob -- Does a map side join between a set of "seed" vectors and all of the input vectors.
  • Collocations -- Finds co-locations of tokens in text. Runs on Hadoop.
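
To make the RowSimilarityJob description above more concrete, here is a minimal single-machine sketch in plain Java of the general idea: build an inverted index from columns to rows, then compare only those pairs of rows that share at least one column. This is not Mahout's code and does not use the Mahout API. The class name CooccurrenceSketch, its methods and the toy input are hypothetical, and a raw co-occurrence count stands in for whatever similarity or distance measure the real job is configured with.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Mahout.
public class CooccurrenceSketch {

    /**
     * Builds an inverted index: column id -> rows with an entry in that column.
     * Each row is given as the list of (distinct) column ids it contains.
     */
    static Map<Integer, List<Integer>> invert(int[][] rows) {
        Map<Integer, List<Integer>> index = new TreeMap<>();
        for (int row = 0; row < rows.length; row++) {
            for (int column : rows[row]) {
                index.computeIfAbsent(column, k -> new ArrayList<>()).add(row);
            }
        }
        return index;
    }

    /**
     * Counts shared columns for every pair of rows that co-occur at least once.
     * Keys have the form "a,b" with a < b. Pairs with nothing in common never
     * appear, which is the point of going through the inverted index.
     */
    static Map<String, Integer> cooccurrences(int[][] rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<Integer> posting : invert(rows).values()) {
            for (int i = 0; i < posting.size(); i++) {
                for (int j = i + 1; j < posting.size(); j++) {
                    String pair = posting.get(i) + "," + posting.get(j);
                    counts.merge(pair, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Row 0 has columns 0, 1, 2; row 1 has columns 1, 2; row 2 has column 3.
        int[][] rows = { {0, 1, 2}, {1, 2}, {3} };
        System.out.println(cooccurrences(rows)); // prints {0,1=2}
    }
}
```

In the toy input, rows 0 and 1 share columns 1 and 2, so the pair is counted twice, while row 2 shares nothing with the others and is never compared. The distributed job spreads this work across Hadoop; the sketch only shows why the inverted index avoids comparing every row with every other row.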

Deprecated or drafts only: