Monday, November 8, 2010

Machine Learning Toolkits

Wow. So much to read today.
While following link upon link, I found so many great toolkits that I think it is worth listing them here.
One of the greatest sources was the GNU/Linux AI & Alife HOWTO.

[edit] It's been a while since I wrote this blog post, but many people still seem to find it, so here's a quick update. After looking into many of these libraries, I started using scikit-learn and soon used it exclusively. Now I am a regular contributor. It is a fast-growing project with great documentation, many algorithms, and it is just so easy to use. Also, working with Python and the Python crowd is fun. I heartily recommend it. [/edit]

Here goes:
  • Vowpal Wabbit - project on very fast online gradient descent by Yahoo research (C++)
  • VFML (Very Fast Machine Learning) - library for very fast decision trees and Bayes networks (C++)
  • Stochastic Gradient Descent - library for SVMs with stochastic gradient descent (C++)
  • Maximum Entropy Modeling Toolkit for Python and C++ - the name says it all
  • Elefant - toolkit that includes kernel methods, optimization strategies and belief propagation. It has a GUI.
  • Milk - Python toolkit that includes SVMs, decision trees, kNN, PCA, k-means, NMF and feature selection
  • Peach - pure Python library that includes neural networks, fuzzy logic, genetic algorithms and swarm intelligence
  • Pebl - Python library and command-line application for learning the structure of a Bayesian network
  • Machine Learning: An Algorithmic Perspective - Actually a book. But with MANY MANY MANY examples online. All in Python. MOST AWESOME! - I just ordered the book
  • dbacl - a digramic Bayesian classifier - a collection of command line tools for Bayesian classification particularly for spam filtering
  • Shark - Modular library including neural networks, kernel methods, discrete and continuous optimization, fuzzy logic and control and mixtures density models (C++)
  • PyMVPA - Python module including more classifiers, regression and feature selection methods than can be listed here. Do a cross-validated classifier sweep and parameter search in fewer than 10 lines of Python.
  • Monte - gradient-based learning in Python - Python module that contains neural networks, k-means and logistic regression, with a focus on parametric models
  • scikit-learn - Python module with a good API. Includes SVMs, generalized linear models, Gaussian mixture models, mean shift, feature selection and ranking, data management and many more.
  • mlpy - Python module that includes wavelet transforms, kernel methods, FDA, PDA, LASSO, LARS, feature selection and ranking, and data management. Very clean interface.
  • Modular toolkit for Data Processing - Python toolkit for data processing. In my opinion the API takes a little getting used to. Includes PCA, k-means, RBMs, FastICA, Neural Gas, SVMs, perceptrons and many more.
  • Orange - Data mining through visual programming or Python. Large toolbox that includes great visualization features, classifiers, data management, regression and clustering. Definitely worth trying.
  • Weka - A classic tool for all things data mining. Contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. Can be used via its GUI, scripting, or Java.
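Since the update above ended up recommending scikit-learn, here is a minimal sketch of the kind of short cross-validated classifier run praised in this list (this uses scikit-learn's modern API, which has changed since 2010; the dataset, estimator and fold count are just illustrative choices, not anything from the post):

```python
# Hypothetical example: 5-fold cross-validation of a classifier
# on the bundled iris dataset, in a handful of lines.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)          # 150 samples, 4 features, 3 classes
clf = LogisticRegression(max_iter=1000)    # plain logistic regression classifier
scores = cross_val_score(clf, X, y, cv=5)  # accuracy on each of 5 folds
print(scores.mean())
```

The same three-step pattern (load data, pick an estimator, score it) applies to most estimators in the library, which is a large part of why it is so easy to use.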


  1. There's also RapidMiner, PyBrain, Apache Mahout, LibLinear, and that's just from the first couple of pages of

  2. Thanks for your note about scikit-learn.
    I'd like to test this library for a bayesian network work but I have some difficulties to create a simple bayesian network. Do you know where I could find an example of a simple implementation using scikit-learn?
