Peekaboo: Ramblings about Machine Learning, Python and scikit-learn, by Andreas Mueller

Don't fund Software that doesn't exist (2020-01-07)<div class="c0">
<span class="c5">I’ve been happy to see an increase in funding for open source software across research areas and funding bodies. However, I have observed that a majority of funding from, say, the NSF, goes either to projects that do not exist yet, where the funding is meant to create a new project, or to projects that are developed and used within a single research lab. I think this top-down approach to creating software comes from a misunderstanding of the existing open source software that is used in science. This post collects thoughts on the effectiveness of current grant-based funding and how to improve it from the perspective of the grant-makers. </span></div>
<div class="c0 c1">
<span class="c5"></span></div>
<div class="c0">
Instead of the current approach of funding new projects, I would recommend funding existing open source software, ideally software that is widely used and underfunded. The story of the underfunded but critically important open source software (which I’ll refer to as infrastructure software) should be an old tale by now. If this is news to you, look at the history of the <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://en.wikipedia.org/wiki/Heartbleed&sa=D&ust=1578421373008000">Heartbleed bug</a></span>, which basically broke security on the internet and led to the <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://www.coreinfrastructure.org/&sa=D&ust=1578421373008000">core infrastructure initiative</a></span>, or read the insightful “<span class="c3"><a class="c4" href="https://www.google.com/url?q=https://www.fordfoundation.org/work/learning/research-reports/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure/&sa=D&ust=1578421373009000">Roads and Bridges</a></span>” report by Nadia Eghbal. For a concrete example of a critical infrastructure project that is particularly scientific in scope, but still unfunded, see “<span class="c3"><a class="c4" href="https://www.google.com/url?q=https://arxiv.org/pdf/1610.03159.pdf&sa=D&ust=1578421373009000">The AstroPy Problem</a></span>”.</div>
<div class="c0">
There has been a lot of discussion analyzing this problem and drawing relationships between peer-production open-source infrastructure and <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://en.wikipedia.org/wiki/Elinor_Ostrom%23Design_principles_for_Common_Pool_Resource_(CPR)_institution&sa=D&ust=1578421373010000">Common Pool Resources</a></span>. A great summary is provided again by <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://nadiaeghbal.com/tragedy-of-the-commons&sa=D&ust=1578421373010000">Nadia Eghbal</a></span><span class="c5">. In this post I want to focus more directly on the impact and structure of grant funding in this context.</span></div>
<div class="c0 c1">
<span class="c5"></span></div>
<div class="c0">
Before I explain why I think funding new projects is a bad idea, let me briefly provide some background on how many of the open source projects that became infrastructure were created. Commonly, these were started by a single person, usually out of need (Python, NumPy<span class="c5">, matplotlib, git, IPython/Jupyter) or curiosity (Linux). Then, other people joined the project as they found it either interesting or useful themselves.</span></div>
<div class="c0">
Most of these projects are not only “open source”, they became <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://en.wikipedia.org/wiki/Peer_production&sa=D&ust=1578421373011000">peer-production projects</a></span>, in which a loosely assembled, self-organizing community of volunteers collaborates to create software. In more historical terms, this could be characterized as the projects being<span class="c3"><a class="c4" href="https://www.google.com/url?q=https://opensource.com/article/17/11/open-source-or-free-software&sa=D&ust=1578421373011000"> “free software” not only “open source</a></span>” - open source is a statement about the license applied to the code, while free software also includes aspects of governance and community engagement (this is also related to the bazaar metaphor from the <span class="c3"><a class="c4" href="https://www.google.com/url?q=http://www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/ar01s02.html&sa=D&ust=1578421373012000">Cathedral and the Bazaar</a></span><span class="c5">).</span></div>
<div class="c0">
<br /></div>
<div class="c0">
<span class="c5">It seems to me that the important distinction between open source software and peer-production has been mostly lost on grant makers and funding agencies, as they are usually trying to create new projects, or plan big overhauls of existing projects. Both of these are in conflict with the peer-production (or bazaar) philosophy as I’ll describe below.</span></div>
<h2 class="c6" id="h.o2b6par71unx">
Was our infrastructure<span class="c8"> created from grants?</span></h2>
<div class="c0">
<span class="c5">Given the amount of current funding, we can ask “How many software packages that are central to science were made as a result of grants?” </span>I hope (and imagine) the answer is not zero, but<span class="c7"> I am not familiar with any examples.</span><span class="c5"> Looking at Jupyter or the scientific ecosystem (numpy, matplotlib, pandas, scipy, scikit-learn, scikit-image, seaborn, astropy, …)--none of these projects were started by grants (even though Jupyter benefited from substantial funding later on). From my conversations with people in the R community, my understanding is that the situation there is similar.</span></div>
<div class="c0">
<span class="c5">If you are aware of central pieces of open source infrastructure that were kick-started by a grant, please let me know, as I’m quite curious to see their development.</span></div>
<h2 class="c6" id="h.s5eahbdmnqss">
<span class="c8">Do grants create infrastructure?</span></h2>
<div class="c0">
The reverse question is being worked on by Johanna Cohoon and <span class="c3"><a class="c4" href="https://www.google.com/url?q=http://james.howison.name/&sa=D&ust=1578421373014000">James Howison</a></span>. You can see preliminary results in <span class="c3"><a class="c4" href="https://www.google.com/url?q=http://howisonlab.github.io/portland_workshop/papers/Howison2-Sustaining%2520scientific%2520infrastructures%2520transitioni.pdf&sa=D&ust=1578421373014000">Routes to Sustainable Software in Science: Transitioning to Peer Production</a></span>. They followed packages that did get funding via the NSF program “Software Infrastructure for Sustained Innovation” (SI2), and investigated whether and how these projects transitioned to a community-based model<span class="c5">. As a side-note, the SI2 program is part of what funds my own work, so in a sense I’m quite lucky it exists. However, Cohoon and Howison found that of the 23 projects they captured in their taxonomy, only one transitioned from a single-author project to a peer-production project, and none of the projects that were started within a tool group or a research lab transitioned to peer production, i.e. to being a community project. The SI2 grant explicitly asks for a sustainability plan--which usually means peer production. Still, only a single project was able to make this move. Given my experience with academic open source software, this is hardly surprising. Not transitioning to a community-based project does not necessarily mean the project is not used, but it is a very strong indicator that it is not.</span></div>
<div class="c0">
<span class="c5">Several projects studied by Cohoon and Howison had no activity after the grant period ended, essentially meaning the project was abandoned and the money and time were wasted creating an artifact that was discarded immediately.</span></div>
<div class="c0">
I should note that three of the projects examined were already peer-production projects<span class="c5"> (the study did not include my work on scikit-learn as the grant is ongoing), meaning that of the 23 projects, only four are likely to remain useful, while 19 are likely to die with the grant.</span></div>
<div class="c0 c1">
<span class="c5"></span></div>
<div class="c0">
So we have seen that those projects that are central infrastructure were not established via grant money, and that those that are established via grant money do not become infrastructure<span class="c5">, and in fact rarely become peer-production projects at all. </span>What are possible explanations for these observations?</div>
<h2 class="c6" id="h.75ceux6zo2et">
Do grant-based projects adhere to open source principles?</h2>
<div class="c0">
<span class="c5">A more provocative version of this question might be “Are there fundamental reasons why creating infrastructure open source software via grants will fail?”. I think there are two potential reasons.</span></div>
<h3 class="c9" id="h.lh9mjzpvv643">
<span class="c2">Incentives</span></h3>
<div class="c0">
<span class="c5">To investigate this, let’s look at the manifesto of peer production software, “The Cathedral and the Bazaar” by Eric Steven Raymond. </span>It states, as the<span class="c3"><a class="c4" href="https://www.google.com/url?q=http://www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/ar01s02.html&sa=D&ust=1578421373016000"> very first lesson of open source software</a></span><span class="c5"> (!!):</span></div>
<div class="c12">
<span class="c5">1. Every good work of software starts by scratching a developer's personal itch.</span></div>
<div class="c0">
I have found this to be true in my personal experience, and while Raymond doesn’t provide direct empirical data, I think this insight is largely accepted in the open source community and for example repeated in <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://www.amazon.com/Internet-Success-Open-Source-Software-Commons/dp/0262017253/&sa=D&ust=1578421373017000">Internet Success: A Study of Open-Source Software Commons,</a></span> by Charles Schweik, which is well-summarized in <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://knightlab.northwestern.edu/2013/07/24/six-lessons-on-success-and-failure-for-open-source-software/&sa=D&ust=1578421373018000">Six Things to know about successful open-source software by Rich Gordon</a></span>.</div>
<div class="c0 c1">
<span class="c5"></span></div>
<div class="c0">
I want to argue that this first principle contradicts the creation of open source projects as a result of a grant. <span class="c7">If you need grant money to even start the project, it means the project is not useful enough to be worth your time otherwise.</span><span class="c5"> Writing software as a result of a grant means you didn’t actually need the software; otherwise you would have written it before. If even the author doesn’t need the software enough to be incentivized to write it, it’s unlikely that anyone else will, and it’s unlikely that a peer community will form that is incentivized to develop and maintain the software. </span></div>
<div class="c0 c1">
<span class="c5"></span></div>
<div class="c0">
<span class="c5">Let’s rephrase this last argument in a slightly different way. I think the impact a single developer can make is often inversely proportional to the size and maturity of an existing piece of software, and the personal return, both as a user and in terms of community rewards, is similarly inversely related to project size and maturity.</span></div>
<div class="c0">
<span class="c5">That would imply that as a project grows, it becomes harder to convince someone to contribute voluntarily, because their trade-off gets worse and worse.</span></div>
<div class="c0">
If the project was started by grant money, that means the incentives for the very first developer were not strong enough to start the project out of self-interest. As the project grows (as a result of the funding), the incentive for someone else to come in and contribute will be even smaller than for that first developer. Given that the project wasn’t attractive enough for the first contributor, how will it ever attract anyone else? More discussion on developer motivations and the difficulties of attracting developers can be found in <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://firstmonday.org/ojs/index.php/fm/article/view/1477/1392%23p4&sa=D&ust=1578421373019000">Cave or Community?: An Empirical Examination of 100 Mature Open Source Projects by Sandeep Krishnamurthy</a></span>. For a discussion of incentives in scientific open source software, see <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://herbsleb.org/web-pubs/pdfs/Howison-incentives-2013.pdf&sa=D&ust=1578421373019000">Incentives and Integration In Scientific Software Production by James Howison and James Herbsleb</a></span><span class="c5">.</span></div>
<div class="c0 c1">
<span class="c5"></span></div>
<h3 class="c9" id="h.agcpley18bmx">
<span class="c2">Natural Selection of Projects: Most open source projects fail</span></h3>
<div class="c0">
Another crucial aspect of peer-production projects is that there is a process of natural selection among projects. <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://knightlab.northwestern.edu/2013/07/24/six-lessons-on-success-and-failure-for-open-source-software/&sa=D&ust=1578421373020000">Most open source projects are not successful</a></span>. There are thousands of open source projects related to science created every year, many of them aiming to be infrastructure. Most of them never achieve any uptake or even become functional, and according to the study in <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://knightlab.northwestern.edu/2013/07/24/six-lessons-on-success-and-failure-for-open-source-software/&sa=D&ust=1578421373020000">Schweik’s Internet Success</a></span>, <span class="c7">having a source of funding is not related to becoming successful.</span> Of the projects that Schweik studied, 17% were successful by their measure (not abandoned for at least three releases and fulfilling a clear need). While this number is not directly comparable to the numbers from the Cohoon and Howison study, this success rate over all projects on SourceForge (the dominant platform at the time of the study) is likely comparable to, or even higher than, the rate of successful projects that were <span class="c10">funded</span><span class="c5"> by SI2.</span></div>
<div class="c0 c1">
<span class="c5"></span></div>
<div class="c0">
<span class="c5">The bazaar-style development model means that enough developers have been convinced that the project is viable and interesting to commit their time. This sets a bar that a project must clear to become a peer-production project. Initiating a new project means adding one more candidate to the thousands of projects that are started and wither. Funding a project that is already in the peer-production phase means the project has already cleared that bar. To me (having been on review panels), it seems unlikely that a funding body can accurately judge whether a project will be successful, as the numbers of the Cohoon and Howison study make evident. However, this ability is implicitly assumed in the grant-making process whenever the long-term goal of the grant is peer production.</span></div>
<h2 class="c6" id="h.li4oaktrapj">
<span class="c8">Is funding existing software any better?</span></h2>
<div class="c0">
The observations above are not in conflict with asking for funding for existing, established projects. Often these have communities surrounding them that voluntarily contribute time and effort to support the project. However, the effort involved in developing and maintaining a project is usually proportional to the user base, and the developer community might not grow proportionally (NumPy has contributions from 46 developers over the last year <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://github.com/numpy/numpy/graphs/contributors?from%3D2018-12-11%26to%3D2019-12-17%26type%3Dc&sa=D&ust=1578421373021000">according to GitHub</a></span>, but is likely to have millions of users). Existing projects also suffer from <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://en.wikipedia.org/wiki/Software_rot&sa=D&ust=1578421373021000">bit-rot</a></span><span class="c5">, and maintaining an existing project is often seen as much less fun than writing something new from scratch. It also brings with it the burdens of legacy software, and the mixed bag of dependent users, who can be encouraging and motivating one day, and then inflammatory and aggressive the next.</span></div>
<div class="c0">
<span class="c5">To ensure central infrastructure will stay available to the scientific community, funding is necessary for large existing projects.</span></div>
<div class="c0">
<span class="c5">However, both the known dynamics of open source development, as well as data on existing grants and existing infrastructure show that money invested in new projects does not pay off.</span></div>
<h2 class="c6" id="h.wrvszf3euvbb">
<span class="c8">Closing Notes</span></h2>
<div class="c0">
My recommendation for any future software funding would be to only fund software that already uses peer production as its organizational principle, or, absent that, to fund maintenance and enhancement of projects that are widely used. For now, ironically, the part that is easiest to find volunteers for (implementing new features) is the easiest to fund, while the part that is hardest to find volunteers for (maintenance) is basically impossible to fund, though big props to the <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://www.moore.org/&sa=D&ust=1578421373022000">Gordon and Betty Moore Foundation </a></span>and the <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://sloan.org/&sa=D&ust=1578421373022000">Alfred P. Sloan Foundation</a></span>, in particular <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://sloan.org/about/staff/joshua-m-greenberg&sa=D&ust=1578421373023000">Josh Greenberg</a></span> and <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://profiles.stanford.edu/chrismentzel&sa=D&ust=1578421373023000">Chris Mentzel</a></span>, and to <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://chanzuckerberg.com/eoss/&sa=D&ust=1578421373023000">Chan-Zuckerberg</a></span> for starting to change that.</div>
<div class="c0 c1">
<span class="c5"></span></div>
<div class="c0">
Going even further, the requirement of milestones and deliverables in grant applications stands in stark contrast to the peer-production nature of open source infrastructure, and more closely resembles the waterfall model, which has long been abandoned in favor of <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://en.wikipedia.org/wiki/Agile_software_development&sa=D&ust=1578421373024000">agile software development</a></span>. The more we can move towards community-driven decision making, the better the outcome for both scientific users and open source communities will be. In a sense, this is my response to a piece from Dan Katz, who was at the time an NSF program director overseeing the SI2 program, <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://danielskatzblog.wordpress.com/2016/03/23/only-mostly-dead-grants-for-software-maintenance-only/&sa=D&ust=1578421373024000">about maintenance-only grants</a></span><span class="c5">, questioning their usefulness. I think maintenance-only grants are the way forward (or maybe even open-ended grants), not because we don’t need new development, but because it’s impossible to plan many important developments on the time-scale that grants are made (see agile vs waterfall), and because it’s much easier to find volunteers for new features than for maintenance. Effective maintenance requires multi-year commitments that are much easier to make with long-term grants.</span></div>
<div class="c0 c1">
<span class="c5"></span></div>
<div class="c0">
If you’ve read this far, I highly encourage you to read <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://knightlab.northwestern.edu/2013/07/24/six-lessons-on-success-and-failure-for-open-source-software/&sa=D&ust=1578421373025000">Six Things to know about successful open-source software by Rich Gordon</a></span> and the “<span class="c3"><a class="c4" href="https://www.google.com/url?q=https://www.fordfoundation.org/work/learning/research-reports/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure/&sa=D&ust=1578421373025000">Roads and Bridges</a></span>” report by Nadia Eghbal. <span class="c3"><a class="c4" href="https://www.google.com/url?q=http://james.howison.name/&sa=D&ust=1578421373025000">James Howison</a></span> has been studying the development of scientific open source, <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://herbsleb.org/web-pubs/pdfs/Howison-incentives-2013.pdf&sa=D&ust=1578421373026000">incentives</a></span> and in particular its <span class="c3"><a class="c4" href="https://www.google.com/url?q=https://www.ideals.illinois.edu/handle/2142/73439&sa=D&ust=1578421373026000">relationship with grant funding</a></span> and I highly recommend checking out his work.</div>
Don't cite the No Free Lunch Theorem (2019-07-02)

Tldr; You probably shouldn’t be citing the<a href="https://www.mitpressjournals.org/doi/abs/10.1162/neco.1996.8.7.1341"> "No Free Lunch" Theorem by Wolpert</a>. If you’ve cited it somewhere, you might have used it to support the wrong conclusion. What it actually (vaguely) says is “You can’t learn from data without making assumptions”.<br />
<br />
The <a href="https://www.mitpressjournals.org/doi/abs/10.1162/neco.1996.8.7.1341">paper on the “No Free Lunch Theorem”</a>, actually called "<i>The Lack of A Priori Distinctions Between Learning Algorithms</i>", is one of those papers that are often cited but rarely read. I hear many people in the ML community refer to it to support the claim that “one model can’t be the best at everything” or “one model won’t always be better than another model”.
The point of this post is to convince you that this is not what the paper or theorem says (at least not the one by Wolpert that is usually cited), that you should not cite this theorem in this context, and that commonly cited versions of the "No Free Lunch" Theorem are not actually true.<br />
<h3>
Multiple Theorems, one Name</h3>
The first problem is that there are at least two theorems with the name “no free lunch” that I know about. One is by Wolpert, first published in <i>The Lack of A Priori Distinctions Between Learning Algorithms</i> (there are actually several, but they go in the same direction), and one is in<a href="http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/"> <i>Understanding Machine Learning</i> by Shalev-Shwartz and Ben-David</a> (a very excellent book!).
Wolpert also published a “No Free Lunch in Optimization”, but I'm only concerned with the theorem for supervised learning. The theorem in <i>Understanding Machine Learning</i> is actually quite different, and I’ll discuss it below. I think the goal of Shalev-Shwartz and Ben-David was to formalize the folk wisdom of the “no free lunch” theorem in a different way than Wolpert did, and their theorem actually says “one model can’t always win”, though in a very specific way (and if you cite this one, there are other caveats, see below). In a way, the theorem they present is much clearer in what we actually learn from it. But I’m not sure giving it a name that was already taken was a good idea.<br />
<h3>
So what does it say?</h3>
The main statement of the original paper can be summarized as “Under the assumptions of the theorem, any two prediction functions have the same ability to generalize”.
There are two crucial parts to this: the assumptions and the conclusions. Let’s start with trying to understand the conclusion. It is often read to mean something like “Gradient Boosting can’t always win”. Instead, what it actually says is that “Gradient boosting is as good as always predicting the most frequent class”. Or, “Neural networks are as good as predicting the <b>least</b> frequent class.”
Clearly these statements don’t correspond to our actual experience in machine learning practice. Always predicting the least frequent class is clearly a terrible strategy. But according to the theorem, it’s as good as the best state-of-the-art model you can find with respect to generalization properties. So what’s happening here?<br />
<h3>
The Assumptions</h3>
The key to understanding the theorem is in understanding the assumptions of the theorem. The theorem does not use one of the most commonly used assumptions in machine learning theory, which is that the data is drawn i.i.d. from some given distribution.
Instead, Wolpert assumes the data is a finite set, and training and test set are disjoint and drawn from a discrete distribution. This does sound reasonable; in practice our data is always finite, and we want to generalize to new data that we haven’t seen before.
Making these assumptions allows Wolpert to average over all possible datasets. The theorem compares the performance of two algorithms averaged over all possible datasets generated under these assumptions.<br />
While these assumptions sound reasonable for doing machine learning, they really are not. What these assumptions are saying is that<br />
a) <b>the test data and the training data are statistically independent,</b> i.e. the test data has nothing to do with the training data, and<br />
b) <b>the labels have nothing to do with the features</b> (because we are averaging over all possible labelings).
Phrasing it this way, these assumptions are clearly not good for doing any predictive modelling.<br />
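To make the averaging step concrete, here is a tiny sketch (the inputs and predictors are toy examples of my own, not from the paper): once you average over all possible labelings of the unseen points, every deterministic predictor generalizes equally well.

```python
from itertools import product

test_points = ["a", "bb", "c"]  # toy unseen inputs

def average_accuracy(predictor):
    """Average accuracy of a predictor over ALL possible 0/1 labelings
    of the test points -- the averaging at the heart of Wolpert's setup."""
    total_correct = 0
    n_labelings = 0
    for labeling in product([0, 1], repeat=len(test_points)):
        truth = dict(zip(test_points, labeling))
        total_correct += sum(predictor(x) == truth[x] for x in test_points)
        n_labelings += 1
    return total_correct / (n_labelings * len(test_points))

def always_one(x):   # "predict the most frequent class"
    return 1

def arbitrary(x):    # some other deterministic rule
    return len(x) % 2

# Both predictors average out to chance level.
print(average_accuracy(always_one), average_accuracy(arbitrary))  # prints: 0.5 0.5
```

Each test point agrees with exactly half of all labelings, so any deterministic predictor lands at 0.5 on average, regardless of how clever it looks.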
<h3>
So what does it mean?</h3>
Now that we revisited both conclusion and assumptions, let’s try to summarize, or maybe rephrase the No Free Lunch theorem by Wolpert. The conclusion includes “any model is as good as predicting the minority class”. What that really says is that “learning is impossible”.
Given our understanding of the assumptions, the full statement of the Theorem is something like:
“<b>If the training set has nothing to do with the test set, and the features have nothing to do with the labels, learning is impossible</b>”.
This statement makes intuitive sense, but it is very far from what I hear commonly associated with the theorem. The most sensible reading of the theorem is “<b>You need to make assumptions in order for learning to be possible</b>”. However, what it really shows is that if you make the specific assumption of the theorem, learning is not possible. So if you want to claim generally that “learning requires assumptions”, I don’t think you should cite this paper.<br />
<h3>
What (I think) Wolpert’s point was</h3>
I think the point of the paper is to challenge the i.i.d. assumption. Wolpert gives (good) reasons why he thinks it’s not appropriate, and why machine learning theory should explore other frameworks. In particular, there is a good case to be made for modeling datasets as being finite.
If this is the case, then the i.i.d. assumption would allow points to be both in the training and the test set. Clearly that’s not the point of generalization. So Wolpert requires training and test set to be disjoint, which also makes sense.
However, the consequence is that training and test set (and their labels) are completely independent now, which is strange.
I don’t know if he actually thought this was a good framework for machine learning. I assume his motivation was to challenge the field to revisit assumptions and find alternatives to the i.i.d. assumption that more closely resemble machine learning practice. Many years later, it seems to me there was an unfortunate lack of follow up from the rest of the community, and a clear misunderstanding of the theorem by many machine learning practitioners.<br />
<h3>
The other “No Free Lunch Theorem”</h3>
As I mentioned, there’s another “No Free Lunch Theorem”. It’s quite different in that it does use an i.i.d. assumption for evaluating a model. In other ways it’s quite similar in that it makes use of the fact that without additional assumptions, if you see part of the data, the remaining part could have any possible labeling. More concretely, the theorem says “<b>For any predictive algorithm there is a dataset on which it fails, i.e. a dataset on which a different learner would perform better</b>”. However, this does not prevent statements like “Algorithm X is always better than Algorithm Y”, because the algorithm that does better is not realizable (it's the algorithm that produces the true answer for this dataset without looking at the data). <b>Under this framework I’m sure you could easily prove that in an imbalanced dataset, it’s better (in expectation) to predict the majority class than to predict the minority class, a statement that is not true in Wolpert’s framework.</b><br />
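As a quick sanity check of that last claim, here is a small simulation (synthetic labels of my own, drawn i.i.d.) showing the majority-class predictor winning in expectation:

```python
import random

random.seed(0)
# Synthetic imbalanced labels drawn i.i.d.: roughly 90% class 1, 10% class 0.
labels = [1 if random.random() < 0.9 else 0 for _ in range(10_000)]

# Accuracy of the two constant predictors on these i.i.d. draws.
majority_acc = sum(y == 1 for y in labels) / len(labels)  # always predict 1
minority_acc = sum(y == 0 for y in labels) / len(labels)  # always predict 0

print(majority_acc > minority_acc)  # prints: True
```

Under the i.i.d. assumption this inequality holds in expectation whenever the classes are imbalanced; in Wolpert's framework, where all labelings are averaged over, the two strategies come out identical.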
<h3>
What to Cite</h3>
I think there are very few cases when citing Wolpert supports whatever point you’re making.
<b>If your point is “No model can always be best”, I would suggest citing Shalev-Shwartz and Ben-David.</b>
If your point is “Learning is impossible without proper assumptions”, you might cite the whole chapter by Shalev-Shwartz and Ben-David. I’m not sure there’s a good way to make this statement in a formal way. You could cite Wolpert if you really want to, but I think that might be more confusing than helpful. If your point is “The i.i.d. assumption is weird in the presence of finite data”, definitely cite Wolpert!<br />
<b>Finally, if your point is “Gradient boosting can’t always be better than neural networks because of the No Free Lunch Theorem”, then, as far as I know, you’re wrong, and there is nothing that would prohibit a statement of this form</b>. I don’t believe there are many strict “always worse” or “always better” relationships between commonly used ML algorithms, but I’m also not aware of any theoretical reason that would prevent such statements from existing (in a framework where learning is possible). And as I said above, predicting the majority class is always better than predicting the minority class, for a reasonable definition of “always”.<br />
<h3>
Learning More</h3>
If you’re interested in learning more about machine learning theory, I think the book by <a href="http://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/">Shalev-Shwartz and Ben-David</a> is actually quite excellent. I also really enjoyed <a href="https://cs.nyu.edu/~mohri/mlbook/">“Foundations of Machine Learning” by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar</a>. Both books are available as PDFs on the author websites I linked to. I’m by no means a theory person, but I think having some background in machine learning theory helps provide a framework for reasoning about algorithms.
Of course you can also check out <a href="https://www.mitpressjournals.org/doi/abs/10.1162/neco.1996.8.7.1341">Wolpert’s paper</a>, though I think that’s mostly interesting if you want to learn why he doesn’t like the i.i.d. assumption - so it’s more about philosophy of machine learning theory than about standard machine learning theory.<br />
<br />
<b>Off-topic: speed reading like spritz</b> (2014-03-20)<br />
<br />
As the title suggests, this is a non-machine-learning, non-vision, non-python post *gasp*.<br />
Some people in my network posted about <a href="http://www.spritzinc.com/">spritz</a> a startup that recently went out of stealth-mode. They do a pretty cool app for speed reading. See <a href="http://www.huffingtonpost.com/2014/02/27/spritz-reading_n_4865756.html?utm_content=buffer5bf3e&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer">this huffington post article</a> for a quick demo and explanation.<br />
They say they are still in development, so the app is not available for the public.<br />
<br />
The app seems pretty neat but also pretty easy to build. I said as much, and people came back with "they probably do a lot of natural language processing and parse the sentence to align to the proper character" and similar remarks.<br />
So I reverse engineered it. By which I mean opening the demo gifs in Gimp and counting letters. And, surprise surprise: they just count letters. So the letter they highlight (at least in the demo) depends only on the length of the word.<br />
The magic formula is <code>highlight = 1 + ceil(word_length / 4)</code>. They might also be timing the transitions differently; I haven't really looked at that, I must admit.<br />
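In Python, the reverse-engineered rule can be sketched like this (my reading of the demo gifs, not Spritz's actual code; the index is 1-based, and short one-letter words would need special-casing):

```python
import math

def highlight_index(word):
    # 1-based index of the highlighted letter; depends only on the word length
    return 1 + math.ceil(len(word) / 4)

for word in ["word", "reading", "revelation"]:
    i = highlight_index(word)
    print(word, "->", word[i - 1])
```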
After this revelation, I coded up my own <a href="https://github.com/amueller/speed_reading/tree/master">little version in javascript.</a><br />
Obviously the real value of an app like that is integration with various ecosystems, browser integration etc...<br />
But if you are just interested in the concept, you can paste your favorite (or maybe rather least favorite given the nature of the app?) e-book into <a href="http://amueller.github.io/speed_reading/">my webpage and give it a shot</a>.<br />
The code is obviously on <a href="https://github.com/amueller/speed_reading/tree/master">github</a> and CC0.<br />
I wouldn't try to use it commercially though, without talking to spritz.<br />
<br />
<b>Scikit-learn sprint and 0.14 release candidate (Update: binaries available :)</b> (2013-07-29)<br />
<br />
Yesterday a week-long <a href="http://scikit-learn.org/dev/index.html">scikit-learn</a> coding sprint in Paris ended.<br />
And let me just say: a week is pretty long for a sprint. I think most of us were pretty exhausted in the end. But we put together a release candidate for 0.14 that <a href="http://gael-varoquaux.info/">Gael Varoquaux</a> tagged last night.<br />
<br />
You can install it via:<br />
<span style="font-size: small;"><code>pip install -U https://github.com/scikit-learn/scikit-learn/archive/0.14a1.zip</code></span><br />
<br />
There are also <a href="https://github.com/scikit-learn/scikit-learn/releases/tag/0.14a1">tarballs on github</a> and binaries on <a href="http://sourceforge.net/projects/scikit-learn/files/?source=navbar">sourceforge</a>.<br />
<br />
If you want the most current version, you can check out the release branch on github:<br />
<a href="https://github.com/scikit-learn/scikit-learn/tree/0.14.X">https://github.com/scikit-learn/scikit-learn/tree/0.14.X</a><br />
<br />
The full list of changes can be found in <a href="http://scikit-learn.org/dev/whats_new.html">what's new</a>.<br />
<br />
The purpose of the release candidate is to give users a chance to give us feedback before the release. So please try it out and report back if you have any issues.<br />
<a name='more'></a><br />
<h4>
New Website </h4>
Before I start talking about the release candidate and the sprint, I want to mention the new face of <a href="http://scikit-learn.org/dev/">scikit-learn.org</a>. I think it is an improvement with respect to design, but it is an even bigger and more important improvement with respect to navigation and accessibility of the docs. The new page was drafted by <a href="https://github.com/nellev">Nelle Varoquaux</a>, Vincent Michel and me, and the design is mostly due to <a href="http://www.montefiore.ulg.ac.be/~glouppe/">Gilles Louppe</a>, who I think did an amazing job.<br />
<br />
Basically we redid the front page to give a short overview of the package, and added a <a href="http://scikit-learn.org/dev/documentation.html">documentation overview</a> to make it easier to find things.<br />
Feedback on design and navigation are more than welcome - on the mailing list or the <a href="https://github.com/scikit-learn/scikit-learn/issues/2286">issue tracker.</a><br />
<br />
We are also trying to address the issue of having completely separate pages for different versions, with Google search results that do not point to the latest stable version. But we will have to see how that plays out.<br />
<br />
<h4>
Release Candidate for 0.14</h4>
Now to the release candidate:<br />
This is the first time we did a candidate. What this means is that people can choose to install the upcoming release and we can include changes based on their feedback before switching the default install version to 0.14.<br />
<br />
There are a lot of real killer features in this version; the full change log is linked above. Let me say a bit about the most exciting new features.<br />
<br />
<h3>
Python 3 Support</h3>
We now have full Python 3 support, more precisely Python 3.3. Using <a href="http://pythonhosted.org/six/">six</a>, we maintain a single code base that supports Python 2.6, 2.7 and 3.3.<br />
<h3>
</h3>
<h3>
Faster Trees and Forests</h3>
Gilles did a complete rewrite of the <a href="http://scikit-learn.org/dev/modules/tree.html">tree module</a>, with the goal of decreasing runtime and memory consumption of all tree based estimators.<br />
As a consequence, random forests are about 50%-300% faster, and extremely randomized trees look even better. That makes the scikit-learn implementation about as fast as the commercial implementation by <a href="http://wise.io/">wise.io</a>.<br />
It is very hard to create a fair benchmark as run times vary widely with parameter settings and data sets. To his credit, Gilles didn't want to publish any timing results before he had time to perform extensive tests. But it does look pretty good.<br />
<br />
<h3>
AdaBoost Classification and Regression</h3>
<a href="http://scikit-learn.org/dev/modules/ensemble.html#adaboost">AdaBoost</a> is a classical weighted boosting method, implemented for scikit-learn by <a href="https://github.com/ndawe">Noel Dawe</a> and <a href="http://www.montefiore.ulg.ac.be/~glouppe/">Gilles</a>. By default, the implementation uses decision trees or stumps, but it can be used with any other estimator that supports sample weights.<br />
The algorithm is generally applicable and often performs very well in practice.<br />
Unfortunately there is an issue with building the Sphinx documentation, and the API is currently not visible on the dev website. You can still look at the docstring in IPython, though, and it will be fixed for the release. <br />
Here is one of the <a href="http://scikit-learn.org/dev/auto_examples/ensemble/plot_adaboost_twoclass.html#example-ensemble-plot-adaboost-twoclass-py">examples</a>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://scikit-learn.org/dev/_images/plot_adaboost_twoclass_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="200" src="http://scikit-learn.org/dev/_images/plot_adaboost_twoclass_1.png" width="400" /></a></div>
<br />
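To make the weighted-boosting idea concrete, here is a toy discrete-AdaBoost sketch with exhaustive decision stumps in numpy (an illustration of the classic algorithm, not scikit-learn's implementation):

```python
import numpy as np

def adaboost_stumps(X, y, n_rounds=5):
    """Toy discrete AdaBoost with decision stumps; y must be in {-1, +1}."""
    n_samples, n_features = X.shape
    w = np.full(n_samples, 1.0 / n_samples)  # sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        best = None
        # exhaustively search feature / threshold / sign for the best stump
        for j in range(n_features):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] <= thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        pred = sign * np.where(X[:, j] <= thr, 1, -1)
        w *= np.exp(-alpha * y * pred)  # upweight misclassified samples
        w /= w.sum()
        stumps.append((j, thr, sign))
        alphas.append(alpha)

    def predict(X_new):
        score = sum(a * s * np.where(X_new[:, j] <= t, 1, -1)
                    for a, (j, t, s) in zip(alphas, stumps))
        return np.sign(score)

    return predict

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
clf = adaboost_stumps(X, y)
print(clf(X))
```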
<h3>
Restricted Boltzmann Machines</h3>
This one is a guest performance by <a href="http://ynd.github.io/">Yann Dauphin</a>, an expert in deep learning and feature learning.<a href="http://scikit-learn.org/dev/modules/neural_networks.html#rbm"> Restricted Boltzmann Machines</a> are usually used as feature extraction algorithms or for matrix completion problems.<br />
They are a generative graphical model that can approximate very complex data distributions, and were made popular as an initialization for neural networks in the deep learning paradigm.<br />
The scikit-learn implementation is of the Bernoulli Restricted Boltzmann Machine, which means that the input as well as the learned features are binary.<br />
Often, this is relaxed to input and output that is continuous between 0 and 1, but concentrated at these two values.<br />
This is one of the basic building blocks for deep learning, and can be made into a Deep Belief Network simply by stacking RBMs using a Pipeline.<br />
On the down-side, RBMs often take long to train on the CPU, and it is not always clear whether they perform better in feature extraction tasks than simpler methods, such as K-Means based encodings.<br />
One of the benefits of having an implementation in scikit-learn is that many more people will experiment with it, which will lead to a better understanding of its behavior in practice.<br />
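Such a stack could look roughly like this (a hypothetical sketch on random data, assuming a scikit-learn with <code>BernoulliRBM</code> is installed; the parameter values and labels are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 64))   # inputs should lie in [0, 1]
y = rng.randint(0, 2, size=200)   # random binary labels, just for illustration

# Two stacked RBMs as feature extractors, followed by a linear classifier.
model = Pipeline([
    ("rbm1", BernoulliRBM(n_components=32, n_iter=5, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=16, n_iter=5, random_state=0)),
    ("logistic", LogisticRegression()),
])
model.fit(X, y)
print(model.predict(X[:5]))
```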
Of course, no mention of deep learning is complete without a <a href="http://scikit-learn.org/dev/auto_examples/plot_rbm_logistic_classification.html#example-plot-rbm-logistic-classification-py">plot of learned filters</a>:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://scikit-learn.org/dev/_images/plot_rbm_logistic_classification_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="380" src="http://scikit-learn.org/dev/_images/plot_rbm_logistic_classification_1.png" width="400" /></a></div>
<br />
<br />
<h3>
Missing Value Imputation</h3>
A very recent addition, and the product of the Google Summer of Code by Nicolas Trésegnie. Until now, scikit-learn did not support missing values in any estimators, as they are often hard to handle. Unfortunately, missing values pop up frequently in practical applications.<br />
Nicolas introduces a new estimator, the Imputer, which can be used to preprocess data and fill in missing values using several strategies.<br />
Currently only simple but effective methods are implemented, such as filling in the mean or median of a feature. For details, see the documentation. <br />
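The mean strategy, for example, amounts to the following (a numpy sketch of the idea, not the Imputer API itself):

```python
import numpy as np

def impute_mean(X):
    """Replace NaNs in each column by that column's mean."""
    X = np.array(X, dtype=float)
    col_means = np.nanmean(X, axis=0)   # per-column mean, ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X = [[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]]
print(impute_mean(X))
```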
<br />
<h3>
Randomized Parameter Optimization</h3>
<a href="http://scikit-learn.org/dev/modules/grid_search.html#randomized-parameter-optimization">Randomized parameter search</a>, as an alternative to <a href="http://scikit-learn.org/dev/modules/grid_search.html#exhaustive-grid-search">grid search</a> is one of the few things that I did for the current release. It is an implementation of the approach <a href="http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf">put forward</a> by <a href="http://www.eng.uwaterloo.ca/~jbergstr/">James Bergstra</a>.<br />
The basic idea is to overcome the curse of dimensionality for hyper-parameters by using random sampling. Let me elaborate: for algorithms with many hyper-parameters - such as complicated pipelines or neural networks - it is often not feasible to do a grid search over all parameter settings of interest.<br />
It is also not always clear in advance which parameters are relevant and which are not.<br />
By specifying distributions over the parameter space and sampling from them, it is possible to overcome this problem in part.<br />
<br />
Let me illustrate that with an example: <br />
Imagine you have two continuous parameters, one of which is completely irrelevant (which you don't know in advance). Say both parameters lie between 0 and 1. If you use a standard grid search with steps of 0.2 in both directions, you need 25 fitting runs to cover the whole grid, yet you will only have tried 5 distinct values of the relevant parameter.<br />
If you instead sample both parameters uniformly at random 25 times, you get 25 distinct settings of the relevant parameter, giving you a much finer search with the same number of fits.<br />
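The counting argument above is easy to verify (a standard-library sketch; the "parameters" here are just numbers, no model is fit):

```python
import random

rng = random.Random(0)

# 5x5 grid over two parameters in [0, 1]: 25 settings,
# but only 5 distinct values of the relevant first parameter.
grid_axis = [i / 4 for i in range(5)]
grid = [(a, b) for a in grid_axis for b in grid_axis]
print(len(grid), len({a for a, _ in grid}))

# 25 random draws: every draw tries a fresh value of each parameter.
samples = [(rng.random(), rng.random()) for _ in range(25)]
print(len(samples), len({a for a, _ in samples}))
```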
<br />
<h3>
Model evaluation and selection with more scoring functions</h3>
Until now, our API for <a href="http://scikit-learn.org/dev/modules/grid_search.html">grid search</a> and <a href="http://scikit-learn.org/dev/modules/cross_validation.html#computing-cross-validated-metrics">cross validation</a> allowed only functions that take a vector of ground truth values y_true and a vector of predictions y_hat. That made it impossible to use scores such as area under the ROC curve or ranking losses, which need certainty estimates rather than hard predictions.<br />
<br />
In 0.14, we introduced a <a href="http://scikit-learn.org/dev/modules/model_evaluation.html">new interface</a> that is much more flexible. We now support any callable with arguments <code>(estimator, X_test, y_test)</code>, i.e. a fitted estimator, the test data and the ground truth labels. This allows for quite sophisticated evaluation schemes, that even have full access to the fitted model.<br />
For convenience, we also allow string options for all the common methods.<br />
A list can be found in the documentation. <br />
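A minimal scorer following the new convention could look like this (the toy estimator is hypothetical; the point is just the <code>(estimator, X_test, y_test)</code> signature):

```python
class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most common class."""
    def fit(self, X, y):
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.majority_] * len(X)

def accuracy_scorer(estimator, X_test, y_test):
    # The scorer receives the fitted estimator, so it could also inspect
    # its attributes or call predict_proba / decision_function.
    preds = estimator.predict(X_test)
    return sum(p == t for p, t in zip(preds, y_test)) / len(y_test)

est = MajorityClassifier().fit([[0]] * 4, [0, 0, 1, 0])
print(accuracy_scorer(est, [[0], [1]], [0, 1]))  # 0.5
```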
<br />
<br />
There have been numerous other improvements and bug-fixes. In particular, the <a href="http://scikit-learn.org/dev/modules/model_evaluation.html">metrics module</a> for model evaluation and the corresponding documentation were greatly improved by <a href="http://www.ajoly.org/">Arnaud Joly</a>. <a href="https://twitter.com/ogrisel">Olivier Grisel</a> implemented out-of-core learning for <a href="http://scikit-learn.org/dev/modules/naive_bayes.html">naive Bayes estimators</a> using partial_fit. Another great improvement has been the rewrite of the neighbors module by <a href="http://www.astro.washington.edu/users/vanderplas/">Jake Vanderplas</a>, which made many neighbors-based algorithms more efficient.<br />
We now also use the neighbors module in the <a href="http://scikit-learn.org/dev/modules/clustering.html#dbscan">DBSCAN clustering</a>, which makes our implementation much faster and more scalable.<br />
<br />
<h4>
The Sprint</h4>
First and foremost, I want to thank <a href="https://github.com/nellev">Nelle</a> (president of <a href="http://www.afpy.org/">afpy</a>) for the organization of the sprint. She did a spectacular job in organizing travel, not losing people in Paris, and generally holding it all together. Also, she brought croissants every morning.<br />
I also want to thank <a href="http://alexandre.gramfort.net/">Alex Gramfort</a>, who got us space at Telecom <a href="https://www.telecom-paristech.fr/">ParisTech</a> for most of the sprint, and the people at <a href="http://www.tinyclues.com/">tinyclues</a>, who gave us their office for the weekend.<br />
<br />
As I already mentioned some of the great contributions of the sprint above, and you can read the rest in the <a href="http://scikit-learn.org/dev/whats_new.html">change log</a>, here is just a brief account of my personal experience (i.e. the interesting part of the blog post, if any, ends here ;)<br />
<br />
The sprint was a very different experience for me than the last one, or any coding session I had so far, as I spent a lot of time on organization and on pushing work on the website.<br />
I'm very bad at web design, and I don't have much experience with Jinja. But I have been convinced for quite some time that we needed to revamp the website, in particular to make the documentation more accessible and easier to navigate.<br />
Luckily, I found some much more experienced web-designers, who did the actual work: Gilles Louppe, Nelle Varoquaux and <a href="https://github.com/jaquesgrobler/scikit-learn/wiki/Jaques-Grobler">Jaques Grobler</a>. I really like the result, even though it is not finished yet.<br />
In particular, I think the new documentation overview is a <a href="http://scikit-learn.org/dev/documentation.html">great improvement</a>.<br />
<br />
For the rest of the time, I mostly reviewed pull requests, discussed API and tried to find priorities for the release. This made me task-switch quite a lot, and I don't feel I actually accomplished much. I am very happy with what the team achieved overall, though, and I guess I did my part. <br />
<br />
In the end, I am really exhausted now. Maybe seven days of sprinting is a bit too much. I guess <a href="http://gael-varoquaux.info/">Gael Varoquaux</a>, who finished off the release candidate last night together with <a href="http://vene.ro/">Vlad Niculae</a>, <a href="https://twitter.com/ogrisel">Olivier Grisel</a> and <a href="https://github.com/larsmans">Lars Buitinck</a> must be even more exhausted.<br />
<br />
Maybe we should do only five days next time, and not release (candidate) immediately afterwards. On the other hand, it is rare that so many people reserve so much time for the project, and it is good to get things done.<br />
<br />
I'll grab a much needed coffee now.<br />
<br />
<b>ICML 2013 Reading List</b> (2013-07-02)<br />
<br />
ICML has been over for two weeks now, but I still wanted to write about my reading list, as there were some quite interesting papers (<a href="http://jmlr.org/proceedings/papers/v28/">the proceedings are here</a>). Also, I haven't blogged in ages, for which I really have no excuse ;)<br />
<br />
There are three topics that I am particularly interested in, which got a lot of attention at this year's ICML: neural networks, feature expansion and kernel approximation, and structured prediction.<br />
<br />
<a name='more'></a><br />
<br />
But first: <br />
<h3>
<b><a href="http://jmlr.org/proceedings/papers/v28/bergstra13.html">Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures</a></b></h3>
<div id="authors">
James Bergstra, Daniel Yamins, David Cox</div>
<br />
This is the newest in a series of papers by James Bergstra on hyperparameter optimization. I quite enjoy his work, and his hyperopt software is in active use in my lab. In computer vision applications in particular, there is so much engineering that it is very hard to separate research contributions from engineering contributions. This paper shows 1) how important engineering is and 2) how far automation of the engineering part can really go.<br />
<br />
<h2>
<span style="font-size: x-large;">
Neural Networks</span></h2>
Now to perhaps the most unlikely candidate: neural networks.<br />
They gained a lot of attention in the more machine-learny circles in the last couple of years. Still, I was a bit surprised how many - in particular very empirical - papers made it to ICML.<br />
<br />
<br />
<h3>
<a href="http://jmlr.org/proceedings/papers/v28/wan13.pdf">Regularization of Neural Networks using DropConnect</a></h3>
<div id="authors">
Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, Rob Fergus</div>
<div id="authors">
</div>
<div id="authors">
One of the zoo of follow-ups on the drop-out work by Hinton, this paper suggests setting weights to zero, instead of hidden unit activations. It achieves better accuracy and is more efficient than drop-out.</div>
<br />
<h3 class="title">
<a href="http://jmlr.org/proceedings/papers/v28/goodfellow13.pdf">Maxout Networks</a></h3>
<span class="authors">
Ian Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, Yoshua Bengio</span><br />
<br />
<span class="authors">One of the most impressive follow-ups on the drop-out work, this paper demonstrates how to combine drop-out with a maximum nonlinearity.</span><br />
<span class="authors">That's right. The only nonlinearity is the maximum over a group of hidden units.</span><br />
<span class="authors">I feel this is pretty innovative and the results speak for themselves.</span><br />
<span class="authors">The authors argue that the max non-linearity allows the network to learn a linear approximation of any convex activation function. Unfortunately, it is not really clear from the paper how much of the performance can be attributed to the max non-linearity, as there are no results without max-out.</span><br />
<span class="authors"><br /></span>
<br />
<h3 class="title">
<a href="http://jmlr.org/proceedings/papers/v28/sutskever13.pdf">On the importance of initialization and momentum in deep learning</a></h3>
<span class="authors">
Ilya Sutskever, James Martens, George Dahl, Geoffrey Hinton</span><br />
<br />
<span class="authors">This work investigates the relation between momentum and Nesterov's accelerated gradient. It argues that, together with the right initialization, learning with momentum can yield much better models. </span><br />
<br />
<h2>
<span style="font-size: x-large;">
Kernel Approximation and Feature Extraction </span></h2>
<br />
<div class="title">
<h3>
<a href="http://jmlr.org/proceedings/papers/v28/gittens13.pdf">Revisiting the Nystrom method for improved large-scale machine learning</a></h3>
</div>
<span class="authors">
Alex Gittens, Michael Mahoney</span><br />
<span class="authors">This work compares sample-based and projection-based methods for low-rank approximations. I haven't looked into the details yet, but I'm a big fan of the Nystroem method for kernel approximations, so I will definitely see what's in there. </span><br />
<br />
<div class="title">
<h3>
<a href="http://jmlr.org/proceedings/papers/v28/balasubramanian13.pdf">Smooth Sparse Coding via Marginal Regression for Learning Sparse Representations</a></h3>
</div>
<span class="authors">
Krishnakumar Balasubramanian, Kai Yu, Guy Lebanon</span><br />
<span class="authors">The authors propose a new sparse coding framework using non-parametric kernel smoothing. They provide generalization bounds for sparse dictionary learning and demonstrate benefits compared to standard sparse coding and Locally Linear Coding. </span><br />
<br />
<br />
<h2>
<span style="font-size: x-large;">
Structured Prediction</span></h2>
<h3>
<a href="http://jmlr.org/proceedings/papers/v28/jancsary13.pdf">Learning Convex QP Relaxations for Structured Prediction</a></h3>
<div id="authors">
Jeremy Jancsary, Sebastian Nowozin, Carsten Rother</div>
<div id="authors">
</div>
<div id="authors">
This is quite exciting work by the folks from MSRC, whom I met during my internship. They propose to use a QP relaxation for learning structured prediction. Basically, they parametrize the problem such that inference via the QP relaxation is always convex, and learn this restricted family. I have only skimmed it so far ;)</div>
<div id="authors">
</div>
<div id="authors">
<h3 class="title">
</h3>
<h3 class="title">
<a href="http://jmlr.org/proceedings/papers/v28/kraehenbuehl13.pdf">Parameter Learning and Convergent Inference for Dense Random Fields</a></h3>
<span class="authors">
Philipp Kraehenbuehl, Vladlen Koltun</span> </div>
<div id="authors">
<br /></div>
<div id="authors">
This is a continuation of the authors work on dense random fields for semantic image segmentation. It is another example of "learning for inference". In their previous work, it was shown that mean-field inference can be implemented efficiently by convolutions in certain cases. Here, the authors show how it is possible to directly minimize the loss of the prediction produced by mean-field inference.</div>
<div id="authors">
<br /></div>
<div id="authors">
<br /></div>
<div id="authors">
There are several more papers on optimization for inference and/or learning,</div>
<div id="authors">
but I can't possibly list them all. There are also some interesting theory papers, for example on random forests.</div>
<div id="authors">
</div>
<div id="authors">
Also, I want to mention a paper by a friend, Cho, who writes about <br />
<a href="http://jmlr.org/proceedings/papers/v28/cho13.pdf">Simple Sparsification Improves Sparse Denoising Autoencoders in Denoising Highly Corrupted Images</a>, where he matches state-of-the-art denoising algorithms using auto-encoders.<br />
<br />
That should be enough, otherwise you could just look at the proceedings ;)</div>
<br />
<b>pystruct: more structured prediction with python</b> (2013-01-27)<br />
<br />
Some time ago <a href="http://peekaboo-vision.blogspot.de/2012/06/structured-svm-and-structured.html">I wrote about a structured learning project</a> I have been working on for some time, called <a href="https://pystruct.github.io/">pystruct</a>.<br />
After not working on it for some time, I think it has come quite a long way in the last couple of weeks, as I picked up work on structured SVMs again. So here is a quick update on what you can do with it.<br />
<br />
To the best of my knowledge this is the only tool with ready-to-use functionality to learn structural SVMs (or max-margin CRFs) on loopy graphs - even though this is pretty standard in the (computer vision) literature.<br />
<br />
<a name='more'></a><br />
<br />
The most commonly used software for learning structural SVMs is SVM^struct by <a href="http://www.cs.cornell.edu/people/tj/svm_light/svm_struct.html">Thorsten Joachims</a>, which is a great piece of software, but imho not that easy to use and written completely in C (with a Python interface, though).<br />
<br />
<br />
A quick reminder on what structured prediction does (I wrote about this <a href="http://peekaboo-vision.blogspot.de/2012/06/basics-on-structured-learning-and.html">before</a>): you are given a list of input objects $X_1, ..., X_n$ and corresponding outputs $Y_1, ..., Y_n$, and you want to learn to predict the output $Y$ for some unknown input $X$.<br />
<br />
This is a generalization of standard multi-class classification in two ways:<br />
<ul>
<li> The input objects $X$ are structured, meaning they are not just an array of numbers, as is usually the case in machine learning, but something more complex, for example a sequence, an image or a graph. There are usually some numerical values associated with $X$, but it is somehow more than just a flat vector.</li>
<li>The output objects $Y$ usually belong to some very large set. Think of labelling a sequence of length $r$, where each entry can take one of $m$ classes. Then the set of possible $Y$ has size $m^r$ - which grows exponentially in the length of the sequence. To cope with this, $Y$ is also assumed to have some structure that can help us generalize in the presence of so many classes.</li>
</ul>
<br />
The mathematical formulation of structured prediction is<br />
\[ Y = f(X) = \text{argmax}_{\hat{Y}} g(X, \hat{Y}, \theta) \]<br />
<br />
Here $g$ is a function that encodes the compatibility of $\hat{Y}$ with $X$ and depends on a set of parameters $\theta$. The prediction $Y$ is given as that $\hat{Y}$ that maximizes the compatibility with $X$.<br />
<br />
If you want to view this in a probabilistic way, you can replace $g(X, \hat{Y}, \theta)$ by $p(\hat{Y}| X, \theta)$. Then the prediction $Y$ is the maximum a posteriori (MAP) estimate for $Y$.<br />
<br />
The task of structured prediction is to find parameters $\theta$ such that we predict well on future data $X$.<br />
<br />
There are several steps in doing this. First, we have to say what "good" means.<br />
A common measure is the Hamming loss on $Y$: for the sequence example above, we don't punish all wrong outputs equally, but penalize based on how many entries of a predicted sequence are wrong.<br />
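For sequences, the Hamming loss simply counts the positions that are wrong (a minimal standard-library sketch):

```python
def hamming_loss(y_true, y_pred):
    """Number of positions where the prediction differs from the truth."""
    return sum(t != p for t, p in zip(y_true, y_pred))

print(hamming_loss([0, 1, 2, 1], [0, 2, 2, 1]))  # 1: only position 1 is wrong
```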
<br />
As we can't know about future data, we do the standard thing of regularized empirical risk minimization. In English, this means we want a model that does well on the training data and is simple. Usually this is formulated as<br />
<br />
\[\min_\theta L(\mathbf{X}, \mathbf{Y}, \theta) = \sum_{i=1}^n l(f(X_i), Y_i) + \frac{1}{C} R(\theta)<br />
\]<br />
where $l$ is for example the Hamming loss, $R$ is some penalty on the complexity of the parameters, for example the Euclidean (L2) norm, and $C$ is a trade-off between complexity and goodness of fit.<br />
<br />
The next step is choosing the form of $g$. In pystruct, I implemented the structural SVM and structural Perceptron approaches, which use a simple linear form:<br />
\[ g(X, \hat{Y}) = \left<\theta, \psi(X, \hat{Y}) \right> \]<br />
<br />
Here, $\psi(X, \hat{Y}) \in \mathbb{R}^d$ is a joint feature of $X$ and $\hat{Y}$ and $\theta$ is simply a vector of weights. All the structure of the problem is encoded in $\psi$.<br />
<br />
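As a sanity check, the binary SVM fits this template with the joint feature $\psi(x, y) = y \cdot x / 2$, so that the argmax over $y \in \{-1, +1\}$ reduces to the sign of $\left<\theta, x\right>$ (a hypothetical numpy sketch of the formulas above, not pystruct code):

```python
import numpy as np

def psi(x, y):
    # joint feature for the binary SVM special case, y in {-1, +1}
    return y * x / 2.0

def g(theta, x, y):
    # compatibility of label y with input x under parameters theta
    return np.dot(theta, psi(x, y))

def predict(theta, x):
    # argmax over the (tiny) output set {-1, +1}
    return max((-1, 1), key=lambda y: g(theta, x, y))

theta = np.array([1.0, -2.0])
print(predict(theta, np.array([3.0, 1.0])))   # <theta, x> = 1 > 0, so +1
print(predict(theta, np.array([0.0, 1.0])))   # <theta, x> = -2 < 0, so -1
```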
Now that we have all the math together, we can see there are several parts to the problem:<br />
<br />
<ol>
<li>Defining and computing $\psi$, the joint feature of $X$ and $\hat{Y}$ that defines the structure of the problem.</li>
<li>Computing $\text{argmax}_{\hat{Y}} g(X, \hat{Y}, \theta)$.</li>
<li>Find the best parameter settings $\theta$ by solving<br />\[\min_\theta L(\mathbf{X}, \mathbf{Y}, \theta),\]<br />i.e. the actual learning part.</li>
</ol>
There are three modules in pystruct, corresponding to these three parts.<br />
(For reference, SVM^struct concentrates on solving 3. - and does a great job at it).<br />
<h4>
1. Problems (i.e. CRFs): Knows about the problem structure.</h4>
These know about the structure of the problem, the loss and the inference. This is basically the part that you have to write yourself when using the Python interface in SVM^struct.<br />
I am only working on pairwise models, and there is support for grids and general graphs. I am mostly working on the grids at the moment.<br />
<br />
<h4>
2. Inference Solvers: Does the heavy lifting in inference. </h4>
There are some options to use different solvers for inference. A linear programming solver using GLPK is included. I have Python interfaces for several other methods on github, including LibDAI, QPBO and AD3, which can all be easily used with pystruct.<br />
<br />
This is where the heavy lifting is done, and in some sense these backends are exchangeable.<br />
<br />
<h4>
3. Learners: Know about learning.</h4>
These implement max-margin learning, similar to SVM^struct. There is an online subgradient version, a one-slack QP version and the standard n-slack QP version. The QPs are solved via cvxopt.<br />
<br />
They are not particularly optimized, but getting there. Often this is not the bottleneck when working with loopy graphs.<br />
There is also a simple perceptron. I might add an interface to SVM^struct here if I'm not happy with my current solvers in the future.<br />
<br />
Now let the code speak. First, let us look at some ways to use the library.<br />
I definitely want to explain the inner workings, too, but that might be another post. <br />
pystruct includes several <a href="https://github.com/amueller/pystruct/tree/master/examples">examples</a>. Let's walk through a simplified version of the <a href="https://github.com/amueller/pystruct/blob/master/examples/binary_svm.py">binary svm example</a> first, and work our way up from there.<br />
<br />
The example simply shows how to implement a standard binary SVM in the pystruct framework. This is just an illustration - don't use that to actually solve your SVM problems, as it is not really optimized for that ;)<br />
<style type="text/css">
/* Normalize monospace sizing:
en.wikipedia.org/wiki/MediaWiki_talk:Common.css/Archive_11#Teletype_style_fix_for_Chrome */
pre, code, kbd, samp { font-family: monospace, sans-serif; }
em,i { font-style: italic; }
b,strong { font-weight: bold; }
</style>
<style type="text/css">
/* Flexible box model classes */
/* Taken from Alex Russell http://infrequently.org/2009/08/css-3-progress/ */
.hbox {
display: -webkit-box;
-webkit-box-orient: horizontal;
-webkit-box-align: stretch;
display: -moz-box;
-moz-box-orient: horizontal;
-moz-box-align: stretch;
display: box;
box-orient: horizontal;
box-align: stretch;
}
.hbox > * {
-webkit-box-flex: 0;
-moz-box-flex: 0;
box-flex: 0;
}
.vbox {
display: -webkit-box;
-webkit-box-orient: vertical;
-webkit-box-align: stretch;
display: -moz-box;
-moz-box-orient: vertical;
-moz-box-align: stretch;
display: box;
box-orient: vertical;
box-align: stretch;
}
.vbox > * {
-webkit-box-flex: 0;
-moz-box-flex: 0;
box-flex: 0;
}
.reverse {
-webkit-box-direction: reverse;
-moz-box-direction: reverse;
box-direction: reverse;
}
.box-flex0 {
-webkit-box-flex: 0;
-moz-box-flex: 0;
box-flex: 0;
}
.box-flex1, .box-flex {
-webkit-box-flex: 1;
-moz-box-flex: 1;
box-flex: 1;
}
.box-flex2 {
-webkit-box-flex: 2;
-moz-box-flex: 2;
box-flex: 2;
}
.box-group1 {
-webkit-box-flex-group: 1;
-moz-box-flex-group: 1;
box-flex-group: 1;
}
.box-group2 {
-webkit-box-flex-group: 2;
-moz-box-flex-group: 2;
box-flex-group: 2;
}
.start {
-webkit-box-pack: start;
-moz-box-pack: start;
box-pack: start;
}
.end {
-webkit-box-pack: end;
-moz-box-pack: end;
box-pack: end;
}
.center {
-webkit-box-pack: center;
-moz-box-pack: center;
box-pack: center;
}
</style>
<style type="text/css">
/**
* Primary styles
*
* Author: IPython Development Team
*/
body {
overflow: hidden;
}
span#save_widget {
padding: 5px;
margin: 0px 0px 0px 300px;
display:inline-block;
}
span#notebook_name {
height: 1em;
line-height: 1em;
padding: 3px;
border: none;
font-size: 146.5%;
}
.ui-menubar-item .ui-button .ui-button-text {
padding: 0.4em 1.0em;
font-size: 100%;
}
.ui-menu {
-moz-box-shadow: 0px 6px 10px -1px #adadad;
-webkit-box-shadow: 0px 6px 10px -1px #adadad;
box-shadow: 0px 6px 10px -1px #adadad;
}
.ui-menu .ui-menu-item a {
border: 1px solid transparent;
padding: 2px 1.6em;
}
.ui-menu .ui-menu-item a.ui-state-focus {
margin: 0;
}
.ui-menu hr {
margin: 0.3em 0;
}
#menubar_container {
position: relative;
}
#notification {
position: absolute;
right: 3px;
top: 3px;
height: 25px;
padding: 3px 6px;
z-index: 10;
}
#toolbar {
padding: 3px 15px;
}
#cell_type {
font-size: 85%;
}
div#main_app {
width: 100%;
position: relative;
}
span#quick_help_area {
position: static;
padding: 5px 0px;
margin: 0px 0px 0px 0px;
}
.help_string {
float: right;
width: 170px;
padding: 0px 5px;
text-align: left;
font-size: 85%;
}
.help_string_label {
float: right;
font-size: 85%;
}
div#notebook_panel {
margin: 0px 0px 0px 0px;
padding: 0px;
}
div#notebook {
overflow-y: scroll;
overflow-x: auto;
width: 100%;
/* This spaces the cell away from the edge of the notebook area */
padding: 5px 5px 15px 5px;
margin: 0px;
background-color: white;
}
div#pager_splitter {
height: 8px;
}
div#pager {
padding: 15px;
overflow: auto;
display: none;
}
div.ui-widget-content {
border: 1px solid #aaa;
outline: none;
}
.cell {
border: 1px solid transparent;
}
div.cell {
width: 100%;
padding: 5px 5px 5px 0px;
/* This acts as a spacer between cells, that is outside the border */
margin: 2px 0px 2px 0px;
}
div.code_cell {
background-color: white;
}
/* any special styling for code cells that are currently running goes here */
div.code_cell.running {
}
div.prompt {
/* This needs to be wide enough for 3 digit prompt numbers: In[100]: */
width: 11ex;
/* This 0.4em is tuned to match the padding on the CodeMirror editor. */
padding: 0.4em;
margin: 0px;
font-family: monospace;
text-align:right;
}
div.input {
page-break-inside: avoid;
}
/* input_area and input_prompt must match in top border and margin for alignment */
div.input_area {
color: black;
border: 1px solid #ddd;
border-radius: 3px;
background: #f7f7f7;
}
div.input_prompt {
color: navy;
border-top: 1px solid transparent;
}
div.output_wrapper {
/* This is a spacer between the input and output of each cell */
margin-top: 5px;
margin-left: 5px;
/* FF needs explicit width to stretch */
width: 100%;
/* this position must be relative to enable descendents to be absolute within it */
position: relative;
}
/* class for the output area when it should be height-limited */
div.output_scroll {
/* ideally, this would be max-height, but FF barfs all over that */
height: 24em;
/* FF needs this *and the wrapper* to specify full width, or it will shrinkwrap */
width: 100%;
overflow: auto;
border-radius: 3px;
box-shadow: inset 0 2px 8px rgba(0, 0, 0, .8);
}
/* output div while it is collapsed */
div.output_collapsed {
margin-right: 5px;
}
div.out_prompt_overlay {
height: 100%;
padding: 0px;
position: absolute;
border-radius: 3px;
}
div.out_prompt_overlay:hover {
/* use inner shadow to get border that is computed the same on WebKit/FF */
box-shadow: inset 0 0 1px #000;
background: rgba(240, 240, 240, 0.5);
}
div.output_prompt {
color: darkred;
/* 5px right shift to account for margin in parent container */
margin: 0 5px 0 -5px;
}
/* This class is the outer container of all output sections. */
div.output_area {
padding: 0px;
page-break-inside: avoid;
}
/* This class is for the output subarea inside the output_area and after
the prompt div. */
div.output_subarea {
padding: 0.4em 0.4em 0.4em 0.4em;
}
/* The rest of the output_* classes are for special styling of the different
output types */
/* all text output has this class: */
div.output_text {
text-align: left;
color: black;
font-family: monospace;
}
/* stdout/stderr are 'text' as well as 'stream', but pyout/pyerr are *not* streams */
div.output_stream {
padding-top: 0.0em;
padding-bottom: 0.0em;
}
div.output_stdout {
}
div.output_stderr {
background: #fdd; /* very light red background for stderr */
}
div.output_latex {
text-align: left;
color: black;
}
div.output_html {
}
div.output_png {
}
div.output_jpeg {
}
div.text_cell {
background-color: white;
padding: 5px 5px 5px 5px;
}
div.text_cell_input {
color: black;
border: 1px solid #ddd;
border-radius: 3px;
background: #f7f7f7;
}
div.text_cell_render {
font-family: "Helvetica Neue", Arial, Helvetica, Geneva, sans-serif;
outline: none;
resize: none;
width: inherit;
border-style: none;
padding: 5px;
color: black;
}
/* The following gets added to the <head> if it is detected that the user has a
* monospace font with inconsistent normal/bold/italic height. See
* notebookmain.js. Such fonts will have keywords vertically offset with
* respect to the rest of the text. The user should select a better font.
* See: https://github.com/ipython/ipython/issues/1503
*
* .CodeMirror span {
* vertical-align: bottom;
* }
*/
.CodeMirror {
line-height: 1.231; /* Changed from 1em to our global default */
}
.CodeMirror-scroll {
height: auto; /* Changed to auto to autogrow */
/* The CodeMirror docs are a bit fuzzy on if overflow-y should be hidden or visible.*/
/* We have found that if it is visible, vertical scrollbars appear with font size changes.*/
overflow-y: hidden;
overflow-x: auto; /* Changed from auto to remove scrollbar */
}
/* CSS font colors for translated ANSI colors. */
.ansiblack {color: black;}
.ansired {color: darkred;}
.ansigreen {color: darkgreen;}
.ansiyellow {color: brown;}
.ansiblue {color: darkblue;}
.ansipurple {color: darkviolet;}
.ansicyan {color: steelblue;}
.ansigrey {color: grey;}
.ansibold {font-weight: bold;}
.completions {
position: absolute;
z-index: 10;
overflow: hidden;
border: 1px solid grey;
}
.completions select {
background: white;
outline: none;
border: none;
padding: 0px;
margin: 0px;
overflow: auto;
font-family: monospace;
}
option.context {
background-color: #DEF7FF;
}
option.introspection {
background-color: #EBF4EB;
}
/*fixed part of the completion*/
.completions p b {
font-weight:bold;
}
.completions p {
background: #DDF;
/*outline: none;
padding: 0px;*/
border-bottom: black solid 1px;
padding: 1px;
font-family: monospace;
}
pre.dialog {
background-color: #f7f7f7;
border: 1px solid #ddd;
border-radius: 3px;
padding: 0.4em;
padding-left: 2em;
}
p.dialog {
padding : 0.2em;
}
.shortcut_key {
display: inline-block;
width: 15ex;
text-align: right;
font-family: monospace;
}
.shortcut_descr {
}
/* Word-wrap output correctly. This is the CSS3 spelling, though Firefox seems
to not honor it correctly. Webkit browsers (Chrome, rekonq, Safari) do.
*/
pre, code, kbd, samp { white-space: pre-wrap; }
#fonttest {
font-family: monospace;
}
.js-error {
color: darkred;
}
</style>
<style type="text/css">
.rendered_html {color: black;}
.rendered_html em {font-style: italic;}
.rendered_html strong {font-weight: bold;}
.rendered_html u {text-decoration: underline;}
.rendered_html :link { text-decoration: underline }
.rendered_html :visited { text-decoration: underline }
.rendered_html h1 {font-size: 197%; margin: .65em 0; font-weight: bold;}
.rendered_html h2 {font-size: 153.9%; margin: .75em 0; font-weight: bold;}
.rendered_html h3 {font-size: 123.1%; margin: .85em 0; font-weight: bold;}
.rendered_html h4 {font-size: 100%; margin: 0.95em 0; font-weight: bold;}
.rendered_html h5 {font-size: 85%; margin: 1.5em 0; font-weight: bold;}
.rendered_html h6 {font-size: 77%; margin: 1.65em 0; font-weight: bold;}
.rendered_html ul {list-style:disc; margin: 1em 2em;}
.rendered_html ul ul {list-style:square; margin: 0em 2em;}
.rendered_html ul ul ul {list-style:circle; margin: 0em 2em;}
.rendered_html ol {list-style:upper-roman; margin: 1em 2em;}
.rendered_html ol ol {list-style:upper-alpha; margin: 0em 2em;}
.rendered_html ol ol ol {list-style:decimal; margin: 0em 2em;}
.rendered_html ol ol ol ol {list-style:lower-alpha; margin: 0em 2em;}
.rendered_html ol ol ol ol ol {list-style:lower-roman; margin: 0em 2em;}
.rendered_html hr {
color: black;
background-color: black;
}
.rendered_html pre {
margin: 1em 2em;
}
.rendered_html blockquote {
margin: 1em 2em;
}
.rendered_html table {
border: 1px solid black;
border-collapse: collapse;
margin: 1em 2em;
}
.rendered_html td {
border: 1px solid black;
text-align: left;
vertical-align: middle;
padding: 4px;
}
.rendered_html th {
border: 1px solid black;
text-align: left;
vertical-align: middle;
padding: 4px;
font-weight: bold;
}
.rendered_html tr {
border: 1px solid black;
}
.rendered_html p + p {
margin-top: 1em;
}
</style>
<style type="text/css">
/* Overrides of notebook CSS for static HTML export
*/
body {
overflow: visible;
padding: 8px;
}
.input_area {
padding: 0.4em;
}
</style>
<style type="text/css">
.highlight .hll { background-color: #ffffcc }
.highlight { background: #f8f8f8; }
.highlight .c { color: #408080; font-style: italic } /* Comment */
.highlight .err { border: 1px solid #FF0000 } /* Error */
.highlight .k { color: #008000; font-weight: bold } /* Keyword */
.highlight .o { color: #666666 } /* Operator */
.highlight .cm { color: #408080; font-style: italic } /* Comment.Multiline */
.highlight .cp { color: #BC7A00 } /* Comment.Preproc */
.highlight .c1 { color: #408080; font-style: italic } /* Comment.Single */
.highlight .cs { color: #408080; font-style: italic } /* Comment.Special */
.highlight .gd { color: #A00000 } /* Generic.Deleted */
.highlight .ge { font-style: italic } /* Generic.Emph */
.highlight .gr { color: #FF0000 } /* Generic.Error */
.highlight .gh { color: #000080; font-weight: bold } /* Generic.Heading */
.highlight .gi { color: #00A000 } /* Generic.Inserted */
.highlight .go { color: #808080 } /* Generic.Output */
.highlight .gp { color: #000080; font-weight: bold } /* Generic.Prompt */
.highlight .gs { font-weight: bold } /* Generic.Strong */
.highlight .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
.highlight .gt { color: #0040D0 } /* Generic.Traceback */
.highlight .kc { color: #008000; font-weight: bold } /* Keyword.Constant */
.highlight .kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
.highlight .kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
.highlight .kp { color: #008000 } /* Keyword.Pseudo */
.highlight .kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
.highlight .kt { color: #B00040 } /* Keyword.Type */
.highlight .m { color: #666666 } /* Literal.Number */
.highlight .s { color: #BA2121 } /* Literal.String */
.highlight .na { color: #7D9029 } /* Name.Attribute */
.highlight .nb { color: #008000 } /* Name.Builtin */
.highlight .nc { color: #0000FF; font-weight: bold } /* Name.Class */
.highlight .no { color: #880000 } /* Name.Constant */
.highlight .nd { color: #AA22FF } /* Name.Decorator */
.highlight .ni { color: #999999; font-weight: bold } /* Name.Entity */
.highlight .ne { color: #D2413A; font-weight: bold } /* Name.Exception */
.highlight .nf { color: #0000FF } /* Name.Function */
.highlight .nl { color: #A0A000 } /* Name.Label */
.highlight .nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
.highlight .nt { color: #008000; font-weight: bold } /* Name.Tag */
.highlight .nv { color: #19177C } /* Name.Variable */
.highlight .ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
.highlight .w { color: #bbbbbb } /* Text.Whitespace */
.highlight .mf { color: #666666 } /* Literal.Number.Float */
.highlight .mh { color: #666666 } /* Literal.Number.Hex */
.highlight .mi { color: #666666 } /* Literal.Number.Integer */
.highlight .mo { color: #666666 } /* Literal.Number.Oct */
.highlight .sb { color: #BA2121 } /* Literal.String.Backtick */
.highlight .sc { color: #BA2121 } /* Literal.String.Char */
.highlight .sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
.highlight .s2 { color: #BA2121 } /* Literal.String.Double */
.highlight .se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
.highlight .sh { color: #BA2121 } /* Literal.String.Heredoc */
.highlight .si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
.highlight .sx { color: #008000 } /* Literal.String.Other */
.highlight .sr { color: #BB6688 } /* Literal.String.Regex */
.highlight .s1 { color: #BA2121 } /* Literal.String.Single */
.highlight .ss { color: #19177C } /* Literal.String.Symbol */
.highlight .bp { color: #008000 } /* Name.Builtin.Pseudo */
.highlight .vc { color: #19177C } /* Name.Variable.Class */
.highlight .vg { color: #19177C } /* Name.Variable.Global */
.highlight .vi { color: #19177C } /* Name.Variable.Instance */
.highlight .il { color: #666666 } /* Literal.Number.Integer.Long */
</style>
<script src="https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS_HTML" type="text/javascript">
</script>
<script type="text/javascript">
init_mathjax = function() {
if (window.MathJax) {
// MathJax loaded
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ['$','$'], ["\\(","\\)"] ],
displayMath: [ ['$$','$$'], ["\\[","\\]"] ]
},
displayAlign: 'left', // Change this to 'center' to center equations.
"HTML-CSS": {
styles: {'.MathJax_Display': {"margin": 0}}
}
});
MathJax.Hub.Queue(["Typeset",MathJax.Hub]);
}
}
init_mathjax();
</script><br />
<div class="text_cell_render border-box-sizing rendered_html">
<h3>
Binary SVM</h3>
First, we load some data using scikit-learn. We use the digits dataset and make it into a binary prediction task.
We add a column of ones as the structural SVM doesn't implement a bias term.</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [1]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_digits</span>
<span class="kn">from</span> <span class="nn">sklearn.cross_validation</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="c"># do a binary digit classification</span>
<span class="n">digits</span> <span class="o">=</span> <span class="n">load_digits</span><span class="p">()</span>
<span class="n">X</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">digits</span><span class="o">.</span><span class="n">data</span><span class="p">,</span> <span class="n">digits</span><span class="o">.</span><span class="n">target</span>
<span class="c"># make binary task by doing odd vs even numers</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">y</span> <span class="o">%</span> <span class="mi">2</span>
<span class="c"># code as +1 and -1</span>
<span class="n">y</span> <span class="o">=</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">y</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">X</span> <span class="o">/=</span> <span class="n">X</span><span class="o">.</span><span class="n">max</span><span class="p">()</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">X_train_bias</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_train</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="n">X_train</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">))])</span>
<span class="n">X_test_bias</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_test</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="n">X_test</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">))])</span>
</pre>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
The data here is actually not structured, just an array.</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [2]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">X_train_bias</span><span class="o">.</span><span class="n">shape</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
Out[2]:</div>
<div class="output_subarea output_pyout">
<pre>(1347, 65)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
The labels are also not structured:</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [3]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">y_train</span><span class="o">.</span><span class="n">shape</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
Out[3]:</div>
<div class="output_subarea output_pyout">
<pre>(1347,)</pre>
</div>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
The two labels are encoded as -1 and +1, as is common for SVMs.</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [4]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">y_train</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
Out[4]:</div>
<div class="output_subarea output_pyout">
<pre>array([ 1, -1, -1, ..., -1, 1, 1])</pre>
</div>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
First, we import and instantiate the problem description for a binary SVM.</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [5]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="kn">from</span> <span class="nn">pystruct.problems</span> <span class="kn">import</span> <span class="n">BinarySVMProblem</span>
<span class="n">problem</span> <span class="o">=</span> <span class="n">BinarySVMProblem</span><span class="p">(</span><span class="n">n_features</span><span class="o">=</span><span class="n">X_train_bias</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
</pre>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
Now we import one of the learners. These are completely agnostic to the kind of problem we want to solve. Let's try the online subgradient method. We provide it with the instance of the binary SVM formulation and set the regularization parameter ($C$ in the terminology from above), the learning rate and the maximum number of iterations. </div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [6]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="kn">from</span> <span class="nn">pystruct.learners</span> <span class="kn">import</span> <span class="n">SubgradientStructuredSVM</span>
<span class="n">ssvm</span> <span class="o">=</span> <span class="n">SubgradientStructuredSVM</span><span class="p">(</span><span class="n">problem</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.0001</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
Now we train the algorithm. The learners all have the usual scikit-learn interface:</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [7]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">ssvm</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train_bias</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
</div>
<div class="output_subarea output_stream output_stdout">
<pre>Training primal subgradient structural SVM
final objective: 0.339834
calls to inference: 67350
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
And evaluate on the test set:</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [8]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">ssvm</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test_bias</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
Out[8]:</div>
<div class="output_subarea output_pyout">
<pre>0.88</pre>
</div>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
Let's have a look at some predictions:</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [9]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">ssvm</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test_bias</span><span class="p">))[:</span><span class="mi">5</span><span class="p">]</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
Out[9]:</div>
<div class="output_subarea output_pyout">
<pre>array([-1., 1., 1., 1., -1.])</pre>
</div>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
That wasn't very hard, but it also wasn't terribly exciting.<br />
So, let's take a small step up in complexity:<br />
<h3>
Multi-Class SVM.</h3>
The Crammer-Singer multi-class formulation is a special case of a structural SVM.<br />
Let's get the original labels back:</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [10]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">y</span> <span class="o">=</span> <span class="n">digits</span><span class="o">.</span><span class="n">target</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">X_train_bias</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_train</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="n">X_train</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">))])</span>
<span class="n">X_test_bias</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">([</span><span class="n">X_test</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="n">X_test</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">))])</span>
</pre>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [11]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">y_train</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
Out[11]:</div>
<div class="output_subarea output_pyout">
<pre>array([8, 5, 2, ..., 4, 9, 3])</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [12]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="kn">from</span> <span class="nn">pystruct.problems</span> <span class="kn">import</span> <span class="n">CrammerSingerSVMProblem</span>
</pre>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [13]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">problem</span> <span class="o">=</span> <span class="n">CrammerSingerSVMProblem</span><span class="p">(</span><span class="n">n_features</span><span class="o">=</span><span class="n">X_train_bias</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">n_classes</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
</div>
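The Crammer-Singer construction behind this problem class can be sketched in plain numpy. This is my own illustration of the formulation, not pystruct's internals: the joint feature map places x in the block of the weight vector belonging to class y, and inference is an argmax over the per-class scores.

```python
import numpy as np

def joint_feature(x, y, n_classes):
    """Crammer-Singer psi(x, y): x copied into the y-th block of a
    (n_classes * n_features) vector, zeros everywhere else."""
    n_features = x.shape[0]
    out = np.zeros(n_classes * n_features)
    out[y * n_features:(y + 1) * n_features] = x
    return out

def inference(x, w, n_classes):
    # w.psi(x, y) reduces to the dot product of x with the y-th block of w
    scores = w.reshape(n_classes, -1) @ x
    return int(np.argmax(scores))

x = np.array([1.0, 2.0])
w = np.array([0.0, 0.0,   # class-0 block: score 0
              1.0, 0.0,   # class-1 block: score 1
              0.0, 1.0])  # class-2 block: score 2
pred = inference(x, w, n_classes=3)   # class 2 wins
```

With this psi and inference, the generic max-margin learner recovers exactly the Crammer-Singer multi-class SVM.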
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [14]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">ssvm</span> <span class="o">=</span> <span class="n">SubgradientStructuredSVM</span><span class="p">(</span><span class="n">problem</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span> <span class="n">learning_rate</span><span class="o">=</span><span class="mf">0.0001</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [15]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">ssvm</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train_bias</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
</div>
<div class="output_subarea output_stream output_stdout">
<pre>Training primal subgradient structural SVM
final objective: 0.619628
calls to inference: 67350
</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [16]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">ssvm</span><span class="o">.</span><span class="n">score</span><span class="p">(</span><span class="n">X_test_bias</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
Out[16]:</div>
<div class="output_subarea output_pyout">
<pre>0.92000000000000004</pre>
</div>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [17]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">ssvm</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test_bias</span><span class="p">))[:</span><span class="mi">5</span><span class="p">]</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
Out[17]:</div>
<div class="output_subarea output_pyout">
<pre>array([5, 5, 2, 0, 2])</pre>
</div>
</div>
</div>
</div>
</div>
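Presumably the <code>score</code> above is just the mean accuracy of <code>predict</code> on the test set. As a quick sketch with plain numpy, using hypothetical labels in place of the real <code>y_test</code> (which is defined earlier in the notebook):

```python
import numpy as np

# Hypothetical stand-ins for y_test and the ssvm predictions above.
y_test = np.array([5, 5, 2, 0, 2])
y_pred = np.array([5, 5, 2, 0, 1])

# Mean accuracy: fraction of samples with the correct label.
accuracy = np.mean(y_pred == y_test)
print(accuracy)  # 0.8 for these toy labels
```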
<div class="text_cell_render border-box-sizing rendered_html">
But now let us finally get to something a bit more interesting.<br />
<h3>
CRFs on grid graphs</h3>
pystruct actually contains some code to handle grid graphs, but to demonstrate the interface, I'll represent the graphs explicitly.
First, let us generate some 2d grid toy data.</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [36]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="kn">import</span> <span class="nn">pystruct.toy_datasets</span> <span class="kn">as</span> <span class="nn">toy</span>
<span class="kn">from</span> <span class="nn">pystruct.utils</span> <span class="kn">import</span> <span class="n">make_grid_edges</span>
<span class="n">X</span><span class="p">,</span> <span class="n">Y</span> <span class="o">=</span> <span class="n">toy</span><span class="o">.</span><span class="n">generate_blocks</span><span class="p">(</span><span class="n">n_samples</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
We generated 3 examples. Each example here is a 10 x 12 grid with two possible labels, 0 and 1:</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [44]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="k">print</span><span class="p">(</span><span class="n">Y</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">matshow</span><span class="p">(</span><span class="n">Y</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="p">:,</span> <span class="p">:])</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
</div>
<div class="output_subarea output_stream output_stdout">
<pre>(3, 10, 12)
</pre>
</div>
</div>
<div class="hbox output_area">
<div class="prompt output_prompt">
Out[44]:</div>
<div class="output_subarea output_pyout">
<pre><matplotlib.image.AxesImage at 0x67bf650></pre>
</div>
</div>
<div class="hbox output_area">
<div class="prompt output_prompt">
</div>
<div class="output_subarea output_display_data">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAASMAAAD5CAYAAABs8lPQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAADLlJREFUeJzt3G9MlfX/x/HXqcONkmnpBIqDO4xkcA4IJM7N1dSSnG04
U9aUHE6xbri2dK5103lHYdYS617rDy2Xbd0ocuQNZhiLUblDrc02m+NsKOaNCosw+dPne+PLrx/K
l+scIK/rjef52K4N4cLrdUOfcq5zPCHnnBMABOyeoAcAgESMABhBjACYQIwAmECMAJhAjACY4FuM
zpw5o5KSEi1fvlzNzc1+XXZW+vv7tX79esXjcZWVlenEiRNBT0rb+Pi4qqqqVFtbG/SUtAwODqqu
rk6lpaWKxWLq6ekJepKno0ePKh6Pq7y8XPX19bp582bQk6bYs2ePcnNzVV5e/s/nfv31V9XU1Ki4
uFhPPfWUBgcHA1w4DeeDsbExV1RU5Pr6+tzIyIirqKhwFy5c8OPSs3L16lXX29vrnHPujz/+cMXF
xab3Tvbaa6+5+vp6V1tbG/SUtDQ0NLi3337bOefc6OioGxwcDHjR9Pr6+lxhYaH766+/nHPOPfvs
s+69994LeNVUX375pUskEq6srOyfz7388suuubnZOedcU1OTe+WVV4KaNy1ffjL65ptv9Mgjjyga
jSorK0vbt2/Xp59+6selZyUvL0+VlZWSpOzsbJWWlmpgYCDgValdvnxZ7e3t2rt3r9w8eC3r9evX
1dXVpT179kiSwuGwFi1aFPCq6S1cuFBZWVkaHh7W2NiYhoeHlZ+fH/SsKR5//HE9+OCDt3yura1N
u3btkiTt2rVLn3zySRDTPPkSoytXrqigoOCfX0ciEV25csWPS89ZMplUb2+vVq9eHfSUlA4cOKBj
x47pnnvmx63Avr4+LV26VLt379ajjz6q559/XsPDw0HPmtbixYt18OBBLVu2TA8//LAeeOABbdiw
IehZabl27Zpyc3MlSbm5ubp27VrAi6by5U9tKBTy4zL/uqGhIdXV1amlpUXZ2dlBz/F0+vRp5eTk
qKqqal78VCRJY2NjSiQS2rdvnxKJhBYsWKCmpqagZ03r0qVLOn78uJLJpAYGBjQ0NKSTJ08GPWvG
QqGQyb+TvsQoPz9f/f39//y6v79fkUjEj0vP2ujoqLZt26adO3dqy5YtQc9Jqbu7W21tbSosLNSO
HTt09uxZNTQ0BD3LUyQSUSQS0apVqyRJdXV1SiQSAa+a3vnz57VmzRotWbJE4XBYW7duVXd3d9Cz
0pKbm6uff/5ZknT16lXl5OQEvGgqX2JUXV2tn376SclkUiMjI/roo4+0efNmPy49K845NTY2KhaL
af/+/UHPScuRI0fU39+vvr4+nTp1Sk888YTef//9oGd5ysvLU0FBgS5evChJ6ujoUDweD3jV9EpK
StTT06MbN27IOaeOjg7FYrGgZ6Vl8+bNam1tlSS1trba/AfWrzvl7e3trri42BUVFbkjR474ddlZ
6erqcqFQyFVUVLjKykpXWVnpPv/886Bnpa2zs3PePJv23XffuerqardixQr3zDPPmH42zTnnmpub
XSwWc2VlZa6hocGNjIwEPWmK7du3u4ceeshlZWW5SCTi3nnnHffLL7+4J5980i1fvtzV1NS43377
LeiZU4Scmyc3GADc1ebH0y4A7nrECIAJ4bl8s8WnBwHY5XVXaE4x+q9DMzi3U9K6uV/SN52aX3sl
K5sP6XDa53bKwuL0dWp+7ZVsbE71J4KHaQBMIEYATPA5RlF/Lzdn0aAHzEI06AEzFg16wAxFgx4w
C9GgB6SBGHmKBj1gFqJBD5ixaNADZiga9IBZiAY9IA08TANgAjECYELKGM2nt4sFMH95xmh8fFwv
vviizpw5owsXLujDDz/Ujz/+6Nc2ABnEM0bz7e1iAcxfnq/A/l9vF/v111/fdlbnpI+jmh/37QHc
acmJI12eMUrv/56tm8HlAGSKqG790eRcivM9H6bNx7eLBTA/ecZovr1dLID5y/NhWjgc1ptvvqmN
GzdqfHxcjY2NKi0t9WsbgAyS8i1ENm3apE2bNvmxBUAG4xXYAEwgRgBMIEYATCBGAEwgRgBMIEYA
TCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBM
IEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwg
RgBMIEYATPCMUX9/v9avX694PK6ysjKdOHHCr10AMkzY64tZWVl6/fXXVVlZqaGhIa1cuVI1NTUq
LS31ax+ADOEZo7y8POXl5UmSsrOzVVpaqoGBgdti1Dnp4+jEASDTJSeOdHnG6JbfOJlUb2+vVq9e
fdtX1s3gcgAyRVS3/mhyLsX5ad3AHhoaUl1dnVpaWpSdnT3LaQAwvZQxGh0d1bZt27Rz505t2bLF
j00AMpBnjJxzamxsVCwW0/79+/3aBCADecboq6++0gcffKAvvvhCVVVVqqqq0pkzZ/zaBiCDeN7A
fuyxx/T333/7tQVABuMV2ABMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYA
TCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBM
IEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATEgrRuPj46qqqlJtbe2d3gMg
Q6UVo5aWFsViMYVCoTu9B0CGShmjy5cvq729XXv37pVzzo9NADJQONUJBw4c0LFjx/T7779Pc0bn
pI+jEweATJecONLlGaPTp08rJydHVVVV6uzsnOasdTO4HIBMEdWtP5qcS3G+58O07u5utbW1qbCw
UDt27NDZs2fV0NAwx4kAMFXIpXkj6Ny5c3r11Vf12Wef/f83h0KSDt2pbZjHDulw0BNgzGHJ877z
jF5nxLNpAO6UlDew/8/atWu1du3aO7kFQAbjFdgATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBG
AEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYA
TCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBM
SBmjwcFB1dXVqbS0VLFYTD09PX7sApBhwqlOeOmll/T000/r448/1tjYmP78808/dgHIMJ4xun79
urq6utTa2vrfk8NhLVq0yJdhADKLZ4z6+vq0dOlS7d69W99//71WrlyplpYW3X///ZPO6pz0cXTi
AJDpkhNHujzvGY2NjSmRSGjfvn1KJBJasGCBmpqabjtr3aQjOoNLA7ibRXVrHVLxjFEkElEkEtGq
VaskSXV1dUokEnMaCAD/i2eM8vLyVFBQoIsXL0qSOjo6FI/HfRkGILOkfDbtjTfe0HPPPaeRkREV
FRXp3Xff9WMXgAyTMkYVFRX69ttv/dgCIIPxCmwAJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAj
ACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECMA
JhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACakjNHR
o0cVj8dVXl6u+vp63bx5049dADKMZ4ySyaTeeustJRIJ/fDDDxofH9epU6f82gYgg4S9vrhw4UJl
ZWVpeHhY9957r4aHh5Wfn+/XNgAZxDNGixcv1sGDB7Vs2TLdd9992rhxozZs2HDbWZ2TPo5OHAAy
XXLiSJfnw7RLly7p+PHjSiaTGhgY0NDQkE6ePHnbWesmHdEZXBrA3SyqW+uQimeMzp8/rzVr1mjJ
kiUKh8PaunWruru757oRAKbwjFFJSYl6enp048YNOefU0dGhWCzm1zYAGcQzRhUVFWpoaFB1dbVW
rFghSXrhhRd8GQYgs4Scc27W3xwKSTr0L87B3eKQDgc9AcYcluSVG16BDcAEYgTABGIEwARiBMAE
YgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARi
BMAEYgTABGIEwARiBMAEYgTABGIEwASfY5T093Jzlgx6wCwkgx4wY8mgB8xQMugBs5AMekAaiJGn
ZNADZiEZ9IAZSwY9YIaSQQ+YhWTQA9LAwzQAJhAjACaEnHNu1t8cCv2bWwDc5bxyE75TvzEAzAQP
0wCYQIwAmECMAJhAjACYQIwAmECMAJjwH4uFFd042wHKAAAAAElFTkSuQmCC
" />
</div>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
The input is a noisy version of this label (actually there are two features per point in the grid).</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [45]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="k">print</span><span class="p">(</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">matshow</span><span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">])</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
</div>
<div class="output_subarea output_stream output_stdout">
<pre>(3, 10, 12, 2)
</pre>
</div>
</div>
<div class="hbox output_area">
<div class="prompt output_prompt">
Out[45]:</div>
<div class="output_subarea output_pyout">
<pre><matplotlib.image.AxesImage at 0x69c2910></pre>
</div>
</div>
<div class="hbox output_area">
<div class="prompt output_prompt">
</div>
<div class="output_subarea output_display_data">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAASMAAAD5CAYAAABs8lPQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAAED5JREFUeJzt3Xts1XWax/HP0XbWS0MRhrZKITBIpactbQWCIgwFQYIK
IlTDbctykWRdJ0KIOJOdna2zGyhLjBTJzGYdVBgYccZNFBGYSYON1W5RUnSYwRkY0mbKdVy0aC3Q
i9/9w7VCkN85vXzPeep5v5ImLf35nCeKb3ounG/IOecEAHF2TbwXAACJGAEwghgBMIEYATCBGAEw
gRgBMCFmMdq7d69GjBih4cOHa926dbG62S5paGjQpEmTlJOTo9zcXG3cuDHeK0Wtvb1dhYWFmjFj
RrxXiUpjY6OKi4uVnZ2tcDismpqaeK8UaO3atcrJyVFeXp7mz5+vixcvxnulKyxZskTp6enKy8vr
+LWPP/5YU6dOVVZWlu655x41NjbGccOrcDHQ1tbmhg0b5urq6lxLS4vLz893hw8fjsVNd8mpU6fc
wYMHnXPOffbZZy4rK8v0vpd6+umn3fz5892MGTPivUpUSkpK3ObNm51zzrW2trrGxsY4b3R1dXV1
bujQoe7ChQvOOecefvhh9+KLL8Z5qyu99dZbrra21uXm5nb82hNPPOHWrVvnnHOurKzMPfnkk/Fa
76pi8pPRu+++q1tvvVVDhgxRcnKy5s6dq9deey0WN90lGRkZKigokCSlpKQoOztbJ0+ejPNWkR0/
fly7d+/WsmXL5HrBa1nPnTunqqoqLVmyRJKUlJSk1NTUOG91dX369FFycrKam5vV1tam5uZmDRw4
MN5rXWHChAm66aabLvu1nTt3atGiRZKkRYsW6dVXX43HaoFiEqMTJ05o0KBBHV9nZmbqxIkTsbjp
bquvr9fBgwc1duzYeK8S0cqVK7V+/Xpdc03veCiwrq5OAwYM0OLFi3X77bfrkUceUXNzc7zXuqp+
/fpp1apVGjx4sG655Rb17dtXU6ZMifdaUTlz5ozS09MlSenp6Tpz5kycN7pSTH7XhkKhWNxMj2tq
alJxcbHKy8uVkpIS73UC7dq1S2lpaSosLOwVPxVJUltbm2pra/Xoo4+qtrZWN954o8rKyuK91lUd
O3ZMGzZsUH19vU6ePKmmpiZt37493mt1WigUMvn/ZExiNHDgQDU0NHR83dDQoMzMzFjcdJe1trZq
zpw5WrhwoWbNmhXvdSKqrq7Wzp07NXToUM2bN0/79u1TSUlJvNcKlJmZqczMTI0ZM0aSVFxcrNra
2jhvdXUHDhzQuHHj1L9/fyUlJWn27Nmqrq6O91pRSU9P1+nTpyVJp06dUlpaWpw3ulJMYjR69Ggd
PXpU9fX1amlp0csvv6yZM2fG4qa7xDmnpUuXKhwOa8WKFfFeJypr1qxRQ0OD6urqtGPHDk2ePFlb
t26N91qBMjIyNGjQIB05ckSSVFFRoZycnDhvdXUjRoxQTU2Nzp8/L+ecKioqFA6H471WVGbOnKkt
W7ZIkrZs2WLzD9hYPVK+e/dul5WV5YYNG+bWrFkTq5vtkqqqKhcKhVx+fr4rKChwBQUFbs+ePfFe
K2qVlZW95tm0999/340ePdqNHDnSPfjgg6afTXPOuXXr1rlwOOxyc3NdSUmJa2lpifdKV5g7d667
+eabXXJyssvMzHTPP/+8O3v2rLv77rvd8OHD3dSpU90nn3wS7zWvEHKulzzAAOBbrXc87QLgW48Y
ATAhqTv/sMWnBwHYFfSoULdiJEna1YmHnH5VKs0vjerS3903oUvrROMFLY7qukOlrymv9IGo5770
0JKurhTZ/VFe92qpNKs06rHu537+QAkN6MTviyOlUlZp1Jff+fq+Tu8TjWbdENV1p0t/oYzSZZ2a
/cFv7+jKShFdd8fHUV3XVrZOST98slOzL/Ttfh4uF/zqeu6mATCBGAEwIbYxyiuK6c11V1rRbfFe
ofNGFMV7g87rXxTvDTolpej2eK/QadeMvyveK0REjAKkF42I9wqdR4y8650xGh/vFSLibhoAE4gR
ABMixqg3vV0sgN4rMEbt7e167LHHtHfvXh0+fFgvvfSSPvzww1jtBiCBBMaot71dLIDeK/Allt/0
drH79++//KJflX79eV5Rr3vGDIAvVZLejvrqwBhF9XfPovzrHQASzYT///hK8FsKB95N641vFwug
dwqMUW97u1gAvVfg3bSkpCRt2rRJ06ZNU3t7u5YuXars7OxY7QYggUR8j4Dp06dr+vTpsdgFQALj
FdgATCBGAEwgRgBMIEYATCBGAEzo1iGOoVBIP3Y/6sl9OvyTfuZlriQd0Cgvc/uq0ctcSZp89k0v
c1v+Evwm6V0VSvJ3Nuh/j7rXy9wp1+3xMleSUp/y9O/jD37GStJ//fLve3Te8tC2wNNB+MkIgAnE
CIAJxAiACcQIgAnECIAJxAiACcQIgAnECIAJxAiACcQIgAnECIAJxAiACcQIgAnECIAJxAiACcQI
gAnECIAJxAiACcQIgAnECIAJxAiACcQIgAndPqpI9/s5guV3u0Je5krSuAt+GnzjoS+8zJWk0M3n
vcy9LqXZy9z2tiQvcyWp9Q99vMx9feLdXuZK0v1P7vMyd/66zV7mStIrZ4t7dF7rd1M5qgiAfcQI
gAnECIAJxAiACcQIgAnECIAJxAiACYExamho0KRJk5STk6Pc3Fxt3LgxVnsBSDCBr0xLTk7WM888
o4KCAjU1NWnUqFGaOnWqsrOzY7UfgAQRGKOMjAxlZGRIklJSUpSdna2TJ09eHqMjpV9/3r/oyw8A
Ce+Ld6rk3nk76uujfs1+fX29Dh48qLFjx17+jazSqG8MQOK45q4J0l0TOr7+4j/Kgq+PZmhTU5OK
i4tVXl6ulJSU7m0IAN8gYoxaW1s1Z84cLVy4ULNmzYrFTgASUGCMnHNaunSpwuGwVqxYEaudACSg
wBi988472rZtm958800VFhaqsLBQe/fujdVuABJI4APY48eP1xdf+HuPHgD4Cq/ABmACMQJgAjEC
YAIxAmACMQJgQrdPB8l3/9OT+3T4O130MleS9r9U5Gdwqp+xkqRVfsa+/Sc/c8f/2s9cSQo9/K9e
5q5y/k40adYNXub+u/7Zy1xJ6v+DHj6RZlOI00EA2EeMAJhAjACYQIwAmECMAJhAjACYQIwAmECM
AJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwA
mNDto4p+1pPbXOIfazwNlqSf+Bm777d3+hksaXK5nyOhWv2c+qOcxvf9DJZ09I18L3MH3PdXL3Ml
6aMPBnuZuy7/B17mStIhjezRedtCyzmqCIB9xAiACcQIgAnECIAJxAiACcQIgAlRxai9vV2FhYWa
MWOG730AJKioYlReXq5wOKxQKOR7HwAJKmKMjh8/rt27d2vZsmWBL1gCgO5IinTBypUrtX79en36
6aff+P03Lvl8uKSsHloMQO92pvLPOlN5JOrrA2O0a9cupaWlqbCwUJWVld94zX2dWg9Aokgvuk3p
Rbd1fP37p14PvD7wblp1dbV27typoUOHat68edq3b59KSkp6ZlMAuERgjNasWaOGhgbV1dVpx44d
mjx5srZu3Rqr3QAkkE69zohn0wD4EvEB7K9MnDhREydO9LkLgATGK7ABmECMAJhAjACYQIwAmECM
AJhAjACY0O3TQdzjPbnOJe7yNFeSfuNnbKmnuZJU4jK8zP3ettNe5irNz1hJCp338xe29zxQ5GWu
JC3W817mnv7R97zMlaSUH3/Uo/OaUtI4HQSAfcQIgAnECIAJxAiACcQIgAnECIAJxAiACcQIgAnE
CIAJxAiACcQIgAnECIAJxAiACcQIgAnECIAJxAiACcQIgAnECIAJxAiACcQIgAnECIAJ3T4dZKLb
05P7dNh53XQvcyVp/4XxXuZOXfG2l7mSpKOe5o71M7b4J7/0M1jSW/q+l7nvabSXuZI05KG/+Rk8
189YSXKZoR6dF7pDnA4CwD5iBMAEYgTABGIEwARiBMAEYgTABGIEwISIMWpsbFRxcbGys7MVDodV
U1MTi70AJJikSBc8/vjjuvfee/XKK6+ora1Nn3/+eSz2ApBgAmN07tw5VVVVacuWLV9enJSk1NTU
mCwGILEExqiurk4DBgzQ4sWL9cEHH2jUqFEqLy/XDTfc0HFNfem2js/7Fo1U36KR/rYF0GtU1n75
Ea3AGLW1tam2tlabNm3SmDFjtGLFCpWVlemnP/1pxzVDShd2eVkA315Ft3/58ZWnfhF8feAD2JmZ
mcrMzNSYMWMkScXFxaqt7UTqACBKgTHKyMjQoEGDdOTIEUlSRUWFcnJyYrIYgMQS8dm0Z599VgsW
LFBLS4uGDRumF154IRZ7AUgwEWOUn5+v9957Lxa7AEhgvAIbgAnECIAJxAiACcQIgAnECIAJxAiA
CRGf2o+k+uy4ntjjCqnlXT5BKaI7tc/L3Kn/cLeXuZKkh/2MnfHGr73MTdcZL3Ml6aMBg73MHfL+
X73MlSQ1epqb4WmupNCunv5/MPjoI34yAmACMQJgAjECYAIxAmACMQJgAjECYAIxAmACMQJgAjEC
YAIxAmACMQJgAjECYAIxAmACMQJgAjECYAIxAmACMQJgAjECYAIxAmACMQJgAjECYELIOdflIwBC
oZBaPJ168J1Kf6eDnHqgr5e5h5TnZa4k3bO5ysvcPy4LPrGhqzLbkr3MlaTU+1u8zK3fk+ZlriQN
CT3qZe5S910vcyVprPb36LzloW0Kyg0/GQEwgRgBMIEYATCBGAEwgRgBMIEYATAhYozWrl2rnJwc
5eXlaf78+bp48WIs9gKQYAJjVF9fr+eee061tbU6dOiQ2tvbtWPHjljtBiCBJAV9s0+fPkpOTlZz
c7OuvfZaNTc3a+DAgbHaDUACCYxRv379tGrVKg0ePFjXX3+9pk2bpilTplx2zb+Vff3598dLE8d7
2RNAL/PnyjM6Unkm6usDY3Ts2DFt2LBB9fX1Sk1N1UMPPaTt27drwYIFHdf8yw+7viyAb6/bitJ1
W1F6x9evP/X7wOsDHzM6cOCAxo0bp/79+yspKUmzZ89WdXV1z2wKAJcIjNGIESNUU1Oj8+fPyzmn
iooKhcPhWO0GIIEExig/P18lJSUaPXq0Ro4cKUlavnx5TBYDkFgCHzOSpNWrV2v16tWx2AVAAuMV
2ABMIEYATCBGAEwgRgBMIEYATCBGAEwgRgBMiPg6o0i+s83PkULXLfzYy1xJunmLn/OVli8q9zJX
ktxrfo4UCv2nn/9+91/7Gy9zJUlDPI2d8Tc/gyXd6fZ5mbu5fLKXuZK0ufKxHp64LfC7/GQEwARi
BMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIE
wITYxuhoZUxvrtv+VBnvDTqt8n/jvUHnna38Y7xX6JyzlfHeoPP+UhnvDSIiRkF6Y4zOxnuDzjtb
eTjeK3QOMfKCu2kATCBGAEwIOee6/I7soZCfN4kH8O0UlJtunQ7SjY4BwGW4mwbABGIEwARiBMAE
YgTABGIEwARiBMCE/wOntQjk7qyxlwAAAABJRU5ErkJggg==
" />
</div>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
We hope that the pairwise CRF can learn to smooth the data to get consistent, sharp labels out.<br />
Now we import the class for handling CRFs with arbitrary (globally shared) pairwise potentials, <code>GraphCRF</code>, and instantiate it.
As inference on such a grid graph is already non-trivial, we have to choose an inference method. Let's go with AD3 for the moment.</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [ ]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">crf</span> <span class="o">=</span> <span class="n">GraphCRF</span><span class="p">(</span><span class="n">inference_method</span><span class="o">=</span><span class="s">'ad3'</span><span class="p">)</span>
<span class="n">clf</span> <span class="o">=</span> <span class="n">SubgradientStructuredSVM</span><span class="p">(</span><span class="n">problem</span><span class="o">=</span><span class="n">crf</span><span class="p">,</span> <span class="n">max_iter</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> <span class="n">C</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
To feed the instances to pystruct, we explicitly represent each edge in the grid:</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [ ]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">G</span> <span class="o">=</span> <span class="p">[</span><span class="n">make_grid_edges</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">X</span><span class="p">]</span>
</pre>
</div>
</div>
</div>
</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [53]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">G</span><span class="p">[</span><span class="mi">0</span><span class="p">][:</span><span class="mi">12</span><span class="p">]</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
Out[53]:</div>
<div class="output_subarea output_pyout">
<pre>array([[ 0, 1],
[ 1, 2],
[ 2, 3],
[ 3, 4],
[ 4, 5],
[ 5, 6],
[ 6, 7],
[ 7, 8],
[ 8, 9],
[ 9, 10],
[10, 11],
[12, 13]])</pre>
</div>
</div>
</div>
</div>
</div>
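For illustration, the edge list of a 4-connected grid like the one <code>make_grid_edges</code> returns could be built with plain numpy as follows. This is a sketch, not pystruct's actual implementation, but it produces the same row-major node numbering as the output above:

```python
import numpy as np

def grid_edges(height, width):
    # Number the nodes in row-major order, as in the output above.
    nodes = np.arange(height * width).reshape(height, width)
    # Horizontal neighbors: (i, j) -- (i, j + 1)
    horizontal = np.c_[nodes[:, :-1].ravel(), nodes[:, 1:].ravel()]
    # Vertical neighbors: (i, j) -- (i + 1, j)
    vertical = np.c_[nodes[:-1, :].ravel(), nodes[1:, :].ravel()]
    return np.vstack([horizontal, vertical])

edges = grid_edges(10, 12)
print(edges[:12])  # [0, 1] through [10, 11], then [12, 13] as above
```

For a 10 x 12 grid this gives 10 * 11 horizontal plus 9 * 12 vertical edges.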
<div class="text_cell_render border-box-sizing rendered_html">
Now we reshape the input a bit. GraphCRF expects the input <code>X</code> to be of shape <code>(n_nodes, n_features)</code> and the output <code>Y</code> to be of shape <code>(n_nodes,)</code> for each sample.</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [ ]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="c"># reshape / flatten x and y</span>
<span class="n">X_flat</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">X</span><span class="p">]</span>
<span class="n">Y_flat</span> <span class="o">=</span> <span class="p">[</span><span class="n">y</span><span class="o">.</span><span class="n">ravel</span><span class="p">()</span> <span class="k">for</span> <span class="n">y</span> <span class="ow">in</span> <span class="n">Y</span><span class="p">]</span>
</pre>
</div>
</div>
</div>
</div>
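With dummy arrays of the same shapes as the toy data (stand-ins for the actual notebook variables), the flattening turns each 10 x 12 grid into 120 nodes:

```python
import numpy as np

# Dummy stand-ins with the shapes of the toy data above.
X = np.zeros((3, 10, 12, 2))          # (n_samples, height, width, n_features)
Y = np.zeros((3, 10, 12), dtype=int)  # (n_samples, height, width)

X_flat = [x.reshape(-1, 2) for x in X]  # each sample: (n_nodes, n_features)
Y_flat = [y.ravel() for y in Y]         # each sample: (n_nodes,)

print(X_flat[0].shape, Y_flat[0].shape)  # (120, 2) (120,)
```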
<div class="text_cell_render border-box-sizing rendered_html">
The actual input when using GraphCRF is then a tuple <code>x=(features, graph)</code> for each sample.</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [ ]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">X_structured</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="n">X_flat</span><span class="p">,</span> <span class="n">G</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
Then the rest is easy again:</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [ ]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">clf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_structured</span><span class="p">,</span> <span class="n">Y_flat</span><span class="p">)</span>
<span class="n">Y_pred</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_structured</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
Now let us look at the prediction:</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [57]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">plt</span><span class="o">.</span><span class="n">matshow</span><span class="p">(</span><span class="n">Y_pred</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">Y</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">))</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
Out[57]:</div>
<div class="output_subarea output_pyout">
<pre><matplotlib.image.AxesImage at 0x6bc7b50></pre>
</div>
</div>
<div class="hbox output_area">
<div class="prompt output_prompt">
</div>
<div class="output_subarea output_display_data">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAASMAAAD5CAYAAABs8lPQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAADMRJREFUeJzt3W9MlfX/x/HXqcONkmnpBIqDO4xkcA4IJM7N1dSSnG04
U9aUHE6xbri2dK5103lHYdYS617rDy2Xbd0ocuQNZhiLUblDrc02m+NsKOaNCosw+dPne+PLrx/K
l+scIK/rjef52K4N8aLrdcOenXOd4ynknHMCgIDdE/QAAJCIEQAjiBEAE4gRABOIEQATiBEAE3yL
0ZkzZ1RSUqLly5erubnZr8vOSn9/v9avX694PK6ysjKdOHEi6ElpGx8fV1VVlWpra4OekpbBwUHV
1dWptLRUsVhMPT09QU/ydPToUcXjcZWXl6u+vl43b94MetIUe/bsUW5ursrLy//53q+//qqamhoV
Fxfrqaee0uDgYIALp+F8MDY25oqKilxfX58bGRlxFRUV7sKFC35celauXr3qent7nXPO/fHHH664
uNj03slee+01V19f72pra4OekpaGhgb39ttvO+ecGx0ddYODgwEvml5fX58rLCx0f/31l3POuWef
fda99957Aa+a6ssvv3SJRMKVlZX9872XX37ZNTc3O+eca2pqcq+88kpQ86blyyOjb775Ro888oii
0aiysrK0fft2ffrpp35celby8vJUWVkpScrOzlZpaakGBgYCXpXa5cuX1d7err1798rNg/eyXr9+
XV1dXdqzZ48kKRwOa9GiRQGvmt7ChQuVlZWl4eFhjY2NaXh4WPn5+UHPmuLxxx/Xgw8+eMv32tra
tGvXLknSrl279MknnwQxzZMvMbpy5YoKCgr++XUkEtGVK1f8uPScJZNJ9fb2avXq1UFPSenAgQM6
duyY7rlnftwK7Ovr09KlS7V79249+uijev755zU8PBz0rGktXrxYBw8e1LJly/Twww/rgQce0IYN
G4KelZZr164pNzdXkpSbm6tr164FvGgqX/7UhkIhPy7zrxsaGlJdXZ1aWlqUnZ0d9BxPp0+fVk5O
jqqqqubFoyJJGhsbUyKR0L59+5RIJLRgwQI1NTUFPWtaly5d0vHjx5VMJjUwMKChoSGdPHky6Fkz
FgqFTP476UuM8vPz1d/f/8+v+/v7FYlE/Lj0rI2Ojmrbtm3auXOntmzZEvSclLq7u9XW1qbCwkLt
2LFDZ8+eVUNDQ9CzPEUiEUUiEa1atUqSVFdXp0QiEfCq6Z0/f15r1qzRkiVLFA6HtXXrVnV3dwc9
Ky25ubn6+eefJUlXr15VTk5OwIum8iVG1dXV+umnn5RMJjUyMqKPPvpImzdv9uPSs+KcU2Njo2Kx
mPbv3x/0nLQcOXJE/f396uvr06lTp/TEE0/o/fffD3qWp7y8PBUUFOjixYuSpI6ODsXj8YBXTa+k
pEQ9PT26ceOGnHPq6OhQLBYLelZaNm/erNbWVklSa2urzf/A+nWnvL293RUXF7uioiJ35MgRvy47
K11dXS4UCrmKigpXWVnpKisr3eeffx70rLR1dnbOm1fTvvvuO1ddXe1WrFjhnnnmGdOvpjnnXHNz
s4vFYq6srMw1NDS4kZGRoCdNsX37dvfQQw+5rKwsF4lE3DvvvON++eUX9+STT7rly5e7mpoa99tv
vwU9c4qQc/PkBgOAu9r8eNkFwF2PGAEwITyXH7b48iAAu7zuCs0pRv91aAbndkpaN/dL+qZT82uv
ZGXzIR1O+9xOWVicvk7Nr72Sjc2p/kTwNA2ACcQIgAk+xyjq7+XmLBr0gFmIBj1gxqJBD5ihaNAD
ZiEa9IA0ECNP0aAHzEI06AEzFg16wAxFgx4wC9GgB6SBp2kATCBGAExIGaP59HGxAOYvzxiNj4/r
xRdf1JkzZ3ThwgV9+OGH+vHHH/3aBiCDeMZovn1cLID5y/Md2P/r42K//vrr287qnPR1VPPjvj2A
Oy05caTLM0bp/d2zdTO4HIBMEdWtD03OpTjf82nafPy4WADzk2eM5tvHxQKYvzyfpoXDYb355pva
uHGjxsfH1djYqNLSUr+2AcggKT9CZNOmTdq0aZMfWwBkMN6BDcAEYgTABGIEwARiBMAEYgTABGIE
wARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTA
BGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAE
YgTABGIEwATPGPX392v9+vWKx+MqKyvTiRMn/NoFIMOEvX4zKytLr7/+uiorKzU0NKSVK1eqpqZG
paWlfu0DkCE8Y5SXl6e8vDxJUnZ2tkpLSzUwMHBbjDonfR2dOABkuuTEkS7PGN3yD04m1dvbq9Wr
V9/2O+tmcDkAmSKqWx+anEtxflo3sIeGhlRXV6eWlhZlZ2fPchoATC9ljEZHR7Vt2zbt3LlTW7Zs
8WMTgAzkGSPnnBobGxWLxbR//36/NgHIQJ4x+uqrr/TBBx/oiy++UFVVlaqqqnTmzBm/tgHIIJ43
sB977DH9/ffffm0BkMF4BzYAE4gRABOIEQATiBEAE4gRABOIEQATiBEAE4gRABOIEQATiBEAE4gR
ABOIEQATiBEAE4gRABOIEQATiBEAE4gRABOIEQATiBEAE4gRABOIEQATiBEAE4gRABOIEQATiBEA
E4gRABOIEQATiBEAE4gRABOIEQATiBEAE4gRABOIEQATiBEAE4gRABPSitH4+LiqqqpUW1t7p/cA
yFBpxailpUWxWEyhUOhO7wGQoVLG6PLly2pvb9fevXvlnPNjE4AMFE51woEDB3Ts2DH9/vvv05zR
Oenr6MQBINMlJ450ecbo9OnTysnJUVVVlTo7O6c5a90MLgcgU0R160OTcynO93ya1t3drba2NhUW
FmrHjh06e/asGhoa5jgRAKYKuTRvBJ07d06vvvqqPvvss///4VBI0qE7tQ3z2CEdDnoCjDksed53
ntH7jHg1DcCdkvIG9v9Zu3at1q5deye3AMhgvAMbgAnECIAJxAiACcQIgAnECIAJxAiACcQIgAnE
CIAJxAiACcQIgAnECIAJxAiACcQIgAnECIAJxAiACcQIgAnECIAJxAiACcQIgAnECIAJxAiACcQI
gAnECIAJxAiACcQIgAnECIAJxAiACcQIgAnECIAJxAiACcQIgAnECIAJxAiACcQIgAnECIAJxAiA
CSljNDg4qLq6OpWWlioWi6mnp8ePXQAyTDjVCS+99JKefvppffzxxxobG9Off/7pxy4AGcYzRtev
X1dXV5daW1v/e3I4rEWLFvkyDEBm8YxRX1+fli5dqt27d+v777/XypUr1dLSovvvv3/SWZ2Tvo5O
HAAyXXLiSJfnPaOxsTElEgnt27dPiURCCxYsUFNT021nrZt0RGdwaQB3s6hurUMqnjGKRCKKRCJa
tWqVJKmurk6JRGJOAwHgf/GMUV5engoKCnTx4kVJUkdHh+LxuC/DAGSWlK+mvfHGG3ruuec0MjKi
oqIivfvuu37sApBhUsaooqJC3377rR9bAGQw3oENwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARi
BMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIE
wARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMAEYgTABGIEwARiBMCElDE6
evSo4vG4ysvLVV9fr5s3b/qxC0CG8YxRMpnUW2+9pUQioR9++EHj4+M6deqUX9sAZJCw128uXLhQ
WVlZGh4e1r333qvh4WHl5+f7tQ1ABvGM0eLFi3Xw4EEtW7ZM9913nzZu3KgNGzbcdlbnpK+jEweA
TJecONLl+TTt0qVLOn78uJLJpAYGBjQ0NKSTJ0/edta6SUd0BpcGcDeL6tY6pOIZo/Pnz2vNmjVa
smSJwuGwtm7dqu7u7rluBIApPGNUUlKinp4e3bhxQ845dXR0KBaL+bUNQAbxjFFFRYUaGhpUXV2t
FStWSJJeeOEFX4YByCwh55yb9Q+HQpIO/YtzcLc4pMNBT4AxhyV55YZ3YAMwgRgBMIEYATCBGAEw
gRgBMIEYATCBGAEwwfMvyuLux/uBYAWPjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECM
AJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACY4HOMkv5ebs6SQQ+YhWTQA2YsGfSAGUoGPWAW
kkEPSAMx8pQMesAsJIMeMGPJoAfMUDLoAbOQDHpAGniaBsAEYgTAhJBzzs36h0Ohf3MLgLucV27m
9H8HmUPHAOAWPE0DYAIxAmACMQJgAjECYAIxAmACMQJgwn8AkiIW3/OMaQsAAAAASUVORK5CYII=
" />
</div>
</div>
</div>
</div>
</div>
<div class="text_cell_render border-box-sizing rendered_html">
And compare against just using the features, without any interaction between the nodes:</div>
<div class="cell border-box-sizing code_cell vbox">
<div class="input hbox">
<div class="prompt input_prompt">
In [61]:</div>
<div class="input_area box-flex1">
<div class="highlight">
<pre><span class="n">plt</span><span class="o">.</span><span class="n">matshow</span><span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">></span> <span class="mf">0.0</span><span class="p">)</span>
</pre>
</div>
</div>
</div>
<div class="vbox output_wrapper">
<div class="output vbox">
<div class="hbox output_area">
<div class="prompt output_prompt">
Out[61]:</div>
<div class="output_subarea output_pyout">
<pre><matplotlib.image.AxesImage at 0x7379390></pre>
</div>
</div>
<div class="hbox output_area">
<div class="prompt output_prompt">
</div>
<div class="output_subarea output_display_data">
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAASMAAAD5CAYAAABs8lPQAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz
AAALEgAACxIB0t1+/AAADWFJREFUeJzt3V9sU3Ufx/HPtLtQFlAI23SFlCDL1m5skxESogEUJJiM
IDQGJhnhj14QEyFEvDTcwBY0MvTO+AcjERMvFMnkYsEhcZlKisYEEwxZk8GQC3XoHDI2f8/FswcZ
PDv9M3rOt/T9Sk4yxtnON7W8PW1P+ytyzjkBQMDuCXoAAJCIEQAjiBEAE4gRABOIEQATiBEAE3yL
0fHjx1VVVaV58+apra3Nr8Nmpa+vT8uWLVMsFlNNTY0OHjwY9EhpGx0dVUNDg5qamoIeJS0DAwOK
x+Oqrq5WNBpVT09P0CN52rdvn2KxmGpra9Xc3Kxr164FPdJttmzZorKyMtXW1t743m+//aYVK1ao
srJSTz31lAYGBgKccALOByMjI27u3Lmut7fXDQ8Pu7q6Onf27Fk/Dp2VS5cuuTNnzjjnnPvzzz9d
ZWWl6Xlv9vrrr7vm5mbX1NQU9ChpaWlpce+8845zzrnr16+7gYGBgCeaWG9vr5szZ477+++/nXPO
Pfvss+79998PeKrbffXVVy6RSLiampob33v55ZddW1ubc8651tZW98orrwQ13oR8OTP69ttv9cgj
jygSiai4uFjr16/XZ5995sehs1JeXq76+npJUklJiaqrq9Xf3x/wVKlduHBBHR0d2rZtm1weXMt6
5coVnTp1Slu2bJEkhUIhTZs2LeCpJjZ16lQVFxdraGhIIyMjGhoaUkVFRdBj3ebxxx/Xgw8+OO57
R48e1aZNmyRJmzZt0qeffhrEaJ58idHFixc1a9asG38Oh8O6ePGiH4eetGQyqTNnzmjRokVBj5LS
zp07tX//ft1zT348Fdjb26uZM2dq8+bNevTRR/X8889raGgo6LEmNH36dO3atUuzZ8/Www8/rAce
eEDLly8Peqy0XL58WWVlZZKksrIyXb58OeCJbufLvbaoqMiPw9xxg4ODisfjam9vV0lJSdDjeDp2
7JhKS0vV0NCQF2dFkjQyMqJEIqHt27crkUhoypQpam1tDXqsCZ0/f14HDhxQMplUf3+/BgcHdfjw
4aDHylhRUZHJf5O+xKiiokJ9fX03/tzX16dwOOzHobN2/fp1rVu3Ths3btSaNWuCHiel7u5uHT16
VHPmzNGGDRt04sQJtbS0BD2Wp3A4rHA4rIULF0qS4vG4EolEwFNN7PTp01q8eLFmzJihUCiktWvX
qru7O+ix0lJWVqZffvlFknTp0iWVlpYGPNHtfIlRY2Ojfv75ZyWTSQ0PD+vjjz/W6tWr/Th0Vpxz
2rp1q6LRqHbs2BH0OGnZu3ev+vr61NvbqyNHjuiJJ57QBx98EPRYnsrLyzVr1iydO3dOktTZ2alY
LBbwVBOrqqpST0+Prl69KuecOjs7FY1Ggx4rLatXr9ahQ4ckSYcOHbL5P1i/ninv6OhwlZWVbu7c
uW7v3r1+HTYrp06dckVFRa6urs7V19e7+vp698UXXwQ9Vtq6urry5tW077//3jU2Nrr58+e7Z555
xvSrac4519bW5qLRqKupqXEtLS1ueHg46JFus379evfQQw+54uJiFw6H3bvvvut+/fVX9+STT7p5
8+a5FStWuN9//z3oMW9T5FyePMEA4K6WHy+7ALjrESMAJoQm88MWXx4EYJfXs0KTipEkvZrBvl2S
lqa5756MfnNmXtWetPbrUvrzWtGlzGbO1e2c7m0s5d/t3KX8mleyMXOqewQP0wCYQIwAmOBrjCJ+
HuwOiAQ9QBYiQQ+QhUjQA2QoEvQAWYgEPUAaiJGHSNADZCES9ABZiAQ9QIYiQQ+QhUjQA6SBh2kA
TCBGAExIGaN8+rhYAPnLM0ajo6N68cUXdfz4cZ09e1YfffSRfvrpJ79mA1BAPGOUbx8XCyB/eV6B
/f8+Lvabb74Zt0/XTV9HlB/P2gPIveTYli7PGKXz3rOlGRwMQOGIaPzJyckU+3s+TMvHj4sFkJ88
Y5RvHxcLIH95PkwLhUJ66623tHLlSo2Ojmrr1q2qrq72azYABSTlR4isWrVKq1at8mMWAAWMK7AB
mECMAJhAjACYQIwAmECMAJgw6Q/kz0e5/LD/XMnkA+7vdvz388edv529bwPOjACYQIwAmECMAJhA
jACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECM
AJhAjACYQIwAmECMAJhAjACYUOScc1n/cFFRzhaNycflaHIpH5e6yTf5eJ/Lp/vFHkleueHMCIAJ
xAiACcQIgAnECIAJxAiACcQIgAnECIAJnjHq6+vTsmXLFIvFVFNTo4MHD/o1F4ACE/L6y+LiYr3x
xhuqr6/X4OCgFixYoBUrVqi6utqv+QAUCM8YlZeXq7y8XJJUUlKi6upq9ff3j4tR1037R8Y2AEiO
benyjNG4X5xM6syZM1q0aNG47y/N4GAACkdE409OTqbYP60nsAcHBxWPx9Xe3q6SkpIsRwOAiaWM
0fXr17Vu3Tpt3LhRa9as8WMmAAXIM0bOOW3dulXRaFQ7duzwayYABcgzRl9//bU+/PBDffnll2po
aFBDQ4OOHz/u12wACojnE9iPPfaY/vnnH79mAVDAuAIbgAnECIAJxAiACcQIgAnECIAJZlcHyaV8
XAUC/8qnFTFyLZf35Tt9O7M6CIC8QIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACY
QIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYMOmlisSyP/g/
WE4It2KpIgB5gRgBMIEYATCBGAEwgRgBMIEYATAhrRiNjo6qoaFBTU1NuZ4HQIFKK0bt7e2KRqNj
1xUBwJ2XMkYXLlxQR0eHtm3b5nnBEgBMRijVDjt37tT+/fv1xx9/TLBH101fR8Y2AIUuObalyzNG
x44dU2lpqRoaGtTV1TXBXkszOByAQhHR+FOTkyn293yY1t3draNHj2rOnDnasGGDTpw4oZaWlkmO
CAC3S/uNsidPntRrr72mzz///N8f5o2ymABvlMWt7ugbZXk1DUCupHwC+3+WLFmiJUuW5HIWAAWM
K7ABmECMAJhAjACYQIwAmECMAJhAjACYkPZL+0Am9uTwYthcXVCZjzPfTTgzAmACMQJgAjECYAIx
AmACMQJgAjECYAIxAmACMQJgAjECYAIxAmACMQJgAjECYAIxAmACMQJgAjECYAIxAmACMQJgAjEC
YAIxAmACMQJgAjECYMKkVwfJx5UakHushpH/7vy/Qe/7BGdGAEwgRgBMIEYATCBGAEwgRgBMIEYA
TCBGAExIGaOBgQHF43FVV1crGo2qp6fHj7kAFJiUFz2+9NJLevrpp/XJJ59oZGREf/31lx9zASgw
njG6cuWKTp06pUOHDv1351BI06ZN82UwAIXFM0a9vb2aOXOmNm/erB9++EELFixQe3u77r///hv7
dN20f2RsAwApObalx/M5o5GRESUSCW3fvl2JREJTpkxRa2vruH2W3rRFMpkTwF0uovGF8OYZo3A4
rHA4rIULF0qS4vG4EonEJAcEgNt5xqi8vFyzZs3SuXPnJEmdnZ2KxWK+DAagsKR8Ne3NN9/Uc889
p+HhYc2dO1fvvfeeH3MBKDApY1RXV6fvvvvOj1kAFDCuwAZgAjECYAIxAmACMQJgAjECYAIxAmBC
kXPOZf3DRUUsKHSTfFxeiSWF4Jc9krxyw5kRABOIEQATiBEAE4gRABOIEQATiBEAE4gRABOIEQAT
iBEAE4gRABOIEQATiBEAE4gRABOIEQATiBEAE4gRABOIEQATiBEAE4gRABOIEQATiBEAEya9Oohy
tCJGLletYBWPf3Fb/CuXt0U+rsJy52+PPawOAsA+YgTABGIEwARiBMAEYgTABGIEwISUMdq3b59i
sZhqa2vV3Nysa9eu+TEXgALjGaNkMqm3335biURCP/74o0ZHR3XkyBG/ZgNQQEJefzl16lQVFxdr
aGhI9957r4aGhlRRUeHXbAAKiGeMpk+frl27dmn27Nm67777tHLlSi1fvvyWvbpu+joytgFAcmxL
j+fDtPPnz+vAgQNKJpPq7+/X4OCgDh8+fMteS2/aIunPCeAuF9H4PnjzjNHp06e1ePFizZgxQ6FQ
SGvXrlV3d/fkZwSAW3jGqKqqSj09Pbp69aqcc+rs7FQ0GvVrNgAFxDNGdXV1amlpUWNjo+bPny9J
euGFF3wZDEBh8XwCW5J2796t3bt3+zELgALGFdgATCBGAEwgRgBMIEYATCBGAEwgRgBMIEYATJj0
UkX5t9BNfsrVMjr5uIQO8tMeiaWKANhHjACYQIwAmECMAJhAjACYQIwAmECMAJhAjACYQIwAmECM
AJhAjACYQIwAmECMAJhAjACYQIwAmECMAJjga4ySfh7sDkgGPUBWkkEPkLFk0ANkKBn0AFlIBj1A
GoiRh2TQA2QlGfQAGUsGPUCGkkEPkIVk0AOkgYdpAEwgRgBMmPQH8gNAurxyE8rVLwaATPAwDYAJ
xAiACcQIgAnECIAJxAiACcQIgAn/AaH7ZZLC2ReHAAAAAElFTkSuQmCC
" />
</div>
</div>
</div>
</div>
</div>
I'm getting a bit tired. Expect the post to be updated around tomorrow ;)Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com4tag:blogger.com,1999:blog-7345806147365425073.post-22082246585576769002013-01-25T19:38:00.001+01:002013-07-16T16:22:29.440+02:00Machine Learning Cheat Sheet (for scikit-learn)As you hopefully have heard, we at <a href="http://scikit-learn.org/dev/">scikit-learn</a> are doing a <a href="https://docs.google.com/spreadsheet/viewform?formkey=dFdyeGNhMzlCRWZUdldpMEZlZ1B1YkE6MQ#gid=0">user survey</a> (which is still open by the way).<br />
One of the requests there was to provide some sort of flow chart on how to do machine learning.<br />
<br />
As this is clearly impossible, I went to work straight away.<br />
<br />
This is the result:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihyphenhyphen0NANgWsnpVjOafuGCyBQppYXA25H8ZKv7SXxhbSxfSmBKpD5CCjGb_cXXqXzPIjdK_h8pLO7XAGDR1bDXKc9nAtRpE4tFCzcD6kUzMgZTE_NXN35GtvDD_JE9pF-jVCzY0aYfPPAk0/s1600/drop_shadows_background.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEihyphenhyphen0NANgWsnpVjOafuGCyBQppYXA25H8ZKv7SXxhbSxfSmBKpD5CCjGb_cXXqXzPIjdK_h8pLO7XAGDR1bDXKc9nAtRpE4tFCzcD6kUzMgZTE_NXN35GtvDD_JE9pF-jVCzY0aYfPPAk0/s640/drop_shadows_background.png" width="480" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
[edit2]<br />
clarification: With ensemble classifiers and ensemble regressors I mean<b> random forests</b>, <b>extremely randomized trees, gradient boosted trees</b>, and the soon-to-come weight-boosted trees (AdaBoost).<br />
[/edit2]<br />
<br /><br />
Needless to say, this sheet is completely authoritative.<br />
<br />
<a name='more'></a>Thanks to Rob Zinkov for pointing out an error in one yes/no decision.<br />
<br />
More seriously: this is actually my work flow / train of thoughts whenever I try to solve a new problem. Basically, start simple first. If this doesn't work out, try something more complicated.<br />
The chart above includes the intersection of all algorithms that are in scikit-learn and the ones that I find most useful in practice.<br />
<br />
Only that I <b>always</b> start out with "just looking". To make any of the algorithms actually work, you need to do the <i>right</i> preprocessing of your data - which is much more of an art than picking the right algorithm imho.<br />
<br />
Anyhow, enjoy ;)<br />
<br />
[edit3]<br />
You can find the SVG and dia file I used <a href="https://gist.github.com/amueller/4642976">here</a>. I doubt this qualifies as a creative work, but to be safe, I put this under a <a href="https://gist.github.com/amueller/4642976">CC0 license</a>, which translates to "public domain" in the US.<br />
[/edit3] <br />
<span style="font-size: x-small;">[edit]</span><br />
<span style="font-size: x-small;">As some people commented about structured prediction not being included in the chart: There is <a href="http://www.cs.cornell.edu/people/tj/svm_light/svm_struct.html">SVMstruct</a>, which is a great library and has interfaces to many languages, but is only free for non-commercial use.</span><br />
<span style="font-size: x-small;">There is also the library I'm working on, <a href="https://github.com/amueller/pystruct">pystruct</a>, which I will write about on another day ;)</span><br />
<span style="font-size: x-small;"><br /></span>
<span style="font-size: x-small;">The chart is not really comprehensive, as I focused on scikit-learn. Otherwise I certainly would have included neural networks ;)</span><br />
<span style="font-size: x-small;">[/edit]</span>Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com53tag:blogger.com,1999:blog-7345806147365425073.post-91040902462500441042013-01-21T23:44:00.000+01:002013-01-22T00:18:29.516+01:00Scikit-Learn 0.13 released! We want your feedback.After a little delay, the team finished work on the 0.13 release of scikit-learn.<br />
There is also a <a href="https://docs.google.com/spreadsheet/viewform?formkey=dFdyeGNhMzlCRWZUdldpMEZlZ1B1YkE6MQ#gid=0">user survey</a> that we launched in parallel with the release, to get some feedback from our users.<br />
<br />
There is a list of changes and new features <a href="http://scikit-learn.org/stable/whats_new.html">on the website</a>.<br />
You can upgrade using easy-install or pip using:<br />
<br />
pip install -U scikit-learn<br />
or<br />
easy_install -U scikit-learn <br />
<br />
<br />
There were more than 60 people contributing to this release, with 24 people having 10 commits or more.<br />
<br />
Again many improvements are behind the scenes or only slightly notable. We improved test coverage a lot and we have much more consistent parameter names now. There is now also a user guide entry for the classification metrics, and their naming was improved.<br />
<br />
This was one of the many improvements by <a href="https://github.com/arjoly">Arnaud Joly</a>, who joined the project very recently but nevertheless wound up being the one with the second most commits in this release!<br />
<br />
Now let me get to some of the more visible highlights of this release from my perspective:<br />
<br />
- Thanks to Lars and Olivier, the <a href="http://scikit-learn.org/stable/modules/feature_extraction.html#hashing-vectorizer">Hashing Trick</a> finally made it into scikit-learn.<br />
This allows for very fast vectorization of large text corpora and stateless transformers for the same.<br />
<br />
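A minimal sketch of what this enables (class and parameter names as in scikit-learn's <code>HashingVectorizer</code>; the toy documents are made up for illustration):

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the quick brown fox", "machine learning in python"]

# Stateless: no fit needed and no vocabulary stored,
# so it works on streams of text of arbitrary size.
vect = HashingVectorizer(n_features=2 ** 10)
X = vect.transform(docs)
print(X.shape)  # (2, 1024)
```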
- Sample weights were added to the tree module thanks to Noel and Gilles. This enabled the implementation of a smarter resampling for random forests, which leads to a speed-up of random forests of up to a factor of two! Also, this is the basis of including AdaBoost with Trees in the next release.<br />
<br />
- I added a method to use <a href="http://scikit-learn.org/stable/modules/ensemble.html#random-trees-embedding">totally randomized trees for hashing / embedding features</a> to a high-dimensional, sparse binary representation. It goes along the lines of my <a href="http://peekaboo-vision.blogspot.de/2012/12/kernel-approximations-for-efficient.html">last blog post</a> on using non-linear embeddings followed by simple linear classifiers.<br />
<br />
- I also added <a href="http://scikit-learn.org/stable/modules/kernel_approximation.html#nystroem-kernel-approx">Nystroem kernel approximations</a>, which are really easy to do but should come in quite handy. They still need some more work, though. For details, see my post on <a href="http://peekaboo-vision.blogspot.de/2012/12/kernel-approximations-for-efficient.html">kernel approximations</a>.<br />
<br />
<br />
Thanks to the team for working on this together. I am really happy with the way everybody joins forces, this is an amazing project!Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com0tag:blogger.com,1999:blog-7345806147365425073.post-91055859201500974902012-12-26T23:28:00.001+01:002013-06-20T21:30:01.637+02:00Kernel Approximations for Efficient SVMs (and other feature extraction methods) [update]Recently we added another method for kernel approximation, the Nyström method,
to <a href="http://scikit-learn.org/dev/">scikit-learn</a>, which will be featured in the upcoming 0.13 release.<br />
Kernel-approximations were my first somewhat bigger contribution to
scikit-learn and I have been thinking about them for a while.<br />
To dive into kernel approximations, first recall the <a href="https://en.wikipedia.org/wiki/Kernel_trick">kernel-trick</a>.<br />
<a name='more'></a><br />
<h3>
The Kernel Trick</h3>
The motivation is to obtain a non-linear decision boundary, because not all problems are linear.
Consider this simple 1d dataset<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj771imEPe4S5tJpSXz5aYyBp4jcUQkZH34VS7b6UlWgKqBgk5atb8qmLyxMJZPfLTBfUiwkvRFaSVofpZjPLaKiw0CQUEFmQtVfkGCT7qnlAGvGKu-EqZsNEcqKJdBDxtRB3ohpxQWtdw/s1600/kernel_trick1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj771imEPe4S5tJpSXz5aYyBp4jcUQkZH34VS7b6UlWgKqBgk5atb8qmLyxMJZPfLTBfUiwkvRFaSVofpZjPLaKiw0CQUEFmQtVfkGCT7qnlAGvGKu-EqZsNEcqKJdBDxtRB3ohpxQWtdw/s640/kernel_trick1.png" width="480" /></a></div>
<br />
There is obviously no way this can be linearly separated.<br />
There is an easy way to make it linearly separable, though. By embedding the data in a higher-dimensional space using some non-linear mapping, for example $x \mapsto x^2$<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEih_VelA1SLh79UQQahNVgoXVxBZfI2_on8B8r7l4-LSpbMBa_vQIAq051F1hHF9LjVLfmmRBbbwzCxN87_ASVv_6Hhc-9bHigaGuMJEApAAPjPWKzGVMSZ6334n2a4vlgJ6QhqNDOYULY/s1600/kernel_trick2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEih_VelA1SLh79UQQahNVgoXVxBZfI2_on8B8r7l4-LSpbMBa_vQIAq051F1hHF9LjVLfmmRBbbwzCxN87_ASVv_6Hhc-9bHigaGuMJEApAAPjPWKzGVMSZ6334n2a4vlgJ6QhqNDOYULY/s640/kernel_trick2.png" width="480" /></a></div>
<br />
This is obviously a contrived example, but not totally detached from reality.<br />
<br />
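To make the 1d example concrete, here is a tiny sketch (the data points are made up): no threshold on $x$ separates the classes, but a threshold on $x^2$ does.

```python
import numpy as np

# Class 1 sits on both sides of class 0, so no single threshold on x works.
X = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1])

# After the non-linear mapping x -> x**2 a single threshold separates them.
X_mapped = X ** 2
pred = (X_mapped > 1.0).astype(int)  # any threshold in (0.25, 4.0) works
print((pred == y).all())  # True
```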
Learning classifiers in high dimensions is expensive, though, and you have to come up with the non-linear mapping. If you start out with a given number of features $n$ and you want to compute all polynomial terms up to degree 2, you get more than $n^2$ features, and this rises exponentially in the degree $d$ of polynomials you want!<br />
<br />
There is a very simple trick to circumvent this problem, though, which is the kernel-trick.<br />
The essence of the kernel-trick is that if you can describe an algorithm in a certain way -- which is using only inner products -- then you never need to actually use the feature mapping, as long as you can compute the inner product in the feature space.<br />
<br />
For the polynomial feature map, the inner product in the feature space is given by $k(x, y) = (x^Ty + c)^d$ which is easy enough to compute for any degree $d$.<br />
<br />
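We can check this identity numerically for $d=2$, $c=1$ in two input dimensions, where the explicit feature map is small enough to write out (the vectors are arbitrary examples):

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    # Polynomial kernel k(x, y) = (x^T y + c)^d
    return (x @ y + c) ** d

def phi(x, c=1.0):
    # Explicit degree-2 feature map for 2d input; phi(x) @ phi(y) == k(x, y).
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(poly_kernel(x, y), phi(x) @ phi(y))  # both equal 4 (up to rounding)
```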
What is even better is that you don't really need to start from a feature map. You can specify an inner product $k(x, y)$ directly and under mild conditions (if $k$ is a Mercer-kernel), there exists a space $H$ for which $k$ is the scalar product. It is possible to construct a mapping from the original $\mathbb{R}^n \rightarrow H$ but you never actually need to compute it.<br />
<br />
One of the most popular choices of $k$ is the Gaussian (or RBF) kernel<br />
$k(x, y) = \text{exp}(-\gamma ||x - y|| ^ 2)$, <br />
which is a scalar product in a space that is even infinite dimensional.<br />
<br />
<h3>
Kernelized SVMs</h3>
One of the most popular applications of the kernel trick is the kernelized Support Vector Machine (SVM), which is one of the best off-the-shelf classifiers today. <br />
<br />
One of the characteristics of kernelized algorithms is that their runtime and space complexity is basically independent of the dimensionality of the input space, but rather scales with the number of data points used for training. There are lots of tricks to make SVM training fast, but in general you can assume that the run time is cubic in the number of samples.<br />
Usually good implementations avoid computing the kernel values for all pairs of training points but this comes at the cost of some runtime and algorithmic complexity. If you could afford it, you'd really like to store the whole kernel matrix, i.e. the kernel value for all pairs of training points, which is quadratic in the number of samples.<br />
<br />
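A quick back-of-the-envelope calculation illustrates why storing the full kernel matrix becomes infeasible (the sample count is an arbitrary example):

```python
# Memory needed to store the full kernel (Gram) matrix in float64:
# one 8-byte float per pair of training points.
n_samples = 100_000
n_bytes = n_samples ** 2 * 8
print(n_bytes / 1e9)  # 80.0 gigabytes
```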
If you have very complex, but small (say <10.000 samples or <100.000 depending on your patience) datasets, kernels are pretty neat. But if you deal with really A LOT of data, there is no way you can train an SVM - it will just be too slow or the memory complexity will just be too high. One trend in recent years has been to apply stochastic gradient descent (SGD) optimization to learn linear classifiers.<br />
<br />
These can be really fast - a single iteration is linear in the dimensionality (or rather in the number of non-zero features per sample, which is very nice for sparse data with very few non-zero features per sample) and convergence is sublinear in the number of samples (in theory)!<br />
<br />
So that is really, really fast. But classifiers learned in this way are usually linear in the data (except neural networks which I will not talk about much today).<br />
<br />
<h3>
Kernel approximation - Marrying kernels and SGD</h3>
So on the one hand, we have kernelized SVMs, which work quite well on
complicated data that is not linearly separable, but don't scale well
to many samples. On the other hand, we have SGD optimization that is
very efficient, but only produces linear classifiers.<br />
<br />
A somewhat recent trend is to combine the two by (sort of) moving
away from the kernel trick and computing explicit feature maps. So we
actually map our features to a high dimensional space and then apply a
linear classifier, which yields non-linear decisions in the original
space. I feel like this is a bit strange given the original motivation
for studying kernels, but why not.<br />
<br />
So what we want is to have an embedding into a reasonably sized
space, so that we can then learn a linear classifier using SGD on the
new representation.<br />
<br />
As some people really love the rbf-kernel, it would be great to have
a way to compute the mapping there explicitly. But mapping to an
infinite-dimensional space is clearly not practical.
Fortunately it is known that we only need a finite subspace of that
infinite space to solve the SVM problem, the one that is spanned by the
images of the training data (this is called the Representer Theorem).<br />
<br />
If we used all data points, we would map to the $N$-dimensional space
$\mathbb{R}^N$ and have the same scaling problems that the
kernel-SVM has. Actually we would do worse, as we would really need to
store all kernel values. But we can think of an easy approximation: We
don't go to the full space spanned by all $N$
training points, but we just use a subset. This will only yield an
approximate embedding but if we keep the number of samples we use the
same, the resulting embedding is independent of dataset size and we can
basically choose the complexity to suit our problem.
This algorithm is called Nyström method (or Nyström embedding) and is
what I just added to scikit-learn.<br />
There is also another method to compute approximations to the
rbf-kernel, which is based on some ideas of Fourier-Analysis, connecting
kernels to measures.
It produces a Monte Carlo (i.e. randomized) approximation to the feature
map. If you sample infinitely long, you will actually get "the real"
feature map back. But obviously you stop at a certain point, depending
on your budget. I implemented this method in scikit-learn last year.
<br />
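Both approximations plug into the SGD pipeline described above. Here is a hedged sketch on a toy dataset (the class names <code>Nystroem</code> and <code>RBFSampler</code> are as in scikit-learn's <code>kernel_approximation</code> module; the dataset and parameter values are my own illustrative choices, not the MNIST setup used later):

```python
from sklearn.datasets import make_circles
from sklearn.kernel_approximation import Nystroem, RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# A non-linearly-separable toy problem: two concentric circles.
X, y = make_circles(n_samples=500, noise=0.05, factor=0.3, random_state=0)

scores = {}
for Approx in (Nystroem, RBFSampler):
    # Map into ~100 explicit feature dimensions, then train a
    # linear classifier with SGD on the new representation.
    clf = make_pipeline(
        Approx(gamma=2.0, n_components=100, random_state=0),
        SGDClassifier(max_iter=1000, random_state=0),
    )
    scores[Approx.__name__] = clf.fit(X, y).score(X, y)

print(scores)  # both should do well on this easy problem
```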
<h3>
</h3>
<h3>
Some experiments</h3>
Let's do a simple experiment comparing the Nyström approximation and
the Fourier approximation of an rbf kernel.
I use a subset of MNIST for that (as I am impatient): 20000 training
examples and 10000 test examples. This is basically the same as <a href="http://scikit-learn.org/dev/auto_examples/plot_kernel_approximation.html">this sklearn example</a>,
only with a somewhat bigger dataset. You can find the <a href="https://gist.github.com/4387511">code here</a>.<br />
<b>[update] </b>I forgot to set C on the approximate kernel SVMs. Here is the new plot, in which the Nystroem method is somewhat closer to the exact kernel and the gap between the two methods is bigger.<b> [/update]</b><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFYtMVWZ_Ywu4yCzKS62SBqmOpzdiy9emU5oPofsksmZVCC5Fhfvh3k2ArKVmaBNWyZSE6OC0tISVnUTz7c2cbD5iVy8vivZmreGrxySfA6wSDCNSpmc4zURIeq7ISQnAEmjQF6logfmo/s1600/mnist_kernel_approx_2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFYtMVWZ_Ywu4yCzKS62SBqmOpzdiy9emU5oPofsksmZVCC5Fhfvh3k2ArKVmaBNWyZSE6OC0tISVnUTz7c2cbD5iVy8vivZmreGrxySfA6wSDCNSpmc4zURIeq7ISQnAEmjQF6logfmo/s640/mnist_kernel_approx_2.png" width="480" /></a></div>
The plot compares accuracy and training time of the two approximate kernel embeddings with an exact SVM on the rbf kernel.<br />
<br />
What we can see from this example immediately is that even with
relatively few dimensions (about 1000) both methods produce quite decent
classifiers.
It is also clear that for the same number of features (dimensions in the
embedding space), the Nyström method always beats the Fourier method.<br />
<br />
On the other hand, looking at the running time, Nyström scales a lot
worse than the Fourier embedding. This is caused mainly by an additional
normalization step (one needs to compute a singular value decomposition
of the kernel on the chosen subset of data points used to construct
the mapping).<br />
I might not use the fastest method for SVD here, but even with more
efficient methods, this step seems quite costly.<br />
<br />
With many more samples, computing the embedding for all points might dominate, giving a different picture.
Anyhow, I feel it is nice to be able to compare these methods quite easily side-by-side.<br />
<br />
I plan to throw these at the cover type dataset, but didn't have the time so far.<br />
<h3>
</h3>
<h3>
Other Embeddings</h3>
Going through the motivation above, you might wonder: why bother with kernels at all?
In the end, what you care about is some form of embedding that allows you to do good classification.<br />
<br />
So why not construct one directly? There are several people working on this, but as far as I know mostly
completely disconnected from (and not comparing to) the above methods.<br />
The ImageNet challenge led to some interesting methods in computer vision, like Fisher Vectors, but
research in this direction is still pretty early.
Intuitively it seems important to find out what is important about the data to create a good embedding.
This argument has been made by some people in the neural network community for some time now. <br />
<br />
There is only one embedding in scikit-learn right now that targets
linear classification, which is
a very unsophisticated random forest based embedding. You can find an
<a href="http://scikit-learn.org/dev/auto_examples/ensemble/plot_random_forest_embedding.html">example here</a>. The top left shows a toy dataset that is not linearly separable. Below is the decision boundary as given by naive Bayes using a high-dimensional embedding of the data.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://scikit-learn.org/dev/_images/plot_random_forest_embedding_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://scikit-learn.org/dev/_images/plot_random_forest_embedding_1.png" width="480" /></a></div>
<br />
<br />
Hopefully <a href="https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine">restricted Boltzmann machines</a> (and maybe at some point
Fisher vectors) will be added (<a href="https://github.com/scikit-learn/scikit-learn/pull/1200">PR</a>) and we can compare all these very
different approaches to see what actually works - which, in my opinion
is all that counts.<br />
<br />
Though I'd find it much more satisfying if we also know why ;)<br />
<br />
<h3>
Afterthoughts</h3>
Obviously the embedding $\rightarrow$ linear classifier path is not the only promising avenue for general purpose non-linear classifiers, but you can read about neural networks and random forests enough elsewhere (and also sometimes here ;).<br />
<br />
And just to mention it: there is an interesting connection between the Nystroem method and rbf-networks (remember those?).<br />
<br />
<h3>
The Literature</h3>
I wrote this blog-post to be somewhat concise (obviously without much success) and readable and didn't
give the references in the text.<br />
<br />
If you are interested, there is a huge
amount of inspiring research around what I talked above. Here are just some pointers to relevant work.<br />
<br />
The very recent paper that inspired me to revisit kernel approximations
is from this year's NIPS: <a href="http://books.nips.cc/papers/files/nips25/NIPS2012_0248.pdf">Nystroem Method vs Random Fourier Features: A Theoretical and Empirical Comparison</a><br />
<br />
The original formulation of Nyström approximations is from:<br />
Using the Nystrom method to speed up kernel machines, by Williams and Seeger.
<br />
<br />
The paper that (re-)ignited interest in kernel approximation is by Rahimi and Recht: <a href="http://seattle.intel-research.net/pubs/rahimi-recht-random-features.pdf">Random features for large-scale kernel machines</a><br />
<br />
Finally, feature computations and embeddings that don't relate to
kernels were investigated a lot in computer vision recently.<br />
A good
overview is:
<a href="http://eprints.pascal-network.org/archive/00008315/02/chatfield11comparison.pdf">The devil is in the details: an evaluation of recent feature encoding methods</a><br />
<br />
After writing this I just read <a href="http://www.machinedlearnings.com/2012/12/do-you-really-have-big-data.html">"Do you really have big data" </a>which discusses the complexity / time trade-off from a slightly different angle.
Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com24tag:blogger.com,1999:blog-7345806147365425073.post-35081421722822069292012-12-15T22:34:00.000+01:002012-12-15T22:34:22.621+01:00Another look at MNISTI'm a bit obsessed with MNIST.<br />
Mainly because I think it should not be used in any papers any more - it is weird for a lot of reasons.<br />
When preparing the <a href="http://peekaboo-vision.blogspot.de/2012/12/workshop-on-python-machine-learning-and.html">workshop we held yesterday</a> I noticed one that I wasn't aware of yet: most of the 1-vs-1 subproblems are <b>really</b> easy!<br />
<br />
Basically all pairs of numbers can be separated perfectly using a linear classifier!<br />
And even if you just do a PCA to two dimensions, they can pretty much still be linearly separated! It doesn't get much easier than that. This makes me even more sceptical about "feature learning" results on this dataset.<br />
<br />
To illustrate my point, here are all pairwise PCA projections. The image is pretty huge. Otherwise you wouldn't be able to make out individual data points.<br />
You can generate it using this very <a href="https://gist.github.com/4299381">simple gist</a>.<br />
<br />
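The same kind of check is easy to reproduce on scikit-learn's small built-in 8x8 digits dataset (not MNIST itself; the class pair and the model are my own choices for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

digits = load_digits()
mask = (digits.target == 0) | (digits.target == 1)
X, y = digits.data[mask], digits.target[mask]

# Project the 64-dimensional images down to just two PCA dimensions...
X_2d = PCA(n_components=2).fit_transform(X)

# ...and a linear classifier still separates this pair almost perfectly.
score = LogisticRegression(max_iter=1000).fit(X_2d, y).score(X_2d, y)
print(score)
```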
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhjQWIjA7jgiiyhHsoMFVBvrvFR8TDdgpD-qTOztQtQkWSCPjewZB5INJJHGM8cP35PvtVRvAXED-2Q45UrmZYdHbPJRTj9i9K1BqEj77s_oeDGxsTxj6XKaQzV10fsXFVPbwtLS8b9I8/s1600/mnist_pairs.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjhjQWIjA7jgiiyhHsoMFVBvrvFR8TDdgpD-qTOztQtQkWSCPjewZB5INJJHGM8cP35PvtVRvAXED-2Q45UrmZYdHbPJRTj9i9K1BqEj77s_oeDGxsTxj6XKaQzV10fsXFVPbwtLS8b9I8/s400/mnist_pairs.png" width="400" /></a></div>
<br />
<br />
There are some classes that are not obviously separated: 3 vs 5, 4 vs 9, 5 vs 8 and 7 vs 9. But keep in mind, this is just a PCA to two dimensions. It doesn't mean that they couldn't be separated linearly in the original space.<br />
<br />
Interestingly the "1"s are very easy to identify, even with seven and nine there is basically no way to confuse them. The ones have a somewhat peculiar shape, though. It would be fun to see what a tour along the "bow" (see img at [2, 2]) would look like.<br />
Manifold-people should be delighted ;)<br />
<br />
I think this plot emphasizes again: look at your data!<br />
I hope you enjoyed this perspective.Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com5tag:blogger.com,1999:blog-7345806147365425073.post-36040029637303397422012-12-14T17:31:00.000+01:002012-12-14T17:31:47.647+01:00Workshop on Python, Machine Learning and Scikit-LearnToday there was a workshop at my uni, organized by my Professor Sven Behnke, together with my colleagues Hannes Schulz, Nenard Birešev and me.<br />
<br />
The target group was a local graduate school with a general scientific background, but not much CS or machine learning.<br />
<br />
The workshop consisted of us explaining the methods and the students then playing around with them and answering some questions using IPython notebooks that we provided (if you still don't know about IPython Notebooks, watch <a href="http://www.youtube.com/watch?v=F4rFuIb1Ie4">this talk</a> <b>now</b>).<br />
<br />
Using the notebooks worked out great! There is only so much you can teach in a 5 hour workshop but I think we got across some basic concepts of machine learning and working with data in Python.<br />
<br />
We got some positive feedback and the students really went exploring.<br />
We covered PCA, k-means, linear regression, logistic regression and nearest neighbors, including some real-world examples.<br />
<br />
<br />
You can find all resources, including tex and notebooks for generating figures etc. on <a href="https://github.com/amueller/tutorial_ml_gkbionics">github</a>.<br />
<br />
You are welcome to reuse our material, though dropping us a line would be nice.<br />
<br />
I haven't asked my coauthors about licensing but I think it shouldn't be a problem as long as you attribute.Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com0tag:blogger.com,1999:blog-7345806147365425073.post-18718379411825367092012-11-06T23:59:00.001+01:002012-11-07T11:32:25.103+01:00A Wordcloud in PythonLast week I was at <a href="https://2012.de.pycon.org/">Pycon DE</a>, the German Python conference. After hacking on scikit-learn a lot last week, I decided to do something different on my way back, something I had planned for quite a while:<br />
doing a <a href="http://www.wordle.net/">wordle</a>-like <a href="http://www.infobarrel.com/media/image/54054.jpg">word cloud</a>.<br />
<br />
I know, word clouds are a bit out of style but I kind of like them any way. My motivation to think about word clouds was that I thought these could be combined with topic-models to give somewhat more interesting visualizations.<br />
<br />
So I looked around to find a nice open-source implementation of word-clouds ... only to find none. (This has been a while, maybe it has changed since).<br />
<br />
While I was bored in the train last week, I came up with <a href="https://github.com/amueller/word_cloud">this code</a>.<br />
A little today-themed taste:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-wzGA3_7_-U3mlw_wiTQLOJTsN0v0-v0TKPQ4eT5nETuaV2Mo96lBkpjqh12X-RTxOdjcJloxo0qPxn0R-J_KDYn1DCqvIEpPjtEHucpp-vRQXCjtoIrz1i115exlxa0u6x_BYcKLBSc/s1600/constitution_.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-wzGA3_7_-U3mlw_wiTQLOJTsN0v0-v0TKPQ4eT5nETuaV2Mo96lBkpjqh12X-RTxOdjcJloxo0qPxn0R-J_KDYn1DCqvIEpPjtEHucpp-vRQXCjtoIrz1i115exlxa0u6x_BYcKLBSc/s400/constitution_.png" width="400" /></a></div>
<br />
<br />
<a name='more'></a><br />
The first step is to get some document. I used the <a href="http://www.archives.gov/exhibits/charters/constitution_transcript.html">constitution of the united states </a>for the above.<br />
<pre class="brush:python">with open("constitution.txt") as f:
    lines = f.readlines()
text = "".join(lines)
</pre>
<br />
The next step is to extract words and give the words some weighting - for example how often they occur in the document.
I used scikit-learn's <a href="http://scikit-learn.org/dev/modules/feature_extraction.html#common-vectorizer-usage">CountVectorizer</a> for that as it is convenient and fast, but you could also use <a href="http://nltk.org/">nltk</a> or just some regexp.<br />
I get the counts of the 200 most common non-stopwords and normalize by the maximum count (to be somewhat invariant to document size).<br />
<br />
<pre class="brush:python">cv = CountVectorizer(min_df=0, charset_error="ignore",
stop_words="english", max_features=200)
counts = cv.fit_transform([text]).toarray().ravel()
words = np.array(cv.get_feature_names())
# normalize
counts = counts / float(counts.max())
</pre>
<br />
Now the real work starts. The basic idea is to randomly sample a place on the canvas and draw a word with a size related to its importance (frequency).<br />
We have to take care not to make the words overlap, though.<br />
<br />
There seems to be no good alternative to the <a href="http://www.pythonware.com/products/pil/">Python image library</a> (PIL), which is really, really horrible. There are no docstrings. You specify colors using strings. There is a weird module structure. There are no docstrings.<br />
<br />
Any way, we can get a canvas and a drawing object like this:
<br />
<pre class="brush:python">img_grey = Image.new("L", (width, height))
draw = ImageDraw.Draw(img_grey)
</pre>
We can then write in the image using
<br />
<pre class="brush:python">font = ImageFont.truetype(font_path, font_size)
draw.setfont(font)
draw.text((y, x), "Text that will appear in white", fill="white")</pre>
The <code>font_path</code> here is an absolute path to a TrueType font on your system. I found no way to get around this (I didn't look very hard, though).<br />
<br />
Ok, now we could draw random positions and see if we could draw there without touching any other words.<br />
There is a handy function in <code>ImageDraw.textsize</code>, which tells you how large a piece of text will be once rendered. We can use that to test if there is any overlap.<br />
<br />
Unfortunately, random sampling any place in the image turns out to be very inefficient: if a lot of the room is already taken, we have to try quite often to find some space.<br />
<br />
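That rejection-sampling loop can be sketched like this (a minimal sketch on a boolean occupancy grid instead of the PIL image; <code>max_tries</code> is a hypothetical cutoff that shows exactly where the approach degenerates on a crowded canvas):

```python
import random

def find_free_spot(occupied, box_w, box_h, max_tries=1000):
    """Sample random top-left corners until a box_w x box_h box fits
    without touching anything already drawn.

    `occupied` is a 2D list of booleans marking pixels already used.
    Returns (x, y) or None if we give up -- on a crowded canvas most
    draws get rejected, which is why this gets slow.
    """
    height, width = len(occupied), len(occupied[0])
    for _ in range(max_tries):
        x = random.randrange(width - box_w)
        y = random.randrange(height - box_h)
        # reject if any pixel in the candidate rectangle is taken
        if not any(occupied[y + j][x + i]
                   for j in range(box_h) for i in range(box_w)):
            return x, y
    return None
```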
My next idea was first to find out all possible free places in the image and then sample randomly from those. The easiest way to find free positions is to convolve the current image with a box of size <code>ImageDraw.textsize(next_word)</code>. The places where the result is zero are exactly the places that have enough room for the text.<br />
Using <code>scipy.ndimage.uniform_filter</code> that worked quite nicely.<br />
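A sketch of that convolution step, assuming an occupancy array where non-zero pixels are taken (the helper name and border handling are my choices, not the blog's code):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def free_corners(occupied, box_h, box_w):
    """Top-left corners (row, col) where a box_h x box_w box is free.

    `occupied` is a 2D array, non-zero where pixels are taken.
    uniform_filter computes a box average around each pixel; where that
    average is exactly zero, the whole window is free.  Out-of-bounds
    pixels are treated as taken (cval=1.0), so returned boxes always
    stay inside the image.
    """
    avg = uniform_filter(occupied.astype(float), size=(box_h, box_w),
                         mode='constant', cval=1.0)
    # uniform_filter centers the window; shift back to top-left corners
    centers = np.argwhere(avg == 0)
    return [(r - box_h // 2, c - box_w // 2) for r, c in centers]
```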
<br />
But what do we do if there is not enough room to draw a word in the size we want?<br />
Then we have to make the font smaller and try again. Which means convolving the image again, this time with a somewhat smaller box.<br />
<br />
The code wasn't very fast and this seemed pretty wasteful, so I wanted to use another approach:<a href="http://en.wikipedia.org/wiki/Integral_image"> integral images</a>! Integral images are a way to pre-compute a simple 2d structure from which it is possible to extract the sum over arbitrary rectangles in the image in constant time.<br />
The integral image is basically a 2d cumulative sum and can be computed as
<code>integral_image = np.cumsum(np.cumsum(image, axis=0), axis=1)</code>.
This can be done once, and then we can look up rectangles of any size very fast.
If we are interested in windows of size <code>(w, h)</code> we can find the sum over all possible windows of this size via
<br />
<pre class="brush:python">area = (integral_image[w:, h:] + integral_image[:w, :h]
- integral_image[w:, :h] - integral_image[:w, h:])
</pre>
This is a combination of the integral image query (<a href="http://en.wikipedia.org/wiki/Integral_image">see wikipedia</a>) and my favorite numpy trick to query all positions simultaneously.<br />
So basically this does the same as the convolution above, only it precomputes a structure so that we can query for all possible windows sizes.<br />
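A quick standalone sanity check that the cumulative-sum structure really gives box sums (note that I zero-pad the integral image here, which makes the slicing slightly different from the unpadded snippet above):

```python
import numpy as np

rng = np.random.RandomState(0)
image = rng.randint(0, 2, size=(20, 30))

# pad with a leading zero row/column so windows starting at index 0
# can be queried with the same slicing
ii = np.zeros((21, 31), dtype=int)
ii[1:, 1:] = np.cumsum(np.cumsum(image, axis=0), axis=1)

w, h = 4, 6  # window size
# sums over every w x h window, for all positions at once
area = ii[w:, h:] + ii[:-w, :-h] - ii[w:, :-h] - ii[:-w, h:]

# compare against the naive double loop
for i in range(image.shape[0] - w + 1):
    for j in range(image.shape[1] - h + 1):
        assert area[i, j] == image[i:i + w, j:j + h].sum()
```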
<br />
After drawing a word, we have to compute the integral image again.<br />
Unfortunately, the fancy indexing with the integral image was a bit sluggish.<br />
<br />
On the other hand, that was a great opportunity to try out <a href="http://docs.cython.org/src/userguide/memoryviews.html">typed memory views</a> in <a href="http://cython.org/">cython</a>, which I <a href="https://2012.de.pycon.org/programm/schedule/sessions/8/">learned about </a>from<a href="http://consulting.behnel.de/"> Stefan Behnel</a> at Pycon DE :)
<br />
<pre class="brush:python">def query_integral_image(unsigned int[:,:] integral_image, int size_x, int size_y):
cdef int x = integral_image.shape[0]
cdef int y = integral_image.shape[1]
cdef int area, i, j
x_pos, y_pos = []
for i in xrange(x - size_x):
for j in xrange(y - size_y):
area = integral_image[i, j] + integral_image[i + size_x, j + size_y]
area -= integral_image[i + size_x, j] + integral_image[i, j + size_y]
if not area:
x_pos.append(i)
y_pos.append(j)
</pre>
Awesome! Easy to write down and straight to C speed.<br />
<br />
Except for the last two lines ... lists are not fast.<br />
I couldn't make those much faster (the <a href="http://docs.python.org/2/library/array.html">array module</a> doesn't have a C API, afaik).<br />
<br />
I wanted to sample from all possible positions anyway, so I just ran the above code twice: first counting how many possible positions there are, then sampling an index, then running the loop again until reaching the sampled position.<br />
Using C++ lists would probably have been easier, but I was too lazy to try...<br />
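In plain Python, the count-then-walk trick looks roughly like this (a sketch of the idea, not the cython code):

```python
import random

def sample_free_position(is_free, n_x, n_y, rng=random):
    """Uniformly sample an (i, j) with is_free(i, j) without building
    a list of candidates: count them first, draw a target index, then
    walk the grid again until the target is reached."""
    count = sum(1 for i in range(n_x) for j in range(n_y) if is_free(i, j))
    if count == 0:
        return None
    target = rng.randrange(count)
    seen = 0
    for i in range(n_x):
        for j in range(n_y):
            if is_free(i, j):
                if seen == target:
                    return (i, j)
                seen += 1
```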
<br />
Anyhow, now I had pretty decent integral images :)<br />
The building still took some time, though... so I lazily recomputed only the part that changes after a new word is drawn.<br />
Check out the full code on <a href="https://github.com/amueller/word_cloud">github</a>.<br />
It is not very pretty but I think should be quite readable.<br />
<br />
Less talk more pictures:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtlUmIZg-9xkUTrCz1vyu4R7H2zCMaE5jehssFObyoANiOOilbO2l-s8q0KOg2abz5cAJrze3ssJb_Sagd-rnSc5FPN_3amTRwgXtTAdEFITq925Glxo80L7HfzRmhMH3cgu7FOcYpt3M/s1600/constitution2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjtlUmIZg-9xkUTrCz1vyu4R7H2zCMaE5jehssFObyoANiOOilbO2l-s8q0KOg2abz5cAJrze3ssJb_Sagd-rnSc5FPN_3amTRwgXtTAdEFITq925Glxo80L7HfzRmhMH3cgu7FOcYpt3M/s400/constitution2.png" width="400" /> </a> </div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
To scale the fonts I used some arbitrary logarithmic dependence on the frequency that I felt looked decent.<br />
It is also possible to just make a word smaller if there is no more room.<br />
<br />
Oh and of course I allowed flipping of the words :)
I also played with using arbitrary colors. I didn't see anything like colormaps in PIL, so I used the <a href="http://en.wikipedia.org/wiki/Hsl_color_space">HSL</a> space and just sampled the hue. More elaborate schemes are obviously possible.<br />
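Sampling just the hue can be done with the standard library's colorsys module (a small sketch with hypothetical default saturation and lightness; note that colorsys takes HLS argument order, not HSL):

```python
import colorsys
import random

def random_color(saturation=0.8, lightness=0.5):
    """Pick a random hue with fixed saturation and lightness and
    return an (r, g, b) tuple of 0-255 ints, usable e.g. as a PIL
    fill color in RGB mode."""
    hue = random.random()
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return int(r * 255), int(g * 255), int(b * 255)
```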
<br />
Again, I used a slight trick for a bit more speed: I first computed everything in grey-scale, saved all the positions and then re-did it in color.<br />
<br />
One more, this time a bit more in the theme of the blog (can you guess what this is?)
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg01YMEhQxeZr8J77faf-P5wnwx69lo157S4XL4-VbuTlejUVY1N1xSmP_248QJETKRSug6rXbfJYw0dGjXnRzkq7YjbIk3nugv0MdoCyVkohr47hm76YJFUqTlx6Kqeim3X2mXNbkC0AM/s1600/prml3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg01YMEhQxeZr8J77faf-P5wnwx69lo157S4XL4-VbuTlejUVY1N1xSmP_248QJETKRSug6rXbfJYw0dGjXnRzkq7YjbIk3nugv0MdoCyVkohr47hm76YJFUqTlx6Kqeim3X2mXNbkC0AM/s400/prml3.png" width="400" /></a></div>
<br />
And with less saturation:<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEAKb2h7CeL5xrJ_EnDsxyVl3BGi60fXK0EuS_zJUJG_PgAW00vrxRZi9C-wF9ofFIgt9uajHy9-5v252b3ZPr37ARLn3RxSWKlSgRL2mqjwHr6i3QycHnINQvHIL2q4Aeo2s3k9NTBow/s1600/prml2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEAKb2h7CeL5xrJ_EnDsxyVl3BGi60fXK0EuS_zJUJG_PgAW00vrxRZi9C-wF9ofFIgt9uajHy9-5v252b3ZPr37ARLn3RxSWKlSgRL2mqjwHr6i3QycHnINQvHIL2q4Aeo2s3k9NTBow/s400/prml2.png" width="400" /></a></div>
<br />
There is definitely some room for improvement w.r.t. the look of it, but I feel this is already a nice start if you want to play around.<br />
<br />
One last comment: I thought about improving performance (apparently the only thing on my mind during this little project) by doing the whole thing at a lower resolution and then recreating it at a higher one.<br />
This has two problems: if the resolution is too small, some text might become invisible because it is rendered too small. The other problem is that PIL's font sizes don't scale linearly, so it is not possible to say "I want this font four times larger".<br />
You can work around that but it's not pretty.<br />
So I went with the cython / integral image way, which I think is kind of cool :)<br />
<br />
If you scrolled down for the <a href="https://github.com/amueller/word_cloud">code, it is here</a>.<br />
<br />
PS: yes, this doesn't generate css / html4. But as you get the text sizes and positions, it should be easy to use this as a backend to generate a html page. PR welcome ;) Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com62tag:blogger.com,1999:blog-7345806147365425073.post-88238327949233618902012-10-05T22:17:00.000+02:002012-10-05T23:03:44.402+02:00Animating Random Projections of High Dimensional DataRecently <a href="http://jakevdp.github.com/">Jake</a> showed some pretty cool videos in <a href="http://jakevdp.github.com/blog/2012/08/18/matplotlib-animation-tutorial/">his blog</a>.<br />
This inspired me to go back to an idea I had some time ago, about visualizing high-dimensional data via random projections.<br />
<br />
I love to do exploratory data analysis with <a href="http://scikit-learn.org/dev/">scikit-learn</a>, using the <a href="http://scikit-learn.org/dev/modules/manifold.html">manifold</a>, <a href="http://scikit-learn.org/dev/modules/decomposition.html">decomposition</a> and <a href="http://scikit-learn.org/dev/modules/clustering.html">clustering</a> module. But in the end, I can only look at two (or three) dimensions. And I really <a href="http://vimeo.com/36579366">like to see</a> what I am doing.<br />
<br />
<a name='more'></a>
So I go and look at the first two <a href="http://scikit-learn.org/dev/modules/decomposition.html#principal-component-analysis-pca">PCA</a> directions, then at the first and third, then at the second and third... and so on. That is a bit tedious, and looking at more would be great. For example using time.<br />
<br />
There is software out there called <a href="http://www.ggobi.org/">ggobi</a> which does a pretty good job at visualizing high-dimensional data sets. It is possible to take interactive tours of your high dimensions, set projection angles and whatnot. It has a UI and tons of settings.<br />
I used it a couple of times and I really like it. But it doesn't really fit into my usual work flow. It has good R integration, but no Python integration that I know of. And it also seems a bit overkill for "just looking around a bit".<br />
<br />
So I thought I could try something in the same spirit, but way simpler.<br />
I'll have a look at their work again and see what kind of smart things
they came up with, but for now, I'll just follow my nose and do
something stupid.<br />
<br />
So let's start with <a href="http://scikit-learn.org/dev/auto_examples/datasets/plot_iris_dataset.html#example-datasets-plot-iris-dataset-py">iris.</a> This is about the simplest data set there is, with three classes, four dimensions, and most of the variation in the first two PCA directions.<br />
You can get a pretty good idea by looking at the first three PCA directions (colors code classes):<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://scikit-learn.org/dev/_images/plot_iris_dataset_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://scikit-learn.org/dev/_images/plot_iris_dataset_1.png" width="490" /></a></div>
<br />
<br />
But we wanted to look at all four. To do this, I want a projection that is parametrized, so I can vary it over time. Basically I want a 1d trip (the time) through the space of all 2d projections of iris.<br />
It's not so easy to cover them all, so let's just start with some tour that gets around.<br />
Starting from a PCA'ed version of iris in X_pca, we take<br />
<br />
<span style="font-size: 8pt;">
<pre class="brush:python">interpolation1 = np.cos(alpha) * X_pca[:, 1] + np.sin(alpha) * X_pca[:, 2]
interpolation2 = np.cos(beta) * X_pca[:, 0] + np.sin(beta) * X_pca[:, 3]
</pre>
</span>
<br />
So we take some angle alpha between the second and third direction, and some angle beta between the first and fourth.<br />
As time only has one dimension, I have to give alpha and beta a common parametrization. I chose beta to go twice as fast as alpha.<br />
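With a single time parameter, that could be spelled out as follows (a sketch consistent with the snippet above; the function names are mine):

```python
import numpy as np

def tour_angles(n_frames):
    """One seamless loop of the tour: alpha goes around once while
    beta goes around twice."""
    t = np.linspace(0, 2 * np.pi, n_frames, endpoint=False)
    return t, 2 * t

def project(X_pca, alpha, beta):
    """The 2d view at tour position (alpha, beta), mirroring the
    interpolation formulas above."""
    dim1 = np.cos(alpha) * X_pca[:, 1] + np.sin(alpha) * X_pca[:, 2]
    dim2 = np.cos(beta) * X_pca[:, 0] + np.sin(beta) * X_pca[:, 3]
    return np.column_stack([dim1, dim2])
```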
<br />
Using this together with matplotlib.animation.FuncAnimation gives (drums)<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' src='https://www.blogger.com/video.g?token=AD6v5dz5ttTL1_94zgsyRyU-QXEDBtKmjUUkhWpjQQNtDCenL8TwHD2kLRt2A-BxfF97-2zNHsQ8_PxmV7WdiZbrXg' class='b-hbp-video b-uploaded' frameborder='0'></iframe><br />
<br />
I think that looks pretty neat already. (Apart from bad video quality :-/ sorry for that. I blame blogger.)<br />
The full code is <a href="https://gist.github.com/3841780#file_iris_video.py">here</a> and just a couple of lines.<br />
<br />
Now let's move to a bit more high-dimensional data.<br />
I went with the digits dataset in scikit-learn, which is 64 dimensional, which is kind of hard to look at already. To be able to see something, I just used the classes 1, 2 and 7 (otherwise it gets pretty crowded).<br />
The tour I used for iris was pretty arbitrary and now I wanted something that would work more or less with an arbitrary number of dimensions.<br />
<br />
What I ended up doing is use a projection matrix and vary each entry with a slightly shifted sine-wave. This gives a rotating smooth look and also makes a close loop, so you can repeat the video.<br />
Here is what the result of using PCA on the three classes of digits looks like:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgF4Ir5lP_K0hofpjeJLpt9G-YW3udZ4G77NqJg2EMs4t4vMbm5cwvYx4gHkBnAl8nkzcVAb-pczXsdhwIEvRRYbhaE33BOGmTJm1W2NhyphenhyphenYctTZla5PNIha-mkdE1BCP7tp801vj5oYxOk/s1600/digits_pca.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="243" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgF4Ir5lP_K0hofpjeJLpt9G-YW3udZ4G77NqJg2EMs4t4vMbm5cwvYx4gHkBnAl8nkzcVAb-pczXsdhwIEvRRYbhaE33BOGmTJm1W2NhyphenhyphenYctTZla5PNIha-mkdE1BCP7tp801vj5oYxOk/s320/digits_pca.png" width="320" /></a></div>
Looks pretty good already (this data set is also very easy), but it is not so clear whether 1 and 2 can be easily linearly separated.<br />
So let's have a look at the video:<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' src='https://www.blogger.com/video.g?token=AD6v5dywERID8S2iWblqS9M2qGbW27ZHd_ty98oZysNCTRgzlS3axSzoIyqgwxGcmIpMix8Txv-NghOiks_fEkXQ' class='b-hbp-video b-uploaded' frameborder='0'></iframe><br />
<br />
And yes, indeed: If you watch the video a bit, it's easy to convince yourself that all classes are pretty well linearly separable.<br />
<br />
The code that does the animation is this piece:<br />
<br />
<span style="font-size: 8pt;">
<pre class="brush:python">def animate(self, i):
# set top 2x2 to identity
self.projection[0, 0] = 1
self.projection[1, 1] = 1
# set "free entries" of projection matrix
# gives them a "rotation" feel and makes the whole thing seamless.
scale = 2 * np.pi * i / self.frames
self.projection[2:, :] = np.sin(self.frequencies * scale + self.phases)
interpolation = np.dot(X, self.projection)
# normalize so we fit on screen
interpolation /= interpolation.max(axis=0)
for p, c in zip(self.points, np.unique(y)):
p.set_data(interpolation[y == c, 0], interpolation[y == c, 1])
return self.points
</pre></span>
<br />
The full code is <a href="https://gist.github.com/3841780#file_digits_video.py">here</a>. It should be generic and I invite you to play with it :)<br />
If you wonder about the 2x2 identity that I used, let me go off on a slight tangent...<br />
(if you're not into math, skip this ;)<br />
<br />
As I said before, we want to have a tour of the 2d projections of our data space.<br />
The set of 2d projections is given by the <a href="http://en.wikipedia.org/wiki/Grassmannian">Grassmannian manifold</a> - usually this is thought of as the space of all subspaces, but we can also think of it as the space of all projections.<br />
A point on the Grassmannian can be represented by a basis of the subspace we are interested in. But many different bases give rise to the same subspace: any invertible linear combination yields the same subspace. We can get rid of that redundancy by fixing a square sub-part (for example the first two rows, as I did above) and demanding that this part is the identity matrix.<br />
Not all subspaces can be expressed in this way - for example the one that has only zeros where I want the unit matrix. But the set is dense in the Grassmannian, which basically means I get "nearly all" points.<br />
<br />
I doubt this makes any difference to the video but I like to think my math education <a href="http://www.ais.uni-bonn.de/%7Eamueller/diplom_mueller.pdf">(pdf)</a> was not totally wasted ;)<br />
<br />
Enjoy and I'd love to hear any feedback :)Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com4tag:blogger.com,1999:blog-7345806147365425073.post-33588431268291486482012-09-22T16:29:00.000+02:002012-10-04T15:08:51.522+02:00Recap of my first Kaggle Competition: Detecting Insults in Social Commentary [update 3]Recently I entered my first <a href="http://www.kaggle.com/">kaggle</a> competition - for those who don't know it, it is a site running machine learning competitions. A data set and time frame is provided and the best submission gets a money prize, often something between 5000$ and 50000$.<br />
<br />
I found the approach quite interesting and could definitely use a new laptop, so I entered <a href="http://www.kaggle.com/c/detecting-insults-in-social-commentary">Detecting Insults in Social Commentary. </a><br />
My weapon of choice was Python with <a href="http://scikit-learn.org/dev/">scikit-learn</a> - for those who haven't read my blog before: I am one of the core devs of the project and never shut up about it.<br />
<br />
<a name='more'></a>During the competition I was visiting Microsoft Research, so this is where most of my time and energy went, in particular at the end of the competition, as it was also the end of my internship. And there was also the <a href="http://peekaboo-vision.blogspot.de/2012/09/scikit-learn-012-released.html">scikit-learn release</a> in between. Maybe I can spend a bit more time on the next competition.<br />
<br />
<h3>
The Task</h3>
The task was to classify forum posts / comments into "insult" and "not insult".<br />
The original data set was very small, ~3500 comments, each usually between 1 and 5 sentences.<br />
One week before the deadline, another ~3500 data points were released (the story is a bit more complicated but doesn't matter so much). Some data points had timestamps (mostly missing in training but available in the second set and the final validation). <br />
<br />
<h3>
The Result (Spoiler alert)</h3>
I made 6th place.<a class="team-link single-player" href="http://www.kaggle.com/users/5048/vivek-sharma" target="_blank" title="Vivek Sharma last submitted on 7:47 pm, Wednesday 19 September 2012 UTC"> Vivek Sharma</a> won.<br />
From some mail exchanges, comments in my blog and a <a href="http://www.kaggle.com/c/detecting-insults-in-social-commentary/forums/t/2744/what-did-you-use">thread I opened in the competition forum</a>, I know that at least places 1, 2, 4, 5 and 6 (me) used <a href="http://scikit-learn.org/dev/">scikit-learn</a> for classification and / or feature extraction. This seems like a huge success for the project! I haven't heard from the third place, yet, btw.<br />
<br />
<br />
Enough blabla, now to the interesting part:<br />
First <a href="https://github.com/amueller/kaggle_insults/">my code on github</a>. Probably not so easy to run. Try my "<a href="https://github.com/amueller/scikit-learn/tree/working">working" branch of sklearn</a> if you are interested.<br />
<br />
<h3>
Things That worked</h3>
My two best performing models are actually quite simple, so I'll just paste them here.<br />
The first uses character n-grams, some handcrafted features (in BadWordCounter), chi squared and logistic regression (output had to be probabilities):<br />
<br />
<pre class="brush:python"><span style="font-size: 8pt;"> select = SelectPercentile(score_func=chi2, percentile=18)
clf = LogisticRegression(tol=1e-8, penalty='l2', C=7)
countvect_char = TfidfVectorizer(ngram_range=(1, 5),
analyzer="char", binary=False)
badwords = BadWordCounter()
ft = FeatureStacker([("badwords", badwords), ("chars", countvect_char), ])
char_model = Pipeline([('vect', ft), ('select', select), ('logr', clf)])</span>
</pre>
<br />
The second is very similar, but also uses word n-grams and actually performed a little better on the final evaluation:<br />
<span style="font-size: x-small;"><br /></span>
<pre class="brush:python"><span style="font-size: 8pt;"> select = SelectPercentile(score_func=chi2, percentile=16)
clf = LogisticRegression(tol=1e-8, penalty='l2', C=4)
countvect_char = TfidfVectorizer(ngram_range=(1, 5),
analyzer="char", binary=False)
countvect_word = TfidfVectorizer(ngram_range=(1, 3),
analyzer="word", binary=False, min_df=3)
badwords = BadWordCounter()
ft = FeatureStacker([("badwords", badwords), ("chars", countvect_char),
("words", countvect_word)])
char_word_model = Pipeline([('vect', ft), ('select', select), ('logr', clf)]) </span></pre>
<br />
My final submission contained two more models and also the combination of all four. As expected, the combination performed better than any single model, but the improvement over char_word_model was not large (0.82590 AUC vs 0.82988 AUC; the winner had 0.84249).<br />
Basically all parameters here are crazily cross-validated, but many are quite robust (C=12 and percentile=4 will give about the same results).<br />
Some of the magic happens obviously in BadWordCounter. You can see the implementation <a href="https://github.com/amueller/kaggle_insults/blob/master/features.py#L35">here</a>, but I think the most significant features are "number of words in a badlist", "ratio of words that is in badlist", "ratio of words in ALL CAPS".<br />
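The actual implementation is linked above; a minimal stand-in for that kind of feature extraction might look like this (the word list here is a tiny hypothetical placeholder, not the one from the repository):

```python
import numpy as np

BAD_WORDS = {"idiot", "moron", "stupid"}  # tiny placeholder list

def handcrafted_features(comments):
    """n_bad, bad_ratio and the all-caps ratio for each comment,
    returned as an (n_samples, 3) array."""
    rows = []
    for text in comments:
        words = text.split()
        n_words = max(len(words), 1)
        n_bad = sum(w.lower().strip(".,!?") in BAD_WORDS for w in words)
        n_caps = sum(w.isupper() and len(w) > 1 for w in words)
        rows.append([n_bad, n_bad / float(n_words), n_caps / float(n_words)])
    return np.array(rows)
```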
<br />
Here is a visualization of the largest coefficients of three of my models. Blue means positive sign (insult), red negative (not insult):<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidJm6ux6IsYGNGPZ4_xxHXHFTpYehOiDBAxKMyW_wkwKXjkaTCHULDhuJDUNDOSpQSeWzqnUBRlQ_W2PbedXqf0NyBdlDcCH6yfdDVYkOTaXBJBNSTiehmTnfRSlUW-tDAYkvh667Me7Q/s1600/all_plots_small.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEidJm6ux6IsYGNGPZ4_xxHXHFTpYehOiDBAxKMyW_wkwKXjkaTCHULDhuJDUNDOSpQSeWzqnUBRlQ_W2PbedXqf0NyBdlDcCH6yfdDVYkOTaXBJBNSTiehmTnfRSlUW-tDAYkvh667Me7Q/s1600/all_plots_small.png" width="490" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<br />
Most of the used features are quite intuitive, which I guess is a nice result (bad_ratio is the fraction of "bad" words, n_bad is the number).<br />
<br />
But in particular the character plot looks pretty redundant, with most of the high positives detecting whether someone is a moron or idiot or maybe retarded...<br />
Still it performs quite well (and of course these are only 100 of over 10,000 used features).<br />
<br />
For the list of bad words, I used one that allegedly is also used by google.<br />
As this will include "motherfucker" but not "idiot" or "moron" (two VERY important words in the training / leaderboard set), I extended the list with these and whatever the thesaurus said was "stupid".<br />
Interestingly in some models, the word "fuck" had a very large negative weight.<br />
I speculate this is caused by n_bad (the number of bad words) having a high weight and "fuck" not actually indicating insults.<br />
<br />
As a side note: for the parameter selection, I used the ShuffleSplit (as <a href="https://github.com/ogrisel">Olivier</a> suggested), as StratifiedKFold didn't seem to be very stable. I have no idea why.<br />
I discovered very close to the end that there were some duplicates in the training set (I think one comment was present 5 times), which might have been messing with the cross-validation.<br />
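With current scikit-learn, the ShuffleSplit-based evaluation looks roughly like this (the 0.12-era API spelled some of these names differently, and the data here is synthetic, not the competition set):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
# many random 80/20 splits give a smoother score estimate (with a
# standard deviation) than a few stratified folds on small, noisy data
cv = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
print(scores.mean(), scores.std())
```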
<br />
<br />
<h3>
Things that didn't work</h3>
<h4>
Feature selection:</h4>
I tried L1 feature selection with logistic regression followed by L2-penalized logistic regression, though it was worse than univariate selection in all cases.<br />
I also tried RFE, but didn't really get it to work. I am not so familiar with it and didn't know how to adjust the step-size to work in reasonable time with so many features.<br />
I also gave the randomized logistic regression feature selection a shot (only briefly though), also without much success.<br />
<br />
<h4>
Classifiers: </h4>
One of my submissions used elastic net penalized SGD, but that also turned out to be a bit worse than Logistic Regression.<br />
I also tried Bernoulli naive Bayes, KNN, and random forests (after L1 feature selection) to no avail.<br />
What surprised me most was that I couldn't get SVC (LibSVM) to work.<br />
The logistic regression I used (from LibLinear) was a lot better than the LibSVM with Platt-scaling. Therefore I didn't really try any fancy kernels.<br />
<br />
<h4>
Features:</h4>
I tried to use features from PCA and K-Means (distance to centers).<br />
I also tried to use the chi squared kernel approximation in RandomizedChi2,<br />
as this often worked very well for bag of visual words, but didn't see any improvement.<br />
I also played with <a href="http://pypi.python.org/pypi/jellyfish/0.1.2">jellyfish</a>, which does some word stemming and standardization, but couldn't see an improvement.<br />
<br />
<br />
<b>A long complicated pipeline:</b><br />
I also tried to put more effort into handcrafting the features and parsing the text.<br />
I used sentence and word tokenizers from <a href="http://nltk.org/">nltk</a>, used collocations, extracted features <a href="https://github.com/amueller/kaggle_insults/blob/master/features.py#L169">using regex</a>, even tried to count and correct spelling mistakes.<br />
I briefly used part-of-speech tag histograms, but gave up on POS-tagging as it was very slow.<br />
You can look up the details of what I tried <a href="https://github.com/amueller/kaggle_insults/blob/master/features.py#L123">here</a>.<br />
The model using these features was by far the worst. I didn't use any character features, but many many handcrafted ones. And it didn't really overfit.<br />
It was also pretty bad on the cross-validation on which I designed the features.<br />
Apparently I didn't really find the features I was missing.<br />
I also used a database of positive and negative connotated words.<br />
<br />
I should probably have tried to combine each of these features with the other classifiers, though I wanted to avoid building too-similar models (as I wanted to average them). Also, I didn't really invest enough time to do that (my internship was more important to me).<br />
<br />
<h3>
Things I implemented </h3>
I made several additions to scikit-learn particularly for this competition.<br />
They basically focused on text feature extraction, parameter selection with grid search and feature selection.<br />
<br />
These are:<br />
<h4>
Merged</h4>
<ul>
<li>Enable grid searches using Recursive Feature Elimination. (<a href="https://github.com/scikit-learn/scikit-learn/pull/1128">PR</a>) </li>
<li>Add minimum document frequency option to CountVectorizer (n-gram based text feature extraction) (<a href="https://github.com/scikit-learn/scikit-learn/pull/1128">PR</a>)</li>
<li>Sparse Matrix support in Recursive Feature Elimination. (<a href="https://github.com/scikit-learn/scikit-learn/pull/1029">PR</a>)</li>
<li>Sparse Matrix support in Univariate Feature Selection. (<a href="https://github.com/scikit-learn/scikit-learn/pull/1025">PR</a>)</li>
<li>Enhanced grid search for n-gram extraction. (<a href="https://github.com/scikit-learn/scikit-learn/pull/1024">PR</a>)</li>
<li>Add AUC scoring function. (<a href="https://github.com/scikit-learn/scikit-learn/pull/1013">PR</a>)</li>
<li>MinMaxScaler: Scale data feature-wise between given values (i.e. 0-1). (<a href="https://github.com/scikit-learn/scikit-learn/pull/1131">PR</a>) </li>
</ul>
<h4>
Not merged (yet)</h4>
<ul>
<li>FeatureUnion: use several feature extraction methods and concatenate features. (<a href="https://github.com/scikit-learn/scikit-learn/pull/1173">PR</a>)</li>
<li>Sparse matrix support in randomized logistic regression (<a href="https://github.com/scikit-learn/scikit-learn/pull/1133">PR</a>).</li>
<li>Enhanced visualization and analysis of grid searches. (<a href="https://github.com/scikit-learn/scikit-learn/pull/1128">PR</a>)</li>
<li>Allow grid search using AUC scores. (<a href="https://github.com/scikit-learn/scikit-learn/pull/1014">PR</a>)</li>
</ul>
<h3>
Things I learned</h3>
I learned a lot about how to process text. I never worked with any text data before and I think now I have a pretty good grip on the general idea. The data was quite small for this kind of application but still I think I got a little feel.<br />
Also, it seems to me that the simplest model worked best, feature selection and feature extraction are very important, though hand-crafting features is very non-trivial.<br />
To recap: my best single model was the "char_word_model", which can be constructed in 7 lines of sklearn code, together with 30 lines for custom feature extraction. I think if I had also added the date, I might have had a good chance.<br />
<br />
<br />
<h3>
Things that worked for others</h3>
Most contestants used similar models to mine, i.e. linear classifiers, word and character n-grams, and some form of counting swearwords.<br />
Vivek, who won, found that SVMs worked better for him than logistic regression. Chris Brew, who came in fourth, only used character n-grams and a customized SGD classifier. So even with very simple features, you can get very far.<br />
It seems most people didn't use feature selection, which I tried a lot.<br />
<br />
The most commonly used software was scikit-learn, as I said above, R, and <a href="http://nlp.stanford.edu/software/classifier.shtml">software from the Stanford NLP</a> group.<br />
<br />
For details on what others used, see the discussion in the <a href="http://www.kaggle.com/c/detecting-insults-in-social-commentary/forums/t/2744/what-did-you-use">kaggle forum</a>. <br />
<br />
<h3>
Final Comments</h3>
<br />
After the first version of this blog-post (which I now shamelessly rewrote), I got a huge amount (relatively speaking) of feedback from other competitors.<br />
Thanks to everybody who shared their methods - in the comments, at kaggle, and at the <a href="http://sourceforge.net/mailarchive/message.php?msg_id=29871399">scikit-learn mailing list</a> - and even <a href="https://github.com/cbrew/Insults">their code</a>!<br />
<br />
I feel it is great that even though this is a competition and money is involved, we can openly discuss what we use and what works. I think this will help push the "data science" community and also will help us create better tools.<br />
<br />
<br />
There were several things that seemed a bit weird about the competition.<br />
I know the competitions are generally still somewhat in a beta phase, but there are some things that could be improved:<br />
<br />
The scores from the leader board dropped significantly, from <a href="http://www.kaggle.com/c/detecting-insults-in-social-commentary/leaderboard/milestone">around 91 AUC</a> to <a href="http://www.kaggle.com/c/detecting-insults-in-social-commentary/leaderboard">around 83 AUC</a> on the final evaluation.
I'm pretty sure I did not overfit (in particular the leader board score
was always close to my cross validation score and I only scored on the
leader board 4 times). Some discussion about this is <a href="http://www.kaggle.com/c/detecting-insults-in-social-commentary/forums/t/2737/ask-the-leaders-what-should-have-i-done-to-avoid-overfitting">here</a>. Generally speaking, some sanity tests on the data sets would be great.<br />
<br />
I
was a bit disappointed during the competition as cross-validation
seemed very noisy and my standard deviation captured the scores of the
first 15 places.<br />
That also made it hard to see which changes actually helped.<br />
Also, there seemed to be a high amount of label noise.<br />
For example most of my models had this false positive:<br />
<br />
<span style="font-size: x-small;">Are you retarded faggot lol If you are blind and dont use widgets then that
doesnt mean everyone else does n't use them Widgets is one of the reasons
people like android and prefer it agains iOS You can have any types of widgets
for weather seeing your twitter and stuff and on ios you scroll like an idiot
like a minute and when you finally found the apps you still have to click a
couple of times before you see what you need Android 2:0 iOS ; ] </span><br />
<br />
<br />
Hope you enjoyed this lengthy post :)Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com9tag:blogger.com,1999:blog-7345806147365425073.post-47562914810085443392012-09-05T11:24:00.000+02:002012-10-03T13:29:24.063+02:00Scikit-learn 0.12 releasedLast night I uploaded the new version 0.12 of scikit-learn to <a href="http://pypi.python.org/pypi/scikit-learn/">pypi</a>. Also the <a href="http://scikit-learn.org/">updated website </a>is up and running and development now starts <a href="http://scikit-learn.org/dev/">towards 0.13</a>.<br />
<br />
The new release has some nifty new features (<a href="http://scikit-learn.org/stable/whats_new.html">see whatsnew</a>):<br />
* Multidimensional scaling <br />
* Multi-Output random forests (<a href="http://www.icg.tugraz.at/Members/kontschieder/iccv11.pdf">like these</a>)<br />
* Multi-task Lasso<br />
* More loss functions for ensemble methods and SGD<br />
* Better text feature extraction<br />
<br />
<a name='more'></a>
Even so, the majority of changes in this release are somewhat "under the hood".<br />
<a href="http://blog.vene.ro/">Vlad</a> developed and set up a <a href="http://jenkins-scikit-learn.github.com/scikit-learn-speed/">continuous performance benchmark</a> for the main algorithms during his google summer of code. I am sure this will help improve performance.<br />
There already has been a lot of work in improving performance, by Vlad, Immanuel, Gilles and others for this release.<br />
<br />
Another improvement was the introduction of a set of common tests that are applied to all our estimators. This led to some improvements in stability, but arguably more importantly to a more consistent interface, more robust input validation (checking that the input has as many features at test time as in training, that there are as many labels as data points, etc.) and better error messages.<br />
Work in this direction is not over but I think much progress has been made. And while this is no shiny new algorithm, I think that error messages of the form<br />
<br />
<div class="line" id="LC68">
<span class="s">"A sparse matrix was passed, but dense data</span> <span class="s">is required. Use X.todense() to convert to dense."</span></div>
<div class="line" id="LC68">
</div>
<div class="line" id="LC68">
<span class="s">will help users a lot more than some "invalid index" error deep in the code (thanks <a href="http://gael-varoquaux.info/blog/">Gael</a>) </span>.</div>
<br />
Even more behind the scenes, to make this possible, the <a href="http://scikit-learn.org/dev/developers/index.html#apis-of-scikit-learn-objects">API</a> of scikit-learn objects is now a bit more well defined and stricter. <br />
The number of mixin classes, from which algorithms derive, has been extended to:<br />
<br />
* ClusterMixin<br />
* TransformerMixin<br />
* ClassifierMixin<br />
* RegressorMixin<br />
* MetaEstimatorMixin (e.g. RFE, GridSearchCV; needs another estimator to be instantiated)<br />
<br />
These now give a very good handle on how estimators behave and how they should be used (for example clustering algorithms all implement a "fit" and "fit_predict" but not necessarily a "predict" etc).<br />
I think we are not far from a very unified interface with intuitive behavior and parameter names.<br />
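To illustrate what the mixins buy you, here is a toy estimator (entirely hypothetical, not part of scikit-learn): deriving from ClusterMixin gives you a working fit_predict for free, as long as fit sets labels_:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin

class ThresholdClustering(BaseEstimator, ClusterMixin):
    """Toy clustering: split samples by whether their mean exceeds a threshold.

    Deriving from ClusterMixin provides fit_predict, which calls fit and
    returns the labels_ attribute that fit is expected to set.
    """
    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.labels_ = (X.mean(axis=1) > self.threshold).astype(int)
        return self

# fit_predict comes from the mixin, not from this class
labels = ThresholdClustering().fit_predict([[-1.0, -2.0], [3.0, 4.0]])
```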
<br />
<br />
Not completely related to the release, but worth noticing:<br />
During the last couple of weeks I had the feeling that there are more and more users that are contributing and several improvements in the release are due to some first time contributors.<br />
<br />
Also, people from other packages have been reaching out to join forces. <br />
I talked to <a href="http://www.igglybob.com/">Ryan Curtin</a> from <a href="http://www.mlpack.org/">mlpack</a> and someone from <a href="http://www.shogun-toolbox.org/">shogun</a> joined us to celebrate the release on IRC :) Even with different foci, I hope we can all collaborate a bit more in the future for even better software.<br />
Btw, shogun also released yesterday, at the same time as we did. Congratulations!<br />
They also have a <a href="http://www.shogun-toolbox.org/">pretty new website</a>; be sure to check it out.<br />
<br />
That's all, enjoy the release and<a href="https://github.com/scikit-learn/scikit-learn/issues"> give us a shout</a> if you have any trouble.<br />
<br />Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com0tag:blogger.com,1999:blog-7345806147365425073.post-59497766951159026312012-09-01T13:34:00.000+02:002013-07-02T08:24:07.493+02:00Segmentation Algorithms in scikits-imageRecently some segmentation and superpixel algorithms I implemented were <a href="https://github.com/scikits-image/scikits-image/pull/206">merged</a> into <a href="http://scikits-image.org/">scikits-image</a>. You can see the example <a href="http://scikits-image.org/docs/dev/auto_examples/plot_segmentations.html">here</a>.<br />
<br />
I reimplemented <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.6267&rep=rep1&type=pdf">Felzenszwalb's fast graph based method</a>, <a href="http://vision.ucla.edu/papers/vedaldiS08quick.pdf">quickshift</a> and <a href="http://www.kev-smith.com/papers/SMITH_TPAMI12.pdf">SLIC</a>.<br />
The goal was to have easy access to some successful methods to make comparison easier and encourage experimenting with the algorithms. <br />
<br />
Here is a comparison of my implementations against the original implementations on Lena (downscaled by a factor of 2). The first row is my implementation, the second the original.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2YeiFjlFEIjB7ssWRnSjxcMTRGD-U_DysM_Em0AGgg5QtBBz2cVRLXlA3mqR48eXA1lAibk8-3d9U0ikcSH8eslWqIVu4D1tCn48B2rLfLeGIhL7WNsNtu9UcIV4IPZ8qNYSc_fqwZMk/s1600/visual_comparison.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi2YeiFjlFEIjB7ssWRnSjxcMTRGD-U_DysM_Em0AGgg5QtBBz2cVRLXlA3mqR48eXA1lAibk8-3d9U0ikcSH8eslWqIVu4D1tCn48B2rLfLeGIhL7WNsNtu9UcIV4IPZ8qNYSc_fqwZMk/s640/visual_comparison.png" width="490" /></a></div>
<br />
<a name='more'></a><br />
<br />
For the comparison, I used my <a href="https://github.com/amueller/vlfeat">python bindings of vl_feat's quickshift</a>, my <a href="http://peekaboo-vision.blogspot.de/2012/05/superpixels-for-python-pretty-slic.html">SLIC bindings</a> and used the <a href="http://www.cs.brown.edu/%7Epff/segment/">executable provided</a> for Felzsenzwalb's method.<br />
<br />
In general, I think this looks quite good. The biggest visual difference is for SLIC, where my implementation clearly does not do as well as the original one. I am quite sure this is a matter of using the right color space transform.<br />
<br />
For the fast graph based approach, the result looks qualitatively similar, but is actually different. The reason is that I implemented a "per channel" approach, as advocated in the paper: Segment each RGB channel separately, then combine the segments using intersection.<br />
Reading the code later on, I saw that the algorithm that is actually implemented works directly on the RGB image - which is what I first implemented, but then revised after reading the paper :-/<br />
It should be fairly easy to change my implementation to fit the original one (and not do what is said in the paper).<br />
<br />
Here are some timing comparisons - they are just on this image, so to be taken with a grain of salt. But I think the general message is clear:<br />
<br />
<table>
<tbody>
<tr><td></td><td>Fast Graph Based</td><td>SLIC</td><td>Quickshift</td></tr>
<tr><td>mine</td><td>910ms</td><td>589ms</td><td>5470ms</td></tr>
<tr><td>original</td><td>166ms</td><td>234ms</td><td>5130ms</td></tr>
</tbody></table>
<br />
So the original implementation of the Fast Graph Based approach is much faster than mine, though as said above, it implements a different approach. I would expect a speedup of roughly 3x from implementing theirs, which would still leave my code about half as fast.
For SLIC, my code is also about half as fast, while for Quickshift it is <strike>insignificantly faster</strike> somewhat slower.
I am a bit disappointed with the outcome, but I think this is at least a reasonable place to start for further improvements. My first priority would be to qualitatively match the performance of SLIC.<br />
<br />
While working on the code, I noticed that using the correct colorspace (<a href="http://en.wikipedia.org/wiki/Lab_color_space">Lab</a>) is really crucial for this algorithm to work. For quickshift, it did not make much of a difference.
One problem here is that the quickshift code, the SLIC code and scikits-image all use different transformations from RGB to Lab.<br />
<br />
I will have to play with these to get a feeling on how much they influence the outcome.
As the code is now included in scikits-image, you can find it in our (I recently gained commit rights :) <a href="https://github.com/scikits-image/scikits-image/tree/master/skimage/segmentation">github repository.</a><br />
<br />
During this quite long project, I really came to love <a href="http://www.cython.org/">Cython</a>, which I used to write all of the algorithms. The workflow from a standard Python program for testing to an implementation with C speed is seamless! The difference in speed between my implementations and the originals is almost certainly due to algorithmic issues, rather than "Python being slow". The C code generated by Cython is really straightforward and fast.<br />
I want to point out the <br />
<pre>cython -a</pre>
command, as it took me some time to discover it. It is simply brilliant. It gives a html output of your cython code, highlighting lines in which Python API is called (and that are therefore slow).<br />
<br />
If you want to implement an algorithm from scratch for Python, and it must be fast, definitely use Cython!<br />
<br />
That's all :)Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com4tag:blogger.com,1999:blog-7345806147365425073.post-23316959234462726152012-08-15T17:08:00.002+02:002012-10-03T00:38:54.194+02:00[ECCV2012] Offset based image completionThis is a short post about an ECCV 2012 paper I just <a href="http://cvpapers.com/eccv2012.html">discovered</a>.<br />
The paper is <a href="http://research.microsoft.com/en-us/um/people/kahe/publications/eccv12completion.pdf">Statistics of Patch Offsets for Image Completion</a> by Kaiming He and Jian Sun from Microsoft Research Asia.<br />
There was a recent talk at MSRC about <a href="http://www.adobe.com/technology/projects/patchmatch.html">PatchMatch</a> by <a href="http://danbgoldman.com/">Dan Goldman</a>. PatchMatch is a simple but very efficient image completion algorithm that is used in Photoshop. The page linked above contains a beautiful illustration of what is possible using this very efficient, still very powerful, algorithm.<br />
For each patch that has parts that are missing, PatchMatch finds the best fit in the known image. <br />
The trick to do this fast is to use a randomized search and exploit the fact that neighboring patches have probably neighboring matches.<br />
But enough about PatchMatch - that just got me interested in this topic.<br />
<br />
So this new ECCV paper has a similar idea, but is a bit more radical.<br />
First, it looks for matching patches in the known parts of the image. It requires that these are not too close. The key observation is: the <b>offsets </b>of these patches <b>cluster</b>. These clusters correspond to repeating patterns in the image, to dominant lines and texture.<br />
Once we found the dominant offsets (done by using 2d histograms in the paper), we can just take some translated versions of the picture (in the paper, 60 offsets=translations are used) and try to stitch these together - using a standard MRF with alpha-expansion approach.<br />
BAM! State of the art. I think this is less than 50 lines of Python. I love it.<br />
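To make the offset statistic concrete, here is a brute-force sketch in plain NumPy. This is my own illustration, nothing from the paper's actual code; the patch size, the minimum-distance threshold tau and the voting scheme are all simplified stand-ins for the 2d histogram used in the paper:

```python
import numpy as np

def dominant_offsets(img, patch=8, tau=16, k=3):
    """For each patch, find the offset to its best match that is at least
    tau pixels away, then return the k most-voted offsets -- a crude,
    brute-force version of the paper's 2d offset histogram."""
    h, w = img.shape
    coords = [(i, j) for i in range(0, h - patch, patch)
                     for j in range(0, w - patch, patch)]
    patches = {c: img[c[0]:c[0] + patch, c[1]:c[1] + patch] for c in coords}
    votes = {}
    for c in coords:
        best, best_dist = None, np.inf
        for c2 in coords:
            off = (c2[0] - c[0], c2[1] - c[1])
            if abs(off[0]) < tau and abs(off[1]) < tau:
                continue  # require that matches are not too close
            dist = np.sum((patches[c] - patches[c2]) ** 2)
            if dist < best_dist:
                best_dist, best = dist, off
        if best is not None:
            votes[best] = votes.get(best, 0) + 1
    return sorted(votes, key=votes.get, reverse=True)[:k]

rng = np.random.RandomState(0)
base = rng.rand(16, 64)
img = np.tile(base, (4, 1))  # image with a vertical period of 16 pixels
offsets = dominant_offsets(img)
```

On this synthetic image the top offsets come out as multiples of the vertical period, which is exactly the repeating structure that the stitching step then exploits.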
<br />
Here is an illustration from the paper:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOp5Y4Imi9gcrmWVKlaeBdAdFLrdnwgnsTS40Yw-LkG2kCnzaDElrAxBu3rsmsFFPxW5Su4eZitxON2YbjXbkwQdPhjH2AMP40-iyW5NQ23bSnosCr8UK5EieJiPA2iAYTtzwMDMM35Rs/s1600/offset_patches.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOp5Y4Imi9gcrmWVKlaeBdAdFLrdnwgnsTS40Yw-LkG2kCnzaDElrAxBu3rsmsFFPxW5Su4eZitxON2YbjXbkwQdPhjH2AMP40-iyW5NQ23bSnosCr8UK5EieJiPA2iAYTtzwMDMM35Rs/s640/offset_patches.png" width="490" /></a></div>
<br />
You can find the project website with many images <a href="http://research.microsoft.com/en-us/um/people/kahe/eccv12/index.html">here</a>. In particular, an interesting application is filling in the missing bits in stitched panoramas.Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com3tag:blogger.com,1999:blog-7345806147365425073.post-39505731390331260742012-07-17T16:39:00.001+02:002012-10-03T00:40:06.727+02:00A filterbank for low-level visionAt the moment I have the pleasure to be at <a href="http://research.microsoft.com/en-us/labs/cambridge/">MSRC</a>, working under the supervision of <a href="http://research.microsoft.com/en-us/people/carrot/">Carsten Rother</a> and <a href="http://www.nowozin.net/sebastian/">Sebastian Nowozin</a>.<br />
<br />
We are tackling some low-level vision tasks (as in their <a href="http://research.microsoft.com/pubs/162522/jancsary2012rtf.pdf">recent CVPR paper</a>) and in this context,<br />
filter banks are very useful.<br />
They might also be useful for object detection, since one <a href="http://arxiv.org/abs/1109.6638">Gabor rules them all</a>, and Google uses collections of Gabor filters for its image retrieval.<br />
<br />
I used the maximum response filter bank from the website of <a href="http://www.robots.ox.ac.uk/%7Evgg/index.html">Andrew Zisserman's Visual Geometry</a> group <a href="http://www.robots.ox.ac.uk/%7Evgg/research/texclass/filters.html">here</a>.<br />
<br />
My Python adaption is available as a<a href="https://gist.github.com/3129692"> github gist</a>.<br />
The Root Filter set looks like this: <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVup56xio9ZMObblmGFIf8BHNLN0sHkTZWWxDoXB_3TbYIonxEQvEKkhx5fvx6rtE0fY54kMsM-r__DmwvFaqTV1sG8pn9SM4PjwC6dp9fDuGdUOsbKAZchOGZ0ageWWhqnxHzvuFUBz4/s1600/filters.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhVup56xio9ZMObblmGFIf8BHNLN0sHkTZWWxDoXB_3TbYIonxEQvEKkhx5fvx6rtE0fY54kMsM-r__DmwvFaqTV1sG8pn9SM4PjwC6dp9fDuGdUOsbKAZchOGZ0ageWWhqnxHzvuFUBz4/s640/filters.png" width="490" /></a></div>
<br />
<a name='more'></a><br />The maximum responses are obtained by using the max response for the first 6 rows (the other two remain), giving something like orientation invariant edges and bars.<br />
Applying this method to Lena looks something like this:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-EDBgxvWL8eMpmIQ-AXn8K6HPlNDxRtWxYdc8A3OpYx1wU54d-scnSd4topJPx6dGcUbaschREpAX5aPtbvXIRPSF7WOEgyTXaYAYqkXnIaggN_OxJ148zu6nOhkoV2nGlR1MqdN7FXY/s1600/lena_filters.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj-EDBgxvWL8eMpmIQ-AXn8K6HPlNDxRtWxYdc8A3OpYx1wU54d-scnSd4topJPx6dGcUbaschREpAX5aPtbvXIRPSF7WOEgyTXaYAYqkXnIaggN_OxJ148zu6nOhkoV2nGlR1MqdN7FXY/s640/lena_filters.png" width="490" /></a></div>
<br />
<br />
It's a bit slow but can easily be accelerated using joblib (see code).<br />
Happy filtering!Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com0tag:blogger.com,1999:blog-7345806147365425073.post-43804100121171036742012-06-14T19:03:00.001+02:002012-08-07T18:39:28.854+02:00Update for structured SVM in PythonI just <a href="https://github.com/amueller/pystruct">pushed an update</a> for my <a href="http://peekaboo-vision.blogspot.de/2012/06/structured-svm-and-structured.html">structured SVM</a> in Python.<br />
This contains a bugfix in the dual formulation and a subgradient descent version of the structured SVM.<br />
<br />
<a name='more'></a><br /><br />
The new version now has some options to be really verbose, track the slacks and constraints of all examples and measure the primal objective.<br />
<br />
Also, quite handy for approximate inference, it complains when the slack of the "most violated constraint" is smaller than the slacks of the previous constraints.<br />
<br />
The code is not as pretty as it could be but I'm still hacking at it all day.<br />
I hope it is still usable.<br />
<br />
A quick word about the bug that I fixed:<br />
In the usual SVM, all the dual variables alpha are constrained to lie within<br />
0 < alpha < C. In the structural SVM, several dual variables correspond to the same example and share a common slack variable. Therefore, the <b>sum</b> of all alpha that correspond to a single example <b>is bound by C</b>.<br />
Took me 3 days to find :-/<br />
<br />
A good read on subgradient methods for structured prediction is <a href="http://repository.cmu.edu/cgi/viewcontent.cgi?article=1054&context=robotics">"(Online) Subgradient Methods for Structured Prediction"</a> by Ratliff et. al.<br />
<br />
I found the learning rate to be a bit hard to tune, though. I guess this method is again better suited when there are many examples. Please note that I implemented a batch version, not an online version, though this is trivial to change.<br />
<br />
Any comments would be very welcome :)Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com3tag:blogger.com,1999:blog-7345806147365425073.post-89515115771617911182012-06-05T21:28:00.001+02:002013-08-24T11:16:01.054+02:00Structured SVM and Structured Perceptron for CRF learning in Python<span style="background-color: #f9cb9c;">[EDIT: If you are reading this now, have a look at <a href="http://pystruct.github.io/">pystruct.github.io</a>. The project matured quite a bit in the meantime.] </span><br />
<br />
Today I pushed some of <a href="https://github.com/amueller/pystruct">my code to github</a> that I use for experimenting with CRF learning. This goes along the lines of my recent posts on graphcut and I hope to post a full CRF learning framework for semantic image segmentation soon.
This is a pretty standard setup in computer vision, but I really haven't found much code online.<br />
<br />
Actually I haven't found any code to learn loopy CRFs, so I hope my simple implementation can help to get a better understanding of these methods. It certainly helped me ;)<br />
<br />
<a name='more'></a><br />
<br />
Let me start by saying how a structured perceptron works.<br />
Given a training set (x^i, y^i), a structured perceptron tries to find a function f such that<br />
<br />
<a href="http://www.codecogs.com/eqnedit.php?latex=y_i%20=%20%5Carg%20%5Cmax_%7By%20%5Cin%20Y%7D%20f%28x_i,%20y%29" target="_blank"><img src="http://latex.codecogs.com/gif.latex?y_i%20=%20%5Carg%20%5Cmax_%7By%20%5Cin%20Y%7D%20f%28x_i,%20y%29" title="y_i = \arg \max_{y \in Y} f(x^i, y)" /></a><br />
This means it basically tries to minimize the zero-one loss.<br />
<br />
The function f is given by <br />
<a href="http://www.codecogs.com/eqnedit.php?latex=f%28x,%20y%29%20=%20%3Cw,%20%5Cphi%28x,%20y%29%3E" target="_blank"><img src="http://latex.codecogs.com/gif.latex?f%28x,%20y%29%20=%20%3Cw,%20%5Cphi%28x,%20y%29%3E" title="f(x, y) = <w, \phi(x, y)>" /></a><br />
so it is a linear function of some joint feature phi of x and y parametrized by w.<br />
Learning basically uses the perceptron algorithm on phi. One iteration of learning does:<br />
<br />
For each example (x^i, y^i): <br />
<br />
<a href="http://www.codecogs.com/eqnedit.php?latex=%5Chat%7By%7D%20=%20%5Carg%20%5Cmax%20f%28x_i,%20y%29" target="_blank"><img src="http://latex.codecogs.com/gif.latex?%5Chat%7By%7D%20=%20%5Carg%20%5Cmax%20f%28x_i,%20y%29" title="\hat{y} = \arg \max f(x^i, y)" /></a><br />
<a href="http://www.codecogs.com/eqnedit.php?latex=w%20%5Cleftarrow%20w%20@plus;%20%5Cphi%28x%5Ei,%20y%5Ei%29%20-%20%5Cphi%28x%5Ei,%20%5Chat%7By%7D%29" target="_blank"><img src="http://latex.codecogs.com/gif.latex?w%20%5Cleftarrow%20w%20+%20%5Cphi%28x%5Ei,%20y%5Ei%29%20-%20%5Cphi%28x%5Ei,%20%5Chat%7By%7D%29" title="w \leftarrow w + \phi(x^i, y^i) - \phi(x^i, \hat{y})" /></a><br />
<br />
This is very easy to implement and already gives nice results in many cases.<br />
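As a sanity check of the update rule above, here is a minimal sketch. This is my own toy code, not the pystruct implementation: it treats plain multiclass classification as the simplest structured problem, with phi(x, y) placing x in the block of weights belonging to class y, so the argmax is just a loop over the classes:

```python
import numpy as np

def phi(x, y, n_classes):
    """Joint feature map: x placed in the weight block belonging to class y."""
    out = np.zeros(n_classes * len(x))
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def structured_perceptron(X, Y, n_classes, n_iter=20):
    w = np.zeros(n_classes * X.shape[1])
    for _ in range(n_iter):
        for x, y in zip(X, Y):
            # argmax over the output space -- trivial here; for loopy CRFs
            # this is where graph-cut/QPBO inference would go
            y_hat = max(range(n_classes), key=lambda c: w @ phi(x, c, n_classes))
            if y_hat != y:  # perceptron update on the joint features
                w += phi(x, y, n_classes) - phi(x, y_hat, n_classes)
    return w

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]])
Y = np.array([0, 1, 0, 1])
w = structured_perceptron(X, Y, n_classes=2)
```

For a CRF, the only things that change are phi (which then collects unary and pairwise statistics of a labeling) and the argmax (which becomes an inference problem).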
So how do we apply this to CRF learning?<br />
For ease of exposition, assume a binary CRF.
The parameters are pairwise and unary potentials, usually written like this:<br />
<br />
<br />
<a href="http://www.codecogs.com/eqnedit.php?latex=f%28x,y%29%20=%20%5Csum_i%20%5Ctheta_i%28x%29y_i%20@plus;%20%5Ctheta_%7Bij%7D%28x%29y_i%20y_j" target="_blank"><img src="http://latex.codecogs.com/gif.latex?f%28x,y%29%20=%20%5Csum_i%20%5Ctheta_i%28x%29y_i%20+%20%5Ctheta_%7Bij%7D%28x%29y_i%20y_j" title="f(x,y) = \sum_i \theta_i(x)y_i + \theta_{ij}(x)y_i y_j" /></a><br />
So for using a structured perceptron for CRF learning, we basically need to put these theta in the w above and we need to find the argmax somehow.<br />
In my implementation, the argmax is done using <a href="http://pub.ist.ac.at/%7Evnk/">QPBO</a> (my Python bindings are <a href="https://github.com/amueller/pyqpbo">here</a> and work the same as my <a href="http://peekaboo-vision.blogspot.de/2012/05/graphcuts-for-python-pygco.html">gco bindings</a>). This is a graph-cut based inference procedure for general energy functions. Another possibility would be to use <a href="http://research.microsoft.com/en-us/downloads/dad6c31e-2c04-471f-b724-ded18bf70fe3/">TRW-S</a>.<br />
<br />
I like to use graph-cut for inference since it is usually much faster than message passing. The problem is that the usual methods only work with submodular energy functions. If you do learning, you would need to restrict yourself to this kind of function - which I'm not doing here.<br />
<br />
I implemented some very simple forms of CRFs, but they should be easy enough to extend. I am using global parameters theta and the unary terms are just given by a scalar weighting of x.<br />
So this is more of a traditional MRF with learned affinities.<br />
<br />
Here are some simple examples included in my code.<br />
Learning to denoise blocks: <br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1PmRUCyFjqBebrU415r2ROPKDs2ZiBCXaeR1LOZXMCmGsfL_h7KeOUDvcPtunXaRK_LwskbMz6HvsCB9oxSvwAt4v_g_anSGQkiPdhvOzIDUfxjLJizKFiZL4Qg9rcTiH62J81BTgBAY/s1600/denoise.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg1PmRUCyFjqBebrU415r2ROPKDs2ZiBCXaeR1LOZXMCmGsfL_h7KeOUDvcPtunXaRK_LwskbMz6HvsCB9oxSvwAt4v_g_anSGQkiPdhvOzIDUfxjLJizKFiZL4Qg9rcTiH62J81BTgBAY/s320/denoise.png" width="320" /></a></div>
Learning a checkerboard (this is sort of interesting since a non-submodular function is learned:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyGNalbCMbRfl6v6XLnP-VoxUqZjvpayF2Zh1fUbSPWoA7qa6rjHzzgJznoJ0OqfgVchuXOdHiv7sMhCrFyiizF21y8h2MgQbXYrfwR-3hMYQOWUL6VLMV5w7BUbOKEfonVV5AKu2_q0o/s1600/checker.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="227" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyGNalbCMbRfl6v6XLnP-VoxUqZjvpayF2Zh1fUbSPWoA7qa6rjHzzgJznoJ0OqfgVchuXOdHiv7sMhCrFyiizF21y8h2MgQbXYrfwR-3hMYQOWUL6VLMV5w7BUbOKEfonVV5AKu2_q0o/s320/checker.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
Learning to denoise stripes:</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieI552tAZ2q7ky8xiSVgw1vsSaOMRbBUCwMWE6iL5b5aLWFzHiwIAF8rtqhVKAYPdeHBvOZZn5xLtny1JAfAq6XkNKCw2BEHCkEL4ZH6oGsfDTx8m8D2Z-9rS3IOWaQomy3h9Bjn8TSr0/s1600/stripes.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="240" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEieI552tAZ2q7ky8xiSVgw1vsSaOMRbBUCwMWE6iL5b5aLWFzHiwIAF8rtqhVKAYPdeHBvOZZn5xLtny1JAfAq6XkNKCw2BEHCkEL4ZH6oGsfDTx8m8D2Z-9rS3IOWaQomy3h9Bjn8TSr0/s320/stripes.png" width="320" /></a></div>
These are obviously very simple examples, but I feel they still convey the spirit of the method.<br />
<br />
My code also includes a very naive Structured SVM.<br />
The details of learning a structured SVM are well described in I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun
<a href="http://www.jmlr.org/papers/volume6/tsochantaridis05a/tsochantaridis05a.pdf"><i>Large Margin Methods for Structured and Interdependent Output Variables</i></a> and I won't go into the details here.<br />
Similar to the structured perceptron, it alternates two steps: the argmax and the update.<br />
The update here is solving the dual formulation of the SVM-version of the linear function described above, where constraints are derived from the "negative examples" obtained by the argmax.<br />
I solve this dual formulation simply using <a href="http://abel.ee.ucla.edu/cvxopt/">cvxopt</a>.<br />
I say my implementation is pretty naive since I optimize the dual from scratch each time I add new constraints. Since I am mainly interested in learning for loopy CRFs where inference is hard, this doesn't matter so much, as solving the optimization is much faster than doing the inference.<br />
Any hints on how to do this better are still welcome ;)<br />
<br />
There is one more significant difference between the structured perceptron and the structured SVM: to obtain the constraints (="negatives") in the SVM formulation, one needs to take the loss into account - for CRFs usually the number of variables that are labeled wrong. This is the so-called "loss augmented prediction". The idea is that not only should all negatives have lower energy than the ground truth, they should also have a margin that is as big as their loss. (This is the slack-rescaled version, which I implemented.)<br />
This can be done by modifying the unary potentials basically by subtracting the energy of the ground truth.<br />
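The unary trick is easiest to see in the closely related margin-rescaling variant with a Hamming loss, where each wrong state simply gets its score raised by one. A small sketch (my own illustration, not the code from the repository, which implements slack rescaling):

```python
import numpy as np

def loss_augmented_unaries(unaries, y_true):
    """Raise the score of every wrong state by 1 (the per-variable Hamming
    loss), so that high-loss labelings win the argmax unless the model beats
    them by a margin. This is the margin-rescaling variant; slack rescaling
    multiplies by the loss instead of adding it."""
    aug = unaries.copy()
    aug += 1.0                                   # add the loss to every state...
    aug[np.arange(len(y_true)), y_true] -= 1.0   # ...except the true one
    return aug

unaries = np.zeros((3, 2))  # 3 variables, 2 states, all scores equal
aug = loss_augmented_unaries(unaries, np.array([0, 1, 0]))
```

With all model scores equal, the argmax over the augmented unaries picks exactly the wrong label for every variable, which is what generates useful constraints early in training.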
<br />
For the simple examples above, the SVM doesn't give very different results to the Perceptron, so I won't show the images.<br />
I hope this is useful and I'd love to get some feedback and suggestions.<br />
<br />
[update] I forgot to mention that my toy CRF implementation not only works on grid graphs. There is also a version for general graphs.[/update]Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com4tag:blogger.com,1999:blog-7345806147365425073.post-90740690532990399412012-06-05T20:17:00.001+02:002013-08-11T22:22:05.442+02:00Basics on structured learning and prediction<br />
I just pushed some of my structured learning code to <a href="https://github.com/amueller/pystruct">github</a> and hope that some people might find it useful. Before describing my code here, I wanted to give a basic intro into structured prediction. I hope I can at least convey some intuition for this vast research area. So here goes...<br />
<br />
<b>What is structured learning and prediction?</b><br />
<br />
Structured prediction is a generalization of the standard paradigms of supervised learning, classification and regression. All of these can be thought of as finding a function that minimizes some loss over a training set. The differences are in the kinds of functions that are used and the losses.<br />
In classification, the target domain are discrete class labels, and the loss is usually the 0-1 loss, i.e. counting the misclassifications. In regression, the target domain is the real numbers, and the loss is usually mean squared error.<br />
In structured prediction, both the target domain and the loss are more or less arbitrary. This means the goal is not to predict a label or a number, but a possibly much more complicated object like a sequence or a graph.<br />
<br />
<a name='more'></a><br />
<br />
<b>What does that mean?</b><br />
<br />
In structured prediction, we often deal with finite, but large output spaces Y.<br />
This situation could be dealt with using classification with a very large number of classes. The idea behind structured prediction is that we can do better than this, by making use of the structure of the output space.<br />
<br />
<b>A (very simplified) example</b><br />
<br />
Let's say we want to generate text from spoken sentences. Viewed as a pure classification problem, we could see each possible sentence as a class. This has several drawbacks: we have many classes, and to do correct predictions, we have to have all possible sentences in the training set. That doesn't work well. Also, we might not care about getting the sentence completely right.<br />
If we misinterpret a single word, this might be not as bad as misinterpreting every word. So a 0-1 loss on sentences seems inappropriate.<br />
We could also try to view every word as a separate class and try to predict each word individually. This seems somewhat better, since we could learn to get most of the words in a sentence right. On the other hand, we lose all context. So for example the expression "car door" is way more likely than "car boar", while predicted individually these could easily be confused.<br />
Structured prediction tries to overcome these problems by considering the output (here the sentence) as a whole and using a loss function that is appropriate for this domain.<br />
<br />
<b>A formalism</b><br />
I hope I have convinced you that structured prediction is a useful thing. So how are we going to formalize it? Having functions that produce arbitrary objects seems a bit hard to handle. There is one very basic formula at the heart of structured prediction:<br />
<div style="text-align: center;">
<a href="http://www.codecogs.com/eqnedit.php?latex=y*%20=%20%5Carg%20%5Cmax_%7By%20%5Cin%20Y%7D%20f%28x,%20y%29" target="_blank"><img src="http://latex.codecogs.com/gif.latex?y*%20=%20%5Carg%20%5Cmax_%7By%20%5Cin%20Y%7D%20f%28x,%20y%29" title="y* = \arg \max_{y \in Y} f(x, y)" /></a>
</div>
<br />
Here x is the input, Y is the set of all possible outputs and f is a compatibility function that says how well y fits the input x. The prediction for x is y*, the element of Y that maximizes the compatibility.<br />
This very simple formula allows us to predict arbitrarily complex outputs, as long as we can say how compatible a given output is with the input.<br />
<br />
This approach opens up two questions:<br />
<br />
<b>How do we specify f? How do we compute y*?</b><br />
As I said above, the output set Y is usually a finite but very large set (all graphs, all sentences in the English language, all images of a given resolution). Finding the argmax in the above equation by exhaustive search is therefore out of the question, so we need to restrict ourselves to functions f for which we can do the maximization over y efficiently. The most popular tools for building such f are energy functions and conditional random fields (CRFs), which are basically equivalent as far as finding y* is concerned.<br />
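For intuition, here is the argmax written out naively for a tiny output set (my own toy compatibility function, not a real model). Real structured predictors never enumerate Y like this, which is exactly why f has to be restricted:

```python
from itertools import product

def f(x, y):
    # toy compatibility: rewards sequences y that match x elementwise
    return sum(xi == yi for xi, yi in zip(x, y))

labels = [0, 1]
x = [1, 0, 1]

# exhaustive search over Y = all label sequences of length len(x);
# |Y| = 2**3 here, but it grows exponentially in general
y_star = max(product(labels, repeat=len(x)), key=lambda y: f(x, y))
print(y_star)  # (1, 0, 1)
```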
<br />
I won't go into the details of these methods as this is a vast field. The example I am most interested in is pairwise energy functions of discrete variables, which are explained a bit in <a href="http://peekaboo-vision.blogspot.de/2012/05/graphcuts-for-python-pygco.html">my last post</a>.<br />
<br />
There are basically three challenges in doing structured learning and prediction:<br />
- choosing a parametric form of f<br />
- solving argmax_y f(x, y)<br />
- learning the parameters of f to minimize a loss<br />
<br />
My last post was just concerned with the second part (given a particular f, find y*), while my next post will be about the third part, learning parameters.<br />
<br />
There have been many publications and books on this topic. For a nice introduction in the context of computer vision, I recommend <br />
Sebastian Nowozin, Christoph H. Lampert:
<i><a href="http://pub.ist.ac.at/%7Echl/papers/nowozin-fnt2011.pdf">"Structured Learning and Prediction in Computer Vision"</a></i><br />
One of the founding publications on the topic of learning structured models is<br />
I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun
<a href="http://www.jmlr.org/papers/volume6/tsochantaridis05a/tsochantaridis05a.pdf"><i>Large Margin Methods for Structured and Interdependent Output Variables</i></a>
which is also a must-read on the topic. <i><br /></i>Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com1tag:blogger.com,1999:blog-7345806147365425073.post-4359417304402319612012-05-20T19:15:00.000+02:002012-10-05T21:58:58.613+02:00Graphcuts for Python: pygco (slight update)I have been using the excellent <a href="http://vision.csd.uwo.ca/code/">gco</a> library for energy minimization with graph cuts for quite some time. Finally I got around to cleaning up / rewriting some of my <a href="https://github.com/amueller/gco_python">Python wrappers</a> so that maybe someone else can use them, too.<br />
<br />
<a name='more'></a><br />
So what does this library do?<br />
It does discrete energy minimization on loopy graphs. This is an important topic in computer vision, since it can be used for segmentation, stereo vision and optical flow. When you read "CRF" in a computer vision paper and there is no learning involved, it just means they minimized an energy function (probably using gco). Often this is just some form of "smoothing", but a lot more can be done with it.<br />
<br />
Let's talk about pairwise energy functions. These functions take the form<br />
<br />
E(y) = \sum_i v_i(y_i) + \sum_{ij} w_{ij}(y_i, y_j) <br />
<br />
<br />
where v_i and w_ij are fixed parameters and y_i are the multinomial variables over which we want to minimize this energy. Typically the i run over all pixels in an image, and w_ij is only nonzero for adjacent pixels.<br />
In general, think of v and w as tables (of size n_labels or n_labels x n_labels) for each node resp. edge in the graph. In simple models, these are the same tables for all nodes / edges, in more complicated ones they vary. <br />
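To make the notation concrete, here is a sketch (my own toy code, not part of gco) that evaluates E for a labeling of a tiny chain graph, with one unary table v and one pairwise table w shared by all nodes and edges:

```python
# toy pairwise energy: E(y) = sum_i v[y_i] + sum_{(i,j) in edges} w[y_i][y_j]
v = [0.0, 1.0]            # unary costs, one entry per label
w = [[0.0, 0.5],          # pairwise costs, n_labels x n_labels;
     [0.5, 0.0]]          # Potts-like: agreeing neighbors cost nothing

edges = [(0, 1), (1, 2)]  # a tiny chain graph with three nodes

def energy(y):
    unary = sum(v[yi] for yi in y)
    pairwise = sum(w[y[i]][y[j]] for i, j in edges)
    return unary + pairwise

print(energy([0, 0, 0]))  # 0.0: all cheap labels, all neighbors agree
print(energy([0, 1, 0]))  # 2.0: one unary cost of 1 plus two disagreeing edges
```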
<br />
In general this problem is NP-hard, but there are many interesting cases in which it can (somewhat surprisingly) be solved efficiently and exactly.<br />
If y_i are binary and the energy is submodular, the exact solution can be found efficiently.<br />
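For a binary pairwise term, submodularity is a simple inequality on the 2x2 cost table, so it is easy to check; a small helper (my own naming) might look like this:

```python
def is_submodular(w):
    # a binary pairwise cost table w (2x2) is submodular iff
    # w(0,0) + w(1,1) <= w(0,1) + w(1,0)
    return w[0][0] + w[1][1] <= w[0][1] + w[1][0]

potts = [[0, 1],
         [1, 0]]
print(is_submodular(potts))  # True: Potts terms are submodular

antiferro = [[1, 0],
             [0, 1]]
print(is_submodular(antiferro))  # False: this term rewards disagreement
```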
<br />
If the y_i are not binary but the energy fulfills a related property, it is possible to efficiently find good, if not optimal, solutions. See the seminal paper by Yuri Boykov, Olga Veksler and Ramin Zabih, <span class="reference-text"><a class="external text" href="http://www.cs.cornell.edu/%7Erdz/Papers/BVZ-pami01-final.pdf" rel="nofollow">Fast approximate energy minimization via graph cuts.</a></span><br />
<br />
This is where gco comes in: it implements two move-making algorithms, alpha-expansion and alpha-beta-swaps, that solve a series of binary problems to obtain a solution to a multi-label problem.<br />
<br />
Going back to the applications I mentioned above, these labels could be different depths in stereo vision, different directions of movements in optical flow, or different object classes in semantic segmentation.<br />
<br />
There are several ways to specify the energy for gco and I don't want to go into too much detail here. Have a look at the README if you are interested.<br />
<br />
For the moment, I have ported two ways to Python (but the rest should come soon). One is for general graphs, specified by listing all edges, the other one is for 2D grid graphs (images).<br />
In all cases, the unary potentials (the v_i above) are just given as one number for each vertex (pixel in the grid graph case) and state.<br />
In the simple case that I wrapped for Python, the pairwise potentials (the w_ij)<br />
are fixed over the whole graph and only depend on the combination of labels y_i and y_j.<br />
This covers the case of the simple Potts potential, where w_ij is zero if y_i = y_j and one otherwise.<br />
<br />
If you want to use the code, you need to download the original <a href="http://vision.csd.uwo.ca/code/">gco library </a>and <a href="https://github.com/amueller/gco_python">my wrappers</a>. There is one thing you have to keep in mind when working with gco, though: all potentials are expected to be integers, so you need to round them!<br />
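Since gco expects integer potentials, a common trick is to scale float costs by a constant before rounding, so you keep a few digits of precision; a sketch (the helper name and scale factor are my choices, not prescribed by gco):

```python
def to_int_potentials(costs, scale=100):
    # multiply float costs by a scale factor and round to the nearest
    # integer; a larger scale preserves more of the relative precision
    return [[int(round(c * scale)) for c in row] for row in costs]

pairwise = [[0.0, 0.337],
            [0.337, 0.0]]
print(to_int_potentials(pairwise))  # [[0, 34], [34, 0]]
```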
<br />
Here are some examples (available at my <a href="https://github.com/amueller/gco_python">git repo</a>):<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-n1wmTYEeh85Nx-m9mRPQ4Bch1bicUka2Bjgu-aXta7EocM62n_O3xZOGyzpfbIhCHi73aW6dKCS48o8f1Xm0Y3-WKh9dJAoq4b6jmdta0FMlkdkBi7ytlCIwwQmYzSnqmifJ8RmaPqY/s1600/simple.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="241" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh-n1wmTYEeh85Nx-m9mRPQ4Bch1bicUka2Bjgu-aXta7EocM62n_O3xZOGyzpfbIhCHi73aW6dKCS48o8f1Xm0Y3-WKh9dJAoq4b6jmdta0FMlkdkBi7ytlCIwwQmYzSnqmifJ8RmaPqY/s320/simple.png" width="320" /></a></div>
This is a trivial example of smoothing. We have a binary image, add some noise and want to recover the original image. The example contains a "grid graph" version and a version where I explicitly construct the edges.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBotJjmt7qEij0F2eut_MJJaln8CypjhEp8das-ZzY87Pa8BDoUA52wlPA9tXPtdOWuLaIOJs73ddAQGUzhADJypXrYeb1CrhvbG1ZlPe3ffYnTPjx7dNZaaTIxZAd8JGeSkU0agHnOuk/s1600/simple_multinomial.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="68" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBotJjmt7qEij0F2eut_MJJaln8CypjhEp8das-ZzY87Pa8BDoUA52wlPA9tXPtdOWuLaIOJs73ddAQGUzhADJypXrYeb1CrhvbG1ZlPe3ffYnTPjx7dNZaaTIxZAd8JGeSkU0agHnOuk/s320/simple_multinomial.png" width="320" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvwbmOR5QwLltejDh9I3PVPvdVvq1NiQ3Cm4MHZMpFmcyfG8uiE1JVkPuKZBikaCCQ19WnolvUssw9HEdH35SxPlQOJAm07Rz_skLk6-2fUO6xwjMlwhdYhmwT4uFe2UDoJ7WPZowd3Do/s1600/simple_multinomial.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><br /></a></div>
The same works with multinomial input. Here I used two different kinds of pairwise potentials: the second image from the right used Potts potentials, while the one on the very right knows that blue-green and green-red edges are more likely than blue-red.<br />
<br />
And finally, a not-as-trivial-looking example:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCz_SCSOdt26snqCNHLYnJO81Ayy88EePrd2CEkoT9jecc9q1S0k6T5e1ZB50lVmB7yJh0fsVts3Y-sXe7ylF2K_ulD3q3-yG8im_eGNWKlMGde8GBbLztiYewbb2u9iYPqqasdXLpCqQ/s1600/middlebury.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgCz_SCSOdt26snqCNHLYnJO81Ayy88EePrd2CEkoT9jecc9q1S0k6T5e1ZB50lVmB7yJh0fsVts3Y-sXe7ylF2K_ulD3q3-yG8im_eGNWKlMGde8GBbLztiYewbb2u9iYPqqasdXLpCqQ/s640/middlebury.png" width="490" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
The first two images on the top are consecutive frames of a video sequence from the Middlebury benchmark. This example tries to find depth layers in the image using parallax.<br />
The second image is translated in the x direction and the best fit for each pixel is found. The image on the top right gives the best translation for each individual pixel. The bottom images show the results obtained with alpha-expansion and Potts potentials (right) and a 1d topology (left).<br />
<br />
This is not really how you would do stereo vision (the potentials I used are very naive) but I feel it is a nice example (about 10 lines in Python).<br />
<br />
I hope some people find this helpful and there will be more computer vision code in Python soon :)Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com35tag:blogger.com,1999:blog-7345806147365425073.post-81274874927404764422012-05-13T15:02:00.000+02:002012-05-18T18:11:19.236+02:00ICML 2012 Deep Learning and Unsupervised Feature Extraction Reading ListThe <a href="http://icml.cc/2012/">ICML2012</a> accepted papers are officially <a href="http://icml.cc/2012/papers/">online</a>.<br />
<br />
On <a href="https://twitter.com/#%21/t3kcit">twitter</a>, <a href="http://cs.stanford.edu/%7Ekarpathy/">Andrej Karpathy</a> complained that the list is a bit hard to browse through. I agree, and even though this is probably not the nice visualization he had in mind, I felt that topical reading lists would somehow mitigate the problem.<br />
Here is my reading list on deep learning and unsupervised feature extraction:<br />
<br />
<div class="paper" id="paper-73">
<div class="paper" id="paper-910">
<h2>
A Generative Process for Contractive Auto-Encoders</h2>
<div class="authors">
Salah Rifai, Yann Dauphin, Pascal Vincent,
Yoshua Bengio </div>
<div class="type">
– Accepted </div>
<div class="abstract">
<b>Abstract: </b>The contractive
auto-encoder learns a representation of the input data that
captures the local manifold structure around each data point,
through the leading singular vectors of the Jacobian of the
transformation from input to representation. The corresponding
singular values specify how much local variation is plausible
in directions associated with the corresponding singular
vectors, while remaining in a high-density region of the input
space. This paper proposes a procedure for generating samples
that are consistent with the local structure captured by a
contractive auto-encoder. The associated stochastic process
defines a distribution from which one can sample, and which
experimentally appears to converge quickly and mix well
between modes, compared to Restricted Boltzmann Machines and
Deep Belief Networks. The intuitions behind this procedure can
also be used to train the second layer of contraction that
pools lower-level features and learns to be invariant to the
local directions of variation discovered in the first layer.
We show that this can help learn and represent invariances
present in the data and improve classification error.
</div>
</div>
<h2>
<br />
</h2>
<h2>
Building high-level features using large scale unsupervised
learning</h2>
<div class="authors">
Quoc Le, Marc'Aurelio Ranzato, Rajat Monga,
Matthieu Devin, Greg Corrado, Kai Chen, Jeff Dean, Andrew Ng </div>
<div class="type">
– Accepted </div>
<div class="abstract">
<b>Abstract: </b>We consider the
challenge of building feature detectors for high-level concepts
from only unlabeled data. For example, we would like to
understand if it is possible to learn a face detector using only
unlabeled images downloaded from the Internet. To answer this
question, we trained a 9-layered locally connected sparse
autoencoder with pooling and local contrast normalization on a
large dataset of images (which has 10 million images, each image
has 200x200 pixels). Contrary to what appears to be a
widely-held negative belief, our experimental results reveal
that it is possible to achieve a face detector via only
unlabeled data. Control experiments show that the feature
detector is robust not only to translation but also to scaling
and 3D rotation. Also via recognition and visualization, we find
that the same network is sensitive to other high-level concepts
such as cat faces and human bodies.
</div>
</div>
<br />
<div class="paper" id="paper-405">
<h2>
Evaluating Bayesian and L1 Approaches for Sparse Unsupervised
Learning </h2>
<div class="authors">
Shakir Mohamed, Katherine Heller, Zoubin
Ghahramani </div>
<div class="type">
– Accepted </div>
<div class="abstract">
<b>Abstract: </b> The use of <i>L_1</i>
regularisation for sparse learning has generated immense
research interest, with many successful applications in diverse
areas such as signal acquisition, image coding, genomics and
collaborative filtering. While existing work highlights the many
advantages of <i>L_1</i> methods, in this paper we find that <i>L_1</i>
regularisation often dramatically under-performs in terms of
predictive performance when compared with other methods for
inferring sparsity. We focus on unsupervised latent variable
models, and develop <i>L_1</i> minimising factor models,
Bayesian variants of “<i>L_1</i>”, and Bayesian models with a
stronger <i>L_0</i>-like sparsity induced through
spike-and-slab distributions. These spike-and-slab Bayesian
factor models encourage sparsity while accounting for
uncertainty in a principled manner, and avoid unnecessary
shrinkage of non-zero values. We demonstrate on a number of data
sets that in practice spike-and-slab Bayesian methods outperform
<i>L_1</i> minimisation, even on a computational budget. We thus
highlight the need to re-assess the wide use of <i>L_1</i>
methods in sparsity-reliant applications, particularly when we
care about generalising to previously unseen data, and provide
an alternative that, over many varying conditions, provides
improved generalisation performance.</div>
</div>
<br />
<div class="paper" id="paper-461">
<div class="paper" id="paper-105">
<h2>
On multi-view feature learning</h2>
<div class="authors">
Roland Memisevic </div>
<div class="type">
– Accepted </div>
<div class="abstract">
<b>Abstract: </b> Sparse coding
is a common approach to learning local features for object
recognition. Recently, there has been an increasing interest
in learning features from spatio-temporal, binocular, or other
multi-observation data, where the goal is to encode the
relationship between images rather than the content of a
single image. We discuss the role of multiplicative
interactions and of squaring non-linearities in learning such
relations. In particular, we show that training a sparse
coding model whose filter responses are multiplied or squared
amounts to jointly diagonalizing a set of matrices that encode
image transformations. Inference amounts to detecting
rotations in the shared eigenspaces. Our analysis helps
explain recent experimental results showing that Fourier
features and circular Fourier features emerge when training
complex cell models on translating or rotating images. It also
shows how learning about transformations makes it possible to
learn invariant features.</div>
<div class="paper" id="paper-284">
<h2>
</h2>
<h2>
Deep Mixtures of Factor Analysers</h2>
<div class="authors">
Yichuan Tang, Ruslan Salakhutdinov,
Geoffrey Hinton </div>
<div class="type">
– Accepted </div>
<div class="abstract">
<b>Abstract: </b>An efficient
way to learn deep density models that have many layers of
latent variables is to learn one layer at a time using a
model that has only one layer of latent variables. After
learning each layer, samples from the posterior
distributions for that layer are used as training data for
learning the next layer. This approach is commonly used with
Restricted Boltzmann Machines, which are <i>undirected</i>
graphical models with a single hidden layer, but it can also
be used with Mixtures of Factor Analysers (MFAs) which are <i>directed</i>
graphical models. In this paper, we present a greedy
layer-wise learning algorithm for Deep Mixtures of Factor
Analysers (DMFAs). Even though a DMFA can be converted to an
equivalent shallow MFA by multiplying together the factor
loading matrices at different levels, learning and inference
are much more efficient in a DMFA and the sharing of each
lower-level factor loading matrix by many different higher
level MFAs prevents overfitting. We demonstrate empirically
that DMFAs learn better density models than both MFAs and
two types of Restricted Boltzmann Machines on a wide variety
of datasets.</div>
<div class="paper" id="paper-659">
<h2>
</h2>
<h2>
Learning Local Transformation Invariance with Restricted
Boltzmann Machines</h2>
<div class="authors">
Kihyuk Sohn, Honglak Lee </div>
<div class="type">
– Accepted </div>
<div class="abstract">
<b>Abstract: </b>The difficulty
of developing feature learning algorithms that are robust
to the novel transformations (e.g., scale, rotation, or
translation) has been a challenge in many applications
(e.g., object recognition problems). In this paper, we
address this important problem of transformation invariant
feature learning by introducing the transformation
matrices into the energy function of the restricted
Boltzmann machines. Specifically, the proposed
transformation-invariant restricted Boltzmann machines not
only learn the diverse patterns by explicitly transforming
the weight matrix, but it also achieves the invariance of
the feature representation via probabilistic max pooling
of hidden units over the set of transformations.
Furthermore, we show that our transformation-invariant
feature learning framework is not limited to this specific
algorithm, but can be also extended to many unsupervised
learning methods, such as an autoencoder or sparse coding.
To validate, we evaluate our algorithm on several
benchmark image databases such as MNIST variation,
CIFAR-10, and STL-10 as well as the customized digit
datasets with significant transformations, and show very
competitive classification performance to the
state-of-the-art. Besides the image data, we apply the
method for phone classification tasks on TIMIT database to
show the wide applicability of our proposed algorithms to
other domains, achieving state-of-the-art performance.</div>
<div class="abstract">
<br /></div>
<div class="paper" id="paper-718">
<h2>
Large-Scale Feature Learning With Spike-and-Slab
Sparse Coding</h2>
<div class="authors">
Ian Goodfellow, Aaron Courville, Yoshua
Bengio </div>
<div class="type">
– Accepted </div>
<div class="abstract">
<b>Abstract: </b>We consider the problem of object recognition with a
large number of classes. In order to scale existing
feature learning algorithms to this setting, we
introduce a new feature learning and extraction
procedure based on a factor model we call
spike-and-slab sparse coding (S3C). Prior work on this model has
not prioritized the ability to exploit parallel
architectures and scale to the enormous problem sizes
needed for object recognition. We present an inference
procedure appropriate for use with GPUs which allows
us to dramatically increase both the training set size
and the amount of latent factors. We demonstrate that
this approach improves upon the supervised learning
capabilities of both sparse coding and the ssRBM on the
CIFAR-10 dataset. We use the CIFAR-100 dataset to
demonstrate that our method scales to large numbers of
classes better than previous methods. Finally, we use
our method to win the NIPS 2011 Workshop on Challenges
In Learning Hierarchical Models’ Transfer Learning
Challenge.</div>
<div class="abstract">
<br /></div>
<div class="paper" id="paper-791">
<h2>
Deep Lambertian Networks</h2>
<div class="authors">
Yichuan Tang, Ruslan Salakhutdinov,
Geoffrey Hinton </div>
<div class="type">
– Accepted </div>
<div class="abstract">
<b>Abstract: </b>Visual
perception is a challenging problem in part due to
illumination variations. A possible solution is to
first estimate an illumination invariant
representation before using it for recognition. The
object albedo and surface normals are examples of such
representation. In this paper, we introduce a
multilayer generative model where the latent variables
include the albedo, surface normals, and the light
source. Combining Deep Belief Nets with the Lambertian
reflectance assumption, our model can learn good
priors over the albedo from 2D images. Illumination
variations can be explained by changing only the
lighting latent variable in our model. By transferring
learned knowledge from similar objects, albedo and
surface normals estimation from a <i>single</i> image
is possible in our model. Experiments demonstrate that
our model is able to generalize as well as improve
over standard baselines in <i>one-shot</i> face
recognition.
</div>
</div>
<div class="abstract">
<br /></div>
</div>
<div class="paper" id="paper-319">
<h2>
Scene parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers</h2>
<div class="authors">
Clément Farabet, Camille Couprie, Laurent Najman, Yann LeCun
</div>
<div class="type">
– Accepted
</div>
<div class="abstract">
<b>Abstract: </b>Scene parsing
consists in labeling each pixel in an image with the category of the
object it belongs to. We propose a method that uses a multiscale
convolutional network trained from raw pixels to extract dense feature
vectors that encode regions of multiple sizes centered on each pixel.
The method alleviates the need for engineered features. In parallel to
feature extraction, a tree of segments is computed from a graph of pixel
dissimilarities. The feature vectors associated with the segments
covered by each node in the tree are aggregated and fed to a classifier
which produces an estimate of the distribution of object categories
contained in the segment. A subset of tree nodes that cover the image
are then selected so as to maximize the average 'purity' of the class
distributions, hence maximizing the overall likelihood that each segment
will contain a single object. The system yields record accuracies on
the Sift Flow Dataset (33 classes) and the Barcelona Dataset (170
classes) and near-record accuracy on Stanford Background Dataset (8
classes), while being an order of magnitude faster than competing
approaches, producing a 320x240 image labeling in less than 1 second,
including feature extraction.
</div>
</div>
<div class="abstract">
</div>
</div>
</div>
</div>
</div>
<br />Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com0tag:blogger.com,1999:blog-7345806147365425073.post-75507060924744329302012-05-04T20:10:00.001+02:002014-02-01T19:44:11.912+01:00Superpixels for Python - pretty SLICYesterday I wanted to try out a "new" superpixel algorithm that seemed quite successful: <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.165.8269&rep=rep1&type=pdf">SLIC superpixels</a>.<br />
This is actually a very simple algorithm, basically doing KMeans in the color+(x,y) space. I'm a bit bummed that it got named, since I already tried the same approach a couple of years ago and didn't think it was very useful. Well, apparently it is.<br />
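The "KMeans in color+(x,y) space" idea can be sketched in a few lines. This is my own toy illustration in numpy, not the actual SLIC implementation (which also initializes cluster centers on a regular grid and restricts the search to local windows, which is what makes it fast):

```python
import numpy as np

def toy_slic(image, n_segments=4, compactness=10.0, n_iter=10, seed=0):
    # stack the color channels with scaled (x, y) coordinates and run
    # plain k-means; compactness trades off color similarity against
    # spatial proximity (larger -> more compact segments)
    h, w, _ = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    spatial = np.dstack([yy, xx]) * (compactness / max(h, w))
    feats = np.dstack([image, spatial]).reshape(h * w, -1)

    rng = np.random.RandomState(seed)
    centers = feats[rng.choice(h * w, n_segments, replace=False)]
    for _ in range(n_iter):
        # assign each pixel to the nearest center in the joint space
        dists = ((feats[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for k in range(n_segments):
            if (labels == k).any():
                centers[k] = feats[labels == k].mean(0)
    return labels.reshape(h, w)

image = np.random.rand(16, 16, 3)
segments = toy_slic(image)
print(segments.shape)  # (16, 16)
```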
The authors have a nice <a href="http://ivrg.epfl.ch/supplementary_material/RK_SLICSuperpixels/index.html">website</a> with some examples.
Unfortunately the Linux binary didn't run on my box and building it on Linux seemed somewhat non-trivial.<br />
<br />
So I did what I always do: wrote some Python wrappers. You can find them on <a href="https://github.com/amueller/slic-python">github</a> [update] I did an implementation for <a href="http://scikit-image.org/">scikit-image</a> which is now quite mature thanks to some other contributors. I would recommend using that instead if you want SLIC in python.[/update]. The whole thing is pretty small, easy to build and easy to use. Also damn fast (less than a second per image).<br />
<br />
There are two variations, one where you can specify the number of superpixels and one where you can specify the number of pixels in a superpixel. Both have an additional parameter, the "compactness", which is a trade-off between the similarity in colorspace and (x,y) space.<br />
Results for varying parameter settings look something like this:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbAQDp110AHnp2GR2wMG8HkmkowRtW7U310vl3Do-eIIXXs64fDd2st7gqAgBYc3eEnLPBA7jzLKex_e3G41OGoyLFMy08Kqtvk6CFoUqTvB5twkENdHSDmISmJHBLLPlC5gsnApPttB8/s1600/slic_17_13_s.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjbAQDp110AHnp2GR2wMG8HkmkowRtW7U310vl3Do-eIIXXs64fDd2st7gqAgBYc3eEnLPBA7jzLKex_e3G41OGoyLFMy08Kqtvk6CFoUqTvB5twkENdHSDmISmJHBLLPlC5gsnApPttB8/s320/slic_17_13_s.png" height="240" width="320" /></a></div>
<br />
Compare to my (former) favorite, <a href="http://peekaboo-vision.blogspot.de/2011/06/really-quick-shift-python-bindings.html">quickshift</a>:
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiZ8tbdWyr1-AehOFPA1OCAAkQ5DZ9PMdECf5GfNLRNG8s103pKEK5XFBUxN7bReGWmvjJuj4w8n8FURhIm1dXImLY94sZEGHR_wFVNfGIgyypyTFcEk3OHCbm7bIc5NvTmFGuDpbCDMM/s1600/17_13_s.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjiZ8tbdWyr1-AehOFPA1OCAAkQ5DZ9PMdECf5GfNLRNG8s103pKEK5XFBUxN7bReGWmvjJuj4w8n8FURhIm1dXImLY94sZEGHR_wFVNfGIgyypyTFcEk3OHCbm7bIc5NvTmFGuDpbCDMM/s400/17_13_s.png" height="300" width="400" /></a></div>
<strike>Both are done in RGB colorspace and could probably benefit from going to Lab.</strike><br />
The SLIC implementation converts to Lab, while I didn't do the conversion for quickshift (which I probably should have done).<strike><br /></strike>Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com19tag:blogger.com,1999:blog-7345806147365425073.post-34460853487943137632012-04-30T12:34:00.001+02:002012-05-02T08:48:27.203+02:00Python tidbits: inverting the nesting of a nested list.More than once I came across the problem of rearranging a nested list.
I had a nested list of the form
<pre class="brush:python">
X = [['a', 'b', 'c'], ['d', 'e', 'f']]
</pre>
And I want
<pre class="brush:python">
Y = [['a', 'd'], ['b', 'e'], ['c', 'f']]
</pre>
without having to resort to an ugly list comprehension over 3 lines.
A friend told me to use
<pre class="brush:python">
Y = zip(*X)
</pre>
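One caveat: zip actually yields tuples, not lists (and in Python 3 it is lazy, so you need to materialize it). If you really want lists of lists, wrap each column:

```python
X = [['a', 'b', 'c'], ['d', 'e', 'f']]

# zip(*X) transposes the nesting, but yields tuples;
# map each column back to a list explicitly
Y = [list(col) for col in zip(*X)]
print(Y)  # [['a', 'd'], ['b', 'e'], ['c', 'f']]
```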
So easy! Kind of obvious but I didn't find it on the web. So I thought I'd write it down. Enjoy!Andreas Muellerhttp://www.blogger.com/profile/10177962095394942563noreply@blogger.com4