Patrick Emami
My summaries of Machine Learning papers and investigations into various topics concerning artificial intelligence
http://pemami4911.github.io/
Fri, 26 Feb 2021 21:47:50 +0000
A Symmetric and Object-centric World Model for Stochastic Environments
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } },
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>
<hr />
<p><strong>Patrick Emami</strong>, Pan He, Anand Rangarajan, Sanjay Ranka</p>
<p>2020 NeurIPS Object Representations for Learning and Reasoning Workshop <strong>Oral Presentation</strong></p>
<div>
<div class="image-wrapper">
<img src="/img/orlr/orlr_main.png" alt="" width="" height="" />
</div>
<div>
</div>
</div>
<h2 id="abstract">Abstract</h2>
<p>Object-centric world models learn useful representations for planning and control but have so far only been applied to synthetic and deterministic environments. We introduce a perceptual-grouping-based world model for the dual task of extracting object-centric representations and modeling stochastic dynamics in visually complex and noisy video environments. The world model is built upon a novel latent state space model that learns the variance for object discovery and dynamics separately. This design is motivated by the disparity in available information that exists between the discovery component, which takes a provided video frame and decomposes it into objects, and the dynamics component, which predicts representations for future video frames conditioned only on past frames. To learn the dynamics variance, we introduce a best-of-many-rollouts objective. We show that the world model successfully learns accurate and diverse rollouts in a real-world robotic manipulation environment with noisy actions while learning interpretable object-centric representations.</p>
<p><a href="https://github.com/orlrworkshop/orlrworkshop.github.io/blob/master/pdf/ORLR_3.pdf">[paper]</a> <a href="https://github.com/pemami4911/symmetric-and-object-centric-world-models">[code]</a> <a href="/pdfs/Workshop_poster_HD.pdf">[poster]</a></p>
<h2 id="demos---bair-towel-pick-30k-with-noisy-actions">Demos - BAIR Towel Pick 30K with noisy actions</h2>
<p>From left to right: ground truth, <strong>Ours</strong>, OP3, VRNN:</p>
<div class="image-wrapper">
<img src="/img/orlr/BAIR_gt-video-1-rollout-0.gif" />
<img src="/img/orlr/BAIR_Ours-video-1-rollout-0.gif" />
<img src="/img/orlr/BAIR_OP3-video-1-rollout-0.gif" />
<img src="/img/orlr/BAIR_VRNN-video-1-rollout-0.gif" />
</div>
<div class="image-wrapper">
<img src="/img/orlr/BAIR_gt-video-2-rollout-0.gif" />
<img src="/img/orlr/BAIR_Ours-video-2-rollout-0.gif" />
<img src="/img/orlr/BAIR_OP3-video-2-rollout-0.gif" />
<img src="/img/orlr/BAIR_VRNN-video-2-rollout-0.gif" />
</div>
<div class="image-wrapper">
<img src="/img/orlr/BAIR_gt-video-3-rollout-0.gif" />
<img src="/img/orlr/BAIR_Ours-video-3-rollout-0.gif" />
<img src="/img/orlr/BAIR_OP3-video-3-rollout-0.gif" />
<img src="/img/orlr/BAIR_VRNN-video-3-rollout-0.gif" />
</div>
<p>Object decompositions, <strong>Ours</strong>:</p>
<div class="image-wrapper">
<img src="/img/orlr/BAIR_Ours-slot-video-1-rollout-0.gif" />
<img src="/img/orlr/BAIR_Ours-slot-video-3-rollout-0.gif" />
<img src="/img/orlr/BAIR_Ours-slot-video-4-rollout-0.gif" />
<img src="/img/orlr/BAIR_Ours-slot-video-5-rollout-0.gif" />
<img src="/img/orlr/BAIR_Ours-slot-video-9-rollout-0.gif" />
</div>
Tue, 08 Dec 2020 00:00:00 +0000
http://pemami4911.github.io/blog/2020/12/08/symmetric-and-object-centric-world-models.html
blog
What Can Neural Networks Reason About?
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } },
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<hr />
<p><a href="https://openreview.net/forum?id=rJxbJeHFPS">Xu, Li, Zhang, Du, Kawarabayashi, Jegelka, 2020</a></p>
<h2 id="summary">Summary</h2>
<p>This work proposes a theory they call “algorithmic alignment” to explain why some classes of neural net architectures generalize much better than others on certain reasoning problems. They use PAC learning to derive sample complexity bounds that show that the number of samples needed to achieve a desired amount of generalization increases when certain subproblems are “hard” to learn.</p>
<p>For example, they empirically show that DeepSets can easily learn summary statistics for a set of numbers, like “max” or “min”. Since a single MLP has to learn the aggregation function (an easy subproblem) plus the for loop over all elements (a harder subproblem), their theory suggests that the number of samples required to achieve good generalization at test time is much higher for the MLP, which their experiments confirm. Interestingly, they explain why graph neural nets (GNNs) align well with dynamic programming (DP) problems (because of their iterative message-passing style updates), but then also explain why they do not align well with NP-Hard problems. They provide further experimental evidence on a shortest-paths DP problem and a Subset-Sum NP-Hard problem to verify this.</p>
<h2 id="what-about-transformers">What About Transformers?</h2>
<p>The paper doesn’t discuss Transformers, so as a simple exercise, I thought about how they fit into the framework. <a href="https://graphdeeplearning.github.io/post/transformers-are-gnns/">Transformers map readily onto fully connected GNNs</a>, which suggests that they should “align algorithmically” with DP problems but not NP-Hard problems. Note that for a set of $K$ objects with $d$-dimensional features, a multi-head attention Transformer/GNN performs an $O(K^2 d)$ operation. This highlights one limitation of Transformers: they can easily reason about object-object relations, but will struggle to generalize when faced with higher-order relations such as object-object-object. Relational reasoning over $k$-partite graphs, $k > 2$, shows up in certain NP-Hard problems like <a href="https://pubsonline.informs.org/doi/pdf/10.1287/opre.16.2.422">Multidimensional Assignment</a>. It seems like algorithmic alignment will certainly be useful for future research on designing neural net architectures for NP-Hard problems.</p>
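<p>To make the pairwise structure concrete, here is a minimal sketch of single-head self-attention weights. It assumes identity query/key projections, which is a simplification of mine and not how a real Transformer layer is parameterized. The point is only that the score matrix is $K \times K$, with one $d$-dimensional dot product per pair of objects.</p>

```python
import math


def attention_weights(X):
    """Self-attention weights for K objects with d features each.

    Illustrative simplification: queries and keys are the raw inputs
    (identity projections), so only the pairwise structure is shown.
    """
    K, d = len(X), len(X[0])
    weights = []
    for i in range(K):  # every object attends ...
        # ... to every object: K dot products of length d per row.
        scores = [
            sum(a * b for a, b in zip(X[i], X[j])) / math.sqrt(d)
            for j in range(K)
        ]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical input: K = 3 objects with d = 2 features each.
W = attention_weights([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

<p>Each entry of <code>W</code> relates exactly two objects. Nothing in this computation scores a triple of objects jointly, which is the limitation described above.</p>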
Sun, 22 Mar 2020 00:00:00 +0000
http://pemami4911.github.io/paper-summaries/deep-learning-theory/2020/03/22/what-can-neural-nets-reason-about.html
paper-summaries, deep-learning-theory
MLSS 2019
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } },
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>
<hr />
<p>Now that the <a href="https://sites.google.com/view/mlss-2019">2019 London MLSS</a> is over, I thought I’d share a couple things about my experience and summarize some of the fascinating technical content I learned over the 10 days of lectures. This will not only help improve my recall of the material, but will also make it easy to share some of what I learned with my lab and with others that weren’t able to attend. If you find any mistakes, please feel free to reach out so I can make corrections. :)</p>
<p>First and foremost, however, let me send a huge thank you to the organizers, Arthur Gretton and Marc Deisenroth, for putting this amazing event together. It was an incredibly rewarding experience for me. Also, thank you to all the lecturers and volunteers as well!</p>
<p>Before I jump into discussing technical highlights…</p>
<h2 id="wait-why-are-phd-students-going-to-summer-school">Wait, why are PhD students going to summer school?</h2>
<p>For those of you reading this that may not know what a summer school is, summer schools are typically focused on a specific topic, have daily lectures and practicals, and can last from a few days to two weeks. The location of the summer school is usually in an exciting place, so that students can explore the area together. Usually there’s also a poster session for students to share their ongoing work. At this MLSS, days were structured as two 2-hour lectures followed by a practical which involved writing code for some exercises.</p>
<p>From what I’ve been told and from what I experienced, one of the best reasons to attend a summer school is to make friends with other highly-motivated students from around the world. We were lucky enough that Arthur and Marc assembled an amazingly diverse cohort. Out of the ~200 students that attended, 53 nationalities were represented!</p>
<h2 id="it-wasnt-all-work">It wasn’t all work</h2>
<p>I was surrounded by a ton of brilliant people from the moment I arrived in London. Here are some highlights from the fun times we had:</p>
<ul>
<li>Experienced the <a href="https://www.bbc.com/news/uk-49157898">hottest day in UK history</a>. OK so this wasn’t “fun” but our joint suffering was a great bonding experience. I’m used to this type of heat back in Florida, but in London most buildings don’t have air conditioning.</li>
<li>On the Saturday afternoon we bought ~half the food at a local grocery store to feed a large group of us that were having a picnic in Green Park near Buckingham Palace.</li>
<li>A group of us saw A Midsummer Night’s Dream at the Open Air Theatre in Regent’s Park.</li>
</ul>
<div>
<div class="image-wrapper">
<img src="/img/MLSS/mlss2019.jpeg" alt="" width="" height="" />
</div>
<div>
<p class="image-caption">After our tour of the Tower of London</p>
</div>
</div>
<h2 id="technical-highlights-from-mlss">Technical highlights from MLSS</h2>
<p>I’ll only be discussing some of the lectures in this post, but all of the slides and lectures are available at the <a href="https://github.com/mlss-2019">MLSS Github page</a>.</p>
<h2 id="optimization">Optimization</h2>
<p>Lecturer: John Duchi [<a href="https://github.com/mlss-2019/slides/tree/master/optimisation">Slides</a>] [<a href="https://videoken.com/embed/d6i9-ZvjDmY">Video 1</a>] [<a href="https://videoken.com/embed/iNrBhcEjZbc">Video 2</a>]</p>
<p>John first introduced us to the main ideas in convex optimization and then covered some special topics. A big takeaway from his lecture is to stop and think about each particular optimization problem for a bit instead of just tossing SGD at it (supposedly we shouldn’t say “SGD” because the trajectories taken by stochastic gradient methods don’t strictly “descend”; rather, they wander around quite a bit). His point is that we can do a lot better.</p>
<p>The concepts of subgradients and a subdifferential were new to me. A vector $g$ is a <strong>subgradient</strong> of a function $f$ at $x$ if, for all $y$,
\(f(y) \geq f(x) + \langle g, y - x \rangle.\)</p>
<p>In a picture:</p>
<div>
<div class="image-wrapper">
<img src="/img/MLSS/IMG_DD39FE1AF60A-1.jpeg" alt="" width="" height="" />
</div>
<div>
<p class="image-caption">A subgradient of the absolute value function</p>
</div>
</div>
<p>The subdifferential is the set of all vectors $g$ that satisfy the above inequality. These are useful because they can be computed so long as $f$ can be evaluated, which means we can derive optimization algorithms in terms of subgradients to make them more broadly applicable.</p>
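<p>As a small illustration (my own sketch, not from the lecture), here is a subgradient method minimizing $f(x) = |x - 3|$. The target value 3, the $1/t$ step size, and the function names are all assumptions made for the example. Because the iterates are not guaranteed to descend, the code keeps track of the best point seen so far.</p>

```python
def subgradient_abs(x):
    """Return one valid subgradient of f(x) = |x| at x."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0  # at the kink, any value in [-1, 1] is a valid subgradient


def subgradient_method(x0, steps=5000):
    """Minimize f(x) = |x - 3| using diminishing step sizes 1 / t."""
    x, best_x = x0, x0
    for t in range(1, steps + 1):
        g = subgradient_abs(x - 3.0)  # a subgradient of |x - 3| at x
        x = x - g / t
        # Iterates can overshoot and wander, so remember the best one.
        if abs(x - 3.0) < abs(best_x - 3.0):
            best_x = x
    return best_x


x_star = subgradient_method(x0=-5.0)
```

<p>The update never needs a derivative at the kink, only some $g$ that satisfies the inequality above.</p>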
<p>Another new concept was the notion of $\rho$-weakly convex functions. A function $f$ is $\rho$-weakly convex if adding a large enough quadratic makes it convex, that is, if $f(x) + \frac{\rho}{2}\|x\|^2$ is convex. In other words, these are the “sort-of convex” functions that regularizing with a quadratic turns into convex ones.</p>
<p>Ultimately, John gave us three steps to follow:</p>
<ol>
<li>Make a local approximation to your function</li>
<li>Add regularization to “avoid bouncing around like an idiot”</li>
<li>Optimize approximately</li>
</ol>
<h3 id="additional-resources">Additional Resources</h3>
<ul>
<li><a href="https://web.stanford.edu/~boyd/cvxbook/">Convex Optimization - Boyd and Vandenberghe</a></li>
<li><a href="https://arxiv.org/abs/1810.05633">Stochastic (Approximate) Proximal Point Methods: Convergence, Optimality, and Adaptivity - Asi, Duchi</a></li>
</ul>
<h2 id="interpretability">Interpretability</h2>
<p>Lecturer: Sanmi Koyejo [<a href="https://github.com/mlss-2019/slides/tree/master/interpretability">Slides</a>] [<a href="https://videoken.com/embed/dMzqU9aaTFQ">Video 1</a>] [<a href="https://videoken.com/embed/KO8rdIaij8s">Video 2</a>]</p>
<p>Key takeaways:</p>
<ul>
<li>When building an ML system for a task where the repercussions of false positives/false negatives are unknown, it is usually best to defer decision-making to someone downstream</li>
<li>Monotonicity as an interpretability criterion: if feature $X$ “increases”, the output should “increase” as well</li>
<li>A standard definition of what it means for a model to be “interpretable” is still missing</li>
</ul>
<h3 id="additional-resources-1">Additional Resources</h3>
<ul>
<li><a href="https://github.com/jphall663/awesome-machine-learning-interpretability">Awesome-ML-Interpretability</a></li>
<li><a href="https://distill.pub/2017/feature-visualization/">Distill Pub’s Feature Viz article</a></li>
<li><a href="https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime">Introduction to LIME</a></li>
</ul>
<h2 id="gaussian-processes">Gaussian Processes</h2>
<p>Lecturer: James Hensman [<a href="https://github.com/mlss-2019/slides/tree/master/gaussian_processes">Slides</a>] [<a href="https://videoken.com/embed/GxZWMgRydoM">Video 1</a>] [<a href="https://videoken.com/embed/uuKyVS5K8F0">Video 2</a>]</p>
<p>GPs have always been a bit mind-boggling to me. Good thing MLSS was held in the UK, where there seems to be a high concentration of GP/Bayesian ML researchers. Also, James’ slides contain amazing visuals.</p>
<p>What’s a GP? As MacKay puts it, GPs are smoothing techniques that involve placing a Gaussian prior over the <em>functions</em> in the hypothesis class.</p>
<p>I highly recommend checking out Lab 0 and Lab 1 from the GP practical <a href="https://github.com/mlss-2019/tutorials/tree/master/gaussian_processes">here</a> which were great for building the intuitions.</p>
<h3 id="multivariate-normals">Multivariate normals</h3>
<p>GPs heavily make use of <a href="https://en.wikipedia.org/wiki/Multivariate_normal_distribution">multivariate normal distributions</a> (MVN).
In fact, GPs are based on infinite-dimensional MVNs… It took a while for me to wrap my head around this, but these plots from James’ slides really helped:</p>
<div>
<div class="image-wrapper">
<img src="/img/MLSS/GP-1.png" alt="" width="" height="" />
</div>
<div>
<p class="image-caption">samples from a 2D MVN. See slides for GIF versions.</p>
</div>
</div>
<p>The plot on the RHS is the important one. The x-axis represents the two dimensions of this MVN; there are no units for the x-axis, it simply depicts each of the N dimensions of an MVN from 1 to N (the order matters!). The y-axis depicts the values of a random variate sampled from this MVN. A line is drawn connecting $f(x_1)$ to $f(x_2)$. For a 6D MVN, there would be 6 ticks on the x-axis and a line connecting all 6 function values. The trick with GPs is that we (theoretically) let the number of dimensions go to infinity, so the x-axis becomes a continuum. But in practice, we only have a finite number of data points, and we assume the $(x_i, f(x_i))$ pairs are observations of this unknown function.</p>
<p>Confused? Let’s take a step back. In Lab 0, we build some intuition by looking at Linear Regression and then the extension to Bayesian Linear Regression. For GP regression, we make the usual assumption that some unknown target function generated our data, except now our hypothesis class is much more expressive. GPs place a prior over the functions in the hypothesis class. They do this by defining a (Gaussian) stochastic process indexed by the data points:</p>
\[p\Biggl(\begin{bmatrix} f(x_1)\\ f(x_2) \\ \cdots \\ f(x_n) \end{bmatrix} \Biggr) = \mathscr{N} \Biggl( \begin{bmatrix} \mu(x_1)\\ \mu(x_2) \\ \cdots \\ \mu(x_n) \end{bmatrix}, \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1,x_n)\\
\vdots & & \vdots \\
k(x_n, x_1) & \cdots & k(x_n, x_n)\end{bmatrix} \Biggr).\]
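<p>The finite-dimensional view above can be sketched directly: pick $n$ index points, build the covariance matrix, and sample from the resulting MVN. The squared-exponential kernel, its hyperparameter values, and the jitter term are assumptions chosen for illustration, and the mean is taken to be zero.</p>

```python
import numpy as np


def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') on 1-D inputs."""
    sq_dists = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)


rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)  # the n index points x_1, ..., x_n

# Covariance matrix with entries k(x_i, x_j); the small jitter on the
# diagonal keeps the Cholesky factorization numerically stable.
K = rbf_kernel(x, x) + 1e-6 * np.eye(len(x))
L = np.linalg.cholesky(K)

# Each column is one draw of [f(x_1), ..., f(x_n)] from N(0, K).
prior_samples = L @ rng.standard_normal((len(x), 3))
```

<p>Plotting each column of <code>prior_samples</code> against <code>x</code> gives the familiar smooth random functions. Using more index points just makes the x-axis look more like a continuum.</p>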
<h3 id="covariance-functions">Covariance functions</h3>
<p>Note that we can usually transform our data to be zero-mean, so the mean function $\mathbf{\mu}$ is generally assumed to be $0$ and our main concern is the kernel matrix $K$, also known as the Gram matrix. The entries of $K$ are evaluations $k(x_i, x_j)$ of the <strong>covariance function</strong> of the GP. Unlike regular (Bayesian) Linear Regression, GPs have a certain useful structure that comes with formulating the problem as a (Gaussian) stochastic process. That is, the kernel matrix $K$ enables us to incorporate prior knowledge about how we expect the data points to interact. In the next section, I’ll talk about kernels and the kernel matrix $K$ in more detail.</p>
<p>Once we have the basic machinery of GPs down, we can do some cool stuff like compute the conditional distribution of the function that fits a test point $x^*$ given a training set of point pairs $(x_i, f(x_i))$.</p>
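<p>Here is a minimal sketch of that conditional, using the standard Gaussian conditioning formulas with a zero prior mean. The kernel choice, the tiny noise term, and the toy data are my assumptions, not something taken from the lecture or the labs.</p>

```python
import numpy as np


def rbf_kernel(xa, xb, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') on 1-D inputs."""
    sq_dists = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)


def gp_posterior(x_train, f_train, x_test, noise=1e-6):
    """Mean and covariance of f(x_test) given f(x_train), zero prior mean."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)   # train/test cross-covariance
    K_ss = rbf_kernel(x_test, x_test)   # test/test covariance
    L = np.linalg.cholesky(K)
    # alpha = K^{-1} f, computed with two triangular solves.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, cov


# Hypothetical toy data: noiseless observations of sin(x).
x_train = np.array([0.0, 1.0, 2.5, 4.0])
f_train = np.sin(x_train)
x_test = np.array([1.0, 3.0])
mean, cov = gp_posterior(x_train, f_train, x_test)
```

<p>The first test point coincides with a training input, so its posterior variance collapses to nearly zero. The second lies between observations and keeps some uncertainty.</p>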
<p>We also covered model selection using the marginal likelihood of a GP. Kernels are typically parameterized by a lengthscale and variance parameter, and the marginal likelihood objective for selecting these kernel parameters nicely trades off model complexity with data fit.</p>
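<p>For reference, the standard form of the log marginal likelihood for GP regression (as given in Rasmussen and Williams) makes this trade-off visible. Here I write $K_\theta$ for the kernel matrix with hyperparameters $\theta$, including any observation-noise term on its diagonal, and $\mathbf{y}$ for the $n$ observed targets; this notation is mine:</p>
\[\log p(\mathbf{y} \mid X, \theta) = -\frac{1}{2} \mathbf{y}^\top K_\theta^{-1} \mathbf{y} - \frac{1}{2} \log |K_\theta| - \frac{n}{2} \log 2\pi.\]
<p>The first term rewards data fit, the second penalizes model complexity through the determinant of the kernel matrix, and the third is a constant.</p>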
<p>Finally, I’ll briefly mention something we touched on at the end of the GP lecture that was interesting. Since exact GP inference requires inverting the $n \times n$ matrix $K$, which costs $O(n^3)$ in the number of data points, <em>sparse</em> GPs cleverly select a small set of “pseudo-inputs” that approximate the original GP well and allow for using GPs on much larger datasets.</p>
<p>GPs are useful in modern ML because many problems are concerned with interpolating a function given noisy observations using a Bayesian approach.</p>
<h3 id="additional-resources-2">Additional Resources</h3>
<ul>
<li><a href="http://www.gaussianprocess.org/gpml/chapters/">Gaussian Processes for Machine Learning - Rasmussen and Williams</a></li>
<li><a href="http://papers.nips.cc/paper/6877-convolutional-gaussian-processes">Convolutional Gaussian Processes - van der Wilk, Rasmussen, Hensman</a></li>
<li><a href="http://proceedings.mlr.press/v97/burt19a.html">Rates of Convergence for Sparse Variational Gaussian Process Regression - Burt, Rasmussen, van der Wilk</a></li>
</ul>
<h2 id="kernels">Kernels</h2>
<p>Lecturer: Lorenzo Rosasco [<a href="https://github.com/mlss-2019/slides/tree/master/kernels">Slides</a>] [<a href="https://videoken.com/embed/6bUEdtUmh_4">Video 1</a>] [<a href="https://videoken.com/embed/uHPi7q0QuY0">Video 2</a>]</p>
<p>You may have heard of the <em>kernel trick</em>, which involves replacing all inner products of features in an algorithm with a kernel evaluation, e.g., as in <a href="https://sites.google.com/site/dataclusteringalgorithms/kernel-k-means-clustering-algorithm">Kernel K-Means</a>.</p>
<h4 id="more-than-just-a-trick">More than just a trick</h4>
<p>Lorenzo showed us some recent work he’s been doing on scaling up kernel methods to the sizes of datasets commonly consumed by deep learning. To do this type of research, it is important to understand the theory of kernel methods. We covered some of this during the lecture, like Reproducing Kernel Hilbert Spaces (RKHSs). I think this was one of the most difficult lectures of the entire MLSS to digest in the time allotted.</p>
<h4 id="rkhs">RKHS</h4>
<p>Here is my attempt to summarize my understanding of RKHSs. Please see Arthur’s notes (linked below), especially Section 4.1, for a great explanation.</p>
<p>An RKHS is an abstract mathematical space closely intertwined with the notion of kernels. An RKHS is an extension of a Hilbert space; that is, it has all of the properties of a Hilbert space (it’s a vector space, has an inner product, and is complete), plus two other key properties. Note that RKHSs are <em>function spaces</em>; the primitive object that lives in an RKHS is a function.</p>
<p>In fact, <em>evaluating</em> a function in the RKHS maps points from $\mathscr{X} \rightarrow \mathbb{R}$, where $\mathscr{X}$ is the data. However, the “representation” of a function in the RKHS could be an infinite-dimensional vector, if the feature maps corresponding to a particular kernel are infinite dimensional. Since we care about the inner product (evaluation) of the functions in the RKHS, we never actually have to deal with these individual, potentially infinite-dimensional representations. This took me a long time to wrap my head around (see Lemma 9 in Arthur’s notes).</p>
<p>To my understanding, the fact that certain kernels (e.g., the RBF kernel) are said to lift your original feature space to an “infinite-dimensional feature space” comes from the fact that such a kernel is defined as an <em>infinite series</em> of inner products of the original features. This is valid when the series converges (the assumption here is that each feature map is in L2).</p>
<p>If we are given an RKHS, then the key tool we can leverage, which comes for free with the RKHS, is the reproducing property of said RKHS. The reproducing property of the RKHS says that the evaluation of any <em>function</em> in the RKHS at a point $x \in \mathscr{X}$ is the inner product of the <em>function</em> with the corresponding <em>reproducing kernel</em> of the RKHS at the point. The Riesz representation theorem guarantees for us that such a reproducing kernel exists.</p>
<p>Lorenzo showed us how we can start with feature maps and arrive at RKHSs. That is, if you aren’t given an RKHS, you can instead start by defining a kernel as the inner product $k(x,y) := \langle \phi(x), \phi(y) \rangle$ between feature maps. Equivalently, we can define a kernel as a symmetric, positive definite function $k : \mathscr{X} \times \mathscr{X} \rightarrow \mathbb{R}$, and a theorem (Moore-Aronszajn) tells us that there’s a corresponding RKHS for which $k$ is a reproducing kernel. However, in the latter case, there are infinitely many feature maps corresponding to the symmetric, positive definite function we started with.</p>
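<p>A finite-dimensional toy case helps me here. The sketch below uses the degree-2 polynomial kernel on $\mathbb{R}^2$ (my choice of example, not one from the lecture) to check that a kernel evaluation matches the inner product of explicit feature maps, and that two different feature maps can give the same kernel.</p>

```python
import math


def inner(u, v):
    """Euclidean inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y):
    """k(x, y) = <x, y>^2, evaluated without forming any feature map."""
    return inner(x, y) ** 2


def phi(x):
    """A 3-dimensional feature map for the kernel above."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def phi_alt(x):
    """A different, 4-dimensional feature map for the same kernel."""
    x1, x2 = x
    return (x1 * x1, x1 * x2, x2 * x1, x2 * x2)


x, y = (1.0, 2.0), (3.0, -1.0)
via_kernel = poly_kernel(x, y)
via_phi = inner(phi(x), phi(y))
via_phi_alt = inner(phi_alt(x), phi_alt(y))
```

<p>All three values agree, which is the kernel trick in miniature: the kernel is fixed, while the feature map behind it is not unique.</p>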
<h3 id="additional-resources-3">Additional Resources</h3>
<ul>
<li><a href="http://www.gatsby.ucl.ac.uk/~gretton/coursefiles/lecture4_introToRKHS.pdf">Arthur Gretton’s course notes on RKHS</a></li>
<li><a href="https://www.cs.toronto.edu/~duvenaud/cookbook/">David Duvenaud’s notes on kernels</a></li>
</ul>
<h2 id="mcmc">MCMC</h2>
<p>Lecturer: Michael Betancourt [<a href="https://github.com/mlss-2019/slides/tree/master/mcmc">Slides</a>] [<a href="https://videoken.com/embed/UzcLe-kpMDQ">Video 1</a>] [<a href="https://videoken.com/embed/drCwg49Ba_U">Video 2</a>]</p>
<p>What is Bayesian computation?</p>
<ol>
<li>Specify a model</li>
<li>Compute posterior expectation values</li>
</ol>
<p>The first part of our MCMC lecture was all about having us update our mental priors about high-dimensional statistics. For example, although we may think that the mode of a probability density contributes most to an integral when integrating over parameter space, this falls apart quickly in higher dimensions. This is because the relative amount of volume around the mode rapidly disappears in high dimensions ($d\theta \approx 0$). What contributes most to the integral is instead the region where the probability mass sits while also occupying a non-trivial amount of volume ($d\theta \gg 0$). We learned that probability mass concentrates on a thin and hard to find hypersurface called the <em>typical set</em> in higher dimensions.</p>
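<p>A quick way to see this numerically is to sample from a standard normal in increasing dimension and look at how far the samples land from the mode at the origin. This is my own sketch; the sample count and the dimensions are arbitrary choices.</p>

```python
import math
import random

random.seed(0)


def radius_stats(dim, n_samples=2000):
    """Mean and standard deviation of ||x|| for x ~ N(0, I_dim)."""
    radii = []
    for _ in range(n_samples):
        sq_norm = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(dim))
        radii.append(math.sqrt(sq_norm))
    mean = sum(radii) / n_samples
    var = sum((r - mean) ** 2 for r in radii) / n_samples
    return mean, math.sqrt(var)


for dim in (1, 10, 100):
    mean, std = radius_stats(dim)
    print(f"dim={dim:3d}  mean radius={mean:.2f}  std={std:.2f}")
```

<p>As the dimension grows, the mean radius grows roughly like $\sqrt{d}$ while the spread stays below one, so the samples live in a thin shell far away from the mode.</p>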
<h3 id="what-about-variational-methods">What about variational methods?</h3>
<p>Variational inference, which replaces integration with an optimization problem and makes simplistic assumptions about the variational posterior, underestimates the variance and favors solutions that land inside the target typical set. Yikes!</p>
<h3 id="markov-chains">Markov chains</h3>
<p>Monte Carlo methods use samples from the exact target typical set. Markov chains are great because a Markov transition that targets a particular distribution will naturally concentrate towards its probability mass. In practice for MCMC, Michael showed us that the chain quickly converges onto the typical set. The main challenge of MCMC methods seems to be handling various failure modes that cause divergence or cases where the chain gets stuck somewhere within the typical set.</p>
<h3 id="diagnosing-your-mcmc">Diagnosing your MCMC</h3>
<p>From the law of large numbers, we get that Markov chains are asymptotically consistent estimators. However, due to various pathologies, we need more sophisticated diagnostic tools and algorithms to get good finite time behavior. One diagnostic tool is the effective sample size (ESS). The more anti-correlated the samples from the Markov chain are, the higher the ESS.</p>
<p>A robust strategy is to run multiple chains from various initial states and compare the expectations. Another strategy is to check the potential scale reduction factor, or R-hat, which is similar to an ANOVA between multiple chains.</p>
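<p>Here is a rough sketch of the classic (non-split) R-hat, which compares between-chain and within-chain variance. The synthetic “chains” are just independent Gaussian draws that I made up for the example, and this is not Stan’s exact estimator, which adds refinements.</p>

```python
import math
import random


def r_hat(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    # B / n: variance of the chain means.
    b_over_n = sum((mu - grand_mean) ** 2 for mu in means) / (m - 1)
    # W: average of the within-chain sample variances.
    w = sum(
        sum((x - mu) ** 2 for x in c) / (n - 1)
        for c, mu in zip(chains, means)
    ) / m
    var_plus = (n - 1) / n * w + b_over_n
    return math.sqrt(var_plus / w)


random.seed(1)
# Four hypothetical chains that agree with each other ...
mixed = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# ... and four where two of them are stuck around a different location.
stuck = [
    [random.gauss(shift, 1.0) for _ in range(1000)]
    for shift in (0.0, 0.0, 3.0, 3.0)
]
r_mixed, r_stuck = r_hat(mixed), r_hat(stuck)
```

<p>Values close to 1 suggest the chains agree with each other. A value well above 1, as for the stuck chains, is a sign that they have not explored the same distribution.</p>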
<h3 id="hamiltonian-monte-carlo">Hamiltonian Monte Carlo</h3>
<p>HMC is an alternative to Metropolis-Hastings MCMC. One issue with MH is that a naive Normal proposal density will almost always jump outside of the typical set, because most of the volume lies outside. Lowering the std dev of the proposal to avoid this means convergence will be too slow.</p>
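<p>To see the problem, this sketch measures the acceptance rate of random-walk Metropolis on a standard normal target. The target, the step sizes, and the function name are assumptions I picked to keep the example small.</p>

```python
import math
import random


def rw_metropolis_acceptance(dim, step, n_steps=2000, seed=0):
    """Acceptance rate of random-walk Metropolis on a N(0, I_dim) target."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start at a target draw
    log_p = -0.5 * sum(v * v for v in x)
    accepted = 0
    for _ in range(n_steps):
        proposal = [v + step * rng.gauss(0.0, 1.0) for v in x]
        log_p_prop = -0.5 * sum(v * v for v in proposal)
        # The Normal proposal is symmetric, so the MH acceptance
        # probability reduces to min(1, p(proposal) / p(x)).
        if rng.random() < math.exp(min(0.0, log_p_prop - log_p)):
            x, log_p = proposal, log_p_prop
            accepted += 1
    return accepted / n_steps


rate_1d = rw_metropolis_acceptance(dim=1, step=1.0)
rate_100d = rw_metropolis_acceptance(dim=100, step=1.0)
rate_100d_small = rw_metropolis_acceptance(dim=100, step=0.1)
```

<p>With the same step size, almost every proposal is rejected in 100 dimensions. Shrinking the step restores acceptance, but each move then covers very little ground, which is the slow exploration described above.</p>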
<p>We want to efficiently explore the typical set without leaving it. Insight: align a vector field to the typical set and then integrate along the field. HMC is the formal procedure for adding just enough <em>momentum</em> to the gradient vector field to align the gradient field with the typical set. The analogy Michael used is that we need to add the momentum needed for a satellite to enter orbit around the earth.</p>
<p>How to do this? Use a measure-preserving Hamiltonian flow. The flow is generated by the Hamiltonian, a scalar function $H(p,q)$ of the parameters $q$ and momentum $p$ such that $H(p,q) = -\log (\pi(p|q)) -\log (\pi(q))$. Here, $\pi(q)$ is the target density, which is specified by the problem. $\pi(p|q)$ is a conditional density for the momentum, which we choose. According to Michael, this is the only way of producing the gradient vector field we want to integrate along that actually works (this is a magic, hand-wavy thing).</p>
<p>The Markov transitions now consist of randomly sampling a <em>Hamiltonian trajectory</em>. Motion is governed by the above Hamiltonian, which has a kinetic and potential energy term. From the current location on the typical set, we randomly sample the momentum and the trajectory length, and then use an ODE numerical integrator to compute where we end up. I picture this like a marble being released with some initial momentum inside a bowl-shaped surface; wherever the marble finally rests after rolling around for a while is analogous to the next sample from the posterior we draw and is the next state of the chain.</p>
<p>Numerical integration introduces some errors, so we have to be careful about the type of integration we use and how we correct for the introduced bias in our estimator. Use <a href="https://mc-stan.org/">Stan</a> in practice for your MCMC needs.</p>
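<p>To make the moving parts concrete, here is a toy one-dimensional HMC sketch. It makes several simplifying assumptions of mine: a standard normal target, a standard normal momentum distribution, a leapfrog integrator, and a fixed trajectory length instead of a randomly sampled one. Real problems should still go through Stan.</p>

```python
import math
import random


def leapfrog(q, p, grad_u, step, n_steps):
    """Leapfrog integration of H(p, q) = U(q) + p^2 / 2."""
    p -= 0.5 * step * grad_u(q)  # initial half step for the momentum
    for _ in range(n_steps - 1):
        q += step * p
        p -= step * grad_u(q)
    q += step * p
    p -= 0.5 * step * grad_u(q)  # final half step for the momentum
    return q, p


def hmc(u, grad_u, q0, n_samples, step=0.1, n_steps=16, seed=0):
    """Sample from the density proportional to exp(-U(q))."""
    rng = random.Random(seed)
    q, samples = q0, []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)  # resample momentum: pi(p | q) = N(0, 1)
        h_old = u(q) + 0.5 * p * p
        q_new, p_new = leapfrog(q, p, grad_u, step, n_steps)
        h_new = u(q_new) + 0.5 * p_new * p_new
        # Metropolis correction for the integrator's energy error.
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples


# Standard normal target: U(q) = -log pi(q) = q^2 / 2 up to a constant.
samples = hmc(lambda q: 0.5 * q * q, lambda q: q, q0=3.0, n_samples=5000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

<p>The accept/reject step is one way to correct for the bias that numerical integration introduces. With a small step size the energy error is tiny, so nearly every trajectory is accepted.</p>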
<h3 id="additional-resources-4">Additional Resources</h3>
<ul>
<li><a href="https://arxiv.org/abs/1701.02434">A Conceptual Introduction to Hamiltonian Monte Carlo - Betancourt</a></li>
<li><a href="https://betanalpha.github.io/writing/">Michael Betancourt’s blog</a></li>
</ul>
<h2 id="fairness">Fairness</h2>
<p>Lecturer: Timnit Gebru [<a href="https://github.com/mlss-2019/slides/tree/master/fairness">Slides</a>] [<a href="https://videoken.com/embed/7uV_VohAwnw">Video</a>]</p>
<p>Timnit gave an incredibly engaging lecture, which also doubled as a nice break from brain-bending equations. Here are some key takeaways:</p>
<p>Timnit talked about “Fairwashing”, which is also discussed by Zachary Lipton in his <a href="https://twimlai.com/twiml-talk-285-fairwashing-and-the-folly-of-ml-solutionism-with-zachary-lipton/">recent TWiML interview</a>. Basically, it is not enough to take a contentious research topic and simply try to “make it fair”. The idea of “make X fair” misses the point: we need to step back and critically consider the implications of doing the research in the first place. Ask: should we even be building this technology? And then the follow-up questions: if we don’t, will someone else build it anyways? If so, what do we do?</p>
<p>We learned about her PhD project where they took Google street view images and ran computer vision data mining on images of cars to extract patterns about the US population. Since I grew up in Jacksonville, FL, I thought I’d include this slide showing that it is the 10th most segregated city (measured by disparity in average car prices) in the US. Yikes!</p>
<div>
<div class="image-wrapper">
<img src="/img/MLSS/jacksonville.png" alt="" width="" height="" />
</div>
<div>
<p class="image-caption">The red in the center is San Marco/Riverside. Duuuval</p>
</div>
</div>
<p>When the data is biased (and data collected out in the real world reflects the current state of the society from which it was extracted), model predictions will be biased. Some examples we saw: using ML to predict who to hire, placing current ML techniques in the hands of ICE, predictive policing, gender classification, and facial recognition.</p>
<p>Science is inherently political. We all have a responsibility to be cognizant of the <a href="https://openai.com/blog/preparing-for-malicious-uses-of-ai/">dual-use nature</a> of the technologies we create. Even people working on “theory” must adhere to this. The people working on the foundations of our modern statistics (<a href="https://en.wikipedia.org/wiki/Ronald_Fisher#Eugenics">like Fisher</a>) were eugenicists!</p>
<h3 id="additional-resources-5">Additional Resources</h3>
<ol>
<li><a href="https://arxiv.org/abs/1803.09010">Datasheets for Datasets - Gebru et al.</a></li>
<li><a href="https://arxiv.org/abs/1810.03993">Model Cards for Model Reporting - Mitchell et al.</a></li>
<li><a href="https://mitpress.mit.edu/books/sorting-things-out">Sorting Things Out: Classification and its Consequences - Bowker and Star</a></li>
<li><a href="https://ali-alkhatib.com/blog/anthropological-intelligence">Anthropological/AI & the HAI - Ali Alkhatib</a></li>
<li><a href="https://www.americaunderwatch.com/">America Under Watch - Garvie, Moy</a></li>
</ol>
<h2 id="submodularity">Submodularity</h2>
<p>Lecturer: Stefanie Jegelka [<a href="https://github.com/mlss-2019/slides/tree/master/submodular">Slides</a>] [<a href="https://videoken.com/embed/P2mcH0RoPD0">Video</a>]</p>
<p>This was a fairly niche topic and most people hadn’t heard of it. Essentially, it is a technique in combinatorial optimization for optimal subset selection. Subset selection is clearly relevant to ML, e.g., for active learning, coordinate selection, compression, and so on. However, optimizing a <em>set function</em> $F(s)$ over subsets $s \subseteq S$ is non-trivial; sets are discrete objects, and for any set $S$ there are $2^{|S|}$ possible subsets!</p>
<p>It turns out there are specific classes of functions called <em>submodular functions</em> that can be optimized in a reasonable amount of time. These functions have a property called <em>submodularity</em>. In words, this means that if you have a subset $T$ and a subset $S$, and $T \subseteq S$, adding a new element $a \notin S$ to the “smaller” subset $T$ should increase the value at least as much as adding it to $S$. This is also referred to as diminishing returns. Written out, it is</p>
\[F(T \cup \{a\}) - F(T) \geq F(S \cup \{a\}) - F(S), \quad T \subseteq S, \; a \notin S.\]
<p>You can also think about $F$ like a concave function, such that its “discrete derivative” is non-increasing.</p>
<p>An example of a submodular function is the coverage function: given $A_1, …, A_n \subset U$,</p>
\[F(S) = \Bigl | \bigcup_{j \in S} A_j \Bigr |.\]
<p>As $S$ grows, the number of new elements added to the union by including another index $j'$ in $S$ can only shrink or stay the same, because more of $A_{j'}$ is already covered.</p>
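<p>To make this concrete, here is a minimal Python sketch that evaluates a coverage function and brute-force checks the diminishing-gains inequality. The toy sets in <code>A</code> and the helper names <code>coverage</code> and <code>is_submodular</code> are made up for illustration; they are not from the lecture.</p>

```python
from itertools import combinations

# Hypothetical toy instance: three subsets A_0, A_1, A_2 of some ground set U.
A = [{1, 2, 3}, {3, 4}, {4, 5, 6}]


def coverage(S):
    """F(S) = number of elements covered by the union of A_j for j in S."""
    covered = set()
    for j in S:
        covered |= A[j]
    return len(covered)


def is_submodular(F, n):
    """Brute-force check of the diminishing-gains inequality on n indices."""
    subsets = [frozenset(c) for r in range(n + 1)
               for c in combinations(range(n), r)]
    for S in subsets:
        for T in subsets:
            if not T.issubset(S):
                continue
            for a in range(n):
                if a in S:
                    continue  # the inequality is only stated for a not in S
                gain_small = F(T | {a}) - F(T)
                gain_large = F(S | {a}) - F(S)
                if gain_large > gain_small:
                    return False
    return True


print(coverage({0}), coverage({0, 1}), coverage({0, 1, 2}))
print(is_submodular(coverage, len(A)))
```

<p>The brute-force check enumerates all pairs of subsets, so it is only meant for tiny examples like this one.</p>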
<h3 id="convex-or-concave">Convex or concave?</h3>
<p>It is not obvious whether submodular functions are more like convex or concave functions, as they share similarities with both. The fact that there are similarities hints at the reason why they are interesting, since convex and concave functions are “nice” to optimize in continuous optimization. We already mentioned the non-increasing discrete derivative property, which is similar to concave functions. The similarity to convexity comes from the fact that it is easier to <em>minimize</em> a submodular function than it is to maximize one: unconstrained minimization can be done in polynomial time, whereas maximization is NP-hard in general (e.g., Max Cut is a special case).</p>
<h3 id="greedy-algorithms">Greedy algorithms</h3>
<p>We discussed greedy algorithms for maximizing submodular functions during the lecture. The basic algorithm for finding</p>
<p>\(\max_{S} F(S) \texttt{ s.t. } |S| \leq k\)
is as follows:</p>
<ol>
<li>$S_0 = \emptyset$</li>
<li>$\texttt{for } i=0,…,k-1$
<ol>
<li>$e^* = \texttt{arg max}_{e \in \mathscr{V} \setminus S_i} F(S_i \cup \{e\})$</li>
<li>$S_{i+1} = S_i \cup \{e^*\}$</li>
</ol>
</li>
</ol>
<p>This works when $F$ is submodular and monotone, i.e., adding items can never reduce $F$. Submodularity means the marginal improvement of adding a single item gives information about the global value, and for such functions the greedy solution is guaranteed to be within a factor of $1 - 1/e$ of the optimum. Lots more on this in the slides.</p>
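<p>The greedy procedure above can be sketched in a few lines of Python. This is an illustrative sketch on a made-up coverage instance, with hypothetical names (<code>greedy_maximize</code>, <code>coverage</code>, <code>A</code>), not code from the lecture.</p>

```python
# Hypothetical toy instance for the coverage function F(S) = |union of A_j, j in S|.
A = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 2, 3, 4}]


def coverage(S):
    covered = set()
    for j in S:
        covered |= A[j]
    return len(covered)


def greedy_maximize(F, ground_set, k):
    """Pick up to k elements one at a time, always adding the one that maximizes F."""
    S = set()
    for _ in range(k):
        candidates = [e for e in ground_set if e not in S]
        if not candidates:
            break  # ran out of elements before reaching the budget
        best = max(candidates, key=lambda e: F(S | {e}))
        S.add(best)
    return S


chosen = greedy_maximize(coverage, range(len(A)), k=2)
print(sorted(chosen), coverage(chosen))
```

<p>On this instance the first pick is the largest set, and the second pick is the set that adds the most uncovered elements, which is exactly the marginal-gain reasoning described above.</p>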
<h3 id="continuous-relaxation-for-minimization">Continuous relaxation for minimization</h3>
<p>What if we want to minimize a submodular function? We can consider a continuous relaxation of $F$ to a convex function. The Lovász extension provides a method for relaxing a function defined on the Boolean hypercube to $[0,1]^N$. The result is a piecewise-linear convex function $f(z)$ that can be optimized using subgradient methods, and a final result can be obtained by rounding.</p>
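<p>As a rough illustration, the Lovász extension can be evaluated by sorting the coordinates of $z$ in decreasing order and weighting each marginal gain of $F$ by the corresponding coordinate. The sketch below assumes $F(\emptyset) = 0$ and reuses a made-up coverage function; the names <code>lovasz_extension</code>, <code>coverage</code>, and <code>A</code> are hypothetical.</p>

```python
# Hypothetical toy instance: coverage function with F(empty set) = 0.
A = [{1, 2, 3}, {3, 4}, {4, 5, 6}]


def coverage(S):
    covered = set()
    for j in S:
        covered |= A[j]
    return len(covered)


def lovasz_extension(F, z):
    """Evaluate the Lovasz extension of F at a point z in [0, 1]^N.

    Sort coordinates in decreasing order and weight each marginal gain
    F(S_i) - F(S_{i-1}) by the corresponding coordinate of z.
    Assumes F(empty set) = 0.
    """
    order = sorted(range(len(z)), key=lambda i: -z[i])
    value, prefix, previous = 0.0, set(), 0.0
    for i in order:
        prefix.add(i)
        current = F(prefix)
        value += z[i] * (current - previous)
        previous = current
    return value


print(lovasz_extension(coverage, [1, 0, 1]))        # agrees with F({0, 2})
print(lovasz_extension(coverage, [0.5, 0.5, 0.5]))  # a fractional point
```

<p>At the corners of the hypercube the extension agrees with $F$ itself, which is what makes rounding a fractional minimizer back to a set meaningful.</p>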
<h3 id="additional-resources-6">Additional Resources</h3>
<ol>
<li><a href="https://theory.stanford.edu/~jvondrak/data/submod-tutorial-1.pdf">Other intro tutorial slides - Vondrak</a></li>
<li><a href="https://theory.stanford.edu/~jvondrak/CS369P/lec17.pdf">More on the Lovasz Extension - Vondrak</a></li>
<li><a href="http://melodi.ee.washington.edu/~bilmes/mypubs/iyer2015-spps.pdf">Submodular Point Processes with Applications to Machine Learning - Iyer, Bilmes</a></li>
<li><a href="https://arxiv.org/abs/1701.08939">Deep Submodular Functions - Bilmes, Bai</a></li>
</ol>
Thu, 15 Aug 2019 00:00:00 +0000
http://pemami4911.github.io/blog/2019/08/15/mlss-2019.html
blog
Learning From Demonstrations in the Wild
<hr />
<p><a href="https://arxiv.org/abs/1811.03516v1">Behbahani et al., 2018</a></p>
<h2 id="summary">Summary</h2>
<p>The motivation behind this work is to develop an automated process for learning the behaviors of road users from large amounts of unlabeled video data. A generative model (trained policy) of road user behavior could be used within a larger traffic scene understanding pipeline. In this paper, they propose Horizon GAIL, an imitation-learning algorithm based on GAIL that stabilizes learning from demonstration (LfD) over long horizons. Expert policy demonstrations are provided by a slightly improved Deep SORT tracker, and they use PPO as the “student” RL algorithm. The Unity game engine is used to build an RL env that mimics the scene from the real-world environment, in which they roll out their PPO Horizon GAIL agent. Their experiments are on 850 minutes of traffic camera data of a large roundabout. By using a curriculum where the episode horizon is extended by 1 timestep each training epoch, they demonstrated how Horizon GAIL can match the expert policy’s state/action distribution much more closely than GAIL, PS-GAIL, and behavior cloning while also improving on training stability.</p>
<h2 id="observations">Observations</h2>
<ul>
<li>The ability to auto-generate the Unity env from Google Maps would be crucial to scaling this technique up. maps2sim?</li>
<li>They provided an empirical comparison of Deep SORT with ViBe’s vision tracker, and showed that running the Kalman filter in 3D space improved Deep SORT’s performance on multiple multi-object tracking metrics by a few percentage points.</li>
<li>Each road user is modeled independently, i.e., the policy does not account for other agents in the environment explicitly. It looks like the policy used for learning vehicle and pedestrian behavior is the same, although because of the Mask R-CNN detector, they are able to differentiate between the two classes. In scenarios where the behaviors exhibited by the road users can be highly unpredictable and diverse (a busy traffic intersection with heavy pedestrian presence), perhaps a hierarchical policy could be useful that conditions on the inferred object class.</li>
<li>Interesting future work might include incorporating multi-agent modeling in the RL framework for more complex traffic scenarios.</li>
</ul>
Wed, 26 Dec 2018 00:00:00 +0000
http://pemami4911.github.io/paper-summaries/computer-vision/2018/12/26/LfD-in-the-wild.html
paper-summaries
computer-vision
Addressing Function Approximation Error in Actor-Critic Methods & Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning
<hr />
<p><a href="https://arxiv.org/abs/1802.09477">Fujimoto et al., 2018</a> and <a href="https://arxiv.org/abs/1809.02925v2">Kostrikov et al., 2018</a></p>
<h2 id="summary">Summary</h2>
<p>I discuss two recent related papers in the Deep RL literature in this post. The first paper, by Fujimoto et al., introduces techniques for reducing bias and variance in a popular actor-critic method, Deep Deterministic Policy Gradient (DDPG). The second paper, by Kostrikov et al., makes a similar contribution by evaluating and addressing bias and variance in inverse RL. Both of these papers take widely used Deep RL algorithms, empirically and theoretically demonstrate specific weaknesses, and suggest reasonable improvements. These are valuable studies that help develop a better understanding of Deep RL.</p>
<h3 id="addressing-function-approx-error-in-ac-methods">Addressing Function Approx. Error in AC Methods</h3>
<p>If you are unfamiliar with DDPG, you can check out <a href="https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html">my blog post</a> on the algorithm. The most important thing to know is that the success of the whole algorithm relies on having a critic network that can accurately estimate $Q$-values. The only signal the actor network gets in its gradient to help it achieve higher rewards comes from the gradient of the critic wrt the actions selected by the actor. If the critic gradient is biased, then the actor will fail to learn anything!</p>
<p>In Section 4, the authors begin by empirically demonstrating the overestimation bias present in the critic network (action-value estimator) in DDPG. They show that the overestimation bias essentially stems from the fact that DPG algorithms have to approximate both the policy and the value functions, and the approximate policy is maximized in the gradient direction provided by the approximate value function (rather than the true value function). Then, inspired by Double Q-Learning, they introduce a technique they call “clipped Double Q-Learning in AC” for achieving the same idea. Basically, the critic target becomes
\(y_1 = r + \gamma \min_{i=1,2} Q_{\theta'_i} (s', \pi_{\theta_1}(s')).\)
This requires introducing a second critic. The min makes it so that it is possible to underestimate Q-values, but this is preferable to overestimation.</p>
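<p>For a single transition, the target above boils down to taking the smaller of the two target-critic estimates. A minimal sketch follows; the function name and the <code>done</code> flag are my own additions for illustration and are not part of the formula above.</p>

```python
def clipped_double_q_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    """Critic target y = r + gamma * min(Q1', Q2') for a single transition.

    q1_next and q2_next are the two target critics evaluated at
    (s', pi(s')). The `done` flag is an assumption added here: it
    drops the bootstrap term at terminal states.
    """
    bootstrap = 0.0 if done else min(q1_next, q2_next)
    return reward + gamma * bootstrap


# The two critics disagree; the target uses the more pessimistic one.
print(clipped_double_q_target(reward=1.0, q1_next=10.0, q2_next=8.0))
```

<p>Because the same pessimistic target is used to train both critics, an overestimate in one of them cannot propagate unchecked through the bootstrap.</p>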
<p>Then, to help with variance reduction, they suggest:</p>
<ul>
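<li>Delay the actor updates: update the actor (and the target networks) less often than the critic, so that the value estimate has time to settle between policy updates</li>
<li>Add a small amount of clipped noise to the actions selected by the target actor when computing the critic target, which helps regularize the critic and is reminiscent of Expected SARSA</li>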
</ul>
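<p>Both tricks are simple to express in code. The sketch below handles a scalar action; the function names are hypothetical, and the default values (noise scale, clipping range, action bounds, update period) are illustrative assumptions rather than settings confirmed in this post.</p>

```python
import random


def smoothed_target_action(mu, sigma=0.2, noise_clip=0.5,
                           low=-1.0, high=1.0, rng=random):
    """Perturb the target actor's action mu with clipped Gaussian noise."""
    noise = rng.gauss(0.0, sigma)
    noise = max(-noise_clip, min(noise_clip, noise))
    return max(low, min(high, mu + noise))


def should_update_actor(critic_step, policy_delay=2):
    """Update the actor only once every `policy_delay` critic updates."""
    return critic_step % policy_delay == 0


rng = random.Random(0)
samples = [smoothed_target_action(0.3, rng=rng) for _ in range(5)]
print(samples)
print([step for step in range(1, 7) if should_update_actor(step)])
```

<p>Clipping the noise keeps the perturbed action close to the original one, so the critic is smoothed over a small neighborhood of actions rather than over arbitrary ones.</p>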
<p>Their experimental results on MuJoCo (they call their algorithm TD3) suggest these improvements are very effective, outperforming PPO, TRPO, ACKTR, and others.</p>
<h3 id="discriminator-actor-critic-addressing-sample-inefficiency-and-reward-bias-in-adversarial-imitation-learning">Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning</h3>
<p>EDIT: The title of the paper was previously “Addressing Sample Inefficiency and Reward Bias in Inverse RL”</p>
<p>Seemingly inspired by the former, this paper recently came out exploring inverse RL, specifically generative adversarial imitation learning (GAIL) and behavioral cloning. In GAIL, the discriminator learns to distinguish between transitions sampled from an expert and those from a trained policy. The policy is rewarded for confusing the discriminator. However, GAIL is typically quite sample inefficient.</p>
<p>One way the authors suggest to help with the sample inefficiency of GAIL is by using off-policy RL instead of on-policy RL. They modify the GAIL objective to be
\(\max_D \mathbb{E}_\mathscr{R} [\log(D(s,a))] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi).\)
Basically, $\pi_E$ is the expert policy, from which expert trajectories are sampled, and $\mathscr{R}$ is the replay buffer, which holds transitions collected by ~all previous policies. Strictly speaking, sampling from the replay buffer calls for an importance sampling correction, but the authors note that they ignore it in practice. Since TD3 is technically a deterministic policy gradient algorithm, I’m assuming one way to implement this importance sampling would be to have the actor output the mean of a multivariate Gaussian; this Gaussian could then be used to define the entropy term of the policy and the importance sampling ratio. This is fairly common for continuous control tasks like MuJoCo.</p>
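<p>To see what the discriminator is being asked to do, here is a small sketch that computes a Monte Carlo estimate of the first two terms of the objective from discriminator outputs. The function name and the numbers are made up for illustration, and the entropy term and any importance weights are deliberately left out.</p>

```python
import math


def discriminator_objective(d_replay, d_expert):
    """Monte Carlo estimate of E_R[log D(s,a)] + E_piE[log(1 - D(s,a))].

    d_replay: discriminator outputs in (0, 1) on replay-buffer samples.
    d_expert: discriminator outputs in (0, 1) on expert samples.
    The entropy regularizer and any importance weights are left out.
    """
    replay_term = sum(math.log(d) for d in d_replay) / len(d_replay)
    expert_term = sum(math.log(1.0 - d) for d in d_expert) / len(d_expert)
    return replay_term + expert_term


# Made-up discriminator outputs, for illustration only.
print(discriminator_objective([0.9, 0.8], [0.1, 0.2]))  # separates the two sources
print(discriminator_objective([0.5, 0.5], [0.5, 0.5]))  # uninformative
```

<p>Under this sign convention, the objective is larger when the discriminator outputs values near 1 on replay-buffer samples and near 0 on expert samples.</p>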
<p>They further analyzed different reward functions for GAIL, and show that certain GAIL reward functions can actually <em>inhibit</em> learning depending on the particular MDP (e.g., if the environment has a survival bonus or penalty). To create a more robust reward function that will learn the expert policies, <em>they suggest explicitly learning rewards for absorbing states of the MDP</em>. They implement this by adding an indicator to these particular states so that the GAIL discriminator can identify whether reaching an absorbing state is desirable from the perspective of the expert. In the <a href="https://openreview.net/pdf?id=Hk4fpoA5Km">OpenReview thread</a>, one reviewer makes sure to point out that the problems with inverse RL algorithms highlighted in the paper are due to incorrect implementations of the MDP, rather than shortcomings of the algorithms themselves (<a href="https://openreview.net/forum?id=Hk4fpoA5Km&noteId=B1gGZfAq6Q">see this comment in particular</a>).</p>
<p>Very interestingly, they used VR to generate expert trajectories of gripping blocks with a Kuka arm. This environment has a per-step penalty, and the normal GAIL reward fails to learn the expert policy due to the learning inhibition caused by the reward function bias. The proposed method learns to imitate the expert quickly due to its added reward for absorbing states.</p>
<p>It would be great to investigate the effects of using off-policy samples in the objective more carefully (why exactly does importance sampling not matter?). The absorbing state reward stuff being so useful is surprising, and should be helpful in future applications where GAIL is used for inverse RL.</p>
Thu, 13 Sep 2018 00:00:00 +0000
http://pemami4911.github.io/paper-summaries/deep-rl/2018/09/13/addressing-challenges-in-deep-rl.html
paper-summaries
deep-RL
RUDDER: Return Decomposition for Delayed Rewards
<hr />
<p><a href="https://arxiv.org/abs/1806.07857v1">Arjona-Medina, Gillhofer, et al., 2018</a></p>
<h2 id="intro">Intro</h2>
<p>Delayed rewards are one of the fundamental challenges of reinforcement learning. This paper proposes an algorithm for converting an MDP with delayed rewards into a different MDP with equivalent optimal policies where the delayed reward is converted into immediate rewards. They accomplish this with “reward redistribution”, a potential solution to the RL credit assignment problem.</p>
<p>Basically, they argue that reward redistribution enables Monte Carlo (MC) and temporal-difference (TD) methods to be applied under delayed rewards without high variance or biasing the Q-values. This is important because TD methods are the most commonly used RL approach, and the paper proves that, with delayed rewards, TD methods take time exponential in the length of the delay to remove the bias.</p>
<h3 id="some-details-about-rudder">Some details about RUDDER</h3>
<p>One of the main obstacles that is addressed in Section 3 is proving that the new MDP with redistributed rewards has the same optimal policies as the original MDP. I found that this section (and the previous) were unnecessarily hard to read, but here is my take on what the key points are:</p>
<ul>
<li>Because the delayed reward MDP obeys the Markov property, the transformed MDP with equivalent optimal policies but immediate rewards needs to have an altered set of states. This altered set of states obeys the Markov property for predicting the immediate rewards, but crucially, not for the delayed reward. This is accomplished by using differences between states $\triangle (s, s')$ in the delayed reward MDP as the new set of states</li>
<li>The return of the delayed reward MDP at the end of an episode should be predicted by a function $g$ (the LSTM) that can be decomposed into a sum over the state, action pairs of the episode</li>
<li>It should hold that the partial sums that all together sum to $g$, $\sum_{\tau = 0}^{t} h(a_{\tau}, \triangle(s_{\tau}, s_{\tau+1}))$, equal the Q-values at time $t$ for the reward redistribution to be optimal</li>
<li>This approach requires strong exploration that can actually uncover the delayed reward. If a robot gets a +1 for doing the dishes correctly and +0 reward otherwise, it may <em>never</em> even see the +1 in the first place!</li>
<li>To help with the above, they use a “lessons replay buffer” to store episodes where the delayed reward was observed, to sample from when computing gradients for improved data efficiency</li>
<li>An LSTM is used for the return prediction, and techniques for credit assignment in deep learning like layer-wise relevance propagation and integrated gradients are used for “backwards analysis” to accomplish the reward redistribution. This gets pretty complex, and the appendix goes into detail of how they modified the LSTM cell for this (something called a “monotonic LSTM”)</li>
<li>They have to introduce other auxiliary tasks to encourage the reward prediction LSTM network to learn the optimal reward redistribution and get around the Markov property problem mentioned earlier. These aux tasks are to predict the Q-values, predict the reward in the next 10 steps, and predict the reward accumulated from time 0 so far</li>
</ul>
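<p>To get a feel for what redistribution does, here is a deliberately simplified sketch. It assumes we already have a sequence of return predictions (one per time step) and credits each step with the change in the predicted return. This difference-of-predictions rule is a simpler stand-in for the integrated gradients and layer-wise relevance propagation analysis described above, and the function name and numbers are made up.</p>

```python
def redistribute_reward(return_predictions):
    """Turn a sequence of return predictions g_0, ..., g_T into per-step rewards.

    Each step is credited with the change in the predicted final return,
    so the redistributed rewards sum to the last prediction g_T.
    """
    rewards, previous = [], 0.0
    for g in return_predictions:
        rewards.append(g - previous)
        previous = g
    return rewards


# Made-up predictions: the model becomes confident about the final return
# at step 2, long before the environment actually pays out.
predictions = [0.0, 0.0, 0.8, 0.8, 1.0]
rewards = redistribute_reward(predictions)
print(rewards)
print(sum(rewards))
```

<p>The partial sums of the redistributed rewards recover the return predictions, which mirrors the partial-sum condition in the list above: most of the credit now arrives at the step where the outcome became predictable.</p>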
<h2 id="experimental-results">Experimental results</h2>
<p>The authors note that RUDDER is aimed at RL problems with (i) delayed rewards, (ii) no distractions due to other rewards or changing characteristics of the environment, and (iii) no skills to be learned to receive the delayed reward. (iii) seems to imply that RUDDER could be combined with hierarchical RL. (ii) suggests that when there are sporadic rewards throughout an episode, the improvements from RUDDER might be minimized. Hence, the environments they experiment with all have heavily delayed rewards observed only at the end of the episode, and their results are naturally impressive. I understand the difficulty (high computational/$$$ cost) in evaluating on all 50-something Atari games, but it would be really nice to see how it fares on other games that aren’t perfectly set up for RUDDER.</p>
<p>They also note that human demonstrations could be used to fill the lessons buffer, which might further improve recent <a href="https://arxiv.org/abs/1805.11592">impressive imitation learning results</a>.</p>
<h3 id="grid-world">Grid world</h3>
<p>They compute an optimal policy for a simple grid world with a delayed reward MDP. Then, they compute the bias and variance from the mean square error (MSE) of the Q-values for MC and TD estimators with policy evaluation. They demonstrate the exponential dependency on the reward delay length for the MC and TD methods by varying the delay. RUDDER shows favorable reductions in the bias and variance of estimated vs. optimal Q-values at all delay lengths.</p>
<h3 id="charge-discharge">Charge-discharge</h3>
<p>This is a simple 2-state, 2-action MDP with episodes of variable length and a delayed reward obtained only at the end of the episode. RUDDER augmenting a Q-learning agent is much more data efficient than pure Q-learning, an MC agent, and MCTS.</p>
<h2 id="atari-evaluation">Atari evaluation</h2>
<p>The most impressive results of the paper are the scores for the Venture and Bowling Atari games. They set a new SOTA on Venture.
They augment <a href="https://arxiv.org/abs/1707.06347">PPO</a> with the return prediction network paired with integrated gradients for backwards analysis. The lessons buffer is used to train both the return prediction network and A2C (for PPO). I’m not sure about that because I thought PPO was an on-policy algorithm; wouldn’t using episodes from the lessons buffer bias the policy gradient? They use the difference between two consecutive frames for $\triangle(s, s')$ plus the current frame (for seeing static objects) as input.</p>
<div>
<div class="image-wrapper">
<img src="/img/rudder-arch.png" alt="" width="" height="" />
</div>
<div>
<p class="image-caption">NN architectures for RUDDER augmenting PPO</p>
</div>
</div>
<div>
<div class="image-wrapper">
<img src="/img/rudder.png" alt="" width="" height="" />
</div>
<div>
<p class="image-caption">Visual depiction of the reward redistribution</p>
</div>
</div>
<p>They show that RUDDER is much more data efficient than APE-X, DDQN, Noisy-DQN, Rainbow, and other SOTA Deep RL approaches on these two games. <a href="https://goo.gl/EQerZV">Video</a>.</p>
<h2 id="closing-thoughts">Closing thoughts</h2>
<p>I think this is an exciting RL paper that will have a major impact, mainly because I can see lots of future research refining the ideas from this paper and then applying them to a variety of domains, especially in robotics. This paper could use polishing (some parts are not easy to read) and some more experiments showing how it performs on MDPs with both delayed rewards and also some other “distractor” rewards. I think this setting is most common, as it’s usually possible to do a little reward hacking to augment sparse/delayed reward MDPs. In the appendix, they go into the engineering details on how they got this to work. It seems like that process, along with how the hyperparameters, auxiliary tasks, and choice of backwards-analysis method need to change for different environments, will have to be cleaned up before this approach becomes widely useful.</p>
Fri, 22 Jun 2018 00:00:00 +0000
http://pemami4911.github.io/paper-summaries/reinforcement-learning-theory/2018/06/22/rudder.html
paper-summaries
reinforcement-learning-theory
Z-Forcing: Training Stochastic Recurrent Networks
<hr />
<p><a href="http://papers.nips.cc/paper/7248-z-forcing-training-stochastic-recurrent-networks">Goyal, et al., 2017</a></p>
<h2 id="intro">Intro</h2>
<p>A new training procedure for recurrent VAEs is proposed. Recall that for VAEs, we model a joint distribution over observations $x$ and latent variables $z$, and assume that $z$ is involved in the generation of $x$. This distribution is parameterized by $\theta$. Maximizing the marginal log-likelihood $\log p_{\theta}(x)$ wrt $\theta$ is intractable because it requires integrating over $z$. Instead, we introduce a variational distribution $q_{\phi}(z|x)$ and maximize a lower bound on the marginal log-likelihood, the ELBO.</p>
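<p>For reference, the standard single-datapoint form of this bound (written here for a generic VAE, not the recurrent model from the paper) is</p>
\[\log p_{\theta}(x) \geq \mathbb{E}_{q_{\phi}(z|x)} \bigl[ \log p_{\theta}(x|z) \bigr] - \mathrm{KL}\bigl( q_{\phi}(z|x) \,\|\, p_{\theta}(z) \bigr).\]
<p>The first term rewards accurate reconstructions, and the KL term keeps the approximate posterior close to the prior.</p>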
<h3 id="stochastic-recurrent-networks">Stochastic recurrent networks</h3>
<p>When applying VAEs to sequences, it has been proposed to use recurrent networks for the recognition network (aka inference network aka variational posterior) and the generation network (aka decoder aka conditional probability of the next observation given previous observations and latents). These probabilistic models can be autoregressive (in this paper, they use LSTMs with MLPs for predicting the parameters of Gaussian distributions). It is common to model these conditional distributions with Gaussians for continuous variables or categoricals for discrete variables.</p>
<p>Usually, the prior over latent variables is also learned with a parametric model.</p>
<p>If I’m not mistaken, learning the parameters of these parametric models with a training data set, and then using them at test time for fast inference is referred to as <em>amortized variational inference</em>, which appears to have <a href="http://gershmanlab.webfactional.com/pubs/GershmanGoodman14.pdf">correlates in our cognition</a>.</p>
<h3 id="z-forcing">Z-forcing</h3>
<p>Strong autoregressive decoders overpower the latent variables $z$, preventing the conditional probability distribution (CPD) from learning complex multi-modal distributions. To mitigate this, they introduce an auxiliary cost to the training objective. An extra parametric model is introduced, $p_{\eta}(b | z)$, that “forces” the latents to be predictive of the hidden states $b$ of the “backward network” (the inference network).</p>
<h2 id="experiments">Experiments</h2>
<p>They validate the approach on speech modeling (TIMIT, Blizzard) and language modeling. The metric is average log-likelihood (LL). On Sequential MNIST, z-forcing is competitive with “deeper” recurrent generative models like PixelRNN.</p>
<div>
<div class="image-wrapper">
<img src="/img/z-forcing-interp.png" alt="" width="" height="" />
</div>
<div>
<p class="image-caption">Some fun language modeling results interpolating the latent space</p>
</div>
</div>
<h2 id="takeaways">Takeaways</h2>
<p>It’s always a consideration as to whether increasing the complexity of an approach (adding an extra network and auxiliary cost) is worth the effort vs. simpler approaches that can get almost the same performance. The results on TIMIT and Blizzard are pretty convincing. The authors also suggest incorporating the auxiliary loss with PixelRNN/CNN in future work.</p>
Fri, 22 Jun 2018 00:00:00 +0000
http://pemami4911.github.io/paper-summaries/natural-language-processing/2018/06/22/z-forcing.html
paper-summaries
natural-language-processing
Tracking Occluded Objects and Recovering Incomplete Trajectories by Reasoning about Containment Relations and Human Actions
<hr />
<p><a href="https://pdfs.semanticscholar.org/4170/0dc1e60f5c8eaef409ef014f37c8b9e1b8cd.pdf">Liang, Zhu, Zhu, 2018</a></p>
<h2 id="summary">Summary</h2>
<p>This paper looks at tracking severely occluded objects in long video sequences.</p>
<p>I like this passage:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Although some recent work adopted deep neural networks
to extract contexts for object detection and tracking, these
data-driven feedforward methods have well-known problems:
i) They are black-box models that cannot be explained
and only applicable with supervised training by fitting the
typical context of the object, thus difficult to generalize
to new tasks. ii) Lacking explicit representation to handle
occlusions, low resolution, and lighting variations—there
are millions of ways to occlude an object in a given image
(Wang et al. 2017), making it impossible to have enough
data for training and testing such black box models. In this
paper, we go beyond passive recognition by reasoning about
time-varying containment relations.
</code></pre></div></div>
<p>They look at <em>containment relations</em> induced by human activity. A containment relation occurs when an object contains or holds another object, obscuring it from view. Contained objects have the same trajectories as the container.</p>
<p>They use an idea that I have thought about as well and think is powerful; rather than only relying on detections for next state hypotheses, they suggest alternative hypotheses based on occlusion reasoning. The other hypotheses come from containment relations and blocked relations; if an object is contained, it inherits the track state of its container. If an object is blocked, its track state is considered stationary (not always true!).</p>
<p>This tracking scenario is unique because they only consider human action as the source of state change (i.e., this approach doesn’t apply to general pedestrian or vehicle tracking).</p>
<p>They use state-of-the-art detection and activity recognition to carry out object tracking. They use the network flow approach for occlusion-aware data association in a sliding window. I think their algorithm has to decide between containment and blocked occlusion events via the activity recognition.</p>
<p>They note that their approach is limited by how good object detection is. I think an important direction to consider is when external forces other than people can act on the object. This requires much more complex modeling of what an occluded object might be doing. Also, we need better object detection!</p>
<h2 id="interesting-related-works">Interesting related works</h2>
<p>In <a href="https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4587584">Zhang, Li, and Nevatia 2008</a>, they use an Explicit Occlusion Model (EOM) in the network flow data association model. Good potential baseline.</p>
Sun, 03 Jun 2018 00:00:00 +0000
http://pemami4911.github.io/paper-summaries/computer-vision/2018/06/03/tracking-occluded-objects-by-reasoning-about-containment.html
paper-summaries
computer-vision
Convexified Convolutional Neural Networks
<hr />
<p><a href="https://arxiv.org/abs/1609.01000">Zhang, Liang, Wainwright, 2016</a></p>
<h2 id="summary">Summary</h2>
<p>In this paper, the authors proposed a method for convexifying convolutional neural networks to train them without backpropagation. Furthermore, this relaxation to the convex setting allows for theoretical proofs of bounds on the generalization error.</p>
<p>Succinctly, they propose to use a reproducing kernel Hilbert space (RKHS) and the kernel trick to lift the data into a high-dimensional space that is expressive enough to capture certain nonlinear activation functions. In experiments on MNIST and CIFAR-10, they show that “convexifying” smaller CNNs can outperform training them with backpropagation.</p>
<p>They note that their method doesn’t work with max pooling or very deep CNNs with lots of bells and whistles.</p>
<p>This is a thought-provoking paper. I like how the authors pursued a theoretically interesting question, even though there isn’t much practical use yet for this. I don’t have personal experience writing theory papers, but I imagine that this is a good(?) representation of how they often go in ML. The research is driven by an interesting theoretical question, not a practical application that needs solving/SOTA results.</p>
Sat, 07 Apr 2018 00:00:00 +0000
http://pemami4911.github.io/paper-summaries/deep-learning-theory/2018/04/07/convexified-cnns.html
paper-summaries
deep-learning-theory
Progressive Growing of GANs for Improved Quality, Stability, and Variation
<hr />
<p><a href="http://research.nvidia.com/sites/default/files/pubs/2017-10_Progressive-Growing-of/karras2018iclr-paper.pdf">Karras, et al., 2018</a></p>
<h2 id="summary">Summary</h2>
<p>The basic idea is to introduce a curriculum into the GAN training procedure. One starts by training the generator to produce 4 x 4 images, progressively adding layers to increase the resolution. In the paper, they generated high-quality 1024 x 1024 samples for CelebA-HQ, and also report results on LSUN and CIFAR-10.</p>
<p>This is a nice applied paper where the core idea is quite simple and explained clearly. They describe all of the challenges hidden under the surface of training large-scale GANs and tell the reader how they tackled them. Lots of good deep learning voodoo in this paper.</p>
<p>They found that the progressive scheme helps the GAN converge to a much better optimum (image quality is amazing) and reduces total training time by about a factor of 2.</p>
<p>They mainly use the <a href="https://arxiv.org/abs/1704.00028">WGAN-GP</a> loss.
Recall that the WGAN loss is \(\min_G \max_D \mathbb{E}_{x \sim \mathbb{P}_r} [ D(x) ] - \mathbb{E}_{\hat{x} \sim \mathbb{P}_g} [D(\hat{x}) ]\)</p>
<p>The main change made in WGAN-GP is the addition of a gradient penalty term to softly enforce the 1-Lipschitz constraint. Previously, hard weight clipping within some [-c, c] was used. The new loss looks like \(\min_G \max_D \mathbb{E}_{x \sim \mathbb{P}_r} [ D(x) ] - \mathbb{E}_{\hat{x} \sim \mathbb{P}_g} [D(\hat{x}) ] - \color{blue}{\lambda \mathbb{E}_{x' \sim \mathbb{P}_{x'}} [ (\| \nabla_{x'} D(x')\|_2 - 1)^2]}\), where the penalty is subtracted because the critic $D$ maximizes the objective, and $\lambda$ is set to 10.</p>
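<p>Here is a small, framework-free sketch of the penalty term on its own. It assumes $x'$ is drawn by uniformly interpolating between a real and a generated sample, which is how I recall WGAN-GP defining $\mathbb{P}_{x'}$, and it uses finite differences as a stand-in for the autograd call you would use in practice. All function names and toy data are hypothetical.</p>

```python
import math
import random


def grad_norm(critic, x, eps=1e-5):
    """L2 norm of the gradient of critic at x, via central finite differences."""
    squared = 0.0
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        partial = (critic(up) - critic(down)) / (2.0 * eps)
        squared += partial * partial
    return math.sqrt(squared)


def gradient_penalty(critic, real_batch, fake_batch, lam=10.0, rng=random):
    """lam * mean((||grad D(x')||_2 - 1)^2) over interpolated points x'."""
    total = 0.0
    for x, x_hat in zip(real_batch, fake_batch):
        t = rng.random()  # uniform interpolation coefficient
        x_prime = [t * a + (1.0 - t) * b for a, b in zip(x, x_hat)]
        total += (grad_norm(critic, x_prime) - 1.0) ** 2
    return lam * total / len(real_batch)


def linear_critic(w):
    """Toy critic D(x) = w . x, whose gradient norm is ||w|| everywhere."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))


real = [[1.0, 2.0], [0.5, -1.0]]
fake = [[0.0, 0.0], [2.0, 1.0]]
print(gradient_penalty(linear_critic([0.6, 0.8]), real, fake))  # gradient norm 1
print(gradient_penalty(linear_critic([3.0, 4.0]), real, fake))  # gradient norm 5
```

<p>The linear critic makes the behavior easy to check by hand: a weight vector of norm 1 incurs essentially no penalty, while one of norm 5 is penalized by roughly $\lambda (5 - 1)^2$.</p>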
<p>Definition: <strong>Inception score</strong> is an evaluation metric for GANs where generated samples are fed into an Inception model trained on ImageNet. Images with meaningful objects are supposed to have low label entropy, but the entropy of the label distribution across images should be high (high variation).</p>
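<p>In formula form, the score is usually written as $\exp \bigl( \mathbb{E}_x \, \mathrm{KL}( p(y|x) \,\|\, p(y) ) \bigr)$, where $p(y)$ is the label distribution averaged over all generated images. A minimal sketch, with made-up class probabilities standing in for the Inception model’s outputs:</p>

```python
import math


def inception_score(probs):
    """exp of the average KL(p(y|x) || p(y)) over a list of predicted label distributions."""
    n, k = len(probs), len(probs[0])
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    kl_sum = 0.0
    for p in probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        kl_sum += sum(pc * math.log(pc / marginal[c])
                      for c, pc in enumerate(p) if pc > 0)
    return math.exp(kl_sum / n)


# Confident and diverse predictions (made-up): the best case for two classes.
print(inception_score([[1.0, 0.0], [0.0, 1.0]]))
# Every image gets the same uncertain prediction: the worst case.
print(inception_score([[0.5, 0.5], [0.5, 0.5]]))
```

<p>The score ranges from 1 (no confident or varied predictions) up to the number of classes (each image confidently assigned, with all classes equally represented).</p>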
Mon, 26 Mar 2018 00:00:00 +0000
http://pemami4911.github.io/paper-summaries/generative-adversarial-networks/2018/03/26/progressive-growing-gans.html
paper-summaries
generative-adversarial-networks