Circuits Thread (Various people) (summarized by Rohin): We’ve previously (AN #111) summarized the three main claims of the Circuits thread. Since then, several additional articles in the series have been published, so now seems like a good time to take another look at the work in this space.

The three main claims of the Circuits agenda, introduced in Zoom In, are:

1. Neural network features - the activation values of hidden layers - are understandable.
2. Circuits - the weights connecting these features - are also understandable.
3. Universality - when training different models on different tasks, you will get analogous features.

To support the first claim, there are seven different techniques that we can use to understand neural network features that all seem to work quite well in practice:

1. Feature visualization produces an input that maximally activates a particular neuron, thus showing what that neuron is “looking for”.
2. Dataset examples are the inputs in the training dataset that maximally activate a particular neuron, which provide additional evidence about what the neuron is “looking for”.
3. Synthetic examples created independently of the model or training set can be used to check a particular hypothesis for what the neuron activates on.
4. Tuning involves perturbing an image and seeing how the neuron activations change. For example, we might rotate images and see how that impacts a curve detector.
5. By looking at the circuit used to create a neuron, we can read off the algorithm that implements the feature, and make sure that it makes intuitive sense.
6. We can look to see how the feature is used in future circuits.
7. We can handwrite circuits that implement the feature after we understand how it works, in order to check our understanding and make sure it performs as well as the learned circuit.

Note that since techniques 5-7 are based on understanding circuits, their empirical success also supports claim 2.
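Technique 4 (tuning) is easy to illustrate on a toy example. The sketch below is not taken from the Circuits articles: it assumes a hypothetical hand-built line detector (`toy_neuron`) on a 9x9 binary image, with made-up weights, and rotates a line stimulus (`line_image`) to record how the activation changes with orientation. All names and numbers here are illustrative assumptions.

```python
import math

SIZE = 9  # side length of the square toy image, in pixels


def line_image(angle_deg, size=SIZE):
    """Return a size x size binary image of a line through the centre."""
    centre = size // 2
    theta = math.radians(angle_deg)
    dx, dy = math.cos(theta), math.sin(theta)
    img = [[0.0] * size for _ in range(size)]
    for row in range(size):
        for col in range(size):
            x, y = col - centre, centre - row  # centre-origin coords, y up
            # Perpendicular distance from the pixel to the line.
            if abs(x * dy - y * dx) <= 0.5:
                img[row][col] = 1.0
    return img


def toy_neuron(img, preferred_deg=0.0):
    """Hypothetical line detector: weighted sum of pixels followed by ReLU."""
    template = line_image(preferred_deg)
    total = 0.0
    for r in range(SIZE):
        for c in range(SIZE):
            # Excitatory weight on the preferred line, mild inhibition elsewhere.
            weight = 1.0 if template[r][c] else -0.25
            total += weight * img[r][c]
    return max(0.0, total)


def tuning_curve(neuron, angles):
    """Map each stimulus orientation to the neuron's activation."""
    return {angle: neuron(line_image(angle)) for angle in angles}


curve = tuning_curve(toy_neuron, range(0, 180, 15))
best = max(curve, key=curve.get)  # orientation with the strongest response
```

Here the detector responds most strongly at its preferred orientation and falls off as the stimulus rotates away, which is the kind of curve one would inspect for a real neuron by rotating real or synthetic inputs.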
The third claim, that the features learned are universal, is the most speculative. There isn’t direct support for it, but anecdotally it does seem to be the case that many vision networks do in fact learn the same features, especially in the earlier layers that learn lower-level features.

In order to analyze circuits, we need to develop interpretability tools that work with them. Just as activations were the primary object of interest in understanding the behavior of a model on a given input, weights are the primary object of interest in circuits. Visualizing Weights explains how we can adapt all of the techniques from Building Blocks to this setting. Just as in Building Blocks, a key idea is to “name” each neuron with its feature visualization: this makes each neuron meaningful to humans, in the same way that informative variable names make the variable more meaningful to a software engineer.

Once we have “named” our neurons, our core operation is to visualize the matrix of weights connecting one neuron in layer L to another in layer L+1. By looking at this visualization, we can see the “algorithm” (or at least part of the algorithm) for producing the layer L+1 neuron given the layer L neuron. The other visualizations are variations on this core operation: for example, we might instead visualize weights for multiple input neurons at once, or for a group of neurons together, or for neurons that are separated by more than one layer.

We then apply our machinery to curve detectors, a collection of 10 neurons that detect curves. We first use techniques 1-4 (which don’t rely on circuits) to understand what the neurons are doing in Curve Detectors. This is probably my favorite post of the entire thread, because it spends ~7000 words and many, many visualizations digging into just 10 neurons, and it is still consistently interesting. It gives you a great sense of just how complex the behavior learned by neural networks can be.
Unfortunately, it’s hard to summarize (both because there’s a lot of content and because the visualizations are so core to it); I recommend reading it directly.

Curve Circuits delves into the analysis of how the curve detectors actually work. The key highlight of this post is that one author set values for weights without looking at the original network, and recreated a working curve detection algorithm (albeit not one that was robust to colors and textures, as in the original neural network). This is a powerful validation that at least that author really does “understand” the curve detectors. It’s not a small network either: it takes around 50,000 parameters. Nonetheless, it can be described relatively concisely:

Gabor filters turn into proto-lines which build lines and early curves. Finally, lines and early curves are composed into curves. In each case, each shape family (e.g. conv2d2 line) has positive weight across the tangent of the shape family it builds (e.g. 3a early curve). Each shape family implements the rotational equivariance motif, containing multiple rotated copies of approximately the same neuron.

(Yes, that’s a lot of jargon; it isn’t really meant to be understood just from that paragraph. The post goes into much more detail.)

So why is it amenable to such a simple explanation? Equivariance helps a lot in reducing the amount we need to explain. Equivariance occurs when there is a kind of symmetry across the learned neurons, such that there’s a single motif copied across several neurons, except transformed each time. A typical example of equivariance is rotational equivariance, where the feature is simply rotated by some amount. Curve detectors show rotational equivariance: we have 10 different detectors that are all implemented in roughly the same way, except that the orientation is changed slightly. Many of the underlying neurons also show this sort of equivariance.
In this case, we only need to understand one of the circuits and the others follow naturally. This reduces the amount to understand by about 50x.

This isn’t the only type of equivariance. For example, one neuron might detect A on the left and B on the right while a second detects A on the right and B on the left, which is reflection equivariance. If you trained a multilayer perceptron (MLP) on images, it would probably show translational equivariance, where you’d see multiple neurons computing the same function but for different spatial locations of the image. (CNNs build translational equivariance into the architecture because it is so common.)

The authors go into detail by showing many circuits that involve some sort of equivariance. Note that equivariance is a property of features (neurons), rather than of the circuits that connect them. Nonetheless, it is reflected in the structure of the weights that make up the circuit. The post gives several examples of such circuits, which fall into three types: invariant-equivariant circuits (building a family of equivariant features starting from invariant features), equivariant-invariant circuits, and equivariant-equivariant circuits.

High-Low Frequency Detectors applies all of these tools to understand a family of features that detect areas with high frequencies on one side and low frequencies on the other. It’s mostly interesting as another validation of the tools, especially because these detectors were found through interpretability techniques and weren’t hypothesized before (whereas we could have, and maybe did, predict a priori that an image classifier would learn to detect curves).
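To make the rotational equivariance idea discussed above concrete, here is a minimal sketch under stated assumptions. It is not the curve detector family from the posts: `BASE` is a hypothetical 3x3 weight motif, the family contains only four copies at exact 90 degree rotations, and `response` is a plain weighted sum. The point is only that when neurons are rotated copies of one motif, understanding one of them determines the rest.

```python
def rot90(grid):
    """Rotate a square grid by 90 degrees counter-clockwise."""
    n = len(grid)
    return [[grid[c][n - 1 - r] for c in range(n)] for r in range(n)]


def rotate(grid, k):
    """Apply rot90 k times (k is taken modulo 4)."""
    for _ in range(k % 4):
        grid = rot90(grid)
    return grid


def response(weights, image):
    """Pre-activation of a neuron: the weighted sum of the input pixels."""
    return sum(w * x
               for w_row, x_row in zip(weights, image)
               for w, x in zip(w_row, x_row))


# Hypothetical 3x3 weights for a crude corner-shaped detector.
BASE = [[0, 2, 2],
        [2, 0, -1],
        [2, -1, -1]]

# The equivariant family: four rotated copies of the same motif.
FAMILY = [rotate(BASE, k) for k in range(4)]

# A stimulus that matches the base detector.
STIMULUS = [[0, 1, 1],
            [1, 0, 0],
            [1, 0, 0]]

# table[k][j] = response of family member k to the stimulus rotated j times.
table = [[response(FAMILY[k], rotate(STIMULUS, j)) for j in range(4)]
         for k in range(4)]

# Weights stored across the family versus weights that are actually free.
total_weights = sum(len(row) for kernel in FAMILY for row in kernel)
free_weights = sum(len(row) for row in BASE)
```

In this toy family, each rotated copy responds to the correspondingly rotated stimulus exactly as the base neuron responds to the original, so the 36 stored weights are determined by 9. Learned families are only approximately equivariant, so the saving in a real network is approximate as well.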