I've been reading up on the Lottery Ticket Hypothesis, which is super interesting.
Basically, the observation is that these days we build vast neural networks with billions of parameters, but most of those parameters turn out not to be needed. That is, after training, you can throw away something like 95% of the weights (pruning) and the network will still work fine.
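For concreteness, here's a minimal sketch of what "pruning" usually means in this context: magnitude pruning, where the smallest weights are simply zeroed out. The helper function and the 95% figure here are just illustrative, not taken from the paper.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.95):
    """Zero out the smallest-magnitude entries, keeping only the top (1 - sparsity)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

# Toy example: prune 95% of a random weight matrix.
w = np.random.randn(256, 256)
pruned, mask = magnitude_prune(w, sparsity=0.95)
print(f"kept {mask.mean():.1%} of the weights")
```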
The LTH paper asks: could we have started with a network just 5% of the size and trained it to comparable accuracy? If so, that would be a huge efficiency win for Deep Learning.
What's interesting is that you can do this, but (so far) only by training the full network, perhaps several times over, to discover which weights are needed, and then resetting the surviving weights to their original random initialization. The authors argue that training a neural network isn't so much creating a model as finding a lucky sub-network (a lottery ticket) already hidden in the randomly initialized network, a bit like a sculptor "finding" the bust hidden in a block of marble.
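Here's a rough sketch of that procedure (iterative magnitude pruning with rewinding) on a toy PyTorch model, just to make the loop structure concrete. The model, synthetic data, pruning rate, and number of rounds are all made up for illustration; the paper's real experiments obviously use proper architectures and datasets.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)               # toy inputs
y = (X.sum(dim=1) > 0).long()          # toy binary labels

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
init_state = copy.deepcopy(model.state_dict())   # the "lottery" initialization
masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

def train(model, steps=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        for n, p in model.named_parameters():    # keep pruned weights frozen at zero
            if n in masks:
                p.grad *= masks[n]
        opt.step()
    return loss.item()

prune_rate = 0.5                  # drop 50% of the *surviving* weights each round
for round_ in range(4):           # 0.5**4 ≈ 6% of the weights remain at the end
    loss = train(model)
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n not in masks:
                continue
            # prune the smallest-magnitude weights among those still alive
            threshold = p[masks[n].bool()].abs().quantile(prune_rate)
            masks[n] *= (p.abs() >= threshold).float()
        # rewind: reset the survivors to their ORIGINAL random initialization
        model.load_state_dict(init_state)
        for n, p in model.named_parameters():
            if n in masks:
                p *= masks[n]
    kept = sum(m.sum().item() for m in masks.values()) / sum(m.numel() for m in masks.values())
    print(f"round {round_}: loss={loss:.3f}, weights kept={kept:.1%}")

print(f"final loss of the sparse ticket: {train(model):.3f}")
```

The key move is the rewind step (`load_state_dict(init_state)`): the paper finds that if you re-randomize the surviving weights instead of resetting them to their original values, the sparse network loses its advantage, which is why the initialization itself is the "winning ticket".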
Initial LTH paper: http://arxiv.org/abs/1803.03635
Follow-up with major clarifications: http://arxiv.org/abs/1905.01067