I recently stumbled upon an interesting problem:

Determine values $x_i$ that maximize $\sum_t \left( a_t x_t - w_t x_t^2 \right)$ subject to the constraint $\sum_t b_t x_t = 0$. It is given that all $w_i$ values are positive and that the full arrays do not fit into memory (hence they need to be loaded in batches).
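
To make the setup concrete, here is a tiny toy instance (just an illustration; the array names `a`, `b`, `w` and the sizes are mine, and in the real problem the arrays would be streamed in batches rather than held in memory):

```python
import numpy as np

# Toy instance of the problem (in reality the arrays are too large for memory).
rng = np.random.default_rng(0)
n = 5
a = rng.normal(size=n)             # a_t, coefficients of the linear term
b = rng.normal(size=n)             # b_t, coefficients of the constraint
w = rng.uniform(0.5, 2.0, size=n)  # w_t > 0, coefficients of the quadratic term

def objective(x):
    """f(x) = sum_t (a_t * x_t - w_t * x_t**2), to be maximized."""
    return np.sum(a * x - w * x**2)

def constraint(x):
    """g(x) = sum_t b_t * x_t, required to equal zero."""
    return np.sum(b * x)
```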

I initially tried deriving an analytical solution to this problem. If I could not derive one, there was always some good ol' backprop to fall back on as a backup plan: load the arrays in batches and minimize the following objective function:

$$L = -\sum_t \left( a_t x_t - w_t x_t^2 \right) + \lambda \left| \sum_t b_t x_t \right|$$

However, this approach seems a bit iffy: if $\lambda$ is set too low, the constraint is not satisfied, and if $\lambda$ is set too high, the function is not properly maximized…
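
For reference, that penalty-plus-backprop plan would look roughly like the sketch below (assuming PyTorch, glossing over the batching/memory issue, and with entirely arbitrary choices for `lam`, the learning rate, and the number of steps):

```python
import torch

def penalty_solve(a, b, w, lam=10.0, lr=1e-2, steps=5000):
    """Minimize -f(x) + lam * |g(x)| by gradient descent; a, b, w are float tensors."""
    x = torch.zeros_like(a, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        f = (a * x - w * x**2).sum()   # objective we want to maximize
        g = (b * x).sum()              # constraint value, should end up near zero
        loss = -f + lam * g.abs()
        loss.backward()
        opt.step()
    return x.detach()
```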

After some doodling I recalled a constrained optimization trick from my undergraduate calculus course called Lagrange Multipliers, which allows you to find the maxima (or minima) of a function under equality constraints.

Method of Lagrange Multipliers: Suppose we have differentiable functions $f$ and $g$, and we want to maximize $f(x)$ subject to the constraint $g(x) = 0$. If $x$ is a local maximum of $f$ subject to the constraint $g$, and if $\nabla g(x) \neq 0$, then there exists $\lambda \in \mathbb{R}$ such that the following system of equations is satisfied by $\lambda$ and $x$:

$$\nabla f(x) + \lambda \nabla g(x) = 0, \qquad g(x) = 0$$

Method intuition: Let's construct the function $L(x, \lambda) = f(x) + \lambda g(x)$ (which is called the Lagrange function). When our constraint $g(x) = 0$ is enforced we have $L(x, \lambda) = f(x)$. Thus, optimizing $f(x)$ under the constraint is equivalent to optimizing $L(x, \lambda)$ under the constraint. To find an extremum of $L(x, \lambda)$ we set all partial derivatives $\frac{\partial L}{\partial x_i} = 0$, which gives us

$$\nabla L(x, \lambda) = \nabla f(x) + \lambda \nabla g(x) \quad\Longrightarrow\quad 0 = \nabla f(x) + \lambda \nabla g(x)$$

We also set $\frac{\partial L}{\partial \lambda} = 0$. Since $\frac{\partial L}{\partial \lambda} = g(x)$, setting it to zero recovers $g(x) = 0$ (i.e. our constraint). These conditions give us a system of $n+1$ equations in the $n+1$ unknowns $(x_1, x_2, \ldots, x_n, \lambda)$, which form the solution:

$$\frac{\partial}{\partial x_1} f(x) + \lambda \frac{\partial}{\partial x_1} g(x) = 0, \quad \frac{\partial}{\partial x_2} f(x) + \lambda \frac{\partial}{\partial x_2} g(x) = 0, \quad \ldots, \quad \frac{\partial}{\partial x_n} f(x) + \lambda \frac{\partial}{\partial x_n} g(x) = 0, \quad g(x) = 0$$
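
To see the method in action on a textbook example (unrelated to the batched problem above), one could let sympy solve the system symbolically, say for maximizing $f(x, y) = xy$ subject to $x + y = 10$; a minimal sketch:

```python
import sympy as sp

# Textbook illustration of the Lagrange system:
# maximize f(x, y) = x*y subject to g(x, y) = x + y - 10 = 0.
x, y, lam = sp.symbols("x y lam", real=True)
f = x * y
g = x + y - 10

eqs = [sp.diff(f, x) + lam * sp.diff(g, x),   # df/dx + lam * dg/dx = 0
       sp.diff(f, y) + lam * sp.diff(g, y),   # df/dy + lam * dg/dy = 0
       g]                                     # the constraint itself
print(sp.solve(eqs, [x, y, lam]))             # -> {x: 5, y: 5, lam: -5}
```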

Note: we require $\nabla g(x) \neq 0$, otherwise we would be finding the maxima of $f$ with no dependence on $g$ - but the whole goal is to find the maxima of $f$ constrained by $g$.

Food for thought, relation to the naive-backprop approach: In the gradient descent approach I was thinking of minimizing the function

$$L = -f(x) + \lambda \left| g(x) \right|$$

This expression is similar to the Lagrange function, except for the absolute value (L1 norm) placed on the constraint function $g$ (also, here the negative of $f$ is minimized rather than $f$ being maximized, which is equivalent). Perhaps minimizing this expression would give a similar answer to the Lagrange multiplier approach if the same $\lambda$ were employed. However, I have not derived this or tested it empirically.

Solving our problem:

We start by constructing the Lagrange function $L(x, \lambda)$ for our problem and calculating its partial derivative with respect to each $x_i$:

$$\frac{\partial}{\partial x_i} L(x, \lambda) = \frac{\partial}{\partial x_i} \left( f(x) + \lambda g(x) \right) = \frac{\partial}{\partial x_i} \left( \sum_t \left( a_t x_t - w_t x_t^2 \right) + \lambda \sum_t b_t x_t \right) = a_i - 2 w_i x_i + \lambda b_i$$

Now, setting this expression equal to zero and solving for $x_i$ gives us $x_i = \frac{a_i + \lambda b_i}{2 w_i}$. We can then plug each $x_i$ into our constraint $g$ to determine $\lambda$:

$$\begin{aligned} 0 &= \sum_t b_t x_t \\ 0 &= \sum_t b_t \left( \frac{a_t + \lambda b_t}{2 w_t} \right) \\ 0 &= \sum_t \frac{b_t a_t}{2 w_t} + \lambda \sum_t \frac{b_t^2}{2 w_t} \\ \lambda &= -\frac{\sum_t \frac{b_t a_t}{2 w_t}}{\sum_t \frac{b_t^2}{2 w_t}} \end{aligned}$$
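
In code, for arrays that happen to fit in memory, the whole solution collapses to a few lines (a sketch; the random toy data is only there to sanity-check that the constraint comes out satisfied):

```python
import numpy as np

def lagrange_solve(a, b, w):
    """Closed form: lambda from the constraint, then x_i = (a_i + lambda * b_i) / (2 * w_i)."""
    lam = -np.sum(b * a / (2 * w)) / np.sum(b**2 / (2 * w))
    return (a + lam * b) / (2 * w), lam

# Quick sanity check on random data: the constraint should be (numerically) satisfied.
rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)
w = rng.uniform(0.5, 2.0, size=5)
x, lam = lagrange_solve(a, b, w)
assert abs(np.sum(b * x)) < 1e-9
```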

Et voilà! This closed-form solution gives us each $x_i$ that maximizes $f$ while enforcing the constraint $g$. Furthermore, we do not need to load all the data into memory (as stipulated in the problem setup) and can calculate $\lambda$ (and each $x_i$) in batches of data. Super super cool. Sidenote: we know our solution is a maximum because $\frac{\partial^2}{\partial x_i^2} L(x, \lambda) = -2 w_i$. As all $w_i$ are positive, the second-order partial derivatives are negative, hence we have found a maximum.
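
A batched version might look like the sketch below, where `load_batches` is a hypothetical callable yielding `(a, b, w)` chunks and the data is assumed to be iterable twice:

```python
import numpy as np

def batched_lagrange_solve(load_batches):
    """Two passes over the data: accumulate the sums defining lambda, then emit x chunk by chunk.

    `load_batches` is a hypothetical callable returning an iterator of (a, b, w) numpy chunks.
    """
    num, den = 0.0, 0.0
    for a, b, w in load_batches():          # pass 1: two running scalars for lambda
        num += np.sum(b * a / (2 * w))
        den += np.sum(b**2 / (2 * w))
    lam = -num / den

    for a, b, w in load_batches():          # pass 2: x_i = (a_i + lam * b_i) / (2 * w_i)
        yield (a + lam * b) / (2 * w)
```

Only the two running scalars for the numerator and denominator of $\lambda$ need to stay in memory during the first pass, which is what makes the closed form batch-friendly.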

