Archive

In an earlier update, I wrote about implementing computational graphs as a way to learn new programming languages.  I recently took the time to implement one in Kotlin, as I was having a hard time moving forward in Rust.  Here are some implementation decisions I made and how I feel about them looking back:

Implementing operators as standalone classes.

When I built Aij (the Java compute graph), Rust Compute Graph, and the first revision of CG4j, I had a graph class with ‘add node’/‘add operator’ methods which appended lambdas to a big list of forward and reverse operations.  This kept the graph class nice and portable: I could drop it into new projects and be on my way.  The downside comes from serialization.  Java’s built-in serializer can’t handle lambdas, so saving and restoring the graph automatically isn’t possible.  Another downside is that lambdas incur a small performance overhead, from what I’ve seen; even in Rust they cost us an unnecessary heap allocation.  The solution: define an operator or node class and subclass it.
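
Roughly, the subclass approach looks like this.  This is a minimal Kotlin sketch with made-up names and signatures, not the actual CG4j classes:

import java.io.Serializable

// Sketch only: one concrete class per operator instead of a pair of lambdas in a list.
// Concrete classes survive Java's built-in serialization, so the whole graph can be
// written out and read back without any extra machinery.
abstract class Node(vararg val inputs: Int) : Serializable {
    abstract fun forward(args: List<FloatArray>): FloatArray
    abstract fun reverse(args: List<FloatArray>, adjoint: FloatArray): List<FloatArray>
}

class AddNode(left: Int, right: Int) : Node(left, right) {
    override fun forward(args: List<FloatArray>): FloatArray =
        FloatArray(args[0].size) { i -> args[0][i] + args[1][i] }

    // d(a+b)/da = 1 and d(a+b)/db = 1, so each input just receives the incoming adjoint.
    override fun reverse(args: List<FloatArray>, adjoint: FloatArray): List<FloatArray> =
        listOf(adjoint.copyOf(), adjoint.copyOf())
}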

Returning Nodes or Integers?

Depending on whether you opt for nodes or node indices, your constructors end up taking either a node index plus a graph pointer, OR a node object.  Passing around a graph is inconvenient, especially if you have to load one from disk: you end up dealing with the whole chicken-and-egg problem.  Passing around a node is easier by far, but it means you have to handle saving out nodes which might exist on multiple paths and, more problematically…

Recursively computing the forward propagation calls.

One thing I did do correctly in the Rust implementation was calculating the nodes in order.  Each node had an ID/index, and when doing forward-prop I’d just iterate left-to-right across the node list and calculate each one.  In the Kotlin implementation, I opted to memoize and recursively compute the nodes.  That shouldn’t have been an issue in theory, since I was caching results, but the function calls aren’t free, and it means getting stack traces like this:

 at com.josephcatrambone.cg4j.Tensor.add_i(Tensor.kt:256)
 at com.josephcatrambone.cg4j.AddNode.forwardOperation(Node.kt:87)
 at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:33)
 at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
 at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
 at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
 at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
 at com.josephcatrambone.cg4j.Graph.forward(Graph.kt:30)
 at com.josephcatrambone.cg4j.Graph.forward$default(Graph.kt:19)
 at com.josephcatrambone.cg4j.RNN.predict(RNN.kt:129)
 at com.josephcatrambone.cg4j.RNN.predict$default(RNN.kt:97)
 at com.josephcatrambone.cg4j.MainKt.testRNN(Main.kt:107)

The single advantage of this approach is that unused code paths don’t get called.  If you have nodes A, B, … X, Y, Z, and Z = A+B, then the non-memoized version computes the values for everything between A and Z, as opposed to just A, B, and Z for the recursive version.  Depending on how much of your graph is used at any one time, this can be a big saving.  I’m not sure which approach is the clear winner here.
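
For comparison, here’s a hedged sketch of both forward strategies, built on the hypothetical Node class from the sketch above and using the index-plus-graph style; none of these names are the real CG4j API.  The iterative version is roughly what the Rust implementation did; the recursive one is what produced the stack trace above.

// Graph inputs are pre-seeded into `values` before either call.
class ForwardSketch(val nodes: List<Node>) {
    // Left-to-right: walk the node list by index and compute everything up to the target.
    // No recursion, but branches the target never touches get computed anyway.
    fun forwardIterative(target: Int, values: MutableMap<Int, FloatArray>): FloatArray {
        for (i in 0..target) {
            if (i !in values) {
                values[i] = nodes[i].forward(nodes[i].inputs.map { values[it]!! })
            }
        }
        return values[target]!!
    }

    // Recursive + memoized: only the ancestors of the target get computed,
    // but every hop is a real method call, hence stack traces like the one above.
    fun forwardRecursive(target: Int, values: MutableMap<Int, FloatArray>): FloatArray =
        values.getOrPut(target) {
            nodes[target].forward(nodes[target].inputs.map { forwardRecursive(it, values) })
        }
}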

UPDATE: This code is now available in both Java and Python!

I’ve been on an automatic differentiation kick ever since reading about dual numbers on Wikipedia.

I implemented a simple forward-mode autodiff system in Rust, thinking it would allow me to do ML faster.  I failed to realize/read that forward-mode differentiation, while simpler, gives you the derivative of ALL outputs with respect to ONE input variable per pass.  Reverse mode, in contrast, gives you the derivative of ONE output with respect to ALL inputs per pass.

That is to say, if I had f(x, y, z) = [a, b, c], forward mode would give me da/dx, db/dx, dc/dx in a single pass.  Reverse mode would give me da/dx, da/dy, da/dz in a single pass.

Forward mode is really easy. I have a repo with code changes here: https://github.com/JosephCatrambone/RustML
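
To give a flavor of the dual-number trick, here’s a standalone Kotlin sketch (illustrative only, not the code from that repo): carry a value and its derivative through every operation, seed dx = 1.0 on the input you care about, and the derivative of the output falls out for free.

data class Dual(val value: Double, val dx: Double = 0.0) {
    operator fun plus(other: Dual) = Dual(value + other.value, dx + other.dx)
    operator fun times(other: Dual) = Dual(value * other.value, dx * other.value + value * other.dx)
}

fun main() {
    // f(x, y) = (x + y) * x at x = 2, y = 3, differentiated with respect to x.
    val x = Dual(2.0, 1.0)  // Seed d/dx x = 1.
    val y = Dual(3.0)       // d/dx y = 0.
    val f = (x + y) * x
    println("f = ${f.value}, df/dx = ${f.dx}")  // f = 10.0, df/dx = 7.0
}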

Reverse mode took me a while to figure out, mostly because I was confused about how adjoints worked.  I’m still confused, but I’m now so accustomed to the strangeness that I don’t notice it.  Here’s some simple, single-variable reverse-mode autodiff.  It’s about 100 lines of Python:

#!/usr/bin/env python
# JAD: Joseph's Automatic Differentiation
from collections import deque


class Graph(object):
    def __init__(self):
        self.names = list()
        self.operations = list()
        self.derivatives = list()  # A list of LISTS, where each item is the gradient with respect to that argument.
        self.node_inputs = list()  # A list of the indices of the input nodes.
        self.shapes = list()
        self.graph_inputs = list()
        self.forward = list()  # Cleared on forward pass.
        self.adjoint = list()  # Cleared on reverse pass.

    def get_output(self, input_set, node=-1):
        self.forward = list()
        for i, op in enumerate(self.operations):
            self.forward.append(op(input_set))
        return self.forward[node]

    def get_gradient(self, input_set, node, forward_data=None):
        if forward_data is not None:
            self.forward = forward_data
        else:
            self.forward = list()
            for i, op in enumerate(self.operations):
                self.forward.append(op(input_set))
        # Initialize adjoints to 0 except our target, which is 1.
        self.adjoint = [0.0]*len(self.forward)
        self.adjoint[node] = 1.0
        gradient_stack = deque()
        for input_node in self.node_inputs[node]:
            gradient_stack.append((input_node, node))  # Keep pairs of target/parent.
        while gradient_stack:  # While not empty.
            current_node, parent_node = gradient_stack.popleft()
            # Chain rule: add the parent's adjoint scaled by d(parent)/d(current)
            # for every argument slot of the parent that this node occupies.
            for arg_index, parent_input in enumerate(self.node_inputs[parent_node]):
                if parent_input == current_node:
                    dop = self.derivatives[parent_node][arg_index]
                    self.adjoint[current_node] += self.adjoint[parent_node]*dop(input_set)
            for input_arg in self.node_inputs[current_node]:
                gradient_stack.append((input_arg, current_node))
        return self.adjoint

    def get_shape(self, node):
        return self.shapes[node]

    def add_input(self, name, shape):
        index = len(self.names)
        self.names.append(name)
        self.operations.append(lambda inputs: inputs[name])
        self.derivatives.append([lambda inputs: 1])
        self.node_inputs.append([])
        self.graph_inputs.append(index)
        self.shapes.append(shape)
        return index

    def add_add(self, name, left, right):
        index = len(self.names)
        self.names.append(name)
        self.operations.append(lambda inputs: self.forward[left] + self.forward[right])
        self.derivatives.append([lambda inputs: 1, lambda inputs: 1])  # d/da (a+b) = 1.  d/db (a+b) = 1.
        self.node_inputs.append([left, right])
        self.shapes.append(self.get_shape(left))
        return index

    def add_multiply(self, name, left, right):
        index = len(self.names)
        self.names.append(name)
        self.operations.append(lambda inputs: self.forward[left] * self.forward[right])
        self.derivatives.append([lambda inputs: self.forward[right], lambda inputs: self.forward[left]])  # d/da (a*b) = b.  d/db (a*b) = a.
        self.node_inputs.append([left, right])
        self.shapes.append(self.get_shape(left))
        return index


if __name__ == "__main__":
    g = Graph()
    x = g.add_input("x", (1, 1))
    y = g.add_input("y", (1, 1))
    a = g.add_add("a", x, y)
    b = g.add_multiply("b", a, x)  # b = (x + y)*x
    input_map = {'x': 2, 'y': 3}
    print(g.get_output(input_map))  # 10
    print(g.get_gradient(input_map, b))  # [7.0, 2.0, 2.0, 1.0]: db/dx, db/dy, db/da, db/db.


> let doit x y = x(x(y))
> let arnold x = x ++ "aauuugh"
> putStrLn $ arnold "Get to da choppa"
Get to da choppaaauuugh
> putStrLn ( doit arnold "Nauuugh" )
Nauuughaauuughaauuugh