building a deep learning machine


GPU:

  • Pascal Titan X ($1200, 12GB memory, 3584 cores, 11 TFLOPS, 250W)
  • GeForce GTX 1080 ($699, 8GB memory, 2560 cores, 10 TFLOPS, 180W)

Disk: an SSD is faster than RAID, while RAID offers more capacity.


Power supply:

  • (from the NVIDIA DIGITS DevBox) 1600W power supply unit from premium suppliers, including EVGA
  • (recommended by Joseph Redmon) Rosewill Hercules 1600W
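
A rough power budget suggests why a 1600W unit is a comfortable choice for a four-GPU build. The sketch below uses the GPU wattages listed in this post; the 300W allowance for the CPU, drives, and fans is an assumption made for illustration, not a measured figure, and total_power is a hypothetical helper name.

```python
def total_power(gpu_watts, num_gpus, other_watts):
    """Estimated peak draw in watts for the whole machine."""
    return gpu_watts * num_gpus + other_watts


PSU_WATTS = 1600
NUM_GPUS = 4        # the motherboards in this post support 4 GPUs
OTHER_WATTS = 300   # ASSUMPTION: CPU, drives, fans, and so on

for name, watts in [("Pascal Titan X", 250), ("GeForce GTX 1080", 180)]:
    needed = total_power(watts, NUM_GPUS, OTHER_WATTS)
    print(name, needed, "W needed,", PSU_WATTS - needed, "W headroom")
```

Under these assumptions, four Titan X cards need about 1300W, which leaves some headroom on a 1600W supply.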

Motherboard (support 4 gpus):

  • GIGABYTE GA-X99-UD3P motherboard
  • Asus X99-E WS workstation class motherboard with 4-way PCI-E Gen3 x16 support


Graph data structures

  • A Graph contains a set of operations and tensors:

  • tf.Operation objects, which represent units of computation;

  • tf.Tensor objects, which represent the units of data that flow between operations. You can think of a TensorFlow tensor as an n-dimensional array or list.

  • Variables maintain state across executions of the graph.

import tensorflow as tf

# Build a dataflow graph.
c = tf.constant([[1.0, 2.0], [3.0, 4.0]])
d = tf.constant([[1.0, 1.0], [0.0, 1.0]])
e = tf.matmul(c, d)

# Construct a `Session` to execute the graph.
sess = tf.Session()

# Execute the graph and store the value that `e` represents in `result`.
result = sess.run(e)
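
For reference, the value that `e` represents can be worked out without TensorFlow. The sketch below multiplies the same two matrices with plain nested lists; the matmul helper is written here only for illustration and is not part of any library. Running the graph should give the same numbers, returned as a NumPy array.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


c = [[1.0, 2.0], [3.0, 4.0]]
d = [[1.0, 1.0], [0.0, 1.0]]
print(matmul(c, d))  # [[1.0, 3.0], [3.0, 7.0]]
```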

Variables: Creation, Initialization, Saving, and Loading

– tf.get_variable(name, shape, dtype, initializer)


– tf.Variable


– Open question: where is a variable stored, on the CPU or the GPU?

# Create two variables.
weights = tf.Variable(tf.random_normal([784, 200], stddev=0.35), name="weights")
biases = tf.Variable(tf.zeros([200]), name="biases")

# Add an op to initialize the variables
# (tf.initialize_all_variables() is the older, deprecated name).
init_op = tf.global_variables_initializer()

# Add ops to save and restore all the variables.
saver = tf.train.Saver()

# Later, when launching the model
with tf.Session() as sess:
    # Run the init operation.
    sess.run(init_op)

    # Save the variables to disk.
    save_path = saver.save(sess, "/tmp/model.ckpt")

    # Restore variables from disk.
    saver.restore(sess, "/tmp/model.ckpt")




Example 1: Least Squares Regression

import tensorflow as tf
import numpy as np
# Create 100 phony x, y data points in NumPy, y = x * 0.1 + 0.3
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 0.1 + 0.3

# Try to find values for W and b that compute y_data = W * x_data + b
# (We know that W should be 0.1 and b 0.3, but Tensorflow will
# figure that out for us.)
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
b = tf.Variable(tf.zeros([1]))
y = W * x_data + b

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

# Before starting, initialize the variables.  We will 'run' this first.
init = tf.global_variables_initializer()

# Launch the graph.
sess = tf.Session()
sess.run(init)

# Fit the line.
for step in range(201):
    sess.run(train)
    if step % 20 == 0:
        print(step, sess.run(W), sess.run(b))
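
To see what the optimizer is doing, here is a minimal sketch of the same fit in plain Python, with the gradients of the mean squared error written out by hand. It mirrors the learning rate (0.5) and step count (201) of the example; the fixed random seed is an addition so the run is repeatable. It is an illustration of gradient descent, not a description of TensorFlow's internals.

```python
import random

random.seed(0)

# Same synthetic data as the TensorFlow example: y = 0.1 * x + 0.3
xs = [random.random() for _ in range(100)]
ys = [0.1 * x + 0.3 for x in xs]

W = random.uniform(-1.0, 1.0)
b = 0.0
learning_rate = 0.5

for step in range(201):
    errors = [W * x + b - y for x, y in zip(xs, ys)]
    # Gradients of mean((W * x + b - y) ** 2) with respect to W and b.
    grad_W = 2.0 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
    grad_b = 2.0 * sum(errors) / len(xs)
    W -= learning_rate * grad_W
    b -= learning_rate * grad_b
    if step % 20 == 0:
        print(step, W, b)
```

W and b should approach 0.1 and 0.3 by the final step.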

QA notes

Antoine Bordes’s slides: Artificial Tasks for Artificial Intelligence

Big Data -> Big AI?

  • positive: We can certainly improve (a lot) on many (well defined) tasks.
  • negative:
    • Training data will never cover the whole range of possible situations
    • train/test distributions drift
    • Real large-scale data is complex, noisy, unlabeled -> hard to train/label
    • Interpretation of success or failure is complex

Recent models on sequences

  • Neural Turing Machines (Graves et al., 2014)
  • Stack-augmented RNN (Joulin & Mikolov, 2015)

FB AI-complete QA

  • 20 tasks: single supporting fact, two/three supporting facts, two/three argument relations, ...
  • Look for systems able to solve all tasks: no task-specific engineering

Visual QA

  • 0.25M images, 0.76M questions, 10M answers

iterable, iterator and generator

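An iterable (a list, for example) is required to support one method: __iter__, which returns an iterator.

An iterator (e.g., the object returned by iter()) is required to support two methods: __iter__ and next (__next__ in Python 3.x).

So you can call

x = iter([1, 2, 3])
next(x)  # the built-in next() works in both Python 2.6+ and 3.x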

A generator simplifies the creation of iterators. A generator is a function that produces a sequence of results instead of a single value.
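
The difference is easiest to see side by side. The sketch below writes the same countdown twice: once as a hand-written iterator class, and once as a generator function. Countdown and countdown are names made up for this illustration.

```python
class Countdown:
    """Iterator written by hand: implements both protocol methods."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__.
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value

    next = __next__  # Python 2 spelling of the same method


def countdown(start):
    """Generator: same sequence, with the protocol supplied by `yield`."""
    while start > 0:
        yield start
        start -= 1


print(list(Countdown(3)))  # [3, 2, 1]
print(list(countdown(3)))  # [3, 2, 1]
```

The generator version keeps its state in local variables, so there is no class and no explicit StopIteration to write.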

Question: how to write a generator using h5py?
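
One possible answer, as a sketch: an h5py dataset supports len() and NumPy-style slicing, so a generator can read one slice per step instead of loading the whole file. The code below uses a plain list as a stand-in so it runs without h5py; batches and batch_size are illustrative names, and the file path and dataset name mentioned in the comment are hypothetical.

```python
def batches(dataset, batch_size):
    """Yield consecutive slices of `dataset`, one batch at a time."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]


# Stand-in data. With h5py, `data` would instead be a dataset object such as
# h5py.File(path, "r")[name], where `path` and `name` are hypothetical.
data = list(range(10))
for batch in batches(data, 4):
    print(batch)
```

With the stand-in list this prints three batches, the last one shorter than the others.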

Theano and numpy broadcasting



Naive code

for i in range(len(X1)):
    X[i, :] = X1[i, :] - X2

Broadcasting makes the loop unnecessary (X = X1 - X2). The rule is that two dimensions, compared starting from the trailing one, are compatible when
1. the dimensions are equal, or
2. one of the dimensions is 1.
Dimensions with size 1 are stretched or “copied” to match the other.

For example

import numpy as np

A = np.random.rand(8, 1, 6, 1)
B = np.random.rand(7, 1, 5)
print((A * B).shape)  # (8, 7, 6, 5)
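
The shape rule can be written out in a few lines of plain Python. This is a sketch of the rule as stated above, not NumPy's actual implementation; broadcast_shape is a name chosen for this illustration.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError."""
    ndim = max(len(shape_a), len(shape_b))
    # Left-pad the shorter shape with 1s so both have the same rank.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(
                "incompatible dimensions: %d vs %d" % (dim_a, dim_b))
    return tuple(result)


print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))  # (8, 7, 6, 5)
```

Padding on the left is what aligns the trailing dimensions, which is why (7, 1, 5) is treated as (1, 7, 1, 5).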

theano variables and shared variables
By default, Theano matrix/tensor3/tensor4 variables are not broadcastable. The dimensions of Theano shared variables are not broadcastable either, as their shape could change.

Someone suggested

b = theano.shared(bval, broadcastable=True)

but it does not work for me.

Instead, I tried

b = T.addbroadcast(b, 1)

and it works.

We can also use T.TensorType to set the broadcastable pattern

T.TensorType('float32', broadcastable=[False, True])

vector broadcasting in theano

A vector can be broadcast when computing with a matrix

import numpy as np
import theano

xval = np.array([[1, 2, 3], [4, 5, 6]])
bval = np.array([10, 20, 30])

xshared = theano.shared(xval)
bshared = theano.shared(bval)
zvar = xshared * 1.0 / bshared
print(zvar.eval())


  • A vector can be broadcast only when its length matches the number of columns
  • We can use "b.dimshuffle(0, 'x')" or "c2 = b[:, np.newaxis]" to change a vector into a 2D array with shape = [d, 1]

What if CVPR hates deep learning? Please mention faces

One of the most controversial papers in the CV field this year, “Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers”, was rejected by CVPR but has now been accepted by ICML’12. 🙂

For those who are not familiar with the story, here is the letter from the author to Serge Belongie, a CVPR PC chair. It is quite surprising to read the harsh reviews, but I cannot overlook the fact that the reviewers are very responsible and they have their arguments.

I met one of the authors at the NYC Multimedia & Vision meeting, who joked that CVPR reviewers hate deep learning.

Now I want to argue that he was wrong. The reason is that I found two deep learning papers that have been accepted by CVPR’12:

  • Hierarchical Representations for Face Verification with Convolutional Deep Belief Networks, by Gary Huang, Honglak Lee, Erik Learned-Miller
  • Hierarchical Face Parsing via Deep Learning, by Ping Luo, Xiaogang Wang, Xiaoou Tang

What is the difference between these two papers and the ICML paper? We can easily see where the magic is. The two accepted papers are both related to “face”. In contrast, the scene parsing paper has nothing to do with faces, so it was rejected!

Next time I see Yann LeCun, I think I will suggest that he submit a CVPR paper with face recognition experiments. 🙂

Lessons from Jeopardy! (1)

In 2011, the IBM computer named “Watson” lit up the Web and traditional media by beating the best human player in the Jeopardy! game.

This is Watson

This is a rough intuition of how Watson works

Why am I writing this series of posts NOW? Although the Watson team won the game in 2011, not many details had been published until the IBM Journal of Research and Development special issue in May 2012. I want to note down the lessons I learned from the special issue and from discussions with my Watson colleagues.

My motivation to learn from the Watson team comes not only from my respect for those fantastic NLP researchers, but also from the desire to push forward other AI-related fields, especially my own areas of computer vision and multimedia.

Computers never behave exactly like human beings. They are either (1) far more powerful or (2) much more stupid than human brains. Watson was in category (2) five years ago but is now obviously in category (1).

By the measure used internally, the computer in 2007 could answer 70% of the questions with a precision of only 16%, or simply 16%@70. But amazing things happened after 2007, and the performance was boosted to 85%@70 in 2011, which is good enough to beat the best human player in the world. What is the magic behind this journey? How could the Watson team have the confidence to continue the project despite the poor initial performance? Please see the next post:

Lessons from Jeopardy! (2)  Where does the confidence come from