Avi Singh's blog

Why Write

Sat, 12 Feb 2022 00:00:00 +0000

This blog has not seen a new post in over six years. So, why am I reviving it now? Put simply, because of the following lines from a Paul Graham blog post: “If writing down your ideas always makes them more precise and more complete, then no one who hasn’t written about a topic has fully formed ideas about it. And someone who never writes has no fully formed ideas about anything nontrivial.” This statement appears a bit extreme, but Graham has built this argument over multiple essays, and I find that his views resonate with my (albeit limited) writing experience.

In other words, my motivation to write comes from my desire to think better. Why, then, is it necessary to write in public? Why not write in a private journal instead? Writing in public can lead to interesting discussions with people who might stumble upon these articles, the benefits of which are (at least) two-fold. First, a discussion can lead to further refinement of my ideas and thinking. Second, a stimulating discussion is fun in and of itself, and might serve as yet another incentive for maintaining a writing habit.

So, what do I plan to write about? When I wrote this blog as an undergraduate student, the blog posts were instructive in nature. Like others trying to break into research, I often found research papers difficult to follow, and preferred reading blog posts written by researchers instead. These blog posts helped me build my understanding to an extent where traditional research papers became accessible to me. Each of my own blog posts was in turn centered around explaining an approach to a particular technical problem, typically in the field of computer vision. However, I do not intend new posts on this blog to follow a similar tutorial format. While I might occasionally write such posts to improve my understanding of certain ideas or algorithms, the primary goal I have with the blog at this point is to improve my thinking for better everyday decision-making. It is likely that I will focus more on research or career-related topics, but I might occasionally venture into more personal topics as well. We’ll see.

I will conclude this blog post with some ideas for what I might write about next, in the hope that it might help me deal with the procrastination that I usually suffer from when it comes to writing. Nothing, however, is set in stone. At the end of the day, I want to write about what I can’t not write about.

Future blog post ideas:

How to choose a research problem? I like Vladlen Koltun’s thoughts on this front, and I will likely draw on them as I solidify my own thinking in this regard.
How to become a better writer? If I am going to do something, I might as well do it well. I plan to start by looking at what others have written on this topic, and then try and figure out what does (and does not) work for me.

Deep Learning for Visual Question Answering

Mon, 02 Nov 2015 00:00:00 +0000

In this blog post, I’ll talk about the Visual Question Answering problem, and I’ll also present neural network based approaches for the same. The source code for this blog post is written in Python and Keras, and is available on Github.

A year or so ago, a chatbot named Eugene Goostman made it to the mainstream news, after having been reported as the first computer program to have passed the famed Turing Test in an event organized at the University of Reading. While the organizers hailed it as a historical achievement, most of the scientific community wasn’t impressed. This leads us to the question: Is the Turing Test, in its original form, a suitable test for AI in the modern day?

In the last couple of years, a number of papers (like this paper from JHU/Brown, and this one from MPI) have suggested that the task of Visual Question Answering (VQA, for short) can be used as an alternative Turing Test. The task involves answering an open-ended question (or a series of questions) about an image. An example is shown below:

Image from visualqa.org

The AI system needs to solve a number of sub-problems in Natural Language Processing and Computer Vision, in addition to being able to perform some kind of “common-sense” reasoning. It needs to localize the subject being referenced (the woman’s face, and more specifically the region around her lips), needs to detect objects (the banana), and should also have some common-sense knowledge that the word mustache is often used to refer to markings or objects on the face that are not actually mustaches (like milk mustaches). Since the problem cuts through two very different modalities (vision and text), and requires high-level understanding of the scene, it appears to be an ideal candidate for a true Turing Test. The problem also has real world applications, like helping the visually impaired.

A few days ago, the Visual QA Challenge was launched, and along with it came a large dataset (~750K questions on ~250K images). After the MS COCO Image Captioning Challenge sparked a lot of interest in the problem of image captioning (or was it the interest that led to the challenge?), the time seems ripe to move onto a much harder problem at the intersection of NLP and Vision.

This post will present ways to model this problem using Neural Networks, exploring both Feedforward Neural Networks, and the much more exciting Recurrent Neural Networks (LSTMs, to be specific). If you do not know much about Neural Networks, then I encourage you to check these two awesome blogs: Colah’s Blog and Karpathy’s Blog. Specifically, check out the posts on Recurrent Neural Nets, Convolutional Neural Nets and LSTM Nets. The models in this post take inspiration from this ICCV 2015 paper, this ICCV 2015 paper, and this NIPS 2015 paper.

Generating Answers

An important aspect of solving this problem is to have a system that can generate new answers. While most of the answers in the VQA dataset are short (1-3 words), we would still like to have a system that can generate arbitrarily long answers, in keeping with the spirit of the Turing test. We can perhaps take inspiration from papers on Sequence to Sequence Learning using RNNs, which solve a similar problem when generating translations of arbitrary length. Multi-word methods have been presented for VQA too. However, for the purpose of this blog post, we will ignore this aspect of the problem. We will select the 1000 most frequent answers in the VQA training dataset, and solve the problem in a multi-class classification setting. These top 1000 answers cover over 80% of the answers in the VQA training set, so we can still expect to get reasonable results.
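To make the answer-selection step concrete, here is a minimal sketch of picking the k most frequent answers and measuring how much of the training set they cover. The function name `top_k_answers` and the toy answer list are hypothetical and only for illustration; with the real VQA training answers, k would be 1000.

```python
from collections import Counter

def top_k_answers(answers, k):
    """Return the k most frequent answers and the fraction of samples they cover."""
    counts = Counter(answers)
    top = [ans for ans, _ in counts.most_common(k)]
    covered = sum(counts[a] for a in top)
    return top, covered / len(answers)

# Toy data standing in for the VQA training answers (hypothetical values).
train_answers = ["yes", "no", "yes", "2", "yes", "white", "no", "red"]
top, coverage = top_k_answers(train_answers, k=2)

# Map each kept answer to a class index for the classifier.
label_of = {ans: idx for idx, ans in enumerate(top)}
```

Training samples whose answer is not among the kept classes have no valid label, so in a setup like this they would simply be left out of training.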

The Feedforward Neural Model

To get started, let’s first try to model the problem using a MultiLayer Perceptron. An MLP is a simple feedforward neural net that maps a feature vector (of fixed length) to an appropriate output. In our problem, this output will be a probability distribution over the set of possible answers. We will be using Keras, an awesome deep learning library based on Theano, and written in Python. Setting up Keras is fairly easy: just have a look at their readme to get started.

In order to use the MLP model, we need to map all our input questions and images to a feature vector of fixed length. We perform the following operations to achieve this:

For the question, we transform each word to its word vector, and sum up all the vectors. The length of this feature vector will be the same as the length of a single word vector, and the word vectors (also called embeddings) that we use have a length of 300.
For the image, we pass it through a Deep Convolutional Neural Network (the well-known VGG Architecture), and extract the activation from the second last layer (before the softmax layer, that is). Size of this feature vector is 4096.
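The two steps above can be sketched in a few lines. This is only an illustration with toy dimensions: the helper names `question_vector` and `mlp_input` and the tiny embedding table are hypothetical, and the actual pipeline uses 300-dimensional word vectors and 4096-dimensional VGG features.

```python
def question_vector(tokens, embeddings, dim):
    """Sum the word vectors of all tokens; unknown words contribute nothing."""
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
    return total

def mlp_input(tokens, image_feature, embeddings, dim):
    """Concatenate the question vector with the image feature vector."""
    return question_vector(tokens, embeddings, dim) + list(image_feature)

# Toy 3-dimensional embeddings and a 2-dimensional "image feature"
# (hypothetical values; the post uses 300 and 4096 dimensions).
toy_embeddings = {
    "is": [1.0, 0.0, 0.0],
    "it": [0.0, 1.0, 0.0],
    "red": [0.0, 0.0, 2.0],
}
x = mlp_input(["is", "it", "red"], [0.5, 0.25], toy_embeddings, dim=3)
```

Because the word vectors are summed, shuffling the words of the question yields exactly the same feature vector, which is the bag-of-words limitation that the recurrent model later addresses.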

Once we have generated the feature vectors, all we need to do now is to define a model in Keras, set up a cost function and an optimizer, and we’re good to go. The following Keras code defines a multi-layer perceptron with two hidden layers, 1024 hidden units in each layer and dropout layers in the middle for regularization. The final layer is a softmax layer, and is responsible for generating the probability distribution over the set of possible answers. I have used the categorical_crossentropy loss function since it is a multi-class classification problem. The rmsprop method is used for optimization. You can try experimenting with other optimizers, and see what kind of learning curves you get.

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation

img_dim = 4096 #top layer of the VGG net
word_vec_dim = 300 #dimension of pre-trained word vectors
nb_hidden_units = 1024 #number of hidden units, a hyperparameter
nb_classes = 1000 #number of candidate answers (the 1000 most frequent ones)

model = Sequential()
model.add(Dense(nb_hidden_units, input_dim=img_dim+word_vec_dim, 
          init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(nb_hidden_units, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes, init='uniform'))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

Have a look at the entire python script to see the code for generating the features and training the network. It does not access the hard disk once the training begins, and uses about 4GB of RAM. You can reduce memory usage by lowering the batchSize variable, but that would also lead to longer training times. It is able to process over 215K image-question pairs in less than 160 seconds/epoch when working on a GTX 760 GPU with a batch size of 128. I ran my experiments for 100 epochs.

The Recurrent Neural Model

A drawback of the previous approach is that we ignore the sequential nature of the questions. Regardless of what order the words appear in, we’ll get the same vector representing the question, à la bag-of-words (BOW). A way to tackle this limitation is by use of Recurrent Neural Networks, which are well-suited for sequential data. We’ll be using LSTMs here, since they avoid some common pitfalls of vanilla RNNs, and often give slightly better performance. You can also experiment with other recurrent layers in Keras, such as GRU. The word vectors corresponding to the tokens in the question are passed to an LSTM in a sequential fashion, and the output of the LSTM (from its output gate) after all the tokens have been passed is chosen as the representation for the entire question. This fixed length vector is concatenated with the 4096 dimensional CNN vector for the image, and passed on to a multi-layer perceptron with fully connected layers. The last layer is once again softmax, and provides us with a probability distribution over the possible outputs.

from keras.models import Sequential
from keras.layers.core import Dense, Activation, Merge, Dropout, Reshape
from keras.layers.recurrent import LSTM

num_hidden_units_mlp = 1024
num_hidden_units_lstm = 512
img_dim = 4096
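word_vec_dim = 300
max_len = 30 #maximum question length in tokens (assumed value, adjust to your data)
nb_classes = 1000 #number of candidate answers (the 1000 most frequent ones)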

image_model = Sequential()
image_model.add(Reshape(input_shape = (img_dim,), dims=(img_dim,)))

language_model = Sequential()
language_model.add(LSTM(output_dim = num_hidden_units_lstm, 
			return_sequences=False, 
			input_shape=(max_len, word_vec_dim)))

model = Sequential()
model.add(Merge([language_model, image_model], 
			mode='concat', concat_axis=1))
model.add(Dense(num_hidden_units_mlp, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(num_hidden_units_mlp, init='uniform'))
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

A single train_on_batch method call in Keras expects the sequences to be of the same length (so that it can be represented as a Theano Tensor). There has been a lot of discussion regarding training LSTMs with variable length sequences, and I used the following technique: I sorted all the questions by their length, and then processed them in batches of 128 while training. Most batches had questions of the same length (say 9 or 10 words), and there was no need for zero-padding. For the few batches that did have questions of varying length, the shorter questions were zero-padded. I was able to achieve a training speed of 200 seconds/epoch on a GTX 760 GPU.
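The length-sorted batching described above can be sketched as follows. This is a simplified illustration, not the training script itself: questions are represented as lists of integer token ids rather than word vectors, the function name `length_sorted_batches` is hypothetical, and padding at the front of the sequence is an assumption (the text only says the shorter questions were zero-padded).

```python
def length_sorted_batches(questions, batch_size, pad_token=0):
    """Sort tokenized questions by length, cut into batches, pad within each batch."""
    ordered = sorted(questions, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        longest = len(batch[-1])  # sorted, so the last question is the longest
        # Assumption: pad at the front, so the real tokens are seen last.
        batches.append([[pad_token] * (longest - len(q)) + q for q in batch])
    return batches

# Toy questions as lists of token ids (hypothetical values).
questions = [[1, 2, 3], [4, 5], [6, 7, 8], [9, 10], [11]]
batches = length_sorted_batches(questions, batch_size=2)
```

With a real dataset and a batch size of 128, most batches end up containing questions of a single length, so the padding branch only matters at the boundaries between length groups.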

Show me the numbers

I trained my system on the Training Set of the VQA dataset, and evaluated performance on the validation set, following the rules of the VQA challenge. The answer produced by the Neural Net is checked against every answer provided by humans (there are ten human answers for every question). If the answer produced by the neural net exactly matches at least three of the ten answers, then we classify it as a correct prediction. Here is the performance of the models that I trained:

Model	Accuracy
BOW+CNN	48.46%
LSTM-Language only	44.17%
LSTM+CNN	51.63%
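The scoring rule described before the table (a prediction counts as correct if it exactly matches at least three of the ten human answers) can be written down directly. The sketch below implements only that rule as stated in this post; the function names and the toy answers are hypothetical, and any answer normalisation the official evaluation may apply is not modelled here.

```python
def is_correct(predicted, human_answers, min_matches=3):
    """True if the prediction exactly matches at least min_matches human answers."""
    matches = sum(1 for ans in human_answers if ans == predicted)
    return matches >= min_matches

def accuracy(predictions, all_human_answers):
    """Fraction of questions whose prediction is counted as correct."""
    correct = sum(is_correct(p, h) for p, h in zip(predictions, all_human_answers))
    return correct / len(predictions)

# Two toy questions with ten human answers each (hypothetical values).
humans = [
    ["yes"] * 7 + ["no"] * 3,
    ["red"] * 2 + ["orange"] * 8,
]
score = accuracy(["no", "red"], humans)
```

Here the first prediction is accepted (three humans said "no") and the second is rejected (only two said "red"), so the score is 0.5.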

Update: The results that I reported earlier were based on a metric slightly different from the one used in VQA. They have since been updated. Also, I was able to obtain a performance of 53.34% on the test-dev set (LSTM+CNN), which is practically the same as the number reported by the VQA authors for their LSTM baseline.

It’s interesting to see that even a “blind” model is able to obtain an accuracy of 44.17%. This shows that the model is pretty good at guessing the answers once it has identified the type of question. The LSTM+CNN model shows an improvement of about 3% as compared to the Feedforward Model (BOW+CNN), which tells us that the temporal structure of the question is indeed helpful. These results are in line with what was obtained in the original VQA paper. However, the results reported in the paper were on the test set (trained on train+val), while we have evaluated on the validation set (trained on only train). If we learn a model on both the training and the validation data, then we can expect a significant improvement in performance since the number of training examples will increase by 50%. Finally, there is a lot of scope for hyperparameter tuning (number of hidden units, number of MLP hidden layers, number of LSTM layers, dropout or no dropout etc.).

I carried out my experiments for 100 epochs¹, and observed the following curve:

The LSTM+CNN model flattens out in performance after about 50 epochs. The BOW+CNN also showed similar behavior, but took a surprising dive at epoch 90, which was soon rectified by the 100th epoch. I’ll probably re-initialize and run the models for 500 epochs, and see if such behavior is seen again or not. Update: I did run it once more, and the dip was not observed!

A note on word embeddings

We have a number of choices when using word embeddings, and I experimented with three of them:

GloVe Word Embeddings trained on the common-crawl: These gave the best performance, and all results reported here are using these embeddings.
Goldberg and Levy 2014: These are the default embeddings that come with spaCy, and they gave significantly worse results.
Embeddings Trained on the VQA questions: I used Gensim’s word2vec implementation to train my own embeddings on the questions in the training set of the VQA dataset. The performance was similar to, but slightly worse than, the GloVe embeddings. This is primarily because the VQA training set alone is not sufficiently large (~2.5m words) to get reasonable word vectors, especially for less common words.

Link to github repo

Validation was done once per 10 epochs for BOW+CNN, once every 5 epochs for LSTMs. ↩

Monocular Visual Odometry using OpenCV

Mon, 08 Jun 2015 00:00:00 +0000

Last month, I made a post on Stereo Visual Odometry and its implementation in MATLAB. This post focuses on Monocular Visual Odometry, and how we can implement it in OpenCV/C++. The implementation that I describe in this post is once again freely available on github. It is also simpler to understand, and runs at 5fps, which is much faster than my older stereo implementation.

If you are new to Visual Odometry, I suggest having a look at the first few paragraphs (before all the math starts) of my old post. It talks about what Visual Odometry is, why we need it, and also compares the monocular and stereo approaches.

Acquainted with all the basics of visual odometry? Cool. Let’s go ahead.

Demo

Before I move onto describing the implementation, have a look at the algorithm in action!

Pretty cool, eh? Let’s dive into implementing it in OpenCV now.

Formulation of the problem

Input

We have a stream of gray scale images coming from a camera. Let the frames, captured at time \(t\) and \(t+1\) be referred to as \(\mathit{I}^{t}\), \(\mathit{I}^{t+1}\). We have prior knowledge of all the intrinsic parameters, obtained via calibration, which can also be done in OpenCV.

Output

For every pair of images, we need to find the rotation matrix \(R\) and the translation vector \(t\), which describe the motion of the vehicle between the two frames. The vector \(t\) can only be computed up to a scale factor in our monocular scheme.

Algorithm Outline

Capture images: \(\mathit{I}^t\), \(\mathit{I}^{t+1}\),
Undistort the above images.
Use the FAST algorithm to detect features in \(\mathit{I}^t\), and track those features to \(\mathit{I}^{t+1}\). A new detection is triggered if the number of features drops below a certain threshold.
Use Nister’s 5-point algorithm with RANSAC to compute the essential matrix.
Estimate \(R, t\) from the essential matrix that was computed in the previous step.
Take scale information from some external source (like a speedometer), and concatenate the translation vectors, and rotation matrices.

You may or may not understand all the steps that have been mentioned above, but don’t worry. All the points above will be explained in great detail in the text to follow.

Undistortion

Distortion happens when lines that are straight in the real world become curved in the images. This step compensates for this lens distortion. It is performed with the help of the distortion parameters that were obtained during calibration. Since the KITTI dataset that I’m using already comes with undistorted images, I won’t write the code for it here. However, it is relatively straightforward to undistort with OpenCV.

Feature Detection

My approach uses the FAST corner detector, just like my stereo implementation. I’ll now explain in brief how the detector works, though you should have a look at the original paper and source code if you want to really understand it. Suppose there is a point \(\mathbf{P}\) that we want to test for being a corner. We draw a circle with a circumference of 16 pixels around this point, as shown in the figure below. We then check whether there exists a contiguous set of pixels on this circle whose intensities all exceed the intensity of \(\mathbf{P}\) by at least a threshold \(\mathbf{I}\), or a contiguous set whose intensities are all lower by at least the same threshold \(\mathbf{I}\). If yes, then we mark this point as a corner. A heuristic for rejecting the vast majority of non-corners is used, in which the pixels at positions 1, 9, 5 and 13 are examined first: at least three of them must have an intensity higher by at least \(\mathbf{I}\), or lower by at least \(\mathbf{I}\), for the point to be a corner. This particular approach is selected due to its computational efficiency as compared to other popular interest point detectors such as SIFT.

Image from the original FAST feature detection paper
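To make the segment test concrete, here is a minimal sketch of the corner check for a single candidate pixel. It is not OpenCV's implementation: the function name `is_fast_corner` is hypothetical, the 16 ring intensities are assumed to be given in circular order, and the required arc length n = 12 is an assumption (the text above does not state it; the quick rejection on pixels 1, 5, 9 and 13 is only valid for arcs of at least 12 pixels, so the sketch applies it only in that case).

```python
def is_fast_corner(center, ring, threshold, n=12):
    """Segment test on one pixel; ring holds the 16 circle intensities in circular order."""
    brighter = [p >= center + threshold for p in ring]
    darker = [p <= center - threshold for p in ring]

    # Quick rejection using pixels 1, 5, 9 and 13 (indices 0, 4, 8, 12).
    compass = (0, 4, 8, 12)
    if n >= 12:
        enough_bright = sum(brighter[i] for i in compass) >= 3
        enough_dark = sum(darker[i] for i in compass) >= 3
        if not (enough_bright or enough_dark):
            return False

    def has_run(flags):
        # Longest circular run of True values, found by scanning the ring twice.
        run = best = 0
        for flag in flags + flags:
            run = run + 1 if flag else 0
            best = max(best, run)
        return min(best, len(flags)) >= n

    return has_run(brighter) or has_run(darker)

# An arc of 12 bright pixels that wraps around the end of the ring.
wrapped = [150] * 6 + [100] * 4 + [150] * 6
corner = is_fast_corner(center=100, ring=wrapped, threshold=20)
```

Scanning the doubled list is a simple way to handle arcs that wrap around the start of the ring without any index arithmetic.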

Using OpenCV, detecting features is trivial, and here is the code that does it.

void featureDetection(Mat img_1, vector<Point2f>& points1)	{ 
  vector<KeyPoint> keypoints_1;
  int fast_threshold = 20;
  bool nonmaxSuppression = true;
  FAST(img_1, keypoints_1, fast_threshold, nonmaxSuppression);
  KeyPoint::convert(keypoints_1, points1, vector<int>());
}

The parameters in the code above are set such that it gives ~4000 features on one image from the KITTI dataset. You may want to tune these parameters so as to obtain the best performance on your own data. Note that the code above also converts the datatype of the detected feature points from KeyPoints to a vector of Point2f, so that we can directly pass it to the feature tracking step, described below:

Feature Tracking

The FAST corners detected in the previous step are fed to the next step, which uses a KLT tracker. The KLT tracker basically looks around every corner to be tracked, and uses this local information to find the corner in the next image. You are welcome to look into the KLT link to know more. The corners detected in \(\mathit{I}^{t}\) are tracked in \(\mathit{I}^{t+1}\). Let the set of features detected in \(\mathit{I}^{t}\) be \(\mathcal{F}^{t}\), and the set of corresponding features in \(\mathit{I}^{t+1}\) be \(\mathcal{F}^{t+1}\). Here is the function that does feature tracking in OpenCV using the KLT tracker:

void featureTracking(Mat img_1, Mat img_2, vector<Point2f>& points1, vector<Point2f>& points2, vector<uchar>& status)	{ 

//this function automatically gets rid of points for which tracking fails

  vector<float> err;					
  Size winSize=Size(21,21);																								
  TermCriteria termcrit=TermCriteria(TermCriteria::COUNT+TermCriteria::EPS, 30, 0.01);

  calcOpticalFlowPyrLK(img_1, img_2, points1, points2, status, err, winSize, 3, termcrit, 0, 0.001);

  //getting rid of points for which the KLT tracking failed or those who have gone outside the frame
  int indexCorrection = 0;
  for( int i=0; i<status.size(); i++)
     {  Point2f pt = points2.at(i- indexCorrection);
     	if ((status.at(i) == 0)||(pt.x<0)||(pt.y<0))	{
     		  if((pt.x<0)||(pt.y<0))	{
     		  	status.at(i) = 0;
     		  }
     		  points1.erase (points1.begin() + i - indexCorrection);
     		  points2.erase (points2.begin() + i - indexCorrection);
     		  indexCorrection++;
     	}

     }

}

Feature Re-Detection

Note that while doing KLT tracking, we will eventually lose some points (as they move out of the field of view of the car), and we thus trigger a redetection whenever the total number of features goes below a certain threshold (2000 in my implementation).

Essential Matrix Estimation

Once we have point-correspondences, we have several techniques for the computation of an essential matrix. The essential matrix is defined as follows: \(\begin{equation} y_{1}^{T}Ey_{2} = 0 \end{equation}\) Here, \(y_{1}\), \(y_{2}\) are homogeneous normalised image coordinates. While a simple algorithm requiring eight point correspondences exists (Longuet-Higgins, 1981), a more recent approach that is shown to give better results is the five point algorithm¹. It solves a number of non-linear equations, and requires the minimum number of points possible, since the Essential Matrix has only five degrees of freedom.
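As a small sanity check of the constraint \(y_{1}^{T}Ey_{2} = 0\), the sketch below builds an essential matrix for a pure sideways translation (no rotation, so \(E\) reduces to the cross-product matrix of \(t\)) and evaluates the constraint for one 3D point seen from both camera positions. All names and numbers are hypothetical, and the sign convention used to move the point into the second camera frame is an assumption; for a pure translation the constraint holds with either sign.

```python
def cross_matrix(t):
    """Matrix [t]_x such that [t]_x v equals the cross product t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def epipolar_residual(y1, E, y2):
    """Evaluate y1^T E y2 for homogeneous normalised image coordinates."""
    Ey2 = [sum(E[r][c] * y2[c] for c in range(3)) for r in range(3)]
    return sum(y1[r] * Ey2[r] for r in range(3))

def normalise(point):
    """Project a 3D point onto the z = 1 image plane."""
    x, y, z = point
    return [x / z, y / z, 1.0]

t = (1.0, 0.0, 0.0)        # sideways translation, no rotation
E = cross_matrix(t)        # with R = I, the essential matrix is just [t]_x
X = (2.0, 1.0, 4.0)        # a 3D point in the first camera frame
y1 = normalise(X)
y2 = normalise((X[0] - t[0], X[1] - t[1], X[2] - t[2]))  # same point, second camera
residual = epipolar_residual(y1, E, y2)

# A wrong correspondence does not satisfy the constraint.
bad_residual = epipolar_residual(y1, E, [0.25, 0.5, 1.0])
```

The residual for the true correspondence is zero, while the mismatched point gives a clearly non-zero value; this is the quantity an inlier test would threshold.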

RANSAC

If all of our point correspondences were perfect, then we would need only five feature correspondences between two successive frames to estimate motion accurately. However, the feature tracking algorithms are not perfect, and therefore we have several erroneous correspondences. A standard technique for handling outliers when doing model estimation is RANSAC. It is an iterative algorithm. At every iteration, it randomly samples five points from our set of correspondences, estimates the Essential Matrix, and then checks if the other points are inliers when using this essential matrix. The algorithm terminates after a fixed number of iterations, and the Essential matrix with which the maximum number of points agree is used.

Using the above in OpenCV is again pretty straightforward, and all you need is one line:

E = findEssentialMat(points2, points1, focal, pp, RANSAC, 0.999, 1.0, mask);
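The sample, estimate, count-inliers loop that RANSAC performs is easier to see on a smaller problem. The sketch below swaps the five-point essential matrix solver for a much simpler model, a 2D line fitted from two points, so it illustrates the structure of RANSAC rather than what findEssentialMat does internally. The function name `ransac_line`, the tolerance and the toy data are all hypothetical.

```python
import random

def ransac_line(points, iterations, tolerance, seed=0):
    """Fit y = a*x + b with RANSAC; returns the model (a, b) and its inliers."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        (x1, y1), (x2, y2) = rng.sample(points, 2)  # minimal sample for a line
        if x1 == x2:
            continue  # degenerate (vertical) sample, skip it
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Count how many points agree with this candidate model.
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= tolerance]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers

# Ten points on y = 2x + 1 plus three gross outliers.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
data += [(1.0, 9.0), (4.0, -3.0), (7.0, 30.0)]
model, inliers = ransac_line(data, iterations=50, tolerance=0.1)
```

In the visual odometry setting the minimal sample is five correspondences instead of two points, the model is the essential matrix, and the agreement test is based on the epipolar constraint rather than a vertical distance.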

Computing R, t from the Essential Matrix

Another definition of the Essential Matrix (consistent with the definition mentioned earlier) is as follows: \(\begin{equation} E = R[t]_{x} \end{equation}\) Here, \(R\) is the rotation matrix, while \([t]_{x}\) is the matrix representation of a cross product with \(t\). Taking the SVD of the essential matrix, and then exploiting the constraints on the rotation matrix, we get the following:

\[E = U\Sigma V^{T}\] \[[t]_{x} = VW\Sigma V^{T}\] \[R = UW^{-1}V^{T}\]
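The matrix \(W\) is not defined in the text above. In the standard form of this decomposition (assumed here to be what is intended), it is a rotation by 90° about the z-axis, and \(\Sigma\) is \(\mathrm{diag}(1, 1, 0)\) up to scale:

\[W = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad W^{-1} = W^{T}\]

Substituting the two factors back, and using \(V^{T}V = I\), gives a quick consistency check with \(E = R[t]_{x}\):

\[R[t]_{x} = (UW^{-1}V^{T})(VW\Sigma V^{T}) = UW^{-1}W\Sigma V^{T} = U\Sigma V^{T} = E\]

Note that swapping \(W\) for \(W^{-1}\) and flipping the sign of \(t\) also produce valid factorisations, so there are four candidate \((R, t)\) pairs in total; only one of them places the triangulated points in front of both cameras.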

Here’s the one-liner that implements it in OpenCV:

recoverPose(E, points2, points1, R, t, focal, pp, mask);

Constructing Trajectory

Let the pose of the camera be denoted by \(R_{pos}\), \(t_{pos}\). We can then track the trajectory using the following equation:

\[t_{pos} = t_{pos} + R_{pos}\,t\] \[R_{pos} = R R_{pos}\]

Note that the scale information of the translation vector \(t\) has to be obtained from some other source before concatenating. In my implementation, I extract this information from the ground truth that is supplied by the KITTI dataset.
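The accumulation step can be sketched with plain lists standing in for matrices. This is an illustration under stated assumptions rather than the code from the repository: the helper names are hypothetical, the frame-to-frame translation is rotated into the global frame with the current pose rotation before that rotation is updated, and the externally supplied scale multiplies the unit-length translation.

```python
def mat_vec(M, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]

def mat_mul(A, B):
    """Multiply two 3x3 matrices."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def update_pose(R_pos, t_pos, R, t, scale):
    """Fold one frame-to-frame motion (R, t) into the running pose."""
    moved = mat_vec(R_pos, t)  # express the step in the global frame
    t_new = [t_pos[i] + scale * moved[i] for i in range(3)]
    R_new = mat_mul(R, R_pos)
    return R_new, t_new

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_pos, t_pos = identity, [0.0, 0.0, 0.0]

# Two frames of pure forward motion along z, with scales 0.5 and 1.5.
for scale in (0.5, 1.5):
    R_pos, t_pos = update_pose(R_pos, t_pos, identity, [0.0, 0.0, 1.0], scale)
```

After the two steps the accumulated translation is 2.0 along z and the rotation is still the identity. Because every step is stacked on the previous pose, any error in a single \(R\), \(t\) or scale estimate stays in the trajectory from that point on.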

Heuristics

Most Computer Vision algorithms are not complete without a few heuristics thrown in, and Visual Odometry is not an exception. The heuristic that we use is explained below:

Dominant Motion is Forward

The entire visual odometry algorithm makes the assumption that most of the points in its environment are rigid. However, if we are in a scenario where the vehicle is at a standstill, and a bus passes by (at a road intersection, for example), it would lead the algorithm to believe that the car has moved sideways, which is physically impossible. As a result, if we ever find that the translation is dominant in a direction other than forward, we simply ignore that motion.
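One way to express this check is shown below. It is a sketch under assumptions, not the exact test used in the implementation: the function name is hypothetical, the camera z-axis is assumed to point forward, and absolute values are used for the sideways and vertical components.

```python
def is_forward_dominant(t):
    """True if the forward (z) component is the largest part of the translation."""
    tx, ty, tz = t
    return tz > abs(tx) and tz > abs(ty)

# A mostly-forward step is kept, a mostly-sideways step is ignored.
keep = is_forward_dominant((0.1, -0.05, 0.99))
skip = is_forward_dominant((0.9, 0.0, 0.4))
```

A frame pair that fails the check is simply skipped when the trajectory is updated, so the pose stays where it was.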

Results

So, how good is the performance of the algorithm on the KITTI dataset? See for yourself.

Computed Trajectory vs Ground Truth for 2000 frames

What next?

A major limitation of my implementation is that it cannot evaluate relative scale. I did try implementing some methods, but I encountered the problem which is known as “scale drift” i.e. small errors accumulate, leading to bad odometry estimates. I hope I’ll soon implement a more robust relative scale computation pipeline, and write a post about it!

David Nister, An efficient solution to the five-point relative pose problem (2004) ↩

Recognizing Human Activities with Kinect - The implementation

Tue, 02 Jun 2015 00:00:00 +0000

Disclaimer: The work described in this post was done by me and my classmate at IIT-Kanpur, Ankit Goyal. Here is a link to the presentation that we gave.

This is a follow-up to my earlier post, in which I explored temporal models that can be applied to things like part-of-speech tagging, gesture recognition, and any sequential or temporal sources of data in general. In this post, I will describe in more detail the implementation of our project that classified RGBD videos according to the activity being performed in them.

Dataset

Quite a few RGBD datasets are available for human activity detection/classification, and we chose to use the MSR Daily Activity 3D dataset. Since we had limited computational resources (the mathserver of IITK), and a limited time before the submission deadline, we chose to use a subset of the above dataset, and worked with only 6 activities. So, our problem was now reduced to 6-class classification.

Features

In any machine learning problem, your model or learning algorithm is useless without a good set of features. I read a recent paper which had a decent review of the various features used. They were:

3D silhouettes - Finding the outline of the human body, and using the shape of this outline as features.
Skeletal joints or body part tracking - Kinect comes with an algorithm to determine the pose of the body from the depth image alone. Pose here refers to the 3D coordinates of 15 body joints.
Local Spatio-temporal features - Just like some 2D/3D image feature detector, but with the added dimension of time.
Local 3D occupancy features - This one seemed the most interesting. What this does is to treat an RGBD video as a function I(x, y, z, t). Now, this is a very sparse function, and would be zero at most points in a 4D space. But, whenever a certain activity is performed, certain regions of this 4D space will become filled. Inferring from such data is now a matter of sampling it efficiently, and this is where all the innovation must lie, if this technique is to work.
3D optical flow - The 3D counterpart of the popular optical flow, it is also known as Scene Flow in the academic literature. This is one paper that makes use of these features.

The features that we ultimately went ahead with were the skeletal joints. The MSR Daily Activity 3D dataset already provides the skeletal joint coordinates to us, so all we had to do was take that data, and do some basic pre-processing on it.

Preprocessing the features.

The dataset provides us with the 3D coordinates of 15 human body joints. These coordinates are in the frame of reference of the Kinect. The first operation that we perform on them is the following: to transform the points from the Kinect reference frame to the frame of the person. By frame of the person, we refer to a frame whose origin is the joint corresponding to the torso.

The next thing that we do is what we call “body size normalization”. Basically all the body lengths, such as the distance between the elbow and hand, are scaled up or down to a standard body size. This ensures that the variation in body sizes is handled at the feature level itself, and our model does not have to worry about it anymore.
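The two preprocessing steps can be sketched as follows. This is a simplified illustration and not the MATLAB code from the project: the function names, joint names and coordinates are hypothetical, only a translation to the torso is applied (no rotation), and the bone rescaling moves just the child joint, whereas a full implementation would also carry along every joint further down the limb.

```python
import math

def to_person_frame(joints, torso="torso"):
    """Translate every joint so that the torso joint becomes the origin."""
    ox, oy, oz = joints[torso]
    return {name: (x - ox, y - oy, z - oz) for name, (x, y, z) in joints.items()}

def normalise_bone(joints, parent, child, standard_length):
    """Rescale the parent->child bone to standard_length, keeping its direction."""
    px, py, pz = joints[parent]
    cx, cy, cz = joints[child]
    length = math.sqrt((cx - px) ** 2 + (cy - py) ** 2 + (cz - pz) ** 2)
    k = standard_length / length
    fixed = dict(joints)
    fixed[child] = (px + k * (cx - px), py + k * (cy - py), pz + k * (cz - pz))
    return fixed

# Toy skeleton with three joints in the Kinect frame (hypothetical values).
kinect_joints = {
    "torso": (1.0, 1.0, 2.0),
    "elbow": (1.0, 2.0, 2.0),
    "hand": (1.0, 2.0, 5.0),
}
person = to_person_frame(kinect_joints)
person = normalise_bone(person, "elbow", "hand", standard_length=1.5)
```

After both steps the torso sits at the origin and the elbow-to-hand distance is 1.5 regardless of how long the forearm was in the raw data.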

Click here to get the MATLAB code that does the feature extraction part from skeleton files that were obtained from the MSR dataset.

Model

Now, as I discussed in my previous post, Hidden Conditional Random Fields (HCRFs) was the model that we finally selected. The original authors had released a well documented toolbox, to which we directly fed the features that were computed above.

Results

Five-fold cross-validation without any hyper-parameter tuning yielded a precision of 71%. These results do not seem impressive at first glance, but it must be noted that all our experiments were performed in the “new person” setting, i.e. the person in the test set did not appear in the training set, and we did not do any hyper-parameter tuning. Our results can be summarised in the following heatmap:

Where the algorithm succeeds and fails

The above figure made one thing clear: that accuracy is being seriously harmed by the algorithm’s inability to correctly distinguish between drinking and talking on the phone. The reason for this is relatively simple. The features that we are using are skeletal features, and therefore we do not pay any attention to what objects the human is interacting with. If you look at the skeletal stream, talking on the phone and drinking water seem extremely similar! In both cases, the human raises a hand, and brings it near his head. Thus, in order to make a truly useful activity detection system, it is important to model these interactions explicitly.

If we do get around to improving this model, I will post it here.

Recognizing Human Activities with Kinect - Choosing a temporal model

Wed, 27 May 2015 00:00:00 +0000

Update: I have posted the sequel to this post here

In this blog post, I will very briefly talk about some popular models used for temporal/sequence classification, their advantages/disadvantages, which one I used for my human activity recognition project, and why. This post is intended for people who would like to delve into sequence classification, but don’t know where to start. I plan to follow up on this post with another post that explains in detail our implementation of recognizing human activities from RGBD data. However, if you want to have a look at it now, here are the slides.

In one of my graduate-level courses, Machine Learning for Computer Vision, we were asked to select a research paper to review and present. We selected the paper Unstructured Human Activity Detection from RGBD Images. Our reasons for this selection were several: it was fairly recent (2012), had a large number of citations (according to google scholar, at least), and it dealt with sequential data (RGBD videos). Temporal models, or sequence classification, was something that was not covered in our course, and so we were eager to explore this area of Machine Learning. We read the paper, made a poster out of it, and presented it to our peers, TAs and the professor.

The next part of the course was more interesting, and it involved us picking up a Machine Learning problem, and we then had the option of either implementing an existing approach to the problem, or we could come up with our own approach to solve it. We could have implemented the paper that we reviewed, but it seemed more interesting to have a look at the models available for sequence classification, and then use one such model for our problem.

So we started looking around, and found that the following three models (and their variations) seem to be the most popular:

Hidden Markov Models (HMMs)
Maximum Entropy Markov Models (MEMMs)
Conditional Random Fields (CRFs)

Here’s the very basic intuition about temporal models: Suppose you are reading some text character by character. The first character that you observe is an “i”. Now, what do you think are the chances of you observing another “i”? Pretty slim, right? This is because consecutive “i”s are pretty rare in English text. Modeling such probabilistic relationships in a mathematical form is precisely why we use temporal models, instead of just using some regular classifier (such as logistic regression). There are two more popular models for sequence classification (or structured prediction, as some people like to call it): 1) Structural SVMs, 2) Recurrent Neural Nets. I won’t talk about either of them, as I have not used them, but you are welcome to check them out.
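To make the “i” followed by “i” intuition concrete, here is a minimal MATLAB sketch that estimates such transition probabilities by counting character pairs. The sample string is a made-up toy example (chosen so that one “ii” actually occurs), not data from the project.

```matlab
% Toy sample text (an assumption for illustration only)
sample = 'it is his skiing trip';

cur = sample(1:end-1);   % every character that has a successor
nxt = sample(2:end);     % the character that follows it

isI = (cur == 'i');      % positions where the current character is 'i'

% Empirical transition probabilities P(next | current = 'i')
pII = sum(isI & (nxt == 'i')) / sum(isI);   % 'i' followed by 'i'
pIS = sum(isI & (nxt == 's')) / sum(isI);   % 'i' followed by 's'

fprintf('P(i|i) = %.3f, P(s|i) = %.3f\n', pII, pIS);
```

A temporal model stores exactly this kind of table (for all pairs of states) and uses it when scoring a whole sequence, which a per-character classifier cannot do.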

Hidden Markov Models are the oldest, and have been used in things like speech-to-text since the 1970s. MEMMs came around in 2000, only to be followed (and overshadowed) by Conditional Random Fields a year later. Both MEMMs and CRFs came from Andrew McCallum’s research group, and were focused on Natural Language Processing tasks. However, once you have extracted features from sequential data, you can use these models as long as your features satisfy the assumptions made by these models. Note that all of these models are special cases of probabilistic graphical models, so all the inference and learning algorithms from there can directly be applied here.

Hidden Markov Models

Graphical Model Representation of a stack of HMMs

As I mentioned earlier, Hidden Markov Models have been around for a long time, and were heavily used by the speech processing community. I won’t go much into the details/code of HMMs, as there are a large number of resources that describe the topic, targeted both at beginners and those who want to go into all the details. HMMs are generative models, and efficient dynamic programming algorithms are available for both training and inference. The model uses hidden states, and assumes that the observations are independent of each other, given their hidden states. A common way to go about doing classification with HMMs is the following: train an HMM for every class, and then for every new example, find the probability of that example being generated by each HMM. The HMM that gives the maximum probability determines your final class.
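The "one HMM per class" recipe can be sketched in a few lines of plain MATLAB. This is a minimal illustration, not the project code: the two models, their parameters and the two test sequences are all made up, and the models are given directly instead of being trained. The likelihood of a sequence under each model is computed with the scaled forward algorithm.

```matlab
% Two hand-written toy HMMs (all numbers are assumptions for illustration).
% Model 1 likes to stay in a state, model 2 likes to switch states.
models(1).prior = [0.5 0.5];
models(1).A = [0.9 0.1; 0.1 0.9];     % A(i,j) = P(state j | state i)
models(1).B = [0.9 0.1; 0.2 0.8];     % B(i,k) = P(symbol k | state i)
models(2).prior = [0.5 0.5];
models(2).A = [0.1 0.9; 0.9 0.1];
models(2).B = [0.9 0.1; 0.1 0.9];

sequences = {[1 1 1 1 2 2 2 2], [1 2 1 2 1 2 1 2]};
logLik = zeros(numel(models), numel(sequences));

for m = 1:numel(models)
    for s = 1:numel(sequences)
        obs = sequences{s};
        % forward variable for the first observation
        alpha = models(m).prior .* models(m).B(:, obs(1))';
        ll = 0;
        for k = 1:numel(obs)
            if k > 1
                % propagate through the transition matrix, then weight by emission
                alpha = (alpha * models(m).A) .* models(m).B(:, obs(k))';
            end
            c = sum(alpha);        % scaling factor avoids numerical underflow
            alpha = alpha / c;
            ll = ll + log(c);      % log-likelihood is the sum of log scaling factors
        end
        logLik(m, s) = ll;
    end
end

% For every sequence, pick the model with the highest log-likelihood
[~, predictedClass] = max(logLik, [], 1);
disp(predictedClass);
```

The slowly changing sequence ends up with the "sticky" model and the alternating one with the "switching" model.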

However, with HMMs come a number of disadvantages, with the major ones being:

Requires enumeration of all possible observation sequences.
Requires the observations to be independent of each other (given the hidden state).
Generative approach for solving a conditional problem leading to unnecessary computations.

Maximum Entropy Markov Models

So, let’s move on to a new model which, in theory, solves all of the above problems: MEMMs. MEMMs were introduced in 2000 for NLP tasks, and showed improvements in tasks where assumption [2] mentioned above did not hold. MEMMs are discriminative models, so they also do away with problems [1] and [3]. There’s also a hierarchical version of the same model, and a Hierarchical MEMM is what was used in the paper that we reviewed. The paper contains an interesting way of selecting graph structure, and I recommend checking it out.

Graphical Representation of an MEMM. Note how the direction of arrow from observation to hidden state has been reversed.

But MEMMs come with their own problem, commonly called the label-bias problem.

Label bias problem

States with low-entropy transition distributions “effectively ignore” their observations. States with fewer outgoing transitions have an “unfair advantage”.
Since training is always done with respect to known previous tags, the model struggles at test time when there is uncertainty in the previous tag.

It is impossible to understand the above without some background on what MEMMs are, so it is advisable to first look at how MEMMs work, and then at the original CRF paper which talks about the label bias problem.
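A tiny numerical sketch may help with the first point. An MEMM normalizes the transition scores separately for every current state, so a state with a single successor passes on all of its probability mass no matter what the observation says. The scores below are invented purely for illustration.

```matlab
% Rows = possible next states, columns = two different observations.
% State A has only one successor, state C has two (all scores are made up).
scoresFromA = [ 5.0 -5.0];            % the observation strongly likes / dislikes the successor
scoresFromC = [ 5.0 -5.0;             % successor D
               -5.0  5.0];            % successor E

% Per-state normalization (softmax over the successors, column by column)
softmaxCols = @(S) exp(S) ./ repmat(sum(exp(S), 1), size(S, 1), 1);

pFromA = softmaxCols(scoresFromA);    % always 1: the observation is ignored
pFromC = softmaxCols(scoresFromC);    % depends heavily on the observation

disp(pFromA);
disp(pFromC);
```

CRFs avoid this by normalizing once over whole label sequences instead of once per state.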

Conditional Random Fields -> Star of the show

Graphical Representation of a CRF. Note that this is an undirected graphical model, as opposed to HMM/MEMM

To overcome the label-bias problem of MEMMs, CRFs were introduced a year later, and demonstrated superior or equivalent performance in almost every NLP task that the authors tested them on. CRFs (and their variants) are considered state-of-the-art in a number of machine learning problems, especially in Computer Vision. They are used not only for temporal modeling, but can also model more complicated relationships in high-dimensional data, and some applications include image segmentation and depth estimation from monocular images. Understanding CRFs is a little more challenging than HMMs or MEMMs, so I will list a few resources for you to get started with. For beginners, the best resource is this short course by Charles Elkan. It also has accompanying course notes, and if you go to his academic website, you can also find some programming assignments to implement CRFs. Here is a more comprehensive list of resources related to CRFs, and it’s pretty thorough.

Now, in 2006, there was an extension to CRFs from MIT CSAIL, called hidden CRFs (hCRFs). Here is the original paper. What this does, in essence, is introduce another layer of hidden states, and it is designed to assign a single label to every sequence. This is different from MEMMs and CRFs, which assign a label to every observation in a sequence, and different from HMMs too (wherein a stack of HMMs is trained for classification).

Graphical Representation of an hCRF. Note the extra hidden layer.

The original hCRF paper applied it to gesture recognition from RGB videos, and demonstrated superior performance to CRFs in classifying gestures, so we zeroed in on this model for our Human Activity Classification task (note that activities are not exactly the same as gestures).

The real icing on the cake was this: MIT CSAIL had released a well-documented toolbox, making it ridiculously easy for us to use this model on whichever dataset we wanted, and the only major programming part left to us was the feature extraction stage.

In a future blog post, I will describe in detail the implementation of our project: the dataset, the features we used, and what results we got.

Visual Odometry from scratch - A tutorial for beginners

Mon, 25 May 2015 00:00:00 +0000

I made a post regarding Visual Odometry several months ago, but never followed it up with a post on the actual work that I did. I am hoping that this blog post will serve as a starting point for beginners looking to implement a Visual Odometry system for their robots. I will basically present the algorithm described in the paper Real-Time Stereo Visual Odometry for Autonomous Ground Vehicles (Howard2008), with some of my own changes. It’s a somewhat old paper, but very easy to understand, which is why I used it for my very first implementation. The MATLAB source code for the same is available on GitHub.

What is odometry?

Have you seen that little gadget on a car’s dashboard that tells you how much distance the car has travelled? It’s called an odometer. It (probably) measures the number of rotations that the wheel is undergoing, and multiplies that by the circumference to get an estimate of the distance travelled by the car. Odometry in Robotics is a more general term, and often refers to estimating not only the distance traveled, but the entire trajectory of a moving robot. So for every time instance \(t\), there is a vector \([ x^{t} y^{t} z^{t} \alpha^{t} \beta^{t} \gamma^{t}]\) which describes the complete pose of the robot at that instance. Note that \(\alpha^{t}, \beta^{t}, \gamma^{t}\) here are the Euler angles, while \(x^{t}, y^{t} ,z^{t}\) are the Cartesian coordinates of the robot.

What’s visual odometry?

There is more than one way to determine the trajectory of a moving robot, but the one that we will focus on in this blog post is called Visual Odometry. In this approach we have a camera (or an array of cameras) rigidly attached to a moving object (such as a car or a robot), and our job is to construct a 6-DOF trajectory using the video stream coming from the camera(s). When we are using just one camera, it’s called Monocular Visual Odometry. When we’re using two (or more) cameras, it’s referred to as Stereo Visual Odometry.

Why stereo, or why monocular?

There are certain advantages and disadvantages associated with both the stereo and the monocular scheme of things, and I’ll briefly describe some of the main ones here. (Note that this blog post will only concentrate on stereo as of now, but I might document and post my monocular implementation also). The advantage of stereo is that you can estimate the exact trajectory, while in monocular you can only estimate the trajectory up to a scale factor. So, in monocular VO, you can only say that you moved one unit in x, two units in y, and so on, while in stereo, you can say that you moved one meter in x, two meters in y, and so on. Also, stereo VO is usually much more robust (due to more data being available). But, in cases where the distance of the objects from the camera is too large (as compared to the distance between the two cameras of the stereo system), the stereo case degenerates to the monocular case. So, let’s say you have a very small robot (like the robobees), then it’s useless to have a stereo system, and you would be much better off with a monocular VO algorithm like SVO. Also, there’s a general trend of drones becoming smaller and smaller, so groups like those of Davide Scaramuzza are now focusing more on monocular VO approaches (or so he said in a talk that I happened to attend).

Enough English, let’s talk math now
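Here is a quick numerical sketch of why stereo degenerates when the scene is far away compared to the baseline. For a rectified pair, depth and disparity are related by Z = f·B/d, so a fixed one-pixel disparity error costs more and more depth accuracy as Z grows. The focal length and baseline below are assumed, roughly KITTI-like values, not numbers taken from this post.

```matlab
% Assumed, illustrative calibration values
f = 720;      % focal length in pixels
B = 0.54;     % baseline (distance between the two cameras) in meters

Z  = [5 20 80 320];        % scene depths in meters
d  = f * B ./ Z;           % disparity in pixels at each depth
dZ = Z.^2 / (f * B);       % approximate depth change caused by a 1 px disparity error

% columns: depth (m), disparity (px), depth error per pixel (m)
disp([Z; d; dZ]');
```

At 320 m the disparity is barely above one pixel, so the second camera adds almost no metric information.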

Formulation of the problem

Input

We have a stream of (grayscale/color) images coming from a pair of cameras. Let the left and right frames, captured at time t and t+1 be referred to as \(\mathit{I}_l^t\), \(\mathit{I}_r^t\), \(\mathit{I}_l^{t+1}\), \(\mathit{I}_r^{t+1}\). We have prior knowledge of all the intrinsic as well as extrinsic calibration parameters of the stereo rig, obtained via any one of the numerous stereo calibration algorithms available.

Output

For every pair of stereo images, we need to find the rotation matrix \(R\) and the translation vector \(t\), which describe the motion of the vehicle between the two frames.
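Once you have \(R\) and \(t\) for every pair of frames, the full trajectory is obtained by chaining them. The sketch below assumes that \((R, t)\) expresses the pose of the camera at time \(t+1\) in the camera frame at time \(t\), with z pointing forward; with the opposite convention you would chain the inverse transforms instead. The four identical steps (1 m forward, then a 90 degree turn) are a made-up example that should bring the camera back to where it started.

```matlab
theta = pi/2;                                   % 90 degree turn about the y (up/down) axis
Rstep = [ cos(theta) 0 sin(theta);
          0          1 0;
         -sin(theta) 0 cos(theta)];
tstep = [0; 0; 1];                              % 1 m forward in the current camera frame

pos  = zeros(3, 1);                             % global position
Rpos = eye(3);                                  % global orientation
trajectory = zeros(3, 4);

for k = 1:4
    pos  = pos + Rpos * tstep;                  % move along the current heading
    Rpos = Rpos * Rstep;                        % then update the heading
    trajectory(:, k) = pos;
end

disp(trajectory);
```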

The algorithm

An outline:

Capture images: \(\mathit{I}_l^t\), \(\mathit{I}_r^t\), \(\mathit{I}_l^{t+1}\), \(\mathit{I}_r^{t+1}\)
Undistort, Rectify the above images.
Compute the disparity map \(\mathit{D}^t\) from \(\mathit{I}_l^t\), \(\mathit{I}_r^t\) and the map \(\mathit{D}^{t+1}\) from \(\mathit{I}_l^{t+1}\), \(\mathit{I}_r^{t+1}\).
Use FAST algorithm to detect features in \(\mathit{I}_l^t\), \(\mathit{I}_l^{t+1}\) and match them.
Use the disparity maps \(\mathit{D}^t\), \(\mathit{D}^{t+1}\) to calculate the 3D positions of the features detected in the previous steps. Two point clouds \(\mathcal{W}^{t}\), \(\mathcal{W}^{t+1}\) will be obtained.
Select a subset of points from the above point cloud such that all the matches are mutually compatible.
Estimate \(R, t\) from the inliers that were detected in the previous step.

Do not worry if you do not understand some of the terminologies like disparity maps or FAST features that you see above. Most of them will be explained in greater detail in the text to follow, along with the code to use them in MATLAB.

Undistortion, Rectification

Before computing the disparity maps, we must perform a number of preprocessing steps.

Undistortion: This step compensates for lens distortion. It is performed with the help of the distortion parameters that were obtained during calibration.

Rectification: This step is performed so as to ease up the problem of disparity map computation. After this step, all the epipolar lines become parallel to the horizontal, and the disparity computation step needs to perform its search for matching blocks only in one direction.

Stereo images overlaid from KITTI dataset, notice the feature matches are along parallel (horizontal) lines

Both of these operations are implemented in MATLAB, and since the KITTI Visual Odometry dataset that I used in my implementation already has these operations applied to its images, you won’t find the code for them in my implementation. You can see how to use these functions here and here. Note that you need the Computer Vision Toolbox, and MATLAB R2014a or newer for these functions.

Disparity Map Computation

Given a pair of images from a stereo camera, we can compute a disparity map. Suppose a particular 3D point \(F\) in the physical world is located at the position \((x,y)\) in the left image, and the same feature is located at \((x-d,y)\) in the right image; then the location \((x,y)\) on the disparity map holds the value \(d\). Note that the y-coordinates are the same since the images have been rectified. Thus, we can define disparity at each point in the image plane as: \(\begin{equation} d = x_{l} - x_{r} \end{equation}\)

A disparity map computed on frames from KITTI VO dataset

Block-Matching Algorithm

Disparity at each point is computed using a sliding window. For every pixel in the left image, a 15x15 pixel window is generated around it, and the values of all the pixels in the window are stored. This window is then constructed at the same coordinate in the right image, and is slid horizontally until the Sum-of-Absolute-Differences (SAD) is minimized. The algorithm used in our implementation is an advanced version of this block-matching technique, called the Semi-Global Block Matching algorithm. A function directly implements this algorithm in MATLAB:

disparityMap1 = disparity(I1_l,I1_r, 'DistanceThreshold', 5);
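If you want to see what the basic sliding-window search does without the toolbox, here is a minimal sketch of plain SAD block matching for a single pixel. It is not the Semi-Global algorithm used by the function above. The image pair is synthetic: the "right" image is just the "left" image shifted by a known disparity, and the window size, search range and pixel location are arbitrary choices for the demo.

```matlab
H = 40; W = 60;
trueDisparity = 4;        % known shift used to build the synthetic pair
half = 3;                 % 7x7 window here (the post uses 15x15)
maxD = 8;                 % largest disparity that is searched

imgL = rand(H, W);                                  % random texture as the left image
imgR = zeros(H, W);
imgR(:, 1:W-trueDisparity) = imgL(:, 1+trueDisparity:W);   % x_r = x_l - d

r = 20; c = 30;                                     % pixel whose disparity we want
rows = r-half : r+half;
winL = imgL(rows, c-half : c+half);

sad = zeros(1, maxD + 1);
for d = 0:maxD
    % candidate window in the right image, shifted d pixels to the left
    winR = imgR(rows, (c-d)-half : (c-d)+half);
    sad(d + 1) = sum(abs(winL(:) - winR(:)));
end

[~, idx] = min(sad);
disparityAtPixel = idx - 1;
fprintf('estimated disparity: %d px\n', disparityAtPixel);
```

Repeating this for every pixel gives a (slow and noisy) disparity map; the semi-global variant adds smoothness constraints on top of the same matching cost idea.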

Feature Detection

My approach uses the FAST corner detector. I’ll now explain in brief how the detector works, though you should have a look at the original paper and source code if you want to really understand it. Suppose there is a point \(\mathbf{P}\) which we want to test for being a corner. We draw a circle with a circumference of 16 pixels around this point, as shown in the figure below. We then check whether there exists a contiguous set of pixels on this circle whose intensities all exceed the intensity of \(\mathbf{P}\) by a certain threshold \(\mathbf{I}\), or a contiguous set whose intensities are all lower than that of \(\mathbf{P}\) by at least the same threshold \(\mathbf{I}\). If yes, then we mark this point as a corner. A heuristic is used to reject the vast majority of non-corners quickly: the pixels at positions 1, 9, 5 and 13 are examined first, and at least three of them must be brighter than \(\mathbf{P}\) by at least \(\mathbf{I}\), or darker by at least the same amount, for the point to be a corner. This particular approach is selected due to its computational efficiency as compared to other popular interest point detectors such as SIFT.

Image from the original FAST feature detection paper
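The segment test itself is easy to sketch in plain MATLAB. The snippet below is only an illustration on a synthetic image (a bright square on a dark background), and it checks three hand-picked points: the corner of the square, a point on its edge, and a point in its interior. The required run length n = 9 and the threshold are my assumptions for the demo; the quick rejection test on pixels 1, 5, 9 and 13 is left out.

```matlab
% Offsets of the 16 pixels on a circle of radius 3 (pixel 1 is at the top, going clockwise)
dx = [ 0  1  2  3 3 3 2 1 0 -1 -2 -3 -3 -3 -2 -1];
dy = [-3 -3 -2 -1 0 1 2 3 3  3  2  1  0 -1 -2 -3];

img = zeros(21, 21);
img(8:21, 8:21) = 1;          % bright square, its top-left corner is at (row 8, col 8)

thresh = 0.2;                 % intensity threshold (assumed)
n = 9;                        % required length of the contiguous arc (assumed)

candidates = [8 8; 8 14; 14 14];        % [row col]: corner, edge, interior
isCorner = false(size(candidates, 1), 1);

for k = 1:size(candidates, 1)
    cy = candidates(k, 1); cx = candidates(k, 2);
    ring = zeros(1, 16);
    for p = 1:16
        ring(p) = img(cy + dy(p), cx + dx(p));
    end
    flagsList = {ring > img(cy, cx) + thresh, ring < img(cy, cx) - thresh};
    for q = 1:2
        wrapped = [flagsList{q}, flagsList{q}];   % duplicate so runs can wrap around
        run = 0; longest = 0;
        for p = 1:numel(wrapped)
            if wrapped(p)
                run = run + 1;
            else
                run = 0;
            end
            longest = max(longest, run);
        end
        if min(longest, 16) >= n
            isCorner(k) = true;
        end
    end
end

disp(isCorner');
```

Only the first candidate passes: at the corner, 11 of the 16 ring pixels form one contiguous darker arc, while the edge point only gets 7.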

Another thing that we do in this approach is something that is called “bucketing”. If we just run a feature detector over an entire image, there is a very good chance that most of the features would be concentrated in certain rich regions of the image, while certain other regions would not have any representation. This is not good for our algorithm, since it relies on the assumption of a static scene, and to find the “true” static scene, we must look at all of the image, instead of just certain regions of it. In order to tackle this issue, we divide the image into a grid (with cells of roughly 100x100px), and extract at most 20 features from each cell, thus maintaining a more uniform distribution of features.

In the code, you will find the following line:

points1_l = bucketFeatures(I1_l, h, b, h_break, b_break, numCorners);

This line calls the following function:

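function points = bucketFeatures(I, h, b, h_break, b_break, numCorners)
% input image I should be grayscale
% h, b: image height and width; h_break, b_break: number of grid rows and columns
% numCorners: maximum number of corners kept per grid cell

% top-left corner of every grid cell
y = floor(linspace(1, h - h/h_break, h_break));
x = floor(linspace(1, b - b/b_break, b_break));

final_points = [];
for i=1:length(y)
    for j=1:length(x)
        % region of interest for this cell: [x y width height]
        roi = [x(j), y(i), floor(b/b_break), floor(h/h_break)];
        corners = detectFASTFeatures(I, 'MinQuality', 0.00, 'MinContrast', 0.1, 'ROI', roi);
        % keep only the strongest corners in this cell
        corners = corners.selectStrongest(numCorners);
        final_points = vertcat(final_points, corners.Location);
    end
end
points = cornerPoints(final_points);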

As you can see, the image is divided into grids, and the strongest corners from each grid are selected for the subsequent steps.

Feature Description and Matching

The FAST corners detected in the previous step are fed to the next step, which uses a KLT tracker. The KLT tracker basically looks around every corner to be tracked, and uses this local information to find the corner in the next image. You are welcome to look into the KLT link to know more. The corners detected in \(\mathit{I}_{l}^{t}\) are tracked in \(\mathit{I}_{l}^{t+1}\). Let the set of features detected in \(\mathit{I}_{l}^{t}\) be \(\mathcal{F}^{t}\), and the set of corresponding features in \(\mathit{I}_{l}^{t+1}\) be \(\mathcal{F}^{t+1}\).

In MATLAB, this is again super-easy to do, and the following three lines initialize the tracker, and run it once.

tracker = vision.PointTracker('MaxBidirectionalError', 1);
initialize(tracker, points1_l.Location, I1_l);
[points2_l, validity] = step(tracker, I2_l);

Note that in my current implementation, I am just tracking the points from one frame to the next, and then again doing the detection part, but in a better implementation, one would track these points as long as the number of points does not drop below a particular threshold.
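To give a feel for what "uses this local information" means, here is a minimal sketch of a single Lucas-Kanade step, the building block behind KLT trackers. It is not what vision.PointTracker does internally in full (a real tracker iterates and uses image pyramids). The patch is a synthetic Gaussian blob and the sub-pixel shift is a made-up value.

```matlab
[xg, yg] = meshgrid(-20:20, -20:20);
sigma = 5;
shift = [0.3, -0.2];                   % true motion (dx, dy) in pixels, assumed for the demo

% Smooth synthetic patch centered at (cx, cy)
blob = @(cx, cy) exp(-((xg - cx).^2 + (yg - cy).^2) / (2 * sigma^2));
I1 = blob(0, 0);                       % patch in the first frame
I2 = blob(shift(1), shift(2));         % same patch after the motion

[Ix, Iy] = gradient(I1);               % spatial derivatives
It = I2 - I1;                          % temporal derivative

% Normal equations of the brightness-constancy constraint Ix*u + Iy*v = -It
G = [sum(Ix(:).^2),      sum(Ix(:).*Iy(:));
     sum(Ix(:).*Iy(:)),  sum(Iy(:).^2)];
b = -[sum(Ix(:).*It(:)); sum(Iy(:).*It(:))];

flow = G \ b;                          % estimated (dx; dy)
fprintf('estimated motion: (%.3f, %.3f)\n', flow(1), flow(2));
```

The matrix G is well conditioned only when the patch has gradients in both directions, which is exactly why corners are the points that get tracked.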

Triangulation of 3D PointCloud

The real world 3D coordinates of all the points in \(\mathcal{F}^{t}\) and \(\mathcal{F}^{t+1}\) are computed with respect to the left camera using the disparity value corresponding to these features from the disparity map, and the known projection matrices of the two cameras \(\mathbf{P}_{1}\) and \(\mathbf{P}_{2}\). We first form the reprojection matrix \(\mathbf{Q}\), using data from \(\mathbf{P}_{1}\) and \(\mathbf{P}_{2}\):

\[Q= \left[ {\begin{array}{cccc} 1 & 0 & 0 & -c_{x} \\ 0 & 1 & 0 & -c_{y} \\ 0 & 0 & 0 & -f \\ 0 & 0 & -1/T_{x} & 0 \\ \end{array} } \right]\]

\(c_{x}\) = x-coordinate of the optical center of the left camera (in pixels)
\(c_{y}\) = y-coordinate of the optical center of the left camera (in pixels)
\(f\) = focal length of the first camera
\(T_{x}\) = The x-coordinate of the right camera with respect to the first camera (in meters)

We use the following relation to obtain the 3D coordinates of every feature in \(\mathcal{F}^{t}\) and \(\mathcal{F}^{t+1}\):

\[\begin{equation} \left[ \begin{array}{c} X \\ Y \\ Z \\ W\end{array} \right] = \mathbf{Q} \times \left[ \begin{array}{c} x \\ y \\ d \\ 1\end{array} \right] \end{equation}\]

The result is a homogeneous vector, so the Euclidean 3D coordinates of the feature are \((X/W, Y/W, Z/W)\). Let the point clouds obtained from this step be referred to as \(\mathcal{W}^{t}\) and \(\mathcal{W}^{t+1}\). To have a better understanding of the geometry that goes on in the above equations, you can have a look at the Bible of visual geometry i.e. Hartley and Zisserman’s Multiple View Geometry.
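As a sanity check on the numbers, the same triangulation can be written without the matrix, using the standard rectified-stereo relations Z = f·B/d, X = (x − c_x)·Z/f and Y = (y − c_y)·Z/f. Sign conventions for \(\mathbf{Q}\) and \(T_{x}\) vary between libraries, so this sketch simply assumes a positive baseline B; the calibration values and the two feature locations are made up.

```matlab
% Assumed, illustrative calibration of the left camera
f  = 720;            % focal length in pixels
cx = 600; cy = 180;  % optical center in pixels
B  = 0.54;           % baseline in meters (positive)

pts = [650 200;      % feature locations (x, y) in the left image
       400 150];
d   = [19.44; 4.86]; % disparity of each feature in pixels

Z = f * B ./ d;                      % depth along the optical axis
X = (pts(:, 1) - cx) .* Z / f;       % lateral position
Y = (pts(:, 2) - cy) .* Z / f;       % vertical position

pointCloud = [X Y Z];
disp(pointCloud);
```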

The Inlier Detection Step

This algorithm differs from most other visual odometry algorithms in the sense that it does not have an outlier detection step, but it has an inlier detection step. We assume that the scene is rigid, and hence it must not change between the time instances \(t\) and \(t+1\). As a result, the distance between any two features in the point cloud \(\mathcal{W}^{t}\) must be the same as the distance between the corresponding points in \(\mathcal{W}^{t+1}\). If any such distance is not the same, then either there is an error in the 3D triangulation of at least one of the two features, or we have triangulated a feature on a moving object, which we cannot use in the next step. In order to have the maximum set of consistent matches, we form the consistency matrix \(\mathbf{M}\) such that:

\[\begin{equation} \mathbf{M}_{i,j} = \begin{cases} 1, & \mbox{if the distance between i and j points is same in both the point clouds} \\ 0, & \mbox{otherwise} \end{cases} \end{equation}\]
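In practice "same distance" means "same up to a small tolerance". Here is a minimal sketch of building \(\mathbf{M}\) from two tiny point clouds. The points, the rigid motion and the tolerance are all invented for the demo; the fourth point is moved independently to play the role of a feature on a moving object.

```matlab
W1 = [0 0 10;
      1 0 10;
      0 1 12;
      2 2 15];                           % point cloud at time t (one point per row)

theta = 0.1;                             % small rotation about the y axis
Rot = [ cos(theta) 0 sin(theta);
        0          1 0;
       -sin(theta) 0 cos(theta)];
W2 = (Rot * W1')' + repmat([0.1 0 -0.5], 4, 1);   % rigidly moved cloud at time t+1
W2(4, :) = W2(4, :) + [0.5 0 0.8];       % point 4 moves on its own

tol = 0.01;                              % allowed change in distance, in meters (assumed)
n = size(W1, 1);
M = zeros(n);
for i = 1:n
    for j = 1:n
        d1 = norm(W1(i, :) - W1(j, :));  % distance between i and j at time t
        d2 = norm(W2(i, :) - W2(j, :));  % distance between the same pair at time t+1
        M(i, j) = abs(d1 - d2) < tol;
    end
end

disp(M);
```

The first three points are mutually consistent, while the independently moving point is only consistent with itself.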

From the original point clouds, we now wish to select the largest subset such that all the points in this subset are consistent with each other (every element in the reduced consistency matrix is 1). This problem is equivalent to the Maximum Clique Problem, with \(\mathbf{M}\) as an adjacency matrix. A clique is basically a subset of a graph that only contains nodes that are all connected to each other. An easy way to visualise this is to think of a graph as a social network, and then trying to find the largest group of people who all know each other.

This is what a clique looks like.

This problem is known to be NP-complete, so no efficient algorithm is known for finding the optimal solution on problems of practical size. We therefore employ a greedy heuristic that gives us a clique which is usually close to the optimal solution:

Select the node with the maximum degree, and initialize the clique to contain this node.
From the existing clique, determine the subset of nodes \(\mathit{v}\) which are connected to all the nodes present in the clique.
From the set \(\mathit{v}\), select a node which is connected to the maximum number of other nodes in \(\mathit{v}\). Repeat from step 2 till no more nodes can be added to the clique.

The above algorithm is implemented in the following two functions in my code:

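function cl = updateClique(potentialNodes, clique, M)
% potentialNodes: logical vector, 1 for nodes connected to every node in the clique
% M: consistency (adjacency) matrix

maxNumMatches = 0;
curr_max = 0;
for i = 1:length(potentialNodes)
    if(potentialNodes(i)==1)
        % count how many potential nodes node i is connected to
        numMatches = 0;
        for j = 1:length(potentialNodes)
            if (potentialNodes(j) & M(i,j))
                numMatches = numMatches + 1;
            end
        end
        % remember the best-connected potential node seen so far
        if (numMatches>=maxNumMatches)
            curr_max = i;
            maxNumMatches = numMatches;
        end
    end
end

% grow the clique by the best-connected node, if there is one
if (maxNumMatches~=0)
    clique(length(clique)+1) = curr_max;
end

cl = clique;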


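function newSet = findPotentialNodes(clique, M)
% returns a logical vector marking the nodes connected to every node in the clique

newSet = M(:,clique(1));
if (numel(clique)>1)
    for i=2:length(clique)
        newSet = newSet & M(:,clique(i));
    end
end

% nodes already in the clique are not candidates
for i=1:length(clique)
    newSet(clique(i)) = 0;
end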
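If you want to play with the heuristic in isolation, here is a compact, self-contained sketch of the same three steps on a hand-made graph. It is a vectorized illustration rather than a copy of the two functions above, and it assumes an adjacency matrix with a zero diagonal. The graph is made up: nodes 1 to 4 form a clique, node 5 knows only 1, 2 and 6.

```matlab
A = zeros(6);
edges = [1 2; 1 3; 1 4; 2 3; 2 4; 3 4; 1 5; 2 5; 5 6];
for e = 1:size(edges, 1)
    A(edges(e, 1), edges(e, 2)) = 1;
    A(edges(e, 2), edges(e, 1)) = 1;
end

% Step 1: start from the node with the maximum degree
[~, start] = max(sum(A, 2));
clique = start;

% Step 2: candidates are the nodes connected to every node in the clique
candidates = A(:, start) > 0;

while any(candidates)
    % Step 3: pick the candidate connected to the most other candidates
    score = sum(A(:, candidates), 2);
    score(~candidates) = -1;
    [~, best] = max(score);
    clique(end + 1) = best;
    % keep only the candidates that are also connected to the new member
    candidates = candidates & (A(:, best) > 0);
end

disp(sort(clique));
```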

Computation of \(\mathbf{R}\) and \(\mathbf{t}\)

In order to determine the rotation matrix \(\mathbf{R}\) and translation vector \(\mathbf{t}\), we use Levenberg-Marquardt non-linear least squares minimization to minimize the following sum:

\[\begin{equation} \epsilon = \sum_{\mathcal{F}^{t}, \mathcal{F}^{t+1}} (\mathbf{j_{t}} - \mathbf{P}\mathbf{T}\mathbf{w_{t+1}})^{2} + (\mathbf{j_{t+1}} - \mathbf{P}\mathbf{T^{-1}}\mathbf{w_{t}})^{2} \end{equation}\]

\(\mathcal{F}^{t}, \mathcal{F}^{t+1}\): Features in the left image at time \(t\) and \(t+1\)
\(\mathbf{j_{t}}, \mathbf{j_{t+1}}\): 2D Homogeneous coordinates of the features \(\mathcal{F}^{t}, \mathcal{F}^{t+1}\)
\(\mathbf{w_{t}}, \mathbf{w_{t+1}}\): 3D Homogeneous coordinates of the features \(\mathcal{F}^{t}, \mathcal{F}^{t+1}\)
\(\mathbf{P}\): \(3\times4\) Projection matrix of left camera
\(\mathbf{T}\): \(4\times4\) Homogeneous Transformation matrix

The Optimization Toolbox in MATLAB directly implements the Levenberg-Marquardt algorithm in the function lsqnonlin, which needs to be supplied with a vector objective function that needs to be minimized, and a set of parameters that can be varied.

This is how the function to be minimized is represented in MATLAB. This part of the algorithm is the most computationally expensive one.

function F = minimize(PAR, F1, F2, W1, W2, P1)
r = PAR(1:3);
t = PAR(4:6);
%F1, F2 -> 2d coordinates of features in I1_l, I2_l
%W1, W2 -> 3d coordinates of the features that have been triangulated
%P1 -> Projection matrix of the left camera
%r, t -> 3x1 vectors, need to be varied for the minimization
F = zeros(2*size(F1,1), 3);
reproj1 = zeros(size(F1,1), 3);
reproj2 = zeros(size(F1,1), 3);

dcm = angle2dcm( r(1), r(2), r(3), 'ZXZ' );
tran = [ horzcat(dcm, t); [0 0 0 1]];

for k = 1:size(F1,1)
    f1 = F1(k, :)';
    f1(3) = 1;
    w2 = W2(k, :)';
    w2(4) = 1;
    
    f2 = F2(k, :)';
    f2(3) = 1;
    w1 = W1(k, :)';
    w1(4) = 1;
    
    f1_repr = P1*(tran)*w2;
    f1_repr = f1_repr/f1_repr(3);
    f2_repr = P1*pinv(tran)*w1;
    f2_repr = f2_repr/f2_repr(3);
    
    reproj1(k, :) = (f1 - f1_repr);
    reproj2(k, :) = (f2 - f2_repr);    
end

F = [reproj1; reproj2];
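To see what the optimizer is actually driving towards zero, here is a toolbox-free sketch that evaluates the same two reprojection terms for a known motion. Everything in it is synthetic: the projection matrix, the three 3D points and the "true" transform are assumptions for the demo, and no optimization is run. It just compares the error at the true transform with the error at an identity guess.

```matlab
P = [720 0 600 0;
     0 720 180 0;
     0   0   1 0];                           % assumed 3x4 projection matrix of the left camera

theta = 0.05;
Rot = [ cos(theta) 0 sin(theta);
        0          1 0;
       -sin(theta) 0 cos(theta)];
T = [Rot, [0.1; 0; 0.8]; 0 0 0 1];           % "true" 4x4 transform from frame t+1 to frame t

w2 = [-2 1 12 1; 3 -1 20 1; 0.5 0.5 8 1]';   % homogeneous 3D points at time t+1 (columns)
w1 = T * w2;                                 % the same points at time t

% Perspective division: homogeneous image points -> pixel coordinates
project = @(X) [X(1, :) ./ X(3, :); X(2, :) ./ X(3, :)];
j1 = project(P * w1);                        % observed features at time t
j2 = project(P * w2);                        % observed features at time t+1

% Both terms of the cost: j_t - P*T*w_{t+1} and j_{t+1} - P*inv(T)*w_t
residual = @(Tg) [j1 - project(P * Tg * w2), j2 - project(P * (Tg \ w1))];

errTrue     = sum(sum(residual(T).^2));
errIdentity = sum(sum(residual(eye(4)).^2));
fprintf('error at true T: %g, at identity: %g\n', errTrue, errIdentity);
```

A solver such as lsqnonlin starts from a guess like the identity and adjusts the six motion parameters until this residual is as small as possible.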

Validation of results

A particular set of \(\mathbf{R}\) and \(\mathbf{t}\) is said to be valid if it satisfies the following conditions:

The number of features in the clique is at least 8.
The reprojection error \(\epsilon\) is less than a certain threshold.

The above constraints help in dealing with noisy data.

An important “hack”

If you run the above algorithm on real-world sequences, you will encounter a rather big problem. The assumption of scene rigidity stops holding when a large vehicle such as a truck or a van occupies a majority of the field of view of the camera. In order to deal with such data, we introduce a simple hack: accept a translation/rotation matrix only if the dominant motion is in the forward direction. This is known to improve results significantly on the KITTI dataset, though you won’t find this hack explicitly written in most of the papers that are published on the same!
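One simple way to express such a check is sketched below. It assumes that the camera z-axis points forward (as in the KITTI camera convention) and that forward motion shows up as a positive z component; the exact test and the example vectors are my illustration, not necessarily what any particular implementation uses.

```matlab
% Accept a translation only if its forward (z) component is positive and dominant
isForwardDominant = @(t) (t(3) > 0) && (abs(t(3)) > abs(t(1))) && (abs(t(3)) > abs(t(2)));

tGood = [0.02; -0.01; 0.85];   % mostly forward: plausible for a car
tBad  = [0.90;  0.00; 0.10];   % mostly sideways: likely caused by a passing vehicle

acceptGood = isForwardDominant(tGood);
acceptBad  = isForwardDominant(tBad);
fprintf('accept tGood: %d, accept tBad: %d\n', acceptGood, acceptBad);
```

When an estimate is rejected, the pose is simply not updated for that frame pair.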

Stitching Intra-Oral Images

Sat, 23 May 2015 00:00:00 +0000

Note: This is a repost of my January post on MIT Media Lab’s Wordpress blog of their RedX 2015 Camp held at IIT-Bombay. There are a few minor modifications though.

Most intraoral cameras have a relatively narrow field of view, and the entire jaw is never visible in a single image. We are trying to stitch several images into one, so that the user has a complete view of the jaw, and we can then segment the teeth from it, and keep track of every individual tooth.

A basic image stitching pipeline has the following steps:

Matching features between two images
Computing the homography with RANSAC (minimal set is four matches)
Transforming, concatenating and blending the images.

Most of the existing panorama building algorithms are well-suited for applications in which the object being photographed is quite far away from the camera, such as in the image shown below (obtained from the Autostitch page):
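The "minimal set is four matches" part of step 2 can be sketched with the direct linear transform (DLT): each match gives two linear equations in the nine entries of the homography, and four matches pin it down up to scale. The snippet below is only the inner solve that RANSAC would call on every random sample. The ground-truth homography and the four points are made up so that the result can be checked.

```matlab
Htrue = [ 1.1   0.05  20;
         -0.03  0.95 -10;
          1e-4  2e-4   1];                    % assumed ground-truth homography

src  = [0 0; 100 0; 100 80; 0 80];            % four points in the first image
srcH = [src, ones(4, 1)]';
dstH = Htrue * srcH;
dst  = (dstH(1:2, :) ./ repmat(dstH(3, :), 2, 1))';   % their matches in the second image

% Two equations per match in the unknowns h = [h11 h12 h13 h21 ... h33]
A = zeros(8, 9);
for k = 1:4
    x = src(k, 1); y = src(k, 2);
    u = dst(k, 1); v = dst(k, 2);
    A(2*k - 1, :) = [-x -y -1  0  0  0  u*x u*y u];
    A(2*k,     :) = [ 0  0  0 -x -y -1  v*x v*y v];
end

% The solution is the null vector of A: the last right singular vector
[~, ~, V] = svd(A);
H = reshape(V(:, end), 3, 3)';
H = H / H(3, 3);                              % fix the arbitrary scale

disp(H);
```

With real, noisy matches you would normalize the coordinates first and re-estimate the homography from all the inliers that RANSAC finds.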

Panorama construction

However, we are photographing the teeth at a really close range, and minor changes in perspective are fatal for these algorithms. In order to overcome the problems imposed by changes in perspective, we are using ASIFT, a feature detection/description/matching algorithm which is more robust to perspective changes than SIFT. The next steps (homography computation, blending) are pretty standard, and here are some results:

A stitch of three images taken from an intraoral camera

Every Tooth Tracked

Sat, 23 May 2015 00:00:00 +0000

Note: This is a repost of my January post on MIT Media Lab’s Wordpress blog of their RedX 2015 Camp held at IIT-Bombay. There are a few minor modifications though.

We want to track the health of every tooth over time, and therefore wanted an algorithm that could extract the image of every single tooth from the stitch that we obtained in our previous step. Our first attempt was at a completely automated approach, and we soon found a paper which attempted to solve a problem that was a subset of ours. They wanted to separate the teeth from the rest of the image, while we wanted to segment every individual tooth from the rest of the image. The algorithm that these guys had used was pretty basic (Active Contours Without Edges), and I got it working within half an hour in MATLAB, with the following results:

Obtained using Active Contours Without Edges (Chan-Vese)

But this approach had a few problems. It was computationally expensive (~ 2min to run on my Intel Core i7 machine), and could not be used to segment an individual tooth out.

So, I started looking at other algorithms, and soon stumbled across the Watershed transform. In order to generate good results, watershed needs certain markers, and these markers can be generated using either automated or manual methods. One popular automated method for generating these markers is ‘opening-by-reconstruction’ and ‘closing-by-reconstruction’. The following results were obtained using MATLAB’s watershed example:

Vanilla Watershed with automatic marker generation

As you can see, the above is a complete mess. A lot of unwanted segments are obtained, and some superpixels (clusters of pixels) flow into each other. So, I then tried a manual-marker approach, and the results were much better:

Watershed with manually-annotated markers

A MATLAB-based GUI is used to generate the masks as follows:

The mask file looks something like this:

The mask used to generate the above results

In the final product, we can assume a touchscreen-based user interface, wherein the user swipes a finger across every tooth once, and then gets the segmented image as an output. Once several such images have been manually annotated, we could use a learning algorithm that can automatically generate these masks.

Visual Odometry - The Reading List

Tue, 29 Jul 2014 00:00:00 +0000

I am thinking of taking up a project on ‘Visual Odometry’ as UGP-1 (Undergraduate Project) here in my fifth semester at IIT-Kanpur. This post is primarily a list of some useful links which will get one acquainted with the basics of Visual Odometry.

The first thing that anyone should read is this wonderful two-part review by Davide Scaramuzza and Friedrich Fraundorfer:

One thing that I did not understand from the above tutorials was the ‘5-point algorithm’ by Nister in 2003. The original paper is here. But, this paper also seemed quite complicated for me to implement without any background, so I moved onto a simpler algorithm, called the ‘8-point algorithm’, which was published a long time ago by Longuet-Higgins. You can find it here. There are some lecture slides which explain this in a simple manner, and you can find them here.

Note, there are more papers that one should read regarding this, most notably:

In my next post, I will hopefully start working on my implementation.

RANSAC

Mon, 21 Jul 2014 00:00:00 +0000

This post is about the popular outlier rejection algorithm RANSAC. It stands for RANdom SAmple Consensus. It is widely used in computer vision, with one application being the rejection of false feature matches in a pair of images from a stereo camera set.

Suppose you have been given a dataset and you want to fit a mathematical model on it. We now assume that this data has certain inliers and some outliers. Inliers refer to the data points whose presence can be explained with the help of a mathematical model, while outliers are data points whose presence can never be explained via any reasonable mathematical model. Usually their presence in the dataset deteriorates the quality of the mathematical model that we can fit to the data. For best results, we should ignore these outliers while estimating the parameters of our mathematical model. RANSAC helps us in identifying these points so that we can obtain a better fit for the inliers.

Note that even the inliers do not exactly fit the mathematical model as they might have some noise, but the outliers either have an extremely large amount of noise or they are obtained due to faults in measurement, or because of problems in the sensor from which we are obtaining the data.

The Algorithm

The Input

Data points
Some parametrized model (we need to estimate the parameters for this model)
Some confidence parameters

Algo

A set of points from the original dataset is randomly selected, and these are assumed to be the inliers.
Parameters are estimated to fit to this hypothetical inlier set.
Every point that was not a part of this hypothetical inlier set is tested against the mathematical model that we just fit.
The points that fit the model become a part of the consensus set. The model is good if a particular number of points have been classified as part of the consensus set.
This model is then re-estimated using all the members of a consensus set.
The above process is repeated a fixed number of times, and the model with the largest consensus set is kept.
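The steps above can be sketched for the simplest possible model, a straight line y = a·x + b, where two points are enough for one hypothesis. The data is synthetic (20 points close to y = 2x + 1 plus 5 gross outliers), and the tolerance and the number of iterations are arbitrary choices for the demo.

```matlab
% Synthetic data: inliers near y = 2x + 1, plus a few gross outliers
xin  = linspace(0, 10, 20);
yin  = 2 * xin + 1 + 0.05 * sin(7 * xin);    % small deterministic "noise"
xout = [1.3 3.7 5.1 7.9 9.2];
yout = [15 -6 25 2 -4];
x = [xin xout];
y = [yin yout];

numIter = 200;        % fixed number of repetitions
tol = 0.3;            % maximum residual for a point to join the consensus set
bestCount = 0;
bestInliers = false(size(x));

for it = 1:numIter
    idx = randperm(numel(x), 2);             % random minimal sample (2 points)
    if x(idx(1)) == x(idx(2)), continue; end % skip vertical lines
    a = (y(idx(2)) - y(idx(1))) / (x(idx(2)) - x(idx(1)));
    b = y(idx(1)) - a * x(idx(1));
    inliers = abs(y - (a * x + b)) < tol;    % test every point against the hypothesis
    if sum(inliers) > bestCount              % keep the largest consensus set
        bestCount = sum(inliers);
        bestInliers = inliers;
    end
end

% Re-estimate the model using all members of the best consensus set
coeffs = polyfit(x(bestInliers), y(bestInliers), 1);
fprintf('slope = %.3f, intercept = %.3f, consensus size = %d\n', coeffs(1), coeffs(2), bestCount);
```

A plain least-squares fit on all 25 points would be dragged away by the outliers, while the RANSAC fit only ever sees the consensus set.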

How many times do we repeat?

It is possible to theoretically determine the fixed number of iterations ‘k’ which are needed, if we have an estimate of the percentage of outliers present in the data.
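The standard way of doing this is to ask for a probability p that at least one of the k random samples contains only inliers. If w is the fraction of inliers (one minus the fraction of outliers) and n is the number of points in a minimal sample, then \(k = \log(1-p) / \log(1-w^{n})\). The values of p and w below are assumptions chosen for illustration.

```matlab
p = 0.99;         % desired probability of drawing at least one all-inlier sample
w = 0.5;          % assumed inlier fraction (50% outliers)
n = [2 4];        % minimal sample size: 2 for a line, 4 for a homography

k = ceil(log(1 - p) ./ log(1 - w .^ n));
disp(k);
```

Note how quickly k grows with the sample size: models that need more points per hypothesis need many more iterations for the same confidence.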