Final Year Project

FYP CA2

Introduction

Augmented Reality is defined as the modification of human visual perception of reality by imposing additional stimuli generated by the computer, such as 3D images. The stimuli can take many forms such as audio, visual markings, or even tactile feedback. As a comparison, virtual reality completely substitutes the real world with computer-generated reality.

There are many applications of AR in the modern world. Those include advertising, task support (e.g. medical surgery), navigation, entertainment and military services (e.g. for training purposes).

Areas of research

According to “Trends in Augmented Reality Tracking, Interaction and Display: A Review of Ten Years of ISMAR”, the core research of AR can be broken down into 5 main research areas:

1.       Tracking techniques

2.       Interaction techniques

3.       Calibration and registration

4.       AR applications

5.       Display techniques

However, the focus is mainly on tracking and calibration and registration as they are the fundamentals of AR.

In order to augment something, an image of the real world is required, and tracking algorithms are applied onto it. There are 2 types of tracking: marker-based tracking and natural feature tracking. Marker-based, as the name suggests, requires a fiducial marker in order to augment an object. Those markers are commonly square markers, as shown in figure 1.1. Natural feature tracking tracks an image by means of estimating the keypoints of the image, as shown in figure 1.2.

figure 1.1

figure 1.2

For natural feature tracking, the object that is to be detected is trained for its robust keypoints and stored into a database. Robust keypoints of an object are points that are stable, i.e. those points are robust to changes in illumination, noise, and minor changes in viewpoint. In addition, they are highly distinctive, relatively easy to extract and allow for correct object identification with low probability of mismatch. Elaboration on tracking will be discussed in the sections below.

Calibration is to determine the intrinsic properties of the camera, as well as the camera position when it is viewing the image. Intrinsic properties include the focal length coefficients and the principal points of the camera. The camera position refers to the transformation matrix (rotation and translation) of the camera in the real world when viewing the image. This transformation matrix can be estimated using algorithms such as EPnP (Efficient Perspective n Points).

Registration is the augmenting of virtual objects onto the image. Those virtual objects can be drawn onto the screen using graphic libraries such as OpenGL. After a square marker that exist in the database is found in the image (for marker-based tracking), or an object in the database is detected in the image (for nature feature tracking), the virtual object is drawn onto the image. Orientation of the virtual object is determined by the transformation matrix of the camera position calculated as mentioned in the above paragraph.

Research Question

The research question is to form a new algorithm by combining existing open-sources to accommodate real-time tracking and rendering on the mobile platform navigation, and making it open-source.

FYP overview

As mentioned in the introduction above, there are many applications for AR. However, the aim of this FYP is to develop an AR training system for Military purposes, namely for the CBRD (Chemical, Biological and Radiological Defense) of the SAF.

During operations and trainings, soldiers of the CBRD have to wear the MOPP (Mission Oriented Protective Posture) suit, which includes the protective gas mask. This makes it impossible for the soldier to be equipped with Head-Mounted Device (HMD) of any sort. Hence, until there is improvement on the hardware factor of the HMD, the usage of mobile devices, such as tablets, is the best option.

Special posters are used as feature points for AR tracking and augmenting during training. Those posters are placed at desired places for virtual objects to be augmented at.

Existing Similar AR SDKs on the Mobile Platform

There exist similar mobile AR SDKs in the market. 3 of the most popular SDKs are listed below:

1.       QualComm

2.       Metaio

3.       Layar

QualComm

The QualComm AR SDK (QCAR) supports multiple development tools including Eclipse (for Android), Xcode (for iOS), and Unity. It uses its own QualComm patented FastCV algorithms for tracking and detecting, which is optimised for the mobile platform. Images from users are processed via cloud computing to QualComm’s server and the results are sent back to the user. However, it is closed-source.

figure 2

Metaio

The Metaio Unifeye works similarly with QualComm, which also uses cloud computing. It requires purchase. The Metaio Unifeye is fully configured and controlled via graphical control elements, thus programming knowledge is not required. It features marker tracking, 2D planar markerless tracking, face tracking and extensible tracking. Like QualComm, its implementations are close-source.

Layar

The Layar, too, uses cloud computing for processing the images. Layar requires subscription. Payment is made per matches of the image. Layar uses images of users to extract fingerprints (of the image). The user remains the owner of his reference images but Layar owns the fingerprints and reserves the rights to use them in all its products. It features nature feature tracking, and excels on planar surfaces. However, Layar is also close-source.

Since the existing SDKs are close-source, users do not know that is being done to their images. Through this FYP, users can use or modify the codes to suit their purpose. Furthermore, these existing SDKs cannot be used with the CBRD training system, as all changes made to the SDKs must be consulted with the companies of the SDKs.

Tracking Algorithms

As mentioned before, tracking is the first thing needed for AR. Currently, there are many algorithms for tracking and detection, with SIFT (scale-invariant feature tracking) and SURF (speeded-up robust feature) being the most popular algorithms used. Some of these algorithms have already been implemented into the OpenCV library, hence the choice of using the OpenCV library in this FYP. Furthermore, being an open source library, improvements can be implemented into the library when necessary.

Before applying those algorithms, the image has to undergo gray-scaling. Then, Gaussian Blurring (also known as Gaussian Smoothing) is applied in order to reduce image noise and details, usually with an octave of 4.

figure 3: Original image

figure 4: image blurred using Gaussian blur

Below is a list of some popular tracking algorithms and their comparison.

SIFT:

SIFT feature descriptor is invariant to uniform scaling, orientation, and affine distortion. Important characteristic of these features is that the relative positions between them in the original scene should not change from one image to another.

Scale-space approximation

In the first step of SIFT, several octaves of the original image are generated. Each octave’s image size is half the previous one. Within an octave, images are progressively blurred using the Gaussian Blur operator (Utkarsh). Scale spaces are usually implemented as an image pyramid, where the images are scaled down with the same ratio. Those Gaussian-blurred images are sub-sampled through Difference of Gaussian (DoG) in order to achieve a higher level of the pyramid.

figure 5: Difference of Gaussian (DoG)

After creating the Gaussian Pyramid, keypoints of the image can then be found. Finding keypoints is done by locating Maxima/minima in DoG images.

Locating Keypoints

This is done by iterating through each pixel and check against all its neighbours. This check is done within the current image, the image above it and as well as below it.

figure 6:Checking the pixel marked ‘X’ against its neighbours, marked green

The process is illustrated in figure 6. The pixel X is marked as a keypoint if it is the greatest or least of all 26 neighbours. However, a keypoint almost never lies exactly on a single pixel. Therefore, subpixel values are generated using Taylor expansion of the image around the estimated keypoint.

Keypoints Removal

Several keypoints are generated with some of them lying on an edge or not having enough contrast, hence not useful as feature keypoints.

Low contrast keypoints

To remove low contrast keypoints, the magnitude of the intensity of the pixel/subpixel in the DoG image is checked against a certain value. If the magnitude is less, the keypoint is rejected.

Keypoints along an edge

2 gradients, both perpendicular to each other, are calculated. Based on the image around the keypoint, three possibilities exist:

1.       A flat region: both gradients will be small.

2.       An edge: one gradient will be big (perpendicular to the edge) and the other will be small (along the edge)

3.       A “corner”: both gradients will be big.

A corner makes an excellent feature point, hence algorithm similar to the Harris Corner Detection is used to check if it is a corner. The keypoint is rejected if it is not a corner.

By reducing the number of unnecessary keypoints, efficiency and robustness of SIFT is increased.

Keypoint Orientation Assignment

An “orientation collection region” is formed around the keypoint. Within this region, the most prominent orientation is calculated and assigned to that keypoint. The following formula is used to calculate the magnitude and orientation of each pixel in the region:

where m(x,y) is the magnitude and θ(x,y) is the orientation.

A histogram is then created using the results:

figure 7: Histogram of magnitude proportion and orientation bins

In the above histogram, the 360 degrees of orientation are broken into 36 bins (each 10 degrees). For example, the gradient direction at a certain point (in the “orientation collection region”) is 18.759 degrees, and it will be classified under the 10-19 degree bin. And the “amount” that is added to the bin is proportional to the magnitude of gradient at that point. The orientation with the highest peak will then be assigned to the current keypoint that is being checked.

SIFT feature generation

The last step is to generate unique fingerprints for each of the keypoints left. A 16×16 window of sub-pixels around the keypoint is chosen and then split into sixteen 4×4 windows. From each 4×4 window, a histogram of 8 bins is generated. Each bin corresponding to 0-44 degrees, 45-89 degrees, etc. Gradient orientations from the 4×4 are put into these bins. This is done for all 4×4 blocks, giving a result consisting of 4×4×8 (128) values. Finally, the 128 values are normalized, giving the keypoint a unique feature vector.

SURF (Speeded-Up Robust Features):

SURF is a method inspired by the SIFT algorithm with some differences in determine the keypoints.

Difference in Scale-space approximation

In SIFT, the image is scaled and blurred in a Gaussian Pyramid way. However, in SURF, the image is filtered “using a box filter approximation of second-order Gaussian partial derivatives, since integral images allow the computation of rectangular box filters in near constant time” (Luo, Gwun).

Hessian Matrix-Based Keypoints

The SURF detector is based on the Hessian matrix determinant. If the determinant of that pixel is a maximum, it is a keypoint. This method is accurate and much faster than SIFT.

FERNS:

The FERNS algorithm follows the same steps as the SIFT from the beginning until and including the part where keypoints are reduced, i.e. both algorithms have different ways of assigning unique identity to the keypoints of an image.

Approach

Problems of matching keypoints of an object are casted into classification problems. Firstly, the image undergoes partitioning, using algorithms such as k-means clustering, to create patches around its keypoints of the image. Ferns are then used to classify those patches. A visual representation of this approach is given in figure 8.The Semi-Naïve Bayesian approach is used for patch recognition between objects in the database and the image.

figure 8: Visual Representation of classifying patches

The difference between ferns and trees is that trees have hierarchical structure, while ferns do not.
 
figure 9.1: tree

figure 9.2: fern
 

From the figures above, it can be seen that Ferns are basically sequences of tests that is to be applied onto a patch. Those tests are randomly chosen binary tests on the intensity values around a keypoint.

Comparison in terms of object detection

Speed

SIFT is the slowest out of the 3 due to its methods being computationally intensive. SURF is fast because of its use of the Hessian matrix, while FERNS is due to classifying and approximation.

Performance
 
 

figure 10: Ratio of successfully tracked image (Lieberknecht, Benhimane,Meier, Navab)

Figure 10 shows the ratio of successfully tracked image of different foci (Angle, Range, Fast far, Fast close, Illumination) using test images of different characteristics (low textured, repetitive, normal, high-textured). Examples of the test images are under the appendix.

From this figure, SIFT has undeniably the best performance. However, as mentioned before, SIFT is computationally intensive and is relatively slow. The next best in performance is the FERNS algorithm, out-performing SURF in every aspect.

Hence, for this FYP, the Ferns algorithm is chosen.

Camera Pose

As mentioned under introduction to AR earlier, the camera pose is important to correctly augment the orientation and transformation of any virtual objects. To calculate the camera pose, I have chosen to use the Effective Perspective n Points (EPnP) algorithm.

EPnP

The aim of PnP is to “determine the position and orientation of a camera given its intrinsic parameters and a set of n correspondences between 3D points and their 2D projections” (Moreno-Noguer, Lepetit, Fua). Because EPnP is a non-iterative method, it is relatively faster than other iterative methods. EPnP also has good accuracy, speed and does not require an initial estimate.

The EPnP approach, based on the assumption that a set of n reference points whose both 3D coordinates in the real world and 2D corresponding image projections are known, is as follows:

1.       Retrieve their coordinates in camera coordinate system

2.       Express their coordinates as a weighted sum of virtual control points

3.       The solution is expressed as a matrix M of size 2n × 12 or 2n × 9, which can be computed easily from the 3D points and their corresponding 2D projections

4.       Appropriate weights can be obtained by solving small systems of quadratic equations

Analysis

This approach is relatively computationally efficient as the most expensive part is the computation of MTM, and it is only done once. Therefore it is chosen to be used in this FYP.

Augmenting

Augmenting is super-imposing a virtual object onto the image. In the case of this FYP, I am to augment virtual objects onto a video. To do that, the video is taken as frames. Using the ferns algorithm, keypoints on each frame are determined. When an object that exists in the database is detected and captured on the frame, the camera pose at that instance is calculated through EPnP. The virtual object is then drawn onto the image, with its transformation given by the camera pose.

Virtual Object

To draw a virtual object, a graphic library, such as OpenGL, is needed. For this FYP, OpenGL is used. The reason being that it is open standard, and cross platform. Development on the PC can be moved to the mobile platform later.

The frames of the video have to be displayed on an OpenGL window in order to display virtual objects onto the frame. Transformation matrix of the virtual object is replaced by the camera position matrix as calculated earlier. OpenGL only does the displaying of both the video frames, and the augmented virtual object, while all other processes, such as tracking and camera pose calculations, are done in the background by OpenCV and EPnP respectively.

Problem Faced

Problem 1:

The matrix of the camera pose estimated by the EPnP is relatively correct. However, when the result is used onto the transformation of the virtual object, the virtual object “jumps” around the screen. There is no available reason found yet for this problem, hence further investigation will be needed.

Problem 2:

When displaying the frames onto the OpenGL window, there is some colour distortion. According to one of my supervisor, it is most probably due to the channel of the colour bits. However, since the aim of that phase was to make the implementations work correctly, this problem is temporary put aside until the core functionalities are in working order.

References:

Utkarsh Sinha. SIFT: Scale Invariant Feature Transform. Retrieved on 10 November, from http://www.aishack.in/2010/05/sift-scale-invariant-feature-transform/

Francesc Moreno-Noguer, Vincent Lepetit, Pascal Fua. Accurate Non-iterative Solution to the PnP Problem.

Luo Juan, Oubong Gwun. A Comparison of SIFT, PCA-SIFT and SURF.

Feng Zhou, Henry Been-Lim Duh, Mark Bilinghurst. Trends in Augmented Reality Tracking, Interaction and display: A review of Ten Years of ISMAR.

Sebastian Lieberknecht, Selim Benhimane, Peter Meier, Nassir Navab. A Dataset and Evaluation Methodology for Template-based Tracking Algorithms.

Appendix

Figure 11: Reference Images used in dataset. From left: Low, Repetitive, Normal, High Textureness (Lieberknecht, Benhimane,Meier, Navab)

Figure 12: From top to bottom, illustration of  ”Angle”, “Range”, “Fast Far”, “Fast Close”, “illumination” (Lieberknecht, Benhimane,Meier, Navab)

Week 8

After studying the EPnP source code and demo code last week, it is time to start hands-on coding. The aim is to integrate the EPnP code into the object-in-video that I made during the recess week, in order to obtain the estimated camera translation and rotation.

Calling the function compute_pose() triggers a break point when the program runs. This problem is rectified by keeping the number of correspondences constant throughout the entire code instead of setting it to be the number of correspondences found in every frame.

Intrinsic properties of the camera is an important component, which are the uC, vC, fU, fV (uC, vCare the coordinates of the projection, and fU, fV are the focal lengths). Previously, I have used values 160 and 120 for uCand vCrespectively with the ratio aspect of 4:3 in mind, and 200 (this value is chosen randomly) for both fU and fV. Movement of the camera during testing is simulated by moving the object for detection instead of moving the camera itself. This is the idea of moving the entire environment rather than moving the camera. The change in translation when I shift the object is in the correct direction. Eg. When I move the object downwards, simulating the camera being elevated, the translation of the y-axis increase.

However, the intrinsic property used for testing is wrong. The Camera Calibration program that came with the OpenCV library is then used to determine the intrinsic property. Once again, the camera pose changes printed on the screen are in the correct direction. However, only by augmenting an object can I determine whether those values are truly correct.

Week 7

This week I have started on reading up the EPnP. According to the paper “Accurate Non-Iterative O(n) Solution to the PnP Problem”, the PnP problem is to determine the camera position given it’s intrinsic parameters and a set of n correspondence between 3D points in the real world and their 2D projections.

I am to study the EPnP source code and demo code. Basically, the code takes in Xw, Yw and Zw, which are the 3D coordinates of the object to be detected, along with u and v, which are the 2D correspondences of that object in the image (in the case of a video, frames), and compute the camera pose. Since in this FYP, I am doing on 2D planar surface, the Zw point is not needed.

However, the function in the EPnP code requires the argument of Zw, a static number ‘1’ is chosen instead of a random number used I the original demo code. Although the code can be run, the error in estimating the position is enormous. It is more accurate to have Zw a random number. This problem will be further investigated.

Feature-based detection using the OpenCV library. The window on the right displays the keypoints of the frames in grayscale. The left window shows the corresponding keypoints between the object and frame when detected.

Week 6

Started on video capturing. I had tried to run a code written on C++ for capturing the frames of a video, copy it and turn it into Mat (matrix class in OpenCV) format and display it on the monitor. However it could not be compiled. It took me 1.5 days to figure out that I failed to enter the cv namespace; I’ll never forget that again. Next step will be to combine the code I had learnt on week 5 with this new code. The objective is to code a working program that captures the keypoints of a video, frame by frame, and display it onto the monitor.

In the code, I have captured the video, frame by frame, and convert it into a matrix format (the class Mat in OpenCV). Next I tried to find the robust points of each frame. However, it takes an insanely long time per frame.

Recess week

Side note: on the 18th of September, there was a news report about an AR machine at Changi Airport Terminal 3 that looks similar to the “Magic Mirror” in the MXR lab. Its purpose is to allow customers to try on different clothes virtually on the screen, by dragging the virtual clothes onto the user.

This week, a big misunderstanding has been cleared. I had thought that the frames captured by the camera need to be trained for their robust points. They are actually supposed to be treated as the scenes. Finding their keypoints is enough. Training robust keypoints is only for the object that is in the database.

After altering the code, I managed to display 2 images in the same window. The top image shows the object to be detected with its kepoints pinpointed. The bottom is the frames that are captured by the camera with its keypoints pinpointed as well. The next step will be to match the correspond keypoints of the object.

After a few runs of the code, it seems that the keypoints pinpointed on the window are not stable, and the correspondence between the object and the frames cannot be found. The problem is that I locate the keypoints of the frames without converting it to grayscale, making the calculations of the intensity of the pixels inaccurate. The detection now works after converting both the frames and object to grayscale.

Despite that the program can detect the object in the frames, there still exist 2 problems:

1)      The detection fails when the object is rotated more than +90˚ or less than -90˚.

2)      Detection fails when object is relatively far away from the camera.

Therefore, the next step is to read up on EPnP (Effective Perspective n Points) in order to calculate the transformation matrix between the object and the frames and solve the rotation problem.

Week 5

This week is the start of hands-on coding.  I have been given a sample code for detecting the most stable 2D points, i.e. the keypoints, of 2 images. One image is a picture of an object, and the other a picture of a scene with that same object in it. Bascially, the code compares the keypoints of both images in order to find the correspondence between the keypoints of the 2 images, and then map the keypoints of the object onto the object in the scene.

First, the program converts both the object image and scene image into matrices. Next it smoothes the intensity of the pixels using Gaussian Blur. Different resolutions of the images are Gaussian Blurred using the Gaussian Pyramid. The object is then trained for its model and stable keypoints. Finally the keypoints of the object will be matched against the keypoints of the image. When found, it will display the lines connecting the corresponding keypoints.

Week 3

This week, I had borrowed some readings on AR to have a general idea on the types of AR, and the general idea of how it works. According to “Trends in Augmented Reality Tracking, Interaction and Display: A Review of Ten Years of ISMAR”, the core of AR can be broken down into 4 main research areas:

1.       Tracking techniques

2.       Interaction techniques

3.       Calibration and registration

4.       AR applications

5.       Display techniques

However, I am to focus mainly on tracking, and in the meantime to read up on interaction, as they are the fundamentals of AR.

There are 3 main tracking techniques for AR: Sensor-based, Vision-based and Hybrid. Sensor-based tracking techniques track by means of magnetic, sound, inertial, light and mechanical. They all have advantages, such as light weight and high update rates, yet most of the time their outputs are easily distorted by external noises.

As for vision-based, which is the field that I am to research in, it is divided into 2 classes: feature-based and model-based. The idea of feature-based tracking is to find a link between 2D image features and their 3D world frame coordinates, and in turn obtain the camera position. One way is the commonly-used fixed square markers by finding the 3D coordinates of the corners of the square markers. The other way is by means of natural features. It is done by finding the images’ most stable points to provide robust tracking. The principle behind model-tracking is mainly constructing model using the lines or edges of that model, and using the predefined model to track and identify objects in the scene. Lastly, hybrid tracking is basically a combination of various tracking techniques, such as SLAM (Simultaneous Localisation and Mapping), SIFT (Scale-invariant feature transform) inertial, and model-based.

There are a few types of interaction techniques and interfaces, including mainly of Tangible AR and Collaborative AR. The concept of tangible AR is to be able to interact with the virtual AR objects with real, physical real world objects. One example that I had found on the web would be the “ARis” made by the Japanese in 2008, where players can interact with the AR maid using a “magic stick”. The idea of collaborative AR is that more than 1 user can interact with the same AR object in the same space, and all users can see the change done on the AR object by the other users.

Week 4

Due to my negligence and lack of time management, I was only able to start my FYP proper this week. I was briefed on the idea of this FYP about using AR for military training purposes, namely the CBRE.

I was tasked to read up on the Ferns algorithm. According to “Fast Keypoint Recognition usng Random Ferns”, it is an algorithm that finds the most stable points of an image through classification of patches of the image. Naive Bayesian Classification is used here, as well as k-means clustering to partition the classifications of the patches in order to classify the patches. Next, a series of tests are done on the patches, which is where the difference between randomized trees and ferns lies. Randomized tree executes a hierarchy of tests onto the patches, while fern executes a series of tests, one test after another, onto the patches. Hence fern is much less computational intensive than randomized tree.

Timeline

Week                         Objective

3                              Learn anout AR in general

4                              Learn about Ferns algorithm for tracking

5                              Understand code for tracking with Ferns algorithm

6                              Combine video capture and tracking code

Recess week            - Combine video capture and tracking code

                               - Understand EPnP

7                              Understand EPnP and start on ARToolKit

8                              Learn ARToolKit

9                              Read up on mobile developement