Augmented Reality (AR) is defined as the modification of human visual perception of reality by superimposing additional computer-generated stimuli, such as 3D images, onto it. The stimuli can take many forms, such as audio, visual markings, or even tactile feedback. By comparison, virtual reality completely substitutes the real world with a computer-generated one.
There are many applications of AR in the modern world. These include advertising, task support (e.g. medical surgery), navigation, entertainment and military services (e.g. for training purposes).
Areas of research
According to “Trends in Augmented Reality Tracking, Interaction and Display: A Review of Ten Years of ISMAR”, core AR research can be broken down into five main research areas:
1. Tracking techniques
2. Interaction techniques
3. Calibration and registration
4. AR applications
5. Display techniques
However, the focus of this project is mainly on tracking, calibration and registration, as they are the fundamentals of AR.
In order to augment something, an image of the real world is required, and tracking algorithms are applied to it. There are two types of tracking: marker-based tracking and natural feature tracking. Marker-based tracking, as the name suggests, requires a fiducial marker in order to augment an object. These markers are commonly square markers, as shown in figure 1.1. Natural feature tracking tracks an image by estimating the keypoints of the image, as shown in figure 1.2.
For natural feature tracking, the object to be detected is trained for its robust keypoints, which are stored in a database. Robust keypoints of an object are points that are stable, i.e. robust to changes in illumination, noise, and minor changes in viewpoint. In addition, they are highly distinctive, relatively easy to extract, and allow correct object identification with a low probability of mismatch. Tracking is elaborated on in the sections below.
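As a minimal illustration of building such a database entry with OpenCV (the file name is hypothetical, and SIFT, described later, stands in for whichever detector is actually used):

```python
import cv2

# Detect robust keypoints in a reference image and keep their
# descriptors; storing these descriptors forms the "database" entry
# against which live camera frames are later matched.
reference = cv2.imread("reference_poster.png", cv2.IMREAD_GRAYSCALE)
detector = cv2.SIFT_create()
keypoints, descriptors = detector.detectAndCompute(reference, None)
# 'descriptors' (one row per keypoint) would be saved to the database.
```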
Calibration determines the intrinsic properties of the camera, as well as the camera position when it is viewing the image. The intrinsic properties include the focal length coefficients and the principal point of the camera. The camera position refers to the transformation matrix (rotation and translation) of the camera in the real world when viewing the image. This transformation matrix can be estimated using algorithms such as EPnP (Efficient Perspective-n-Point).
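For illustration, the intrinsic properties are commonly collected into a 3×3 matrix; the values below are placeholders rather than real calibration results:

```python
import numpy as np

# Sketch of the intrinsic camera matrix K: focal length coefficients
# (fx, fy) on the diagonal and the principal point (cx, cy) in the
# last column. All numbers here are hypothetical.
fx, fy = 800.0, 800.0
cx, cy = 320.0, 240.0
K = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])
```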
Registration is the augmenting of virtual objects onto the image. These virtual objects can be drawn onto the screen using graphics libraries such as OpenGL. After a square marker that exists in the database is found in the image (for marker-based tracking), or an object in the database is detected in the image (for natural feature tracking), the virtual object is drawn onto the image. The orientation of the virtual object is determined by the transformation matrix of the camera position, calculated as described in the paragraph above.
The research objective is to form a new algorithm by combining existing open-source components to support real-time tracking and rendering for navigation on the mobile platform, and to release the result as open source.
As mentioned in the introduction above, there are many applications for AR. However, the aim of this FYP is to develop an AR training system for military purposes, namely for the CBRD (Chemical, Biological and Radiological Defense) of the SAF.
During operations and training, soldiers of the CBRD have to wear the MOPP (Mission Oriented Protective Posture) suit, which includes a protective gas mask. This makes it impossible for the soldier to be equipped with a Head-Mounted Device (HMD) of any sort. Hence, until HMD hardware improves, the use of mobile devices, such as tablets, is the best option.
Special posters are used as feature points for AR tracking and augmenting during training. These posters are placed at the desired places where virtual objects are to be augmented.
Existing Similar AR SDKs on the Mobile Platform
There exist similar mobile AR SDKs in the market. Three of the most popular SDKs are described below:
The Qualcomm AR SDK (QCAR) supports multiple development tools, including Eclipse (for Android), Xcode (for iOS), and Unity. It uses Qualcomm’s own patented FastCV algorithms for tracking and detection, which are optimised for the mobile platform. Users’ images are processed via cloud computing on Qualcomm’s servers and the results are sent back to the user. However, it is closed-source.
Metaio Unifeye works similarly to QCAR in that it also uses cloud computing. It requires a purchase. Metaio Unifeye is fully configured and controlled via graphical control elements, so programming knowledge is not required. It features marker tracking, 2D planar markerless tracking, face tracking and extensible tracking. Like QCAR, its implementation is closed-source.
Layar, too, uses cloud computing to process images. Layar requires a subscription, with payment made per image match. Layar extracts fingerprints from users’ images; the user remains the owner of his reference images, but Layar owns the fingerprints and reserves the right to use them in all its products. It features natural feature tracking, and excels on planar surfaces. However, Layar is also closed-source.
Since the existing SDKs are closed-source, users do not know what is being done with their images. Through this FYP, users can use or modify the code to suit their purposes. Furthermore, these existing SDKs cannot be used with the CBRD training system, as all changes made to the SDKs must be agreed with the companies behind them.
As mentioned before, tracking is the first thing needed for AR. Currently, there are many algorithms for tracking and detection, with SIFT (Scale-Invariant Feature Transform) and SURF (Speeded-Up Robust Features) being the most popular. Some of these algorithms have already been implemented in the OpenCV library, hence the choice of OpenCV for this FYP. Furthermore, being an open-source library, improvements can be added to it when necessary.
Before applying those algorithms, the image is first converted to grayscale. Then, Gaussian blurring (also known as Gaussian smoothing) is applied in order to reduce image noise and detail, usually across four octaves.
Figure 3: Original image
Figure 4: Image blurred using Gaussian blur
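A minimal sketch of this preprocessing with OpenCV (the file name and kernel size are illustrative choices):

```python
import cv2

# Convert the input image to grayscale, then apply Gaussian blurring
# to suppress noise and fine detail before keypoint detection.
image = cv2.imread("frame.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 1.6)  # sigma 1.6 is a common SIFT base value
```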
Below are descriptions of some popular tracking algorithms and a comparison between them.
The SIFT feature descriptor is invariant to uniform scaling and orientation, and partially invariant to affine distortion. An important characteristic of these features is that their relative positions in the original scene should not change from one image to another.
In the first step of SIFT, several octaves of the original image are generated; each octave’s image size is half that of the previous one. Within an octave, images are progressively blurred using the Gaussian blur operator (Utkarsh). Scale spaces are usually implemented as an image pyramid, where the images are scaled down by the same ratio at each level. Adjacent Gaussian-blurred images are then subtracted to produce Difference of Gaussian (DoG) images, and the image is sub-sampled to form the next level of the pyramid.
Figure 5: Difference of Gaussian (DoG)
After creating the Gaussian pyramid, keypoints of the image can then be found by locating maxima/minima in the DoG images. This is done by iterating through each pixel and checking it against all of its neighbours, within the current image as well as the images above and below it in the scale space.
Figure 6: Checking the pixel marked ‘X’ against its neighbours, marked in green
The process is illustrated in figure 6. The pixel X is marked as a keypoint if it is the greatest or least of all 26 neighbours. However, a keypoint almost never lies exactly on a single pixel; therefore, subpixel values are generated using a Taylor expansion of the image around the estimated keypoint.
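A sketch of this neighbour check (with the subpixel refinement omitted), assuming the three adjacent DoG levels are available as NumPy arrays:

```python
import numpy as np

def is_extremum(dog_below, dog_current, dog_above, x, y):
    # Gather the 3x3x3 neighbourhood of (x, y) across the three DoG
    # levels; the centre pixel is a candidate keypoint if it is the
    # maximum or minimum of all 26 neighbours.
    cube = np.stack([dog_below[y-1:y+2, x-1:x+2],
                     dog_current[y-1:y+2, x-1:x+2],
                     dog_above[y-1:y+2, x-1:x+2]])
    centre = dog_current[y, x]
    return centre == cube.max() or centre == cube.min()
```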
Many keypoints are generated, but some of them lie on an edge or do not have enough contrast, and are hence not useful as feature keypoints.
Low contrast keypoints
To remove low-contrast keypoints, the magnitude of the intensity of the pixel/subpixel in the DoG image is checked against a threshold. If the magnitude is below it, the keypoint is rejected.
Keypoints along an edge
Two gradients, perpendicular to each other, are calculated at the keypoint. Based on the image around the keypoint, three possibilities exist:
1. A flat region: both gradients will be small.
2. An edge: one gradient will be big (perpendicular to the edge) and the other will be small (along the edge).
3. A “corner”: both gradients will be big.
A corner makes an excellent feature point, hence an algorithm similar to Harris corner detection is used to check whether the keypoint is a corner; if it is not, the keypoint is rejected.
By reducing the number of unnecessary keypoints, the efficiency and robustness of SIFT are increased.
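A sketch of such a corner test, using the principal-curvature ratio from Lowe's SIFT paper (r = 10 is the paper's suggested threshold; this is an illustrative reimplementation, not the FYP's code):

```python
import numpy as np

def passes_edge_test(dog, x, y, r=10.0):
    # 2x2 Hessian of the DoG image at (x, y), from finite differences.
    dxx = dog[y, x+1] + dog[y, x-1] - 2.0 * dog[y, x]
    dyy = dog[y+1, x] + dog[y-1, x] - 2.0 * dog[y, x]
    dxy = (dog[y+1, x+1] - dog[y+1, x-1] - dog[y-1, x+1] + dog[y-1, x-1]) / 4.0
    trace, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:
        return False  # curvatures of opposite sign: reject
    # Keep the keypoint only if both principal curvatures are
    # comparable in size, i.e. the point is a "corner", not an edge.
    return trace * trace / det < (r + 1.0) ** 2 / r
```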
Keypoint Orientation Assignment
An “orientation collection region” is formed around the keypoint. Within this region, the most prominent orientation is calculated and assigned to that keypoint. For a Gaussian-blurred image L, the magnitude and orientation of each pixel in the region are calculated as:
m(x, y) = √[(L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²]
θ(x, y) = tan⁻¹[(L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y))]
where m(x, y) is the magnitude and θ(x, y) is the orientation.
A histogram is then created using the results:
Figure 7: Histogram of magnitude proportion and orientation bins
In the above histogram, the 360 degrees of orientation are broken into 36 bins of 10 degrees each. For example, if the gradient direction at a certain point in the “orientation collection region” is 18.759 degrees, it is classified under the 10–19 degree bin. The “amount” added to a bin is proportional to the magnitude of the gradient at that point. The orientation of the highest peak is then assigned to the keypoint being checked.
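A sketch of this binning, assuming the per-pixel magnitudes and orientations for the collection region have already been computed:

```python
import numpy as np

def dominant_orientation(magnitudes, orientations_deg):
    # 36 bins of 10 degrees each; every pixel votes for its bin with a
    # weight proportional to its gradient magnitude.
    hist = np.zeros(36)
    for m, theta in zip(magnitudes, orientations_deg):
        hist[int(theta % 360) // 10] += m
    return np.argmax(hist) * 10  # lower edge of the winning bin, in degrees
```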
SIFT feature generation
The last step is to generate a unique fingerprint for each of the remaining keypoints. A 16×16 window of sub-pixels around the keypoint is chosen and split into sixteen 4×4 windows. From each 4×4 window, a histogram of 8 bins is generated, with each bin corresponding to 0–44 degrees, 45–89 degrees, and so on. Gradient orientations from the 4×4 window are put into these bins. This is done for all sixteen windows, giving 4×4×8 = 128 values. Finally, the 128 values are normalised, giving the keypoint a unique feature vector.
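In practice, the whole pipeline above is available in OpenCV. A minimal sketch (OpenCV 4.4+, where SIFT lives in the main module; the file name is hypothetical):

```python
import cv2

gray = cv2.imread("reference_poster.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
# 'descriptors' is an N x 128 array: one normalised 128-value
# feature vector per surviving keypoint.
```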
SURF (Speeded-Up Robust Features)
SURF is a method inspired by the SIFT algorithm, with some differences in how the keypoints are determined.
Difference in Scale-space approximation
In SIFT, the image is scaled and blurred to form a Gaussian pyramid. In SURF, however, the image is filtered “using a box filter approximation of second-order Gaussian partial derivatives, since integral images allow the computation of rectangular box filters in near constant time” (Luo, Gwun).
Hessian Matrix-Based Keypoints
The SURF detector is based on the determinant of the Hessian matrix. If the determinant at a pixel is a local maximum, that pixel is taken as a keypoint. This method is accurate and much faster than SIFT’s.
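For comparison, a minimal SURF sketch; SURF is patented and, in OpenCV, lives in the contrib xfeatures2d module, so this assumes a contrib build with non-free algorithms enabled:

```python
import cv2

gray = cv2.imread("reference_poster.png", cv2.IMREAD_GRAYSCALE)
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
keypoints, descriptors = surf.detectAndCompute(gray, None)
# Keypoints are pixels where the Hessian-determinant response
# exceeds the threshold and is a local maximum.
```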
The FERNS algorithm follows the same steps as SIFT from the beginning up to and including the stage where keypoints are reduced; the two algorithms differ in how they assign a unique identity to the keypoints of an image.
The problem of matching keypoints of an object is cast as a classification problem. Firstly, the image is partitioned, using algorithms such as k-means clustering, to create patches around its keypoints. Ferns are then used to classify those patches. A visual representation of this approach is given in figure 8. A Semi-Naïve Bayesian approach is used for patch recognition between objects in the database and the image.
Figure 8: Visual representation of classifying patches
From the figure above, it can be seen that ferns are basically sequences of tests applied to a patch. These tests are randomly chosen binary tests on the intensity values around a keypoint.
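An illustrative sketch of one such fern, following the description above (the patch size and number of tests are arbitrary choices, not values from the FYP):

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH_SIZE, NUM_TESTS = 32, 10
# Each test compares the intensities at two randomly chosen positions
# inside the patch around a keypoint.
pairs = rng.integers(0, PATCH_SIZE, size=(NUM_TESTS, 4))

def fern_index(patch):
    # The sequence of binary test outcomes forms an index into the
    # fern's class-probability table; several ferns are combined in a
    # Semi-Naive Bayesian fashion to classify the patch.
    index = 0
    for y1, x1, y2, x2 in pairs:
        index = (index << 1) | int(patch[y1, x1] < patch[y2, x2])
    return index  # one of 2**NUM_TESTS possible leaf indices
```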
Comparison in terms of object detection
SIFT is the slowest of the three, as its methods are computationally intensive. SURF is fast because of its use of the Hessian matrix, while FERNS is fast due to its use of classification and approximation.
Performance
Figure 10: Ratio of successfully tracked images (Lieberknecht, Benhimane, Meier, Navab)
Figure 10 shows the ratio of successfully tracked images under different foci (Angle, Range, Fast Far, Fast Close, Illumination) using test images of different characteristics (low-textured, repetitive, normal, high-textured). Examples of the test images are in the appendix.
From this figure, SIFT clearly has the best performance. However, as mentioned before, SIFT is computationally intensive and relatively slow. The next best performer is the FERNS algorithm, which outperforms SURF in every aspect.
Hence, for this FYP, the Ferns algorithm is chosen.
As mentioned in the introduction to AR earlier, the camera pose is needed to correctly orient and transform any augmented virtual object. To calculate the camera pose, I have chosen to use the Efficient Perspective-n-Point (EPnP) algorithm.
The aim of PnP is to “determine the position and orientation of a camera given its intrinsic parameters and a set of n correspondences between 3D points and their 2D projections” (Moreno-Noguer, Lepetit, Fua). Because EPnP is non-iterative, it is relatively fast compared with iterative methods; it also has good accuracy and does not require an initial estimate.
The EPnP approach assumes a set of n reference points whose 3D coordinates in the real world and corresponding 2D image projections are both known, and proceeds as follows:
1. Retrieve their coordinates in the camera coordinate system
2. Express their coordinates as a weighted sum of virtual control points
3. The solution is expressed as a matrix M of size 2n × 12 or 2n × 9, which can be computed easily from the 3D points and their corresponding 2D projections
4. Appropriate weights can be obtained by solving small systems of quadratic equations
This approach is computationally efficient, as the most expensive part is the computation of MᵀM, and it is done only once. It is therefore chosen for this FYP.
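OpenCV exposes EPnP through its solvePnP function. A minimal sketch, with placeholder correspondences and intrinsics (a real run would use matched keypoints and the calibrated K):

```python
import cv2
import numpy as np

# n >= 4 correspondences between 3D world points and their 2D
# projections; random placeholders here, real data would come from
# feature matching against the database.
object_points = np.random.rand(6, 3).astype(np.float32)
image_points = np.random.rand(6, 2).astype(np.float32)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float32)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)  # 3x3 rotation matrix from the rotation vector
# (R | tvec) is the camera pose used to transform the virtual object.
```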
Augmenting is the super-imposition of a virtual object onto the image. In this FYP, I am to augment virtual objects onto a video. To do that, the video is processed frame by frame. Using the Ferns algorithm, keypoints in each frame are determined. When an object that exists in the database is detected in a frame, the camera pose at that instant is calculated through EPnP. The virtual object is then drawn onto the image, with its transformation given by the camera pose.
To draw a virtual object, a graphics library such as OpenGL is needed. For this FYP, OpenGL is used because it is an open standard and cross-platform, so development on the PC can later be moved to the mobile platform.
The frames of the video have to be displayed in an OpenGL window in order to display virtual objects on the frame. The transformation matrix of the virtual object is set to the camera position matrix calculated earlier. OpenGL only displays the video frames and the augmented virtual object; all other processing, such as tracking and camera-pose calculation, is done in the background by OpenCV and EPnP respectively.
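As a sketch of handing the estimated pose to OpenGL (a common pattern, not the FYP's exact code): the rotation R and translation tvec are packed into a 4×4 matrix, the y and z axes are flipped to convert from OpenCV's camera convention to OpenGL's, and the result is flattened column-major for glLoadMatrixf:

```python
import numpy as np

def pose_to_modelview(R, tvec):
    view = np.eye(4)
    view[:3, :3] = R                 # rotation from EPnP
    view[:3, 3] = tvec.ravel()       # translation from EPnP
    view[1, :] *= -1.0               # OpenCV y-down    -> OpenGL y-up
    view[2, :] *= -1.0               # OpenCV z-forward -> OpenGL z-back
    return view.T.ravel()            # column-major array for glLoadMatrixf
```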
The camera pose matrix estimated by EPnP is largely correct. However, when the result is applied to the transformation of the virtual object, the virtual object “jumps” around the screen. No cause has been found for this problem yet, so further investigation is needed.
When displaying the frames in the OpenGL window, there is some colour distortion. According to one of my supervisors, it is most probably due to the ordering of the colour channels. However, since the aim of that phase was to make the implementation work correctly, this problem has been temporarily set aside until the core functionalities are in working order.
Francesc Moreno-Noguer, Vincent Lepetit, Pascal Fua. Accurate Non-iterative Solution to the PnP Problem.
Luo Juan, Oubong Gwun. A Comparison of SIFT, PCA-SIFT and SURF.
Feng Zhou, Henry Been-Lirn Duh, Mark Billinghurst. Trends in Augmented Reality Tracking, Interaction and Display: A Review of Ten Years of ISMAR.
Sebastian Lieberknecht, Selim Benhimane, Peter Meier, Nassir Navab. A Dataset and Evaluation Methodology for Template-based Tracking Algorithms.
Figure 11: Reference images used in the dataset. From left: low, repetitive, normal and high textureness (Lieberknecht, Benhimane, Meier, Navab)
Figure 12: From top to bottom, illustration of “Angle”, “Range”, “Fast Far”, “Fast Close”, “Illumination” (Lieberknecht, Benhimane, Meier, Navab)