\documentclass[10pt,twocolumn,letterpaper]{article}
\usepackage{relsize}
\usepackage{fg}
\usepackage{times}
\usepackage{epsfig}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{bm}
\usepackage{textcomp}
\renewcommand*{\d}{\mathrm{d}}
\newcommand*{\dd}{\partial}
\newcommand*{\diffp}[2]{\ensuremath{\frac{\dd #1}{\dd #2}}}
\newcommand*{\diffpp}[3]{\ensuremath{\frac{\dd^2 #1}{\dd #2 \dd #3}}}
\newcommand*{\diffppp}[4]{\ensuremath{\frac{\dd^3 #1}{\dd #2 \dd #3 \dd #4}}}
\newcommand*{\difff}[2]{\ensuremath{\frac{\d #1}{\d #2}}}
\newcommand*{\diffff}[3]{\ensuremath{\frac{\d^2 #1}{\d #2 \d #3}}}
\newcommand*{\difffp}[3]{\ensuremath{\frac{\dd\d #1}{\d #2 \dd #3}}}
\newcommand*{\difffpp}[4]{\ensuremath{\frac{\dd^2\d #1}{\d #2 \dd #3 \dd #4}}}
\newcommand{\Matrix}[1]{\begin{bmatrix} #1 \end{bmatrix}}
\newcommand{\Vector}[1]{\Matrix{#1}}
\newcommand*{\SET}[1] {\ensuremath{\mathcal{#1}}}
\newcommand*{\MAT}[1] {\ensuremath{\mathbf{#1}}}
\newcommand*{\VEC}[1] {\ensuremath{\bm{#1}}}
\newcommand*{\CONST}[1]{\ensuremath{\mathit{#1}}}
\newcommand*{\norm}[1]{\mathopen\| #1 \mathclose\|}% use instead of $\|x\|$
\newcommand*{\abs}[1]{\mathopen| #1 \mathclose|}% use instead of $|x|$
\newcommand*{\absLR}[1]{\left| #1 \right|}% use instead of $|x|$
\newcommand*{\normLR}[1]{\left\| #1 \right\|}% use instead of $\|x\|$
% Include other packages here, before hyperref.
% If you comment hyperref and then uncomment it, you should delete
% egpaper.aux before re-running latex. (Or just hit 'q' on the first latex
% run, let it finish, and you should be clear).
\usepackage[pagebackref=true,breaklinks=true,letterpaper=true,colorlinks,bookmarks=false]{hyperref}
\fgfinalcopy % *** Uncomment this line for the final submission
\def\httilde{\mbox{\tt\raisebox{-.5ex}{\symbol{126}}}}
% Pages are numbered in submission mode, and unnumbered in camera-ready
\iffgfinal\pagestyle{empty}\fi
\begin{document}
%%%%%%%%% TITLE
\title{Expression Invariant 3D Face Recognition with a Morphable Model}
\author{Brian Amberg\\
{\tt\small brian.amberg@unibas.ch} \and
Reinhard Knothe\\
{\tt\small reinhard.knothe@unibas.ch} \and
Thomas Vetter\\
{\tt\small thomas.vetter@unibas.ch}
}
\maketitle
% \thispagestyle{empty}
%%%%%%%%% ABSTRACT
\begin{abstract}
We present an expression-invariant method for face recognition by fitting an
identity/expression separated 3D Morphable Model to shape data. The
expression model greatly improves recognition and retrieval rates in the
uncooperative setting, while achieving recognition rates on par with the best
recognition algorithms in the face recognition vendor test. The
fitting is performed with a robust nonrigid ICP algorithm. It is able to
perform face recognition in a fully automated scenario and on noisy data.
The system was evaluated on two datasets, one
with a high noise level and strong expressions, and the standard UND range
scan database, showing that while expression invariance increases recognition
and retrieval performance for the expression dataset, it does not decrease
performance on the neutral dataset. The high recognition rates are achieved
even with a purely shape based method, without taking image data into
account.
\end{abstract}
%%%%%%%%% BODY TEXT
\section{Introduction}
We present a system which uses shape information from a 3D scanner to
perform automated face recognition. The main novelty of the system is its
invariance to expressions. The system is tested on two
public datasets. It is fully automatic and can handle the typical artifacts of
3D scanners, namely outliers and missing regions. Face recognition in this
setting is a difficult task, and difficult tasks benefit from strong prior knowledge.
To introduce the prior knowledge we use a 3D Morphable Model
(3DMM)~\cite{blanz:model}, which is a generative statistical model of 3D faces.
3DMMs have been applied successfully for face recognition on different
modalities. The most challenging setting is recognition from single images
under varying pose and illumination. This was addressed
by~\cite{blanz03:face_rec,romdhani:recognition}. There, a 3DMM with shape,
texture and illumination model was fit to probe and gallery images. As the
model separates shape and albedo parameters from pose and lighting, it enables
pose and lighting-invariant recognition. We use the same idea for
expression-invariant face recognition from 3D shape. We fit an identity/expression
separating
3DMM~\cite{blanz03:expression} to shape data and normalize the
resulting face by removing the pose and expression components. See
Figure~\ref{fig:fitting} for an example of expression normalization. The
expression and pose normalized data then allow efficient and effective
recognition. A 3DMM has been fitted to range data before~\cite{blanz07:range}
and the results were even evaluated on part of the UND database. Our approach
differs from this work in the fitting method employed, which is independent of
the acquisition device, and in the use of an expression model to improve face
recognition. Additionally, our method is fully automatic,
while~\cite{blanz07:range} needed seven manually selected landmarks.
\begin{figure}
\vspace{-0.5em}
\begin{tabular}{@{ }c@{ }c@{ }c@{ }c@{}}
\includegraphics[height=0.42\linewidth]{16_1_tgt}&
\includegraphics[height=0.42\linewidth]{16_1_expression}&
\includegraphics[height=0.42\linewidth]{16_1_neutral}\\[-0.8em]
\smaller a) Target & \smaller b) Fit & \smaller c) Normalized\\[0.8em]
\includegraphics[height=0.42\linewidth]{16_6_tgt}&
\includegraphics[height=0.42\linewidth]{16_6_expression}&
\includegraphics[height=0.42\linewidth]{16_6_neutral}\\[-0.8em]
\smaller a) Target & \smaller b) Fit & \smaller c) Normalized
\end{tabular}
\vspace{0.5em}
\caption{Expression normalisation for two scans of the same individual.
The robust fitting gives a good estimate (b) of the true face surface given
the noisy measurement (a). It fills in holes and removes artifacts using
prior knowledge from the face model. The pose and expression normalized faces
(c) are used for face recognition.
}
\label{fig:fitting}
\end{figure}
Expression-invariant recognition for shape data was also approached in
\cite{xiaoguang06:face_matching}, where a person specific 3D Morphable
Expression Model was learned for each subject in the gallery. In contrast, we
are using a general 3DMM learned from an independent database of face shapes
which can be applied without any relearning to a new scan. This makes the
enrollment phase trivial and the recognition phase effectively constant in the
size of the gallery while still being accurate. We have to fit just one
model to the probe, which can then be compared efficiently to
the enrolled subjects, by comparing their coefficients in the low dimensional
face space. While the number of comparisons is still at most linear in the
number of examples (and can be made sublinear with an indexing method), the time
it takes to compare coefficients in face space is negligible compared to
fitting time.
%
Model-less approaches which align the probe to each example in the database
using e.g.\ ICP~\cite{bowyer05:icp_recognition} suffer from the same problem
as~\cite{xiaoguang06:face_matching}.
Because the probe has to be aligned with each gallery scan, these methods scale
linearly in the gallery size, while our model based approach needs only a
single fit to the probe.
Another interesting model-less approach~\cite{bronstein05:face_rec} compares
surfaces by the distribution of geodesics, which stays constant for nonrigidly
deforming (but not stretching or tearing) objects. This approach is difficult
to apply in this setting though, as the scanning produces holes, disconnected
regions and strong noise, which can best be handled by a method which uses
specific information about the object class.
\section{Model}
A PCA model~\cite{blanz:model} built from 175 subjects was used. It was built
from one neutral expression face scan per identity and 50 expression scans of a
subset of the subjects. The data was registered with a modification
of~\cite{amberg07:nicp}.
The identity model consists of a mean shape $\VEC\mu$ and a matrix of offset
vectors $\MAT M_n$ such that a new face instance $\VEC f$ is generated from a
vector of coefficients $\VEC\alpha_n$ as
\begin{align}
\VEC f&=\VEC\mu + \MAT M_n\VEC\alpha_n\qquad.
\end{align}
The model is constructed such that the $\alpha_i$ are independently normally
distributed with zero mean and unit variance under the standard assumption of a
Gaussian distribution of the data. This was done by performing PCA
on the data matrix built from the mean free shape vectors.
Additionally, for each of the 50 expression scans, we calculated an expression
vector as the difference between the expression scan and the corresponding
neutral scan of that subject.
This data is already mode-centered, if we regard the neutral
expression as the natural mode of expression data. On these offset vectors
again PCA was applied to get an expression matrix $\MAT M_e$ and
expression coefficients $\VEC\alpha_e$, such that the complete expression model is
\begin{align}
\VEC f&=\VEC\mu + \MAT M_n\VEC\alpha_n + \MAT M_e\VEC\alpha_e
=\VEC\mu + \MAT M\VEC\alpha\qquad,\\
\MAT M &= \Matrix{\MAT M_n &|& \MAT M_e} \qquad \VEC\alpha = \Matrix{\VEC\alpha_n \\ \VEC\alpha_e}\qquad.
\end{align}
The basic assumption of this paper is that the face and expression spaces are
linearly independent, such that each face is represented by a unique set of
coefficients. While the resulting expression and identity matrices are not
perfectly orthogonal, they have little overlap, which together with the
regularisation employed is sufficient for this application. We assume that the
overlap between the spaces is due to the fact that it is impossible to acquire
perfectly consistent neutral expressions.
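With this separation, expression normalization amounts to discarding the
estimated pose and expression coefficients after fitting. As a sketch, writing
$\hat{\VEC\alpha}_n$ for the fitted identity coefficients and
$\VEC f_{\text{norm}}$ for the normalized face (both symbols are introduced
here for illustration only), the normalized shape is
\begin{align}
\VEC f_{\text{norm}} &= \VEC\mu + \MAT M_n\hat{\VEC\alpha}_n\qquad,
\end{align}
i.e.\ the full model evaluated with $\VEC\alpha_e=\VEC 0$ in the coordinate
frame of the model.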
We use the registered scans and a mirrored version of each registered scan to
increase the variability of the model. Mirroring doubles the number of training
shapes, which allows us to calculate a model with more than 175 neutral
coefficients.
\section{Fitting}
The fitting algorithm used in this paper is a variant of the nonrigid ICP work
in~\cite{amberg07:nicp}. The main difference is that the deformation model is
a statistical model and the optimisation in each step is an iterative method,
which finds the minimum of a convex function. Additionally, as it is applied on
noisy data (see Figure~\ref{fig:difficult}), we included a more elaborate robust weighting term. Like other
ICP methods, it is a local optimization method, which does not guarantee
convergence to the global minimum, but is dependent on the initialization. It
consists of the following steps:
\begin{itemize}
\item Iterate over regularization values $\theta_1>\dots>\theta_N$:
\begin{itemize}
\item Repeat until convergence:
\begin{enumerate}
\item Find candidate correspondences by searching for the closest compatible
point for each model vertex.
\item Weight the correspondences by their distance using a robust estimator.
\item Fit the 3DMM to these correspondences using a
regularization strength of $\theta_i$\label{step_fit}.
\item Continue with the lower $\theta_{i+1}$ if the median change in vertex
position is smaller than a threshold.
\end{enumerate}
\end{itemize}
\end{itemize}
\begin{figure}
\vspace{-1.0em}
\begin{tabular}{@{ }c@{ }c@{ }c@{ }c@{}}
\includegraphics[height=0.42\linewidth]{56_4_tgt}&
\includegraphics[height=0.42\linewidth]{23_2_tgt}&
\includegraphics[height=0.42\linewidth]{5_6_tgt}\\[-1.0em]
& \smaller a) Targets & \\[0.2em]
\includegraphics[height=0.42\linewidth]{56_4_expression}&
\includegraphics[height=0.42\linewidth]{23_2_expression}&
\includegraphics[height=0.42\linewidth]{5_6_expression}\\[-0.8em]
& \smaller b) Fits &
\end{tabular}
\vspace{0.2em}
\caption{The reconstruction (b) is robust against scans (a) with artifacts, noise, and holes.}
\label{fig:difficult}
\end{figure}
The search for the closest compatible point takes only points into account which
have conforming normals, are closer than a threshold, and are not on or close
to the border of the scan. This has the effect of removing many outliers. The
search is sped up by organizing the target scan in a space partitioning tree
made up of spheres.
The correspondences are then weighted with a robust function by their
residual distance. The robust function is linear for distances smaller than
$2$mm, behaves like $1/x$ between $2$mm and
$20$mm, and is zero for a distance larger than $20$mm.
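As an illustration only, one function with these three regimes can be written
as
\begin{align}
\psi(d) &= \begin{cases}
d & d < 2\\
4/d & 2 \le d \le 20\\
0 & d > 20
\end{cases}\qquad,
\end{align}
where $d$ is the residual distance in millimetres. The symbol $\psi$ and the
constant $4$, chosen so that the first two branches agree at $d=2$, are
assumptions made for this sketch; the exact function and constants used in the
experiments are not specified here.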
Note that it is necessary to balance robustness and regularization, as the
right balance depends on the noise characteristic of the data. Suitable values
were determined manually from a few scans of the GavabDB database and kept
constant for all experiments, on the GavabDB as well as on the UND database. In
step~\ref{step_fit} the 3DMM is fit to 3D-3D point correspondences. This is
done with a Gauss-Newton least squares optimization, using an analytic Jacobian
and Gauss-Newton Hessian approximation. Denote the correspondence points by
$\MAT u=\Matrix{\VEC u_1, \dots, \VEC u_n}$ and the rows of the model which
correspond to the $i$th vertex by subscript $i$, then we can write the cost
function minimized in this step as
\begin{align}
f(\MAT R, \VEC t, \VEC\alpha) &= \sum_i \normLR{\MAT R( \VEC\mu_i + \MAT M_i\VEC\alpha) + \VEC t - \VEC u_i}^2 + \lambda\normLR{\VEC\alpha}^2\qquad.\label{eqn:mincost}
\end{align}
%We make the norm dependent on the target normal by using an orthonormal
%covariance matrix $\MAT C_i$ per vertex, which makes the cost of deviation
%along the normal higher than deviations inside the target surface.
%\begin{align}
% \MAT C_i &= \Matrix{ \VEC n_i^T\\ \nu \VEC a_i^T\\\nu\VEC b_i^T} & \VEC n_i &\bot \VEC a_i \bot \VEC b_i \bot \VEC n_i
%\end{align}
%where $\VEC n_i$ is the normal of the target correspondence and $\nu$ is an
%anisotropy parameter. If we do not use the anisotropic distance measure (i.e.
%$\MAT C_i=\MAT I$), then the cost function Equation~\ref{eqn:mincost} can be
%minimized more efficiently by changing it to
This can be minimized more efficiently by changing the direction of the rigid transform to
\begin{align}
f(\MAT R, \VEC t, \VEC\alpha) &= \sum_i \normLR{ \VEC\mu_i + \MAT M_i\VEC\alpha + {\VEC t'} - {\MAT R'}\VEC u_i }^2 + \lambda\normLR{\VEC\alpha}^2\nonumber\\
{\VEC t'} &= \MAT R^{-1}\VEC t\qquad {\MAT R'} = \MAT R^{-1}\qquad,
\end{align}
because then the Jacobian consists of a large constant part and three columns
which depend on the iteration.
\begin{align}
F_i &= \VEC\mu_i + \MAT M_i\VEC\alpha + {\VEC t'} - {\MAT R'_{r_1,r_2,r_3}}\VEC u_i\\
\diffp{F_i}{\VEC\alpha} &= \MAT M_i\qquad
\diffp{F_i}{\VEC t'} = \MAT I_3\qquad
\diffp{F_i}{r_j} = -\diffp{\MAT R'_{r_1,r_2,r_3}}{r_j}\VEC u_i\\
\MAT J &= \Matrix{\MAT J_c & | & \MAT J_d }\\
\MAT J_c &= \Matrix{\MAT M & \VEC 1 \otimes \MAT I_3\\ \MAT I & \MAT 0}\\
\MAT J_d &= -\Matrix{(\MAT I \otimes \diffp{\MAT R'}{r_1})\MAT u^T & (\MAT I \otimes \diffp{\MAT R'}{r_2})\MAT u^T& (\MAT I \otimes \diffp{\MAT R'}{r_3})\MAT u^T\\\MAT 0 & \MAT 0 & \MAT 0}
\end{align}
Accordingly, the Hessian can be approximated as
\begin{align}
\MAT H &= \Matrix{
\MAT J_c^T\MAT J_c & \MAT J_c^T\MAT J_d\\
(\MAT J_c^T\MAT J_d)^T & \MAT J_d^T\MAT J_d
}\qquad.
\end{align}
By precalculating the constant parts of the matrices we can remove most of the
computation time, making step~\ref{step_fit} very fast.
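For reference, with this approximation a Gauss-Newton iteration takes the
standard form
\begin{align}
\VEC x &\leftarrow \VEC x - \MAT H^{-1}\MAT J^T\VEC F\qquad,
\end{align}
where $\VEC x$ stacks $\VEC\alpha$, $\VEC t'$ and the rotation parameters
$r_1, r_2, r_3$, and $\VEC F$ stacks the residuals $F_i$ together with the
regularization residual. The symbols $\VEC x$ and $\VEC F$ are introduced here
for illustration; this is the textbook update and omits any step size control
or damping an implementation may use. Only the blocks of $\MAT H$ involving
$\MAT J_d$ change between iterations, so $\MAT J_c^T\MAT J_c$ does not have to
be recomputed within the optimisation of step~\ref{step_fit}.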
We initialize the registration by locating the tip of the nose with a
heuristic, which assumes that the head is upright and looking into the camera.
This initialization is good enough for a fully automatic
fit, as the fitting behaves like rigid ICP in the beginning, and rigid ICP is
known to have a large basin of convergence.
\section{Experiments}
\begin{figure*}
\begin{tabular}{cc}
\scalebox{0.82}{\input{shrec_MNCG}} &
\scalebox{0.82}{\input{und_MNCG}}
\end{tabular}
\caption{For the expression dataset the retrieval rate is improved by
including the expression model, while for the neutral expression dataset the
performance does not decrease. Plotted is the mean normalized cumulative
gain, which is the number of retrieved correct answers divided by the number
of possible correct answers. Note also the different scales of the MNCG
curves for the two datasets. Our approach has a high accuracy on the
neutral (UND) dataset.}
\label{fig:mcg}
\end{figure*}
\begin{figure*}
\begin{tabular}{cc}
\scalebox{0.82}{\input{shrec_PR}} &
\scalebox{0.82}{\input{und_PR}}
\end{tabular}
\caption{Use of the expression model improves retrieval performance.
Plotted are precision and recall for different retrieval depths. The lower
precision of the UND database is due to the fact that some queries have no
correct answers. For the UND database we achieve total recall when querying
nine answers, while the maximal number of scans per individual is eight.
For the GavabDB database the expression model gives a strong
improvement in recall rate but full recall cannot be achieved.}
\label{fig:precision_expression}
\end{figure*}
\begin{figure*}
\begin{tabular}{cc}
\scalebox{0.82}{\input{shrec_FARFRR}} &
\scalebox{0.82}{\input{und_FARFRR}}
\end{tabular}
\caption{Impostor detection is reliable, as the minimum distance to a match
is smaller than the minimum distance to a nonmatch. Note the vast increase in
recognition performance with the expression model on the expression database,
and the fact that the recognition rate is not decreasing on the neutral
database, even though we added expression invariance. We can operate at $0$\%
false acceptance rate with less than $4$\% false rejection rate, or less than
$1$\%\ FAR with less than $1$\%\ FRR.}
\label{fig:impostor}
\end{figure*}
We evaluated the system on two databases with and without
the expression model. We used the GavabDB~\cite{gavabdb} database and the
UND~\cite{bowyer05:2d3d_recognition} database. For both databases, only the shape information was
used. The GavabDB database contains 427 scans, with seven scans per ID, three
neutral and four expressions. The expressions in this dataset vary
considerably, including sticking out the tongue and strong facial distortions.
Additionally it has strong artifacts due to facial hair, motion and the bad
scanner quality. This dataset is typical for a non-cooperative environment.
The UND database was used in the face recognition grand challenge~\cite{frvt06} and consists
of 953 scans, with one to eight scans per ID. It is of better quality and
contains only slight expression variations. It represents a cooperative
scenario.
The fitting was initialized by detecting the nose, and assuming that the face is
upright and looking along the $z$-axis. To detect the nose we
first removed the spike artifacts typical of range scanners by repeated
min-filtering and removal of large triangles, then we detected the vertex with
the smallest depth, which in its horizontal slice is sufficiently closer to the
camera than the other pixels in that slice. For the UND dataset this reliably
gives us a point on the tip or ridge of the nose. The heuristic worked for 939
out of 953 scans; in the remaining 16 scans we marked the nose manually. The
GavabDB database has the scans already aligned and the tip of the nose is at
the origin. We used this information for the GavabDB experiments. The same
regularisation parameters were used for all experiments, even though the
GavabDB data is more noisy than the UND data. The parameters were set manually
based on a few scans from the GavabDB database. We used 100 principal identity
components and 30 expression components for all experiments.
In the experiments the distances between all scans were calculated, and we
measured recognition and retrieval rates by treating every scan once as the
probe and all other scans as the gallery. Both databases were used
independently.
\subsection{Retrieval Measures}
We measure similarity between faces in parameter space as the angle between the
face parameters in Mahalanobis space, which has proven to have high recognition
rates~\cite{blanz03:face_rec}. The distance measure is
\begin{align}
s(\VEC\alpha_1, \VEC\alpha_2) &= \arccos\left(\frac{\VEC\alpha_1^T\VEC\alpha_2}{\norm{\VEC\alpha_1}\norm{\VEC\alpha_2}}\right)\qquad.
\end{align}
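Because the model coefficients are scaled to unit variance, the Mahalanobis
distance between two coefficient vectors is simply their Euclidean distance,
whereas the angle ignores their lengths. As a short sketch (the symbol $d_M$
and the scale factor $c$ are introduced here for illustration):
\begin{align}
d_M(\VEC\alpha_1, \VEC\alpha_2) &= \norm{\VEC\alpha_1 - \VEC\alpha_2}\qquad,&
s(c\,\VEC\alpha_1, \VEC\alpha_2) &= s(\VEC\alpha_1, \VEC\alpha_2)\quad\text{for } c>0\qquad.
\end{align}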
We observed that the angular measure gives slightly higher recognition rates
than the Mahalanobis distance. The Mahalanobis angle has the effect of
regarding all caricatures of a face, which lie on a ray from the origin towards
any identity, as the same identity. We also evaluated other measures, but found
them to be consistently worse than the Mahalanobis angle.
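The retrieval curves reported in the figures can be written compactly. As a
sketch in notation introduced here for illustration, let $c_q(k)$ be the number
of correct scans among the $k$ nearest gallery scans of probe $q$, and $n_q$
the number of gallery scans showing the same individual as $q$. Following the
standard definitions of precision and recall, and the description of the gain
in Figure~\ref{fig:mcg}, we then have
\begin{align}
\mathrm{precision}_q(k) &= \frac{c_q(k)}{k}\qquad
\mathrm{recall}_q(k) = \frac{c_q(k)}{n_q}\qquad
\mathrm{NCG}_q(k) = \frac{c_q(k)}{\min(k, n_q)}\qquad.
\end{align}
We assume here that the plotted curves are averages of these quantities over
the probes; how probes with $n_q=0$ enter the averages is left open in this
sketch.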
\subsection{Results}
As expected, the two datasets behave differently because of the presence of
expressions in the examples.
\subsubsection{UND}
For the UND database we have good recognition rates with the neutral
model. The mean normalized cumulative gain curve in
Figure~\ref{fig:mcg} shows for varying retrieval depth the number of
correctly retrieved scans divided by the maximal number of scans that could be
retrieved at this level. From this it can be seen that the first match is
always the correct match, if there is any match in the database. But for some
probes no example is in the gallery. Therefore for face recognition we have to
threshold the maximum allowed distance to be able to reject impostors. Varying
the distance threshold leads to varying false acceptance rates (FAR) and false
rejection rates (FRR), which are shown in Figure~\ref{fig:impostor}. Even
though we have been tuning the model to the GavabDB dataset and not the UND
dataset, our recognition rates at any FAR are as good as or better than the
best results from the face recognition vendor test. This shows that our basic
face recognition method without expression modelling gives convincing results.
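For reference, the textbook pairwise definitions of these error rates, in
notation introduced here for illustration, are
\begin{align}
\mathrm{FAR}(\tau) &= \frac{\abs{\{(p,g)\in\SET I : s(\VEC\alpha_p,\VEC\alpha_g)\le\tau\}}}{\abs{\SET I}}\qquad
\mathrm{FRR}(\tau) = \frac{\abs{\{(p,g)\in\SET G : s(\VEC\alpha_p,\VEC\alpha_g)>\tau\}}}{\abs{\SET G}}\qquad,
\end{align}
where $\SET G$ and $\SET I$ denote the sets of probe/gallery pairs showing the
same and different individuals, respectively, and $\tau$ is the distance
threshold. Whether the curves in Figure~\ref{fig:impostor} are computed over
all pairs or only over the nearest gallery scan of each probe is not fixed by
this sketch.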
Now we analyze how the expression modelling impacts recognition results on this
expression-less database. If face and expression space are not orthogonal, then
adding invariance towards expressions should make the recognition rates
decrease. In fact, we observe that the recognition results are slightly lower,
but only by a marginal amount, and still on par with the results from the face
recognition vendor test. Let us now turn towards the expression database, where
we expect to see an increase in recognition rate due to the expression model.
\subsubsection{GavabDB}
The recognition rates on the GavabDB without expression model are not quite as
good as for the expression-less UND dataset, so here we hope to find some
improvement by using expression normalization. And indeed, the closest point
recognition rate with only the neutral model is 96.25\% which can be improved
to 98.36\% by adding the expression model. Also the FAR/FRR values decrease
considerably. The largest improvement can be seen in retrieval performance,
displayed in the precision recall curves in
Figure~\ref{fig:precision_expression} and mean normalized cumulative gain
curves in Figure~\ref{fig:mcg}. This is because there are multiple examples in
the gallery, so finding a single match is relatively easy. But retrieving all
examples from the database, even those with strong expressions, is only made
possible by the expression model.
%\emph{TODO: Try also $k$-NN, that should give 100\% recognition rate on the
%GavabDB too.}
\section{Speed}
Though the method as presented takes approximately 40 seconds per
query, it has the potential for speedup. It is possible to parallelize the
closest point estimation and the optimisation, and more elaborate fitting
algorithms including multiresolution schemes can be developed. The speed also
depends on the number of vertices and components; for the results presented
here, 11000 vertices and 100 neutral plus 30 expression components were fitted.
\section{Conclusion}
We have used a 3D Morphable Model with a separating expression model to develop
an expression-invariant face recognition algorithm. We have shown that the
system has excellent recognition rates on difficult expression data and data
taken in a cooperative environment. The introduction of expression invariance
did not incur a significant loss of precision on easier neutral data. The strong prior
knowledge of the 3DMM allows robust handling of noisy data and allowed us to
build a fully automatic face recognition system. We also introduced a relatively
efficient fitting algorithm, which, as it has the potential for
parallelisation, could be made even faster.
Note that, as we do establish correspondence between the model and the scans,
it is trivial to add image based classification for datasets where a calibrated
photo is available. This can be done by comparing the rectified textures,
which should result in even higher recognition rates. It is also important to
note that the expression normalization described here for range data can be
applied equally well to other modalities, using any of the proposed 3DMM
fitting algorithms.
In the future we plan to include the additional texture cues and make the
method faster, such that it is applicable in real world scenarios where a
processing time of 40 seconds per probe is still a problem. Furthermore we
would like to investigate more sophisticated fitting algorithms and a morphable
model with a larger expression space.
%\section*{Acknowledgement}
%The authors wish to thank P.\ Paysan for the data
%acquisition. This work was supported in part by a grant from Microsoft
%Research and the Swiss National Science Foundation (200021-103814 and NCCR COME project 5005-66380).
{\small
\bibliographystyle{ieee}
%%use following if all content of bibtex file should be shown
%\nocite{*}
\bibliography{shrec_08}
}
\end{document}