Foreground/Background Segmentation with Learned Dictionary
CIPRIAN DAVID, VASILE GUI, FLORIN ALEXA
Electronics and Telecommunications Faculty
“Politehnica” University of Timisoara
No. 2, Blvd. Vasile Parvan, 300223, Timisoara
ROMANIA
*********************.ro, ******************.ro, ********************.ro
Abstract: - Video sequences are viewed as a temporal collection of inverse problems. This parallel with the classical inverse problem of denoising leads us to investigate a sparse-representation-based approach to background subtraction. A global dictionary is trained with a k-means classifier, and a set of coefficients is estimated with the matching pursuit method. A background estimate is computed for each frame as a linear combination of dictionary vectors (atoms) weighted by these coefficients, and the foreground-background segmentation is obtained from it. The global dictionary and the coefficients are propagated and updated along the sequence. The approach yields surprisingly good preliminary results, which encourage further investigation of possible extensions of the algorithm.
Key-Words: - background subtraction, segmentation, sparse representations, video.
1 Introduction
This paper addresses the problem of foreground-background segmentation in video sequences. Video surveillance, human-computer interfacing, industrial monitoring and artificial vision are a few examples of applications in which foreground-background segmentation is an early processing task. The goal of this segmentation is a clear two-way partition of each frame: on one hand the static part of the image (background), and on the other hand the moving objects of the video sequence (foreground).
An intensively researched direction in foreground-background segmentation is background subtraction. The goal is to estimate, from a temporal sequence of frames, a robust background model that is invariant to illumination changes, shadows and other environmental changes that might appear. A trivial solution is to use the previous frames as background models, but its lack of robustness makes it unsuitable for background estimation.
More robust models associate a probability density function (usually a normal distribution) with each pixel location [10, 16]. Other approaches apply a temporal median filter to previous frames in order to estimate the background [6, 18]. The main drawback of these approaches is the unrealistic assumption of a simple background model. A wide variety of dynamic backgrounds (e.g. waves, clouds) do not fit this model.
Dynamic backgrounds are handled better by a mixture of Gaussians model [15, 11, 1]. This method offers relatively high robustness and computing speed, but its parameters need careful tuning and the model is very sensitive to sudden changes in illumination.
Better adaptation to fast changes is obtained with the nonparametric kernel density estimation model [8]. The model takes into account the recent history of the pixel value by fitting a kernel estimator at each recent time frame. Although considerable improvements are reported with this model [4, 13, 9, 5, 14], nonparametric approaches have a relatively high memory load and their selectivity is still questionable in some cases.
Due to its duality, the foreground-background segmentation problem can roughly be viewed as a denoising problem. We try to recover a clean representation of the background, initially "corrupted" by the foreground, and thus achieve the segmentation.
The present paper proposes a foreground-background segmentation method based on sparse representations. In general, the background of a video sequence is well represented by a slowly changing scene, which suggests a certain amount of redundancy and sparsity, at least between the current frame and the recent previous ones.
Sparse representation over learned dictionaries has been an intensively studied subject in recent years [2, 7]. Originating in the inverse problem of image denoising, this topic has also found many applications in vision tasks [17, 3].
The structure of the paper is as follows. Section 2 presents the principles of sparse representations and their use in image denoising. Our adaptation of sparse representations to the foreground-background segmentation of video sequences is presented in section 3. Some promising preliminary results obtained on video sequences are shown and analyzed in section 4. Section 5 concludes and outlines some ideas for future work.
2 Sparse Representations
This section gives a short overview of the use of sparse representations over learned dictionaries, as originally intended for image denoising.
The inverse problem deals with the recovery of an ideal image from a corrupted observed version of it. Usually this problem arises in the case of an image degraded by additive noise. The observed image is, thus:
$$Y = X + N \qquad (1)$$
where $X$ denotes the image to recover, $N$ denotes the noise and $Y$ the observed image. The basic assumption of sparse-representation-based methods is that the image can be represented as a linear combination of vectors from a dictionary.
In order to recover the noiseless image, the following type of functional is minimized:
$$\|X - Y\|_2^2 + \|D\alpha - X\|_2^2 + \|\alpha\|_0 \qquad (2)$$
The first term ensures that the recovered image is similar to the observed version and that no major structural changes are introduced. The second term ensures that the denoised image is approximated by a linear combination of the vectors of a dictionary $D$ with coefficients $\alpha$. Finally, the third term forces a certain degree of sparsity of the coefficients: $\|\alpha\|_0$ counts the non-zero coefficients. This means that the recovered image is represented with the smallest possible number of vectors from the dictionary. The functional above is minimized iteratively in three steps. In the first step, a matching pursuit type algorithm is used to estimate the coefficients of the linear decomposition of the denoised image over the dictionary $D$:
$$\hat{\alpha} = \arg\min_{\alpha} \left\{ \|D\alpha - X\|_2^2 + \|\alpha\|_0 \right\} \qquad (3)$$
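The coefficient estimation in (3) can be illustrated with a minimal sketch of basic matching pursuit, written here in Python with NumPy. The function name `matching_pursuit`, the parameter `n_coefs` and the requirement that the dictionary columns be unit-norm atoms are assumptions made for this illustration, not details given in the text.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_coefs=10):
    """Basic matching pursuit: greedily pick one atom per iteration.

    signal:      1-D array of length n (a vectorized patch)
    dictionary:  (n, K) array whose columns are assumed to be unit-norm atoms
    Returns a list of (atom_index, coefficient) pairs and the final residual.
    """
    residual = np.array(signal, dtype=float)        # working copy of the signal
    picks = []
    for _ in range(n_coefs):
        correlations = dictionary.T @ residual      # inner product with every atom
        k = int(np.argmax(np.abs(correlations)))    # best-matching atom
        c = float(correlations[k])
        if c == 0.0:                                # nothing left to explain
            break
        picks.append((k, c))
        residual -= c * dictionary[:, k]            # remove that atom's contribution
    return picks, residual
```

Because each iteration selects the atom most correlated with the current residual, the same atom may be selected more than once, and the number of iterations bounds the number of non-zero coefficients.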
Second, the dictionary is updated. Finally, the last step updates the denoised image $X$:
$$\hat{X} = \arg\min_{X} \left\{ \|X - Y\|_2^2 + \|D\alpha - X\|_2^2 \right\} \qquad (4)$$
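As a worked step, and assuming the unweighted form of the functional in (4), the update of X has a closed form. For fixed D and α the functional is quadratic in X, so setting its gradient to zero gives the midpoint between the observation and the current sparse approximation:

```latex
\begin{aligned}
\nabla_X \left( \|X - Y\|_2^2 + \|D\alpha - X\|_2^2 \right)
  &= 2\,(X - Y) + 2\,(X - D\alpha) = 0 \\
\Rightarrow \quad \hat{X} &= \tfrac{1}{2}\,\left( Y + D\alpha \right)
\end{aligned}
```

If the two terms carried different weights, the result would be the corresponding weighted average of Y and Dα.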
An important choice is the construction of the dictionary $D$. In [7], two choices are analyzed: the dictionary can be obtained by training on a database or by training directly on the observed image. Note that the algorithm presented above is applied to small rectangular (or square) patches extracted from the observed image. The denoised patches then form the final denoised image. This also makes clear how the dictionary can be trained on the observed image. It also means that for each image patch there is a set of coefficients, $\alpha_{ij}$, for the linear representation of the denoised patch, and thus the functional to be minimized becomes:
$$\|X - Y\|_2^2 + \sum_{ij} \left( \|D\alpha_{ij} - X_{ij}\|_2^2 + \|\alpha_{ij}\|_0 \right) \qquad (5)$$
where $X_{ij}$ denotes the patch extracted from $X$ at the location $(i, j)$.
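The patch decomposition can be sketched as follows. This is a minimal illustration that assumes non-overlapping square patches and a gray-level image stored as a 2-D NumPy array; the text does not say whether the patches overlap, and the names `extract_patches` and `assemble_patches` are hypothetical.

```python
import numpy as np

def extract_patches(image, size=8):
    """Split a 2-D image into non-overlapping size x size patches.

    Returns an array of shape (rows, cols, size*size); entry [i, j] is the
    vectorized patch X_ij. Border pixels that do not fill a patch are dropped.
    """
    h, w = image.shape
    rows, cols = h // size, w // size
    cropped = image[:rows * size, :cols * size]
    # Axes after reshape: (patch row, pixel row, patch col, pixel col).
    blocks = cropped.reshape(rows, size, cols, size).swapaxes(1, 2)
    return blocks.reshape(rows, cols, size * size)

def assemble_patches(patches, size=8):
    """Inverse of extract_patches: tile the patches back into an image."""
    rows, cols, _ = patches.shape
    blocks = patches.reshape(rows, cols, size, size).swapaxes(1, 2)
    return blocks.reshape(rows * size, cols * size)
```

Each vectorized patch can then be decomposed over the dictionary independently, and the processed patches are tiled back to form the output image.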
3 Foreground - Background Segmentation
In the foreground-background segmentation of a video sequence, there is a certain amount of redundancy due to the static character of the background or, in the worst case, to its slow change over a certain length of time. This leads us to consider the sparse representation paradigm for background subtraction. Furthermore, the task of recovering, or at least identifying, the background for segmentation is in a certain way similar to the inverse problem of denoising.
Considering the case of a perfectly static background we can write:
$$Y_t = X^b + X^f_t \qquad (6)$$
where $Y_t$ is the frame observed at time $t$, composed of the background image, $X^b$, and the moving foreground, $X^f_t$. Notice that the background carries no time index, since it is assumed static. It is very important to notice the difference between the two cases concerning the corrupting signal: Gaussian noise in the case of image denoising, and a completely unpredictable signal, the foreground, in the case of video sequences. This is yet another reason to consider the background as the image to recover.
Placing the task of background subtraction in a sparse representation context, we can write the following functional to minimize:
$$\|X^b - Y_t\|_2^2 + \|D\alpha - X^b\|_2^2 + \|\alpha\|_0 \qquad (7)$$
Notice that the first term has lost its earlier meaning: it is in fact strongly related to the foreground at time $t$. Due to the different nature of the two corrupting signals (noise and foreground), we cannot expect, as in the case of image denoising, a similarity between the observed image and the recovered background. This is the main reason to drop this term from the functional, which gives the following expression to minimize:
$$\hat{X}^b = \arg\min \left\{ \|D\alpha - X^b\|_2^2 + \|\alpha\|_0 \right\} \qquad (8)$$
The minimization algorithm above must be initialized with a known, or at least estimated, version of the background, $X^b$.
Our option is to consider a patch-wise average of recent frames as an initial background estimate. Equation (8) becomes:
$$\hat{X}^b = \arg\min \left\{ \Big\| D\alpha - \sum_k Y_{t-k} \Big\|_2^2 + \|\alpha\|_0 \right\} \qquad (9)$$
Based on this last equation, the approach we propose here is to iteratively estimate the background using the strategy imposed by equation (9), and to compare it to the observed image at each moment in time.
The success of the approach is in direct relation with the capacity to robustly train the dictionary and find suitable sets of coefficients for estimating the background. In order to increase this chance, a patch strategy is adopted.
In training the dictionary we exploit the redundancy of consecutive frames by training an initial global dictionary with a k-means classifier. The set of coefficients is obtained by applying a matching pursuit type algorithm. Our choice is the basic matching pursuit [12].
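A k-means dictionary of this kind could be built as in the sketch below. It is only an illustration under stated assumptions: it uses plain Lloyd iterations with random initialization, and it normalizes the centroids to unit norm so they can serve directly as matching pursuit atoms. Neither choice is specified in the text, and the name `kmeans_dictionary` and its parameters are hypothetical.

```python
import numpy as np

def kmeans_dictionary(patches, n_atoms=100, n_iter=20, seed=0):
    """Cluster vectorized patches with Lloyd's k-means and use the
    normalized centroids as dictionary atoms.

    patches: (N, n) array, one vectorized patch per row, with N >= n_atoms
    Returns an (n, n_atoms) dictionary with unit-norm columns.
    """
    rng = np.random.default_rng(seed)
    data = patches.astype(float)
    # Start from n_atoms distinct patches chosen at random.
    centroids = data[rng.choice(len(data), size=n_atoms, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every patch to every centroid.
        d2 = ((data ** 2).sum(axis=1)[:, None]
              - 2.0 * data @ centroids.T
              + (centroids ** 2).sum(axis=1)[None, :])
        labels = d2.argmin(axis=1)
        for k in range(n_atoms):
            members = data[labels == k]
            if len(members):              # an empty cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
    norms = np.linalg.norm(centroids, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0             # leave an all-zero centroid unchanged
    return (centroids / norms).T
```

Since each centroid is an average of similar patches, occasional foreground patches in the training set are smoothed out when the background dominates at each location.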
In summary, our iterative algorithm consists of the following steps:
• Initialization of the background estimate as the patch-wise average over recent frames, $\hat{X}^b_{init} = \sum_{ij} \sum_k Y^{ij}_{t-k}$;
• Global patch-wise dictionary design, employing the k-means classifier;
• Basic matching pursuit algorithm to estimate the coefficients of the linear decomposition, $\alpha_{ij}$;
• Addition of new patches and update of the dictionary at new frames under a Euclidean distance constraint;
• Matching pursuit to update the coefficients.
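The text does not spell out how the Euclidean distance constraint is applied during the dictionary update. The sketch below shows one possible reading only: a normalized patch from a new frame is appended as a new atom when its distance to the nearest existing atom exceeds a threshold. The function name `update_dictionary`, the threshold `max_dist` and the append-only policy are all assumptions made for this illustration.

```python
import numpy as np

def update_dictionary(dictionary, new_patches, max_dist=0.5):
    """Append normalized new patches that no existing atom is close to.

    dictionary:  (n, K) array with unit-norm columns
    new_patches: (M, n) array of vectorized patches from the new frame
    max_dist:    hypothetical threshold on the Euclidean distance between a
                 normalized patch and its nearest atom
    """
    atoms = [dictionary[:, k] for k in range(dictionary.shape[1])]
    for patch in new_patches.astype(float):
        norm = np.linalg.norm(patch)
        if norm == 0.0:
            continue                        # an all-zero patch has no direction
        candidate = patch / norm
        nearest = min(np.linalg.norm(candidate - atom) for atom in atoms)
        if nearest > max_dist:              # far from every atom: keep as a new atom
            atoms.append(candidate)
    return np.stack(atoms, axis=1)
```

Under the opposite reading, in which only patches close to the current background estimate are admitted, the comparison would be reversed and made against the background patch at the same location.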
The foreground-background segmentation is summarized by the following equations:
$$X^f_t \approx Y_t - \hat{X}^b \qquad (10)$$
$$\hat{X}^b = \arg\min \left\{ \Big\| D\alpha - \sum_k Y_{t-k} \Big\|_2^2 + \|\alpha\|_0 \right\} \qquad (11)$$
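Equations (10) and (11) can be illustrated with the short sketch below, which rebuilds a background patch from its coefficients and marks as foreground the pixels that differ from the background estimate. Turning the difference in (10) into a binary mask with a fixed gray-level `threshold` is an assumption of this sketch, as are the function names.

```python
import numpy as np

def reconstruct(dictionary, picks):
    """Background patch as a linear combination of atoms.

    picks is a list of (atom_index, coefficient) pairs, for example the
    output of a matching pursuit decomposition.
    """
    patch = np.zeros(dictionary.shape[0])
    for k, c in picks:
        patch += c * dictionary[:, k]
    return patch

def segment_foreground(frame, background, threshold=25.0):
    """Boolean mask, True where the frame departs from the background
    estimate by more than `threshold` gray levels."""
    # Convert to float first so that uint8 frames do not wrap around on subtraction.
    diff = np.abs(frame.astype(float) - background.astype(float))
    return diff > threshold
```

In practice the threshold would have to be tuned to the noise level and contrast of the sequence.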
4 Preliminary Results
We use vectors obtained only from the gray-level intensity of the image. This is not a robust choice, so special attention must be paid to the dictionary training phase and to the initial background estimate. To achieve a higher initial robustness, the average background estimate is computed patch-wise, under a distance constraint between the patches involved. This constraint is enforced simply by imposing a threshold on the distance between the patches taken into account for the background estimate. The patch size considered is 8x8. The dictionary was trained using all the patches from 20 consecutive initial frames, for a decomposition basis of 100 vectors.
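The distance-constrained average can be sketched as follows for a single patch location. The text does not state which patch the distances are measured against; this sketch assumes the temporal median patch as the reference, and both the name `robust_patch_average` and the threshold `max_dist` are hypothetical.

```python
import numpy as np

def robust_patch_average(patch_stack, max_dist):
    """Average the patches seen at one location, ignoring outliers.

    patch_stack: (T, n) array, the vectorized patch at one location in T frames
    max_dist:    hypothetical threshold on the Euclidean distance to the
                 reference patch
    """
    stack = patch_stack.astype(float)
    reference = np.median(stack, axis=0)      # assumed reference: temporal median
    dist = np.linalg.norm(stack - reference, axis=1)
    keep = dist <= max_dist
    if not keep.any():                        # no patch is close enough
        return reference
    return stack[keep].mean(axis=0)
```

With 19 background patches and one patch covered by a moving object, the covered patch is rejected and the average stays at the background level, whereas a plain mean would be biased toward the object.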
Figure 1. Examples of trained dictionary and image from the training set: (a) initially trained dictionary; (b) example of image from the dictionary training set.
Figure 1 shows an example of a dictionary trained on a set of 20 frames (Figure 1.a) and an example frame from that set (Figure 1.b).
An important assumption was made concerning the speed of the moving foreground objects. We considered a relatively fast-moving foreground in the dictionary training frames, so that at each patch location the background is seen more often than the foreground. This, combined with the embedded averaging effect of the k-means classifier, ensures the needed training robustness.
In the coefficient estimation phase, only 10 coefficients are computed for each patch. One reason for this limitation is the memory load. The limitation is also justified by the sparsity of the representation. An example illustrating this sparsity is given in Table 1. Certain coefficients correspond to the same dictionary vectors, and the last values computed are already very small.
No.   Coefficient value   Vector index
1      0.998771            3
2     -0.005720           88
3      0.011346            8
4     -0.004941           57
5      0.012511           47
6     -0.009206           48
7      0.011826
8     -0.002676           57
9      0.007432           84
10    -0.006993           48
Table 1. Coefficients and corresponding vectors for an arbitrary image point.
Finally, Figure 3 shows a few examples of results obtained with the proposed approach, summarized by equations (10) and (11). The original images are in the left column and the corresponding results in the right column. The results in the second and third rows show a better segmentation than the result in the first row. This is because the assumption of a fast-moving object is not adequately satisfied in the first case, so the initial background estimate is not sufficiently robust.
5 Conclusion
A foreground-background segmentation approach based on sparse representation over a learned dictionary was investigated in the present paper. Although a simpler case was considered in the present study and several simplifying assumptions were made, the design of the approach is also simple, and yet it yields surprisingly good preliminary results. We also consider the results encouraging in light of the possibility of incorporating new types of local image information in order to increase robustness in different situations. Different types of dictionary training and coefficient estimation will be investigated. Future work will also include extensive comparative studies with classical and newer background subtraction approaches.
Acknowledgement: This work was supported by the grant IDEI, ID_931, contract no. 651/19.01.2009.
Figure 3. Foreground-background segmentation results.
References:
[1] S.D. Babacan and T.N. Pappas, Spatiotemporal algorithm for background subtraction, in ICASSP 2007, (Honolulu), vol. 1, pp. I-1065-I-1068, April 2007.
[2] E. Candes and T. Tao, Decoding by linear programming, IEEE Trans. Information Theory, 51(12), 2005.
[3] V. Cevher, A. C. Sankaranarayanan, M. F. Duarte, D. Reddy, R. G. Baraniuk and R. Chellappa, Compressive sensing for background subtraction, in ECCV, Marseille, France, 12-18 October, 2008.
[4] H. Chen and P. Meer, Robust computer vision through kernel density estimation, ECCV, pp. 236-250, Copenhagen, Denmark, May 2002.
[5] S. Cvetkovic, P. Bakker, J. Schirris, and P.H.N. de With, Background Estimation and Adaptation Model with Light-Change Removal for Heavily Down-Sampled Video Surveillance Signals, in 2006 IEEE International Conference on Image Processing, (Atlanta), pp. 1829-1832, Oct. 2006.
[6] R. Cucchiara, M. Piccardi, and A. Prati, Detecting moving objects, ghosts, and shadows in video streams, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, pp. 1337-1342, Oct. 2003.
[7] M. Elad and M. Aharon, Image denoising via sparse and redundant representations over learned dictionaries, IEEE Trans. IP, 15(12):3736-3745, December 2006.
[8] A. Elgammal, D. Harwood, and L. Davis, Non-parametric model for background subtraction, in Proceedings of IEEE ICCV'99 Frame-rate workshop, Sept. 1999.
[9] C. N. Ianăşi, C. I. Toma, V. Gui, D. Pescaru, Kernel Selection for Mean Shift Background Tracking in Video Surveillance, Proceedings 4th Int. Conference on Microelectronics and Computer Science (ICMCS-05), Chişinău, Moldova, Vol. II, September 15-17, 2005, pp. 389-392.
[10] D. Koller, J. Weber, T. Huang, J. Malik, G. Ogasawara, B. Rao, and S. Russell, Towards Robust Automatic Traffic Scene Analysis in Real-time, Proc. ICPR'94, pp. 126-131, Nov. 1994.
[11] D.-S. Lee, J. Hull, and B. Erol, A Bayesian framework for Gaussian mixture background modeling, in Proceedings of IEEE International Conference on Image Processing, (Barcelona, Spain), Sept. 2003.
[12] S. Mallat and Z. Zhang, Matching pursuit in a time-frequency dictionary, IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3397-3415, December 1993.
[13] A. Mittal, N. Paragios, Motion-Based Background Subtraction using Adaptive Kernel Density Estimation, in Proceedings of the 2004 IEEE Computer Society Conference on CVPR, vol. 2, pp. 302-309, 2004.
[14] M. Piccardi, T. Jan, Efficient mean-shift background subtraction, in Proceedings of IEEE 2004 ICIP, Singapore, Oct. 2004.
[15] C. Stauffer and W. Grimson, Learning patterns of activity using real-time tracking, in IEEE Trans. on Pattern Analysis and Machine Intelligence, 22, pp. 747-757, Aug. 2000.
[16] C. Wren, A. Azarbayejani, T. Darrell, and A.P. Pentland, Pfinder: real-time tracking of the human body, IEEE Trans. on Pattern Anal. and Machine Intell., vol. 19, no. 7, pp. 780-785, 1997.
[17] J. Wright, A. Yang, A. Ganesh, S. Sastry and Y. Ma, Robust face recognition via sparse representation, IEEE Trans. on Pattern Analysis and Machine Intelligence, 2009.
[18] Q. Zhou and J. Aggarwal, Tracking and classifying moving objects from videos, in Proceedings of IEEE Workshop on Performance Evaluation of Tracking and Surveillance, 2001.