View independent human body pose estimation from a single perspective image

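How to Use Computer Vision for Human Pose Recognition

Human pose recognition is an important application of computer vision: it identifies and analyzes a person's posture, movements, and changes in posture over time. The technology is widely used in medicine, sports, security surveillance, and other fields. This article describes how to perform human pose recognition with computer vision techniques.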

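First, human pose recognition needs images or video of the human body. These can be captured with cameras, depth sensors, or other imaging devices. Such devices provide high-quality images and depth information, which makes the body's pose easier to capture.

Next, computer vision algorithms process the image or video data. Many algorithms are used for human pose estimation today, such as convolutional neural networks (CNNs) and support vector machines (SVMs). They help identify and analyze the body's skeletal structure, joint angles, and overall posture.

Applying these algorithms also requires data preprocessing. Body images or video typically undergo scale normalization, denoising, and keypoint detection. These steps improve the accuracy and stability of the algorithms and reduce the influence of noise and redundant information.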

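Pose recognition follows two main approaches: 2D pose and 3D pose. 2D pose estimation analyzes the body in the image plane and yields the positions of the skeletal keypoints and the posture they describe. 3D pose estimation analyzes the body in three-dimensional space and yields richer pose information, such as joint angles, rotation, and scale.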
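Joint angles, named above as one output of pose analysis, can be derived directly from keypoint coordinates. The following is a minimal sketch and is not taken from the original article; the function name `joint_angle` and the sample coordinates are illustrative assumptions. The same code works for 2D and 3D keypoints.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at keypoint b, formed by the segments b->a and b->c.

    a, b, c are coordinate tuples of equal length (2D or 3D).
    Hypothetical helper for illustration only.
    """
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0 or n2 == 0:
        raise ValueError("degenerate limb: two keypoints coincide")
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cos_t = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_t))

# Example: shoulder, elbow, wrist in 2D pixel coordinates (made-up values).
elbow_angle = joint_angle((1.0, 0.0), (0.0, 0.0), (0.0, 1.0))
print(round(elbow_angle, 3))  # 90.0
```

With 2D keypoints the result is the angle as seen in the image, which differs from the true anatomical angle whenever the limb points toward or away from the camera; 3D keypoints avoid that foreshortening.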

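For 2D pose recognition, deep learning methods such as CNNs and recurrent neural networks (RNNs) can be used. Trained on large amounts of annotated data, these methods achieve high accuracy and generalize well. They can also be combined with traditional algorithms such as SVMs and hidden Markov models (HMMs) to improve recognition performance.

For 3D pose recognition, applicable technologies include multi-camera systems, motion capture equipment, and depth sensors. They provide additional data dimensions and allow the body pose to be reconstructed and tracked accurately. Capturing and analyzing 3D pose data enables more accurate and natural human-computer interaction.

In practice, human pose recognition can be applied in many fields.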

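Top-Down Human Pose Estimation

Top-down human pose estimation is a method for inferring human pose from images or video. It first detects each person as a whole and then refines the estimate down to the position and angle of each joint.

The basic idea is to split pose estimation into two subproblems: person detection and joint localization. First, an object detector such as Faster R-CNN or YOLO locates the bounding boxes that contain people. These boxes are then passed to a joint localization network, which refines the estimate to the position and angle of each joint.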
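The two-stage structure described above can be sketched as follows. This is an illustrative outline under stated assumptions, not a specific library's API: `detect_people` and `locate_joints` are hypothetical callables standing in for a detector (such as Faster R-CNN or YOLO) and a joint localization network, and the joint network is assumed to return keypoints normalized to [0, 1] within each box.

```python
def to_image_coords(box, crop_keypoints):
    """Map keypoints normalized to [0, 1] inside a box back to image pixels."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    return [(x0 + u * w, y0 + v * h) for u, v in crop_keypoints]

def top_down_pose(image, detect_people, locate_joints):
    """Stage 1: detect person boxes. Stage 2: localize joints inside each box."""
    poses = []
    for box in detect_people(image):
        crop_keypoints = locate_joints(image, box)  # one network pass per person
        poses.append(to_image_coords(box, crop_keypoints))
    return poses

# Stub models so the sketch runs without any trained network.
def fake_detector(image):
    return [(10, 20, 110, 220)]          # one box: x0, y0, x1, y1

def fake_joint_net(image, box):
    return [(0.5, 0.5), (0.0, 1.0)]      # two keypoints, box-relative

print(top_down_pose(None, fake_detector, fake_joint_net))
# [[(60.0, 120.0), (10.0, 220.0)]]
```

Because the second stage runs once per detected box, the cost of this design grows with the number of people in the image.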

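In the joint localization stage, a CNN is typically used to regress each joint. These models usually contain several convolutional layers and fully connected layers that extract features from the image and predict joint positions and angles. Techniques such as residual connections and dilated (atrous) convolutions can improve accuracy.

Beyond the CNN itself, other techniques can improve localization accuracy. For example, pose priors can constrain the range of joint positions and angles. Multi-scale processing and multi-scale feature fusion improve accuracy for people who appear at different sizes in the image.

In practice, top-down pose estimation has been highly successful. It is widely used in action recognition, human-computer interaction, and virtual reality. In human-computer interaction, for example, recognizing the user's gestures and movements enables natural interaction; in virtual reality, capturing the user's pose enables body awareness and interaction.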

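Top-down pose estimation also has challenges and limitations. First, because human poses are diverse and complex, errors can arise when the pose varies widely. Second, the algorithms are sensitive to occlusion, lighting changes, and background clutter. Third, the computational cost is high, since the joint localization network runs once per detected person, and this demands considerable computing resources and time.

In summary, top-down pose estimation is an effective way to infer human pose from images or video. It achieves accurate estimates by decomposing the problem and applying CNN models together with other techniques. Despite its challenges and limitations, it has broad application prospects in human-computer interaction, virtual reality, and related fields. As algorithms improve and hardware advances, top-down pose estimation can be expected to become more accurate and reliable.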

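Traditional Human Pose Estimation Algorithms

Traditional human pose estimation algorithms are the techniques that were used before deep learning became dominant. Human pose estimation infers a person's pose by analyzing the positions of body keypoints in images or video. It is widely used in computer vision, motion capture, and human-computer interaction. This article introduces several common traditional algorithms.

Color-based pose estimation uses skin color as the feature that identifies the human body. With a color distribution model and a skin detection algorithm, body regions can be extracted and keypoints detected and tracked. Classic examples are skin-color threshold segmentation and methods based on skin color models.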

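Model-based pose estimation describes the body's pose with a mathematical model, usually built from the connectivity of the body's joints and constraints on their angles. Typical approaches are based on skeleton models, structural models, and graphical models.

1. Skeleton-model methods: the body is represented as a hierarchy of joints. Given keypoint positions detected in the image, the pose is computed from the model's topology and connectivity. Typical uses are skeleton-based tracking, skeleton-based pose recovery, and 3D pose reconstruction.

2. Structural-model methods: a structural model describes the relative positions of body keypoints and the angle constraints between them. Once the model is built, tracking and detection methods can be used to estimate the pose. A structural model consists of joint nodes and the connections between them, and can be either 2D or 3D.

3. Graphical-model methods: pose estimation is formulated as a problem on a graph. Body keypoints are the nodes and the connections between joints are the edges, so graph algorithms can be used to solve for the pose. Common graphical models include Gaussian graphical models and conditional random fields.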
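As an illustration of solving pose estimation on a graph, the sketch below runs min-sum dynamic programming over a chain of parts (for example shoulder, elbow, wrist), where each part has a few candidate locations. It is a deliberately simplified stand-in for the graphical models named above, not an implementation of any particular one; `unary`, `pairwise`, and the toy costs are assumptions made for the example.

```python
def best_chain_pose(unary, pairwise):
    """Pick one candidate per part to minimize total cost on a chain of parts.

    unary[i][k]       : cost of placing part i at its candidate k
                        (e.g. how poorly the image matches there).
    pairwise(i, j, k) : cost of joining candidate j of part i-1 to candidate k
                        of part i (e.g. deviation from the expected limb length).
    Returns (chosen candidate index per part, total cost).
    """
    cost = list(unary[0])
    back = []
    for i in range(1, len(unary)):
        new_cost, ptr = [], []
        for k, u in enumerate(unary[i]):
            # Best predecessor candidate for placing part i at candidate k.
            j = min(range(len(cost)), key=lambda j: cost[j] + pairwise(i, j, k))
            new_cost.append(cost[j] + pairwise(i, j, k) + u)
            ptr.append(j)
        cost = new_cost
        back.append(ptr)
    # Trace the optimal assignment back from the last part.
    k = min(range(len(cost)), key=cost.__getitem__)
    total = cost[k]
    path = [k]
    for ptr in reversed(back):
        k = ptr[k]
        path.append(k)
    path.reverse()
    return path, total

# Toy problem: 3 parts, 2 candidates each; switching candidate index costs 1.
unary = [[0, 5], [4, 0], [0, 3]]
path, total = best_chain_pose(unary, lambda i, j, k: 0 if j == k else 1)
print(path, total)  # [0, 1, 0] 2
```

The same recursion extends from a chain to a tree-shaped skeleton by passing messages from the leaves toward a root part.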

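Optimization-based pose estimation defines an objective function and adjusts the keypoint positions to minimize it, which yields the pose estimate. Common optimization methods include least squares and nonlinear optimization.

These are several common traditional pose estimation algorithms; each has its own strengths and suitable scenarios.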

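BlazePose Model Structure

Introduction: BlazePose is a deep learning model for human pose estimation. It detects body keypoints such as the head, shoulders, elbows, wrists, knees, and ankles, which helps in understanding and analyzing human movement and posture. This article describes the structure and working principle of BlazePose and its applications in pose estimation.

I. Model structure

BlazePose uses a lightweight neural network that balances real-time speed and accuracy. It has two main components: a pose estimator, which detects the body's keypoints, and a pose reconstructor, which reconstructs the 3D pose from those keypoints.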

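1. Pose estimator: the first component of BlazePose, responsible for detecting the body's keypoints. It uses a lightweight convolutional neural network (CNN) that applies repeated convolution and pooling operations to the input image and progressively extracts high-level features. Several further convolutional and fully connected layers then output the position and confidence of each keypoint.

2. Pose reconstructor: the second component of BlazePose, responsible for converting the detected keypoints into a 3D pose. It uses a neural network to solve the 3D pose estimation problem. First, it lifts the 2D keypoints from the image plane to obtain their approximate positions in 3D space. It then optimizes these positions to obtain a more accurate 3D pose.

II. Working principle

BlazePose performs pose estimation by jointly training the pose estimator and the pose reconstructor. During training, the network parameters are optimized by minimizing the keypoint position prediction error and the pose reconstruction error. To improve generalization, training also uses data augmentation such as random rotation, mirroring, and scaling.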
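The augmentations listed above have to be applied to the keypoint labels as well as to the image. The following is a minimal sketch of the label side only and is not BlazePose's actual training code; the function names, the keypoint order, and the `swap_pairs` convention are assumptions made for illustration.

```python
import math

def mirror_keypoints(keypoints, image_width, swap_pairs):
    """Flip keypoints horizontally and swap left/right labels.

    swap_pairs lists (left_index, right_index) pairs, e.g. the left and right
    wrist; without the swap a mirrored left wrist would still be labeled 'left'.
    """
    flipped = [(image_width - 1 - x, y) for x, y in keypoints]
    for left, right in swap_pairs:
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped

def rotate_scale_keypoints(keypoints, center, angle_deg, scale):
    """Rotate keypoints about center by angle_deg, then scale about center.

    In image coordinates (y pointing down) a positive angle appears clockwise.
    """
    cx, cy = center
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    out = []
    for x, y in keypoints:
        dx, dy = x - cx, y - cy
        out.append((cx + scale * (c * dx - s * dy),
                    cy + scale * (s * dx + c * dy)))
    return out

# Two keypoints: index 0 = left wrist, index 1 = right wrist (made-up values).
print(mirror_keypoints([(10, 5), (30, 5)], image_width=100, swap_pairs=[(0, 1)]))
# [(69, 5), (89, 5)]
```

In a training loop the angle, scale, and flip decision would be drawn at random per sample (for example with `random.uniform`), and the same transform would be applied to the image.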

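In practice, BlazePose suits many pose estimation scenarios. For example, it can be used in sports motion analysis to help coaches and athletes analyze and improve form and technique.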

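Real-Time Human Pose Recognition in Parts from Single Depth Images (translation)

Abstract: We propose a new method to quickly and accurately predict the 3D positions of body joints from a single depth image, using no temporal information. We take an object recognition approach and design an intermediate body-parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Our large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, and so on. Finally, we generate confidence-scored 3D proposals for several body joints by reprojecting the classification result and finding local modes. The system runs at 200 frames per second on consumer hardware. Our evaluation shows high accuracy on both synthetic and real test sets and investigates the effect of several training parameters. In our comparison with related work we achieve state-of-the-art accuracy and demonstrate improved generalization over exact whole-skeleton nearest-neighbor matching.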

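1. Introduction

Robust interactive human body tracking has applications in gaming, human-computer interaction, security, telepresence, and even health care. The task has been greatly simplified by the arrival of real-time depth cameras [16, 19, 44, 37, 28, 13]. However, even the best existing systems still have limitations. In particular, before the launch of Kinect [21], no system ran at interactive rates on consumer hardware while handling a full range of human body shapes and sizes. Some systems achieve high speeds by tracking from frame to frame, but they struggle to re-initialize quickly and so are not robust.

In this paper we focus on pose recognition in parts: detecting the 3D position of each skeletal joint from a single depth image. Our focus on per-frame initialization and recovery is designed to complement any appropriate tracking algorithm [7, 39, 16, 42, 13], which might further incorporate temporal and kinematic coherence. The algorithm is currently a core component of the Kinect gaming platform.

As illustrated in Figure 1, and inspired by recent object recognition work that divides objects into parts [12, 43], our approach is driven by two key design goals: computational efficiency and robustness. A single input depth image is segmented into a dense probabilistic body part labeling, with each part defined to be spatially localized near skeletal joints of interest.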

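Steps for Human Pose Recognition Using Computer Vision

Computer vision has been widely adopted in recent years, and human pose recognition is one of its applications. Pose recognition analyzes a person's movements and posture with computer vision techniques in order to understand and identify the pose. It is used in many fields, such as human-computer interaction, virtual reality, and motion analysis.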

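Human pose recognition with computer vision involves the following steps:

1. Data collection: first, collect a dataset for pose recognition. The dataset can contain different kinds of poses and movements, captured from different angles and under different lighting conditions. Its quality and diversity are critical to training an effective model.

2. Data preprocessing: before recognition, the collected data must be preprocessed. This includes denoising, cropping, resizing, and grayscale conversion. Preprocessing improves data quality and reduces computational cost.

3. Feature extraction: next, extract useful features from the preprocessed images, such as body keypoints, joint angles, and contours. The goal is to find features that describe the pose accurately enough for later classification and recognition.

4. Model training: once features are extracted, choose a suitable machine learning algorithm or deep learning model and train it. Common machine learning algorithms include support vector machines (SVMs) and random forests; deep learning models such as convolutional neural networks (CNNs) are also widely used. Training on annotated data lets the model learn the features of different poses.

5. Model evaluation and tuning: after training, evaluate and tune the model. Apply it to a test dataset and compute metrics such as accuracy, recall, and F1 score. If performance is unsatisfactory, adjust the model's structure or parameters, or use more training data.

6. Real-time pose recognition: once the model is trained and has passed evaluation, it can be applied to real-time recognition. This requires capturing live images or video and running the trained model on them. Computation speed and algorithmic efficiency must also be considered so that results are both timely and accurate.

In summary, the steps are data collection, data preprocessing, feature extraction, model training, model evaluation and tuning, and real-time pose recognition.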
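The metrics named in step 5 can be computed from predicted and true pose labels as sketched below. The function names and the toy labels are illustrative assumptions; libraries such as scikit-learn provide equivalent, more complete functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted pose label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one pose class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy test set with two pose classes (made-up labels).
y_true = ["sit", "sit", "stand", "sit", "stand"]
y_pred = ["sit", "stand", "stand", "sit", "sit"]
print(accuracy(y_true, y_pred))                      # 0.6
print(precision_recall_f1(y_true, y_pred, "sit"))    # precision, recall, F1 all 2/3
```

For a multi-class pose classifier these per-class values are usually averaged across classes; which averaging scheme to use depends on how imbalanced the classes are.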

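A Survey of Human Pose Capture Methods

Human pose capture (human pose estimation) is the process of extracting human pose from images or video. It plays an important role in many applications, such as human-computer interaction, multimedia retrieval, and human motion analysis. With advances in computer vision and deep learning, pose capture methods continue to evolve and improve. This article surveys these methods and systematically introduces the main ones.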

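Traditional pose capture methods fall into two groups: model-based and feature-based. Model-based methods build a model of the human pose and use an optimization algorithm to fit the model to the input image. Feature-based methods extract features directly from the input image and estimate the pose with a classification or regression algorithm.

Model-based methods include predefined models and flexible models. Predefined models are specified in advance, such as joint models and skeleton models. They are generally built from knowledge of human anatomy and are fitted to the image by an optimization algorithm. Flexible models are learned automatically from input images, such as image representation models and probabilistic graphical models. They adapt to the input image, which improves the accuracy and robustness of the estimate.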

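Feature-based methods include hand-crafted features and deep learning features. Hand-crafted features are obtained by feature extraction and dimensionality reduction on the input image, which reduces the complex pose estimation problem to feature classification or regression. Common hand-crafted features include HOG (Histogram of Oriented Gradients) and SIFT (Scale-Invariant Feature Transform). Deep learning features are learned automatically by deep neural networks, and the pose is then estimated by classification or regression. Deep networks such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have achieved notable results in pose capture.

Besides model-based and feature-based methods, some methods combine the two, namely hybrid methods and end-to-end methods. Hybrid methods fuse traditional model-based and feature-based techniques, solving pose capture by building a model and extracting features together. End-to-end methods start from the raw input image and use a single deep neural network to learn both the image features and the pose estimation model, giving an integrated pose capture pipeline.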

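Deep Learning Human Pose Estimation: Summary Report (HRNet)

2021/6/23
Paper walkthrough
Deep High-Resolution Representation Learning for Human Pose Estimation
Paper overview
This paper studies the human pose estimation problem, with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from the low-resolution representations produced by a high-to-low resolution network. In contrast, the proposed network maintains high-resolution representations throughout the whole process.
The experiments study one small network and one large network: HRNet-W32 and HRNet-W48, where 32 and 48 are the widths (C) of the high-resolution subnetwork in the last three stages. The widths of the other three parallel subnetworks are 64, 128, 256 for HRNet-W32 and 96, 192, 384 for HRNet-W48.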
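The widths quoted above follow a simple pattern: each lower-resolution branch is twice as wide as the one above it. A minimal sketch, with `hrnet_branch_widths` as an illustrative helper name rather than anything from the HRNet code base:

```python
def hrnet_branch_widths(c, num_branches=4):
    """Channel widths of the parallel branches, from highest to lowest resolution.

    c is the width of the high-resolution branch (32 for HRNet-W32,
    48 for HRNet-W48); each further branch doubles the width.
    """
    return [c * 2 ** i for i in range(num_branches)]

print(hrnet_branch_widths(32))  # [32, 64, 128, 256]
print(hrnet_branch_widths(48))  # [48, 96, 192, 384]
```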
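Paper overview
This paper studies human pose estimation, with a focus on producing reliable high-resolution representations.
Existing methods: most recover high-resolution representations from the low-resolution representations produced by a high-to-low resolution network.
This paper's method: the network maintains high-resolution representations throughout the whole process. The resulting pose estimation model set three new COCO records.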
Recent work
What is human pose estimation?
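2D pose estimation: estimate the 2D (x, y) coordinates of each joint from an RGB image. 3D pose estimation: estimate the 3D (x, y, z) coordinates of each joint from an RGB image.
Human pose estimation has some compelling applications and is heavily used in action recognition, animation, gaming, and more. For example, HomeCourt (https://www.homecourt.ai/), a popular deep learning application, uses pose estimation to analyze basketball players' movements.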


View Independent Human Body Pose Estimation from a Single Perspective Image

Vasu Parameswaran and Rama Chellappa
Center for Automation Research, University of Maryland, College Park, MD 20742
vasc,rama@

Abstract

Recovering the 3D coordinates of various joints of the human body from an image is a critical first step for several model-based human tracking and optical motion capture systems. Unlike previous approaches that have used a restrictive camera model or assumed a calibrated camera, our work deals with the general case of a perspective uncalibrated camera and is thus well suited for archived video. The input to the system is an image of the human body and correspondences of several body landmarks, while the output is the set of 3D coordinates of the landmarks in a body-centric coordinate system. Using ideas from 3D model based invariants, we set up a polynomial system of equations in the unknown head pitch, yaw and roll angles. If we are able to make the often-valid assumption that torso twist is small, we show that there exists a finite number of solutions to the head-orientation which can be computed readily. Once the head orientation is computed, the epipolar geometry of the camera is recovered, leading to solutions to the 3D joint positions. Results are presented on synthetic and real images.

1. Introduction

Human body tracking and optical motion capture are two facets of the large area of research commonly referred to as human motion analysis. Good starting points for understanding the applications, specific problems and solution strategies are survey papers, recent ones being [14] and [8].
Human body tracking and optical motion capture systems rely on good bootstrapping, that is, an accurate initial estimate of the human body pose. This is a difficult problem partly due to the large number of degrees of freedom of the body, making searching for solutions computationally intensive or intractable. Part of the difficulty also arises because of loss of depth information and non-linearity introduced due to perspective effects. To make the problem more tractable, researchers have resorted to assuming a scaled orthographic camera in the uncalibrated case or a calibrated camera in the perspective case, both of which are more restrictive than one would like in practice. In [13], Taylor uses a scaled orthographic projection model and shows that there is an infinite number of solutions parameterized by a single scale parameter. After fixing the scale parameter arbitrarily, there is a finite number of solutions to the problem because of symmetries about a plane parallel to the image plane. Further, the method cannot be employed in cases where strong perspective effects exist. In [1] the authors recover both anthropometry and pose of a human body. However, they use scaled orthographic projection and present a search based, iterative solution to the problem, pruning the search-space using anthropometric statistics. In [2], Bregler and Malik restrict the projection model to scaled orthographic for initialization and tracking of the joint angles of a human subject. In [3] and [6], multiple calibrated cameras are used to build a 3D model of subjects while in [7] the authors work with a single camera but assume that it is calibrated. In [9] the authors use a learning approach to predict the 2D marker position (2D pose) from a single image while in [10] they build on the work and present an approach for recovering the approximate 3D pose of the body from a set of uncalibrated camera views. Their work is interesting in that they do not require joint correspondences to be provided in the image.
Rather, they employ a machine learning and a probabilistic approach to map the segmented human body silhouette to a set of 2D pose hypotheses and recover the 3D pose from them. Recently, Grauman et al. [4] reported on a probabilistic structure and shape model of the human body for the recovery of the 3D joint positions given multiple views of the silhouettes from calibrated cameras. In [11], Sminchisescu and Triggs report an approach for monocular 3D human tracking. Model initialization is search based and camera parameters are assumed known.

Working with calibrated image/video data and/or multiple cameras is possible only in restricted application domains. Most archived videos are monocular with unknown camera parameters (intrinsic and extrinsic). Moreover, the scaled orthographic assumption may be too restrictive for many cases. We believe a full-perspective solution to the problem will increase the applicability of good tracking algorithms such as Bregler's [2] because in addition to providing a more accurate initial estimate, one can recover the perspective 3D to 2D transform of the camera, making it possible to carry out full-perspective tracking of the human body. In this work, we aim for such a solution and seek to estimate the 3D positions of various body landmarks in a body-centric coordinate system. Using ideas from model based invariance theory, we set up a simple polynomial system of equations for which analytical solutions exist. In cases where no solutions exist, an approximate solution is calculated. Recovering the 3D joint angles, which are helpful for tracking, then becomes possible by way of inverse kinematics on the limbs.

2. Problem Statement

We employ a simplified human body model of fourteen joints and four face landmarks: two feet, two knees, two hips (about which the upper-legs rotate), pelvis, upper-neck (about which the head rotates), two shoulders, two elbows, two hands, forehead, nose, chin and (right or left) ear. The hip joints constitute a rigid body. Choosing the pelvis as the origin, we
can define the X axis as the line passing through the pelvis and the two hips. The line joining the base of the neck with the pelvis can be taken as the positive Y axis. The Z axis points in the forward direction. We call the XY plane the torso plane. We scale the coordinate system such that the head-to-chin distance is unity. With respect to the input and output, the problem we seek to solve in this paper is similar to those addressed previously (e.g. [13], [1]): given an image with the location in the image of the body landmarks and the relative body lengths, recover their body-centric coordinates. We make use of two assumptions:

1. We use the isometry approximation where all subjects are assumed to have the same body part lengths when scaled. The allometry approximation [16], where the proportions are dependent on body size, is considered to be better: for instance, children have a proportionally larger head than adults. Our algorithm, however, is invariant to full-body 3D projective transformations.

2. The torso twist is small such that the shoulders take on fixed coordinates in the body-centered coordinate system. Except for the case where the subject twists the shoulder-line relative to the hip-line by a large angle, this assumption is usually applicable. Further, since our algorithm relies on human input, it is easy to tell if this assumption is violated.

3. Approach

Besides the articulated pose of the human body, the unknown variables in the problem are the extrinsic and intrinsic camera parameters. In [12], Stiller et al. derive camera-parameter independent relationships between five world points on a rigid object and their imaged coordinates for an affine camera. Weiss and Ray in [15] simplified and extended the result to the full-projective case, showing that there exists one equation relating three 3D invariants and four 2D invariants formed from six world points and their image coordinates. Our approach is motivated by theirs but we are able to
derive a simpler final result involving two 3D invariants rather than three. In the following, we will first show how to recover the three angles of rotation of the head in the body-centric coordinate system, given the image locations of the body landmarks. From the recovered head orientation, we next show how the 3D coordinates of the remaining joints can be recovered. Recovery of these quantities also allows us to determine the epipolar geometry of the camera.

3.1 Motivating Example

We review and modify the approach of [15] below. Five points (in homogeneous coordinates) in 3D projective space cannot be linearly independent. Assuming that the first four points are not all coplanar, we can write the 3D coordinates of the fifth point with the basis as the first four points:

(1)

The are the unknown projective scale factors and are the unknown projective coordinates of the point in the basis of the first four points. We would like to model a point configuration where four points lie on the same plane. Given that we need the first four to form a basis, we can choose a labeling such that points 1, 2, 4 and 5 form a plane while points 3 and 6 lie outside this plane. In this configuration, point 3 doesn't contribute to point 5's coordinates, making zero. For we have:

(2)

Figure 1: Six point configuration used for analysis. Points 1, 2, 4 and 5 form a plane and points 3 and 6 lie outside this plane.

Here, , and are the basis coordinates for . If is the world to image transform such that , where is an unknown scale factor, the image coordinate for the fifth point, is given by:

(3)

Writing as , and doing the same algebra for point 6, we have the following two equations relating image coordinates:

(4)

We would like to eliminate the projective coordinates and the scale factors. Let denote the determinant (the notation is such that we index by the point left out from the five-point set ().
Substituting for from (1) and noting that determinants with two equal columns vanish we have:

The projective coordinates and scale factors can be eliminated by taking cross ratios to obtain two 3D invariants (as opposed to three in [15]):

For the image coordinates, we follow the same approach of taking determinants and their cross ratios. Using the 'points-left-out' notation, let denote the determinant :

(6)

Letting denote the determinant , we similarly obtain:

(7)

Obtaining expressions for the other determinants, , , , calculating cross ratios and and equating them to and respectively we obtain:

(8)

The are known quantities, computed from image coordinates, and we can rewrite the above equations in terms of known coefficients, as

(9)

(9) expresses a view-invariant relationship between solely the 3D coordinates and their 2D image positions for the six points shown in Figure 1.

3.2 Recovering the Head Orientation

If we choose the following labeling of points: right-hip (1), left-hip (2), left-shoulder (4), right-shoulder (5) and allow points 3 and 6 to be any two head features in (say forehead and chin), the only unknowns in the equation are the coordinates and . Being positions on the head, which is a rigid body that rotates about the upper-neck, in effect, there are only 3 scalar unknowns corresponding to a rotation matrix. If we use Euler angles, we can write:

where and are known forehead and chin coordinates corresponding to a reference 'neutral' position. Observing that the third elements of are zero,

where is the signed area of points , a known constant, and is the th row of . Similarly,

When the above are substituted into (9), the scalar cancels out and we obtain

(10)

where , , .
We now write expressions for :

(11)

Expanding R in terms of the Euler angles, and substituting it in the expressions for the determinants, (10) becomes a 13 term transcendental equation in the Euler angles. Given the point correspondences of two more head features, say the nose and either ear, we will have three equations in the three unknown Euler angles. The equations depend on the neutral position of the head reflected in and . Choosing a neutral position where the head points forward with no yaw or roll, the coordinates are zero for the forehead, nose and chin and two of the equations become four term equations giving:

(12)
(13)
(14)

where and . Interestingly, (12) and (13) are independent of and can be solved rather trivially using and . We obtain a quadratic equation in :

(15)

where the can be written in terms of and . Hence there are up to four solutions for and . When these are substituted into (14), we obtain a simple equation in :

(16)

where the can be written in terms of , and . With , we obtain two solutions for . Collectively, we then obtain up to eight solutions for the angles.

The angle solutions represent head orientations that produce the image. At this stage, we could do some rather basic anthropometric filtering by observing that the pitch angle cannot be so large that the chin penetrates the torso. Similarly, we could also impose constraints on the roll and yaw angles. The valid solutions can then be presented to the user from which one will be selected.

3.3 Recovering the Epipolar Geometry

Recall that the world-to-image transform projects points from the body-centered coordinate system to the image plane. Given the calculated head orientation, we can recover this transform, which has eleven unknowns. From the eight point correspondences at our disposal (four head plus four torso), we have an overdetermined set of sixteen equations in its elements, which we solve for in a least squares sense using singular value decomposition.
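The count of sixteen equations and eleven unknowns can be made concrete with a short sketch. This is the standard direct linear transform construction, given here as an illustration and not as the authors' own code; the symbol `P` for the 3x4 world-to-image transform, the function names, and the sample camera and points are assumptions made for the example. Each 3D-to-2D correspondence contributes two homogeneous linear equations in the twelve entries of `P`, which are determined only up to scale (hence eleven unknowns).

```python
def dlt_rows(X, x):
    """Two linear equations in the 12 entries of P from one correspondence.

    X = (X, Y, Z) is a world point, x = (u, v) its image point. From
    u = (p1.Xh)/(p3.Xh) and v = (p2.Xh)/(p3.Xh), with Xh = (X, Y, Z, 1):
        p1.Xh - u * p3.Xh = 0
        p2.Xh - v * p3.Xh = 0
    """
    Xh = list(X) + [1.0]
    u, v = x
    zero = [0.0] * 4
    return [Xh + zero + [-u * c for c in Xh],
            zero + Xh + [-v * c for c in Xh]]

def build_system(world_points, image_points):
    """Stack the equations for all correspondences into a (2n x 12) matrix."""
    A = []
    for X, x in zip(world_points, image_points):
        A.extend(dlt_rows(X, x))
    return A

def project(P, X):
    """Apply a 3x4 projection matrix to a 3D point and dehomogenize."""
    Xh = list(X) + [1.0]
    w = [sum(p * c for p, c in zip(row, Xh)) for row in P]
    return (w[0] / w[2], w[1] / w[2])

# Synthetic check with a made-up camera and eight made-up landmarks.
P_true = [[800.0, 0.0, 320.0, 10.0],
          [0.0, 800.0, 240.0, 20.0],
          [0.0, 0.0, 1.0, 5.0]]
points = [(-1, -1, 0), (1, -1, 0), (1, 1, 0), (-1, 1, 0),
          (0, 0, 1), (0.5, 0, 2), (0, 0.5, 1.5), (-0.5, 0.5, 1)]
A = build_system(points, [project(P_true, X) for X in points])
p_true = [c for row in P_true for c in row]
residual = max(abs(sum(a * b for a, b in zip(row, p_true))) for row in A)
print(len(A))            # 16 equations from 8 correspondences
print(residual < 1e-6)   # the true P satisfies A p = 0 up to rounding
```

With noisy correspondences no exact null vector exists, and the least-squares solution is the right singular vector of `A` with the smallest singular value, which can be obtained for example from `numpy.linalg.svd`.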
The matrix contains all information necessary to retrieve the camera center. can be written in the form where is the camera center [5]. Given this, can be recovered as .

3.4 Recovering Body Joint Coordinates

Consider any unknown world point with known image point . Inverting the relationship , we obtain a set of solutions for parametrized by the unknown . This is simply the epipolar line of the image point in the body-centered coordinate system.

(17)
(18)

where can easily be calculated in terms of elements of and . Let represent the right elbow which is connected to the right shoulder with known world coordinates . We also know the upper arm length, . We then have the following constraint:

(19)

which is a quadratic in , representing the two points of intersection of the epipolar line with the sphere of possible right elbow positions. These two solutions for the elbow represent the unavoidable forward/backward flipping ambiguity inherent in the problem. Once the correct right elbow position is found, the right hand can be found in the same manner. Similarly, we can obtain the 3D coordinates of all the other joints of the body. The interactivity in this solution process can be eliminated by having the user pre-specify the relative depths of the joints. In other words, before the solution process starts, each joint is assigned a boolean variable that specifies whether that joint is closer to the camera than its parent. Given that the user is specifying the point correspondences of body landmarks, this input imposes trivial additional burden. This idea is also used in [13]. Since we have already calculated the camera center, we are able to calculate these distances readily.

3.5 Dealing with Unsolvable Cases

Computation of the head-orientation as well as the limb 3D locations involves the solution of quadratic equations. In our experiments on real images and noisy synthetic images, in several cases, there were no solutions to one or more quadratic. For the head-orientation case, we recovered as
solutions to a constrained optimization problem with the objective function as the sum of squares of (12) and (13) along with the trigonometric identities as constraints. Use of Lagrange multipliers resulted in a non-trivial system of polynomial equations in . We tried two different approaches: (1) computing a Gröbner basis of the polynomials so that they are reduced to triangular form and (2) searching for local optima. Gröbner basis computations were rather heavy and slowed down the algorithm, although the recovery of all local minima was guaranteed. Searching for local optima (in the space ) was found to be much faster (the search space was quantized into bins) and produced a good approximate solution most of the time. For the limb position (19) with no solutions, we computed a scale such that the scaled limb-length (in this case ) made the discriminant positive. This effectively accounted for variations in the assumed and actual limb lengths.

4. Results

We evaluated the approach on synthetic and real images, the results of which we present below.

4.1 Synthetic Images

In the synthetic case, given that the error is zero for a perfect model and perfect image correspondences, we focused on empirical error analysis. There are two sources of error: (1) differences between the assumed model and imaged subject and (2) inaccuracies in the image correspondences. For five different viewpoints, and 500 random unknown poses per noise-level, we calculated the average error in full-body reconstruction (sum of squares of the difference between real and recovered 3D coordinates scaled by the head-to-foot distance) for Gaussian noise of zero mean and unit standard deviation and increasing noise intensities. The interactivity of the algorithm was eliminated by the evaluation program automatically choosing the head-orientation with minimum error among the solutions. There are three cases: noisy-model, noisy-image, and noisy-model with noisy-image.
For image noise, we perturbed the image coordinates with the noise, scaled by the image dimensions which were taken to be those of the bounding box of the imaged body. For model-noise, the scale was the head-to-foot distance. Figure 2 shows the dependency. An important observation is that the reconstruction is more sensitive to errors in the model than in the image point correspondences. Interestingly, the curve for noisy-model with noisy-image error is almost the same as the noisy-model curve. We believe that this is because the model error swamps out image errors which are much smaller, especially at higher noise levels. Further, since the model and image errors are independent, errors cancel out in some cases. Nevertheless it can be seen that small errors in the model and image only produce small errors in the final reconstruction.

Figure 2: Error dependency on noise

4.2 Real Images

We evaluated the qualitative performance of the approach on real images by using 3D graphics to render the reconstructed body pose and epipolar geometry. We used a 3D model derived photogrammetrically from front and side views of one subject and used the same 3D model for all images. There were two important problems with real images. One problem is that clothing obscures the location of the shoulders and hips, the accuracy of which affects the head orientation computation. We addressed this problem with two strategies. First, given that the shoulders, hips and upper-neck form a planar homography we compute and use it: though we do not use the upper-neck as a feature point in (12), (13) and (14), we require the user to locate it. The homography is uniquely specified by four planar points.
We use the five torso points to calculate the torso-plane-to-image homography in a least-squares sense, transform the torso-plane to the image using the homography and use the transformed points as input rather than the user-specified points. Second, rather than requiring the user to locate the true right and left hip (about which the upper legs rotate), we just require their surface locations (i.e. 'end-points'), which are easier to locate. The model stores the true centers of rotation of the legs as well as the surface locations.

Another problem is due to the fact that we model the neck junction as a ball and socket joint. In reality, the skull rests on top of the cervical portion of the spinal cord and the cervical vertebrae are free to rotate (although by a small amount and with a small radius). To compensate for this, we take the skull center of rotation to be midway between the neck-base and upper-neck. This produced a significant improvement in the head-orientation recovery for cases where subjects lunged their head forward or backward in addition to rotating it.

Figure 3: Person Sitting, Front-view

Figure 4: Baseball

For some images where these two effects were significant, we had to guess the true image coordinates three or four times before the algorithm returned realistic looking results. Figure 3 shows a subject sitting down and imaged from the front. Also shown in the image are user-input locations of various body landmarks. Beside the image are two rendered views of the reconstructed body pose and epipolar lines of the body landmarks from novel viewpoints. The meeting of epipolar lines depicts the camera position. Figure 4 shows a baseball pitcher and the reconstruction. Interestingly in this case, the camera is behind the torso of the subject and this fact is recovered by the reconstruction. Figure 5 shows a subject sitting down with the hand pointed towards the camera, inducing strong perspective, while Figure 6 shows a subject skiing. The novel views of the reconstructions show that
the body pose is captured quite well.

Figure 5: Person Sitting, Side-view

Figure 6: Person Skiing

5. Conclusions

We presented a method to calculate the 3D positions of various body landmarks in a body-centric coordinate system, given an uncalibrated perspective image and point correspondences in the image of the body landmarks, an important sub-problem of monocular model-based human body tracking and optical motion capture. Our small-torso-twist assumption gives us enough ground truth points on the torso and allows us to use ideas from 3D model based invariance theory to set up a simple polynomial system of equations to first recover the head orientation and with it, the epipolar geometry and all of the limb positions. While theoretically correct given the assumptions, the method encountered specific problems when applied to real images, which we addressed by way of strategies to reduce error in input as well as the model. We demonstrated effectiveness of the method on real images with strong perspective effects and empirically characterized the influence of errors in the model and image point correspondences on the final reconstruction. Given that model accuracy has significant impact on the reconstruction, we are evaluating a probabilistic approach for reconstruction using anthropometric statistics. In future, we plan to exploit the analysis by synthesis approach to render the reconstructed head on to the image plane and iteratively refine the reconstruction using color and edge cues.

Acknowledgments

This work was supported in part by NSF Grant ECS 02-25475.

References

[1] C. Barron and I. A. Kakadiaris. Estimating anthropometry and pose from a single uncalibrated image. Computer Vision and Image Understanding, 81, 2001.
[2] C. Bregler and J. Malik. Tracking people with twists and exponential maps. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 1998.
[3] D. Gavrila and L. Davis. 3-d model-based tracking of humans in action. Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 73–80, 1996.
[4] K. Grauman, G. Shakhnarovich, and T. Darrell. Inferring 3d structure with a statistical image-based shape model. Proc. International Conference on Computer Vision, 2003.
[5] R. Hartley. Chirality. International Journal of Computer Vision, 26(1):41–61, 1998.
[6] A. Hilton. Towards model-based capture of a person's shape, appearance and motion. IEEE International Workshop on Modelling People, 1999.
[7] H. J. Lee and Z. Chen. Determination of 3d human body posture from a single view. Computer Vision, Graphics and Image Processing, 30, 1985.
[8] T. Moeslund and E. Granum. A survey of computer vision based human motion capture. Computer Vision and Image Understanding, 81(3), March 2001.
[9] R. Rosales and S. Sclaroff. Specialized mappings and the estimation of human body pose from a single image. IEEE Workshop on Human Motion, pages 19–24, 2000.
[10] R. Rosales, M. Siddiqui, J. Alon, and S. Sclaroff. Estimating 3d body pose using uncalibrated cameras. Technical Report 2001-008, Dept. of Computer Science, Boston University, 2001.
[11] C. Sminchisescu and B. Triggs. Kinematic jump processes for monocular 3d human tracking. Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2003.
[12] P. F. Stiller, C. A. Asmuth, and C. S. Wan. Invariant indexing and single view recognition. Proc. DARPA Image Understanding Workshop, pages 1423–1428, 1994.
[13] C. Taylor. Reconstructions of articulated objects from point correspondences in a single image. Computer Vision and Image Understanding, 80(3), 2000.
[14] L. Wang, W. Hu, and T. Tan. Recent developments in human motion analysis. Pattern Recognition, 36(3):585–601, March 2003.
[15] I. Weiss and M. Ray. Model-based recognition of 3d objects from single images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23, February 2001.
[16] V. M. Zatsiorsky. Kinetics of Human Motion. Human Kinetics, Champaign, IL, 2002.