Content-based Video Retrieval
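Content-Based Video Retrieval (Latest)
Data at each level of the video hierarchy can be described by a set of attributes. For example: the attributes of a video sequence mainly include the number of scenes and the duration; the attributes of a scene include title, duration, number of shots, start shot, and end shot; the attributes of a shot include duration, start frame number, end frame number, the set of representative frames, and the feature-space vector; a frame has a large number of attributes, including its histogram, contour map, and DC and AC component maps.
Content-Based Video Processing
The main content is extracted from all frames, and the video content is described structurally from the bottom up. To achieve this, the video must undergo the following processing: video segmentation, feature extraction, and video content organization. The process is as follows:
[Figure 2: Content-based video processing pipeline; recoverable labels: dynamic features, static features]
Steps of content-based video retrieval: 1. Segment the video sequence into shots. 2. Select key frames within each shot. 3. Extract the features of each shot and the visual features of its key frames, and store them in the video database.
In the motion metric M(k), Ox(i,j,k) is the X component of the optical flow at pixel (i,j) in frame k, and Oy(i,j,k) is the Y component of the optical flow at pixel (i,j) in frame k.
Then find the local minima of M(k): starting from k=0, scan the curve M(k) and find two local maxima M(K1) and M(K2) such that the value of M(K2) differs from that of M(K1) by at least p% (p is set empirically). If M(Kj) = min(M(K)) for K1 < Kj < K2, select Kj as a key frame. Then take K2 as the new K1 and continue searching for the next Kj. Wolf's motion-based method can choose a number of key frames suited to the structure of the shot. Better results can be obtained if the moving objects are first separated from the background and the optical flow is then computed at the objects' positions.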
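The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the slide author's code. It assumes that M(k) is the sum of |Ox| + |Oy| over all pixels of frame k (the defining equation is missing from this extract), that K1 and K2 are local maxima of M(k), and that the p% difference is measured relative to M(K1). The function names are hypothetical.

```python
def motion_metric(flow_x, flow_y):
    """Assumed form of M(k): sum of |Ox| + |Oy| over all pixels of one frame."""
    return sum(abs(ox) + abs(oy)
               for row_x, row_y in zip(flow_x, flow_y)
               for ox, oy in zip(row_x, row_y))


def local_maxima(m):
    """Indices k where m[k] is a local maximum of the curve."""
    return [k for k in range(1, len(m) - 1) if m[k - 1] < m[k] >= m[k + 1]]


def select_keyframes(m, p=20.0):
    """Pick the frame of least motion between two sufficiently different peaks."""
    peaks = local_maxima(m)
    keyframes = []
    if not peaks:
        return keyframes
    k1 = peaks[0]
    for k2 in peaks[1:]:
        # Accept K2 only if M(K2) differs from M(K1) by at least p percent.
        if abs(m[k2] - m[k1]) >= p / 100.0 * abs(m[k1]):
            kj = min(range(k1 + 1, k2), key=lambda k: m[k])
            keyframes.append(kj)
            k1 = k2  # K2 becomes the new K1
    return keyframes
```

With the curve `[1, 5, 2, 1, 3, 8, 4, 0, 2, 9, 1]` the peaks are at frames 1, 5 and 9, so a larger p accepts fewer peak pairs and yields fewer key frames.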
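III. Introduction to Content-Based Video Retrieval
What we need to study is how an information retrieval system can properly express the content a user asks for, find the information in the video database that satisfies the query, and return it to the user.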
Chapter 9: Video Information Retrieval
侯 颖 houying@
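9.1 Overview
I. Overview of Information Retrieval
With the rapid development of computer networks and multimedia technology, large amounts of image and video information emerge every day, and we are "drowned" in an ocean of information. How to organize and manage this massive body of data, which contains a great deal of unstructured information, and how to query and retrieve useful information from it effectively, is the task of information retrieval.
Information retrieval means retrieving, from a collection of information, the subset that is relevant to the user's information need. Video data is extremely large in volume, and image data also differs from traditional text data in its organizational structure and forms of expression. How to organize, represent, store, manage, query, and retrieve such data poses a serious challenge to traditional database technology. Video information retrieval in particular, and especially content-based video and image retrieval, has become a hot research topic both in China and abroad.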
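The information life cycle has three main stages: creation, retrieval, and use.
An information unit (information set) represents a unit of data and can be any physical unit, for example a file, an e-mail, a web page, an image, a video, or an audio clip. Metadata is information about the organization of data, the data fields, and their relationships. Metadata provides a standardized, general description for digital information units and resource collections in all their forms.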
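A Brief Overview of Content-Based Video Retrieval and Its Key Technologies
Authors: 马晨晨周政龙门. Source: 《新学术论丛》, 2013, No. 04.
1. Introduction
With the development of multimedia technology and the emergence of the information superhighway, digital video storage and transmission technologies have made major advances.
How to find the required material in a massive volume of video is the problem that video retrieval must solve.
Traditional video retrieval relies on manual, sequential searching with fast-forward and rewind, which is tedious and time-consuming and clearly cannot meet the needs of multimedia databases.
Users would like to give an example or a feature description and have the system automatically locate the required video clips; this is content-based video retrieval.
2. Content-Based Video Retrieval
Content-based video retrieval (CBVR) means retrieving video data from a large-scale video database according to the content of the video and its contextual relationships.
Its main characteristics: it extracts information cues directly from the video data; it performs approximate matching; and it extracts and describes the features and content of the video automatically, without human involvement.
It combines techniques from image understanding, pattern recognition, computer vision, and related fields.
The process is as follows. First, the video stream is segmented into shots by shot boundary detection, and key frames are selected within each shot. Next, the motion features of the shots and the visual features of the key frames are extracted and stored in the video database as a retrieval mechanism. Finally, the query submitted by the user is matched against certain features, and the results are returned to the user ranked by similarity. The user can refine the query results, and the system adjusts the retrieval results according to the user's feedback.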
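The quality of feature extraction and of the retrieval algorithm determines the efficiency and performance of the system.
3. Key Technologies
Video contains rich content.
Video is generally represented hierarchically.
A video can be represented at several levels: scene, shot, and frame.
The frame is the most basic unit of video, and shot boundary detection is the basis for structuring video hierarchically.
3.1 Shot Boundary Detection
To implement content-based video retrieval, the video data must first be segmented automatically into shots; this is called shot boundary detection or scene change detection.
Shot transitions are either abrupt or gradual; an abrupt transition is a sudden change of shot between two adjacent frames.
(1) Pixel-based shot detection
The sum of the absolute differences between corresponding pixels of two frames is used as the inter-frame difference; when it exceeds a threshold m, a shot change is declared.
The drawback is sensitivity to noise and object motion, which easily causes false detections.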
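The pixel-based test can be illustrated with a short Python sketch. It assumes greyscale frames stored as equally sized lists of rows and a hand-picked threshold m; the function names are hypothetical and the example is not taken from the article.

```python
def frame_difference(frame_a, frame_b):
    """Sum of absolute differences between corresponding pixels of two frames."""
    return sum(abs(pa - pb)
               for row_a, row_b in zip(frame_a, frame_b)
               for pa, pb in zip(row_a, row_b))


def detect_cuts(frames, m):
    """Return every index k where the difference between frame k-1 and k exceeds m."""
    return [k for k in range(1, len(frames))
            if frame_difference(frames[k - 1], frames[k]) > m]
```

Because every pixel contributes to the sum, a large moving object or camera noise can push the difference over m without any real shot change, which is the weakness noted above.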
Structuring Video Content
It is a fundamental step of automated indexing and content-based video retrieval or summarization applications which provide an efficient access to huge video archives
The Schema of Video Shot Detection
Video Shot Boundary Detection Methods
Basic idea: compare the features of adjacent frames; a place where a major change occurs is taken to be a shot boundary.
Although cut detection appears to be a simple task for a human being, it is a non-trivial task for computers. Cut detection would be a trivial problem if each frame of a video were enriched with additional information about when and by which camera it was taken. While most algorithms achieve good results with hard cuts, many fail to recognize soft cuts. Hard cuts usually go together with sudden and extensive changes in the visual content, while soft cuts feature slow and gradual changes.
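As one possible instance of "comparing the features of adjacent frames", the sketch below uses a coarse grey-level histogram as the feature. The choice of feature, the bin count and the threshold are assumptions made for illustration; the slides do not specify them. A single fixed threshold like this targets hard cuts only, since a gradual transition spreads its change over many frame pairs.

```python
def histogram(frame, bins=4, max_value=256):
    """Coarse grey-level histogram of a frame given as a list of pixel rows."""
    hist = [0] * bins
    for row in frame:
        for p in row:
            hist[min(p * bins // max_value, bins - 1)] += 1
    return hist


def histogram_difference(h1, h2):
    """Bin-by-bin absolute difference between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def detect_hard_cuts(frames, threshold):
    """Indices k where the histogram changes sharply between frame k-1 and k."""
    hists = [histogram(f) for f in frames]
    return [k for k in range(1, len(hists))
            if histogram_difference(hists[k - 1], hists[k]) > threshold]
```

A histogram ignores where pixels are, so rearranging the same pixel values inside a frame leaves the feature unchanged.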
视觉与听觉信息处理国家重点实验室(教育部北大)
视觉与听觉信息处理国家重点实验室视觉与听觉信息处理国家重点实验室、智能科学系(信息科学中心)主任:查红彬副主任:吴玺宏谢昆青刘宏视觉与听觉信息处理国家重点实验室学术委员会视觉研究室视觉信息处理研究室在充分发挥学科交叉的综合优势基础上,特别注意从视觉信息处理中提出新的数学问题,进一步发展用于信息处理的数学方法,主要研究方法包括图像压缩与编码、图像处理和模式识别、计算机视觉等。
多年来我们承担了国家科技攻关、863高科技、攀登计划、自然科学基金、博士点基金等大量课题,同时取得了一批有独创性的重大成果。
本研究室具体包含如下4个研究领域:1、图像处理、图像理解和计算机视觉2、生物特征识别3、图像的压缩、复现、传输及其信息安全4、三维视觉计算与机器人(New!!!)联系人:张超电话: 62757000视觉信息处理研究室科研方向:[1]图像处理、图像理解和计算机视觉[2]生物特征识别[3]图像的压缩、复现、传输及其信息安全[4]三维视觉计算与机器人视觉信息处理研究室科研项目:用于患者行为实时跟踪的护理机器人主动视觉研究,国家自然科学基金项目,2002-2004建筑物与复杂场景三维数字化技术的基础研究,教育部科学技术研究重点项目,2003-2005数字博物馆关键技术研究,北京大学15-211项目,2003-2005Document Image Retrieval,与日本Ricoh Co.的合作项目,2003-20043D Model Retrieval,与日本Fujitsu研发中心的合作项目,2003-2004数字几何处理的理论框架与关键技术研究,国家自然科学基金重点项目,2004-2007计算机视觉研究,自然科学基金重大项目子项小波分析理论及其在图像处理中的应用,自然科学基金跨学部重点项目数学机械化与自动推理平台"课题"信息安全、传输与可靠性研究,国家重点基础研究发展规划项目微阵列基因表达数据分析和可视化研究,自然科学基金项目全国优秀博士学位论文奖励基金, 2001-2005JPEG2000图象压缩集成电路核心技术研究,国家863高科技研究项目基于小波的视频压缩与通讯系统研究,国家973重大基础项目的子专题符合JPEG2000的图象压缩集成电路研究,教育部骨干教师项目多进制小波理论及其在基于内容的图像压缩中的应用,国家自然科学基金项目, 2004.1-2006.12 视觉研究室主要人员:听觉研究室言语听觉信息处理研究室成立于1988年,多年来一直以人工神经网络、智能机器学习等为理论基石,以听觉信息处理、语音信号处理和语言信息处理为研究背景进行了深入的理论探索和实践研究。
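Chapter 8: Multimedia Information Retrieval
② Number of colours
Typical values for the number of colours in an image are: 2 colours (the image is black and white only), 16 colours, 256 colours, 16-bit high colour (2^16 = 65,536 colours), and 24-bit true colour (2^24 = 16,777,216 colours). For natural images, the more colours there are, the better the visual quality.
(2) Image file format types
① Bitmap: a bitmap is an image made up of many pixels; the image file records the position of every pixel of the graphic or image together with the value representing that pixel's colour. Depending on whether and how the data is compressed, bitmap files come in many formats, such as .bmp, .tif, .gif, and .jpg.
② Vector graphics: a vector graphic is produced by the computer through mathematical operations rather than described point by point like a bitmap. It therefore takes up very little space, and its display quality is not affected by its size or by the resolution of the display. The file format of a vector graphic depends on the software that generates it. There are many vector formats, such as *.AI, *.EPS and SVG for Adobe Illustrator, *.dwg and dxf for AutoCAD, *.cdr for Corel DRAW, and the Windows standard metafile *.wmf and enhanced metafile *.emf.
8.1.2 Approaches to Multimedia Information Retrieval
1. Text-based multimedia information retrieval
The multimedia items are first analysed manually to extract keywords that reflect their physical and content characteristics. These keywords are then catalogued or indexed as text to build a database similar to one for text documents, so that retrieving multimedia information becomes retrieving those keywords.
2. Content-based multimedia information retrieval
TVix video search (/)
Outline
Multimedia information; image information retrieval; audio information retrieval; video information retrieval; Flash file retrieval
Multimedia: according to the definition recommended by the International Telecommunication Union (ITU-T), media can be divided into perception media, representation media, presentation media, storage media, and transmission
Perception media are the media types that carry information from the objective world which human sense organs can perceive, for example sound, graphics, images, speech, and text.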
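Content-Based Video Retrieval (CBVR)
Video is a composite medium that integrates images, sound, and text, and it carries the most information of all media types.
With the development of Internet technology and growing network bandwidth, the volume of online video is growing explosively. How to search this massive body of Internet video has become a research hotspot in China and abroad and a major research topic for next-generation search engines.
Video retrieval performs structural analysis on massive unstructured video data and extracts features of the video content (including semantic features); on that basis, video is retrieved by its content.
To index raw video by its content, an algorithm is needed that can automatically extract and describe the features and content of the video without human involvement.
Compared with traditional text retrieval, video retrieval is technically much harder.
First, the features of video content are hard to extract and process; extracting semantic features is especially difficult.
Second, video retrieval differs greatly from traditional text search in index construction, query processing, and human-computer interaction, and some technical problems remain unsolved.
The basic workflow of video retrieval: structural analysis → feature extraction (dynamic and static features) → semantic extraction → high-dimensional indexing → retrieval feedback → browsing applications. The features of the shots and the visual features of the key frames are extracted and stored in the video database.
Once the database is built, content-based retrieval is carried out by measuring similarity.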
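Retrieval by similarity measurement can be sketched as a ranking over stored feature vectors. The example below is only an illustration: it assumes that each shot is stored as a fixed-length feature vector and that Euclidean distance is the similarity measure, neither of which the text specifies. The names `retrieve` and `database` are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def retrieve(query, database, top_k=3):
    """Return the names of the top_k stored items closest to the query vector."""
    ranked = sorted(database.items(), key=lambda item: euclidean(query, item[1]))
    return [name for name, _ in ranked[:top_k]]


# Hypothetical feature database: shot name -> feature vector.
database = {
    "shot_a": [0.9, 0.1, 0.0],
    "shot_b": [0.1, 0.8, 0.1],
    "shot_c": [0.8, 0.2, 0.0],
}
```

For large collections the linear scan would normally be replaced by the high-dimensional index mentioned in the workflow.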
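1. Structural analysis
A video can be divided into the following hierarchy: video sequence (video) → scene → shot → frame.
  scene: not necessarily continuous in time; the smallest semantic unit
  shot: continuous in space and time; one continuous take by the camera
  frame: a still picture; one frame of film
Each level can be described by a set of attributes.
The attributes of a video sequence mainly include the number of scenes and the duration; the attributes of a scene include title, duration, number of shots, start shot, and end shot; the attributes of a shot include duration, start frame number, end frame number, the set of representative frames, and the feature-space vector; a frame has a large number of attributes, including its histogram, contour map, and DC and AC component maps.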
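The hierarchy and its attributes map naturally onto nested record types. The following Python sketch is one possible representation, with the assumptions that duration is counted in frames and that derived attributes (shot count, start and end shot) are computed rather than stored. The class and field names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Frame:
    number: int
    histogram: List[int] = field(default_factory=list)


@dataclass
class Shot:
    start_frame: int
    end_frame: int
    key_frames: List[Frame] = field(default_factory=list)
    feature_vector: List[float] = field(default_factory=list)

    @property
    def duration(self) -> int:
        # Duration in frames (an assumption; the text does not fix the unit).
        return self.end_frame - self.start_frame + 1


@dataclass
class Scene:
    title: str
    shots: List[Shot] = field(default_factory=list)

    @property
    def shot_count(self) -> int:
        return len(self.shots)

    @property
    def duration(self) -> int:
        return sum(shot.duration for shot in self.shots)

    @property
    def start_shot(self) -> Shot:
        return self.shots[0]

    @property
    def end_shot(self) -> Shot:
        return self.shots[-1]


@dataclass
class VideoSequence:
    scenes: List[Scene] = field(default_factory=list)

    @property
    def scene_count(self) -> int:
        return len(self.scenes)

    @property
    def duration(self) -> int:
        return sum(scene.duration for scene in self.scenes)
```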
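Video structural analysis means processing the video through shot segmentation, key-frame extraction, and scene segmentation to obtain its structural information, which in turn provides the basic access units for video retrieval and browsing.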
浅谈图像集在视觉概念检测中的应用
浅谈图像集在视觉概念检测中的应用摘要:视觉概念检测技术是一种对图像进行检测、管理及分类的有效方法,而检测算法需要有高质量的图像集作为训练集来测试算法的可行性及精确性。
本文介绍了理想的图像集应具备的特性及常用的图像集,为视觉概念检测的研究提供有价值的参考。
关键词:视觉概念图像集检测技术近年来,随着图像检索技术的快速发展,图像视觉内容信息作为一种直观形象、完整复现场景的信息表达形式产生着越来越重要的影响,可以说机器视觉的应用范围几乎涵盖了国民经济的各个行业,主要包括:工业、农业、医药、军事、航天、气象、天文、公安等。
面对如此大规模的图像视觉内容信息量,如何实现合理有效地组织、表达及搜索,已成为现阶段信息检索领域研究的热点问题。
视觉概念检测技术是一种对大量图像进行自动检测、管理及分类的有效方法,它通过合理的算法对获取的图像进行检测、识别、分类,从而达到用机器代替人来做图像测量和判断的目的。
若要使图像检测及分类准确性高,就需要使用高质量的图像集作为训练集,来验证算法的可行性及精确性。
1 理想的图像测试集应具备的特性[1]1.1 图像集应在图像检索领域具有代表性及整体性过去,研究人员使用的图像集常常是分散的,甚至可能自己的私人图像收藏,这样的测试集难免会具有片面性,理想情况是测试集包含许多不同的样本点,能够涵盖图像源的整个频谱,图像足够多到能够代表整个领域。
1.2 图像集应具备标准化的测试基准,以便执行客观的评价在目前的文献中,经常发生不同的研究人员在同一个图像集下执行不同的性能测试,这就使得无法执行比较基准。
标准化的测试基准应该至少包括典型的搜索概念、统一的图像信息,以及统一的绩效测量和报告的详细指引。
1.3 图像集应该便于用户访问及使用,而不必担心版权等问题有些图像集,如MPEG7测试集,被科学界使用已经有一些年了,但是现在却基本找不到,并且也不能随意的发布了。
对使用者来说,能够容易的访问并且在需要的时候可以再发表是必不可少的。
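12 Digital Media Resource Retrieval
(2) Content-based video retrieval. Content-based video retrieval (CBVR) retrieves video data from a large-scale video database according to the content of the video and its contextual relationships. It involves many techniques, such as video structure analysis (shot detection), automatic indexing of video data, and video clustering. Video structure analysis means segmenting the video into its basic units, shots, by detecting shot boundaries. Automatic indexing of video data includes key-frame selection and the extraction of static and motion features; video clustering is then performed on these features.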
库客数字音乐图书馆还有很多的英语读物资源,都是由英国 BBC广播电台、美国ABC广播电台当红主播亲自朗读的, 并结合丰富的古典音乐配乐,内容涵盖了儿童文学、诗歌名 著、小说、历史传记等近千部作品。
KUKE数字音乐图书馆
3.MyET多媒体英语资源库 (1)数据库简介 MyET多媒体英语资源库是由北京策腾文化有限公司于2002年 开发的一套英语学习的多媒体英语资源库,该库以听说训练 法为基础,专门为解决中国人学习英语过程中的最大问题— —听说障碍所设计,以较为精准的语音分析技术为核心,并 与国内外著名的英语教学出版社和期刊社合作,优选适合不 同学习水平的学习者需要的课程,让学习者既可以快速提高 口语水平,也能够通过长期使用真正有效提高英语的实际应 用能力。
检定英语口语能力,追踪学习进程。 “自我检定”功能,帮助用户了解自己的英语口语能力并加 以改善;“我的成绩”菜单则记录用户的学习课程及成绩进 步状况,随时掌握自己的学习成果;“口说能力诊断书”会 统计分析用户的口说能力状况,这就好像人的一份体检表, 如:发音部分是什么元音出现了红字? 语调起伏程度是否足 够? 说话中是否出现不该有的停顿? 重音是否常常放错位置 ?
基于形状特征的图像检索
题目:基于形状特征的图像检索系统的设计与实现基于形状特征的图像检索系统的设计与实现摘要近年来,随着多媒体和计算机互联网技术的快速发展,数字图像的数量正以惊人的速度增长。
面对日益丰富的图像信息海洋,人们需要有效地从中获取所期望得到多媒体信息。
因此,在大规模的图像数据库中进行快速、准确的检索成为人们研究的热点。
为了实现快速而准确地检索图像,利用图像的视觉特征,如颜色、纹理、形状等来进行图像检索的技术,也就是基于内容的图像检索技术(CBIR)应运而生[6]。
本文主要研究基于形状特征的图像检索,边缘检测是基于形状特征的一种检索方法,边缘是图像最基本的特性。
在图像边缘检测中,微分算子可以提取出图像的细节信息,景物边缘是细节信息中最具有描述景物特征的部分,也是图像分析中的一个不可或缺的部分。
本文详细地分析了一种边缘检测方法Canny算子,用C++编程实现各算子的边缘检测,并根据边缘检测的有效性和定位的可靠性,得出Canny算子具备有最优边缘检测所需的特性。
并通过基于轮廓的描述方法,傅里叶描述符对图像的形状特征进行描述并存入数据库中。
对行相应的检索功能。
关键词:图像检索;形状特征;Canny算子;边缘检测;傅里叶描述符Design and Implementation of Image Retrieval System Based onShape FeaturesABSTRACTWith the rapid development of multimedia and computer network technique, the quantity of digital image and video is going up fabulously. Facing the vast ocean of information of image, it has a good sense to obtain the desired multimedia information. Currently, rapid and effective searching for desired image from large-scale image databases becomes an hot research topic.In order to retrieve image quickly and accurately using image visual features such as color, texture, shape, which named content-based image retrieval (CBIR) came into being. This paper introduces the principle of wavelet transform applying to image edge detection. Edge detection is based on the shape of the characteristics of a retrieval method, and the edge is the most basic characteristics of the image. In the image edge detection ,differential operator can be used to extract the details of the images, features’ edge is the most detailed information describing the characteristics of the features of the image analysis, and is also an integral part of the image. This paper analyzes a Canny operator edge detection method, and we complete with the C++ language procedure to come ture edge detection. According to the effectiveness of the image detection and the reliability of the orientation, we can deduced that the Canny operator have the characteristics which the image edge has. And contour-based method for describing the image Fourier descriptors to describe the shape feature and stored in the database. 
Align the corresponding search function.Key words:image retrieval;sharp feature;Canny operator;edge detection;Fourier shape descriptors目录1 前言 (1)1.1 课题背景及研究意义 (1)1.2 国内外发展状况 (1)1.3 课题研究的主要内容 (2)2 基于形状特征的图像检索 (3)2.1 图像检索技术的发展过程 (3)2.1.1 基于内容的图像检索技术 (3)2.1.2 基于形状特征的图像检索 (3)2.2 边缘检测 (4)2.3 Canny边缘检测 (4)2.3.1 Canny指标 (4)2.3.2 Canny算子的实现 (5)2.4 基于轮廓的描述方法 (7)2.4.1 傅立叶形状描述符 (7)2.5 图像的相似性度量 (9)3 基于形状特征的图像检索系统的设计 (10)3.1 Canny算子的程序设计 (10)3.2 图像特征数据库设计 (11)3.3 实验结果 (12)4 基于形状特征的图像检索系统实现 (13)4.1 系统框架 (13)4.2 编程环境 (14)4.3 程序结果 (14)5 总结 (15)参考文献 (16)致谢 (17)附录 (18)1前言1.1课题背景及研究意义随着多媒体技术、计算机技术、通信技术及Intemet网络的迅速发展,人们正在快速地进入一个信息化社会。
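A Brief Description of Two Ways to Extract Video Key Frames
Author: 左璐. Source: 《北京电力高等专科学校学报》, 2010, No. 07.
I. Downloading and installing ffprobe
(1) Download ffprobe: go to the /projects/ffprobe/ page and download the ffprobe source code.
(2) Unpack the source:
    #tar -xzvf ffprob-53.tar.gz
(3) Install. For typical Linux software, building and installing from source goes as follows:
    #./configure    // check the environment and generate the makefile
    #make           // compile
    #make install   // install; usually requires root privileges
II. Extracting key frames
From the description above we already have an approach for extracting video key frames: use ffprobe to locate the key frames (that is, their pkt_dts values), then use ffmpeg to convert each key frame into a PNG image.
There are two ways to do this.
(1) Leave the source code unchanged and use a shell script: extract the pkt_dts values from the output of the ffprobe -show_frames command, then use ffmpeg to convert each such frame into an image. The key statements of my script:
    ./ffprobe -show_frames $1 > ffprobetmp 2>>/dev/null
    DTS=`awk 'BEGIN {FS="\n" ; RS="\["} {if($0~/codec_type=video/ && $0~/pkt_flag_key=K/) print substr($13,9)}' ffprobetmp`
    for singledts in $DTS
    do
        echo $singledts
        ffmpeg -i $1 -sameq -ss $singledts -vframes 1 -y -vcodec png ./png/$singledts.png
    done
Explanation:
    ./ffprobe -show_frames $1 > ffprobetmp 2>>/dev/null
        Run ffprobe to display the frame information and write it to the file ffprobetmp; error messages are not shown.
    DTS=`awk 'BEGIN {FS="\n" ; RS="\["} {if($0~/codec_type=video/ && $0~/pkt_flag_key=K/) print substr($13,9)}' ffprobetmp`
        Use awk to process the information output by ffprobe.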
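The awk filter above depends on pkt_dts being the 13th line of each frame block. The same selection can be sketched in Python by field name instead of position. This is only an illustration under stated assumptions: it expects text in the `[FRAME] ... [/FRAME]` key=value layout with the `codec_type`, `pkt_flag_key` and `pkt_dts` fields that the script relies on, and the sample text below is invented for the example (including the `_` flag for non-key frames); other ffprobe versions may name these fields differently.

```python
def keyframe_dts(ffprobe_output):
    """Collect pkt_dts for every video frame flagged as a key frame."""
    values = []
    for block in ffprobe_output.split("[FRAME]")[1:]:
        # Build a field dictionary from the key=value lines of this frame block.
        fields = dict(line.split("=", 1)
                      for line in block.splitlines() if "=" in line)
        if (fields.get("codec_type") == "video"
                and fields.get("pkt_flag_key") == "K"
                and "pkt_dts" in fields):
            values.append(fields["pkt_dts"])
    return values


# Hypothetical sample in the layout the shell script assumes.
SAMPLE = """[FRAME]
codec_type=video
pkt_flag_key=K
pkt_dts=0.000000
[/FRAME]
[FRAME]
codec_type=video
pkt_flag_key=_
pkt_dts=0.040000
[/FRAME]
[FRAME]
codec_type=audio
pkt_flag_key=K
pkt_dts=0.050000
[/FRAME]
[FRAME]
codec_type=video
pkt_flag_key=K
pkt_dts=0.480000
[/FRAME]
"""
```

Each returned value could then be passed to ffmpeg's `-ss` option in the same way as `$singledts` in the loop above.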
Content-based Video Retrieval
Centre for Telematics and Information Technology, University of Twente
P.O. Box 217, 7500 AE Enschede, The Netherlands
Email: milan@cs.utwente.nl

1. Introduction
With technology advances in multimedia, digital TV and information highways, a large amount of video data is now publicly available. However, without an appropriate search technique all these data are barely usable. Users are not satisfied with video retrieval systems that provide analogue VCR functionality. They want to query the content instead of raw video data. For example, a user may ask for a specific part of a video that contains some semantic information. Content-based search and retrieval of these data becomes a challenging and important problem. Therefore, the need for tools that can manipulate the video content in the same way as traditional databases manage numeric and textual data is significant.
This extended abstract presents our approach for content-based video retrieval. It is organised as follows. In the next section, we give an overview of related work. The third section describes our approach with emphasis on video modelling as one of the most critical processes in video retrieval. The fourth section presents our conclusions.

2. State of the art
Video content can be grouped into two types: low-level visual content and semantic content. Low-level visual content is characterised by visual features such as colour, shapes, textures etc. On the other hand, semantic content contains high-level concepts such as objects and events. The semantic content can be presented through many different visual presentations. The main distinction between these two types of content is the different requirements for extracting each of them.
The process of extracting the semantic content is more complex, because it requires domain knowledge or user interaction, while extraction of visual features is usually domain independent.
Extensive research efforts have been made with regard to the retrieval of video and image data based on their visual content such as colour distribution, texture and shape. These approaches fall into two categories: query by example and visual sketches. Both of these are based on similarity measurement. Examples include IBM's Query by Image Content (QBIC) [1], VisualSEEk [2], Photobook [3], Blobworld [4], as well as Virage video engine [5], CueVideo [6] and VideoQ [7] in the field of video. Query by example approaches are suitable if a user has a similar image at hand, but they would not perform well if the image is taken from a different angle or has a different scale. The naive user is interested in querying at the semantic level rather than having to use features to describe his concepts. Sometimes it is difficult to express concepts by sketching. Moreover, a good match in terms of the feature metrics may yield poor results (multiple domain recall, e.g. a query for 60% of green and 40% of blue may return an image of grass and sky, a green board on a blue wall or a blue car parked in front of a park, as well as many others).
Modelling the semantic content is more difficult than modelling the low-level visual content of a video. At the physical level video is a temporal sequence of pixel regions without direct relation to its semantic content. Therefore, it is very difficult to explore semantic content from the raw video data. In addition, if we consider the multiple semantic meanings that the same video content may have, such as metaphorical, associative, hidden or suppressed meaning, the problem becomes more complex.
The simplest way to model the video content is by using free text manual annotation.
Some approaches [8, 9] introduce additional video entities, such as objects and events, as well as their relations, that should be annotated, because they are subjects of interest in video. One of the major limitations of these approaches is that the search process is based mainly on attribute information, which is associated with a video segment manually or (semi)automatically in the process of annotation. These approaches are very limited in terms of spatial relations among sub-frame entities. Spatio-temporal data models overcome these limitations by associating the concept of video object with the sub-frame region that conveys useful information, and by defining events that include spatio-temporal relations among objects. Modelling of these high-level concepts gives the possibility to describe objects in space and time and capture movements of objects. As humans think in terms of events and remember different events and objects after watching video, these high-level concepts are the most important cues in content-based video retrieval. A few attempts to include these high-level concepts into a video model are made in [10, 11].
The distinction we made regarding modelling the video content makes two important things clear. On the one hand, feature-based models use automatically extracted features to represent the content of a video, but they do not provide semantics that describe high-level concepts of video, such as objects and events. On the other hand, semantic models usually use free text/attribute/keyword annotation to represent the high-level concepts of the video content, which has many shortcomings. The main one is that manual annotation is tedious, subjective and time consuming. Obviously, an integrated approach that provides automatic mapping from features to high-level concepts is the challenging solution.

3. The third way: Concept inferencing
In order to overcome the problem of mapping from features to high-level concepts we propose a layered video data model that has the following structure. The raw video data layer is at the bottom. This layer consists of a sequence of frames, as well as some video attributes, such as compression format, frame rate, number of bits per pixel, colour model, duration, etc. The next layer is the feature layer that consists of domain-independent features that can be automatically extracted from raw data. Examples are shapes, textures, colour histogram, as well as dynamic features characterising frame sequences, such as temporality, motion, etc. The concept layer is on the top. It consists of logical concepts that are subjects of interest of users or applications. Automatic mapping from the raw video data layer to the feature layer is already achieved, but automatic mapping from the feature layer to the concept layer is still a challenging problem. We simplify this problem by dividing the concept layer into an object layer and an event layer.
We define a region as a contiguous set of pixels that is homogeneous in features such as texture, colour, shape and motion. As we already mentioned, a region can be automatically extracted and tracked. Then, we define a video object as a collection of video regions which have been grouped together under some criteria defined by the domain knowledge. As we can see in the literature [12, 13, 14], automatic detection of video objects (sub-frame entities) in a known domain is feasible. For this purpose, we proposed an object grammar that consists of rules for object extraction. A simplified example of an object rule in the soccer domain could be "if the shape of a region is round, and the colour is white, and it is moving, that object is a ball".
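The soccer rule quoted above can be written down as a small rule table. The sketch below only illustrates the idea of an object rule in Python; it is not the object grammar of the paper, whose syntax is not given here. The `Region` fields and the `SOCCER_RULES` table are hypothetical, and in a real system the field values would come from the feature layer.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Region:
    shape: str    # e.g. "round", as delivered by a (hypothetical) shape feature
    colour: str   # dominant colour name
    moving: bool  # whether the tracked region moves between frames


ObjectRule = Tuple[str, Callable[[Region], bool]]

# One rule per object type: (label, predicate over region features).
SOCCER_RULES: List[ObjectRule] = [
    ("ball", lambda r: r.shape == "round" and r.colour == "white" and r.moving),
]


def classify(region: Region, rules: List[ObjectRule] = SOCCER_RULES) -> Optional[str]:
    """Return the label of the first rule the region satisfies, or None."""
    for label, predicate in rules:
        if predicate(region):
            return label
    return None
```

Keeping rules as data rather than hard-coded branches makes it possible to add domain-specific rules without touching the matching loop.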
For the second part of the problem, automatic mapping from this object layer to the event layer, we propose an event grammar that consists of rules for describing event types in terms of spatio-temporal object interactions. The event types can be primitive and compound. A primitive event type can be described using object types, spatio-temporal and real-world relations among object types, as well as audio segment types and temporal relations among them. In turn, predefined event types, their temporal relations, as well as real-world and spatial relations among their objects can together be part of a compound event type description. For example, in the soccer domain, if the ball object type is inside the goalpost object type for a while and this is followed by very loud shouting and a long whistle, that might indicate that someone has scored a goal, which should be recognised as a goal event.
The main advantage of the proposed layered video data model is automatic mapping from features to concepts. This approach bridges the gap between domain-independent features, such as colour histograms, shapes and textures, and domain-dependent high-level concepts such as objects and events. The proposed event grammar formalises the description of spatio-temporal object interactions. However, metaphorical, associative, hidden or suppressed meaning of the video content is not covered by this grammar. Although we proposed the traditional annotation approach for this kind of content, this could be a direction of our future work.

4. Conclusion
We proposed a layered video data model that integrates audio and video primitives. The four-layer structure of our video model eases the process of translating raw video data into an efficient internal representation that captures video semantics. Our model allows dynamic (ad-hoc) definition of video objects and events that can be used in the process of content-based retrieval.
This enables a user to dynamically define a new event, insert a new index for it and query the database, all by one query. Easy description of video content is supported by robust object and event grammars that can be used for specifying even very complex objects and events. With the proposed event grammar, we try to go one step further in video content description. We put effort into formalising events as descriptions of objects' (inter)actions. This results in easier capturing of high-level concepts of video content, and queries are closer to the user's way of thinking (users' cognitive maps of a video). The corresponding query language enables users to specify a wide range of queries using audio, video and image media types. The layered model structure allows dynamic logical segmentation of video data during querying.
A prototype of a video database system based on the proposed model and query language is under development. We use the MOA object algebra [15] developed at the University of Twente and the MONET database management system [16] developed at CWI and the University of Amsterdam as the implementation platform.

References
[1] M. Flickner, H. Sawhney, W. Niblack et al., "Query by Image and Video Content: The QBIC System", IEEE Computer, 28, (Sept. 1995), pp. 23-32.
[2] J. R. Smith, S-F. Chang, "VisualSEEk: A Fully Automated Content-Based Image Query System", ACM Multimedia Conference, Boston, MA, November 1996.
[3] A. Pentland, R. W. Picard, S. Sclaroff, "Photobook: Content-Based Manipulation of Image Databases", Int. J. Computer Vision, 18 (3), pp. 233-254.
[4] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, J. Malik, "Blobworld: A System for Region-Based Image Indexing and Retrieval", Third Int. Conf. on Visual Information and Information Systems, Amsterdam, 1999, pp. 509-516.
[5] A. Hampapur, A. Gupta, B. Horowitz, C-F. Shu, C. Fuller, J. Bach, M. Gorkani, R. Jain, "Virage Video Engine", SPIE Vol. 3022, 1997.
[6] D. Ponceleon, S. Srinivasan, A. Amir, D. Petkovic, D. Diklic, "Key to Effective Video Retrieval: Effective Cataloging and Browsing", ACM Multimedia '98, pp. 99-107.
[7] S-F. Chang, W. Chen, H. Meng, H. Sundaram, D. Zhong, "A Fully Automated Content Based Video Search Engine Supporting Spatio-Temporal Queries", IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, No. 5, Sept. 1998.
[8] S. Adali, K. S. Candan, S-S. Chen, K. Erol, V. S. Subrahmanian, "Advanced Video Information System: Data Structure and Query Processing", Multimedia Systems, Vol. 4, No. 4, Aug. 1996, pp. 172-86.
[9] C. Decleir, M-S. Hacid, J. Kouloumdjian, "A Database Approach for Modelling and Querying Video Data", LTCS-Report 99-03, 1999.
[10] H. Jiang, A. Elmagarmid, "Spatial and temporal content-based access to hypervideo databases", VLDB Journal, 1998, No. 7, pp. 226-238.
[11] J. Z. Li, M. T. Ozsu, D. Szafron, "Modeling of Video Spatial Relationships in an Object Database Management System", Proc. of Int. Workshop on Multi-media Database Management Systems, 1996, pp. 124-132.
[12] Y. Gong, L. T. Sin, C. H. Chuan, H-J. Zhang, M. Sakauchi, "Automatic Parsing of TV Soccer Programs", IEEE International Conference on Multimedia Computing and Systems, Washington D.C., 1995, pp. 167-174.
[13] S. Intille, A. Bobick, "Visual Tracking Using Closed-Worlds", M.I.T. Media Laboratory, Technical Report No. 294, Nov. 1994.
[14] G. P. Pingali, Y. Jean, I. Carlbom, "LucentVision: A System for Enhanced Sports Viewing", Proc. of Visual'99, Amsterdam, 1999, pp. 689-696.
[15] P. Boncz, A. N. Wilschut, M. L. Kersten, "Flattening an object algebra to provide performance", Proceedings of the 14th IEEE International Conference on Data Engineering, Orlando, Florida, 1998, pp. 568-577.
[16] P. Boncz, M. L. Kersten, "Monet: An Impressionist Sketch of an Advanced Database System", Proceedings Basque International Workshop on Information Technology, San Sebastian, Spain, July 1995.
An Approach for Video Retrieval by Video Clip
+ Corresponding author: Phn: 86-10-62752426, Fax: 86-10-62981438, E-mail: peng_yuxin@
Received 2002-11-25; Accepted 2003-03-20 Peng YX, Ngo CW, Dong QJ, Guo ZM, Xiao JG. An approach for video retrieval by video clip. Journal of Software, 2003,14(8):1409~1417. /1000-9825/14/1409.htm Abstract: Video clip retrieval plays a critical role in the content-based video retrieval. Two major concerns in this
Vol.14, No.8
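彭宇新 1,2+, Ngo Chong-Wah 3, 董庆杰 1,2, 郭宗明 1,2, 肖建国 1,2
1 (Institute of Computer Science and Technology, Peking University, Beijing 100871)
2 (National Key Laboratory of Text Information Processing Technology, Peking University, Beijing 100871)
3 (Department of Computer Science, City University of Hong Kong, Hong Kong)
Video clip retrieval is the main mode of content-based video retrieval. It has to solve two problems: (1) automatically segmenting from the video database the multiple clips that are similar to the query clip; (2) ranking these similar clips from highest to lowest similarity. This work is the first attempt to apply matching theory from graph theory to these two
∗ About the first author: 彭宇新 (born 1974), male, from Duyun, Guizhou; PhD candidate; main research area: content-based video retrieval.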
音频信号的分类与分割
哈尔滨理工大学毕业设计题目:音频信号的分类与分割院系:电气与电子工程学院姓名:指导教师:系主任:2011年6月23日音频信号的分类与分割摘要随着计算机技术、网络技术和通讯技术的不断发展,图像、视频、音频等多媒体数据已逐渐成为信息处理领域中主要的信息媒体形式,其中音频信息占有很重要的地位。
同时,由于信息获取的方式、手段和技术的不断进步和多样化,使得信息数据量以极高的速度增加,为有效的处理和组织信息带来了挑战,而信息有效的处理和组织是深入分析和充分利用的前提。
原始音频数据是一种非语义符号表示和非结构化的二进制流,缺乏内容语义的描述和结构化的组织,给音频信息的深度处理和分析工作带来了很大的困难。
如何提取音频中的结构化信息和内容语义是音频信息深度处理、基于内容检索和辅助视频分析等应用的关键。
音频分类与分割技术是解决这一问题的关键技术,是音频结构化的基础。
本文介绍了在MATLAB环境中如何进行语音信号采集后的时频域分析处理,并通过实例分析了应用MATLAB处理语音信号的过程。
本文根据模式识别理论分析了音频分类与分割的技术流程,同时讨论了其中涉及的相关技术;介绍了特征分析与抽取,以及采用的相关音频处理技术。
关键词MATLAB;语音信号;特征分析The classification and segmentation of the AudioAbstractWith the continually evolving of computer technology, network technology and communication technology, images, video, audio and other multimedia data in the field of information processing has become the main form of information media, audio information plays an especially important role.At the same time, due to the way access to information, tools and technology continues to progress and diversify, the amount of data information increase at very high speed, which has brought challengesfor efficient processing and organizing of the information , and effective processing and organization of i information are premise of analysis and full use of the .The original audio data is a non-semantic notation and unstructured binary stream, lack of content and structure of semantic description of the organization, which has led to great difficulties to the depth of audio information processing and analysis. How to extract structured information in audio and audio information content is the key for the depth of semantic processing, video content-based retrieval and analysis applications supporting. 
Audio classification and segmentation is a key technology to solve this problem is the structural basis for the audio.This article describes how the MATLAB environment for voice signal collected after the time-frequency domain analysis and processing, and analysis of the application by example MATLAB to handle voice signals.Our theoretical analysis is based on pattern recognition, audio classification and segmentation of the technical process, and involving the relatedtechnologies discussed; We describe the characteristics analysis and extraction, and to the corresponding audio processing technologyThe last chapter involves the summary and evaluation all the work of the paper, and this research were discussed for future.Keywords:MATLAB;V oice signal; Characteristics目录摘要 (I)Abstract............................................................................................................... I I第1章绪论 (1)1.1 研究背景 (1)1.2 语音信号的采集 (3)1.2.1 预加重处理 (3)1.2.2 切分与加窗处理 (3)1.3 研究的主要内容 (4)第2章音频分类与分割技术研究现状 (5)2.1 音频语义内容分析 (5)2.2 层次化音频结构分析框架 (6)第3章音频信号特征的提取 (8)3.1 语音端点检测的基本方法 (8)3.1.1 短时加窗处理 (8)3.1.2 短时平均能量 (8)3.2 短时平均过零率 (11)3.3 基于能量和过零率的语音端点检测 (14)第4章语音信号的短时频阈分析 (16)4.1 语音信号的快速傅里叶变换 (16)4.2 临界频带谱平坦测度函数计算 (18)4.3 基于短时能量比的语音端点检测算法的研究 (19)4.4 音频信号的功率谱分析 (20)4.5 音频信号的子带熵分析 (21)结论 (22)致谢 (23)参考文献 (24)附录A (26)附录B (33)第1章绪论随着计算机技术和信息技术的发展,语音交互已经成为人机交互的必要手段,而语音信号的采集和处理是人机交互的前提和基础。
基于3d模型的行为识别
A Dissertation Submitted in Partial Fulfillment of the Requirementsfor the Degree of Master in EngineeringAction Recognition Based on3D ModelMaster Candidate:Jing DuMajor:Computer Application Technology Supervisor:Prof.Dongfang ChenWuhan University of Science and TechnologyWuhan,Hubei430081,P.R.ChinaMay,2014摘要人体行为识别技术被广泛应用于智能安全监控、自然用户界面和基于内容的图像和视频检索。
随着深度图像摄像头价格的不断降低和精确度的逐步提高,如何利用Kinect这类深度图像摄像头提取出的3D人体骨架关节点模型进行行为识别成为学者们关注的课题。
运动特征的表示方法是行为识别的重要部分。
本文提出一种名为肢体角度模型的姿态表示模型,此模型可以有效避免方向敏感性、人体长度和人体组成部分之间长度比例的影响。
肢体角度模型将所有关节点坐标转换到由一些身体部位的关节点建立的球坐标系上,然后计算由相邻关节点组成的肢体在这个球坐标系上的角度信息。
在行为归类步骤中使用隐马尔可夫模型作为模式识别工具。
实验中收集了不同视角方向和不同人体的骨架序列作为实验样本。
实验结果证明了方法的有效性。
关键词:3D模型;关节点;行为识别;姿态表示;隐马尔科夫模型AbstractHuman action recognition technology has been wildly applied to intelligent security surveillance,natural user interface and content-based image and video retrieval.As depth camera becomes cheap and accurate,how to make use of the new type of data,3D skeleton joint model extracted by depth camera such as Kinect,has been a highly active research topic.Action feature estimate is in important part of action recognition.A posture representation model named limb angle model is proposed,which is invariant to limb length,length ratio between body parts and body orientation.This model contains polar angle and azimuthal angle of each limb that consist of tow adjacent joints in the spherical coordinate system which is established by the features of body joints.Hidden Markov Model(HMM)is exploited as a pattern recognition tool for action classification.Skeleton sequences of different body orientation and different people are collected as experimental data.Experimental results demonstrate the effectiveness of our approach.Keywords:3D Model;Joint;Action Recognition;Posture Representation;HMM目录摘要 (I)Abstract (II)目录 (III)第1章绪论 (1)1.1研究背景及意义 (1)1.2行为识别技术发展概况 (4)1.2.1行为识别流程 (4)1.2.2基于2D图像序列的运动特征 (4)1.2.3基于3D图像序列的运动特征 (5)1.3全文研究内容及章节安排 (6)第2章深度图像捕捉设备 (8)2.1深度感应 (8)2.2深度图像感应设备 (9)2.2.1立体摄像机 (9)2.2.2结构光传感器 (9)2.2.3ToF摄像机 (9)2.3Kinect for Windows (10)第3章肢体角度模型 (12)3.1姿态表示模型 (12)3.2Kinect人体模型 (14)3.3肢体角度模型 (15)3.3.1关节点和行为识别 (15)3.3.2关节点 (16)3.3.3关节点位置信息提取 (17)3.3.4肢体 (18)3.3.5坐标系建立与转换 (19)3.3.6球坐标系转换 (20)3.3.7肢体角度 (23)3.3.8模型可重构特性 (24)3.3.9肢体角度差异 (25)3.3.10姿态差异衡量方法 (26)3.3.11肢体角度模型总结 (27)第4章行为归类 (28)4.1隐马尔可夫模型介绍 (28)4.1.1马尔可夫过程 (28)4.1.2隐马尔可夫模型 (28)4.1.3隐马尔可夫模型的训练和概率计算 (30)4.2隐马尔可夫模型应用 (30)4.2.1关键姿态 (31)4.2.2训练与识别 (31)第5章实验 (32)5.1实验样本 (32)5.2实验环境 (32)5.3样本存储 (33)5.4实验结果 (33)5.4.1不同视角样本之间的交叉实验 (33)5.4.2不同人体之间的交叉实验 (34)5.4.3阈值实验 (34)第6章结论与展望 (36)致谢 (37)参考文献 (38)附录1攻读硕士学位期间发表的论文 (42)附录2攻读硕士学位期间参加的科研项目 (43)第1章绪论1.1研究背景及意义人体行为识别是计算机视觉领域中一项非常热门的研究课题。
Measuring Video Similarity
Authors: 吴翌; 庄越挺; 潘云鹤
Journal: 《计算机辅助设计与图形学学报》
Year (volume), issue: 2001 (013) 003
Pages: 5 (P284-288)
Affiliation (all authors): Institute of Artificial Intelligence, Zhejiang University
Language of the full text: Chinese
CLC classification: TP391.4
Abstract: In content-based video retrieval systems, the most common retrieval method is query by example: the user submits a video and the system returns a set of similar videos. How to define whether two videos are similar, however, remains a difficult problem. This paper puts forward a video similarity model to address it. First, it introduces the centroid feature vector of a shot, which reduces the storage needed for key-frame features. Second, taking into account the latent factors in human visual judgement, it proposes methods for measuring similarity between shots and overall similarity between videos. Measuring at these different granularities provides great flexibility and is reasonable in practice. Retrieval experiments demonstrate the effectiveness of the algorithm.
Related literature:
1. Methods for determining attribute weights and their application in measuring aircraft-type similarity [J], 杨卫东; 左洪福; 李怀远; 刘若晨; 蔡景
2. A new method for measuring gene semantic similarity [J], 张少华; 尚学群; 王淼
3. Adaptive transfer of decision trees based on similarity measurement [J], 王雪松; 潘杰; 程玉虎; 曹戈
4. Image super-resolution reconstruction based on structural similarity measurement [J], 张晶
5. Application of a new colour similarity measure in image retrieval [J], 顾晓东; 杨诚
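The shot centroid feature vector mentioned in the abstract can be illustrated as the element-wise mean of a shot's key-frame feature vectors, so that one vector is stored per shot instead of one per key frame. This reading is an assumption based on the abstract alone. The distance-based similarity score below is a placeholder chosen for the example and is not the measure proposed in the paper.

```python
import math


def shot_centroid(keyframe_features):
    """Element-wise mean of the key-frame feature vectors of one shot."""
    n = len(keyframe_features)
    return [sum(column) / n for column in zip(*keyframe_features)]


def shot_similarity(centroid_a, centroid_b):
    """Placeholder similarity in (0, 1]: 1 / (1 + Euclidean distance)."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(centroid_a, centroid_b)))
    return 1.0 / (1.0 + distance)
```

Storing only the centroid trades some detail inside the shot for a database whose size grows with the number of shots rather than the number of key frames.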
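Development and Application of Intelligent Video Retrieval Technology in Campus Security
Li Ya; Zhang Xiaoping
Abstract: The campus is an important place for cultivating the builders and successors of the nation, and it carries the hopes of the country, society, and families.
For this reason, campus safety management is not only a matter of the campus's own environment but also an important issue affecting the stability of society as a whole, and it is increasingly a focus of attention across society.
With the arrival of the mobile Internet era, the safe campus has become an important topic for many researchers.
Effective new security technologies and products, together with sound management methods, must be used to strengthen the deterrent power of the security system and to keep the campus under long-term, stable, real-time monitoring so that it stays safe.
This article mainly describes the development and application of intelligent video retrieval technology in campus security work in the mobile Internet era.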
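Journal: Digital Technology & Application (数字技术与应用)
Year (volume), issue: 2016, (9)
Pages: 1 (p. 104)
Keywords: video surveillance; intelligent video retrieval technology; video processing
Authors: Li Ya; Zhang Xiaoping
Affiliations: Admissions Office for Colleges and Technical Secondary Schools of the Xinjiang Uygur Autonomous Region, Xinjiang 830000; Xinjiang Light Industry Vocational and Technical College, Xinjiang 830000
Language of the full text: Chinese
Chinese Library Classification: TP391.41

As safe-campus work advances, cameras and early-warning devices have been installed in every corner of schools, universities in particular.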
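On the one hand, most security technologies and devices are designed for ordinary residential settings, whereas schools have many buildings, dense and varied populations, and a complex mix of functional areas.
Ordinary security technologies and devices therefore cannot effectively prevent or handle public-order and safety incidents at schools.
On the other hand, students attract a high degree of public attention, and keeping the school environment safe and stable matters greatly to the overall stability of the country and society. Traditional video retrieval based on preset alarms and timestamps is often powerless to extract valuable information quickly from the massive recordings of tens of thousands of cameras, so quickly extracting useful leads from massive video collections becomes especially important.
Ordinary video retrieval relies mainly on manual annotation and sheer manpower. A school's system is storing massive amounts of video at every moment, and although manual annotation is easy to implement, it is severely limited: a single video often costs considerable labour and resources, obtaining useful information takes too long, and omissions and errors still cannot be fully avoided [1].
Before the 1990s, analogue systems could not support remote control, could not be scaled up freely, and could hardly store images at large scale over long periods; digitisation was the prerequisite for the development of video surveillance technology.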
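The GRU Method
GRU (Gated Recurrent Unit) is a recurrent neural network (RNN) model commonly used for processing sequence data.
Compared with the conventional RNN structure, GRU has greater modelling capacity and handles long-term dependencies better.
This article introduces the principle, structure, and applications of GRU, and lists some related papers and cases for reference.
I. Principle and Structure of GRU
GRU is a variant of the RNN that adds a gating mechanism to the conventional RNN in order to better control how information is updated and forgotten.
Compared with the long short-term memory network (LSTM), GRU uses fewer gates, which simplifies the structure and reduces the computational cost.
The main idea of GRU is to regulate the flow of information with two gating signals, the reset gate and the update gate.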
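1. Reset gate: the reset gate decides which information from the previous state the model should discard at the current time step.
It computes a weighted sum of the current input and the previous hidden state, passes the result through a sigmoid activation, and outputs a vector whose values lie between 0 and 1.
When the reset gate's output is close to 0, the previous state information is discarded; when it is close to 1, the previous state information is kept.
2. Update gate: the update gate controls the trade-off between the current input and the previous state.
Like the reset gate, it applies a sigmoid to a weighted sum of the current input and the previous hidden state and outputs a vector whose values lie between 0 and 1.
When the gate is close to 0, the update is ignored and the previous state is carried over; when it is close to 1, the state is updated with the new information. (This assumes the update h_t = (1 − z_t)·h_{t−1} + z_t·h̃_t; some formulations swap the roles of z_t and 1 − z_t.)
3. Hidden state: in GRU the hidden state is updated at every time step.
Through the adjustments made by the reset and update gates, the hidden state captures the relevant information in the input sequence and retains it in later time steps.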
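The three items above can be sketched as a single GRU time step in plain Python. This is a minimal illustration rather than a reference implementation: the parameter names (`Wr`, `Ur`, `br` and so on), the helper `zero_params`, and the dictionary layout are assumptions made for this example, and the update convention is the one in which an update gate close to 1 lets the new candidate state through.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def zero_params(input_dim, hidden_dim):
    # Hypothetical helper: all-zero weights, handy for checking the gate logic.
    def mat(cols):
        return [[0.0] * cols for _ in range(hidden_dim)]
    return {
        "Wr": mat(input_dim), "Ur": mat(hidden_dim), "br": [0.0] * hidden_dim,
        "Wz": mat(input_dim), "Uz": mat(hidden_dim), "bz": [0.0] * hidden_dim,
        "Wh": mat(input_dim), "Uh": mat(hidden_dim), "bh": [0.0] * hidden_dim,
    }


def gru_step(x, h_prev, p):
    """One GRU step; assumed convention: h = (1 - z) * h_prev + z * h_cand."""
    # Reset gate: how much of the previous state feeds the candidate.
    r = [sigmoid(a + b + c)
         for a, b, c in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev), p["br"])]
    # Update gate: how much of the candidate replaces the previous state.
    z = [sigmoid(a + b + c)
         for a, b, c in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev), p["bz"])]
    # Candidate state, computed from the input and the reset-scaled previous state.
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(a + b + c)
              for a, b, c in zip(matvec(p["Wh"], x), matvec(p["Uh"], reset_h), p["bh"])]
    # New hidden state: element-wise blend of old state and candidate.
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]
```

With all-zero parameters both gates output 0.5 and the candidate is 0, so the step simply halves the previous state; pushing the update-gate bias strongly negative closes the gate and leaves the previous state unchanged.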
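II. Applications of GRU
As a powerful sequence-modelling tool, GRU is widely used in natural language processing, speech recognition, image caption generation, and other tasks.
Some related papers and cases are listed below for reference:
1. "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation": this paper proposes an RNN-based encoder-decoder framework in which the encoder uses GRU units to encode the input sequence into a fixed-length vector representation for machine translation.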
Content-based Video Retrieval

Centre for Telematics and Information Technology, University of Twente
P.O. Box 217, 7500 AE Enschede, The Netherlands
Email: milan@cs.utwente.nl

1. Introduction

With advances in multimedia technology, digital TV, and information highways, a large amount of video data is now publicly available. Without appropriate search techniques, however, these data are hardly usable. Users are not satisfied with video retrieval systems that offer only analogue VCR functionality; they want to query the content rather than the raw video data. For example, a user may ask for a specific part of a video that contains certain semantic information. Content-based search and retrieval of these data is therefore a challenging and important problem, and there is a significant need for tools that can manipulate video content in the same way that traditional databases manage numeric and textual data.

This extended abstract presents our approach to content-based video retrieval. It is organised as follows. The next section gives an overview of related work. The third section describes our approach, with emphasis on video modelling as one of the most critical processes in video retrieval. The fourth section draws conclusions.

2. State of the art

Video content can be grouped into two types: low-level visual content and semantic content. Low-level visual content is characterised by visual features such as colour, shape, and texture. Semantic content, on the other hand, contains high-level concepts such as objects and events, and the same semantic content can be presented through many different visual presentations. The main distinction between the two types lies in what is required to extract them. Extracting semantic content is more complex, because it requires domain knowledge or user interaction, while extraction of visual features is usually domain independent.

Extensive research efforts have been made with regard to the retrieval of video and image data based on their visual content, such as colour distribution, texture, and shape. These approaches fall into two categories: query by example and visual sketches. Both are based on similarity measurement. Examples include IBM’s Query by Image Content (QBIC) [1], VisualSEEk [2], Photobook [3], and Blobworld [4], as well as the Virage video engine [5], CueVideo [6], and VideoQ [7] in the field of video. Query-by-example approaches are suitable if a user has a similar image at hand, but they do not perform well if the image is taken from a different angle or has a different scale. The naive user is interested in querying at the semantic level rather than having to use features to describe his concepts, and it is sometimes difficult to express concepts by sketching. Moreover, a good match in terms of the feature metrics may still yield poor results (multiple domain recall): a query for 60% green and 40% blue may return an image of grass and sky, a green board on a blue wall, or a blue car parked in front of a park, as well as many others.

Modelling the semantic content is more difficult than modelling the low-level visual content of a video. At the physical level, video is a temporal sequence of pixel regions without direct relation to its semantic content, so it is very difficult to derive semantic content from the raw video data. If, in addition, we consider the multiple semantic meanings that the same video content may have, such as metaphorical, associative, hidden, or suppressed meaning, the problem becomes even more complex.

The simplest way to model the video content is by using free-text manual annotation.
Some approaches [8, 9] introduce additional video entities that should be annotated, such as objects and events and the relations between them, because they are the subjects of interest in video. One of the major limitations of these approaches is that the search process is based mainly on attribute information, which is associated with a video segment manually or (semi)automatically during annotation. These approaches are also very limited in terms of spatial relations among sub-frame entities. Spatio-temporal data models overcome these limitations by associating the concept of a video object with the sub-frame region that conveys useful information, and by defining events that include spatio-temporal relations among objects. Modelling these high-level concepts makes it possible to describe objects in space and time and to capture their movements. As humans think in terms of events and remember different events and objects after watching a video, these high-level concepts are the most important cues in content-based video retrieval. A few attempts to include these high-level concepts in a video model are made in [10, 11].

The distinction we have drawn regarding the modelling of video content makes two important things clear. On the one hand, feature-based models use automatically extracted features to represent the content of a video, but they do not provide semantics that describe high-level concepts of video, such as objects and events. On the other hand, semantic models usually use free-text, attribute, or keyword annotation to represent the high-level concepts of the video content, which has many shortcomings, the main one being that manual annotation is tedious, subjective, and time consuming. An integrated approach that provides automatic mapping from features to high-level concepts is therefore the challenging solution.

3. The third way: Concept inferencing

To overcome the problem of mapping from features to high-level concepts, we propose a layered video data model with the following structure. The raw video data layer is at the bottom. This layer consists of a sequence of frames, as well as some video attributes, such as compression format, frame rate, number of bits per pixel, colour model, and duration. The next layer is the feature layer, which consists of domain-independent features that can be automatically extracted from raw data. Examples are shapes, textures, and colour histograms, as well as dynamic features characterising frame sequences, such as temporality and motion. The concept layer is on the top. It consists of logical concepts that are of interest to users or applications. Automatic mapping from the raw video data layer to the feature layer has already been achieved, but automatic mapping from the feature layer to the concept layer is still a challenging problem. We simplify this problem by dividing the concept layer into an object layer and an event layer.

We define a region as a contiguous set of pixels that is homogeneous in features such as texture, colour, shape, and motion. As already mentioned, a region can be automatically extracted and tracked. We then define a video object as a collection of video regions that have been grouped together under some criteria defined by the domain knowledge. As the literature shows [12, 13, 14], automatic detection of video objects (sub-frame entities) in a known domain is feasible. For this purpose, we propose an object grammar that consists of rules for object extraction. A simplified example of an object rule in the soccer domain could be “if the shape of a region is round, and the colour is white, and it is moving, that object is a ball”.
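The simplified ball rule can be read as a predicate over region features. The sketch below is only an illustration of that reading, not the object grammar itself: the `Region` fields, the string feature values, and the `classify` function are hypothetical names introduced here, and the example applies a rule to a single region although a video object may group several regions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Region:
    # Hypothetical feature summary of one tracked region.
    shape: str
    colour: str
    moving: bool


# A rule pairs an object label with a predicate over region features.
Rule = Tuple[str, Callable[[Region], bool]]

# Assumed soccer-domain rule set containing only the ball rule from the text.
SOCCER_RULES: List[Rule] = [
    ("ball", lambda r: r.shape == "round" and r.colour == "white" and r.moving),
]


def classify(region: Region, rules: List[Rule]) -> Optional[str]:
    """Return the label of the first rule the region satisfies, or None."""
    for label, predicate in rules:
        if predicate(region):
            return label
    return None
```

Keeping rules as data means a new domain only needs a new rule list, while the matching loop stays unchanged.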
For the second part of the problem, the automatic mapping from the object layer to the event layer, we propose an event grammar that consists of rules for describing event types in terms of spatio-temporal object interactions. Event types can be primitive or compound. A primitive event type can be described using object types, spatio-temporal and real-world relations among object types, and audio segment types with the temporal relations among them. A compound event type description, in turn, can combine predefined event types, their temporal relations, and the real-world and spatial relations among their objects. For example, in the soccer domain, if the ball object type is inside the goalpost object type for a while and this is followed by very loud shouting and a long whistle, that might indicate that someone has scored a goal, which should be recognised as a goal event.

The main advantage of the proposed layered video data model is automatic mapping from features to concepts. This approach bridges the gap between domain-independent features, such as colour histograms, shapes, and textures, and domain-dependent high-level concepts, such as objects and events. The proposed event grammar formalises the description of spatio-temporal object interactions. However, metaphorical, associative, hidden, or suppressed meaning of the video content is not covered by this grammar. Although we propose the traditional annotation approach for this kind of content, this could be a direction of our future work.

4. Conclusion

We proposed a layered video data model that integrates audio and video primitives. The four-layer structure of our video model eases the process of translating raw video data into an efficient internal representation that captures video semantics. Our model allows dynamic (ad-hoc) definition of video objects and events that can be used in the process of content-based retrieval.
This enables a user to dynamically define a new event, insert a new index for it, and query the database, all in one query. Easy description of video content is supported by robust object and event grammars that can be used for specifying even very complex objects and events. With the proposed event grammar, we try to go one step further in video content description. We put effort into formalising events as descriptions of objects’ (inter)actions. This makes high-level concepts of video content easier to capture and brings queries closer to the user's way of thinking (users’ cognitive maps of a video). The corresponding query language enables users to specify a wide range of queries using audio, video, and image media types. The layered model structure allows dynamic logical segmentation of video data during querying.

A prototype of a video database system based on the proposed model and query language is under development. We use the MOA object algebra [15], developed at the University of Twente, and the MONET database management system [16], developed at CWI and the University of Amsterdam, as the implementation platform.

References

[1] M. Flickner, H. Sawhney, W. Niblack et al., “Query by Image and Video Content: The QBIC System”, IEEE Computer, 28, (Sept. 1995), pp. 23-32.
[2] J. R. Smith, S-F. Chang, “VisualSEEk: A Fully Automated Content-Based Image Query System”, ACM Multimedia Conference, Boston, MA, November 1996.
[3] A. Pentland, R. W. Picard, S. Sclaroff, “Photobook: Content-Based Manipulation of Image Databases”, Int. J. Computer Vision, 18 (3), pp. 233-254.
[4] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, J. Malik, “Blobworld: A System for Region-Based Image Indexing and Retrieval”, Third Int. Conf. on Visual Information and Information Systems, Amsterdam, 1999, pp. 509-516.
[5] A. Hampapur, A. Gupta, B. Horowitz, C-F. Shu, C. Fuller, J. Bach, M. Gorkani, R. Jain, “Virage Video Engine”, SPIE Vol. 3022, 1997.
[6] D. Ponceleon, S. Srinivasan, A. Amir, D. Petkovic, D. Diklic, “Key to Effective Video Retrieval: Effective Cataloging and Browsing”, ACM Multimedia ’98, pp. 99-107.
[7] S-F. Chang, W. Chen, H. Meng, H. Sundaram, D. Zhong, “A Fully Automated Content Based Video Search Engine Supporting Spatio-Temporal Queries”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 8, No. 5, Sept. 1998.
[8] S. Adali, K. S. Candan, S-S. Chen, K. Erol, V. S. Subrahmanian, “Advanced Video Information System: Data Structure and Query Processing”, Multimedia Systems, Vol. 4, No. 4, Aug. 1996, pp. 172-86.
[9] C. Decleir, M-S. Hacid, J. Kouloumdjian, “A Database Approach for Modelling and Querying Video Data”, LTCS-Report 99-03, 1999.
[10] H. Jiang, A. Elmagarmid, “Spatial and temporal content-based access to hypervideo databases”, VLDB Journal, 1998, No. 7, pp. 226-238.
[11] J. Z. Li, M. T. Ozsu, D. Szafron, “Modeling of Video Spatial Relationships in an Object Database Management System”, Proc. of Int. Workshop on Multi-media Database Management Systems, 1996, pp. 124-132.
[12] Y. Gong, L. T. Sin, C. H. Chuan, H-J. Zhang, M. Sakauchi, “Automatic Parsing of TV Soccer Programs”, IEEE International Conference on Multimedia Computing and Systems, Washington D.C., 1995, pp. 167-174.
[13] S. Intille, A. Bobick, “Visual Tracking Using Closed-Worlds”, M.I.T. Media Laboratory, Technical Report No. 294, Nov. 1994.
[14] G. P. Pingali, Y. Jean, I. Carlbom, “LucentVision: A System for Enhanced Sports Viewing”, Proc. of Visual’99, Amsterdam, 1999, pp. 689-696.
[15] P. Boncz, A. N. Wilschut, M. L. Kersten, “Flattening an object algebra to provide performance”, Proceedings of the 14th IEEE International Conference on Data Engineering, Orlando, Florida, 1998, pp. 568-577.
[16] P. Boncz, M. L. Kersten, “Monet: An Impressionist Sketch of an Advanced Database System”, Proceedings Basque International Workshop on Information Technology, San Sebastian, Spain, July 1995.