Probabilistic region relevance learning for content-based image retrieval
Probabilistic Graphical Models
(h) is the only remaining node, so it is placed 8th.
This yields the final elimination ordering: <A, X, D, T, S, L, B, R>.
Note: minimum-fill (fewest added edges) search often produces better orderings than other methods!
Basic concepts 2. Starting from F(X1, X2, ..., Xn), a function of (X2, X3, ..., Xn) can be obtained as follows:

G(X2, X3, ..., Xn) = Σ_{X1} F(X1, X2, ..., Xn)

This process is called elimination (summing out X1).
Pseudocode description of the VE algorithm
VE(N, E, e, Q, ρ)
Input: N, a Bayesian network; E, the evidence variables; e, the observed values of the evidence variables; Q, the query variables; ρ, an elimination ordering containing all variables not in Q ∪ E
Output: P(Q | E = e)
1. f ← the set of all probability distributions (factors) in N;
2. In the factors of f, set the evidence variables E to their observed values e;
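The pseudocode breaks off after step 2. To make the whole procedure concrete, here is a minimal Python sketch of variable elimination; the factor representation (a tuple of variable names plus a table keyed by value tuples) and the helper names are my own choices, not part of the original slides.

from itertools import product

# A factor is (vars, table): vars is a tuple of variable names, and table
# maps a tuple of values (one per variable, in order) to a probability.

def restrict(factor, var, val):
    # Step 2: instantiate an evidence variable to its observed value.
    vars_, table = factor
    if var not in vars_:
        return factor
    i = vars_.index(var)
    return (vars_[:i] + vars_[i+1:],
            {k[:i] + k[i+1:]: p for k, p in table.items() if k[i] == val})

def multiply(f, g, domains):
    # Pointwise product of two factors over the union of their variables.
    vars_ = tuple(dict.fromkeys(f[0] + g[0]))
    table = {}
    for vals in product(*(domains[v] for v in vars_)):
        assign = dict(zip(vars_, vals))
        table[vals] = (f[1][tuple(assign[v] for v in f[0])]
                       * g[1][tuple(assign[v] for v in g[0])])
    return (vars_, table)

def sum_out(factor, var):
    # Eliminate var by summing it out (the elimination step defined above).
    vars_, table = factor
    i = vars_.index(var)
    out = {}
    for k, p in table.items():
        out[k[:i] + k[i+1:]] = out.get(k[:i] + k[i+1:], 0.0) + p
    return (vars_[:i] + vars_[i+1:], out)

def variable_elimination(factors, evidence, order, domains):
    factors = list(factors)
    for var, val in evidence.items():          # step 2: set evidence
        factors = [restrict(f, var, val) for f in factors]
    for var in order:                          # eliminate along the ordering
        related = [f for f in factors if var in f[0]]
        if not related:
            continue
        factors = [f for f in factors if var not in f[0]]
        prod = related[0]
        for f in related[1:]:
            prod = multiply(prod, f, domains)
        factors.append(sum_out(prod, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f, domains)
    z = sum(result[1].values())                # normalize to get P(Q | E = e)
    return result[0], {k: p / z for k, p in result[1].items()}

For instance, with factors P(A) and P(B | A), evidence B = true, and an empty elimination ordering, the call returns the posterior P(A | B = true).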
Probabilistic Graphical Model
1. Representation
2. Inference
3. Learning

A brief review of the first six sections of the video series, all of which covered Representation. Key points:
Bayesian Network (directed graph)
Constructing a BN from a joint distribution; independence, conditional independence, and context-specific independence in a BN; the chain rule and the factorization of the joint probability; d-separation, I-maps, P-maps.
Markov Network (undirected graph)
From BN to MN; the Markov property and Markov assumptions; undirected separation; the moral graph.
Inference

Inference is the process of answering queries by computation. There are three kinds of inference problems in a BN:
1. Posterior probability queries, i.e., computing P(Q | E = e).
2. Maximum a posteriori hypothesis (MAP) problems, i.e., computing h* = arg max_h P(H = h | E = e).
3. Most probable explanation (MPE) problems, a special case of MAP in which H contains all non-evidence variables in the network.
Prediction of Protein-Protein Interactions Based on Association Rule Mining
林合同, 龚云路, 秦殿刚, 冯铁男, 王翼飞
(College of Sciences, Shanghai University, Shanghai, China)
Abstract: Using the primary-structure information of proteins, protein sequences are characterized by a tripeptide-frequency method, and association rule (AR) mining is applied to the prediction of protein-protein interactions (PPIs). Computational results show that the proposed method can accurately predict protein-protein interactions under different classifications of cysteine. Finally, the effects of the different cysteine classifications on the prediction results are compared.
Keywords: association rule mining; protein-protein interactions; sequence encoding; amino acid classification
Jun. 2012
doi: 10.3969/j.issn.1007-2861.2012.03.010
Protein-protein interactions play an important role in the activities of life. Studying protein-protein interactions is important for understanding proteins' …
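As a rough illustration of the sequence-encoding step described in the abstract, the sketch below computes normalized tripeptide (3-mer) frequencies over a reduced amino-acid alphabet; the three-group classification used here is a placeholder, not the paper's actual cysteine-based classification scheme.

from collections import Counter
from itertools import product

# Hypothetical grouping of the 20 amino acids into three classes.
GROUPS = {aa: "h" for aa in "AVLIMFWP"}       # hydrophobic
GROUPS.update({aa: "p" for aa in "GSTCYNQ"})  # polar
GROUPS.update({aa: "c" for aa in "DEKRH"})    # charged
LABELS = sorted(set(GROUPS.values()))

def tripeptide_frequencies(seq):
    # Map the sequence to group labels, then count overlapping 3-mers and
    # normalize, yielding a len(LABELS)**3-dimensional feature vector.
    reduced = [GROUPS[aa] for aa in seq if aa in GROUPS]
    counts = Counter(zip(reduced, reduced[1:], reduced[2:]))
    total = max(len(reduced) - 2, 1)
    return {t: counts.get(t, 0) / total for t in product(LABELS, repeat=3)}

vec = tripeptide_frequencies("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")

Feature vectors of this kind, one per protein, could then be discretized and fed to an association-rule miner.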
Probabilistic Models in Information Retrieval (1992)
Norbert Fuhr, March 4, 1992. An introduction to and survey of probabilistic information retrieval (IR) is given. First, the basic concepts of this approach are described: the probability ranking principle shows that optimum retrieval quality can be achieved under certain assumptions; a conceptual model for IR along with the corresponding event space clarifies the interpretation of the probabilistic parameters involved. For the estimation of these parameters, three different learning strategies are distinguished, namely query-related, document-related and description-related learning. As a representative of each of these strategies, a specific model is described. A new approach regards IR as uncertain inference; here, imaging is used as a new technique for estimating the probabilistic parameters, and probabilistic inference networks support more complex forms of inference. Finally, the more general problems of parameter estimation, query expansion and the development of models for advanced document representations are discussed.
Machine Learning Terminology: Chinese-English Glossary
机器学习专业词汇中英⽂对照activation 激活值activation function 激活函数additive noise 加性噪声autoencoder ⾃编码器Autoencoders ⾃编码算法average firing rate 平均激活率average sum-of-squares error 均⽅差backpropagation 后向传播basis 基basis feature vectors 特征基向量batch gradient ascent 批量梯度上升法Bayesian regularization method 贝叶斯规则化⽅法Bernoulli random variable 伯努利随机变量bias term 偏置项binary classfication ⼆元分类class labels 类型标记concatenation 级联conjugate gradient 共轭梯度contiguous groups 联通区域convex optimization software 凸优化软件convolution 卷积cost function 代价函数covariance matrix 协⽅差矩阵DC component 直流分量decorrelation 去相关degeneracy 退化demensionality reduction 降维derivative 导函数diagonal 对⾓线diffusion of gradients 梯度的弥散eigenvalue 特征值eigenvector 特征向量error term 残差feature matrix 特征矩阵feature standardization 特征标准化feedforward architectures 前馈结构算法feedforward neural network 前馈神经⽹络feedforward pass 前馈传导fine-tuned 微调first-order feature ⼀阶特征forward pass 前向传导forward propagation 前向传播Gaussian prior ⾼斯先验概率generative model ⽣成模型gradient descent 梯度下降Greedy layer-wise training 逐层贪婪训练⽅法grouping matrix 分组矩阵Hadamard product 阿达马乘积Hessian matrix Hessian 矩阵hidden layer 隐含层hidden units 隐藏神经元Hierarchical grouping 层次型分组higher-order features 更⾼阶特征highly non-convex optimization problem ⾼度⾮凸的优化问题histogram 直⽅图hyperbolic tangent 双曲正切函数hypothesis 估值,假设identity activation function 恒等激励函数IID 独⽴同分布illumination 照明inactive 抑制independent component analysis 独⽴成份分析input domains 输⼊域input layer 输⼊层intensity 亮度/灰度intercept term 截距KL divergence 相对熵KL divergence KL分散度k-Means K-均值learning rate 学习速率least squares 最⼩⼆乘法linear correspondence 线性响应linear superposition 线性叠加line-search algorithm 线搜索算法local mean subtraction 局部均值消减local optima 局部最优解logistic regression 逻辑回归loss function 损失函数low-pass filtering 低通滤波magnitude 幅值MAP 极⼤后验估计maximum likelihood estimation 极⼤似然估计mean 平均值MFCC Mel 倒频系数multi-class classification 多元分类neural networks 神经⽹络neuron 神经元Newton’s method ⽜顿法non-convex function ⾮凸函数non-linear feature ⾮线性特征norm 范式norm bounded 有界范数norm constrained 范数约束normalization 归⼀化numerical roundoff errors 数值舍⼊误差numerically checking 数值检验numerically reliable 数值计算上稳定object detection 物体检测objective function ⽬标函数off-by-one error 缺位错误orthogonalization 正交化output layer 输出层overall cost function 总体代价函数over-complete basis 超完备基over-fitting 过拟合parts of objects ⽬标的部件part-whole decompostion 部分-整体分解PCA 主元分析penalty term 惩罚因⼦per-example mean subtraction 逐样本均值消减pooling 池化pretrain 预训练principal components analysis 主成份分析quadratic constraints ⼆次约束RBMs 受限Boltzman机reconstruction based models 基于重构的模型reconstruction cost 重建代价reconstruction term 重构项redundant 冗余reflection matrix 反射矩阵regularization 正则化regularization term 正则化项rescaling 缩放robust 鲁棒性run ⾏程second-order feature ⼆阶特征sigmoid activation function S型激励函数significant digits 有效数字singular value 奇异值singular vector 奇异向量smoothed L1 penalty 平滑的L1范数惩罚Smoothed topographic L1 sparsity penalty 平滑地形L1稀疏惩罚函数smoothing 平滑Softmax Regresson Softmax回归sorted in decreasing order 降序排列source features 源特征sparse autoencoder 消减归⼀化Sparsity 稀疏性sparsity parameter 稀疏性参数sparsity penalty 稀疏惩罚square function 平⽅函数squared-error ⽅差stationary 平稳性(不变性)stationary stochastic process 平稳随机过程step-size 步长值supervised learning 监督学习symmetric positive semi-definite matrix 对称半正定矩阵symmetry breaking 对称失效tanh function 双曲正切函数the average activation 平均活跃度the derivative checking method 梯度验证⽅法the empirical distribution 经验分布函数the energy function 能量函数the Lagrange dual 拉格朗⽇对偶函数the log likelihood 对数似然函数the pixel intensity value 像素灰度值the rate of convergence 收敛速度topographic cost term 拓扑代价项topographic ordered 拓扑秩序transformation 变换translation invariant 平移不变性trivial 
answer 平凡解under-complete basis 不完备基unrolling 组合扩展unsupervised learning ⽆监督学习variance ⽅差vecotrized implementation 向量化实现vectorization ⽮量化visual cortex 视觉⽪层weight decay 权重衰减weighted average 加权平均值whitening ⽩化zero-mean 均值为零Letter AAccumulated error backpropagation 累积误差逆传播Activation Function 激活函数Adaptive Resonance Theory/ART ⾃适应谐振理论Addictive model 加性学习Adversarial Networks 对抗⽹络Affine Layer 仿射层Affinity matrix 亲和矩阵Agent 代理 / 智能体Algorithm 算法Alpha-beta pruning α-β剪枝Anomaly detection 异常检测Approximation 近似Area Under ROC Curve/AUC Roc 曲线下⾯积Artificial General Intelligence/AGI 通⽤⼈⼯智能Artificial Intelligence/AI ⼈⼯智能Association analysis 关联分析Attention mechanism 注意⼒机制Attribute conditional independence assumption 属性条件独⽴性假设Attribute space 属性空间Attribute value 属性值Autoencoder ⾃编码器Automatic speech recognition ⾃动语⾳识别Automatic summarization ⾃动摘要Average gradient 平均梯度Average-Pooling 平均池化Letter BBackpropagation Through Time 通过时间的反向传播Backpropagation/BP 反向传播Base learner 基学习器Base learning algorithm 基学习算法Batch Normalization/BN 批量归⼀化Bayes decision rule 贝叶斯判定准则Bayes Model Averaging/BMA 贝叶斯模型平均Bayes optimal classifier 贝叶斯最优分类器Bayesian decision theory 贝叶斯决策论Bayesian network 贝叶斯⽹络Between-class scatter matrix 类间散度矩阵Bias 偏置 / 偏差Bias-variance decomposition 偏差-⽅差分解Bias-Variance Dilemma 偏差 – ⽅差困境Bi-directional Long-Short Term Memory/Bi-LSTM 双向长短期记忆Binary classification ⼆分类Binomial test ⼆项检验Bi-partition ⼆分法Boltzmann machine 玻尔兹曼机Bootstrap sampling ⾃助采样法/可重复采样/有放回采样Bootstrapping ⾃助法Break-Event Point/BEP 平衡点Letter CCalibration 校准Cascade-Correlation 级联相关Categorical attribute 离散属性Class-conditional probability 类条件概率Classification and regression tree/CART 分类与回归树Classifier 分类器Class-imbalance 类别不平衡Closed -form 闭式Cluster 簇/类/集群Cluster analysis 聚类分析Clustering 聚类Clustering ensemble 聚类集成Co-adapting 共适应Coding matrix 编码矩阵COLT 国际学习理论会议Committee-based learning 基于委员会的学习Competitive learning 竞争型学习Component learner 组件学习器Comprehensibility 可解释性Computation Cost 计算成本Computational Linguistics 计算语⾔学Computer vision 计算机视觉Concept drift 概念漂移Concept Learning System /CLS 概念学习系统Conditional entropy 条件熵Conditional mutual information 条件互信息Conditional Probability Table/CPT 条件概率表Conditional random field/CRF 条件随机场Conditional risk 条件风险Confidence 置信度Confusion matrix 混淆矩阵Connection weight 连接权Connectionism 连结主义Consistency ⼀致性/相合性Contingency table 列联表Continuous attribute 连续属性Convergence 收敛Conversational agent 会话智能体Convex quadratic programming 凸⼆次规划Convexity 凸性Convolutional neural network/CNN 卷积神经⽹络Co-occurrence 同现Correlation coefficient 相关系数Cosine similarity 余弦相似度Cost curve 成本曲线Cost Function 成本函数Cost matrix 成本矩阵Cost-sensitive 成本敏感Cross entropy 交叉熵Cross validation 交叉验证Crowdsourcing 众包Curse of dimensionality 维数灾难Cut point 截断点Cutting plane algorithm 割平⾯法Letter DData mining 数据挖掘Data set 数据集Decision Boundary 决策边界Decision stump 决策树桩Decision tree 决策树/判定树Deduction 演绎Deep Belief Network 深度信念⽹络Deep Convolutional Generative Adversarial Network/DCGAN 深度卷积⽣成对抗⽹络Deep learning 深度学习Deep neural network/DNN 深度神经⽹络Deep Q-Learning 深度 Q 学习Deep Q-Network 深度 Q ⽹络Density estimation 密度估计Density-based clustering 密度聚类Differentiable neural computer 可微分神经计算机Dimensionality reduction algorithm 降维算法Directed edge 有向边Disagreement measure 不合度量Discriminative model 判别模型Discriminator 判别器Distance measure 距离度量Distance metric learning 距离度量学习Distribution 分布Divergence 散度Diversity measure 多样性度量/差异性度量Domain adaption 领域⾃适应Downsampling 下采样D-separation (Directed separation)有向分离Dual problem 对偶问题Dummy node 哑结点Dynamic Fusion 动态融合Dynamic programming 动态规划Letter EEigenvalue decomposition 特征值分解Embedding 嵌⼊Emotional analysis 
情绪分析Empirical conditional entropy 经验条件熵Empirical entropy 经验熵Empirical error 经验误差Empirical risk 经验风险End-to-End 端到端Energy-based model 基于能量的模型Ensemble learning 集成学习Ensemble pruning 集成修剪Error Correcting Output Codes/ECOC 纠错输出码Error rate 错误率Error-ambiguity decomposition 误差-分歧分解Euclidean distance 欧⽒距离Evolutionary computation 演化计算Expectation-Maximization 期望最⼤化Expected loss 期望损失Exploding Gradient Problem 梯度爆炸问题Exponential loss function 指数损失函数Extreme Learning Machine/ELM 超限学习机Letter FFactorization 因⼦分解False negative 假负类False positive 假正类False Positive Rate/FPR 假正例率Feature engineering 特征⼯程Feature selection 特征选择Feature vector 特征向量Featured Learning 特征学习Feedforward Neural Networks/FNN 前馈神经⽹络Fine-tuning 微调Flipping output 翻转法Fluctuation 震荡Forward stagewise algorithm 前向分步算法Frequentist 频率主义学派Full-rank matrix 满秩矩阵Functional neuron 功能神经元Letter GGain ratio 增益率Game theory 博弈论Gaussian kernel function ⾼斯核函数Gaussian Mixture Model ⾼斯混合模型General Problem Solving 通⽤问题求解Generalization 泛化Generalization error 泛化误差Generalization error bound 泛化误差上界Generalized Lagrange function ⼴义拉格朗⽇函数Generalized linear model ⼴义线性模型Generalized Rayleigh quotient ⼴义瑞利商Generative Adversarial Networks/GAN ⽣成对抗⽹络Generative Model ⽣成模型Generator ⽣成器Genetic Algorithm/GA 遗传算法Gibbs sampling 吉布斯采样Gini index 基尼指数Global minimum 全局最⼩Global Optimization 全局优化Gradient boosting 梯度提升Gradient Descent 梯度下降Graph theory 图论Ground-truth 真相/真实Letter HHard margin 硬间隔Hard voting 硬投票Harmonic mean 调和平均Hesse matrix 海塞矩阵Hidden dynamic model 隐动态模型Hidden layer 隐藏层Hidden Markov Model/HMM 隐马尔可夫模型Hierarchical clustering 层次聚类Hilbert space 希尔伯特空间Hinge loss function 合页损失函数Hold-out 留出法Homogeneous 同质Hybrid computing 混合计算Hyperparameter 超参数Hypothesis 假设Hypothesis test 假设验证Letter IICML 国际机器学习会议Improved iterative scaling/IIS 改进的迭代尺度法Incremental learning 增量学习Independent and identically distributed/i.i.d. 
独⽴同分布Independent Component Analysis/ICA 独⽴成分分析Indicator function 指⽰函数Individual learner 个体学习器Induction 归纳Inductive bias 归纳偏好Inductive learning 归纳学习Inductive Logic Programming/ILP 归纳逻辑程序设计Information entropy 信息熵Information gain 信息增益Input layer 输⼊层Insensitive loss 不敏感损失Inter-cluster similarity 簇间相似度International Conference for Machine Learning/ICML 国际机器学习⼤会Intra-cluster similarity 簇内相似度Intrinsic value 固有值Isometric Mapping/Isomap 等度量映射Isotonic regression 等分回归Iterative Dichotomiser 迭代⼆分器Letter KKernel method 核⽅法Kernel trick 核技巧Kernelized Linear Discriminant Analysis/KLDA 核线性判别分析K-fold cross validation k 折交叉验证/k 倍交叉验证K-Means Clustering K – 均值聚类K-Nearest Neighbours Algorithm/KNN K近邻算法Knowledge base 知识库Knowledge Representation 知识表征Letter LLabel space 标记空间Lagrange duality 拉格朗⽇对偶性Lagrange multiplier 拉格朗⽇乘⼦Laplace smoothing 拉普拉斯平滑Laplacian correction 拉普拉斯修正Latent Dirichlet Allocation 隐狄利克雷分布Latent semantic analysis 潜在语义分析Latent variable 隐变量Lazy learning 懒惰学习Learner 学习器Learning by analogy 类⽐学习Learning rate 学习率Learning Vector Quantization/LVQ 学习向量量化Least squares regression tree 最⼩⼆乘回归树Leave-One-Out/LOO 留⼀法linear chain conditional random field 线性链条件随机场Linear Discriminant Analysis/LDA 线性判别分析Linear model 线性模型Linear Regression 线性回归Link function 联系函数Local Markov property 局部马尔可夫性Local minimum 局部最⼩Log likelihood 对数似然Log odds/logit 对数⼏率Logistic Regression Logistic 回归Log-likelihood 对数似然Log-linear regression 对数线性回归Long-Short Term Memory/LSTM 长短期记忆Loss function 损失函数Letter MMachine translation/MT 机器翻译Macron-P 宏查准率Macron-R 宏查全率Majority voting 绝对多数投票法Manifold assumption 流形假设Manifold learning 流形学习Margin theory 间隔理论Marginal distribution 边际分布Marginal independence 边际独⽴性Marginalization 边际化Markov Chain Monte Carlo/MCMC 马尔可夫链蒙特卡罗⽅法Markov Random Field 马尔可夫随机场Maximal clique 最⼤团Maximum Likelihood Estimation/MLE 极⼤似然估计/极⼤似然法Maximum margin 最⼤间隔Maximum weighted spanning tree 最⼤带权⽣成树Max-Pooling 最⼤池化Mean squared error 均⽅误差Meta-learner 元学习器Metric learning 度量学习Micro-P 微查准率Micro-R 微查全率Minimal Description Length/MDL 最⼩描述长度Minimax game 极⼩极⼤博弈Misclassification cost 误分类成本Mixture of experts 混合专家Momentum 动量Moral graph 道德图/端正图Multi-class classification 多分类Multi-document summarization 多⽂档摘要Multi-layer feedforward neural networks 多层前馈神经⽹络Multilayer Perceptron/MLP 多层感知器Multimodal learning 多模态学习Multiple Dimensional Scaling 多维缩放Multiple linear regression 多元线性回归Multi-response Linear Regression /MLR 多响应线性回归Mutual information 互信息Letter NNaive bayes 朴素贝叶斯Naive Bayes Classifier 朴素贝叶斯分类器Named entity recognition 命名实体识别Nash equilibrium 纳什均衡Natural language generation/NLG ⾃然语⾔⽣成Natural language processing ⾃然语⾔处理Negative class 负类Negative correlation 负相关法Negative Log Likelihood 负对数似然Neighbourhood Component Analysis/NCA 近邻成分分析Neural Machine Translation 神经机器翻译Neural Turing Machine 神经图灵机Newton method ⽜顿法NIPS 国际神经信息处理系统会议No Free Lunch Theorem/NFL 没有免费的午餐定理Noise-contrastive estimation 噪⾳对⽐估计Nominal attribute 列名属性Non-convex optimization ⾮凸优化Nonlinear model ⾮线性模型Non-metric distance ⾮度量距离Non-negative matrix factorization ⾮负矩阵分解Non-ordinal attribute ⽆序属性Non-Saturating Game ⾮饱和博弈Norm 范数Normalization 归⼀化Nuclear norm 核范数Numerical attribute 数值属性Letter OObjective function ⽬标函数Oblique decision tree 斜决策树Occam’s razor 奥卡姆剃⼑Odds ⼏率Off-Policy 离策略One shot learning ⼀次性学习One-Dependent Estimator/ODE 独依赖估计On-Policy 在策略Ordinal attribute 有序属性Out-of-bag estimate 包外估计Output layer 输出层Output smearing 输出调制法Overfitting 过拟合/过配Oversampling 过采样Letter PPaired t-test 成对 t 检验Pairwise 成对型Pairwise Markov property 成对马尔可夫性Parameter 参数Parameter estimation 参数估计Parameter tuning 调参Parse tree 
解析树Particle Swarm Optimization/PSO 粒⼦群优化算法Part-of-speech tagging 词性标注Perceptron 感知机Performance measure 性能度量Plug and Play Generative Network 即插即⽤⽣成⽹络Plurality voting 相对多数投票法Polarity detection 极性检测Polynomial kernel function 多项式核函数Pooling 池化Positive class 正类Positive definite matrix 正定矩阵Post-hoc test 后续检验Post-pruning 后剪枝potential function 势函数Precision 查准率/准确率Prepruning 预剪枝Principal component analysis/PCA 主成分分析Principle of multiple explanations 多释原则Prior 先验Probability Graphical Model 概率图模型Proximal Gradient Descent/PGD 近端梯度下降Pruning 剪枝Pseudo-label 伪标记Letter QQuantized Neural Network 量⼦化神经⽹络Quantum computer 量⼦计算机Quantum Computing 量⼦计算Quasi Newton method 拟⽜顿法Letter RRadial Basis Function/RBF 径向基函数Random Forest Algorithm 随机森林算法Random walk 随机漫步Recall 查全率/召回率Receiver Operating Characteristic/ROC 受试者⼯作特征Rectified Linear Unit/ReLU 线性修正单元Recurrent Neural Network 循环神经⽹络Recursive neural network 递归神经⽹络Reference model 参考模型Regression 回归Regularization 正则化Reinforcement learning/RL 强化学习Representation learning 表征学习Representer theorem 表⽰定理reproducing kernel Hilbert space/RKHS 再⽣核希尔伯特空间Re-sampling 重采样法Rescaling 再缩放Residual Mapping 残差映射Residual Network 残差⽹络Restricted Boltzmann Machine/RBM 受限玻尔兹曼机Restricted Isometry Property/RIP 限定等距性Re-weighting 重赋权法Robustness 稳健性/鲁棒性Root node 根结点Rule Engine 规则引擎Rule learning 规则学习Letter SSaddle point 鞍点Sample space 样本空间Sampling 采样Score function 评分函数Self-Driving ⾃动驾驶Self-Organizing Map/SOM ⾃组织映射Semi-naive Bayes classifiers 半朴素贝叶斯分类器Semi-Supervised Learning 半监督学习semi-Supervised Support Vector Machine 半监督⽀持向量机Sentiment analysis 情感分析Separating hyperplane 分离超平⾯Sigmoid function Sigmoid 函数Similarity measure 相似度度量Simulated annealing 模拟退⽕Simultaneous localization and mapping 同步定位与地图构建Singular Value Decomposition 奇异值分解Slack variables 松弛变量Smoothing 平滑Soft margin 软间隔Soft margin maximization 软间隔最⼤化Soft voting 软投票Sparse representation 稀疏表征Sparsity 稀疏性Specialization 特化Spectral Clustering 谱聚类Speech Recognition 语⾳识别Splitting variable 切分变量Squashing function 挤压函数Stability-plasticity dilemma 可塑性-稳定性困境Statistical learning 统计学习Status feature function 状态特征函Stochastic gradient descent 随机梯度下降Stratified sampling 分层采样Structural risk 结构风险Structural risk minimization/SRM 结构风险最⼩化Subspace ⼦空间Supervised learning 监督学习/有导师学习support vector expansion ⽀持向量展式Support Vector Machine/SVM ⽀持向量机Surrogat loss 替代损失Surrogate function 替代函数Symbolic learning 符号学习Symbolism 符号主义Synset 同义词集Letter TT-Distribution Stochastic Neighbour Embedding/t-SNE T – 分布随机近邻嵌⼊Tensor 张量Tensor Processing Units/TPU 张量处理单元The least square method 最⼩⼆乘法Threshold 阈值Threshold logic unit 阈值逻辑单元Threshold-moving 阈值移动Time Step 时间步骤Tokenization 标记化Training error 训练误差Training instance 训练⽰例/训练例Transductive learning 直推学习Transfer learning 迁移学习Treebank 树库Tria-by-error 试错法True negative 真负类True positive 真正类True Positive Rate/TPR 真正例率Turing Machine 图灵机Twice-learning ⼆次学习Letter UUnderfitting ⽋拟合/⽋配Undersampling ⽋采样Understandability 可理解性Unequal cost ⾮均等代价Unit-step function 单位阶跃函数Univariate decision tree 单变量决策树Unsupervised learning ⽆监督学习/⽆导师学习Unsupervised layer-wise training ⽆监督逐层训练Upsampling 上采样Letter VVanishing Gradient Problem 梯度消失问题Variational inference 变分推断VC Theory VC维理论Version space 版本空间Viterbi algorithm 维特⽐算法Von Neumann architecture 冯 · 诺伊曼架构Letter WWasserstein GAN/WGAN Wasserstein⽣成对抗⽹络Weak learner 弱学习器Weight 权重Weight sharing 权共享Weighted voting 加权投票法Within-class scatter matrix 类内散度矩阵Word embedding 词嵌⼊Word sense disambiguation 词义消歧Letter ZZero-data learning 零数据学习Zero-shot learning 零次学习Aapproximations近似值arbitrary随意的affine仿射的arbitrary任意的amino 
acid氨基酸amenable经得起检验的axiom公理,原则abstract提取architecture架构,体系结构;建造业absolute绝对的arsenal军⽕库assignment分配algebra线性代数asymptotically⽆症状的appropriate恰当的Bbias偏差brevity简短,简洁;短暂broader⼴泛briefly简短的batch批量Cconvergence 收敛,集中到⼀点convex凸的contours轮廓constraint约束constant常理commercial商务的complementarity补充coordinate ascent同等级上升clipping剪下物;剪报;修剪component分量;部件continuous连续的covariance协⽅差canonical正规的,正则的concave⾮凸的corresponds相符合;相当;通信corollary推论concrete具体的事物,实在的东西cross validation交叉验证correlation相互关系convention约定cluster⼀簇centroids 质⼼,形⼼converge收敛computationally计算(机)的calculus计算Dderive获得,取得dual⼆元的duality⼆元性;⼆象性;对偶性derivation求导;得到;起源denote预⽰,表⽰,是…的标志;意味着,[逻]指称divergence 散度;发散性dimension尺度,规格;维数dot⼩圆点distortion变形density概率密度函数discrete离散的discriminative有识别能⼒的diagonal对⾓dispersion分散,散开determinant决定因素disjoint不相交的Eencounter遇到ellipses椭圆equality等式extra额外的empirical经验;观察ennmerate例举,计数exceed超过,越出expectation期望efficient⽣效的endow赋予explicitly清楚的exponential family指数家族equivalently等价的Ffeasible可⾏的forary初次尝试finite有限的,限定的forgo摒弃,放弃fliter过滤frequentist最常发⽣的forward search前向式搜索formalize使定形Ggeneralized归纳的generalization概括,归纳;普遍化;判断(根据不⾜)guarantee保证;抵押品generate形成,产⽣geometric margins⼏何边界gap裂⼝generative⽣产的;有⽣产⼒的Hheuristic启发式的;启发法;启发程序hone怀恋;磨hyperplane超平⾯Linitial最初的implement执⾏intuitive凭直觉获知的incremental增加的intercept截距intuitious直觉instantiation例⼦indicator指⽰物,指⽰器interative重复的,迭代的integral积分identical相等的;完全相同的indicate表⽰,指出invariance不变性,恒定性impose把…强加于intermediate中间的interpretation解释,翻译Jjoint distribution联合概率Llieu替代logarithmic对数的,⽤对数表⽰的latent潜在的Leave-one-out cross validation留⼀法交叉验证Mmagnitude巨⼤mapping绘图,制图;映射matrix矩阵mutual相互的,共同的monotonically单调的minor较⼩的,次要的multinomial多项的multi-class classification⼆分类问题Nnasty讨厌的notation标志,注释naïve朴素的Oobtain得到oscillate摆动optimization problem最优化问题objective function⽬标函数optimal最理想的orthogonal(⽮量,矩阵等)正交的orientation⽅向ordinary普通的occasionally偶然的Ppartial derivative偏导数property性质proportional成⽐例的primal原始的,最初的permit允许pseudocode伪代码permissible可允许的polynomial多项式preliminary预备precision精度perturbation 不安,扰乱poist假定,设想positive semi-definite半正定的parentheses圆括号posterior probability后验概率plementarity补充pictorially图像的parameterize确定…的参数poisson distribution柏松分布pertinent相关的Qquadratic⼆次的quantity量,数量;分量query疑问的Rregularization使系统化;调整reoptimize重新优化restrict限制;限定;约束reminiscent回忆往事的;提醒的;使⼈联想…的(of)remark注意random variable随机变量respect考虑respectively各⾃的;分别的redundant过多的;冗余的Ssusceptible敏感的stochastic可能的;随机的symmetric对称的sophisticated复杂的spurious假的;伪造的subtract减去;减法器simultaneously同时发⽣地;同步地suffice满⾜scarce稀有的,难得的split分解,分离subset⼦集statistic统计量successive iteratious连续的迭代scale标度sort of有⼏分的squares平⽅Ttrajectory轨迹temporarily暂时的terminology专⽤名词tolerance容忍;公差thumb翻阅threshold阈,临界theorem定理tangent正弦Uunit-length vector单位向量Vvalid有效的,正确的variance⽅差variable变量;变元vocabulary词汇valued经估价的;宝贵的Wwrapper包装分类:。
An Introduction to Probabilistic Graphical Models
David Madigan, Rutgers University
madigan@
Expert Systems
• Explosion of interest in "Expert Systems" in the early 1980s
• Many companies (Teknowledge, IntelliCorp, Inference, etc.), many IPOs, much media hype
• Ad-hoc uncertainty handling
Uncertainty in Expert Systems
If A then C (p1)
If B then C (p2)
What if both A and B are true? Then C is true with certainty factor:
CF = p1 + p2 × (1 - p1)
"Currently fashionable ad-hoc mumbo jumbo" (A.F.M. Smith)
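A minimal Python sketch of this parallel combination rule for certainty factors (the formula quoted on the slide); the function name and example values are mine:

from functools import reduce

def combine_cf(cfs):
    # Combine certainty factors of rules that independently support the
    # same conclusion: CF <- CF + p * (1 - CF), applied associatively.
    return reduce(lambda acc, p: acc + p * (1.0 - acc), cfs, 0.0)

# If A then C (p1 = 0.8); If B then C (p2 = 0.5); both A and B true:
print(combine_cf([0.8, 0.5]))  # 0.9

The quoted criticism is precisely that such combination schemes are ad hoc rather than derived from probability theory.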
Lemma: If P admits a recursive factorization according to an ADG G, then P factorizes according to the moral graph G^m (and according to chordal supergraphs of G^m).
Lemma: If P admits a recursive factorization according to an ADG G, and A is an ancestral set in G, then the marginal P_A admits a recursive factorization according to the subgraph G_A.
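For reference, the recursive factorization these lemmas rely on is the standard ADG one; the formula below is my reconstruction of that standard definition rather than text recovered from the slide:

% recursive factorization according to an ADG G = (V, E):
p(x) = \prod_{v \in V} p\left(x_v \mid x_{\mathrm{pa}(v)}\right)
% where pa(v) denotes the set of parents of vertex v in G.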
Probability and Stochastic Processes
Probability and Stochastic Processes are essential concepts in the field of mathematics and have wide-ranging applications in various fields such as engineering, economics, and science. The study of probability involves understanding the likelihood of different outcomes occurring in a given situation, while stochastic processes deal with the random changes in systems over time. These concepts are crucial for making informed decisions in uncertain situations and for modeling complex systems. However, they can also be challenging to grasp for many students and professionals.

One of the primary problems with probability and stochastic processes is the abstract nature of the concepts involved. Unlike other areas of mathematics that deal with concrete quantities and relationships, probability and stochastic processes often involve dealing with uncertainty and randomness. This can make it difficult for individuals to develop an intuitive understanding of these concepts, leading to confusion and frustration. Additionally, the mathematical formalism used to describe these concepts can be daunting for many, further complicating the learning process.

Another challenge with probability and stochastic processes is the wide range of applications and contexts in which they are used. While the fundamental principles may remain the same, the specific details and techniques can vary significantly depending on the field of study. For example, the application of probability in finance may differ from its application in physics or engineering. This can make it challenging for individuals to apply their knowledge of probability and stochastic processes to real-world problems outside of their immediate domain of expertise.

Moreover, the reliance on complex mathematical tools and techniques in probability and stochastic processes can be a barrier for many learners. Understanding concepts such as random variables, probability distributions, and stochastic differential equations often requires a strong foundation in mathematical analysis and statistics. For individuals without a solid background in these areas, grasping the intricacies of probability and stochastic processes can be a daunting task.

Furthermore, the abstract nature of probability and stochastic processes can also lead to misconceptions and misinterpretations. Individuals may struggle to differentiate between independent and mutually exclusive events, or they may misapply probability concepts in real-world scenarios. These misconceptions can hinder the ability to make sound decisions based on probabilistic reasoning and can lead to errors in judgment.

In addition to the technical challenges, there are also psychological barriers that can impede the learning and application of probability and stochastic processes. The inherent uncertainty and randomness involved in these concepts can be unsettling for some individuals, leading to a reluctance to fully engage with the material. Moreover, the fear of making mistakes or misinterpreting probabilities can create anxiety and self-doubt, further hindering the learning process.

To address these challenges, educators and practitioners can employ various strategies to enhance the understanding and application of probability and stochastic processes. One approach is to use real-world examples and practical applications to illustrate the concepts in a tangible way.
By demonstrating how probability and stochastic processes are used in fields such as finance, engineering, and healthcare, learners can develop a deeper appreciation for the relevance and importance of these concepts. Furthermore, breaking down complex mathematical formalism into more accessible language and visual representations can help demystify probability and stochastic processes for many individuals. Utilizing simulations, interactive demonstrations, and hands-on activities can also provide a more concrete and intuitive understanding of these concepts, making them less intimidating and more approachable.

Moreover, fostering a growth mindset and promoting a positive attitude towards uncertainty and randomness can help individuals overcome the psychological barriers associated with probability and stochastic processes. Encouraging a willingness to embrace uncertainty and learn from mistakes can create a more supportive and conducive learning environment, enabling individuals to develop a more resilient and adaptive approach to probabilistic reasoning.

In conclusion, probability and stochastic processes are fundamental concepts with wide-ranging applications, but they can be challenging to grasp and apply due to their abstract nature, wide range of applications, complex mathematical tools, potential for misconceptions, and psychological barriers. By employing strategies that emphasize real-world examples, practical applications, accessible language and visual representations, and a growth mindset, educators and practitioners can help individuals overcome these challenges and develop a deeper understanding and appreciation for probability and stochastic processes.
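In the spirit of the hands-on simulations recommended above, here is a small self-contained Python example (my illustration, not from the text) that estimates a probability about a stochastic process by Monte Carlo simulation: the chance that a simple symmetric random walk is strictly positive after 100 steps.

import random

def random_walk_positive_prob(steps=100, trials=100_000, seed=0):
    # Monte Carlo estimate of P(S_n > 0) for S_n = X_1 + ... + X_n,
    # where each X_i is +1 or -1 with equal probability.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.choice((-1, 1)) for _ in range(steps))
        if s > 0:
            hits += 1
    return hits / trials

print(random_walk_positive_prob())  # about 0.46 (the walk ends at exactly 0 about 8% of the time)

Watching the estimate stabilize as the number of trials grows gives a concrete feel for both the randomness of the process and the meaning of the probability.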
Bayesian Representation of Stochastic Processes under Learning: de Finetti Revisited
Matthew O. Jackson, Ehud Kalai, and Rann Smorodinsky
Revised: May 6, 1998
Abstract
A probability distribution governing the evolution of a stochastic process has infinitely many Bayesian representations of the form μ = ∫ μθ dλ(θ). Among these, a natural representation is one whose components (the μθ's) are `learnable' (one can approximate μθ by conditioning on observation of the process) and `sufficient for prediction' (μθ's predictions are not aided by conditioning on observation of the process). We show the existence and uniqueness of such a representation under a suitable asymptotic mixing condition on the process. This representation can be obtained by conditioning on the tail-field of the process, and any learnable representation that is sufficient for prediction is asymptotically like the tail-field representation. This result is related to the celebrated de Finetti theorem, but with exchangeability weakened to an asymptotic mixing condition, and with his conclusion of a decomposition into iid component distributions weakened to components that are learnable and sufficient for prediction.

1 For example, Nyarko (1996) argues that it is important for learning results in incomplete information games to be robust to equivalent reformulations of type spaces. He discusses examples which are not robust to such reformulations. In the language of this paper the reformulations are different representations of the process associated with the same game and strategies.
Probabilistic networks and explanatory coherence
Cognitive Science Quarterly (2000) 1, 93-116

Probabilistic Networks and Explanatory Coherence [1]
Paul Thagard, University of Waterloo [2]

Causal reasoning can be understood qualitatively in terms of explanatory coherence or quantitatively in terms of probability theory. Comparison of these approaches can be done by looking at computational models, using my explanatory coherence networks and Pearl's probabilistic ones. The explanatory coherence program ECHO can be given a probabilistic interpretation, but there are many conceptual and computational problems that make it difficult to replace coherence networks by probabilistic ones. On the other hand, ECHO provides a psychologically plausible and computationally efficient model of some kinds of probabilistic causal reasoning. Hence coherence theory need not give way to probability theory as the basis for epistemology and decision making.

Keywords: causality, explanation, coherence, probability, Bayesian networks.

[1] Thanks to Gilbert Harman, Steve Kimbrough, and Steve Roehrig for very helpful discussions, and to Eugene Charniak for suggesting this comparison. Michael Ranney, Patricia Schank, and anonymous referees provided useful comments on an earlier draft. Thanks to Cameron Shelley for editorial assistance. This research is supported by the Natural Sciences and Engineering Research Council of Canada.
[2] Address for correspondence: Paul Thagard, Philosophy Department, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1. E-mail: pthagard@watarts.uwaterloo.ca. Home page: http://cogsci.uwaterloo.ca.

Received April 14, 1999; revised September 24, 1999. © 2000 Hermes Science Publications.

Two Traditions in Causal Reasoning

When surprising events occur, people naturally try to generate explanations of them. Such explanations usually involve hypothesizing causes that have the events as effects. Reasoning from effects to prior causes is found in many domains, including:
– Social reasoning: when friends are acting strange, we conjecture about what might be bothering them.
– Legal reasoning: when a crime has been committed, jurors must decide whether the prosecution's case gives a convincing explanation of the evidence.
– Medical diagnosis: given a set of symptoms, a physician tries to decide what disease or diseases produced them.
– Fault diagnosis in manufacturing: when a piece of equipment breaks down, a troubleshooter must try to determine the cause of the breakdown.
– Scientific theory evaluation: scientists seek an acceptable theory to explain experimental evidence.

What is the nature of such reasoning? The many discussions of causal reasoning over the centuries can be seen as falling under two general traditions that I shall call explanationism and probabilism. Explanationists understand causal reasoning qualitatively, while probabilists exploit the resources of the probability calculus to understand causal reasoning quantitatively. Explanationism goes back at least to Aristotle (1984, vol. 1, p. 128), who considered the inference that the planets are near as providing an explanation of why they do not twinkle. Some Renaissance astronomers such as Copernicus and Rheticus evaluated theories according to their explanatory capabilities (Blake, 1960). The leading explanationists in the nineteenth century were the British scientist, philosopher, and historian William Whewell (1967), and the American polymath C. S. Peirce (1931-58).
The most enthusiastic explanationists in this century have been epistemologists such as Gilbert Harman (1973, 1986) and William Lycan (1988). In the field of artificial intelligence, computational models of inference to the best explanation have been developed (Josephson and Josephson, 1994; Shrager and Langley, 1990; Thagard, 1992b).

The probabilist tradition is less ancient than explanationism, for the mathematical theory of probability only arose in the seventeenth century through the work of Pascal, Bernoulli, and others (Hacking, 1975). Laplace and Jevons were the major proponents of probabilistic approaches to induction in the eighteenth and nineteenth century, respectively (Laudan, 1981, chap. 12). Many twentieth-century philosophers have advocated probabilistic approaches to epistemology, including Keynes (1921), Carnap (1950), Jeffrey (1983), Levi (1980), Kyburg (1983), and Kaplan (1996).

Probabilistic approaches have recently become influential in artificial intelligence as a way of dealing with uncertainty encountered in expert systems (D'Ambrosio, 1999; Frey, 1998; Jordan, 1998; Neapolitan, 1990; Pearl, 1988, 1996; Peng and Reggia, 1990). Probabilistic approaches are also being applied to natural language understanding (Charniak, 1993). The explanationist versus probabilist issue surfaces in a variety of sub-areas. Some legal scholars concerned with evidential reasoning have been probabilist (Lempert, 1986; Cohen, 1977), while some are explanationist and see probabilist reasoning as neglecting important aspects of how jurors reach decisions (Allen, 1994; Pennington & Hastie, 1986). In the philosophy of science, there is an unresolved tension between probabilist accounts of scientific inference (Achinstein, 1991; Hesse, 1974; Horwich, 1982; Howson & Urbach, 1989; Maher, 1993) and explanationist accounts (Eliasmith & Thagard, 1997; Lipton, 1991; Thagard, 1988, 1992b). Neither the probabilist nor the explanationist tradition is monolithic: there are competing interpretations of probability and inference within the former, and differing views of explanatory inference within the latter.

In recent years, it has become possible to examine the differences between explanationist and probabilist approaches at a much finer level, because algorithms have been developed for implementing them computationally. My theory of explanatory coherence incorporates the kinds of reasoning advocated by explanationists, and is implemented in a connectionist program called ECHO that shows how explanatory coherence can be computed in networks of propositions. Pearl (1988) and others have shown how probabilistic reasoning can be implemented computationally using networks. The question naturally arises of the relation between ECHO networks and probabilistic networks. This paper shows how ECHO's qualitative input can be used to produce a probabilistic network to which Pearl's algorithms are applicable. At one level, this result can be interpreted as showing that ECHO is a special case of a probabilistic network.

The production of a probabilistic version of ECHO highlights, however, several computational problems with probabilistic networks. The probabilistic version of ECHO requires the provision of many conditional probabilities of dubious availability, and the computational techniques needed to translate ECHO into probabilistic networks are potentially combinatorially explosive.
ECHO can therefore be viewed as an intuitively appealing and computationally efficient approximation to probabilistic reasoning. We will also see that ECHO puts important constraints on the conditional probabilities used in probabilistic networks.

The comparison between ECHO and probabilistic networks does not in itself settle the relation between the explanationist and probabilist traditions, since there are other ways of being an explanationist besides ECHO, and there are other ways of being a probabilist besides Pearl's networks. But from a computational perspective, ECHO and Pearl networks are much more fully specified than previous explanationist and probabilist proposals, so a head-to-head comparison is potentially illuminating. After briefly reviewing Pearl's approach to probabilistic networks, I shall sketch the probabilistic interpretation of explanatory coherence and discuss the computational problems that arise. Then a demonstration of how ECHO naturally handles Pearl's central examples will support the conclusion that explanatory coherence theory is not obviated by the probabilistic approach.

The point of this paper, however, is not simply a comparison of two computational models of causal reasoning. Causal reasoning is an essential part of human thinking, so the nature of such reasoning is an important question for cognitive science. Do people use coherence-based or probabilistic inference when they evaluate competing causal accounts in social, legal, medical, engineering, and scientific contexts? Many researchers in AI and philosophy assume that probabilistic approaches are the only ones appropriate for understanding such reasoning, but there is much experimental evidence that human thinking is often not in accord with the prescriptions of probability theory (see, e.g., Kahneman, Slovic, & Tversky, 1982). On the other hand, there is some psychological evidence that explanatory coherence theory captures aspects of human thinking (Read & Marcus-Newhall, 1993; Schank & Ranney, 1991, 1992; Thagard & Kunda, 1998). Moreover, coherence-based reasoning is pervasive in human thinking, in areas as diverse as perception, decision making, ethical judgments, and emotional inference (see Thagard, forthcoming, for a broad survey). Clarification of the relation between explanatory coherence and probabilistic accounts is thus part of the general psychological project of understanding how human causal reasoning works.

The probabilistic view assumes that the degrees of belief that people have in various propositions can be described by quantities that comply with the principles of the mathematical theory of probability. In contrast, the explanationist approach sees no reason to use probability theory to model degrees of belief. Probability theory is an immensely valuable tool for making statistical inferences about patterns of frequencies in the world, but is not the appropriate mathematics for understanding human inference in general.

Explanatory Coherence and ECHO

The theory of explanatory coherence, TEC, is informally stated in the following principles (Thagard, 1989, 1992a, 1992b):

Principle E1. Symmetry. Explanatory coherence is a symmetric relation, unlike, say, conditional probability.
Principle E2. Explanation.
(a) A hypothesis coheres with what it explains, which can either be evidence or another hypothesis; (b) hypotheses that together explain some other proposition cohere with each other; and (c) the more hypotheses it takes to explain something, the lower the degree of coherence.
Principle E3. Analogy. Similar hypotheses that explain similar pieces of evidence cohere.
Principle E4. Data priority. Propositions that describe the results of observations have a degree of acceptability on their own.
Principle E5. Contradiction. Contradictory propositions are incoherent with each other.
Principle E6. Competition. If P and Q both explain a proposition, and if P and Q are not explanatorily connected, then P and Q are incoherent with each other. (P and Q are explanatorily connected if one explains the other or if together they explain something.)
Principle E7. Acceptance. The acceptability of a proposition in a system of propositions depends on its coherence with them.

A full exposition and defense of these principles, along with a detailed description of their implementation in the computational model ECHO, can be found in Thagard (1992b).

In ECHO, propositions are represented by units (artificial neurons), and coherence relations are represented by excitatory and inhibitory links. Principle 7, Acceptance, is implemented in ECHO using a simple connectionist method for updating the activation of a unit based on the units to which it is linked. Units typically start with an activation level of 0, except for the special EVIDENCE unit whose activation is always 1. Activation spreads from it to the units representing data, and from them to units representing propositions that explain data, and then to units representing higher-level propositions that explain propositions that explain data, and so on. Inhibitory links between units make them suppress each other's activation. The activation of each unit j is updated according to the following equation:

a_j(t+1) = a_j(t)(1 - d) + net_j(max - a_j(t)) if net_j > 0, otherwise a_j(t+1) = a_j(t)(1 - d) + net_j(a_j(t) - min).   [1]

Here d is a decay parameter (say .05) that decrements each unit at every cycle, min is the minimum activation (-1), and max is the maximum activation (1). Based on the weight w_ij between each pair of units i and j, we can calculate net_j, the net input to a unit, by:

net_j = Σ_i w_ij a_i(t).   [2]

Figure 1 displays a typical network produced by ECHO, given that H1 and H2 together explain E1, H1 and H2 together explain E2, H4 and H5 together also explain E1 and E2, and H1 is itself explained by H3. All links are symmetric, although the flow of activation is not, since it originates in the EVIDENCE unit. Note that there are many loops in the graph in figure 1. Loops with 3 nodes arise both from explanation (e.g. H1, H2, and E1) and from competition (e.g. H1, H4 and E1). Loops with 4 nodes occur whenever there are two propositions that both explain two other propositions (e.g. H1, H4, E1, E2). The presence of loops brings no complications to ECHO's computation of acceptability, but we shall see that loops are problematic for probabilistic networks.

Figure 1. A typical ECHO network. Solid lines indicate excitatory links, while dotted lines indicate inhibitory links.

Updating an ECHO network is very simple and can be performed locally at each unit u_i, which must have access, for each u_j to which it is linked, to the weight of the link between u_i and u_j and to the activation of u_j.
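A small Python sketch of this updating rule (equations [1] and [2]); the dictionary-based network encoding is my own illustration, not ECHO's actual data structures:

def update_activations(units, weights, activations, d=0.05, amin=-1.0, amax=1.0):
    # One synchronous pass of equations [1]-[2]. `weights` should contain
    # both orientations (i, j) and (j, i) of each symmetric link.
    new = {}
    for j in units:
        net_j = sum(w * activations[i] for (i, k), w in weights.items() if k == j)
        if net_j > 0:
            a = activations[j] * (1 - d) + net_j * (amax - activations[j])
        else:
            a = activations[j] * (1 - d) + net_j * (activations[j] - amin)
        new[j] = max(amin, min(amax, a))
    if "EVIDENCE" in new:
        new["EVIDENCE"] = 1.0  # the special evidence unit stays clamped at 1
    return new

Iterating this map until activations stop changing implements settling; with excitation near .04 and inhibition near -.06 (the defaults reported below), such networks typically settle quickly.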
ECHO networks are never completely connected: most units are not directly linked to most other units. Even if networks were completely connected, the maximum number of links per unit in a network with n units would be n-1, so that n*(n-1) simple calculations are all that updating requires. In principle, if each unit has its own processor, the n-1 calculations can be performed independently of each other. There is, however, no guarantee that a small number of updating steps will suffice to produce stable activation values. It is possible that activation will pass round and round the network, producing oscillations in activations of units so that no judgment of acceptability is produced. But computational experiments have shown that ECHO networks can be highly stable given appropriate weight values for excitatory and inhibitory links. For most of the major ECHO examples, 1000 runs have been done to determine how sensitive network performance is to excitation (the default weight on excitatory links), inhibition (the default weight on inhibitory links), and decay, considering each in the range of .01 to .1. The sensitivity experiments show that ECHO performs well so long as excitation is low relative to inhibition, when the inhibitory links between units make the network settle without the oscillations that uncontrolled excitation would produce. Resulting activations can range between 1 (maximum acceptance) and -1 (maximum rejection). On the basis of these experiments, I chose the default parameter values of .04 for excitation, -.06 for inhibition, and .05 for decay, since they tend to lead to rapid settling. All runs of ECHO on different examples now use these parameter values.

Strikingly, the settling time of the largest ECHO applications does not increase as a function of the number of units: larger networks with more links do not require more updating steps to reach stable activation values. This result is consistent with experiments performed using ARCS, a program for analog retrieval that uses networks similar to ECHO's but with many more units and links (Thagard, Holyoak, Nelson, & Gochfeld, 1990). Thus although we cannot show analytically that ECHO can efficiently compute explanatory coherence, experimental results suggest that ECHO's ability to determine the acceptability of propositions is little affected by the size or connectivity of networks. Further information on how ECHO works will be provided below in the discussion of how ECHO handles some probabilistic tasks.

Probabilistic Networks

The theory of explanatory coherence employs vague concepts such as explanation and acceptability, and ECHO requires input specifying explanatory relations. It is reasonable to desire a more precise way of understanding causal reasoning, for example in terms of the mathematical theory of probability. That theory can be stated in straightforward axioms that establish probabilities as quantities between 0 and 1. From these axioms it is trivial to derive Bayes' theorem, which can be written as:

P(H/E) = P(H) × P(E/H) / P(E).   [3]

It says that the probability of a hypothesis given the evidence is the prior probability of the hypothesis times the probability of the evidence given the hypothesis, divided by the probability of the evidence. Bayes' theorem is very suggestive for causal reasoning, since we can hope to decide what caused an effect by considering what cause has the greatest probability given the effect.
Hence probabilists are often called Bayesians.

In practice, however, application of the probability calculus becomes complicated. Harman (1986, p. 25) pointed out that in general probabilistic updating is combinatorially explosive, since we need to know the probabilities of a set of conjunctions whose size grows exponentially with the number of propositions. For example, full probabilistic information about three propositions, A, B, and C, would require knowing a total of 8 different values: P(A & B & C), P(A & B & not-C), P(A & not-B & not-C), etc. Only 30 propositions would require more than a billion probabilities. As Thagard and Verbeurgt (1998) showed, coherence maximization is also potentially intractable computationally, but the algorithms we describe provide efficient ways of computing coherence, and one, the semidefinite programming algorithm, is guaranteed to accomplish at least .878 of the optimal constraint satisfaction.

Probabilistic networks prune enormously the required number of probabilities and probability calculations, since they restrict calculations to a limited set of dependencies. Suppose you know that B depends only on A, and C depends only on B. You then have the simple network A→B→C. This means that A can affect the probability of C only through B, so that the calculation of the probability of C can take into account the value of B while ignoring A.

Probabilistic networks have gone under many different names: causal networks, belief networks, Bayesian networks, influence diagrams, and independence networks. For precision's sake, I want to consider a particular kind of probabilistic network, concentrating on the elegant and powerful methods of Pearl (1988). Since methods for dealing with probabilistic networks other than his are undoubtedly possible, I shall not try to compare ECHO generally with probabilistic networks, but will make the comparison specifically with Pearl networks.

In Pearl networks, each node represents a multi-valued variable such as a patient's temperature, which might take three values: high, medium, low. In the simplest cases, the variable can be propositional, with two values, true and false. Already we see a difference between Pearl networks and ECHO networks, since ECHO requires separate nodes for a proposition and its negation. But translations between Pearl nodes and ECHO nodes are clearly possible and will be discussed below.

More problematic are the edges in the two kinds of networks. Pearl networks are directed, acyclic graphs. Edges are directed, pointing from causes to effects, so that A→B indicates that A causes B and not vice versa. In contrast, ECHO's links are all symmetric, befitting the character of coherence and incoherence (Principle E1 above), but symmetries are not allowed in Pearl networks. The specification that the graphs be acyclic rules out relations such as those shown in figure 2. Since the nodes are variables, a more accurate interpretation of the edge A→B would be: the values of B are causally dependent on the values of A.

The structure of Pearl networks is used to localize probability calculations and surmount the combinatorial explosion that can result from considering the probabilities of everything given everything else. Figure 3 shows a fragment of a Pearl network in which the variable D is identified as being dependent on A, B, and C, while E and F are dependent on D.
The probabilities that D will take on its various values can then be calculated only by looking at A, B, C, E, and F, ignoring other variables in the network from which D is assumed to be conditionally independent, given the five variables on which it is directly dependent. The probabilities of the values of D can be expressed as a vector corresponding to the set of values. For example, if D is temperature and has values (high, medium, low), the vector (.5 .3 .2) assigned to D means that the probability of high temperature is .5, of medium temperature is .3, and of low temperature is .2. In accord with the axioms of probability theory, the numbers in the vector must sum to 1, since they are the probabilities of all the exclusive values of the variable.

Figure 2. Examples of cyclic graphs.

Figure 3. Sample Pearl network, in which the variable D is dependent on A, B, and C, while E and F are dependent on D. Lines with arrows indicate dependencies.

The desired result of computing using a Pearl network is that each node should have a stable vector representing the probabilities of its values given all the other information in the network. If a measurement determines that the temperature is high, then the vector for D would be (1 0 0). If the temperature is not known, it must be inferred using information gathered both from the variables on which D is dependent and the ones that depend on D. In terms of Bayes' theorem, we can think of A, B, and C as providing prior probabilities for the values of D, while E and F provide observed evidence for it. The explanatory coherence interpretation of figure 3 is that A, B, and C explain D, while D explains E and F. For each variable X, Pearl uses BEL(x) to indicate the computed degrees of belief (probabilities) that X takes on each of its values x. BEL(x) is thus a vector with as many entries as X has values, and is calculated using the equation:

BEL(x) = α × λ(x) × π(x).   [4]

Here α is a normalizing constant used to ensure that the entries in the vector sum to 1. λ(x) is a vector representing the amount of support for particular values of X coming up from below, that is, from variables that depend on X. π(x) is a vector representing the amount of support for particular values of X coming down from above, that is, from variables on which X depends. For a variable V at the very top of the network, the value passed down by V will be a vector of the prior probabilities that V takes on its various values. Ultimately, BEL(x) should be a function of these prior probabilities and fixed probabilities at nodes where the value of the variable is known, producing BEL vectors such as (1 0 0).
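As a minimal numerical illustration of equation [4] (this is not Pearl's full propagation algorithm, and the example vectors are hypothetical):

def bel(lam, pi):
    # BEL(x) = alpha * lambda(x) * pi(x), normalized to sum to 1 (eq. [4]).
    unnorm = [l * p for l, p in zip(lam, pi)]
    alpha = 1.0 / sum(unnorm)
    return [alpha * u for u in unnorm]

# Temperature with values (high, medium, low): pi from parents A, B, C;
# lambda from children E, F.
print(bel(lam=[0.9, 0.3, 0.1], pi=[0.5, 0.3, 0.2]))  # ~ [0.80, 0.16, 0.04]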
Calculating BEL values is non-trivial, because it requires repeatedly updating the BEL and other values until the prior probabilities and the known values based on evidence have propagated throughout the network. It has been shown that the general problem of probabilistic inference in networks is NP-hard (Cooper, 1990), so we should not expect there to be a universal efficient algorithm for updating BEL. Pearl presents algorithms for computing BEL in the special case where networks are singly connected, that is, where no more than one path exists between two nodes (Pearl, 1988, chap. 4; see also Neapolitan, 1990). A path here is a sequence of edges independent of direction. If there is more than one path between nodes, then the network contains a loop that can interfere with achievement of stable values of BEL, λ, and π. Hence methods have been developed for converting multiply connected networks into singly connected ones by clustering nodes into new nodes with many values.

For example, consider the network shown in figure 4 (from Pearl, 1988, p. 196). In this example, metastatic cancer is a cause of both increased total serum calcium and brain tumor, either of which can cause a coma. This is problematic for Pearl because there are two paths between A and D. Clustering involves collapsing nodes B and C into a new node Z representing a variable with values that are all possible combinations of the values of B and C: increased calcium and tumor, increased calcium and no tumor, no increased calcium and tumor, and no increased calcium and no tumor. ECHO deals with cases such as these very differently, using the principle of competition, as we will see below.

Figure 4. Pearl's representation of a multiply connected network that must be manipulated before probability calculations can be performed.

There are other ways of dealing with loops in probabilistic networks besides clustering. Pearl discusses two approximating alternatives, while Lauritzen and Spiegelhalter (1988) have offered a powerful general method for converting any directed acyclic graph into a tree of cliques of that graph. See also Neapolitan, 1990, chap. 7. Hrycej (1990) shows how approximation by stochastic simulation can be understood as sampling from the Gibbs distribution in a random Markov field. Frey (1998) uses graph-based inference techniques to develop new algorithms for Bayesian networks.

What probabilities must actually be known to compute BEL(x) even in singly connected networks? Consider again figure 3, where BEL values for node D are to be computed using values for A, B, C, E, and F. To simplify, consider only values a, b, c, d, and e of the respective variables. Pearl's algorithms do not simply require knowledge of the conditional probabilities P(d/a), P(d/b), and P(d/c). (Here P(d/a) is shorthand for the probability that D has value d given that A has value a.) Rather, the calculation considers the probabilities of d given all the possible combinations of values of the variables on which D depends. Consider the simple propositional case where the possible values of D are that it is true (d) or false (not-d). Pearl's algorithm requires knowing P(d/a & b & c), P(d/a & b & not-c), P(d/a & not-b & not-c) and five other conditional probabilities. The analogous conditional probabilities for not-d can be computed from the ones just given. More generally, if D depends on n variables with k values each, k^n conditional probabilities will be required for computation. This raises two problems that Pearl discusses. First, if n is large, the calculations become computationally intractable, so that approximation methods must be used. Probabilistic networks are nevertheless much more attractive computationally than the general problem of computing probabilities, since the threat of combinatorial explosion is localized to nodes which we can hope to be dependent on a relatively small number of other nodes. Second, even if n is not so large, there is the problem of obtaining sensible conditional probabilities to plug into the calculations.
Pearl acknowledges that it is unreasonable to expect a human or other system to store all this information about conditional probabilities, and shows how it is sometimes possible to use simplified models of particular kinds of causal interaction to avoid having to do many of the calculations that the algorithms would normally require. Table 1 summarizes the differences between ECHO and Pearl networks. We now have enough information to begin considering the relation between ECHO and Pearl networks.

Table 1: Comparison of ECHO and Pearl networks.

                              ECHO                    Pearl
Nodes represent               propositions            variables
Edges represent               coherence               dependencies
Directedness                  symmetric               directed
Loops                         many                    must be eliminated
Node quantity updated         activation [-1, 1]      BEL: vector of [0, 1]
Additional updating           none                    λ, π
Additional information used   explanations, data      conditional probabilities, prior probabilities
Modeling probabilistic actions for practical decision-theoretic planning
Any planning model that strives to solve real-world problems must deal with the inherent uncertainty in its domains. Various approaches have been suggested (0; 0; 0; 0), and the generally accepted and traditional solution is to use probability to model domain uncertainty (0; 0; 0). A representative of this approach is the buridan planner (0). In buridan, uncertainty about the true state of the world is modeled with a probability distribution over the state space. Actions have uncertain effects, and each of these effects is also modeled with a probability distribution. Projecting a plan thus does not result in a single final state, but in a probability distribution over the state space. To make the representation computationally tractable, the probability distributions involved take non-zero probabilities on only a finite number of states. The buridan representation, which we will call the single probability distribution (SPD) model, has a well-founded semantics and is the underlying representation
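To illustrate the SPD idea, the sketch below projects a plan by pushing a sparse probability distribution over states through probabilistic actions. The state names and the action's outcome probabilities are hypothetical, and this is a schematic reading of the representation rather than the buridan implementation:

```python
from collections import defaultdict

def project(belief, action):
    """Propagate a sparse distribution over states through one
    probabilistic action: action(state) -> list of (next_state, prob)."""
    nxt = defaultdict(float)
    for state, p in belief.items():
        for succ, q in action(state):
            nxt[succ] += p * q
    return dict(nxt)

# Hypothetical two-state example: 'pickup' succeeds with probability 0.8.
def pickup(state):
    if state == "block-on-table":
        return [("holding-block", 0.8), ("block-on-table", 0.2)]
    return [(state, 1.0)]

belief = {"block-on-table": 1.0}
for _ in range(2):                # project a two-step plan
    belief = project(belief, pickup)
print(belief)                     # {'holding-block': 0.96, 'block-on-table': 0.04}
```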
100 Must-Read NLP Papers (Compiled, Ready for Direct Download)

This is a list of 100 important natural language processing (NLP) papers that serious students and researchers working in the field should probably know about and read. A self-compiled collection of these papers is available; the link has been updated (extraction code: x7tn).

This list is compiled by .

I welcome any feedback on this list.

This list is originally based on the answers for a Quora question I posted years ago: "What are the most important research papers which all NLP students should definitely read?".

I thank all the people who contributed to the original post.

This list is far from complete or objective, and is evolving, as important papers are being published year after year.
SPSS Vocabulary: Chinese-English Glossary
Spss词汇中英文对照Absolute deviation, 绝对离差Absolute number, 绝对数Absolute residuals, 绝对残差Acceleration array, 加速度立体阵Acceleration in an arbitrary direction, 任意方向上的加速度Acceleration normal, 法向加速度Acceleration space dimension, 加速度空间的维数Acceleration tangential, 切向加速度Acceleration vector, 加速度向量Acceptable hypothesis, 可接受假设Accumulation, 累积Accuracy, 准确度Actual frequency, 实际频数Adaptive estimator, 自适应估计量Addition, 相加Addition theorem, 加法定理Additivity, 可加性Adjusted rate, 调整率Adjusted value, 校正值Admissible error, 容许误差Aggregation, 聚集性Alternative hypothesis, 备择假设Among groups, 组间Amounts, 总量Analysis of correlation, 相关分析Analysis of covariance, 协方差分析Analysis of regression, 回归分析Analysis of time series, 时间序列分析Analysis of variance, 方差分析Angular transformation, 角转换ANOVA (analysis of variance), 方差分析ANOVA Models, 方差分析模型Arcing, 弧/弧旋Arcsine transformation, 反正弦变换Area under the curve, 曲线面积AREG , 评估从一个时间点到下一个时间点回归相关时的误差ARIMA, 季节和非季节性单变量模型的极大似然估计Arithmetic grid paper, 算术格纸Arithmetic mean, 算术平均数Arrhenius relation, 艾恩尼斯关系Assessing fit, 拟合的评估Associative laws, 结合律Asymmetric distribution, 非对称分布Asymptotic bias, 渐近偏倚Asymptotic efficiency, 渐近效率Asymptotic variance, 渐近方差Attributable risk, 归因危险度Attribute data, 属性资料Attribution, 属性Autocorrelation, 自相关Autocorrelation of residuals, 残差的自相关Average, 平均数Average confidence interval length, 平均置信区间长度Average growth rate, 平均增长率Bar chart, 条形图Bar graph, 条形图Base period, 基期Bayes' theorem , Bayes定理Bell-shaped curve, 钟形曲线Bernoulli distribution, 伯努力分布Best-trim estimator, 最好切尾估计量Bias, 偏性Binary logistic regression, 二元逻辑斯蒂回归Binomial distribution, 二项分布Bisquare, 双平方Bivariate Correlate, 二变量相关Bivariate normal distribution, 双变量正态分布Bivariate normal population, 双变量正态总体Biweight interval, 双权区间Biweight M-estimator, 双权M估计量Block, 区组/配伍组BMDP(Biomedical computer programs), BMDP统计软件包Boxplots, 箱线图/箱尾图Breakdown bound, 崩溃界/崩溃点Canonical correlation, 典型相关Caption, 纵标目Case-control study, 病例对照研究Categorical variable, 分类变量Catenary, 悬链线Cauchy distribution, 柯西分布Cause-and-effect relationship, 因果关系Cell, 单元Censoring, 终检Center of symmetry, 对称中心Centering and scaling, 中心化和定标Central tendency, 集中趋势Central value, 中心值CHAID -χ2 Automatic Interaction Detector, 卡方自动交互检测Chance, 机遇Chance error, 随机误差Chance variable, 随机变量Characteristic equation, 特征方程Characteristic root, 特征根Characteristic vector, 特征向量Chebshev criterion of fit, 拟合的切比雪夫准则Chernoff faces, 切尔诺夫脸谱图Chi-square test, 卡方检验/χ2检验Choleskey decomposition, 乔洛斯基分解Circle chart, 圆图Class interval, 组距Class mid-value, 组中值Class upper limit, 组上限Classified variable, 分类变量Cluster analysis, 聚类分析Cluster sampling, 整群抽样Code, 代码Coded data, 编码数据Coding, 编码Coefficient of contingency, 列联系数Coefficient of determination, 决定系数Coefficient of multiple correlation, 多重相关系数Coefficient of partial correlation, 偏相关系数Coefficient of production-moment correlation, 积差相关系数Coefficient of rank correlation, 等级相关系数Coefficient of regression, 回归系数Coefficient of skewness, 偏度系数Coefficient of variation, 变异系数Cohort study, 队列研究Column, 列Column effect, 列效应Column factor, 列因素Combination pool, 合并Combinative table, 组合表Common factor, 共性因子Common regression coefficient, 公共回归系数Common value, 共同值Common variance, 公共方差Common variation, 公共变异Communality variance, 共性方差Comparability, 可比性Comparison of bathes, 批比较Comparison value, 比较值Compartment model, 分部模型Compassion, 伸缩Complement of an event, 补事件Complete association, 完全正相关Complete dissociation, 完全不相关Complete statistics, 完备统计量Completely randomized design, 完全随机化设计Composite event, 联合事件Composite events, 复合事件Concavity, 凹性Conditional expectation, 条件期望Conditional likelihood, 条件似然Conditional probability, 条件概率Conditionally linear, 
依条件线性Confidence interval, 置信区间Confidence limit, 置信限Confidence lower limit, 置信下限Confidence upper limit, 置信上限Confirmatory Factor Analysis , 验证性因子分析Confirmatory research, 证实性实验研究Confounding factor, 混杂因素Conjoint, 联合分析Consistency, 相合性Consistency check, 一致性检验Consistent asymptotically normal estimate, 相合渐近正态估计Consistent estimate, 相合估计Constrained nonlinear regression, 受约束非线性回归Constraint, 约束Contaminated distribution, 污染分布Contaminated Gausssian, 污染高斯分布Contaminated normal distribution, 污染正态分布Contamination, 污染Contamination model, 污染模型Contingency table, 列联表Contour, 边界线Contribution rate, 贡献率Control, 对照Controlled experiments, 对照实验Conventional depth, 常规深度Convolution, 卷积Corrected factor, 校正因子Corrected mean, 校正均值Correction coefficient, 校正系数Correctness, 正确性Correlation coefficient, 相关系数Correlation index, 相关指数Correspondence, 对应Counting, 计数Counts, 计数/频数Covariance, 协方差Covariant, 共变Cox Regression, Cox回归Criteria for fitting, 拟合准则Criteria of least squares, 最小二乘准则Critical ratio, 临界比Critical region, 拒绝域Critical value, 临界值Cross-over design, 交叉设计Cross-section analysis, 横断面分析Cross-section survey, 横断面调查Crosstabs , 交叉表Cross-tabulation table, 复合表Cube root, 立方根Cumulative distribution function, 分布函数Cumulative probability, 累计概率Curvature, 曲率/弯曲Curvature, 曲率Curve fit , 曲线拟和Curve fitting, 曲线拟合Curvilinear regression, 曲线回归Curvilinear relation, 曲线关系Cut-and-try method, 尝试法Cycle, 周期Cyclist, 周期性D test, D检验Data acquisition, 资料收集Data bank, 数据库Data capacity, 数据容量Data deficiencies, 数据缺乏Data handling, 数据处理Data manipulation, 数据处理Data processing, 数据处理Data reduction, 数据缩减Data set, 数据集Data sources, 数据来源Data transformation, 数据变换Data validity, 数据有效性Data-in, 数据输入Data-out, 数据输出Dead time, 停滞期Degree of freedom, 自由度Degree of precision, 精密度Degree of reliability, 可靠性程度Degression, 递减Density function, 密度函数Density of data points, 数据点的密度Dependent variable, 应变量/依变量/因变量Dependent variable, 因变量Depth, 深度Derivative matrix, 导数矩阵Derivative-free methods, 无导数方法Design, 设计Determinacy, 确定性Determinant, 行列式Determinant, 决定因素Deviation, 离差Deviation from average, 离均差Diagnostic plot, 诊断图Dichotomous variable, 二分变量Differential equation, 微分方程Direct standardization, 直接标准化法Discrete variable, 离散型变量DISCRIMINANT, 判断Discriminant analysis, 判别分析Discriminant coefficient, 判别系数Discriminant function, 判别值Dispersion, 散布/分散度Disproportional, 不成比例的Disproportionate sub-class numbers, 不成比例次级组含量Distribution free, 分布无关性/免分布Distribution shape, 分布形状Distribution-free method, 任意分布法Distributive laws, 分配律Disturbance, 随机扰动项Dose response curve, 剂量反应曲线Double blind method, 双盲法Double blind trial, 双盲试验Double exponential distribution, 双指数分布Double logarithmic, 双对数Downward rank, 降秩Dual-space plot, 对偶空间图DUD, 无导数方法Duncan's new multiple range method, 新复极差法/Duncan新法Effect, 实验效应Eigenvalue, 特征值Eigenvector, 特征向量Ellipse, 椭圆Empirical distribution, 经验分布Empirical probability, 经验概率单位Enumeration data, 计数资料Equal sun-class number, 相等次级组含量Equally likely, 等可能Equivariance, 同变性Error, 误差/错误Error of estimate, 估计误差Error type I, 第一类错误Error type II, 第二类错误Estimand, 被估量Estimated error mean squares, 估计误差均方Estimated error sum of squares, 估计误差平方和Euclidean distance, 欧式距离Event, 事件Event, 事件Exceptional data point, 异常数据点Expectation plane, 期望平面Expectation surface, 期望曲面Expected values, 期望值Experiment, 实验Experimental sampling, 试验抽样Experimental unit, 试验单位Explanatory variable, 说明变量Exploratory data analysis, 探索性数据分析Explore Summarize, 探索-摘要Exponential curve, 指数曲线Exponential growth, 指数式增长EXSMOOTH, 指数平滑方法Extended fit, 扩充拟合Extra parameter, 附加参数Extrapolation, 外推法Extreme observation, 末端观测值Extremes, 极端值/极值F distribution, F分布F test, F检验Factor, 因素/因子Factor 
analysis, 因子分析Factor Analysis, 因子分析Factor score, 因子得分Factorial, 阶乘Factorial design, 析因试验设计False negative, 假阴性False negative error, 假阴性错误Family of distributions, 分布族Family of estimators, 估计量族Fanning, 扇面Fatality rate, 病死率Field investigation, 现场调查Field survey, 现场调查Finite population, 有限总体Finite-sample, 有限样本First derivative, 一阶导数First principal component, 第一主成分First quartile, 第一四分位数Fisher information, 费雪信息量Fitted value, 拟合值Fitting a curve, 曲线拟合Fixed base, 定基Fluctuation, 随机起伏Forecast, 预测Four fold table, 四格表Fourth, 四分点Fraction blow, 左侧比率Fractional error, 相对误差Frequency, 频率Frequency polygon, 频数多边图Frontier point, 界限点Function relationship, 泛函关系Gamma distribution, 伽玛分布Gauss increment, 高斯增量Gaussian distribution, 高斯分布/正态分布Gauss-Newton increment, 高斯-牛顿增量General census, 全面普查GENLOG (Generalized liner models), 广义线性模型Geometric mean, 几何平均数Gini's mean difference, 基尼均差GLM (General liner models), 一般线性模型Goodness of fit, 拟和优度/配合度Gradient of determinant, 行列式的梯度Graeco-Latin square, 希腊拉丁方Grand mean, 总均值Gross errors, 重大错误Gross-error sensitivity, 大错敏感度Group averages, 分组平均Grouped data, 分组资料Guessed mean, 假定平均数Half-life, 半衰期Hampel M-estimators, 汉佩尔M估计量Happenstance, 偶然事件Harmonic mean, 调和均数Hazard function, 风险均数Hazard rate, 风险率Heading, 标目Heavy-tailed distribution, 重尾分布Hessian array, 海森立体阵Heterogeneity, 不同质Heterogeneity of variance, 方差不齐Hierarchical classification, 组内分组Hierarchical clustering method, 系统聚类法High-leverage point, 高杠杆率点HILOGLINEAR, 多维列联表的层次对数线性模型Hinge, 折叶点Histogram, 直方图Historical cohort study, 历史性队列研究Holes, 空洞HOMALS, 多重响应分析Homogeneity of variance, 方差齐性Homogeneity test, 齐性检验Huber M-estimators, 休伯M估计量Hyperbola, 双曲线Hypothesis testing, 假设检验Hypothetical universe, 假设总体Impossible event, 不可能事件Independence, 独立性Independent variable, 自变量Index, 指标/指数Indirect standardization, 间接标准化法Individual, 个体Inference band, 推断带Infinite population, 无限总体Infinitely great, 无穷大Infinitely small, 无穷小Influence curve, 影响曲线Information capacity, 信息容量Initial condition, 初始条件Initial estimate, 初始估计值Initial level, 最初水平Interaction, 交互作用Interaction terms, 交互作用项Intercept, 截距Interpolation, 内插法Interquartile range, 四分位距Interval estimation, 区间估计Intervals of equal probability, 等概率区间Intrinsic curvature, 固有曲率Invariance, 不变性Inverse matrix, 逆矩阵Inverse probability, 逆概率Inverse sine transformation, 反正弦变换Iteration, 迭代Jacobian determinant, 雅可比行列式Joint distribution function, 分布函数Joint probability, 联合概率Joint probability distribution, 联合概率分布K means method, 逐步聚类法Kaplan-Meier, 评估事件的时间长度Kaplan-Merier chart, Kaplan-Merier图Kendall's rank correlation, Kendall等级相关Kinetic, 动力学Kolmogorov-Smirnove test, 柯尔莫哥洛夫-斯米尔诺夫检验Kruskal and Wallis test, Kruskal及Wallis检验/多样本的秩和检验/H检验Kurtosis, 峰度Lack of fit, 失拟Ladder of powers, 幂阶梯Lag, 滞后Large sample, 大样本Large sample test, 大样本检验Latin square, 拉丁方Latin square design, 拉丁方设计Leakage, 泄漏Least favorable configuration, 最不利构形Least favorable distribution, 最不利分布Least significant difference, 最小显著差法Least square method, 最小二乘法Least-absolute-residuals estimates, 最小绝对残差估计Least-absolute-residuals fit, 最小绝对残差拟合Least-absolute-residuals line, 最小绝对残差线Legend, 图例L-estimator, L估计量L-estimator of location, 位置L估计量L-estimator of scale, 尺度L估计量Level, 水平Life expectance, 预期期望寿命Life table, 寿命表Life table method, 生命表法Light-tailed distribution, 轻尾分布Likelihood function, 似然函数Likelihood ratio, 似然比line graph, 线图Linear correlation, 直线相关Linear equation, 线性方程Linear programming, 线性规划Linear regression, 直线回归Linear Regression, 线性回归Linear trend, 线性趋势Loading, 载荷Location and scale equivariance, 位置尺度同变性Location equivariance, 位置同变性Location invariance, 位置不变性Location scale family, 位置尺度族Log rank test, 
时序检验Logarithmic curve, 对数曲线Logarithmic normal distribution, 对数正态分布Logarithmic scale, 对数尺度Logarithmic transformation, 对数变换Logic check, 逻辑检查Logistic distribution, 逻辑斯特分布Logit transformation, Logit转换LOGLINEAR, 多维列联表通用模型Lognormal distribution, 对数正态分布Lost function, 损失函数Low correlation, 低度相关Lower limit, 下限Lowest-attained variance, 最小可达方差LSD, 最小显著差法的简称Lurking variable, 潜在变量Main effect, 主效应Major heading, 主辞标目Marginal density function, 边缘密度函数Marginal probability, 边缘概率Marginal probability distribution, 边缘概率分布Matched data, 配对资料Matched distribution, 匹配过分布Matching of distribution, 分布的匹配Matching of transformation, 变换的匹配Mathematical expectation, 数学期望Mathematical model, 数学模型Maximum L-estimator, 极大极小L 估计量Maximum likelihood method, 最大似然法Mean, 均数Mean squares between groups, 组间均方Mean squares within group, 组内均方Means (Compare means), 均值-均值比较Median, 中位数Median effective dose, 半数效量Median lethal dose, 半数致死量Median polish, 中位数平滑Median test, 中位数检验Minimal sufficient statistic, 最小充分统计量Minimum distance estimation, 最小距离估计Minimum effective dose, 最小有效量Minimum lethal dose, 最小致死量Minimum variance estimator, 最小方差估计量MINITAB, 统计软件包Minor heading, 宾词标目Missing data, 缺失值Model specification, 模型的确定Modeling Statistics , 模型统计Models for outliers, 离群值模型Modifying the model, 模型的修正Modulus of continuity, 连续性模Morbidity, 发病率Most favorable configuration, 最有利构形Multidimensional Scaling (ASCAL), 多维尺度/多维标度Multinomial Logistic Regression , 多项逻辑斯蒂回归Multiple comparison, 多重比较Multiple correlation , 复相关Multiple covariance, 多元协方差Multiple linear regression, 多元线性回归Multiple response , 多重选项Multiple solutions, 多解Multiplication theorem, 乘法定理Multiresponse, 多元响应Multi-stage sampling, 多阶段抽样Multivariate T distribution, 多元T分布Mutual exclusive, 互不相容Mutual independence, 互相独立Natural boundary, 自然边界Natural dead, 自然死亡Natural zero, 自然零Negative correlation, 负相关Negative linear correlation, 负线性相关Negatively skewed, 负偏Newman-Keuls method, q检验NK method, q检验No statistical significance, 无统计意义Nominal variable, 名义变量Nonconstancy of variability, 变异的非定常性Nonlinear regression, 非线性相关Nonparametric statistics, 非参数统计Nonparametric test, 非参数检验Nonparametric tests, 非参数检验Normal deviate, 正态离差Normal distribution, 正态分布Normal equation, 正规方程组Normal ranges, 正常范围Normal value, 正常值Nuisance parameter, 多余参数/讨厌参数Null hypothesis, 无效假设Numerical variable, 数值变量Objective function, 目标函数Observation unit, 观察单位Observed value, 观察值One sided test, 单侧检验One-way analysis of variance, 单因素方差分析Oneway ANOVA , 单因素方差分析Open sequential trial, 开放型序贯设计Optrim, 优切尾Optrim efficiency, 优切尾效率Order statistics, 顺序统计量Ordered categories, 有序分类Ordinal logistic regression , 序数逻辑斯蒂回归Ordinal variable, 有序变量Orthogonal basis, 正交基Orthogonal design, 正交试验设计Orthogonality conditions, 正交条件ORTHOPLAN, 正交设计Outlier cutoffs, 离群值截断点Outliers, 极端值OVERALS , 多组变量的非线性正规相关Overshoot, 迭代过度Paired design, 配对设计Paired sample, 配对样本Pairwise slopes, 成对斜率Parabola, 抛物线Parallel tests, 平行试验Parameter, 参数Parametric statistics, 参数统计Parametric test, 参数检验Partial correlation, 偏相关Partial regression, 偏回归Partial sorting, 偏排序Partials residuals, 偏残差Pattern, 模式Pearson curves, 皮尔逊曲线Peeling, 退层Percent bar graph, 百分条形图Percentage, 百分比Percentile, 百分位数Percentile curves, 百分位曲线Periodicity, 周期性Permutation, 排列P-estimator, P估计量Pie graph, 饼图Pitman estimator, 皮特曼估计量Pivot, 枢轴量Planar, 平坦Planar assumption, 平面的假设PLANCARDS, 生成试验的计划卡Point estimation, 点估计Poisson distribution, 泊松分布Polishing, 平滑Polled standard deviation, 合并标准差Polled variance, 合并方差Polygon, 多边图Polynomial, 多项式Polynomial curve, 多项式曲线Population, 总体Population attributable risk, 人群归因危险度Positive correlation, 正相关Positively skewed, 正偏Posterior distribution, 
后验分布Power of a test, 检验效能Precision, 精密度Predicted value, 预测值Preliminary analysis, 预备性分析Principal component analysis, 主成分分析Prior distribution, 先验分布Prior probability, 先验概率Probabilistic model, 概率模型probability, 概率Probability density, 概率密度Product moment, 乘积矩/协方差Profile trace, 截面迹图Proportion, 比/构成比Proportion allocation in stratified random sampling, 按比例分层随机抽样Proportionate, 成比例Proportionate sub-class numbers, 成比例次级组含量Prospective study, 前瞻性调查Proximities, 亲近性Pseudo F test, 近似F检验Pseudo model, 近似模型Pseudosigma, 伪标准差Purposive sampling, 有目的抽样QR decomposition, QR分解Quadratic approximation, 二次近似Qualitative classification, 属性分类Qualitative method, 定性方法Quantile-quantile plot, 分位数-分位数图/Q-Q图Quantitative analysis, 定量分析Quartile, 四分位数Quick Cluster, 快速聚类Radix sort, 基数排序Random allocation, 随机化分组Random blocks design, 随机区组设计Random event, 随机事件Randomization, 随机化Range, 极差/全距Rank correlation, 等级相关Rank sum test, 秩和检验Rank test, 秩检验Ranked data, 等级资料Rate, 比率Ratio, 比例Raw data, 原始资料Raw residual, 原始残差Rayleigh's test, 雷氏检验Rayleigh's Z, 雷氏Z值Reciprocal, 倒数Reciprocal transformation, 倒数变换Recording, 记录Redescending estimators, 回降估计量Reducing dimensions, 降维Re-expression, 重新表达Reference set, 标准组Region of acceptance, 接受域Regression coefficient, 回归系数Regression sum of square, 回归平方和Rejection point, 拒绝点Relative dispersion, 相对离散度Relative number, 相对数Reliability, 可靠性Reparametrization, 重新设置参数Replication, 重复Report Summaries, 报告摘要Residual sum of square, 剩余平方和Resistance, 耐抗性Resistant line, 耐抗线Resistant technique, 耐抗技术R-estimator of location, 位置R估计量R-estimator of scale, 尺度R估计量Retrospective study, 回顾性调查Ridge trace, 岭迹Ridit analysis, Ridit分析Rotation, 旋转Rounding, 舍入Row, 行Row effects, 行效应Row factor, 行因素RXC table, RXC表Sample, 样本Sample regression coefficient, 样本回归系数Sample size, 样本量Sample standard deviation, 样本标准差Sampling error, 抽样误差SAS(Statistical analysis system ), SAS统计软件包Scale, 尺度/量表Scatter diagram, 散点图Schematic plot, 示意图/简图Score test, 计分检验Screening, 筛检SEASON, 季节分析Second derivative, 二阶导数Second principal component, 第二主成分SEM (Structural equation modeling), 结构化方程模型Semi-logarithmic graph, 半对数图Semi-logarithmic paper, 半对数格纸Sensitivity curve, 敏感度曲线Sequential analysis, 贯序分析Sequential data set, 顺序数据集Sequential design, 贯序设计Sequential method, 贯序法Sequential test, 贯序检验法Serial tests, 系列试验Short-cut method, 简捷法Sigmoid curve, S形曲线Sign function, 正负号函数Sign test, 符号检验Signed rank, 符号秩Significance test, 显著性检验Significant figure, 有效数字Simple cluster sampling, 简单整群抽样Simple correlation, 简单相关Simple random sampling, 简单随机抽样Simple regression, 简单回归simple table, 简单表Sine estimator, 正弦估计量Single-valued estimate, 单值估计Singular matrix, 奇异矩阵Skewed distribution, 偏斜分布Skewness, 偏度Slash distribution, 斜线分布Slope, 斜率Smirnov test, 斯米尔诺夫检验Source of variation, 变异来源Spearman rank correlation, 斯皮尔曼等级相关Specific factor, 特殊因子Specific factor variance, 特殊因子方差Spectra , 频谱Spherical distribution, 球型正态分布Spread, 展布SPSS(Statistical package for the social science), SPSS统计软件包Spurious correlation, 假性相关Square root transformation, 平方根变换Stabilizing variance, 稳定方差Standard deviation, 标准差Standard error, 标准误Standard error of difference, 差别的标准误Standard error of estimate, 标准估计误差Standard error of rate, 率的标准误Standard normal distribution, 标准正态分布Standardization, 标准化Starting value, 起始值Statistic, 统计量Statistical control, 统计控制Statistical graph, 统计图Statistical inference, 统计推断Statistical table, 统计表Steepest descent, 最速下降法Stem and leaf display, 茎叶图Step factor, 步长因子Stepwise regression, 逐步回归Storage, 存Strata, 层(复数)Stratified sampling, 分层抽样Stratified sampling, 分层抽样Strength, 强度Stringency, 严密性Structural relationship, 结构关系Studentized residual, 
学生化残差/t化残差Sub-class numbers, 次级组含量Subdividing, 分割Sufficient statistic, 充分统计量Sum of products, 积和Sum of squares, 离差平方和Sum of squares about regression, 回归平方和Sum of squares between groups, 组间平方和Sum of squares of partial regression, 偏回归平方和Sure event, 必然事件Survey, 调查Survival, 生存分析Survival rate, 生存率Suspended root gram, 悬吊根图Symmetry, 对称Systematic error, 系统误差Systematic sampling, 系统抽样Tags, 标签Tail area, 尾部面积Tail length, 尾长Tail weight, 尾重Tangent line, 切线Target distribution, 目标分布Taylor series, 泰勒级数Tendency of dispersion, 离散趋势Testing of hypotheses, 假设检验Theoretical frequency, 理论频数Time series, 时间序列Tolerance interval, 容忍区间Tolerance lower limit, 容忍下限Tolerance upper limit, 容忍上限Torsion, 扰率Total sum of square, 总平方和Total variation, 总变异Transformation, 转换Treatment, 处理Trend, 趋势Trend of percentage, 百分比趋势Trial, 试验Trial and error method, 试错法Tuning constant, 细调常数Two sided test, 双向检验Two-stage least squares, 二阶最小平方Two-stage sampling, 二阶段抽样Two-tailed test, 双侧检验Two-way analysis of variance, 双因素方差分析Two-way table, 双向表Type I error, 一类错误/α错误Type II error, 二类错误/β错误UMVU, 方差一致最小无偏估计简称Unbiased estimate, 无偏估计Unconstrained nonlinear regression , 无约束非线性回归Unequal subclass number, 不等次级组含量Ungrouped data, 不分组资料Uniform coordinate, 均匀坐标Uniform distribution, 均匀分布Uniformly minimum variance unbiased estimate, 方差一致最小无偏估计Unit, 单元Unordered categories, 无序分类Upper limit, 上限Upward rank, 升秩Vague concept, 模糊概念Validity, 有效性VARCOMP (Variance component estimation), 方差元素估计Variability, 变异性Variable, 变量Variance, 方差Variation, 变异Varimax orthogonal rotation, 方差最大正交旋转Volume of distribution, 容积W test, W检验Weibull distribution, 威布尔分布Weight, 权数Weighted Chi-square test, 加权卡方检验/Cochran检验Weighted linear regression method, 加权直线回归Weighted mean, 加权平均数Weighted mean square, 加权平均方差Weighted sum of square, 加权平方和Weighting coefficient, 权重系数Weighting method, 加权法W-estimation, W估计量W-estimation of location, 位置W估计量Width, 宽度Wilcoxon paired test, 威斯康星配对法/配对符号秩和检验Wild point, 野点/狂点Wild value, 野值/狂值Winsorized mean, 缩尾均值Withdraw, 失访Youden's index, 尤登指数Z test, Z检验Zero correlation, 零相关Z-transformation, Z变换。
Deep Learning for Multi-label Classification
Deep Learning for Multi-label Classification
Jesse Read, Fernando Perez-Cruz

Abstract—In multi-label classification, the main focus has been to develop ways of learning the underlying dependencies between labels, and to take advantage of this at classification time. Developing better feature-space representations has been predominantly employed to reduce complexity, e.g., by eliminating non-helpful feature attributes from the input space prior to (or during) training. This is an important task, since many multi-label methods typically create many different copies or views of the same input data as they transform it, and considerable memory can be saved by taking advantage of redundancy. In this paper, we show that a proper development of the feature space can make labels less interdependent and easier to model and predict at inference time. For this task we use a deep learning approach with restricted Boltzmann machines. We present a deep network that, in an empirical evaluation, outperforms a number of competitive methods from the literature.

I. INTRODUCTION

Multi-label classification is the supervised learning problem where an instance may be associated with multiple labels. This is opposed to the traditional task of single-label classification (i.e., multi-class, or binary) where each instance is only associated with a single class label. The multi-label context is receiving increased attention and is applicable to a wide variety of domains, including text, audio data, still images and video, and bioinformatics; see [12], [22], [23] and the references therein.

The most well-known approach to multi-label classification is to simply train an independent classifier for each label. This is usually known in the literature as the binary relevance (BR) transformation, e.g., [22], [15]. Essentially, a multi-label problem is transformed into one binary problem for each label, and any off-the-shelf binary classifier is applied to each of these problems individually. Practically all the multi-label literature identifies that this method is limited by the fact that dependencies between labels are not explicitly modelled, and proposes algorithms to take these dependencies into account.

To date, many successful multi-label algorithms have been obtained by the so-called problem transformation methods (where the multi-label problem is transformed into several multi-class or binary problems), for example [2], [5], [14], [24], [4]. These methods make many copies of the feature space in memory (or make many passes over it). Most of the highest performing methods also use ensembles, for example with support vector machines (SVMs) [14], [24], decision trees [18], probabilistic methods [26], [28] or boosting [17], [25]. That is to say, most competitive methods from the large part of the literature could benefit tremendously from more concise representations of the feature space, relatively much more so than in the single-label context; the initial investment in reducing the number of feature variables in a multi-label problem is much more likely to offer considerable speed-ups during learning and classification. However, relatively little work in the multi-label literature has considered this issue.

Using the raw instance data to construct a model makes the implicit assumption that the labels originate from this data and that they can be recovered directly from it. Usually, however, both the labels and the feature variables originate from particular abstract concepts. For example, we generally think of an image as being labelled beach, not because its pixel-data vector is beach-like, but rather
because the image itself meets some criteria of our abstract idea of what a beach is. Ideally then, a feature set would include (for example) variables for a grainy surface such as sand or pebbles, and for being adjacent to a (significant) body of water. Hence, it is highly desirable to recover the hidden dependencies and structure from the original concepts behind the learning task. A good representation of these dependencies makes the problem easier to learn.

A Restricted Boltzmann Machine (RBM) [9] learns a layer of hidden features in an unsupervised fashion. This hidden layer can capture complex dependencies and structure from the input space, and represent it more compactly (whenever the number of hidden units is smaller than the number of original feature attributes). The methods we detail in this paper using RBMs offer some interesting benefits to multi-label classification in a variety of domains:

• The predictive performance of existing state-of-the-art methods is generally improved.
• Many classification paradigms previously relatively uncompetitive in multi-label learning can often obtain much higher predictive performance and become competitive, and thus now offer their respective advantages to this context, such as better posterior-probability estimates, lower memory consumption, faster performance, easier implementation, and incremental learning.
• The output feature space can be updated incrementally. This not only makes incremental learning feasible, but also means that cost savings are magnified for batch-learners that need to be retrained at intervals on new data.
• The model can be built using unlabeled examples, which are typically obtained much more cheaply than labelled examples; especially in multi-label contexts, since examples are assigned multiple labels.

We also stack several RBMs to create two varieties of Deep Belief Networks (DBNs). We look at two approaches using DBNs. In a first approach, we learn the final layer together with the labels and use an existing multi-label classifier. In a second approach, we use back-propagation to fine-tune the weights of our neural network for discriminative prediction, and augment this with a second multi-label predictive layer.

We develop a framework to experiment with RBMs and DBNs in a variety of multi-label classification contexts. Within this framework we carry out an empirical evaluation with many different methods from the literature, on a collection of real-world datasets from diverse domains (to the best of our knowledge, this is also the largest and most varied collection of datasets analysed with an RBM framework). The results indicate the benefits of this style of learning for multi-label classification.

II. PRIOR WORK

Multi-label datasets and classification methods have rapidly become more numerous in recent years, and classification performance has steadily improved. An overview of the most well known and influential work in this area is provided in [22], [12].

The binary relevance approach (BR) does not obtain high predictive performance because it does not model dependencies between labels. A number of methods have improved on this predictive performance with methods that do model label dependence. A well-known alternative is the label powerset (LP) method [23], which transforms the multi-label problem into a single-label problem with a single class, having the powerset as the set of values (i.e., all possible 2^L combinations). In LP, label dependencies are modelled directly and predictive performance is greater than BR, but
computational complexity is too high for most practical applications.

The complexity issue has been addressed in works such as [24] and [13]. The former presents RAkEL (RAndom k-labEL sets), an ensemble method that selects m subsets of k labels and uses LP to learn each of these subproblems.

The classifier chain approach (CC) [15] has received recent attention, for example in [3] and [26]. This method employs one classifier for each label, like BR, but the classifiers are not independent. Rather, each classifier predicts the binary relevance of each label given the input space plus the predictions of the previous classifiers (hence the chain).

Another type of binary-classification approach is the pairwise transformation method (PW), where a binary model is trained for each pair of labels. The predictions result more naturally in a set of pairwise preferences than a multi-label prediction (thus becoming popular in ranking schemes), but PW methods can be adapted to make multi-label predictions, for example [5]. These methods perform well in several domains, although their application can easily be prohibitive on many datasets due to their quadratic complexity.

An alternative to problem transformation is algorithm adaptation, where a specific single-label method is adapted directly for multi-label classification. MLkNN [30] is a k-nearest neighbours method adapted for multi-label learning by voting from the labels found in the neighbours. IBLR is a related method that also incorporates a second layer of logistic regression. BPMLL [29] is a back-propagation neural network adapted for multi-label classification by having multiple binary outputs as the label variables.

Processing the feature space of multi-label data has already been studied in the literature. [20] presents an overview of the main techniques with respect to problem transformation methods. In [27] a clustering-based supervised approach is used to obtain label-specific features for each label. The advantages of this method are reduced where label relevances are not trained separately, for example in LP methods (which learn all labels together as a single multi-class meta label).
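For concreteness, here is a compact sketch of the BR and CC transformations just described, assuming scikit-learn-style binary base classifiers (an illustration of the generic schemes, not the implementations evaluated in this paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_br(X, Y):
    """Binary relevance: one independent binary model per label."""
    return [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

class ClassifierChain:
    """Like BR, but classifier j also sees the labels/predictions 1..j-1."""
    def fit(self, X, Y):
        self.models, Xa = [], X
        for j in range(Y.shape[1]):
            self.models.append(LogisticRegression().fit(Xa, Y[:, j]))
            Xa = np.hstack([Xa, Y[:, j:j + 1]])   # augment with true label j
        return self

    def predict(self, X):
        Xa, cols = X, []
        for m in self.models:
            p = m.predict(Xa).reshape(-1, 1)
            cols.append(p)
            Xa = np.hstack([Xa, p])               # augment with prediction
        return np.hstack(cols)

# Toy usage with random data (4 labels):
rng = np.random.default_rng(1)
X = rng.random((60, 10)); Y = (rng.random((60, 4)) < 0.4) * 1
print(ClassifierChain().fit(X, Y).predict(X[:3]))
```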
In any case, this label-specific feature construction is a meta technique that can easily be applied independently of other preprocessing and learning techniques, such as the one we describe in this paper. In [25] redundancy is eliminated from the learning space of the BR method by taking random subsets of the training space across an ensemble. This work centers on the fact that a standard BR approach considers the full input space for each label, even though only a subset of the variables may be relevant to any particular label. Compressive sensing techniques have also been used in the literature for reducing the complexity of multi-label data by taking advantage of label sparsity [21], [11].

These methods are mainly motivated by reducing an algorithm's running time by reducing the number of feature variables in the input space, rather than learning or modelling the dependencies between them. More examples of feature-space reduction for multi-label classification are reviewed in [22].

The authors of [7] use a fully-connected network closely related to a Boltzmann machine for multi-label classification, using Gibbs sampling for inference. They use this network to model dependencies in the label space for prediction, rather than to improve the feature space. Since this is a fully connected network, it is tractable only for problems with a relatively small number of labels.

Figure 1 roughly illustrates the way some of the different classifiers model correlations among attributes and labels, assuming a linear base classifier.

Fig. 1: A network view of various classifiers; the connections among features and labels. (a) BR (b) CC (c) LP

III. DEEP LEARNING WITH RESTRICTED BOLTZMANN MACHINES

A well-known approach to deep learning is to model each layer of higher level features in a restricted Boltzmann machine [9]. We base our approaches on this strategy.

A. Preliminaries

In all that follows: X ⊂ R^d is the input domain of all possible feature values. An instance is represented as a vector of d feature values x = [x_1, ..., x_d]. The set Ł = {λ_1, ..., λ_L} is the output domain of L possible labels. Each instance x is associated with a subset of these labels Y ⊆ Ł, typically represented by a binary vector y = [y_1, ..., y_L], where y_j = 1 ⇔ λ_j ∈ Y; i.e., y_j = 1 if and only if the j-th label is associated with instance x, and 0 otherwise.

We assume a set of training data of N labelled examples {(x_i, y_i)}, i = 1, ..., N; y_i is the label vector (labelset) assignment of the i-th example; y_j^(i) is the relevance of the j-th label to the i-th example. In the BR context, for example, L binary classifiers h_1, ..., h_L are trained, where each h_j models the binary problem relating to the j-th label, such that

ŷ = h(x̃) = [h_1(x̃), ..., h_L(x̃)]

outputs a prediction vector ŷ ∈ {0,1}^L for any test instance x̃.

B. Restricted Boltzmann Machines

A Boltzmann machine is a type of fully-connected neural network that can be used to discover the underlying regularities of the (observed) training data [1]. When many features are involved, this type of network is only tractable in the restricted Boltzmann machine setting [9], where units are fully connected between layers, but are unconnected within layers.
An RBM learns a layer of u hidden feature variables from the original d feature variables of a training set (usually u < d). These hidden variables can provide a compact representation of the underlying patterns and structure of the input. In fact, an RBM can capture 2^u input-space regions, whereas standard clustering requires O(2^u) parameters and examples to capture this much complexity.

Figure 2 shows an RBM as a graphical model with two sets of nodes: visible (X-variables, shaded) and hidden (Z-variables). Each X_j is connected to all Z_k, k = 1, ..., u, by weight W_jk (the same in both directions).

Fig. 2: An RBM with 5 input units and 3 hidden units. Each edge is associated with a weight W_jk, which together make up the weight matrix W.

RBMs are energy-based models, where the joint probability of visible and hidden units is proportional to the energy between them: P(x, z) ∝ e^(−E(x,z)). Hence, by manipulating the energy E we can in turn generate the probability P(x, z). Specifically, we minimize the energy

E(x, z) = −xᵀW z

by learning the weight matrix W to find low-energy states. Contrastive divergence [8] is typically used for this task.

C. Deep Belief Networks

RBMs can be stacked to form so-called DBNs [9]. The RBMs are trained greedily: the first RBM takes the input space X and produces output Z^(1), then the second RBM treats Z^(1) as if it were the input space and produces Z^(2), and so on and so forth. When used for single-label classification, the final output layer is typically a softmax function (which is appropriate where only one of the output units should be on, to indicate one of K classes). In the following section we outline our approach, creating DBNs suitable for multi-label classification.

IV. DEEP BELIEF NETWORKS (DBNs) FOR MULTI-LABEL CLASSIFICATION

Ideally, an RBM would produce hidden variables that correspond directly to the label variables, and thus we could recover the label vector directly given any input vector; i.e., y ≡ z^(ℓ), or z^(ℓ) would be deterministically mappable to y. Unfortunately, this is seldom the case, because the abstract hidden variables do not need to correspond directly to the labels. However, we should expect the hidden layer of data to be more closely related to the labels than the original data, and thus it makes sense to use it as a feature space to classify instances. Hence, by using the hidden space created by the RBM, we would expect any multi-label classifier to obtain better performance (than when using the original feature space).
We do this simply by using the hidden representation of each instance as the input feature space, and associating it with the labels to create the training set {(z_i, y_i)}, i = 1, ..., N. We can then train any multi-label classifier h on this dataset. To evaluate a test instance x̃, we feed it through the RBM and obtain z̃ from the upper layer, and then acquire a prediction ŷ = h(z̃); and so on for each test instance.

From here we take two approaches. Since the sub-optimality produced by greedy learning is not necessarily harmful to many discriminative supervised methods [10], we can treat the final hidden-layer variables Z as the feature input variables, and train any off-the-shelf multi-label model h that can predict

ŷ = h(z̃)

where z̃ is produced by the RBM for some test instance x̃; see Figure 3a. In a second approach, we add a final layer of weights W^(ℓ) on top; see Figure 3b. Now the structure is similar to the neural network of BPMLL [29], except that we create the layers and initialize the weights using RBMs. Later we will show that our method performs much better. We can employ back propagation to fine-tune the network in a supervised fashion (with respect to label assignments) as in, for example, [9] (for single-label classification). For a number of epochs, each training instance x_i is propagated forward (upward) through the network and output as the prediction ŷ_i. The errors ε_i = y_i − ŷ_i are then propagated backward through the network, updating the weights (previously initialized by the RBMs). Due to the initialisation with RBMs, far fewer epochs are required than would usually be typical for back propagation (and we actually observed that more than around 100 epochs tends to result in overfitting).

On both these approaches it is possible to add more depth in the form of an additional classification layer. In the multi-label context, this has previously been done to the basic BR method in [6], where a second BR is trained on the outputs of the first (a stacking approach). A related technique in the neural network context, often called a "skip layer", has been used in, e.g., [19], [16]. In our case we allow for generic classifiers. This helps add some further discriminative power for taking into account the dependencies in the label space.

(a) A DBN with two layers of hidden units, i.e., two RBMs. (b) A DBN where a 3rd hidden layer represents the labels.

Fig. 3: DBNs for multi-label classification. In 3a, the output space (second hidden layer) Z^(2) can be trained with the label space Y by any multi-label classifier. In 3b, the labels are predicted directly in a third hidden layer.

Note that we have also experimented with a DBN that models the instance space and label space together generatively, P(x, y, z). In the multi-label setting this complicates the inference, since there are 2^L possible y. We tried using Gibbs sampling, but could not obtain competitive results from this model in the multi-label setting compared to our other approaches (even after reducing x in an RBM first). However, this seems like an interesting direction, and we intend to follow this idea further in future work.

V. EXPERIMENTS

We carry out an empirical evaluation to gauge the effectiveness and efficiency of RBMs and DBNs in a number of different multi-label classification scenarios, using different learning algorithms and a wide collection of databases. We have implemented these methods in the MEKA framework, an open-source Java-based framework with a number of important benchmark multi-label methods. In this framework RBMs can easily be used in a wide variety of multi-label schemes.
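Before the experimental results, a minimal end-to-end sketch of the first approach: pre-train an RBM with CD-1 (sketched without bias terms), map instances to the hidden layer, and fit any multi-label classifier there. The toy data and ClassifierChain refer to the earlier sketch; this illustrates the standard algorithms, not the authors' MEKA code:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd1_step(X, W, lr=0.1):
    """One contrastive-divergence (CD-1) update of W (d x u) on a batch X (N x d)."""
    ph = sigmoid(X @ W)                        # positive phase: P(z=1 | x)
    h = (np.random.rand(*ph.shape) < ph) * 1.0
    pv = sigmoid(h @ W.T)                      # negative phase: reconstruct x
    ph2 = sigmoid(pv @ W)
    grad = (X.T @ ph - pv.T @ ph2) / X.shape[0]
    return W + lr * grad                       # data minus model correlations

def up(X, W):
    """Deterministic up-pass: hidden representation of each instance."""
    return sigmoid(X @ W)

# Hypothetical toy data: N x d inputs, N x L binary labels, u hidden units.
rng = np.random.default_rng(0)
X_train = rng.random((100, 20)); Y_train = (rng.random((100, 4)) < 0.3) * 1
X_test = rng.random((10, 20)); u = 8

W = rng.normal(scale=0.01, size=(X_train.shape[1], u))
for _ in range(1000):                          # unsupervised pre-training
    W = cd1_step(X_train, W)

Z_train = up(X_train, W)                       # hidden features replace raw ones
h = ClassifierChain().fit(Z_train, Y_train)
y_pred = h.predict(up(X_test, W))
```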
The source code of our implementations will be made available as part of the MEKA framework. We selected commonly-used datasets from a variety of domains, listed in Table I along with some basic statistics about them. The datasets vary considerably with respect to the type of data and their dimensions (the number of labels, features, and examples). In Music, instances of music are associated with emotions; in Scene, images belong to categories; in Yeast, proteins may be associated with multiple biological functions, and in Genbase, gene sequences. Medical, Enron and Reuters are text datasets where text documents are associated with categories. These datasets are described in greater detail in [12].

TABLE I: A collection of multi-label datasets and associated statistics, where LC is label cardinality: the average number of labels relevant to each example.

          N      L    d     LC    Type
Music     593    6    72    1.87  audio
Scene     2407   6    294   1.07  image
Yeast     2417   14   103   4.24  biology
Genbase   661    27   1185  1.25  biology
Medical   978    45   1449  1.25  medical/text
Enron     1702   53   1001  3.38  e-mail/text
Reuters   6000   103  500   1.46  news/text

A. RBM performance

We first compare the performance of introducing an RBM, blindly trained, for reducing the input dimension, and then try out three of the common paradigms in multi-label classification (namely BR, LP and PW) to test the improvements proposed for this feature-extraction algorithm. The RBM will improve the performance of the multi-label classification paradigms if the extracted features are relevant for better describing the task at hand, and will be neutral or negative if the features that have been extracted blindly do not correspond to relevant features for assigning labels.

The RBM has several parameters that need to be fine-tuned (i.e., the number of hidden units, learning rate and momentum), and we use three-fold cross validation to set them. We considered the number of hidden units u ∈ {30, 60, 120, 240}, the learning rate η ∈ {0.1, 0.01, 0.001}, and momentum α ∈ {0.2, 0.4, 0.8}. We used weight costs of 2·10^−5 and E = 1000 epochs throughout.

1) Ensemble of Classifier Chains: CC is a competitive BR method that uses the chain rule to improve the prediction for each potential label. As it is unclear what the best ordering should be, we use an ensemble of 50 CCs, in which the labels are randomly ordered in each realization (as in [15]). In Table IIa, we report the accuracy, as defined in [23], [6], [15], [13], to report the performance of our multi-label classifiers:

accuracy = (1/N) Σ_{i=1}^{N} |y_i ∧ ŷ_i| / |y_i ∨ ŷ_i|,

where ∧ and ∨ are the bitwise AND and OR functions, respectively, for {0,1}^L × {0,1}^L → {0,1}^L. (Footnote 2: There are a variety of multi-label evaluation measures used in multi-label experiments in the literature; [22] provides an overview of some of the most popular. The accuracy provides a good balance to gauge the overall predictive performance of multi-label methods [12], [15].)

TABLE II: We compare ECC with and without feature extraction using RBMs.

(a) We report the accuracy for SVM and logistic regression based multi-label classifiers.

          SVM              Log-Reg
          ECC_R   ECC      ECC_R   ECC
Music     0.581   0.576    0.558   0.504
Scene     0.731   0.710    0.709   0.554
Yeast     0.532   0.535    0.513   0.504
Genbase   0.979   0.981    0.971   0.977
Medical   0.695   0.770    0.449   0.706
Enron     0.469   0.454    0.451   0.355
Reuters   0.459   0.461    0.408   0.376

(b) The parameters chosen for ECC_R on the first of the two folds (using an internal train/test set of the training set). Parameters for the second fold of each dataset were invariably similar or identical.

          SVMs               Log. Reg.
          η     α    u       η     α    u
Music     0.1   0.2  120     0.1   0.8  30
Scene     0.1   0.8  240     0.1   0.8  60
Yeast     0.01  0.2  120     0.01  0.2  30
Genbase   0.1   0.8  120     0.1   0.4  60
Medical   0.1   0.6  120     0.1   0.6  120
Enron     0.1   0.6  120     0.1   0.6  120
Reuters   0.1   0.6  120     0.1   0.6  120

In Table IIa, ECC_R and ECC, respectively, denote the accuracy of the ECC with the RBM-generated features and with the original input space. We have used two different classifiers, a nonlinear SVM and logistic regression (a linear classifier), both of which have been trained with the default parameters in WEKA. It can be seen that for the logistic regression classifier, the accuracy achieved with the features generated by the RBM is significantly better for the Music, Scene, Enron and Reuters datasets; it only underperforms for the Medical dataset, and the results are comparable for the Yeast and Genbase datasets. The RBM not only reduces the dimensionality of the input space for the classifier, but it also makes the features suitable for linear classifiers, which allows interpreting the RBM features and understanding how each one of them participates in the prediction for each label.

For the SVM-based ECC classifiers there is not a significant difference when we use the RBM-processed features compared to using the raw data directly, as the RBF kernel in the SVM can compensate for the preprocessing done by the RBM. In this case, almost all the results are comparable, except for Scene and Medical, on which, respectively, ECC_R and ECC outperform. We should remark that the linear logistic regression is as good as the nonlinear SVM in most cases, so it seems that using the RBM features reduces the input dimension and makes the classification problem easier, as a linear classifier performs as well as a state-of-the-art nonlinear classifier.

In Figure 4 we show the accuracy on the seven databases for the ECC and ECC_R multi-label classifiers with an SVM base classifier, as a function of the number of hidden units of the RBM. In this plot, it can be seen that once we have enough features, using the RBM is comparable to not using it; and it is clear that for Medical the number of features is too small, and we would have needed to increase the number of extracted features to achieve the same performance as the SVM does. (Footnote 3: We did not do so, to keep the experimental setting uniform for all proposed methods, as we think it is important that hyperparameter settings should be general and not finely tuned for each application.)

Fig. 4: The number of hidden units (horizontal axis) and corresponding accuracy as compared to accuracy with the same methods on the original feature space (horizontal lines). For η = 0.1, α = 0.1.
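For reference, the accuracy measure defined above can be computed directly from binary label matrices (a transcription of the formula, assuming NumPy arrays):

```python
import numpy as np

def ml_accuracy(Y_true, Y_pred):
    """Multi-label accuracy: mean over examples of |y AND y_hat| / |y OR y_hat|."""
    Y_true = Y_true.astype(bool)
    Y_pred = Y_pred.astype(bool)
    inter = np.logical_and(Y_true, Y_pred).sum(axis=1)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    return np.mean(inter / np.maximum(union, 1))  # guard all-zero rows
```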
Finally, in Table III we show the accuracy of the SVM-based classifier on the Scene dataset for all the tested combinations of the learning rate and the momentum, with the number of hidden units fixed to 120. The accuracy for the ECC (without RBM-generated features) is 0.695, and in this case any combination of learning rate and momentum does better, which indicates that with a sufficient number of hidden units, the RBM learning is quite robust and not overly sensitive to hyperparameter settings.

TABLE III: The accuracy of ECC, with an SVM base classifier, for a fixed number of hidden units u = 120, and for varying learning rate (λ) and momentum (α).

λ       α     accuracy
0.001   0.2   0.707
0.001   0.4   0.705
0.001   0.8   0.705
0.01    0.2   0.710
0.01    0.4   0.714
0.01    0.8   0.720
0.1     0.2   0.726
0.1     0.4   0.727
0.1     0.8   0.726

Fig. 5: The difference in accuracy (shown here on the Music and Medical datasets) between baseline BR (dashed lines) and more-advanced CC (solid lines), both built on RBM-produced outputs, decreases with more hidden units (horizontal axis). For η = 0.1, α = 0.1.

2) RAndom k-labEL subsets: RAkEL is a truncated powerset method in which we try all combinations of 3 labels and we report an ensemble with 2L classifiers. We use the same hyperparameter settings as we did for the ECC (as reported in Table IIb) to make the results comparable across multi-label classification paradigms, and we report the accuracy in Table IV.

TABLE IV: We report the accuracy for RAkEL with and without feature extraction using RBMs, using SVM and logistic regression based multi-label classifiers.

          SVM                Log-Reg
          RAk_R    RAk       RAk_R    RAk
Music     0.581    0.579     0.538    0.465
Scene     0.712    0.684     0.663    0.469
Yeast     0.537    0.537     0.497    DNF
Genbase   0.984    0.984     0.968    0.976
Medical   0.652    0.743     0.494    0.639
Enron     0.452    0.413     0.376    0.273
Reuters   0.342    0.337     0.285    DNF

The results for this paradigm are similar to the ones that we reported for the ECC in the previous section. For logistic regression (a linear classifier), the RBM-generated features lend themselves to accurate predictions when compared with the unprocessed features under the same baseline classifier, and they are comparable to the results achieved by the nonlinear SVM classifier. After processing the features with an RBM we might not need to rely on a nonlinear classifier. For the SVM, using the RBM-generated features does not help, but it does not hurt either, in terms of accuracy, as the SVM's nonlinear mapping is versatile enough to learn any nonlinear mapping.

3) Pairwise Classification: We implemented a pairwise approach, namely the Four-class pairWise classifier (FW), in which we build models to learn classes y_jk ∈ {00, 01, 10, 11} for each label pair 1 ≤ j < k ≤ L, dividing each into votes for the individual labels ŷ_j and ŷ_k and using a threshold at classification time. We find that overall it obtains better predictive performance than the pairwise methods that create decision boundaries between labels (where y_jk ∈ {01, 10}), as in [5], for example, especially with SVMs. We report the accuracy in Table V, using the same hyperparameters as we did for the ECC (as reported in Table IIb) to make the results comparable across multi-label classification paradigms.

TABLE V: We report the accuracy for FW with and without feature extraction using RBMs, using SVM and logistic regression based multi-label classifiers.

          SVM                Log-Reg
          FW_R     FW        FW_R     FW
Music     0.578    0.573     0.549    0.492
Scene     0.694    0.649     0.660    0.490
Yeast     0.537    0.538     0.507    0.495
Genbase   0.985    0.985     0.949    0.975
Medical   0.571    0.748     0.492    DNF
Enron     0.463    0.408     0.376    DNF

The conclusions are similar to those for the other two paradigms. The linear classifier (logistic regression) does significantly better with the RBM-generated features than with the original input space, while the nonlinear SVM classifier is versatile enough to provide accurate predictions with or without RBM-generated features. Fortunately, the linear classifier with RBM-generated features comes quite close to the SVM-based classifier and allows us to interpret which RBM features contribute to each label; hence we can provide intuitive interpretations for each RBM feature, while it is hard to get such an interpretation from the SVM's nonlinear mapping.

B. DBN performance

After analyzing the performance of the RBM-generated features, we focus on two DBN structures for multi-label classification:

• DBN2ECC: a network of two hidden layers, the final of which is united with the labels in a new dataset and trained with ECC (see Figure 3a)
• DBN3bp: a network of three hidden layers where the final layer represents the labels, fine-tuned with back propagation (see Figure 3b)

Both setups can be visualised in Figure 6, where h ≡ W in the case of DBN3bp. We use u = d/5 hidden units, 1000 RBM epochs, 100 BP epochs (on DBN3bp), and the best of either α = 0.8, λ = 0.1 and α = 0.8, λ = 0.1 on a 67:33 percent internal train/test validation (taking advantage of the fact, as we explained earlier, that the choice of learning rate and momentum is fairly robust given enough hidden units).
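A rough sketch of greedy layer-wise stacking in the spirit of DBN2ECC, reusing the sigmoid, cd1_step and ClassifierChain helpers and toy data from the earlier sketches; layer sizes and epoch counts here are illustrative assumptions, not the paper's exact configuration:

```python
def train_dbn(X, layer_sizes, epochs=1000):
    """Greedy layer-wise RBM stack: each layer models the previous output."""
    rng = np.random.default_rng(0)
    weights, Z = [], X
    for u in layer_sizes:
        W = rng.normal(scale=0.01, size=(Z.shape[1], u))
        for _ in range(epochs):
            W = cd1_step(Z, W)
        weights.append(W)
        Z = sigmoid(Z @ W)        # deterministic up-pass to the next layer
    return weights, Z

# DBN2ECC-style: two hidden layers of size d/5, then ECC-like training on top.
weights, Z2 = train_dbn(X_train, [X_train.shape[1] // 5] * 2)
h = ClassifierChain().fit(Z2, Y_train)
```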
Mock AI English Interview Questions and Answers
1. Question: What is the difference between a neural network and a deep learning model?
Answer: A neural network is a set of algorithms modeled loosely after the human brain that are designed to recognize patterns. A deep learning model is a neural network with multiple layers, allowing it to learn more complex patterns and features from data.

2. Question: Explain the concept of 'overfitting' in machine learning.
Answer: Overfitting occurs when a machine learning model learns the training data too well, including its noise and outliers, resulting in poor generalization to new, unseen data.

3. Question: What is the role of 'bias' in an AI model?
Answer: Bias in an AI model refers to the systematic errors introduced by the model during the learning process. It can be due to the choice of model, the training data, or the algorithm's assumptions, and it can lead to unfair or inaccurate predictions.

4. Question: Describe the importance of data preprocessing in AI.
Answer: Data preprocessing is crucial in AI as it involves cleaning, transforming, and reducing the data to a suitable format for the model to learn effectively. Proper preprocessing can significantly improve the performance of AI models by ensuring that the input data is relevant, accurate, and free from noise.

5. Question: How does reinforcement learning differ from supervised learning?
Answer: Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a reward signal. It differs from supervised learning, where the model learns from labeled data to predict outcomes based on input features.

6. Question: What is the purpose of a 'convolutional neural network' (CNN)?
Answer: A convolutional neural network (CNN) is a type of deep learning model that is particularly effective for processing data with a grid-like topology, such as images. CNNs use convolutional layers to automatically and adaptively learn spatial hierarchies of features from input images.

7. Question: Explain the concept of 'feature extraction' in AI.
Answer: Feature extraction in AI is the process of identifying and extracting relevant pieces of information from raw data. It is a crucial step in many machine learning algorithms, as it helps to reduce the dimensionality of the data and to focus on the most informative aspects that can be used to make predictions or classifications.

8. Question: What is the significance of 'gradient descent' in training AI models?
Answer: Gradient descent is an optimization algorithm used to minimize a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. In the context of AI, it is used to minimize the loss function of a model, thus refining the model's parameters to improve its accuracy.

9. Question: How does 'transfer learning' work in AI?
Answer: Transfer learning is a technique where a pre-trained model is used as the starting point for learning a new task. It leverages the knowledge gained from one problem to improve performance on a different but related problem, reducing the need for large amounts of labeled data and computational resources.

10. Question: What is the role of 'regularization' in preventing overfitting?
Answer: Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function, which discourages overly complex models. It helps to control the model's capacity, forcing it to generalize better to new data by not fitting too closely to the training data.
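As a worked complement to the gradient descent answer (question 8), here is a minimal sketch minimizing a one-dimensional quadratic loss; the loss function and learning rate are illustrative choices, not part of the original answers:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Iteratively step against the gradient to minimize a loss."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize L(x) = (x - 3)^2, whose gradient is 2(x - 3).
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # ~3.0
```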
English Abbreviations in an Information Retrieval Course
信息检索课程中的英文简称Information Retrieval Course: An In-Depth Exploration.Information retrieval, commonly abbreviated as IR, is a crucial field in computer science that deals with the retrieval of information from large collections of unstructured or semi-structured data. It finds its applications in various domains, including libraries, e-commerce, search engines, and more. In this article, we delve into the intricacies of information retrieval, its importance, and the techniques used in this domain.1. Introduction to Information Retrieval.Information retrieval is the process of obtaining relevant information from a large, often unstructured, collection of data. It involves techniques such as indexing, searching, and ranking to ensure that the most relevant information is presented to the user. The goal is toprovide accurate and timely information to meet the user'sinformation needs.2. Core Components of Information Retrieval.Indexing: Indexing is the process of creating a data structure, such as an inverted index, that maps terms (keywords) to the locations (documents) where they appear. This allows efficient retrieval of documents containing specific terms.Searching: Searching involves the user submitting a query, which is then processed and compared against the index to retrieve relevant documents. Queries can be simple keywords or complex expressions.Ranking: Ranking algorithms determine the order in which retrieved documents are presented to the user. Relevance, popularity, and recency are common factors considered in ranking.3. Types of Information Retrieval Systems.Boolean Retrieval: Boolean retrieval systems allow users to specify search queries using Boolean operators (AND, OR, NOT) to combine terms and filter results.Vector Space Models: These models represent documents and queries as vectors in a high-dimensional space. Relevance is determined by measuring the similarity between these vectors.Probabilistic Models: Probabilistic models estimate the probability of a document being relevant to a given query. They consider factors like term frequencies and document lengths.Learning-to-Rank (L2R) Models: These models use machine learning techniques to learn the ranking function based on training data. They aim to optimize ranking metrics like mean reciprocal rank (MRR) or normalized discounted cumulative gain (NDCG).4. Challenges in Information Retrieval.Semantic Gap: The semantic gap refers to the mismatch between the user's information need and the representationof information in the system. Addressing this gap requires techniques like latent semantic indexing or word embeddings.Scalability: As data collections grow, it becomes challenging to maintain and query the index efficiently. Distributed retrieval systems and近似算法can help address scalability issues.User Intent Understanding: Understanding the trueintent behind a user's query is crucial for accurate retrieval. Techniques like query reformulation and user profiling can aid in understanding user intent.5. Applications of Information Retrieval.Search Engines: Search engines are the most visible application of information retrieval, serving billions of queries daily. They use a combination of IR techniques to provide relevant search results.E-commerce: E-commerce platforms rely on IR to help users find products or services that meet their needs. This involves searching product descriptions, user reviews, and more.Libraries and Archives: Libraries and archives use IR systems to catalog and retrieve books, documents, and other materials. 
These systems often incorporate metadata and faceted search to enhance retrieval accuracy.
Question Answering Systems: Question answering systems aim to provide direct answers to user queries, often by analyzing a large corpus of text to extract relevant information.

6. Future Trends in Information Retrieval.
Semantic Retrieval: As the focus shifts towards understanding the true meaning of queries and documents, semantic retrieval techniques like entity linking and semantic role labeling will become increasingly important.
Multimodal Retrieval: With the increasing availability of multimedia content, there is a growing need for systems that can handle text, images, audio, and video simultaneously.
Personalized Retrieval: Techniques like user profiling and collaborative filtering will play a crucial role in personalizing search results based on user preferences and behavior.
Interactive Retrieval: Systems that allow users to interactively refine their queries or provide feedback on search results will improve retrieval accuracy and user satisfaction.

In conclusion, information retrieval is a crucial field that powers many of the technologies we rely on daily. It involves complex techniques and algorithms to ensure accurate and timely information delivery. As data volumes continue to grow and user needs become more sophisticated, IR research will focus on addressing challenges like the semantic gap, scalability, and user intent understanding. Future trends like semantic retrieval, multimodal retrieval, personalized retrieval, and interactive retrieval will further enhance the capabilities of IR systems and improve user experiences.
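The indexing-searching-ranking pipeline of Section 2 can be made concrete with a short sketch (our illustration, not part of the original article; the toy corpus and function names are invented). It builds an inverted index and ranks the candidate documents with TF-IDF weights, the core idea behind the vector space models of Section 3:

```python
import math
from collections import Counter, defaultdict

# Toy corpus; document IDs and text are illustrative.
docs = {
    1: "information retrieval from large document collections",
    2: "search engines rank documents by relevance",
    3: "boolean retrieval combines terms with and or not",
}

# Indexing: map each term to the set of documents containing it.
index = defaultdict(set)
tf = {}  # per-document term frequencies
for doc_id, text in docs.items():
    terms = text.split()
    tf[doc_id] = Counter(terms)
    for term in terms:
        index[term].add(doc_id)

def idf(term):
    # Smoothed inverse document frequency.
    return math.log(len(docs) / (1 + len(index[term]))) + 1

def search(query, k=3):
    # Searching: candidates are documents containing any query term.
    q_terms = query.split()
    candidates = set().union(*(index[t] for t in q_terms if t in index))
    # Ranking: TF-IDF weighted overlap, normalized by document length.
    scores = {}
    for d in candidates:
        dot = sum(tf[d][t] * idf(t) for t in q_terms)
        norm = math.sqrt(sum((tf[d][t] * idf(t)) ** 2 for t in tf[d]))
        scores[d] = dot / norm if norm else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])[:k]

print(search("document retrieval"))
```

A Boolean retrieval system would instead intersect or union the posting sets directly (AND/OR over `index[t]`), which is why the inverted index serves both retrieval styles.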
A Probabilistic Model for Phonocardiograms Segmentation Based on Homomorphic Filtering
We denote:

\hat{x}(t) = \ln x(t) = \ln a(t) + \ln f(t) \qquad (1)

In cases where x(t) = 0 we add a small positive value, and then we have \hat{x}(t) \approx \ln a(t) + \ln f(t).
Gill D.¹, Intrator N.², Gavriely N.³
¹ Department of Statistics, The Hebrew University of Jerusalem, Jerusalem 91905, Israel; ² School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel; ³ Rappaport Faculty of Medicine, Technion IIT, Haifa 31096, Israel
gill@mta.ac.il

Abstract. This work presents a novel method for automatic detection and identification of heart sounds. Homomorphic filtering is used to obtain a smooth envelogram of the phonocardiogram, which enables a robust detection of events of interest in the heart sound signal. Sequences of features extracted from the detected events are used as observations of a hidden Markov model. It is demonstrated that the task of detection and identification of the major heart sounds can be learned from unlabelled phonocardiograms by an unsupervised training process and without the assistance of any additional synchronizing channels.
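As a rough illustration of the envelogram idea (our sketch, not the authors' implementation; the filter order and cut-off frequency are assumptions, since the paper's design choices are not given here), one can log-transform the rectified signal as in Eq. (1), low-pass filter it to keep the slowly varying ln a(t), and exponentiate back:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def homomorphic_envelope(x, fs, cutoff_hz=8.0, eps=1e-8):
    """Smooth amplitude envelope of a phonocardiogram x sampled at fs Hz.

    Following Eq. (1): ln|x(t)| = ln a(t) + ln f(t). Low-pass filtering the
    log-signal keeps the slowly varying ln a(t); exp() recovers a(t).
    The 8 Hz cut-off and 2nd-order filter are illustrative choices.
    """
    log_x = np.log(np.abs(x) + eps)          # eps guards x(t) = 0, as in the text
    b, a = butter(2, cutoff_hz / (fs / 2))   # 2nd-order low-pass Butterworth
    log_env = filtfilt(b, a, log_x)          # zero-phase filtering
    return np.exp(log_env)

# Example: a 1 kHz toy signal with two bursts mimicking S1/S2 heart sounds.
fs = 1000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 60 * t) * (np.exp(-((t - 0.2) ** 2) / 0.001)
                                  + np.exp(-((t - 0.6) ** 2) / 0.001))
env = homomorphic_envelope(x, fs)
```

Peaks of such an envelope would mark candidate heart-sound events, whose feature sequences the paper then feeds to a hidden Markov model.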
Towards a Theory of Balancing Exploration and Exploitation in Probabilistic Environments
Stefani Nellen (snellen@), Marsha C. Lovett (lovett@)
Carnegie Mellon University, Department of Psychology, 5000 Forbes Ave, Pittsburgh, PA 15213, USA

Abstract
Learning to make good choices in a probabilistic environment requires that the decision maker resolves the tension between exploration (learning about all available options) and exploitation (consistently choosing the best option in order to maximize rewards). We present a mathematical learning model that makes selections in a repeated-choice probabilistic task based on the expected payoff associated with each option and the information gain that will result from choosing that option. This model can be used to analyze the relative impact of exploration and exploitation over time and under different conditions. It predicts the aggregated and individual learning trajectories of participants in various versions of the task sufficiently well to support our basic argument: information gain is a valid and rational criterion underlying human decision making. Future modeling work will address the exact nature of the interaction between exploration and exploitation.

Introduction
Decision makers are often placed in novel situations that offer them a finite variety of choices. They know that each of the choices is associated with some probability of leading to a positive outcome, but they don't know what these probabilities are. They might also know that making a choice will constrain the available choices in the subsequent decision cycle, but, again, they don't quite know in which manner this will happen. All they know is that they will have n opportunities to make one choice at a time, and that the long-term goal is to maximize accumulated rewards. Each decision maker has to resolve the tension between exploration (learning about the payoffs of the options, which is achieved by selecting them and observing the outcome) and exploitation (consistently choosing the best option). Maximizing rewards depends on accurate payoff estimates and therefore on sufficient information. On the other hand, this exploration must eventually be discarded in favor of consistently choosing the best option if the goal is to be met. This example shows that exploration is a necessary prerequisite of probability learning. It also shows that the benefit of exploration depends on the amount of information that has already been accumulated; therefore, it will not remain constant over time. We are convinced that it is possible to predict the relative impact of exploration and exploitation under different conditions over time, and that this relative impact varies in interaction with the probabilistic structure provided by the environment. The model we present here provides an implementation of this basic idea.

In the context of decision making research, the representation of probabilistic information and learning has been somewhat understudied, partly because probabilistic representations are often assessed by eliciting one-shot probability estimates from participants instead of observing changes in their actual behavior (see Gigerenzer, 1994, for a critique of the notion of single-event probabilities). Additionally, following a long history of models of probability learning
(e.g., Estes, 1964), recent cognitive models of heuristics or decision making algorithms that rely on probabilistic cues often provide participants with the explicit probabilities from the outset (Broeder, 2000; Payne, Bettman and Johnson, 1993; Rieskamp & Hoffrage, 2000) or assume that the representations have already been formed (models often achieve this during a separate "training phase") and are ready to be used. In ACT-R (Anderson & Lebiere, 1998), the selection of cognitive operators (production rules) is based on their "expected utility", which is partly determined by estimates of the probability of success associated with that operator (the other components being "cost/effort" and noise). This expected utility can gradually be learned by experience. Information gain, i.e. the increase in knowledge about the payoff structure of the system, does not currently play an explicit role in utility learning and operator selection in any of these models.

The model presented in this report proposes a definition of information gain and an explanation of how information gain and estimated payoffs interactively influence decision making. The model also explains how the relative impact of information gain and estimated payoffs on decision making changes over time. Before describing the model and its predictions in more detail, we will provide a description of the task in which it was developed and tested, a repeated-choice probability learning task with immediate and delayed rewards. Afterwards, we will present the fit between some model predictions and behavioral data, compare this fit to that of a version of the model that does not take information gain into account, and explore future directions for refining the model.

A Probabilistic Learning Task
A version of the task that was used for developing this model has previously been used by Brown and Lovett (2001) in order to assess people's ability to learn to prefer long-term over short-term benefits. In that context, it has been dubbed a "Single Player Version of the Prisoner's Dilemma (PD)". However, the particular connotations of the PD game are of less relevance for the present work and might be confusing here. For this reason, we will simply be referring to the "Probabilistic Learning Task (PLT)". Figure 1 shows a schematic overview of the PLT, including its underlying rewards and constraints structure. Participants go through the task at the computer. They are presented with four options in the form of four closed doors, two red ones and two green ones (in the figure, the dark doors in the top row are both red). In the instructions, participants are given the following pieces of knowledge about the task ahead of them: They know that they are asked to make a series of choices, selecting one door at a time. They need a key to open a door of the same color, and they are given a red key at the beginning of the task. By opening a door, the current key will be given up. Upon opening a door, two things will happen: The participants may or may not receive a reward (5 "points"). Additionally, they will receive a new key that will be either red or green. This key will constrain their available options in the subsequent trial, because red keys open only red doors and green keys open only green doors. Finally, participants are told that the goal in this game is to gain as many points as possible.

Figure 1: Overview of the choices and their outcomes in the PLT. The outcomes are shown for clarification.
Participants have to learn them.

It may be noted that green doors have a lower probability of reward than red doors. However, the upper right red door (UR), which has a higher reward probability than the upper left red door (UL), gives a green key. To solve the game, participants have to learn that UL has the highest payoff probability in the long run, even though its immediate rewards are less probable than those of UR. The behavioral manifestation of having solved the game is to choose UL consistently, disregarding the other options. Before being able to solve the game, participants must learn the reward probabilities associated with each door and the association between doors and keys (the latter connection is deterministic). Participants complete a total of 200 trials in one session with this task.

The "Expected Utility Differences (EUDs)" in Two Versions of the Task
Given the constraint of the keys, there are three sequences of choices that can be repeated: choosing UL a number of times, choosing first UR and then BL, and choosing BR repeatedly. These three strategies can be evaluated in terms of the payoffs associated with the involved doors. In the PLT, the expected values of the strategies follow the general constraint

2*payoff(UL) > payoff(UR) + payoff(BL) > 2*payoff(BR).

The payoff probabilities of the doors can be manipulated to characterize different versions of the PLT, which differ in the extent of the "greater than" relation. This relation is called the "Expected Utility Difference" (EUD). Inserting the reward probabilities from Fig. 1 in the expression above and multiplying the outcome by five (because of the "points" that form the reward) yields

2*(0.6*5) > (0.8*5 + 0.2*5) > 2*(0.4*5), i.e., 6 > 5 > 4.

Therefore, the EUD in this scenario is 1 (in [arbitrary] expected-point units). The other version of the PLT we are interested in has the following probabilities of reward (the mapping of keys to doors is identical): p(reward UL) = 0.8, p(reward UR) = 0.9, p(reward BL) = 0.1, p(reward BR) = 0.2. This results in an EUD of 3. Brown and Lovett (in preparation) have found that participants find it very hard to learn how to solve the game when the EUD is 1, while many more participants are able to learn the solution when the EUD is 3. The onset of learning is also much earlier in that condition. Interestingly, this effect is independent of the specific probabilities that are used to form the EUDs (Brown & Lovett, 2001). This justifies our decision to regard the EUD 1 and 3 versions of the task as two truly distinct conditions, and to compare the behavior of the model to human behavior under exactly these two conditions.

Description of the Model
Like participants, our model for learning probabilities and making choices in the PLT starts out with very little knowledge. The only initial constraint on its selection is the required mapping between keys and doors of the same color. Like participants, it will make 200 choices, learning about the probabilistic reward structure of the system during the process of making the choices and from the feedback following them. At each "choice point", there are two available options (two doors matching the color of the key). The model selects the option that has the higher current evaluation.
The following three factors contribute to the overall evaluation of each option:
(1) The current estimate of the probability of receiving a reward upon opening that door.
(2) The current estimate of the value of the key that will be given by that door, weighed by a parameter relating the importance of future rewards to that of immediate rewards.
(3) A measure of information gain that expresses how much knowledge of the characteristics of all available options, i.e. of the system as a whole, will increase as a consequence of opening that door. This is weighed by a parameter that grades the importance of information gain.

A formal definition of the model's selection S at time t is

S_t = \max(E_{it}, E_{jt}),

where E_{it} and E_{jt} are the evaluations of the two available choices i and j at time t. The evaluation of both options is analogously computed as

E_{it} = \frac{successes_{it}}{successes_{it} + failures_{it}} + k \cdot \max\!\left(A_t;\ \frac{B_t + C_t}{2}\right) + c \cdot N_{it}^{-0.5}

The first term is the estimate of the probability of success of door i at time t. The second term is the value of the key given by the door, weighed by the parameter k. The value of a key is the estimate of the best future rewards that can be expected from having that key. This is either the current estimate of the repeatable cell whose door matches the current key (denoted by A_t), or the expected value of alternative sequences that start with the current key but will give a different key (denoted here by the average of the expected value of the door that can be opened with the current key and will itself yield a different key (B_t) and the current value of that different key (C_t)). The third term, finally, denotes the information gain associated with choosing door i at time t, weighed by the parameter c. This measure simply decreases as a power function of N_{it}, the number of times door i has already been chosen at point t. This expresses the assumption that we learn something about the system each time we make a choice (which is true, because we get feedback), but that this information shows marginally decreasing returns. The selection of a power function to represent this effect quantitatively is partly based on the fact that the posterior distributions of the reward probabilities associated with the four doors are beta distributions, the variance of which decreases as a power function of additional observations. This characteristic of the posterior probability distribution of an event again points to the crucial interaction between exploration and exploitation we are trying to capture here: the accuracy of the estimates, both of the reward probabilities and of the value of the keys, critically depends on sufficient exploration, more specifically on a sufficiently large number of observations. However, the impact of exploration does not remain constant, but instead decreases systematically: rapidly in the beginning, marginally later on.

The model presented is a learning model in the sense that all estimates are updated after each choice, as is the information gain measure associated with that choice. It does not use noise; the only sources of variability are changes in the estimates, which in turn are caused by the probabilistic feedback, and changes in the information gain measure, which leads the model to abandon familiar options and explore less familiar ones. The parameters in the model (k, c) are basically free parameters that can be adjusted to reflect a different impact of future rewards (k) or information gain (c), either under different circumstances or even between (simulated) participants.
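The selection rule can be sketched in code. The following is our own reconstruction, not the authors' spreadsheet implementation: the EUD = 3 reward probabilities and the deterministic key-to-door mapping come from the task description, while the handling of unexplored doors (N_it = 0) and the approximation of the other key's value by its repeatable door's estimate are our assumptions.

```python
import random

# Sketch of the evaluation rule E_it above (a reconstruction, not the
# authors' model). Door labels and the key->door mapping follow the PLT;
# reward probabilities are the EUD=3 condition.
REWARD_P = {"UL": 0.8, "UR": 0.9, "BL": 0.1, "BR": 0.2}
KEY_GIVEN = {"UL": "red", "UR": "green", "BL": "red", "BR": "green"}
DOORS = {"red": ["UL", "UR"], "green": ["BL", "BR"]}
REPEAT = {"red": "UL", "green": "BR"}   # door that returns its own key
SWITCH = {"red": "UR", "green": "BL"}   # door that yields the other key
K, C = 3.0, 4.0                         # parameter values used in the paper

def simulate(n_trials=200, seed=0):
    rng = random.Random(seed)
    succ = {d: 0 for d in REWARD_P}
    fail = {d: 0 for d in REWARD_P}

    def p_hat(d):                        # success estimate, initialized to 0.5
        n = succ[d] + fail[d]
        return succ[d] / n if n else 0.5

    def key_value(key):                  # max(A; (B + C)/2); C is approximated
        a = p_hat(REPEAT[key])           # by the other key's repeatable door
        other = KEY_GIVEN[SWITCH[key]]
        b, c_val = p_hat(SWITCH[key]), p_hat(REPEAT[other])
        return max(a, (b + c_val) / 2)

    def evaluate(d):                     # E_it = estimate + k*key + c*N^-0.5
        n = succ[d] + fail[d]
        bonus = C * n ** -0.5 if n else float("inf")  # unexplored doors first
        return p_hat(d) + K * key_value(KEY_GIVEN[d]) + bonus

    key, chose_left = "red", []
    for _ in range(n_trials):
        door = max(DOORS[key], key=evaluate)
        rewarded = rng.random() < REWARD_P[door]
        succ[door] += rewarded
        fail[door] += not rewarded
        chose_left.append(door in ("UL", "BL"))  # coding used for the data
        key = KEY_GIVEN[door]
    return chose_left
```

With both available doors unexplored, the N_it^{-0.5} bonus dominates, reproducing the "pure exploration" phase described next.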
However, in all simulations presented here, k and c have been kept constant throughout, with k = 3 and c = 4.

Some Basic Predictions and Mechanisms
As the relative impact of the two competing components of the evaluations changes with time, the learning and behavior of the model can be described as follows (note that we are not assuming discrete stages but continuous changes; the segmentation in the following paragraphs serves clarification):
(1) Pure exploration. Choices will be made based on the information gain measure, i.e. choices that have not yet been explored will be explored. When c = 4, the estimates of the actual payoffs (which are initialized to 0.5) are still too small to counteract this during the first few cycles (i.e. until a sufficient number of experiences has been accumulated).
(2) Early, inaccurate estimates. Because the measure of information gain decreases relatively rapidly in the beginning, the actual estimates of the reward probabilities and the key values begin to impact the model's choices. However, due to the probabilistic feedback, and the fact that they are still based on relatively few observations, the estimates might not reflect the true ranking of the options, particularly in conditions where EUD = 1. Two forces counteract these inaccurate representations. First, by choosing the currently best option repeatedly, the model gathers information that corrects its estimate. Now previously lower-ranked options can compete again for selection. Second, by exercising its bias towards the less explored options, it obtains a more accurate estimate of their payoff probabilities as well. Consequently, the model quickly recovers from initial false estimates. Now, the model has established estimates of the reward probabilities that are robust enough to remain unfazed by the occasional "failure" feedback. The model will then consistently choose the option with the highest estimated long-term payoff.
(3) Familiarity breeds contempt. Flexibility is maintained for a while beyond this point, because, as one option, or a combination of two options, is chosen repeatedly, its information gain measure decreases, while that of the other options remains constant. Thus, there is the chance that the model abandons an option again in favor of another, particularly if this competitor has a similar expected payoff. It is clear how appropriate this behavior is in the task described here, where the EUDs between choices are sometimes as close as 1 unit, and the chance to "accidentally" settle on the wrong solution is pronounced.
(4) Optimal choices with intermissions. Eventually, the model will learn to solve the game, because its estimates are becoming more and more accurate. However, because of the dynamics described in the previous section, the model will continue to explore alternative options. The intervals between these "exploration fits" also follow a power function: the number of experiences between "exploration fits" becomes much longer each time, until they are so widely spaced as to be without any relevance anymore.

We will examine how the model learns and behaves under the different EUD conditions of the task in the next section.

Comparison Between Model and Data
The data to which we compare the behavior of our model were collected by Brown and Lovett (in preparation). A total of 80 participants worked on the version of the PLT described in this paper (for related work using a deterministic version, see Brown & Lovett, 2001).
60 participants worked under the EUD 1 condition (participants were collapsed from different groups that used different sets of probabilities to form an EUD of 1, because the specific probabilities had no effect on behavior, as reported by Brown & Lovett, 2001). 20 participants worked under an EUD = 3 condition. The reward probabilities for both groups are given in Table 1.

Table 1: Reward probabilities in the EUD1 and EUD3 conditions.

            UL                UR                BL                BR
EUD = 1     p = 0.6/0.7/0.8   p = 0.8/0.9/0.9   p = 0.2/0.3/0.5   p = 0.4/0.5/0.6
EUD = 3     p = 0.8           p = 0.9           p = 0.1           p = 0.2

Participants' choices in both groups were recorded on a trial-by-trial basis. Choices of UL and BL, which are the choices that yield the "better key" (red), were coded as "1", to indicate subjects' attention to the keys, as opposed to the immediate rewards (which would favor UR and BR, respectively; choosing these options was recorded as "0"). Based on the binary raw data, the proportion of "choosing left" was computed for each of the 200 trials. Note that only a repeated choice of UL can lead to a proportion > 50%. An increase above that level is thus indicative of choosing the solution, UL, more often than not. Finally, the proportions were averaged over ten-trial blocks, resulting in 20 ten-trial blocks that depict the aggregated learning trajectory for all participants. Additionally, response latencies were obtained for all trials.

It was our goal to have the model produce data of the same format as the empirical data. To this effect, we implemented it in a spreadsheet and recorded each of the 200 choices it made per run, coding them the same way the human data were coded. We produced 3*20 model runs under the EUD1 condition, using the probabilities given in Table 1, and 60 model runs under the EUD3 condition, also using the same probabilities. The results of these model runs were aggregated in the same manner as the human data. The model (obviously) was constrained to open doors with the appropriate key in the same manner as humans were. Its choices were based on the evaluations elaborated in the preceding section.

We will first present and comment on the correspondence between model and data for both EUD conditions at the aggregate level. However, the quality of a model can also be assessed by determining how well its individual runs resemble the individual learning trajectories of participants, particularly when behavior is very variable, as is the case here. Therefore, we present some comparisons between individual participants and individual model runs that show that the model can produce a range of behavior consistent with that shown across the sample of participants. A note on the parameters: the values of k = 3 and c = 4 were chosen to obtain a good fit to the data in EUD1. They were kept constant for all other comparisons, including the ones on the individual level.

EUD1 and EUD3: Aggregate Learning Trajectories
It is easy to see how the behavior of the model differs under different EUD conditions. When the EUD is 1, i.e. very low, the model needs more trials to arrive at estimates that are accurate enough to warrant exploitation of one option. However, the small advantage associated with the winner, combined with the model's bias towards exploration, limits the stability of behavior under this condition. While the model will eventually converge to solving the game even under this condition, it is, like humans, often unable to do so within the 200 trials allotted in the experimental task described here.
However, if the EUD is as high as 3, learning is considerably sped up: the accuracy of the estimates increases faster, because the variability of the binary feedback is lower as the probability of success becomes more extreme (closer to 1 or 0), and the gap between the two options' estimates increases faster. Exploration continues to be beneficial under this condition, but is rarer, as the differences between the estimates are pronounced enough to lead to appropriate exploitation and to counteract the impact of the information gain measure.

Figure 2 shows the (aggregated) learning trajectories of data and model under the EUD1 and EUD3 conditions. The difference between the groups is captured nicely by the model, especially the logarithmically shaped learning curve in the EUD3 condition. Note that neither humans nor model arrive at 100% exploitation of UL under EUD3. This reflects (in the model) the response to the probabilistic feedback and the ensuing recurrent, brief "exploration bursts". Note particularly the downward dip in the final four 10-trial blocks that is shared by empirical and model data. In the model, this is the consequence of having chosen the solution, UL, a considerable number of times. As a consequence, its information gain measure has decreased sufficiently to allow the competitors a few more explorations: familiarity breeds contempt. Neither humans nor model show much learning under the EUD1 condition, for the reasons outlined above.

An even more powerful test of the validity of our model, in particular of our claim that a component of information gain is essential for understanding and predicting human behavior, is a comparison between the data and a version of our model that sets c = 0, thereby completely eliminating the component of information gain. The results of this comparison are shown in Fig. 3.

Figure 2: Model and data curves under the EUD1 and EUD3 conditions.

The "no-info-gain" model does not capture the learning under the EUD3 condition. Levels of "choosing left" remain constant during all trials for this model. Even more importantly, the "no-info-gain" model operates without noise, and therefore does not exhibit any variability over time. The fact that the proportion of choosing left remains below 100% for the no-info-gain model is an artifact of this: some model runs always choose UL from the beginning, others always choose UR-BL throughout all trials. No changes occur, because none of these options has a bad enough payoff to jolt the model out of its inertia. The same inflexible behavior holds for the EUD1 version of the no-info-gain model.

Figure 3: Predictions of a model without information gain.

This begs the question whether the addition of noise to the evaluation mechanism wouldn't have the same effect as the notion of explicit information gain. This is still an issue for future exploration, especially since there are two kinds of noise that can play a role here: the estimates themselves can be noisy, or the selection process that operates on them can involve noise. The subtle difference between these two concepts of noise, and possible integrations with the present model, are the objective of future work. Here, the following argument can be made against the use of noise and in favor of the notion of information gain proposed here: cognitive models of probability matching within the ACT-R framework have to assume an extremely high level of noise in order to capture the observation that there is still variability in participants' behavior after a large number of experiences.
We will show examples of this kind of behavior in the current task in the next section, and show how our model can reproduce it without assuming any noise. We will address this issue again in the discussion.

Individual Participants and Individual Model Runs
Another measure of a model's quality that goes beyond the comparison of average curves involves the inspection of individual model runs alongside individual subjects. Especially in tasks that cause high variability in behavior, this is interesting, because it enables us to inspect the flexibility of both model and humans. Therefore, we inspected whether we could identify individual model runs, in the set that fed the average curves shown in Fig. 2, that match the learning trajectories of individual subjects.

One characteristic of the model, both in the EUD1 and the EUD3 condition, is that it can exhibit relatively "gradual" learning. Essentially, the model settles on a "current best" based on the current estimates, and is drawn away from it again by increasingly correct estimates (which perhaps reveal that the option it has settled on is not that superior after all), as well as by the decrease in that option's information gain relative to that of the other options. It gradually converges towards an increased choice of the true best, in this case UL. Figure 4 shows an example of this, the model being matched to participant 126.

Figure 4.

We can see similar patterns of learning in individual model runs under the EUD1 condition, even though the model, on average, hardly learns. One striking and frequent pattern under the EUD1 condition, and one that we believe is hard to capture by a model that uses only noise as a variability inducer, is the complete abandonment and later re-uptake of the option UL. This pattern, which we call spikes (one of many examples is shown in Fig. 5), is due to two factors. Firstly, the quality of UL as the "best" choice is less clear in EUD1 than in EUD3, so its estimate will remain closer to that of its competitors, occasionally falling below them. Secondly, the advantage that UL might have over the other options in terms of expected payoff is not big enough to counter the fact that its information gain measure will decrease as it is chosen more often, falling below that of the competitors: familiarity breeds contempt. These two factors taken together model the fact that, under EUD1, the model can't establish sufficient "trust" in an option in order to exploit it: as it repeatedly chooses UL, its estimate remains mediocre; at the same time, the other options begin to seem more attractive again. If there is one option that has been explored particularly rarely, this option will promise a high information gain and will be chosen for a couple of trials, until it has been established that its current estimate is below that of UL and its information gain has decreased. The result is a "spike", as the one seen below. Fig. 6 shows one of the many examples of this pattern that can be found in the data and in the set of model runs. Here, one model run is matched to participant 293.

Figure 6.

Discussion
It has been our goal to demonstrate that information gain, not just actual payoffs, can drive decision making in a probabilistic environment. To this end we have created a model that learns to make choices in a probability learning task, choosing among options based on its estimates of their actual payoffs and the information gain associated with selecting each option.
Upon detecting that this model fits human behavior much better than a model that ignores information gain, we ask ourselves two questions: Is the behavior of such a model rational? And: Is the model correct? The answer to the first question is, in our opinion, a clear "yes". Exploration is appropriate in probabilistic environments, because it increases the accuracy of the probabilistic representations. The model, like humans, is able to adjust its amount of exploration to the structure of the probabilities, abandoning exploration early when they are easier to discriminate. Its need for exploration prevents the model from being stuck with the wrong choices, but this need is also systematically related to the structure of the environment, such that its impact will be smaller the fewer data are needed to arrive at reliable estimates.

But: Is the model as it stands now correct? This is unlikely. It has only recently been formulated and has only been tested with the datasets reported here. We regard the present version of the model as a skeleton containing the elements we believe are essential for explaining human choices and learning in a probabilistic situation. However, as the word "skeleton" suggests, augmentation is clearly called for. For instance, it is likely that the value of exploration, i.e. of information gain, varies from situation to situation. It is even more likely that humans themselves can adapt its importance to different situational demands. Essentially, this calls for modeling work regarding systematic changes in the c parameter. Another open question concerns the precise definition of information gain. Right now it is a "raw", content-independent indicator of how much we have learned by making a choice, and it always decreases according to the same function. This is a strong assumption, which must be tested empirically. It remains to be seen whether the monotonic decrease of information gain remains adequate to model behavior in situations with, e.g., non-stationary probabilities. Human adjustment to this additional complexity will certainly pose another challenge.

Finally, the view put forth in this paper is that choices in probabilistic environments can be influenced by the explicit, active wish to explore. This notion is partially at odds with models that only assume a noisy estimation, or a noisy choice process, in order to account for variability in behavior. A comparison between these two approaches can be resolved on two different levels: experiments can be designed in which the two models make clearly distinct predictions, and formal analyses of both models can be conducted to reveal how the two models might "fit" different situations, and under which circumstances their choices converge. These efforts might lead toward a more complete theory of how these two drives of exploration and exploitation might be interacting in driving human behavior, an endeavor of which the model reported here merely scratches the surface.

Acknowledgments
S.N. would like to thank Niels Taatgen (CMU, Department of Psychology) and Howard Seltman (CMU, Department of Statistics) for their extremely helpful comments on this work.

References
Anderson, J.R. & Lebiere, C. (1998). The Atomic Components of Thought. Mahwah, New Jersey: Erlbaum.
Broeder, A. (2000). Assessing the empirical validity of the "Take The Best" heuristic as a model of human probabilistic inference. Journal of Experimental Psychology: Learning, Memory and Cognition, 26(5), 1332-1346.
Brown, J.C. & Lovett, M.C. (2001).
The effect of reducing information in a modified Prisoner's Dilemma Game. In J.D. Moore & K. Stenning (Eds.), Proceedings of the 23rd Annual Meeting of the Cognitive Science Society (pp. 162-167). Mahwah, New Jersey: Erlbaum.
Brown, J.C. & Lovett, M.C. (in preparation). Learning to Choose in a Non-Deterministic, Single-Player Version of the Prisoner's Dilemma Game. Carnegie Mellon University, Pittsburgh, PA.
Estes, W.K. (1964). Probability learning. In A.W. Melton (Ed.), Categories of Human Learning. New York: Academic Press.
Gigerenzer, G. (1994). Why the distinction between single-event probabilities and frequencies is relevant for psychology (and vice versa). In G. Wright & P. Ayton (Eds.), Subjective Probability. New York: Wiley.
Payne, J.W., Bettman, J.R. & Johnson, E.J. (1993). The Adaptive Decision Maker. New York: Cambridge University Press.
Rieskamp, J. & Hoffrage, U. (2000). When do people use simple heuristics, and how can we tell? In G. Gigerenzer, P.M. Todd & the ABC Research Group, Simple Heuristics that Make Us Smart. New York: Oxford University Press.
Modeling wine preferences by data mining from physicochemical properties

Paulo Cortez (a,*), António Cerdeira (b), Fernando Almeida (b), Telmo Matos (b), José Reis (a,b)
(a) Department of Information Systems/R&D Centre Algoritmi, University of Minho, 4800-058 Guimarães, Portugal
(b) Viticulture Commission of the Vinho Verde region (CVRVV), 4050-501 Porto, Portugal

Abstract
We propose a data mining approach to predict human wine taste preferences that is based on easily available analytical tests at the certification step. A large dataset (when compared to other studies in this domain) is considered, with white and red vinho verde samples (from Portugal). Three regression techniques were applied, under a computationally efficient procedure that performs simultaneous variable and model selection. The support vector machine achieved promising results, outperforming the multiple regression and neural network methods. Such a model is useful to support the oenologist's wine tasting evaluations and to improve wine production. Furthermore, similar techniques can help in target marketing by modeling consumer tastes from niche markets.

Key words: Sensory preferences, Regression, Variable selection, Model selection, Support vector machines, Neural networks

Preprint submitted to Elsevier, 22 May 2009

1 Introduction
Once viewed as a luxury good, nowadays wine is increasingly enjoyed by a wider range of consumers. Portugal is a top ten wine exporting country with 3.17% of the market share in 2005 [11]. Exports of its vinho verde wine (from the northwest region) have increased by 36% from 1997 to 2007 [8]. To support its growth, the wine industry is investing in new technologies for both wine making and selling processes. Wine certification and quality assessment are key elements within this context. Certification prevents the illegal adulteration of wines (to safeguard human health) and assures quality for the wine market. Quality evaluation is often part of the certification process and can be used to improve wine making (by identifying the most influential factors) and to stratify wines such as premium brands (useful for setting prices).

Wine certification is generally assessed by physicochemical and sensory tests [10]. Physicochemical laboratory tests routinely used to characterize wine include determination of density, alcohol or pH values, while sensory tests rely mainly on human experts. It should be stressed that taste is the least understood of the human senses [25], thus wine classification is a difficult task.
Moreover, the relationships between the physicochemical and sensory analysis are complex and still not fully understood [20].

Advances in information technologies have made it possible to collect, store and process massive, often highly complex datasets. All this data holds valuable information such as trends and patterns, which can be used to improve decision making and optimize chances of success [28].

* Corresponding author. E-mail: pcortez@dsi.uminho.pt; tel.: +351 253510313; fax: +351 253510300.

Data mining (DM) techniques [33] aim at extracting high-level knowledge from raw data. There are several DM algorithms, each one with its own advantages. When modeling continuous data, the linear/multiple regression (MR) is the classic approach. The backpropagation algorithm was first introduced in 1974 [32] and later popularized in 1986 [23]. Since then, neural networks (NNs) have become increasingly used. More recently, support vector machines (SVMs) have also been proposed [4][26]. Due to their higher flexibility and nonlinear learning capabilities, both NNs and SVMs are gaining attention within the DM field, often attaining high predictive performances [16][17]. SVMs present theoretical advantages over NNs, such as the absence of local minima in the learning phase. In effect, the SVM was recently considered one of the most influential DM algorithms [34]. While the MR model is easier to interpret, it is still possible to extract knowledge from NNs and SVMs, given in terms of input variable importance [18][7].

When applying these DM methods, variable and model selection are critical issues. Variable selection [14] is useful to discard irrelevant inputs, leading to simpler models that are easier to interpret and that usually give better performances. Complex models may overfit the data, losing the capability to generalize, while a model that is too simple will present limited learning capabilities. Indeed, both NN and SVM have hyperparameters that need to be adjusted [16], such as the number of NN hidden nodes or the SVM kernel parameter, in order to get good predictive accuracy (see Section 2.3).

The use of decision support systems by the wine industry is mainly focused on the wine production phase [12]. Despite the potential of DM techniques to predict wine quality based on physicochemical data, their use is rather scarce and mostly considers small datasets. For example, in 1991 the "Wine" dataset was donated to the UCI repository [1]. The data contain 178 examples with measurements of 13 chemical constituents (e.g. alcohol, Mg) and the goal is to classify three cultivars from Italy. This dataset is very easy to discriminate and has been mainly used as a benchmark for new DM classifiers. In 1997 [27], a NN fed with 15 input variables (e.g. Zn and Mg levels) was used to predict six geographic wine origins. The data included 170 samples from Germany and a 100% predictive rate was reported. In 2001 [30], NNs were used to classify three sensory attributes (e.g. sweetness) of Californian wine, based on grape maturity levels and chemical analysis (e.g. titrable acidity). Only 36 examples were used and a 6% error was achieved. Several physicochemical parameters (e.g. alcohol, density) were used in [20] to characterize 56 samples of Italian wine. Yet, the authors argued that mapping these parameters with a sensory taste panel is a very difficult task and instead they used a NN fed with data taken from an electronic tongue. More recently, mineral characterization (e.g. Zn and Mg) was used to discriminate 54 samples into two red wine classes [21]. A probabilistic NN was adopted, attaining 95% accuracy. As a powerful learning tool, SVM has
outperformed NN in several applications, such as predicting meat preferences [7]. Yet, in the field of wine quality only one application has been reported, where spectral measurements from 147 bottles were successfully used to predict 3 categories of rice wine age [35].

In this paper, we present a case study for modeling taste preferences based on analytical data that are easily available at the wine certification step. Building such a model is valuable not only for certification entities but also for wine producers and even consumers. It can be used to support the oenologist's wine evaluations, potentially improving the quality and speed of their decisions. Moreover, measuring the impact of the physicochemical tests on the final wine quality is useful for improving the production process. Furthermore, it can help in target marketing [24], i.e. by applying similar techniques to model the consumer preferences of niche and/or profitable markets.

The main contributions of this work are:
• We present a novel method that performs simultaneous variable and model selection for NN and SVM techniques. The variable selection is based on sensitivity analysis [18], which is a computationally efficient method that measures input relevance and guides the variable selection process. Also, we propose a parsimony search method to select the best SVM kernel parameter with a low computational effort.
• We test this approach in a real-world application, the prediction of vinho verde wine (from the Minho region of Portugal) taste preferences, showing its impact in this domain. In contrast with previous studies, a large dataset is considered, with a total of 4898 white and 1599 red samples. Wine preferences are modeled under a regression approach, which preserves the order of the grades, and we show how the definition of the tolerance concept is useful for accessing different performance levels. We believe that this integrated approach is valuable to support applications where ranked sensory preferences are required, for example in wine or meat quality assurance.

The paper is organized as follows: Section 2 presents the wine data, DM models and variable selection approach; in Section 3, the experimental design is described and the obtained results are analyzed; finally, conclusions are drawn in Section 4.

2 Materials and methods

2.1 Wine data
This study will consider vinho verde, a unique product from the Minho (northwest) region of Portugal. Medium in alcohol, it is particularly appreciated due to its freshness (especially in the summer). This wine accounts for 15% of the total Portuguese production [8], and around 10% is exported, mostly white wine. In this work, we will analyze the two most common variants, white and red (rosé is also produced), from the demarcated region of vinho verde. The data were collected from May/2004 to February/2007 using only protected designation of origin samples that were tested at the official certification entity (CVRVV). The CVRVV is an inter-professional organization with the goal of improving the quality and marketing of vinho verde. The data were recorded by a computerized system (iLab), which automatically manages the process of wine sample testing from producer requests to laboratory and sensory analysis. Each entry denotes a given test (analytical or sensory) and the final database was exported into a single sheet (.csv).

During the preprocessing stage, the database was transformed in order to include a distinct wine sample (with all tests) per row. To avoid discarding examples, only the most common physicochemical tests were selected. Since the red and white tastes
are quite different, the analysis will be performed separately; thus two datasets were built, with 1599 red and 4898 white examples. (The datasets are available at: http://www3.dsi.uminho.pt/pcortez/wine/) Table 1 presents the physicochemical statistics per dataset. Regarding the preferences, each sample was evaluated by a minimum of three sensory assessors (using blind tastes), who graded the wine on a scale that ranges from 0 (very bad) to 10 (excellent). The final sensory score is given by the median of these evaluations. Fig. 1 plots the histograms of the target variables, denoting a typical normal-shape distribution (i.e. with more normal grades than extreme ones).

[insert Table 1 and Fig. 1 around here]

2.2 Data mining approach and evaluation
We will adopt a regression approach, which preserves the order of the preferences. For instance, if the true grade is 3, then a model that predicts 4 is better than one that predicts 7. A regression dataset D is made up of k ∈ {1,...,N} examples, each mapping an input vector with I input variables (x_{k1},...,x_{kI}) to a given target y_k. The regression performance is commonly measured by an error metric, such as the mean absolute deviation (MAD) [33]:

MAD = \sum_{i=1}^{N} |y_i - \hat{y}_i| / N \qquad (1)

where \hat{y}_k is the predicted value for the k-th input pattern. The regression error characteristic (REC) curve [2] is also used to compare regression models, with the ideal model presenting an area of 1.0. The curve plots the absolute error tolerance T (x-axis) versus the percentage of points correctly predicted (the accuracy) within the tolerance (y-axis).

The confusion matrix is often used for classification analysis, where a C×C matrix (C is the number of classes) is created by matching the predicted values (in columns) with the desired classes (in rows). For an ordered output, the predicted class is given by p_i = y_i if |y_i - \hat{y}_i| ≤ T, else p_i = y'_i, where y'_i denotes the closest class to \hat{y}_i, given that y'_i ≠ y_i. From the matrix, several metrics can be used to access the overall classification performance, such as the accuracy and precision (i.e. the predicted column accuracies) [33].

The holdout validation is commonly used to estimate the generalization capability of a model [19]. This method randomly partitions the data into training and test subsets. The former subset is used to fit the model (typically with 2/3 of the data), while the latter (with the remaining 1/3) is used to compute the estimate. A more robust estimation procedure is the k-fold cross-validation [9], where the data is divided into k partitions of equal size. One subset is tested each time and the remaining data are used for fitting the model. The process is repeated sequentially until all subsets have been tested. Therefore, under this scheme, all data are used for training and testing. However, this method requires around k times more computation, since k models are fitted.

2.3 Data mining methods
We will adopt the most common NN type, the multilayer perceptron, where neurons are grouped into layers and connected by feedforward links [3]. For regression tasks, this NN architecture is often based on one hidden layer of H hidden nodes with a logistic activation and one output node with a linear function [16]:

\hat{y} = w_{o,0} + \sum_{j=I+1}^{o-1} \frac{1}{1 + \exp(-\sum_{i=1}^{I} x_i w_{j,i} - w_{j,0})} \cdot w_{o,j} \qquad (2)

where w_{i,j} denotes the weight of the connection from node j to i, and o the output node. The performance is sensitive to the topology choice (H). A NN with H = 0 is equivalent to the MR model. By increasing H, more complex mappings can be performed, yet an excessive value of H will overfit the data, leading to generalization loss.
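Read literally, Eq. (2) is a one-hidden-layer perceptron with logistic hidden units and a linear output. A minimal sketch with made-up weights (our illustration; the actual networks were trained with the BFGS algorithm of the nnet R package, as described in Section 3, and the node numbering below merely follows the equation's convention):

```python
import math

def nn_predict(x, w, H):
    """One-hidden-layer MLP of Eq. (2): logistic hidden nodes, linear output.

    x: list of I standardized inputs; w: dict mapping (to_node, from_node)
    to a weight, where node 0 is the bias, nodes 1..I are the inputs,
    nodes I+1..I+H are the hidden units and node o = I+H+1 is the output.
    """
    I = len(x)
    o = I + H + 1
    y = w[(o, 0)]                                  # output bias w_{o,0}
    for j in range(I + 1, o):                      # hidden nodes j = I+1..o-1
        s = sum(x[i - 1] * w[(j, i)] for i in range(1, I + 1)) + w[(j, 0)]
        y += w[(o, j)] / (1 + math.exp(-s))        # logistic activation
    return y

# Tiny example: I = 2 inputs, H = 1 hidden node, arbitrary weights.
w = {(3, 0): 0.1, (3, 1): 0.5, (3, 2): -0.4,       # hidden node 3
     (4, 0): 0.2, (4, 3): 1.3}                      # output node 4
print(nn_predict([0.7, -1.2], w, H=1))
```

Setting H = 0 reduces the loop to nothing and leaves a linear model, which is why the paper treats the H = 0 network as equivalent to MR.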
A computationally efficient method to set H is to search through the range {0, 1, 2, 3, ..., H_max} (i.e. from the simplest NN to more complex ones). For each H value, a NN is trained and its generalization estimate is measured (e.g. over a validation sample). The process is stopped when the generalization decreases or when H reaches the maximum value (H_max).

In SVM regression [26], the input x ∈ R^I is transformed into a high m-dimensional feature space, by using a nonlinear mapping (φ) that does not need to be explicitly known but that depends on a kernel function (K). The aim of a SVM is to find the best linear separating hyperplane, tolerating a small error (ε) when fitting the data, in the feature space:

\hat{y} = w_0 + \sum_{i=1}^{m} w_i \phi_i(x) \qquad (3)

The ε-insensitive loss function sets an insensitive tube around the residuals and the tiny errors within the tube are discarded (Fig. 2).

[insert Fig. 2 around here]

We will adopt the popular gaussian kernel, which presents fewer parameters than other kernels (e.g. polynomial) [31]: K(x, x') = exp(-γ ||x - x'||²), γ > 0. Under this setup, the SVM performance is affected by three parameters: γ, ε and C (a trade-off between fitting the errors and the flatness of the mapping). To reduce the search space, the first two values will be set using the heuristics [5]: C = 3 (for a standardized output) and ε = \hat{σ}/\sqrt{N}, where \hat{σ} = 1.5/N × \sum_{i=1}^{N} (y_i - \hat{y}_i)² and \hat{y} is the value predicted by a 3-nearest-neighbor algorithm. The kernel parameter (γ) produces the highest impact on the SVM performance, with values that are too large or too small leading to poor predictions. A practical method to set γ is to start the search from one of the extremes and then search towards the middle of the range while the predictive estimate increases [31].

2.4 Variable and Model Selection
Sensitivity analysis [18] is a simple procedure that is applied after the training phase and analyzes the model responses when the inputs are changed. Originally proposed for NNs, this sensitivity method can also be applied to other algorithms, such as SVM [7]. Let \hat{y}_{aj} denote the output obtained by holding all input variables at their average values except x_a, which varies through its entire range with j ∈ {1,...,L} levels. If a given input variable (x_a ∈ {x_1,...,x_I}) is relevant then it should produce a high variance (V_a). Thus, its relative importance (R_a) can be given by:

V_a = \sum_{j=1}^{L} (\hat{y}_{aj} - \overline{\hat{y}}_a)^2 / (L - 1) \qquad (4)
R_a = V_a / \sum_{i=1}^{I} V_i \times 100\ (\%)

In this work, the R_a values will be used to measure the importance of the inputs and also to discard irrelevant inputs, guiding the variable selection algorithm. We will adopt the popular backward selection, which starts with all variables and iteratively deletes one input until a stopping criterion is met [14]. Yet, we guide the variable deletion (at each step) by the sensitivity analysis, in a variant that allows a reduction of the computational effort by a factor of I (when compared to the standard backward procedure) and that in [18] has outperformed other methods (e.g. backward and genetic algorithms).
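Equation (4) is straightforward to compute. The sketch below is our illustration (the `predict` callable, the input ranges, and the toy check are assumptions, not the paper's code); it scans each input over L = 5 levels while holding the others at their averages, as done in Section 3:

```python
import numpy as np

def relative_importance(predict, x_mean, ranges, L=5):
    """Sensitivity analysis of Eq. (4).

    predict: callable mapping an input vector to a prediction;
    x_mean: average value of each of the I inputs;
    ranges: (min, max) per input, scanned with L levels.
    Returns R_a in percent for every input a.
    """
    I = len(x_mean)
    V = np.zeros(I)
    for a in range(I):
        levels = np.linspace(ranges[a][0], ranges[a][1], L)
        y_a = []
        for v in levels:                 # vary x_a, hold the rest at the mean
            x = np.array(x_mean, dtype=float)
            x[a] = v
            y_a.append(predict(x))
        y_a = np.array(y_a)
        V[a] = np.sum((y_a - y_a.mean()) ** 2) / (L - 1)   # variance V_a
    return 100 * V / V.sum()             # R_a = V_a / sum_i V_i * 100%

# Toy check with a known function: x0 should dominate.
f = lambda x: 3 * x[0] + 0.5 * x[1]
print(relative_importance(f, x_mean=[0.0, 0.0], ranges=[(-1, 1), (-1, 1)]))
```

In the guided backward selection, the input with the smallest R_a is the one deleted at each iteration, which avoids refitting I candidate models per step.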
Similarly to [36], the variable and model selection will be performed simultaneously, i.e. in each backward iteration several models are searched, with the one that presents the best generalization estimate selected. For a given DM method, the overall procedure is depicted below:
(1) Start with all F = {x_1,...,x_I} input variables.
(2) If there is a hyperparameter P ∈ {P_1,...,P_k} to tune (e.g. for NN or SVM), start with P_1 and go through the remaining range until the generalization estimate decreases. Compute the generalization estimate of the model by using an internal validation method. For instance, if the holdout method is used, the available data are further split into training (to fit the model) and validation sets (to get the predictive estimate).
(3) After fitting the model, compute the relative importances (R_i) of all x_i ∈ F variables and delete from F the least relevant input. Go to step 4 if the stopping criterion is met, otherwise return to step 2.
(4) Select the best F (and P in case of NN or SVM) values, i.e., the input variables and model that provide the best predictive estimates. Finally, retrain this configuration with all available data.

3 Empirical results
The R environment [22] is an open source, multiple platform (e.g. Windows, Linux) and high-level matrix programming language for statistical and data analysis. All experiments reported in this work were written in R and conducted on a Linux server with an Intel dual core processor. In particular, we adopted the RMiner [6], a library for the R tool that facilitates the use of DM techniques in classification and regression tasks.

Before fitting the models, the data was first standardized to a zero mean and one standard deviation [16]. RMiner uses the efficient BFGS algorithm to train the NNs (nnet R package), while the SVM fit is based on the Sequential Minimal Optimization implementation provided by LIBSVM (kernlab package). We adopted the default R suggestions [29]. The only exceptions are the hyperparameters (H and γ), which will be set using the procedure described in the previous section, with the search ranges H ∈ {0, 1, ..., 11} [36] and γ ∈ {2^3, 2^1, ..., 2^{-15}} [31]. While the maximum number of searches is 12/10, in practice the parsimony approach (step 2 of Section 2.4) will reduce this number substantially.

Regarding the variable selection, we set the estimation metric to the MAD value (Equation 1), as advised in [31]. To reduce the computational effort, we adopted the simpler 2/3 and 1/3 holdout split as the internal validation method. The sensitivity analysis parameter was set to L = 5, i.e. x_a ∈ {-1.0, -0.5, ..., 1.0} for a standardized input. As a reasonable balance between the pressure towards simpler models and the increase of the computational search, the stopping criterion was set to 2 iterations without any improvement or when only one input is available.

To evaluate the selected models, we adopted 20 runs of the more robust 5-fold cross-validation, in a total of 20×5 = 100 experiments for each tested configuration. Statistical confidence will be given by the t-student test at the 95% confidence level [13].
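The two headline metrics are simple to reproduce. A minimal sketch (our illustration with invented grade vectors; the paper's RMiner library is not reproduced here) of the MAD of Eq. (1) and of the accuracy within a tolerance T, i.e. one point of the REC curve:

```python
import numpy as np

def mad(y, y_hat):
    # Mean absolute deviation, Eq. (1).
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)))

def accuracy_within(y, y_hat, T):
    # Fraction of predictions whose absolute error is at most T,
    # i.e. one point of the REC curve at tolerance T.
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)) <= T)

y_true = [5, 6, 6, 7, 4]            # sensory grades (median of >= 3 assessors)
y_pred = [5.2, 5.6, 6.9, 6.8, 4.4]  # hypothetical model outputs
print(mad(y_true, y_pred))
for T in (0.25, 0.5, 1.0):          # the tolerances reported in Table 2
    print(T, accuracy_within(y_true, y_pred, T))
```

Sweeping T from 0 upward and plotting `accuracy_within` against T traces the full REC curve used in Fig. 3.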
The results are summarized in Table 2. The test set errors are shown in terms of the mean and confidence intervals. Three metrics are present: MAD, the classification accuracy for different tolerances (i.e. T = 0.25, 0.5 and 1.0) and Kappa (T = 0.5). The selected models are described in terms of the average number of inputs (I) and hyperparameter value (H or γ). The last row shows the total computational time required in seconds.

[insert Table 2 and Fig. 3 around here]

For both tasks and all error metrics, the SVM is the best choice. The differences are higher for small tolerances and in particular for the white wine (e.g. for T = 0.25, the SVM accuracy is almost two times better when compared to the other methods). This effect is clearly visible when plotting the full REC curves (Fig. 3). The Kappa statistic [33] measures the accuracy when compared with a random classifier (which presents a Kappa value of 0%). The higher the statistic, the more accurate the result. The most practical tolerance values are T = 0.5 and T = 1.0. The former tolerance rounds the regression response into the nearest class, while the latter accepts a response that is correct within one of the two closest classes (e.g. a 3.1 value can be interpreted as grade 3 or 4 but not 2 or 5). For T = 0.5, the SVM accuracy improvement is 3.3 pp for red wine (6.2 pp for Kappa), a value that increases to 12.0 pp for the white task (20.4 pp for Kappa). The NN is quite similar to MR in the red wine modeling, thus similar performances were achieved. For the white data, a more complex NN model (H = 2.1) was selected, slightly outperforming the MR results. Regarding the variable selection, the average number of deleted inputs ranges from 0.9 to 1.8, showing that most of the physicochemical tests used are relevant. In terms of computational effort, the SVM is the most expensive method, particularly for the larger white dataset.

A detailed analysis of the SVM classification results is presented by the average confusion matrixes for T = 0.5 (Table 3). To simplify the visualization, the 3 and 9 grade predictions were omitted, since these were always empty. Most of the values are close to the diagonals (in bold), denoting a good fit by the model. The true predictive accuracy for each class is given by the precision metric (e.g. for grade 4 and white wine, precision_{T=0.5} = 19/(19+7+4) = 63.3%). This statistic is important in practice, since in a real deployment setting the actual values are unknown and all predictions within a given column would be treated the same. For a tolerance of 0.5, the SVM red wine accuracies are around 57.7 to 67.5% in the intermediate grades (5 to 7) and very low (0%/20%) for the extreme classes (3, 8 and 4), which are less frequent (Fig. 1). In general, the white data results are better: 60.3/63.3% for classes 6 and 4, 67.8/72.6% for grades 7 and 5, and a surprising 85.5% for class 8 (the exceptions are the 3 and 9 extremes with 0%, not shown in the table). When the tolerance is increased (T = 1.0), high accuracies ranging from 81.9 to 100% are attained for both wine types and classes 4 to 8.

[insert Table 3 and Fig. 4 around here]

The average SVM relative importance plots (R_a values) of the analytical tests are shown in Fig. 4. It should be noted that all 11 inputs are shown, since in each simulation different sets of variables can be selected. In several cases, the obtained results confirm the oenological theory. For instance, an increase in the alcohol (4th and 2nd most relevant factor) tends to result in a higher quality wine. Also, the rankings are different within each wine type.
For instance, the citric acid and residual sugar levels are more important in white wine, where the equilibrium between the freshness and sweet taste is more appreciated. Moreover, the volatile acidity has a negative impact, since acetic acid is the key ingredient in vinegar. The most intriguing result is the high importance of sulphates, ranked first for both cases. Oenologically this result could be very interesting. An increase in sulphates might be related to the fermenting nutrition, which is very important to improve the wine aroma.

4 Conclusions and implications

In recent years, the interest in wine has increased, leading to growth of the wine industry. As a consequence, companies are investing in new technologies to improve wine production and selling. Quality certification is a crucial step for both processes and is currently largely dependent on wine tasting by human experts. This work aims at the prediction of wine preferences from objective analytical tests that are available at the certification step. A large dataset (with 4898 white and 1599 red entries) was considered, including vinho verde samples from the northwest region of Portugal. This case study was addressed by two regression tasks, where each wine type preference is modeled in a continuous scale, from 0 (very bad) to 10 (excellent). This approach preserves the order of the classes, allowing the evaluation of distinct accuracies, according to the degree of error tolerance (T) that is accepted.

Due to advances in the data mining (DM) field, it is possible to extract knowledge from raw data. Indeed, powerful techniques such as neural networks (NNs) and, more recently, support vector machines (SVMs) are emerging. While being more flexible models (i.e. no a priori restriction is imposed), their performance depends on a correct setting of hyperparameters (e.g. number of hidden nodes of the NN architecture or the SVM kernel parameter). On the other hand, multiple regression (MR) is easier to interpret than NN/SVM, with most NN/SVM applications considering their models as black boxes. Another relevant aspect is variable selection, which leads to simpler models while often improving the predictive performance. In this study, we present an integrated and computationally efficient approach to deal with these issues.
Sensitivity analysis is used to extract knowledge from the NN/SVM models, given in terms of relative importance of the inputs. A simultaneous variable and model selection scheme is also proposed, where the variable selection is guided by sensitivity analysis and the model selection is based on a parsimony search that starts from a reasonable value and is stopped when the generalization estimate decreases.

Encouraging results were achieved, with the SVM model providing the best performances, outperforming the NN and MR techniques, particularly for white vinho verde wine, which is the most common type. When admitting only the correctly classified classes (T = 0.5), the overall accuracies are 62.4% (red) and 64.6% (white). It should be noted that the datasets contain six/seven classes (from 3 to 8/9). These accuracies are much better than the ones expected by a random classifier. The performance is substantially improved when the tolerance is set to accept responses that are correct within one of the two nearest classes (T = 1.0), obtaining a global accuracy of 89.0% (red) and 86.8% (white). In particular, for both tasks the majority of the classes present an individual accuracy (precision) higher than 90%.

The superiority of SVM over NN is probably due to the differences in the training phase. The SVM algorithm guarantees an optimum fit, while NN training may fall into a local minimum. Also, the SVM cost function (Fig. 2) gives a linear penalty to large errors. In contrast, the NN algorithm minimizes the sum of squared errors. Thus, the SVM is expected to be less sensitive to outliers, and this effect results in a higher accuracy for low error tolerances. As argued in [15], it is difficult to compare DM methods in a fair way, with data analysts tending to favor models that they know better. We adopted the default suggestions of the R tool [29], except for the hyperparameters (which were set using a grid search). Since the default settings are more commonly used, this seems a reasonable assumption for the comparison. Nevertheless, different NN results could be achieved if different hidden node and/or minimization cost functions were used. Under the tested setup, the SVM algorithm provided the best results while requiring more computation. Yet, the SVM fitting can still be achieved within a reasonable time with current processors. For example, one run of the 5-fold cross-validation testing takes around 26 minutes for the larger white dataset, which covers a three-year collection period.

The result of this work is important for the wine industry. At the certification phase and by Portuguese law, the sensory analysis has to be performed by human tasters. Yet, the evaluations are based on the experience and knowledge of the experts, which are prone to subjective factors. The proposed data-driven approach is based on objective tests and thus it can be integrated into a decision support system, aiding the speed and quality of the oenologist performance. For instance, the expert could repeat the tasting only if her/his grade is far from the one predicted by the DM model. In effect, within this domain the T = 1.0 distance is accepted as a good quality control process and, as shown in this study, high accuracies were achieved for this tolerance. The model could also be used to improve the training of oenology students. Furthermore, the relative importance of the inputs brought interesting insights regarding the impact of the analytical tests. Since some variables can be controlled in the production process, this information can be used to improve the wine quality.
For instance, alcohol concentration can be increased or decreased by monitoring the grape sugar concentration prior to the harvest. Also, the residual sugar in wine could be raised by suspending the sugar fermentation carried out by yeasts. Moreover, the volatile acidity produced during the malo-lactic fermentation in red wine depends on the lactic bacteria control activity.
Gaussian Mixture Models and Probabilistic Decision-Based Neural Networks for Pattern Classification
Keywords: Gaussian mixture models; probabilistic decision-based neural networks; pattern classification; EM algorithm.
1 Introduction
Pattern classification is the partitioning of a feature space into a number of decision regions. One would like to have perfect partitioning so that none of the decisions is wrong. If there are overlaps between classes, it becomes necessary to minimize the probability of misclassification errors or the average cost of errors. One approach to minimizing the errors is to apply the Bayes' decision rule [1]. The Bayesian approach, however, requires the class-conditional probability density to be accurately estimated. In recent years, the application of semi-parametric methods [2], [3] to estimating probability density functions has attracted a great deal of attention. For example, Traven [3] proposed a method, called Gaussian clustering, to estimate the parameters of a Gaussian mixture distribution. The method uses a stochastic gradient descent procedure to find the maximum likelihood estimates of a finite mixture distribution. Cwik and Koronacki [4] extended Traven's work so that no constraints on the covariance structure of the mixture components were imposed. Due to the capability of Gaussian mixtures to model arbitrary densities, Gaussian mixture models (GMMs) have been used in various problem domains such as pattern classification [5] and cluster analysis [6]. There have been several neural network approaches to statistical pattern classification (for a review, see [2]). One reason for their popularity is that the outputs of multi-layer neural networks are found to be the estimates of the Bayesian a posteriori probabilities.
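To make the GMM-plus-Bayes-rule idea concrete, here is a hedged sketch using scikit-learn's EM-based `GaussianMixture` as a modern stand-in (the surveyed methods fit their mixtures with their own estimators):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_conditional_gmms(X, y, n_components=3):
    # One Gaussian mixture per class, fit by EM, as a semi-parametric
    # estimate of the class-conditional density p(x | class).
    X, y = np.asarray(X), np.asarray(y)
    gmms, priors = {}, {}
    for c in np.unique(y):
        gmms[c] = GaussianMixture(n_components=n_components,
                                  covariance_type="full").fit(X[y == c])
        priors[c] = float(np.mean(y == c))
    return gmms, priors

def bayes_predict(gmms, priors, X):
    # Bayes' decision rule: choose the class maximizing log p(x|c) + log P(c).
    classes = sorted(gmms)
    scores = np.column_stack([gmms[c].score_samples(X) + np.log(priors[c])
                              for c in classes])
    return np.asarray(classes)[np.argmax(scores, axis=1)]
```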
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

John Lafferty, Andrew McCallum, Fernando Pereira
WhizBang! Labs–Research, 4616 Henry Street, Pittsburgh, PA 15213 USA
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA
Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104 USA

Abstract
We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.

1. Introduction

The need to segment and label sequences arises in many different problems in several scientific fields. Hidden Markov models (HMMs) and stochastic grammars are well understood and widely used probabilistic models for such problems. In computational biology, HMMs and stochastic grammars have been successfully used to align biological sequences, find sequences homologous to a known evolutionary family, and analyze RNA secondary structure (Durbin et al., 1998). In computational linguistics and computer science, HMMs and stochastic grammars have been applied to a wide variety of problems in text and speech processing, including topic segmentation, part-of-speech (POS) tagging, information extraction, and syntactic disambiguation (Manning & Schütze, 1999).

HMMs and stochastic grammars are generative models, assigning a joint probability to paired observation and label sequences; the parameters are typically trained to maximize the joint likelihood of training examples. To define a joint probability over observation and label sequences, a generative model needs to enumerate all possible observation sequences, typically requiring a representation in which observations are task-appropriate atomic entities, such as words or nucleotides. In particular, it is not practical to represent multiple interacting features or long-range dependencies of the observations, since the inference problem for such models is intractable.

This difficulty is one of the main motivations for looking at conditional models as an alternative. A conditional model specifies the probabilities of possible label sequences given an observation sequence. Therefore, it does not expend modeling effort on the observations, which at test time are fixed anyway. Furthermore, the conditional probability of the label sequence can depend on arbitrary, non-independent features of the observation sequence without forcing the model to account for the distribution of those dependencies. The chosen features may represent attributes at different levels of granularity of the same observations (for example, words and characters in English text), or aggregate properties of the observation sequence (for instance, text layout). The probability of a transition between labels may depend not only on the current observation, but also on past and future observations, if available. In contrast, generative models must make very strict independence assumptions on the observations, for instance conditional independence given the labels, to achieve tractability.
Maximum entropy Markov models (MEMMs) are conditional probabilistic sequence models that attain all of the above advantages (McCallum et al., 2000). In MEMMs, each source state¹ has an exponential model that takes the observation features as input, and outputs a distribution over possible next states. These exponential models are trained by an appropriate iterative scaling method in the maximum entropy framework. Previously published experimental results show MEMMs increasing recall and doubling precision relative to HMMs in a FAQ segmentation task.

¹ Output labels are associated with states; it is possible for several states to have the same label, but for simplicity in the rest of this paper we assume a one-to-one correspondence.

MEMMs and other non-generative finite-state models based on next-state classifiers, such as discriminative Markov models (Bottou, 1991), share a weakness we call here the label bias problem: the transitions leaving a given state compete only against each other, rather than against all other transitions in the model. In probabilistic terms, transition scores are the conditional probabilities of possible next states given the current state and the observation sequence. This per-state normalization of transition scores implies a "conservation of score mass" (Bottou, 1991) whereby all the mass that arrives at a state must be distributed among the possible successor states. An observation can affect which destination states get the mass, but not how much total mass to pass on. This causes a bias toward states with fewer outgoing transitions. In the extreme case, a state with a single outgoing transition effectively ignores the observation. In those cases, unlike in HMMs, Viterbi decoding cannot downgrade a branch based on observations after the branch point, and models with state-transition structures that have sparsely connected chains of states are not properly handled. The Markovian assumptions in MEMMs and similar state-conditional models insulate decisions at one state from future decisions in a way that does not match the actual dependencies between consecutive states.

This paper introduces conditional random fields (CRFs), a sequence modeling framework that has all the advantages of MEMMs but also solves the label bias problem in a principled way. The critical difference between CRFs and MEMMs is that a MEMM uses per-state exponential models for the conditional probabilities of next states given the current state, while a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. Therefore, the weights of different features at different states can be traded off against each other.

We can also think of a CRF as a finite state model with unnormalized transition probabilities. However, unlike some other weighted finite-state approaches (LeCun et al., 1998), CRFs assign a well-defined probability distribution over possible labelings, trained by maximum likelihood or MAP estimation. Furthermore, the loss function is convex,² guaranteeing convergence to the global optimum. CRFs also generalize easily to analogues of stochastic context-free grammars that would be useful in such problems as RNA secondary structure prediction and natural language processing.

² In the case of fully observable states, as we are discussing here; if several states have the same label, the usual local maxima of Baum-Welch arise.

Figure 1. Label bias example, after (Bottou, 1991). For conciseness, we place observation-label pairs on
transitions rather than states; a special symbol represents the null output label.

We present the model, describe two training procedures and sketch a proof of convergence. We also give experimental results on synthetic data showing that CRFs solve the classical version of the label bias problem, and, more significantly, that CRFs perform better than HMMs and MEMMs when the true data distribution has higher-order dependencies than the model, as is often the case in practice. Finally, we confirm these results as well as the claimed advantages of conditional models by evaluating HMMs, MEMMs and CRFs with identical state structure on a part-of-speech tagging task.

2. The Label Bias Problem

Classical probabilistic automata (Paz, 1971), discriminative Markov models (Bottou, 1991), maximum entropy taggers (Ratnaparkhi, 1996), and MEMMs, as well as non-probabilistic sequence tagging and segmentation models with independently trained next-state classifiers (Punyakanok & Roth, 2001) are all potential victims of the label bias problem.

For example, Figure 1 represents a simple finite-state model designed to distinguish between the two words rib and rob. Suppose that the observation sequence is r i b. In the first time step, r matches both transitions from the start state, so the probability mass gets distributed roughly equally among those two transitions. Next we observe i. Both states 1 and 4 have only one outgoing transition. State 1 has seen this observation often in training; state 4 has almost never seen this observation; but like state 1, state 4 has no choice but to pass all its mass to its single outgoing transition, since it is not generating the observation, only conditioning on it. Thus, states with a single outgoing transition effectively ignore their observations. More generally, states with low-entropy next state distributions will take little notice of observations. Returning to the example, the top path and the bottom path will be about equally likely, independently of the observation sequence. If one of the two words is slightly more common in the training set, the transitions out of the start state will slightly prefer its corresponding transition, and that word's state sequence will always win. This behavior is demonstrated experimentally in Section 5.

Léon Bottou (1991) discussed two solutions for the label bias problem. One is to change the state-transition structure of the model. In the above example we could collapse states 1 and 4, and delay the branching until we get a discriminating observation. This operation is a special case of determinization (Mohri, 1997), but determinization of weighted finite-state machines is not always possible, and even when possible, it may lead to combinatorial explosion. The other solution mentioned is to start with a fully-connected model and let the training procedure figure out a good structure. But that would preclude the use of prior structural knowledge that has proven so valuable in information extraction tasks (Freitag & McCallum, 2000).
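The per-state normalization at the heart of the problem is easy to demonstrate numerically. In the hypothetical fragment below, a state with a single successor assigns probability 1 to that successor whether the observation was seen often or almost never in training:

```python
import numpy as np

def normalize(raw_scores):
    # MEMM-style per-state normalization of outgoing transition scores.
    return raw_scores / raw_scores.sum()

# Hypothetical unnormalized scores for state 1's single outgoing transition:
for obs, raw in [("i (frequent in training)", np.array([9.0])),
                 ("o (almost never seen)", np.array([0.01]))]:
    print(obs, "->", normalize(raw))  # both print [1.]: the observation is ignored
```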
Proper solutions require models that account for whole state sequences at once by letting some transitions "vote" more strongly than others depending on the corresponding observations. This implies that score mass will not be conserved, but instead individual transitions can "amplify" or "dampen" the mass they receive. In the above example, the transitions from the start state would have a very weak effect on path score, while the transitions from states 1 and 4 would have much stronger effects, amplifying or damping depending on the actual observation, and a proportionally higher contribution to the selection of the Viterbi path.³ In the related work section we discuss other heuristic model classes that account for state sequences globally rather than locally. To the best of our knowledge, CRFs are the only model class that does this in a purely probabilistic setting, with guaranteed global maximum likelihood convergence.

³ Weighted determinization and minimization techniques shift transition weights while preserving overall path weight (Mohri, 2000); their connection to this discussion deserves further study.

3. Conditional Random Fields

In what follows, $X$ is a random variable over data sequences to be labeled, and $Y$ is a random variable over corresponding label sequences. All components $Y_i$ of $Y$ are assumed to range over a finite label alphabet $\mathcal{Y}$. For example, $X$ might range over natural language sentences and $Y$ range over part-of-speech taggings of those sentences, with $\mathcal{Y}$ the set of possible part-of-speech tags. The random variables $X$ and $Y$ are jointly distributed, but in a discriminative framework we construct a conditional model $p(Y \mid X)$ from paired observation and label sequences, and do not explicitly model the marginal $p(X)$.

Definition. Let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that $Y$ is indexed by the vertices of $G$. Then $(X, Y)$ is a conditional random field when, conditioned on $X$, the random variables $Y_v$ obey the Markov property with respect to the graph: $p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$, where $w \sim v$ means that $w$ and $v$ are neighbors in $G$.

Thus, a CRF is a random field globally conditioned on the observation $X$. Throughout the paper we tacitly assume that the graph $G$ is fixed. In the simplest and most important example for modeling sequences, $G$ is a simple chain or line: $G = (V = \{1, 2, \ldots, m\},\ E = \{(i, i+1)\})$. $X$ may also have a natural graph structure; yet in general it is not necessary to assume that $X$ and $Y$ have the same graphical structure, or even that $X$ has any graphical structure at all. However, in this paper we will be most concerned with sequences $X = (X_1, X_2, \ldots, X_n)$ and $Y = (Y_1, Y_2, \ldots, Y_n)$.

If the graph $G$ of $Y$ is a tree (of which a chain is the simplest example), its cliques are the edges and vertices. Therefore, by the fundamental theorem of random fields (Hammersley & Clifford, 1971), the joint distribution over the label sequence $Y$ given $X$ has the form

$$p_\theta(y \mid x) \propto \exp\Big( \sum_{e \in E,\,k} \lambda_k\, f_k(e, y|_e, x) + \sum_{v \in V,\,k} \mu_k\, g_k(v, y|_v, x) \Big), \qquad (1)$$

where $x$ is a data sequence, $y$ a label sequence, and $y|_S$ is the set of components of $y$ associated with the vertices in subgraph $S$. We assume that the features $f_k$ and $g_k$ are given and fixed. For example, a Boolean vertex feature $g_k$ might be true if the word $X_i$ is upper case and the tag $Y_i$ is "proper noun."

The parameter estimation problem is to determine the parameters $\theta = (\lambda_1, \lambda_2, \ldots; \mu_1, \mu_2, \ldots)$ from training data with empirical distribution $\tilde{p}(x, y)$. In Section 4 we describe an iterative scaling algorithm that maximizes the log-likelihood objective function $\mathcal{O}(\theta) = \sum_i \log p_\theta(y^{(i)} \mid x^{(i)})$.

As a particular case, we can construct an HMM-like CRF by defining one feature for each state pair $(y', y)$, and one feature for each state-observation pair $(y, x)$: $f_{y',y}(\langle u, v \rangle, y|_{\langle u,v \rangle}, x) = \delta(y_u = y')\,\delta(y_v = y)$ and $g_{y,x}(v, y|_v, x) = \delta(y_v = y)\,\delta(x_v = x)$.
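A hedged sketch of these HMM-like indicator features (string ids for tags and words are an assumption made here for readability):

```python
def edge_feature(y_prev_target, y_target):
    # f_{y',y}: fires when the labels at an edge take the pair (y', y).
    return lambda y_prev, y, x_i: 1.0 if (y_prev, y) == (y_prev_target, y_target) else 0.0

def vertex_feature(y_target, x_target):
    # g_{y,x}: fires when the vertex has label y and observation x.
    return lambda y_prev, y, x_i: 1.0 if (y, x_i) == (y_target, x_target) else 0.0

f_DT_NN = edge_feature("DT", "NN")      # hypothetical tag pair
g_NN_dog = vertex_feature("NN", "dog")  # hypothetical tag-word pair
print(f_DT_NN("DT", "NN", "dog"), g_NN_dog("DT", "NN", "dog"))  # 1.0 1.0
```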
The corresponding parameters $\lambda_{y',y}$ and $\mu_{y,x}$ play a similar role to the (logarithms of the) usual HMM parameters $p(y' \mid y)$ and $p(x \mid y)$. Boltzmann chain models (Saul & Jordan, 1996; MacKay, 1996) have a similar form but use a single normalization constant to yield a joint distribution, whereas CRFs use the observation-dependent normalization $Z_\theta(x)$ for conditional distributions.

Although it encompasses HMM-like models, the class of conditional random fields is much more expressive, because it allows arbitrary dependencies on the observation sequence. In addition, the features do not need to specify completely a state or observation, so one might expect that the model can be estimated from less training data. Another attractive property is the convexity of the loss function; indeed, CRFs share all of the convexity properties of general maximum entropy models.

For the remainder of the paper we assume that the dependencies of $Y$, conditioned on $X$, form a chain. To simplify some expressions, we add special start and stop states $Y_0 = \text{start}$ and $Y_{n+1} = \text{stop}$. Thus, we will be using the graphical structure shown in Figure 2.

Figure 2. Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model.

For a chain structure, the conditional probability of a label sequence can be expressed concisely in matrix form, which will be useful in describing the parameter estimation and inference algorithms in Section 4. Suppose that $p_\theta(Y \mid X)$ is a CRF given by (1). For each position $i$ in the observation sequence $x$, we define the matrix random variable $M_i(x) = [M_i(y', y \mid x)]$ by

$$M_i(y', y \mid x) = \exp\Big( \sum_k \lambda_k\, f_k(e_i, Y|_{e_i} = (y', y), x) + \sum_k \mu_k\, g_k(v_i, Y|_{v_i} = y, x) \Big),$$

where $e_i$ is the edge with labels $(Y_{i-1}, Y_i)$ and $v_i$ is the vertex with label $Y_i$. In contrast to generative models, conditional models like CRFs do not need to enumerate over all possible observation sequences, and therefore these matrices can be computed directly as needed from a given training or test observation sequence $x$ and the parameter vector $\theta$. Then the normalization (partition function) $Z_\theta(x)$ is the $(\text{start}, \text{stop})$ entry of the product of these matrices:

$$Z_\theta(x) = \big( M_1(x)\, M_2(x) \cdots M_{n+1}(x) \big)_{\text{start},\,\text{stop}}.$$

Using this notation, the conditional probability of a label sequence $y$ is written as

$$p_\theta(y \mid x) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{Z_\theta(x)},$$

where $y_0 = \text{start}$ and $y_{n+1} = \text{stop}$.

4. Parameter Estimation for CRFs

We now describe two iterative scaling algorithms to find the parameter vector $\theta$ that maximizes the log-likelihood of the training data. Both algorithms are based on the improved iterative scaling (IIS) algorithm of Della Pietra et al. (1997); the proof technique based on auxiliary functions can be extended to show convergence of the algorithms for CRFs. Iterative scaling algorithms update the weights as $\lambda_k \leftarrow \lambda_k + \delta\lambda_k$ and $\mu_k \leftarrow \mu_k + \delta\mu_k$ for appropriately chosen $\delta\lambda_k$ and $\delta\mu_k$. In particular, the IIS update $\delta\lambda_k$ for an edge feature $f_k$ is the solution of

$$\tilde{E}[f_k] \;\stackrel{\text{def}}{=}\; \sum_{x,y} \tilde{p}(x, y) \sum_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x) \;=\; \sum_{x,y} \tilde{p}(x)\, p_\theta(y \mid x) \sum_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x)\, e^{\delta\lambda_k\, T(x, y)},$$

where $T(x, y) \stackrel{\text{def}}{=} \sum_{i,k} f_k(e_i, y|_{e_i}, x) + \sum_{i,k} g_k(v_i, y|_{v_i}, x)$ is the total feature count.
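Computationally, everything above rests on the matrices $M_i(x)$. The sketch below evaluates $Z_\theta(x)$ and $p_\theta(y \mid x)$ from precomputed matrices via the matrix form; it is a numpy illustration under the chain assumption, with random matrices standing in for ones built from actual features:

```python
import numpy as np

def partition_function(M, start, stop):
    # Z(x) is the (start, stop) entry of the product M_1 M_2 ... M_{n+1}.
    prod = M[0]
    for Mi in M[1:]:
        prod = prod @ Mi
    return prod[start, stop]

def conditional_prob(M, y, start, stop):
    # p(y|x) = prod_i M_i[y_{i-1}, y_i] / Z(x), with y_0 = start, y_{n+1} = stop.
    path = [start] + list(y) + [stop]
    score = np.prod([M[i][path[i], path[i + 1]] for i in range(len(M))])
    return score / partition_function(M, start, stop)

# Tiny example: 4 states, where indices 2 and 3 act as start/stop labels.
rng = np.random.default_rng(0)
M = [np.exp(rng.normal(size=(4, 4))) for _ in range(3)]  # n = 2 positions
print(conditional_prob(M, y=[0, 1], start=2, stop=3))
```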
The equations for vertex feature updates $\delta\mu_k$ have similar form. However, efficiently computing the exponential sums on the right-hand sides of these equations is problematic, because $T(x, y)$ is a global property of $(x, y)$, and dynamic programming will sum over sequences with potentially varying $T$. To deal with this, the first algorithm, Algorithm S, uses a "slack feature." The second, Algorithm T, keeps track of partial totals.

For Algorithm S, we define the slack feature by

$$s(x, y) \stackrel{\text{def}}{=} S - \sum_{i,k} f_k(e_i, y|_{e_i}, x) - \sum_{i,k} g_k(v_i, y|_{v_i}, x),$$

where $S$ is a constant chosen so that $s(x^{(i)}, y) \geq 0$ for all $y$ and all observation vectors $x^{(i)}$ in the training set, thus making $T(x, y) = S$. Feature $s$ is "global," that is, it does not correspond to any particular edge or vertex.

For each index $i = 0, \ldots, n+1$ we now define the forward vectors $\alpha_i(x)$ with base case $\alpha_0(y \mid x) = 1$ if $y = \text{start}$ and $0$ otherwise, and recurrence $\alpha_i(x) = \alpha_{i-1}(x)\, M_i(x)$. Similarly, the backward vectors $\beta_i(x)$ are defined by $\beta_{n+1}(y \mid x) = 1$ if $y = \text{stop}$ and $0$ otherwise, and $\beta_i(x) = M_{i+1}(x)\, \beta_{i+1}(x)$.

With these definitions, the update equations are

$$\delta\lambda_k = \frac{1}{S} \log \frac{\tilde{E}[f_k]}{E_\theta[f_k]}, \qquad \delta\mu_k = \frac{1}{S} \log \frac{\tilde{E}[g_k]}{E_\theta[g_k]},$$

where the model expectations are computed from the forward and backward vectors, e.g.

$$E_\theta[f_k] = \sum_x \tilde{p}(x) \sum_{i=1}^{n+1} \sum_{y',\,y} f_k(e_i, (y', y), x)\, \frac{\alpha_{i-1}(y' \mid x)\, M_i(y', y \mid x)\, \beta_i(y \mid x)}{Z_\theta(x)}.$$

The factors involving the forward and backward vectors in the above equations have the same meaning as for standard hidden Markov models. For example, $p_\theta(Y_i = y \mid x) = \alpha_i(y \mid x)\, \beta_i(y \mid x) / Z_\theta(x)$ is the marginal probability of label $Y_i = y$ given that the observation sequence is $x$. This algorithm is closely related to the algorithm of Darroch and Ratcliff (1972), and MART algorithms used in image reconstruction.

The constant $S$ in Algorithm S can be quite large, since in practice it is proportional to the length of the longest training observation sequence. As a result, the algorithm may converge slowly, taking very small steps toward the maximum in each iteration. If the length of the observations $x^{(i)}$ and the number of active features varies greatly, a faster-converging algorithm can be obtained by keeping track of feature totals for each observation sequence separately. Let $T(x) \stackrel{\text{def}}{=} \max_y T(x, y)$. Algorithm T accumulates feature expectations into counters indexed by $T(x)$. More specifically, we use the forward-backward recurrences just introduced to compute the expectations $a_{k,t}$ of feature $f_k$ and $b_{k,t}$ of feature $g_k$ given that $T(x) = t$. Then our parameter updates are $\delta\lambda_k = \log \beta_k$ and $\delta\mu_k = \log \gamma_k$, where $\beta_k$ and $\gamma_k$ are the unique positive roots of the following polynomial equations

$$\sum_{t=0}^{T_{\max}} a_{k,t}\, \beta_k^{\,t} = \tilde{E}[f_k], \qquad \sum_{t=0}^{T_{\max}} b_{k,t}\, \gamma_k^{\,t} = \tilde{E}[g_k], \qquad (2)$$

which can be easily computed by Newton's method.

A single iteration of Algorithm S and Algorithm T has roughly the same time and space complexity as the well-known Baum-Welch algorithm for HMMs. To prove convergence of our algorithms, we can derive an auxiliary function to bound the change in likelihood from below; this method is developed in detail by Della Pietra et al. (1997). The full proof is somewhat detailed; however, here we give an idea of how to derive the auxiliary function. To simplify notation, we assume only edge features $f_k$ with parameters $\lambda_k$. Given two parameter settings $\theta$ and $\theta' = \theta + \delta$, we bound the change in the objective function from below by an auxiliary function $A(\theta', \theta)$; the bounds in its derivation follow from the convexity of $-\log$ and $\exp$. Differentiating $A$ with respect to $\delta\lambda_k$ and setting the result to zero yields equation (2).

5. Experiments

We first discuss two sets of experiments with synthetic data that highlight the differences between CRFs and MEMMs.
The first experiments are a direct verification of the label bias problem discussed in Section 2. In the second set of experiments, we generate synthetic data using randomly chosen hidden Markov models, each of which is a mixture of a first-order and second-order model. Competing models are then trained and compared on test data. As the data becomes more second-order, the test error rates of the trained models increase. This experiment corresponds to the common modeling practice of approximating complex local and long-range dependencies, as occur in natural data, by small-order Markov models. Our results clearly indicate that even when the models are parameterized in exactly the same way, CRFs are more robust to inaccurate modeling assumptions than MEMMs or HMMs, and resolve the label bias problem, which affects the performance of MEMMs. To avoid confusion of different effects, the MEMMs and CRFs in these experiments do not use overlapping features of the observations. Finally, in a set of POS tagging experiments, we confirm the advantage of CRFs over MEMMs. We also show that the addition of overlapping features to CRFs and MEMMs allows them to perform much better than HMMs, as already shown for MEMMs by McCallum et al. (2000).

Figure 3. Plots of error rates for HMMs, CRFs, and MEMMs on randomly generated synthetic data sets, as described in Section 5.2. As the data becomes "more second order," the error rates of the test models increase. As shown in the left plot, the CRF typically significantly outperforms the MEMM. The center plot shows that the HMM outperforms the MEMM. In the right plot, each open square represents a data set with $\alpha < 1/2$, and a solid circle indicates a data set with $\alpha \geq 1/2$; the plot shows that when the data is mostly second order ($\alpha \geq 1/2$), the discriminatively trained CRF typically outperforms the HMM. These experiments are not designed to demonstrate the advantages of the additional representational power of CRFs and MEMMs relative to HMMs.

5.1 Modeling label bias

We generate data from a simple HMM which encodes a noisy version of the finite-state network in Figure 1. Each state emits its designated symbol with probability 29/32 and any of the other symbols with probability 1/32. We train both an MEMM and a CRF with the same topologies on the data generated by the HMM. The observation features are simply the identity of the observation symbols. In a typical run using 2,000 training and 500 test samples, trained to convergence of the iterative scaling algorithm, the CRF error is 4.6% while the MEMM error is 42%, showing that the MEMM fails to discriminate between the two branches.

5.2 Modeling mixed-order sources

For these results, we use five labels, a–e ($|\mathcal{Y}| = 5$), and 26 observation values, A–Z ($|\mathcal{X}| = 26$); however, the results were qualitatively the same over a range of sizes for $|\mathcal{Y}|$ and $|\mathcal{X}|$. We generate data from a mixed-order HMM with state transition probabilities given by

$$p_\alpha(y_i \mid y_{i-1}, y_{i-2}) = \alpha\, p_2(y_i \mid y_{i-1}, y_{i-2}) + (1 - \alpha)\, p_1(y_i \mid y_{i-1})$$

and, similarly, emission probabilities given by

$$p_\alpha(x_i \mid y_i, x_{i-1}) = \alpha\, p_2(x_i \mid y_i, x_{i-1}) + (1 - \alpha)\, p_1(x_i \mid y_i).$$

Thus, for $\alpha = 0$ we have a standard first-order HMM. In order to limit the size of the Bayes error rate for the resulting models, the conditional probability tables are constrained to be sparse.
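A hedged sketch of this mixed-order generator follows; for illustration it uses dense random conditional tables drawn with numpy, whereas the paper additionally imposes the sparsity constraints described next:

```python
import numpy as np

rng = np.random.default_rng(0)
n_labels, n_obs, alpha = 5, 26, 0.5

# Random conditional tables (hypothetical; the paper constrains sparsity).
P1 = rng.dirichlet(np.ones(n_labels), size=n_labels)              # p1(y_i | y_{i-1})
P2 = rng.dirichlet(np.ones(n_labels), size=(n_labels, n_labels))  # p2(y_i | y_{i-1}, y_{i-2})
E1 = rng.dirichlet(np.ones(n_obs), size=n_labels)                 # p1(x_i | y_i)
E2 = rng.dirichlet(np.ones(n_obs), size=(n_labels, n_obs))        # p2(x_i | y_i, x_{i-1})

def sample_sequence(length=25):
    y = [int(rng.integers(n_labels))]
    x = [int(rng.choice(n_obs, p=E1[y[0]]))]
    for i in range(1, length):
        # Convex mixture of second- and first-order transition distributions.
        p_trans = alpha * (P2[y[-2], y[-1]] if i >= 2 else P1[y[-1]]) \
                  + (1 - alpha) * P1[y[-1]]
        y.append(int(rng.choice(n_labels, p=p_trans)))
        p_emit = alpha * E2[y[-1], x[-1]] + (1 - alpha) * E1[y[-1]]
        x.append(int(rng.choice(n_obs, p=p_emit)))
    return y, x

labels, observations = sample_sequence()
```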
In particular, the second-order transition table $p_2(y_i \mid y_{i-1}, y_{i-2})$ can have at most two nonzero entries for each conditioning pair, and the second-order emission table $p_2(x_i \mid y_i, x_{i-1})$ can have at most three nonzero entries for each conditioning pair. For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing.

On each randomly generated training set, a CRF is trained using Algorithm S. (Note that since the length of the sequences and number of active features is constant, Algorithms S and T are identical.) The algorithm is fairly slow to converge, typically taking approximately 500 iterations for the model to stabilize. On the 500 MHz Pentium PC used in our experiments, each iteration takes approximately 0.2 seconds. On the same data an MEMM is trained using iterative scaling, which does not require forward-backward calculations, and is thus more efficient. The MEMM training converges more quickly, stabilizing after approximately 100 iterations. For each model, the Viterbi algorithm is used to label a test set; the experimental results do not significantly change when using forward-backward decoding to minimize the per-symbol error rate.

The results of several runs are presented in Figure 3. Each plot compares two classes of models, with each point indicating the error rate for a single test set. As $\alpha$ increases, the error rates generally increase, as the first-order models fail to fit the second-order data. Results are qualitatively the same across the tested parameterizations. As shown in the first graph, the CRF generally outperforms the MEMM, often by a wide margin of 10%–20% relative error. (The points for very small error rate, where the MEMM does better than the CRF, are suspected to be the result of an insufficient number of training iterations for the CRF.)

Figure 4. Per-word error rates for POS tagging on the Penn treebank, using first-order models trained on 50% of the 1.1 million word corpus. The oov rate is 5.45%.

          error   oov error
  HMM     5.69%   45.99%
  MEMM    6.37%   54.61%
  CRF     5.55%   48.05%
  MEMM+   4.81%   26.99%
  CRF+    4.27%   23.76%
  (+ = using spelling features)

5.3 POS tagging experiments

To confirm our synthetic data results, we also compared HMMs, MEMMs and CRFs on Penn treebank POS tagging, where each word in a given input sentence must be labeled with one of 45 syntactic tags.

We carried out two sets of experiments with this natural language data. First, we trained first-order HMM, MEMM, and CRF models as in the synthetic data experiments, introducing parameters $\mu_{y,x}$ for each tag-word pair and $\lambda_{y',y}$ for each tag-tag pair in the training set. The results are consistent with what is observed on synthetic data: the HMM outperforms the MEMM, as a consequence of the label bias problem, while the CRF outperforms the HMM. The error rates for training runs using a 50%-50% train-test split are shown in Figure 4; the results are qualitatively similar for other splits of the data. The error rates on out-of-vocabulary (oov) words, which are not observed in the training set, are reported separately.

In the second set of experiments, we take advantage of the power of conditional models by adding a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it ends in one of a small set of suffixes. Here we find, as expected, that both the MEMM and the CRF benefit significantly from the use of these features, with the overall error rate reduced by around 25%, and the out-of-vocabulary error rate reduced by around 50%.

One usually starts training from the all zero parameter vector, corresponding to the uniform distribution. However, for these datasets, CRF training with that
initialization is much slower than MEMM training. Fortunately, we can use the optimal MEMM parameter vector as a starting point for training the corresponding CRF. In Figure 4, MEMM+ was trained to convergence in around 100 iterations. Its parameters were then used to initialize the training of CRF+, which converged in 1,000 iterations. In contrast, training of the same CRF from the uniform distribution had not converged even after 2,000 iterations.

6. Further Aspects of CRFs

Many further aspects of CRFs are attractive for applications and deserve further study. In this section we briefly mention just two.

Conditional random fields can be trained using the exponential loss objective function used by the AdaBoost algorithm (Freund & Schapire, 1997). Typically, boosting is applied to classification problems with a small, fixed number of classes; applications of boosting to sequence labeling have treated each label as a separate classification problem (Abney et al., 1999). However, it is possible to apply the parallel update algorithm of Collins et al. (2000) to optimize the per-sequence exponential loss. This requires a forward-backward algorithm to compute efficiently certain feature expectations, along the lines of Algorithm T, except that each feature requires a separate set of forward and backward accumulators.

Another attractive aspect of CRFs is that one can implement efficient feature selection and feature induction algorithms for them. That is, rather than specifying in advance which features to use, we could start from feature-generating rules and evaluate the benefit of generated features automatically on data. In particular, the feature induction algorithms presented in Della Pietra et al. (1997) can be adapted to fit the dynamic programming techniques of conditional random fields.

7. Related Work and Conclusions

As far as we know, the present work is the first to combine the benefits of conditional models with the global normalization of random field models. Other applications of exponential models in sequence modeling have either attempted to build generative models (Rosenfeld, 1997), which involve a hard normalization problem, or adopted local conditional models (Berger et al., 1996; Ratnaparkhi, 1996; McCallum et al., 2000) that may suffer from label bias.

Non-probabilistic local decision models have also been widely used in segmentation and tagging (Brill, 1995; Roth, 1998; Abney et al., 1999). Because of the computational complexity of global training, these models are only trained to minimize the error of individual label decisions assuming that neighboring labels are correctly set. Label bias would be expected to be a problem here too.

An alternative approach to discriminative modeling of sequence labeling is to use a permissive generative model, which can only model local dependencies, to produce a list of candidates, and then use a more global discriminative model to rerank those candidates. This approach is standard in large-vocabulary speech recognition (Schwartz & Austin, 1993), and has also been proposed for parsing (Collins, 2000). However, these methods fail when the correct output is pruned away in the first pass.
Probabilistic Region Relevance Learning for Content-Based Image Retrieval

Iker Gondra, Douglas R. Heisterkamp
Oklahoma State University, Computer Science Department
219 Mathematical Sciences, Stillwater, OK 74078-1053

Abstract
Probabilistic feature relevance learning (PFRL) is an effective method for adaptively computing local feature relevance in content-based image retrieval. It computes flexible retrieval metrics for producing neighborhoods that are elongated along less relevant feature dimensions and constricted along most influential ones. Based on the observation that regions in an image have unequal importance for computing image similarity, we propose a probabilistic method inspired by PFRL, probabilistic region relevance learning (PRRL), for automatically estimating region relevance based on user's feedback. PRRL can be used to set region weights in region-based image retrieval frameworks that use an overall image-to-image similarity measure. Experimental results on general-purpose images show the effectiveness of PRRL in learning the relative importance of regions in an image.

Keywords: Region-based Image Retrieval, Region Importance, Relevance Feedback

1. Introduction

In traditional approaches, a content-based image retrieval (CBIR) system extracts some global features (such as color, texture, and shape) from an image. The features are then the components of a feature vector which makes the image correspond to a point in a feature space. In order to determine closeness between two images, a similarity measure is used to calculate the distance between their corresponding feature vectors. Then, the closest images in feature space to a query image are returned to the user as the query results. Because of the semantic gap between high-level concepts and low-level features, the performance of CBIR is not satisfactory. In order to overcome this problem, two major approaches have been suggested: the use of a learning technique, such as relevance feedback, to learn the user's high-level concept, and region-based image representations that are closer to a user's perception of an image's content.

Relevance feedback (RF) works by gathering semantic information from user interaction. In order to learn a user's query concept, the user labels each image returned in the previous query round as relevant (1) or non-relevant (-1). Based on the feedback, the retrieval scheme is adjusted and the next set of images is presented to the user for labelling.
Two main RF strategies have been proposed in CBIR: query shifting [13] and distance reweighting [6, 12, 11]. Query shifting involves moving the query towards the region of the feature space containing relevant images and away from the region containing non-relevant images. Distance reweighting assumes that the relevant images are located along some direction of the feature space. Thus, the task is to determine the features that help the most in retrieving relevant images and increase their importance in determining similarity. In [11], a probabilistic feature relevance learning (PFRL) method that automatically captures the feature relevance based on RF is presented. It computes flexible retrieval metrics for producing neighborhoods that are elongated along less relevant feature dimensions and constricted along most influential ones (see Figure 1). Retrieved images with RF are used to compute local feature relevance.

Figure 1. Features are unequal in their differential relevance for computing similarity. The neighborhoods of queries b and c should be elongated along the less relevant Y and X axis respectively. For query a, features X and Y have equal discriminating strength.

In contrast to traditional methods, which compute global features, region-based approaches [1, 9, 14] extract features from segmented regions of an image. Then, images are retrieved according to the similarity between regions. The main objective of using regions is to do a more meaningful retrieval that is closer to a user's perceptions of an image's content. Instead of looking at the image as a whole, we look at the objects in the image and their relationships. The similarity measure that most of these systems [1, 9] use to compare two images is based on individual region-to-region similarity. Both Blobworld [1] and Netra [9] require the user to select the region(s) of interest from the segmented image. This information is then used for determining similarity with database images. A major problem with these systems is that the segmented regions they produce usually do not correspond to actual objects in the image. For instance, an object may be partitioned into several regions, with none of them being representative of the object.

In order to overcome the problems of inaccurate image segmentation, some approaches have been proposed that consider all the regions in an image for determining similarity [8, 2, 14]. In [8], IRM (Integrated Region Matching) is proposed as a measure that allows a many-to-many region mapping relationship between two images by matching a region of one image to several regions of another image. Basically, the "most similar highest priority" principle is used, and the smaller the distance between two regions is, the larger their significance credit (weight) is (see Figure 2). Thus, by having a similarity measure which is a weighted sum of distances between all regions from different images, IRM is more robust to inaccurate segmentation.

Figure 2. Integrated region matching (IRM).

Recently, a fuzzy logic approach, UFM (Unified Feature Matching) [2], was proposed as an alternative to IRM. An image is represented by a set of segmented regions, each of which is represented by a fuzzy feature denoting color, texture, and shape characteristics. Because fuzzy features can characterize the gradual transition between regions in an image, segmentation-related inaccuracies are implicitly considered. The similarity between two images is then defined as the overall similarity between two sets of fuzzy features.
A key factor in these types of systems that consider all the regions to perform an overall image-to-image similarity is the weighting of regions. The weight that is assigned to each region for determining similarity is usually based on prior assumptions, such as that larger regions, or regions that are close to the center of the image, should have larger weights. This is often inconsistent with human perception. For instance, a facial region may be the most important when the user is looking for images of people, while other larger regions such as the background may be much less relevant. Based on the observation that regions in an image have unequal importance for computing image similarity (see Figure 3), we propose a probabilistic method inspired by PFRL [11], probabilistic region relevance learning (PRRL), for automatically capturing region relevance based on user's feedback. PRRL can be used to set region weights in region-based image retrieval frameworks that use an overall image-to-image similarity measure.

Figure 3. Regions are unequal in their differential relevance for computing similarity. Given that the user is looking for images of people, region R1 ("People") is the most important, followed by R2 ("Clothes") and R3 ("Animals"). Thus, the neighborhood of the similarity metric should be elongated along the direction of R3 and constricted along the direction of R1.

1.1 Related Work

Although RF learning has been successfully applied to CBIR systems that use global image representations, not much research has been conducted on RF learning methods for region-based CBIR. Based on the assumption that important regions should appear more often in relevant images than unimportant regions, Jing et al. [7] proposed an RFIIF (Region Frequency × Inverse Image Frequency) weighting scheme. Let $q = \{r_1, r_2, \ldots, r_N\}$ be the variable-length representation of a query image, where $r_i$ represents the features extracted from a region in the image. Let $D$ be the set of all images in the database, and $R(q)$ be the set of cumulative relevant retrieved images for query image $q$. For each region $r_i$, the region frequency ($RF$) is defined as

$$RF(r_i) = \sum_{x \in R(q)} s(r_i, x),$$

where $s(r_i, x) = 1$ if at least one region of $x$ is similar to $r_i$ and $0$ otherwise. Two regions are deemed similar if their distance is smaller than a predefined threshold. The inverse image frequency ($IIF$) is defined, by analogy with the inverse document frequency of text retrieval, as

$$IIF(r_i) = \log \frac{|D|}{\sum_{x \in D} s(r_i, x)},$$

and the importance of region $r_i$ is proportional to $RF(r_i) \times IIF(r_i)$.
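A small sketch of the RFIIF weighting just described; the `similar(region, image)` predicate is an assumed helper implementing the threshold test, and the IDF-style form of IIF follows the reconstruction above:

```python
import math

def region_frequency(r, relevant_images, similar):
    # RF(r): count of cumulative relevant retrieved images containing
    # at least one region similar to r.
    return sum(1 for img in relevant_images if similar(r, img))

def inverse_image_frequency(r, database, similar):
    # IIF(r): IDF-style penalty for regions that occur all over the database.
    n = sum(1 for img in database if similar(r, img))
    return math.log(len(database) / max(n, 1))

def rfiif_importance(r, relevant_images, database, similar):
    return region_frequency(r, relevant_images, similar) * \
           inverse_image_frequency(r, database, similar)
```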
re-gions.Because (i.e.,the query image is always relevant),is the maximum error that can be made when assigning 0to the probability that is rele-vant when the probability is in fact 1.On the other hand,is the error that is made by predictingto be the probability that is relevant.Therefore,represents a reduction in error between the two predictions.Therefore,a measure of the relevance of region for can be defined as(1)The relative relevance can then be used as the weight of region in a weighted similarity measuree a segmentation method to extract regions andrepresent current query by;initializeregion weight vector to(3)where1()returns1if its argument is true,and0otherwise. Thus,is an adaptive similarity threshold that changes so that there is sufficient data for the estimation of (1).The value of is chosen so that,where.The probabilistic region relevance learning algorithm is summarized in Figure4.4.Unified Feature Matching(UFM)Chen and Wang[2]proposed unified feature matching(UFM)as an improved alternative to IRM.In UFM,an im-age is characterized by a fuzzy feature denoting color,tex-ture,and shape characteristics.The similarity between two images is then defined as the overall similarity between twosets of fuzzy features.Because fuzzy features can char-acterize the gradual transition between regions in an im-age,segmentation-related inaccuracies are implicitly con-sidered.The image segmentation algorithm that is usedfirst par-titions an image into blocks of4x4pixels.Then,a featurevector representing color and texture properties is extracted for each block.Thefirst three features are the av-erage color components and the other three represent energyin high frequency bands of the wavelet transforms[3,10]. The-means algorithm is then used to cluster the feature vectors into regions.The number of regions is adaptively chosen according to a stopping criteria.A fea-ture vector is then extracted for each regionto describe its shape characteristics.The shape features are normalized inertia of order1to3[5].The color and texture properties of each region are represented by a fuzzy feature with a Cauchy membership function defined aswhere is the average of all feature vectors in andis the average distance between shape features.Let and be the fuzzy feature representations for a query and target image respectively.The color and texture similarity between thequery and the target image is captured by the similarity vec-torwhereand similarly for the shape similarity,captured by similarity vector.The UFM measure for the query and target imageis then defined aswhere the normalized weight vectors and can be set according to some region weighting heuristic,adjusts the importance of and,and de-termines the significance of(i.e.,color and texture simi-larity)and(i.e.,shape similarity).For further details,see [2].5.Experimental ResultsA subset of2000labelled images from the general-purpose COREL image database was used as the data set. 
5. Experimental Results

A subset of 2000 labelled images from the general-purpose COREL image database was used as the data set. There are 20 image categories, each containing 100 pictures. The region-based feature vectors of those images are obtained with the segmentation algorithm described in Section 4.

We tested the performance of UFM, UFM with PRRL (UFM+PRRL), and UFM with RFIIF (UFM+RFIIF). The retrieval performance is measured by precision and recall, defined as

$$\text{precision} = \frac{\text{number of relevant images in the retrieval set}}{\text{size of the retrieval set}}, \qquad \text{recall} = \frac{\text{number of relevant images in the retrieval set}}{\text{number of relevant images in the database}}.$$

Every image is used as a query image. A uniform weighting scheme is used to set the region weights of each query and target image. For UFM+PRRL and UFM+RFIIF, user's feedback was simulated by carrying out 3 RF iterations for each query. Because the images in the data set are labelled according to their category, it is known whether an image in the retrieval set would be labelled as relevant or non-relevant by the user. The average precision of the 2000 queries with respect to different numbers of RF iterations is shown in Figure 5.

Figure 5. Precision at different numbers of RF iterations. The size of the retrieval set is 20.

Figures 6 through 9 show the precision-recall curves after each RF iteration. We can observe that UFM+PRRL has the best performance. It can be seen that, even after only 1 RF iteration, the region weights learned by PRRL result in a very significant performance improvement. Figure 10 shows the retrieval results obtained on a random query image. It is difficult to make objective comparisons with other region-based image retrieval systems, such as Netra [9] or Blobworld [1], which require additional information from the user (i.e., important regions and/or features) during the retrieval process.

Figure 6. Precision-recall curve with no learning. Figure 7. Precision-recall curve after 1 RF iteration. Figure 8. Precision-recall curve after 2 RF iterations. Figure 9. Precision-recall curve after 3 RF iterations. Figure 10. Retrieval results on a random query image (top leftmost): the initial retrieval set with UFM (precision = 0.3) and the retrieval set with UFM+PRRL after 2 RF iterations (precision = 0.75). The images are sorted based on their similarity to the query image; the ranks descend from left to right and from top to bottom.

6. Conclusions and Future Work

Region-based image retrieval frameworks that use an overall image-to-image similarity measure usually set region weights based on some heuristic that is often inconsistent with human perception about the importance of regions in an image. In this paper, we presented a novel probabilistic method for automatically estimating the relative relevance of the regions in an image. The experimental results on general-purpose images show convincingly that learning region relevance based on user's feedback can significantly improve retrieval performance.

Currently, our method only performs intra-query learning. That is, for each given query, the user's feedback is used to learn the relevance of the regions in the query, and the learning process starts from ground up for each new query. However, it is also possible to exploit inter-query learning (i.e., the long-term knowledge accumulated over the course of many query sessions) to enhance the retrieval performance of future queries.
Thus, for a new query, instead of starting the learning process from ground up, we could exploit the previously learned region importances of similar queries. This would be very beneficial, especially in the initial retrieval set, since, instead of using uniform weighting or some other weighting heuristic, we could make a more informed initial estimate of the relevance of regions in the new query. We plan to investigate the possibility of incorporating inter-query learning into the region-based image retrieval framework as part of our future work.

References

[1] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: Image segmentation using expectation-maximization and its applications to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1026–1038, 2002.
[2] Y. Chen and J. Wang. A region-based fuzzy feature matching approach to content-based image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1252–1267, 2002.
[3] I. Daubechies. Ten Lectures on Wavelets. Capital City Press, 1992.
[4] J. Friedman. Flexible metric nearest neighbor classification. Technical report, Department of Statistics, Stanford University, 1994.
[5] A. Gersho. Asymptotically optimum block quantization. IEEE Transactions on Information Theory, IT-25(4):231–262, July 1979.
[6] Y. Ishikawa, R. Subramanya, and C. Faloutsos. Mindreader: Querying databases through multiple examples. In Proceedings of the 24th VLDB Conference, pages 433–438, 1998.
[7] F. Jing, M. Li, L. Zhang, H. Zhang, and B. Zhang. Learning in region-based image retrieval. In Proceedings of the 2nd International Conference on Image and Video Retrieval, pages 206–215, 2003.
[8] J. Li, J. Wang, and G. Wiederhold. IRM: Integrated region matching for image retrieval. In Proceedings of the 8th ACM International Conference on Multimedia, pages 147–156, 2000.
[9] W. Ma and B. Manjunath. Netra: A toolbox for navigating large image databases. In Proceedings of the IEEE International Conference on Image Processing, pages 568–571, 1997.
[10] Y. Meyer. Wavelets: Algorithms and Applications. SIAM, Philadelphia, 1993.
[11] J. Peng, B. Bhanu, and S. Qing. Probabilistic feature relevance learning for content-based image retrieval. Computer Vision and Image Understanding, 75(1/2):150–164, 1999.
[12] Y. Rui and T. Huang. Relevance feedback: A power tool for interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 8(5):644–655, 1998.
[13] Y. Rui, T. Huang, and S. Mehrotra. Content-based image retrieval with relevance feedback in MARS. In Proceedings of the IEEE International Conference on Image Processing, pages 815–818, 1997.
[14] J. Wang, G. Li, and G. Wiederhold. SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:947–963, 2001.