Algorithms for Non-negative Matrix Factorization

Daniel D. Lee
Bell Laboratories, Lucent Technologies, Murray Hill, NJ 07974

H. Sebastian Seung
Dept. of Brain and Cog. Sci., Massachusetts Institute of Technology, Cambridge, MA 02138

Abstract

Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.

1 Introduction

Unsupervised learning algorithms such as principal components analysis and vector quantization can be understood as factorizing a data matrix subject to different constraints. Depending upon the constraints utilized, the resulting factors can be shown to have very different representational properties. Principal components analysis enforces only a weak orthogonality constraint, resulting in a very distributed representation that uses cancellations to generate variability [1, 2]. On the other hand, vector quantization uses a hard winner-take-all constraint that results in clustering the data into mutually exclusive prototypes [3].
We have previously shown that nonnegativity is a useful constraint for matrix factorization that can learn a parts representation of the data [4, 5]. The nonnegative basis vectors that are learned are used in distributed, yet still sparse combinations to generate expressiveness in the reconstructions [6, 7]. In this submission, we analyze in detail two numerical algorithms for learning the optimal nonnegative factors from data.

2 Non-negative matrix factorization

We formally consider algorithms for solving the following problem:

Non-negative matrix factorization (NMF) Given a non-negative matrix $V$, find non-negative matrix factors $W$ and $H$ such that:

$$V \approx WH \qquad (1)$$

NMF can be applied to the statistical analysis of multivariate data in the following manner. Given a set of multivariate $n$-dimensional data vectors, the vectors are placed in the columns of an $n \times m$ matrix $V$ where $m$ is the number of examples in the data set. This matrix is then approximately factorized into an $n \times r$ matrix $W$ and an $r \times m$ matrix $H$. Usually $r$ is chosen to be smaller than $n$ or $m$, so that $W$ and $H$ are smaller than the original matrix $V$. This results in a compressed version of the original data matrix.

What is the significance of the approximation in Eq. (1)? It can be rewritten column by column as $v \approx Wh$, where $v$ and $h$ are the corresponding columns of $V$ and $H$. In other words, each data vector $v$ is approximated by a linear combination of the columns of $W$, weighted by the components of $h$. Therefore $W$ can be regarded as containing a basis that is optimized for the linear approximation of the data in $V$. Since relatively few basis vectors are used to represent many data vectors, good approximation can only be achieved if the basis vectors discover structure that is latent in the data.

The present submission is not about applications of NMF, but focuses instead on the technical aspects of finding non-negative matrix factorizations. Of course, other types of matrix factorizations have been extensively studied in numerical linear algebra, but the non-negativity constraint makes much of this previous work
inapplicable to the present case [8].

Here we discuss two algorithms for NMF based on iterative updates of $W$ and $H$. Because these algorithms are easy to implement and their convergence properties are guaranteed, we have found them very useful in practical applications. Other algorithms may possibly be more efficient in overall computation time, but are more difficult to implement and may not generalize to different cost functions. Algorithms similar to ours where only one of the factors is adapted have previously been used for the deconvolution of emission tomography and astronomical images [9, 10, 11, 12].

At each iteration of our algorithms, the new value of $W$ or $H$ is found by multiplying the current value by some factor that depends on the quality of the approximation in Eq. (1). We prove that the quality of the approximation improves monotonically with the application of these multiplicative update rules. In practice, this means that repeated iteration of the update rules is guaranteed to converge to a locally optimal matrix factorization.

3 Cost functions

To find an approximate factorization $V \approx WH$, we first need to define cost functions that quantify the quality of the approximation. Such a cost function can be constructed using some measure of distance between two non-negative matrices $A$ and $B$. One useful measure is simply the square of the Euclidean distance between $A$ and $B$ [13],

$$\|A - B\|^2 = \sum_{ij} (A_{ij} - B_{ij})^2 \qquad (2)$$

This is lower bounded by zero, and clearly vanishes if and only if $A = B$. Another useful measure is

$$D(A\|B) = \sum_{ij} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right) \qquad (3)$$

We now consider two alternative formulations of NMF as optimization problems:

Problem 1 Minimize $\|V - WH\|^2$ with respect to $W$ and $H$, subject to the constraints $W, H \ge 0$.

Problem 2 Minimize $D(V\|WH)$ with respect to $W$ and $H$, subject to the constraints $W, H \ge 0$.

Although the functions $\|V - WH\|^2$ and $D(V\|WH)$ are convex in $W$ only or $H$ only, they are not convex in both variables together. Therefore it is unrealistic to expect an algorithm to solve Problems 1 and 2 in the sense of finding global minima. However, there are many techniques from numerical optimization that can be applied to find local minima.
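Gradient descent is perhaps the simplest technique to implement, but convergence can be slow. Other methods such as conjugate gradient have faster convergence, at least in the vicinity of local minima, but are more complicated to implement than gradient descent [8]. Gradient based methods also have the disadvantage of being very sensitive to the choice of step size, which can be very inconvenient for large applications.

4 Multiplicative update rules

We have found that the following "multiplicative update rules" are a good compromise between speed and ease of implementation for solving Problems 1 and 2.

Theorem 1 The Euclidean distance $\|V - WH\|$ is nonincreasing under the update rules

$$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}} \qquad W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}} \qquad (4)$$

The Euclidean distance is invariant under these updates if and only if $W$ and $H$ are at a stationary point of the distance.

Theorem 2 The divergence $D(V\|WH)$ is nonincreasing under the update rules

$$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}} \qquad W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}} \qquad (5)$$

The divergence is invariant under these updates if and only if $W$ and $H$ are at a stationary point of the divergence.

Proofs of these theorems are given in a later section. For now, we note that each update consists of multiplication by a factor. In particular, it is straightforward to see that this multiplicative factor is unity when $V = WH$, so that perfect reconstruction is necessarily a fixed point of the update rules.

5 Multiplicative versus additive update rules

It is useful to contrast these multiplicative updates with those arising from gradient descent [14]. In particular, a simple additive update for $H$ that reduces the squared distance can be written as

$$H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu} \left[ (W^T V)_{a\mu} - (W^T W H)_{a\mu} \right] \qquad (6)$$

If $\eta_{a\mu}$ are all set equal to some small positive number, this is equivalent to conventional gradient descent. As long as this number is sufficiently small, the update should reduce $\|V - WH\|$. Now if we diagonally rescale the variables and set

$$\eta_{a\mu} = \frac{H_{a\mu}}{(W^T W H)_{a\mu}} \qquad (7)$$

[...]

$$H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu} \left[ \sum_i W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}} - \sum_i W_{ia} \right] \qquad (8)$$

Again, if the $\eta_{a\mu}$ are small and positive, this update should reduce $D(V\|WH)$. If we now set [...]

Figure 1: Minimizing the auxiliary function $G(h, h^t) \ge F(h)$ guarantees that $F(h^{t+1}) \le F(h^t)$ for $h^{t+1} = \arg\min_h G(h, h^t)$.

Lemma 2 If $K(h^t)$ is the diagonal matrix

$$K_{ab}(h^t) = \delta_{ab} (W^T W h^t)_a / h^t_a \qquad (13)$$

then

$$G(h, h^t) = F(h^t) + (h - h^t)^T \nabla F(h^t) + \tfrac{1}{2} (h - h^t)^T K(h^t) (h - h^t) \qquad (14)$$

is an auxiliary function for

$$F(h) = \tfrac{1}{2} \sum_i \Big( v_i - \sum_a W_{ia} h_a \Big)^2 \qquad (15)$$

Proof: Since $G(h, h) = F(h)$ is obvious, we need only show that $G(h, h^t) \ge F(h)$. To do this, we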
compare [...] (22) [...] (23) [...] is a positive eigenvector of [...] with unity eigenvalue, and application of the Frobenius-Perron theorem shows that Eq. (17) holds.

We can now demonstrate the convergence of Theorem 1:

Proof of Theorem 1 Replacing $G(h, h^t)$ in Eq. (11) by Eq. (14) results in the update rule:

$$h^{t+1} = h^t - K(h^t)^{-1} \nabla F(h^t) \qquad (24)$$

Since Eq. (14) is an auxiliary function, $F$ is nonincreasing under this update rule, according to Lemma 1. Writing the components of this equation explicitly, we obtain [...] (28)

Proof: It is straightforward to verify that $G(h, h) = F(h)$. To show that $G(h, h^t) \ge F(h)$, we use convexity of the log function to derive the inequality [...] (30) we obtain [...] (31)

From this inequality it follows that $F(h) \le G(h, h^t)$.

Theorem 2 then follows from the application of Lemma 1:

Proof of Theorem 2: The minimum of $G(h, h^t)$ with respect to $h$ is determined by setting the gradient to zero: [...]

7 Discussion

We have shown that application of the update rules in Eqs. (4) and (5) is guaranteed to find at least locally optimal solutions of Problems 1 and 2, respectively. The convergence proofs rely upon defining an appropriate auxiliary function. We are currently working to generalize these theorems to more complex constraints. The update rules themselves are extremely easy to implement computationally, and will hopefully be utilized by others for a wide variety of applications.

We acknowledge the support of Bell Laboratories. We would also like to thank Carlos Brody, Ken Clarkson, Corinna Cortes, Roland Freund, Linda Kaufman, Yann Le Cun, Sam Roweis, Larry Saul, and Margaret Wright for helpful discussions.

References

[1] Jolliffe, IT (1986). Principal Component Analysis. New York: Springer-Verlag.
[2] Turk, M & Pentland, A (1991). Eigenfaces for recognition. J. Cogn. Neurosci. 3, 71–86.
[3] Gersho, A & Gray, RM (1992). Vector Quantization and Signal Compression. Kluwer Acad. Press.
[4] Lee, DD & Seung, HS (1997). Unsupervised learning by convex and conic coding. Proceedings of the Conference on Neural Information Processing Systems 9, 515–521.
[5] Lee, DD & Seung, HS (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791.
[6] Field, DJ (1994). What is the goal of
sensory coding? Neural Comput. 6, 559–601.
[7] Foldiak, P & Young, M (1995). Sparse coding in the primate cortex. The Handbook of Brain Theory and Neural Networks, 895–898. (MIT Press, Cambridge, MA).
[8] Press, WH, Teukolsky, SA, Vetterling, WT & Flannery, BP (1993). Numerical recipes: the art of scientific computing. (Cambridge University Press, Cambridge, England).
[9] Shepp, LA & Vardi, Y (1982). Maximum likelihood reconstruction for emission tomography. IEEE Trans. MI-2, 113–122.
[10] Richardson, WH (1972). Bayesian-based iterative method of image restoration. J. Opt. Soc. Am. 62, 55–59.
[11] Lucy, LB (1974). An iterative technique for the rectification of observed distributions. Astron. J. 74, 745–754.
[12] Bouman, CA & Sauer, K (1996). A unified approach to statistical tomography using coordinate descent optimization. IEEE Trans. Image Proc. 5, 480–492.
[13] Paatero, P & Tapper, U (1997). Least squares formulation of robust non-negative factor analy[...]b. 37, 23–35.
[14] Kivinen, J & Warmuth, M (1997). Additive versus exponentiated gradient updates for linear prediction. Journal of Information and Computation 132, 1–64.
[15] Dempster, AP, Laird, NM & Rubin, DB (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. 39, 1–38.
[16] Saul, L & Pereira, F (1997). Aggregate and mixed-order Markov models for statistical language processing. In C. Cardie and R. Weischedel (eds). Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, 81–89. ACL Press.
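
As an illustration of the update rules in Theorems 1 and 2 of the NMF paper above, the following is a minimal sketch in plain Python rather than the authors' implementation. The helper names (`matmul`, `transpose`, `ratio`, `update_euclidean`, `update_divergence`) are made up for this example. It assumes that V is strictly positive and that W and H are initialized strictly positive, so that no denominator is zero.

```python
import math

def matmul(A, B):
    """Plain-Python product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def euclidean_cost(V, W, H):
    """Squared Euclidean distance ||V - WH||^2, Eq. (2)."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def divergence_cost(V, W, H):
    """Generalized KL divergence D(V || WH), Eq. (3); assumes V > 0 and WH > 0."""
    WH = matmul(W, H)
    return sum(v * math.log(v / x) - v + x
               for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def update_euclidean(V, W, H):
    """One sweep of the Theorem 1 rules: update H, then W with the new H."""
    Wt = transpose(W)
    num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
    H = [[h * n / d for h, n, d in zip(rh, rn, rd)] for rh, rn, rd in zip(H, num, den)]
    Ht = transpose(H)
    num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
    W = [[w * n / d for w, n, d in zip(rw, rn, rd)] for rw, rn, rd in zip(W, num, den)]
    return W, H

def ratio(V, W, H):
    """Elementwise V / (WH), used by both divergence updates."""
    WH = matmul(W, H)
    return [[v / x for v, x in zip(rv, rx)] for rv, rx in zip(V, WH)]

def update_divergence(V, W, H):
    """One sweep of the Theorem 2 rules: update H, then W with the new H."""
    num = matmul(transpose(W), ratio(V, W, H))   # sum_i W_ia V_imu / (WH)_imu
    col_sums = [sum(col) for col in zip(*W)]     # sum_k W_ka, one per basis index a
    H = [[h * n / s for h, n in zip(rh, rn)] for rh, rn, s in zip(H, num, col_sums)]
    num = matmul(ratio(V, W, H), transpose(H))   # sum_mu H_amu V_imu / (WH)_imu
    row_sums = [sum(row) for row in H]           # sum_nu H_anu, one per basis index a
    W = [[w * n / s for w, n, s in zip(rw, rn, row_sums)] for rw, rn in zip(W, num)]
    return W, H
```

Calling `W, H = update_euclidean(V, W, H)` in a loop and tracking `euclidean_cost(V, W, H)` gives a simple way to observe the nonincreasing behavior the theorems describe; a production implementation would more likely use a numerical array library.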

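Summary of Bayesian network structure learning

Bayesian network structure learning with complete data sets:

Methods based on statistical dependency analysis: these usually apply statistical or information-theoretic methods to analyze the dependencies among variables and thereby obtain the optimal network structure. Research on dependency-analysis methods falls into three groups:

Decomposition-based methods (relying on the existence of V-structures)
  Decomposition of search for v-structures in DAGs
  Decomposition of structural learning about directed acyclic graphs
  Structural learning of chain graphs via decomposition

Markov blanket based methods
  Using Markov blankets for causal structure learning
  Learning Bayesian network structure using Markov blanket decomposition

Methods based on restricting the structure space
  Bayesian network learning algorithms using structural restrictions (combines these restrictions with the PC algorithm to obtain an improved algorithm with more efficient structure learning) (the restrictions identified by Campos are: 1. an undirected or directed edge must exist; 2. an undirected or directed edge must not exist; 3. a partial ordering of some nodes)

Commonly used algorithms:

SGS: determines the network structure from the conditional independence relations between nodes.

PC: exploits the fact that nodes in sparse networks do not require high-order independence tests and proposes a pruning strategy: starting with 0-order independence tests and moving up to higher-order tests, it prunes the connections between nodes in the initial network.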
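
The PC pruning strategy summarized above can be sketched as follows. This is a minimal illustration rather than a reference implementation: `pc_skeleton` and the `independent(x, y, cond)` oracle are hypothetical names, and in practice the oracle would be a statistical conditional-independence test on data. Only the skeleton (undirected edges) is recovered; edge orientation is not shown.

```python
from itertools import combinations

def pc_skeleton(nodes, independent):
    """Prune a complete undirected graph with tests of increasing order.

    `independent(x, y, cond)` is an assumed oracle returning True when x and y
    are judged conditionally independent given the set `cond`.
    """
    adj = {v: set(nodes) - {v} for v in nodes}   # start fully connected
    sepset = {}                                  # separating set for each removed edge
    order = 0
    # Stop once no node has enough neighbours to form a conditioning set of this size.
    while any(len(adj[x]) - 1 >= order for x in nodes):
        for x in nodes:
            for y in sorted(adj[x]):
                others = sorted(adj[x] - {y})
                for cond in combinations(others, order):
                    if independent(x, y, set(cond)):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(cond)
                        break
        order += 1
    edges = {frozenset((x, y)) for x in nodes for y in adj[x]}
    return edges, sepset
```

Because low-order tests remove many edges early in a sparse network, few node pairs ever reach the expensive high-order tests, which is the saving the notes attribute to PC.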


Advances in Applied Mathematics, 2023, 12(6), 2788-2801
Published Online June 2023 in Hans. https:///journal/aam https:///10.12677/aam.2023.126280

Empirical Modal Decomposition of Single-Channel Respiratory Signals for Automatic Sleep Staging

Yuxin Bai, Rongqian Linghu
College of Science, North China University of Technology, Beijing

Received: May 16th, 2023; accepted: Jun. 9th, 2023; published: Jun. 16th, 2023

Abstract

Sleep is a basic physiological need of the body: it ensures growth and development, stores energy, and maintains immunity. Accurate assessment of sleep quality is the key to recognizing sleep disorders and taking effective interventions. Manual sleep staging is time consuming and subjective even when performed by experienced sleep specialists. Researchers have therefore proposed a number of accurate, effective, and targeted sleep staging methods. For example, a single-channel automatic sleep staging method based on deep learning and empirical modal decomposition algorithms has been successfully used for respiratory signal (RESP) sleep staging, which provides a new way to decompose respiratory signals and identify sleep stages automatically. The data set used in this paper is from SHHS, a multi-center cohort study designed to determine the cardiovascular and other consequences of sleep-disordered breathing. First, we analyzed the single-channel respiratory signals from the SHHS database to better understand human sleep. Second, the pre-processed respiratory signals were decomposed using an empirical modal decomposition algorithm (EMD), and nine features covering the time domain, nonlinear dynamics, and statistics were extracted from the original respiratory signals and from the six simpler component signals produced by the decomposition. Finally, a classification model was constructed using a long short-term memory network (LSTM) to classify the extracted respiratory signal features for automatic sleep staging.

The experimental results show that the accuracy of automatic sleep staging of respiratory signals from the SHHS database is 89.22% and 88.43% in the 4-class and 5-class sleep staging tasks, respectively. They also show that the automatic sleep staging model proposed in this paper has high classification accuracy and efficiency, and has strong applicability and stability.

Keywords

Empirical Modal Decomposition Algorithm, Long Short-Term Memory Network LSTM, Respiratory Signal, Feature Extraction, Sleep Stage Classification

This work is licensed under the Creative Commons Attribution International License (CC BY 4.0). /licenses/by/4.0/

1. Introduction

Sleep is one of the standards by which human quality of life and physical health are evaluated, and understanding sleep quality and structure is essential to human health.

A Low-Complexity Orthogonality Defect Threshold Lattice-Reduction Precoding Algorithm for MIMO

Wang Wei; Li Yongzhao; Zhang Hailin

Abstract: To reduce the complexity of lattice reduction aided (LRA) precoding, a low-complexity LRA precoding algorithm based on the orthogonality defect threshold is proposed. We introduce the orthogonality defect (od) threshold as an early-termination condition into the lattice reduction (LR) algorithm, which reduces computational complexity by adaptively terminating the LR processing early. In addition, sorted QR decomposition of the channel matrix is used to raise the probability of early termination, which further reduces computational complexity. Moreover, to achieve a favorable tradeoff between performance and complexity, we define a power loss factor (PLF) to optimize the od threshold. Simulation results show that the proposed algorithm achieves significant complexity savings with nearly the same bit-error-rate (BER) performance as the traditional LRA precoding algorithm.

formulated as

minimize $f(x(\hat{\lambda}, \hat{\mu}))$, (3a)
subject to $(\hat{\lambda}, \hat{\mu}) \in \hat{\Lambda}$. (3b)

Alternately, a profitable extreme point or direction of $X$ is generated through the solution of an approximation of (1), in which $f$ is replaced by its first-order, linear approximation, $y \mapsto f(x) + \nabla f(x)^T (y - x)$, defined at the solution, $x$, to the RMP (3), that is, by the problem

minimize $\nabla f(x)^T y$, (4a)
subject to $y \in X$; (4b)

this approximate problem is a linear programming problem, which in general is much easier to solve than the original one. (This is called the column generation subproblem, and corresponds to the decomposition step in some descriptions of column generation methods.) If the solution to this problem lies within the current inner approximation, then the conclusion is that the current solution, $x$, is optimal in (1), since, then, $\nabla f(x)^T (y - x) \ge 0$ must hold for all $y \in X$. Otherwise, $\hat{P}$ or $\hat{D}$ is augmented by a new element, the resulting inner approximation is improved (that is, enlarged), and the solution to the new RMP has a strictly lower objective value than the previous one; the latter result follows since the strict inequality $\nabla f(x)^T d < 0$ holds (that is, $d$ defines a direction of descent with respect to $f$ at $x$), where $d$ denotes either the direction $d := y - x$ towards the new extreme point $y$ or an extreme direction. The iteration is then repeated with the solution of a new column generation subproblem defined at the solution to the RMP.

In the method of [31], Carathéodory's Theorem is utilized in the validation of a column dropping rule, according to which any extreme point or direction whose weight in the expression of the solution $x$ to the RMP is zero is removed; thanks to the finiteness of $P$ and $D$ and the strictly decreasing values of $f$, the convergence of the SD algorithm in the number of RMP is finite. (In the case of non-polyhedral sets, it was observed in [11] that von Hohenbalken's original procedure does not necessarily converge.
Their remedy is the introduction of a safe-guarding step which

Algorithm Comparison and Application of the IDA Model of Carbon Emissions

Cheng Yutai; Zhang Najun

Journal: 统计与信息论坛 (Statistics & Information Forum)
Year (volume), issue: 2017 (032) 005

Abstract: This paper analyzes the basic theoretical framework of the index decomposition analysis (IDA) model of carbon emissions and the structure and characteristics of its various algorithms, and illustrates the application of each algorithm with an empirical analysis of data for China from 1991 to 2014. Based on a comprehensive evaluation of applicability and effectiveness, it offers reference information for algorithm selection: the factor effects produced by the LD algorithm are easy to understand, but the decomposition leaves a residual term; the RLD and Shapley algorithms achieve complete decomposition of the factors and are essentially consistent with each other; the GFI algorithm is suited to the complete decomposition of multi-factor effects, but its computation is complex; the AMDI and AWD algorithms are constrained by residual terms and by restrictions on logarithmic weight assignment; and the LMDI I algorithm has a flexible decomposition form and decomposes the factors completely.

Pages: 8 (P10-17)
Authors: Cheng Yutai; Zhang Najun
Affiliation: Department of Statistics, Tianjin University of Finance and Economics, Tianjin 300222
Language: Chinese
Classification: F222.1; C812

Related literature
1. Prediction model of carbon emissions in the building construction process and its application [J], Liu Jialin; Ma Peng
2. Algorithm comparison and application of the SDA model of carbon emissions [J], Zhang Najun; Cheng Yutai
3. Application of mathematical models in the measurement and forecasting of carbon emissions [J], Li Wanting; Song Nanzhe; Shen Yingcai; Xing Jie
4. Forecasting China's total carbon emissions and their structure during the 14th Five-Year Plan period, based on the mixed-frequency ADL-MIDAS model [J], He Yongda; Wen Hong; Sun Chuanwang
5. Comparison of the forecasting accuracy of the MIDAS and EQW models, using the economic growth effect of asset prices as an example [J], Wang Chunzhi; Zhao Guojie; Wang Weiguo; Yu Yang
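
The complete-decomposition property that the abstract attributes to LMDI I can be illustrated with a small sketch. It assumes the simplest setting, a single-sector identity in which emissions are the product of a few factors, and uses the standard additive LMDI-I weights built from the logarithmic mean. The function names and the factor values are illustrative assumptions, not data or code from the paper.

```python
import math

def log_mean(a, b):
    """Logarithmic mean L(a, b) = (a - b) / (ln a - ln b), with L(a, a) = a."""
    if a == b:
        return a
    return (a - b) / (math.log(a) - math.log(b))

def lmdi_additive(factors_0, factors_t):
    """Additive LMDI-I effect of each factor for an identity C = product of factors.

    Both arguments map factor names to strictly positive values in the base
    period and the target period.
    """
    c0 = math.prod(factors_0.values())
    ct = math.prod(factors_t.values())
    weight = log_mean(ct, c0)
    return {name: weight * math.log(factors_t[name] / factors_0[name])
            for name in factors_0}

# Illustrative (made-up) values: emissions = population * GDP per capita * carbon intensity.
base = {"population": 100.0, "gdp_per_capita": 2.0, "carbon_intensity": 0.5}
target = {"population": 110.0, "gdp_per_capita": 2.5, "carbon_intensity": 0.4}
effects = lmdi_additive(base, target)
total_change = math.prod(target.values()) - math.prod(base.values())
residual = total_change - sum(effects.values())  # zero up to rounding error
```

The effects sum to the total change because the logarithms of the factor ratios add up to the logarithm of the emissions ratio, which the logarithmic-mean weight converts back into the absolute difference; this is why no residual term remains.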
Ronald Parr
Computer Science Department, Stanford University, Stanford, CA 94305-9010
parr@

Abstract

This paper presents two new approaches to decomposing and solving large Markov decision problems (MDPs), a partial decoupling method and a complete decoupling method. In these approaches, a large, stochastic decision problem is divided into smaller pieces. The first approach builds a cache of policies for each part of the problem independently, and then combines the pieces in a separate, light-weight step. A second approach also divides the problem into smaller pieces, but information is communicated between the different problem pieces, allowing intelligent decisions to be made about which piece requires the most attention. Both approaches can be used to find optimal policies or approximately optimal policies with provable bounds. These algorithms also provide a framework for the efficient transfer of knowledge across problems that share similar structure.

1 Introduction

The Markov Decision Problem (MDP) framework provides a formal framework for modeling a large variety of stochastic, sequential decision problems. It is a well-understood framework with well-known on-line and off-line algorithms for determining optimal behavior (see e.g.
Puterman (1994)). The limitations of this framework are also well-known: compliance with the Markov property generally requires a very fine grained description of the environment, i.e., a very large number of states. One of the main research thrusts for MDPs has been the development of methods for large state spaces.

A major complicating factor in this line of research is the apparent non-decomposability of MDPs: the utility or value of any state can, in general, be affected indirectly by the cost structure and the dynamics of any other state. This thwarts efforts to decompose MDPs into completely independent subproblems and complicates efforts to reduce computation time through parallelization.

While some progress has been made on understanding some very special cases where MDPs may be decomposed into independent subproblems (Singh, 1992; Lin, 1997), much of the effort has focused on methods that decompose MDPs into "communicating" subproblems (Bertsekas & Tsitsiklis, 1989; Dean & Lin, 1995). In these iterative methods, information about subproblem solutions is communicated to neighboring subproblems. The solution for each subproblem may need to be updated many times until a globally optimal solution is obtained.

This paper considers a special, but fairly general class of problem decompositions where each subproblem is "weakly" coupled with the neighboring subproblems. This means that the number of states connecting the two subproblems is small, a relationship that appears naturally in many problems. For example, the problem of moving from one's office to one's house has this structure: one's office is a small region that is connected by a much smaller region, the door, to an external corridor. Many other offices may be connected to this corridor, each with a similar structure. The corridor could be fairly large and connected to other corridors by relatively small intersection regions. Most buildings have a small number of doorways that connect them to the streets outside. Each street has a relatively
small number of points where it connects to other streets. One such street connects to the house one calls home, which is itself an aggregation of weakly connected pieces. An MDP is weakly coupled if it can be divided into two or more subproblems that are weakly coupled with each other. Figure 1 shows a simple navigation MDP divided into four rooms, each of which can be considered a subproblem.

This paper uses a similar approach to that used in communicating MDP solution methods, but aims to avoid iteratively updating solutions to subproblems by building a set of policies independently for each subproblem. Each set of policies is called a cache. The caches are constructed in such a way that they are guaranteed a priori to provide performance within a constant of the optimal, regardless of the structure of the other subproblems. This permits a complete decoupling of the MDP into independent subproblems that can be solved in parallel and then recombined in a light-weight step.

The decoupling process is based upon the observation that any policy over a region of state space defines a linear function for the values of the states inside the region in terms of the values of the states outside the region (see, for example, Parr (1998)). The linear relationship is exploited by the algorithms in this paper to build caches for each region of the state space. The caches are built iteratively by constructing linear programs that discover the values of the states outside the region for which the cache performs the worst, then adding a new policy to the cache to cover the worst case.

[Figure 1 diagram: a grid divided into Room 1, Room 2, Room 3, and Room 4, with connecting states marked X and a reward marked $.]
Figure 1: A weakly coupled MDP. There is a reward in room [...], indicated with a $. Connecting states are identified with an X. Similar examples and pictures are used by Precup and Sutton (1997) and Hauskrecht et al. (1998).

The efficient manipulation of policy caches also provides a formal basis for the transfer of knowledge across problems with similar substructures. The simplest case of
this occurs when the reward structure for a problem changes. Suppose, for example, that the reward in the navigation problem is moved from room [...] to room [...]. Policy caches devised for rooms [...] and [...] can be transferred to the new problem. Similarly, if one's destination is now a cafe instead of home, the policies designed for one's office and the containing building should transfer to the new problem.

Since the number of possible policies for a subproblem is exponential in the number of states in the subproblem, there may exist problems and accuracy requirements for which the size of the policy cache will be exponential. In these cases there still will be some benefit to constructing a small policy cache, even if it does not provide the desired accuracy guarantees. This paper presents an algorithm that augments standard communicating MDP algorithms with the use of a policy cache. The policy cache can be used to determine lower and upper bounds on the values that states in the subproblem can assume, and this provides a means of deciding when it is worth using a cached solution and when it is worth producing a new subproblem solution. This is particularly useful in determining if subproblem solutions from a related problem can be applied to a new one.

2 Markov Decision Problems

To review the basic MDP framework, an MDP is a 4-tuple, $(S, A, T, R)$, where $S$ is a set of states, $A$ is a set of actions, $T$ is a transition model mapping $S \times A \times S$ into probabilities in $[0, 1]$, and $R$ is a reward function mapping [...] into real-valued rewards. Algorithms for solving MDPs can return a policy, $\pi$, that maps from $S$ to $A$, or a real-valued value function $V$. In this paper, the focus is on infinite-horizon MDPs with a discount factor [...]. The aim in these problems is to find an optimal policy, $\pi^*$, that maximizes the expected discounted total reward of the agent, or to find an approximately optimal policy that comes within some bound of optimal.

Value iteration, policy iteration or linear programming can be used to determine the optimal policy for an MDP. These algorithms all
use some form of the Bellman equation (Bellman, 1957): [...]

When the Bellman equation is satisfied, the maximizing action for each state is the optimal action. For a particular policy, the Bellman equation becomes a system of linear equations: [...]

These can be solved to determine $V_\pi$, the value of following $\pi$ from any state. The Bellman error for a particular policy at a particular state is the difference between the value function for that policy and the right-hand side of the Bellman equation: [...]

For any policy, the maximum Bellman error over all states, [...], is a well-known bound on the distance from the optimal value function (Williams & Baird, 1993): [...]

[...] assignment of policies to regions can be determined by solving a "high-level" reduced decision problem defined over only the states in the out-spaces of the regions. This reduced decision problem removes all but the out-space states from the problem. Actions in the reduced problem correspond to assignments of policies to regions in the original decision problem. This transformation is the basic insight of Forestier and Varaiya (1978) and it follows as a special case of the hierarchical results in Parr and Russell (1997). The approach is also investigated in Hauskrecht et al. (1998).
This type of problem can also be viewed as a semi-Markov decision problem (SMDP), where each low-level policy becomes a primitive SMDP action, as in Parr (1998).

In Figure 1, the high level problem would contain just the eight specially marked states. An action in the high level problem would correspond to a decision to adopt some policy from the room's cache upon entering the room, and staying with this policy until the next out-space state is reached. The solution to the high-level problem may produce a non-stationary policy at the low level, which means that the actions taken in any room may depend upon the manner in which the room is entered. A non-stationary policy of this type can be converted easily to a stationary policy that is at least as good (Parr, 1998).

The relationship between the size of the out-spaces and the complexity of the high-level problem should make the importance of weak coupling clear. If the size of the out-spaces approaches the size of the original MDP, then the high-level decision problem that combines the cached subproblem solutions will be as difficult as the original MDP.
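
For reference, the Bellman backup reviewed in Section 2, which is also what a solver for the high-level problem would apply, can be sketched as follows. The displayed equations did not survive extraction, so this is a textbook form under stated assumptions and not necessarily the paper's exact notation: rewards are taken to depend on the state and action, `T[s][a]` is a list of `(probability, next_state)` pairs, and `gamma` is the discount factor. All names are hypothetical.

```python
def bellman_backup(V, T, R, gamma):
    """Apply the Bellman equation once; return new values and the greedy policy."""
    new_V, policy = {}, {}
    for s in T:
        # Expected discounted return of each action under the current estimate V.
        q = {a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a]) for a in T[s]}
        policy[s] = max(q, key=q.get)
        new_V[s] = q[policy[s]]
    return new_V, policy

def value_iteration(T, R, gamma, tol=1e-10):
    """Iterate backups until the largest change in any state's value is below tol."""
    V = {s: 0.0 for s in T}
    while True:
        new_V, policy = bellman_backup(V, T, R, gamma)
        residual = max(abs(new_V[s] - V[s]) for s in T)  # Bellman residual of V
        V = new_V
        if residual < tol:
            return V, policy

# A two-state toy problem (made up): "b" pays 1 per step forever, "a" can move to "b".
T = {"a": {"stay": [(1.0, "a")], "go": [(1.0, "b")]},
     "b": {"stay": [(1.0, "b")]}}
R = {"a": {"stay": 0.0, "go": 0.0}, "b": {"stay": 1.0}}
values, greedy = value_iteration(T, R, gamma=0.9)
```

The residual computed in the loop plays the role of the maximum Bellman error mentioned above: when it is small, the current value estimate is close to the optimal value function.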
An algorithm that completely decomposed an MDP would produce a policy cache for each region, combine these to produce an optimal or approximately optimal overall solution, and never need to revise any of the caches. Unless the [...] are chosen very carefully, or the caches are very large, combinations of policies in the initial policy caches may not suffice.

There are several approaches to revising the policy caches. One extreme end of this spectrum is the approach in Sutton, Precup, and Singh (1998), where policies and low-level actions are mixed together in the same SMDP. This sacrifices the reduction in computational complexity obtained from solving a reduced decision problem in favor of a guarantee of obtaining optimality. Another approach considered by Dean and Lin (1995) updates each [...] directly. Dean and Lin considered a special case in which the old policies were discarded at each iteration, and a new policy was computed for each region based upon the high-level decision problem's current estimates for the value of the out-space states.
The approach advocated by Dean and Lin is guaranteed to converge to the optimal policy. However, it is just one special case of a general class of methods that must converge. Any reasonable scheme that improves the policies in the regions and propagates those improvements through the high-level decision problem is guaranteed to produce an optimal policy as long as no regions "starve", i.e., never have their policies improved. This result follows directly from the observation that the high-level problem of assigning policies to rooms is really just an SMDP where the set of permitted actions for the SMDP is just the set of possible policies defined over regions.

The algorithms in this paper all aim to minimize the number of policies that are computed for MDP subproblems. The extent to which this can be minimized is a measure of how effectively an MDP has been decomposed. If each subproblem requires only a small cache of candidate solutions, this means that the subproblem solutions are relatively independent. These are precisely the situations in which a large computational benefit is reaped from decomposition, since the MDP can be divided and conquered by solving a reasonable number of small subproblems. The size of the policy caches also gives some measure of the parallelizability of the problem. If a region can be solved with a small cache of policies, this suggests that the entire cache could be constructed a priori as a completely independent subprocess.

The following section describes several algorithms for constructing policy caches for subproblems with minimal assumptions about the rest of the MDP. These algorithms aim to minimize the size of the cache, while ensuring that solutions using the cache will be within a bound of optimal.
The succeeding section describes a scheme for working with policy caches for which optimality bounds have not been established a priori. This method efficiently establishes bounds on the benefit of adding a new policy to a cache, based upon the current contents of the cache.

4 Complete decoupling

This section presents algorithms that find a policy cache for a particular region such that the cache is guaranteed to provide policies that are within a constant of optimal when a high-level problem using the cache for the region is solved. The only assumption made about the regions to which the region connects is that their states assume values on a known bounded interval. Define $V_o$ as a vector of values that the states in the out-space of the region can take on (the region subscript will be dropped when there can be no confusion about the region in question). The fan-out of a region is defined as the dimension of this vector.

In addition to storing a cache of policies, it is useful to store a cache of value functions, one for each cached policy. Each is a linear function that provides the value of any state in the region as a linear function of $V_o$. For any policy these functions can be determined by solving a system of linear equations (see Parr (1998)).

The goal in constructing a policy cache for a region is to produce a cache such that for every possible value of the corresponding out-space states, there is a policy in the region’s cache for which the performance in the region will be within a bound of optimal. A policy for a region is optimal with respect to $V_o$ if it is the solution to the MDP defined just over the states in the region, with the assumption that states in the out-space of the region are absorbing states with values locked at the value of the corresponding entry in $V_o$. In one room of the four-room example, the optimal policy for a given $V_o$ would be determined by solving an MDP with just the states in that room and the two connecting states in the adjacent rooms. The value of each connecting state would be treated as a constant given by the corresponding entry of $V_o$. A policy is said to
be $\epsilon$-optimal with respect to $V_o$ if its value at every state in the region is within $\epsilon$ of the optimal value when the values of the states in the out-space of the region are fixed by $V_o$. For any state and any value of $V_o$, there must be one policy in the cache that appears at least as good as all of the others. A policy dominates at a state for a particular $V_o$ if its cached value function assigns that state a value at least as large as that of every other policy in the cache. This means that the low-level policy appears to be the best high-level action at that state for a particular $V_o$. A cache of policies is optimal at a state if, for any $V_o$, the dominating policy is optimal. A cache of policies is optimal if it is optimal at all states in the in-space of the region.

Theorem 1 If an MDP is divided into regions, and an optimal cache of policies is determined for each region, these policies can be combined to produce a globally [...] optimality.

[...], where [...] is the fan-out of the region. This will be unmanageable unless the range of values is very small, the fan-out of the region is very small, or $\epsilon$ is very large.

4.2 Value Space Search

This section presents an algorithm that aims to avoid constructing an exponential number of policies by searching through the space of out-space values $V_o$ to find a point at which the current policy cache is not adequate. If such a point is found, a new policy is added to the cache, and the process is repeated until no points can be found for which the current cache is inadequate. The following formal results are the basis of the value space search algorithm:

Lemma 1 For any state, the dominating policies at that state form a piecewise-linear convex function of $V_o$.

Proof: This follows from the observation that using the dominating policy means taking the maximum over a set of linear policy functions.

For this example, the top-left state is treated as if it were an in-space state even though there is no entrance to the room in that area. This keeps the value surfaces corresponding to policies displayable in two dimensions.

[Plot: value of top-left state against value of out-state, with lines labeled "Optimal for Vo = 0" and "Optimal for Vo = 2.0".]

Figure 3: The optimal policy for $V_o = 0$ avoids the exit, making the value of the top-left state nearly independent of the value of the exit. The optimal policy for $V_o = 2.0$ goes directly for the exit and has a strong dependence on $V_o$.

[...] can be determined in time that is polynomial in [...].

Proof: This is achieved by means of a linear program. For all states in the in-space of the region, for all [...] in [...], for all [...], and for all [...], the following linear program is solved:

Maximize: [...]
Subject to: [...]

Note that the free variables in the system are the components of $V_o$. The objective function maximizes the Bellman error at the state under the assumption that the given action is taken. The first set of constraints identifies the region in $V_o$ space for which the given policy dominates at the state. If this region exists, it is guaranteed to be a single, continuous facet of a convex surface by Lemma 1. The last set of constraints bounds $V_o$ to be within the range of possible values. The largest value returned by the linear program over all [...] provides the point at which the current cache of policies will have the largest Bellman error. The time bound is satisfied because linear programming is polynomial in the size of its inputs.

[...] point at which the error in a set of infinite horizon MDP value functions is the largest.

The value space search algorithm was used to find an optimal policy cache for room [...] of Figure 1 using the same action model and discount factor as used for the one-exit problem. This subproblem contains [...] states and has a fan-out of [...]. Possible actions are right, left, up and down, but these actions are unreliable, resulting in movement in one of the three other axis-parallel directions [...] of the time. The discount factor used was [...]. There are [...] possible policies for this subproblem. Of course, many of these are unreasonable policies that, for example, move the agent in circles. However, as in the one-exit example, a variety of policies can still be induced by different values of the out-space
states. If the values of the states are assumed to be on [...], then the $\epsilon$-grid approach for this problem would require [...] million policies for [...]. The value space search algorithm produced a policy cache with the same optimality guarantees with just [...] policies. For [...], the $\epsilon$-grid approach would require [...] million policies, while the value space search algorithm produced the same policies.

In this particular case, the value space search algorithm has captured the intuition that this type of subproblem should not be that hard. A few seconds of computation has produced a small cache of policies that will ensure a nearly optimal solution for this region no matter what happens in any connecting region. This small subproblem is now decoupled and completely solved, at least for [...] and for problems where the neighboring states can assume values on [...]. Any MDP satisfying these conditions and with an optimality requirement of no more than [...] of optimal.

This could be a problem, however, if the agent typically starts in some state that is not a high-level state. In such cases, the starting position of the agent can be treated as if it were a connecting state by adding it to the in-space of the enclosing region and constructing a policy cache as if it were a connecting state. If desired, every state could be treated as if it were an in-space state, ensuring full low-level optimality as well.

The algorithm presented in this section has run time that is exponential in the fan-out of the region, but unlike the $\epsilon$-grid approach, it does not depend explicitly on [...], and unlike the value space search algorithm, it can avoid considering every state inside of a region if high-level $\epsilon$-optimality is sufficient. The algorithm relies upon the following formal results:

Lemma 2 For any point $V_o$ and any set of points in value space with a corresponding set of policies, such that each policy is optimal with respect to its point and such that the points form a convex hull around $V_o$, the value of the optimal policy with respect to $V_o$ at any state is bounded from below by the value of the dominating cached policy and from above by the hyperplane containing the optimal values at each of the points.
Proof: Bounding from below is obvious and follows from Lemma 1: the optimal policy at any point must do at least as well as the dominating policy in the cache. The bound from above is somewhat more subtle. Let the upper hyperplane be the hyperplane containing the optimal values at the cached points. Suppose that there exists some $V_o$ inside the hull and a corresponding optimal policy such that, for some state, the value of this policy is above the upper hyperplane. Consider the hyperplane corresponding to the linear value function of this policy at that state. There must exist some corner of the convex hull used to create the upper hyperplane (one of the cached points) where this policy's hyperplane is above the upper hyperplane, i.e., where this policy's value exceeds the value of the policy cached for that corner. However, the cached policy is known to be optimal with respect to that corner, so this is a contradiction.

[Plot: value of top-left state against value of out-state, with lines labeled "Optimal for Vo = 0", "Optimal for Vo = 2.0", and "Upper bounding hull".]

Figure 4: Two policies, and an upper surface bounding their distance from optimality [...]. For any $V_o$, the linear function for the optimal policy cannot cross [...] at [...] or cross 1.84 at [...] = [1.84]. Thus, the value of the optimal policy is bounded by the line shown.

Theorem 3 For a region and a cache of policies that are optimal at the corresponding points in value space, the optimal policy value for any state with respect to any $V_o$ is bounded from below by the convex surface formed by the maximum over the corresponding linear value functions and bounded from above by the convex hull containing the points: [...].

Proof: The bound from below is a direct consequence of Lemma 1. The bound from above follows from Lemma 2 and noting that the lowest bounding hyperplane for any $V_o$ must form a facet in the convex hull of [...].

[...] facets, making this algorithm exponential in the fan-out. Still, the convex hull bounding algorithm is superior to the $\epsilon$-grid approach since the grid approach has run time that depends directly on [...]

[...] cache for [...], or generating a new policy that is optimal for the algorithm’s current estimate of $V_o$. A straightforward way to answer this question would be to use the cached functions to assign values to every state in the subproblem and then compute the Bellman error for each state.
However, this approach would require so much computation that it would essentially defeat the purpose of solving a high-level problem. Instead, high-level optimality can be checked quite efficiently by using the tools of the convex hull bounding algorithm.

Starting with some policy cache, the elements of which are optimal at the corresponding points in value space, for any particular $V_o$, the value of any state under the optimal policy with respect to $V_o$ is bounded from below by the value of the dominating cached policy, and the value is bounded from above by the convex hull formed by the optimal values at those points (Theorem 3). The situation here is slightly different from the bounding algorithm in that $V_o$ is fixed and known. Instead of a high-dimensional convex hull problem, the bounds for a particular $V_o$ can be determined by solving a linear program. In the following, [...] is an unknown linear equation, i.e., the coefficients and constant are free variables:

Maximize: [...]
Subject to: [...]

To reassure oneself that this is indeed a linear program, recall that in this context $V_o$, the points at which the cached policies are optimal, and the coefficients and constants for the cached value functions are all known constants. The only variables are the components of the unknown linear equation. The first set of constraints requires that the unknown linear equation be no better than the optimal policy at points in value space where the optimal policy is known.
This is, essentially, a restatement of Lemma 2. The second set of constraints requires that the unknown linear equation never exceeds the maximum value any state can assume in this problem. Thus, the objective function forces the linear program to find the highest hyperplane that does not violate Lemma 2 or the bound on state values. If $V_o$ lies in the convex hull of the cached points, then the solution will be the facet of the upper-bounding convex hull from Theorem 3. Note that if $V_o$ does not lie in the convex hull, the bound on state values will be returned. This bound can be tightened by requiring that the constant of the linear equation be no larger than the value of the optimal policy at [...] and that the coefficients of the linear equation sum to be no more than [...].

If the distance between the dominating policy and the upper bound returned by the above linear program is less than $\epsilon$ for every state in the in-space of the region, then the policy cache for the region is sufficient to produce a high-level $\epsilon$-optimal policy for the current value of $V_o$. This means that a high-level decision problem can, for now, avoid updating the policy for the region and focus attention on other regions. This decision will need to be reevaluated as values of the states in the out-space of the region change. One way to view this result is that it enables a form of high-level prioritized sweeping (Moore & Atkeson, 1993; Andre, Friedman, & Parr, 1997).
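For a region with fan-out 1, the adequacy check described above can be illustrated without a linear programming solver, because the highest line that stays at or below the known optimal values reduces to linear interpolation between the two nearest cached points. The following is a minimal sketch under that assumption, not the implementation used in this work. The function names, the `(v_i, slope, intercept)` layout of a cache entry, and the numeric values are all hypothetical, and the additional constraint on the maximum state value is omitted.

```python
from bisect import bisect_left

# Hypothetical cache entry for one in-space state: (v_i, slope, intercept).
# The policy is assumed optimal when the single out-space value equals v_i,
# and its value at the in-space state is slope * v + intercept.

def lower_bound(cache, v):
    """Value of the dominating cached policy at out-space value v (Lemma 1)."""
    return max(slope * v + intercept for (_, slope, intercept) in cache)

def upper_bound(cache, v):
    """Chord between the nearest points where the optimal value is known.

    One-dimensional stand-in for the linear program in the text; it applies
    only when v lies inside the hull of the cached points.
    """
    points = sorted((v_i, slope * v_i + intercept)
                    for (v_i, slope, intercept) in cache)
    xs = [x for x, _ in points]
    if not xs[0] <= v <= xs[-1]:
        raise ValueError("v lies outside the hull of the cached points")
    k = bisect_left(xs, v)
    if xs[k] == v:
        return points[k][1]  # optimal value is known exactly at this point
    (x0, y0), (x1, y1) = points[k - 1], points[k]
    t = (v - x0) / (x1 - x0)
    return (1 - t) * y0 + t * y1

def cache_is_adequate(cache, v, eps):
    """True if the gap between the two bounds is less than eps at v."""
    return upper_bound(cache, v) - lower_bound(cache, v) < eps

# Illustrative values only: one policy optimal at v = 0, one at v = 2.
example_cache = [(0.0, 0.1, 1.0), (2.0, 0.9, 0.0)]
```

With the illustrative `example_cache`, the gap at `v = 1.0` is about 0.3, so `cache_is_adequate(example_cache, 1.0, 0.5)` holds while a requirement of `eps = 0.1` would call for a new policy. For larger fan-out the interpolation step would have to be replaced by the linear program itself.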
This result also has significant consequences for the transfer of knowledge across problems. Suppose, for example, that a particular model substructure appears in many different problems. Consider a larger version of the four-room problem with many interconnected rooms. Different tasks in this domain would correspond to different positions of the reward in different rooms. Every time a policy is produced for a room it can be added to the room’s policy cache. The above linear program can be used to determine quickly if, for some new problem, the cache in a particular room is adequate. Thus, a form of cross-task learning is achieved where the time required to plan for new objectives declines as experience is gained with the environment. Moreover, intelligent allocation of computational resources will be possible since parts of the value space that have already been mastered will no longer drain CPU time.

6 Conclusion

This paper presented two approaches to decoupling MDPs, a complete decoupling approach and a partial decoupling approach. With complete decoupling, the problem is divided into independent subproblems, and the solutions to these subproblems are combined in a light-weight step. Two new algorithms for determining optimal policy caches for a subproblem are presented. The significance of the first algorithm is that it uses a polynomial time test to decide when to add new policies to the cache. The second algorithm uses a computational geometry approach that can be exponential in the fan-out of the subproblem, but can be more efficient than the first algorithm if the fan-out is small.
Since complete decoupling may not always be possible, a method for partial decoupling is presented. This method assumes that an imperfect policy cache is used by a high-level asynchronous MDP algorithm. It uses the policy cache to bound the optimal values of states in the in-space of a region with respect to the values of the states in the out-space of the region. By providing upper and lower bounds, this permits intelligent decisions about when to update the policy cache for a region based upon the algorithm’s current estimate of the values of the states in the out-space of the region.

Together these results provide a framework for large-scale parallelization of MDPs and a formal framework for the transfer of knowledge across problems that share common structures. These results can be applied hierarchically, although the optimality requirements for the subproblems will become stricter with each division if the same level of optimality is to be maintained at the top level.

This work does not address the questions of state abstraction or value function approximation. Fortunately, these techniques will complement the results presented here. The decoupled MDP algorithms will benefit from any approach that compresses the state space, especially if the compression reduces the fan-out of the regions in some decomposition of the space.

A limitation of this work is that it applies mainly to a restricted class of MDPs, those that are weakly coupled.
Moreover, the efficiency of the methods described here will depend heavily upon the manner in which the MDP is decomposed into subproblems, and, in particular, the fan-out of the regions in the decomposition. The reader should keep in mind, however, that this type of aggressive decoupling of MDPs is a fairly new topic and that while the algorithms involved are, admittedly, complex, the potential benefits in parallelization and knowledge transfer across problems resulting from this line of research are substantial.

7 Acknowledgment

This work was supported in part by DARPA contract DACA76-93-C-0025 under subcontract to Information Extraction and Transport, Inc., and through the generosity of the Powell Foundation and the Sloan Foundation and by DARPA Prime contract IET-1004-96-009. Some of this work was done at the University of California at Berkeley, where it was supported in part by ONR grant N00014-97-1-0942 and ARO MURI grant DAAH04-96-1-0341. The author benefited from helpful discussions about this and related work with David Andre, Craig Boutilier, Mike Bowling, Tom Dean, Nir Friedman, Milos Hauskrecht, Daphne Koller, Uri Lerner, Stuart Russell, Mehran Sahami and Rich Sutton. The reviewers also provided some extremely helpful comments.

References

Andre, D., Friedman, N., & Parr, R. (1997). Generalized prioritized sweeping. In Advances in Neural Information Processing Systems 10: Proceedings of the 1997 Conference, Denver, Colorado. MIT Press.
Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, New Jersey.
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, New Jersey.
Cassandra, A. R., Kaelbling, L. P., & Littman, M. L. (1994). Acting optimally in partially observable stochastic domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), pp. 1023–1028, Seattle, Washington. AAAI Press.
Dean, T., & Lin, S.-H. (1995). Decomposition techniques for planning in stochastic domains. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI-95), pp. 1121–1127, Montreal, Canada. Morgan Kaufmann.
Forestier, J.-P., & Varaiya, P. (1978). Multilayer control of large Markov chains. IEEE Transactions on Automatic Control, AC-23, 298–304.
Hauskrecht, M. (1998). Planning with temporally abstract actions. Tech. rep. CS-98-01, Computer Science Department, Brown University, Providence, Rhode Island.
Hauskrecht, M., Meuleau, N., Boutilier, C., Kaelbling, L. P., & Dean, T. (1998). Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI-98). To appear.
Lin, S.-H. (1997). Exploiting Structure for Planning and Control. Ph.D. thesis, Computer Science Department, Brown University, Providence, Rhode Island.
Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28(1–4), 47–66.
Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13, 103–130.
Parr, R. (1998). Hierarchical Control and Learning for Markov Decision Processes. Ph.D. thesis, University of California, Computer Science Division, Berkeley, California.
Parr, R., & Russell, S. (1997). Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems 10: Proceedings of the 1997 Conference, Denver, Colorado. MIT Press.
Precup, D., & Sutton, R. S. (1997). Multi-time models for temporally abstract planning. In Advances in Neural Information Processing Systems 10: Proceedings of the 1997 Conference, Denver, Colorado. MIT Press.
Puterman, M. L. (1994). Markov Decision Processes. Wiley, New York.
Russell, S. J., & Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, New Jersey.
Singh, S. P. (1992). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning, 8(3), 323–340.
Sutton, R. S., Precup, D., & Singh, S. P. (1998). Between MDPs and semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales. In prep.
Williams, R. J., & Baird, L. C., III (1993). Tight performance bounds on greedy policies based on imperfect value functions. Tech. rep. NU-CCS-93-14, College of Computer Science, Northeastern University, Boston, Massachusetts.