Bayesian zero-inflated predictive modelling of herd-level Salmonella prevalence for


Research on Stock Index Futures Price Prediction Models Based on Machine Learning Algorithms

Author: YANG Xuewei (School of Economics and Management, Qinghai University for Nationalities, Xining 810007, China)
Source: Software Engineering (《软件工程》), 2022, No. 12

Abstract: The combination of artificial intelligence and quantitative investment has produced a variety of price prediction models based on machine learning algorithms. To study how well different machine learning algorithms predict stock index futures prices, this paper builds price prediction models with four commonly used algorithms, namely support vector regression (SVR), long short-term memory (LSTM) networks, random forest (RF) and extreme gradient boosting (XGBoost), and uses them to predict the price of CSI 300 stock index futures. A Bayesian algorithm is used to optimize the hyperparameters of each model, and the improvement that Bayesian optimization brings to the prediction accuracy of the four algorithms is compared. The results show that RF and XGBoost, owing to the strengths of the models themselves, can predict financial time-series data accurately, while Bayesian optimization, which uses a Gaussian process and continually updates the prior, significantly improves the prediction performance of SVR: the mean squared error (MSE), mean absolute error (MAE), symmetric mean absolute percentage error (SMAPE) and loss fitness (LOSS) fall by 78.6%, 94.7%, 95.1% and 97.0%, respectively.

Keywords: machine learning; SVR; LSTM; RF; XGBoost
CLC number: TP312    Document code: A

1 Introduction
Many complex factors, including the macroeconomic background, the level of development of financial markets and investors' psychological expectations, jointly drive changes in the prices of financial instruments, so financial time-series prices are nonstationary, nonlinear and highly noisy [1].
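The error measures named in the abstract can be made concrete with a short sketch. The code below is an illustration only, not the paper's implementation: the function names are hypothetical, the prices are made-up numbers, and SMAPE is computed with one common definition (the mean of 2|y - ŷ| / (|y| + |ŷ|), expressed as a percentage), which may differ from the one the paper uses. The paper's LOSS measure is not defined in this excerpt and is therefore omitted.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent.

    Assumed definition: mean of 2|t - p| / (|t| + |p|); a term whose
    denominator is zero contributes 0.
    """
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = abs(t) + abs(p)
        terms.append(0.0 if denom == 0 else 2.0 * abs(t - p) / denom)
    return 100.0 * sum(terms) / len(terms)


if __name__ == "__main__":
    # Hypothetical futures prices and predictions, for illustration only.
    actual = [4000.0, 4020.0, 3990.0]
    predicted = [4010.0, 4015.0, 4000.0]
    print(mse(actual, predicted), mae(actual, predicted), smape(actual, predicted))
```

A relative reduction such as those quoted in the abstract is then (before - after) / before for each measure.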

A Novel Bayesian Compressive Sensing Technique for Image Reconstruction

Abstract: Bayesian compressive sensing (BCS) was proposed in recent years. It treats the reconstruction process as a Bayesian model rather than the traditional l1 sparse model. In BCS, the so-called relevance vector machine (RVM) ... Signal reconstruction with the conventional RVM often has very poor accuracy. To improve accuracy, the paper proposes a new BCS technique, particle swarm Bayesian compressive sensing (PSBCS). Experiments show that this new technique greatly outperforms conventional BCS in reconstruction accuracy.

Keywords: Bayesian compressive sensing (BCS); relevance vector machine (RVM); particle swarm optimization; local-optimum trap; vector selection ...

Paper: Graduation Thesis

Bayesian Estimation of the Parameters of the Эрланга Distribution under Different Loss Functions

Abstract
While studying the repair time of weapons and equipment, Russian mathematicians introduced the Эрланга distribution, which has played a positive role in research on equipment maintenance theory.

First, within classical statistics, this thesis derives the moment estimate and the maximum likelihood estimate of the parameter of the Эрланга distribution. Then, taking the exponential distribution as the prior, it studies Bayes estimation for the Эрланга distribution under the Linex, composite Linex, MLinex and composite MLinex loss functions, and goes on to study hierarchical (multilayer) Bayes estimation of the parameter under the composite Linex and MLinex loss functions. Finally, a set of random numbers is generated with Matlab in order to compare the moment, maximum likelihood and Bayes estimates under the Linex loss function, and to study how different parameter values affect the Bayes estimates of the Эрланга distribution under the different loss functions.

Keywords: loss function; Bayes estimation; hierarchical Bayes estimation; numerical simulation; Matlab

Contents
Introduction
1 Parameter estimation in classical statistics
1.1 Moment estimation of the parameter
1.2 Maximum likelihood estimation of the parameter
2 Bayes estimation under different loss functions
2.1 Bayes estimation under the Linex loss function
2.2 Bayes estimation under the composite Linex loss function
2.3 Bayes estimation under the MLinex loss function
2.4 Bayes estimation under the composite MLinex loss function
3 Estimation of the hyperparameter λ
4 Hierarchical Bayes estimation under different loss functions
4.1 Hierarchical Bayes estimation under the composite Linex loss function
4.2 Hierarchical Bayes estimation under the composite MLinex loss function
5 Worked example and data simulation
5.1 Comparison of estimators under the Linex loss function
5.2 Comparison of Bayes estimates of the Эрланга distribution parameter under the various loss functions
5.3 Comparative analysis of estimators under the composite Linex loss function
5.4 Comparison of estimators under the MLinex loss function
Conclusion
References
Acknowledgements
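For readers unfamiliar with the Linex loss that the thesis builds on, the following sketch may help. It assumes the commonly used form L(Δ) = exp(cΔ) - cΔ - 1 with Δ = estimate - θ and c ≠ 0, for which the Bayes estimate is -(1/c) ln E[exp(-cθ) | data]. The thesis derives closed forms for its specific posterior, which are not reproduced in this excerpt, so the code approximates the expectation by a plain average over posterior draws. The function name and the sample values are hypothetical.

```python
import math


def linex_bayes_estimate(posterior_samples, c):
    """Monte Carlo Bayes estimate under the Linex loss.

    Assumed loss: L(d) = exp(c*d) - c*d - 1, with d = estimate - theta.
    The Bayes estimate is -(1/c) * log E[exp(-c*theta) | data]; the
    expectation is approximated by averaging over posterior samples.
    """
    if c == 0:
        raise ValueError("c must be non-zero for the Linex loss")
    m = sum(math.exp(-c * t) for t in posterior_samples) / len(posterior_samples)
    return -math.log(m) / c


if __name__ == "__main__":
    draws = [1.0, 2.0, 3.0]  # hypothetical posterior draws of theta
    print(linex_bayes_estimate(draws, 1.0))   # below the posterior mean of 2
    print(linex_bayes_estimate(draws, -1.0))  # above the posterior mean of 2
```

With c > 0 overestimation is penalised more heavily, so the estimate sits below the posterior mean; with c < 0 it sits above it.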

Chinese Academy of Sciences Machine Learning, Lecture 4: Bayesian Modeling
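$\dfrac{\alpha_1 + N_1}{\alpha_1 + \alpha_2 + N}$
Dirichlet prior
• Conjugate to the multinomial/multinoulli
$$\mathrm{Dir}(\theta \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}, \qquad B(\alpha) := \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma(\alpha_0)}, \qquad \alpha_0 := \sum_{i=1}^{K} \alpha_i$$
$$\mathbb{E}[\theta_k] = \frac{\alpha_k}{\alpha_0}, \qquad \mathrm{mode}[\theta_k] = \frac{\alpha_k - 1}{\alpha_0 - K}, \qquad \mathrm{var}[\theta_k] = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0 + 1)}$$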
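As a quick numerical check of the Dirichlet summaries on this slide, here is a minimal sketch. The function name is hypothetical and the parameter vector is an arbitrary example; the mode is returned only when every component of α exceeds 1, since the closed form is valid only in that case.

```python
def dirichlet_summary(alpha):
    """Return (mean, mode, variance) of Dir(theta | alpha), per component.

    mode is None unless every alpha_k > 1, where the closed form
    (alpha_k - 1) / (alpha_0 - K) applies.
    """
    a0 = sum(alpha)
    k = len(alpha)
    mean = [a / a0 for a in alpha]
    mode = [(a - 1) / (a0 - k) for a in alpha] if all(a > 1 for a in alpha) else None
    var = [a * (a0 - a) / (a0 ** 2 * (a0 + 1)) for a in alpha]
    return mean, mode, var


if __name__ == "__main__":
    mean, mode, var = dirichlet_summary([2.0, 3.0, 5.0])
    print(mean)  # components sum to 1
    print(mode)  # components sum to 1
    print(var)
```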
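MAP estimation: maximum a posteriori estimation
• Note: we can learn a concept even when given only positive examples!
Rectangle Game
• Healthy-levels game: measure the insulin and cholesterol of a healthy population. What is the healthy range?
Rectangle Game: Setup
• Hypothesis space
• Likelihood (1D)
• Prior
$p(h) = p(l_1)\, p(s_1)\, p(l_2)\, p(s_2)$
For simplicity, the parameters are assumed independent.
$p(l_i) \propto 1$, i.e. $p(l_i) = \mathcal{N}(l_i \mid 0, \infty)$ (translation invariant)
$p(s_i) \propto 1/s_i$, i.e. $p(s_i) = \mathrm{Gamma}(s_i \mid 0, 0)$ (scale invariant)
• Modified likelihood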
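$\prod_{j=1}^{D} \prod_{k=1}^{K} \theta_{jck}^{\mathbb{I}(x_j = k)}$
$$\pi_c = \frac{\alpha_c + N_c}{\sum_{c'=1}^{K} \alpha_{c'} + N}, \qquad \theta_{jck} = \frac{\beta_k + N_{jck}}{\sum_{k'=1}^{K} \beta_{k'} + N_c}$$
Summary of the posterior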
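The two posterior-mean formulas above share one pattern: add the Dirichlet pseudo-counts to the observed counts and normalise. A minimal sketch follows, with a hypothetical function name and made-up counts; the same function serves for the class prior π_c (counts N_c, pseudo-counts α_c) and for one feature's distribution within a class (counts N_jck, pseudo-counts β_k).

```python
def posterior_mean(counts, pseudo_counts):
    """Posterior mean of a categorical parameter under a Dirichlet prior.

    Each entry is (pseudo_count + count) / (sum of pseudo_counts + sum of counts).
    """
    total = sum(counts) + sum(pseudo_counts)
    return [(a + n) / total for a, n in zip(pseudo_counts, counts)]


if __name__ == "__main__":
    # Hypothetical data: 3 examples of class 0, 1 example of class 1, uniform prior.
    print(posterior_mean([3, 1], [1.0, 1.0]))
    # A value never observed still gets non-zero probability.
    print(posterior_mean([0, 4], [1.0, 1.0]))
```

The second call shows why the prior matters in practice: a category with zero observations is not assigned probability zero.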

Latent Class Growth and Group-Based Trajectory Models: Overview and Explanation

1. Introduction
1.1 Overview

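Latent class growth and group-based trajectory models are an important research direction in machine learning. With the rapid growth of data volumes and rising application demands, the concept of latent class growth has attracted wide attention. In traditional machine learning tasks, data are divided into known classes, and a predefined classification model can classify them accurately. In practice, however, new classes often appear, and traditional classification models cannot handle these unknown classes. The latent class growth approach was proposed to solve this problem. It dynamically extends the existing set of classes, allowing the system to accept new, unknown classes during learning and to keep updating the classification model. The core of the approach is to use latent information in the data to infer the existence of new classes. It has a wide range of applications, for example in object recognition, image classification and speech recognition.

Group-based trajectory models are likewise an important class of machine learning models. In many practical problems the data have a temporal structure, as with time series and video data. Traditional machine learning models often cannot make full use of this temporal information. A group-based trajectory model represents the data as a linear combination of a set of subsequences (group bases), which describes the temporal characteristics of the data better. This not only improves the representational power but also reduces the feature dimension, thereby improving the model's ability to generalise.

This article describes the principles and methods of latent class growth and group-based trajectory models in detail and verifies their performance experimentally. We first introduce the basic concepts and principles of the latent class growth approach, then focus on the modelling process of the group-based trajectory model. Finally, we analyse and summarise the experimental results and look ahead to future directions for both methods.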

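1.2 Article structure
This article centres on "latent class growth" and "group-based trajectory models". It is divided into three parts: introduction, main body and conclusion. The introduction first gives an overview of the whole article and presents the background and importance of latent class growth and group-based trajectory models. It then sets out the overall structure of the article and briefly describes the content of each chapter.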

The MXM Toolkit: Guide to an R Package for Feature Selection, Cross-Validation and Bayesian Networks

A very brief guide to using MXM
Michail Tsagris, Vincenzo Lagani, Ioannis Tsamardinos

1 Introduction

MXM is an R package which contains functions for feature selection, cross-validation and Bayesian Networks. The main functionalities focus on feature selection for different types of data. We highlight the option for parallel computing and the fact that some of the functions have been either partially or fully implemented in C++. As for the other ones, we always try to make them faster.

2 Feature selection related functions

MXM offers many feature selection algorithms, namely MMPC, SES, MMMB, FBED, forward and backward regression. The target set of variables to be selected, ideally what we want to discover, is called the Markov Blanket and it consists of the parents, children and parents of children (spouses) of the variable of interest, assuming a Bayesian Network for all variables.

MMPC stands for Max-Min Parents and Children. The idea is to use the Max-Min heuristic when choosing variables to put in the selected variables set and proceed in this way. "Parents and Children" comes from the fact that the algorithm will identify the parents and children of the variable of interest assuming a Bayesian Network. What it will not recover is the spouses of the children of the variable of interest. For more information the reader is referred to [23].

MMMB (Max-Min Markov Blanket) extends MMPC to discovering the spouses of the variable of interest [19]. SES (Statistically Equivalent Signatures), on the other hand, extends MMPC to discovering statistically equivalent sets of the selected variables [18, 9]. Forward and backward selection are the two classical procedures.

The flexibility offered by all these algorithms is their ability to handle many types of dependent variables, such as continuous, survival, categorical (ordinal, nominal, binary) and longitudinal. Let us now see all of them one by one. The relevant functions are

1. MMPC and SES. SES uses MMPC to return multiple statistically equivalent sets of variables. MMPC returns only one set of variables. In all cases, the log-likelihood ratio test is used to assess the significance of a variable. These algorithms accept categorical only, continuous only or mixed data on the predictor variables side.
2. wald.mmpc and wald.ses. SES uses MMPC using the Wald test. These two algorithms accept continuous predictor variables only.
3. perm.mmpc and perm.ses. SES uses MMPC where the p-value is obtained using permutations. Similarly to the Wald versions, these two algorithms accept continuous predictor variables only.
4. ma.mmpc and ma.ses. MMPC and SES for multiple datasets measuring the same variables (dependent and predictors).
5. MMPC.temporal and SES.temporal. Both of these algorithms are the usual SES and MMPC modified for correlated data, such as clustered or longitudinal. The predictor variables can only be continuous.
6. fbed.reg. The FBED feature selection method [2]. The log-likelihood ratio test or the eBIC (BIC is a special case) can be used.
7. fbed.glmm.reg. FBED with generalised linear mixed models for repeated measures or clustered data.
8. fbed.ge.reg. FBED with GEE for repeated measures or clustered data.
9. ebic.bsreg. Backward selection method using the eBIC.
10. fs.reg. Forward regression method for all types of predictor variables and for most of the available tests below.
11. glm.fsreg. Forward regression method for logistic and Poisson regression specifically. The user can call this directly if he knows his data.
12. lm.fsreg. Forward regression method for normal linear regression. The user can call this directly if he knows his data.
13. bic.fsreg. Forward regression using BIC only to add a new variable. No statistical test is performed.
14. bic.glm.fsreg. The same as before but for linear, logistic and Poisson regression (GLMs).
15. bs.reg. Backward regression method for all types of predictor variables and for most of the available tests below.
16. glm.bsreg. Backward regression method for linear, logistic and Poisson regression (GLMs).
17. iamb. The IAMB algorithm [20], which stands for Incremental Association Markov Blanket. The algorithm performs a forward regression at first, followed by a backward regression offering two options. Either the usual backward regression is performed or a faster, but perhaps less correct, variation. In the usual backward regression, at every step the least significant variable is removed. In the original IAMB version all non-significant variables are removed at every step.
18. mmmb. This algorithm works for continuous or categorical data only. After applying the MMPC algorithm one can go to the selected variables and perform MMPC on each of them.

A list with the available options for the test argument is given below. Make sure you include the test name within "" when you supply it. Most of these tests come in their Wald and perm (permutation based) versions. In their Wald or perm versions, they may have slightly different acronyms, for example waldBinary or WaldOrdinal denote the logistic and ordinal regression respectively.

1. testIndFisher. This is a standard test of independence when both the target and the set of predictor variables are continuous (continuous-continuous).
2. testIndSpearman. This is a non-parametric alternative to the testIndFisher test [6].
3. testIndReg. In the case of target-predictors being continuous-mixed or continuous-categorical, the suggested test is via the standard linear regression. If the robust option is selected, M estimators [11] are used. If the target variable consists of proportions or percentages (within the (0, 1) interval), the logit transformation is applied beforehand.
4. testIndRQ. Another robust alternative to testIndReg for the case of continuous-mixed (or continuous-continuous) variables is testIndRQ. If the target variable consists of proportions or percentages (within the (0, 1) interval), the logit transformation is applied beforehand.
5. testIndBeta. When the target is a proportion (or percentage, i.e., between 0 and 1, not inclusive) the user can fit a regression model assuming a beta distribution [5]. The predictor variables can be either continuous, categorical or mixed.
6. testIndPois. When the target is discrete, and specifically count data, the default test is via the Poisson regression. The predictor variables can be either continuous, categorical or mixed.
7. testIndNB. As an alternative to the Poisson regression, we have included the negative binomial regression to capture cases of overdispersion [8]. The predictor variables can be either continuous, categorical or mixed.
8. testIndZIP. When the number of zeros is more than expected under a Poisson model, the zero-inflated Poisson regression is to be employed [10]. The predictor variables can be either continuous, categorical or mixed.
9. testIndLogistic. When the target is categorical with only two outcomes, success or failure for example, then a binary logistic regression is to be used. Whether regression or classification is the task of interest, this method is applicable. The advantage of this over a linear or quadratic discriminant analysis is that it allows for categorical predictor variables as well and for mixed types of predictors.
10. testIndMultinom. If the target has more than two outcomes, but it is of nominal type (political party, nationality, preferred basketball team), so that there is no ordering of the outcomes, multinomial logistic regression will be employed. Again, this regression is suitable for classification purposes as well and it too allows for categorical predictor variables. The predictor variables can be either continuous, categorical or mixed.
11. testIndOrdinal. This is a special case of multinomial regression, in which the outcomes have an ordering, such as not satisfied, neutral, satisfied. The appropriate method is ordinal logistic regression. The predictor variables can be either continuous, categorical or mixed.
12. testIndTobit (Tobit regression for left censored data). Suppose you have measurements for which values below some value were not recorded. These are left censored values and by using a normal distribution we can bypass this difficulty. The predictor variables can be either continuous, categorical or mixed.
13. testIndBinom. When the target variable is a matrix of two columns, where the first one is the number of successes and the second one is the number of trials, binomial regression is to be used. The predictor variables can be either continuous, categorical or mixed.
14. gSquare. If all variables, both the target and predictors, are categorical, the default test is the G2 test of independence. An alternative to the gSquare test is testIndLogistic. With the latter, depending on the nature of the target (binary, un-ordered multinomial or ordered multinomial) the appropriate regression model is fitted. The predictor variables can be either continuous, categorical or mixed.
15. censIndCR. For the case of time-to-event data, a Cox regression model [4] is employed. The predictor variables can be either continuous, categorical or mixed.
16. censIndWR. A second model for the case of time-to-event data, a Weibull regression model, is employed [14, 13]. Unlike the semi-parametric Cox model, the Weibull model is fully parametric. The predictor variables can be either continuous, categorical or mixed.
17. censIndER. A third model for the case of time-to-event data, an exponential regression model, is employed. The predictor variables can be either continuous, categorical or mixed. This is a special case of the Weibull model.
18. testIndIGreg. When you have non-negative data, i.e. the target variable takes positive values (including 0), a suggested regression is based on the inverse Gaussian distribution. The link function is not the inverse of the square root as expected, but the logarithm. This is to ensure that the fitted values will always be non-negative. An alternative model is the Weibull regression (censIndWR). The predictor variables can be either continuous, categorical or mixed.
19. testIndGamma (Gamma regression). The Gamma distribution is designed for strictly positive data (greater than zero). It is used in reliability analysis, as an alternative to the Weibull regression. This test however does not accept censored data, just the usual numeric data. The predictor variables can be either continuous, categorical or mixed.
20. testIndNormLog (Gaussian regression with a log link). Gaussian regression using the log link (instead of the identity) allows non-negative data to be handled naturally. Unlike the gamma or the inverse Gaussian regression, zeros are allowed. The predictor variables can be either continuous, categorical or mixed.
21. testIndClogit. When the data come from a case-control study, the suitable test is via conditional logistic regression [7]. The predictor variables can be either continuous, categorical or mixed.
22. testIndMVReg. In the case of a multivariate continuous target, the suggested test is via a multivariate linear regression. The target variable can be compositional data as well [1]. These are positive data, whose vectors sum to 1. They can sum to any constant, as long as it is the same, but for convenience we assume that they are normalised to sum to 1. In this case the additive log-ratio transformation (multivariate logit transformation) is applied beforehand. The predictor variables can be either continuous, categorical or mixed.
23. testIndGLMMReg. In the case of a longitudinal or clustered target (continuous, proportions within 0 and 1 (not inclusive)), the suggested test is via a (generalised) linear mixed model [12]. The predictor variables can only be continuous. This test is only applicable in SES.temporal and MMPC.temporal.
24. testIndGLMMPois. In the case of a longitudinal or clustered target (counts), the suggested test is via a (generalised) linear mixed model [12]. The predictor variables can only be continuous. This test is only applicable in SES.temporal and MMPC.temporal.
25. testIndGLMMLogistic. In the case of a longitudinal or clustered target (binary), the suggested test is via a (generalised) linear mixed model [12]. The predictor variables can only be continuous. This test is only applicable in SES.temporal and MMPC.temporal.

To avoid any mistakes or a wrongly selected test by the algorithms you are advised to select the test you want to use. All of these tests can be used with SES and MMPC, forward and backward regression methods. MMMB accepts only testIndFisher, testIndSpearman and gSquare. The reason for this is that MMMB was designed for variables (dependent and predictors) of the same type. For more info the user should see the help page of each function.

2.1 A more detailed look at some arguments of the feature selection algorithms

SES, MMPC, MMMB, forward and backward regression offer the option for robust tests (the argument robust). This is currently supported for the case of the Pearson correlation coefficient and linear regression. We plan to extend this option to binary logistic and Poisson regression as well. These algorithms have an argument user_test. In the case that the user wants to use his own test, for example mytest, he can supply it in this argument as is, without "". For all previously mentioned regression based conditional independence tests, the argument works as test = "testIndFisher". In the case of the user test it works as user_test = mytest. The max_k argument must always be at least 1 for SES, MMPC and MMMB, otherwise it is a simple filtering of the variables. The argument ncores offers the option for parallel implementation of the first step of the algorithms, the filtering step, where the significance of each predictor is assessed. If you have only a few thousand variables, this option may bring no significant improvement. But if you have more and a "difficult" regression test, such as quantile regression (testIndRQ), then with 4 cores this could reduce the computational time of the first step by up to nearly 50%. For the Poisson, logistic and normal linear regression we have included C++ codes to speed up this process, without the use of parallel.

FBED (Forward Backward Early Dropping) is a variant of forward selection, which is performed in the first phase and followed by the usual backward regression. In sum, the variation is that every non-significant variable is dropped until no more significant variables are found or there is no variable left.

The forward and backward regression methods have a few different arguments. For example stopping, which can be either "BIC" or "adjrsq", with the latter being used only in the linear regression case. Every time a variable is significant it is added to the selected variables set. But it may be the case that it is actually not necessary, and for this reason we also calculate the BIC of the relevant model at each step. If the difference in BIC is less than the tol (argument) threshold value, the variable does not enter the set and the algorithm stops.

The forward and backward regression methods can proceed via the BIC as well. At every step of the algorithm, the BIC of the relevant model is calculated and if the BIC of the model including a candidate variable is reduced by more than the tol (argument) threshold value, that variable is added. Otherwise the variable is not included and the algorithm stops.

2.2 Other relevant functions

Once SES or MMPC are finished, the user might want to see the model produced. For this reason the functions ses.model and mmpc.model can be used. If the user wants to get some summarised results with MMPC for many combinations of max_k and threshold values he can use the mmpc.path function. Ridge regression (ridge.reg and ridge.cv) has been implemented.
Note that ridge regression is currently offered only for linear regression with continuous predictor variables. As for some miscellaneous functions, we have implemented the zero-inflated Poisson and beta regression models, should the user want to use them.

2.3 Cross-validation

cv.ses and cv.mmpc perform a K-fold cross validation for most of the aforementioned regression models. There are many metric functions to be used, appropriate for each case. The folds can be generated in a stratified fashion when the dependent variable is categorical.

3 Networks

Currently three algorithms for constructing Bayesian Networks (or their skeleton) are offered, plus modifications.

• MMHC (Max-Min Hill-Climbing) [23] (mmhc.skel), which constructs the skeleton of the Bayesian Network (BN). This has the option of running SES [18] instead.
• MMHC (Max-Min Hill-Climbing) [23] (local.mmhc.skel), which constructs the skeleton around a selected node. It identifies the Parents and Children of that node and then finds their Parents and Children.
• MMPC followed by the PC rules. This is the command mmpc.or.
• PC algorithm [15] (pc.skel), for which the orientation rules (pc.or) have been implemented as well. Both of these algorithms accept continuous only, categorical data only or a mix of continuous, multinomial and ordinal. The skeleton of the PC algorithm has the option for permutation based conditional independence tests [21].
• The functions ci.mm and ci.fast perform a symmetric test with mixed data (continuous, ordinal and binary data) [17]. This is employed by the PC algorithm as well.
• Bootstrap of the PC algorithm to estimate the confidence of the edges (pc.skel.boot).
• PC skeleton with repeated measures (glmm.pc.skel). This uses the symmetric test proposed by [17] with generalised linear models.
• Skeleton of a network with continuous data using forward selection. The command "…work" performs a task similar to MMHC. It goes to every variable and, instead of applying the MMPC algorithm, it applies the forward selection regression. All data must be continuous, since the Pearson correlation is used. The algorithm is fast, since the forward regression with the Pearson correlation is very fast.

We also have utility functions, such as

1. rdag and rdag2. Data simulation assuming a BN [3].
2. findDescendants and findAncestors. Descendants and ancestors of a node (variable) in a given Bayesian Network.
3. dag2eg. Transforming a DAG into an essential (mixed) graph, its class of equivalent DAGs.
4. equivdags. Checking whether two DAGs are equivalent.
5. is.dag. In fact this checks whether cycles are present by trying to topologically sort the edges. BNs do not allow for cycles.
6. mb. The Markov Blanket of a node (variable) given a Bayesian Network.
7. nei. The neighbours of a node (variable) given an undirected graph.
8. undir.path. All paths between two nodes in an undirected graph.
9. transitiveClosure. The transitive closure of an adjacency matrix, with and without arrowheads.
10. bn.skel.utils. Estimation of the false discovery rate [22], plus AUC and ROC curves based on the p-values.
11. bn.skel.utils2. Estimation of the confidence of the edges [16], plus AUC and ROC curves based on the confidences.
12. plotnetwork. Interactive plot of a graph.

4 Acknowledgments

The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 617393.

References

[1] John Aitchison. The statistical analysis of compositional data. Chapman and Hall London, 1986.
[2] Giorgos Borboudakis and Ioannis Tsamardinos. Forward-Backward Selection with Early Dropping, 2017.
[3] Diego Colombo and Marloes H Maathuis. Order-independent constraint-based causal structure learning. Journal of Machine Learning Research, 15(1):3741-3782, 2014.
[4] David Henry Cox. Regression Models and Life-Tables. Journal of the Royal Statistical Society, 34(2):187-220, 1972.
[5] Silvia Ferrari and Francisco Cribari-Neto. Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31(7):799-815, 2004.
[6] Edgar C Fieller and Egon S Pearson. Tests for rank correlation coefficients: II. Biometrika, 48:29-40, 1961.
[7] Mitchell H Gail, Jay H Lubin, and Lawrence V Rubinstein. Likelihood calculations for matched case-control studies and survival studies with tied death times. Biometrika, 68(3):703-707, 1981.
[8] Joseph M Hilbe. Negative binomial regression. Cambridge University Press, 2011.
[9] Vincenzo Lagani, Giorgos Athineou, Alessio Farcomeni, Michail Tsagris, and Ioannis Tsamardinos. Feature Selection with the R Package MXM: Discovering Statistically-Equivalent Feature Subsets. Journal of Statistical Software, 80(7), 2017.
[10] Diane Lambert. Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1):1-14, 1992.
[11] RARD Maronna, Douglas Martin, and Victor Yohai. Robust statistics. John Wiley & Sons, Chichester, 2006.
[12] Jose Pinheiro and Douglas Bates. Mixed-effects models in S and S-PLUS. Springer Science & Business Media, 2006.
[13] FW Scholz. Maximum likelihood estimation for type I censored Weibull data including covariates, 1996.
[14] Richard L Smith. Weibull regression models for reliability data. Reliability Engineering & System Safety, 34(1):55-76, 1991.
[15] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. The MIT Press, second edition, 2001.
[16] Sofia Triantafillou, Ioannis Tsamardinos, and Anna Roumpelaki. Learning neighborhoods of high confidence in constraint-based causal discovery. In European Workshop on Probabilistic Graphical Models, pages 487-502. Springer, 2014.
[17] Michail Tsagris, Giorgos Borboudakis, Vincenzo Lagani, and Ioannis Tsamardinos. Constraint-based Causal Discovery with Mixed Data. In The 2017 ACM SIGKDD Workshop on Causal Discovery, 14/8/2017, Halifax, Nova Scotia, Canada, 2017.
[18] I. Tsamardinos, V. Lagani, and D. Pappas. Discovering multiple, equivalent biomarker signatures. In Proceedings of the 7th conference of the Hellenic Society for Computational Biology & Bioinformatics, Heraklion, Crete, Greece, 2012.
[19] Ioannis Tsamardinos, Constantin F Aliferis, and Alexander Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 673-678. ACM, 2003.
[20] Ioannis Tsamardinos, Constantin F Aliferis, Alexander R Statnikov, and Er Statnikov. Algorithms for Large Scale Markov Blanket Discovery. In FLAIRS conference, volume 2, pages 376-380, 2003.
[21] Ioannis Tsamardinos and Giorgos Borboudakis. Permutation testing improves Bayesian network learning. In ECML PKDD'10 Proceedings of the 2010 European conference on Machine learning and knowledge discovery in databases, pages 322-337. Springer-Verlag, 2010.
[22] Ioannis Tsamardinos and Laura E Brown. Bounding the False Discovery Rate in Local Bayesian Network Learning. In AAAI, pages 1100-1105, 2008.
[23] Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65(1):31-78, 2006.
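The guide above recommends zero-inflated Poisson regression (testIndZIP) when a count target has more zeros than a Poisson model expects. The sketch below illustrates the underlying distribution only: it is an intercept-only zero-inflated Poisson written in Python, not MXM code and not a regression with covariates. The function names, the parameter names lam and pi, and the data are all hypothetical.

```python
import math


def zip_pmf(y, lam, pi):
    """P(Y = y) for a zero-inflated Poisson.

    With probability pi the outcome is a structural zero; otherwise it is
    drawn from a Poisson distribution with mean lam.
    """
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    return (pi if y == 0 else 0.0) + (1.0 - pi) * poisson


def zip_loglik(data, lam, pi):
    """Log-likelihood of i.i.d. counts under the zero-inflated Poisson."""
    return sum(math.log(zip_pmf(y, lam, pi)) for y in data)


if __name__ == "__main__":
    # Hypothetical counts with many zeros.
    counts = [0] * 8 + [3, 4]
    print(zip_loglik(counts, lam=3.5, pi=0.0))  # plain Poisson (no inflation)
    print(zip_loglik(counts, lam=3.5, pi=0.6))  # zero-inflated, fits far better
```

Setting pi to 0 recovers the ordinary Poisson model, which makes the two log-likelihoods directly comparable.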

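Uncertainty: The Road to Better Model Selection with Bayesian Linear Regression

Keywords: probability, neural networks

Readers who follow Mathematica Stack Exchange (which I strongly recommend to all Wolfram Language users) may have seen this post recently, in which I showed a function I wrote that makes Bayesian linear regression easier to carry out.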

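Since finishing that function, I have kept using it to get a better idea of what it can do, and to compare it with regular fitting functions such as Fit.

In this post I do not want to get too technical (for more on Bayesian neural network regression, see my previous post at https://wolfr.am/GMmXoLta). Instead I want to focus on the practical application and interpretation of Bayesian regression, and share some of the surprising results you can get from it.

01 Getting started
The easiest way to get my BayesianLinearRegression (https://wolfr.am/GMn9Di7w) function is through the version I submitted to the Wolfram Function Repository.

To use the code examples in this post, you can evaluate the following code, which creates a shortcut for the function.

You can also visit the GitHub repository and follow the installation instructions, then load the BayesianInference package with the following code. Alternatively, you can get the function by evaluating the standalone source file of BayesianLinearRegression (https://wolfr.am/GMngf5Uj), although without the full BayesianInference package you may not be able to use the function regressionPlot1D that I use later.

That function is defined in the file BayesianVisualisations.wl (https://wolfr.am/GMnlzNkh).

02 Back to basics
I am now going to do something very familiar to most people with a data-fitting background: polynomial regression.

I could use a more complicated example, but I have found that fitting data with Bayesian methods opens up many new possibilities even in an example as simple as polynomial regression, so it actually makes a very good demonstration.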
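To give a feel for what a Bayesian fit returns that an ordinary fit does not, here is a deliberately simplified sketch in Python. It is not the author's BayesianLinearRegression function and makes assumptions that function may not share: a degree-1 polynomial y = w0 + w1·x, a known noise variance, and an independent zero-mean Gaussian prior on both coefficients. Under those assumptions the posterior is Gaussian with covariance (XᵀX/σ² + I/τ²)⁻¹, and the 2×2 inverse can be written out by hand. All names and data are hypothetical.

```python
def bayes_line_fit(xs, ys, noise_var=1.0, prior_var=100.0):
    """Posterior mean and covariance of (w0, w1) for y = w0 + w1*x + noise.

    Assumes Gaussian noise with known variance noise_var and an independent
    N(0, prior_var) prior on each coefficient.
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Posterior precision A = X^T X / noise_var + I / prior_var (2x2, symmetric).
    a11 = n / noise_var + 1.0 / prior_var
    a12 = sx / noise_var
    a22 = sxx / noise_var + 1.0 / prior_var
    det = a11 * a22 - a12 * a12
    # Posterior covariance is the inverse of A.
    c11, c12, c22 = a22 / det, -a12 / det, a11 / det
    # Posterior mean = covariance * X^T y / noise_var.
    b1, b2 = sy / noise_var, sxy / noise_var
    mean = (c11 * b1 + c12 * b2, c12 * b1 + c22 * b2)
    cov = ((c11, c12), (c12, c22))
    return mean, cov


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.1, 2.9, 5.2, 6.8]  # hypothetical data, roughly y = 1 + 2x
    mean, cov = bayes_line_fit(xs, ys)
    print(mean)                    # coefficient estimates
    print(cov[0][0], cov[1][1])    # their posterior variances
```

The covariance matrix is the extra output: it quantifies how uncertain each coefficient is, which is the ingredient the post goes on to use for model selection.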

The Bayesian Method and the BDM Directional Spectrum in Matlab

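Bayesian inference and the BDM (Bayesian Directional Model) directional spectrum, as used in Matlab, are common methods in signal processing and data analysis.

This article introduces and discusses both topics in detail.

I. The Bayesian method
1. Overview of the Bayesian method
The Bayesian method is a statistical inference method that infers the probability distribution of unknown parameters from known data.

Based on Bayes' theorem, it combines prior information with the observed data to obtain the posterior distribution, which is then used for parameter estimation and model inference.

Its core idea is to use known information to keep revising the inference about the unknown quantities, which gives it particular advantages when dealing with small samples and high-dimensional data.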
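The idea of repeatedly revising an inference can be shown in a few lines. The sketch below is written in Python for illustration and is not Matlab toolbox code: it applies Bayes' theorem to a finite set of hypotheses, and feeding the posterior back in as the next prior gives the sequential updating described above. The function name and the numbers are hypothetical.

```python
def bayes_update(prior, likelihood):
    """Posterior over a finite set of hypotheses.

    posterior[i] is proportional to prior[i] * likelihood[i], normalised to sum to 1.
    """
    unnormalised = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalised)
    return [u / evidence for u in unnormalised]


if __name__ == "__main__":
    belief = [0.5, 0.5]              # two hypotheses, equally likely a priori
    observations = [
        [0.8, 0.3],                  # likelihood of observation 1 under each hypothesis
        [0.7, 0.4],                  # likelihood of observation 2 under each hypothesis
    ]
    for lik in observations:
        belief = bayes_update(belief, lik)   # yesterday's posterior is today's prior
        print(belief)
```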

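2. Applying the Bayesian method in Matlab
Matlab provides a rich set of toolboxes and functions that can be used for Bayesian inference.

By calling the appropriate functions and supplying the observed data and the prior distribution, users can quickly carry out Bayesian parameter estimation, model comparison and predictive analysis.

Matlab also supports modelling and inference for a variety of probabilistic models, including Bayesian linear regression, naive Bayes classification and Markov chain Monte Carlo methods.

II. The BDM (Bayesian Directional Model) directional spectrum
1. Overview of BDM
BDM is a directional spectrum estimation method based on Bayesian theory, used to analyse and process signals and data that have directionality.

Compared with the traditional Fourier transform and power spectrum analysis, BDM has stronger modelling capability and predictive accuracy, and is particularly suited to fields such as radar signal processing, seismology and meteorology.

2. Implementing BDM in Matlab
Matlab provides a toolbox dedicated to directional spectrum analysis; by calling the appropriate functions and supplying the directional data and model parameters, users can quickly estimate and analyse the BDM directional spectrum.

Matlab's BDM toolbox supports several directional spectrum models and parameterisations, which can be chosen and customised flexibly according to the needs of the application.

III. Conclusion
The Bayesian method and the BDM directional spectrum are important signal processing and data analysis methods in Matlab, with particular advantages and promising applications in handling uncertainty and directional information.

A thorough understanding and skilled use of these methods can effectively improve the accuracy and efficiency of data analysis and advance research and applications in related fields.

We hope this article offers some reference and inspiration to readers interested in these two areas.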

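Bayesian Inference for the Normal-Inverse-Gamma Stochastic Frontier Model

LIU Xiaojun et al.
Applied Mathematics: A Journal of Chinese Universities (高校应用数学学报), 2013, 28(4): 488-496
Funding: Natural Science Foundation of Shanghai (13ZR1419100); Scientific Research Innovation Project of the Shanghai Municipal Education Commission (14YZ115)

... widely applied. The inefficiency term is usually taken to be a one-sided disturbance, and the distribution assumed for it is generally the half-normal distribution [1], the exponential distribution [2], the truncated normal distribution [3], the restricted Gamma distribution [0], and so on. Reference [4] proposed the general Normal-Gamma stochastic frontier model, which overcomes the inflexibility of a single parameter, and analysed the model by maximum likelihood estimation. Reference [5] estimated the parameters of the Normal-Gamma stochastic frontier model with a Bayesian approach, but carried out the Bayesian analysis, by importance sampling, only for the case where the shape parameter of the Gamma distribution is an integer. Reference [6] used the Gibbs method for Bayesian inference on the Normal-Gamma stochastic frontier model, solving the inference problem when the inefficiency term follows a general Gamma distribution, but the algorithm's effic...

§2 The model
Consider the stochastic frontier model
$Y_i = \dots + V_i - U_i, \quad i = 1, 2, \dots, N,$

... an inverse Gamma distribution with parameters $p$ and $\theta$, i.e. ... ~ RG($p$, $\theta$), whose density function is $\pi(\dots; p, \theta) = \dots$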

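Introduction to R Packages by Application Area
By R-Fox

Analysis of Pharmacokinetic Data
URL: /web/views/Pharmacokinetics.html
Maintainer: Suzette Blanchard
Version: 2008-02-15
Translator: R-fox, 2008-04-12

The main goal of pharmacokinetic data analysis is to determine the relationship between the dosing regimen and the body's response to the drug, using the nonlinear concentration-time curve or related summaries such as the area under the curve.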
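As an illustration of one such summary, the area under a sampled concentration-time curve is often approximated with the linear trapezoidal rule. The sketch below is in Python rather than R and is not taken from any of the packages discussed here; the function name and the sample values are hypothetical.

```python
def auc_trapezoid(times, concentrations):
    """Area under a sampled concentration-time curve (linear trapezoidal rule).

    times must be increasing and the same length as concentrations.
    """
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        mean_height = (concentrations[i] + concentrations[i - 1]) / 2.0
        area += width * mean_height
    return area


if __name__ == "__main__":
    # Hypothetical sampling times (hours) and plasma concentrations (mg/L).
    t = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]
    c = [0.0, 4.2, 5.1, 3.8, 2.0, 0.6]
    print(auc_trapezoid(t, c))  # AUC from 0 to 8 h, in mg*h/L
```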

R基本包里的nls()函数用非线性最小二乘估计法估计非线性模型的参数,返回nls类的对象,有 coef(),formula(), resid(),print(), summary(),AIC(),fitted() and vcov()等方法。

在主要目的实现后,兴趣就转移到研究属性(如:年龄、体重、伴随用药、肾功能)不同的人群是否需要改变药物剂量。

在药物(代谢)动力学领域,分析多个个体的组合数据估计人群参数被称作群体药动学(population PK)。

非线性混合模型为分析群体药动学数据提供了自然的工具,包括概率或贝叶斯估计方法。

nlme包用Lindstrom和Bates提出的概率方法拟合非线性混合效应模型(1990, Biometrics 46, 673-87),允许nested随机效应(nested random effects),组内误差允许相关的或不等的方差。

返回一个nlme类的对象表示拟合结果,结果可用print(),plot()和summary() 方法输出。

nlme对象给出了细节的结果信息和提取方法。

nlmeODE包组合odesolve包和nlme包做混合效应建模,包括多个药动学/药效学(PK/PD)模型。

面版数据(panel data)的贝叶斯估计方法在CRAN的Bayesian Inference任务列表里有所描述(/web/views/Bayesian.html)。

Principles of the BayesA model in breeding

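BayesA is a statistical model based on Bayesian theory that is widely used in genomic selection for breeding.

Its use is illustrated by the following simulation study:
First, population data are simulated for four generations with 4000 individuals per generation, and phenotypes with 4, 3 and 2 categories are simulated. The 4000 individuals of the third generation form the reference population, 1000 randomly selected individuals of the fourth generation form the validation population, and the simulation is repeated 20 times.

Genomic breeding values are estimated with linear and threshold versions of BayesA and BayesCπ. In addition, support vector regression with default hyperparameters (SVRdef) and with tuned hyperparameters (SVRtuning) is used for genomic selection.

The results show that, for traits with 2, 3 and 4 categories, the prediction accuracy of the Bayesian threshold models is on average 2.1%, 2.6% and 2.9% higher than that of the Bayesian linear models.

Both BayesA and BayesCπ can be used to predict and select individuals in breeding. BayesA assumes that every marker has a non-zero effect with its own variance, whereas BayesCπ assumes that a proportion π of markers have zero effect and the remaining markers share a common variance.

In practice, the model should be chosen to suit the situation at hand.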

A Bayesian compressive sensing reconstruction algorithm using context modelling of wavelet coefficients

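Hou Xingsong; Sun Jinqiang
[Abstract] Current compressive sensing image reconstruction algorithms do not make full use of the intra-scale correlation of image wavelet coefficients. To address this, a context-modelling Bayesian compressive sensing reconstruction (CBCS) algorithm is proposed. The algorithm assumes that the wavelet coefficients of an image follow a spike-and-slab probability model with unknown parameters. A new context modelling method first produces the context vector in the neighbourhood of the coefficient to be estimated. The probability that the coefficient is significant is then inferred from its correlation with the context vector and from the state of its parent coefficient. Finally, given this probability, Bayesian inference with Markov chain Monte Carlo sampling recovers the wavelet coefficients of the image from the measurement vector, which yields the reconstructed image. Experiments show that CBCS adapts to changes in image content; compared with the tree-structured wavelet compressive sensing reconstruction algorithm, which uses only inter-scale correlation, reconstruction performance improves by up to nearly 2 dB at a sampling rate of 0.9.
[Journal] Journal of Xi'an Jiaotong University
[Year (Volume), Issue] 2013, 47(6)
[Pages] 6 (pp. 12-17)
[Keywords] context modelling; compressive sensing; image reconstruction; Bayesian inference
[Authors] Hou Xingsong; Sun Jinqiang
[Affiliation] School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049 (both authors)
[Language] Chinese
[CLC number] TN914.42

Classical compressive sensing reconstruction algorithms such as matching pursuit (MP) [1], orthogonal matching pursuit (OMP) [2] and CoSaMP [3] are mostly greedy iterative algorithms, and they treat the decomposition coefficients of a signal on the sparse basis as independent and identically distributed.

However, the wavelet coefficients of an image are usually statistically correlated: they decay across scales and cluster within a scale [4].

If a compressive sensing reconstruction algorithm makes full use of this correlation, its performance improves markedly.

How to exploit the correlation between wavelet coefficients to improve compressive sensing reconstruction has therefore attracted wide attention.

For example, the Model-Based CoSaMP (MBC) algorithm [5] and the tree-structured wavelet compressive sensing (TSW-CS) algorithm [6] both exploit the decay of wavelet coefficients across scales.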

Hierarchical Bayesian models: spatial analysis

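1.1 Hierarchical Bayesian models. Classical inferential models, spatial regression models and spatial panel models share one feature: fitting them relies entirely on the information in the collected sample.

In practice, however, researchers often have some understanding of how the object of study varies or is distributed before any sample is collected.

This understanding may come from long experience or from reasonable assumptions.

Because it has not been tested against a sample, we call it prior knowledge.

Suppose, for example, that we want to study the probability distribution of the monthly number of cases of a disease in some area.

Even without a statistical survey, standard theory and reasonable assumptions suggest that the number of cases follows a Poisson distribution.

Day-to-day experience at the hospital may even indicate the approximate range in which the number of cases falls.

Here, what is known about the shape of the distribution and the approximate range of the case count is prior knowledge.

Prior knowledge is a great help in exploring how the object of study behaves.

Classical inferential models, spatial regression models and spatial panel models do not use prior knowledge, so the available information is not fully exploited.

The hierarchical Bayesian model discussed in this section combines prior knowledge with sample information to make inferences about the data.

Because it uses both sources of information effectively, it can improve the accuracy of inference or reduce the cost of sampling.

(1) A brief introduction to Bayesian statistics. Before introducing the hierarchical Bayesian model, we briefly set out the basic principles of Bayesian statistics.

The foundation of Bayesian statistics is Bayes' theorem: $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$ (1), where $P(A)$ is the prior probability of event A (for example, an expert concluding from experience or earlier research that the incidence of hepatitis B is 10% gives a prior probability), $P(B)$ is the probability of event B, with $P(B) \neq 0$, and $P(A \mid B)$ is the posterior probability of A given B.

$P(B \mid A)/P(B)$ is the degree of support that the occurrence of event A lends to event B, i.e. the likelihood function.

$P(B \mid A)/P(B)$ can be read as follows: if $P(B \mid A)/P(B) = n$, then the probability of B given that A has occurred is n times the probability of B when it is not known whether A has occurred.

An important purpose of the Bayesian approach is to obtain the probability distribution of a random variable and the influence of each factor on that distribution.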
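As a small numerical check of Equation (1), the sketch below computes a posterior probability and the support ratio n = P(B|A)/P(B). Only the 10% prior comes from the text; the two test characteristics (0.90 and 0.20) and the function name `posterior` are hypothetical values chosen for illustration.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from Bayes' theorem, with P(B) from the law of total probability."""
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    if p_b == 0.0:
        raise ValueError("P(B) must be non-zero")
    return p_b_given_a * prior_a / p_b


# Prior P(A) = 0.10 is the text's example; the next two numbers are assumptions.
prior = 0.10
p_pos_given_a = 0.90      # hypothetical P(B|A)
p_pos_given_not_a = 0.20  # hypothetical P(B|not A)

post = posterior(prior, p_pos_given_a, p_pos_given_not_a)

# Support ratio n = P(B|A) / P(B); the posterior equals n times the prior.
p_b = p_pos_given_a * prior + p_pos_given_not_a * (1.0 - prior)
n = p_pos_given_a / p_b
print(post, n * prior)
```

Both printed values agree, which is the reading of n given above: observing B multiplies the prior probability of A by n.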

Bayes estimation and prediction for the NGINAR(1) model

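Yang Yanqiu; Wang Dehui
[Journal] Journal of Jilin University (Science Edition)
[Year (Volume), Issue] 2019, 57(6)
[Pages] 6 (pp. 1385-1390)
[Keywords] NGINAR(1) model; Bayes estimation; model prediction
[Authors] Yang Yanqiu; Wang Dehui
[Affiliations] School of Mathematics, Jilin University, Changchun 130012; School of Mathematics, Jilin Normal University, Siping, Jilin 136000
[Language] Chinese
[CLC number] O212.1

Many counting processes in real life, such as the number of inpatients in a hospital at a given time, the size of a biological population in a region at a given time, or the number of crimes in a region over a given period, take non-negative integer values. Fitting such data with ordinary time series models often produces anomalous forecasts, so integer-valued autoregressive models are needed. There are currently two main approaches to modelling integer-valued time series: state-space modelling through a latent process [1], and modelling through thinning operators, which is the main approach. The binomial thinning operator "∘" [2] is the foundation of integer-valued time series. It is defined as $\alpha \circ X = \sum_{i=1}^{X} B_i$, where α ∈ (0,1), X is a non-negative integer-valued random variable, and {B_i} is a sequence of independent and identically distributed (i.i.d.) Bernoulli random variables, independent of X, with P(B_i = 1) = 1 − P(B_i = 0) = α. Binomial thinning has one limitation: because the summands {B_i} are i.i.d. Bernoulli variables that take only the values 0 and 1, α∘X is always less than or equal to X. In real problems, however, each event may be linked to several further counted events, so geometric random variables describe them better. Reference [3] introduced the negative binomial thinning operator "*", defined as $\beta * X = \sum_{j=1}^{X} W_j$, where β ∈ (0,1), X is a non-negative integer-valued random variable, and {W_j} is a sequence of i.i.d. geometric random variables, independent of X, satisfying …, k ∈ ℕ₀. Because the summands {W_j} take non-negative integer values, β*X may be larger or smaller than X, which removes the limitation of binomial thinning. This paper gives the Bayes estimate of the parameter of the first-order integer-valued autoregressive model based on negative binomial thinning. It first presents a simulation study, then compares mean squared errors with conditional least squares and Yule-Walker estimation, and finally analyses and forecasts New Zealand cattle skin-lesion data.

1 The model
The first-order integer-valued autoregressive model based on the binomial thinning operator "∘" [4-5] is X_t = α∘X_{t−1} + Z_t. If X_t is the number of inpatients at time t, α∘X_{t−1} is the number of patients from the previous period who remain in hospital, each with probability α, and Z_t is the number of new admissions at time t, then X_t is the sum of α∘X_{t−1} and Z_t. For overdispersed counting processes the negative binomial thinning operator "*" is more suitable. Based on it, reference [6] proposed the new geometric first-order integer-valued autoregressive process, NGINAR(1): X_t = α*X_{t−1} + ε_t, where the negative binomial thinning operator "*" is defined as $\alpha * X_{t-1} = \sum_{j=1}^{X_{t-1}} W_j$ with α ∈ (0,1), X_{t−1} is a non-negative integer-valued random variable, {W_j} is a sequence of i.i.d. geometric random variables independent of X_{t−1}, and {ε_t} is a sequence of mutually independent integer-valued random variables independent of X_{t−1} and of α*X_{t−1}. The sequence {X_t} given by the NGINAR(1) model is a stationary time series. Taking the marginal distribution to be geometric, i.e. …, it can be verified that when … the random variable ε_t follows a mixture of two geometric distributions with probability law …

2 Bayes estimation
References [7-8] studied Bayes estimation under squared-error loss. Here we consider Bayes estimation for the NGINAR(1) model, where … is the Bayes estimator of α. Let X_0, X_1, …, X_T be a set of sample observations from the NGINAR(1) model. The likelihood function of the sample is …, where i = 1, 2, …, T. Take the prior distribution of α to be uniform on (0,1), i.e. π(α) = 1, 0 < α < 1. By Bayes' principle, the posterior distribution of α is …, where x = (x_0, x_1, …, x_T). With α restricted to the corresponding stationary region, the Bayes estimate of α under quadratic loss is the expectation of the posterior distribution π(α|x). In practice the posterior mean is usually taken as the Bayes estimate of a parameter. The Bayes estimate of the NGINAR(1) parameter is found to be … (1). In summary:
Theorem 1. Let the sample x_0, x_1, …, x_T come from the NGINAR(1) model. Under quadratic loss and a uniform prior, the Bayes estimate of α has the form of Equation (1).

3 Simulation study
We now use simulation to assess the performance of the Bayes estimate of the NGINAR(1) parameter, comparing its mean squared error with that of the conditional least squares (CLS) and Yule-Walker (Y-W) estimates. The algorithm for the Bayes estimate is:
1) choose an initial value α(0) and set i = 1;
2) draw … from the mixture of two geometric distributions;
3) draw …;
4) compute … from Equation (1);
5) set i = i + 1 and return to step 2) until the algorithm meets the convergence criterion agreed in advance.
In the simulations the sample sizes are T = 100 and 500, with μ = 0.5, 5 and 10. Table 1 lists the bias and mean squared error (MSE) of the Bayes estimate, and Table 2 lists the ratios of the MSE of the Bayes estimate to those of the CLS and Yule-Walker estimates. Each run uses 2000 burn-in iterations to ensure convergence of the parameter, followed by 500 further iterations that give the reported results.

Table 1 Bias and MSE of Bayesian estimation
(μ, α)       T = 100: Bias   MSE       T = 500: Bias   MSE
(0.5, 0.3)    0.2171         0.0251     0.1165         0.0137
(0.5, 0.5)   -0.1124         0.0162    -0.0896         0.0085
(0.5, 0.7)    0.0802         0.0109    -0.1074         0.0079
(5, 0.3)      0.0541         0.0021     0.0356         0.0012
(5, 0.5)      0.0128         0.0039     0.0448         0.0007
(5, 0.7)     -0.0142         0.0025    -0.0154         0.0021
(10, 0.3)     0.1020         0.0073     0.0511         0.0031
(10, 0.5)     0.1202         0.0154     0.1163         0.0108
(10, 0.7)    -0.1730         0.0171     0.0778         0.0093

Table 2 MSE ratio of three estimation methods
T     (μ, α)     Bayes/CLS   Bayes/Y-W
100   (5, 0.3)   0.4371      0.3914
100   (5, 0.5)   0.9156      0.5861
100   (5, 0.7)   0.8519      0.6961
500   (5, 0.3)   0.7484      0.8235
500   (5, 0.5)   1.0041      0.8140
500   (5, 0.7)   0.7919      0.8171

Table 1 shows that both the bias and the MSE of the Bayes estimate are fairly small. Judged by MSE, the ratios in Table 2 show that the Bayes estimate outperforms the CLS and Yule-Walker estimates in all but one setting: the Bayes/CLS ratio is 1.0041 for T = 500, (μ, α) = (5, 0.5).

4 Data analysis and model prediction
Fig.1 Sample path of skin-lesions data
We now apply the NGINAR(1) model to a data set of cattle skin lesions [9] for analysis and prediction. The data come from the New Zealand Ministry of Agriculture and Forestry and give the monthly number of cattle with skin lesions recorded by the animal health laboratories of one region of New Zealand from 2003 to 2009. The data set is split into two parts: the data from 2003-01 to 2009-08 are used to estimate the parameters, and the data from 2009-09 to 2009-12 are held out as values to be predicted. Denote the first 80 observations by X_1, X_2, …, X_80. The sample mean is 1.5, the sample variance is 3.41772 and the variance-to-mean ratio is Id = 2.2785. The sample path is shown in Fig. 1, and the autocorrelation and partial autocorrelation functions in Fig. 2 and Fig. 3. Figs 2 and 3 show first-order dependence, so the NGINAR(1) model can be fitted to the data.
Fig.2 Autocorrelation function plot of skin-lesions data
Fig.3 Partial autocorrelation function plot of skin-lesions data
Prediction for the NGINAR(1) model is based on the approximate h-step (h ∈ ℕ₀) conditional predictive distribution of {X_t} (conditional distribution prediction for short). This approach suits integer-valued time series better than conventional conditional expectation prediction. Conditional expectation prediction minimises the mean squared error of the forecast, but when the observations and the values to be predicted are integers, it rarely yields an integer. To solve this problem, reference [10] proposed predicting integer-valued models through the conditional distribution. The resulting forecasts lie in the same state space as the integer-valued series itself. The conditional median, conditional mean and conditional mode, and even confidence intervals for the forecast, are also easy to compute, so satisfactory forecasts can be obtained. Because the NGINAR(1) process is Markov, the conditional distribution of X_{n+h} given X_n (that is, the conditional predictive distribution given X_n) is P(X_{n+h} = x_{n+h} | X_n = x_n) = [P^h]_{x_{n+h}, x_n}, where the transition probabilities are …
We now examine the predictive performance of the NGINAR(1) model on the skin-lesion data, comparing conditional distribution prediction with conditional expectation prediction; the results are in Table 3. For h = 1, 2, 3, 4 the conditional distribution forecasts are all 0, matching the observed values, whereas the conditional expectation forecasts are 1.4127, 1.4704, 1.4785 and 1.4797, which deviate from the observed value 0. The mean absolute deviation is 0 for conditional distribution prediction and 1.4603 for conditional expectation prediction, so conditional distribution prediction is better suited to integer-valued time series.

Table 3 Comparison of conditional expectation and conditional distribution prediction results of skin-lesions data
h                         Observed   Conditional expectation   Conditional distribution
1                         0          1.4127                    0
2                         0          1.4704                    0
3                         0          1.4785                    0
4                         0          1.4797                    0
Mean absolute deviation              1.4603                    0

Fig. 4 shows the h-step conditional predictive distribution for the skin-lesion data. The NGINAR(1), ZIPINAR(1) and PINAR(1) models are fitted to the data and compared by AIC (Akaike information criterion), BIC (Bayes information criterion), root mean square (RMS) and variance-to-mean ratio; the results are in Table 4. The NGINAR(1) model has the smallest AIC and BIC, the RMS values of the three models differ little, and the variance-to-mean ratio of the NGINAR(1) model, 2.4799, is closest to that of the data, 2.2785. All of this indicates that the NGINAR(1) model fits this data set well.
Fig.4 h-Step conditional prediction distribution of skin-lesions data

Table 4 AIC, BIC, RMS and Id of skin-lesions data under different models
Model        Estimates                              AIC        BIC        RMS      Id
ZIPINAR(1)   α̂ = 0.1641, λ̂ = 2.0311, ρ̂ = 0.3863    276.8756   284.0217   1.8068   1.6740
PINAR(1)     α̂ = 0.1573, λ̂ = 1.2567                293.339    298.1031   1.8075   1.0000
NGINAR(1)    α̂ = 0.1400, μ̂ = 1.4799                271.103    275.867    1.8097   2.4799

The results above indicate that the Bayes estimate of the NGINAR(1) parameter generally performs better than the conditional least squares and Yule-Walker estimates, and that conditional distribution prediction suits integer-valued time series better than conditional expectation prediction.

References
[1] FUKASAWA T, BASAWA I V. Estimation for a Class of Generalized State-Space Time Series Models [J]. Statist Probab Lett, 2002, 60(4): 459-473.
[2] STEUTEL F W, VAN HARN K. Discrete Analogues of Self-decomposability and Stability [J]. Ann Probab, 1979, 7(5): 893-899.
[3] ALY E-E A A, BOUZAR N. On Some Integer-Valued Autoregressive Moving Average Models [J]. J Multivariate Anal, 1994, 50(1): 132-151.
[4] McKENZIE E. Some Simple Models for Discrete Variate Time Series [J]. Journal of the Amer Water Resources Association, 1985, 21(4): 645-650.
[5] AL-OSH M A, ALZAID A A. First-Order Integer-Valued Autoregressive (INAR(1)) Process [J]. J Time Ser Anal, 1987, 8(3): 261-275.
[6] RISTIĆ M M, BAKOUCH H S, NASTIĆ A S. A New Geometric First-Order Integer-Valued Autoregressive (NGINAR(1)) Process [J]. J Statist Plan Inference, 2009, 139(7): 2218-2226.
[7] YANG Kai, WANG Dehui. Bayesian Estimation for First-Order Autoregressive Model with Explanatory Variables [J]. Comm Statistics: Theory Methods, 2017, 46(22): 11214-11227.
[8] ZHU Fukang, LI Qi. Moment and Bayesian Estimation of Parameters in the INGARCH(1,1) Model [J]. Journal of Jilin University (Science Edition), 2009, 47(5): 899-902. (In Chinese.)
[9] JAZI M A, JONES G, LAI C D. First-Order Integer-Valued AR Processes with Zero Inflated Poisson Innovations [J]. J Time Series Anal, 2012, 33(6): 954-963.
[10] MÖLLER T A, SILVA M E, WEIß C H, et al. Self-exciting Threshold Binomial Autoregressive Processes [J]. Adv Statist Anal, 2016, 100(4): 369-400.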
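The difference between the two thinning operators described in the NGINAR(1) article can be illustrated with a short simulation. This is a minimal sketch, not the authors' code. The function names are hypothetical. The geometric law used for W_j, P(W = k) = β^k / (1 + β)^(k+1) with mean β, is an assumption, because the corresponding formula did not survive in this copy of the text.

```python
import random


def binomial_thinning(alpha, x, rng):
    """alpha ∘ x: sum of x i.i.d. Bernoulli(alpha) variables, so the result never exceeds x."""
    return sum(1 for _ in range(x) if rng.random() < alpha)


def geometric_count(mean, rng):
    """Draw W >= 0 with P(W = k) = mean**k / (1 + mean)**(k + 1) (assumed form, E[W] = mean)."""
    p_stop = 1.0 / (1.0 + mean)
    k = 0
    while rng.random() >= p_stop:  # keep counting until the first "stop"
        k += 1
    return k


def negative_binomial_thinning(beta, x, rng):
    """beta * x: sum of x i.i.d. geometric counts, which may exceed x."""
    return sum(geometric_count(beta, rng) for _ in range(x))


rng = random.Random(0)
x = 10
bin_draws = [binomial_thinning(0.4, x, rng) for _ in range(20000)]
nb_draws = [negative_binomial_thinning(0.4, x, rng) for _ in range(20000)]
nb_wide = [negative_binomial_thinning(0.9, x, rng) for _ in range(1000)]

print(max(bin_draws), sum(nb_draws) / len(nb_draws), max(nb_wide))
```

Under these assumptions both operators have mean 0.4 × 10 = 4 in the first two runs, but only negative binomial thinning can return a value larger than x, which is the property the article relies on for overdispersed counts.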

Bayesian inference theory for time series vector autoregressive models

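The time series vector autoregressive model (VAR model for short) was first proposed in the early 1980s by the American scholars Litterman, Sargent, Sims and others, mainly as a replacement for simultaneous-equations structural models in order to improve the accuracy of economic forecasts. Using simultaneous-equations models to study macroeconomic problems is common practice among economists worldwide. The approach combines theoretical analysis with actual statistical data and uses linear or nonlinear regression to determine the quantitative relationships among economic variables, building a model system made up of several equations. Simultaneous-equations models are suited to analysing economic structure but not to forecasting: their forecasts are not very accurate, mainly because the exogenous variables themselves have to be forecast. Unlike simultaneous-equations models, the VAR model has a relatively simple and clear structure and is better suited to economic forecasting, …

… does not agree with the … process, so that the model structure contradicts economic theory. In contrast to conventional econometric methods, Bayesian inference theory offers a convenient analytical framework for estimating VAR models with too many parameters, thanks mainly to Litterman's pioneering work at the Federal Reserve Bank of Minneapolis. In 1986 Litterman used a Bayesian time series autoregressive model to forecast seven macroeconomic indicators, including the gross product of Minnesota, with very good results. Since then Bayesian methods have been widely applied in business and government macroeconomic forecasting, and the related research output has grown year by year. However, judging from the literature in China and abroad, a systematic theoretical treatment of Bayesian inference for VAR models has not yet appeared.

Therefore every equation in the unrestricted VAR(p) model has the same explanatory variables. It is easy to show that model (1) can be rewritten as the following system of n equations:
y_t = B′z_t + u_t, t = 1, 2, …, n (3)
where …
Further, if the transposes of the vectors y_t, B′z_t and u_t, t = 1, 2, …, n, are stacked row by row so that each forms an n×m matrix, the n equations above reduce to a more compact matrix form …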


ORIGINAL ARTICLE
Bayesian Zero-Inflated Predictive Modelling of Herd-Level Salmonella Prevalence for Risk-Based Surveillance
J. Benschop1, S. Spencer2, L. Alban3, M. Stevenson1 and N. French1
1 EpiCentre, Institute of Veterinary, Animal and Biomedical Sciences, Massey University, Palmerston North, New Zealand (place where work was carried out)
2 School of Mathematical Sciences, University of Nottingham, Nottingham, UK
3 Danish Meat Association, Kjellerup, Denmark

Impacts
• Implementing reduction strategies for surveillance systems that were developed to protect public health is challenging. The apparently contradictory requirements for continued consumer confidence in food supply, and cost reduction for industry need to be met.
• We use zero-inflated modelling to demonstrate how targeting farms in space and by risk factors has the potential to result in more efficient ways of conducting surveillance.
• This allows assessment of the factors that affect herd infection status (introduction) and those that affect the sero-prevalence in infected herds (persistence and spread). Partitioning risk factors is valuable for directing practical risk-mitigating advice.

Keywords: Salmonella; pigs; zoonotic; Denmark; surveillance; zero-inflated
Correspondence: Dr. Jackie Benschop. Massey University, Private Bag 11222, Palmerston North 4442, New Zealand. Tel.: +6463504528; Fax.: +6463505716; E-mail: j.benschop@
Received for publication November 20, 2009
doi: 10.1111/j.1863-2378.2010.01355.x

Summary
The national control programme for Salmonella in Danish swine herds introduced in 1993 has led to a large decrease in pork-associated human cases of salmonellosis. The pork industry is increasingly focused on the cost-effectiveness of surveillance while maintaining consumer confidence in the pork food supply. Using national control programme data from 2003 and 2004, we developed a zero-inflated binomial model to predict which farms were most at risk of Salmonella. We preferentially sampled these high-risk farms using two sampling schemes based on model predictions
resulting from a farm's covariate pattern and its random effect. Zero-inflated binomial modelling allows assessment of similarities and differences between factors that affect herd infection status (introduction), and those that affect the seroprevalence in infected herds (persistence and spread). Both large (producing greater than 5000 pigs per annum), and small herds (producing less than 2000 pigs per annum) were at significantly higher risk for infection and subsequent seroprevalence, when compared with medium sized herds (producing between 2000 and 5000 pigs per annum). When compared with herds being located elsewhere, being located in the south of Jutland significantly decreased the risk of herd infection, but increased the risk of a pig from an infected herd being seropositive. The model suggested that many of the herds where Salmonella was not detected were infected, but at a low prevalence. Using cost and sensitivity, we compared the results of our model based sampling schemes with those under the standard sampling scheme, based on herd size, and the recently introduced risk-based approach. Model-based results were less sensitive but show significant cost savings. Further model refinements, sampling schemes and the methods to evaluate their performance are important areas for future work, and these should continue to occur in direct consultation with Danish authorities.

Zoonoses and Public Health. © 2010 Blackwell Verlag GmbH • Zoonoses Public Health. 57 (Suppl. 1) (2010) 60–70

Introduction
New challenges for animal health surveillance for zoonotic disease in the twenty-first century are many and include those brought about by increased trade, limited resources, consumer awareness and disease emergence (Hodges and Kimball, 2005; Woolhouse and Gowtage-Sequeria, 2005; Fevre et al., 2006; Vorou et al., 2007). This study is focused on the additional challenge of developing reduction strategies for surveillance systems for diseases that in the past represented an important risk, while today the risk to consumers is
substantially reduced. Surveillance for Salmonella in Danish pig herds is an example that meets these criteria.
Such strategies require a delicate balance between satisfying producer and industry concerns about cost-effective testing and maintaining consumer confidence in food supply. Salmonella and BSE were the food risks most dreaded in a UK survey of food risk perception undertaken in 1999 (Kirk et al., 2002) and a recent survey of consumers identified meat as the food item in which confidence had decreased the most (Verbeke et al., 2007). It makes sense that any strategy involving a reduction in testing should demonstrate an equal or greater sensitivity as the existing one, regardless of the potential efficiency gains.
The means to evaluate the sensitivity of a surveillance programme and subsequently compare alternatives has been explored in the veterinary epidemiological literature (Audigé et al., 2001; Cannon, 2002; Martin et al., 2007b). In this context, a surveillance programme is considered as a diagnostic system which aims to correctly identify the presence or absence of an unwanted agent. By quantifying the characteristics of the diagnostic system (such as its specificity and sensitivity), a surveillance programme can be formally evaluated. For example, Audigé and Beckett defined surveillance sensitivity as the probability of declaring an area infected, given that infection exists, for the evaluation of surveillance for porcine reproductive and respiratory syndrome (PRRS) in Switzerland (Audigé and Beckett, 1999).
Quantification of the sensitivity of a surveillance system allows comparison of alternative surveillance strategies.
For example, the comparison of the sensitivity of the currently targeted surveillance system for classical swine fever (CSF) in Denmark with that of a simulated non-targeted system identified that the current system was twice as sensitive compared with the simulated, non-targeted system (Martin et al., 2007a). In another Danish example, the sensitivity of the current surveillance programme for infectious bovine rhinotracheitis (IBR) was compared with three other surveillance scenarios targeting specific geographical areas and risk periods (Chriel et al., 2005).
Techniques all involving scenario tree methodology have been used for proof of disease freedom for exotic, non-zoonotic and clinically severe animal infections such as PRRS, CSF and IBR. In this study, we apply zero-inflated binomial modelling to the endemic, zoonotic and sub-clinical infection of Danish finisher pigs with Salmonella, using routinely collected data from the existing DSSCP data set from 2003 and 2004. Proof of disease freedom is not the end-point here, rather the issue is to maintain the status of domestically produced pork as a minor source of salmonellosis in humans. We propose it is possible to meet the differing needs of both consumer confidence in food supply and industry requirements for a surveillance reduction strategy with a targeted approach, whereby populations with higher risk of infection are preferentially sampled. These higher-risk populations are identified by model-based predictions driven by the previous year's performance and their covariate risk profile.
The objective was firstly to develop a model that predicts which farms are most at risk of Salmonella. Second, we preferentially sample the high-risk farms and compare our results to those under: (i) the standard sampling scheme, based on herd size and (ii) the recently introduced risk-based approach (Ministry of Family and Consumer Affairs, 2006). In this way, we are able to evaluate the impact of alternative sampling strategies on overall system performance.

Materials and Methods
Data sources
Data were obtained from three sources. First, every pig herd is required to register with the Danish Central Husbandry Register. This provided a unique identifier (the CHR number), details of farm location, herd size and the number of sows in the herd.
The second source of data was from the central database of the DSSCP. We used the results from 9735 farms in 2003 (n = 578 260 individual samples) for initial model building. The DSSCP database also provided results from the 8151 farms sampled in 2004 that were also sampled in 2003 to investigate our different sampling schemes. Details retrieved from the DSSCP database included the CHR number, the date of sampling and the result of the Danish-mix ELISA (DME). This test measures antibodies in meat-juice to determine the previous exposure of finisher pigs to Salmonella spp. and can detect O-antigens from at least 93% of all serovars known to be present in Danish pigs (Mousing et al., 1997). The principal advantages of serological methods for Salmonella detection is the ability to assay a large number of samples rapidly at relatively low cost and high sensitivity when compared with bacteriology (2 € per sample). For
since1August2001(Alban et al.,2002).All samples included in this study were analysed at the Danish Insti-tute for Food and Veterinary Research using the DME. On the basis of testing,herds receive a monthly‘serologi-cal Salmonella index’which is based on a weighted aver-age of the results from the previous three months.The levels of index are low level or no antibodies(index 0–39);medium(index40–69);and high(index70or greater)(Alban et al.,2002).Herds in the medium and high index have reduced payments forfinisher pigs sent to slaughter and must collect pen-faecal samples to deter-mine the subtype and distribution of Salmonella in the herd.The third source of data was the Danish Specific Path-ogen Free(SPF)Company which provided health status details associated with each farm.We chose to analyse data from2003and2004as we had access to additional farm-level details such as herd size,health status and the number of sows on the farm for those respective years.We proposedfitting a model to data from2003to inform sampling strategies for the sub-sequent year(estimation).Then wefit a model to the 2004data and use this to see how successful the sampling strategies chosen from the2003data were(prediction). 
Sampling schemes and model development
Four sampling schemes were developed. Herds were assigned to one of these schemes based on estimates from a model fitted to DSSCP data from 2003:
1 Original herd size based sampling (OHS): This sampling strategy was in place from August 2001 to July 2005. Using this approach, the eligible population comprised all herds with an annual kill greater than 200 slaughter pigs (representing 99% of all finisher herds in Denmark). The number of samples taken depended solely on herd size: the aim was to take 60, 75 or 100 samples annually from herds with an estimated annual kill of 200–2000, 2001–5000 and greater than 5000 slaughter pigs respectively (Alban et al., 2002). For the purposes of this study, we have used this sampling scheme to represent the bench-mark to which we compare the alternative sampling strategies.
2 DMA risk-based sampling (DRB): In July 2005, the surveillance system became performance-based which reduced the annual sample size by approximately one-third. For herds that had no positive meat juice samples over the previous 5 months, the sample size was reduced to one sample per month (Enoe et al., 2003; Ministry of Family and Consumer Affairs, 2006). If a herd then had one or more positive samples, the strategy reverts to one based on herd-size (OHS). We apply a modified version of these sampling criteria to herds in 2004 based on their performance in 2003. Our modification is that we have extended the time period over which herds are assessed to determine their prevalence to be the whole year, rather than the previous 5 months.
3 Model derived risk-based sampling A (MRBA): We developed a targeted surveillance strategy based on our previous risk-factor, spatial and temporal analyses of the DSSCP data (Benschop et al., 2008a,b,c). All herds with a predicted median within-herd seroprevalence at or below a model determined cut-off in 2003 were identified as low risk and were placed on the DRB scheme. This prediction was based on the farm's covariate pattern and random farm effect. All other
herds (above the predicted within-herd seroprevalence threshold) were left on the current sampling scheme for 2004 based on herd size (OHS).
4 Model derived risk-based sampling B (MRBB): As in MRBA above, all herds with a predicted median within-herd seroprevalence at or below a model determined cut-off in 2003 were identified as low risk and were placed on the DRB scheme. The remaining herds were then assigned to two different sampling schemes depending on their predicted seroprevalence in 2003: (i) those with a predicted seroprevalence that was < 0.25 or > 0.55 were left on the current sampling scheme based on herd size; and, (ii) those with a predicted seroprevalence of between 0.25 and 0.55 were more intensively sampled to provide 95% confidence that we were within 0.05 of the true value of the predicted seroprevalence. The increased intensity of sampling is created from model-derived data. This range was chosen as these herds were near the cut-off for level 2 Salmonella status (0.40).

Model development for the sampling schemes
The frequency histogram of the herd-level prevalence based on the actual test results from the OHS sampling strategy for 2003 and 2004 (Figure 1) showed a large amount of variation with a predominance of test-negative herds. These test-negative herds can come from two types of disease-negative herds: (i) those that are truly uninfected and therefore every sample is negative, and (ii) those that are, in fact, infected but provide insufficient samples to detect the presence of infection. This led us to propose a zero-inflated binomial (ZIB) approach to model herd-level Salmonella prevalence as it reflected our understanding of what is happening on the farm. The ZIB model has two herd-level outcomes, the probability of infection and, conditional on infection being present,
an estimate of herd-level seroprevalence. This type of modelling can provide an added advantage over logistic regression: an ability to assess the extent of the similarities and differences between factors affecting herd infection status (invasion) and those affecting the seroprevalence in infected herds (persistence and spread).
Variables that might explain both the presence of infection and herd-level prevalence included herd size, farm location, the number of sows present and herd health status. Herd size was the actual number of slaughter pigs produced for the year; this was centred by subtracting the mean and dividing by 1000. Farm location was a binary variable; if a herd was located in the Sonderjylland district it was coded as 1, otherwise 0. Health status was a three-level categorical variable: conventional, SPF and SPF with Mycoplasma. The presence of sows was expressed as a three level ordinal variable: farms with no sows, farms with less than 125 (some) and farms with over 125 (many).
Logistic regression modelling was used for initial model building. Bivariate analyses found all covariates significant at the P ≤ 0.25 level and using data from 2003 we built a multivariable model within the statistical software R, version 2.5.1 (Ihaka and Gentleman, 1996). The outcome variable was seroprevalence defined as the number of cases divided by the number of samples taken. All putative risk factors were significant. The continuous variable herd size was checked to see if it was linear in its log odds (Hosmer and Lemeshow, 1989). Polynomials of herd size and biologically plausible two-way interaction terms between the main-effect variables were considered for inclusion.
Once satisfied with the model structure we developed a logistic model within a Bayesian framework using WinBUGS version 1.4.1 (Gilks et al., 1994). The code for the model is shown in Fig. 2. Initially, we stipulated informed priors for the intercept term, and
covariates relating to location, health status and the number of sows present on farm. We based these on published literature supplying subjective information about the likelihood ascribed to various combinations of covariate values (Congdon, 2001). For example, from earlier work on other data from the Danish Salmonella surveillance-and-control programme we believed that SPF health status would be a protective factor for a herd (Benschop et al., 2008b). Moreover, residing in the district of Sonderjylland in the south of Jutland would be a risk factor (Benschop et al., 2008a) for herd-level sero-positivity. Based on available literature, an increased number of sows on farms were considered a risk factor for Salmonella in finishers (Hautekiet et al., 2008). Priors for the Bayesian logistic regression model were expressed in terms of a conjugate beta density (Congdon, 2001).

    # WinBUGS model file for Zero Inflated Binomial model
    #
    # cases[i] = number of positives in herd i,
    # pop[i] = number sampled in herd i,
    # rho[i] = seroprevalence in herd i given that it is infected,
    # J[i] = infection status (1/0) of herd i,
    # q[i] = probability of infection for herd i,
    # p[i] = unconditional seroprevalence in herd i.
    model{
      for(i in 1:Ncases) {
        cases[i] ~ dbin(p[i], pop[i])
        p[i] <- rho[i]*J[i]
        J[i] ~ dbern(q[i])
        logit(rho[i]) <- beta0 + beta1*herd.small[i] + beta1a*herd.large[i] + beta2*stspf[i] + beta3*stm[i] + beta4*sowalot[i] + beta5*sownone[i] + beta6*sond[i] + B[i]
        logit(q[i]) <- alpha0 + alpha1*herd.small[i] + alpha1a*herd.large[i] + alpha2*stspf[i] + alpha3*stm[i] + alpha4*sowalot[i] + alpha5*sownone[i] + alpha6*sond[i] + A[i]
        # Priors for random effects
        B[i] ~ dnorm(0, reprec)
        A[i] ~ dnorm(0, reprec)
      }
      # Priors on fixed effects
      alpha0 ~ dnorm(0,prec)
      beta0 ~ dnorm(0,prec)
      alpha1 ~ dnorm(0,prec)
      beta1 ~ dnorm(0,prec)
      alpha1a ~ dnorm(0,prec)
      beta1a ~ dnorm(0,prec)
      alpha2 ~ dnorm(0,prec)
      beta2 ~ dnorm(0,prec)
      alpha3 ~ dnorm(0,prec)
      beta3 ~ dnorm(0,prec)
      alpha4 ~ dnorm(0,prec)
      beta4 ~ dnorm(0,prec)
      alpha5 ~ dnorm(0,prec)
      beta5 ~ dnorm(0,prec)
      alpha6 ~ dnorm(0,prec)
      beta6 ~ dnorm(0,prec)
    }
Fig. 2. WinBUGS code for the zero-inflated binomial model.

We used a non-informed, normally distributed prior centred at zero and with a variance of 1 for the effect of herd size, given that information about the effect of this variable on sero-positivity was uncertain or conflicting. Three chains were run and convergence was judged to have occurred on the basis of visual inspection of time series plots and Gelman-Rubin plots (Toft et al., 2007).
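The Gelman-Rubin diagnostic mentioned above compares the spread between chains with the spread within each chain. The sketch below computes the usual potential scale reduction factor for one parameter. It is a stand-in written for illustration, not the WinBUGS procedure the authors inspected visually; the function name and the simulated chains are hypothetical.

```python
import random
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one parameter.

    `chains` is a list of equal-length lists of sampled values.
    """
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    within = mean([variance(c) for c in chains])    # W: average within-chain variance
    between_over_n = variance(chain_means)          # B/n: variance of the chain means
    pooled = (n - 1) / n * within + between_over_n  # pooled estimate of the posterior variance
    return (pooled / within) ** 0.5


# Hypothetical chains: three that agree, and three stuck in different regions.
rng = random.Random(1)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(3)]
stuck = [[rng.gauss(mu, 1.0) for _ in range(2000)] for mu in (0.0, 3.0, 6.0)]
print(round(gelman_rubin(mixed), 3), round(gelman_rubin(stuck), 3))
```

Values close to 1 suggest the chains are sampling the same distribution; values well above 1, as for the `stuck` chains here, suggest they have not converged.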
The length of the chain was determined by running sufficient iterations to ensure the Monte Carlo standard errors for each parameter were less than 5% of the posterior standard deviation. A total of 40 000 iterations were run with a 'burn in' of 4000 iterations.
The logistic regression model was extended to a zero-inflated binomial model and specified as follows:
$$\mathrm{cases}[i] \sim \mathrm{Bin}(\mathrm{pop}[i],\ p[i])$$
Here, the number of cases from the i th herd is binomially distributed as a function of the number of trials (tests for Salmonella antibodies in meat-juice) pop[i], and the probability of a test being positive (adjusted OD% > 10), p[i]. We further defined:
$$p[i] = \mathrm{rho}[i] \times J[i]$$
where J[i] is an indicator variable representing infection status of the i th herd, rho[i] is the sero-prevalence conditional on the presence of infection. The term rho therefore represents the probability of finding infection in a randomly chosen pig from an infected herd. The latent variable J[i] is distributed as:
$$J[i] \sim \mathrm{Bern}(q[i])$$
where q[i] is the probability of a herd being infected. This latent variable was modelled as:
$$\log\left(\frac{q_i}{1 - q_i}\right) = \alpha_0 + \alpha_1 x_{1i} + \dots + \alpha_m x_{mi} + A_i \qquad (1)$$
In Equation 1, the logit of the observed probability of the i th herd being infected, logit(q_i), was modelled as a function of m = 4 farm-level explanatory variables (herd size, location, the number of sows present and health status) and a random effect term, A_i, which was normally distributed with a mean of zero and precision r. For the ZIB model, the continuous variable herd size was categorized to facilitate model convergence. The categories chosen were the same as those used in the DSSCP (Alban et al., 2002).
The latent variable rho[i] was modelled as:
$$\log\left(\frac{\mathrm{rho}_i}{1 - \mathrm{rho}_i}\right) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_m x_{mi} + B_i \qquad (2)$$
In Equation 2, the logit of the probability of observing infection in a randomly chosen pig from the i th infected farm was modelled as a function of the four farm-level explanatory variables defined earlier and a random effect term for herd, B_i, which was normally distributed with a mean of zero and precision s.
We set non-informed, normally distributed priors centred at zero and with a precision of 0.5 for each of the fixed effect terms, including the intercept. Sensitivity to these priors was evaluated by re-running the models with a precision of 1 and 0.2. For the precision of the random farm-level effects, r and s, we specified a precision of 1. Sensitivity to these priors was evaluated by re-running the models with a precision of 0.5 and 0.3.
Three chains were run and convergence was judged to have occurred on the basis of visual inspection of plots of the sampled values as a time series (Toft et al., 2007). The required number of iterations of the Gibbs sampler was determined by running sufficient iterations to ensure the Monte Carlo standard errors for each parameter were less than 5% of the posterior standard deviations. A total of 30 060 iterations were run with a 'burn in' of 1000 iterations.
We proposed fitting this model on 2003 data to inform sampling strategies for the subsequent year (estimation).
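The two-part structure of the zero-inflated binomial model can be made concrete with a short sketch: a herd tests entirely negative either because it is uninfected or because it is infected but every sampled pig is negative. The functions below are illustrative only; the names and the example values of q, rho and n are assumptions, not estimates from the paper.

```python
import random


def prob_all_negative(q, rho, n):
    """P(no positives among n samples): herd uninfected, or infected but missed."""
    return (1.0 - q) + q * (1.0 - rho) ** n


def prob_infected_given_all_negative(q, rho, n):
    """P(herd infected | all n samples negative), by Bayes' theorem."""
    missed = q * (1.0 - rho) ** n
    return missed / ((1.0 - q) + missed)


def simulate_herd(q, rho, n, rng):
    """Draw (J, cases): J ~ Bern(q), cases ~ Bin(n, rho * J)."""
    infected = rng.random() < q
    cases = sum(1 for _ in range(n) if infected and rng.random() < rho)
    return infected, cases


# Hypothetical herd: 50% chance of infection, 5% seroprevalence if infected,
# 12 samples a year (one per month, as under the reduced sampling scheme).
q, rho, n = 0.5, 0.05, 12
rng = random.Random(42)
draws = [simulate_herd(q, rho, n, rng) for _ in range(20000)]
observed_zero = sum(1 for _, cases in draws if cases == 0) / len(draws)
print(prob_all_negative(q, rho, n), observed_zero)
print(prob_infected_given_all_negative(q, rho, n))
```

With few samples and low within-herd prevalence, a substantial share of test-negative herds are in fact infected, which is the situation the ZIB model is designed to represent.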
We then fit a model to the 2004 data and use it to assess how successful the sampling strategies chosen from the 2003 data were by, for example, comparing the number of false negatives (prediction).

To check for consistency between years (2003 and 2004), we examined model outputs from both years of data separately and compared the magnitude and direction of the regression coefficients. The 8151 random farm-level effects for the 2 years were compared using scatter plots, and their agreement was quantified using Lin's concordance correlation coefficient (Lin, 1989).

A scatter plot of the median conditional sero-prevalence rho[i] versus the median probability of infection q[i] (Fig. 3) was used to identify the cut-off for the two model-derived risk-based sampling schemes MRBA and MRBB.

Comparison of sampling schemes

The results from all four sampling schemes were compared by considering cost, the number of false-negative farms and the number of farms detected with a within-herd sero-prevalence of ≥0.40.

Costs were compared by adding up the number of tests taken under each of the four sampling schemes. Only the costs of meat juice testing were taken into account, with each meat juice sample tested costing 2 €. These costs are borne by the producers through levies on each pig slaughtered. There are follow-on tests, costing 200 €, once herds reach levels 2 and 3, with further costs if herds are found to be positive. These follow-on tests were not considered further in this study.

Predictive Modelling Informs Salmonella Surveillance. J. Benschop et al. © 2010 Blackwell Verlag GmbH • Zoonoses Public Health. 57 (Suppl. 1) (2010) 60–70

For each farm (n = 8151) there were 1020 iterations stored from the model and these were used to determine the false-negative rate and the number of farms detected with a within-herd sero-prevalence of ≥0.40 for each of the four sampling schemes.

The number of farms that were falsely reported as negative and the sensitivity for each of the four sampling schemes were determined using the following process:

(a) the J[i] parameter, the indicator variable representing infection status of the ith herd, for 2004 was examined at each iteration. If it equalled one, then, for that iteration, the farm was considered infected. Otherwise, for that iteration, the farm was considered uninfected;
(b) rho[i], the predicted within-herd seroprevalence given the herd was infected, for 2004 was determined for each iteration when the farm was infected. rho[i] was combined with the number of pigs sampled, using the binomial distribution, to determine the number of positives that would be detected at each iteration;
(c) a false-negative iteration was defined as one where the farm was infected at the iteration, but no positives were detected at that iteration. The number of false-negative iterations was summed and divided by the number of total iterations to give the number of false-negative farms;
(d) this was expressed as the sensitivity of the sampling scheme by dividing the number of false-negative farms by the total number of farms (n = 8151), and subtracting this fraction (the false-negative fraction) from one.

The number of farms that were predicted to have an observed seroprevalence of ≥0.40 for each of the four sampling schemes was determined using the following process:

1 the number of positives detected in each herd for each iteration was determined as in steps (a) and (b) in the preceding paragraph;
2 the number of positives was divided by the number sampled to give the observed seroprevalence in each herd at each iteration;
3 these numbers were summed and divided by the number of iterations to obtain the expected number of herds with observed seroprevalences of ≥0.40.

Results

Data sources

In 2003, there were 9735 herds in the programme. The median number of pigs finished per year was 2000 (IQR: 800–3700). In total, 5938 herds (61%) kept no sows, 1752 (18%) some and 2045 (21%) kept many. A total of 7107 herds (73%) were of conventional health status, 586 (6%) SPF status and 2042 (21%) SPF with Mycoplasma. Finally, 978 herds (10%) were from Sonderjylland.

Sampling schemes (including model development)

All predictors were significant in the simple logistic regression model developed in R. The results of the Bayesian model using all these predictors are shown in Table 1. Compared with pigs from conventional health status herds, pigs from SPF health status and SPF-Mycoplasma status herds had 0.69 (95% CI 0.66–0.72) and 0.93 (95% CI 0.91–0.96) times the odds of being Salmonella positive, respectively. Compared with herds having 1–125 sows, having none or more than 125 sows increased the odds of a pig being Salmonella positive by a factor of 1.33 (95% CI 1.28–1.38) and 1.36 (95% CI 1.32–1.41), respectively.
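The odds ratios quoted for Table 1 relate to the posterior means on the logit scale through the exponential function. The short check below uses two coefficients as read from Table 1; the sign of the SPF coefficient is taken to be negative, which is an interpretation of the extracted table but is consistent with an odds ratio below one. Because the reported odds ratios and credible intervals are presumably summaries of the posterior draws themselves, exponentiating the posterior mean is only an approximation to them.

```python
import math

# Posterior means on the logit scale, as read from Table 1
# (the SPF coefficient is assumed negative, consistent with OR < 1)
posterior_means = {
    "SPF vs conventional": -0.37,
    "Sonderjylland vs elsewhere": 0.28,
}

# Odds ratio = exp(coefficient)
odds_ratios = {name: math.exp(beta) for name, beta in posterior_means.items()}

for name, odds_ratio in odds_ratios.items():
    print(f"{name}: OR = {odds_ratio:.2f}")
```

Both values agree with the odds ratios reported in Table 1 (0.69 and 1.32).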
Compared with farms located outside of Sonderjylland, the odds of pigs being Salmonella positive on farms within Sonderjylland were increased by a factor of 1.32 (95% CI 1.28–1.36).

Estimated coefficients for the ZIB model are shown in Tables 2 and 3. Table 2 shows the factors included in the zero-inflated part of the model; these are interpreted as factors associated with the probability of a herd being infected. A herd producing less than 2000 (small), or greater than 5000 (large) pigs for slaughter per year had 1.58 (95% CI: 1.18–2.11) or 2.08 (95% CI: 1.42–3.14) times the odds of infection with Salmonella, respectively, compared with herds producing between 2000 and 5000 (medium) pigs per year for slaughter. Compared with herds located outside of Sonderjylland, a Sonderjylland herd had 0.25 times the odds of being infected with Salmonella (95% CI: 0.19–0.33).

Table 3 shows the model results for the binomial part of the ZIB model; these are interpreted as variables associated with the level of seropositivity in a herd, given that the herd is infected. The odds of a pig being sero-positive in an infected small or large herd were increased by a factor of 1.16 (95% CI 1.08–1.24) compared with a pig being sero-positive in an infected medium herd. The remaining results were similar to those provided in Table 1 for the logistic regression model.

The ZIB model was insensitive to changes in the precision parameter of the prior distribution assigned to A_i and B_i. The zero-inflated part of the model showed a 5-fold increase in the value of the posterior standard deviation when compared with the binomial part of the model.

As we planned to use this model, based on 2003 data, to predict the probability of infection and seropositivity in 2004, we checked for consistency between the 2 years. This was thought to be important, because substantial changes in pig- and herd-level risks for infection (arising from, e.g. changes in herd size or changes in the price of feed) from 1 year to the next could reduce the ability of the 2003 model to predict herd-level behaviour in 2004.

The magnitude and sign of the regression coefficients for 2003 and 2004 were compared. There was no change in

Table 2. Zero-inflated binomial model output showing factors associated with Salmonella infection status in 8151 Danish finisher herds in 2003 as a part of the national surveillance and control programme

Variable        Level                    Posterior mean   Posterior SD   MC error   OR (95% CI)
Intercept       –                        2.36             0.19           0.008      –
Herd size       Small                    0.46             0.15           0.004      1.58 (1.18–2.11)
                Medium                   Reference        –              –          –
                Large                    0.73             0.21           0.005      2.08 (1.42–3.14)
Health status   Conventional             Reference        –              –          –
                SPF                      0.56             0.46           0.011      1.67 (0.85–5.02)
                SPF (with Mycoplasma)    −0.14            0.16           0.003      0.87 (0.63–1.18)
Sow status      None                     0.14             0.21           0.008      1.15 (0.75–1.71)
                Some                     Reference        –              –          –
                Many                     0.05             0.24           0.009      1.04 (0.64–1.67)
Sonderjylland   No                       Reference        –              –          –
                Yes                      −1.38            0.14           0.003      0.25 (0.19–0.33) [a]

SD, standard deviation; CI, Bayesian credible interval; MC error, Monte Carlo standard error of the posterior mean; OR, odds ratio.
[a] Interpretation: once adjusted for herd size, number of sows and herd health status, a farm located in Sonderjylland had 0.25 times the odds of being Salmonella positive compared with a farm located elsewhere (95% CI: 0.19–0.33).

Table 1. Results of a logistic regression model showing factors associated with Salmonella sero-positivity in 578260 meat-juice ELISA results taken from 9735 Danish finisher herds in 2003 as a part of the national surveillance-and-control programme

Variable        Level                    Posterior mean   Posterior SD   MC error   OR (95% CI)
Intercept       –                        −2.88            0.01           <0.001     –
Herd size [a]   Continuous               2.9 × 10^−2      0.00           <0.001     1.02 (1.01–1.03)
Health status   Conventional             Reference        –              –          –
                SPF                      −0.37            0.02           <0.001     0.69 (0.66–0.72) [b]
                SPF (with Mycoplasma)    −0.07            0.01           <0.001     0.93 (0.91–0.96)
Sow status      No sows                  0.31             0.02           <0.001     1.33 (1.28–1.38)
                Some sows (1–125)        Reference        –              –          –
                Many sows (>125)         0.28             0.02           <0.001     1.36 (1.32–1.41)
Sonderjylland   No                       Reference        –              –          –
                Yes                      0.28             0.02           <0.001     1.32 (1.28–1.36)

SD, standard deviation; CI, Bayesian credible interval; MC error, Monte Carlo standard error of the posterior mean; OR, odds ratio.
[a] No. finishers produced (rescaled by subtracting the minimum, then dividing by 1000).
[b] Interpretation: once adjusted for herd size, sow status and location within Sonderjylland, a pig on a farm with SPF health status had 0.69 times the odds of being Salmonella positive compared with a pig on a farm with conventional health status (95% CI: 0.66–0.72).
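Returning to the comparison of sampling schemes described in the Methods, steps (a) to (d) and steps 1 to 3 can be sketched as a single function operating on the stored posterior draws. This is one reading of the procedure, not the authors' code: the function name `evaluate_scheme`, the array layout and all input values are hypothetical, and the treatment of farms from which no pigs are sampled (observed sero-prevalence set to zero) is an assumption.

```python
import numpy as np


def evaluate_scheme(J, rho, n_sampled, rng, threshold=0.40):
    """Evaluate one sampling scheme from stored posterior draws.

    J         : (n_iter, n_farms) 0/1 draws of infection status
    rho       : (n_iter, n_farms) draws of sero-prevalence given infection
    n_sampled : (n_farms,) number of pigs tested per farm under the scheme
    """
    n_iter, n_farms = J.shape
    # (a)-(b) positives detected: binomial draw, zero whenever J == 0
    positives = rng.binomial(n_sampled, rho * J)
    # (c) false-negative iteration: farm infected, yet no positive sample
    false_neg = (J == 1) & (positives == 0)
    # Sum over farms and iterations, then divide by the number of iterations
    fn_farms = false_neg.sum() / n_iter
    # (d) sensitivity as defined in the text: 1 - FN farms / all farms
    sensitivity = 1.0 - fn_farms / n_farms
    # Steps 1-2: observed sero-prevalence (zero where no pigs were sampled)
    observed = np.divide(positives, n_sampled,
                         out=np.zeros(positives.shape),
                         where=n_sampled > 0)
    # Step 3: expected number of farms at or above the threshold
    high_farms = (observed >= threshold).sum() / n_iter
    return fn_farms, sensitivity, high_farms


rng = np.random.default_rng(2)
n_iter, n_farms = 1020, 4                  # 1020 stored iterations, as in the text
J = rng.binomial(1, 0.7, size=(n_iter, n_farms))   # hypothetical draws
rho = rng.beta(2, 18, size=(n_iter, n_farms))      # hypothetical draws
n_sampled = np.array([60, 75, 100, 0])     # hypothetical scheme; last farm untested
fn_farms, sensitivity, high_farms = evaluate_scheme(J, rho, n_sampled, rng)
print(fn_farms, sensitivity, high_farms)
```

Running the same function with the sample sizes implied by each of the four schemes would give directly comparable false-negative counts, while cost follows from the total number of samples at 2 € each.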
