Multiple Logistic Regression: Foreign-Language Translation Material

 2023-09-04 14:48:30

Multiple Logistic Regression

Authors: David W. Hosmer and Stanley Lemeshow

Nationality: United States

Source: Applied Logistic Regression

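1. Introduction

In the previous chapter we introduced the logistic regression model in the univariate context. As with linear regression, the strength of a modeling technique lies in its ability to model many variables, some of which may be on different measurement scales. In this chapter we generalize the logistic model to the case of more than one independent variable, which we refer to as the 'multivariable case.' Central to the consideration of multiple logistic models is the estimation of the coefficients in the model and testing for their significance; this follows the same lines as the univariate model. An additional modeling consideration introduced in this chapter is the use of design variables for modeling discrete, nominal scale independent variables. In all cases it is assumed that there is a predetermined collection of variables to be examined. The question of variable selection is dealt with in Chapter 4.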

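2. The Multiple Logistic Regression Model

Consider a collection of p independent variables denoted by the vector $\mathbf{x}' = (x_1, x_2, \ldots, x_p)$. For the moment we assume that each of these variables is at least interval scale. Let the conditional probability that the outcome is present be denoted by $P(Y = 1 \mid \mathbf{x}) = \pi(\mathbf{x})$. The logit of the multiple logistic regression model is given by the equation

$$g(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \qquad (2.1)$$

in which case the logistic regression model is

$$\pi(\mathbf{x}) = \frac{e^{g(\mathbf{x})}}{1 + e^{g(\mathbf{x})}} \qquad (2.2)$$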
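
As a small illustration of equations (2.1) and (2.2), the following Python sketch evaluates the logit and the corresponding probability for a given coefficient vector. The function names and the numeric values are hypothetical and chosen only for illustration; they do not come from the text.

```python
import math

def logit(beta, x):
    """g(x) = beta_0 + beta_1*x_1 + ... + beta_p*x_p, as in equation (2.1)."""
    if len(beta) != len(x) + 1:
        raise ValueError("beta needs one constant term plus one slope per covariate")
    return beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))

def probability(beta, x):
    """pi(x) = e^g / (1 + e^g), as in equation (2.2), arranged to avoid overflow."""
    g = logit(beta, x)
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))
    e = math.exp(g)
    return e / (1.0 + e)

# Hypothetical coefficients (beta_0, beta_1, beta_2) and covariate values (x_1, x_2).
beta = [-1.0, 0.5, 0.25]
x = [2.0, 0.0]
print(logit(beta, x), probability(beta, x))
```

With these illustrative values the logit is zero, so the probability is one half; a logit of zero always marks the point where the outcome is as likely to be present as absent.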

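If some of the independent variables are discrete, nominal scale variables such as race, sex, treatment group, and so forth, it is inappropriate to include them in the model as if they were interval scale variables. The numbers used to represent the various levels of these nominal scale variables are merely identifiers and have no numeric significance. In this situation the method of choice is to use a collection of design variables (or dummy variables). Suppose, for example, that one of the independent variables is race, coded as 'white,' 'black,' and 'other.' In this case two design variables are necessary. One possible coding strategy is that when the respondent is 'white,' the two design variables, D1 and D2, are both set equal to zero; when the respondent is 'black,' D1 is set equal to 1 while D2 still equals 0; when the race of the respondent is 'other,' we use D1 = 0 and D2 = 1. Table 2.1 illustrates this coding of the design variables.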

Table 2.1 An example of the coding of the design variables for race, coded at three levels

            Design variable
RACE        D1      D2
White       0       0
Black       1       0
Other       0       1

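Most logistic regression software will generate design variables, and some programs offer a choice of several different methods. The different strategies for creating and interpreting design variables are discussed in detail in Chapter 3.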
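
The coding in Table 2.1 can be sketched in a few lines of Python. This is only one possible implementation of the coding described above, with the first level treated as the reference group; the function name `design_variables` and the lowercase level labels are hypothetical.

```python
def design_variables(value, levels):
    """Return the k - 1 design variables for a nominal variable with k levels.

    The first entry of `levels` is the reference group (all design variables 0).
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]

RACE_LEVELS = ["white", "black", "other"]
for race in RACE_LEVELS:
    d1, d2 = design_variables(race, RACE_LEVELS)
    print(race, d1, d2)
```

Each row printed corresponds to one row of Table 2.1, and a variable with k levels always yields k - 1 values.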

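In general, if a nominal scaled variable has k possible values, then k - 1 design variables will be needed. This is true since, unless stated otherwise, all of our models have a constant term. To illustrate the notation used for design variables in this text, suppose that the jth independent variable $x_j$ has $k_j$ levels. The $k_j - 1$ design variables will be denoted as $D_{jl}$ and the coefficients for these design variables will be denoted as $\beta_{jl}$, $l = 1, 2, \ldots, k_j - 1$. Thus, the logit for a model with p variables and the jth variable being discrete would be

$$g(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \sum_{l=1}^{k_j - 1} \beta_{jl} D_{jl} + \cdots + \beta_p x_p$$

When discussing the multiple logistic regression model we will, in general, suppress the summation and double subscripting needed to indicate when design variables are being used. The exception will be the discussion of modeling strategies, when we need to use the specific value of the coefficients for any design variables in the model.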

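3. Fitting the Multiple Logistic Regression Model

Assume that we have a sample of n independent observations $(\mathbf{x}_i, y_i)$, $i = 1, 2, \ldots, n$. As in the univariate case, fitting the model requires that we obtain estimates of the vector $\boldsymbol{\beta}' = (\beta_0, \beta_1, \ldots, \beta_p)$. The method of estimation used in the multivariable case is the same as in the univariate situation: maximum likelihood. The likelihood function is nearly identical to that given in equation (1.3), the only change being that $\pi(\mathbf{x})$ is now defined as in equation (2.2). There are p + 1 likelihood equations, obtained by differentiating the log likelihood function with respect to the p + 1 coefficients. The likelihood equations that result may be expressed as follows:

$$\sum_{i=1}^{n} \left[ y_i - \pi(\mathbf{x}_i) \right] = 0$$

and

$$\sum_{i=1}^{n} x_{ij} \left[ y_i - \pi(\mathbf{x}_i) \right] = 0$$

for $j = 1, 2, \ldots, p$.

As in the univariate model, the solution of the likelihood equations requires special software that is available in most, if not all, statistical packages. Let $\hat{\boldsymbol{\beta}}$ denote the solution to these equations. Thus, the fitted values for the multiple logistic regression model are $\hat{\pi}(\mathbf{x}_i)$, the value of the expression in equation (2.2) computed using $\hat{\boldsymbol{\beta}}$ and $\mathbf{x}_i$.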
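
The text does not say which algorithm statistical packages use to solve the likelihood equations. One common choice is Newton-Raphson iteration, and the sketch below applies it in plain Python as an illustration only. The helper names `solve` and `fit_logistic` and the eight-observation data set are hypothetical.

```python
import math

def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - s) / m[r][r]
    return z

def fit_logistic(xs, ys, iterations=25, tol=1e-10):
    """Newton-Raphson iteration for the p + 1 likelihood equations."""
    rows = [[1.0] + list(x) for x in xs]  # x_i0 = 1 carries the constant term
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, r)))) for r in rows]
        # Left-hand sides of the likelihood equations at the current beta.
        score = [sum(r[j] * (y - p_i) for r, y, p_i in zip(rows, ys, p)) for j in range(k)]
        # Negative of the matrix of second partial derivatives of the log likelihood.
        info = [[sum(r[j] * r[l] * p_i * (1.0 - p_i) for r, p_i in zip(rows, p))
                 for l in range(k)] for j in range(k)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Hypothetical data: one 0/1 covariate; 1 of 4 outcomes present at x = 0, 3 of 4 at x = 1.
xs = [[0], [0], [0], [0], [1], [1], [1], [1]]
ys = [1, 0, 0, 0, 1, 1, 1, 0]
beta_hat = fit_logistic(xs, ys)
print(beta_hat)
```

For a single 0/1 covariate the maximum likelihood estimates have a closed form (the log odds in the reference group, and the difference in log odds between groups), which gives a convenient check on the iteration.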

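In the previous chapter only a brief mention was made of the method for estimating the standard errors of the estimated coefficients. Now that the logistic regression model has been generalized both in concept and notation to the multivariable case, we consider estimation of standard errors in more detail.

The method of estimating the variances and covariances of the estimated coefficients follows from the well-developed theory of maximum likelihood estimation. This theory states that the estimators are obtained from the matrix of second partial derivatives of the log likelihood function. These partial derivatives have the following general form

$$\frac{\partial^2 L(\boldsymbol{\beta})}{\partial \beta_j^2} = -\sum_{i=1}^{n} x_{ij}^2 \, \pi_i (1 - \pi_i) \qquad (2.3)$$

$$\frac{\partial^2 L(\boldsymbol{\beta})}{\partial \beta_j \, \partial \beta_l} = -\sum_{i=1}^{n} x_{ij} x_{il} \, \pi_i (1 - \pi_i) \qquad (2.4)$$

for $j, l = 0, 1, 2, \ldots, p$, where $\pi_i$ denotes $\pi(\mathbf{x}_i)$ and $x_{i0} = 1$. Let the $(p+1) \times (p+1)$ matrix containing the negative of the terms given in equations (2.3) and (2.4) be denoted as $\mathbf{I}(\boldsymbol{\beta})$. This matrix is called the observed information matrix. The variances and covariances of the estimated coefficients are obtained from the inverse of this matrix, which we denote as $\mathrm{Var}(\boldsymbol{\beta}) = \mathbf{I}^{-1}(\boldsymbol{\beta})$. Except in very special cases it is not possible to write down an explicit expression for the elements of this matrix. Hence, we use the notation $\mathrm{Var}(\beta_j)$ to denote the jth diagonal element of this matrix, which is the variance of $\hat{\beta}_j$, and $\mathrm{Cov}(\beta_j, \beta_l)$ to denote an arbitrary off-diagonal element, which is the covariance of $\hat{\beta}_j$ and $\hat{\beta}_l$. The estimators of the variances and covariances, denoted by $\widehat{\mathrm{Var}}(\hat{\boldsymbol{\beta}})$, are obtained by evaluating $\mathrm{Var}(\boldsymbol{\beta})$ at $\hat{\boldsymbol{\beta}}$. We use $\widehat{\mathrm{Var}}(\hat{\beta}_j)$ and $\widehat{\mathrm{Cov}}(\hat{\beta}_j, \hat{\beta}_l)$ to denote the values in this matrix. For the most part, we will have occasion to use only the estimated standard errors of the estimated coefficients, which we denote as

$$\widehat{\mathrm{SE}}(\hat{\beta}_j) = \left[ \widehat{\mathrm{Var}}(\hat{\beta}_j) \right]^{1/2} \qquad (2.5)$$

for $j = 0, 1, 2, \ldots, p$. We will use this notation in developing methods for coefficient testing and confidence interval estimation.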

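A formulation of the information matrix that is useful when discussing model fitting and assessment of fit is $\hat{\mathbf{I}}(\hat{\boldsymbol{\beta}}) = \mathbf{X}'\mathbf{V}\mathbf{X}$, where $\mathbf{X}$ is an n by p + 1 matrix containing the data for each subject, and $\mathbf{V}$ is an n by n diagonal matrix with general element $\hat{\pi}_i (1 - \hat{\pi}_i)$. That is, the matrix $\mathbf{X}$ is

$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

and the matrix $\mathbf{V}$ is

$$\mathbf{V} = \begin{bmatrix} \hat{\pi}_1 (1 - \hat{\pi}_1) & 0 & \cdots & 0 \\ 0 & \hat{\pi}_2 (1 - \hat{\pi}_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{\pi}_n (1 - \hat{\pi}_n) \end{bmatrix}$$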
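
To make the X'VX formulation concrete, the sketch below builds the estimated information matrix for a tiny hypothetical data set, inverts it, and takes square roots of the diagonal to obtain the standard errors of equation (2.5). The helper names `invert` and `standard_errors` and the data are assumptions made for illustration, and the Gauss-Jordan inversion is just one simple way to invert a small matrix.

```python
import math

def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(row) + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        d = m[col][col]
        m[col] = [v / d for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]

def standard_errors(X, beta_hat):
    """Estimated standard errors from the inverse of the information matrix X'VX."""
    pi_hat = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta_hat, row)))) for row in X]
    v = [p * (1.0 - p) for p in pi_hat]  # diagonal elements of V
    k = len(beta_hat)
    info = [[sum(row[j] * row[l] * w for row, w in zip(X, v)) for l in range(k)]
            for j in range(k)]
    cov = invert(info)  # estimated variances and covariances
    return [math.sqrt(cov[j][j]) for j in range(k)]

# Hypothetical data: a column of ones for the constant, then one 0/1 covariate.
X = [[1, 0]] * 4 + [[1, 1]] * 4
# Maximum likelihood estimates when 1 of 4 outcomes is present at x = 0 and 3 of 4 at x = 1.
beta_hat = [math.log(1.0 / 3.0), 2.0 * math.log(3.0)]
se = standard_errors(X, beta_hat)
print(se)
```

In this two-group case the variances reduce to sums of reciprocal cell counts (1/1 + 1/3 for the constant, and 1/1 + 1/3 + 1/3 + 1/1 for the slope), so the matrix result can be checked by hand.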

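Before proceeding further we present an example that illustrates the formulation of a multiple logistic regression model and the estimation of its coefficients, using a subset of the variables from the low birth weight study described in Section 1.6.2. The code sheet for the full data set is given in Table 1.6. As discussed in Section 1.6.2, the goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, $n_1 = 59$ of whom had low birth weight babies and $n_0 = 130$ of whom had normal birth weight babies. Four variables thought to be of importance were age, weight of the mother at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy. In this example, the variable race has been recoded using the two design variables in Table 2.1, which appear in the table as the independent variables Race_2 and Race_3. The results of fitting the logistic regression model to these data are shown in Table 2.2.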

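Table 2.2 Estimated coefficients for a multiple logistic regression model using the variables AGE, weight at last menstrual period (LWT), RACE, and number of first-trimester physician visits (FTV) from the low birth weight study

Variable    Coeff.    Std. Err.    z        p-value
AGE         -0.024    0.0337       -0.71    0.480
LWT         -0.014    0.0065       -2.18    0.029
Race_2       1.004    0.4979        2.02    0.044
Race_3       0.433    0.3622        1.20    0.232
FTV         -0.049    0.1672       -0.30    0.768
Constant     1.295    1.0714        1.21    0.227

Log likelihood = -111.286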

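In Table 2.2 the estimated coefficients for the two design variables from Table 2.1 are labeled Race_2 and Race_3. The estimated logit is given by the following expression:

$$\hat{g}(\mathbf{x}) = 1.295 - 0.024 \times \mathrm{AGE} - 0.014 \times \mathrm{LWT} + 1.004 \times \mathrm{Race\_2} + 0.433 \times \mathrm{Race\_3} - 0.049 \times \mathrm{FTV}$$

The fitted values are obtained using the estimated logit, $\hat{g}(\mathbf{x})$.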
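
As an illustration of how a fitted value follows from the estimated logit, the sketch below plugs the Table 2.2 coefficients into equations (2.1) and (2.2). The subject's covariate values and the function names are hypothetical and serve only to show the arithmetic.

```python
import math

# Estimated coefficients as reported in Table 2.2.
COEF = {"CONSTANT": 1.295, "AGE": -0.024, "LWT": -0.014,
        "RACE_2": 1.004, "RACE_3": 0.433, "FTV": -0.049}

def estimated_logit(age, lwt, race, ftv):
    """Evaluate the estimated logit; race is 'white', 'black', or 'other' (Table 2.1)."""
    if race not in ("white", "black", "other"):
        raise ValueError(f"unknown race level: {race!r}")
    race_2 = 1 if race == "black" else 0
    race_3 = 1 if race == "other" else 0
    return (COEF["CONSTANT"] + COEF["AGE"] * age + COEF["LWT"] * lwt
            + COEF["RACE_2"] * race_2 + COEF["RACE_3"] * race_3 + COEF["FTV"] * ftv)

def fitted_probability(age, lwt, race, ftv):
    """Fitted value: equation (2.2) evaluated at the estimated logit."""
    g = estimated_logit(age, lwt, race, ftv)
    return 1.0 / (1.0 + math.exp(-g))

# Hypothetical subject, not taken from the study data.
g_hat = estimated_logit(age=25, lwt=130, race="black", ftv=1)
pi_hat = fitted_probability(age=25, lwt=130, race="black", ftv=1)
print(round(g_hat, 3), round(pi_hat, 3))
```

Because the published coefficients are rounded to three decimals, fitted values computed this way will differ slightly from those produced by software that keeps full precision.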

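4. Testing for the Significance of the Model

Once we have fit a particular multivariable logistic regression model, we begin the process of model assessment. As in the univariate case presented in Chapter 1, the first step in this process is usually to assess the significance of the variables in the model. The likelihood ratio test for the overall significance of the p coefficients for the independent variables in the model is performed in exactly the same manner as in the univariate case. The test is based on the statistic G given in equation (1.12). The only difference is that the fitted values, $\hat{\pi}$, under the model are based on the vector $\hat{\boldsymbol{\beta}}$ containing p + 1 parameters. Under the null hypothesis that the p 'slope' coefficients for the covariates in the model are equal to zero, the distribution of G is chi-square with p degrees of freedom.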

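Consider the fitted model whose estimated coefficients are given in Table 2.2. For that model, the value of the log likelihood, shown at the bottom of the table, is L = -111.286. The log likelihood for the constant-only model may be obtained by evaluating the numerator of equation (1.12) or by fitting the constant-only model. Either method yields the log likelihood L = -117.336. Thus the value of the likelihood ratio test is, from equation (1.12),

$$G = -2\left[ (-117.336) - (-111.286) \right] = 12.10$$

and the p-value for the test, $P[\chi^2(5) > 12.10] \approx 0.03$, is significant at the 0.05 level. In this case we reject the null hypothesis and conclude that at least one and perhaps all p coefficients are different from zero, an interpretation analogous to that in multiple linear regression.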
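
The likelihood ratio calculation can be reproduced with the Python standard library. The sketch below recomputes the constant-only log likelihood from the counts of 59 low and 130 normal birth weight babies, forms G, and evaluates the chi-square tail probability using the standard closed form for 5 degrees of freedom. The function names are hypothetical.

```python
import math

def constant_only_loglik(n1, n0):
    """Log likelihood of the model containing only the constant term."""
    n = n1 + n0
    return n1 * math.log(n1 / n) + n0 * math.log(n0 / n)

def chi2_sf_5df(x):
    """P[chi-square(5) > x]; this closed form is valid only for 5 degrees of freedom."""
    r = math.sqrt(x)
    phi = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)  # standard normal density at r
    return math.erfc(r / math.sqrt(2.0)) + 2.0 * phi * (r + r ** 3 / 3.0)

loglik_full = -111.286                       # from the bottom of Table 2.2
loglik_null = constant_only_loglik(59, 130)  # 59 low and 130 normal birth weight babies
G = -2.0 * (loglik_null - loglik_full)
p_value = chi2_sf_5df(G)
print(round(loglik_null, 3), round(G, 2), round(p_value, 3))
```

The degrees of freedom equal the number of slope coefficients under test, here the five coefficients for AGE, LWT, Race_2, Race_3, and FTV.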

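Before concluding that any or all of the coefficients are nonzero, we may wish to look at the univariable Wald test statistics,

$$W_j = \frac{\hat{\beta}_j}{\widehat{\mathrm{SE}}(\hat{\beta}_j)}$$

These are given in the fourth column of Table 2.2. Under the hypothesis that an individual coefficient is zero, these statistics follow the standard normal distribution. The p-values are given in the fifth column of Table 2.2. If we use a level of significance of 0.05, then we would conclude that the variables LWT and possibly RACE are significant, while AGE and FTV are not significant.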
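
The Wald statistics and their two-tailed p-values can be recomputed from the coefficients and standard errors in Table 2.2, as in the sketch below. Because the published coefficients are rounded, the recomputed z values agree with the table only to within a few hundredths. The dictionary and function names are hypothetical.

```python
import math

# (coefficient, standard error) pairs as reported in Table 2.2.
TABLE_2_2 = {
    "AGE": (-0.024, 0.0337),
    "LWT": (-0.014, 0.0065),
    "RACE_2": (1.004, 0.4979),
    "RACE_3": (0.433, 0.3622),
    "FTV": (-0.049, 0.1672),
    "CONSTANT": (1.295, 1.0714),
}

def wald(coef, se):
    """Return the Wald statistic and its two-tailed standard normal p-value."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

results = {name: wald(coef, se) for name, (coef, se) in TABLE_2_2.items()}
for name, (z, p) in results.items():
    flag = "significant" if p < 0.05 else "not significant"
    print(f"{name:9s} z = {z:6.2f}  p = {p:.3f}  {flag} at the 0.05 level")
```

At the 0.05 level only LWT and Race_2 fall below the threshold, which matches the reading of the table given above.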

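If our goal is to obtain the best fitting model while minimizing the number of parameters, the next logical step is to fit a reduced model containing only those variables thought to be significant, and compare it to the full model containing all the variables. The results of fitting the reduced model are given in Table 2.3.

Table 2.3 Estimated coefficients for a multiple logistic regression model using the variable LWT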



Applied Logistic Regression, Second Edition. By David W. Hosmer and Stanley Lemeshow. Copyright © 2000 John Wiley & Sons, Inc. ISBN: 0-471-72214-6

CHAPTER 2

Multiple Logistic Regression

1.INTRODUCTION

In the previous chapter we introduced the logistic regression model in the univariate context. As in the case of linear regression, the strength of a modeling technique lies in its ability to model many variables, some of which may be on different measurement scales. In this chapter we will generalize the logistic model to the case of more than one independent variable. This will be referred to as the 'multivariable case.' Central to the consideration of multiple logistic models will be estimation of the coefficients in the model and testing for their significance. This will follow along the same lines as the univariate model. An additional modeling consideration which will be introduced in this chapter is the use of design variables for modeling discrete, nominal scale independent variables. In all cases it will be assumed that there is a predetermined collection of variables to be examined. The question of variable selection is dealt with in Chapter 4.

2.THE MULTIPLE LOGISTIC REGRESSION MODEL

Consider a collection of p independent variables denoted by the vector $\mathbf{x}' = (x_1, x_2, \ldots, x_p)$. For the moment we will assume that each of these variables is at least interval scale. Let the conditional probability that the outcome is present be denoted by $P(Y = 1 \mid \mathbf{x}) = \pi(\mathbf{x})$. The logit of the multiple logistic regression model is given by the equation

$$g(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \qquad (2.1)$$

in which case the logistic regression model is

$$\pi(\mathbf{x}) = \frac{e^{g(\mathbf{x})}}{1 + e^{g(\mathbf{x})}} \qquad (2.2)$$

If some of the independent variables are discrete, nominal scale variables such as race, sex, treatment group, and so forth, it is inappropriate to include them in the model as if they were interval scale variables. The numbers used to represent the various levels of these nominal scale variables are merely identifiers, and have no numeric significance. In this situation the method of choice is to use a collection of design variables (or dummy variables). Suppose, for example, that one of the independent variables is race, which has been coded as 'white,' 'black' and 'other.' In this case, two design variables are necessary. One possible coding strategy is that when the respondent is 'white,' the two design variables, D1 and D2, would both be set equal to zero; when the respondent is 'black,' D1 would be set equal to 1 while D2 would still equal 0; when the race of the respondent is 'other,' we would use D1 = 0 and D2 = 1. Table 2.1 illustrates this coding of the design variables.

Most logistic regression software will generate design variables, and some programs have a choice of several different methods. The different strategies for creation and interpretation of design variables are discussed in detail in Chapter 3.

Table 2.1 An Example of the Coding of the Design Variables for Race, Coded at Three Levels

            Design Variable
RACE        D1      D2
White       0       0
Black       1       0
Other       0       1

Table 2.3 Estimated coefficients for the multiple logistic regression model using the variables LWT and RACE

Variable    Coeff.    Std. Err.    z        p-value
LWT         -0.015    0.0064       -2.36    0.018
Race_2       1.081    0.4881        2.22    0.027
Race_3       0.481    0.3567        1.35    0.178
Constant     0.806    0.8452        0.95    0.340

In general, if a nominal scaled variable has k possible values, then k - 1 design variables will be needed. This is true since, unless stated otherwise, all of our models have a constant term. To illustrate the notation used for design variables in this text, suppose that the jth independent variable $x_j$ has $k_j$ levels. The $k_j - 1$ design variables will be denoted as $D_{jl}$ and the coefficients for these design variables will be denoted as $\beta_{jl}$, $l = 1, 2, \ldots, k_j - 1$. Thus, the logit for a model with p variables and the jth variable being discrete would be

$$g(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \sum_{l=1}^{k_j - 1} \beta_{jl} D_{jl} + \cdots + \beta_p x_p$$

When discussing the multiple logistic regression model we will, in general, suppress the summation and double subscripting needed to indicate when design variables are being used. The exception to this will be the discussion of modeling strategies when we need to use the specific value of the coefficients for any design variables in the model.

3.FITTING THE MULTIPLE LOGISTIC REGRESSION MODEL

Assume that we have a sample of n independent observations $(\mathbf{x}_i, y_i)$, $i = 1, 2, \ldots, n$. As in the univariate case, fitting the model requires that we obtain estimates of the vector $\boldsymbol{\beta}' = (\beta_0, \beta_1, \ldots, \beta_p)$. The method of estimation used in the multivariable case will be the same as in the univariate situation: maximum likelihood. The likelihood function is nearly identical to that given in equation (1.3) with the only change being that $\pi(\mathbf{x})$ is now defined as in equation (2.2). There will be p + 1 likelihood equations that are obtained by differentiating the log likelihood function with respect to the p + 1 coefficients. The likelihood equations that result may be expressed as follows:

$$\sum_{i=1}^{n} \left[ y_i - \pi(\mathbf{x}_i) \right] = 0$$

and

$$\sum_{i=1}^{n} x_{ij} \left[ y_i - \pi(\mathbf{x}_i) \right] = 0$$

for $j = 1, 2, \ldots, p$.

As in the univariate model, the solution of the likelihood equations requires special software that is available in most, if not all, statistical packages. Let $\hat{\boldsymbol{\beta}}$ denote the solution to these equations. Thus, the fitted values for the multiple logistic regression model are $\hat{\pi}(\mathbf{x}_i)$, the value of the expression in equation (2.2) computed using $\hat{\boldsymbol{\beta}}$ and $\mathbf{x}_i$.

In the previous chapter only a brief mention was made of the method for estimating the standard errors of the estimated coefficients. Now that the logistic regression model has been generalized both in concept and notation to the multivariable case, we consider estimation of standard errors in more detail.

The method of estimating the variances and covariances of the estimated coefficients follows
