LECTURE 7

Forecasting with ARMA Models

Minimum Mean-Square Error Prediction

Imagine that y(t) is a stationary stochastic process with E{y(t)} = 0. We may be interested in predicting values of this process several periods into the future on the basis of its observed history. This history is contained in the so-called information set. In practice, the latter is always a finite set {y_t, y_{t−1}, . . . , y_{t−p}} representing the recent past. Nevertheless, in developing the theory of prediction, it is also useful to consider an infinite information set I_t = {y_t, y_{t−1}, . . . , y_{t−p}, . . .} representing the entire past. We shall denote the prediction of y_{t+h} which is made at time t by ŷ_{t+h|t}, or by ŷ_{t+h} when it is clear that we are predicting h steps ahead.

The criterion which is commonly used in judging the performance of an estimator or predictor ŷ of a random variable y is its mean-square error, defined by E{(y − ŷ)²}. If all of the available information on y is summarised in its marginal distribution, then the minimum-mean-square-error prediction is simply the expected value E(y). However, if y is statistically related to another random variable x whose value can be observed, and if the form of the joint distribution of x and y is known, then the minimum-mean-square-error prediction of y is the conditional expectation E(y|x). This proposition may be stated formally:

(1) Let ŷ = ŷ(x) be the conditional expectation of y given x, which is also expressed as ŷ = E(y|x). Then E{(y − ŷ)²} ≤ E{(y − π)²}, where π = π(x) is any other function of x.

Proof. Consider

(2)    E{(y − π)²} = E{[(y − ŷ) + (ŷ − π)]²}
                   = E{(y − ŷ)²} + 2E{(y − ŷ)(ŷ − π)} + E{(ŷ − π)²}.

Within the second term, there is

(3)    E{(y − ŷ)(ŷ − π)} = ∫_x ∫_y (y − ŷ)(ŷ − π) f(x, y) ∂y ∂x
                         = ∫_x { ∫_y (y − ŷ) f(y|x) ∂y } (ŷ − π) f(x) ∂x
                         = 0.

Here the second equality depends upon the factorisation f(x, y) = f(y|x)f(x), which expresses the joint probability density function of x and y as the product of the conditional density function of y given x and the marginal density function of x. The final equality depends upon the fact that ∫(y − ŷ)f(y|x)∂y = E(y|x) − E(y|x) = 0. Therefore E{(y − π)²} = E{(y − ŷ)²} + E{(ŷ − π)²} ≥ E{(y − ŷ)²}, and the assertion is proved.

The definition of the conditional expectation implies that

(4)    E(xy) = ∫_x ∫_y xy f(x, y) ∂y ∂x
             = ∫_x x { ∫_y y f(y|x) ∂y } f(x) ∂x
             = E(xŷ).

When the equation E(xy) = E(xŷ) is rewritten as

(5)    E{x(y − ŷ)} = 0,

it may be described as an orthogonality condition. This condition indicates that the prediction error y − ŷ is uncorrelated with x. The result is intuitively appealing; for, if the error were correlated with x, we should not be using the information of x efficiently in forming ŷ.

The proposition of (1) is readily generalised to accommodate the case where, in place of the scalar x, there is a vector x = [x_1, . . . , x_p]. This generalisation indicates that the minimum-mean-square-error prediction of y_{t+h} given the information in {y_t, y_{t−1}, . . . , y_{t−p}} is the conditional expectation E(y_{t+h}|y_t, y_{t−1}, . . . , y_{t−p}).

In order to determine the conditional expectation of y_{t+h} given {y_t, y_{t−1}, . . . , y_{t−p}}, we need to know the functional form of the joint probability density function of all of these variables. In lieu of precise knowledge, we are often prepared to assume that the distribution is normal. In that case, it follows that the conditional expectation of y_{t+h} is a linear function of {y_t, y_{t−1}, . . . , y_{t−p}}; and so the problem of predicting y_{t+h} becomes a matter of forming a linear regression.

Even if we are not prepared to assume that the joint distribution of the variables is normal, we may be prepared, nevertheless, to base the prediction of y upon a linear function of {y_t, y_{t−1}, . . . , y_{t−p}}. In that case, the criterion of minimum-mean-square-error linear prediction is satisfied by forming ŷ_{t+h} = φ_1 y_t + φ_2 y_{t−1} + · · · + φ_{p+1} y_{t−p} from the values φ_1, . . . , φ_{p+1} which minimise

(6)    E{(y_{t+h} − ŷ_{t+h})²} = E{(y_{t+h} − Σ_{j=1}^{p+1} φ_j y_{t−j+1})²}
                               = γ_0 − 2 Σ_j φ_j γ_{h+j−1} + Σ_i Σ_j φ_i φ_j γ_{i−j},

wherein γ_{i−j} = E(y_{t−i} y_{t−j}) is the autocovariance of the process at lag i − j. This is a linear least-squares regression problem which leads to a set of p + 1 orthogonality conditions described as the normal equations:

(7)    E{(y_{t+h} − ŷ_{t+h}) y_{t−j+1}} = γ_{h+j−1} − Σ_{i=1}^{p+1} φ_i γ_{i−j} = 0;    j = 1, . . . , p + 1.

In matrix terms, these are

(8)    │ γ_0      γ_1      …   γ_p     │ │ φ_1     │   │ γ_h     │
       │ γ_1      γ_0      …   γ_{p−1} │ │ φ_2     │   │ γ_{h+1} │
       │  ⋮        ⋮       ⋱    ⋮      │ │  ⋮      │ = │  ⋮      │
       │ γ_p      γ_{p−1}  …   γ_0     │ │ φ_{p+1} │   │ γ_{h+p} │

Notice that, for the one-step-ahead prediction of y_{t+1}, these are nothing but the Yule–Walker equations. In the case of an optimal predictor which combines previous values of the series, it follows from the orthogonality principle that the forecast errors are uncorrelated with the previous predictions.

A result of this sort is familiar to economists in connection with the so-called efficient-markets hypothesis. A financial market is efficient if the prices of the traded assets constitute optimal forecasts of their discounted future returns, which consist of interest and dividend payments and of capital gains. According to the hypothesis, the changes in asset prices will be uncorrelated with the past or present price levels; which is to say that asset prices will follow random walks. Moreover, it should not be possible for someone who is apprised only of the past history of asset prices to reap speculative profits on a systematic and regular basis.
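To make the normal equations (8) concrete, the following is a minimal numerical sketch of their solution, written in Python with NumPy. It is an illustration rather than part of the original notes; the function name and the AR(1) autocovariances used to exercise it are assumptions of the example.

import numpy as np

def normal_equation_weights(gamma, h):
    """Solve the normal equations (8) for the weights phi_1, ..., phi_{p+1}
    of the h-step-ahead linear predictor formed from y_t, ..., y_{t-p}.
    gamma : autocovariances gamma_0, gamma_1, ..., gamma_{h+p} of y(t); h >= 1."""
    m = len(gamma) - h                     # number of weights, p + 1
    G = np.array([[gamma[abs(i - j)] for j in range(m)] for i in range(m)])
    rhs = np.array([gamma[h + j] for j in range(m)])
    return np.linalg.solve(G, rhs)

# Illustration with AR(1) autocovariances gamma_k = sigma^2 phi^k / (1 - phi^2):
# the one-step-ahead weights should be phi on y_t and zero on the earlier values.
phi, sigma2 = 0.8, 1.0
gamma = [sigma2 * phi**k / (1 - phi**2) for k in range(6)]
print(normal_equation_weights(gamma, h=1))   # approximately [0.8, 0, 0, 0, 0]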

Forecasting with ARMA Models

So far, we have avoided making specific assumptions about the nature of the process y(t). We are greatly assisted in the business of developing practical forecasting procedures if we can assume that y(t) is generated by an ARMA process such that

(9)    y(t) = {µ(L)/α(L)} ε(t) = ψ(L)ε(t).

We shall continue to assume, for the sake of simplicity, that the forecasts are based on the information contained in the infinite set I_t = {y_t, y_{t−1}, y_{t−2}, . . .} comprising all values that have been taken by the variable up to the present time t. Knowing the parameters in ψ(L) enables us to recover the sequence {ε_t, ε_{t−1}, ε_{t−2}, . . .} from the sequence {y_t, y_{t−1}, y_{t−2}, . . .} and vice versa; so either of these constitutes the information set. This equivalence implies that the forecasts may be expressed in terms of {y_t}, or in terms of {ε_t}, or as a combination of the elements of both sets.

Let us write the realisations of equation (9) as

(10)    y_{t+h} = {ψ_0 ε_{t+h} + ψ_1 ε_{t+h−1} + · · · + ψ_{h−1} ε_{t+1}} + {ψ_h ε_t + ψ_{h+1} ε_{t−1} + · · ·}.

Here the first term on the RHS embodies disturbances subsequent to the time t when the forecast is made, and the second term embodies disturbances which are within the information set {ε_t, ε_{t−1}, ε_{t−2}, . . .}. Let us now define a forecasting function, based on the information set, which takes the form of

(11)    ŷ_{t+h|t} = {ρ_h ε_t + ρ_{h+1} ε_{t−1} + · · ·}.

Then, given that ε(t) is a white-noise process, it follows that the mean square of the error in the forecast h periods ahead is given by

(12)    E{(y_{t+h} − ŷ_{t+h})²} = σ_ε² Σ_{i=0}^{h−1} ψ_i² + σ_ε² Σ_{i=h}^{∞} (ψ_i − ρ_i)².

Clearly, the mean-square error is minimised by setting ρ_i = ψ_i; and so the optimal forecast is given by

(13)    ŷ_{t+h|t} = {ψ_h ε_t + ψ_{h+1} ε_{t−1} + · · ·}.

This might have been derived from the equation y(t + h) = ψ(L)ε(t + h), which generates the true value of y_{t+h}, simply by putting zeros in place of the unobserved disturbances ε_{t+1}, ε_{t+2}, . . . , ε_{t+h} which lie in the future when the forecast is made. Notice that, on the assumption that the process is stationary, the mean-square error of the forecast tends to the value of

(14)    V{y(t)} = σ_ε² Σ_i ψ_i²

as the lead time h of the forecast increases. This is nothing but the variance of the process y(t).

The optimal forecast of (13) may also be derived by specifying that the forecast error should be uncorrelated with the disturbances up to the time of making the forecast. For, if the forecast errors were correlated with some of the elements of the information set, then, as we have noted before, we would not be using the information efficiently, and we could not be generating optimal forecasts. To demonstrate this result anew, let us consider the covariance between the forecast error and the disturbance ε_{t−i}:

(15)    E{(y_{t+h} − ŷ_{t+h}) ε_{t−i}} = Σ_{k=1}^{h} ψ_{h−k} E(ε_{t+k} ε_{t−i}) + Σ_{j=0}^{∞} (ψ_{h+j} − ρ_{h+j}) E(ε_{t−j} ε_{t−i})
                                       = σ_ε² (ψ_{h+i} − ρ_{h+i}).

Here the final equality follows from the fact that

(16)    E(ε_{t−j} ε_{t−i}) = σ_ε²,  if i = j,
                           = 0,     if i ≠ j.

If the covariance in (15) is to be equal to zero for all values of i ≥ 0, then we must have ρ_i = ψ_i for all i, which means that the forecasting function must be the one that has been specified already under (13).

It is helpful, sometimes, to have a functional notation for describing the process which generates the h-steps-ahead forecast. The notation provided by Whittle (1963) is widely used. To derive this, let us begin by writing

(17)    y(t + h) = L^{−h} ψ(L) ε(t).

On the RHS, there are not only the lagged sequences {ε(t), ε(t − 1), . . .} but also the sequences ε(t + h) = L^{−h} ε(t), . . . , ε(t + 1) = L^{−1} ε(t), which are associated with negative powers of L that serve to shift a sequence forwards in time. Let {L^{−h} ψ(L)}_+ be defined as the part of the operator containing only nonnegative powers of L. Then the forecasting function can be expressed as

(18)    ŷ(t + h|t) = {L^{−h} ψ(L)}_+ ε(t)
                   = {ψ(L)/L^h}_+ {1/ψ(L)} y(t).
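The ψ-weights in (13) and (18) are obtained by expanding ψ(L) = µ(L)/α(L) of (9) as a power series in L. The following is a minimal numerical sketch of that expansion, and of the resulting forecast and its mean-square error (12); it is an illustration rather than part of the original notes, and the function names, the truncation length n and the ARMA(1, 1) parameters used to exercise it are assumptions of the example.

import numpy as np

def psi_weights(alpha, mu, n):
    """First n coefficients of psi(L) = mu(L)/alpha(L), where
    alpha = [1, alpha_1, ..., alpha_p] and mu = [mu_0, ..., mu_q]."""
    psi = np.zeros(n)
    for j in range(n):
        m_j = mu[j] if j < len(mu) else 0.0
        s = sum(alpha[k] * psi[j - k] for k in range(1, min(j, len(alpha) - 1) + 1))
        psi[j] = m_j - s
    return psi

def optimal_forecast(alpha, mu, eps, h, sigma2=1.0, n=200):
    """h-step forecast (13) from recorded disturbances eps (eps[-1] = eps_t),
    together with its mean-square error sigma^2 * sum_{i<h} psi_i^2 from (12)."""
    psi = psi_weights(alpha, mu, n)
    past = eps[::-1]                               # eps_t, eps_{t-1}, ...
    yhat = sum(psi[h + j] * past[j] for j in range(min(len(past), n - h)))
    mse = sigma2 * np.sum(psi[:h] ** 2)
    return yhat, mse

# ARMA(1,1): (1 - 0.8L)y(t) = (1 - 0.3L)e(t), so alpha = [1, -0.8] and mu = [1, -0.3].
eps = np.random.default_rng(0).normal(size=50)
print(optimal_forecast([1.0, -0.8], [1.0, -0.3], eps, h=3))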

Example. Consider an ARMA(1, 1) process represented by the equation

(19)    (1 − φL)y(t) = (1 − θL)ε(t).

The function which generates the sequence of forecasts h steps ahead is given by

(20)    ŷ(t + h|t) = { L^{−h} [1 + (φ − θ)L/(1 − φL)] }_+ ε(t)
                   = φ^{h−1} {(φ − θ)/(1 − φL)} ε(t)
                   = φ^{h−1} {(φ − θ)/(1 − θL)} y(t).

When θ = 0, this gives the simple result that ŷ(t + h|t) = φ^h y(t).

Generating the Forecasts Recursively

We have already seen that the optimal (minimum-mean-square-error) forecast of y_{t+h} can be regarded as the conditional expectation of y_{t+h} given the information set I_t, which comprises the values of {ε_t, ε_{t−1}, ε_{t−2}, . . .} or, equally, the values of {y_t, y_{t−1}, y_{t−2}, . . .}. On taking expectations of y(t) and ε(t) conditional on I_t, we find that

(21)    E(y_{t+k}|I_t) = ŷ_{t+k|t}  if k > 0,        E(ε_{t+k}|I_t) = 0        if k > 0,
        E(y_{t−j}|I_t) = y_{t−j}    if j ≥ 0,        E(ε_{t−j}|I_t) = ε_{t−j}  if j ≥ 0.

In this notation, the forecast h periods ahead is

(22)    E(y_{t+h}|I_t) = Σ_{k=1}^{h} ψ_{h−k} E(ε_{t+k}|I_t) + Σ_{j=0}^{∞} ψ_{h+j} E(ε_{t−j}|I_t)
                       = Σ_{j=0}^{∞} ψ_{h+j} ε_{t−j}.

In practice, the forecasts may be generated using a recursion based on the equation

(23)    y(t) = −{α_1 y(t − 1) + α_2 y(t − 2) + · · · + α_p y(t − p)} + µ_0 ε(t) + µ_1 ε(t − 1) + · · · + µ_q ε(t − q).

By taking the conditional expectation of this function, we get

(24)    ŷ_{t+h} = −{α_1 ŷ_{t+h−1} + · · · + α_p y_{t+h−p}} + µ_h ε_t + · · · + µ_q ε_{t+h−q}   when 0 < h ≤ p, q,

(25)    ŷ_{t+h} = −{α_1 ŷ_{t+h−1} + · · · + α_p y_{t+h−p}}   if q < h ≤ p,

(26)    ŷ_{t+h} = −{α_1 ŷ_{t+h−1} + · · · + α_p ŷ_{t+h−p}} + µ_h ε_t + · · · + µ_q ε_{t+h−q}   if p < h ≤ q,

and

(27)    ŷ_{t+h} = −{α_1 ŷ_{t+h−1} + · · · + α_p ŷ_{t+h−p}}   when p, q < h,

where, in accordance with (21), any value ŷ_{t−j} with j ≥ 0 which the recursion calls for is simply the observed value y_{t−j}.
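Expressed in code, the recursions (24)-(27) reduce to a single loop in which earlier forecasts stand in for unavailable values of y and the conditional expectations of future disturbances are set to zero. The sketch below is an illustration rather than part of the original notes; the function name and argument layout are assumptions, and it presumes that at least p observations and q recovered disturbances are available.

def arma_forecasts(alpha, mu, y, eps, H):
    """Forecasts yhat_{t+1}, ..., yhat_{t+H} generated by the recursions (24)-(27).
    alpha : [alpha_1, ..., alpha_p]   AR coefficients, signed as in (23)
    mu    : [mu_0, mu_1, ..., mu_q]   MA coefficients
    y     : observations ..., y_{t-1}, y_t   (y[-1] is y_t)
    eps   : recovered disturbances ..., eps_{t-1}, eps_t"""
    p, q = len(alpha), len(mu) - 1
    values = list(y)                    # observations, with forecasts appended
    for h in range(1, H + 1):
        # Autoregressive part: earlier forecasts where h - i > 0, observations otherwise.
        ar = -sum(alpha[i - 1] * values[len(y) - 1 + h - i] for i in range(1, p + 1))
        # Moving-average part: only disturbances dated t or earlier survive, since
        # E(eps_{t+k} | I_t) = 0 for k > 0; it is empty once h > q, as in (25) and (27).
        ma = sum(mu[j] * eps[len(eps) - 1 + h - j] for j in range(h, q + 1))
        values.append(ar + ma)
    return values[len(y):]

# Example: the ARMA(1,1) model of (19) with phi = 0.8 and theta = 0.3,
# for which alpha_1 = -0.8 under the sign convention of (23).
yhats = arma_forecasts([-0.8], [1.0, -0.3], y=[0.5, 1.2], eps=[0.1, 0.4], H=4)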

It can be seen from (27) that, for h > p, q, the forecasting function becomes a pth-order homogeneous difference equation in y. The p values of y(t) from t = r = max(p, q) to t = r − p + 1 serve as the starting values for the equation.

The behaviour of the forecast function beyond the reach of the starting values can be characterised in terms of the roots of the autoregressive operator. It may be assumed that none of the roots of α(L) = 0 lie inside the unit circle; for, if there were roots inside the circle, then the process would be radically unstable. If all of the roots are greater than unity in modulus, then ŷ_{t+h} will converge to zero as h increases. If one of the roots of α(L) = 0 is unity, then we have an ARIMA(p, 1, q) model; and the general solution of the homogeneous equation of (27) will include a constant term which represents the product of the unit root with a coefficient which is determined by the starting values. Hence the forecast will tend to a nonzero constant. If two of the roots are unity, then the general solution will embody a linear time trend which is the asymptote to which the forecasts will tend. In general, if d of the roots are unity, then the general solution will comprise a polynomial in t of order d − 1.

The forecasts can be updated easily once the coefficients in the expansion of ψ(L) = µ(L)/α(L) have been obtained. Consider

(28)    ŷ_{t+h|t+1} = {ψ_{h−1} ε_{t+1} + ψ_h ε_t + ψ_{h+1} ε_{t−1} + · · ·}   and
        ŷ_{t+h|t} = {ψ_h ε_t + ψ_{h+1} ε_{t−1} + ψ_{h+2} ε_{t−2} + · · ·}.

The first of these is the forecast for h − 1 periods ahead made at time t + 1, whilst the second is the forecast for h periods ahead made at time t. It can be seen that

(29)    ŷ_{t+h|t+1} = ŷ_{t+h|t} + ψ_{h−1} ε_{t+1},

where ε_{t+1} = y_{t+1} − ŷ_{t+1|t} is the current disturbance at time t + 1. The latter is also the prediction error of the one-step-ahead forecast made at time t.

Example. For an example of the analytic form of the forecast function, we may consider the Integrated Autoregressive (IAR) process defined by

(30)    {1 − (1 + φ)L + φL²} y(t) = ε(t),

wherein φ ∈ (0, 1). The roots of the auxiliary equation z² − (1 + φ)z + φ = 0 are z = 1 and z = φ. The solution of the homogeneous difference equation

(31)    {1 − (1 + φ)L + φL²} ŷ(t + h|t) = 0,

which defines the forecast function, is

(32)    ŷ(t + h|t) = c_1 + c_2 φ^h,

where c_1 and c_2 are constants which reflect the initial conditions. These constants are found by solving the equations

(33)    y_{t−1} = c_1 + c_2 φ^{−1},    y_t = c_1 + c_2.

The solutions are

(34)    c_1 = (y_t − φ y_{t−1})/(1 − φ)    and    c_2 = φ(y_t − y_{t−1})/(φ − 1).

The long-term forecast is ȳ = c_1, which is the asymptote to which the forecasts tend as the lead period h increases.

Ad-hoc Methods of Forecasting

There are some time-honoured methods of forecasting which, when analysed carefully, reveal themselves to be the methods which are appropriate to some simple ARIMA models which might be suggested by a priori reasoning. Two of the leading examples are provided by the method of exponential smoothing and the Holt–Winters trend-extrapolation method.

Exponential Smoothing. A common forecasting procedure is exponential smoothing. This depends upon taking a weighted average of past values of the time series with the weights following a geometrically declining pattern. The function generating the one-step-ahead forecasts can be written as

(35)    ŷ(t + 1|t) = {(1 − θ)/(1 − θL)} y(t)
                   = (1 − θ){y(t) + θ y(t − 1) + θ² y(t − 2) + · · ·}.

On multiplying both sides of this equation by 1 − θL and rearranging, we get

(36)    ŷ(t + 1|t) = θ ŷ(t|t − 1) + (1 − θ) y(t),

which shows that the current forecast for one step ahead is a convex combination of the previous forecast and the value which actually transpired.

The method of exponential smoothing corresponds to the optimal forecasting procedure for the ARIMA(0, 1, 1) model (1 − L)y(t) = (1 − θL)ε(t), which is better described as an IMA(1, 1) model. To see this, let us consider the ARMA(1, 1) model y(t) − φy(t − 1) = ε(t) − θε(t − 1). This gives

(37)    ŷ(t + 1|t) = φ y(t) − θ ε(t)
                   = φ y(t) − θ {(1 − φL)/(1 − θL)} y(t)
                   = [{(1 − θL)φ − (1 − φL)θ}/(1 − θL)] y(t)
                   = {(φ − θ)/(1 − θL)} y(t).

On setting φ = 1, which converts the ARMA(1, 1) model to an IMA(1, 1) model, we obtain precisely the forecasting function of (35).

The Holt–Winters Method. The Holt–Winters algorithm is useful in extrapolating local linear trends. The prediction h periods ahead of a series y(t) = {y_t, t = 0, ±1, ±2, . . .} which is made at time t is given by

(38)    ŷ_{t+h|t} = α̂_t + β̂_t h,

where

(39)    α̂_t = λ y_t + (1 − λ)(α̂_{t−1} + β̂_{t−1})
            = λ y_t + (1 − λ) ŷ_{t|t−1}

is the estimate of an intercept or levels parameter formed at time t, and

(40)    β̂_t = µ(α̂_t − α̂_{t−1}) + (1 − µ) β̂_{t−1}

is the estimate of the slope parameter, likewise formed at time t. The coefficients λ, µ ∈ (0, 1] are the smoothing parameters.

The algorithm may also be expressed in error-correction form. Let

(41)    e_t = y_t − ŷ_{t|t−1} = y_t − α̂_{t−1} − β̂_{t−1}

be the error at time t arising from the prediction of y_t on the basis of information available at time t − 1. Then the formula for the levels parameter can be given as

(42)    α̂_t = λ e_t + ŷ_{t|t−1}
            = λ e_t + α̂_{t−1} + β̂_{t−1},

which, on rearranging, becomes

(43)    α̂_t − α̂_{t−1} = λ e_t + β̂_{t−1}.

When the latter is substituted into equation (40), we get an analogous expression for the slope parameter:

(44)    β̂_t = µ(λ e_t + β̂_{t−1}) + (1 − µ) β̂_{t−1}
            = λµ e_t + β̂_{t−1}.

In order to reveal the underlying nature of this method, it is helpful to combine the two equations (42) and (44) in a simple state-space model:

(45)    │ α̂(t) │   │ 1  1 │ │ α̂(t−1) │   │ λ  │
        │ β̂(t) │ = │ 0  1 │ │ β̂(t−1) │ + │ λµ │ e(t).

This can be rearranged to give

(46)    │ 1−L   −L  │ │ α̂(t) │   │ λ  │
        │  0    1−L │ │ β̂(t) │ = │ λµ │ e(t).

The solution of the latter is

(47)    │ α̂(t) │       1       │ 1−L    L  │ │ λ  │
        │ β̂(t) │ = ─────────── │  0    1−L │ │ λµ │ e(t).
                     (1 − L)²

Therefore, from (38), it follows that

(48)    ŷ(t + 1|t) = α̂(t) + β̂(t)
                   = {(λ + λµ) e(t) − λ e(t − 1)}/(1 − L)².

This can be recognised as the forecasting function of an IMA(2, 2) model of the form

(49)    (I − L)² y(t) = µ_0 ε(t) + µ_1 ε(t − 1) + µ_2 ε(t − 2),

for which

(50)    ŷ(t + 1|t) = {µ_1 ε(t) + µ_2 ε(t − 1)}/(1 − L)².
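A minimal sketch of the Holt–Winters recursions, written in the error-correction form of (42) and (44), is as follows. It is an illustration rather than part of the original notes; the starting values chosen for the level and slope are an assumption of the example.

import numpy as np

def holt_winters(y, lam, mu):
    """Run the recursions (42) and (44) over the sample y and return a function
    h -> alpha_t + beta_t * h, the forecast rule (38)."""
    level, slope = y[0], 0.0                  # crude starting values (an assumption)
    for t in range(1, len(y)):
        e = y[t] - (level + slope)            # one-step prediction error, as in (41)
        level = level + slope + lam * e       # levels update (42)
        slope = slope + lam * mu * e          # slope update (44)
    return lambda h: level + slope * h        # forecast function (38)

# On a noise-free linear trend the forecasts should extend the line almost exactly.
y = 2.0 + 0.5 * np.arange(30)
forecast = holt_winters(y, lam=0.5, mu=0.5)
print(forecast(1), forecast(5))               # close to 17.0 and 19.0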

The Local Trend Model. There are various arguments which suggest that an IMA(2, 2) model might be a natural model to adopt. The simplest of these arguments arises from an elaboration of a second-order random walk which adds an ordinary white-noise disturbance to the trend. The resulting model may be expressed in two equations:

(51)    (I − L)² ξ(t) = ν(t),
        y(t) = ξ(t) + η(t),

where ν(t) and η(t) are mutually independent white-noise processes. Combining the equations, and using the notation ∇ = 1 − L, gives

(52)    y(t) = ν(t)/∇² + η(t)
             = {ν(t) + ∇² η(t)}/∇².

Here the numerator ν(t) + ∇² η(t) = {ν(t) + η(t)} − 2η(t − 1) + η(t − 2) constitutes a second-order MA process.

Slightly more elaborate models with the same outcome have also been proposed. Thus the so-called structural model consists of the equations

(53)    y(t) = µ(t) + ε(t),
        µ(t) = µ(t − 1) + β(t − 1) + η(t),
        β(t) = β(t − 1) + ζ(t).

Working backwards from the final equation gives

(54)    β(t) = ζ(t)/∇,
        µ(t) = β(t − 1)/∇ + η(t)/∇ = ζ(t − 1)/∇² + η(t)/∇,
        y(t) = ζ(t − 1)/∇² + η(t)/∇ + ε(t)
             = {ζ(t − 1) + ∇ η(t) + ∇² ε(t)}/∇².
Once more, the numerator constitutes a second-order MA process.

Equivalent Forecasting Functions

Consider a model which combines a global linear trend with an autoregressive disturbance process:

(55)    y(t) = γ_0 + γ_1 t + ε(t)/(I − φL).

The formation of an h-step-ahead prediction is straightforward; for we can separate the forecast function into two additive parts.

The first part of the function is the extrapolation of the global linear trend. This takes the form of

(56)    z_{t+h|t} = γ_0 + γ_1 (t + h) = z_t + γ_1 h,

where z_t = γ_0 + γ_1 t. The second part is the prediction associated with the AR(1) disturbance term η(t) = (I − φL)^{−1} ε(t). The following iterative scheme provides a recursive solution to the problem of generating the forecasts:

(57)    η̂_{t+1|t} = φ η_t,
        η̂_{t+2|t} = φ η̂_{t+1|t},
        η̂_{t+3|t} = φ η̂_{t+2|t},    etc.

Notice that the analytic solution of the associated difference equation is just

(58)    η̂_{t+h|t} = φ^h η_t.

This reminds us that, whenever we can express the forecast function in terms of a linear recursion, we can also express it in an analytic form embodying the roots of a polynomial lag operator. The operator in this case is the AR(1) operator I − φL. Since, by assumption, |φ| < 1, it is clear that the contribution of the disturbance part to the overall forecast function

(59)    ŷ_{t+h|t} = z_{t+h|t} + η̂_{t+h|t},

becomes negligible when h becomes large.

Consider the limiting case when φ → 1. Now, in place of an AR(1) disturbance process, we have to consider a random-walk process. We know that the forecast function of a random walk consists of nothing more than a constant

function. On adding this constant to the linear function z_{t+h|t} = γ_0 + γ_1 (t + h), we continue to have a simple linear forecast function.

Another way of looking at the problem depends upon writing equation (55) as

(60)    (I − φL){y(t) − γ_0 − γ_1 t} = ε(t).

Setting φ = 1 turns the operator I − φL into the difference operator I − L = ∇. But ∇γ_0 = 0 and ∇γ_1 t = γ_1, so equation (60) with φ = 1 can also be written as

(61)    ∇y(t) = γ_1 + ε(t).

This is the equation of a process which is described as a random walk with drift. Yet another way of expressing the process is via the equation y(t) = y(t − 1) + γ_1 + ε(t). It is intuitively clear that, if the random-walk process ∇z(t) = ε(t) is associated with a constant forecast function, and if z(t) = y(t) − γ_0 − γ_1 t, then y(t) will be associated with a linear forecast function.

The purpose of this example has been to offer a limiting case where models with local stochastic trends (i.e. random-walk and unit-root models) and models with global polynomial trends come together.

Finally, we should notice that the model of a random walk with drift has the same linear forecast function as the model

(62)    ∇² y(t) = ε(t),

which has two unit roots in the AR operator.
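As a final illustration of how the two additive parts of the forecast function combine in (59), the following sketch generates the forecasts for the model of (55) when the parameters are known. It is not part of the original notes; the function name and the numerical inputs are assumptions of the example.

def trend_ar1_forecasts(y, gamma0, gamma1, phi, H):
    """h-step forecasts for (55): a global linear trend plus an AR(1) disturbance.
    y[-1] is the latest observation, taken to occur at time t = len(y) - 1."""
    t = len(y) - 1
    eta = y[-1] - (gamma0 + gamma1 * t)       # current disturbance eta_t
    out = []
    for h in range(1, H + 1):
        z = gamma0 + gamma1 * (t + h)         # trend extrapolation (56)
        eta = phi * eta                       # AR(1) recursion (57), giving phi^h * eta_t
        out.append(z + eta)                   # combined forecast (59)
    return out

# With phi close to zero the forecasts collapse quickly onto the global trend line;
# with phi close to one the current deviation from the trend persists, as discussed above.
print(trend_ar1_forecasts([1.0, 1.8, 2.9], gamma0=1.0, gamma1=1.0, phi=0.5, H=3))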
