Monday, August 29, 2016

On Credible Cointegration Analyses

I may not know whether some \(I(1)\) variables are cointegrated, but if they are, I often have a very strong view about the likely number and nature of cointegrating combinations. Single-factor structure is common in many areas of economics and finance, so if cointegration is present in an \(N\)-variable system, a natural benchmark is one common trend, and hence \(N-1\) cointegrating combinations.  Moreover, the natural cointegrating combinations are almost always spreads or ratios. For example, log consumption and log income may or may not be cointegrated, but if they are cointegrated, then the obvious benchmark cointegrating combination is \((\ln C - \ln Y)\). Similarly, \(N\) government bond yields \(y\) may or may not be cointegrated, but if they are, then the obvious benchmark is \(N-1\) cointegrating combinations, given by term spreads relative to some reference yield; e.g., \(y_2 - y_1\), \(y_3 - y_1\), ..., \(y_N - y_1\).
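A minimal sketch of the benchmark in Python (assuming a hypothetical pandas DataFrame `yields` whose columns are \(I(1)\) bond yield series): form the \(N-1\) spreads relative to a reference yield as the prespecified candidate cointegrating combinations, and run a unit-root test on each as a rough check.

```python
# Sketch only: `yields` is a hypothetical DataFrame of I(1) bond yield series.
import pandas as pd
from statsmodels.tsa.stattools import adfuller

def benchmark_spreads(yields: pd.DataFrame, reference: str) -> pd.DataFrame:
    """The N-1 candidate cointegrating combinations y_i - y_reference."""
    others = [c for c in yields.columns if c != reference]
    return pd.DataFrame({f"{c}-{reference}": yields[c] - yields[reference] for c in others})

def spread_adf_pvalues(yields: pd.DataFrame, reference: str) -> pd.Series:
    """ADF p-values for each spread; small values are consistent with the
    one-common-trend benchmark with prespecified (spread) cointegrating vectors."""
    spreads = benchmark_spreads(yields, reference)
    return spreads.apply(lambda s: adfuller(s.dropna())[1])
```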

There's not much literature exploring this perspective. (One notable exception is Horvath and Watson, "Testing for Cointegration When Some of the Cointegrating Vectors are Prespecified", Econometric Theory, 11, 952-984.) We need more.

Sunday, August 21, 2016

More on Big Data and Mixed Frequencies

I recently blogged on Big Data and mixed-frequency data, arguing that Big Data (wide data, in particular) leads naturally to mixed-frequency data.  (See here for the tall data / wide data / dense data taxonomy.)  The obvious just occurred to me, namely that it's also true in the other direction. That is, mixed-frequency situations also lead naturally to Big Data, and with a subtle twist: the nature of the Big Data may be dense rather than wide. The theoretically-pure way to set things up is as a state-space system laid out at the highest observed frequency, appropriately treating most of the lower-frequency data as missing, as in the ADS (Aruoba-Diebold-Scotti) framework.  Because the system is laid out at the highest frequency, it is dense by construction if any of the underlying series is dense.
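A rough sketch of that setup, using a one-factor state-space model at monthly frequency with a quarterly series entered as mostly missing (series names and frequencies are hypothetical; this illustrates the idea, not the ADS implementation):

```python
# Sketch only: lay everything out at the highest (monthly) frequency and let the
# Kalman filter treat the lower-frequency (quarterly) observations as missing.
import pandas as pd
from statsmodels.tsa.statespace.dynamic_factor import DynamicFactor

def mixed_frequency_factor(monthly_df: pd.DataFrame, quarterly_s: pd.Series):
    # Re-index the quarterly series to the monthly calendar: months without a
    # quarterly observation become NaN, i.e., "missing" to the Kalman filter.
    quarterly_m = quarterly_s.reindex(monthly_df.index)
    endog = pd.concat([monthly_df, quarterly_m.rename("quarterly")], axis=1)
    # One-factor model at monthly frequency; NaNs are handled automatically.
    model = DynamicFactor(endog, k_factors=1, factor_order=2)
    return model.fit(disp=False)
```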

Wednesday, August 17, 2016

On the Evils of Hodrick-Prescott Detrending


Jim Hamilton has a very cool new paper, "Why You Should Never Use the Hodrick-Prescott (HP) Filter".

Of course we've known of the pitfalls of HP ever since Cogley and Nason (1995) brought them into razor-sharp focus decades ago.  The title of the even-earlier Nelson and Kang (1981) classic, "Spurious Periodicity in Inappropriately Detrended Time Series", says it all.  Nelson-Kang made the spurious-periodicity case against polynomial detrending of I(1) series.  Hamilton makes the spurious-periodicity case against HP detrending of many types of series, including I(1).  (Or, more precisely, Hamilton adds even more weight to the Cogley-Nason spurious-periodicity case against HP.)

But the main contribution of Hamilton's paper is constructive, not destructive.  It provides a superior detrending method, based only on a simple linear projection. 

Here's a way to understand what "Hamilton detrending" does and why it works, based on a nice connection to Beveridge-Nelson (1981) detrending not noticed in Hamilton's paper.  

First consider Beveridge-Nelson (BN) trend for I(1) series.  BN trend is just a very long-run forecast based on an infinite past.  [You want a very long-run forecast in the BN environment because the stationary cycle washes out from a very long-run forecast, leaving just the forecast of the underlying random-walk stochastic trend, which is also the current value of the trend since it's a random walk.  So the BN trend at any time is just a very long-run forecast made at that time.]  Hence BN trend is implicitly based on the projection: \(y_t ~ \rightarrow ~ c, ~ y_{t-h}, ~...,~ y_{t-h-p} \), for \(h \rightarrow \infty \) and \(p \rightarrow \infty\).

Now consider Hamilton trend.  It is explicitly based on the projection: \(y_t ~ \rightarrow ~ c, ~ y_{t-h}, ~...,~ y_{t-h-p} \), for \(p = 3 \).  (Hamilton also uses a benchmark of  \(h = 8 \).)

So BN and Hamilton are both "linear projection trends", differing only in choice of \(h\) and \(p\)!  BN takes an infinite forecast horizon and projects on an infinite past.  Hamilton takes a medium forecast horizon and projects on just the recent past.
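A minimal sketch of the common "linear projection trend", assuming a hypothetical pandas Series `y` of quarterly (log) data; Hamilton's benchmark corresponds to \(h = 8\), \(p = 3\), while pushing \(h\) and \(p\) toward infinity gives the BN-style projection:

```python
# Sketch only: project y_t on a constant and y_{t-h}, ..., y_{t-h-p};
# the fitted value is the trend, the residual is the cycle.
import pandas as pd
import statsmodels.api as sm

def projection_trend(y: pd.Series, h: int = 8, p: int = 3):
    X = pd.concat({f"lag{h + j}": y.shift(h + j) for j in range(p + 1)}, axis=1)
    res = sm.OLS(y, sm.add_constant(X), missing="drop").fit()
    trend = res.fittedvalues    # projection of y_t on its own lagged history
    cycle = y - trend           # Hamilton-style cycle = projection residual
    return trend, cycle
```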

Much of Hamilton's paper is devoted to defending the choice of \(p = 3 \), which turns out to perform well for a wide range of data-generating processes (not just I(1)).  The BN choice of \(h = p = \infty \), in contrast, although optimal for I(1) series, is less robust to other DGPs.  (And of course estimation of the BN projection as written above is infeasible, which people avoid in practice by assuming low-order ARIMA structure.)

Monday, August 15, 2016

More on Nonlinear Forecasting Over the Cycle

Related to my last post, here's a new paper that just arrived from Rachidi Kotchoni and Dalibor Stevanovic, "Forecasting U.S. Recessions and Economic Activity". It's not non-parametric, but it is non-linear. As Dalibor put it, "The method is very simple: predict turning points and recession probabilities in the first step, and then augment a direct AR model with the forecasted probability." Kotchoni-Stevanovic and Guerron-Quintana-Zhong are usefully read together.
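A hedged sketch of the two-step idea as described (not the authors' code), with hypothetical pandas Series `growth`, `recession`, and `spread`: a probit delivers an \(h\)-step-ahead recession probability, which then augments a direct AR regression.

```python
# Sketch only: step 1 fits a probit for the recession probability h steps ahead;
# step 2 runs a direct AR(p) regression for growth, augmented with that probability.
import pandas as pd
import statsmodels.api as sm

def two_step_direct(growth: pd.Series, recession: pd.Series, spread: pd.Series,
                    h: int = 4, p: int = 4):
    # Step 1: probit for Pr(recession at t+h) given the term spread at t.
    probit = sm.Probit(recession.shift(-h), sm.add_constant(spread), missing="drop").fit(disp=0)
    prob = pd.Series(probit.predict(sm.add_constant(spread)), index=spread.index, name="prob")
    # Step 2: direct AR(p) for growth at t+h, augmented with the fitted probability.
    lags = pd.concat({f"lag{j}": growth.shift(j) for j in range(p)}, axis=1)
    X = sm.add_constant(pd.concat([lags, prob], axis=1))
    return sm.OLS(growth.shift(-h), X, missing="drop").fit()
```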

Sunday, August 14, 2016

Nearest-Neighbor Forecasting in Times of Crisis

Nonparametric K-nearest-neighbor forecasting remains natural and obvious and potentially very useful, as it has been since its inception long ago.

[Most crudely: Find the K-history closest to the present K-history, see what followed it, and use that as a forecast. Slightly less crudely: Find the N K-histories closest to the present K-history, see what followed each of them, and take an average. There are many obvious additional refinements.]
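For concreteness, here is a bare-bones version of that recipe in Python (names and choices are illustrative; `y` is a hypothetical one-dimensional array):

```python
# Sketch only: average what followed the N past K-histories closest (in
# Euclidean distance) to the present K-history.
import numpy as np

def knn_forecast(y: np.ndarray, K: int = 4, N: int = 5) -> float:
    current = y[-K:]                                  # the present K-history
    # every past K-history y[s:s+K] with an observed successor y[s+K]
    candidates = [(np.linalg.norm(y[s:s+K] - current), y[s+K])
                  for s in range(len(y) - K)]
    nearest = sorted(candidates, key=lambda t: t[0])[:N]
    return float(np.mean([nxt for _, nxt in nearest]))  # average of what followed
```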

Overall, nearest-neighbor forecasting remains curiously under-utilized in dynamic econometrics. Maybe that will change. In an interesting recent development, for example, new Federal Reserve System research by Pablo Guerron-Quintana and Molin Zhong puts nearest-neighbor methods to good use for forecasting in times of crisis.

Monday, August 8, 2016

NSF Grants vs. Improved Data

Lots of people are talking about the Cowen-Tabarrok Journal of Economic Perspectives piece, "A Skeptical View of the National Science Foundation’s Role in Economic Research". See, for example, John Cochrane's insightful "A Look in the Mirror".

A look in the mirror indeed. I was a 25-year ward of the NSF, but for the past several years I've been on the run. I bolted in part because the economics NSF reward-to-effort ratio has fallen dramatically for senior researchers, and in part because, conditional on the ongoing existence of NSF grants, I feel strongly that NSF money and "signaling" are better allocated to young assistant and associate professors, for whom the signaling value from NSF support is much higher.

Cowen-Tabarrok make some very good points. But I can see both sides of many of their issues and sub-issues, so I'm not taking sides. Instead let me make just one observation (and I'm hardly the first).

If NSF funds were to be re-allocated, improved data collection and dissemination looks attractive. I'm not talking about funding cute RCTs-of-the-month. Rather, I'm talking about funding increased and ongoing commitment to improving our fundamental price and quantity data (i.e., the national accounts and related statistics). They desperately need to be brought into the new millennium. Just look, for example, at the wealth of issues raised in recent decades by the Conference on Research in Income and Wealth.

Ironically, it's hard to make a formal case (at least for data dissemination as opposed to creation), as Chris Sims has emphasized with typical brilliance. His "The Futility of Cost-Benefit Analysis for Data Dissemination" explains "why the apparently reasonable idea of applying cost-benefit analysis to government programs founders when applied to data dissemination programs." So who knows how I came to feel that NSF funds might usefully be re-allocated to data collection and dissemination. But so be it.

Monday, August 1, 2016

On the Superiority of Observed Information

Earlier I claimed that "Efron-Hinkley holds up -- observed information dominates estimated expected information for finite-sample MLE inference." Several of you have asked for elaboration.

The earlier post grew from a 6 AM Hong Kong breakfast conversation with Per Mykland (with both of us suffering from 12-hour jet lag), so I wanted to get some detail from him before elaborating, to avoid erroneous recollections. But it's basically as I recalled -- mostly coming from the good large-deviation properties of the likelihood ratio. The following is adapted from that conversation and a subsequent email exchange. (Any errors or omissions are entirely mine.)

There was quite a bit of work in the 1980s and 1990s. It was kicked off by Efron and Hinkley (1978). The main message is in their plot on p. 460, suggesting that the observed information was the more accurate estimator. Research gradually focused on the behavior of the likelihood ratio (\(LR\)) statistic and its signed square root \(R=\mathrm{sgn}(\hat{\theta} - \theta ) \sqrt{LR}\), which was seen to have good conditionality properties, local sufficiency, and, most crucially, good large-deviation properties.  (For details see Mykland (1999), Mykland (2001), and the references there.)

The large-deviation situation is as follows.  Most statistics have cumulant behavior as in Mykland (1999) eq. (2.1).  In contrast, \(R\) has cumulant behavior as in Mykland (1999) eq. (2.2), which yields the large deviation properties of Mykland (1999) Theorem 1. (Also see Theorems 1 and 2 of Mykland (2001).)
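For readers who want to see the two objects side by side, here is a toy numerical illustration (it shows only the basic distinction, not the Mykland large-deviation results): for a unit-scale Cauchy location model, compare the observed information (negative Hessian of the log likelihood at the MLE) with the estimated expected information (\(n/2\) for unit-scale Cauchy).

```python
# Toy sketch only: observed vs. estimated expected information for a Cauchy
# location model; data and numbers are purely illustrative.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import cauchy

def obs_vs_exp_information(x: np.ndarray, eps: float = 1e-4):
    loglik = lambda m: cauchy.logpdf(x, loc=m).sum()
    mle = minimize_scalar(lambda m: -loglik(m)).x
    # observed information: numerical negative second derivative at the MLE
    obs = -(loglik(mle + eps) - 2 * loglik(mle) + loglik(mle - eps)) / eps**2
    exp_info = len(x) / 2.0   # expected information: n * (1/2) for unit-scale Cauchy
    return mle, obs, exp_info

# usage: mle, obs, exp = obs_vs_exp_information(cauchy.rvs(loc=1.0, size=200, random_state=0))
# compare standard errors 1/np.sqrt(obs) and 1/np.sqrt(exp).
```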

Tuesday, July 26, 2016

An Important Example of Simultaneously Wide and Dense Data

By the way, related to my last post on wide and dense data, an important example of analysis of data that are both wide and dense is the high-frequency high-dimensional factor modeling of Pelger and of Ait-Sahalia and Xiu.  Effectively they work with wide sets of realized volatilities, each of which is constructed from underlying dense intraday data.
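A rough sketch of the "wide built from dense" construction, assuming a hypothetical long-format DataFrame `ticks` of intraday log prices with columns asset, timestamp, logprice (this illustrates the general idea, not their estimators):

```python
# Sketch only: daily realized variances per asset from dense intraday returns,
# stacked into a wide day-by-asset panel, from which one can extract factors.
import numpy as np
import pandas as pd

def realized_variance_panel(ticks: pd.DataFrame) -> pd.DataFrame:
    ticks = ticks.sort_values(["asset", "timestamp"]).copy()
    ticks["ret"] = ticks.groupby("asset")["logprice"].diff()
    date = ticks["timestamp"].dt.date.rename("date")
    rv = (ticks["ret"] ** 2).groupby([date, ticks["asset"]]).sum()
    return rv.unstack("asset")          # rows: days, columns: assets

def first_factor(rv_panel: pd.DataFrame) -> pd.Series:
    z = ((rv_panel - rv_panel.mean()) / rv_panel.std()).dropna()
    u, s, _ = np.linalg.svd(z.values, full_matrices=False)
    return pd.Series(u[:, 0] * s[0], index=z.index, name="factor1")
```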

Monday, July 25, 2016

The Action is in Wide and/or Dense Data

I recently blogged on varieties of Big Data: (1) tall, (2) wide, and (3) dense.

Presumably tall data are the least interesting insofar as the only way to get a long calendar span is to sit around and wait, in contrast to wide and dense data, which now appear routinely.

But it occurs to me that tall data are also the least interesting for another reason:  from a nonparametric perspective, wide data effectively undo tall data. In particular, non-parametric estimation in high dimensions (that is, with wide data) is always subject to the fundamental and inescapable "curse of dimensionality":  the rate at which estimation error vanishes gets hopelessly slow, very quickly, as dimension grows.  [Wonks will recall that the Stone-optimal rate in \(d\) dimensions, for estimation of a twice-differentiable function, is \( \sqrt{T^{1- \frac{d}{d+4}}}\).]
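[To put rough numbers on the wonk note: the estimation error vanishes like \(T^{-2/(d+4)}\), which is \(T^{-0.4}\) for \(d=1\) but only about \(T^{-0.14}\) for \(d=10\). So matching the accuracy that \(T=1000\) observations deliver in one dimension would require roughly \(1000^{2.8} \approx 10^{8.4}\) observations in ten.]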

The upshot:  As our datasets get wider, they also implicitly get less tall. That's all the more reason to downplay tall data.  The action is in wide and dense data (whether separately or jointly).

Monday, July 18, 2016

The HAC Emperor has no Clothes: Part 2

The time-series kernel-HAC literature seems to have forgotten about pre-whitening. But most of the action is in the pre-whitening, as stressed in my earlier post. In time-series contexts, parametric allowance for good old ARMA-GARCH disturbances (with AIC order selection, say) is likely to be all that's needed, cleaning out whatever conditional-mean and conditional-variance dynamics are operative, after which there's little or no need for anything else. (And although I say "parametric" ARMA/GARCH, it's actually fully non-parametric from a sieve perspective.)
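A hedged sketch of the pre-whitening route for a scalar series (univariate and VARHAC-in-spirit; `u` is a hypothetical array of regression residuals or scores, and GARCH effects are omitted for brevity): fit an AR by AIC and read the long-run variance off the fitted AR rather than off an unwhitened kernel.

```python
# Sketch only: AIC-selected AR pre-whitening, with the long-run variance
# computed from the fitted AR as sigma^2 / (1 - sum(phi))^2.
import numpy as np
from statsmodels.tsa.ar_model import ar_select_order

def ar_longrun_variance(u: np.ndarray, maxlag: int = 12) -> float:
    sel = ar_select_order(u, maxlag=maxlag, ic="aic", trend="c")
    res = sel.model.fit()
    phi = np.asarray(res.params)[1:]      # AR coefficients (first entry is the constant)
    return float(res.sigma2 / (1.0 - phi.sum()) ** 2)
```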

Instead, people focus on kernel-HAC sans pre-whitening, and obsess over truncation lag selection. Truncation lag selection is indeed very important when pre-whitening is forgotten, since too short a lag can lead to seriously distorted inference, as emphasized in the brilliant early work of Kiefer-Vogelsang and in important recent work by Lewis, Lazarus, Stock and Watson. But all of that becomes much less important when pre-whitening is successfully implemented.

[Of course spectra need not be rational, so ARMA is just an approximation to a more general Wold representation (and remember, GARCH(1,1) is just an ARMA(1,1) in squares). But is that really a problem? In econometrics don't we feel comfortable with ARMA approximations 99.9 percent of the time? The only econometrically-interesting process I can think of that doesn't admit a finite-ordered ARMA representation is long memory (fractional integration). But that too can be handled parametrically by introducing just one more parameter, moving from ARMA(p,q) to ARFIMA(p,d,q).]

My earlier post linked to the key early work of Den Haan and Levin, which remains unpublished. I am confident that their basic message remains intact. Indeed recent work revisits and amplifies it in important ways; see Kapetanios and Psaradakis (2016) and new work in progress by Richard Baillie to be presented at the September 2016 NBER/NSF time-series meeting at Columbia ("Is Robust Inference with OLS Sensible in Time Series Regressions?").