LARGE SAMPLE SIEVE ESTIMATION OF SEMI-NONPARAMETRIC MODELS
BY Xiaohong Chen

COWLES FOUNDATION PAPER NO. 1262
COWLES FOUNDATION FOR RESEARCH IN ECONOMICS
YALE UNIVERSITY
Box 208281
New Haven, Connecticut 06520-8281
2008
http://cowles.econ.yale.edu/

Chapter 76
LARGE SAMPLE SIEVE ESTIMATION OF SEMI-NONPARAMETRIC MODELS*
XIAOHONG CHEN
Department of Economics, Yale University, Box 208281, New Haven, CT 06520, USA
e-mail: xiaohong.chen@yale.edu

Handbook of Econometrics, Volume 6B
Copyright © 2007 Elsevier B.V. All rights reserved
DOI: 10.1016/S1573-4412(07)06076-X

* The author thanks C. Ai, J. Heckman, B. Honore, J. Huang, G. Imbens, R. Matzkin, W. Newey, J. Powell and H. White for valuable suggestions, J. Huang for showing his work on concave extended linear models, and two anonymous referees for critical comments that led to thorough revisions. She also thanks K. Hyndman, A. Ingster, M. Kredler, D. Pouzo and R. Sela for proof-reading, M. Garibotti, D. Pouzo and V. Tsyrennikov for simulations and other PhD students who went through earlier versions used as the lecture notes for Topics in Econometrics during the Fall 2002, Fall 2003, Spring 2005 and Fall 2005 sessions at New York University. The author acknowledges financial support from the National Science Foundation and the C.V. Starr Center at NYU. Any errors or omissions are the responsibility of the author.

Contents

Abstract
Keywords
1. Introduction
2. Sieve estimation: Examples, definitions, sieves
   2.1. Empirical examples of semi-nonparametric econometric models
   2.2. Definition of sieve extremum estimation
      2.2.1. Ill-posed versus well-posed problem, sieve extremum estimation
      2.2.2. Sieve M-estimation
      2.2.3. Series estimation, concave extended linear models
      2.2.4. Sieve MD estimation
   2.3. Typical function spaces and sieve spaces
      2.3.1. Typical smoothness classes and (finite-dimensional) linear sieves
      2.3.2. Weighted smoothness classes and (finite-dimensional) linear sieves
      2.3.3. Other smoothness classes and (finite-dimensional) nonlinear sieves
      2.3.4. Infinite-dimensional (nonlinear) sieves and method of penalization
      2.3.5. Shape-preserving sieves
      2.3.6. Choice of a sieve space
   2.4. A small Monte Carlo study
   2.5. An incomplete list of sieve applications in econometrics
3. Large sample properties of sieve estimation of unknown functions
   3.1. Consistency of sieve extremum estimators
   3.2. Convergence rates of sieve M-estimators
      3.2.1. Example: Additive mean regression with a monotone constraint
      3.2.2. Example: Multivariate quantile regression
   3.3. Convergence rates of series estimators
   3.4. Pointwise asymptotic normality of series LS estimators
      3.4.1. Asymptotic normality of the spline series LS estimator
      3.4.2. Asymptotic normality of functionals of series LS estimator
4. Large sample properties of sieve estimation of parametric parts in semiparametric models
   4.1. Semiparametric two-step estimators
      4.1.1. Asymptotic normality
   4.2. Sieve simultaneous M-estimation
      4.2.1. Asymptotic normality of smooth functionals of sieve M-estimators
      4.2.2. Asymptotic normality of sieve GLS
      4.2.3. Example: Partially additive mean regression with a monotone constraint
      4.2.4. Efficiency of sieve MLE
   4.3. Sieve simultaneous MD estimation: Normality and efficiency
5. Concluding remarks
References

Abstract

Often researchers find parametric models restrictive and sensitive to deviations from the parametric specifications; semi-nonparametric models are more flexible and robust, but lead to other complications: they introduce infinite-dimensional parameter spaces that may not be compact, and the optimization problem may no longer be well-posed.
The method of sieves provides one way to tackle such difficulties by optimizing an empirical criterion over a sequence of approximating parameter spaces (i.e., sieves); the sieves are less complex but are dense in the original space and the resulting optimization problem becomes well-posed. With different choices of criteria and sieves, the method of sieves is very flexible in estimating complicated semi-nonparametric models with (or without) endogeneity and latent heterogeneity. It can easily incorporate prior information and constraints, often derived from economic theory, such as monotonicity, convexity, additivity, multiplicity, exclusion and nonnegativity. It can simultaneously estimate the parametric and nonparametric parts in semi-nonparametric models, typically with optimal convergence rates for both parts.

This chapter describes estimation of semi-nonparametric econometric models via the method of sieves. We present some general results on the large sample properties of the sieve estimates, including consistency of the sieve extremum estimates, convergence rates of the sieve M-estimates, pointwise normality of series estimates of regression functions, root-n asymptotic normality and efficiency of sieve estimates of smooth functionals of infinite-dimensional parameters. Examples are used to illustrate the general results.

Keywords

sieve extremum estimation, series, sieve minimum distance, semiparametric two-step estimation, endogeneity in semi-nonparametric models

JEL classification: C13, C14, C20

1. Introduction

Semiparametric and nonparametric modelling techniques have grown increasingly popular in both theoretical and applied econometrics.¹ This is partly because economic theory seldom suggests any parametric functional relationships among economic variables, nor does it suggest particular parametric forms for error distributions.
An additional reason for the growing popularity of semi-nonparametric models is the declining computational cost of collecting and analyzing large economic data sets. All of the chapters in the book edited by Barnett, Powell and Tauchen (1991) and several chapters² in the Handbook of Econometrics Volume 4 edited by Engle and McFadden (1994) have already reviewed the work in semiparametric and nonparametric econometrics that has been conducted up to the mid-1990s. More recently, Horowitz (1998) has provided a comprehensive treatment of four leading classes of semiparametric econometric models estimated via the kernel method. Pagan and Ullah (1999), Härdle et al. (2004) and Li and Racine (2007) have surveyed the most well-known existing theoretical and empirical work on the estimation and testing of semiparametric and nonparametric econometric models via the methods of kernel, local linear regression and series. This chapter will review some recent developments in large sample theory on estimation of semi-nonparametric models via the method of sieves [Grenander (1981)].

Semi-nonparametric models involve unknown parameters that lie in infinite-dimensional parameter spaces; hence it can be computationally difficult to estimate such models using finite samples. Moreover, even if one could solve the problem of optimizing a sample criterion over an infinite-dimensional parameter space, the resulting estimator may have undesirable large sample properties such as inconsistency and/or a very slow rate of convergence; this is because the problem of optimization over an infinite-dimensional noncompact space may no longer be well-posed. To resolve this problem, the method of sieves optimizes a criterion function over a sequence of significantly less complex, and often finite-dimensional, parameter spaces, which we call sieves.
To ensure consistency of the method, we require that the complexity of sieves increases with the sample size so that in the limit the sieves are dense in the original parameter space.³

¹ In this chapter, an econometric model is termed "parametric" if all of its parameters are in finite-dimensional parameter spaces; a model is "nonparametric" if all of its parameters are in infinite-dimensional parameter spaces; a model is "semiparametric" if its parameters of interest are in finite-dimensional spaces but its nuisance parameters are in infinite-dimensional spaces; a model is "semi-nonparametric" if it contains both finite-dimensional and infinite-dimensional unknown parameters of interest.
² See the ones written by Newey and McFadden (1994), Andrews (1994a), Powell (1994), Härdle and Linton (1994), Matzkin (1994), Manski (1994) and others.
³ These terms will become much clearer in the next two sections.

The infinite-dimensional unknown parameter in a nonparametric or semiparametric model can often be viewed as a member of some function space with certain regularities (e.g., having bounded second derivatives, monotone, concave). Thus, many deterministic approximation results developed in mathematics and computer science can be used to suggest sieves that provide good and computable approximations to an unknown function. For example, the sieves or approximating spaces can be constructed using linear spans of power series, Fourier series, splines or many other basis functions; see e.g. Judd (1998, Chapters 6 and 12) for numerical implementation of such sieves for problems in economics and finance. Since these approximating spaces can often be characterized by a finite number of "parameters", a nonparametric or semiparametric estimation problem is often reduced to a parametric one when the method of sieves is implemented.
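To make this reduction concrete, the following is a minimal sketch (not taken from the chapter) of a least squares fit over the linear span of a power series. The simulated regression function, the noise level and the rule `sieve_dimension` that links the number of terms to the sample size are all illustrative assumptions, and the hand-written linear solver is used only to keep the example free of external libraries.

```python
import math
import random


def poly_basis(x, k):
    """Return the first k power-series terms [1, x, ..., x**(k-1)]."""
    return [x ** j for j in range(k)]


def solve_linear_system(a, b):
    """Solve a @ coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - tail) / m[i][i]
    return coef


def sieve_least_squares(xs, ys, k):
    """Least squares of y on the k-term power-series sieve; returns the fit."""
    rows = [poly_basis(x, k) for x in xs]
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    moment = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    coef = solve_linear_system(gram, moment)
    return lambda x: sum(c * p for c, p in zip(coef, poly_basis(x, k)))


def sieve_dimension(n):
    """Hypothetical rule (an assumption): the number of terms grows like n**(1/5)."""
    return max(2, round(2 * n ** 0.2))


# Simulated data (an assumption for illustration): y = sin(2*pi*x) + noise.
rng = random.Random(1)
xs = [rng.random() for _ in range(500)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
fit = sieve_least_squares(xs, ys, sieve_dimension(len(xs)))
```

The rule `sieve_dimension` is only one of many possible choices. For any fixed `k`, `sieve_least_squares` is an ordinary parametric regression on `k` regressors; only the rule linking `k` to the sample size distinguishes it from a classical parametric fit.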
However, to obtain the desired theoretical properties of the estimator, it is necessary that the number of parameters increase slowly with the sample size. It is this feature that gives the sieve method its added flexibility and robustness over classical parametric methods which assume fixed, finite-dimensional parameter spaces.

One attractive feature of the method of sieves is that it is easy to implement. The sieve method is particularly convenient when the unknown functions enter the criterion function (or moment condition) nonlinearly, satisfy some known restrictions such as monotonicity, concavity, additivity, multiplicity and exclusion, or when the error distribution has known tail behavior such as fat tails. With different choices of criteria and sieves, the method of sieves provides a flexible and computationally feasible approach to estimate complicated semi-nonparametric models with (or without) constraints, endogeneity and latent heterogeneity. Moreover, it can simultaneously estimate the parametric and nonparametric components in semi-nonparametric models, and can often achieve optimal convergence rates for both parts. We shall demonstrate these with some examples in the subsequent sections.

Although the method of sieves is easy to implement and the sieve estimators typically have desirable large sample properties, its theoretical properties cannot be justified by applying the classical theory for parametric models. Any appropriate large sample theory for the sieve method should not only account for the approximation errors, which arise because we replace the original parameter space with the simpler sieve space, but also control for the complexity of the sieve parameter spaces, which increases with the sample size. Consequently, the large sample properties of the sieve method are in general difficult to derive, which may partly explain why currently there are fewer econometric applications using such techniques than those using the kernel method.
However, we should mention that the sieve estimation method admits, as special cases, many standard estimation methods (such as series-based methods) in econometrics. As a result, some large sample results appear in the literature in papers that do not mention the word "sieve" at all.

In this chapter we shall present some general results on large sample estimation theory using the method of sieves and illustrate how to apply these results with examples. Instead of presenting the current sieve estimation theory at its greatest generality, we have chosen to review results that are relatively accessible but general enough to cover most semi-nonparametric econometric applications. References are given for the results that are not presented in detail.

The rest of this chapter is organized as follows. In Section 2, we first present several examples of semi-nonparametric econometric models. We then define the sieve extremum estimation and its special cases including sieve M-estimation, sieve maximum likelihood estimation (MLE), sieve generalized least squares (GLS), sieve minimum distance (MD) and others. The various criterion functions are illustrated using examples. In addition, we introduce the popular series estimators as the sieve M-estimators obtained when the criterion functions are concave and the sieve spaces are finite-dimensional linear.⁴ We then review typical function spaces and sieve spaces used in econometrics, and conclude this section with a small Monte Carlo study to demonstrate the implementation of the sieve extremum estimation.⁵ Section 3 focuses on the large sample properties of sieve estimation of infinite-dimensional unknown parameters. We first provide a new consistency theorem for general sieve extremum estimation where the original parameter space may not be compact and the problem may not be well-posed. This theorem implies consistency of sieve M-estimators and of sieve MD-estimators in two remarks.
We then present a convergence rate result for sieve M-estimators and illustrate how to apply the result with some examples. We also review the convergence rate and the pointwise asymptotic normality results for the series estimators. In Section 4, we present general results on $\sqrt{n}$-asymptotic normality of sieve estimators of smooth functionals of unknown infinite-dimensional parameters, where $n$ denotes the sample size. Here we first discuss the popular two-step semiparametric procedures in which the first step unknown functions could be estimated by any nonparametric procedures such as kernel, local linear regression and sieve methods, and the second step unknown parametric components are estimated by the generalized method of moments (GMM). The theorem on $\sqrt{n}$-asymptotic normality of the second step GMM estimator is a slight refinement of the existing ones in the semiparametric literature. We then review the $\sqrt{n}$-asymptotic normality of the sieve M-estimation of smooth functionals of unknown functions, as well as the semiparametric efficiency of the sieve MLE. Finally we present the recent theory on the sieve MD estimation for the parametric components in semi-nonparametric conditional moment models where the unknown functions could depend on endogenous variables. Section 5 points out additional topics on statistical inference via the method of sieves that are not reviewed here due to the lack of space.

⁴ We note that this definition of series estimators differs slightly from those in the current econometrics literature.
⁵ See the chapter by Ichimura and Todd (2007) for more details on the implementation of semi-nonparametric estimators.

Throughout this chapter, we assume that there is an underlying complete probability space, the data $\{Z_t = (Y_t', X_t')' : t \ge 1\}$ are strictly stationary ergodic,⁶ and all probability calculations are done under the true probability measure $P_o$. For random variables $V_n$ and positive numbers $b_n$, $n \ge 1$, we define $V_n = O_P(b_n)$ as $\lim_{c\to\infty} \limsup_n P(|V_n| \ge c b_n) = 0$, and define $V_n = o_P(b_n)$ as $\lim_n P(|V_n| \ge c b_n) = 0$ for all $c > 0$. The notation $\operatorname{plim}_{n\to\infty} V_n = 0$ also means that $V_n = o_P(1)$ (i.e., $V_n$ converges to 0 in probability). Similarly $V_n = o_{a.s.}(1)$ means that $V_n$ converges to 0 almost surely. For two sequences of positive numbers $b_{1n}$ and $b_{2n}$, the notation $b_{1n} \asymp b_{2n}$ means that the ratio $b_{1n}/b_{2n}$ is bounded below and above by positive constants that are independent of $n$.

⁶ In this chapter, the notation $'$ denotes the transpose of a vector. See Hansen (1982), White (1984) or Wooldridge (1994) for the definition of a strictly stationary ergodic process. We make this assumption to simplify the presentation. See White and Wooldridge (1991) on sieve extremum estimation for general dependent heterogeneous processes.

2. Sieve estimation: Examples, definitions, sieves

As alluded to in the introduction, the method of sieves consists of two key ingredients: a criterion function and sieve parameter spaces (a sequence of approximating spaces). Both the criterion functions and the sieve spaces can be very flexible. In particular, almost all of the classical criterion functions stated in Newey and McFadden (1994), so long as they still allow for identification, can be used as criterion functions in the method of sieve estimation. Therefore, the main new ingredient is the choice of sieve parameter spaces, which will be discussed in this section.

2.1. Empirical examples of semi-nonparametric econometric models

It is impossible to list all of the existing and potential semi-nonparametric models and their empirical applications in econometrics.
In this subsection we present three empirical examples as illustration; additional ones can be found in Manski (1994), Powell (1994), Matzkin (1994), Horowitz (1998), Pagan and Ullah (1999), Blundell and Powell (2003) and other surveys on this topic.

EXAMPLE 2.1 (Single spell duration models with unobserved heterogeneity). Classical single spell duration models in search unemployment [Flinn and Heckman (1982)], job turnover [Jovanovic (1979)], labor supply [Heckman and Willis (1977)] and others often suggest a functional form for the structural duration distribution conditional on individual heterogeneity. More specifically, let $G(\tau|u, x)$ be the structural distribution function of duration $T$ conditional on a scalar of unobserved heterogeneity $U = u$ and a vector of observed heterogeneity $X = x$. The distribution of observed duration given $X = x$ is
\[ F(\tau|x) = \int G(\tau|u, x)\, dh(u), \]
where the unobserved heterogeneity $U$ is modelled as a random factor with distribution function $h(\cdot)$. An i.i.d. sample of observations $\{T_i, X_i\}_{i=1}^n$ allows us to recover the true $F(\tau|x)$ uniquely. Theoretical models often imply parametric functional forms of $G$ up to unknown finite-dimensional parameters $\beta$. Denote $g(\cdot|\beta, u, x)$ as the probability density function of $G(\cdot|\beta, u, x)$. Conventional parametric MLE method assumes that the unobserved heterogeneity follows some known distribution $h_\gamma$ up to some unknown finite-dimensional parameters $\gamma$. Under this assumption it then estimates the unknown parameters $\beta, \gamma$ by
\[ \arg\max_{\beta,\gamma} \frac{1}{n} \sum_{i=1}^n \log\Big\{ \int g(T_i|\beta, u, X_i)\, dh_\gamma(u) \Big\}. \]

Heckman and Singer (1984) point out that both theoretical and empirical examples indicate that the parametric MLE estimates of structural parameters $\beta$ in these duration models are inconsistent if the distribution of the unobserved heterogeneity is misspecified.
Instead, they propose the following semi-nonparametric single spell duration model
\[ F(\tau|\beta, h, x) = \int G(\tau|\beta, u, x)\, dh(u), \tag{2.1} \]
where the distribution $h$ of unobserved heterogeneity is left unspecified. Heckman and Singer (1984) establish the identification of $(\beta', h)$, and propose a sieve MLE method to estimate $(\beta', h)$ jointly. They also show that their estimator is consistent.

The Heckman–Singer model is a typical example of a broad class of semi-nonparametric models that specify the (conditional) distribution associated with the observed economic variables semi-nonparametrically, where the specific semi-nonparametric form can be derived from independence of errors and regressors such as in discrete choice models, transformation models, sample selection models, mixture models, random censoring, nonlinear measurement errors and others. More generally, one could consider semi-nonparametric models based on quantile independence, symmetry or other qualitative restrictions on distributions. See Horowitz (1998), Manski (1994), Powell (1994) and Bickel et al. (1993) for examples.

EXAMPLE 2.2 (Shape-invariant system of Engel curves). Blundell, Browning and Crawford (2003) have shown that a system of Engel curves that satisfies Slutsky's symmetry condition and allows for demographic effects on budget shares in a given year must take the following form:
\[ Y_{1\ell i} = h_{1\ell}\big(Y_{2i} - h_0(X_{1i})\big) + h_{2\ell}(X_{1i}) + \varepsilon_{\ell i}, \quad \ell = 1, \dots, N, \]
where $Y_{1\ell i}$ is the $i$th household's budget share on the $\ell$th good, $Y_{2i}$ is the $i$th household's log-total nondurable expenditure, and $X_{1i}$ is a vector of the $i$th household's demographic variables that affect the household's nondurable consumption. Note that $h_0(X_{1i})$ is common among all the goods and is called an "equivalence scale" in the consumer demand literature.
Citing strong empirical evidence and many existing works, Blundell, Browning and Crawford (2003) have argued that popular parametric linear and quadratic forms for $h_{1\ell}(\cdot)$ are inadequate, and that consumer demand theory only suggests the purely nonparametric specification:
\[ E\big[ Y_{1\ell i} - \{ h_{1\ell}(Y_{2i} - h_0(X_{1i})) + h_{2\ell}(X_{1i}) \} \,\big|\, X_{1i}, Y_{2i} \big] = E[\varepsilon_{\ell i} \,|\, X_{1i}, Y_{2i}] = 0, \tag{2.2} \]
where $h_{1\ell}$, $h_{2\ell}$ and $h_0$ are all unknown functions. For the identification of all these unknown functions $\theta = (h_0, h_{11}, \dots, h_{1N}, h_{21}, \dots, h_{2N})'$ satisfying (2.2), it suffices to assume that at least one of $h_{1\ell}$, $\ell = 1, \dots, N$, is nonlinear and that $h_{2\ell}(x_1^*) = 0$, $\ell = 1, \dots, N$, for some $x_1^*$ in the support of $X_1$.

Unfortunately, when $X_{1i}$ contains too many household demographic variables (say when $\dim(X_{1i}) \ge 3$), the fully nonparametric specification (2.2) cannot lead to precise estimates of the unknown functions $h_0, h_{21}, \dots, h_{2N}$ due to the so-called "curse of dimensionality". Therefore, applied researchers must impose more structure on the model. Using the British family expenditure survey (FES) data, Blundell, Duncan and Pendakur (1998) found the following semi-nonparametric specification to be reasonable:
\[ E\big[ Y_{1\ell i} - \{ h_{1\ell}(Y_{2i} - g(X_{1i}'\beta_1)) + X_{1i}'\beta_{2\ell} \} \,\big|\, X_{1i}, Y_{2i} \big] = 0, \tag{2.3} \]
where $h_{1\ell}$, $\ell = 1, \dots, N$, are still unknown functions, but now $h_0(X_{1i}) = g(X_{1i}'\beta_1)$ and $h_{2\ell}(X_{1i}) = X_{1i}'\beta_{2\ell}$ are known up to unknown finite-dimensional parameters $\beta_1$ and $\beta_{2\ell}$. Here the parameters of interest are $\theta = (\beta_1', \beta_{21}', \dots, \beta_{2N}', h_{11}, \dots, h_{1N})'$. This semi-nonparametric specification has been estimated by Blundell, Duncan and Pendakur (1998) using the kernel method and Blundell, Chen and Kristensen (2007) using the sieve method.

Both the specifications (2.2) and (2.3) assume that the total nondurable expenditure $Y_{2i}$ is exogenous. However, this assumption has been rejected empirically.
Noting the endogeneity of total nondurable expenditure, Blundell, Chen and Kristensen (2007) considered the following semi-nonparametric instrumental variables (IV) regression:
\[ E\big[ Y_{1\ell i} - \{ h_{1\ell}(Y_{2i} - g(X_{1i}'\beta_1)) + X_{1i}'\beta_{2\ell} \} \,\big|\, X_{1i}, X_{2i} \big] = 0, \tag{2.4} \]
where the parameters of interest are still $\theta = (\beta_1', \beta_{21}', \dots, \beta_{2N}', h_{11}, \dots, h_{1N})'$, and $X_{2i}$ is the gross earnings of the head of the $i$th household, which is used as an instrument for the total nondurable expenditure $Y_{2i}$. They estimated this model via the sieve method and their empirical findings demonstrate the importance of accounting for the endogenous total expenditure semi-nonparametrically.

EXAMPLE 2.3 (Consumption-based asset pricing models). A standard consumption-based asset pricing model assumes that at time zero a representative agent maximizes the expected present value of the total utility function $E_0\{\sum_{t=0}^{\infty} \delta^t u(C_t)\}$, where $\delta$ is the time discount factor and $u(C_t)$ is period $t$'s utility. The consumption-based asset pricing model comes from the first-order conditions of a representative agent's optimal consumption choice problem. These first-order conditions place restrictions on the joint distribution of the intertemporal marginal rate of substitution in consumption and asset returns. They imply that for any traded asset indexed by $\ell$, with a gross return at time $t + 1$ of $R_{\ell,t+1}$, the following Euler equation holds:
\[ E(M_{t+1} R_{\ell,t+1} \,|\, w_t) = 1, \quad \ell = 1, \dots, N, \tag{2.5} \]
where $M_{t+1}$ is the intertemporal marginal rate of substitution in consumption, and $E(\cdot|w_t)$ denotes the conditional expectation given the information set at time $t$ (which is the sigma-field generated by $w_t$). More generally, any nonnegative random variable $M_{t+1}$ satisfying Equation (2.5) is called a stochastic discount factor (SDF); see Hansen and Richard (1987) and Cochrane (2001).

Hansen and Singleton (1982) have assumed that the period $t$ utility takes the power specification $u(C_t) = [(C_t)^{1-\gamma} - 1]/[1 - \gamma]$, where $\gamma$ is the curvature parameter of the utility function at each period, which implies that the SDF takes the form $M_{t+1} = \delta (C_{t+1}/C_t)^{-\gamma}$ and the Euler equation becomes:
\[ E\Big[ \delta_o \Big(\frac{C_{t+1}}{C_t}\Big)^{-\gamma_o} R_{\ell,t+1} - 1 \,\Big|\, w_t \Big] = 0, \quad \ell = 1, \dots, N, \tag{2.6} \]
where the unknown scalar parameters $\delta_o$, $\gamma_o$ can be estimated by Hansen's (1982) generalized method of moments (GMM). However, this classical power utility-based asset pricing model (2.6) has been rejected empirically.

Many subsequent papers have tried to relax the model (2.6) to fit the data better by introducing durable goods, habit formation or a nonseparable preference specification. The first class of papers proposes various parametric forms of the SDF, $M_{t+1}$, that are more flexible than $M_{t+1} = \delta (C_{t+1}/C_t)^{-\gamma}$; see e.g. Eichenbaum and Hansen (1990), Constantinides (1990), Campbell and Cochrane (1999). The second class of papers has made the SDF, $M_{t+1}$, a purely nonparametric function of a few state variables; see e.g. Gallant and Tauchen (1989), Newey and Powell (1989) and Bansal and Viswanathan (1993). Recently, Chen and Ludvigson (2003) have specified the SDF, $M_{t+1}$, to be semi-nonparametric in order to incorporate some preference parameters. In particular, they combine the power utility specification with a nonparametric internal habit formation: $E_0\{\sum_{t=0}^{\infty} \delta^t [(C_t - H_t)^{1-\gamma} - 1]/[1 - \gamma]\}$, where $H_t = H(C_t, C_{t-1}, \dots, C_{t-L})$ is the period $t$ habit level. Here $H(\cdot)$ is an unknown function that is homogeneous of degree one in current and past consumption, and can be rewritten as $H(C_t, C_{t-1}, \dots, C_{t-L}) = C_t h_o(C_{t-1}/C_t, \dots, C_{t-L}/C_t)$ with $h_o(\cdot)$ unknown. It is obvious that one needs to impose $0 \le h_o(\cdot) < 1$ so that $0 \le H_t < C_t$. The following external habit specification is a special case of their model:
\[ E\Bigg[ \delta_o \Big(\frac{C_{t+1}}{C_t}\Big)^{-\gamma_o} \frac{\big(1 - h_o\big(\frac{C_t}{C_{t+1}}, \dots, \frac{C_{t+1-L}}{C_{t+1}}\big)\big)^{-\gamma_o}}{\big(1 - h_o\big(\frac{C_{t-1}}{C_t}, \dots, \frac{C_{t-L}}{C_t}\big)\big)^{-\gamma_o}} R_{\ell,t+1} - 1 \,\Bigg|\, w_t \Bigg] = 0, \tag{2.7} \]
for $\ell = 1, \dots, N$, where $\gamma_o > 0$, $\delta_o > 0$ are unknown scalar preference parameters, $h_o(\cdot) \in [0, 1)$ is an unknown function and $H_{t+1} = C_{t+1} h_o(C_t/C_{t+1}, \dots, C_{t+1-L}/C_{t+1})$ is the habit level at time $t + 1$. Chen and Ludvigson (2003) have applied the sieve method to estimate this model and its generalization which allows for internal habit formation of unknown form. Their empirical findings, using quarterly data, are in favor of flexible nonlinear internal habit formation.

Semi-nonparametric conditional moment models. We note that Examples 2.2 and 2.3 and many other economic models imply semi-nonparametric conditional moment restrictions of the form
\[ E\big[\rho(Z_t; \theta_o) \,\big|\, X_t\big] = 0, \quad \theta_o \equiv (\beta_o', h_o')', \tag{2.8} \]
where $\rho(\cdot;\cdot)$ is a column vector of residual functions whose functional forms are known up to unknown parameters, $\theta \equiv (\beta', h')'$, and $\{Z_t = (Y_t', X_t')'\}_{t=1}^n$ is the data, where $Y_t$ is a vector of endogenous variables and $X_t$ is a vector of conditioning variables. Here $E[\rho(Z_t, \theta)|X_t]$ denotes the conditional expectation of $\rho(Z_t, \theta)$ given $X_t$, and the true conditional distribution of $Y_t$ given $X_t$ is unspecified (and is treated as a nuisance function). The parameters of interest $\theta_o \equiv (\beta_o', h_o')'$ contain a vector of finite-dimensional unknown parameters $\beta_o$ and a vector of infinite-dimensional unknown functions $h_o(\cdot) = (h_{o1}(\cdot), \dots, h_{oq}(\cdot))'$, where the arguments of $h_{oj}(\cdot)$ could depend on $Y$, $X$, a known index function $\delta_j(Z, \beta_o)$ up to unknown $\beta_o$, another unknown function $h_{ok}(\cdot)$ for $k \ne j$, or could also depend on unobserved random variables. Motivated by the asset pricing and rational expectations models, Hansen (1982, 1985) studied the conditional moment restriction $E[\rho(Z_t; \beta_o)|X_t] = 0$ (i.e., without unknown $h_o$) for stationary ergodic time series data (where typically $Z_t = (Y_t', X_t')'$ and $X_t$ includes lagged $Y_t$ and other pre-determined variables known at time $t$).
Chamberlain (1992), Newey and Powell (2003), Ai and Chen (2003) and Chen and Pouzo (2006) studied the general case $E[\rho(Z_t; \beta_o, h_o)|X_t] = 0$ for i.i.d. data.

The semi-nonparametric conditional moment models given by (2.8) can be classified into two broad subclasses. The first subclass consists of models without endogeneity in the sense that $\rho(Z_t, \theta) - \rho(Z_t, \theta_o)$ does not depend on any endogenous variables ($Y_t$); hence the true parameter $\theta_o$ can be identified as the unique maximizer of $Q(\theta) = -E[\rho(Z_t, \theta)' \{\Sigma(X_t)\}^{-1} \rho(Z_t, \theta)]$, where $\Sigma(X_t)$ is a positive definite weighting matrix. The second subclass consists of models with endogeneity in the sense that $\rho(Z_t, \theta) - \rho(Z_t, \theta_o)$ does depend on endogenous variables ($Y_t$). Here the true parameter $\theta_o$ can be identified as the unique maximizer of
\[ Q(\theta) = -E\big[ m(X_t, \theta)' \{\Sigma(X_t)\}^{-1} m(X_t, \theta) \big] \quad \text{with} \quad m(X_t, \theta) \equiv E\big[\rho(Z_t, \theta) \,\big|\, X_t\big]. \]
Although the second subclass includes the first subclass as a special case, when $\theta$ contains unknown functions, it is much easier to derive asymptotic properties for various nonparametric estimators of $\theta$ identified by the conditional moment models belonging to the first subclass.

The first subclass includes, as special cases, many semi-nonparametric regression models that have been well studied in econometrics. For example, it includes the specifications (2.2) and (2.3) of Example 2.2, the partially linear regression $E[Y_i - X_{1i}'\beta_o - h_o(X_{2i}) \,|\, X_{1i}, X_{2i}] = 0$ of Engle et al. (1986) and Robinson (1988), the index regression $E[Y_i - h_o(X_i'\beta_o) \,|\, X_i] = 0$ of Powell, Stock and Stoker (1989), Ichimura (1993) and Klein and Spady (1993), the varying coefficient model $E[Y_i - \sum_{j=1}^q h_{oj}(D_{ji}) X_{ji} \,|\, (D_{ki}, X_{ki}), k = 1, \dots, q] = 0$ of Chen and Tsay (1993), Cai, Fan and Yao (2000) and Chen and Conley (2001), and the additive model with a known link ($F$) function $E[Y_i - F(\sum_{j=1}^q h_{oj}(X_{ji})) \,|\, X_{1i}, \dots, X_{qi}] = 0$ of Horowitz and Mammen (2004).
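As a worked illustration of why the partially linear regression belongs to the first subclass, consider the following sketch. It assumes a scalar residual $\rho(Z, \theta) = Y - X_1'\beta - h(X_2)$, the weight $\Sigma(X) \equiv 1$ and finite second moments; the symbols $\varepsilon$ and $\Delta$ are introduced here for convenience and do not appear in the chapter.

```latex
\begin{align*}
\varepsilon &\equiv \rho(Z,\theta_o) = Y - X_1'\beta_o - h_o(X_2),
  \qquad E[\varepsilon \mid X_1, X_2] = 0, \\
\Delta(X;\theta) &\equiv \rho(Z,\theta_o) - \rho(Z,\theta)
  = X_1'(\beta - \beta_o) + h(X_2) - h_o(X_2), \\
Q(\theta) &= -E\big[\rho(Z,\theta)^2\big]
  = -E\big[(\varepsilon - \Delta(X;\theta))^2\big] \\
&= -E[\varepsilon^2]
  + 2\,E\big[\Delta(X;\theta)\,E[\varepsilon \mid X_1, X_2]\big]
  - E\big[\Delta(X;\theta)^2\big] \\
&= Q(\theta_o) - E\big[\Delta(X;\theta)^2\big].
\end{align*}
```

Because $\Delta(X; \theta)$ depends only on the conditioning variables, the cross term vanishes and $Q(\theta) \le Q(\theta_o)$, with equality only if $\Delta(X; \theta) = 0$ almost surely. Uniqueness of the maximizer then requires an additional identification condition, for instance that $E[(X_1 - E[X_1|X_2])(X_1 - E[X_1|X_2])']$ is positive definite, which is a standard condition in the partially linear regression literature rather than something stated in the text above.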
The second subclass includes, as special cases, the speci?cation (2.4) of Example 2.2, Example 2.3, semi-nonparametric asset pricing and rational expectation models, and simultaneous equations with ?exible parameterization. A leading, yet dif?cult exam- ple of this subclass, is the purely nonparametric instrumental variables (IV) regression 5560 X. Chen E[Y1i ? ho(Y2i)|Xi] = 0 studied by Newey and Powell (2003), Darolles, Florens and Renault (2002), Blundell, Chen and Kristensen (2007), Hall and Horowitz (2005) and Carrasco, Florens and Renault (2006). A more dif?cult example is the nonparametric IV quantile regression E[1{Y1i ho(Y2i)} ? γ |Xi] = 0 for some known γ ∈ (0, 1) considered by Chernozhukov, Imbens and Newey (2007), Horowitz and Lee (2007) and Chen and Pouzo (2006). See Blundell and Powell (2003), Florens (2003), Newey and Powell (1989), Carrasco, Florens and Renault (2006) and Chen and Pouzo (2006) for additional examples. 2.2. De?nition of sieve extremum estimation 2.2.1. Ill-posed versus well-posed problem, sieve extremum estimation Let Θ be an in?nite-dimensional parameter space endowed with a (pseudo-) metric d. A typical semi-nonparametric econometric model speci?es that there is a population cri- terion function Q : Θ → R, which is uniquely maximized at a (pseudo-) true parameter θo ∈ Θ.7 The choice of Q(·) and the existence of θo are suggested by the identi?ca- tion of an econometric model. The (pseudo-) true parameter θo ∈ Θ is unknown but is related to a joint probability measure Po(z1,zn), from which a sample of size n observations {Zt }n t=1, Zt ∈ Rdz , 1 dz < ∞, is available. Let Qn : Θ → R be an empirical criterion, which is a measurable function of the data {Zt }n t=1 for all θ ∈ Θ, and converges to Q in some sense (to be more precise in Subsection 3.1) as the sample size n → ∞. One general way to estimate θo is by maximizing Qn over Θ; the maxi- mizer, arg supθ∈Θ Qn(θ), assuming it exists, is then called the extremum estimate. See e.g. 
Amemiya (1985, Chapter 4), Gallant and White (1988b), Newey and McFadden (1994) and White (1994).

When $\Theta$ is infinite-dimensional and possibly not compact with respect to the (pseudo-) metric $d$,[8] maximizing $Q_n$ over $\Theta$ may not be well-defined; or even if a maximizer $\arg\sup_{\theta \in \Theta} Q_n(\theta)$ exists, it is generally difficult to compute, and may have undesirable large sample properties such as inconsistency and/or a very slow rate of convergence. These difficulties arise because the problem of optimization over an infinite-dimensional noncompact space may no longer be well-posed. Throughout this chapter, we say the optimization problem is well-posed if, for all sequences $\{\theta_k\}$ in $\Theta$ such that $Q(\theta_o) - Q(\theta_k) \to 0$, we have $d(\theta_o, \theta_k) \to 0$; it is ill-posed (or not well-posed) if there exists a sequence $\{\theta_k\}$ in $\Theta$ such that $Q(\theta_o) - Q(\theta_k) \to 0$ but $d(\theta_o, \theta_k) \not\to 0$.[9]

[7] Although we often call $\theta_o$ the "true" parameter in this survey chapter, it in fact could be a pseudo-true parameter value, depending on the specification of the econometric model and the choice of $Q$. See Ai and Chen (2007) for estimation of misspecified semi-nonparametric models.

[8] In an infinite-dimensional metric space $(\mathcal{H}, d)$, a compact set is a $d$-closed and totally bounded set. (A set is totally bounded if for any $\varepsilon > 0$, there exist finitely many open balls with radius $\varepsilon$ that cover the set.) A $d$-closed and bounded set is compact only in a finite-dimensional Euclidean space.

[9] See Carrasco, Florens and Renault (2006) and Vapnik (1998) for surveys on ill-posed inverse problems in linear nonparametric models.

For a given semi-nonparametric model, suppose the criterion $Q(\theta)$ and the space $\Theta$ are chosen such that $Q(\theta)$ is uniquely maximized at $\theta_o$ in $\Theta$. Then whether the problem is ill-posed or well-posed depends on the choice of the pseudo-metric $d$.
This is because different metrics on an infinite-dimensional space $\Theta$ may not be equivalent to each other.[10] In particular, it is likely that some standard norm (say $\|\theta_o - \theta\|_s$) on $\Theta$ is not continuous in $Q(\theta_o) - Q(\theta)$ and the problem is ill-posed under $\|\cdot\|_s$, but there is another pseudo-metric (say $\|\theta_o - \theta\|_w$) on $\Theta$ that is continuous in $Q(\theta_o) - Q(\theta)$, hence the problem becomes well-posed under this $\|\cdot\|_w$; such a pseudo-metric is typically weaker than $\|\cdot\|_s$ (i.e., $\|\theta_o - \theta\|_s \to 0$ implies $\|\theta_o - \theta\|_w \to 0$). See Ai and Chen (2003, 2007) for more discussions.[11]

No matter whether the semi-nonparametric problems are well-posed or ill-posed, the method of sieves provides one general approach to resolve the difficulties associated with maximizing $Q_n$ over an infinite-dimensional space $\Theta$ by maximizing $Q_n$ over a sequence of approximating spaces $\Theta_n$, called sieves by Grenander (1981), which are less complex but are dense in $\Theta$. Popular sieves are typically compact, nondecreasing ($\Theta_n \subseteq \Theta_{n+1} \subseteq \cdots \subseteq \Theta$) and are such that for any $\theta \in \Theta$ there exists an element $\pi_n\theta$ in $\Theta_n$ satisfying $d(\theta, \pi_n\theta) \to 0$ as $n \to \infty$, where the notation $\pi_n$ can be regarded as a projection mapping from $\Theta$ to $\Theta_n$.

An approximate sieve extremum estimate, denoted by $\hat\theta_n$, is defined as an approximate maximizer of $Q_n(\theta)$ over the sieve space $\Theta_n$, i.e.,

$$Q_n(\hat\theta_n) \ge \sup_{\theta \in \Theta_n} Q_n(\theta) - O_P(\eta_n), \quad \text{with } \eta_n \to 0 \text{ as } n \to \infty. \tag{2.9}$$

When $\eta_n = 0$, we call $\hat\theta_n$ in (2.9) the exact sieve extremum estimate.[12] The sieve extremum estimation method clearly includes the standard extremum estimation method by setting $\Theta_n = \Theta$ for all $n$.

REMARK 2.1. Following White and Wooldridge (1991, Theorem 2.2), one can show that $\hat\theta_n$ in (2.9) is well defined and measurable under the following mild sufficient conditions: (i) $Q_n(\theta)$ is a measurable function of the data $\{Z_t\}_{t=1}^{n}$ for all $\theta \in \Theta_n$; (ii) for any data $\{Z_t\}_{t=1}^{n}$, $Q_n(\theta)$ is upper semicontinuous on $\Theta_n$ under the metric $d(\cdot, \cdot)$; and (iii) the sieve space $\Theta_n$ is compact under the metric $d(\cdot, \cdot)$.
Therefore, in the rest of this chapter we assume that $\hat\theta_n$ in (2.9) exists and is measurable.

[10] This is in contrast to the fact that all the norms are equivalent on a finite-dimensional Euclidean space.

[11] The use of a weaker pseudo-metric enables Ai and Chen (2003) to obtain root-$n$ normality of $\hat\beta$ for $\beta_o$ identified via the model $E[\rho(Z_t; \beta_o, h_o) \mid X_t] = 0$, even when $h_o(\cdot)$ is a function of the endogenous variable $Y$ and the estimation problem may be ill-posed under the standard mean squared error metric $E\{[h(Y) - h_o(Y)]^2\}$.

[12] Since the complexity of the sieve space $\Theta_n$ increases with the sample size, the maximization of $Q_n(\theta)$ over $\Theta_n$ need not be exact, and the approximate maximizer $\hat\theta_n$ in (2.9) will be enough for consistency; see the consistency theorem in Subsection 3.1.

For a semi-nonparametric econometric model, $\theta_o \in \Theta$ can be decomposed into two parts $\theta_o = (\beta_o, h_o) \in \mathcal{B} \times \mathcal{H}$, where $\mathcal{B}$ denotes a finite-dimensional compact parameter space, and $\mathcal{H}$ an infinite-dimensional parameter space. In this case, a natural sieve space will be $\Theta_n = \mathcal{B} \times \mathcal{H}_n$ with $\mathcal{H}_n$ being a sieve for $\mathcal{H}$, and the resulting estimate $\hat\theta_n = (\hat\beta_n, \hat h_n)$ in (2.9) will sometimes be called a simultaneous (or joint) sieve extremum estimate. For a semi-nonparametric model, we can also estimate the parameters of interest $(\beta_o, h_o)$ by the approximate profile sieve extremum estimation that consists of two steps:

Step 1. For an arbitrarily fixed value $\beta \in \mathcal{B}$, compute $Q_n(\beta, \hat h(\beta)) \ge \sup_{h \in \mathcal{H}_n} Q_n(\beta, h) - O_P(\eta_n)$ with $\eta_n = o(1)$;

Step 2. Estimate $\beta_o$ by $\hat\beta_n$ solving $Q_n(\hat\beta_n, \hat h(\hat\beta_n)) \ge \max_{\beta \in \mathcal{B}} Q_n(\beta, \hat h(\beta)) - O_P(\eta_n)$, and then estimate $h_o$ by $\hat h_n = \hat h(\hat\beta_n)$.

Depending on the specific structure of a semi-nonparametric model, the profile sieve extremum estimation procedure may be easier to compute.

2.2.2. Sieve M-estimation

When $Q_n(\theta)$ can be expressed as a sample average of the form

$$\sup_{\theta \in \Theta_n} Q_n(\theta) = \sup_{\theta \in \Theta_n} \frac{1}{n} \sum_{t=1}^{n} l(\theta, Z_t),$$

with $l : \Theta \times \mathbb{R}^{d_z} \to \mathbb{R}$ being the criterion based on a single observation, we also call the $\hat\theta_n$ solving (2.9) an approximate sieve maximum-likelihood-like (M-) estimate.[13] This includes sieve maximum likelihood estimation (MLE), sieve least squares (LS), sieve generalized least squares (GLS) and sieve quantile regression as special cases.

EXAMPLE 2.1 (Continued). Heckman and Singer (1984) estimated the unknown true parameters $\theta_o = (\beta_o, h_o) \in \Theta$ in their semiparametric specification, (2.1), of Example 2.1 by the sieve MLE:

$$\sup_{\theta \in \Theta_n} Q_n(\theta) = \sup_{\beta \in \mathcal{B},\, h \in \mathcal{H}_n} \frac{1}{n} \sum_{i=1}^{n} \log\Big\{ \int g(T_i \mid \beta, u, X_i)\, dh(u) \Big\},$$

where as $n \to \infty$, the sieve space, $\mathcal{H}_n$, becomes dense in the space of probability distribution functions over $\mathbb{R}$.

[13] Our definition follows that in Newey and McFadden (1994). Some statisticians such as Birgé and Massart (1998) call this a sieve minimum contrast estimate.

EXAMPLE 2.2 (Continued). The nonparametric exogenous expenditure specification (2.2) of Example 2.2 can be estimated by the sieve nonlinear LS:

$$\sup_{\theta \in \Theta_n} Q_n(\theta) = \sup_{h \in \mathcal{H}_n} \frac{-1}{n} \sum_{i=1}^{n} \sum_{\ell=1}^{N} \big[ Y_{1\ell i} - \{ h_{1\ell}(Y_{2i} - h_0(X_{1i})) + h_{2\ell}(X_{1i}) \} \big]^2,$$

with $\theta = h = (h_0, h_{11}, \ldots, h_{1N}, h_{21}, \ldots, h_{2N})'$ the unknown parameters and $\Theta_n = \mathcal{H}_n = \mathcal{H}_{0,n} \times \prod_{\ell=1}^{N} \mathcal{H}_{1\ell,n} \times \prod_{\ell=1}^{N} \mathcal{H}_{2\ell,n}$ the sieve space,[14] where we impose the identification condition $h_{2\ell}(x_1^*) = 0$ on the sieve space $\mathcal{H}_{2\ell,n}$ for $\ell = 1, \ldots, N$.

The semi-nonparametric exogenous expenditure specification (2.3) of Example 2.2 can also be estimated by the sieve nonlinear LS:

$$\sup_{\theta \in \Theta_n} Q_n(\theta) = \sup_{\beta \in \mathcal{B},\, h \in \mathcal{H}_n} \frac{-1}{n} \sum_{i=1}^{n} \sum_{\ell=1}^{N} \big[ Y_{1\ell i} - \{ h_{1\ell}(Y_{2i} - g(X_{1i}'\beta_1)) + X_{1i}'\beta_{2\ell} \} \big]^2,$$

with $\theta = (\beta', h')' = (\beta_1', \beta_{21}', \ldots, \beta_{2N}', h_{11}, \ldots, h_{1N})'$ the unknown parameters and $\Theta_n = \mathcal{B} \times \mathcal{H}_n = \mathcal{B}_1 \times \prod_{\ell=1}^{N} \mathcal{B}_{2\ell} \times \prod_{\ell=1}^{N} \mathcal{H}_{1\ell,n}$ the sieve space.
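As a rough illustration of a sieve M-estimate with a sample-average criterion, the sketch below computes a sieve LS estimate of a regression function over a polynomial sieve whose dimension grows with the sample size. It is a minimal sketch rather than any procedure taken from the examples above: the simulated design, the helper names (`poly_sieve`, `q_n`, `sieve_ls`) and the growth rule $k_n = \lceil n^{1/3} \rceil$ are all assumptions made for illustration.

```python
import numpy as np

def poly_sieve(x, k):
    """n-by-k matrix whose columns are the basis functions 1, x, ..., x^(k-1)."""
    return np.vander(x, N=k, increasing=True)

def q_n(coef, x, y):
    """Sample-average criterion Q_n(h) = -(1/n) * sum_t [y_t - h(x_t)]^2."""
    h = poly_sieve(x, len(coef)) @ coef
    return -np.mean((y - h) ** 2)

def sieve_ls(x, y, k):
    """Exact maximizer of q_n over the k-dimensional polynomial sieve."""
    coef, *_ = np.linalg.lstsq(poly_sieve(x, k), y, rcond=None)
    return coef

# Simulated data; the regression function and noise level are assumptions.
rng = np.random.default_rng(0)
n = 400
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(n)

k_n = int(np.ceil(n ** (1.0 / 3.0)))  # assumed growth rule for dim(Theta_n)
coef_hat = sieve_ls(x, y, k_n)
print(k_n, q_n(coef_hat, x, y))
```

Because the polynomial sieves are nested, the maximized criterion cannot decrease when $k_n$ increases; how fast $k_n$ should grow is a separate question that this sketch does not address.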
More generally, we can apply the sieve GLS criterion

$$\sup_{\theta \in \Theta_n} Q_n(\theta) = \sup_{\theta \in \Theta_n} \frac{-1}{n} \sum_{i=1}^{n} \rho(Z_i, \theta)' \{\Sigma(X_i)\}^{-1} \rho(Z_i, \theta)$$

to estimate all the models belonging to the first subclass of the conditional moment restrictions (2.8), where $\rho(Z_i, \theta) - \rho(Z_i, \theta_o)$ does not depend on endogenous variables $Y_i$; here $\Sigma(X_i)$ is a positive definite weighting matrix function such as the identity matrix. See Remark 4.3 in Subsection 4.3 for an optimally weighted version of this procedure.

[14] Throughout this chapter $\prod_{\ell=1}^{N} \mathcal{H}_{\ell,n}$ denotes a Cartesian product $\mathcal{H}_{1,n} \times \cdots \times \mathcal{H}_{N,n}$.

2.2.3. Series estimation, concave extended linear models

In this chapter, we call a special case of sieve M-estimation series estimation, which is sieve M-estimation with concave criterion functions $Q_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} l(\theta, Z_t)$ and finite-dimensional linear sieve spaces $\Theta_n$. We say the criterion is concave if $Q_n(\tau\theta_1 + (1 - \tau)\theta_2) \ge \tau Q_n(\theta_1) + (1 - \tau) Q_n(\theta_2)$ for any $\theta_1, \theta_2 \in \Theta$ and any scalar $\tau \in (0, 1)$. Of course this definition only makes sense when the parameter space $\Theta$ is convex (i.e., for any $\theta_1, \theta_2 \in \Theta$, we have $\tau\theta_1 + (1 - \tau)\theta_2 \in \Theta$ for any scalar $\tau \in (0, 1)$). We say a sieve $\Theta_n$ is finite-dimensional linear if it is a linear span of finitely many known basis functions; see Subsection 2.3.1 for examples.

Although our definition of series estimation may differ from those in the current econometrics literature, it is closely related to the definition of the sieve M-estimation of "concave extended linear models" in the statistics literature; see e.g. Hansen (1994), Stone et al. (1997), and Huang (2001). Consider a $\mathcal{Z}$-valued random variable $Z$, where $\mathcal{Z}$ is an arbitrary set. The probability density $p_o(z)$ of $Z$ depends on a true but unknown parameter $\theta_o$. All the concave extended linear models have three common ingredients:

(1) a (possibly infinite-dimensional) linear parameter space $\Theta$;

(2) the criterion evaluated at a single observation is concave; that is, given any $\theta_1, \theta_2 \in \Theta$, $l(\tau\theta_1 + (1 - \tau)\theta_2, z) \ge \tau l(\theta_1, z) + (1 - \tau) l(\theta_2, z)$ for any scalar $\tau \in (0, 1)$ and any value $z \in \mathcal{Z}$;

(3) the population criterion $Q(\theta) = E[l(\theta, Z)]$ is strictly concave; that is, given any two essentially different functions $\theta_1, \theta_2 \in \Theta$, $E[l(\tau\theta_1 + (1 - \tau)\theta_2, Z)] > \tau E[l(\theta_1, Z)] + (1 - \tau) E[l(\theta_2, Z)]$ for any scalar $\tau \in (0, 1)$.

The sieve M-estimation of a concave extended linear model can be implemented by maximizing $Q_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} l(\theta, Z_t)$ over a finite-dimensional linear sieve space $\Theta_n$ without any constraints. The resulting estimator is called a series estimator in this chapter. Therefore, for the same concave criterion function, a sieve M-estimator is a series estimator if the sieve spaces $\Theta_n$ are finite-dimensional linear (such as the ones listed in Subsections 2.3.1 and 2.3.2), but is not a series estimator if the sieve spaces $\Theta_n$ are not finite-dimensional linear (such as the ones listed in Subsections 2.3.3 and 2.3.4). Although this definition of a series estimator might look restrictive, it will make the descriptions of large sample properties much easier in Section 3.

For series estimation, concavity of the criterion function plays a central role. In particular, the sieve spaces used in estimation are not required to be compact and can be any unrestricted finite-dimensional linear spaces. Such sieves not only make it easy to compute the estimators, but also make it convenient to discuss orthogonal projections and functional analysis of variance (ANOVA) decompositions (such as additivity) in the nonparametric multivariate regression framework; see e.g. Stone (1985, 1986), Andrews and Whang (1990), Huang (1998a).

In order to apply series estimation to a semi-nonparametric model, one needs to first find a concave criterion function that identifies the unknown parameters of interest. We now present several such examples.

EXAMPLE 2.4 (Multivariate LS regression). We consider the estimation of an unknown multivariate conditional mean function $\theta_o(\cdot) = h_o(\cdot) = E(Y \mid X = \cdot)$.
Here $Z = (Y, X)$, $Y$ is a scalar, $X$ has support $\mathcal{X}$ that is a bounded subset of $\mathbb{R}^d$, $d \ge 1$. Suppose $h_o \in \Theta$, where $\Theta$ is a linear subspace of the space of functions $h$ with $E[h(X)^2] < \infty$. Let $l(h, Z) = -[Y - h(X)]^2$ and $Q(\theta) = -E\{[Y - h(X)]^2\}$; then both are concave in $h$ and $Q$ is strictly concave in $h \in \Theta$.

Let $\{p_j(X), j = 1, 2, \ldots\}$ denote a sequence of known basis functions that can approximate any real-valued square integrable functions of $X$ well; see Subsection 2.3.1 or Newey (1997) for specific examples of such basis functions. Then

$$\Theta_n = \mathcal{H}_n = \Big\{ h : \mathcal{X} \to \mathbb{R},\ h(x) = \sum_{j=1}^{k_n} a_j p_j(x):\ a_1, \ldots, a_{k_n} \in \mathbb{R} \Big\}, \tag{2.10}$$

with $\dim(\Theta_n) = k_n \to \infty$ slowly as $n \to \infty$, is a finite-dimensional linear sieve for $\Theta$, and $\hat h = \arg\max_{h \in \mathcal{H}_n} \frac{-1}{n} \sum_{t=1}^{n} [Y_t - h(X_t)]^2$ is a series estimator of the conditional mean $h_o(\cdot) = E(Y \mid X = \cdot)$. Moreover, this series estimator $\hat h$ has a simple closed-form expression:

$$\hat h(x) = p^{k_n}(x)' (P'P)^{-} \sum_{i=1}^{n} p^{k_n}(X_i) Y_i, \quad x \in \mathcal{X}, \tag{2.11}$$

with $p^{k_n}(X) = (p_1(X), \ldots, p_{k_n}(X))'$, $P = (p^{k_n}(X_1), \ldots, p^{k_n}(X_n))'$ and $(P'P)^{-}$ the Moore–Penrose generalized inverse. The estimator $\hat h$ given in (2.11) will be called a series LS estimator or a linear sieve LS estimator.

EXAMPLE 2.5 (Multivariate quantile regression). Let $\alpha \in (0, 1)$. We consider the estimation of an unknown multivariate $\alpha$th quantile function $\theta_o(\cdot) = h_o(\cdot)$ such that $E[1\{Y \le h_o(X)\} \mid X] = \alpha$. Here $Z = (Y, X)$, $X$ has support $\mathcal{X}$ that is a bounded subset of $\mathbb{R}^d$, $d \ge 1$. Suppose $h_o \in \Theta$, where $\Theta$ is a linear subspace of the space of functions $h$ with $E[h(X)^2] < \infty$. Let $l(h, Z) = [1\{Y \le h(X)\} - \alpha][Y - h(X)]$,[15] and $Q(\theta) = E\{[1\{Y \le h(X)\} - \alpha][Y - h(X)]\}$; then both are concave in $h$ and $Q$ is strictly concave in $h \in \Theta$. Let $\Theta_n = \mathcal{H}_n$ be a finite-dimensional linear sieve such as the one given in (2.10). Then $\hat h = \arg\max_{h \in \mathcal{H}_n} \frac{1}{n} \sum_{t=1}^{n} [1\{Y_t \le h(X_t)\} - \alpha][Y_t - h(X_t)]$ is a series estimator of the conditional quantile function $h_o$.

EXAMPLE 2.6 (Log-density estimation).
Let $f_o$ be the true unknown positive probability density of $Z$ on $\mathcal{Z}$ and suppose that we want to estimate the log-density, $\log f_o$. Since $\log f_o$ is subject to the nonlinear constraint $\int_{\mathcal{Z}} \exp\{\log f_o(z)\}\, dz = 1$, it is more convenient to write $\log f_o = h_o - \log \int_{\mathcal{Z}} \exp h_o(z)\, dz$, and treat $h_o$ as an unknown function in some linear space. Since $\log f_o = [h_o + c] - \log \int_{\mathcal{Z}} \exp[h_o(z) + c]\, dz$ for any constant $c$, we need some location normalization to ensure the identification of $h_o$. By imposing a linear constraint such as $\int_{\mathcal{Z}} h(z)\, dz = 0$ (or $h(z^*) = 0$ for a fixed $z^* \in \mathcal{Z}$), we can determine $h$ uniquely and make the mapping $h \mapsto \log f$ one-to-one. Therefore, we assume $h_o \in \Theta$, where $\Theta$ is a linear subspace of the space of real-valued functions $h$ with $E[h(Z)^2] < \infty$ and $\int_{\mathcal{Z}} h(z)\, dz = 0$. The log-likelihood evaluated at a single observation $Z$ is given by $l(h, Z) = h(Z) - \log \int_{\mathcal{Z}} \exp h(z)\, dz$. Stone (1990) has shown that $l(h, Z)$ is concave and $Q(\theta) = E\{h(Z) - \log \int_{\mathcal{Z}} \exp h(z)\, dz\}$ is strictly concave in $h \in \Theta$.

[15] This is, up to sign, the "check" function in Koenker and Bassett (1978).

Let $\{p_j(Z), j = 1, 2, \ldots\}$ denote a sequence of known basis functions that can approximate any real-valued square integrable functions of $Z$ well. Then

$$\Theta_n = \mathcal{H}_n = \Big\{ h : \mathcal{Z} \to \mathbb{R},\ h(z) = \sum_{j=1}^{k_n} a_j p_j(z):\ \int_{\mathcal{Z}} h(z)\, dz = 0,\ a_1, \ldots, a_{k_n} \in \mathbb{R} \Big\},$$

with $\dim(\Theta_n) = k_n \to \infty$ slowly as $n \to \infty$, is a finite-dimensional linear sieve for $\Theta$, and

$$\hat h = \arg\max_{h \in \mathcal{H}_n} \frac{1}{n} \sum_{i=1}^{n} \Big\{ h(Z_i) - \log \int_{\mathcal{Z}} \exp h(z)\, dz \Big\}$$

is a series estimator of the log-density function $h_o$. It is easy to see that log-conditional density and log-spectral density estimation can be carried out in the same way; see e.g. Stone (1994) and Kooperberg, Stone and Truong (1995b).

EXAMPLE 2.7 (Estimation of conditional hazard function). Consider a positive survival time $T$, a positive censoring time $C$, the observed time $Y = \min(T, C)$ and an $\mathcal{X}$-valued random vector $X$ of covariates. Let $Z = (X', Y, 1(T \le C))'$ denote a single observation.
Suppose $T$ and $C$ are conditionally independent given $X$, and that $\Pr(C \le \tau_0) = 1$ for a known positive constant $\tau_0$. Let $f_o(\tau \mid x)$ and $F_o(\tau \mid x)$, $\tau > 0$, be the true unknown conditional density function and conditional distribution function, respectively, of $T$ given $X = x$. Then the ratio $f_o(\tau \mid x)/[1 - F_o(\tau \mid x)]$, $\tau > 0$, is called the conditional hazard function of $T$ given $X = x$. We want to estimate the log-conditional hazard function $h_o(\tau, x) = \log\{f_o(\tau \mid x)/[1 - F_o(\tau \mid x)]\}$. Since the likelihood at a single observation $Z$ equals

$$[f(Y \mid X)]^{1(T \le C)} [1 - F(Y \mid X)]^{1(T > C)} = [\exp h(Y, X)]^{1(T \le C)} \exp\Big\{ -\int_0^Y \exp h(\tau, X)\, d\tau \Big\},$$

the log-likelihood evaluated at a single observation is given by

$$l(h, Z) = 1(T \le C)\, h(Y, X) - \int_0^Y \exp h(\tau, X)\, d\tau.$$

Kooperberg, Stone and Truong (1995a) showed that $l(h, Z)$ is concave in $h$ and $Q(\theta) = E\{l(h, Z)\}$ is strictly concave in $h$.

Suppose $h_o \in \Theta$, where $\Theta$ is a linear subspace of the space of real-valued functions $h$ with $E[h(Y, X)^2] < \infty$. Let $\{p_j(Y, X), j = 1, 2, \ldots\}$ denote a sequence of known basis functions that can approximate any real-valued square integrable functions of $(Y, X)$ well. Then

$$\Theta_n = \mathcal{H}_n = \Big\{ h : (0, \tau_0] \times \mathcal{X} \to \mathbb{R},\ h(\tau, x) = \sum_{j=1}^{k_n} a_j p_j(\tau, x):\ a_1, \ldots, a_{k_n} \in \mathbb{R} \Big\},$$

with $\dim(\Theta_n) = k_n \to \infty$ slowly as $n \to \infty$, is a finite-dimensional linear sieve for $\Theta$, and

$$\hat h = \arg\max_{h \in \mathcal{H}_n} \frac{1}{n} \sum_{i=1}^{n} \Big\{ 1(T_i \le C_i)\, h(Y_i, X_i) - \int_0^{Y_i} \exp h(\tau, X_i)\, d\tau \Big\}$$

is a series estimator of the log-conditional hazard function $h_o$.

Finally, we should point out that not all semi-nonparametric M-estimation problems can be reparameterized into series estimation problems. For example, the nonparametric exogenous expenditure specification (2.2) of Example 2.2 does not belong to the concave extended linear models, since, in this specification, the unknown function $h_0(X_1)$ enters the other unknown functions $h_{1\ell}(Y_2 - h_0(X_1))$, $\ell = 1, \ldots, N$, nonlinearly as an argument.
Nevertheless, as described in the previous subsection, this model can still be estimated by the general sieve M-estimation method.

2.2.4. Sieve MD estimation

When $-Q_n(\theta)$ can be expressed as a quadratic distance from zero, we call the $\hat\theta_n$ solving (2.9) an approximate sieve minimum distance (MD) estimate. One typical quadratic form is

$$\sup_{\theta \in \Theta_n} Q_n(\theta) = \sup_{\theta \in \Theta_n} \frac{-1}{n} \sum_{t=1}^{n} \hat m(X_t, \theta)' \{\hat\Sigma(X_t)\}^{-1} \hat m(X_t, \theta) \tag{2.12}$$

with $\hat m(X_t, \theta_o) \to 0$ in probability. Here $\hat m(X_t, \theta)$ is a nonparametrically estimated moment restriction function of fixed, finite dimension, and $\hat\Sigma(X_t)$ is a possibly nonparametrically estimated weighting matrix of the same dimension as that of $\hat m(X_t, \theta)$. The weighting matrix, $\hat\Sigma$, is introduced for the purpose of efficiency,[16] and $\hat\Sigma(X_t) \to \Sigma(X_t)$ in probability, where $\Sigma(X_t)$ is a positive definite matrix (of the same fixed, finite dimension as that of $\hat\Sigma(X_t)$). We can apply the sieve MD criterion, (2.12), to estimate all the models belonging to the conditional moment restrictions $E[\rho(Z, \theta_o) \mid X] = 0$, regardless of whether or not $\rho(Z_t, \theta) - \rho(Z_t, \theta_o)$ depends on endogenous variables $Y_t$. In particular, $\hat m(X_t, \theta)$ could be any nonparametric estimate of the conditional mean function $m(X_t, \theta) = E[\rho(Z, \theta) \mid X = X_t]$; see e.g. Newey and Powell (1989, 2003) and Ai and Chen (1999, 2003).

[16] See Ai and Chen (2003) or Subsection 4.3 for details on semiparametric efficiency.

Another typical quadratic form is the sieve GMM criterion

$$\sup_{\theta \in \Theta_n} Q_n(\theta) = \sup_{\theta \in \Theta_n} -\hat g_n(\theta)' \hat W \hat g_n(\theta) \tag{2.13}$$

with $\hat g_n(\theta_o) \to 0$ in probability. Here $\hat g_n(\theta)$ is a sample average of some unconditional moment conditions of increasing dimension, and $\hat W$ is a possibly random weighting matrix of the same increasing dimension as that of $\hat g_n(\theta)$. As above, the weighting matrix $\hat W$ is introduced for the purpose of efficiency, and $\hat W - W_n \to 0$ in probability, with $W_n$ being a positive definite matrix (of the same increasing dimension as that of $\hat W$).
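To make the sieve MD criterion (2.12) concrete, the following sketch applies it, with identity weighting, to the nonparametric IV regression $E[Y_1 - h_o(Y_2) \mid X] = 0$. It is a minimal sketch under stated assumptions, not the procedure of any cited paper: $h$ is restricted to a polynomial linear sieve, $\hat m$ is taken to be the fitted value from a series regression of the residual on a polynomial basis in $X$, and the simulated design and helper names (`poly_basis`, `project`, `md_criterion`, `sieve_md`) are hypothetical.

```python
import numpy as np

def poly_basis(v, k):
    """n-by-k matrix with columns 1, v, ..., v^(k-1)."""
    return np.vander(v, N=k, increasing=True)

def project(P, A):
    """Series LS fitted values P (P'P)^- P'A from regressing A on the columns of P."""
    return P @ (np.linalg.pinv(P.T @ P) @ (P.T @ A))

def md_criterion(a, y1, Q, P):
    """Q_n(h) = -(1/n) * sum_t m_hat(X_t, h)^2 for h = Q a, identity weighting."""
    m_hat = project(P, y1 - Q @ a)
    return -np.mean(m_hat ** 2)

def sieve_md(y1, Q, P):
    """Maximizer of md_criterion: least squares of projected y1 on projected Q."""
    a, *_ = np.linalg.lstsq(project(P, Q), project(P, y1), rcond=None)
    return a

# Simulated data; every design choice below is an illustrative assumption.
rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(0.0, 1.0, n)                  # instrument X
v = 0.1 * rng.standard_normal(n)
y2 = 0.25 + 0.5 * x + v                       # endogenous regressor Y2
u = 2.0 * v + 0.05 * rng.standard_normal(n)   # error correlated with Y2
y1 = y2 ** 2 + u                              # h_o(y2) = y2^2

Q = poly_basis(y2, 3)   # linear sieve for h, dimension 3
P = poly_basis(x, 5)    # basis in the instrument, dimension 5
a_hat = sieve_md(y1, Q, P)
print(a_hat, md_criterion(a_hat, y1, Q, P))
```

With a linear sieve for $h$ and identity weighting, the criterion is a concave quadratic in the sieve coefficients, so its maximizer has the two-stage least squares form computed in `sieve_md`; with nonlinear sieves or a nonlinear $\rho$, a numerical optimizer would be needed instead.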
Note that $E[\rho(Z, \theta_o) \mid X] = 0$ if and only if the following increasing number of unconditional moment restrictions hold:

$$E[\rho(Z_t, \theta_o)\, p_{0j}(X_t)] = 0, \quad j = 1, 2, \ldots, k_{m,n}, \tag{2.14}$$

where $\{p_{0j}(X), j = 1, 2, \ldots, k_{m,n}\}$ is a sequence of known basis functions that can approximate any real-valued square integrable functions of $X$ well as $k_{m,n} \to \infty$. Let $p^{k_{m,n}}(X) = (p_{01}(X), \ldots, p_{0k_{m,n}}(X))'$. It follows that the conditional moment restrictions (2.8), $E[\rho(Z, \theta_o) \mid X] = 0$, can be estimated via the sieve GMM criterion (2.13) using $\hat g_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \rho(Z_t, \theta) \otimes p^{k_{m,n}}(X_t)$.

Not only is it possible for both the sieve MD, (2.12), and the sieve GMM, (2.13), to estimate all the models belonging to the conditional moment restrictions (2.8), but they are also very closely related. For example, when applying the sieve MD (2.12) procedure, we could use the series LS estimator (2.15) as an estimator of the conditional mean function $m(X, \theta) = E[\rho(Z, \theta) \mid X]$:

$$\hat m(X, \theta) = \sum_{j=1}^{n} \rho(Z_j, \theta)\, p^{k_{m,n}}(X_j)' (P'P)^{-} p^{k_{m,n}}(X), \tag{2.15}$$

with $P = (p^{k_{m,n}}(X_1), \ldots, p^{k_{m,n}}(X_n))'$, where $k_{m,n} \to \infty$ slowly as $n \to \infty$, and $(P'P)^{-}$ the Moore–Penrose inverse. The resulting sieve MD (2.12) with identity weighting $\hat\Sigma(X_t) = I$ will become the following sieve GMM (2.13):

$$\min_{\theta \in \Theta_n} \Big[ \sum_{i=1}^{n} \rho(Z_i, \theta) \otimes p^{k_{m,n}}(X_i) \Big]' \{ I \otimes (P'P)^{-} \} \Big[ \sum_{i=1}^{n} \rho(Z_i, \theta) \otimes p^{k_{m,n}}(X_i) \Big], \tag{2.16}$$

where $\otimes$ denotes the Kronecker product; see Ai and Chen (2003) for details.

EXAMPLE 2.2 (Continued). The semi-nonparametric endogenous expenditure specification (2.4) of Example 2.2 can be estimated by the sieve MD (2.12), with $\hat m(X_i, \theta) = (\hat m_1(X_i, \theta), \ldots, \hat m_N(X_i, \theta))'$,

$$\hat m_\ell(X_i, \theta) = \sum_{j=1}^{n} \big[ Y_{1\ell j} - \{ h_{1\ell}(Y_{2j} - g(X_{1j}'\beta_1)) + X_{1j}'\beta_{2\ell} \} \big]\, p^{k_{m,n}}(X_j)' (P'P)^{-} p^{k_{m,n}}(X_i),$$

where $\theta = (\beta', h')' = (\beta_1', \beta_{21}', \ldots, \beta_{2N}', h_{11}, \ldots, h_{1N})'$ is the vector of unknown parameters, and $\Theta_n = \mathcal{B} \times \mathcal{H}_n = \mathcal{B}_1 \times \prod_{\ell=1}^{N} \mathcal{B}_{2\ell} \times \prod_{\ell=1}^{N} \mathcal{H}_{1\ell,n}$ is the sieve space; see Blundell, Chen and Kristensen (2007) for details.

EXAMPLE 2.3 (Continued).
The semi-nonparametric external habit specification (2.7) of Example 2.3 can be estimated by the sieve GMM criterion (2.16), with $\rho(Z_t, \theta) = (\rho_1(Z_t, \theta), \ldots, \rho_N(Z_t, \theta))'$,

$$\rho_\ell(Z_t, \theta) = \delta \Big( \frac{C_t}{C_{t+1}} \Big)^{\gamma} \frac{\big( 1 - h\big( \frac{C_t}{C_{t+1}}, \ldots, \frac{C_{t+1-L}}{C_{t+1}} \big) \big)^{-\gamma}}{\big( 1 - h\big( \frac{C_{t-1}}{C_t}, \ldots, \frac{C_{t-L}}{C_t} \big) \big)^{-\gamma}} R_{\ell,t+1} - 1, \quad \ell = 1, \ldots, N,$$

$$Z_t = \Big( \frac{C_t}{C_{t+1}}, \ldots, \frac{C_{t+1-L}}{C_{t+1}}, \frac{C_{t-1}}{C_t}, \ldots, \frac{C_{t-L}}{C_t}, R_{1,t+1}, \ldots, R_{N,t+1}, X_t' \Big)', \quad X_t = w_t,$$

where $\theta = (\beta', h)' = (\delta, \gamma, h)'$ is the vector of unknown parameters, and $\Theta_n = \mathcal{B} \times \mathcal{H}_n = \mathcal{B}_\delta \times \mathcal{B}_\gamma \times \mathcal{H}_n$ is the sieve space; here $0 \le h < 1$ is imposed on the sieve space $\mathcal{H}_n$. This model (2.7) can also be estimated by the sieve MD (2.12), with $\hat m(X_t, \theta) = \hat m(w_t, \theta)$ being a nonparametric estimator such as the series LS estimator (2.15) of $E[\rho(Z_t, \theta) \mid X_t = w_t]$; see Chen and Ludvigson (2003) for details.[17]

[17] There are also semi-nonparametric recursive method of moment procedures that enable us to estimate nonlinear time series models with latent variables. See e.g. Chen and White (1998, 2002), Pastorello, Patilea and Renault (2003) and Linton and Mammen (2005).

2.3. Typical function spaces and sieve spaces

Here we will present some commonly used sieves whose approximation properties are already known in the mathematical literature on approximation theory.

2.3.1. Typical smoothness classes and (finite-dimensional) linear sieves

We first review the most popular smoothness classes of functions used in the nonparametric estimation literature; see e.g. Stone (1982, 1994), Robinson (1988), Newey (1997) and Horowitz (1998). Suppose for the moment that $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ is the Cartesian product of compact intervals $\mathcal{X}_1, \ldots, \mathcal{X}_d$. Let $0 < \gamma \le 1$. A real-valued function $h$ on $\mathcal{X}$ is said to satisfy a Hölder condition with exponent $\gamma$ if there is a positive number $c$ such that $|h(x) - h(y)| \le c|x - y|_e^{\gamma}$ for all $x, y \in \mathcal{X}$; here $|x|_e = (\sum_{l=1}^{d} x_l^2)^{1/2}$ is the Euclidean norm of $x = (x_1, \ldots, x_d) \in \mathcal{X}$. Given a $d$-tuple $\alpha = (\alpha_1, \ldots, \alpha_d)$ of nonnegative integers, set $[\alpha] = \alpha_1 + \cdots + \alpha_d$ and let $D^\alpha$ denote the differential operator defined by

$$D^\alpha = \frac{\partial^{[\alpha]}}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}.$$

Let $m$ be a nonnegative integer and set $p = m + \gamma$. A real-valued function $h$ on $\mathcal{X}$ is said to be $p$-smooth if it is $m$ times continuously differentiable on $\mathcal{X}$ and $D^\alpha h$ satisfies a Hölder condition with exponent $\gamma$ for all $\alpha$ with $[\alpha] = m$. Denote the class of all $p$-smooth real-valued functions on $\mathcal{X}$ by $\Lambda^p(\mathcal{X})$ (called a Hölder class), and the space of all $m$-times continuously differentiable real-valued functions on $\mathcal{X}$ by $C^m(\mathcal{X})$. Define a Hölder ball with smoothness $p = m + \gamma$ as

$$\Lambda_c^p(\mathcal{X}) = \Big\{ h \in C^m(\mathcal{X}):\ \sup_{[\alpha] \le m} \sup_{x \in \mathcal{X}} |D^\alpha h(x)| \le c,\ \sup_{[\alpha] = m} \sup_{x, y \in \mathcal{X},\, x \ne y} \frac{|D^\alpha h(x) - D^\alpha h(y)|}{|x - y|_e^{\gamma}} \le c \Big\}.$$

The Hölder (or $p$-smooth) class of functions is popular in econometrics because a $p$-smooth function can be approximated well by various linear sieves.

A sieve is called a "(finite-dimensional) linear sieve" if it is a linear span of finitely many known basis functions. Linear sieves, including power series, Fourier series, splines and wavelets, form a large class of sieves useful for sieve extremum estimation. We now provide some examples of commonly used linear sieves for univariate functions with support $\mathcal{X} = [0, 1]$.

Polynomials. Let $\mathrm{Pol}(J_n)$ denote the space of polynomials on $[0, 1]$ of degree $J_n$ or less; that is,

$$\mathrm{Pol}(J_n) = \Big\{ \sum_{k=0}^{J_n} a_k x^k,\ x \in [0, 1]:\ a_k \in \mathbb{R} \Big\}.$$

Trigonometric polynomials. Let $\mathrm{TriPol}(J_n)$ denote the space of trigonometric polynomials on $[0, 1]$ of degree $J_n$ or less; that is,

$$\mathrm{TriPol}(J_n) = \Big\{ a_0 + \sum_{k=1}^{J_n} [a_k \cos(2k\pi x) + b_k \sin(2k\pi x)],\ x \in [0, 1]:\ a_k, b_k \in \mathbb{R} \Big\}.$$

Let $\mathrm{CosPol}(J_n)$ denote the space of cosine polynomials on $[0, 1]$ of degree $J_n$ or less; that is,

$$\mathrm{CosPol}(J_n) = \Big\{ a_0 + \sum_{k=1}^{J_n} a_k \cos(k\pi x),\ x \in [0, 1]:\ a_k \in \mathbb{R} \Big\}.$$

Let $\mathrm{SinPol}(J_n)$ denote the space of sine polynomials on $[0, 1]$ of degree $J_n$ or less; that is,

$$\mathrm{SinPol}(J_n) = \Big\{ \sum_{k=1}^{J_n} a_k \sin(k\pi x),\ x \in [0, 1]:\ a_k \in \mathbb{R} \Big\}.$$

We note that the classical trigonometric sieve, $\mathrm{TriPol}(J_n)$, is well suited for approximating periodic functions on $[0, 1]$, while the cosine sieve, $\mathrm{CosPol}(J_n)$, is well suited for approximating aperiodic functions on $[0, 1]$ and the sine sieve, $\mathrm{SinPol}(J_n)$, can approximate functions vanishing at the boundary points (i.e., when $h(0) = h(1) = 0$).

Univariate splines. Let $J_n$ be a positive integer, and let $t_0, t_1, \ldots, t_{J_n}, t_{J_n+1}$ be real numbers with $0 = t_0 < t_1 < \cdots < t_{J_n} < t_{J_n+1} = 1$. Partition $[0, 1]$ into $J_n + 1$ subintervals $I_j = [t_j, t_{j+1})$, $j = 0, \ldots, J_n - 1$, and $I_{J_n} = [t_{J_n}, t_{J_n+1}]$. We assume that the knots $t_1, \ldots, t_{J_n}$ have bounded mesh ratio:

$$\frac{\max_{0 \le j \le J_n} (t_{j+1} - t_j)}{\min_{0 \le j \le J_n} (t_{j+1} - t_j)} \le c \quad \text{for some constant } c > 0. \tag{2.17}$$

Let $r \ge 1$ be an integer. A function on $[0, 1]$ is a spline of order $r$, equivalently, of degree $m \equiv r - 1$, with knots $t_1, \ldots, t_{J_n}$ if the following hold: (i) it is a polynomial of degree $m$ or less on each interval $I_j$, $j = 0, \ldots, J_n$; and (ii) (for $m \ge 1$) it is $(m - 1)$-times continuously differentiable on $[0, 1]$. Such spline functions constitute a linear space of dimension $J_n + r$. For detailed discussions of univariate splines, see de Boor (1978) and Schumaker (1981). For a fixed integer $r \ge 1$, we let $\mathrm{Spl}(r, J_n)$ denote the space of splines of order $r$ (or of degree $m \equiv r - 1$) with $J_n$ knots satisfying (2.17). Since

$$\mathrm{Spl}(r, J_n) = \Big\{ \sum_{k=0}^{r-1} a_k x^k + \sum_{j=1}^{J_n} b_j [\max\{x - t_j, 0\}]^{r-1},\ x \in [0, 1]:\ a_k, b_j \in \mathbb{R} \Big\},$$

we also call $\mathrm{Spl}(r, J_n)$ the polynomial spline sieve of degree $m \equiv r - 1$.

In this chapter, $L_2(\mathcal{X}, \mathrm{leb})$ denotes the space of real-valued functions $h$ such that $\int_{\mathcal{X}} |h(x)|^2\, dx < \infty$.

Wavelets. Let $m \ge 0$ be an integer. A real-valued function $\psi$ is called a "mother wavelet" of degree $m$ if it satisfies the following: (i) $\int_{\mathbb{R}} x^k \psi(x)\, dx = 0$ for $0 \le k \le m$; (ii) $\psi$ and all its derivatives up to order $m$ decrease rapidly as $|x| \to \infty$; (iii) $\{2^{j/2}\psi(2^j x - k): j, k \in \mathbb{Z}\}$ forms a Riesz basis of $L_2(\mathbb{R}, \mathrm{leb})$, in the sense that the linear span of $\{2^{j/2}\psi(2^j x - k): j, k \in \mathbb{Z}\}$ is dense in $L_2(\mathbb{R}, \mathrm{leb})$ and there exist positive constants $c_1 \le c_2 < \infty$ such that

$$c_1 \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} |a_{jk}|^2 \le \Big\| \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} a_{jk} 2^{j/2} \psi(2^j x - k) \Big\|_{L_2(\mathbb{R}, \mathrm{leb})}^2 \le c_2 \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} |a_{jk}|^2$$

for all doubly bi-infinite square-summable sequences $\{a_{jk}: j, k \in \mathbb{Z}\}$.

A scaling function $\varphi$ is called a "father wavelet" of degree $m$ if it satisfies the following: (i) $\int_{\mathbb{R}} \varphi(x)\, dx = 1$; (ii) $\varphi$ and all its derivatives up to order $m$ decrease rapidly as $|x| \to \infty$; (iii) $\{\varphi(x - k): k \in \mathbb{Z}\}$ forms a Riesz basis for a closed subspace of $L_2(\mathbb{R}, \mathrm{leb})$.

Orthogonal wavelets. Given an integer $m \ge 0$, there exist a father wavelet $\varphi$ of degree $m$ and a mother wavelet $\psi$ of degree $m$, both compactly supported, such that for any integer $j_0 \ge 0$, any function $g$ in $L_2(\mathbb{R}, \mathrm{leb})$ has the following wavelet $m$-regular multiresolution expansion:

$$g(x) = \sum_{k=-\infty}^{\infty} a_{j_0 k} \varphi_{j_0 k}(x) + \sum_{j=j_0}^{\infty} \sum_{k=-\infty}^{\infty} b_{jk} \psi_{jk}(x), \quad x \in \mathbb{R},$$

where

$$a_{jk} = \int_{\mathbb{R}} g(x) \varphi_{jk}(x)\, dx, \quad \varphi_{jk}(x) = 2^{j/2} \varphi(2^j x - k), \quad x \in \mathbb{R},$$

$$b_{jk} = \int_{\mathbb{R}} g(x) \psi_{jk}(x)\, dx, \quad \psi_{jk}(x) = 2^{j/2} \psi(2^j x - k), \quad x \in \mathbb{R},$$

and $\{\varphi_{j_0 k}, k \in \mathbb{Z};\ \psi_{jk}, j \ge j_0, k \in \mathbb{Z}\}$ is an orthonormal[18] basis of $L_2(\mathbb{R}, \mathrm{leb})$; see Meyer (1992, Theorem 3.3). For $j \ge 0$ and $0 \le k \le 2^j - 1$, denote the periodized wavelets on $[0, 1]$ by

$$\varphi_{jk}^*(x) = 2^{j/2} \sum_{l \in \mathbb{Z}} \varphi(2^j x + 2^j l - k), \quad \psi_{jk}^*(x) = 2^{j/2} \sum_{l \in \mathbb{Z}} \psi(2^j x + 2^j l - k), \quad x \in [0, 1].$$

For $j_0 \ge 0$, the collection $\{\varphi_{j_0 k}^*, k = 0, \ldots, 2^{j_0} - 1;\ \psi_{jk}^*, j \ge j_0, k = 0, \ldots, 2^j - 1\}$ is an orthonormal basis of $L_2([0, 1], \mathrm{leb})$ [see Daubechies (1992)]. We consider the finite-dimensional linear space spanned by this wavelet basis. For an integer $J_n > j_0$, set

$$\mathrm{Wav}(m, 2^{J_n}) = \Big\{ \sum_{k=0}^{2^{j_0} - 1} \alpha_{j_0 k} \varphi_{j_0 k}^*(x) + \sum_{j=j_0}^{J_n - 1} \sum_{k=0}^{2^j - 1} \beta_{jk} \psi_{jk}^*(x),\ x \in [0, 1]:\ \alpha_{j_0 k}, \beta_{jk} \in \mathbb{R} \Big\}$$

or, equivalently [see Meyer (1992)],

$$\mathrm{Wav}(m, 2^{J_n}) = \Big\{ \sum_{k=0}^{2^{J_n} - 1} \alpha_k \varphi_{J_n k}^*(x),\ x \in [0, 1]:\ \alpha_k \in \mathbb{R} \Big\}.$$

[18] I.e., $\int_{\mathbb{R}} \psi_{jk}(x) \psi_{jk}(x)\, dx = 1$ and $\int_{\mathbb{R}} \psi_{jk}(x) \psi_{j'k'}(x)\, dx = 0$ for $j \ne j'$ or $k \ne k'$; also $\int_{\mathbb{R}} \varphi_{j_0 k}(x) \varphi_{j_0 k}(x)\, dx = 1$ and $\int_{\mathbb{R}} \varphi_{j_0 k}(x) \varphi_{j_0 k'}(x)\, dx = 0$ for $k \ne k'$; in addition $\int_{\mathbb{R}} \varphi_{j_0 k}(x) \psi_{jk'}(x)\, dx = 0$ for $j \ge j_0$.

Tensor product spaces.
Let $\mathcal{U}_\ell$, $1 \le \ell \le d$, be compact sets in Euclidean spaces and $\mathcal{U} = \mathcal{U}_1 \times \cdots \times \mathcal{U}_d$ be their Cartesian product. Let $G_\ell$ be a linear space of functions on $\mathcal{U}_\ell$ for $1 \le \ell \le d$, each of which can be any of the sieve spaces described above, among others. The tensor product, $G$, of $G_1, \ldots, G_d$ is defined as the space of functions on $\mathcal{U}$ spanned by the functions $\prod_{\ell=1}^{d} g_\ell(x_\ell)$, where $g_\ell \in G_\ell$ for $1 \le \ell \le d$. We note that $\dim(G) = \prod_{\ell=1}^{d} \dim(G_\ell)$. Tensor-product construction is a standard way to generate linear sieves of multivariate functions from linear sieves of univariate functions.

Linear sieves are attractive because of their simplicity and ease of implementation. Moreover, linear sieves can approximate functions in a Hölder space, $\Lambda^p(\mathcal{X})$, well. In the following we let $\theta$ denote a real-valued function with a bounded domain $\mathcal{X} \subset \mathbb{R}^d$, $\|\theta\|_\infty \equiv \sup_{x \in \mathcal{X}} |\theta(x)|$ denote its $L_\infty$ norm, and $\|\theta\|_{2,\mathrm{leb}} \equiv \{\int_{\mathcal{X}} [\theta(x)]^2\, dx / \mathrm{vol}(\mathcal{X})\}^{1/2}$ be the scaled $L_2$ norm relative to the Lebesgue measure of $\mathcal{X}$. Define the sieve approximation errors to $\theta_o \in \Lambda^p(\mathcal{X})$ in $L_\infty(\mathcal{X}, \mathrm{leb})$-norm and $L_2(\mathcal{X}, \mathrm{leb})$-norm as

$$\rho_{\infty n} \equiv \inf_{g \in \Theta_n} \|g - \theta_o\|_\infty \quad \text{and} \quad \rho_{2n} \equiv \inf_{g \in \Theta_n} \|g - \theta_o\|_{2,\mathrm{leb}}.$$

It is obvious that $\rho_{2n} \le \rho_{\infty n}$. For a multivariate function $\theta_o \in \Theta = \Lambda^p([0, 1]^d)$, we consider the tensor product linear sieve space $\Theta_n$, which is constructed as a tensor product space of some commonly used univariate linear approximating spaces $\Theta_{n1}, \ldots, \Theta_{nd}$. Let $\dim(\Theta_n) = k_n$ and $[p]$ be the biggest integer satisfying $[p] < p$. Then we have the following tensor product sieve approximation error rates for $\theta_o \in \Lambda^p([0, 1]^d)$:

Polynomials. If each $\Theta_{n\ell} = \mathrm{Pol}(J_n)$, then $\rho_{\infty n} = O(J_n^{-p}) = O(k_n^{-p/d})$ [see e.g. Section 5.3.2 of Timan (1963)].

Trigonometric polynomials. If $\theta_o$ can be extended to a periodic function, and if each $\Theta_{n\ell} = \mathrm{TriPol}(J_n)$, then $\rho_{\infty n} = O(J_n^{-p}) = O(k_n^{-p/d})$ [see e.g. Section 5.3.1 of Timan (1963)].

Splines. If each $\Theta_{n\ell} = \mathrm{Spl}(r, J_n)$ with $r \ge [p] + 1$, then $\rho_{\infty n} = O(J_n^{-p}) = O(k_n^{-p/d})$ [see (13.69) and Theorem 12.8 of Schumaker (1981)].

Orthogonal wavelets.
If each $\Theta_{n\ell} = \mathrm{Wav}(m, 2^{J_n})$ with $m > p$, then $\rho_{\infty n} = O(2^{-pJ_n}) = O(k_n^{-p/d})$ [see Proposition 2.5 of Meyer (1992)].

2.3.2. Weighted smoothness classes and (finite-dimensional) linear sieves

In semi-nonparametric econometric applications, sometimes the parameters of interest are functions with unbounded supports. Here we present two finite-dimensional linear sieves that can approximate functions with unbounded supports well. In the following we let $L_p(\mathcal{X}, \omega)$, $1 \le p < \infty$, denote the space of real-valued functions $h$ such that $\int_{\mathcal{X}} |h(x)|^p \omega(x)\, dx < \infty$ for a smooth weight function $\omega : \mathcal{X} \to (0, \infty)$.

Hermite polynomials. Hermite polynomial series $\{H_k: k = 1, 2, \ldots\}$ is an orthonormal basis of $L_2(\mathbb{R}, \omega)$ with $\omega(x) = \exp\{-x^2\}$. It can be obtained by applying the Gram–Schmidt procedure to the polynomial series $\{x^{k-1}: k = 1, 2, \ldots\}$ under the inner product $\langle f, g \rangle_\omega = \int_{\mathbb{R}} f(x) g(x) \exp\{-x^2\}\, dx$. That is, $H_1(x) = 1/\sqrt{\int_{\mathbb{R}} \exp\{-x^2\}\, dx} = \pi^{-1/4}$, and for all $k \ge 2$,

$$H_k(x) = \frac{x^{k-1} - \sum_{j=1}^{k-1} \langle x^{k-1}, H_j \rangle_\omega H_j(x)}{\sqrt{\int_{\mathbb{R}} \big[ x^{k-1} - \sum_{j=1}^{k-1} \langle x^{k-1}, H_j \rangle_\omega H_j(x) \big]^2 \exp\{-x^2\}\, dx}}.$$

Let $\mathrm{HPol}(J_n)$ denote the space of Hermite polynomials on $\mathbb{R}$ of degree $J_n$ or less:

$$\mathrm{HPol}(J_n) = \Big\{ \sum_{k=1}^{J_n + 1} a_k H_k(x) \exp\Big\{ -\frac{x^2}{2} \Big\},\ x \in \mathbb{R}:\ a_k \in \mathbb{R} \Big\}.$$

Then any function in $L_2(\mathbb{R}, \mathrm{leb})$ can be approximated by the $\mathrm{HPol}(J_n)$ sieve as $J_n \to \infty$.

When the $\mathrm{HPol}(J_n)$ sieve is used to approximate an unknown $\sqrt{\theta_o}$, where $\theta_o$ is a probability density function over $\mathbb{R}$, the corresponding sieve maximum likelihood estimation is also called SNP in econometrics; see e.g. Gallant and Nychka (1987), Gallant and Tauchen (1989) and Coppejans and Gallant (2002).

Laguerre polynomials. Laguerre polynomial series $\{L_k: k = 1, 2, \ldots\}$ is an orthonormal basis of $L_2([0, \infty), \omega)$ with $\omega(x) = \exp\{-x\}$. It can be obtained by applying the Gram–Schmidt procedure to the polynomial series $\{x^{k-1}: k = 1, 2, \ldots\}$ under the inner product $\langle f, g \rangle_\omega = \int_0^\infty f(x) g(x) \exp\{-x\}\, dx$. Let $\mathrm{LPol}(J_n)$ denote the space of Laguerre polynomials on $[0, \infty)$ of degree $J_n$ or less:

$$\mathrm{LPol}(J_n) = \Big\{ \sum_{k=1}^{J_n + 1} a_k L_k(x) \exp\Big\{ -\frac{x}{2} \Big\},\ x \in [0, \infty):\ a_k \in \mathbb{R} \Big\}.$$
Then any function in $L_2([0, \infty), leb)$ can be approximated by the $\mathrm{LPol}(J_n)$ sieve as $J_n \to \infty$.

2.3.3. Other smoothness classes and (finite-dimensional) nonlinear sieves

Nonlinear sieves can also be used for sieve extremum estimation. A popular class of nonlinear sieves in econometrics is single hidden layer feedforward Artificial Neural Networks (ANN). Here we present three typical forms of ANNs; see Hornik et al. (1994) for additional ones.

Sigmoid ANN. Define

$$\mathrm{sANN}(k_n) = \Big\{ \sum_{j=1}^{k_n} \alpha_j S(\gamma_j' x + \gamma_{0,j}):\ \gamma_j \in R^d,\ \alpha_j, \gamma_{0,j} \in R \Big\},$$

where $S : R \to R$ is a sigmoid activation function, i.e., a bounded nondecreasing function such that $\lim_{u \to -\infty} S(u) = 0$ and $\lim_{u \to \infty} S(u) = 1$. Some popular sigmoid activation functions include
• Heaviside $S(u) = 1\{u \ge 0\}$;
• logistic $S(u) = 1/(1 + \exp\{-u\})$;
• hyperbolic tangent $S(u) = (\exp\{u\} - \exp\{-u\})/(\exp\{u\} + \exp\{-u\})$;
• Gaussian sigmoid $S(u) = (2\pi)^{-1/2} \int_{-\infty}^{u} \exp(-y^2/2)\,dy$;
• cosine squasher $S(u) = \frac{1 + \cos(u + 3\pi/2)}{2}\,1\{|u| \le \pi/2\} + 1\{u > \pi/2\}$.

Let $\mathcal{X}$ be a compact set in $R^d$, and $C(\mathcal{X})$ be the space of continuous functions mapping from $\mathcal{X}$ to $R$. Gallant and White (1988a) first established that the sANN sieve with the cosine squasher activation function is dense in $C(\mathcal{X})$ under the sup-norm. Cybenko (1990) and Hornik, Stinchcombe and White (1989) show that the $\mathrm{sANN}(k_n)$, with any sigmoid activation function, is dense in $C(\mathcal{X})$ under the sup-norm. Let $H = \{h \in L_2(\mathcal{X}, leb): \int_{R^d} |w|\,|\hat h(w)|\,dw < \infty\}$. This means $h \in H$ if and only if it is square integrable and its Fourier transform $\hat h$ has finite first moment, where $\hat h(w) \equiv \int \exp(-i w' x) h(x)\,dx$ is the Fourier transform of $h$.
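As a rough illustration of the sANN definition and the activation functions listed above, the following Python sketch evaluates four of the sigmoid functions and a single-hidden-layer network. The function names (`heaviside`, `logistic`, `gaussian_sigmoid`, `cosine_squasher`, `sann`) and the array layout are assumptions made for this illustration only; they do not come from the chapter.

```python
import math
import numpy as np

def heaviside(u):
    # S(u) = 1{u >= 0}
    return (np.asarray(u, dtype=float) >= 0).astype(float)

def logistic(u):
    # S(u) = 1 / (1 + exp(-u))
    return 1.0 / (1.0 + np.exp(-np.asarray(u, dtype=float)))

def gaussian_sigmoid(u):
    # Standard normal CDF, written through the error function.
    u = np.asarray(u, dtype=float)
    return 0.5 * (1.0 + np.vectorize(math.erf)(u / math.sqrt(2.0)))

def cosine_squasher(u):
    # S(u) = [1 + cos(u + 3*pi/2)]/2 on |u| <= pi/2, 0 to the left, 1 to the right.
    u = np.asarray(u, dtype=float)
    inside = np.abs(u) <= np.pi / 2
    return 0.5 * (1.0 + np.cos(u + 1.5 * np.pi)) * inside + (u > np.pi / 2)

def sann(x, alpha, gamma, gamma0, S=logistic):
    """Evaluate sum_j alpha_j * S(gamma_j' x + gamma_{0,j}).

    x: (n, d) inputs; alpha: (k,); gamma: (k, d); gamma0: (k,).
    """
    return S(x @ gamma.T + gamma0) @ alpha
```

With the logistic activation and nonnegative weights `alpha` that sum to one, the network output lies strictly between 0 and 1, which is the kind of range restriction that is easy to impose with this sieve.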
Barron (1993) established that for any $h_o \in H$, the $\mathrm{sANN}(k_n)$ sieve approximation error rate in $L_2(\mathcal{X}, leb)$-norm $\rho_{2n}$ is no slower than $O([k_n]^{-1/2})$, which was later improved to $O([k_n]^{-1/2 - 1/(2d)})$ in Makovoz (1996) for the $\mathrm{sANN}(k_n)$ with the Heaviside sigmoid function, and to $O([k_n]^{-1/2 - 1/(d+1)})$ in Chen and White (1999) for the $\mathrm{sANN}(k_n)$ with general sigmoid function.

General ANN. Define

$$\mathrm{gANN}(k_n) = \Big\{ \sum_{j=1}^{2^r k_n} \alpha_j \big[\max(|\gamma_j|_e, 1)\big]^{-m} \psi(\gamma_j' x + \gamma_{0,j}):\ \gamma_j \in R^d,\ \alpha_j, \gamma_{0,j} \in R \Big\},$$

where $\psi : R \to R$ is any activation function but not a polynomial with fixed degree. In particular, we often let $\psi$ be a smooth function in a Hölder space $\Lambda^m(R)$ and satisfy $0 < \int_R |D^r \psi(x)|\,dx < \infty$ for some $r \ge m$. This includes all the above sigmoid activation functions as special cases (with $m = 0$ and $r = 1$); see Hornik et al. (1994) for additional examples. Let

$$H = \Big\{ h \in L_2(\mathcal{X}, \mu):\ h(x) = \int \exp(i a' x)\,d\sigma_h(a),\ \int_{R^d} \big[\max(|a|_e, 1)\big]^{m+1} d|\sigma_h|_{tv}(a) < \infty \Big\},$$

where $\sigma_h$ is a complex-valued measure, and $|\sigma_h|_{tv}$ denotes the total variation of $\sigma_h$. Let $W_2^m(\mathcal{X}, \mu)$ be the weighted Sobolev space of functions, where functions as well as all their partial derivatives (up to $m$th order) are $L_2(\mathcal{X}, \mu)$-integrable for a finite measure $\mu$. It is known that a function in $H$ also belongs to $W_2^m(\mathcal{X}, \mu)$. Denote $\|h\|_{m,\mu} = \{\int h(x)^2\,d\mu(x) + \int |D^m h(x)|_e^2\,d\mu(x)\}^{1/2}$ as the weighted Sobolev norm. Hornik et al. (1994) established that for any $h_o \in H$, the $\mathrm{gANN}(k_n)$ sieve approximation error rate in the weighted Sobolev norm ($\|\cdot\|_{m,\mu}$) is no slower than $O([k_n]^{-1/2})$, which was later improved to $O([k_n]^{-1/2 - 1/(d+1)})$ in Chen and White (1999).

Gaussian radial basis ANN. Let $\mathcal{X} = R^d$. Define

$$\mathrm{rbANN}(k_n) = \Big\{ \alpha_0 + \sum_{j=1}^{k_n} \alpha_j G\Big(\frac{\{(x - \gamma_j)'(x - \gamma_j)\}^{1/2}}{\sigma_j}\Big):\ \gamma_j \in R^d,\ \alpha_j, \sigma_j \in R,\ \sigma_j > 0 \Big\},$$

where $G$ is the standard Gaussian density function. Let $W_1^m(\mathcal{X})$ be the Sobolev space of functions, where functions as well as all their partial derivatives (up to $m$th order) are $L_1(\mathcal{X}, leb)$-integrable. Meyer (1992) shows that $\mathrm{rbANN}(k_n)$ is dense in the smoothness class $W_1^m(\mathcal{X})$.
Girosi (1994) established that for any $h_o \in H$, the $\mathrm{rbANN}(k_n)$ sieve approximation error rate in $L_2(\mathcal{X}, leb)$-norm $\rho_{2n}$ is no slower than $O([k_n]^{-1/2})$, which was later improved to $O([k_n]^{-1/2 - 1/(d+1)})$ in Chen, Racine and Swanson (2001).

Additional examples of nonlinear sieves include spline sieves with data-driven choices of knot locations (or free-knot splines), and wavelet sieves with thresholding. Nonlinear sieves are more flexible and may enjoy better approximation properties than linear sieves; see e.g. Chen and Shen (1998) for the comparison of linear vs. nonlinear sieves.

2.3.4. Infinite-dimensional (nonlinear) sieves and method of penalization

Most commonly used sieve spaces are finite-dimensional truncated series such as those listed above. However, the general theory on sieve extremum estimation can also allow for infinite-dimensional sieve spaces. For example, consider the smoothness class $\Theta = \Lambda^p(\mathcal{X})$ with $\mathcal{X} = [0, 1]$, $p > 1/2$. It is well known that any function $\theta \in \Theta$ can be expressed as an infinite Fourier series $\theta(x) = \sum_{k=1}^{\infty} [a_k \cos(kx) + b_k \sin(kx)]$, and its derivative with fractional power $\gamma \in (0, p]$ can also be defined in terms of Fourier series:

$$\theta^{(\gamma)}(x) = \sum_{k=1}^{\infty} k^{\gamma} \Big\{ \Big[a_k \cos\frac{\pi\gamma}{2} + b_k \sin\frac{\pi\gamma}{2}\Big] \cos(kx) + \Big[b_k \cos\frac{\pi\gamma}{2} - a_k \sin\frac{\pi\gamma}{2}\Big] \sin(kx) \Big\}.$$

Similarly, any function $\theta \in \Theta = \Lambda^p(\mathcal{X})$ and its fractional derivatives can be expressed as infinite series of splines and wavelets; see e.g. Meyer (1992). Let $\mathrm{pen}(\theta) = (\int_{\mathcal{X}} |\theta^{(p)}(x)|^q\,dx)^{1/q}$ for $p > 1/2$ and some integer $q \ge 1$. Then we can take the sieves to be $\Theta_n = \{\theta \in \Theta: \mathrm{pen}(\theta) \le b_n\}$ with $b_n \to \infty$ as $n \to \infty$ arbitrarily slowly; see e.g. Shen (1997). The choice of $q$ is typically related to the criterion function $Q_n(\theta)$, such as $q = 2$ for conditional mean regression [Wahba (1990)], $q = 1$ [Koenker, Ng and Portnoy (1994)] and total variation norm [Koenker and Mizera (2003)] for quantile regressions.
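The Fourier-series definition of the fractional derivative and the penalty $\mathrm{pen}(\theta)$ can be sketched numerically as follows. This is only an illustration under stated assumptions: the series is truncated at a finite number of terms, the integral is replaced by a Riemann sum on a uniform grid over $[0, 1]$, and the names `frac_derivative`, `penalty` and `in_sieve` are hypothetical.

```python
import numpy as np

def frac_derivative(x, a, b, gamma):
    """theta^(gamma)(x) for theta(x) = sum_{k=1}^K [a_k cos(kx) + b_k sin(kx)]."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    k = np.arange(1, len(a) + 1)
    c, s = np.cos(np.pi * gamma / 2), np.sin(np.pi * gamma / 2)
    cos_coef = k**gamma * (a * c + b * s)   # coefficient on cos(kx)
    sin_coef = k**gamma * (b * c - a * s)   # coefficient on sin(kx)
    kx = np.outer(x, k)
    return np.cos(kx) @ cos_coef + np.sin(kx) @ sin_coef

def penalty(a, b, p, q=2, n_grid=2001):
    """Riemann-sum approximation of (int_0^1 |theta^(p)(x)|^q dx)^(1/q)."""
    grid = np.linspace(0.0, 1.0, n_grid)
    vals = np.abs(frac_derivative(grid, a, b, p)) ** q
    return float(vals.mean() ** (1.0 / q))  # interval length is 1

def in_sieve(a, b, p, b_n, q=2):
    # Membership check for Theta_n = {theta: pen(theta) <= b_n}.
    return penalty(a, b, p, q) <= b_n
```

Setting `gamma = 1` or `gamma = 2` reproduces the ordinary first and second derivatives of the truncated series, which is a convenient sanity check on the formula.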
More generally, if the parameter space $\Theta$ is a typical function space such as a Hölder, Sobolev or Besov space, then any function $\theta \in \Theta$ can be expressed as infinite series of some known Riesz basis $\{B_k(\cdot)\}_{k=1}^{\infty}$. An infinite-dimensional sieve space could take the form:

$$\Theta_n = \Big\{ \theta \in \Theta:\ \theta(\cdot) = \sum_{k=1}^{\infty} a_k B_k(\cdot),\ \mathrm{pen}(\theta) \le b_n \Big\} \quad\text{with } b_n \to \infty \text{ slowly}, \tag{2.18}$$

where $\mathrm{pen}(\theta)$ is a smoothness (or roughness) penalty term.

REMARK 2.2. When $Q_n(\theta)$ is concave and $\mathrm{pen}(\theta)$ is convex, the sieve extremum estimation, $\sup_{\theta \in \Theta_n} Q_n(\theta)$ with $\Theta_n$ given in (2.18), becomes equivalent to the penalized extremum estimation

$$\max_{\theta \in \Theta} \big\{ Q_n(\theta) - \lambda_n\,\mathrm{pen}(\theta) \big\}, \tag{2.19}$$

where the Lagrange multiplier $\lambda_n$ is chosen such that the solution satisfies $\mathrm{pen}(\hat\theta) = b_n$. See e.g. Eggermont and LaRiccia (2001, Subsection 1.6).

2.3.5. Shape-preserving sieves

There are many sieves that can preserve the shape, such as nonnegativity, monotonicity and convexity, of the unknown function to be approximated. See e.g. DeVore (1977a, 1977b) on shape-preserving spline and polynomial sieves, Anastassiou and Yu (1992a, 1992b) and Dechevsky and Penev (1997) on shape-preserving wavelet sieves. Here we mention one of such shape-preserving sieves.

Cardinal B-spline wavelets. The cardinal B-spline of order $r \ge 1$ is given by

$$B_r(x) = \frac{1}{(r-1)!} \sum_{j=0}^{r} (-1)^j \binom{r}{j} \big[\max(0, x - j)\big]^{r-1}, \tag{2.20}$$

which has support $[0, r]$, is symmetric about $r/2$ and is a piecewise polynomial of highest degree $r - 1$. It satisfies

$$B_r(x) \ge 0, \qquad \sum_{k=-\infty}^{+\infty} B_r(x - k) = 1 \quad\text{for all } x \in R,$$

which is crucial to preserve the shape of the unknown function to be approximated. Its derivative satisfies $\frac{\partial}{\partial x} B_r(x) = B_{r-1}(x) - B_{r-1}(x - 1)$. See Chui (1992, Chapter 4) for a recursive construction of cardinal B-splines and their properties.

We can construct a cardinal B-spline wavelet basis for the space $L_2(R, leb)$ as follows. Let $\phi_r(x) = B_r(x)$ be the father wavelet (or the scaling function). Then there is a "unique" mother wavelet function $\psi_r$ with minimum support $[0, 2r - 1]$, which is given by

$$\psi_r(x) = \sum_{\ell=0}^{3r-2} q_\ell B_r(2x - \ell), \qquad q_\ell = (-1)^{\ell}\, 2^{1-r} \sum_{j=0}^{r} \binom{r}{j} B_{2r}(\ell + 1 - j).$$

Let $\phi_{r,jk}(x) = 2^{j/2} B_r(2^j x - k)$, $\psi_{r,jk}(x) = 2^{j/2} \psi_r(2^j x - k)$, $x \in R$. Then for an integer $j_0 \ge 0$, $\{\phi_{r,j_0 k}, k \in Z;\ \psi_{r,jk}, j \ge j_0, k \in Z\}$ is a Riesz basis of $L_2(R, leb)$. Moreover, any function $g$ in $L_2(R, leb)$ has the following spline-wavelet $m = r - 1$ regular multiresolution expansion:

$$g(x) = \sum_{k=-\infty}^{\infty} a_{j_0 k}\, 2^{j_0/2} B_r(2^{j_0} x - k) + \sum_{j=j_0}^{\infty} \sum_{k=-\infty}^{\infty} b_{jk} \psi_{r,jk}(x), \quad x \in R,$$

see Chui (1992, Chapter 6). For an integer $J_n > j_0 = 0$, set

$$\mathrm{SplWav}(r - 1, 2^{J_n}) = \Big\{ \sum_{k=-\infty}^{\infty} a_{0k} B_r(x - k) + \sum_{j=0}^{J_n - 1} \sum_{k=-\infty}^{\infty} \beta_{jk} \psi_{r,jk}(x),\ x \in R:\ a_{0k}, \beta_{jk} \in R \Big\}$$

or, equivalently,$^{19}$

$$\mathrm{SplWav}(r - 1, 2^{J_n}) = \Big\{ \sum_{k=-\infty}^{\infty} \alpha_k\, 2^{J_n/2} B_r(2^{J_n} x - k),\ x \in R:\ \alpha_k \in R \Big\}.$$

Any nondecreasing continuous function on $R$ can be approximated well by the $\mathrm{SplWav}(r - 1, 2^{J_n})$ sieve with nondecreasing sequence $\{\alpha_k\}$ (i.e., $\alpha_k \le \alpha_{k+1}$). In particular, let

$$\mathrm{MSplWav}(r - 1, 2^{J_n}) = \Big\{ g(x) = \sum_{k=-\infty}^{\infty} \alpha_k\, 2^{J_n/2} B_r\Big(2^{J_n} x - k + \frac{r}{2}\Big):\ \alpha_k \le \alpha_{k+1} \Big\}$$

denote the monotone spline wavelet sieve. Then for any bounded nondecreasing continuous function $\theta_o$ on $R$, the $\mathrm{MSplWav}(r - 1, 2^{J_n})$, $r \ge 1$, sieve approximation error rate in sup-norm is $O(2^{-J_n})$; for any bounded nondecreasing continuously differentiable function $\theta_o$ on $R$, the $\mathrm{MSplWav}(r - 1, 2^{J_n})$, $r \ge 2$, sieve approximation error rate in sup-norm is $O(2^{-2J_n})$; see e.g. Anastassiou and Yu (1992a).

$^{19}$ See Chen, Hansen and Scheinkman (1998) for the approximation property of this sieve for twice differentiable functions on $R$.

2.3.6. Choice of a sieve space

The choice of a sieve space $\Theta_n = B \times H_n$ depends on how well it approximates $\Theta = B \times H$ and how easily one can compute $\max_{\theta \in \Theta_n} Q_n(\theta)$. In general, it will be easier to compute $\max_{\theta \in \Theta_n} Q_n(\theta)$ when the sieve space, $\Theta_n = B \times H_n$, is an unconstrained finite-dimensional linear space.
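To make the shape-preserving construction of Subsection 2.3.5 concrete, here is a minimal Python sketch of the cardinal B-spline formula (2.20) and of a truncated member of the monotone sieve. The function names, the restriction to $r \ge 2$, the finite index range `k_min, ..., k_min + len(alpha) - 1`, and the choice to absorb the scaling factor $2^{J_n/2}$ into the coefficients are all assumptions made for illustration.

```python
import numpy as np
from math import comb, factorial

def cardinal_bspline(x, r):
    """B_r(x) from the truncated-power formula (2.20), for order r >= 2."""
    if r < 2:
        raise ValueError("this sketch assumes r >= 2")
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for j in range(r + 1):
        total += (-1) ** j * comb(r, j) * np.maximum(0.0, x - j) ** (r - 1)
    # Zero out rounding noise outside the support [0, r].
    return np.where((x >= 0) & (x <= r), total / factorial(r - 1), 0.0)

def monotone_spline(x, alpha, r, J, k_min=0):
    """g(x) = sum_k alpha_k B_r(2^J x - k + r/2), k = k_min, ..., k_min + len(alpha) - 1.

    The factor 2^(J/2) of the sieve definition is absorbed into alpha.
    """
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i, a_k in enumerate(alpha):
        g += a_k * cardinal_bspline(2.0**J * x - (k_min + i) + r / 2.0, r)
    return g
```

On any interval where all overlapping shifts are included in the index range, a nondecreasing coefficient sequence yields a nondecreasing $g$, which follows from the derivative identity together with summation by parts.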
Moreover, if the criterion function, $Q_n(\theta)$, is concave, one can choose such a linear sieve, just as in the series estimation of a concave extended linear model described in Subsection 2.2.3.

However, the ease of computation should not be the only concern when one decides which sieve to use in practice. This is because the large sample performance of a sieve estimate also depends on the approximation properties of the chosen sieve. Unfortunately, a finite-dimensional linear sieve does not always possess better approximation properties than some nonlinear sieves. For example, let us consider the estimation of a multivariate conditional mean function $h_o(\cdot) = E[Y_t | X_t = \cdot] \in \Theta$. Let $\Theta_n$ be a sieve space. Then $\hat\theta = \hat h = \arg\max_{h \in \Theta_n} -\frac{1}{n} \sum_{t=1}^{n} [Y_t - h(X_t)]^2$ is a sieve M-estimator of $h_o$. If $\Theta = \Lambda^p([0,1]^d)$ is the space of $p$-smooth functions with $p > d/2$, then one can take $\Theta_n$ to be any of the finite-dimensional linear sieve spaces in Subsection 2.3.1, and the resulting estimator $\hat h$ is a series estimator. However, if $\Theta = W_1^1([0,1]^d)$ as defined in Subsection 2.3.3, then it is better to choose the sieve space, $\Theta_n$, to be the nonlinear Gaussian radial basis ANN in Subsection 2.3.3; the resulting estimator is still a sieve M-estimator but not a series estimator. See Section 3 for additional examples.

How well a sieve, $\Theta_n$, approximates $\Theta$ often depends on the support, the smoothness, the shape restrictions of functions in $\Theta$ and the structure, such as additivity, nonnegativity, exclusion restrictions, imposed by the econometric model. For example, a Hermite polynomial sieve can approximate a multivariate unknown smooth density with unbounded supports and relatively thin tails well, but a power series sieve and a Fourier series sieve cannot. This is why Gallant and Nychka (1987) considered Hermite polynomial sieve MLE since they wanted to approximate multivariate densities that are smooth, have unbounded supports and include the multivariate normal density as a special case.
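For readers who want to experiment with the univariate Hermite polynomial sieve of Subsection 2.3.2, the sketch below builds the orthonormal Hermite basis from NumPy's physicists' Hermite polynomials, whose weight function is $\exp\{-x^2\}$, and then forms the $\mathrm{HPol}(J)$ basis functions. The function names are assumptions for this illustration, and the normalizing constant $\sqrt{2^n n! \sqrt{\pi}}$ is the standard one for physicists' Hermite polynomials rather than something stated in the chapter.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermgauss, hermval

def hermite_orthonormal(x, J):
    """Columns H_1, ..., H_{J+1}: orthonormal under the weight exp(-x^2)."""
    x = np.asarray(x, dtype=float)
    cols = []
    for n in range(J + 1):
        coef = np.zeros(n + 1)
        coef[n] = 1.0                       # selects the degree-n polynomial
        norm = sqrt(2.0**n * factorial(n) * sqrt(pi))
        cols.append(hermval(x, coef) / norm)
    return np.column_stack(cols)

def hpol_basis(x, J):
    """Basis functions H_k(x) * exp(-x^2 / 2) that span HPol(J)."""
    x = np.asarray(x, dtype=float)
    return hermite_orthonormal(x, J) * np.exp(-x**2 / 2.0)[:, None]
```

A sieve element is then `hpol_basis(x, J) @ a` for a coefficient vector `a` of length `J + 1`; orthonormality can be checked with Gauss–Hermite quadrature nodes and weights from `hermgauss`.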
As another example, a first-order monotone spline sieve can approximate any bounded monotone but nondifferentiable function well, and a third-order cardinal B-spline wavelet sieve can approximate any bounded monotone differentiable function well. In Example 2.1, Heckman and Singer (1984, pp. 300 and 301) did not want to impose any assumptions on the distribution function $h(\cdot)$ of the latent random factor, hence they applied a first-order monotone spline sieve to approximate it. In their estimation of the first eigenfunction of the conditional expectation operator associated with a fully nonparametric scalar diffusion model, Chen, Hansen and Scheinkman (1998) applied a shape-preserving third order cardinal B-spline wavelet sieve to approximate the unknown first eigenfunction, since the first eigenfunction is known to be monotone and twice continuously differentiable. As a final example, in their sieve MD estimation of the semi-nonparametric external habit model (2.7) of Example 2.3, Chen and Ludvigson (2003) used the sANN sieve with logistic activation function to approximate the unknown habit function $H(C_t, C_{t-1}, \ldots, C_{t-L}) = C_t\, h\big(\frac{C_{t-1}}{C_t}, \ldots, \frac{C_{t-L}}{C_t}\big)$. This is partly because when $L \ge 3$, the unknown smooth function $h : R^L \to [0, 1)$ can be approximated by a sANN sieve well, and partly because it is very easy to impose the habit constraint $0 \le H(C_t, C_{t-1}, \ldots, C_{t-L}) < C_t$ when $h\big(\frac{C_{t-1}}{C_t}, \ldots, \frac{C_{t-L}}{C_t}\big)$ is approximated by the sANN sieve with logistic activation function.

For a sieve estimate to be consistent with a fast rate of convergence, it is important to choose sieves with good approximation error rates as well as controlled complexity.$^{20}$ Nevertheless, for econometric applications where the only prior information on the unknown functions is their smoothness and supports, the choice of a sieve space is not important, as long as the chosen sieve space has the desired approximation error rate.

2.4.
A small Monte Carlo study

To illustrate how to implement the sieve extremum estimation, we present a small Monte Carlo simulation carried out using Matlab and Fortran. The true model is:

$$Y_1 = X_1 \beta_o + h_{o1}(Y_2) + h_{o2}(X_2) + U$$

with $\beta_o = 1$, $h_{o1}(Y_2) = 1/[1 + \exp\{-Y_2\}]$ and $h_{o2}(X_2) = \log(1 + X_2)$. We assume that $Y_2$ is endogenous and $Y_2 = X_1 + X_2 + X_3 + R \times U + e$ with either $R = 0.9$ (strong correlation) or $0.1$ (weak correlation). Suppose that the regressors $X_1, X_2, X_3$ are independent and uniformly distributed over $[0, 1]$, and that $e$ is independent of $(X, U)$ and normally distributed with mean zero and variance $0.1$. (We have also tried $E[e^2] = 0.05, 0.25$; the simulation results share very similar patterns to the ones when $E[e^2] = 0.1$, hence are not reported here.) Conditional on $X = (X_1, X_2, X_3)'$, $U$ is normally distributed with mean zero and variance $(X_1^2 + X_2^2 + X_3^2)/3$. Let $Z = (Y_1, Y_2, X')'$. A random sample of $n = 1000$ data $\{Z_i\}_{i=1}^{n}$ is generated from this design.

An econometrician observes the simulated data $\{Z_i\}_{i=1}^{n}$, and wants to estimate $\theta_o = (\beta_o, h_{o1}, h_{o2})'$, obeying the conditional moment restriction:

$$E\big[Y_{1i} - \{X_{1i}\beta_o + h_{o1}(Y_{2i}) + h_{o2}(X_{2i})\} \,\big|\, X_i\big] = 0. \tag{2.21}$$

This model is a generalization of the partially linear IV regression $E[Y_1 - \{X_1\beta_o + h_{o1}(Y_2)\} | X] = 0$ example of Ai and Chen (2003) to a partially additive IV regression. Since $h_{o1}(Y_2)$ is an unknown function of the endogenous variable $Y_2$, both examples belong to the so-called ill-posed inverse problems.

Let $\rho(Z, \theta) = Y_1 - \{X_1\beta + h_1(Y_2) + h_2(X_2)\}$ with $\theta = (\beta, h_1, h_2)'$. We say that the parameters $\theta_o = (\beta_o, h_{o1}, h_{o2})'$ are identified if $E[\rho(Z, \theta) | X] = 0$ only when $\theta = \theta_o$.

$^{20}$ This will become clear from the large sample theory discussed later in Section 3.

As a sufficient condition for the identification of $\theta_o$, we assume that $\mathrm{Var}(X_1) > 0$, $h_1(y_2)$ is a bounded function with $\sup_{y_2} |h_1(y_2)| \le 1$ and that $h_2(x_2)$ satisfies $h_2(0.5) = \log(3/2)$.
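The simulation design above can be reproduced in a few lines. The following Python sketch is an assumption-laden stand-in for the Matlab and Fortran code mentioned in the text: the function name `simulate_design`, its arguments and the returned dictionary are hypothetical, and only the distributional assumptions are taken from the design.

```python
import numpy as np

def simulate_design(n=1000, R=0.9, var_e=0.1, seed=0):
    """Draw one sample {Z_i} from the Monte Carlo design of this subsection."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 3))            # X1, X2, X3 ~ iid U[0, 1]
    # U | X ~ N(0, (X1^2 + X2^2 + X3^2) / 3): heteroskedastic error.
    U = rng.normal(size=n) * np.sqrt((X**2).sum(axis=1) / 3.0)
    e = rng.normal(scale=np.sqrt(var_e), size=n)      # independent of (X, U)
    Y2 = X.sum(axis=1) + R * U + e                    # endogenous regressor
    h1 = 1.0 / (1.0 + np.exp(-Y2))                    # h_o1(Y2)
    h2 = np.log1p(X[:, 1])                            # h_o2(X2) = log(1 + X2)
    Y1 = 1.0 * X[:, 0] + h1 + h2 + U                  # beta_o = 1
    return {"Y1": Y1, "Y2": Y2, "X": X, "U": U}
```

Calling `simulate_design(R=0.1)` gives the weak-correlation case; the true functions satisfy the identification restrictions stated above, namely $|h_{o1}| \le 1$ and $h_{o2}(0.5) = \log(3/2)$.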
In particular, we assume that $\theta_o = (\beta_o, h_{o1}, h_{o2}) \in \Theta = B \times H_1 \times H_2$ with $B$ a compact interval in $R$, $H_1 = \{h_1 \in C^2(R): \sup_{y_2} |h_1(y_2)| \le 1, \int [D^2 h_1(y_2)]^2\,dy_2 < \infty\}$ and $H_2 = \{h_2 \in C^2([0, 1]): h_2(0.5) = \log(3/2), \int [D^2 h_2(x_2)]^2\,dx_2 < \infty\}$.

Since this model (2.21) fits into the second subclass of the conditional moment restrictions (2.8) with $E[\rho(Z, \theta_o) | X] = 0$, we can apply the sieve MD criterion (2.12) to estimate $\theta_o = (\beta_o, h_{o1}, h_{o2})$. We take $\Theta_n = B \times H_{1n} \times H_{2n}$ as the sieve space, where

$$H_{1n} = \Big\{ h_1(y_2) = \Pi_1' B^{k_{1,n}}(y_2):\ \int [D^2 h_1(y_2)]^2\,dy_2 \le c_1 \log n \Big\},$$

$B^{k_{1,n}}(y_2)$ is either a polynomial spline basis with equally spaced (according to empirical quantile of $Y_2$) knots, or a 3rd order cardinal B-spline basis, or a Hermite polynomial basis,$^{21}$ and $\dim(\Pi_1) = k_{1,n}$ is the number of unknown sieve coefficients of $h_1$. Similarly,

$$H_{2n} = \Big\{ h_2(x_2) = \Pi_2' B^{k_{2,n}}(x_2):\ \int [D^2 h_2(x_2)]^2\,dx_2 \le c_2 \log n,\ h_2(0.5) = \log(3/2) \Big\},$$

$B^{k_{2,n}}(x_2)$ is either a polynomial spline basis with equally spaced (according to empirical quantile of $X_2$) knots, or a 3rd order cardinal B-spline basis, and $\dim(\Pi_2) = k_{2,n}$ is the number of unknown sieve coefficients of $h_2$. In the Monte Carlo study, we have tried $k_{1,n} = 4, 5, 6, 8$ and $k_{2,n} = 4, 5, 6$.

As an illustration, we only consider the sieve MD estimation (2.12) using the identity weighting $\Sigma(X) = I$,$^{22}$ and the series LS estimator as the $\hat m(X, \theta)$ for the conditional mean function $E[\rho(Z, \theta) | X]$, thus the criterion becomes

$$\min_{\beta \in B,\ h_1 \in H_{1n},\ h_2 \in H_{2n}} \frac{1}{n} \sum_{i=1}^{n} \{\hat m(X_i, \theta)\}^2,$$

with

$$\hat m(X, \theta) = \sum_{j=1}^{n} \big[Y_{1j} - \{X_{1j}\beta + h_1(Y_{2j}) + h_2(X_{2j})\}\big]\, p^{k_{m,n}}(X_j)' (P'P)^{-} p^{k_{m,n}}(X),$$

where in the simulation $p^{k_{m,n}}(X)$ is taken to be the 4th degree polynomial spline sieve, with basis
$\{1,\ X_1,\ X_1^2,\ X_1^3,\ X_1^4,\ [\max(X_1 - 0.5, 0)]^4,$
$X_2,\ X_2^2,\ X_2^3,\ X_2^4,\ [\max(X_2 - 0.5, 0)]^4,$
$X_3,\ X_3^2,\ X_3^3,\ X_3^4,\ [\max(X_3 - 0.1, 0)]^4,\ [\max(X_3 - 0.25, 0)]^4,\ [\max(X_3 - 0.5, 0)]^4,\ [\max(X_3 - 0.75, 0)]^4,\ [\max(X_3 - 0.90, 0)]^4,$
$X_1 X_3,\ X_2 X_3,\ X_1[\max(X_3 - 0.25, 0)]^4,\ X_2[\max(X_3 - 0.25, 0)]^4,\ X_1[\max(X_3 - 0.75, 0)]^4,\ X_2[\max(X_3 - 0.75, 0)]^4\}$.

We note that the above criterion is equivalent to a constrained 2 Stage Least Squares (2SLS) with $k_{m,n} = 26$ instruments and $\dim(\Theta_n) = 1 + k_{1,n} + k_{2,n}$ $(< k_{m,n})$ unknown parameters:

$$\min_{\beta \in B,\ h_1 \in H_{1n},\ h_2 \in H_{2n}} [Y_1 - X_1\beta - B\Pi]' P(P'P)^{-}P' [Y_1 - X_1\beta - B\Pi],$$

where $Y_1 = (Y_{11}, \ldots, Y_{1n})'$, $X_1 = (X_{11}, \ldots, X_{1n})'$, $\Pi = (\Pi_1', \Pi_2')'$, $B_1 = (B^{k_{1,n}}(Y_{21}), \ldots, B^{k_{1,n}}(Y_{2n}))'$, $B_2 = (B^{k_{2,n}}(X_{21}), \ldots, B^{k_{2,n}}(X_{2n}))'$ and $B = (B_1, B_2)$.

Since $\rho(Z, \theta)$ is linear in $\theta = (\beta, h_1, h_2)'$, the joint sieve MD estimation is equivalent to the profile sieve MD estimation for this model. We can first compute a profile sieve estimator for $h_1(y_2) + h_2(x_2)$. That is, for any fixed $\beta$, we compute the sieve coefficients $\Pi$ by minimizing $\sum_{i=1}^{n} \{\hat m(X_i, \theta)\}^2$ subject to the smoothness constraints imposed on the functions $h_1$ and $h_2$:

$$\min_{\Pi:\ \int [D^2 h_\ell(y)]^2\,dy \le c_\ell \log n,\ \ell = 1, 2} [Y_1 - X_1\beta - B\Pi]' P(P'P)^{-}P' [Y_1 - X_1\beta - B\Pi] \tag{2.22}$$

for some upper bounds $c_\ell > 0$, $\ell = 1, 2$. Let $\Pi(\beta)$ be the solution to (2.22) and $\hat h_1(y_2; \beta) + \hat h_2(x_2; \beta) = (B^{k_{1,n}}(y_2)', B^{k_{2,n}}(x_2)')\,\Pi(\beta)$ be the profile sieve estimator of $h_1(y_2) + h_2(x_2)$. Next, we estimate $\beta$ by $\hat\beta_{iv}$ which solves the following 2SLS problem:

$$\min_{\beta} [Y_1 - X_1\beta - B\Pi(\beta)]' P(P'P)^{-}P' [Y_1 - X_1\beta - B\Pi(\beta)]. \tag{2.23}$$

Finally we estimate $h_{o1}(y_2) + h_{o2}(x_2)$ by

$$\hat h_1(y_2) + \hat h_2(x_2) = (B^{k_{1,n}}(y_2)', B^{k_{2,n}}(x_2)')\,\Pi(\hat\beta_{iv}),$$

and then estimate $h_{o1}$ and $h_{o2}$ by imposing the location constraint $h_2(0.5) = \log(3/2)$:

$$\hat h_{2,iv}(x_2) = B^{k_{2,n}}(x_2)'\Pi_2(\hat\beta_{iv}) - B^{k_{2,n}}(0.5)'\Pi_2(\hat\beta_{iv}) + \log(3/2),$$
$$\hat h_{1,iv}(y_2) = B^{k_{1,n}}(y_2)'\Pi_1(\hat\beta_{iv}) + B^{k_{2,n}}(0.5)'\Pi_2(\hat\beta_{iv}) - \log(3/2).$$

$^{21}$ See Blundell, Chen and Kristensen (2007) for a more detailed description on the choice of $H_{1n}$.
$^{22}$ See Subsection 4.3 or Ai and Chen (2003) for the sieve MD procedure with the optimal weighting matrix.

We note that although this model (2.21) belongs to the nasty ill-posed inverse problem, the above profile sieve MD procedure is very easy to compute, and in fact, $\hat\beta_{iv}$ and $\Pi(\hat\beta_{iv})$ have closed form solutions.
To see this, we note that (2.22) is equivalent to

$$\min_{\Pi, \lambda} (Y_1 - X_1\beta - B\Pi)' P(P'P)^{-}P' (Y_1 - X_1\beta - B\Pi) + \sum_{\ell=1}^{2} \lambda_\ell \{\Pi_\ell' C_\ell \Pi_\ell - c_\ell \log n\},$$

where for $\ell = 1, 2$, $C_\ell = \int [D^2 B^{k_{\ell,n}}(y)][D^2 B^{k_{\ell,n}}(y)]'\,dy$, $\Pi_\ell' C_\ell \Pi_\ell = \int [D^2 h_\ell(y)]^2\,dy$ and $\lambda_\ell \ge 0$ is the Lagrange multiplier. However, we do not want to specify the upper bounds $c_\ell > 0$, $\ell = 1, 2$; instead we choose some small values as the penalization weights $\lambda_1, \lambda_2$, and solve the following problem:

$$\min_{\Pi} (Y_1 - X_1\beta - B\Pi)' P(P'P)^{-}P' (Y_1 - X_1\beta - B\Pi) + \sum_{\ell=1}^{2} \lambda_\ell \Pi_\ell' C_\ell \Pi_\ell. \tag{2.24}$$

Denote $C(\lambda_1, \lambda_2) = \begin{pmatrix} \lambda_1 C_1 & 0 \\ 0 & \lambda_2 C_2 \end{pmatrix}$ as the smoothness penalization matrix. The minimization problem (2.24) has a simple closed form solution:

$$\Pi(\beta) = \big(B'P(P'P)^{-}P'B + C(\lambda_1, \lambda_2)\big)^{-} B'P(P'P)^{-}P' [Y_1 - X_1\beta] = W[Y_1 - X_1\beta],$$

with $W = (B'P(P'P)^{-}P'B + C(\lambda_1, \lambda_2))^{-} B'P(P'P)^{-}P'$. Substituting the solution $\Pi(\beta)$ into the 2SLS problem (2.23), we obtain

$$\hat\beta_{iv} = \big[X_1'(I - BW)' P(P'P)^{-}P' (I - BW) X_1\big]^{-1} X_1'(I - BW)' P(P'P)^{-}P' (I - BW) Y_1,$$

and $\Pi(\hat\beta_{iv}) = W[Y_1 - X_1\hat\beta_{iv}]$.

Table 1
Different endogeneity, Spl(3, 2) for h2, k2n = 5, λ2 = 0.0001

R      β        SE(β)    IBias²(h1)   IMSE(h1)   IBias²(h2)   IMSE(h2)
Spl(3, 2), k1n = 5, λ1 = 0.005
0.0    1.0081   0.0909   0.0003       0.0427     0.0000       0.0026
0.1    1.0021   0.0907   0.0003       0.0446     0.0000       0.0026
0.9    0.9404   0.0947   0.0148       0.0926     0.0003       0.0030
Spl(3, 1), k1n = 4, λ1 = 0.001
0.0    1.0076   0.0891   0.0002       0.0225     0.0000       0.0025
0.1    1.0010   0.0886   0.0002       0.0229     0.0000       0.0025
0.9    0.9398   0.0941   0.0160       0.0623     0.0003       0.0029
HPol(4), k1n = 5, λ1 = 0.005
0.0    1.0089   0.0906   0.0003       0.0395     0.0000       0.0026
0.1    1.0029   0.0901   0.0003       0.0397     0.0000       0.0026
0.9    0.9418   0.0948   0.0121       0.0830     0.0003       0.0030
HPol(3), k1n = 4, λ1 = 0.001
0.0    1.0078   0.0890   0.0002       0.0202     0.0000       0.0025
0.1    1.0012   0.0885   0.0002       0.0205     0.0000       0.0025
0.9    0.9401   0.0941   0.0112       0.0546     0.0003       0.0029
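The closed-form expressions for $\Pi(\beta)$ and $\hat\beta_{iv}$ translate almost line by line into matrix code. The sketch below is a minimal NumPy version, not the authors' Matlab or Fortran implementation: the function name `profile_sieve_iv` is hypothetical, the generalized inverses are taken to be Moore–Penrose pseudo-inverses, and the basis matrices `B`, `P` and the penalty matrix `C` (that is, $C(\lambda_1, \lambda_2)$) are assumed to be supplied by the user.

```python
import numpy as np

def profile_sieve_iv(Y1, X1, B, P, C):
    """Closed-form profile sieve MD (penalized 2SLS) estimates.

    Y1, X1: (n,) vectors; B: (n, k1 + k2) sieve basis matrix;
    P: (n, km) instrument basis matrix; C: (k1 + k2, k1 + k2) penalty matrix.
    Returns (beta_iv, Pi_iv).
    """
    n = len(Y1)
    proj = P @ np.linalg.pinv(P.T @ P) @ P.T              # P (P'P)^- P'
    W = np.linalg.pinv(B.T @ proj @ B + C) @ B.T @ proj   # Pi(beta) = W (Y1 - X1 beta)
    M = np.eye(n) - B @ W                                 # I - BW
    A = M.T @ proj @ M
    beta_iv = float(X1 @ A @ Y1) / float(X1 @ A @ X1)     # scalar beta
    Pi_iv = W @ (Y1 - X1 * beta_iv)
    return beta_iv, Pi_iv
```

With `C` set to zero and noise-free data generated from columns that lie in the span of `P`, the function recovers the true coefficients exactly, which gives a quick check of the algebra before penalization is switched on.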
Table 2
Different penalization levels and sieve terms, R = 0.9

(λ1, λ2)        β        SE(β)    IBias²(h1)   IMSE(h1)   IBias²(h2)   IMSE(h2)
Spl(3, 1) for h1 and h2, k1n = k2n = 4
(0.001, 0.0)    0.9366   0.0941   0.0176       0.0612     0.0003       0.0018
(0.05, 0.001)   0.9324   0.0867   0.0185       0.0568     0.0003       0.0016
Spl(3, 3) for h1 and h2, k1n = k2n = 6
(0.001, 0.0)    0.9451   0.0984   0.0124       0.1594     0.0003       0.0032
(0.05, 0.001)   0.9441   0.0954   0.0125       0.0720     0.0003       0.0028

For $R = 0.9$, $0.1$ and $0.0$, a sample of 1000 data points was generated according to the above design. The sieve MD procedure was applied to the data with identity weighting matrix $\Sigma(X) = I$ and the penalization weights $\lambda_1 = 0.005$ (or $0.001$) and $\lambda_2 = 0.0001$ (or $0$) for simplicity. The estimated coefficients were recorded. Then, a new sample of 1000 data points was drawn and the estimated coefficients were computed again. This procedure was repeated 400 times. The mean (M) and standard error (SE) of the $\beta_o$ estimator across the 400 simulations are reported in Tables 1–2. To evaluate the performance of the sieve MD estimators of the nonparametric components $h_{o1}(Y_2)$ and $h_{o2}(X_2)$, we report their integrated squared biases (IBias²) and the integrated mean squared errors (IMSE) across the 400 simulations in Tables 1–2.$^{23}$ Table 1 summarizes the performance of the estimators across different degrees of endogeneity and different sieves for $h_1(Y_2)$. Table 2 summarizes the sensitivity of the estimators (under $R = 0.9$) to different sieve number of terms and penalization parameters for both $h_1(Y_2)$ and $h_2(X_2)$. We also plot the estimated functions $h_{o1}(Y_2)$ and $h_{o2}(X_2)$ corresponding to the strong correlation case ($R = 0.9$) in Figure 1, where the solid lines represent the true functions and the dashed (or dotted) lines denote the sieve MD (or sieve IV) estimates.

Tables 1–2 and Figure 1 indicate that even under strong correlation, the sieve MD estimates of $\beta_o$ and $h_{o2}(X_2)$ perform well.
We find that the sieve IV estimates of $\beta_o$ and $h_{o2}(X_2)$ are not sensitive to the choices of the penalization parameters $\lambda_1, \lambda_2$, nor to the choices of sieve bases for $h_{o1}(Y_2)$. The sieve IV estimate of $h_{o1}(Y_2)$ is also not very sensitive to the choices of sieve bases, although it is slightly more sensitive to the penalization parameter $\lambda_1$ under strong correlation. Since under strong correlation, the estimation of $h_{o1}(Y_2)$ is a nasty ill-posed inverse problem, as the penalization parameter $\lambda_1$ gets smaller, the integrated squared bias of $h_{o1}(\cdot)$ does not change much but the integrated variance of $h_{o1}(\cdot)$ increases more. The additional Monte Carlo results for other sieve bases such as 3rd order cardinal B-splines and for different combinations of sieve number of terms and penalization levels share similar patterns to the ones reported here. These findings are also consistent with the more detailed Monte Carlo studies in Blundell, Chen and Kristensen (2007).

Figure 1. True and estimated functions with $R = 0.9$, $\lambda_1 = 0.001$, $\lambda_2 = 0.0001$.

$^{23}$ The IBias²($h_1$) and IMSE($h_1$) in Table 1 are calculated as follows. Let $\hat h_i$ be the estimate of $h_{o1}$ from the $i$th simulated data set, and $\bar h(y) = \sum_{i=1}^{400} \hat h_i(y)/400$ be the pointwise average across 400 simulations. We calculate the pointwise squared bias as $[\bar h(y) - h_{o1}(y)]^2$, and the pointwise variance as $400^{-1} \sum_{i=1}^{400} [\hat h_i(y) - \bar h(y)]^2$. The integrated squared bias is calculated by numerically integrating the pointwise squared bias from $\underline{y}$ to $\bar y$, which are respectively the 2.5th and 97.5th empirical percentiles of $Y_2$; the integrated MSE is computed in a similar way.

2.5. An incomplete list of sieve applications in econometrics

We conclude this section by listing a few applications of the sieve extremum estimation in econometrics.$^{24}$ Most of the existing applications are done in microeconometrics.
Elbadawi, Gallant and Souza (1983) studied Fourier series LS estimation of demand elasticity. Cosslett (1983) proposed nonparametric ML estimation of a binary choice model. Heckman and Singer (1984) considered sieve ML estimation of a duration model where the unknown error distribution is approximated by a first-order spline. Their estimation procedure was also applied in Cameron and Heckman (1998) to a life-cycle schooling problem. Duncan (1986) used spline sieve MLE in estimating a censored regression. Hausman and Newey (1995) considered power series and spline series LS estimation of consumer surplus. Hahn (1998) and Imbens, Newey and Ridder (2005) used power series and splines in the two-step efficient estimation of the average treatment effect models. Newey, Powell and Vella (1999), and Pinkse (2000) considered series estimation of a triangular system of simultaneous equations. To estimate semiparametric generalizations of Heckman's (1979) sample selection model, Gallant and Nychka (1987) proposed the Hermite polynomial sieve MLE, while Newey (1988) and Das, Newey and Vella (2003) applied the series LS estimation method. Recently, Newey (2001) used the sieve MD procedure to estimate a nonlinear measurement error model. Blundell, Chen and Kristensen (2007) considered a profile sieve MD procedure to estimate shape-invariant Engel curves with nonparametric endogenous expenditure. Coppejans (2001) proposed sieve ML estimation of a binary choice model. Khan (2005) considered a sieve LS estimation of a probit binary choice model with unknown heteroskedasticity. Hirano, Imbens and Ridder (2003) proposed a sieve logistic regression to estimate propensity score for treatment effect models.

$^{24}$ Although restricting our attention to economic applications only, it is still impossible to mention all the existing applications of sieve methods in econometrics. Any omissions reflect my lack of awareness and are purely unintentional.
Mahajan (2004) estimated a semiparametric single index model with binary misclassified regressors via sieve MLE. Chen, Fan and Tsyrennikov (2006) studied sieve MLE of semi-nonparametric multivariate copula models. Chen, Hong and Tamer (2005) made use of spline sieves to estimate nonlinear nonclassical measurement error models with an auxiliary sample. Their estimation procedure was shown in Chen, Hong and Tarozzi (2007) to be semiparametrically efficient for general nonlinear GMM models of nonclassical measurement errors, missing data and treatment effects. Hu and Schennach (2006) applied sieve MLE to estimate a nonlinear nonclassical measurement error model with instruments. Brendstrup and Paarsch (2004) applied Hermite and Laguerre polynomial sieve MLE to estimate sequential asymmetric English auctions. Bierens (in press) and Bierens and Carvalho (in press) applied Legendre polynomial sieve MLE respectively to estimate an interval-censored mixed proportional hazard model and a competing risks model of recidivism.

There have also been many applications of the method of sieves in time series econometrics. Engle et al. (1986) forecasted electricity demand using a partially linear spline regression. Engle and Gonzalez-Rivera (1991) applied sieve MLE to estimate ARCH models where the unknown density of the standardized innovation is approximated by a first order spline sieve. Gallant and Tauchen (1989) and Gallant, Hsieh and Tauchen (1991) employed Hermite polynomial sieve MLE to study asset pricing and foreign exchange rates. Gallant and Tauchen (1996, 2004) have proposed the combinations of Hermite polynomial sieve and simulated method of moments to effectively solve many complicated asset pricing models with latent factors, and their methods have been widely applied in empirical finance.
Bansal and Viswanathan (1993), Bansal, Hsieh and Viswanathan (1993) and Chapman (1997) considered sieve approximation of the whole stochastic discount factor (or pricing kernel) as a function of a few macroeconomic factors. White (1990) and Granger and Teräsvirta (1993) suggested nonparametric LS forecasting via sigmoid ANN sieve. Hutchinson, Lo and Poggio (1994) applied radial basis ANN to option pricing. Chen, Racine and Swanson (2001) used partially linear ANN and ridgelet sieves to forecast US inflation. McCaffrey et al. (1992) estimated the Lyapunov exponent of a chaotic system via ANN sieves.$^{25}$ Chen and Ludvigson (2003) employed a sigmoid ANN sieve to estimate the unknown habit function in a consumption asset pricing model. Polk, Thompson and Vuolteenaho (2003) applied sigmoid ANN to compute conditional quantile in testing stock return predictability. Chen, Hansen and Scheinkman (1998) employed a shape-preserving spline-wavelet sieve to estimate the eigenfunctions of a fully nonparametric scalar diffusion model from discrete-time low-frequency observations. Chen and Conley (2001) made use of the same sieve to estimate a spatial temporal model with flexible conditional mean and conditional covariance. Phillips (1998) applied orthonormal basis to analyze spurious regressions. Engle and Rangel (2004) proposed a new Spline GARCH model to measure unconditional volatility and have applied it to equity markets for 50 countries for up to 50 years of daily data. See Fan and Yao (2003) for additional applications to financial time series models.

3. Large sample properties of sieve estimation of unknown functions

We already know that the sieve method is very general and easily implementable. In this section, we shall first establish that, under mild regularity conditions, the sieve extremum estimation will consistently estimate both finite-dimensional and infinite-dimensional unknown parameters.
However, for econometric and statistical inference, one would like to know how accurate a consistent sieve estimator might be given a finite data set and what its limiting distribution is. Unfortunately there does not yet exist a general theory of pointwise limiting distribution for a sieve extremum estimator of an unknown function. There are a few results on pointwise limiting distribution for series estimators of densities and LS regression functions, which we shall review at the end of this section. However, all is not lost. We do have a well developed theory on $\sqrt{n}$-asymptotic normality of sieve estimators of smooth functionals$^{26}$ of unknown functions. As we shall see in Section 4, in order to derive $\sqrt{n}$-asymptotic normality and semiparametric efficiency of sieve estimators of parametric components in a semi-nonparametric model, the sieve estimators of the nonparametric components should converge to the true unknown functions at rates faster than $n^{-1/4}$ under certain metric. This motivates the importance of establishing rates of convergence for sieve estimators of unknown functions even when the unknown functions are nuisance parameters (i.e., not the parameters of interest). Moreover, when an unknown function is also a parameter of interest in a nonparametric or a semi-nonparametric model, the convergence rate will provide useful information on the accuracy of a sieve estimator for a given finite sample size.

$^{25}$ Their work is closely related to the estimation of derivative of a multivariate unknown regression function via ANN sieves in Gallant and White (1992). Shintani and Linton (2004) proposed a nonparametric test of chaos via ANN sieves.
$^{26}$ See Section 4 for the definition of a "smooth functional". Here it suffices to know that regular finite-dimensional parameters and average derivatives of unknown functions are examples of smooth functionals.
Unfortunately, to date there is no unified theory on rates of convergence for the general sieve extremum estimators of unknown functions either.$^{27}$ Nevertheless, the theory on convergence rates of sieve M-estimators is by now well developed.

In this section we first provide a new consistency theorem on general sieve extremum estimation in Subsection 3.1. We then review the existing results on convergence rates and pointwise limiting distributions for sieve M-estimators of unknown functions. We begin this discussion with a survey of the convergence rate results for general sieve M-estimators of unknown functions in Subsection 3.2 and illustrate how to verify the technical conditions assumed for the general result with two examples. Although series estimation is a special case of sieve M-estimation, due to its special properties (i.e., concave criterion and finite-dimensional linear sieve space), the convergence rate of a series estimator can be derived under alternative sufficient conditions, which will be reviewed in Subsection 3.3. Subsection 3.4 presents the existing results on the pointwise normality of the series estimator in the special case of a LS regression function.

3.1. Consistency of sieve extremum estimators

For an infinite-dimensional, possibly noncompact parameter space $\Theta$, Geman and Hwang (1982) obtained the consistency of sieve MLE with i.i.d. data; White and Wooldridge (1991) obtained the consistency of sieve extremum estimates with dependent and heterogeneous data. For an infinite-dimensional, compact parameter space $\Theta$, Gallant (1987) and Gallant and Nychka (1987) derived the consistency of sieve M-estimates; Newey and Powell (2003) and Chernozhukov, Imbens and Newey (2007) established the consistency of sieve MD estimates.
In the following, we present a new consistency theorem for approximate sieve extremum estimates that allows for noncompact infinite-dimensional $\Theta$ and is applicable to ill-posed semi-nonparametric problems.$^{28}$

Let $d(\cdot,\cdot)$ be a (pseudo) metric on $\Theta$. In particular, when $\Theta = \mathcal{B} \times \mathcal{H}$ where $\mathcal{B}$ is a subset of some Euclidean space and $\mathcal{H}$ is a subset of some normed function space, we can use $d(\theta, \tilde\theta) = |\beta - \tilde\beta|_e + \|h - \tilde h\|_{\mathcal{H}}$, where $|\cdot|_e$ denotes the Euclidean norm, and $\|\cdot\|_{\mathcal{H}}$ is a norm imposed on the function space $\mathcal{H}$. For example, if $\mathcal{H} = C^m(\mathcal{X})$ with a bounded $\mathcal{X}$, we could take $\|h\|_{\mathcal{H}}$ to be $\|h\|_\infty$ or $\|h\|_{2,leb}$.

$^{27}$ To the best of our knowledge, currently there is one unpublished paper [Chen and Pouzo (2006)] that derives the convergence rates for the sieve MD estimates $\hat\theta_n$ of $\theta_o = (\beta_o, h_o)$ satisfying the semi-nonparametric conditional moment models $E[\rho(Z, \beta_o, h_o(\cdot))|X] = 0$, where the unknown $h_o(\cdot)$ could depend on the endogenous variables $Y$ or latent variables. Earlier, Ai and Chen (2003) obtained a faster than $n^{-1/4}$ convergence rate under a weaker metric. There are also a few papers on convergence rates of sieve MD estimate of $h_o$ in specific models; see e.g. Blundell, Chen and Kristensen (2007) and Hall and Horowitz (2005) for the model $E[Y_1 - h_o(Y_2)|X] = 0$. Van der Vaart and Wellner (1996, Theorem 3.4.1) stated an abstract rate result for sieve extremum estimation. However, their conditions rule out ill-posed semi-nonparametric problems, and require a maximal inequality with rate for the process $\sqrt{n}(Q_n - Q)$, which is currently not available for a general criterion $Q_n$. Hence, it is fair to say that a general theory on rates of convergence for sieve extremum estimators is currently lacking.

$^{28}$ Based on a recent theorem of Stinchcombe (2002), the consistency of sieve extremum estimates is a generic property.

CONDITION 3.1 (Identification). (i) $Q(\theta_o) > -\infty$, and if $Q(\theta_o) = +\infty$ then $Q(\theta) < +\infty$ for all $\theta \in \Theta_k \setminus \{\theta_o\}$ for all $k \ge 1$; (ii) there are a nonincreasing positive function $\delta(\cdot)$ and a positive function $g(\cdot)$ such that for all $\varepsilon > 0$ and for all $k \ge 1$,
$$Q(\theta_o) - \sup_{\{\theta\in\Theta_k:\, d(\theta,\theta_o) \ge \varepsilon\}} Q(\theta) \ge \delta(k)\, g(\varepsilon) > 0.$$

CONDITION 3.2 (Sieve spaces). $\Theta_k \subseteq \Theta_{k+1} \subseteq \Theta$ for all $k \ge 1$; and there exists a sequence $\pi_k\theta_o \in \Theta_k$ such that $d(\theta_o, \pi_k\theta_o) \to 0$ as $k \to \infty$.

CONDITION 3.3 (Continuity). (i) For each $k \ge 1$, $Q(\theta)$ is upper semicontinuous on $\Theta_k$ under the metric $d(\cdot,\cdot)$; (ii) $|Q(\theta_o) - Q(\pi_{k(n)}\theta_o)| = o(\delta(k(n)))$.

CONDITION 3.4 (Compact sieve space). The sieve spaces, $\Theta_k$, are compact under $d(\cdot,\cdot)$.

CONDITION 3.5 (Uniform convergence over sieves). (i) For all $k \ge 1$, $\operatorname{plim}_{n\to\infty} \sup_{\theta\in\Theta_k} |Q_n(\theta) - Q(\theta)| = 0$; (ii) $\hat c(k(n)) = o_P(\delta(k(n)))$ where $\hat c(k(n)) \equiv \sup_{\theta\in\Theta_{k(n)}} |Q_n(\theta) - Q(\theta)|$; (iii) $\eta_{k(n)} = o(\delta(k(n)))$.

THEOREM 3.1. Let $\hat\theta_n$ be the approximate sieve extremum estimator defined by (2.9). If Conditions 3.1–3.5 hold, then $d(\hat\theta_n, \theta_o) = o_P(1)$.

PROOF. By Remark 2.1, $\hat\theta_n$ is well defined and measurable. For all $\varepsilon > 0$, under Conditions 3.3(i) and 3.4, $\sup_{\{\theta\in\Theta_{k(n)}:\, d(\theta,\theta_o) \ge \varepsilon\}} Q(\theta)$ exists. By definition, we have for all $\varepsilon > 0$,
$$\Pr\big(d(\hat\theta_n, \theta_o) > \varepsilon\big) \le \Pr\Big(\sup_{\{\theta\in\Theta_{k(n)}:\, d(\theta,\theta_o) \ge \varepsilon\}} Q_n(\theta) \ge Q_n(\pi_{k(n)}\theta_o) - O(\eta_{k(n)})\Big) \le P_1 + P_2,$$
where
$$P_1 \equiv \Pr\Big(\sup_{\{\theta\in\Theta_{k(n)}:\, d(\theta,\theta_o) \ge \varepsilon\}} |Q_n(\theta) - Q(\theta)| > \hat\nu(k(n))\Big) \le \Pr\Big(\sup_{\theta\in\Theta_{k(n)}} |Q_n(\theta) - Q(\theta)| > \hat\nu(k(n))\Big),$$
and
$$P_2 \equiv \Pr\Big(\sup_{\{\theta\in\Theta_{k(n)}:\, d(\theta,\theta_o) \ge \varepsilon\}} Q(\theta) \ge Q(\pi_{k(n)}\theta_o) - 2\hat\nu(k(n)) - O(\eta_{k(n)})\Big)$$
$$= \Pr\Big(2\hat\nu(k(n)) + Q(\theta_o) - Q(\pi_{k(n)}\theta_o) + O(\eta_{k(n)}) \ge Q(\theta_o) - \sup_{\{\theta\in\Theta_{k(n)}:\, d(\theta,\theta_o) \ge \varepsilon\}} Q(\theta)\Big).$$
Choosing $\hat\nu(k(n)) = \hat c(k(n))$, it follows that $P_1 = 0$ by definition of $\hat c(k(n))$ and Condition 3.5(i), and
$$P_2 \le \Pr\big[2\hat c(k(n)) + \{Q(\theta_o) - Q(\pi_{k(n)}\theta_o)\} + O(\eta_{k(n)}) \ge \delta(k(n))\, g(\varepsilon)\big] \to 0$$
by Conditions 3.1 and 3.5(ii).

REMARK 3.1. (1) Theorem 3.1 is applicable to both well-posed and ill-posed semi-nonparametric models. When the problem (such as the nonparametric IV regression $E[Y_1 - h_o(Y_2)|X] = 0$) is ill-posed, one may have $\liminf_k \delta(k) = 0$, which is still allowed by Conditions 3.1(ii), 3.3(ii) and 3.5(ii)(iii). See Chen and Pouzo (2006) for alternative general consistency theorems for sieve extremum estimates that allow for ill-posed problems.

(2) If $\liminf_k \delta(k) > 0$, then Condition 3.5(iii) is automatically satisfied with $\eta_{k(n)} = o(1)$, Condition 3.5(ii) is implied by Condition 3.5(i), and Condition 3.3(ii) is implied by Condition 3.2 and Condition 3.3(ii)$'$:

CONDITION 3.3(ii)$'$. $Q(\theta)$ is continuous at $\theta_o$ in $\Theta$.

(3) Theorem 3.1 is an extension of Corollary 2.6 of White and Wooldridge (1991). Their corollary implies $d(\hat\theta_n, \theta_o) = o_P(1)$ under Conditions 3.4, 3.5(i) and Conditions 3.1$'$, 3.2$'$ and 3.3$'$:

CONDITION 3.1$'$. (i) $Q(\theta)$ is continuous at $\theta_o$ in $\Theta$, $Q(\theta_o) > -\infty$; (ii) for all $\varepsilon > 0$, $Q(\theta_o) > \sup_{\{\theta\in\Theta:\, d(\theta,\theta_o) \ge \varepsilon\}} Q(\theta)$.

CONDITION 3.2$'$. $\Theta_k \subseteq \Theta_{k+1} \subseteq \Theta$ for all $k \ge 1$; and for any $\theta \in \Theta$ there exists $\pi_k\theta \in \Theta_k$ such that $d(\theta, \pi_k\theta) \to 0$ as $k \to \infty$.

CONDITION 3.3$'$. For each $k \ge 1$, (i) $Q_n(\theta)$ is a measurable function of the data $\{Z_t\}_{t=1}^n$ for all $\theta \in \Theta_k$; and (ii) for any data $\{Z_t\}_{t=1}^n$, $Q_n(\theta)$ is upper semicontinuous on $\Theta_k$ under the metric $d(\cdot,\cdot)$.

We note that under Condition 3.2, Condition 3.1$'$(ii) implies that Condition 3.1(ii) is satisfied with $\delta(k) = \text{const.} > 0$, hence Remark 3.1(2) is applicable and $d(\hat\theta_n, \theta_o) = o_P(1)$. Unfortunately, Condition 3.1$'$(ii) may fail to be satisfied in some ill-posed semi-nonparametric models when $\Theta$ is a noncompact infinite-dimensional parameter space.

(4) Condition 3.1 is satisfied by Condition 3.1$''$:

CONDITION 3.1$''$. (i) $\Theta$ is compact under $d(\cdot,\cdot)$, and $Q(\theta)$ is upper semicontinuous on $\Theta$ under $d(\cdot,\cdot)$; (ii) $Q(\theta)$ is uniquely maximized at $\theta_o$ in $\Theta$, $Q(\theta_o) > -\infty$.

As a consequence of Theorem 3.1, we obtain: $d(\hat\theta_n, \theta_o) = o_P(1)$ under Conditions 3.1$''$, 3.2, 3.4 and 3.5(i).
This result is very similar to Lemmas A.1 in Newey and Powell (2003) and Chernozhukov, Imbens and Newey (2007).

REMARK 3.2. If $\hat\theta_n$ satisfies $Q_n(\hat\theta_n) \ge \sup_{\theta\in\Theta_n} Q_n(\theta) - O_{a.s.}(\eta_n)$, then $d(\hat\theta_n, \theta_o) = o_{a.s.}(1)$ under Conditions 3.1–3.4 and Condition 3.5$'$:

CONDITION 3.5$'$. (i) For all $k \ge 1$, $\sup_{\theta\in\Theta_k} |Q_n(\theta) - Q(\theta)| = o_{a.s.}(1)$; (ii) $\hat c(k(n)) = o_{a.s.}(\delta(k(n)))$; (iii) $\eta_{k(n)} = o(\delta(k(n)))$.

This extends Gallant's (1987) theorem to almost sure convergence of approximate sieve extremum estimates, allowing for noncompact infinite-dimensional $\Theta$ and for ill-posed semi-nonparametric models.

Note that when $\Theta_k = \Theta$ is compact, the conditions for Theorem 3.1 become the standard assumptions imposed for consistency of parametric extremum estimation in Newey and McFadden (1994) and White (1994). For semi-nonparametric models, the entire parameter space $\Theta$ contains infinite-dimensional unknown functions and is generally noncompact. Nevertheless, one can easily construct compact approximating parameter spaces (sieves) $\Theta_k$. Moreover, it is relatively easy to verify the uniform convergence over compact sieve spaces,$^{29}$ while "$\operatorname{plim}_{n\to\infty} \sup_{\theta\in\Theta} |Q_n(\theta) - Q(\theta)| = 0$" may fail when the space $\Theta$ is too "large" or too "complex".

$^{29}$ One could modify the proof of Corollary 2.2 in Newey (1991) or the proof of Lemma 1 in Andrews (1992) to provide sufficient conditions for Condition 3.5(i) in terms of Conditions 3.3(i) and 3.4 and the pointwise convergence over $\Theta_k$.

We now review some notions of complexity of a function class. Let $L_r(P_o)$, $r \in [1, \infty)$, denote the space of real-valued random variables with finite $r$th moments and $\|\cdot\|_r$ denote the $L_r(P_o)$-norm. Let $\mathcal{F}_n = \{g(\theta, \cdot): \theta \in \Theta_n\}$ be a class of real-valued, $L_r(P_o)$-measurable functions indexed by $\theta \in \Theta_n$. One notion of complexity of the class $\mathcal{F}_n$ is the $L_r(P_o)$-covering numbers without bracketing, which is the minimal number of $w$-balls $\{\{f: \|f - g_j\|_r \le w\}, \|g_j\|_r < \infty, j = 1, \ldots, N\}$ that cover $\mathcal{F}_n$, denoted as $N(w, \mathcal{F}_n, \|\cdot\|_r)$.
Likewise, we can define $N(w, \mathcal{F}_n, \|\cdot\|_{n,r})$ as the $L_r(P_n)$-(random) covering numbers without bracketing, where $\|\cdot\|_{n,r}$ denotes the $L_r(P_n)$-norm and $P_n$ denotes the empirical measure of a random sample $\{Z_i\}_{i=1}^n$. Sometimes the covering numbers of $\mathcal{F}_n$ can grow to infinity very fast as $n$ grows; it is then more convenient to measure the complexity of $\mathcal{F}_n$ using the notion of $L_r(P_o)$-metric entropy without bracketing, $H(w, \mathcal{F}_n, \|\cdot\|_r) \equiv \log(N(w, \mathcal{F}_n, \|\cdot\|_r))$, and the $L_r(P_n)$-(random) metric entropy without bracketing, $H(w, \mathcal{F}_n, \|\cdot\|_{n,r}) \equiv \log(N(w, \mathcal{F}_n, \|\cdot\|_{n,r}))$. Detailed discussions of metric entropy can be found in Pollard (1984), Andrews (1994a), van der Vaart and Wellner (1996) and van de Geer (2000).

When the function class $\Theta$ is too complex in the sense that its metric entropy is too large, the uniform convergence over the entire parameter space $\Theta$ may fail, but the uniform convergence over a sieve space $\Theta_n$ (i.e., Condition 3.5(i)) can still be satisfied. For example, when $Q_n(\theta) = n^{-1}\sum_{t=1}^n l(\theta, Z_t)$, $\{Z_t\}_{t=1}^n$ is i.i.d. and $E\{\sup_{\theta\in\Theta_n} |l(\theta, Z_t)|\} < \infty$, then Condition 3.5(i) is satisfied if and only if $H(w, \{l(\theta, \cdot): \theta \in \Theta_n\}, \|\cdot\|_{n,1}) = o_P(n)$ for all $w > 0$; see Pollard (1984). When the space $\Theta$ is infinite-dimensional and not totally bounded, $H(w, \{l(\theta, \cdot): \theta \in \Theta\}, \|\cdot\|_{n,1}) = O_P(n)$ may occur; hence $\sup_{\theta\in\Theta} |Q_n(\theta) - Q(\theta)| \ne o_P(1)$. For such a case, the extremum estimator obtained by maximizing over the entire parameter space $\Theta$, $\arg\sup_{\theta\in\Theta} Q_n(\theta)$, may fail to exist or be inconsistent.

Conditions 3.1–3.4 of Theorem 3.1 are basic regularity conditions; one can provide more primitive sufficient assumptions for Condition 3.5 in specific applications. In the next remarks we present simple consistency results for sieve M-estimators and sieve MD-estimators. Let $N(w, \Theta_n, d)$ denote the minimal number of $w$-radius balls (under the metric $d$) that cover the sieve space $\Theta_n$.

REMARK 3.3 (Consistency of sieve M-estimator $\hat\theta_n = \arg\sup_{\theta\in\Theta_n} n^{-1}\sum_{t=1}^n l(\theta, Z_t) - o_P(1)$). Suppose that Conditions 3.2 and 3.4 hold, that Condition 3.1 is satisfied with $Q(\theta) = E\{l(\theta, Z_t)\}$ and $\liminf_{k(n)} \delta(k(n)) > 0$, and that $E\{l(\theta, Z_t)\}$ is continuous at $\theta = \theta_o \in \Theta$. Then $d(\hat\theta_n, \theta_o) = o_P(1)$ under the following Condition 3.5M:

CONDITION 3.5M. (i) $\{Z_t\}_{t=1}^n$ is i.i.d., $E\{\sup_{\theta\in\Theta_n} |l(\theta, Z_t)|\}$ is bounded; (ii) there are a finite $s > 0$ and a random variable $U(Z_t)$ with $E\{U(Z_t)\} < \infty$ such that $\sup_{\theta,\theta'\in\Theta_n:\, d(\theta,\theta') \le \delta} |l(\theta, Z_t) - l(\theta', Z_t)| \le \delta^s U(Z_t)$; (iii) $\log N(\delta^{1/s}, \Theta_n, d) = o(n)$ for all $\delta > 0$.

Remark 3.3 is a direct consequence of Theorem 3.1 and Pollard's (1984) Theorem II.24. This is because Condition 3.5M(i) and (ii) imply $H(w, \{l(\theta, \cdot): \theta \in \Theta_n\}, \|\cdot\|_{n,1}) \le \log N(\delta^{1/s}, \Theta_n, d)$, hence Condition 3.5M implies Condition 3.5(i). See White and Wooldridge (1991, Theorem 2.5) and Ai and Chen (2007, Lemma A.1) for more general sufficient assumptions for Condition 3.5.

REMARK 3.4 (Consistency of sieve MD-estimator $\hat\theta_n = \arg\inf_{\theta\in\Theta_n} \frac{1}{n}\sum_{t=1}^n \hat m(X_t, \theta)' \{\hat\Sigma(X_t)\}^{-1} \hat m(X_t, \theta) + o_P(1)$). Suppose that Conditions 3.2 and 3.4 hold, that $m(X_t, \theta) \equiv E\{\rho(Z_t, \theta)|X_t\} = 0$ only when $\theta = \theta_o \in \Theta$, that for all $X_t$, $m(X_t, \theta)$ is continuous in $\theta_o$ under the metric $d(\cdot,\cdot)$, and that $\liminf_{k(n)} \delta(k(n)) > 0$. Then $d(\hat\theta_n, \theta_o) = o_P(1)$ under the following Condition 3.5MD:

CONDITION 3.5MD. (i) $\{Z_t\}_{t=1}^n$ is i.i.d., $E\{\sup_{\theta\in\Theta_n} |m(X_t, \theta)' m(X_t, \theta)|\}$ is bounded; (ii) there are a finite $s > 0$ and a $U(X_t)$ with $E\{[U(X_t)]^2\} < \infty$ such that $\sup_{\theta,\theta'\in\Theta_n:\, d(\theta,\theta') \le \delta} |m(X_t, \theta) - m(X_t, \theta')| \le \delta^s U(X_t)$; (iii) $\log N(\delta^{1/s}, \Theta_n, d) = o(n)$ for all $\delta > 0$; (iv) uniformly over $X_t$, $\hat\Sigma(X_t) = \Sigma(X_t) + o_P(1)$ for a positive definite and finite $\Sigma(X_t)$; (v) $\frac{1}{n}\sum_{i=1}^n |\hat m(X_i, \theta) - m(X_i, \theta)|^2 = o_P(1)$ uniformly over $\theta \in \Theta_n$.

See Chen and Pouzo (2006) for a proof of Remark 3.4; they also provide sufficient conditions for the consistency of sieve MD-estimator $\hat\theta_n$ without imposing $\liminf_{k(n)} \delta(k(n)) > 0$.
Also see Newey and Powell (2003) and Ai and Chen (1999, 2003, 2007) for primitive sufficient conditions for Condition 3.5MD(iv) and (v) where $\hat\Sigma(X_t)$ and $\hat m(X_t, \theta)$ are kernel or series estimates of $\Sigma(X_t)$ and $m(X_t, \theta)$, respectively.

Finally, Theorem 3.1 is also applicable to derive convergence of sieve extremum estimates to some pseudo-true values in misspecified semi-nonparametric models; see Lemma 3.1 of Ai and Chen (2007) for such an application.

3.2. Convergence rates of sieve M-estimators

There are many results on convergence rates of sieve M-estimators of unknown functions. For i.i.d. data, Van de Geer (1995) obtained the rate for sieve LS regression. Shen and Wong (1994), and Birgé and Massart (1998) derived the rates for general sieve M-estimation. Van de Geer (1993) and Wong and Shen (1995) obtained the rates for sieve MLE. For time series data, Chen and Shen (1998) derived the rate for sieve M-estimation of stationary beta-mixing models.$^{30}$ The general theory on convergence rates is technically involved and relies on the theory of empirical processes. In this section we present a simple version of the rate results for sieve M-estimation whose conditions are easy to verify. However, readers who are interested in the most general theory on convergence rates of sieve M-estimates are encouraged to read the papers by Shen and Wong (1994), Wong and Shen (1995) and Birgé and Massart (1998).

$^{30}$ It is impossible to mention here all the existing results on convergence rates of sieve M-estimates. There are many papers on convergence rates of particular sieves, such as the work on polynomial spline regression and density estimation by Stone and his collaborators, see Subsection 3.3 for details; the work on wavelets by Donoho, Johnstone and others [see e.g., Donoho et al. (1995)]; the work on neural networks by Barron (1993), White (1990) and others.

Recall $\theta_o \in \Theta$ and that the approximate sieve M-estimate $\hat\theta_n$ solves:
$$n^{-1}\sum_{t=1}^n l(\hat\theta_n, Z_t) \ge \sup_{\theta\in\Theta_n} n^{-1}\sum_{t=1}^n l(\theta, Z_t) - O_P(\varepsilon_n^2) \quad \text{with } \varepsilon_n \to 0. \qquad (3.1)$$
Let $d(\theta_o, \theta)$ be a (pseudo-) metric on $\Theta$ such that $d(\theta_o, \hat\theta_n) = o_P(1)$. Let $K(\theta_o, \theta) \equiv E(l(\theta_o, Z_t) - l(\theta, Z_t))$.$^{31}$ Let $\|\theta_o - \theta\|$ be a metric on $\Theta$ such that $\|\theta_o - \theta\| \le \text{const.}\, d(\theta_o, \theta)$ for all $\theta \in \Theta$, and $\|\theta_o - \theta\| \asymp K^{1/2}(\theta_o, \theta)$ for $\theta \in \Theta$ with $d(\theta_o, \theta) = o(1)$. We shall give a convergence rate for sieve estimate $\hat\theta_n$ under $\|\theta_o - \theta\|$, and thus automatically give an upper bound on $d(\theta_o, \hat\theta_n)$, where $d$ is any other metric on $\Theta$ satisfying $d(\theta_o, \theta) \le \text{const.}\, K^{1/2}(\theta_o, \theta)$.

In order for $\hat\theta_n$ to converge to $\theta_o$ at a fast rate under the metric $\|\theta_o - \hat\theta_n\|$, not only does the sieve approximation error rate, $\|\theta_o - \pi_n\theta_o\|$, have to approach zero suitably fast, but additionally, the sieve space, $\Theta_n$, must not be too complex. We have already introduced $L_r(P_o)$-covering numbers (metric entropy) without bracketing as a complexity measure of a class $\mathcal{F}_n = \{g(\theta, \cdot): \theta \in \Theta_n\}$; we now consider another measure of complexity. Let $\mathcal{L}_r$ be the completion of $\mathcal{F}_n$ under the norm $\|\cdot\|_r$. For any given $w > 0$, if there exists a collection of functions (brackets) $\{g_1^l, g_1^u, \ldots, g_N^l, g_N^u\} \subset \mathcal{L}_r$ such that $\max_{1 \le j \le N} \|g_j^u - g_j^l\|_r \le w$ and for any $g \in \mathcal{F}_n$, there exists $j \in \{1, \ldots, N\}$ with $g_j^l \le g \le g_j^u$ a.e.-$P_o$, then the minimal number of such brackets, $N_{[\,]}(w, \mathcal{F}_n, \|\cdot\|_r) \equiv \min(N: \{g_1^l, g_1^u, \ldots, g_N^l, g_N^u\})$, is called the $L_r(P_o)$-covering numbers with bracketing. Likewise, $H_{[\,]}(w, \mathcal{F}_n, \|\cdot\|_r) \equiv \log(N_{[\,]}(w, \mathcal{F}_n, \|\cdot\|_r))$ is called the $L_r(P_o)$-metric entropy with bracketing of the class $\mathcal{F}_n$. See Pollard (1984), Andrews (1994a), Van der Vaart and Wellner (1996) and Van de Geer (2000) for more details.

We now present a result of Chen and Shen (1996) for i.i.d. data; see Chen and Shen (1998) for the stationary beta-mixing case and Chen and White (1999) for the stationary uniform-mixing case.$^{32}$

CONDITION 3.6. $\{Z_t\}_{t=1}^n$ is an i.i.d. or $m$-dependent sequence.

CONDITION 3.7. There is $C_1 > 0$ such that for all small $\varepsilon > 0$,
$$\sup_{\{\theta\in\Theta_n:\, \|\theta_o - \theta\| \le \varepsilon\}} \operatorname{Var}\big(l(\theta, Z_t) - l(\theta_o, Z_t)\big) \le C_1\varepsilon^2.$$

CONDITION 3.8.
For any $\delta > 0$, there exists a constant $s \in (0, 2)$ such that
$$\sup_{\{\theta\in\Theta_n:\, \|\theta_o - \theta\| \le \delta\}} |l(\theta, Z_t) - l(\theta_o, Z_t)| \le \delta^s U(Z_t),$$
with $E([U(Z_t)]^\gamma) \le C_2$ for some $\gamma \ge 2$.

$^{31}$ If the criterion is a log-likelihood, then $K(\theta_o, \theta)$ is simply the Kullback–Leibler information.

$^{32}$ See Fan and Yao (2003) for description of various nonparametric methods applied to nonlinear time series models.

Conditions 3.6 and 3.7 imply that, within a neighborhood of $\theta_o$, $\operatorname{Var}\big(n^{-1/2}\sum_{t=1}^n [l(\theta, Z_t) - l(\theta_o, Z_t)]\big)$ behaves like $\|\theta_o - \theta\|^2$. Condition 3.8 implies that, when restricting to a local neighborhood of $\theta_o$, $l(\theta, Z_t)$ is "continuous" at $\theta_o$ with respect to a metric $\|\theta_o - \theta\|$, which is locally equivalent to $K^{1/2}$. Conditions 3.7 and 3.8 are usually easily verifiable by exploiting the specific form of the criterion function.

Denote $\mathcal{F}_n = \{l(\theta, Z_t) - l(\theta_o, Z_t): \|\theta_o - \theta\| \le \delta, \theta \in \Theta_n\}$, and for some constant $b > 0$, let$^{33}$
$$\delta_n = \inf\Big\{\delta \in (0, 1): \frac{1}{\sqrt{n}\,\delta^2}\int_{b\delta^2}^{\delta} \sqrt{H_{[\,]}(w, \mathcal{F}_n, \|\cdot\|_2)}\, dw \le \text{const.}\Big\}.$$
To calculate $\delta_n$, an upper bound on $H_{[\,]}(w, \mathcal{F}_n, \|\cdot\|_2)$ is often enough, and, fortunately for us, much of the work has already been done. For instance, according to Lemma 2.1 of Ossiander (1987) we have that $H_{[\,]}(w, \mathcal{F}_n, \|\cdot\|_2) \le H(w, \mathcal{F}_n, \|\cdot\|_\infty)$. Moreover, Condition 3.8 implies that
$$H_{[\,]}(w, \mathcal{F}_n, \|\cdot\|_2) \le \log N(w^{1/s}, \Theta_n, \|\cdot\|).$$
For finite-dimensional linear sieves such as those listed in Subsection 2.3.1 we have $\log N(\varepsilon, \Theta_n, \|\cdot\|) \le \text{const.}\, \dim(\Theta_n) \log(1/\varepsilon)$ [see e.g. Chen and Shen (1998)]; and for neural network and ridgelet nonlinear sieves we have $\log N(\varepsilon, \Theta_n, \|\cdot\|) \le \text{const.}\, \dim(\Theta_n) \log(\dim(\Theta_n)/\varepsilon)$ [see e.g. Chen and White (1999)].

THEOREM 3.2. Let $\hat\theta_n$ be the approximate sieve M-estimator defined by (3.1). If Conditions 3.6–3.8 hold, then $\|\theta_o - \hat\theta_n\| = O_P(\varepsilon_n)$, with $\varepsilon_n = \max\{\delta_n, \|\theta_o - \pi_n\theta_o\|\}$.

We note that $\delta_n$ increases with the complexity of the sieve $\Theta_n$ and can be interpreted as a measure of the standard deviation term, while the deterministic approximation error $\|\theta_o - \pi_n\theta_o\|$ decreases with the complexity of the sieve $\Theta_n$ and is a measure of the bias. The best convergence rate can be obtained by choosing the complexity of the sieve $\Theta_n$ such that $\delta_n \asymp \|\theta_o - \pi_n\theta_o\|$.

Chen and Shen (1998) have demonstrated how to apply the time series version of this theorem with three examples: first, they considered a multivariate nonparametric regression with either a neural network sieve, a wavelet sieve or a spline sieve; second, a partially additive time series model via spline and Fourier series sieves; and third, a transformation model with an unknown link via a monotone spline sieve. Chen and White (1999) considered a time series nonparametric conditional quantile regression via neural network sieve and multivariate conditional density estimation via neural network sieve. Chen and Conley (2001) applied this theorem to a varying coefficient VAR model with a flexible spatial conditional covariance. In the following we illustrate the verification of the conditions of Theorem 3.2 with two examples.

$^{33}$ There is a typo in Chen and Shen (1998, p. 297), where the "sup" in the definition of $\delta_n$ should be replaced by the "inf". Nevertheless, all the other calculations of $\delta_n$ in Chen and Shen (1998) are correct.

3.2.1. Example: Additive mean regression with a monotone constraint

Suppose that the i.i.d. data $\{Y_t, X_t = (X_{1t}, \ldots, X_{qt})'\}_{t=1}^n$ are generated according to
$$Y_t = h_{o1}(X_{1t}) + \cdots + h_{oq}(X_{qt}) + e_t, \qquad E[e_t|X_t] = 0.$$
Let $\theta_o = (h_{o1}, \ldots, h_{oq})' \in \Theta = \mathcal{H}$ be the parameters of interest with $\mathcal{H} = \mathcal{H}^1 \times \cdots \times \mathcal{H}^q$ to be specified in Assumption 3.1. For simplicity, we assume that $\dim(X_j) = 1$ for $j = 1, \ldots, q$, $\dim(X) = q$ and $\dim(Y) = 1$. We estimate the regression function $\theta_o(X_t) = \sum_{j=1}^q h_{oj}(X_{jt})$ by maximizing over a sieve $\Theta_n = \mathcal{H}_n$ the criterion $Q_n(\theta) = n^{-1}\sum_{t=1}^n l(\theta, Z_t)$, where $l(\theta, Z_t) = -(1/2)[Y_t - \sum_{j=1}^q h_j(X_{jt})]^2$ and $Z_t = (Y_t, X_t')'$. Let $\|\theta - \theta_o\|^2 = E(\theta(X_t) - \theta_o(X_t))^2 = E\{\sum_{j=1}^q [h_j(X_{jt}) - h_{oj}(X_{jt})]\}^2$.

ASSUMPTION 3.1.
(i) $h_{o1} \in \mathcal{H}^1 = C([b_{11}, b_{21}]) \cap \{h: \text{nondecreasing}\}$; (ii) for $j = 2, \ldots, q$, $h_{oj} \in \mathcal{H}^j = \Lambda_{c_j}^{p_j}([b_{1j}, b_{2j}])$ with $p_j > 1/2$; and $h_{oj}(x_j^*) = 0$ for some known $x_j^* \in (b_{1j}, b_{2j})$.

ASSUMPTION 3.2. $\sigma^2(X) \equiv E[e^2|X]$ is bounded.

Assumption 3.1(ii) is sufficient for identification, and Assumption 3.2 is a simple regularity condition that has been imposed in many papers; see e.g. Newey (1997).

The sieve will be chosen to have the form $\mathcal{H}_n = \mathcal{H}_n^1 \times \cdots \times \mathcal{H}_n^q$. First we let $\mathcal{H}_n^1$ be a shape-preserving sieve such as the monotone spline wavelet sieve $\text{MSplWav}(r_1 - 1, 2^{J_{1n}})$ with $r_1 \ge 1$ and $k_{1n} = 2^{J_{1n}}$ in Subsection 2.3.5. For $j = 2, \ldots, q$, we let $\mathcal{H}_n^j = \{h_j \in \Theta_{jn}: h_j(x_j^*) = 0, \|h_j\|_\infty \le c_j\}$ where $\Theta_{jn}$ can be any of the finite-dimensional linear sieve examples in Subsection 2.3.1 such as $\Theta_{jn} = \text{Pol}(k_{jn})$ or $\text{TriPol}(k_{jn})$ or $\text{Spl}(r_j, k_{jn})$ with $r_j \ge [p_j] + 1$, or $\text{Wav}(m_j, 2^{J_{jn}})$ with $m_j > p_j$ and $k_{jn} = 2^{J_{jn}}$. In the following result we denote $p_1 = 1$ and $p = \min\{p_1, p_2, \ldots, p_q\}$.

PROPOSITION 3.3. Let $\hat\theta_n$ be the sieve M-estimate. Suppose that Assumptions 3.1 and 3.2 hold. Let $k_{jn} = O(n^{1/(2p_j+1)})$ for $j = 1, \ldots, q$. Then $\|\hat\theta_n - \theta_o\| = O_P(n^{-p/(2p+1)})$ with $p = \min\{p_1, \ldots, p_q\}$.

PROOF. Theorem 3.2 is readily applicable to prove this result. It is easy to see that $K(\theta_o, \theta) \asymp \|\theta - \theta_o\|^2$. Condition 3.6 is assumed. Now we check Conditions 3.7 and 3.8. Since $l(\theta, Z_t) - l(\theta_o, Z_t) = (\theta - \theta_o)[e_t + (\theta_o - \theta)/2]$, we have
$$E[l(\theta, Z_t) - l(\theta_o, Z_t)]^2 \le 2E\{\sigma^2(X_t)[\theta_o(X_t) - \theta(X_t)]^2\} + (1/2)E[\theta_o(X_t) - \theta(X_t)]^4$$
$$\le \text{const.}\,\|\theta - \theta_o\|^2 + (1/2)E[\theta_o(X_t) - \theta(X_t)]^4.$$
By Theorem 1 of Gabushin (1967) when $p$ is an integer and Lemma 2 in Chen and Shen (1998) for any $p > 0$, we have $\|\theta - \theta_o\|_\infty \le c\|\theta - \theta_o\|^{2p/(2p+1)}$. Hence
$$E[\theta_o(X_t) - \theta(X_t)]^4 \le \sup_x |\theta(x) - \theta_o(x)|^2\, E[\theta_o(X_t) - \theta(X_t)]^2 \le C\|\theta - \theta_o\|^{2(1+[2p/(2p+1)])}.$$
So Condition 3.7 is satisfied for all $\varepsilon \le 1$. On the other hand,
$$|l(\theta, Z_t) - l(\theta_o, Z_t)| \le \|\theta - \theta_o\|_\infty \big(|e_t| + [\|\theta_o\|_\infty + \|\theta\|_\infty]/2\big) \quad \text{a.s.}$$
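Using Lemma 2 in Chen and Shen (1998) we see that Condition 3.8 is then satisfied with $s = 2p/(2p+1)$, $U(Z_t) = |e_t| + \text{const.}$ and $\gamma = 2$.

To apply Theorem 3.2, it remains to compute the deterministic approximation error rate $\|\theta_o - \pi_n\theta_o\|$ and the metric entropy with bracketing $H_{[\,]}(w, \mathcal{F}_n, \|\cdot\|_2)$ of the class $\mathcal{F}_n = \{l(\theta, Z_t) - l(\theta_o, Z_t): \|\theta - \theta_o\| \le \delta, \theta \in \Theta_n\}$. By definition, $\|\theta_o - \pi_n\theta_o\| \le \text{const.}\, \max\{\|h_{oj} - \pi_n h_{oj}\|_\infty: j = 1, \ldots, q\}$. Let $C = \sqrt{E\{U(Z_t)^2\}}$, then for all $0 < w/C \le \delta < 1$, $H_{[\,]}(w, \mathcal{F}_n, \|\cdot\|_2) \le \sum_{j=1}^q \log N(w/C, \mathcal{H}_n^j, \|\cdot\|_\infty)$.

The final bit of calculation now depends on the choice of sieves. First, $\|h_{o1} - \pi_n h_{o1}\|_\infty = O((k_{1n})^{-1})$ by Anastassiou and Yu (1992a); and for $j = 2, \ldots, q$, $\mathcal{H}^j = \Lambda_{c_j}^{p_j}$, $\|h_{oj} - \pi_n h_{oj}\|_\infty = O((k_{jn})^{-p_j})$ by Lorentz (1966). Second, for all $j = 1, 2, \ldots, q$, $\log N(w/C, \mathcal{H}_n^j, \|\cdot\|_\infty) \le \text{const.} \times k_{jn} \times \log(1 + 4c_j/w)$ by Lemma 2.5 in van de Geer (2000). Hence $\delta_n$ solves
$$\frac{1}{\sqrt{n}\,\delta_n^2}\int_{b\delta_n^2}^{\delta_n} \sqrt{H_{[\,]}(w, \mathcal{F}_n, \|\cdot\|_2)}\, dw \le \frac{1}{\sqrt{n}\,\delta_n^2} \max_{j=1,\ldots,q} \int_{b\delta_n^2}^{\delta_n} \sqrt{k_{jn} \times \log\Big(1 + \frac{4c_j}{w}\Big)}\, dw$$
$$\le \frac{1}{\sqrt{n}\,\delta_n^2} \max_{j=1,\ldots,q} \sqrt{k_{jn}} \times \delta_n \le \text{const.}$$
and the solution is $\delta_n \asymp \max_{j=1,\ldots,q} \sqrt{k_{jn}/n}$. By Theorem 3.2, $\|\hat\theta_n - \theta_o\| = O_P(\max_{j=1,\ldots,q}\{(k_{jn})^{-p_j}, \delta_n\})$. With the choice of $k_{jn} = O(n^{1/(2p_j+1)})$ for $j = 1, \ldots, q$, we obtain $\|\hat\theta_n - \theta_o\| = O_P(n^{-p/(2p+1)})$ with $p = \min\{p_1, \ldots, p_q\} > 0.5$. This immediately implies $\|\hat h_j - h_{oj}\|_2 = O_P(n^{-p/(2p+1)})$ for $j = 1, \ldots, q$.

REMARK 3.5. (1) Since the parameter space $\mathcal{H} = \mathcal{H}^1 \times \cdots \times \mathcal{H}^q$ specified in Assumption 3.1 is compact with respect to the norm $\|\cdot\|$, we can take the original parameter space $\mathcal{H}$ as the sieve space $\mathcal{H}_n$. Applying Theorem 3.2 again and noting that the approximation error $\|\pi_n\theta_o - \theta_o\| = 0$, we have $\|\hat\theta_n - \theta_o\| = O_P(\delta_n)$, where $\delta_n$ solves:
$$\frac{1}{\sqrt{n}\,\delta_n^2}\int_{b\delta_n^2}^{\delta_n} \sqrt{\sum_{j=1}^q \log N(w, \mathcal{H}^j, \|\cdot\|_\infty)}\, dw \le \frac{1}{\sqrt{n}\,\delta_n^2}\int_{b\delta_n^2}^{\delta_n} \sqrt{\sum_{j=1}^q \Big(\frac{c_j}{w}\Big)^{1/p_j}}\, dw \quad \text{by Birman and Solomjak (1967)}$$
$$\le \frac{1}{\sqrt{n}\,\delta_n^2} \max_{j=1,\ldots,q} \text{const.}\,(\delta_n)^{1 - \frac{1}{2p_j}} \le \text{const.}$$
which is satisfied if $\delta_n = O(n^{-p/(2p+1)})$ with $p = \min\{p_1, \ldots, p_q\} > 0.5$. However, it is unclear how one can implement such an optimization over the entire parameter space $\mathcal{H}$ given a finite data set.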
(2) Suppose that in Proposition 3.3 we replace Assumption 3.1(i) by $h_{o1} \in \Lambda_{c_1}^{p_1}([b_{11}, b_{21}])$ and let $\mathcal{H}_n^1 = \text{Pol}(k_{1n})$, or $\text{TriPol}(k_{1n})$, or $\text{Spl}(r_1, k_{1n})$ with $r_1 \ge [p_1] + 1$, or $\text{Wav}(m_1, 2^{J_{1n}})$ with $m_1 > p_1$, $2^{J_{1n}} = k_{1n}$. Let $p = \min\{p_1, \ldots, p_q\} > 0.5$. Then we have $\|\hat h_j - h_{oj}\|_2 = O_P(n^{-p/(2p+1)})$ for $j = 1, \ldots, q$. Further, let $\|D^m \hat h_j - D^m h_{oj}\|_2 = \{E[D^m \hat h_j(X_{jt}) - D^m h_{oj}(X_{jt})]^2\}^{1/2}$ for an integer $m \ge 1$. If $p > m \ge 1$ then $\|D^m \hat h_j - D^m h_{oj}\|_2 = O_P(k_{jn}^{-(p-m)}) = O_P(n^{-(p-m)/(2p+1)})$ for $j = 1, \ldots, q$. This convergence rate achieves the optimal one derived in Stone (1982).

3.2.2. Example: Multivariate quantile regression

Suppose that the i.i.d. data $\{Y_t, X_t\}_{t=1}^n$ are generated according to
$$Y_t = \theta_o(X_t) + e_t, \qquad P[e_t \le 0|X_t] = \alpha \in (0, 1),$$
where $X_t \in \mathcal{X} = R^d$, $d \ge 1$. We estimate the conditional quantile function $\theta_o(\cdot)$ by maximizing over $\Theta_n$ the criterion $Q_n(\theta) = n^{-1}\sum_{t=1}^n l(\theta, Y_t, X_t)$, where
$$l(\theta, Y_t, X_t) = \{1(Y_t < \theta(X_t)) - \alpha\}[Y_t - \theta(X_t)].$$
Let $\|\theta - \theta_o\|^2 = E(\theta(X_t) - \theta_o(X_t))^2$ and $W_1^1(\mathcal{X})$ be the Sobolev space defined in Subsection 2.3.3.

ASSUMPTION 3.3. $\theta_o \in \Theta = W_1^1(\mathcal{X})$.

ASSUMPTION 3.4. Let $f_{e|X}$ be the conditional density of $e_t$ given $X_t$ satisfying $0 < \inf_{x\in\mathcal{X}} f_{e|X=x}(0) \le \sup_{x\in\mathcal{X}} f_{e|X=x}(0) < \infty$ and $\sup_{x\in\mathcal{X}} |f_{e|X=x}(z) - f_{e|X=x}(0)| \to 0$ as $|z| \to 0$.

It is known that the tensor product of finite-dimensional linear sieves such as those in Subsection 2.3.1 will not be able to approximate functions in $W_1^m(\mathcal{X})$, $m \ge 1$, well, hence the sieve convergence rates based on those linear sieves will be slower than those based on nonlinear sieves; see e.g. Chen and Shen (1998, Proposition 1, Case 1.3(ii)) for such an example. For time series regression models, Chen and White (1999), Chen, Racine and Swanson (2001) have shown that neural network sieves lead to faster convergence rates for functions in $W_1^m(\mathcal{X})$. Thus we consider the following Gaussian radial basis ANN sieve $\Theta_n$ for the unknown $\theta_o \in W_1^1(\mathcal{X})$:
$$\Theta_n = \Big\{\alpha_0 + \sum_{j=1}^{k_n} \alpha_j G\Big(\frac{\{(x - \gamma_j)'(x - \gamma_j)\}^{1/2}}{\sigma_j}\Big),\ \sum_{j=0}^{k_n} |\alpha_j| \le c_0,\ |\gamma_j| \le c_1,\ 0 < \sigma_j \le c_2\Big\},$$
where $G$ is the standard Gaussian density function.

PROPOSITION 3.4. Let $\hat\theta_n$ be the sieve M-estimate. Suppose that Assumptions 3.3 and 3.4 hold. Let $k_n^{2(1+1/(d+1))} \log(k_n) = O(n)$. Then
$$\|\hat\theta_n - \theta_o\| = O_P\big([n/\log n]^{-(1+2/(d+1))/[4(1+1/(d+1))]}\big).$$

PROOF. Theorem 3.2 is readily applicable to prove this result. Condition 3.6 is directly assumed. By the above assumptions on the conditional density $f_{e|X}$, it is easy to check that $K(\theta_o, \theta) \asymp E(\theta(X_t) - \theta_o(X_t))^2$; see Chen and White (1999, pp. 686–687) for details. Now let us check Conditions 3.7 and 3.8. Since $|l(\theta, Y_t, X_t) - l(\theta_o, Y_t, X_t)| \le \max(\alpha, 1 - \alpha)|\theta(X_t) - \theta_o(X_t)|$, we have
$$\operatorname{Var}\big(l(\theta, Y_t, X_t) - l(\theta_o, Y_t, X_t)\big) \le E[l(\theta, Y_t, X_t) - l(\theta_o, Y_t, X_t)]^2 \le E[\theta(X_t) - \theta_o(X_t)]^2,$$
and thus Condition 3.7 is satisfied. Moreover, we have
$$\sup_{\{\theta\in\Theta_n:\, \|\theta - \theta_o\| \le \delta\}} |l(\theta, Y_t, X_t) - l(\theta_o, Y_t, X_t)| \le \sup_{\{\theta\in\Theta_n:\, \|\theta - \theta_o\| \le \delta\}} |\theta(X_t) - \theta_o(X_t)|,$$
and $\|\theta - \theta_o\|_\infty \le c\|\theta - \theta_o\|^{2/3}$ by Theorem 1 of Gabushin (1967). Hence, Condition 3.8 is satisfied with $s = 2/3$, $U(X_t) \equiv c$. Now by results in Chen, Racine and Swanson (2001), $\|\theta_o - \pi_n\theta_o\| \le \text{const.}\,(k_n)^{-1/2-1/(d+1)}$ and $\log N(w, \Theta_n, \|\cdot\|_\infty) \le \text{const.}\, k_n \log(k_n/w)$. With $k_n^{2(1+1/(d+1))} \log(k_n) = O(n)$, it is easy to see that $\|\hat\theta_n - \theta_o\| = O_P([n/\log n]^{-(1+2/(d+1))/[4(1+1/(d+1))]})$ by applying Theorem 3.2.

3.3. Convergence rates of series estimators

In this subsection we present the convergence rate of the series estimators for the concave extended linear models. Recall that in this framework, the parameter space, $\Theta$, is a linear space which is often a subspace of the space of square integrable functions, the sample criterion function $Q_n(\theta) = n^{-1}\sum_{i=1}^n l(\theta, Z_i)$ is concave in $\theta \in \Theta$ almost surely and the population criterion function $Q(\theta) = E[l(\theta, Z_i)]$ is strictly concave in $\theta \in \Theta$. The results reported here are largely based on those of Huang (1998a, 2001) and Newey (1997). Throughout this subsection, $\{Z_i\}_{i=1}^n$ is i.i.d.
and $\theta$ denotes a real-valued function with a bounded domain, $\mathcal{X} \subset R^d$. We use $\|\hat\theta - \theta_o\|$ to measure the discrepancy between $\hat\theta$ and $\theta_o$.

CONDITION 3.9. $\|\theta\| \asymp \|\theta\|_{2,leb}$ for any Lebesgue square-integrable function $\theta$.

In the multivariate LS regression of Example 2.4, $\theta_o(X) = E[Y|X]$, a natural choice for the norm is $\|\theta\| = \|\theta\|_2 = \{E[\theta(X)^2]\}^{1/2}$. If the density of $X$ is bounded away from zero and infinity, then Condition 3.9 is satisfied. In general a natural choice of the norm, $\|\cdot\|$, will depend on the specific application and on the data generating process. We impose the following condition on the linear sieve space.

CONDITION 3.10. The finite-dimensional linear sieve space, $\Theta_n$, is theoretically identifiable in the sense that any $\theta \in \Theta_n$ with $\|\theta\| = 0$ implies that $\theta(u) = 0$ everywhere.

Under Condition 3.9, Condition 3.10 is trivially satisfied by commonly used linear approximation spaces such as those given in Subsection 2.3.1.

CONDITION 3.11. $\theta_o = \arg\max_\Theta E[l(\theta, Z)]$ satisfies $\|\theta_o\|_\infty \le K_o < \infty$.

CONDITION 3.12. For any bounded functions $\theta_1, \theta_2 \in \Theta$, $E[l(\theta_1 + \tau(\theta_2 - \theta_1), Z)]$ is twice continuously differentiable with respect to $\tau \in [0, 1]$. For any constant $0 < K < \infty$,
$$\frac{\partial^2}{\partial\tau^2} E[l(\theta_1 + \tau(\theta_2 - \theta_1), Z)] \asymp -\|\theta_2 - \theta_1\|^2$$
for $\theta_1, \theta_2 \in \Theta$ with $\|\theta_1\|_\infty \le K$ and $\|\theta_2\|_\infty \le K$ and $0 \le \tau \le 1$.

Given the above conditions, we can define $\theta_n \equiv \arg\max_{\theta\in\Theta_n} E[l(\theta, Z)]$, and it is easy to see that $\|\theta_n - \theta_o\| \asymp \inf_{\theta\in\Theta_n} \|\theta - \theta_o\|$.

CONDITION 3.13. For any pair of functions $\theta_1, \theta_2 \in \Theta_n$, $l(\theta_1 + \tau(\theta_2 - \theta_1), Z)$ is twice continuously differentiable with respect to $\tau$. Moreover,
(i) $$\sup_{g\in\Theta_n} \frac{\big|\frac{\partial}{\partial\tau} l(\theta_n + \tau g, Z)|_{\tau=0}\big|}{\|g\|} = O_P\Big(\sqrt{\frac{\dim(\Theta_n)}{n}}\Big);$$
(ii) for any constant $0 < K < \infty$, there is a $c > 0$ such that
$$\frac{\partial^2}{\partial\tau^2} l(\theta_1 + \tau(\theta_2 - \theta_1), Z) \le -c\|\theta_2 - \theta_1\|^2$$
for any $\theta_1, \theta_2 \in \Theta_n$ with $\|\theta_1\|_\infty \le K$ and $\|\theta_2\|_\infty \le K$ and $0 \le \tau \le 1$, except on an event whose probability tends to zero as $n \to \infty$.

Denote $k_n = \dim(\Theta_n)$, $A_n \equiv \sup_{\theta\in\Theta_n,\, \|\theta\|_{2,leb} \ne 0} (\|\theta\|_\infty / \|\theta\|_{2,leb})$ and $\rho_{2n} \equiv \inf_{\theta\in\Theta_n} \|\theta - \theta_o\|_{2,leb}$. Under Conditions 3.9–3.11, we have $\rho_{2n} \asymp \inf_{\theta\in\Theta_n} \|\theta - \theta_o\|$.
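As a rough numerical check on the constant $A_n$ just defined, the sketch below evaluates it for a polynomial sieve on $[0, 1]$. It rests on assumptions that are ours rather than the chapter's: Pol(J) is taken to mean polynomials of degree at most J, the basis is the shifted Legendre basis (orthonormal under Lebesgue measure on $[0, 1]$), and the supremum over $x$ is taken on a finite grid that contains the endpoints. It uses the standard fact that, for a finite-dimensional linear space with an orthonormal basis $\{\phi_j\}$, the ratio $\sup_\theta \|\theta\|_\infty / \|\theta\|_{2,leb}$ equals $\sup_x \{\sum_j \phi_j(x)^2\}^{1/2}$. The function name `sup_l2_ratio_pol` is hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre


def sup_l2_ratio_pol(J, grid_size=2001):
    """Approximate A_n for polynomials of degree <= J on [0, 1].

    Assumption: the sup over x is evaluated on an equally spaced grid
    that includes both endpoints 0 and 1.
    """
    x = np.linspace(0.0, 1.0, grid_size)
    # Columns are P_0(2x-1), ..., P_J(2x-1); scaling by sqrt(2j+1) makes
    # them orthonormal with respect to Lebesgue measure on [0, 1].
    scale = np.sqrt(2.0 * np.arange(J + 1) + 1.0)
    basis = legendre.legvander(2.0 * x - 1.0, J) * scale
    # sup over theta of |theta(x)| / ||theta||_2 at a fixed x is the
    # Euclidean norm of the basis vector evaluated at x.
    return float(np.sqrt((basis ** 2).sum(axis=1)).max())


for J in (2, 5, 10):
    print(J, sup_l2_ratio_pol(J))  # each value is close to J + 1
```

Under these assumptions the ratio equals $J + 1$ and is attained at the endpoints, so the constant grows linearly in the polynomial degree.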
The following result is a special case of Huang (2001) for the sieve estimator of a concave extended linear model.

THEOREM 3.5. Suppose Conditions 3.9–3.13 hold. Let $\lim_{n\to\infty} A_n\rho_{2n} = 0$ and $\lim_{n\to\infty} A_n^2 k_n / n = 0$. Then the series estimator, $\hat\theta$, exists uniquely with probability approaching one as $n \to \infty$, and
$$\|\hat\theta - \theta_o\| = O_P\Big(\sqrt{\frac{k_n}{n}} + \rho_{2n}\Big).$$

This theorem could be regarded as a special case of Theorem 3.2 by taking $\delta_n \asymp \sqrt{k_n/n}$ and $\|\pi_n\theta_o - \theta_o\| \asymp \rho_{2n}$. To see this, first note that under Conditions 3.9–3.11 there is an essentially unique element $\pi_n\theta_o \in \Theta_n$ such that $\|\pi_n\theta_o - \theta_o\| = \inf_{\theta\in\Theta_n} \|\theta - \theta_o\|$, and $\|\pi_n\theta_o - \theta_o\| \asymp \|\pi_n\theta_o - \theta_o\|_{2,leb} \asymp \rho_{2n}$, which is the approximation error rate. Second, within the framework of concave extended linear models, for a finite-dimensional linear sieve $\Theta_n$ we have $\log N(w, \Theta_n, \|\cdot\|_\infty) \le \text{const.}\, k_n \log(1/w)$, hence $\delta_n \asymp \sqrt{k_n/n}$.

The constant $A_n \ge 1$ is a measure of irregularity of the finite-dimensional linear sieve space, $\Theta_n$. Since we require that $\Theta_n$ be theoretically identifiable and functions in $\Theta_n$ be bounded, $A_n$ is finite. In fact, let $\{\phi_j, j = 1, \ldots, k_n\}$ be an orthonormal basis of $\Theta_n$ relative to the theoretical inner product. Then, by the Cauchy–Schwarz inequality, $A_n \le \{\sum_{j=1}^{k_n} \|\phi_j\|_\infty^2\}^{1/2} < \infty$. It is obvious that $\|\theta\|_\infty \le A_n\|\theta\|_{2,leb}$ for all $\theta \in \Theta_n$. The linear sieve spaces are usually chosen to be among commonly used approximating spaces such as those described in Subsection 2.3.1 and the associated constant $A_n$ is readily obtained by using results in the approximation theory literature. Here are some examples.

Polynomials. If $\Theta_n = \text{Pol}(J_n)$ and $\mathcal{X} = [0, 1]$, then $A_n \asymp J_n$ [see Theorem 4.2.6 of DeVore and Lorentz (1993)].

Trigonometric polynomials. If $\Theta_n = \text{TriPol}(J_n)$ and $\mathcal{X} = [0, 1]$, then $A_n \asymp J_n^{1/2}$ [see Theorem 4.2.6 of DeVore and Lorentz (1993)].

Univariate splines. If $\Theta_n = \text{Spl}(r, J_n)$ and $\mathcal{X} = [0, 1]$, then $A_n \asymp J_n^{1/2}$ [see Theorem 5.1.2 of DeVore and Lorentz (1993)].

Orthogonal wavelets. If $\Theta_n = \text{Wav}(m, 2^{J_n})$ and $\mathcal{X} = [0, 1]$, then $A_n \asymp 2^{J_n/2}$ [see Lemma 2.8 of Meyer (1992)].

Tensor product spaces.
Let $\Theta_n$ be the tensor product of $\Theta_{n1}, \ldots, \Theta_{nd}$. The constant $A_n$ associated with the tensor product linear sieve space, $\Theta_n$, can be determined from the corresponding constants for its components. Set $a_{n\ell} = \sup_{\theta\in\Theta_{n\ell},\ \|\theta\|_{2,leb}\neq 0} \|\theta\|_\infty / \|\theta\|_{2,leb}$ for $1 \le \ell \le d$. It is shown in Huang (1998a) that $A_n \le \text{const.} \times \prod_{\ell=1}^{d} a_{n\ell}$.

We conclude this subsection with an application to the multivariate LS regression of Example 2.4.

ASSUMPTION 3.5.
(i) $X$ has a compact support $\mathcal{X}$ and has a density that is bounded away from zero and infinity on $\mathcal{X}$, where $\mathcal{X} \subset \mathbb{R}^d$ is a Cartesian product of compact intervals $\mathcal{X}_1, \ldots, \mathcal{X}_d$;
(ii) $\mathrm{Var}(Y|X = \cdot)$ is bounded on $\mathcal{X}$;
(iii) $h_o(\cdot) = E[Y|X = \cdot] \in \Lambda^p(\mathcal{X})$ with $p > d/2$.

Theorem 3.5 can treat a general finite-dimensional linear sieve space $\Theta_n$. For simplicity, however, we consider here only the case when the sieve space, $\Theta_n$, in Example 2.4 is constructed as a tensor product space of some commonly used univariate linear approximating spaces $\Theta_{n1}, \ldots, \Theta_{nd}$. Then $k_n = \dim(\Theta_n) = \prod_{\ell=1}^{d} \dim(\Theta_{n\ell})$.

PROPOSITION 3.6. Suppose Assumption 3.5 holds. Let $\hat h_n$ be the series estimate of $h_o$ in Example 2.4, with the sieve, $\Theta_n$, being the tensor product of the univariate sieve spaces $\Theta_{n1}, \ldots, \Theta_{nd}$. For $\ell = 1, \ldots, d$,
- if $\Theta_{n\ell} = \mathrm{Pol}(J_n)$, $p > d$ and $J_n^{3d}/n \to 0$, then $\|\hat h_n - h_o\| = O_P\big(\sqrt{J_n^d/n} + J_n^{-p}\big)$;
- if $\Theta_{n\ell} = \mathrm{TriPol}(J_n)$, $p > d/2$ and $J_n^{2d}/n \to 0$, then $\|\hat h_n - h_o\| = O_P\big(\sqrt{J_n^d/n} + J_n^{-p}\big)$;
- if $\Theta_{n\ell} = \mathrm{Spl}(r, J_n)$ with $r \ge [p] + 1$, $p > d/2$ and $J_n^{2d}/n \to 0$, then $\|\hat h_n - h_o\| = O_P\big(\sqrt{J_n^d/n} + J_n^{-p}\big)$.
Let $J_n = O(n^{1/(2p+d)})$, then $\|\hat h_n - h_o\| = O_P(n^{-p/(2p+d)})$.

We note that this proposition can also be obtained as a direct consequence of Theorem 1 in Newey (1997).[34] The choice of $J_n \asymp n^{1/(2p+d)}$ balances the variance ($J_n^d/n$) and the squared bias ($J_n^{-2p}$) trade-off: $J_n^d/n \asymp J_n^{-2p}$. The resulting rate of convergence $n^{-2p/(2p+d)}$ is actually optimal in the context of regression and density estimations: no estimate has a faster rate of convergence uniformly over the class of $p$-smooth functions [Stone (1982)]. The rate of convergence depends on two quantities: the specified smoothness $p$ of the target function $\theta_o$ and the dimension $d$ of the domain on which the target function is defined. Note the dependence of the rate of convergence on the dimension $d$: given the smoothness $p$, the larger the dimension, the slower the rate of convergence; moreover, the rate of convergence tends to zero as the dimension tends to infinity. This provides a mathematical description of a phenomenon commonly known as the "curse of dimensionality". Imposing additivity on an unknown multivariate function can imply faster rates of convergence of the corresponding estimate; see Subsection 3.2.1, Stone (1985, 1986), Andrews and Whang (1990), Huang (1998b) and Huang et al. (2000).

[34] Proposition 3.6 is about the convergence rates under $\|\cdot\|_2$-norm for LS regressions. There are also a few results on the convergence rates under $\|\cdot\|_\infty$-norm for LS regressions; see e.g. Stone (1982), Newey (1997) and de Jong (2002).

3.4. Pointwise asymptotic normality of series LS estimators

To date, we have a relatively complete theory on the rates of convergence for sieve M-estimators. The corresponding asymptotic distribution theory, however, is incomplete and requires much future work. All of the currently available results are for series estimators of densities and the LS regression functions. Asymptotic normality of the series LS estimators has been studied in Andrews (1991b), Gallant and Souza (1991), Newey (1994b, 1997), Zhou, Shen and Wolfe (1998), and Huang (2003). Stone (1990) and Strawderman and Tsiatis (1996) have given asymptotic normality results for polynomial spline estimators in the context of density estimation and hazard estimation, respectively.[35]

We focus on Example 2.4 throughout this subsection.
That is, we assume that the data $\{Z_i = (Y_i, X_i')'\}_{i=1}^n$ are i.i.d., and the parameter of interest, $\theta_o(\cdot) = h_o(\cdot) = E[Y|X = \cdot]$, is a real-valued regression function with a bounded domain $\mathcal{X} \subset \mathbb{R}^d$.

[35] See Portnoy (1997) for a closely related result on the asymptotic normality for smoothing spline quantile estimators.

3.4.1. Asymptotic normality of the spline series LS estimator

Here we present a result by Huang (2003) on the pointwise asymptotic normality of the spline series LS estimator.

ASSUMPTION 3.6.
(i) $\mathrm{Var}(Y|X = \cdot)$ is bounded away from zero on $\mathcal{X}$;
(ii) $\sup_{x\in\mathcal{X}} E\big[\{Y - h_o(X)\}^2 \times 1\{|Y - h_o(X)| > \lambda\} \,\big|\, X = x\big] \to 0$ as $\lambda \to \infty$.

In the following, $\Phi(\cdot)$ denotes the standard normal distribution function, and $SD(\hat h(x)|X_1, \ldots, X_n) = \{\mathrm{Var}(\hat h(x)|X_1, \ldots, X_n)\}^{1/2}$.

THEOREM 3.7. [See Huang (2003).] Suppose Assumptions 3.5 and 3.6 hold. Let $\hat h_n$ be the series estimate of $h_o$ in Example 2.4, with the sieve, $\Theta_n$, being the tensor product of the univariate spline sieve spaces $\Theta_{n\ell} = \mathrm{Spl}(r, J_n)$, $r \ge [p] + 1$, $1 \le \ell \le d$. If $\lim_{n\to\infty} J_n^d \log n / n = 0$ and $\lim_{n\to\infty} J_n / n^{1/(2p+d)} = \infty$, then
$$\Pr\big(\hat h(x) - h_o(x) \le t \times SD(\hat h(x)|X_1, \ldots, X_n)\big) \to \Phi(t), \quad t \in \mathbb{R}.$$

Asymptotic distribution results such as Theorem 3.7 can be used to construct asymptotic confidence intervals. Let $\widehat{SD}(\hat h(x)|X_1, \ldots, X_n)$ be a consistent estimate of $SD(\hat h(x)|X_1, \ldots, X_n)$; see Andrews (1991b) and Newey (1997) for such an estimate. Let $\hat h^l_\alpha(x) = \hat h(x) - z_{1-\alpha/2}\,\widehat{SD}(\hat h(x)|X_1, \ldots, X_n)$ and $\hat h^u_\alpha(x) = \hat h(x) + z_{1-\alpha/2}\,\widehat{SD}(\hat h(x)|X_1, \ldots, X_n)$, where $z_{1-\alpha/2}$ is the $(1 - \alpha/2)$th quantile of the standard normal distribution. If the conditions of Theorem 3.7 hold, then $[\hat h^l_\alpha(x), \hat h^u_\alpha(x)]$ is an asymptotic $1 - \alpha$ confidence interval of $h_o(x)$; that is, $\lim_{n\to\infty} P(\hat h^l_\alpha(x) \le h_o(x) \le \hat h^u_\alpha(x)) = 1 - \alpha$.

Recall that for the tensor product spline sieve $\Theta_n$, $k_n = \dim(\Theta_n) \asymp J_n^d$. If $h_o(\cdot)$ is $p$-smooth, then the tensor product spline sieve has the bias order $J_n^{-p} \asymp k_n^{-p/d}$.
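The confidence interval construction above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a univariate power-polynomial basis rather than the spline sieve of Theorem 3.7, and a heteroskedasticity-robust sandwich formula as the estimate of the conditional standard deviation. The function name, the basis and the variance formula are choices made for this sketch, not the chapter's prescription.

```python
import numpy as np


def series_ls_ci(x, y, x0, degree, z=1.959963984540054):
    """Series LS estimate of E[Y | X = x0] with a pointwise interval.

    Assumptions (for illustration only): scalar regressor, basis
    p(x) = (1, x, ..., x^degree)', and a sandwich estimate of
    Var(h_hat(x0) | X_1, ..., X_n). The default z is the 0.975 normal quantile.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    P = np.vander(x, degree + 1, increasing=True)            # n x k_n design
    p0 = np.vander(np.atleast_1d(float(x0)), degree + 1, increasing=True)[0]
    PtP_inv = np.linalg.pinv(P.T @ P)                         # generalized inverse
    beta = PtP_inv @ (P.T @ y)                                # series LS coefficients
    resid = y - P @ beta
    meat = P.T @ (P * resid[:, None] ** 2)                    # sum_i p_i p_i' e_i^2
    sd = float(np.sqrt(p0 @ PtP_inv @ meat @ PtP_inv @ p0))   # estimated SD at x0
    h_hat = float(p0 @ beta)
    return h_hat, h_hat - z * sd, h_hat + z * sd


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(0.0, 1.0, size=500)
    y = np.sin(2.0 * np.pi * x) + 0.3 * rng.standard_normal(500)
    print(series_ls_ci(x, y, x0=0.25, degree=5))
```

The interval is only meaningful when the number of basis terms is large enough for the approximation bias to be small relative to the standard deviation.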
The condition $\lim_{n\to\infty} J_n / n^{1/(2p+d)} = \infty$ in Theorem 3.7 implies that the bias term is asymptotically negligible relative to the standard deviation of the estimate. Such a condition, $\lim_{n\to\infty} k_n / n^{d/(2p+d)} = \infty$, is usually called undersmoothing (or overfitting); that is, the total number of sieve parameters ($k_n$) required for undersmoothing is more than what is required to achieve Stone's (1982) optimal rate of convergence.

3.4.2. Asymptotic normality of functionals of series LS estimator

We now review the asymptotic normality results in Newey (1997) for any series estimation of functionals of the LS regression function. Let $a : \Theta \to \mathbb{R}$ be a functional, and we want to estimate $a(h_o)$, where $h_o(\cdot) = E[Y|X = \cdot] \in \Theta$. Recall that $\hat h(\cdot) = p^{k_n}(\cdot)'(P'P)^{-} \sum_{i=1}^n p^{k_n}(X_i) Y_i$ is the series LS estimator of $h_o(\cdot)$, with $p^{k_n}(X)$ being the finite-dimensional linear sieve (2.10), see Example 2.4. Then $a(\hat h)$ will be a natural estimator for $a(h_o)$. Let $s \ge 0$ be an integer, and define a strong norm on $\Theta$ as $\|h\|_{s,\infty} = \max_{[\gamma]\le s} \sup_{x\in\mathcal{X}} |D^\gamma h(x)|$. Also, let $\zeta_0(k_n) \equiv \sup_{x\in\mathcal{X}} |p^{k_n}(x)|_e$, $\zeta_s(k_n) \equiv \max_{[\gamma]\le s} \sup_{x\in\mathcal{X}} |D^\gamma p^{k_n}(x)|_e$, where $|\cdot|_e$ is the Euclidean norm.

ASSUMPTION 3.7.
(i) $\mathrm{Var}(Y|X = \cdot)$ is bounded away from zero on $\mathcal{X}$; $\sup_{x\in\mathcal{X}} E[\{Y - h_o(X)\}^4 | X = x] < \infty$;
(ii) the smallest eigenvalue of $E[p^{k_n}(X) p^{k_n}(X)']$ is bounded away from zero uniformly in $k_n$;
(iii) for an integer $s \ge 0$ there are $\alpha > 0$, $\beta^*_{k_n}$ such that $\inf_{g\in\Theta_n} \|g - h_o\|_{s,\infty} = \|p^{k_n}(\cdot)'\beta^*_{k_n} - h_o(\cdot)\|_{s,\infty} = O(k_n^{-\alpha})$.

ASSUMPTION 3.8. Either
(i) $\lim_{n\to\infty} k_n \{\zeta_0(k_n)\}^2 / n = 0$, and $a(h)$ is linear in $h \in \Theta$; or
(ii) for $s$ as in Assumption 3.7, $\lim_{n\to\infty} k_n^2 \{\zeta_s(k_n)\}^4 / n = 0$, and there exists a function $D(h; \tilde h)$ that is linear in $h \in \Theta$ such that for some $c_1, c_2, \varepsilon > 0$ and for all $\tilde h, \bar h$ with $\|\tilde h - h_o\|_{s,\infty} < \varepsilon$, $\|\bar h - h_o\|_{s,\infty} < \varepsilon$, it is true that
$$|a(h) - a(\tilde h) - D(h - \tilde h; \tilde h)| \le c_1 \big(\|h - \tilde h\|_{s,\infty}\big)^2 \quad\text{and}\quad |D(h; \tilde h) - D(h; \bar h)| \le c_2 \|h\|_{s,\infty} \|\tilde h - \bar h\|_{s,\infty}.$$

ASSUMPTION 3.9.
(i) There is a positive constant $c$ such that $|D(h; h_o)| \le c\|h\|_{s,\infty}$ for $s$ from Assumption 3.7;
(ii) there is an $h_n \in \Theta_n$ such that $E[h_n(X)^2] \to 0$ but $D(h_n; h_o)$ is bounded away from zero.

Assumption 3.7(iii) is a condition on the sieve approximation error under the strong norm $\|h\|_{s,\infty}$. Assumption 3.8 implies that $a(h)$ is Fréchet differentiable in $h$ with respect to the norm $\|h\|_{s,\infty}$. Assumption 3.9 says that the derivative $D(h; h_o)$ is continuous in the norm $\|h\|_{s,\infty}$ but not in the mean-square norm $\|h\|_2 = \{E[h(X)^2]\}^{1/2}$. The lack of mean-square continuity will imply that the estimator $a(\hat h)$ is not $\sqrt{n}$-consistent for $a(h_o)$; see Newey (1997) for detailed discussions.

In the following we denote $\Sigma = E[p^{k_n}(X) p^{k_n}(X)' \mathrm{Var}(Y|X)]$,
$$A = \frac{\partial a(p^{k_n}(X)'\beta)}{\partial\beta}\bigg|_{\beta = \beta^*_{k_n}} \quad\text{and}\quad V_{k_n} = A' \big\{E[p^{k_n}(X) p^{k_n}(X)']\big\}^{-1} \Sigma \big\{E[p^{k_n}(X) p^{k_n}(X)']\big\}^{-1} A.$$
We let $\xrightarrow{d}$ denote convergence in distribution and $N(0, 1)$ denote a scalar random variable drawn from a standard normal distribution.

THEOREM 3.8. [See Newey (1997).] Suppose Assumptions 3.5(i)(ii), 3.7–3.9 hold. Let $\hat h_n$ be the series estimate of $h_o$ in Example 2.4, with the sieve $\Theta_n$ being the linear sieve (2.10). If $\lim_{n\to\infty} \sqrt{n}\, k_n^{-\alpha} = 0$, then
$$\sqrt{n}\, V_{k_n}^{-1/2} \big[a(\hat h) - a(h_o)\big] \xrightarrow{d} N(0, 1).$$

We note that for the linear functional $a(h_o) = h_o(x)$, this theorem implies pointwise asymptotic normality of any series LS estimators $\hat h(x)$ satisfying Assumptions 3.5(i)(ii), 3.7, 3.8(i) and 3.9(ii). When we specialize this theorem further to the tensor product spline series estimator of $h_o(x)$, then Assumption 3.8(i) requires $\lim_{n\to\infty} k_n^2 / n = 0$, which is stronger than the condition $\lim_{n\to\infty} k_n \log n / n = 0$ in Theorem 3.7. However, Theorem 3.7 is applicable only to the spline series LS estimator, while the results by Newey (1994b, 1997) are much more general.

The normality results reported in this section are only valid for i.i.d. data; see Andrews (1991b) for asymptotic normality of linear functionals of the series LS estimators based on time series dependent observations.

4.
Large sample properties of sieve estimation of parametric parts in semiparametric models

In the general sieve extremum estimation framework of Section 2, a model typically contains a parameter vector $\theta = (\beta, h)$, where $\beta$ is a vector of finite-dimensional parameters and $h$ is a vector of infinite-dimensional parameters. When both $\beta$ and $h$ are parameters of interest we call the model "semi-nonparametric". When $h$ is a vector of nuisance parameters, then, following Powell (1994) and others, we will call the model "semiparametric".

For weakly dependent observations, semiparametric models can be classified into two categories: (i) $\beta$ cannot be estimated at a $\sqrt{n}$-rate, i.e., $\beta$ has zero information bound; see van der Vaart (1991); and (ii) $\beta$ can be estimated at a $\sqrt{n}$-rate. Models belonging to category (i) should be correctly viewed as nonparametric. However, since these models can still be estimated by the method of sieves, the general sieve convergence rate results can be applied to derive slower than $\sqrt{n}$-rates for the sieve estimates of $\beta$. To date there is little research about whether or not the sieve estimate of $\beta$ can reach the optimal convergence rate and what its limiting distribution is. It is worth mentioning that for Heckman and Singer's (1984) model, Ishwaran (1996a) established that the $\beta$-parameters cannot be estimated at $\sqrt{n}$-rate, while Ishwaran (1996b) constructed another estimator of $\beta$ that converges at the optimal rate but is not a sieve MLE. Prior to the work of Ishwaran (1996a, 1996b), Honoré (1990, 1994) proposed a clever estimator of $\beta$ that is not a sieve MLE either and computed its convergence rate. It is still an open question whether or not Heckman and Singer's (1984) sieve MLE estimator could reach Ishwaran's optimal rate.[36]

There is a large literature on semiparametric estimation of $\beta$ for models belonging to category (ii); see Bickel et al.
(1993), Newey and McFadden (1994), Powell (1994), Horowitz (1998) and Pagan and Ullah (1999) for reviews. Most of these results are derived using the so-called two-step procedure: Step one estimates $h$ nonparametrically by $\hat h$, while step two estimates $\beta$ via either M-estimation, GMM or more generally, MD-estimation with the unknown $h$ replaced by $\hat h$. A few general results deal with the simultaneous estimation of $\beta$ and $h$. For example, the sieve simultaneous procedure jointly estimates $\beta$ and $h$ by maximizing a sample criterion function $Q_n(\beta, h)$ over the sieve parameter space $\Theta_n = B \times \mathcal{H}_n$. The earlier applications of sieve MLE in econometrics, such as the papers by Duncan (1986) and Gallant and Nychka (1987), took this approach.

In Subsection 4.1 we review existing theory on the $\sqrt{n}$-asymptotic normality of the two-step estimates of $\beta$. In Subsection 4.2, we present recent advances on the $\sqrt{n}$-asymptotic normality and efficiency of the sieve simultaneous M-estimates of $\beta$. In Subsection 4.3, we mention the $\sqrt{n}$-asymptotic normality and efficiency of the simultaneous sieve MD estimates of $\beta$.

[36] There are other important results in econometrics about specific models belonging to category (i). For example, Manski (1985) proposed a maximum score estimator of a binary choice model with zero median restriction; Kim and Pollard (1990) derived the $n^{1/3}$ consistency of Manski's estimator; Horowitz (1992) proposed a smoothed maximum score estimator for Manski's model, and proved that his smoothed estimator converges faster than $n^{1/3}$ and is asymptotically normal; Andrews and Schafgans (1998) proposed a slower than $\sqrt{n}$ rate kernel estimator of the intercept in Heckman's sample selection model; Honoré and Kyriazidou (2000) developed a slower than $\sqrt{n}$ rate kernel estimator of a discrete choice dynamic panel data model. See Powell (1994), Horowitz (1998), Pagan and Ullah (1999) for more examples.

4.1.
Semiparametric two-step estimators

There are several general theory papers in econometrics about the semiparametric two-step procedure. Andrews (1994b) proposed the MINPIN estimator of $\beta$, which is the extremum estimator of $\beta$ where the empirical criterion function depends on the first step nonparametric estimator of $h$. Andrews (1994b) also provided a set of relatively high level conditions to ensure the $\sqrt{n}$-normality of his MINPIN estimator of $\beta$. Ichimura and Lee (2006) presented a set of relatively low level conditions to ensure the $\sqrt{n}$-normality of the semiparametric two-step M-estimator of $\beta$. Newey (1994a), Pakes and Olley (1995), and Chen, Linton and van Keilegom (2003) have studied the properties of the semiparametric two-step GMM estimator of $\beta$. In addition to providing a general way to compute the asymptotic variance of the second step $\beta$ estimate, Newey (1994a) showed that the second stage estimation of $\beta$ and its asymptotic variance do not depend on the particular choice of the nonparametric estimation technique in the first step, but only depend on the convergence rate of the first step estimation.

4.1.1. Asymptotic normality

In the following we state two results which are slight modifications of those in Chen, Linton and van Keilegom (2003), in which the empirical criterion function can be nonsmooth with respect to both $\beta$ and $h$. Let $M : B \times \mathcal{H} \to \mathbb{R}^{d_m}$ be a nonrandom, vector-valued measurable function, where $B$ is a compact subset in $\mathbb{R}^{d_\beta}$ with $d_m \ge d_\beta$. The identifying assumption is that $M(\beta, h_o(\cdot, \beta)) = 0$ at $\beta = \beta_o \in B$ and $M(\beta, h_o(\cdot, \beta)) \ne 0$ for all $\beta \ne \beta_o$. We denote $\beta_o \in B$ and $h_o \in \mathcal{H}$ as the true unknown finite- and infinite-dimensional parameters, where the function $h_o \in \mathcal{H}$ can depend on the parameters $\beta$ and the data $Z$. We usually suppress the arguments of the function $h_o$ for notational convenience; thus: $(\beta, h) \equiv (\beta, h(\cdot, \beta))$, $(\beta, h_o) \equiv (\beta, h_o(\cdot, \beta))$ and $(\beta_o, h_o) \equiv (\beta_o, h_o(\cdot, \beta_o))$. We assume that $\mathcal{H}$ is a vector space of functions endowed with a pseudo-metric $\|\cdot\|_{\mathcal{H}}$, which is a sup-norm metric with respect to the $\beta$-argument and a pseudo-metric with respect to all the other arguments.

Suppose that there also exists a random vector-valued function $M_n : B \times \mathcal{H} \to \mathbb{R}^{d_m}$ depending on the data $\{Z_i : i = 1, \ldots, n\}$, such that $M_n(\beta, h_o)' W M_n(\beta, h_o)$ is close to $M(\beta, h_o)' W M(\beta, h_o)$ for some symmetric positive-definite matrix $W$. Suppose that for each $\beta$ there is an initial nonparametric estimator $\hat h(\cdot)$ for $h_o(\cdot)$. Denote $W_n$ as a possibly random weighting matrix such that $W_n - W = o_P(1)$. Then $\beta_o$ can be estimated by $\hat\beta$, which solves the sample minimum distance problem[37]:
$$\min_{\beta\in B} M_n(\beta, \hat h)' W_n M_n(\beta, \hat h). \tag{4.1}$$

For any $\beta \in B$, we say that $M(\beta, h)$ is pathwise differentiable at $h$ in the direction $[\bar h - h]$ if $\{h + \tau(\bar h - h) : \tau \in [0, 1]\} \subset \mathcal{H}$ and $\lim_{\tau\to 0} [M(\beta, h + \tau(\bar h - h)) - M(\beta, h)]/\tau$ exists; we denote the limit by $\Gamma_2(\beta, h)[\bar h - h]$.

THEOREM 4.1. Suppose that $\beta_o \in \mathrm{int}(B)$ satisfies $M(\beta_o, h_o) = 0$, that $\hat\beta - \beta_o = o_P(1)$, $W_n - W = o_P(1)$, and that:

(4.1.1) $\|M_n(\hat\beta, \hat h)\| = \inf_{\|\beta - \beta_o\| \le \delta_n} \|M_n(\beta, \hat h)\| + o_P(1/\sqrt{n})$ for some positive sequence $\delta_n = o(1)$.

(4.1.2) (i) The ordinary partial derivative $\Gamma_1(\beta, h_o)$ in $\beta$ of $M(\beta, h_o)$ exists in a neighborhood of $\beta_o$, and is continuous at $\beta = \beta_o$; (ii) the matrix $\Gamma_1 \equiv \Gamma_1(\beta_o, h_o)$ is such that $\Gamma_1' W \Gamma_1$ is nonsingular.

(4.1.3) The pathwise derivative $\Gamma_2(\beta, h_o)[h - h_o]$ of $M(\beta, h_o)$ exists in all directions $[h - h_o]$ and satisfies: $\|\Gamma_2(\beta, h_o)[h - h_o] - \Gamma_2(\beta_o, h_o)[h - h_o]\| \le \|\beta - \beta_o\| \times o(1)$ for all $\beta$ with $\|\beta - \beta_o\| = o(1)$, all $h$ with $\|h - h_o\|_{\mathcal{H}} = o(1)$.

Either

(4.1.4) $\|M(\beta, \hat h) - M(\beta, h_o) - \Gamma_2(\beta, h_o)[\hat h - h_o]\| = o_P(n^{-1/2})$ for all $\beta$ with $\|\beta - \beta_o\| = o(1)$; or

(4.1.4)′ (i) there are some constants $c \ge 0$, $\epsilon \in (0, 1]$ such that $\|M(\beta, h) - M(\beta, h_o) - \Gamma_2(\beta, h_o)[h - h_o]\| \le c\|h - h_o\|_{\mathcal{H}}^{1+\epsilon}$ for all $\beta$ with $\|\beta - \beta_o\| = o(1)$ and all $h$ with $\|h - h_o\|_{\mathcal{H}} = o(1)$; and (ii) $c\|\hat h - h_o\|_{\mathcal{H}}^{1+\epsilon} = o_P(n^{-1/2})$.

(4.1.5) For all sequences of positive numbers $\{\delta_n\}$ with $\delta_n = o(1)$,
$$\sup_{\|\beta - \beta_o\| < \delta_n,\ \|h - h_o\|_{\mathcal{H}} < \delta_n} \frac{\|M_n(\beta, h) - M(\beta, h) - M_n(\beta_o, h_o)\|}{n^{-1/2} + \|M_n(\beta, h)\| + \|M(\beta, h)\|} = o_P(1).$$
(4.1.6) For some finite matrix $V_1$, $\sqrt{n}\{M_n(\beta_o, h_o) + \Gamma_2(\beta_o, h_o)[\hat h - h_o]\} \xrightarrow{d} N[0, V_1]$.

Then $\sqrt{n}(\hat\beta - \beta_o) \xrightarrow{d} N[0, (\Gamma_1' W \Gamma_1)^{-1} \Gamma_1' W V_1 W \Gamma_1 (\Gamma_1' W \Gamma_1)^{-1}]$.

[37] See Theorem 1 in Chen, Linton and van Keilegom (2003) for the consistency property of $\hat\beta - \beta_o = o_P(1)$.

REMARK 4.1. This theorem can be established by following the proof of Theorem 2 in Chen, Linton and van Keilegom (2003). Note that condition (4.1.4) is implied by condition (4.1.4)′, while condition (4.1.4)′ with $\epsilon = 1$ becomes the one imposed in Newey (1994a) and Chen, Linton and van Keilegom (2003). When $M(\beta, h)$ is highly nonlinear in $h$ and/or when the argument "$\cdot$" of $h(\cdot, \beta)$ has unbounded support, then condition (4.1.4)′(i) with $\epsilon = 1$ may fail to hold, but condition (4.1.4)′ with $0 < \epsilon < 1$ is typically satisfied. See Chen, Hong and Tarozzi (2007) for such an example in the two-step GMM estimation for nonclassical measurement error models and missing data problems. Of course a smaller $\epsilon$ has to be compensated by a faster rate of convergence of $\hat h$ to $h_o$ in condition (4.1.4)′(ii): $\|\hat h - h_o\|_{\mathcal{H}} = o_P(n^{-1/[2(1+\epsilon)]})$. In the extreme case when $\|\hat h - h_o\|_{\mathcal{H}} = O_P(n^{-1/2})$, which is the case if $h_o$ is a probability distribution function, then condition (4.1.4) is implied by condition (4.1.4)″: (i) $\|M(\beta, h) - M(\beta, h_o) - \Gamma_2(\beta, h_o)[h - h_o]\| = \|h - h_o\|_{\mathcal{H}} \times o(1)$ for all $\beta$ with $\|\beta - \beta_o\| = o(1)$ and all $h$ with $\|h - h_o\|_{\mathcal{H}} = o(1)$; and (ii) $\|\hat h - h_o\|_{\mathcal{H}} = O_P(n^{-1/2})$.

Many econometric models correspond to $M(\beta, h) = E[m(Z_i, \beta, h)]$, $M_n(\beta, h) = n^{-1} \sum_{i=1}^n m(Z_i, \beta, h)$, where $m : \mathbb{R}^{d_z} \times B \times \mathcal{H} \to \mathbb{R}^{d_m}$ is a measurable, vector-valued function such that $E[m(Z_i, \beta, h_o(\cdot, \beta))] = 0$ if and only if $\beta = \beta_o \in B$, a subset of $\mathbb{R}^{d_\beta}$. In this situation, Theorem 3 in Chen, Linton and van Keilegom (2003) provides a set of easily verifiable sufficient conditions for the stochastic equicontinuity condition (4.1.5) with i.i.d. data $\{Z_i\}$. The following lemma extends their result to strictly stationary processes.
Let $\mathcal{F} = \{m(z, \beta, h) : \beta \in B, h \in \mathcal{H}\}$ denote the class of measurable functions indexed by $(\beta, h)$, and $H_{[\,]}(w, \mathcal{F}, \|\cdot\|_r)$ be the $L_r(P_o)$-metric entropy with bracketing of the class $\mathcal{F}$.

LEMMA 4.2. Suppose that $\{Z_t : t \ge 1\}$ is strictly stationary, that $M(\beta, h) = E[m(Z_t, \beta, h)]$ and $M_n(\beta, h) = n^{-1} \sum_{t=1}^n m(Z_t, \beta, h)$, and that the arguments of the $h(\cdot)$ in $m(Z_t, \beta, h(\cdot))$ only depend on $\beta$ and finitely many $Z_t$. Suppose that each component $m_j$ of $m = (m_1, \ldots, m_{d_m})'$ satisfies:

(4.2.1) $m_j(\cdot, \beta, h)$ is locally uniformly $L_r(P_o)$-continuous with respect to $\beta, h$ in the sense:
$$\bigg(E\bigg[\sup_{(\beta', h'):\ \|\beta' - \beta\| < \delta,\ \|h' - h\|_{\mathcal{H}} < \delta} |m_j(Z, \beta', h') - m_j(Z, \beta, h)|^r\bigg]\bigg)^{1/r} \le K_j \delta^{s_j}$$
for all $(\beta, h) \in B \times \mathcal{H}$, all small positive value $\delta = o(1)$, and for some constants $s_j \in (0, 1]$, $K_j > 0$ and $r \ge 1$.

Then: (i) $H_{[\,]}(w, \mathcal{F}_j, \|\cdot\|_r) \le \log N([\tfrac{w}{2K_j}]^{1/s_j}, B, \|\cdot\|) + \log N([\tfrac{w}{2K_j}]^{1/s_j}, \mathcal{H}, \|\cdot\|_{\mathcal{H}})$ for $j = 1, \ldots, d_m$.

Furthermore, suppose that

(4.2.2) $B$ is a compact subset of $\mathbb{R}^{d_\beta}$, and $\int_0^\infty \sqrt{\log N(\varepsilon^{1/s_j}, \mathcal{H}, \|\cdot\|_{\mathcal{H}})}\, d\varepsilon < \infty$ for $j = 1, \ldots, d_m$.

(4.2.3) Either $\{Z_t\}_{t=1}^n$ is i.i.d. and (4.2.1) holds with $r \ge 2$, or $\{Z_t\}_{t=1}^n$ is beta-mixing with a mixing decay rate satisfying $\sum_{t=1}^\infty t^{2/(r-2)} \beta_t < \infty$ for some $r > 2$, and (4.2.1) holds with $r > 2$.

Then: (ii) for all positive $\delta_n$ with $\delta_n = o(1)$,
$$\sup_{\|\beta - \beta_o\| < \delta_n,\ \|h - h_o\|_{\mathcal{H}} < \delta_n} \big\|M_n(\beta, h) - M(\beta, h) - \{M_n(\beta_o, h_o) - M(\beta_o, h_o)\}\big\| = o_P(n^{-1/2}). \tag{4.2}$$

PROOF. Result (i) is already derived in the proof of Theorem 3 in Chen, Linton and van Keilegom (2003). Result (ii) for the i.i.d. case is Theorem 3 of Chen, Linton and van Keilegom (2003). Now for the stationary beta-mixing case, conditions (4.2.1)–(4.2.3) imply that $\int_0^\infty \sqrt{H_{[\,]}(w, \mathcal{F}, \|\cdot\|_r)}\, dw < \infty$ with $r > 2$. This and $\sum_{t=1}^\infty t^{2/(r-2)} \beta_t < \infty$ imply that all the assumptions in Doukhan, Massart and Rio (1995) for the Donsker theorem on stationary beta-mixing are satisfied, which in turn implies the stochastic equicontinuity (4.2) result (ii).
Both Theorem 3 in Chen, Linton and van Keilegom (2003) and Lemma 4.2 are extensions of the "type II class" and "type IV class" defined in Andrews (1994a) from $\beta \in B$ to $(\beta, h) \in B \times \mathcal{H}$. Condition (4.2.1) allows for discontinuous moment functions in $(\beta, h)$ such as sign and indicator functions of $(\beta, h)$.

Given the results of Newey (1994a), Chen, Linton and van Keilegom (2003) and Theorem 4.1, the choice of estimation of $h$ in the first step should mainly depend on the ease of implementation. Recently, for the partially linear quantile regression $Y_t = X_{0t}'\beta_o + h_o(X_{1t}) + e_t$, $P[e_t \le 0 | X_t] = \alpha \in (0, 1)$, Lee (2003) proposed a two-step, $\sqrt{n}$ asymptotically normal and efficient estimator of $\beta$, where the first step involved a high-dimensional kernel quantile regression of $Y_t$ on $X = (X_0', X_1')'$. Chen, Linton and van Keilegom (2003) considered a modification of Lee's model to a partially linear quantile regression with some endogenous regressors, and proposed another $\sqrt{n}$ asymptotically normal estimator of $\beta$ by two-step GMM where the first step nonparametric estimation only involves $h(X_{1t})$. We can extend their models further to a partially additive quantile regression:
$$Y_t = X_{0t}'\beta_o + \sum_{j=1}^q h_{oj}(X_{jt}) + e_t, \quad P[e_t \le 0 | X_t] = \alpha \in (0, 1).$$
If $h_{o1}, \ldots, h_{oq}$ were known, then $\beta_o$ could be estimated based on the moment restriction $E[m(Z_t, \beta, h_o)] = 0$ iff $\beta = \beta_o$ with $m(Z_t, \beta, h_o) = X_{0t}\{\alpha - 1(Y_t \le X_{0t}'\beta + \sum_{j=1}^q h_{oj}(X_{jt}))\}$. Clearly, to estimate $\beta$ by semiparametric two-step GMM using the sample moment $n^{-1} \sum_{t=1}^n m(Z_t, \beta, \hat h)$, it would be much easier if $\hat h = (\hat h_1, \ldots, \hat h_q)$ were a sieve estimate, say obtained by $\max_{h\in\mathcal{H}_n} Q_n(\beta, h) = n^{-1} \sum_{t=1}^n l(\beta, h, Y_t, X_t)$ for any arbitrarily fixed $\beta$, where
$$l(\beta, h, Y_t, X_t) = \bigg[1\bigg\{Y_t < X_{0t}'\beta + \sum_{j=1}^q h_j(X_{jt})\bigg\} - \alpha\bigg]\bigg[Y_t - X_{0t}'\beta - \sum_{j=1}^q h_j(X_{jt})\bigg],$$
and $\mathcal{H}_n = \mathcal{H}_n^1 \times \cdots \times \mathcal{H}_n^q$ as in Subsection 3.2.1, rather than a high-dimensional kernel quantile regression.
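The first-step criterion above is the negative of the usual quantile "check" loss, so maximizing it is a quantile regression. A small sanity check, under simplifying assumptions that are not in the text: the fitted index is restricted to a single constant (no regressors, no sieve terms), in which case the maximizer should be a sample $\alpha$-quantile. The function names below are hypothetical.

```python
import numpy as np


def sieve_quantile_criterion(y, fitted, alpha):
    """Sample average of [1{y < fitted} - alpha] * (y - fitted).

    This equals minus the average check loss, so larger is better.
    `fitted` stands in for X0'beta + sum_j h_j(X_j); here it may be a scalar.
    """
    u = np.asarray(y, dtype=float) - np.asarray(fitted, dtype=float)
    return float(np.mean(((u < 0).astype(float) - alpha) * u))


def best_constant(y, alpha):
    """Maximize the criterion over constants, searching the observed values."""
    y = np.asarray(y, dtype=float)
    values = [sieve_quantile_criterion(y, c, alpha) for c in y]
    return float(y[int(np.argmax(values))])


if __name__ == "__main__":
    y = np.arange(1.0, 102.0)              # the values 1, 2, ..., 101
    print(best_constant(y, 0.25), np.quantile(y, 0.25))
    print(best_constant(y, 0.50), np.quantile(y, 0.50))
```

In an actual sieve implementation the constant would be replaced by a linear combination of basis functions, and the maximization would typically be carried out by linear programming rather than a grid search.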
Andrews (1994b), Newey (1994a, 1994b), Newey, Powell and Vella (1999) and Das, Newey and Vella (2003) have made the same recommendation in the context of two-step estimation with an additive LS regression in the first step.

There is also a large literature on the general theory of efficient estimation of $\beta$ via various two-step procedures. For instance, the profile MLE estimation of $\beta$ [which can be viewed as an important subclass of Andrews' (1994b) MINPIN procedure] can lead to efficient estimation of $\beta$; see e.g. Severini and Wong (1992), Ai (1997) and Murphy and van der Vaart (2000). Other two-step procedures which lead to the efficient estimation of $\beta$ include those based on the efficient score equation approach; see Bickel et al. (1993) and Newey (1990a), and the optimally weighted GMM approach; see Newey (1990a, 1990b, 1993). See Powell (1994) and Pagan and Ullah (1999) for other examples. Clearly, these efficient procedures can be combined with a first step nonparametric estimation of $h$ via the method of sieves.

4.2. Sieve simultaneous M-estimation

There are few general theory papers about the sieve simultaneous M-estimation of $\beta$ and $h$; see Wong and Severini (1991), Shen (1997), Chen and Shen (1998). This procedure jointly estimates $\beta$ and $h$ by maximizing a sample criterion function $Q_n(\beta, h)$ over the sieve parameter space $\Theta_n = B \times \mathcal{H}_n$, where $Q_n(\beta, h)$ takes a sample average form $\frac{1}{n} \sum_{i=1}^n l(\beta, h, Z_i)$. Wong and Severini (1991) established $\sqrt{n}$-asymptotic normality and efficiency of smooth functionals of nonparametric MLE with parameter space $\Theta_n \equiv \Theta = B \times \mathcal{H}$. Shen (1997) extended their results to sieve MLE and to allow for highly curved (nonlinear) least favorable directions. Chen and Shen (1998) extend the result of Shen (1997) to general sieve M-estimation with stationary weakly dependent data.

4.2.1. Asymptotic normality of smooth functionals of sieve M-estimators

Let $\hat\theta_n = (\hat\beta_n, \hat h_n) = \arg\max_{(\beta, h)\in B\times\mathcal{H}_n} \frac{1}{n} \sum_{i=1}^n l(\beta, h, Z_i)$ denote the sieve M-estimate of $\theta_o = (\beta_o, h_o)$. In this subsection we present a simple $\sqrt{n}$-asymptotic normality theorem for the plug-in estimate of a smooth functional of $\theta_o$. See Shen (1997) and Chen and Shen (1998) for the general version.

Suppose that $\Theta = B \times \mathcal{H}$ is convex in $\theta_o$ so that $\theta_o + \tau[\theta - \theta_o] \in \Theta$ for all small $\tau \in [0, 1]$ and for all fixed $\theta \in \Theta$. Suppose that the directional derivative
$$\frac{\partial l(\theta_o, z)}{\partial\theta}[\theta - \theta_o] \equiv \lim_{\tau\to 0} \frac{l(\theta_o + \tau[\theta - \theta_o], z) - l(\theta_o, z)}{\tau}$$
is well defined for almost all $z$ in the support of $Z$.

Let $\Theta = B \times \mathcal{H}$ be equipped with a norm $\|\cdot\|$. Suppose the functional of interest, $f : \Theta \to \mathbb{R}$, is smooth in the sense that
$$\frac{\partial f(\theta_o)}{\partial\theta}[\theta - \theta_o] \equiv \lim_{\tau\to 0} \frac{f(\theta_o + \tau[\theta - \theta_o]) - f(\theta_o)}{\tau}$$
is well defined and
$$\bigg\|\frac{\partial f(\theta_o)}{\partial\theta}\bigg\| \equiv \sup_{\{\theta\in\Theta:\ \|\theta - \theta_o\| > 0\}} \frac{\big|\frac{\partial f(\theta_o)}{\partial\theta}[\theta - \theta_o]\big|}{\|\theta - \theta_o\|} < \infty.$$

Next, suppose that $\|\cdot\|$ induces an inner product $\langle\cdot, \cdot\rangle$ on the completion of the space spanned by $\Theta - \theta_o$, denoted as $\bar V$. By the Riesz representation theorem, there exists $v^* \in \bar V$ such that, for any $\theta \in \Theta$,
$$\frac{\partial f(\theta_o)}{\partial\theta}[\theta - \theta_o] = \langle\theta - \theta_o, v^*\rangle \quad\text{iff}\quad \bigg\|\frac{\partial f(\theta_o)}{\partial\theta}\bigg\| < \infty.$$

Suppose that the sieve M-estimate $\hat\theta_n$ converges to $\theta_o$ at a rate faster than $\delta_n$ (i.e., $\|\hat\theta_n - \theta_o\| = o_P(\delta_n)$). Let $\varepsilon_n$ denote any sequence satisfying $\varepsilon_n = o(n^{-1/2})$, and $\mu_n(g(Z)) = \frac{1}{n} \sum_{t=1}^n \{g(Z_t) - E(g(Z_t))\}$ denote the empirical process indexed by the function $g$. Recall that $K(\theta_o, \theta) \equiv E[l(\theta_o, Z_i) - l(\theta, Z_i)]$.

CONDITION 4.1.
(i) There is $\omega > 0$ such that $|f(\theta) - f(\theta_o) - \frac{\partial f(\theta_o)}{\partial\theta}[\theta - \theta_o]| = O(\|\theta - \theta_o\|^\omega)$ uniformly in $\theta \in \Theta_n$ with $\|\theta - \theta_o\| = o(1)$;
(ii) $\|\frac{\partial f(\theta_o)}{\partial\theta}\| < \infty$;
(iii) there is $\pi_n v^* \in \Theta_n$ such that $\|\pi_n v^* - v^*\| \times \|\hat\theta_n - \theta_o\| = o_P(n^{-1/2})$.

CONDITION 4.2.
$$\sup_{\{\theta\in\Theta_n:\ \|\theta - \theta_o\| \le \delta_n\}} \mu_n\bigg(l(\theta, Z) - l(\theta \pm \varepsilon_n \pi_n v^*, Z) - \frac{\partial l(\theta_o, Z)}{\partial\theta}[\pm\varepsilon_n \pi_n v^*]\bigg) = O_P(\varepsilon_n^2).$$

CONDITION 4.3. $K(\theta_o, \hat\theta_n) - K(\theta_o, \hat\theta_n \pm \varepsilon_n \pi_n v^*) = \pm\varepsilon_n \langle\hat\theta_n - \theta_o, \pi_n v^*\rangle + o(n^{-1})$.

CONDITION 4.4.
(i) $\mu_n\big(\frac{\partial l(\theta_o, Z)}{\partial\theta}[\pi_n v^* - v^*]\big) = o_P(n^{-1/2})$;
(ii) $E\big\{\frac{\partial l(\theta_o, Z)}{\partial\theta}[\pi_n v^*]\big\} = o(n^{-1/2})$.

CONDITION 4.5. $n^{1/2} \mu_n\big(\frac{\partial l(\theta_o, Z)}{\partial\theta}[v^*]\big) \xrightarrow{d} N(0, \sigma^2_{v^*})$, with $\sigma^2_{v^*} > 0$.
We note that for classical nonlinear M-estimation such as those reviewed in Newey and McFadden (1994), Conditions 4.1(i)(ii), 4.2, 4.3 and 4.5 are still required (albeit in slightly different expressions), while Conditions 4.1(iii) and 4.4 are automatically satisfied since $\pi_n v^* = v^*$ for the standard nonlinear M-estimation. Note that for i.i.d. data Condition 4.5 is satisfied whenever $\sigma^2_{v^*} = \mathrm{Var}\big(\frac{\partial l(\theta_o, Z)}{\partial\theta}[v^*]\big) > 0$. If $l(\theta, Z)$ is also pathwise differentiable in $\theta \in \Theta_n$ with $\|\theta - \theta_o\| = o(1)$, then Conditions 4.2 and 4.3 are implied by Conditions 4.2′ and 4.3′ respectively, where

CONDITION 4.2′.
$$\sup_{\{\theta\in\Theta_n:\ \|\theta - \theta_o\| \le \delta_n\}} \mu_n\bigg(\frac{\partial l(\theta, Z)}{\partial\theta}[\pi_n v^*] - \frac{\partial l(\theta_o, Z)}{\partial\theta}[\pi_n v^*]\bigg) = o_P(n^{-1/2}).$$

CONDITION 4.3′. $E\big\{\frac{\partial l(\hat\theta_n, Z)}{\partial\theta}[\pi_n v^*]\big\} = \langle\hat\theta_n - \theta_o, \pi_n v^*\rangle + o(n^{-1/2})$.

Condition 4.2 (or 4.2′) can be verified by applying Lemma 4.2. Condition 4.3 (or 4.3′) can be verified when a Hilbert norm $\|\theta - \theta_o\|$ is chosen. Conditions 4.2–4.4 may need to be modified when the parameter space $\Theta$ is not convex; see Shen (1997) and Chen and Shen (1998) for the needed modification.

THEOREM 4.3. Suppose Conditions 4.1–4.5 hold, and $\|\hat\theta_n - \theta_o\|^\omega = o_P(n^{-1/2})$. Then, for the sieve M-estimate $\hat\theta_n$, $n^{1/2}(f(\hat\theta_n) - f(\theta_o)) \xrightarrow{d} N(0, \sigma^2_{v^*})$.

The proof of Theorem 4.3 follows trivially from those in Shen (1997) and Ai and Chen (1999). In applications, one needs to specify a Hilbert norm $\|\theta - \theta_o\|$ in order to compute the representer $v^*$. Wong and Severini (1991) and Shen (1997) have used the Fisher norm, $\|\theta - \theta_o\|^2 = E\{\frac{\partial l(\theta_o, Z_i)}{\partial\theta}[\theta - \theta_o]\}^2$, for the sieve MLE procedure. Ai and Chen (1999, 2003) have introduced a Fisher-like norm for their sieve MD and sieve GLS procedures. In the next subsection we specialize Theorem 4.3 to derive root-$n$ asymptotic normality of parametric parts in sieve GLS problems.

4.2.2.
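Asymptotic normality of sieve GLS

Recall that for all the models belonging to the first subclass of the conditional moment restrictions (2.8), $E\{\rho(Z, \theta_o)|X\} = 0$, where $\rho(Z, \theta) - \rho(Z, \theta_o)$ does not depend on endogenous variables $Y$, we can estimate $\theta_o = (\beta_o, h_o) \in B \times \mathcal{H}$ by the sieve GLS procedure:
$$\hat\theta_n = (\hat\beta_n, \hat h_n) = \arg\min_{(\beta, h)\in B\times\mathcal{H}_n} \frac{1}{n} \sum_{i=1}^n \rho(Z_i, \beta, h)' [\Sigma(X_i)]^{-1} \rho(Z_i, \beta, h),$$
where $\Sigma(X_i)$ is a positive definite weighting matrix. When $\Sigma(X_i)$ is known such as the identity matrix, this belongs to the sieve M-estimation with $l(\theta, Z_i) = -\rho(Z_i, \theta)' [\Sigma(X_i)]^{-1} \rho(Z_i, \theta)/2$. See Subsection 4.3 and Remark 4.3 for estimation of the optimal weighting matrix $\Sigma_o(X_i) \equiv \mathrm{Var}\{\rho(Z_i, \theta_o)|X_i\}$.

We now apply Theorem 4.3 to derive root-$n$ asymptotic normality of the sieve GLS estimator $\hat\beta_n$. Define the norm
$$\|\theta - \theta_o\|^2 = E\bigg\{\bigg(\frac{\partial\rho(Z_i, \theta_o)}{\partial\theta}[\theta - \theta_o]\bigg)' \Sigma(X_i)^{-1} \bigg(\frac{\partial\rho(Z_i, \theta_o)}{\partial\theta}[\theta - \theta_o]\bigg)\bigg\}.$$
For $j = 1, \ldots, d_\beta$, let
$$D_{w_j}(X) = \frac{\partial\rho(Z, \beta, h_o(\cdot))}{\partial\beta_j}\bigg|_{\beta=\beta_o} - \frac{\partial\rho(Z, \beta_o, h_o(\cdot) + \tau w_j(\cdot))}{\partial\tau}\bigg|_{\tau=0} = \frac{\partial\rho(Z, \theta_o)}{\partial\beta_j} - \frac{\partial\rho(Z, \theta_o)}{\partial h}[w_j],$$
$w = (w_1, \ldots, w_{d_\beta})$, and $D_w(X) = (D_{w_1}(X), \ldots, D_{w_{d_\beta}}(X)) = \frac{\partial\rho(Z, \theta_o)}{\partial\beta'} - \frac{\partial\rho(Z, \theta_o)}{\partial h}[w]$ be a $(d_\rho \times d_\beta)$-matrix valued measurable function of $X$. Let $w^* = (w_1^*, \ldots, w_{d_\beta}^*)$, where for $j = 1, \ldots, d_\beta$, $w_j^*$ solves
$$E\big[D_{w_j^*}(X)' \Sigma(X)^{-1} D_{w_j^*}(X)\big] = \inf_{w_j} E\big[D_{w_j}(X)' \Sigma(X)^{-1} D_{w_j}(X)\big].$$
Denote $D_{w^*}(X) = \frac{\partial\rho(Z, \theta_o)}{\partial\beta'} - \frac{\partial\rho(Z, \theta_o)}{\partial h}[w^*]$. Let $v^*_\beta = \big(E[D_{w^*}(X)' \Sigma(X)^{-1} D_{w^*}(X)]\big)^{-1} \lambda$, $v^*_h = -w^* v^*_\beta$ and $v^* = (v^*_\beta, v^*_h)$.

ASSUMPTION 4.1.
(i) $\beta_o \in \mathrm{int}(B)$;
(ii) $E[D_{w^*}(X)' \Sigma(X)^{-1} D_{w^*}(X)]$ is positive definite;
(iii) there is $\pi_n v^* \in \Theta_n$ such that $\|\pi_n v^* - v^*\| \times \|\hat\theta_n - \theta_o\| = o_P(n^{-1/2})$.

ASSUMPTION 4.2.
(i) $\Sigma(X)$ and $\Sigma_o(X) \equiv \mathrm{Var}\{\rho(Z_i, \theta_o)|X\}$ are positive definite and bounded uniformly over $X$;
(ii) $\rho(Z, \theta)$ is twice continuously pathwise differentiable with respect to $\theta \in \Theta$ with $\|\theta - \theta_o\| = o(1)$;
(iii) Conditions 4.2′ and 4.3′ are satisfied with $\frac{\partial l(\theta, Z)}{\partial\theta}[\pi_n v^*] = -\rho(Z, \theta)' \times \Sigma(X)^{-1}\{\frac{\partial\rho(Z, \theta)}{\partial\theta}[\pi_n v^*]\}$ for all $\theta \in \Theta_n$ with $\|\theta - \theta_o\| = o(1)$;
(iv) $\{Z_i\}_{i=1}^n$ is i.i.d., $E\{\rho(Z, \theta_o)|X\} = 0$, $E\{\rho(Z, \theta) - \rho(Z, \theta_o)|X\} = \rho(Z, \theta) - \rho(Z, \theta_o)$ for all $\theta \in \Theta$.

PROPOSITION 4.4.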
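Let $\hat\theta_n$ be the sieve GLS estimate. Suppose Assumptions 4.1–4.2 hold. Then $n^{1/2}(\hat\beta_n - \beta_o) \xrightarrow{d} N(0, V_1^{-1} V_2 V_1^{-1})$ where
$$V_1 = E\big[D_{w^*}(X)' \Sigma(X)^{-1} D_{w^*}(X)\big], \qquad V_2 = E\big[D_{w^*}(X)' \Sigma(X)^{-1} \Sigma_o(X) \Sigma(X)^{-1} D_{w^*}(X)\big].$$

PROOF. Let $f(\theta) = \lambda'\beta$, where $\lambda$ is an arbitrary unit vector in $\mathbb{R}^{d_\beta}$. Clearly, Condition 4.1(i) is satisfied with $\frac{\partial f(\theta_o)}{\partial\theta}[\theta - \theta_o] = (\beta - \beta_o)'\lambda$ and $\omega = \infty$. In addition, under Assumption 4.1(i)(ii), we have $v^* = (v^*_\beta, v^*_h)$ and
$$\|v^*\|^2 = \sup_{\{\theta\in\Theta:\ \|\theta - \theta_o\| > 0\}} \frac{\{(\beta - \beta_o)'\lambda\}^2}{\|\theta - \theta_o\|^2} = \lambda' \big(E[D_{w^*}(X)' \Sigma(X)^{-1} D_{w^*}(X)]\big)^{-1} \lambda < \infty.$$
Thus Condition 4.1 is implied by Assumption 4.1. Note that
$$\frac{\partial l(\theta_o, Z)}{\partial\theta}[\theta - \theta_o] = -\rho(Z, \theta_o)' \Sigma(X)^{-1} \bigg\{\frac{\partial\rho(Z, \theta_o)}{\partial\beta'}(\beta - \beta_o) + \frac{\partial\rho(Z, \theta_o)}{\partial h}[h - h_o]\bigg\},$$
we have
$$E\bigg\{\frac{\partial l(\theta_o, Z)}{\partial\theta}[\pi_n v^*]\bigg\} = -E\bigg\{\rho(Z, \theta_o)' \Sigma(X)^{-1} \bigg(\frac{\partial\rho(Z, \theta_o)}{\partial\beta'} v^*_\beta + \frac{\partial\rho(Z, \theta_o)}{\partial h}[\pi_n v^*_h]\bigg)\bigg\} = 0,$$
hence Condition 4.4(ii) is automatically satisfied. Since
$$\frac{1}{n} \sum_{t=1}^n \frac{\partial l(\theta_o, Z_t)}{\partial\theta}[\pi_n v^* - v^*] = -\frac{1}{n} \sum_{t=1}^n \rho(Z_t, \theta_o)' \Sigma(X_t)^{-1} \bigg\{\frac{\partial\rho(Z_t, \theta_o)}{\partial h}[\pi_n v^*_h - v^*_h]\bigg\},$$
by Chebyshev inequality and Assumptions 4.1(iii) and 4.2(i), we have
$$\frac{1}{n} \sum_{i=1}^n \frac{\partial l(\theta_o, Z_i)}{\partial\theta}[\pi_n v^* - v^*] = o_P(n^{-1/2}),$$
hence Condition 4.4(i) is satisfied. Since data are i.i.d. and under Assumptions 4.1(ii) and 4.2(i),
$$\sigma^2_{v^*} = \mathrm{Var}\bigg(\frac{\partial l(\theta_o, Z)}{\partial\theta}[v^*]\bigg) = \mathrm{Var}\bigg(\rho(Z, \theta_o)' \Sigma(X)^{-1} \bigg\{\frac{\partial\rho(Z, \theta_o)}{\partial\beta'} - \frac{\partial\rho(Z, \theta_o)}{\partial h}[w^*]\bigg\} v^*_\beta\bigg)$$
$$= v^{*\prime}_\beta E\big[D_{w^*}(X)' \Sigma(X)^{-1} \Sigma_o(X) \Sigma(X)^{-1} D_{w^*}(X)\big] v^*_\beta = \lambda' V_1^{-1} V_2 V_1^{-1} \lambda > 0,$$
Condition 4.5 is satisfied. By Theorem 4.3, we obtain, for any arbitrary unit vector $\lambda \in \mathbb{R}^{d_\beta}$, $n^{1/2} \lambda'(\hat\beta_n - \beta_o) \xrightarrow{d} N(0, \sigma^2_{v^*})$. Hence $\sqrt{n}(\hat\beta_n - \beta_o) \xrightarrow{d} N(0, V_1^{-1} V_2 V_1^{-1})$.

REMARK 4.2. The asymptotic variance, $V_1^{-1} V_2 V_1^{-1}$, of the sieve GLS estimator $\hat\beta_n$ can be consistently estimated by $\hat V_1^{-1} \hat V_2 \hat V_1^{-1}$, where
$$\hat V_1 = \frac{1}{n} \sum_{i=1}^n \bigg(\frac{\partial\rho(Z_i, \hat\theta_n)}{\partial\beta'} - \frac{\partial\rho(Z_i, \hat\theta_n)}{\partial h}[\hat w]\bigg)' \Sigma(X_i)^{-1} \bigg(\frac{\partial\rho(Z_i, \hat\theta_n)}{\partial\beta'} - \frac{\partial\rho(Z_i, \hat\theta_n)}{\partial h}[\hat w]\bigg),$$
$$\hat V_2 = \frac{1}{n} \sum_{i=1}^n \bigg(\frac{\partial\rho(Z_i, \hat\theta_n)}{\partial\beta'} - \frac{\partial\rho(Z_i, \hat\theta_n)}{\partial h}[\hat w]\bigg)' \Sigma(X_i)^{-1} \hat\Sigma_o(X_i) \Sigma(X_i)^{-1} \bigg(\frac{\partial\rho(Z_i, \hat\theta_n)}{\partial\beta'} - \frac{\partial\rho(Z_i, \hat\theta_n)}{\partial h}[\hat w]\bigg),$$
$\hat w = (\hat w_1, \ldots, \hat w_{d_\beta})$ solves the following sieve minimization problem: for $j = 1, \ldots, d_\beta$,
$$\min_{w_j\in\mathcal{H}_n} \sum_{i=1}^n \bigg(\frac{\partial\rho(Z_i, \hat\theta_n)}{\partial\beta_j} - \frac{\partial\rho(Z_i, \hat\theta_n)}{\partial h}[w_j]\bigg)' \Sigma(X_i)^{-1} \bigg(\frac{\partial\rho(Z_i, \hat\theta_n)}{\partial\beta_j} - \frac{\partial\rho(Z_i, \hat\theta_n)}{\partial h}[w_j]\bigg),$$
and $\hat\Sigma_o(X_i)$ can be any consistent nonparametric estimator of $\Sigma_o(X_i)$; see Ai and Chen (1999) for kernel estimator and Ai and Chen (2003, 2007) for series LS estimator of $\Sigma_o(X_i)$.

4.2.3. Example: Partially additive mean regression with a monotone constraint

Suppose that the i.i.d. data $\{Y_t, X_t = (X_{0t}', X_{1t}, \ldots, X_{qt})'\}_{t=1}^n$ are generated according to
$$Y_i = X_{0i}'\beta_o + h_{o1}(X_{1i}) + \cdots + h_{oq}(X_{qi}) + e_i, \quad E[e_i|X_i] = 0.$$
Let $\theta_o = (\beta_o', h_{o1}, \ldots, h_{oq})' \in \Theta = B \times \mathcal{H}$ be the parameters of interest, where $B$ is a compact subset of $\mathbb{R}^{d_\beta}$ and $\mathcal{H}$ is the same as that in Subsection 3.2.1. Since $h_{o1}(\cdot)$ can have a constant we assume that $X_0$ does not contain the constant regressor, $\dim(X_0) = d_\beta$, $\dim(X_j) = 1$ for $j = 1, \ldots, q$, $\dim(X) = d_\beta + q$, and $\dim(Y) = 1$. We estimate the regression function $\theta_o(X) = X_{0t}'\beta_o + \sum_{j=1}^q h_{oj}(X_{jt})$ by maximizing over $\Theta_n = B \times \mathcal{H}_n$ the criterion $Q_n(\theta) = n^{-1} \sum_{t=1}^n l(\theta, Y_t, X_t)$, where $l(\theta, Y_t, X_t) = -\frac{1}{2}[Y_t - X_{0t}'\beta - \sum_{j=1}^q h_j(X_{jt})]^2$. Let $\|\theta - \theta_o\|^2 = E\{X_{0t}'(\beta - \beta_o) + \sum_{j=1}^q [h_j(X_{jt}) - h_{oj}(X_{jt})]\}^2$. Note that $D_{w^*}(X) = X_0 - \sum_{k=1}^q w^{*k}(X_k)$, where $w^{*k}(X_k)$, $k = 1, \ldots, q$, solves
$$\inf_{w^k,\ k=1,\ldots,q:\ E[|X_0 - \sum_{k=1}^q w^k(X_k)|_e^2] > 0} E\bigg[\bigg(X_0 - \sum_{k=1}^q w^k(X_k)\bigg)'\bigg(X_0 - \sum_{k=1}^q w^k(X_k)\bigg)\bigg].$$

PROPOSITION 4.5. Suppose that Assumption 3.1 and the following hold:
(i) $\beta_o \in \mathrm{int}(B)$;
(ii) $\Sigma_o(X)$ is positive and bounded;
(iii) $E[X_0 X_0']$ is positive definite; $E[D_{w^*}(X)' D_{w^*}(X)]$ is positive definite;
(iv) each element of $w^{*j}$ belongs to the Hölder space $\Lambda^{m_j}$ with $m_j > 1/2$ for $j = 1, \ldots, q$.
Let $k_{jn} = O(n^{1/(2p_j+1)})$ for $j = 1, \ldots, q$. Then $n^{1/2}(\hat\beta_n - \beta_o) \xrightarrow{d} N(0, V_1^{-1} V_2 V_1^{-1})$ where $V_1 = E[D_{w^*}(X)' D_{w^*}(X)]$, $V_2 = E[D_{w^*}(X)' \Sigma_o(X) D_{w^*}(X)]$.

PROOF. We obtain the result by applying Proposition 4.4. Let $\Theta_n = B \times \mathcal{H}_n$ and $\mathcal{H}_n = \mathcal{H}_n^1 \times \cdots \times \mathcal{H}_n^q$, where $\mathcal{H}_n^j$, $j = 1, 2, \ldots, q$, are the same as those in Subsection 3.2.1.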
By the same proof as that for Proposition 3.3, we have $\|\hat\theta_n - \theta_o\| = O_P(n^{-p/(2p+1)})$ provided that $p = \min\{p_1, \ldots, p_q\} > 0.5$. This and assumption (iv) imply Assumption 4.1(iii). Condition 4.3 is trivially satisfied given the definition of the metric $\|\cdot\|$. It remains to verify Condition 4.2:

$$\mu_n\left(\left\{X_0'v^*_\beta + \sum_{j=1}^q \pi_n v^*_{h_j}(X_j)\right\}\left\{X_0'[\beta - \beta_o] + \sum_{j=1}^q [h_j(X_j) - h_{oj}(X_j)]\right\}\right) = o_P(n^{-1/2}),$$

uniformly over $\theta \in \Theta_n$ with $\|\theta - \theta_o\| \le \delta_n = O(n^{-p/(2p+1)})$. Applying Theorem 3 in Chen, Linton and van Keilegom (2003) (or Lemma 4.2 for the i.i.d. case), assumptions (i)–(iv) and Assumption 3.1 ($h_j \in H_j = \Lambda^{m_j}_c$ with $m_j > 1/2$ for all $j = 1, \ldots, q$) imply Condition 4.2; also see van der Vaart and Wellner (1996).

Notice that for the well-known partially linear regression model $Y_i = X_{0i}'\beta_o + h_{o1}(X_{1i}) + e_i$, $E[e_i|X_i] = 0$, we can explicitly solve for $D_{w^*}(X) \equiv [X_0 - w^{*1}(X_1)]'$ with $w^{*1}(X_1) = E\{X_0|X_1\}$. Hence assumption (iv) will be satisfied if $E\{X_0|X_1\}$ is smooth enough. See Remark 4.3 for semiparametric efficient estimation of $\beta_o$.

4.2.4. Efficiency of sieve MLE

Wong (1992) and Wong and Severini (1991) established asymptotic efficiency of plug-in nonparametric MLE estimates of smooth functionals. Shen (1997) extended their results to sieve MLE. We review the results of Wong (1992) and Shen (1997) in this subsection. Related work can be found in Begun et al. (1983), Ibragimov and Hasminskii (1991), and Bickel et al. (1993).

Here the criterion is $Q_n(\theta) = \frac{1}{n}\sum_{i=1}^n l(Z_i,\theta)$, where $l(Z_i,\theta) = \log p(Z_i,\theta)$ is a log-likelihood evaluated at the single observation $Z_i$. We use the Fisher norm: $\|\theta - \theta_o\|^2 = E\left\{\frac{\partial\log p(Z_i,\theta_o)}{\partial\theta}[\theta - \theta_o]\right\}^2$.

Recall that a probability family $\{P_\theta : \theta \in \Theta\}$ is locally asymptotically normal (LAN) at $\theta_o$ if (1) for any $g$ in the linear span of $\Theta - \theta_o$, $\theta_o + tn^{-1/2}g \in \Theta$ for all small $t \ge 0$, and (2)

$$\frac{dP_{\theta_o + n^{-1/2}g}}{dP_{\theta_o}}(Z_1, \ldots, Z_n) = \exp\left\{\Sigma_n(g) - \frac{1}{2}\|g\|^2 + R_n(\theta_o, g)\right\},$$

where $\Sigma_n(g)$ is linear in $g$, $\Sigma_n(g) \xrightarrow{d} N(0, \|g\|^2)$ and $\mathrm{plim}_{n\to\infty} R_n(\theta_o, g) = 0$ (both limits are under the true probability measure $P_o = P_{\theta_o}$); see e.g. LeCam (1960).

To avoid the "super-efficiency" phenomenon, certain conditions on the estimates are required. In estimating a smooth functional in the infinite-dimensional case, Wong (1992, p. 58) defines the class of pathwise regular estimates in the sense of Bahadur (1964). An estimate $T_n(Z_1, \ldots, Z_n)$ of $f(\theta_o)$ is pathwise regular if for any real number $\tau > 0$ and any $g$ in the linear span of $\Theta - \theta_o$, we have

$$\limsup_{n\to\infty} P_{\theta_{n,\tau}}\left(T_n < f(\theta_{n,\tau})\right) \le \liminf_{n\to\infty} P_{\theta_{n,-\tau}}\left(T_n < f(\theta_{n,-\tau})\right),$$

where $\theta_{n,\tau} = \theta_o + n^{-1/2}\tau g$.

THEOREM 4.6. [See Wong (1992), Shen (1997).] In addition to LAN, suppose the functional $f : \Theta \to R$ is Fréchet-differentiable at $\theta_o$ with $0 < \left\|\frac{\partial f(\theta_o)}{\partial\theta}\right\| < \infty$. Then for any pathwise regular estimate $T_n$ of $f(\theta_o)$, and any real number $\tau > 0$,

$$\limsup_{n\to\infty} P_o\left(\sqrt{n}\,|T_n - f(\theta_o)| \le \tau\right) \le P_o\left(\left|N\left(0, \left\|\frac{\partial f(\theta_o)}{\partial\theta}\right\|^2\right)\right| \le \tau\right),$$

where $N\left(0, \left\|\frac{\partial f(\theta_o)}{\partial\theta}\right\|^2\right)$ is a scalar random variable drawn from a normal distribution with mean 0 and variance $\left\|\frac{\partial f(\theta_o)}{\partial\theta}\right\|^2$.

THEOREM 4.7. [See Shen (1997).] In addition to the conditions to ensure $n^{1/2}(f(\hat\theta_n) - f(\theta_o)) \xrightarrow{P_{\theta_o}} N(0, \sigma^2_{v^*})$ with $\sigma^2_{v^*} = \left\|\frac{\partial f(\theta_o)}{\partial\theta}\right\|^2$, if LAN holds, then for the plug-in sieve MLE estimates of $f(\theta)$, any real number $\tau > 0$, and any $g$ in the linear span of $\Theta - \theta_o$,

$$n^{1/2}\left(f(\hat\theta_n) - f(\theta_{n,\tau})\right) \xrightarrow{P_{\theta_{n,\tau}}} N(0, \sigma^2_{v^*}),$$

where $\theta_{n,\tau} = \theta_o + n^{-1/2}\tau g$. Here $\xrightarrow{P_\theta}$ means convergence in distribution under probability measure $P_\theta$.

4.3. Sieve simultaneous MD estimation: Normality and efficiency

As we mentioned in Section 2.1, most structural econometric models belong to the semiparametric conditional moment framework $E[\rho(Z,\beta_o,h_o(\cdot))|X] = 0$, where the difference $\rho(Z,\beta,h(\cdot)) - \rho(Z,\beta_o,h_o(\cdot))$ does depend on the endogenous variables $Y$. There are even fewer general theory papers on the sieve simultaneous MD estimation of $\beta_o$ and $h_o$ for this class of models; see Newey and Powell (1989, 2003) and Ai and Chen (1999, 2003).
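To make the distinction between the two subclasses concrete, the following worked comparison may help. It is only an illustration: the partially linear IV specification, with outcome $Y_1$, endogenous regressor $Y_2$ and exogenous regressors $X_1 \subset X$, is an assumed example chosen for exposition and is not a model defined in this section.

```latex
% Assumed illustrative model: partially linear IV regression, Z = (Y_1, Y_2, X')'
\rho(Z,\theta) = Y_1 - X_1'\beta - h(Y_2), \qquad E[\rho(Z,\theta_o)\,|\,X] = 0,
% the difference of generalized residuals involves the endogenous Y_2:
\rho(Z,\theta) - \rho(Z,\theta_o) = -X_1'(\beta-\beta_o) - \bigl[h(Y_2) - h_o(Y_2)\bigr].

% Contrast: partially linear regression of Subsection 4.2.3 (first subclass)
\rho(Z,\theta) = Y - X_0'\beta - h(X_1), \qquad
\rho(Z,\theta) - \rho(Z,\theta_o) = -X_0'(\beta-\beta_o) - \bigl[h(X_1) - h_o(X_1)\bigr].
```

In the second case the difference is a function of $X$ alone, so it equals its own conditional expectation given $X$ and a GLS criterion in $\rho$ suffices. In the first case it does not, which is why the MD criterion is built on an estimate of $m(X,\theta) = E[\rho(Z,\theta)|X]$ rather than on $\rho$ itself.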
The sieve simultaneous MD procedure jointly estimates $\beta_o$ and $h_o$ by minimizing a sample quadratic form

$$\frac{1}{n}\sum_{i=1}^n \hat m(X_i,\beta,h)'\,[\hat\Sigma(X_i)]^{-1}\,\hat m(X_i,\beta,h)$$

over the sieve parameter space $\Theta_n = B \times H_n$, where $\hat m(X_i,\beta,h)$ is any nonparametric estimator of the conditional mean function $m(X,\beta,h) \equiv E[\rho(Z,\beta,h(\cdot))|X]$, $\hat\Sigma(X) \to \Sigma(X)$ in probability and $\Sigma(X)$ is a positive definite weighting matrix. Ai and Chen (1999, 2003) established the $\sqrt{n}$-asymptotic normality of this sieve MD estimator $\hat\beta$ of $\beta_o$.

For semiparametric efficient estimation of $\beta_o$, Ai and Chen (1999) proposed the three-step optimally weighted sieve MD procedure:

Step 1. Obtain an initial consistent sieve MD estimator $\hat\theta_n = (\hat\beta_n, \hat h_n)$ by
$$\min_{\theta=(\beta,h)\in B\times H_n} \frac{1}{n}\sum_{i=1}^n \hat m(X_i,\theta)'\,\hat m(X_i,\theta),$$
where $\hat m(X_i,\theta)$ is any nonparametric estimator of the conditional mean function $m(X,\theta) \equiv E[\rho(Z,\beta,h(\cdot))|X]$.

Step 2. Obtain a consistent estimator $\hat\Sigma_o(X)$ of the optimal weighting matrix $\Sigma_o(X) \equiv \mathrm{Var}[\rho(Z,\beta_o,h_o(\cdot))|X]$ using $\hat\theta_n = (\hat\beta_n, \hat h_n)$ and any nonparametric regression procedure (such as kernel, nearest-neighbor or series LS estimation).

Step 3. Obtain the optimally weighted estimator $\tilde\theta_n = (\tilde\beta_n, \tilde h_n)$ by solving
$$\min_{\theta=(\beta,h)\in B\times H_n} \frac{1}{n}\sum_{i=1}^n \hat m(X_i,\theta)'\,[\hat\Sigma_o(X_i)]^{-1}\,\hat m(X_i,\theta).$$

As an alternative way to efficiently estimate $\beta_o$, Ai and Chen (2003) proposed the locally continuously updated sieve MD procedure:

Step 1. Obtain an initial consistent sieve MD estimator $\hat\theta_n$ by
$$\min_{\theta\in B\times H_n} \sum_{i=1}^n \hat m(X_i,\theta)'\,\hat m(X_i,\theta),$$
where $\hat m(X_i,\theta)$ is the series LS estimator (2.15) of $m(X,\theta) \equiv E[\rho(Z,\beta,h(\cdot))|X]$.

Step 2. Obtain the optimally weighted sieve MD estimator $\tilde\theta_n = (\tilde\beta_n, \tilde h_n)$ by
$$\min_{\theta=(\beta,h)\in N_{on}} \frac{1}{n}\sum_{i=1}^n \hat m(X_i,\theta)'\,[\hat\Sigma_o(X_i,\theta)]^{-1}\,\hat m(X_i,\theta),$$
where $N_{on}$ is a shrinking neighborhood of $\theta_o = (\beta_o, h_o)$ within the sieve space $B \times H_n$, and $\hat\Sigma_o(X_i,\theta)$ is any nonparametric estimator of the conditional variance function $\Sigma_o(X,\theta) \equiv \mathrm{Var}[\rho(Z,\beta,h(\cdot))|X]$. To compute this Step 2 one could use $\hat\theta_n = (\hat\beta_n, \hat h_n)$ from Step 1 as a starting point.

While Ai and Chen (1999) consider kernel estimation of the conditional mean $m(\cdot,\theta)$ and the conditional variance $\Sigma_o(\cdot,\theta)$, Ai and Chen (2003) propose series LS estimation of $m(\cdot,\theta)$ and $\Sigma_o(\cdot,\theta)$. Let $\{p_{0j}(X), j = 1, 2, \ldots, k_{m,n}\}$ be a sequence of known basis functions that can approximate any real-valued square integrable function of $X$ well as $k_{m,n} \to \infty$, $p^{k_{m,n}}(X) = (p_{01}(X), \ldots, p_{0k_{m,n}}(X))'$ and $P = (p^{k_{m,n}}(X_1), \ldots, p^{k_{m,n}}(X_n))'$. Then a series LS estimator of the conditional variance $\Sigma_o(X,\theta) \equiv \mathrm{Var}[\rho(Z,\theta)|X]$ is

$$\hat\Sigma_o(X,\theta) \equiv \sum_{i=1}^n \rho(Z_i,\theta)\rho(Z_i,\theta)'\,p^{k_{m,n}}(X_i)'(P'P)^{-1}p^{k_{m,n}}(X).$$

Also, $\Sigma_o(X) = \mathrm{Var}[\rho(Z,\theta_o)|X]$ can be simply estimated by $\hat\Sigma_o(X) \equiv \hat\Sigma_o(X,\hat\theta_n)$.

We state the following result on semiparametric efficient estimation of $\beta_o$ for the class of conditional moment restrictions $E[\rho(Z,\beta_o,h_o(\cdot))|X] = 0$; see Ai and Chen (2003) for details. For $j = 1, \ldots, d_\beta$, let

$$D_{w_j}(X) \equiv \left.\frac{\partial E\{\rho(Z,\beta,h_o(\cdot))|X\}}{\partial\beta_j}\right|_{\beta=\beta_o} - \left.\frac{\partial E\{\rho(Z,\beta_o,h_o(\cdot) + \tau w_j(\cdot))|X\}}{\partial\tau}\right|_{\tau=0} \equiv \frac{\partial m(X,\theta_o)}{\partial\beta_j} - \frac{\partial m(X,\theta_o)}{\partial h}[w_j],$$

$$E\left[D_{w_{oj}}(X)'\,\Sigma_o(X)^{-1}D_{w_{oj}}(X)\right] = \inf_{w_j} E\left[D_{w_j}(X)'\,\Sigma_o(X)^{-1}D_{w_j}(X)\right],$$

$w_o = (w_{o1}, \ldots, w_{od_\beta})$, and let $D_{w_o}(X) \equiv (D_{w_{o1}}(X), \ldots, D_{w_{od_\beta}}(X))$ be a $(d_\rho \times d_\beta)$-matrix valued measurable function of $X$.

THEOREM 4.8. Let $\tilde\beta_n$ be either the three-step optimally weighted sieve MD estimator or the two-step locally continuously updated sieve MD estimator. Under the conditions stated in Ai and Chen (2003, Theorems 6.1 and 6.2), $\tilde\beta_n$ is semiparametric efficient and satisfies $\sqrt{n}(\tilde\beta_n - \beta_o) \xrightarrow{d} N(0, V_o^{-1})$, with $V_o = E[D_{w_o}(X)'\,\Sigma_o(X)^{-1}D_{w_o}(X)]$.

Ai and Chen (2003) also provide a simple consistent estimator, $\hat V_o^{-1}$, for the asymptotic variance $V_o^{-1}$ of $\tilde\beta_n$, where

$$\hat V_o = \frac{1}{n}\sum_{i=1}^n \left(\frac{\partial\hat m(X_i,\tilde\theta_n)}{\partial\beta'} - \frac{\partial\hat m(X_i,\tilde\theta_n)}{\partial h}[\hat w_o]\right)'[\hat\Sigma_o(X_i)]^{-1}\left(\frac{\partial\hat m(X_i,\tilde\theta_n)}{\partial\beta'} - \frac{\partial\hat m(X_i,\tilde\theta_n)}{\partial h}[\hat w_o]\right),$$

$\hat w_o = (\hat w_{o1}, \ldots, \hat w_{od_\beta})$ solves the following sieve minimization problem:

$$\min_{w_j\in H_n}\sum_{i=1}^n \left(\frac{\partial\hat m(X_i,\tilde\theta_n)}{\partial\beta_j} - \frac{\partial\hat m(X_i,\tilde\theta_n)}{\partial h}[w_j]\right)'[\hat\Sigma_o(X_i)]^{-1}\left(\frac{\partial\hat m(X_i,\tilde\theta_n)}{\partial\beta_j} - \frac{\partial\hat m(X_i,\tilde\theta_n)}{\partial h}[w_j]\right)$$

for $j = 1, \ldots, d_\beta$, and

$$\frac{\partial\hat m(X,\theta)}{\partial\beta_j} - \frac{\partial\hat m(X,\theta)}{\partial h}[w_j] \equiv \sum_{i=1}^n \left(\frac{\partial\rho(Z_i,\theta)}{\partial\beta_j} - \frac{\partial\rho(Z_i,\theta)}{\partial h}[w_j]\right)p^{k_{m,n}}(X_i)'(P'P)^{-1}p^{k_{m,n}}(X).$$

REMARK 4.3. (1) Recently, Chen and Pouzo (2006) have extended the root-n normality and efficiency results of Ai and Chen (2003) to allow the generalized residual functions $\rho(Z,\beta,h(\cdot))$ to be not pointwise continuous in $\theta = (\beta, h)$.

(2) The three-step optimally weighted sieve MD procedure leads to semiparametric efficient estimation of $\beta_o$ for the model $E[\rho(Z,\beta_o,h_o(\cdot))|X] = 0$ regardless of whether $\rho(Z,\beta,h(\cdot)) - \rho(Z,\beta_o,h_o(\cdot))$ depends on the endogenous variables $Y$ or not. However, when $\rho(Z,\beta,h(\cdot)) - \rho(Z,\beta_o,h_o(\cdot))$ does not depend on $Y$, to obtain an efficient estimator of $\beta_o$ one can also apply the following simpler three-step sieve GLS procedure, as suggested in Ai and Chen (1999):

Step 1. Obtain an initial consistent sieve GLS estimator $\hat\theta_n = (\hat\beta_n, \hat h_n)$ by
$$\min_{(\beta,h)\in B\times H_n} \frac{1}{n}\sum_{i=1}^n \rho(Z_i,\beta,h(\cdot))'\,\rho(Z_i,\beta,h(\cdot)).$$

Step 2. Obtain a consistent estimator $\hat\Sigma_o(X)$ of $\Sigma_o(X) = \mathrm{Var}[\rho(Z,\theta_o)|X]$ using $\hat\theta_n = (\hat\beta_n, \hat h_n)$ and any nonparametric regression procedure, such as $\hat\Sigma_o(X) = \hat\Sigma_o(X,\hat\theta_n)$.

Step 3. Obtain the optimally weighted GLS estimator $\tilde\theta_n = (\tilde\beta_n, \tilde h_n)$ by solving
$$\min_{(\beta,h)\in B\times H_n} \frac{1}{n}\sum_{i=1}^n \rho(Z_i,\beta,h(\cdot))'\,[\hat\Sigma_o(X_i)]^{-1}\,\rho(Z_i,\beta,h(\cdot)).$$

That is, for all the models belonging to the first subclass of the conditional moment restrictions (2.8), $E\{\rho(Z,\beta_o,h_o)|X\} = 0$, where $\rho(Z,\theta) - \rho(Z,\theta_o)$ does not depend on endogenous variables $Y$, the simple three-step sieve GLS estimator $\tilde\beta_n$ also satisfies $\sqrt{n}(\tilde\beta_n - \beta_o) \xrightarrow{d} N(0, V_o^{-1})$. Of course, the following continuously updated sieve GLS procedure will also lead to semiparametric efficient estimation of $\beta_o$:

$$(\hat\beta_{cgls}, \hat h_{cgls}) = \arg\min_{(\beta,h)\in B\times H_n} \frac{1}{n}\sum_{i=1}^n \rho(Z_i,\beta,h(\cdot))'\,[\hat\Sigma_o(X_i,\beta,h(\cdot))]^{-1}\,\rho(Z_i,\beta,h(\cdot)).$$
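The three-step sieve GLS procedure can be sketched numerically for the partially linear regression model with a scalar residual. This is a minimal illustration under stated assumptions, not the implementation of Ai and Chen (1999): the power-series sieve for h, the low-order polynomial basis used for the series LS variance estimate, the positivity floor on the fitted variance, the simulated design, and all function names (`poly_basis`, `weighted_ls`, `three_step_sieve_gls`, `simulate`) are hypothetical choices made for the example.

```python
import numpy as np


def poly_basis(x, k):
    """Power-series sieve basis (1, x, ..., x^(k-1)) for a scalar regressor."""
    return np.vander(x, k, increasing=True)


def weighted_ls(y, W, weights):
    """Minimise sum_i weights_i * (y_i - W_i c)^2 over the coefficient vector c."""
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(W * sw[:, None], y * sw, rcond=None)
    return coef


def three_step_sieve_gls(y, x0, x1, k_h=6, floor_frac=0.05):
    """Three-step sieve GLS for y = x0*beta + h(x1) + e with scalar x0 and x1."""
    n = len(y)
    # Sieve regressors: x0 carries beta, polynomial terms in x1 approximate h.
    W = np.column_stack([x0, poly_basis(x1, k_h)])

    # Step 1: identity-weighted sieve LS gives the initial estimate.
    theta_init = weighted_ls(y, W, np.ones(n))
    resid = y - W @ theta_init

    # Step 2: series LS regression of squared residuals on a basis in X = (x0, x1).
    P = np.column_stack([np.ones(n), x0, x1, x1**2])
    gamma, *_ = np.linalg.lstsq(P, resid**2, rcond=None)
    # Ad hoc floor (an assumption of this sketch) keeps the fitted variance positive.
    sigma2_hat = np.maximum(P @ gamma, floor_frac * np.mean(resid**2))

    # Step 3: re-estimate with weights 1 / sigma2_hat.
    theta_gls = weighted_ls(y, W, 1.0 / sigma2_hat)
    return theta_init[0], theta_gls[0], sigma2_hat


def simulate(n, beta=1.5, seed=0):
    """Hypothetical heteroskedastic design, for illustration only."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(0.0, 1.0, n)
    x0 = x1 + rng.normal(0.0, 1.0, n)
    e = np.sqrt(0.25 + x1**2) * rng.normal(0.0, 1.0, n)
    y = beta * x0 + np.sin(2.0 * np.pi * x1) + e
    return y, x0, x1


y, x0, x1 = simulate(4000)
beta_init, beta_gls, sigma2_hat = three_step_sieve_gls(y, x0, x1)
print(f"initial estimate: {beta_init:.3f}, optimally weighted estimate: {beta_gls:.3f}")
```

Because the sieve is linear in its coefficients, each step reduces to a (weighted) least squares problem. With a nonlinear or shape-constrained sieve the same three steps would apply, but Steps 1 and 3 would require a numerical optimizer.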
For the conditional moment restriction (without unknown function $h_o$), $E[\rho(Z,\beta_o)|X] = 0$, there are many alternative efficient estimation procedures for $\beta_o$, including the empirical likelihood of Donald, Imbens and Newey (2003), the generalized empirical likelihood (GEL) of Newey and Smith (2004), the kernel-based empirical likelihood of Kitamura, Tripathi and Ahn (2004), and the continuously updated minimum distance procedure or the Euclidean conditional empirical likelihood of Antoine, Bonnal and Renault (2007), among others. It seems that one could extend their results to the more general conditional moment framework $E[\rho(Z,\beta_o,h_o(\cdot))|X] = 0$, where the unknown function $h_o(\cdot)$ is approximated by a sieve. In fact, Zhang and Gijbels (2003) have already considered the sieve empirical likelihood procedure for the special case $E[\rho(Z,\beta_o,h_o(X))|X] = 0$, where $h_o$ is a function of the conditioning variable $X$ only; see Otsu (2005) for the general case.

Recently Ai and Chen (2007, 2004) have considered the semiparametric conditional moment framework $E[\rho_j(Z,\beta_o,h_o(\cdot))|X_j] = 0$ for $j = 1, \ldots, J$ with finite $J$, where each conditional moment has its own conditioning set $X_j$ that could differ across equations. This extension would be useful for estimating semiparametric structural models with incomplete information.

5. Concluding remarks

In this chapter, we have surveyed some recent large sample results on nonparametric and semiparametric estimation of econometric models via the method of sieves. We have restricted our attention to general consistency and convergence rates of sieve estimation of unknown functions and $\sqrt{n}$-asymptotic normality of sieve estimation of smooth functionals. Examples were used to illustrate the general sieve estimation theory. It is our hope that the examples adequately depict the general sieve extremum estimation approach and its versatility.
We conclude this chapter by pointing out additional topics on the method of sieves that have not been reviewed for lack of time and space.

First, although there is still a lack of general theory on testing via the sieve method, there are some consistent specification tests using the method of sieves. For example, Hong and White (1995) tested a parametric regression model using series LS estimators; Hart (1997) presented many consistent tests using series estimators; Stinchcombe and White (1998) tested a parametric conditional moment restriction $E[\rho(Z,\beta_o)|X] = 0$ using neural network sieves; and Li, Hsiao and Zinn (2003) tested semiparametric/nonparametric regression models using spline series estimators. Most recently, Song (2005) proposed consistent tests of semi-nonparametric regression models via conditional martingale transforms, where the unknown functions are estimated by the method of sieves. Additional references include Wooldridge (1992), Bierens (1990), Bierens and Ploberger (1997) and de Jong (1996). Also, in principle, all of the existing tests based on kernel or local linear regression methods, such as those in Robinson (1989), Fan and Li (1996), Lavergne and Vuong (1996), Chen and Fan (1999), Fan and Linton (1999), Aït-Sahalia, Bickel and Stoker (2001), Horowitz and Spokoiny (2001) and Fan, Zhang and Zhang (2001), can be carried out using the method of sieves.

Second, we have not touched on the issue of data-driven selection of sieve spaces. In practice, many existing model selection methods such as cross-validation (CV), generalized CV and AIC have been used in the current context due to the connection of the method of sieves with parametric models; see the survey chapter by Ichimura and Todd (2007) on implementation details of semi-nonparametric estimators including series estimators, and the reviews by Stone et al. (1997) and Ruppert, Wand and Carroll (2003) on model selection with spline sieves for extended linear models. There are a few papers in statistics, including Barron, Birgé and Massart (1999) and Shen and Ye (2002), that address data-driven selection among different sieve bases. There are many results on data-driven selection of the number of terms for a given sieve basis; see e.g. Li (1987), Andrews (1991a), Hurvich, Simonoff and Tsai (1998), Donald and Newey (2001), Coppejans and Gallant (2002), Phillips and Ploberger (2003), Fan and Peng (2004) and Imbens, Newey and Ridder (2005). In particular, Andrews (1991a) establishes the asymptotic optimality of CV as a method to select series terms for nonparametric least squares regressions with heteroskedastic errors. Imbens, Newey and Ridder (2005) establish a similar result for semiparametrically efficient estimation of average treatment effect parameters with a first-step series estimation of conditional means. It would be very useful to extend their results to handle a more general class of semi-nonparametric models estimated via the method of sieves.

Third, so far there is little research on higher order refinements of the large sample properties of semiparametric efficient sieve estimators. Many authors, including Linton (1995) and Heckman et al. (1998), have pointed out that the first-order asymptotics of semiparametric procedures could be misleading and unhelpful. For the case of kernel estimators, some papers such as Robinson (1995), Linton (1995, 2001), Nishiyama and Robinson (2000, 2005), Xiao and Linton (2001) and Ichimura and Linton (2002) have obtained higher order refinements. It would be useful to extend these results to semiparametric efficient estimators using the method of sieves.
Finally, given the relative ease of implementation of the sieve method, but the general difficulty of deriving its large sample properties, it might be fruitful to combine the sieve method with the kernel or the local linear regression methods [see e.g. Fan and Gijbels (1996)]. Recent papers by Horowitz and Mammen (2004) and Horowitz and Lee (2005) have demonstrated the usefulness of this combination.

References

Ai, C. (1997). "A semiparametric maximum likelihood estimator". Econometrica 65, 933–964.
Ai, C., Chen, X. (2003). "Efficient estimation of models with conditional moment restrictions containing unknown functions". Econometrica 71, 1795–1843. Working paper version, 1999.
Ai, C., Chen, X. (1999). "Efficient sieve minimum distance estimation of semiparametric conditional moment models". Manuscript. London School of Economics.
Ai, C., Chen, X. (2004). "On efficient sequential estimation of semi-nonparametric moment models". Working paper. New York University.
Ai, C., Chen, X. (2007). "Estimation of possibly misspecified semiparametric conditional moment restriction models with different conditioning variables". Journal of Econometrics. In press.
Aït-Sahalia, Y., Bickel, P., Stoker, T. (2001). "Goodness-of-fit tests for kernel regression with an application to option implied volatilities". Journal of Econometrics 105, 363–412.
Amemiya, T. (1985). Advanced Econometrics. Harvard University Press, Cambridge.
Anastassiou, G., Yu, X. (1992a). "Monotone and probabilistic wavelet approximation". Stochastic Analysis and Applications 10, 251–264.
Anastassiou, G., Yu, X. (1992b). "Convex and convex-probabilistic wavelet approximation". Stochastic Analysis and Applications 10, 507–521.
Andrews, D. (1991a). "Asymptotic optimality of generalized C_L, cross-validation, and generalized cross-validation in regression with heteroskedastic errors". Journal of Econometrics 47, 359–377.
Andrews, D. (1991b). "Asymptotic normality of series estimators for nonparametric and semiparametric regression models". Econometrica 59, 307–345.
Andrews, D. (1992). "Generic uniform convergence". Econometric Theory, 241–257.
Andrews, D. (1994a). "Empirical process method in econometrics". In: Engle III, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. 4. North-Holland, Amsterdam.
Andrews, D. (1994b). "Asymptotics for semi-parametric econometric models via stochastic equicontinuity". Econometrica 62, 43–72.
Andrews, D., Schafgans, M. (1998). "Semiparametric estimation of the intercept of a sample selection model". Review of Economic Studies 65, 497–517.
Andrews, D., Whang, Y. (1990). "Additive interactive regression models: Circumvention of the curse of dimensionality". Econometric Theory 6, 466–479.
Antoine, B., Bonnal, H., Renault, E. (2007). "On the efficient use of the informational content of estimating equations: Implied probabilities and Euclidean empirical likelihood". Journal of Econometrics 138, 488–512.
Bahadur, R.R. (1964). "On Fisher's bound for asymptotic variances". Ann. Math. Statist. 35, 1545–1552.
Bansal, R., Viswanathan, S. (1993). "No arbitrage and arbitrage pricing: A new approach". The Journal of Finance 48 (4), 1231–1262.
Bansal, R., Hsieh, D., Viswanathan, S. (1993). "A new approach to international arbitrage pricing". The Journal of Finance 48, 1719–1747.
Barnett, W.A., Powell, J., Tauchen, G. (1991). Non-parametric and Semi-parametric Methods in Econometrics and Statistics. Cambridge University Press, New York.
Barron, A.R. (1993). "Universal approximation bounds for superpositions of a sigmoidal function". IEEE Trans. Information Theory 39, 930–945.
Barron, A., Birgé, L., Massart, P. (1999). "Risk bounds for model selection via penalization". Probab. Theory Related Fields 113, 301–413.
Begun, J., Hall, W., Huang, W., Wellner, J.A. (1983). "Information and asymptotic efficiency in parametric-nonparametric models". The Annals of Statistics 11, 432–452.
Bickel, P.J., Klaassen, C.A.J., Ritov, Y., Wellner, J.A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. The Johns Hopkins University Press, Baltimore.
Bierens, H. (1990). "A consistent conditional moment test of functional form". Econometrica 58, 1443–1458.
Bierens, H. (2006). "Semi-nonparametric interval-censored mixed proportional hazard models: Identification and consistency results". Econometric Theory. In press.
Bierens, H., Carvalho, J. (2006). "Semi-nonparametric competing risks analysis of recidivism". Journal of Applied Econometrics. In press.
Bierens, H., Ploberger, W. (1997). "Asymptotic theory of integrated conditional moment tests". Econometrica 65, 1129–1151.
Birgé, L., Massart, P. (1998). "Minimum contrast estimators on sieves: Exponential bounds and rates of convergence". Bernoulli 4, 329–375.
Birman, M., Solomjak, M. (1967). "Piece-wise polynomial approximations of functions in the class $W^\alpha_p$". Mathematics of the USSR Sbornik 73, 295–317.
Blundell, R., Powell, J. (2003). "Endogeneity in nonparametric and semiparametric regression models". In: Dewatripont, M., Hansen, L.P., Turnovsky, S. (Eds.), Advances in Economics and Econometrics: Theory and Applications, vol. 2. Cambridge University Press, Cambridge, pp. 312–357.
Blundell, R., Browning, M., Crawford, I. (2003). "Non-parametric Engel curves and revealed preference". Econometrica 71, 205–240.
Blundell, R., Chen, X., Kristensen, D. (2007). "Semi-nonparametric IV estimation of shape-invariant Engel curves". Econometrica. In press.
Blundell, R., Duncan, A., Pendakur, K. (1998). "Semiparametric estimation and consumer demand". Journal of Applied Econometrics 13, 435–461.
Brendstrup, B., Paarsch, H. (2004). "Identification and estimation in sequential, asymmetric, English auctions". Manuscript, University of Iowa.
Cai, Z., Fan, J., Yao, Q. (2000). "Functional-coefficient regression models for nonlinear time series". Journal of the American Statistical Association 95, 941–956.
Cameron, S., Heckman, J. (1998). "Life cycle schooling and dynamic selection bias". Journal of Political Economy 106, 262–333.
Campbell, J., Cochrane, J. (1999). "By force of habit: A consumption-based explanation of aggregate stock market behavior". Journal of Political Economy 107, 205–251.
Carrasco, M., Florens, J.-P., Renault, E. (2006). "Linear inverse problems in structural econometrics estimation based on spectral decomposition and regularization". In: Heckman, J.J., Leamer, E.E. (Eds.), Handbook of Econometrics, vol. 6. North-Holland, Amsterdam.
Chamberlain, G. (1992). "Efficiency bounds for semiparametric regression". Econometrica 60, 567–596.
Chapman, D. (1997). "Approximating the asset pricing kernel". The Journal of Finance 52 (4), 1383–1410.
Chen, X., Conley, T. (2001). "A new semiparametric spatial model for panel time series". Journal of Econometrics 105, 59–83.
Chen, X., Fan, Y. (1999). "Consistent hypothesis testing in semiparametric and nonparametric models for econometric time series". Journal of Econometrics 91, 373–401.
Chen, X., Ludvigson, S. (2003). "Land of addicts? An empirical investigation of habit-based asset pricing models". Manuscript. New York University.
Chen, X., Pouzo, D. (2006). "Efficient estimation of semi-nonparametric conditional moment models with possibly nonsmooth moments". Manuscript. New York University.
Chen, X., Shen, X. (1996). "Asymptotic properties of sieve extremum estimates for weakly dependent data with applications". Manuscript. University of Chicago.
Chen, X., Shen, X. (1998). "Sieve extremum estimates for weakly dependent data". Econometrica 66, 289–314.
Chen, R., Tsay, R. (1993). "Functional-coefficient autoregressive models". Journal of the American Statistical Association 88, 298–308.
Chen, X., White, H. (1998). "Nonparametric adaptive learning with feedback". Journal of Economic Theory 82, 190–222.
Chen, X., White, H. (1999). "Improved rates and asymptotic normality for nonparametric neural network estimators". IEEE Trans. Information Theory 45, 682–691.
Chen, X., White, H. (2002). "Asymptotic properties of some projection-based Robbins–Monro procedures in a Hilbert space". Studies in Nonlinear Dynamics and Econometrics 6 (1). Article 1.
Chen, X., Fan, Y., Tsyrennikov, V. (2006). "Efficient estimation of semiparametric multivariate copula models". Journal of the American Statistical Association 101, 1228–1240.
Chen, X., Hansen, L.P., Scheinkman, J. (1998). "Shape-preserving estimation of diffusions". Manuscript. University of Chicago.
Chen, X., Hong, H., Tamer, E. (2005). "Measurement error models with auxiliary data". Review of Economic Studies 72, 343–366.
Chen, X., Hong, H., Tarozzi, A. (2007). "Semiparametric efficiency in GMM models of nonclassical measurement errors, missing data and treatment effects". The Annals of Statistics. In press.
Chen, X., Linton, O., van Keilegom, I. (2003). "Estimation of semiparametric models when the criterion function is not smooth". Econometrica 71, 1591–1608.
Chen, X., Racine, J., Swanson, N. (2001). "Semiparametric ARX neural network models with an application to forecasting inflation". IEEE Trans. Neural Networks 12, 674–683.
Chernozhukov, V., Imbens, G., Newey, W. (2007). "Instrumental variable identification and estimation of nonseparable models via quantile conditions". Journal of Econometrics 139, 4–14.
Chui, C. (1992). An Introduction to Wavelets. Academic Press, Inc., San Diego.
Cochrane, J. (2001). Asset Pricing. Princeton University Press, Princeton, NJ.
Constantinides, G. (1990). "Habit-formation: A resolution of the equity premium puzzle". Journal of Political Economy 98, 519–543.
Coppejans, M. (2001). "Estimation of the binary response model using a mixture of distributions estimator (MOD)". Journal of Econometrics 102, 231–261.
Coppejans, M., Gallant, A.R. (2002). "Cross-validated SNP density estimates". Journal of Econometrics 110, 27–65. Cosslett, S. (1983). "Distribution-free maximum likelihood estimation of the binary choice model". Econo- metrica 51, 765–782. Cybenko, G. (1990). "Approximation by superpositions of a sigmoid function". Mathematics of Control, Signals and Systems 2, 303–314. Darolles, S., Florens, J.-P., Renault, E. (2002). "Nonparametric instrumental regression". Mimeo. GREMAQ, University of Toulouse. Das, M., Newey, W.K., Vella, F. (2003). "Nonparametric estimation of sample selection models". Review of Economic Studies 70, 33–58. Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM, Philadelphia. de Boor, C. (1978). A Practical Guide to Splines. Springer-Verlag, New York. Dechevsky, L., Penev, S. (1997). "On shape-preserving probabilistic wavelet approximators". Stochastic Analysis and Applications 15, 187–215. de Jong, R. (1996). "The Bierens test under data dependence". Journal of Econometrics 72, 1–32. de Jong, R. (2002). "A note on 'Convergence rates and asymptotic normality for series estimators': Uniform convergence rates". Journal of Econometrics 111, 1–9. DeVore, R.A. (1977a). "Monotone approximation by splines". SIAM Journal on Mathematical Analysis 8, 891–905. DeVore, R.A. (1977b). "Monotone approximation by polynomials". SIAM Journal on Mathematical Analy- sis 8, 906–921. DeVore, R.A., Lorentz, G.G. (1993). Constructive Approximation. Springer-Verlag, Berlin. Donald, S., Newey, W. (2001). "Choosing the number of instruments". Econometrica 69, 1161–1191. Donald, S., Imbens, G., Newey, W. (2003). "Empirical likelihood estimation and consistent tests with condi- tional moment restrictions". Journal of Econometrics 117, 55–93. Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., Picard, D. (1995). "Wavelet shrinkage: Asymptopia?". Journal of the Royal Statistical Society, Series B 57, 301–369. Doukhan, P., Massart, P., Rio, E. (1995). 
"Invariance principles for absolutely regular empirical processes". Ann. Inst. Henri Poincaré – Probabilités et Statistiques 31, 393–427. Duncan, G.M. (1986). "A semiparametric censored regression estimator". Journal of Econometrics 32, 5–34. Eggermont, P., LaRiccia, V. (2001). Maximum Penalized Likelihood Estimation, Volume I: Density Estima- tion. Springer, New York. Eichenbaum, M., Hansen, L.P. (1990). "Estimating models with intertemporal substitution using aggregate time series data". Journal of Business and Economic Statistics 8, 53–69. Elbadawi, I., Gallant, A.R., Souza, G. (1983). "An elasticity can be estimated consistently without a prior knowledge of functional form". Econometrica 51, 1731–1751. Engle, R., Gonzalez-Rivera, G. (1991). "Semiparametric ARCH models". Journal of Business and Economic Statistics 9, 345–359. Ch. 76: Large Sample Sieve Estimation of Semi-Nonparametric Models 5627 Engle, R.F., McFadden, D.L. (Eds.) (1994). Handbook of Econometrics, vol. 4. North-Holland, Amsterdam. Engle, R., Rangel, G. (2004). "The spline GARCH model for unconditional volatility and its global macro- economic causes". Working paper. New York University. Engle, R., Granger, C., Rice, J., Weiss, A. (1986). "Semiparametric estimates of the relation between weather and electricity sales". Journal of the American Statistical Association 81, 310–320. Fan, J., Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall, London. Fan, Y., Li, Q. (1996). "Consistent model speci?cation tests: Omitted variables, parametric and semiparamet- ric functional forms". Econometrica 64, 865–890. Fan, Y., Linton, O. (1999). "Some higher order theory for a consistent nonparametric model speci?cation test". Working paper LSE. Fan, J., Peng, H. (2004). "On non-concave penalized likelihood with diverging number of parameters". The Annals of Statistics 32, 928–961. Fan, J., Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. 
Springer-Verlag, New York.
Fan, J., Zhang, C., Zhang, J. (2001). "Generalized likelihood ratio statistics and Wilks phenomenon". The Annals of Statistics 29, 153–193.
Flinn, C., Heckman, J. (1982). "New methods for analyzing structural models of labor force dynamics". Journal of Econometrics 18, 115–168.
Florens, J.P. (2003). "Inverse problems and structural econometrics: The example of instrumental variables". In: Dewatripont, M., Hansen, L.P., Turnovsky, S. (Eds.), Advances in Economics and Econometrics: Theory and Applications, vol. 2. Cambridge University Press, Cambridge, pp. 284–311.
Gabushin, O. (1967). "Inequalities for norms of functions and their derivatives in the Lp metric". Matematicheskie Zametki 1, 291–298.
Gallant, A.R. (1987). "Identification and consistency in seminonparametric regression". In: Bewley, T.F. (Ed.), Advances in Econometrics, vol. I. Cambridge University Press, pp. 145–170.
Gallant, A.R., Nychka, D. (1987). "Semi-non-parametric maximum likelihood estimation". Econometrica 55, 363–390.
Gallant, A.R., Souza, G. (1991). "On the asymptotic normality of Fourier flexible form estimates". Journal of Econometrics 50, 329–353.
Gallant, A.R., Tauchen, G. (1989). "Semiparametric estimation of conditional constrained heterogenous processes: Asset pricing applications". Econometrica 57, 1091–1120.
Gallant, A.R., Tauchen, G. (1996). "Which moments to match?". Econometric Theory 12, 657–681.
Gallant, A.R., Tauchen, G. (2004). "EMM: A program for efficient method of moments estimation, Version 2.0 User's Guide". Working paper. Duke University.
Gallant, A.R., White, H. (1988a). "There exists a neural network that does not make avoidable mistakes". In: Proceedings of the IEEE 1988 International Conference on Neural Networks, vol. 1. IEEE, New York, pp. 657–664.
Gallant, A.R., White, H. (1988b). A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Basil Blackwell, Oxford.
Gallant, A.R., White, H. (1992). "On learning the derivatives of an unknown mapping with multilayer feedforward networks". Neural Networks 5, 129–138.
Gallant, A.R., Hsieh, D., Tauchen, G. (1991). "On fitting a recalcitrant series: The pound/dollar exchange rate, 1974–83". In: Barnett, W.A., Powell, J., Tauchen, G. (Eds.), Non-parametric and Semi-parametric Methods in Econometrics and Statistics. Cambridge University Press, Cambridge, pp. 199–240.
Geman, S., Hwang, C. (1982). "Nonparametric maximum likelihood estimation by the method of sieves". The Annals of Statistics 10, 401–414.
Girosi, F. (1994). "Regularization theory, radial basis functions and networks". In: Cherkassky, V., Friedman, J.H., Wechsler, H. (Eds.), From Statistics to Neural Networks. Theory and Pattern Recognition Applications. Springer-Verlag, Berlin.
Granger, C.W.J., Teräsvirta, T. (1993). Modelling Nonlinear Economic Relationships. Oxford University Press, New York.
Grenander, U. (1981). Abstract Inference. Wiley Series, New York.
Härdle, W., Linton, O. (1994). "Applied nonparametric methods". In: Engle III, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. 4. North-Holland, Amsterdam.
Härdle, W., Mueller, M., Sperlich, S., Werwatz, A. (2004). Nonparametric and Semiparametric Models. Springer, New York.
Hahn, J. (1998). "On the role of the propensity score in efficient semiparametric estimation of average treatment effects". Econometrica 66, 315–332.
Hall, P., Horowitz, J. (2005). "Nonparametric methods for inference in the presence of instrumental variables". The Annals of Statistics 33, 2904–2929.
Hansen, L.P. (1982). "Large sample properties of generalized method of moments estimators". Econometrica 50, 1029–1054.
Hansen, L.P. (1985). "A method for calculating bounds on the asymptotic covariance matrices of generalized method of moments estimators". Journal of Econometrics 30, 203–238.
Hansen, M.H. (1994). "Extended linear models, multivariate splines, and ANOVA". PhD Dissertation.
Department of Statistics, University of California at Berkeley.
Hansen, L.P., Richard, S. (1987). "The role of conditioning information in deducing testable restrictions implied by dynamic asset pricing models". Econometrica 55, 587–613.
Hansen, L.P., Singleton, K. (1982). "Generalized instrumental variables estimation of nonlinear rational expectations models". Econometrica 50, 1269–1286.
Hart, J. (1997). Nonparametric Smoothing and Lack-of-Fit Tests. Springer-Verlag, New York.
Hausman, J., Newey, W. (1995). "Nonparametric estimation of exact consumer surplus and deadweight loss". Econometrica 63, 1445–1467.
Heckman, J. (1979). "Sample selection bias as a specification error". Econometrica 47, 153–161.
Heckman, J., Singer, B. (1984). "A method for minimizing the impact of distributional assumptions in econometric models for duration data". Econometrica 52, 271–320.
Heckman, J., Willis, R. (1977). "A beta logistic model for the analysis of sequential labor force participation of married women". Journal of Political Economy 85, 27–58.
Heckman, J., Ichimura, H., Smith, J., Todd, P. (1998). "Characterization of selection bias using experimental data". Econometrica 66, 1017–1098.
Hirano, K., Imbens, G., Ridder, G. (2003). "Efficient estimation of average treatment effects using the estimated propensity score". Econometrica 71, 1161–1189.
Hong, Y., White, H. (1995). "Consistent specification testing via nonparametric series regression". Econometrica 63, 1133–1159.
Honoré, B. (1990). "Simple estimation of a duration model with unobserved heterogeneity". Econometrica 58, 453–473.
Honoré, B. (1994). "A note on the rate of convergence of estimators of mixtures of Weibulls". Manuscript. Northwestern University.
Honoré, B., Kyriazidou, E. (2000). "Panel data discrete choice models with lagged dependent variables". Econometrica 68, 839–874.
Hornik, K., Stinchcombe, M., White, H. (1989). "Multilayer feedforward networks are universal approximators".
Neural Networks 2, 359–366.
Hornik, K., Stinchcombe, M., White, H., Auer, P. (1994). "Degree of approximation results for feedforward networks approximating unknown mappings and their derivatives". Neural Computation 6, 1262–1275.
Horowitz, J. (1992). "A smoothed maximum score estimator for the binary response model". Econometrica 60, 505–531.
Horowitz, J. (1998). Semiparametric Methods in Econometrics. Springer-Verlag, New York.
Horowitz, J., Lee, S. (2005). "Nonparametric estimation of an additive quantile regression model". Journal of the American Statistical Association 100, 1238–1249.
Horowitz, J., Lee, S. (2007). "Nonparametric instrumental variables estimation of a quantile regression model". Econometrica 75, 1191–1208.
Horowitz, J., Mammen, E. (2004). "Nonparametric estimation of an additive model with a link function". The Annals of Statistics 32, 2412–2443.
Horowitz, J., Spokoiny, V. (2001). "An adaptive, rate-optimal test of a parametric mean-regression model against a nonparametric alternative". Econometrica 69, 599–631.
Hu, Y., Schennach, S. (2006). "Identification and estimation of nonclassical nonlinear errors-in-variables models with continuous distributions using instruments". Working paper. University of Texas, Austin.
Huang, J.Z. (1998a). "Projection estimation in multiple regression with application to functional ANOVA models". The Annals of Statistics 26, 242–272.
Huang, J.Z. (1998b). "Functional ANOVA models for generalized regression". Journal of Multivariate Analysis 67, 49–71.
Huang, J.Z. (2001). "Concave extended linear modeling: A theoretical synthesis". Statistica Sinica 11, 173–197.
Huang, J.Z. (2003). "Local asymptotics for polynomial spline regression". The Annals of Statistics 31, 1600–1635.
Huang, J.Z., Kooperberg, C., Stone, C.J., Truong, Y.K. (2000). "Functional ANOVA modeling for proportional hazards regression". The Annals of Statistics 28, 960–999.
Hurvich, C., Simonoff, J., Tsai, C. (1998). "Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion". Journal of the Royal Statistical Society, Series B 60, 271–293.
Hutchinson, J., Lo, A., Poggio, T. (1994). "A non-parametric approach to pricing and hedging derivative securities via learning networks". The Journal of Finance 49, 851–889.
Ibragimov, I.A., Hasminskii, R.Z. (1991). "Asymptotically normal families of distributions and efficient estimation". The Annals of Statistics 19, 1681–1724.
Ichimura, H. (1993). "Semiparametric least squares (SLS), and weighted SLS estimation of single index models". Journal of Econometrics 58, 71–120.
Ichimura, H., Lee, S. (2006). "Characterization of the asymptotic distribution of semiparametric M-estimators". Manuscript. UCL.
Ichimura, H., Linton, O. (2002). "Asymptotic expansions for some semiparametric program evaluation estimators". Working paper IFS and LSE.
Ichimura, H., Todd, P. (2007). "Implementing nonparametric and semiparametric estimators". In: Heckman, J.J., Leamer, E. (Eds.), Handbook of Econometrics, vol. 6B. Elsevier. Chapter 74.
Imbens, G., Newey, W., Ridder, G. (2005). "Mean-squared-error calculations for average treatment effects". Manuscript. UC Berkeley.
Ishwaran, H. (1996a). "Identification and rates of estimation for scale parameters in location mixture models". The Annals of Statistics 24, 1560–1571.
Ishwaran, H. (1996b). "Uniform rates of estimation in the semiparametric Weibull mixture models". The Annals of Statistics 24, 1572–1585.
Jovanovic, B. (1979). "Job matching and the theory of turnover". Journal of Political Economy 87, 972–990.
Judd, K. (1998). Numerical Methods in Economics. MIT Press.
Khan, S. (2005). "An alternative approach to semiparametric estimation of heteroskedastic binary response models". Manuscript. University of Rochester.
Kim, J., Pollard, D. (1990). "Cube root asymptotics". The Annals of Statistics 18, 191–219.
Kitamura, Y., Tripathi, G., Ahn, H. (2004). "Empirical likelihood-based inference in conditional moment restriction models". Econometrica 72, 1667–1714.
Klein, R., Spady, R. (1993). "An efficient semiparametric estimator for binary response models". Econometrica 61, 387–421.
Koenker, R., Bassett, G. (1978). "Regression quantiles". Econometrica 46, 33–50.
Koenker, R., Mizera, I. (2003). "Penalized triograms: Total variation regularization for bivariate smoothing". Journal of the Royal Statistical Society, Series B 66, 145–163.
Koenker, R., Ng, P., Portnoy, S. (1994). "Quantile smoothing splines". Biometrika 81, 673–680.
Kooperberg, C., Stone, C.J., Truong, Y.K. (1995a). "Hazard regression". Journal of the American Statistical Association 90, 78–94.
Kooperberg, C., Stone, C.J., Truong, Y.K. (1995b). "Rate of convergence for logspline spectral density estimation". Journal of Time Series Analysis 16, 389–401.
Lavergne, P., Vuong, Q. (1996). "Nonparametric selection of regressors: The nonnested case". Econometrica 64, 207–219.
LeCam, L. (1960). "Local asymptotically normal families of distributions". Univ. California Publications in Statist. 3, 37–98.
Lee, S. (2003). "Efficient semiparametric estimation of a partially linear quantile regression model". Econometric Theory 19, 1–31.
Li, K. (1987). "Asymptotic optimality for Cp, CL cross-validation, and generalized cross-validation: Discrete index set". The Annals of Statistics 15, 958–975.
Li, Q., Racine, J. (2007). Nonparametric Econometrics: Theory and Practice. Princeton University Press. In press.
Li, Q., Hsiao, C., Zinn, J. (2003). "Consistent specification tests for semiparametric/nonparametric models based on series estimation methods". Journal of Econometrics 112, 295–325.
Linton, O. (1995). "Second order approximation in the partially linear regression model". Econometrica 63, 1079–1112.
Linton, O. (2001). "Edgeworth approximations for semiparametric instrumental variable estimators and test statistics". Journal of Econometrics 106, 325–368.
Linton, O., Mammen, E. (2005). "Estimating semiparametric ARCH(∞) models by kernel smoothing methods". Econometrica 73, 771–836.
Lorentz, G. (1966). Approximation of Functions. Holt, New York.
Mahajan, A. (2004). "Identification and estimation of single index models with misclassified regressors". Manuscript. Stanford University.
Makovoz, Y. (1996). "Random approximants and neural networks". Journal of Approximation Theory 85, 98–109.
Manski, C. (1985). "Semiparametric analysis of discrete response: Asymptotic properties of the maximum score estimator". Journal of Econometrics 27, 313–334.
Manski, C. (1994). "Analog estimation of econometric models". In: Engle III, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. 4. North-Holland, Amsterdam.
Matzkin, R.L. (1994). "Restrictions of economic theory in nonparametric methods". In: Engle III, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. 4. North-Holland, Amsterdam.
McCaffrey, D., Ellner, S., Gallant, A., Nychka, D. (1992). "Estimating the Lyapunov exponent of a chaotic system with nonparametric regression". Journal of the American Statistical Association 87, 682–695.
Meyer, Y. (1992). Ondelettes et operateurs I: Ondelettes. Hermann, Paris.
Murphy, S., van der Vaart, A. (2000). "On profile likelihood". Journal of the American Statistical Association 95, 449–465.
Newey, W.K. (1990a). "Semiparametric efficiency bounds". Journal of Applied Econometrics 5, 99–135.
Newey, W.K. (1990b). "Efficient instrumental variables estimation of nonlinear models". Econometrica 58, 809–837.
Newey, W.K. (1991). "Uniform convergence in probability and stochastic equicontinuity". Econometrica 59, 1161–1167.
Newey, W.K. (1993). "Efficient estimation of models with conditional moment restrictions". In: Maddala, G.S., Rao, C.R., Vinod, H.D. (Eds.), Handbook of Statistics, vol. 11.
North-Holland, Amsterdam.
Newey, W.K. (1994a). "The asymptotic variance of semiparametric estimators". Econometrica 62, 1349–1382.
Newey, W.K. (1994b). "Series estimation of regression functionals". Econometric Theory 10, 1–28.
Newey, W.K. (1997). "Convergence rates and asymptotic normality for series estimators". Journal of Econometrics 79, 147–168.
Newey, W.K. (2001). "Flexible simulated moment estimation of nonlinear errors in variables models". Review of Economics and Statistics 83, 616–627.
Newey, W.K. (1988). "Two step series estimation of sample selection models". Manuscript. MIT Department of Economics.
Newey, W.K., McFadden, D.L. (1994). "Large sample estimation and hypothesis testing". In: Engle III, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. 4. North-Holland, Amsterdam.
Newey, W.K., Powell, J.L. (1989). "Nonparametric instrumental variable estimation". Manuscript. Princeton University.
Newey, W.K., Powell, J.L. (2003). "Instrumental variable estimation of nonparametric models". Econometrica 71, 1565–1578. Working paper version, 1989.
Newey, W.K., Smith, R. (2004). "Higher order properties of GMM and generalized empirical likelihood estimators". Econometrica 72, 219–256.
Newey, W.K., Powell, J.L., Vella, F. (1999). "Nonparametric estimation of triangular simultaneous equations models". Econometrica 67, 565–603.
Nishiyama, Y., Robinson, P.M. (2000). "Edgeworth expansions for semiparametric averaged derivatives". Econometrica 68, 931–980.
Nishiyama, Y., Robinson, P.M. (2005). "The bootstrap and the Edgeworth correction for semiparametric averaged derivatives". Econometrica 73, 903–980.
Ossiander, M. (1987). "A central limit theorem under metric entropy with L2 bracketing". The Annals of Probability 15, 897–919.
Otsu, T. (2005). "Sieve conditional empirical likelihood estimation of semiparametric models". Manuscript. Yale University.
Pagan, A., Ullah, A. (1999).
Nonparametric Econometrics. Cambridge University Press.
Pakes, A., Olley, G.S. (1995). "A limit theorem for a smooth class of semiparametric estimators". Journal of Econometrics 65, 295–332.
Pastorello, S., Patilea, V., Renault, E. (2003). "Iterative and recursive estimation in structural non-adaptive models". Journal of Business & Economic Statistics 21, 449–509.
Phillips, P.C.B. (1998). "New tools for understanding spurious regressions". Econometrica 66, 1299–1325.
Phillips, P.C.B., Ploberger, W. (2003). "An introduction to best empirical models when the parameter space is infinite-dimensional". Oxford Bulletin of Economics and Statistics 65, 877–890.
Pinkse, J. (2000). "Nonparametric two-step regression estimation when regressors and errors are dependent". Canadian Journal of Statistics 28, 289–300.
Polk, C., Thompson, T.S., Vuolteenaho, T. (2003). "New forecasts of the equity premium". Manuscript. Harvard University.
Pollard, D. (1984). Convergence of Statistical Processes. Springer-Verlag, New York.
Portnoy, S. (1997). "Local asymptotics for quantile smoothing splines". The Annals of Statistics 25, 387–413.
Powell, J. (1994). "Estimation of semiparametric models". In: Engle III, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. 4. North-Holland, Amsterdam.
Powell, J., Stock, J., Stoker, T. (1989). "Semiparametric estimation of index coefficients". Econometrica 57, 1403–1430.
Robinson, P. (1988). "Root-N-consistent semiparametric regression". Econometrica 56, 931–954.
Robinson, P. (1989). "Hypothesis testing in semiparametric and nonparametric models for econometric time series". Review of Economic Studies 56, 511–534.
Robinson, P. (1995). "The normal approximation for semiparametric averaged derivatives". Econometrica 63, 667–680.
Ruppert, D., Wand, M., Carroll, R. (2003). Semiparametric Regression. Cambridge University Press, Cambridge.
Schumaker, L. (1981). Spline Functions: Basic Theory. John Wiley & Sons, New York.
Severini, T., Wong, W.H.
(1992). "Profile likelihood and conditionally parametric models". The Annals of Statistics 20, 1768–1802.
Shen, X. (1997). "On methods of sieves and penalization". The Annals of Statistics 25, 2555–2591.
Shen, X., Wong, W. (1994). "Convergence rate of sieve estimates". The Annals of Statistics 22, 580–615.
Shen, X., Ye, J. (2002). "Adaptive model selection". Journal of the American Statistical Association 97, 210–221.
Shintani, M., Linton, O. (2004). "Nonparametric neural network estimation of Lyapunov exponents and a direct test for chaos". Journal of Econometrics 120, 1–34.
Song, K. (2005). "Testing semiparametric conditional moment restrictions using conditional martingale transforms". Manuscript. Yale University, Department of Economics.
Stinchcombe, M. (2002). "Some genericity analyses in nonparametric econometrics". Manuscript, University of Texas, Austin, Department of Economics.
Stinchcombe, M., White, H. (1998). "Consistent specification testing with nuisance parameters present only under the alternative". Econometric Theory 14, 295–325.
Stone, C.J. (1982). "Optimal global rates of convergence for nonparametric regression". The Annals of Statistics 10, 1040–1053.
Stone, C.J. (1985). "Additive regression and other nonparametric models". The Annals of Statistics 13, 689–705.
Stone, C.J. (1986). "The dimensionality reduction principle for generalized additive models". The Annals of Statistics 14, 590–606.
Stone, C.J. (1990). "Large-sample inference for log-spline models". The Annals of Statistics 18, 717–741.
Stone, C.J. (1994). "The use of polynomial splines and their tensor products in multivariate function estimation (with discussion)". The Annals of Statistics 22, 118–184.
Stone, C.J., Hansen, M., Kooperberg, C., Truong, Y.K. (1997). "Polynomial splines and their tensor products in extended linear modeling (with discussion)". The Annals of Statistics 25, 1371–1470.
Strawderman, R.L., Tsiatis, A.A. (1996). "On the asymptotic properties of a flexible hazard estimator". The Annals of Statistics 24, 41–63.
Timan, A.F. (1963). Theory of Approximation of Functions of a Real Variable. MacMillan, New York.
Van de Geer, S. (1993). "Hellinger-consistency of certain nonparametric maximum likelihood estimators". The Annals of Statistics 21, 14–44.
Van de Geer, S. (1995). "The method of sieves and minimum contrast estimators". Mathematical Methods of Statistics 4, 20–38.
Van de Geer, S. (2000). Empirical Processes in M-estimation. Cambridge University Press.
Van der Vaart, A. (1991). "On differentiable functionals". The Annals of Statistics 19, 178–204.
Van der Vaart, A., Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag, New York.
Vapnik, V. (1998). Statistical Learning Theory. Wiley–Interscience, New York.
Wahba, G. (1990). Spline Models for Observational Data. CBMS–NSF Regional Conference Series. Philadelphia.
White, H. (1984). Asymptotic Theory for Econometricians. Academic Press.
White, H. (1990). "Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings". Neural Networks 3, 535–550.
White, H. (1994). Estimation, Inference and Specification Analysis. Cambridge University Press.
White, H., Wooldridge, J. (1991). "Some results on sieve estimation with dependent observations". In: Barnett, W.A., Powell, J., Tauchen, G. (Eds.), Non-parametric and Semi-parametric Methods in Econometrics and Statistics. Cambridge University Press, Cambridge, pp. 459–493.
Wong, W.H. (1992). "On asymptotic efficiency in estimation theory". Statistica Sinica 2, 47–68.
Wong, W.H., Severini, T. (1991). "On maximum likelihood estimation in infinite dimensional parameter spaces". The Annals of Statistics 19, 603–632.
Wong, W.H., Shen, X. (1995). "Probability inequalities for likelihood ratios and convergence rates for sieve MLE's". The Annals of Statistics 23, 339–362.
Wooldridge, J. (1992). "A test for functional form against nonparametric alternatives". Econometric Theory 8, 452–475.
Wooldridge, J. (1994). "Estimation and inference for dependent processes". In: Engle III, R.F., McFadden, D.L. (Eds.), Handbook of Econometrics, vol. 4. North-Holland, Amsterdam.
Xiao, Z., Linton, O. (2001). "Second order approximation for an adaptive estimator in a linear regression". Econometric Theory 17, 984–1024.
Zhang, J., Gijbels, I. (2003). "Sieve empirical likelihood and extensions of the generalized least squares". Scandinavian Journal of Statistics 30, 1–24.
Zhou, S., Shen, X., Wolfe, D.A. (1998). "Local asymptotics for regression splines and confidence regions". The Annals of Statistics 26, 1760–1782.