gam                   package:mgcv                   R Documentation

_G_e_n_e_r_a_l_i_z_e_d _A_d_d_i_t_i_v_e _M_o_d_e_l_s _u_s_i_n_g _p_e_n_a_l_i_z_e_d _r_e_g_r_e_s_s_i_o_n _s_p_l_i_n_e_s _a_n_d 
_G_C_V

_D_e_s_c_r_i_p_t_i_o_n:

     Fits the specified generalized additive model (GAM) to data.
     `gam()' is not a clone of what Splus provides. Smooth terms are
     represented using penalized regression splines with smoothing
     parameters selected by GCV or by regression splines with fixed
     degrees of freedom (mixtures of the two are permitted).
     Multi-dimensional smooths are available using penalized thin plate
     regression splines, but the user must make sure that covariates
     are sensibly scaled relative to each other when using such terms.
     For a general overview see Wood (2001).

_U_s_a_g_e:

      gam(formula,family=gaussian(),data=list(),weights=NULL,
          control=gam.control,scale=0)

_A_r_g_u_m_e_n_t_s:

 formula: A GAM formula. This is exactly like the formula for a glm
          except that smooth terms can be added to the right hand side
          of the formula (and a formula of the form `y ~ .' is not
          allowed). Smooth terms are specified by expressions of the
          form: `s(var1,var2,...,k=12,fx=FALSE,bs="tp")' where `var1',
          `var2', etc. are the covariates which the smooth is a
          function of and `k' is the dimension of the basis used to
          represent the smooth term. If `k' is not specified then
          `k=10*3^(d-1)' is used where `d' is the number of covariates
          for this term. `fx' is used to indicate whether or not this
          term has a fixed number of degrees of freedom (`fx=FALSE' to
          select d.f. by GCV/UBRE). `bs' indicates the basis to use,
          with `"cr"' indicating cubic regression spline, and `"tp"'
          indicating thin plate regression spline: `"cr"' can only be
          used with 1-d smooths.

          For backwards compatibility the formula may also include
          terms like  `s(x,12|f)', which specifies a regression spline
          which is not to be penalized and has 12 knots, or `s(x,z,25)'
          indicating a rank 25 penalized t.p.r.s. In such cases
          arguments `k', `fx' and `bs' are ignored if supplied and a
          one dimensional term will always use a cubic regression
          spline basis. Note that a term of the form `s(x)' will result
          in a term with a `"tp"' basis.
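
          For example (an illustrative sketch; `x', `z', `x0' and
          `x1' are hypothetical covariates), the following are all
          valid smooth specifications:

               y ~ s(x)                     # 1-d smooth, "tp" basis, k=10
               y ~ x0 + s(x1,k=20,bs="cr")  # parametric term plus a
                                            # cubic regression spline
               y ~ s(x,z,k=30)              # 2-d t.p.r.s., basis dimension 30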

  family: This is a family object specifying the distribution and link
          to use in fitting etc. See `glm' and `family' for more
          details. Where the family is `neg.binom' a negative binomial
          family is used, based on the implementation in the MASS
          library. In this case, if the value of `theta' is not given,
          a modified version of `glm.nb', `gam.nbut', is used to
          estimate `theta' iteratively, starting from a Poisson
          distribution. This extra layer of iteration slows down
          fitting.

    data: A data frame containing the model response variable and
          covariates required by the formula. If this is missing then
          the frame from which `gam' was called is searched for the
          variables specified in the formula.

 weights: Prior weights on the data.

 control: A list as returned by `gam.control', with five user
          controllable elements: `maxit' controls the maximum number
          of iterations in `gam.fit'; the convergence tolerance in
          `gam.fit' is controlled by `epsilon'; and `trace' controls
          whether diagnostic output is printed. The smoothing
          parameter selection method is controlled by two further
          items: `mgcv.tol' controls the convergence tolerance to use
          in smoothing parameter estimation, while `mgcv.half.max'
          controls the maximum number of step halvings to try in each
          optimization step if the step fails to reduce the GCV score.
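
          For example (a hedged sketch using only the control elements
          listed above):

               gc <- gam.control(maxit=30,epsilon=1e-07,trace=TRUE)
               b <- gam(y~s(x),control=gc)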

   scale: If this is zero then GCV is used for all distributions except
          Poisson, binomial and negative binomial where UBRE is used
          with scale parameter assumed to be 1. If this is greater than
          1 it is assumed to be the scale parameter/variance and UBRE
          is used. If `scale' is negative, GCV is always used (for
          binomial models in particular, it is probably worth
          comparing UBRE and GCV results; for ``over-dispersed
          Poisson'' GCV is probably more appropriate than UBRE).
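
          For example (an illustrative sketch), a negative `scale'
          forces GCV based selection for a binomial model, which can
          then be compared with the default UBRE based fit:

               b.ubre <- gam(y~s(x),family=binomial)          # scale=0: UBRE
               b.gcv <- gam(y~s(x),family=binomial,scale=-1)  # GCV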

_D_e_t_a_i_l_s:

     Two alternative bases are available for representing model terms.
     Univariate smooth terms can be represented using conventional
     cubic regression splines - which are very efficient
     computationally - or thin plate regression splines. Multivariate
     terms must be represented using thin plate regression splines. For
     either basis the user specifies the dimension of the basis for
     each smooth term. The dimension of the basis is one more than the
     maximum degrees of freedom that the  term can have, but usually
     the term will be fitted by penalized maximum likelihood estimation
     and the actual degrees of freedom will be chosen by GCV. However,
     the user can choose to fix the degrees of freedom of a term, in
     which case the actual degrees of freedom will be one less than the
     basis dimension.
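
     For example (a minimal sketch, with hypothetical `y' and `x'),
     `s(x,k=20)' can have at most 19 degrees of freedom, with the
     actual value chosen by GCV/UBRE, while `s(x,k=20,fx=TRUE)' is
     fitted with exactly 19 degrees of freedom:

          b <- gam(y~s(x,k=20))            # d.f. selected by GCV/UBRE
          bf <- gam(y~s(x,k=20,fx=TRUE))   # d.f. fixed at 19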

     Thin plate regression splines are constructed by starting with the
     basis for a full thin plate spline and then truncating this basis
     in an optimal manner, to obtain a low rank smoother. Details are
     given in Wood (MS submitted). One key advantage of the approach is
     that it avoids the knot placement problems of conventional
     regression spline modelling, but it also has the advantage that
     smooths of lower rank are nested within smooths of higher rank, so
     that it is legitimate to use conventional hypothesis testing
     methods to compare models based on pure regression splines.  
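
     For example (a hedged sketch, legitimate only for pure regression
     spline fits with fixed degrees of freedom and a known scale
     parameter, e.g. Poisson errors), two nested "tp" basis fits can
     be compared by an ordinary deviance test:

          b.small <- gam(y~s(x,k=5,fx=TRUE),family=poisson)
          b.big <- gam(y~s(x,k=15,fx=TRUE),family=poisson)
          # 10 d.f. difference between the two fits
          1-pchisq(b.small$deviance-b.big$deviance,10)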

     In the case of the cubic regression spline basis, knots of the
     spline are placed evenly throughout the covariate values to which
     the term refers: for example, if fitting 101 data with an 11 knot
     spline of `x' then there would be a knot at every 10th (ordered)
     `x' value. The parameterization used represents the spline in
     terms of its values at the knots. Connecting these values at
     neighbouring knots by sections of cubic polynomial, constrained
     to join at the knots so as to be continuous up to and including
     second derivative, yields a natural cubic spline through the
     values at the knots (given two extra conditions specifying that
     the second derivative of the curve should be zero at the two end
     knots). This parameterization makes the parameters directly
     interpretable: each is the value of the spline at one knot.
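
     The even placement rule is easy to reproduce directly (a minimal
     sketch of the rule described above, not code used by `gam'
     itself):

          x <- sort(runif(101))
          knots <- x[seq(1,101,by=10)]   # 11 knots, every 10th ordered value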

     Given a basis for each smooth term, it is easy to obtain a
     wiggliness penalty for each, and hence to construct a penalized
     likelihood which balances the fit of the overall model against
     its complexity. This consists of the log likelihood for the model
     minus a sum of wiggliness penalties (one for each smooth), each
     multiplied by a smoothing parameter. The smoothing parameters
     control the trade-off between fit and smoothness.
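
     In symbols (a sketch consistent with the description above -
     conventions differ by constant factors): if l(b) is the model log
     likelihood, S_i the wiggliness penalty matrix for the ith smooth
     and lambda_i its smoothing parameter, the penalized likelihood is

          l_p(b) = l(b) - (1/2) sum_i lambda_i b'S_i b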

     So the gam fitting problem becomes a penalized glm fitting
     problem, which can be fitted using a slight modification of
     `glm.fit': `gam.fit'. The penalized glm approach also allows
     smoothing parameters for all smooth terms to be selected
     simultaneously by GCV or UBRE. This is achieved as part of
     fitting, by calling `mgcv' within `gam.fit'.

     Details of the GCV/UBRE minimization method are given in Wood
     (2000).
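
     After fitting, the selected smoothing parameters and estimated
     degrees of freedom can be inspected directly (a usage sketch
     based on the returned object described below):

          b <- gam(y~s(x0)+s(x1))
          b$sp    # smoothing parameter selected for each smooth
          b$edf   # estimated degrees of freedom for each smooth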

_V_a_l_u_e:

     The function returns an object of class `"gam"' which has the
     following elements: 

coefficients: the coefficients of the fitted model. Parametric
          coefficients are  first, followed  by coefficients for each
          spline term in turn.

residuals: the deviance residuals for the fitted model.

fitted.values: fitted model predictions of expected value for each
          datum.

  family: family object specifying distribution and link used. 

linear.predictor: fitted model prediction of link function of expected
          value for  each datum.

deviance: the (unpenalized) deviance of the fitted model.

null.deviance: deviance for single parameter model.

 df.null: null degrees of freedom

    iter: number of iterations of IRLS taken to get convergence.

 weights: final weights used in IRLS iteration.

prior.weights: prior weights on observations.

       y: response data.

converged: indicates whether or not the iterative fitting method
          converged.

    sig2: estimated or supplied variance/scale parameter.

     edf: estimated degrees of freedom for each smooth.

boundary: did parameters end up at boundary of parameter space?

      sp: smoothing parameter for each smooth.

      df: number of knots for each smooth (one more than maximum
          degrees of freedom).

    nsdf: number of parametric, non-smooth, model terms including the
          intercept.

      Vp: estimated covariance matrix for parameter estimators.

covariate.shift: the amounts by which the covariates were shifted in
          order to centre them around zero.

      xp: knot locations for each cubic regression spline based smooth.
          `xp[i,]' are the locations for the ith smooth.

      UZ: array storing the matrices for transforming from the
          t.p.r.s. basis to the equivalent t.p.s. basis - see
          `GAMsetup' for details of how the matrices are packed in
          this array.

      Xu: the set of unique covariate locations used to define the
          t.p.s. from which the t.p.r.s. basis was derived. Again see
          `GAMsetup' for details of the packing algorithm.

xu.length: The number of unique covariate combinations in the data. 

 formula: the model formula.

full.formula: the model formula with each smooth term fully expanded
          and with optional arguments given explicitly (i.e. not by
          reference to other variables) - useful for later prediction
          from the model.

       x: parametric design matrix columns (including intercept)
          followed by the data that form arguments of the smooths.

  s.type: type of spline basis used: 0 for conventional cubic
          regression spline, 1 for t.p.r.s.

 p.order: the order of the penalty used for each term. 0 signals
          auto-selection.

      dim: number of covariates of which each term is a function.

     call: an object of mode `call' containing the call to `gam()'
          that produced this `gam' object (useful for constructing
          model frames).

mgcv.conv: A list of smoothing parameter convergence diagnostics, with
          the following elements (irrelevant for models with only one
          smoothing parameter to estimate):

     _g the gradient of the GCV/UBRE score w.r.t. the smoothing
          parameters at termination.

     _h the second derivatives corresponding to `g' above - i.e. the
          leading diagonal of the Hessian.

     _e the eigen-values of the Hessian. These should all be
          non-negative!

     _i_t_e_r the number of iterations taken.

     _i_n._o_k `TRUE' if the second smoothing parameter guess improved the
          GCV/UBRE score. (Please report examples  where this is
          `FALSE')

     _s_t_e_p._f_a_i_l `TRUE' if the algorithm terminated by failing to improve
          the GCV/UBRE score rather than by "converging".  Not
          necessarily a problem, but check the above derivative
          information quite carefully.
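
          These diagnostics can be inspected directly, e.g. (a usage
          sketch):

               b$mgcv.conv$g   # gradient: should be near zero
               b$mgcv.conv$e   # eigenvalues: should be non-negative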

_W_A_R_N_I_N_G_S:

     The code does not check for rank deficiency of the model matrix
     - it will likely just fail instead!

     You must have more unique combinations of covariates than the
     model has total parameters. (Total parameters is sum of basis
     dimensions plus sum of non-spline terms less the number of spline
     terms). 
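
     For example (an illustrative check with hypothetical covariates
     `x', `z' and `w'): the model `y~s(x,z,k=30)+s(w,k=10)' has
     30+10+1-2 = 39 total parameters, so more than 39 unique
     combinations of `x', `z' and `w' are required:

          nrow(unique(data.frame(x,z,w))) > 39   # must be TRUE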

     Automatic smoothing parameter selection is not likely to work well
     when  fitting models to very few response data.

     Relative scaling of covariates within a multi-dimensional smooth
     term has an effect on the results: make sure that relative
     scalings are sensible. For example, measuring one spatial
     co-ordinate in millimetres and the other in lightyears will
     usually produce poor results.

     With large datasets (more than a few thousand data) the `"tp"'
     basis gets very slow to use. In this case use `"cr"' for 1-d
     smooths. If you need to use multi-dimensional terms with large
     datasets and find `gam' too slow, please let me know - and I'll up
     the priority for fixing this!

_A_u_t_h_o_r(_s):

     Simon N. Wood snw@st-and.ac.uk

_R_e_f_e_r_e_n_c_e_s:

     Hastie and Tibshirani (1990) Generalized Additive Models. Chapman
     and Hall.

     Green and Silverman (1994) Nonparametric Regression and
     Generalized Linear Models. Chapman and Hall.

     Gu and Wahba (1991) Minimizing GCV/GML scores with multiple
     smoothing parameters via the Newton method. SIAM J. Sci. Statist.
     Comput. 12:383-398

     Wood (2000) Modelling and Smoothing Parameter Estimation  with
     Multiple  Quadratic Penalties. JRSSB 62(2):413-428

     Wood (2001) mgcv: GAMs and Generalized Ridge Regression for R. R
     News 1(2):20-25

     Wood (MS submitted) Thin Plate Regression Splines

     Wahba (1990) Spline Models of Observational Data. SIAM 

     <URL: http://www.ruwpa.st-and.ac.uk/simon.html>

_S_e_e _A_l_s_o:

     `s', `predict.gam', `plot.gam'

_E_x_a_m_p_l_e_s:

     library(mgcv)
     set.seed(1)
     n<-400
     sig2<-4
     x0 <- runif(n, 0, 1)
     x1 <- runif(n, 0, 1)
     x2 <- runif(n, 0, 1)
     x3 <- runif(n, 0, 1)
     f <- 2 * sin(pi * x0)
     f <- f + exp(2 * x1) - 3.75887
     f <- f + 0.2 * x2^11 * (10 * (1 - x2))^6 + 10 * (10 * x2)^3 * (1 - x2)^10 - 1.396
     e <- rnorm(n, 0, sqrt(sig2))
     y <- f + e
     b<-gam(y~s(x0)+s(x1)+s(x2)+s(x3))
     plot(b,pages=1)
     # now a GAM with a fixed 3 d.f. regression spline term and two
     # penalized terms (s(x2,15) is old style syntax: a penalized
     # cubic regression spline with 15 knots)
     b1<-gam(y~s(x0,k=4,fx=TRUE,bs="tp")+s(x1,k=12)+s(x2,15))
     plot(b1,pages=1)
     # now fit a 2-d term to x0,x1
     b3<-gam(y~s(x0,x1)+s(x2)+s(x3))
     par(mfrow=c(2,2))
     plot(b3)
     par(mfrow=c(1,1))
     # now simulate poisson data
     g<-exp(f/5)
     y<-rpois(n,g)
     b2<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=poisson)
     plot(b2,pages=1)
     # negative binomial data
     set.seed(1)
     y<-rnbinom(n,size=2,mu=g)
     b3<-gam(y~s(x0)+s(x1)+s(x2)+s(x3),family=neg.binom)
     plot(b3,pages=1)
     # and a pretty 2-d smoothing example....
     test1<-function(x,z,sx=0.3,sz=0.4)
     { (pi^sx*sz)*(1.2*exp(-(x-0.2)^2/sx^2-(z-0.3)^2/sz^2)+
       0.8*exp(-(x-0.7)^2/sx^2-(z-0.8)^2/sz^2))
     }
     n<-500
     old.par<-par(mfrow=c(2,2))
     x<-runif(n);z<-runif(n)
     xs<-seq(0,1,length=30);zs<-seq(0,1,length=30)
     pr<-data.frame(x=rep(xs,30),z=rep(zs,rep(30,30)))
     truth<-matrix(test1(pr$x,pr$z),30,30)
     contour(xs,zs,truth)
     y<-test1(x,z)+rnorm(n)*0.1
     b4<-gam(y~s(x,z))
     fit1<-matrix(predict.gam(b4,pr,se=FALSE),30,30)
     contour(xs,zs,fit1)
     persp(xs,zs,truth)
     persp(b4)
     par(old.par)

