\documentclass{article}
\usepackage[utf8]{inputenc}
%\usepackage{tikz}
\usepackage{natbib}
\usepackage{amssymb}
\usepackage{url}
\usepackage{hyperref}
\usepackage{color}
\usepackage{fancyvrb} %  VerbatimFootnotes, allow verb in footnotes
\usepackage{listings}
\usepackage{moreverb}
\usepackage{graphicx}
\DeclareGraphicsExtensions{.pdf}
% #### this ## DISABLED comment persuades Sweave not to insert \ usepackage { SXweaveXXX }
%% next stuff must be in PREAMBLE, not code
\SweaveOpts{keep.source=TRUE}
<<SweaveListingsPreparations,results=tex,echo=FALSE,strip.white=false>>=
## require(SweaveListingUtils)
## SweaveListingPreparations()
@ 
\usepackage{bm}
\begin{document}
%% disable for SweaveListingsUtils
\newcommand{\code}[1]{{\tt #1}}
\newcommand{\todo}[1]{{\color{red} \bf #1}}

\newcommand{\X}{{\mathbf X}}
\newcommand{\vu}{{\mathbf u}}
\newcommand{\vb}{{\mathbf b}}
\newcommand{\vzero}{{\mathbf 0}}
\newcommand{\cb}{{\mathcal B}}
\newcommand{\Z}{{\mathbf Z}}
\newcommand{\f}{{\mathbf f}}
\newcommand{\veta}{{\bm \eta}}
\newcommand{\vtheta}{{\bm \theta}}
\newcommand{\vbeta}{{\bm \beta}}
\newcommand{\vsigma}{{\bm \Sigma}}
\newcommand{\Vnest}{\sigma^2_{\mbox{\small nest}}}
\newcommand{\Vobs}{\sigma^2_{\mbox{\small obs}}}

\newcommand{\z}{\tau} % zero-inflation parameter
\title{Owls example: a zero-inflated, generalized linear mixed model for count data}
\date{\today}
\author{Ben Bolker, Mollie Brooks, Beth Gardner, Cleridy Lennert, Mihoko Minami}
\bibliographystyle{chicago}

\maketitle

%\lstdefinestyle{Routstyle}{style=RoutstyleO,numbers=left,numberstyle=\tiny,%
%basicstyle=\ttfamily,columns=fixed}

<<echo=FALSE>>=
options(continue=" ")
if (require(cacheSweave)) {
  setCacheDir("owls_cache")
}
@ 

\section{The model}

Data collected in ecological studies are often complex. Studies may
involve repeat observations on the same units (e.g., individuals,
quadrats, stations).  Predictor variables may be categorical or
continuous, and interactions may be of interest.  In addition, such
data may contain excess zero-valued observations (with respect to a
Poisson model) because of sampling variability and/or incomplete
knowledge of processes that generated the data (e.g., factors that
define suitable species habitat)
\cite{zuur_mixed_2009}. \emph{Zero-inflated generalized linear
  mixed-effects models} (ZIGLMMs) are a class of models, incorporating
aspects of generalized linear models, mixed models, and zero-inflated
models, that are both flexible and computationally efficient tools for
data of this sort.

The data for this example, taken from \cite{zuur_mixed_2009} and
ultimately from \cite{roulinbersier_2007}, quantify the amount
of sibling negotiation (vocalizations when parents
are absent) by owlets (owl chicks) in different nests as a
function of food treatment (deprived or satiated), the sex of the
parent, and arrival time of the parent at the nest. Since the same nests are
observed repeatedly, it is natural to consider a mixed-effects model
for these data, with the nest as a random effect. Because we are
interested in explaining variability in the number of vocalizations
\emph{per chick}, the total brood size in each nest is used as an offset
in the model.

Since the same nests are measured repeatedly, \cite{zuur_mixed_2009} fitted a Poisson generalized linear mixed-effects model to these data (see their Section 13.2.2). We extend that example by considering zero-inflation.

We use a zero-inflated Poisson model 
with a log link function for the count (Poisson) part
[i.e., the inverse link is exponential]
of the model and a logit link for the binary part
[i.e., the inverse link is logistic]
\begin{equation}
  Y_{ij} \sim \mbox{ZIPois}(\lambda=\exp(\eta_{ij}),p=\mbox{logistic}(z_i))
\label{eq:zip}
\end{equation}
where ZIPois represents the zero-inflated Poisson distribution,
and $\mbox{logistic}(z) \equiv (1+\exp(-z))^{-1}$.
The zero-inflated Poisson with parameters $\lambda$, $p$ has
probability $p+(1-p) \exp(-\lambda)$ if $Y=0$ and $(1-p) \mbox{Pois}(X,\lambda)$ if $Y>0$.

In this case we use only a single, overall zero-inflation parameter
for the whole model --- $z_i=z_0$. (In R's Wilkinson-Rogers notation,
this would be written as \verb+z~1+.)

The model for the vector $\veta$, the linear predictor of
the Poisson part of the distribution,
follows a standard linear mixed model formulation:
\begin{equation}
  (\veta|\cb=\vb) = \X \vbeta + \Z \vb + \f
\label{eq:linpred}
\end{equation}
and
\begin{equation}
  \cb \sim \mbox{Normal}(\vzero,\vsigma_\theta)
\label{eq:refmat}
\end{equation}
where $\veta$ is the vector of linear predictors;
$\cb$ is the random variable representing a vector
of random effects; $\vb$ is a \emph{particular} value
of this random variable; $\X$ is the fixed-effect
design matrix; $\vbeta$ is the vector of fixed-effect
parameters; $\Z$ is the random-effect design matrix;
$\vsigma_\theta$ is the variance-covariance matrix of
the random effects, with parameters $\vtheta$; and
$\f$ is an offset vector.

In this case, the  fixed-effect
design matrix $\X$ includes columns
for the intercept, difference between males and females,
between food treatments, arrival times, and their interactions.
The random effect design matrix $\Z$
is a simple dummy matrix encoding nest identity.
There is only a single random effect, so $\vsigma_\vtheta$
is a diagonal matrix with the
random effects (nest) variance $\Vnest$ on the diagonal
(the nest random effects are independent and identically
distributed)
and the random-effects parameter vector is
just $\vtheta=\{\Vnest\}$.
The offset $\f$, equal to the log of the number
of chicks in the nest, accounts for the
fact that the data give the total number of
vocalizations per \emph{nest}, while we
are ultimately interested in the total
number of vocalizations per \emph{chick}. 

For the purposes of model comparison (timing, accuracy etc.)
we will stick with the offset, zero-inflated Poisson model
described by eqs. (\ref{eq:linpred}, \ref{eq:zip}), but
in the discussion below we will also consider alternative
formulations that seem to fit the data well (e.g.
using overdispersed distributions such as the
lognormal-Poisson or negative binomial; allowing the
response to brood size to be other than strictly proportional).

\section{Preliminaries}

Load packages:

<<loadpkg,results=hide>>=
library(coda)
library(reshape)
library(MCMCglmm)
library(coefplot2)
library(glmmADMB)
library(bbmle)
library(ggplot2)
theme_set(theme_bw())  ## set color scheme
library(RColorBrewer)
library(R2jags)
## library(lme4)  ## wait to load -- glmmADMB/lme4 conflicts
@ 
Package versions:
<<echo=FALSE>>=
usedpkgs <- sort(c("coda","reshape","MCMCglmm","coefplot2","glmmADMB","lme4","bbmle","RColorBrewer",
                   "ggplot2","R2jags","rjags","coda","R2WinBUGS"))
i1 <- installed.packages()[usedpkgs,"Version"]
## HACK required for SweaveListingsUtils package
## print(i1[1:4],quote=FALSE)
## print(i1[5:6],quote=FALSE)
print(i1,quote=FALSE)
@ 

Load the data and use the \code{rename} function from the
\code{reshape} package to change the name of the response variable:
<<loadtweak>>=
load("../DATA/Owls.rda")
library(reshape)
Owls <- rename(Owls,c(SiblingNegotiation="NCalls"))
@ 

Scale arrival time: necessary for WinBUGS, useful
to have it done up front for the sake of parameter comparisons\footnote{here we \emph{replace} the original variable with the scaled version, rather than creating a new variable called (say) \code{scArrivalT}.  This can potentially lead to confusion if we forget whether we are dealing with the scaled or the unscaled version \ldots}
<<>>=
Owls <- transform(Owls,ArrivalTime=scale(ArrivalTime,center=TRUE,scale=FALSE))
@ 

\section{R}

There are (at least) three different ways to do this problem in R,
although (as far as we know) there is no simple, out-of-the-box
method that is built purely on R (or underlying C or FORTRAN code)
that solves the problem as we have stated it.
\begin{itemize}
\item The reference method, used for comparisons with ADMB and WinBUGS,
  is the \code{MCMCglmm} package; this method is the only one that works ``out of the box''
  in R without recourse to other (non-R) tools.  Its only disadvantage
  is that it fits a lognormal-Poisson version of the model (i.e., with
  an observation-level random effect added \citep{Elston+2001}
  rather than the reference Poisson model.  
  In terms of equation~\ref{eq:linpred}, we add a parameter
  $\Vobs$ to the random-effects parameter vector
  $\vtheta$ and expand $\vsigma_{\vtheta}$ to incorporate an additional
  set of diagonal components $\Vobs$
  (the observation-level random effects are also independent
  and identically distributed).
  
(Because it allows for overdispersion in the count part
of the model, this model is actually superior to the
ZIP we originally proposed, but it is not the reference model.)
\item One can cheat a little bit and make use of ADMB (but without
  having to do any explicit AD Model Builder coding) by using the
  \code{glmmADMB} package, which encapsulates a subset of ADMB's
  capabilities into a pre-compiled binary that can be run from within
  R using a fairly standard formula syntax.
\item With a bit of hacking, one can write up a fairly simple, fairly
  generic implementation of the expectation-maximization (EM) algorithm 
  that alternates between fitting a GLMM (using \code{glmer}) with data
  that are weighted according to their zero probability, and fitting a
  binary GLM (using \code{glm}) for the probability that a data point is zero.
  We have implemented this in the code in \verb+owls_R_funs.R+, as
  a function called \code{zipme} (see below).
\end{itemize}

\subsection{MCMCglmm}

As mentioned above, we use \code{MCMCglmm} for our reference
R implementation.

<<>>=
library(MCMCglmm)
@ 

Set up a variable that will pick out the offset (log brood size)
parameter, which will be in position 3 of the parameter vector:
<<offvec>>=
offvec <- c(1,1,2,rep(1,5))
@ 

While \code{MCMCglmm} can easily fit a ZIGLMM,
specifying the fixed effect is a little bit tricky.
For zero-inflated models, \code{MCMCglmm} internally constructs an 
augmented data frame: if the original data frame is 

\begin{verbatim}
resp      f1  x1
1         A   1.1
0         A   2.3
3         B   1.7
\end{verbatim}
where \code{resp} is a Poisson response, \code{f1} is a factor,
and \code{x1} is a continuous predictor, then the augmented data frame
will look like this:
\begin{verbatim}
resp      trait   f1  x1
1         resp    A   1.1
0         resp    A   2.3
3         resp    B   1.7
1         zi_resp A   1.1
0         zi_resp A   2.3
1         zi_resp B   1.7
\end{verbatim}
The \code{trait} column
distinguishes two types of ``pseudo-observations'':
the \code{resp} values give the actual observed values,
for modeling the count portion of the data,
while the \verb+zi_resp+ values reduce the results to
binary (0/1) form.

\code{MCMCglmm} provides a special helper function, \code{at.level}, which
enables us to specify that some covariates affect only the count part of
the model (\code{resp}), or only the binary part (\verb+zi_resp+)
of the model: the first would be
specified as an interaction of \code{at.level(trait,1)} with a covariate
(i.e., the covariate affects only responses for level~1 of trait, which
are the count responses).  

Here is the model specified by \cite{zuur_mixed_2009},
which includes an offset effect of brood size and most but not all of
the interactions between \code{FoodTreatment}, \code{SexParent},
and \code{ArrivalTime} on the count aspect of the model, but only a single
fixed (intercept) effect on the binary part of the model (i.e., the
zero-inflation term):
<<fixef2>>=
fixef2 <- NCalls~trait-1+ ## intercept terms for both count and binary terms
  at.level(trait,1):logBroodSize+
  at.level(trait,1):((FoodTreatment+ArrivalTime)*SexParent)
@ 

(If we wanted to include covariates in the model for the
level of zero-inflation we would use an interaction with
\code{at.level(trait,2)}: for example, e.g. we would add
a term \code{at.level(trait,2):SexParent} to the fixed-effect
model if we wanted to model a situation where the zero-inflation 
proportion varied according to parental sex.)

The next complexity is in the specification of the priors, which
(ironically) we have to do in order 
to make the model simple enough.  By default,
\code{MCMCglmm} will fit the same random effect models to both
parts of the model (count and binary).  Here, we want
to turn off the random effect of nest for the binary aspect
of the model. In addition, \code{MCMCglmm}
always fits residual error terms for both parts of the model.  In our
first specification, we first fix the residual error (\code{R}) of the binary part
of the data to 1 (because it is not identifiable) by setting \code{fix=2};
the parameter \code{nu=0.002} specifies a very weak
prior for the other (non-fixed) variance term.
(In order to get reasonable mixing of the chain we have to fix it to a non-zero value.)
We also essentially turn off the random effect on the 
binary part of the model by fixing its variance to $10^{-6}$, in the same way.

<<prior1>>=
prior_overdisp  <- list(R=list(V=diag(c(1,1)),nu=0.002,fix=2),
                        G=list(list(V=diag(c(1,1e-6)),nu=0.002,fix=2)))
@ 

\verb+prior_overdisp+ will serve as a base version of the prior, but
we also want to specify that the log brood size enters the model as
an offset.  We do this by making the priors for all \emph{but} the
log-brood-size effect (nearly) equal to the default value for
fixed effects [$N(\mu=0,\sigma^2=10^{8})$ --- the
  variance for the log brood size effect
is an (extremely) weak prior centered on zero] and setting
a very strong prior centered at 1 [$N(\mu=1,\sigma^2=10^{-6})$] for the
log brood size effect.\footnote{If we set $\sigma^2=10^{10}$,
  the default value, for the non-offset predictors,
  we get an error saying that \code{fixed effect V prior is not positive definite} ---
  the difference in variances is so large that the smaller variance underflows
  to zero in a calculation.}

<<>>=
prior_overdisp_broodoff <- c(prior_overdisp,
                             list(B=list(mu=c(0,1)[offvec],
                               V=diag(c(1e8,1e-6)[offvec]))))
@ 

Now we fit the lognormal version of the ZIPois model as follows:

<<broodfit,cache=TRUE>>=
mt1 <- system.time(mfit1 <- MCMCglmm(fixef2,
                                     rcov=~idh(trait):units,
                                     random=~idh(trait):Nest,
                                     prior=prior_overdisp_broodoff,
                                     data=Owls,
                                     family="zipoisson",
                                     verbose=FALSE))
@

For comparison, we'll also try the model with 
the log-brood-size parameter allowed to vary from 1
(i.e. dropping the prior specification that fixes 
its value at 1):

<<fit2,cache=TRUE>>=
mt2 <- system.time(mfit2 <- MCMCglmm(fixef2,
                                     rcov=~idh(trait):units,
                                     random=~idh(trait):Nest,
                                     prior=prior_overdisp,
                                     data=Owls,
                                     family="zipoisson",
                                     verbose=FALSE))
@ 

These fits take \Sexpr{round(mt1[3],1)} and \Sexpr{round(mt2[3],1)} seconds
respectively.

Define a utility
function to abbreviate the variable names from
the slightly verbose MCMCglmm results and apply
it to the relevant portions of the fits:
<<sum2>>=
abbfun <- function(x) {
  gsub("(Sol\\.)*(trait|at.level\\(trait, 1\\):)*","",
       gsub("FoodTreatment","FT",x))
}
colnames(mfit1$Sol) <- abbfun(colnames(mfit1$Sol))
colnames(mfit2$Sol) <- abbfun(colnames(mfit2$Sol))
@ 

Now look at the results:
<<>>=
summary(mfit1)
@ 

Moving through this in sequence:
\begin{itemize}
\item the first three lines give information about the
  chain parameters: using the MCMCglmm default, 3000 ``burn-in'' 
  samples have been taken, followed by 10000 steps that are sampled every 10 steps
\item the DIC, or deviation information criterion,
  is useful (with caveats) for model comparison
  \citep{Spiegelhalter+2002}
\item the $G$ (random effects) variance for the count
  part of the model (\code{zi\_NCalls.Nest}) has been succesfully
  fixed to a small value; the variance among nests is small but non-zero
\item the residual (among-unit) variance is quite large for number of
  calls (suggesting overdispersion); it is fixed to 1.0 for the
  count part of the model, as described above
\item the fixed-effect parameter table gives the mean, 95\% credible
  intervals, effective sample size, and Bayesian $p$-value (a two-tailed
  test of the more extreme of the fraction of samples above or below zero)
  for both the binary intercept (the logit of the zero-inflation
  probability $z_0$, named \verb+zi_NCall+ here) and the
  the fixed effects of the parameters.
  It looks like parental sex is not doing much, at least as measured in terms of $p$ values.
\end{itemize}

Alternatively, we can represent the results graphically:
<<coefplot1,fig=TRUE>>=
library(coefplot2)
op <- par(mfrow=c(2,1))
vn1 <- abbfun(rownames(coeftab(mfit2)))
coefplot2(mfit1,intercept=TRUE,varnames=vn1)
coefplot2(mfit1,var.idx=c(1,3),ptype="vcov",
          main="")
par(op)
@ 

We have to look at the \emph{trace plots} --- plots of the
time series of the Markov chain --- to make sure that the
fits behaved appropriately.  We are hoping that the 
trace plots look like white noise, with rapid and
featureless variation:

For the fixed effects:
<<tr1,fig=TRUE>>=
print(xyplot(mfit1$Sol,layout=c(3,3)))
@ 

The one parameter that looks slightly questionable is \verb+zi_NCalls+,
and indeed its effective size is rather lower than we'd like (we 
should be aiming for at least 200 if we want reliable confidence
intervals):
<<>>=
round(sort(effectiveSize(mfit1$Sol)))
@ 
Note that although \code{logBroodSize} is varying, it's varying
over a very small range (0.998-1.002) because of the strong prior
we imposed.
%% why is DIC(mfit2) 20 points GREATER than DIC(mfit1) ???

We can also run a quantitative check on convergence,
using \code{geweke.diag} which gives a $Z$-statistic
for the similarity between the first 10\% and the
last 50\% of the chain:
<<>>=
geweke.diag(mfit1$Sol)
@ 
All the values are well within the 95\% confidence
intervals of the standard normal (i.e., $|x|<1.96$).

For variance parameters:
<<tr2,fig=TRUE>>=
vv <- mfit1$VCV
## drop uninformative ZI random effects
vv <- vv[,c("NCalls.Nest","NCalls.units")]
print(xyplot(vv,layout=c(1,2)))
effectiveSize(vv)
@ 

Density plots display the posterior distributions
of the parameters, useful for 
checking whether the distribution of
sampled points is somehow odd (strongly skewed,
bimodal, etc.).  Symmetry and approximate normality
of the posterior are useful for inference
(e.g. the DIC is based on an assumption of
approximate normality, and quantiles and highest posterior
density intervals give the same results for symmetry),
but not absolutely necessary.

For fixed effects:
<<dens1,fig=TRUE>>=
print(densityplot(mfit1$Sol,layout=c(3,3)))
@ 

For variance parameters:
<<dens2,fig=TRUE>>=
print(densityplot(vv,layout=c(2,1)))
@ 

Comparing the fits of the model with and without 
log brood size:
<<compcoefplot,fig=TRUE>>=
coefplot2(list("brood-offset"=mfit1,"brood-est"=mfit2),
          intercept=TRUE,
          varnames=vn1,
          legend=TRUE,legend.x="right")
@ 

With the exception of the log brood size and intercept
parameters,
the parameters are similar.  The log brood size parameter
is quite far from 1:
<<>>=
coda::HPDinterval(mfit2$Sol)["logBroodSize",]
@ 
%% why does DIC not go in the expected direction??

We can also create versions of the model that attempt to eliminate the overdispersion
in the count part of the model.  We can't fix both variance parameters, but we can
make the variance prior informative (by setting \code{nu=100}, or \code{nu=1000})
and make the (1,1) element of the variance matrix small.
However, if we try too hard to do this, we sacrifice 
a well-mixed chain (this mixing property is the reason that \code{MCMCglmm}
automatically adds an observation-level variance to every model).
Since the model runs \emph{reasonably} fast, for this case we might be able to use
brute force and just run the model 10 or 100 times as long \ldots

We could set up prior specifications for this non-overdispersed case as
follows:
<<elimoverdisp>>=
prior_broodoff <- within(prior_overdisp_broodoff,
                         R <- list(V=diag(c(1e-6,1)),nu=100,fix=2))
prior_null <- within(prior_overdisp,
                     R <- list(V=diag(c(1e-6,1)),nu=100,fix=2))
@



\subsection{glmmADMB}

This problem can also be done with \code{glmmADMB}.
The model is about equally fast, and the coding is easier!
On the other hand, there is arguably some advantage to having the MCMC
output (which would take longer to get with ADMB).

<<loadglmmadmb>>=
library(glmmADMB)
@ 
<<glmmadmbfit,cache=TRUE>>=
gt1 <- system.time(gfit1 <- glmmadmb(NCalls~(FoodTreatment+ArrivalTime)*SexParent+
                                     offset(logBroodSize)+(1|Nest),
                                     data=Owls,
                                     zeroInflation=TRUE,
                                     family="poisson"))
@

It takes \Sexpr{round(gt1[3],1)} seconds, slightly faster than \code{MCMCglmm}.

<<>>=
summary(gfit1)
@ 

The results are quite similar, although MCMCglmm gives wider
confidence intervals in general because it considers the
uncertainties in all model components, and because it allows 
for overdispersion.
<<coefplotmg1,fig=TRUE,width=8>>=
cm1tab <- coeftab(mfit1)[4:8,]
cg1tab <- coeftab(gfit1)[2:6,]
coefplot2(list(MCMCglmm=cm1tab,glmmADMB=cg1tab),merge.names=FALSE,intercept=TRUE,
          varnames=abbfun(rownames(cg1tab)),
          legend=TRUE)
@ 

<<glmmadmbnbinomfit,cache=TRUE>>=
gt2 <- system.time(gfit2 <- glmmadmb(NCalls~(FoodTreatment+ArrivalTime)*SexParent+
                                     offset(logBroodSize)+(1|Nest),
                                     data=Owls,
                                     zeroInflation=TRUE,
                                     family="nbinom"))
@

This takes a little longer than the Poisson model
(\Sexpr{round(gt2[3],1)} seconds). 
The parameters are a little
bit closer to the MCMCglmm fit (which also allows
for overdispersion).

<<coefplotmg2,fig=TRUE,width=8>>=
cg2tab <- coeftab(gfit2)[2:6,]
coefplot2(list(MCMCglmm=cm1tab,glmmADMB_Poiss=cg1tab,glmmADMB_NB=cg2tab),
          merge.names=FALSE,intercept=TRUE,
          varnames=abbfun(rownames(cg1tab)),
          legend=TRUE,
          col=c(1,2,4))
@ 

It seems that most of the difference in confidence interval size was
due to overdispersion (or lack of it), rather than to what components
of uncertainty were included.

To explore the variance-mean relationship, we can calculate
the mean and variance by group and look at the relationship.
A linear relationship implies that a quasi-Poisson or 
negative binomial ``type 1''
with variance $V$ proportional the mean $\mu$
\citep{hardin_generalized_2007}
is likely to be best, while a relationship of the form $V = \mu + C \mu^2$
implies that a negative binomial type~2 
(the standard in R and most other statistics
packages) or lognormal-Poisson fit is likely to be best.

<<plot2,fig=TRUE>>=
library(plyr)
mvtab <- ddply(Owls,
               .(FoodTreatment:SexParent:Nest),
               summarise,
               callmean=mean(NCalls),
               callvar=var(NCalls))
q1 <- qplot(callmean,callvar,data=mvtab)
print(q1+
      ## linear (quasi-Poisson/NB1) fit
      geom_smooth(method="lm",formula=y~x-1)+
      ## smooth (loess)
      geom_smooth(colour="red")+
      ## semi-quadratic (NB2/LNP)
      geom_smooth(method="lm",formula=y~I(x^2)+offset(x)-1,colour="purple")+
      ## Poisson (v=m)
      geom_abline(a=0,b=1,lty=2))
@ 
\\ (black=Poisson [$V=\mu$]; purple=quadratic/NB2; blue=linear/NB1;
red=nonparametric)

Of the parametric choices, it looks like NB1 is better than NB2.
We can fit this in \code{glmmADMB} via \code{family="nbinom1"}:
<<glmmadmbnbinom1fit,cache=TRUE>>=
gt3 <- system.time(gfit3 <- glmmadmb(NCalls~(FoodTreatment+ArrivalTime)*SexParent+
                                     offset(logBroodSize)+(1|Nest),
                                     data=Owls,
                                     zeroInflation=TRUE,
                                     family="nbinom1"))
@ 

It doesn't change the coefficients very much, although
it does make the main effect of food stronger:
<<coefplotmg3,fig=TRUE,width=8>>=
cg3tab <- coeftab(gfit3)[2:6,]
coefplot2(list(MCMCglmm=cm1tab,glmmADMB_Poiss=cg1tab,glmmADMB_NB=cg2tab,
               glmmADMB_NB1=cg3tab),
          merge.names=FALSE,intercept=TRUE,
          varnames=abbfun(rownames(cg1tab)),
          legend=TRUE,
          col=c(1,2,4,5))
@ 

AIC suggests that (ZI)NB1 is a much better fit to the data.
<<>>=
AICtab(gfit1,gfit2,gfit3)
@ 

\subsection{EM algorithm}

The expectation-maximization (EM) algorithm (e.g., \cite{Dempster+1977,minami_modeling_2007}) is an iterative procedure for
finding maximum likelihood estimates of model parameters. To find
parameter estimates, the EM algorithm iterates between expectation and
maximization steps until convergence. The response variable of a
zero-inflated Poisson model can be viewed as arising from one of two
states (e.g., \cite{Lambert1992}): a 'perfect' ('zero') state where only the
value zero is possible or an imperfect state where any non-negative
value is possible. In this conceptualization, the zero-valued
observations of a data set could come from either state, whereas
positive counts could only come from the imperfect state. At each
iteration of the EM algorithm, the purpose of the expectation step is
to estimate the probability that each zero- valued observation is in
the 'perfect' state, given the parameters of the logistic and
log-linear models (which are estimated in the maximization steps). In
our implementation, we assume that the logistic and log-linear models
have no shared parameters.

In our case, we use the \code{glmer} function from \code{lme4} to fit the Poisson part of
the regression, and either the \code{glmer} function (if the zero-inflation model contains
a random effect) or the \code{glm} function (if not) for the logistic regression ($Z$) part of the algorithm.
The algorithm is encapsulated in a \code{zipme} function we have written which takes
a standard R model formula for the Poisson part (\code{cformula});
a model formula with \code{z} on the left-hand side for the zero-inflation part;
and addition parameters \code{data}, \code{maxitr} (maximum number of iterations),
\code{tol} (tolerance criterion), and \code{verbose} (whether to print out
progress messages).

<<zipemload>>=
source("../R/owls_R_funs.R")
library(lme4)
@ 

<<zipem1,cache=TRUE>>=
zt1 <- system.time(ofit_zipme <- 
                   zipme(cformula=NCalls~(FoodTreatment+ArrivalTime)*SexParent+
                         offset(logBroodSize)+(1|Nest),
                         zformula=z ~ 1,
                         data=Owls,maxitr=20,tol=1e-6,
                         verbose=FALSE))
@ 

<<zipem2,cache=TRUE,results=hide>>=
Owls$obs <- seq(nrow(Owls))
zt2 <- system.time(ofit2_zipme <- 
                   zipme(cformula=NCalls~(FoodTreatment+ArrivalTime)*SexParent+
                         offset(logBroodSize)+(1|Nest)+(1|obs),
                         zformula=z ~ 1,
                         data=Owls,maxitr=20,tol=1e-6,
                         verbose=FALSE))
@ 

The EM fits take \Sexpr{round(zt1[3],1)} 
and \Sexpr{round(zt2[3],1)} seconds without and
with observation-level random effects, respectively.

The EM results are similar to the corresponding fits
from other approaches: the Poisson fit looks most
similar to the \code{glmmADMB} Poisson fit (this is
the reference model, which doesn't account for overdispersion)
and the lognormal-Poisson fit (with an observation-level
random effect) looks most like the \code{MCMCglmm}
fit.  

The only innovation here is using \code{brewer.pal(6,"Dark2")}
from the \code{RColorBrewer} package to get some nicer colors.

<<coefplotmg4,fig=TRUE,width=8>>=
cg4tab <- coeftab(ofit_zipme$cfit)[2:6,]
cg5tab <- coeftab(ofit2_zipme$cfit)[2:6,]
library(RColorBrewer)
cvec <- brewer.pal(6,"Dark2")
coefplot2(list(MCMCglmm=cm1tab,glmmADMB_Poiss=cg1tab,glmmADMB_NB=cg2tab,
               glmmADMB_NB1=cg3tab,
               zipme=cg4tab,
               zipme2=cg5tab),
          merge.names=FALSE,intercept=TRUE,
          varnames=abbfun(rownames(cg1tab)),
          legend=TRUE,
          col=cvec)
@ 


\section{ADMB}
\subsection{Code in owls.tpl}
\VerbatimInput[numbers=left,samepage=true,fontsize=\small]{../ADMB/owls.tpl}

\begin{itemize}
\item \code{sigma} is bounded in (0.001,100)
\item line 21 initializes the objective function to zero. \code{jnll} is where we'll sum up the joint negative log-likelihood of the observation model and the random effects. 
\item line 22 transforms the logit of the zero-inflation parameter back to the probability scale
\item line 23 computes the linear predictor (offset + fixed effects + random effects):
  in \code{rcoeffs*sigma} (scaling the random effects by the random-effects standard deviation:
  vector $\times$ scalar), \code{*} acts as elementwise multiplication;
  in the other cases (multiplying fixed-effect and random-effect design matrices by the
  corresponding parameter vectors) it acts as matrix multiplication
\item line 24 converts the linear predictor from the log to the count scale;
\item lines 26--36 loop over the observations, subtracting the ZIP log-likelihood
  (computed on lines 28--35) for each observation to the objective function value (joint negative log-likelihood) \code{jnll}
\item line 38 adds the negative log-likelihood of the random effects to \code{jnll}
\item line 42 will make sure there's enough space. ADMB told us to use this number when we tried to run the program.

\end{itemize}
\subsection{Running the model from within R}
Load the \code{R2admb} package and tell R
where to find the \code{admb} executable:
<<admb_run>>=
library(R2admb)
setup_admb()
@

Read in the data (if we hadn't already done so):
<<loaddataagain>>=
load("../DATA/Owls.rda")
## don't forget to transform!
Owls <- transform(Owls,ArrivalTime=scale(ArrivalTime,center=TRUE,scale=FALSE))
@ 

Organize the data --- these 
definitions need to be defined in advance 
in order for \code{R2admb} to check them properly ...
<<>>=
LBrood <- log(Owls$BroodSize)
mmf <- model.matrix(~(FoodTreatment+ArrivalTime)*SexParent,
                    data=Owls)
mmr <- model.matrix(~Nest-1, data=Owls)
response <- Owls$SiblingNegotiation
nf <- ncol(mmf)
nobs <- nrow(Owls)
nnests <- length(levels(Owls$Nest))
@ 

Combine these objects into a list and write them to
a file in the appropriate format for ADMB:
<<>>=
regdata=list(nobs=nrow(Owls),
     nnests=length(levels(Owls$Nest)),
     nf=ncol(mmf),
     fdesign=mmf,
     rdesign=mmr,
     obs=response,
     Lbrood=LBrood)
write_dat("owls", L=regdata)
@ 

Define and write a list of starting
values for the coefficients:
<<>>=
regparams=list(fcoeffs=rep(0,ncol(mmf)),
  logit_p=0,
  sigma=1,
  rcoeffs=rep(0.001,length(levels(Owls$Nest))))
write_pin("owls", L=regparams)
@ 

Compile the model, specifying that it contains
random effects:
<<eval=FALSE>>=
compile_admb("owls", re=TRUE)
@ 

Run the compiled executable:
<<eval=FALSE>>=
xargs <- "-noinit -nox -l1 40000000 -nl1 40000000 -ilmn 5"
run_admb("owls",extra.args=xargs)
@ 
The extra argument \code{-noinit} tells ADMB to start the random effects from the last optimum values, instead of the \code{pin} file values, when doing the Laplace approximation. \code{-nox} reduces the amount of information output while it's running. \code{-l1} allocates memory to prevent ADMB from creating the temporary storage file \code{f1b2list1}, which is much slower than using RAM. \code{-nl1} is similar to \code{-l1}, but for the temporary file \code{nf1b2list1}. Users add these command line options when they see temporary files created. The amount to allocate is done by trial and error or experience. In this user's experience, 40000000 is a good value to try. \code{-ilmn 5} is used to make ADMB run faster when there are a lot of random effects; it tells ADMB to use a limited  memory quasi-Newton optimization algorithm and only save 5 steps.
 (See the ADMB and ADMB-RE manuals, and \url{http://admb-project.org/community/tutorials-and-examples/memory-management}, for more information.) 

Read in the results and clean up files
that were produced by ADMB:
<<eval=FALSE>>=
fit_admb <- read_admb("owls")
## rename fixed-effect parameters according to column
##  names of model matrix
names(fit_admb$coeflist$fcoeffs) <-
  names(fit_admb$coefficients)[1:ncol(mmf)] <- colnames(mmf)
clean_admb("owls")
@ 



<<echo=FALSE>>=l
L1 <- load("../ADMB/fit_fits.RData")
@ 
\subsection{Results}

The \verb+fit_admb+ object read in by \code{R2admb}
works with many of the standard accessor methods in
R: \code{coef}, \code{stdEr}, \code{vcov}, \code{confint}, etc.:
<<>>=
methods(class="admb")
@ 

However, the coefficients vector contains the full set of
fixed and random effect coefficients (as well as the zero-inflation
parameter and the random-effects variance):
<<>>=
logLik(fit_admb)
coef(fit_admb)[1:8]
@ 

These coefficients are in the same order as the columns of the model matrix \code{mmf}
<<>>=
colnames(mmf)
@

Checking equivalence:
<<>>=
write_pin("owls",as.list(c(coef(gfit1),
                           logitpz=qlogis(gfit1$pz),
                           sigma=sqrt(gfit1$S[[1]]))))
@ 

Comparing the three reference implementations
(with \code{glmmADMB}, the EM algorithm (\code{zipme}),
and ADMB):
\todo{do we care that the results from glmmADMB and ADMB
  aren't quite identical?  Is this due to ``robustification''?
  glmmADMB has a better log-likelihood (1985 vs 1999) --- either ADMB
  got stuck or the likelihoods are calculated differently (robustification
  etc.)
}
<<fig=TRUE,width=8>>=
## rearrange order to match
a1tab <- coeftab(fit_admb)[2:6,]
cvec <- brewer.pal(6,"Dark2")
coefplot2(list(glmmADMB_Poiss=cg1tab,
               zipme=cg4tab,
               ADMB=a1tab),
          merge.names=FALSE,intercept=TRUE,
          varnames=abbfun(rownames(cg1tab)),
          legend=TRUE,
          col=cvec[1:3])
@ 

\subsection{Faster version using separable functions}
The same model can be written in a much more efficient way if we exploit the separability of its parameters. We do this by operating on each observation separately and only sending the necessary parameters to the separable function.  This lets ADMB know how sparse the Hessian will be. The fewer parameters get sent together, the sparser the Hessian. So instead of using a matrix for the random effects, we use an index vector for which nest corresponds to each observation. Then, for each observation, only the random effect for the relevant nest gets sent to the separable function.   

The new TPL file is \verb+owls_sep.tpl+.

<<eval=FALSE>>=
sepdat=list(nobs=nrow(Owls),
     nnests=length(levels(Owls$Nest)),
     nf=ncol(mmf),
     fdesign=mmf,
     nests=as.integer(Owls$Nest),
     obs=response,
     Lbrood=LBrood)
write_dat("owls_sep", L=sepdat)
@
The \verb+DATA_SECTION+ of the new tpl file looks like this:
\begin{verbatimtab}[8]
DATA_SECTION
	init_int nobs    // # obs (=599)
	init_int nnests  // # random effects levels (=nests=27)
	init_int nf      // # fixed effects (incl intercept & interaction)
	init_matrix fdesign(1,nobs,1,nf) // Design matrix for fixed effects + intercept
	init_ivector nests(1,nobs)   // Nest indices
	init_vector obs(1,nobs)      // Response variable (# calls)
	init_vector Lbrood(1,nobs)   // Log of brood size (offset)
\end{verbatimtab}
The \code{PARAMETER\_SECTION} and par files are the same.
<<eval=FALSE>>=
write_pin("owls_sep", L=regpars)
@
The new \code{PROCEDURE\_SECTION} is followed by two separable function definitions: one to calculate the negative log-likelihood of the observations, and another to calculate the negative log-likelihood of the random effects. These negative log-likelihoods get added to the joint negative log-likelihood \code{jnll}. Note that the definition of a \code{SEPARABLE\_FUNCTION} must all be on one line, but it's split in this document to fit on the page.
\begin{verbatimtab}[8]
PROCEDURE_SECTION
	jnll = 0.0;
	for(int i=1; i<=nobs; i++)
	{
		pois(i, fcoeffs, rcoeffs(nests(i)), logit_p, sigma);
	}
	for(int n=1; n<=nnests; n++)
	{
		rand(rcoeffs(n));
	}
SEPARABLE_FUNCTION void pois(int i, const dvar_vector& fcoeffs, const dvariable& r, 
const dvariable& logit_p, const dvariable& sigma)
	
	dvariable p = exp(logit_p)/(1.0+exp(logit_p));
	dvariable eta = Lbrood(i) + fdesign(i)*fcoeffs + r*sigma;
	dvariable mu = exp(eta);

	if(obs(i)==0)
	{
		jnll-=log(p + (1-p)*exp(-mu) );
	}
	else//obs(i)>0
	{
		jnll-=log(1-p) + obs(i)*log(mu) - mu- gammln(obs(i)+1);
	}
SEPARABLE_FUNCTION void rand(const dvariable& r)
	jnll+=0.5*(r*r)+0.5*log(2.*M_PI); //for the random effects distributed N(0,1)
\end{verbatimtab}

Then we compile and run the code from R and read back the results  
<<eval=FALSE>>=
compile_admb("owls_sep", safe=FALSE, re=TRUE, verbose=FALSE)
run_admb("owls_sep",verbose=FALSE,
extra.args="-shess -noinit -nox -l1 40000000 -nl1 40000000 -ilmn 5")
fit_admb_sep <- read_admb("owls_sep")
@

The extra argument \code{-shess} tells ADMB to use algorithms that are efficient for sparse Hessian matrices. 

We get the same answer, at least up to a precision of $10^{-7}$:
<<>>=
all.equal(coef(fit_admb),coef(fit_admb_sep),tol=1e-7)
@ 
\section{BUGS}
\VerbatimInput[numbers=left,samepage=true,fontsize=\small]{../BUGS/owls.txt}

\begin{itemize}
\item{lines 4--13 define the priors
    \begin{itemize}
    \item (4--8) weak $N(\mu=0,\tau=0.01)$ for
      the fixed effect coefficients \code{beta} and the intercept
      \code{alpha} (remember that WinBUGS parameterizes
      the normal distribution in terms of the precision, or inverse variance);
    \item (8--9) a uniform $(0,5)$ distribution for the random-effects standard deviation
      \code{sigma}
      (which is converted to a precision \code{tau} on line 9); 
    \item (10) $U(0,1)$ for
      the zero-inflation probability \code{psi}; 
    \item (11--13) and $N(0,\tau)$ priors for
      the per-nest random effects (these last values could be thought of as
      part of the model rather than as priors, since their variance is
      controlled by the parameter \code{sigma}).
    \end{itemize}}
\item line 16 defines the conditional probability of observation \code{i}
  based on the conditional mean \code{mu[i]}
\item line 17 defines \code{mu[i]}, which is either equal to zero
  (if the observation is a structural zero) or to the \code{lambda[i]}
  value relevant to observation \code{i}; \code{JAGS} complains if
  we ever have a non-zero observation from a Poisson with mean zero
  (because the likelihood is exactly zero), so we add a small fudge factor
\item line 18 picks a Bernoulli (\code{dbern}) variable to decide whether
  the observation is a structural zero or not
\item line 19 defines the expected mean of the sampling part of the
  distribution, based on the design matrix and the fixed effects
  (\code{inprod(X[i,],beta)}); the intercept and offset (\code{alpha}+\code{offset[i]});
  and the random nest effect (\code{a[nest[i]]}).
\end{itemize}

For convenience, we packaged the R code to run the BUGS model (in JAGS) as
a function. 
<<>>=
owls_BUGS_fit <- function(data, ni=25000, nb=2000, nt=10, nc=3) {
  ## Bundle data
  data$ArrivalTime <- scale(data$ArrivalTime,center=TRUE,scale=FALSE)
  fmm <- model.matrix(~(FoodTreatment+ArrivalTime)*SexParent,data=data)[,-1]
  bugs.data <- with(data,list(N=nrow(data),
                            nnests=length(levels(Nest)),
                            offset = logBroodSize,  ## nb specified on original scale
                            SibNeg = SiblingNegotiation,
                            nest = as.numeric(Nest),
                            X=fmm))

  ## Inits function
  inits <- function(){list(a=rnorm(27, 0, .5),
                           sigma=runif(1,0,.5), alpha=runif(1, 0, 2), beta = rnorm(5))}

  ## Parameters to estimate
  params <- c("alpha", "beta", "sigma", "psi")  ## DON'T save b, overdispersion latent variable
                                                ## or a, nest random effect
 
  jags(data = bugs.data, inits = inits, parameters.to.save = params, model.file = "owls.txt",
       n.thin = nt, n.chains = nc, n.burnin = nb, n.iter = ni)
}
@ 

The \code{jags} command (from the \code{R2jags} package) is compatible with the \code{bugs}
command from the \code{R2WinBUGS} package; that should be all you have to change to run
with WinBUGS instead of JAGS.

The only important point we had to take account specifically for a BUGS solution to
the problem is the issue of centering the continuous predictor.  Centering continuous predictors
is generally a good idea for reasons of both interpretability and numerical stability
\citep{GelmanHill2006,schielzeth_simple_2010}.
While deterministic, derivative- or linear-algebra-based approaches such as those
implemented by \code{lme4} in R or AD Model Builder often have built-in
safeguards against strongly correlated parameters, centering is helpful
in borderline cases or when convergence appears to fail.
However, centering can be critically important for 
MCMC analyses: with a few exceptions (the optional \code{glm} module in JAGS%
can handle this case), mixing will be very slow for models with uncentered
predictors.

<<>>=
library(R2jags) ## loads 'rjags', 'R2WinBUGS', 'coda' as well
@ 
<<eval=FALSE>>=
ofit <- owls_BUGS_fit(Owls)
@ 
<<echo=FALSE>>=
L <- load("../BUGS/owls_BUGS_fits.RData")
@ 

\code{ofit} is an object of class \code{rjags}.
For a start, we can print it out:
<<>>=
print(ofit,digits=2)
@ 
The results give information about the fit (number of chains, length of burn-in,
number of samples, number saved) as well as estimated means, standard deviations, quantiles
for each saved parameter.  
We use \code{print(ofit,digits=2)} to reduce the resolution a bit
so the print-out fits on the page better.
(By default, \code{R2jags} prints results with 3 digits of precision,
while \code{R2WinBUGS} prints them with only 1 digit of precision:
you can override either of these defaults by using \code{print}
with an explicit \code{digits} option set.)
In addition, the results contain two useful diagnostics,
the Gelman-Rubin statistic (\code{Rhat}) and the effective size
(\code{n.eff}). Because we have run multiple chains, we can use the (preferred) Gelman-Rubin
test of convergence, which assesses whether the variance among and within
chains is consistent with those chains sampling the same space (i.e., whether
the chains have converged).  The G-R statistics should be close to 1 (equal within-
and between-chain variance), 
with a general rule of thumb that $\hat R <1.2$ is acceptable.
(In this case it looks like we might have done a bit of overkill, with the maximum
\code{Rhat} value equal to 1.006.)  The effective sample size corrects for autocorrelation
in the chain, telling how many equivalent samples we would have if the chains were really
independent.  In this case the intercept parameter mixes most poorly ($n=390$), while
$\sigma$ is uncorrelated ($n=6900$, which is equal to the total number of samples saved).
In general $n>1000$ is overkill, $n>200$ is probably acceptable.

The \code{R2jags} package provides a \code{plot} method for \code{rjags} object,
which falls back on the method defined in \code{R2WinBUGS} for \code{bugs} objects;
\code{plot(ofit)} would give us 
a nice, rich plot that compares the chains, but the effect here
is a bit ruined by the deviance, which is on a very different scale
from the other variables.  
We extract the \code{BUGSoutput} component of the fit (which is
what gets plotted anyway)
and use a utility function \code{dropdev} (defined elsewhere)
to drop the deviance component from the fit:
We use a utility function
<<fig=TRUE>>=
source("../R/owls_R_funs.R")
oo <- dropdev(ofit$BUGSoutput) 
plot(oo)
@ 

Adding the optional argument \code{display.parallel=TRUE} would compare
the distributions of the estimated parameters across chains.
This plot method is most useful if we have large vectors of 
parameters (e.g. if we had included the nest random effects in
our list of parameters to save), and if the predictors have
all been adjusted to have similar scales (e.g. if we had
divided arrival time [our only continuous predictor variable]
by its standard deviation).  While putting all of the input
variables into a model matrix is convenient from one point
of view, it does limit the usefulness of our output because
the names of the variables are \code{beta[1]}, \code{beta[2]}, \ldots
instead of being more informative. Note that \code{plot.bugs}
displays the 50\% credible intervals, not the 95\% intervals,
so these plots are not particularly useful for standard 
(``is it significant??'') types of inference.

If we want more detail, or different types of plots, we
can convert the \code{bugs} object we have extracted
from the JAGS fit to an \code{mcmc} object
(or an \code{mcmc.list} object in this case, since it
contains multiple chains)
and use plotting and diagnostic tools from the \code{coda} package:
<<>>=
mm <- as.mcmc.list(oo)
@ 

As in the \code{MCMCglmm} case we would 
usually like to look at the trace plots: here
the traces of the three different chains we ran are plotted together
for each parameter.
<<BUGStraceplot,fig=TRUE>>=
## thin slightly for convenience
mm2 <- as.mcmc.list(lapply(mm,function(x) {as.mcmc(x[seq(1,nrow(x),by=3),])}))
print(xyplot(mm2,layout=c(3,3)))
@ 

We could plot density plots with \code{densityplot},
or run the Gelman-Rubin diagnostic with \code{gelman.diag}.



\section{Combined summary}

Finally, we will use some utility functions to summarize and
compare all of the parameter estimates from the methods described, for 
model variants as close
as we can get to the reference model (i.e. zero-inflated Poisson, log-brood-size
offset).

<<this_is_64>>=
cwd <- setwd("../../TOOLS")
source("tools.R")
setwd(cwd)  ## restore
library(reshape)
library(ggplot2)
ovals <- get_proj("owls",start.dir="../..")
b <- basesum(ovals,ncolparam=3)
@ 

\setkeys{Gin}{width=6.5in}
<<sumparamplot,fig=TRUE,width=20,height=15>>=
print(b$plots$params)
@ 

The MCMCglmm fit is the most noticeably different from all the rest,
(presumably) because it is fitting a zero-inflated lognormal-Poisson (overdispersed)
model rather than a zero-inflated Poisson model.

The next most obvious discrepancy is in the confidence intervals for the
among-nest variance (\code{nestvar}).  The ADMB fit was done on the
standard deviation scale, and the confidence intervals then transformed
by squaring.  The more symmetric appearance of the MCMCglmm and BUGS confidence
intervals (which are based on the posterior distribution of the variance,
and hence are invariant across changes in parameter scales) suggest that
the sampling/posterior distribution of the variance is actually more symmetric
on the variance scale than on the standard deviation scale.

Beyond this, there are slight differences in the parameter estimates from
glmmADMB, possibly because of some difference in stopping criteria.
\todo{?? could try restarting glmmADMB at consensus parameter values \ldots}

Finally, we look at the timings of the various methods.  The R methods
are all in the same ballpark, along with
the non-separable variant ADMB model (with E-M/zipme slightly faster than the others).
BUGS is much slower ($\approx 12$ minutes), while the separable ADMB model
is much faster ($\approx 4$ seconds):

We could use \verb+print(b$times)+ to look at the plot,
but the values (times in seconds) are easy enough to read:
<<>>=
ff <- subset(ovals$base,select=c(.id,time))
ff$time <- round(ff$time,1)
ff[order(ff$time),]
@ 
\bibliography{../../nonlin}

<<cleanup,echo=FALSE>>=
unloadNamespace("SweaveListingUtils")
@

\end{document}
                  
