- Thursday, June 14th
- 10:45am -- 12:30pm
- 2:00pm -- 3:45pm
- 4:15pm -- 6:00pm

- Friday, June 15th
- 10:45am -- 12:30pm
- 2:00pm -- 3:45pm
- 4:15pm -- 6:00pm

- Saturday, June 16th (Bioinformatics Day)

Session Time:

Room Location: Santa Ana Room

Session Chair:

**Making Trees Interactive: KLIMT**`Author:`*Simon Urbanek*(University of Augsburg, Germany)*Antony Unwin*(University of Augsburg, Germany)`Abstract:`- Trees are a valuable way of displaying structure in data sets. Adding interactive tools would make them even more valuable. This paper describes our prototype software, KLIMT (Classification - Interactive Methods for Trees), for interactive graphical analysis of trees. The research is work in progress and there are many different possible options. What features do analysts want?

**Some Graphics for Recursive Partitioning**`Author:`*Daniel B. Carr*(George Mason University)*Ru Sun*(George Mason University)`Abstract:`- This paper presents some graphical templates for constructing,
describing, and studying recursive partitioning trees. Some of the templates
have multiple applications. For example, the layout approaches for showing
random trees are generally applicable to the display of clusters. Thus the
paper may be of interest to those who make little use of recursive
partitioning.

The paper provides some perspective by raising a few general issues. One issue concerns the limits of human sensory input and the resulting conscious awareness. The limits are cause for humility in the face of overwhelming quantities of data. For example, one paper indicates using a recursive partitioning algorithm on a problem with over two million variables. Analysts who would look at the data on much smaller problems inevitably end up looking at caricatures of the data. Assuming that it is still beneficial to have analysts involved in the analysis process, it seems that thought and computational power could be devoted to producing and prioritizing caricatures that exploit analysts' visual processing strengths.

In terms of constructing trees, a dynamic example shows the approach of using the grand tour, brushing, alpha-blending and graphical partitioning to build trees. The visual approach uses linear combinations of predictor variables. When the data view allows partitioning on more than one predictor variable, the approach includes a type of look-ahead compared to a one-variable-at-a-time algorithm. More generally, the views can be smoothed regression surfaces, and various approaches can be used to graphically define multivariate partitions. The view used for partitioning may not be well chosen; thus projection pursuit or related algorithms can help analysts to select views. Tree displays can use graphical representations to show the partition boundaries. This can be done for traditional as well as graphically defined partitions.

The paper emphasizes graphical possibilities and does not evaluate the quality of analyst-defined trees. Adjusting the significance tests of human-defined partitions for multiple comparisons is an open research question with some algorithm emulation possibilities. More generally, graphics can also be used in the evaluation process. One evaluation process generates different trees by weighted random selection of prioritized variables at each partitioning step. The paper closes by describing an approach to laying out trees based on their similarities.

Session Time:

Room Location: Costa Mesa Room

Session Chair:

**Predictive Data Mining with Multiple Additive Regression Trees**`Author:`*Jerome H. Friedman*(Stanford University)`Abstract:`- Predicting future outcomes based on past observational data is a common application in data mining. The primary goal is usually predictive accuracy, with secondary goals being speed, ease of use, and interpretability of the resulting predictive model. New automated procedures for predictive data mining, based on ``boosting'' CART regression trees, are described. The goal is a class of fast ``off-the-shelf'' procedures for regression and classification that are competitive in accuracy with more customized approaches, while being fairly automatic to use (little tuning), and highly robust especially when applied to less than clean data. Tools are presented for interpreting and visualizing these multiple additive regression tree (MART) models.
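The stagewise logic behind boosted regression trees can be sketched in a few lines. The toy code below is our own illustration, not Friedman's implementation: it repeatedly fits a depth-one stump to the current residuals under squared-error loss and adds a shrunken copy to the fit. MART itself boosts full CART trees under general loss functions; the data here are invented.

```python
def fit_stump(x, r):
    """Best single-split stump (threshold, left mean, right mean) for residuals r."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1:]

def boost(x, y, rounds=50, nu=0.1):
    """Stagewise additive model: each round fits a stump to the current residuals."""
    f = [0.0] * len(x)
    stumps = []
    for _ in range(rounds):
        r = [yi - fi for yi, fi in zip(y, f)]
        t, ml, mr = fit_stump(x, r)
        stumps.append((t, ml, mr))
        # add a shrunken copy of the stump to the current fit
        f = [fi + nu * (ml if xi <= t else mr) for xi, fi in zip(x, f)]
    return f, stumps

x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0.0, 0.1, 0.0, 0.2, 1.0, 1.1, 0.9, 1.0]  # step-like target
fhat, stumps = boost(x, y, rounds=200)
```

The shrinkage parameter `nu` is what makes the procedure "fairly automatic": small steps trade a few extra rounds for robustness of the final fit.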

**Towards Understanding Boosting**`Author:`*Bin Yu*(UC Berkeley)*Peter Buhlmann*(Statistics, ETH, Zurich)`Abstract:`- Boosting is a very effective computational procedure to improve upon an initial estimator/classifier (weak learner). It comes from machine learning and has had impressive successes on large real data sets. This talk takes the recent gradient descent point of view of boosting to understand L2Boost in regression and classification, especially its overfitting resistance. In particular, we derive L2Boost's interesting exponential bias and variance trade-off in regression, and approximate the 0-1 generalization error by its smoothed version to argue for the importance of the bias-reduction in classification, for the overfitting-resistance of the 0-1 loss function through tapering, and for the failure of attempts to decompose the 0-1 loss as a sum of bias and variance. Moreover, we put forward the thesis that the (proper) L2Boost in classification is not worse, or maybe even better, than LogitBoost or AdaBoost. We conclude that the exponentially-diminishing variance (and centered higher moments) plus the overfitting-resistance of the 0-1 loss are responsible for the overfitting resistance of L2Boosting (and possibly other forms of boosting) in classification.

**Why Does Model Averaging Work?**`Author:`*Yoav Freund*(AT\&T Labs)`Abstract:`- The last few years have seen an increased interest in model averaging
techniques such as bagging and boosting. To the statistician, the most
interesting aspect of these techniques is their resistance to over-fitting.
I will describe two theoretical explanations to this phenomenon. One
explanation applies to boosting (and other margin-based methods such as
SVM). The other applies to bagging (and other pseudo-Bayesian techniques).

These explanations differ from common statistical analysis in that they are both based on ``non-generative'' models of the world. I will explain how generative and non-generative models differ and why the difference is important.

Session Time:

Room Location: Viejo Room

Session Chair:

**A Model Based Approach to Text Categorization and Clustering**`Author:`*Alejandro Murua*(University of Washington and Insightful Corporation)*Jeremy Tantrum*(University of Washington)*Werner Stuetzle*(University of Washington)*Solveig Sieberts*(University of Washington)`Abstract:`- In this work we develop a complete methodology for document classification and clustering. Documents can be represented by high-dimensional vectors of term/word frequencies. We study in depth the effect of dimensionality reduction (via Principal Components Analysis), term weighting and frequency transformation on both classification and clustering tasks. We conclude that increasing the feature space dimension beyond a certain critical value, depending on the complexity and size of the data, does not improve performance but worsens it, and that applying a logarithm or square-root transformation to the term frequencies reduces error rates. We used these findings to construct a model-based document clustering (MBDC) algorithm, which explicitly models the data as being drawn from a Gaussian mixture. The mixture is used to construct clusters based on the likelihood of the data, and to classify documents according to the Bayes rule. One main advantage of our approach is the ability to automatically select the number of clusters present in the document collection via Bayes factors. Our experiments with the Topic Detection and Tracking Corpus demonstrate the ability of MBDC to choose a sensible number of clusters as well as meaningful partitions of the data. Moreover, motivated by the document clustering problem, we introduce a novel algorithm to cluster high-dimensional large data sets. This algorithm uses model-based clustering in the context of splitting the data into more manageable subsets by way of fractionation. An extension to this method (model-based re-fractionation) is also proposed.
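As a minimal sketch of the preprocessing pipeline the abstract studies (the term-frequency matrix and its values are invented for illustration, and the one-component PCA via power iteration is our own stand-in for the full analysis):

```python
import math

tf = [[4, 0, 1, 0],
      [3, 1, 0, 0],
      [0, 5, 2, 1],
      [0, 4, 3, 1]]   # rows: documents, columns: terms

log_tf = [[math.log1p(v) for v in row] for row in tf]   # log transform damps frequent terms
sqrt_tf = [[math.sqrt(v) for v in row] for row in tf]   # the other transform studied

# Centre columns, then find the leading principal direction of the covariance.
n, p = len(log_tf), len(log_tf[0])
means = [sum(row[j] for row in log_tf) / n for j in range(p)]
X = [[row[j] - means[j] for j in range(p)] for row in log_tf]
cov = [[sum(X[i][a] * X[i][b] for i in range(n)) / n for b in range(p)] for a in range(p)]

v = [1.0] * p
for _ in range(200):                      # power iteration for the top eigenvector
    cv = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
    norm = math.sqrt(sum(ci * ci for ci in cv))
    v = [ci / norm for ci in cv]

scores = [sum(X[i][j] * v[j] for j in range(p)) for i in range(n)]  # PC1 scores
```

On this toy matrix the first principal component separates the two pairs of similar documents, which is the kind of structure the mixture model is then fit to.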

**Model-Based Clustering for Gene Expression Data**`Author:`*Ka Yee Yeung*(University of Washington)*Walter L. Ruzzo*(University of Washington)`Abstract:`- Many biologists are excited about emerging DNA microarray technologies, since they make it possible for the first time to study the simultaneous variations in activities of thousands of genes. There is a great need to develop analytical methodologies to extract the information contained in these rapidly growing data sets. Because of the large number of genes and the complexity of biological networks, clustering is a useful exploratory technique for analysis of such data. Among the many clustering algorithms that have been proposed, model-based clustering stands out as one of the few with rigorous probabilistic underpinnings. Namely, it assumes that the data is generated by a mixture of underlying probability distributions such as multivariate normal distributions. This Gaussian mixture model has been shown to be a powerful tool for many applications. We will present our experiences in applying model-based clustering algorithms to gene expression data, compare it to some of the many other clustering approaches that have been tried, and discuss challenges remaining in this field.

**Model-Based Clustering in Multidimensional Scaling**`Author:`*Man-Suk Oh*(Ewha Women's University)*Adrian Raftery*(University of Washington)`Abstract:`- Multidimensional scaling is widely used to handle data which consists of similarity or dissimilarity measures between pairs of objects. Major problems in multidimensional scaling are object configuration, choice of dimension, and clustering of objects. We propose a Bayesian approach to deal with these problems, using a mixture of multivariate normal distributions as a prior distribution of the object coordinates. A Markov chain Monte Carlo algorithm is used to estimate the object configuration and group membership simultaneously. A simple criterion to determine the dimension of object configuration and the number of groups is proposed.

Session Time:

Room Location: Capistrano Room

Session Chair:

**Bayesian Inference in a New Class of Multi-Scale Time Series Models**`Author:`*Marco A. R. Ferreira*(Duke University)*Mike West*(Duke University)*David Higdon*(Duke University)*Herbie Lee*(Duke University)`Abstract:`- We introduce a class of multi-scale models for time series. The novel framework couples ``simple'' standard Markov models for the time series stochastic process at different levels of aggregation, and links them via ``error'' models to induce a new and rich class of structured linear models reconciling modelling and information at different levels of resolution. Jeffrey's rule of conditioning is used to revise the implied distributions and ensure that the probability distributions at different levels are strictly compatible. Our construction has several interesting characteristics: a variety of autocorrelation functions resulting from just a few parameters, the ability to combine information from different scales, and the capacity to emulate long memory processes. There are at least three uses for our multi-scale framework: to integrate the information from data observed at different scales; to induce a particular process when the data is observed only at the finest scale; as a prior for an underlying multi-scale process. Bayesian estimation based on MCMC analysis is developed, and issues of forecasting are discussed. Two interesting applications are presented. In the first application, we illustrate some basic concepts of our multiscale class of models through the analysis of the flow of a river. In the second application we use our multiscale framework to model daily and monthly log-volatilities of exchange rates.

**A Variable Memory Markovian Modeling Approach to Unsupervised Sequence Segmentation**`Author:`*Gill Bejerano*(Hebrew University)*Yevgeny Seldin*(Hebrew University)*Naftali Tishby*(Hebrew University)`Abstract:`- We outline a novel unsupervised sequence segmentation algorithm motivated by information theoretic principles. The algorithm, which segments the sequences into alternating Variable Memory Markov sources, is based on competitive learning between the Markov models, implemented as Prediction Suffix Trees using the minimum description length principle. By applying a model clustering procedure, based on rate distortion theory combined with deterministic annealing, we obtain a hierarchical segmentation of sequences between alternating Markov sources. The algorithm seems to be self-regulated and automatically avoids over-segmentation. The method is applied successfully to unsupervised segmentation of texts into languages where it is able to infer both the number of languages and the language switching points. When applied to protein sequence families, we demonstrate the method's ability to identify biologically meaningful sub-sequences within the proteins, which correspond to important functional sub-units, known as protein domains.

**Causal Investigation of Time Series Using Graphical Modelling**`Author:`*Marco Reale*(University of Canterbury)*Granville Tunnicliffe Wilson*(Lancaster University)`Abstract:`- (No abstract available)

**Evaluating Sequential Tests for A Class of Stochastic Processes**`Author:`*Xiaoping Xiong*(St. Jude Children's Research Hospital)*Ming Tan*(St. Jude Children's Research Hospital)`Abstract:`- We propose computational methods for sequential tests for a class of stochastic processes for which the probability density of the test statistic can be factorized into a product of a known likelihood function that is independent of the stopping rule and a conditional probability that is independent of parameters. The proposed methods improve the accuracy and efficiency of computation and enable us to evaluate properties of special interest in sequential tests, such as the probability of discordance between a sequential test and the nonsequential test at the last stage of the sequential test. We give examples for evaluating sequential tests on information time with normal outcomes.

**A Comparison of Reversible Jump MCMC Algorithms for DNA Sequence Segmentation Using Hidden Markov Models**`Author:`*R. J. Boys*(Newcastle University)*D. A. Henderson*(Newcastle University)`Abstract:`- Most DNA sequences display evidence of compositional heterogeneity in the form of patches or domains of similar structure. In this paper we address the problem of identifying such regions of homogeneous composition in genome sequences by using hidden Markov models.

Session Time:

Room Location: Laguna Room

Session Chair:

**Functional Analysis of Computer Network Data**`Author:`*Jeff Solka*(NSWCDD)*David Marchette*(NSWCDD)`Abstract:`- In this talk we focus on some of our recent efforts in the application of functional data analysis methods to mail and web server access data. The application of cluster analysis to the data will be illustrated. Methods for the visualization of the data will also be presented. If time permits, results illustrating the identification of outliers within the data set will also be given.

**Inferring Internal Losses and Delays in Communication Networks from Edge Measurements**`Author:`*Robert Nowak*(Rice University)`Abstract:`- Optimizing communication network performance and detecting attacks and intrusions requires knowledge of loss rates and queueing delays at different points in the network. However, it is impractical to directly monitor packet losses and delays at each and every router. Measurements at the edge of the network (sources and receivers) are relatively easy and inexpensive in comparison. Consequently, it is natural to consider the following inverse problem: from edge-based measurements can we infer the loss rates and delays experienced at internal points in the network? This paper presents a unified formulation of the problems of internal loss and delay estimation, and describes an expectation-maximization algorithm for computing maximum likelihood estimates. We also propose a new method for jointly visualizing network connectivity and network performance parameters.

**Texture Modeling Using Self-Similar Wavelets and POMMs**`Author:`*Jennifer Davidson*(Iowa State University)*Richard Barton*(University of Houston)`Abstract:`- Stochastic models are well suited for the modeling of natural texture images. The wavelet transform has also been useful in analyzing texture data. While previous work has supported this, a factor limiting the widespread use of such models is the high number of parameters necessary to describe the model. Recent work has shown that the number of these parameters can be reduced by making assumptions about the data, such as self-similarity of the wavelet coefficients and other assumptions. We make further assumptions about the wavelet transform (WT) coefficients in this work to include stationarity of the WT coefficients as well as dependence within and across scales of the WT coefficients. This is done by fitting an empirical version of a stationary partially ordered Markov model (POMM) in the transform domain both within and across scales. The POMM assumes dependence of the WT coefficient values in a local neighborhood in scale, and dependence on the parent node in the adjacent scale with coarser resolution. An empirical pdf for the POMM is found that is based on a quantized version of the WT coefficients. The quantization into K states is performed by modeling the WT coefficients with a mixture of two Gaussians and then estimating the parameters of the Gaussians (the variance only, as we assume a zero mean). We use an expectation-maximization algorithm to find the variances, as the states are considered unobserved data. Maximum likelihood is used to determine the K states. Once the states are found, the empirical POMM pdf is calculated, and this is the model used to represent the data. Synthesis is straightforward and real-time since POMMs have an algorithm that produces a sample of their distribution with a single visit to each of the pixel locations in the image.
Examples are shown of the synthesis technique for varying values of K and neighborhood size for the POMM. Other applications are also being pursued, including classification of texture data and encoding for storage and transmission.
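The quantization step described above, fitting a mixture of two zero-mean Gaussians by EM with only the two variances and the mixing weight unknown, can be sketched as follows. This is our own toy reconstruction on synthetic "coefficients", not the authors' code:

```python
import math
import random

def em_two_variances(x, iters=200):
    """EM for a two-component zero-mean Gaussian mixture: estimate (v1, v2, w)."""
    v1, v2, w = 0.5, 5.0, 0.5          # initial small/large variances and weight
    for _ in range(iters):
        # E-step: responsibility of the small-variance component for each point
        r = []
        for xi in x:
            p1 = w * math.exp(-xi * xi / (2 * v1)) / math.sqrt(v1)
            p2 = (1 - w) * math.exp(-xi * xi / (2 * v2)) / math.sqrt(v2)
            r.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted variance and mixing-weight updates
        n1 = sum(r)
        v1 = sum(ri * xi * xi for ri, xi in zip(r, x)) / n1
        v2 = sum((1 - ri) * xi * xi for ri, xi in zip(r, x)) / (len(x) - n1)
        w = n1 / len(x)
    return v1, v2, w

random.seed(0)
# synthetic stand-in for wavelet coefficients: half narrow, half wide
x = [random.gauss(0, 0.5) for _ in range(300)] + [random.gauss(0, 3.0) for _ in range(300)]
v1, v2, w = em_two_variances(x)
```

Assigning each coefficient to the component with the larger responsibility then gives the K = 2 quantized states used for the empirical POMM pdf.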

**The Adaptive Data Cube: An Experiment in Hyperspectral Pattern Recognition**`Author:`*Carey E. Priebe*(Johns Hopkins University)`Abstract:`- We present an experiment in hyperspectral pattern recognition designed to illustrate the potential of the ``adaptive data cube'' in integrated sensing and processing (ISP).

Session Time:

Room Location: Santa Ana Room

Session Chair:

**GGobi Meets R: An Extensible Environment for Interactive Dynamic Data Visualization**`Author:`*Deborah F Swayne*(AT\&T Labs - Research)*Duncan Temple Lang*(Lucent Bell Laboratories)*Andreas Buja*(AT\&T Labs - Research)*Diane Cook*(Iowa State University)`Abstract:`- GGobi is a direct descendant of XGobi, designed so that it can be embedded in other software and controlled using an API (application programming interface). This design has been developed and tested in partnership with R. When GGobi is used with R, the result is a full marriage between GGobi's direct manipulation graphical environment and R's familiar extensible environment for statistical data analysis. GGobi has several other advances over XGobi, including multiple plotting windows, more flexible color management, xml file handling, and portability to Windows.

**Graphical Post-Analysis of Association Rules**`Author:`*Heike Hofmann*(University of Augsburg, Germany)`Abstract:`- Association Rules are a widely used tool in data mining. Major problems originate from the mass of output as well as from the restriction to support and confidence for measures of quality. We will introduce graphical techniques for examining association rules. These allow us not only to assess the quality of a single rule visually, but they also provide an overview of the structure among the rules, laying the basis for an interpretation and extraction of ``real'' results.
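For readers less familiar with the two quality measures the abstract mentions, here is a tiny worked example (transactions invented for illustration) of support and confidence:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Estimate of P(rhs | lhs): support of the union over support of the left side."""
    return support(lhs | rhs) / support(lhs)

s = support({"bread", "milk"})        # 3 of the 5 transactions
c = confidence({"bread"}, {"milk"})   # 3 of the 4 bread transactions
```

The graphical techniques in the paper are aimed exactly at the situation where thousands of such (support, confidence) pairs must be surveyed at once.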

**Uncovering Complexity in Data through Sound**`Author:`*Mark H. Hansen*(Bell Laboratories)*Ben Rubin*(EAR Studio)`Abstract:`- Today, almost every aspect of our lives can be ``rendered'' digitally. Advances in data collection technologies have made commonplace continuous, high-resolution measurements of our physical environment (weather patterns, seismic events, ecological indicators). Equally open to observation are our routine movements through and interactions with our physical surroundings (automobile and air traffic, large-scale land use). In computer-mediated settings, our activities either depend crucially on or consist entirely of complex digital data (financial transactions, accesses to global information systems, web site and internet usage). As a reflection of the diversity and variety of the systems under study, these data-based descriptions of our daily lives tend to be massive in size, dynamic in character, and replete with rich structures. The advent of these enormous repositories of digital information presents us with an interesting challenge: how can we represent and interpret such complex, abstract and socially important data? In a new collaboration, we have begun to give voice to a variety of internet-related data streams. In our initial work, we studied traffic across the web site www.lucent.com. Recently, our focus has been on capturing large-scale ``chatter'' on the web. We have built monitoring agents that allow us to collect streams from thousands of public forums, bulletin boards and chat rooms. The incorporation of textual components in our audio displays presents new and interesting challenges. In this talk, we will try to put our work in context by presenting a brief (and biased) review of the use of data in music compositions, as well as previous attempts to incorporate sound directly into the process of data analysis. 
This work is part of a Lucent program sponsoring collaborations between researchers at Bell Laboratories and the Brooklyn Academy of Music.

Session Time:

Room Location: Costa Mesa Room

Session Chair:

**Causal Inference in Statistics: A Gentle Introduction**`Author:`*Judea Pearl*(UCLA)`Abstract:`- This talk will provide a conceptual introduction to causal inference, aimed at helping researchers gain access to recent advances in this area. The talk will stress the paradigmatic shifts that must be undertaken in moving from traditional statistical analysis to causal analysis of multivariate data. Special emphasis is placed on the assumptions that underlie all causal inferences, the languages used in formulating those assumptions, and the conditional nature of causal claims inferred from nonexperimental studies. These emphases will be illustrated through a brief survey of recent results, including the control of confounding, corrections for noncompliance, and a symbiosis between counterfactual and graphical methods of analysis. Background information can be viewed at http://www.cs.ucla.edu/~judea/.
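One of the surveyed results, the control of confounding, can be made concrete with the back-door adjustment formula P(y | do(x)) = sum_z P(y | x, z) P(z). The numbers below are a hypothetical illustration (a single binary confounder z), showing how the adjusted quantity differs from the naive conditional:

```python
# Hypothetical confounded system: z influences both treatment x and outcome y.
p_z = {0: 0.5, 1: 0.5}
p_x_given_z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}   # z = 1 encourages x = 1
p_y_given_xz = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.5, (1, 1): 0.7}  # (x, z) -> P(y=1)

# Naive conditional P(y=1 | x=1): mixes the causal effect with confounding.
num = sum(p_z[z] * p_x_given_z[z][1] * p_y_given_xz[(1, z)] for z in (0, 1))
den = sum(p_z[z] * p_x_given_z[z][1] for z in (0, 1))
p_y_given_x1 = num / den

# Back-door adjusted P(y=1 | do(x=1)): average over the *marginal* of z.
p_y_do_x1 = sum(p_z[z] * p_y_given_xz[(1, z)] for z in (0, 1))
```

In this example the naive conditional overstates the interventional probability because treated units are disproportionately drawn from the high-risk stratum.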

**The Defining Role of ``Principal Effects'' in Comparing Treatments Using General Post-Treatment Variables**`Author:`*Constantine E. Frangakis*(The Johns Hopkins University)*Donald B. Rubin*(Harvard University)`Abstract:`- We study the general and demanding problem of how to make treatment comparisons that adjust for a post-treatment variable, because standard methods can create post-treatment selection bias precisely in situations where there is a scientific reason for the adjustment. We propose a general framework for comparing treatments adjusting for post-treatment variables that yields ``principal effects'' based on ``principal stratification''. Principal stratification with respect to a post-treatment variable is a cross-classification of subjects defined by the joint potential values of that variable under each of the treatments being compared. Principal effects are defined as causal effects within a principal stratum. The key property of principal strata is that they are not affected by treatment assignment and, therefore, can be used as a pre-treatment covariate in defining any covariate-based estimand. As a result, the central property of our principal effects is that they are always causal effects, and do not suffer from post-treatment selection bias. We discuss briefly that principal causal effects are the link between two recent applications involving post-treatment variables: (i) treatment noncompliance; and (ii) missingness of outcomes (dropout) following treatment noncompliance. We then discuss the open problem of surrogate endpoints, where we show, using principal effects, that all current definitions of surrogacy, even when perfectly true, cannot generally be interpreted as causal effects attributable to the surrogate. We also formulate a new approach based on principal stratification and principal effects, and show that it has better properties than the standard methods.

Session Time:

Room Location: Viejo Room

Session Chair:

**Active Learning for Support Vector Machines with Applications to Text Classification**`Author:`*Simon Tong*(Stanford University)*Daphne Koller*(Stanford University)`Abstract:`- Support vector machines have met with significant success in numerous real-world learning tasks. However, like most machine learning algorithms, they are generally applied using a randomly selected training set classified in advance. In many settings, we also have the option of using pool-based active learning. Instead of using a randomly selected training set, the learner has access to a pool of unlabeled instances and can guide the sampling process by querying for the labels of certain pool instances based upon the data that it has seen so far. We introduce a new algorithm for performing active learning with support vector machines, i.e., an algorithm for choosing which instances to request next. We provide a theoretical motivation for the algorithm using the notion of a version space. We present experimental results showing that employing our active learning method can significantly reduce the need for labeled training instances in both the standard inductive and transductive settings.
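The query rule can be illustrated with a toy linear stand-in for the SVM (weights and pool instances invented): ask for the label of the unlabeled pool instance whose score puts it closest to the current decision boundary.

```python
# Current linear classifier: sign(w . x + b); a trained SVM would supply these.
w, b = [1.0, -1.0], 0.0

# Pool of unlabeled candidate instances.
pool = [(3.0, 0.0), (0.2, 0.1), (-2.0, 1.0)]

def margin(x):
    """Unsigned distance-to-boundary proxy |w . x + b| for instance x."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Query the most "uncertain" instance: the one nearest the boundary.
query = min(pool, key=margin)
```

After the oracle labels the queried point, the classifier is refit and the rule is applied again, which is what makes the sampling process label-efficient compared with random selection.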

**Conditional Random Fields for Text Processing**`Author:`*John Lafferty*(Carnegie Mellon University)*Andrew McCallum*(WhizBang! Labs - Research)*Fernando Pereira*(WhizBang! Labs - Research)`Abstract:`- We present a framework for building probabilistic models to segment and label sequence data using random fields that are globally conditioned on the input sequence. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax the strong independence assumptions made in those models, and to incorporate hierarchical and overlapping features into the model. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare their performance to HMMs and MEMMs on synthetic and natural language data.

**Relevant Encoding of Linguistic Data via the Information Bottleneck Method**`Author:`*Naftali Tishby*(The Hebrew University of Jerusalem)`Abstract:`- We introduce a general information theoretic method for extracting relevant information from one set of variables about another, relevant, set. The mutual information between two random variables answers the question: `what is the minimal number of yes/no questions (bits) that are needed to be asked about the variable X in order to learn all we can about the variable Y?' This value does not, however, tell you anything about the content of these questions. What is it that we need to know about the variable X that provides information about the variable Y? We call (the answers to) these questions the relevant components (information) in X about Y, and propose a general method for generating such questions - relevant encoding of X with respect to Y. We introduce both top-down and bottom-up algorithms for this task and discuss their applications for data clustering, text categorization and classification, word-sense disambiguation, and Bioinformatics. Based on joint work with Noam Slonim, Bill Bialek, and Fernando Pereira.
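A small numeric illustration (our own, with an invented joint distribution) of the mutual information quantity the abstract starts from, I(X; Y) = sum over (x, y) of p(x,y) log2[ p(x,y) / (p(x) p(y)) ]:

```python
import math

# Invented joint distribution p(x, y) over a binary-ish pair of variables.
joint = {("a", 0): 0.4, ("a", 1): 0.1,
         ("b", 0): 0.1, ("b", 1): 0.4}

# Marginals p(x) and p(y).
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0) + p
    py[y] = py.get(y, 0) + p

# Mutual information in bits.
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
```

The information bottleneck method goes one step further: it seeks a compressed encoding of X that keeps as much of this shared information about Y as possible.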

Session Time:

Room Location: Capistrano Room

Session Chair:

**A Split-Merge Markov Chain Monte Carlo Procedure for the Dirichlet Process Mixture Model**`Author:`*Sonia Jain*(University of Toronto)*Radford M. Neal*(University of Toronto)`Abstract:`- We propose a split-merge Markov chain algorithm to address the problem of inefficient sampling for conjugate Dirichlet process mixture models. Traditional Markov chain Monte Carlo methods for Bayesian mixture models, such as Gibbs sampling, can become trapped in isolated modes corresponding to an inappropriate clustering of data points. This article describes a Metropolis-Hastings procedure that can escape such local modes by splitting or merging mixture components. Our Metropolis-Hastings algorithm employs a new technique in which an appropriate proposal for splitting or merging components is obtained by using a restricted Gibbs sampling scan. We demonstrate empirically that our method outperforms the Gibbs sampler in situations where two or more components are similar in structure.

**Priors for Bayesian Neural Networks**`Author:`*Mark Robinson*(University of British Columbia)`Abstract:`- In recent years, Neural Networks (NN) have become a popular data-analytic tool in Statistics, Computer Science and many other fields. NNs can be used as universal approximators, that is, a tool for regressing a dependent variable on a possibly complicated function of the explanatory variables. The NN parameters, unfortunately, are notoriously hard to interpret. Under the Bayesian view, we propose and discuss prior distributions for some of the network parameters which encourage parsimony and reduce overfitting, by eliminating redundancy, promoting orthogonality, linearity or additivity. Thus we consider more senses of parsimony than are discussed in the existing literature. We investigate the predictive performance of networks fit under these various priors.

**Adaptive Metropolis-Hastings Samplers for the Bayesian Analysis of Large Linear Gaussian Systems**`Author:`*Stephen KH Yeung*(University of Newcastle upon Tyne)*Darren J. Wilkinson*(University of Newcastle upon Tyne)`Abstract:`- This paper concerns the implementation of efficient Bayesian computation for large linear Gaussian models containing many latent variables. Such models often arise in the context of dynamic linear modeling, where the underlying stochastic process evolves through time, and in more conventional applications such as in hierarchical linear modeling.

**Genetic Analysis of Melanoma Onset by Using Estimating Equations and Bayesian Hierarchical Models**`Author:`*Kim-Anh Do*(University of Texas M.D. Anderson Cancer Center)`Abstract:`- There are complex relative contributions of genetic and shared environmental factors to an increased risk of melanoma. Data from the Queensland Familial Melanoma Project comprising 15,907 persons from the 1,912 families of 2,118 melanoma cases were analyzed to estimate the additive genetic, common and unique environmental contributions to variation in the age at onset of melanoma. Two complementary approaches for analyzing correlated time-to-onset family data were considered: the generalized estimating equations (GEE) method in which one can estimate relationship-specific dependence simultaneously with regression coefficients that describe the average population response to changing covariates; and a subject-specific Bayesian mixed model in which heterogeneity in regression parameters is explicitly modeled and the different components of variation may be estimated directly. The proportional hazards and Weibull models were utilized, as both produce natural frameworks for estimating relative risks while adjusting for simultaneous effects of other covariates. A simple Markov Chain Monte Carlo method for covariate imputation of missing data was used and the actual implementation of the Bayesian model was based on Gibbs sampling using the freeware package BUGS. In addition, we also used a Bayesian model to investigate the relative contribution of genetic and environmental effects on the expression of naevi and freckles, which are known risk factors for melanoma.

**GDAGsim: Sparse Matrix Algorithms for Bayesian Computation**`Author:`*Darren J. Wilkinson*(University of Newcastle)`Abstract:`- GDAGsim is a C software library for analysis of conditionally specified linear models. In particular, it can be used to carry out conditional sampling of Gaussian Directed Acyclic Graph (GDAG) models, and hence can be used for the implementation of efficient block MCMC samplers for such models.

Session Time:

Room Location: Laguna Room

Session Chair:

**Approximations to Dirichlet Processes with Applications**`Author:`*Jayaram Sethuraman*(Florida State University)`Abstract:`- The Dirichlet process, which can be used as a prior for an unknown distribution appearing in the modeling of data, was introduced by Ferguson. A direct constructive definition of the Dirichlet process given by us simplifies the proofs of many properties. In this talk we present several approximations to Dirichlet processes that will be useful in computational Bayesian analysis involving Dirichlet processes. We present proofs to show that these approximations work in senses adequate for applications. Finally we present examples of applications of these approximations in some computational Bayesian problems.
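
As a concrete illustration of one such approximation (an editorial sketch, not part of the talk): the constructive stick-breaking definition can be truncated at a finite number of atoms, with the final weight absorbing the leftover stick mass so the weights still sum to one.

```python
import random

def truncated_stick_breaking(alpha, n_atoms, rng=random.Random(0)):
    """Draw the weights of a truncated stick-breaking (constructive)
    representation of a Dirichlet process with concentration alpha.
    The final weight absorbs the remaining stick so the weights sum to 1."""
    weights = []
    remaining = 1.0
    for _ in range(n_atoms - 1):
        v = rng.betavariate(1.0, alpha)  # Beta(1, alpha) stick fraction
        weights.append(remaining * v)
        remaining *= (1.0 - v)
    weights.append(remaining)  # truncation: lump the tail mass here
    return weights

w = truncated_stick_breaking(alpha=2.0, n_atoms=50)
```

Pairing each weight with an independent draw from the base measure yields a finite approximation to a Dirichlet-process realization.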

**Banks of Interacting Bayesian Filters**`Author:`*Boris L. Rozovskii*(University of Southern California)*R. Blazek*(University of Southern California)*A. Petrov*(University of Southern California)`Abstract:`- Emerging applications of statistics to network intrusion detection, estimation of network congestion, target tracking, volatility estimation in financial markets, and related areas bring to the forefront new, more complicated problems in state estimation. These applications require models with abrupt sporadic changes and unusual observation structures. This paper proposes randomly modulated jump-diffusion systems for modeling such behavior. We present a bank of interacting Bayesian filters that provides an optimal (in the mean-square sense) estimate of the state process. A recursive algorithm for computing the estimates will be discussed.

**Data Reduction by Quantization**`Author:`*Edward J. Wegman*(George Mason University, Center for Computational Statistics )*Nkem-Amin (Martin) Khumbah*(George Mason University)`Abstract:`- Massive data sets challenge the limits of both computability and visualization. It is therefore desirable to compress data sets, with approximately $10^6$ to $10^7$ bytes being a reasonable target. This can be done by sampling (thinning) or quantization (binning). Binning essentially maps the original sample space into a new discrete sample space. It is commonly thought that data are sparse (lumpy) in high dimensions. Binning therefore consists of identifying clusters and determining statistical properties within clusters. Our proposal is to identify statistical properties within bins and replace the original data with statistically equivalent data of a much smaller scale. This paper explores these ideas.
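
A minimal sketch of the binning idea (an editorial illustration, not the authors' implementation): map each point to a grid cell and keep only per-cell sufficient statistics, here a count and an incrementally updated mean.

```python
from collections import defaultdict

def bin_reduce(points, cell_size):
    """Quantize points onto a regular grid and keep per-cell sufficient
    statistics (count and mean), replacing the raw data with a much
    smaller, statistically equivalent summary."""
    cells = defaultdict(lambda: [0, None])
    for p in points:
        key = tuple(int(coord // cell_size) for coord in p)
        count, mean = cells[key]
        if mean is None:
            mean = [0.0] * len(p)
        count += 1
        # incremental mean update within the cell
        mean = [m + (c - m) / count for m, c in zip(mean, p)]
        cells[key] = [count, mean]
    return {k: (c, tuple(m)) for k, (c, m) in cells.items()}

summary = bin_reduce([(0.1, 0.2), (0.2, 0.1), (5.5, 5.5)], cell_size=1.0)
```

Richer per-cell summaries (covariances, extremes) fit the same pattern without changing the reduction in scale.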

**The Principle and Practice of Minimum Description Length**`Author:`*Bin Yu*(University of California, Berkeley)`Abstract:`- The Minimum Description Length (MDL) Principle for statistical modeling states Occam's razor in the precise language of coding/information theory. On the one hand it generalizes the Maximum Likelihood Principle; on the other, it motivates useful and effective model selection criteria. Moreover, it serves as an objective platform for comparing model selection procedures from frequentist and Bayesian statistics alike. In this talk, I will give an overview of MDL and describe a fast and low-delay perceptually lossless coder for music/speech based on cascaded LMS and MDL weighting.
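
To make the model-selection use of MDL concrete (an editorial sketch, not from the talk), here is the classic two-part code length for choosing a polynomial order: the data are coded given the fitted model, plus a penalty for coding the parameters themselves.

```python
import numpy as np

def mdl_select_degree(x, y, max_degree=5):
    """Two-part MDL criterion for polynomial order selection (in nats):
    L(k) = (n/2) * log(RSS_k / n)     -- data given the fitted model
         + ((k + 1) / 2) * log(n)     -- the k+1 fitted parameters.
    The selected order minimizes the total description length."""
    n = len(x)
    lengths = {}
    for k in range(max_degree + 1):
        coeffs = np.polyfit(x, y, k)
        rss = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
        lengths[k] = 0.5 * n * np.log(max(rss, 1e-12) / n) \
                   + 0.5 * (k + 1) * np.log(n)
    return min(lengths, key=lengths.get), lengths

x = np.arange(50) / 10.0
# quadratic signal plus a small deterministic perturbation
y = 1 + 2 * x + 3 * x**2 + 0.1 * (-1.0) ** np.arange(50)
best, lengths = mdl_select_degree(x, y)
```

The extra half-log-n per parameter is exactly the Occam penalty the abstract alludes to; higher-order fits lower the residual term but not enough to pay for their parameters.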

Room Location: Santa Ana Room

Session Chair:

**A Bayesian Approach to the Analysis of cDNA Microarray Data**`Author:`*M.A. Black*(Purdue University)*B.A. Craig*(Purdue University)*M. Tanurdzic*(Purdue Genetics Program, Purdue University)*R.W. Doerge*(Purdue University)`Abstract:`- The recent explosion of interest in microarray technology has resulted in this becoming the preferred methodology for conducting gene expression experiments. Although the ability of an array experiment to examine the expression of thousands of genes simultaneously gives a previously unheard-of level of insight to researchers, it also raises a plethora of statistical questions regarding both the sheer volume of data being produced and the level of variability inherent in this relatively new technology. In this paper we present statistical methods based on Bayesian linear models for investigating the various sources of variability present in array experiments. Examples involving data from cDNA microarray experiments conducted at the Purdue University Computational Genomics facility will be used to illustrate this methodology.

**A Statistical Analysis of Radiolabeled Gene Expression Data**`Author:`*Rafael A. Irizarry*(Johns Hopkins University)*Giovanni Parmigiani*(Johns Hopkins University)*Mingzhou Guo*(Johns Hopkins University)*Tatiana Dracheva*(National Cancer Institute)*Jin Jen*(Johns Hopkins University)`Abstract:`- This paper considers statistical issues in the analysis of a designed experiment to investigate differential gene expression in colon cancer and normal colon tissue. In this experiment gene expression is measured using radiolabeling-based array filters. Specific statistical issues arise in connection with radiolabeling technology, because of the absence of direct controls, which are replaced by empty spots on the filter, and with designed experiments, because of the opportunity to systematically quantify important sources of random variation. Here we consider three aspects in detail: normalization of expression intensities; shrinkage estimates of intensity ratios between cancer and normal tissue; and ranking of genes by the strength of the evidence that they are differentially expressed. We propose robust and simple-to-implement procedures for normalization and shrinkage that address, in a technology-specific way, the problem of estimating ratios in the presence of small and noisy denominators. We also discuss a graphical display to rank genes using a metric based on quantiles of a null distribution obtained by replicating the array experiment in normal tissue.

**Replication and Appropriate Statistical Analysis Are Required For Accurate Interpretation of DNA Microarray Experiments**`Author:`*She-pin Hung*(University of California, Irvine)*G. Wesley Hatfield*(University of California, Irvine)`Abstract:`- In its most simple sense, a DNA microarray is defined as an orderly arrangement of hundreds to hundreds of thousands of unique DNA molecules (probes) of known sequence. There are two basic sources for the DNA probes on an array. Either each unique probe is individually synthesized {\it in situ\/} on a rigid surface, or pre-synthesized probes (oligonucleotides or PCR products) are attached to the array platform (usually glass or nylon membranes). Prior to the studies reported here, it was not clear whether comparable data could be obtained by these different DNA microarray formats. Here we report the use of rigorous statistical methods to compare data obtained from {\it in situ\/} synthesized and pre-synthesized DNA microarrays, the Affymetrix GeneChip${}^{\scriptstyle\rm TM}$, and Sigma Genosys${}^{\scriptstyle\rm TM}$ nylon filters, respectively. The results dramatically demonstrate the necessity of replication and appropriate statistical analysis for the interpretation of results of DNA microarray experiments.

**Identifying Statistically Significant Similarities in Gene Expression Patterns via Bayesian Infinite Mixture Models**`Author:`*Mario Medvedovic*(University of Cincinnati Medical Center)*Siva Sivaganesan*(University of Cincinnati)`Abstract:`- The recent development of DNA microarray (DNA ``chip'') technologies for parallel monitoring of the expression levels of a large number of genes holds the promise of taking our understanding of molecular processes underlying normal functions of living organisms, as well as underlying mechanisms of human diseases, to a new level. The ability of DNA microarray technology to produce expression data on a large number of genes in a parallel fashion has resulted in new approaches to identifying individual genes as well as whole pathways involved in performing different biologic functions. One commonly used approach to drawing conclusions from microarray data is to identify groups of genes with similar expression patterns across different experimental conditions through a cluster analysis. The biologic significance of the results of such analyses has been demonstrated in numerous studies. Various clustering procedures, ranging from simple agglomerative hierarchical methods to optimization-based global procedures and Self-Organizing Maps, have been used for clustering gene expression profiles. In identifying patterns of expression, such procedures depend on either a visual identification of patterns in a color-coded display (hierarchical clustering) or on the correct specification of the number of patterns present in the data prior to the analysis (k-means and Self-Organizing Maps). We developed a statistical procedure based on the Bayesian Infinite Mixture model in which conclusions about the probability of a set of gene expression profiles being generated by the same pattern are based on the posterior probability distribution of clusterings given the data.
In contrast to the Finite Mixture approach, this model does not require specifying the number of mixture components, and the resulting clustering is obtained by averaging over all possible numbers of mixture components. We implemented a Gibbs sampling based algorithm for generating samples from the posterior distribution of clusterings and used it to identify groups of genes with similar expression profiles in a publicly available dataset.
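
The prior over partitions in such infinite mixture models is often described via the Chinese restaurant process. As an illustration of how a clustering can be drawn without fixing the number of clusters (an editorial sketch, not the authors' sampler):

```python
import random

def chinese_restaurant_process(n, alpha, rng=random.Random(1)):
    """Sample a random partition of n items from the prior implied by a
    Dirichlet-process (infinite) mixture: item i joins an existing
    cluster with probability proportional to its size, or opens a new
    cluster with probability proportional to alpha."""
    assignments = []
    counts = []  # counts[c] = current size of cluster c
    for i in range(n):
        weights = counts + [alpha]  # existing clusters, then "new cluster"
        r = rng.random() * (i + alpha)
        c = 0
        while r >= weights[c]:
            r -= weights[c]
            c += 1
        if c == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[c] += 1
        assignments.append(c)
    return assignments

labels = chinese_restaurant_process(n=100, alpha=1.0)
```

A Gibbs sampler over such partitions, combined with a likelihood for the expression profiles, averages over all possible numbers of components exactly as the abstract describes.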

**An Interdisciplinary Program Employing Computational, Biochemical and Genomic Methods to Examine the Effects of Chromosome Structure on the Regulation of Gene Expression**`Author:`*Lorenzo Tolleri*(University of California, Irvine )*Craig J. Benham*(Mount Sinai School of Medicine, New York)*Pierre Baldi*(University of California, Irvine )*G. Wesley Hatfield*(University of California, Irvine )`Abstract:`- We have computed the position of all of the SIDD sites present on the {\it E. coli\/} chromosome at a mid-physiological superhelical density of $\sigma=-0.05$. These calculations are based on a statistical mechanical approach in which the governing partition function and other quantities of interest are evaluated to a high degree of precision. Further computations to determine the positions of all of the SIDD sites on the chromosome at the superhelical densities encountered in topoisomerase mutant strains are currently underway. We are using Hidden Markov Models to predict the location of all the high affinity IHF binding sites on the {\it E. coli\/} chromosome. To determine the {\it in vivo\/} occupancy of IHF at each of its chromosomal binding sites, we are developing a procedure, which we call CLIP-on-a-CHIP, that uses {\it in vivo\/} IHF-DNA crosslinking, immunoprecipitation and DNA microarrays. For these experiments, cells in log phase are treated with formaldehyde to covalently link chromosomal DNA with DNA binding proteins.

Session Time:

Room Location: Costa Mesa Room

Session Chair:

**Introductory Comments***Michael Jordan*(UC Berkeley)

**Variational Methods and Bayesian Estimation**`Author:`*Tommi Jaakkola*(Massachusetts Institute of Technology)`Abstract:`- Sampling methods have typically been used to counter the often prohibitive cost of exact Bayesian calculations. I will discuss an alternative deterministic approach to this problem based on variational methods. Variational methods either generate adjustable simplifying transforms of the likelihood function or operate in the space of distributions and find the best posterior approximation within a simpler family of tractable distributions. The resulting variational transforms are fast and lead to closed form posterior approximations over the parameters and thereby provide an approximate posterior predictive model. The methods can be readily extended to graphical models with complete or incomplete observations with the help of additional variational transforms. I will present the fundamentals of this approach and discuss its limitations and relation to other approaches.

**Advanced Mean Field Methods for Probabilistic Models**`Author:`*Manfred Opper*(NCRG, Aston University, Birmingham)*Ole Winther*(Technical University of Denmark )`Abstract:`- Mean field (MF) methods provide tractable approximations for the computation of high dimensional sums and integrals in probabilistic models. By neglecting certain dependencies between random variables, a closed set of equations for the expected values of these variables is derived which often can be solved in a time that only grows polynomially in the number of variables. This talk deals with principled and general approaches for correcting the deficiencies of simple MF methods which have their origin in Statistical Physics. We also discuss the relation of these methods to belief propagation techniques.
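
The "closed set of equations for the expected values" mentioned above can be made concrete with the simplest case (an editorial sketch, not from the talk): the naive mean-field equations for a pairwise binary model, solved by damped fixed-point iteration.

```python
import math

def naive_mean_field(J, h, n_iter=200):
    """Naive mean-field equations for a pairwise binary (Ising) model
    p(s) proportional to exp(sum_i h_i s_i + sum_{i<j} J_ij s_i s_j),
    s_i in {-1,+1}.  Neglecting correlations yields the closed
    fixed-point system  m_i = tanh(h_i + sum_j J_ij m_j),
    solved here by damped iteration."""
    n = len(h)
    m = [0.0] * n
    for _ in range(n_iter):
        for i in range(n):
            field = h[i] + sum(J[i][j] * m[j] for j in range(n) if j != i)
            m[i] = 0.5 * m[i] + 0.5 * math.tanh(field)  # damping for stability
    return m

# two weakly coupled spins with opposite external fields
J = [[0.0, 0.3], [0.3, 0.0]]
h = [1.0, -0.5]
m = naive_mean_field(J, h)
```

The advanced methods discussed in the talk correct the systematic errors this factorized approximation makes, but the computational pattern, polynomial-cost fixed-point equations for expectations, is the same.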

**Probability assessment with Maximum Entropy in Graphical Models**`Author:`*Wim Wiegerinck*(SNN - University of Nijmegen)*Tom Heskes*(SNN - University of Nijmegen)`Abstract:`- The Maximum Entropy (MaxEnt) method is a standard method that searches
for the distribution that maximizes entropy under a given set of
constraints. Roughly speaking, it selects the distribution that satisfies
the given constraints without introducing any additional information. This
talk will discuss MaxEnt applied to graphical models. Here, the optimal
model has to satisfy not only a given set of constraints but also a given
set of independency statements.
As an application, we show how MaxEnt in graphical models can provide a practical tool for the assessment of model parameters in graphical models that are built in collaboration with domain experts.
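
A small self-contained example of the MaxEnt principle itself (an editorial sketch in the spirit of Jaynes' loaded-die problem, not part of the talk): the maximum-entropy distribution on a finite set under a mean constraint is an exponential family, and its single Lagrange multiplier can be found by bisection.

```python
import math

def maxent_die(mu, lo=-10.0, hi=10.0, tol=1e-10):
    """Maximum-entropy distribution on {1,...,6} subject to the mean
    constraint E[X] = mu.  The solution has the exponential-family form
    p_i proportional to exp(lam * i); the multiplier lam is found by
    bisection, since the implied mean is increasing in lam."""
    def mean(lam):
        w = [math.exp(lam * i) for i in range(1, 7)]
        z = sum(w)
        return sum(i * wi for i, wi in zip(range(1, 7), w)) / z
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < mu:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = [math.exp(lam * i) for i in range(1, 7)]
    z = sum(w)
    return [wi / z for wi in w]

p = maxent_die(mu=4.5)
```

In the graphical-model setting of the talk, the independence statements add structural constraints on top of the moment constraints, but the variational character of the solution is the same.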

Session Time:

Room Location: Viejo Room

Session Chair:

**Functional Data Analysis of Complex Computer Simulation Output: A Case Study in nuclear Waste Disposal Risk Assessment**`Author:`*David Draper*(University of California, Santa Cruz)*Bruno Mendes*(University of California, Santa Cruz and University of Bath, UK)`Abstract:`- A key issue in the consolidation process of the nuclear fuel cycle is the safe disposal of radioactive waste. Deep geological disposal based on a multibarrier concept is at present the most actively investigated option (visualize a deep underground facility within which radioactive materials such as spent fuel rods or reprocessed waste, previously encapsulated, are placed, surrounded by other man-made barriers). While the safety of this concept ultimately relies on the safety of the mechanical, chemical and physical barriers offered by the geological formation itself, the physico-chemical behavior of such a disposal system over geological time scales (hundreds or thousands of years) is far from known with certainty. From 1996 to 1999, with partners in Italy, Spain, and Sweden, we were involved in a project for the European Commission, GESAMAC, which aimed in part to capture all relevant sources of uncertainty in predicting what would happen if the disposal barriers were compromised in the future by processes such as geological faulting, human intrusion, and/or climatic change. One major goal of the project was the development of a methodology to predict the radiologic dose for people in the biosphere as a function of time, how far the disposal facility and the other components of the multibarrier system are underground, and other factors likely to be related to dose. For this purpose we developed a complex computer simulation environment called GTMCHEM which ``deterministically'' models the one-dimensional migration of radionuclides through the geosphere up to the biosphere. 
In this talk I will describe the application of methods of functional data analysis (FDA) to explore the dependence of predicted radiologic dose curves as a function of time on inputs to the computer simulations. FDA includes extensions of traditional statistical methods such as principal components analysis and the analysis of variance (ANOVA) to the case where the outcome, instead of a single real number, is a curve, in our case the logarithm of radiologic dose as a function of the logarithm of time. Previous work in this field was limited to methods such as ANOVA applied to the maximum of such curves; FDA thus permits a much more complete investigation of the relationship between dose and time, and how this relationship depends on the computer simulation inputs.
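
The FDA extension of principal components to curve-valued outcomes can be sketched in a few lines (an editorial illustration on toy data, not the GESAMAC analysis): sample each curve on a common grid, center, and take the SVD of the resulting matrix.

```python
import numpy as np

def functional_pca(curves, n_components=2):
    """Principal components analysis for curves: each row of `curves`
    is one response curve sampled on a common grid.  The SVD of the
    centered matrix yields the mean curve, the principal component
    curves, and per-curve scores."""
    mean_curve = curves.mean(axis=0)
    centered = curves - mean_curve
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]                    # component curves
    scores = U[:, :n_components] * s[:n_components]   # one score row per curve
    return mean_curve, components, scores

# toy curves: vertical shifts plus slope variation on a 100-point grid
t = np.linspace(0.0, 1.0, 100)
curves = np.array([a + b * t for a, b in [(0, 1), (1, 1), (0, 2), (1, 2)]])
mean_curve, components, scores = functional_pca(curves)
```

Regressing the scores on simulation inputs then gives a functional analogue of ANOVA over the whole curve rather than over a single summary such as its maximum.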

**Integrated Assessment of Drinking Water Regulations**`Author:`*Mitchell J. Small*(Carnegie Mellon University)*Patrick Gurian*(Carnegie Mellon University)*Mark Schervish*(Carnegie Mellon University)*J.R. Lockwood*(Carnegie Mellon University)`Abstract:`- The evaluation of the costs and benefits of drinking water regulations in the United States involves multiple disciplines, diverse datasets, and widely differing expectations and beliefs among interested parties. The debate over the EPA's recent proposal (and subsequent withdrawal) of a new maximum contaminant level (MCL) for arsenic highlights the high degree of uncertainty and importance of such an assessment. We present the results of an integrated, Bayesian statistical assessment framework for evaluating the costs and benefits of alternative MCL's for drinking water in the United States. The framework includes: statistical models for the national distribution of contaminant concentrations in (raw) source waters for the $\approx 56,000$ community water suppliers in the US; models for current treatment systems in-place and their pollutant removal efficiencies; and resulting finished-water concentrations, population exposures and health risks. The model also predicts new treatments or other management options that will be adopted in response to new MCL's, their costs, and exposure- and risk-reduction benefits. A full uncertainty analysis is conducted, informed with available data sets for raw-water concentrations, treatments-in-place, finished-water concentrations, and health-risk data. The model is identified and estimated using Bayesian, Markov Chain Monte Carlo methods. The framework is illustrated for a single contaminant (arsenic); and methods are discussed for evaluating multiple MCL's for a suite of contaminants.

**Bayesian Sensitivity Analysis and Uncertainty Analysis**`Author:`*Jeremy E. Oakley*(University of Sheffield)*Anthony O'Hagan*(University of Sheffield)`Abstract:`- The problem of assessing uncertainties in complex computer simulation codes is of increasing importance in many fields. This is particularly true in environmental applications, where large and highly complex codes are used, where the relevant science is not always clear, and where the implications of errors can be enormous. Although the codes themselves are generally deterministic, statistical methods provide powerful tools for quantifying uncertainties. This talk concerns a range of Bayesian techniques that offer an integrated framework for addressing problems in the use of such models. One major source of uncertainty for the user of a model is often in specifying values for relevant inputs. Running the model may typically mean assigning values to physical parameters which are either unobservable or at least cannot be measured on the scale assumed by the model. Uncertainty in these inputs induces uncertainty on the model outputs, and the objectives of sensitivity and uncertainty analysis are first to quantify this uncertainty and then to explore the role of each uncertain input in the overall output uncertainty. The standard techniques for such analysis involve Monte Carlo methods, which make random draws from the distributions of the inputs, run the code for each sampled input configuration, and thereby obtain a sample of outputs. When the code may take some minutes, hours or even days for a single run, the thousands of runs demanded by Monte Carlo methods become impracticable. The Bayesian tools described in this talk can be applied effectively with far fewer code runs.

**Sensitivity Analysis of a Buried Radioactive Waste Risk Model**`Author:`*Tom Stockton*(Neptune and Co)`Abstract:`- Complex ecosystem models are useful for investigating dynamics of systems where multiple variables are interacting in a non-linear manner. Quantitatively assessing the importance of input variables becomes more difficult as the dimensionality of the model increases. Sensitivity analysis deals with assigning influence measures to input variables for a given model. Local sensitivity analysis deals with the modification of input parameters one at a time. Although local sensitivity analysis is useful in some applications, the region of possible realizations for the model of interest is left largely unexplored. Global sensitivity analysis attempts to explore the possible realizations of the model more completely. The space of possible realizations for the model can be explored through the use of search curves or evaluation of multi-dimensional integrals using Monte Carlo methods. Sensitivity measures computed by these methods estimate the portion of the total variation of the response that can be attributed to the input variable of interest through an ANOVA-like decomposition of main and interaction effects. Methods of sensitivity analysis such as Multivariate Adaptive Regression Splines (MARS) and the Fourier Amplitude Sensitivity Test (FAST) provide tools that can be used in global sensitivity analysis. To enhance the interpretability of model output by providing a quantitative measure of the importance of input parameters, the application of such global sensitivity analysis tools will be demonstrated through a modeling example involving the performance of buried radioactive waste.
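
The ANOVA-like variance decomposition mentioned above is often estimated with Sobol indices. A minimal Monte Carlo sketch (an editorial illustration using the common two-matrix estimator, not the talk's method) on an additive toy model with known answers:

```python
import numpy as np

def sobol_first_order(f, dim, n=20000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of first-order Sobol sensitivity indices,
    via the common two-matrix (Saltelli-style) estimator
    S_i ~ mean(f(B) * (f(A_B^i) - f(A))) / Var(f)."""
    A = rng.random((n, dim))
    B = rng.random((n, dim))
    fA, fB = f(A), f(B)
    var = np.var(np.concatenate([fA, fB]))
    indices = []
    for i in range(dim):
        ABi = A.copy()
        ABi[:, i] = B[:, i]  # A with column i taken from B
        indices.append(float(np.mean(fB * (f(ABi) - fA)) / var))
    return indices

# additive model y = x1 + 2*x2 on [0,1]^2: analytic indices are 0.2 and 0.8
S = sobol_first_order(lambda X: X[:, 0] + 2.0 * X[:, 1], dim=2)
```

For expensive simulators the same indices are usually computed from a cheap surrogate (MARS, FAST) rather than from the model directly, which is the role those methods play in the abstract.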

Session Time:

Room Location: Capistrano Room

Session Chair:

**Learning to Trade via Direct Reinforcement**`Author:`*John Moody*(Oregon Graduate Institute)`Abstract:`- I present new methods for optimizing portfolios, asset allocations and trading systems based on Direct Reinforcement. In this approach, investment decision making is viewed as a stochastic control problem, and strategies are discovered directly. The need to build forecasting models is eliminated, and better trading performance is obtained. The Direct Reinforcement approach differs from dynamic programming and reinforcement algorithms such as TD-learning and Q-learning, which attempt to estimate a value function for the control problem. I present an adaptive algorithm called Recurrent Reinforcement Learning (RRL) that enables a simpler problem representation, avoids Bellman's curse of dimensionality, and offers compelling advantages in efficiency. I demonstrate how Direct Reinforcement can be used to directly optimize risk-adjusted investment returns (including the differential Sharpe ratio), while accounting for the effects of transaction costs. In extensive simulation work, we find that our approach based on RRL produces better trading strategies than systems utilizing Q-Learning (a value function method) or trading based on forecasts. Real world applications include a monthly asset allocation system and an intra-daily currency trader.

**Statistical Inference, The Bootstrap, and Neural Network Modeling with Application to Foreign Exchange Rates**`Author:`*Jeff Racine*(University of South Florida)*Halbert White*(University of California, San Diego)`Abstract:`- In this paper we propose tests for individual and joint irrelevance of network inputs. Such tests can be used to determine whether an input or group of inputs ``belong'' in a particular model, thus permitting valid statistical inference based on estimated feedforward neural network models. The approaches employ well known statistical resampling techniques. We conduct a small Monte Carlo Experiment showing that our tests have reasonable level and power behavior, and we apply our methods to examine whether there are predictable regularities in foreign exchange rates. We find that exchange rates do appear to contain information that is exploitable for enhanced point prediction, but the nature of the predictive relations evolves through time.

Session Time:

Room Location: Laguna Room

Session Chair:

**Dynamic Visualization of Changing Prior and Posterior in Bayesian Analysis**`Author:`*Hani Doss*(Ohio State University)*B. Narasimhan*(Stanford University)`Abstract:`- Over the years, statistical problems given to the NSASAG have often been
considered through a Bayesian approach. This invariably raises the question
of ``How do you choose the prior?'' Ideally, one would want to know
posterior distributions for a wide variety of priors. If the posterior does
not change much when one changes the prior then one gets a feeling of
reassurance---a different investigator with a slightly different prior may
not even bother to recompute the posterior for his prior. On the other hand,
if the posterior changes significantly when one changes the prior, then it
is important to record that fact, so that for example more time is spent on
prior elicitation. Therefore, in almost any problem in which one carries out
a serious data analysis, one wants to calculate the posterior distribution
for a large number of prior distributions, especially in the exploratory
stages of the analysis. In many problems, the posterior is estimated through
Markov chain Monte Carlo, which may require non-negligible computer time,
and unfortunately this precludes consideration of a large number of priors
and an interactive analysis. To deal with this problem, we present a
computing environment within which one can interactively change the prior
and immediately see the corresponding changes in the posterior. The
environment is based on the object-oriented programming language LISP-STAT
and an importance sampling procedure which enables one to use the output of
one or a small number of Markov chains to obtain estimates of the posterior
for a large class of priors. The environment is very general and handles a
wide range of standard models, including for example GLM's and hierarchical
models.
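
The importance-sampling reweighting at the heart of this environment is simple to state (sketched here in Python rather than LISP-STAT, with an illustrative normal example): since the likelihood cancels, posterior draws obtained under one prior can be reweighted by the prior ratio to approximate the posterior under another prior, with no new MCMC runs.

```python
import math
import random

def reweight_for_new_prior(samples, log_old_prior, log_new_prior):
    """Reuse posterior draws obtained under one prior to estimate the
    posterior under another: the importance weight for draw theta is
    new_prior(theta) / old_prior(theta), normalized over the sample."""
    log_w = [log_new_prior(t) - log_old_prior(t) for t in samples]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]  # stabilize before normalizing
    total = sum(w)
    return [wi / total for wi in w]

def posterior_mean(samples, weights):
    return sum(wi * t for wi, t in zip(weights, samples))

# draws standing in for MCMC output under the old (flat) prior
rng = random.Random(0)
samples = [rng.gauss(1.0, 1.0) for _ in range(2000)]
# switch to a N(-1, 1) prior without rerunning the chain
weights = reweight_for_new_prior(samples, lambda t: 0.0,
                                 lambda t: -0.5 * (t + 1.0) ** 2)
```

The approximation degrades when the new prior is far from the old one (few effective samples), which is why a small number of well-placed chains is used to cover a large class of priors.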

**Nonparametric Clustering**`Author:`*David W. Scott*(Rice University)`Abstract:`- The use of density estimation to find clusters in data is supplementing ad hoc hierarchical methodology. Examples include finding high-density regions, finding modes in a kernel density estimator, and the mode tree. Alternatively, a mixture model may be fit and the mixture components associated with individual clusters. A high-dimensional mixture model with many components is difficult to fit in practice. Here, we survey mode and level set methods for finding clusters. We describe a new algorithm that estimates a subset of a mixture model. In particular, we demonstrate how to fit one component at a time and how the fits may be organized to reveal the complete clustering model.
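
One standard way to cluster via modes of a kernel density estimate is mean-shift hill climbing (an editorial illustration of the general idea, not the author's algorithm): each point ascends the estimated density, and points that reach the same mode form one cluster, with no preset number of clusters.

```python
import math

def mean_shift_modes(data, bandwidth, n_steps=100):
    """Assign each 1-D point to a density mode by mean-shift hill
    climbing on a Gaussian kernel density estimate: points converging
    to the same mode form one cluster."""
    def shift(x):
        weights = [math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data]
        return sum(wi * d for wi, d in zip(weights, data)) / sum(weights)
    modes, labels = [], []
    for x in data:
        for _ in range(n_steps):
            x = shift(x)  # move to the local weighted mean
        for k, mode in enumerate(modes):
            if abs(x - mode) < bandwidth / 10.0:
                labels.append(k)
                break
        else:
            modes.append(x)
            labels.append(len(modes) - 1)
    return labels, modes

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels, modes = mean_shift_modes(data, bandwidth=0.5)
```

The bandwidth plays the role of the resolution parameter explored by the mode tree: smaller bandwidths split clusters, larger ones merge them.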

**Massive Data Sets**`Author:`*Jon Kettenring*(Telcordia Technologies)`Abstract:`- TBA

Session Time:

Room Location: Santa Ana Room

Session Chair:

**Statistical Learning Problems Associated with the World Wide Web**`Author:`*Byron Dom*(IBM Almaden Research Center)`Abstract:`- Problems in statistical machine learning occur in many domains and there
is currently much activity in this field. The two major problems addressed
are supervised and unsupervised learning, which have also appeared in the
context of the world wide web. This talk surveys work on web-specific
instances of these two learning problems and others in the hope of making
the statistics community more aware of some of the more significant work
that has been performed to date. Problems discussed include supervised
classification of web pages, clustering of web pages and so-called
``resource discovery'' - finding the most authoritative information on
specific subjects.

**Finite State Approaches to Information Extraction**`Author:`*Andrew McCallum*(WhizBang! Labs)*Fernando Pereira*(WhizBang! Labs)*John Lafferty*(Carnegie Mellon University)*Dayne Freitag*(Carnegie Mellon University)`Abstract:`- Finite state machines are the dominant model for information extraction
both in research and industry. In this talk I will give an overview of
several finite-state approaches to information extraction, culminating in
the presentation of Conditional Random Fields (CRFs), a new model for
probabilistic modeling of sequence data. CRFs offer several advantages over
hidden Markov models, including the ability to relax strong independence
assumptions made in those models. Conditional random fields also avoid a
fundamental limitation of maximum entropy Markov models (MEMMs) and other
discriminative Markov models based on directed graphical models, which can
be biased towards states with few successor states. I will present parameter
estimation algorithms for several models, as well as experiments with real
and synthetic data.

**Graph Structure in the Web**`Author:`*Andrew Tomkins*(IBM Almaden Research Center)`Abstract:`- This talk discusses recent results concerning the macroscopic structure of the web. We show that the web is a ``bow tie'' decomposable into four equal-sized regions based on connectivity properties. We also examine a number of statistical properties of the web graph, showing results concerning the distribution of in- and out-links, the distribution of sizes of strongly and weakly connected components, and the graph diameter under various definitions. The resulting picture is much less strongly connected than was previously believed.
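
The weakly connected components mentioned above are the components of the web graph with link direction ignored. As a small editorial sketch of the computation (hypothetical helper, not the authors' code), breadth-first search on the undirected version suffices:

```python
from collections import deque

def weakly_connected_components(edges):
    """Weakly connected components of a directed graph, given as a list
    of (source, target) links: BFS on the undirected version."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        queue, comp = deque([start]), set()
        seen.add(start)
        while queue:
            u = queue.popleft()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(comp)
    return components

comps = weakly_connected_components([("a", "b"), ("b", "c"), ("d", "e")])
```

Strongly connected components, the core of the "bow tie", additionally require respecting link direction (e.g. Tarjan's or Kosaraju's algorithm), which is where the four-region decomposition comes from.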

Session Time:

Room Location: Costa Mesa Room

Session Chair:

**Genome-Wide Binding Motif Discovery via Microarray and Prospect Sampler**`Author:`*Jun Liu*(Harvard University)*Xiaole Liu*(Stanford University)`Abstract:`- Recent biological experiments showed that by combining a modified
chromatin immunoprecipitation procedure with microarray analysis one can
obtain very strong information on the genome-wide binding locations of a
specific transcription factor (a protein). When combined with a
computational strategy such as the Gibbs motif sampler, one can even
pinpoint the exact binding motif in some cases. Here we describe an
improved algorithm that is specifically designed to find the most
significant protein-binding motifs when ChIP-microarray data are available.

**Hierarchical Models for Gene Expression Data Analysis**`Author:`*Michael Newton*(University of Wisconsin at Madison)*Christina Kendziorski*(University of Wisconsin at Madison)`Abstract:`- Hierarchical statistical models provide a flexible approach to the
analysis of gene expression data. They enable robust and efficient inference
concerning patterns of differential expression among cell types. One
rationale for these models is that with thousands of gene-specific
questions, it is meaningful to treat the gene-specific parameters as arising
from array-specific or cell-type-specific distributions, rather than
treating them as fixed effects. Two parametric formulations have proven to
be effective: Gamma-Gamma-Multinomial (GGM) and LogNormal-Normal-Multinomial
(LNNM). Both characterize fluctuations in gene-specific expected expression
and fluctuations in measured expression given these underlying means. A
nonparametric version of the model provides further insight. I will discuss
hierarchical statistical modeling in the context of several experiments
involving oligonucleotide chip sets. I will try to demonstrate the utility
of reporting the probability of various forms of differential expression,
and I will discuss model fitting, model checking, and the role of an
interesting arithmetic-geometric mean ratio.

**Stochastic Models for Sequences with Non-Local Dependency Structure**`Author:`*Scott C. Schmidler*(Duke University)`Abstract:`- We describe a class of probability models for capturing non-local dependencies in sequential data, motivated by applications to biopolymer (protein and nucleic acid) sequence analysis. These models generalize previous work on segment-based stochastic models for sequence analysis. We provide algorithms for Bayesian inference on these models via dynamic programming and Markov chain Monte Carlo simulation. We demonstrate this approach with an application to protein structure prediction.

Session Time:

Room Location: Viejo Room

Session Chair:

**Does Anyone Know When the Correlation Coefficient is Useful?: A Study of the Times of Extreme River Flows**`Author:`*David R. Brillinger*(University of California, Berkeley)`Abstract:`- John Tukey spoke and wrote concerning the uses and limitations of
correlation and regression coefficients. In particular he did not think
highly of the former, except in limited circumstances, and he recognized that
there were substantial difficulties of interpretation going along with the
latter. The two concepts have random process analogs and this paper
considers the case of stationary point processes and their use in a setup
that is reasonably well understood physically. The situation is that of the
passage of the extreme water flows through a series of locks along the
Mississippi River. The focus is on an investigation of the validity of
partial coherency analysis, an analog of partial correlation analysis. A
maximum likelihood analysis is also presented.

**On the Interaction Between Statistics and Computing: In Memory of John W. Tukey**`Author:`*Luisa Fernholz*(Temple University)`Abstract:`- This talk is partially based on my unpublished joint work with John W.
Tukey. I will discuss some of Tukey's ideas and comments on data analysis,
statistics, mathematical statistics, computing, etc. In that context, I will
present examples of how the interaction between statistical theory and
statistical computing can be used to generate new methodologies that offer
powerful tools for analyzing data. In particular, I will present a data
dependent method of outlier detection based on the multihalver (the
leave-out-half jackknife). The approach is based on the sequences from
Plackett-Burman designs with Hadamard matrices used to generate the
different ``halvings'' of the data. Examples will be given to show the
effectiveness of the multihalver in detecting outliers. The multihalver
examples are based on my joint work with John W. Tukey, one of his last
works in statistics.

**The Legacy of John Tukey**`Author:`*Robert L. Launer*(U.S. Army Research Office and the George Washington University)`Abstract:`- The talk begins with another look at John Tukey's 1949 paper, ``One Degree of Freedom for Non-Additivity''. This is the springboard for a look at his career and work. His activity as a consultant to the United States Government is highlighted through personal reminiscence.

Room Location: Capistrano Room

Session Chair:

**Spatio-Temporal Prediction of Incomplete Precipitation Records**`Author:`*Craig Johns*(University of Colorado, Denver)*Douglas Nychka*(National Center for Atmospheric Research)`Abstract:`- Ecological models depend on precipitation fields as inputs. However,
monthly precipitation data from a large number of stations in the United
States over a large number of years contain many missing observations. To
predict, or infill, these missing values, we describe a model that relies
neither completely on stationary models nor completely on observed
correlations. Modifications to the model are made in order to make the
fitting computationally feasible for large data sets.

**Bayesian and Frequentist Inference for Ecological Inference: The R x C Case**`Author:`*Ori Rosen*(University of Pittsburgh)*Wenxin Jiang*(Northwestern University)*Gary King*(Harvard University)*Martin Tanner*(Northwestern University)`Abstract:`- In this paper we propose Bayesian and frequentist approaches to
ecological inference, based on $R \times C$ contingency tables, including a
covariate. The proposed Bayesian model extends the binomial-beta
hierarchical model developed by King, Rosen and Tanner (1999) from the $2
\times 2$ case to the $R \times C$ case. As in the $2 \times 2$ case, the
inferential procedure employs Markov Chain Monte Carlo (MCMC) methods. As
such, the resulting MCMC analysis is rich but computationally intensive. The
frequentist approach, based on first moments rather than on the entire
likelihood, provides quick inference via nonlinear least-squares, while
retaining good frequentist properties. The two approaches are illustrated
with simulated data, as well as with real data on voting patterns in Weimar
Germany. In the final section of the paper we provide an overview of a range
of alternative inferential approaches which trade off computational
intensity for statistical efficiency.

**Using the Chemical Mass Balance Receptor Model to Estimate Pollution Source Contributions from Correlated Air Quality Observations**`Author:`*William F. Christensen*(Southern Methodist University)`Abstract:`- In the environmental sciences, receptor models are used to evaluate the
contribution of various pollution sources to the air composition at a
location. Using pollution source profiles, profile uncertainties, and
measurement error variances, the chemical mass balance (CMB) model can be
fit in order to partition the ambient pollutants measured at the receptor
into a collection of source contributions. We discuss the use of the CMB
model for the analysis of a multivariate time series of air quality
measurements, and we consider estimation and inference procedures which
account for the multiple sources of correlation in the data. Using computer
simulation, we compare these approaches under various scenarios in which
standard model assumptions are violated.

**The Application of Ensemble and Combination Classifiers for Land Cover Mapping via Satellite Imagery**`Author:`*Brian M. Steele*(University of Montana)*David A. Patterson*(University of Montana)`Abstract:`- This talk concerns the application of statistical classification rules
for constructing land cover maps from Landsat Thematic Mapper (TM) satellite
imagery. These maps are widely used for large-scale management by the USDA
Forest Service and other agencies for tasks such as prioritizing fire
fighting efforts. The classification problem begins with a Landsat TM scene
(approximately $170 \times 170$ km) that has been partitioned into a set of
approximately 800,000 polygons. A training set (2000 to 5000 observations)
is collected by ground visitation, and a classifier is constructed from the
training set to assign land cover type (15-20 types) to the unsampled
polygons using the satellite imagery. Good map accuracy is difficult to
achieve in mountainous and forested landscapes because land cover type
transitions are sometimes indistinct, and because of satellite measurement
error. This talk addresses the application of ensemble methods (boosting and
bagging), and combination methods for improving classifier accuracy. Two
classifiers are of particular interest. I discuss a k-nearest neighbor
(k-NN) classifier which estimates group membership probabilities using the
exact analytic bootstrap expectations of a k-NN probability estimator. In
other words, this classifier amounts to the classifier that would be
obtained from averaging all possible bagging versions of a k-NN classifier.
The second classifier is a simple spatial classifier that uses the distance
from polygons to training observations to classify polygons. While the
accuracy of this classifier is quite poor, combinations of spatial and k-NN
or tree classifiers are substantially better than any of the constituent
classifiers. The method of combining classifiers is not particularly
important; in fact, for these data, a very simple method works nearly as
well as any.

**Mining for Knowledge About Ostracode Assemblages in the Tecolutla River Delta**`Author:`*A. Dale Magoun*(The University of Louisiana at Monroe)*Mervin Kontrovitz*(The University of Louisiana at Monroe)*Daniel J. Stanley*(Smithsonian Institution)`Abstract:`- Sediment surface samples containing ostracodes from twenty-one locations representing different aquatic habitats were obtained from a recent study pertaining to the depositional variability of the Tecolutla River delta, Mexico (Chen, Stanley, and Wright, 2000). Aquatic habitats such as an estuary, offshore, a mangrove, tidal ponds, a marsh, and the river were surveyed. The surface samples were analyzed for organic matter and grain size distributions. The ostracodes were identified to their lowest taxonomic level. This paper shows the results of several approaches to the analysis and interpretation of this multivariate study. These methods offer different perspectives on the interdependent structures that exist in the aquatic habitats found in this riverine environment. The paper discusses the findings from viewpoints ranging from traditional taxonomy to the latest techniques using correspondence analysis.

Session Time:

Room Location: Laguna Room

Session Chair:

**Developing Data Mining Systems**`Author:`*Arno Siebes*(Utrecht University)`Abstract:`- Data mining is the search for patterns in (large) databases. In the last
couple of years I've been involved in the development of two data mining
systems and I'm about to start the development of a third system (geared
towards bioinformatics). In this talk I will tell you some of my experiences
and how they influence the design of the new system. I will especially
address efficiency issues related to the interrogation of large databases.
For example, which data structures in the database speed up query
processing, and which types of query operators are necessary?

**Graphical and Statistical Pruning of Association Rules**`Author:`*Adalbert Wilhelm*(University of Augsburg)`Abstract:`- Association rules are amongst the most important patterns that can be discovered using data mining. Their discovery is supported by most, if not all, data mining tools. The analysis and interpretation of the discovered rules, however, is far more difficult, and often almost impossible, given the huge number of generated rules. In this paper we propose graphical aids and statistical tests to overcome the main drawbacks of mining association rules: arbitrary thresholds for support and confidence, the huge number of association rules, meaningless associations due to the presence of frequent itemsets, and the empirical evaluation of implication strength. We show how Double Decker plots can be used to visualize association rules. These plots visualize the contingency table that yields the association rule, as well as the other potential rules in that table, whether they meet the thresholds or not. This gives a deeper understanding of the nature of the correlation between the left-hand side of the rule and the right-hand side.
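To make the support/confidence/test pruning concrete, here is a minimal sketch (illustrative counts and thresholds, not the paper's data or exact procedure) that computes a rule's support and confidence from its underlying 2x2 contingency table and applies a chi-square test of independence as a statistical filter:

```python
# Sketch (hypothetical counts): pruning an association rule LHS -> RHS
# using support/confidence thresholds plus a chi-square independence test
# on the rule's 2x2 contingency table.

def rule_stats(n11, n10, n01, n00):
    """Support and confidence of LHS -> RHS.

    n11: transactions with both LHS and RHS, n10: LHS only,
    n01: RHS only, n00: neither.
    """
    n = n11 + n10 + n01 + n00
    support = n11 / n               # P(LHS and RHS)
    confidence = n11 / (n11 + n10)  # P(RHS | LHS)
    return support, confidence

def chi_square(n11, n10, n01, n00):
    """Pearson chi-square statistic for independence of LHS and RHS."""
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    stat = 0.0
    for obs, r, c in [(n11, row1, col1), (n10, row1, col0),
                      (n01, row0, col1), (n00, row0, col0)]:
        exp = r * c / n
        stat += (obs - exp) ** 2 / exp
    return stat

# Keep a rule only if it passes both the thresholds and the test
# (3.84 is the 5% critical value of chi-square with 1 df).
s, c = rule_stats(n11=400, n10=100, n01=200, n00=300)
keep = s >= 0.1 and c >= 0.5 and chi_square(400, 100, 200, 300) > 3.84
```

The statistical filter discards rules whose apparent confidence is explained by the marginal frequency of the right-hand side alone.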

***** CANCELLED *****~~Analyzing High Dimensional Online Monitoring Data~~`Author:`*Ursula Gather*(University of Dortmund)`Abstract:`- In technical process control, but also in modern intensive care, we are
confronted with data structures that are massive as well as high
dimensional and dynamic, with complex dependencies among the components.
We report on problems and challenges for statistical data analysis in this
situation, especially w.r.t. the ability of methods to work online. Some
first results concerning dimension reduction and online pattern detection
will also be presented.

Session Time:

Room Location: Santa Ana Room

Session Chair:

**Searching the Web: Current Limitations, New Techniques, and Future Directions**`Author:`*C. Lee Giles*(Pennsylvania State University)`Abstract:`- The World Wide Web continues to revolutionize communication and information systems. Measurements of its size and content provide us with insights into the Web's future and suggestions for new Web tools. Our sampling of the web found that the web, though large, is not as large as some industrial databases. Our sampling of search engines found that search engines index only a fraction of the web, do not index sites equally, and may not index new pages for months. This talk discusses the impact of these results on future Web search and makes suggestions for future Web tools and research. We illustrate new techniques for information access on the Web by describing two approaches: metasearch, illustrated by Inquirus, a content-based metasearch engine; and niche search, illustrated by ResearchIndex, the largest free full-text index of computer science literature. This is joint work with Steve Lawrence, Kurt Bollacker and Eric Glover.

**How Big is the World Wide Web?**`Author:`*Adrian Dobra*(Carnegie Mellon University)*Stephen Fienberg*(Carnegie Mellon University)`Abstract:`- Considerable efforts have been dedicated to the development of sound procedures for assessing the size of the World Wide Web. The problem is compounded by the fact that sampling directly from the Web is not possible. Several groups of researchers have found sampling schemes which consist of running a number of queries on several major search engines. We present a new approach to analyze datasets collected by query-based sampling, founded on a hierarchical Bayes formulation of the Rasch model for multiple capture-recapture methodology. We illustrate the approach using data gathered by Giles and Lawrence in 1997.
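The hierarchical Rasch approach generalizes the classic two-sample capture-recapture idea. As background, a minimal sketch of the simplest such estimator (Lincoln-Petersen, with made-up counts) under the strong assumption, which the talk's model relaxes, that two search engines index pages independently and with equal probability:

```python
# Sketch (illustrative numbers): Lincoln-Petersen capture-recapture.
# If engine A indexes nA pages, engine B indexes nB, and m pages are
# indexed by both, independence of the two "captures" gives the
# population estimate N ~ nA * nB / m.

def lincoln_petersen(nA, nB, m):
    """Estimate total population size from two overlapping samples."""
    return nA * nB / m

est = lincoln_petersen(nA=120, nB=150, m=60)  # -> 300.0
```

Query-based sampling violates the independence assumption in practice, which is exactly why the hierarchical Bayes formulation is needed.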

Session Time:

Room Location: Costa Mesa Room

Session Chair:

**A Tutorial on Support Vector Machines**`Author:`*Bernhard Sch\"olkopf*(Biowulf Technologies)`Abstract:`- This talk will present a general tutorial on support vector machines (SVMs).
`Talk Slides:`- [Postscript]

**Kernel Methods for Unsupervised Learning**`Author:`*Bernhard Sch\"olkopf*(Biowulf Technologies)`Abstract:`- Over the last years, ideas from SVMs have been generalized to unsupervised learning problems such as multi-dimensional quantile estimation and vector quantization. Algorithms for these problems will be presented in the talk.

Session Time:

Room Location: Viejo Room

Session Chair:

**Graphical Representation as a Discipline**`Author:`*Herman Chernoff*(Harvard University)`Abstract:`- The advent of the computer has facilitated the development and use of
graphical representation. At the same time it has made such use more
important. In response to such needs, many novel techniques have originated.
It is time to replace naive dogmas about what constitutes a good
representation with a serious study of fundamental
principles.

**Clustering and the Genetics of Complex Disease**`Author:`*Richard Olshen*(Stanford University)`Abstract:`- This work is in collaboration with Alfred Lin, Jing Huang, Neil Risch,
David Cox, Koustubh Ranade, Yii-Der Ida Chen, Dee Pei, Chii-Min Hwu, David
Curb, Beatriz Rodriguez, Victor Dzau, and many others. Part of the talk will
be a description of two ongoing projects in which I am involved, both with
many other individuals. Each is a search for genes that predispose to what
seems to be a polygenic disease. The first project is SAPPHIRe (Stanford,
Asian, and Pacific Program in Hypertension and Insulin Resistance), a
network of the NHLBI's Family Blood Pressure Program. The other is supported
by the Donald W. Reynolds Foundation and is concerned with cardiovascular
diseases, in particular with genes expressed in vessel walls. Most of the
talk will be about defining phenotypes in general and so-called
``intermediate phenotypes'' in particular. With SAPPHIRe we have applied
k-means clustering on variables, none marginally Gaussian, that quantify
levels of plasma lipids and the metabolism of insulin and glucose. Some
technologies for choosing ``how many clusters'' firmly choose two for the
597 women and somewhat less certainly two for the 535 men. However, others
insist upon three clusters. The differences seem of interest, as statistics
and as biology. In each case, one cluster is clearly the ``not insulin
resistant'' cluster. Cluster membership is significantly associated with
hypertension in women, much less so in men. We devised various permutation
tests for making inferences about these data, tests that respect family
structures.

**Multivariate Statistical Process Control and Signature Analysis Using Eigenfactor Detection Methods**`Author:`*Kuang H. Chen*(Massachusetts Institute of Technology)*Duane S. Boning*(Massachusetts Institute of Technology)*Roy E. Welsch*(Massachusetts Institute of Technology)`Abstract:`- Many businesses now use univariate statistical process control (USPC) in both their manufacturing and service operations. Automated data collection, low-cost computation, product design to facilitate measurement, and demands for higher quality, lower cost, and increased reliability have accelerated the use of USPC. However, in many situations the widespread use of USPC has caused a backlash as processes are frequently adjusted or shut down when nothing is really wrong because the probability of false positives (Type I error) is calculated based on USPC and takes little or no account of the multiple tests that are being performed or the correlation structure that may exist in the data. Attempts to deal with these issues focus on Bonferroni adjustments, Hotelling's T-squared statistics, and the generalized variance. The problem of high dimensionality is commonly addressed by using some form of dimension reduction such as principal component analysis (PCA). Often these methods indicate that some sort of change has taken place, but provide little information about the real nature of that change. In this paper, we develop a multivariate detection method called eigenfactor analysis which combines information contained in the matrix of eigenvectors and the eigenvalues, and is capable of detecting new events and subtle changes in the covariance structure of the process. Information regarding covariance structure in the process can be crucial for feedback, tuning, and control purposes. Unlike univariate cases where the distributions of the null and alternative hypotheses are aligned on the same axis, the orientations of two multivariate distributions can be much more complicated. 
Multivariate PCA and T-squared techniques project the test samples on a distribution model based on the training data; hence, such techniques assume alignment in the distributions. Consequently, detection strategies that are capable of differentiating directional drifts become very desirable. The paper concludes with an example from semiconductor manufacturing where the goal is to detect the end point in plasma etch using optical emission spectra. Results using the new procedures are then compared with existing univariate and multivariate process control techniques.

Room Location: Capistrano Room

Session Chair:

**Data Sharpening for Higher-Order Density Estimation**`Author:`*Michael C. Minnotte*(Utah State University)*Peter Hall*(Australian National University)`Abstract:`- Data sharpening is a method of applying carefully chosen transformations to data to obtain superior properties with simple methods of analysis. For the case of kernel density estimation, we show that transformations based on pilot estimates of the density and its derivatives can lead to arbitrarily high orders of bias reduction in density estimation with second-order (positive) kernels. Although the transformation is bandwidth-dependent, it requires neither subsidiary smoothing parameters nor back-transformation. Unlike estimates generated using traditional higher-order kernels, ours are constrained to be nonnegative. Numerical studies demonstrate that they also have improved mean square error properties and have fewer arbitrary wiggles.

**Robust Detection of Multivariate Outliers in High Dimensions and High Levels of Contamination**`Author:`*Mark Werner*(University of Colorado)*Karen Kafadar*(University of Colorado)`Abstract:`- Detection of multivariate outliers using classical statistical measures
such as the mean and standard deviation is made difficult by the masking
effect, where the influence of multiple outliers on these measures limits
their identification. Robust methods have been developed which do not use
classical measures, but they tend to be computationally prohibitive and are
not feasible for use in very large data sets. Building on ideas of Pe{\~n}a
and Prieto (2001) and Rocke and Woodruff (2001), we investigate the success
of a variety of methods in detecting outliers, with particular interest in a
mixture of algorithms: using a combination of classical and robust methods
to exploit the advantages of each and thereby achieve greater success than
when the methods are used separately. We thus investigate the
performance of these algorithms on various types of outliers and outlier
clusters. We find that different algorithms perform better for different
types of outliers and therefore recommend a combination of methods to
achieve the highest possible outlier identification rate when confronted
with a data set containing unknown outlier types.
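As a sketch of the classical baseline the authors combine with robust steps (toy data and a standard cutoff, not the paper's algorithms): flag points whose squared Mahalanobis distance from the sample mean exceeds a chi-square quantile. With a gross outlier in a small sample, the outlier inflates the mean and covariance enough that the cutoff is never reached, illustrating the kind of masking that motivates the robust alternatives:

```python
# Sketch (toy 2-d data): classical outlier flagging via squared
# Mahalanobis distance.  Note how the outlier at (5, 5) distorts the
# mean and covariance so that no point exceeds the cutoff -- masking.

def mahalanobis_sq_2d(data):
    """Squared Mahalanobis distance of each 2-d point from the mean."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    # Inverse of the 2x2 sample covariance matrix.
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    out = []
    for x, y in data:
        dx, dy = x - mx, y - my
        out.append(ixx * dx * dx + 2 * ixy * dx * dy + iyy * dy * dy)
    return out

data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (-0.1, 0.0), (0.0, -0.2), (5.0, 5.0)]
d2 = mahalanobis_sq_2d(data)
flagged = [i for i, d in enumerate(d2) if d > 5.99]  # chi-square(2), 5% cutoff
# The outlier has the largest distance but is still not flagged.
```

With sample moments, no squared distance can exceed $(n-1)^2/n$, so small contaminated samples can never trip a fixed chi-square cutoff.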

**The Complexity of the MCD-Problem**`Author:`*Paul Fischer*(University of Dortmund, Lehrstuhl Informatik 2)*Thorsten Bernholt*(University of Dortmund, Lehrstuhl Informatik 2)`Abstract:`- In modern statistics the robust estimation of parameters is a central
problem. The Minimum Covariance Determinant (MCD), see, e.g.,~[Rous84], is
probably the most important robust estimator of multivariate location and
scatter. Its algorithmic complexity, however, was unknown and generally
thought to be exponential even if the dimensionality of the data is fixed. A
number of heuristics for solving the problem have been developed, e.g.,
Fast-MCD~[Rous99] and Feasible Solution~[HaOl99]. Here we present a
polynomial time algorithm for MCD for fixed dimension of the data. In
contrast we show that the MCD problem is $NP$-hard if the dimension varies,
hence one cannot expect to find efficient algorithms in this
case.
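To fix ideas, a brute-force MCD sketch on toy data (illustrative only; exhaustive search is exponential in the sample size, which is exactly why the polynomial-time result for fixed dimension and the NP-hardness result for varying dimension matter):

```python
# Sketch (toy data): brute-force Minimum Covariance Determinant --
# among all h-subsets, find the one whose sample covariance matrix has
# the smallest determinant.  Only feasible for tiny data sets.

from itertools import combinations

def cov_det_2d(points):
    """Determinant of the sample covariance matrix of 2-d points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx * syy - sxy ** 2

def mcd_brute_force(data, h):
    """Return the h-subset whose covariance determinant is minimal."""
    return min(combinations(data, h), key=cov_det_2d)

# Toy data: a tight cluster plus one gross outlier.
data = [(0.0, 0.0), (1.0, 0.1), (0.5, 0.4), (0.2, 0.9), (10.0, 10.0)]
best = mcd_brute_force(data, h=4)
# The outlier (10, 10) is excluded from the minimizing subset.
```

Heuristics such as Fast-MCD replace the exhaustive enumeration with concentration steps over a few random starting subsets.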

**Finding Committee Solutions by Clustering Models in Function Space**`Author:`*Thomas Ragg*(University of Karlsruhe)`Abstract:`- Forming a committee is an approach for integrating several opinions or
functions instead of favoring a single one. Selecting and weighting the
committee members is done in several ways by different algorithms; possible
solutions to this problem are still a topic of current research. Our
starting point is the decomposition of the committee error into a bias- and
variance-like term. Two requirements can be derived from this equation: models
should on the one hand be regularized properly to reduce the average error.
On the other hand they should be as independent as possible (in the
mathematical sense) to decrease the committee error. The first requirement,
regularization, can be handled by a Bayesian learning framework. For the
second, I suggest a new selection method for committee
members based on the pairwise stochastic dependence of their output
functions, which maximizes the overall independence. Given these pairwise
similarity values the models can be separated in classes by a hierarchical
clustering algorithm. From the error decomposition of committees I derive a
criterion that allows me to find the optimal number of classes, i.e., the
optimal stopping criterion for the clustering algorithm. The benefits of the
approach are demonstrated for committees of neural networks on a noisy
benchmark problem as well as on some problems from the UCI
repository.

**Detection of Novel Samples in Mass Spectral Data Using Cluster Analysis**`Author:`*Vladimir Svetnik*(Merck \& Co., Inc.)*Andy Liaw*(Merck \& Co., Inc.)`Abstract:`- We present an application of cluster analysis to the detection of novel
(unusual, outlying) samples in mass spectral data. These samples, unlike the
majority of the others, may represent novel chemical structures that are of
most interest to the scientists. Details of our specific application are
presented in [1]. A typical N by p data set consists of N (nearly a
thousand) mass spectral measurements, with intensities measured at p
(several hundred) mass-to-charge ratios. Since each sample can be represented as a
point in the p-dimensional space, search for unusual samples can be
considered as a search for the ``abnormal'' (outlier) points in this space.
While it is well known that outlier identification and clustering in such
high dimensions is extremely difficult, it will be argued that the nature of
the mass spectral data does not lend itself to dimension reduction. The
outlier detection problem in large, multidimensional data sets has received
significant attention in the data mining community; see, for example, [2-4]. Similar
to [4], our approach utilizes hierarchical clustering to solve this problem.
Advantages of this approach are that, unlike many other methods, it places
fewer assumptions on the data, and can be used with various distance
(similarity) measures. Based on the clustering algorithm, we developed a
complete procedure for outlier identification. We start from the working
definition of outliers as those samples that are members of clusters with
``small'' cardinalities, and are far away from clusters with ``large''
cardinalities. A parameter, U, which depends on the expected number of novel
samples in the data, is used to define ``small'' clusters. We note one
monotonic feature of the hierarchical clustering algorithm: the
cardinality of the cluster that a sample belongs to is non-decreasing as
the number of clusters decreases. This feature of the algorithm allows us
to calculate a measure of outlyingness for each sample, kmin, which is the
smallest number of clusters at which the sample belongs to an outlying
cluster (a cluster with cardinality smaller than the threshold UN). Samples
are then ranked by their kmin values, and the top few are submitted for
further investigation by scientists. The use of the sparsely populated clusters
is consistent with the hypothesis that novel samples should represent only a
small fraction of the data. We discuss the use of different similarity
measures in our method. In particular, we focus on those measures that are
traditionally used in the analysis of mass spectral data. We also present a
bootstrap procedure that is used to assess the ``confidence'' one should
have about the outlyingness of the samples identified by this method.

Session Time:

Room Location: Laguna Room

Session Chair:

**A Computational Approach for Full Nonparametric Bayesian Inference under Dirichlet Process Mixture Models**`Author:`*Alan E. Gelfand*(University of Connecticut)*Athanasios Kottas*(Duke University)`Abstract:`- Widely used parametric generalized linear models are, unfortunately, a
somewhat limited class of specifications. Nonparametric aspects are often
introduced to enrich this class, resulting in semiparametric models.
Focusing on single or k-sample problems, many classical nonparametric
approaches are limited to hypothesis testing. Those that allow estimation
are limited to certain functionals of the underlying distributions.
Moreover, the associated inference often relies upon asymptotics, even
though nonparametric specifications are most appealing for smaller sample
sizes. Bayesian nonparametric approaches avoid asymptotics but have, to
date, been limited in the range of inference. Working with Dirichlet process
priors, we extend the effort of Gelfand and Mukhopadhyay (1995). In that
paper inference was confined to posterior moments of linear functionals of
the population distribution. Here, we provide a computational approach to
obtain the entire posterior distribution for more general functionals. We
illustrate with three applications: investigation of extreme value
distributions associated with a single population, comparison of medians in
a k-sample problem, and comparison of survival times from different
populations under fairly heavy censoring.

**Hierarchical Model-Based Clustering For Large Datasets**`Author:`*Christian Posse*(KangarooNet Inc.)`Abstract:`- In recent years, hierarchical model-based clustering has provided promising results in a variety of applications. However, its use with large datasets has been hindered by time and memory complexity that are at least quadratic in the number of observations. To overcome this difficulty, we propose to start the hierarchical agglomeration from an efficient classification of the data in many classes rather than from the usual set of singleton clusters. This initial partition is derived from a subgraph of the minimum spanning tree associated with the data. To this end, we develop graphical tools that assess the presence of clusters in the data and uncover observations difficult to classify. Using this approach, we analyze two large, real datasets: a multi-band MRI image of the human brain and data on global precipitation climatology. In the latter case, we discuss ways of integrating the spatial information in the clustering analysis. We focus on two-stage methods, in which a second stage of processing using established methods is applied to the output from the algorithm presented in this paper, viewed as a first stage.
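The MST-based initial partition can be sketched as follows (toy data and a simple length-threshold cut; the paper derives its subgraph with more care):

```python
# Sketch (assumed threshold rule): derive a starting partition for
# hierarchical agglomeration by cutting the longest edges of the
# Euclidean minimum spanning tree and taking connected components.

def mst_edges(points):
    """Prim's algorithm: edges (dist, i, j) of the Euclidean MST."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j in in_tree:
                    continue
                d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
                if best is None or d < best[0]:
                    best = (d, i, j)
        edges.append(best)
        in_tree.add(best[2])
    return edges

def initial_partition(points, cut):
    """Connected components after removing MST edges longer than `cut`."""
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for d, i, j in mst_edges(points):
        if d <= cut:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Two well-separated toy groups collapse to two starting classes,
# so agglomeration need not begin from singletons.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
parts = initial_partition(pts, cut=2.0)
```

Starting from these coarse classes reduces the quadratic agglomeration cost to the number of initial classes rather than the number of observations.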

Session Time:

Room Location: Santa Ana Room

Session Chair:

**Computing Environments for Bayesian Statistics**`Author:`*Robert Gentleman*(Harvard School of Public Health)`Abstract:`- Bayesian computing should be more than the application of standard
Bayesian methods to particular problems. It is certainly important to be
able to apply this methodology to practical problems easily. It is perhaps
more important to have a platform that encourages the development of new
Bayesian methodology and helps us extend the functionality of existing
tools. There are relatively few computing environments available for
Bayesian computing. In this talk I will discuss some of the functionality
that is needed for Bayesian computing and some of the implementations that
are available. In addition, changes or enhancements to existing packages that
will encourage the use of Bayesian methods will be proposed. Some prototypes
will be explored.

**Stochastic Parameterized Grammars for Bayesian Model Composition**`Author:`*Eric Mjolsness*(JPL)*Michael Turmon*(JPL)*Wolfgang Fink*(JPL)`Abstract:`- Bayesian analysis systems typically have some input language for
describing probabilistic models upon which exact or approximate inference is
to be performed by one or more algorithmic engines. Here it is proposed that
many useful generative probabilistic models can appropriately be expressed
in the form of stochastic grammars, which recursively generate sets of words
with numerical parameters attached. The rules in these stochastic
parameterized grammars (SPG’s) each have the power of a Boltzmann
probability distribution, suggesting the use of mean field theory and other
methods for model inversion. Model composition arises at the level of
multiple rules in a grammar, and also at the level of entire grammars called
as subroutines to implement a rule in other grammars. This SPG viewpoint
raises new possibilities for interacting with subject domain experts to
create statistical models and data analysis algorithms, but raises new
challenges for the language or system implementor in the areas of
mathematical notation, algorithm composition (e.g. using clocked objective
functions), and software synthesis.

**The Bayes Net Toolbox for Matlab**`Author:`*Kevin Murphy*(UC Berkeley)`Abstract:`- The Bayes Net Toolbox (BNT) for Matlab is a software package for directed graphical models. It supports exact and approximate inference, parameter and structure learning, and static and temporal models. It is widely used in academia for teaching and research. The BNT web site receives an average of 300 hits per week. In this talk, I will describe some of the features that distinguish it from other Bayes net software packages, some of the advantages and disadvantages of using Matlab, plus plans for future work. For more details, see http://http.cs.berkeley.edu/~murphyk/Bayes/bnt.htm

Session Time:

Room Location: Costa Mesa Room

Session Chair:

**Data Squashing: Constructing Summary Data Sets**`Author:`*William DuMouchel*(AT\&T Shannon Labs)`Abstract:`- One of the chief obstacles to effective data mining is the clumsiness of
managing and analyzing data in very large files. The process of model search
and model fitting often requires many passes over a large dataset, or random
access to the elements of a large dataset. Many statistical fitting
algorithms assume that the entire dataset being analyzed fits into computer
memory, restricting the number of feasible analyses. Here we define ``large
dataset'' as one that cannot be analyzed using some particular desired
combination of hardware and software because of computer memory constraints.
There are two basic approaches to this problem: either switch to a different
hardware/software/analysis strategy or else substitute a smaller dataset for
the large one. Here we assume that the former strategy is unavailable or
undesirable and consider ways of constructing a smaller substitute dataset.
This latter approach was named data squashing by DuMouchel, Volinsky,
Johnson, Cortes and Pregibon (1999) ``Squashing flat files flatter'' [KDD'99
Proceedings]. Formally, data squashing is a form of lossy compression that
attempts to preserve statistical information. Suppose that the original or
``mother'' dataset is a matrix $Y$ having $N$ rows or entities and $n$
columns or variables. The squashed dataset is a matrix $X$ having $M$ rows
and $n+1$ columns, where $M \ll N$. The extra column in $X$ is a column of
weights, $w_i$, $i = 1, \ldots, M$, where $w_i > 0$ and $\sum_i w_i = N$.
It is assumed that $M$ is small enough so that $X$ can be processed by the
desired hardware/software, and that the software can make appropriate use of
the weight variable. The $n$-dimensional distribution of the rows of $X$
weighted by the $w_i$ is intended to approximate the distribution of the
rows of $Y$ well enough that statistical analysis of $X$ is an acceptable
substitute for the desired analysis of $Y$. A squashing procedure is
evaluated by how much more closely modeling of the squashed pseudo-data
approximates results from the full data than do the results from a random
sample of the same size. Methods for data squashing will be presented and
compared.
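The $Y \to (X, w)$ construction can be illustrated with a toy sketch. This is not the authors' squashing algorithm (which also matches higher-order moments); it simply partitions the mother data into $M$ groups and represents each group by its mean row, with weight equal to the group size, which already reproduces the mother means exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, M = 100_000, 3, 200                 # mother rows, variables, squashed rows
Y = rng.normal(size=(N, n))               # "mother" dataset

# Toy squashing: group rows by the rank of the first variable, then
# represent each group by its mean row, weighted by the group size.
order = np.argsort(Y[:, 0])
groups = np.array_split(order, M)
X = np.array([Y[g].mean(axis=0) for g in groups])    # M x n pseudo-data
w = np.array([len(g) for g in groups], dtype=float)  # weights, sum to N

# The weighted mean of the squashed data equals the mother mean.
print(np.allclose((w[:, None] * X).sum(axis=0) / N, Y.mean(axis=0)))  # True
```

A real squashing procedure chooses the groups and pseudo-points so that weighted higher-order moments, not just the means, are preserved within groups.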

**Exploratory Analysis of Retail Sales of Billions of Items**`Author:`*William F. Eddy*(Carnegie Mellon University)*Dunja Mladenic*(J. Stefan Institute, Slovenia and Carnegie Mellon Univ., USA)*Scott Ziolko*(Carnegie Mellon University)`Abstract:`- We report some preliminary analyses of a data set collected over the
past year from a grocery chain containing hundreds of stores. Each record in
the data set represents an individual item processed by an individual laser
scanner at a particular store at a particular time on a particular day. Each
record contains additional information such as store department, price, etc.
together with identifying information such as the particular checkout
scanner and, for some transactions, customer identification. The total data
set contains billions of items which can be aggregated into hundreds of
millions of transactions for millions of repeat customers. This talk will
describe a number of analyses we have undertaken. Some of these have simply
focused on ascertaining the ``quality'' of the data while others have been
more narrowly focused on simple questions like ``which pairs of items are
most frequently purchased together'' or ``what is the relationship between
basket size and number of baskets.'' The sheer size of the data set has
forced us to go beyond simple ``data mining'' methods and become involved in
``meta-mining'': the post-processing of the results of basic
analyses.

**Mining Large Datasets**`Author:`*Johannes Gehrke*(Cornell University)`Abstract:`- This talk has two parts. I will first survey recent work in scalable decision tree construction over massive training databases. In the second part, I will address algorithms for mining high-speed data streams.

Session Time:

Room Location: Viejo Room

Session Chair:

**Technology and the 2010 Census**`Author:`*Carol M. Van Horn*(U.S. Census Bureau)`Abstract:`- The success of the 2000 census was due, in part, to the application of
new technologies such as image capture, Internet and laptop computers for
some data collection activities. In the coming decade the Census Bureau is
planning a program that integrates three components: collecting long-form
data in a national survey, the American Community Survey (ACS); enhancing
the Master Address File and TIGER, a geographic database, to bring them into
compliance with GPS coordinates; and re-engineering the 2010 short-form census.
By the next census in 2010 advances in technology will provide opportunities
for further successes. Research and testing will involve handheld computers
equipped with GPS for the creation of an initial address list and for use in
nonresponse and other field follow-up activities, enabling field workers to
enter responses directly into a computer file. The short-form census holds
promise for expanding the Internet and other electronic reporting options as
modes of data collection. This paper describes the
opportunities, the benefits, and the challenges the Census Bureau faces
using technology in the 2010 census. The expanded use of technology in 2010
will greatly reduce the Census Bureau's reliance on paper questionnaires.

**The U.S. Census Bureau's MAF/TIGER System, Internal and External Interfaces**`Author:`*Robert Marx*(U.S. Census Bureau)*Linda M. Franz*(U.S. Census Bureau)`Abstract:`- The U.S. Census Bureau's overall mission is to be the preeminent
collector and provider of timely, relevant, and quality data about the
people and economy of the United States. To accomplish this mission, the
Census Bureau has been using the Master Address File and the Topologically
Integrated Geographic Encoding and Referencing (MAF/TIGER) system for more
than fifteen years to support its various census and sample survey
activities. In addition, the MAF/TIGER database has been used as the
foundation of the burgeoning geographic information system (GIS) industry in
the United States to support the analytical programs and GIS activities
managed by other federal agencies, numerous state, local, and tribal
governments, the private sector, and academic organizations. The MAF/TIGER
system is an aging national resource. The Census Bureau needs to prepare for
significantly more automation in the 21st century, including: improvement of
street and other map feature locations; addition of accurate housing unit
locations and enhanced feature change detection methodology to provide more
timely updates; and modernization of the processing environment from
``home-grown'' systems to one based on COTS and GIS software. Accomplishing the
needed improvements will increase the effectiveness of the sponsoring and
participating organizations that depend on the Census Bureau's statistical
data and its geographic infrastructure.

**(Title and abstract unavailable)**`Author:`*Latanya Sweeney*(Carnegie Mellon University)`Abstract:`

Room Location: Capistrano Room

Session Chair:

**Assessing Patient Survival Using Microarray Gene Expression Data Via Partial Least Squares Proportional Hazard Regression**`Author:`*Danh V. Nguyen*(UC Davis)*David M. Rocke*(UC Davis)`Abstract:`- High dimensional data sets from microarray experiments where the number
of variables (genes) $p$ far exceed the number of samples $N$ render most
traditional statistical tools of little direct use. However, some of these
statistical tools, when used in conjunction with an appropriate dimension
reduction method, can be effective. In this paper we introduce the use of
proportional hazards (PH) regression (Cox 1972) in conjunction with dimension
reduction by partial least squares (PLS), since the number of covariates $p$
exceeds the number of samples $N$. This setting is typical of gene
expression data from DNA microarrays. Specifically, for a given vector of
response values which are times to event (death or censored times) and $p$
gene expressions (covariates) we address the issue of how to assess
(estimate) the survival experience (curve) when $N \ll p$. The approach
taken to cope with the high dimensionality is to reduce the dimension via
some dimension reduction (component extraction) method in the first stage
and then estimate the survival distribution using a PH regression model in
the second stage. The primary method of component extraction considered is
PLS. PLS achieves dimension reduction by constructing components to maximize
the covariance between the response (survival times) and the linear
combination of the covariates (gene expressions) sequentially. This is
analogous to principal components analysis (PCA), except that the optimization
criterion is variance in PCA rather than covariance in PLS. We demonstrate
the application of the methodology to a diffuse large B-cell lymphoma (DLBCL)
complementary DNA (cDNA) data set.
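The first-stage component extraction can be sketched as ordinary univariate-response PLS with $X$-deflation. The function below is illustrative, not the authors' implementation, and it ignores censoring, which the survival setting must handle:

```python
import numpy as np

def pls_components(X, y, K):
    """Sequentially extract K PLS components t_k = X w_k: each weight
    vector w_k maximizes the covariance between y and X w (with
    ||w|| = 1), and X is deflated after each component is extracted."""
    Xk = X - X.mean(axis=0)
    yc = y - y.mean()
    scores = []
    for _ in range(K):
        w = Xk.T @ yc                     # direction of maximal covariance
        w /= np.linalg.norm(w)
        t = Xk @ w                        # component scores
        scores.append(t)
        load = Xk.T @ t / (t @ t)
        Xk = Xk - np.outer(t, load)       # deflate X
    return np.column_stack(scores)        # N x K score matrix

# Example: N = 20 samples, p = 500 genes, K = 3 components.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))
y = rng.exponential(size=20)              # stand-in for survival times
T = pls_components(X, y, 3)
```

The resulting $N \times K$ score matrix, with $K \ll N$, can then enter a standard PH regression fit in the second stage.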

**Lessons Learned From Analyzing the Differential Gene Expression Data Between Normal and Tumor Tissues in Head and Neck Cancer Patients**`Author:`*J. Jack Lee*(University of Texas M.D. Anderson Cancer Center)*Hyung Woo Kim*(University of Texas M.D. Anderson Cancer Center)*Feng Zhan*(University of Texas M.D. Anderson Cancer Center)*Adel K. El-Naggar*(University of Texas M.D. Anderson Cancer Center)`Abstract:`- Gene expression in head and neck cancer patients was assessed by using
the Research Genetics cDNA membranes GF200 and GF211. The major objective is
to identify differentially expressed genes between normal and tumor tissues.
In the presentation, we will share many lessons we have learned in
acquiring, displaying, and analyzing the microarray data. Before the data
analysis, experimental conditions need to be carefully documented. For
example, information on patient characteristics, tissue extraction method,
choice of primer, lot number and strip number of the membrane, hybridization
procedure, exposure time, and other parameters of image acquisition all
need to be recorded. We performed duplicate experiments and, in some cases,
aligned the images multiple times to estimate the alignment variability,
experiment variability, and patient variability. The effect due to multiple
stripping of the membrane was also inspected. Standardization by background
correction and by the nonparametric-regression-based LOESS method was
examined. The differentially expressed genes were identified by computing the
raw fold changes, the t-statistics, and change with respect to the
interquartile range. Exploratory graphical methods using the hexbin plot and
brushing methods were applied. Results of data analysis using various tools
were compared. The conclusion of the data analysis and its biological
interpretation will be reported.
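The fold-change and t-statistic screening step can be sketched generically. The function below is an illustration for log-scale genes-by-samples matrices, using the Welch form of the two-sample t statistic; it is not the authors' code:

```python
import numpy as np

def differential_stats(normal, tumor):
    """Per-gene log fold change and Welch two-sample t statistics for
    genes-by-samples matrices of log-scale expression values."""
    m1, m2 = normal.mean(axis=1), tumor.mean(axis=1)
    v1, v2 = normal.var(axis=1, ddof=1), tumor.var(axis=1, ddof=1)
    n1, n2 = normal.shape[1], tumor.shape[1]
    fold = m2 - m1                          # log fold change (tumor - normal)
    t = fold / np.sqrt(v1 / n1 + v2 / n2)   # Welch t statistic per gene
    return fold, t

# Toy example: 1000 genes, 5 samples per group; gene 0 is shifted up.
rng = np.random.default_rng(2)
normal = rng.normal(0.0, 1.0, size=(1000, 5))
tumor = rng.normal(0.0, 1.0, size=(1000, 5))
tumor[0] += 5.0
fold, t = differential_stats(normal, tumor)
```

Genes flagged by such raw screens are exactly the ones the abstract suggests cross-checking against interquartile-range criteria and graphical displays such as hexbin plots.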

**Taming Genetic Microarray Data: A Paradigm Using a Well-Known Case Study**`Author:`*Howard T Thaler*(Memorial Sloan-Kettering Cancer Center)`Abstract:`- Microarray technology, such as the Gene-Chip expression analysis probe
array (Affymetrix), generates expression levels for thousands of genes from
a single specimen. The array data are used to characterize genetic
differences between individuals or between types of tissue. However,
statistical properties of the data produced present challenges in the
interpretation of statistical analyses and identification of true outliers
associated with causal genes. The data deviate “wildly” from classical
statistical assumptions: normality, homoscedasticity, and additivity. In a
well-known, publicly available leukemia patient data set published by
Golub et al., means and standard deviations varied over several orders of
magnitude and the cost-constrained, modest sample size precludes using
asymptotic approximations. A simple data transformation “tamed” the data to
come reasonably close to satisfying those assumptions by several ad hoc
criteria. While the number of genes measured exceeded the number of sample
specimens 100-fold, a simple dimensionality-reduction strategy ameliorated
the multiplicity problem and facilitated evaluation of group differences and
covariate effects -- yielding more focused results for the transformed data
than for the raw data. Conclusion: The proposed paradigm for analyzing
microarray gene expression data yielded more precise, concise and reliable
results.

**Statistical Modelling of Microarray Data**`Author:`*Ziad Taib*(Biostatistics, AstraZeneca R\&D Mölndal)`Abstract:`- We summarize some statistical issues encountered when attempting to analyse gene expression data. We try to argue that basing the analysis on a statistical model can be far more rewarding than using ad hoc methods and cut-off criteria.

Session Time:

Room Location: Laguna Room

Session Chair:

**Unraveling and Defining Biocomplexity**`Author:`*William K Michener*(University of New Mexico)*James L Rosenberger*(Penn State University)`Abstract:`- In this presentation we discuss Biocomplexity, a term describing a
new research focus that evolved during the past three years at the NSF and
that fosters interdisciplinary research to understand and model the complex
interrelationships underlying biological systems. Examples of biocomplexity
research are given, research paradigms are described, and essential
components for success are presented. The importance of the mathematical and
statistical sciences is evident in integrating the components of
reductionist research into a quantitative model that provides a predictive
outcome with appropriate measures of uncertainty. The Biocomplexity and the
Biocomplexity in the Environment funding programs will be described and
opportunities for statisticians, mathematicians and computer scientists
discussed.

**Theoretical and Computational Challenges in Entropy Evaluation of Macromolecules**`Author:`*Harshinder Singh*(West Virginia University)*E. James Harner*(West Virginia University)*Eugene Demchuk*(NIOSH/HELD)*Vladimir Hnizdo*(NIOSH/HELD)`Abstract:`- Evaluation of entropy is important in biological processes in order to predict the stability of a molecular conformation. The entropy evaluation requires probabilistic modeling of conformations in the internal coordinates. Since fluctuations in the rotational angle (torsional) coordinates make a pivotal contribution to the overall configurational entropy of the molecule, we review circular probability modeling approaches for modeling the torsional angles. Since macromolecules such as proteins have a very large number of interdependent torsional angles and the distributions of many of them could be multimodal and even skewed, we discuss theoretically and computationally challenging problems that arise in the simultaneous modeling of these angles based on data from molecular dynamics simulations.

Room Location: Santa Ana Room

Session Chair:

**Ciphertext Size Requirement of Ciphertext-Only Attack on Vigenere Cipher**`Author:`*Qiong Yang*(Boston University)*Song Guo*(College of Computer Science, Northeastern University)`Abstract:`- The index of coincidence (IC) test has been used in cryptanalysis
of the Vigenere cipher. However, this test has been applied based on intuition
rather than probability theory. In this paper, we studied the statistical
properties of the IC test and proved that IC is an unbiased estimator of
$\sum p_i^2$. Furthermore, by using the Cauchy inequality and one-sample
U-statistic theory, we proved that IC is asymptotically Gaussian. Based on
these results, we developed a probability framework for the IC testing method
and applied it to determine the sample size requirement.
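For reference, the IC statistic itself and its standard use for guessing a Vigenère key length can be sketched as follows (the textbook construction, not the authors' probability framework):

```python
from collections import Counter

def index_of_coincidence(text: str) -> float:
    """IC = sum_i f_i (f_i - 1) / (n (n - 1)), an unbiased estimator
    of sum_i p_i^2 for the letter probabilities p_i."""
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    counts = Counter(letters)
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))

def avg_coset_ic(ciphertext: str, k: int) -> float:
    """Average IC over the k cosets of every k-th letter; when k equals
    the Vigenere key length each coset is monoalphabetic, so its IC
    rises from ~1/26 (random) toward the plaintext value (~0.066 for
    English)."""
    letters = [c for c in ciphertext.upper() if c.isalpha()]
    cosets = [''.join(letters[i::k]) for i in range(k)]
    return sum(index_of_coincidence(c) for c in cosets) / k
```

The ciphertext-size question studied in the abstract is precisely how long these cosets must be for the IC estimates to separate reliably.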

**Interval Computation of Gamma Probabilities and Their Inverses**`Author:`*Trong Wu*(Southern Illinois University Edwardsville)`Abstract:`- A new method for computing the gamma cumulative distribution functions and their inverses is presented in this paper. The method uses two continued fractions, one for the incomplete gamma function and the other for its complement. An improved interval method, implemented as C++ language classes, is used for the computation, making it self-validating. We developed programming techniques to speed up the increments in the iterative loops for finding the inverse of the gamma cumulative distribution function for a given probability. In fact, the inverses can be considered random gamma variates if a uniform random number generator is used to generate the probabilities over the interval [0, 1). The entire computation involves only two simple algebraic functions; there is no use of transcendental functions, auxiliary functions, power series, or Newton's method. Therefore, one can expect the method to be easy to implement.
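To give a flavor of the continued-fraction approach, here is a plain floating-point sketch of the upper-tail fraction only, evaluated with the modified Lentz algorithm. The paper's second fraction, interval arithmetic, and C++ classes are omitted, and unlike the paper this sketch does use transcendental functions (`math.lgamma` supplies the normalizing constant):

```python
import math

def gamma_q(a: float, x: float, eps: float = 1e-14, itmax: int = 200) -> float:
    """Regularized upper incomplete gamma Q(a, x) = Γ(a, x) / Γ(a), via
    the continued fraction
        Γ(a, x) = e^{-x} x^a / (x + 1 - a - 1(1 - a)/(x + 3 - a - ...)),
    evaluated with the modified Lentz algorithm.  Accurate for x > a + 1."""
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, itmax + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return math.exp(-x + a * math.log(x) - math.lgamma(a)) * h

def gamma_p(a: float, x: float) -> float:
    """Regularized lower incomplete gamma P(a, x) = 1 - Q(a, x)."""
    return 1.0 - gamma_q(a, x)
```

For example, `gamma_p(k/2, x/2)` gives the chi-squared CDF with `k` degrees of freedom in the fraction's region of accuracy.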

**Smooth Quadratures of Volterra Integral Equations with Applications to Estimation of HIV Infection Rates and Projection of AIDS Incidence**`Author:`*John J. Hsieh*(University of Toronto)`Abstract:`- Many scientific research problems arise as solutions to Volterra integral equations of the first kind in which the given function $\mu(t)$ is expressed as the integral over $(0,t]$ of the product of the known kernel $K(t,u)$ and the unknown function $\lambda(u)$ with respect to $u$, where $\mu(t)$ and $\lambda(t)$ are positive for $0 \leq t \leq \tau$, for some positive $\tau$, $K(t,u)$ is positive for $t \geq u$ and is 0 for $t<u$ and is square integrable on $\{(t,u) \mid 0 \leq t \leq \tau, 0 \leq u \leq \tau\}$. In biomedical research problems, data for $\mu(t)$ and $K(t,u)$ often come as step functions of time and so the method of quadratures may be employed to solve the equation to estimate $\lambda(t)$ as a step function of time. However, this method tends to yield negative and/or erratic values as solutions of $\lambda(t)$. To obtain non-negative and smooth estimates of $\lambda(t)$, we shall treat $\lambda(t)$ as the mean of a linear (non-homogeneous) Poisson process and $\mu(t)$ as the mean of the translated planar (non-homogeneous) Poisson process with $K(t,u)$ as the probability distribution of the translation from the linear Poisson at $u$ to the planar Poisson at $t$. We then construct the likelihood function for $\lambda(t)$ in terms of the planar Poisson point process at $t$ and the probability distribution $K(t,u)$. An EM (expectation-maximization) formula is derived for iterative estimation of $\lambda(t)$. Smooth nonparametric maximum likelihood estimates of $\lambda(t)$ are then obtained by applying the EM algorithm coupled with a smoothing step at each iteration.
Accuracy of the estimates of $\lambda(t)$ can then be checked by comparing the estimates of $\mu(t)$, calculated from the Volterra equation using the estimated solution of $\lambda(t)$ and the known distribution $K(t,u)$, with the observed planar Poisson process via statistical goodness-of-fit tests. Furthermore, by extrapolating the estimates of $\lambda(t)$ into the future, the Volterra integral equation can be used to project the future course of the $\mu(t)$ function. In this paper we have employed various parametric and nonparametric distributions for $K(t,u)$ and applied the above method to estimate the HIV infection rates and to make short-term projections of AIDS incidence using HIV/AIDS diagnosis data. The parametric distributions (such as Weibull and gamma distributions and several new functions) and non-parametric distributions (such as linear and cubic spline functions) for $K(t,u)$ are so chosen as to take into account the fact that the HIV screening test was available only since 1985 and the treatment was made available to HIV-positive patients only after 1987. Our method has produced results that fit the observed AIDS incidence better than those produced by other existing methods, based on data from the U.S., Canada, and Australia.
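The basic (unsmoothed) method of quadratures that this approach improves on can be sketched as a lower-triangular solve; the grid, kernel, and test functions below are illustrative:

```python
import numpy as np

def solve_volterra(K, mu, h):
    """Method of quadratures for mu(t_i) = ∫_0^{t_i} K(t_i, u) λ(u) du:
    with a step-h grid and a right-endpoint rectangle rule the equation
    becomes a lower-triangular linear system, solved by forward
    substitution.  Requires K(t, t) > 0."""
    lam = np.zeros(len(mu))
    for i in range(len(mu)):
        s = h * (K[i, :i] @ lam[:i])
        lam[i] = (mu[i] - s) / (h * K[i, i])
    return lam

# Check on an exactly solvable case: K(t,u) = exp(-(t-u)), λ(u) = u,
# so that mu(t) = t - 1 + exp(-t).
h = 0.01
t = np.arange(1, 101) * h
K = np.exp(-(t[:, None] - t[None, :]))
mu = t - 1 + np.exp(-t)
lam = solve_volterra(K, mu, h)
```

With exact step-function data this recovers $\lambda$ well; the abstract's point is that with noisy incidence data the same direct inversion turns negative and erratic, motivating the EM-plus-smoothing estimator.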

**Designing Experiments for Causal Networks**`Author:`*William D Heavlin*(Advanced Micro Devices)`Abstract:`- Causal networks are generalizations of Ishikawa diagrams. Emphasizing tolerance design applications, this work presents an optimal design algorithm when the variables are organized as a causal network. The causal network is transformed into a causal map, which represents all factors and responses as points in a common D-dimensional metric space. The design approach is algorithmic, optimizing Wynn’s entropy criterion. This criterion maximizes dispersion among multiple responses, using a distance-in-space coefficients (DiSCo) model. A key constraint is block self-containment--the blocks are analyzable without reference to one another--complemented by all-block analyses. Criteria are response dispersion, efficiency, and column rank. Skewing blocks with off-target factors is explored.

**Multi-Layer Structured Correlation Designs for Heterogeneous and Unbalanced Clustered Data**`Author:`*Edward C. Chao*(Insightful Corporation)`Abstract:`- Data with high dimensional hierarchical structures often occur in
longitudinal studies, geographical studies, or family studies. A usual
approach is the multi-level random-effects model. The situation of interest
here is the case when the number of clusters is small but the number of
hierarchical levels in each cluster is large. Particular interest focuses on
heterogeneous and unbalanced clustering. Multi-level models might have
difficulty in fitting data with unbalanced clusters. We propose designing
the correlation with multi-layer structures, in which each layer represents a
unique parameterization for correlation, with a type chosen from generic
structures such as AR, exchangeable, or stationary.

In nested designs, a factor is nested in another and each factor has multiple levels. In these cases, a layer of correlation corresponds to modeling the correlation structure within a factor in the hierarchical structures. A layer may consist of multiple blocks, and blocks of the same layer have the same parameterization. These blocks represent the levels of the factor associated with the layer. This approach can be easily extended to unbalanced hierarchical structures and heterogeneous clusters. Algorithms based on the approach are embedded in GEE methods. A case study from prostate cancer and simulation studies show this approach is more efficient than the existing GEE methods and multivariate methods.

**On Perfect Stability in Characteristic Function**`Author:`*Jinhyo Kim*(Cheju National University, South Korea)*Bongsu Ko*(Cheju National University, South Korea)`Abstract:`- Algorithmic stability is used to support the notion that the characteristic function (CF) is superior to the moment generating function (MGF) in terms of numerically stable behavior. It is shown that no computationally better tool than the CF can exist in terms of numerical stability. The uniqueness of the Vandermonde matrix with the perfect condition number is characterized for the numerical behavior of the CF.

Room Location: Viejo Room

Session Chair:

**An Environment for Creating Interactive Statistical Documents**`Author:`*Samuel E. Buttrey*(Naval Postgraduate School)*Deborah Nolan*(University of California, Berkeley)*Duncan Temple Lang*(Bell Laboratories)`Abstract:`- The spectacular growth and acceptance of the Web has made it a very
attractive medium for interactive documents. Web-based reporting in
industry, ``live'' documents in research, and interactive worksheets in
education material are in many ways ideal uses of the Web. These types of
documents frequently display dynamic, statistical output both in the form of
text and plots. Unfortunately, much of the effort in creating these types of
documents has focused on re-inventing existing statistical software, often
with inferior results. The reason is that systems such as S and SAS
cannot be integrated into the reader's browser.

A better approach is to allow the author to create a document using common authoring tools (e.g., {\LaTeX}, MS Word, or HTML editors) and to conveniently insert dynamic and interactive components from other languages. The author focuses on the presentation and display of these components, including the usual multi-media elements such as text, images and sounds. She uses HTML form elements and Java components to provide interactive controls with which the reader can manipulate the contents of the document. Finally, she performs statistical computations and renders visual displays using the statistical software that is embedded within the reader's browser.

In this presentation, we describe how we have created an environment for interactive statistical documents. It allows the author to use HTML, JavaScript and R to create the content and the interactivity. The reader accesses the interactive and dynamic functionality of the document via a plug-in for Netscape that embeds R within it. The different languages are all reasonably standard tools and each is used for the purposes for which it was designed. This makes it a reasonably straightforward environment in which to quickly and simply create interfaces for various different applications and audiences.

**A Course on Web-Based Statistics**`Author:`*Juergen Symanzik*(Utah State University)*Natascha Vukasinovic*(Utah State University)`Abstract:`- Many statistics courses have been taught that make use of Web-based statistical tools such as teachware tools, electronic textbooks, and statistical software on the Web. However, to the best of our knowledge, there has been no previous course in which statistical issues and the Web have been systematically discussed. In this talk, we provide an overview of our Web-Based Statistics course, including detailed discussions of lecture topics, homework assignments, and student projects. We discuss references (papers and URLs) useful for such a class and summarize student surveys conducted during the course. We finish our talk with recommendations for future similar courses.

**ASSIST: A Package for Spline Smoothing In S-Plus Template**`Author:`*Yuedong Wang*(Univ of California)*Chunlei Ke*(St. Jude Medical)`Abstract:`- We present a suite of user-friendly S-Plus functions for fitting, among others, (a) smoothing spline models for independent and correlated Gaussian data, and for independent binomial, Poisson and Gamma data; (b) semi-parametric regression models; (c) non-parametric mixed effects models; and (d) semi-parametric nonlinear mixed effects models. The general form of smoothing splines based on reproducing kernel Hilbert spaces is used to model non-parametric functions. Thus these S-Plus functions deal with many different situations in a unified fashion. Some well-known special cases are polynomial splines including the popular cubic splines, periodic splines, spherical splines, thin-plate splines, L-splines, generalized additive models, smoothing spline ANOVA models, and self-modeling nonlinear regression models. These fixed/mixed non-parametric/semi-parametric models are widely used in practice to analyze data arising in many areas of investigation such as medicine, epidemiology, pharmacokinetics and social science. One goal of this software development is to collect existing programs and make them user-friendly so that more researchers can use them with ease. We have also written some new programs to fill in the gaps.

**JAVA Implementation of Multiple Linear Regression Models for Patient-Specific Longitudinal Data to Monitor Chemotherapy-Induced Anemia**`Author:`*Christine E. McLaren*(University of California, Irvine)*Wagner Truppel*(University of California, Irvine)*Randall F. Holcombe*(Chao Family Comprehensive Cancer Center)*Edward L. Kambour*(Prostrategic Solutions)`Abstract:`- Physicians typically compare a laboratory result for an individual patient with previous values and population-based reference ranges to determine the significance of any change. To provide a statistical basis for this process, we previously developed an approach to sequentially analyze laboratory test results and identify departures from past values. The statistical methods include hierarchical multiple regression modeling with a weighted minimum risk criterion for model selection to choose models indicating changes in mean values over time. The “optimal” model was chosen as the one with the smallest, statistically significant, weighted estimated risk as compared to the null (no-change) model. For routine use in clinical settings, we now describe the improved design and implementation of these numerical algorithms for sequential change detection of the mean. Algorithms were enhanced with analytical versions of matrices and were implemented using the JAVA programming language. Input to the computer program included a vector of sequential readings and the desired statistical significance level. Positive weight factors used in model evaluation were read from a stored table. Efficient techniques were developed for computation of the Gasser, Sroka, and Jennen-Steinmetz (GSJS) variance estimate and associated unbiased risk (i.e. expected loss function), subset regression of indicator variables on the input vector, regression residuals, and estimated auto-correlation of model residuals. A GUI was constructed to support the numerical and graphical input/output structure.
Simulations were used to assess the speed and portability of the JAVA implementation, termed Change Detector. We analyzed data from patients treated with cisplatin-based chemotherapy regimens ($n=60$) to determine significant changes in hematocrit values. Compared to the original S-plus implementation, we found the new JAVA program to be as accurate, easier to use, and faster. We conclude that Change Detector provides an improved statistical program for automated review of laboratory data in the clinical setting.

**The Development of Community Nutrition Map (CNMap)**`Author:`*Alvin B. Nowverl*(USDA-ARS-BHNRC-CNRG)`Abstract:`- Community Nutrition Map (CNMap) is a web application to display nutritional and demographic information for geographic areas within the United States using a compilation of data from a variety of sources. Data sources include the United States Department of Agriculture's 1994-96 and 1998 Continuing Surveys of Food Intakes by Individuals and the Department's Food and Nutrition Service web site. Data were also obtained from the U.S. Bureau of the Census' 1995-1998 Current Population Surveys and adjusted total population data files. The initial phase of CNMap will provide reports at the state level for a number of nutritional indicators such as: (1) The percentage of individuals meeting recommended daily allowances of a select group of nutrients; (2) The percentage of individuals meeting minimum requirements for Pyramid Servings food groups; (3) The percentage of American households receiving food stamps; (4) The percentage of individuals using supplements. In the development of CNMap, several statistical and data processing issues were addressed. These include maintaining confidentiality when reporting survey data geographically, obtaining proper estimates when combining different sources of information, proper use of sampling weights, and using static or dynamic web page design.

Room Location: Capistrano Room

Session Chair:

**Cost Growth Models for NASA's Programs**`Author:`*Tze-San Lee*(Western Illinois University)*L. Dale Thomas*(National Aeronautics and Space Administration)`Abstract:`- Under two cost growth indices, annual absolute and relative cost growth, probability-based models were constructed for basis functions of NASA's technology readiness levels through the use of Johnson's four-parameter system of bounded, unbounded, or lognormal distributions. In addition, statistical prediction models were built for the programs. The result of this research shows that the program's initial cost estimate is the only significant predictor for the program's annual absolute cost growth, while the weighted average of technology readiness level from the program's components is the only significant predictor for the program's annual relative cost growth.

**Series Approximations in Risk Analysis**`Author:`*Reza Modarres*(The George Washington University)*Costas Christophi*(The George Washington University)`Abstract:`- Several asymptotic approximation methods for computing the distribution of a multiplicative risk model are discussed. We consider the asymptotic expansion of $R=\prod_{i=1}^p x_i^{a_i}$ where the $x_i$ are positive random variables, independent but not identically distributed. The Generalized Central Limit Theorem is used to provide an approximation for the distribution of R and to study conditions under which this approximation is valid. The Edgeworth expansion of the distribution of R is discussed for the independent and not identically distributed case. We also discuss a saddlepoint approximation to the distribution and quantile functions of the model. The accuracy of the above approximations is illustrated in several examples, and the results are compared to exact results, when available, or to Monte Carlo results.
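The log-scale reduction behind the first of these approximations can be sketched in a few lines (a toy illustration with assumed gamma inputs and exponents, not the paper's examples): since $\log R = \sum_i a_i \log x_i$ is a sum of independent terms, a central-limit argument suggests approximating R by a lognormal distribution.

```python
import math
import random

random.seed(0)

# Hypothetical example: R = prod_i x_i^{a_i} with independent gamma inputs.
# On the log scale, log R = sum_i a_i * log(x_i), so a central-limit
# argument suggests a lognormal approximation for R whose parameters are
# the mean and variance of log R (estimated here by simulation).
a = [0.5, 1.0, 2.0]           # assumed exponents, for illustration only
shapes = [2.0, 3.0, 4.0]      # assumed gamma shape parameters

def draw_log_R():
    return sum(ai * math.log(random.gammavariate(k, 1.0))
               for ai, k in zip(a, shapes))

logs = [draw_log_R() for _ in range(50000)]
mu = sum(logs) / len(logs)
var = sum((v - mu) ** 2 for v in logs) / (len(logs) - 1)

# Lognormal approximation to P(R <= r), via the normal CDF of log r.
def approx_cdf(r):
    z = (math.log(r) - mu) / math.sqrt(var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

r0 = math.exp(mu)             # approximate median of R under the approximation
mc = sum(v <= math.log(r0) for v in logs) / len(logs)   # Monte Carlo check
print(approx_cdf(r0), mc)
```

The gap between the two printed values reflects the skewness that Edgeworth or saddlepoint corrections are designed to capture.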

**An Adequate Statistic for the Exponentially Distributed Censoring Data**`Author:`*P. S. Nair*(Creighton University)*S-C Cheng*(Creighton University)`Abstract:`- In the problem of estimating the parameter of the underlying probability distribution, a sufficient statistic should be one that summarizes and exhausts in itself all the relevant information on the parameter that is contained in the sample. A similar basic problem of statistics is that of predicting a future (that is, not yet observed) random variable on the basis of some existing observable random variables when the parameter of the underlying probability distribution does not concern us directly. Likewise, we look for a statistic that is exhaustive of all the relevant information on the future random variable that is available in the current observable random variables. The notion of adequate statistics, initiated by Fisher and Skibinsky, deals with this concern. Subsequently, it has been extensively investigated in the literature. In this article, we make a comprehensive study of finding an adequate statistic for the total time on test when the data are assumed to be exponentially distributed and censored after the r-th failure.

**Comparing Two Measurement Devices: Review and Extensions to Estimate New Device Variability**`Author:`*Brian J Eastwood*(Eli Lilly and Company)`Abstract:`- There is much literature available on methods for comparing two measurement systems that are supposed to be equivalent. These methods are briefly reviewed in the context of comparing two in-vitro assays for measuring clotting times where a new method is intended to replace an old method. The basic study is a comparison of results run in both assays. With this study it is possible to determine the relative performance of both assays with respect to bias and variability (standard deviation) using techniques described in Bland and Altman (1981) and Lin (1989) to determine if they are ``in agreement''. But it is not possible for such a study to describe the bias or variability of either assay. Usually there is much historical information available for the ``old'' method. By incorporating that information it is then straightforward to obtain point estimates of the bias and variability of the new assay. Using distributional assumptions or bootstrap estimates one can then obtain confidence interval estimates and conduct hypothesis tests about the absolute and relative performance of the two assays. The performance of various bootstrap and exact-distribution estimates is compared, first in the context of ``demonstrating agreement'', and then in the context of examining whether the new assay has actually improved performance over the old assay.

**Computationally Intensive Techniques for a Fully Bayesian, Decision Theoretic Approach to Financial Forecasting and Portfolio Selection**`Author:`*Andrew Simpson*(University of Newcastle)*Darren J. Wilkinson*(University of Newcastle)`Abstract:`- This paper considers the problem of Bayesian modelling and forecasting for multivariate financial time series. For example, prices of related stocks exhibit dependencies between series, as well as the usual dependencies over time. The multivariate dynamic linear state space models of West and Harrison (1997) are often appropriate for explaining log-price behaviour. The problem of Bayesian inference for the underlying states and covariance matrices has been examined by a variety of algorithms in the literature. In order for some of these algorithms to work efficiently, a variety of Kalman filtering, smoothing and simulation-smoothing techniques are required, as na{\"\i}ve implementations suffer from problems associated with slow mixing and convergence.

Room Location: Laguna Room

Session Chair:

**A Statistical View of the Support Vector Machine**`Author:`*Yi Lin*(University of Wisconsin, Madison)`Abstract:`- We establish the relationship between the support vector machine and the Bayes rule. The Bayes rule is the optimal classification rule when the underlying distribution of the data is known. Since this distribution is unknown in practice, the Bayes rule is not directly available, but it can be used as an ideal benchmark for any classification procedure. We show that the support vector machine approaches the Bayes rule in an asymptotic sense. The results are established under a very mild condition, allowing an arbitrary number of discontinuities in the underlying conditional probability function. This is in contrast with most other asymptotic results in the statistical literature, where the underlying conditional probability functions are assumed to be smooth to a given order. The results clarify the mechanism underlying the support vector machine, and highlight the advantages and limitations of the support vector machine methodology.
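The population-level identity behind this connection can be stated compactly (standard notation, not taken from the abstract; a sketch, not the paper's full argument):

```latex
% Labels Y in {-1,+1}, with conditional probability p(x) = P(Y = 1 | X = x).
% The Bayes rule is
\[
f_B(x) \;=\; \operatorname{sign}\!\bigl(p(x) - \tfrac{1}{2}\bigr).
\]
% The SVM minimizes a penalized hinge loss; at the population level,
\[
\arg\min_{f}\; E\bigl[\,(1 - Y f(X))_{+}\bigr]
\;=\; \operatorname{sign}\!\bigl(p(x) - \tfrac{1}{2}\bigr)
\;=\; f_B(x),
\]
% so the hinge loss targets the Bayes rule directly, without requiring a
% smooth estimate of p(x) itself.
```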

**Lazy Class Probability Estimators**`Author:`*Dragos D. Margineantu*(Oregon State University)*Thomas G. Dietterich*(Oregon State University)`Abstract:`- For many practical applications one would rather have learning
algorithms compute accurate values of probabilities for each possible class,
instead of a single class label. Unfortunately, most of the existing
classification algorithms give poor class probability estimates because they
were specifically designed to maximize classification accuracy and, as a
result, the learned models output probability values that are more extreme
than warranted (i.e., close to 0 or 1).

This paper introduces a new lazy learning algorithm --- the Lazy Option Trees --- based on which we derive a method for computing good class probability estimates.

Our algorithm builds on the basic ideas of the lazy decision tree classification algorithm introduced by Friedman et al. (1996). In order to compute good probability estimates, multiple tests are performed in each node on the query instance. The algorithm also allows tests on continuous attributes, and performs local smoothing in the leaf nodes. The class probability estimates are improved using Breiman's Bagging.

One of the most important uses of accurate class probability estimates in machine learning and data mining is prediction in the presence of arbitrarily large costs associated with the different kinds of errors. We tested our method for different cost models and application domains from the UCI ML repository and, for the majority of the tasks, its probability estimates improve over the probability estimates of both decision trees and bagged Probability Estimation Trees (unpruned, uncollapsed, and smoothed decision trees; B-PETs), one of the best existing class probability estimators. For evaluating the quality of the predictions in a cost-sensitive context, we employed the paired BDeltaCost procedure introduced by Margineantu \& Dietterich (2000).

**Perfect Random Tree Classifiers**`Author:`*Adele Cutler*(Utah State University)*Guohua Zhao*(AT\&T)`Abstract:`- Ensemble classifiers are some of the most accurate general-purpose classifiers available. We introduce a new ensemble classifier, PERT, which is an ensemble of perfectly-fit random trees. Compared to other ensemble methods, PERT is very fast to fit. Considering the random nature of the trees, PERT is surprisingly accurate. Calculations suggest that one reason why PERT performs so well is that although the trees are extremely weak, they are also almost uncorrelated.

**Multicategory Support Vector Machines**`Author:`*Yoonkyung Lee*(University of Wisconsin-Madison)*Yi Lin*(University of Wisconsin-Madison)*Grace Wahba*(University of Wisconsin-Madison)`Abstract:`- The Support Vector Machine (SVM) has recently shown great performance in practice as a classification methodology. Even though the SVM implements the optimal classification rule asymptotically in the binary case, the one-versus-rest approach to solving the multicategory case using SVMs is not optimal. We have proposed Multicategory SVMs, which extend the binary SVM to the multicategory case and encompass the binary SVM as a special case. The Multicategory SVM implements the optimal classification rule as the sample size gets large, overcoming the suboptimality of conventional one-versus-rest approaches. The proposed method deals with the equal misclassification cost and unequal cost cases in a unified way.

**Using Pseudo-Predictors to Improve the Performance of a Classification Rule**`Author:`*Majid Mojirsheibani*(Carleton University)`Abstract:`- We consider an iterative procedure to improve the misclassification error rate of an initial classification rule. The proposed procedure involves two steps: (i) an iterative method for generating a sequence of classifiers from the initial one, and (ii) a combining procedure that ``pools together'' the sequence of constructed classifiers in order to produce a new classifier which is far more effective (in an asymptotic sense) than the initial one. The sequence of classifiers in step (i) are generated based on repeated augmentation of the feature vector with some carefully constructed pseudo-predictors. Both the mechanics and the asymptotic validity of the proposed procedure are discussed. We will also discuss methods for selecting the number of iterations.

Room Location: Santa Ana Room

Session Chair:

**Self-Modeling Regression with Random Effects**`Author:`*Naomi S. Altman*(Cornell University)*Julio Villarreal*(EdVISION Corporation)`Abstract:`- In many longitudinal studies, the response can be modeled as a
(discretely sampled) curve over time for each subject. Often these curves
have a common shape function and individual subjects differ from the common
shape by a transformation of the time and response scales. Lindstrom (1995)
represented the common shape by a free-knot regression spline, and used a
parametric random effects model to represent the differences between curves.
We extend Lindstrom's work by representing the common shape by a penalized
regression spline, and use a parametric random effects model to represent
the differences between curves. The use of penalized regression splines
allows for a generalization in the modeling, estimation, and testing of
parameters and is easily implemented. An iterative two-step algorithm is
proposed for fitting the model.

Conditional on the fitted common shape model, it is possible to fit and test nonlinear mixed effects using standard methods. While the sieve parametric form of the model suggests that a conditional likelihood ratio test should be available for testing whether the shape varies with a time invariant covariate, the null distribution of the likelihood ratio test may not be chi-squared.

**Support Vector Machine Regression in Chemometrics**`Author:`*Ayhan Demiriz*(Verizon Inc.)*Kristin P. Bennett*(Rensselaer Polytechnic Institute)*Curt M. Breneman*(Rensselaer Polytechnic Institute)*Mark J. Embrechts*(Rensselaer Polytechnic Institute )`Abstract:`- Predicting the biological activity of a compound from its chemical structure is a fundamental problem in drug design. The ability exists to generate vast numbers of potential pharmaceutical compounds. Statistical and machine learning methods can provide an efficient means of estimating the bioresponses of these compounds in order to expedite drug design. In this paper we develop a Support Vector Machine Regression (SVMr) methodology for estimating the bioresponse of molecules based on large sets of descriptors. Since such data are characterized by large numbers of descriptors and very few data points, we adapt SVMr model selection and bagging strategies in order to avoid overfitting. The proposed approach compares very favorably with Partial Least Squares (PLS), a well-known and commonly used method in chemometrics, on the performance of Quantitative Structure-Activity Relationships (QSAR) analysis based on real chemistry data.

**Data-Driven and Optimal Denoising of a Signal and Recovery of Its Derivative Using Multiwavelets**`Author:`*Nathaniel Tymes, Jr*(University of New Mexico)*Sam Efromovich*(University of New Mexico)*M. Cristina Pereyra*(University of New Mexico)*Joseph D. Lakey*(New Mexico State University)`Abstract:`- Multiwavelets are relative newcomers to the world of wavelets. Thus it is not surprising that the denoising methods in use are modified universal thresholding procedures developed for uniwavelets. On the other hand, a specific feature of the discrete multiwavelet transform is that the typical errors are not independent, identically distributed normal errors. We therefore suggest an alternative denoising procedure based on the Efromovich-Pinsker algorithm.

**RIP-GAMs with an Application in Human Brain Research**`Author:`*Michael G. Schimek*(Karl-Franzens-University of Graz)`Abstract:`- Backfitting is still the most popular numerical technique for generalized additive models (GAM). Its implementation in S-Plus is stable and sufficient for a large number of fitting problems. Here we take interest in GAM fitting of rather complicated data showing patterns of correlation. As a result we have to account for rank deficiency of the system matrix due to spatial or temporal correlation of some variables. We illustrate this on an example from human brain research. To cope with such situations we introduce the idea of relaxed iterative projection generalized additive models (RIP-GAM). What backfitting GAM and RIP-GAM have in common is the use of the same S-Plus functions provided for generalized additive modelling, such as s(), the spline smoother, and other features. Main results from our example: while RIP does not seem to run into numerical troubles, backfitting has slow or no convergence in some instances. In standard situations, however, both procedures produce the same estimation results.

**An Adaptive-Learned Temporal Radial Basis Function Network for Recursive Function Estimation**`Author:`*Yiu Ming Cheung*(The Chinese University of Hong Kong)*Lei Xu*(The Chinese University of Hong Kong)`Abstract:`- We present a temporal Radial Basis Function (RBF) network for recursive function estimation. This network is a dynamic hybrid system which consists of two sub-RBF networks. One sub-network models the relationship between the current network output and the past ones, and the other sub-network describes the relationship between the current network output and the inputs. In each sub-network, the kernel parameters of the hidden layer and those in the output layer are all adaptively determined globally by using an expectation-maximization (EM) algorithm (Xu 1998). The performance of our proposed network is also demonstrated with a comparison to a classic one.

Room Location: Viejo Room

Session Chair:

**A Statistical Approach to the Segmentation of MR Imagery and Volume Estimation of Stroke Lesions**`Author:`*Benjamin Stein*(University of Massachusetts)*Joseph Horowitz*(University of Massachusetts)`Abstract:`- We propose a 3D method to segment magnetic resonance imagery (MRI) of ischemic stroke patients into lesion and background, and hence to estimate lesion volumes. It is a hierarchical, regularized method based on classical statistics that produces a rigorous confidence interval for lesion volume. This approach requires a limited amount of user interaction to initialize. The procedure has been tested on real MR data, with volume estimates within 6\% of those derived from doctors' hand segmentations. According to the physicians with whom we are working, these results are clinically useful to evaluate stroke therapies.

**Visualizing Spatial Autocorrelation with Dynamically Linked Windows**`Author:`*Luc Anselin*(University of Illinois, Urbana-Champaign)*Ibnu Syabri*(University of Illinois, Urbana-Champaign)*Oleg Smirnov*(University of Texas at Dallas)*Yanqui Ren*(University of Illinois, Urbana-Champaign)`Abstract:`- Several recent efforts have focused on adding exploratory data analysis functionality to geographic information systems by means of various coupling mechanisms between established statistical software packages and a GIS. In this paper, we outline an alternative approach where the functionality is built from scratch, using a combination of small libraries of dedicated functions, rather than relying on the full scope of existing software suites. The suggested approach is modular and completely freestanding, allowing the use of data formats from different vendors. It combines within an overall framework of fully dynamically linked windows a cartographic representation of data on a map with traditional statistical graphics, such as histograms, box plots, and scatterplots. In addition, it includes several devices to visualize spatial autocorrelation in lattice (or regional) data, such as the Moran Scatterplot and LISA maps. Apart from being freestanding, this new program (DynESDA2) implements a number of other advances, such as the capability to brush polygon coverages, simultaneous linking of multiple maps with multiple statistical graphics, and interactive LISA maps.

**Compression and Analysis of Very Large Imagery Data Sets Using Spatial Statistics**`Author:`*James A. Shine*(George Mason University and US Army Topographic Engineering Center)`Abstract:`- As remote sensing instruments evolve, the size of imagery data sets
derived from remote sensing continues to increase. Several satellites
currently offer resolution of 1 meter per pixel or better. At this
resolution, even a small geographic area leads to a very large data set; 1
square mile, for example, is represented by approximately $2.6 \times 10^6$
pixels. Many sensors are now multispectral or even hyperspectral, increasing
the size of the data set by up to $10^2$. Processing images for
classification or mapping purposes thus poses an increasing computational
challenge.
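The pixel counts quoted above are easy to verify (a back-of-the-envelope check; the mile-to-meter constant and the band count are assumptions for illustration, not figures from the paper):

```python
# Back-of-the-envelope check of the pixel counts quoted above, assuming
# 1-meter-per-pixel imagery and 1 mile = 1609.34 m.
METERS_PER_MILE = 1609.34

pixels_per_sq_mile = METERS_PER_MILE ** 2   # one pixel per square meter
print(f"{pixels_per_sq_mile:.2e}")          # on the order of 2.6 x 10^6

# A hyperspectral sensor with, say, 100 bands multiplies storage by ~10^2:
bands = 100
print(f"{pixels_per_sq_mile * bands:.2e}")
```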

This paper describes the use of spatial statistics to reduce the size of large 1-meter imagery data sets. The images were taken over locations in the United States using a CAMIS (Computerized Airborne Multispectral Imaging System) instrument flown in an airplane and registered by trained image analysts. Models of spatial variation are first computed on an entire image, then on subsampled sets of the image. Parameters of the models are used to compress the original image. Image analysis operations are then performed on the original and compressed images and performance is compared. In some cases it is possible to compress data by several orders of magnitude without substantially degrading the results of subsequent analysis.

**Hierarchical Visualization of Environmental Data on the Web Using nViZn**`Author:`*Lacey Jones*(Utah State University)`Abstract:`- Statistical analyses of large-scale data can often be hard to interpret. Many times several pages of numbers must be used to describe the data, which makes finding the numerical output that is of interest tedious or difficult. Converting these numbers into information that is understandable and useful to someone without an extensive statistical background is also a task that is not easily accomplished. Visual representations such as maps, graphs, and charts can aid in this process. They help create a better understanding through visual representation of the information and processes the data is explaining. A recent improvement in visual statistics has been the use of the Internet. The new Illumitek software tool, nViZn, the follow-up to the Graphics Production Library, is an interactive tool that allows a user without a statistical computing background to expand or narrow the focus of a visual representation and to order, overlay, and rearrange the format of the data, as fast as the user's connection can process. The software also uses the Internet to give easy access to anyone with an Internet connection and a graphical user interface, without having to purchase a personal statistical computing package at an impractical price. I will use this new software to create hierarchical visual images, such as maps and charts, of data modeling concentration estimates of 148 hazardous air pollutants for the 60,803 census tracts in the continental United States, obtained through the EPA's Cumulative Exposure Project.

**A Hierarchical Interactive Visualization System**`Author:`*Peter Tino*(Aston University, UK)*Ian Nabney*(Aston University, UK)*Yi Sun*(Aston University, UK)`Abstract:`- In this paper we propose an interactive hierarchical visualization system that, at each level of the hierarchy, provides the user not only with the data projections, but also with the corresponding magnification factor and directional curvature plots. Magnification factors quantify the extent to which areas are magnified on projection to the data space. Directional curvatures capture the local folding patterns in the projection manifold. The visualization system is constructed in a statistically principled framework.

Room Location: Capistrano Room

Session Chair:

**A Tree-Based Scan Statistic for Database Disease Surveillance**`Author:`*Martin Kulldorff*(University of Connecticut School of Medicine)*Zixing Fang*(University of Connecticut School of Medicine)*Stephen Walsh*(University of Connecticut School of Medicine)`Abstract:`- Many databases exist by which it is possible to study the relationship
between health events and various potential risk factors. Among these
databases, some have variables that naturally form a hierarchical tree
structure, such as pharmaceutical drugs or occupations. For example, Ecotrin
is a brand of aspirin, which belongs to the class of nonsteroidal
anti-inflammatory drugs, which in turn belongs to the larger class of
analgesic drugs. As another example, in the occupational classification
system of the Census, `statisticians' are a subset of `mathematical and
computer scientists', which are a subset of `professional specialty
occupations', which in turn are a subset of `managerial and professional
specialty occupations'.

In this paper we propose a tree-based scan statistic for database surveillance, to be used when the independent variable can be defined in the form of a hierarchical tree. The proposed method is illustrated by examining whether death from silicosis is particularly common among specific occupations as classified by the Census Bureau, without a preconceived idea of which specific occupation or group of occupations, if any, may be related to increased risk. While the method can be used for many different types of databases, it will be described here in terms of `occupation' and `mortality'.

**Creating Ensembles of Decision Trees Through Sampling**`Author:`*Chandrika Kamath*(Lawrence Livermore National Laboratory)*Erick Cantu-Paz*(Lawrence Livermore National Laboratory)`Abstract:`- Recent work in classification indicates that significant improvements in accuracy can be obtained by growing an ensemble of classifiers and having them vote for the most popular class. This paper focuses on ensembles of decision trees that are created with a randomized procedure based on sampling. Randomization can be introduced by using random samples of the training data (as in bagging or arcing) and running a conventional tree-building algorithm, or by randomizing the induction algorithm itself. The objective of this paper is to describe our first experiences with a novel randomized tree induction method that uses a subset of samples at a node to determine the split. Our empirical results show that ensembles generated using this approach yield results that are competitive in accuracy and superior in computational cost.

**Data Mining Diabetic Databases: Are Rough Sets a Useful Addition?**`Author:`*Joseph L. Breault*(Tulane University; Alton Ochsner Medical Foundation)`Abstract:`- The publicly available Pima Indian diabetic database (PIDD) at the UC-Irvine Machine Learning Lab has become a standard for testing data mining algorithms to see their accuracy in predicting diabetic status from the 8 variables given. Looking at the 392 complete cases, guessing all are non-diabetic gives an accuracy of 65.1\%. Since 1988, many dozens of publications using various algorithms have resulted in accuracy rates of 66\% to 81\%. Rough sets as a data mining predictive tool has been used in medical areas since the late 1980s, but not applied to the PIDD to our knowledge. When we apply rough sets to PIDD using ROSETTA software, the predictive accuracy is 82.6\%, which is better than other data mining methods that we are aware of. Rough sets are a useful addition to the analysis of diabetic databases.

**Model Complexity Based Design of Radial Basis Function Networks with Data Mining Applications**`Author:`*Miyoung Shin*(Syracuse University and ETRI (Korea))*Amrit L. Goel*(Syracuse University )`Abstract:`- Radial basis function (RBF) models, a particular class of neural networks, have recently become popular for pattern recognition tasks because of their fast learning capability and good mathematical properties (best and universal approximation). However, current algorithms for learning the model parameters tend to produce inconsistent designs due to their {\it ad-hoc\/} and trial-and-error nature. In this paper we develop a new mathematical framework for RBF design. Specifically, we use singular value decomposition to study the complexity of the interpolation ($G$) and design ($S$) matrices which form the foundation of the SG algorithm proposed here. This algorithm provides a consistent approach for determining the RBF parameters, viz. the number of basis functions ($m$), their widths ($s$), centers ($c$), and weights ($w$). It is shown that $m$ can be obtained as the effective rank of $G$. For this purpose a new model complexity measure ($D$) is introduced and its relationship to singular values is derived. The centers, $c$, are determined by QR factorization with column pivoting of right singular vectors of $G$. It is shown that the selected $c$'s reflect the best compromise between structural stability and residual minimization. Finally, the weights are computed by the usual pseudo inverse method.
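The rank-determination step can be roughly sketched as follows. This is not the authors' SG algorithm: the Gaussian kernel, width, tolerance, toy target, and the greedy pivoting rule (a simple stand-in for QR with column pivoting on the right singular vectors) are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 40))
y = np.sin(2 * np.pi * x)        # toy target, for illustration only
s = 0.1                          # assumed common RBF width

# Gaussian interpolation matrix G_ij = exp(-(x_i - x_j)^2 / (2 s^2)).
G = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * s * s))

# m = effective rank of G: singular values above an (assumed) tolerance.
sv = np.linalg.svd(G, compute_uv=False)
m = int(np.sum(sv > 1e-8 * sv[0]))

# Greedy column pivoting: pick the column with the largest residual norm,
# deflate that direction, repeat m times; the pivots index the centers.
R = G.copy()
centers = []
for _ in range(m):
    j = int(np.argmax(np.linalg.norm(R, axis=0)))
    centers.append(j)
    q = R[:, j] / np.linalg.norm(R[:, j])
    R -= np.outer(q, q @ R)      # remove the chosen direction

# Weights by the usual pseudo-inverse method on the selected design matrix.
S = G[:, centers]
w = np.linalg.pinv(S) @ y
print(m, len(centers))
```

With m chosen this way, the selected columns span G to within the tolerance, so the least-squares fit of a smooth target is essentially exact.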

**Combining Decision Trees Using Systematic Patterns**`Author:`*Hyunjoong Kim*(Worcester Polytechnic Institute)`Abstract:`- Tree ensemble, or voting, methods using re-sampling techniques have recently been highlighted in statistical classification and data mining. In this paper, we propose a new ensemble method for decision trees that utilizes systematic patterns of classifications. The new method improves the prediction accuracy of a single decision tree algorithm. It is also expected that this method performs reasonably well with fewer re-samples than the popular Bagging or Boosting methods. An experiment with a real dataset is carried out to assess the performance of the new method.

Room Location: Laguna Room

Session Chair:

**Resampling Time Series with Seasonal Components**`Author:`*Dimitris Politis*(University of California, San Diego)`Abstract:`- In the case of time series with a seasonal component, the well-known block bootstrap procedure is not directly applicable. We propose a modification of the block bootstrap that successfully addresses the issue of seasonalities, and show some of its properties.
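One modification of this kind can be sketched in a few lines (an illustrative special case, not necessarily the authors' proposal): resample whole seasonal cycles, so every observation in the bootstrap series stays in its own phase of the season.

```python
import random

random.seed(1)

# Hypothetical sketch of a seasonal block bootstrap: with period d, draw
# blocks of length d whose starts are aligned to the season, so each
# bootstrap series keeps every observation in its original phase.
def seasonal_block_bootstrap(x, d, blocks_per_series=None):
    n = len(x)
    assert n % d == 0, "sketch assumes a whole number of seasonal cycles"
    cycles = [x[i:i + d] for i in range(0, n, d)]   # split into full cycles
    k = blocks_per_series or len(cycles)
    out = []
    for _ in range(k):
        out.extend(random.choice(cycles))           # resample whole cycles
    return out

x = [10, 0, -10, 0] * 25        # toy series with period d = 4
xb = seasonal_block_bootstrap(x, d=4)
print(len(xb), xb[:8])
```

Longer blocks (multiples of d, still phase-aligned) would additionally preserve dependence across neighbouring cycles; the single-cycle case above is the simplest member of that family.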

**Subgraph Sampling for Relational Data**`Author:`*David Jensen*(University of Massachusetts Amherst)*Jennifer Neville*(University of Massachusetts Amherst)`Abstract:`- Sampling is central to evaluating inductive learning algorithms. Sampling in relational data is far more challenging and error-prone than sampling in non-relational contexts. We examine a class of algorithms for sampling relational data, analyze the characteristics of the samples they produce, and show that all have important pitfalls for unwary researchers.

**Inference for Sample Maxima in the Presence of Serial Correlation and Heavy-Tailed Distributions**`Author:`*Tucker McElroy*(University of California, San Diego)*Dimitris Politis*(UCSD)`Abstract:`- We consider data from an infinite order moving average time series model with inputs in a stable domain of attraction. The sample maximum of the data is of interest in settings such as insurance and finance; we produce a normalization of this statistic, which, in conjunction with subsampling methods, will allow for asymptotically correct estimation of its cumulative distribution function.

**BootQC: Bootstrap for Statistical Quality Control and Applications to Aviation Safety Analysis**`Author:`*Regina Y. Liu*(Rutgers University)*Hueychung Teng*(Rutgers University)`Abstract:`- Control charts are widely used as effective online monitoring tools in statistical quality control. Most of the existing methods for constructing control charts are parametric in nature, and their applicability is much restricted by the requirement of predetermined models such as normal distributions. A nonparametric alternative based on bootstrap methods was proposed in Liu and Tang (1996). Using both the standard bootstrap and the moving block bootstrap, these new bootstrap control charts are valid for monitoring independent data as well as dependent data. Assigning proper false alarm rates to these bootstrap control charts, this paper develops a meaningful threshold system for regulating and monitoring aviation safety data. The threshold system can serve as a set of standards for evaluating the performance of aviation entities, and provide guidelines for identifying unexpected performances and assigning appropriate corrective measures. Consequently, this threshold system can help achieve more effective regulation of air traffic and safety. Both bootstrap control charts and threshold systems are demonstrated in an analysis of aviation surveillance data collected by the FAA from several air carriers. The demonstration uses the software BootQC (Liu and Teng (2000)). BootQC is a Microsoft Excel Add-In file written in Visual Basic for Applications (VBA), and its graphical user interfaces provide easy access to online data analysis, applying bootstrap methods to generate statistical quality control charts and threshold charts.
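The core idea of a bootstrap control chart can be sketched in a few lines (a minimal illustration, not the BootQC tool itself; the false-alarm rate, subgroup size, and in-control reference data are assumptions for illustration):

```python
import random

random.seed(2)

# Minimal sketch of a bootstrap control chart: estimate control limits as
# bootstrap percentiles of the subgroup mean, so no normality assumption
# is needed.  The false-alarm rate alpha is assumed to be 0.01.
def bootstrap_limits(reference, subgroup_size, alpha=0.01, B=5000):
    means = []
    for _ in range(B):
        sub = [random.choice(reference) for _ in range(subgroup_size)]
        means.append(sum(sub) / subgroup_size)
    means.sort()
    lcl = means[int((alpha / 2) * B)]           # lower control limit
    ucl = means[int((1 - alpha / 2) * B) - 1]   # upper control limit
    return lcl, ucl

reference = [random.gauss(50.0, 2.0) for _ in range(200)]  # in-control data
lcl, ucl = bootstrap_limits(reference, subgroup_size=5)
print(round(lcl, 2), round(ucl, 2))
```

For dependent data, drawing moving blocks of the reference series instead of single observations gives the block-bootstrap version mentioned above.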

**Selection of Shrinkage Factor for the Two Stage Testimator of the Normal Mean Using Bootstrap Likelihood**`Author:`*Makarand V. Ratnaparkhi*(Wright State University, Dayton)*Vasant B. Waikar*(Miami University, Oxford, Ohio)*Frederick J. Schuurmann*(Miami University, Oxford, Ohio)`Abstract:`- In this paper, a new methodology based on the likelihood of bootstrap samples is introduced for improving the efficiency of the two stage shrinkage testimator of Waikar {\it et al\/} (2001) for estimation of a normal mean. In particular, this method is useful for selecting the shrinkage factor for the two stage testimator, thereby increasing the efficiency of such a testimator. The method is useful for a number of other estimation problems. However, in this presentation only estimation of the normal mean is considered in the related simulation studies and the discussion of results.