catLGM - Code for Categorical data analysis with Latent Gaussian Models

Written by Emtiyaz, CS, UBC.
Last updated: Dec. 17, 2011.

Description

This code can be used for the following kinds of categorical (binary and multi-class) data analysis:

  • Binary/Multi-Class classification (e.g. logistic regression).
  • Binary/Multi-Class Gaussian Process classification.
  • Factor analysis / PCA / latent graphical models for categorical data.
  • Missing value imputation of categorical data.
See some examples of catLGM here. For full details of the models and algorithms, see the following papers:

Download CatLGM.zip.

System requirements and dependencies: You need to download minFunc (by Mark Schmidt), required for numerical optimization, and the GPML toolbox for Gaussian processes. Place these directories inside the catLGM directory. I also use some functions from Tom Minka’s lightspeed toolbox for matrix inversion and matrix determinants, but these can be replaced by other equivalent functions. The catLGM code works fine on MATLAB 7.4 (2007a) and higher versions.

Getting started

The following commands will get you started:
   > addpath(genpath(pwd));
   > exampleCatLGM;

Usage

All function calls follow the same format, performing a specific task on data given a model and an algorithm.
   > out = catLGM(model, task, algo, data);

For example, given features X and labels y, the following performs posterior inference in Bayesian logistic regression using a variational method based on the piecewise bound, and then predicts labels for new test data using Monte Carlo.
   > postDist = catLGM('bayesLogitReg', 'infer', 'var-pw', [X y]);
   > pred = catLGM('bayesLogitReg', 'predict', 'mc', Xtest, [], postDist);

Similarly, given a categorical data matrix Y with missing values, the following learns a multinomial-logit factor analysis model using a variational EM algorithm based on the log bound, and then imputes the missing values using the learned model (with a Monte Carlo estimate).
   > learnOut = catLGM('multiLogitFA', 'learn', 'var-log', Y);
   > params = learnOut.params;
   > postDist = learnOut.postDist;
   > imputOut = catLGM('multiLogitFA', 'impute', 'var-log', Y, [], 'mc', params, postDist);

Algorithm options can also be specified as the fifth argument.
   > options = struct('maxItersInfer', 500, 'lowerBoundTolInfer', 0.001);
   > postDist = catLGM('bayesLogitReg', 'infer', 'var-pw', [X y], options);
See the full list of options below.

Model-Task-Algo Details

The following tables list the model, task, and algorithm combinations in the current implementation.

Logistic Regression Models: The table below shows the models implemented. Two tasks can be performed for these models: inference ('infer') and prediction ('pred'). For the 'infer' task, the algorithm options are variational EM algorithms based on the Bohning bound ('var-boh'), the log bound ('var-log'), and the piecewise bounds ('var-pw'). For the 'pred' task, I use an approximation based on Monte Carlo ('mc').

   Model                Description                             infer                      pred
   bayesLogitReg        Bayesian logistic regression            var-boh, var-log, var-pw   mc
   bayesMultiLogitReg   Bayesian multinomial-logit regression   var-boh, var-log           mc
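
These combinations plug directly into the call format from the Usage section. The sketch below mirrors the bayesLogitReg example given earlier, assuming multi-class labels y are coded as integers; it is illustrative only, so check exampleCatLGM.m for the exact expected input format.

```matlab
% Sketch: posterior inference for Bayesian multinomial-logit regression
% with the log bound, followed by Monte Carlo prediction on test data.
postDist = catLGM('bayesMultiLogitReg', 'infer', 'var-log', [X y]);
pred     = catLGM('bayesMultiLogitReg', 'predict', 'mc', Xtest, [], postDist);
```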

See the following papers for details of the bounds:

Gaussian Process Classification: The table below shows the models implemented. Currently, inference ('infer') and prediction ('pred') can be performed; I will implement hyperparameter learning soon. The algorithms are the same as for the regression models. See my AISTATS paper, Categorical Data Analysis with Latent Gaussian Models, for details of the stick likelihood and multi-class GP classification.

   Model          Description                                                 infer                      pred
   logitGP        Binary GP classification with logistic link                 var-boh, var-log, var-pw   mc
   multiLogitGP   Multi-class GP classification with multinomial-logit link   var-boh, var-log           mc
   stickLogitGP   Multi-class GP classification with stick-logit link         var-pw                     mc
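
By analogy with the regression calls above, a binary GP classification run might look like the following. This is a sketch only; the exact argument list for the GP models (e.g. how covariance hyperparameters are passed) may differ, so consult exampleCatLGM.m.

```matlab
% Sketch: inference with the piecewise bound for binary GP
% classification, then Monte Carlo prediction at test inputs.
postDist = catLGM('logitGP', 'infer', 'var-pw', [X y]);
pred     = catLGM('logitGP', 'predict', 'mc', Xtest, [], postDist);
```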

Factor Models

Currently, learning ('learn'), inference ('infer'), and imputation ('impute') can be performed. See my AISTATS paper, Categorical Data Analysis with Latent Gaussian Models, for details of the models and algorithms.

   Model             Description                                                         learn and infer            impute
   binaryLogitFA     Binary factor analysis with logistic link                           var-boh, var-log, var-pw   mc
   binaryLogitLGGM   Binary latent Gaussian graphical model with logistic link           var-boh, var-log, var-pw   mc
   multiLogitFA      Categorical factor analysis with multinomial-logit link             var-boh, var-log           mc
   multiLogitLGGM    Categorical latent Gaussian graphical model with multinomial-logit link   var-boh, var-log     mc
   stickLogitFA      Categorical factor analysis with stick-logit link                   var-pw                     mc
   stickLogitLGGM    Categorical latent Gaussian graphical model with stick-logit link   var-pw                     mc
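
Mirroring the multiLogitFA example from the Usage section, a stick-logit factor analysis run would look roughly like this (a sketch; for stick-logit models the table above lists 'var-pw' as the only learning algorithm):

```matlab
% Sketch: learn a stick-logit factor analysis model with the piecewise
% bound, then impute missing entries of Y with a Monte Carlo estimate.
learnOut = catLGM('stickLogitFA', 'learn', 'var-pw', Y);
imputOut = catLGM('stickLogitFA', 'impute', 'var-pw', Y, [], 'mc', ...
                  learnOut.params, learnOut.postDist);
```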

List of options

Default values of options are set inside getCatLGMDefaultOptions.m and getAlgoFuncsAndOptions.m.

Options for inference
   maxItersInfer        Maximum number of iterations for the inference step (e.g. 100).
   lowerBoundTolInfer   Lower bound tolerance used to determine convergence of the inference step (e.g. 0.01).
   initVec              Initialization for the inference step (a vector whose size depends on the algorithm).
   displayInfer         Display on/off (1 or 0).

Options for the learning algorithm (EM)
   maxItersLearn        Maximum number of E- and M-step iterations for learning.
   lowerBoundTolLearn   Lower bound tolerance used to determine convergence of the learning step (e.g. 0.01).
   displayLearn         Display on/off (1 or 0).
   computeSs            Compute sufficient statistics (switched off during inference mode).
   maxItersMstep        Maximum number of iterations for the M-step (in case a gradient algorithm is used).
   lowerBoundTolMstep   Lower bound tolerance for the M-step (in case a gradient algorithm is used).
   displayMstep         Display on/off (1 or 0).
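
Putting the learning options together, an options struct for the EM algorithm might look like the sketch below. The values are illustrative, not recommendations; the actual defaults live in getCatLGMDefaultOptions.m and getAlgoFuncsAndOptions.m.

```matlab
% Sketch: illustrative learning options for a variational EM run.
options = struct('maxItersLearn',      100, ...
                 'lowerBoundTolLearn', 0.01, ...
                 'displayLearn',       1, ...
                 'maxItersMstep',      50, ...
                 'lowerBoundTolMstep', 0.001, ...
                 'displayMstep',       0);
learnOut = catLGM('multiLogitFA', 'learn', 'var-log', Y, options);
```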

Options for bound
   boundParams          A structure containing bound parameters.

Setting hyperparameters

Hyperparameters can be passed through the options. If no hyperparameters are specified, the default values are set inside setupLGM.m.
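
For example (a sketch; the actual hyperparameter field names are defined inside setupLGM.m, and the 'hyperParams' field below is hypothetical):

```matlab
% Hypothetical field name 'hyperParams'; check setupLGM.m for the
% field names the code actually expects.
options = struct();
options.hyperParams = myHyperParams;
postDist = catLGM('bayesLogitReg', 'infer', 'var-pw', [X y], options);
```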

To Do List

I will be adding the following in the next few days. If you want something specific, please email me.
  • Add hyperparameter learning for multi-class GP.
  • Binary GP methods such as EP from Carl Rasmussen’s GPML toolbox.
  • Mark Girolami’s VB algorithm for multi-class GP.
  • Add the following bounds,
    • The Jaakkola bound for the logistic log-partition function.
    • Guillaume’s and Tom Minka’s bound for the log-sum-exp function.
  • Hybrid Monte Carlo (HMC) sampler for multiclass GP.
  • Fast sparse GP and sparse latent graphical models.