Modelling – examining the correlations between predictor variables

This post will consider how to explore the relationships among a set of variables: variables that we will ultimately use in a mixed-effects analysis of lexical decision data.

Most of the analyses we will be doing will examine whether observed reading behaviours (e.g. the accuracy or latency of responses in experimental tasks like lexical decision) are influenced by the attributes of the stimuli presented to participants, by the attributes of the participants themselves, or by interactions between item and participant attributes.

By attribute I mean things like word length or frequency, or participant age or reading ability.

A concern in the analysis of psycholinguistic data is the way in which everything correlates with everything else. Thus, longer words typically look like few other words, words used more often in the language are learnt earlier in life, and older readers have read more print and are more skilled readers. These correlations are to be expected (which is not to say that we should not seek to explain them), but they make things difficult when we attempt to analyse data statistically.

You can understand the problem in terms of overlap. If you want to know what the effects of two attributes are, but these variables are correlated so that, essentially, the information provided by one variable overlaps with the information provided by the other, you are going to have a problem, and that problem is called multicollinearity.
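To make the idea of overlap concrete, here is a minimal sketch in Python. It computes the Pearson correlation between two item attributes and squares it to get the proportion of variance the two share. The word lengths and neighbourhood sizes are invented for illustration and are not taken from any real dataset; the names pearson_r, length and neighbours are likewise my own.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for six words (invented for illustration only):
# word length in letters, and number of similar-looking words (neighbours)
length = [3, 4, 5, 6, 7, 8]
neighbours = [12, 9, 7, 4, 2, 1]

r = pearson_r(length, neighbours)
shared = r ** 2  # proportion of variance the two attributes have in common

print(f"correlation r = {r:.2f}")
print(f"shared variance r^2 = {shared:.2f}")
```

In this made-up example the correlation is strongly negative, so almost all of the variation in one attribute can be predicted from the other. That is the overlap: once you know a word's length, its neighbourhood size tells you very little that is new.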

As Cohen, Cohen, West and Aiken (2003) note, the estimate of the effect of one predictor will be unreliable because, if it overlaps with another predictor, little unique information will be available to estimate its value.
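One common way to put a number on this loss of unique information is the variance inflation factor (VIF), a standard regression diagnostic that I am introducing here for illustration; it is not named in the text above. For a given predictor, VIF = 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below assumes the simplest case of just two predictors, where R² is simply their squared correlation. The function name and the list of correlations are my own choices.

```python
import math

def variance_inflation_factor(r_squared):
    """VIF = 1 / (1 - R^2), where R^2 is the proportion of variance in one
    predictor that is explained by the other predictors."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return 1.0 / (1.0 - r_squared)

# Assumption: only two predictors, so R^2 for each is their squared correlation
correlations = (0.0, 0.5, 0.9, 0.99)

results = []
for r in correlations:
    vif = variance_inflation_factor(r ** 2)
    # The standard error of the coefficient grows by the square root of the VIF
    results.append((r, vif, math.sqrt(vif)))

for r, vif, se_factor in results:
    print(f"r = {r:.2f}  VIF = {vif:6.2f}  SE multiplied by {se_factor:5.2f}")
```

The pattern to notice is that the inflation is mild at moderate correlations and grows very quickly as the correlation approaches 1: the more two predictors overlap, the less unique information is left to estimate either effect.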

Think of it like this:


Creative Commons – flickr – National Media Museum: Elderly couple with a young female spirit

Creepy, right?

What are we looking at? Which image is ‘real’?


I very much recommend reading the following publications to build understanding. I would start with Cohen et al. (2003) before going anywhere near Belsley et al. (1980), the authoritative text.

Baayen, R. H. (2008). Analyzing linguistic data. Cambridge University Press.

Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
