Getting started — loading data, writing and saving scripts

[16/1/14 edit — as the procedures for finding data in Mac OS were detailed in class, I am not going to edit this post to show the Mac OS steps in detail: the steps are mostly the same, and we have practised them in class.]

In this post, we shall: 1. load data into the R workspace; 2. get summary statistics for those data; 3. plot visualizations of the data distributions — histograms; and 4. finish up by saving and closing the .R file.

For this post, I am going to assume that you know about files, folders and directories on your computer. In Windows (e.g. in XP and 7), you should be familiar with Windows Explorer. (For Macs, we would be working with Finder and, within Finder, with File > Get Info.) If you are not familiar with Windows Explorer, this series of videos, e.g. here on Windows Explorer, should get you to the point where you understand: the difference between a file and a folder; file types; how to navigate the file and folder directory structure; how to get the details view of files in folders; and how to sort files in folders, e.g. by name or date. N.B. the screenshots of folders in the following will look different to Windows Explorer as it normally appears in Windows 7. That is because I use xplorer2, as I find it easier to move stuff around.

Just as in other statistics applications, e.g. SPSS, data can be entered manually or imported from an external source. Here I will talk about: 1. loading files into the R workspace as a dataframe object; 2. getting summary statistics for the variables in the dataframe; and 3. plotting visualizations of the data distributions that will be helpful for making sense of the dataframe.

1. Loading files into the R workspace as a dataframe object

Start by getting your data files

I often do data collection using paper-and-pencil standardized tests of reading ability, and also DMDX scripts running, e.g., a lexical decision task testing online visual word recognition. Typically, I end up with an excel spreadsheet file called subject scores 210413.xlsx and a DMDX output data file called lexical decision experiment 210413.azk. I’ll talk about the experimental data files another time.

The paper-and-pencil stuff gets entered into the subject scores database by hand. Usually, the spreadsheets come with notes to myself on what was happening during testing. To prepare for analysis, though, I will often want just a spreadsheet showing columns and rows of data, where the columns are named sensibly (meaningful names I can understand), with no spaces in either the column names or the entries in data cells. Something that looks like this:

R-2-1-csv

This is a screen shot of a file called: ML scores 080612 220413.csv

— which you can download here.

These data were collected in an experimental study of reading, the ML study.

The file is in .csv (comma separated values) format. I find this format easy to use in my workflow (collect data – tidy in excel – output to csv – read into R – analyse in R), but you can get any kind of data you can think of into R (SPSS .sav or .dat, .xlsx, stuff from the internet – no problem).
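(For what it’s worth, a sketch with hypothetical file names, and using packages you would need to install separately, other formats can be read in with functions such as these:)

# hypothetical examples for other formats; the file names are made up
# library(foreign)
# spss.data <- read.spss("scores.sav", to.data.frame = TRUE)
# library(readxl)
# excel.data <- read_excel("scores.xlsx")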

Let’s assume you’ve just downloaded the database and, in Windows, it has ended up in your downloads folder. I will make life easier for myself by copying the file into a folder whose location I know.

— I know that sounds simple but I often work with collaborators who do not know where their files are.

I’ll just copy the file from the downloads folder into a folder I called R resources public:

R-2-1-folders

Advice to you

Your lives, and my life if I am working with you, will be easier if you are systematic and consistent about your folders: if I am supervising you, I expect to see a folder for each experiment and, within that, sub-folders for stimulus set selection; data collection script and materials; raw data files; (tidied) collated data; and analysis.

Read a data file into the R workspace

I have talked about the workspace before; here we are going to load data into it using the functions setwd() and read.csv().

You need to tell R where your data can be found, and where to put the outputs (e.g. pdfs of a plot). You do this by setting the working directory (see my advice earlier about folders): this is going to be the Windows folder where you put the data you wish to analyse.

For this example, the folder is:

C:\Users\me\Dropbox\resources R public

[Do you know what this is – I mean, the equivalent, on your computer? If not, see the video above – or get a Windows for Dummies book or the like.]

What does R think the working directory currently is? You can find out by running the getwd() command:


getwd()

and you will likely get told this:


> getwd()
[1] "C:/Users/p0075152/Documents"

The Documents folder is not where the data file is, so you set the working directory using setwd()


setwd("C:/Users/me/Dropbox/resources R public")

Notice: 1. the address in Windows Explorer will have all the slashes facing backwards, but "\" is a special character in R (the escape character), so 2. the address given in setwd() has all slashes facing forward; 3. the address is in quotes; and 4. of course, if you spell this wrong, R will give you an error message. I tend to copy the address from Windows Explorer, and change the slashes by hand in the R script.
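(As an aside, and reusing the example path above: doubled backslashes also work, because \\ is the escaped form of \ inside an R string.)

setwd("C:/Users/me/Dropbox/resources R public")       # forward slashes
setwd("C:\\Users\\me\\Dropbox\\resources R public")   # or doubled backslashes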

Load the data file using the read.csv() function


subjects <- read.csv("ML scores 080612 220413.csv", header=T, na.strings = "-999")

Notice:

1. subjects… — I am calling the dataframe something sensible

2. …<- read.csv(…) — this is the part of the code loading the file into the workspace

3. …(“ML scores 080612 220413.csv”…) — the file has to be named, spelled correctly, and have the .csv suffix given

4. …, header = T… — I am asking for the column names to be included in the subjects dataframe resulting from this function call

5. … na.strings = “-999” … — I am asking R to code -999 values, where they are found, as NA – in R, NA means Not Available i.e. a missing value.

Things that can go wrong here:

— you did not set the working directory to the folder where the file is

— you misspelled the file name

— either error will cause R to tell you the file does not exist (it does, just not where you said it was)

I recommend you try making these errors and getting the error message deliberately. Making errors is a good thing in R.

Note:

— coding missing values systematically is a good thing; -999 works for me because it never occurs in real life for my reading experiments

— you can code for missing values in excel in the csv before you get to this stage
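As a quick check after loading (a sketch, using the Age column of this dataframe as an example), you can count how many values were read in as NA:

sum(is.na(subjects$Age))   # number of missing values in the Age column
colSums(is.na(subjects))   # number of missing values in every column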

Let’s assume you ran the command correctly. What do you see? This:

R-2-1-readcsv

Notice:

1. in the workspace window, you can see the dataframe object, subjects, you have now created

2. in the console window, you can see the commands executed

3. in the script window, you can see the code used

4. the file name, top of the script window, went from black to red, showing a change has been made but not yet saved.

To save a change, the keyboard shortcut is CTRL-S (cmd-S in Mac OS).

2. Getting summary statistics for the variables in the dataframe

It is worth reminding ourselves that R is, among other things, an object-oriented programming language. See this:

The entities that R creates and manipulates are known as objects. These may be variables, arrays of numbers, character strings, functions, or more general structures built from such components.

During an R session, objects are created and stored by name (we discuss this process in the next session). The R command

     > objects()

(alternatively, ls()) can be used to display the names of (most of) the objects which are currently stored within R. The collection of objects currently stored is called the workspace.

[R introduction: http://cran.r-project.org/doc/manuals/r-release/R-intro.html#Data-permanency-and-removing-objects]

or this helpful video by ajdamico.

So, we have created an object using the read.csv() function. We know that object was a .csv file holding subject scores data. In R, in the workspace, it is an object that is a dataframe: a structure much like the spreadsheets in excel, SPSS etc., where columns are variables and rows are observations, and where columns can correspond to variables of different types (e.g. factors, numbers etc.). Most of the time, we’ll be using dataframes in our analyses.
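If it helps to see that structure in miniature (a sketch with made-up values, nothing to do with the ML data), you can build a tiny dataframe by hand and inspect it:

eg.subjects <- data.frame(subjectID = c("s1", "s2", "s3"),
                          Age = c(21, 34, 58),
                          Group = factor(c("control", "control", "dyslexic")))
str(eg.subjects)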

You can view the dataframe by running the command:


View(subjects)

— or actually just clicking on the name of the dataframe in the workspace window in R-studio, to see:

R-2-1df-view

Note: you cannot edit the data in that window (try it).

I am going to skip over a whole bunch of stuff on how R deals with data. A useful Quick-R tutorial can be found here. The R-in-Action book, which builds on that website, will tell you that R has a wide variety of objects for holding data: scalars, vectors, matrices, arrays, dataframes, and lists. I mostly work with dataframes and vectors, so that’s mostly what we’ll encounter here.

Now that we have created an object, we can interrogate it. We can ask what columns are in the subjects dataframe, how many variables there are, what the average values of the variables are, if there are missing values, and so on using an array of useful functions:


# inspect the first few rows (here, 2) of the dataframe
head(subjects, n = 2)

# summary statistics for every variable in the dataframe
summary(subjects)

# describe() comes from the psych package, so load it first
library(psych)
describe(subjects)

# the structure of the dataframe: the type of each variable
str(subjects)

# the same as describe(), without needing library(psych) first
psych::describe(subjects)

# the number of columns (variables) in the dataframe
length(subjects)

# the number of observations in the Age variable
length(subjects$Age)

Copy-paste these commands into the script window and run them; below is what you will see in the console:

R-2-1-descriptives

Notice:

1. the head() function gives you the top rows, showing column names and data in cells; I asked for 2 rows of data with n = 2

2. the summary() function gives you: minimum, maximum, median and quartile values for numeric variables, numbers of observations with one or other levels for factors, and will tell you if there are any missing values, NAs

3. describe() (from the psych package) will give you means, SDs, the kind of thing you report in tables in articles

4. str() will tell you if variables are factors, integers etc.

If I were writing a report on this subjects data I would copy the output from describe() into excel, format it for APA, and stick it in a word document.
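For instance (a sketch only: the output file name is just an example, and this assumes the psych package is loaded), you can write the describe() output to a .csv file and open that in excel for formatting:

write.csv(describe(subjects), file = "subjects-descriptives.csv")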

3. Plotting visualization of data distributions

These summary statistics will not give you the whole picture that you need. Mean estimates of the centre of the data distribution can be deceptive, hiding pathologies like multimodality, skew, and so on. What we really need to do is to look at the distributions of the data and, unsurprisingly, in R, the tools for doing so are excellent. Let’s start with histograms.

[Here, I will use geom_histogram(); see the documentation online.]

We are going to examine the distribution of the Age variable, age in years for ML’s participants, using the ggplot() function as follows:


# ggplot() and geom_histogram() come from the ggplot2 package: run library(ggplot2) first
pAge <- ggplot(subjects, aes(x=Age))
pAge + geom_histogram()

Notice:

— We are asking ggplot() to map the Age variable to the x-axis i.e. we are asking for a presentation of one variable.

If you copy-paste the code into the script window and run it, three things will happen:

1. You’ll see the commands appear in the lower console window

2. You’ll see the pAge object listed in the upper right workspace window

R-2-1-rstudio-histogram

Notice:

— When you ask for geom_histogram(), you are asking for a bar plot plus a statistical transformation (stat_bin) which assigns observations to bins (ranges of values in the variable)

— calculates the count (number) of observations in each bin

— or – if you prefer – the density (percentage of total/bar width) of observations in each bin

— the height of the bar gives you the count, the width of the bin spans values in the variable for which you want the count

— by default, geom_histogram() will bin the data into bins whose width equals the range of values divided by 30, but you can ask for more or less detail by specifying binwidth (see the sketch just before the Exercise, below); this is what R means when it says:

>stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to 
adjust this.

3. We see the resulting plot in the plot window, which we can export as a pdf:

R-2-1-histogram

Notice:

— ML tested many people in their 20s – no surprise, a student mostly tests students – plus a smallish number of people in middle age and older (parents, grandparents)

— nothing seems wrong with these data – no funny values
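For example (a sketch, assuming the Age values are in years), specifying the bin width explicitly groups the ages into five-year bands:

pAge <- ggplot(subjects, aes(x=Age))
pAge + geom_histogram(binwidth = 5)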

Exercise

Now examine the distribution of the other variables, TOWRE word and TOWRE nonword reading accuracy. Adapt the code by substituting the variable name with the names of these variables, and do one plot for each variable:


pAge <- ggplot(subjects, aes(x=????))
pAge + geom_histogram()

Notice:

— In my opinion, at least one variable has a value that merits further thought.

4. What do you do when you’re finished?

OK, so you’ve been writing some killer code and you’ve produced a super graph, appearing in the plot window. How do you save it? You can click on the export button:

RStudio-plot-export

— you’ll be asked if you want to save the plot as an image (you’ll then be asked what format you want, .jpeg and .png are good for most purposes), as a pdf (my preference), or copy it to the clipboard so that you can paste it somewhere (like a word document, I suppose).

What about the script? In RStudio, go to the file menu then Save-As and you will see this:

RStudio-script-file-save

— Choose a name: something-sensible-dated.R

— And save it to where you can find it.

What have we learnt?

The R code used in this post can be downloaded here. Download it and use it with the database provided.

[For my postgraduate students coming to this post, see the .R file in the Week One practical materials folder.]

In this post, we learnt about:

1. Putting data files into folders on your computer.

2. The workspace, the working directory and how to set it in R using setwd().

3. How to read a file into the workspace (from the working directory = folder on your computer) using read.csv().

4. How to inspect the dataframe object using functions like summary() or describe().

5. How to plot histograms to examine the distributions of variables.

Key vocabulary

working directory

workspace

object

dataframe

NA, missing value

csv

histogram


Getting started – getting helped

As we get started, we will need help. Let’s work on how and where to find that help, and also how to act on it.

We will need help in various ways when using R

You need help most of the time when using a software programme, but if you’re getting started on R you don’t have the advantage of practice that you have with MS Word or Excel or SPSS. Needing help is not an R thing, it’s a learning thing. However, there being lots of help available is an R thing.

In the following, I’ll discuss two common (for me) help scenarios:

1. Let’s say you are trying to run a line of code and you get an error message or a warning

2. We will need help where we’ve got a decision to make

— The first is usually when I am trying to do something and I have a specific problem I need help on.

— The second is when I am not really sure what I need or even if I need help.

I’ll talk about getting advice and also dealing with it.

1. Let’s say you are trying to run a line of code and you get an error message or a warning

R is a case-sensitive,  interpreted language.

– This means that if you spell an object name incorrectly, if you get the syntax wrong, if you muddle letter case, you will get an error (just like any natural language).

The error messages can be cryptic.

For example, I run the following (incorrect) line of code:


p <- ggplot(mtcars, aes(x = wt))
p + geom_Object)

I get this:

R-error-unexpected

— You can see in the message that the (first) problem is that the bracket is wrong.

How do I work out what to do?

First, more than likely, I am trying something out but I have beside me a book or a website I am working from, maybe trying an example: so if I got a message like this I would check that I copied the code correctly.

Second, I might try to google the error message “Error: unexpected ‘)’ in “p + geom_Object)” — try it — it does not get me very far in this case.

Third, I try to change the code to see what difference that makes. Here, the problem is with the bracket so I change that:


p <- ggplot(mtcars, aes(x = wt))
p + geom_Object()

— The obvious response to the error message was to close the brackets — even minimal experience with R will tell you that unmatched brackets are not doing much good.

— Sometimes, that quick response will solve the problem straightaway.

— Sometimes it won’t and you might get another message, like this:

R-error-could-not-find-object

— “could not find function” — so obviously, I am still not getting the function name right — actually, I have the case wrong.

— But let’s say I am convinced I have it right: what should I do?

— Google it — most of the time, reasonably careful reading of the result will tell you what’s going on.
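By contrast (a sketch using a real geom from ggplot2, since geom_Object above is only a stand-in for a misspelled name), a correctly spelled and correctly cased call runs without complaint:

library(ggplot2)
p <- ggplot(mtcars, aes(x = wt))
p + geom_histogram()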

How and where to find help

As noted previously, there are plenty of places to get help (see also the following), but one place you should know to look is the built-in R help.

So you’re trying to work out how to use a function, or you’re looking for a function, or you’re trying to work out what options to use with a function, you can use in-built help:

– type: ?[name of function] or help([name of function]) or ??[name of function]

–Where [name of function], everything from [ to ] inclusive, stands for the name of the function you want help with. For example, if you entered the function call:


help(ggplot)

in the RStudio script window and then ran that command you would get this:

R-help-ggplot

— Notice that you get the help(ggplot) command being shown in the console in blue, it has been executed, but the result appears on the right, in the Help window.

The built-in help in R, accessed by running a help([function name]) or ?[function name] command, has a standard structure. It is useful but not (I admit) always terribly friendly. I usually use it if I basically know what the function is for and why I am using it, but want to use the function in a specific way or in a different way than the default.

The help will tell you: 1. what the function does; 2. what arguments can be passed to it; 3. how it can be specified in various ways; and 4. it will give you some example code to try out. The example code will work with built-in data and, usually, I learn the most by playing with the example.
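As a small aside, base R can also run a help page’s example code for you, via the example() function:

example(mean)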

Note that a google search in a form something like this: “in R [name of function or text of error message]” will very often take you to a post on stackoverflow or some R blog where the error or problem and the solution or multiple alternate solutions (this is R) will be described or explained for you.

What to do with the help

Try it out. Learning how to use R involves some experimentation. Copy the code from a reported solution into a script, try running it and see if it works.

Every problem you ever have will already have been solved by someone else and, in R, the solution will be reported online.

2. We will need help where we’ve got a decision to make

Most modern software applications strive to ensure we get what we want without thinking too much about either the ends or the means. An application might, as soon as we start it, ask us what we want or show us what we might want: to write a letter; to input data; to read our emails. In R, you’re the boss of you and with power comes responsibility.

We might need help, in R, to work out what we want — Case one: You have a vague idea about what you are doing

Most applications will give you one way to do any thing. In R, you will usually have more than one (often, several) different ways to do a thing.

– Which way do you choose? This is a decision based on (for me) answers to the questions: does it work (now)? do I understand why it works? do I like it?

Typically, I work out what I want in one of three ways:

1. I am reading a book e.g. “R in Action” by Kabacoff on a topic relevant to my current task, e.g. on how to manipulate the structure of a dataframe, and I see a description of something I need to do, so I follow the example (code is given, operating on built-in data) and then adapt the code (e.g. changing variable names) to do that action for my own data.

2. I have only a vague idea of what I want, let’s say I am vaguely thinking that I would like to be able to draw error bars in R, so I enter the following in google: “How do I draw error bars in R” and I will get a result that looks like this:

R-help-google-it

I use DuckDuckGo instead of Google but the result will be similar whatever your search engine.

Typing a description of a problem or a question will often bring up a lot of results, so I look for whatever seems promising and start reading it; then, if I can understand the advice and think I like the look of the results, I copy the code (code is almost always provided) to try the suggested solution.

Note that you sometimes get alternate solutions offered in the comments.

Half the time, the search will give me a result listing linked to a discussion in Stack Overflow like this result in response to the question: “How can I break line on R’s setwd() command?”

R-help-stackoverflow

— This discussion was posted to StackOverflow in response to the question. You can see that one of the commentators did not think the person asking the question should even be using setwd(), while another commentator did think setwd() was useful. I agree with the second, but I note this to indicate that there is plenty of discussion, and often alternate solutions, for any one question you might have.

We might need help, in R, to work out what we want — Case two: I am not really aware that I have a problem but I am feeling dissatisfied with the approach I am currently taking

For example, I sometimes need to analyse information ordered by date, and I present it the way I usually do, with a bar chart or a line graph or something, but I just don’t think the product feels right. What I might do, and what I used to do a lot, is have a look around for inspiration among the various online aggregations of R advice and tips.

For example, see what is shown at the R graph gallery website. There, if I were following my hypothetical example, I might find this:

R-graph-gallery-heatmap

A calendar heat map

— which looks pretty cool, and of course comes with a code example you can try out.

Or I might have a look around the R-bloggers website, and see if anything nice looking has been posted lately, like this:

R-help-bloggers

— coincidentally, a post by Max Gordon on a topic I was discussing yesterday: how can I produce a complete article, tabled statistics and text, from inside R?

What have we learnt?

You will need help in R, but there is a lot of it out there; you just need to work on how and where to find it, and then on what to do with it.


Getting started — Installation of RStudio and some packages + using ggplot() to make a simple plot

Install R

Revise how to install R, as previously discussed here and here.

Installation

How do you download and install R? Google “CRAN” and click on the download link, then follow the instructions (e.g. at “install R for the first time”).

R-CRAN

Anthony Damico has produced some great video tutorials on using R, here is his how-to guide:

http://www.screenr.com/kzT8

And moonheadsing at Learning Omics has a blog post with a series of screen-shots showing you how to install R.

Install R-studio

Having installed R, the next thing we will want to do is install R-studio, a popular and useful interface for writing scripts and using R.

If you google “R-studio” you will get to this window:

R-0-1-rstudio

Click on the “Download now” button and you will see this window:

R-0-2-rstudio

Click on the “Download RStudio desktop” and you will see this window:

R-0-3-rstudio

You can just click on the link to the installer recommended for your computer.

What happens next depends on whether you have administrative/root privileges on your computer.

I believe you can install R-studio without such rights using the zip/tarball download.

Having installed R and R-studio,  in Windows you will see these applications now listed as newly installed programs at the start menu. Depending on what you said in the installation process, you might also have icons on your desktop. In Mac OS, you will see the R and RStudio icons appear in the applications shown by Launchpad.

The R icon looks like this:

R-CRAN-icon

The RStudio icon looks like this:

RStudio-icon

If you click on the R icon you will get to see bare R, the console, which will look like this:

R-console-bare

R console, version 3.0.1, as seen on an iMac

Go ahead and click on the R icon and check it out; then go to the R menu and click on ‘Quit R’ to leave it, and don’t save the workspace image (you’ll be asked whether you want to or not). We won’t be using R directly; we will be using it through RStudio.

Click on the R-studio icon – it will pick up the R installation for you.

[N.B. Sometimes, e.g. when one does not have full admin rights on a computer e.g. in a university pooled computer, you are forced to install R someplace unusual and RStudio cannot find it. In that situation, you need to locate the R bin files from within RStudio. It’s not too hard to do this but I won’t get into that unless asked.]

Now we are ready to get things done in R.

Start a new script in R-studio, install packages, draw a plot

Here, we are going to 1. start a new script, 2. install then load a library of functions (ggplot2) and 3. use it to draw a plot.

Depending on what you did at installation, you can expect to find shortcut links to R (a blue R) and to R-Studio (a shiny blue circle with an R) in the Windows start menu, or as icons on the desktop.

To get started, in Windows, double click (left mouse button) on the R-Studio icon.

Maybe you’re now looking at this:

R-rstudio-1-1

1. Start a new script

What you will need to do next is go to the file menu [top left of R-Studio window] and create a new R script:

– move the cursor to File, then New, then R script, and then click on the left mouse button

or

— in Windows, just press the buttons ctrl-shift-N at the same time or in Mac OS, press the buttons command-shift-N together

— the second move is a keyboard shortcut for the first; I prefer keyboard shortcuts to mouse moves

— to get this:

R-rstudio-1-2

What’s next?

This circled bit you see in the picture below:

R-rstudio-1-3-console

is the console.

It is what you would see if you open R directly, not using R-Studio.

You can type and execute commands in it but mostly you will see unfolding here what happens when you write and execute commands in the script window, circled below:

R-rstudio-1-3-script

— The console reflects your actions in the script window.

If you look on the top right of the R-Studio window, you can see two tabs, Workspace and History: these windows (they can be resized by dragging their edges) also reflect what you do:

1. Workspace will show you the functions, data files and other objects (e.g. plots, models) that you are creating in your R session.

[Workspace — see the R introduction, and see this helpful post by Quick-R — when you work with R, your commands result in the creation of objects, e.g. variables or functions, and during an R session these objects are created and stored by name — the collection of objects currently stored is the workspace.]

2. History shows you the commands you execute as you execute them.

— I look at the Workspace a lot when using R-Studio, but no longer look much at History (though I once used it).

My script is my history.

2. Install then load a library of functions (ggplot2)

We can start by adding some capacity to the version of R we have installed. We install packages of functions that we will be using e.g. packages for drawing plots (ggplot2) or for modelling data (lme4).

[Packages – see the introduction and this helpful page in Quick-R — all R functions and (built-in) datasets are stored in packages; only when a package is loaded are its contents available]

Copy then paste the following command into the script window in R-studio:


install.packages("ggplot2", "reshape2", "plyr", "languageR",
"lme4", "psych")

Highlight the command in the script window …

— to highlight the command, hold down the left mouse button, drag the cursor from the start to the finish of the command

— then either press the run button …

[see the run button on the top right of the script window]

… or (in Windows) press the buttons CTRL-enter together, alternatively (in Mac OS) press the cmd-enter buttons together …

— and watch the console show you R installing the packages you have requested.

Those packages will always be available to you, every time you open R-Studio, provided you load them at the start of your session.

[I am sure there is a way to ensure they are always loaded at the start of the session and will update this when I find that out.]
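[One way, as an aside and a matter of taste: R runs any code it finds in a file called .Rprofile in your home directory at the start of every session, so library() calls placed there, as in the hypothetical sketch below, are executed automatically; some people prefer to keep the library() calls visible at the top of each script instead.]

# hypothetical contents of a ~/.Rprofile file
library(ggplot2)
library(psych)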

The keyboard shortcut for running a line or a selection of lines of code is:

— in Windows, press control (CTRL) and enter buttons together

— in Mac OS, press command (cmd) and enter buttons together

There is a 2-minute version of the foregoing laborious step-by-step, by ajdamico, here. N.B. the video is for installing and loading packages using the plain R console but applies equally to R-Studio.

Having installed the packages, in the same session or in the next session, the first thing you need to do is load the packages for use by using the library() function:


library(languageR)
library(lme4)
library(ggplot2)
library(rms)
library(plyr)
library(reshape2)
library(psych)

— copy/paste or type these commands into the script, highlight and run them: you will see the console fill up with information about how the packages are being loaded:

R-rstudio-1-4-library

Notice that the packages window on the bottom right of R-Studio now shows that the loaded packages have been ticked:

R-rstudio-1-4-library-packages

Let’s do something interesting now.

3. Use ggplot function to draw a plot

[In the following, I will use a simple example from the ggplot2 documentation on geom_point.]

Copy the following two lines of code into the script window:


p <- ggplot(mtcars, aes(wt, mpg))
p + geom_point()

— run them and you will see this:

R-rstudio-1-ggplot-point

— notice that the plot window, bottom right, shows you a scatterplot.

How did this happen?

Look at the first line of code:


p <- ggplot(mtcars, aes(wt, mpg))

— it creates an object, you can see it listed in the workspace (it was not there before):

R-rstudio-1-ggplot-point-2

— that line of code does a number of things, so I will break it down piece by piece:

p <- ggplot(mtcars, aes(wt, mpg))


p <- ...

— means: create <- (assignment arrow) an object (named p, now in the workspace)


... ggplot( ... )

— means: do this using the ggplot() function, which is made available by installing the ggplot2 package of data and functions and then loading it with library(ggplot2)


... ggplot(mtcars ...)

— means create the plot using the data in the database (in R: dataframe) called mtcars

— mtcars is a dataframe that comes built into R (in the datasets package, available by default), and it is used throughout the ggplot2 documentation, which is why we can plot it here without loading any data of our own


... ggplot( ... aes(wt, mpg))

— aes(wt,mpg) means: map the variables wt and mpg to the aesthetic attributes of the plot.

In the ggplot2 book (Wickham, 2009, e.g. pp 12-), the things you see in a plot, the colour, size and shape of the points in a scatterplot, for example, are aesthetic attributes or visual properties.

— with aes(wt, mpg) we are informing R(ggplot) that the named variables are the ones to be used to create the plot.

Now, what happens next concerns the nature of the plot we want to produce: a scatterplot representing how, for the data we are using, values on one variable relate to values on the other.

A scatterplot represents each observation as a point, positioned according to the value of two variables. As well as a horizontal and a vertical position, each point also has a size, a colour and a shape. These attributes are called aesthetics, and are the properties that can be perceived on the graphic.

(Wickham: ggplot2 book, p.29; emphasis in text)

— The observations in the mtcars database are information about cars, including weight (wt) and miles per gallon (mpg).

[run ?mtcars or help(mtcars) in the console to see the documentation for this dataframe, in case you’re interested]

— This bit of the code asked the p object to include two attributes: wt and mpg.

— The aesthetics (aes) of the graphic object will be mapped to these variables.

— Nothing is seen yet, though the object now exists, until you run the next line of code.

The next line of code:

p + geom_point()

— adds (+) a layer to the plot, a geometric object: geom

— here we are asking for the addition of geom_point(), a scatterplot of points

— the variables mpg and wt will be mapped to the aesthetics, x-axis and y-axis position, of the scatterplot

The wonderful thing about the ggplot() function is that we can keep adding geoms to modify the plot.

— add a command to the second line of code to show the relationship between wt and mpg for the cars in the mtcars dataframe:


p <- ggplot(mtcars, aes(wt, mpg))
p + geom_point() + geom_smooth()

— here

+ geom_smooth()

adds a loess smoother to the plot, indicating the predicted miles per gallon (mpg) for cars, given the data in mtcars, given a car’s weight (wt).

R-rstudio-ggplot-loess-export

[If you want to know what loess means – this post looks like a good place to get started.]

Notice that there is an export button on the top of the plots window pane, click on it and export the plot as a pdf.

Where does that pdf get saved to? Good question.
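(One answer, as an aside: if you instead save the plot from the script using ggplot2’s ggsave() function, with a made-up file name in the sketch below, a relative file name is written to the current working directory, so you always know where the pdf went.)

ggsave("mtcars-wt-mpg-smooth.pdf", width = 6, height = 4)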

What have we learnt?

— starting a new script

— installing and loading packages

— creating a new plot


Postgraduate data analysis and interpretation

I am revising my previous posts on the whole process, from getting started with R to using it to do linear modelling and linear mixed-effects modelling.

PANO_20140111_155425

Sunset over the coast, on the road from Ravenglass to Broughton

What this revision will entail is some rewriting of the provided .R code, some updating, and some extension. Extensions will focus on model comparison methods, information criteria approaches to model evaluation, and the choices one must make in data analysis.



Mixed-effects modeling — four hour workshop — part IV: LMEs

For an example analysis, we collated a .csv file of data from a repeated measures lexical decision experiment, which can be downloaded here. The code I have used in running the analysis described in this post can be found here.

This file holds data on:

1. participant attributes;

2. word attributes;

3. participant responses to words in the lexical decision task.

In this post, I want to take a quick look at how you would actually run a mixed-effects model, using the lmer() function furnished by the lme4 package, written by Doug Bates (Bates, 2010, which appears to be a preprint of his presumed new book; see also Pinheiro & Bates, 2000). A useful how-to guide, one I followed when I first started doing this kind of modelling, is in Baayen (2008). An excellent, and very highly cited, introduction to mixed-effects modelling for psycholinguistics is found in Baayen, Davidson & Bates (2008).

First, then, we enter the following lines of code to load the libraries we are likely to need, to set the working directory to the folder where we have copied the full database, and to read that database as a dataframe in the session’s workspace:

Load libraries, define functions, load data

We start by loading the libraries of functions we are likely to need, defining a set of functions not in libraries, setting the working directory, and loading the dataframe we will work with into the R session workspace.


# load libraries of packages #############################################################################


# library(languageR)
library(lme4)
library(ggplot2)
library(grid)
library(rms)
library(plyr)
library(reshape2)
library(psych)
# library(gridExtra)



# load data files  #######################################################################################


# set the working directory

setwd("/Users/robdavies/Downloads")

# work out where your files are and substitute for the path for my files location, above


# # read in the item norms data
#
# item.norms <- read.csv(file="item norms 270312 050513.csv", head=TRUE,sep=",", na.strings = "-999")
# word.norms <- read.csv(file="word norms 140812.csv", head=TRUE,sep=",", na.strings = "-999")

item.coding <- read.csv(file="item coding info 270312.csv", head=TRUE,sep=",", na.strings = "-999")

# read in the merged new dataframe

item.words.norms <- read.csv(file="item.words.norms 050513.csv", head=TRUE,sep=",", na.strings = "NA")

# read in the subjects scores data file

subjects <- read.csv("ML scores 080612 220413.csv", header=T, na.strings = "-999")

# read in the behavioural data file

behaviour <- read.csv("ML data 080612.csv", header=T, na.strings = "-999")

# notice that we collated the subjects, items and behavioural data into a single dataframe which we then output
# using write.csv as a .csv file, we can read that collated data file in here, to save time

# read in the full database

RTs.norms.subjects <- read.csv("RTs.norms.subjects 061113.csv", header=T, na.strings = "-999")

# read in the database of correct responses to words only

nomissing.full <- read.csv("RTs.norms.subjects nomissing.full 061113.csv", header=T, na.strings = "-999")



# examine the data  ######################################################################################


summary(nomissing.full)

# note the presence of centred variables

 

If you run all that code, you will see the dataframes and functions appear listed in the workspace window of R-studio.

Run the models

What I will do next is perform three sets of mixed-effects model analyses. I will talk about the first set in this post.

In this first set, I shall compare models varying in the predictor variables included in the model specification. We shall say that models with more predictors are more complex than models with fewer predictors (a subset of the predictors specified for the more complex models).

I shall then compare models differing in complexity to see if the increased complexity is justified by improved fit to the data.

This is in response to: 1. reading Baayen (2008) and Pinheiro & Bates (2000); and 2. recommendations by a reviewer; though I note that the reviewer’s recommendations are, in their application, qualified in my mind by some cautionary comments made by Pinheiro & Bates (2000), and I confess that any problems of application or understanding are surely expressions of my misunderstanding.

What I am concerned about is the level of complexity that can be justified in the ultimate model by the contribution it makes to the improved fit of the model to the observed data. Of course, I have collected a set of predictors guided by theory, custom and practice in my field. But, as you have seen, I might choose to draw on a large set of potential predictors. I know that one cannot – to take one option used widely by psycholinguistics researchers – simply inspect a correlation table, comparing predictors with the outcome variable, and exclude those predictors that do not correlate with the outcome. Harrell (2001) comments that that approach has a biasing effect on the likelihood of detecting an effect that is really there (consuming ‘ghost degrees of freedom’, or something like that). So what should we do, or how should we satisfy ourselves that we should or should not include one or more predictor variables in our models?

You could start complex and step backwards to the simplest model that appears justified by model fit to observed data. I will return to that option another time.

In this post, I will start with the simplest model and build complexity up. Simpler models can be said to be nested in more complex models, e.g. a simple model including just main effects vs. a more complex model also including interactions. I will add complexity by adding sets of predictors, and I will compare each proximate pair of simple model and more complex model using the Likelihood Ratio Test (Pinheiro & Bates, 2000; see, especially, chapter 2, pp. 75-).

We can compare models with the same random effects (e.g. of subjects or items) but varying in complexity with respect to the fixed effects, i.e. replicable or manipulated variables like item length or participant age, and, separately, models with the same fixed effects but varying random effects.

Note that in comparing models varying in random effects, one estimates those effects using restricted maximum likelihood: you will see the argument REML = TRUE in the lmer() function call; that is what REML = TRUE requests. Note that in comparing models varying in fixed effects, one must estimate those effects using maximum likelihood: you will see the argument REML = FALSE in the lmer() function call; that is what REML = FALSE requests. See comments in Baayen (2008) and Pinheiro & Bates (2000); I will come back to these important terms another time, focusing in this post on just writing the code for running the models.

We can compare the more and less complex models using the Likelihood Ratio Test (LRT), which is executed with the anova([simple model], [more complex model]) function call. A χ² statistic is computed and evaluated against the χ² distribution to estimate the probability of an increase in model likelihood as large as or larger than that observed when comparing the more and the less complex model, given the null hypothesis of no gain in likelihood.

Note that I log transform the RTs but one might also consider examining the reciprocal RT.
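For the record, a sketch only (the RT column name is as in nomissing.full above; the -1000/RT scaling is one convention you will see, chosen so that larger values still correspond to slower responses): the reciprocal transformation could be coded as follows.

nomissing.full$invRT <- -1000/nomissing.full$RT
summary(nomissing.full$invRT)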


# we can move to some multilevel modelling ###############################################################

# we might start by examining the effects of predictors in a series of additions of sets of predictors
# we can hold the random effects, of subjects and items, constant, while varying the fixed effects
# we can compare the relative fit of models to data using the anova() likelihood ratio test
# in this series of models, we will use ML fitting

full.lmer0 <- lmer(log(RT) ~

# Trial.order +
#
# cAge + cTOWREW + cTOWRENW + cART +
#
# cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer0,corr=F)

#

full.lmer1 <- lmer(log(RT) ~

 Trial.order +

 # cAge + cTOWREW + cTOWRENW + cART +
 #
 # cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer1,corr=F)

#

full.lmer2 <- lmer(log(RT) ~

 Trial.order +

 cAge + cTOWREW + cTOWRENW + cART +

 # cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer2,corr=F)

#

full.lmer3 <- lmer(log(RT) ~

 Trial.order +

 cAge + cTOWREW + cTOWRENW + cART +

 cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer3,corr=F)

#

full.lmer4 <- lmer(log(RT) ~

 Trial.order +

 (cAge + cTOWREW + cTOWRENW + cART)*

 (cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG) +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer4,corr=F)

# compare models of varying complexity

anova(full.lmer0, full.lmer1)
anova(full.lmer1, full.lmer2)
anova(full.lmer2, full.lmer3)
anova(full.lmer3, full.lmer4)

This code will first print out the summaries of each model, along the lines you have already seen. It will then give you this:


# > # compare models of varying complexity
# >
# > anova(full.lmer0, full.lmer1)
# Data: nomissing.full
# Models:
# full.lmer0: log(RT) ~ (1 | subjectID) + (1 | item_name)
# full.lmer1: log(RT) ~ Trial.order + (1 | subjectID) + (1 | item_name)
# Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
# full.lmer0 4 -977.58 -951.31 492.79
# full.lmer1 5 -975.83 -943.00 492.92 0.2545 1 0.6139
# > anova(full.lmer1, full.lmer2)
# Data: nomissing.full
# Models:
# full.lmer1: log(RT) ~ Trial.order + (1 | subjectID) + (1 | item_name)
# full.lmer2: log(RT) ~ Trial.order + cAge + cTOWREW + cTOWRENW + cART + (1 |
# full.lmer2: subjectID) + (1 | item_name)
# Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
# full.lmer1 5 -975.83 -943.00 492.92
# full.lmer2 9 -977.09 -917.98 497.54 9.2541 4 0.05505 .
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# > anova(full.lmer2, full.lmer3)
# Data: nomissing.full
# Models:
# full.lmer2: log(RT) ~ Trial.order + cAge + cTOWREW + cTOWRENW + cART + (1 |
# full.lmer2: subjectID) + (1 | item_name)
# full.lmer3: log(RT) ~ Trial.order + cAge + cTOWREW + cTOWRENW + cART + cLength +
# full.lmer3: cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG + (1 | subjectID) +
# full.lmer3: (1 | item_name)
# Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
# full.lmer2 9 -977.09 -917.98 497.54
# full.lmer3 15 -1115.42 -1016.91 572.71 150.34 6 < 2.2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# > anova(full.lmer3, full.lmer4)
# Data: nomissing.full
# Models:
# full.lmer3: log(RT) ~ Trial.order + cAge + cTOWREW + cTOWRENW + cART + cLength +
# full.lmer3: cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG + (1 | subjectID) +
# full.lmer3: (1 | item_name)
# full.lmer4: log(RT) ~ Trial.order + (cAge + cTOWREW + cTOWRENW + cART) *
# full.lmer4: (cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG) +
# full.lmer4: (1 | subjectID) + (1 | item_name)
# Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
# full.lmer3 15 -1115.4 -1016.91 572.71
# full.lmer4 39 -1125.7 -869.56 601.84 58.262 24 0.0001119 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# it looks like the full model, including interactions but not the trialorder variable, is justified

Notice:

— It is not very pretty, but it does tell you, for each pair of models being compared, what predictors are included in each model and the outcome of the likelihood ratio test, giving the numbers I usually report: the χ², df and the p-value.

— I am looking at this output to see if, for each pair of models, the more complex model fits the data significantly better (if p <= 0.05).

– I add predictors in blocks and assess the improvement of fit associated with the whole block.

I have made a lot of promises to explain things more another time. So I will add one more. Pinheiro & Bates (2000; p. 88) argue that Likelihood Ratio Test comparisons of models varying in fixed effects tend to be anticonservative i.e. will see you observe significant differences in model fit more often than you should. I think they are talking, especially, about situations in which the number of model parameter differences (differences between the complex model and the nested simpler model) is large relative to the number of observations. This is not really a worry for this dataset, but I will come back to the substance of this view, and alternatives to the approach taken here.

What next?

What I would seek to do in the next series of analyses is determine if the effects that are significant appear to be significant due:

1. to the presence of other, nonsignificant effects — making the check by removing the nonsignificant effects then rerunning the model;

2. to the presence of outlier observations — making the check by removing observations associated with large model residuals then rerunning the model.

What have we learnt?

(The code I have used in running the analysis described in this post can be found here.)

Firstly, I am going to have to do a lot more explaining, for my own sake, and have promised to do so for a wide range of ideas, including those concerning the methods of model fitting, and, especially, the discussion around how one determines the appropriate set of fixed effects.

Secondly, that the model code is both simple and – on its face – reasonably intuitive.

Key vocabulary (mostly, a list of things to be explained)

fixed effects

random effects

random intercepts

random slopes

interactions

multiplicative products

centering

maximum likelihood

restricted maximum likelihood

Markov chain Monte Carlo

Likelihood Ratio Test

lmer()

anova()

Reading

Baayen, R. H. (2008). Analyzing linguistic data. Cambridge University Press.
Bates, D. M. (2005). Fitting linear mixed models in R. R News, 5, 27-30.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioural sciences (3rd. edition). Mahwah, NJ: Lawrence Erlbaum Associates.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York,NY: Cambridge University Press.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-plus (Statistics and Computing). New York: Springer.


Mixed-effects modeling — four hour workshop — part III: Regression modeling

We completed a study on the factors that influence visual word recognition performance in lexical decision. We then put together a dataset for analysis here. We examined the distribution and inter-relation of critical variables in that dataset, acting to transform the data where we found indications of concern, here.

We will now proceed to analyse the lexical decision reaction times (RTs). The dataset collated in the last post consisted of observations corresponding to correct responses to words only: RTs, along with data about the subject and item attributes. It will be recalled that we wish to examine the effects on reading of who is reading and what they are reading.

The last collated dataset can be downloaded here.

The code I am going to use in this post can be downloaded here.

We are going to start by ignoring the random effects structure of the data, i.e. the fact that the data are collected in a repeated measures design where every participant saw every word so that there are dependencies between observations made by each participant (some people are slower than others, and most of their responses will be slower) and dependencies between observations made to each item (some words are harder than others, and will be found to be harder by most readers). What we will do is perform Ordinary Least Squares (OLS) regression with our data, and drawing scatterplots will be quite central to our work (see previous posts here and here). I am going to assume familiarity with correlation and OLS regression (but see previous posts here and here).
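To make that OLS starting point concrete (a sketch only, not the model specification we will settle on, and using predictor names that appear in the plotting code below), a single lm() call fit to these data would look like this:

nomissing.full.lm <- lm(RT ~ Length + LgSUBTLCD + Age + TOWRE_wordacc,
                        data = nomissing.full)
summary(nomissing.full.lm)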

The relationships between pairs of variables — scatterplots

We are interested in whether our readers are recognizing some words faster than others. Previous research (e.g. Balota et al., 2004) might lead us to expect that they should respond faster to more frequent, shorter, words that are easier to imagine (the referent of). We are also interested in whether individuals vary in how they recognize words. We might expect that readers who score high on standardized reading tests (e.g. the TOWRE) would also respond faster, on average, in the experimental reading task (lexical decision). The interesting question, for me, is whether there is an interaction between the first set of effects — of what is being read — and the second set of effects — who is doing the reading.

We have already looked at plots illustrating the relationships between various sets of the predictor variables: scatterplot matrices.

We are now going to create a trellis of scatterplots to examine the relationship between each predictor variable and word recognition RT.

Let’s start with a scatterplot. I assume the collated dataset has been downloaded and is available for use in R’s workspace. If you copy the following code to draw a plot:


plength <- ggplot(nomissing.full,aes(x=Length, y=RT))
plength <- plength + geom_jitter(colour = "black", alpha = 1/5)
plength <- plength + xlab("Length") + theme_bw()
plength <- plength + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
plength <- plength + geom_smooth(method="lm", size = 1.5)
plength

You will get this:

LME-4hrworkshop-scatter-length

Notice:

– I am adding each element, one at a time, making for more lines of code but greater transparency between code and effect.

— I first ask to have a plot object created: plength <- ggplot(nomissing.full,aes(x=Length, y=RT))

— the object is called plength, the function doing the creation is ggplot(), it is acting on the database nomissing.full, and it will map the variables Length to x coordinate and RT to y coordinate

— the mapping is from data to aesthetic attributes like point location in a scatterplot

— I then add to what is a blank plot the geometric object (geom) point — run the first line alone and see what happens

— here I am asking for jittered points with + geom_jitter(colour = “black”, alpha = 1/5) — where a bit of random noise has been added to the point locations because so many would otherwise appear in the same place — overplotting that obscures the pattern of interest — I am also modifying the relative opacity of the points with alpha = 1/5

– I could add unjittered scatterplot points with: + geom_point() — substitute that in for geom_jitter() and see what happens — I could also change the colour scheme

– I then change the axis label with: + xlab(“Length”)

– I modify the size of the labeling with: + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))

– I add a smoother i.e. a line showing a linear model prediction of y-axis values for the x-axis variable values with: + geom_smooth(method=”lm”, size = 1.5)  – and I ask for the size to be increased for better readability

– and I ask for the plot to have a white background with: + theme_bw()

In fact, I would like to see a plot of all the key predictor variables and their effect on RT, considering just the bivariate relationships. It is relatively easy to first draw each plot, then create a trellis of plots, as follows:


# 1. are RTs affected by item attributes?

# let's look at the way in which RT relates to the word attributes, picking the most obvious variables
# we will focus, here, on the database holding correct responses to words

head(nomissing.full, n = 2)

plength <- ggplot(nomissing.full,aes(x=Length, y=RT))
plength <- plength + geom_jitter(colour = "black", alpha = 1/5)
plength <- plength + xlab("Length") + theme_bw()
plength <- plength + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
plength <- plength + geom_smooth(method="lm", size = 1.5)
plength

pOLD <- ggplot(nomissing.full,aes(x=OLD, y=RT))
pOLD <- pOLD + geom_jitter(colour = "black", alpha = 1/5)
pOLD <- pOLD + xlab("OLD") + theme_bw()
pOLD <- pOLD + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pOLD <- pOLD + geom_smooth(method="lm", size = 1.5)
pOLD

pfreq <- ggplot(nomissing.full,aes(x=LgSUBTLCD, y=RT))
pfreq <- pfreq + geom_point(colour = "black", alpha = 1/5)
pfreq <- pfreq + xlab("LgSUBTLCD") + theme_bw()
pfreq <- pfreq + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pfreq <- pfreq + geom_smooth(method="lm", size = 1.5)
pfreq

pAoA_Kup_lem <- ggplot(nomissing.full,aes(x=AoA_Kup_lem, y=RT))
pAoA_Kup_lem <- pAoA_Kup_lem + geom_point(colour = "black", alpha = 1/5)
pAoA_Kup_lem <- pAoA_Kup_lem + xlab("AoA") + theme_bw()
pAoA_Kup_lem <- pAoA_Kup_lem + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pAoA_Kup_lem <- pAoA_Kup_lem + geom_smooth(method="lm", size = 1.5)
pAoA_Kup_lem

pbrookesIMG <- ggplot(nomissing.full,aes(x=brookesIMG, y=RT))
pbrookesIMG <- pbrookesIMG + geom_jitter(colour = "black", alpha = 1/5)
pbrookesIMG <- pbrookesIMG + xlab("Imageability") + theme_bw()
pbrookesIMG <- pbrookesIMG + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pbrookesIMG <- pbrookesIMG + geom_smooth(method="lm", size = 1.5)
pbrookesIMG

# 2. are RTs affected by subject attributes?

# let's look at the way in which RT relates to the word attributes, picking the most obvious variables
# we will focus, here, on the database holding correct responses to words

head(nomissing.full, n = 2)

pAge <- ggplot(nomissing.full,aes(x=Age, y=RT))
pAge <- pAge + geom_jitter(colour = "black", alpha = 1/5)
pAge <- pAge + xlab("Age") + theme_bw()
pAge <- pAge + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pAge <- pAge + geom_smooth(method="lm", size = 1.5)
pAge

pTOWRE_wordacc <- ggplot(nomissing.full,aes(x=TOWRE_wordacc, y=RT))
pTOWRE_wordacc <- pTOWRE_wordacc + geom_jitter(colour = "black", alpha = 1/5)
pTOWRE_wordacc <- pTOWRE_wordacc + xlab("TOWRE word") + theme_bw()
pTOWRE_wordacc <- pTOWRE_wordacc + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pTOWRE_wordacc <- pTOWRE_wordacc + geom_smooth(method="lm", size = 1.5)
pTOWRE_wordacc

pTOWRE_nonwordacc <- ggplot(nomissing.full,aes(x=TOWRE_nonwordacc, y=RT))
pTOWRE_nonwordacc <- pTOWRE_nonwordacc + geom_jitter(colour = "black", alpha = 1/5)
pTOWRE_nonwordacc <- pTOWRE_nonwordacc + xlab("TOWRE nonword") + theme_bw()
pTOWRE_nonwordacc <- pTOWRE_nonwordacc + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pTOWRE_nonwordacc <- pTOWRE_nonwordacc + geom_smooth(method="lm", size = 1.5)
pTOWRE_nonwordacc

pART_HRminusFR <- ggplot(nomissing.full,aes(x=ART_HRminusFR, y=RT))
pART_HRminusFR <- pART_HRminusFR + geom_jitter(colour = "black", alpha = 1/5)
pART_HRminusFR <- pART_HRminusFR + xlab("ART print exposure") + theme_bw()
pART_HRminusFR <- pART_HRminusFR + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pART_HRminusFR <- pART_HRminusFR + geom_smooth(method="lm", size = 1.5)
pART_HRminusFR

Notice:

— What changes between plots is: the plot object name, the x-axis variable name (spelled exactly as it appears in the dataframe) and the x-axis label (see the sketch after these notes)

— You can run each set of code to examine the relationships between predictor variables and RT one at a time.
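Since only the variable name and the label change from plot to plot, you could, if you wish, wrap the repeated lines in a small helper function. This is only a sketch of one way to do it (the function name plot_rt_by and the use of the .data pronoun, which needs a reasonably recent version of ggplot2, are my additions, not code from the workshop):

library(ggplot2) # assumed to be loaded already for the plotting code above

plot_rt_by <- function(df, xvar, xlabel) {
# build the same jittered scatterplot, formatting and smoother as above for any named x variable
p <- ggplot(df, aes(x = .data[[xvar]], y = RT))
p <- p + geom_jitter(colour = "black", alpha = 1/5)
p <- p + xlab(xlabel) + theme_bw()
p <- p + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
p <- p + geom_smooth(method="lm", size = 1.5)
p
}

# e.g.
# plength <- plot_rt_by(nomissing.full, "Length", "Length")
# pAge <- plot_rt_by(nomissing.full, "Age", "Age")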

We can create a grid and then print the plots to specific locations in the grid, as follows, outputting everything as a pdf:


pdf("nomissing.full-RT-by-main-effects-subjects-061113.pdf", width = 18, height = 12)

grid.newpage()

pushViewport(viewport(layout = grid.layout(2,5)))

vplayout <- function(x,y)
viewport(layout.pos.row = x, layout.pos.col = y)

print(plength, vp = vplayout(1,1))
print(pOLD, vp = vplayout(1,2))
print(pfreq, vp = vplayout(1,3))
print(pAoA_Kup_lem, vp = vplayout(1,4))
print(pbrookesIMG, vp = vplayout(1,5))

print(pAge, vp = vplayout(2,1))
print(pTOWRE_wordacc, vp = vplayout(2,2))
print(pTOWRE_nonwordacc, vp = vplayout(2,3))
print(pART_HRminusFR, vp = vplayout(2,4))

dev.off()

If you run that code it will get you this:

LME-4hrworkshop-scatter-trellis

What is your evaluation?

The next question obviously is the extent to which these variables make unique contributions to the explanation of variance in reading RT, when taking the other variables into account, and whether we might detect interactions between effects of item attributes like frequency and person attributes like word reading skill.

We can examine potential interactions by splitting the data into subsets, effectively, through the creation of a variable that associates observations with intervals in one of the components of the potential interaction. In other words, if we wanted to look at the word reading skill x frequency interaction, we might look at the frequency effect for each of four subsets of the data, corresponding to four equal intervals in the TOWRE word reading accuracy variable. We can use the following code to do that:


# first, add coding variables that split the dataset into equal intervals according to scores on each subject
# attribute of interest

nomissing.full$TOWRE_wordacc_cutequal <- cut_interval(nomissing.full$TOWRE_wordacc, n = 4)

# then plot the interaction: word reading skill (TOWRE word accuracy) x frequency

pdf("nomissing.full-freqwords-rt-lm-scatter-061113.pdf", width = 12, height = 8)

pLgSUBTLCD <- ggplot(nomissing.full,aes(x=LgSUBTLCD,y=RT))
pLgSUBTLCD <- pLgSUBTLCD + geom_jitter(colour = "black", alpha = 1/5) + theme_bw() + xlab("LgSUBTLCD")
pLgSUBTLCD <- pLgSUBTLCD + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=10))
pLgSUBTLCD <- pLgSUBTLCD + geom_smooth(method="lm", size = 1.5, colour = "red")
pLgSUBTLCD <- pLgSUBTLCD + facet_wrap(~TOWRE_wordacc_cutequal,nrow=1)
pLgSUBTLCD

dev.off()

Running this code will get you this:

LME-4hrworkshop-wordxfreq-scatter

To me, there is a possible interaction such that the frequency effect (LgSUBTLCD) may be getting smaller for more skilled word readers, but that modulation of frequency by word reading skill does not seem very strong.

OLS Regression — the full dataset

We can perform a full regression on the dataset to look for the effects of interest, effects of subject attributes, of word attributes and of the interactions between the first and the last. We are going to ignore the fact that, due to the dependencies of observations under subject and item, we do not really have as many degrees of freedom as we appear to have (several thousand observations).

We can run a regression taking all the data and ignoring the clustering of observations:


# Inspect the database

summary(nomissing.full)

# recommended that first make an object that summarizes the distribution

nomissing.full.dd <- datadist(nomissing.full)

# specify that will be working with the subjects.dd data distribution

options(datadist = "nomissing.full.dd")

# run the ols() function to perform the regression

nomissing.full.ols <- ols(logrt ~

(cAge + cTOWREW + cTOWRENW + cART)*
(cLength + cOLD + cIMG + cAOA + cLgSUBTLCD),

data = nomissing.full)

# enter the name of the model to get a summary

nomissing.full.ols

That will get you this:


# > nomissing.full.ols
#
# Linear Regression Model
#
# ols(formula = logrt ~ (cAge + cTOWREW + cTOWRENW + cART) * (cLength +
#                                                               cOLD + cIMG + cAOA + cLgSUBTLCD), data = nomissing.full)
#
# Model Likelihood     Discrimination
# Ratio Test           Indexes
# Obs     5257    LR chi2    677.56    R2       0.121
# sigma 0.1052    d.f.           29    R2 adj   0.116
# d.f.    5227    Pr(> chi2) 0.0000    g        0.043
#
# Residuals
#
# Min       1Q   Median       3Q      Max
# -0.37241 -0.06810 -0.01275  0.05202  0.55272
#
# Coef    S.E.   t       Pr(>|t|)
# Intercept              2.7789 0.0015 1914.52 <0.0001
# cAge                   0.0015 0.0001   13.95 <0.0001
# cTOWREW               -0.0018 0.0002   -8.42 <0.0001
# cTOWRENW               0.0009 0.0003    3.40 0.0007
# cART                  -0.0021 0.0002  -11.58 <0.0001
# cLength               -0.0101 0.0027   -3.68 0.0002
# cOLD                   0.0012 0.0071    0.18 0.8599
# cIMG                  -0.0066 0.0013   -5.00 <0.0001
# cAOA                   0.0022 0.0010    2.16 0.0310
# cLgSUBTLCD            -0.0365 0.0033  -10.98 <0.0001
# cAge * cLength         0.0003 0.0002    1.46 0.1453
# cAge * cOLD           -0.0006 0.0005   -1.13 0.2580
# cAge * cIMG            0.0000 0.0001   -0.03 0.9798
# cAge * cAOA           -0.0001 0.0001   -1.02 0.3099
# cAge * cLgSUBTLCD      0.0001 0.0002    0.42 0.6767
# cTOWREW * cLength      0.0002 0.0004    0.54 0.5915
# cTOWREW * cOLD         0.0002 0.0011    0.22 0.8265
# cTOWREW * cIMG         0.0000 0.0002    0.02 0.9859
# cTOWREW * cAOA         0.0002 0.0002    1.14 0.2531
# cTOWREW * cLgSUBTLCD   0.0002 0.0005    0.31 0.7600
# cTOWRENW * cLength    -0.0004 0.0005   -0.80 0.4213
# cTOWRENW * cOLD        0.0009 0.0012    0.72 0.4735
# cTOWRENW * cIMG        0.0003 0.0002    1.29 0.1975
# cTOWRENW * cAOA        0.0000 0.0002   -0.20 0.8377
# cTOWRENW * cLgSUBTLCD  0.0015 0.0006    2.62 0.0088
# cART * cLength        -0.0004 0.0003   -1.29 0.1967
# cART * cOLD            0.0002 0.0009    0.22 0.8263
# cART * cIMG           -0.0001 0.0002   -0.65 0.5133
# cART * cAOA            0.0000 0.0001    0.17 0.8670
# cART * cLgSUBTLCD     -0.0005 0.0004   -1.29 0.1964

which looks perfectly reasonable.

Notice:

— A regression model might be written as:

word.skill = age + nonword.skill + print.exposure + intercept + error

— I can break the ols() function call down; for simplicity, the breakdown below uses the shorter model from the earlier subjects analysis (predicting word reading skill from subject attributes), but the structure of the call is the same as for the model just run:

subjects.ols —  the name of the model to be created

<- ols() — the model function call

TOWRE_wordacc — the dependent or outcome variable

~ — is predicted by

Age + TOWRE_nonwordacc + ART_HRminusFR — the predictors

, data = subjects — the specification of what dataframe will be used

— Notice also that when you run the model, you get no output other than the command itself being echoed in the console (and no error message, if all has gone right).

— To get the summary of the model, you need to enter the model name:

subjects.ols

— Notice when you look at the summary the following features:

Obs — the number of observations

d.f. — residual degrees of freedom

Sigma — the residual standard error i.e. the variability in model residuals — models that better fit the data would be associated with smaller spread of model errors (residuals)

Residuals — the listed residuals quartiles, an indication of the distribution of model errors — linear regression assumes that model errors are random, and that they are therefore distributed in a normal distribution

Model Likelihood Ratio Test — a measure of goodness of fit, with the \chi^2 test statistic, degrees of freedom (model predictors) and p-value reported

R^2 and adj. R^2 — measures of the proportion of variance explained by the model.

In estimating the outcome (here, word reading skill), we are predicting the outcome given what we know about the predictors. Most of the time, the prediction will be imperfect, the predicted outcome will diverge from the observed outcome values. How closely do predicted values correspond to observed values? In other words, how far is the outcome independent of the predictor? Or, how much does the way in which values of the outcome vary coincide with the way in which values of the predictor vary? (I am paraphrasing Cohen et al., 2003, here.)

Let’s call the outcome Y and the predictor X. How much does the way in which the values of Y vary coincide with the way in which values of X vary? We are talking about variability, and that is typically measured in terms of the standard deviation (SD) or its square, the variance (SD^2). Variances are additive while SDs are not, so we typically work with variances. In asking the string of questions given in the foregoing paragraph, we are thinking about a problem that can be expressed like this: what part of Y variance is associated with X (this part will equal the variance of the estimated scores) and what part of Y variance is independent of X (the model error, equal to the variance of the discrepancies between estimated and observed Y)? If there is no relationship between X and Y, our best estimate of Y will be the mean of Y. R^2 is the proportion of Y variance associated with X – our predictors.
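To make that concrete, here is a small sketch, not part of the original analysis, using an arbitrary two-predictor model to show that R^2 is just the proportion of outcome variance that is not left in the residuals:

# an arbitrary small model, chosen only to illustrate the definition of R-squared
illustration.lm <- lm(logrt ~ cLength + cLgSUBTLCD, data = nomissing.full)

ss.total <- sum((nomissing.full$logrt - mean(nomissing.full$logrt))^2) # total outcome (Y) variation
ss.residual <- sum(residuals(illustration.lm)^2) # Y variation left unexplained by the model

1 - ss.residual/ss.total # should match summary(illustration.lm)$r.squared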

Coef — a table presenting the model estimates for the effects of each predictor, while taking into account all other predictors

Notice:

— The estimated regression coefficients represent the rate of change in outcome units per unit change in the predictor variable.

— You could multiply a coefficient by a value of its predictor to get an estimate of that predictor's contribution to the outcome, e.g. in the simpler subjects model, word reading skill is higher by 0.7382 for each unit increase in nonword reading score.

— The intercept (the regression constant or Y intercept) represents the estimated outcome when the predictors are all equal to zero – you could say, the average outcome value assuming the predictors have no effect; in the simpler subjects model, a word reading score of about 55.

For some problems, as here, it is convenient to center the predictor variables by subtracting a variable’s mean value from each of its values. This can help alleviate nonessential collinearity (see here) due to scaling, when working with curvilinear effects terms and interactions, especially (Cohen et al., 2003). If both the outcome and predictor variables have been centered then the intercept will also equal zero. If just the predictor variable is centered then the intercept will equal the mean value for the outcome variable. Where there is no meaningful zero for a predictor (e.g. with age) centering predictors can help the interpretation of results. The slopes of the effects of the predictors are unaffected by centering.

— In addition to the coefficients, we also have values for each estimate for:

S.E. — the standard error of the estimated coefficient of the effect of the predictor. Imagine that we sampled our population many times and estimated the effect each time: those estimates would constitute a sampling distribution for the effect estimate coefficient, and would be approximately normally distributed. The spread of those estimates is given by the SE of the coefficients, which can be calculated in terms of the portion of variance not explained by the model along with the SDs of Y and of X.

t — the t statistic associated with the Null Hypothesis Significance Test (NHST, but see Cohen, 1994) that the effect of the predictor is actually zero. We assume that there is no effect of the predictor. We have an estimate of the possible spread of coefficient estimate values, given repeated sampling from the population, and reruns of the regression – that is given by the SE of the coefficient. The t-test in general is given by:

t = \frac{sample.value - null.hypothesis.value}{standard.error}

therefore, for regression coefficients:

t = \frac{B_{effect} - H_{0}}{SE_{B_{effect}}}

i.e. for a null hypothesis that the coefficient should be zero, we divide the estimate of the coefficient, B, by the standard error of possible coefficients, to get a t-statistic

Pr(>|t|) — we can then get the probability of obtaining a value of t this large or larger by reference to the distribution of t values for the degrees of freedom at hand.
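As a sketch of how those columns relate to one another (again an illustration rather than code from the workshop, reusing the kind of small lm() model shown earlier), the t values are the estimates divided by their standard errors, and the p values come from the t distribution with the residual degrees of freedom:

illustration.lm <- lm(logrt ~ cLength + cLgSUBTLCD, data = nomissing.full)

coefs <- coef(summary(illustration.lm)) # columns: Estimate, Std. Error, t value, Pr(>|t|)

coefs[, "Estimate"]/coefs[, "Std. Error"] # reproduces the t value column

2*pt(-abs(coefs[, "t value"]), df = df.residual(illustration.lm)) # reproduces the p value column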

But is this really the right thing to do?

As I discuss in my talk, it is not (slides to be uploaded). I have already alluded to the problem. The model makes sense on its face. And the effects estimates might be reasonable. The problem is that we are not really giving the randomness its due here. Did we really have 5250+ independent observations? How accurate are our estimates of the standard errors?

OLS Regression — the by-items mean RTs dataset

The field has long been aware of random variation in datasets like this, where we are examining the effect of e.g. word attributes on the responses of samples of participants in a repeated-measures design. The solution has not been to throw all the data into a regression. Rather, we have averaged responses to each item over participants. To put it another way, for each word, we have averaged over the responses recorded for the participants in our sample, to arrive at an average person response. The folk explanation for this procedure is that we are ‘washing out’ random variation to arrive at a mean response but if you consider discussions by e.g. Raaijmakers et al. (1999) it is not quite that simple (the whole problem turns on the ‘language as fixed effect’ issue, the paper is highly recommended). Let’s look at how this works for our data set.

Note I am going to be a bit lazy here and not centre the predictors before performing the analysis nor log transform RTs. You’re welcome to redo the analysis having made that preliminary step.

The first thing we need to do is create a database holding by-items mean RTs. We do this by calculating the average RT (over subjects) of responses to each item.


# define functions for use later:

# we want to examine performance averaging over subjects to get a mean RT for each item ie by items RTs

# first create a function to calculate mean RTs

meanRT <- function(df) {mean(df$RT)}

# then apply it to items in our original full dataset, outputting a dataset of by-items mean RTs

nomissing.full.byitems.RT.means <- ddply(nomissing.full, .(item_name), meanRT)

# inspect the results

summary(nomissing.full.byitems.RT.means)

# if you click on the dataframe name in the workspace window in RStudio you can see that we have created a set of 160
# (mean) RTs

# we want to rename the variable calculated for us from V1 to byitems.means

nomissing.full.byitems.RT.means <- rename(nomissing.full.byitems.RT.means, c(V1 = "byitems.means"))

The next thing we can do is add information about the items into this dataframe by merging it with our item norms dataframe, as follows:


# we can analyse what affects these by-items means RTs but must first merge this dataset with the item norms dataset:

byitems <- merge(nomissing.full.byitems.RT.means, item.words.norms, by = "item_name")

# inspect the results

summary(byitems)

We can then run another OLS regression, this time on just the by-items means RTs therefore without the subject attribute variables or, critically, the interactions, as follows:


# recommended that first make an object that summarizes the distribution

byitems.dd <- datadist(byitems)

# specify that will be working with the subjects.dd data distribution

options(datadist = "byitems.dd")

# run the ols() function to perform the regression

byitems.ols <- ols(byitems.means ~

Length + OLD + brookesIMG + AoA_Kup_lem + LgSUBTLCD,

data = byitems)

# enter the name of the model to get a summary

byitems.ols

That will get you this:


> byitems.ols

Linear Regression Model

ols(formula = byitems.means ~ Length + OLD + brookesIMG + AoA_Kup_lem +
LgSUBTLCD, data = byitems)

Model Likelihood     Discrimination
Ratio Test           Indexes
Obs       160    LR chi2    126.84    R2       0.547
sigma 39.8543    d.f.            5    R2 adj   0.533
d.f.      154    Pr(> chi2) 0.0000    g       47.664

Residuals

Min       1Q   Median       3Q      Max
-96.8953 -26.1113  -0.3104  20.8134 167.8680

Coef     S.E.    t     Pr(>|t|)
Intercept   912.2377 49.6487 18.37 <0.0001
Length      -19.2256  5.8640 -3.28 0.0013
OLD           6.5787 15.2532  0.43 0.6669
brookesIMG  -12.6055  2.8667 -4.40 <0.0001
AoA_Kup_lem   4.0169  2.2254  1.81 0.0730
LgSUBTLCD   -58.1738  7.1677 -8.12 <0.0001

Notice:

— the proportion of variance explained by the model is considerably larger than that explained by the model for the full dataset

— the residual standard error is also larger, though note that this model was fit to raw RTs in milliseconds while the full-dataset model was fit to log RTs, so the two values of sigma are not directly comparable

— obviously the degrees of freedom are reduced

— the pattern of effects seems broadly similar, i.e. which effects are significant and which are not

How could we take into account both random variation due to items (as in our by-items analysis) and random variation due to subjects?

As I discuss in my talk (slides to be uploaded here), the discussions by Clark (1973) and Raaijmakers et al. (1999) are enlightening. We are heading towards the mixed-effects approach argued for by Baayen (2008; Baayen et al., 2008) and many others to do the job. We might consider one last option: to perform a by-items analysis not on by-items mean RT but on the RTs of each participant considered as an individual (see Lorch & Myers, 1990). We might then analyse the coefficients resulting from that per-subject analysis: see Balota et al. (2004) for one example of this approach; see Snijders & Bosker (1999), also Baayen et al. (2008), for a discussion of the problems with it.

We can first define a function to perform a regression analysis for each subject:


model <- function(df){lm(logrt ~

Length + OLD + brookesIMG + AoA_Kup_lem +
LgSUBTLCD,

data = df)}

Notice:

— I am switching to the lm() function to run regression analyses — the effect is the same but the function will be a bit simpler to handle.

Then use the function to model each person individually, creating a dataframe with results of each model:


nomissing.full.models <- dlply(nomissing.full, .(subjectID), model)

rsq <- function(x) summary(x)$r.squared
nomissing.full.models.coefs <- ldply(nomissing.full.models, function(x) c(coef(x),rsquare = rsq(x)))

head(nomissing.full.models.coefs,n=3)
length(nomissing.full.models.coefs$subjectID)
summary(nomissing.full.models.coefs)

# then merge the models coefs and subject scores dbases by subject

nomissing.full.models.coefs.subjects <- merge(nomissing.full.models.coefs, subjects, by="subjectID")

# inspect

summary(nomissing.full.models.coefs.subjects)

We can use these data to analyse whether the coefficients of the effects of word attributes, estimated separately for each subject, are themselves predictable from data about subject attributes.


# we can now analyse whether the coefficients are predicted by subject variables

# rename the dataframe for simplicity:

subjects.coefs <- nomissing.full.models.coefs.subjects

# recommended that first make an object that summarizes the distribution

subjects.coefs.dd <- datadist(subjects.coefs)

# specify that will be working with the subjects.dd data distribution

options(datadist = "subjects.coefs.dd")

# run the ols() function to perform the regression

subjects.coefs.ols <- ols(LgSUBTLCD ~

Age + TOWRE_wordacc + TOWRE_nonwordacc + ART_HRminusFR,

data = subjects.coefs)

# enter the name of the model to get a summary

subjects.coefs.ols

Running that code will get you this:


> subjects.coefs.ols

Linear Regression Model

ols(formula = LgSUBTLCD ~ Age + TOWRE_wordacc + TOWRE_nonwordacc +
ART_HRminusFR, data = subjects.coefs)

Model Likelihood     Discrimination
Ratio Test           Indexes
Obs       34    LR chi2     19.94    R2       0.444
sigma 0.0176    d.f.            4    R2 adj   0.367
d.f.      29    Pr(> chi2) 0.0005    g        0.015

Residuals

Min        1Q    Median        3Q       Max
-0.029786 -0.014093  0.002851  0.012819  0.032995

Coef    S.E.   t     Pr(>|t|)
Intercept        -0.1264 0.0328 -3.86 0.0006
Age               0.0001 0.0002  0.55 0.5856
TOWRE_wordacc     0.0001 0.0005  0.11 0.9101
TOWRE_nonwordacc  0.0017 0.0005  3.26 0.0029
ART_HRminusFR    -0.0006 0.0004 -1.57 0.1268

Notice that the per-subject frequency effect does appear to be predicted by subject nonword reading performance, in line with the frequency x nonword reading interaction seen in the analysis of the full dataset, above.

What have we learnt?

datadist()

ols()

ddply()

ggplot() — see previous posts on data visualization

Reading

Baayen, R. H. (2008). Analyzing linguistic data. Cambridge University Press.

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of memory and language, 59(4), 390-412.

Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & Yap, M. (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology: General, 133(2), 283.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioural sciences (3rd. edition). Mahwah, NJ: Lawrence Erlbaum Associates.

Harrell, F. E., Jr. (2001). Regression modeling strategies: With applications to linear models, logistic regression, and survival analysis. New York, NY: Springer.

Lorch, R. F., & Myers, J. L. (1990). Regression analyses of repeated measures data in cognitive research. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(1), 149.

Raaijmakers, J. G., Schrijnemakers, J., & Gremmen, F. (1999). How to deal with “the language-as-fixed-effect fallacy”: Common misconceptions and alternative solutions. Journal of Memory and Language, 41(3), 416-426.

Snijders, T. A. B., & Bosker, R. J. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Thousand Oaks, CA: Sage Publications.


Mixed-effects modeling — four hour workshop — part II: Data Exploration

We got our data together in Part I of this workshop, eventually arriving at the collation of a single dataframe in ‘long format’ holding all our data on subject attributes, item attributes and behavioural responses to words in a lexical decision experiment. That collated database can be downloaded here.

The code we will be using in this post can be downloaded from here.

We should begin by examining the features of our collated dataset. We shall be familiarising ourselves with the distributions of the variables, looking for evidence of problems in the dataset, and doing transformations or other operations to handle those problems.

Some of the problems are straightforward. We should look at the distributions of scores to check if any errors crept in to the dataset, miscoded or misrecorded observations. Other problems are a little more demanding both to understand and to diagnose and treat. The running example in this workshop requires a regression analysis in which we will be required to enter a number of variables as predictors, including variables characterizing word attributes like the frequency, length and imageability of words. The problem is that all these variables share or overlap in information because words that appear often in the language also tend to be short, easy to imagine, learnt earlier in life and so on. This means that we need to examine the extent to which our predictors cluster, representing overlapping information, and presenting potential problems for our analysis.

Much of what follows has been discussed in previous posts here on getting summary statistics and plotting histograms to examine variable distributions, here on plotting multiple histograms in a grid, see also here on histograms and density plots, here on plotting scatterplots to examine bivariate relationships, and here on checking variable clustering. This post constitutes an edited, shortened, version of those previous discussions.

Examine the variable distributions — summary statistics

I am going to assume you have already downloaded and read in the data files concerning subject and item or word attributes as well as the collated database merging the subject and item attribute data with the behavioural data (see Part I). What we are going to do now is look at the distributions of the key variables in those databases.

We can start simply by examining the summaries for our variables. We want to focus on the variables that will go into our models, so variables like age or TOWRE word or non-word reading score, item length, word frequency, and lexical decision reaction time (RT) to words.

If you have successfully read the relevant data files into the R workspace then you should see them listed by the dataframe names (assigned when you performed the read.csv function calls) in the workspace window in RStudio (top right).

R provides a handy set of functions to allow you to examine your data (see e.g. Dalgaard, 2008):


head(subjects, n = 2)        # the first n rows of the dataframe

summary(subjects)            # min, max, median and quartiles for numeric variables; counts for factor levels; NAs

describe(subjects)           # describe() from Hmisc or psych, depending on which package is loaded

str(subjects)                # the structure: which variables are factors, integers, and so on

psych::describe(subjects)    # means, SDs etc., explicitly from the psych package

length(subjects)             # the number of columns (variables) in the dataframe

length(subjects$Age)         # the number of observations for a single variable

Here we are focusing on the data in the subject scores dataframe.

Copy-paste these commands into the script window and run them; below is what you will see in the console:

R-2-1-descriptives

Notice:

1. the head() function gives you the top rows, showing column names and data in cells; I asked for 2 rows of data with n = 2

2. the summary() function gives you: minimum, maximum, median and quartile values for numeric variables, numbers of observations with one or other levels for factors, and will tell you if there are any missing values, NAs

3. psych::describe() (from the psych package) will give you means, SDs, the kind of thing you report in tables in articles

4. str() will tell you if variables are factors, integers etc.

If I were writing a report on this subjects data I would copy the output from describe() into excel, format it for APA, and stick it in a word document.
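One possible shortcut, my suggestion rather than anything from the original post (the file name is a placeholder), is to write the describe() output straight to a .csv file and do the formatting from there:

# write the descriptive statistics out as a .csv file that excel can open
write.csv(psych::describe(subjects), file = "ML-subjects-descriptives.csv")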

As an exercise, you might apply the same functions to the items or the full databases.

Notice the values and evaluate if they make sense to you. Examine, in particular, the treatment of factors and numeric variables.

Examine the variable distributions — plots

Summary statistics are handy, and required for report writing, but plotting distributions will raise awareness of problems in your dataset much more efficiently.

We will use ggplot to plot our variables, especially geom_histogram, see the documentation online.

We are going to examine the distribution of the Age variable, age in years for ML’s participants, using the ggplot() function as follows:


pAge <- ggplot(subjects, aes(x=Age))
pAge + geom_histogram()

Notice:

— We are asking ggplot() to map the Age variable to the x-axis i.e. we are asking for a presentation of one variable.

If you copy-paste the code into the script window and run it three things will happen:

1. You’ll see the commands appear in the lower console window

2. You’ll see the pAge object listed in the upper right workspace window

R-2-1-rstudio-histogram

Notice:

— When you ask for geom_histogram(), you are asking for a bar plot plus a statistical transformation (stat_bin) which assigns observations to bins (ranges of values in the variable)

— calculates the count (number) of observations in each bin

— or – if you prefer – the density (percentage of total/bar width) of observations in each bin

— the height of the bar gives you the count, the width of the bin spans values in the variable for which you want the count

— by default, geom_histogram() will bin the data into bins of width equal to the range of values divided by 30, but you can ask for more or less detail by specifying binwidth (see the example after these notes); this is what R means when it says:

>stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to 
adjust this.

3. We see the resulting plot in the plot window, which we can export as a pdf:

R-2-1-histogram

Notice:

— ML tested many people in their 20s – no surprise, a student mostly tests students – plus a smallish number of people in middle age and older (parents, grandparents)

— nothing seems wrong with these data – given what we know about the sample.
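As an example of the binwidth adjustment mentioned in the notes above (the value of 5 is an arbitrary choice for the Age variable, not a recommendation):

pAge <- ggplot(subjects, aes(x=Age))
pAge + geom_histogram(binwidth = 5) # one bar per five-year band of Age, rather than the default range/30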

We can move on to examine the distribution of multiple variables simultaneously by producing a grid of histograms, as follows. In Exploratory Data Analysis, we often wish to split a larger dataset into smaller subsets, showing a grid or trellis of small multiples – the same kind of plot, one plot per subset – arranged systematically to allow the reader (maybe you) to discern any patterns, or stand-out features, in the data. Much of what follows is cribbed from the ggplot2 book (Wickham, 2009; pp.115 – ; see also Chang, 2013). Wickham (2009) advises that arranging multiple plots on a single page involves using grid, an R graphics system that underlies ggplot2.

Copy-paste the following code in the script window to create the plots:


pAge <- ggplot(subjects, aes(x=Age))
pAge <- pAge + geom_histogram()

pEd <- ggplot(subjects, aes(x=Years_in_education))
pEd <- pEd + geom_histogram()

pw <- ggplot(subjects, aes(x=TOWRE_wordacc))
pw <- pw + geom_histogram()

pnw <- ggplot(subjects, aes(x=TOWRE_nonwordacc))
pnw <- pnw + geom_histogram()

pART <- ggplot(subjects, aes(x=ART_HRminusFR))
pART <- pART + geom_histogram()

Notice:

— I have given each plot a different name. This will matter when we assign plots to places in a grid.

— I have changed the syntax a little bit: create a plot then add the layers to the plot; not doing this, ie sticking with the syntax as in the preceding examples, results in an error when we get to print the plots out to the grid; try it out – and wait for the warning to appear in the console window.

We have five plots we want to arrange, so let’s ask for a grid of 2 x 3 plots ie two rows of three.


grid.newpage()

pushViewport(viewport(layout = grid.layout(2,3)))

vplayout <- function(x,y)
viewport(layout.pos.row = x, layout.pos.col = y)

Having asked for the grid, we then print the plots out to the desired places, as follows:


print(pAge, vp = vplayout(1,1))
print(pEd, vp = vplayout(1,2))
print(pw, vp = vplayout(1,3))
print(pnw, vp = vplayout(2,1))
print(pART, vp = vplayout(2,2))

As R works through the print() function calls, you will see the Plots window in R-studio fill with plots getting added to the grid you specified.

You can get R to generate an output file holding the plot as a .pdf or as a variety of alternatively formatted files.

Wickham (2009; p. 151) recommends differing graphic formats for different tasks, e.g. png (72 dpi for the web); dpi = dots per inch i.e. resolution.
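If you do want a .png for the web, the device-based approach described next for pdf() works the same way with the png() device; the file name and pixel dimensions here are placeholders of my own:

png("ML-data-age-histogram.png", width = 800, height = 600, res = 72) # width and height in pixels for png()

print(pAge)

dev.off()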

Most of the time, I output pdfs. You call the pdf() function – a disk-based graphics device – specifying the name of the file to be output, the width and height of the output in inches (don’t know why, but OK), print the plots, and close the device with dev.off(). You can use this code to do that.

I can wrap everything up in a single sequence of code bracketed by commands to generate a pdf:


pdf("ML-data-subjects-histograms-061113.pdf", width = 15, height = 10)

pAge <- ggplot(subjects, aes(x=Age))
pAge <- pAge + geom_histogram()

pEd <- ggplot(subjects, aes(x=Years_in_education))
pEd <- pEd + geom_histogram()

pw <- ggplot(subjects, aes(x=TOWRE_wordacc))
pw <- pw + geom_histogram()

pnw <- ggplot(subjects, aes(x=TOWRE_nonwordacc))
pnw <- pnw + geom_histogram()

pART <- ggplot(subjects, aes(x=ART_HRminusFR))
pART <- pART + geom_histogram()

grid.newpage()

pushViewport(viewport(layout = grid.layout(2,3)))

vplayout <- function(x,y)
viewport(layout.pos.row = x, layout.pos.col = y)

print(pAge, vp = vplayout(1,1))
print(pEd, vp = vplayout(1,2))
print(pw, vp = vplayout(1,3))
print(pnw, vp = vplayout(2,1))
print(pART, vp = vplayout(2,2))

dev.off()

Which results in a plot that looks like this:

LME-4hrworkshop-histograms

There is nothing especially concerning about these distributions though the bimodal nature of the age distribution is something worth remembering if age turns out to be a significant predictor of reading reaction times.

Further moves

Histograms are not always the most useful approach to visualizing data distributions. What if you wanted to compare two distributions overlaid on top of each other e.g. distributions of variables that should be similar? It is worth considering density plots, see here.
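For example, a minimal sketch of a density plot for the RT data used below (my own example; it assumes subjectID is read in as a factor or character variable):

pRT <- ggplot(RTs.norms.subjects, aes(x=RT))
pRT + geom_density()

# or overlaid, with one curve per participant
pRT.subjects <- ggplot(RTs.norms.subjects, aes(x=RT, colour = subjectID))
pRT.subjects + geom_density()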

Exploring data and acting on what we find

Let’s consider an aspect of the data that, when revealed by a distribution plot (but was also evident when we looked at the summary statistics), will require action.

I am talking about what we see if we plot a histogram of the RT distribution using the following code. If we copy this into RStudio and run it:


pRT <- ggplot(RTs.norms.subjects, aes(x=RT))
pRT + geom_histogram()

We get this:

LME-4hrworkshop-RT-histogram

Notice that there are a number of negative RTs. That is, of course, impossible, but it makes sense if you know that DMDX, the application used to collect lexical decision responses, encodes RTs for errors as negative RTs.

Notice also that even if we ignore the -RT values, the RT distribution is, as usual, highly skewed.

Data transformations I/II

We can remove the error -RT observations by subsetting the full collated database, filtering out all observations corresponding to errors to create a new ‘correct responses only’ database. We can also log transform the RTs for correct responses in that same new database. We can do this in a few lines of code in the script as follows:


# the summary shows us that there are negative RTs, which we can remove by subsetting the dataframe
# by setting a condition on rows

nomissing.full <- RTs.norms.subjects[RTs.norms.subjects$RT > 0,]

# inspect the results

summary(nomissing.full)

# how many errors were logged?

# the number of observations in the full database i.e. errors and correct responses is:

length(RTs.norms.subjects$RT)

# > length(RTs.norms.subjects$RT)
# [1] 5440

# the number of observations in the no-errors database is:

length(nomissing.full$RT)

# so the number of errors is:

length(RTs.norms.subjects$RT) - length(nomissing.full$RT)

# > length(RTs.norms.subjects$RT) - length(nomissing.full$RT)
# [1] 183

# is this enough to conduct an analysis of response accuracy? we can have a look but first we will need to create
# an accuracy coding variable i.e. where error = 0 and correct = 1

RTs.norms.subjects$accuracy <- as.integer(ifelse(RTs.norms.subjects$RT > 0, "1", "0"))

# we can create the accuracy variable, and run the glmm but, as we will see, we will get an error message, false
# convergence likely due to the low number of errors


# I will also transform the RT variable to the log base 10 (RT), following convention in the field, though
# other transformations are used by researchers and may have superior properties

nomissing.full$logrt <- log10(nomissing.full$RT)


Notice:

— You can see that there were a very small number of errors recorded in the experiment, fewer than 200 errors out of 5000+ trials. This is not an unusual error rate, I think, for a dataset collected with typically developing adult readers.

— I am starting to use the script file not just to hold lines of code to perform the analysis, not just to hold the results of analyses that I might redo, but also to comment on what I am seeing as I work.

— The ifelse() function is pretty handy, and I have started using it more and more.

— Here, I can break down what is happening step by step:-

1. RTs.norms.subjects$accuracy <- — creates a vector that gets added to the existing dataframe columns, that vector has elements corresponding to the results of a test on each observation in the RT column, a test applied using the ifelse() function, a test that results in a new vector with elements in the correct order for each observation.

2. as.integer() — wraps the ifelse() function in a function call that renders the output of ifelse() as a numeric integer vector; you will remember from a previous post how one can use is.[datatype]() and as.[datatype]() function calls to test or coerce, respectively, the datatype of a vector.

3. ifelse(RTs.norms.subjects$RT > 0, "1", "0") — uses the ifelse() function to perform a test on the RT vector such that if it is true that the RT value is greater than 0 then a 1 is returned, and if it is not then a 0 is returned; these 1s and 0s encode accuracy and fill up the new vector being created, one element at a time, in an order corresponding to the order in which the RTs are listed.

I used to write notes on my analyses on paper, then in a Word document (which, with CTRL-F, is more searchable), to ensure I had a record of what I did.

Check predictor collinearity

The next thing we want to do is to examine the extent to which our predictors cluster, representing overlapping information, and presenting potential problems for our analysis. We can do this stage of our analysis by examining scatterplot matrices and calculating the condition number for our predictors. Note that I am going to be fairly lax here, by deciding in advance that I will use some predictors but not others in the analysis to come. (If I were being strict, I would have to find and explain theory-led and data-led reasons for including the predictors that I do include.)

Note that we will first run some code to define a function that will allow us to create scatterplot matrices, furnished by William Revelle on his web-site (see URL below):


#code taken from part of a short guide to R
#Version of November 26, 2004
#William Revelle
# see: http://www.personality-project.org/r/r.graphics.html

panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r = (cor(x, y))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex * abs(r))
}

We can then run the following code:


# we can consider the way in which potential predictor variables cluster together

summary(nomissing.full)

# create a subset of the dataframe holding just the potential numeric predictors
# NB I am not bothering, here, to include all variables, e.g. multiple measures of the same dimension

nomissing.full.pairs <- subset(nomissing.full, select = c(

"Length", "OLD", "BG_Mean", "LgSUBTLCD", "brookesIMG", "AoA_Kup_lem", "Age",
"TOWRE_wordacc", "TOWRE_nonwordacc", "ART_HRminusFR"

))

# rename those variables that need it for improved legibility

nomissing.full.pairs <- rename(nomissing.full.pairs, c(

brookesIMG = "IMG", AoA_Kup_lem = "AOA",

TOWRE_wordacc = "TOWREW", TOWRE_nonwordacc = "TOWRENW", ART_HRminusFR = "ART"

))


# subset the predictors again by items and person level attributes to make the plotting more sensible

nomissing.full.pairs.items <- subset(nomissing.full.pairs, select = c(

"Length", "OLD", "BG_Mean", "LgSUBTLCD", "IMG", "AOA"

))

nomissing.full.pairs.subjects <- subset(nomissing.full.pairs, select = c(

"Age", "TOWREW", "TOWRENW", "ART"

))


# draw a pairs scatterplot matrix

pdf("ML-nomissing-full-pairs-items-splom-061113.pdf", width = 8, height = 8)

pairs(nomissing.full.pairs.items, lower.panel=panel.smooth, upper.panel=panel.cor)

dev.off()

# draw a pairs scatterplot matrix

pdf("ML-nomissing-full-pairs-subjects-splom-061113.pdf", width = 8, height = 8)

pairs(nomissing.full.pairs.subjects, lower.panel=panel.smooth, upper.panel=panel.cor)

dev.off()

# assess collinearity using the condition number, kappa, metric

collin.fnc(nomissing.full.pairs.items)$cnumber
collin.fnc(nomissing.full.pairs.subjects)$cnumber

# > collin.fnc(nomissing.full.pairs.items)$cnumber
# [1] 46.0404
# > collin.fnc(nomissing.full.pairs.subjects)$cnumber
# [1] 36.69636


You will see when you run this code that the set of subject attribute predictors (see below) …

R-10-1-subjects-splom

… and the set of item attribute predictors (see below) …

R-10-2-items-splom

… include some quite strongly overlapping variables, distinguished by high correlation coefficients.

I was advised, when I started doing regression analyses, that a rule of thumb one can happily use when examining bivariate, i.e. pairwise, correlations between predictor variables is that there is no multicollinearity to worry about if there are no coefficients greater than r = .8. However, I can think of no statistical reason for this number, and Baayen (2008) recommends using the condition number, kappa, as a diagnostic measure instead, based on Belsley et al.’s (1980) arguments that pairwise correlations do not reveal the real, underlying, structure of latent factors (i.e. principal components of which collinear variables are transformations or expressions).

If we calculate the condition number for, separately, the subject predictors and the item predictors, we get values of 30+ in each case. Baayen (2008) comments that (if memory serves) a condition number higher than 12 indicates a dangerous level of multicollinearity. However, Cohen et al. (2003) comment that while 12 or so might be a risky kappa there are no strong statistical reasons for setting the thresholds for a ‘good’ or a ‘bad’ condition number; one cannot apply the test thoughtlessly. That said, 30+ does not look good.

The condition number is based around Principal Components Analysis (PCA) and I think it is worth considering the outcome of such an analysis over predictor variables when examining the multicollinearity of those variables. PCA will often show that a set of several item attribute predictors relates quite strongly to a set of maybe four orthogonal components: something to do with frequency/experience; something relating to age or meaning; something relating to word length and orthographic similarity; and maybe something relating to sublexical unit frequency. However, as noted previously, Cohen et al. (2003) provide good arguments for being cautious about performing a principal components regression in response to a state of high collinearity among your predictors, including problems relating to the interpretability of findings.
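To give a sense of where the condition number comes from, here is an illustration of the idea only; collin.fnc() in languageR makes its own choices about scaling and the intercept, so this will not reproduce its output exactly:

# the condition number reflects the ratio of the largest to the smallest singular value of a
# suitably scaled predictor matrix; large ratios mean some dimensions are nearly redundant
X <- scale(nomissing.full.pairs.items) # standardize the item attribute predictors
singular.values <- svd(X)$d
max(singular.values)/min(singular.values)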

Data transformations II/II: Center variables on their means

One perhaps partial means to alleviate the collinearity indicated may be to center predictor variables on their means.

Centering variables is helpful, also, for the interpretation of effects where 0 on the raw variables is not meaningful e.g. no one has an age of 0 years, no word has a length of 0 letters. With centered variables, if there is an effect then one can say that for a unit increase in the predictor variable (where 0 for the variable is now equal to the mean for the sample, because of centering), there is a change in the outcome, on average, equal to the coefficient estimated for the effect.

Centering variables is helpful, especially, when we are dealing with interactions. Cohen et al. (2003) note that the collinearity between predictors and interaction terms is reduced (this is reduction of nonessential collinearity, due to scaling) if predictors are centered. Note that interactions are represented by the multiplicative product of their component variables. You will often see an interaction represented as “age x word length effects”, the “x” reflects the fact that for the interaction of age by length effects the interaction term is entered in the analysis as the product of the age and length variables. In addition, centering again helps with the interpretation of effects because if an interaction is significant then one must interpret the effect of one variable in the interaction while taking into account the other variable (or variables) involved in the interaction (Cohen et al., 2003; see, also, Gelman & Hill, 2007). As that is the case, if we center our variables and the interaction as well as the main effects are significant, say, in our example, the age, length and age x length effects, then we might look at the age effect and interpret it by reporting that for unit change in age the effect is equal to the coefficient while the effect of length is held constant (i.e. at 0, its mean).
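As a small illustration of the point that an interaction term is the product of its component variables (using the centered variables created in the code below), these two model specifications give the same fit for numeric predictors:

# R expands cAge * cLength to cAge + cLength + cAge:cLength, and, for numeric predictors,
# the cAge:cLength term is simply the elementwise product of the two centered variables
lm(logrt ~ cAge * cLength, data = nomissing.full)
lm(logrt ~ cAge + cLength + I(cAge * cLength), data = nomissing.full)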

As you would expect, you can center variables in R very easily, we will use one method, which is to employ the scale() function, specifying that we want variables to be centered on their means, and that we want the results of that operation to be added to the dataframe as centered variables. Note that by centering on their means, I mean we take the mean for a variable and subtract it from each value observed for that variable. We could center variables by standardizing them, where we both subtract the mean and divide the variable by its standard deviation (see Gelman & Hill’s 2007, recommendations) but we will look at that another time.

So, I center variables on their means, creating new predictor variables that get added to the dataframe with the following code:


# we might center the variables - looking ahead to the use of interaction terms - as the centering will
# both reduce nonessential collinearity due to scaling (Cohen et al., 2003) and help in the interpretation
# of effects (Cohen et al., 2003; Gelman & Hill, 2007 p .55) ie if an interaction is sig one term will be
# interpreted in terms of the difference due to unit change in the corresponding variable while the other is 0

nomissing.full$cLength <- scale(nomissing.full$Length, center = TRUE, scale = FALSE)
nomissing.full$cBG_Mean <- scale(nomissing.full$BG_Mean, center = TRUE, scale = FALSE)
nomissing.full$cOLD <- scale(nomissing.full$OLD, center = TRUE, scale = FALSE)
nomissing.full$cIMG <- scale(nomissing.full$brookesIMG, center = TRUE, scale = FALSE)
nomissing.full$cAOA <- scale(nomissing.full$AoA_Kup_lem, center = TRUE, scale = FALSE)
nomissing.full$cLgSUBTLCD <- scale(nomissing.full$LgSUBTLCD, center = TRUE, scale = FALSE)

nomissing.full$cAge <- scale(nomissing.full$Age, center = TRUE, scale = FALSE)
nomissing.full$cTOWREW <- scale(nomissing.full$TOWRE_wordacc, center = TRUE, scale = FALSE)
nomissing.full$cTOWRENW <- scale(nomissing.full$TOWRE_nonwordacc, center = TRUE, scale = FALSE)
nomissing.full$cART <- scale(nomissing.full$ART_HRminusFR, center = TRUE, scale = FALSE)

# subset the predictors again by items and person level attributes to make the plotting more sensible

nomissing.full.c.items <- subset(nomissing.full, select = c(

"cLength", "cOLD", "cBG_Mean", "cLgSUBTLCD", "cIMG", "cAOA"

))

nomissing.full.c.subjects <- subset(nomissing.full, select = c(

"cAge", "cTOWREW", "cTOWRENW", "cART"

))

# assess collinearity using the condition number, kappa, metric

collin.fnc(nomissing.full.c.items)$cnumber
collin.fnc(nomissing.full.c.subjects)$cnumber

# > collin.fnc(nomissing.full.c.items)$cnumber
# [1] 3.180112
# > collin.fnc(nomissing.full.c.subjects)$cnumber
# [1] 2.933563

Notice:

— I check the condition numbers for the predictors after centering and it appears that they are markedly reduced.

— I am not sure that that reduction can possibly be the end of the collinearity issue, but I shall take that result and run with it, for now.

— I also think that it would be appropriate to examine collinearity with both the centered variables and the interactions involving those predictors but that, too, can be left for another time.

Write the file out as a csv so that we do not have to repeat ourselves


write.csv(nomissing.full, file = "RTs.norms.subjects nomissing.full 061113.csv", row.names = FALSE)

What have we learnt?

In this post, we learnt about:

— How to inspect the dataframe object using functions like summary() or describe().

— How to plot histograms to examine the distributions of variables.

— How to use the grid graphics system (grid.newpage(), pushViewport(), viewport()) to arrange multiple plots at once.

— How to output our plots as a pdf for later use.

— How to subset dataframes conditionally.

— How to log transform variables.

— How to create a variable conditionally dependent on values in another variable.

Key vocabulary

working directory

workspace

object

dataframe

NA, missing value

csv

histogram

graphics device

ggplot()

geom_histogram()

geom_density()

facet_wrap()

pdf()

grid()

print()

log10()

ifelse()

Reading

Baayen, R. H. (2008). Analyzing linguistic data. Cambridge University Press.

Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.

Chang, W. (2013). R Graphics Cookbook. Sebastopol, CA: O’Reilly Media.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioural sciences (3rd. edition). Mahwah, NJ: Lawrence Erlbaum Associates.

Dalgaard, P. (2008). Introductory statistics with R (2nd. Edition). Springer.

Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York,NY: Cambridge University Press.

Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. Springer.


Mixed-effects modeling — four hour workshop — part I: Data handling

This post will be one of four in which we go through how to model a multilevel dataset using linear mixed-effects modeling. To begin, we need to collate the data we have gathered in a simple reading experiment but one involving a repeated-measures design warranting a mixed-effects approach.

Data handling

This post is an updated version of an earlier post. The .R code I will use to perform the operations discussed in this post can be downloaded from here.

I am going to assume that you know how to install R and RStudio, if not, see earlier posts or this recent summary where you are shown how to:

— install R (the statistical programming environment) and RStudio (the interactive editor for working with R)

— start a script to run R functions and get things done

— install R packages to gain access to the capabilities furnished by user-contributed libraries of functions

— can write and run R code to e.g. produce basic plots or get summary statistics

–download data files of the sort we will be using (.csv) and read them into R workspace as dataframes

You should also know how to get help in R, as summarized here.

What we will do in this post is focus on getting your data together in preparation for running mixed-effects modelling.

The data we will use in our example were collected for a student research project. We were interested in whether or how lexical decision responses were influenced by:

1. the participant attributes;

2. the item attributes;

3. interactions between participant and item attributes.

We tested 39 adults, males and females, most either in their 20s or in middle age.

Every adult was tested on word and nonword reading skills (TOWRE, Torgesen et al., 1999) and print exposure (ART, Masterson & Hayes, 2007). We collated the subject scores database and it can be downloaded here.

Critically, every adult also was tested in a computer based lexical decision task (run using DMDX, Forster & Forster, 2003) in which they saw 320 letter strings on a screen, one at a time, and were asked to indicate by a button press (using a gamepad) whether they thought each item was a word they knew or a nonword they did not. In addition to the 320 critical trials, participants were also given 20 practice trials which we shall ignore in the analysis. Information on the item names (the word or non-word) linking them to item type (word or non-word) and item number (DMDX trial number) can be downloaded here, we will need it for later merge operations.

The words and nonwords were selected to be matched on length (measured in letters), orthographic neighbourhood size (often called N, Coltheart et al., 1977) and bigram frequency (mean, sum and frequency specific numbers). This matching was designed to force participants to access their mental lexicon to perform the task, instead of making responses based on ‘low-level’ perceptual information like item length. We collated data on item (word and nonword attributes) and they can be downloaded here, see relevant previous posts here and here.

However, what we really cared about, in designing the experiment, was whether the time that participants took to respond to stimuli (reaction time, RT, or response latency) – the interval from stimulus onset to response onset – varied in relation to: participant attributes like reading skill; item attributes like length; and the interactions between those two sets of predictors. In addition to the length etc. of the items, we also cared if lexical access was influenced by word attributes. Focusing on just words, we wanted to know if RTs were affected by word frequency, age of acquisition (AoA), and imageability. We collated information about such attributes and combined them with data on item attributes like length to a single database that can be downloaded here, see relevant post here.

All these data about item attributes were drawn either from the English Lexicon Project (Balota et al., 2007) or from datasets very helpfully provided by other psycholinguists (Cortese & Fuggett, 2004; Cortese & Khanna, 200; Kuperman et al., in press) see notes here and here.

When we ran the lexical decision study, DMDX produced an .azk output file holding keypress RTs along with information about trial order (when did a word appear in a block of stimulus trials) and participant identity code (subjectID).

Getting data recorded in an experiment into a usable state can be a quite exasperating process. Things that one encounters include: inconsistent subjectID naming between behavioural and participant attribute data collection; item or trial coding errors; mismatches because one used the word FALSE in Excel. For some reason, the process of data collation is often called data munging.

The behavioural data file is in ‘long format’: all participants tested are shown, one participant on top of the other, in a long column where each row corresponds to a single trial observation, i.e. each row corresponds to a response made by a participant to a word. It can be downloaded here.

If you open it in excel, it looks like this:

R-9-1-long-data

Notice:

— If you were to open the file in excel, you would see that there are 13941 rows, including the column headers or variable names.

— 13940 data rows =  340*41

– Recall I said there were 320 critical trials and 39 participants. On the latter score, I misspoke: the raw file actually holds data for 41 participants, but we lost two because of inconsistent subjectID entry. If you don’t know who they are (or, at least, if you cannot map subjectIDs in the participant attributes database to subjectIDs in the behavioural database), you cannot analyze them.

Load the data into the workspace

I am going to assume you have successfully downloaded the datafiles and have them in a folder that you are able to use as your working directory.

You can run the following code to set the working directory to the folder where the datafiles are and then load the files into the R session workspace as dataframes.


# load the plyr package -- its rename() and (later) ddply() functions are used below
library(plyr)

# set the working directory

setwd("[enter path to folder holding data files here]")

# # read in the item norms data
#
# item.norms <- read.csv(file="item norms 270312 050513.csv", header=TRUE, sep=",", na.strings = "-999")
# word.norms <- read.csv(file="word norms 140812.csv", header=TRUE, sep=",", na.strings = "-999")

item.coding <- read.csv(file="item coding info 270312.csv", header=TRUE, sep=",", na.strings = "-999")

# read in the merged new dataframe

item.words.norms <- read.csv(file="item.words.norms 050513.csv", header=TRUE, sep=",", na.strings = "NA")

# read in the subjects scores data file

subjects <- read.csv("ML scores 080612 220413.csv", header=T, na.strings = "-999")

# read in the behavioural data file

behaviour <- read.csv("ML data 080612.csv", header=T, na.strings = "-999")

Notice:

— I commented out the read.csv() function calls for item.norms and word.norms. We will not need that stuff for this post.

— We do now need the item.coding dataframe, however, for reasons I will explain next.

Put the databases together

We next need to put the dataframes together through the use of the merge() function.

What this means is that we will end up with a dataframe that:

1. has as many rows (some thousands) as valid observations

2. each observation will be associated with data about the subject and about the word.

Notice:

— The merge() function  (see the help page) allows you to put two dataframes together. I believe other functions allow merger of multiple dataframes.

— By default, the dataframes are merged on the columns with names they both have, so here we plan to merge: the normative item.words.norms dataframe with the item.coding dataframe by the common variable item_name; we then merge that norms.coding dataframe with the behaviour dataframe by the common variable item_number; we then merge that RTs.norms dataframe with the subject scores dataframe by the common variable subjectID.

— You see why we needed the coding data: the behavioural database lists RT observations by (DMDX trial item) number; the normative database lists item attribute observations by item name (e.g. a word name like “axe”). I cannot put the normative and the behavioural databases together without a common variable; that common variable has to be item_name, so I created the coding database, deriving the information from the DMDX script, to hold both the item_name and item_number columns. Creating the coding database was not too costly, but the fact that I needed to do so illustrates the benefits of thinking ahead: if I had thought ahead, I would have had both item_name and item_number columns in the normative database when I built it.

— As noted previously, if the dataframes being merged have observations that cannot be mapped together because there is a mismatch on the row index (e.g. item name, or whatever other common variable is being used), those observations get ‘lost’ in the process of merging (see the sketch after these notes for how to keep or count unmatched rows). This is nicely illustrated by what happens when you merge the behaviour dataframe – which holds observations on both responses to words and to nonwords – with the item.words.norms dataframe, which holds just data about the attributes of words. As there are no nonword items in the latter, when you merge item.words.norms and behaviour, creating a new joint dataframe, the latter has nothing about nonwords. This suits my purposes, because I wish to ignore nonwords (and practice items, for that matter, also absent from the norms database) but you should pay careful attention to what should happen and what does happen when you merge dataframes.

— I also check my expectations about the results of operations like merge() against the results of those operations. I almost never get a match first time between expectations and results. Here, it is no different, as I will explain.
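As an aside: if you ever want to keep, or at least count, the unmatched rows rather than silently lose them, merge() takes all, all.x and all.y arguments. Here is a minimal sketch, using the dataframes created in the code below (the name RTs.norms.all is mine, just for illustration):


# by default merge() keeps only rows present in both dataframes (an inner join);
# all.y = TRUE keeps every behavioural observation, filling the norms columns of
# nonword (and practice) trials with NA
RTs.norms.all <- merge(norms.coding, behaviour, by = "item_number", all.y = TRUE)

# the difference in row counts tells you how many observations the default merge drops
nrow(behaviour) - nrow(merge(norms.coding, behaviour, by = "item_number"))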

Let’s run the code to merge our constituent databases.


# to put files together ie to merge files we use the merge() function
# we want to put the data files together so that we can analyze RTs by subject characteristics or item norms or conditions (coding)

# so that we can associate item normative with coding data, we need to merge files by a common index, item_name

norms.coding <- merge(item.words.norms, item.coding, by="item_name")

# inspect the result

head(norms.coding, n = 2)

# note the common index item_name does not get repeated but R will recognize the duplication of item_type and assign .x or .y
# to distinguish the two duplicates.

# delete the redundant column

norms.coding <- norms.coding[, -2] # using dataframe indexing: [, -2] drops the second column

head(norms.coding,n=2)

# rename the .y column to keep it tidy

norms.coding <- rename(norms.coding, c(item_type.y = "item_type"))

# inspect the results:

head(norms.coding,n=2)

summary(norms.coding)

# put the data files together with data on subjects and on behaviour:

# first, rename the "Item" variable in the behaviour dataframe as "item_number" to allow the merge

behaviour <- rename(behaviour, c(Item = "item_number"))

# run merge to merge behaviour and norms

RTs.norms <- merge(norms.coding, behaviour, by="item_number")

# run merge to merge RTs.norms and subject scores

RTs.norms.subjects <- merge(RTs.norms, subjects, by="subjectID")

Notice:

— In merging the normative and coding dataframes, I had to deal with the fact that both held two common columns, the one I used for merging, item_name, and another, item_type. The former is used for merging and goes into the dataframe that is the product of the merger as a single column. The item_type column was not used for merging and consequently is duplicated: R deals with duplicates by renaming them to keep them distinct, here adding .x or .y.  This is a bit of a pain so I dealt with the duplication by deleting one copy and renaming the other.

If we inspect the results of the merger:


# inspect the results:

head(RTs.norms.subjects,n=2)
summary(RTs.norms.subjects)
length(RTs.norms.subjects$RT)

We see:

R-9-2-merged

Notice:

— We have just observations of words: see the item_type summary.

— Participants, at least the first few, are listed under subjectID with 160 observations each, as they should be. Variables appear to be correctly listed by data type, numeric, factor etc.

— I note that there are some missing values, NAs, among the IMG and AOA observations.

— I also note that the Subject variable indicates at least one participant, A18, has 320 not 160 observations.

Checking: Does the merger output dataframe match my expectations?

As noted in the foregoing study summary, we have participant attribute data on 39 participants and, presumably, lexical decision behavioural data on those same participants. I would expect the final merger output dataframe, therefore, to have: 39*160 = 6240 rows.

It does not.

We need to work out what the true state of each component dataframe is to make sense of what we get as the output of the merge() calls.

The first thing I am going to do is to check whether the merger between the subjects and behaviour dataframes lost a few observations because there was a mismatch on subjectID for one or more participants. The way you can do this (there are smarter ways) is to ask R what the levels of the subjectID factor are for each dataframe, then check the matching by eye.


# how many rows should we have?

levels(subjects$subjectID)
levels(behaviour$subjectID)
levels(RTs.norms.subjects$subjectID)

Which shows you this:

R-9-3-levels

I can tell that there are some subjectIDs in the subjects dataframe that are not in the behaviour dataframe, and vice versa. Well, I told you that there were problems in the consistency of subjectID naming. The output dataframe RTs.norms.subjects will only have observations for those rows indexed with a subjectID in common between both behaviour and subjects dataframes. (I think the smarter way to do the check would be to do a logic test, using ifelse(), with a dataframe output telling you where subjectID is or is not matched.)
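A quicker way to do that check than reading the levels by eye is to ask directly which subjectIDs appear in one dataframe but not the other, e.g. with setdiff(); a minimal sketch:


# subjectIDs present in the subject scores data but not in the behavioural data
setdiff(levels(subjects$subjectID), levels(behaviour$subjectID))

# and the reverse
setdiff(levels(behaviour$subjectID), levels(subjects$subjectID))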

Notice:

— The behaviour dataframe includes 41 not 39 participants’ data, including one with a blank “” subjectID.

— The levels listed for subjectID for the output dataframe RTs.norms.subjects actually give you all the levels listed for both factors, irrespective of whether there are any data associated with each level. This is an artifact of the way in which R (otherwise efficiently) stores factors as integer indices pointing to a set of levels; unused levels can be tidied up with droplevels(), see the sketch after these notes.

— We need, therefore, to count how many observations we have for each participant in the output dataframe. I will do this by creating a function to count how many times subjectID is listed.
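For completeness, here is a minimal sketch of tidying up those leftover levels with droplevels(), which removes factor levels that no longer have any observations (assigning to a new name so as not to alter the dataframe used below):


# drop factor levels with no remaining observations after the merge
RTs.norms.subjects.dropped <- droplevels(RTs.norms.subjects)
levels(RTs.norms.subjects.dropped$subjectID)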

The following code uses the function definition syntax in R as well as the ddply() function, made available by installing the plyr package and then loading its library. The plyr package, by Hadley Wickham, is incredibly useful, and I will come back to it. If you copy the following code into the script window and then run it, you will see what it does; I promise to return in a later post to explain what is going on and how it happens.


# define a function to count how many observations are in the dataframe for each subjectID

IDCOUNT <- function(df) {length(df$subjectID)}

RTs.norms.subjects.count <- ddply(RTs.norms.subjects, .(subjectID), IDCOUNT)
RTs.norms.subjects.count

behaviour.count <- ddply(behaviour, .(subjectID), IDCOUNT)
behaviour.count

Which gives you this:

R-9-4-count

What we have done is:

1. Created a function to count how many times each different subjectID value e.g. A15 occurs in the subjectID column. You should realize, here, that each subjectID will be listed in the subjectID column for every single row held, i.e. every observation recorded, in the dataframe for the participant.

2. Used the function to generate a set of numbers – for each participant, how many observations are there? – then used the ddply() function call to create a dataframe listing each subjectID with, for each one, the number of associated observations.

3. Inspecting the latter dataframe tells us that we have 34 participants with 160 observations for each one.

Neat, even if I do say so myself, but it is all thanks to the R megastar Hadley Wickham, see here for more information on plyr.
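(A quick base R cross-check of the same counts, without plyr, is to tabulate the subjectID column directly; a one-line sketch:)


# base R alternative: tabulate the number of observations per subjectID
table(RTs.norms.subjects$subjectID)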

I think I now understand what happened:

1. We had a match on observations in the subject attributes data and the behavioural data for only 34 participants.

2. The output dataframe RTs.norms.subjects therefore holds only 34 * 160 = 5440 observations.

Output the merged dataframe as a .csv file

Having done all that work (not really: a few lines of code and a few milliseconds did what hours of copying and pasting by hand would have required), I don’t want to have to do it all again, so I save the merged dataframe as a .csv file using the write.csv() function.


# we can write the dataframe output by our merging to a csv file to save having to do this work again later

write.csv(RTs.norms.subjects, file = "RTs.norms.subjects 061113.csv", row.names = FALSE)

What have we learnt?

You could do this by hand. That would involve working in excel, and copying, for each person’s trials, their scores on the reading skill tests etc., and for each word, that word’s values on length etc. I have done that. It is both time-consuming and prone to human error. Plus, it is very inflexible because who wants to do a thing like that more than once? I learnt about keeping different datasets separate until needed – what you could call a modular approach – in the great book on advanced regression by Andrew Gelman and Jennifer Hill (Gelman & Hill, 2007).

I think that keeping data sets the right size – i.e. subject attributes data in a subjects database, item attributes data in a normative database etc. – is helpful: more manageable (easier to read things, if required, in excel), less error prone, and logically transparent (therefore, in theory, more intuitive). However, the cost of using merge(), as I have shown, is that you are required to know exactly what you have, and what you should get. You should have that understanding anyway; I would contend that it is easier to avoid feeling that obligation when working with excel or SPSS, but that that avoidance is costly in other ways (reputational risk). Thus, work and check your workings. Just because they taught you that in school does not mean it stops being relevant.

Key vocabulary

merge()

levels()

function(){}

ddply()

length()

summary()

write.csv()

Reading

Balota, D. A., Yap, M.J., Cortese, M.J., Hutchison, K.A., Kessler, B., Loftus, B., . . . Treiman, R. (2007). The English lexicon project. Behavior Research Methods, 39, 445-459.
Coltheart, M., Davelaar, E., Jonasson, J. T., & Besner, D. (1977). Access to the internal lexicon. In S. Dornic (Ed.), Attention and Performance VI (pp. 535-555). Hillsdale,NJ: LEA.
Cortese, M.J., & Fugett, A. (2004). Imageability Ratings for 3,000 Monosyllabic Words. Behavior Research Methods, Instruments, & Computers, 36, 384-387.
Cortese, M.J., & Khanna, M.M. (2008). Age of Acquisition Ratings for 3,000 Monosyllabic Words. Behavior Research Methods, 40, 791-794.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York,NY: Cambridge University Press.
Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (in press). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods.
Masterson, J., & Hayes, M. (2007). Development and data for UK versions of an author and title recognition test for adults. Journal of Research in Reading, 30, 212-219.
Torgesen, J. K., Wagner, R. K., & Rashotte, C. A. (1999). TOWRE Test of word reading efficiency. Austin,TX: Pro-ed.


Mixed effects – Autumn workshop – practical steps in one hour

For an example analysis, we collated a .csv file of data from a repeated measures lexical decision experiment, which can be downloaded here. The code I have used in running the analysis described in this post can be found here.

This file holds data on:

1. participant attributes;

2. word attributes;

3. participant responses to words in the lexical decision task.

In this post, I want to take a quick look at how you would actually run a mixed-effects model, using the lmer() function furnished by the lme4 package, written by Doug Bates (Bates, 2010 – this appears to be a preprint of his presumed new book; see also Pinheiro & Bates, 2000). A useful how-to guide, one I followed when I first started doing this kind of modelling, is in Baayen (2008). An excellent, and very highly cited, introduction to mixed-effects modelling for psycholinguistics is found in Baayen, Davidson & Bates (2008).

First, then, we enter the following lines of code to load the libraries we are likely to need, to set the working directory to the folder where we have copied the full database, and to read that database as a dataframe in the session’s workspace:

Load libraries, define functions, load data

We start by loading the libraries of functions we are likely to need, defining a set of functions not in libraries, setting the working directory, and loading the dataframe we will work with into the R session workspace.


# load libraries of packages #############################################################################
library(languageR)
library(lme4)
library(ggplot2)
library(grid)
library(rms)
library(plyr)
library(reshape2)
library(psych)
library(gridExtra)

# define functions #######################################################################################
# pairs plot function

#code taken from part of a short guide to R
#Version of November 26, 2004
#William Revelle
# see: http://www.personality-project.org/r/r.graphics.html

panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r = (cor(x, y))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex * abs(r))
}
# just a correlation table, taken from here:
# http://myowelt.blogspot.com/2008/04/beautiful-correlation-tables-in-r.html
# actually adapted from a previous nabble thread:

corstarsl <- function(x){
require(Hmisc)
x <- as.matrix(x)
R <- rcorr(x)$r
p <- rcorr(x)$P

## define notions for significance levels; spacing is important.
mystars <- ifelse(p < .001, "***", ifelse(p < .01, "** ", ifelse(p < .05, "* ", " ")))

## truncate the matrix that holds the correlations to two decimal places
R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[,-1]

## build a new matrix that includes the correlations with their appropriate stars
Rnew <- matrix(paste(R, mystars, sep=""), ncol=ncol(x))
diag(Rnew) <- paste(diag(R), " ", sep="")
rownames(Rnew) <- colnames(x)
colnames(Rnew) <- paste(colnames(x), "", sep="")

## remove upper triangle
Rnew <- as.matrix(Rnew)
Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
Rnew <- as.data.frame(Rnew)

## remove last column and return the matrix (which is now a data frame)
Rnew <- cbind(Rnew[1:length(Rnew)-1])
return(Rnew)
}

# read in data files #####################################################################################

getwd()
# set the working directory

setwd("[enter the path to the folder holding the data file here]")
# (the file may have been downloaded to your Downloads folder -- on a Mac, see Get Info in Finder)

# read in the full database

RTs.norms.subjects <- read.csv("RTs.norms.subjects 090513.csv", header=T, na.strings = "-999")

If you run all that code, you will see the dataframes and functions appear listed in the workspace window of R-studio.

Prepare data – remove errors

As in the previous post, we need to remove errors from the dataframe. We can do this by setting a condition on rows, subsetting the dataframe to a new dataframe where no RT is less than zero because DMDX outputs .azk datafiles in which incorrect RTs are listed as negative numbers.


# preparing our dataframe for modelling ##################################################################
# the summary shows us that there are negative RTs, which we can remove by subsetting the dataframe
# by setting a condition on rows

nomissing.full <- RTs.norms.subjects[RTs.norms.subjects$RT > 0,]

# inspect the results

summary(nomissing.full)

# how many errors were logged?

# the number of observations in the full database i.e. errors and correct responses is:

length(RTs.norms.subjects$RT)

# > length(RTs.norms.subjects$RT)
# [1] 5440

# the number of observations in the no-errors database is:

length(nomissing.full$RT)

# so the number of errors is:

length(RTs.norms.subjects$RT) - length(nomissing.full$RT)

# > length(RTs.norms.subjects$RT) - length(nomissing.full$RT)
# [1] 183


Notice:

— I am copying the outputs from function calls from the console window, where they appear, into the script file, where they are now saved for later reference.

— I find this to be a handy way to keep track of results, as I move through a series of analyses. This is because I often do analyses a number of times, maybe having made changes in the composition of the dataframe or in how I transformed variables, and I need to keep track of what happens after each stage or each operation.

— You can see that there were a very small number of errors recorded in the experiment, fewer than 200 errors out of 5000+ trials. This is not an unusual error rate, I think, for a dataset collected with typically developing adult readers.

In this post, I will add something new, using the ifelse() control structure to create an accuracy variable which is added to the dataframe:


# so the number of errors is:

length(RTs.norms.subjects$RT) - length(nomissing.full$RT)

# > length(RTs.norms.subjects$RT) - length(nomissing.full$RT)
# [1] 183

# is this enough to conduct an analysis of response accuracy? we can have a look but first we will need to create
# an accuracy coding variable i.e. where error = 0 and correct = 1

RTs.norms.subjects$accuracy <- as.integer(ifelse(RTs.norms.subjects$RT > 0, "1", "0"))

# we can create the accuracy variable, and run the glmm but, as we will see, we will get an error message, false
# convergence likely due to the low number of errors

Notice:

— I am starting to use the script file not just to hold lines of code to perform the analysis, not just to hold the results of analyses that I might redo, but also to comment on what I am seeing as I work.

— The ifelse() function is pretty handy, and I have started using it more and more.

— Here, I can break down what is happening step by step:-

1. RTs.norms.subjects$accuracy <- — creates a vector that gets added to the existing dataframe columns. That vector has one element for each observation in the RT column, in the same order as those observations, and each element holds the result of a test applied to the corresponding RT using the ifelse() function.

2. as.integer() — wraps the ifelse() function in a function call that renders the output of ifelse() as a numeric integer vector; you will remember from a previous post how one can use is.[datatype]() and as.[datatype]() function calls to test or coerce, respectively, the datatype of a vector.

3. ifelse(RTs.norms.subjects$RT > 0, “1”, “0”) — uses the ifelse() function to perform a test on the RT vector: if the RT value is greater than 0 then a 1 is returned; if it is not, a 0 is returned. These 1s and 0s encode accuracy and fill up the new vector being created, one element at a time, in an order corresponding to the order in which the RTs are listed.
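The comments in the code above mention fitting a glmm to the accuracy data. I do not report that analysis here but, for illustration only, a call could look something like the following sketch, using glmer() from lme4 with a binomial family; note that this minimal predictor set is chosen just for the example, not the specification one would settle on for the study:


# an illustrative sketch only -- not the accuracy model reported for the study
accuracy.glmm <- glmer(accuracy ~ Length + LgSUBTLCD +
                        (1|subjectID) + (1|item_name),
                        data = RTs.norms.subjects, family = binomial)
summary(accuracy.glmm)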

I used to write notes on my analyses on paper, then in a word document (which, with CTRL-F, is more searchable), to ensure I had a record of what I did.

It is worth keeping good notes on what you do because you (I) rarely get an unbroken interval of time in which to complete an analysis up to the report stage, thus you will often return to an analysis after a break of days to months, and you will therefore need to write notes that your future self will read so that analyses are self-explanatory or self-explained.

Check predictor collinearity

The next thing we want to do is to examine the extent to which our predictors cluster, representing overlapping information, and presenting potential problems for our analysis. We can do this stage of our analysis by examining scatterplot matrices and calculating the condition number for our predictors. Note that I am going to be fairly lax here, by deciding in advance that I will use some predictors but not others in the analysis to come. (If I were being strict, I would have to find and explain theory-led and data-led reasons for including the predictors that I do include.)


# check collinearity and check data distributions #########################################################
# we can consider the way in which potential predictor variables cluster together

summary(nomissing.full)

# create a subset of the dataframe holding just the potential numeric predictors
# NB I am not bothering, here, to include all variables, e.g. multiple measures of the same dimension

nomissing.full.pairs <- subset(nomissing.full, select = c(

 "Length", "OLD", "BG_Mean", "LgSUBTLCD", "brookesIMG", "AoA_Kup_lem", "Age",
 "TOWRE_wordacc", "TOWRE_nonwordacc", "ART_HRminusFR"

 ))

# rename those variables that need it for improved legibility

nomissing.full.pairs <- rename(nomissing.full.pairs, c(

 brookesIMG = "IMG", AoA_Kup_lem = "AOA",

 TOWRE_wordacc = "TOWREW", TOWRE_nonwordacc = "TOWRENW", ART_HRminusFR = "ART"

 ))

# subset the predictors again by items and person level attributes to make the plotting more sensible

nomissing.full.pairs.items <- subset(nomissing.full.pairs, select = c(

 "Length", "OLD", "BG_Mean", "LgSUBTLCD", "IMG", "AOA"

))

nomissing.full.pairs.subjects <- subset(nomissing.full.pairs, select = c(

"Age", "TOWREW", "TOWRENW", "ART"

))

# draw a pairs scatterplot matrix

pdf("ML-nomissing-full-pairs-items-splom-110513.pdf", width = 8, height = 8)

pairs(nomissing.full.pairs.items, lower.panel=panel.smooth, upper.panel=panel.cor)

dev.off()

# draw a pairs scatterplot matrix

pdf("ML-nomissing-full-pairs-subjects-splom-110513.pdf", width = 8, height = 8)

pairs(nomissing.full.pairs.subjects, lower.panel=panel.smooth, upper.panel=panel.cor)

dev.off()

# assess collinearity using the condition number, kappa, metric

collin.fnc(nomissing.full.pairs.items)$cnumber
collin.fnc(nomissing.full.pairs.subjects)$cnumber

# > collin.fnc(nomissing.full.pairs.items)$cnumber
# [1] 46.0404
# > collin.fnc(nomissing.full.pairs.subjects)$cnumber
# [1] 36.69636

You will see when you run this code that the set of subject attribute predictors (see below) …

R-10-1-subjects-splom

… and the set of item attribute predictors (see below) …

R-10-2-items-splom

… include some quite strongly overlapping variables, distinguished by high correlation coefficients.

I was advised, when I started doing regression analyses, that a rule of thumb one can happily use when examining bivariate i.e. pairwise correlations between predictor variables is that there is no multicollinearity to worry about if no coefficient is greater than r = .8. However, I can think of no statistical reason for this number. Baayen (2008) instead recommends using the condition number, kappa, as a diagnostic measure, based on Belsley et al.’s (1980) arguments that pairwise correlations do not reveal the real, underlying structure of latent factors (i.e. principal components of which collinear variables are transformations or expressions).

If we calculate the condition number for, separately, the subject predictors and the item predictors, we get values of 30+ in each case. Baayen (2008) comments that (if memory serves) a condition number higher than 12 indicates a dangerous level of multicollinearity. However, Cohen et al. (2003) comment that while 12 or so might be a risky kappa there are no strong statistical reasons for setting the thresholds for a ‘good’ or a ‘bad’ condition number; one cannot apply the test thoughtlessly. That said, 30+ does not look good.

The condition number is based around Principal Components Analysis (PCA) and I think it is worth considering the outcome of such an analysis over predictor variables when examining the multicollinearity of those variables. PCA will often show that a set of several item attribute predictors relates quite strongly to a set of maybe four orthogonal components: something to do with frequency/experience; something relating to age or meaning; something relating to word length and orthographic similarity; and maybe something relating to sublexical unit frequency. However, as noted previously, Cohen et al. (2003) provide good arguments for being cautious about performing a principal components regression in response to a state of high collinearity among your predictors, including problems relating to the interpretability of findings.
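If you want to look at that component structure directly, one quick way is to run a PCA over the standardized predictors, e.g. with base R's prcomp(); a minimal sketch (na.omit() is there because of the missing IMG and AOA values noted earlier):


# a quick look at the principal components of the item attribute predictors
items.pca <- prcomp(na.omit(nomissing.full.pairs.items), scale. = TRUE)
summary(items.pca)   # proportion of variance accounted for by each component
items.pca$rotation   # loadings of the predictors on the components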

Center variables on their means

One means to at least partially alleviate the collinearity indicated may be to center predictor variables on their means.

(Should you center variables on their means before or after you remove errors, i.e. means calculated for the full or the no-errors dataframe? Should you center both predictors and the outcome variables? I will need to come back to those questions.)

Centering variables is helpful, also, for the interpretation of effects where 0 on the raw variables is not meaningful, e.g. no one has an age of 0 years, no word has a length of 0 letters. With centered variables, if there is an effect then one can say that for unit increase in the predictor variable (where 0 for the variable is now equal to the mean for the sample, because of centering), there is a change in the outcome, on average, equal to the coefficient estimated for the effect.

Centering variables is helpful, especially, when we are dealing with interactions. Cohen et al. (2003) note that the collinearity between predictors and interaction terms is reduced (this is reduction of nonessential collinearity, due to scaling) if predictors are centered. Note that interactions are represented by the multiplicative product of their component variables. You will often see an interaction represented as “age x word length effects”, the “x” reflects the fact that for the interaction of age by length effects the interaction term is entered in the analysis as the product of the age and length variables. In addition, centering again helps with the interpretation of effects because if an interaction is significant then one must interpret the effect of one variable in the interaction while taking into account the other variable (or variables) involved in the interaction (Cohen et al., 2003; see, also, Gelman & Hill, 2007). As that is the case, if we center our variables and the interaction as well as the main effects are significant, say, in our example, the age, length and age x length effects, then we might look at the age effect and interpret it by reporting that for unit change in age the effect is equal to the coefficient while the effect of length is held constant (i.e. at 0, its mean).

As you would expect, you can center variables in R very easily; we will use one method, which is to employ the scale() function, specifying that we want variables to be centered on their means, and that we want the results of that operation to be added to the dataframe as centered variables. Note that by centering on their means, I mean we take the mean for a variable and subtract it from each value observed for that variable. We could instead standardize variables, where we both subtract the mean and divide the variable by its standard deviation (see Gelman & Hill's (2007) recommendations), but we will look at that another time.
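For reference, the only change needed to standardize rather than merely center is to set scale = TRUE in the same call; a one-line sketch (the variable name zLength is mine, for illustration):


# standardizing: subtract the mean and also divide by the standard deviation
nomissing.full$zLength <- scale(nomissing.full$Length, center = TRUE, scale = TRUE)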

So, I center variables on their means, creating new predictor variables that get added to the dataframe with the following code:


# we might center the variables - looking ahead to the use of interaction terms - as the centering will
# both reduce nonessential collinearity due to scaling (Cohen et al., 2003) and help in the interpretation
# of effects (Cohen et al., 2003; Gelman & Hill, 2007 p .55) ie if an interaction is sig one term will be
# interpreted in terms of the difference due to unit change in the corresponding variable while the other is 0

nomissing.full$cLength <- scale(nomissing.full$Length, center = TRUE, scale = FALSE)
nomissing.full$cBG_Mean <- scale(nomissing.full$BG_Mean, center = TRUE, scale = FALSE)
nomissing.full$cOLD <- scale(nomissing.full$OLD, center = TRUE, scale = FALSE)
nomissing.full$cIMG <- scale(nomissing.full$brookesIMG, center = TRUE, scale = FALSE)
nomissing.full$cAOA <- scale(nomissing.full$AoA_Kup_lem, center = TRUE, scale = FALSE)
nomissing.full$cLgSUBTLCD <- scale(nomissing.full$LgSUBTLCD, center = TRUE, scale = FALSE)

nomissing.full$cAge <- scale(nomissing.full$Age, center = TRUE, scale = FALSE)
nomissing.full$cTOWREW <- scale(nomissing.full$TOWRE_wordacc, center = TRUE, scale = FALSE)
nomissing.full$cTOWRENW <- scale(nomissing.full$TOWRE_nonwordacc, center = TRUE, scale = FALSE)
nomissing.full$cART <- scale(nomissing.full$ART_HRminusFR, center = TRUE, scale = FALSE)

# subset the predictors again by items and person level attributes to make the plotting more sensible

nomissing.full.c.items <- subset(nomissing.full, select = c(

 "cLength", "cOLD", "cBG_Mean", "cLgSUBTLCD", "cIMG", "cAOA"

))

nomissing.full.c.subjects <- subset(nomissing.full, select = c(

 "cAge", "cTOWREW", "cTOWRENW", "cART"

))

# assess collinearity using the condition number, kappa, metric

collin.fnc(nomissing.full.c.items)$cnumber
collin.fnc(nomissing.full.c.subjects)$cnumber

# > collin.fnc(nomissing.full.c.items)$cnumber
# [1] 3.180112
# > collin.fnc(nomissing.full.c.subjects)$cnumber
# [1] 2.933563

Notice:

— I check the condition numbers for the predictors after centering and it appears that they are substantially reduced.

— I am not sure that that reduction can possibly be the end of the collinearity issue, but I shall take that result and run with it, for now.

— I also think that it would be appropriate to examine collinearity with both the centered variables and the interactions involving those predictors but that, too, can be left for another time.

Run the models

What I will do next is perform three sets of mixed-effects model analyses. I will talk about the first set in this post.

In this first set, I shall compare models varying in the predictor variables included in the model specification. We shall say that models with more predictors are more complex than models with fewer predictors (a subset of the predictors specified for the more complex models).

I shall then compare models differing in complexity to see if increased complexity is justified by improved fit to data.

This is in response to: 1. reading Baayen (2008) and Pinheiro & Bates (2000); 2. recommendations by a reviewer; though I note that the reviewer's recommendations are, in their application, qualified in my mind by some cautionary comments made by Pinheiro & Bates (2000), and I confess that any problems of application or understanding are surely expressions of my misunderstanding.

What I am concerned about is the level of complexity that can be justified in the ultimate model by the contribution to improved fit by the model of the observed data. Of course, I have collected a set of predictors guided by theory, custom and practice in my field. But, as you have seen, I might choose to draw on a large set of potential predictors. I know that one cannot – to take one option used widely by psycholinguistics researchers – simply inspect a correlation table, comparing predictors with the outcome variable, and exclude those predictors that do not correlate with the outcome. Harrell (2001) comments that that approach has a biasing effect on the likelihood of detecting an effect that is really there (consuming ‘ghost degrees of freedom’, or something like that). So what should we do, or how should we satisfy ourselves that we should or should not include one or more predictor variables in our models?

You could start complex and step backwards to the simplest model that appears justified by model fit to observed data. I will return to that option another time.

In this post, I will start with the simplest model and build complexity up. Simpler models can be said to be nested in more complex models, e.g. a simple model including just main effects vs. a more complex model also including interactions. I will add complexity by adding sets of predictors, and I will compare each proximate pair of simple model and more complex model using the Likelihood Ratio Test (Pinheiro & Bates, 2000; see, especially, chapter 2, pp. 75-).

We can compare models with the same random effects (e.g. of subjects or items) but varying in complexity with respect to the fixed effects, i.e. replicable or manipulated variables like item length or participant age, and, separately, models with the same fixed effects but varying random effects.

Note that in comparing models varying in random effects, one estimates those effects using restricted maximum likelihood: that is what the argument REML = TRUE in the lmer() function call specifies. In comparing models varying in fixed effects, one must estimate those effects using maximum likelihood: that is what REML = FALSE specifies. See comments in Baayen (2008) and Pinheiro & Bates (2000); I will come back to these important terms another time, focusing in this post on just writing the code for running the models.

We can compare the more and less complex models using the Likelihood Ratio Test (LRT), which is executed with the anova([simple model], [more complex model]) function call. A \chi^2 statistic is computed and evaluated against the \chi^2 distribution to estimate the probability of an increase in model likelihood as large as or larger than that observed when comparing the more and less complex models, given the null hypothesis of no gain in likelihood.

Note that I log transform the RTs but one might also consider examining the reciprocal RT.
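For reference, using the reciprocal would just mean swapping a transformed RT variable for log(RT) in the model formulae below; a minimal sketch (the variable name invRT is mine):


# an alternative to log(RT): the (scaled, negative) reciprocal of RT
nomissing.full$invRT <- -1000/nomissing.full$RT
# ... then use invRT in place of log(RT) in the lmer() calls that follow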


# we can move to some multilevel modelling ###############################################################

# we might start by examining the effects of predictors in a series of additions of sets of predictors
# we can hold the random effects, of subjects and items, constant, while varying the fixed effects
# we can compare the relative fit of models to data using the anova() likelihood ratio test
# in this series of models, we will use ML fitting

full.lmer0 <- lmer(log(RT) ~

# Trial.order +
#
# cAge + cTOWREW + cTOWRENW + cART +
#
# cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer0,corr=F)

#

full.lmer1 <- lmer(log(RT) ~

 Trial.order +

 # cAge + cTOWREW + cTOWRENW + cART +
 #
 # cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer1,corr=F)

#

full.lmer2 <- lmer(log(RT) ~

 Trial.order +

 cAge + cTOWREW + cTOWRENW + cART +

 # cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer2,corr=F)

#

full.lmer3 <- lmer(log(RT) ~

 Trial.order +

 cAge + cTOWREW + cTOWRENW + cART +

 cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer3,corr=F)

#

full.lmer4 <- lmer(log(RT) ~

 Trial.order +

 (cAge + cTOWREW + cTOWRENW + cART)*

 (cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG) +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer4,corr=F)

# compare models of varying complexity

anova(full.lmer0, full.lmer1)
anova(full.lmer1, full.lmer2)
anova(full.lmer2, full.lmer3)
anova(full.lmer3, full.lmer4)

This code will first print out the summaries of each model, along the lines you have already seen. It will then give you this:


# > # compare models of varying complexity
# >
# > anova(full.lmer0, full.lmer1)
# Data: nomissing.full
# Models:
# full.lmer0: log(RT) ~ (1 | subjectID) + (1 | item_name)
# full.lmer1: log(RT) ~ Trial.order + (1 | subjectID) + (1 | item_name)
# Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
# full.lmer0 4 -977.58 -951.31 492.79
# full.lmer1 5 -975.83 -943.00 492.92 0.2545 1 0.6139
# > anova(full.lmer1, full.lmer2)
# Data: nomissing.full
# Models:
# full.lmer1: log(RT) ~ Trial.order + (1 | subjectID) + (1 | item_name)
# full.lmer2: log(RT) ~ Trial.order + cAge + cTOWREW + cTOWRENW + cART + (1 |
# full.lmer2: subjectID) + (1 | item_name)
# Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
# full.lmer1 5 -975.83 -943.00 492.92
# full.lmer2 9 -977.09 -917.98 497.54 9.2541 4 0.05505 .
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# > anova(full.lmer2, full.lmer3)
# Data: nomissing.full
# Models:
# full.lmer2: log(RT) ~ Trial.order + cAge + cTOWREW + cTOWRENW + cART + (1 |
# full.lmer2: subjectID) + (1 | item_name)
# full.lmer3: log(RT) ~ Trial.order + cAge + cTOWREW + cTOWRENW + cART + cLength +
# full.lmer3: cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG + (1 | subjectID) +
# full.lmer3: (1 | item_name)
# Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
# full.lmer2 9 -977.09 -917.98 497.54
# full.lmer3 15 -1115.42 -1016.91 572.71 150.34 6 < 2.2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# > anova(full.lmer3, full.lmer4)
# Data: nomissing.full
# Models:
# full.lmer3: log(RT) ~ Trial.order + cAge + cTOWREW + cTOWRENW + cART + cLength +
# full.lmer3: cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG + (1 | subjectID) +
# full.lmer3: (1 | item_name)
# full.lmer4: log(RT) ~ Trial.order + (cAge + cTOWREW + cTOWRENW + cART) *
# full.lmer4: (cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG) +
# full.lmer4: (1 | subjectID) + (1 | item_name)
# Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
# full.lmer3 15 -1115.4 -1016.91 572.71
# full.lmer4 39 -1125.7 -869.56 601.84 58.262 24 0.0001119 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# it looks like the full model, including interactions but not the trialorder variable, is justified

Notice:

— It is not very pretty, but it does tell you what predictors are included in each model you are comparing and, for each pair of models being compared, the outcome of the likelihood ratio test, giving the numbers I usually report: the \chi^2, df and the p-value.

— I am looking at this output to see if, for each pair of models, the more complex model fits the data significantly better (if p <= 0.05).

— I add predictors in blocks and assess the improvement of fit associated with the whole block.

I have made a lot of promises to explain things more another time. So I will add one more. Pinheiro & Bates (2000; p. 88) argue that Likelihood Ratio Test comparisons of models varying in fixed effects tend to be anticonservative, i.e. they will lead you to observe significant differences in model fit more often than you should. I think they are talking, especially, about situations in which the number of model parameter differences (differences between the complex model and the nested simpler model) is large relative to the number of observations. This is not really a worry for this dataset, but I will come back to the substance of this view, and alternatives to the approach taken here.

What next?

What I would seek to do in the next series of analyses is determine if the effects that are significant appear to be significant due:

1. to the presence of other, nonsignificant effects — making the check by removing the nonsignificant effects then rerunning the model;

2. to the presence of outlier observations — making the check by removing observations associated with large model residuals then rerunning the model.
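For the second of those checks, one common recipe (see e.g. Baayen, 2008) is to remove observations with large standardized residuals and refit the model. A rough sketch, assuming we are checking the full.lmer3 model fitted above and using a conventional cut-off of 2.5 standard deviations (the object names keep, used and trimmed are mine):


# flag observations with absolute standardized residuals below 2.5
keep <- abs(as.vector(scale(resid(full.lmer3)))) < 2.5

# the rows of nomissing.full actually used in the fit (rows with NAs get dropped by lmer)
used <- rownames(model.frame(full.lmer3))

# refit the model on the trimmed data
trimmed <- nomissing.full[used[keep], ]
full.lmer3.trimmed <- update(full.lmer3, data = trimmed)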

What have we learnt?

(The code I have used in running the analysis described in this post can be found here.)

Firstly, I am going to have to do a lot more explaining, for my own sake, and have promised to do so for a wide range of ideas, including those concerning the methods of model fitting, and, especially, the discussion around how one determines the appropriate set of fixed effects.

Secondly, that the model code is both simple and – on its face – reasonably intuitive.

Key vocabulary (mostly, a list of things to be explained)

fixed effects

random effects

random intercepts

random slopes

interactions

multiplicative products

centering

maximum likelihood

restricted maximum likelihood

Markov chain Monte Carlo

Likelihood Ratio Test

lmer()

anova()

Reading

Baayen, R. H. (2008). Analyzing linguistic data. Cambridge University Press.
Bates, D. M. (2005). Fitting linear mixed models in R. R News, 5, 27-30.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioural sciences (3rd. edition). Mahwah, NJ: Lawrence Erlbaum Associates.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York,NY: Cambridge University Press.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-plus (Statistics and Computing). New York: Springer.


Why R? One hour R workshop

Watching someone else do or talk about computing is boring, as many rubbish and one good film of the past have proved.

computers-wargames-imdb

Still from WarGames (1983), source: imdb

Better to do something practical: here, the task is to evaluate whether R is something you want or need to learn. Thus, this practical session is designed to show that:

1. It is easy to download and install R along with R-Studio, the interface you might use to work with R.

2. It is easy and efficient to load data into R and work with it.

3. Performing analyses makes intuitive sense.

4. Visualizing your data in R is way more effective than anywhere else I know – the plots are prettier and more informative.

In case you’re interested, I’ve got a more drawn out exposition of this point, see “Browse by category” menu, starting here:

https://ayeimanol-r.net/2013/04/13/getting-started-in-r/

Installation – R

How do you download and install R? Google “CRAN” and click on the download link, then follow the instructions (e.g. at “install R for the first time”).

R-CRAN

Obviously, you will require administration rights, though I believe you can download the installation file (a .dmg on a Mac), run it and then stick the application in your applications folder without the need for full admin rights.

What you’ll get is the R statistical computing environment and you can work with R through the console. I find it easier to use the R-Studio interface, for reasons I’ll explain.

Installation – RStudio

I am repeating this post in summary:

https://ayeimanol-r.net/2013/04/21/289/

If you google “R-studio” you will get to this window:

R-0-1-rstudio

Click on the “Download now” button and you will see this window:

R-0-2-rstudio

Click on the “Download RStudio desktop” and you will see this window:

R-0-3-rstudio

You can just click on the link to the installer recommended for your computer.

What happens next depends on whether you have administrative/root privileges on your computer.

I believe you can install R-studio without such rights using the zip/tarball download.

Having installed R and R-studio…

— in Windows you will see these applications now listed as newly installed programs at the start menu. Depending on what you said in the installation process, you might also have icons on your desktop.

— in Mac you should see RStudio in your applications folder and launchpad.

Click on the R-studio icon – it will pick up the R installation for you.

Now we are ready to get things done in R.

Start a new script in R-studio, install packages, draw a plot

Here, we are going to 1. start a new script, 2. install then load a library of functions (ggplot2) and 3. use it to draw a plot.

Depending on what you did at installation, you can expect to find shortcut links to R (a blue R) and to R-Studio (a shiny blue circle with an R) in the Windows start menu, or as icons on the desktop.

To get started, in Windows, double click (left mouse button) on the R-Studio icon.

Maybe you’re now looking at this:

R-rstudio-1-1

1. Start a new script

What you will need to do next is go to the file menu [top left of R-Studio window] and create a new R script:

–move the cursor to file – then – new – then – R script and then click on the left mouse button

or

— just press the buttons ctrl-shift-N at the same time — the second move is a keyboard shortcut for the first; I prefer keyboard shortcuts to mouse moves

— to get this:

R-rstudio-1-2

What’s next?

This circled bit you see in the picture below:

R-rstudio-1-3-console

is the console.

It is what you would see if you open R directly, not using R-Studio.

You can type and execute commands in it but mostly you will see unfolding here what happens when you write and execute commands in the script window, circled below:

R-rstudio-1-3-script

— The console reflects your actions in the script window.

If you look on the top right of the R-Studio window, you can see two tabs, Workspace and History: these windows (they can be resized by dragging their edges) also reflect what you do:

1. Workspace will show you the functions, data files and other objects (e.g. plots, models) that you are creating in your R session.

[Workspace — See the R introduction, and see this helpful post by Quick-R — when you work with R, your commands result in the creation of objects e.g. variables or functions, and during an R session these objects are created and stored by name — the collection of objects currently stored is the workspace.]

2. History shows you the commands you execute as you execute them.

— I look at the Workspace a lot when using R-Studio, and no longer look at (but did once use) History much.

My script is my history.

2. Install then load a library of functions (ggplot2)

We can start by adding some capacity to the version of R we have installed. We install packages of functions that we will be using e.g. packages for drawing plots (ggplot2) or for modelling data (lme4).

[Packages – see the introduction and this helpful page in Quick-R — all R functions and (built-in) datasets are stored in packages, only when a package is loaded are its contents available]

Copy then paste the following command into the script window in R-studio:


# the package names go into a single character vector, created with c();
# rms is included here because it is loaded with library(rms) below
install.packages(c("ggplot2", "reshape2", "plyr", "languageR",
                   "lme4", "rms", "psych"))

Highlight the command in the script window …

— to highlight the command, hold down the left mouse button, drag the cursor from the start to the finish of the command

— then either press the run button …

[see the run button on the top right of the script window]

… or press the buttons CTRL-enter together, and watch the console show you R installing the packages you have requested.

Those packages will always be available to you, every time you open R-Studio, provided you load them at the start of your session.

[I am sure there is a way to ensure they are always loaded at the start of the session and will update this when I find that out.]
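[One way I believe this can be done is to put the library() calls in a file called .Rprofile in your home directory or project folder; R sources that file at the start of every session. A sketch:]


# contents of a file named .Rprofile in your home or project directory;
# R runs this automatically at the start of each session
library(ggplot2)
library(plyr)
library(reshape2)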

There is a 2-minute version of the foregoing laborious step-by-step, by ajdamico, here. N.B. the video is for installing and loading packages using the plain R console but applies equally to R-Studio.

Having installed the packages, in the same session or in the next session, the first thing you need to do is load the packages for use by using the library() function:


library(languageR)
library(lme4)
library(ggplot2)
library(rms)
library(plyr)
library(reshape2)
library(psych)

— copy/paste or type these commands into the script, highlight and run them: you will see the console fill up with information about how the packages are being loaded:

R-rstudio-1-4-library

Notice that the packages window on the bottom right of R-Studio now shows the list of packages have been ticked:

R-rstudio-1-4-library-packages

Let’s do something interesting now.

3. Use ggplot function to draw a plot

[In the following, I will use a simple example from the ggplot2 documentation on geom_point.]

Copy the following two lines of code into the script window:


p <- ggplot(mtcars, aes(wt, mpg))
p + geom_point()

— run them and you will see this:

R-rstudio-1-ggplot-point

— notice that the plot window, bottom right, shows you a scatterplot.

How did this happen?

Look at the first line of code:


p <- ggplot(mtcars, aes(wt, mpg))

— it creates an object, you can see it listed in the workspace (it was not there before):

R-rstudio-1-ggplot-point-2

— that line of code does a number of things, so I will break it down piece by piece:

p <- ggplot(mtcars, aes(wt, mpg))


p <- ...

— means: create <- (assignment arrow) an object (named p, now in the workspace)


... ggplot( ... )

— means do this using the ggplot() function, which is provided by installing the ggplot2 package and then loading it with library(ggplot2) — the ggplot2 package of data and functions


... ggplot(mtcars ...)

— means create the plot using the data in the database (in R: dataframe) called mtcars

— mtcars is a dataframe that gets loaded together with functions like ggplot when you execute: library(ggplot2)


... ggplot( ... aes(wt, mpg))

— aes(wt,mpg) means: map the variables wt and mpg to the aesthetic attributes of the plot.

In the ggplot2 book (Wickham, 2009, e.g. pp 12-), the things you see in a plot, the colour, size and shape of the points in a scatterplot, for example, are aesthetic attributes or visual properties.

— with aes(wt, mpg) we are informing R(ggplot) that the named variables are the ones to be used to create the plot.

Now, what happens next concerns the nature of the plot we want to produce: a scatterplot representing how, for the data we are using, values on one variable relate to values on the other.

A scatterplot represents each observation as a point, positioned according to the value of two variables. As well as a horizontal and a vertical position, each point also has a size, a colour and a shape. These attributes are called aesthetics, and are the properties that can be perceived on the graphic.

(Wickham: ggplot2 book, p.29; emphasis in text)

— The observations in the mtcars database are information about cars, including weight (wt) and miles per gallon (mpg).

[run help(mtcars) in R to see the documentation, in case you’re interested]

— This bit of the code asked the p object to include two attributes: wt and mpg.

— The aesthetics (aes) of the graphic object will be mapped to these variables.

— Nothing is seen yet, though the object now exists, until you run the next line of code.

The next line of code:

p + geom_point()

— adds (+) a layer to the plot, a geometric object: geom

— here we are asking for the addition of geom_point(), a scatterplot of points

— the variables mpg and wt will be mapped to the aesthetics, x-axis and y-axis position, of the scatterplot

The wonderful thing about the ggplot() function is that we can keep adding geoms to modify the plot.

— add a command to the second line of code to show the relationship between wt and mpg for the cars in the mtcars dataframe:


p <- ggplot(mtcars, aes(wt, mpg))
p + geom_point() + geom_smooth()

— here

+ geom_smooth()

adds a loess smoother to the plot, indicating the predicted miles per gallon (mpg) for cars, given the data in mtcars, given a car’s weight (wt).

R-rstudio-ggplot-loess-export

[If you want to know what loess means – this post looks like a good place to get started.]

Notice that there is an export button on the top of the plots window pane, click on it and export the plot as a pdf.

Where does that pdf get saved to? Good question.
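One way to take control of that is to save plots from code with ggplot2's ggsave() function, which writes the most recent plot to the file name you give it, with relative paths resolved against the working directory (which we will set with setwd() below); a minimal sketch (the file name is just an example):


# save the last plot drawn to a pdf in the current working directory
ggsave("cars-wt-mpg.pdf", width = 8, height = 6)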

Extras

The really cool thing about drawing plots in R (if using ggplot2, especially) is that one can modify plots towards the use you have in mind, intuitively, over a series of steps:

So far, we have mapped car weight (wt) and miles per gallon (mpg) in the mtcars database to the x and y coordinates of the plot. Is there more in the data that we can bring out in the visualization?

You can check out what is in the database:


# what is in this database?

help(mtcars)

View(mtcars)

Maybe we want to bring the power of the cars into the picture because, presumably, heavier cars consume more petrol because they need to have more power (you tell me).

We can add power (hp) to the set of aesthetics and map it to further plot attributes e.g. colour or size:


# maybe it would be interesting to add information to the plot about
# horsepower

p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = hp))
p + geom_point()

# maybe it would be interesting to add information to the plot about
# horsepower

p <- ggplot(mtcars, aes(x = wt, y = mpg, size = hp))
p + geom_point()

Next, I would like to edit the plot a bit for presentation:

i. add in a smoother, though since the wiggliness of the default loess is not helping much, switch to a straight-line fit;

ii. add a plot title and relabel the axes to make them more intelligible;

iii. modify the axis labelling because the labels are currently too small for a presentation;

iv. make the background white.

— of course, this is going to be easy and do-able in steps, as follows.

Notice that I change the format of the coding a little bit to make it easier to keep track of the changes.


# add the smoother back in, I don't think the wiggliness of the default loess helps much so modify the line fitting method

p <- ggplot(mtcars, aes(x = wt, y = mpg, size = hp, colour = hp))
p + geom_point() + geom_smooth(method = "lm")

# add a plot title and make the axis labels intelligible

p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = hp, size = hp))
p <- p + geom_point() + geom_smooth(method = "lm") + labs(title = "Cars plot")
p <- p + xlab("Vehicle Weight") + ylab("Miles per Gallon")
p

# modify the label sizing -- modify sizes plot title, axis labels, axis tick value labels

p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = hp, size = hp))
p <- p + geom_point() + geom_smooth(method = "lm") + labs(title = "Cars plot")
p <- p + xlab("Vehicle Weight") + ylab("Miles per Gallon")
p <- p + theme(plot.title = element_text(size = 80))
p <- p + theme(axis.text = element_text(size = 35))
p <- p + theme(axis.title.y = element_text(size = 45))
p <- p + theme(axis.title.x = element_text(size = 45))
p

# change the background so the plot is on white
# note: theme_bw() is a complete theme, so add it before the element_text()
# tweaks, otherwise it resets the text sizes back to their defaults

p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = hp, size = hp))
p <- p + geom_point() + geom_smooth(method = "lm") + labs(title = "Cars plot")
p <- p + xlab("Vehicle Weight") + ylab("Miles per Gallon")
p <- p + theme_bw()
p <- p + theme(plot.title = element_text(size = 80))
p <- p + theme(axis.text = element_text(size = 35))
p <- p + theme(axis.title.y = element_text(size = 45))
p <- p + theme(axis.title.x = element_text(size = 45))
p

R-whyR-plotting-edits

On reflection, you might ask what we gain by using both colour and size to depict horsepower. Decisions like these will be more sensible once we start using some real data.
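If you do want to experiment a little more with mtcars before that, one sketch worth trying is to map a categorical variable to colour instead of a continuous one. Here I use cyl, the number of cylinders, which is another column in mtcars; wrapping it in factor() tells ggplot() to treat it as categorical:


# map the number of cylinders, treated as a factor, to point colour
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl)))
p + geom_point()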

Getting your data into R

Just as in other statistics applications e.g. SPSS, data can be entered manually, or imported from an external source. Here I will talk about: 1. loading files into the R workspace as dataframe object; 2. getting summary statistics for the variables in the dataframe; 3. and plotting visualizations of pairwise relationships in the data.

1. Loading files into the R workspace as dataframe object

Start by getting your data files

We will work with the same data file described at the top of this blog: ML scores 080612 220413.csv, a .csv (comma separated values) spreadsheet of subject scores from the ML study, an experimental study of reading. Download it, then copy it from your downloads folder into a folder whose location you know.

Read a data file into the R workspace

I have talked about the workspace before; here we load data into it using the functions setwd() and read.csv().

You need to tell R where your data can be found, and where to put any outputs (e.g. the pdf of a plot). That means setting the working directory (see my advice earlier about folders): the folder where you put the data you wish to analyse.

For this example, the folder is:

C:\Users\p0075152\Dropbox\resources R public

[Do you know what this is – I mean, the equivalent, on your computer? If not, see the video above – or get a Windows for Dummies book or the like.]

What does R think the working directory currently is? You can find out by running the getwd() command:


getwd()

and you will likely get told this:

> getwd()
[1] "/Users/robdavies"

— with a different username, obviously.

The username folder is not where the data file is, so you set the working directory using setwd()


setwd("/Users/robdavies/Downloads")

— you need to:

— find the file in your downloads folder

— highlight the file in Mac Finder and use the Finder File menu Get Info option to find the path to it

— copy that path information into the setwd() function call and run it

— typical errors include misspelling elements of the path and not wrapping it in quotation marks (""); for Windows paths, see the sketch below this list
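For Windows users, a small sketch using the example folder mentioned above: inside the quotes, write the path with forward slashes (or doubled backslashes), not single backslashes.


# on Windows, forward slashes work inside the quotes
setwd("C:/Users/p0075152/Dropbox/resources R public")

# doubling the backslashes also works
setwd("C:\\Users\\p0075152\\Dropbox\\resources R public")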

Load the data file using the read.csv() function


subjects <- read.csv("ML scores 080612 220413.csv", header=T, na.strings = "-999")

Notice:

1. subjects… — I am calling the dataframe something sensible

2. …<- read.csv(…) — this is the part of the code loading the file into the workspace

3. …("ML scores 080612 220413.csv"…) — the file has to be named, spelled correctly, and given with its .csv suffix

4. …header = T… — I am asking for the first row of the file to be read as the column names of the resulting subjects dataframe

5. …na.strings = "-999"… — I am asking R to code -999 values, where they are found, as NA; in R, NA means Not Available, i.e. a missing value.

Things that can go wrong here:

— you did not set the working directory to the folder where the file is

— you misspelled the file name

— either error will cause R to tell you the file does not exist (it does, just not where you said it was)

I recommend you try making these errors and getting the error message deliberately. Making errors is a good thing in R.
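A quick way to check yourself before (or after) hitting that error is to ask R what it can actually see in the current working directory. A minimal sketch:


# list the files R can see in the current working directory
list.files()

# check whether the data file is visible before trying to read it
file.exists("ML scores 080612 220413.csv")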

Note:

— coding missing values systematically is a good thing; -999 works for me because it never occurs as a real value in my reading experiments (see the sketch after this note for a quick way to count the NAs)

— you can code the missing values in Excel, in the csv, before you get to this stage
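Assuming the read.csv() call worked (we will confirm that next), here is a quick sketch of how to count the values that na.strings coded as NA; Age is one of the columns in the subjects dataframe, as we will see below.


# count the NAs in each column of the dataframe
colSums(is.na(subjects))

# or count the NAs in just one variable
sum(is.na(subjects$Age))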

Let’s assume you ran the command correctly. What do you see? This:

R-2-1-readcsv

Notice:

1. in the workspace window, you can see the dataframe object, subjects, you have now created

2. in the console window, you can see the commands executed

3. in the script window, you can see the code used

4. the file name, top of the script window, went from black to red, showing a change has been made but not yet saved.

To save the change, use the keyboard shortcut CTRL-S.

2. Getting summary statistics for the variables in the dataframe

It is worth reminding ourselves that R is, among other things, an object-oriented programming language. See this:

The entities that R creates and manipulates are known as objects. These may be variables, arrays of numbers, character strings, functions, or more general structures built from such components.

During an R session, objects are created and stored by name (we discuss this process in the next session). The R command

     > objects()

(alternatively, ls()) can be used to display the names of (most of) the objects which are currently stored within R. The collection of objects currently stored is called the workspace.

[R introduction: http://cran.r-project.org/doc/manuals/r-release/R-intro.html#Data-permanency-and-removing-objects]

or this helpful video by ajdamico.

So, we have created an object using the read.csv() function. We know that object was created from a .csv file holding subject scores data. In R, in the workspace, it is an object that is a dataframe, a structure much like the spreadsheets in Excel, SPSS etc.: columns are variables and rows are observations, and the columns can hold variables of different types (e.g. factors, numbers etc.). Most of the time, we’ll be using dataframes in our analyses.
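A minimal sketch of some quick ways to inspect that dataframe object (Age is one of the variables in the subjects data, used again below):


# how many rows (observations) and columns (variables)?
dim(subjects)

# what are the columns called?
names(subjects)

# pull out a single column (variable) using the $ operator
subjects$Age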

You can view the dataframe by running the command:


View(subjects)  # note the capital V

— or actually just clicking on the name of the dataframe in the workspace window in R-studio, to see:

R-2-1df-view

Note: you cannot edit the file in that window, try it.

Now that we have created an object, we can interrogate it. We can ask what columns are in the subjects dataframe, how many variables there are, what the average values of the variables are, if there are missing values, and so on using an array of useful functions:


head(subjects, n = 2)

summary(subjects)

# describe() comes from the psych package: install it with install.packages("psych")
# if you have not already done so, then load it

library(psych)

describe(subjects)

str(subjects)

psych::describe(subjects)

# how many columns (variables) are in the dataframe?

length(subjects)

# how many observations of the Age variable?

length(subjects$Age)

Copy and paste these commands into the script window and run them; below is what you will see in the console:

R-2-1-descriptives

Notice:

1. the head() function gives you the top rows, showing column names and data in cells; I asked for 2 rows of data with n = 2

2. the summary() function gives you: minimum, maximum, median and quartile values for numeric variables, numbers of observations with one or other levels for factors, and will tell you if there are any missing values, NAs

3. describe() (from the psych package) will give you means, SDs, the kind of thing you report in tables in articles

4. str() will tell you if variables are factors, integers etc.

If I were writing a report on these subjects data, I would copy the output from describe() into Excel, format it to APA style, and stick it in a Word document.
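If you want to skip the copy-pasting, here is a sketch of one way to get the describe() output into a file Excel can open; this assumes the psych package is installed, and the output file name is just an example (it will land in the working directory).


# write the descriptive statistics to a .csv you can open in Excel
descriptives <- psych::describe(subjects)
write.csv(descriptives, "ML-subjects-descriptives.csv")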

3. Plotting visualization of data and doing a little bit of modelling

In a previous post, I went into how you can plot histograms of the data to examine the distribution of the variables; see here:

https://ayeimanol-r.net/2013/04/22/getting-started-working-directories-loading-data-and-a-bit-more-plotting/

but for this post I want to stick to the focus on relationships between variables.

Let’s use what we learnt to examine visually the relationships between variables in the dataset. Here, our concern is with measures of reading performance collected using the TOWRE (Torgesen et al., 1999), a test of word and non-word reading accuracy: participants are asked to read aloud lists of words or non-words of increasing difficulty in 45 s, and the performance score is the number of items read correctly in that time.

I might expect to see that the ability to read aloud words correctly actually increases as: 1. people get older – increased word naming accuracy with increased age; 2. people read more – increased word naming accuracy with increased reading history (ART score); 3. people show increased ability to read aloud made-up words (pseudowords) or nonwords – increased word naming accuracy with increased nonword naming accuracy.
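Before plotting, we could get a quick numeric check on those expectations with a correlation matrix. A minimal sketch, using the column names from the subjects dataframe and assuming they are all numeric:


# pairwise correlations between the subject variables
cor(subjects[, c("Age", "ART_HRminusFR", "TOWRE_wordacc", "TOWRE_nonwordacc")],
    use = "pairwise.complete.obs")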

We can examine these relationships and, in this post, we will work on doing so through using scatter plots like that shown below.

R-3-1-subjects-scatter

What does the scatterplot tell us, and how can it be made?

Notice:

— if people are more accurate on nonword reading they are also often more accurate on word reading

— at least one person is very bad on nonword reading but OK on word reading

— most of the data we have for this group of ostensibly typically developing readers is up in the good nonword readers and good word readers zone.

I imposed a line of best fit indicating the relationship between nonword and word reading estimated using linear regression, the line in blue, as well as a shaded band to show uncertainty – confidence interval – over the relationship. You can see that the uncertainty increases as we go towards low accuracy scores on either measure. We will return to these key concepts in other posts.

For now, we will stick to showing the relationships between all the variables in our sample, modifying plots over a series of steps as follows:

[In what follows, I will be using the ggplot2 documentation that can be found online, and working my way through some of the options discussed in Winston Chang’s book, the R Graphics Cookbook.]

Draw a scatterplot using default options

Let’s draw a scatterplot using geom_point(). Run the following line of code


ggplot(subjects, aes(x = Age, y = TOWRE_wordacc)) + geom_point()

and you should see this in R-studio Plots window:

R-3-1-scatter-default

Notice:

— There might be no relationship between age and reading ability: how does ability vary with age here?

— Maybe the few older people, in their 60s, pull the relationship towards the negative i.e. as people get older abilities decrease.

Let’s imagine that we wanted to report a plot showing the relationship between word and nonword reading. We can modify the plot for presentation as we have done previously but we also add modifications to the axis labelling — fixing the axis limits and reducing the number of ticks to tidy the plot a bit.


# generate a scatterplot showing the relationship between a pair of variables
# let's split the plot construction into two sets of steps:

pNW <- ggplot(subjects, aes(x = TOWRE_nonwordacc, y = TOWRE_wordacc))
pNW <- pNW + geom_point() + ylim(70, 104)
pNW <- pNW + scale_x_continuous(breaks = c(20, 40, 60))
pNW <- pNW + geom_smooth(method = "lm", size = 1.5)
pNW <- pNW + theme_bw()
pNW

pNW <- pNW + ggtitle("Word vs. nonword reading\nskill")
pNW <- pNW + xlab("TOWRE nonword naming accuracy") + ylab("TOWRE word naming accuracy")
pNW <- pNW + theme(plot.title = element_text(size = 40))
pNW <- pNW + theme(axis.text = element_text(size = 25))
pNW <- pNW + theme(axis.title.y = element_text(size = 30))
pNW <- pNW + theme(axis.title.x = element_text(size = 30))
pNW

Notice:

— I am adding each change one at a time, making for more lines of code, but I could, and will, be more succinct (see the sketch below the plot).

— I add the title with: + ggtitle("Word vs. nonword reading\nskill")

— I change the axis labels with: + xlab("TOWRE nonword naming accuracy") + ylab("TOWRE word naming accuracy")

— I fix the y-axis limits with: + ylim(70, 104)

— I change where the tick marks on the x-axis are with: + scale_x_continuous(breaks = c(20, 40, 60)) — there were too many before

— I modify the size of the title, axis labels and axis tick value labels for better readability with the theme() calls, e.g. + theme(axis.title.x = element_text(size = 30)) and + theme(axis.text = element_text(size = 25))

— I add a smoother, i.e. a line showing a linear model prediction of y-axis values for the x-axis variable values, with: + geom_smooth(method = "lm", size = 1.5) — and I ask for the line size to be increased for better readability

— And I ask for the plot to have a white background with: + theme_bw()

All that will get you this:

R-whyR-ML-scatter-edits
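As promised, here is a more succinct version of the same plot, chaining all the layers in one go; a sketch that should reproduce the code above.


pNW <- ggplot(subjects, aes(x = TOWRE_nonwordacc, y = TOWRE_wordacc)) +
  geom_point() +
  geom_smooth(method = "lm", size = 1.5) +
  ylim(70, 104) +
  scale_x_continuous(breaks = c(20, 40, 60)) +
  theme_bw() +
  ggtitle("Word vs. nonword reading\nskill") +
  xlab("TOWRE nonword naming accuracy") +
  ylab("TOWRE word naming accuracy") +
  theme(plot.title = element_text(size = 40),
        axis.text = element_text(size = 25),
        axis.title = element_text(size = 30))
pNW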

A bit of modelling

Can we get some stats in here now?

You will have started to see signs of one feature of R: there are multiple ways to do any given thing. That is also true of what we are going to do next and last: a regression model of the ML subject scores data, to look at the predictors of word reading accuracy. We are going to use the lm() function because it is simple, but we could instead use ols() from the rms package.


# let's do a little bit of regression modelling

help(lm)

summary(subjects)

MLwords.lm <- lm(TOWRE_wordacc ~ Age + TOWRE_nonwordacc + ART_HRminusFR,
                 data = subjects)

summary(MLwords.lm)

Notice:

— The code specifying the model is written to produce a regression model object:

MLwords.lm <- lm(TOWRE_wordacc ~ Age + TOWRE_nonwordacc + ART_HRminusFR,
                 data = subjects)

— Run it and you will see a model object appear in the workspace window, top right of RStudio, but in the console you will only see the code being echoed.

— R will only show you what you ask for.

— The model formula is written out the same way you might write a regression equation i.e. the code syntax matches the statistical formula convention. Obviously, this is not an accident.

— Looking ahead, specifying models for ANOVA, mixed-effects models and so on follows the same sort of structure: the code syntax reflects the conceptual unity of these models.

Running this code will get you:

R-whyR-lm-regression-output

Thus, we see the same statistical output we would see anywhere else.
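Two functions that are handy right after fitting, as a quick sketch: coef() pulls out just the coefficient estimates, and confint() gives confidence intervals for them.


# the coefficient estimates on their own
coef(MLwords.lm)

# 95% confidence intervals for the coefficients
confint(MLwords.lm)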

What have we learnt?

— starting a new script

— installing and loading packages

— creating and editing plots with ggplot()

— setting the working directory and reading a .csv file into a dataframe

— getting summary statistics for the variables in a dataframe

— fitting a simple regression model with lm()

Vocabulary

functions

install.packages()

library()

ggplot()

setwd()

read.csv()

summary()

describe()

lm()

critical terms

object

workspace

package

scatterplot

loess

aesthetics

Resources

You can download the .R code I used for the class here.

There is a lot more that you can do with plotting than I have shown here, see the technical help with examples you can run just by copying them into RStudio, here:

http://docs.ggplot2.org/current/

There’s a nice tutorial here:

http://www.r-bloggers.com/ggplot2-tutorial/

Reading

Baayen, R. H. (2008). Analyzing linguistic data. Cambridge University Press.
Castles, A., & Nation, K. (2006). How does orthographic learning happen? In S. Andrews (Ed.), From inkmarks to ideas: Challenges and controversies about word recognition and reading. Psychology Press.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioural sciences (3rd. edition). Mahwah, NJ: Lawrence Erlbaum Associates.
Harm, M. W., & Seidenberg, M. S. (2004). Computing the meanings of words in reading: Cooperative division of labor between visual and phonological processes. Psychological Review, 111, 662-720.
Harrell, F. E., Jr. (2001). Regression modeling strategies: With applications to linear models, logistic regression, and survival analysis. New York, NY: Springer.
Masterson, J., & Hayes, M. (2007). Development and data for UK versions of an author and title recognition test for adults. Journal of Research in Reading, 30, 212-219.
Nation, K. (2008). Learning to read words. The Quarterly Journal of Experimental Psychology, 61, 1121-1133.
Rayner, K., Foorman, B. R., Perfetti, C. A., Pesetsky, D., & Seidenberg, M. S. (2001). How psychological science informs the teaching of reading. Psychological Science in the Public Interest, 2, 31-74.
Share, D. L. (1995). Phonological recoding and self-teaching: sine qua non of reading acquisition. Cognition, 55, 151-218.
Stanovich, K. E., & West, R. F. (1989). Exposure to print and orthographic processing. Reading Research Quarterly, 24, 402-433.
Torgesen, J. K., Wagner, R. K., & Rashotte, C. A. (1999). TOWRE: Test of Word Reading Efficiency. Austin, TX: Pro-Ed.
