Getting started — loading data, writing and saving scripts

[16/1/14 edit — as the procedure for finding data in Mac OS was detailed in class, I am not going to edit this post to show the Mac OS steps in detail because the steps are mostly the same and we have practised them in class.]

In this post, we shall: 1. load data into the R workspace; 2. get summary statistics for those data; 3. plot visualizations of the data distributions (histograms); and 4. finish up by closing and saving the .R file.

For this post, I am going to assume that you know about files, folders and directories on your computer. In Windows (e.g. in XP and 7), you should be familiar with Windows Explorer. (For Macs, we would be working with Finder and, within Finder, with Get Info.) If you are not familiar with Windows Explorer, this series of videos e.g. here on Windows Explorer should get you to the point where you understand: the difference between a file and a folder; file types; how to navigate the file and folder directory structure; how to get the details view of files in folders; and how to sort folder contents, e.g. by name or date. N.B. the screenshots of folders in what follows will look different from Windows Explorer as it normally appears in Windows 7. That is because I use xplorer2, as I find it easier to move stuff around.

Just as in other statistics applications, e.g. SPSS, data can be entered manually or imported from an external source. Here I will talk about: 1. loading files into the R workspace as a dataframe object; 2. getting summary statistics for the variables in the dataframe; and 3. plotting visualizations of data distributions that will be helpful in making sense of the dataframe.

1. Loading files into the R workspace as dataframe object

Start by getting your data files

I often do data collection using paper-and-pencil standardized tests of reading ability, and also DMDX scripts running e.g. a lexical decision task testing online visual word recognition. Typically, I end up with an Excel spreadsheet file called subject scores 210413.xlsx and a DMDX output data file called lexical decision experiment 210413.azk. I’ll talk about the experimental data files another time.

The paper-and-pencil stuff gets entered into the subject scores database by hand. Usually, the spreadsheets come with notes to myself on what was happening during testing. To prepare for analysis, though, I will often want just a spreadsheet showing columns and rows of data, where the columns are named sensibly (meaningful names I can understand) with no spaces in either the column names or the entries in data cells. Something that looks like this:

R-2-1-csv

This is a screen shot of a file called: ML scores 080612 220413.csv

— which you can download here.

These data were collected in an experimental study of reading, the ML study.

The file is in .csv (comma separated values) format. I find this format easy to use in my workflow (collect data, tidy in Excel, output to csv, read into R, analyse in R), but you can get almost any kind of data you can think of into R (SPSS .sav or .dat, .xlsx, stuff from the internet — no problem).

Let’s assume you’ve just downloaded the database and, in Windows, it has ended up in your downloads folder. I will make life easier for myself by copying the file into a folder whose location I know.

— I know that sounds simple but I often work with collaborators who do not know where their files are.

I’ll just copy the file from the downloads folder into a folder I called R resources public:

R-2-1-folders

Advice to you

Your lives, and my life if I am working with you, will be easier if you are systematic and consistent about your folders: if I am supervising you, I expect to see a folder for each experiment and, within that, sub-folders for stimulus set selection; data collection script and materials; raw data files; (tidied) collated data; and analysis.

Read a data file into the R workspace

I have talked about the workspace before; here we will load data into it using the functions setwd() and read.csv().

You need to tell R where your data can be found, and where to put the outputs (e.g. the pdf of a plot). You do this by setting the working directory (see my advice earlier about folders): this is the Windows folder where you put the data you wish to analyse.

For this example, the folder is:

C:\Users\me\Dropbox\resources R public

[Do you know what this is – I mean, the equivalent, on your computer? If not, see the video above – or get a Windows for Dummies book or the like.]

What does R think the working directory currently is? You can find out by running the getwd() command:


getwd()

and you will likely get told this:


> getwd()
[1] "C:/Users/p0075152/Documents"

The Documents folder is not where the data file is, so you set the working directory using setwd()


setwd("C:/Users/me/Dropbox/resources R public")

Notice: 1. the address in Windows Explorer will have all the slashes facing backwards, but “\” in R is a special (escape) character, so 2. the address given in setwd() has all the slashes facing forwards; 3. the address is in quotation marks; and 4. of course, if you spell the address wrong, R will give you an error message. I tend to copy the address from Windows Explorer and change the slashes by hand in the R script.
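For example, a quick sketch of the two forms that work on Windows (the path here is mine; substitute your own):


# forward slashes ...
setwd("C:/Users/me/Dropbox/resources R public")
# ... or doubled (escaped) backslashes
setwd("C:\\Users\\me\\Dropbox\\resources R public")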

Load the data file using the read.csv() function


subjects <- read.csv("ML scores 080612 220413.csv", header=T, na.strings = "-999")

Notice:

1. subjects… — I am calling the dataframe something sensible

2. …<- read.csv(…) — this is the part of the code loading the file into the workspace

3. …(“ML scores 080612 220413.csv”…) — the file has to be named, spelled correctly, and have the .csv suffix given

4. …, header = T… — I am asking for the column names to be included in the subjects dataframe resulting from this function call

5. … na.strings = “-999” … — I am asking R to code -999 values, where they are found, as NA – in R, NA means Not Available i.e. a missing value.

Things that can go wrong here:

— you did not set the working directory to the folder where the file is

— you misspelled the file name

— either error will cause R to tell you the file does not exist (it does, just not where you said it was)

I recommend you try making these errors and getting the error message deliberately. Making errors is a good thing in R.

Note:

— coding missing values systematically is a good thing; -999 works for me because it never occurs in real life for my reading experiments

— you can code missing values in Excel, in the .csv, before you get to this stage
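If you want to check how the missing value coding worked out, here is a quick sketch (using the subjects dataframe created above) that counts the NAs in each column:


# count the number of NA values in each column of the dataframe
colSums(is.na(subjects))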

Let’s assume you ran the command correctly. What do you see? This:

R-2-1-readcsv

Notice:

1. in the workspace window, you can see the dataframe object, subjects, you have now created

2. in the console window, you can see the commands executed

3. in the script window, you can see the code used

4. the file name, top of the script window, went from black to red, showing a change has been made but not yet saved.

To save a change, the keyboard shortcut is CTRL-S.

2. Getting summary statistics for the variables in the dataframe

It is worth reminding ourselves that R is, among other things, an object-oriented programming language. See this:

The entities that R creates and manipulates are known as objects. These may be variables, arrays of numbers, character strings, functions, or more general structures built from such components.

During an R session, objects are created and stored by name (we discuss this process in the next session). The R command

     > objects()

(alternatively, ls()) can be used to display the names of (most of) the objects which are currently stored within R. The collection of objects currently stored is called the workspace.

[R introduction: http://cran.r-project.org/doc/manuals/r-release/R-intro.html#Data-permanency-and-removing-objects]

or this helpful video by ajdamico.

So, we have created an object using the read.csv() function. We know that object was a .csv file holding subject scores data. In R, in the workspace, it is an object that is a dataframe: a structure much like the spreadsheets in Excel, SPSS etc., where columns are variables and rows are observations, and where the columns can correspond to variables of different types (e.g. factors, numbers etc.). Most of the time, we’ll be using dataframes in our analyses.

You can view the dataframe by running the command:


View(subjects)

— or actually just clicking on the name of the dataframe in the workspace window in R-studio, to see:

R-2-1df-view

Note: you cannot edit the file in that window, try it.

I am going to skip over a whole bunch of stuff on how R deals with data. A useful Quick-R tutorial can be found here. The R-in-Action book, which builds on that website, will tell you that R has a wide variety of objects for holding data: scalars, vectors, matrices, arrays, dataframes, and lists. I mostly work with dataframes and vectors, so that’s mostly what we’ll encounter here.
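To make the two structures we will use most concrete, here is a tiny sketch with made-up values (the object names and numbers are mine, nothing to do with the ML data):


# a numeric vector
ages <- c(21, 34, 58)

# a dataframe built from that vector plus a character variable
demo <- data.frame(Age = ages, Group = c("younger", "older", "older"))

# inspect the structure of the dataframe
str(demo)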

Now that we have created an object, we can interrogate it. We can ask what columns are in the subjects dataframe, how many variables there are, what the average values of the variables are, if there are missing values, and so on using an array of useful functions:


head(subjects, n = 2)        # show the first two rows of the dataframe

summary(subjects)            # summary statistics for every variable

describe(subjects)           # means, SDs etc. (needs the psych package loaded)

str(subjects)                # the structure: what type each variable is

psych::describe(subjects)    # the same as describe(), naming the package explicitly

length(subjects)             # the number of columns (variables) in the dataframe

length(subjects$Age)         # the number of observations in the Age variable

Copy and paste these commands into the script window and run them; below is what you will see in the console:

R-2-1-descriptives

Notice:

1. the head() function gives you the top rows, showing column names and data in cells; I asked for 2 rows of data with n = 2

2. the summary() function gives you: minimum, maximum, median and quartile values for numeric variables, numbers of observations with one or other levels for factors, and will tell you if there are any missing values, NAs

3. describe() (from the psych package) will give you means, SDs, the kind of thing you report in tables in articles

4. str() will tell you if variables are factors, integers etc.

If I were writing a report on these subject data, I would copy the output from describe() into Excel, format it for APA, and stick it in a Word document.

3. Plotting visualization of data distributions

These summary statistics will not give you the whole picture that you need. Mean estimates of the centre of the data distribution can be deceptive, hiding pathologies like multimodality, skew, and so on. What we really need to do is to look at the distributions of the data and, unsurprisingly, in R, the tools for doing so are excellent. Let’s start with histograms.

[Here, I will use geom_histogram(); see the documentation online.]

We are going to examine the distribution of the Age variable, age in years for ML’s participants, using the ggplot() function as follows:


pAge <- ggplot(subjects, aes(x=Age))
pAge + geom_histogram()

Notice:

— We are asking ggplot() to map the Age variable to the x-axis i.e. we are asking for a presentation of one variable.

If you copy and paste the code into the script window and run it, three things will happen:

1. You’ll see the commands appear in the lower console window

2. You’ll see the pAge object listed in the upper right workspace window

R-2-1-rstudio-histogram

Notice:

— When you ask for geom_histogram(), you are asking for a bar plot plus a statistical transformation (stat_bin) which assigns observations to bins (ranges of values in the variable)

— it then calculates the count (number) of observations in each bin

— or, if you prefer, the density (percentage of total/bar width) of observations in each bin

— the height of the bar gives you the count; the width of the bin spans the values in the variable for which you want the count

— by default, geom_histogram() bins the data into bins whose width is the range of values divided by 30, but you can ask for more or less detail by specifying binwidth (see the sketch below, after the plot); this is what R means when it says:

>stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to 
adjust this.

3. We see the resulting plot in the plot window, which we can export as a pdf:

R-2-1-histogram

Notice:

— ML tested many people in their 20s – no surprise, a student mostly tests students – plus a smallish number of people in middle age and older (parents, grandparents)

— nothing seems wrong with these data – no funny values
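Here is the binwidth sketch promised above; the value 5 (i.e. five-year bins) is just an illustration, adjust it to the level of detail you want:


# the same histogram, but with an explicit bin width of 5 years
pAge <- ggplot(subjects, aes(x=Age))
pAge + geom_histogram(binwidth = 5)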

Exercise

Now examine the distributions of the other variables, TOWRE word and TOWRE nonword reading accuracy. Adapt the code by substituting the variable name with the name of each of these variables, and do one plot per variable:


pAge <- ggplot(subjects, aes(x=????))
pAge + geom_histogram()

Notice:

— In my opinion, at least one variable has a value that merits further thought.

4. What do you do when you’re finished?

OK, so you’ve been writing some killer code and you’ve produced a super graph, which appears in the plot window. How do you save it? You can click on the export button:

RStudio-plot-export

— you’ll be asked if you want to save the plot as an image (you’ll then be asked what format you want; .jpeg and .png are good for most purposes), as a pdf (my preference), or to copy it to the clipboard so that you can paste it somewhere (like a Word document, I suppose).

What about the script? In RStudio, go to the File menu, then Save As, and you will see this:

RStudio-script-file-save

— Choose a name: something-sensible-dated.R

— And save it to where you can find it.

What have we learnt?

The R code used in this post can be downloaded here. Download it and use it with the database provided.

[For my postgraduate students coming to this post, see the .R file in the Week One practical materials folder.]

In this post, we learnt about:

1. Putting data files into folders on your computer.

2. The workspace, the working directory and how to set it in R using setwd().

3. How to read a file into the workspace (from the working directory = folder on your computer) using read.csv().

4. How to inspect the dataframe object using functions like summary() or describe().

5. How to plot histograms to examine the distributions of variables.

Key vocabulary

working directory

workspace

object

dataframe

NA, missing value

csv

histogram


Getting started – getting helped

As we get started, we will need help. Let’s work on how and where to find that help, and also how to act on it.

We will need help in various ways when using R

You need help most of the time when using a software programme, but if you’re getting started on R you don’t have the advantage of practice that you have with MS Word or Excel or SPSS. Needing help is not an R thing, it’s a learning thing. However, there being lots of help available is an R thing.

In the following, I’ll discuss two common (for me) help scenarios:

1. Let’s say you are trying to run a line of code and you get an error message or a warning

2. We will need help where we’ve got a decision to make

— The first is usually when I am trying to do something and I have a specific problem I need help on.

— The second is when I am not really sure what I need or even if I need help.

I’ll talk about getting advice and also dealing with it.

1. Let’s say you are trying to run a line of code and you get an error message or a warning

R is a case-sensitive, interpreted language.

– This means that if you spell an object name incorrectly, get the syntax wrong, or muddle the letter case, you will get an error (just like in any language, the details matter).

The error messages can be cryptic.

For example, I run the following (incorrect) line of code:


p <- ggplot(mtcars, aes(x = wt))
p + geom_Object)

I get this:

R-error-unexpected

— You can see in the message that the (first) problem is that the bracket is wrong.

How do I work out what to do?

First, more than likely, I am trying something out but I have beside me a book or a website I am working from, maybe trying an example: so if I got a message like this I would check that I copied the code correctly.

Second, I might try to google the error message “Error: unexpected ‘)’ in “p + geom_Object)” — try it — it does not get me very far in this case.

Third, I try to change the code to see what difference that makes. Here, the problem is with the bracket so I change that:


p <- ggplot(mtcars, aes(x = wt))
p + geom_Object()

— The obvious response to the error message was to close the brackets — even minimal experience with R will tell you that open brackets are not doing much good.

— Sometimes, that quick response will solve the problem straightaway.

— Sometimes it won’t and you might get another message, like this:

R-error-could-not-find-object

— “could not find function” — so obviously, I am still not getting the function name right — in fact, I have the case wrong.

— But let’s say I am convinced I have it right: what should I do?

— Google it — most of the time, reasonably careful reading of the result will tell you what’s going on.

How and where to find help

As noted previously, there are plenty of places to get help (see also the following), but one place you should know to look is the built-in R help.

If you’re trying to work out how to use a function, or you’re looking for a function, or you’re trying to work out what options to use with a function, you can use the built-in help:

– type: ?[name of function] or help([name of function]) or ??[name of function]

– where [name of function], from [ to ] inclusive, is the name of the function you want help with. For example, if you entered the function call:


help(ggplot)

in the RStudio script window and then ran that command you would get this:

R-help-ggplot

— Notice that the help(ggplot) command is shown in the console in blue (it has been executed), but the result appears on the right, in the Help window.

The built-in help in R, accessed by running a help([function name]) or ?[function name] command, has a standard structure; it is useful but not (I admit) always terribly friendly. I usually use it if I basically know what the function is for and why I am using it, but I want to use the function in a specific way or in a different way than the default.

The help will tell you: 1. what the function does; 2. what arguments can be passed to it; 3. how it can be specified in various ways; and 4. it will give you some example code to try out. The example code will work with built-in data and, usually, I learn the most by playing with the example.
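A few illustrative calls, as a sketch (substitute any function or topic you like; example() is a handy extra for running the examples from a help page):


?read.csv       # help page for a function whose name you know
help(read.csv)  # the same thing, written as a function call
??histogram     # search the help system when you do not know the function name
example(mean)   # run the example code from the help page for mean()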

Note that a google search in a form something like this: “in R [name of function or text of error message]” will very often take you to a post on stackoverflow or some R blog where the error or problem and the solution or multiple alternate solutions (this is R) will be described or explained for you.

What to do with the help

Try it out. Learning how to use R involves some experimentation. Copy the code from a reported solution into a script, try running it and see if it works.

Almost every problem you will ever have has already been solved by someone else and, in R, the solution will usually be reported online.

2. We will need help where we’ve got a decision to make

Most modern software applications strive to ensure we get what we want without thinking too much about either the ends or the means. An application might, as soon as we start it, ask us what we want or show us what we might want: to write a letter; to input data; to read our emails. In R, you’re the boss of you and with power comes responsibility.

We might need help, in R, to work out what we want — Case one: You have a vague idea about what you are doing

Most applications will give you one way to do any thing. In R, you will usually have more than one (often, several) different ways to do a thing.

– Which way do you choose? This is a decision based on (for me) answers to the questions: does it work (now)? do I understand why it works? do I like it?

Typically, I work out what I want in one of three ways:

1. I am reading a book e.g. “R in Action” by Kabacoff on a topic relevant to my current task, e.g. on how to manipulate the structure of a dataframe, and I see a description of something I need to do, so I follow the example (code is given, operating on built-in data) and then adapt the code (e.g. changing variable names) to do that action for my own data.

2. I have only a vague idea of what I want. Let’s say I am vaguely thinking that I would like to be able to draw error bars in R, so I enter the following in Google: “How do I draw error bars in R”, and I will get a result that looks like this:

R-help-google-it

I use DuckDuckGo instead of Google but the result will be similar whatever your search engine.

Typing a description of a problem or a question will often bring up a lot of results, so I look for whatever seems promising and start reading it; if I can understand the advice, and think I like the look of the results, I copy the code (code is usually provided) to try the suggested solution.

Note that you sometimes get alternative solutions offered in the comments.

Half the time, the search will give me a result linking to a discussion on Stack Overflow, like this result in response to the question: “How can I break line on R’s setwd() command?”

R-help-stackoverflow

— This discussion was posted to Stack Overflow in response to the question. You can see that one of the commenters did not think the person asking the question should even be using setwd(), while another thought setwd() was useful. I agree with the second, but I note this to show that there is plenty of discussion, and often there are alternative solutions, for any one question you might have.

We might need help, in R, to work out what we want — Case two: I am not really aware that I have a problem but I am feeling dissatisfied with the approach I am currently taking

For example, I sometimes need to analyse information ordered by date, and I present it how I usually do, with a bar chart or a line graph or something, but the product just doesn’t feel right. What I might do, and what I used to do a lot, is have a look around for inspiration among the various online aggregations of R advice and tips.

For example, see what is shown at the R graph gallery website. There, if I were following my hypothetical example, I might find this:

R-graph-gallery-heatmap

A calendar heat map

— which looks pretty cool, and of course comes with a code example you can try out.

Or I might have a look around the R-bloggers website, and see if anything nice looking has been posted lately, like this:

R-help-bloggers

— coincidentally, a post by Max Gordon on a topic I was discussing yesterday: how can I produce a complete article, tabled statistics and text, from inside R?

What have we learnt?

You will need help in R, but there is a lot of it available; you just need to work on how and where to find it, and then what to do with it.


Getting started — Installation of RStudio and some packages + using ggplot() to make a simple plot

Install R

Revise how to install R, as previously discussed here and here.

Installation

How do you download and install R? Google “CRAN” and click on the download link, then follow the instructions (e.g. at “install R for the first time”).

R-CRAN

Anthony Damico has produced some great video tutorials on using R, here is his how-to guide:

http://www.screenr.com/kzT8

And moonheadsing at Learning Omics has a blog post with a series of screenshots showing you how to install R.

Install R-studio

Having installed R, the next thing we will want to do is install R-studio, a popular and useful interface for writing scripts and using R.

If you google “R-studio” you will get to this window:

R-0-1-rstudio

Click on the “Download now” button and you will see this window:

R-0-2-rstudio

Click on the “Download RStudio desktop” and you will see this window:

R-0-3-rstudio

You can just click on the link to the installer recommended for your computer.

What happens next depends on whether you have administrative/root privileges on your computer.

I believe you can install R-studio without such rights using the zip/tarball download.

Having installed R and R-studio, in Windows you will see these applications listed as newly installed programs in the Start menu. Depending on what you said during the installation process, you might also have icons on your desktop. In Mac OS, you will see the R and RStudio icons appear in the applications shown by Launchpad.

The R icon looks like this:

R-CRAN-icon

The RStudio icon looks like this:

RStudio-icon

If you click on the R icon you will get to see bare R, the console, which will look like this:

R-console-bare

R console, version 3.0.1, as seen on an iMac

Go ahead and click on the R icon and check it out; then go to the R menu and click on ‘Quit R’ to leave it. Don’t save the workspace image (you’ll be asked whether you want to or not). We won’t be using R directly; we will be using it through RStudio.

Click on the R-studio icon – it will pick up the R installation for you.

[N.B. Sometimes, e.g. when one does not have full admin rights on a computer e.g. in a university pooled computer, you are forced to install R someplace unusual and RStudio cannot find it. In that situation, you need to locate the R bin files from within RStudio. It’s not too hard to do this but I won’t get into that unless asked.]

Now we are ready to get things done in R.

Start a new script in R-studio, install packages, draw a plot

Here, we are going to 1. start a new script, 2. install then load a library of functions (ggplot2) and 3. use it to draw a plot.

Depending on what you did at installation, you can expect to find shortcut links to R (a blue R) and to R-Studio (a shiny blue circle with an R) in the Windows start menu, or as icons on the desktop.

To get started, in Windows, double click (left mouse button) on the R-Studio icon.

Maybe you’re now looking at this:

R-rstudio-1-1

1. Start a new script

What you will need to do next is go to the file menu [top left of R-Studio window] and create a new R script:

– move the cursor to File, then New, then R Script, and then click the left mouse button

or

— in Windows, just press the buttons ctrl-shift-N at the same time or in Mac OS, press the buttons command-shift-N together

— the second move is a keyboard shortcut for the first; I prefer keyboard shortcuts to mouse moves

— to get this:

R-rstudio-1-2

What’s next?

This circled bit you see in the picture below:

R-rstudio-1-3-console

is the console.

It is what you would see if you open R directly, not using R-Studio.

You can type and execute commands in it, but mostly you will see here what happens when you write and execute commands in the script window, circled below:

R-rstudio-1-3-script

— The console reflects your actions in the script window.

If you look on the top right of the R-Studio window, you can see two tabs, Workspace and History: these windows (they can be resized by dragging their edges) also reflect what you do:

1. Workspace will show you the functions, data files and other objects (e.g. plots, models) that you are creating in your R session.

[Workspace — see the R introduction, and see this helpful post by Quick-R — when you work with R, your commands result in the creation of objects, e.g. variables or functions; during an R session these objects are created and stored by name, and the collection of objects currently stored is the workspace.]

2. History shows you the commands you execute as you execute them.

— I look at the Workspace a lot when using R-Studio; I once used History but no longer look at it much.

My script is my history.

2. Install then load a library of functions (ggplot2)

We can start by adding some capacity to the version of R we have installed. We install packages of functions that we will be using e.g. packages for drawing plots (ggplot2) or for modelling data (lme4).

[Packages – see the introduction and this helpful page in Quick-R — all R functions and (built-in) datasets are stored in packages, only when a package is loaded are its contents available]

Copy then paste the following command into the script window in R-studio:


install.packages("ggplot2", "reshape2", "plyr", "languageR",
"lme4", "psych")

Highlight the command in the script window …

— to highlight the command, hold down the left mouse button, drag the cursor from the start to the finish of the command

— then either press the run button …

[see the run button on the top right of the script window]

… or (in Windows) press the buttons CTRL-enter together, alternatively (in Mac OS) press the cmd-enter buttons together …

— and watch the console show you R installing the packages you have requested.

Those packages will always be available to you, every time you open R-Studio, provided you load them at the start of your session.

[I am sure there is a way to ensure they are always loaded at the start of the session and will update this when I find that out.]

The keyboard shortcut for running a line or a selection of lines of code is:

— in Windows, press control (CTRL) and enter buttons together

— in Mac OS, press command (cmd) and enter buttons together

There is a 2-minute version of the foregoing laborious step-by-step, by ajdamico, here. N.B. the video is for installing and loading packages using the plain R console but applies equally to R-Studio.

Having installed the packages, in the same session or in the next session, the first thing you need to do is load the packages for use by using the library() function:


library(languageR)
library(lme4)
library(ggplot2)
library(rms)
library(plyr)
library(reshape2)
library(psych)

— copy/paste or type these commands into the script, highlight and run them: you will see the console fill up with information about how the packages are being loaded:

R-rstudio-1-4-library

Notice that the packages window on the bottom right of R-Studio now shows that the loaded packages have been ticked:

R-rstudio-1-4-library-packages

Let’s do something interesting now.

3. Use ggplot function to draw a plot

[In the following, I will use a simple example from the ggplot2 documentation on geom_point.]

Copy the following two lines of code into the script window:


p <- ggplot(mtcars, aes(wt, mpg))
p + geom_point()

— run them and you will see this:

R-rstudio-1-ggplot-point

— notice that the plot window, bottom right, shows you a scatterplot.

How did this happen?

Look at the first line of code:


p <- ggplot(mtcars, aes(wt, mpg))

— it creates an object, you can see it listed in the workspace (it was not there before):

R-rstudio-1-ggplot-point-2

— that line of code does a number of things, so I will break it down piece by piece:

p <- ggplot(mtcars, aes(wt, mpg))


p <- ...

— means: create <- (assignment arrow) an object (named p, now in the workspace)


... ggplot( ... )

— means: do this using the ggplot() function, which is provided by installing the ggplot2 package of data and functions and then loading it with library(ggplot2)


... ggplot(mtcars ...)

— means create the plot using the data in the database (in R: dataframe) called mtcars

— mtcars is a built-in dataframe (it ships with R in the datasets package, which is loaded by default) and is used throughout the ggplot2 documentation examples


... ggplot( ... aes(wt, mpg))

— aes(wt,mpg) means: map the variables wt and mpg to the aesthetic attributes of the plot.

In the ggplot2 book (Wickham, 2009, e.g. pp 12-), the things you see in a plot, the colour, size and shape of the points in a scatterplot, for example, are aesthetic attributes or visual properties.

— with aes(wt, mpg) we are informing R(ggplot) that the named variables are the ones to be used to create the plot.

Now, what happens next concerns the nature of the plot we want to produce: a scatterplot representing how, for the data we are using, values on one variable relate to values on the other.

A scatterplot represents each observation as a point, positioned according to the value of two variables. As well as a horizontal and a vertical position, each point also has a size, a colour and a shape. These attributes are called aesthetics, and are the properties that can be perceived on the graphic.

(Wickham: ggplot2 book, p.29; emphasis in text)

— The observations in the mtcars database are information about cars, including weight (wt) and miles per gallon (mpg).

[see the help page for the mtcars dataset, e.g. by running ?mtcars, in case you’re interested]

— This bit of the code asked the p object to include two attributes: wt and mpg.

— The aesthetics (aes) of the graphic object will be mapped to these variables.

— Nothing is seen yet, though the object now exists, until you run the next line of code.

The next line of code:

p + geom_point()

— adds (+) a layer to the plot, a geometric object: geom

— here we are asking for the addition of geom_point(), a scatterplot of points

— the variables mpg and wt will be mapped to the aesthetics, x-axis and y-axis position, of the scatterplot

The wonderful thing about the ggplot() function is that we can keep adding geoms to modify the plot.

— add a command to the second line of code to show the relationship between wt and mpg for the cars in the mtcars dataframe:


p <- ggplot(mtcars, aes(wt, mpg))
p + geom_point() + geom_smooth()

— here

+ geom_smooth()

adds a loess smoother to the plot, indicating the predicted miles per gallon (mpg) for a car of a given weight (wt), given the data in mtcars.

R-rstudio-ggplot-loess-export

[If you want to know what loess means – this post looks like a good place to get started.]
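A couple of optional arguments you can pass to the smoother, as a sketch (these options are my additions, not part of the example above): name the smoothing method explicitly and hide the confidence ribbon.


# same scatterplot, with an explicit loess smoother and no confidence band
p + geom_point() + geom_smooth(method = "loess", se = FALSE)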

Notice that there is an export button on the top of the plots window pane, click on it and export the plot as a pdf.

Where does that pdf get saved to? Good question.
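One answer, as a sketch using the ggsave() function (not covered above): save the plot from code, in which case the file goes to the working directory unless you give a full path.


# saves the most recently displayed ggplot; the file name here is just an example
ggsave("mtcars-wt-mpg-loess.pdf", width = 6, height = 4)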

What have we learnt?

— starting a new script

— installing and loading packages

— creating a new plot


Postgraduate data analysis and interpretation

I am revising my previous posts on the whole process, from getting started with R to using it to do linear modelling and linear mixed-effects modelling.

PANO_20140111_155425

Sunset over the coast, on the road from Ravenglass to Broughton

What this revision will entail is some rewriting of the provided .R code, some updating, and some extension. Extensions will focus on model comparison methods, information criteria approaches to model evaluation, and the choices one must make in data analysis.

 

 

 


Mixed-effects modeling — four hour workshop — part IV: LMEs

For an example analysis, we collated a .csv file of data from a repeated measures lexical decision experiment, which can be downloaded here. The code I have used in running the analysis described in this post can be found here.

This file holds data on:

1. participant attributes;

2. word attributes;

3. participant responses to words in the lexical decision task.

In this post, I want to take a quick look at how you would actually run a mixed-effects model, using the lmer() function furnished by the lme4 package, written by Doug Bates (Bates, 2010 — this appears to be a preprint of his presumed new book; see also Pinheiro & Bates, 2000). A useful how-to guide, one I followed when I first started doing this kind of modelling, is in Baayen (2008). An excellent, and very highly cited, introduction to mixed-effects modelling for psycholinguistics is found in Baayen, Davidson & Bates (2008).

First, then, we enter the following lines of code to load the libraries we are likely to need, to set the working directory to the folder where we have copied the full database, and to read that database as a dataframe in the session’s workspace:

Load libraries, define functions, load data

We start by loading the libraries of functions we are likely to need, defining a set of functions not in libraries, setting the working directory, and loading the dataframe we will work with into the R session workspace.


# load libraries of packages #############################################################################


# library(languageR)
library(lme4)
library(ggplot2)
library(grid)
library(rms)
library(plyr)
library(reshape2)
library(psych)
# library(gridExtra)



# load data files  #######################################################################################


# set the working directory

setwd("/Users/robdavies/Downloads")

# work out where your files are and substitute for the path for my files location, above


# # read in the item norms data
#
# item.norms <- read.csv(file="item norms 270312 050513.csv", head=TRUE,sep=",", na.strings = "-999")
# word.norms <- read.csv(file="word norms 140812.csv", head=TRUE,sep=",", na.strings = "-999")

item.coding <- read.csv(file="item coding info 270312.csv", head=TRUE,sep=",", na.strings = "-999")

# read in the merged new dataframe

item.words.norms <- read.csv(file="item.words.norms 050513.csv", head=TRUE,sep=",", na.strings = "NA")

# read in the subjects scores data file

subjects <- read.csv("ML scores 080612 220413.csv", header=T, na.strings = "-999")

# read in the behavioural data file

behaviour <- read.csv("ML data 080612.csv", header=T, na.strings = "-999")

# notice that we collated the subjects, items and behavioural data into a single dataframe which we then output
# using write.csv as a .csv file, we can read that collated data file in here, to save time

# read in the full database

RTs.norms.subjects <- read.csv("RTs.norms.subjects 061113.csv", header=T, na.strings = "-999")

# read in the database of correct responses to words only

nomissing.full <- read.csv("RTs.norms.subjects nomissing.full 061113.csv", header=T, na.strings = "-999")



# examine the data  ######################################################################################


summary(nomissing.full)

# note the presence of centred variables

 

If you run all that code, you will see the dataframes and functions appear listed in the workspace window of R-studio.

Run the models

What I will do next is perform three sets of mixed-effects model analyses. I will talk about the first set in this post.

In this first set, I shall compare models varying in the predictor variables included in the model specification. We shall say that models with more predictors are more complex than models with fewer predictors (a subset of the predictors specified for the more complex models).

I shall then compare models differing in complexity to see if increased complexity is justified by improved fit to the data.

This is in response to: 1. reading Baayen (2008) and Pinheiro & Bates (2000); and 2. recommendations by a reviewer; though I note that the reviewer's recommendations are, in their application, qualified in my mind by some cautionary comments made by Pinheiro & Bates (2000), and I confess that any problems of application or understanding are surely expressions of my misunderstanding.

What I am concerned about is the level of complexity that can be justified in the ultimate model by the contribution to improved fit by the model of the observed data. Of course, I have collected a set of predictors guided by theory, custom and practice in my field. But, as you have seen, I might choose to draw on a large set of potential predictors. I know that one cannot – to take one option used widely by psycholinguistics researchers – simply inspect a correlation table, comparing predictors with the outcome variable, and exclude those predictors that do not correlate with the outcome. Harrell (2001) comments that that approach has a biasing effect on the likelihood of detecting an effect that is really there (consuming, ‘ghost degrees of freedom’, or something like that). So what should we do or how should we satisfy ourselves that we should or should not include one or more predictor variables in our models?

You could start complex and step backwards to the simplest model that appears justified by model fit to observed data. I will return to that option another time.

In this post, I will start with the simplest model and build complexity up. Simpler models can be said to be nested in more complex models, e.g. a simple model including just main effects vs. a more complex model also including interactions. I will add complexity by adding sets of predictors, and I will compare each proximate pair of simple model and more complex model using the Likelihood Ratio Test (Pinheiro & Bates, 2000; see, especially, chapter 2, pp. 75-).

We can compare models with the same random effects (e.g. of subjects or items) but varying in complexity with respect to the fixed effects, i.e. replicable or manipulated variables like item length or participant age, and, separately, models with the same fixed effects but varying random effects.

Note that in comparing models varying in random effects, one estimates those effects using restricted maximum likelihood: you will see the argument REML = TRUE in the lmer() function call, that is what REML means. Note that in comparing models varying in fixed effects one must estimate those effects using maximum likelihood: you will see the argument REML = FALSE in the lmer() function call, that is what FALSE means. See comments in Baayen (2008) and Pinheiro & Bates (2000); I will come back to those important terms to focus on understanding, another time, focusing in this post on just writing the code for running the models.
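To make that distinction concrete, here is a sketch (the object name is mine; the variable names are those used in the models further below) of the kind of model you might fit with REML = TRUE when the comparison of interest is between random effects structures, for example adding by-subject random slopes for word frequency:


# random intercepts for subjects and items, plus by-subject random slopes for
# word frequency (cLgSUBTLCD); REML fit, as you would use when the comparison
# is between random effects structures
full.lmer.slopes <- lmer(log(RT) ~ cLgSUBTLCD +

 (cLgSUBTLCD|subjectID) + (1|item_name),

 data = nomissing.full, REML = TRUE)

summary(full.lmer.slopes)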

We can compare the more and less complex models using the Likelihood Ratio Test (LRT), which is executed with the anova([simple model], [more complex model]) function call. A \chi^2 statistic is computed and evaluated against the \chi^2 distribution to estimate the probability of an increase in model likelihood as large as or larger than that observed, given the null hypothesis of no gain in likelihood.

Note that I log transform the RTs but one might also consider examining the reciprocal RT.
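If you want to see what the alternative transformations do to the RT distribution, a quick sketch (assuming the nomissing.full dataframe and ggplot2 are loaded; -1000/RT is one common form of the reciprocal for millisecond latencies):


# histograms of raw, log-transformed and reciprocal-transformed RTs
ggplot(nomissing.full, aes(x = RT)) + geom_histogram()
ggplot(nomissing.full, aes(x = log(RT))) + geom_histogram()
ggplot(nomissing.full, aes(x = -1000/RT)) + geom_histogram()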


# we can move to some multilevel modelling ###############################################################

# we might start by examining the effects of predictors in a series of additions of sets of predictors
# we can hold the random effects, of subjects and items, constant, while varying the fixed effects
# we can compare the relative fit of models to data using the anova() likelihood ratio test
# in this series of models, we will use ML fitting

full.lmer0 <- lmer(log(RT) ~

# Trial.order +
#
# cAge + cTOWREW + cTOWRENW + cART +
#
# cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer0,corr=F)

#

full.lmer1 <- lmer(log(RT) ~

 Trial.order +

 # cAge + cTOWREW + cTOWRENW + cART +
 #
 # cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer1,corr=F)

#

full.lmer2 <- lmer(log(RT) ~

 Trial.order +

 cAge + cTOWREW + cTOWRENW + cART +

 # cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer2,corr=F)

#

full.lmer3 <- lmer(log(RT) ~

 Trial.order +

 cAge + cTOWREW + cTOWRENW + cART +

 cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer3,corr=F)

#

full.lmer4 <- lmer(log(RT) ~

 Trial.order +

 (cAge + cTOWREW + cTOWRENW + cART)*

 (cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG) +

 (1|subjectID) + (1|item_name),

 data = nomissing.full, REML = FALSE)

print(full.lmer4,corr=F)

# compare models of varying complexity

anova(full.lmer0, full.lmer1)
anova(full.lmer1, full.lmer2)
anova(full.lmer2, full.lmer3)
anova(full.lmer3, full.lmer4)

This code will first print out the summaries of each model, along the lines you have already seen. It will then give you this:


# > # compare models of varying complexity
# >
# > anova(full.lmer0, full.lmer1)
# Data: nomissing.full
# Models:
# full.lmer0: log(RT) ~ (1 | subjectID) + (1 | item_name)
# full.lmer1: log(RT) ~ Trial.order + (1 | subjectID) + (1 | item_name)
# Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
# full.lmer0 4 -977.58 -951.31 492.79
# full.lmer1 5 -975.83 -943.00 492.92 0.2545 1 0.6139
# > anova(full.lmer1, full.lmer2)
# Data: nomissing.full
# Models:
# full.lmer1: log(RT) ~ Trial.order + (1 | subjectID) + (1 | item_name)
# full.lmer2: log(RT) ~ Trial.order + cAge + cTOWREW + cTOWRENW + cART + (1 |
# full.lmer2: subjectID) + (1 | item_name)
# Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
# full.lmer1 5 -975.83 -943.00 492.92
# full.lmer2 9 -977.09 -917.98 497.54 9.2541 4 0.05505 .
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# > anova(full.lmer2, full.lmer3)
# Data: nomissing.full
# Models:
# full.lmer2: log(RT) ~ Trial.order + cAge + cTOWREW + cTOWRENW + cART + (1 |
# full.lmer2: subjectID) + (1 | item_name)
# full.lmer3: log(RT) ~ Trial.order + cAge + cTOWREW + cTOWRENW + cART + cLength +
# full.lmer3: cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG + (1 | subjectID) +
# full.lmer3: (1 | item_name)
# Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
# full.lmer2 9 -977.09 -917.98 497.54
# full.lmer3 15 -1115.42 -1016.91 572.71 150.34 6 < 2.2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# > anova(full.lmer3, full.lmer4)
# Data: nomissing.full
# Models:
# full.lmer3: log(RT) ~ Trial.order + cAge + cTOWREW + cTOWRENW + cART + cLength +
# full.lmer3: cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG + (1 | subjectID) +
# full.lmer3: (1 | item_name)
# full.lmer4: log(RT) ~ Trial.order + (cAge + cTOWREW + cTOWRENW + cART) *
# full.lmer4: (cLength + cOLD + cBG_Mean + cLgSUBTLCD + cAOA + cIMG) +
# full.lmer4: (1 | subjectID) + (1 | item_name)
# Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
# full.lmer3 15 -1115.4 -1016.91 572.71
# full.lmer4 39 -1125.7 -869.56 601.84 58.262 24 0.0001119 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# it looks like the full model, including interactions but not the trialorder variable, is justified

Notice:

— It is not very pretty, but it does tell you which predictors are included in each model and, for each pair of models being compared, the outcome of the ratio test, giving the numbers I usually report: the \chi^2, the df and the p-value.

— I am looking at this output to see whether, for each pair of models, the more complex model fits the data significantly better (if p <= 0.05).

— I add predictors in blocks and assess the improvement in fit associated with the whole block.

I have made a lot of promises to explain things more another time, so I will add one more. Pinheiro & Bates (2000; p. 88) argue that Likelihood Ratio Test comparisons of models varying in fixed effects tend to be anticonservative, i.e. they will see you observe significant differences in model fit more often than you should. I think they are talking, especially, about situations in which the number of model parameter differences (differences between the complex model and the nested simpler model) is large relative to the number of observations. That is not really a worry for this dataset, but I will come back to the substance of this view, and to alternatives to the approach taken here.

What next?

What I would seek to do in the next series of analyses is determine if the effects that are significant appear to be significant due:

1. to the presence of other, nonsignificant effects — making the check by removing the nonsignificant effects then rerunning the model;

2. to the presence of outlier observations — making the check by removing observations associated with large model residuals then rerunning the model.
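For the second of those checks, a minimal sketch (assuming the most complex model, full.lmer4, from the code above; the 2.5 cut-off is a common convention, not a rule, and the object names are mine):


# flag observations whose absolute standardized residuals exceed 2.5 ...
keep <- as.vector(abs(scale(resid(full.lmer4))) < 2.5)

# ... drop them and refit the model to the trimmed dataset
nomissing.full.trimmed <- nomissing.full[keep, ]

full.lmer4.trimmed <- update(full.lmer4, data = nomissing.full.trimmed)

summary(full.lmer4.trimmed)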

What have we learnt?

(The code I have used in running the analysis described in this post can be found here.)

Firstly, I am going to have to do a lot more explaining, for my own sake, and have promised to do so for a wide range of ideas, including those concerning the methods of model fitting, and, especially, the discussion around how one determines the appropriate set of fixed effects.

Secondly, that the model code is both simple and – on its face – reasonably intuitive.

Key vocabulary (mostly, a list of things to be explained)

fixed effects

random effects

random intercepts

random slopes

interactions

multiplicative products

centering

maximum likelihood

restricted maximum likelihood

Markov chain Monte Carlo

Likelihood Ratio Test

lmer()

anova()

Reading

Baayen, R. H. (2008). Analyzing linguistic data. Cambridge University Press.
Bates, D. M. (2005). Fitting linear mixed models in R. R News, 5, 27-30.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioural sciences (3rd. edition). Mahwah, NJ: Lawrence Erlbaum Associates.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York,NY: Cambridge University Press.
Pinheiro, J. C., & Bates, D. M. (2000). Mixed-effects models in S and S-plus (Statistics and Computing). New York: Springer.


Mixed-effects modeling — four hour workshop — part III: Regression modeling

We completed a study on the factors that influence visual word recognition performance in lexical decision. We then put together a dataset for analysis, here. We examined the distributions and inter-relations of the critical variables in that dataset, acting to transform the data where we found indications of concern, here.

We will now proceed to analyse the lexical decision reaction times (RTs). The dataset collated in the last post consisted of observations corresponding to correct responses to words only: RTs, along with data about the subject and item attributes. It will be recalled that we wish to examine the effects on reading of who is reading and what they are reading.

The last collated dataset can be downloaded here.

The code I am going to use in this post can be downloaded here.

We are going to start by ignoring the random effects structure of the data, i.e. the fact that the data are collected in a repeated measures design where every participant saw every word so that there are dependencies between observations made by each participant (some people are slower than others, and most of their responses will be slower) and dependencies between observations made to each item (some words are harder than others, and will be found to be harder by most readers). What we will do is perform Ordinary Least Squares (OLS) regression with our data, and drawing scatterplots will be quite central to our work (see previous posts here and here). I am going to assume familiarity with correlation and OLS regression (but see previous posts here and here).
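As a reminder of what that looks like in code, here is a sketch of a single-predictor OLS model (an illustration only, with an object name of my choosing; it is not the analysis we are building towards):


# OLS regression of RT on word length, ignoring the repeated-measures structure
ols.length <- lm(RT ~ Length, data = nomissing.full)
summary(ols.length)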

The relationships between pairs of variables — scatterplots

We are interested in whether our readers are recognizing some words faster than others. Previous research (e.g. Balota et al., 2004) might lead us to expect that they should respond faster to more frequent, shorter, words that are easier to imagine (the referent of). We are also interested in whether individuals vary in how they recognize words. We might expect that readers who score high on standardized reading tests (e.g. the TOWRE) would also respond faster, on average, in the experimental reading task (lexical decision). The interesting question, for me, is whether there is an interaction between the first set of effects — of what is being read — and the second set of effects — who is doing the reading.

We have already looked at plots illustrating the relationships between various sets of the predictor variables: scatterplot matrices.

We are now going to create a trellis of scatterplots to examine the relationship between each predictor variable and word recognition RT.

Let’s start with a scatterplot. I assume the collated dataset has been downloaded and is available for use in R’s workspace. If you copy the following code to draw a plot:


plength <- ggplot(nomissing.full,aes(x=Length, y=RT))
plength <- plength + geom_jitter(colour = "black", alpha = 1/5)
plength <- plength + xlab("Length") + theme_bw()
plength <- plength + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
plength <- plength + geom_smooth(method="lm", size = 1.5)
plength

You will get this:

LME-4hrworkshop-scatter-length

Notice:

– I am adding each element, one at a time, making for more lines of code but greater transparency between code and effect.

— I first ask to have a plot object created: plength <- ggplot(nomissing.full,aes(x=Length, y=RT))

— the object is called plength, the function doing the creation is ggplot(), it is acting on the database nomissing.full, and it will map the variables Length to x coordinate and RT to y coordinate

— the mapping is from data to aesthetic attributes like point location in a scatterplot

— I then add to what is a blank plot the geometric object (geom) point — run the first line alone and see what happens

— here I am asking for jittered points with + geom_jitter(colour = “black”, alpha = 1/5) — where a bit of random noise has been added to the point locations because so many would otherwise appear in the same place — overplotting that obscures the pattern of interest — I am also modifying the relative opacity of the points with alpha = 1/5

– I could add the scatterplot points without jitter using: + geom_point() (substitute that in for geom_jitter() and see what happens); I could also change the colour scheme

– I then change the axis label with: + xlab(“Length”)

– I modify the size of the labeling with: + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))

– I add a smoother i.e. a line showing a linear model prediction of y-axis values for the x-axis variable values with: + geom_smooth(method=”lm”, size = 1.5)  – and I ask for the size to be increased for better readability

– and I ask for the plot to have a white background with: + theme_bw()

In fact, I would like to see a plot of all the key predictor variables and their effect on RT, considering just the bivariate relationships. It is relatively easy to first draw each plot, then create a trellis of plots, as follows:


# 1. are RTs affected by item attributes?

# let's look at the way in which RT relates to the word attributes, picking the most obvious variables
# we will focus, here, on the database holding correct responses to words

head(nomissing.full, n = 2)

plength <- ggplot(nomissing.full,aes(x=Length, y=RT))
plength <- plength + geom_jitter(colour = "black", alpha = 1/5)
plength <- plength + xlab("Length") + theme_bw()
plength <- plength + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
plength <- plength + geom_smooth(method="lm", size = 1.5)
plength

pOLD <- ggplot(nomissing.full,aes(x=OLD, y=RT))
pOLD <- pOLD + geom_jitter(colour = "black", alpha = 1/5)
pOLD <- pOLD + xlab("OLD") + theme_bw()
pOLD <- pOLD + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pOLD <- pOLD + geom_smooth(method="lm", size = 1.5)
pOLD

pfreq <- ggplot(nomissing.full,aes(x=LgSUBTLCD, y=RT))
pfreq <- pfreq + geom_point(colour = "black", alpha = 1/5)
pfreq <- pfreq + xlab("LgSUBTLCD") + theme_bw()
pfreq <- pfreq + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pfreq <- pfreq + geom_smooth(method="lm", size = 1.5)
pfreq

pAoA_Kup_lem <- ggplot(nomissing.full,aes(x=AoA_Kup_lem, y=RT))
pAoA_Kup_lem <- pAoA_Kup_lem + geom_point(colour = "black", alpha = 1/5)
pAoA_Kup_lem <- pAoA_Kup_lem + xlab("AoA") + theme_bw()
pAoA_Kup_lem <- pAoA_Kup_lem + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pAoA_Kup_lem <- pAoA_Kup_lem + geom_smooth(method="lm", size = 1.5)
pAoA_Kup_lem

pbrookesIMG <- ggplot(nomissing.full,aes(x=brookesIMG, y=RT))
pbrookesIMG <- pbrookesIMG + geom_jitter(colour = "black", alpha = 1/5)
pbrookesIMG <- pbrookesIMG + xlab("Imageability") + theme_bw()
pbrookesIMG <- pbrookesIMG + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pbrookesIMG <- pbrookesIMG + geom_smooth(method="lm", size = 1.5)
pbrookesIMG

# 2. are RTs affected by subject attributes?

# let's look at the way in which RT relates to the word attributes, picking the most obvious variables
# we will focus, here, on the database holding correct responses to words

head(nomissing.full, n = 2)

pAge <- ggplot(nomissing.full,aes(x=Age, y=RT))
pAge <- pAge + geom_jitter(colour = "black", alpha = 1/5)
pAge <- pAge + xlab("Age") + theme_bw()
pAge <- pAge + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pAge <- pAge + geom_smooth(method="lm", size = 1.5)
pAge

pTOWRE_wordacc <- ggplot(nomissing.full,aes(x=TOWRE_wordacc, y=RT))
pTOWRE_wordacc <- pTOWRE_wordacc + geom_jitter(colour = "black", alpha = 1/5)
pTOWRE_wordacc <- pTOWRE_wordacc + xlab("TOWRE word") + theme_bw()
pTOWRE_wordacc <- pTOWRE_wordacc + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pTOWRE_wordacc <- pTOWRE_wordacc + geom_smooth(method="lm", size = 1.5)
pTOWRE_wordacc

pTOWRE_nonwordacc <- ggplot(nomissing.full,aes(x=TOWRE_nonwordacc, y=RT))
pTOWRE_nonwordacc <- pTOWRE_nonwordacc + geom_jitter(colour = "black", alpha = 1/5)
pTOWRE_nonwordacc <- pTOWRE_nonwordacc + xlab("TOWRE nonword") + theme_bw()
pTOWRE_nonwordacc <- pTOWRE_nonwordacc + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pTOWRE_nonwordacc <- pTOWRE_nonwordacc + geom_smooth(method="lm", size = 1.5)
pTOWRE_nonwordacc

pART_HRminusFR <- ggplot(nomissing.full,aes(x=ART_HRminusFR, y=RT))
pART_HRminusFR <- pART_HRminusFR + geom_jitter(colour = "black", alpha = 1/5)
pART_HRminusFR <- pART_HRminusFR + xlab("ART print exposure") + theme_bw()
pART_HRminusFR <- pART_HRminusFR + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=20))
pART_HRminusFR <- pART_HRminusFR + geom_smooth(method="lm", size = 1.5)
pART_HRminusFR

Notice:

— What changes between plots is: the plot object name, the x-axis variable name (spelled exactly as it appears in the dataframe) and the x-axis label

— You can run each set of code to examine the relationships between predictor variables and RT one at a time.

We can create a grid and then print the plots to specific locations in the grid, as follows, outputting everything as a pdf:


pdf("nomissing.full-RT-by-main-effects-subjects-061113.pdf", width = 18, height = 12)

grid.newpage()

pushViewport(viewport(layout = grid.layout(2,5)))

vplayout <- function(x,y)
viewport(layout.pos.row = x, layout.pos.col = y)

print(plength, vp = vplayout(1,1))
print(pOLD, vp = vplayout(1,2))
print(pfreq, vp = vplayout(1,3))
print(pAoA_Kup_lem, vp = vplayout(1,4))
print(pbrookesIMG, vp = vplayout(1,5))

print(pAge, vp = vplayout(2,1))
print(pTOWRE_wordacc, vp = vplayout(2,2))
print(pTOWRE_nonwordacc, vp = vplayout(2,3))
print(pART_HRminusFR, vp = vplayout(2,4))

dev.off()

If you run that code it will get you this:

LME-4hrworkshop-scatter-trellis

What is your evaluation?

The next question obviously is the extent to which these variables make unique contributions to the explanation of variance in reading RT, when taking the other variables into account, and whether we might detect interactions between effects of item attributes like frequency and person attributes like word reading skill.

We can examine potential interactions by, effectively, splitting the data into subsets: we create a variable that associates observations with intervals in one of the components of the potential interaction. In other words, if we wanted to look at the word reading skill x frequency interaction, we might look at the frequency effect for each of four subsets of the data, corresponding to four equal intervals in the TOWRE word reading accuracy variable. We can use the following code to do that:


# first, add coding variables that split the dataset into equal intervals according to scores on each subject
# attribute of interest

nomissing.full$TOWRE_wordacc_cutequal <- cut_interval(nomissing.full$TOWRE_wordacc, n = 4)

# then plot the interaction: word reading skill x frequency

pdf("nomissing.full-freqwords-rt-lm-scatter-061113.pdf", width = 12, height = 8)

pLgSUBTLCD <- ggplot(nomissing.full,aes(x=LgSUBTLCD,y=RT))
pLgSUBTLCD <- pLgSUBTLCD + geom_jitter(colour = "black", alpha = 1/5) + theme_bw() + xlab("LgSUBTLCD")
pLgSUBTLCD <- pLgSUBTLCD + theme(axis.title.x = element_text(size=25)) + theme(axis.text.x = element_text(size=10))
pLgSUBTLCD <- pLgSUBTLCD + geom_smooth(method="lm", size = 1.5, colour = "red")
pLgSUBTLCD <- pLgSUBTLCD + facet_wrap(~TOWRE_wordacc_cutequal,nrow=1)
pLgSUBTLCD

dev.off()

Running this code will get you this:

LME-4hrworkshop-wordxfreq-scatter

To me, there is a possible interaction such that the frequency effect (LgSUBTLCD) may be getting smaller for more skilled word readers, but that modulation of the frequency effect by word reading skill does not seem very strong.

OLS Regression — the full dataset

We can perform a full regression on the dataset to look for the effects of interest: effects of subject attributes, effects of word attributes, and the interactions between the two sets. We are going to ignore the fact that, due to the dependencies of observations under subject and item, we do not really have as many degrees of freedom as we appear to have (several thousand observations).

We can run a regression taking all the data and ignoring the clustering of observations:


# Inspect the database

summary(nomissing.full)

# it is recommended that we first make an object that summarizes the variable distributions

nomissing.full.dd <- datadist(nomissing.full)

# specify that we will be working with the nomissing.full.dd data distribution

options(datadist = "nomissing.full.dd")

# run the ols() function to perform the regression

nomissing.full.ols <- ols(logrt ~

(cAge + cTOWREW + cTOWRENW + cART)*
(cLength + cOLD + cIMG + cAOA + cLgSUBTLCD),

data = nomissing.full)

# enter the name of the model to get a summary

nomissing.full.ols

That will get you this:


# > nomissing.full.ols
#
# Linear Regression Model
#
# ols(formula = logrt ~ (cAge + cTOWREW + cTOWRENW + cART) * (cLength +
#                                                               cOLD + cIMG + cAOA + cLgSUBTLCD), data = nomissing.full)
#
# Model Likelihood     Discrimination
# Ratio Test           Indexes
# Obs     5257    LR chi2    677.56    R2       0.121
# sigma 0.1052    d.f.           29    R2 adj   0.116
# d.f.    5227    Pr(> chi2) 0.0000    g        0.043
#
# Residuals
#
# Min       1Q   Median       3Q      Max
# -0.37241 -0.06810 -0.01275  0.05202  0.55272
#
# Coef    S.E.   t       Pr(>|t|)
# Intercept              2.7789 0.0015 1914.52 <0.0001
# cAge                   0.0015 0.0001   13.95 <0.0001
# cTOWREW               -0.0018 0.0002   -8.42 <0.0001
# cTOWRENW               0.0009 0.0003    3.40 0.0007
# cART                  -0.0021 0.0002  -11.58 <0.0001
# cLength               -0.0101 0.0027   -3.68 0.0002
# cOLD                   0.0012 0.0071    0.18 0.8599
# cIMG                  -0.0066 0.0013   -5.00 <0.0001
# cAOA                   0.0022 0.0010    2.16 0.0310
# cLgSUBTLCD            -0.0365 0.0033  -10.98 <0.0001
# cAge * cLength         0.0003 0.0002    1.46 0.1453
# cAge * cOLD           -0.0006 0.0005   -1.13 0.2580
# cAge * cIMG            0.0000 0.0001   -0.03 0.9798
# cAge * cAOA           -0.0001 0.0001   -1.02 0.3099
# cAge * cLgSUBTLCD      0.0001 0.0002    0.42 0.6767
# cTOWREW * cLength      0.0002 0.0004    0.54 0.5915
# cTOWREW * cOLD         0.0002 0.0011    0.22 0.8265
# cTOWREW * cIMG         0.0000 0.0002    0.02 0.9859
# cTOWREW * cAOA         0.0002 0.0002    1.14 0.2531
# cTOWREW * cLgSUBTLCD   0.0002 0.0005    0.31 0.7600
# cTOWRENW * cLength    -0.0004 0.0005   -0.80 0.4213
# cTOWRENW * cOLD        0.0009 0.0012    0.72 0.4735
# cTOWRENW * cIMG        0.0003 0.0002    1.29 0.1975
# cTOWRENW * cAOA        0.0000 0.0002   -0.20 0.8377
# cTOWRENW * cLgSUBTLCD  0.0015 0.0006    2.62 0.0088
# cART * cLength        -0.0004 0.0003   -1.29 0.1967
# cART * cOLD            0.0002 0.0009    0.22 0.8263
# cART * cIMG           -0.0001 0.0002   -0.65 0.5133
# cART * cAOA            0.0000 0.0001    0.17 0.8670
# cART * cLgSUBTLCD     -0.0005 0.0004   -1.29 0.1964

which looks perfectly reasonable.

Notice:

— A regression model might be written as:

word.skill = age + nonword.skill + print.exposure + intercept + error

— I can break the function call down, using the simpler subjects-only model from an earlier post as the example (the call is reassembled as a sketch after this list):

subjects.ols —  the name of the model to be created

<- ols() — the model function call

TOWRE_wordacc — the dependent or outcome variable

~ — is predicted by

Age + TOWRE_nonwordacc + ART_HRminusFR — the predictors

, data = subjects — the specification of what dataframe will be used
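Put back together, that call (the simpler, subjects-only model described above, not the full nomissing.full.ols model) would look something like the following sketch; it assumes the rms package is loaded and that the subjects dataframe from the earlier post is in the workspace:


# the pieces listed above, reassembled into a single ols() call (a sketch)

subjects.ols <- ols(TOWRE_wordacc ~ Age + TOWRE_nonwordacc + ART_HRminusFR, data = subjects)
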

— Notice also that when you run the model, you get nothing except the model command being repeated (and no error message if this is done right).

— To get the summary of the model, you need to enter the model name:

subjects.ols

— Notice when you look at the summary the following features:

Obs — the number of observations

d.f. — residual degrees of freedom

Sigma — the residual standard error i.e. the variability in model residuals — models that better fit the data would be associated with smaller spread of model errors (residuals)

Residuals — the listed residuals quartiles, an indication of the distribution of model errors — linear regression assumes that model errors are random, and that they are therefore distributed in a normal distribution

Model Likelihood Ratio Test — a measure of goodness of fit, with the \chi^2 test statistic, degrees of freedom (model predictors) and p-value reported

R^2 and adj. R^2 — measures of the proportion of variance explained by the model.

In estimating the outcome (here, word reading skill), we are predicting the outcome given what we know about the predictors. Most of the time, the prediction will be imperfect, the predicted outcome will diverge from the observed outcome values. How closely do predicted values correspond to observed values? In other words, how far is the outcome independent of the predictor? Or, how much does the way in which values of the outcome vary coincide with the way in which values of the predictor vary? (I am paraphrasing Cohen et al., 2003, here.)

Let’s call the outcome Y and the predictor X. How much does the way in which the values of Y vary coincide with the way in which values of X vary? We are talking about variability, and that is typically measured in terms of the standard deviation SD or its square the variance sd^2. Variances are additive while SDs are not so we typically work with variances. In asking the string of questions given in the foregoing paragraph, we are thinking about a problem that can be expressed like this: what part of Y variance is associated with X (this part will equal the variance of the estimated scores) and what part of Y variance is independent of X (the model error, equal to the variance of the discrepancies between estimated and observed Y)? If there is no relationship between X and Y, our best estimate of Y will be the mean of Y. R^2 is the proportion of Y variance associated with X – our predictors.
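A minimal sketch of that variance partitioning, using made-up X and Y values (none of these names come from the dataset), shows the equivalence:


# illustrative only: R-squared as the proportion of Y variance associated with X

set.seed(1)
X <- rnorm(100)
Y <- 2*X + rnorm(100)

toy.lm <- lm(Y ~ X)

# variance of the estimated (fitted) scores as a proportion of total Y variance ...
var(fitted(toy.lm))/var(Y)

# ... equals 1 minus the proportion of Y variance left over in the residuals ...
1 - var(resid(toy.lm))/var(Y)

# ... and both match the R-squared reported by the model summary
summary(toy.lm)$r.squared
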

Coef — a table presenting the model estimates for the effects of each predictor, while taking into account all other predictors

Notice:

— The estimated regression coefficients represent the rate of change in outcome units per unit change in the predictor variable.

— You could multiply the coefficient by a value for the predictor to get an estimate of the outcome for that value e.g. in the earlier subjects model, word reading skill was estimated to be higher by 0.7382 for each unit increase in nonword reading score.

— The intercept (the regression constant or Y intercept) represents the estimated outcome when all the predictors are equal to zero; in the earlier subjects model, that was a word reading score of about 55.

For some problems, as here, it is convenient to center the predictor variables by subtracting a variable’s mean value from each of its values. This can help alleviate nonessential collinearity (see here) due to scaling, when working with curvilinear effects terms and interactions, especially (Cohen et al., 2003). If both the outcome and predictor variables have been centered then the intercept will also equal zero. If just the predictor variable is centered then the intercept will equal the mean value for the outcome variable. Where there is no meaningful zero for a predictor (e.g. with age) centering predictors can help the interpretation of results. The slopes of the effects of the predictors are unaffected by centering.
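A quick sketch of that last point about the intercept (again with made-up X and Y, not variables from the dataset): centering just the predictor leaves the slope unchanged and moves the intercept to the mean of the outcome.


# illustrative only: centering a predictor changes the intercept, not the slope

set.seed(1)
X <- rnorm(100)
Y <- 2*X + rnorm(100)
cX <- X - mean(X)

coef(lm(Y ~ X))   # intercept = estimated Y when raw X = 0
coef(lm(Y ~ cX))  # same slope; intercept now equals the mean of Y
mean(Y)
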

— In addition to the coefficients, we also have values for each estimate for:

S.E. — the standard error of the estimated coefficient of the effect of the predictor. Imagine that we sampled our population many times and estimated the effect each time: those estimates would constitute a sampling distribution for the effect estimate coefficient, and would be approximately normally distributed. The spread of those estimates is given by the SE of the coefficients, which can be calculated in terms of the portion of variance not explained by the model along with the SDs of Y and of X.

t — the t statistic associated with the Null Hypothesis Significance Test (NHST, but see Cohen, 1994) that the effect of the predictor is actually zero. We assume that there is no effect of the predictor. We have an estimate of the possible spread of coefficient estimate values, given repeated sampling from the population, and reruns of the regression – that is given by the SE of the coefficient. The t-test in general is given by:

t = \frac{sample.value - null.hypothesis.value}{standard.error}

therefore, for regression coefficients:

t = \frac{B_{effect} - H_0}{SE_{B_{effect}}}

i.e. for a null hypothesis that the coefficient should be zero, we divide the estimate of the coefficient, B, by the standard error of possible coefficients, to get a t-statistic

Pr(>|t|) — we can then get the probability of obtaining a value of t this large or larger by reference to the distribution of t values for the degrees of freedom at hand.
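As a sketch only, you can reproduce the t and Pr(>|t|) columns of a coefficient table like the one above from an estimate, its standard error and the residual degrees of freedom; here I plug in the rounded values printed for cLgSUBTLCD, so the result will differ slightly from the printed t of -10.98:


# illustrative only: t and p for a coefficient, from its estimate and standard error

B  <- -0.0365   # the cLgSUBTLCD estimate, as printed (rounded) above
SE <-  0.0033   # its standard error, as printed (rounded) above
df <-  5227     # the residual degrees of freedom reported for the model

t.value <- (B - 0)/SE              # the null hypothesis value is zero
p.value <- 2*pt(-abs(t.value), df)

t.value
p.value
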

But is this really the right thing to do?

As I discuss in my talk, it is not (slides to be uploaded). I have already alluded to the problem. The model makes sense on its face. And the effects estimates might be reasonable. The problem is that we are not really giving the randomness its due here. Did we really have 5250+ independent observations? How accurate are our estimates of the standard errors?

OLS Regression — the by-items mean RTs dataset

The field has long been aware of random variation in datasets like this, where we are examining the effect of e.g. word attributes on the responses of samples of participants in a repeated-measures design. The solution has not been to throw all the data into a regression. Rather, we have averaged responses to each item over participants. To put it another way, for each word, we have averaged over the responses recorded for the participants in our sample, to arrive at an average person response. The folk explanation for this procedure is that we are ‘washing out’ random variation to arrive at a mean response but if you consider discussions by e.g. Raaijmakers et al. (1999) it is not quite that simple (the whole problem turns on the ‘language as fixed effect’ issue, the paper is highly recommended). Let’s look at how this works for our data set.

Note I am going to be a bit lazy here and not centre the predictors before performing the analysis nor log transform RTs. You’re welcome to redo the analysis having made that preliminary step.

The first thing we need to do is create a database holding by-items mean RTs. We do this by calculating the average RT (over subjects) of responses to each item.


# define functions for use later:

# we want to examine performance averaging over subjects to get a mean RT for each item ie by items RTs

# first create a function to calculate mean RTs

meanRT <- function(df) {mean(df$RT)}

# then apply it to items in our original full dataset, outputting a dataset of by-items mean RTs

nomissing.full.byitems.RT.means <- ddply(nomissing.full, .(item_name), meanRT)

# inspect the results

summary(nomissing.full.byitems.RT.means)

# if you click on the dataframe name in the workspace window in RStudio you can see that we have created a set of 160
# (mean) RTs

# we want to rename the variable calculated for us from V1 to byitems.means

nomissing.full.byitems.RT.means <- rename(nomissing.full.byitems.RT.means, c(V1 = "byitems.means"))
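If you prefer base R, the same by-items means can be calculated with aggregate(); this is just an alternative sketch, assuming the same column names, and the object name byitems.RT.means.check is made up for the comparison. It should agree with the ddply() result.


# an equivalent route in base R: mean RT per item, averaging over subjects

byitems.RT.means.check <- aggregate(RT ~ item_name, data = nomissing.full, FUN = mean)

# rename the aggregated column so that it matches the ddply() version

byitems.RT.means.check <- rename(byitems.RT.means.check, c(RT = "byitems.means"))

summary(byitems.RT.means.check)
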

The next thing we can do is add information about the items into this dataframe by merging it with our item norms dataframe, as follows:


# we can analyse what affects these by-items means RTs but must first merge this dataset with the item norms dataset:

byitems <- merge(nomissing.full.byitems.RT.means, item.words.norms, by = "item_name")

# inspect the results

summary(byitems)

We can then run another OLS regression, this time on just the by-items mean RTs, therefore without the subject attribute variables or, critically, the interactions, as follows:


# it is recommended that we first make an object that summarizes the variable distributions

byitems.dd <- datadist(byitems)

# specify that we will be working with the byitems.dd data distribution

options(datadist = "byitems.dd")

# run the ols() function to perform the regression

byitems.ols <- ols(byitems.means ~

Length + OLD + brookesIMG + AoA_Kup_lem + LgSUBTLCD,

data = byitems)

# enter the name of the model to get a summary

byitems.ols

That will get you this:


> byitems.ols

Linear Regression Model

ols(formula = byitems.means ~ Length + OLD + brookesIMG + AoA_Kup_lem +
LgSUBTLCD, data = byitems)

Model Likelihood     Discrimination
Ratio Test           Indexes
Obs       160    LR chi2    126.84    R2       0.547
sigma 39.8543    d.f.            5    R2 adj   0.533
d.f.      154    Pr(> chi2) 0.0000    g       47.664

Residuals

Min       1Q   Median       3Q      Max
-96.8953 -26.1113  -0.3104  20.8134 167.8680

Coef     S.E.    t     Pr(>|t|)
Intercept   912.2377 49.6487 18.37 <0.0001
Length      -19.2256  5.8640 -3.28 0.0013
OLD           6.5787 15.2532  0.43 0.6669
brookesIMG  -12.6055  2.8667 -4.40 <0.0001
AoA_Kup_lem   4.0169  2.2254  1.81 0.0730
LgSUBTLCD   -58.1738  7.1677 -8.12 <0.0001

Notice:

— the proportion of variance explained by the model is considerably larger than that explained by the model for the full dataset

— the residual standard error is also larger, though note that it is not directly comparable to the full-dataset model because here we are modelling raw RTs in milliseconds rather than log RTs

— obviously the degrees of freedom are reduced

— the pattern of effects seems broadly similar i.e. what effects are significant and what are not

How could we take into account both random variation due to items (as in our by-items analysis) and random variation due to subjects?

As I discuss in my talk (slides to be uploaded here), discussions by Clark (1973) and Raaijmakers et al. (1999) are enlightening. We are heading towards the mixed-effects approach argued by Baayen (2008; Baayen et al., 2008) and many others to be the right tool for the job. We might consider one last option: to perform the items analysis not on by-items mean RTs but separately on the RTs of each participant considered as an individual (see Lorch & Myers, 1990). We might then analyse the coefficients resulting from that per-subject analysis: see Balota et al. (2004) for one example of this approach; see Snijders & Bosker (1999), also Baayen et al. (2008), for a discussion of the problems with it.

We can first define a function to perform a regression analysis for each subject:


model <- function(df){lm(logrt ~

Length + OLD + brookesIMG + AoA_Kup_lem +
LgSUBTLCD,

data = df)}

Notice:

— I am switching to the lm() function to run regression analyses — the effect is the same but the function will be a bit simpler to handle.

Then use the function to model each person individually, creating a list of per-subject models and, from that, a dataframe holding the coefficients (and R-squared) of each model:


nomissing.full.models <- dlply(nomissing.full, .(subjectID), model)

rsq <- function(x) summary(x)$r.squared
nomissing.full.models.coefs <- ldply(nomissing.full.models, function(x) c(coef(x),rsquare = rsq(x)))

head(nomissing.full.models.coefs,n=3)
length(nomissing.full.models.coefs$subjectID)
summary(nomissing.full.models.coefs)

# then merge the models coefs and subject scores dbases by subject

nomissing.full.models.coefs.subjects <- merge(nomissing.full.models.coefs, subjects, by="subjectID")

# inspect

summary(nomissing.full.models.coefs.subjects)

We can use these data to analyse whether the coefficients of the effects of word attributes, estimated separately for each subject, are themselves predictable from data about subject attributes.


# we can now analyse whether the coefficients are predicted by subject variables

# rename the dataframe for simplicity:

subjects.coefs <- nomissing.full.models.coefs.subjects

# it is recommended that we first make an object that summarizes the variable distributions

subjects.coefs.dd <- datadist(subjects.coefs)

# specify that we will be working with the subjects.coefs.dd data distribution

options(datadist = "subjects.coefs.dd")

# run the ols() function to perform the regression

subjects.coefs.ols <- ols(LgSUBTLCD ~

Age + TOWRE_wordacc + TOWRE_nonwordacc + ART_HRminusFR,

data = subjects.coefs)

# enter the name of the model to get a summary

subjects.coefs.ols

Running that code will get you this:


> subjects.coefs.ols

Linear Regression Model

ols(formula = LgSUBTLCD ~ Age + TOWRE_wordacc + TOWRE_nonwordacc +
ART_HRminusFR, data = subjects.coefs)

Model Likelihood     Discrimination
Ratio Test           Indexes
Obs       34    LR chi2     19.94    R2       0.444
sigma 0.0176    d.f.            4    R2 adj   0.367
d.f.      29    Pr(> chi2) 0.0005    g        0.015

Residuals

Min        1Q    Median        3Q       Max
-0.029786 -0.014093  0.002851  0.012819  0.032995

Coef    S.E.   t     Pr(>|t|)
Intercept        -0.1264 0.0328 -3.86 0.0006
Age               0.0001 0.0002  0.55 0.5856
TOWRE_wordacc     0.0001 0.0005  0.11 0.9101
TOWRE_nonwordacc  0.0017 0.0005  3.26 0.0029
ART_HRminusFR    -0.0006 0.0004 -1.57 0.1268

Notice that the per-subject frequency effect does appear to be predicted by nonword reading performance — in line with the frequency x nonword interaction seen in the analysis of the full dataset, above.

What have we learnt?

datadist()

ols()

ddply()

ggplot() — see previous posts on data visualization

Reading

Baayen, R. H. (2008). Analyzing linguistic data. Cambridge University Press.

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of memory and language, 59(4), 390-412.

Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & Yap, M. (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology: General, 133(2), 283.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioural sciences (3rd. edition). Mahwah, NJ: Lawrence Erlbaum Associates.

Harrell, F. E., Jr. (2001). Regression modeling strategies: With applications to linear models, logistic regression, and survival analysis. New York, NY: Springer.

Lorch, R. F., & Myers, J. L. (1990). Regression analyses of repeated measures data in cognitive research. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(1), 149.

Raaijmakers, J. G., Schrijnemakers, J., & Gremmen, F. (1999). How to deal with “the language-as-fixed-effect fallacy”: Common misconceptions and alternative solutions. Journal of Memory and Language, 41(3), 416-426.

Snijders, T. A. B., & Bosker, R. J. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Thousand Oaks, CA: Sage Publications.


Mixed-effects modeling — four hour workshop — part II: Data Exploration

We got our data together in Part I of this workshop, eventually arriving at the collation of a single dataframe in ‘long format’ holding all our data on subject attributes, item attributes and behavioural responses to words in a lexical decision experiment. That collated database can be downloaded here.

The code we will be using in this post can be downloaded from here.

We should begin by examining the features of our collated dataset. We shall be familiarising ourselves with the distributions of the variables, looking for evidence of problems in the dataset, and doing transformations or other operations to handle those problems.

Some of the problems are straightforward. We should look at the distributions of scores to check if any errors crept into the dataset, miscoded or misrecorded observations. Other problems are a little more demanding both to understand and to diagnose and treat. The running example in this workshop requires a regression analysis in which we will enter a number of variables as predictors, including variables characterizing word attributes like the frequency, length and imageability of words. The problem is that all these variables share or overlap in information, because words that appear often in the language also tend to be short, easy to imagine, learnt earlier in life and so on. This means that we need to examine the extent to which our predictors cluster, representing overlapping information and presenting potential problems for our analysis.

Much of what follows has been discussed in previous posts here on getting summary statistics and plotting histograms to examine variable distributions, here on plotting multiple histograms in a grid, see also here on histograms and density plots, here on plotting scatterplots to examine bivariate relationships, and here on checking variable clustering. This post constitutes an edited, shortened, version of those previous discussions.

Examine the variable distributions — summary statistics

I am going to assume you have already downloaded and read in the data files concerning subject and item or word attributes as well as the collated database merging the subject and item attribute data with the behavioural data (see Part I). What we are going to do now is look at the distributions of the key variables in those databases.

We can start simply by examining the summaries for our variables. We want to focus on the variables that will go into our models, so variables like age or TOWRE word or non-word reading score, item length, word frequency, and lexical decision reaction time (RT) to words.

If you have successfully read the relevant data files into the R workspace then you should see them listed by the dataframe names (assigned when you performed the read.csv function calls) in the workspace window in RStudio (top right).

R provides a handy set of functions to allow you to examine your data (see e.g. Dalgaard, 2008):


head(subjects, n = 2)

summary(subjects)

describe(subjects)

str(subjects)

psych::describe(subjects)

length(subjects)

length(subjects$Age)

Here we are focusing on the data in the subject scores dataframe.

Copy-paste these commands into the script window and run them; below is what you will see in the console:

R-2-1-descriptives

Notice:

1. the head() function gives you the top rows, showing column names and data in cells; I asked for 2 rows of data with n = 2

2. the summary() function gives you: minimum, maximum, median and quartile values for numeric variables, numbers of observations with one or other levels for factors, and will tell you if there are any missing values, NAs

3. psych::describe() (from the psych package) will give you means, SDs, the kind of thing you report in tables in articles

4. str() will tell you if variables are factors, integers etc.

If I were writing a report on this subjects data I would copy the output from describe() into excel, format it for APA, and stick it in a word document.

As an exercise, you might apply the same functions to the items or the full databases.

Notice the values and evaluate if they make sense to you. Examine, in particular, the treatment of factors and numeric variables.

Examine the variable distributions — plots

Summary statistics are handy, and required for report writing, but plotting distributions will raise awareness of problems in your dataset much more efficiently.

We will use ggplot to plot our variables, especially geom_histogram, see the documentation online.

We are going to examine the distribution of the Age variable, age in years for ML’s participants, using the ggplot() function as follows:


pAge <- ggplot(subjects, aes(x=Age))
pAge + geom_histogram()

Notice:

— We are asking ggplot() to map the Age variable to the x-axis i.e. we are asking for a presentation of one variable.

If you copy-paste the code into the script window and run it three things will happen:

1. You’ll see the commands appear in the lower console window

2. You’ll see the pAge object listed in the upper right workspace window

R-2-1-rstudio-histogram

Notice:

— When you ask for geom_histogram(), you are asking for a bar plot plus a statistical transformation (stat_bin) which assigns observations to bins (ranges of values in the variable)

— calculates the count (number) of observations in each bin

— or – if you prefer – the density (percentage of total/bar width) of observations in each bin

— the height of the bar gives you the count, the width of the bin spans values in the variable for which you want the count

— by default, geom_histogram() will bin the data into bins of width equal to the range of values divided by 30, but you can ask for more or less detail by specifying binwidth (see the short example after the message below); this is what R means when it says:

>stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to 
adjust this.
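For example, we might ask for bins five years wide for the Age plot built above; a small sketch only (the binwidth value here is just a choice, not a recommendation):


# specifying the bin width explicitly (here, 5 years) rather than accepting the default

pAge + geom_histogram(binwidth = 5)
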

3. We see the resulting plot in the plot window, which we can export as a pdf:

R-2-1-histogram

Notice:

— ML tested many people in their 20s – no surprise, a student mostly tests students – plus a smallish number of people in middle age and older (parents, grandparents)

— nothing seems wrong with these data – given what we know about the sample.

We can move on to examine the distribution of multiple variables simultaneously by producing a grid of histograms, as follows. In Exploratory Data Analysis, we often wish to split a larger dataset into smaller subsets, showing a grid or trellis of small multiples – the same kind of plot, one plot per subset – arranged systematically to allow the reader (maybe you) to discern any patterns, or stand-out features, in the data. Much of what follows is cribbed from the ggplot2 book (Wickham, 2009, pp. 115 onwards; see also Chang, 2013). Wickham (2009) advises that arranging multiple plots on a single page involves using grid, the R graphics system that underlies ggplot2.

Copy-paste the following code in the script window to create the plots:


pAge <- ggplot(subjects, aes(x=Age))
pAge <- pAge + geom_histogram()

pEd <- ggplot(subjects, aes(x=Years_in_education))
pEd <- pEd + geom_histogram()

pw <- ggplot(subjects, aes(x=TOWRE_wordacc))
pw <- pw + geom_histogram()

pnw <- ggplot(subjects, aes(x=TOWRE_nonwordacc))
pnw <- pnw + geom_histogram()

pART <- ggplot(subjects, aes(x=ART_HRminusFR))
pART <- pART + geom_histogram()

Notice:

— I have given each plot a different name. This will matter when we assign plots to places in a grid.

— I have changed the syntax a little bit: create a plot, then add the layers to the plot object; not doing this, i.e. sticking with the syntax as in the preceding examples, results in an error when we get to printing the plots out to the grid; try it out and wait for the message to appear in the console window.

We have five plots we want to arrange, so let’s ask for a grid of 2 x 3 plots ie two rows of three.


grid.newpage()

pushViewport(viewport(layout = grid.layout(2,3)))

vplayout <- function(x,y)
viewport(layout.pos.row = x, layout.pos.col = y)

Having asked for the grid, we then print the plots out to the desired places, as follows:


print(pAge, vp = vplayout(1,1))
print(pEd, vp = vplayout(1,2))
print(pw, vp = vplayout(1,3))
print(pnw, vp = vplayout(2,1))
print(pART, vp = vplayout(2,2))

As R works through the print() function calls, you will see the Plots window in R-studio fill with plots getting added to the grid you specified.

You can get R to generate an output file holding the plot as a .pdf or as a variety of alternatively formatted files.

Wickham (2009; p. 151) recommends differing graphic formats for different tasks, e.g. png (72 dpi for the web); dpi = dots per inch i.e. resolution.
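For instance, a png version could be produced with the png() device; a sketch only, where the file name is made up and the width and height are given in pixels (png() uses pixels by default, unlike the inches used by pdf()):


# png() is an alternative disk-based graphics device

png("ML-data-subjects-histograms.png", width = 1200, height = 800, res = 72)

# ... print the plots to the grid here, as in the pdf example below ...

dev.off()
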

Most of the time, I output pdfs. You call the pdf() function – a disk-based graphics device – specifying the name of the file to be output and the width and height of the output in inches (don’t know why, but OK), print the plots, and close the device with dev.off(). You can use this code to do that.

I can wrap everything up in a single sequence of code bracketed by commands to generate a pdf:


pdf("ML-data-subjects-histograms-061113.pdf", width = 15, height = 10)

pAge <- ggplot(subjects, aes(x=Age))
pAge <- pAge + geom_histogram()

pEd <- ggplot(subjects, aes(x=Years_in_education))
pEd <- pEd + geom_histogram()

pw <- ggplot(subjects, aes(x=TOWRE_wordacc))
pw <- pw + geom_histogram()

pnw <- ggplot(subjects, aes(x=TOWRE_nonwordacc))
pnw <- pnw + geom_histogram()

pART <- ggplot(subjects, aes(x=ART_HRminusFR))
pART <- pART + geom_histogram()

grid.newpage()

pushViewport(viewport(layout = grid.layout(2,3)))

vplayout <- function(x,y)
viewport(layout.pos.row = x, layout.pos.col = y)

print(pAge, vp = vplayout(1,1))
print(pEd, vp = vplayout(1,2))
print(pw, vp = vplayout(1,3))
print(pnw, vp = vplayout(2,1))
print(pART, vp = vplayout(2,2))

dev.off()

Which results in a plot that looks like this:

LME-4hrworkshop-histograms

There is nothing especially concerning about these distributions though the bimodal nature of the age distribution is something worth remembering if age turns out to be a significant predictor of reading reaction times.

Further moves

Histograms are not always the most useful approach to visualizing data distributions. What if you wanted to compare two distributions overlaid on top of each other e.g. distributions of variables that should be similar? It is worth considering density plots, see here.
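As a quick sketch using the subjects dataframe from above, a single-variable density plot can be built in just the same way as the histograms (the plot object name here is arbitrary); overlaying two distributions would additionally need the data in long format with a grouping variable mapped to colour or fill.


# a density plot of the age distribution, as an alternative to the histogram

pAge.density <- ggplot(subjects, aes(x=Age))
pAge.density + geom_density()
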

Exploring data and acting on what we find

Let’s consider an aspect of the data that, when revealed by a distribution plot (but was also evident when we looked at the summary statistics), will require action.

I am talking about what we see if we plot a histogram of the RT distribution using the following code. If we copy this into RStudio and run it:


pRT <- ggplot(RTs.norms.subjects, aes(x=RT))
pRT + geom_histogram()

We get this:

LME-4hrworkshop-RT-histogram

Notice that there are a number of negative RTs. That is, of course, impossible, but it makes sense if you know that DMDX, the application used to collect lexical decision responses, encodes RTs for errors as negative RTs.

Notice also that even if we ignore the -RT values, the RT distribution is, as usual, highly skewed.

Data transformations I/II

We can remove the error -RT observations by subsetting the full collated database, filtering out all observations corresponding to errors to create a new ‘correct responses only’ database. We can also log transform the RTs for correct responses in that same new database. We can do this in a few lines of code in the script as follows:


# the summary shows us that there are negative RTs, which we can remove by subsetting the dataframe
# by setting a condition on rows

nomissing.full <- RTs.norms.subjects[RTs.norms.subjects$RT > 0,]

# inspect the results

summary(nomissing.full)

# how many errors were logged?

# the number of observations in the full database i.e. errors and correct responses is:

length(RTs.norms.subjects$RT)

# > length(RTs.norms.subjects$RT)
# [1] 5440

# the number of observations in the no-errors database is:

length(nomissing.full$RT)

# so the number of errors is:

length(RTs.norms.subjects$RT) - length(nomissing.full$RT)

# > length(RTs.norms.subjects$RT) - length(nomissing.full$RT)
# [1] 183

# is this enough to conduct an analysis of response accuracy? we can have a look but first we will need to create
# an accuracy coding variable i.e. where error = 0 and correct = 1

RTs.norms.subjects$accuracy <- as.integer(ifelse(RTs.norms.subjects$RT > 0, "1", "0"))

# we can create the accuracy variable, and run the glmm but, as we will see, we will get an error message, false
# convergence likely due to the low number of errors


# I will also transform the RT variable to the log base 10 (RT), following convention in the field, though
# other transformations are used by researchers and may have superior properties

nomissing.full$logrt <- log10(nomissing.full$RT)


Notice:

— You can see that there were a very small number of errors recorded in the experiment, less than 200 errors out of 5000+ trials. This is a not unusual error rate, I think, for a dataset collected with typically developing adult readers.

— I am starting to use the script file not just to hold lines of code to perform the analysis, not just to hold the results of analyses that I might redo, but also to comment on what I am seeing as I work.

— The ifelse() function is pretty handy, and I have started using it more and more.

— Here, I can break down what is happening step by step:-

1. RTs.norms.subjects$accuracy <- — creates a vector that gets added to the existing dataframe columns; that vector has elements corresponding to the results of a test on each observation in the RT column, a test applied using the ifelse() function, which returns a new vector with elements in the correct order for each observation.

2. as.integer() — wraps the ifelse() function in a function call that renders the output of ifelse() as a numeric integer vector; you will remember from a previous post how one can use is.[datatype]() and as.[datatype]() function calls to test or coerce, respectively, the datatype of a vector.

3. ifelse(RTs.norms.subjects$RT > 0, "1", "0") — uses the ifelse() function to perform a test on the RT vector such that if it is true that the RT value is greater than 0 then a 1 is returned, and if it is not then a 0 is returned; these 1s and 0s encode accuracy and fill up the new vector being created, one element at a time, in an order corresponding to the order in which the RTs are listed.
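A minimal illustration of how that test behaves, run on a made-up vector of three RTs (one error, two correct responses):


# illustrative only: ifelse() is applied element by element

ifelse(c(-250, 430, 512) > 0, "1", "0")

# > ifelse(c(-250, 430, 512) > 0, "1", "0")
# [1] "0" "1" "1"
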

I used to write notes on my analyses on paper, then in a Word document (which, with CTRL-F, is more searchable), to ensure I had a record of what I did.

Check predictor collinearity

The next thing we want to do is to examine the extent to which our predictors cluster, representing overlapping information, and presenting potential problems for our analysis. We can do this stage of our analysis by examining scatterplot matrices and calculating the condition number for our predictors. Note that I am going to be fairly lax here, by deciding in advance that I will use some predictors but not others in the analysis to come. (If I were being strict, I would have to find and explain theory-led and data-led reasons for including the predictors that I do include.)

Note that we will first run some code to define a function that will allow us to create scatterplot matrices, furnished by William Revelle on his web-site (see URL below):


#code taken from part of a short guide to R
#Version of November 26, 2004
#William Revelle
# see: http://www.personality-project.org/r/r.graphics.html

panel.cor <- function(x, y, digits=2, prefix="", cex.cor)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r = (cor(x, y))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex * abs(r))
}

We can then run the following code:


# we can consider the way in which potential predictor variables cluster together

summary(nomissing.full)

# create a subset of the dataframe holding just the potential numeric predictors
# NB I am not bothering, here, to include all variables, e.g. multiple measures of the same dimension

nomissing.full.pairs <- subset(nomissing.full, select = c(

"Length", "OLD", "BG_Mean", "LgSUBTLCD", "brookesIMG", "AoA_Kup_lem", "Age",
"TOWRE_wordacc", "TOWRE_nonwordacc", "ART_HRminusFR"

))

# rename those variables that need it for improved legibility

nomissing.full.pairs <- rename(nomissing.full.pairs, c(

brookesIMG = "IMG", AoA_Kup_lem = "AOA",

TOWRE_wordacc = "TOWREW", TOWRE_nonwordacc = "TOWRENW", ART_HRminusFR = "ART"

))


# subset the predictors again by items and person level attributes to make the plotting more sensible

nomissing.full.pairs.items <- subset(nomissing.full.pairs, select = c(

"Length", "OLD", "BG_Mean", "LgSUBTLCD", "IMG", "AOA"

))

nomissing.full.pairs.subjects <- subset(nomissing.full.pairs, select = c(

"Age", "TOWREW", "TOWRENW", "ART"

))


# draw a pairs scatterplot matrix

pdf("ML-nomissing-full-pairs-items-splom-061113.pdf", width = 8, height = 8)

pairs(nomissing.full.pairs.items, lower.panel=panel.smooth, upper.panel=panel.cor)

dev.off()

# draw a pairs scatterplot matrix

pdf("ML-nomissing-full-pairs-subjects-splom-061113.pdf", width = 8, height = 8)

pairs(nomissing.full.pairs.subjects, lower.panel=panel.smooth, upper.panel=panel.cor)

dev.off()

# assess collinearity using the condition number, kappa, metric

collin.fnc(nomissing.full.pairs.items)$cnumber
collin.fnc(nomissing.full.pairs.subjects)$cnumber

# > collin.fnc(nomissing.full.pairs.items)$cnumber
# [1] 46.0404
# > collin.fnc(nomissing.full.pairs.subjects)$cnumber
# [1] 36.69636


You will see when you run this code that the set of subject attribute predictors (see below) …

R-10-1-subjects-splom

… and the set of item attribute predictors (see below) …

R-10-2-items-splom

… include some quite strongly overlapping variables, distinguished by high correlation coefficients.

I was advised, when I started doing regression analyses, that a rule-of-thumb one can happily use when examining bivariate i.e. pairwise correlations between predictor variables is that there is no multicollinearity to worry about if there are no coefficients greater than r = .8. However, I can think of no statistical reason for this number, and Baayen (2008) recommends using the condition number, kappa, as a diagnostic measure instead, based on Belsley et al.’s (1980) arguments that pairwise correlations do not reveal the real, underlying, structure of latent factors (i.e. principal components of which collinear variables are transformations or expressions).

If we calculate the condition number for, separately, the subject predictors and the item predictors, we get values of 30+ in each case. Baayen (2008) comments that (if memory serves) a condition number higher than 12 indicates a dangerous level of multicollinearity. However, Cohen et al. (2003) comment that while 12 or so might be a risky kappa there are no strong statistical reasons for setting the thresholds for a ‘good’ or a ‘bad’ condition number; one cannot apply the test thoughtlessly. That said, 30+ does not look good.

The condition number is based around Principal Components Analysis (PCA) and I think it is worth considering the outcome of such an analysis over predictor variables when examining the multicollinearity of those variables. PCA will often show that a set of several item attribute predictors relates quite strongly to a set of maybe four orthogonal components: something to do with frequency/experience; something relating to age or meaning; something relating to word length and orthographic similarity; and maybe something relating to sublexical unit frequency. However, as noted previously, Cohen et al. (2003) provide good arguments for being cautious about performing a principal components regression in response to a state of high collinearity among your predictors, including problems relating to the interpretability of findings.
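If you want to look at that component structure directly, one way (a sketch only, using the items predictor subset created above; the pca object name is mine) is to run prcomp(), standardizing the variables, and then inspect how much variance each component captures and which predictors load on which component:


# illustrative sketch: principal components of the item attribute predictors

nomissing.full.pairs.items.pca <- prcomp(nomissing.full.pairs.items, scale. = TRUE)

# the proportion of variance captured by each component

summary(nomissing.full.pairs.items.pca)

# the loadings (rotation matrix): which predictors pattern together on which component

round(nomissing.full.pairs.items.pca$rotation, 2)
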

Data transformations II/II: Center variables on their means

One means to at least partially alleviate the collinearity indicated here is to center predictor variables on their means.

Centering variables is helpful, also, for the interpretation of effects where 0 on the raw variables is not meaningful e.g. no one has an age of 0 years, no word has a length of 0 letters. With centered variables, if there is an effect then one can say that for unit increase in the predictor variable (where 0 for the variable is now equal to the mean for the sample, because of centering), there is a change in the outcome, on average, equal to the coefficient estimated for the effect.

Centering variables is helpful, especially, when we are dealing with interactions. Cohen et al. (2003) note that the collinearity between predictors and interaction terms is reduced (this is reduction of nonessential collinearity, due to scaling) if predictors are centered. Note that interactions are represented by the multiplicative product of their component variables. You will often see an interaction represented as “age x word length effects”, the “x” reflects the fact that for the interaction of age by length effects the interaction term is entered in the analysis as the product of the age and length variables. In addition, centering again helps with the interpretation of effects because if an interaction is significant then one must interpret the effect of one variable in the interaction while taking into account the other variable (or variables) involved in the interaction (Cohen et al., 2003; see, also, Gelman & Hill, 2007). As that is the case, if we center our variables and the interaction as well as the main effects are significant, say, in our example, the age, length and age x length effects, then we might look at the age effect and interpret it by reporting that for unit change in age the effect is equal to the coefficient while the effect of length is held constant (i.e. at 0, its mean).

As you would expect, you can center variables in R very easily, we will use one method, which is to employ the scale() function, specifying that we want variables to be centered on their means, and that we want the results of that operation to be added to the dataframe as centered variables. Note that by centering on their means, I mean we take the mean for a variable and subtract it from each value observed for that variable. We could center variables by standardizing them, where we both subtract the mean and divide the variable by its standard deviation (see Gelman & Hill’s 2007, recommendations) but we will look at that another time.
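(For reference only, since we will look at standardization properly another time: if you did want to standardize rather than just center a variable, the scale() call differs only in setting scale = TRUE; the zAge name in this sketch is made up.)


# a sketch of standardizing (centering and dividing by the SD) rather than just centering

nomissing.full$zAge <- scale(nomissing.full$Age, center = TRUE, scale = TRUE)
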

So, I center variables on their means, creating new predictor variables that get added to the dataframe with the following code:


# we might center the variables - looking ahead to the use of interaction terms - as the centering will
# both reduce nonessential collinearity due to scaling (Cohen et al., 2003) and help in the interpretation
# of effects (Cohen et al., 2003; Gelman & Hill, 2007 p .55) ie if an interaction is sig one term will be
# interpreted in terms of the difference due to unit change in the corresponding variable while the other is 0

nomissing.full$cLength <- scale(nomissing.full$Length, center = TRUE, scale = FALSE)
nomissing.full$cBG_Mean <- scale(nomissing.full$BG_Mean, center = TRUE, scale = FALSE)
nomissing.full$cOLD <- scale(nomissing.full$OLD, center = TRUE, scale = FALSE)
nomissing.full$cIMG <- scale(nomissing.full$brookesIMG, center = TRUE, scale = FALSE)
nomissing.full$cAOA <- scale(nomissing.full$AoA_Kup_lem, center = TRUE, scale = FALSE)
nomissing.full$cLgSUBTLCD <- scale(nomissing.full$LgSUBTLCD, center = TRUE, scale = FALSE)

nomissing.full$cAge <- scale(nomissing.full$Age, center = TRUE, scale = FALSE)
nomissing.full$cTOWREW <- scale(nomissing.full$TOWRE_wordacc, center = TRUE, scale = FALSE)
nomissing.full$cTOWRENW <- scale(nomissing.full$TOWRE_nonwordacc, center = TRUE, scale = FALSE)
nomissing.full$cART <- scale(nomissing.full$ART_HRminusFR, center = TRUE, scale = FALSE)

# subset the predictors again by items and person level attributes to make the plotting more sensible

nomissing.full.c.items <- subset(nomissing.full, select = c(

"cLength", "cOLD", "cBG_Mean", "cLgSUBTLCD", "cIMG", "cAOA"

))

nomissing.full.c.subjects <- subset(nomissing.full, select = c(

"cAge", "cTOWREW", "cTOWRENW", "cART"

))

# assess collinearity using the condition number, kappa, metric

collin.fnc(nomissing.full.c.items)$cnumber
collin.fnc(nomissing.full.c.subjects)$cnumber

# > collin.fnc(nomissing.full.c.items)$cnumber
# [1] 3.180112
# > collin.fnc(nomissing.full.c.subjects)$cnumber
# [1] 2.933563

Notice:

— I check the condition numbers for the predictors after centering and it appears that they are substantially reduced.

— I am not sure that that reduction can possibly be the end of the collinearity issue, but I shall take that result and run with it, for now.

— I also think that it would be appropriate to examine collinearity with both the centered variables and the interactions involving those predictors but that, too, can be left for another time.

Write the file out as a csv so that we do not have to repeat ourselves


write.csv(nomissing.full, file = "RTs.norms.subjects nomissing.full 061113.csv", row.names = FALSE)

What have we learnt?

In this post, we learnt about:

— How to inspect the dataframe object using functions like summary() or describe().

— How to plot histograms to examine the distributions of variables.

— How to use the grid package (grid.newpage(), pushViewport(), viewport()) to arrange multiple plots at once.

— How to output our plots as a pdf for later use.

— How to subset dataframes conditionally.

— How to log transform variables.

— How to create a variable conditionally dependent on values in another variable.

Key vocabulary

working directory

workspace

object

dataframe

NA, missing value

csv

histogram

graphics device

ggplot()

geom_histogram()

geom_density()

facet_wrap()

pdf()

grid()

print()

log10()

ifelse()

Reading

Baayen, R. H. (2008). Analyzing linguistic data. Cambridge University Press.

Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.

Chang, W. (2013). R Graphics Cookbook. Sebastopol, CA: O’Reilly Media.

Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioural sciences (3rd. edition). Mahwah, NJ: Lawrence Erlbaum Associates.

Dalgaard, P. (2008). Introductory statistics with R (2nd. Edition). Springer.

Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. New York,NY: Cambridge University Press.

Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. Springer.
