This post assumes that you have read and worked through examples in previous posts concerning: 1. how to read in .csv files as dataframes; 2. data structure basics, including discussion of vectors, matrices, and dataframes. We will be working with the items data previously referenced here.
flickr – Creative Commons – US National Archives: Rising the first main frame of a dirigible, ca. 1933
Here we will look at dataframes in more detail because dataframes will be, generally, what we will be dealing with in our data analyses. We will revise how you can input a database from an external source, e.g. an excel .csv file, and read it into the R workspace as a dataframe. We will also look at how you can test and convert the data type of variables in a dataframe.
In the next post, we will look at how you refer to elements of a dataframe by place, name or condition, and how you use that capacity to add to or remove from the dataframe and how you subset dataframes.
Revision: downloading external databases
If you recall my post on workflow, in the research I do and supervise, we typically work through a series of stages that end in a number of data files: select items and prepare a script to present those items (as stimuli) and collect responses (usually reading responses); we test people and collect data; we collate the data, often with some cleaning; we do analyses in R. The data collation phase will typically yield a number of data files, and I usually create such files in excel (where collation and cleaning is done by hand), outputting them as .csv files.
It is straightforward getting the data files into R, provided you have set your working directory correctly.
I am going to assume that you have managed to download the item norms data files from the links given in this post, or in the following reiteration of the main text:
We have .csv files holding:
1. data on the words and nonwords, variables common to both like word length;
2. data on word and nonword item coding, corresponding to item in program information, that will allow us to link these norms data to data collected during the lexical decision task;
3. data on just the words, e.g. age-of-acquisition, which obviously do not obtain for the nonwords.
— We have data on the characteristics of 160 words and 160 nonwords, selected using the English Lexicon Project nonwords interface which is really very useful.
— The words are supposed to be matched to the nonwords on length and other orthographic characteristics to ensure that participants are required to access the lexicon to distinguish the real words from the made-up words (the nonwords). We can and will test this matching using t-tests (in another post).
Revision: getting external databases into R using read.csv()
We can set the working directory and input the databases in a few lines of code:
# set the working directory
setwd("C:/Users/p0075152/Dropbox/resources R public")
# read in the item norms data
item.norms <- read.csv(file="item norms 270312 050513.csv", head=TRUE, sep=",", na.strings = "-999")
word.norms <- read.csv(file="word norms 140812.csv", head=TRUE, sep=",", na.strings = "-999")
item.coding <- read.csv(file="item coding info 270312.csv", head=TRUE, sep=",", na.strings = "-999")
— The code here specifies that we are reading in or inputting comma delimited files, .csv, using the read.csv() function call.
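— We specified with head=TRUE (short for the header argument) that the first row of each file holds the column names for the variables.
— We also specified with sep="," that the values in different columns in the dataframe were delimited by commas. Check what happens if this bit of the code is taken out (for read.csv() the comma is already the default separator, so you should see no change).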
— Finally, we specified that missing values in the database – referred to in R as NA or not available – should be recognized wherever there is a value of -999; I use -999 to code for missing values when I am building databases, you might choose something different.
— N.B. We might equally have used read.table(): a more general form of the read function.
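To make the read.csv() and read.table() comparison concrete, here is a minimal sketch. It does not use the item norms files: it writes a tiny made-up .csv file (the item names and values are invented for illustration) to a temporary location, then reads it back both ways.

```r
# write a tiny, made-up .csv file to a temporary location (hypothetical data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("item_name,Length,BG_Mean",
             "flark,5,1200",
             "house,5,-999"),
           con = tmp)

# read.csv(): header = TRUE and sep = "," are already the defaults
norms.csv <- read.csv(file = tmp, na.strings = "-999")

# read.table(): the more general function, so header and sep must be given
norms.tab <- read.table(file = tmp, header = TRUE, sep = ",",
                        na.strings = "-999")

# both calls should give the same dataframe, with -999 read in as NA
identical(norms.csv, norms.tab)
is.na(norms.csv$BG_Mean)
```

If you drop the header and sep arguments from the read.table() call, you should find that the file is no longer split into the columns you expect.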
Revision: getting information about dataframe objects
We have looked at how to get information about dataframes before. There are a number of very useful functions, and you can enter the following code to get information about the various databases you have just put into the workspace.
# describe() is not in base R; it is assumed here to come from the psych package
library(psych)
# get summary statistics for the item norms
head(item.norms, n = 2)
summary(item.norms)
describe(item.norms)
str(item.norms)
length(item.norms)
length(item.norms$item_name)
If you run that code on the item.norms dataframe, you will see something like this in the console window (below the script window):
— head() will give you the top rows from the dataframe. The n = argument in head(item.norms, n = 2) specifies that we want the top 2 rows, but you could ask for any number: try it.
— summary() will give you information about factors like item_type, telling you how many times each level in the factor occurs, as well as statistical data about numeric variables like (in this dataframe) Length, including median and mean.
— describe() will give you the mean, SD and other statistics often required for reports.
— str() is very useful because it indicates what data mode or type each variable is identified to have by R.
— length() will tell you how many variables there are in the dataframe.
— length(item.norms$item_name) tells us how many observations or elements there are in the variable or vector item.norms$item_name.
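If you do not have the item norms files to hand, you can try the same functions on a small dataframe built by hand. This is only a sketch: toy.norms and its values are made up for illustration, and dim() is added as a convenient way to get rows and columns in one call.

```r
# a small made-up dataframe standing in for item.norms (hypothetical values)
toy.norms <- data.frame(item_name = c("flark", "house", "brint", "table"),
                        item_type = factor(c("nonword", "word", "nonword", "word")),
                        Length = c(5, 5, 5, 5))

head(toy.norms, n = 2)        # the top 2 rows
summary(toy.norms)            # counts per factor level, statistics for numeric variables
str(toy.norms)                # the type R has assigned to each variable
length(toy.norms)             # the number of variables (columns)
length(toy.norms$item_name)   # the number of observations in one variable
dim(toy.norms)                # rows, then columns, in one call
```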
$ – Referring to dataframe variables by name
Notice the use of the item.norms$item_name notation.
What we are doing here is referring to a specific variable in a dataframe by name using the notation: dataframe’s-name-$-vector-name.
Note that you can avoid having to write dataframe$variable, and just refer to the variable by name, if you use the attach() function. However, some people regard that as problematic (consider what happens if more than one variable exists with the same name), so I will not be using attach().
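A minimal sketch of the naming problem with attach(), using a made-up dataframe and a made-up workspace object that happen to share the name Length:

```r
toy.norms <- data.frame(Length = c(4, 5, 6))   # hypothetical values

# a separate object in the workspace that happens to share the name
Length <- 99

attach(toy.norms)   # R should report that Length is masked by the workspace object
Length              # the workspace object is found first, not the dataframe variable
toy.norms$Length    # the $ notation is unambiguous
detach(toy.norms)   # undo the attach
```

With the $ notation there is never any doubt about which Length you are asking for.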
Using descriptive information about dataframes to examine their state
I would use one or more of the summary functions to examine whether the dataframe I have input is in the shape I expect it to be.
I usually inspect dataframes as I input them, at the start of an analysis. I am looking for answers to the following questions:-
1. Does the dataframe have the dimensions I expect? I have usually built the dataset in excel by hand so I have a clear idea about the columns (variables) that should be in it, their names, and the number of rows (observations). Even for a dataset with several thousand observations, I should still be able to tell precisely how many should be in the dataframe resulting from the read.csv() function call.
2. Are the variables of the datatype I expect? Using the str() and summary() function calls, I can tell if R regards variables in the same way as I do.
In an earlier version of this post, I noted that the item.norms dataframe inspection yielded a surprise:
— I expected to see the bigram frequency variables, BG_Sum, BG_Mean and BG_Freq_By_Pos, all treated as numeric variables, but saw in the output from the str() function call that R thought they were factors. I think this happened because the ELP yields BG numbers with “,” as a separator for numbers in the 1,000s. The presence of the “,” triggered the recognition by R of these variables as factors. I fixed that in excel by reformatting the cells, removing the “,”, but one could use the functions discussed at the end.
3. Are there missing values in the dataset, and have they been coded correctly? Needs will vary between analyses. Analyses of the data yielded by psycholinguistic experiments are typically analyses of the latency of correct responses, so we will need to exclude errors before analysis. We could do that by first coding errors as missing values (-999 in the database when it is built, -999 assigned as NA in R when input), then omitting the NAs.
I don’t think it ever pays to omit a careful movement through this phase because any errors that go undetected here will only waste your time later.
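The three checks can be sketched in a few lines. The dataframe here is made up (toy.data, its variable names and its values are assumptions for illustration), with -999 standing in for an error response as described above.

```r
# made-up data with -999 as the missing value code (hypothetical values)
toy.data <- data.frame(subject = factor(c("s1", "s1", "s2", "s2")),
                       item_name = c("flark", "house", "flark", "house"),
                       RT = c(612, -999, 587, 540))

# recode -999 as NA, as na.strings = "-999" would do at input
toy.data$RT[toy.data$RT == -999] <- NA

# 1. dimensions: rows (observations), then columns (variables)
dim(toy.data)

# 2. the type R has assigned to each variable
str(toy.data)

# 3. how many missing values are there in each variable?
colSums(is.na(toy.data))

# omit the rows holding NAs, leaving the observations we want to analyse
toy.complete <- na.omit(toy.data)
nrow(toy.complete)
```

Note that na.omit() drops every row holding an NA in any variable, so it is worth checking the counts from step 3 first.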
Data modes, testing for mode, and changing it
As mentioned previously, data objects in R have structure (e.g. a vector or a dataframe etc.) and mode: R can deal with numeric, character, logical and other modes of data.
Usually, if you think a variable is numeric, character etc., R will think so too.
Let’s consider the different data types before we consider how to test a variable for type and then how to convert variable types (coercion). See this helpful post in the Quick-R blog, or chapter 2 in R-in-Action (Kabacoff, 2011) for further information.
— factor: categorical or nominal data, with a small number of distinct values (levels) e.g. the factor gender with levels male and female, or the factor item type with levels word and nonword, or the factor subject with levels one for each different person in a dataset; factors can be ordered or unordered (ordinal data).
— numeric: number data, which R can store as double precision or integer; that is a more sophisticated distinction which is not relevant yet.
— character: data consisting of characters, could be intended as numbers or factors
[This is a bit weak but will do for now.]
— dealing with dates and times is a whole other world of stuff to take on, and we will come to that stuff later
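As a rough illustration of these types, here are a few made-up vectors (the names and values are invented), with class() and typeof() showing how R regards each one:

```r
item.type <- factor(c("word", "nonword", "word"))   # factor
word.length <- c(4, 5, 6)                           # numeric, stored as double precision
n.letters <- c(4L, 5L, 6L)                          # numeric, stored as integer
item.name <- c("bird", "flark", "house")            # character

class(item.type)      # "factor"
levels(item.type)     # the distinct values the factor can take
class(word.length)    # "numeric"
typeof(word.length)   # "double"
class(n.letters)      # "integer"
class(item.name)      # "character"
```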
Testing variables for mode
There are some very useful functions in R for testing the mode of a variable.
Let’s start by checking out the BG_ variables.
# if we inspect the results of the input, we can check the assignment of data mode to the vectors
# in our data frames
is.factor(item.norms$BG_Sum)
is.factor(item.norms$BG_Mean)
is.factor(item.norms$BG_Freq_By_Pos)
is.numeric(item.norms$BG_Sum)
is.numeric(item.norms$BG_Mean)
is.numeric(item.norms$BG_Freq_By_Pos)
You will see that if you run these lines of code:
1. R will return the result FALSE for each is.factor() test because the BG bigram frequency variables are not classed as factors;
2. R will return TRUE for each is.numeric() test because the bigram variables are classed as numeric.
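The same tests can be tried on a small made-up dataframe (toy.norms and its values are invented for illustration), including a way to test every variable in one go using sapply():

```r
toy.norms <- data.frame(item_type = factor(c("word", "nonword")),
                        BG_Sum = c(10500, 8700))   # hypothetical values

is.factor(toy.norms$BG_Sum)         # FALSE
is.numeric(toy.norms$BG_Sum)        # TRUE
is.factor(toy.norms$item_type)      # TRUE
is.character(toy.norms$item_type)   # FALSE: a factor is not a character vector

# apply the same test to every variable in the dataframe
sapply(toy.norms, is.numeric)
```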
Converting (coercing) variable types – coercing data vector mode
There are some very useful functions in R for converting or coercing a variable’s mode from one kind to another, if required.
We can convert a variable’s data type using the as.factor(), as.numeric() functions; there is also as.character().
For example, we can convert BG_Sum from a numeric variable to a factor and back again using as.factor() and as.numeric(). With each conversion, we can demonstrate the variable’s data type by testing it using is.factor() or is.numeric() and also by getting str() and summary() results for the variable.
# convert the BG_Sum variable from being numeric to being a factor:
item.norms$BG_Sum <- as.factor(item.norms$BG_Sum)
# check the variable type
is.factor(item.norms$BG_Sum)
is.numeric(item.norms$BG_Sum)
# show structure and summary
str(item.norms$BG_Sum)
summary(item.norms$BG_Sum)
# convert the variable back from factor to numeric
# (go via as.character(): as.numeric() applied directly to a factor returns
# the level codes 1, 2, 3 ... and not the original values)
item.norms$BG_Sum <- as.numeric(as.character(item.norms$BG_Sum))
# check the variable type
is.factor(item.norms$BG_Sum)
is.numeric(item.norms$BG_Sum)
# show structure and summary
str(item.norms$BG_Sum)
summary(item.norms$BG_Sum)
Try this out for yourself, inspecting the result of each conversion.
— I am doing the conversion using the as.factor() or as.numeric() function calls.
— I am asking that the converted variable is given the same name as the original type variable. I could ask for the converted variable to be given a new name e.g. item.norms$newBG_Sum.
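One thing to watch when converting a factor to numeric: as.numeric() applied directly to a factor gives you the internal level codes, not the numbers you see printed. The sketch below uses made-up bigram frequency values to show the difference. It also shows one possible way, using gsub(), to deal with numbers that have been read in as text because of a “,” in the 1,000s.

```r
bg.sum <- c(10500, 8700, 10500)        # hypothetical bigram frequencies
bg.factor <- as.factor(bg.sum)         # levels are "8700" and "10500"

as.numeric(bg.factor)                  # level codes: 2 1 2
as.numeric(as.character(bg.factor))    # original values: 10500 8700 10500

# numbers read in as text, with "," marking the 1,000s
bg.text <- c("10,500", "8,700")
as.numeric(gsub(",", "", bg.text))     # strip the commas, then convert: 10500 8700
```

Going via as.character() first recovers the original values because it converts the level labels, rather than the codes, to numbers.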
What have we learnt?
[The code used in this post can be downloaded from here.]
We have revised how you input a dataframe, getting it from a file in your computer folder to an object in the workspace.
We have revised how you get information to inspect the dataframe.
We have looked at how you test, and how you can convert, the type of data a variable is treated as.