Getting normative data for words

The easiest step is to get the data we need from the English Lexicon Project (ELP). In particular, we want information on frequency and on orthographic similarity. There are several frequency measures; we will be using the log contextual diversity (CD) measure from the SUBTLEX database (see papers by Brysbaert, Adelman and others). For orthographic similarity, we will be using OLD20 (Yarkoni et al., 2008).
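To make the OLD20 measure concrete: it is the mean Levenshtein (edit) distance between a word and its 20 closest orthographic neighbours in a reference lexicon. The sketch below is only an illustration of that definition, not the ELP's own code; the function names and the toy lexicon are made up, and a real calculation would need a full lexicon.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def old_n(word, lexicon, n=20):
    """Mean edit distance from `word` to its n closest neighbours in `lexicon`."""
    distances = sorted(levenshtein(word, other)
                       for other in lexicon if other != word)
    nearest = distances[:n]
    if not nearest:
        raise ValueError("lexicon contains no other words")
    return sum(nearest) / len(nearest)


# Toy lexicon for illustration only; far too small for real norms.
toy_lexicon = ["cat", "cot", "cut", "coat", "dog"]
print(old_n("cat", toy_lexicon, n=2))
```

With the ELP we do not need to compute this ourselves; the OLD column in the query output already holds the value.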

We need to open Internet Explorer (the ELP website does not seem to work well with Chrome) and, at:

http://elexicon.wustl.edu/WordStart.asp

click on:

Generate Lists of Items with Specific Lexical Characteristics

and then select the following variables:

SUBTLWF

LgSUBTLWF

SUBTLCD

LgSUBTLCD

OLD

OLDF

PLD

PLDF

NPhon (Number of Phonemes)

NMorph (Number of Morphemes)

then click on the execute query button (the one at the bottom, for restricted vocabulary) and paste in a list of the words from the lexical decision stimulus set. Any listing will do; one might come from:

item norms 100812.csv
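If the word list has to be pulled out of a file like this one, a few lines of Python can produce a clean, de-duplicated list ready for pasting. This is a minimal sketch: the column name `word` and the sample rows are assumptions, since the real layout of the csv file is not shown here.

```python
import csv
import io

# Hypothetical sample standing in for the real file; the actual
# column names and contents may differ.
SAMPLE_CSV = """word,item_type
house,word
table,word
house,word
Chair ,word
"""


def unique_words(csv_text, column="word"):
    """Return distinct, trimmed, lower-cased entries of one column, in first-seen order."""
    seen = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        word = row[column].strip().lower()
        if word:
            seen.setdefault(word, None)
    return list(seen)


word_list = unique_words(SAMPLE_CSV)
# One word per line, ready to paste into the ELP query box.
print("\n".join(word_list))
```

To run this on the real file, read its text with `open(...).read()` and pass that in place of `SAMPLE_CSV`, adjusting `column` to match the file's header.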

We can then copy and paste the results returned by this query into a new sheet (word norms) in the workbook where we gather together all the information sources for our analysis.

Getting data from databases without online interfaces

Using the ELP is relatively straightforward; getting the age-of-acquisition (AoA) and imageability ratings is a little harder. We are going to use the Cortese imageability and AoA norms, which I have copied into a new folder:

Dropbox\resources R\2013 R class\item norms data

We could get the values we need by finding each word in the norms and copying and pasting its values by hand. Or we can let Excel do the work, following the instructions given here:

http://crr.ugent.be/archives/833

See the PDF how-to guide on VLOOKUP.

Having followed the instructions in the guide for both the IMG and the AoA databases, a quick spot check shows that the VLOOKUP function seems to deliver norm values from the source databases accurately, and also that there are a few missing values.
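For anyone who would rather script the lookup than use a spreadsheet, the same exact-match logic can be sketched in Python. The words and rating values below are invented for illustration and are not taken from the real norms; the point is only to show a lookup that returns a value where one exists and flags the words that are missing.

```python
# Hypothetical ratings for illustration only; not real norm values.
img_norms = {"house": 6.5, "table": 6.2}
stimuli = ["house", "table", "chair"]


def lookup(words, norms):
    """Exact-match lookup of each word in a norms table; None marks a missing value."""
    return {word: norms.get(word) for word in words}


matched = lookup(stimuli, img_norms)
missing = [word for word, value in matched.items() if value is None]

print(matched)
print("Missing:", missing)
```

Listing the missing words explicitly does the job of the spot check: it shows at a glance which items need values from another source.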

We can use the Brookes online IMG ratings that I collected for these (and other words) to complete the IMG database.
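Filling gaps from a second source can be sketched as follows. This is an assumption about how one might do it, not a record of what was done here: the dictionary names and values are hypothetical, and the source of each value is kept alongside it so that ratings from different norming studies can be told apart later in the analysis.

```python
def fill_missing(primary, fallback):
    """Keep primary values; use the fallback only where the primary value is None."""
    completed = {}
    for word, value in primary.items():
        if value is not None:
            completed[word] = (value, "primary")
        elif word in fallback:
            completed[word] = (fallback[word], "fallback")
        else:
            completed[word] = (None, "missing")
    return completed


# Hypothetical values for illustration only.
primary_img = {"house": 6.5, "table": 6.2, "chair": None, "glorp": None}
fallback_img = {"chair": 6.1}

completed_img = fill_missing(primary_img, fallback_img)
for word, (value, source) in completed_img.items():
    print(word, value, source)
```

Whether ratings from two sources are comparable enough to combine is a separate question that the code cannot answer.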

We can use the Kuperman AoA norms to get an alternate (complete) set of AoA values for words in the stimulus set:

http://crr.ugent.be/archives/806

We use the norms from the 51k-word database.
