Creating a Non-Word List to Match 226 of the Snodgrass Standardised Picture Set

Creating non-word lists is a necessary but time-consuming exercise, often needed when conducting behavioural language tasks such as lexical decision or non-word reading. This article describes the process by which we created a list of 226 non-words matching 226 items of the Snodgrass picture set [1]. To examine phoneme monitoring in fluent and non-fluent speakers, we used the pictures created by Snodgrass and Vanderwart [1]. We also wished to look at phoneme monitoring in non-words, so we created a list of non-words matched to the Snodgrass picture names. The non-words were matched on the following dimensions: number of syllables, stress pattern, number of phonemes, bigram count, and presence and location of the target sound where relevant. These properties were chosen because they have been found to influence how easy or difficult it is to detect a target phoneme.


Introduction
Creating non-word lists is a necessary but time-consuming exercise, often needed when conducting behavioural language tasks such as lexical decision or non-word reading. This article describes the process by which we created a list of 226 non-words matching 226 items of the Snodgrass picture set [1]. To examine phoneme monitoring in fluent and non-fluent speakers, we used the pictures created by Snodgrass and Vanderwart [1]. We also wished to look at phoneme monitoring in non-words, so we created a list of non-words matched to the Snodgrass picture names. The non-words were matched on the following dimensions: number of syllables, stress pattern, number of phonemes, bigram count, and presence and location of the target sound where relevant. These properties were chosen because they have been found to influence how easy or difficult it is to detect a target phoneme.

Rationale for creating a non-word list
The nature of the non-words used in experimental work has been shown to be extremely important to the results of the study in which they are used. For example, how similar a non-word is to a real word affects the speed at which a lexical decision is made [2][3][4][5]. Gibbs and Van Orden [3] found that lexical decisions were fastest when the non-words contained illegal letter strings, that is, strings of letters that do not appear together in the language used, e.g., /gtf/. Keuleers and Brysbaert [6] state that, because of the impact non-words have on lexical decisions, they should contain only legal letter strings and thus approximate real words more closely.
Phonotactic probability is the frequency with which different sound segments and segment sequences occur in the lexicon [7][8][9][10][11]. For example, /bl/ occurs commonly in English and is therefore thought to have a high phonotactic probability. It has been found that sensitivity to phonotactic probability develops in childhood and increases as our lexicon grows [8][12][13][14]. Munson and Bable [15] suggested that this increase in sensitivity reflects our lexical representations becoming more segmental. As our lexicon expands, so too do the phonotactic possibilities, and we become more sensitive to those segments which appear most often, e.g., /bl/. Coady and Aslin [12], Storkel [8] and Zamuner, Gerken and Hammond [16] have found that phonotactic probability is reflected in the accuracy of speech in young children: the lower the phonotactic probability, the less accurate the speech. This finding, when applied to the two-step model of lexical access [17], can be explained in terms of the level of activation. When a speaker attempts to access a word in their lexicon, this model proposes two steps: lemma retrieval and phonological retrieval. These two steps are not sequential; activation spreads throughout the retrieval network from semantic features to phonological features and back again. The most active phoneme units are then selected and positioned in the phonological frame. The model would suggest that units with higher phonotactic probability have higher activation and are, therefore, more readily retrieved. For this reason it may be easier to detect /l/ when it is in a /bl/ combination rather than a /nl/ combination, as /bl/ occurs more often in English than /nl/. As our list was created for a phoneme monitoring task, controlling for the number of letter bigrams was especially important.
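As an illustration of the definition above, the following minimal sketch estimates the relative frequency of adjacent phoneme pairs in a small lexicon. The lexicon, the ASCII phoneme symbols and the function name are all invented for illustration; published phonotactic probability measures are computed over large corpora and are usually position specific, which this sketch is not.

```python
from collections import Counter

# Hypothetical toy lexicon: each word is a list of ASCII stand-in phoneme symbols.
TOY_LEXICON = [
    ["b", "l", "u"],        # "blue"
    ["b", "l", "a", "k"],   # "black"
    ["b", "l", "i", "d"],   # "bleed"
    ["k", "a", "t"],        # "cat"
]


def biphone_probabilities(lexicon):
    """Return each adjacent phoneme pair's share of all pairs in the lexicon."""
    counts = Counter()
    for phonemes in lexicon:
        # zip pairs every phoneme with its right-hand neighbour.
        counts.update(zip(phonemes, phonemes[1:]))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


probs = biphone_probabilities(TOY_LEXICON)
bl = probs.get(("b", "l"), 0.0)   # frequent pair in this toy lexicon
nl = probs.get(("n", "l"), 0.0)   # pair that never occurs in it
```

In this toy lexicon /bl/ accounts for 3 of the 10 adjacent pairs, whereas /nl/ never occurs, mirroring the /bl/ versus /nl/ contrast described above.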
In Levelt et al.'s [18] model of speech production it is noted that we have the ability to monitor the phonological code that is generated in the syllabification process, which occurs before word production. Tasks such as phoneme monitoring can be used to test our ability to monitor phonological code, which is what Schiller [19] did. Adult Dutch speakers were given a silent phoneme monitoring task in which the phoneme they had to monitor for occurred either in syllable-initial and stress-initial position or in syllable-initial but not stress-initial position. Phoneme monitoring was fastest when the phoneme occurred in the initial stress position. Dutch, like English, is a language in which the majority of multisyllabic words are stressed on the initial syllable, so the results can be generalised to English. Coalson and Byrd [20] conducted a study asking participants to monitor for a phoneme in non-words. They found similar results to Schiller [19] and also suggest that fluent adults monitor for phonemes more slowly in non-words than in real words. It can be seen from this work that controlling for the position of the phoneme within the word, and for whether it occurs in the stressed syllable, is important as both affect the speed of monitoring.

Purpose of the list: current study
We created this non-word list because, in our subsequent study, we wished to examine phoneme monitoring in real words and non-words in adults who are fluent versus adults who are dysfluent. As we also wished to do this in a silent picture phoneme monitoring paradigm, we chose to use the Snodgrass picture set [1]. Snodgrass and Vanderwart created their set of 260 line drawings, which they standardised on four variables: familiarity, image agreement, name agreement and visual complexity. These variables must be controlled for as they affect cognitive processing in pictorial and verbal form. More familiar items are more easily named, as are words learnt at a younger age and items with higher name and image agreement and lower visual complexity [21][22][23].

Generating the non-words
Initially we excluded some of the Snodgrass words, e.g., those not regularly used in British English such as wrench (in British English we would use spanner). Noun phrases, e.g., wine glass, were also excluded. We then transcribed each word orthographically and phonologically, detailing the position of primary stress, the total number of syllables and the total number of phonemes. A letter bigram count was also calculated by hand. This count, taking account of the phonological transcription, was vital as English orthographic transcription does not consistently agree with phonological transcription. Once we had all of this information we could begin creating our non-words.
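The per-item record described above can be sketched as a small data structure. This is an illustrative sketch only: the counts in the article were made by hand, the class and function names are hypothetical, the phoneme symbols are ASCII stand-ins, the example non-word is invented and is not taken from Table 1, and the sketch assumes that bigrams are counted over adjacent phonemes in the transcription.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transcription:
    spelling: str                # orthographic form
    syllables: List[List[str]]   # phoneme symbols grouped by syllable
    stressed: int                # 0-based index of the primary-stress syllable


def profile(t: Transcription) -> dict:
    """Summarise the dimensions on which a word and a non-word are compared."""
    phonemes = [p for syllable in t.syllables for p in syllable]
    return {
        "syllables": len(t.syllables),
        "stressed": t.stressed,
        "phonemes": len(phonemes),
        # A sequence of n phonemes contains n - 1 adjacent pairs.
        "bigrams": max(len(phonemes) - 1, 0),
    }


# Illustrative entries with simplified ASCII phoneme symbols.
word = Transcription("apple", [["a"], ["p", "@", "l"]], stressed=0)
nonword = Transcription("obbid", [["o"], ["b", "i", "d"]], stressed=0)  # invented

matched = profile(word) == profile(nonword)
```

Comparing whole profiles makes a mismatch on any one dimension (for example, a candidate with stress on the second syllable) visible immediately.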
To create the non-words we used two software programs. The first was the ARC Nonword Database [24]. This database was created so that researchers could access monosyllabic non-words or pseudo-homophones chosen on the basis of a number of properties, including: the number of letters, the neighbourhood size, summed frequency of neighbours, number of body neighbours, summed frequency of body neighbours, number of body friends, number of body enemies, number of onset neighbours, summed frequency of onset neighbours, number of phonological neighbours, summed frequency of phonological neighbours, bigram frequency by type and by token (both position specific and position non-specific), trigram frequency by type and by token (both position specific and position non-specific) and the number of phonemes.
Upper and lower limits can be set for each of these values, and the fields to be included in the output can also be selected. Non-words and pseudo-homophones can be restricted to orthographically existing onsets only, orthographically existing bodies only, legal bigrams only, monomorphemic syllables only, polymorphemic syllables only and morphologically ambiguous syllables. The ARC software, whilst extensive, could only be used to create non-words for the monosyllabic words in the Snodgrass set (121 of the 226 words). Each non-word was chosen from a list of possible options given by the ARC database. When the target sound needed to be present, we had to select non-words that also had the target sound in the same position. It was not possible to ask the software to do this for us, which added to the workload.
For the remaining 105 multisyllabic words we used the Wuggy software [6] to create the non-words. Once again, non-words were matched to real words in terms of phoneme length, syllable length, presence or absence of the target sound, the position of the target sound when it occurred, and stress pattern. Wuggy is a multilingual pseudo-word generator designed to generate non-words in Basque, Dutch, English, French, German, Serbian (Cyrillic and Latin), Spanish and Vietnamese. This software was developed to expand upon what ARC offers, as it can generate multisyllabic words. A word or non-word can be entered and the algorithm generates pseudo-words which are matched in sub-syllabic structure and transition frequencies. In the Wuggy software, after the language has been selected, it is possible to select whether real words or pseudo-words are required. Output restrictions can then be applied, including: match length of sub-syllabic segments, match letter length, match transition frequencies (concentric search) and match sub-syllabic segments, e.g., 2 out of 3. There are also output options similar to those of ARC, including: syllables, lexicality, OLD20, neighbours at edit distance, number of overlapping segments and deviation statistics. Each of the remaining 105 words was entered into Wuggy and one of the options generated was chosen according to whether it had the target sound (when applicable) in the correct location.
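The manual selection step, choosing a candidate whose target sound sits in the same position as in the real word, can be sketched as a simple filter. This is a hedged illustration rather than a feature of ARC or Wuggy: it assumes the candidates have already been exported and transcribed phonologically by the researcher, and the function names, ASCII phoneme symbols and example candidates are all hypothetical.

```python
def target_positions(phonemes, target):
    """Return every 0-based index at which the target phoneme occurs."""
    return [i for i, p in enumerate(phonemes) if p == target]


def keeps_target(word, candidate, target):
    """True if the candidate has the same phoneme length as the word and the
    target phoneme at exactly the same positions (or nowhere, if the word
    does not contain the target)."""
    return (
        len(word) == len(candidate)
        and target_positions(word, target) == target_positions(candidate, target)
    )


# Illustrative real word ("balloon") and invented candidates.
word = ["b", "@", "l", "u", "n"]
candidates = [
    ["d", "@", "l", "i", "n"],   # /l/ in the same position: usable
    ["l", "@", "d", "i", "n"],   # /l/ moved to word-initial position
    ["d", "@", "l", "i"],        # /l/ in place, but one phoneme short
]

usable = [c for c in candidates if keeps_target(word, c, "l")]
```

A fuller version would also compare syllable count and stress pattern before accepting a candidate.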
Once each non-word had been chosen and transcribed orthographically and phonologically, a manual bigram count was taken. To ensure no bigrams were missed, the total number of phonemes was calculated (980 phonemes in each list, words and non-words). Following this, the total number of possible bigrams was calculated (754 bigrams in each list, words and non-words). Bigram frequency data were calculated for real words and non-words, and a Wilcoxon signed-rank test showed similar frequencies across the two lists (z = -0.123, p = 0.902). None of the non-words differed from the real words by more than 2 standard deviations (more than 5 bigrams); the greatest difference was 6 occurrences of a bigram versus 1 occurrence of it. By ensuring that the lists are as similar as possible, we have minimised the chance that any difference in performance between the lists is due to factors other than the word/non-word distinction.
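The totals reported above are consistent with counting adjacent phoneme pairs within each item: a word of n phonemes yields n - 1 pairs, so 226 items with 980 phonemes give 980 - 226 = 754 bigrams. The sketch below illustrates that cross-check and a per-bigram comparison between two lists. It assumes phoneme-pair bigrams, uses invented toy lists rather than the published ones, and all function names are hypothetical; the Wilcoxon signed-rank test itself is not reproduced here.

```python
from collections import Counter


def bigram_counts(items):
    """Count adjacent phoneme pairs across a list of transcribed items."""
    counts = Counter()
    for phonemes in items:
        counts.update(zip(phonemes, phonemes[1:]))
    return counts


def expected_bigram_total(items):
    """Total phonemes minus number of items (each item has at least one phoneme)."""
    return sum(len(phonemes) for phonemes in items) - len(items)


def largest_difference(words, nonwords):
    """Largest absolute difference in occurrences of any single bigram."""
    a, b = bigram_counts(words), bigram_counts(nonwords)
    # A Counter returns 0 for bigrams that are missing from one list.
    return max(abs(a[pair] - b[pair]) for pair in set(a) | set(b))


# Invented toy lists with ASCII stand-in phoneme symbols.
words = [["k", "a", "t"], ["d", "o", "g"]]
nonwords = [["k", "a", "p"], ["d", "o", "b"]]

totals_agree = (
    sum(bigram_counts(words).values()) == expected_bigram_total(words)
    and sum(bigram_counts(nonwords).values()) == expected_bigram_total(nonwords)
)
worst = largest_difference(words, nonwords)
```

If the counted total and the expected total disagree, at least one bigram has been missed or double counted in the tally.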

Outcome
The completed non-word list with the corresponding Snodgrass words can be found in Table 1. The target phonemes that we used in the subsequent phoneme monitoring task are highlighted in bold (where applicable). It should be noted that, whilst this list is matched and there is no significant difference in bigram frequencies between the two lists, this is only the case when all 226 words are used. If any items are excluded in work using the lists, a new bigram count must be taken to ensure that the lists remain well matched.

Table 1: The completed non-word list with corresponding Snodgrass words (columns: S.No., Non-Word List).

Citation: Bretherton-Furness J, Ward D, Saddy D (2016) Creating a Non-Word List to Match 226 of the Snodgrass Standardised Picture Set. J Phonet and Audiol 2: 109. doi:10.4172/2471-9455.1000109