Verbing Grammark

I came across a project on GitHub, which is where Open Source programmers like to hang out and work on projects, occasionally getting together and helping each other. The project is called Grammark, an online grammar checker.

The ability to predict which patterns are formed in brain scans when a person imagines celery or an airplane, based on how the corresponding words co-occur in texts, suggests that it is possible to model mental representations based on word statistics. Whether counting how frequently nouns and verbs combine in Google search queries, or extracting eigenvectors from matrices made up of Wikipedia lines and Shakespeare plots, these latent semantics approximate the associative links that form concepts. However, cognition is fundamentally intertwined with action; even passively reading verbs has been shown to activate the same motor circuits as when we tap a finger or observe actual movements. If languages evolved by adapting to the brain, sensorimotor constraints linking articulatory gestures with aspects of motion might also be reflected in the statistics of word co-occurrences.

Emotional ratings of words are in high demand. Research on emotions (the ways in which they are produced and perceived, their internal structure, and their consequences) draws on semantic and grammar research in several areas. For instance, Verona, Sprague, and Sadeh (2012) used emotionally neutral and negative words in an experiment comparing responses of offenders without a personality disorder to those of offenders with an antisocial personality disorder.

Another line of research deals with the impact that emotional features have on the processing and memory of words. Kousta, Vinson, and Vigliocco (2009) found that participants responded faster to positive and negative words than to neutral words in a lexical decision experiment, a finding later replicated by Scott, O’Donnell, and Sereno (2012) in sentence reading. According to Kousta, Vigliocco, Vinson, Andrews, and Del Campo (2011), emotion is particularly important in the semantic representations of abstract words. In other research, Fraga, Pineiro, Acuna-Farina, Redondo, and Garcia-Orza (2012) reported that emotional words are more likely to be used as attachment sites for relative clauses in sentences.

A third approach uses emotional ratings of words to estimate the sentiment expressed by entire messages or texts. Leveau, Jean-Larose, Denhière, and Nguyen (2012), for instance, wrote a computer program to estimate the valence and arousal evoked by texts on the basis of word measures (see also Liu, 2012).
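As a rough illustration of scoring a text from word-level ratings, here is a minimal Python sketch. It is not the program by Leveau and colleagues; it simply averages the ratings of the words it recognises. The `RATINGS` table, its values, and the `score_text` function are all invented for this example and do not come from ANEW or any published norms.

```python
import re

# Hypothetical ratings on a 1-9 scale; NOT taken from ANEW or any published norms.
RATINGS = {
    "happy": {"valence": 8.0, "arousal": 6.0},
    "calm": {"valence": 7.0, "arousal": 2.0},
    "angry": {"valence": 2.0, "arousal": 8.0},
}


def score_text(text, ratings=RATINGS):
    """Average the ratings of every rated word in the text; None if no word is rated."""
    # Lowercase the text and split it into simple alphabetic tokens.
    words = re.findall(r"[a-z']+", text.lower())
    rated = [ratings[w] for w in words if w in ratings]
    if not rated:
        return None
    return {
        dim: sum(r[dim] for r in rated) / len(rated)
        for dim in ("valence", "arousal")
    }


print(score_text("I am happy and calm"))
```

Unrated words are skipped rather than treated as neutral, so a text with no rated words returns `None` instead of a misleading midpoint score.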

Emotional ratings of words are also used to automatically estimate the emotional values of new words by comparing them to validated words. Bestgen and Vincze (2012) gauged the affective values of 17,350 words by using rated values of words that were semantically related.
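The idea of borrowing ratings from related words can be sketched in a few lines of Python. This is only an assumption about the simplest possible version (a plain average over neighbours), not the procedure Bestgen and Vincze actually used. The neighbour lists, the ratings, and the `estimate_rating` function are hypothetical.

```python
# Hypothetical valence ratings and neighbour lists, for illustration only.
RATED_VALENCE = {"joy": 8.0, "glad": 7.0, "grief": 2.0}
NEIGHBOURS = {
    "cheerful": ["joy", "glad", "merry"],  # "merry" has no rating and is skipped
    "zyx": ["unknown"],
}


def estimate_rating(word, neighbours, rated):
    """Mean rating of the word's rated neighbours; None if none of them are rated."""
    values = [rated[n] for n in neighbours.get(word, []) if n in rated]
    return sum(values) / len(values) if values else None


print(estimate_rating("cheerful", NEIGHBOURS, RATED_VALENCE))
```

In practice the neighbour lists would have to come from some measure of semantic relatedness, which this sketch takes as given.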

So far, nearly all studies have been based on Bradley and Lang’s (1999) Affective Norms for English Words (ANEW) or translated versions (for exceptions see Kloumann et al., 2012; Mohammad & Turney, 2010). These norms contain ratings for 1,034 words. There are three types of ratings, in line with Osgood, Suci, and Tannenbaum’s (1957) theory of emotions. The first, and most important, concerns the valence (or pleasantness) of the emotions evoked by the word, going from unhappy to happy. The second addresses the degree of arousal evoked by the word. The third dimension refers to the dominance/power of the word, the extent to which the word denotes something that is weak/submissive or strong/dominant. The number of words covered by the ANEW norms appeared sufficient for use in small-scale factorial experiments.

In these experiments, a limited number of stimuli are selected that vary on one dimension (e.g., valence) and are matched on other variables (e.g., arousal, word frequency, word length, and others). However, this number is prohibitively small for the large-scale megastudies that are currently emerging in psycholinguistics. In these studies (e.g., Balota et al., 2007; Ferrand et al., 2010; Keuleers et al., 2010, 2012), regression analyses of thousands of words are used to disentangle the influences on word recognition. The ANEW norms are also limited as input for computer algorithms gauging the sentiment of a message/text or the emotional values of non-rated words.

In 2012, Marc Brysbaert posted the following:

Together with Victor Kuperman and Hans Stadthagen-Gonzalez, we collected age-of-acquisition (AoA) ratings for 30,121 English content words (nouns, verbs, and adjectives). The collection of these new AoA norms was possible because we made use of the web-based crowdsourcing technology offered by the Amazon Mechanical Turk. Correlations with existing AoA measures suggest that these estimates are as good as the existing ones.

You find the article on the new AoA norms (Kuperman et al., Behavior Research Methods, 2012) here.

You find the Kuperman et al. (2012) AoA ratings here.

Here you find a comparison with the AoA norms from other large-scale databases (Bird et al., 2001; Stadthagen-Gonzalez & Davis, 2006; Cortese & Khanna, 2008; Schock et al., 2012). In each sheet two or three new columns have been added: the Kuperman et al. AoA ratings for the overlapping words, and the predicted Kuperman et al. AoA norms on the basis of original rating (by means of linear or polynomial regression; the regression weights are shown as well).
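To make the "predicted norms by means of linear regression" idea concrete, here is a small Python sketch of an ordinary least-squares fit that maps another study's ratings onto one target scale. The numbers are made up, and `fit_line` and `predict` are names chosen for this example; the actual regression weights are the ones given in the comparison sheets, not these.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up ratings for words that appear in both lists (illustration only).
other_study = [1.0, 2.0, 3.0, 4.0]   # ratings on the other study's scale
target_norms = [3.0, 5.0, 7.0, 9.0]  # ratings for the same words on the target scale

slope, intercept = fit_line(other_study, target_norms)


def predict(rating):
    """Convert a rating from the other study to the target scale."""
    return slope * rating + intercept


print(slope, intercept, predict(5.0))
```

Once the line is fitted on the overlapping words, `predict` can be applied to words that only the other study rated, which is how a combined list can grow beyond the original one.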

Because the Age-of-acquisition norms can also be used for inflected forms and because the other studies contained ratings for words we did not include (pronouns, number words, adverbs, nouns mostly used as names) we can expand the original Kuperman et al. list to a total of 51,715 words, which you find here. In this list, for each word we give the Kuperman et al. AoA rating, and the predicted AoA ratings on the basis of other studies (based on the lemmas of the words).

You may have noticed that we make much of our information (SUBTLEX word frequencies, AoA norms, RTs from the Lexicon Projects, …) available as Excel files. We do this because we know many people work with such files.

Most of the time we simply open the Excel files and manually look up the information we need. This is fine as long as the number of items is limited. However, it becomes an (error-prone!) chore once the stimulus lists become large and we need information for many variables. In such cases it is nice to know that you can do the work automatically by making use of the Excel VLOOKUP function.
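For readers who prefer a script to a spreadsheet, the same kind of lookup can be sketched in Python with the standard `csv` module. This assumes the norms have been saved as a CSV file; the column names `Word` and `AoA`, the sample values, and the helper functions `load_norms` and `look_up` are assumptions made for this example, not the layout of the real files.

```python
import csv
import io

# Stand-in for a norms file exported to CSV; the words and values are invented.
NORMS_CSV = """Word,AoA
apple,4.2
theory,10.5
"""


def load_norms(handle, key="Word", value="AoA"):
    """Read a CSV file into a dictionary mapping lowercase word -> numeric value."""
    return {row[key].lower(): float(row[value]) for row in csv.DictReader(handle)}


def look_up(stimuli, norms):
    """Pair each stimulus with its value, or None when the word is not in the norms."""
    return [(word, norms.get(word.lower())) for word in stimuli]


norms = load_norms(io.StringIO(NORMS_CSV))
result = look_up(["Apple", "theory", "zebra"], norms)
print(result)
```

With a real file, `io.StringIO(NORMS_CSV)` would be replaced by `open(path, newline="")`. Missing words come back as `None`, much as a failed exact-match VLOOKUP leaves an `#N/A` in the cell.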

To help you, we have included a number of screenshots of how to do this in a pdf file.

After the publication of this post, Ian Simpson (University of Granada) contacted them with some interesting examples of Excel functions to be used with text databases. You find them here.

With all of this amazing information, I thought it would be interesting to put it to a practical end, such as giving the email you are about to send an ‘aggressiveness’ rating.
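Here is one way such a rating might be sketched in Python. Everything about it is an assumption: it treats a word as more "aggressive" when it is low in valence and high in arousal and dominance, which is my own heuristic and not an established measure. The `NORMS` values and the `aggressiveness` function are invented, and it has nothing to do with how Grammark works.

```python
import re

# Hypothetical (valence, arousal, dominance) ratings on a 1-9 scale.
NORMS = {
    "hate": (2.0, 7.0, 6.0),
    "thanks": (8.0, 4.0, 5.0),
}


def aggressiveness(text, norms=NORMS):
    """Return a 0-1 score averaged over rated words, or None if no word is rated.

    Assumed heuristic: low valence, high arousal and high dominance all
    push the score up; each dimension contributes up to 8 points out of 24.
    """
    words = re.findall(r"[a-z']+", text.lower())
    scores = [
        ((9 - v) + (a - 1) + (d - 1)) / 24
        for v, a, d in (norms[w] for w in words if w in norms)
    ]
    return sum(scores) / len(scores) if scores else None


print(aggressiveness("I hate this"))
```

A real version would need a full set of norms and some thought about negation ("I do not hate this"), which a simple word average ignores.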

