Lab 1: statistics for EMCL students

Lab assignment 1 — comments

Some general comments regarding your solutions of assignment 1

Only a few of you defined exactly what data you were collecting: not the number of occurrences of a certain pronoun (or of a certain numeral), but the number of web pages that contained that pronoun (or that numeral). Nonetheless, we used this value to approximate the frequency of the pronouns (or of the numerals) themselves. This is certainly very problematic, as a single page might contain some pronouns (or, e.g., the numeral "one") many times, so I don't encourage you to write a scientific paper based on your results. Still, I hope you learned a lot from it.

I was happy that all of you gave nice explanations concerning the linguistic behaviour of certain words in your language (and also in comparison with other languages). You nicely pointed out cases where a word has several meanings or functions. This is important for your future career within any subfield of linguistics.

Some people also reported their results for personal pronouns using a pie chart, which makes sense. The entire pie represents all occurrences of all personal pronouns. It would not make sense to draw a pie chart for numerals, though, because you have not counted all numerals ("forty" was an arbitrary value at which to stop). Similarly, percentages are meaningful for pronouns (what percentage of all pronouns is first person singular?), but not for numerals.

Only a few people made use of Word's functionality to add a caption to figures and graphs. In this way you can add not only figure numbers (such as "Figure 1.1"), but also a title and some explanation to your illustrations, which is very useful and even required in scientific publications.

The count of numerals resulted in a marked peak at "twenty", "thirty" and "forty" for most of you. Some of you tried to offer cultural explanations for that observation. To support such a hypothesis, you would have to show that the peak in your language is higher (compared to the neighbouring values) than in other languages. I have two simpler explanations (and simpler explanations are always to be preferred in science): first, round numbers are used more often in real life than other numbers, and second, non-round numbers are more often written with digits than with letters.

Quite a lot of people failed to observe the overall tendency that the frequencies of the numerals diminish as the numbers grow (excluding the local peaks at round numbers). One of the goals of creating a histogram in which values are grouped by four was exactly to observe this general tendency despite local fluctuations.
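To see why grouping by four helps, here is a minimal Python sketch. The counts are synthetic, invented purely for illustration (a falling trend with an artificial bump at each round number); they are not anyone's real results, and the function name group_counts is my own.

```python
def group_counts(counts, size=4):
    """Sum consecutive runs of `size` counts into one bin each."""
    return [sum(counts[i:i + size]) for i in range(0, len(counts), size)]

# Synthetic counts for the numerals 1..40 (NOT real data):
# a decreasing trend, inflated by half at 10, 20, 30 and 40.
synthetic = []
for n in range(1, 41):
    count = 1200 // n
    if n % 10 == 0:
        count = count * 3 // 2
    synthetic.append(count)

bins = group_counts(synthetic, size=4)
for index, total in enumerate(bins):
    low = index * 4 + 1
    print(f"{low:2d}-{low + 3:2d}: {total}")
```

In the ungrouped list the round numbers stick out above their neighbours; in the grouped list each bump is averaged with three ordinary values, so the downward trend is easier to see.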

When you compared different languages, you most often observed that the frequencies were very different: obviously, there are many more pages in English than in German, and more in German than in a native language such as Kannada. This should not have surprised you.

Concerning the difference between the values ("I", "you", etc. in the first case; "one", "two", etc. in the second case) on the one hand, and the frequencies of these values, on the other, please make sure you understand what I explained to you last Thursday.

Let me summarize once again. In the first experiment you were collecting personal pronouns. So the variable to be measured was "personal pronoun", and the variable had possible values such as "I", "you", "he", etc. (or their equivalents in your language). Each page that Google located containing the query counted as one case. If you had 100 hits for "I", 50 hits for "you" and 40 hits for "he", then in total you had 190 cases. Within these 190 cases, the value "I" was observed 100 times, the value "you" 50 times and the value "he" 40 times.
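The bookkeeping in this example can be written out as a short Python sketch. The hit counts are the ones from the paragraph above; the variable names are my own choice for illustration.

```python
# Values of the variable "personal pronoun" and their absolute frequencies.
hits = {"I": 100, "you": 50, "he": 40}

# Every hit is one case, so the number of cases is the sum of the frequencies.
total_cases = sum(hits.values())

# Relative frequency: the share of all cases that show a given value.
relative = {value: count / total_cases for value, count in hits.items()}

for value, count in hits.items():
    print(f"{value}: {count} cases, {relative[value]:.1%} of all cases")
print("total cases:", total_cases)
```

Note that the keys of the dictionary are the values of the variable, while the numbers stored under them are the frequencies of those values.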

In both experiments, it made sense to find the mode: the value that was most frequent. In some languages, there were two values that were similarly frequent, so you could speak of a bimodal distribution.
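Finding the mode needs nothing more than the frequency table. The sketch below is one possible way to do it in Python. The tolerance parameter is my own assumption for what "similarly frequent" might mean, and the second table is hypothetical, made up only to show a bimodal case.

```python
def modes(freq, tolerance=0.0):
    """Return the values whose frequency is within `tolerance`
    (as a fraction) of the highest frequency."""
    top = max(freq.values())
    return [value for value, count in freq.items()
            if count >= top * (1 - tolerance)]

# Frequencies from the example in the text: a single mode.
print(modes({"I": 100, "you": 50, "he": 40}))

# Hypothetical frequencies where two values are almost equally frequent.
print(modes({"I": 100, "you": 97, "he": 40}, tolerance=0.05))
```

The first call returns only "I"; the second returns both "I" and "you", which is the situation described above as bimodal.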

Speaking of the median did not make sense for pronouns, and it made only questionable sense for numerals. In general, to speak of a median, you need to be able to sort the values. It does not make sense to say that "I" is larger than "you", for this is a typical case of a nominal variable (such as "male/female" or nationality; called "string" in SPSS Variable View). In the case of numerals, the values are also nominal in some sense (each is a string of phonemes), but you can argue that it is possible to rank them. So, if you rank all numerals you encounter on the Internet in a specific language according to their numerical values, you can argue, for instance, that half of the time you encounter "one" or "two", and the other half of the time you find "three", "four", "five", etc.
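If you accept that numerals can be ranked, the median can be read off the cumulative frequencies. Here is a minimal Python sketch. The counts are hypothetical, chosen only so that the result matches the "one or two" illustration above, and the function returns the lower median when the number of cases is even (a convention I am assuming here).

```python
def ordinal_median(ranked_counts):
    """ranked_counts: list of (value, count) pairs, already sorted by rank.
    Returns the first value at which the running total reaches half of
    all cases (the lower median when the number of cases is even)."""
    total = sum(count for _, count in ranked_counts)
    running = 0
    for value, count in ranked_counts:
        running += count
        if running >= total / 2:
            return value
    raise ValueError("no observations")

# Hypothetical counts (NOT real data), sorted by numerical value.
ranked = [("one", 60), ("two", 45), ("three", 40), ("four", 30), ("five", 25)]
median_value = ordinal_median(ranked)
print(median_value)
```

The function only works because the list is sorted. For pronouns there is no meaningful order to sort by, which is exactly why the median is undefined there.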

Speaking of the mean was totally meaningless. What is the average of a hundred occurrences of "I", fifty occurrences of "you" and forty occurrences of "he"? What many of you calculated was the average of the frequencies of the values, and not the average of the values themselves. That is, how often does a pronoun occur on average? Without any data collection I can tell you that the average relative frequency of English pronouns is 12.5%: there are eight pronouns, the sum of their relative frequencies is 100%, so their average relative frequency is 1/8. If the average absolute frequency happens to be, say, 23,456,789, then eight times this number is simply how many personal pronouns you have on the Internet in total. This figure primarily measures the number of pages in your language, because if there are more web pages, then there will also be more personal pronouns. (Dividing the number of pronouns by the number of sentences would be an interesting measure of how strongly pro-drop a language is.)
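You can check this argument yourself with a few lines of Python. The sketch reuses the three hit counts from the earlier example, so the average relative frequency comes out as 1/3, just as it comes out as 1/8 for eight pronouns. The variable names are mine.

```python
from statistics import mean

hits = {"I": 100, "you": 50, "he": 40}
counts = list(hits.values())
total = sum(counts)

# Average of the relative frequencies: always 1 / (number of values),
# whatever the data are, because the relative frequencies sum to 1.
mean_relative = mean(count / total for count in counts)

# Average of the absolute frequencies: the total divided by the number
# of values, so it carries no information beyond the total itself.
mean_absolute = mean(counts)

print(mean_relative, 1 / len(counts))
print(mean_absolute * len(counts), total)
```

Change the numbers in hits as you like: the first line of output will still show 1/3 twice (up to rounding), and the second will still show the total twice.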

I hope this clarification is satisfactory. If it is not, you are always welcome to ask questions. Don't be shy about asking questions! As an old maxim has it: the shy person does not learn.