(Sorry for this "meta-tex" format) \bf{Physics and Linguistics} \it{What's common?} } Tam\'as B{\'\i}r\'o (Roland E\"otv\"os University, Budapest mail: birot@ludens.elte.hu) Abstract: ----------- In recent decades interdisciplinary questions, like biophysics, environment sciences, or cognitive sciences have been getting more and more popular. The purpose of my lecture is to present the first approaching steps between physicists and linguists, to show what possibilities can be found to relate these seemingly very distant disciplinaries. I shall introduce a method known from statistical physics (random walks), which is able to discover non-trivial long-range correlations in texts. I also shall define a "distance" between two texts, leading to an algorithm for sorting texts by language or content. Proceeding: -------------- "Physics is an imperialist, expanding science" $-$ has said Prof. G. Marx to me $-$ "Physics is what physicists are dealing with". No one can say the limits of modern physics: its methods are penetrating into most sciences. The interaction of physics with mathematics, physics with chemistry is well known. When speaking about physics and biology, we should distinguish among biophysics and biological physics: the first deals with physical processes in living organisms, the second uses physical methods (mostly statistical physics) to analyse biological systems, such as evolution or molecular motors (see e. g. 1). In recent decades physics has been penetrating in social sciences, as well. While scientists mostly apply physics as the discipline of the physical processes, scholars of social sciences $-$ similarly to biological physicists $-$ rather make use of physical methods. In economics, for example, the impact of thermodynamics (energy, entropy, etc. in their physical meaning) on microeconomic processes has been examined (2), but most of researches deal with statistical physical methods used in economics or finances (e. g. 3, 4). What can physics contribute to linguistics? The first point, I do not wish to deal with, is \it{acoustics} and physical phonetics. Phonetics is the science of the sounds in human speech (to not confuse with phonology, dealing with the system of the sounds, their rules and regular changes in a given language). One of the two aspects of phonetics (i. e. the physiological and the physical ones) could have been interesting to a physicist of hundred years ago, dealing with acoustics: spectra of human speech, of different sounds produced by the same or different people, can be recorded, analysed and compared. But that is rather like biophysics: analysing physical processes in various "contexts". I wish to deal with the potential contribution of \it{modern physics} to \it{modern linguistics}. What do I mean by modern physics and by modern linguistics? Modern physics $-$ in the sense Prigogine and Stengers use it $-$ is not quantum theory or relativity theory, but the statistical physics of \it{complex systems}, of irreversibility "\sl{making possible spontaneous organising processes}" (5). The development of modern statistical physics has put at our disposal means strong enough to describe complex systems, such as economic or biological phenomena. Why not using them to describe linguistical phenomena, as well? On the other hand, I mean by modern linguistics the \it{generative linguistics} of \it{Noam Chomsky} and his (closer or not so close) followers. 
\it{Mathematical linguistics}, one of whose main developers was Chomsky himself, has made it possible to explore the features of natural languages with the help of exact and abstract mathematical methods (the theory of formal languages). (Before that, "quantitative linguistics" meant using statistical methods to process data obtained from text corpora or from a dictionary.) (6)

Putting these together: linguists have learnt from mathematicians how to deal with linguistic phenomena in an exact way, and statistical physicists have new, powerful methods. Why could we (i.e. physicists) not help mathematicians in providing linguists with modern methods? Only a few steps have been taken so far, but in the following I wish to present two attempts in that direction.

To understand the so-called \it{one-dimensional random walk model} (7-10), let us imagine a flea walking along a line. We have a text composed of an alphabet (e.g. the 26 letters of the English alphabet, space, comma, etc.), which we map onto a binary sequence of "1" and "-1" (e.g. by replacing all characters by a binary code, and then writing "-1" instead of every "0"). We read this sequence to the flea, who has to move up when hearing "1", and to move down when hearing "-1". Denoting by $u_i$ the $i$-th element of the binary code, the position $y(l)$ of the flea after $l$ steps is:

$$ y(l) = \sum_{i = 1}^l u_i $$

An important statistical quantity characterising the walk of the flea is the root mean square fluctuation $F(l)$ about the average of the displacement:

$$ F^2 (l) = < (\Delta y (l) - <\Delta y(l)>)^2 > = < \Delta y(l)^2 > - < \Delta y(l) >^2 $$

where

$$ \Delta y(l) = y (l_0+l) - y (l_0), $$

and the averages are taken over all positions $l_0$. It can easily be seen that $F(l)$ has a power-law behaviour:

$$ F (l) \sim l^\alpha $$

If we have a purely random sequence, $\alpha = 0.5$. In the case of short (local) correlations extending up to a characteristic length $R$ (e.g. a Markov chain), the asymptotic behaviour ($l \gg R$) remains unchanged: $\alpha = 0.5$. But in the case of long-range correlations (where no characteristic $R$ exists), i.e. when the probability of a "1" at a given place is affected by what can be found at a very long distance, the alpha exponent will differ from $0.5$.

This "experiment" has been carried out for various texts, such as the original version and different translations of the Bible, Shakespeare's dramas, novels, a dictionary, computer programs after compilation (.exe files), etc. Some of the more interesting results are:

1. Texts have a constant alpha exponent over several decades (orders of magnitude) in $l$, significantly different from $0.5$ (on average about $0.6 - 0.7$). Computer programs are even more correlated: they scale with an exponent above $0.9$. (8)
2. The exponent is not characteristic of the author. While the alpha of "Hamlet" is $0.56$, that of "Romeo and Juliet" is $0.6$ (8).
3. Translation seems to diminish correlations. Although the Bible has a very high alpha value ($\sim 0.75$), its translations are less correlated (9).
4. The examined dictionary has shown correlations much longer than the length of its entries! (8)
5. Cutting the text into pieces and reshuffling them randomly destroys the correlations: beyond the scale of the pieces' length, $\alpha = 0.5$.

The correct explanation for these long-range correlations has yet to be found.
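To make the measurement concrete, the following short Python sketch estimates the alpha exponent of a text along the lines described above. The coding rule mapping characters to "1"/"-1" (here simply the parity of the character code) and the window lengths used for the fit are my own illustrative choices, not those of refs. (7-10); any fixed binary coding of the alphabet can be substituted.

import math

def binary_walk(text):
    # Map each character to +1 or -1 (here: parity of its character code,
    # an arbitrary illustrative choice) and accumulate the walk y(l).
    steps = [1 if ord(c) % 2 else -1 for c in text]
    walk, y = [0], 0
    for u in steps:
        y += u
        walk.append(y)
    return walk

def fluctuation(walk, l):
    # Root mean square fluctuation F(l) of the displacement
    # Delta y(l) = y(l0 + l) - y(l0), averaged over all starting points l0.
    deltas = [walk[l0 + l] - walk[l0] for l0 in range(len(walk) - l)]
    mean = sum(deltas) / len(deltas)
    return math.sqrt(sum((d - mean) ** 2 for d in deltas) / len(deltas))

def alpha_exponent(text, lengths=(4, 8, 16, 32, 64, 128, 256, 512)):
    # Fit log F(l) against log l by least squares; the slope is alpha.
    walk = binary_walk(text)
    xs, ys = [], []
    for l in lengths:
        if l < len(walk):
            f = fluctuation(walk, l)
            if f > 0:
                xs.append(math.log(l))
                ys.append(math.log(f))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# An uncorrelated sequence should give alpha close to 0.5; a sufficiently
# long text with long-range correlations is expected to give a larger value.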
Called "gauging similarity with n-grams" by his inventor, \it{Marc Damashek}, this method consist of constructing a unity vector from a given text, and the similarity of two texts can be given by their dot product. (11) Let us move a "window" of length $n$ character along our document, symbol by symbol. We count the number $m_i$ of the occurrences of the n-gram (i. e. sequence of $n$ characters) indicated with $i$, for each $i$ (i. e. each possible n-grams). The document can be characterised by a vector $x$, whose components are: $$ x_i = {m_i \over \sum_j m_j}$$ If we have two texts with vectors $x$ and $y$, their "similarity" can be measured by their dot product: $$ S = \cos \theta = \sum_i x_i y_i $$ Table 1 shows my results with $n = 3$, while Table 2 shows the dot products of the vectors of the same documents, when $n = 4$. F1 - F5 are e-mail updates of the American Institute of Physics' Bulletin of Physics News, E1 - E3 are other e-mails in English, Fr1 and Fr2 are short French letters, while H1 - H3 are personal e-mails in Hungarian. Their lengths are between 3400 to 6000 characters, except of Fr1 and Fr2, whose length are about 1000 - 1200 characters. My alphabet consisted of 26 letters, space, dot and comma. Sequences of spaces should be deleted before. Texts of the same language and topic give noticeably higher dot product than documents of different languages. Product of an F- and an E- file (same language but different topics) is smaller than the one of two F- or of two E-files, but significantly higher than the product of two documents in different languages. (For example, in the case of $n = 3$, the n-gram "the" has far the highest $m_i$ value in English texts: think to the articles, to "these", "those", "there", "them", "they", etc.) The procedure can be improved by introducing centroid vectors. The latters being the average of vectors taken from a given set of document (e. g.: the set of the documents in a given language), it is characteristics to the common features of this set (e. g. the grammatical words in a language). If we subtract the centroid vector from the document vectors, we can refine our similarity measure. This method gives an effective technique called \it{Acquaintance} for sorting and clustering documents by language, topic and sub-topic. Table 1. (n = 3) 0.xxx (Please write 0. before all data !!) F1 F2 F3 F4 F5 E1 E2 E3 Fr1 Fr2 H1 H2 H3 F1 1.0 801 824 783 793 691 692 707 281 279 257 292 261 F2 1.0 805 787 796 696 717 704 263 259 254 281 249 F3 1.0 781 766 702 706 695 228 235 233 266 233 F4 1.0 727 650 674 667 241 231 237 265 237 F5 1.0 636 621 662 275 264 257 292 257 E1 1.0 816 786 272 264 221 275 232 E2 1.0 797 244 251 220 258 229 E3 1.0 273 280 218 266 222 Fr1 1.0 636 211 207 204 Fr2 1.0 218 225 233 H1 1.0 737 783 H2 1.0 804 H3 1.0 Table 2. (n = 4) 0.xxx (Please write 0. before all data !!) F1 F2 F3 F4 F5 E1 E2 E3 Fr1 Fr2 H1 H2 H3 F1 1.0 617 654 580 583 492 491 492 119 089 075 089 070 F2 1.0 644 597 615 499 516 504 097 078 072 079 058 F3 1.0 605 582 510 525 501 083 064 066 083 058 F4 1.0 525 448 476 461 085 059 065 075 063 F5 1.0 430 422 437 094 067 074 086 065 E1 1.0 680 651 116 078 068 103 068 E2 1.0 660 110 080 065 087 068 E3 1.0 116 085 075 103 073 Fr1 1.0 437 044 044 060 Fr2 1.0 038 044 059 H1 1.0 462 540 H2 1.0 601 H3 1.0 What conclusions can be made at the end? I believe, the task of the physics is to describe the nature with using mathematical methods. 
Before, "nature" meant lifeless nature, but in recent decades it has started to include other segments of the world around us: biology, our economical and social environment, etc. Why not also our language? We have two, very different sciences: physics and linguistics. They keep winking at each other. The first steps towards the meeting have been made from both sides. Hopefully, we can be present at their rendez-vous in the next decades, or even at their marriage, too. I am convinced of the fact that their relationship will be fruitful for both science, and their children may even have effect on other members of the big family "Science". (1) I. Der\'enyi, T. Vicsek: The kinesin walk: A dynamic model with elastically coupled heads, Proc. Natl. Acad. Sci. USA, Vol. 93, pp. 6775-6779, (1996) (2) K. Martin\'as, \'A. Csek\H o: Extropy - A New Tool for the Assessment of the Human impact on Environment, in: Complex Systems in Natural and Economic Sciences, Proceedings of the Workshop "Methods of Non-Equilibrium Processes and Thermodynamics in Economics and Environment Sciences", 19-22 September 1995, M\'atraf\"ured, Hungary. (3) M. H. R. Stanley et al.: Can Statistical Physics Contribute to the Science of Economics? Fractals, Vol. 4., No. 3 (1996), pp. 415-425. (4) J-Ph. Bouchaud, M. Potters: Th\Žeorie des Risques Financiers: Portefeuilles, options et risques majeurs, to be published. (5) I. Prigogine, I. Stengers: La nouvelle alliance. M\Žetamorphose de la science, Gallimard, Paris, 1986. (I have translated it from the Hungarian edition, Akad\'emiai Kiad\'o, Budapest, 1995, p. 10.) (6) N. Chomsky: Language and Mind, New York, etc., 1968. Harcourt, Brace & World, Inc. (I used the Hungarian edition, Osiris-Sz\'azadv\'eg, Budapest, 1995., p. 227.) (7) The idea has first been used for DNA-sequences: C.-K. Peng et al.: Long-range correlations in nucleotide sequences, Nature, Vol. 356, 12 March 1992, pp. 168-170. (8) A. Schenkel et al.: Long Range Correlation in Human Writings, Fractals, Vol. 1, No. 1 (1993), pp. 47-57. (9) M. Amit et al.: Language and Codification Dependence of Long-Range Correlations in Texts, Fractals, Vol. 2, No. 1 (1994), pp. 7-13. (10) G. Dietler, Y.-C. Zhang: Crossover from White Noise to Long Range Correlated Noise in DNA Sequences and Writings, Fractals, Vol. 2, No. 4 (1994), pp. 473-479. (11) M. Damashek: Gauging Similarity with n-Grams: Language-Independent Categorization of Text, Science, Vol. 267 (10 February 1995), pp. 843-848.