(Sorry for this "meta-tex" format) \bf{Physics and Linguistics} \it{What's common?} } Tam\'as B{\'\i}r\'o (Roland E\"otv\"os University, Budapest mail: birot@ludens.elte.hu) Abstract: ----------- In recent decades interdisciplinary questions, like biophysics, environment sciences, or cognitive sciences have been getting more and more popular. The purpose of my lecture is to present the first approaching steps between physicists and linguists, to show what possibilities can be found to relate these seemingly very distant disciplinaries. I shall introduce a method known from statistical physics (random walks), which is able to discover non-trivial long-range correlations in texts. I also shall define a "distance" between two texts, leading to an algorithm for sorting texts by language or content. Proceeding: -------------- "Physics is an imperialist, expanding science" $-$ has said Prof. G. Marx to me $-$ "Physics is what physicists are dealing with". No one can say the limits of modern physics: its methods are penetrating into most sciences. The interaction of physics with mathematics, physics with chemistry is well known. When speaking about physics and biology, we should distinguish among biophysics and biological physics: the first deals with physical processes in living organisms, the second uses physical methods (mostly statistical physics) to analyse biological systems, such as evolution or molecular motors (see e. g. 1). In recent decades physics has been penetrating in social sciences, as well. While scientists mostly apply physics as the discipline of the physical processes, scholars of social sciences $-$ similarly to biological physicists $-$ rather make use of physical methods. In economics, for example, the impact of thermodynamics (energy, entropy, etc. in their physical meaning) on microeconomic processes has been examined (2), but most of researches deal with statistical physical methods used in economics or finances (e. g. 3, 4). What can physics contribute to linguistics? The first point, I do not wish to deal with, is \it{acoustics} and physical phonetics. Phonetics is the science of the sounds in human speech (to not confuse with phonology, dealing with the system of the sounds, their rules and regular changes in a given language). One of the two aspects of phonetics (i. e. the physiological and the physical ones) could have been interesting to a physicist of hundred years ago, dealing with acoustics: spectra of human speech, of different sounds produced by the same or different people, can be recorded, analysed and compared. But that is rather like biophysics: analysing physical processes in various "contexts". I wish to deal with the potential contribution of \it{modern physics} to \it{modern linguistics}. What do I mean by modern physics and by modern linguistics? Modern physics $-$ in the sense Prigogine and Stengers use it $-$ is not quantum theory or relativity theory, but the statistical physics of \it{complex systems}, of irreversibility "\sl{making possible spontaneous organising processes}" (5). The development of modern statistical physics has put at our disposal means strong enough to describe complex systems, such as economic or biological phenomena. Why not using them to describe linguistical phenomena, as well? On the other hand, I mean by modern linguistics the \it{generative linguistics} of \it{Noam Chomsky} and his (closer or not so close) followers. 
\it{Mathematical linguistics}, one of whose main developers was Chomsky himself, has made it possible to explore the features of natural languages with the help of exact and abstract mathematical methods (the theory of formal languages). (Before that, "quantitative linguistics" meant using statistical methods to process data obtained from text corpora or from a dictionary.) (6)

Putting these together: linguists have learnt from mathematicians how to deal with linguistic phenomena in an exact way, and statistical physicists have new, powerful methods. Why could we (i.e. physicists) not help mathematicians in providing linguists with modern methods? Only a few steps have been taken so far, but in the following I wish to present two attempts in that direction.

To understand the so-called \it{one-dimensional random walk model} (7-10), let us imagine a flea walking along a line. We have a text composed of an alphabet (e.g. the 26 letters of the English alphabet, space, comma, etc.), which we map onto a binary sequence of "1" and "-1" (e.g. by replacing all characters by a binary code, and then writing "-1" instead of every "0"). We read this sequence to the flea, who has to move up when hearing "1", and to move down when hearing "-1". Denoting by $u_i$ the $i$-th element of the binary code, the position $y(l)$ of the flea after $l$ steps is:

$$ y(l) = \sum_{i = 1}^l u_i $$

An important statistical quantity characterising the walk of the flea is the root mean square fluctuation $F(l)$ about the average of the displacement:

$$ F^2 (l) = < (\Delta y (l) - <\Delta y(l)>)^2 > = < \Delta y(l)^2 > - < \Delta y(l) >^2 $$

where

$$ \Delta y(l) = y (l_0+l) - y (l_0), $$

and the averages are taken over all positions $l_0$. It can easily be seen that $F(l)$ has a power-law behaviour:

$$ F (l) \sim l^\alpha $$

If we have a purely random sequence, $\alpha = 0.5$. In the case of short (local) correlations extending up to a characteristic length $R$ (e.g. a Markov chain), the asymptotic behaviour ($l \gg R$) remains unchanged: $\alpha = 0.5$. But in the case of long-range correlations (where no characteristic $R$ exists), i.e. when the probability of a "1" at a given place is affected by what can be found at a very long distance, the alpha exponent will differ from $0.5$.

This "experiment" has been carried out for various texts, such as the original version and different translations of the Bible, Shakespeare's dramas, novels, a dictionary, computer programs after compilation (.exe files), etc. Some of the more interesting results are:

1. Texts have a constant alpha exponent over several decades (orders of magnitude) in $l$, significantly different from $0.5$ (on average about $0.6 - 0.7$). Computer programs are even more correlated: they scale with an exponent above $0.9$. (8)
2. The exponent is not characteristic of the author. While the alpha of "Hamlet" is $0.56$, that of "Romeo and Juliet" is $0.6$ (8).
3. Translation seems to diminish correlations. Although the Bible has a very high alpha value ($\sim 0.75$), its translations are less correlated (9).
4. The examined dictionary has shown correlations much longer than the length of its entries! (8)
5. Cutting the text into pieces and reshuffling them randomly destroys the correlations: beyond the scale of the pieces' length, $\alpha = 0.5$.

The correct explanation for these long-range correlations has yet to be found.
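To make the measurement concrete, the following short Python sketch estimates the alpha exponent of a text along the lines described above. The coding rule mapping characters to "1"/"-1" (here simply the parity of the character code) and the window lengths used for the fit are my own illustrative choices, not those of refs. (7-10); any fixed binary coding of the alphabet can be substituted.

import math

def binary_walk(text):
    # Map each character to +1 or -1 (here: parity of its character code,
    # an arbitrary illustrative choice) and accumulate the walk y(l).
    steps = [1 if ord(c) % 2 else -1 for c in text]
    walk, y = [0], 0
    for u in steps:
        y += u
        walk.append(y)
    return walk

def fluctuation(walk, l):
    # Root mean square fluctuation F(l) of the displacement
    # Delta y(l) = y(l0 + l) - y(l0), averaged over all starting points l0.
    deltas = [walk[l0 + l] - walk[l0] for l0 in range(len(walk) - l)]
    mean = sum(deltas) / len(deltas)
    return math.sqrt(sum((d - mean) ** 2 for d in deltas) / len(deltas))

def alpha_exponent(text, lengths=(4, 8, 16, 32, 64, 128, 256, 512)):
    # Fit log F(l) against log l by least squares; the slope is alpha.
    walk = binary_walk(text)
    xs, ys = [], []
    for l in lengths:
        if l < len(walk):
            f = fluctuation(walk, l)
            if f > 0:
                xs.append(math.log(l))
                ys.append(math.log(f))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# An uncorrelated sequence should give alpha close to 0.5; a sufficiently
# long text with long-range correlations is expected to give a larger value.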
Called "gauging similarity with n-grams" by his inventor, \it{Marc Damashek}, this method consist of constructing a unity vector from a given text, and the similarity of two texts can be given by their dot product. (11) Let us move a "window" of length $n$ character along our document, symbol by symbol. We count the number $m_i$ of the occurrences of the n-gram (i. e. sequence of $n$ characters) indicated with $i$, for each $i$ (i. e. each possible n-grams). The document can be characterised by a vector $x$, whose components are: $$ x_i = {m_i \over \sum_j m_j}$$ If we have two texts with vectors $x$ and $y$, their "similarity" can be measured by their dot product: $$ S = \cos \theta = \sum_i x_i y_i $$ Table 1 shows my results with $n = 3$, while Table 2 shows the dot products of the vectors of the same documents, when $n = 4$. F1 - F5 are e-mail updates of the American Institute of Physics' Bulletin of Physics News, E1 - E3 are other e-mails in English, Fr1 and Fr2 are short French letters, while H1 - H3 are personal e-mails in Hungarian. Their lengths are between 3400 to 6000 characters, except of Fr1 and Fr2, whose length are about 1000 - 1200 characters. My alphabet consisted of 26 letters, space, dot and comma. Sequences of spaces should be deleted before. Texts of the same language and topic give noticeably higher dot product than documents of different languages. Product of an F- and an E- file (same language but different topics) is smaller than the one of two F- or of two E-files, but significantly higher than the product of two documents in different languages. (For example, in the case of $n = 3$, the n-gram "the" has far the highest $m_i$ value in English texts: think to the articles, to "these", "those", "there", "them", "they", etc.) The procedure can be improved by introducing centroid vectors. The latters being the average of vectors taken from a given set of document (e. g.: the set of the documents in a given language), it is characteristics to the common features of this set (e. g. the grammatical words in a language). If we subtract the centroid vector from the document vectors, we can refine our similarity measure. This method gives an effective technique called \it{Acquaintance} for sorting and clustering documents by language, topic and sub-topic. Table 1. (n = 3) 0.xxx (Please write 0. before all data !!) F1 F2 F3 F4 F5 E1 E2 E3 Fr1 Fr2 H1 H2 H3 F1 1.0 801 824 783 793 691 692 707 281 279 257 292 261 F2 1.0 805 787 796 696 717 704 263 259 254 281 249 F3 1.0 781 766 702 706 695 228 235 233 266 233 F4 1.0 727 650 674 667 241 231 237 265 237 F5 1.0 636 621 662 275 264 257 292 257 E1 1.0 816 786 272 264 221 275 232 E2 1.0 797 244 251 220 258 229 E3 1.0 273 280 218 266 222 Fr1 1.0 636 211 207 204 Fr2 1.0 218 225 233 H1 1.0 737 783 H2 1.0 804 H3 1.0 Table 2. (n = 4) 0.xxx (Please write 0. before all data !!) F1 F2 F3 F4 F5 E1 E2 E3 Fr1 Fr2 H1 H2 H3 F1 1.0 617 654 580 583 492 491 492 119 089 075 089 070 F2 1.0 644 597 615 499 516 504 097 078 072 079 058 F3 1.0 605 582 510 525 501 083 064 066 083 058 F4 1.0 525 448 476 461 085 059 065 075 063 F5 1.0 430 422 437 094 067 074 086 065 E1 1.0 680 651 116 078 068 103 068 E2 1.0 660 110 080 065 087 068 E3 1.0 116 085 075 103 073 Fr1 1.0 437 044 044 060 Fr2 1.0 038 044 059 H1 1.0 462 540 H2 1.0 601 H3 1.0 What conclusions can be made at the end? I believe, the task of the physics is to describe the nature with using mathematical methods. 
Before, "nature" meant lifeless nature, but in recent decades it has started to include other segments of the world around us: biology, our economical and social environment, etc. Why not also our language? We have two, very different sciences: physics and linguistics. They keep winking at each other. The first steps towards the meeting have been made from both sides. Hopefully, we can be present at their rendez-vous in the next decades, or even at their marriage, too. I am convinced of the fact that their relationship will be fruitful for both science, and their children may even have effect on other members of the big family "Science". (1) I. Der\'enyi, T. Vicsek: The kinesin walk: A dynamic model with elastically coupled heads, Proc. Natl. Acad. Sci. USA, Vol. 93, pp. 6775-6779, (1996) (2) K. Martin\'as, \'A. Csek\H o: Extropy - A New Tool for the Assessment of the Human impact on Environment, in: Complex Systems in Natural and Economic Sciences, Proceedings of the Workshop "Methods of Non-Equilibrium Processes and Thermodynamics in Economics and Environment Sciences", 19-22 September 1995, M\'atraf\"ured, Hungary. (3) M. H. R. Stanley et al.: Can Statistical Physics Contribute to the Science of Economics? Fractals, Vol. 4., No. 3 (1996), pp. 415-425. (4) J-Ph. Bouchaud, M. Potters: Th\Žeorie des Risques Financiers: Portefeuilles, options et risques majeurs, to be published. (5) I. Prigogine, I. Stengers: La nouvelle alliance. M\Žetamorphose de la science, Gallimard, Paris, 1986. (I have translated it from the Hungarian edition, Akad\'emiai Kiad\'o, Budapest, 1995, p. 10.) (6) N. Chomsky: Language and Mind, New York, etc., 1968. Harcourt, Brace & World, Inc. (I used the Hungarian edition, Osiris-Sz\'azadv\'eg, Budapest, 1995., p. 227.) (7) The idea has first been used for DNA-sequences: C.-K. Peng et al.: Long-range correlations in nucleotide sequences, Nature, Vol. 356, 12 March 1992, pp. 168-170. (8) A. Schenkel et al.: Long Range Correlation in Human Writings, Fractals, Vol. 1, No. 1 (1993), pp. 47-57. (9) M. Amit et al.: Language and Codification Dependence of Long-Range Correlations in Texts, Fractals, Vol. 2, No. 1 (1994), pp. 7-13. (10) G. Dietler, Y.-C. Zhang: Crossover from White Noise to Long Range Correlated Noise in DNA Sequences and Writings, Fractals, Vol. 2, No. 4 (1994), pp. 473-479. (11) M. Damashek: Gauging Similarity with n-Grams: Language-Independent Categorization of Text, Science, Vol. 267 (10 February 1995), pp. 843-848.