Regilaulude teema-analüüs: võimalusi ja väljakutseid / Topic analysis of Estonian runosongs: Prospects and challenges
The article examines the possibilities for topic analysis of runosongs using the topic modelling method. A problem in applying the method is the regional variability of the runosong language. A preliminary analysis of the song texts showed that topic analysis yields more meaningful results for collections with more uniform language. In the topic analysis of the songs of Hiiumaa, Saaremaa and Muhu, selected for closer examination, 20 topics were identified, which give a quick overview of the thematic structure of the songs under study. The study showed that the identified topics are distributed relatively evenly across the region examined. However, the computational topic groups do not map one-to-one onto the earlier classification of runosongs: they disregard the genre differences between the songs and bring to the fore the song types that occur more frequently in the song collection under study.
The article explores the possibilities of computational topic analysis of Estonian runosong texts using latent Dirichlet allocation (LDA) topic modelling. Runosong is an oral poetic tradition known among most Finnic peoples. Estonian runosong texts, the material of the current research, have been collected mainly since the 1880s and are held in the Estonian Folklore Archives of the Estonian Literary Museum, where a runosong database with more than 100 000 texts has been compiled (Oras et al. 2003–2020). The language of runosongs varies considerably across dialects and, in addition, uses a specific archaic idiom that differs from the spoken language, which complicates computational analysis of the content of the texts.
Topic modelling is a method for discovering abstract topics, detected statistically from how frequently words co-occur in the texts. In the case of a runosong corpus, the method could be used to automatically detect the thematic structure of a large number of runosong texts, to compare the thematic distribution of regional runosong traditions, and to analyse how the thematic distribution obtained computationally relates to the classification of the texts resulting from folkloristic analysis. The aim of the current article is to explore whether topic modelling can give meaningful results when applied to unlemmatized and highly variable runosong texts.
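To make the idea of statistically detected topics concrete, the following is a minimal sketch of a toy collapsed Gibbs sampler, one common way of fitting an LDA model. It is not the implementation used in the study; the function names, the hyperparameter values and the English toy tokens are all assumptions made up for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topic counts per document, word counts per topic,
    # and the total number of tokens assigned to each topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # A topic is likely if this document already uses it
                # and if it already contains this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Most frequent words of each topic, ignoring words with zero count."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Invented toy documents (not runosong texts), already tokenised.
toy_docs = [
    ["bride", "groom", "wedding", "bride"],
    ["groom", "wedding", "gift", "bride"],
    ["horse", "forest", "ride", "horse"],
    ["ride", "forest", "horse", "war"],
]
doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2)
print(top_words(topic_word))
```

Because the sampler relies only on which word forms co-occur, dialectal spellings of the same word count as different words, which is why unlemmatized, regionally varied texts are a hard case for the method.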
For LDA topic modelling I used the application MALLET (McCallum 2002). The initial trials with the whole corpus of runosong texts made it clear that the language of the songs varies too much for the model to reach the level of content. It also became obvious that stopwords and refrain words need to be removed. The topics obtained from runosongs from all over Estonia represented dialectal variants of the language rather than thematic clusters, so it was necessary to restrict the material. I used stylometric analysis (with the R package stylo, Eder et al. 2013) to divide the area into linguistically more homogeneous subregions, and chose the western islands of Estonia, with 16 parishes and 3672 song texts, for further exploration.
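The removal of stopwords and refrain words can be sketched as a small preprocessing step. This is a minimal illustration, not the procedure used in the study: the word lists, the song identifier, the parish label and the song line are invented placeholders, and the output is assumed to follow the one-instance-per-line layout (name, label, text) that MALLET's import-file command reads by default.

```python
import re

# Hypothetical word lists; real lists would have to be compiled for the corpus.
STOPWORDS = {"ja", "ei", "see"}
REFRAIN_WORDS = {"kaske", "leelo"}


def clean_tokens(text):
    """Lowercase the text, split it into word tokens, drop listed words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and t not in REFRAIN_WORDS]


def to_mallet_line(name, label, tokens):
    """One instance per line: name, label and text, separated by tabs."""
    return f"{name}\t{label}\t{' '.join(tokens)}"


# Invented example line and identifiers, for illustration only.
example_text = "Lähme kiigele, kaske, ja laulame, kaske"
example_tokens = clean_tokens(example_text)
example_line = to_mallet_line("song_0001", "parish_A", example_tokens)
print(example_line)
```

Storing the parish as the label keeps the regional information attached to each text, so the topic proportions can later be compared across parishes.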
With this material I decided to generate 20 topics. Within this smaller area the topics no longer clustered regional language variants: (1) the linguistic variants of the main concepts of a topic were brought together under the keywords of the same topic; (2) in most cases, the detected topics were distributed among all the parishes included in the selection.
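One way to check how topics are distributed among parishes is to average the per-song topic proportions within each parish. The sketch below assumes that a list of topic proportions is available for every song and that each song carries a parish label; the function names, the threshold and the toy numbers are hypothetical and do not come from the study.

```python
from collections import defaultdict


def topic_share_by_parish(doc_topics, parishes):
    """Average the topic proportions of the songs recorded in each parish."""
    sums = {}
    counts = defaultdict(int)
    for weights, parish in zip(doc_topics, parishes):
        acc = sums.setdefault(parish, [0.0] * len(weights))
        for k, w in enumerate(weights):
            acc[k] += w
        counts[parish] += 1
    return {p: [s / counts[p] for s in acc] for p, acc in sums.items()}


def parishes_covered(shares, topic, threshold=0.05):
    """Parishes in which the topic's average share reaches the threshold."""
    return sorted(p for p, row in shares.items() if row[topic] >= threshold)


# Invented toy data: four songs, two topics, two hypothetical parishes.
toy_doc_topics = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.6, 0.4]]
toy_parishes = ["parish_A", "parish_A", "parish_B", "parish_B"]

shares = topic_share_by_parish(toy_doc_topics, toy_parishes)
for topic in range(2):
    print(topic, parishes_covered(shares, topic, threshold=0.3))
```

A topic that reaches the threshold in only a few neighbouring parishes is more likely to reflect a dialectal variant than a theme, whereas a topic present in all parishes matches the pattern described above.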
Looking at the 20 keywords, the topics indeed seemed to reflect certain thematic subgroups of the songs. In several cases the most prominent song type of a topic was reflected in the keywords; in other cases the keywords referred to larger groups of songs. Five of the 20 topics focused on weddings, more precisely on different episodes of the wedding ritual: adornment and dressing, arriving and greeting, finding the bride and taking her to her new home, sharing the presents prepared by the bride, and recommendations to the bride and the groom. In all these topics the verbs refer either to the present or the future (rather than to the past, which is common in narrative songs). A topic of swinging songs also includes songs about dancing and feasts. Five topics focus on different narrative plots about the troubles of young people, about wooing and marriage. Lyric songs about the life of orphans and about singing each form a separate topic, and there is a separate male topic covering songs of various genres related to horses, riding and the woods. The largest topic includes songs about working at home and outside, but also songs about premarital sex. There are two topics that focus on well-known children’s songs and lullabies. Two topics relate to German landlords, their power and activities, and one to recruiting and the war.
The conclusions of this exploration are as follows: (1) topic modelling requires texts in homogeneous language variants; otherwise, the linguistic differences override the topics at some point; (2) it is possible to use unlemmatized texts for topic modelling, but in this case grammatical features (tense, modality) interfere with the topic analysis; (3) the proportions of variable and stable (recurrent) elements (song types, motifs) in the material have a clear impact on topic formation: the more frequently an element occurs in the material and the more stable its wording, the higher its probability of forming the centre of a topic, whereas distinct but rare themes remain unnoticed and are split among the topics of more prominent subjects; (4) the sets of words assembled into a topic may, in addition to a common thematic focus, refer to a common framework, for example environments, or behavioural or communicative patterns (such as begging for something). Compared to the folkloristic classification of folk songs, the automatic distribution of songs (1) highlights the subjects that occur more frequently in the body of songs (for example, a topic highlights swinging songs instead of the calendar songs of the folkloristic classification); (2) partly overrides genre differences (for example, song games can be found under different topics, whereas they form a distinct group in folkloristic classifications).