Genres
At present, we can only list the genres that are present somewhere in the treebank. There is no machine-readable way to say which sentences belong to which genre. We could do this the same way as the parallel ranges above, using ranges of sentence ids:
Genre: news = set-s1 .. set-s3694
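A minimal sketch of how such lines could be read by a script. The syntax is only the one proposed here, not something the validator currently accepts; the names GenreLine, parse_genre_line and genre_by_sentence are made up for illustration, and interpreting a range as a contiguous span in file order is an assumption.

```python
import re
from typing import NamedTuple, Optional


class GenreLine(NamedTuple):
    genre: str
    first_id: Optional[str]  # None when the line carries no range
    last_id: Optional[str]


# Matches "Genre: news = set-s1 .. set-s3694" and also a bare "Genre: news".
_GENRE_RE = re.compile(
    r"^Genre:\s*(?P<genre>[a-z]+)"
    r"(?:\s*=\s*(?P<first>\S+)\s*\.\.\s*(?P<last>\S+))?\s*$"
)


def parse_genre_line(line):
    """Return a GenreLine, or None if the line is not a genre line."""
    m = _GENRE_RE.match(line.strip())
    if m is None:
        return None
    return GenreLine(m.group("genre"), m.group("first"), m.group("last"))


def genre_by_sentence(sent_ids, genre_lines):
    """Map sentence ids to genres.

    sent_ids is the list of sentence ids in file order. A range is taken
    to cover every sentence from first_id to last_id inclusive (assumption).
    Lines without a range are skipped; unknown ids raise KeyError.
    """
    position = {sid: i for i, sid in enumerate(sent_ids)}
    result = {}
    for gl in genre_lines:
        if gl.first_id is None:
            continue
        start, end = position[gl.first_id], position[gl.last_id]
        for sid in sent_ids[start:end + 1]:
            result[sid] = gl.genre
    return result
```

Resolving ranges by position rather than by comparing the ids themselves avoids any assumption about how sentence ids are numbered.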
However, we probably need to revisit the taxonomy of genres (or text types). The current granularity is not optimal. Note that we use the term genre in a broad sense of text type; besides the genre proper, it sometimes includes the medium, topic etc.
Decision list
- Can we specify the genre at all? We are required to say what we know about the genre in the metadata in README, but we could say that the genre is unknown:
Genre: unknown
In particular, if the corpus consists of randomly shuffled isolated sentences that resulted from crawling the web, we can hardly know the genre of the document a sentence was taken from. Hence the genre is unknown. This value includes the web genre, which was among the permitted values until UD 2.14. However, the unknown genre is broader, as the data source does not have to be the web. On the other hand, there are texts that do come from the web, yet they can be clearly assigned to a more specific genre. Note that the unknown value should not be used if we know that the corpus contains multiple genres and we know which ones, but we do not know exactly where each of them occurs (although an annotator could tell by going over the corpus again). In such cases, multiple genre lines can be used without sentence id ranges.
A special type of isolated sentences is examples from a reference grammar, language textbook or other linguistic literature. These sentences are often made-up examples (or carefully selected from fieldwork material) to demonstrate a particular grammatical construction. Many UD treebanks contain such sentences. These should not be labeled as unknown genre; instead, they should use the grammar label (until UD 2.14 it was called grammar-examples):
Genre: grammar
- If we believe that the corpus or its part should have an identifiable genre, the next question to answer is whether the text / utterance was prepared or spontaneous. Written language is typically prepared (the author can think about it, revisit it and edit it), even if not always with the same level of care. Chats on social media are borderline, so they get their own label. Spoken language can also be prepared (e.g., news on TV) or it can be spontaneous. Political speeches (including parliament proceedings) are also borderline: some of them are prepared, some of them are spontaneous or half-prepared (the speaker has notes but not the exact text). More generally, a speech is a monologue intended for some audience (besides political speeches these could be laudatory speeches, speeches at funerals or on various other occasions). But the label does not cover all kinds of spoken data!
Genre: speech
An interview (e.g. a journalist interviewing a celebrity) is typically prepared on the side of the interviewer and partially spontaneous (with possible corrections before it is published) on the side of the interviewee. It gets its own label (but note that this label does not cover other types of dialogues):
Genre: interview
Short posts such as tweets get the label social. They may be prepared to some extent but often they are written quickly, with shortcuts and possibly errors. This genre includes chats and discussions under other posts or under larger articles (while the larger article itself does not belong to the social genre). User-generated reviews of products and services are not considered social (provided they are posted at places dedicated to reviews and not among general posts on Twitter, Facebook, Telegram, Reddit etc.).
Genre: social
For all other written data and for spoken language that is read or recited (i.e., there is probably a written original), consider it prepared and go to the next step.
For all other spoken data, consider it spontaneous and give it the label spontaneous. Hence, the former spoken genre, which was among the permitted values until UD 2.14, should now be split into speech, interview and spontaneous or, as the case may be, merged with the appropriate prepared genre. If it is not clear that a spoken utterance is prepared or qualifies as speech or interview, it should be spontaneous by default. Typical spontaneous data are transcriptions of unprepared dialogues such as TV debates, as well as recordings of conversations at home and in other natural settings.
Genre: spontaneous
- Drama contains mostly dialogues, sometimes monologues, and occasional other notes such as scene description. It may or may not be in verse. Note that movie subtitles would get the drama label, too.
Genre: drama
- If it is in verse but it is not a play (drama, see above), then it is a candidate for the poetry label, especially if it is lyric poetry (for epic poetry, there may be edge cases where one may want to consider merging it with prose narration, but poetry should still be the default). Song lyrics belong here, too. This category is probably also the best fit for prayers.
Genre: poetry
- Texts produced by second-language learners in a language class may contain specific errors and have their own category (formerly learner-essays, now just learner). Typically such texts are short essays, but they could fall into various other genres below (such as mail, narration or essay) if they were not produced by language learners. Note that the learner category should not absorb everything written by a non-native speaker; it is designed specifically for texts created in the language learning environment.
Genre: learner
- Letters including e-mails get the label mail. This includes the former email genre, which was among the permitted values until UD 2.14. In general, these are monologues addressed to a concrete person or group of persons, unless they were already identified as a speech, poetry etc. We do not distinguish private letters from official letters and business correspondence.
Genre: mail
- Novels, short stories and other works of fiction are labeled as narration. It is not decisive whether the content is fiction or reflects real events (and in some cases, such as the Bible, the question of factuality would be controversial). The narration genre also includes non-fiction narratives such as chronicles, biographies and travelogues. On the other hand, news is a separate category, not included in narration. The former bible genre, used until UD 2.14, is now included in narration.
Genre: narration
- Daily newspapers typically contain short articles describing recent events and are labeled news. Magazines are typically not included in this genre, as they contain longer reads which may be popular science, reviews, interviews and other material. However, it will be practical if the whole issue of a daily newspaper (perhaps without weekend supplements) can be categorized as news without measuring the length of individual articles. Besides political news it may contain business news, sports results, weather forecasts, TV programs, announcements, advertisements etc. Transcribed spoken news broadcast through radio, TV or internet also qualifies as news.
Genre: news
- Reviews are evaluative texts of any length and regardless of the qualification of the author (hence, the category covers user-generated product or service reviews as well as book or movie reviews written by experts). (In the former genres used until UD 2.14 the label was in plural – reviews.)
Genre: review
- Laws, international treaties, local regulations, contracts, terms and conditions of a service are all in a broad category called regulation. Note that while legal bills approved by a parliament are regulation, the proceedings of the parliament deliberation belong to the speech category. From the former genres that were valid until UD 2.14, the new regulation category covers both legal and government.
Genre: regulation
- Manuals, guidelines, documentation, patent applications and various other types of instructions (including recipes, travel guides or directions on how to get somewhere) are labeled as instruction. Furthermore, this category includes specialized descriptive texts such as a technical report on an experiment or a health report with a patient's diagnosis (until UD 2.14 probably labeled as medical). Textbooks may also belong here, unless they are seen as fitting better into other categories (for example academic or grammar examples).
Genre: instruction
- Data from question answering competitions are close to educational or encyclopedic domains but they have a distinct form and are kept in a separate category.
Genre: qa
- Articles from Wikipedia or any other encyclopedia, as well as individual popular science articles from magazines, are categorized as encyclopedia. This includes the former genre wiki that was used until UD 2.14. Dictionary entries would also be included in this category.
Genre: encyclopedia
- Scholarly articles from any field are categorized as academic. Unlike encyclopedia, they are intended for an expert audience rather than the general public. A category of the same name existed among the genres until UD 2.14; however, it is not clear whether academic papers about medicine were labeled as academic or medical. Now they should be academic.
Genre: academic
- A text that discusses a topic, possibly presenting the opinion of the author and/or other people, and does not belong to any of the above categories, is an essay. This may include some texts formerly (until UD 2.14) categorized as blog.
Genre: essay
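The renames mentioned throughout the decision list could be collected in a small lookup table for migrating existing README files. The following is only a sketch: the names NEW_GENRES, DIRECT_RENAMES, NEEDS_REVIEW and migrate are hypothetical, and the table contains only what the decision list states. Former labels that the list does not address (fiction, nonfiction, none) are deliberately left unmapped.

```python
# Labels proposed in the decision list above.
NEW_GENRES = {
    "unknown", "grammar", "speech", "interview", "social", "spontaneous",
    "drama", "poetry", "learner", "mail", "narration", "news", "review",
    "regulation", "instruction", "qa", "encyclopedia", "academic", "essay",
}

# Former labels (valid until UD 2.14) with exactly one stated successor.
DIRECT_RENAMES = {
    "grammar-examples": "grammar",
    "learner-essays": "learner",
    "email": "mail",
    "bible": "narration",
    "reviews": "review",
    "legal": "regulation",
    "government": "regulation",
    "wiki": "encyclopedia",
}

# Former labels whose successor depends on the actual data; the listed
# values are candidates only and a human has to decide.
NEEDS_REVIEW = {
    "web": ["unknown"],          # or a more specific genre if identifiable
    "spoken": ["speech", "interview", "spontaneous"],
    "medical": ["instruction", "academic"],
    "blog": ["essay"],           # only "some texts", per the list
}


def migrate(label):
    """Return (candidate_new_labels, needs_manual_review) for a former label."""
    if label in DIRECT_RENAMES:
        return [DIRECT_RENAMES[label]], False
    if label in NEEDS_REVIEW:
        return list(NEEDS_REVIEW[label]), True
    if label in NEW_GENRES:
        # Same name before and after (e.g. news, academic, social, poetry).
        return [label], False
    # Not covered by the decision list (e.g. fiction, nonfiction, none).
    return [], True
```

For instance, migrate("email") gives (["mail"], False), while migrate("spoken") returns three candidates and flags the treebank for manual review.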
TODO:
- I suppose there must be a taxonomy of text types somewhere that someone has created for corpora but I did not find anything that would look authoritative.
- Chris: Here’s at least one attempt at defining registers, genres, etc. from the Corpus Linguistics tradition. Could be a good thing to compare with:
https://scholarspace.manoa.hawaii.edu/server/api/core/bitstreams/e80fbfaf-736c-4f32-bd98-ed99ef330316/content
- Better URL: https://core.ac.uk/download/pdf/36987049.pdf
- List of genres known and allowed in UD v2 (see validator or documentation of treebank metadata).
- none, news, fiction, nonfiction, academic, medical, legal, government, blog, reviews, social, email, spoken, wiki, web, bible, grammar-examples, learner-essays, poetry
- See what Sara et al. did in their genre-balanced UD.
- See the Helsinki Corpus (Old, Middle, Early Modern English).
The categorization I sketched in the decision list above largely ignores the topic (domain) of the data, which can obviously also affect the linguistic properties. If we find it important, we could possibly add some coarse-grained topic classification such as the following:
- medicine, natural sciences, mathematics
- technology, engineering, industry, agriculture
- law, political science, economy, social sciences
- humanities, arts, culture, religion
To discuss:
Instead of the sentence id ranges in README, we could define sentence-level tags that would be added to every sentence belonging to a parallel dataset or to a genre. It would use more space but filtering of the data (when we are looking for a particular genre or parallel dataset) would be easier. The metadata in README would be more readable; we would still keep a general overview (list of genres, id of parallel dataset) there.
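To illustrate how much simpler the filtering would become, here is a sketch that assumes the tag is stored as a sentence-level comment of the form "# genre = news" in the CoNLL-U file. That comment name is purely hypothetical (nothing like it is defined yet), and so is the function name sentences_with_genre.

```python
import re


def sentences_with_genre(conllu_text, genre, key="genre"):
    """Yield the sentence blocks that carry the comment '# <key> = <genre>'.

    Sentences in CoNLL-U are separated by blank lines; comment lines
    start with '#'. The 'genre' comment itself is an assumed convention.
    """
    wanted = f"# {key} = {genre}"
    for block in re.split(r"\n\s*\n", conllu_text.strip()):
        if any(line.strip() == wanted for line in block.splitlines()):
            yield block
```

With the range-based metadata, the same selection would require reading the README first and then resolving each range against the order of sentences in the data file.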