Corpus Annotation: Sentence and Discourse

March 16–20, 2009



Prof. Eva Hajičová (Charles University in Prague, Czech Republic)


Institute of Mathematics and Computer Science (University of Latvia)
Raiņa blvd. 29, Riga, Latvia


The purpose of the course is to present, using the annotation scenario of the Prague Dependency Treebank as an example, an integrated approach to corpus annotation on the layer of underlying syntax, including the information structure of the sentence, and to illustrate the possibilities that such an integrated approach to sentence annotation offers for the annotation of discourse.

All the lectures will address both the theoretical aspects of the given topics and the ways of reflecting them in corpus annotation, so that the annotation is useful for the further development of the theories and for practical applications.








Underlying layer of (dependency) syntactic relations and their representation in sentence annotation.
A brief introduction to dependency syntax in comparison to the immediate constituent approach (phrase structure); advantages of dependency structures with the verb as the root of the dependency tree and its complementations carrying the types of dependency relations. The requirements such an approach imposes on the information carried by the lexical entries.

 Mo, Mar 16
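
As an informal illustration of the dependency representation described for the first lecture, the following minimal Python sketch builds a tree with the verb as the root and each dependent carrying the type of its dependency relation. The class `Node`, the helper `edges`, the example sentence ("Mary gave John a new book") and the choice of relation labels are assumptions made for this sketch only; it is not the data format used by the PDT.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    """One node of a dependency tree (hypothetical structure for illustration)."""
    lemma: str
    relation: str  # type of dependency relation to the governing node
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a dependent and return it, so deeper levels can be built."""
        self.children.append(child)
        return child


def edges(node: Node) -> Iterator[Tuple[str, str, str]]:
    """Yield (head, relation, dependent) triples in depth-first order."""
    for child in node.children:
        yield (node.lemma, child.relation, child.lemma)
        yield from edges(child)


# "Mary gave John a new book": the verb is the root of the tree.
root = Node("give", "PRED")
root.add(Node("Mary", "ACT"))    # actor
root.add(Node("John", "ADDR"))   # addressee
book = root.add(Node("book", "PAT"))  # patient
book.add(Node("new", "RSTR"))    # attribute of "book"

for head, relation, dependent in edges(root):
    print(f"{head} --{relation}--> {dependent}")
```

Unlike a phrase-structure analysis, the sketch has no non-terminal nodes: every node corresponds to a word, and the relation label sits on the dependent.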




Information structure of the sentence and its annotation in the Prague Dependency Treebank (PDT).
Information structure of the sentence (its topic-focus articulation) as a semantically relevant linguistic phenomenon. Contextual boundness as the primary distinction from which the articulation of the sentence into its topic (what the sentence is about) and its focus (what the sentence "says" about its topic) can be derived. The annotation of contextual boundness in the PDT, the implementation of the algorithm for determining topic and focus, and the first results of research based on this annotation.

 Tu, Mar 17
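
The idea that topic and focus can be derived from contextual boundness can be illustrated with a deliberately simplified Python sketch. It is not the algorithm implemented for the PDT, which works on the dependency tree and involves further rules; here the sentence is assumed to be a flat list of items, each already marked as contextually bound or non-bound. The function name `split_topic_focus` and the markers `"t"` (bound), `"c"` (contrastive bound) and `"f"` (non-bound) are illustrative assumptions.

```python
from typing import List, Tuple

BOUND = {"t", "c"}   # assumed markers: contextually bound (plain / contrastive)
NON_BOUND = {"f"}    # assumed marker: contextually non-bound


def split_topic_focus(items: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Simplified rule: bound items form the topic, non-bound items the focus."""
    topic: List[str] = []
    focus: List[str] = []
    for word, mark in items:
        if mark in BOUND:
            topic.append(word)
        elif mark in NON_BOUND:
            focus.append(word)
        else:
            raise ValueError(f"unknown boundness marker: {mark!r}")
    return topic, focus


# Answer to "What did Mary give John?": only "a book" is new information.
sentence = [("Mary", "t"), ("gave", "t"), ("John", "t"), ("a book", "f")]
topic, focus = split_topic_focus(sentence)
print("topic:", topic)
print("focus:", focus)
```

The same word string in a different context (for example, as an answer to "Who gave John a book?") would receive different markers and therefore a different topic-focus split.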




Coreference in the sentence and in the text; coreferential links in the PDT.
The distinction between grammatical and textual coreference and its reflection in corpus annotation. Types of coreferential relations; problematic issues in capturing these relations in corpus annotation.

 We, Mar 18




From the structure of the sentence to discourse patterning.
Topic-focus articulation of the sentence and its implications for the structuring of discourse. The notion of the stock of shared knowledge (i.e. knowledge assumed by the speaker to be shared by him and the addressee) and the hierarchy of activation of the items of this stock. Illustration of discourse patterning based on this approach.
Annotation of discourse relations in different annotation schemes; a proposal of discourse relations annotation within the PDT.

 Th, Mar 19




Hands-on session.

 Fr, Mar 20





In the hands-on session, the students will undertake a corpus-based analysis of a piece of English text in terms of underlying syntactic relations, information structure, coreference and basic discourse relations. The task also includes translating the given text into their mother tongue; in their commentary, they should present a short evaluation of the two versions.


Prerequisites: general knowledge of basic concepts of syntax.

Reading list

  • Charles Fillmore: The Case for Case. In: E. Bach and R. Harms (eds.), Universals in Linguistic Theory, New York, 1968, pp. 1-88.
  • Jan Hajič et al.: A Three-Level Annotation Scenario. In: A. Abeillé (ed.), Treebanks: Building and Using Parsed Corpora, Kluwer, Amsterdam, 2000, pp. 103-127.
  • Eva Hajičová: Issues of Sentence Structure and Discourse Patterns, Charles University, Prague, 1993.
  • Eva Hajičová and Petr Sgall: Dependency Syntax in Functional Generative Description. In: V. Ágel et al. (eds.), Dependenz und Valenz: Ein internationales Handbuch der zeitgenössischen Forschung, Halbband I, Walter de Gruyter, Berlin, New York, 2003, pp. 570-592.
  • Martha Palmer, Daniel Gildea and Paul Kingsbury: The Proposition Bank: An Annotated Corpus of Semantic Roles. In: Computational Linguistics 31(1), 2005, pp. 71-106.
  • Rashmi Prasad et al.: The Penn Discourse Treebank 2.0. In: Proceedings of the LREC Conference 2008, Marrakesh.
  • Petr Sgall, Eva Hajičová and Jarmila Panevová: The Meaning of the Sentence in its Semantic and Pragmatic Aspects. Chapters 2 and 3, Reidel, Dordrecht, 1986, pp. 100-265.
  • Papers on sentence and discourse annotation in the recent Proceedings of LREC and the ACL special workshops on corpus annotation – selection according to the participants' personal choice.

ECTS credits


Course application form

Local contact persons

Gunta Nešpore:

Normunds Grūzītis: