Output details
11 - Computer Science and Informatics
University of Edinburgh
Genre distinctions for discourse in the Penn TreeBank
<22> Originality: First-ever demonstration that the widely used 1-million-word Penn TreeBank corpus is not simply a collection of news reports, but actually comprises documents from a range of genres, from financial reports to film reviews to errata and verse, with each type showing very different lexical, syntactic and organizational properties.
Significance: Subsequent work that has used the Penn TreeBank corpus for training parsers, doing domain adaptation, assessing discourse coherence, etc., now acknowledges this fact and conditions its claims on the particular genre involved.
Rigour: Uses the standard methodology of corpus linguistics and tools from computational linguistics.