ADJ is currently precisely the union of PTB JJ, JJR, and JJS.
ADP covers the PTB tag RP, the subset of uses of IN that are not complementizers or subordinating conjunctions, and prepositional uses of TO (in old treebanks, which tagged every to as TO even when it was used as a preposition).
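The rule above can be sketched in Python as follows. This is only an illustration: the function name `map_adposition_tag` and the two boolean flags are hypothetical, and in practice the decision they encode would have to come from the treebank's syntactic annotation, since the PTB tag alone does not carry it.

```python
def map_adposition_tag(ptb_tag, introduces_clause=False, is_infinitive_marker=False):
    """Map PTB RP, IN, or TO to a UD tag (illustrative sketch only).

    introduces_clause: hypothetical flag, True when an IN word is a
        complementizer or subordinating conjunction.
    is_infinitive_marker: hypothetical flag, True when a TO word marks
        an infinitive rather than acting as a preposition.
    """
    if ptb_tag == "RP":
        return "ADP"
    if ptb_tag == "IN":
        # Complementizers and subordinating conjunctions become SCONJ.
        return "SCONJ" if introduces_clause else "ADP"
    if ptb_tag == "TO":
        # Only prepositional uses of "to" are ADP; the infinitive marker is PART.
        return "PART" if is_infinitive_marker else "ADP"
    raise ValueError(f"not an adposition-related PTB tag: {ptb_tag}")
```

For example, `map_adposition_tag("IN", introduces_clause=True)` yields `SCONJ`, while a plain prepositional IN yields `ADP`.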
ADV covers all uses of PTB tags RB, RBR, RBS, and WRB except the clausal negation not and reduced forms of it, which become PART.
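A minimal Python sketch of this exception might look like the following. The names `NEGATION_FORMS` and `map_adverb_tag` are hypothetical, and the set of negation spellings is an assumption based on the forms listed under PART on this page (with an ASCII-apostrophe variant added).

```python
# Assumed spellings of clausal negation; "n't" (ASCII apostrophe) is added
# alongside the typographic form "n’t".
NEGATION_FORMS = {"not", "n't", "n’t", "nt"}
ADVERB_TAGS = {"RB", "RBR", "RBS", "WRB"}


def map_adverb_tag(word, ptb_tag):
    """Map a PTB adverb tag to UD ADV, or PART for clausal negation."""
    if ptb_tag not in ADVERB_TAGS:
        raise ValueError(f"not a PTB adverb tag: {ptb_tag}")
    # Negation is tagged RB in the PTB, so only RB needs the check.
    if ptb_tag == "RB" and word.lower() in NEGATION_FORMS:
        return "PART"
    return "ADV"
```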
AUX: auxiliary verb
AUX covers PTB MD and uses of the verbal tags (VB, VBP, VBG, VBN, VBD, VBZ) for forms of be, have, do, and get when these are used as auxiliaries (we count passive get as an auxiliary).
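As a hedged illustration, the AUX decision can be written as a small Python function. The name `map_verbal_tag` and the `used_as_auxiliary` flag are hypothetical; whether a given token is functioning as an auxiliary has to be determined from the syntactic analysis, which this sketch simply takes as input.

```python
VERBAL_TAGS = {"VB", "VBP", "VBG", "VBN", "VBD", "VBZ"}
AUX_LEMMAS = {"be", "have", "do", "get"}


def map_verbal_tag(ptb_tag, lemma, used_as_auxiliary):
    """Return 'AUX' or 'VERB' for a PTB modal or verbal tag (sketch only)."""
    if ptb_tag == "MD":
        # Modals are always AUX.
        return "AUX"
    if ptb_tag in VERBAL_TAGS:
        # Only be, have, do, and get can be AUX, and only in auxiliary use.
        if lemma in AUX_LEMMAS and used_as_auxiliary:
            return "AUX"
        return "VERB"
    raise ValueError(f"not a PTB verbal tag: {ptb_tag}")
```

Under these assumptions, passive get (as in "got fired") would be passed with `used_as_auxiliary=True` and come out as `AUX`, while main-verb have (as in "have a car") comes out as `VERB`.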
CONJ: coordinating conjunction
CONJ corresponds to PTB CC.
DET covers most cases of PTB DT, PDT, and WDT. However, when a Penn Treebank word with one of these tags stands alone as a noun phrase rather than modifying another word, it becomes PRON.
INTJ corresponds to the PTB UH.
NOUN corresponds to all cases of PTB NN and NNS, except for %, which we retag as SYM.
NUM corresponds exactly to the PTB CD.
The following English words (only) are currently treated as PART:
- Possessive marker: ’s or ’ (and non-standard forms s, -s)
- Predicate negation: not, n’t, nt
- Infinitive marker: to (and non-standard forms ta, na, too, ot, 2, a)
(This is a slightly motley list and we may still want to rethink this category for English….)
This covers PTB tags POS and some (old PTB style) or all uses of TO, and the subset of RB that is negation.
PRON is used for English pronouns, such as we, her, it, who, and that when used as a relative pronoun.
PRON corresponds to the PTB PRP, PRP$, WP, WP$, EX, and certain things that are tagged DT (question and Wh pronouns, such as who, this, and that), when they comprise a nominal by themselves rather than functioning as the determiner of a nominal head (usually a noun). (The assignment of PRP$ and WP$ to PRON might be subject to revision - they could also become DET.)
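The split between DET and PRON described here can be sketched as below. This is an assumption-laden illustration: `map_pronoun_or_determiner` and the `stands_alone_as_nominal` flag are hypothetical names, and the flag stands in for a judgment that would come from the parse.

```python
ALWAYS_PRON_TAGS = {"PRP", "PRP$", "WP", "WP$", "EX"}
DETERMINER_TAGS = {"DT", "PDT", "WDT"}


def map_pronoun_or_determiner(ptb_tag, stands_alone_as_nominal=False):
    """Return 'PRON' or 'DET' for PTB pronoun and determiner tags (sketch only)."""
    if ptb_tag in ALWAYS_PRON_TAGS:
        return "PRON"
    if ptb_tag in DETERMINER_TAGS:
        # A determiner-tagged word that forms a nominal by itself is PRON.
        return "PRON" if stands_alone_as_nominal else "DET"
    raise ValueError(f"not a PTB pronoun or determiner tag: {ptb_tag}")
```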
PROPN: proper noun
PROPN corresponds to everything tagged NNP or NNPS in the PTB tag set. (Note that at present we make no attempt to exclude words arguably of other parts of speech which appear in proper noun phrases that the PTB tag set would tag with NNP(S). So, United States is United/PROPN States/PROPN.)
PUNCT covers PTB tags:
- Some uses of NFP (for lines of hyphens, asterisks or tildes)
SCONJ: subordinating conjunction
SCONJ is used for these two subclasses of subordinating conjunctions:
- Complementizers: that, whether, if, etc.
- Adverbial clause introducers: when, since, before, etc. (when introducing a clause not a nominal)
These are a subset of the things that the IN tag is used for in the PTB.
We treat the putative relativizer use of that (e.g., Jespersen 1924) as a relative pronoun in modern English, so that it gets the POS tag PRON.
SYM covers the PTB tags NFP (except for lines of separators, which become PUNCT), #, $, and SYM, as well as the percent sign (%).
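The following Python sketch shows one way the SYM, PUNCT, and NOUN cases could be separated. The names are hypothetical, and the regular expression is an assumption: the page does not define "lines of separators" precisely, so the sketch treats any run of two or more hyphens, asterisks, or tildes as one.

```python
import re

# Assumption: a separator line is a run of 2+ hyphens, asterisks, or tildes.
SEPARATOR_LINE = re.compile(r"^[-*~]{2,}$")


def map_symbol_tag(word, ptb_tag):
    """Return 'SYM', 'PUNCT', or 'NOUN' for the PTB tags discussed here."""
    if ptb_tag == "NFP":
        return "PUNCT" if SEPARATOR_LINE.match(word) else "SYM"
    if ptb_tag in {"#", "$", "SYM"}:
        return "SYM"
    if ptb_tag in {"NN", "NNS"}:
        # The percent sign is retagged from NOUN to SYM.
        return "SYM" if word == "%" else "NOUN"
    raise ValueError(f"tag not handled by this sketch: {ptb_tag}")
```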
VERB covers PTB tags VB, VBP, VBZ, VBD, VBG, VBN, except for auxiliary verb uses of be, have, do, and get.
(Auxiliary verbs and modals are AUX and the infinitive to is PART.)
The English tag X is used for the PTB tags FW, LS, XX, ADD, AFX, and GW. Some things tagged AFX would be candidates for retagging with other tags, but that has not been attempted.
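The PTB tags that this page maps to a single UD tag regardless of context can be collected into a lookup table. The sketch below is illustrative only: `DIRECT_MAP` and `map_direct_tag` are hypothetical names, and the table deliberately leaves out every tag whose mapping depends on the word or its syntactic use (for example IN, TO, RB, DT, NN, and the verbal tags).

```python
# Context-free correspondences stated on this page.
DIRECT_MAP = {
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "MD": "AUX",
    "CC": "CONJ",
    "UH": "INTJ",
    "CD": "NUM",
    "POS": "PART",
    "NNP": "PROPN", "NNPS": "PROPN",
    "FW": "X", "LS": "X", "XX": "X", "ADD": "X", "AFX": "X", "GW": "X",
}


def map_direct_tag(ptb_tag):
    """Return the UD tag for a context-free PTB tag, or None if context is needed."""
    return DIRECT_MAP.get(ptb_tag)
```

A `None` result signals that the tag needs one of the context-sensitive rules described in the sections above.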