SYM: symbol
Definition
A symbol is a word-like entity that differs from ordinary words by form, function, or both.
Many symbols are or contain special non-alphanumeric characters, similarly to punctuation. What distinguishes them from punctuation is that they can be replaced by ordinary words. This includes all currency symbols: for example, $ 75 is read the same as seventy-five dollars.
Mathematical operators form another group of symbols.
Another group of symbols is emoticons and emoji.
Strings that consist entirely of alphanumeric characters are not
symbols, but they may be proper nouns: 130XE, DC10. Other strings
may be tagged PROPN (rather than SYM) even if they contain special
characters: DC-10.
Similarly, abbreviations of single words are not symbols; they are assigned the part of speech
of the full form. For example, Mr. (mister), kg (kilogram), km (kilometr), dr (doktor)
should be tagged as nouns.
Acronyms for proper names such as OSN and NATO should be tagged as proper nouns.
Characters used as bullets in itemized lists (•, ‣) are not symbols; they are punctuation.
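The rules above can be approximated by a first-pass filter. The following is only a minimal sketch, not part of any UD tool: the function name `sym_candidate`, the `KNOWN_SYMBOLS` lexicon (seeded from the examples on this page) and the fallback on Unicode symbol categories are all assumptions made for illustration.

```python
import unicodedata

# List bullets are punctuation, not symbols.
BULLETS = {"•", "‣"}

# Hypothetical seed lexicon. Several symbols (%, §) and all ASCII emoticons
# are classified as punctuation by Unicode, so they must be listed explicitly.
KNOWN_SYMBOLS = {"$", "%", "§", "©", "+", "−", "×", "÷", "=", "<", ">", ":)"}

# Unicode general categories: currency, math, other symbols (includes emoji).
SYMBOL_CATEGORIES = {"Sc", "Sm", "So"}


def sym_candidate(token: str) -> bool:
    """Return True if the token could plausibly be tagged SYM."""
    if not token or token in BULLETS:
        return False
    if token.isalnum():
        # Purely alphanumeric strings (130XE, DC10) are never SYM.
        return False
    if token in KNOWN_SYMBOLS:
        return True
    # Fallback: every character belongs to a Unicode symbol category.
    return all(unicodedata.category(ch) in SYMBOL_CATEGORIES for ch in token)


for tok in ["$", "%", "😝", "DC10", "DC-10", "Mr.", "•"]:
    print(tok, sym_candidate(tok))
```

A filter like this cannot decide context-dependent cases. Tokens such as x, / and * occur both as SYM and as other tags in the treebank data, so they still need a contextual decision.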
Examples
- $, %, §, ©
- +, −, ×, ÷, =, <, >
- :), ♥‿♥, 😝
- john.doe@universal.org, http://universaldependencies.org/, 1-800-COMPANY
Diffs
Prague Dependency Treebank
The PDT part-of-speech tagset does not distinguish symbols from punctuation, hence all non-alphanumeric characters in the converted data are currently tagged PUNCT.
The PDT texts date from the early 1990s and contain no e-mail addresses.
If they did, the PDT tokenization rules would split them at every dot and at sign.
The same holds for telephone numbers. For example,
tel.: (05) 4321 6014 is analyzed as eight tokens (NOUN PUNCT PUNCT PUNCT NUM PUNCT NUM NUM).
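The effect of this tokenization can be illustrated with a short sketch. It does not reproduce the actual PDT tokenizer: the regular expression, the function names `pdt_like_tokenize` and `coarse_tag`, and the crude tagging fallback are assumptions chosen only so that the telephone example above comes out as eight tokens.

```python
import re


def pdt_like_tokenize(text: str) -> list[str]:
    """Keep runs of word characters together; split off every other
    non-space character as a separate token."""
    return re.findall(r"\w+|[^\w\s]", text)


def coarse_tag(token: str) -> str:
    """Very rough tag guess, for illustration only."""
    if token.isdigit():
        return "NUM"
    if token.isalnum():
        return "NOUN"  # placeholder: real tagging needs a lexicon and context
    return "PUNCT"     # the PDT tagset has no separate symbol class


tokens = pdt_like_tokenize("tel.: (05) 4321 6014")
print(tokens)
# ['tel', '.', ':', '(', '05', ')', '4321', '6014']
print(" ".join(coarse_tag(t) for t in tokens))
# NOUN PUNCT PUNCT PUNCT NUM PUNCT NUM NUM

print(pdt_like_tokenize("john.doe@universal.org"))
# ['john', '.', 'doe', '@', 'universal', '.', 'org']
```

Under these assumptions an e-mail address falls apart into seven tokens, which is why it could not be tagged as a single SYM without retokenization.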
Treebank Statistics (UD_Czech)
There are 7 SYM lemmas (0%), 7 SYM types (0%) and 1260 SYM tokens (0%).
Out of 17 observed tags, the rank of SYM is: 15 in number of lemmas, 16 in number of types and 15 in number of tokens.
The 10 most frequent SYM lemmas: %, x, +, =, /, *, $
The 10 most frequent SYM types: %, x, +, =, /, *, $
The 10 most frequent ambiguous lemmas: x (SYM 120, NOUN 70, PUNCT 3), / (PUNCT 700, SYM 18), * (PUNCT 655, SYM 10)
The 10 most frequent ambiguous types: x (SYM 120, NOUN 41, PUNCT 3), / (PUNCT 700, SYM 18), * (PUNCT 655, SYM 10)
Morphology
The form / lemma ratio of SYM is 1.000000 (the average of all parts of speech is 2.195930).
The 1st highest number of forms (1) was observed with the lemma “$”: $.
The 2nd highest number of forms (1) was observed with the lemma “%”: %.
The 3rd highest number of forms (1) was observed with the lemma “*”: *.
SYM occurs with 1 feature: cs-feat/ConjType (120; 10% instances)
SYM occurs with 1 feature-value pair: ConjType=Oper
SYM occurs with 2 feature combinations.
The most frequent feature combination is _ (1140 tokens).
Examples: %, +, =, /, *, $
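Counts of this kind can be derived from a treebank in CoNLL-U format. The sketch below is an assumption about how such numbers might be computed, not the script that generated this page: in particular it takes the form / lemma ratio to be the number of distinct (lemma, form) pairs divided by the number of lemmas. The function name `upos_stats`, the helper `row` and the three-sentence toy sample are invented for illustration.

```python
from collections import defaultdict


def upos_stats(conllu_lines, upos="SYM"):
    """Count lemmas, types, tokens and the form/lemma ratio for one UPOS tag."""
    forms_by_lemma = defaultdict(set)
    tokens = 0
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence boundary or comment
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        form, lemma, tag = cols[1], cols[2], cols[3]
        if tag == upos:
            tokens += 1
            forms_by_lemma[lemma].add(form)
    lemmas = len(forms_by_lemma)
    types = len({f for forms in forms_by_lemma.values() for f in forms})
    pairs = sum(len(forms) for forms in forms_by_lemma.values())
    ratio = pairs / lemmas if lemmas else 0.0
    return {"lemmas": lemmas, "types": types, "tokens": tokens,
            "form_lemma_ratio": ratio}


def row(idx, form, lemma, tag):
    """Build a 10-column CoNLL-U line; unused columns are placeholders."""
    return "\t".join([idx, form, lemma, tag, "_", "_", "0", "root", "_", "_"])


# Invented toy data, not taken from any treebank.
sample = [
    "# sent_id = toy-1", row("1", "5", "5", "NUM"), row("2", "%", "%", "SYM"), "",
    "# sent_id = toy-2", row("1", "3", "3", "NUM"), row("2", "+", "+", "SYM"), "",
    "# sent_id = toy-3", row("1", "2", "2", "NUM"), row("2", "%", "%", "SYM"), "",
]

stats = upos_stats(sample)
print(stats)
```

Because symbols do not inflect, every lemma has exactly one form, which yields a ratio of 1.0 under this definition.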
Relations
SYM nodes are attached to their parents using 14 different relations: cs-dep/nmod (927; 74% instances), cs-dep/cc (149; 12% instances), cs-dep/advmod (52; 4% instances), cs-dep/conj (31; 2% instances), cs-dep/case (20; 2% instances), cs-dep/dobj (19; 2% instances), cs-dep/dep (18; 1% instances), cs-dep/root (18; 1% instances), cs-dep/nsubj (8; 1% instances), cs-dep/parataxis (7; 1% instances), cs-dep/appos (5; 0% instances), cs-dep/advcl (3; 0% instances), cs-dep/iobj (2; 0% instances), cs-dep/mark (1; 0% instances)
Parents of SYM nodes belong to 10 different parts of speech: NUM (920; 73% instances), NOUN (168; 13% instances), VERB (100; 8% instances), PROPN (22; 2% instances), ROOT (18; 1% instances), SYM (13; 1% instances), ADV (12; 1% instances), ADJ (4; 0% instances), PRON (2; 0% instances), INTJ (1; 0% instances)
747 (59%) SYM nodes are leaves.
359 (28%) SYM nodes have one child.
70 (6%) SYM nodes have two children.
84 (7%) SYM nodes have three or more children.
The highest child degree of a SYM node is 12.
Children of SYM nodes are attached using 17 different relations: cs-dep/nmod (287; 34% instances), cs-dep/nummod (152; 18% instances), cs-dep/case (85; 10% instances), cs-dep/dobj (83; 10% instances), cs-dep/punct (72; 9% instances), cs-dep/nsubj (43; 5% instances), cs-dep/advmod (32; 4% instances), cs-dep/conj (25; 3% instances), cs-dep/advmod:emph (24; 3% instances), cs-dep/amod (13; 2% instances), cs-dep/cc (8; 1% instances), cs-dep/mark (6; 1% instances), cs-dep/advcl (5; 1% instances), cs-dep/cop (4; 0% instances), cs-dep/appos (3; 0% instances), cs-dep/aux (1; 0% instances), cs-dep/dep (1; 0% instances)
Children of SYM nodes belong to 14 different parts of speech: NOUN (320; 38% instances), NUM (256; 30% instances), ADP (85; 10% instances), PUNCT (72; 9% instances), ADV (23; 3% instances), PRON (17; 2% instances), PART (15; 2% instances), ADJ (14; 2% instances), SYM (13; 2% instances), CONJ (10; 1% instances), VERB (10; 1% instances), PROPN (6; 1% instances), SCONJ (2; 0% instances), AUX (1; 0% instances)
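As a quick consistency check, the child-degree counts reported for UD_Czech should add up to the total number of SYM tokens (1260), and the percentages should follow from them. The snippet below is a small worked calculation over the figures quoted on this page; the variable names are illustrative only.

```python
# Child-degree distribution of SYM nodes in UD_Czech, as reported above.
degree_counts = {
    "leaves": 747,
    "one child": 359,
    "two children": 70,
    "three or more children": 84,
}

total = sum(degree_counts.values())
print(total)  # 1260

# Percentage of SYM nodes in each class, rounded to whole percent.
percentages = {name: round(100 * n / total) for name, n in degree_counts.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

The four classes sum to 1260, matching the SYM token count, and the rounded shares come out as 59%, 28%, 6% and 7%.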
Treebank Statistics (UD_Czech-CAC)
There is 1 SYM lemma (0%), 1 SYM type (0%) and 3783 SYM tokens (1%).
Out of 16 observed tags, the rank of SYM is: 16 in number of lemmas, 16 in number of types and 14 in number of tokens.
The 10 most frequent SYM lemmas: *
The 10 most frequent SYM types: *
The 10 most frequent ambiguous lemmas: none
The 10 most frequent ambiguous types: none
Morphology
The form / lemma ratio of SYM is 1.000000 (the average of all parts of speech is 2.206260).
The 1st highest number of forms (1) was observed with the lemma “*”: *.
SYM occurs with 1 feature: cs-feat/Abbr (3783; 100% instances)
SYM occurs with 1 feature-value pair: Abbr=Yes
SYM occurs with 1 feature combination.
The most frequent feature combination is Abbr=Yes (3783 tokens).
Examples: *
Relations
SYM nodes are attached to their parents using 24 different relations: cs-dep/nmod (2477; 65% instances), cs-dep/advmod (405; 11% instances), cs-dep/conj (290; 8% instances), cs-dep/nsubj (141; 4% instances), cs-dep/dobj (108; 3% instances), cs-dep/dep (85; 2% instances), cs-dep/case (80; 2% instances), cs-dep/root (55; 1% instances), cs-dep/cc (25; 1% instances), cs-dep/cop (20; 1% instances), cs-dep/nsubjpass (19; 1% instances), cs-dep/appos (16; 0% instances), cs-dep/auxpass (9; 0% instances), cs-dep/acl (8; 0% instances), cs-dep/advcl (8; 0% instances), cs-dep/mark (8; 0% instances), cs-dep/mwe (7; 0% instances), cs-dep/iobj (5; 0% instances), cs-dep/aux (4; 0% instances), cs-dep/xcomp (4; 0% instances), cs-dep/advmod:emph (3; 0% instances), cs-dep/punct (3; 0% instances), cs-dep/auxpass:reflex (2; 0% instances), cs-dep/ccomp (1; 0% instances)
Parents of SYM nodes belong to 12 different parts of speech: NOUN (1169; 31% instances), NUM (972; 26% instances), VERB (658; 17% instances), PROPN (584; 15% instances), SYM (177; 5% instances), ADJ (126; 3% instances), ROOT (55; 1% instances), PRON (20; 1% instances), ADV (13; 0% instances), ADP (4; 0% instances), CONJ (4; 0% instances), PUNCT (1; 0% instances)
2449 (65%) SYM nodes are leaves.
688 (18%) SYM nodes have one child.
375 (10%) SYM nodes have two children.
271 (7%) SYM nodes have three or more children.
The highest child degree of a SYM node is 12.
Children of SYM nodes are attached using 26 different relations: cs-dep/case (622; 25% instances), cs-dep/nmod (489; 19% instances), cs-dep/conj (212; 8% instances), cs-dep/nummod (211; 8% instances), cs-dep/punct (201; 8% instances), cs-dep/cc (189; 7% instances), cs-dep/amod (154; 6% instances), cs-dep/nsubj (82; 3% instances), cs-dep/advmod (57; 2% instances), cs-dep/dobj (50; 2% instances), cs-dep/cop (49; 2% instances), cs-dep/advmod:emph (46; 2% instances), cs-dep/mark (44; 2% instances), cs-dep/acl (25; 1% instances), cs-dep/dep (21; 1% instances), cs-dep/expl (16; 1% instances), cs-dep/aux (12; 0% instances), cs-dep/xcomp (11; 0% instances), cs-dep/auxpass:reflex (7; 0% instances), cs-dep/advcl (6; 0% instances), cs-dep/appos (6; 0% instances), cs-dep/nsubjpass (5; 0% instances), cs-dep/csubj (3; 0% instances), cs-dep/mwe (2; 0% instances), cs-dep/parataxis (2; 0% instances), cs-dep/ccomp (1; 0% instances)
Children of SYM nodes belong to 14 different parts of speech: ADP (608; 24% instances), NOUN (550; 22% instances), NUM (222; 9% instances), PUNCT (200; 8% instances), SYM (177; 7% instances), ADJ (169; 7% instances), CONJ (166; 7% instances), VERB (120; 5% instances), PRON (96; 4% instances), ADV (84; 3% instances), SCONJ (41; 2% instances), PART (40; 2% instances), PROPN (38; 2% instances), AUX (12; 0% instances)
Treebank Statistics (UD_Czech-CLTT)
There are 3 SYM lemmas (0%), 3 SYM types (0%) and 18 SYM tokens (0%).
Out of 15 observed tags, the rank of SYM is: 14 in number of lemmas, 15 in number of types and 15 in number of tokens.
The 10 most frequent SYM lemmas: %, +, /
The 10 most frequent SYM types: %, +, /
The 10 most frequent ambiguous lemmas: / (SYM 2, PUNCT 2)
The 10 most frequent ambiguous types: / (SYM 2, PUNCT 2)
- /
- SYM 2: Odpisová sazba na jednotku těženého množství ( Kč / t , Kč / m je podílem pořizovací ceny ložiska na jednotlivém pozemku a zásob nevyhrazeného nerostu ( t , m prokázaných geologickým průzkumem na tomto pozemku .
- PUNCT 2: (4) Rozdíl mezi součtem počátečních zůstatků nově otevřených účtů aktiv a mezi součtem počátečních zůstatků nově otevřených účtů pasiv se uvede na účet v účtové skupině 49 , a to v závislosti na povaze zjištěného rozdílu ( + / - ) jako zůstatek aktivní nebo pasivní .
Morphology
The form / lemma ratio of SYM is 1.000000 (the average of all parts of speech is 1.764161).
The 1st highest number of forms (1) was observed with the lemma “%”: %.
The 2nd highest number of forms (1) was observed with the lemma “+”: +.
The 3rd highest number of forms (1) was observed with the lemma “/”: /.
SYM does not occur with any features.
Relations
SYM nodes are attached to their parents using 3 different relations: cs-dep/nmod (14; 78% instances), cs-dep/cc (2; 11% instances), cs-dep/dep (2; 11% instances)
Parents of SYM nodes belong to 2 different parts of speech: NUM (13; 72% instances), NOUN (5; 28% instances)
8 (44%) SYM nodes are leaves.
7 (39%) SYM nodes have one child.
2 (11%) SYM nodes have two children.
1 (6%) SYM node has three or more children.
The highest child degree of a SYM node is 4.
Children of SYM nodes are attached using 6 different relations: cs-dep/nmod (8; 53% instances), cs-dep/conj (2; 13% instances), cs-dep/punct (2; 13% instances), cs-dep/advmod:emph (1; 7% instances), cs-dep/case (1; 7% instances), cs-dep/nummod (1; 7% instances)
Children of SYM nodes belong to 5 different parts of speech: NOUN (8; 53% instances), PUNCT (4; 27% instances), ADP (1; 7% instances), ADV (1; 7% instances), NUM (1; 7% instances)