How to find the length of a token in ANTLR?
I am trying to create a grammar that accepts any character or number or just about anything, provided its length is equal to 1.
Is there a function to check the length?
EDIT
Let me make my question clearer with an example. I wrote the following code:
    grammar first;

    tokens {
      SET = 'set';
      VAL = 'val';
      UND = 'und';
      CON = 'con';
      ON  = 'on';
      OFF = 'off';
    }

    @parser::members {
      private boolean inbounds(Token t, int min, int max) {
        int n = Integer.parseInt(t.getText());
        return n >= min && n <= max;
      }
    }

    parse   : SET expr;

    expr    : VAL('u'('e')?)? STRING
            | UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF)
            | CON('n'('e'('c'('t')?)?)?)? onechar
            ;

    CHAR    : 'a'..'z';
    DIGIT   : '0'..'9';
    STRING  : (CHAR | DIGIT)+;

    dot     : .;
    onechar : dot { $dot.text.length() == 1;} ;

    SPACE   : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
I want my grammar to do the following things:
- Accept commands like: 'set value abc', 'set underli on', 'set conn #'. The grammar should be intelligent enough to accept incomplete words like 'underl' instead of 'underline', etc.
- The third syntax, 'set connect onechar', should accept any character, but just one character. It can be a numeric digit, a letter or any special character. I am getting a compiler error in the generated parser file because of this.
- The first syntax, 'set value', should accept all possible strings, even on and off. But when I give something like 'set value offer', the grammar fails. I think this is happening because I already have a token 'off'.
In my grammar, the three requirements I have listed above are not working. I don't know why.
There are some mistakes and/or bad practices in your grammar:
#1
The following is not a validating predicate:
    {$dot.text.length() == 1;}
A proper validating predicate in ANTLR has a question mark at the end, and the inner code has no semicolon at the end. It should be:
    {$dot.text.length() == 1}?
instead.
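To see why the semicolon matters, it can help to picture how a validating predicate ends up in Java code: its body is used as a boolean expression inside an `if`, and the rule fails with an exception when the expression is false. The class below is a minimal sketch of that idea in plain Java. The names `PredicateSketch` and `FailedPredicate` are made up for illustration, and this is not the code ANTLR actually generates.

```java
// Hypothetical stand-in for a generated rule method; not ANTLR's real output.
class PredicateSketch {

    // Made-up exception playing the role of a failed validating predicate.
    static class FailedPredicate extends RuntimeException {
        FailedPredicate(String rule, String predicate) {
            super("rule " + rule + " failed predicate: {" + predicate + "}?");
        }
    }

    // Accepts the token text only if it is exactly one character long.
    static String onechar(String tokenText) {
        // The predicate body sits inside an if-condition, so it has to be
        // an expression: a trailing semicolon would not compile here.
        if (!(tokenText.length() == 1)) {
            throw new FailedPredicate("onechar", "text.length() == 1");
        }
        return tokenText;
    }

    public static void main(String[] args) {
        System.out.println(onechar("x"));
        try {
            onechar("xy");
        } catch (FailedPredicate e) {
            System.out.println(e.getMessage());
        }
    }
}
```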
#2
You should not be handling these alternative commands:
    expr : VAL('u'('e')?)? STRING
         | UND('e'('r'('l'('i'('n'('e')?)?)?)?)?)? (ON | OFF)
         | CON('n'('e'('c'('t')?)?)?)? onechar
         ;
in a parser rule. You should let the lexer handle this instead. Something like this will do it:
    expr : VAL STRING
         | UND (ON | OFF)
         | CON onechar
         ;

    // ...

    VAL : 'val' ('u' ('e')?)?;
    UND : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
    CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
(Also see #5!)
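The nested optional groups in those lexer rules say: the first three letters are required, and after that any further prefix of the full keyword is fine. The same check can be written in plain Java as a rough sketch. The class and method names below are invented for this example and have nothing to do with the ANTLR runtime.

```java
// Plain-Java illustration of what the nested ( ... )? groups accept.
class AbbreviationSketch {

    /**
     * Returns true if word is a prefix of keyword that is at least
     * minLength characters long, e.g. "underl" for "underline".
     */
    static boolean isAbbreviation(String word, String keyword, int minLength) {
        return word.length() >= minLength && keyword.startsWith(word);
    }

    public static void main(String[] args) {
        System.out.println(isAbbreviation("underl", "underline", 3)); // true
        System.out.println(isAbbreviation("un", "underline", 3));     // false: too short
        System.out.println(isAbbreviation("undx", "underline", 3));   // false: not a prefix
    }
}
```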
#3
Your lexer rules:
    CHAR   : 'a'..'z';
    DIGIT  : '0'..'9';
    STRING : (CHAR | DIGIT)+;
are making things complicated for you. The lexer can produce three different kinds of tokens because of this: CHAR, DIGIT or STRING. Ideally, you should only create STRING tokens, since a STRING can already be a single CHAR or DIGIT. You can do that by adding the fragment keyword before these rules:
    fragment CHAR  : 'a'..'z' | 'A'..'Z';
    fragment DIGIT : '0'..'9';
    STRING         : (CHAR | DIGIT)+;
There will now be no CHAR and DIGIT tokens in your token stream, only STRING tokens. In short: fragment rules are only used inside lexer rules, by other lexer rules. They will never be tokens of their own (and can therefore never appear in any parser rule!).
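One way to think about fragment rules is as private helper methods of the lexer: they help recognize a token but are never handed to the parser themselves. Here is a small hand-written sketch of that idea in plain Java. It is only an analogy with invented names, not how an ANTLR lexer is implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-rolled analogy: isChar/isDigit act like fragment rules,
// and only whole STRING tokens are ever emitted.
class FragmentSketch {

    // Helper only, never a token of its own (like "fragment CHAR").
    static boolean isChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Helper only, never a token of its own (like "fragment DIGIT").
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Returns the text of every STRING token in input, skipping all other characters. */
    static List<String> strings(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int start = i;
            while (i < input.length()
                    && (isChar(input.charAt(i)) || isDigit(input.charAt(i)))) {
                i++;
            }
            if (i > start) {
                tokens.add(input.substring(start, i)); // one STRING token
            } else {
                i++; // not part of a STRING: skip this character
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Even a single letter or digit comes out as a STRING.
        System.out.println(strings("a 1 abc42"));
    }
}
```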
#4
The rule:
    dot : .;
does not do what you think it does. It matches "any token", not "any character". Inside a lexer rule, the . matches any character, but in parser rules it matches any token. Realize that parser rules can only make use of the tokens created by the lexer.
The input source is first tokenized based on the lexer rules. After that has been done, the parser (through its parser rules) can then operate on these tokens (not characters!). Make sure you understand this! (If not, ask for clarification or grab a book about ANTLR.)
- An example -
Take the following grammar:
    p : . ;
    A : 'a' | 'A';
    B : 'b' | 'B';
The parser rule p will match any token the lexer produces: only an A or a B token. So, p can only match one of the characters 'a', 'A', 'b' or 'B', nothing else.
And in the following grammar:
    prs : . ;
    FOO : 'a';
    BAR : . ;
the lexer rule BAR matches any single character in the range \u0000 .. \uFFFF, but it can never match the character 'a', since the lexer rule FOO is defined before the BAR rule and captures this 'a' already. And the parser rule prs again matches any token, either FOO or BAR.
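The FOO/BAR example can be mimicked with a tiny hand-written tokenizer that tries the rules in the order they are defined. This is a simplified sketch in plain Java with an invented class name, meant only to show why BAR never sees an 'a'. A real ANTLR lexer is generated and works differently internally.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the FOO/BAR grammar: one token per character,
// with FOO tried before BAR.
class RuleOrderSketch {

    /** Returns the token type name for each character of input. */
    static List<String> tokenize(String input) {
        List<String> types = new ArrayList<>();
        for (char c : input.toCharArray()) {
            if (c == 'a') {
                types.add("FOO"); // FOO : 'a';  defined first, so it wins
            } else {
                types.add("BAR"); // BAR : . ;   every other single character
            }
        }
        return types;
    }

    public static void main(String[] args) {
        // The parser rule "prs : . ;" would accept any one of these tokens.
        System.out.println(tokenize("a#b")); // prints [FOO, BAR, BAR]
    }
}
```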
#5
Putting single characters like 'u' inside your parser rules will cause the lexer to tokenize a u as a separate token: you don't want that. Also, by putting them in parser rules, it is unclear which token has precedence over other tokens. You should keep all such literals outside your parser rules and make them explicit lexer rules instead. Only use lexer rules in your parser rules.
So, don't do:
    prule  : 'u' ':' STRING;
    STRING : ...
but do:
    prule  : U ':' STRING;
    U      : 'u';
    STRING : ...
You could make ':' a lexer rule as well, but that is of less importance. The 'u', however, can also be a STRING, so it must appear as a lexer rule before the STRING rule.
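The precedence point can be pictured as a chain of checks in definition order: text that both U and STRING could match goes to whichever rule is listed first. Below is a deliberately simplified plain-Java model that classifies a whole word at once. The class name and the UNKNOWN fallback are made up for this sketch, and a real lexer works character by character rather than on pre-split words.

```java
// Simplified model of rule precedence between U and STRING.
class PrecedenceSketch {

    /** Returns the name of the first rule, in definition order, that matches word. */
    static String tokenType(String word) {
        if (word.equals("u")) {
            return "U";      // U : 'u';  listed before STRING, so it wins for "u"
        }
        if (word.matches("[a-zA-Z0-9]+")) {
            return "STRING"; // STRING : (CHAR | DIGIT)+;
        }
        return "UNKNOWN";    // hypothetical fallback, not part of the grammar
    }

    public static void main(String[] args) {
        System.out.println(tokenType("u"));  // U
        System.out.println(tokenType("ux")); // STRING
    }
}
```

If the two checks were swapped, "u" would always come out as a STRING and the U rule would never fire, which is the situation the ordering advice is meant to avoid.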
Okay, those were the most obvious things that come to mind. Based on them, here's a proposed grammar:
    grammar First;

    parse
      : (SET expr {System.out.println("expr = " + $expr.text);} )+ EOF
      ;

    expr
      : VAL STRING    {System.out.print("a :: ");}
      | UL (ON | OFF) {System.out.print("b :: ");}
      | CON onechar   {System.out.print("c :: ");}
      ;

    onechar
      : STRING {$STRING.text.length() == 1}?
      ;

    SET : 'set';
    VAL : 'val' ('u' ('e')?)?;
    UL  : 'und' ( 'e' ( 'r' ( 'l' ( 'i' ( 'n' ( 'e' )?)?)?)?)?)?;
    CON : 'con' ( 'n' ( 'e' ( 'c' ( 't' )?)?)?)?;
    ON  : 'on';
    OFF : 'off';

    STRING : (CHAR | DIGIT)+;

    fragment CHAR  : 'a'..'z' | 'A'..'Z';
    fragment DIGIT : '0'..'9';

    SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
that can be tested with the following class:
    import org.antlr.runtime.*;

    public class Main {
        public static void main(String[] args) throws Exception {
            // Four commands; the last one passes two characters to 'set conn'.
            String source =
                "set value abc \n" +
                "set underli on \n" +
                "set conn x \n" +
                "set conn xy ";
            // Lex the input, then feed the token stream to the parser.
            ANTLRStringStream in = new ANTLRStringStream(source);
            FirstLexer lexer = new FirstLexer(in);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            FirstParser parser = new FirstParser(tokens);
            System.out.println("parsing:\n======\n" + source + "\n======");
            parser.parse();
        }
    }
which, after generating the lexer and parser:
    java -cp antlr-3.2.jar org.antlr.Tool First.g
    javac -cp antlr-3.2.jar *.java
    java -cp .:antlr-3.2.jar Main
prints the following output:
    parsing:
    ======
    set value abc
    set underli on
    set conn x
    set conn xy
    ======
    a :: expr = value abc
    b :: expr = underli on
    c :: expr = conn x
    line 0:-1 rule onechar failed predicate: {$STRING.text.length() == 1}?
    c :: expr = conn xy
As you can see, the last command, c :: expr = conn xy, produces an error, as expected.