Presto SqlParser Source Code Walkthrough: the SqlParser Class
- 2022-01-23
A brief look at the SqlParser class
Parsing the DSL involves the following steps.
Presto parses SQL with ANTLR. Driving the ANTLR-generated lexer and parser is the job of SqlParser, mainly inside invokeParser.
First, the SQL to be parsed is converted from a String into a CharStream via CharStreams.
Because SQL is case-insensitive, the character stream is wrapped into a case-insensitive one. CaseInsensitiveStream uses the proxy (decorator) pattern: it delegates every call to the underlying stream, and its LA method adds the logic that uppercases every character.
This produces the character stream the lexer needs.
public class CaseInsensitiveStream implements CharStream {
private final CharStream stream;
public CaseInsensitiveStream(CharStream stream) { this.stream = stream; }
@Override
public String getText(Interval interval) { return stream.getText(interval); }
@Override
public void consume() { stream.consume(); }
@Override
public int LA(int i) {
int result = stream.LA(i);
switch (result) {
case 0:
case IntStream.EOF:
return result;
default:
return Character.toUpperCase(result);
}
}
@Override
public int mark() { return stream.mark(); }
@Override
public void release(int marker) { stream.release(marker); }
@Override
public int index() { return stream.index(); }
@Override
public void seek(int index) { stream.seek(index); }
@Override
public int size() { return stream.size(); }
@Override
public String getSourceName() { return stream.getSourceName(); }
}
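The decorator idea behind CaseInsensitiveStream can be demonstrated without ANTLR. The sketch below is a hypothetical stand-in (UpperCasingReader is not a Presto or ANTLR class): it wraps a plain Reader, uppercases characters on read, and passes EOF through untouched, mirroring the EOF/0 cases in LA().

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Illustrative analogue of CaseInsensitiveStream: a decorator that
// delegates to an underlying Reader and uppercases each character.
public class UpperCasingReader extends Reader {
    private final Reader delegate;

    public UpperCasingReader(Reader delegate) {
        this.delegate = delegate;
    }

    @Override
    public int read() throws IOException {
        int c = delegate.read();
        // pass EOF (-1) through unchanged, like the EOF/0 cases in LA()
        return c == -1 ? c : Character.toUpperCase(c);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = delegate.read(cbuf, off, len);
        for (int i = off; i < off + n; i++) {
            cbuf[i] = Character.toUpperCase(cbuf[i]);
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }

    public static void main(String[] args) throws IOException {
        Reader r = new UpperCasingReader(new StringReader("select * from foo"));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            sb.append((char) c);
        }
        System.out.println(sb); // SELECT * FROM FOO
    }
}
```

The real CaseInsensitiveStream only uppercases in LA (lookahead), so the original text of the SQL is preserved for getText; the sketch above simplifies that detail.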
Next, the character stream is fed into the lexer.
SqlBaseLexer is the lexer ANTLR generates from SqlBase.g4. It extends Lexer, which in turn extends Recognizer&lt;Integer, LexerATNSimulator&gt; and implements TokenSource.
The lexer turns the character stream into a token stream. Because SqlBaseLexer (via Lexer) implements TokenSource, it can be wrapped in a CommonTokenStream.
Next, the token stream is fed into the parser.
SqlBaseParser is the parser ANTLR generates from SqlBase.g4. It extends Parser, which in turn extends Recognizer&lt;Token, ParserATNSimulator&gt;.
To let users customize the lexer and parser, the initializer receives both objects and may operate on them; by default it does nothing.
public class SqlParser {
private static final BiConsumer<SqlBaseLexer, SqlBaseParser> DEFAULT_PARSER_INITIALIZER = (SqlBaseLexer lexer, SqlBaseParser parser) -> {};
private final BiConsumer<SqlBaseLexer, SqlBaseParser> initializer;
@Inject
public SqlParser(SqlParserOptions options) { this(options, DEFAULT_PARSER_INITIALIZER); }
public SqlParser(SqlParserOptions options, BiConsumer<SqlBaseLexer, SqlBaseParser> initializer) {
this.initializer = requireNonNull(initializer, "initializer is null");
...
}
private Node invokeParser(String name, String sql, Function<SqlBaseParser, ParserRuleContext> parseFunction, ParsingOptions parsingOptions) {
...
initializer.accept(lexer, parser);
...
}
...
}
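The initializer hook boils down to injecting a BiConsumer that runs before parsing, with a no-op default. A stripped-down sketch (the Lexer/Parser stand-ins below are hypothetical, not the real Presto types):

```java
import java.util.function.BiConsumer;

// Illustrative version of SqlParser's initializer hook: callers may pass
// a BiConsumer to customize both objects; the default does nothing.
public class HookDemo {
    static class Lexer { boolean caseSensitive; }
    static class Parser { boolean trace; }

    private static final BiConsumer<Lexer, Parser> DEFAULT_INITIALIZER = (l, p) -> {};
    private final BiConsumer<Lexer, Parser> initializer;

    public HookDemo() {
        this(DEFAULT_INITIALIZER);
    }

    public HookDemo(BiConsumer<Lexer, Parser> initializer) {
        this.initializer = initializer;
    }

    public Parser parse() {
        Lexer lexer = new Lexer();
        Parser parser = new Parser();
        initializer.accept(lexer, parser); // user hook runs before parsing
        return parser;
    }

    public static void main(String[] args) {
        Parser p = new HookDemo((l, pr) -> pr.trace = true).parse();
        System.out.println(p.trace); // true
    }
}
```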
The parser registers a PostProcessor object as a parse listener. PostProcessor extends SqlBaseBaseListener (generated by ANTLR), which implements the SqlBaseListener interface.
SqlBaseBaseListener provides enterXXX and exitXXX methods for every parse-tree node type, but all of them are empty by default, so SqlBaseBaseListener by itself does nothing.
PostProcessor overrides four of them: exitUnquotedIdentifier, exitBackQuotedIdentifier, exitDigitIdentifier, and exitNonReserved.
- exitUnquotedIdentifier handles unquoted identifiers and throws an exception when the identifier contains a forbidden symbol.
  - The forbidden symbols are computed by EnumSet.complementOf(allowedIdentifierSymbols).
  - allowedIdentifierSymbols is initialized in the constructor and declared final, so the reference cannot be reassigned.
  - For example, select * from foo@bar contains the forbidden symbol @ and raises "line 1:15: identifiers must not contain '@'".
- exitBackQuotedIdentifier handles backquoted identifiers and throws an exception whenever a backquote appears.
  - Standard Presto does not accept backquotes.
  - For example, select * from `foo` contains backquotes and raises "line 1:15: backquoted identifiers are not supported; use double quotes to quote identifiers".
- exitDigitIdentifier handles identifiers that start with a digit.
  - For example, select 1x from dual contains the invalid 1x and raises "line 1:8: identifiers must not start with a digit; surround the identifier with double quotes".
- exitNonReserved handles nonReserved words.
  - It requires the child to be a leaf (TerminalNode).
  - It replaces the nonReserved token with an IDENTIFIER token.
  - For example, in select if(1=1,1,0) from foo, the token if originally has type SqlBaseLexer.IF (value 85) and is rewritten to SqlBaseLexer.IDENTIFIER (value 231).
public class SqlParser {
private final EnumSet<IdentifierSymbol> allowedIdentifierSymbols;
public SqlParser(SqlParserOptions options, BiConsumer<SqlBaseLexer, SqlBaseParser> initializer) {
allowedIdentifierSymbols = EnumSet.copyOf(options.getAllowedIdentifierSymbols());
...
}
private class PostProcessor extends SqlBaseBaseListener {
private final List<String> ruleNames;
private final Consumer<ParsingWarning> warningConsumer;
public PostProcessor(List<String> ruleNames, Consumer<ParsingWarning> warningConsumer) {
this.ruleNames = ruleNames;
this.warningConsumer = requireNonNull(warningConsumer, "warningConsumer is null");
}
@Override
public void exitUnquotedIdentifier(SqlBaseParser.UnquotedIdentifierContext context) {
String identifier = context.IDENTIFIER().getText();
for (IdentifierSymbol identifierSymbol : EnumSet.complementOf(allowedIdentifierSymbols)) {
char symbol = identifierSymbol.getSymbol();
if (identifier.indexOf(symbol) >= 0) {
throw new ParsingException("identifiers must not contain '" + identifierSymbol.getSymbol() + "'", null, context.IDENTIFIER().getSymbol().getLine(), context.IDENTIFIER().getSymbol().getCharPositionInLine());
}
}
}
@Override
public void exitBackQuotedIdentifier(SqlBaseParser.BackQuotedIdentifierContext context) {
Token token = context.BACKQUOTED_IDENTIFIER().getSymbol();
throw new ParsingException(
"backquoted identifiers are not supported; use double quotes to quote identifiers",
null,
token.getLine(),
token.getCharPositionInLine());
}
@Override
public void exitDigitIdentifier(SqlBaseParser.DigitIdentifierContext context) {
Token token = context.DIGIT_IDENTIFIER().getSymbol();
throw new ParsingException(
"identifiers must not start with a digit; surround the identifier with double quotes",
null,
token.getLine(),
token.getCharPositionInLine());
}
@Override
public void exitNonReserved(SqlBaseParser.NonReservedContext context) {
// we can't modify the tree during rule enter/exit event handling unless we're dealing with a terminal.
// Otherwise, ANTLR gets confused and fires spurious notifications.
if (!(context.getChild(0) instanceof TerminalNode)) {
int rule = ((ParserRuleContext) context.getChild(0)).getRuleIndex();
throw new AssertionError("nonReserved can only contain tokens. Found nested rule: " + ruleNames.get(rule));
}
// replace nonReserved words with IDENT tokens
context.getParent().removeLastChild();
Token token = (Token) context.getChild(0).getPayload();
if (RESERVED_WORDS_WARNING.contains(token.getText().toUpperCase())) {
warningConsumer.accept(new ParsingWarning(
format("%s should be a reserved word, please use double quote (\"%s\"). This will be made a reserved word in future release.", token.getText(), token.getText()),
token.getLine(),
token.getCharPositionInLine()));
}
context.getParent().addChild(new CommonToken(
new Pair<>(token.getTokenSource(), token.getInputStream()),
SqlBaseLexer.IDENTIFIER,
token.getChannel(),
token.getStartIndex(),
token.getStopIndex()));
}
}
...
}
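The allowed-symbol check in exitUnquotedIdentifier can be exercised standalone. The sketch below uses a hypothetical two-value IdentifierSymbol enum (the real one lives in Presto's parser module and has more members); EnumSet.complementOf flips the allowed set into the forbidden set:

```java
import java.util.EnumSet;

// Illustrative version of the exitUnquotedIdentifier symbol check.
public class IdentifierCheck {
    enum IdentifierSymbol {
        AT_SIGN('@'), COLON(':');

        private final char symbol;

        IdentifierSymbol(char symbol) { this.symbol = symbol; }

        char getSymbol() { return symbol; }
    }

    static void check(String identifier, EnumSet<IdentifierSymbol> allowed) {
        // complementOf turns "allowed" into "forbidden"
        for (IdentifierSymbol s : EnumSet.complementOf(allowed)) {
            if (identifier.indexOf(s.getSymbol()) >= 0) {
                throw new IllegalArgumentException(
                        "identifiers must not contain '" + s.getSymbol() + "'");
            }
        }
    }

    public static void main(String[] args) {
        EnumSet<IdentifierSymbol> allowed = EnumSet.of(IdentifierSymbol.COLON);
        check("foo:bar", allowed); // ok: ':' is allowed
        try {
            check("foo@bar", allowed); // '@' is forbidden
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // identifiers must not contain '@'
        }
    }
}
```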
Error listeners are also attached to the lexer and the parser; all pre-existing error listeners are removed first.
LEXER_ERROR_LISTENER is a BaseErrorListener that converts ANTLR's syntaxError callback into a Presto ParsingException and throws it.
PARSER_ERROR_HANDLER is an ErrorHandler. ErrorHandler extends BaseErrorListener and provides a builder for constructing instances, with special handling for specialRules, specialTokens, and ignoredRules. The rules PARSER_ERROR_HANDLER handles are listed in the code.
Depending on enhancedErrorHandlerEnabled, the parser gets either PARSER_ERROR_HANDLER or LEXER_ERROR_LISTENER.
The parser's error-recovery strategy is also replaced: recoverInline is overridden so that attempts to insert or delete a token do not clutter error reporting.
public class SqlParser {
private static final BaseErrorListener LEXER_ERROR_LISTENER = new BaseErrorListener() {
@Override
public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol, int line, int charPositionInLine, String message, RecognitionException e) {
throw new ParsingException(message, e, line, charPositionInLine);
}
};
private static final ErrorHandler PARSER_ERROR_HANDLER = ErrorHandler.builder()
.specialRule(SqlBaseParser.RULE_expression, "<expression>")
.specialRule(SqlBaseParser.RULE_booleanExpression, "<expression>")
.specialRule(SqlBaseParser.RULE_valueExpression, "<expression>")
.specialRule(SqlBaseParser.RULE_primaryExpression, "<expression>")
.specialRule(SqlBaseParser.RULE_identifier, "<identifier>")
.specialRule(SqlBaseParser.RULE_string, "<string>")
.specialRule(SqlBaseParser.RULE_query, "<query>")
.specialRule(SqlBaseParser.RULE_type, "<type>")
.specialToken(SqlBaseLexer.INTEGER_VALUE, "<integer>")
.ignoredRule(SqlBaseParser.RULE_nonReserved)
.build();
private boolean enhancedErrorHandlerEnabled;
public SqlParser(SqlParserOptions options, BiConsumer<SqlBaseLexer, SqlBaseParser> initializer) {
...
enhancedErrorHandlerEnabled = options.isEnhancedErrorHandlerEnabled();
}
private Node invokeParser(String name, String sql, Function<SqlBaseParser, ParserRuleContext> parseFunction, ParsingOptions parsingOptions) {
...
// Override the default error strategy to not attempt inserting or deleting a token.
// Otherwise, it messes up error reporting
parser.setErrorHandler(new DefaultErrorStrategy() {
@Override
public Token recoverInline(Parser recognizer) throws RecognitionException {
if (nextTokensContext == null) {
throw new InputMismatchException(recognizer);
} else {
throw new InputMismatchException(recognizer, nextTokensState, nextTokensContext);
}
}
});
lexer.removeErrorListeners();
lexer.addErrorListener(LEXER_ERROR_LISTENER);
parser.removeErrorListeners();
if (enhancedErrorHandlerEnabled) {
parser.addErrorListener(PARSER_ERROR_HANDLER);
} else {
parser.addErrorListener(LEXER_ERROR_LISTENER);
}
...
}
...
}
The parser's prediction mode is set to the potentially faster SLL first; if that fails, parsing falls back to LL.
parseFunction takes the parser and returns the parse tree.
Note that parseFunction has type Function&lt;SqlBaseParser, ParserRuleContext&gt;.
Its main purpose is to select the grammar's entry rule.
public class SqlParser {
public Statement createStatement(String sql, ParsingOptions parsingOptions) {
return (Statement) invokeParser("statement", sql, SqlBaseParser::singleStatement, parsingOptions);
}
public Expression createExpression(String expression, ParsingOptions parsingOptions) {
return (Expression) invokeParser("expression", expression, SqlBaseParser::standaloneExpression, parsingOptions);
}
public Return createReturn(String routineBody, ParsingOptions parsingOptions) {
return (Return) invokeParser("return", routineBody, SqlBaseParser::standaloneRoutineBody, parsingOptions);
}
...
}
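Selecting the entry rule amounts to passing a method reference as a Function. A toy analogue (the MiniParser below is hypothetical, not SqlBaseParser):

```java
import java.util.function.Function;

// Illustrates how createStatement/createExpression pick the grammar's
// entry rule by handing invokeParser a method reference.
public class EntryPointDemo {
    static class MiniParser {
        String statement() { return "statement-tree"; }
        String expression() { return "expression-tree"; }
    }

    static String invokeParser(Function<MiniParser, String> parseFunction) {
        // the caller decides which rule parsing starts from
        return parseFunction.apply(new MiniParser());
    }

    public static void main(String[] args) {
        System.out.println(invokeParser(MiniParser::statement));  // statement-tree
        System.out.println(invokeParser(MiniParser::expression)); // expression-tree
    }
}
```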
Finally, every element of the parse tree needs further processing. Since the tree is a complex object structure, ANTLR supports the visitor pattern for this.
An AstBuilder object visits the tree and transforms it.
If a StackOverflowError occurs during parsing, it is converted into a ParsingException.
public class SqlParser {
private Node invokeParser(String name, String sql, Function<SqlBaseParser, ParserRuleContext> parseFunction, ParsingOptions parsingOptions) {
try {
SqlBaseLexer lexer = new SqlBaseLexer(new CaseInsensitiveStream(CharStreams.fromString(sql)));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
SqlBaseParser parser = new SqlBaseParser(tokenStream);
initializer.accept(lexer, parser);
// Override the default error strategy to not attempt inserting or deleting a token.
// Otherwise, it messes up error reporting
parser.setErrorHandler(new DefaultErrorStrategy() {
@Override
public Token recoverInline(Parser recognizer) throws RecognitionException {
if (nextTokensContext == null) {
throw new InputMismatchException(recognizer);
} else {
throw new InputMismatchException(recognizer, nextTokensState, nextTokensContext);
}
}
});
parser.addParseListener(new PostProcessor(Arrays.asList(parser.getRuleNames()), parsingOptions.getWarningConsumer()));
lexer.removeErrorListeners();
lexer.addErrorListener(LEXER_ERROR_LISTENER);
parser.removeErrorListeners();
if (enhancedErrorHandlerEnabled) {
parser.addErrorListener(PARSER_ERROR_HANDLER);
} else {
parser.addErrorListener(LEXER_ERROR_LISTENER);
}
ParserRuleContext tree;
try {
// first, try parsing with potentially faster SLL mode
parser.getInterpreter().setPredictionMode(PredictionMode.SLL);
tree = parseFunction.apply(parser);
} catch (ParseCancellationException ex) {
// if we fail, parse with LL mode
tokenStream.reset(); // rewind input stream
parser.reset();
parser.getInterpreter().setPredictionMode(PredictionMode.LL);
tree = parseFunction.apply(parser);
}
return new AstBuilder(parsingOptions).visit(tree);
} catch (StackOverflowError e) {
throw new ParsingException(name + " is too large (stack overflow while parsing)");
}
}
}
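AstBuilder follows ANTLR's visitor pattern: one visit method per node type, dispatched through accept(). A self-contained miniature of that dispatch (hypothetical node and visitor types, not the generated SqlBaseVisitor):

```java
// Miniature of the visitor pattern AstBuilder relies on: nodes expose
// accept(), visitors supply one method per node type, and dispatch picks
// the right method based on the concrete node class.
public class VisitorDemo {
    interface Node { <T> T accept(Visitor<T> v); }

    interface Visitor<T> {
        T visitLiteral(Literal n);
        T visitNot(Not n);
    }

    record Literal(boolean value) implements Node {
        public <T> T accept(Visitor<T> v) { return v.visitLiteral(this); }
    }

    record Not(Node child) implements Node {
        public <T> T accept(Visitor<T> v) { return v.visitNot(this); }
    }

    // walks the tree bottom-up, analogous to how AstBuilder rebuilds the
    // ANTLR parse tree into Presto AST nodes
    static class Evaluator implements Visitor<Boolean> {
        public Boolean visitLiteral(Literal n) { return n.value(); }
        public Boolean visitNot(Not n) { return !n.child().accept(this); }
    }

    public static void main(String[] args) {
        Node tree = new Not(new Literal(false));
        System.out.println(tree.accept(new Evaluator())); // true
    }
}
```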