@q Copyright 2012-2022 Alexander Shibakov@>
@q Copyright 2002-2014 Free Software Foundation, Inc.@>
@q This file is part of SPLinT@>
@q SPLinT is free software: you can redistribute it and/or modify@>
@q it under the terms of the GNU General Public License as published by@>
@q the Free Software Foundation, either version 3 of the License, or@>
@q (at your option) any later version.@>
@q SPLinT is distributed in the hope that it will be useful,@>
@q but WITHOUT ANY WARRANTY; without even the implied warranty of@>
@q MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the@>
@q GNU General Public License for more details.@>
@q You should have received a copy of the GNU General Public License@>
@q along with SPLinT. If not, see <http://www.gnu.org/licenses/>.@>
@** The lexer.
\ifbootstrapmode
\input limbo.sty
\input yystype.sty
\input grabstates.sty
\immediate\openout\stlist=ldl_states.h
\def\MRI{}
\def\ld{}
\fi
The lexer used by \ld\ is almost straightforward. A few
facilities (\Cee\ header files, some output functions) needed by the
lexer are conveniently coded into the \Cee\ code run by the driver
routines, which makes the lexer more complex than it would otherwise
have to be. The purpose of each such facility, however, can be easily
clarified by reading this documentation and occasionally referring to
the manual for the \bison\ parser that is part of this distribution.
@(ldl.ll@>=
@G
@> @<\ld\ lexer definitions@> @=
%{@> @<\ld\ lexer \Cee\ preamble@> @=%}
@> @<\ld\ lexer options@> @=
%%
@> @<\ld\ token regular expressions@> @=
%%
@O
void define_all_states( void ) {
    @<Collect state definitions for the \ld\ lexer@>@;
}
@o
@g
@ The options below request a reentrant scanner equipped with a \bison\
bridge, a start condition stack, and debugging support.
@<\ld\ lexer options@>=
@G(fs1)
%option bison-bridge
%option noyywrap nounput noinput reentrant
%option noyy_top_state
%option debug
%option stack
%option outfile="ldl.c"
@g
@ @<\ld\ lexer \Cee\ preamble@>=
@ The file \.{ldl\_states.h} below contains the names of all the start
conditions@^start conditions@> (or states) collected by the bootstrap parser.
@<Collect state definitions for the \ld\ lexer@>=
#define _register_name( name ) @[Define_State( #name, name )@]
#include "ldl_states.h"
#undef _register_name
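/* illustrative note: with the definition above, the generated
   \.{ldl\_states.h} is expected to consist of one invocation of
   |_register_name| for each start condition declared for this lexer
   (\.{SCRIPT}, \.{EXPRESSION}, and so on); the file is rebuilt every
   time the bootstrap parser runs */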
@ The character classes used by the scanner, as well as the lexer state
declarations, have been put in the definitions section of
the input file. No attempt has been made to clean up the definitions
of the character classes.
@<\ld\ lexer definitions@>=
@<\ld\ lexer states@>@;
@G(fs1)
CMDFILENAMECHAR [_a-zA-Z0-9\/\.\\_\+\$\:\[\]\\\,\=\&\!\<\>\-\~]
CMDFILENAMECHAR1 [_a-zA-Z0-9\/\.\\_\+\$\:\[\]\\\,\=\&\!\<\>\~]
FILENAMECHAR1 [_a-zA-Z\/\.\\\$\_\~]
SYMBOLCHARN [_a-zA-Z\/\.\\\$\_\~0-9]
FILENAMECHAR [_a-zA-Z0-9\/\.\-\_\+\=\$\:\[\]\\\,\~]
WILDCHAR [_a-zA-Z0-9\/\.\-\_\+\=\$\:\[\]\\\,\~\?\*\^\!]
WHITE [ \t\n\r]+
NOCFILENAMECHAR [_a-zA-Z0-9\/\.\-\_\+\$\:\[\]\\\~]
V_TAG [.$_a-zA-Z][._a-zA-Z0-9]*
V_IDENTIFIER [*?.$_a-zA-Z\[\]\-\!\^\\]([*?.$_a-zA-Z0-9\[\]\-\!\^\\]|::)*
@g
@ The lexer uses different sets of rules depending on the context and the current state.
These can be changed from within the lexer itself or externally by the parser
(as is the case in the \ld\
implementation). \locallink{stateswitchers}Later\endlink, a number of
helper macros implement state switching so that the state names are
very rarely used explicitly. Keeping all the state declarations in the
same section also simplifies the job of the
\locallink{bootstrapstates}bootstrap parser\endlink.
\ifbootstrapmode\immediate\openout\stlist=ldl_states.h\fi
@<\ld\ lexer states@>=
@G(fs1)
%s SCRIPT
%s EXPRESSION
%s BOTH
%s DEFSYMEXP
%s MRI
%s VERS_START
%s VERS_SCRIPT
%s VERS_NODE
@g
@*1 Macros for lexer functions.
The \locallink{pingpong}state switching\endlink\ `ping-pong' between the lexer and the parser aside,
the \ld\ lexer is very traditional. One implementation choice
deserving some attention is the treatment of comments. The
difficulty of implementing \Cee\ style comment scanning using regular
expressions is well-known, so a frequently used alternative is a
special function that simply skips to the end of the comment. This is
exactly what the \ld\ lexer does with an aptly named |comment()|
function. The typesetting parser uses the \.{\\ldcomment} macro for
the same purpose. For the curious, here is a \flex\ style regular
expression defining \Cee\ comments\footnote{Taken from W.~McKeeman's site
at
\url{http://www.cs.dartmouth.edu/~mckeeman/cs118/assignments/comment.html} and
adapted to \flex\ syntax. Here is the same regular expression pretty printed by
\splint: \flexrestyle{"/*"("/"`[\^*/]`"*"+[\^*/])*"*"+"/"}}:
$$
\hbox{\.{"/*" ("/"\yl[\^*/]\yl"*"+[\^*/])* "*"+ "/"}}
$$
This expression does not handle {\it every\/} practical situation,
however, since it assumes that the end-of-line character can be
matched like any other. Neither does it detect some common mistakes,
such as attempting to nest comments. A few minor modifications could
fix these deficiencies and add some error handling; for the sake of
consistency, however, the approach taken here mirrors the one in the
original \ld.
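\par
For reference, here is a minimal sketch (it is {\it not\/} the code used by
\ld, which is not reproduced in this document) of such a skipping function
in the spirit of |comment()|. It assumes that the opening \.{/*} has already
been consumed by the rule that matched it, relies on the traditional
non-reentrant |input()| interface of \flex, and treats both |EOF| and |0| as
the end of input, since different \flex\ versions use different conventions:
$$
\vbox{\halign{#\hfil\cr
\.{static void skip\_comment( void ) /* in the spirit of comment() */}\cr
\.{\{}\cr
\quad\.{int c;}\cr
\quad\.{while ( ( c = input() ) != EOF \&\& c != 0 ) \{}\cr
\qquad\.{if ( c != '*' ) continue; /* an ordinary character */}\cr
\qquad\.{while ( ( c = input() ) == '*' ) ; /* skip a run of asterisks */}\cr
\qquad\.{if ( c == '/' ) return; /* the comment is closed */}\cr
\qquad\.{if ( c == EOF \yl\yl c == 0 ) break; /* unterminated comment */}\cr
\quad\.{\}}\cr
\.{\}}\cr
}}
$$
The \TeX\ macros below follow the same pattern, reading the input one
character at a time until the closing sequence is seen.
\par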
The top level of the \.{\\ldcomment} macro simply bypasses the state
setup of the lexer and enters a `|while| loop' in the input
routine. This macro is a reasonable approximation of the functionality
provided by |comment()|.
@=
@G(t)
\def\ldcomment{%
\let\oldyyreturn\yyreturn
\let\oldyylextail\yylextail
\let\yylextail\yymatch %/* start inputting characters until {\tt *}{\tt /} is seen */
\let\yyreturn\ldcommentskipchars
}
@g
@ The rest of the |while| loop merely waits for the \.{*/} combination.
@=
@G(t)
\def\ldcommentskipchars{%
\ifnum\yycp@@=`*
\yybreak{\let\yyreturn\ldcommentseekslash\yyinput}%
%/* {\tt *} found, look for {\tt /} */
\else
\yybreak{\yyinput}% %/* keep skipping characters */
\yycontinue
}%
\def\ldcommentseekslash{%
\ifnum\yycp@@=`/
\yybreak{\ldcommentfinish}%/* {\tt /} found, exit */
\else
\ifnum\yycp@@=`*
\yybreak@@{\yyinput}% %/* keep skipping {\tt *}'s looking for a {\tt /} */
\else
\yybreak@@{\let\yyreturn\ldcommentskipchars\yyinput}%
%/* found a character other than {\tt *} or {\tt /} */
\fi
\yycontinue
}%
@g
@ Once the end of the comment has been found, resume lexing the input
stream.
@=
@G(t)
\def\ldcommentfinish{%
\let\yyreturn\oldyyreturn
\let\yylextail\oldyylextail
\yylextail
}
@g
@ The semantics of the macros defined above do not quite match those
of the |comment()| function. The most significant difference is that
the portion of the action following \.{\\ldcomment} expands {\it
before\/} the comment characters are skipped. In most applications,
|comment()| is the last function called in an action, so this does not
limit the use of \.{\\ldcomment} too dramatically.
A more intuitive and easier to use version of \.{\\ldcomment} is
possible, however, if \.{\\yylextail} is not used inside actions (in the case of
an `optimized' lexer the restriction is even weaker, namely,
\.{\\yylextail} merely has to be absent in the portion of the action
following \.{\\ldcomment}).
Another remark might be in order. It would seem more appropriate to
employ \TeX's native grouping mechanism to avoid the side effects
caused by the assignments performed by the macros (such as
\.{\\let\\oldyyreturn\\yyreturn}). While this is possible with some
careful macro writing, a na\:\i ve grouping attempt would interfere
with the assignments performed by \.{\\yymatch}
(e.g.~\.{\\yyresetstreams}). Avoiding assignments like these is still
possible, although the effort required borders on excessive.
@=
@G(t)
\def\ldcomment#1\yylextail{%
\let\oldyyreturn\yyreturn
\def\yylexcontinuation{#1\yylextail}%
\let\yyreturn\ldcommentskipchars %/* start inputting characters until {\tt *}{\tt /} is seen */
\yymatch
}
\def\ldcommentfinish{%
\let\yyreturn\oldyyreturn
\yylexcontinuation
}
@g
@ \namedspot{pretendbufferswlex}The same idea can be applied to
`\locallink{pretendbuffersw}pretend buffer switching\endlink'. Whenever
the `real' \ld\ parser encounters an \prodstyle{INCLUDE} command, it
switches the input buffer for the lexer and waits for the lexer to
return the tokens from the file it just opened. When the lexer scans
the end of the included file, it returns a special token, \prodstyle{END}, that
completes the appropriate production and lets the parser continue with
its job.
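\par
To make the mechanism being simulated concrete, here is a schematic
fragment (purely illustrative, and not taken from \ld\ or from this
package) of how such buffer switching is ordinarily arranged in a
non-reentrant \flex\ scanner; the helper name \.{push\_include} is
hypothetical, while the \.{yy}-prefixed calls are standard \flex\
buffer management primitives:
$$
\vbox{\halign{#\hfil\cr
\.{void push\_include( const char *name ) /* called when INCLUDE is seen */}\cr
\.{\{}\cr
\quad\.{FILE *f = fopen( name, "r" );}\cr
\quad\.{if ( f ) /* save the current buffer, redirect the scanner to the new file */}\cr
\qquad\.{yypush\_buffer\_state( yy\_create\_buffer( f, YY\_BUF\_SIZE ) );}\cr
\.{\}}\cr
\noalign{\smallskip}
\.{<<EOF>> \{ /* the matching rule in the scanner itself */}\cr
\quad\.{yypop\_buffer\_state(); /* resume the including file, if any */}\cr
\quad\.{if ( YY\_CURRENT\_BUFFER ) return END; /* complete the INCLUDE production */}\cr
\quad\.{yyterminate();}\cr
\.{\}}\cr
}}
$$
\par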
We would like to simulate the file inclusion by inserting the
appropriate end-of-file marker for the lexer (a double
\.{\\yyeof}) into the input. After the relevant production completes, the marker
has to be removed from the input stream (the lexer is designed to
leave it intact so that the end of file can be read multiple times
while the lexer is looking for the longest match).
The macro below is designed to handle this task. The idea is to replace
the double \.{\\yyeof} at the beginning of the input with an appropriate
lexer action. The \.{\\yyreadinput} macro handles the input buffer and inserts the
tail portion of the current \flex\ action in front of it.
@=
@G(t)
\def\ldcleanyyeof#1\yylextail{%
\yyreadinput{\ldcl@@anyyeof{#1\yylextail}}{\romannumeral0\yyr@@@@dinput}%
}
\def\ldcl@@anyyeof#1#2#3{%
#3\ldcl@@anyye@@f{#1}#2%
}
\def\ldcl@@anyye@@f#1#2\yyeof\yyeof{#1}
@g
@*1 Regular expressions.
The `heart' of any lexer is the collection of regular expressions that
describe the {\it tokens\/} of the appropriate language. The variety of
tokens recognized by \ld\ is quite extensive and is described in the
sections that follow.
Variable names, constants, and algebraic operations come first.
@<\ld\ token regular expressions@>=
@G(fs2)
<BOTH,SCRIPT,EXPRESSION,VERS_START,VERS_NODE,VERS_SCRIPT>{
"/*" {@> @[TeX_( "/ldcomment/yylexnext" );@]@=}
}
<DEFSYMEXP>{
"-" {@> @[TeX_( "/yylexreturnchar" );@]@=}
"+" {@> @[TeX_( "/yylexreturnchar" );@]@=}
{FILENAMECHAR1}{SYMBOLCHARN}* {@> @[TeX_( "/yylexreturnsym{NAME}" );@]@=}
"=" {@> @[TeX_( "/yylexreturnchar" );@]@=}
}
<MRI,EXPRESSION>{
"$"([0-9A-Fa-f])+ {@> @ @=}
([0-9A-Fa-f])+(H|h|X|x|B|b|O|o|D|d) {@> @@=}
}