From e9e1f22f9ca977285d27a2457ab5f2e79ba72158 Mon Sep 17 00:00:00 2001
From: =?utf8?q?Rapha=C3=ABl=20Van=20Dyck?=
Date: Tue, 6 Jan 2026 14:12:26 +0100
Subject: [PATCH] revise reference manual

---
 system-files/REFERENCE-MANUAL | 64 ++++++++++++++++++++++++++++++++---
 system-files/all-caps.css     |  1 -
 system-files/all-caps.js      |  3 +-
 3 files changed, 60 insertions(+), 8 deletions(-)

diff --git a/system-files/REFERENCE-MANUAL b/system-files/REFERENCE-MANUAL
index f51f3f3..51460c6 100644
--- a/system-files/REFERENCE-MANUAL
+++ b/system-files/REFERENCE-MANUAL
@@ -167,8 +167,62 @@

Regular Expressions

The tokenizer uses regular expressions to specify the patterns associated with the token categories.

+The following table summarizes the syntax of the regular expressions used in this document:

+Syntax                              Meaning
+'$\ldots$'                          literal string (literal single quotes and backslashes must be escaped)
+"$\ldots$"                          literal string (literal double quotes and backslashes must be escaped)
+[$\ldots$]                          positive character class (literal carets, hyphens, closing square brackets, and backslashes must be escaped)
+[^$\ldots$]                         negative character class (literal carets, hyphens, closing square brackets, and backslashes must be escaped)
+$\mlvar{char}_1$-$\mlvar{char}_2$   range of characters (inside character classes)
+$\mlvar{re}_1\mlvar{re}_2$          concatenation operation
+$\mlvar{re}_1$|$\mlvar{re}_2$       union (alternation) operation
+$\mlvar{re}_1$-$\mlvar{re}_2$       difference operation
+$\mlvar{re}$*                       zero-or-more-times (Kleene star) operation
+$\mlvar{re}$+                       one-or-more-times (Kleene plus) operation
+$\mlvar{re}$?                       zero-or-one-time (optional) operation
+$\metavar{name}\Coloneq\mlvar{re}$  definition of a named regular expression
+Literal strings can contain the following escape sequences:

+Escape sequence    Meaning
+\\\\               the backslash
+\'                 the single quote
+\"                 the double quote
+\U{$\mlvar{hex}$}  the Unicode character whose code point is represented by the hexadecimal numeral $\mlvar{hex}$
+Character classes can contain the following escape sequences:

+Escape sequence    Meaning
+\\\\               the backslash
+\^                 the caret
+\-                 the hyphen
+\]                 the closing square bracket
+\U{$\mlvar{hex}$}  the Unicode character whose code point is represented by the hexadecimal numeral $\mlvar{hex}$
+\C{$\mlvar{cat}$}  the Unicode characters whose general categories are $\mlvar{cat}$
+The zero-or-more-times, one-or-more-times, and zero-or-one-time operations have precedence over the concatenation operation, and the concatenation operation has precedence over the union and difference operations. All operations are left associative. Parentheses can be added to override these precedence and associativity rules.
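The precedence rules above agree with those of most regular expression dialects. As a rough illustration only, the sketch below uses Python's `re` module as a stand-in for the manual's notation. This is an assumption: Python does not quote literals and has no difference operation, so only the postfix, concatenation, and union operations are shown. The helper `matches` and the three pattern names are hypothetical.

```python
import re

# Python's `re` dialect stands in for the manual's notation (an assumption).
# Postfix operations bind tighter than concatenation, which binds tighter than union.
PATTERN = re.compile(r"ab*|c")          # grouped by default as (a(b*))|c
GROUPED = re.compile(r"(?:a(?:b*))|c")  # the same grouping written out explicitly
OVERRIDDEN = re.compile(r"(?:ab)*|c")   # parentheses override the default grouping


def matches(pattern, text):
    """Return True when the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ["a", "abbb", "c", "abab", "ac"]:
        print(
            repr(text),
            matches(PATTERN, text),
            matches(GROUPED, text),
            matches(OVERRIDDEN, text),
        )
```

`PATTERN` and `GROUPED` accept exactly the same strings, whereas `OVERRIDDEN` accepts `abab` and rejects `abbb`, which shows what the added parentheses change.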

+References to named regular expressions can be used wherever regular expressions can be used. References to named regular expressions denoting classes of characters can also be used inside character classes. Circular definitions are not allowed.
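The ban on circular definitions can be checked mechanically. The following is a minimal sketch, assuming the definitions have already been reduced to a dictionary that maps each name to the set of names its regular expression references. That representation and the function name `find_cycle` are hypothetical; the manual does not prescribe how the check is performed.

```python
def find_cycle(references):
    """Return a list of names forming a circular definition, or None.

    `references` maps each defined name to the set of names it references
    (a hypothetical representation chosen for this sketch).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {name: WHITE for name in references}
    path = []

    def visit(name):
        color[name] = GRAY
        path.append(name)
        for ref in sorted(references.get(name, ())):
            if color.get(ref, WHITE) == GRAY:
                # `ref` is already on the current path: a cycle is closed.
                return path[path.index(ref):] + [ref]
            if ref in references and color[ref] == WHITE:
                cycle = visit(ref)
                if cycle:
                    return cycle
        path.pop()
        color[name] = BLACK
        return None

    for name in sorted(references):
        if color[name] == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    acyclic = {"identifier": {"letter", "digit"}, "letter": set(), "digit": set()}
    circular = {"a": {"b"}, "b": {"a"}}
    print(find_cycle(acyclic))   # no circular definition
    print(find_cycle(circular))  # names along the circular chain
```

A depth-first search is used here because it reports the offending chain of names, which makes for a more useful error message than a bare yes-or-no answer.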

+Except for the parts referencing named regular expressions, regular expressions are typeset in a monospaced typeface. Spaces can be added freely outside literal strings and character classes without modifying the meaning of a regular expression.

Extended Backus-Naur Form (EBNF)

The parser and the syntax analyzer use a variant of the extended Backus-Naur form (EBNF) notation to specify various context-free grammars.

+The following table summarizes the syntax of the variant of the EBNF notation used in this document:

+Syntax                                      Meaning
+$\mlvar{lhs}\Coloneq\mlvar{rhs}$            definition of a production rule (the left-hand-side is a nonterminal symbol and the right-hand-side is a sequence of zero or more nonterminal and/or terminal symbols)
+$\metavar{nonterminal}$                     nonterminal symbol
+terminal                                    terminal symbol
+'$\mlvar{char}$'                            terminal symbol consisting of the character $\mlvar{char}$ ($\mlvar{char}\neq\code{'}$)
+"$\mlvar{char}$"                            terminal symbol consisting of the character $\mlvar{char}$ ($\mlvar{char}\neq\code{"}$)
+'$\mlvar{char}_1\ldots\mlvar{char}_n$'      abbreviation for '$\mlvar{char}_1$''$\mlvar{char}_n$' ($\mlvar{char}_i\neq\code{'}$)
+"$\mlvar{char}_1\ldots\mlvar{char}_n$"      abbreviation for "$\mlvar{char}_1$""$\mlvar{char}_n$" ($\mlvar{char}_i\neq\code{"}$)
+$\epsilon$                                  empty sequence of symbols
+$\mlvar{rhs}_1\mid\cdots\mid\mlvar{rhs}_n$  union (alternation) operation
+$\mlvar{symbol}\ast$                        zero-or-more-times (Kleene star) operation
+$\mlvar{symbol}+$                           one-or-more-times (Kleene plus) operation
+$\mlvar{symbol}?$                           zero-or-one-time (optional) operation
+$(\mlvar{rhs})$                             grouping (the group can be used wherever a symbol can be used)

Tokenizer

The tokenizer converts an input sequence of Unicode characters into a sequence of tokens in two steps. During the first step, the tokenizer converts the input sequence of Unicode characters into a provisional sequence of tokens. During the second step, the tokenizer converts the provisional sequence of tokens into a final sequence of tokens.

Character Classes

@@ -300,10 +354,10 @@
\<                 the less-than sign
\U{$\mlvar{hex}$}  the Unicode character whose code point is represented by the hexadecimal numeral $\mlvar{hex}$

-Let $\mlvar{input}$ be the input sequence of Unicode characters. For any token $T$, let us denote by $\lex(T)$ the lexeme associated with $T$ and by $\pat(T)$ the pattern associated with $T$'s category. The tokenizer must find a sequence of tokens $\langle T_0,\ldots,T_{n-1}\rangle$ such that the following conditions are satisfied:
+Let $\mlvar{input}$ be the input sequence of Unicode characters. For any token $T$, let us denote by $\lex(T)$ the lexeme associated with $T$ and by $\pat(T)$ the pattern associated with $T$'s category. The tokenizer must find a sequence of tokens $\langle T_1,\ldots,T_n\rangle$ such that the following conditions are satisfied:

Because the meaning of a program cannot be ambiguous, there cannot exist more than one sequence of tokens satisfying the previous conditions for any given input. As the following examples demonstrate, the patterns alone do not provide this guarantee: