Posted in development, software-engineering on November 15, 2019 by Adrian Wyssmann ‐ 7 min read
Did you ever want to describe a syntax of a given language? Yes, then this post might be interesting for you. I learned this years ago but have forgotten about. Just found an old document and tough, would be nice to share.
BNF is an acronym for Backus Naur Form. John Backus and Peter Naur introduced for the first time a formal notation to describe the syntax of a given language.
BNF is a formal notation to describe the syntax of a given language.
BNF is one of the most commonly used meta-syntactic notations for specifying the syntax of programming languages, command sets, and the like. It is widely used for language descriptions but seldom documented anywhere (how do you document a meta-syntax?), so that it must usually be learned by osmosis (see RFC 2234.
Most of BNF was introduced by Backus in a report presented at an earlier UNESCO conference on ALGOL 58. Few read the report, but when Peter Naur read it he was surprised at some of the differences he found between his and Backus’s interpretation of ALGOL 58. He decided that for the successor to ALGOL, all participants of the first design had come to recognize some weaknesses, should be given in a similar form so that all participants should be aware of what they were agreeing to. He made a few modifications that are almost universally used and drew up on his own the BNF for ALGOL 60 at the meeting where it was designed. Depending on how you attribute presenting it to the world, it was either by Backus in 59 or Naur in 60.
A grammar written in this way defines rules of syntax. These rules specify what sequences of symbols are allowed to occur in a legal program and in what ways these symbols may be combined. They do not specify how such sequences are to be interpreted. That requires semantic rules, which attach meaning to the syntactic structures of the grammar.
Most programming languages are today defined as having grammars written in some extension of BNF. The precise layout and use of meta-symbols varies. Here we consider an Extended BNF (EBNF) typical of the variants used to describe the C programming language family.
All versions of BNF and EBNF have basically the same elements.
The meta-symbols of BNF are:
| symbols | meaning |
|---|---|
::= | is defined as |
| | (pipe) pipe means a logical or |
< ... > | angle brackets used to surround category names |
The angle brackets distinguish syntax rules names (non-terminal symbols) from terminal symbols which are written exactly as they are to be represented. A BNF rule defining a non-terminal has the form:
<nonterminal> ::= sequence_of_alternatives consisting of strings of
terminals or non-terminals
separated by the meta-symbol |<program> ::= program
<declaration_sequence>
begin
<statements_sequence>
end ;This shows that a mini-language program consists of the keyword program followed by the declaration sequence, then the keyword begin and the statements sequence, finally the keyword end and a semicolon.
In fact, many authors have introduced some slight extensions of BNF for the ease of use:. So here are all rules:
| symbols | meaning | example |
|---|---|---|
::= | is defined as | |
| | or(&&) | |
< ... > | angle brackets used to surround category names | |
[ ... ] | optional items are enclosed in meta symbols [ and ] | (1) |
{ ... } | repetitive items (zero or more times) are enclosed in meta symbols { and } | (2) |
" ... " | terminals of only one character are surrounded by quotes " to distinguish them from meta-symbols | (3) |
In recent text books, terminal and non-terminal symbols are distinguished by using bold faces for terminals and suppressing < and > around non-terminals. This improves greatly the readability. The example then becomes:
if_statement ::= if boolean_expression then
statement_sequence
[ else
statement_sequence ]
end if ";"
(1) showing an optional item
<if_statement> ::= if <boolean_expression> then
<statement_sequence>
[ else
<statement_sequence> ]
end if ;(2) showing repetitive items
<identifier> ::= <letter> { <letter> | <digit> }this rule is equivalent to the recursive rule:
<identifier> ::= <letter> |
<identifier> [ <letter> | <digit> ](3) ; is surrounded by quotes "
<statement_sequence> ::= <statement> { ";" <statement> }Consider this BNF for a US postal address:
<postal-address> ::= <name-part> <street-address> <zip-part>
<personal-part> ::= <name> | <initial> "."
<name-part> ::= <personal-part> <last-name> [<jr-part>] <EOL>
| <personal-part> <name-part>
<street-address> ::= [<apt>] <house-num> <street-name> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>This translates into English as:
postal-address consists of a name-part, followed by a street-address part, followed by a zip-code partpersonal-part consists of either a first name or an initial followed by a dot.name-part consists of either: a personal-part followed by a last name followed by an optional jr-part (Jr., Sr., or dynastic number) and end-of-line, or a personal part followed by a name part (this rule illustrates the use of recursion in BNFs, covering the case of people who use multiple first and middle names and/or initials).street address consists of an optional apartment specifier, followed by a street number, followed by a street name.zip-part consists of a town-name, followed by a comma, followed by a state code, followed by a ZIP-code followed by an end-of-line.Note that many things (such as the format of a personal-part, apartment specifier, or ZIP-code) are left unspecified. These lexical details are presumed to be obvious from context or specified somewhere nearby.
There are many variants and extensions of BNF, possibly containing some or all of the
regexp (regular experiences) wild cards such as * or +. A regular expression is a sequence of characters with the following meanings
| symbols | meaning |
|---|---|
| An ordinary character | An ordinary character (not one of the special characters discussed below) matches that character. A backslash \ followed by any special character matches the special character itself. The special characters are (RE means regular expression!) |
| . | . matches any character except NEWLINE |
* (Kleen-star) | RE* matches zero or more occurrences of RE (repetition). If there is any choice, the longest leftmost matching string is chosen |
^ | at the beginning of an RE matches the start of a line |
$ | at the end of an RE matches the end of a line |
\< | matches the beginning of a word |
\> | matches the end of a word |
\b | often \> and \< are replaced by \b, the special character for “word boundary” |
[string] | matches any one character in that string. If the first character of the string is a ^it matches any character (except NEWLINE) and the remaining characters in the string |
- | may be used to indicate a range of consecutive ASCII characters (e.g. [A-Z] refer all letters between A and Z) |
\( RE \) | matches whatever RE matches. |
\m\ | RE\m\ matches m occurrences of RE.RE\m,\ matches m or more occurrences of RERE\m,n\ matches between m and n occurrences |
The concatenation of REs is a RE that matches the concatenation of the strings matched by each RE.
Checkout https://www.regular-expressions.info/ for more details about regular expressions.