Did you ever want to describe a syntax of a given language? Yes, then this post might be interesting for you. I learned this years ago but have forgotten about. Just found an old document and tough, would be nice to share.

Extended Backus Naur Form (EBNF)

What is BNF notation?

BNF is an acronym for Backus Naur Form. John Backus and Peter Naur introduced for the first time a formal notation to describe the syntax of a given language.

BNF is a formal notation to describe the syntax of a given language.

BNF is one of the most commonly used meta-syntactic notations for specifying the syntax of programming languages, command sets, and the like. It is widely used for language descriptions but seldom documented anywhere (how do you document a meta-syntax?), so that it must usually be learned by osmosis (see RFC 2234.

Most of BNF was introduced by Backus in a report presented at an earlier UNESCO conference on ALGOL 58. Few read the report, but when Peter Naur read it he was surprised at some of the differences he found between his and Backus's interpretation of ALGOL 58. He decided that for the successor to ALGOL, all participants of the first design had come to recognize some weaknesses, should be given in a similar form so that all participants should be aware of what they were agreeing to. He made a few modificiations that are almost universally used and drew up on his own the BNF for ALGOL 60 at the meeting where it was designed. Depending on how you attribute presenting it to the world, it was either by Backus in 59 or Naur in 60.

A grammar written in this way defines rules of syntax. These rules specify what sequences of symbols are allowed to occur in a legal program and in what ways these symbols may be combined. They do not specify how such sequences are to be interpreted. That requires semantic rules, which attach meaning to the syntactic structures of the grammar.

What is EBNF notation?

Most programming languages are today defined as having grammars written in some extension of BNF. The precise layout and use of meta-symbols varies. Here we consider an Extended BNF (EBNF) typical of the variants used to describe the C programming language family.

Rules

All versions of BNF and EBNF have basically the same elements.

  • A set of rules is specified. These are known as production rules. They define the patterns or sequences of symbols allowed in the language.
  • Each production rule defines the pattern that represents a named structured part of the language, such as an expression or a statement. The name of such a part is called a non-terminal symbol in the language.
  • The basic building blocks of the language are symbols which stand for themselves. These can be individual characters or combinations such as keywords and arithmetic operators. These basic elements of the language are called terminal symbols.
  • Each rule contains the name of the non-terminal being defined, followed by some convention that says the rest of the rule now starts. This is then followed by the sequence or alternative sequences allowed for that symbol. A defining sequence can contain any terminal symbols allowed for that language. It can also contan non-terminal symbols defined in other rules. If a non-terminal is used without being defined, the grammar for the language being defined is incomplete.
  • The definition of a rule can also contain the symbol being defined by that rules as part of its definition. This is called recursive definition. Any symbol defined in this way must have an alternative definition which does not contain itself, otherwise the grammar is again incomplete.

meta-symbols

The meta-symbols of BNF are:

symbols meaning
::= is defined as
| (pipe) pipe means a logical or
< ... > angle brackets used to surround category names

The angle brackets distinguish syntax rules names (non-terminal symbols) from terminal symbols which are written exactly as they are to be represented. A BNF rule defining a non-terminal has the form:

<nonterminal> ::= sequence_of_alternatives consisting of strings of
                  terminals or non-terminals 
                  separated by the meta-symbol |

For example, the BNF production for a mini-language is

<program> ::= program
                 <declaration_sequence>
              begin
                 <statements_sequence>
              end ;

This shows that a mini-language program consists of the keyword program followed by the declaration sequence, then the keyword begin and the statements sequence, finally the keyword end and a semicolon.

Rules of EBNF

In fact, many authors have introduced some slight extensions of BNF for the ease of use:. So here are all rules:

symbols meaning example
::= is defined as
| or(&&)
< ... > angle brackets used to surround category names
[ ... ] optional items are enclosed in meta symbols [ and ] (1)
{ ... } repetitive items (zero or more times) are enclosed in meta symbols { and } (2)
" ... " terminals of only one character are surrounded by quotes " to distinguish them from meta-symbols (3)

In recent text books, terminal and non-terminal symbols are distinguished by using bold faces for terminals and suppressing < and > around non-terminals. This improves greatly the readability. The example then becomes:

if_statement ::= if boolean_expression then
                     statement_sequence
                 [ else
                     statement_sequence ]
                 end if ";"

Examples

(1) showing an optional item

<if_statement> ::=  if <boolean_expression> then
    <statement_sequence>
[ else
    <statement_sequence> ]
end if ;

(2) showing repetitive items

<identifier> ::= <letter> { <letter> | <digit> }

this rule is equivalent to the recursive rule:

<identifier> ::= <letter> |
                 <identifier> [ <letter> | <digit> ]

(3) ; is surrounded by quotes "

<statement_sequence> ::= <statement> { ";" <statement> }

A concrete example

Consider this BNF for a US postal address:

<postal-address> ::= <name-part> <street-address> <zip-part>

<personal-part> ::= <name> | <initial> "."

<name-part> ::= <personal-part> <last-name> [<jr-part>] <EOL>
              | <personal-part> <name-part>

<street-address> ::= [<apt>] <house-num> <street-name> <EOL>

<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>

This translates into English as:

  • A postal-address consists of a name-part, followed by a street-address part, followed by a zip-code part
  • A personal-part consists of either a first name or an initial followed by a dot.
  • A name-part consists of either: a personal-part followed by a last namefollowed by an optional jr-part (Jr., Sr., or dynastic number) and end-of-line, or a personal part followed by a name part (this rule illustrates the use of recursion in BNFs, covering the case of people who use multiple first and middle names and/or initials).
  • A street address consists of an optional apartment specifier, followed by a street number, followed by a street name.
  • A zip-part consists of a town-name, followed by a comma, followed by a state code, followed by a ZIP-code followed by an end-of-line.

Note that many things (such as the format of a personal-part, apartment specifier, or ZIP-code) are left unspecified. These lexical details are presumed to be obvious from context or specified somewhere nearby.

Wildcards

There are many variants and extensions of BNF, possibly containing some or all of the regexp (regular experiences) wild cards such as * or +. A regular expression is a sequence of characters with the following meanings

symbols meaning
An ordinary character An ordinary character (not one of the special characters discussed below) matches that character.
A backslash\ followed by any special character matches the special character itself. The special characters are (RE means regular expression!)
. . matches any character except NEWLINE
* (Kleen-star) RE* matches zero or more occurrences of RE (repetition). If there is any choice, the longest leftmost matching string is chosen
^ at the beginning of an RE matches the start of a line
$ at the end of an RE matches the end of a line
\< matches the beginning of a word
\> matches the end of a word
\b often \> and \< are replaced by \b, the special character for "word boundary"
[string] matches any one character in that string. If the first character of the string is a ^it matches any character (except NEWLINE) and the remaining characters in the string
- may be used to indicate a range of consecutive ASCII characters (e.g. [A-Z] refer all letters between A and Z)
\( RE \) matches whatever RE matches.
\m\ RE\m\ matches m occurrences of RE.
RE\m,\ matches m or more occurrences of RE
RE\m,n\ matches between m and n occurrences

The concatenation of REs is a RE that matches the concatenation of the strings matched by each RE.

Checkout https://www.regular-expressions.info/ for more details about regular expressions.