
Section 14.2 Free Monoids and Languages

In this section, we will introduce the concept of a language. Languages are subsets of a certain type of monoid, the free monoid over an alphabet. After defining a free monoid, we will discuss languages and some of the basic problems relating to them. We will also discuss the common ways in which languages are defined.

Let \(A\) be a nonempty set, which we will call an alphabet. Our primary interest will be in the case where \(A\) is finite; however, \(A\) could be infinite for most of the situations that we will describe. The elements of \(A\) are called letters or symbols. Among the alphabets that we will use are \(B=\{0,1\}\), and the set of ASCII (American Standard Code for Information Interchange) characters, which we symbolize as \(ASCII\).

Definition 14.2.1 Strings over an Alphabet

A string of length \(n\), \(n \geqslant 1\), over alphabet \(A\) is a sequence of \(n\) letters from \(A\): \(a_1a_2\ldots a_n\). The null string, \(\lambda\), is defined as the string of length zero containing no letters. The set of strings of length \(n\) over \(A\) is denoted by \(A^n\). The set of all strings over \(A\) is denoted \(A^*\).

Note 14.2.2

  1. If the length of string \(s\) is \(n\), we write \(\lvert s \rvert =n\).

  2. The null string is not the same as the empty set, although they are similar in many ways.

  3. \(A^*=A^0\cup A^1\cup A^2\cup A^3\cup \cdots \textrm{ and if } i\neq j,A^i\cap A^j=\emptyset\); that is, \(\{A^0,A^1,A^2,A^3,\ldots \}\) is a partition of \(A^*\).

  4. An element of \(A\) can appear any number of times in a string.
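The partition of \(A^*\) into the sets \(A^n\) can be sketched in a few lines of Python. This is only an illustration: the helper name `strings_of_length` is hypothetical, Python strings stand in for strings over \(B\), and the empty Python string plays the role of \(\lambda\).

```python
from itertools import product

def strings_of_length(alphabet, n):
    """Return A^n: every string of length n over the given finite alphabet."""
    return ["".join(letters) for letters in product(sorted(alphabet), repeat=n)]

B = {"0", "1"}
null_string = ""   # stands in for lambda; len(null_string) == 0

# The first few blocks of the partition: B^0, B^1, B^2, B^3.
levels = [strings_of_length(B, n) for n in range(4)]
sizes = [len(level) for level in levels]   # |B^n| = |B|^n
```

Here `levels[0]` contains only the null string, and strings of different lengths never coincide, which is why the blocks are pairwise disjoint.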

Proof
Definition 14.2.4 Concatenation

Let \(a=a_1a_2\cdots a_m\) and \(b=b_1b_2\cdots b_n\) be strings of length \(m\) and \(n\), respectively. The concatenation of \(a\) with \(b\), \(a+b\), is the string \(a_1a_2\cdots a_mb_1b_2\cdots b_n\) of length \(m+n\).

The set of strings over any alphabet is a monoid under concatenation.

Note 14.2.5

  1. The null string is the identity element of \([A^*; +]\). Henceforth, we will denote the monoid of strings over \(A\) by \(A^*\).

  2. Concatenation is noncommutative, provided \(\lvert A\rvert > 1\).

  3. If \( \lvert A_1 \rvert = \lvert A_2 \rvert\), then the monoids \(A_1^*\) and \(A_2^*\) are isomorphic. An isomorphism can be defined using any bijection \(f:A_1\to A_2\). If \(a=a_1a_2\cdots a_n \in A_1^*\), the rule \(f^*(a)=f(a_1)f(a_2)\cdots f(a_n)\) defines a bijection from \(A_1^*\) onto \(A_2^*\). We will leave it to the reader to prove that for all \(a,b\in A_1^*\), \(f^*(a+b)=f^*(a)+f^*(b)\).
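The three observations above can be checked on small examples. In this sketch Python's `+` on strings plays the role of concatenation; the bijection `f` between \(\{0,1\}\) and \(\{a,b\}\) and the helper `f_star` are hypothetical choices made only for illustration.

```python
def f_star(f, s):
    """Extend a letter-to-letter map f (a dict) to whole strings."""
    return "".join(f[letter] for letter in s)

# An assumed bijection between the alphabets {0, 1} and {a, b}.
f = {"0": "a", "1": "b"}

a, b = "0110", "10"

identity_ok = ("" + a == a) and (a + "" == a)   # the null string is the identity
noncommutative = (a + b != b + a)               # "011010" versus "100110"
homomorphism_ok = f_star(f, a + b) == f_star(f, a) + f_star(f, b)
```

A check on one pair of strings is of course not a proof; it only shows what the identity \(f^*(a+b)=f^*(a)+f^*(b)\) says in a concrete case.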

The languages of the world, English, German, Russian, Chinese, and so forth, are called natural languages. In order to communicate in writing in any one of them, you must first know the letters of the alphabet and then know how to combine the letters in meaningful ways. A formal language is an abstraction of this situation.

Definition 14.2.6 Formal Language

If \(A\) is an alphabet, a formal language over \(A\) is a subset of \(A^*\).

Example 14.2.7 Some Formal Languages

  1. English can be thought of as a language over a set of letters \(A, B, \ldots, Z\), both upper and lower case, and other special symbols, such as punctuation marks and the blank. Exactly what subset of the strings over this alphabet defines the English language is difficult to pin down. This is a characteristic of natural languages that we try to avoid with formal languages.

  2. The set of all ASCII stream files can be defined in terms of a language over ASCII. An ASCII stream file is a sequence of zero or more lines followed by an end-of-file symbol. A line is defined as a sequence of ASCII characters that ends with a “new line” character. The end-of-file symbol is system-dependent.

  3. The set of all syntactically correct expressions in any computer language is a language over ASCII.

  4. A few languages over \(B\) are

    • \(L_1=\{s\in B^* \mid s \textrm{ has exactly as many 1's as it has 0's}\}\)

    • \(L_2=\{1+s+0 \mid s\in B^*\}\)

    • \(L_3=\langle 0,01\rangle\) = the submonoid of \(B^*\) generated by \(\{0,01\}\).

Investigation 14.2.1 Two Fundamental Problems: Recognition and Generation

The generation and recognition problems are basic to computer programming. Given a language, \(L\), the programmer must know how to write (or generate) a syntactically correct program that solves a problem. On the other hand, the compiler must be written to recognize whether a program contains any syntax errors.

Problem 14.2.8 The Recognition Problem

Given a formal language \(L\) over alphabet \(A\), the Recognition Problem is to design an algorithm that determines the truth of \(s\in L\) in a finite number of steps for all \(s\in A^*\). Any such algorithm is called a recognition algorithm.

Definition 14.2.9 Recursive Language

A language is recursive if there exists a recognition algorithm for it.

Example 14.2.10 Some Recursive Languages

  1. The language of syntactically correct propositions over a set of propositional variables is recursive.

  2. The three languages in item 4 of Example 14.2.7 are all recursive. Recognition algorithms for \(L_1\) and \(L_2\) should be easy for you to imagine. The reason a recognition algorithm for \(L_3\) might not be obvious is that the definition of \(L_3\) is more cryptic. It doesn't tell us what belongs to \(L_3\), just what can be used to create strings in \(L_3\). This is how many languages are defined. With a second description of \(L_3\), we can easily design a recognition algorithm. You can prove that \(L_3=\{s\in B^* \mid s=\lambda \textrm{ or } s \textrm{ starts with a 0 and has no consecutive 1's}\}\).
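As a rough sketch, recognition algorithms for the three languages might look like the following. The function names are hypothetical, and inputs are assumed to be Python strings over the characters 0 and 1. Two recognizers are given for \(L_3\): one uses the second description, the other peels off the generators 0 and 01 from the left.

```python
from itertools import product

def in_L1(s):
    """Exactly as many 1's as 0's."""
    return s.count("0") == s.count("1")

def in_L2(s):
    """Strings of the form 1 + s' + 0."""
    return len(s) >= 2 and s[0] == "1" and s[-1] == "0"

def in_L3(s):
    """Second description: null, or starts with 0 and has no consecutive 1's."""
    return s == "" or (s[0] == "0" and "11" not in s)

def in_L3_by_generators(s):
    """Original description: s is a concatenation of copies of 0 and 01."""
    i = 0
    while i < len(s):
        if s.startswith("01", i):
            i += 2          # a 1 can only appear as the tail of the generator 01
        elif s[i] == "0":
            i += 1
        else:
            return False
    return True

# Compare the two descriptions of L3 on every string of length 8 or less.
samples = ["".join(t) for n in range(9) for t in product("01", repeat=n)]
descriptions_agree = all(in_L3(s) == in_L3_by_generators(s) for s in samples)
```

The agreement check is evidence for the claimed equality of the two descriptions on short strings, not a substitute for the proof.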

Problem 14.2.12 The Generation Problem

Design an algorithm that generates or produces any string in \(L\). Here we presume that \(A\) is either finite or countably infinite; hence, \(A^*\) is countable by Theorem 14.2.1, and \(L \subseteq A^*\) must be countable. Therefore, the generation of \(L\) amounts to creating a list of strings in \(L\). The list may be either finite or infinite, and you must be able to show that every string in \(L\) appears somewhere in the list.

Proof
Example 14.2.14

Since all of the languages in item 4 of Example 14.2.7 are recursive, they must have generating algorithms. The one given in the proof of Theorem 14.2.13 is not usually the most efficient. You could probably design more efficient generating algorithms for \(L_2\) and \(L_3\); however, a better generating algorithm for \(L_1\) is not quite so obvious.
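One familiar way to turn a recognition algorithm into a generating algorithm is to list all of \(A^*\), shortest strings first, and keep the strings that the recognizer accepts. The sketch below takes that approach for \(L_1\); whether it matches the proof referred to above is an assumption, and the names `all_strings` and `generate` are hypothetical.

```python
from itertools import count, islice, product

def all_strings(alphabet):
    """Yield every string over a finite alphabet, in order of increasing length."""
    letters = sorted(alphabet)
    for n in count(0):
        for letters_chosen in product(letters, repeat=n):
            yield "".join(letters_chosen)

def generate(alphabet, recognizer):
    """Lazily list the language accepted by recognizer, filtering all of A*."""
    return (s for s in all_strings(alphabet) if recognizer(s))

def in_L1(s):
    """Recognizer for L1: as many 1's as 0's."""
    return s.count("0") == s.count("1")

first_L1 = list(islice(generate("01", in_L1), 6))
# first_L1 == ["", "01", "10", "0011", "0101", "0110"]
```

Every string of \(L\) appears after finitely many steps because it has some finite position in the listing of \(A^*\). The cost is that most candidates are rejected, and for a finite language the generator never halts on its own, which is one sense in which this method is inefficient.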

The recognition and generation problems can vary in difficulty depending on how a language is defined and what sort of algorithms we allow ourselves to use. This is not to say that the means by which a language is defined determines whether it is recursive. It just means that the truth of the statement “\(L\) is recursive” may be more difficult to determine with one definition than with another. We will close this section with a discussion of grammars, which are standard forms of definition for a language. When we restrict ourselves to only certain types of algorithms, we can affect our ability to determine whether \(s\in L\) is true. In defining a recursive language, we do not restrict ourselves in any way in regard to the type of algorithm that will be used. In the next section, we will consider machines called finite automata, which can only perform simple algorithms.

One common way of defining a language is by means of a phrase structure grammar (or grammar, for short). The set of strings that can be produced using a set of grammar rules is called a phrase structure language.

Example 14.2.15 Zeros before Ones

We can define the set of all strings over \(B\) for which all 0's precede all 1's as follows. Define the starting symbol \(S\) and establish rules that \(S\) can be replaced with any of the following: \(\lambda\), \(0S\), or \(S1\). These replacement rules are usually called production rules. They are usually written in the format \(S\to \lambda\), \(S\to 0S\), and \(S\to S1\). Now define \(L\) to be the set of all strings that can be produced by starting with \(S\) and applying the production rules until \(S\) no longer appears. The strings in \(L\) are exactly the ones that are described above.

Definition 14.2.16 Phrase Structure Grammar

A phrase structure grammar consists of four components:

  1. A nonempty finite set of terminal characters, \(T\). If the grammar is defining a language over \(A\), \(T\) is a subset of \(A^*\).

  2. A finite set of nonterminal characters, \(N\).

  3. A starting symbol, \(S\in N\).

  4. A finite set of production rules, each of the form \(X\to Y\), where \(X\) and \(Y\) are strings over \(A\cup N\) such that \(X\neq Y\) and \(X\) contains at least one nonterminal symbol.

If \(G\) is a phrase structure grammar, \(L(G)\) is the set of strings that can be produced by starting with \(S\) and applying the production rules a finite number of times until no nonterminal characters remain. If a language can be defined by a phrase structure grammar, then it is called a phrase structure language.
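To make the definition of \(L(G)\) concrete, the following sketch enumerates the strings of \(L(G)\) up to a chosen length by a breadth-first search over the strings derivable from \(S\). The function `language_up_to`, the representation of rules as pairs of Python strings, and the pruning bound `slack` are all assumptions made for this illustration; the bound is adequate for the small grammar of Example 14.2.15 used here but is not a general-purpose guarantee.

```python
from collections import deque

def language_up_to(rules, start, nonterminals, max_len, slack=2):
    """Return the strings of L(G) having at most max_len terminal characters.

    rules is a list of (left, right) string pairs; the null string is "".
    Intermediate strings longer than max_len + slack are discarded.
    """
    seen = {start}
    queue = deque([start])
    produced = set()
    while queue:
        form = queue.popleft()
        if not any(ch in nonterminals for ch in form):
            produced.add(form)          # no nonterminals remain
            continue
        for left, right in rules:
            pos = form.find(left)
            while pos != -1:            # apply the rule at every position
                new = form[:pos] + right + form[pos + len(left):]
                terminals = sum(ch not in nonterminals for ch in new)
                if (terminals <= max_len and len(new) <= max_len + slack
                        and new not in seen):
                    seen.add(new)
                    queue.append(new)
                pos = form.find(left, pos + 1)
    return produced

# The grammar of Example 14.2.15: S -> lambda, S -> 0S, S -> S1.
RULES = [("S", ""), ("S", "0S"), ("S", "S1")]
zeros_then_ones = language_up_to(RULES, "S", {"S"}, 3)
```

For this grammar the result is the set of strings of length 3 or less in which no 1 is followed by a 0, in agreement with the description in Example 14.2.15.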

Example 14.2.17 Alternating bits language

The language over \(B\) consisting of strings of alternating 0's and 1's is a phrase structure language. It can be defined by the following grammar:

  1. Terminal characters: \(\lambda\), \(0\), and \(1\)

  2. Nonterminal characters: \(S\), \(T\), and \(U\)

  3. Starting symbol: \(S\)

  4. Production rules: \[ \begin{array}{ccc} S\to T & S\to U & S\to \lambda\\ S\to 0& & S\to 1\\ S\to 0T&& S\to 1U \\ T\to 10T&& T\to 10\\ U\to 01U& & U\to 01\\ \end{array} \]

These rules can be visualized with a graph:

Figure 14.2.18 Production rules for the language of alternating 0's and 1's

We can verify that a string such as 10101 belongs to the language by starting with \(S\) and producing 10101 using the production rules a finite number of times: \(S\to 1U\to 101U\to 10101\).
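A derivation such as the one above can be checked mechanically: each string in the chain must follow from the previous one by a single application of a production rule. The sketch below encodes the rules of this example as pairs of Python strings, with the empty string for \(\lambda\); the helper names `one_step` and `is_derivation` are hypothetical.

```python
# Production rules of the alternating bits grammar as (left, right) pairs.
RULES = [
    ("S", "T"), ("S", "U"), ("S", ""),
    ("S", "0"), ("S", "1"),
    ("S", "0T"), ("S", "1U"),
    ("T", "10T"), ("T", "10"),
    ("U", "01U"), ("U", "01"),
]

def one_step(x, y, rules=RULES):
    """True if y results from x by applying one rule at one position."""
    for left, right in rules:
        pos = x.find(left)
        while pos != -1:
            if x[:pos] + right + x[pos + len(left):] == y:
                return True
            pos = x.find(left, pos + 1)
    return False

def is_derivation(chain, rules=RULES):
    """True if chain starts at S and every consecutive pair is one valid step."""
    return chain[0] == "S" and all(
        one_step(x, y, rules) for x, y in zip(chain, chain[1:])
    )

valid = is_derivation(["S", "1U", "101U", "10101"])   # the chain from the text
invalid = is_derivation(["S", "1U", "111"])           # no rule rewrites U as 11
```

Checking a given derivation is much easier than finding one, which is the harder part of the recognition problem for a grammar-defined language.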

Example 14.2.19 Valid Sage Variables

Let \(G\) be the grammar with components:

  1. Terminal symbols: all letters of the alphabet (both upper and lower case), the digits 0 through 9, and the underscore

  2. Nonterminal symbols: \(\{I, X\}\)

  3. Starting symbol: \(I\)

  4. Production rules: \(I \to \alpha\) for any letter \(\alpha\); \(I \to \alpha+X\) for any letter \(\alpha\); \(X\to \beta+X\) for any letter, digit, or underscore \(\beta\); and \(X \to \beta\) for any letter, digit, or underscore \(\beta\). There are a total of \(52+52+63+63=230\) production rules for this grammar. The language \(L(G)\) consists of all valid Sage variable names.
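The count of 230 rules can be confirmed by writing the rules out explicitly. In this sketch each rule is a pair of Python strings, and the standard `string` module supplies the 52 letters and 10 digits; the variable names are chosen only for illustration.

```python
import string

letters = string.ascii_letters                        # 52 letters, both cases
tail = string.ascii_letters + string.digits + "_"     # 63 letters, digits, underscore

rules = (
    [("I", a) for a in letters]            # I -> alpha
    + [("I", a + "X") for a in letters]    # I -> alpha + X
    + [("X", b + "X") for b in tail]       # X -> beta + X
    + [("X", b) for b in tail]             # X -> beta
)
rule_count = len(rules)   # 52 + 52 + 63 + 63 = 230
```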

Example 14.2.20 Backus-Naur Form

Backus-Naur form (BNF) is a popular alternate form of defining the production rules in a grammar. If the production rules \(A\to B_1, A\to B_2,\ldots, A\to B_n\) are part of a grammar, they would be written in BNF as \(A ::=B_1 \mid B_2\mid \cdots \mid B_n\). The symbol \(\mid\) in BNF is read as “or,” while the \(::=\) is read as “is defined as.” Additional notations of BNF are that \(\{x\}\) represents zero or more repetitions of \(x\) and \([y]\) means that \(y\) is optional.

A BNF version of the production rules for a Sage variable, \(I\), is \[\begin{array}{l} \textrm{letter} ::=a \mid b \mid c \mid \cdots \mid z\mid A\mid B\mid \cdots \mid Z\\ \textrm{digit}::=0 \mid 1 \mid \cdots \mid 9\\ I::= \textrm{letter}\{\textrm{letter} \mid \textrm{digit} \mid \_\}\\ \end{array} \]
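The BNF rule for \(I\) translates almost directly into a recognition algorithm: check that the first character is a letter and that every later character is a letter, digit, or underscore. The sketch below follows the grammar exactly as written in this example, so it rejects names that begin with an underscore and does not exclude reserved words; the function name `is_identifier` is hypothetical.

```python
import string

LETTERS = set(string.ascii_letters)
TAIL = LETTERS | set(string.digits) | {"_"}

def is_identifier(s):
    """Recognize I ::= letter{letter | digit | _} as given in the text."""
    return len(s) >= 1 and s[0] in LETTERS and all(ch in TAIL for ch in s[1:])
```

For example, `is_identifier("a_b2")` is true, while `is_identifier("1x")` and `is_identifier("")` are false.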

Example 14.2.21 The language of simple arithmetic expressions

An arithmetic expression can be defined in BNF. For simplicity, we will consider only expressions obtained using addition and multiplication of integers. The terminal symbols are (, ), +, *, -, and the digits 0 through 9. The nonterminal symbols are \(E\) (for expression), \(T\) (term), \(F\) (factor), and \(N\) (number). The starting symbol is \(E\). Production rules are \begin{equation*}\begin{array}{c} E ::=E+T \mid T\\ T ::=T * F \mid F\\ F ::=(E)\mid N\\ N ::=[-]\textrm{digit} \{\textrm{digit}\}\\ \end{array} \end{equation*}
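One way to recognize this language is a recursive-descent parser with one function per nonterminal. The sketch below is an illustration under stated assumptions: the left-recursive rules \(E ::= E+T\) and \(T ::= T*F\) are handled with loops (a standard substitution, not something the grammar itself prescribes), no blanks are allowed in the input, and the names `parse` and `in_language` are hypothetical. As a by-product the parser returns the value of the expression.

```python
DIGITS = set("0123456789")

def parse(text):
    """Return the value of text if it is in the language; raise ValueError if not."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def expression():            # E ::= E+T | T, read as T {+T}
        nonlocal pos
        value = term()
        while peek() == "+":
            pos += 1
            value += term()
        return value

    def term():                  # T ::= T*F | F, read as F {*F}
        nonlocal pos
        value = factor()
        while peek() == "*":
            pos += 1
            value *= factor()
        return value

    def factor():                # F ::= (E) | N
        nonlocal pos
        if peek() == "(":
            pos += 1
            value = expression()
            if peek() != ")":
                raise ValueError("expected ')'")
            pos += 1
            return value
        return number()

    def number():                # N ::= [-]digit{digit}
        nonlocal pos
        start = pos
        if peek() == "-":
            pos += 1
        first_digit = pos
        while peek() in DIGITS:
            pos += 1
        if pos == first_digit:
            raise ValueError("expected a digit")
        return int(text[start:pos])

    value = expression()
    if pos != len(text):
        raise ValueError("unexpected character")
    return value

def in_language(text):
    """Recognition algorithm: True exactly when parse succeeds."""
    try:
        parse(text)
        return True
    except ValueError:
        return False
```

With these definitions `parse("2+3*4")` gives 14 and `parse("(2+3)*4")` gives 20, reflecting the precedence built into the grammar, while `in_language("2-3")` is false because the grammar uses the minus sign only as an optional sign on a number.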

One particularly simple type of phrase structure grammar is the regular grammar.

Definition 14.2.22 Regular Grammar

A regular (right-hand form) grammar is a grammar whose production rules are all of the form \(A\to t\) and \(A\to tB\), where \(A\) and \(B\) are nonterminal and \(t\) is terminal. A left-hand form grammar allows only \(A \to t\) and \(A\to Bt\). A language that has a regular phrase structure grammar is called a regular language.
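Whether a grammar is in right-hand form can be tested rule by rule. In the sketch below a rule is a pair whose right side is a tuple of symbols, and \(\lambda\) is represented by the empty string and counted among the terminals, following the list of terminal characters in Example 14.2.17; that reading, and the name `is_right_regular`, are assumptions made for this illustration.

```python
import string

def is_right_regular(rules, terminals, nonterminals):
    """True if every rule has the form A -> t or A -> tB."""
    for left, right in rules:
        if left not in nonterminals:
            return False
        form_t = len(right) == 1 and right[0] in terminals
        form_tB = (len(right) == 2 and right[0] in terminals
                   and right[1] in nonterminals)
        if not (form_t or form_tB):
            return False
    return True

# Grammar of Example 14.2.19 (variable names).
letters = string.ascii_letters
tail = letters + string.digits + "_"
name_rules = (
    [("I", (a,)) for a in letters] + [("I", (a, "X")) for a in letters]
    + [("X", (b, "X")) for b in tail] + [("X", (b,)) for b in tail]
)
names_regular = is_right_regular(name_rules, set(tail), {"I", "X"})

# Grammar of Example 14.2.15: S -> lambda, S -> 0S, S -> S1.
zero_one_rules = [("S", ("",)), ("S", ("0", "S")), ("S", ("S", "1"))]
zeros_regular = is_right_regular(zero_one_rules, {"", "0", "1"}, {"S"})
```

The second grammar fails the test because of the rule \(S\to S1\). A failed test says only that this particular grammar is not in right-hand form; it says nothing about whether the language has some other grammar that is.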

Example 14.2.23

  1. The set of Sage variable names is a regular language since the grammar by which we defined the set is a regular grammar.

  2. The language of all strings for which all 0's precede all 1's (Example 14.2.15) is regular; however, the grammar by which we defined this set is not regular. Can you define these strings with a regular grammar?

  3. The language of arithmetic expressions is not regular.

Subsection 14.2.1 Exercises for Section 14.2

1

  1. If a computer is being designed to operate with a character set of 350 symbols, how many bits must be reserved for each character? Assume each character will use the same number of bits.

  2. Do the same for 3,500 symbols.

2

It was pointed out in the text that the null string and the null set are different. The former is a string and the latter is a set, two different kinds of objects. Discuss how the two are similar.

3

What sets of strings are defined by the following grammar?

  1. Terminal symbols: \(\lambda\), 0, and 1

  2. Nonterminal symbols: \(S\) and \(E\)

  3. Starting symbol: \(S\)

  4. Production rules: \(S\to 0S0, S \to 1S1, S\to E, E \to \lambda, E\to 0, E\to 1\)

4

What sets of strings are defined by the following grammar?

  1. Terminal symbols: \(\lambda\), \(a\), \(b\), and \(c\)

  2. Nonterminal symbols: \(S, T, U \textrm{ and } E\)

  3. Starting symbol: \(S\)

  4. Production rules: \[\begin{array}{ccc} S\to aS & S \to T & T\to bT\\ T\to U & U \to cU & U \to E\\ & E\to \lambda &\\ \end{array}\]

5

Define the following languages over \(B\) with phrase structure grammars. Which of these languages are regular?

  1. The strings with an odd number of characters.

  2. The strings of length 4 or less.

  3. The palindromes, strings that are the same backwards as forwards.

6

Define the following languages over \(B\) with phrase structure grammars. Which of these languages are regular?

  1. The strings with more 0's than 1's.

  2. The strings with an even number of 1's.

  3. The strings for which all 0's precede all 1's.

7

Prove that if a language over \(A\) is recursive, then its complement is also recursive.

8

Use BNF to define the grammars in Exercises 3 and 4.

9

  1. Prove that if \(X_1, X_2, \ldots\) is a countable sequence of countable sets, the union of these sets, \(\bigcup_{i=1}^{\infty} X_i\), is countable.

  2. Using the fact that the countable union of countable sets is countable, prove that if \(A\) is countable, then \(A^*\) is countable.
