Topics List 1: Finite state machines

1. Finite state machines (fsms) are introduced as abstractions or theoretical models of digital computers, in order to allow us to prove theorems about them.

Fsms may be specified formally by way of a state-transition and output matrices, or less formally by sketching a graph. We usually allow the latter, on the understanding that the formal description can be constructed when it is needed.

2. Fsms are capable of a wide variety of operations. Construct the following fsms:

(a) an fsm which reads a binary string and tells us whether it has seen an odd or even number of 1's.

(b) an fsm to add two binary integers. At each moment, the input consists of an element from the set {00, 01, 10, 11}, representing one digit from each integer, starting from the least significant end. The output is to be the sum.

3. (ASIDE) A finite state machine whose output depends on the current state and input is called a Mealy fsm. There is another variety, the Moore fsm, whose output depends only on the state which the machine is about to ENTER. Obviously if one has a Moore fsm, it is trivial to write down a Mealy fsm whose output (for any input string) is the same as the Moore one. Show how to do the converse; that is, given a Mealy fsm, show how to construct a Moore fsm whose output (for any input string) is the same as the Mealy one.

What difference would it make if the Moore machine's output depended on the state the machine is LEAVING?

4. Terminology: if we have a finite alphabet of symbols, and if L is some set of strings over the alphabet, we call L a "language" over the alphabet.

5. We are commonly interested in fsms acting as "acceptors" or "recognizers" for languages. In this case, we have no interest in the output of the fsm. Instead, the fsm has a start state, and one or more accept states. The fsm accepts a string if the string causes the machine to move from its start state to one of its accept states; it rejects it otherwise. The set of strings accepted by a particular fsm is the language accepted by that fsm, and the fsm is called an acceptor (or recognizer) for the language.

Many computational questions can be phrased in these terms. If an fsm were to exist which accepted the language {2, 3, 5, 7, 11, 13, ...} over the alphabet of decimal digits, this fsm would surely be computing the primality of the integers. (By the way, no such fsm can exist.)

Construct the following:

(a) an fsm which accepts the language {1³ⁿ| n > 0}; that is, the set of unary integers divisible by 3.

(b) an fsm to accept the set of binary integers divisible by 3. (Comment. When I first set this, I gave a hint: input the number starting at the least significant end. I had in mind an fsm requiring 6 states. The TA, however, came up with a much neater solution which required only 3 states and read the number from the left. The TA's solution had another interesting property, which led one to notice an interesting theorem in arithmetic.)

6. Prove that no fsm can recognize the language {aⁿbⁿ| n > 0}, that is to say, a set of strings in which each consists of n a's followed by n b's.

(Hint. Any string must be presented to the fsm character by character starting from the left. What happens if there are many many a's?)

What is the true essence of this proof? Does it suggest other languages which no fsm can recognize?

In effect, your proof is telling us that no digital computer can recognize these languages. Technically this is true, but is it a useful result? In the end, this will cause us to abandon the fsm as a model of computation.

7. (ASIDE) There is a "pumping" theorem which can be used to prove that certain languages cannot be accepted by fsms. Basically it tells us that in a language accepted by an fsm, any string longer than the number of states must be divisible into three parts x, y, z in such a way that xz, xyz, xyyz, xyyyz, ... are all in the language. x and z may be empty strings, but not y. The reason is that in an fsm with p states, the journey from the initial state I to some accepting state A, for any string of length p or more, must pass through some state more than once. If R is the first repeated state, denote by x the string which takes the fsm from I to R, by y the one which takes it from R back to R, and by z the one which takes it from R to A. Then xz takes the machine from I to R directly to A; xyz, xyyz, xyyyz, etc. take it from I to R, one or more times round the loop, and then from R to A.

The use of this result to show that languages are not accepted by fsms typically takes one of two forms.

The first is to show that there is nowhere to position the y section. In (6) above, for example, it can't be inside the a's, or the a's would have to pumped up without lengthening the b's; it can't be inside the b's for the analogous reason; and it can't cross the ab boundary, for you would then get something like ...abbabb... in the middle of an accepted string.

The second applies to languages such as {1ⁿ| n is a square}. If this language is accepted by an fsm, the theorem tells us that some square has to be expressable as x + y + z, where x + ky + z is also a square for any integer k >= 0. But this is clearly impossible, since the squares get further and further apart, but these integers don't.

Note that the theorem can only be used to prove the negative result: that some language is not accepted by an fsm.

It is not true that if the strings of a language can be pumped up, the language is accepted by an fsm. An example is the language of matched parenthesis strings: (()), ()(), ((())()), (), ... No fsm accepts this language, for essentially the same reason used in (6). If a string begins with many many ('s, an fsm would eventually lose count. But every string in this language has a section which can be pumped up, for example the first ().

8. We make an attempt to generalize the fsm by introducing the notion of non-deterministic fsms.

However, we can prove a theorem that:

Every language recognized by a non-deterministic fsm is is also recognized by a deterministic fsm.

The usual proof shows how, given the non-deterministic fsm, we can construct a (deterministic) fsm which models each action of the original.

9. A language L is accepted by some finite state machine F. Describe how to adapt F to recognize the language:

L^R = {w^R| w belongs to L and w^R is the reverse of w}

10. If L is the language accepted by a deterministic fsm F, show how to adapt F to recognize the language L¯ (i.e. the complement of L, the language containing all strings over the alphabet not in L). Why does your adaptation work only for deterministic fsm's?

Nonetheless, your construction does prove that the complement of a language accepted by any fsm F is also accepted by an fsm, because if F is non-deterministic, you can construct a deterministic fsm accepting L first, and then apply your adaptation.

11. There are many ways to specify languages. For a few simple languages, we have used an ad hoc notation of the form {aⁿbⁿ| n > 0}. As we have seen in (5), an fsm also defines a language. It is now appropriate to introduce a new form of expression which can be used to specify or describe certain languages: the regular expression.

The definition of a regular expression is recursive:

(a) A single letter is a regular expression.
In this case, the language specified by the regular expression consists of a single one-letter string. (And ambiguously, we use the regular expression b to describe the set {b}, hoping that the context will resolve the ambiguity.}
(b) If E and F are regular expressions, E v F is a regular expression.
The language described by E v F is the union of the languages described by E and F.
(c) If E and F are regular expressions, EF is a regular expression.
The language described by EF is the Cartesian product of the languages described by E and F; that is to say, it consists of all strings formed by concatenating a string chosen from E to the left of one chosen from F. We sometimes write E.F, but the dot-operator is usually omitted, as it is for the multiplication of variables in mathematics texts.
(d) If E is a regular expression, E* is a regular expression.
The language described by E* consists of all strings formed by concatenating zero or more strings chosen from E. So if E is a v b, E* consists of any string of a's and b's including the null-string. This operator is called the Kleene-star, in honour of its inventor.
(e) Nothing is a regular expression which cannot be made by finitely many applications
of (a), (b), (c) and (d).

In complicated regular expressions, we may include parentheses to clarify the order in which the three operators are applied: (E v F)G, (E(F*))*, ... As for arithmetic, however, we specify operator priority so as to reduce bracketing: *, dot, v. So EFvG* means ((EF)v(G*)) and EF* means E(F*).

Different regular expressions may describe the same language. (See, for example, (12a) below). This gives rise to an algebra for regular expressions; for example:

(E v F)G = EG v FG

but we will not pursue this very far.

Try to gain a careful understanding of the language (the set of strings) which a given regular expression describes. Languages which can be described by regular expressions are called regular languages.

Programs which search texts for patterns often use regular expressions to define the pattern.

12. Some exercises:

(a) Show that if E and F are regular expressions, the regular expressions (E v F)* and (E* F*)* each represent the same language.

(b) Replace each of (a v bb v ba)* and (a v (bb v ab)*)* by regular expressions not containing v.

13. We will prove (the Kleene-Myhill theorems) that:

(a) every language recognized by an fsm is regular, and

(b) every regular language is recognized by an fsm.

Later we will encounter left-linear and right-linear languages, languages which are specified by a particular kind of syntax mechanism rather different from either fsms or regular expressions. As it happens, every one of these languages too is regular, and vice versa.

14. The union, concatenation, and "star" of languages recognized by fsms are also recognized by fsms. Furthermore the complement and "reverse" of regular languages are regular.

Extra notes

E1. Canonical (or algorithmic) constructions

The proofs for the "equivalence" results in 8, 10, 13(a), 13(b), 40(b), 41, etc. are constructive. Not only do they show, for example, that every regular language is recognized by an fsm, but they also show how to construct an fsm from a regular expression. These constructions, however, are "canonical". They take us from one language description to another by "micro-modelling", that is, by converting the individual components or rules of the source. They make no use of information about what the source description is doing at a higher level, so the result is generally worse than a human might achieve.

Thus, the canonical construction converts a non-deterministic fsm with 12 states to a deterministic fsm with 4096 states (2 power 12). In a given problem, however, an intelligent human might well be able to achieve the same result with only 30 to 40 states.

There may, of course, be other algorithms which simplify the results. There is, for example, a process to minimize the states of an fsm.

E2. We have mentioned five ways to describe regular languages: finite- state machines (fsm), non-deterministic finite-state machines (nfsm), regular expressions (regexp), right-linear grammars (rlg) and left- linear grammars (llg).

To go from each one to each other, we might specify 20 canonical constructions. Theoretically, we need only five; for example:

regexp --> nfsm --> fsm --> rlg --> llg --> regexp

since, for any desired conversion, we can go around this cycle.

Similarly, we need only five translation programs for the five natural languages: English, Japanese, Russian, Chinese, French. Of course, the result of translating from English to French passing through this cycle might be rather poor!

E3. Some applications

(a) Consider the following programming problem:

A text-editor, which is going to process texts involving English, Greek Russian and Hebrew alphabets, needs more keys than there are available. So the programmer has decided to make use of the ESCAPE key (denoted by ESC) and the [ symbol. For the Greek equivalent of b, she will enters ESC b; for the Russian, ESC [ b; and for the Hebrew, ESC ESC b. The effect of these sequences lasts for one character only. If she needs two Russian characters, she has to type ESC [ before each. If a first ESC is followed by any two of ESC or [, all three are discarded.

Design an fsm which takes keys as input symbols, and outputs characters of the four alphabets. (Transitions with no output are legal.)

What changes are needed if you want to have English, Greek, Russian and Hebrew keys which take the program to an alphabet and leave it there until it is changed?

A good way to write the corresponding program is to simulate this fsm.

In fact, complex input processes are often be implemented in this way. Writing such a program in which each transition calls a small procedure is also a widely used device.

(b) Programs such as the UNIX grep permit a file to be scanned for occurrences of particular strings or patterns. Typically these patterns are specified by regular expressions.

How can we determine whether a string beginning at a particular point in a text belongs to the set described by a regular expression? Convert the regular expression to an fsm, using the process mentioned in 13(b). Then feed the characters in the text to the fsm, returning YES if the fsm enters an accepting state.