16 May 1999 23:54:41 -0400

Related articles |
---|

Re: Jack W. Crenshaw - Any clues how to optimize ? mallors@ips1.msfc.nasa.gov (1999-04-19) |

Re: Jack W. Crenshaw - Any clues how to optimize ? andersh@maths.lth.se (Anders Holtsberg) (1999-05-03) |

Re: Jack W. Crenshaw - Any clues how to optimize ? mikee@cetasoft.cog (1999-05-07) |

Parsing != (either 'recursive descent' or 'operator precedence') hunk@alpha1.csd.uwm.edu (1999-05-16) |

From: | hunk@alpha1.csd.uwm.edu (Mark William Hopkins) |

Newsgroups: | comp.compilers |

Date: | 16 May 1999 23:54:41 -0400 |

Organization: | University of Wisconsin - Milwaukee, Computing Services Division |

References: | 99-04-067 99-05-010 99-05-021 |

Keywords: | parse |

On 3 May 1999 14:44:50 -0400, Anders Holtsberg <andersh@maths.lth.se> wrote:

*><[Op precedence parsing more economical than rec. descent]>*

*> (Or am I saying something stupid here: for me "recursive*

*>decent" and "operator precedance" is mutually exclusive. Correct???)*

You can algebraically derive a parser from the SDT that specifies it.

This, by far, transcends anything that even remotely resembles

recursive descent or operator precedence (or anything, in actual fact,

that is published or widely known).

Note: parsers operate on tranductions, not grammars. A parser for a

'context-free language' is actually processing a syntax directed

transduction, or SDT.

An SDT can be converted to an SSDT (simple SDT) by allowing, as output

actions, operations on a 'value stack' or 'tree-building operation'.

This is why you see these items commonly in relation to parsers. The

function actually being fulfilled by having these objects around is to

convert SDT's to SSDT's, because the latter are much more tractible --

as about to be described below.

An SSDT can be specified as a list of rules of the form

N -> RE(x,y,N)

where RE is a regular expression involving input symbols (the x's), output

actions (the y's) and non-terminals (the N's).

Every SSDT is a system of inequalities of the form

N >= RE(x, y, N)

over an algebra that is contains the x's, y's and involves regular

expression operations. The algebra, itself, has the following operators

0: the 'fail' symbol

1: the 'empty word'

A+B: to indicate alteration (may also be denoted A|B).

A B: to indicate concatenation

A*: to indicate iteration (repetitions of 0, 1, 2, or more A's).

It's called a Kleene algebra and is axiomatized by the following:

(w + x) + y = w + (x + y) (wx)y = w(xy)

w + 0 = w = 0 + w w1 = w = 1w

w + x = x+w w(x + y)z = wxz + wyz

w + w = w w0z = 0

with inequality defined by

u >= v if and only if u = v + x for some x

equivalently: if and only if u = u + v

and by the following axioms for the star-operator:

x* >= 1 + x x*

if x >= a b^n c for all n = 0, 1, 2, ... then x >= a b* c

A weaker axiomatization replaces the last rule by the following:

if x >= ax + xb + c then x >= a* x b*

which is a general property deriveable from the last rule. However, the

stronger axiomatization is required for what follows.

----------

The most important, and basic, properties of the >= operator are that:

(A) A + B is the least upper bound of the set { A, B }: A + B = lub {A, B}

Everything >= 0. Another way to say this: 0 = lub {}.

Every finite set { A1, A2, ..., An } has, as its least upper bound

lub { A1, A2, ..., An } = 0 + A1 + A2 + ... + An.

(B) lub(A) lub(B) = lub(A B), for all finite sets A, B

and as a consequence of the last rule:

(C) A* = lub { 1, A, A^2, A^3, ... } = lub {A}*

In general:

(D) Every rational subset RE of the algebra has a least upper bound, such

that

lub(U V) = lub(U) lub(V), for all rational subsets U, V

lub(U*) = lub(U)*

lub(U + V) = lub(U) + lub(V)

The rational subsets of an algebra are those constructed from finite sets

using the set operations:

A B = { u v: u in A, v in B }

A* = {1} union A union A^2 union A^3 ...

A + B = A union B

So a Kleene algebra is actually an image of its own rational subsets.

----------

An SSDT over alphabets X (for inputs) and Y (for outputs) is a finite system

of inequalities of the form

N >= <Expression involving N's, X's, Y's>

with a set of variables (designated above by N's) called non-terminals,

and a specially designated variable (usually denoted S) called the 'start'

symbol.

The subset of (X+Y)* specified by an SSDT will also be called an SSDT.

The fundamental property of SSDT's with respect to this representation is

that if the transduction 'generated' by S is the set L (a subset of (X+Y)*)

then the least solution for the SSDT, in the main variable S is:

S = lub L

So, every SSDT is represented by an object in the algebra: its "translation

expression".

The algebra this occurs in is defined by the following:

* It is generated by the symbols from X and Y

* xy = yx, for all x in X, y in Y

* It is closed under 'least solutions' to SSDT's.

A general property of such SSDT's is that they can be embedded in a larger

(but simpler) algebra consisting of the following:

* If is generated by the symbols from X and Y

* It includes the operator symbols from a set D:

D = { <0|, <1|, ..., <n-1|, |0>, |1>, ..., |n-1> }

where n > 1.

* xy = yx, xd = dx, yd = dy, for all x in X, y in Y and d in D

* <i| |j> = 1 if i = j, 0 else

* |0><0| + |1><1| + ... |n-1><n-1| = 0

For anyone familiar with Quantum Physics, the distinct (and actually deep)

analogy entailed by the last two properties is the reason the extra

symbols are denoted as they are.

Call this algebra C_n(X, Y). Then the following is true:

FUNDAMENTAL THEOREM OF TRANSDUCTION EXPRESSIONS:

The SSDT's are precisely those subsets, L, of (X+Y)* which have least

upper bounds in C_n(X, Y) for which

(lub L) d = d (lub L), for d in D

As such, C_n(X, Y) is a simple extension to the 'regular expression' notation

which happens to also be sufficiently powerful to represent all translation

expressions. The ones which commute with the operators in D are provably

equal to least upper bounds of the very subsets of (X+Y)* that happen to

be SSDT's.

The connection this has to parsers is via the following interpretation:

0 = error

1 = do nothing

AB = do A, then do B

A+B = do A or do B

A* = do A: 0, 1, 2, or more times

x = test the following input for equality to x, and shift if equal.

y = carry out action y

<i| = push i

|j> = pop and test for equality to j.

So, the act of solving an SSDT amounts, essentially, to writing down a

non-deterministic parser for the SSDT. Since every element of C_n(X, Y)

is the least upper bound of a rational subset of X+Y+D, then the solution

can always be expressed as a "regular expression" over the symbols X, Y

and D. This, in turn, can be expressed as (at the least solutuion to) a

finite system of inequalities of the form:

q >= sum of terms of the forms 1, z q, for z in X+Y+D

with only one inequality for each q.

This directly corresponds to a finite automaton with the interpretations

q = state

q >= ... 1 ... means q is a final state

q >= ... z q' ... means q has a transition to q' after doing z.

If the original SSDT is deterministic, this system can be written as

directly as deterministic code using the following correspondences:

* state = goto label

* state transition = goto

* final state = return (or exit)

* tests are ultimately incorporated in if-then-else statements.

To actually render tests as if-then-else statements, it may be necessary

to use the commutativity rule to pull up pop's and input symbols ahead

of everything else. For example, if

q = ... + y q' + ...

q' = x q''

then this is equivalent to:

q = ... + x y q'' + ...

The key is to get a sum of terms headed by distinct x's and |i>'s.

This corresponds closely to what are called "look-aheads" in parser

terminology.

The actual process of resolving the non-deterministic branching in favor

of if-then-else's, via 'look-ahead' substitution and other algebraic means

is where the actual work of the computation would be involved.

The other major task is to resolve is error-handling -- which can

actually be done quite effectively, once you have the derivation of

the rest of the SSDT on-hand, sitting before you.

Since all of the computation involved in transforming an SSDT to a

regular expression (or, equivalently, a finite system of equations)

more art than science this is not something that can be easily

automated. A corollary of this is that no regular or automated

process is going to come anywhere close to embodying any substantial

portion of expertise involved in doing these computations. To

actually get an automated process to do so amounts to nothing less

than writing an artificial intelligence expert system for SSDT

system-solving and Kleene algebra computation.

In comparison, the tools in the current state of the art are rather

blunt and inefficient by about a couple orders of magnitude --

especially given that the processing of a language (normally

considered 'difficult') like that of C is rather routine using the

methods based on this background information. Also, since they

generally do not embody any of the information here (because it is

still 'unknown' at this time), they represent backwards and obsolete

technology.

For a good example of a place where this has actually been applied,

look at the regular expression demo software in the regex directory of

the comp.compilers archive. The parsers used in processing the inputs

in these demos were derived by hand.

Also: listed in the free compilers archive is the CAS 8051 assembler,

which incorporates a large subset of C's expression syntax and a

fair-sized portion of its statement syntax (for doing assembler

directives). This parser was derived in a similar fashion, by hand.

In all cases, the derivation process is actually rather simple,

routine and very reliable.

Post a followup to this message

Return to the
comp.compilers page.

Search the
comp.compilers archives again.