Parsing Techniques in COBOL

Introduction

Where I work, many programs do some sort of text parsing. In particular they parse service orders, which are written in a strange and barbarous tongue peculiar to the telephone industry.

In the great wide world, parsing programs -- such as compilers and interpreters -- are typically written in languages such as C, C++, or Pascal. These languages are well suited for processing text streams in complex ways. In addition, code generators such as lex and yacc can do much of the gruntwork.

At my company, service orders are processed largely by COBOL programs. Parsing presents special problems for COBOL:

  1. COBOL is designed for fixed-length fields in fixed positions, not for free-flowing text.
  2. The COBOL culture does not include any tradition in techniques of parsing, and most COBOL programmers have no background or training in the subject.
So programmers cobble together whatever seems to work. For simpler parsing problems, it may be enough to use UNSTRING, INSPECT, and whatever else comes in handy. Often, however, especially for more complex grammars, the result is an ugly mishmash of fragile, ad-hoc, hard-to-maintain code.

(Similar problems may arise at a later stage of processing. The best way to represent the results of parsing may be to use some combination of various data structures -- linked lists, trees, and the like. Neither the COBOL language nor the COBOL culture is well equipped to handle such data structures.)

These pages describe some techniques which have proven useful in the JOCKEY application. Though some of the discussion is theoretical, most of it comes from the real world of service order processing.

Isolation from physical format

The first step is to extract the text from the physical format in which it is made available.

For typical compilers, this step is mostly fairly straightforward, because the source code is already close to the proper format. This phase may include some minor reformatting. For example, a C compiler must recognize trigraphs, and then splice continued lines together.

For service orders this step is more substantial. Every kind of file has its own format for representing individual lines, and thus requires a customized approach.

JOCKEY uses the JKBLLOAD routine to load input lines into a standardized format. Other routines can then access them without regard to the original source of the text. We can reuse the parsing modules in multiple contexts.

One character at a time

Generally the next step is to break the text down into individual characters, and process them one at a time.

For COBOL programmers this approach does not come naturally. We are used to moving whole blocks of characters in and out of fixed fields. For a very simple parsing task, this traditional approach can work just fine, and more powerful techniques are overkill. For more complex grammars, however, it is cumbersome and inefficient at best.

After loading body lines with the JKBLLOAD routine, JOCKEY uses the JKBLREAD routine to read the stored input lines one character at a time.

Now the fun begins

At this point we have transformed the original input file into a stream of individual characters. But we haven't yet addressed the nub of the problem: how can a program make sense of the text?

Sometimes the simplest thing to do is to code a bunch of IF statements and PERFORM statements, in an ad hoc manner, to do whatever you need to do. That's a little vague, but there's not much more to say. The details of what you code depend on the details of what you need.

It's more interesting to consider some more generic techniques which may be applied to any parsing problem.

Lexical and semantic analysis

The usual textbook approach divides parsing into several phases: These pages mainly describe how JOCKEY performs lexical analysis. The higher levels of analysis are rudimentary, if not downright ugly.

Finite state machines

In non-trivial cases, JOCKEY relies on a finite state machine (or "FSM") to perform its lexical analysis. Each FSM is specialized for a specific task. We have one FSM to extract listing names, one to extract telephone numbers, another to extract addresses, and so forth.

Briefly, an FSM consists of a set of defined states which respond to a set of defined events. At any one time the FSM is in a single state. When an event occurs, the FSM responds with the corresponding action, depending on its current state. Each state has a different mapping of events to actions. An action may include a transition to a different state.

In the context of parsing, an event is usually an input character from the text to be parsed. The significance of a given character depends on the surrounding characters. The rules for the various states reflect those dependencies.

Eventually the FSM reaches a terminal state -- which may be either an error state (signifying a syntax error in the input text) or an accepting state (signifying the recognition of a token).

How JOCKEY approaches a service order

JOCKEY does not attempt to parse an entire service order, mainly because big chunks of the order are irrelevant to JOCKEY. We ignore some sections entirely. In other sections we look for specific kinds of lines and ignore the rest.

In any case, JOCKEY parses only a few lines at a time. Each set of lines begins with a USOC (Universal Service Order Code) or a FID (Field IDentifier) at the left side. Following the USOC or FID there is typically some additional data, which may overflow onto continuation lines.

The first step is to load the USOC or FID, together with all of its continuation lines, into memory via calls to JKBLLOAD.

For a FID, the next step may be to extract the data we want by calling a specialized routine. For example, one routine extracts the listing name. Internally, these routines call JKBLREAD to fetch the stored text one character at a time.

The next step is to look for any relevant data tags. A tag consists of a slash ('/') followed by an identifier. The tag for a telephone number, for example, is '/TN'.

Tags appear in free-form fashion, following the USOC or FID. We use a FINDTAG routine to paw through the text looking for useful tags.

When we find a tag, we may need to look for some additional data following it. For example, if we find /TN, we call a specialized routine to extract a telephone number. If we find /DZIP we call a different routine to extract a ZIP code.


[gears]Finite State Machines [crane]Data Structures [crane]Further Reading
[home]COBOL Home