In the great wide world, parsing programs -- such as compilers and interpreters -- are typically written in languages such as C, C++, or Pascal. These languages are well suited for processing text streams in complex ways. In addition, code generators such as lex and yacc can do much of the gruntwork.
At my company, service orders are processed largely by COBOL programs. Parsing presents special problems for COBOL:
(Similar problems may arise at a later stage of processing. The best way to represent the results of parsing may be to use some combination of various data structures -- linked lists, trees, and the like. Neither the COBOL language nor the COBOL culture is well equipped to handle such data structures.)
These pages describe some techniques which have proven useful in the JOCKEY application. Though some of the discussion is theoretical, most of it comes from the real world of service order processing.
For typical compilers, this step is mostly fairly straightforward, because the source code is already close to the proper format. This phase may include some minor reformatting. For example, a C compiler must recognize trigraphs, and then splice continued lines together.
For service orders this step is more substantial. Every kind of file has its own format for representing individual lines, and thus requires a customized approach.
JOCKEY uses the JKBLLOAD routine to load input lines into a standardized format. Other routines can then access them without regard to the original source of the text. We can reuse the parsing modules in multiple contexts.
For COBOL programmers this approach does not come naturally. We are used to moving whole blocks of characters in and out of fixed fields. For a very simple parsing task, this traditional approach can work just fine, and more powerful techniques are overkill. For more complex grammars, however, it is cumbersome and inefficient at best.
After loading body lines with the JKBLLOAD routine, JOCKEY uses the JKBLREAD routine to read the stored input lines one character at a time.
Sometimes the simplest thing to do is to code a bunch of IF statements and PERFORM statements, in an ad hoc manner, to do whatever you need to do. That's a little vague, but there's not much more to say. The details of what you code depend on the details of what you need.
It's more interesting to consider some more generic techniques which may be applied to any parsing problem.
Briefly, an FSM consists of a set of defined states which respond to a set of defined events. At any one time the FSM is in a single state. When an event occurs, the FSM responds with the corresponding action, depending on its current state. Each state has a different mapping of events to actions. An action may include a transition to a different state.
In the context of parsing, an event is usually an input character from the text to be parsed. The significance of a given character depends on the surrounding characters. The rules for the various states reflect those dependencies.
Eventually the FSM reaches a terminal state -- which may be either an error state (signifying a syntax error in the input text) or an accepting state (signifying the recognition of a token).
In any case, JOCKEY parses only a few lines at a time. Each set of lines begins with a USOC (Universal Service Order Code) or a FID (Field IDentifier) at the left side. Following the USOC or FID there is typically some additional data, which may overflow onto continuation lines.
The first step is to load the USOC or FID, together with all of its continuation lines, into memory via calls to JKBLLOAD.
For a FID, the next step may be to extract the data we want by calling a specialized routine. For example, one routine extracts the listing name. Internally, these routines call JKBLREAD to fetch the stored text one character at a time.
The next step is to look for any relevant data tags. A tag consists of a slash ('/') followed by an identifier. The tag for a telephone number, for example, is '/TN'.
Tags appear in free-form fashion, following the USOC or FID. We use a FINDTAG routine to paw through the text looking for useful tags.
When we find a tag, we may need to look for some additional data following it. For example, if we find /TN, we call a specialized routine to extract a telephone number. If we find /DZIP we call a different routine to extract a ZIP code.
Further Reading