Copyright © 1998 by MSTR Technology
This document describes s_parser (symbolic parser), a new module for parsing numeric input for CAE applications. Including it in an application is easy if the application originally used the parser module: the function name and calling sequence are nearly identical. However, quite a lot is going on behind the scenes. In a nutshell, the new module allows the user to define symbolic constants and functions and to insert expressions using them into the input stream, rather than hard-coded numeric constants only. All of the logic for doing this is handled within the module. Any MSTR Technology application that relies on numeric input, either from files or the keyboard, can use this capability.
Additional features will eventually be provided for interactive programs to allow the user to view and manipulate the internal symbol tables. This will be done by slightly expanding the s_parser module itself (to include more external control functions) and by writing a set of user-interface objects. Currently, you can only view the loaded and evaluated symbol table, using the Extras:Symbol Table pull-down menu. When you do this, be aware that the first few entries in the table are software defined and should be ignored.
If you are thinking of using this module, please note that its intent is to assist the user in entering numerical data, not string data. All of the symbols, functions, and operators deal with 64-bit floating point numbers. Therefore, this module is not particularly appropriate for input consisting primarily of strings or for input data that are generated entirely by another computer program. In those cases, the original parser module, which is included in the s_parser module as a separate function ("old_parser"), should be adequate.
A quick example will give you a flavor of what this module does. Suppose a user wants to create a CAE2D-style points file to define a square box. Suppose the box is centered at (0,0) and its side length is 1. The points file might look like this:
1, -0.5, -0.5
2, 0.5, -0.5
3, 0.5, 0.5
4, -0.5, 0.5
There is nothing particularly wrong with this until the user decides she really wanted the box centered at (1,2) and the side length should be 0.85. Now, she is faced with manually changing eight numbers in the data file (or interactively, if she is using CAEGEN), with all of the tedium and opportunity for errors that represents.
The new method, made possible by this module, is to define the key data values and then make the data file in a new format that is easier to modify:
x0 = 0
y0 = 0
side = 1
half = side/2
1, x0-half, y0-half
2, x0+half, y0-half
3, x0+half, y0+half
4, x0-half, y0+half
When the application program opens the file with the above lines and processes them through the symbolic parser, the parser reports back the following information:
LINE 1: No data (a "comment" line)
LINE 2: No data (a "comment" line)
LINE 3: No data (a "comment" line)
LINE 4: No data (a "comment" line)
LINE 5: One int32, two real64 numbers: 1, -0.5, -0.5
LINE 6: One int32, two real64 numbers: 2, 0.5, -0.5
LINE 7: One int32, two real64 numbers: 3, 0.5, 0.5
LINE 8: One int32, two real64 numbers: 4, -0.5, 0.5
In this way, the application program itself has no direct knowledge of what is going on behind the scenes. But note that the application should be prepared, when expecting a certain input (like two floating point numbers), to accept a line with no numbers on it (it is effectively a comment line) and try again until it gets what it expects.
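The "try again until it gets what it expects" loop can be sketched as follows. This is only an illustration: ParseLine is a hypothetical stand-in for a call to the symbolic parser (it returns the number of fields found on a line), and the name next_data_line is not part of the module.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a call to the symbolic parser: takes one input
// line and returns the number of data fields found on it (0 for comment and
// definition lines, negative if an include file could not be opened).
using ParseLine = std::function<int(const std::string&)>;

// Scan forward from `next` until a line yields at least `expected` fields.
// Returns the index of that line, or -1 if the input ends, an error is
// reported, or a data line has too few fields. `next` is left just past the
// last line consumed.
int next_data_line(const std::vector<std::string>& lines, std::size_t& next,
                   int expected, const ParseLine& parse) {
    while (next < lines.size()) {
        const std::size_t current = next++;
        const int found = parse(lines[current]);
        if (found < 0) return -1;   // include file could not be opened
        if (found == 0) continue;   // effectively a comment line: try again
        if (found >= expected) return static_cast<int>(current);
        return -1;                  // data line with too few fields
    }
    return -1;                      // ran out of input
}
```

For the box file above, a call asking for three fields would skip the four definition lines and stop on the fifth line.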
Now that we have the basic idea, let’s define the quantities that the parser will use.
Numerical Constants
Numerical constants are numbers that are entered using the standard rules of C or FORTRAN. The optional exponent field can be marked with the characters "E", "e", "D", or "d". All constants are stored internally as real64 floating point numbers.
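One way to accept all four exponent markers is to map the FORTRAN "D"/"d" onto "e" and hand the result to the C library. The sketch below shows that idea only; it is not taken from the module's source, and parse_constant is a hypothetical name. Note that strtod also accepts forms (hexadecimal floats, "inf", "nan") that the module may well reject.

```cpp
#include <algorithm>
#include <cstdlib>
#include <string>

// Convert a C- or FORTRAN-style numeric constant to a double.
// `ok` is set to false if the text is not entirely consumed as a number.
double parse_constant(std::string text, bool& ok) {
    // strtod only understands 'E'/'e', so map the FORTRAN markers onto 'e'.
    std::replace(text.begin(), text.end(), 'D', 'e');
    std::replace(text.begin(), text.end(), 'd', 'e');
    char* end = nullptr;
    const double value = std::strtod(text.c_str(), &end);
    ok = !text.empty() && end == text.c_str() + text.size();
    return value;
}
```

With this sketch, "1.5D3", "1.5d3", "1.5E3", and "1.5e3" all convert to the same real64 value.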
Identifiers
Identifiers are C-like names for user-defined variables and functions. An identifier must start with an alphabetic character (a-z) or $. An identifier can have additional characters after the first one, consisting of (a-z), (0-9), $, and _ (underscore). Unlike C, there is no distinction between lower case and upper case letters. Up to the first 31 characters are significant. Some identifier names are reserved for standard mathematical functions and argument extraction (see below).
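The naming rules can be expressed as a short check. This is a sketch of the rules as stated above, not the module's code; is_identifier and identifier_key are hypothetical names, and the lower-case key is just one way to get case-insensitive lookup with 31 significant characters.

```cpp
#include <cctype>
#include <string>

// True if `s` follows the rules above: first character a letter or '$',
// remaining characters letters, digits, '$', or '_'.
bool is_identifier(const std::string& s) {
    if (s.empty()) return false;
    const unsigned char first = static_cast<unsigned char>(s[0]);
    if (!std::isalpha(first) && first != '$') return false;
    for (unsigned char c : s) {
        if (!std::isalnum(c) && c != '$' && c != '_') return false;
    }
    return true;
}

// Lookup key: case folded and cut to the 31 significant characters, so that
// "Half", "HALF", and "half" all name the same symbol.
std::string identifier_key(const std::string& s) {
    std::string key = s.substr(0, 31);
    for (char& c : key) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return key;
}
```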
Operators
The s_parser operators are a subset of the operators in the ANSI C language, with one addition. The C bitwise operators have been ignored for obvious reasons. However, all of the logical operators are implemented. Note that logical operators, when applied to floating point numbers, assume that 0.0 (exactly) is false and anything else is true.
The operators in the symbolic parser are
Symbol      Operation
+           unary + (null operation)
-           arithmetic negation
!           logical negation
** or ^     exponentiation (not part of ANSI C)
*           multiplication
/           division
+           binary addition
-           binary subtraction
>           greater than
<           less than
>=          greater than or equal to
<=          less than or equal to
==          equal to
!=          not equal to
&&          logical and
||          logical or
? :         C-style conditional evaluation
=           assignment
+=          addition assignment
-=          subtraction assignment
*=          multiplication assignment
/=          division assignment
The precedence of operators (when not enclosed by parentheses) follows the ANSI C rules, with the addition of exponentiation, whose precedence is between the unary operators and multiplication. The precedence of the arithmetic operators is similar to FORTRAN usage.
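The rule that exactly 0.0 is false and anything else is true can be written out as below. The function names are hypothetical, and the choice of 1.0 and 0.0 as the results of a logical operation is an assumption borrowed from C; the text above does not say what values the module produces.

```cpp
// Truth test used by the logical operators: only exact zero is false.
bool is_true(double x) { return x != 0.0; }

// Assumption: like C, logical operators yield 1.0 for true and 0.0 for false.
double logical_not(double x) { return is_true(x) ? 0.0 : 1.0; }
double logical_and(double a, double b) { return (is_true(a) && is_true(b)) ? 1.0 : 0.0; }
double logical_or(double a, double b) { return (is_true(a) || is_true(b)) ? 1.0 : 0.0; }

// C-style conditional: condition ? when_true : when_false
double conditional(double condition, double when_true, double when_false) {
    return is_true(condition) ? when_true : when_false;
}
```

One practical consequence: a quantity that is zero only "to within round-off" still counts as true, so tests on computed values are safer written with an explicit comparison.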
Mathematical Functions
The symbolic parser implements all of the mathematical functions in the ANSI C standard math library: sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh, exp, log, log10, pow, sqrt, ceil, floor, fabs, ldexp, frexp, modf, fmod. (See Kernighan & Ritchie, 2nd Edition, page 251.)
There are two functions, frexp and modf, that return a value in the second argument. It is unlikely that someone would use these, but the correct syntax would be (for example)
aaa = frexp(4.321,bbb)
In other words, the second argument should be a user variable so that the calculated value destined for the second argument has a place to go.
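For readers unfamiliar with frexp, the C library call behind the example looks like this. The sketch shows the standard library behavior only; that aaa receives the fraction and bbb the exponent is an assumption based on the C semantics, and split_frexp is a hypothetical wrapper.

```cpp
#include <cmath>

// The two results of frexp: x == fraction * 2^exponent, 0.5 <= |fraction| < 1.
struct FrexpResult {
    double fraction;  // the function's return value (aaa in the example)
    int exponent;     // stored through the second argument (bbb in the example)
};

FrexpResult split_frexp(double x) {
    int exponent = 0;
    const double fraction = std::frexp(x, &exponent);
    return {fraction, exponent};
}
```

For 4.321 the exponent is 3 and the fraction is 4.321/8.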
Valid Expressions
A valid expression is a sequence of operators, functions, and constants that evaluates to a real64 (or integer) using the syntax rules of C. Parentheses can be used to make the expression more readable, change the order of evaluation, and/or indicate a function evaluation. If the parser encounters an invalid expression, it will evaluate it as far as possible, then report the invalid parts as string fields. (See the explanation of the calling sequence below.)
Definition of User Variables
As you may have gathered from the discussions above, a user variable is defined by placing the name of the variable, an equal sign, and a valid expression on an input line. The value of the variable can be changed later with another assignment line. It is also possible to use the C-style assignment operators (+=, -=, *=, /=), which respectively add the value on the right to the existing value, subtract it from the existing value, multiply the existing value by it, or divide the existing value by it.
If a user variable is referenced before it is defined, it is created automatically and assigned a value of zero.
Due to the internal algorithm used in the module, it is not possible to put more than one assignment on a line, nor to put any other data on an assignment line.
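The behavior of user variables (case-insensitive names, automatic creation with a value of zero, compound assignment) can be modeled with a small table. This is a simplified sketch and not the module's internal symbol table; SymbolTable and its member names are hypothetical.

```cpp
#include <cctype>
#include <map>
#include <string>

class SymbolTable {
public:
    // name = value
    void assign(const std::string& name, double value) { values_[key(name)] = value; }

    // name op= rhs, where op is '+', '-', '*', or '/'.
    // Returns false for an unknown operator.
    bool update(const std::string& name, char op, double rhs) {
        double& value = values_[key(name)];  // created as 0.0 if missing
        switch (op) {
            case '+': value += rhs; return true;
            case '-': value -= rhs; return true;
            case '*': value *= rhs; return true;
            case '/': value /= rhs; return true;
        }
        return false;
    }

    // Referencing an undefined variable creates it with a value of zero.
    double get(const std::string& name) { return values_[key(name)]; }

private:
    // Names are case-insensitive, so store them in lower case.
    static std::string key(std::string name) {
        for (char& c : name) {
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        }
        return name;
    }
    std::map<std::string, double> values_;
};
```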
Definition of User Functions
The principal difference between a user variable and a user function is when the value is computed. A user variable is assigned a specific value at the time it is defined (or modified), and that value is used wherever the variable appears. A function, on the other hand, returns a value that is calculated each time the function is referenced. The function may or may not use arguments that are included in the reference to that function.
A function is defined on a line that contains (in this order) the function name, open parenthesis, close parenthesis, equal sign, and (finally) a valid expression. This syntax does not explicitly specify the arguments of the function. There can be up to four arguments and they are referenced in the expression on the right-hand side as the pseudo-variables arg1, arg2, arg3, and arg4. When the function is referenced in a subsequent expression, arg1, arg2, arg3, and arg4 refer (respectively) to the first, second, third, and fourth arguments in the reference. If an argument is not included, its value is implicitly set to zero. When a function is referenced with no arguments, it still needs open-close parentheses so that the parser can differentiate it from a simple variable. An example will make this clearer.
Suppose we wish to create a function that returns the distance from the point (x0,y0,z0). This could be defined as follows:
dist0() = sqrt((arg1-x0)**2+(arg2-y0)**2+(arg3-z0)**2)
Assuming that we want to find the distance to the point (x1,y1,z1), we can put
ddd = dist0(x1,y1,z1)
If arguments are omitted, they default to zero. For example, the line
dzz = dist0()
calculates the distance from (0,0,0) to (x0,y0,z0).
When a function is defined, the symbolic parser stores a binary tree of coded operations that are executed when the function is evaluated, rather than storing the definition line as ASCII characters. This means that subsequent evaluations of a function are reasonably fast. One function can be defined in terms of others. However, it is not possible to define recursive functions.
The +=, -=, *=, and /= operators cannot be used in a function definition. Also, a function cannot be redefined.
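The argument convention (up to four arguments, missing ones set to zero, the body evaluated at each reference) can be imitated in C++ as shown below. This is an analogy only, not how the module stores functions; all names are hypothetical, and silently ignoring a fifth argument is an assumption.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <functional>
#include <initializer_list>

using Args = std::array<double, 4>;  // arg1..arg4
using UserFunction = std::function<double(const Args&)>;

// Call a function with zero to four arguments; omitted arguments are zero.
double call_function(const UserFunction& f, std::initializer_list<double> supplied) {
    Args args{};  // all four start at 0.0
    std::size_t i = 0;
    for (double v : supplied) {
        if (i == args.size()) break;  // assumption: extra arguments are ignored
        args[i++] = v;
    }
    return f(args);
}

// The user variables x0, y0, z0 that dist0 refers to.
struct Origin { double x0, y0, z0; };

// dist0() = sqrt((arg1-x0)**2+(arg2-y0)**2+(arg3-z0)**2)
// The origin is read at each call, so later changes to x0, y0, z0 are seen.
UserFunction make_dist0(const Origin* origin) {
    return [origin](const Args& arg) {
        const double dx = arg[0] - origin->x0;
        const double dy = arg[1] - origin->y0;
        const double dz = arg[2] - origin->z0;
        return std::sqrt(dx * dx + dy * dy + dz * dz);
    };
}
```

Because the body is evaluated at each reference, changing x0 after dist0 is defined changes the result of the next call.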
Include Files
There are many occasions when the user will want to apply the same symbolic constants and functions to multiple sets of data files, or save similar sets of definitions in different files so that multiple runs can easily be made with minimal changes. This is done with the concept of an include file, similar to C source code. The syntax is identical to C also. There is an important difference, though: The purpose of the include file is only to define variables and functions, not to input the data lines in the included file (if any). When the s_parser encounters an input line like the following example,
#include "define1.dat"
it opens the file, reads it line by line, saves any definitions, and discards any data lines. The s_parser then reports back to the calling program that the line defining the include command was a comment (data fields = 0).
If the parser cannot open the file for some reason, it reports back a -1 as the number of fields found. (Normally, the parser returns 0 or a positive number.)
Included files can in turn include other files. If an included file in an included file cannot be opened, the return value is -N, where N is the level of recursion of the includes.
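The -N convention can be illustrated with a small model in which the "files" are held in memory. This reflects one reading of the text (the file named on the include line is level 1, a file it includes is level 2, and so on) and is not the module's code; FileSet, process_include, and the depth limit of 16 are all assumptions made for the sketch.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// In-memory stand-in for the file system: file name -> lines of the file.
using FileSet = std::map<std::string, std::vector<std::string>>;

// Returns 0 on success, or -N if a file at include level N cannot be opened.
int process_include(const FileSet& files, const std::string& name, int depth = 1) {
    const int max_depth = 16;  // assumption: guard against files that include themselves
    const auto it = files.find(name);
    if (it == files.end() || depth > max_depth) return -depth;

    const std::string keyword = "#include \"";
    for (const std::string& line : it->second) {
        if (line.compare(0, keyword.size(), keyword) != 0) {
            continue;  // a definition would be saved here; data lines are discarded
        }
        const std::size_t close = line.find('"', keyword.size());
        if (close == std::string::npos) continue;  // malformed include: skipped in this sketch
        const std::string nested = line.substr(keyword.size(), close - keyword.size());
        const int status = process_include(files, nested, depth + 1);
        if (status < 0) return status;  // pass the error level back up
    }
    return 0;
}
```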
Angle Units
It is often convenient in model building or other CAE activities to use angle units other than the default radian units normally associated with the math functions sin, cos, tan, asin, acos, atan, and atan2. The s_parser accepts the following commands to change the angle units for subsequent calls to the standard math functions:
#degree
#degrees
#grad
#grads
#radian
#radians
These commands set the angle units to (respectively) degrees, grads, or radians. Both the singular and plural forms are acceptable, and upper/lower case letters are equivalent. If no commands to override are given, the default units are radians.
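The effect of the angle unit commands amounts to a conversion wrapped around the C library functions, which always work in radians. The sketch below shows that arithmetic; the names are hypothetical, and returning the inverse functions in the selected unit is an assumption based on asin, acos, atan, and atan2 being listed among the affected functions.

```cpp
#include <cmath>

enum class AngleUnit { Radians, Degrees, Grads };

const double kPi = std::acos(-1.0);

// Convert an angle in the selected unit to radians (360 degrees = 400 grads).
double to_radians(double angle, AngleUnit unit) {
    switch (unit) {
        case AngleUnit::Degrees: return angle * kPi / 180.0;
        case AngleUnit::Grads:   return angle * kPi / 200.0;
        case AngleUnit::Radians: break;
    }
    return angle;
}

// Convert a result in radians back to the selected unit.
double from_radians(double radians, AngleUnit unit) {
    switch (unit) {
        case AngleUnit::Degrees: return radians * 180.0 / kPi;
        case AngleUnit::Grads:   return radians * 200.0 / kPi;
        case AngleUnit::Radians: break;
    }
    return radians;
}

// Forward functions convert their argument; inverse functions convert their result.
double unit_sin(double angle, AngleUnit unit) { return std::sin(to_radians(angle, unit)); }
double unit_asin(double x, AngleUnit unit) { return from_radians(std::asin(x), unit); }
```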
Data Lines
The ultimate purpose of s_parser is to read data lines and report the number and values of any integers (int32), reals (real64), and/or strings found thereon. This is done when the parser is sent a data line, which is any line not containing a comment (only), an include, or a variable/function definition.
A data line contains a number of fields, each field being an integer, real, or string. A real field is one valid expression that the parser reduces to a single number. It can contain any combination of operators, variables, and/or functions as long as the syntax is correct. An integer field is the same, except that the real number happens to evaluate to an integer. Since real64 mantissas have more than enough room to store all of the binary digits in an int32, all int32 values can be safely (exactly) stored in real64 format.
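The distinction between an integer field and a real field can be pictured as a test on the evaluated real64 value. The sketch below uses an exact test (a whole number within the int32 range); whether the module applies any tolerance is not stated here, so treat the exactness, and the name is_int32_value, as assumptions.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// True if the real64 value is a whole number that fits in an int32.
// NaN fails the range comparisons and is therefore rejected.
bool is_int32_value(double x) {
    return x >= std::numeric_limits<std::int32_t>::min() &&
           x <= std::numeric_limits<std::int32_t>::max() &&
           std::floor(x) == x;
}
```

Because a real64 mantissa holds 53 bits, converting any int32 to real64 and back returns the original value unchanged.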
A string is identified in one of two ways. First, a quoted string is set off by single or double quotes. The same type of quote should be used at the beginning and end. Inside the quotes, the standard C "backslash" escapes (like "\n") are handled correctly, except for the octal and hexadecimal formats. This makes it possible to embed new lines, tabs, bells, etc. in a string. The quote characters themselves can also be included with the backslash, as in "Rick said, \"Let’s get going!\"".
The second type of string is anything that cannot be interpreted by the s_parser as a valid expression. For example, the existence of unbalanced parentheses on a data line will result in at least one of them getting tagged as a string field. Also, parentheses enclosing more than one expression will be identified as strings. The most common example is the standard FORTRAN complex number format. If the parser encounters a data line like this:
(1.23,4.56)
it will return 4 fields as follows: (1) a string = "(", (2) a real64 = 1.23, (3) a real64 = 4.56, and (4) a string = ")".
Fields can always be separated by commas or semicolons. Quoted strings are also separated by their enclosing quotes. In some cases, white space (blanks, tabs) can also separate fields, as long as no binary operator intervenes. There are some additional complexities involving the + and – operators, which will be covered below.
There is no limit on the number of fields that s_parser can extract, other than a maximum for certain arrays defined in the calling program where the information about the fields will be stored. This will be covered in the section below detailing the calling sequence.
Those Pesky + and – Operators; Blanks as Separators
The + and – operators are difficult to parse because they can be used in three different ways: (1) as part of the string defining a numerical constant, (2) as unary operators on expressions to the right; or (3) as binary operators for expressions on the left and right. When two expressions are separated only by a +/– and possibly blanks, should the s_parser interpret them as one expression with a binary operator, or as two expressions and a unary operator? This question is tricky to resolve in a way that works consistently when reading both conventional data lines containing constant numbers separated by blanks, and lines containing expressions built using the full power of s_parser.
In the end, the author had to make some assumptions that he hopes cover most practical situations and conform to the intuitive expectations of most users. (Users of a previous version should note that the rules below have been changed to make them more intuitive and to cover situations encountered in preparing data files for MSTR Technology programs.)
Rule 1: If a +/- operator cannot be associated with an expression on the left as a binary operator, it will be assumed to be a unary operator. Examples include cases where the symbol on the left is a comma (",") or semicolon (";"). This also covers the case where a +/- directly follows an open parenthesis ("(").
Rule 2: If a +/- operator can be associated with an expression on the left as a binary operator, it will only be assumed to be a unary operator if it is directly adjacent to an expression on the right and there is intervening white space (blank, tab, etc.) between it and the expression on the left.
Some examples will make these rules clear:
1.23 -aaa     2 number fields, "1.23" and "-aaa"
1.23-aaa      1 number field after resolution of -
aaa - bbb     1 number field after resolution of -
aaa- bbb      1 number field after resolution of -
aaa-bbb       1 number field after resolution of -
aaa,- bbb     2 number fields, "aaa" and "-bbb"
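Rules 1 and 2 can be condensed into a short decision function. This is a sketch written from the rules as stated, not the module's tokenizer; classify_sign is a hypothetical name, and treating any operator character on the left like a comma (so that "a * -b" works) is an added assumption.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum class SignRole { Unary, Binary };

// Decide how the '+' or '-' at line[pos] should be read.
SignRole classify_sign(const std::string& line, std::size_t pos) {
    // Find the nearest non-blank character to the left of the sign.
    std::size_t left = pos;
    while (left > 0 && std::isspace(static_cast<unsigned char>(line[left - 1]))) {
        --left;
    }

    // Rule 1: nothing on the left that could be an operand.
    if (left == 0) return SignRole::Unary;
    const char prev = line[left - 1];
    // ',', ';', and '(' come from the text; the operator characters are an assumption.
    if (std::string(",;(+-*/^<>=!&|?:").find(prev) != std::string::npos) {
        return SignRole::Unary;
    }

    // Rule 2: unary only if blanks separate it from the left expression
    // and it is directly adjacent to the expression on the right.
    const bool blank_before = left < pos;
    const bool adjacent_right =
        pos + 1 < line.size() && !std::isspace(static_cast<unsigned char>(line[pos + 1]));
    return (blank_before && adjacent_right) ? SignRole::Unary : SignRole::Binary;
}
```

Applied to the six example lines above, the function reports a unary sign for the first and last lines and a binary operator for the other four.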
Comments
Comments are set off by the pound sign (#). Any characters from the # on are ignored, except for the include and angle unit commands described above.
A comment line is any line where the first character is a # (again, excepting the include and angle unit commands). It is part of the design of s_parser that definition lines appear to the calling program as the same thing as comment lines. Internally, they are quite different, of course.
Calling Convention
The application program should include the following line (or something like it)
#include "s_parser.h"
The symbol table and other information are stored in a container object of type s_parser_class. In this way, it is possible to maintain multiple symbol sets if necessary.
(The header file and C++ source are in the utility subdirectory.) The function prototype is as follows:
int s_parser_class::parser(char *line, int max_fields,
                           char *field[], parser_field_type ftype[],
                           int32 ivalue[], real64 dvalue[]);
Note that the function is formally referenced as "parser" rather than "s_parser" to maintain compatibility with the original parser module. The original parser is available as a stand-alone function (without a container class) with the following prototype:
int old_parser(char *line, int max_fields,
char *field[], parser_field_type ftype[],
int32 ivalue[], real64 dvalue[]);
The line variable is the string that you want parsed. It does not have to be a line from a text file; it can be any string containing a definition or data fields. Note that some characters in the string will be altered, and so if you need to keep the original, you should make a copy before calling parser.
max_fields is an integer containing the maximum size of the arrays that are the other arguments to the function.
field is an array containing pointers to the beginnings of the respective data fields in ASCII characters. The field strings are all null terminated (which is why the line is changed). If the data field is an expression, the corresponding string will be the entire ASCII character sequence making up that expression.
ftype is an array containing one of three values that are defined in s_parser.h: PARSER_STRING, PARSER_INT32, PARSER_REAL64. These values give the s_parser’s idea of what the corresponding field is. But note that even if the s_parser says that a field is a PARSER_REAL64, it still returns the correct ASCII characters for that field. Conversely, even if s_parser thinks a field is a string, it will return numbers for the field in the ivalue and dvalue arrays (see below).
ivalue is an array containing the int32 values of the respective fields. dvalue is an array containing the real64 values of the respective fields.
All of the arrays are indexed in the standard C way, starting at 0. Thus, the real64 value of the first field is dvalue[0].
The return value is the number of fields found on the line. It may be greater than the maximum value specified. In that case, the extra fields were found and decoded by the s_parser, but could not be reported on through the arrays.
If the return value is zero, that means the line was either a comment line or a definition line. If the return value is negative, it means that an error occurred trying to open an include file.
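The three cases for the return value (negative, zero, positive and possibly larger than max_fields) can be handled in one place in the calling program. The sketch below works on the integer alone, so it does not need the module's header; LineKind, ParseSummary, and summarize are hypothetical names.

```cpp
#include <algorithm>

enum class LineKind { IncludeError, NoData, Data };

struct ParseSummary {
    LineKind kind;
    int usable_fields;   // entries of field/ftype/ivalue/dvalue that were filled in
    int dropped_fields;  // fields found on the line but not reported in the arrays
};

// Interpret the parser's return value given the array size passed as max_fields.
ParseSummary summarize(int return_value, int max_fields) {
    if (return_value < 0) return {LineKind::IncludeError, 0, 0};
    if (return_value == 0) return {LineKind::NoData, 0, 0};  // comment or definition line
    const int usable = std::min(return_value, max_fields);
    return {LineKind::Data, usable, return_value - usable};
}
```

A caller would loop over indices 0 through usable_fields - 1 only, and might warn the user when dropped_fields is not zero.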
There is another s_parser function that is designed to interpret one cell in a table. The cell is expected to contain one real or complex number. The syntax is
parser_field_type s_parser_class::interpret_cell(char *data,
                                                 real64 &real_part, real64 &imag_part);
The return value of this function is PARSER_REAL64 if the data string is a real number (or one integer), PARSER_COMPLEX if the data string is a complex number (two reals separated by white space, comma, or semicolon; optionally enclosed in parentheses), or PARSER_STRING if the data string contains something else. If the return value is PARSER_REAL64, the value of the real number (or integer) is copied into the real_part variable. If the return value is PARSER_COMPLEX, the real part of the complex value is copied into the real_part variable and the imaginary part is copied into the imag_part variable. Here are some examples:
(1.23 -aaa)   PARSER_COMPLEX, real_part = 1.23, imag_part = -aaa
1.23-aaa      PARSER_REAL64, real_part = 1.23-aaa
(1,2)         PARSER_COMPLEX, real_part = 1, imag_part = 2
@_:[]         PARSER_STRING
In the case of bad syntax on a data line, the s_parser algorithm will most likely return a combination of number and string fields. The first string field that occurs on a line where string input was not expected is a clue to where a syntax error occurred. Also, a line may contain more or fewer than the expected number of data items.
It is up to the calling program to handle all of these contingencies.
Conclusion
This is the third version of the s_parser. Further refinements are likely as we continue trying it out in applications. Any questions on its use, bug reports, or suggestions for enhancements should be directed to MSTR Technology.