A C language Parser for reading CCP4-style keyworded input

Peter Briggs, September 2001

0. Contents

1. Introduction
2. ccp4_parser functions

2.1 C API functions: overview
2.2 Detailed description of the C API functions
2.3 Accessing the tokens: the CCP4PARSERARRAY and CCP4PARSERTOKEN structures
2.4 Examples of Usage

2.4.1 Reading keyworded input from stdin
2.4.2 Analysing and interpreting ccp4_parse output
2.4.3 Setting delimiter characters

2.5 Fortran API subroutines

3. Acknowledgements

1. Introduction

Most CCP4 programs read command input to set parameters on stdin in the form of `keyworded' records. These have a leading keyword followed possibly by arguments which might be numbers or strings, or keyword/value pairs of the form keyword=value. Such arguments are separated by delimiter characters, which are by default spaces, tabs, commas or `=' characters.

The details of the expected input are given in the documentation for each program. However, there are some general rules:

Only the first four characters of keywords are significant (although you are recommended to use complete keywords) and they are case-insensitive;
Records may be continued across line breaks using &, - or \ as the last non-blank, non-comment character on the line to be continued;
Text following a non-quoted ! or # is treated as a comment and ignored. A continuation character may precede the comment;
Strings may be single- or double-quoted. They may also be left unquoted if they contain none of the delimiter characters mentioned above, or if the whole of the rest of the record is read as a single string;
Files may be included in the input stream using a record comprising a leading @ followed by the filename.
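
The comment rule above can be sketched in a few lines of C. This is only an illustration: strip_comment is a hypothetical helper, not a function of the CCP4 library, and it assumes that any single or double quote opens a quoted string (the library itself is stricter about where quotes are recognised).

```c
#include <stddef.h>

/* Hypothetical helper (not part of the CCP4 library): truncate `line`
   in place at the first '!' or '#' that lies outside a quoted string.
   Returns the length of the text that remains. */
size_t strip_comment(char *line)
{
    char quote = '\0';   /* the quote character we are inside, or '\0' */
    size_t i;

    for (i = 0; line[i] != '\0'; i++) {
        char c = line[i];
        if (quote != '\0') {
            if (c == quote) quote = '\0';   /* closing quote */
        } else if (c == '\'' || c == '"') {
            quote = c;                      /* opening quote */
        } else if (c == '!' || c == '#') {
            line[i] = '\0';                 /* comment starts here */
            break;
        }
    }
    return i;
}
```

With this sketch a # inside 'run #1' is kept as part of the string, while a later unquoted # ends the record.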

This document describes a set of C language functions and structures which can be used to implement this functionality in an application program. A set of Fortran API subroutines is also provided which mimics the functionality of the original core CCP4 parser routines.

2. ccp4_parser functions

Although referred to as a ``parser'', the CCP4 functions actually implement a tokeniser which will take a line of characters and extract a set of tokens delimited by special characters.

Most programs have a loop over input records which are initially fed to the function ccp4_parser (equivalent to the Fortran PARSER subroutine) to tokenise them and extract the initial keyword. ccp4_parser can cope with continued, commented input lines and included files. It calls ccp4_parse (equivalent to the Fortran PARSE subroutine) to tokenise individual records; ccp4_parse can also be called directly by an application.

The following sections describe the functions and usage.

2.1 C API functions: overview

The tokeniser uses the CCP4PARSERARRAY structure to store and pass information between the application and the ccp4_parser functions. This structure is initialised by a call to ccp4_parse_start, which returns a pointer to a new CCP4PARSERARRAY structure.

The pointer to the CCP4PARSERARRAY must be passed to the relevant functions when processing (tokenising) lines of characters. Two functions are provided for this purpose:

ccp4_parse will tokenise a single line supplied by the application
ccp4_parser will read a line supplied by the application or (if this is blank) from standard input or (if the first character of the line is an @ symbol) from an external file, and will continue reading from standard input or external file (appending to the original line) if a continuation character is detected. The concatenated line will be tokenised and returned.

In either case the function returns the number of tokens detected on the line.

The behaviour of ccp4_parse(r) can be modified by calling ccp4_parse_delimiters, which enables the application to set or reset the delimiter set used by the CCP4PARSERARRAY, and by ccp4_parse_maxmin, which sets the maximum and minimum base-10 exponents to trap for floating point under- or overflow when evaluating numerical tokens.

Once a line has been tokenised the information is stored in the CCP4PARSERARRAY and can be accessed via members of the structure as outlined in the relevant section. The utility function ccp4_keymatch can be used to check whether two strings match as CCP4-style keywords (i.e. whether they match up to the fourth character independent of case).

The function ccp4_parse_end is used to perform clean-up of a CCP4PARSERARRAY once it is no longer needed.

2.2 Detailed description of the C API functions

Function	Description
CCP4PARSERARRAY *ccp4_parse_start(const int maxtokens)	initialise a CCP4PARSERARRAY to be used in subsequent calls to ccp4_parser routines.
int ccp4_parse_end(CCP4PARSERARRAY *parser)	clean up CCP4PARSERARRAY parser after use

ccp4_parse_start initialises a CCP4PARSERARRAY to be used with the subsequent ccp4_parser functions. The calling function must supply maxtokens, the maximum number of tokens. The function returns a pointer to a new CCP4PARSERARRAY.

ccp4_parse_end cleans up a CCP4PARSERARRAY pointed to by parser after being used by the ccp4_parser functions.

Function	Description
int ccp4_parse(char *line, CCP4PARSERARRAY *parser)	given a string "line", break up into tokens and store in CCP4PARSERARRAY "parser"
int ccp4_parser(char *line, const int nchars, CCP4PARSERARRAY *parser, const int print)	read input from stdin or external file, break up into tokens and store in CCP4PARSERARRAY "parser"

ccp4_parse takes an input string line and returns the number of tokens which are delimited by certain characters (defaulted to space, tab, comma, equals - these can be changed by the application using a call to ccp4_parse_delimiters).

Substrings can be delimited by single- or double-quotes but must be surrounded by delimiters to be recognised. An unquoted ! or # in the input line introduces a trailing comment which is ignored. Null fields are denoted by two adjacent null delimiters (defaulted to comma and equals).

ccp4_parse returns the number of tokens parsed in the input line, 0 (zero) on reaching end-of-file, or -1 on encountering an unrecoverable error. The results of the tokenisation are returned via parser.
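
To make the tokenising step concrete, the following is a minimal sketch of splitting a line on the default delimiters. It is not the library's implementation: demo_tokenise and DEMO_DELIMS are hypothetical names, null fields are not handled, positions are assumed to be counted from zero, and quoted tokens are recorded without their quote characters.

```c
#include <string.h>

#define DEMO_DELIMS " \t,="   /* the default delimiters described above */

/* Hypothetical, simplified stand-in for the tokenising step. Records the
   first and last character positions of up to maxtok tokens in ibeg[]
   and iend[], and returns the number of tokens found. */
int demo_tokenise(const char *line, int maxtok, int ibeg[], int iend[])
{
    int ntok = 0;
    int i = 0;

    while (line[i] != '\0' && ntok < maxtok) {
        /* Skip any run of delimiters */
        while (line[i] != '\0' && strchr(DEMO_DELIMS, line[i]) != NULL) i++;
        /* Stop at end of line or at the start of a trailing comment */
        if (line[i] == '\0' || line[i] == '!' || line[i] == '#') break;

        if (line[i] == '\'' || line[i] == '"') {
            /* Quoted token: runs to the matching quote (or end of line) */
            char quote = line[i++];
            ibeg[ntok] = i;
            while (line[i] != '\0' && line[i] != quote) i++;
            iend[ntok] = i - 1;
            if (line[i] == quote) i++;
        } else {
            /* Unquoted token: runs to the next delimiter or comment */
            ibeg[ntok] = i;
            while (line[i] != '\0' && strchr(DEMO_DELIMS, line[i]) == NULL
                   && line[i] != '!' && line[i] != '#') i++;
            iend[ntok] = i - 1;
        }
        ntok++;
    }
    return ntok;
}
```

Under these assumptions the record RESO 20.0, 1.8 ! limits yields three tokens, and the quoted string in TITLE 'my run' x=1 is kept as a single token despite the embedded space.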

ccp4_parser reads ``keyworded'' data from the input stream, and calls ccp4_parse to tokenise it. By default stdin is used as the input stream, but a line starting with @<name> starts reading from the file <name> until end-of-file is reached.

Each input line may be continued on the next line by the continuation characters `&', `-' or `\' at the end of the input line. This character is dropped from the list of tokens returned to the calling application.
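
The continuation test can be illustrated with a short sketch. strip_continuation is a hypothetical helper, not a library function, and it assumes that any trailing comment has already been removed from the line.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: if the last non-blank character of `line` is a
   continuation character (&, - or \), cut the line there and return 1.
   Otherwise leave the line unchanged and return 0. */
int strip_continuation(char *line)
{
    size_t n = strlen(line);

    /* Ignore trailing whitespace, including any newline */
    while (n > 0 && isspace((unsigned char) line[n - 1])) n--;

    if (n > 0 && (line[n - 1] == '&' || line[n - 1] == '-' ||
                  line[n - 1] == '\\')) {
        line[n - 1] = '\0';   /* drop the continuation character */
        return 1;
    }
    return 0;
}
```

A caller would append the next input line to the truncated text whenever the function returns 1.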

nchars is the maximum number of characters which will be read into line. If line has zero length initially then reading starts immediately from the input stream; otherwise line is processed first, and more input is read if it ends in a continuation character or if it forces reading from an external file.

The print argument should be set to 0 (zero) to suppress echoing of the input lines to standard output.

ccp4_parser returns the number of tokens parsed in the input line, 0 (zero) on reaching end-of-file, or -1 on encountering an unrecoverable error. The complete line, less any continuation marks, is returned in line. The results of the tokenisation are returned via parser.

Function	Description
int ccp4_parse_delimiters(CCP4PARSERARRAY *parser, const char *delimiters, const char *nulldelimiters)	set up or restore non-default delimiters
int ccp4_parse_maxmin(CCP4PARSERARRAY *parser, const double max_exp, const double min_exp)	set non-default maximum and minimum values for numerical tokens

ccp4_parse_delimiters allows the calling application to set its own delimiter characters to be used by ccp4_parse and ccp4_parser. The delimiter and null delimiter character lists are supplied as the strings delimiters and nulldelimiters respectively. If a NULL pointer is supplied for either of the two lists then the default delimiters are (re)set.

ccp4_parse_delimiters returns 1 on success, 0 if there was an error. In the event of an error the delimiter lists will be unchanged.

ccp4_parse_maxmin allows the application to set its own maximum and minimum base-10 exponent values, which are used as limits when evaluating the values of numerical tokens in order to avoid floating point over- and underflow. The maximum and minimum exponents are supplied as max_exp and min_exp respectively.

Function	Description
int ccp4_keymatch(const char *keyin1, const char *keyin2)	compare input strings to see if they match as CCP4-style keywords

ccp4_keymatch returns 1 if keyword strings keyin1 and keyin2 are ``identical'' CCP4-style keywords, and 0 otherwise. Keywords are identical if they are the same up to the first four characters, independent of case.
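
The matching rule can be restated as a short sketch. demo_keymatch is a hypothetical function written for illustration only; it follows the rule as described here and is not the source of ccp4_keymatch.

```c
#include <ctype.h>

/* Hypothetical restatement of the matching rule: return 1 if key1 and
   key2 agree, ignoring case, over their first four characters, and 0
   otherwise. Strings shorter than four characters must agree in full. */
int demo_keymatch(const char *key1, const char *key2)
{
    int i;

    for (i = 0; i < 4; i++) {
        int c1 = toupper((unsigned char) key1[i]);
        int c2 = toupper((unsigned char) key2[i]);
        if (c1 != c2) return 0;     /* differ within the first four */
        if (c1 == '\0') return 1;   /* both strings ended together */
    }
    return 1;                       /* first four characters agree */
}
```

Under this rule RESOLUTION and reso match, while PROBE and PROTEIN do not, because they differ at the fourth character.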

Function	Description
int ccp4_parse_init_token(CCP4PARSERARRAY *parser, const int itok)	initialise a single token in CCP4PARSERARRAY before use
ccp4_parse_reset(parser)	initialise CCP4PARSERARRAY before use (includes calls to ccp4_parse_init_token to initialise all tokens)

ccp4_parse_init_token resets the values for token itok within parser. ccp4_parse_reset initialises all the tokens in parser. It is not normally necessary to call either of these functions.

Internal functions

The following functions are listed here for completeness, but they are not intended to be called externally and full specifications are not given. Full details are in the source code comments.

Function	Description
int strtoupper(char *str1, const char *str2)	convert string str2 to uppercase and return in string str1.
int strmatch(const char *str1, const char *str2)	return 1 if strings str1 and str2 are identical, return 0 otherwise.
int charmatch(const char character, const char *charlist)	return 1 if character appears in the string charlist, return 0 otherwise.
int doublefromstr(const char *str, const double max_exp, const double min_exp, double *valuePtr, double *intvaluePtr, int *intdigitsPtr, double *frcvaluePtr, int *frcdigitsPtr, double *expvaluePtr, int *expdigitsPtr)	convert a string representation of a number str into the number (returned via valuePtr), checking for exponent over/underflow using the maximum/minimum exponents max_exp/min_exp. Also return the number of digits and the values of each ``component'' of the number (integer and fractional parts, and base-10 exponent).

2.3 Accessing the tokens: the CCP4PARSERARRAY and CCP4PARSERTOKEN structures

Information about the line and its tokens is stored in the CCP4PARSERARRAY structure, and the application reads it by accessing the members of that structure.

A new CCP4PARSERARRAY is created and initialised with a call to ccp4_parse_start:

parser = (CCP4PARSERARRAY *) ccp4_parse_start(maxtokens);

where maxtokens specifies the maximum number of tokens that this instance of the CCP4PARSERARRAY can store (typically 20 or more).

The ``public'' members of the CCP4PARSERARRAY structure are:

keyword: (char [5]) the first four characters (or fewer, if the token is shorter than four characters) of the first token on the line, uppercased and terminated by a null character
ntokens: (int) the number of tokens detected on the line
token: (CCP4PARSERTOKEN *) an array of structures describing the tokens

The first two members are accessed as expected, e.g. parser->keyword and parser->ntokens. The tokens themselves are stored in an array of CCP4PARSERTOKENs. A CCP4PARSERTOKEN is a structure whose members hold information which characterises the token:

fullstring: (char *) the complete token as a string
word: (char [5]) the first four characters (or fewer, if the token is shorter than four characters) of the token, terminated by a null character
ibeg, iend: (int) the positions in the line for the first and last characters of the token respectively

The token is characterised by the following set of structure members:

isstring: (int) true if the token is a character string
isnumber: (int) true if the token is a number
isquoted: (int) true if the token is contained within quotes
isnull: (int) true if the token is a null field

In this context, a ``character string'' indicates a string of characters which cannot be interpreted as a number. In this case there is additional information about the token:

strlen: (int) the number of characters in whole token

Valid numerical tokens include integers, decimals and exponentials. The additional information about numerical tokens is:

value: (double) the equivalent numerical value of the token
intdigits: (int) the number of 'digits' preceding the decimal point
frcdigits: (int) the number of 'digits' after the decimal point

Information about particular tokens can then be accessed as expected, using constructs such as parser->token[1].value or parser->token[5].isstring.
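
The distinction between string and numerical tokens can be sketched with a cut-down record. DemoToken and demo_classify are hypothetical: the member names mirror those documented above, but this is not the library's CCP4PARSERTOKEN definition, and the sketch uses the standard strtod function in place of the library's own number conversion (so, unlike the library, it may also accept forms such as hexadecimal values or "inf").

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical cut-down token record, for illustration only */
typedef struct {
    const char *fullstring;  /* the complete token */
    char word[5];            /* first four characters, null-terminated */
    int isstring;            /* true if not interpretable as a number */
    int isnumber;            /* true if interpretable as a number */
    double value;            /* meaningful only when isnumber is true */
} DemoToken;

/* Fill in `tok` from the text of a single token */
void demo_classify(DemoToken *tok, const char *text)
{
    char *end;

    tok->fullstring = text;
    strncpy(tok->word, text, 4);
    tok->word[4] = '\0';

    /* A number is a non-empty string that strtod consumes completely */
    tok->value = strtod(text, &end);
    tok->isnumber = (end != text && *end == '\0');
    tok->isstring = !tok->isnumber;
    if (!tok->isnumber) tok->value = 0.0;
}
```

An application would then test isnumber before using value, in the same way as for parser->token[1].value.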

Note that there are also a number of ``private'' members of the CCP4PARSERARRAY, which should not be accessed or changed directly from the application program. There is no guarantee that these members will not change in future revisions of the library.

2.4 Examples of Usage

2.4.1 Reading keyworded input from stdin

The following code implements a loop to read and tokenise keyworded input from the standard input stream, using ccp4_parser:

  /* Initialise a parser array which can store up to 20 tokens */
  parser = (CCP4PARSERARRAY *) ccp4_parse_start(20);

  /* Read lines from stdin until END/end keyword is entered or
     EOF is reached */
  cont = 1;

  while (cont) {
    /* Blank the line before calling ccp4_parser
       to force reading from stdin */
    line[0] = '\0';

    /* Call ccp4_parser to read input line and break into tokens
       Returns the number of tokens, or zero for eof */
    ntok = ccp4_parser(line,200,parser,1);

    if (ntok < 1) {
      /* End of file (0) or unrecoverable error (-1) encountered */
      cont = 0;
    } else {      
      /* Perform keyword analysis and interpretation */
      ......
    }
    /* Loop around to read next line */
  }

  /* Clean up parser array */
  ccp4_parse_end(parser);

2.4.2 Analysing and interpreting ccp4_parse output

The following code implements an END keyword:

      /* END keyword
         Usage: END */
      if (ccp4_keymatch("END",parser->keyword) && parser->ntokens == 1) {
        return 0;
      }

A slightly more sophisticated example is shown for a PROBE keyword, which takes a single numerical argument (the radius of a probe sphere used for accessible surface area calculations).

      /* PROBE keyword
         Usage: PROBE rad */
      if (ccp4_keymatch("PROBE",parser->keyword)) {
        if (parser->ntokens != 2) {
          printf("Error: PROBE keyword - requires a single argument\n");
        } else if (!parser->token[1].isnumber) {
          printf("Error: PROBE keyword - arguments must be numerical\n");
        } else {
          probe_radius = parser->token[1].value;
          printf("Probe radius set to %f\n",probe_radius);
          if (probe_radius <= 0.0) {
            printf("Error: PROBE keyword - value should be positive\n");
          }
        }
      }

Note that the ccp4_keymatch function provides a way of comparing strings to see if they match as CCP4-style keywords.

2.4.3 Setting delimiter characters

The following code shows examples of setting and resetting the delimiter characters:

  /* Set up parser to only allow tabs and spaces as delimiters */
  ccp4_parse_delimiters(parser," \t","");

  /* Reset to default delimiters */
  ccp4_parse_delimiters(parser,NULL,NULL);

2.5 Fortran API subroutines

Fortran APIs are provided for the core functionality: PARSE, PARSER and PARSDL. Other Fortran routines which use these subroutines (e.g. RDSYMM, or the KEYPARSE family) are not implemented - the existing Fortran subroutines should be used instead.

The PARSE, PARSER and PARSDL subroutines (and their usage) are described in detail elsewhere, in the parser.f documentation, and this information is not reproduced here.

3. Acknowledgements

The C functions and Fortran APIs were written by Peter Briggs, with useful feedback provided by Martyn Winn.

The functionality is based heavily on the original CCP4 Fortran parser library, which itself was based on Mike Levitt's routine of the same name and subsequently modified by: Peter Brick, Phil Evans, Eleanor Dodson, Dave Love, Peter Briggs.