Processing character data

Market research data has traditionally contained very few character fields, and there have been even fewer requirements to process them. This has changed in recent years, however, and instead of gender being coded as 1 and 2, we are seeing data files that contain character strings ‘male’ and ‘female’. This is especially true of Excel data, which MERLIN is well able to process, and MERLIN also has powerful facilities for processing character data, some of which we will highlight in this blog.


Character variables

A MERLIN character variable (cvar) may be up to 9999 bytes in length. Each byte usually holds one character but UTF-8 data that cannot be converted into ANSI (e.g. Chinese) needs two or more bytes for each character. Note that UTF-8 files may be used freely in MERLIN and usually the user doesn’t even need to be aware of file formats but, in the single case of processing cvars derived from non-Latin data such as Chinese, RCP NOLATIN must be set, and the user needs to be aware of the additional storage space required. We will not discuss this further here, and all examples in this blog will assume the default setting RCP LATIN.

Let us suppose that gender is stored in our data file as a string of up to 6 characters. If we are using ASCII data, the definition might look like this…

DC $GENDER=$21-26,              !pick up from columns 21-26

… or, if we are using Excel or CSV data, like this…

DC $GENDER(6)=$*5,               !pick up from field #5

In the first case, the cvar length is implied by the length of the pickup field but must be explicitly stated in the second case, since the field could be any length.

The initial value of all cvars is Undefined (U) so, if a definition is bypassed by branching statements such as IF or FILTER, the cvar will retain this value – but if the definition statement is executed, the cvar will then contain exactly what is found in the data, e.g. ‘Male’, ‘FEMALE’, or maybe blank. If an Excel or CSV field is longer than the cvar’s length, a warning will be issued and additional characters ignored.


Listing, tabulating and exporting cvars

Cvars may be shown in report files using the LIST statement, or you can use a FREQ statement to generate a frequency count of all the different character strings found – particularly useful if you are unsure what the data file contains.

Cvars may be written out to new data files with the OUTPUT statement, or exported to other formats and packages via MERLIN Toolkit.

Cvars up to 1024 bytes in length may be tabulated on the vertical axis of a MERLIN table, showing all the character strings found in alphabetical order, followed by the number of blank or undefined records (if any).


Literals

Let us now suppose we wish to create a single coded variable (svar) called $GENDERX from our cvar $GENDER. We can do this with the following code:

DS $GENDERX=$GENDER/
‘male’; ‘Men’,
‘female’; ‘Women’,

‘male’ and ‘female’ are the strings we are looking for in the data, and are known as literals. ‘Men’ and ‘Women’ are the labels to be used in tables, which we have deliberately changed to illustrate the fact that they may differ from the literals.

It is not essential to create a cvar first – you can define the svar directly from the raw data location. In either case, if the literal is shorter than the entity being tested, MERLIN will assume there are trailing blanks at the end of the literal.

If you run a frequency count on the cvar or the raw data, MERLIN will generate code similar to the above in an FRQ file, which can then be incorporated into your script.

By default, tests on literals are not case sensitive, so the first category above would include ‘male’, ‘Male’, ‘MALE’, or even ‘mALe’. If you want the test to be case sensitive, set format ICL (Ignore Case in Literals) false.

If you know that your data contains variations of the literals, you can include these in the definition, e.g.

DS $GENDERX=$GENDER/
‘male’ + ‘M’; ‘Men’,
‘female’ + ‘F’; ‘Women’,

or you could use ‘pattern matching’, described below.


Pattern matching

By default, MERLIN recognises two characters in literals as representing any character(s):

?          means any single character (including a blank space)
*           means any sequence of characters of any length

So, ‘A?C’ would pick up ‘AXC’, ‘A3C’, ‘A#C’, etc, and ‘A*C’ would pick up ‘AXC’, ‘AXxC’, ‘AX3xC’, etc. This means we could simplify our last example to…

DS $GENDERX=$GENDER/
‘M*’; ‘Men’,
‘F*’; ‘Women’,

If you need to test for ‘?’ or ‘*’ in a literal, you can precede them with the MERLIN escape character (the grave accent ‘`’), e.g. ‘Why`?’ means you are looking for the actual string ‘Why?’. Alternatively, you can disable pattern matching completely by setting format PML (Pattern Matching in Literals) false.


Cvar substrings

It is possible to isolate any part of a cvar (known as a ‘substring’) so, for example, $CVAR(2:3)/ ‘LM’ tests for characters 2-3 being ‘LM’.

Either or both of the substring boundaries may be integer constants, or ivars, e.g. $CVAR($IVAR1:$IVAR2) – which opens up powerful possibilities for processing free format text strings, e.g. a comma-separated address field within ASCII data.

If the substring reference starts on the first character, you may omit the value before the colon and if it ends on the last character, you may omit the value after the colon, e.g.

DC $PART1=$CVAR(:6),
DC $PART2=$CVAR(7:),


Functions

MERLIN has numerous functions that facilitate the processing of character data, including:

%CLEN determines the length of a character string
%INDEX determines the position of one character string within another
%LOWCASE converts a character string into lower case
%UPCASE converts a character string into UPPER case
%NUM converts a character string into an ivar or wvar
%STR converts an ivar into a character string


For more details about these and other functions, see the ‘Functions’ chapter of the MERLIN Help File.

Any questions? Email support@merlinco.co.uk. x