Generating rows and columns of statistics

MERLIN allows you to generate statistical data in rows or columns of a table from single or multi coded variables (svars or mvars) – or in rows only from numeric variables (ivars or wvars). Two different methods are available, which we shall refer to as automatic and manual generation methods. Here is a list of all the statistics that may be generated, and the formats which control them.

BST Base for statistics
SUM Sum of scored values
SSQ Sum of squares
AVG Average
SDV Standard deviation
SER Standard error
EVR Error variance
ILL Lowest score value
ILH Highest score value
MDE Mode
MED Median
ILEn Quantiles (automatic generation only)

 

When using numeric variables, the last five statistics above may only be produced in frequency tables (not statistics tables).
 

Automatic generation method (rows only)

Rows of statistics may be generated automatically from svars, mvars, ivars or wvars. In the case of svars and mvars, statistics will only be generated if score values are specified in the labels with label control <Vn>, e.g.

DS $Q3=$20/
1;’Very good<V2>’,
2;’Good<V1>’,
3;’Neither<V0>’,
4;’Bad<V-1>’,
5;’Very bad<V-2>’,
6;’Dk/na’,

If any such variable is used on the vertical axis of a table, one or more rows of statistics will automatically appear at the end of the table, depending on the format(s) set – so, if F=AVG/SDV/EVR, three rows of statistics will appear. By default, they will appear in the order of the above list, but this can be changed with the SELECT STAT ORDER statement.

Note that format ILE needs a numeric setting n between 2 (= median) and 10 (= deciles). This format will produce n-1 rows showing the quantile points, i.e. 1 row for a median up to 9 rows for deciles.

To globally change the row label associated with a statistic, you can re-define the relevant special text, e.g. %AVG – but to change it for a specific variable, it is easier to include a dummy item at the end of the variable, using label control <AA> to assign the label to AVG, or <A1> to assign it to the first statistic shown, e.g. for an svar or mvar:

D;’Product X average<AA>’,

… or, for an ivar or wvar with no other labels:

V=D;’Item A base for stats<A1>’,

 
Manual generation method (rows or columns)

Rows or columns of statistics may be generated from svars or mvars (providing the labels contain score values as above), by specifying the statistics as dummy items within the variable. This is done using label control <T=xxx> where xxx is the relevant format setting (ILE is not allowed).

Although it requires more typing, this method has several advantages:

  • you can specify the position (and order) of the statistics (e.g. to show them before DK/NA)
  • you can give each statistic your own label
  • you can vary the number of decimal places with label control <Dn>
  • the statistics will appear on whichever axis the variable is used

Here is an example:

DS $Q3=$20/
1;’Very good<V2>’,
2;’Good<V1>’,
3;’Neither<V0>’,
4;’Bad<V-1>’,
5;’Very bad<V-2>’,
D;’Mean score<T=AVG>’,
D;’Std.deviation<T=SDV>’,
D;’Err.variance<T=EVR>’,
6;’Dk/na’,

Any questions? Email support@merlinco.co.uk.

Tabulating numeric data in statistics and frequency tables

MERLIN provides two ways of tabulating a numeric variable on the vertical axis of a table – known as statistics and frequency tables. Users are sometimes unclear about the difference between the two, so we will illustrate it using a variable called $NUMBER. This statement…

T#1=$NUMBER * $CLASS,

…produces a statistics table like this (the statistics shown depend on FORMAT settings)…

Often this type of table is joined onto the end of another table (maybe showing ranges of values) – or several statistics tables may be joined together to produce a statistical summary from different variables.

This statement…

T#2=$NUMBER(15) * $CLASS,

…produces a frequency table showing the frequency distribution of the values, as well as the statistics…

The syntax ‘(15)’ tells MERLIN to create a table with 15 rows (plus total, undefined, and the statistics rows) – in other words, we are allowing for up to 15 different values being found in the variable $NUMBER. This is an estimate, and our table shows that only 12 different values were found, so MERLIN automatically suppressed the unused rows.

You may be tempted to specify a high number like ‘(9999)’ but this is very inefficient since MERLIN will unnecessarily create a huge table – which is likely to slow the run down and (if there are many such tables), could cause limits to be exceeded. Since there are only 20 respondents in the sample, there cannot possibly be more than 20 different values, so you should specify 20 or less. If the number you specify is too low, MERLIN will report this and abort the run so you can increase it.

 

Format TNP

If your script contains many frequency tables, all containing the syntax ‘(100)’ for example, it is more efficient to use format TNP. If you set this to 100, it means that all table specifications from that point onwards which have a numeric variable on the vertical axis but no number in brackets, will generate frequency tables with up to 100 different values. If you want to use a different number (higher or lower), you can specify this in the usual way and it will take precedence over the TNP setting. If you want to produce a statistics table instead of a frequency table, you can specify ‘(0)’. For example…

SELECT BANNER $CLASS,

F=TNP100, !allocate 100 rows from now on
T#1=$NUM1 *, !frequency table with 100 rows
T#2=$SVAR *, !not a numeric variable so TNP ignored
T#3=$NUM2 *, !frequency table with 100 rows
T#4=$NUM3(200) *, !frequency table with 200 rows
T#5=$NUM4(0) *, !statistics table
T#6=$NUM5 *, !frequency table with 100 rows

 

Statistics requiring frequency distribution

The following statistics can only be produced in frequency tables, since the calculation requires a full frequency distribution:

ILE quantiles
ILH highest value
ILL lowest value
MDE mode
MED median

 

If you do not wish to show the frequency distribution, you can suppress it by setting format DIS false, e.g.

T#7(F=MED/NDIS)=$NUM6(150)*,    !frequency table with median but no distribution rows

 

Character tables

Finally, although not strictly part of our topic, we will mention that the tabulation of character data (e.g. postcodes) works in a similar way to numeric data. You can either show summary rows only (answered, not answered, undefined) – or you can show the full distribution by specifying ‘(n)’ after the character variable. Format TNP may be used as for numeric data.

Any questions? Email support@merlinco.co.uk.

Significance testing

MERLIN can perform many statistical tests, but the one we get asked about most is how to flag significant differences between cells in the same row with letters – something like this…

The letter highlighted in yellow denotes that the percentage above it is significantly greater than the percentage in column B of the same row at the 99% level – and the letter highlighted in green denotes that the mean score above it is significantly greater than the mean score in column G of the same row at the 95% level.

In the table above, we have set formats TTS3/SHG1, but this article describes many variations allowed.

Which values are to be tested?

This is determined by format TTS, which can be set as follows:
TTS0 = no tests
TTS1 = test mean scores only (using a t-test)
TTS2 = test column percentages only (using a z-test)
TTS3 = test both

Which columns are to be tested against which?

This is determined by format SHG:
SHG-2 = test each column against the remainder (i.e. total column minus current column)
SHG-1 = test each column against the total column
SHG0 = test all columns against each other (the default setting)
SHG1 = test within each 1st level header group, i.e. REGION, AGE, GENDER
SHG2 = test within each 2nd level header group (known in MERLIN as ‘overheaders’)
SHG3 = test within each 3rd level header group (known in MERLIN as ‘superheaders’)

Note that SHG-1 is a special case that should not be used unless the data has been changed with MANIP.

In the table above, we have used SHG1 – as indicated by the footnote which is generated automatically (although its appearance and position can be varied with special text %SHG and formats CGI, CGS and PSF).

Clients sometimes wish to test columns within header groups (SHG1) and against total minus each column (SHG-2) on the same table – and this can be achieved using formats TTS3/SHG1/SGX3. SGX does the same test as TTS3/SHG-2, but flags cells with one or two asterisks (depending on the significance level) instead of letters, and making it easy to distinguish between the two tests. By default, the asterisks are shown in the same cell as the values but format FBC allows you to move them below, and format SMS allows you to show + or – instead of *. Special text %SGX allows you to add text to the footnote.

Finally, you can specify your own pattern of testing with the SELECT SHG statement, e.g. SELECT SHG (1.4) (2.5) (3.6), means test column 1 against 4, 2 against 5, and 3 against 6.

Which levels of testing are to be used?

MERLIN allows testing at two significance levels, determined by formats SLA and SLB. By default these are set to 95 and 99 respectively but can be given any values between 60 – 99.99999, with up to 5 decimal places. If SLA and SLB have the same value, only one level of testing will be done.

Which flag characters are to be used?

By default, the columns on each page will be lettered from A-Z (lower case for SLA level and UPPER case for SLB level). This means that you cannot test more than 26 columns, which can be a problem when outputting tables to Excel with no ‘page breaks’ in the banners – but there are two ways around this. First, format TTL may be set to:
TTL0 = assign letters across all columns (the default setting)
TTL1 = re-start lettering on each test group, as determined by format SHG. So, if F=SHG1, the lettering will re-start on the 1st column within each header group
TTL2 = same as TTL1, but omitting the first column in each test group (typically used when this contains a sub-total). So, if F=SHG2, the lettering will re-start on the 2nd column within each overheader group.

Second, you may re-define and/or extend the string of characters used, in special texts %FCA and %FCB, e.g.
%FCA = ‘abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ’,
%FCB = ‘ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞŸ’,
If you are only testing at one level, you could redefine %FCA to include both upper and lower case letters, allowing a large number of columns to be tested against each other.

Finally, you can assign flag characters manually using label control <T=SMAx/SMBy>, and this method allows you to exclude columns from testing by not assigning any characters to them.

The table below was produced using formats TTS3/SHG1/SGX3/FBC/SLB95/TTL1.

What else?

Here are some other significance tests that can be done in MERLIN, using the formats shown:

CHI chi-square test
CHS single sample chi-square test
DEP dependent t-test
KST Kolmogorov-Smirnov test
LSD least significant difference
MWW Mann-Whitney-Wilcoxon test
TTF f-test

Any questions? Email support@merlinco.co.uk.

Using Excel and CSV data in MERLIN

When computers were first used to analyse market research data the most common means of storing data was column binary (card image) format but, with the advent of telephone and web interviewing packages, ASCII data became the favourite. Now, however, more and more data are stored in Excel or CSV files – and there are numerous advantages to this approach including:

  • Easy to view, edit, sort, etc in Excel.
  • Data fields may be referenced by names in the header row rather than absolute locations. So if Q3 is in the 11th column of one data file, but the 13th column of another, this doesn’t matter so long as you refer to it by the name Q3.
  • Variable names can be inferred from the header row names, or vice versa.
  • More efficient storage (no empty spaces in the data file).

MERLIN can read and write data in Excel or CSV files, and the principles described below apply to both. We refer to data fields / cells in such files as freefields (free format fields) to distinguish them from fixed location fields in ASCII and column binary files, which we have traditionally called fields.

Assuming that the first row in each data file is a header containing the names of the variables (which is nearly always the case) you will need to specify one of these RCPs (Run Control Parameters):

  • IGHEAD means the header row is to be ignored
  • USEHEAD means the header row is to be used, and enables you to refer to freefields by these names, rather than absolute locations.

The RECORD LENGTH statement needs to specify the number of freefields in your data file(s), rather than the number of characters, e.g. RECORD LENGTH 57 FIELDS. There could, in fact, be more freefields, but this would mean you are only analysing the first 57.

To refer to a freefield, you can use one of three approaches (which can be mixed freely within the same script). In all cases, the reference starts with $* (as opposed to $ on its own, used to refer to ASCII and column binary data locations):

  • By number, e.g. $*10 means the 10th freefield in the data record.
  • By letter (as used in Excel), e.g. $*AC means freefield AC, i.e. the 29th
  • By name, e.g. $*’Q25′, e.g. the freefield that has the name Q25 in the header row of the current data file. Freefield names are not case sensitive, so you could type $*’Q25′ or $*’q25′. If the freefield name is the same as the variable name, you can omit it in a definition, e.g. DS $Q25=$*/1-5 means define a single-coded variable called Q25 from the freefield of the same name (with a value range of 1-5). Names may only be used (or implied) if you have specified RCP USEHEAD.

The last approach is the most flexible, since Q25 could be in any location, and its position may vary from file to file (which is often the case when you are analysing different waves of the same survey together).

Another advantage of Excel or CSV files is that multi-coded variables can be stored in a single freefield, as opposed to the traditional approach of entering code 0 or 1 in a series of locations (known as a ‘bitstring’). The bitstring approach may still be used with freefield data but the former is easier to work with, and more flexible if the number of valid codes changes. A single freefield could therefore contain “1,7,10” to signify these 3 values (the quotation marks are optional in Excel, but compulsory in a CSV file). To indicate that a freefield may contain more than one value, an extra asterisk must be added after the freefield reference, e.g.

  • DM $Q30=$*15*/1-20, means freefield 15 may contain 1 or more values between 1-20
  • DM $Q30=$*’Q30’*/1-20, freefield named Q30 ditto
  • DM $Q30=$**/1-20, ditto omitting freefield name (same as the variable name)

If your data file is not in Excel or CSV format, you can use MERLIN Toolkit to convert it, e.g. you can import from SPSS, and you can automatically generate a MERLIN script defining the variables.

Any questions? Email support@merlinco.co.uk.