Phil Hearn: Blogger, Writer & Founder of MRDC Software Ltd.

March 18, 2022 / in Data Analysis, MRDCL

5 ways to make MRDCL runs faster

MRDCL users are used to volumes of tables often being available in seconds. It is often a surprise when a run takes several minutes or even hours to complete. So, let’s look at the five biggest things that will make your MRDCL runs faster because, in our experience, most long run times can be reduced hugely.

1. Temporary variables

It is common to have variables you only need in the data stage of an MRDCL run. These variables are what you might call ‘building block’ variables. They may be variables that pick up data or process data that is then modified into another variable for tabulation. Or, they may have come from an online survey, storing redundant data.

Keeping IDF sizes to the minimum

Such variables will get written to the IDF (MRDCL’s internal database of variables and data) and simply inflate its size for no benefit. Of course, in a small project of 500 respondents with 100 tables, it will have a negligible benefit (probably less than one second), but that’s not the point. It is good practice to make redundant variables temporary so that they are not written to the IDF if they are not used in tabulations. As a side benefit, if you export data from MRDCL, temporary variables are excluded, making the file easier to understand and use for recipients of the exported file.

Making the variables temporary

How do you do this? You simply put an ‘x’ on the front of the variable name. Let’s imagine Q1 has 99 responses, but in the table output, you want a sub-total of the first five codes. You might write this:

Dm $q1=$101-199,

Dm $qq1=$q1/1..5,1-99,

However, you should write this:

Dm $xq1=$101-199,

Dm $q1=$xq1/1..5,1-99,

This will mean that $xq1 is not written to the IDF, and your run will be faster.
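As a rough illustration of the idea only (this is Python, not MRDCL, and it says nothing about how MRDCL is implemented internally), the sketch below treats any variable whose name starts with 'x' as temporary and leaves it out of what gets stored. The function name variables_to_store and the sample data are hypothetical.

```python
def variables_to_store(variables):
    """Return only the variables that would be kept, dropping temporary ones.

    Assumption for illustration: a name starting with 'x' marks a temporary
    variable, mirroring the naming convention described above.
    """
    return {name: values for name, values in variables.items()
            if not name.startswith("x")}


# Hypothetical respondent-level data: xq1 is the building block, q1 is tabulated.
variables = {
    "xq1": [3, 56, 12],
    "q1": [3, 56, 12],
    "banner": [1, 2, 1],
}

stored = variables_to_store(variables)
print(sorted(stored))  # ['banner', 'q1']
```

The fewer variables that survive this step, the less there is to write and read back later, which is the whole point of marking building-block variables as temporary.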

Increasing the temporary space

If you have a lot of temporary variables, you may exceed the default temporary limit. However, you can increase this by having this statement at the top of your .stp file.

Temporary /10000/,

2. Big multi variables

In terms of processing, there is a huge difference between processing a single-coded variable and a multi-coded variable.

How MRDCL stores single and multi-coded variables

A single value variable can store its value as a number. Indeed, an empty single value variable is stored as 0 and can be referred to in MRDCL as code zero. If a single response variable has 100 responses and a respondent has selected code 56, MRDCL stores the value 56 for that variable. It’s as obvious as that!

However, a multi-value variable can have any number of values. So, for a variable of 100 responses (codes), there are 100 responses that may be true or false. In other words, any combination of the 100 possible selections is theoretically possible, rather than the one value held by a single-value variable. As far as MRDCL is concerned, any combination and density of answers are possible.
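To make the difference concrete, here is a minimal Python sketch (not MRDCL, and not a description of MRDCL's actual storage format) that models a single-coded answer as one integer and a multi-coded answer as one true/false flag per code. The names and sample selections are hypothetical.

```python
# Single-coded: one integer per respondent, with 0 meaning "no answer".
single_answer = 56

# Multi-coded: one true/false flag per possible code (100 codes here).
NUM_CODES = 100
multi_answer = [False] * NUM_CODES
for code in (3, 56, 99):           # hypothetical selections
    multi_answer[code - 1] = True  # codes are 1-based, list indexes 0-based


def count_single(answers, code):
    """Count respondents who gave `code`: one comparison per respondent."""
    return sum(1 for a in answers if a == code)


def count_multi(answers, num_codes):
    """Count every code in multi-coded data: one check per code per respondent."""
    totals = [0] * num_codes
    for flags in answers:
        for i, selected in enumerate(flags):
            if selected:
                totals[i] += 1
    return totals


print(count_single([56, 12, 56], 56))               # 2
print(count_multi([multi_answer], NUM_CODES)[55])   # 1
```

In this model the single-coded variable costs one value per respondent, while the multi-coded one costs as many checks as there are codes, so a 9999-code multi variable is far more work than a 100-code one.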

Does it need to be a multi-value variable?

It may seem tempting to define all variables with responses as multi-coded variables. It saves some thought but places more onus on your computer’s processing. If it’s one or two variables, it will not matter much, but if there are many unnecessary multi-coded variables or several big variables of, say, 9999 responses (the maximum is 30000), this can impact heavily.

Do you need big variables? Can they be single?

I have seen several examples of the misuse of multi-value variables. Here are examples:

Variables of 9999 empty responses for backcoding data from open-ended questions when 100 will be ample.
Picking up data from a field that can only be single with inefficient code such as this – dm $q33=$121-123/1-500,
Making big multi-variables when two variables are more efficient. For example, you may want to show brands 1-1000, 1001-2000 and 2001-3000 as sub-totals. You could generate a multi-variable such as:

Dm $brands=$120-123/1..1000,1001..2000,2001..3000,1-3000,

Effectively, you are making a variable with 3003 responses, but only the first three responses make it necessary to be a multi-value variable. Therefore, it is better to tabulate two variables like this:

Ds $brands=$120-123/1-3000,

Ds $brands_summary=$brands/1..1000,1001..2000,2001..3000,

Your table would need to appear as:

T#1=$brands_summary * $banner,

+t#1x=$brands * $banner,
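The reason the two-variable approach works is that, for single-coded data, the summary band can be derived from the one brand code. A small Python sketch of that derivation follows (illustrative only, not MRDCL; the function name and the use of 0 for "no brand" are assumptions).

```python
def brand_summary(brand):
    """Map a single brand code (1..3000) to a summary code 1, 2 or 3.

    Returns 0 when no brand was given or the code is out of range
    (an assumption made for this sketch).
    """
    if 1 <= brand <= 3000:
        return (brand - 1) // 1000 + 1
    return 0


print(brand_summary(1), brand_summary(1000), brand_summary(1001), brand_summary(3000))
# 1 1 2 3
```

Each respondent then carries two single numbers (the brand and its band) instead of 3003 true/false responses.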

As with all these examples, one or two inefficiencies do not matter. However, if there are a large number, you are slowing your runs unnecessarily.

3. Unneeded if statements

If statements take time to process, particularly if there are a lot of them. One of the less well-known statement types in MRDCL is the ability to convert a code to a value. I have encountered code like the following numerous times, especially when backcoding data for each record in a data file. If you’ve ever entered code like this, there is a much more efficient way.

If $q1/1,di $iq1=5,

If $q1/2,di $iq1=15,

If $q1/3,di $iq1=25,

If $q1/4,di $iq1=35,

If $q1/5,di $iq1=45,

If $q1/6,di $iq1=55,

You could write the same code like this:

Di $iq1=$q1(5,15,25,35,45,55),

The processing time will be a fraction of the time of the first example if you have hundreds of if statements.
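The same contrast can be sketched in Python (an analogy only, not MRDCL): a chain of separate tests against a single indexed lookup. Treating "no matching code" as 0 is an assumption made here, since the original code does not say what happens in that case.

```python
# Values for codes 1 to 6, in code order, taken from the example above.
VALUES = (5, 15, 25, 35, 45, 55)


def value_with_ifs(code):
    """One test per code: every test runs for every record."""
    value = 0  # assumption: 0 when no code matches
    if code == 1:
        value = 5
    if code == 2:
        value = 15
    if code == 3:
        value = 25
    if code == 4:
        value = 35
    if code == 5:
        value = 45
    if code == 6:
        value = 55
    return value


def value_with_lookup(code):
    """One indexed lookup replaces the whole chain of tests."""
    if 1 <= code <= len(VALUES):
        return VALUES[code - 1]
    return 0  # assumption: 0 when the code is outside 1..6


print(value_with_ifs(3), value_with_lookup(3))  # 25 25
```

Both functions give the same answers, but the first evaluates six conditions per record while the second does one range check and one lookup, and that gap grows with every extra code.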

4. Data loops/hierarchical data/occasion-based data

One of the reasons that MRDCL scripts get very big is repeated sections of a questionnaire. This occurs with hierarchical data or occasion-based data. Such surveys are often referred to as having data loops. Examples are as follows:

A survey where you ask questions about a household and have separate questionnaires for each adult in the household. This is often called hierarchical data, with a ‘household’ record and a record for each household member.
A survey where you ask respondents about meal occasions in a week. There will typically be a respondent record and a record for each meal occasion.
A survey about business travel where respondents tell you about each business trip in the last three months.

Data from data loops

A common trait of this type of survey is that one ‘level’ of data is fixed, but other levels are not. For example, for the business travel survey, there will be one record for each respondent, but each respondent may have any number of business trips in the last three months. Sometimes, the ‘other levels’ are fixed in number – e.g. four most recent eating out occasions. However, how you process these surveys in MRDCL can significantly impact the effort you need to write the MRDCL script and the processing time.

Making unmanageable data manageable

Let’s stay with the example of the business travel survey, and let’s say we cap the data we collect to the last ten business trips. Let’s assume we ask 30 questions about each business trip. Theoretically, we need to pick up 300 variables (30 questions × 10 trips). However, MRDCL lets you process each trip as a separate level of data, so you only pick up 30 variables. It does this by numbering each level of data – maybe 1 for each respondent and 2 for each trip.

Making the gains even bigger than you might think

From a processing point of view, MRDCL will run faster; MRDCL only has to process 30 variables rather than 300. However, there are more benefits.

Tables become simpler – you don’t need to add together the data from 10 variables to make tables based on ‘all trips’.
Calculating data across variables becomes simple. Examples are:
- Calculating total expenditure from all trips
- Calculating the percentage of spending on hotels
- Counting the number of times a particular hotel chain was used
- Calculating the usage of each hotel chain once or more across all trips

Without treating these surveys as hierarchical datasets, these calculations are not as easy as they may appear.
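To show why one record per trip makes these calculations straightforward, here is a minimal Python sketch (not MRDCL). It holds trips as separate records linked to a respondent and computes the four examples listed above. All field names and figures are hypothetical.

```python
from collections import Counter

# One record per trip (level 2), each linked to a respondent (level 1).
trips = [
    {"respondent": 1, "hotel_chain": "A", "hotel_spend": 200.0, "total_spend": 500.0},
    {"respondent": 1, "hotel_chain": "B", "hotel_spend": 100.0, "total_spend": 300.0},
    {"respondent": 1, "hotel_chain": "A", "hotel_spend": 50.0, "total_spend": 100.0},
    {"respondent": 2, "hotel_chain": "A", "hotel_spend": 100.0, "total_spend": 100.0},
]


def summarise(trips):
    """Compute the four 'across all trips' figures from trip-level records."""
    total_spend = sum(t["total_spend"] for t in trips)
    hotel_spend = sum(t["hotel_spend"] for t in trips)
    # A set of (respondent, chain) pairs counts each respondent once per chain.
    pairs = {(t["respondent"], t["hotel_chain"]) for t in trips}
    return {
        "total_spend": total_spend,
        "hotel_share_pct": 100.0 * hotel_spend / total_spend if total_spend else 0.0,
        "times_chain_used": Counter(t["hotel_chain"] for t in trips),
        "respondents_using_chain": Counter(chain for _, chain in pairs),
    }


summary = summarise(trips)
print(summary["total_spend"])                    # 1000.0
print(summary["hotel_share_pct"])                # 45.0
print(summary["times_chain_used"]["A"])          # 3
print(summary["respondents_using_chain"]["A"])   # 2
```

With ten sets of 30 variables per respondent instead, each of these figures would need code that names all ten trip slots explicitly.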

A video on our YouTube channel explains this technique – it can save hours of processing time

Data and variables: Dat-101 – Handling hierarchical data-Processing records

5. Using a large number of +t tables

Adding a mean score to the bottom of a table is easily achieved in MRDCL using a +t table. However, if you have a lot of +t tables, run times can start to increase, particularly if they are filtered with if statements. If you only have one or two tables, it will make a negligible difference, but if you have a lot of tables, run times can be reduced significantly. I’ve seen this when users want many summary tables of mean scores. So, something like this might look familiar:

Mean score summaries can be a problem.

[*do a=1:100]

%avg='Mean score [a]',

[*sk 99 on a.eq.1]+[*99]t#1[~a]=@ $q1[~a] * $banner,

[*end a]

Now, with the pre-processor, that looks like quite efficient code, but you are effectively running 100 tables and suppressing the distributions. The following code may look more cumbersome, but it will mean producing only two tables.

Doing the mean score summary in 2 tables

Dm $var100rows(100),

Dm $q1_not_undefined=[*do a=1:100]$q1[~a]/nu,[*end a]

T#1(f=nitb/nrtv/dpt2)=$var100rows * $banner,

t#1x(nptb)=$q1_not_undefined * $banner,

[*do a=1:100]

Select wq $q1[~a],t#1=[a] * $banner,

[*end a]

Select wq 1,

Then in the manip stage,

Mt#1=#1 / #1x,

But this looks more complicated!

It does look more complicated, but this code is effectively weighting one table by the scores, producing another table with the base for the mean scores and dividing one table by the other to derive mean scores. However, neater solutions are at hand by building templates. This is covered more fully in our video on this topic, but let’s take a brief look at how easy templates can be to use.
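The arithmetic behind this approach can be sketched in Python (an illustration of the technique, not of MRDCL itself): build one table of summed scores, build a second table of bases (respondents with a defined score), and divide cell by cell. The function name, the use of None for an undefined score and the sample data are all assumptions.

```python
def mean_score_tables(scores_by_item, banner, num_columns):
    """Build the weighted table, the base table and their cell-by-cell ratio.

    scores_by_item: one list of per-respondent scores per row (None = undefined)
    banner: the 0-based banner column each respondent falls into
    """
    weighted = [[0.0] * num_columns for _ in scores_by_item]
    base = [[0] * num_columns for _ in scores_by_item]
    for row, scores in enumerate(scores_by_item):
        for score, col in zip(scores, banner):
            if score is None:
                continue  # undefined scores count in neither table
            weighted[row][col] += score
            base[row][col] += 1
    means = [
        [w / b if b else None for w, b in zip(w_row, b_row)]
        for w_row, b_row in zip(weighted, base)
    ]
    return weighted, base, means


# Two rated items, four respondents, two banner columns (hypothetical data).
scores = [
    [4, 2, 5, None],
    [1, 3, None, None],
]
banner = [0, 0, 1, 1]

weighted, base, means = mean_score_tables(scores, banner, 2)
print(weighted)  # [[6.0, 5.0], [4.0, 0.0]]
print(base)      # [[2, 1], [2, 0]]
print(means)     # [[3.0, 5.0], [2.0, None]]
```

Every row of the summary is filled in during one pass over the data, so the cost is two tables plus a division rather than one table per mean score.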

Make it a general-purpose piece of code.

Using data statements, you should be able to convert that code to something like:

[*data 100=q1,100][*insert meanscoresummary]

In other words, you have a standard template called by inserting meanscoresummary.stp and setting parameters in *data 100 to automate this for any project. Easier and more efficient!

In summary

Do loops are great but can be dangerous

One of the problems with MRDCL is the power of the code you can generate. Do loops are great but remember they simply produce larger amounts of code for you. If that code is inefficient, the pre-processor will not care; it will just do what you tell it. So, think about what is being produced and processed and, subsequently, added to your IDF.
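A tiny Python sketch makes the point about loops generating code. It expands a one-line template into 100 lines of text. The {a} placeholder and the template wording are hypothetical and are not MRDCL pre-processor syntax.

```python
def expand_loop(template, start, end):
    """Return one line of text per loop value, substituting {a} each time."""
    return [template.format(a=i) for i in range(start, end + 1)]


lines = expand_loop("table {a}: mean of q1_{a} by banner", 1, 100)
print(len(lines))  # 100
print(lines[0])    # table 1: mean of q1_1 by banner
```

The loop is one line to write, but everything downstream has to process all 100 generated statements.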

Differences can be huge.

As I said in the opening paragraph, I am not just discussing best practices here. I have seen run times reduced from hours to one minute. That makes a difference in your productivity and what you can offer clients. Any runs that take a long time can become problematic, as any errors will cause delays in delivering output. Learning the techniques to improve run times is time well spent.

If you need more help, please feel free to contact us at support@mrdcsoftware.com.