MRDCL users are so used to runs completing in seconds that it can come as a surprise when a run takes several minutes or even hours. So, let’s look at five things that will make your MRDCL runs faster because, in our experience, most long run times can be cut dramatically.
It’s more than best practice
By adhering to these rules, you are not just adopting best practice; you might be reducing the run time for a project from, say, 5 hours to 2 minutes! Let’s take a look at each.
1. Temporary variables not used
It is quite common to have variables that you only use in the data stage of an MRDCL run. These are what you might call ‘building block’ variables. They pick up or process data that is then modified into another variable, and it is that other variable that is actually tabulated. Or, they may have come from an online survey and store redundant data.
Keep IDF sizes to the minimum
Such variables will get written to the IDF (MRDCL’s internal database of variables and data) and simply inflate its size for no benefit. Of course, in a small project of 500 respondents with 100 tables, it will have a negligible benefit (probably less than one second), but that’s not the point. It is good to get into the habit of making redundant variables temporary so that they are not written to the IDF if they are not used in tabulations.
Make the variables temporary
How do you do this? You simply put an ‘x’ on the front of the variable name. Let’s imagine Q1 has 99 responses, but in the tables you want a sub-total of the first five codes. You might write this:
However, you should write this:
This will mean that $xq1 is not written to the IDF and your run will be faster.
Increase the temporary space if necessary
If you have a lot of temporary variables, you may exceed the default temporary limit. However, you can easily increase this by having this statement at the top of your .stp file.
2. Big multi variables
In terms of processing, there is a huge difference in processing time between a single value variable and a multi value variable. Once you think about it, it is probably obvious.
How MRDCL stores single and multi value variables
A single value variable can store its value as a number. Indeed, an empty single value variable is stored as 0 and can be referred to in MRDCL code as $q55/0. So, if a single response variable has 100 responses and a respondent has selected code 56, the number 56 is all that needs to be stored for that variable.
However, a multi value variable can have any number of values. So, for a variable of 100 responses (codes), there are 100 responses, each of which may be either true or false. In other words, this is 100 possible data selections rather than the one value of a single value variable.
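To make the difference concrete, here is a rough sketch in Python. It is not MRDCL's actual storage format, which is not described here; the function names and the list-of-flags layout are assumptions chosen purely for illustration.

```python
# Illustrative Python only: not MRDCL's real internal format.

def store_single(selected_code):
    """A single value variable needs one number; 0 means empty."""
    return selected_code


def store_multi(selected_codes, num_codes):
    """A multi value variable needs one true/false flag per possible code."""
    flags = [False] * num_codes
    for code in selected_codes:
        flags[code - 1] = True  # codes are 1-based
    return flags


single = store_single(56)                        # one value to store and test
multi = store_multi([3, 56, 99], num_codes=100)  # 100 flags to store and test

print(single)
print(len(multi))
print(sum(multi))  # how many of the 100 codes were selected
```

The point of the sketch is only the size difference: one number versus one flag for every possible code, whether or not the respondent selected it.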
Do you need big variables? Can they be single?
I have seen users define variables that have 9999 responses and then not use them in the tables. This wastes a lot of processing time and storage. I have also seen users make every variable multi value ‘because it seems easier’. Every big multi value variable will have an impact on your run times. And, of course, maybe it could be temporary as described above.
3. Unneeded if statements
If statements take time to process, particularly if there are a lot of them. One of the lesser-known statement types in MRDCL is the one that converts a code to a value. If you’ve ever entered code like this, there is a much more efficient way.
Can you avoid using if statements completely?
If $q1/1,di $iq1=5,
If $q1/2,di $iq1=15,
If $q1/3,di $iq1=25,
If $q1/4,di $iq1=35,
If $q1/5,di $iq1=45,
If $q1/6,di $iq1=55,
You could write the same code like this:
The processing time will be a fraction of the time of the first example.
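The idea behind replacing a chain of if statements with a code-to-value conversion can be sketched in Python. This is not MRDCL syntax, and the function and table names are hypothetical; only the codes and values come from the if statements above.

```python
# Illustrative Python, not MRDCL. Codes and values are taken from the
# six if statements shown earlier; all names are hypothetical.

def value_with_ifs(code):
    """One separate test per code, so every record pays for six comparisons."""
    value = None
    if code == 1:
        value = 5
    if code == 2:
        value = 15
    if code == 3:
        value = 25
    if code == 4:
        value = 35
    if code == 5:
        value = 45
    if code == 6:
        value = 55
    return value


# The same mapping expressed once, as a table.
CODE_TO_VALUE = {1: 5, 2: 15, 3: 25, 4: 35, 5: 45, 6: 55}


def value_with_lookup(code):
    """A single lookup replaces the whole chain of tests."""
    return CODE_TO_VALUE.get(code)


print(value_with_ifs(3), value_with_lookup(3))
```

Both functions give the same answer for every code; the second simply does less work per record, which is the saving a code-to-value statement is aiming for.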
4. Producing lots of +t tables
Adding a mean score to the bottom of a table is easily achieved in MRDCL by using a +t table. However, if you have a lot of +t tables, run times can start to increase. If you have only one or two such tables, it will make only a small difference, but if you have a lot of them, run times can be reduced significantly. I’ve seen this when users want many summary tables of mean scores. So, something like this might look familiar:
Mean score summaries can be a problem
%avg='Mean score [a]',
[*sk 99 on a.eq.1]+[*99]t#1[~a]=@ $q1[~a] * $banner,
Now, with the pre-processor, that looks like quite efficient code, but you are effectively running 100 tables. The following code may look more cumbersome, but it means that there are only two tables.
Doing the mean score summary in 2 tables
Dm $q1_not_undefined=[*do a=100]$q1[~a]/nu,[*end a]
T#1(f=nitb/nrtv/dpt2)=$var100rows * $banner,
t#1x(nptb)=$q1_not_undefined * $banner,
Select wq $q1[~a],t#1=[a] * $banner,
Select wq 1,
Then in the manip stage,
Mt#1=#1 / #1x,
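The arithmetic behind this approach can be sketched in Python, assuming the first table holds the summed scores for each statement and the second holds the count of respondents with a defined answer. The data and names are made up for illustration, and the banner columns are left out.

```python
# Illustrative Python, not MRDCL. None stands for an undefined answer.
respondents = [
    {"q1_1": 4, "q1_2": 2, "q1_3": None},
    {"q1_1": 5, "q1_2": None, "q1_3": 3},
    {"q1_1": 3, "q1_2": 4, "q1_3": 1},
]
statements = ["q1_1", "q1_2", "q1_3"]

sums = {s: 0 for s in statements}    # plays the role of the score table
counts = {s: 0 for s in statements}  # plays the role of the base table

# One pass over the data fills both tables.
for person in respondents:
    for s in statements:
        if person[s] is not None:
            sums[s] += person[s]
            counts[s] += 1

# The final division is done once per cell, after the data pass.
means = {s: sums[s] / counts[s] for s in statements if counts[s] > 0}
print(means)
```

However many statements there are, the data is accumulated into two tables and the division happens afterwards, rather than building one table per statement.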
Make it a general purpose piece of code.
That looks more complex and it is, but it is more efficient. Using data statements you should be able to convert that code to something like:
[*data 100=q1,100][*insert meanscoresummary]
5. Having lots of repeated blocks of code (for occasion based questions, for example)
One of the reasons that MRDCL scripts get very big is when there are repeated sections of a questionnaire. I recently encountered a user who had several big variables for each possible eating or drinking occasion over a two-week period. Rather than 25 big variables, the user had 5000 variables (25 big variables * 200 possible eating/drinking occasions). Each block of 25 variables was the same except for its data location and variable names.
Treat the data as hierarchical data
By treating the data as two levels of data, the survey can be processed far more efficiently. Effectively, 25 variables are processed up to 200 times (but only for occasions that have data) rather than processing and adding 5000 variables to your IDF. Also, your tables become simple crosstabs rather than tables with 200 overlays.
There’s a video on our YouTube channel that explains this technique – it can save hours of processing time.
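As a rough picture of the two-level idea, here is a Python sketch. It is not MRDCL code; the record layout, the variable names and the data are all hypothetical, and only the 25 * 200 arithmetic comes from the example above.

```python
# Illustrative Python, not MRDCL. Layout, names and data are hypothetical.
NUM_OCCASIONS = 200
VARS_PER_OCCASION = 25

# Flat layout: every respondent carries a slot for every possible occasion.
flat_variable_count = NUM_OCCASIONS * VARS_PER_OCCASION

# Two-level layout: a respondent record plus one child record per
# occasion that was actually reported.
respondent = {
    "id": 1,
    "occasions": [
        {"occasion": 1, "drink": 3, "place": 2},
        {"occasion": 7, "drink": 1, "place": 5},
    ],
}


def tally(respondents, variable):
    """Count codes for one occasion-level variable across all occasions."""
    counts = {}
    for person in respondents:
        for occasion in person["occasions"]:  # only occasions with data
            code = occasion[variable]
            counts[code] = counts.get(code, 0) + 1
    return counts


drink_counts = tally([respondent], "drink")
print(flat_variable_count)
print(drink_counts)
```

In the sketch, the respondent with two occasions costs two passes through the occasion-level variables, not 200, and the tally is an ordinary count rather than something spread over 200 overlays.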
Do loops are great but can be dangerous
One of the problems with MRDCL is the power of the pre-processor. Do loops are great, but remember that they are just producing larger amounts of code for you. If that code is inefficient, the pre-processor will not care; it will just do what you tell it. So, think about what is being produced and processed and, subsequently, added to your IDF.
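A toy Python sketch shows why a short loop can hide a lot of work. It is not the MRDCL pre-processor; the expand function and the template text are hypothetical, and only the [~a] placeholder is borrowed from the examples above.

```python
# Illustrative Python, not the MRDCL pre-processor. The template text is
# a made-up placeholder, not real MRDCL code.

def expand(template, count):
    """Return one generated line per loop iteration, substituting [~a]."""
    return [template.replace("[~a]", str(a)) for a in range(1, count + 1)]


lines = expand("process $v[~a]", 100)

print(len(lines))  # one short line of source became this many lines
print(lines[0])
print(lines[-1])
```

One line of source turns into 100 lines that all have to be processed, so any inefficiency in the template is repeated 100 times.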
Differences can be huge
As I said at the start, I am not just talking about best practice here. I have seen run times reduced from hours to one minute. That makes a difference to your life!
If you need more help, please feel free to contact me at firstname.lastname@example.org.