Efficient vs Inefficient Market Research Data Processing
I am sometimes shocked to see how inefficiently s…Read more
Data storage is cheap and data collection is cheaper. This means that there is less effort to control the amount that is collected. The result is that when data is output from survey software programs, the record length may be extraordinarily long usually due to the amount text that has been collected. For most surveys, this will not be a problem, so you only need to be concerned with this for exceptional projects.
Unlike almost all survey analysis programs, MRDCL has no limits to the amount of data that can be processed if the data are handled correctly.
Many survey software programs output data to ASCII or to Triple-S, which also uses ASCII for its data storage. ASCII data means that data is output in a fixed format setting the width of each field to the maximum required for any one respondent. This means that if one respondent has replied to an open ended question using 2000 characters, the field would be set to 2000 characters (Note: some survey tools may limit such a field to a certain number of characters). Further, long brand lists where multiple responses are allowed may be output as a separate field for each possible response, such that if there are 1000 make/models that a respondent could choose, there would be 1000 fields in the output file.
ASCII stands for American Standard Code for Information Interchange (see Wikipedia link for more info - https://en.wikipedia.org/wiki/ASCII). In practical terms, for survey data this means that data in an ASCII file is stored in a fixed format using the characters on a keyboard – this is not strictly true but it’s a good enough way to think about ASCII. Most survey data are stored as numeric codes where each code represents a specific response to a question. The exception is open ended questions or text based answers where the text is stored.
In survey terms, an ASCII record usually refers to the data relating to one specific respondent. Thus, a data file will usually contain a series of records which bear data to the same number of respondents as records in the file. There are exceptions to this. For example, when data is collected for each respondent and for each meal occasion.
There are no rules as to what may constitute the maximum length of an ASCII record. In theory, it may be any length, but most survey analysis software systems will have limits. Some will be very limited. This can present a problem if the record is too long. If the survey analysis tool you are using cannot process an ASCII record length of more than, say, 10000 characters, it may either fail to run or ignore any data beyond the 10000th character.
One problem I have with long ASCII record is finding a tool to inspect that data if you want to see what is coded on one or more records. Notepad, for example, is extremely limited and, more importantly, does not have any tools to indicate what code is present in a specific field. There are a number of more powerful editors, such as my favourite Winvi, which handle bigger files as well as displaying the line you are viewing or editing as well as the character in the line / within the record.
However, most editors either become very slow or crash if a file has thousands of lines (records) and a long ASCII record length – even Winvi, which is very powerful, can fail to work. UltraEdit has successfully handled some of the biggest files that I have seen, but a file with a record length of 500,000 characters per line might cause UltraEdit to struggle. Being able to view/edit a file in its raw form is important, in my view.
MRDCL sets the limit for a line within a file to 99999. This is, of course, an arbitrary limit that we could increase. However, every increase would degrade the performance of MRDCL for more normal sized files, so we have set a generous limit which does not degrade MRDCL’s impressive run time performance. Further, most of the more powerful editors can handle up 100000 characters. But, for records over 99999 characters a solution is at hand.
BRIMS stands for Break Records Into Manageable Sizes. A free MRDC utility will cut each line/record in the data file into two or more lines of a manageable length. For example, if you have a file with a record length of 500,000 characters – probably bigger than any survey analysis tool or editor can handle – the free utility will chop the line into say 10 lines of 50,000 characters. This data can then be analysed using MRDCL and can be viewed in the more powerful editors. You can download the toolbox which contains this Cut ASCII utility.
If your data record started with a record length of 500,000 characters and using the BRIMS technique was output as ten lines for each respondent of 50,000 characters, the new file would appear as follows for each record:
Line 1 – Fields 1-50,000 would be the fields 1-50,000 in the original file
Line 2 – Fields 1-50,000 would be the fields 50,001 to 100,000 in the original file
Line 3 – Fields 1-50,000 would be the fields 100,001 to 150,000 in the original file
Line 10 – Fields 1-50,000 would be the fields 450,001 to 500,000 in the original file.
In other words, your output file from your original data would have 10 times the number of lines but it would be one-tenth of the width (or record length).
To process this data in MRDCL, you would need to read and process each line for each respondent in turn. You will need to use READ statements to read each line for each respondent otherwise the START CONTROL will act as the statement that reads data. Your script would look like this:
Start control, C=myfile.asc, BRA, Finish control, Start data, Card characters 50000, Read $1-50000, !Process data from the first line (original data fields 1-50,000) Ds $var=$5/1,2, Read $1-50000, !Process data from the second line (original data fields 50,001 – 100,000 Ds $var2=$5/1,2, !What you are picking up from $5 was originally in field 50,005. Read $1-50000, !Process data from the third line (original data fields 100,001 – 150,000 Ds $var3=$5/1,2, !What you are picking up from $5 was originally in field 100,005. !etc etc etc for the next 6 lines Read $1-50000, !Process data from the tenth line (original data fields 450,001 – 500,000 Ds $var4=$5/1,2, !What you are picking up from $5 was originally in field 450,005. Finish data,
Effectively, you are reading the ten lines applicable to each respondent after using the BRIMS technique and then when your program control hits FINISH DATA, it will write the record away in the usual way.
You will note that the Run Control Parameter (RCP) BRA is used in the example script. This RCP tells MRDCL that Blank Records are Accepted. This is important as MRDCL will need to treat a blank line as being an empty data instead of being a line that it would usually ignore. It is important that this RCP is used.
If the software system that holds the original data allows you to control which variables are exported to ASCII, you may have another option, which may be better. You could export, say, half of the data to one ASCII file, selecting half of the variables and then carry out a second export to another data file. You can then use MRDCL to read the data from two separate files. The limitations with this methodology are, firstly, that it may not be possible to control which variables are exported, especially from some of the free online survey tools and, secondly, MRDCL has a limit of opening four files simultaneously. However, it means that you do not need to use BRIMS and requires a simpler MRDCL script.
The script below assumes you have data in two separate files called myfile1.asc and myfile2.asc.
Start control, C(a)=myfile1.asc, C(b)=myfile2.asc, BRA, Finish control, Start data, Card characters 99999, Read(a) $1-99999, !Process a record from the first file Ds $var=$5/1,2, Read(b) $1-99999, !Process a record from the second file Ds $var2=$5/1,2, Finish data,
It may or may not be possible to control how data are exported from the software you have used to collect the data. However, if it is possible, these are some things that will make your exports more efficient:
More and more data contains text. Large texts fields can take a large amount of space, particularly when there are no limitations on the amount of text a survey respondent can enter. Often this text is not used for tabulation analysis – you might only use it to view in Excel, for example. Therefore, if you don’t need the text questions, you should consider removing the data when you have large files, especially when there are several text-based questions.
Some survey tools store a lot of non-survey data, such as email address, IP address, home address etc. as well as various reference numbers or codes. Generally, you do not need these when analysing survey results, so you may choose to remove them if it is possible.
Most survey tools will output the data from a multiple response question as a series of 1s and 0s (or blanks) for each of the responses. For example, if a question asks which of twenty brands are bought regularly, the data is usually output as 20 fields containing 1 or 0 (or blank) where 1 indicates that that brand was bought. For example, if brands 2, 4 and 20 were bought regularly, all 20 fields would be 0 or blank except for the second, fourth and twentieth fields which would contain a code 1.
A spreadfield could store the same data as 020420 in three 2 digit fields representing 2, 4 and 20. This taking up six columns of data rather than 20. Now, imagine that you have a make/model list of 2000 cars and a respondent is asked to choose their three favourite cars, you could store this data in 3 fields of 4 digits (12 columns) rather than 2000 columns. Now, imagine that there are several questions asking about makes and models of cars and the data will grow rapidly. So, spreadfields can be an efficient way to store multiple response questions that have a lot of answers.
The biggest problem is that most survey software tools cannot output data in this way – again, most free tools would not have this flexibility, but the better paid for tools will often have this facility. One way you can get round this, however, is to set up the questionnaire so each response is collected separately. This would work well for the example above where the first three choices are collected. You could ask the question as three questions asking for the respondent’s first, second and third choices – the data for each of these three questions would be single responses and could be stored efficiently.
You should be able to handle any ASCII data file using MRDCL provided that you follow one of the above guidelines. MRDCL will process these big files efficiently and quickly – far quicker than other products. If you need further help, just ask us to explain how MRDCL can deal with your data.