The need for big data science keeps increasing, if not exponentially then certainly polynomially, as the industry has merged with techniques from machine learning, artificial intelligence, and High Performance Computing (HPC), which has itself grown remarkably, producing a direct symbiotic growth in the demand for big data science. Traditionally, data science was practiced by and large by corporate firms that could pay the hefty license fees for tools offering tailor-made solutions to a large number of problems that the open source movement does not address. Tools such as SAS offered specialized procedures and features, such as reading files without loading them completely into memory, unlike open source tools such as R, which first loads the data into memory as a data frame before performing any operation on it. SAS (pronounced “sass”) once stood for “statistical analysis system”; however, SAS now covers more than just statistics, as it also incorporates machine learning and other complex computing into its system (https://www.sas.com/en_us/home.html). R is an open source language for statistical computing and graphics (https://www.r-project.org/). In this article, apart from outlining the differences between the SAS and open source (e.g., R) worlds, we also provide a toolkit that can convert a single SAS7BDAT file, or a set of them, into CSV format for use by R and other applications. At the same time, SAS users may be interested in performing operations on data delivered in CSV format after converting it to SAS format, for which the tool provides the same conversion-script flexibility, much in the manner of a bag of tasks. The scripts can be executed in a standalone serial mode, in a batch-submission serial mode on a parallel environment, or in a ‘bag-of-tasks’ oriented parallel distributed mode.
Need for a SAS7BDAT-to-CSV (and Vice Versa) Data Format Conversion Toolkit
The corporate world, which could afford the hefty license fees for tools such as SAS, did not have to worry about big data challenges as much as later became apparent in the open source world, such as among R users. A lock-in situation arose, as all of the organizational IT memory (prior code, code dependencies, know-how, and any custom training programs for new hires) was in SAS. On the other hand, as data science and big data analysis became important in the less affluent corporate, training, research, and academic worlds alike, those communities increased their emphasis on R, making it more and more popular. The gap between people wanting to work in SAS and people wanting to work in R thus kept widening, largely because of the inertia of acquired skill sets and orientations, and hence the switching cost involved in terms of the effort needed. While the large memory footprint that a single R session needs for big files cannot simply be engineered away without some intelligent parallel programming, as it is in SAS, a data scientist is at least in a comfort zone if the data provided is in Comma Separated Value (CSV) format rather than SAS7BDAT format. Given the recent increase in demand for R programmers, despite the features R lacks compared to SAS, and for the reasons given above, there have been efforts to write modules or packages that can easily import sas7bdat files for conversion to CSV. These efforts are recent, within only the last two to three years, clearly underscoring the need of R professionals wanting to take over the roles that SAS professionals have held so far in the corporate world. These additional packages, such as haven by Wickham et al.
2016 [1] or sas7bdat by Shotwell et al. 2015 [2], thus quickly became popular, as many companies hold data on the order of several gigabytes in sas7bdat format that an R or SPSS professional could easily handle if it were simply converted to CSV, or at least loaded into a data frame without the need for CSV conversion. The problem with these additional packages is that they do not perform well at larger data sizes, starting at hundreds of gigabytes. More importantly, if a section of the data needs to be read, so that several different sections can be parsed in parallel on systems with only a few gigabytes of RAM rather than hundreds, these packages present an inherent deadlock situation. Even more importantly, converting the sas7bdat data frame, once read, to a CSV file can be still more demanding in terms of RAM, time, and disk space, simply because the CSV writer fills blank entries with ‘NA’ rather than leaving them empty (not even a space), and encloses every entry in a pair of double quotes (""). The need for a tool that can convert a large SAS7BDAT file, or even hundreds of SAS7BDAT files, to CSV format in an automated fashion is therefore immense, given the time, cost, and effort it can save the data scientist; the R packages do not currently offer this, at least not efficiently, for big files of sizes over 100 gigabytes to start with. At this point, I must revisit the data providers’ business professionals. As professionals in the data collection and selling business are aware of the merits of big data and high-performance computing technologies, and are already locked in to the corporate world’s demands for their profits, they tend to generate as much data as possible, preferably in SAS7BDAT format.
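As a concrete illustration of the CSV bloat described above, a post-processing pass with sed can strip the quoted fields and ‘NA’ placeholders that default CSV writers (for example R’s write.csv) emit. This is an illustrative sketch only, not part of the toolkit; it assumes no field legitimately contains an embedded comma, double quote, or the literal string NA:

```shell
#!/bin/sh
# Shrink a CSV in which every field is double-quoted and missing values
# are written as NA. Illustrative sketch: assumes no field legitimately
# contains commas, quotes, or the literal token "NA".
shrink_csv() {
    sed -e 's/^NA,/,/' \
        -e 's/,NA$/,/' \
        -e ':a' -e 's/,NA,/,,/' -e 'ta' \
        -e 's/"//g' "$@"
}
```

Running something like `shrink_csv big.csv > big_small.csv` on a wide table with many missing values can cut the file size substantially; the `:a`/`ta` loop is needed because adjacent `,NA,NA,` tokens overlap and a single global substitution would miss every second one.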
As an example, in the health analytics sector, the Truven MarketScan database, an IBM enterprise, provides all of its data in SAS7BDAT format; even purchasing ten years of commercial and Medicare data means dealing with about 3.5 terabytes of SAS7BDAT files. Several biomedical publications have already cited the Truven MarketScan database, for example the recent publications [6,7,8,9], as data science becomes more popular in the biomedical sector as well, having already gained importance in the big data bioinformatics and genomics sector. No R expert, with any currently known freely available package, will be able even to read such data without something like 1 terabyte of RAM on a supercomputer. The SasCsvToolkit thus finds a critical application in catering to the underlying requirements of an ever more popular big data science world that is reaching into every industry as a game changer.
At the same time, a SAS expert has a comfort zone for data analytics in the SAS environment and may simply want to convert all CSV data to SAS7BDAT format. Even people who know both SAS and other data processing tools, such as SPSS, MATLAB, R, AWK, SED, and PERL, know that some things are easier in one tool than another, and there can be situations where conversion to SAS7BDAT format makes more sense because the downstream analysis is simpler in SAS systems. One such example is simply using the SQL procedure in SAS when no SQL server is available in the environment and inner joins and other table queries would be difficult in AWK, for instance, or even with the JOIN command on Linux, which requires sorted input and careful handling of field delimiters. Thus, a parallel distributed tool for converting large and/or numerous CSV files to SAS7BDAT format is an acute necessity.
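The sorting requirement just mentioned can be shown in a few lines of shell. This is a minimal illustration, not part of the toolkit: an inner join of two CSV tables on their first column, with the sorting that the join command insists on done explicitly:

```shell
#!/bin/sh
# Inner join of two CSV files on their first column using POSIX sort and
# join. Unlike PROC SQL, join requires both inputs to already be sorted
# on the key; this function does that housekeeping explicitly.
csv_inner_join() {
    sort -t, -k1,1 "$1" > "$1.sorted"
    sort -t, -k1,1 "$2" > "$2.sorted"
    join -t, "$1.sorted" "$2.sorted"
}
```

In PROC SQL the equivalent inner join needs no pre-sorting step at all, which is precisely why converting to SAS7BDAT and querying there can be the simpler route.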
The conversion of CSV to SAS7BDAT format brings further benefits because the toolkit also includes sample template scripts for database table search, exact match, substring match, except conditions, filters, unique values, table joins, and data mining, tasks for which a SAS database programmer often struggles to find the right way to develop the code.
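For readers who stay on the CSV side, several of these template operations have rough AWK one-liner analogues. The sample table, column positions, and match strings below are hypothetical, and fields are assumed not to contain embedded commas:

```shell
#!/bin/sh
# Hypothetical sample table to run the one-liners against.
printf 'id,drug,note\n1,foo,barline\n2,foo,none\n3,baz,rebar\n' > data.csv

# Exact match: rows whose 2nd field equals "foo" (header row skipped).
awk -F, 'NR > 1 && $2 == "foo"' data.csv

# Substring match: rows whose 3rd field contains "bar".
awk -F, 'NR > 1 && index($3, "bar")' data.csv

# Unique values of the 2nd column, in order of first appearance.
awk -F, 'NR > 1 && !seen[$2]++ {print $2}' data.csv
```

These cover search, exact match, substring match, and unique values; joins, as noted above, are where AWK becomes awkward and PROC SQL shines.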
SasCsvToolkit and its Usage
The SasCsvToolkit is written in shell script and depends on SAS 9.4, Perl, and a Linux OS. The scripts can be submitted to a job scheduler such as the Portable Batch System (PBS), Nitzberg et al. [3], or SLURM, Jette et al. [4]; automation pipelines with parameters that can be adjusted in the file are provided in the toolkit for both. The toolkit also includes automated SASSQL routine scripts for conducting basic search, string pattern matching, and inner joins, which are usual requirements for any database work.
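The bag-of-tasks pattern behind the parallel mode can be illustrated without any scheduler at all, using plain background shell jobs. The sketch below is a stand-in, not the toolkit’s actual driver: `convert_one` is a hypothetical placeholder for the per-file conversion step, which in the toolkit is a SAS invocation submitted through SLURM or PBS:

```shell
#!/bin/sh
# Bag-of-tasks sketch: launch one independent background job per file name
# listed (one per line) in a manifest, then wait for all of them to finish.
# convert_one is a hypothetical placeholder for the real per-file
# SAS7BDAT-to-CSV conversion step.
convert_one() {
    echo "converted $1" >> results.log    # placeholder for real work
}

printf 'fileA\nfileB\nfileC\n' > filelist.txt   # manifest of dataset names
: > results.log                                  # start with an empty log

while read -r name; do
    convert_one "$name" &     # each task is independent: a bag of tasks
done < filelist.txt
wait                          # block until every task has finished
```

Because every task is independent, a scheduler such as SLURM can scatter them across nodes in exactly the same spirit, with per-task log files making failures traceable.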
A ‘README’ file is incorporated that documents the usage syntax for the tool. The parallel jobs are implemented with a combination of Perl and shell scripts and require no MPI, CUDA, OpenMP, or POSIX threads on top of the SLURM scheduler. Environment variables and SAS command-line system variables are used cleverly to write files and communicate between script workloads.
Before using the SasCsvToolkit, it is critical to change the directory in the .sas files to the location where the SAS library files exist or are planned to be generated. A sub-directory called TrialDir is provided where the results will be generated. Small sample CSV files are also provided in TrialDir so that users can try their hand at CSV-to-SAS format conversion. To keep things easily identifiable and traceable in case of errors, a log file is generated for each sub-task or SAS file; the log file bears the name of the sub-task’s SAS file and its corresponding submission ID. The file sampleSasYbdatFileNames.txt is provided as an example; it lists one file name per line, without the .sas7bdat extension. Users are advised to read the README file for more clarity on the various usage options that exist for CSV-to-SAS format conversion, or vice versa, in serial command-line, batch-submission, or parallel mode.