DataStage and File Formats - datastage

Does anyone know how I can determine, within a DataStage job, whether the input sequential file has Microsoft (DOS) or Unix end-of-line markers, so that the result can direct the path through the rest of the job?
Thanks

If you just want to make sure you can read all the data regardless of the line endings, the following might be useful: it lets you handle everything in one job instead of maintaining separate jobs for DOS and UNIX files. So this answer does not match the exact question, but it might still solve the problem.
In the Sequential File Stage, you can define a filter sed 's/\r//' to convert DOS (Windows) line breaks (\r\n) to UNIX/Linux line breaks (\n).
In the Format tab, define UNIX newline as Record delimiter.
Note that defining a filter in a Sequential File stage has some drawbacks:
The option First Line is Column Names is not available when using a filter.
If your input file has column names in the first line, you need to remove it manually by extending the filter with a tail command: sed 's/\r//' | tail -n +2
The option Report Progress is not available when using a filter.
You might want to set the type of the last column in your column definition to VarChar or NVarChar; I've seen strange behaviour when fixed-length types like NChar or Char are used for the last column. I didn't do deeper research on this, but I believe it has to do with using a filter.
Tested in DS 11.7.1
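To address the original question of detecting the line endings so the job can branch: a minimal sketch, assuming you can run a command before the job (for example via ExecSH or a before-job subroutine) and capture its output, is a Perl one-liner that checks the first line of the file for a carriage return (the file path is a placeholder):
perl -ne 'print /\r$/ ? "DOS\n" : "UNIX\n"; exit' /path/to/input.txt
The job would then use the printed value to decide which path to take through the rest of the processing.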

Related

Perl Command Understanding

I have been working on some product code to resolve an issue, but I am stuck on a line of code.
Can anyone help me understand what exactly this command does?
perl -MText::CSV -lne 'BEGIN{$p = Text::CSV->new()} print join "|", $p->fields() if $p->parse($_)' /home/daily/${FULL_FILENAME} > /home/output.txt
I think it copies the file to my home location with some transformations, but I'm not sure exactly.
This is a slightly broken program that translates a comma-separated values (CSV) file to a pipe-separated values file.
The particular command-line switches are documented in perlrun. This is a "one-liner", so you can read about those to see what's going on there.
The Text::CSV module deals with CSV files, and the program is parsing a line from the file and re-outputting as a pipe-separated file.
But this program treats each line as a complete record. That might be fine for you, but at some point you might end up with a field value that has a newline in it, like a,"b\nc",d. Reading line by line then breaks the program, since the quotes appear to be unclosed within the first line. Not only that, it blindly joins the parsed fields without considering whether any of the fields should be quoted. It might be unlikely that a pipe character would be in the data, but the problem isn't its rarity; it's the consequences and cost when it does show up.
The rewrite.pl example script in the related module Text::CSV_XS is a tool that could replace this one-liner. It reads the input properly and knows how to translate it correctly.
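If you want to stay close to the original one-liner, here is a minimal sketch of a more robust version (the input and output paths are placeholders, as in the question): it reads whole CSV records with getline, which copes with embedded newlines, and writes them back out with a pipe separator so fields are quoted when necessary:
perl -MText::CSV -e '
    my $in  = Text::CSV->new({ binary => 1, auto_diag => 1 });   # binary allows embedded newlines
    my $out = Text::CSV->new({ binary => 1, sep_char => "|", eol => "\n" });
    open my $fh, "<", $ARGV[0] or die "open: $!";
    while (my $row = $in->getline($fh)) {                        # reads a full record, even across lines
        $out->print(*STDOUT, $row);                               # re-quotes fields containing | or newlines
    }
' /home/daily/input.csv > /home/output.txt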

SAS - Reading multiple compressed data files

I hope you are all well.
So my question is about the procedure to open multiple raw data files that are compressed.
My file names are ordered, so I have for example: o_equities_20080528.tas.zip o_equities_20080529.tas.zip o_equities_20080530.tas.zip ...
Thank you all in advance.
How much work this will be depends on whether:
You have enough space to extract all the files simultaneously into one folder
You need to be able to keep track of which file each record has come from (i.e. you can't tell just from looking at a particular record).
If you have enough space to extract everything and you don't need to track which records came from which file, then the simplest option is to use a wildcard infile statement, allowing you to import the records from all of your files in one data step:
infile "c:\yourdir\o_equities_*.tas" <other infile options as per individual files>;
This syntax works regardless of OS - it's a SAS feature, not shell expansion.
If you have enough space to extract everything in advance but you need to keep track of which records came from each file, then please refer to this page for an example of how to do this using the filevar option on the infile statement:
http://www.ats.ucla.edu/stat/sas/faq/multi_file_read.htm
If you don't have enough space to extract everything in advance, but you have access to 7-zip or another archive utility, and you don't need to keep track of which records came from each file, you can use a pipe filename and extract to standard output. If you're on a Linux platform then this is very simple, as you can take advantage of shell expansion:
filename cmd pipe "nice -n 19 gunzip -c /yourdir/o_equities_*.tas.zip";
infile cmd <other infile options as per individual files>;
On Windows it's the same sort of idea, but as you can't use shell expansion, you have to construct a separate filename for each zip file, or use some of 7-zip's more arcane command-line options, e.g.:
filename cmd pipe "7z.exe e -an -ai!C:\yourdir\o_equities_*.tas.zip -so -y";
This will extract all files from all of the matching archives to standard output. You can narrow this down further via the 7-zip command if necessary. You will have multiple header lines mixed in with the data - you can use findstr to filter these out in the pipe before SAS sees them, or you can just choose to tolerate the odd error message here and there.
Here, the -an tells 7-zip not to read the zip file name from the command line, and the -ai tells it to expand the wildcard.
If you need to keep track of what came from where and you can't extract everything at once, your best bet (as far as I know) is to write a macro to process one file at a time, using the above techniques, and to add this information while you're importing each dataset.

diff ignore certain pattern in the file

I want to diff two files whose lines begin with "line_$NR". I want the diff to ignore the "line_$NR" prefixes, but when the differences are printed I want "line_$NR" to be displayed.
Is it possible to do that?
Thanks.
I believe in this case you have to preprocess your input files to remove /^line_[0-9]*/, diff the resulting files, then recombine the diff output with the removed prefixes according to the line numbers in the diff output.
Python's difflib should be very handy here, or the equivalent from Perl. If you want to stick to the shell, I suppose you could get by with awk.
If you don't need exact output, perhaps you can use diff's --line-format=... directive to inject the actual line number into the diff, rather than the prefix you removed in the preprocessing step.
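A minimal sketch of that preprocessing approach (file names are placeholders; re-attaching the prefixes afterwards is left out):
perl -pe 's/^line_\d+\s*//' file1.txt > file1.stripped
perl -pe 's/^line_\d+\s*//' file2.txt > file2.stripped
diff file1.stripped file2.stripped
Since only prefixes are removed (no whole lines), the line numbers in the diff output still refer to the same lines in the original files, so a small script can look up and print the corresponding line_$NR prefixes alongside the differences.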

Paginate a big text file

I have a big text file. Each line in the file is a record, and I need to parse the file and show only 20 records in an HTML table at a time. I also have to support sorting.
What I am currently doing is reading the file line by line based on the parameters start, stop, and page_size, which are provided in the query string. It works fine until I have to sort the records, because in order to sort I need to process every line in the text file.
So is there a Unix command with which I can extract a range of lines and sort them? I tried grep, but I don't know enough about it to solve this problem.
Take a look at the pr command. This is what we used to use all the time to paginate big files. You can set the page length, headers, footers, turn on line numbers, etc.
There's probably even a way to munge the output into HTML.
How big is the file? Also have a look at man sort.
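A minimal sketch combining these suggestions, assuming pipe-delimited records sorted on the second field (the delimiter, sort key, and line range are placeholders):
sort -t'|' -k2,2 records.txt > records.sorted.txt
perl -ne 'print if 21 .. 40' records.sorted.txt
The second command prints lines 21 through 40 of the sorted file, i.e. the second page of 20 records; a similar one-liner could wrap each record in <tr><td>...</td></tr> tags when building the HTML table.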

How to collect column headers and data using dbisqlc.exe command

I am trying to query a Sybase ASA database using the dbisqlc.exe command-line on a Windows system and would like to collect the column headers along with the associated table data.
Example:
dbisqlc.exe -nogui -c "ENG=myDB;DBN=dbName;UID=dba;PWD=mypwd;CommLinks=tcpip{PORT=12345}" select * from myTable; OUTPUT TO C:\OutputFile.txt
I would prefer it if this command wrote to stdout; however, that does not appear to be an option, aside from using dbisql.exe, which is not available in my environment.
When I run it in this form, the header and data are generated, but in a format that is hard to parse.
Any assistance would be greatly appreciated.
Try adding the 'FORMAT SQL' clause to the OUTPUT statement. It will give you the select statement containing the column names as well as the data.
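For example, adapting the command from the question (connection details as in the original):
dbisqlc.exe -nogui -c "ENG=myDB;DBN=dbName;UID=dba;PWD=mypwd;CommLinks=tcpip{PORT=12345}" select * from myTable; OUTPUT TO C:\OutputFile.txt FORMAT SQL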
In reviewing the output from the following dbisqlc.exe command, it appears as though I can parse the output using perl.
Command:
dbisqlc.exe -nogui -c "ENG=myDB;DBN=dbName;UID=dba;PWD=mypwd;CommLinks=tcpip{PORT=12345}" select * from myTable; OUTPUT TO C:\OutputFile.txt
The output appears to break in odd places when viewed in text editors such as vi or TextPad; however, the output from this command is actually returned with fixed column widths.
The second line of the output contains runs of = signs spanning the width of each column. What I did was build a "template" string based on these = runs, which can be passed to Perl's unpack function. I then use this template to build an array of column names and parse the result set using unpack.
This may not be the most efficient method, but I think it should give me the results I am looking for.
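A minimal sketch of that idea (the script name parse_output.pl and the output path are placeholders; it assumes the first line holds the column names and the second line the ruler of = signs):
#!/usr/bin/perl
# parse_output.pl - build an unpack template from the = ruler and emit pipe-delimited records
use strict;
use warnings;

open my $fh, "<", "C:/OutputFile.txt" or die "open: $!";
my $header = <$fh>;    # first line: column names
my $ruler  = <$fh>;    # second line: runs of = marking each column's width
$ruler =~ s/\s+$//;

# Build a template such as "A13A9A21": each width = column width plus the separating spaces.
my $template = "";
while ($ruler =~ /(=+)(\s*)/g) {
    $template .= "A" . (length($1) + length($2));
}

my @columns = unpack $template, $header;    # "A" strips trailing blanks for us
print join("|", @columns), "\n";

while (my $line = <$fh>) {
    my @fields = map { s/^\s+//r } unpack $template, $line;    # trim leading blanks on right-aligned values
    print join("|", @fields), "\n";
}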