File Splitting with DataStage (8.5) - datastage

I have a job that successfully produces a sequential file (CSV) output with some hundred million rows. Can someone provide an example where the output is written to a hundred separate sequential files, each with a million rows?
What does the sequential file stage look like, how is it configured?
This is ultimately to allow QA to review any one of the individual outputs without a special text editor that can view large text files.

Based on the suggestion from @Mr. Llama and a lack of forthcoming solutions, we decided on a simple script to be executed at the end of the scheduled DataStage event.
#!/bin/bash
# usage:
# sh ./[script] [input]
# check for input:
if [ $# -ne 1 ]; then
    echo "No input file provided."
    exit 1
fi
# directory for output:
mkdir -p split
# header without content:
head -n 1 "$1" > header.csv
# content without header:
tail -n +2 "$1" > content.csv
# split content into 100000 record files:
split -l 100000 content.csv split/data_
# loop through the new split files, adding the header
# and a '.csv' extension:
for f in split/*; do cat header.csv "$f" > "$f.csv"; rm "$f"; done
# remove the temporary files:
rm header.csv content.csv
Crude, but it works for us in this case.
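For reference, a hypothetical invocation (the script name and input file name are placeholders) and the resulting layout:
# run against the job's CSV output:
sh ./split_output.sh output.csv
ls split/
# data_aa.csv  data_ab.csv  data_ac.csv  ...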

Related

Perl - Changing file name in the middle of write

I am trying to take a very large txt file (over a million lines) that I created in Perl and run it through a different statement in Perl that will essentially look something like this (note: the following is shell):
a=0
b=1
while read line; do
    echo -n "" > "Write file"${b}
    a=($a + 1)
    while ( $a <= 5000 ); do
        echo $line >> "Write file"${b}
        a=($a + 1)
    done
    a=0
    b=($b + 1)
done < "read file"
I'm trying to size it down to 5k lines per file, incrementing the file name each time (filename1.txt, filename2.txt, filename3.txt, etc.).
This doesn't seem to work in shell, possibly due to the size of the input file, and for the life of me I can't think of how to change which file I am writing to in the middle of the loop.
You can just do this in the shell using split.
For example:
split -l 5000 filename.txt filename.txt.
will split filename.txt into multiple files with a max of 5,000 lines each. The output files will be named filename.txt.aa, filename.txt.ab, filename.txt.ac, etc.
From my man split:
NAME
     split -- split a file into pieces
SYNOPSIS
     split [-a suffix_length] [-b byte_count[k|m]] [-l line_count]
           [-p pattern] [file [name]]
DESCRIPTION
     The split utility reads the given file and breaks it up into files of
     1000 lines each. If file is a single dash (`-') or absent, split reads
     from the standard input.
     The options are as follows:
     -a suffix_length
             Use suffix_length letters to form the suffix of the file name.
     -b byte_count[k|m]
             Create smaller files byte_count bytes in length. If ``k'' is
             appended to the number, the file is split into byte_count
             kilobyte pieces. If ``m'' is appended to the number, the file
             is split into byte_count megabyte pieces.
     -l line_count
             Create smaller files n lines in length.
     -p pattern
             The file is split whenever an input line matches pattern,
             which is interpreted as an extended regular expression. The
             matching line will be the first line of the next output file.
             This option is incompatible with the -b and -l options.
     If additional arguments are specified, the first is used as the name
     of the input file which is to be split. If a second additional
     argument is specified, it is used as a prefix for the names of the
     files into which the file is split. In this case, each file into
     which the file is split is named by the prefix followed by a lexically
     ordered suffix using suffix_length characters in the range ``a-z''.
     If -a is not specified, two letters are used as the suffix.
     If the name argument is not specified, the file is split into
     lexically ordered files named with the prefix ``x'' and with suffixes
     as above.
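If you have GNU coreutils rather than the BSD split quoted above, you can also get numeric suffixes and a real extension directly; a sketch, assuming GNU split 8.16 or later:
# produces filename00.txt, filename01.txt, ...
split -l 5000 -d --additional-suffix=.txt filename.txt filename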
As an aside, this is your fixed script:
#!/bin/sh
a=0
b=1
while IFS= read -r line; do
    if [ $a -eq 0 ]; then
        echo -n '' > out-file-${b}
    fi
    echo "$line" >> out-file-${b}
    a=$(( $a + 1 ))
    if [ $a -eq 10 ]; then
        a=0
        b=$(( $b + 1 ))
    fi
done < in-file
Tested with bash and dash.
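If you specifically want the numbered filename1.txt, filename2.txt style from the question, awk can open a new output file every 5000 lines; a minimal sketch (bigfile.txt is a placeholder for your input):
awk 'NR % 5000 == 1 { close(out); out = "filename" ++n ".txt" }
     { print > out }' bigfile.txt
The close() matters when you produce many output files, since awk would otherwise keep every one of them open at once.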

how to replace with sed when source contains $

I have a file that contains:
$conf['minified_version'] = 100;
I want to increment that 100 with sed, so I have this:
sed -r 's/(.*minified_version.*)([0-9]+)(.*)/echo "\1$((\2+1))\3"/ge'
The problem is that this strips the $conf from the original, along with any indentation spacing. What I have been able to figure out is that it's because it's trying to run:
echo " $conf['minified_version'] = $((100+1));"
so of course it's trying to replace the $conf with a variable which has no value.
Here is an awk version:
$ awk '/minified_version/{$3+=1} 1' file
$conf['minified_version'] = 101
This looks for lines that contain minified_version. Any time such a line is found, the third field, $3, is incremented by one.
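Note that the trailing semicolon is dropped, because awk coerces the third field ("100;") to a number when incrementing it. A variant that edits the matched number in place, preserving the semicolon and any indentation, as a sketch:
awk '/minified_version/ { sub(/[0-9]+/, $3 + 1) } 1' file
# $conf['minified_version'] = 101;
Since sub() rewrites $0 directly rather than rebuilding it from fields, the original spacing survives.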
My suggested approach to this would be to have a file on-disk that contained nothing but the minified_version number. Then, incrementing that number would be as simple as:
minified_version=$(< minified_version)
printf '%s\n' "$(( minified_version + 1 ))" >minified_version
...and you could just put a sigil in your source file where that needs to be replaced. Let's say you have a file named foo.conf.in that contains:
$conf['minified_version'] = #MINIFIED_VERSION#
...then you could simply run, in your build process:
sed -e "s/#MINIFIED_VERSION#/$(<minified_version)/g" <foo.conf.in >foo.conf
This has the advantage that you never have code changing foo.conf.in, so you don't need to worry about bugs overwriting the file's contents. It also means that if you're checking your files into source control, so long as you only check in foo.conf.in and not foo.conf you avoid potential merge conflicts due to context near the version number changing.
Now, if you did want to do the operation in-place, here's a somewhat overdesigned approach written in pure native bash (reading from infile and writing to outfile; just rename outfile back over infile when successful to make this an in-place replacement):
target='$conf['"'"'minified_version'"'"'] = '
suffix=';'
while IFS= read -r line; do
    if [[ $line = "$target"* ]]; then
        value=${line##*=}
        value=${value%$suffix}
        new_value=$(( value + 1 ))
        printf '%s\n' "${target}${new_value}${suffix}"
    else
        printf '%s\n' "$line"
    fi
done <infile >outfile

extraction of required columns from many files and writing it to a single file

perl -F"\t" -lane '$, = ","; print $F[0], $F[4]' EM2.gcount > Em2gcount.csv
Using this command I was able to extract columns 0 and 4 from file 1 and write them to a separate .csv file. I have many files, though, and I want to print them all into a single file.
Please help me with what changes I should make.
find . -type f -name "*.gcount" -exec <yourperlcommand> {} \; >> Em2gcount.csv
This will find all .gcount files under your current directory and execute the perl command on {}, which references each file found, appending the combined output to Em2gcount.csv.
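If you would rather avoid spawning one perl process per input file, find's "{} +" form hands many files to a single invocation; a sketch reusing the one-liner from the question:
find . -type f -name "*.gcount" \
    -exec perl -F"\t" -lane '$, = ","; print $F[0], $F[4]' {} + > Em2gcount.csv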

Print line numbers after comparison

Can someone tell me the best way to print the number of different lines in 2 files? I have 2 directories with thousands of files, and I have a perl script that compares all files in dir1 with all files in dir2 and outputs the difference to a separate file. Now I need to add something like "Filename - # of different lines":
File1 - 8
File2 - 30
Right now I am using
my $diff = `diff -y --suppress-common-lines "$DirA/$file" "$DirB/$file"`;
But along with this I also need to print how many lines are different in each one of those 1000 files.
Sorry, this is a duplicate of my previous thread, so I would be glad if some moderator could delete the previous one.
Why even use perl?
for i in "$dirA"/*; do file="${i##*/}"; echo "$file - $(diff -y --suppress-common-lines "$i" "$dirB/$file" | wc -l)" ; done > diffs.txt
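A variant of the same idea using plain diff, counting added and removed lines on both sides (a slightly different number than the side-by-side pair count; directory variables as above):
for i in "$dirA"/*; do
    file="${i##*/}"
    # '<' marks lines only in the first file, '>' lines only in the second
    printf '%s - %s\n' "$file" "$(diff "$i" "$dirB/$file" | grep -c '^[<>]')"
done > diffs.txt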

Substituting environment variables in a file: awk or sed?

I have a file of environment variables that I source in shell scripts, for example:
# This is a comment
ONE=1
TWO=2
THREE=THREE
# End
In my scripts, I source this file (assume it's called './vars') into the current environment, and change (some of) the variables based on user input. For example:
#!/bin/sh
# Read variables
source ./vars
# Change a variable
THREE=3
# Write variables back to the file??
awk 'BEGIN{FS="="}{print $1=$$1}' <./vars >./vars
As you can see, I've been experimenting with awk for writing the variables back, and with sed too, without success. The last line of the script fails. Is there a way to do this with awk or sed (preferably preserving comments, even comments with the '=' character)? Or should I combine 'read' with string cutting in a while loop or some other magic? If possible, I'd like to avoid perl/python and just use the tools available in Busybox. Many thanks.
Edit: perhaps a use case might make clear what my problem is. I keep a configuration file consisting of shell environment variable declarations:
# File: network.config
NETWORK_TYPE=wired
NETWORK_ADDRESS_RESOLUTION=dhcp
NETWORK_ADDRESS=
NETWORK_ADDRESS_MASK=
I also have a script called 'setup-network.sh':
#!/bin/sh
# File: setup-network.sh
# Read configuration
source network.config
# Setup network
NETWORK_DEVICE=none
if [ "$NETWORK_TYPE" == "wired" ]; then
NETWORK_DEVICE=eth0
fi
if [ "$NETWORK_TYPE" == "wireless" ]; then
NETWORK_DEVICE=wlan0
fi
ifconfig -i $NETWORK_DEVICE ...etc
I also have a script called 'configure-network.sh':
#!/bin/sh
# File: configure-network.sh
# Read configuration
source network.config
echo "Enter the network connection type:"
echo " 1. Wired network"
echo " 2. Wireless network"
read -p "Type:" -n1 TYPE
if [ "$TYPE" == "1" ]; then
# Update environment variable
NETWORK_TYPE=wired
elif [ "$TYPE" == "2" ]; then
# Update environment variable
NETWORK_TYPE=wireless
fi
# Rewrite configuration file, substituting the updated value
# of NETWORK_TYPE (and any other updated variables already existing
# in the network.config file), so that later invocations of
# 'setup-network.sh' read the updated configuration.
# TODO
How do I rewrite the configuration file, updating only the variables already existing in the configuration file, preferably leaving comments and empty lines intact? Hope this clears things up a little. Thanks again.
You can't use awk to read from and write to the same file (that is part of your problem).
I prefer to rename the file before I rewrite it (but you can write to a tmp file and then rename, too).
/bin/mv file file.tmp
awk '.... code ...' file.tmp > file
If your env file gets bigger, you'll see that it is getting truncated at the buffer size of your OS.
Also, don't forget that gawk (the standard on most Linux installations) has a built-in array, ENVIRON. You can create what you want from that:
awk 'END {
    for (key in ENVIRON) {
        print key "=" ENVIRON[key]
    }
}' /dev/null
Of course you get everything in your environment, so maybe more than you want. But probably a better place to start with what you are trying to accomplish.
Edit
Most specifically
awk -F"=" '{
if ($1 in ENVIRON) {
printf("%s=%s\n", $1, ENVIRON[$1])
}
# else line not printed or add code to meet your situation
}' file > file.tmp
/bin/mv file.tmp file
Edit 2
I think your var=value lines might need to be export-ed so they are visible to the awk ENVIRON array.
AND
echo PATH=xxx| awk -F= '{print ENVIRON[$1]}'
prints the existing value of PATH.
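To see the export requirement in action, a small demo (FOO and BAZ are throwaway names):
FOO=bar awk 'BEGIN { print "FOO=" ENVIRON["FOO"] }'   # FOO=bar (exported for this command only)
BAZ=qux
awk 'BEGIN { print "BAZ=" ENVIRON["BAZ"] }'           # BAZ= (not exported, so not visible)
export BAZ
awk 'BEGIN { print "BAZ=" ENVIRON["BAZ"] }'           # BAZ=qux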
I hope this helps.
P.S. As you appear to be a new user: if you get an answer that helps you, please remember to mark it as accepted, and/or give it a + (or -) as a useful answer.
I don't know exactly what you are trying to do, but if you are trying to change the value of the variable THREE:
awk -F"=" -vt="$THREE" '$1=="THREE" {$2=t}{print $0>FILENAME}' OFS="=" vars
You can do this with just bash:
rewrite_config() {
    local filename="$1"
    local tmp=$(mktemp)
    # if you want the header
    echo "# File: $filename" >> "$tmp"
    while IFS='=' read -r var value; do
        # skip blank lines and comments: a bare "declare -p" on an empty
        # var name would dump every shell variable
        case $var in ''|\#*) continue ;; esac
        declare -p "$var" | cut -d ' ' -f 3-
    done < "$filename" >> "$tmp"
    mv "$tmp" "$filename"
}
Use it like
source network.config
# manipulate the variables
rewrite_config network.config
I use a temp file to maintain the existence of the config file for as long as possible.