Perl - Changing file name in the middle of write

I am trying to take a very large txt file (over a million lines) that I created in Perl and run it through a different statement in Perl that will essentially look something like this (note the following is shell):
a=0
b=1
while read line; do
  echo -n "" > "Write file"${b}
  a=($a + 1)
  while ( $a <= 5000 ); do
    echo $line >> "Write file"${b}
    a=($a + 1)
  done
  a=0
  b=($b + 1)
done < "read file"
Trying to size it down to 5k lines per file, incrementing the file name each time (filename1.txt, filename2.txt, filename3.txt, etc.)
This doesn't seem to work in shell, possibly due to the size of the input file, and for the life of me I can't think of how to change which file I am writing to in the middle of the loop.
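Since the question asks for Perl, here is a minimal sketch of swapping the output handle mid-loop. The 5,000-line chunk size and the filenameN.txt pattern come from the question; the input name read_file.txt is a placeholder:
#!/usr/bin/perl
use strict;
use warnings;

my $lines_per_file = 5000;
my $count   = 0;
my $file_no = 0;
my $out;

open my $in, '<', 'read_file.txt' or die "read_file.txt: $!";
while (my $line = <$in>) {
    # every 5000 lines, close the current output file and open the next one
    if ($count % $lines_per_file == 0) {
        close $out if $out;
        $file_no++;
        open $out, '>', "filename$file_no.txt"
            or die "filename$file_no.txt: $!";
    }
    print $out $line;
    $count++;
}
close $out if $out;
close $in;
The key point is that a filehandle is just a variable: closing it and reopening it against a new name in the middle of the loop is perfectly legal.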

You can just do this in the shell using split.
For example:
split -l 5000 filename.txt filename.txt.
will split filename.txt into multiple files with a max of 5,000 lines each. The output files will be named filename.txt.aa, filename.txt.ab, filename.txt.ac, etc.
From my man split:
NAME
     split -- split a file into pieces
SYNOPSIS
     split [-a suffix_length] [-b byte_count[k|m]] [-l line_count] [-p pattern] [file [name]]
DESCRIPTION
     The split utility reads the given file and breaks it up into files of 1000 lines each. If file is a single dash (`-') or absent, split reads from the standard input.
     The options are as follows:
     -a suffix_length
          Use suffix_length letters to form the suffix of the file name.
     -b byte_count[k|m]
          Create smaller files byte_count bytes in length. If ``k'' is appended to the number, the file is split into byte_count kilobyte pieces. If ``m'' is appended to the number, the file is split into byte_count megabyte pieces.
     -l line_count
          Create smaller files n lines in length.
     -p pattern
          The file is split whenever an input line matches pattern, which is interpreted as an extended regular expression. The matching line will be the first line of the next output file. This option is incompatible with the -b and -l options.
     If additional arguments are specified, the first is used as the name of the input file which is to be split. If a second additional argument is specified, it is used as a prefix for the names of the files into which the file is split. In this case, each file into which the file is split is named by the prefix followed by a lexically ordered suffix using suffix_length characters in the range ``a-z''. If -a is not specified, two letters are used as the suffix.
     If the name argument is not specified, the file is split into lexically ordered files named with the prefix ``x'' and with suffixes as above.

As an aside, this is your fixed script:
#!/bin/sh
a=0
b=1
while read line; do
  if [ $a -eq 0 ]; then
    echo -n '' > out-file-${b}
  fi
  echo "$line" >> out-file-${b}
  a=$(( $a + 1 ))
  if [ $a -eq 5000 ]; then
    a=0
    b=$(( $b + 1 ))
  fi
done < in-file
Tested with bash and dash.

Related

how to replace with sed when source contains $

I have a file that contains:
$conf['minified_version'] = 100;
I want to increment that 100 with sed, so I have this:
sed -r 's/(.*minified_version.*)([0-9]+)(.*)/echo "\1$((\2+1))\3"/ge'
The problem is that this strips the $conf from the original, along with any indentation spacing. What I have been able to figure out is that it's because it's trying to run:
echo " $conf['minified_version'] = $((100+1));"
so of course it's trying to replace the $conf with a variable which has no value.
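Since the whole problem is the shell expanding $conf, one alternative (a sketch; config.php is a placeholder file name) is to let Perl do the arithmetic, because its e modifier evaluates the replacement as Perl code rather than passing it through a shell:
perl -i.bak -pe 's/(\d+)/$1 + 1/e if /minified_version/' config.php
The $conf and the indentation survive untouched because nothing is ever re-echoed through the shell.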
Here is an awk version:
$ awk '/minified_version/{$3+=1} 1' file
$conf['minified_version'] = 101
This looks for lines that contain minified_version. Any time such a line is found, the third field, $3, is incremented by 1. (Note that $3 is the string 100;, which awk converts to the number 100 before adding 1, so the trailing semicolon is lost in the output above.)
My suggested approach to this would be to have a file on-disk that contained nothing but the minified_version number. Then, incrementing that number would be as simple as:
minified_version=$(< minified_version)
printf '%s\n' "$(( minified_version + 1 ))" >minified_version
...and you could just put a sigil in your source file where that needs to be replaced. Let's say you have a file named foo.conf.in that contains:
$conf['minified_version'] = #MINIFIED_VERSION#
...then you could simply run, in your build process:
sed -e "s/#MINIFIED_VERSION#/$(<minified_version)/g" <foo.conf.in >foo.conf
This has the advantage that you never have code changing foo.conf.in, so you don't need to worry about bugs overwriting the file's contents. It also means that if you're checking your files into source control, so long as you only check in foo.conf.in and not foo.conf, you avoid potential merge conflicts due to context near the version number changing.
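If you would rather not depend on sed, here is a hedged Perl sketch of the same build step (file names as in the example above):
#!/usr/bin/perl
# Sketch: replace the #MINIFIED_VERSION# sigil in foo.conf.in with the number
# stored in the minified_version file, writing the result to foo.conf.
use strict;
use warnings;

open my $vfh, '<', 'minified_version' or die "minified_version: $!";
chomp(my $version = <$vfh>);
close $vfh;

open my $in,  '<', 'foo.conf.in' or die "foo.conf.in: $!";
open my $out, '>', 'foo.conf'    or die "foo.conf: $!";
while (my $line = <$in>) {
    $line =~ s/#MINIFIED_VERSION#/$version/g;
    print $out $line;
}
close $in;
close $out;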
Now, if you did want to do the operation in-place, here's a somewhat overdesigned approach written in pure native bash (reading from infile and writing to outfile; just rename outfile back over infile when successful to make this an in-place replacement):
target='$conf['"'"'minified_version'"'"'] = '
suffix=';'
while IFS= read -r line; do
  if [[ $line = "$target"* ]]; then
    value=${line##*=}
    value=${value%$suffix}
    new_value=$(( value + 1 ))
    printf '%s\n' "${target}${new_value}${suffix}"
  else
    printf '%s\n' "$line"
  fi
done <infile >outfile

File Splitting with DataStage (8.5)

I have a job that successfully produces a sequential file (CSV) output with some hundred million rows, can someone provide an example where the output is written to a hundred separate sequential files, each with a million rows?
What does the sequential file stage look like, how is it configured?
This is to ultimately allow QA to review any one of the individual outputs without a special text editor that can view large text files.
Based on the suggestion from @Mr. Llama and a lack of forthcoming solutions, we decided on a simple script to be executed at the end of the scheduled DataStage event.
#!/bin/bash
# usage:
# sh ./[script] [input]
# check for input:
if [ $# -ne 1 ]; then
  echo "No input file provided."
  exit 1
fi
# directory for output:
mkdir split
# header without content:
head -n 1 "$1" > header.csv
# content without header:
tail -n +2 "$1" > content.csv
# split content into 100000 record files:
split -l 100000 content.csv split/data_
# loop through the new split files, adding the header
# and a '.csv' extension:
for f in split/*; do cat header.csv "$f" > "$f.csv"; rm "$f"; done
# remove the temporary files:
rm header.csv
rm content.csv
Crude but works for us in this case.
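If the temporary header.csv/content.csv files ever become a bottleneck, a hedged single-pass Perl sketch of the same idea (the split/data_ naming mimics the script above; it assumes the split directory already exists):
#!/usr/bin/perl
# Sketch: split a CSV into 100,000-row chunks, repeating the header in each chunk.
use strict;
use warnings;

die "usage: $0 input.csv\n" unless @ARGV == 1;

my $chunk_size = 100_000;
my ($rows, $part, $out) = (0, 0, undef);

open my $in, '<', $ARGV[0] or die "$ARGV[0]: $!";
my $header = <$in>;                     # keep the header line for every chunk
while (my $row = <$in>) {
    if ($rows % $chunk_size == 0) {     # start a new chunk file
        close $out if $out;
        $part++;
        open $out, '>', sprintf('split/data_%03d.csv', $part)
            or die "split/data_$part.csv: $!";
        print $out $header;
    }
    print $out $row;
    $rows++;
}
close $out if $out;
close $in;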

Perl script for normalizing negative numbers

I have several large csv files in which I would like to replace all numbers less than 100, including negative numbers, with 500 or another positive number.
I'm not a programmer but I found a nice Perl one-liner to replace the whitespace with commas, 's/[^\S\n]+/,/g'. I was wondering if there's an easy way to do this as well.
Using Windows quoting (double quotes) for a Perl one-liner:
perl -F/,/ -lane "print join(q{,}, map { /^[-\d.]+$/ && $_ < 100 ? 100 : $_ } @F)" input.csv > output.csv
The following works for me, assuming there are 2 files in the directory:
test1.txt:
201,400,-1
-2.5,677,90.66,30.32
222,18
test2.txt
-1,-1,-1,99,101
3,3,222,190,-999
22,100,100,3
using the one liner:
perl -p -i.bak -e 's/(-?\d+\.?\d*)/$1<100?500:$1/ge' *
-p applies the search-and-replace to each line in each file.
-i.bak does the replacement in the original files and backs them up to new files with a .bak extension.
The s///ge part finds all the numbers (including negative numbers) and compares each one with 100; if it is less than 100, it is replaced with 500.
g means find all matching numbers.
e means the replacement part is treated as Perl code.
* means process all the files in the directory.
After executing this one-liner, I had 4 files in the directory:
test1.txt.bak test1.txt test2.txt.bak test2.txt
and the contents of test1.txt and test2.txt are:
test1.txt
201,400,500
500,677,500,500
222,500
test2.txt
500,500,500,500,101
500,500,222,190,500
500,100,100,500

Inserting headers into multiple files

I found a command line using Perl that inserts headers into my files without going through the tedious process of inserting them one by one. Can someone walk me through the Perl aspect of this command line? I'm new to this and can't seem to find the right explanations for what I wrote.
cat header.txt | perl -0 -i -pe 'BEGIN{$h = <STDIN>}; print $h' 1*
-e
rather than provide a script in a xxxx.pl file, provide it on the command line
-p
makes it iterate over filename arguments somewhat like sed but also prints the contents of $_ at the end of the script.
the two above are combined in -pe
-i
indicate you want to edit the file in place and write the output to the same file. In practice, Perl renames the input file and reads from this renamed version while writing to a new file with the original name
-0
redefines the end of record character (\n by default) so that you can read the entire input file as a single line
1*
is the command line argument to your script, so I guess you are modifying any file with a name that starts with 1 (you could have used *.c, or whatever depending on the type of files you are trying to modify)
print $h
prints the variable $h that is the "main" of your script. If it was initialized with the content of the header file (the intent of this one-liner) then it will print the header file.
BEGIN{ some code here }
this is stuff you execute before the script starts. This is where I'm stumped: this doesn't seem like valid Perl code.
so basically:
this will supposedly slurp the entire header file (because of -0) in the BEGIN block and store it in the variable $h
iterate over all the files specified by the wildcards at the end of the command line
for each file: print the header (print $h) then print the file itself (because of -pe)
so it's equivalent to spelling the script out:
$h = gets content of the entire header file
while (<>){ #loop implied by -pe, iterates over all the 1* files
# the main contents of the "-e" script are inserted below as part of executing -pe
print $h; #print the header we saved
print $_; # implied by -pe, and since we are using -0, this prints the entire content in one shot
# end of the "-e" script. again it was a single print $h statement, the second print is implied by -pe
}
It's a bit hard to explain, take a look at the perlrun documentation for details (run man perlrun).
This is not a 100% complete explanation because I don't think the BEGIN block is right. I tried it on my Ubuntu machine and it complained about its syntax too.
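For comparison, here is a hedged standalone sketch that does the same prepend without the one-liner machinery; header.txt matches the example above, and the files to modify are passed as arguments:
#!/usr/bin/perl
# Sketch: prepend the contents of header.txt to each file named on the command line.
# Assumes the header and each target file comfortably fit in memory.
use strict;
use warnings;

open my $hfh, '<', 'header.txt' or die "header.txt: $!";
my $header = do { local $/; <$hfh> };   # slurp the whole header, like -0 + <STDIN>
close $hfh;

for my $file (@ARGV) {
    open my $in, '<', $file or die "$file: $!";
    my $body = do { local $/; <$in> };  # slurp the whole file, like -p with -0
    close $in;

    open my $outfh, '>', $file or die "$file: $!";  # rewrite in place, like -i
    print $outfh $header, $body;
    close $outfh;
}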
Here's something similar, with an explanation. The program in the question doesn't run on my Mac.
I needed to add the #nullable disable directive to the top of all my csharp files as part of migrating to nullable reference types.
perl -w -i -p -0777 -e 's/^/#nullable disable\n\n/' $(find . -iname '*.cs')
-w enable warnings
-i edit files in place
-p read each file block by block, printing each block after applying a perl expression. the default block size is one line
-0777 changes the default block size to the entire file
-e the perl expression to execute
The final argument uses shell command substitution to create a list of files. It passes that list of file paths to the perl command. The find command searches for files that end in .cs.
The Perl program is a single substitution command. It matches the very beginning of the block and replaces it (prepends, really) with "#nullable disable" and a couple of newlines.

Best way to parse this particular string using awk / sed?

I need to get a particular version string from a file (call it version.lst) and use it to compare another in a shell script. For example sake, the file contains lines that look like this:
V1.000 -- build date and other info here -- APP1
V1.000 -- build date and other info here -- APP2
V1.500 -- build date and other info here -- APP3
.. and so on. Let's say I am trying to grab the first version (in this case, V1.000) from APP1. Obviously, the versions can change and I want this to be dynamic. What I have right now works:
var=`cat version.lst | grep " -- APP1" | grep -Eo "V[0-9].[0-9]{3}"`
The pipe to grep gets the line containing APP1 and the second grep extracts the version string. However, I hear grep is not the way to do this, so I'd like to learn the best way using awk or sed. Any ideas? I am new to both and haven't found a tutorial easy enough to learn the syntax. Do they support egrep-style regular expressions? Thanks!
Try this to get the complete version:
#!/bin/sh
app=APP1
var=$(awk -v "app=$app" '$NF == app {print $1}' version.lst)
or to get only the major version number, the last line could be:
var=$(awk -v "app=$app" '$NF == app {split($1,a,"."); print a[1]}' version.lst)
Using sed to get the complete version:
var=$(sed -n "/ $app\$/s/^\([^ ]*\).*/\1/p" version.lst)
or this to get only the major version number:
var=$(sed -n "/ $app\$/s/^\([^.]*\).*/\1/p" version.lst)
Explanations:
The second AWK command:
-v "app=$app" - set an AWK variable equal to a shell variable
$NF == app - if the last field is equal to the contents of the variable (NF is the number of fields, so $NF is the contents of the NFth field)
{split($1,a,".") - then split the first field at the dot
print a[1] - and print the first part of the result of the split
The sed commands:
-n - don't print any output unless directed to
"/ $app\$/ - for any line that ends with (\$) the contents of the shell variable $app (not that double quotes are used to allow the variable to be expanded and it's a good idea to escape the second dollar sign)
s/^\([^ ]*\).*/\1/p" - starting at the beginning of the line (^), capture (\(\)) the sequence of zero or more (*) non-space characters ([^ ]) (non-dot characters in the second version), and match but don't capture the rest of the characters on the line (.*). Replace the matched text (the whole line in this case) with the captured string (the version number; \1 refers to the first, and here only, capture group) and print it (p).
If I understood correctly: egrep "APP1$" version.lst | awk '{print $1}'
$ awk '/^V1\.00.* APP1$/{print $NF}' version.lst
APP1
That regular expression matches lines that start with "V1.00", followed by any number of any other characters, ending with " APP1". The backslash in the middle there might be really important--it matches only ".", and so it excludes (probably corrupt) lines that might begin with, say, "V1a00". The space before "APP1" excludes things like "APP2_APP1".
"NF" is an automatically generated variable that contains the number of field in the input line. It's also the number of the last field, which happens to be the one you're interested in.
There are a couple of ways to prune off the "V1". Here's one way, although you and I might not be talking about quite the same thing.
$ awk '/^V1\.00.* APP1$/{print substr($1, 1, index($1, ".") - 1), $NF}' version.lst
V1 APP1