I have a file with several lines of the following:
DELIMITER ;
I want to create a separate file for each of these sections.
The man page of the split command does not seem to offer such an option.
The split command only splits a file into blocks of equal size (except possibly the last one).
However, awk is perfect for your type of problem. Here's an example solution.
Sample input
1
2
3
DELIMITER ;
4
5
6
7
DELIMITER ;
8
9
10
11
awk script split.awk
#!/usr/bin/awk -f
BEGIN {
    n = 1;
    outfile = n;
}
{
    # FILENAME is undefined inside the BEGIN block, so build
    # the first output file name when the first record arrives
    if (outfile == n) {
        outfile = FILENAME n;
    }
    if ($0 ~ /DELIMITER ;/) {
        # delimiter found: switch to the next output file and
        # drop the delimiter line itself
        n++;
        outfile = FILENAME n;
    } else {
        print $0 >> outfile;
    }
}
As pointed out by glenn jackman, the code can also be written as:
#!/usr/bin/awk -f
BEGIN {
    n = 1;
}
$0 ~ /DELIMITER ;/ {
    n++;
    next;
}
{
    print $0 >> FILENAME n;
}
The command-line form awk -v x="DELIMITER ;" -v n=1 '$0 ~ x {n++; next} {print > FILENAME n}' is more suitable if you don't use the script very often; however, you can save it in a file as well.
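For instance, the one-liner can be run directly on the sample input (a quick sketch; it assumes the input file is named input, as in the test run below):
awk -v x="DELIMITER ;" -v n=1 '$0 ~ x {n++; next} {print > FILENAME n}' input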
Test run
$ ls input*
input
$ chmod +x split.awk
$ ./split.awk input
$ ls input*
input input1 input2 input3
$ cat input1
1
2
3
$ cat input2
4
5
6
7
$ cat input3
8
9
10
11
The script is just a starting point. You probably have to adapt it to your personal needs and environment.
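As an aside, GNU csplit can do a similar job in a single call, though unlike the awk script it keeps each delimiter line at the top of the piece it starts (a sketch, assuming GNU coreutils):
# split at every line matching the pattern; '{*}' repeats for all matches
csplit input '/DELIMITER ;/' '{*}'
This writes the pieces to files named xx00, xx01, and so on.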
Related
I have a data file that needs a new column of identifiers from 1 to 5. The final purpose is to split the data into five separate files with no leftover file (split leaves a leftover file).
Data:
aa
bb
cc
dd
ff
nn
ww
tt
pp
with identifier column:
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
Not sure if this can be done with seq? Afterwards it will be split with:
awk '$2 == 1 {print $0}'
awk '$2 == 2 {print $0}'
awk '$2 == 3 {print $0}'
awk '$2 == 4 {print $0}'
awk '$2 == 5 {print $0}'
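As an aside, seq can indeed supply the cycling identifier column when combined with paste; a rough sketch, assuming bash and an input file named data:
paste -d' ' data <(seq $(wc -l < data) | awk '{print (($1 - 1) % 5) + 1}')
Here seq numbers the lines and awk folds each number into the 1-5 cycle.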
Perl to the rescue:
perl -pe 's/$/" " . $. % 5/e' < input > output
Note that it uses 0 instead of 5 on every fifth line.
$. is the line number.
% is the modulo operator.
The /e modifier tells the substitution to evaluate the replacement part as code.
That is, the end of line ($) is replaced with a space concatenated (.) with the line number modulo 5.
$ awk '{print $0, ((NR-1)%5)+1}' file
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
No need for that to create 5 separate files of course. All you need is:
awk '{print > ("file_" ((NR-1)%5)+1)}' file
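To spot-check the result (file names as produced by the command above):
$ cat file_1
aa
nn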
Looks like you're happy with a Perl solution that outputs 1-4 then 0 instead of 1-5, so FYI here's the equivalent in awk:
$ awk '{print $0, NR%5}' file
aa 1
bb 2
cc 3
dd 4
ff 0
nn 1
ww 2
tt 3
pp 4
I am going to offer a Perl solution even though it wasn't tagged because Perl is well suited to solve this problem.
If I understand what you want to do, you have a single file that you want to split into 5 separate files based on the position of a line in the data file:
the first line in the data file goes to file 1
the second line in the data file goes to file 2
the third line in the data file goes to file 3
...
Since you already have each line's position in the file, you don't really need the identifier column (though you could pursue that solution if you wanted).
Instead, you can open 5 filehandles and simply alternate which handle you write to:
use strict;
use warnings;

my $datafilename = shift @ARGV;

# open filehandles and store them in an array
my @fhs;
foreach my $i ( 0 .. 4 ) {
    open my $fh, '>', "${datafilename}_$i"
        or die "$!";
    $fhs[$i] = $fh;
}

# open the datafile
open my $datafile_fh, '<', $datafilename
    or die "$!";

my $row_number = 0;
while ( my $datarow = <$datafile_fh> ) {
    print { $fhs[$row_number++ % @fhs] } $datarow;
}

# close resources
foreach my $fh ( @fhs ) {
    close $fh;
}
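For comparison, the same round-robin idea fits in a single line of awk (a sketch; assuming the input file is named data, it writes data_0 through data_4 just like the Perl version):
awk '{print > (FILENAME "_" ((NR - 1) % 5))}' data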
I have a "pipe-separated" file that has about 20 columns. I want to just hash the first column which is a number like account number using sha1sum and return the rest of the columns as is.
Whats the best way I can do this using awk or sed?
Accountid|Time|Category|.....
8238438|20140101021301|sub1|...
3432323|20140101041903|sub2|...
9342342|20140101050303|sub1|...
Above is an example of the text file showing just 3 columns. Only the first column has the hash function applied to it. The result should look like:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
What the Best Way™ is, is up for debate. One way to do it with awk is
awk -F'|' 'BEGIN { OFS=FS } NR == 1 { print } NR != 1 { gsub(/'\''/, "'\'\\\\\'\''", $1); command = ("echo '\''" $1 "'\'' | sha1sum -b | cut -d\\ -f 1"); command | getline hash; close(command); $1 = hash; print }' filename
That is
BEGIN {
    OFS = FS  # set output field separator to field separator; we will use
              # it because we meddle with the fields.
}
NR == 1 {     # first line: just print headers.
    print
}
NR != 1 {     # from there on, do the hash/replace.
    # this constructs a shell command (and runs it) that echoes the field
    # (singly-quoted to prevent surprises) through sha1sum -b, cuts out the hash
    # and gets it back into awk with getline (into the variable hash).
    # the gsub bit is to prevent the shell from barfing if there's an apostrophe
    # in one of the fields.
    gsub(/'/, "'\\''", $1);
    command = ("echo '" $1 "' | sha1sum -b | cut -d\\ -f 1")
    command | getline hash
    close(command)
    # then replace the field and print the result.
    $1 = hash
    print
}
You will notice the differences between the shell command at the top and the awk code at the bottom; that is all due to shell expansion. Because I put the awk code in single quotes in the shell commands (double quotes are not up for debate in that context, what with $1 and all), and because the code contains single quotes, making it work inline leads to a nightmare of backslashes. Because of this, my advice is to put the awk code into a file, say foo.awk, and run
awk -F'|' -f foo.awk filename
instead.
Here's an awk executable script that does what you want:
#!/usr/bin/awk -f
BEGIN { FS=OFS="|" }
FNR != 1 { $1 = encodeData( $1 ) }
47   # any non-zero pattern is true, so this prints every record

function encodeData( fld ) {
    cmd = sprintf( "echo %s | sha1sum", fld )
    cmd | getline output
    close( cmd )
    split( output, arr, " " )
    return arr[1]
}
Here's the flow breakdown:
Set the input and output field separators to |
When the row isn't the first (header) row, re-assign $1 to an encoded value
Print the entire row when 47 is true (always)
Here's the encodeData function breakdown:
Create a cmd to feed data to sha1sum
Feed it to getline
Close the cmd
On my system, there's extra info after the sha1sum hash, so I discard it by splitting the output
Return the first field of the sha1sum output.
With your data, I get the following:
Accountid|Time|Category|.....
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
Run it by calling awk -f awk.script data (or ./awk.script data if you make it executable).
EDIT by EdMorton:
Sorry for the edit, but your script above is the right approach; it just needs some tweaks to make it more robust, and this is much easier than trying to describe them in a comment:
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { for (i=1; i<=NF; i++) f[$i] = i; next }
{ $(f["Accountid"]) = encodeData($(f["Accountid"])); print }

function encodeData( fld, cmd, output ) {
    cmd = "echo \047" fld "\047 | sha1sum"
    if ( (cmd | getline output) > 0 ) {
        sub(/ .*/,"",output)
    }
    else {
        print "failed to hash " fld | "cat>&2"
        output = fld
    }
    close( cmd )
    return output
}
$ awk -f tst.awk file
104a1f34b26ae47a67273fe06456be1fe97f75ba|20140101021301|sub1|...
c84270c403adcd8aba9484807a9f1c2164d7f57b|20140101041903|sub2|...
4fa518d8b005e4f9a085d48a4b5f2c558c8402eb|20140101050303|sub1|...
The f[] array decouples your script from hard-coding the number of the field that needs to be hashed. The additional arguments in the function declaration make cmd and output local to the function, so they are always null/zero on each invocation. The if around getline means you won't return the previous success value if it fails (see http://awk.info/?tip/getline). The rest is maybe more style/preference, with a bit of a performance improvement.
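One more caveat: echo appends a newline, so what actually gets hashed is the field plus "\n". If you want the hash of the bare string (accepting that the hashes will then differ from the examples above), printf avoids the trailing newline; a minimal tweak to the cmd line:
cmd = "printf '%s' \047" fld "\047 | sha1sum"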
I have a file, xx.txt, like this.
1PPYA
2PPYB
1GBND
1CVHA
The first line of this file is "1PPYA". I would like to:
Read the last character of "1PPYA"; in this example, it's "A".
Find "1PPY.txt" (the first four characters) in the "yy" directory.
Delete the lines that start with "csh" and contain the "A" character.
Given the following "1PPY.txt" in the "yy" directory:
csh 1 A 1 27.704 6.347
csh 2 A 1 28.832 5.553
csh 3 A 1 28.324 4.589
csh 4 B 1 27.506 3.695
csh 5 C 1 29.411 4.842
csh 6 A 1 28.378 4.899
The required output would be:
csh 4 B 1 27.506 3.695
csh 5 C 1 29.411 4.842
Assuming your shell is bash
while read word; do
    if [[ $word =~ ^(....)(.)$ ]]; then
        filename="yy/${BASH_REMATCH[1]}.txt"
        letter=${BASH_REMATCH[2]}
        if [[ -f "$filename" ]]; then
            sed "/^csh.*$letter/d" "$filename"
        fi
    fi
done < xx.txt
As you've tagged the question with awk:
awk '{
    filename = "yy/" substr($1,1,4) ".txt"
    letter = substr($1,5)
    # getline returns -1 (which is true) when the file cannot be
    # opened, so test for > 0 to avoid an endless loop
    while ((getline < filename) > 0)
        if (! match($0, "^csh.*" letter))
            print
    close(filename)
}' xx.txt
This might work for you:
sed 's|^ *\(.*\)\(.\)$|sed -i.bak "/^ *csh.*\2/d" yy/\1.txt|' xx.txt | sh
N.B. I added a file backup. If this is not needed, amend the -i.bak to -i.
You can use this bash script:
while read f l
do
    [[ -f $f ]] && awk -v l="$l" '$3 != l' "$f"
done < <(awk '{len=length($0); l=substr($0,len); f=substr($0,1,len-1); print "yy/" f ".txt", l}' xx.txt)
I posted this because you are a new user; in general, however, it is much better to show us what you have tried and where you're stuck.
TXR:
#(next "xx.txt")
#(collect)
#*prefix#{suffix /./}
# (next `yy/#prefix.txt`)
# (collect)
# (all)
#{whole-line}
# (and)
# (none)
#shell #num #suffix #(skip)
# (end)
# (end)
# (do (put-string whole-line) (put-string "\n"))
# (end)
#(end)
Run:
$ txr del.txr
csh 4 B 1 27.506 3.695
csh 5 C 1 29.411 4.842
txr: unhandled exception of type file_error:
txr: (del.txr:5) could not open yy/2PPY.txt (error 2/No such file or directory)
Because of the outer #(collect)/#(end) (easily removed) this processes all of the lines from xx.txt, not just the first line, and so it blows up because I don't have 2PPY.txt.
Hello, my SO friends, my question is:
Specification: merge the marked fields of FILE_2 into the corresponding positions of FILE_1.
A field is marked, and hence identified, by a delimiter pair.
I did this job in Python before I knew awk and sed, with a couple hundred lines of code.
Now I want to see how powerful and efficient awk and sed can be.
Show me some masterpiece of awk or sed, please!
The delimiter pairs can be configured in FILE_3, but let's assume the first delimiter in a pair is 'Marker (number i) start' and the other one is 'Marker (number i) done'.
Example:
|-----------------FILE_1------------------|
text text text
text blabla
Marker_1_start
Marker_1_done
any text
in between blabla
Marker_2_start
Marker_2_done
text text
|-----------------FILE_2------------------|
Marker_1_start
11
1111
Marker_1_done
Marker_2_start
2222
22
Marker_2_done
Expected Output:
|-----------------FILE_Out------------------|
text text text
text blabla
Marker_1_start
11
1111
Marker_1_done
any text
in between blabla
Marker_2_start
2222
22
Marker_2_done
text text
awk '
    # first pass (file_2): collect the text between each marker pair
    FNR==NR && /Marker_.*_done/  {sep = ""; next}
    FNR==NR && /Marker_.*_start/ {marker = $0; next}
    FNR==NR {marker_text[marker] = marker_text[marker] sep $0; sep = "\n"; next}
    # second pass (file_1): print every line, and after each start marker
    # also print the block collected for it
    1 {print}
    /Marker_.*_start/ {print marker_text[$0]}
' file_2 file_1
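The merged text goes to standard output; if you save the program in a file (say a hypothetical annotate.awk holding the body above), producing FILE_Out is just:
awk -f annotate.awk file_2 file_1 > FILE_Out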
There are several ways to approach this. I'm assuming that FILE_2 is smaller than FILE_1 and of a reasonable size.
#!/usr/bin/awk -f
FNR == NR {
    if ($0 ~ /^Marker.*start$/) {
        flag = 1
        idx = $0
        next
    }
    if ($0 ~ /^Marker.*done$/) {
        flag = 0
        nl = ""
        next
    }
    if (flag) lines[idx] = lines[idx] nl $0
    nl = "\n"
    next
}
{
    print
    if (lines[$0]) print lines[$0]
}
To run it:
./script.awk FILE_2 FILE_1
"Now I want to see how powerful and efficient awk and sed can be"
For this type of problem, very efficient. I'm sure my code can be further reduced.
#!/bin/bash
awk '
FNR == NR {
    if ($0 ~ /Marker_1_start/) {m1 = 1; next}
    if ($0 ~ /Marker_2_start/) {m2 = 1; next}
    if ($0 ~ /Marker_1_done/)  {m1 = 0}
    if ($0 ~ /Marker_2_done/)  {m2 = 0}
    if (m1) {a[i++] = $0}
    if (m2) {b[j++] = $0}
}
FNR != NR {
    if ($0 ~ /Marker_1_start/) {print; n1 = 1}
    if ($0 ~ /Marker_2_start/) {print; n2 = 1}
    if ($0 ~ /Marker_1_done/)  {n1 = 0}
    if ($0 ~ /Marker_2_done/)  {n2 = 0}
    if (n1)
        for (k = 0; k < i; k++)
            print a[k]
    else if (n2)
        for (l = 0; l < j; l++)
            print b[l]
    else
        print
}' ./file_2 ./file_1
Output
$ ./filemerge.sh
text text text
text blabla
Marker_1_start
11
1111
Marker_1_done
any text
in between blabla
Marker_2_start
2222
22
Marker_2_done
text text
So my dear SOers, let me get straight to the point:
Specification: filter a text file using pairs of patterns.
Example: if we have a file:
line 1 blabla
line 2 more blabla
line 3 **PAT1a** blabla
line 4 blabla
line 5 **PAT1b** blabla
line 6 blabla
line 7 **PAT2a** blabla
line 8 blabla
line 9 **PAT2b** blabla
line 10 **PAT3a** blabla
line 11 blabla
line 12 **PAT3b** blabla
more and more blabla
should give:
line 3 **PAT1a** blabla
line 4 blabla
line 5 **PAT1b** blabla
line 7 **PAT2a** blabla
line 8 blabla
line 9 **PAT2b** blabla
line 10 **PAT3a** blabla
line 11 blabla
line 12 **PAT3b** blabla
I know how to filter only one part of it using sed:
sed -n -e '/PAT1a/,/PAT1b/{p}'
But how do I filter all the snippets? Do I need to write those pairs of patterns in a configuration file, read a pair from it, run the sed command above, move on to the next pair, and so on?
Note: suppose PAT1, PAT2 and PAT3, etc. share no common prefix (like 'PAT' in this case).
One thing more: how do I make a newline in quoted text in this post without leaving a whole blank line?
I assumed the pattern pairs are given in a separate file. Then, when they appear in order in the input, you can use this awk script:
awk 'NR == FNR { a[NR] = $1; b[NR] = $2; next }
!s && $0 ~ a[i+1] { s = 1 }
s
s && $0 ~ b[i+1] { s = 0; i++ }' patterns.txt input.txt
And a more complicated version when the patterns can appear out of order:
awk 'NR == FNR { a[++n] = $1; b[n] = $2; next }
{ for (i = 1; !s && i <= n; i++) if ($0 ~ a[i]) s = i; }
s
s && $0 ~ b[s] { s = 0 }' patterns.txt input.txt
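For example, for the sample input above, patterns.txt would hold one pair per line, whitespace-separated (both scripts read the start pattern as $1 and the end pattern as $2):
PAT1a PAT1b
PAT2a PAT2b
PAT3a PAT3b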
Awk.
$ awk '/[0-9]a/{o=$0;getline;$0=o"\n"$0;print;next}/[0-9]b/' file
line 3 **PAT1a** blabla
line 4 blabla
line 5 **PAT1b** blabla
line 7 **PAT2a** blabla
line 8 blabla
line 9 **PAT2b** blabla
line 10 **PAT3a** blabla
line 11 blabla
line 12 **PAT3b** blabla
Note: since you said the patterns "share no common prefix", I match on a digit followed by a or b (the regexes [0-9]a and [0-9]b) rather than on the prefix itself.
Use the b command to branch to the end of the script (so the line is auto-printed) for all lines between the patterns, and the d command to delete all other lines:
sed -e '/PAT1a/,/PAT1b/b' -e '/PAT2a/,/PAT2b/b' -e '/PAT3a/,/PAT3b/b' -e d
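If the pairs live in a configuration file (as asked above), the sed commands can be generated instead of hand-written; a sketch, assuming the two-column patterns.txt shown earlier:
sed -n "$(awk '{printf "/%s/,/%s/p\n", $1, $2}' patterns.txt)" input.txt
Here awk turns each pair into a range-print command, and sed -n runs the generated script over the input.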