Combine columns from different files - perl

I have two text files that have these structures:
File 1
Column1:Column2
Column1:Column2
...
File 2
Column3
Column3
...
I would like to create a file that has this structure:
Column1:Column3
Column1:Column3
...
I'm open to any suggestions, but it would be nice if the solution could be done from a Bash shell, or with sed / awk / perl / etc.

cut -d: -f1 "File 1" | paste -d: - "File 2"
This cuts field 1 from File 1 (delimited by a colon) and pastes it with the only column in File 2, separating the output fields with a colon.
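For example, with two small files shaped like the ones in the question (file names and contents assumed for illustration):
$ cat file1
a:b
c:d
$ cat file2
x
y
$ cut -d: -f1 file1 | paste -d: - file2
a:x
c:y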

Here's an awk solution. It assumes file1 and file2 have an equal number of lines.
awk -F : '{ printf "%s:",$1; getline < "file2"; print }' < file1
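One caveat: if file2 has fewer lines than file1, the getline fails and leaves $0 holding the current file1 line, so you get output like Column1:Column1:Column2. A slightly more defensive sketch that checks getline's return value:
awk -F: '{
    printf "%s:", $1                     # emit Column1 and the separator
    if ((getline line < "file2") > 0)    # read the paired line from file2
        print line
    else
        print ""                         # file2 exhausted: leave the field empty
}' file1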

Since a pure bash implementation hasn't been suggested, also assuming an equal number of lines (bash v4 only):
mapfile -t file2 < file2
index=0
while IFS=: read -r column1 _; do
echo "$column1:${file2[index]}"
((index++))
done < file1
bash v3:
IFS=$'\n' read -r -d '' -a file2 < file2
index=0
while IFS=: read -r column1 _; do
echo "$column1:${file2[index]}"
((index++))
done < file1
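Since the question's title mentions Perl but no Perl answer has been posted, here is a sketch of the same parallel read (file names file1 and file2 assumed, equal line counts as above):
perl -e '
    open my $f1, "<", "file1" or die $!;
    open my $f2, "<", "file2" or die $!;
    while (my $l1 = <$f1>) {
        my ($c1) = split /:/, $l1;          # Column1 is everything before the first colon
        defined(my $l2 = <$f2>) or last;    # stop if file2 runs out of lines
        print "$c1:$l2";                    # $l2 still ends with its newline
    }'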

bash while loop failing while using sed

I am facing an issue with sed inside a while loop. I want to read the 2nd column of file1, compare it with the content of file2, and if the string matches, replace the matched string of file1 with the file2 string.
I tried with the following code, but it is not returning any output.
cat file1 | while read a b; do
sed -i "s/$b/$(grep $b file2)/g" file1 > file3;
done
Example input:
file1 content:
1 1234
2 8765
file2 content:
12345
34567
87654
Expected output:
1 12345
2 87654
Your script is very inefficient. The while loop reads each line of file1: that is N operations. For every line it processes, the sed command reprocesses the full file1, turning this into an N*N process. And inside that sed, the command substitution greps through file2 each time; if file2 has M lines, the whole thing becomes an N*N*M process.
On top of that there are some issues:
You update file1 in place because you use the -i flag. An in-place update produces no output, so file3 will be empty.
You are reading file1 with the while loop and at the same time updating it with sed. I don't know exactly how this will behave, but I don't believe it is healthy.
If $b is not in file2 you would, according to your logic, end up with a line containing only a single column. This is not what you expect.
A fix of your script, would be this:
while read -r a b; do
c=$(grep "$b" file2)
[[ "$c" == "" ]] || echo "$a $c"
done < file1 > file3
which is still not efficient, but it is already down to N*M. The best way is to use awk.
note: as a novice, always check your scripts with http://www.shellcheck.net
note: as a professional, always check your scripts with http://www.shellcheck.net
Could you please try the following.
awk 'FNR==NR{a[$2]=$1;next} {for(i in a){if(match($0,"^"i)){print a[i],$0;continue}}}' file1 file2
Adding a non-one-liner form of the solution:
awk '
FNR==NR{
a[$2]=$1
next
}
{
for(i in a){
if(match($0,"^"i)){
print a[i],$0
continue
}
}
}
' Input_file1 Input_file2
Explanation: a detailed explanation of the above code.
awk ' ##Starting awk code from here.
FNR==NR{ ##Checking condition if FNR==NR then do following.
a[$2]=$1 ##Creating array a whose index is $2 and value is $1.
next ##next will skip all further statements from here.
}
{ ##Statements from here will run for 2nd Input_file only.
for(i in a){ ##Traversing through array a all elements here.
if(match($0,"^"i)){ ##Checking condition if current line matches index of current item from array a then do following.
print a[i],$0 ##Printing array a whose index is i and current line here.
continue ##Move on to the next iteration of the for loop.
}
}
}
' Input_file1 Input_file2 ##Mentioning all Input_file names here.

Make some replacements on a bunch of files depending on the number of columns per line

I'm having a problem dealing with some files. I need to perform a column count for every line in a file and, depending on the number of columns, add several ',' at the end of each line. All lines should have 36 columns separated by ','.
This line solves my problem, but how do I run it on a folder with several files in an automated way?
awk ' BEGIN { FS = "," } ;
{if (NF == 32) { print $0",,,," } else if (NF==31) { print $0",,,,," }
}' <SOURCE_FILE> > <DESTINATION_FILE>
Thank you for all your support
R&P
The answer depends on your OS, which you haven't told us. On UNIX and assuming you want to modify each original file, it'd be:
for file in *
do
awk '...' "$file" > tmp$$ && mv tmp$$ "$file"
done
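For example, dropping the one-liner from the question into that loop (a sketch; the * glob and the explicit pass-through for lines that already have 36 fields are my assumptions):
for file in *
do
    awk 'BEGIN { FS = "," }
         NF == 32 { print $0",,,," }
         NF == 31 { print $0",,,,," }
         NF == 36 { print }            # already complete: pass through unchanged
    ' "$file" > tmp$$ && mv tmp$$ "$file"
done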
Also, in general, to get all records in a file to have the same number of fields, you can do this without needing to specify what that number of fields is (though you can if appropriate). The BEGIN block below appends the input file name to ARGV a second time, so awk reads the file twice: the first pass (NR==FNR) finds the maximum field count, and the second pass pads each record up to it:
$ cat tst.awk
BEGIN { FS=OFS=","; ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ }
NR==FNR { nf = (NF > nf ? NF : nf); next }
{
tail = sprintf("%*s",nf-NF,"")
gsub(/ /,OFS,tail)
print $0 tail
}
$
$ cat file
a,b,c
a,b
a,b,c,d,e
$
$ awk -f tst.awk file
a,b,c,,
a,b,,,
a,b,c,d,e
$
$ awk -v nf=10 -f tst.awk file
a,b,c,,,,,,,
a,b,,,,,,,,
a,b,c,d,e,,,,,
It's a short one-liner with Perl:
perl -i.bak -F, -alpe '$_ .= "," x (36-@F)' *
The -a switch autosplits each line on the -F delimiter (a comma) into @F, so 36-@F is the number of missing fields and that many commas get appended; -i.bak edits the files in place, keeping .bak backups.
If this is only a single folder without subfolders, use:
for oldfile in /path/to/files/*
do
newfile="${oldfile}.new"
awk '...' "${oldfile}" > "${newfile}"
done
If you also want to include subdirectories recursively, it's probably easiest to put the awk command and redirection into a small shell script, like this:
#!/bin/bash
oldfile=$1
newfile="${oldfile}.new"
awk '...' "${oldfile}" > "${newfile}"
and then run this script (let's call it runawk.sh) via find:
find /path/to/files/ -type f -not -name "*.new" -exec runawk.sh \{\} \;
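Note that runawk.sh must be executable (chmod +x runawk.sh) and reachable, so either put it on your PATH or give find its full path. If you would rather skip the helper script, the same thing can be done inline (a sketch; it assumes the awk program has been saved to a file, here called pad.awk):
find /path/to/files/ -type f -not -name "*.new" -exec sh -c '
    for oldfile in "$@"; do
        awk -f pad.awk "$oldfile" > "$oldfile.new"
    done
' sh {} +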

Remove lines from AWK output

I would like to remove lines that have less than 2 columns from a file:
awk '{ if (NF < 2) print}' test
one two
Is there a way to store these lines in a variable and then remove them with xargs and sed? Something like:
awk '{ if (NF < 2) VARIABLE}' test | xargs sed -i /VARIABLE/d
GNU sed
I would like to remove lines that have less than 2 columns
less than 2 = remove lines with only one column
sed -r '/^\s*\S+\s+\S+/!d' file
If you would like to split the input into two files (named "pass" and "fail"), based on condition:
awk '{if (NF > 1 ) print > "pass"; else print > "fail"}' input
If you simply want to filter/remove lines with NF < 2:
awk '(NF > 1){print}' input
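If the goal behind the xargs/sed idea was just to write the filtered result back to the original file, a temporary file does it without sed (a sketch):
awk 'NF > 1' test > test.tmp && mv test.tmp test
With GNU awk 4.1 or later you can also use the inplace extension: gawk -i inplace 'NF > 1' test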

Joining lines in order of different blocks in the same text file

I have a file split into blocks separated by a blank line, like the following:
AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTGGGG
AGGTAGTTATTATTTTTTTGGTTTTTAGTATTTAATTGAGTGTTT
ATGTAGGTGTTTATGTATTAGTTTTTTTTAGGTTTAGGGTGTTGT
ATTTAGGTTTTGTGTTTTGTGTATTATTGAATTTAATTAAAGTTA

AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTT
AGTTTTTTTTTATTTGTCGGGATATTTTAGTTGATTTTAGATTGC
TATATTTTTAGTTTCGATTCGTCGTAAGTTTTATTTTTTTTTAAT
GGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTTT
I've truncated/wrapped the lines for clarity's sake, but imagine very long lines. The point of my question is that I want a final file that looks like this:
AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTGGGGAGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTT
AGGTAGTTATTATTTTTTTGGTTTTTAGTATTTAATTGAGTGTTTAGTTTTTTTTTATTTGTCGGGATATTTTAGTTGATTTTAGATTGC
ATGTAGGTGTTTATGTATTAGTTTTTTTTAGGTTTAGGGTGTTGTTATATTTTTAGTTTCGATTCGTCGTAAGTTTTATTTTTTTTTAAT
ATTTAGGTTTTGTGTTTTGTGTATTATTGAATTTAATTAAAGTTAGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTTT
Where this new block has:
the same number of lines as the initial blocks,
each of the lines of the resulting block is a concatenation of the lines with the same line-number in the initial blocks.
this concatenation should be in-order (i.e. "1st line of 1st block" + "1st line of 2nd block", etc
Is it possible to achieve this final block using sed and/or awk, could you show me how it could be done?
In bash with paste:
$ paste <(head -4 file) <(tail -4 file) | tr -d '\t'
AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTGGGGAGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTT
AGGTAGTTATTATTTTTTTGGTTTTTAGTATTTAATTGAGTGTTTAGTTTTTTTTTATTTGTCGGGATATTTTAGTTGATTTTAGATTGC
ATGTAGGTGTTTATGTATTAGTTTTTTTTAGGTTTAGGGTGTTGTTATATTTTTAGTTTCGATTCGTCGTAAGTTTTATTTTTTTTTAAT
ATTTAGGTTTTGTGTTTTGTGTATTATTGAATTTAATTAAAGTTAGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTTT
Try this (it is tailored to the example: the $0 pattern skips the blank separator line, and the i+5 offset hard-codes two 4-line blocks):
awk -vOFS="" '$0{a[NR]=$0}END{for(i=1;i<=NR/2;i++)print a[i],a[i+5]}' file
test with your example:
kent$ cat tmp.txt
AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTGGGG
AGGTAGTTATTATTTTTTTGGTTTTTAGTATTTAATTGAGTGTTT
ATGTAGGTGTTTATGTATTAGTTTTTTTTAGGTTTAGGGTGTTGT
ATTTAGGTTTTGTGTTTTGTGTATTATTGAATTTAATTAAAGTTA

AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTT
AGTTTTTTTTTATTTGTCGGGATATTTTAGTTGATTTTAGATTGC
TATATTTTTAGTTTCGATTCGTCGTAAGTTTTATTTTTTTTTAAT
GGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTTT
kent$ awk -vOFS="" '$0{a[NR]=$0}END{for(i=1;i<=NR/2;i++)print a[i],a[i+5]}' tmp.txt
AGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTGGGGAGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTT
AGGTAGTTATTATTTTTTTGGTTTTTAGTATTTAATTGAGTGTTTAGTTTTTTTTTATTTGTCGGGATATTTTAGTTGATTTTAGATTGC
ATGTAGGTGTTTATGTATTAGTTTTTTTTAGGTTTAGGGTGTTGTTATATTTTTAGTTTCGATTCGTCGTAAGTTTTATTTTTTTTTAAT
ATTTAGGTTTTGTGTTTTGTGTATTATTGAATTTAATTAAAGTTAGGATAGGTTTTGGTGTTTGAGGTTAATTTTGTTTTATTTTTTTTT
awk -F'\n' -v RS= '{for (i=1;i<=NF;i++) str[i] = str[i] $i} END {for (i=1;i<=NF;i++) print str[i]}' file
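The same program laid out readably (like the other answers it relies on the blocks being separated by a blank line, since -v RS= switches awk to paragraph mode):
awk -F'\n' -v RS= '
{
    # Each blank-line-separated block is one record and each of its lines
    # is one field; append field i of this block to the i-th output line.
    for (i = 1; i <= NF; i++)
        str[i] = str[i] $i
}
END {
    # NF still holds the line count of the last block, so this assumes
    # all blocks have the same number of lines.
    for (i = 1; i <= NF; i++)
        print str[i]
}' file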

deleting lines from text files based on the last character of entries in another file, using awk or sed

I have a file, xx.txt, like this.
1PPYA
2PPYB
1GBND
1CVHA
The first line of this file is "1PPYA". I would like to:
Read the last character of "1PPYA". In this example, it's "A".
Find "1PPY.txt" (the first four characters) in the "yy" directory.
Delete the lines that start with "csh" and contain the "A" character.
Given the following "1PPY.txt" in the "yy" directory:
csh 1 A 1 27.704 6.347
csh 2 A 1 28.832 5.553
csh 3 A 1 28.324 4.589
csh 4 B 1 27.506 3.695
csh 5 C 1 29.411 4.842
csh 6 A 1 28.378 4.899
The required output would be:
csh 4 B 1 27.506 3.695
csh 5 C 1 29.411 4.842
Assuming your shell is bash:
while read word; do
if [[ $word =~ ^(....)(.)$ ]]; then
filename="yy/${BASH_REMATCH[1]}.txt"
letter=${BASH_REMATCH[2]}
if [[ -f "$filename" ]]; then
sed "/^csh.*$letter/d" "$filename"
fi
fi
done < xx.txt
As you've tagged the question with awk:
awk '{
filename = "yy/" substr($1,1,4) ".txt"
letter = substr($1,5)
while ((getline < filename) > 0)
if (! match($0, "^csh.*" letter))
print
close(filename)
}' xx.txt
This might work for you:
sed 's|^ *\(.*\)\(.\)$|sed -i.bak "/^ *csh.*\2/d" yy/\1.txt|' xx.txt | sh
N.B. I added a file backup. If this is not needed, amend the -i.bak to -i.
You can use this bash script:
while read f l
do
[[ -f $f ]] && awk -v l="$l" '$3 != l' "$f"
done < <(awk '{len=length($0); l=substr($0,len); f=substr($0,1,len-1); print "yy/" f ".txt", l}' xx.txt)
I posted this because you are a new user; however, it would be much better to show us what you have tried and where you're stuck.
TXR:
#(next "xx.txt")
#(collect)
#*prefix#{suffix /./}
# (next `yy/#prefix.txt`)
# (collect)
# (all)
#{whole-line}
# (and)
# (none)
#shell #num #suffix #(skip)
# (end)
# (end)
# (do (put-string whole-line) (put-string "\n"))
# (end)
#(end)
Run:
$ txr del.txr
csh 4 B 1 27.506 3.695
csh 5 C 1 29.411 4.842
txr: unhandled exception of type file_error:
txr: (del.txr:5) could not open yy/2PPY.txt (error 2/No such file or directory)
Because of the outer #(collect)/#(end) (easily removed) this processes all of the lines from xx.txt, not just the first line, and so it blows up because I don't have 2PPY.txt.