Sort a file with unordered columns of integers - perl

I have an input file with two columns of integer values. I would like to reduce the input file in this way
input file:
...
...
12312 565456
565456 12312
...
...
#
output file:
...
...
12312 565456
...
...
namely, if a pair of numbers appears more than once (in either order), write a single line to the output file with the smaller of the two numbers first.
How can this be done with sort or a Perl script?

You can try:
perl -nale ' @F=reverse @F if($F[0]>$F[1]);
$x=$F[0]." ".$F[1]; if(!$h{$x}){print $x;$h{$x}=1;}'
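As a sketch, the same logic expanded into a standalone script for readability (assuming every input line holds exactly two integers):
#!/usr/bin/perl
# Order each pair smallest-first, then print each unique pair once.
use strict;
use warnings;

my %seen;
while (<>) {
    my ($x, $y) = split;               # two whitespace-separated integers
    ($x, $y) = ($y, $x) if $x > $y;    # smaller number first
    print "$x $y\n" unless $seen{"$x $y"}++;
}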

You could combine perl and sort (note the numeric comparator; a bare sort compares lexically, which mis-orders numbers of different widths):
perl -lne 'BEGIN { $, = " " } print sort { $a <=> $b } split' infile | sort -u

This awk pipeline would also work:
awk -v OFS="\t" '$2<$1 {print $2,$1} $1<=$2 {print}' | sort -u

Related

Replacing all occurrence after nth occurrence in a line in perl

I need to replace all occurrences of a string after the nth occurrence in every line of a Unix file.
My file data:
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
My output data:
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
I tried using sed: sed 's/://3g' test.txt
Unfortunately, the g option combined with an occurrence number is not working as expected; instead, it is replacing all the occurrences.
Another approach using awk
awk -v c=':' -v n=2 'BEGIN{
FS=OFS=""
}
{
j=0;
for(i=0; ++i<=NF;)
if($i==c && j++>=n)$i=""
}1' file
$ cat file
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
$ awk -v c=':' -v n=2 'BEGIN{FS=OFS=""}{j=0;for(i=0; ++i<=NF;)if($i==c && j++>=n)$i=""}1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
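The same keep-the-first-n-colons idea is compact in Perl; a sketch with n hard-coded to 2:
perl -pe '$c = 0; s{:}{ ++$c <= 2 ? ":" : "" }ge' file
The /e modifier evaluates the replacement as code, so each colon is counted and only the first two are kept.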
With GNU awk, you can use gensub. Please try the following; it is based entirely on the samples shown, where the goal is to remove : from the 3rd occurrence onwards. gensub segregates the matched value into two parts, and all colons are then removed from the second part (everything from the 3rd colon on).
awk -v regex="^([^:]*:)([^:]*:)(.*)" '
{
firstPart=restPart=""
firstPart=gensub(regex, "\\1 \\2", "1", $0)
restPart=gensub(regex,"\\3","1",$0)
gsub(/:/,"",restPart)
print firstPart restPart
}
' Input_file
I have inferred based on the limited data you've given us, so it's possible this won't work. But I wouldn't use regex for this job. What you have there is colon delimited fields.
So I'd approach it using split to extract the data, and then some form of string formatting to reassemble exactly what you like:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
chomp;
my ( undef, $first, @rest ) = split /:/;
print ":$first:", join ( "", @rest ),"\n";
}
__DATA__
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
This gives you the desired result, whilst IMO being considerably clearer for the next reader than a complicated regex.
You can use the perl solution like
perl -pe 's~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g if /^:account_id:/' test.txt
The ^(?:[^:]*:){2}(*SKIP)(?!)|: regex means:
^(?:[^:]*:){2}(*SKIP)(?!) - match
^ - start of string (here, a line)
(?:[^:]*:){2} - two occurrences of any zero or more chars other than a : and then a : char
(*SKIP)(?!) - skip the match and go on to search for the next match from the failure position
| - or
: - match a : char.
And the replacement only runs if the current line starts with :account_id: (the if /^:account_id:/ condition).
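To see what (*SKIP)(?!) does in isolation, here is a minimal demonstration with one sample line hard-coded:
perl -E '$_ = ":account_id:12345:6789:Melbourne:Aus"; s~^(?:[^:]*:){2}(*SKIP)(?!)|:~~g; say'
This prints :account_id:123456789MelbourneAus: the first two fields are matched and skipped, and only the remaining colons are deleted.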
Or an awk solution like
awk 'BEGIN{OFS=FS=":"} /^:account_id:/ {result="";for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result}' test.txt
Details:
BEGIN{OFS=FS=":"} - sets the input/output field separator to :
/^:account_id:/ - line must start with :account_id:
result="" - sets result variable to an empty string
for (i=1; i<=NF; ++i) { result = result (i > 2 ? $i : $i OFS)}; print result} - iterates over the fields and if the field number is greater than 2, just append the current field value to result, else, append the value + output field separator; then print the result.
If n is fixed and equal to 2, I would use GNU AWK the following way. Let file.txt content be
:account_id:12345:6789:Melbourne:Aus
:account_id:98765:43210:Adelaide:Aus
then
awk 'BEGIN{FS=":";OFS=""}{$2=FS $2 FS;print}' file.txt
output
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
Explanation: using : as the field separator and nothing as the output field separator removes all : characters by itself, so I re-add the two that have to be preserved: the 1st (before the second column) and the 2nd (after the second column). Beware that I tested this solely against the data shown; if you want to use it, test it first with more possible inputs.
(tested in gawk 4.2.1)
This might work for you (GNU sed):
sed 's/:/\n/3;h;s/://g;H;g;s/\n.*\n//' file
Replace the third occurrence of : by a newline.
Make a copy of the line.
Delete all occurrences of :'s.
Append the amended line to the copy.
Join the pieces by deleting everything from the first newline through the last; this leaves the text up to the third colon followed by the colon-free remainder of the line.
N.B. A newline is the best delimiter to use with sed, as the lines presented to sed's commands are initially devoid of newlines. The important property of the delimiter, however, is that it is unique, so it can be any character that is not found anywhere in the data.
An alternative solution uses a loop to remove all :'s after the first two:
sed -E ':a;s/^(([^:]*:){2}[^:]*):/\1/;ta' file
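For comparison, the same repeat-until-done loop is straightforward in Perl (a sketch):
perl -pe '1 while s/^((?:[^:]*:){2}[^:]*):/$1/' file
Each pass deletes the first colon after the leading two fields, and the while keeps substituting until no such colon remains.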
With GNU awk for the 3rd arg to match() and gensub():
$ awk 'match($0,/(:[^:]+:)(.*)/,a){ $0=a[1] gensub(/:/,"","g",a[2]) } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus
and with any awk in any shell on every Unix box:
$ awk 'match($0,/:[^:]+:/){ tgt=substr($0,1+RLENGTH); gsub(/:/,"",tgt); $0=substr($0,1,RLENGTH) tgt } 1' file
:account_id:123456789MelbourneAus
:account_id:9876543210AdelaideAus

Extracting fasta ids after string match

I have a list of fasta sequences as following:
>Product_1_001:299:H377WBGXB:1:11101
TGATCATCTCACCTACTAATAGGACGATGACCCAGTGACGATGA
>Product_2_001:299:H377WBGXB:2:11101
CATCGATGATCATTGATAAGGGGCCCATACCCATCAAAACCGTT
The original fasta sequence is much longer than the subset posted here. I wanted to extract the 10 characters after the pattern "TCAT" into a separate file and did this
grep -oP "(?<=TCAT).{10}"
I do get the needed result as:
CTCACCTACT
TGATAAGGGG
I would like their corresponding fasta ids as one column and the extracted pattern as second column like:
>Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
>Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
Try this one-liner
perl -lne ' /^[^>].+?(?<=TCAT)(.{10})/ and print $p,"\t",$1; $p=$_ ' file
with your given inputs
$ cat fasta.txt
>Product_1_001:299:H377WBGXB:1:11101
TGATCATCTCACCTACTAATAGGACGATGACCCAGTGACGATGA
>Product_2_001:299:H377WBGXB:2:11101
CATCGATGATCATTGATAAGGGGCCCATACCCATCAAAACCGTT
$ perl -lne ' /^[^>].+?(?<=TCAT)(.{10})/ and print $p,"\t",$1; $p=$_ ' fasta.txt
>Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
>Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
$
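If you prefer a standalone script, here is a sketch of the same pairing logic (it assumes each header line precedes its sequence line):
#!/usr/bin/perl
use strict;
use warnings;

my $id;
while (<>) {
    chomp;
    if (/^>/) {
        $id = $_;                        # remember the current fasta id
    }
    elsif (defined $id && /TCAT(.{10})/) {
        print "$id\t$1\n";               # id plus the 10 chars after TCAT
    }
}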
Another way is using an awk pipeline like this:
cat <your_file> | awk -F"_" '/Product/{printf "%s", $0; next} 1' | awk -F"TCAT" '{ print substr($1,1,35) "\t" substr($2,1,10)}'
The output:
Product_1_001:299:H377WBGXB:1:11101 CTCACCTACT
Product_2_001:299:H377WBGXB:2:11101 TGATAAGGGG
Hope it helps.

Perl To Parse Whitespace Separated Columns

I have a large text file which has three columns present, each column separated by four spaces. I need a perl script to read this text file and output columns 1 and 2 to a new text file, with each of these columns wrapped in quotation marks and separated by a comma in the output file.
The text file with the three columns has data which looks like this:
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
What I would like the output to look like is
"9a2ba3c0580b5f3799ad9d6f487b2d38","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg"
Easy as a one-liner:
perl -lane 'print join ",", map qq("$_"), @F[0, 1]'
-l handles newlines in print
-n reads the input line by line
-a splits each line on whitespace into the @F array
@F[0, 1] is an array slice; it extracts the first two elements of the @F array
map wraps each element in double quotes
join inserts the comma in between
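For example, run against a file (here called test) containing the sample line above:
$ perl -lane 'print join ",", map qq("$_"), @F[0, 1]' test
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg"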
Below code for your reference:
#!/usr/bin/perl
use strict;
use warnings;
my $defaultFileName = defined $ARGV[0] ? $ARGV[0] : "filename.txt";
die "Could not find file: $defaultFileName" unless (-f $defaultFileName);
open my $fh, '<', $defaultFileName or die "Could not open $defaultFileName: $!";
while (my $line = <$fh>) {
    my @tmpData = split(/\s+/, $line);
    printf "\"%s\",\"%s\"\n", $tmpData[0], $tmpData[1];
}
close $fh;
This can also be done with awk
>>cat test
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
9a2ba3c0580b5f3799ad9d6f487b2d3 /folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg HOST
OUTPUT:
>>awk '{FS=" "}{print "\""$1"\",""\""$2"\",""\""$3"\"" }' test
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg","HOST"
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg","HOST"
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg","HOST"
"9a2ba3c0580b5f3799ad9d6f487b2d3","/folder1/folder2/folder3/folder4/folder5/folder6/folder7_name_PC/images/filename.jpg","HOST"
>>awk '{FS=" "}{print "\""$1"\",""\""$2"\",""\""$3"\"" }' test > output.txt
then output.txt will have the desired output.

Insert a string/number into a specific cell of a csv file

Basically right now I have a for loop running that runs a series of tests. Once the tests pass I input the results into a csv file:
for (( some statement )); do
  if [[ something ]]; then
    # input this value into a specific row and column
  fi
done
What I can't figure out right now is how to input a specific value into a specific cell in the csv file. I know in awk you can read a cell with this command:
awk -v "row=2" -F'#' 'NR == row { print $2 }' some.csv and this will print the cell in the 2nd row and 2nd column. I need something similar to this except it can input a value into a specific cell instead of read it. Is there a function that does this?
You can use the following:
awk -v value="$value" -v row="$row" -v col="$col" 'BEGIN{FS=OFS="#"} NR==row {$col=value}1' file
And set the bash values $value, $row and $col. Then you can redirect and move to the original:
awk ... file > new_file && mv new_file file
This && means that just if the first command (awk...) is executed successfully, then the second one will be performed.
Explanation
-v value="$value" -v row="$row" -v col="$col" passes the bash variables to awk (the quotes guard against spaces). Note that value, row and col could have other names; I used the same names as in bash to make it easier to follow.
BEGIN{FS=OFS="#"} sets the Field Separator and Output Field Separator to #. The OFS="#" matters here: assigning to $col makes awk rebuild the record using OFS, so without it the output would become space-separated.
NR==row {$col=value} when the record number (the line number here) equals row, set column col to value.
1 performs the default awk action: {print $0}.
Example
$ cat a
hello#how#are#you
i#am#fine#thanks
hoho#haha#hehe
$ row=2
$ col=3
$ value="XXX"
$ awk -v value=$value -v row=$row -v col=$col 'BEGIN{FS=OFS="#"} NR==row {$col=value}1' a
hello#how#are#you
i#am#XXX#thanks
hoho#haha#hehe
Your question has a 'perl' tag, so here is a way to do it using Tie::Array::CSV, which allows you to treat the CSV file as an array of arrays and use standard array operations:
use strict;
use warnings;
use Tie::Array::CSV;
my $row = 2;
my $col = 3;
my $value = 'value';
my $filename = '/path/to/file.csv';
tie my @file, 'Tie::Array::CSV', $filename, sep_char => '#';
$file[$row][$col] = $value;
untie @file;
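If installing a module is not an option, a plain-Perl one-liner can make the same cell update; a sketch that assumes a # delimiter with no quoting or escaping in the data, and hard-codes row 2, column 3 and the value XXX:
perl -i.bak -lpe 'BEGIN { ($r, $c) = (2, 3) } if ($. == $r) { my @f = split /#/, $_, -1; $f[$c - 1] = "XXX"; $_ = join "#", @f }' file.csv
The -i.bak edits the file in place while keeping a backup, and the -1 limit on split preserves trailing empty fields.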
Using sed:
row=2 # define the row number
col=3 # define the column number
value="value" # define the value you need change.
sed "$row s/[^#]\{1,\}/$value/$col" file.csv # use shell variable in sed to find row number first, then replace any word between #, and only replace the nominate column.
# So above sed command is converted to sed "2 s/[^#]\{1,\}/value/3" file.csv
If the above command is fine, and your sed command support the option -i, then run the command to change the content directly in file.csv
sed -i "$row s/[^#]\{1,\}/$value/$col" file.csv
Otherwise, you need export to temp file, and change the name back.
sed "$row s/[^#]\{1,\}/$value/$col" file.csv > temp.csv
mv temp.csv file.csv

Swap two columns - awk, sed, python, perl

I've got data in a large file (280 columns wide, 7 million lines long!) and I need to swap the first two columns. I think I could do this with some kind of awk for loop, to print $2, $1, then a range to the end of the file - but I don't know how to do the range part, and I can't print $2, $1, $3...$280! Most of the column swap answers I've seen here are specific to small files with a manageable number of columns, so I need something that doesn't depend on specifying every column number.
The file is tab delimited:
Affy-id chr 0 pos NA06984 NA06985 NA06986 NA06989
You can do this by swapping values of the first two fields:
awk ' { t = $1; $1 = $2; $2 = t; print; } ' input_file
I tried perreal's answer with cygwin on a Windows system with a tab-separated file. It didn't work, because the default separator is a space.
If you encounter the same problem, try this instead:
awk -F $'\t' ' { t = $1; $1 = $2; $2 = t; print; } ' OFS=$'\t' input_file
The incoming separator is defined by -F $'\t' and the separator for output by OFS=$'\t'.
awk -F $'\t' ' { t = $1; $1 = $2; $2 = t; print; } ' OFS=$'\t' input_file > output_file
Try this if you only need the two swapped columns (note that it drops everything after them):
awk '{printf("%s\t%s\n", $2, $1)}' inputfile
This might work for you (GNU sed):
sed -i 's/^\([^\t]*\t\)\([^\t]*\t\)/\2\1/' file
Have you tried using the cut command? E.g.
cut -c10-20,1-9,21- myhugefile > myrearrangedhugefile
Note, though, that cut writes the selected ranges in their original file order regardless of how the list is given, so it cannot actually swap the columns.
This is also easy in perl:
perl -pe 's/^(\S+)\t(\S+)/$2\t$1/;' file > outputfile
You could do this in Perl:
perl -F\\t -nlae 'print join("\t", @F[1,0,2..$#F])' inputfile
The -F specifies the delimiter. In most shells you need to precede a backslash with another to escape it. In recent versions of Perl, -F implies -a and -n, so they can be dropped.
For this problem you wouldn't strictly need -l, because the last column appears last in the output. But if, in a different situation, the last column needed to appear between other columns, the trailing newline character would have to be removed. The -l switch takes care of this.
The "\t" in join can be changed to anything else to produce a different delimiter in the output.
2..$#F specifies a range from 2 to the last column. As you might have guessed, inside the square brackets you can put any single column or range of columns in the desired order.
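A quick demonstration of the slice reordering with throwaway data:
$ printf 'a\tb\tc\td\n' | perl -F'\t' -nlae 'print join("\t", @F[1,0,2..$#F])'
b	a	c	d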
No need to call anything else but your shell (be aware that the unquoted read/echo squeezes runs of whitespace, so tabs come out as single spaces):
bash> while read col1 col2 rest; do
echo $col2 $col1 $rest
done <input_file
Test:
bash> echo "first second a b c d e f g" |
while read col1 col2 rest; do
echo $col2 $col1 $rest
done
second first a b c d e f g
Maybe even with "inlined" Python - as in a Python script within a shell script - but only if you want to do some more scripting with Bash beforehand or afterwards... Otherwise it is unnecessarily complex.
Content of script file process.sh:
#!/bin/bash
# inline Python script
read -r -d '' PYSCR << 'EOSCR'   # quoted delimiter: the shell leaves backslashes in the script untouched
from __future__ import print_function
import codecs
import sys
encoding = "utf-8"
fn_in = sys.argv[1]
fn_out = sys.argv[2]
# print("Input:", fn_in)
# print("Output:", fn_out)
with codecs.open(fn_in, "r", encoding) as fp_in, \
codecs.open(fn_out, "w", encoding) as fp_out:
for line in fp_in:
# split into two columns and rest
col1, col2, rest = line.split("\t", 2)
# swap columns in output
fp_out.write("{}\t{}\t{}".format(col2, col1, rest))
EOSCR
# ---------------------
# do setup work?
# e. g. list files for processing
# call python script with params
python3 -c "$PYSCR" "$inputfile" "$outputfile"
# do some more processing
# e. g. rename outputfile to inputfile, ...
If you only need to swap the columns for a single file, then you can also just create a single Python script and statically define the filenames. Or just use an answer above.
Swapping in awk without a temporary variable:
echo '777777744444444464449: 317 647 14423 262927714037 : 0x2A29D5A1BAA7A95541' |
mawk '1; ($1 = $2 substr(_, ($2 = $1)^_))^_' FS=':' OFS=':'
777777744444444464449: 317 647 14423 262927714037 : 0x2A29D5A1BAA7A95541
317 647 14423 262927714037 :777777744444444464449: 0x2A29D5A1BAA7A95541
The leading 1; prints the original record. In the second expression, mawk reads the old $2 before the embedded assignment $2 = $1; each ^_ raises a value to the power of the unset variable _ (i.e. 0), yielding 1, so substr(_, 1) contributes an empty string and the whole expression is true, which prints the swapped record.