Using sed to capture groups of indeterminate length and a multitude of characters

I am struggling to grasp the sed command.
I am working with gene annotation files. In particular, I convert gff3 files to the gtf files needed to run cellranger-arc mkref. Both gffread and agat fail to do this perfectly on gff3 files from NCBI; my agat gtf file doesn't contain 'transcript_id' at all.
The gtf format is tab-delimited, with the final column holding attributes separated by semicolons. Currently, my agat gtf file has 'locus_tag' descriptors, which I want to replace with 'transcript_id', adding the necessary quote marks around the name of the transcript. As an example, I want
... ; locus_tag AbcdE_f1 ; ...
to be replaced with
... ; transcript_id "AbcdE_f1" ; ...
I have tried
sed -i.bak "s/locus_tag\([0-9a-zA-Z ,._-]{1,}\);/transcript_id \"1\";/g" myFile.gtf
but it does nothing. Thanks for any help.
As requested, here is a typical input sample (two lines):
ChrPT RefSeq exon 956 981 . + . Dbxref "GeneID:38831453" ; ID "nbis-exon-1" ; Parent PhpapaC_p1 ; gbkey exon ; gene "3' rps12" ; locus_tag PhpapaC_p1 ; product "ribosomal protein S12"
ChrPT RefSeq gene 1033 1500 . + . Dbxref "GeneID:2546745" ; ID "nbis-gene-17" ; Name rps7 ; gbkey Gene ; gene rps7 ; gene_biotype protein_coding ; locus_tag PhpapaCp002
Desired output:
ChrPT RefSeq exon 956 981 . + . Dbxref "GeneID:38831453" ; ID "nbis-exon-1" ; Parent PhpapaC_p1 ; gbkey exon ; gene "3' rps12" ; transcript_id "PhpapaC_p1" ; product "ribosomal protein S12"
ChrPT RefSeq gene 1033 1500 . + . Dbxref "GeneID:2546745" ; ID "nbis-gene-17" ; Name rps7 ; gbkey Gene ; gene rps7 ; gene_biotype protein_coding ; transcript_id "PhpapaCp002"

Using GNU sed
$ sed -E 's/\<locus_tag\>[ \t]([^ \t]*)/transcript_id "\1"/' input_file
ChrPT RefSeq exon 956 981 . + . Dbxref "GeneID:38831453" ; ID "nbis-exon-1" ; Parent PhpapaC_p1 ; gbkey exon ; gene "3' rps12" ; transcript_id "PhpapaC_p1" ; product "ribosomal protein S12"
ChrPT RefSeq gene 1033 1500 . + . Dbxref "GeneID:2546745" ; ID "nbis-gene-17" ; Name rps7 ; gbkey Gene ; gene rps7 ; gene_biotype protein_coding ; transcript_id "PhpapaCp002"
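For reference, the original attempt failed for two reasons: in sed's default BRE dialect the interval must be written \{1,\}, and the backreference is \1, not a bare 1. A corrected sketch in ERE form, assuming single-space-separated attributes as in the sample (the \< \> word boundaries above are still the safer choice if attributes like old_locus_tag can occur):
$ sed -E 's/locus_tag ([[:alnum:]._-]+)/transcript_id "\1"/g' myFile.gtf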

FYI, using AGAT properly should definitely produce a proper GTF file with transcript_id.
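For example, AGAT ships a dedicated converter; the script name and options below are quoted from memory, so verify them against AGAT's own help output:
$ agat_convert_sp_gff2gtf.pl --gff genomic.gff -o genomic.gtf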

Related

Perl command not giving expected output

Command:
perl -lpe '1 while (s/(^|\s)(0\d*)(\s|$)/$1"$2"$3/)' test5
Input:
1234 012345 0
0.000 01234 0
01/02/03 5467 0abc
01234 0123
0000 000054
0asdf 0we23-1
Current Output:
perl -lpe '1 while (s/(^|\s)(0\d*)(\s|$)/$1"$2"$3/)' test5
1234 "012345" "0"
0.000 "01234" "0"
01/02/03 5467 "0abc"
"01234" "0123"
"0000" "000054"
0asdf 0we23-1
Expected Output:
1234 "012345" 0
0.000 "01234" 0
01/02/03 5467 "0abc"
"01234" "0123"
"0000" "000054"
"0asdf" "0we23-1"
Conditions to follow in output:
All strings starting with 0 should be double quoted, unless they contain / or . characters.
A string starting with 0 that consists of only that single 0 character should not be quoted.
Spacing between strings should be preserved.
This appears to do what you want:
#!/usr/bin/env perl
use strict;
use warnings;
while ( <DATA> ) {
    my @fields = split;
    s/^(0[^\.\/]+)$/"$1"/ for @fields;
    print join( " ", @fields ), "\n";
}
__DATA__
1234 012345 0
0.000 01234 0
01/02/03 5467 0abc
01234 0123
0000 000054
0asdf 0we23-1
Note - it doesn't strictly preserve whitespace like you asked though - it just removes it and reinserts a single space. That seems to meet your spec, but you could instead:
my @fields = split /(\s+)/;
as this would capture the spaces too.
join "", #fields;
This is reducible to a one liner using -a for autosplitting:
perl -lane 's/^(0[^\.\/]+)$/"$1"/ for @F; print join " ", @F'
If you wanted to do the second bit (preserving whitespace strictly) then you'd need to drop the -a and use split yourself.
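If you do want strict whitespace preservation, a single substitution with lookarounds avoids splitting altogether. A sketch applying the same quoting rules as above:
perl -lpe 's/(?:^|(?<=\s))0[^.\/\s]+(?=\s|$)/"$&"/g' test5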

sed: delete lines that match a pattern in a given field

I have a file tab delimited that looks like this:
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
53_234 78 . CCG GAT 999 . . GT:PL:DP:DPR
45_569 5 . TCCG GTTA 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
I am trying to use sed to delete all the lines that contain more than one letter in the 4th field (in the case above, lines 7 and 8 from the top). I have tried the following regular expression, but there must be a glitch somewhere that I cannot find:
sed '5,${;/\([^.]*\t\)\{3\}\[A-Z][A-Z]\+\t/d;}' input.vcf>new.vcf
The syntax is as follows:
5,$ #start at line 5 until the end of the file ($)
([^.]*\t) #matching group is any single character followed by zero or more characters, followed by a tab.
{3} #previous block repeated 3 times (presumably for the 4th field)
[A-Z][A-Z]+\t #followed by any string of two letters or more followed by a tab.
Unfortunately, this doesn't work, but I know I am close to making it work. Any hints or help will make this a great teaching moment.
Thanks.
If awk is okay for you, you can use the command below:
awk '(FNR<5){print} (FNR>=5)&&length($4)<=1' input.vcf
The default field delimiter is whitespace; you can use -F"\t" to switch it to tab. Put it right after awk, for instance awk -F"\t" ....
(FNR<5){print}: FNR is the record number within the current file; when it is less than 5, print the whole line.
(FNR>=5) && length($4)<=1 handles the remaining lines, keeping only those whose 4th field has one character or less (a pattern with no action defaults to printing the line).
Output:
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
You can redirect the output to an output file.
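Putting those pieces together, with a tab delimiter and the output redirected:
awk -F"\t" '(FNR<5){print} (FNR>=5)&&length($4)<=1' input.vcf > new.vcf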
$ awk 'NR<5 || $4~/^.$/' file
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
Fixed your sed filter (took me a while; I almost went crazy over it):
5,${/^\([^\t]\+\t\)\{3\}[A-Z][A-Z]\+\t/d}
Your errors:
[^.]*: everything but a dot. Thanks to Ed, now I know that; I thought the dot had to be escaped, but that does not apply inside brackets. Anyhow, this class can also match a tab, so the group could swallow 2 or 3 fields instead of one and fail to match your line (regexes are greedy by default).
\[A-Z][A-Z]: stray backslash. What did it do? Hmm, no idea!
test:
$ sed '5,${/^\([^\t]\+\t\)\{3\}[A-Z][A-Z]\+\t/d}' foo.Txt
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
conclusion: to process delimited fields, awk is better :)

How to extract specific columns from different files and output in one file?

I have 12 files in a directory; each file has 4 columns. The first column is a gene name and the remaining 3 are count columns. I want to extract columns 1 and 4 from each of the 12 files and paste them into one output file: since the first column is the same in every file, the output should contain that column only once, followed by the 4th column of each file. I do not want to use R here; I am a big fan of awk. So I was trying something like below, but it did not work.
My input files look like
Input file 1
ZYG11B 8267 16.5021 2743.51
ZYG11A 4396 0.28755 25.4208
ZXDA 5329 2.08348 223.281
ZWINT 1976 41.7037 1523.34
ZSCAN5B 1751 0.0375582 1.32254
ZSCAN30 4471 4.71253 407.923
ZSCAN23 3286 0.347228 22.9457
ZSCAN20 4343 3.89701 340.361
ZSCAN2 3872 3.13983 159.604
ZSCAN16-AS1 2311 1.1994 50.9903
Input file 2
ZYG11B 8267 18.2739 2994.35
ZYG11A 4396 0.227859 19.854
ZXDA 5329 2.44019 257.746
ZWINT 1976 8.80185 312.072
ZSCAN5B 1751 0 0
ZSCAN30 4471 9.13324 768.278
ZSCAN23 3286 1.03543 67.4392
ZSCAN20 4343 3.70209 318.683
ZSCAN2 3872 5.46773 307.038
ZSCAN16-AS1 2311 3.18739 133.556
Input file 3
ZYG11B 8267 20.7202 3593.85
ZYG11A 4396 0.323899 29.8735
ZXDA 5329 1.26338 141.254
ZWINT 1976 56.6215 2156.05
ZSCAN5B 1751 0.0364084 1.33754
ZSCAN30 4471 6.61786 596.161
ZSCAN23 3286 0.79125 54.5507
ZSCAN20 4343 3.9199 357.177
ZSCAN2 3872 5.89459 267.58
ZSCAN16-AS1 2311 2.43055 107.803
Desired output from above
ZYG11B 2743.51 2994.35 3593.85
ZYG11A 25.4208 19.854 29.8735
ZXDA 223.281 257.746 141.254
ZWINT 1523.34 312.072 2156.05
ZSCAN5B 1.32254 0 1.33754
ZSCAN30 407.923 768.278 596.161
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN20 340.361 318.683 357.177
ZSCAN2 159.604 307.038 267.58
ZSCAN16-AS1 50.9903 133.556 107.803
As you can see above, the output keeps the first column only once (since it is identical in every file), followed by the 4th column of each file. I have only shown 3 files; it should work for all the files in the directory at once, since they follow the same naming convention: file1_quant.genes.sf, file2_quant.genes.sf, file3_quant.genes.sf, and so on.
Every file has the same first column but different counts in the remaining columns. My idea is to create one output file containing the 1st column followed by the 4th column of every file.
awk '{print $1,$2,$4}' *_quant.genes.sf > genes.estreads
Any heads up?
If I understand you correctly, what you're looking for is one line per key, collated from multiple files.
The tool you need for this job is an associative array. I think awk can do this too, but I'm not 100% sure. I'd probably tackle it in perl though:
#!/usr/bin/perl
use strict;
use warnings;

# an associative array, or hash as perl calls it
my %data;

# iterate the input files (sort might be irrelevant here)
foreach my $file ( sort glob("*_quant.genes.sf") ) {
    # open the file for reading.
    open( my $input, '<', $file ) or die $!;
    # iterate line by line.
    while (<$input>) {
        # extract the data - splitting on any whitespace.
        my ( $key, @values ) = split;
        # add 'column 4' to the hash (of arrays)
        push( @{ $data{$key} }, $values[2] );
    }
    close($input);
}

# start output
open( my $output, '>', 'genes.estreads' ) or die;
# sort, because hashes are explicitly unordered.
foreach my $key ( sort keys %data ) {
    # print the key and all the elements collected.
    print {$output} join( "\t", $key, @{ $data{$key} } ), "\n";
}
close($output);
With data as specified as above, this produces:
ZSCAN16-AS1 50.9903 133.556 107.803
ZSCAN2 159.604 307.038 267.58
ZSCAN20 340.361 318.683 357.177
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN30 407.923 768.278 596.161
ZSCAN5B 1.32254 0 1.33754
ZWINT 1523.34 312.072 2156.05
ZXDA 223.281 257.746 141.254
ZYG11A 25.4208 19.854 29.8735
ZYG11B 2743.51 2994.35 3593.85
The following is how you do it in awk:
awk 'BEGIN{FS = " "};{print $1, $4}' *|awk 'BEGIN{FS = " "};{temp = x[$1];x[$1] = temp " " $2;};END {for(xx in x) print xx,x[xx]}'
As cryptic as it looks, I am just using associative arrays.
Here is the solution broken down:
Just print the key and the value, one per line.
print $1, $2
Store the data in an associative array, keep updating
temp = x[$1]; x[$1] = temp " " $2;
Display it:
for(xx in x) print xx,x[xx]
Sample run:
[cloudera@quickstart test]$ cat f1
A k1
B k2
[cloudera@quickstart test]$ cat f2
A k3
B k4
C k1
[cloudera@quickstart test]$ awk 'BEGIN{FS = " "};{print $1, $2}' *|awk 'BEGIN{FS = " "};{temp = x[$1];x[$1] = temp " " $2;};END {for(xx in x) print xx,x[xx]}'
A k1 k3
B k2 k4
C k1
As a side note, the approach should be reminiscent of the Map Reduce paradigm.
awk '{E[$1]=E[$1] "\t" $4}END{for(K in E)print K E[K]}' *_quant.genes.sf > genes.estreads
The output order is whatever awk's for (K in E) iteration yields, which is unspecified, so rows will not necessarily appear in input order.
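If you need the rows in their original order, here is a sketch that remembers the keys from the first file (this assumes, as the question states, that every file lists the same genes in the same order):
awk 'FNR==NR{k[FNR]=$1; n=FNR} {a[FNR]=a[FNR] OFS $4} END{for(i=1;i<=n;i++) print k[i] a[i]}' *_quant.genes.sf > genes.estreads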
If the first column is the same in all the files, you can use paste:
paste <(tabify f1 | cut -f1,4) \
      <(tabify f2 | cut -f4) \
      <(tabify f3 | cut -f4)
Where tabify changes consecutive spaces to tabs:
sed 's/ \+/\t/g' "$@"
and f1, f2, f3 are the input files' names.
Here's another way to do it in Perl:
perl -lane '$data{$F[0]} .= " $F[3]"; END { print "$_ $data{$_}" for keys %data }' input_file_1 input_file_2 input_file_3
Here's another way of doing it with awk, and it supports multiple files:
awk 'FNR==1{f++}{a[f,FNR]=$1}{b[f,FNR]=$4}END{for(x=1;x<=FNR;x++){printf("%s ",a[1,x]);for(y=1;y<=f;y++)printf("%s ",b[y,x]);print ""}}' input1.txt input2.txt input3.txt
That line of code gives the following output:
ZYG11B 2743.51 2994.35 3593.85
ZYG11A 25.4208 19.854 29.8735
ZXDA 223.281 257.746 141.254
ZWINT 1523.34 312.072 2156.05
ZSCAN5B 1.32254 0 1.33754
ZSCAN30 407.923 768.278 596.161
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN20 340.361 318.683 357.177
ZSCAN2 159.604 307.038 267.58
ZSCAN16-AS1 50.9903 133.556 107.803

Optimizing Large Data Intersect

I have two files from which a subset looks like this:
regions
chr1 150547262 150547338 v2MCL1_29.1.122 . GENE_ID=MCL1;Pool=2;PURPOSE=CNV
chr1 150547417 150547537 v2MCL1_29.1.283 . GENE_ID=MCL1;Pool=1;PURPOSE=CNV
chr1 150547679 150547797 v2MCL1_29.2.32 . GENE_ID=MCL1;Pool=2;PURPOSE=CNV
chr1 150547866 150547951 v2MCL1_29.2.574 . GENE_ID=MCL1;Pool=1;PURPOSE=CNV
chr1 150548008 150548096 v2MCL1_29.2.229 . GENE_ID=MCL1;Pool=2;PURPOSE=CNV
chr4 1801108 1801235 v2FGFR3_3.11.182 . GENE_ID=FGFR3;Pool=2;PURPOSE=CNV
chr4 1801486 1801615 v2FGFR3_3.11.202 . GENE_ID=FGFR3;Pool=1;PURPOSE=CNV
chrX 66833436 66833513 v2AR_region.70.118 . GENE_ID=AR;Pool=1;PURPOSE=CNV
chrX 66866117 66866228 v2AR_region.103.68 . GENE_ID=AR;Pool=2;PURPOSE=CNV
chrX 66871579 66871692 v2AR_region.108.32 . GENE_ID=AR;Pool=1;PURPOSE=CNV
Note: field 1 goes from chr1..chrX
query (a somewhat standard VCF file)
1 760912 . C T 21408 PASS . GT:DP:GQ:PL 1/1:623:99:21408,1673,0
1 766105 . T A 11865 PASS . GT:DP:GQ:PL 1/1:618:99:11865,1025,0
1 767780 . G A 15278 PASS . GT:DP:GQ:PL 1/1:512:99:15352,1274,74
1 150547747 . G A 9840 PASS . GT:DP:GQ:PL 0/1:645:99:9840,0,9051
1 204506107 . C T 22929 PASS . GT:DP:GQ:PL 1/1:636:99:22929,1801,0
1 204508549 . T G 22125 PASS . GT:DP:GQ:PL 1/1:638:99:22125,1757,0
2 2765262 . A G 22308 PASS . GT:DP:GQ:PL 1/1:678:99:22308,1854,0
2 2765887 . C T 9355 PASS . GT:DP:GQ:PL 0/1:649:99:9355,0,9235
2 25463483 . G A 31041 PASS . GT:DP:GQ:PL 1/1:936:99:31041,2422,0
2 212578379 . TA T 5355 PASS . GT:DP:GQ:PL 0/1:500:99:5355,0,3249
3 178881270 . T G 10012 PASS . GT:DP:GQ:PL 0/1:632:99:10012,0,7852
3 182673196 . C T 31170 PASS . GT:DP:GQ:PL 1/1:896:99:31170,2483,0
4 1801511 . C T 12218 PASS . GT:DP:GQ:PL 0/1:885:99:12218,0,11568
4 55097835 . G C 7259 PASS . GT:DP:GQ:PL 0/1:512:99:7259,0,7099
4 55152040 . C T 15866 PASS . GT:DP:GQ:PL 0/1:1060:99:15866,0,14953
X 152017752 . G A 9786 PASS . GT:DP:GQ:PL 0/1:735:99:9786,0,11870
X 152018832 . T G 12281 PASS . GT:DP:GQ:PL 0/1:924:99:12281,0,13971
X 152019715 . A G 10128 PASS . GT:DP:GQ:PL 0/1:689:99:10128,0,9802
Note: there are several leading lines that comprise the header and start with a '#' char.
I'm trying to write a script that will use the first two fields of the query file to see if the coordinates fall between the second and third fields of the regions file. I've coded it like this:
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dump;

my $bed        = shift;
my $query_file = shift;

my %regions;
open( my $region_fh, "<", $bed ) || die "Can not open the input regions BED file: $!";
while (<$region_fh>) {
    next if /track/;
    my @line = split;
    $line[0] =~ s/chr//;    # need to strip off 'chr' or it won't match the query file
    my ( $gene, $pool, $purpose ) = $line[5] =~ /GENE_ID=(\w+);(Pool=\d+);PURPOSE=(.*)$/;
    @{ $regions{ $line[3] } } = ( @line[ 0 .. 4 ], $gene, $pool, $purpose );
}
close $region_fh;

my ( @header, @results );
open( my $query_fh, "<", $query_file ) || die "Can not open the query file: $!";
while (<$query_fh>) {
    if (/^#/) {
        push( @header, $_ );
        next;
    }
    my @fields = split;
    for my $amp ( keys %regions ) {
        if (   $fields[0] eq $regions{$amp}->[0]
            && $fields[1] >= $regions{$amp}->[1]
            && $fields[1] <= $regions{$amp}->[2] )
        {
            $fields[2] = $regions{$amp}->[5];    # add gene name to VCF file
            push( @results, join( "\t", @fields ) );
        }
    }
}
close $query_fh;
The issue is that the query file is ~3.25 million lines long, and the regions file is about 2500 lines long. So, running this takes a very long time (I quit after about 20 minutes of waiting).
I think my overall logic is correct (hopefully!), and I'm wondering if there is a way to optimize the processing to speed it up. I think the problem is that I traverse all 2500 regions for each of the 3.25 million query lines. Can anyone offer any advice on how to revise my algorithm to process these data more efficiently?
Edit: Added a larger sample dataset, which should show some positives this time.
There are two approaches that I can think of. The first is to change the keys of %regions to the chromosome names, with the values being lists of all the start, end, and gene ID values for that chromosome, sorted by the start value.
With your new data the hash would look like this
(
    chr1 => [
        [ 150547262, 150547338, "MCL1" ],
        [ 150547417, 150547537, "MCL1" ],
        [ 150547679, 150547797, "MCL1" ],
        [ 150547866, 150547951, "MCL1" ],
        [ 150548008, 150548096, "MCL1" ],
    ],
    chr4 => [
        [ 1801108, 1801235, "FGFR3" ],
        [ 1801486, 1801615, "FGFR3" ],
    ],
    chrX => [
        [ 66833436, 66833513, "AR" ],
        [ 66866117, 66866228, "AR" ],
        [ 66871579, 66871692, "AR" ],
    ],
)
This way the chromosome name gives instant access to the right part of the hash, instead of having to search through every entry each time, and the sorted start values allow a binary search.
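Here is a minimal sketch of that first approach (illustrative code, not the poster's: the leading "chr" is stripped so keys match the query file, header handling is omitted, and an early-exit linear scan over the sorted starts stands in for a full binary search):

#!/usr/bin/perl
use strict;
use warnings;

my ( $bed, $query_file ) = @ARGV;

# chromosome (without "chr") => list of [ start, end, gene ], sorted by start
my %regions;
open my $region_fh, '<', $bed or die "Can not open the regions BED file: $!";
while (<$region_fh>) {
    next if /track/;
    my @f = split;
    ( my $chr = $f[0] ) =~ s/^chr//;
    my ($gene) = $f[5] =~ /GENE_ID=(\w+)/;
    push @{ $regions{$chr} }, [ @f[ 1, 2 ], $gene ];
}
close $region_fh;
@$_ = sort { $a->[0] <=> $b->[0] } @$_ for values %regions;

open my $query_fh, '<', $query_file or die "Can not open the query file: $!";
while (<$query_fh>) {
    next if /^#/;
    my @fields = split;
    for my $r ( @{ $regions{ $fields[0] } || [] } ) {
        last if $r->[0] > $fields[1];    # starts are sorted, so nothing later can match
        if ( $fields[1] <= $r->[1] ) {
            $fields[2] = $r->[2];        # add the gene name to the VCF record
            print join( "\t", @fields ), "\n";
        }
    }
}
close $query_fh;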
The other possibility is to write the whole of the regions file to an SQLite temporary in-memory database. Once the data is stored and indexed, looking up a gene ID for a given chromosome and position will be pretty fast.

How do I split up a line and rearrange its elements?

I have some data on a single line like below
abc edf xyz rfg yeg udh
I want to present the data as below
abc
xyz
yeg
edf
rfg
udh
so that alternating fields are printed on separate lines: fields at odd positions first, then fields at even positions.
Are there any one liners for this?
The following awk script can do it:
> echo 'abc edf xyz rfg yeg udh' | awk '{
for (i = 1;i<=NF;i+=2){print $i}
print "";
for (i = 2;i<=NF;i+=2){print $i}
}'
abc
xyz
yeg

edf
rfg
udh
Python (2.x, judging by raw_input) in the same spirit as the above awk (4 lines):
$ echo 'abc edf xyz rfg yeg udh' | python -c 'f=raw_input().split()
> for x in f[::2]: print x
> print
> for x in f[1::2]: print x'
Python 1-liner (omitting the pipe to it which is identical):
$ python -c 'f=raw_input().split(); print "\n".join(f[::2] + [""] + f[1::2])'
Another Perl 5 version:
#!/usr/bin/env perl
use Modern::Perl;
use List::MoreUtils qw(part);

my $line   = 'abc edf xyz rfg yeg udh';
my @fields = split /\s+/, $line;    # split on whitespace

# Divide into odd and even-indexed elements
my $i = 0;
my ( $first, $second ) = part { $i++ % 2 } @fields;

# print them out
say for @$first;
say '';    # Newline
say for @$second;
A shame that the previous perl answers are so long. Here are two perl one-liners:
echo 'abc edf xyz rfg yeg udh'|
perl -naE '++$i%2 and say for @F; ++$j%2 and say for "",@F'
On older versions of perl (without "say"), you may use this:
echo 'abc edf xyz rfg yeg udh'|
perl -nae 'push @{$a[++$i%2]},"$_\n" for "",@F; print map{@$_}@a;'
Just for comparison, here's a few Perl scripts to do it (TMTOWTDI, after all). A rather functional style:
#!/usr/bin/perl -p
use strict;
use warnings;
my @a = split;
my @i = map { $_ * 2 } 0 .. $#a / 2;
print join("\n", @a[@i]), "\n\n",
    join("\n", @a[map { $_ + 1 } @i]), "\n";
We could also do it closer to the AWK script:
#!/usr/bin/perl -p
use strict;
use warnings;
my @a = split;
my @i = map { $_ * 2 } 0 .. $#a / 2;
print "$a[$_]\n" for @i;
print "\n";
print "$a[$_+1]\n" for @i;
I've run out of ways to do it, so if any other clever Perlers come up with another method, feel free to add it.
Another Perl solution:
use strict;
use warnings;
while (<>)
{
my @a = split;
my @b = map { $a[2 * ($_ % (@a/2)) + int($_ / (@a/2))] . "\n" } (0 .. @a-1);
print join("\n", @a[0..((@b/2)-1)], '', @a[(@b/2)..@b-1], '');
}
You could even condense it into a real one-liner:
perl -nwle'my @a = split;my @b = map { $a[2 * ($_%(@a/2)) + int($_ / (@a/2))] . "\n" } (0 .. @a-1);print join("\n", @a[0..((@b/2)-1)], "", @a[(@b/2)..@b-1], "");'
Here's the too-literal, non-scalable, ultra-short awk version:
awk '{printf "%s\n%s\n%s\n\n%s\n%s\n%s\n",$1,$3,$5,$2,$4,$6}'
Slightly longer (two more characters), using nested loops (prints an extra newline at the end):
awk '{for(i=1;i<=2;i++){for(j=i;j<=NF;j+=2)print $j;print ""}}'
Doesn't print an extra newline:
awk '{for(i=1;i<=2;i++){for(j=i;j<=NF;j+=2)print $j;if(i==1)print ""}}'
For comparison, paxdiablo's version with all unnecessary characters removed (1, 9 or 11 more characters):
awk '{for(i=1;i<=NF;i+=2)print $i;print "";for(i=2;i<=NF;i+=2)print $i}'
Here's an all-Bash version:
d=(abc edf xyz rfg yeg udh)
i="0 2 4 1 3 5"
for w in $i
do
echo ${d[$w]}
[[ $w == 4 ]]&&echo
done
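A variant that doesn't hardcode the index list, so it works for any even number of words:
d=(abc edf xyz rfg yeg udh)
for ((i=0; i<${#d[@]}; i+=2)); do echo "${d[i]}"; done
echo
for ((i=1; i<${#d[@]}; i+=2)); do echo "${d[i]}"; done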
My attempt in Haskell:
Prelude> (\(x,y) -> putStr $ unlines $ map snd (x ++ [(True, "")] ++ y)) $ List.partition fst $ zip (cycle [True, False]) (words "abc edf xyz rfg yeg udh")
abc
xyz
yeg

edf
rfg
udh
Prelude>
you could also just use tr (though note this only puts each word on its own line; it doesn't do the alternate-field reordering):
echo "abc edf xyz rfg yeg udh" | tr ' ' '\n'
Ruby versions for comparison:
ARGF.each do |line|
  groups = line.split
  0.step(groups.length-1, 2) { |x| puts groups[x] }
  puts
  1.step(groups.length-1, 2) { |x| puts groups[x] }
end
ARGF.each do |line|
  groups = line.split
  puts groups.select { |x| groups.index(x) % 2 == 0 }
  puts
  puts groups.select { |x| groups.index(x) % 2 != 0 }
end
(Note that the second version misbehaves if a line contains duplicate words, since index always returns the position of the first occurrence.)
$ echo 'abc edf xyz rfg yeg udh' |awk -vRS=" " 'NR%2;NR%2==0{_[++d]=$0}END{for(i=1;i<=d;i++)print _[i]}'
abc
xyz
yeg
edf
rfg
udh
As for the blank line between the two groups, I leave that for you to do yourself.
Here is yet another way, using Bash, to manually rearrange words in a line, with prior conversion to an array:
echo 'abc edf xyz rfg yeg udh' | while read tline; do twrd=($tline); echo -e "${twrd[0]} \n${twrd[2]} \n${twrd[4]} \n\n${twrd[1]} \n${twrd[3]} \n${twrd[5]}"; done
Cheers!