How to double the columns in a data frame in Perl

I have a big data frame that looks like this:
name1 A A G
name2 C C T
name3 A G G
name4 H G G
name5 C - T
name6 C C C
name7 A G G
name8 G G A
I expect the data frame to be changed to:
name1 A A A A G G
name2 C C C C T T
name3 A A G G G G
name4 H H G G G G
name5 C C - - T T
name6 C C C C C C
name7 A A G G G G
name8 G G G G A A
I tried to do this with R, but the memory limit does not allow it. Please help me with a Perl solution; I don't know how to write a Perl script. Thanks.

perl -lane'
BEGIN { $, = "\t" }
print shift(@F), map { ($_) x 2 } @F
' file
Output:
name1 A A A A G G
name2 C C C C T T
name3 A A G G G G
name4 H H G G G G
name5 C C - - T T
name6 C C C C C C
name7 A A G G G G
name8 G G G G A A

Using a Perl one-liner:
perl -lane 'print join "\t", shift(@F), map { ($_) x 2 } @F' data.txt

Related

Spaces in nssm.exe command output in powershell

I'm using nssm.exe in my scripts to manage Windows services, but in PowerShell the command output comes out with a space after every character.
PS> $nssm = (Get-Command D:\nssm.exe)
PS> & $nssm
nssm.exe : N S S M : T h e n o n - s u c k i n g s e r v i c e m a n a g e r
At line:1 char:1
+ & $nssm
+ ~~~~~~~
+ CategoryInfo : NotSpecified: (N S S M : T h... m a n a g e r :String) [], RemoteException
+ FullyQualifiedErrorId : NativeCommandError
V e r s i o n 2 . 2 4 3 2 - b i t , 2 0 1 4 - 0 8 - 3 1
U s a g e : n s s m < o p t i o n > [ < a r g s > . . . ]
T o s h o w s e r v i c e i n s t a l l a t i o n G U I :
n s s m i n s t a l l [ < s e r v i c e n a m e > ]
T o i n s t a l l a s e r v i c e w i t h o u t c o n f i r m a t i o n :
n s s m i n s t a l l < s e r v i c e n a m e > < a p p > [ < a r g s > . . . ]
T o s h o w s e r v i c e e d i t i n g G U I :
n s s m e d i t < s e r v i c e n a m e >
How can I get the output without these wide-format spaces between the characters?

No output when running spark NetworkWordCount example

I am a beginner with Spark, and I am trying to use Docker to run the Spark NetworkWordCount example, but I get no output when I run it.
Start a new terminal with docker exec -it container_id bash:
root@sandbox:/usr/local/spark# nc -lk 9999
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
a ba c d e f g
then start another terminal:
root@sandbox:/usr/local/spark# bin/run-example streaming.NetworkWordCount localhost 9999
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/05/12 02:55:57 INFO StreamingExamples: Setting log level to [WARN] for streaming example. To override add a custom log4j.properties to the classpath.
16/05/12 02:55:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
...
Can anyone help me?
Running it with the command below worked for me. When run locally, the streaming examples need at least two threads (one to run the receiver and one to process the data), which is what --master "local[2]" provides; with a single thread the receiver starves the processing and no output appears.
[root@quickstart spark]# ./bin/run-example streaming.NetworkWordCount localhost 9999 --master "local[2]"

Additional space added while reading text from Notepad using powershell

My requirement is to find the location of a particular string in a line of a Notepad file, but when reading the file in PowerShell additional spaces get added, so I am not able to find the location of the string. How can I find the location in this case?
I am using this code:
$ParamsPathForData = ($dir + "\TimeStats\TimeStats_1slot\29_12_2015_07TimeStats1.txt")
$data = Get-Content $ParamsPathForData
Write-Host "$($data.Count) total lines read from file"
foreach ($line in $data)
{
    $l = $line.IndexOf("12/29/2015")
    Write-Host $l
}
I am reading this line from notepad ->
TimeStats 29 12/29/2015 7:13:42 AM +00:00 Debug PREPROCESS: SlotNo:
325-00313, Ip Address: 10.2.200.15, Duplicate Message: False,
Player-Card-No: , MessageId: 883250003130047966, MessageName:
GameIdInfo, Thread Init Delay: 14, Time To Parse: 155, Time To Exec
Main Workflow: 424, Time To Construct & send Response: 22, Total
Response Time: 615
But while executing it in PowerShell I am getting this, with additional spaces ->
T i m e S t a t s 2 9 1 2 / 2 9 / 2 0 1 5 7 : 1 3 : 4 2 A M
+ 0 0 : 0 0 D e b u g P R E P R O C E S S : S l o t N o : 3 2 5 - 0 0 3 1 3 , I p A d d r e s s : 1 0 . 2 . 2 0 0 . 1 5 , D
u p l i c a t e M e s s a g e : F a l s e , P l a y e r
- C a r d - N o : , M e s s a g e I d : 8 8 3 2 5 0 0 0 3 1 3 0 0 4 7 9 6 6 , M e s s a g e N a m e : G a m e I d I n f o , T h
r e a d I n i t D e l a y : 1 4 , T i m e T o P a r s e :
1 5 5 , T i m e T o E x e c M a i n W o r k f l o w : 4 2
4 , T i m e T o C o n s t r u c t & s e n d R e s p o n s
e : 2 2 , T o t a l R e s p o n s e T i m e : 6 1 5
Can anybody please help me?
Change the encoding to Unicode:
$data = Get-Content $ParamsPathForData -Encoding Unicode
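The root cause behind the interleaved spaces is almost certainly that the file is UTF-16: every other byte of ASCII text encoded as UTF-16LE is a NUL, and a reader assuming a single-byte encoding renders those NULs as gaps. A minimal sketch of the effect using common Unix tools (iconv just stands in here for whatever wrote the file; demo.bin is a made-up name):

```shell
# Encode a short ASCII string as UTF-16LE and count the bytes: each
# character takes two bytes, the second of which is a NUL.
printf 'TimeStats' | iconv -f UTF-8 -t UTF-16LE > demo.bin
wc -c < demo.bin    # 9 characters -> 18 bytes

# The raw bytes show the interleaved NULs that display as spaces.
od -An -c demo.bin
```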

Read Tab delimited file and count the occurrences and delete row

I am fairly new to programming and am trying to solve this problem. I have a file like this:
CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 77 T C T T T T T
tg93 79 C - C C C - -
tg93 79 C G C C C C G C
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 105 A G A A A A A G A
tg93 108 A G A A A A G A A
tg93 114 T C T T T T T C T
tg93 131 A C A A A A A A A
tg93 136 G C C G C C G G G
tg93 150 CTCTC - CTCTC - CTCTC CTCTC
In this file, in the heading
CHROM - name
POS - position
REF - reference
ALT - alternate
10 - 16_sample.bam - samplesd
Now I wanted to see how many times the letters in the REF and ALT columns occur. If either of them is repeated fewer than two times, I need to delete that row.
For example:
In the first row, I have 'T' in REF and 'C' in ALT. In the 7 samples there are 5 T's, 2 blanks, and no C, so I need to delete this row.
In the second row, REF is 'C' and ALT is '-'. In the seven samples there are 3 C's, 2 '-'s, and 2 blanks, so we keep this row, as C and '-' each occur at least 2 times.
Blanks are always ignored while counting.
The final file after filtering is
#CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 79 C - C C C - -
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 136 G C C G C C G G G
I am able to read the columns into arrays and display them, but I am not sure how to loop over the bases, count their occurrences, and keep or drop the row accordingly. Can anyone tell me how I should proceed? Any example code I can modify would also be helpful.
#!/usr/bin/env perl
use strict;
use warnings;

print scalar(<>);                  # Read and output the header.
while (<>) {                       # Read a line.
    chomp;                         # Remove the newline from the line.
    my ($chrom, $pos, $ref, $alt, @samples) =
        split /\t/;                # Parse the fields of the line.
    my %counts;                    # Count the occurrences of sample values.
    ++$counts{$_} for @samples;    # e.g. might end up with $counts{"G"} = 3.
    print "$_\n"                   # Print the line if we want to keep it.
        if ($counts{$ref} || 0) >= 2   # ("|| 0" avoids a spurious warning.)
        && ($counts{$alt} || 0) >= 2;
}
Output:
CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 79 C - C C C - -
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 136 G C C G C C G G G
You included 108 in your desired output, but it only has one instance of ALT in the seven samples.
Usage:
perl script.pl file.in >file.out
Or in-place:
perl -i script.pl file
Here's an approach that does not assume tab separation between fields:
use strict;
use warnings;
use IO::All;

my $chrom = "tg93";
my @lines = io('file.txt')->slurp;
foreach (@lines) {
    my %letters;
    # Use a regex with capture groups to extract the data; this method does
    # not depend on tab-separated fields.
    if (/$chrom\s+\d+\s+([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])/) {
        # Initialize the counts for REF ($1) and ALT ($2).
        $letters{$1} = 0;
        $letters{$2} = 0;
        # Loop through the samples and increment the counters on matches.
        foreach my $base ($3, $4, $5, $6, $7, $8, $9) {
            ++$letters{$1} if $base eq $1;
            ++$letters{$2} if $base eq $2;
        }
        # If the counts for both REF and ALT are at least 2, print the line.
        if ($letters{$1} >= 2 && $letters{$2} >= 2) {
            print $_;
        }
    }
}

cut specific columns from several files and reshape using unix tools

I have several hundred files in a folder. Each of these files is a tab-delimited text file containing more than a million rows and 27 columns. From each file, I want to extract only specific columns (say columns 1, 2, 11, 12, and 13); columns 3-10 and 14-27 can be ignored. I want to do this for every file in the folder (say 2300 files). The relevant columns from each of the 2300 files look like this:
Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567890_A rs758676 - - 1 T T - ....col27
1234567890_A rs3916934 - - 1 T T - ....col27
1234567890_A rs2711935 - - 1 T C - ....col27
1234567890_A rs17126880 - - 1 - - - ....col27
1234567890_A rs12831433 - - 1 T T - ....col27
1234567890_A rs12797197 - - 1 T C - ....col27
The cut columns from the 2nd file may look like this....
Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567899_C rs758676 - - 100 T A - ....col27
1234567899_C rs3916934 - - 100 T T - ....col27
1234567899_C rs2711935 - - 100 T C - ....col27
1234567899_C rs17126880 - - 100 C G - ....col27
1234567899_C rs12831433 - - 100 T T - ....col27
1234567899_C rs12797197 - - 100 T C - ....col27
The cut columns from the 3rd file may look like this....
Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567999_F rs758676 - - 256 A A - ....col27
1234567999_F rs3916934 - - 256 T T - ....col27
1234567999_F rs2711935 - - 256 T C - ....col27
1234567999_F rs17126880 - - 256 C G - ....col27
1234567999_F rs12831433 - - 256 T T - ....col27
1234567999_F rs12797197 - - 256 C C - ....col27
The width of Sample.ID and Sample.Index is the same within each file but can change between files. The value of Sample.ID is the same within each file but differs between files. Each of the cut files has the same values under the SNP.Name column. The Sample.Index value may sometimes be the same across different files. The other two columns (Allele1...Forward and Allele2...Forward) may change, and their values are pasted together with a space separator under each SNP.Name for each Sample.ID.
I finally want to merge (tab-delimited) all the cut columns from the 2300 files into this format:
Sample.Index Sample.ID rs758676 rs3916934 rs2711935 rs17126880 rs12831433 rs12797197
1 1234567890_A T T T T T C 0 0 T T T C
200 1234567899_C T A T T T C C G T T T C
256 1234567999_F A A T T T C C G T T C C
In simple terms, I want to convert a long format into a wide format based on the Sample.ID column, similar to the reshape function in R. I tried this with R, but it runs out of memory and is really slow. Can anyone help with Unix tools?
When reshape.sh was applied to 20 files, it produced a spurious "Sample" line in the output. The first 4 fields are shown here.
Sample.Index Sample.ID rs476542 rs7073746
1234567891_A 11 C C A G
1234567892_A 191 T C A G
1234567893_A 204 T C G G
1234567894_A 15 T C A G
1234567895_A 158 T T A A
1234567896_A 208 T C A A
1234567897_A 111 T T G G
1234567898_A 137 T C G G
1234567899_A 216 T C A G
1234567900_A 113 T C G G
1234567901_A 152 T C A G
1234567902_A 178 C C A A
1234567903_A 135 C C A A
1234567904_A 125 T C A A
1234567905_A 194 C C A A
1234567906_A 110 C C G G
1234567907_A 126 C C A A
Sample -
1234567908_A 169 C C G G
1234567909_A 173 C C G G
1234567910_A 168 T C A A
#!/bin/bash
awk '
BEGIN {
    maxInd = length("Sample.Index")
    maxID  = length("Sample.ID")
}
FNR > 11 && $2 ~ "^rs" {
    SNP[$2]
    key[$11,$1]
    val[$2,$11,$1] = $12 " " $13
    maxInd = (len = length($11)) > maxInd ? len : maxInd
    maxID  = (len = length($1))  > maxID  ? len : maxID
}
END {
    printf("%-*s\t%*s\t", maxInd, "Sample.Index", maxID, "Sample.ID")
    for (rs in SNP)
        printf("%s\t", rs)
    printf("\n")
    for (pair in key) {
        split(pair, a, SUBSEP)
        printf("%-*s\t%*s\t", maxInd, a[1], maxID, a[2])
        for (rs in SNP) {
            ale = val[rs,a[1],a[2]]
            out = ale == "- -" || ale == "" ? "0 0" : ale
            printf("%*s\t", length(rs), out)
        }
        printf("\n")
    }
}' DNA*.txt
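The core long-to-wide step in the script can be exercised on a tiny input. This is a sketch with made-up two-column rows, not the 27-column files from the question; note that awk's for (x in array) iteration order is unspecified, so the column order may vary between runs.

```shell
# Tiny long-format input: sample id, SNP name, alleles (made-up data).
printf 'id1\trsA\tT T\nid1\trsB\tC C\nid2\trsA\tA G\n' |
awk -F'\t' '
{
    snp[$2]             # remember every SNP name seen
    id[$1]              # remember every sample id seen
    val[$2,$1] = $3     # alleles for this (SNP, sample) pair
}
END {
    printf "Sample.ID"
    for (s in snp) printf "\t%s", s
    printf "\n"
    for (i in id) {
        printf "%s", i
        for (s in snp)  # a missing (SNP, sample) pair becomes "0 0"
            printf "\t%s", ((s, i) in val ? val[s, i] : "0 0")
        printf "\n"
    }
}'
```

id2 has no rsB row in the input, so its rsB cell is filled with "0 0", mirroring the "- -"/empty handling in the full script.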
Proof of Concept
$ ./reshapeDNA
Sample.Index Sample.ID rs2711935 rs10829026 rs3924674 rs2635442 rs715350 rs17126880 rs7037313 rs11983370 rs6424572 rs7055953 rs758676 rs7167305 rs12831433 rs2147587 rs12797197 rs3916934 rs11002902
11 1234567890_A T T 0 0 C C 0 0 0 0 T C 0 0 C C T G 0 0 C C 0 0 T C A G T T T C G G
111 1234567892_A T T T C C C 0 0 0 0 C C T C C C T T 0 0 C C 0 0 T T A A T T T T G G
1 1234567894_A T T 0 0 T C C C A G C C 0 0 C C 0 0 T C C C T T T T A G T T C C G G
12 1234567893_A T T 0 0 C C T C A A T C 0 0 C C 0 0 T T C C T G T C A G T T T C G G
15 1234567891_A T T C C C C 0 0 0 0 C C C C C C T T 0 0 C C 0 0 T C A G T T T T G G