I have several hundred files in a folder. Each file is a tab-delimited text file containing more than a million rows and 27 columns. From each file I want to extract only specific columns (say columns 1, 2, 11, 12, and 13); columns 3-10 and 14-27 can be ignored. I want to do this for every file in the folder (say 2,300 files). The columns from each of the 2,300 files look like this:
Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567890_A rs758676 - - 1 T T - ....col27
1234567890_A rs3916934 - - 1 T T - ....col27
1234567890_A rs2711935 - - 1 T C - ....col27
1234567890_A rs17126880 - - 1 - - - ....col27
1234567890_A rs12831433 - - 1 T T - ....col27
1234567890_A rs12797197 - - 1 T C - ....col27
The cut columns from the 2nd file may look like this....
Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567899_C rs758676 - - 100 T A - ....col27
1234567899_C rs3916934 - - 100 T T - ....col27
1234567899_C rs2711935 - - 100 T C - ....col27
1234567899_C rs17126880 - - 100 C G - ....col27
1234567899_C rs12831433 - - 100 T T - ....col27
1234567899_C rs12797197 - - 100 T C - ....col27
The cut columns from the 3rd file may look like this....
Sample.ID SNP.Name col3 col10 Sample.Index Allele1...Forward Allele2...Forward col14 ....col27
1234567999_F rs758676 - - 256 A A - ....col27
1234567999_F rs3916934 - - 256 T T - ....col27
1234567999_F rs2711935 - - 256 T C - ....col27
1234567999_F rs17126880 - - 256 C G - ....col27
1234567999_F rs12831433 - - 256 T T - ....col27
1234567999_F rs12797197 - - 256 C C - ....col27
The width of the Sample.ID and Sample.Index columns is the same within each file but can change between files. The value of Sample.ID is the same within each file but differs between files. Each of the cut files has the same values in the SNP.Name column. The Sample.Index value may sometimes be the same across different files. The values of the other two columns (Allele1...Forward and Allele2...Forward) may change, and are pasted with a " " separator under each SNP.Name for each Sample.ID.
I finally want to merge (tab-delimited) all the cut columns from the 2,300 files into this format:
Sample.Index Sample.ID rs758676 rs3916934 rs2711935 rs17126880 rs12831433 rs12797197
1 1234567890_A T T T T T C 0 0 T T T C
100 1234567899_C T A T T T C C G T T T C
256 1234567999_F A A T T T C C G T T C C
In simple terms, I want to convert a long format into a wide format based on the Sample.ID column. This is similar to the reshape function in R. I tried this with R, but it runs out of memory and is really slow. Can anyone help with Unix tools?
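If a Python helper script is acceptable alongside pure Unix tools, here is a minimal sketch of the extract-and-reshape logic using only the standard library. The column positions 1, 2, 11, 12, 13, the DNA*.txt glob, and the merged_wide.txt output name are assumptions for illustration, not part of the question.

#!/usr/bin/env python3
"""Sketch: pull columns 1, 2, 11, 12, 13 from each file and reshape long -> wide."""
import csv
import glob

snps = []   # SNP names in first-seen order (these become the output columns)
rows = {}   # (Sample.Index, Sample.ID) -> {SNP.Name: "Allele1 Allele2"}

for path in sorted(glob.glob("DNA*.txt")):          # assumed file pattern
    with open(path, newline="") as fh:
        for fields in csv.reader(fh, delimiter="\t"):
            if len(fields) < 13 or not fields[1].startswith("rs"):
                continue                            # skip header lines and short rows
            sample_id, snp, sample_index = fields[0], fields[1], fields[10]
            alleles = f"{fields[11]} {fields[12]}"
            if snp not in snps:
                snps.append(snp)
            geno = rows.setdefault((sample_index, sample_id), {})
            geno[snp] = "0 0" if alleles == "- -" else alleles

with open("merged_wide.txt", "w", newline="") as out:
    w = csv.writer(out, delimiter="\t")
    w.writerow(["Sample.Index", "Sample.ID"] + snps)
    for (idx, sid), geno in rows.items():
        w.writerow([idx, sid] + [geno.get(snp, "0 0") for snp in snps])

Note that this keeps one genotype string per (sample, SNP) pair in memory, so it demonstrates the reshape logic rather than solving the memory problem for the full 2,300-file dataset.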
When reshape.sh was applied to 20 files, it produced a spurious "Sample" line in the output. Only the first 4 fields are shown here.
Sample.Index Sample.ID rs476542 rs7073746
1234567891_A 11 C C A G
1234567892_A 191 T C A G
1234567893_A 204 T C G G
1234567894_A 15 T C A G
1234567895_A 158 T T A A
1234567896_A 208 T C A A
1234567897_A 111 T T G G
1234567898_A 137 T C G G
1234567899_A 216 T C A G
1234567900_A 113 T C G G
1234567901_A 152 T C A G
1234567902_A 178 C C A A
1234567903_A 135 C C A A
1234567904_A 125 T C A A
1234567905_A 194 C C A A
1234567906_A 110 C C G G
1234567907_A 126 C C A A
Sample -
1234567908_A 169 C C G G
1234567909_A 173 C C G G
1234567910_A 168 T C A A
#!/bin/bash
awk '
BEGIN {
    maxInd = length("Sample.Index")
    maxID  = length("Sample.ID")
}
FNR > 11 && $2 ~ "^rs" {                 # skip the 11 header lines of each file; keep rs* rows
    SNP[$2]                              # remember every SNP name seen
    key[$11,$1]                          # remember every (Sample.Index, Sample.ID) pair
    val[$2,$11,$1] = $12 " " $13         # genotype: Allele1...Forward Allele2...Forward
    maxInd = (len = length($11)) > maxInd ? len : maxInd
    maxID  = (len = length($1))  > maxID  ? len : maxID
}
END {
    printf("%-*s\t%*s\t", maxInd, "Sample.Index", maxID, "Sample.ID")
    for (rs in SNP)
        printf("%s\t", rs)
    printf("\n")
    for (pair in key) {
        split(pair, a, SUBSEP)           # a[1] = Sample.Index, a[2] = Sample.ID
        printf("%-*s\t%*s\t", maxInd, a[1], maxID, a[2])
        for (rs in SNP) {
            ale = val[rs, a[1], a[2]]
            out = ale == "- -" || ale == "" ? "0 0" : ale
            printf("%*s\t", length(rs), out)
        }
        printf("\n")
    }
}' DNA*.txt
Proof of Concept
$ ./reshapeDNA
Sample.Index Sample.ID rs2711935 rs10829026 rs3924674 rs2635442 rs715350 rs17126880 rs7037313 rs11983370 rs6424572 rs7055953 rs758676 rs7167305 rs12831433 rs2147587 rs12797197 rs3916934 rs11002902
11 1234567890_A T T 0 0 C C 0 0 0 0 T C 0 0 C C T G 0 0 C C 0 0 T C A G T T T C G G
111 1234567892_A T T T C C C 0 0 0 0 C C T C C C T T 0 0 C C 0 0 T T A A T T T T G G
1 1234567894_A T T 0 0 T C C C A G C C 0 0 C C 0 0 T C C C T T T T A G T T C C G G
12 1234567893_A T T 0 0 C C T C A A T C 0 0 C C 0 0 T T C C T G T C A G T T T C G G
15 1234567891_A T T C C C C 0 0 0 0 C C C C C C T T 0 0 C C 0 0 T C A G T T T T G G
I've been struggling with boolean simplification in class, and have been practicing some more at home. I found a list of questions, but they don't have any answers or workings. I'm stuck on this one; if you could answer it, clearly showing each step, I'd much appreciate it:
Q=A.B.(~B+C)+B.C+B
I tried looking for a calculator to give me the answer so I could then work out how to get there, but I'm lost.
(I'm new to this)
Edit: ~B = NOT B
I've never done this, so I'm using this site to help me.
A.B.(B' + C) = A.(B.B' + B.C) = A.(0 + B.C) = A.(B.C)
So the expression is now A.(B.C) + B.C + B.
Not sure about this, but I'm guessing A.(B.C) + (B.C) = (A + 1).(B.C).
Since A + 1 = 1, this equals B.C.
So the expression is now B.C + B.
Factoring out B gives B.C + B = B.(C + 1) = B.1 = B.
Let's be lazy and use sympy, a Python library for symbolic computation.
>>> from sympy import *
>>> from sympy.logic import simplify_logic
>>> a, b, c = symbols('a, b, c')
>>> expr = a & b & (~b | c) | b & c | b # A.B.(~B+C)+B.C+B
>>> simplify_logic(expr)
b
There are two ways to go about such a formula:
Applying simplifications,
Brute force
Let's look at brute force first. The following truth table (for a nicer rendering, see Wolfram Alpha) enumerates all possible values of a, b and c, alongside the value of each top-level term and of the whole expression Q.

a b c | a & b & (~b | c) | b & c | b | Q
0 0 0 |        0         |   0   | 0 | 0
0 0 1 |        0         |   0   | 0 | 0
0 1 0 |        0         |   0   | 1 | 1
0 1 1 |        0         |   1   | 1 | 1
1 0 0 |        0         |   0   | 0 | 0
1 0 1 |        0         |   0   | 0 | 0
1 1 0 |        0         |   0   | 1 | 1
1 1 1 |        1         |   1   | 1 | 1
You can also think of the expression as a tree, whose shape depends on the precedence rules (e.g. usually AND binds more tightly than OR; see also this question on math.se).
So the expression:
a & b & (~b | c) | b & c | b
is a disjunction of three terms:
a & b & (~b | c)
b & c
b
You can try to reason about the individual terms, knowing that only one has to be true (as this is a disjunction).
The last term is true exactly when b is true, and the middle term b & c can only be true if b is true. The first term is a bit harder to see, but look closely: it is a conjunction (factors joined by AND), so every factor must be true for the term to be true; in particular a and b must be true, so again b must be true.
In summary: every top-level term requires b, so the whole expression is false whenever b is false; and because b is itself one of the terms, the expression is true whenever b is true. So it simplifies to just b.
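If you want to check that conclusion mechanically, a small brute-force sketch in Python (mirroring the truth table above) is enough:

from itertools import product

# Enumerate every assignment of a, b, c and confirm the expression equals b.
for a, b, c in product([False, True], repeat=3):
    q = (a and b and ((not b) or c)) or (b and c) or b
    assert q == b
    print(int(a), int(b), int(c), "->", int(q))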
Explore more on Wolfram Alpha:
https://www.wolframalpha.com/input/?i=a+%26+b+%26+(~b+%7C+c)+%7C+b+%26+c+%7C+b
A.B.(~B+C) + B.C + B = A.B.~B + A.B.C + B.C + B ; Distribution
= A.B.C + B.C + B ; Because B.~B = 0
= B.C + B ; Because A.B.C <= B.C
= B ; Because B.C <= B
I have an issue reading PDF content with iText. I have tested all the different techniques; they all work with standard PDF documents, but I have one PDF document that I need to amend, and I can't get its content.
This document was generated by PD4ML. It can be read in Adobe Reader, but cannot be read in OpenOffice.
For example, the following code
PdfReader reader = new PdfReader(src);
FileOutputStream out = new FileOutputStream(result);
out.write(reader.getPageContent(1));
Produces this output:
q Q q 29.18088 102.1433 536.9282 675.0511 re W n /Cs1 cs 1 1 1 sc 29.18088
775.5042 m 574.5602 775.5042 l 574.5602 -2599.312 l 29.18088 -2599.312 l h
f Q q 43.26609 761.4189 m 560.475 761.4189 l 560.475 -2572.832 l 43.26609
-2572.832 l h W n 29.18088 102.1433 536.9282 675.0511 re W n q 24.78997 0 0 22.53634 51.71722 733.2485
cm /Im1 Do Q /Cs1 cs 0.2 0.2 0.2 sc /Cs1 CS 0.2 0.2 0.2 SC 0.5 w 2 J 2 Tr
q 0.5634084 0 0 0.5634084 29.18088 711.2756 cm BT 20 0 0 20 40 0 Tm /G1 1
Tf [ <0033> 1 <004800550049> 1 <00520055005000440051004600480003> 1 <0044005100470003>
But when I try to extract the text content, nothing is displayed even though there are text items, as if the text were encoded differently.
This code:
PdfReader reader = new PdfReader(src);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(result));
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
    out.println(strategy.getResultantText());
}
This just produces spaces. The same happens with LocationTextExtractionStrategy.
The command
PdfContentReaderTool.listContentStream(new File(src), out);
Produces
==============Page 1====================
- - - - - Dictionary - - - - - -
(/Parent=Dictionary of type: /Pages, /Contents=Stream, /Type=/Page, /Resources=Dictionary, /MediaBox=[0, 0, 595.29, 841.89])
Subdictionary /Parent = (/Type=/Pages, /MediaBox=[0, 0, 595.29, 841.89], /Count=6, /Kids=[2 0 R, 14 0 R, 26 0 R, 30 0 R, 34 0 R, 38 0 R])
Subdictionary /Resources = (/XObject=Dictionary, /ProcSet=[/PDF, /Text, /ImageB, /ImageC, /ImageI], /ColorSpace=Dictionary, /Font=Dictionary)
Subdictionary /XObject = (/Im1=Stream of type: /XObject)
Subdictionary /ColorSpace = (/Cs1=[/ICCBased, 12 0 R])
Subdictionary /Font = (/G2=Dictionary of type: /Font, /G1=Dictionary of type: /Font)
Subdictionary /G2 = (/BaseFont=/HCNQGU+font000000001c036002, /DescendantFonts=[50 0 R], /Type=/Font, /Encoding=/Identity-H, /Subtype=/Type0, /ToUnicode=Stream)
Subdictionary /G1 = (/BaseFont=/HCZCBJ+font000000001c036002, /DescendantFonts=[43 0 R], /Type=/Font, /Encoding=/Identity-H, /Subtype=/Type0, /ToUnicode=Stream)
- - - - - XObject Summary - - - - - -
------ /Im1 - subtype = /Image = 9148 bytes ------
Content Stream - - - - - -
q Q q 29.18088 102.1433 536.9282 675.0511 re W n /Cs1 cs 1 1 1 sc 29.18088
775.5042 m 574.5602 775.5042 l 574.5602 -2599.312 l 29.18088 -2599.312 l h
f Q q 43.26609 761.4189 m 560.475 761.4189 l 560.475 -2572.832 l 43.26609
-2572.832 l h W n 29.18088 102.1433 536.9282 675.0511 re W n q 24.78997 0 0 22.53634 51.71722 733.2485
cm /Im1 Do Q /Cs1 cs 0.2 0.2 0.2 sc /Cs1 CS 0.2 0.2 0.2 SC 0.5 w 2 J 2 Tr
q 0.5634084 0 0 0.5634084 29.18088 711.2756 cm BT 20 0 0 20 40 0 Tm /G1 1
But the Text Extraction part is empty.
Any idea why I can't read the text? Is there anything else I could do or test before extracting the text?
Any pointers welcome.
Gilles
This is perhaps a simple question. I have a vector and a matrix and want to make a new matrix based on some manipulation of them. I constructed the new matrix using a for loop, and I would like to know how I can write it with vectorized operations, which are likely faster.
d=[n x 1];
t= [n x n];
I want the new Delta matrix, which is [n x n], computed as follows:
for i=1:39
for j=1:39
Delta(i,j)=d(i)-d(j)-t(i,j);
end
end
The result:
[ d(1)-d(1)-t(1,1),   d(1)-d(2)-t(1,2),   ..., d(1)-d(39)-t(1,39)
  d(2)-d(1)-t(2,1),   d(2)-d(2)-t(2,2),   ..., d(2)-d(39)-t(2,39)
  .
  .
  .
  d(38)-d(1)-t(38,1), d(38)-d(2)-t(38,2), ..., d(38)-d(39)-t(38,39)
  d(39)-d(1)-t(39,1), d(39)-d(2)-t(39,2), ..., d(39)-d(39)-t(39,39) ]
You can use the efficient bsxfun:
Delta = bsxfun(@minus, d, d.') - t
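As an aside for readers working in Python rather than MATLAB, the same pairwise-difference computation can be written with NumPy broadcasting; this is just an illustrative sketch, not part of the original answer.

import numpy as np

n = 39
d = np.random.rand(n)        # stand-in for the n-by-1 vector d
t = np.random.rand(n, n)     # stand-in for the n-by-n matrix t

# d[:, None] - d[None, :] broadcasts to the n-by-n matrix of d(i) - d(j)
Delta = d[:, None] - d[None, :] - t

# Sanity check against the double loop from the question
expected = np.array([[d[i] - d[j] - t[i, j] for j in range(n)] for i in range(n)])
assert np.allclose(Delta, expected)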
I have searched and attempted to solve this puzzle myself (I've gotten close, but had no luck). I have a large table of values (composed of sets of values) that can form multiple combinations, but those combinations must be returned in ID order.
I have not been able to get this to work in SQL.
Example Set:
(Sorry, I am not able to post an image, which would explain it better, so I'll keep it simple.)
Table (ID, Value): {(1,A), (1,B), (1,C), (2,D), (3,F), (3,G), (4,J), (5,S), (5,T), (5,U)}
RESULTS
ID VALUE
1 A
2 D
3 F
4 J
5 S
1 A
2 D
3 F
4 J
5 T
1 A
2 D
3 F
4 J
5 U
1 A
2 D
3 G
4 J
5 S
1 A
2 D
3 G
4 J
5 T
1 A
2 D
3 G
4 J
5 U
1 B
2 D
3 F
4 J
5 S
1 B
2 D
3 F
4 J
5 T
1 B
2 D
3 F
4 J
5 U
1 B
2 D
3 G
4 J
5 S
1 B
2 D
3 G
4 J
5 T
1 B
2 D
3 G
4 J
5 U
1 C
2 D
3 F
4 J
5 S
1 C
2 D
3 F
4 J
5 T
1 C
2 D
3 F
4 J
5 U
1 C
2 D
3 G
4 J
5 S
1 C
2 D
3 G
4 J
5 T
1 C
2 D
3 G
4 J
5 U
Here's the problem solved in dynamic SQL, without any cursors or loops.
IF OBJECT_ID('yourTable') IS NOT NULL
DROP TABLE yourTable;
CREATE TABLE yourTable (ID INT, Value CHAR(1));
INSERT INTO yourTable
VALUES (1,'A'),(1,'B'),(1,'C'),
(2,'D'),
(3,'F'),(3,'G'),
(4,'J'),
(5,'S'),(5,'T'),(5,'U');
DECLARE @row_number_cols VARCHAR(MAX),
        @Aliased_Cols VARCHAR(MAX),
        @Cross_Joins VARCHAR(MAX),
        @Unpivot VARCHAR(MAX);

SELECT @row_number_cols = COALESCE(@row_number_cols + ',','') + col,
       @Aliased_Cols = COALESCE(@Aliased_Cols + ',','') + CONCAT(col,' AS col',ID),
       @Cross_Joins = COALESCE(@Cross_Joins,'') + CASE
                          WHEN ID = 1 THEN CONCAT(' FROM (SELECT * FROM yourTable WHERE ID = 1) AS ID',ID)
                          ELSE CONCAT(' CROSS JOIN (SELECT * FROM yourTable WHERE ID = ',ID,') AS ID',ID)
                      END,
       @Unpivot = COALESCE(@Unpivot + ',','') + CONCAT('col',ID)
FROM yourTable A
CROSS APPLY (SELECT CONCAT('ID',ID,'.Value')) CA(col) --Just so I can reuse "col" in my code
GROUP BY A.ID,CA.col

SELECT @row_number_cols,@Aliased_Cols,@Cross_Joins,@Unpivot

SELECT
'WITH CTE_crossJoins
AS
(
SELECT ROW_NUMBER() OVER (ORDER BY ' + @row_number_cols + ') group_num,' + @Aliased_Cols +
@Cross_Joins + '
)
SELECT group_num,
       val
FROM CTE_crossJoins
UNPIVOT
(
    val for col IN (' + @Unpivot + ')
) unpvt
ORDER BY 1,2'
Results:
group_num val
-------------------- ----
1 A
1 D
1 F
1 J
1 S
2 A
2 D
2 G
2 J
2 S
3 A
3 D
3 G
3 J
3 T
4 A
4 D
4 F
4 J
4 T
5 A
5 D
5 F
5 J
5 U
6 A
6 D
6 G
6 J
6 U
7 B
7 D
7 G
7 J
7 S
8 B
8 D
8 F
8 J
8 S
9 B
9 D
9 F
9 J
9 T
10 B
10 D
10 G
10 J
10 T
11 B
11 D
11 G
11 J
11 U
12 B
12 D
12 F
12 J
12 U
13 C
13 D
13 F
13 J
13 S
14 C
14 D
14 G
14 J
14 S
15 C
15 D
15 G
15 J
15 T
16 C
16 D
16 F
16 J
16 T
17 C
17 D
17 F
17 J
17 U
18 C
18 D
18 G
18 J
18 U
I think this has been answered before here:
How to generate all possible data combinations in SQL?
The difference is that they essentially dropped the ID column; it should be easy to pull it through, though.
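To see the combinatorics outside of SQL, the expected 18 rows are just the cartesian product of the per-ID value sets. A quick Python illustration (not a SQL solution; the value sets below are copied from the example table in the question):

from itertools import product

# Values grouped by ID, as in the example table.
values = {1: ["A", "B", "C"], 2: ["D"], 3: ["F", "G"], 4: ["J"], 5: ["S", "T", "U"]}

ids = sorted(values)
for group_num, combo in enumerate(product(*(values[i] for i in ids)), start=1):
    for id_, val in zip(ids, combo):
        print(group_num, id_, val)    # 18 groups of 5 rows, in ID order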
You can employ a SQL window function to achieve this.
;WITH CTE AS
(
SELECT Id,
Value,
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) RN
FROM Tbl
)
SELECT * FROM CTE ORDER BY RN, ID, VALUE
Fiddle
I am fairly new to programming and am trying to solve this problem. I have a file like this:
CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 77 T C T T T T T
tg93 79 C - C C C - -
tg93 79 C G C C C C G C
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 105 A G A A A A A G A
tg93 108 A G A A A A G A A
tg93 114 T C T T T T T C T
tg93 131 A C A A A A A A A
tg93 136 G C C G C C G G G
tg93 150 CTCTC - CTCTC - CTCTC CTCTC
In this file, the header columns are:
CHROM - name
POS - position
REF - reference
ALT - alternate
10_sample.bam to 16_sample.bam - samples
Now I want to see how many times the letters in the REF and ALT columns occur across the samples. If either of them occurs fewer than two times, I need to delete that row.
For example:
In the first row, I have 'T' in REF and 'C' in ALT. In the 7 samples there are 5 T's, 2 blanks, and no C, so I need to delete this row.
In the second row, REF is 'C' and ALT is '-'. In the seven samples we have 3 C's, 2 '-'s, and 2 blanks, so we keep this row, as C and '-' each occur at least two times.
We always ignore the blanks while counting.
The final file after filtering is
#CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 79 C - C C C - -
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 136 G C C G C C G G G
I am able to read the columns into arrays and display them, but I am not sure how to write the loops that read the bases, count their occurrences, and keep or drop the row. Can anyone tell me how I should proceed with this? It would also be helpful if you have any example code I can build upon.
#!/usr/bin/env perl
use strict;
use warnings;

print scalar(<>);                     # Read and output the header.

while (<>) {                          # Read a line.
    chomp;                            # Remove the newline from the line.
    my ($chrom, $pos, $ref, $alt, @samples) =
        split /\t/;                   # Parse the remainder of the line.

    my %counts;                       # Count the occurrences of sample values.
    ++$counts{$_} for @samples;       # e.g. might end up with $counts{"G"} = 3.

    print "$_\n"                      # Print the line if we want to keep it.
        if ($counts{$ref} || 0) >= 2  # ("|| 0" avoids a spurious warning.)
        && ($counts{$alt} || 0) >= 2;
}
Output:
CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam
tg93 79 C - C C C - -
tg93 80 G A G G G G A A G
tg93 81 A C A A A A C C C
tg93 86 C A C C A A A A C
tg93 136 G C C G C C G G G
Note that position 108 is excluded because it has only one instance of ALT among the seven samples.
Usage:
perl script.pl file.in >file.out
Or in-place:
perl -i script.pl file
Here's an approach that does not assume tab separation between fields
use IO::All;

my $chrom = "tg93";
my @lines = io('file.txt')->slurp;

foreach (@lines) {
    %letters = ();

    # Use a regex with backreferences to extract the data; this method does not
    # depend on tab-separated fields.
    if (/$chrom\s+\d+\s+([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])\s{3}([A-Z-\s])/) {
        # Initialize the hash counts for REF ($1) and ALT ($2).
        $letters{$1} = 0;
        $letters{$2} = 0;

        # Loop through the samples and increment the counters when matches are found.
        foreach ($3, $4, $5, $6, $7, $8, $9) {
            if ($_ eq $1) {
                ++$letters{$1};
            }
            if ($_ eq $2) {
                ++$letters{$2};
            }
        }

        # If the counts for both REF and ALT are greater than or equal to 2, print the line.
        if ($letters{$1} >= 2 && $letters{$2} >= 2) {
            print $_;
        }
    }
}
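If neither Perl approach fits your setup, the same REF/ALT counting logic is easy to express in Python as well. A minimal sketch, assuming tab-separated fields, blanks represented by empty strings, and placeholder file names:

import csv

with open("variants.txt", newline="") as fin, open("filtered.txt", "w", newline="") as fout:
    reader = csv.reader(fin, delimiter="\t")
    writer = csv.writer(fout, delimiter="\t")
    writer.writerow(next(reader))              # copy the header line through
    for row in reader:
        ref, alt, samples = row[2], row[3], row[4:]
        calls = [s for s in samples if s]      # ignore blanks while counting
        if calls.count(ref) >= 2 and calls.count(alt) >= 2:
            writer.writerow(row)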