Generating 2 files based on two columns in a third file - perl

I am trying to prepare two input files based on information in a third file. File 1 is for sample1 and File 2 is for sample2. Both files have lines with tab-delimited columns: the first column contains a unique identifier and the second column contains information.
File 1
>ENT01 xxxxxxxxxxxxxx
>ENT02 xyxyxyxyxyxy
>ENT03 ththththththt
...and so on. Similarly, File 2 contains
>ENG012 ggggggggggggg
>ENG098 ksksksksksks
>ENG234 wewewewewew
I have a File 3 that contains two columns, the first with identifiers from File 1 and the second with identifiers from File 2:
>ENT01 >ENG78
>ENT02 >ENG098
>ENT02 >ENG012
>ENT02 >ENG234
>ENT03 >ENG012
and so on. I want to prepare input files for File 1 and File 2 by following the order in file 3. If an entry is repeated in file 3 (e.g. ENT02), I want to repeat the information for that entry. The expected output is:
For File 1:
>ENT01 xxxxxxxxxx
>ENT02 xyxyxyxyxyxy
>ENT02 xyxyxyxyxyx
>ENT02 xyxyxyxyxyx
>ENT03 ththththththth
And for file 2
>ENG78 some info
>ENG098 ksksksksks
>ENG012 gggggggg
>ENG234 wewewewewew
>ENG012 gggggggg
All the entries in file 1 and file 2 are unique, but not in file 3. Also, there are some entries in file 3, in either column, that are not present in file 1 or file 2. The current logic I am using is to find the intersection of the identifiers in column 1 of files 1 and 2 with the respective columns of file 3, store this as a list, and use this list to compare against File 1 and File 2 separately to generate the output for each. I am working with the following lines:
awk 'FNR==NR{a[$1]=$0;next};{print a[$1]}' file1 intersectlist
grep -v -x -f idsnotfoundinfile1 file3
I am not able to get the right output; I think at some point it is getting sorted and only unique values are printed out. Can someone please help me work this out?

You need to read and remember the first 2 files into some data structure, and then for the third file, output to 2 new files:
$ awk -F'\t' -v OFS='\t' '
    FNR == 1 { file_num++ }
    file_num == 1 || file_num == 2 { data[file_num, $1] = $2; next }

    function value(str) {
        return str ? str : "some info"
    }

    {
        for (i = 1; i <= 2; i++) {
            print $i, value(data[i, $i]) > (ARGV[i] ".new")
        }
    }
' file1 file2 file3
$ cat file1.new
>ENT01 xxxxxxxxxxxxxx
>ENT02 xyxyxyxyxyxy
>ENT02 xyxyxyxyxyxy
>ENT02 xyxyxyxyxyxy
>ENT03 ththththththt
$ cat file2.new
>ENG78 some info
>ENG098 ksksksksksks
>ENG012 ggggggggggggg
>ENG234 wewewewewew
>ENG012 ggggggggggggg

Files 1 and 2 first need to be read so that you can find their lines by the identifiers from file 3. Since the identifiers in these files are unique you can build a hash for each file, with identifiers as keys.
Then process file 3 line by line, where for each identifier on the line retrieve its value from the hash for the appropriate file and write the corresponding lines to new files 1 and 2.
use warnings;
use strict;
use feature 'say';
use Path::Tiny;

my ($file1, $file2, $file3) = qw(File1.txt File2.txt File3.txt);
my ($fileout1, $fileout2) = map { $_ . 'new' } ($file1, $file2);

my %file1 = map { split } path($file1)->lines;
my %file2 = map { split } path($file2)->lines;

my ($ofh1, $ofh2) = map { path($_)->openw } ($fileout1, $fileout2);

open my $fh, '<', $file3 or die "Can't open $file3: $!";

while (<$fh>) {
    my ($f1, $f2) = split;
    say $ofh1 "$f1\t", $file1{$f1} // 'some info';  #/ see text
    say $ofh2 "$f2\t", $file2{$f2} // 'some info';
}
close $_ for $ofh1, $ofh2, $fh;
This produces the correct output for the fragments of input files provided.
I use Path::Tiny here for conciseness. Its lines method returns all lines, and in map's block each line is split on whitespace by default. The list of such pairs returned by map is assigned to a hash, whereby each pair of successive strings forms a key-value pair.
Multiple files can be opened in one statement, and Path::Tiny again makes that clean with openw. Its methods throw an exception (die) on errors, so we get error checking as well.
If an identifier in File 3 is not found in File 1/2 I bluntly use 'some info' as stated in the question,† but I expect that there is a more rounded solution for such a case. Then the laconic // should be changed to accommodate extra processing (or to call a sub in place of the 'some info' string).
It is assumed that files 1 and 2 always have two entries on a line.
Some shortcuts are taken, like reading each file into a hash in one line. Please expand the code as appropriate, with whatever checks may be needed.
† In such a case $file1{$f1} is undef, so the // (defined-or) operator returns its right-hand-side argument. A "proper" way is to test if (exists $file1{$f1}), but // works as well.
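For instance, here is a sketch of the loop body using exists instead of //, with an explicit branch for missing identifiers (the warn message and the fallback string are only placeholders; it assumes the same %file1, %file2, $ofh1, $ofh2 and $fh as in the script above):
while (<$fh>) {
    my ($f1, $f2) = split;
    if (exists $file1{$f1}) {
        say $ofh1 "$f1\t$file1{$f1}";
    }
    else {
        warn "Identifier $f1 not found in $file1\n";   # or handle however you prefer
        say $ofh1 "$f1\tsome info";
    }
    # ... and the same for $f2 with %file2 and $ofh2
}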

Related

get column list using sed/awk/perl

I have different files in the format below.
Scenario 1 :
File1
no,name
1,aaa
20,bbb
File2
no,name,address
5,aaa,ghi
7,ccc,mn
I would like to get the column list that has more columns, provided the shorter one matches it in the same order.
Expected output for scenario 1:
no,name,address
Scenario 2 :
File1
no,name
1,aaa
20,bbb
File2
no,age,name,address
5,2,aaa,ghi
7,3,ccc,mn
Expected results (as a message):
Both file headers and positions are different
I am interested in any short solution using bash / perl / sed / awk.
Perl solution:
perl -lne 'push @lines, $_;
           close ARGV;
           next if @lines < 2;
           @lines = sort { length $a <=> length $b } @lines;
           if (0 == index "$lines[1],", "$lines[0],") {
               print $lines[1];
           } else {
               print "Both file headers and positions are different";
           }' -- File1 File2
-n reads the input line by line and runs the code for each line
-l removes newlines from input and adds them to printed lines
closing the special file handle ARGV makes Perl open the next file and read from it instead of processing the rest of the currently opened file.
next makes Perl go back to the beginning of the code; it can continue once more than one input line has been read.
sort sorts the lines by length so that we know the longer one is in the second element of the array.
index is used to check whether the shorter header is a prefix of the longer one (including the comma after the first header, so e.g. no,names is correctly rejected)
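If the one-liner is hard to follow, here is roughly the same logic as a small standalone script (a sketch; it assumes the two files are passed as arguments and that the header is the first line of each):
use strict;
use warnings;

my @headers;
for my $file (@ARGV) {                       # e.g. File1 File2
    open my $fh, '<', $file or die "Can't open $file: $!";
    chomp(my $header = <$fh>);               # only the header line is needed
    push @headers, $header;
    close $fh;
}

# Shorter header first, then check that it is a prefix of the longer one
# on a comma boundary (so "no,name" does not match "no,names,...").
@headers = sort { length $a <=> length $b } @headers;
if (0 == index "$headers[1],", "$headers[0],") {
    print "$headers[1]\n";
}
else {
    print "Both file headers and positions are different\n";
}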

search for occurrence of a string in another file in a particular column

I have two files:
1) Tab file with the following content. Let's call this reference file:
V$HMGIY_01_rc Ncor=0.405
V$CACD_01 Ncor=0.405
V$GKLF_02 Ncor=0.650
V$AML2_Q3 Ncor=0.792
V$WT1_Q6 Ncor=0.607
V$KID3_01 Ncor=0.668
V$CNOT3_01 Ncor=0.491
V$KROX_Q6 Ncor=0.423
V$ETF_Q6_rc Ncor=0.547
V$E2F_Q2_rc Ncor=0.653
V$SP1_Q6_01_rc Ncor=0.650
V$SP4_Q5 Ncor=0.660
2) The second tab file contains the search string X as shown below. Let's call this file as search_string:
A X
NF-E2_SC-22827 NF-E2
NRSF NRSF
NFATC1_SC-17834 NFATC1
NFKB NFKB
TCF3_SC-349 TCF3
MEF2A MEF2A
What I would like to do is take the first search term (from the search_string file, column X) and check if it occurs in the first column of the reference file.
Example: The first search term is NF-E2. I would need to check if this string occurs in the first column of the reference file. If it occurs, then give a score of 1, else give 0. Also I would like to count the number of times it matches the pattern.
I want the output to be created as follows:
X X in file? number of times it occurs
NF-E2 1 3
NRSF 0 0
NFATC1 0 0
NFKB 1 7
TCF3 0 0
Please note: I need to search each string in a different file, i.e. the first string (NF-E2) should be searched in file NF-E2.tab, the second string (NRSF) in file NRSF.tab, and so on. Also, I would like to program this using either R or Perl scripts only.
Please help!!
Here is a one-liner that you can play with and alter to suit:
perl -lanE '$str=$F[1]; $f="/home/$str/list/$str.txt"; $c=`grep -c "$str" "$f"`;chomp($c);$x=0;$x++ if $c;say "$str\t$x\t$c"' file2
It assumes your second file is called file2. Here is some sample output from input files I made up on my machine:
NF-E2 0 0
NRSF 1 1
NFATC1 1 2
TCF3 1 3
MEF2A 0 0
It just uses grep -c to count the matching lines and stores that number in the variable $c. chomp() removes the linefeed from the output of grep. $x is set to zero and incremented if the count ($c) is greater than zero. Then the result is printed using say.
I'll get you started with the search string and the name of the file to search in...
$ perl -lanE '$str=$F[1];$f=$str.".txt";print "$str $f"' file2
NF-E2 NF-E2.txt
NRSF NRSF.txt
NFATC1 NFATC1.txt
NFKB NFKB.txt
TCF3 TCF3.txt
MEF2A MEF2A.txt
Explanation
Perl command-line switches used:
-l Perl takes care of line endings for us, saving us the trouble - thanks Perl!
-a split the fields of the input line into an array called @F
-n put an implicit loop around our code to process each line of the input file (file2)
-E execute the code that follows inside single quotes and enable the say feature
Then the actual code inside the single quotes ('')... assign the value of the second field, i.e. $F[1] because fields start at 0, to the variable $str. Assign the value of $str with ".txt" appended to the variable $f - which is the name of the file to search in. Then print the search string $str and the filename $f.
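If you would rather not shell out to grep at all, here is a sketch of a pure-Perl version of the first one-liner (it assumes the same /home/$str/list/$str.txt layout; note it counts every occurrence of the string, not just the matching lines):
use strict;
use warnings;

while (<>) {                                  # read file2 line by line
    my (undef, $str) = split;                 # the search string is in column 2
    my $file = "/home/$str/list/$str.txt";

    my $count = 0;
    if (open my $fh, '<', $file) {
        while (my $line = <$fh>) {
            $count += () = $line =~ /\Q$str\E/g;   # add the matches on this line
        }
        close $fh;
    }
    else {
        warn "Can't open $file: $!";
    }

    my $present = $count ? 1 : 0;
    print "$str\t$present\t$count\n";
}
Run it as perl count.pl file2 (the script name is just an example).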
EDITED
If you find Bash easier to understand, here is a Bash version.
#!/bin/bash
# Set tabs to align output columns
tabs -12
# Output headers
echo -e "X\tPresent?\tCount"
# Extract second column of file2
awk '{print $2}' file2 | while read item
do
    # Work out name of file to search in
    FILE="/home/${item}/list/${item}.txt"
    # Count occurrences of $item in $FILE
    COUNT=$(grep -cw "$item" "$FILE")
    # If COUNT>0 the value is present
    PRESENT=0
    [ $COUNT -gt 0 ] && PRESENT=1
    echo -e "$item\t$PRESENT\t$COUNT"
done
Save the file as go, then run like this:
chmod +x go # Only necessary for the first run
./go

if non-empty file compare columns and write matching lines to new file

I want to write a script that does the following: the user can choose to import a .txt file (I have written the code for this; here $list1). This file consists of a single column with a name on each line. If the user imported a non-empty file, then I want to compare the names in a column of another file (here $file2) with the names in the imported file. If there is a match, then the whole line of the original file ($file2) should be written to a new file ($filter1).
This is what I have so far:
my $list1;
if (prompt_yn("Do you want to import a genelist for filtering?")){
    my $genelist1 = prompt("Give the name of the first genelist file:\n");
    open($list1,'+<',$genelist1) or die "Could not open file $genelist1 $!";
}
open(my $filter1,'+>',"filter1.txt") || die "Can't write new file: $!";
my %hash1=();
while(<$list1>){                 # $list1 is the variable from the imported .txt file
    chomp;
    next unless -z $_;
    my $keyfield= $_;            # this imported file contains only one column
    $hash1{$keyfield}++;
}
seek $file2,0,0;                 # cursor resetting
while(<$file2>){                 # this is the other file with multiple columns
    my @line=split(/\t/);        # split on tabs
    my $keyfield=$line[2];       # values to compare are in column 3
    if (exists($hash1{$keyfield})){
        print $filter1 $_;
    }
}
When running this script my output filter1.txt is empty. Which is not correct because there are definitely matches between the columns.
Because you have declared the $list1 filehandle as a lexical ( "my" ) variable inside a block, it is only visible in that block.
So the later lines in your script can't see $list1 and it gives the error message mentioned
To fix this, declare $list1 before the if.. block that opens the file
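In other words, the shape to aim for is something like this (a sketch that keeps your prompt_yn and prompt helpers):
my $list1;    # declared in the outer scope, so the later loops can see it

if (prompt_yn("Do you want to import a genelist for filtering?")) {
    my $genelist1 = prompt("Give the name of the first genelist file:\n");
    open($list1, '<', $genelist1) or die "Could not open file $genelist1 $!";   # plain read mode is enough here
}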
As the script stands, it doesn't set any keys or values in %hash1.
Your spec is fuzzy, but what you might be intending is to load the %hash1 keys from the imported list file:
while(<$list1>){          # $list1 is the variable from the imported .txt file
    chomp;                # remove newlines
    my $keyfield = $_;    # this imported file contains only one column
    $hash1{$keyfield}++;  # this sets a key in %hash1
}
Then when going through file2
while(<$file2>){                      # this is the other file with multiple columns
    my @line = split(/\t/);           # split on tabs
    my $keyfield = $line[2];          # values to compare are in column "2"
    if (exists($hash1{$keyfield})){   # do hash lookup for exact match
        print $_;                     # output entire line
    }
}
Incidentally, $line[2] is actually column 3; the first column is $line[0], the second $line[1], and so on.
If you actually want to do a partial or pattern match (like a grep) then using a hash isn't appropriate.
Finally, you will have to amend the print $_; # output entire line to send the output to a file, if that is what you require. I removed the reference to $filter1 as it isn't declared in the script fragment shown.
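For example, the last loop with the output going to a file might look like this (a sketch, reusing the filter1.txt name from the original script):
open(my $filter1, '>', "filter1.txt") or die "Can't write new file: $!";

while (<$file2>) {                     # the file with multiple columns
    my @line = split(/\t/);            # split on tabs
    my $keyfield = $line[2];           # third column (index 2)
    if (exists $hash1{$keyfield}) {
        print $filter1 $_;             # write the whole matching line
    }
}
close $filter1;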

Splitting a concatenated file based on header text

I have a few very large files which are basically a concatenation of several small files and I need to split them into their constituent files. I also need to name the files the same as the original files.
For example the files QMAX123 and QMAX124 have been concatenated to:
;QMAX123 - Student
... file content ...
;QMAX124 - Course
... file content ...
I need to recreate the file QMAX123 as
;QMAX123 - Student
... file content ...
And QMAX124 as
;QMAX124 - Course
... file content ...
The original file's header ;QMAX<some number> is unique and only appears as a header in the file.
I used the script below to split the content of the files, but I haven't been able to adapt it to get the file names right.
awk '/^;QMAX/{close("file"f);f++}{print $0 > "file"f}' <filename>
So I can either adapt that script to name the file correctly or I can rename the split files created using the script above based on the content of the file, whichever is easier.
I'm currently using cygwin bash (which has perl and awk) if that has any bearing on your answer.
The following Perl should do the trick
use warnings ;
use strict ;
my $F ;    # will hold a filehandle

while (<>) {
    if ( / ^ ; (\S+) /x ) {
        my $filename = $1 ;
        open $F, '>' , $filename or die "can't open $filename " ;
        print $F $_ ;    # keep the header line in the new file
    } else {
        next unless defined $F ;
        print $F $_ or warn "can't write" ;
    }
}
Note that it discards any input before the first header line (the next unless defined $F ; statement). You may prefer to generate an error or add a default file; let me know and I can change it.
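If you would rather collect those leading lines instead of discarding them, one way (a sketch, with UNMATCHED as an example name for the catch-all file) is to open a default file before the loop:
use warnings ;
use strict ;

# Lines seen before the first ";NAME" header go to a catch-all file
# instead of being skipped.
open my $F, '>', 'UNMATCHED' or die "can't open UNMATCHED: $!" ;

while (<>) {
    if ( / ^ ; (\S+) /x ) {
        my $filename = $1 ;
        open $F, '>', $filename or die "can't open $filename: $!" ;
    }
    print $F $_ or warn "can't write" ;
}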
With Awk, it's as simple as
awk '/^;QMAX/ {filename = substr($1,2)} {print >> filename}' input_file

How can I check if contents of one file exist in another in Perl?

Requirement:-
File1 has contents like -
ABCD00000001,\some\some1\ABCD00000001,Y,,5 (this indicates there are 5 files in total in the unit)
File2 has contents as ABCD00000001
So what I need to do is check if ABCD00000001 from File2 exists in File1:
if yes{
print the output to Output.txt till it finds another ',Y,,X'}
else{ No keep checking}
Anyone? Any help is greatly appreciated.
Hi Arkadiy, the output should be: any filename from file 2 (ABCD00000001) matched in file 1, and the lines from one Y marker to the next.
For example, the file 1 structure will be:
ABCD00000001,\some\some1\ABCD00000001,Y,,5
ABCD00000001,\some\some1\ABCD00000002
ABCD00000001,\some\some1\ABCD00000003
ABCD00000001,\some\some1\ABCD00000004
ABCD00000001,\some\some1\ABCD00000005
ABCD00000001,\some\some1\ABCD00000006,Y,,2
so the output should contain all the lines between
ABCD00000001,\some\some1\ABCD00000001,Y,,5 and
ABCD00000001,\some\some1\ABCD00000006,Y,,2
#!/usr/bin/perl -w
use strict;

my $optFile = "C:\\Documents and Settings\\rgolwalkar\\Desktop\\perl_scripts\\SampleOPT1.opt";
my $tifFile = "C:\\Documents and Settings\\rgolwalkar\\Desktop\\perl_scripts\\tif_to_stitch.txt";

print "Reading OPT file now\n";
open (OPT, $optFile);
my @opt_in_array = <OPT>;
close(OPT);
foreach (@opt_in_array) {
    print();
}

print "\nReading TIF file now\n";
open (TIF, $tifFile);
my @tif_in_array = <TIF>;
close(TIF);
foreach (@tif_in_array) {
    print();
}
So all it does is read the 2 files (FYI, I am new to programming).
Try breaking up your problem into discrete steps. It seems that you need to do this (although your question is not very clear):
open file1 for reading
open file2 for reading
read file1, line by line:
for each line in file1, check if there is particular content anywhere in file2
Which part are you having difficulty with? What code have you got so far? Once you have a line in memory, you can compare it to another string using a regular expression, or perhaps a simpler form of comparison.
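As a minimal sketch of those steps (the file names file1.txt and file2.txt, and the choice of comparing the first comma-separated field, are assumptions to adjust to your real data), you might start from something like:
use strict;
use warnings;

# Slurp file2 into one string so each key from file1 can be checked against it.
open my $fh2, '<', 'file2.txt' or die "Can't open file2.txt: $!";
my $file2_contents = do { local $/; <$fh2> };
close $fh2;

open my $fh1, '<', 'file1.txt' or die "Can't open file1.txt: $!";
while (my $line = <$fh1>) {
    chomp $line;
    my ($key) = split /,/, $line;          # e.g. ABCD00000001
    if (index($file2_contents, $key) >= 0) {
        print "Found: $line\n";            # replace with whatever output you need
    }
}
close $fh1;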
OK, I'll bite (partially)...
First, general comments: use strict and -w are good, but you are not checking the results of open or explicitly stating your desired read/write mode.
The contents of your OPT file kinda sorta looks like it is CSV and the second field looks like a Windows path, true? If so, use the appropriate library from CPAN to parse CSV and verify your file names. Misery and pain can be the result otherwise...
As Ether stated earlier, you need to read the file OPT then match the field you want. If the first file is CSV, first you need to parse it without destroying your file names.
Here is a small snippet that will parse your OPT file. At this point, all it does is print the fields, but you can add logic to match to the other file easily. Just read (slurp) the entire second file into a single string and match with your chosen field from the first:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new();
my @opt_fields;

while (<DATA>) {
    if ($csv->parse($_)) {
        push @opt_fields, [ $csv->fields() ];
    } else {
        my $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}

foreach my $ref (@opt_fields) {
    # foreach my $field (@$ref) { print "$field\n"; }
    print "The anon array: @$ref\n";
    print "Use to match?: $ref->[0]\n";
    print "File name?: $ref->[1]\n";
}
__DATA__
ABCD00000001,\some\some1\ABCD00000001,Y,,5
ABCD00000001,\some\some1\ABCD00000002
ABCD00000001,\some\some1\ABCD00000003
ABCD00000001,\some\some1\ABCD00000004
ABCD00000001,\some\some1\ABCD00000005
ABCD00000001,\some\some1\ABCD00000006,Y,,2