I’m working with multiple .vcf files in a directory on a Linux server, and also a tab-delimited key file that contains the sample names and the corresponding barcodes.
Here is how the files are named:
RA_4090_v1_RA_4090_RNA_v1.vcf
RA_4090_dup_v1_RA_4090_dup_RNA_v1.vcf
RA_565_v1.vcf
RA_565_dup_v1.vcf
RA_HCC-78-2.vcf
Here are contents of the key file:
Barcode ID Sample Name
IonSelect-2 RA_4090
IonSelect-4 RA_565
IonSelect-6 RA_HCC-78-2
IonSelect-10 RA_4090_dup
IonSelect-12 RA_565_dup
I need to correlate the correct sample names with each .vcf file and then rename each .vcf file.
There is always one .vcf file for each sample. However, sometimes the sample names begin with the same substring, and it’s impossible to match them up correctly, since the sample names are not standardized.
The following code works well when the sample names are different but fails if multiple sample names begin with the same substring. I have no idea how to account for multiple sample names that begin with the same substring.
Please suggest something that will work. Here is the current code:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy qw(move);

my $home          = "/data/";
my $bam_directory = $home . "test_all_runs/" . $ARGV[0];
my $matrix_key    = $home . "test_all_runs/" . $ARGV[0] . "/key.txt";

my @matrix_key = ();
open(TXT2, "$matrix_key") or die "Can't open '$matrix_key': $!";
while (<TXT2>) {
    push(@matrix_key, $_);
}
close(TXT2);

my @ant_vcf = glob "$bam_directory/*.vcf";
for my $tsv_file (@ant_vcf) {
    my $matrix_barcode_vcf = "";
    my $matrix_sample_vcf  = "";
    foreach (@matrix_key) {
        chomp($_);
        my @matrix_key = split("\t", $_);
        if ( index($tsv_file, $matrix_key[1]) != -1 ) {
            $matrix_barcode_vcf = $matrix_key[0]; print $matrix_key[0];
            $matrix_sample_vcf  = $matrix_key[1];
            chomp $matrix_barcode_vcf;
            chomp $matrix_sample_vcf;
            #print $bam_directory."/".$matrix_sample_id."_".$matrix_barcode.".bam";
            move $tsv_file, $bam_directory . "/" . $matrix_sample_vcf . "_" . $matrix_barcode_vcf . ".vcf";
        }
    }
}
The key to solving your problem is sorting the 'Sample Name' names by length - longest first.
For example, RA_4090_dup should be before RA_4090 in the @matrix_key array so it will attempt to match the longer string first. Then, after a match, you stop searching (I used first from the List::Util module, which has been part of core Perl since version 5.8).
#!/usr/bin/perl
use strict;
use warnings;
use List::Util 'first';

my @files = qw(
    RA_4090_v1_RA_4090_RNA_v1.vcf
    RA_4090_dup_v1_RA_4090_dup_RNA_v1.vcf
    RA_565_v1.vcf
    RA_565_dup_v1.vcf
    RA_HCC-78-2.vcf
);

open my $key, '<', 'junk.txt' or die $!;    # key file
<$key>;    # throw away header line in key file (first line)
my @matrix_key = sort { length($b->[1]) <=> length($a->[1]) } map [ split ], <$key>;
close $key or die $!;

for my $tsv_file (@files) {
    if ( my $aref = first { index($tsv_file, $_->[1]) != -1 } @matrix_key ) {
        print "$tsv_file \t MATCHES $aref->[1]\n";
        print "\t$aref->[1]_$aref->[0]\n\n";
    }
}
This produced this output:
RA_4090_v1_RA_4090_RNA_v1.vcf    MATCHES RA_4090
    RA_4090_IonSelect-2

RA_4090_dup_v1_RA_4090_dup_RNA_v1.vcf    MATCHES RA_4090_dup
    RA_4090_dup_IonSelect-10

RA_565_v1.vcf    MATCHES RA_565
    RA_565_IonSelect-4

RA_565_dup_v1.vcf    MATCHES RA_565_dup
    RA_565_dup_IonSelect-12

RA_HCC-78-2.vcf    MATCHES RA_HCC-78-2
    RA_HCC-78-2_IonSelect-6
I want to perform a VLOOKUP-like process, but with multiple files, wherein the contents of the first column from all files (sorted and uniq-ed) are the reference values. I would like to store the key-value pairs from each file in its own hash and then print them together. Something like this:
file1: while(){$hash1{$key}=$val} ... file2: while(){$hash2{$key}=$val} ... file3: while(){$hash3{$key}=$val} ... and so on
Then print it: print "$ref_val $hash1{$ref_val} $hash2{$ref_val} $hash3{$ref_val} ..."
$i = 1;
@FILES = @ARGV;
foreach $file (@FILES)
{
    open($fh, $file);
    $hname = "hash" . $i;    ## trying to create unique hash by attaching a running number to hash name
    while (<$fh>) { @d = split("\t"); $hname{$d[0]} = $d[7]; }
    $i++;
}
$set = $i - 1;    ## store this number for recreating the hash names during printing
open(FH, "ref_list.txt");
while (<FH>)
{
    chomp(); print "$_\t";
    ## here i run the loop recreating the hash names and printing its corresponding value
    for ($i = 1; $i <= $set; $i++) { $hname = "hash" . $i; print "$hname{$_}\t"; }
    print "\n";
}
Now this is where I am stuck: Perl takes $hname as the hash name instead of $hash1, $hash2, and so on.
Thanks in advance for the help and opinions.
The shown code attempts to use symbolic references to construct variable names at runtime. Those things can raise a lot of trouble and should not be used, except very occasionally in very specialized code.
Here is a way to read multiple files, each into a hash, and store them for later processing.
use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);

my @files = @ARGV;

my @data;
for my $file (@files) {
    open my $fh, '<', $file or do {
        warn "Skip $file, can't open it: $!";
        next;
    };
    push @data, { map { (split /\t/, $_)[0,6] } <$fh> };
}

dd \@data;
Each hash associates the first column with the seventh (index 6), as clarified, for each line. A reference to such a hash for each file, formed by { }, is added to the array.
Note that when you add a key-value pair to a hash which already has that key the new overwrites the old. So if a string repeats in the first column in a file, the hash for that file will end up with the value (column 7) for the last one. The OP doesn't discuss possible duplicates of this kind in data files (only for the reference file), please clarify if needed.
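A tiny self-contained illustration of that overwrite behavior, with made-up key and values:

```perl
use strict;
use warnings;

# Assigning to an existing key replaces its value; only the last
# column-7 value for a repeated column-1 string would survive.
my %h;
$h{'geneA'} = 1.5;
$h{'geneA'} = 2.5;    # overwrites the 1.5

print "$h{'geneA'}\n";    # prints 2.5
```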
Data::Dump is used only to print; if you don't wish to install it, use the core Data::Dumper.
I am not sure that I get the use of that "reference file", but you can now go through the array of hash references for each file and fetch values as needed. Perhaps like
open my $fh_ref, '<', $ref_file or die "Can't open $ref_file: $!";
while (my $line = <$fh_ref>) {
    my $key = ...  # retrieve the key from $line
    print "$key: ";
    foreach my $hr (@data) {
        print "$hr->{$key} ";
    }
    say '';
}
This will print key: followed by values for that string, one from each file.
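One caveat: if some file lacks the key, $hr->{$key} is undef and interpolating it will warn. A small sketch of guarding that with the defined-or operator (the 'NA' placeholder is my choice, not from the question):

```perl
use strict;
use warnings;
use feature 'say';

# Two hashes standing in for two files; the second has no key 'a'
my @data = ( { a => 1 }, { b => 2 } );

my $key  = 'a';
my @vals = map { $_->{$key} // 'NA' } @data;    # 'NA' where the key is absent
say "@vals";    # prints "1 NA"
```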
I'm new to Perl, so please bear with me in my ignorance. What I'm trying to do is read a file (already using the File::Slurp module) and create variables from the data in the file. Currently I have this setup:
use File::Slurp;
my @targets = read_file("targetfile.txt");
print @targets;
Within that target file, I have the following bits of data:
id: 123456789
name: anytownusa
1.2.3.4/32
5.6.7.8/32
The first line is an ID, the second line is a name, and all successive lines will be IP addresses (maximum length of a few hundred).
So my goal is to read that file and create variables that look something like this:
$var1="123456789";
$var2="anytownusa";
$var3="1.2.3.4/32,5.6.7.8/32,etc,etc,etc,etc,etc";
** Taking note that all the IP addresses end up grouped together into a single variable and separated by a (,) comma.
File::Slurp will read the complete file data in one go. This might cause an issue if the file size is very big. Let me show you a simple approach to this problem.
Read the file line by line using a while loop
Check the line number using $. and assign the line data to the respective variable
Store the IPs in an array and at the end print them using join
Note: If you have to alter the line data, use search and replace in the respective conditional block before assigning the line data to the variable.
Code:
#!/usr/bin/perl
use strict;
use warnings;

my ($id, $name, @ips);

while (<DATA>) {
    chomp;
    if ($. == 1) {
        $id = $_;
    }
    elsif ($. == 2) {
        $name = $_;
    }
    else {
        push @ips, $_;
    }
}

print "$id\n";
print "$name\n";
print join ",", @ips;
__DATA__
id: 123456789
name: anytownusa
1.2.3.4/32
5.6.7.8/32
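Following the note above about altering the line data: the question's desired output has the labels stripped from the values, so here is one way the conditional blocks could drop the id: and name: prefixes before assignment (a sketch):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ($id, $name, @ips);

while (<DATA>) {
    chomp;
    if    ($. == 1) { ($id   = $_) =~ s/^id:\s*//;   }    # drop the "id: " label
    elsif ($. == 2) { ($name = $_) =~ s/^name:\s*//; }    # drop the "name: " label
    else            { push @ips, $_; }
}

print "$id\n$name\n", join(',', @ips), "\n";

__DATA__
id: 123456789
name: anytownusa
1.2.3.4/32
5.6.7.8/32
```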
As it has been noted, there is no reason to "slurp" the whole file into a variable. If nothing else, it only makes the processing harder.
Also, why not store named labels in a hash, in this example
my %identity = (id => 123456789, name => 'anytownusa');
The code below picks up the key names from the file, they aren't hard-coded.
Then
use warnings;
use strict;
use feature 'say';

my (@ips, %identity);

my $file = 'targetfile.txt';
open my $fh, '<', $file or die "Can't open $file: $!";

while (<$fh>)
{
    next if not /\S/;
    chomp;
    my ($m1, $m2) = split /:/;
    if ($m1 and $m2) { $identity{$m1} = $m2; }
    else             { push @ips, $m1; }
}
say "$_: $identity{$_}" for keys %identity;
say join '/', @ips;
If the line doesn't have a :, the split will return it whole; that is the IP, which is stored in an array for processing later. Otherwise split returns the named pair, for 'id' and 'name'.
We first skip blank lines with next if not /\S/;, so the line must have some non-space characters, and the else suffices, as there is always something in $m1. We also need to remove the newline, with chomp.
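A quick check of how split /:/ behaves on both kinds of line (illustrative values only):

```perl
use strict;
use warnings;

my @pair   = split /:/, 'id: 123456789';    # ('id', ' 123456789') - two fields
my @single = split /:/, '1.2.3.4/32';       # ('1.2.3.4/32')       - one field

print scalar(@pair), " ", scalar(@single), "\n";    # prints "2 1"
```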
Read the file into variables directly:
use Modern::Perl;

my ($id, $name, @ips) = (<DATA>, <DATA>, <DATA>);
chomp ($id, $name, @ips);

say $id;
say $name;
$" = ',';
say "@ips";
__DATA__
id: 123456789
name: anytownusa
1.2.3.4/32
5.6.7.8/32
Output:
id: 123456789
name: anytownusa
1.2.3.4/32,5.6.7.8/32
This seemed like such an easy task, yet I am boggled.
I have text files, each named after a type of tissue (e.g. cortex.txt, heart.txt)
Each file contains two columns, and the column headers are gene_name and expression_value
Each file contains around 30K to 40K rows
I need to merge the files into one file with 29 columns, with headers
genename, tissue1, tissue2, tissue3, etc. to tissue28
So that each row contains one gene and its expression value in the 28 tissues
The following code creates an array containing a list of every gene name in every file:
my #list_of_genes;
foreach my $input_file ( #input_files ) {
print $input_file, "\n";
open ( IN, "outfiles/$input_file");
while ( <IN> ) {
if ( $_ =~ m/^(\w+\|ENSMUSG\w+)\t/) {
# check if the gene is already in the gene list
my $count = grep { $_ eq $1 } #list_of_genes;
# if not in list, add to the list
if ( $count == 0 ) {
push (#list_of_genes, $1);
}
}
}
close IN;
}
The next bit of code I was hoping would work, but the regex only recognises the first gene name.
Note: I am only testing it on one test file called "tissue1.txt".
The idea is to create an array of all the file names, and then take each gene name in turn and search through each file to extract each value and write it to the outfile in order along the row.
foreach my $gene (@list_of_genes) {
    # print the gene name in the first column
    print OUT $gene, "\t";
    # use the gene name to search the first element of the @input_file array and print to the second column
    open(IN, "outfiles/tissue1.txt");
    while (<IN>) {
        if ( $_ =~ m/^$gene\t(.+)\n/i ) {
            print OUT $1;
        }
    }
    print OUT "\n";
}
EDIT 1:
Thank you Borodin. The output of your code is indeed a list of every gene name with all the expression values across the tissues,
e.g. Bcl20|ENSMUSG00000000317,0.815796340254127,0.815796340245643
This is great, much better than I managed, thank you. Two additional things are needed.
1) If a gene name is not found in a .txt file then a value of 0 should be recorded
e.g. Ht4|ENSMUSG00000000031,4.75878049632381, 0
2) I need a comma separated header row so that the tissue from which each value comes remains associated with the value (basically a table) - the tissue is the name of the text file
e.g. From 2 files heart.txt and liver.txt the first row should be:
genename|id,heart,liver
where genename|id is always the first header
That's a lot of code to implement the simple idiom of using a hash to enforce uniqueness!
It's looking like you want an array of expression values for each different ENSMUSG string in all *.txt files in your outfiles directory.
If the files you need are the only ones in the outfiles directory, then the solution looks like this. I've used autodie to check the return status of all Perl IO operations (chdir, open, print etc.) and checked only that the $gene value contains |ENSMUSG. You may not need even this check if your input data is well-behaved.
Please forgive me if this is bugged, as I have no access to a Perl compiler at present. I have checked it by sight and it looks fine.
use strict;
use warnings 'all';
use autodie;

chdir '/path/to/outfiles';

my %data;

while ( my $file = glob '*.txt' ) {
    open my $fh, '<', $file;
    while ( <$fh> ) {
        my ($gene, $value) = split;
        next unless $gene =~ /\|ENSMUSG/;
        push @{ $data{$gene} }, $value;
    }
}

print join(',', $_, @{ $data{$_} }), "\n" for keys %data;
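To address the two follow-up points in EDIT 1 (a zero for genes missing from a file, and a comma-separated header naming each tissue), one possible extension is to key the inner structure by tissue and fill the gaps at print time. A sketch using simulated in-memory data in place of the real tissue files:

```perl
use strict;
use warnings;

# Simulated input: tissue name => lines of "gene\tvalue" (stand-ins
# for the real heart.txt / liver.txt files)
my %files = (
    heart => [ "Bcl20|ENSMUSG00000000317\t0.81", "Ht4|ENSMUSG00000000031\t4.75" ],
    liver => [ "Bcl20|ENSMUSG00000000317\t0.82" ],
);

my %data;
my @tissues;

for my $tissue (sort keys %files) {
    push @tissues, $tissue;
    for ( @{ $files{$tissue} } ) {
        my ($gene, $value) = split;
        next unless $gene =~ /\|ENSMUSG/;
        # Key by tissue as well, so gaps can be zero-filled below
        $data{$gene}{$tissue} = $value;
    }
}

# Header row: genename|id followed by the tissue (file) names
print join(',', 'genename|id', @tissues), "\n";
for my $gene (sort keys %data) {
    # The defined-or operator supplies 0 where a tissue lacks the gene
    print join(',', $gene, map { $data{$gene}{$_} // 0 } @tissues), "\n";
}
```

With the two sample tissues above, this prints a header line genename|id,heart,liver followed by one comma-separated row per gene, with 0 filled in for the missing liver value of Ht4.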
I need to create a Perl script to check the first four characters of the file name of all the files mentioned in a path, and compare it with a text file containing those four characters.
The idea is to check whether any file starting with a list of numbers is missing.
For Example. Files in path D:/temp are
1234-2041-123.txt
1194-2041-123.txt
3234-2041-123.txt
1574-2041-123.txt
I need to compare the first four characters of each filename - 1234, 1194, 3234, 1574 - with a text file containing the sequences 1234, 1194, 3234, 1574, 1111, 2222 and produce the output
File starting with 1111, 2222 is missing.
I hope I am clear.
I am able to take out the first four characters from the file name but cannot proceed further
@files = <d:/temp/*>;
foreach $file (@files) {
    my $xyz = substr $file, 8, 4;
    print $xyz . "\n";
}
Similarly to @F. Hauri's proposed solution, I propose a hash-based solution:
use strict;
use warnings;
use feature 'say';
use Data::Dumper;

# Get this list from the file. Here, 'auto' and 'cron' will exist in
# /etc, but 'fake' and 'mdup' probably won't.
my @expected_prefixes = qw| auto cron fake mdup |;

# Initialize entries for each prefix in the seed file to false
my %prefix_list;
@prefix_list{ @expected_prefixes } = ();

opendir my $dir, "/etc" or die $!;
while ( my $file_name = readdir $dir ) {
    my $first_four = substr $file_name, 0, 4;
    # Increment element for the prefix for found files
    $prefix_list{$first_four}++;
}

# Get list of prefixes with no matching files found in the directory
my @missing_files = grep { ! $prefix_list{$_} } keys %prefix_list;
say "Missing files: " . Dumper(\@missing_files);
This solution works by creating a hash from all of the values in the file prefixes.txt, then deleting elements from that hash as files are found starting with each sequence.
In addition, if any file name starts with a sequence that doesn't appear in the file then a warning is printed.
The output is simply a matter of listing all the elements of the hash that remain after this process.
use strict;
use warnings;

my %prefixes;

open my $fh, '<', 'prefixes.txt' or die $!;
while (<$fh>) {
    chomp;
    $prefixes{$_} = 1;
}

my @files = qw/
    1234-2041-123.txt
    1194-2041-123.txt
    3234-2041-123.txt
    1574-2041-123.txt
/;

for my $name (@files) {
    my $pref = substr $name, 0, 4;
    if ($prefixes{$pref}) {
        delete $prefixes{$pref};
    }
    else {
        warn qq{Prefix for file "$name" not listed};
    }
}

printf "File starting with %s is missing.\n", join ', ', sort keys %prefixes;
output
File starting with 1111, 2222 is missing.
I am very new to Perl and need your help.
I have a CSV file xyz.csv with the contents below
(here the level1 and er values are string names, not numbers):
level1,er
level2,er2
level3,er3
level4,er4
I parse this CSV file using the script below and pass the fields to an array in the first run
open(my $d, '<', $file) or die "Could not open '$file' $!\n";
while (my $line = <$d>) {
    chomp $line;
    my @data = split ",", $line;
    @XYX = ( [ "$data[0]", "$data[1]" ], );
}
For the second run I take an input from the command prompt and store it in the variable $val. My program should parse the CSV file from the value stored in the variable until it reaches the end of the file.
For example:
If I input level2, I need the script to parse from the second line to the end of the CSV file, ignoring the values before level2 in the file, and pass these values (level2 to level4) to @XYX = (["$data[1]","$data[1]"],);
level2,er2
level3,er3
level4,er4
I input level3 so I need a script to parse from the third line to the end of the CSV file, ignoring the values before level3 in the file, and pass these values (level3 and level4) to @XYX = (["$data[0]","$data[1]"],);
level3,er3
level4,er4
How do I achieve that? Please do give your valuable suggestions. I appreciate your help
As long as you are certain that there are never any commas in the data you should be OK using split. But even so it would be wise to limit the split to two fields, so that you get everything up to the first comma and everything after it.
There are a few issues with your code. First of all, I hope you are putting use strict and use warnings at the top of all your Perl programs. That simple measure will catch many trivial problems that you could otherwise overlook, and so it is especially important before you ask for help with your code.
It isn't commonly known, but putting a newline "\n" at the end of your die string prevents Perl from giving file and line number details in the output of where the error occurred. While this may be what you want, it is usually more helpful to be given the extra information.
Your variable names are very unhelpful, and by convention Perl variables consist of lower-case alphanumerics and underscores. Names like @XYX and $W don't help me understand your code at all!
Rather than splitting to an array, it looks like you would be better off putting the two fields into two scalar variables to avoid all that indexing. And I am not sure what you intend by @XYX = (["$data[1]","$data[1]"],). First of all, do you really mean to use $data[1] twice? Secondly, you should never put scalar variables inside double quotes, as it does something very specific, and unless you know what that is you should avoid it. Finally, did you mean to push an anonymous array onto @XYX each time around the loop? Otherwise the contents of the array will be overwritten each time a line is read from the file, and the earlier data will be lost.
This program uses a regular expression to extract $level_num from the first field. All it does is find the first sequence of digits in the string, which can then be compared to the minimum required level $min_level to decide whether a line from the log is relevant.
use strict;
use warnings;

my $file      = 'xyz.csv';
my $min_level = 3;

my @list;

open my $fh, '<', $file or die "Could not open '$file' $!";

while (my $line = <$fh>) {
    chomp $line;
    my ($level, $error) = split ',', $line, 2;
    my ($level_num) = $level =~ /(\d+)/;
    next unless $level_num >= $min_level;
    push @list, [ $level, $error ];
}
For deciding which records to process you can use the "flip-flop" operator (..) along these lines.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

my $level = shift || 'level1';

while (<DATA>) {
    if (/^\Q$level,/ .. 0) {
        print;
    }
}

__DATA__
level1,er
level2,er2
level3,er3
level4,er4
The flip-flop operator returns false until its first operand is true. From that point on it returns true until its second operand is true, at which point it returns false again.
I'm assuming that your file is ordered so that once you start to process it, you never want to stop. That means that the first operand to the flip-flop can be /^\Q$level,/ (match the string $level at the start of the line) and the second operand can just be zero (as we never want it to stop processing).
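For illustration, here is the same flip-flop idea with both operands as patterns, reading contrived data from an in-memory filehandle:

```perl
use strict;
use warnings;

# The range operator turns true on the line matching START and stays
# true through the line matching END, then turns false again.
open my $fh, '<', \ "a\nSTART\nb\nc\nEND\nd\n" or die $!;
while (<$fh>) {
    print if /START/ .. /END/;    # prints the START..END lines only
}
```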
I'd also strongly recommend not parsing CSV records using split /,/. That may work on your current data but, in general, the fields in a CSV file are allowed to contain embedded commas which will break this approach. Instead, have a look at Text::CSV or Text::ParseWords (which is included with the standard Perl distribution).
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my @XYZ;
my $file = 'xyz.csv';
open my $fh, '<', $file or die "$file: $!\n";

my $level  = shift;               # get level from command line
my $getall = not defined $level;  # true if level not given on command line

my $parser = Text::CSV->new({ binary => 1 });    # object for parsing lines of CSV

while (my $row = $parser->getline($fh))    # $row is an array reference containing cells from a line of CSV
{
    if ($getall                      # if level was not given on the command line, then put all rows into @XYZ
        or                           # if level *was* given on the command line, then...
        $row->[0] eq $level .. 0     # ...wait until the first cell in a row equals $level, then put that row and all subsequent rows into @XYZ
       )
    {
        push @XYZ, $row;
    }
}
close $fh;
#!/usr/bin/perl
use strict;
use warnings;

my $file  = 'xyz.csv';
my $level = shift || "level1";

open(my $data, '<', $file) or die "Could not open '$file' $!\n";

while (my $line = <$data>) {
    chomp $line;
    my @fields = split ",", $line;
    if ($fields[0] eq $level .. 0) {
        print "\n$fields[0]\n";
        print "$fields[1]\n";
    }
}
This worked....thanks ALL for your help...