VCF to BayeScan format: Perl script not recognising populations

I am trying to convert a .vcf file into the correct format for BayeScan. I tried PGDSpider, as recommended, but my .vcf file is too big and I run into a memory issue.
I then found a Perl script on GitHub that may be able to convert my file even though it is really big. The script can be found here. However, it does not correctly identify the number of populations I have: it finds only 1 population, whereas I have 30.
The top of my population file looks like this, following the example format in the Perl script:
index01_barcode_10_PA-1-WW-10 pop1
index02_barcode_29_PA-5-Ferm-19 pop2
index01_barcode_17_PA-1-WW-17 pop1
index02_barcode_20_PA-5-Ferm-10 pop2
index03_barcode_16_PA-7-CA-14 pop3
I have also tried the script with a sorted population file.
I have no experience with the Perl language, so I am struggling to work out why the script is not working.
I think it has to do with this section of the script, but I cannot be sure:
# read and process pop file
while (<POP>) {
    chomp $_;
    my @line = split /\t/, $_;
    $pops{$line[0]} = $line[1];
}
close POP;
# Get populations and sort them
my @upops = sort { $a cmp $b } uniq( values %pops );
print "found ", scalar @upops, " populations\n";
Apologies, as I am not sure how to make this a reproducible example, but I am hoping someone could at least help me understand what this part of the code is doing and whether there is a way to adapt it. Is the problem that my individual names include _ and -?
Thank you so much for your advice and help in advance :)

Firstly, thank you to @toolic for his help and guidance :)
While trying to create a reproducible example it started working, and I think the problem was how I made my populations file.
Previously I used: paste sample_names pops | column -s $'\t' -t > pop_file.txt
to produce the file printed in the question.
However, it works if I simply use: paste sample_names pops > pop_file.txt
I have also used the full path to the .vcf file instead of the path relative to the current directory.
I hope this helps anyone who comes across this issue in the future :)

Related

Inverting PDF colors (negative) using Perl PDF::API2

As the title says, I'm trying to write a script that inverts the colors of a bunch of PDFs, using Perl and PDF::API2.
I'm not very familiar with Perl; I've modified a brief script I found here on Stack Overflow, from this post, with the help of ChatGPT:
how to change all colours in a PDF to their respective complementary colours; how to make a PDF negative
The code I've come to is the following:
use strict;
use warnings;
use PDF::API2;
use PDF::API2::Basic::PDF::Utils;

my $dirname = '.';
my $filename;
opendir(DIR, $dirname) or die "Could not open $dirname\n";
mkdir("inverted") unless -d "inverted";
while ($filename = readdir(DIR)) {
    print("$filename\n");
    next unless $filename =~ /\.pdf$/; # Skip files that are not PDFs
    my $pdf = PDF::API2->open($filename);
    for my $n (1 .. $pdf->pages()) {
        my $p = $pdf->openpage($n);
        $p->{Group} = PDFDict();
        $p->{Group}->{CS} = PDFName('DeviceRGB');
        $p->{Group}->{S} = PDFName('Transparency');

        # Prepend: paint the page white underneath the existing content
        my $gfx = $p->gfx(1);
        $gfx->fillcolor('white');
        $gfx->rect($p->mediabox());
        $gfx->fill();

        # Append: draw white on top using the 'Difference' blend mode
        $gfx = $p->gfx();
        $gfx->egstate($pdf->egstate->blendmode('Difference'));
        $gfx->fillcolor('white');
        $gfx->rect($p->mediabox());
        $gfx->fill();
    }
    $pdf->saveas("inverted/$filename");
}
closedir(DIR);
It seems to work partially: some pages are correctly inverted, but sometimes the first half of a page is not inverted and remains white, like in this pic:
Page partially inverted
I'd like to fix this; I really need a simple script that performs this job. (I've also written a script that afterwards joins all the PDFs from multiple files into a single PDF.) If anyone has an idea on how to fix it I'll be grateful, and I could upload the result to GitHub if anyone needs it. This question has been asked before, but I haven't found a script in Python or any other language that performs this job well, except for a couple of scripts that rely on Docker and Node.js to install.
I've tried working with ChatGPT to fix the issue, but it has no idea how to do this (yes, I know, I shouldn't rely on it, but this is the first time I've used Perl).
I am debugging this and am confident it has to do with the rotation of the pages, but I do not understand the details of the problem yet. However, I have this workaround for the test file:
Rotate it 90 degrees with pdftk, then apply the Perl script, then rotate it back 90 degrees with pdftk:
$ pdftk test.pdf cat 1-endLeft output test2.pdf
$ # run perl script to invert the colors in test2.pdf
$ pdftk test2.pdf cat 1-endRight output test3.pdf
After this, test3.pdf seems to be correctly inverted. This workaround might also work for your other files.

About searching recursively in Perl

I have a Perl script that I, well, mostly pieced together from questions on this site. I've read the documentation on some parts to better understand it. Anyway, here it is:
#!/usr/bin/perl
use File::Find;
my $dir = '/home/jdoe';
my $string = "hard-coded pattern to match";
find(\&printFile, $dir);
sub printFile
{
    my $element = $_;
    if (-f $element && $element =~ /\.txt$/)
    {
        open my $in, "<", $element or die $!;
        while (<$in>)
        {
            if (/\Q$string\E/)
            {
                print "$File::Find::name\n";
                last; # stop looking after a match is found
            }
        }
    }
}
This is a simple script that, similar to grep, looks down recursively through directories for a matching string and then prints the location of each file that contains it. It works, but only if the file is located in my home directory. If I change the hard-coded search to look in a different directory (that I have permissions in), for example /admin/programs, the script no longer seems to do anything: no output is displayed, even when I know it should match at least one file (tested by making a file in /admin/programs with the hard-coded pattern). Why am I experiencing this behavior?
Also, might as well disclaim that this isn't a really useful script (heck, this would be so easy with grep or awk!), but understanding how to do this in Perl is important to me right now. Thanks
EDIT: Found the problem. A simple oversight in that the files in the directory I was looking for did not have .txt as extension. Thanks for helping me find that.
I was able to get the desired output from the code you posted by making a few changes, like:
use strict;
use warnings;
You should always use them, as they flag many errors in your code that you might otherwise miss.
Next I changed the line:
my $dir = './home/jdoe'; ##'./admin/programs'
The . signifies the current directory. If you still face problems, try using the absolute path (from the root) instead of a relative path. Do let me know if this solves your problem.
This script works fine without any issue. The one thing hidden from us is the pattern. Could you share the pattern and what you expect it to match, so that we can validate it?
You could also run your program in debug mode, i.e.:
perl -d your_program
That takes you into the debugger, which has a lot of options for inspecting the flow. Type 'n' at the debug prompt to step through the code and see how it flows; each 'n' prints the current execution point and its result.

Perl Skipping Some Files When Looping Across Files and Writing Their Contents to Output File

I'm having an issue with Perl and I'm hoping someone here can help me figure out what's going on. I have about 130,000 .txt files in a directory called RawData, and a Perl program that loads their names into an array and then loops through it, opening each .txt file. For simplicity, suppose I have four text files I'm looping through:
File1.txt
File2.txt
File3.txt
File4.txt
The contents of each .txt file look something like this:
007 C03XXYY ZZZZ
008 A01XXYY ZZZZ
009 A02XXYY ZZZZ
where X, Y, and Z are digits. In my simplified code below, the program pulls out just the 007 line in each .txt file, saves XX as ID, ignores YY, and grabs the variable data ZZZZ, which I've called VarVal. Then it writes everything to a file with a header, as specified in the code below:
#!/usr/bin/perl
use warnings;
use strict;

open(OUTFILE, "> ../Data/OutputFile.csv") or die $!;
opendir(MYDIR, "../RawData") || die $!;
my @txtfiles = grep { /\.txt$/ } readdir(MYDIR);
closedir(MYDIR);

print OUTFILE "ID,VarName,VarVal\n";

foreach my $txtfile (@txtfiles) {
    # Print to the screen so I can see where I am in the loop.
    print "$txtfile\n";
    open(INFILE, "< ../RawData/$txtfile") or die $!;
    while (<INFILE>) {
        if (m{^007 C03(\d{2})(\d+)(\s+)(.+)}) {
            print OUTFILE "$1,VarName,$4\n";
        }
    }
}
The issue I'm having is that the contents of, for example, File3.txt don't show up in OutputFile.csv. However, it's not an issue with Perl not finding a match, because I checked that the if statement is being executed by deleting OUTFILE and looking at what the code prints to the terminal. What shows up is exactly what should be there.
Furthermore, if I run the problematic file (File3.txt) through the loop by itself, commenting out the opendir and closedir stuff and doing something like my @txtfiles = ("File3.txt");, then the only data that shows up in OutputFile.csv IS what's in File3.txt. But when it goes through the full loop, it won't show up in OutputFile.csv. Plus, I know that File3.txt is being sent into the loop because I can see it printed on the screen by the print statement at the top of the loop. I'm at a loss as to what is going on here.
The other issue is that I don't think it's something specific to this one particular file (maybe it is), but I can't troubleshoot just this one file because I have 130,000 of them, and I only stumbled across the fact that this one wasn't being written to the output file. There may be other files that also aren't getting written, even though there is no obvious reason they shouldn't be, just like File3.txt.
Perhaps looping over 130,000 files in rapid succession causes some sort of I/O issue that randomly fails every so often to write the contents in memory to the output file? That's my best guess, but I have no idea how to diagnose or fix this.
This is kind of a difficult question to debug, but I'm hoping someone on here has some insight or has seen similar problems that would provide me with a solution.
Thanks
There's nothing obviously wrong that I can see in your code. It is a little outdated, as using autodie and lexical filehandles would be better.
However, I would recommend that you make your regex slightly less restrictive: make the spacing variable-length after the first value, and allow the last capture to be empty. I'd also output the filename, so you can see which other files aren't being caught for whatever reason:
if (m{^007\s+C03(\d{2})\d+\s+(.*)}) {
    print OUTFILE "$txtfile $1,VarName,$2\n";
    last;
}
Finally, assuming there is only a single 007 C03 in each file, you could throw in a last call after one is found.
You may want to try sorting the @txtfiles list, then systematically looking through the output to see what is or isn't there. With 130k files in random order, it would be pretty difficult to be certain you missed one. Perl gives you the files in the order they appear in the directory, which is different from user-level commands like ls, so it may not be the order you expect.

Use awk in Perl to parse everything between two strings

I have a huge pile of log files constantly being updated on an HP-UX server. I have written Perl code to find the name of the log file that contains the string I'm searching for.
Perl gets the file name using split and stores it in a variable. From the user input I create the start and stop strings as two variables, like so:
my $ssh = Net::OpenSSH->new($host, user => $user,
master_opts => [ -o => 'NumberOfPasswordPrompts=1',
-o => 'PreferredAuthentications=keyboard-interactive,password'],
login_handler => \&login_handler);
$ssh->error and die "Unable to connect: " . $ssh->error;
my $output=$ssh->capture("grep .$userinput1. /app/bea/user_projects/domains/granite/om_ni.log*");
my $array = (split ":", $output)[0];
print "$array"."\n";
[EDIT]: As you guys requested, above is how $array gets filled in. Below is where the awk sequence starts:
my $a = "INFO - $userinput1"; print $a;
my $b = "INFO - ProcessNode terminated... [$userinput1]"; print $b;
Using awk as part of the ssh capture command, it should search through the whole log file, capture every line between string $a and string $b, and put everything into another array, like so:
my $output2=$ssh->capture("awk -v i=$array '$a,$b' i");
Here $array holds the log file's full path, and it works completely fine as a passing variable.
I tried using awk without the -v parameter as well; it didn't matter at all.
[EDIT 2]: This is the result of the print "$array"."\n"; statement:
/app/bea/user_projects/domains/granite/om_ni.log.2
When I run the Perl script, I get this result:
INFO - 28B26AD1-E959-4F5F-BD89-A7A6E601BE18INFO - ProcessNode terminated... [28B26AD1-E959-4F5F-BD89-A7A6E601BE18] syntax error The source line is 1.
The error context is
INFO - 28B26AD1-E959-4F5F-BD89-A7A6E601BE18,INFO - ProcessNode >>> terminated. <<< .. [28B26AD1-E959-4F5F-BD89-A7A6E601BE18]
awk: Quitting
The source line is 1.
The error somehow points at the word "terminated", but even when I use escape characters all over the strings, it just doesn't care and returns the same error.
Any help on this issue is highly appreciated. Thanks a lot in advance.
While I don't really know awk, the way you are invoking it does not appear to be correct. Here is the manual for awk on HP-UX.
The part in single quotes ($a,$b) should be the program. However, you are passing it two text strings, which are not even quoted to separate them. This is not a valid awk program; hence the syntax error.
I think what you want is something like '/$a/, /$b/' for the program (but again, I am not an awk expert).
Also, you are setting the filename to variable i, then using i in the place of the filename when you invoke the command. I don't know why you are doing this, and I don't think it will even work to use a variable in the filename. Just use $array (which you should rename to something like $file for clarity) in the filename position.
So your whole command should look something like:
"awk '/$a/,/$b/' $file"
In this single command, you are dealing with three different tools: Perl, SSH, and awk. This is very hard to debug, because if there is a problem, it can be hard to tell where the problem is. It is essential that you break down the task into smaller parts in order to get something like this working.
In this case, that means that you should manually SSH into the server and play around with awk until you get the command right. Only when you are sure that you have the awk command right should you try incorporating it into Perl. It will be much easier if you break down the task in that way.
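To illustrate the range-pattern form suggested above, here is a small local awk run (with made-up log lines) showing how /start/,/stop/ prints everything between and including the two matching lines:

```shell
# awk's /start/,/stop/ range pattern prints from the first line matching
# the start regex through the next line matching the stop regex.
printf 'noise\nINFO - START\npayload 1\npayload 2\nINFO - STOP\nmore noise\n' \
  | awk '/INFO - START/,/INFO - STOP/'
# INFO - START
# payload 1
# payload 2
# INFO - STOP
```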

How can I download via FTP all files with the current date in their name?

I have files whose names look like "IDY03101.200901110500.axf". There are about 25 similar files in an FTP repository, and I want to download only those files whose names contain the current date. I have used the following code, which is not working; I believe the regular expression is incorrect.
my @ymb_date = "IDY.*\.$year$mon$mday????\.axf";

foreach my $file (@ymb_date)
{
    print STDOUT "Getting file: $file\n";
    $ftp->get($file) or warn $ftp->message;
}
Any help appreciated.
EDIT:
I need all of the current day's files.
Using ls -lt | grep "Jan 13" works on the UNIX box but not in the script.
What would be a valid regex in this scenario?
It doesn't look like you're using any regular expression. You're trying to use the literal pattern as the filename to download.
Perhaps you want to use the ls method of Net::FTP to get the list of files then filter them.
foreach my $file ( $ftp->ls ) {
    next unless $file =~ m/$regex/;
    $ftp->get($file);
}
You might also like the answers that talk about implementing mget for Net::FTP at Perlmonks.
Also, I think you want the regex to find four digits after the date. In Perl, you write that as \d{4}. The ? is a quantifier in Perl, so four of them in a row won't work:
IDY.*\.$year$mon$mday\d{4}\.axf
I do not think regexes work like that, though it has been a while since I did Perl, so I could be way off. :)
Does your $ftp object have access to an mget() method? If so, maybe try this?
$ftp->mget("IDY*.$year$mon$mday*.axf") or warn $ftp->message;