How to recode the missing genotype code '-' in a PLINK ped file (imputation)

I'm trying to impute genotype data from the public reference panels, but my files fail the file sanity check on the Sanger Imputation Server with the following error:
failed sanity check :
of Non-ACGTN alternate allele at 1:4635556 .. REF_SEQ:'(null)' vs VCF:'-'
I tried fixing this in PLINK with the following command:
./plink --bfile chr1 --recode vcf --out chr1_vcf --missing-genotype -
but then it gives the error Underscore(s) present in sample IDs.
--recode vcf to chr1_vcf.vcf ... done.
and I still see '_' in the newly recoded file.
I would appreciate any help, suggestions and comments.
Thanks
Jasdeep

You will have to replace _ with a different character in your PLINK files before running your command.
From the PLINK manual:
When using --recode vcf, sample IDs are formed by merging the FID and IID and placing an underscore between them. When the FID or IID already contains an underscore, this may make it difficult to reconstruct them from the VCF file; you may want to replace underscores with a different character in PLINK files (Unix tr is handy here).
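A minimal Python sketch of the replacement the manual describes, as an alternative to tr. It assumes the sample IDs live in the first two whitespace-separated columns (FID and IID) of the .fam file, and it only addresses the underscore message quoted above, not the '-' allele in the sanity check. The function name, the "." replacement character and the output file name chr1_clean.fam are illustrative choices, not anything PLINK prescribes.

```python
def replace_underscores(fam_lines, replacement="."):
    """Return .fam lines with '_' replaced in the FID and IID columns only."""
    cleaned = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        fields[0] = fields[0].replace("_", replacement)  # FID
        fields[1] = fields[1].replace("_", replacement)  # IID
        cleaned.append(" ".join(fields))
    return cleaned


if __name__ == "__main__":
    # chr1.fam matches the --bfile prefix in the question;
    # chr1_clean.fam is a hypothetical output name.
    with open("chr1.fam") as src:
        rows = replace_underscores(src)
    with open("chr1_clean.fam", "w") as dst:
        dst.write("\n".join(rows) + "\n")
```

The other columns are left untouched, so only the IDs that PLINK joins with an underscore are changed.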


Extracting symbol names from nm output

I'd like to use the symbol names from nm -P -g to generate a .c file. However, I'm not sure how to extract those symbol names.
Reading https://pubs.opengroup.org/onlinepubs/9699919799/utilities/nm.html says:
The format given in nm STDOUT uses <space> characters between the fields, which may be any number of <blank> characters required to align the columns.
I'm not sure how to interpret this - should my regex be ^[^ ]+_mkdocs[ ] [note: workaround for stackoverflow's wonky code formatting] or something else? I want the result to be whatever symbol name I extracted concatenated with (&doc);
e.g.
foo_mkdocs T 0 0
should become
foo_mkdocs(&doc);
but I'm unsure if I'm understanding nm's output format specification correctly.
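One way to read the quoted sentence is that the fields are separated by one or more blanks, so splitting on runs of whitespace and taking the first field is safer than matching a single space. A small Python sketch under that reading follows. The function name mkdocs_calls is made up for illustration, and the sketch assumes nm was run without -A, so the symbol name is the first field on each line.

```python
def mkdocs_calls(nm_output, suffix="_mkdocs"):
    """Turn `nm -P -g` output into C call statements for matching symbols."""
    calls = []
    for line in nm_output.splitlines():
        fields = line.split()  # splits on any run of blanks
        if not fields:
            continue  # ignore empty lines
        name = fields[0]  # the symbol name is the first field
        if name.endswith(suffix):
            calls.append(f"{name}(&doc);")
    return calls


sample = "foo_mkdocs T 0 0\nbar          T 10 4\n"
for call in mkdocs_calls(sample):
    print(call)
```

In regex terms this corresponds to something like ^([^[:blank:]]+_mkdocs)[[:blank:]]+ rather than a pattern that expects exactly one space.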

Passing either a file or its content as input to a Perl script

I wrote a Perl script to convert from TEX format to JSON format.
Calling in the batch file:
perl -w C:\test\support.pl TestingSample.tex
This is working fine now.
The Perl script has to accept two types of input from another program (which may be on any platform or technology): either a file (*TEX) or the content of a TEX file.
How can I receive the full content as the input to the Perl script?
Now my Perl script is:
my $texfile = $ARGV[0]; my $texcnt = "";
readFileinString($texfile, \$texcnt);
I am trying to change the call to:
perl -w C:/test/support.pl --input "$texcnt" #Content is Input
I am receiving error message:
The command line is too long.
Could someone please advise?
First of all, regarding the error you're getting:
Perl (or your shell) is complaining that your input argument is too long.
Passing entire files as arguments to scripts is generally a bad idea anyway. For example, quotation marks might not be escaped properly, which can leave a wide-open vulnerability to your entire system!
So the solution is to modify your script so that it takes the file as an argument (if that isn't already the case). If you really need to hand over an entire file's content, I'd advise you to create a temporary file in /tmp/ (on Linux) or in your %TEMP% directory (on Windows), write the content into that file, and then give your support.pl script the new temp file as its argument.
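The temporary-file approach can be sketched from the calling program's side. Python is used here only as a stand-in for whatever program produces the TEX content. The function names are hypothetical, and the script path is the one from the question.

```python
import os
import subprocess
import tempfile


def write_temp_tex(content, suffix=".tex"):
    """Write the TEX content to a temp file (under /tmp or %TEMP%) and return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(content)
    return path


def convert_tex_content(content, script=r"C:\test\support.pl"):
    """Hypothetical wrapper: hand the content to support.pl as a file argument."""
    path = write_temp_tex(content)
    try:
        # The script only ever sees a short file name on its command line.
        subprocess.run(["perl", "-w", script, path], check=True)
    finally:
        os.remove(path)  # clean up the temp file afterwards
```

Because the content never appears on the command line, neither its length nor its quotation marks matter to the shell.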

Perl Code : Output not displayed properly

I have a perl code where I access multiple txt files and produce output for them.
When I run the code, the output lines on the console are overwritten.
2015-04-21:12-04-54|getFilesInInputDir| ********** name : PEPORT **********
PEPORT4-21:12-04-54|readNFormOutputFile| name :
PEPORT" is : -04-54|readNFormOutputFile| Frequency for name "
Please note that the second and third lines should have looked like this:
2015-04-21:12-04-54|readNFormOutputFile| name : PEPORT
2015-04-21:12-04-54|readNFormOutputFile| Frequency for name "PEPORT"
Also, after this the code stops processing my files. The code seems fine. What could be the cause of this?
Thanks.
Seems like a CR/LF versus LF issue. Convert your input from MSWin to Linux by running dos2unix or fromdos, or remove the "\r" characters from within the Perl code.
As choroba says, I guess you are reading a file on Linux that was generated on Windows. The easiest fix is to replace chomp with s/\s+\z// or s/\p{cntrl}+\z//
Or, if trailing spaces are significant, you can use s/[\r\n]+\z// or, if you are running Perl 5.10 or later, s/\R\z//
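To illustrate why the stray carriage return garbles the console output and what the [\r\n] variant does, here is a small sketch in Python rather than Perl. The function name is made up, and the sample line is modelled on the log output in the question.

```python
def strip_line_ending(line):
    """Remove trailing CR and LF characters, keeping other trailing whitespace."""
    return line.rstrip("\r\n")


raw = "PEPORT\r\n"  # a line as stored in a Windows-generated file

# Removing only "\n" (what chomp does on Linux) leaves the "\r" behind,
# so a terminal moves the cursor back to column 0 and overwrites the line.
chomped = raw.rstrip("\n")

name = strip_line_ending(raw)
message = f'Frequency for name "{name}"'
print(message)
```

Here chomped still ends in "\r", which is the character that caused the overwritten lines shown in the question.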

Using a .fasta file to compute relative content of sequences

So me being the 'noob' that I am, being introduced to programming via Perl just recently, I'm still getting used to all of this. I have a .fasta file which I have to use, although I'm unsure if I'm able to open it, or if I have to work with it 'blindly', so to speak.
Anyway, the file that I have contains DNA sequences for three genes, written in this .fasta format.
Apparently it's something like this:
>label
sequence
>label
sequence
>label
sequence
My goal is to write a script to open and read the file, which I have gotten the hang of now. I have to read each sequence, compute the relative amounts of 'G' and 'C' within each sequence, and then write the names of the genes and their respective 'G' and 'C' content to a TAB-delimited file.
Would anyone be able to provide some guidance? I'm unsure what a TAB-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see the content. So far I've worked with .txt files which I can easily open, but not .fasta.
I apologise for sounding completely bewildered. I'd appreciate your patience. I'm not like you pros out there!!
I get that it's confusing, but you really should try to limit your question to one concrete problem, see https://stackoverflow.com/faq#questions
I have no idea what a ".fasta" file or 'G' and 'C' is.. but it probably doesn't matter.
Generally:
Open input file
Read and parse data. If it's in some strange format that you can't parse, go hunting on http://metacpan.org for a module to read it. If you're lucky someone has already done the hard part for you.
Compute whatever you're trying to compute
Print to screen (standard out) or another file.
A "TAB-delimited" file is a file with columns (think Excel) where each column is separated by the tab ("\t") character, as a quick Google or Stack Overflow search would tell you.
Here is an approach using the awk utility, which can be used from the command line. Save the following program to a file and run it with awk -f <path> <sequence file>
#!/^>/ means only look at lines that do not start with ">", so that the label
#lines are not counted as sequence
!/^>/{
    #this for-loop goes through all bases in the line and then performs operations below:
    for (i=1;i<=length($0);i++) {
        #for each position encountered, the variable "total" is increased by 1 for total bases
        total++
        #lowercase the base first (some bases are lowercase in some fasta files)
        base=tolower(substr($0,i,1))
        #this increments the c count by one for every c or C encountered, the else branch
        #does the same thing for g and G:
        if (base=="c")
            c++
        else if (base=="g")
            g++
    }
}
END{
    #this "END-block" prints the C, G content in percentage, separated by tabs;
    #note that the counts are totals over all sequences in the file, not per gene
    if (total>0)
        print "Gene name\tG content:\t"(100*g/total)"%\tC content:\t"(100*c/total)"%"
}
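Since the question asks for the G and C content of each gene separately, here is a hedged Python sketch of the four general steps listed earlier (open, parse, compute, print). It assumes the simple layout shown in the question, where each ">label" line is followed by one or more sequence lines. The function names and the column headers are illustrative.

```python
def parse_fasta(lines):
    """Return a list of (label, sequence) pairs from FASTA-formatted lines."""
    records = []
    label, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                records.append((label, "".join(parts)))
            label, parts = line[1:], []  # start a new record
        elif label is not None:
            parts.append(line)  # a sequence may span several lines
    if label is not None:
        records.append((label, "".join(parts)))
    return records


def base_percentages(sequence):
    """Return (G percent, C percent) of a sequence, ignoring case."""
    seq = sequence.upper()
    if not seq:
        return 0.0, 0.0
    return 100 * seq.count("G") / len(seq), 100 * seq.count("C") / len(seq)


def to_tsv(records):
    """Format the per-gene results as TAB-delimited text with a header row."""
    rows = ["gene\tG_percent\tC_percent"]
    for label, seq in records:
        g, c = base_percentages(seq)
        rows.append(f"{label}\t{g:.2f}\t{c:.2f}")
    return "\n".join(rows) + "\n"
```

A .fasta file is plain text, so it can be opened the same way as a .txt file, for example with open("genes.fasta") as handle: print(to_tsv(parse_fasta(handle)), end=""), where genes.fasta is a placeholder name.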

MATLAB: How do you insert a line of text at the beginning of a file?

I have a file full of ascii data. How would I append a string to the first line of the file? I cannot find that sort of functionality using fopen (it seems to only append at the end and nothing else.)
The following is a pure MATLAB solution:
% write first line
dlmwrite('output.txt', 'string 1st line', 'delimiter', '')
% append rest of file
dlmwrite('output.txt', fileread('input.txt'), '-append', 'delimiter', '')
% overwrite on original file
movefile('output.txt', 'input.txt')
Option 1:
I would suggest calling some system commands from within MATLAB. One possibility on Windows is to write your new line of text to its own file and then use the DOS for command to concatenate the two files. Here's what the call would look like in MATLAB:
!for %f in ("file1.txt", "file2.txt") do type "%f" >> "new.txt"
I used the ! (bang) operator to invoke the command from within MATLAB. The command above sequentially pipes the contents of "file1.txt" and "file2.txt" to the file "new.txt". Keep in mind that you will probably have to end the first file with a new line character to get things to append correctly.
Another alternative to the above command would be:
!for %f in ("file2.txt") do type "%f" >> "file1.txt"
which appends the contents of "file2.txt" to "file1.txt", resulting in "file1.txt" containing the concatenated text instead of creating a new file.
If you have your file names in strings, you can create the command as a string and use the SYSTEM command instead of the ! operator. For example:
a = 'file1.txt';
b = 'file2.txt';
system(['for %f in ("' b '") do type "%f" >> "' a '"']);
Option 2:
One MATLAB only solution, in addition to Amro's, is:
dlmwrite('file.txt',['first line' 13 10 fileread('file.txt')],'delimiter','');
This uses FILEREAD to read the text file contents into a string, concatenates the new line you want to add (along with the ASCII codes for a carriage return and a line feed/new line), then overwrites the original file using DLMWRITE.
I get the feeling Option #1 might perform faster than this pure MATLAB solution for huge text files, but I don't know that for sure. ;)
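For comparison, the same "write the new first line, then the old content, then replace the original" idea can be sketched outside MATLAB. The Python version below streams the old file instead of loading it into a string, which may matter for huge files. The function name is made up, and the temp file is created next to the original so that the final rename stays on one filesystem.

```python
import os
import shutil
import tempfile


def prepend_line(path, first_line):
    """Insert first_line at the top of the file by rewriting it via a temp file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as out, open(path, "rb") as src:
        out.write(first_line.encode("utf-8") + b"\n")  # the new first line
        shutil.copyfileobj(src, out)  # then the original content, unchanged
    os.replace(tmp_path, path)  # swap the new file in place of the old one
```

Binary mode is used so that the existing line endings in the data are copied through untouched.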
How about using the frewind(fid) function to take the pointer to the beginning of the file?
I had a similar requirement and tried frewind() followed by the necessary fprintf() statement.
But, warning: it will overwrite whatever is on the first line. Since in my case I was the one writing the file, I put dummy data at the start of the file and then, at the end, let it be overwritten by the operations described above.
BTW, I am still facing one problem with this solution: depending on the length (size) of the dummy data and the actual data, the program either leaves part of the dummy data on the same line or pushes my new data onto the second line.
Any tips in this regard are highly appreciated.