I have a multi-sample VCF file and I want to get a table with sample IDs in the left column, followed by the variants in which each sample has an alternate allele. It should look like this:
ID1 chr2:87432:A:T_0/1 chr10:43234:C:G_1/1
ID2 chr2:87432:A:T_1/1
ID3 chr11:432434:T:G chr14:34234234:C:G chr20:34324234:T:C
I will then read this table into R.
I have tried combinations of:
bcftools query -f '[%SAMPLE\t] %CHROM:%POS:%REF:%ALT[%GT]\n'
but I keep getting sample IDs overlapping on the same line and I can't quite figure out the syntax.
Your help would be much appreciated.
You cannot achieve what you want with a single BCFtools command, because BCFtools parses one VCF variant at a time. However, you can use a command like this to extract what you want:
bcftools +split -i 'GT="0/1" | GT="1/1"' -Ob -o DIR input.vcf
This will create one small .bcf file for each sample, and you can then run one instance of bcftools query per sample to get what you want, as in the sketch below.
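A minimal sketch of that second step (assuming the split files are written as DIR/<SAMPLE>.bcf, which is an assumption about the plugin's naming, and using the chr:pos:ref:alt_GT format from your example):

for f in DIR/*.bcf; do
    sample=$(basename "$f" .bcf)
    printf '%s' "$sample"
    # one tab-separated variant_GT entry per record, all on the sample's line
    bcftools query -f '\t%CHROM:%POS:%REF:%ALT_[%GT]' "$f"
    printf '\n'
done > variants_by_sample.txt

In R, read.table with fill = TRUE can then cope with the ragged rows.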
I have multiple problems importing a CSV file that has a header line with mongoimport.
This is the situation:
I have a big CSV file with the names of the fields in the first line.
I know you can tell mongoimport to use this line for the field names with --headerline.
I want all field types to be strings, but mongoimport sets the types automatically based on what the values look like.
IDs such as 0001 will be turned into 1, which can have bad side effects.
Unfortunately, there is (as far as I know) no way of setting them all to string with a single option, only by naming each field and setting its type with
--columnsHaveTypes --fields "name.string(), ... "
When I did that, the next problem appeared.
The headerline (with all field names) got imported as values in a separate document.
So basically, my questions are:
Is there a way of setting all field types to string using the --headerline option?
Alternatively, is there a way to ignore the first line?
I had this problem when uploading a 41 million record CSV file into MongoDB.
./mongoimport -d testdb -c testcollection --type csv --columnsHaveTypes \
    -f "RECEIVEDDATE.date(2006-01-02 15:04:05)" --file location/test.csv
As above, we can specify field data types with the '-f' or '--fields' option, but when we use it on a file that contains a header line, mongoimport uploads the first row (the header row) as well, which either leads to a 'cannot convert to datatype' error or imports the column names as a data record.
Unfortunately, we cannot use the '--headerline' option instead of '--fields'.
Here are the solutions I found for this problem.
1) Remove the header line and upload using the '--fields' option as in the command above. If you are in a Linux environment, you can use the command below to remove the first row (the header line) of a huge file; it took 2-3 minutes for me, depending on machine performance.
sed -i -e "1d" location/test.csv
2) Upload the file using the '--headerline' option; mongoimport then imports the file with the data types it identifies by default. Then open the mongodb shell, switch to the database with 'use testdb', and run a JavaScript command that fetches each record and converts it to the specific data types. But if you have a huge file, this will take time.
I found this solution on Stack Overflow:
db.testcollection.find().forEach(function (x) {
    x.RECEIVEDDATE = new Date(x.RECEIVEDDATE);
    db.testcollection.save(x);
});
If you want to skip the rows that do not fit the data type, the mongoimport documentation describes the option below:
--parseGrace skipRow
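For example, a sketch combining it with the command above (an assumption, not tested: since the header values should fail the date conversion, that row would be skipped rather than causing an error):

./mongoimport -d testdb -c testcollection --type csv --columnsHaveTypes \
    -f "RECEIVEDDATE.date(2006-01-02 15:04:05)" --parseGrace skipRow \
    --file location/test.csv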
I found a solution that I am comfortable with.
Basically, I wanted to use mongoimport within my Clojure code to import a CSV file into the DB and do a lot of stuff with it automatically. Due to the above-mentioned problems I had to find a workaround to delete this wrong document.
I did the following to "solve" this problem:
To set the types as I wanted, I wrote a function to read the first line, put it in a vector, and then used string concatenation to set these as typed fields.
Turning this: id,name,age,hometown,street
into this: id.string(),name.string(),age.string() etc
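In shell, that transformation is a one-liner; a sketch, with input.csv as a hypothetical filename:

head -n1 input.csv | sed 's/,/.string(),/g; s/$/.string()/'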
Then I used the values from the vector to identify the document with
{ name : "name"
age : "age"
etc : "etc" }
and then deleted it with a simple remove() command.
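For illustration, that final deletion could be run from the shell like this (a hedged sketch; the database and collection names are assumed, and the match document mirrors the example above):

mongo testdb --eval 'db.testcollection.remove({ name: "name", age: "age" })'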
Hope this helps anyone dealing with the same kind of problem.
https://docs.mongodb.com/manual/reference/program/mongoimport/#example-csv-import-types reads:
MongoDB 3.4 added support for specifying field types. Specify field names and types in the form <field>.<type>() using --fields, --fieldFile, or --headerline.
So the first line of your CSV file should have the names with types, e.g.:
name.string(), ...
and the mongoimport parameters
--columnsHaveTypes --headerline --file <filename.csv>
As to the question of how to remove the first line, you can employ pipes: mongoimport reads from STDIN if no --file option is passed. E.g.:
tail -n+2 <filename.csv> | mongoimport --type csv --columnsHaveTypes --fields "name.string(), ... "
I have a CSV file. I'd like to filter it and keep only the columns whose headers begin with 'hit'. How can I do that?
Small example input:
hit1,miss1,hit2,miss2
a,0,d,0
b,0,e,0
c,0,f,0
Desired output:
hit1,hit2
a,d
b,e
c,f
I think I want the exclude command, but I can't figure out the syntax.
The order command will let you specify an inclusive list of column names to be included in the output:
csvfix order -fn hit1,hit2 data.csv
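If you would rather select the columns by their header prefix than list them by name, here is an awk sketch (an alternative to csvfix, assuming plain CSV fields with no quoted commas):

awk -F, 'NR==1 { for (i=1; i<=NF; i++) if ($i ~ /^hit/) keep[++n] = i }
         { line = ""; for (j=1; j<=n; j++) line = line (j>1 ? "," : "") $(keep[j]); print line }' data.csv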
(I realize I'm late to the party, but maybe this will be helpful to the next person.)
For example, say I've got the output of:
SELECT
$text$col1, col2, col3
0,my-value,text
7,value2,string
0,also a value,fort
$text$;
Would it be possible to populate a table directly from it with the COPY command?
Sort of. You would have to strip the first two and last lines of your example in order to use the data with COPY. You could do this by using the PROGRAM keyword:
COPY table_name FROM PROGRAM 'sed -e ''1,2d;$d'' inputfile' WITH (FORMAT csv);
This is direct in that you are doing everything from the COPY command, and indirect in that you are setting up an outside program to filter your input.
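Outside of SQL, the inner command is ordinary sed; the doubled single quotes above are just SQL string escaping. Standalone, it would be (assuming the SELECT output was saved as inputfile):

sed -e '1,2d;$d' inputfile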
I am currently helping a friend reorganise several hundred images on a database-driven website. I have generated a list of the new, reorganised image paths offline and would like to replace each matching image reference in the SQL export of the database with the new path.
EDIT: Here is an example of what I am trying to achieve
The new_paths_list.txt is a file that I generated using a batch script after I had organised all of the existing images into folders. Prior to this all of the images were in just a few folders. A sample of this generated list might be:
image/data/product_photos/telephones/snom/snom_xyz.jpg
image/data/product_photos/telephones/gigaset/giga_xyz.jpg
A sample of my_exported_db.sql (the database exported from the website) might be:
...
,(110,32,'data/phones/snom_xyz.jpg',3),(213,50,'data/telephones/giga_xyz.jpg',0),
...
The result I want is my_exported_db.sql to be:
...
,(110,32,'data/product_photos/telephones/snom/snom_xyz.jpg',3),(213,50,'data/product_photos/telephones/gigaset/giga_xyz.jpg',0),
...
Some pseudo code to illustrate:
1/ Find the first image name in my_exported_db.sql, such as 'snom_xyz.jpg'.
2/ Find the same image name in new_paths_list.txt
3/ If it is present, copy the whole line (the path and filename)
4/ Replace the whole path of this image in my_exported_db.sql with the copied line
5/ Repeat for all other image names in my_exported_db.sql
A regex that appears to match image names is:
([^)''"/])+\.(?:jpg|jpeg|gif|png)
and one that matches image names complete with their paths (relative or absolute) is:
\bdata[^)''"\s]+\.(?:jpg|jpeg|gif|png)
I have looked around and have seen that Sed or Awk may be capable of doing this, but some pointers would be greatly appreciated. I understand that this will only work accurately if there are no duplicated filenames.
You can use sed to convert new_paths_list.txt into a set of sed replacement commands:
sed 's|^image/\(.*\(/[^/]*$\)\)|s#data\2#\1#|' new_paths_list.txt > rules.sed
The file rules.sed will look like this:
s#data/snom_xyz.jpg#data/product_photos/telephones/snom/snom_xyz.jpg#
s#data/giga_xyz.jpg#data/product_photos/telephones/gigaset/giga_xyz.jpg#
Then use sed again to translate my_exported_db.sql:
sed -i -f rules.sed my_exported_db.sql
I think in some shells it's possible to combine these steps and do without rules.sed:
sed 's|^image/\(.*\(/[^/]*$\)\)|s#data\2#\1#|' new_paths_list.txt | sed -i -f - my_exported_db.sql
but I'm not certain about that.
EDIT:
If the images are in several directories under data/, make this change:
sed "s|image/\(.*\(/[^/]*$\)\)|s#[^']*\2#\1#|" new_paths_list.txt > rules.sed
I have a CSV file containing some user data; it looks like this:
"10333","","an.10","Kenyata","","Aaron","","","","","","","","","",""
"12222","","an.4","Wendy","","Aaron","","","","","","","","","",""
"14343","","aaron.5","Nanci","","Aaron","","","","","","","","","",""
I also have a file which has an item on each line like this:
an.10
aaron.5
What I want is to find only the lines in the CSV file contained in the list file.
So desired output would be:
"10333","","an.10","Kenyata","","Aaron","","","","","","","","","",""
"14343","","aaron.5","Nanci","","Aaron","","","","","","","","","",""
(Note how an.4 is not contained in this new list.)
I have just about any environment available to me and am willing to try almost anything, aside from doing it manually, as this CSV contains millions of records and there are about 100k entries in the list itself.
How unique are the identifiers an.10 and the like?
Maybe a very small *nix shell script would be enough:
for i in $(uniq list.txt); do grep "\"$i\"" data.csv; done
That would, for every unique entry in the list, return all matching lines in the CSV file. It does not match exclusively on the identifier column, however. (That could be done with awk, for example, as in the sketch below.)
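A sketch of that awk variant (assuming the identifier is always the third comma-separated field, as in the sample, that no earlier field contains an embedded comma, and that list.txt has plain Unix line endings):

awk -F, 'NR==FNR { want["\"" $0 "\""] = 1; next } $3 in want' list.txt data.csv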
If the csv file is data.csv and the list file is list.txt, I would do this:
for i in $(cat list.txt); do grep "$i" data.csv; done
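With about 100k entries in the list, running one grep per entry will be slow; a single pass with a fixed-string pattern file should be much faster (like the loop, this matches anywhere in the line, not just the identifier column):

grep -F -f list.txt data.csv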