Combining two VCF files with differing sampleIds and locations - pyspark

Good day,
How do I combine multiple variant call files (VCF) with differing subjects?
I have multiple VCF datasets with differing sampleIds and locations:
file1:
contigName |start | end | names | referenceAllele | alternateAlleles| qual| filters| splitFromMultiAllelic| genotypes
1 |792460|792461|["bla"]|G |["A"] |null|["PASS"] |false | [{"sampleId": "abba", "phased": false, "calls": [0, 0]}]
1 |792461|792462|["blaA"]|G |["A"] |null|["PASS"] |false | [{"sampleId": "abba", "phased": false, "calls": [0, 0]}]
file2:
contigName |start | end | names | referenceAllele | alternateAlleles| qual| filters| splitFromMultiAllelic| genotypes
1 |792460|792461|["bla"]|G |["A"] |null|["PASS"] |false | [{"sampleId": "baab", "phased": false, "calls": [0, 0]}]
1 |792464|792465|["blaB"]|G |["A"] |null|["PASS"] |false | [{"sampleId": "baab", "phased": false, "calls": [0, 0]}]
I need to combine these into a single VCF file. I'm required to work in a Databricks (PySpark/Scala) environment due to data security.
The Glow documentation had an idea, which I copied:
import pyspark.sql.functions as F

spark.read.format("vcf")\
    .option("flattenInfoFields", True)\
    .load(file_list)\
    .groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles', 'qual', 'filters', 'splitFromMultiAllelic')\
    .agg(F.sort_array(F.flatten(F.collect_list('genotypes'))).alias('genotypes'))\
    .write.mode("overwrite").format("vcf").save(my_output_destination)
This only works when the sampleIds are the same in both files; otherwise the write fails with:
Task failed while writing rows
Cannot infer sample ids because they are not the same in every row.
I'm considering creating a dummy table with NULL calls for all the IDs, but that seems silly (not to mention a huge resource sink).
Is there a simple way to combine VCF files with differing sampleIds, or to autofill the missing values with NULL calls?
Edit: I managed to do this with the bigvcf format. However, it autofills -1,-1 calls. I'd like to set the autofilled values manually to something that makes it clearer they are not 'real':
write.mode("overwrite").format("bigvcf").save(

The code above works only if you have identical sets of variants in both tables. I would not recommend using it to combine two distinct datasets, as this would introduce batch effects.
The best practice for combining two datasets is to reprocess them from the BAM files to gVCF using the same pipeline, then run joint genotyping to merge the samples (instead of a custom Spark SQL aggregation).
Databricks does provide a GATK4 best-practices pipeline that includes joint genotyping, or you can use DeepVariant to call variants.
If it is not possible to reprocess the data, then the two datasets should be treated separately in a meta-analysis, as opposed to merging the VCFs and performing a mega-analysis.
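That said, if you still want to pad the missing samples explicitly (as asked in the question's edit) rather than rely on bigvcf's -1,-1 fill, something along these lines may work. This is an untested sketch: pad_genotypes is a made-up helper, it assumes the genotypes struct only carries the sampleId, phased and calls fields shown in the question (the real Glow schema has more genotype fields, which this UDF would drop), and it reuses file_list and my_output_destination from the question.
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = (spark.read.format("vcf")
      .option("flattenInfoFields", True)
      .load(file_list))

# Union of every sampleId seen in any of the input files.
all_samples = [r.sampleId for r in
               df.select(F.explode("genotypes.sampleId").alias("sampleId")).distinct().collect()]

genotypes_type = T.ArrayType(T.StructType([
    T.StructField("sampleId", T.StringType()),
    T.StructField("phased", T.BooleanType()),
    T.StructField("calls", T.ArrayType(T.IntegerType())),
]))

@F.udf(genotypes_type)
def pad_genotypes(genotypes):
    # Keep the genotypes that are present and add a placeholder entry
    # (here calls = [-1, -1], i.e. ./.) for every sample that is missing.
    present = {g["sampleId"] for g in genotypes}
    padded = [{"sampleId": g["sampleId"], "phased": g["phased"], "calls": list(g["calls"])}
              for g in genotypes]
    padded += [{"sampleId": s, "phased": False, "calls": [-1, -1]}
               for s in all_samples if s not in present]
    return sorted(padded, key=lambda g: g["sampleId"])

merged = (df
    .groupBy('contigName', 'start', 'end', 'referenceAllele', 'alternateAlleles',
             'qual', 'filters', 'splitFromMultiAllelic')
    .agg(F.flatten(F.collect_list('genotypes')).alias('genotypes'))
    .withColumn('genotypes', pad_genotypes('genotypes')))

merged.write.mode("overwrite").format("vcf").save(my_output_destination)
Because every row now carries an entry for every sample, the VCF writer can infer a single consistent sample list; whether -1,-1 or some other sentinel marks the padded calls is your choice inside pad_genotypes.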

Related

Scala/Spark - Find total number of value in row based on a key

I have a large text file which contains the page views of some Wikimedia projects. (You can find it here if you're really interested) Each line, delimited by a space, contains the statistics for one Wikimedia page. The schema looks as follows:
<project code> <page title> <num hits> <page size>
In Scala, using Spark RDDs or Dataframes, I wish to compute the total number of hits for each project, based on the project code.
So for example for projects with the code "zw", I would like to find all the rows that begin with project code "zw", and add up their hits. Obviously this should be done for all project codes simultaneously.
I have looked at functions like aggregateByKey etc, but the examples I found don't go into enough detail, especially for a file with 4 fields. I imagine it's some kind of MapReduce job, but how exactly to implement it is beyond me.
Any help would be greatly appreciated.
First, you have to read the file in as a Dataset[String]. Then, parse each string into a tuple so that it can easily be converted to a DataFrame. Once you have a DataFrame, a simple .groupBy().agg() is enough to finish the computation.
import org.apache.spark.sql.functions.sum
import spark.implicits._  // needed for .toDF, the $ column syntax and the tuple Encoder

val df = spark.read.textFile("/tmp/pagecounts.gz").map(l => {
  val a = l.split(" ")
  (a(0), a(2).toLong)
}).toDF("project_code", "num_hits")

val agg_df = df.groupBy("project_code")
  .agg(sum("num_hits").as("total_hits"))
  .orderBy($"total_hits".desc)

agg_df.show(10)
The above snippet shows the top 10 project codes by total hits.
+------------+----------+
|project_code|total_hits|
+------------+----------+
| en.mw| 5466346|
| en| 5310694|
| es.mw| 695531|
| ja.mw| 611443|
| de.mw| 572119|
| fr.mw| 536978|
| ru.mw| 466742|
| ru| 463437|
| es| 400632|
| it.mw| 400297|
+------------+----------+
It is certainly also possible to do this with the older RDD API as a map/reduce, but you lose many of the optimizations that the Dataset/DataFrame API brings.

How to make ReadAllFromText not Block Beam Pipeline?

I would like to implement a very simple Beam pipeline:
read Google Storage links to text files from a Pub/Sub topic -> read each text file line by line -> write to BigQuery.
Apache Beam has a pre-implemented PTransform for each step, so the pipeline would be:
Pipeline | ReadFromPubSub("topic_name") | ReadAllFromText() | WriteToBigQuery("table_name")
However, ReadAllFromText() blocks the pipeline somehow. Creating a custom PTransform that returns a random line after reading from Pub/Sub, and writing it to the BigQuery table, works normally (no blocking). Adding a fixed window of 3 seconds or triggering on each element doesn't solve the problem either.
Each file is around 10MB and 23K lines.
Unfortunately, I cannot find documentation on how ReadAllFromText is supposed to work. It would be really weird if it blocked the pipeline until all the files are read; I would expect it to push each line into the pipeline as soon as the line is read.
Is there any known reason for the above behavior? Is it a bug or am I doing something wrong?
Pipeline code:
pipeline_options = PipelineOptions(pipeline_args)

with beam.Pipeline(options=pipeline_options) as p:
    lines = p | ReadFromPubSub(subscription=source_dict["input"]) \
              | 'window' >> beam.WindowInto(window.FixedWindows(3, 0)) \
              | ReadAllFromText(skip_header_lines=1)

    elements = lines | beam.ParDo(SplitPayload())

    elements | WriteToBigQuery(source_dict["output"], write_disposition=BigQueryDisposition.WRITE_APPEND)
.
.
.
class SplitPayload(beam.DoFn):
    def process(self, element, *args, **kwargs):
        timestamp, id, payload = element.split(";")
        return [{
            'timestamp': timestamp,
            'id': id,
            'payload': payload
        }]
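For reference, the line-by-line workaround described above (a custom transform that opens each path itself instead of going through ReadAllFromText) could look roughly like the sketch below. It is untested; ReadFileLines is a made-up name, and it assumes the Pub/Sub payload is just the file path.
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class ReadFileLines(beam.DoFn):
    """Open each file path received from Pub/Sub and emit its lines one by one."""
    def process(self, path, *args, **kwargs):
        if isinstance(path, bytes):
            path = path.decode("utf-8")
        with FileSystems.open(path) as f:
            f.readline()  # skip the header line, as skip_header_lines=1 did
            line = f.readline()
            while line:
                yield line.decode("utf-8").rstrip("\r\n")
                line = f.readline()

# Used in place of ReadAllFromText():
#   lines = p | ReadFromPubSub(subscription=source_dict["input"]) \
#             | 'window' >> beam.WindowInto(window.FixedWindows(3, 0)) \
#             | 'read lines' >> beam.ParDo(ReadFileLines())
Whether this avoids the blocking behaviour in practice would need to be verified on the runner in use.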

Splayed table upsert leading to error: `cast

I built a data loader prototype that saves CSV files into splayed tables. The workflow is as follows:
Create the schema the first time, e.g. for the volatilitysurface table:
volatilitysurface::([date:`datetime$(); ccypair:`symbol$()] atm_convention:`symbol$(); premium_included:`boolean$(); smile_type:`symbol$(); vs_type:`symbol$(); delta_ratio:`float$(); delta_setting:`float$(); wing_extrapolation:`float$(); spread_type:`symbol$());
For every file in the rawdata folder, import it:
myfiles:#[system;"dir /b /o:gn ",string `$getenv[`KDBRAWDATA],"*.volatilitysurface.csv 2> nul";()];
if[myfiles~();.lg.o[`load;"no volatilitysurface files found!"];:0N];
.lg.o[`load;"loading data files ..."];
/ load each file
{
  mypath:"" sv (string `$getenv[`KDBRAWDATA];x);
  .lg.o[`load;"loading file name '",mypath,"' ..."];
  myfile:hsym`$mypath;
  tmp1:select date,ccypair,atm_convention,premium_included,smile_type,vs_type,delta_ratio,delta_setting,wing_extrapolation,spread_type from update date:x, premium_included:?[premium_included = `$"true";1b;0b] from ("ZSSSSSFFFS";enlist ",")0:myfile;
  `volatilitysurface upsert tmp1;
 } #/: myfiles;
delete tmp1 from `.;
.Q.gc[];
.lg.o[`done;"loading volatilitysurface data done"];
.lg.o[`save;"saving volatilitysurface schema to ",string afolder];
volatilitysurface::0!volatilitysurface;
.Q.dpft[afolder;`;`ccypair;`volatilitysurface];
.lg.o[`cleanup;"removing volatilitysurface from memory"];
delete volatilitysurface from `.;
.Q.gc[];
.lg.o[`done;"saving volatilitysurface schema done"];
This works perfectly. I use .Q.gc[] frequently to avoid hitting wsfull. When new CSV files are available, I open the existing schema, upsert into it, and save it again, effectively overwriting the existing HDB on the file system.
Open schema:
.lg.o[`open;"tables already exists, opening the schema ..."];
#[system;"l ",(string afolder) _ 0;{.lg.e[`open;"failed to load hdb directory: ", x]; 'x}];
/ Re-create table index
volatilitysurface::`date`ccypair xkey select from volatilitysurface;
Re-run step #2 to append new CSV files into the existing volatilitysurface table; it upserts the first CSV perfectly, but the second CSV fails with:
error: `cast
I debugged to the point of the error and, double-checking, I see that the metadata of tmp1 and volatilitysurface are exactly the same. Any ideas why this is happening? I get the same issue with any other table. I have tried removing the keys from the table after every upsert, but it doesn't help, i.e.
volatilitysurface::0!volatilitysurface;
volatilitysurface::`date`ccypair xkey volatilitysurface;
And the metadata comparison at the point of the cast error:
meta tmp1
c | t f a
------------------| -----
date | z
ccypair | s
atm_convention | s
premium_included | b
smile_type | s
vs_type | s
delta_ratio | f
delta_setting | f
wing_extrapolation| f
spread_type | s
meta volatilitysurface
c | t f a
------------------| -----
date | z
ccypair | s p
atm_convention | s
premium_included | b
smile_type | s
vs_type | s
delta_ratio | f
delta_setting | f
wing_extrapolation| f
spread_type | s
UPDATE: Using the input from the answer below, I tried TorQ's .loader.loadallfiles function like this (it doesn't fail, but nothing happens either; the table is not created in memory and the data is not written to the database):
.loader.loadallfiles[`headers`types`separator`tablename`dbdir`dataprocessfunc!(`x`ccypair`atm_convention`premium_included`smile_type`vs_type`delta_ratio`delta_setting`wing_extrapolation`spread_type;"ZSSSSSFFFS";enlist ",";`volatilitysurface;`:hdb; {[p;t] select date,ccypair,atm_convention,premium_included,smile_type,vs_type,delta_ratio,delta_setting,wing_extrapolation,spread_type from update date:x, premium_included:?[premium_included = `$"true";1b;0b] from t}); `:rawdata]
UPDATE 2: This is the output I get from TorQ:
2017.11.20D08:46:12.550618000|wsp18497wn|dataloader|dataloader1|INF|dataloader|**** LOADING :rawdata/20171102_113420.disccurve.csv ****
2017.11.20D08:46:12.550618000|wsp18497wn|dataloader|dataloader1|INF|dataloader|reading in data chunk
2017.11.20D08:46:12.566218000|wsp18497wn|dataloader|dataloader1|INF|dataloader|Read 10000 rows
2017.11.20D08:46:12.566218000|wsp18497wn|dataloader|dataloader1|INF|dataloader|processing data
2017.11.20D08:46:12.566218000|wsp18497wn|dataloader|dataloader1|INF|dataloader|Enumerating
2017.11.20D08:46:12.566218000|wsp18497wn|dataloader|dataloader1|INF|dataloader|writing 4525 rows to :hdb/2017.09.12/volatilitysurface/
2017.11.20D08:46:12.581819000|wsp18497wn|dataloader|dataloader1|INF|dataloader|writing 4744 rows to :hdb/2017.09.13/volatilitysurface/
2017.11.20D08:46:12.659823000|wsp18497wn|dataloader|dataloader1|INF|dataloader|writing 731 rows to :hdb/2017.09.14/volatilitysurface/
2017.11.20D08:46:12.737827000|wsp18497wn|dataloader|dataloader1|INF|init|retrieving sort settings from :C:/Dev/torq//config/sort.csv
2017.11.20D08:46:12.737827000|wsp18497wn|dataloader|dataloader1|INF|sort|sorting the volatilitysurface table
2017.11.20D08:46:12.737827000|wsp18497wn|dataloader|dataloader1|INF|sorttab|No sort parameters have been specified for : volatilitysurface. Using default parameters
2017.11.20D08:46:12.737827000|wsp18497wn|dataloader|dataloader1|INF|sortfunction|sorting :hdb/2017.09.05/volatilitysurface/ by these columns : sym, time
2017.11.20D08:46:12.753428000|wsp18497wn|dataloader|dataloader1|ERR|sortfunction|failed to sort :hdb/2017.09.05/volatilitysurface/ by these columns : sym, time. The error was: hdb/2017.09.
I get the message sorttab|No sort parameters have been specified for : volatilitysurface. Using default parameters. Where is this sorttab behaviour documented? Does it use the table's primary key by default?
UPDATE 3: OK, I fixed the issue from UPDATE 2 by providing a non-default sort.csv under my config folder:
tabname,att,column,sort
default,p,sym,1
default,,time,1
volatilitysurface,,date,1
volatilitysurface,,ccypair,1
But now I see that if I call the function multiple times on the same files, it simply appends duplicated data instead of upserting it.
UPDATE 4: Still not there yet ... assuming I can check to make sure that no duplicate file is used. When I load and then start the database, I get back some structure that resembles a dictionary rather than a table:
2017.10.31| (,`volatilitysurface)!,+`date`ccypair`atm_convention`premium_incl..
2017.11.01| (,`volatilitysurface)!,+`date`ccypair`atm_convention`premium_incl..
2017.11.02| (,`volatilitysurface)!,+`date`ccypair`atm_convention`premium_incl..
2017.11.03| (,`volatilitysurface)!,+`date`ccypair`atm_convention`premium_incl..
sym | `AUDNOK`AUDCNH`AUDJPY`AUDHKD`AUDCHF`AUDSGD`AUDCAD`AUDDKK`CADSGD`C..
Note that date is actually a datetime (Z) and not just a date. My full and latest version of the function invocation is:
target:hsym `$("" sv ("./";getenv[`KDBHDB];"/volatilitysurface"));
rawdatadir:hsym `$getenv[`KDBRAWDATA];
.loader.loadallfiles[`headers`types`separator`tablename`dbdir`partitioncol`dataprocessfunc!(`x`ccypair`atm_convention`premium_included`smile_type`vs_type`delta_ratio`delta_setting`wing_extrapolation`spread_type;"ZSSSSSFFFS";enlist ",";`volatilitysurface;target;`date;{[p;t] select date,ccypair,atm_convention,premium_included,smile_type,vs_type,delta_ratio,delta_setting,wing_extrapolation,spread_type from update date:x, premium_included:?[premium_included = `$"true";1b;0b] from t}); rawdatadir];
I'm going to add a second answer here to try and tackle the question about using TorQ's data loader.
I'd like to clarify what output you are getting after running this function. There should be some logging messages; can you post these? For example, when I run the function:
jmcmurray#homer ~/deploy/TorQ (master) $ q torq.q -procname loader -proctype loader -debug
<torq startup messages removed>
q).loader.loadallfiles[`headers`types`separator`tablename`dbdir`partitioncol`dataprocessfunc!(c;"TSSFJFFJJBS";enlist",";`quotes;`:testdb;`date;{[p;t] select date:.z.d,time:TIME,sym:INSTRUMENT,BID,ASK from t});`:csvtest]
2017.11.17D15:03:20.312336000|homer.aquaq.co.uk|loader|loader|INF|dataloader|**** LOADING :csvtest/tradesandquotes20140421.csv ****
2017.11.17D15:03:20.319110000|homer.aquaq.co.uk|loader|loader|INF|dataloader|reading in data chunk
2017.11.17D15:03:20.339414000|homer.aquaq.co.uk|loader|loader|INF|dataloader|Read 11000 rows
2017.11.17D15:03:20.339463000|homer.aquaq.co.uk|loader|loader|INF|dataloader|processing data
2017.11.17D15:03:20.339519000|homer.aquaq.co.uk|loader|loader|INF|dataloader|Enumerating
2017.11.17D15:03:20.340061000|homer.aquaq.co.uk|loader|loader|INF|dataloader|writing 11000 rows to :testdb/2017.11.17/quotes/
2017.11.17D15:03:20.341669000|homer.aquaq.co.uk|loader|loader|INF|dataloader|**** LOADING :csvtest/tradesandquotes20140422.csv ****
2017.11.17D15:03:20.349606000|homer.aquaq.co.uk|loader|loader|INF|dataloader|reading in data chunk
2017.11.17D15:03:20.370793000|homer.aquaq.co.uk|loader|loader|INF|dataloader|Read 11000 rows
2017.11.17D15:03:20.370858000|homer.aquaq.co.uk|loader|loader|INF|dataloader|processing data
2017.11.17D15:03:20.370911000|homer.aquaq.co.uk|loader|loader|INF|dataloader|Enumerating
2017.11.17D15:03:20.371441000|homer.aquaq.co.uk|loader|loader|INF|dataloader|writing 11000 rows to :testdb/2017.11.17/quotes/
2017.11.17D15:03:20.460118000|homer.aquaq.co.uk|loader|loader|INF|init|retrieving sort settings from :/home/jmcmurray/deploy/TorQ/config/sort.csv
2017.11.17D15:03:20.466690000|homer.aquaq.co.uk|loader|loader|INF|sort|sorting the quotes table
2017.11.17D15:03:20.466763000|homer.aquaq.co.uk|loader|loader|INF|sorttab|No sort parameters have been specified for : quotes. Using default parameters
2017.11.17D15:03:20.466820000|homer.aquaq.co.uk|loader|loader|INF|sortfunction|sorting :testdb/2017.11.17/quotes/ by these columns : sym, time
2017.11.17D15:03:20.527216000|homer.aquaq.co.uk|loader|loader|INF|applyattr|applying p attr to the sym column in :testdb/2017.11.17/quotes/
2017.11.17D15:03:20.535095000|homer.aquaq.co.uk|loader|loader|INF|sort|finished sorting the quotes table
After all this, I can run \l testdb and there is a table called "quotes" containing my loaded data.
If you can post logging messages like these, it would be helpful to see what's going on.
UPDATE
"But now I see that if I call the function multiple times on the same files, it simply appends duplicated data instead of upserting it."
If I'm understanding the problem correctly, it sounds like you shouldn't call the function multiple times on the same files. Another process within TorQ could be useful here: the "file alerter". This process will monitor a directory for new and updated files and can call a function on any that appear (so you can have it call the loader function with every new file automatically). It has a number of options, such as moving files after processing (so you can "archive" loaded CSVs).
Note that the file alerter requires that the function take exactly two parameters - the directory and the file name. This effectively means you will need a "wrapper" function around the loader function, which takes a dictionary and a directory. I don't think TorQ includes a function similar to .loader.loadallfiles for a single file, so it might be necessary to copy the target file to a temporary directory, run loadallfiles on that directory, and then delete the file from there before loading the next.
A `cast error refers to a value not being enumerated.
I can't see any enumeration going on here; splayed tables on disk need to have their symbol columns enumerated. For example, this can be done with the following line before calling .Q.dpft:
volatilitysurface:.Q.en[afolder;volatilitysurface];
You may like to consider using an example CSV loader for loading your data. One such example is included in TorQ, the kdb+ framework developed by AquaQ Analytics (as a disclaimer, I work for AquaQ).
The framework is available (free of charge) here: https://github.com/AquaQAnalytics/TorQ
The specific component you will likely be interested in is dataloader.q and is documented here: http://aquaqanalytics.github.io/TorQ/utilities/#dataloaderq
This script will handle everything necessary: loading all the files, enumerating, sorting on disk, applying attributes, etc., as well as using .Q.fsn to prevent running out of memory.

Linking multiple custom variables in Omniture

I have an implementation in SiteCatalyst where I have to track categories and the multiple tags associated with those categories. How should I go about it? Which variables should be defined for it in Omniture?
for example -
|---------------------|
| MOOD | // Main Category
|---------------------|
| Uplifting | // Sub Category
|---------------------|
| Fun | // Sub Category
|---------------------|
| Proud | // Sub Category
|---------------------|
| Fun | // Sub Category
There are three options:
Listprop
You could create a listprop, which you enable in the report suite settings by specifying a delimiter for the traffic variable. This is especially handy if you might end up with multiple levels. You can then implement the traffic variable like so:
s.prop1 = "MOOD|Uplifting|Fun|Proud|Fun";
Adobe Analytics will automatically split the values based on the pipe character. Please note that listprops don't allow correlations and pathing.
Multiple traffic variables/props
The other option would be to create multiple traffic variables and define all of them on every page where they're required.
s.prop1 = "MOOD";
s.prop2 = "Uplifting";
s.prop3 = "Fun";
s.prop4 = "Proud";
s.prop5 = "Fun";
However, this option will consume a lot of traffic variables of which you only have 75.
Classification
The third option looks the same as the listprop; however, you don't configure the traffic variable as a listprop but as a normal traffic variable, and you classify it later using the Classification Rule Builder.
Using the classification rule builder you can split the incoming data by the pipe character (using regular expression) and create new dimensions resembling the categories.
s.prop1 = "MOOD|Uplifting|Fun|Proud|Fun";
I would personally go for the third option, as it doesn't require a lot of props and it allows for a future-proof approach to measuring the categories even when you add more levels.
Good luck with your implementation!

Pass control from one gen_fsm to another

I'm creating a generic Erlang server that should be able to handle hundreds of client connections concurrently. For simplicity, let's suppose that the server performs some basic computation for every client, e.g., addition or subtraction of every pair of values that the client provides.
As a starting point, I'm using this tutorial for basic TCP client-server interaction. An excerpt that represents the supervision tree:
+----------------+
| tcp_server_app |
+--------+-------+
| (one_for_one)
+----------------+---------+
| |
+-------+------+ +-------+--------+
| tcp_listener | + tcp_client_sup |
+--------------+ +-------+--------+
| (simple_one_for_one)
+-----|---------+
+-------|--------+|
+--------+-------+|+
| tcp_echo_fsm |+
+----------------+
I would like to extend this code and allow tcp_echo_fsm to pass control of the socket to one of two modules: tcp_echo_addition (to compute the addition of every two client values) or tcp_echo_subtraction (to compute the subtraction of every two client values).
The tcp_echo_fsm would choose which module to handle a socket based on the first message from the client, e.g., if the client sends <<start_addition>>, then it would pass control to tcp_echo_addition.
The previous diagram becomes:
+----------------+
| tcp_server_app |
+--------+-------+
| (one_for_one)
+----------------+---------+
| |
+-------+------+ +-------+--------+
| tcp_listener | + tcp_client_sup |
+--------------+ +-------+--------+
| (simple_one_for_one)
+-----|---------+
+-------|--------+|
+--------+-------+|+
| tcp_echo_fsm |+
+----------------+
|
|
+----------------+---------+
+-------+-----------+ +-------+--------------+
| tcp_echo_addition | + tcp_echo_subtraction |
+-------------------+ +-------+--------------+
My questions are:
Am I on the right path? Is the tutorial which I'm using a good starting point for a scalable TCP server design?
How can I pass control from one gen_fsm (namely, tcp_echo_fsm) to another gen_fsm (either tcp_echo_addition or tcp_echo_subtraction)? Or better yet: is this a correct/clean way to design the server?
This related question suggests that passing control between a gen_fsm and another module is not trivial and there might be something wrong with this approach.
For 2, you can use gen_tcp:controlling_process/2 to pass control of the TCP connection: http://erlang.org/doc/man/gen_tcp.html#controlling_process-2.
For 1, I am not sure of the value of spawning a new module as opposed to handling the subtraction and addition logic as part of the defined states in your finite state machine. Doing so creates code which runs outside of your supervision tree, so it's harder to handle errors and restarts. Why not define addition and subtraction as different states within your state machine and handle that logic within those two states?
You can create tcp_echo_fsm:subtraction_state/2,3 and tcp_echo_fsm:addition_state/2,3 to handle this logic and use your first message to transition to the appropriate state rather than adding complexity to your application.