Join two PipelinedRDDs - PySpark

I am trying to join two PipelinedRDDs using .join() in a PySpark Jupyter notebook.
First RDD:
primaryType.take(5)
['DECEPTIVE PRACTICE',
'CRIM SEXUAL ASSAULT',
'BURGLARY',
'THEFT',
'CRIM SEXUAL ASSAULT']
Second RDD:
districts.take(5)
['004', '022', '008', '003', '001']
Join RDDs:
rdd_joined = primaryType.join(districts)
rdd_joined.take(5)
Output:
[]
What am I doing wrong here?

There needs to be a unique key to join the two RDDs on, since .join() works on pair RDDs of (key, value). Use rdd.zipWithIndex() to create an index for both RDDs, move the index into the key position, and then join them:
districts.take(5)
['004', '022', '008', '003', '001']
primaryType.take(5)
['DECEPTIVE PRACTICE',
'CRIM SEXUAL ASSAULT',
'BURGLARY',
'THEFT',
'CRIM SEXUAL ASSAULT']
districts = districts.zipWithIndex()
districts.take(5)
[('004', 0), ('022', 1), ('008', 2), ('003', 3), ('001', 4)]
districts = districts.map(lambda kv: (kv[1], kv[0]))
primaryType = primaryType.zipWithIndex()
primaryType = primaryType.map(lambda kv: (kv[1], kv[0]))
primaryType.join(districts).map(lambda kv: kv[1]).take(5)
[('DECEPTIVE PRACTICE', '004'), ('CRIM SEXUAL ASSAULT', '001'), ('CRIM SEXUAL ASSAULT', '022'), ('BURGLARY', '008'), ('THEFT', '003')]


Tbl_Strata count by distinct individual vs rows

How can I use tbl_strata and get the output to show counts by distinct individual rather than by row?
Also, how can I change the display order of the variable I pass to the by= argument of tbl_summary?
I have a long table AND a wide table with one row per patient. I am not sure how to apply the wide table to this code; I can apply the long table, but I get row counts instead of distinct patient counts.
I have included an example of the long table I have, the wide table, and what I would like the output to look like in the picture.
Example code:
#Wide Table Example
df_Wide <- data.frame(patientICN =c(1, 2, 3, 4, 5)
,testtype =c("liquid", "tissue", "tissue", "liquid", "liquid")
,gene1 =c("unk", "pos", "neg", "neg", "unk")
,gene2 =c("pos", "neg", "pos", "unk", "neg")
,gene3 =c("neg", "unk", "unk", "pos", "pos"))
#Long Table Example
df_Long <- data.frame(patientICN =c(1, 1, 2, 2, 3)
,testtype =c("liquid", "tissue", "tissue", "liquid", "liquid")
,gene =c("Gene1", "Gene2", "Gene3", "Gene1", "Gene2")
,result=c("Positve", "Negative", "Unknown","Positive","Unknown"))
#Table Categorized by testtype and result for long table
df_Long %>%
  select(result, gene, testtype) %>%
  mutate(testcategory = paste("TestType", testtype)) %>%
  tbl_strata(
    strata = testtype,
    .tbl_fun =
      ~ .x %>%
        tbl_summary(by = result, missing = "no") %>%
        add_n(),
    .header = "**{strata}**, N={n}"
  )
## The above gives counts of rows rather than distinct patient counts
Is this what you're after? You can install the bstfun pkg from my R-universe: https://ddsjoberg.r-universe.dev/ui#packages
library(gtsummary)
library(dplyr)
#Long Table Example
df_Long <- data.frame(patientICN =c(1, 1, 2, 2, 3)
,testtype =c("liquid", "tissue", "tissue", "liquid", "liquid")
,gene =c("Gene1", "Gene2", "Gene3", "Gene1", "Gene2")
,result=c("Positive", "Negative", "Unknown","Positive","Unknown"))
tbl <-
df_Long %>%
tidyr::pivot_wider(
id_cols = c(patientICN, testtype),
names_from = gene,
values_from = result,
values_fill = "Unknown"
) %>%
mutate(across(starts_with('Gene'), ~factor(.x, levels = c("Positive", "Negative", "Unknown")))) %>%
tbl_strata(
strata = testtype,
~ .x %>%
bstfun::tbl_likert(
include = starts_with("Gene")
)
)
Created on 2022-10-06 with reprex v2.0.2

How clever is the PostgreSQL planner

I mainly use the R package dbplyr to interact with a PostgreSQL database. That works by "piping" operations which are then translated to SQL and executed in one query. This tends to result in many nested joins. I wonder how clever the planner is when it comes to resolving such rather verbose and un-optimized expressions. Is it even possible to write "bad queries" as long as they only use e.g. SELECT, WHERE and JOIN (and no functions, casting etc) and where the end result is the same? How would such a query look? For example, will the planner figure out which columns are needed in a hash join in order to reduce memory, even if the columns are not specified in that join but only at the end after a join involving, say, 6 tables?
For example, can I safely ignore:
Join order
When columns are selected
When filters are applied
I've found a lot of info about how the planner calculates costs and chooses a path, but not so much on how it arrives at a "minimal form" of the query at the first place. EXPLAIN ANALYZE doesn't help because it doesn't show which columns end up being selected. I'm sure someone will be unhappy with this question due to being too vague. If so, please point me in the right direction :)
EDIT:
An example.
Here is how a typical query would look in R with dbplyr. "gene_annotations" has the columns "gene" and "annotation_term"; "genemaps" has "genemap", "gene", "probe", and "study". Here I want to get the gene and annotation associated with a probe.
tbl(con, "gene_annotations") %>% inner_join(tbl(con, "genemaps"), by = "gene") %>%
filter(probe == 1L) %>% select(gene, annotation_term)
This translates to:
SELECT "gene", "annotation_term"
FROM (SELECT "LHS"."gene" AS "gene", "LHS"."annotation_term" AS "annotation_term", "RHS"."genemap" AS "genemap", "RHS"."probe" AS "probe", "RHS"."study" AS "study"
FROM "gene_annotations" AS "LHS"
INNER JOIN "genemaps" AS "RHS"
ON ("LHS"."gene" = "RHS"."gene")
) "dbplyr_004"
WHERE ("probe" = 1)
Can I trust that this has the exact same performance as e.g. this expression (except for the time for parsing and analyzing the expression)?
tbl(con, "gene_annotations") %>% inner_join(tbl(con, "genemaps") %>%
filter(probe == 1L) %>% select(gene) , by = "gene")
SELECT "LHS"."gene" AS "gene", "LHS"."annotation_term" AS "annotation_term"
FROM "gene_annotations" AS "LHS"
INNER JOIN (SELECT "gene"
FROM "genemaps"
WHERE ("probe" = 1)) "RHS"
ON ("LHS"."gene" = "RHS"."gene")
The plan is the same in both cases:
Nested Loop (cost=0.86..72.09 rows=546 width=8)
-> Index Only Scan using genemaps_probe_index on genemaps (cost=0.43..2.16 rows=36 width=4)
Index Cond: (probe = 1)
-> Index Only Scan using gene_annotations_pkey on gene_annotations "LHS" (cost=0.43..1.79 rows=15 width=8)
Index Cond: (gene = genemaps.gene)
I didn't want to provide an example, because I don't have an issue with this specific query. What I'm wondering is whether I can always disregard these issues altogether and just piece together joins until I get the end result I want.
EDIT 2:
I found out that there is a VERBOSE option to EXPLAIN that shows which columns are returned. For the small example above, the plan was identical in that regard too. Still, can I assume that holds for all reasonably complex queries? This is an example of how my queries typically look. As you can see, the SQL dbplyr generates isn't very easy to read. Here it joins six tables after various SELECTs and WHEREs.
SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."value" AS "value", "LHS"."gene" AS "gene", "LHS"."probe" AS "probe", "LHS"."gene_symbol" AS "gene_symbol", "LHS"."probe_name" AS "probe_name", "RHS"."factor_order" AS "factor_order"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."value" AS "value", "LHS"."gene" AS "gene", "LHS"."probe" AS "probe", "LHS"."gene_symbol" AS "gene_symbol", "RHS"."probe_name" AS "probe_name"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."value" AS "value", "LHS"."gene" AS "gene", "LHS"."probe" AS "probe", "RHS"."gene_symbol" AS "gene_symbol"
FROM (SELECT "sample_group", "sample_group_name", "sample_group_description", "sample", "sample_name", "value", "gene", "probe"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "LHS"."genemap" AS "genemap", "LHS"."annotation_term" AS "annotation_term", "LHS"."value" AS "value", "RHS"."gene" AS "gene", "RHS"."probe" AS "probe"
FROM (SELECT "LHS"."sample_group" AS "sample_group", "LHS"."sample_group_name" AS "sample_group_name", "LHS"."sample_group_description" AS "sample_group_description", "LHS"."sample" AS "sample", "LHS"."sample_name" AS "sample_name", "RHS"."genemap" AS "genemap", "RHS"."annotation_term" AS "annotation_term", "RHS"."value" AS "value"
FROM (SELECT *
FROM (SELECT "sample_group", "sample_group_name", "sample_group_description", "sample", "sample_name"
FROM "sample_view") "dbplyr_031"
WHERE (270 = 270)) "LHS"
INNER JOIN "gene_measurements" AS "RHS"
ON ("LHS"."sample" = "RHS"."sample")
) "LHS"
INNER JOIN (SELECT "genemap", "gene", "probe"
FROM "genemaps"
WHERE ("gene" IN (54812) AND "study" = 270)) "RHS"
ON ("LHS"."genemap" = "RHS"."genemap")
) "dbplyr_032") "LHS"
INNER JOIN (SELECT "gene", "gene_symbol"
FROM "genes") "RHS"
ON ("LHS"."gene" = "RHS"."gene")
) "LHS"
INNER JOIN (SELECT "probe", "probe_name"
FROM "probes") "RHS"
ON ("LHS"."probe" = "RHS"."probe")
) "LHS"
INNER JOIN (SELECT "group", "annotation_term_value" AS "factor_order"
FROM (SELECT "LHS"."group" AS "group", "LHS"."annotation_term" AS "annotation_term", "RHS"."annotation_term_value" AS "annotation_term_value"
FROM "group_annotations" AS "LHS"
INNER JOIN (SELECT "annotation_term", "annotation_term_value"
FROM "annotation_terms"
WHERE ("annotation_type" = 111)) "RHS"
ON ("LHS"."annotation_term" = "RHS"."annotation_term")
) "dbplyr_033") "RHS"
ON ("LHS"."sample_group" = "RHS"."group")

How to handle this use-case (running-window data) in Spark

I am using spark-sql 2.4.1 with Java 1.8.
I have source data as below:
val df_data = Seq(
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2020-03-01"),
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-06-01"),
("Indus_1","Indus_1_Name","Country1", "State1",12789979,"2019-03-01"),
("Indus_2","Indus_2_Name","Country1", "State2",21789933,"2020-03-01"),
("Indus_2","Indus_2_Name","Country1", "State2",300789933,"2018-03-01"),
("Indus_3","Indus_3_Name","Country1", "State3",27989978,"2019-03-01"),
("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-06-01"),
("Indus_3","Indus_3_Name","Country1", "State3",30014633,"2017-03-01"),
("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2020-03-01"),
("Indus_4","Indus_4_Name","Country2", "State1",41789978,"2018-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2019-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2018-03-01"),
("Indus_5","Indus_5_Name","Country3", "State3",67789978,"2017-03-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-03-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2020-06-01"),
("Indus_6","Indus_6_Name","Country1", "State1",37899790,"2018-03-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-03-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2020-12-01"),
("Indus_7","Indus_7_Name","Country3", "State1",26689900,"2019-03-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-03-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2018-09-01"),
("Indus_8","Indus_8_Name","Country1", "State2",212359979,"2016-03-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2020-03-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2019-09-01"),
("Indus_9","Indus_9_Name","Country4", "State1",97899790,"2016-03-01")
).toDF("industry_id","industry_name","country","state","revenue","generated_date");
Query :
val distinct_gen_date = df_data.select("generated_date").distinct.orderBy(desc("generated_date"));
For each "generated_date" in the list distinct_gen_date, I need to get all unique industry_ids from 6 months of data:
val cols = {col("industry_id")}
val ws = Window.partitionBy(cols).orderBy(desc("generated_date"));
val newDf = df_data
.withColumn("rank",rank().over(ws))
.where(col("rank").equalTo(lit(1)))
//.drop(col("rank"))
.select("*");
How do I get a moving aggregate (over the unique industry_ids in 6 months of data) for each distinct item? How can this moving aggregation be achieved?
More details:
For example, in the given sample data, assume the dates run from "2020-03-01" back to "2016-03-01". If some industry_x is not present in "2020-03-01", we check "2020-02-01", "2020-01-01", "2019-12-01", "2019-11-01", "2019-10-01", "2019-09-01" sequentially; wherever it is found, that rank-1 record is taken into the data set used to calculate the "2020-03-01" figures. We then move to the next distinct "generated_date", e.g. "2020-02-01": for each distinct date we go back 6 months, get the unique industries, and pick the rank-1 data. Then we pick another distinct "generated_date" and do the same, and so on; the data set keeps changing. With a for loop I can do this, but it gives no parallelism. How can I build the data set for each distinct "generated_date" in parallel?
I don't know how to do this with window functions but a self join can solve your problem.
First, you need a DataFrame with distinct dates:
val df_dates = df_data
.select("generated_date")
.withColumnRenamed("generated_date", "distinct_date")
.distinct()
Next, for each row in your industries data you need to calculate up to which date that industry will be included, i.e., add 6 months to generated_date. I think of them as active dates. I've used add_months() to do this, but you can apply different logic.
import org.apache.spark.sql.functions.add_months
val df_active = df_data.withColumn("active_date", add_months(col("generated_date"), 6))
If we start with this data (separated by date just for our eyes):
industry_id generated_date
(("Indus_1", ..., "2020-03-01"),
("Indus_1", ..., "2019-12-01"),
("Indus_2", ..., "2019-12-01"),
("Indus_3", ..., "2018-06-01"))
It has now:
industry_id generated_date active_date
(("Indus_1", ..., "2020-03-01", "2020-09-01"),
("Indus_1", ..., "2019-12-01", "2020-06-01"),
("Indus_2", ..., "2019-12-01", "2020-06-01")
("Indus_3", ..., "2018-06-01", "2018-12-01"))
Now proceed with self join based on dates, using the join condition that will match your 6 month period:
val condition: Column = (
col("distinct_date") >= col("generated_date")).and(
col("distinct_date") <= col("active_date"))
val df_joined = df_dates.join(df_active, condition, "inner")
df_joined has now:
distinct_date industry_id generated_date active_date
(("2020-03-01", "Indus_1", ..., "2020-03-01", "2020-09-01"),
("2020-03-01", "Indus_1", ..., "2019-12-01", "2020-06-01"),
("2020-03-01", "Indus_2", ..., "2019-12-01", "2020-06-01"),
("2019-12-01", "Indus_1", ..., "2019-12-01", "2020-06-01"),
("2019-12-01", "Indus_2", ..., "2019-12-01", "2020-06-01"),
("2018-06-01", "Indus_3", ..., "2018-06-01", "2018-12-01"))
Drop that auxiliary column active_date or even better, drop duplicates based on your needs:
val df_result = df_joined.dropDuplicates(Seq("distinct_date", "industry_id"))
This drops the duplicated "Indus_1" in "2020-03-01" (it appeared twice because it was retrieved from two different generated_dates):
distinct_date industry_id
(("2020-03-01", "Indus_1"),
("2020-03-01", "Indus_2"),
("2019-12-01", "Indus_1"),
("2019-12-01", "Indus_2"),
("2018-06-01", "Indus_3"))

How to get the value of the previous row in a Scala Apache Spark RDD[Row]?

I need to get a value from the previous or next row while I'm iterating through an RDD[Row]:
(10,1,string1)
(11,1,string2)
(21,1,string3)
(22,1,string4)
I need to concatenate the strings of rows where the difference between their 1st values is not higher than 3; the 2nd value is the ID. So the result should be:
(1, string1string2)
(1, string3string4)
I tried using groupBy, reduce, and partitioning, but I still can't achieve what I want.
I'm trying to make something like this (I know it's not the proper way):
rows.groupBy(row => {
row(1)
}).map(rowList => {
rowList.reduce((acc, next) => {
diff = next(0) - acc(0)
if(diff <= 3){
val strings = acc(2) + next(2)
(acc(1), strings)
}else{
//create new group to aggregatre strings
(acc(1), acc(2))
}
})
})
I wonder if my idea is proper to solve this problem.
Looking for help!
I think you can use sqlContext to solve your problem by using the lag function.
Create RDD:
val rdd = sc.parallelize(List(
(10, 1, "string1"),
(11, 1, "string2"),
(21, 1, "string3"),
(22, 1, "string4"))
)
Create DataFrame:
val df = rdd.toDF("a", "b", "c")
Register your Dataframe:
df.registerTempTable("df")
Query the result:
val res = sqlContext.sql("""
SELECT CASE WHEN l < 3 THEN ROW_NUMBER() OVER (ORDER BY b) - 1
ELSE ROW_NUMBER() OVER (ORDER BY b)
END m, b, c
FROM (
SELECT b,
(a - CASE WHEN lag(a, 1) OVER (ORDER BY a) is not null
THEN lag(a, 1) OVER (ORDER BY a)
ELSE 0
END) l, c
FROM df) A
""")
Show the Results:
res.show
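The query above only assigns a group id (m) to each row. If the goal is one concatenated string per group, as in the desired output, a possible final step (a sketch, assuming the res DataFrame from the query above; note that collect_list does not guarantee element order, so sort first if order matters) is:
import org.apache.spark.sql.functions.{collect_list, concat_ws}
// Concatenate the strings of each derived group into a single value.
val grouped = res
  .groupBy("m", "b")
  .agg(concat_ws("", collect_list("c")).as("strings"))
  .select("b", "strings")
grouped.show()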
I hope this helps.

concat columns by joining multiple DataFrames

I have multiple DataFrames, and I need to concatenate the addresses and zip based on a condition. I have a SQL query which I need to convert into DataFrame joins.
I had written a UDF which works fine for concatenating multiple columns into a single column:
val getConcatenated = udf( (first: String, second: String,third: String,fourth: String,five: String,six: String) => { first + "," + second + "," +third + "," +fourth + "," +five + "," +six } )
MySQL query:
select
CONCAT(al.Address1,',',al.Address2,',',al.Zip) AS AtAddress,
CONCAT(rl.Address1,',',rl.Address2,',',rl.Zip) AS RtAddress,
CONCAT(d.Address1,',',d.Address2,',',d.Zip) AS DAddress,
CONCAT(s.Address1,',',s.Address2,',',s.Zip) AS SAGddress,
CONCAT(vl.Address1,',',vl.Address2,',',vl.Zip) AS VAddress,
CONCAT(sg.Address1,',',sg.Address2,',',sg.Zip) AS SAGGddress
FROM
si s inner join
at a on s.cid = a.cid and s.cid =a.cid
inner join De d on s.cid = d.cid AND d.aid = a.aid
inner join SGrpM sgm on s.cid = sgm.cid and s.sid =sgm.sid and sgm.status=1
inner join SeG sg on sgm.cid =sg.cid and sgm.gid =sg.gid
inner join bd bu on s.cid = bu.cid and s.sid =bu.sid
inner join locas al on a.ALId = al.lid
inner join locas rl on a.RLId = rl.lid
inner join locas vl on a.VLId = vl.lid
I am facing an issue when joining the DataFrames, which gives me null values.
val DS = DS_SI.join(at, Seq("cid","sid"), "inner")
  .join(DS_DE, Seq("cid","aid"), "inner")
  .join(DS_SGrpM, Seq("cid","sid"), "inner")
  .join(DS_SG, Seq("cid","gid"), "inner")
  .join(at, Seq("cid","sid"), "inner")
  .join(DS_BD, Seq("cid","sid"), "inner")
  .join(DS_LOCAS("ALId") <=> DS_LOCATION("lid") && at("RLId") <=> DS_LOCAS("lid") && at("VLId") <=> DS_LOCAS("lid"), "inner")
I am trying to join my DataFrames like above, which is not giving me proper results, and then I want to concatenate by adding the columns:
.withColumn("AtAddress",getConcatenated())
.withColumn("RtAddress",getConcatenated())....
Can anyone tell me how we can achieve this effectively? Am I joining the DataFrames correctly, or is there a better approach for this?
You can use concat_ws(separator, columns_to_concat).
Example:
import org.apache.spark.sql.functions._
df.withColumn("title", concat_ws(", ", DS_DE("Address2"), DS_DE("Address2"), DS_DE("Zip")))