pyspark sql with inline case when join condition - pyspark

This is the original SQL with inline case when condition:
select *
from table_a
LEFT JOIN table_b
ON case when table_a.key not in ('1','2') then '0' else table_a.key end = table_b.key
What is the equivalent pyspark code?
I tried the when().otherwise() function and the if() function, but neither worked.
table_a=spark.sql('''select 1 as key union select 2 as key''')
table_a.show()
table_b=spark.sql('''select 3 as key union select 0 as key''')
table_b.show()
val join_condition = when(((table_a.key == '1') | (table_a.key == '2')), table_a.key == table_b.key).otherwise(('0' == table_b.key))
df = table_a.join(table_b, join_condition, 'leftouter').select(table_a['*'], table_b['*'])
df = table_a.join(table_b, ((if(((table_a.key == '1') | (table_a.key == '2')), ('0'), (table_a.key))) == table_b.key), 'leftouter')\
.select(table_a['*'], table_b['*'])
Thank you!

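Whenever you combine multiple conditions like this (table_a.key == '1' | table_a.key == '2'), you have to wrap each of them in its own parentheses, like this: ((table_a.key == '1') | (table_a.key == '2')). In Python, | binds more tightly than ==, so the unwrapped form is not grouped the way it reads.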
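To see why the parentheses matter, the sketch below uses Python's standard `ast` module to show how the interpreter groups the two spellings; no Spark session is needed for that part. The `case_when_join` helper is a hypothetical name, not from the original post, and shows one way the SQL `CASE WHEN` join might be written with `when().otherwise()`. The cast to string is an assumption, made so the integer keys in the sample tables compare cleanly against the string literals.

```python
import ast


def top_level_node(expr: str) -> str:
    """Return the name of the outermost AST node Python builds for expr."""
    return type(ast.parse(expr, mode="eval").body).__name__


# Without parentheses, `|` binds tighter than `==`, so Python reads this as
# the chained comparison  key == ('1' | key) == '2'.
print(top_level_node("key == '1' | key == '2'"))      # Compare
# With each comparison wrapped, the outermost operation is the `|` itself.
print(top_level_node("(key == '1') | (key == '2')"))  # BinOp


def case_when_join(table_a, table_b):
    """Left join mirroring the SQL CASE WHEN condition (hypothetical helper)."""
    # Imported lazily so the precedence demo above runs without pyspark installed.
    from pyspark.sql import functions as F

    # Assumption: compare keys as strings, matching the '1', '2', '0' literals.
    a_key = table_a["key"].cast("string")
    b_key = table_b["key"].cast("string")
    # CASE WHEN a.key NOT IN ('1', '2') THEN '0' ELSE a.key END
    mapped_key = F.when(~a_key.isin("1", "2"), F.lit("0")).otherwise(a_key)
    return table_a.join(table_b, mapped_key == b_key, "left")
```

Assuming pyspark is installed and `table_a` and `table_b` are the DataFrames from the question, `case_when_join(table_a, table_b).show()` would display the joined rows.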

Related

Optimize multiple case expressions with THEN/END

I have following query.
SELECT
i.id,
CASE WHEN ia.detail_count = 1 THEN i.space_id ELSE NULL END AS space_id,
CASE WHEN ia.detail_count = 1 THEN i.resident_id ELSE NULL END AS resident_id,
CASE WHEN ia.detail_count = 1 THEN i.lease_id ELSE NULL END AS lease_id,
i.deleted_by,
i.deleted_on,
i.updated_by,
i.updated_on,
i.created_by
FROM
myTable i
JOIN (
SELECT
icd.id,
json_build_object(
'lease_ids', array_remove(array_agg(icd.lease_id), NULL),
'resident_ids', array_remove(array_agg(icd.resident_id), NULL),
'space_ids', array_remove(array_agg(icd.space_id), NULL)
) AS details,
COUNT(icd.id) AS detail_count
FROM
mytable_details icd
GROUP BY
icd.id
) ia ON ia.id = i.id;
Can we combine the following three expressions into one, since the condition is the same and only the operand differs?
CASE WHEN ia.detail_count = 1 THEN i.space_id ELSE NULL END AS space_id,
CASE WHEN ia.detail_count = 1 THEN i.resident_id ELSE NULL END AS resident_id,
CASE WHEN ia.detail_count = 1 THEN i.lease_id ELSE NULL END AS lease_id,
I'm not sure that this counts as an optimisation, but I think you could modify the inner query:
(COUNT(icd.id) = 1) as one_detail
... which would return a Boolean result, and then ...
case when one_detail then i.space_id end as space_id,
case when one_detail then i.resident_id end as resident_id,
case when one_detail then i.lease_id end as lease_id,
I'm not sure it's worthwhile in a simple case like this, but for a more complex condition it might be.
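The rewrite suggested above can be tried end to end. The sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, which is an assumption; the `json_build_object` column is left out because SQLite lacks it. The tables are trimmed to a few columns and the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER, space_id INTEGER, resident_id INTEGER, lease_id INTEGER);
    CREATE TABLE mytable_details (id INTEGER, space_id INTEGER);
    INSERT INTO mytable VALUES (1, 10, 100, 1000), (2, 20, 200, 2000);
    INSERT INTO mytable_details VALUES (1, 10), (2, 20), (2, 21);
""")

# The inner query computes the condition once as a boolean column,
# and each outer CASE only refers to that column.
rows = conn.execute("""
    SELECT i.id,
           CASE WHEN ia.one_detail THEN i.space_id END AS space_id,
           CASE WHEN ia.one_detail THEN i.resident_id END AS resident_id,
           CASE WHEN ia.one_detail THEN i.lease_id END AS lease_id
    FROM mytable i
    JOIN (
        SELECT icd.id, (COUNT(icd.id) = 1) AS one_detail
        FROM mytable_details icd
        GROUP BY icd.id
    ) ia ON ia.id = i.id
    ORDER BY i.id
""").fetchall()

# id 1 has exactly one detail row, id 2 has two, so only id 1 keeps its values.
print(rows)  # [(1, 10, 100, 1000), (2, None, None, None)]
```

A `CASE` without an `ELSE` branch yields NULL when no branch matches, which is why the explicit `ELSE NULL` can be dropped.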

Scala Quill: how do I write a query with GROUP BY, ORDER BY, and multiple WHERE conditions?

I want to generate SQL like this:
SELECT
v0.`uid`,
v0.`title`,
v0.`price`,
v0.`publishtime`,
v0.`status`,
v0.`type`,
v0.`is_lfst`,
v0.`app_image`
FROM
`news` v0
LEFT JOIN `news_zq` v1 ON v0.`id` = v1.`nid`
WHERE
v0.`status` = 1
AND v0.`is_lfst` = 1
AND v0.`type` = 2
AND v1.`zq_id` = 2
GROUP BY
v0.`id`
ORDER BY
v0.`publishtime` DESC
LIMIT 20 OFFSET 0
I tried "dynamicQuery" and "infix", but both failed:
dynamicQuery[News]
.leftJoin(dynamicQuery[NewsZq]).on((a, b) => a.id == b.nid)
.filter(_._1.status == lift(1))
.filterIf(newsDo.cid.isDefined && newsDo.cid.get > 0)(_._1.cid == lift(newsDo.cid.get))
.filterIf(newsDo.`type`.isDefined)(_._1.`type` == lift(newsDo.`type`.get))
.groupBy(_._1.id).map(_._2.map(_._1)) // error
.sortBy(_._1.publishtime)(Ord.desc)
.drop(quote(lift(offset)))
.take(quote(lift(limit)))
.map(_._1)

Is there an equivalent in Entity Framework for CASE WHEN SomeCol IS NULL THEN 0 ELSE 1 END?

The T-SQL statement is below. Essentially, I want to return a boolean computed field, xmlHasValue.
SELECT TOP 10
hdr.pkID
, etc= "etc..."
, xmlHasValue = CASE WHEN hdr.someVeryLongXml IS NULL THEN 0 ELSE 1 END
FROM MyLeftTable hdr
inner JOIN MyRightTable lines ON hdr.pkID = lines.fkID
WHERE hdr.SomeField = 123
ORDER BY hdr.pkID DESC
How can I write this in Entity Framework (the full .NET Framework version, not .NET Core) so that EF produces the CASE expression above?
My attempt:
var query = from hdr in dbCtx.MyLeftTable
join lines in dbCtx.MyRightTable on hdr.pkID equals lines.fkID
where hdr.SomeField == 123
orderby hdr.pkID descending
select new //select into anon C# obj
{
pkID = hdr.pkID,
etc = "etc...",
xmlHasValue = hdr.someVeryLongXml //<== ??? stuck here ???
};
var anonObjList = query.AsNoTracking()
.Take(10)
.ToList(); //exec qry on the SERVER, and fill the anon object.

How to select from a subquery only if a column contains specific values in PostgreSQL

Is it possible to select again from a result set only if a column contains specific values?
For example, I want to use the query below as a subquery and check whether its first column contains both 2 and 3. Otherwise, no rows should be returned.
select e.evaluator_id, ROUND(avg(cast(e.rating_score as int))::numeric,1)::varchar, c.q_category_name
from tms.t_evaluation e
inner join tms.m_q_category c
on e.nendo=c.nendo
and e.q_category_id = c.q_category_id
and c.delete_flg = '0'
inner join tms.m_q_subcategory qs
on e.q_category_id = qs.q_category_id
and e.q_subcategory_id = qs.q_subcategory_id
and c.nendo = qs.nendo
and qs.delete_flg = '0'
where e.nendo = '2018'
and e.empl_id = 'empl05'
and e.delete_flg = '0'
and e.evaluator_id in ('2' , '3')
group by e.empl_id, e.nendo, e.q_category_id,
c.q_category_name, e.evaluator_id, e.history_no
The result should contain both 2 and 3 in the first column. Is this possible? My attempt so far:
select e.evaluator_id, ROUND(avg(cast(e.rating_score as int))::numeric,1)::varchar, c.q_category_name
from tms.t_evaluation e
inner join tms.m_q_category c
on e.nendo=c.nendo
and e.q_category_id = c.q_category_id
and c.delete_flg = '0'
inner join tms.m_q_subcategory qs
on e.q_category_id = qs.q_category_id
and e.q_subcategory_id = qs.q_subcategory_id
and c.nendo = qs.nendo
and qs.delete_flg = '0'
where e.nendo = '2018'
and e.empl_id = 'empl05'
and e.delete_flg = '0'
and e.evaluator_id in (select case when evaluator_id=2 or evaluator_id=3 then evaluator_id else null end from t_evaluation order by evaluator_id asc)
group by e.empl_id, e.nendo, e.q_category_id,
c.q_category_name, e.evaluator_id, e.history_no
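One possible way to express the requirement (return rows only when both evaluators are present) is a correlated subquery that counts the distinct matching evaluators. This is a sketch that was not part of the original thread. It uses Python's `sqlite3` module as a stand-in for PostgreSQL, a trimmed-down `t_evaluation` table, and made-up rows, all of which are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# empl05 is rated by evaluators 2 and 3; empl06 only by evaluator 2.
conn.executescript("""
    CREATE TABLE t_evaluation (empl_id TEXT, evaluator_id TEXT, rating_score INTEGER);
    INSERT INTO t_evaluation VALUES
        ('empl05', '2', 4),
        ('empl05', '3', 5),
        ('empl06', '2', 3);
""")

QUERY = """
    SELECT e.evaluator_id, ROUND(AVG(e.rating_score), 1) AS avg_score
    FROM t_evaluation e
    WHERE e.empl_id = ?
      AND e.evaluator_id IN ('2', '3')
      AND (SELECT COUNT(DISTINCT x.evaluator_id)
           FROM t_evaluation x
           WHERE x.empl_id = e.empl_id
             AND x.evaluator_id IN ('2', '3')) = 2
    GROUP BY e.evaluator_id
    ORDER BY e.evaluator_id
"""

both = conn.execute(QUERY, ("empl05",)).fetchall()
only_one = conn.execute(QUERY, ("empl06",)).fetchall()
print(both)      # [('2', 4.0), ('3', 5.0)]
print(only_one)  # []
```

The subquery's count equals 2 only when both evaluator IDs exist for the employee, so a partial match filters out every row.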

How to get multiple sums that are subqueries

I'm using LINQPad to test my EF query, and I can't get the end result to include a few extra columns that represent sums of a field under different conditions.
The StorePaymentInvoices table has a foreign key to CustomerStatementBatchPayments, so I need to sum the CustomerStatementBatchPayments.Net field only where there is a corresponding row in StorePaymentInvoices.
Getting the sums is turning out to be a real mess. Any suggestions?
Sometimes what is hard to do in one statement ends up being easier to do in multiple steps.
var retval = (
from a in CustomerStatementBatches
join b in CustomerStatementBatchPayments on a.ID equals b.CustomerStatementBatchID into grp1
from c in grp1
where a.CustomerStatementID == StatementId
group c by c.CustomerStatementBatchID into grp2
from e in grp2
select new {
StatementId = e.CustomerStatementBatch.CustomerStatementID,
BatchId = e.CustomerStatementBatchID,
Applied = CustomerStatementBatchPayments.Where(csbp => !StorePaymentInvoices.Select (pi => pi.CustomerStatementBatchPaymentID ).ToList().Contains(e.ID)).Sum (csbp => csbp.Net )
}
).ToList();
retval.Dump();
[UPDATE 1]
This is what I've done to get the "conditional" sum values, and I seem to be getting the correct numbers. The SQL it generates is rather ugly, but it executes in under 1 second.
var retval1 = (
from a in CustomerStatementBatches
join b in CustomerStatementBatchPayments on a.ID equals b.CustomerStatementBatchID into grp1
from c in grp1
where a.CustomerStatementID == StatementId
group c by new { a.CustomerStatementID, c.CustomerStatementBatchID} into grp2
from e in grp2.Distinct()
select new {
StatementId = e.CustomerStatementBatch.CustomerStatementID,
BatchId = e.CustomerStatementBatchID
}
).ToList()
.Distinct()
.Select(a => new
{
StatementId = a.StatementId,
BatchId = a.BatchId,
AppliedTotal = (from b in CustomerStatementBatchPayments.Where(r => r.CustomerStatementBatchID == a.BatchId)
join c in StorePaymentInvoices on b.ID equals c.CustomerStatementBatchPaymentID
group b by b.CustomerStatementBatchID into g1
from d in g1
select new{ Total = (decimal?)d.Net}).DefaultIfEmpty().Sum (at => (decimal?)at.Total ) ?? 0.0m,
Unappliedtotal = (from b in CustomerStatementBatchPayments.Where(r => r.CustomerStatementBatchID == a.BatchId)
.Where(s => !StorePaymentInvoices.Any (pi => pi.CustomerStatementBatchPaymentID == s.ID ) )
select new{ Total = (decimal?)b.Net}).DefaultIfEmpty().Sum (at => (decimal?)at.Total ) ?? 0.0m
})
.ToList();
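The intent of the two conditional sums above can be sketched in plain Python, independent of EF. The tuples and the `invoiced_payment_ids` set below are hypothetical stand-ins for the CustomerStatementBatchPayments and StorePaymentInvoices tables, and the sketch assumes each payment counts once no matter how many invoices reference it.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical payment rows: (payment_id, batch_id, net)
payments = [
    (1, "batch-1", Decimal("100.00")),
    (2, "batch-1", Decimal("50.00")),
    (3, "batch-2", Decimal("75.00")),
]
# IDs of payments that have a matching StorePaymentInvoices row (made up).
invoiced_payment_ids = {1, 3}

# One pass over the payments, routing each Net amount to the right bucket.
totals = defaultdict(lambda: {"applied": Decimal("0"), "unapplied": Decimal("0")})
for payment_id, batch_id, net in payments:
    bucket = "applied" if payment_id in invoiced_payment_ids else "unapplied"
    totals[batch_id][bucket] += net

for batch_id, sums in sorted(totals.items()):
    print(batch_id, sums["applied"], sums["unapplied"])
```

Splitting the work into "which payments are invoiced" and "sum per batch and bucket" mirrors the multi-step approach, and each step stays easy to check on its own.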
Try this:
from a in db.CustomerStatementBatches
// Pre-aggregate the payments per batch, then left join the sums to the batches.
join p in db.CustomerStatementBatchPayments
//.Where(i => ...)
.GroupBy(i => i.CustomerStatementBatchesId)
.Select(i => new {
CustomerStatementBatchesId = i.Key,
SumOfPayments = i.Sum(t => t.Net)
}
)
on a.CustomerStatementBatchesId equals p.CustomerStatementBatchesId into tmp
from b in tmp.DefaultIfEmpty()
select new
{
StatementId = a.CustomerStatementId,
BatchId = a.CustomerStatementBatchId,
// b is null for batches that have no payments at all
Applied = ((b == null) ? 0 : b.SumOfPayments)
}