Converting a SQL query to Spark Scala

I have a SQL query that I want to convert to Spark Scala:
SELECT aid,DId,BM,BY
FROM (SELECT DISTINCT aid,DId,BM,BY,TO FROM SU WHERE cd =2) t
GROUP BY aid,DId,BM,BY HAVING COUNT(*) >1;
SU is my DataFrame. I did this with:
sqlContext.sql("""
SELECT aid,DId,BM,BY
FROM (SELECT DISTINCT aid,DId,BM,BY,TO FROM SU WHERE cd =2) t
GROUP BY aid,DId,BM,BY HAVING COUNT(*) >1
""")
Instead of that, I need to express the same thing using the DataFrame API.

This should be the DataFrame equivalent:
SU.filter($"cd" === 2)
.select("aid","DId","BM","BY","TO")
.distinct()
.groupBy("aid","DId","BM","BY")
.count()
.filter($"count" > 1)
.select("aid","DId","BM","BY")
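As a sanity check of what the query computes (the keys that, after the DISTINCT, still cover more than one TO value), here is a small Python/sqlite3 sketch; the sample rows are made up, and BY and TO are double-quoted because they are reserved words in SQLite:

```python
import sqlite3

# Toy check of the DISTINCT -> GROUP BY -> HAVING COUNT(*) > 1 logic.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE SU (aid, DId, BM, "BY", "TO", cd)')
con.executemany("INSERT INTO SU VALUES (?,?,?,?,?,?)", [
    (1, 1, "a", "x", 10, 2),  # same key as next row, different TO -> kept
    (1, 1, "a", "x", 20, 2),
    (1, 1, "a", "x", 20, 2),  # exact duplicate, removed by DISTINCT
    (2, 2, "b", "y", 30, 2),  # key appears once -> dropped by HAVING
    (3, 3, "c", "z", 40, 1),  # cd != 2 -> dropped by WHERE
])
rows = con.execute('''
    SELECT aid, DId, BM, "BY"
    FROM (SELECT DISTINCT aid, DId, BM, "BY", "TO" FROM SU WHERE cd = 2) t
    GROUP BY aid, DId, BM, "BY" HAVING COUNT(*) > 1
''').fetchall()
print(rows)  # [(1, 1, 'a', 'x')]
```

The DataFrame chain above performs the same steps in the same order: filter, project, de-duplicate, group, count, filter.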

Related

Convert pyspark code to snowflake to create row level policy in snowflake

I am trying to convert PySpark code into a Snowflake row access policy.
I am new to Snowflake and not sure how to add the split and case logic to a Snowflake row-level policy.
PySpark code:
df=df.withColumn('first_part',upper(split(col('id'),'#').getItem(0)))\
.withColumn('last_part',split(col('id'),'#').getItem(1))
df.createOrReplaceTempView("df_table")
df1.createOrReplaceTempView("df1_table")
joined_df=spark.sql("select c.*, case when first_part == 'emp' then '1' else p.flag end as flag, p.agreement_date from df_table c left join df1_table p on c.last_part = p.empid")
Snowflake part
create or replace row access policy policy.policy_row_hi as (col1 varchar) returns boolean ->
exists
(select 1
from schema.table1 t1
inner join schema.table2 t2 on (t2.oid = t1.oid)
where t2.empid = col1
and t1.flag = '1'
);

Selecting row(s) that have a distinct count (one) of a certain column

I have the following dataset:
org system_id punch_start_tb1 punch_start_tb2
CG 100242 2022-08-16T00:08:00Z 2022-08-16T03:08:00Z
LA 250595 2022-08-16T00:00:00Z 2022-08-16T03:00:00Z
LB 300133 2022-08-15T04:00:00Z 2022-08-16T04:00:00Z
LB 300133 2022-08-16T04:00:00Z 2022-08-15T04:00:00Z
MO 400037 2022-08-15T14:00:00Z 2022-08-15T23:00:00Z
MO 400037 2022-08-15T23:00:00Z 2022-08-15T14:00:00Z
I am trying to filter the data so that a row appears in the outcome only when the count of its "system_id" is 1.
So, the expected outcome would be only following two rows:
org system_id punch_start_tb1 punch_start_tb2
CG 100242 2022-08-16T00:08:00Z 2022-08-16T03:08:00Z
LA 250595 2022-08-16T00:00:00Z 2022-08-16T03:00:00Z
I tried with a GROUP BY and HAVING clause, but I did not have success.
You can try the below, counting rows per system_id with a window function and keeping only the groups of one:
SELECT org,system_id,punch_start_tb1,punch_start_tb2 FROM
(
SELECT org,system_id,punch_start_tb1,punch_start_tb2
,COUNT(*)OVER(PARTITION BY system_id)CNT
FROM <TableName>
)X
WHERE CNT = 1
The CTE returns the org values that have only one record; then join it with the main table on the org column:
;WITH CTE AS (
select org
from <table_name>
group by org
Having count(1) = 1
)
select t.*
from cte
inner join <table_name> t on cte.org = t.org
You can try this (use MIN because each surviving group has only one row):
select MIN(org), system_id, MIN(punch_start_tb1), MIN(punch_start_tb2)
from <table_name>
group by system_id
Having count(1) = 1
Or use @Meyssam Toluie's answer with GROUP BY system_id instead.
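To verify, here is the GROUP BY ... HAVING COUNT(*) = 1 approach run over the question's sample data with Python's sqlite3 (the table name "punches" is an assumption):

```python
import sqlite3

# The question's six sample rows, loaded into an in-memory table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE punches (org, system_id, punch_start_tb1, punch_start_tb2)")
con.executemany("INSERT INTO punches VALUES (?,?,?,?)", [
    ("CG", 100242, "2022-08-16T00:08:00Z", "2022-08-16T03:08:00Z"),
    ("LA", 250595, "2022-08-16T00:00:00Z", "2022-08-16T03:00:00Z"),
    ("LB", 300133, "2022-08-15T04:00:00Z", "2022-08-16T04:00:00Z"),
    ("LB", 300133, "2022-08-16T04:00:00Z", "2022-08-15T04:00:00Z"),
    ("MO", 400037, "2022-08-15T14:00:00Z", "2022-08-15T23:00:00Z"),
    ("MO", 400037, "2022-08-15T23:00:00Z", "2022-08-15T14:00:00Z"),
])
# Groups of exactly one row survive; MIN is a no-op on a single row.
rows = con.execute("""
    SELECT MIN(org), system_id, MIN(punch_start_tb1), MIN(punch_start_tb2)
    FROM punches
    GROUP BY system_id
    HAVING COUNT(*) = 1
    ORDER BY system_id
""").fetchall()
print([r[0] for r in rows])  # ['CG', 'LA']
```

Only the CG and LA rows come back, matching the expected outcome.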

Loop through a list of queries to be executed and append the results to a DataFrame

I need to loop through each element in the list, run each query against the database, and append the results to the same DataFrame (df). Could you please let me know how to achieve this?
PS: I am using Spark Scala for this.
List((select * from table1 where a=10 ) as rules,
(select * from table1 where b=10) as rules,
(select * from table1 where c=10 ) as rules)
Thank you.
As you load data from the same table table1, you can simply combine the conditions with or in the where clause:
val df = spark.sql("select * from table1 where a=10 or b=10 or c=10")
If the queries are on different tables, you can load them into a list of DataFrames and then union them:
val queries = List(
"select * from table1 where a=10",
"select * from table1 where b=10",
"select * from table1 where c=10"
)
val df = queries.map(spark.sql).reduce(_ union _) // unionAll is deprecated since Spark 2.0
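The same map-then-union idea can be illustrated without Spark; this Python/sqlite3 sketch strings the three queries together with UNION ALL, which, like a DataFrame union, keeps duplicates. The sample rows are made up:

```python
import sqlite3

# A small stand-in for the Spark map/reduce-union pattern: run each query
# in the list and concatenate the result sets with UNION ALL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (a, b, c)")
con.executemany("INSERT INTO table1 VALUES (?,?,?)",
                [(10, 1, 1), (1, 10, 1), (1, 1, 10), (1, 1, 1)])
queries = [
    "select * from table1 where a=10",
    "select * from table1 where b=10",
    "select * from table1 where c=10",
]
combined = " union all ".join(f"select * from ({q})" for q in queries)
rows = con.execute(combined).fetchall()
print(sorted(rows))  # [(1, 1, 10), (1, 10, 1), (10, 1, 1)]
```

As in the Spark answer, a single query with or-ed conditions would give the same rows here without the union step.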

Postgres: make repeated subqueries more efficient?

I have a Postgres 9.6 database with two tables, template and project.
template
id integer
name varchar
project
id integer
name varchar
template_id integer (foreign key)
is_deleted boolean
is_listed boolean
I want to get a list of all templates, with a count of the projects for each template, and a count of the deleted projects for each template, i.e. this type of output
id,name,num_projects,num_deleted,num_listed
1,"circle",19,2,7
2,"square",10,0,8
I have a query like this:
select id, name,
(select count(*) from project where template_id=template.id)
as num_projects,
(select count(*) from project where template_id=template.id and is_deleted)
as num_deleted,
(select count(*) from project where template_id=template.id and is_listed)
as num_listed
from template;
However, looking at the EXPLAIN, this isn't very efficient as the large project table is queried separately three times.
Is there any way to get Postgres to query and iterate over the project table just once?
The query could be rewritten as:
SELECT t.id, t.name,
COUNT(p.template_id) as num_projects,
COUNT(p.template_id) FILTER(WHERE p.is_deleted) as num_deleted,
COUNT(p.template_id) FILTER(WHERE p.is_listed) as num_listed
FROM template t
LEFT JOIN project p
ON p.template_id=t.id
GROUP BY t.id, t.name
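Since SQLite also supports the aggregate FILTER clause (Postgres has had it since 9.4, so it works on 9.6), the single-pass shape can be checked with a quick Python/sqlite3 sketch; the toy rows are assumptions, with booleans stored as 0/1:

```python
import sqlite3

# Toy data: template 1 has three projects (one deleted, two listed),
# template 2 has none -- the LEFT JOIN must still report it with zeros.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE template (id, name)")
con.execute("CREATE TABLE project (id, name, template_id, is_deleted, is_listed)")
con.executemany("INSERT INTO template VALUES (?,?)", [(1, "circle"), (2, "square")])
con.executemany("INSERT INTO project VALUES (?,?,?,?,?)", [
    (1, "p1", 1, 1, 0),
    (2, "p2", 1, 0, 1),
    (3, "p3", 1, 0, 1),
])
# COUNT(p.template_id) counts only matched rows, so an unmatched
# template yields 0 rather than 1.
rows = con.execute("""
    SELECT t.id, t.name,
           COUNT(p.template_id) AS num_projects,
           COUNT(p.template_id) FILTER (WHERE p.is_deleted) AS num_deleted,
           COUNT(p.template_id) FILTER (WHERE p.is_listed) AS num_listed
    FROM template t
    LEFT JOIN project p ON p.template_id = t.id
    GROUP BY t.id, t.name
    ORDER BY t.id
""").fetchall()
print(rows)  # [(1, 'circle', 3, 1, 2), (2, 'square', 0, 0, 0)]
```

The project table is scanned once; the three counts fall out of the same pass.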
Sometimes, doing the aggregation before joining is more efficient than aggregating the result of a join.
SELECT t.id, t.name,
coalesce(p.num_projects, 0) as num_projects,
coalesce(p.num_deleted, 0) as num_deleted,
coalesce(p.num_listed, 0) as num_listed
FROM template t
LEFT JOIN (
SELECT template_id,
count(*) as num_projects,
count(*) filter (where is_deleted) as num_deleted,
count(*) filter (where is_listed) as num_listed
FROM project
GROUP BY template_id
) p ON p.template_id = t.id

How to get the number of results from a query with pgsql in Processing?

For example, if we have a query like:
pgsql.query("SELECT * FROM table_name");
We can get the result by:
pgsql.getString("field_name");
How to get the result from:
pgsql.query("SELECT count(*) FROM table_name");
Give the count(*) column an alias:
pgsql.query("SELECT count(*) as the_count FROM table_name");
Then use the appropriate getter to read that column:
pgsql.getString("the_count");
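The aliasing trick itself can be demonstrated with Python's sqlite3 in place of the Processing pgsql helper (the table and rows here are made up):

```python
import sqlite3

# Alias count(*) so the result can be fetched by column name,
# as in the answer above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table_name (field_name)")
con.executemany("INSERT INTO table_name VALUES (?)", [("a",), ("b",), ("c",)])
cur = con.execute("SELECT count(*) AS the_count FROM table_name")
row = dict(zip([d[0] for d in cur.description], cur.fetchone()))
print(row["the_count"])  # 3
```

Without the alias, the column name of a bare count(*) is driver-dependent, which is exactly why the answer suggests naming it explicitly.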