How to use "with table as" in pyspark sql function? - pyspark

The code below raises an error:
spark.sql("""
    WITH q AS (
        SELECT *
        FROM df
        WHERE parent_id = 59
        UNION ALL
        SELECT d.*
        FROM df d
        JOIN q
        ON d.parent_id = q.id
    )
    SELECT *
    FROM q
""").show()
The error is:
AnalysisException: Table or view not found: q; line 1 pos 221;
So I need to create a view of 'q', but how?

Related

spark-rdbms: Overwrite mode works differently from Append

I am on Spark 3.0.0-preview and trying to save a dataset to a PostgreSQL database. These are the steps I am following:
1. Get the data from table A.
2. Get the data from table B (same structure as A).
3. Do a left anti join between table A and B, to get the rows from table B which are not in table A.
4. Union table A with the outcome of step 3, to get the unique rows from tables A and B.
5. Save the result with Overwrite mode to table B.
Actual: only rows from table_a get updated in the DB.
Expected: the union of the records from table A and step 3 should end up in the database.
Analysis: if I use mode 'Append' the record count is correct, but I want to truncate the table, not append to it.
Code:
val spark = SparkSession.builder.master("local[*]").appName("Testing")
.getOrCreate()
spark.sparkContext.setLogLevel("WARN")
val tableA = spark.read.format("jdbc").option("url", "jdbc:postgresql://localhost:5432/test")
.option("user", "sample")
.option("password", "sample")
.option("query", "select t.uid, t.employer_key, t.name from (select uid, employer_key, name , row_number() over(partition by employer_key order by updated_at desc) as rn from test.table_a) t where t.rn = 1")
.load()
val tableB = spark.read.format("jdbc").option("url", "jdbc:postgresql://localhost:5432/test")
.option("user", "sample")
.option("password", "sample")
.option("query", "select t.uid, t.employer_key, t.name from test.table_b t")
.load()
val nonUpdatedDFRows = tableB.join(tableA, tableB("employer_key") === tableA("employer_key"), "leftanti")
nonUpdatedDFRows.show(5) //Working correctly
val refreshDF = nonUpdatedDFRows.unionByName(tableA)
refreshDF.show(5) //Working correctly
refreshDF.write.format("jdbc").option("url", "jdbc:postgresql://localhost:5432/test")
.option("user", "sample")
.option("password", "sample")
.option("dbtable", "test.table_b")
.option("truncate", "true").mode("overwrite")
.save();
//only rows from table_a get updated in the DB but if I change the mode to Append, it will work fine.
The problem in my code was that I was trying to overwrite the same table I was reading from.
To solve it, I first had to cache the DataFrame, as below:
val tableB = spark.read.format("jdbc").option("url",
"jdbc:postgresql://localhost:5432/test")
.option("user", "sample")
.option("password", "sample")
.option("query", "select t.uid, t.employer_key, t.name from test.table_b t")
.load().cache()
Then run an action to force Spark to actually load the data, as below:
tableB.show(2)

SQLAlchemy ORM: LEFT JOIN LATERAL() ON TRUE

I am trying to replicate the following raw query:
SELECT r.id, r.name, e.id, e.title, e.start, e.end
FROM room r
LEFT JOIN LATERAL (
SELECT evt.id, evt.title, evt.start, evt.end
FROM event evt, calendar cal
WHERE
r.calendar_id=cal.id AND evt.calendar_id=cal.id AND evt.end>%(start)s
ORDER BY abs(extract(epoch from (evt.start - %(start)s)))
LIMIT 1
) e ON TRUE
WHERE r.company_id=%(company_id)s;
with the SQLAlchemy ORM:
start = datetime.datetime.now()
company_id = 6
event_include = session.query(
        Event.id,
        Event.title,
        Event.start,
        Event.end) \
    .filter(
        Room.calendar_id == Calendar.id,
        Event.calendar_id == Calendar.id,
        Event.end > start,
    ) \
    .order_by(func.abs(func.extract('epoch', Event.start - start))) \
    .limit(1) \
    .subquery() \
    .lateral()
query = session.query(Room.id, Room.name, event_include) \
    .filter(Room.company_id == company_id)
Which produces the following SQL:
SELECT room.id AS room_id, room.name AS room_name, anon_1.id AS anon_1_id, anon_1.title AS anon_1_title, anon_1.start AS anon_1_start, anon_1."end" AS anon_1_end
FROM room, LATERAL (
SELECT event.id AS id, event.title AS title, event.start AS start, event."end" AS "end"
FROM event, calendar
WHERE room.calendar_id = calendar.id AND event.calendar_id = calendar.id AND event."end" > %(end_1)s
ORDER BY abs(EXTRACT(epoch FROM event.start - %(start_1)s))
LIMIT %(param_1)s) AS anon_1
WHERE room.company_id = %(company_id_1)s
This returns all the rooms and their next calendar event, but only if there is a next calendar event available. It needs to be a LEFT JOIN LATERAL() ON TRUE, but I'm not sure how to do that.
Any help here would be great.
Use outerjoin with a true() expression:
from sqlalchemy import true
query = session.query(Room.id, Room.name, event_include) \
    .outerjoin(event_include, true()) \
    .filter(Room.company_id == company_id)
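To see the generated SQL without connecting to a database, the whole query can be compiled against the PostgreSQL dialect. A self-contained sketch with simplified, hypothetical versions of the Room/Event/Calendar models (column names taken from the question; the lateral's columns are selected via .c for compatibility with newer SQLAlchemy):

```python
import datetime

from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, func, true
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

# Hypothetical, simplified models mirroring the question's tables.
class Calendar(Base):
    __tablename__ = "calendar"
    id = Column(Integer, primary_key=True)

class Room(Base):
    __tablename__ = "room"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    company_id = Column(Integer)
    calendar_id = Column(Integer, ForeignKey("calendar.id"))

class Event(Base):
    __tablename__ = "event"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    start = Column(DateTime)
    end = Column(DateTime)
    calendar_id = Column(Integer, ForeignKey("calendar.id"))

start = datetime.datetime.now()
session = Session()  # no bind needed just to render SQL

# Correlated subquery turned into a LATERAL; room/calendar correlate implicitly.
event_include = (
    session.query(Event.id, Event.title, Event.start, Event.end)
    .filter(
        Room.calendar_id == Calendar.id,
        Event.calendar_id == Calendar.id,
        Event.end > start,
    )
    .order_by(func.abs(func.extract("epoch", Event.start - start)))
    .limit(1)
    .subquery()
    .lateral()
)

# outerjoin(..., true()) produces LEFT OUTER JOIN LATERAL (...) ON true.
query = (
    session.query(Room.id, Room.name, event_include.c.id, event_include.c.title)
    .outerjoin(event_include, true())
    .filter(Room.company_id == 6)
)

sql = str(query.statement.compile(dialect=postgresql.dialect()))
print(sql)
```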

Converting a SQL query to Spark

I have a SQL query which I want to convert to Spark Scala:
SELECT aid,DId,BM,BY
FROM (SELECT DISTINCT aid,DId,BM,BY,TO FROM SU WHERE cd =2) t
GROUP BY aid,DId,BM,BY HAVING COUNT(*) >1;
SU is my DataFrame. I did this with:
sqlContext.sql("""
SELECT aid,DId,BM,BY
FROM (SELECT DISTINCT aid,DId,BM,BY,TO FROM SU WHERE cd =2) t
GROUP BY aid,DId,BM,BY HAVING COUNT(*) >1
""")
Instead of that, I need to express this using the DataFrame API directly.
This should be the DataFrame equivalent:
SU.filter($"cd" === 2)
.select("aid","DId","BM","BY","TO")
.distinct()
.groupBy("aid","DId","BM","BY")
.count()
.filter($"count" > 1)
.select("aid","DId","BM","BY")

SetSortMode in sphinx not working with delta index

My sphinx.conf file contains something like the following; I am using a delta index to keep indexing quick.
source main
{
#...
sql_query_pre = SET NAMES utf8
sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM photo
sql_query = \
SELECT p.id AS id, p.search AS search, COUNT(li.id) AS total_likes \
FROM `photo` p \
LEFT JOIN `like` li \
ON p.id = li.photo_id \
WHERE p.id <= ( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 ) \
GROUP BY \
p.id
#...
sql_query_info = SELECT * FROM photo WHERE id=$id
}
source delta : main
{
sql_query_pre = SET NAMES utf8
sql_query = \
SELECT p.id AS id, p.search AS search, COUNT(li.id) AS total_likes \
FROM `photo` p \
LEFT JOIN `like` li \
ON p.id = li.photo_id \
WHERE p.id > ( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 ) \
GROUP BY \
p.id
}
And in PHP, when I retrieve the data, I also want to apply sorting:
$s->SetSortMode(SPH_SORT_EXTENDED, '#relevance DESC, total_likes DESC, #id DESC');
$result = $s->Query($data['query'], "delta main");
Sorting worked fine when I had only the main index. But now when I search with both indexes, results from the delta index are appended at the front. What I actually want is for results from both indexes to be fetched and then sorted according to my preferences, i.e. #relevance DESC, total_likes DESC, #id DESC in my case; total_likes should take precedence over id.
Thanks @barryhunter for the solution. The fix was that in the delta index, the second sql_query_pre inherited from main had to be overridden with an empty one:
sql_query_pre = SET NAMES utf8
sql_query_pre =
sql_query = \

Dynamic JPA CriteriaBuilder for hierarchical data

I have a hierarchical data structure where nodes are mapped to their parent node as follows:
@Entity
public class Node implements Serializable {
    @Id
    private long id;

    @Column(name = "PARENT_ID")
    private Long parentId;

    @OneToMany
    @JoinColumn(name = "PARENT_ID")
    public List<Node> children = new LinkedList<Node>();
}
So, for example, let's say I have the following data:
        [A]
       /   \
    [B]     [C]
    / \       \
  [D] [E]     [F]
        \
        [G]
Now I want to build a dynamic query with the JPA CriteriaBuilder that can query any node and return all of its children as well. For example, if I query for B, I get the following results:
B
D
E
G
And, if I query for E, I get:
E
G
And so on...
Since I'm using SQL Server 2012 as my database, I have used a recursive WITH clause, as follows. Assuming node [B] has id 8:
Top to Bottom:
WITH NODE_TREE AS(
SELECT N.ID, N.PARENT_ID FROM NODE_TABLE N WHERE N.ID = 8
UNION ALL
SELECT N.ID, N.PARENT_ID FROM NODE_TABLE N
INNER JOIN NODE_TREE NT
ON N.PARENT_ID = NT.ID
)
SELECT * FROM NODE_TREE;
This will return a top-to-bottom list of nodes:
B, E, D, G
Bottom to Top:
WITH NODE_TREE AS(
SELECT N.ID, N.PARENT_ID FROM NODE_TABLE N WHERE N.ID = 8
UNION ALL
SELECT N.ID, N.PARENT_ID FROM NODE_TABLE N
INNER JOIN NODE_TREE NT
ON N.ID = NT.PARENT_ID
)
SELECT * FROM NODE_TREE;
This will return a bottom-to-top list of nodes:
B, A