IF/CASE statement when using Spark SQL with Cassandra - Scala

I am trying to transform data while selecting it from Cassandra into Spark using Scala.
When selecting the data, I would like to place the counts into a specific count_* column based on the value.
I am unable to find an IF/CASE statement to use with Spark SQL. Any ideas?
val results = csc.sql("""
  SELECT trip_sell_key, trip_veh_key, idle_stop_date, COUNT(*),
         SUM(CASE WHEN idle_stop_duration >= 0
                   AND idle_stop_duration < 5 THEN 1 ELSE 0 END)
  FROM veh_trip
""")

I'm not even sure your SQL is valid for Spark SQL; I don't remember whether Spark SQL supports CASE ... WHEN ... ELSE expressions.
Another point is that COUNT(*) and SUM(...) are aggregate functions, and since you also select non-aggregated columns (trip_sell_key, trip_veh_key, idle_stop_date), they can only work in conjunction with a GROUP BY clause, which is missing from your statement.
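For what it's worth, Spark SQL does accept CASE WHEN ... THEN ... ELSE ... END inside an aggregate. Here is a minimal sketch, assuming a CassandraSQLContext named csc and the veh_trip columns from the question (the column aliases are my own):
val results = csc.sql("""
  SELECT trip_sell_key,
         trip_veh_key,
         idle_stop_date,
         COUNT(*) AS total_stops,                    -- illustrative alias
         SUM(CASE WHEN idle_stop_duration >= 0
                   AND idle_stop_duration < 5 THEN 1 ELSE 0 END) AS count_0_5  -- illustrative alias
  FROM veh_trip
  GROUP BY trip_sell_key, trip_veh_key, idle_stop_date
""")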

Related

Pyspark window functions (lag and row_number) generate inconsistent results

I have been fighting an issue with window functions in pyspark for a few weeks now.
I have the following query to detect changes for a given key field:
rowcount = sqlContext.sql(f"""
with temp as (
select key, timestamp, originalcol, lag(originalcol,1) over (partition by key order by timestamp) as lg
from snapshots
where originalcol is not null
)
select count(1) from (
select *
from temp
where lg is not null
and lg != originalcol
)
""")
Data types are as follows:
key: string (not null)
timestamp: timestamp (unique, not null)
originalcol: timestamp
The snapshots table contains over a million records. This query produces a different row count on each execution: 27952, 27930, etc., while the expected count is 27942. It is only approximately correct, with a deviation of around 10 records, but this is not acceptable: running the same function twice with the same inputs should produce the same results.
I have a similar problem with row_number() over the same window, then filtering for row_number = 1, but I guess the issues are related.
I tried the query in an AWS Glue job as both pyspark and Athena SQL, and the inconsistencies are similar.
Any clue about what I am doing wrong here?
Spark is pretty picky about some silly things...
lg != originalcol does not match NULL values, so the first row of each window partition will always be filtered out (since the first value returned by LAG is always NULL).
The same thing happens when you use NULL in an IN statement. Another example where NULL filters rows out:
where test in (NULL, 1)
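A quick way to see this behaviour (a toy sketch with made-up data, assuming a SparkSession named spark; these are not the asker's tables):
import spark.implicits._

// Comparisons and IN lists involving NULL evaluate to NULL, not true,
// so WHERE silently drops those rows.
val df = Seq(Some(1), None, Some(2)).toDF("test")
df.where("test != 1").show()          // only the row with 2 survives; the NULL row is gone
df.where("test in (null, 1)").show()  // only the row with 1 survives; the NULL row is gone again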
After a bit of research, I discovered that column timestamp is not unique. Even though SQL Server manages to produce the same execution plan and results, pyspark and presto get confused with the order by clause in the window function and produce different results after each execution. If anything can be learned from this experience, it would be to double-check the partition and order by keys in a window function.
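One way to make the ordering deterministic, sketched here in Scala (assuming a SparkSession named spark and the snapshots table from the question; the choice of originalcol as the tie-breaker is an assumption, any column that orders rows uniquely within a key would do):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

// Break ties in the non-unique timestamp with a second sort key so that
// lag() sees the rows in the same order on every run.
val w = Window.partitionBy("key").orderBy(col("timestamp"), col("originalcol"))

val changes = spark.table("snapshots")
  .where(col("originalcol").isNotNull)
  .withColumn("lg", lag(col("originalcol"), 1).over(w))
  .where(col("lg").isNotNull && col("lg") =!= col("originalcol"))

println(changes.count())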

How to optimize broadcast join in spark Scala?

I am a new developer in Spark Scala and I want to improve my code by using a broadcast join.
As I understand it, a broadcast join can optimise the code when we join a large DataFrame with a small one. That is exactly my case: I have a first DataFrame (tab1 in my example) that contains more than 3 billion rows, which I have to join with a second one of only 900 rows.
Here is my SQL query:
SELECT tab1.id1, regexp_extract(tab2.emp_name, ".*?(\\d+)\\)$", 1) AS city,
       tab2.emp_id AS emp_id, tab1.emp_type
FROM table1 tab1
INNER JOIN table2 tab2
  ON (tab1.emp_type = tab2.emp_type AND tab1.start = tab2.code)
And here is my attempt to use a broadcast join:
val tab1 = df1.filter(""" id > 100 """).as("tab1")
val tab2 = df2.filter(""" id > 100 """).as("tab2")
val result = tab1.join(
  broadcast(tab2),
  col("tab1.emp_type") === col("tab2.emp_type") && col("tab1.start") === col("tab2.code"),
  "inner")
The problem is that this approach is not optimized at all: the result contains ALL the columns of the two tables, while I only need 3 of them plus the last one (with the regex on it). It is as if we generate a very big table first and then reduce it to a small one, whereas in SQL we get the small table directly.
So, after this step:
I have to use withColumn to generate the new column (with the regex)
I have to apply a select to keep only the 3 columns that I need, while in SQL I got them immediately (with no extra step, I mean)
Can you please help me optimize my code and my query?
Thanks in advance
You can select the columns you want before doing the join (filtering before the select, so the id column is still available to the filter):
df1.filter("id > 100").select("col1", "col2").as("table1")
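A fuller sketch of that idea in Scala, using the column names from the question (the exact select lists are illustrative assumptions): prune and transform each side first, so only the needed columns are broadcast and joined.
import org.apache.spark.sql.functions.{broadcast, col, regexp_extract}

// Keep only the columns the query actually uses before joining.
val big = df1.select(col("id1"), col("emp_type"), col("start"))
val small = df2.select(
  regexp_extract(col("emp_name"), ".*?(\\d+)\\)$", 1).as("city"),
  col("emp_id"),
  col("emp_type"),
  col("code"))

// Broadcast the ~900-row side; the regex has already been applied to it.
val result = big.join(
    broadcast(small),
    big("emp_type") === small("emp_type") && big("start") === small("code"),
    "inner")
  .select(big("id1"), small("city"), small("emp_id"), big("emp_type"))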

Spark SQL distinct and Scala distinct give different counts, in spite of the same input data

spark.sql("select count(distinct *) from apple1 where data1_ts = 201804").show
10871344
apple1.filter(col("data1_ts") === "201804").distinct.count
20573671
Any idea why? Both use distinct on the same table in S3, selecting the same particular directory, but they give different values.

Most efficient way to select and process data from a dataframe

I would like to load and process data from a dataframe in Spark using Scala.
The raw SQL Statement looks like this:
INSERT INTO TABLE_1
(
key_attribute,
attribute_1,
attribute_2
)
SELECT
MIN(TABLE_2.key_attribute),
CURRENT_TIMESTAMP as attribute_1,
'Some_String' as attribute_2
FROM TABLE_2
LEFT OUTER JOIN TABLE_1
ON TABLE_2.key_attribute = TABLE_1.key_attribute
WHERE
TABLE_1.key_attribute IS NULL
AND TABLE_2.key_attribute IS NOT NULL
GROUP BY
attribute_1,
attribute_2,
TABLE_2.key_attribute
What I've done so far:
I created a DataFrame from the Select Statement and joined it with the TABLE_2 DataFrame.
val table_1 = spark.sql("select key_attribute, current_timestamp() as attribute_1, 'Some_String' as attribute_2 from TABLE_1")
table_2.join(table_1, Seq("key_attribute"), "left_outer")
Not really much progress, because I am facing too many difficulties:
How do I handle the SELECT and process the data efficiently? Should I keep everything in separate DataFrames?
How do I apply the WHERE/GROUP BY clauses with attributes from several sources?
Is there any other/better way apart from Spark SQL?
A few steps in handling this are (a sketch follows below):
First, create the DataFrame with your raw data.
Then save it as a temp table.
You can use filter() or a WHERE condition in Spark SQL to get the resulting DataFrame.
Then, as you did, you can make use of joins with DataFrames. You can think of a DataFrame as a representation of a table.
Regarding efficiency: since the processing is done in parallel, that is taken care of. If you want anything more specific regarding efficiency, please mention it.
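A minimal sketch of those steps in Scala, assuming a SparkSession named spark and DataFrames table_1 and table_2 already loaded for TABLE_1 and TABLE_2 (the temp-view names and the final write call are assumptions):
import org.apache.spark.sql.functions.{current_timestamp, lit}

// Steps 1-2: register the raw DataFrames as temp views.
table_1.createOrReplaceTempView("table_1_v")
table_2.createOrReplaceTempView("table_2_v")

// Step 3: the WHERE/GROUP BY logic from the raw statement, expressed in Spark SQL.
val missing = spark.sql("""
  SELECT MIN(t2.key_attribute) AS key_attribute
  FROM table_2_v t2
  LEFT OUTER JOIN table_1_v t1
    ON t2.key_attribute = t1.key_attribute
  WHERE t1.key_attribute IS NULL
    AND t2.key_attribute IS NOT NULL
  GROUP BY t2.key_attribute
""")

// Step 4: add the constant columns and append the rows to TABLE_1
// (insertInto is an assumption about how TABLE_1 is stored).
missing
  .withColumn("attribute_1", current_timestamp())
  .withColumn("attribute_2", lit("Some_String"))
  .write.mode("append").insertInto("TABLE_1")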

How to group by in cypher efficiently?

I translated the SQL query below to Cypher. GROUP BY in Cypher is implicit, which causes confusion and longer query execution time. My SQL query is:
INSERT INTO tmp_build
(result_id, hshld_id, product_id)
SELECT b.result_id, a.hshld_id, b.cluster_id
FROM fact a
INNER JOIN productdata b ON a.product_id = b.barcode
WHERE b.result_id = 1
GROUP BY b.result_id, a.hshld_id, b.cluster_id;
The equivalent Cypher query is:
MATCH (b:PRODUCTDATA {RESULT_ID: 1 })
WITH b
MATCH (b)<-[:CREATES_PRODUCTDATA]-(a:FACT)
WITH b.RESULT_ID as RESULT_ID , collect(b.RESULT_ID) as result, a.HSHLD_ID as HSHLD_ID,
collect(a.HSHLD_ID) as hshld, b.CLUSTER_ID as CLUSTER_ID, collect(b.CLUSTER_ID) as cluster
CREATE (:TMP_BUILD { RESULT_ID:RESULT_ID , HSHLD_ID:HSHLD_ID , PRODUCT_ID:CLUSTER_ID });
This query runs slowly because of collect(). Without the collect() function it does not give me the GROUP BY results. Is there any way to optimise it, or a better implementation of GROUP BY in Cypher?
In the Cypher query, you are attempting to return rows with both singular values (RESULT_ID, HSHLD_ID, CLUSTER_ID) and their collections, but since you're returning both, each collection will only contain the same value repeated as many times as it occurs in the results (for example, RESULT_ID = 1, result = [1,1,1,1]). I don't think that's useful for you.
Also, nothing in your original query seems to suggest you need aggregations. Your GROUP BY columns are the only columns being returned, there are no aggregation columns, so that seems like you just need distinct rows. Try removing the collection columns from your Cypher query, and use WITH DISTINCT instead of just WITH.
If that doesn't work, then I think you will need to further explain exactly what it is that you are attempting to get as the result.
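For reference, a rough sketch of what that suggestion looks like (untested; it keeps the labels and property names from the question):
// Distinct rows replace the collect()-based grouping.
MATCH (b:PRODUCTDATA {RESULT_ID: 1})<-[:CREATES_PRODUCTDATA]-(a:FACT)
WITH DISTINCT b.RESULT_ID AS RESULT_ID, a.HSHLD_ID AS HSHLD_ID, b.CLUSTER_ID AS CLUSTER_ID
CREATE (:TMP_BUILD { RESULT_ID: RESULT_ID, HSHLD_ID: HSHLD_ID, PRODUCT_ID: CLUSTER_ID });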