Spark SQL COALESCE - scala

I have extracted the coalesce value from a table using Spark SQL. Then I'm converting the result to String so that I can INSERT that value into another table.
However, the column name of the COALESCE is getting inserted into the table instead of the coalesce value.
These are my COALESCE and INSERT queries,
COALESCE:
---------
val lastPartition = spark.sql("SELECT COALESCE(MAX(partition_name), 'XXXXX') FROM db1.table1").toString.mkString
Result:
-------
COALESCE(MAX(partition_name), XXXXX)
20210309
INSERT:
-------
val result = spark.sql(s"""INSERT INTO db2.table2 VALUES ('col1','col2','${lastPartition}','col4')""")
Result:
--------
col1 col2 col3 col4
1 John [COALESCE(MAX(partition_name),XXXXX):string] 15313.21
Here, I want the value of column (col3) to be 20210309 and not the coalesce column name.

You need to use .head().getString(0) to get the value as a String. Calling .toString on the DataFrame only returns a string representation of the DataFrame itself (essentially its schema), not the query result, because the query is not executed until an action such as head() is called.
val lastPartition = spark.sql("SELECT COALESCE(MAX(partition_name), 'XXXXX') FROM db1.table1").head().getString(0)
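Putting both pieces from the question together, a minimal sketch of the corrected flow (same table and column names as above):
val lastPartition: String = spark
  .sql("SELECT COALESCE(MAX(partition_name), 'XXXXX') FROM db1.table1")
  .head()          // action: actually executes the query
  .getString(0)    // value of the first (and only) column of the first row

// lastPartition now holds e.g. "20210309", so the interpolation inserts the value, not the expression
spark.sql(s"INSERT INTO db2.table2 VALUES ('col1','col2','${lastPartition}','col4')")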

Related

How to retrieve column value by passing another column value with IN clause in spark

I have a scenario where I need to read a column from a DataFrame by filtering on another column of the same DataFrame, and then pass those values as an IN condition to select the matching rows from a second DataFrame. How can I achieve this with Spark DataFrames?
In SQL it will be like:
select distinct(A.date) from table A where A.key in (select B.key from table B where cond='D');
I tried like below:
val Bkey: DataFrame = b_df.filter(col("cond")==="D").select(col("key"))
I have the table A data in the a_df DataFrame and the table B data in the b_df DataFrame. How can I pass the Bkey values to the outer query and achieve this in Spark?
You can do a semi join:
val result = a_df.join(b_df.filter(col("cond")==="D"), Seq("key"), "left_semi").select("date").distinct()
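For illustration, a small self-contained sketch of the semi join (the data and key values are made up; only the column names key, date and cond come from the question's SQL):
import org.apache.spark.sql.functions.col
import spark.implicits._

val a_df = Seq(("k1", "2021-03-01"), ("k2", "2021-03-02"), ("k3", "2021-03-03")).toDF("key", "date")
val b_df = Seq(("k1", "D"), ("k2", "X")).toDF("key", "cond")

val result = a_df
  .join(b_df.filter(col("cond") === "D"), Seq("key"), "left_semi") // keeps only a_df rows whose key exists in the filtered b_df
  .select("date")
  .distinct()

result.show()  // only 2021-03-01, because only k1 satisfies cond = 'D'
A left_semi join returns only the columns of the left DataFrame and keeps the rows that have a match on the right, which is exactly the semantics of the SQL IN subquery.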

Transform rows to multiple rows in Spark Scala

I have a problem where I need to transform one row to multiple rows. This is based on a different mapping that I have. I have tried to provide an example below.
Suppose I have a parquet file with the below schema
ColA, ColB, ColC, Size, User
I need to aggregate the above data into multiple rows based on a lookup map. Suppose I have a static map
ColA, ColB, Sum(Size)
ColB, ColC, Distinct (User)
ColA, ColC, Sum(Size)
This means that one row in the input RDD needs to be transformed into 3 aggregated rows. I believe an RDD with flatMapToPair is the way to go, but I am not sure how to go about it.
I am also OK with concatenating the columns into one key, e.g. ColA_ColB.
For creating multiple aggregates from the same data, I have started with something like this:
val keyData: PairFunction[Row, String, Long] = new PairFunction[Row, String, Long]() {
  override def call(x: Row) = {
    (x.getString(1), x.getLong(5))
  }
}
val ip15M = spark.read.parquet("a.parquet").toJavaRDD
val pairs = ip15M.mapToPair(keyData)
java.util.List[(String, Long)] = [(ios,22), (ios,23), (ios,10), (ios,37), (ios,26), (web,52), (web,1)]
I believe I need to use flatMapToPair instead of mapToPair. Along similar lines, I tried
val FlatMapData: PairFlatMapFunction[Row, String, Long] = new PairFlatMapFunction[Row, String, Long]() {
  override def call(x: Row) = {
    (x.getString(1), x.getLong(5))
  }
}
but it gives the error
Expression of type (String, Long) doesn't conform to expected type util.Iterator[(String, Long)]
Any help is appreciated. Please let me know if I need to add any more details.
Should the outcome only have 3 columns, i.e. col1, col2, col3 (the aggregated result)? And is the second aggregate a distinct count of users? (I assume yes.)
If so, you can basically create 3 DataFrames and then union them.
Something along the lines of:
val df1 = spark.sql("select colA as col1, colB as col2 ,sum(Size) as colAgg group by colA,colB")
val df2 = spark.sql("select colB as col1, colC as col2 ,Distinct(User) as colAgg group by colB,colC")
val df3 = spark.sql("select colA as col1, colC as col2 ,sum(Size) as colAgg group by colA,colC")
df1.union(df2).union(df3)
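The same thing can be written with the DataFrame API instead of SQL strings, which sidesteps the flatMapToPair issue entirely (that error happens because PairFlatMapFunction.call must return a java.util.Iterator of tuples, not a single tuple). A sketch, assuming the parquet file from the question with columns ColA, ColB, ColC, Size, User:
import org.apache.spark.sql.functions.{sum, countDistinct}

val df = spark.read.parquet("a.parquet")

val agg1 = df.groupBy("ColA", "ColB").agg(sum("Size").as("colAgg")).toDF("col1", "col2", "colAgg")
val agg2 = df.groupBy("ColB", "ColC").agg(countDistinct("User").as("colAgg")).toDF("col1", "col2", "colAgg")
val agg3 = df.groupBy("ColA", "ColC").agg(sum("Size").as("colAgg")).toDF("col1", "col2", "colAgg")

val result = agg1.union(agg2).union(agg3)  // 3 columns: col1, col2, colAgg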

KDB: How to assign string datatype to all columns

When I created the table Tab, I specified the columns as string,
Tab: ([Key1:string()] Col1:string();Col2:string();Col3:string())
But the column datatype (t) is empty. I suppose specifying the column as string has no effect.
meta Tab
c t f a
--------------------
Key1
Col1
Col2
Col3
After I do a bulk upsert in Java...
c.Dict dict = new c.Dict((Object[]) columns.toArray(new String[columns.size()]), data);
c.Flip flip = new c.Flip(dict);
conn.c.ks("upsert", table, flip);
The datatypes are all symbols:
meta Tab
c t f a
--------------------
Key1 s
Col1 s
Col2 s
Col3 s
How can I specify the datatype of the columns as string and have it remain as string?
You can't define the columns of an empty table as strings, because strings are merely lists of characters (so a string column is a list of lists).
You can just set them as empty lists, which is what your code is doing.
The column will then take on the type of whatever data is first inserted into it.
The real question is why your Java process is sending symbols when it should be sending strings; you need to make that change there before publishing to KDB.
Note that even if you define the columns as char you still won't be able to upsert strings:
q)Tab: ([Key1:`char$()] Col1:`char$();Col2:`char$();Col3:`char$())
q)Tab upsert ([Key1:enlist"test"] Col1:enlist"test";Col2:enlist"test";Col3:enlist "test")
'rank
[0] Tab upsert ([Key1:enlist"test"] Col1:enlist"test";Col2:enlist"test";Col3:enlist "test")
^
q)Tab: ([Key1:()] Col1:();Col2:();Col3:())
q)Tab upsert ([Key1:enlist"test"] Col1:enlist"test";Col2:enlist"test";Col3:enlist "test")
Key1 | Col1 Col2 Col3
------| --------------------
"test"| "test" "test" "test"
KDB does not allow column types to be defined as lists when a table is created, which means you cannot declare a column as string, since a string is itself a list (of chars).
The only way is to define the column as an empty general list, like:
q) t:([]id:`int$();val:())
Then, when you insert data into this table, the column will automatically take the type of that data.
q)`t insert (4;"row1")
q) meta t
c | t f a
---| -----
id | i
val| C
In your case, one option is to send string data from your Java process, as mentioned by user 'emc211', and the other is to convert the data to strings in the KDB process before insertion.
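If you go with the first option, the fix is on the Java/Scala side of the connection: with the kx c.java client, a java.lang.String is sent as a q symbol, whereas a char[] is sent as a q string. A rough sketch against the same c.Dict/c.Flip API used in the question (the column values and connection variable here are illustrative):
import kx.c

// column names exactly as in the q table
val columnNames = Array("Key1", "Col1", "Col2", "Col3")

// one array per column; each cell is a char[] so it arrives in q as a string, not a symbol
val data = Array[AnyRef](
  Array("key1".toCharArray, "key2".toCharArray),
  Array("a1".toCharArray, "a2".toCharArray),
  Array("b1".toCharArray, "b2".toCharArray),
  Array("c1".toCharArray, "c2".toCharArray)
)

val dict = new c.Dict(columnNames, data)
val flip = new c.Flip(dict)
conn.c.ks("upsert", "Tab", flip)   // conn as in the Java snippet from the question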

Spark SQL - regexp_replace not updating the column value

I ran the following query in Hive and it successfully updated the column value in the table: select id, regexp_replace(full_name,'A','C') from table
But when I ran the same query from Spark SQL, it did not update the actual records:
hiveContext.sql("select id, regexp_replace(full_name,'A','C') from table")
Yet when I run hiveContext.sql("select id, regexp_replace(full_name,'A','C') from table").show(), it displays A replaced with C successfully, but only in the output and not in the actual table.
I tried to assign the result to another variable
val vFullName = hiveContext.sql("select id, regexp_replace(full_name,'A','C') from table")
and then
vFullName.show() -- it displays the original values without replacement
How do I get the value replaced in the table from SparkSQL?
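A SELECT never modifies the underlying table (neither in Hive nor in Spark); it only produces a new result set, so the replaced values have to be written back explicitly. A minimal sketch of one way to do that, writing the transformed data out to a separate table (the table and column names come from the question; the target table name is illustrative):
val replaced = hiveContext.sql("select id, regexp_replace(full_name, 'A', 'C') as full_name from table")

// Spark cannot overwrite a table while it is still reading from it in the same job,
// so write to a new table (or stage the data first) and swap afterwards.
replaced.write
  .mode("overwrite")
  .saveAsTable("table_with_replaced_names")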

Calculate table column using over table on server side

Suppose, there are two tables in db:
Table registries:
    Column    |            Type             | Modifiers
--------------+-----------------------------+-----------
 registry_id  | integer                     | not null
 name         | character varying           | not null
 ...
 uploaded_at  | timestamp without time zone | not null
Table rows:
    Column    |            Type             | Modifiers
--------------+-----------------------------+-----------
 row_id       | character varying           | not null
 registry_id  | integer                     | not null
 row          | character varying           | not null
In the real world, registries is just a csv file and rows are the lines of those files. In my Scala/Slick application, I want to know how many lines are in each file.
registries:
1,foo,...
2,bar,...
3,baz,...
rows:
aaa,1,...
bbb,1,...
ccc,2,...
desired result:
1,foo,... - 2
2,bar,... - 1
3,baz,... - 0
My code now is (slick-3.0):
def getRegistryWithLength(rId: Int) = {
  val q1 = registries.filter(_.registryId === rId).take(1).result.headOption
  val q2 = rows.filter(_.registryId === rId).length.result

  val registry = Await.result(db.run(q1), 5.seconds)
  val length = Await.result(db.run(q2), 5.seconds)

  (registry, length)
}
(Await is a bad idea, I know.)
How can I do getRegistryWithLength using a single SQL query?
I could add a row_n column to the registries table, but then I would be forced to update row_n after every delete/insert on the rows table.
How can I have the row_n column of registries calculated automatically on the db server side?
The basic query could be:
SELECT r.*, COALESCE(n.ct, 0) AS ct
FROM   registries r
LEFT   JOIN (
   SELECT registry_id, count(*) AS ct
   FROM   rows
   GROUP  BY registry_id
   ) n USING (registry_id);
The LEFT [OUTER] JOIN is essential so you do not filter out rows from registries that have no related rows in rows.
COALESCE returns 0 instead of NULL where no related rows are found.
There are many related answers on SO. One here:
SQL: How to save order in sql query?
You could wrap this in a VIEW for convenience:
CREATE VIEW reg_rn AS
SELECT ...
... which you query like a table.
Aside: it's unwise to use reserved SQL keywords as identifiers. row is a no-go for a column name (even if Postgres allows it).
Thanks Erwin Brandstetter for the awesome answer; using it, I wrote the code for my Scala/Slick application.
The Scala code looks much more complicated than the plain SQL:
val registryQuery = registries.filter(_.userId === userId)
val rowQuery = rows.groupBy(_.registryId).map { case (regId, rowItems) => (regId, rowItems.length) }
val q = registryQuery.joinLeft(rowQuery).on(_.registryId === _._1).map {
  case (registry, rowsCnt) => (registry, rowsCnt.map(_._2))
}
but it works!
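To run the composed query as a single round trip (instead of the two Await blocks from the original function), something along these lines should work; mapping the missing count to 0 mirrors the COALESCE in the SQL version:
import scala.concurrent.ExecutionContext.Implicits.global

// One database round trip; each result element is (registryRow, Option[Int]),
// where None means the registry has no rows, i.e. a count of 0.
val future = db.run(q.result).map(_.map { case (registry, cnt) => (registry, cnt.getOrElse(0)) })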