I have a DataFrame with a column called "QUERY" that contains a select statement. I want to execute each of these queries and create a new column holding the actual result from the TempView.
+--------------+-----------+-----+----------------------------------------+
|DIFFCOLUMNNAME|DATATYPE |ISSUE|QUERY |
+--------------+-----------+-----+----------------------------------------+
|Firstname |StringType |YES |Select Firstname from TempView limit 1 |
|LastName |StringType |NO |Select LastName from TempView limit 1 |
|Designation |StringType |YES |Select Designation from TempView limit 1|
|Salary |IntegerType|YES |Select Salary from TempView limit 1 |
+--------------+-----------+-----+----------------------------------------+
I tried the following, but I am getting the error: type mismatch; required: String, found: Column (spark.sql expects the query as a String, not a Column):
DF.withColumn("QueryResult", spark.sql(col("QUERY")))
Do I need to use a UDF here? If so, I am not sure how to write and use one. Please suggest.
TempView is a temporary view I have created that contains all the required columns.
The expected final DataFrame will be something like this, with the new column QUERYRESULT added:
+--------------+-----------+-----+----------------------------------------+-----------+
|DIFFCOLUMNNAME|DATATYPE   |ISSUE|QUERY                                   |QUERYRESULT|
+--------------+-----------+-----+----------------------------------------+-----------+
|Firstname     |StringType |YES  |Select Firstname from TempView limit 1  |Bunny      |
|LastName      |StringType |NO   |Select LastName from TempView limit 1   |Gummy      |
|Designation   |StringType |YES  |Select Designation from TempView limit 1|Developer  |
|Salary        |IntegerType|YES  |Select Salary from TempView limit 1     |100        |
+--------------+-----------+-----+----------------------------------------+-----------+
If the number of queries is limited, you can collect them, execute each one, and join the results back to the original queries dataframe (Kieran was faster with his answer, but my answer has an example):
import org.apache.spark.sql.Encoders
import spark.implicits._

val queriesDF = Seq(
  ("Firstname", "StringType", "YES", "Select Firstname from TempView limit 1 "),
  ("LastName", "StringType", "NO", "Select LastName from TempView limit 1 "),
  ("Designation", "StringType", "YES", "Select Designation from TempView limit 1"),
  ("Salary", "IntegerType", "YES", "Select Salary from TempView limit 1 ")
).toDF("DIFFCOLUMNNAME", "DATATYPE", "ISSUE", "QUERY")

val data = Seq(
  ("Bunny", "Gummy", "Developer", 100)
).toDF("Firstname", "LastName", "Designation", "Salary")

data.createOrReplaceTempView("TempView")

// get all distinct queries and evaluate each of them
val queries = queriesDF.select("QUERY").distinct().as(Encoders.STRING).collect().toSeq
val queryResults = queries.map(q => (q, spark.sql(q).as(Encoders.STRING).first()))
val queryResultsDF = queryResults.toDF("QUERY", "QUERY RESULT")

// Join original queries and results
queriesDF.alias("queriesDF")
  .join(queryResultsDF, Seq("QUERY"))
  .select("queriesDF.*", "QUERY RESULT")
Output:
+----------------------------------------+--------------+-----------+-----+------------+
|QUERY |DIFFCOLUMNNAME|DATATYPE |ISSUE|QUERY RESULT|
+----------------------------------------+--------------+-----------+-----+------------+
|Select Firstname from TempView limit 1 |Firstname |StringType |YES |Bunny |
|Select LastName from TempView limit 1 |LastName |StringType |NO |Gummy |
|Select Designation from TempView limit 1|Designation |StringType |YES |Developer |
|Select Salary from TempView limit 1 |Salary |IntegerType|YES |100 |
+----------------------------------------+--------------+-----------+-----+------------+
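Since queryResultsDF holds only one row per distinct query, you could optionally mark it for a broadcast join; a possible tweak to the join above (same output, just a hint that avoids a shuffle):
import org.apache.spark.sql.functions.broadcast

// queryResultsDF is tiny, so a broadcast hint keeps the join cheap
queriesDF.alias("queriesDF")
  .join(broadcast(queryResultsDF), Seq("QUERY"))
  .select("queriesDF.*", "QUERY RESULT")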
Assuming you don't have that many 'query rows', just collect them to the driver using df.collect() and then map over the queries using plain Scala, as sketched below.
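A minimal sketch of that idea, assuming the DF and TempView from the question:
// Collect the query strings to the driver (fine for a small number of rows)
val queryStrings = DF.select("QUERY").collect().map(_.getString(0))

// Run each query with plain Scala and pair it with its single result
val results = queryStrings.map(q => (q, spark.sql(q).first().get(0).toString))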
I'm trying to get the size of each table in my database.
I first listed all my tables in a dataframe using this command:
df = spark.sql("show tables in db")
And this is my current dataframe:
+---------+
| tabs |
+---------+
|db.tab1 |
|db.tab2 |
|db.tab3 |
|db.tab4 |
|db.tab5 |
+---------+
Then, for each table, I want to get some information such as the row count and the last modification date.
To explain more, what I want to do is something like this (it's not working):
df1 = df.withColumn("count", spark.sql('select count(*) from {0}'.format(df.tabs)))
This is the desired result:
+---------+------+
| tabs | count|
+---------+------+
|db.tab1 | 122 |
|db.tab2 | 156 |
|db.tab3 | 235 |
|db.tab4 | 11 |
|db.tab5 | 98 |
+---------+------+
You can try something like the below: get the count for each table and union the results.
from functools import reduce

# collect the table names to the driver
tables = [row.tabs for row in df.collect()]

# build one count query per table (table name selected as a string literal)
count_dfs = [
    spark.sql(f"select '{table}' as tabs, count(*) as count from {table}")
    for table in tables
]

# union the per-table counts into a single dataframe
result_df = reduce(lambda union_df, count_df: union_df.union(count_df), count_dfs)
result_df.show()
I have a table (table_name: raw_data) whose schema, along with some data, appears to be this:
name | category | clear_date |
A    | GOOD     | 2020-05-30 |
A    | GOOD     | 2020-05-30 |
A    | GOOD     | 2020-05-30 |
A    | GOOD     | 2020-05-30 |
A    | BAD      | 2020-05-30 |
A    | BAD      | 2020-05-31 |
Now if I perform a "group by" operation using the following statement:
SELECT name, category, date(clear_date), count(clear_date)
FROM raw_data
GROUP BY name, category, date(clear_date)
ORDER BY name
I get the following result:
name | category | date       | count |
A    | GOOD     | 2020-05-30 | 4     |
A    | BAD      | 2020-05-30 | 1     |
A    | BAD      | 2020-05-31 | 1     |
In order to produce the pivot in the following format:
name | category | 2020-05-30 | 2020-05-31 |
A | GOOD | 4 | NULL |
A | BAD | 1 | 1 |
I am using the following query:
select * from crosstab (
'select name, category, date(clear_date), count(clear_date) from raw_data group by name, category, date(clear_date) order by 1,2,3',
'select distinct date(clear_date) from raw_data order by 1'
)
as newtable (
name varchar, category varchar, "2020-05-30" integer, "2020-05-31" integer
)
ORDER BY name
But I am getting results as follows:
name | category | 2020-05-30 | 2020-05-31 |
A | BAD | 4 | 1 |
Can anyone please suggest how I can achieve the result mentioned above? It appears crosstab removes the duplicate entry for A automatically.
Not sure if this is possible using crosstab, because you have missing records for some dates. Here is an example of how to get the expected result, though I am not sure it is exactly what you need. Anyway, hope this helps.
SELECT r1.*, r2.counter AS "2020-05-30", r3.counter AS "2020-05-31"
FROM (
SELECT DISTINCT name, category
FROM raw_data
) AS r1
LEFT JOIN (
SELECT name, category, count(*) AS counter
FROM raw_data
WHERE clear_date = '2020-05-30'
GROUP BY name, category
) AS r2 ON (r2.category = r1.category AND r2.name = r1.name)
LEFT JOIN (
SELECT name, category, count(*) AS counter
FROM raw_data
WHERE clear_date = '2020-05-31'
GROUP BY name, category
) AS r3 ON (r3.category = r1.category AND r3.name = r1.name)
ORDER BY r1.category DESC;
I have two large CSV files which I am reading into DataFrames in Spark (Scala).
The first DataFrame is something like:
key| col1 | col2 |
-------------------
1 | blue | house |
2 | red | earth |
3 | green| earth |
4 | cyan | home |
The second DataFrame is something like:
key| col1 | col2 | col3
-------------------
1 | blue | house | xyz
2 | cyan | earth | xy
3 | green| mars | xy
I want to get the differences, like this, for the common keys and common columns (the key is like a primary key), in a separate DataFrame:
key| col1          | col2           |
------------------------------------
1  | blue          | house          |
2  | red --> cyan  | earth          |
3  | green         | earth --> mars |
Below is my approach so far:
//read the files into dataframes
val src_df = read_df(file1)
val tgt_df = read_df(file2)

// (assumes src_df and tgt_df are also registered as temp views under these names)

//truncate the dataframes to only contain common keys
val src_common = spark.sql(
  """
  select *
  from src_df src
  where src.key IN (
    select tgt.key
    from tgt_df tgt
  )
  """)

val tgt_common = spark.sql(
  """
  select *
  from tgt_df tgt
  where tgt.key IN (
    select src.key
    from src_df src
  )
  """)

//merge both the dataframes
val joined_df = src_common.join(tgt_common, src_common("key") === tgt_common("key"), "inner")
I was unsuccessfully trying to do something like this
joined_df
.groupby(key)
.apply(some_function(?))
I have tried looking at existing solutions posted online, but I couldn't get the desired result.
PS: I am also hoping the solution will scale to large data.
Thanks
Try the following:
spark.sql(
  """
  select
    s.key,
    if(s.col1 = t.col1, s.col1, s.col1 || ' --> ' || t.col1) as col1,
    if(s.col2 = t.col2, s.col2, s.col2 || ' --> ' || t.col2) as col2
  from src_df s
  inner join tgt_df t on s.key = t.key
  """).show
I have a data frame with two columns: "ID" and "Amount", each row representing a transaction of a particular ID and the transacted amount. My example uses the following DF:
val df = sc.parallelize(Seq((1, 120),(1, 120),(2, 40),
(2, 50),(1, 30),(2, 120))).toDF("ID","Amount")
I want to create a new column identifying whether said amount is a recurring value, i.e. occurs in any other transaction for the same ID, or not.
I have found a way to do this more generally, i.e. across the entire column "Amount", not taking into account the ID, using the following function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DoubleType

def recurring_amounts(df: DataFrame, col: String): DataFrame = {
  // note: the "Amount" column name is hardcoded below, so this only works for col = "Amount"
  var df_to_arr = df.select(col).rdd.map(r => r(0).asInstanceOf[Double]).collect()
  var arr_to_map = df_to_arr.groupBy(identity).mapValues(_.size)
  var map_to_df = arr_to_map.toSeq.toDF(col, "Count")
  var df_reformat = map_to_df.withColumn("Amount", $"Amount".cast(DoubleType))
  var df_out = df.join(df_reformat, Seq("Amount"))
  return df_out
}
val df_output = recurring_amounts(df, "Amount")
This returns:
+---+------+-----+
|ID |Amount|Count|
+---+------+-----+
| 1 | 120 | 3 |
| 1 | 120 | 3 |
| 2 | 40 | 1 |
| 2 | 50 | 1 |
| 1 | 30 | 1 |
| 2 | 120 | 3 |
+---+------+-----+
which I can then use to create my desired binary variable to indicate whether the amount is recurring or not (yes if > 1, no otherwise).
However, my problem is illustrated in this example by the value 120, which is recurring for ID 1 but not for ID 2. My desired output therefore is:
+---+------+-----+
|ID |Amount|Count|
+---+------+-----+
| 1 | 120 | 2 |
| 1 | 120 | 2 |
| 2 | 40 | 1 |
| 2 | 50 | 1 |
| 1 | 30 | 1 |
| 2 | 120 | 1 |
+---+------+-----+
I've been trying to think of a way to apply a function using .over(Window.partitionBy("ID")) but am not sure how to go about it. Any hints would be much appreciated.
If you are comfortable with SQL, you can write a SQL query against your DataFrame. The first thing you need to do is register your DataFrame as a table in Spark's memory. After that you can write SQL on top of that table. Note that spark is the Spark session variable.
val df = sc.parallelize(Seq((1, 120),(1, 120),(2, 40),(2, 50),(1, 30),(2, 120))).toDF("ID","Amount")
df.createOrReplaceTempView("transactions") // registerTempTable is deprecated in recent Spark versions
spark.sql("select *,count(*) over(partition by ID,Amount) as Count from transactions").show()
Please let me know if you have any questions.
Okay, I have a table with column definitions and corresponding ordinal positions. I'm building a metadata-driven ETL framework using Spark (scala). I have a table that contains the following info:
tablename
columnname
datatype
ordinalposition
I have to build a CREATE TABLE statement from that data. Not a big deal, right? I tried what appears to be the standard answer:
var metadatadef = spark.sql("SELECT tablename, columnname, datatype, ordinalposition FROM metadata")
.withColumn("columndef", concat($"columnname", lit(" "), $"datatype"))
.sort($"tablename", $"ordinalposition")
.groupBy("tablename")
.agg(concat_ws(", ", collect_list($"columndef")).as("columndefs"))
But the sort() call seems to be ignored here, or the rows get reshuffled somewhere between collect_list() and concat_ws(). Given source data like this:
+-----------+--------------+---------------+-----------------+
| tablename | columnname | datatype | ordinalposition |
+ ----------+--------------+---------------+-----------------+
| table1 | IntColumn | int | 0 |
| table2 | StringColumn | string | 2 |
| table1 | StringColumn | string | 2 |
| table2 | IntColumn | int | 0 |
| table1 | DecColumn | decimal(15,2) | 1 |
| table2 | DecColumn | decimal(15,2) | 1 |
+-----------+--------------+---------------+-----------------+
I am trying to get output like this:
+-----------+----------------------------------------------------------------+
| tablename | columndefs |
+-----------+----------------------------------------------------------------+
| table1 | IntColumn int, DecColumn decimal(15,2), StringColumn string |
| table2 | IntColumn int, DecColumn decimal(15,2), StringColumn string |
+-----------+----------------------------------------------------------------+
Instead, I wind up with something like this:
+-----------+----------------------------------------------------------------+
| tablename | columndefs |
+-----------+----------------------------------------------------------------+
| table1 | IntColumn int, StringColumn string, DecColumn decimal(15,2) |
| table2 | StringColumn string, IntColumn int, DecColumn decimal(15,2) |
+-----------+----------------------------------------------------------------+
Do I need to build a UDF to ensure I get proper order? I need the output to end up in a dataframe for comparison purposes, not just building the CREATE TABLE statement.
You can create a struct column of (ordinalposition, columndef) and apply sort_array to sort the aggregated columndef values into the wanted order during the groupBy transformation, as follows. (collect_list gives no ordering guarantee after the shuffle that groupBy introduces, which is why sorting beforehand is not reliable.)
import org.apache.spark.sql.functions._
val df = Seq(
("table1", "IntColumn", "int", "0"),
("table2", "StringColumn", "string", "2"),
("table1", "StringColumn", "string", "2"),
("table2", "IntColumn", "int", "0"),
("table1", "DecColumn", "decimal(15,2)", "1"),
("table2", "DecColumn", "decimal(15,2)", "1")
).toDF("tablename", "columnname", "datatype", "ordinalposition")
df.
withColumn("columndef",
struct($"ordinalposition", concat($"columnname", lit(" "), $"datatype").as("cdef"))
).
groupBy("tablename").agg(sort_array(collect_list($"columndef")).as("sortedlist")).
withColumn("columndefs", concat_ws(", ", $"sortedlist.cdef")).
drop("sortedlist").
show(false)
// +---------+-----------------------------------------------------------+
// |tablename|columndefs |
// +---------+-----------------------------------------------------------+
// |table2 |IntColumn int, DecColumn decimal(15,2), StringColumn string|
// |table1 |IntColumn int, DecColumn decimal(15,2), StringColumn string|
// +---------+-----------------------------------------------------------+
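If you then also want the CREATE TABLE text itself (as mentioned in the question), a possible follow-up, assuming the aggregated result above is kept in a val (called resultDF here for illustration) instead of being passed straight to show:
import org.apache.spark.sql.functions.{concat, lit}

// Hypothetical: assemble a DDL string per table from the aggregated column list
val withDdl = resultDF.withColumn(
  "create_stmt",
  concat(lit("CREATE TABLE "), $"tablename", lit(" ("), $"columndefs", lit(")"))
)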