pyspark sql concat empty col - pyspark

With pyspark sql functions, I'm trying to do this
from pyspark.sql import functions as sf
query = sf.concat(sf.lit("UPDATE abc"), sf.lit(" SET col1= '"), sf.col("col1"), sf.lit("'"), sf.lit(", col2= '"), sf.col("col2"), sf.lit("'"), sf.lit(" WHERE col3 = 1"))
myDataframe = myDataframe.withColumn("query", query)
query_collect = myDataframe.collect()
conn = createConnexion(args, username, password)
try:
    for row in query_collect:
        print(row["query"])
        conn.run(row["query"])
        conn.commit()
But it doesn't work. It works with just col1, but col2 causes an error because sometimes col2 is empty (null), and concat returns null as soon as any of its arguments is null.
So the query column is null and conn.run(row["query"]) throws this exception:
None 'NoneType' object has no attribute 'encode'
I'm trying to use pyspark sql when like this, but it's the same issue:
myDataframe = myDataframe.fillna(value="NO_SQL")
query = sf.concat(sf.lit("UPDATE abc"),
    sf.lit(" SET col1= '"),
    sf.col("col1"),
    sf.lit("'"),
    sf.when(sf.col("col2") != "NO_SQL", sf.concat(sf.lit(", col2= '"), sf.col("col2"), sf.lit("'"))),
    sf.lit(" WHERE col3 = 1"))
Edit for @Linus:
I'm trying this
@udf(returnType=StringType())
def sql_worker(col1, col2, colWhere):
    col2_setting = f", col2 = '{col2}'" if col2 is not None else ""
    return f" UPDATE entreprise SET {col1} = '{col1}'{col2_setting} WHERE abc = {colWhere} "
def aaa(dynToInsert, colonne, args, username, password, forLog):
    dfToInsert = dynToInsert.toDF()
    dfToInsert.withColumn("query", sql_worker(sf.col('col1'), sf.col('col2'), sf.col('col3')))
But I get this exception: Invalid returnType: returnType should be DataType or str but is StringType({})
Thanks

Maybe this can solve your issue:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def sql_worker(col1, col2):
    col2_setting = f", col2 = '{col2}'" if col2 is not None else ""
    return f" UPDATE abc SET col1 = '{col1}'{col2_setting} WHERE col3 = 1 "

data = spark.createDataFrame([
    ('abc', 'lll',),
    ('ddd', 'xxx',),
    ('qqq', None),
], ['col1', 'col2'])
data.withColumn("query", sql_worker(col('col1'), col('col2'))).show(10, False)
# +----+----+----------------------------------------------------------+
# |col1|col2|query                                                     |
# +----+----+----------------------------------------------------------+
# |abc |lll | UPDATE abc SET col1 = 'abc', col2 = 'lll' WHERE col3 = 1 |
# |ddd |xxx | UPDATE abc SET col1 = 'ddd', col2 = 'xxx' WHERE col3 = 1 |
# |qqq |null| UPDATE abc SET col1 = 'qqq' WHERE col3 = 1               |
# +----+----+----------------------------------------------------------+

It works with when().otherwise(). At first I tried without the otherwise and it was an error: without otherwise(), rows that don't match the condition get null, which makes the whole concat null again. Thanks.
query = sf.concat(
sf.lit("UPDATE def"),
sf.lit(" SET " + colonne + " = "), sf.col(colonne),
sf.when(sf.col("abc").isNull(), "").otherwise(sf.concat(sf.lit(" , abc = '"), sf.col("abc"), sf.lit("'"))),
sf.lit(" WHERE " + colonne + " = "), sf.col(colonne)
)

Related

Union can be done on tables with same no of column In Scala

Below is my code:
var adf = spark.emptyDataFrame
for (i <- 0 until 10)
{
  val df1 = spark.read.format("csv").load("c:\\file.txt") // file has 2 columns
  val df2 = spark.read.format("csv").load("c:\\file.txt") // file has 2 columns
  df1.createOrReplaceTempView("tab1")
  df2.createOrReplaceTempView("tab2")
  val res = spark.sql("Select A.* , B.* from tab1 a join tab2 b on a.id = b.id")
  adf = adf.union(res)
}
adf.show()
Union is failing, saying "Union can be done on tables with same no of columns".
Can anyone please help?
The schema of an empty DataFrame doesn't have any columns, but your result DataFrame has 4 columns, which is why the union operation is failing. You can make adf a variable and initialize it with null or an Option value. Example using Option:
var adf: Option[DataFrame] = None
for (i <- 0 until 10)
{
  val df1 = spark.read.format("csv").load("c:\\file.txt") // file has 2 columns
  val df2 = spark.read.format("csv").load("c:\\file.txt") // file has 2 columns
  df1.createOrReplaceTempView("tab1")
  df2.createOrReplaceTempView("tab2")
  val res = spark.sql("Select A.* , B.* from tab1 a join tab2 b on a.id = b.id")
  if (adf.isDefined) {
    adf = Option(adf.get.union(res))
  } else adf = Option(res)
}
if (adf.isDefined) adf.get.show()
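As a side note (not from the original answer), the same thing can be written without a mutable accumulator or Option: build the per-iteration results first and combine them with reduce. This is only a sketch reusing the example file path and join key from the question:
// Sketch: collect each iteration's join result, then union them all at once.
val results = (0 until 10).map { _ =>
  val df1 = spark.read.format("csv").load("c:\\file.txt")
  val df2 = spark.read.format("csv").load("c:\\file.txt")
  df1.createOrReplaceTempView("tab1")
  df2.createOrReplaceTempView("tab2")
  spark.sql("select a.*, b.* from tab1 a join tab2 b on a.id = b.id")
}
// reduce is safe here because the fixed range guarantees at least one DataFrame.
val adf = results.reduce(_ union _)
adf.show()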
Resolved the issue by:
Creating Empty Dataframe with Schema:
//Created Schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val schema = StructType(Array(
  StructField("Col1", StringType, true),
  StructField("Col2", StringType, true),
  StructField("Col3", StringType, true),
  StructField("Col4", StringType, true)))
//Created EmptyDataFrame
var adf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
for (i <- 0 until 10)
{
  val df1 = spark.read.format("csv").load("c:\\file.txt") // file has 2 columns
  val df2 = spark.read.format("csv").load("c:\\file.txt") // file has 2 columns
  df1.createOrReplaceTempView("tab1")
  df2.createOrReplaceTempView("tab2")
  val res = spark.sql("Select A.* , B.* from tab1 a join tab2 b on a.id = b.id")
  adf = adf.union(res)
}
adf.show()

How to apply an empty condition to sql select by using "and" in Spark?

I have a UuidConditionSet; when the if condition is not met, I want to apply an empty string to my select statement (or just ignore this UuidConditionSet), but I get this error. How can I solve this problem?
mismatched input 'FROM' expecting <EOF>(line 10, pos 3)
This is the select
(SELECT
item,
amount,
date
from my_table
where record_type = 'myType'
and ( date_format(date, "yyyy-MM-dd") >= '2020-02-27'
and date_format(date, "yyyy-MM-dd") <= '2020-02-28' )
and ()
var UuidConditionSet = ""
var UuidCondition = Seq.empty[String]
if(!UuidList.mkString.isEmpty) {
UuidCondition = for {
Uuid <- UuidList
UuidConditionSet = s"${SQLColumnHelper.EVENT_INFO_STRUCT_NAME}.${SQLColumnHelper.UUID} = '".concat(eventUuid).concat("'")
} yield UuidConditionSet
UuidConditionSet = UuidCondition.reduce(_.concat(" or ").concat(_))
}
s"""SELECT
| ${SQLColumnHelper.STRUCT_NAME_ITEM},
| ${SQLColumnHelper.STRUCT_NAME_AMOUNT},
| ${SQLColumnHelper.DATE}
| from ${sqlTableHelper.TABLE}
| where ${SQLColumnHelper.EVENT_INFO_STRUCT_NAME} = '${RECORD_TYPE}'
| and ( date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") >= '${stayDateRangeTuple._1}'
| and date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") <= '${stayDateRangeTuple._2}' )
| and ($UuidConditionSet)
You can use pattern matching on the list UuidList to check the size and return an empty string if the list is empty. Also, you can use IN instead of multiple ORs here.
Try this:
val UuidCondition = UuidList match {
case l if (l.size > 0) => {
l.map(u => s"'$u'").mkString(
s"and ${SQLColumnHelper.EVENT_INFO_STRUCT_NAME}.${SQLColumnHelper.UUID} in (",
",",
")"
)
}
case _ => ""
}
s"""SELECT
| ${SQLColumnHelper.STRUCT_NAME_ITEM},
| ${SQLColumnHelper.STRUCT_NAME_AMOUNT},
| ${SQLColumnHelper.DATE}
| from ${sqlTableHelper.TABLE}
| where ${SQLColumnHelper.EVENT_INFO_STRUCT_NAME} = '${RECORD_TYPE}'
| and date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") >= '${stayDateRangeTuple._1}'
| and date_format(${SQLColumnHelper.DATE}, "${Constant.STAY_DATE_FORMAT}") <= '${stayDateRangeTuple._2}'
| $UuidCondition
"""

Check count of a column from a dataframe and add column and count as Map

I am a Scala beginner. I am trying to find the count of null values in a column of a table and add the column name and count as a key-value pair in a Map. The code below doesn't work as expected. Please guide me on how I can modify it to make it work.
def nullCheck(databaseName:String,tableName:String) ={
var map = scala.collection.mutable.Map[String, Int]()
validationColumn = Array(col1,col2)
for(i <- 0 to validationColumn.length) {
val nullVal = spark.sql(s"select count(*) from $databaseName.$tableName where validationColumn(i) is NULL")
if(nullval == 0)
map(validationColumn(i)) = nullVal
map
}
The function should return ((col1,count),(col2,count)) as Map
This can be done by creating a dynamic SQL string and then mapping it. Your approach reads the same data multiple times.
Here is the solution. I used an "example" DataFrame.
scala> val inputDf = Seq((Some("Sam"),None,200),(None,Some(31),30),(Some("John"),Some(25),25),(Some("Harry"),None,100)).toDF("name","age","not_imp_column")
scala> inputDf.show(false)
+-----+----+--------------+
|name |age |not_imp_column|
+-----+----+--------------+
|Sam |null|200 |
|null |31 |30 |
|John |25 |25 |
|Harry|null|100 |
+-----+----+--------------+
Our validation columns are name and age, where we shall count nulls.
We put them in a List:
scala> val validationColumns = List("name","age")
And we create a SQL string that will drive the whole calculation:
scala> val sqlStr = "select " + validationColumns.map(x => "sum(" + x + "_count) AS " + x + "_sum" ).mkString(",") + " from (select " + validationColumns.map(x => "case when " + x + " = '$$' then 1 else 0 end AS " + x + "_count").mkString(",") + " from " +" (select" + validationColumns.map(x => " nvl( " + x +",'$$') as " + x).mkString(",") + " from example_table where " + validationColumns.map(x => x + " is null ").mkString("or ") + " ) layer1 ) layer2 "
It resolves to ==>
"select sum(name_count) AS name_sum,sum(age_count) AS age_sum from (select case when name = '$$' then 1 else 0 end AS name_count,case when age = '$$' then 1 else 0 end AS age_count from (select nvl( name,'$$') as name, nvl( age,'$$') as age from example_table where name is null or age is null ) layer1 ) layer2 "
Now we create a temporary view of our DataFrame:
inputDf.createOrReplaceTempView("example_table")
The only thing left to do is execute the SQL and create the Map, which is done by
validationColumns zip spark.sql(sqlStr).collect.map(_.toSeq).flatten.toList toMap
and the result:
Map(name -> 1, age -> 2) // obviously you can make it type safe
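As a side note (not part of the original answer), the same per-column null counts can also be computed directly with the DataFrame API in a single pass, without building a SQL string. A sketch assuming the same inputDf and validationColumns as above:
import org.apache.spark.sql.functions.{col, count, when}

// One aggregation pass: for each validation column, count the rows where it is null.
// count() ignores nulls, and when() without otherwise() yields null for non-matching rows,
// so only the null rows are counted.
val countsRow = inputDf
  .select(validationColumns.map(c => count(when(col(c).isNull, 1)).alias(c)): _*)
  .first()

// Zip column names with their counts into a Map, e.g. Map(name -> 1, age -> 2).
val nullCountMap = validationColumns.zip(countsRow.toSeq.map(_.asInstanceOf[Long])).toMap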

Spark SQL add column/update-accumulate value

I have the following DataFrame:
name,email,phone,country
------------------------------------------------
[Mike,mike#example.com,+91-9999999999,Italy]
[Alex,alex#example.com,+91-9999999998,France]
[John,john#example.com,+1-1111111111,United States]
[Donald,donald#example.com,+1-2222222222,United States]
[Dan,dan#example.com,+91-9999444999,Poland]
[Scott,scott#example.com,+91-9111999998,Spain]
[Rob,rob#example.com,+91-9114444998,Italy]
exposed as temp table tagged_users:
resultDf.createOrReplaceTempView("tagged_users")
I need to add an additional column tag to this DataFrame and assign calculated tags based on different SQL conditions, which are described in the following map (key - tag name, value - condition for the WHERE clause):
val tags = Map(
"big" -> "country IN (SELECT * FROM big_countries)",
"medium" -> "country IN (SELECT * FROM medium_countries)",
//2000 other different tags and conditions
"sometag" -> "name = 'Donald' AND email = 'donald#example.com' AND phone = '+1-2222222222'"
)
I have the following DataFrames(as data dictionaries) in order to be able to use them in SQL query:
Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")
I want to test each line in my tagged_users table and assign it appropriate tags. I tried to implement the following logic in order to achieve it:
tags.foreach {
case (tag, tagCondition) => {
resultDf = spark.sql(buildTagQuery(tag, tagCondition, "tagged_users"))
.withColumn("tag", lit(tag).cast(StringType))
}
}
def buildTagQuery(tag: String, tagCondition: String, table: String): String = {
f"SELECT * FROM $table WHERE $tagCondition"
}
but right now I don't know how to accumulate tags and not override them. Right now as the result I have the following DataFrame:
name,email,phone,country,tag
Dan,dan#example.com,+91-9999444999,Poland,medium
Scott,scott#example.com,+91-9111999998,Spain,medium
but I need something like:
name,email,phone,country,tag
Mike,mike#example.com,+91-9999999999,Italy,big
Alex,alex#example.com,+91-9999999998,France,big
John,john#example.com,+1-1111111111,United States,big
Donald,donald#example.com,+1-2222222222,United States,(big|sometag)
Dan,dan#example.com,+91-9999444999,Poland,medium
Scott,scott#example.com,+91-9111999998,Spain,(big|medium)
Rob,rob#example.com,+91-9114444998,Italy,big
Please note that Donald should have 2 tags (big|sometag) and Scott should have 2 tags (big|medium).
Please show how to implement it.
UPDATED
val spark = SparkSession
.builder()
.appName("Java Spark SQL basic example")
.config("spark.master", "local")
.getOrCreate();
import spark.implicits._
import spark.sql
Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")
val df = Seq(
("Mike", "mike#example.com", "+91-9999999999", "Italy"),
("Alex", "alex#example.com", "+91-9999999998", "France"),
("John", "john#example.com", "+1-1111111111", "United States"),
("Donald", "donald#example.com", "+1-2222222222", "United States"),
("Dan", "dan#example.com", "+91-9999444999", "Poland"),
("Scott", "scott#example.com", "+91-9111999998", "Spain"),
("Rob", "rob#example.com", "+91-9114444998", "Italy")).toDF("name", "email", "phone", "country")
df.collect.foreach(println)
df.createOrReplaceTempView("tagged_users")
val tags = Map(
"big" -> "country IN (SELECT * FROM big_countries)",
"medium" -> "country IN (SELECT * FROM medium_countries)",
"sometag" -> "name = 'Donald' AND email = 'donald#example.com' AND phone = '+1-2222222222'")
val sep_tag = tags.map((x) => { s"when array_contains(" + x._1 + ", country) then '" + x._1 + "' " }).mkString
val combine_sel_tag1 = tags.map((x) => { s" array_contains(" + x._1 + ",country) " }).mkString(" and ")
val combine_sel_tag2 = tags.map((x) => x._1).mkString(" '(", "|", ")' ")
val combine_sel_all = " case when " + combine_sel_tag1 + " then " + combine_sel_tag2 + sep_tag + " end as tags "
val crosqry = tags.map((x) => { s" cross join ( select collect_list(country) as " + x._1 + " from " + x._1 + "_countries) " + x._1 + " " }).mkString
val qry = " select name,email,phone,country, " + combine_sel_all + " from tagged_users " + crosqry
spark.sql(qry).show
spark.stop()
fails with the following exception, because the generated cross join expects a <tag>_countries table for every key in the tags map and no sometag_countries table exists:
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'sometag_countries' not found in database 'default';
at org.apache.spark.sql.catalyst.catalog.ExternalCatalog$class.requireTableExists(ExternalCatalog.scala:48)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireTableExists(InMemoryCatalog.scala:45)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.getTable(InMemoryCatalog.scala:326)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:138)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:701)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:730)
... 74 more
Check out this DF solution:
scala> val df = Seq(("Mike","mike#example.com","+91-9999999999","Italy"),
| ("Alex","alex#example.com","+91-9999999998","France"),
| ("John","john#example.com","+1-1111111111","United States"),
| ("Donald","donald#example.com","+1-2222222222","United States"),
| ("Dan","dan#example.com","+91-9999444999","Poland"),
| ("Scott","scott#example.com","+91-9111999998","Spain"),
| ("Rob","rob#example.com","+91-9114444998","Italy")
| ).toDF("name","email","phone","country")
df: org.apache.spark.sql.DataFrame = [name: string, email: string ... 2 more fields]
scala> val dfbc=Seq("Italy", "France", "United States", "Spain").toDF("country")
dfbc: org.apache.spark.sql.DataFrame = [country: string]
scala> val dfmc=Seq("Poland", "Hungary", "Spain").toDF("country")
dfmc: org.apache.spark.sql.DataFrame = [country: string]
scala> val dfbc2=dfbc.agg(collect_list('country).as("bcountry"))
dfbc2: org.apache.spark.sql.DataFrame = [bcountry: array<string>]
scala> val dfmc2=dfmc.agg(collect_list('country).as("mcountry"))
dfmc2: org.apache.spark.sql.DataFrame = [mcountry: array<string>]
scala> val df2=df.crossJoin(dfbc2).crossJoin(dfmc2)
df2: org.apache.spark.sql.DataFrame = [name: string, email: string ... 4 more fields]
scala> df2.selectExpr("*","case when array_contains(bcountry,country) and array_contains(mcountry,country) then '(big|medium)' when array_contains(bcountry,country) then 'big' when array_contains(mcountry,country) then 'medium' else 'none' end as `tags`").select("name","email","phone","country","tags").show(false)
+------+------------------+--------------+-------------+------------+
|name |email |phone |country |tags |
+------+------------------+--------------+-------------+------------+
|Mike |mike#example.com |+91-9999999999|Italy |big |
|Alex |alex#example.com |+91-9999999998|France |big |
|John |john#example.com |+1-1111111111 |United States|big |
|Donald|donald#example.com|+1-2222222222 |United States|big |
|Dan |dan#example.com |+91-9999444999|Poland |medium |
|Scott |scott#example.com |+91-9111999998|Spain |(big|medium)|
|Rob |rob#example.com |+91-9114444998|Italy |big |
+------+------------------+--------------+-------------+------------+
scala>
SQL approach
scala> Seq(("Mike","mike#example.com","+91-9999999999","Italy"),
| ("Alex","alex#example.com","+91-9999999998","France"),
| ("John","john#example.com","+1-1111111111","United States"),
| ("Donald","donald#example.com","+1-2222222222","United States"),
| ("Dan","dan#example.com","+91-9999444999","Poland"),
| ("Scott","scott#example.com","+91-9111999998","Spain"),
| ("Rob","rob#example.com","+91-9114444998","Italy")
| ).toDF("name","email","phone","country").createOrReplaceTempView("tagged_users")
scala> Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
scala> Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")
scala> spark.sql(""" select name,email,phone,country,case when array_contains(bc,country) and array_contains(mc,country) then '(big|medium)' when array_contains(bc,country) then 'big' when array_contains(mc,country) then 'medium' else 'none' end as tags from tagged_users cross join ( select collect_list(country) as bc from big_countries ) b cross join ( select collect_list(country) as mc from medium_countries ) c """).show(false)
+------+------------------+--------------+-------------+------------+
|name |email |phone |country |tags |
+------+------------------+--------------+-------------+------------+
|Mike |mike#example.com |+91-9999999999|Italy |big |
|Alex |alex#example.com |+91-9999999998|France |big |
|John |john#example.com |+1-1111111111 |United States|big |
|Donald|donald#example.com|+1-2222222222 |United States|big |
|Dan |dan#example.com |+91-9999444999|Poland |medium |
|Scott |scott#example.com |+91-9111999998|Spain |(big|medium)|
|Rob |rob#example.com |+91-9114444998|Italy |big |
+------+------------------+--------------+-------------+------------+
scala>
Iterating through the tags
scala> val tags = Map(
| "big" -> "country IN (SELECT * FROM big_countries)",
| "medium" -> "country IN (SELECT * FROM medium_countries)")
tags: scala.collection.immutable.Map[String,String] = Map(big -> country IN (SELECT * FROM big_countries), medium -> country IN (SELECT * FROM medium_countries))
scala> val sep_tag = tags.map( (x) => { s"when array_contains("+x._1+", country) then '" + x._1 + "' " } ).mkString
sep_tag: String = "when array_contains(big, country) then 'big' when array_contains(medium, country) then 'medium' "
scala> val combine_sel_tag1 = tags.map( (x) => { s" array_contains("+x._1+",country) " } ).mkString(" and ")
combine_sel_tag1: String = " array_contains(big,country) and array_contains(medium,country) "
scala> val combine_sel_tag2 = tags.map( (x) => x._1 ).mkString(" '(","|", ")' ")
combine_sel_tag2: String = " '(big|medium)' "
scala> val combine_sel_all = " case when " + combine_sel_tag1 + " then " + combine_sel_tag2 + sep_tag + " end as tags "
combine_sel_all: String = " case when array_contains(big,country) and array_contains(medium,country) then '(big|medium)' when array_contains(big, country) then 'big' when array_contains(medium, country) then 'medium' end as tags "
scala> val crosqry = tags.map( (x) => { s" cross join ( select collect_list(country) as "+x._1+" from "+x._1+"_countries) "+ x._1 + " " } ).mkString
crosqry: String = " cross join ( select collect_list(country) as big from big_countries) big cross join ( select collect_list(country) as medium from medium_countries) medium "
scala> val qry = " select name,email,phone,country, " + combine_sel_all + " from tagged_users " + crosqry
qry: String = " select name,email,phone,country, case when array_contains(big,country) and array_contains(medium,country) then '(big|medium)' when array_contains(big, country) then 'big' when array_contains(medium, country) then 'medium' end as tags from tagged_users cross join ( select collect_list(country) as big from big_countries) big cross join ( select collect_list(country) as medium from medium_countries) medium "
scala> spark.sql(qry).show
+------+------------------+--------------+-------------+------------+
| name| email| phone| country| tags|
+------+------------------+--------------+-------------+------------+
| Mike| mike#example.com|+91-9999999999| Italy| big|
| Alex| alex#example.com|+91-9999999998| France| big|
| John| john#example.com| +1-1111111111|United States| big|
|Donald|donald#example.com| +1-2222222222|United States| big|
| Dan| dan#example.com|+91-9999444999| Poland| medium|
| Scott| scott#example.com|+91-9111999998| Spain|(big|medium)|
| Rob| rob#example.com|+91-9114444998| Italy| big|
+------+------------------+--------------+-------------+------------+
scala>
UPDATE2:
scala> Seq(("Mike","mike#example.com","+91-9999999999","Italy"),
| ("Alex","alex#example.com","+91-9999999998","France"),
| ("John","john#example.com","+1-1111111111","United States"),
| ("Donald","donald#example.com","+1-2222222222","United States"),
| ("Dan","dan#example.com","+91-9999444999","Poland"),
| ("Scott","scott#example.com","+91-9111999998","Spain"),
| ("Rob","rob#example.com","+91-9114444998","Italy")
| ).toDF("name","email","phone","country").createOrReplaceTempView("tagged_users")
scala> Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
scala> Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")
scala> val tags = Map(
| "big" -> "country IN (SELECT * FROM big_countries)",
| "medium" -> "country IN (SELECT * FROM medium_countries)",
| "sometag" -> "name = 'Donald' AND email = 'donald#example.com' AND phone = '+1-2222222222'")
tags: scala.collection.immutable.Map[String,String] = Map(big -> country IN (SELECT * FROM big_countries), medium -> country IN (SELECT * FROM medium_countries), sometag -> name = 'Donald' AND email = 'donald#example.com' AND phone = '+1-2222222222')
scala> val sql_tags = tags.map( x => { val p = x._2.trim.toUpperCase.split(" ");
| val qry = if(p.contains("IN") && p.contains("FROM"))
| s" case when array_contains((select collect_list("+p.head +") from " + p.last.replaceAll("[)]","")+ " ), " +p.head + " ) then '" + x._1 + " ' else '' end " + x._1 + " "
| else
| " case when " + x._2 + " then '" + x._1 + " ' else '' end " + x._1 + " ";
| qry } ).mkString(",")
sql_tags: String = " case when array_contains((select collect_list(COUNTRY) from BIG_COUNTRIES ), COUNTRY ) then 'big ' else '' end big , case when array_contains((select collect_list(COUNTRY) from MEDIUM_COUNTRIES ), COUNTRY ) then 'medium ' else '' end medium , case when name = 'Donald' AND email = 'donald#example.com' AND phone = '+1-2222222222' then 'sometag ' else '' end sometag "
scala> val outer_query = tags.map( x=> x._1).mkString(" regexp_replace(trim(concat(", ",", " )),' ','|') tags ")
outer_query: String = " regexp_replace(trim(concat(big,medium,sometag )),' ','|') tags "
scala> spark.sql(" select name,email, country, " + outer_query + " from ( select name,email, country ," + sql_tags + " from tagged_users ) " ).show
+------+------------------+-------------+-----------+
| name| email| country| tags|
+------+------------------+-------------+-----------+
| Mike| mike#example.com| Italy| big|
| Alex| alex#example.com| France| big|
| John| john#example.com|United States| big|
|Donald|donald#example.com|United States|big|sometag|
| Dan| dan#example.com| Poland| medium|
| Scott| scott#example.com| Spain| big|medium|
| Rob| rob#example.com| Italy| big|
+------+------------------+-------------+-----------+
scala>
If you need to aggregate the results and not just execute each query, perhaps use map instead of foreach, then union the results:
val o = tags.map {
case (tag, tagCondition) => {
val resultDf = spark.sql(buildTagQuery(tag, tagCondition, "tagged_users"))
.withColumn("tag", new Column("blah"))
resultDf
}
}
o.tail.foldLeft(o.head) {
  case (acc, df) => acc.union(df)
}
I would define multiple tags tables with columns value, tag.
Then your tags definition would be a collection, say Seq[(String, String)], where the first tuple element is the column on which the tag is calculated.
Let's say:
Seq(
  "country" -> "bigCountries",    // Columns [country, bigCountry]
  "country" -> "mediumCountries", // Columns [country, mediumCountry]
  "email" -> "hotmailLosers"      // Columns [email, hotmailLoser]
)
Then iterate through this list and left join each table on the relevant column. After joining each table, simply set your tags column to the current value plus the joined column when it is not null, as sketched below.
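A minimal sketch of that idea, assuming hypothetical tag tables bigCountries, mediumCountries and hotmailLosers registered with spark (each holding the join column plus a tag column with the tag name), and usersDf as the users DataFrame; these names are illustrative, not from the original post:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws}

// (join column, tag table) pairs; each tag table is assumed to have
// the join column plus a "tag" column holding the tag name.
val tagTables: Seq[(String, DataFrame)] = Seq(
  "country" -> spark.table("bigCountries"),
  "country" -> spark.table("mediumCountries"),
  "email"   -> spark.table("hotmailLosers")
)

// Left join every tag table, keeping each matched tag in its own column.
val joined = tagTables.zipWithIndex.foldLeft(usersDf) {
  case (acc, ((joinCol, tagDf), i)) =>
    acc.join(tagDf.withColumnRenamed("tag", s"tag_$i"), Seq(joinCol), "left")
}

// concat_ws skips nulls, so unmatched tags simply disappear from the combined value.
val tagCols = tagTables.indices.map(i => col(s"tag_$i"))
val tagged = joined.withColumn("tags", concat_ws("|", tagCols: _*))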

hiveql remove duplicates including records that had a duplicate

I have a select statement that I am storing in a DataFrame...
val df = spark.sqlContext.sql("select prty_tax_govt_issu_id from CST_EQUIFAX.eqfx_prty_emp_incm_info where emp_mtch_cd = 'Y' and emp_mtch_actv_rcrd_in = 'Y' and emp_sts_in = 'A'");
I then want to take this DataFrame and ONLY select unique records. That is, determine all duplicates on the prty_tax_govt_issu_id field, and if there are duplicates, remove not only the duplicate(s) but every record that has that prty_tax_govt_issu_id.
So the original DataFrame may look like...
+---------------------+
|prty_tax_govt_issu_id|
+---------------------+
| 000000005|
| 000000012|
| 000000012|
| 000000028|
| 000000038|
+---------------------+
The new dataframe should look like....
+---------------------+
|prty_tax_govt_issu_id|
+---------------------+
| 000000005|
| 000000028|
| 000000038|
+---------------------+
Not sure if I need to do this after I store it in the DataFrame or if I can just get that result in my select statement. Thanks :)
Count the number of rows per id and select the ones with count = 1.
val df = spark.sql("select prty_tax_govt_issu_id from CST_EQUIFAX.eqfx_prty_emp_incm_info where emp_mtch_cd = 'Y' and emp_mtch_actv_rcrd_in = 'Y' and emp_sts_in = 'A'")
// Get counts per id
val counts = df.groupBy("prty_tax_govt_issu_id").count()
// Filter for id's having only one row
counts.filter($"count" == 1).select($"prty_tax_govt_issu_id").show()
In SQL, you could do
val df = spark.sql("""
select prty_tax_govt_issu_id
from CST_EQUIFAX.eqfx_prty_emp_incm_info
where emp_mtch_cd = 'Y' and emp_mtch_actv_rcrd_in = 'Y' and emp_sts_in = 'A'
group by prty_tax_govt_issu_id
having count(*)=1
""")
df.show()
a group by clause would do it
select prty_tax_govt_issu_id
from CST_EQUIFAX.eqfx_prty_emp_incm_info
where emp_mtch_cd = 'Y'
and emp_mtch_actv_rcrd_in = 'Y'
and emp_sts_in = 'A'
GROUP BY prty_tax_govt_issu_id
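An equivalent DataFrame-only route (a sketch, not from the answers above): compute the per-id count with a window and keep only the rows whose count is 1, reusing the df loaded earlier.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit}

// Window over each id; rows whose id occurs more than once get cnt > 1 and are dropped.
val w = Window.partitionBy("prty_tax_govt_issu_id")
val unique = df
  .withColumn("cnt", count(lit(1)).over(w))
  .filter(col("cnt") === 1)
  .drop("cnt")
unique.show()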