How to compare two columns' data in Spark DataFrames using Scala

I want to compare two columns in a Spark DataFrame: if the value of one column (attr_value) is found among the values of another (attr_valuelist), I want only that matching value to be kept; otherwise, the column value should be null.
For example, given the following input
id1  id2  attrname  attr_value  attr_valuelist
1    2    test      Yes         Yes, No
2    1    test1     No          Yes, No
3    2    test2     value1      val1, Value1, value2
I would expect the following output
id1  id2  attrname  attr_value  attr_valuelist
1    2    test      Yes         Yes
2    1    test1     No          No
3    2    test2     value1      Value1

I assume, given your sample input, that the column with the search item contains a string while the search target is a sequence of strings. Also, I assume you're interested in case-insensitive search.
This is going to be the input (I added a row that would yield a null, to test the behavior of the UDF I wrote):
+---+---+--------+----------+----------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+----------------------+
|1 |2 |test |Yes |[Yes, No] |
|2 |1 |test1 |No |[Yes, No] |
|3 |2 |test2 |value1 |[val1, Value1, value2]|
|3 |2 |test2 |value1 |[val1, value2] |
+---+---+--------+----------+----------------------+
You can solve your problem with a very simple UDF.
import org.apache.spark.sql.functions.udf

// Returns the first case-insensitive match from the list, or null (None) when there is no match.
val find = udf {
  (item: String, collection: Seq[String]) =>
    collection.find(_.toLowerCase == item.toLowerCase)
}
val df = spark.createDataFrame(Seq(
  (1, 2, "test", "Yes", Seq("Yes", "No")),
  (2, 1, "test1", "No", Seq("Yes", "No")),
  (3, 2, "test2", "value1", Seq("val1", "Value1", "value2")),
  (3, 2, "test2", "value1", Seq("val1", "value2"))
)).toDF("id1", "id2", "attrname", "attr_value", "attr_valuelist")

df.select(
  $"id1", $"id2", $"attrname", $"attr_value",
  find($"attr_value", $"attr_valuelist") as "attr_valuelist")
Showing the result of the last command would yield the following output:
+---+---+--------+----------+--------------+
|id1|id2|attrname|attr_value|attr_valuelist|
+---+---+--------+----------+--------------+
| 1| 2| test| Yes| Yes|
| 2| 1| test1| No| No|
| 3| 2| test2| value1| Value1|
| 3| 2| test2| value1| null|
+---+---+--------+----------+--------------+
You can run this code in any spark-shell. If you are using it from a job you submit to a cluster, remember to import spark.implicits._.
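For readers who prefer a complete picture, here is a minimal sketch of such a submitted application; the object and app names are illustrative and not part of the original answer:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Hypothetical application skeleton; the point is where import spark.implicits._ goes.
object FindInListJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("find-in-list").getOrCreate()
    import spark.implicits._ // needed for $"..." and toDF outside the shell

    val find = udf { (item: String, collection: Seq[String]) =>
      collection.find(_.toLowerCase == item.toLowerCase)
    }
    // ... build the DataFrame and run the select shown above ...

    spark.stop()
  }
}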

You can also try this code; it uses a SQL case when expression. Note that it assumes attr_valuelist is a plain comma-separated string (as in the question's sample), not an array column.
your_dataframe.createOrReplaceTempView("tbl")
val result = sqlContext.sql("""
  select id1, id2, attrname, attr_value,
         case when attr_valuelist like concat('%', attr_value, '%')
              then attr_value
              else null
         end as attr_valuelist
  from tbl
""")
result.show

Related

Spark Scala split column values in a dataframe to appended lists

I have data in a Spark DataFrame in which I need to search for elements by name, append the values to a list, and split the searched elements into separate columns of the DataFrame.
I am using Scala, and below is an example of my current code; it works to get the first value, but I need to append all available values, not just the first.
I'm new to Scala (and Python), so any help will be greatly appreciated!
val getNumber: (String => String) = (colString: String) => {
  if (colString != null) {
    raw"number:(\d+)".r
      .findAllIn(colString)
      .group(1)
  }
  else
    null
}

val udfGetColumn = udf(getNumber)

val mydf = df.select(cols.....)
  .withColumn("var_number", udfGetColumn($"var"))
Example Data:
+---+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|key|var                                                                                                                                                              |
+---+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1  |["[number:123456 rate:111970 position:1]","[number:123457 rate:662352 position:2]","[number:123458 rate:890 position:3]","[number:123459 rate:190 position:4]"] |
|2  |["[number:654321 rate:211971 position:1]","[number:654322 rate:124 position:2]","[number:654323 rate:421 position:3]"]                                          |
+---+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
Desired Result:
+---+----------+--------+------------+
|key|var_number|var_rate|var_position|
+---+----------+--------+------------+
|1  |123456    |111970  |1           |
|1  |123457    |662352  |2           |
|1  |123458    |890     |3           |
|1  |123459    |190     |4           |
|2  |654321    |211971  |1           |
|2  |654322    |124     |2           |
|2  |654323    |421     |3           |
+---+----------+--------+------------+
You don't need to use UDF here. You can easily transform the array column var by converting each element into a map using str_to_map after removing the square brackets ([]) with regexp_replace function. Finally, explode the transformed array and select the fields:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, Seq("[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]")),
  (2, Seq("[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"))
).toDF("key", "var")

val result = df.withColumn(
  "var",
  explode(expr(raw"transform(var, x -> str_to_map(regexp_replace(x, '[\\[\\]]', ''), ' '))"))
).select(
  col("key"),
  col("var").getField("number").alias("var_number"),
  col("var").getField("rate").alias("var_rate"),
  col("var").getField("position").alias("var_position")
)
result.show
//+---+----------+--------+------------+
//|key|var_number|var_rate|var_position|
//+---+----------+--------+------------+
//| 1| 123456| 111970| 1|
//| 1| 123457| 662352| 2|
//| 1| 123458| 890| 3|
//| 1| 123459| 190| 4|
//| 2| 654321| 211971| 1|
//| 2| 654322| 124| 2|
//| 2| 654323| 421| 3|
//+---+----------+--------+------------+
From your comment, it appears the column var is of type string, not array. In this case, you can first transform it by removing the [ ] and " characters and then split by comma to get an array:
val df = Seq(
  (1, """["[number:123456 rate:111970 position:1]", "[number:123457 rate:662352 position:2]", "[number:123458 rate:890 position:3]", "[number:123459 rate:190 position:4]"]"""),
  (2, """["[number:654321 rate:211971 position:1]", "[number:654322 rate:124 position:2]", "[number:654323 rate:421 position:3]"]""")
).toDF("key", "var")

val result = df.withColumn(
  "var",
  split(regexp_replace(col("var"), "[\\[\\]\"]", ""), ",")
).withColumn(
  "var",
  explode(expr("transform(var, x -> str_to_map(x, ' '))"))
).select(
  // select your columns as above...
)

Add values to a dataframe against some particular ID in Spark Scala

I have the following dataframe:
ID  Name  City
1   Ali   swl
2   Sana  lhr
3   Ahad  khi
4   ABC   fsd
And a list of values like (1,2,1):
val nums: List[Int] = List(1, 2, 1)
I want to add these values against ID == 3, so that the DataFrame looks like:
ID  Name  City  newCol  newCol2  newCol3
1   Ali   swl   null    null     null
2   Sana  lhr   null    null     null
3   Ahad  khi   1       2        1
4   ABC   fsd   null    null     null
I wonder if this is possible. Any help will be appreciated. Thanks!
Yes, it's possible.
Use when to populate matched values and otherwise for non-matched values.
I have used zipWithIndex to make the column names unique.
Please check the code below.
scala> import org.apache.spark.sql.functions._

scala> val df = Seq((1,"Ali","swl"),(2,"Sana","lhr"),(3,"Ahad","khi"),(4,"ABC","fsd")).toDF("id","name","city") // Creating DataFrame with the given sample data.
df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]

scala> val nums = List(1,2,1) // List of values to add.
nums: List[Int] = List(1, 2, 1)

scala> val filterData = List(3) // Only these IDs get the new values.

scala> spark.time{ nums.zipWithIndex.foldLeft(df)((df,c) => df.withColumn(s"newCol${c._2}",when($"id".isin(filterData:_*),c._1).otherwise(null))).show(false) } // Used zipWithIndex to make column names unique.
+---+----+----+-------+-------+-------+
|id |name|city|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
|1  |Ali |swl |null   |null   |null   |
|2  |Sana|lhr |null   |null   |null   |
|3  |Ahad|khi |1      |2      |1      |
|4  |ABC |fsd |null   |null   |null   |
+---+----+----+-------+-------+-------+

Time taken: 43 ms
First, you can convert the list to a DataFrame with a single array column and then "decompose" the array column into separate columns as follows:
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

val numsDf =
  Seq(nums)
    .toDF("nums")
    .select(nums.indices.map(i => col("nums")(i).alias(s"newCol$i")): _*)
After that, you can use an outer join to join the data to numsDf with the ID == 3 condition as follows:
val resultDf = data.join(numsDf, data.col("ID") === lit(3), "outer")
resultDf.show() will print:
+---+----+----+-------+-------+-------+
| ID|Name|City|newCol0|newCol1|newCol2|
+---+----+----+-------+-------+-------+
| 1| Ali| swl| null| null| null|
| 2|Sana| lhr| null| null| null|
| 3|Ahad| khi| 1| 2| 1|
| 4| ABC| fsd| null| null| null|
+---+----+----+-------+-------+-------+
Make sure you have added the spark.sql.crossJoin.enabled = true option to the Spark session:
val spark = SparkSession.builder()
  ...
  .config("spark.sql.crossJoin.enabled", value = true)
  .getOrCreate()

isin throws StackOverflowError in withColumn function in Spark

I am using Spark 2.3 in my Scala application. I have a DataFrame created from Spark SQL, named sqlDF in the sample code I shared. I have a list of strings with the items below:
List[String] stringList items:
-9, -8, -7, -6
I want to replace, in all columns of the DataFrame, every value that matches an item in this list with 0.
Initial DataFrame:
column1 | column2 | column3
1       | 1       | 1
2       | -5      | 1
6       | -6      | 1
-7      | -8      | -7
It must return:
column1 | column2 | column3
1       | 1       | 1
2       | -5      | 1
6       | 0       | 1
0       | 0       | 0
For this, I am iterating the query below over all columns (more than 500) in sqlDF.
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList:_*), 0).otherwise(col(currColumnName)))
But I am getting the error below. If I iterate over only one column it works, but running the code above over 500 columns fails:
Exception in thread "streaming-job-executor-0" java.lang.StackOverflowError
    at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
    at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
    at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
    at scala.collection.immutable.List.map(List.scala:285)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
What am I missing?
Here is a different approach: apply a left anti join between each column columnX and X, where X is your list of items transferred into a DataFrame. The left anti join returns all the items not present in X; we then concatenate the results through an outer join (which can be replaced with a left join for better performance, though that will exclude records with all zeros, i.e. id == 3) based on an id assigned with monotonically_increasing_id:
import org.apache.spark.sql.functions.{monotonically_increasing_id, col}
import spark.implicits._

val df = Seq(
    (1, 1, 1),
    (2, -5, 1),
    (6, -6, 1),
    (-7, -8, -7))
  .toDF("c1", "c2", "c3")
  .withColumn("id", monotonically_increasing_id())

val exdf = Seq(-9, -8, -7, -6).toDF("x")

df.columns.map { c =>
    df.select("id", c).join(exdf, col(c) === $"x", "left_anti")
  }
  .reduce((df1, df2) => df1.join(df2, Seq("id"), "outer"))
  .na.fill(0)
  .show
Output:
+---+---+---+---+
| id| c1| c2| c3|
+---+---+---+---+
| 0| 1| 1| 1|
| 1| 2| -5| 1|
| 3| 0| 0| 0|
| 2| 6| 0| 1|
+---+---+---+---+
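As noted above, the outer join in the reduce step can be swapped for a left join; a minimal sketch of that variant (same df and exdf as above) is below. Note it drops rows whose values are all in the exclusion list, such as id == 3 here:
// Same pipeline, concatenating with a left join instead of an outer join.
// id == 3 disappears because its first column was filtered out by the anti join.
df.columns.map { c =>
    df.select("id", c).join(exdf, col(c) === $"x", "left_anti")
  }
  .reduce((df1, df2) => df1.join(df2, Seq("id"), "left"))
  .na.fill(0)
  .show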
foldLeft works perfectly for your case here, as below:
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._

val df = spark.sparkContext.parallelize(Seq(
  (1, 1, 1),
  (2, -5, 1),
  (6, -6, 1),
  (-7, -8, -7)
)).toDF("a", "b", "c")

val list = Seq(-9, -8, -7, -6)

val resultDF = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, when(col(name).isin(list: _*), 0).otherwise(col(name)))
}

resultDF.show(false)
Output:
+---+---+---+
|a |b |c |
+---+---+---+
|1 |1 |1 |
|2 |-5 |1 |
|6 |0 |1 |
|0 |0 |0 |
+---+---+---+
I would suggest broadcasting the list of strings:
val stringList = sc.broadcast(<your List[String] of values>)
After that, use this:
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList.value:_*), 0).otherwise(col(currColumnName)))
Make sure your currColumnName is also in string format; the comparison should be string to string.
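If the columns are not already strings, one way to guarantee a string-to-string comparison is to cast each column before the isin check; a sketch, reusing the sqlDF, currColumnName and broadcast stringList names from above:
import org.apache.spark.sql.functions.{col, when}

// Cast the column to string so isin compares string values against the broadcast List[String].
sqlDF = sqlDF.withColumn(
  currColumnName,
  when(col(currColumnName).cast("string").isin(stringList.value: _*), 0)
    .otherwise(col(currColumnName))
)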

Get the number of nulls per row in a PySpark dataframe

This is probably a duplicate, but somehow I have been searching for a long time already:
I want to get the number of nulls per Row in a Spark dataframe. I.e.
col1 col2 col3
null 1 a
1 2 b
2 3 null
Should in the end be:
col1 col2 col3 number_of_null
null 1 a 1
1 2 b 0
2 3 null 1
In a general fashion, I want to get the number of times a certain string or number appears in a spark dataframe row.
I.e.
col1 col2 col3 number_of_ABC
ABC 1 a 1
1 2 b 0
2 ABC ABC 2
I am using Pyspark 2.3.0 and prefer a solution that does not involve SQL syntax. For some reason, I seem not to be able to google this. :/
EDIT: Assume that I have so many columns that I can't list them all.
EDIT2: I explicitly don't want a pandas solution.
EDIT3: The solution explained with sums or means does not work as it throws errors:
(data type mismatch: differing types in '((`log_time` IS NULL) + 0)' (boolean and int))
...
isnull(log_time#10) + 0) + isnull(log#11))
In Scala:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}
import spark.implicits._

val df = List(
  ("ABC", "1", "a"),
  ("1", "2", "b"),
  ("2", "ABC", "ABC")
).toDF("col1", "col2", "col3")

val expected = "ABC"

val complexColumn: Column = df.schema.fieldNames
  .map(c => when(col(c) === lit(expected), 1).otherwise(0))
  .reduce((a, b) => a + b)

df.withColumn("countABC", complexColumn).show(false)
Output:
+----+----+----+--------+
|col1|col2|col3|countABC|
+----+----+----+--------+
|ABC |1 |a |1 |
|1 |2 |b |0 |
|2 |ABC |ABC |2 |
+----+----+----+--------+
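The same map-and-reduce pattern answers the original null-count question as well; a minimal sketch in Scala, with hypothetical data mirroring the question's first example:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when}
import spark.implicits._

// Hypothetical data containing nulls.
val nullDf = Seq[(String, String, String)](
  (null, "1", "a"),
  ("1", "2", "b"),
  ("2", null, null)
).toDF("col1", "col2", "col3")

// 1 for every null column in a row, summed across all columns.
val nullCount: Column = nullDf.schema.fieldNames
  .map(c => when(col(c).isNull, 1).otherwise(0))
  .reduce(_ + _)

nullDf.withColumn("number_of_null", nullCount).show(false)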
As stated in pasha701's answer, I resort to map and reduce. Note that I am working on Spark 1.6.x and Python 2.7
Taking your DataFrame as df (and as is)
from pyspark.sql import functions as func

# sqlc is an existing SQLContext (Spark 1.6.x, Python 2.7)
dfvals = [
    (None, "1", "a"),
    ("1", "2", "b"),
    ("2", None, None)
]
df = sqlc.createDataFrame(dfvals, ['col1', 'col2', 'col3'])

new_df = df.withColumn('null_cnt',
                       reduce(lambda x, y: x + y,
                              map(lambda x: func.when(func.isnull(func.col(x)) == 'true', 1).otherwise(0),
                                  df.schema.names)))
Check if the value is Null and assign 1 or 0. Add the result to get the count.
new_df.show()
+----+----+----+--------+
|col1|col2|col3|null_cnt|
+----+----+----+--------+
|null| 1| a| 1|
| 1| 2| b| 0|
| 2|null|null| 2|
+----+----+----+--------+

Parse nested data using SCALA

I have the dataframe as follows:
ColA  ColB     ColC
1     [2,3,4]  [5,6,7]
I need to convert it to the below:
ColA  ColB  ColC
1     2     5
1     3     6
1     4     7
Can someone please help with the code in Scala?
You can zip the two array columns by means of a UDF and explode the zipped column as follows:
import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._

val df = Seq(
  (1, Seq(2, 3, 4), Seq(5, 6, 7))
).toDF("ColA", "ColB", "ColC")

def zip = udf(
  (x: Seq[Int], y: Seq[Int]) => x zip y
)

val df2 = df.select($"ColA", zip($"ColB", $"ColC").as("BzipC")).
  withColumn("BzipC", explode($"BzipC"))

val df3 = df2.select($"ColA", $"BzipC._1".as("ColB"), $"BzipC._2".as("ColC"))
df3.show
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
| 1| 2| 5|
| 1| 3| 6|
| 1| 4| 7|
+----+----+----+
The idea I am presenting here is a bit more involved: use map to combine the two arrays ColB and ColC, then use the explode function to explode the combined array, and finally extract the exploded combined array into separate columns.
import scala.collection.mutable
import org.apache.spark.sql.functions._
import spark.implicits._

val tempDF = df.map(row => {
    val colB = row(1).asInstanceOf[mutable.WrappedArray[Int]]
    val colC = row(2).asInstanceOf[mutable.WrappedArray[Int]]
    var array = Array.empty[(Int, Int)]
    for (loop <- 0 to colB.size - 1) {
      array = array :+ (colB(loop), colC(loop))
    }
    (row(0).asInstanceOf[Int], array)
  })
  .toDF("ColA", "ColB")
  .withColumn("ColD", explode($"ColB"))

tempDF.withColumn("ColB", $"ColD._1").withColumn("ColC", $"ColD._2").drop("ColD").show(false)
This would give you the result as:
+----+----+----+
|ColA|ColB|ColC|
+----+----+----+
|1 |2 |5 |
|1 |3 |6 |
|1 |4 |7 |
+----+----+----+
You can also use a combination of posexplode and lateral view from HiveQL:
sqlContext.sql("""
select 1 as colA, array(2,3,4) as colB, array(5,6,7) as colC
""").registerTempTable("test")
sqlContext.sql("""
select
colA , b as colB, c as colC
from
test
lateral view
posexplode(colB) columnB as seqB, b
lateral view
posexplode(colC) columnC as seqC, c
where
seqB = seqC
""" ).show
+----+----+----+
|colA|colB|colC|
+----+----+----+
| 1| 2| 5|
| 1| 3| 6|
| 1| 4| 7|
+----+----+----+
Credits: https://stackoverflow.com/a/40614822/7224597 ;)