I have a dataframe where one column is ; separated strings, e.g. "str1;str2;str3;str4", I also have another static list "strx;stry;strz", the goal is to split the column string value and check if the split array has any intersection with the static list, and keep that row
I tried
df.where($"column".split(";").intersect(staticList).nonEmpty)
or df.where(split($"column", ";").intersect(staticList).nonEmpty)
or replace $"" with col()
or use $"".getString before calling split
I always get error split is not a member of org.apache.spark.sql.ColumnName, similar to getString. It appears the split is applied to a Column type instead of String type.
So my question is, during where or filter, how can I access the string value for a column and split it?
Thanks!!
It seems you're mixing up Spark's split method for Columns with Scala's split for Strings. Please see example below for how the two different split methods are used. Method array_intersect is for intersecting the split Array column with the split element-filter string.
val df = Seq(
(1, "a;b;c;d"),
(2, "u;v;w")
).toDF("id", "csv_str")
val filterCsvStr = "a;c;x;y"
val df2 = df.withColumn("wanted_strs", array_intersect(
split($"csv_str", ";"),
lit(filterCsvStr.split(";"))
)
)
df2.show
// +---+-------+-----------+
// | id|csv_str|wanted_strs|
// +---+-------+-----------+
// | 1|a;b;c;d| [a, c]|
// | 2| u;v;w| []|
// +---+-------+-----------+
df2.where(size($"wanted_strs") > 0).show
// +---+-------+-----------+
// | id|csv_str|wanted_strs|
// +---+-------+-----------+
// | 1|a;b;c;d| [a, c]|
// +---+-------+-----------+
A couple of notes:
Method array_intersect removes duplicate elements in the Array column.
lit is for converting the Scala list to a Column as array_intersect expects Column-typed arguments.
If the element filters are available as values of a Column, arguments for the array_intersect will be simpler:
val df = Seq(
(1, "a;b;c;d", "b;c;e"),
(2, "u;v;w", "t;v")
).toDF("id", "csv_str", "filter_csv_str")
val df2 = df.withColumn("wanted_strs", array_intersect(
split($"csv_str", ";"),
split($"filter_csv_str", ";")
)
)
df2.show
// +---+-------+--------------+-----------+
// | id|csv_str|filter_csv_str|wanted_strs|
// +---+-------+--------------+-----------+
// | 1|a;b;c;d| b;c;e| [b, c]|
// | 2| u;v;w| t;v| [v]|
// +---+-------+--------------+-----------+
I have a column called responseTimes which is of arrayType:
ArrayType(IntegerType,true)
I'm trying to add another column to count the number of null or not-set values in this array:
val contains_null = udf((xs: Seq[Integer]) => xs.contains(null))
df.withColumn("totalNulls", when(contains_null(col("responseTimes")),
lit(1)).otherwise(0))
Although this gives me the right output, IntelliJ keeps telling me to avoid the use of null in my UDF which makes me think this is bad. Is there any other way to do it? Also, is it possible without using UDFs?
The reason is very simple , it is because of the rules of spark udf, well spark deals with null in a different distributed way, I don't know if you know the array_contains built-in function in spark sql.
If UDFs are needed, follow these rules:
Scala code should deal with null values gracefully and shouldn’t error out if there are null values.
Scala code should return None (or null) for values that are unknown, missing, or irrelevant. DataFrames should also use null for for values that are unknown, missing, or irrelevant.
Use Option in Scala code and fall back on null if Option becomes a performance bottleneck.
Please refer to this link if you like tp read more: https://mungingdata.com/apache-spark/dealing-with-null/
You can rewrite your UDF to use Option. In scala, Option(null) gives None, so you can do :
val contains_null = udf((xs: Seq[Integer]) => xs.exists(e => Option(e).isEmpty))
However, if you are using Spark 2.4+, it is more suitable to use Spark built-in functions for this. To check if an array column contains null elements, use exists as suggested by #mck's answer.
If you want to get the count of nulls in array you can combine filter and size function :
df.withColumn("totalNulls", size(expr("filter(responseTimes, x -> x is null)")))
A better way is probably to use higher order function exists to check isnull for each array element:
// sample dataframe
val df = spark.sql("select array(1,null,2) responseTimes union all select array(3,4)")
df.show
+-------------+
|responseTimes|
+-------------+
| [1,, 2]|
| [3, 4]|
+-------------+
// check whether there exists null elements in the array
val df2 = df.withColumn("totalNulls", expr("int(exists(responseTimes, x -> isnull(x)))"))
df2.show
+-------------+----------+
|responseTimes|totalNulls|
+-------------+----------+
| [1,, 2]| 1|
| [3, 4]| 0|
+-------------+----------+
You can also use array_max together with transform:
val df2 = df.withColumn("totalNulls", expr("int(array_max(transform(responseTimes, x -> isnull(x))))"))
df2.show
+-------------+----------+
|responseTimes|totalNulls|
+-------------+----------+
| [1,, 2]| 1|
| [3, 4]| 0|
+-------------+----------+
I am new to Scala/spark. I am working on Scala/Spark application that selects a couple of columns from a hive table and then converts it into a Mutable map with the first column being the keys and second column being the values. For example:
+--------+--+
| c1 |c2|
+--------+--+
|Newyork |1 |
| LA |0 |
|Chicago |1 |
+--------+--+
will be converted to Scala.mutable.Map(Newyork -> 1, LA -> 0, Chicago -> 1)
Here is my code for the above conversion:
val testDF = hiveContext.sql("select distinct(trim(c1)),trim(c2) from default.table where trim(c1)!=''")
val testMap = scala.collection.mutable.Map(testDF.map(r => (r(0).toString,r(1).toString)).collectAsMap().toSeq: _*)
I have no problem with the conversion. However, when I print the counts of rows in the Dataframe and the size of the Map, I see that they don't match:
println("Map - "+testMap.size+" DataFrame - "+testDF.count)
//Map - 2359806 DataFrame - 2368295
My idea is to convert the Dataframes to collections and perform some comparisons. I am also picking up data from other tables, but they are just single columns. and I have no problem converting them to ArrayBuffer[String] - The counts match.
I don't understand why I am having a problem with the testMap. Generally, the counts rows in the DF and the size of the Map should match, right?
Is it because there are too many records? How do I get the same number of records in the DF into the Map?
Any help would be appreciated. Thank you.
I believe the mismatch in counts is caused by elimination of duplicated keys (i.e. city names) in Map. By design, Map maintains unique keys by removing all duplicates. For example:
val testDF = Seq(
("Newyork", 1),
("LA", 0),
("Chicago", 1),
("Newyork", 99)
).toDF("city", "value")
val testMap = scala.collection.mutable.Map(
testDF.rdd.map( r => (r(0).toString, r(1).toString)).
collectAsMap().toSeq: _*
)
// testMap: scala.collection.mutable.Map[String,String] =
// Map(Newyork -> 99, LA -> 0, Chicago -> 1)
You might want to either use a different collection type or include an identifying field to your Map key to make it unique. Depending on your data processing need, you can also aggregate data into a Map-like dataframe via groupBy like below:
testDF.groupBy("city").agg(count("value").as("valueCount"))
In this example, the total of valueCount should match the original row count.
If you add entries with duplicate key to your map, duplicates are automatically removed. So what you should compare is:
println("Map - "+testMap.size+" DataFrame - "+testDF.select($"c1").distinct.count)
I need an idea for how to join two datasets with millions of arrays. Each dataset will have Longs numbered 1-10,000,000. But with different groupings in each one ex. [1,2] [3, 4] and [1], [2, 3], [4] output should be [1,2,3,4]
I need some way to join these sets efficiently.
I have tried an approach where I explode and group by multiple times, finally sorting and distincting the arrays. This works on small sets but is very inefficent for large sets because it explodes the number of rows many times over.
Any ideas on how to use another approach like a reducer or aggregation to solve this problem more efficiently.
The following is a scala code example. However, I would need an approach that works in java as well.
val rdd1 = spark.sparkContext.makeRDD(Array("""{"groupings":[1,2,3]}""", """{"groupings":[4,5,6]}""", """{"groupings":[7,8,9]}""", """{"groupings":[10]}""", """{"groupings":[11]}"""))
val rdd2 = spark.sparkContext.makeRDD(Array("""{"groupings":[1]}""", """{"groupings":[2,3,4]}""", """{"groupings":[7,8]}""", """{"groupings":[9]}""", """{"groupings":[10,11]}"""))
val srdd1 = spark.read.json(rdd1)
val srdd2 = spark.read.json(rdd2)
Dataset 1:
+---------+
|groupings|
+---------+
|[1, 2, 3]|
|[4, 5, 6]|
|[7, 8, 9]|
| [10]|
| [11]|
+---------+
Dataset 2:
+---------+
|groupings|
+---------+
| [1]|
|[2, 3, 4]|
| [7, 8]|
| [9]|
| [10, 11]|
+---------+
Output should be
+------------------+
| groupings|
+------------------+
|[1, 2, 3, 4, 5, 6]|
| [7, 8, 9]|
| [10, 11]|
+------------------+
Update:
This was my original code, which I had problems running, #AyanGuha had me thinking that perhaps it would be simpler to just use a series of joins instead, I am testing that now and will post a solution if it works out.
srdd1.union(srdd2).withColumn("temp", explode(col("groupings")))
.groupBy("temp")
.agg(collect_list("groupings").alias("groupings"))
.withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
.withColumn("temp", explode(col("groupings")))
.groupBy("temp")
.agg(collect_list("groupings").alias("groupings"))
.withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
.withColumn("temp", explode(col("groupings")))
.groupBy("temp")
.agg(collect_list("groupings").alias("groupings"))
.withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
.select(callUDF("sortLongArray", col("groupings")).alias("groupings"))
.distinct()
What this code showed was that after 3 iterations the data coalesced, ideally then 3 joins would do the same.
Update 2:
Looks like I have a new working version, still seems inefficient but I think this will be handled better by spark.
val ardd1 = spark.sparkContext.makeRDD(Array("""{"groupings":[1,2,3]}""", """{"groupings":[4,5,6]}""", """{"groupings":[7,8,9]}""", """{"groupings":[10]}""", """{"groupings":[11,12]}""", """{"groupings":[13,14]}"""))
val ardd2 = spark.sparkContext.makeRDD(Array("""{"groupings":[1]}""", """{"groupings":[2,3,4]}""", """{"groupings":[7,8]}""", """{"groupings":[9]}""", """{"groupings":[10,11]}""", """{"groupings":[12,13]}""", """{"groupings":[14, 15]}"""))
var srdd1 = spark.read.json(ardd1)
var srdd2 = spark.read.json(ardd2)
val addUDF = udf((x: Seq[Long], y: Seq[Long]) => if(y == null) x else (x ++ y).distinct.sorted)
val encompassUDF = udf((x: Seq[Long], y: Seq[Long]) => if(x.size == y.size) false else (x diff y).size == 0)
val arrayContainsAndDiffUDF = udf((x: Seq[Long], y: Seq[Long]) => (x.intersect(y).size > 0) && (y diff x).size > 0)
var rdd1 = srdd1
var rdd2 = srdd2.withColumnRenamed("groupings", "groupings2")
for (i <- 1 to 3){
rdd1 = rdd1.join(rdd2, arrayContainsAndDiffUDF(col("groupings"), col("groupings2")), "left")
.select(addUDF(col("groupings"), col("groupings2")).alias("groupings"))
.distinct
.alias("rdd1")
rdd2 = rdd1.select(col("groupings").alias("groupings2")).alias("rdd2")
}
rdd1.join(rdd2, encompassUDF(col("groupings"), col("groupings2")), "leftanti")
.show(10, false)
Outputs:
+------------------------+
|groupings |
+------------------------+
|[10, 11, 12, 13, 14, 15]|
|[1, 2, 3, 4, 5, 6] |
|[7, 8, 9] |
+------------------------+
I will try this at scale and see what I get.
This works on small sets but is very inefficent for large sets because it explodes the number of rows many times over.
I don't think you have other options than explodeing the arrays, join followed by distinct. Spark is fairly good at such computations and tries doing them as much using internal binary rows as possible. The datasets are compressed and often comparisons are done at byte level (outside JVM)
That's just a matter of enough memory to hold all the elements which may not that big deal.
I'd recommend giving your solution a try and check out the physical plan and the stats. It could in the end turn out to be the only available solution.
Here is an alternate solution using ARRAY data type that is supported as part of HiveQL. This will at least make your coding simple [i.e. building out the logic]. Code below assumes that the raw data is in a text file.
Step 1. Create table
create table array_table1(array_col1 array<int>) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ',' LINES TERMINATED
BY'\n' STORED AS text;
Step 2: Load data into both tables
LOAD DATA INPATH('/path/to/file') OVERWRITE INTO TABLE array_table1
Step 3: Apply sql functions to get results
select distinct(explode(array_col1)) from array_table1 union
select distinct(explode(array_col2)) from array_table2
I am not clear on what is the final output you are looking for, from the example. Is it just union of all distinct numbers - or are they supposed to have a grouping? But anyway, with the tables created you can use a combination of distinct, explode(), left anti join and union - to get the expected results.
You may want to optimize this code to filter the final data set again for duplicates.
Hope that helps!
OK I finally figured it out.
First of all with my array joins I was doing something very wrong, which I overlooked initially.
When joining two arrays with a equivalency. EX. does [1,2,3] equal [1,2,3]? the arrays are hashed. I was doing an intersection match using a UDF. Given x in [1,2,3] is any x in [1, 2, 3, 4, 5]. This cannot be hashed and therefore requires a plan which will check every row with every row.
So to do this you have to explode both arrays first, then join them.
You can then apply other criteria. For example I saved time by only joining arrays which were not equal and when summed were less then the other.
Example with a self join:
rdd2 = rdd2.withColumn("single", explode(col("grouping"))) // Explode the grouping
temp = rdd2.withColumnRenamed("grouping", "grouping2").alias("temp") // Alias for self join
rdd2 = rdd2.join(temp, rdd2.col("single").equalTo(temp.col("single")) // Compare singles which will be hashed
.and(col("grouping").notEqual(col("grouping2"))) // Apply further conditions
.and(callUDF("lessThanArray", col("grouping"), col("grouping2"))) // Make it so only [1,2,3] [4,5,6] is joined and not the duplicate [4,5,6] [1,2,3], for efficiency
, "left") // Left so that the efficiency criteria do not drop rows
I then grouped by the grouping which was joined against aggregated the groupings from the self join.
rdd2.groupBy("grouping")
.agg(callUDF("list_agg",col('grouping2')).alias('grouping2')) // List agg is a UserDefinedAggregateFunction which aggregates lists into a distinct list
.select(callUDF("addArray", col("grouping"), col("grouping2")).alias("grouping")) // AddArray is a udf which concats and distincts 2 arrays
grouping grouping2
[1,2,3] [3,4,5]
[1,2,3] [2,6,7]
[1,2,3] [2,8,9]
becomes just
[1,2,3] [2,3,4,5,6,7,8,9]
after addArray
[1,2,3,4,5,6,7,8,9]
I then iterated that code 3 times, which seems to make everything coalesce and threw in a distinct for good measure.
Notice from the original question I had two datasets, for my specific problem I discovered some assumptions about the first and second set. The first set I could assume had no duplicates as it was a master list, the second set had duplicates hence I only needed to apply the above code to the second set, then join it with the first. I would assume if both sets had duplicates they could be unioned together first.