Spark SQL Split or Extract words from String of Words - scala

I have a Spark DataFrame like below. I'm trying to split the content column into three more columns:
date   time  content
28may  11am  [ssid][customerid,shopid]
val personDF2 = personDF.withColumn("temp", split(col("content"), "\\[")).select(
col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s"col$i")): _*)
The expected output is:
date   time  content                    col1  col2        col3
28may  11am  [ssid][customerid,shopid]  ssid  customerid  shopid
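For reference, a minimal sketch of one direct way to get that output, assuming content always has the bracketed shape shown (strip the outer brackets, then split on the ][ separator and on commas):
val personDF2 = personDF
  .withColumn("temp", split(regexp_replace(col("content"), "^\\[|\\]$", ""), "\\]\\[|,"))
  .select(col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s"col${i + 1}")): _*)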

This assumes a String representing an array of words. You can also collapse the intermediate DataFrames to reduce the load on the system. If there are more than 9 columns you may want zero-padded names such as c00, c01, etc., so that c10 sorts after c09, or simply use integers as column names; I leave that up to you.
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", "[foo][customerid,shopid][Donald,Trump,Esq][single]"),
("B", "[foo]")
)).toDF("k", "v")
val df2 = df.withColumn("words_temp", regexp_replace($"v", lit("]"), lit(""))) // drop the closing brackets
val df3 = df2.withColumn("words_temp2", regexp_replace($"words_temp", lit(","), lit("["))).drop("words_temp") // turn commas into the same [ delimiter
val df4 = df3.withColumn("words_temp3", expr("substring(words_temp2, 2, length(words_temp2))")).withColumn("cnt", expr("length(words_temp2)")).drop("words_temp2") // strip the leading [ ; cnt is not used later
val df5 = df4.withColumn("words", split(col("words_temp3"), "\\[")).drop("words_temp3") // split into an array of words
val df6 = df5.withColumn("num_words", size($"words"))
val df7 = df6.withColumn("v2", explode($"words"))
// Convert to Array of sorts via group by
val df8 = df7.groupBy("k")
.agg(collect_list("v2"))
// Convert to an RDD of tuples and zip each word with its index to generate column names; that is the trick that makes pivot usable
val rdd = df8.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df9 = rdd3.toDF("k", "v")
val df10 = df9.withColumn("vn", explode($"v"))
val df11 = df10.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df11.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in this case:
+---+---+----------+------+------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |c6 |
+---+---+----------+------+------+-----+----+------+
|B |foo|null |null |null |null |null|null |
|A |foo|customerid|shopid|Donald|Trump|Esq |single|
+---+---+----------+------+------+-----+----+------+
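Regarding the remark above about more than 9 columns: zero-padding the generated names keeps them in lexicographic order. A hypothetical tweak to the df11 line (untested sketch, same df10 as above):
val df11 = df10.select($"k", $"vn".getField("_1"),
  concat(lit("c"), format_string("%02d", $"vn".getField("_2")))).toDF("k", "v", "c")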

Update:
This part is based on the original title, which stated an array of words; see the other answer for the string case.
A few notes if you are new to this: it could presumably also be done with a Dataset and map, but here is a solution using DataFrames and RDDs. I may well investigate a pure Dataset approach in future, but this works for sure and at scale.
// Can amalgamate more steps
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", Array(Array("foo", "bar"), Array("Donald", "Trump","Esq"), Array("single"))),
("B", Array(Array("foo2", "bar2"), Array("single2"))),
("C", Array(Array("foo3", "bar3", "x", "y", "z")))
)).toDF("k", "v")
// flatten via 2x explode; can be done more elegantly with a def or UDF, but keeping it simple here
val df2 = df.withColumn("v2", explode($"v"))
val df3 = df2.withColumn("v3", explode($"v2"))
// Convert to Array of sorts via group by
val df4 = df3.groupBy("k")
.agg(collect_list("v3"))
// Convert to an RDD of tuples and zip each word with its index to generate column names; that is the trick that makes pivot usable
val rdd = df4.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df5 = rdd3.toDF("k", "v")
val df6 = df5.withColumn("vn", explode($"v"))
val df7 = df6.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df7.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in correct col order:
+---+----+----+-------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |
+---+----+----+-------+-----+----+------+
|B |foo2|bar2|single2|null |null|null |
|C |foo3|bar3|x |y |z |null |
|A |foo |bar |Donald |Trump|Esq |single|
+---+----+----+-------+-----+----+------+
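On Spark 2.4+ the RDD round trip for generating positions can be avoided: flatten collapses the nested arrays and posexplode supplies the index needed for the pivot. A hedged sketch using the same df as above:
val alt = df
  .select($"k", posexplode(flatten($"v")))                               // yields pos, col
  .select($"k", $"col".as("v"), concat(lit("c"), $"pos").as("c"))
  .groupBy("k").pivot("c").agg(first("v"))
alt.show(false)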

Related

Spark higher order functions to compute top N products from a comma separated list

I am using Spark 2.4 and I have a Spark DataFrame with two columns, id and product_list. The data consists of the list of products that each id has interacted with.
here is the sample code -
scala> spark.version
res3: String = 2.4.3
val df = Seq(
("1", "p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2"),
("2", "p2,p2,p2,p2,p2,p4,p4,p4,p1,p3")
).toDF("id", "product_list")
df.createOrReplaceTempView("df")
+---+--------------------------------+
|id |product_list |
+---+--------------------------------+
|1 |p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2|
|2 |p2,p2,p2,p2,p2,p4,p4,p4,p1,p3 |
+---+--------------------------------+
I would like to return the top 2 products that each id has interacted with. For instance, id = 1 has viewed product p1 5 times and p2 4 times, so I would like to return p1,p2 for id = 1, and similarly p2,p4 for id = 2.
My final output should look like
id, most_seen_products
1, p1,p2
2, p2,p4
Since I am using Spark 2.4, I was wondering whether there is a higher-order function that can first convert this list to an array and then return the top 2 viewed products. In general, the code should handle top N products.
Here is my approach
val df = Seq(
("1", "p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2"),
("2", "p2,p2,p2,p2,p2,p4,p4,p4,p1,p3")
).toDF("id", "product_list")
import org.apache.spark.sql.Row
def getMetrics(value: Row, n: Int): (String, String) = {
val split = value.getAs[String]("product_list").split(",")
val sortedRecords = split.groupBy(x => x).map(data => (data._1, data._2.size)).toList.sortWith(_._2 > _._2)
(value.getAs[String]("id"), sortedRecords.take(n).map(_._1).mkString(","))
}
df.map(value =>
getMetrics(value, 2)
).withColumnRenamed("_1", "id").withColumnRenamed("_2", "most_seen_products").show(false)
Result
+---+------------------+
|id |most_seen_products|
+---+------------------+
|1 |p1,p2 |
|2 |p2,p4 |
+---+------------------+
Looking at your data format, you can just use a .map() or, in the case of SQL, a UDF that converts each row. The function will be:
productList => {
// list of products = split productList by comma
// add all items to a String/Count map
// sort the map, get first 2 elements
// return string.join of those 2 elements
}
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> import scala.collection.immutable.ListMap
scala> def max_products:UserDefinedFunction = udf((product:String) => {
val productList = product.split(",").toList
val finalList = ListMap(productList.groupBy(i=>i).mapValues(_.size).toSeq.sortWith(_._2 > _._2):_*).keys.toList
finalList(0) + "," + finalList(1)
})
scala> df.withColumn("most_seen_products", max_products(col("product_list"))).show(false)
+---+--------------------------------+------------------+
|id |product_list |most_seen_products|
+---+--------------------------------+------------------+
|1 |p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2|p1,p2 |
|2 |p2,p2,p2,p2,p2,p4,p4,p4,p1,p3 |p2,p4 |
+---+--------------------------------+------------------+
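To generalize the UDF above to top N rather than hard-coding the first two elements, a hedged variant (assumes ties are broken arbitrarily, same df as above):
def max_products_n(n: Int): UserDefinedFunction = udf((product: String) =>
  product.split(",").toList
    .groupBy(identity).mapValues(_.size).toSeq
    .sortWith(_._2 > _._2)
    .take(n).map(_._1).mkString(","))
df.withColumn("most_seen_products", max_products_n(2)(col("product_list"))).show(false)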

How to subtract vector from scalar in scala?

I have a parquet file which contains two columns (id, features). I want to subtract a scalar from features and divide the result by another scalar. I tried
df.withColumn("features", ((df("features")-constant1)/constant2))
but it gives me this error:
requirement failed: The number of columns doesn't match. Old column
names (2): id, features New column names (1): features
How to solve it?
My Scala Spark code for this is below. One way to do operations on the Spark ML vector datatype is to cast it to a string; I also used a UDF to perform the subtraction and division.
import spark.implicits._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions._
var df = Seq((1, Vectors.dense(35)),
(2, Vectors.dense(45)),
(3, Vectors.dense(4.5073)),
(4, Vectors.dense(56)))
.toDF("id", "features")
df.show()
val constant1 = 10
val constant2 = 2
val performComputation = (s: Double, val1: Int, val2: Int) => {
Vectors.dense((s - val1) / val2)
}
val performComputationUDF = udf(performComputation)
df.printSchema()
df = df.withColumn("features",
regexp_replace(df.col("features").cast("String"),
"[\\[\\]]", "").cast("Double")
)
df = df.withColumn("features",
performComputationUDF(df.col("features"),
lit(constant1), lit(constant2))
)
df.show(20, false)
// Write the result with mode overwrite
df.write
.mode("overwrite")
.parquet("file:///usr/local/spark/dataset/output1/")
Result
+---+----------+
|id |features |
+---+----------+
|1 |[12.5] |
|2 |[17.5] |
|3 |[-2.74635]|
|4 |[23.0] |
+---+----------+
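For what it's worth, the cast to String is not strictly required: a UDF can operate on the ML Vector type directly. A hedged sketch, assuming the same constant1 and constant2 as above and applied to the original vector column (i.e., before the cast above), for dense vectors of any length:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
// subtract constant1 from every element and divide by constant2, keeping the Vector type
val shiftAndScale = udf((v: Vector) =>
  Vectors.dense(v.toArray.map(x => (x - constant1) / constant2)))
val dfDirect = df.withColumn("features", shiftAndScale(df("features")))
dfDirect.show(false)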

append multiple columns to existing dataframe in spark

I need to append multiple columns to an existing Spark DataFrame, where the column names are given in a List.
Assuming the values for the new columns are constant, for example, given the input columns and DataFrame
val columnsNames=List("col1","col2")
val data = Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4))
and after appending both columns, assuming the constant values are "val1" for col1 and "val2" for col2, the output DataFrame should be
+-----+---+----+----+
|   _1| _2|col1|col2|
+-----+---+----+----+
|  one|  1|val1|val2|
|  two|  2|val1|val2|
|three|  3|val1|val2|
| four|  4|val1|val2|
+-----+---+----+----+
I have written a function to append the columns:
def appendColumns (cols: List[String], ds: DataFrame): DataFrame = {
cols match {
case Nil => ds
case h :: Nil => appendColumns(Nil, ds.withColumn(h, lit(h)))
case h :: tail => appendColumns(tail, ds.withColumn(h, lit(h)))
}
}
Is there a better, more functional way to do it?
Thanks.
Yes, there is a better and simpler way. Basically, you make as many calls to withColumn as you have columns. With lots of columns, Catalyst, the engine that optimizes Spark queries, may feel a bit overwhelmed (I have had that experience in the past with a similar use case, and I have even seen it cause an OOM on the driver when experimenting with thousands of columns). To avoid stressing Catalyst (and write less code), you can simply use select as below to get this done in one Spark command:
val data = Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF
// let's assume that we have a map that associates column names to their values
val columnMap = Map("col1" -> "val1", "col2" -> "val2")
// Let's create the new columns from the map
val newCols = columnMap.keys.map(k => lit(columnMap(k)) as k)
// selecting the old columns + the new ones
data.select(data.columns.map(col) ++ newCols : _*).show
+-----+---+----+----+
| _1| _2|col1|col2|
+-----+---+----+----+
| one| 1|val1|val2|
| two| 2|val1|val2|
|three| 3|val1|val2|
| four| 4|val1|val2|
+-----+---+----+----+
As opposed to recursion, I think a foldLeft is the more general approach for a limited number of columns. Using a Databricks notebook:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._
val columnNames = Seq("c3","c4")
val df = Seq(("one", 1), ("two", 2), ("three", 3), ("four", 4)).toDF("c1", "c2")
def addCols(df: DataFrame, columns: Seq[String]): DataFrame = {
columns.foldLeft(df)((acc, col) => {
acc.withColumn(col, lit(col)) })
}
val df2 = addCols(df, columnNames)
df2.show(false)
returns:
+-----+---+---+---+
|c1 |c2 |c3 |c4 |
+-----+---+---+---+
|one |1 |c3 |c4 |
|two |2 |c3 |c4 |
|three|3 |c3 |c4 |
|four |4 |c3 |c4 |
+-----+---+---+---+
Please be aware of the following: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015, albeit in a slightly different context; the other answer alludes to this via the select approach.
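The same output as the foldLeft version can also be produced in a single select, which sidesteps the repeated withColumn calls discussed in that article (a sketch assuming the same df and columnNames as above):
// existing columns plus one literal column per name in columnNames, in one pass
val df3 = df.select(df.columns.map(col) ++ columnNames.map(c => lit(c).as(c)): _*)
df3.show(false)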

Removing the Option type from a joined RDD

There are two RDDs:
val pairRDD1 = sc.parallelize(List( ("cat",2), ("girl", 5), ("book", 4),("Tom", 12)))
val pairRDD2 = sc.parallelize(List( ("cat",2), ("cup", 5), ("mouse", 4),("girl", 12)))
And then I will do this join operation.
val kk = pairRDD1.fullOuterJoin(pairRDD2).collect
It shows this:
kk: Array[(String, (Option[Int], Option[Int]))] = Array((book,(Some(4),None)), (Tom,(Some(12),None)), (girl,(Some(5),Some(12))), (mouse,(None,Some(4))), (cup,(None,Some(5))), (cat,(Some(2),Some(2))))
If I would like to fill the None values with 0 and transform Option[Int] to Int, what should I write? Thanks!
You can use mapValues on the joined RDD, before the collect:
pairRDD1.fullOuterJoin(pairRDD2).mapValues(pair => (pair._1.getOrElse(0), pair._2.getOrElse(0)))
This has to happen on the RDD, before collect; if you have already collected, you can do the equivalent on the array:
kk.map { case (k, pair) => (k, (pair._1.getOrElse(0), pair._2.getOrElse(0))) }
Based on the comments on the first answer: if you are fine using DataFrames, you can do this with DataFrames of any number of columns.
val ss = SparkSession.builder().master("local[*]").getOrCreate()
val sc = ss.sparkContext
import ss.implicits._
val pairRDD1 = sc.parallelize(List(("cat", 2,9999), ("girl", 5,8888), ("book", 4,9999), ("Tom", 12,6666)))
val pairRDD2 = sc.parallelize(List(("cat", 2,9999), ("cup", 5,7777), ("mouse", 4,3333), ("girl", 12,1111)))
val df1 = pairRDD1.toDF
val df2 = pairRDD2.toDF
val joined = df1.join(df2, df1.col("_1") === df2.col("_1"),"fullouter")
joined.show()
Here _1, _2, etc. are the default column names provided by Spark, but you can rename them if you wish to have proper names.
Result:
+----+----+----+-----+----+----+
| _1| _2| _3| _1| _2| _3|
+----+----+----+-----+----+----+
|girl| 5|8888| girl| 12|1111|
| Tom| 12|6666| null|null|null|
| cat| 2|9999| cat| 2|9999|
|null|null|null| cup| 5|7777|
|null|null|null|mouse| 4|3333|
|book| 4|9999| null|null|null|
+----+----+----+-----+----+----+
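If you also want the zero-filling from the original question on the DataFrame side, a minimal follow-up sketch (assumes every value column is numeric):
val filled = joined.na.fill(0)   // replaces null in numeric columns with 0
filled.show()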

How to split comma separated string and get n values in Spark Scala dataframe?

How can I take only 2 elements from an array-type column in Spark Scala?
I get the data with val df = spark.sqlContext.sql("select col1, col2 from test_tbl").
I have data like the following:
col1 | col2
--- | ---
a | [test1,test2,test3,test4,.....]
b | [a1,a2,a3,a4,a5,.....]
I want to get data like the following:
col1| col2
----|----
a | test1,test2
b | a1,a2
When I do df.withColumn("test", col("col2").take(5)) it does not work. It gives this error:
value take is not a member of org.apache.spark.sql.ColumnName
How can I get the data in the format above?
Inside withColumn you can call a UDF, here getPartialstring, which uses the slice or take method, as in the untested example snippet below.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val getPartialstring = udf((array: Seq[String], fromIndex: Int, toIndex: Int) =>
  array.slice(fromIndex, toIndex).mkString(","))
Your caller will then look like (the index bounds are passed as literal columns):
df.withColumn("test", getPartialstring(col("col2"), lit(0), lit(2)))
col("col2").take(5) is failing because column doesn't have a method take(..) that's why your error message says
error: value take is not a member of org.apache.spark.sql.ColumnName
A UDF approach is one way to tackle this.
You can use the array Column's apply function to get each individual item up to a certain index, and then build a new array using the array function:
import spark.implicits._
import org.apache.spark.sql.functions._
// Sample data:
val df = Seq(
("a", Array("a1", "a2", "a3", "a4", "a5", "a6")),
("a", Array("b1", "b2", "b3", "b4", "b5")),
("c", Array("c1", "c2"))
).toDF("col1", "col2")
val n = 4
val result = df.withColumn("col2", array((0 until n).map($"col2"(_)): _*))
result.show(false)
// +----+--------------------+
// |col1|col2 |
// +----+--------------------+
// |a |[a1, a2, a3, a4] |
// |a |[b1, b2, b3, b4] |
// |c |[c1, c2, null, null]|
// +----+--------------------+
Note that this will "pad" results with nulls for records with arrays smaller than n.
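If you are on Spark 2.4+, the built-in slice function is a simpler alternative that does not pad with nulls, and concat_ws then yields the comma-separated string the question asked for. A minimal sketch, assuming the same df and n as above:
val sliced = df.withColumn("col2", concat_ws(",", slice($"col2", 1, n)))   // slice is 1-based
sliced.show(false)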