How to pass elements of a list to concat function? - scala

I am currently using the following approach to concat the columns in a dataframe:
val Finalraw = raw.withColumn("primarykey", concat($"prod_id",$"frequency",$"fee_type_code"))
But the thing is that I do not want to hardcode the column names, as the number of columns changes every time. I have a list that contains the column names:
columnNames: List[String] = List("prod_id", "frequency", "fee_type_code")
So, the question is how to pass the list elements to the concat function instead of hardcoding the column names?

The concat function takes multiple columns as input while you have a list of strings. You need to transform the list to fit the method input.
First, use map to transform the strings into column objects and then unpack the list with :_* to correctly pass the arguments to concat.
val Finalraw = raw.withColumn("primarykey", concat(columnNames.map(col):_*))
For an explanation of the :_* syntax, see What does `:_*` (colon underscore star) do in Scala?
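For reference, a minimal self-contained sketch of the same approach (assuming raw and columnNames are defined as in the question):
import org.apache.spark.sql.functions.{col, concat}
// Map each column name to a Column and splat the resulting list into concat,
// so the number of columns no longer has to be known in advance.
val columnNames = List("prod_id", "frequency", "fee_type_code")
val Finalraw = raw.withColumn("primarykey", concat(columnNames.map(col): _*))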

Map the list elements to List[org.apache.spark.sql.Column] in a separate variable.
Check this out.
scala> val df = Seq(("a","x-","y-","z")).toDF("id","prod_id","frequency","fee_type_code")
df: org.apache.spark.sql.DataFrame = [id: string, prod_id: string ... 2 more fields]
scala> df.show(false)
+---+-------+---------+-------------+
|id |prod_id|frequency|fee_type_code|
+---+-------+---------+-------------+
|a  |x-     |y-       |z            |
+---+-------+---------+-------------+
scala> val arr = List("prod_id", "frequency", "fee_type_code")
arr: List[String] = List(prod_id, frequency, fee_type_code)
scala> val arr_col = arr.map(col(_))
arr_col: List[org.apache.spark.sql.Column] = List(prod_id, frequency, fee_type_code)
scala> df.withColumn("primarykey",concat(arr_col:_*)).show(false)
+---+-------+---------+-------------+----------+
|id |prod_id|frequency|fee_type_code|primarykey|
+---+-------+---------+-------------+----------+
|a  |x-     |y-       |z            |x-y-z     |
+---+-------+---------+-------------+----------+

Related

Pass arguments to a udf from columns present in a list of strings

I have a list of strings which represent column names inside a dataframe.
I want to pass the arguments from these columns to a udf. How can I do it in Spark Scala?
val actualDF = Seq(
  ("beatles", "help|hey jude", "sad", 4),
  ("romeo", "eres mia", "old school", 56)
).toDF("name", "hit_songs", "genre", "xyz")
val column_list: List[String] = List("hit_songs","name","genre")
// example udf
val testudf = org.apache.spark.sql.functions.udf((s1: String, s2: String) => {
// let's say I want to concat all values
})
val finalDF = actualDF.withColumn("test_res",testudf(col(column_list(0))))
From the above example, I want to pass my list column_list to a udf. I am not sure how I can pass a complete list of strings representing column names, though in the case of a single element I saw I can do it with col(column_list(0)). Please support.
Replace
testudf(col(column_list(0)))
with
testudf(column_list.map(col): _*)
The names are first mapped to Column objects, and :_* then interprets the list as multiple individual input arguments; note that the number of columns must match the number of parameters your udf declares.
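As a minimal sketch (the three-parameter udf below is an assumption to line up with the three names in column_list; it is not the asker's actual logic):
import org.apache.spark.sql.functions.{col, udf}
// Hypothetical udf with one String parameter per entry in column_list.
val testudf = udf((hitSongs: String, name: String, genre: String) => hitSongs + name + genre)
// Map the names to Columns and splat them as the udf's arguments.
val finalDF = actualDF.withColumn("test_res", testudf(column_list.map(col): _*))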
hit_songs is of type Seq[String], so you need to change the first parameter of your udf to Seq[String].
scala> singersDF.show(false)
+-------+-------------+----------+
|name   |hit_songs    |genre     |
+-------+-------------+----------+
|beatles|help|hey jude|sad       |
|romeo  |eres mia     |old school|
+-------+-------------+----------+
scala> actualDF.show(false)
+-------+----------------+----------+
|name   |hit_songs       |genre     |
+-------+----------------+----------+
|beatles|[help, hey jude]|sad       |
|romeo  |[eres mia]      |old school|
+-------+----------------+----------+
scala> column_list
res27: List[String] = List(hit_songs, name)
Change your UDF like below.
// s1 is of type Seq[String]
val testudf = udf((s1: Seq[String], s2: String) => {
  s1.mkString.concat(s2)
})
Applying UDF
scala> actualDF
.withColumn("test_res",testudf(col(column_list.head),col(column_list.last)))
.show(false)
+-------+----------------+----------+-------------------+
|name   |hit_songs       |genre     |test_res           |
+-------+----------------+----------+-------------------+
|beatles|[help, hey jude]|sad       |helphey judebeatles|
|romeo  |[eres mia]      |old school|eres miaromeo      |
+-------+----------------+----------+-------------------+
Without UDF
scala> actualDF.withColumn("test_res",concat_ws("",$"name",$"hit_songs")).show(false) // Without UDF.
+-------+----------------+----------+-------------------+
|name   |hit_songs       |genre     |test_res           |
+-------+----------------+----------+-------------------+
|beatles|[help, hey jude]|sad       |beatleshelphey jude|
|romeo  |[eres mia]      |old school|romeoeres mia      |
+-------+----------------+----------+-------------------+

How to convert spark dataframe array to tuple

How can I convert spark dataframe to a tuple of 2 in scala?
I tried to explode the array and create a new column with the help of the lead function, so that I could use two columns to create the tuples.
In order to use the lead function, I need a column to sort by, and I don't have any.
Please suggest the best way to solve this.
Note: I need to retain the same order in the array.
For example:
Input
The input looks something like this:
id1 | [text1, text2, text3, text4]
id2 | [txt, txt2, txt4, txt5, txt6, txt7, txt8, txt9]
Expected output:
I need to get an output of tuples of length 2:
id1 | [(text1, text2), (text2, text3), (text3,text4)]
id2 | [(txt, txt2), (txt2, txt4), (txt4, txt5), (txt5, txt6), (txt6, txt7), (txt7, txt8), (txt8, txt9)]
You can create a udf that builds the list of tuples using the sliding window function:
val df = Seq(
  ("id1", List("text1", "text2", "text3", "text4")),
  ("id2", List("txt", "txt2", "txt4", "txt5", "txt6", "txt7", "txt8", "txt9"))
).toDF("id", "text")
val sliding = udf((value: Seq[String]) => {
  value.toList.sliding(2).map { case List(a, b) => (a, b) }.toList
})
val result = df.withColumn("text", sliding($"text"))
Output:
+---+-------------------------------------------------------------------------------------------------+
|id |text |
+---+-------------------------------------------------------------------------------------------------+
|id1|[[text1, text2], [text2, text3], [text3, text4]] |
|id2|[[txt, txt2], [txt2, txt4], [txt4, txt5], [txt5, txt6], [txt6, txt7], [txt7, txt8], [txt8, txt9]]|
+---+-------------------------------------------------------------------------------------------------+
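One caveat: sliding(2) over a single-element array produces one window of length 1, so the pattern match in the udf above would throw a MatchError. A slightly more defensive variant (an assumption, not part of the original answer) skips short windows with collect:
val slidingSafe = udf((value: Seq[String]) =>
  value.toList.sliding(2).collect { case List(a, b) => (a, b) }.toList
)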
Hope this helps!

Convert Dataset of array into DataFrame

Given Dataset[Array[String]].
In fact, this structure has a single field of array type.
Is there any possibility to convert it into a DataFrame with each array item placed into a separate column?
If I have RDD[Array[String]] I can achieve it in this way:
val rdd: RDD[Array[String]] = ???
rdd.map(arr => Row.fromSeq(arr))
But surprisingly I cannot do the same with Dataset[Array[String]] – it says that there's no encoder for Row.
And I cannot replace an array with Tuple or case class because the size of the array is unknown at compile time.
If arrays have the same size, "select" can be used:
val original: Dataset[Array[String]] = Seq(Array("One", "Two"), Array("Three", "Four")).toDS()
val arraySize = original.head.size
val result = original.select(
  (0 until arraySize).map(r => original.col("value").getItem(r)): _*
)
result.show(false)
Output:
+--------+--------+
|value[0]|value[1]|
+--------+--------+
|One     |Two     |
|Three   |Four    |
+--------+--------+
Here you can do a foldLeft to create all your columns manually.
val df = Seq(Array("Hello", "world"), Array("another", "row")).toDS()
Then you calculate the size of your array.
val size_array = df.first.length
Then you add the columns to your dataframe with a foldLeft:
0.until(size_array).foldLeft(df) { (acc, number) => acc.withColumn(s"col$number", $"value".getItem(number)) }.show
Here our accumulator is our df, and we just add the columns one by one.
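If you prefer to stay close to the RDD route mentioned in the question, a minimal sketch (assuming every array has the same length as the first row, and a SparkSession named spark is in scope) is to map the underlying RDD to Rows and supply an explicit schema:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val ds = Seq(Array("One", "Two"), Array("Three", "Four")).toDS()
val width = ds.head.length
val schema = StructType((0 until width).map(i => StructField(s"col$i", StringType)))
// Drop to the RDD, where no Encoder for Row is needed, and rebuild a DataFrame.
val asDF = spark.createDataFrame(ds.rdd.map(arr => Row.fromSeq(arr)), schema)
asDF.show(false)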

How to deal with array<String> in spark dataframe?

I have a json dataset, and it is formatted as:
val data = spark.read.json("user.json").select("user_id", "friends")
data.show()
+--------------------+--------------------+
|             user_id|             friends|
+--------------------+--------------------+
|18kPq7GPye-YQ3LyK...|[rpOyqD_893cqmDAt...|
|rpOyqD_893cqmDAtJ...|[18kPq7GPye-YQ3Ly...|
|4U9kSBLuBDU391x6b...|[18kPq7GPye-YQ3Ly...|
|fHtTaujcyKvXglE33...|[18kPq7GPye-YQ3Ly...|
+--------------------+--------------------+
data: org.apache.spark.sql.DataFrame = [user_id: string, friends: array<string>]
How can I transform it to [user_id: String, friend: String], eg:
+--------------------+--------------------+
|             user_id|              friend|
+--------------------+--------------------+
|18kPq7GPye-YQ3LyK...| rpOyqD_893cqmDAt...|
|18kPq7GPye-YQ3LyK...| 18kPq7GPye-YQ3Ly...|
|4U9kSBLuBDU391x6b...| 18kPq7GPye-YQ3Ly...|
|fHtTaujcyKvXglE33...| 18kPq7GPye-YQ3Ly...|
+--------------------+--------------------+
How can I get this dataframe?
You can use the concat_ws function to concatenate the array of strings into a single string:
data.withColumn("friends", concat_ws("",col("friends")))
concat_ws(java.lang.String sep, Column... exprs) Concatenates multiple
input string columns together into a single string column, using the
given separator.
Or you can use a simple udf to convert the array to a string, as below:
import org.apache.spark.sql.functions._
val value = udf((arr: Seq[String]) => arr.mkString(" "))
val newDf = data.withColumn("hobbies", value($"friends"))
If you are trying to get one row per array element for each user, you can use the explode method:
data.withColumn("friends", explode($"friends"))
explode(Column e) Creates a new row for each element in the given
array or map column.
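For the exact shape asked for (one friend per row, in a column named friend), a small variation on the same idea should work:
val friendsDF = data.select($"user_id", explode($"friends").as("friend"))
friendsDF.show()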
If you are trying to get only one element, then, as @ramesh suggested, you can get the first element:
data.withColumn("friends", $"friends"(0))
Hope this helps!

How to concatenate multiple columns into single column (with no prior knowledge on their number)?

Let's say I have the following dataframe:
+----------+-----------+---------+-------+----+
| agentName|original_dt|parsed_dt|   user|text|
+----------+-----------+---------+-------+----+
|qwertyuiop|          0|        0|16102.0|   0|
+----------+-----------+---------+-------+----+
I wish to create a new dataframe with one more column that has the concatenation of all the elements of the row:
+----------+-----------+---------+-------+----+---------------------------+
| agentName|original_dt|parsed_dt|   user|text|                     newCol|
+----------+-----------+---------+-------+----+---------------------------+
|qwertyuiop|          0|        0|16102.0|   0|[qwertyuiop, 0,0, 16102, 0]|
+----------+-----------+---------+-------+----+---------------------------+
Note: This is just an example. The number of columns and their names are not known; it is dynamic.
TL;DR Use struct function with Dataset.columns operator.
Quoting the scaladoc of struct function:
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
There are two variants: string-based for column names or using Column expressions (that gives you more flexibility on the calculation you want to apply on the concatenated columns).
From Dataset.columns:
columns: Array[String] Returns all column names as an array.
Your case would then look as follows:
scala> df.withColumn("newCol",
struct(df.columns.head, df.columns.tail: _*)).
show(false)
+----------+-----------+---------+-------+----+--------------------------+
|agentName |original_dt|parsed_dt|user   |text|newCol                    |
+----------+-----------+---------+-------+----+--------------------------+
|qwertyuiop|0          |0        |16102.0|0   |[qwertyuiop,0,0,16102.0,0]|
+----------+-----------+---------+-------+----+--------------------------+
I think this works perfectly for your case; here is an example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, split}

val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

val data = spark.sparkContext.parallelize(
  Seq(("qwertyuiop", 0, 0, 16102.0, 0))
).toDF("agentName", "original_dt", "parsed_dt", "user", "text")

val result = data.withColumn("newCol", split(concat_ws(";", data.schema.fieldNames.map(c => col(c)): _*), ";"))
result.show()
+----------+-----------+---------+-------+----+------------------------------+
|agentName |original_dt|parsed_dt|user   |text|newCol                        |
+----------+-----------+---------+-------+----+------------------------------+
|qwertyuiop|0          |0        |16102.0|0   |[qwertyuiop, 0, 0, 16102.0, 0]|
+----------+-----------+---------+-------+----+------------------------------+
Hope this helped!
In general, you can merge multiple dataframe columns into one using array.
df.select($"*",array($"col1",$"col2").as("newCol")) \\$"*" will capture all existing columns
Here is the one line solution for your case:
df.select($"*",array($"agentName",$"original_dt",$"parsed_dt",$"user", $"text").as("newCol"))
You can use a udf function to concatenate all the columns into one. All you have to do is define a udf function, pass all the columns you want to concatenate to it, and call it using the .withColumn function of the dataframe.
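A minimal sketch of that udf route (the cast to string and the udf body are assumptions, since the actual column types and the desired separator are not specified):
import org.apache.spark.sql.functions.{array, col, udf}
// Collect every column (cast to string) into one array column, then let the
// udf fold the values into a single string.
val concatAll = udf((values: Seq[String]) => values.mkString(","))
val withNewCol = df.withColumn("newCol", concatAll(array(df.columns.map(c => col(c).cast("string")): _*)))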
Or
You can use the concat_ws(java.lang.String sep, Column... exprs) function available for dataframes:
val df = Seq(("qwertyuiop", 0, 0, 16102.0, 0))
  .toDF("agentName", "original_dt", "parsed_dt", "user", "text")
df.withColumn("newCol", concat_ws(",", $"agentName", $"original_dt", $"parsed_dt", $"user", $"text"))
  .show(false)
Will give you output as
+----------+-----------+---------+-------+----+------------------------+
|agentName |original_dt|parsed_dt|user   |text|newCol                  |
+----------+-----------+---------+-------+----+------------------------+
|qwertyuiop|0          |0        |16102.0|0   |qwertyuiop,0,0,16102.0,0|
+----------+-----------+---------+-------+----+------------------------+
That will get you the result you want
There may be syntax errors in my answer. This is useful if you are using Java < 8 and Spark < 2.
String columns = null;
for (String columnName : dataframe.columns()) {
    columns = (columns == null) ? columnName : columns + "," + columnName;
}
sqlContext.sql("select *, concat_ws('|', " + columns + ") as complete_record from dataframe").show();