How to deal with array<String> in spark dataframe? - scala

I have a JSON dataset, and it is formatted as:
val data = spark.read.json("user.json").select("user_id", "friends")
data.show()
+--------------------+--------------------+
| user_id| friends|
+--------------------+--------------------+
|18kPq7GPye-YQ3LyK...|[rpOyqD_893cqmDAt...|
|rpOyqD_893cqmDAtJ...|[18kPq7GPye-YQ3Ly...|
|4U9kSBLuBDU391x6b...|[18kPq7GPye-YQ3Ly...|
|fHtTaujcyKvXglE33...|[18kPq7GPye-YQ3Ly...|
+--------------------+--------------------+
data: org.apache.spark.sql.DataFrame = [user_id: string, friends: array<string>]
How can I transform it to [user_id: String, friend: String], e.g.:
+--------------------+--------------------+
| user_id| friend|
+--------------------+--------------------+
|18kPq7GPye-YQ3LyK...| rpOyqD_893cqmDAt...|
|18kPq7GPye-YQ3LyK...| 18kPq7GPye-YQ3Ly...|
|4U9kSBLuBDU391x6b...| 18kPq7GPye-YQ3Ly...|
|fHtTaujcyKvXglE33...| 18kPq7GPye-YQ3Ly...|
+--------------------+--------------------+
How can I get this dataframe?

You can use the concat_ws function to concatenate the array of strings into a single string:
import org.apache.spark.sql.functions.{col, concat_ws}
data.withColumn("friends", concat_ws("", col("friends")))
concat_ws(java.lang.String sep, Column... exprs) Concatenates multiple
input string columns together into a single string column, using the
given separator.
Or you can use a simple UDF to convert the array to a string, as below:
import org.apache.spark.sql.functions._
val value = udf((arr: Seq[String]) => arr.mkString(" "))
val newDf = data.withColumn("hobbies", value($"friends"))
If you are trying to get one row per array value for each user, you can use the explode method:
data.withColumn("friends", explode($"friends"))
explode(Column e) Creates a new row for each element in the given
array or map column.
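For the (user_id, friend) output asked for in the question, explode is the right tool. A minimal self-contained sketch (the sample IDs below are made up, and a SparkSession is assumed to be in scope as spark):
import org.apache.spark.sql.functions.explode
import spark.implicits._

// toy data mirroring the question's schema: user_id: string, friends: array<string>
val sample = Seq(
  ("user_A", Seq("user_B", "user_C")),
  ("user_B", Seq("user_A"))
).toDF("user_id", "friends")

// one output row per (user_id, friend) pair
sample.withColumn("friend", explode($"friends"))
  .drop("friends")
  .show()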
If you only need a single element then, as @ramesh suggested, you can take the first element:
data.withColumn("friends", $"friends"(0))
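On Spark 2.4+ the same can also be written with the element_at function (note that it uses 1-based indexing):
import org.apache.spark.sql.functions.element_at
data.withColumn("friends", element_at($"friends", 1))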
Hope this helps!

Related

How to split column in Spark Dataframe to multiple columns

In my case, how can I split a StringType column with the format '1-1235.0 2-1248.0 3-7895.2' into an ArrayType column containing ['1','2','3']?
This is relatively simple with a UDF:
val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("input")
val extractFirst = udf((s: String) => s.split(" ").map(_.split('-')(0).toInt))
df.withColumn("newCol", extractFirst($"input"))
.show()
gives
+--------------------+---------+
| input| newCol|
+--------------------+---------+
|1-1235.0 2-1248.0...|[1, 2, 3]|
+--------------------+---------+
I could not find an easy solution with Spark internals (other than using split in combination with explode etc. and then re-aggregating).
You can split the string into an array using the split function and then transform the array using the higher-order function TRANSFORM (available since Spark 2.4) together with substring_index:
import org.apache.spark.sql.functions.{split, expr}
val df = Seq("1-1235.0 2-1248.0 3-7895.2").toDF("stringCol")
df.withColumn("array", split($"stringCol", " "))
.withColumn("result", expr("TRANSFORM(array, x -> substring_index(x, '-', 1))"))
Notice that this is a native approach; no UDF is applied.
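If you are on Spark 3.0 or later, the same thing can also be written with the transform function from org.apache.spark.sql.functions instead of the SQL expression (a sketch under that assumption):
import org.apache.spark.sql.functions.{col, split, substring_index, transform}

df.withColumn("array", split(col("stringCol"), " "))
  .withColumn("result", transform(col("array"), x => substring_index(x, "-", 1)))
  .show(false)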

Convert Dataset of array into DataFrame

Given Dataset[Array[String]].
In fact, this structure has a single field of array type.
Is there any possibility to convert it into a DataFrame with each array item placed into a separate column?
If I have RDD[Array[String]] I can achieve it in this way:
val rdd: RDD[Array[String]] = ???
rdd.map(arr => Row.fromSeq(arr))
But surprisingly I cannot do the same with Dataset[Array[String]] – it says that there's no encoder for Row.
And I cannot replace an array with Tuple or case class because the size of the array is unknown at compile time.
If arrays have the same size, "select" can be used:
val original: Dataset[Array[String]] = Seq(Array("One", "Two"), Array("Three", "Four")).toDS()
val arraySize = original.head.size
val result = original.select(
(0 until arraySize).map(r => original.col("value").getItem(r)): _*)
result.show(false)
Output:
+--------+--------+
|value[0]|value[1]|
+--------+--------+
|One     |Two     |
|Three   |Four    |
+--------+--------+
Here you can do a foldLeft to create all your columns manually.
val df = Seq(Array("Hello", "world"), Array("another", "row")).toDS()
Then you calculate the size of your array.
val size_array = df.first.length
Then you add the columns to your dataframe with a foldLeft :
0.until(size_array).foldLeft(df.toDF()){(acc, number) => acc.withColumn(s"col$number", $"value".getItem(number))}.show
Here the accumulator is the DataFrame (note that it is acc, not df, that must be used inside the fold), and we just add the columns one by one.
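If you need this more than once, the same fold can be wrapped in a small helper (a sketch; the value column name comes from the single-column Dataset above, and toDF() is used so the fold works on a plain DataFrame):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def arrayToColumns(input: DataFrame, size: Int, arrayCol: String = "value"): DataFrame =
  (0 until size).foldLeft(input) { (acc, i) =>
    acc.withColumn(s"col$i", col(arrayCol).getItem(i))
  }

arrayToColumns(df.toDF(), size_array).show()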

How to pass elements of a list to concat function?

I am currently using the following approach to concat the columns in a dataframe:
val Finalraw = raw.withColumn("primarykey", concat($"prod_id",$"frequency",$"fee_type_code"))
But the thing is that I do not want to hardcode the column names, as the number of columns changes every time. I have a list that contains the column names:
columnNames: List[String] = List("prod_id", "frequency", "fee_type_code")
So, the question is how to pass the list elements to the concat function instead of hardcoding the column names?
The concat function takes multiple columns as input while you have a list of strings. You need to transform the list to fit the method input.
First, use map to transform the strings into column objects and then unpack the list with :_* to correctly pass the arguments to concat.
val Finalraw = raw.withColumn("primarykey", concat(columnNames.map(col):_*))
For an explanation of the :_* syntax, see What does `:_*` (colon underscore star) do in Scala?
Map the list elements to List[org.apache.spark.sql.Column] in a separate variable.
Check this out.
scala> val df = Seq(("a","x-","y-","z")).toDF("id","prod_id","frequency","fee_type_code")
df: org.apache.spark.sql.DataFrame = [id: string, prod_id: string ... 2 more fields]
scala> df.show(false)
+---+-------+---------+-------------+
|id |prod_id|frequency|fee_type_code|
+---+-------+---------+-------------+
|a |x- |y- |z |
+---+-------+---------+-------------+
scala> val arr = List("prod_id", "frequency", "fee_type_code")
arr: List[String] = List(prod_id, frequency, fee_type_code)
scala> val arr_col = arr.map(col(_))
arr_col: List[org.apache.spark.sql.Column] = List(prod_id, frequency, fee_type_code)
scala> df.withColumn("primarykey",concat(arr_col:_*)).show(false)
+---+-------+---------+-------------+----------+
|id |prod_id|frequency|fee_type_code|primarykey|
+---+-------+---------+-------------+----------+
|a |x- |y- |z |x-y-z |
+---+-------+---------+-------------+----------+

How to concatenate multiple columns into single column (with no prior knowledge on their number)?

Let say I have the following dataframe:
+----------+-----------+---------+-------+----+
| agentName|original_dt|parsed_dt|   user|text|
+----------+-----------+---------+-------+----+
|qwertyuiop|          0|        0|16102.0|   0|
+----------+-----------+---------+-------+----+
I wish to create a new dataframe with one more column that has the concatenation of all the elements of the row:
+----------+-----------+---------+-------+----+---------------------------+
| agentName|original_dt|parsed_dt|   user|text|                     newCol|
+----------+-----------+---------+-------+----+---------------------------+
|qwertyuiop|          0|        0|16102.0|   0|[qwertyuiop, 0,0, 16102, 0]|
+----------+-----------+---------+-------+----+---------------------------+
Note: this is just an example. The number of columns and their names are not known; they are dynamic.
TL;DR Use struct function with Dataset.columns operator.
Quoting the scaladoc of struct function:
struct(colName: String, colNames: String*): Column Creates a new struct column that composes multiple input columns.
There are two variants: string-based for column names or using Column expressions (that gives you more flexibility on the calculation you want to apply on the concatenated columns).
From Dataset.columns:
columns: Array[String] Returns all column names as an array.
Your case would then look as follows:
scala> df.withColumn("newCol",
struct(df.columns.head, df.columns.tail: _*)).
show(false)
+----------+-----------+---------+-------+----+--------------------------+
|agentName |original_dt|parsed_dt|user |text|newCol |
+----------+-----------+---------+-------+----+--------------------------+
|qwertyuiop|0 |0 |16102.0|0 |[qwertyuiop,0,0,16102.0,0]|
+----------+-----------+---------+-------+----+--------------------------+
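For the Column-based variant mentioned above, a sketch could look like this (here each column is simply wrapped with col, but any Column expression would work in its place):
import org.apache.spark.sql.functions.{col, struct}
df.withColumn("newCol", struct(df.columns.map(col): _*)).show(false)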
I think this works perfectly for your case; here is an example:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, split}

val spark =
  SparkSession.builder().master("local").appName("test").getOrCreate()
import spark.implicits._

val data = spark.sparkContext.parallelize(
  Seq(
    ("qwertyuiop", 0, 0, 16102.0, 0)
  )
).toDF("agentName", "original_dt", "parsed_dt", "user", "text")

val result = data.withColumn("newCol", split(concat_ws(";", data.schema.fieldNames.map(c => col(c)): _*), ";"))
result.show()
+----------+-----------+---------+-------+----+------------------------------+
|agentName |original_dt|parsed_dt|user |text|newCol |
+----------+-----------+---------+-------+----+------------------------------+
|qwertyuiop|0 |0 |16102.0|0 |[qwertyuiop, 0, 0, 16102.0, 0]|
+----------+-----------+---------+-------+----+------------------------------+
Hope this helped!
In general, you can merge multiple dataframe columns into one using array.
df.select($"*",array($"col1",$"col2").as("newCol")) \\$"*" will capture all existing columns
Here is the one line solution for your case:
df.select($"*",array($"agentName",$"original_dt",$"parsed_dt",$"user", $"text").as("newCol"))
You can use a udf function to concat all the columns into one: define the udf, pass it all the columns you want to concatenate, and add the result with the dataframe's .withColumn function, as sketched below.
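A minimal sketch of that udf approach (a dataframe df with the question's columns is assumed; every column is cast to string before being concatenated):
import org.apache.spark.sql.functions.{array, col, udf}

// join all of a row's values into one comma-separated string
val concatAll = udf((values: Seq[String]) => values.mkString(","))

df.withColumn("newCol", concatAll(array(df.columns.map(c => col(c).cast("string")): _*)))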
Or
You can use the concat_ws(java.lang.String sep, Column... exprs) function available for dataframes:
val df = Seq(("qwertyuiop", 0, 0, 16102.0, 0))
  .toDF("agentName", "original_dt", "parsed_dt", "user", "text")
df.withColumn("newCol", concat_ws(",", $"agentName", $"original_dt", $"parsed_dt", $"user", $"text"))
  .show(false)
This will give you the following output:
+----------+-----------+---------+-------+----+------------------------+
|agentName |original_dt|parsed_dt|user |text|newCol |
+----------+-----------+---------+-------+----+------------------------+
|qwertyuiop|0 |0 |16102.0|0 |qwertyuiop,0,0,16102.0,0|
+----------+-----------+---------+-------+----+------------------------+
That will get you the result you want
There may be syntax errors in my answer. This is useful if you are using Java < 8 and Spark < 2.
String columns = null;
for (String columnName : dataframe.columns()) {
    columns = (columns == null) ? columnName : columns + "," + columnName;
}
// assumes the DataFrame has been registered as a temp table named "dataframe"
sqlContext.sql("select *, concat_ws('|', " + columns + ") as complete_record from dataframe").show();

get the distinct elements of an ArrayType column in a spark dataframe

I have a dataframe with 3 columns named id, feat1 and feat2. feat1 and feat2 are in the form of Array of String:
Id, feat1, feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"], []
2, ["feat1_2"], ["feat2_1","feat2_2"]
3, ["feat1_4"], ["feat2_3"]
I want to get the list of distinct elements inside each feature column, so the output will be:
distinct_feat1,distinct_feat2
-----------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3]
what is the best way to do this in Scala?
You can use the collect_set to find the distinct values of the corresponding column after applying the explode function on each column to unnest the array element in each cell. Suppose your data frame is called df:
import org.apache.spark.sql.functions._
val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
withColumn("feat2", explode(col("feat2"))).
agg(collect_set("feat1").alias("distinct_feat1"),
collect_set("feat2").alias("distinct_feat2"))
distinct_df.show
+--------------------+--------------------+
| distinct_feat1| distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+
distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
WrappedArray(, feat2_1, feat2_2, feat2_3)])
One more solution, for Spark 2.4+:
import org.apache.spark.sql.functions.{array_distinct, concat}
df.withColumn("distinct", array_distinct(concat($"array_col1", $"array_col2")))
Beware: if one of the columns is null, the result will be null.
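To guard against that, you could wrap each array in coalesce with an empty-array fallback (a sketch, assuming the same hypothetical column names and arrays of strings):
import org.apache.spark.sql.functions.{array_distinct, coalesce, concat, typedLit}

// substitute an empty string array whenever a column is null
val emptyArr = typedLit(Seq.empty[String])
df.withColumn("distinct",
  array_distinct(concat(coalesce($"array_col1", emptyArr), coalesce($"array_col2", emptyArr))))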
The method provided by Psidom works great; here is a function (in PySpark) that does the same, given a DataFrame and a list of fields:
def array_unique_values(df, fields):
from pyspark.sql.functions import col, collect_set, explode
from functools import reduce
data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])
And then:
data = array_unique_values(df, my_fields)
data.take(1)
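For completeness, a roughly equivalent helper in Scala might look like this (a sketch; fields is assumed to be non-empty):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_set, explode}

def arrayUniqueValues(df: DataFrame, fields: Seq[String]): DataFrame = {
  // explode every listed array column, then collect the distinct values of each
  val exploded = fields.foldLeft(df)((d, f) => d.withColumn(f, explode(col(f))))
  exploded.agg(
    collect_set(col(fields.head)).alias(fields.head + "_distinct"),
    fields.tail.map(f => collect_set(col(f)).alias(f + "_distinct")): _*
  )
}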