Create new DataFrame with new rows depending in number of a column - Spark Scala - scala

I have a DataFrame with the following data:
num_cta | n_lines
110000000000| 2
110100000000| 3
110200000000| 1
With that information, I need to create a new DF with different number of rows depending the value that comes over the n_lines column.
For example, for the first row of my DF (110000000000), the value of the n_lines column is 2. The result would have to be something like the following:
num_cta
110000000000
110000000000
For all the Dataframe example that I show, the result to get would have to be something like this:
num_cta
110000000000
110000000000
110100000000
110100000000
110100000000
110200000000
Is there a way to do that? And multiply a row n times, depending on the value of a column value?
Regards.

One approach would be to expand n_lines into an array with an UDF and explode it:
val df = Seq(
("110000000000", 2),
("110100000000", 3),
("110200000000", 1)
)toDF("num_cta", "n_lines")
def fillArr = udf(
(n: Int) => Array.fill(n)(1)
)
val df2 = df.withColumn("arr", fillArr($"n_lines")).
withColumn("a", explode($"arr")).
select($"num_cta")
df2.show
+------------+
| num_cta|
+------------+
|110000000000|
|110000000000|
|110100000000|
|110100000000|
|110100000000|
|110200000000|
+------------+

There is no off the shelve way to doing this. However you can try iterate over the dataframe and return a list of num_cta where the number of elements are equal to the corresponding n_lines.
Something like
import spark.implicits._
case class (num_cta:String) // output dataframe schema
case class (num_cta:String, n_lines:Integer) // input dataframe 'df' schema
val result = df.flatmap(x => {
List.fill(x.n_lines)(x.num_cta)
}).toDF

Related

How to add a list or array of strings as a column to a Spark Dataframe

So, I have n number of strings that I can keep either in an array or in a list like this:
val checks = Array("check1", "check2", "check3", "check4", "check5")
val checks: List[String] = List("check1", "check2", "check3", "check4", "check5")
Now, I have a spark dataframe df and I want to add a column with the values present in this List/Array. (It is guaranteed that the number of items in my List/Array will be exactly equal to the number of rows in the dataframe, i.e n)
I tried doing:
df.withColumn("Value", checks)
But that didn't work. What would be the best way to achieve this?
You need to add it as an array column as follows:
val df2 = df.withColumn("Value", array(checks.map(lit):_*))
If you want a single value for each row, you can get the array element:
val df2 = df.withColumn("Value", array(checks.map(lit):_*))
.withColumn("rn", row_number().over(Window.orderBy(lit(1))) - 1)
.withColumn("Value", expr("Value[rn]"))
.drop("rn")

Convert Dataset of array into DataFrame

Given Dataset[Array[String]].
In fact, this structure has a single field of array type.
Is there any possibility to convert it into a DataFrame with each array item placed into a separate column?
If I have RDD[Array[String]] I can achieve it in this way:
val rdd: RDD[Array[String]] = ???
rdd.map(arr => Row.fromSeq(arr))
But surprisingly I cannot do the same with Dataset[Array[String]] – it says that there's no encoder for Row.
And I cannot replace an array with Tuple or case class because the size of the array is unknown at compile time.
If arrays have the same size, "select" can be used:
val original: Dataset[Array[String]] = Seq(Array("One", "Two"), Array("Three", "Four")).toDS()
val arraySize = original.head.size
val result = original.select(
(0 until arraySize).map(r => original.col("value").getItem(r)): _*)
result.show(false)
Output:
+--------+--------+
|value[0]|value[1]|
+--------+--------+
|One |Two |
|Three |Four |
+--------+--------+
Here you can do a foldLeft to create all your columns manually.
val df = Seq(Array("Hello", "world"), Array("another", "row")).toDS()
Then you calculate the size of your array.
val size_array = df.first.length
Then you add the columns to your dataframe with a foldLeft :
0.until(size_array).foldLeft(df){(acc, number) => df.withColumn(s"col$number", $"value".getItem(number))}.show
Here our accumulator is our df, and we just add the columns one by one.

Spark Dataframe select based on column index

How do I select all the columns of a dataframe that has certain indexes in Scala?
For example if a dataframe has 100 columns and i want to extract only columns (10,12,13,14,15), how to do the same?
Below selects all columns from dataframe df which has the column name mentioned in the Array colNames:
df = df.select(colNames.head,colNames.tail: _*)
If there is similar, colNos array which has
colNos = Array(10,20,25,45)
How do I transform the above df.select to fetch only those columns at the specific indexes.
You can map over columns:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
or:
df.select(colNos map (df.columns andThen col): _*)
or:
df.select(colNos map (col _ compose df.columns): _*)
All the methods shown above are equivalent and don't impose performance penalty. Following mapping:
colNos map df.columns
is just a local Array access (constant time access for each index) and choosing between String or Column based variant of select doesn't affect the execution plan:
val df = Seq((1, 2, 3 ,4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
#user6910411's answer above works like a charm and the number of tasks/logical plan is similar to my approach below. BUT my approach is a bit faster.
So,
I would suggest you to go with the column names rather than column numbers. Column names are much safer and much ligher than using numbers. You can use the following solution :
val colNames = Seq("col1", "col2" ...... "col99", "col100")
val selectColNames = Seq("col1", "col3", .... selected column names ... )
val selectCols = selectColNames.map(name => df.col(name))
df = df.select(selectCols:_*)
If you are hesitant to write all the 100 column names then there is a shortcut method too
val colNames = df.schema.fieldNames
Example: Grab first 14 columns of Spark Dataframe by Index using Scala.
import org.apache.spark.sql.functions.col
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols.map(name=>col(name)):_*)
You cannot simply do this (as I tried and failed):
// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols)
The reason is that you have to convert your datatype of Array[String] to Array[org.apache.spark.sql.Column] in order for the slicing to work.
OR Wrap it in a function using Currying (high five to my colleague for this):
// Subsets Dataframe to using beg_val & end_val index.
def subset_frame(beg_val:Int=0, end_val:Int)(df: DataFrame): DataFrame = {
val sliceCols = df.columns.slice(beg_val, end_val)
return df.select(sliceCols.map(name => col(name)):_*)
}
// Get first 25 columns as subsetted dataframe
val subset_df:DataFrame = df_.transform(subset_frame(0, 25))

Adding a constant value to columns in a Spark dataframe

I have a Spark data frame which will be like below
id person age
1 naveen 24
I want add a constant "del" to each column value except the last column in the dataframe like below,
id person age
1del naveendel 24
Can someone assist me how to implement this in Spark df using Scala
You can use the lit and concat functions:
import org.apache.spark.sql.functions._
// add suffix to all but last column (would work for any number of cols):
val colsWithSuffix = df.columns.dropRight(1).map(c => concat(col(c), lit("del")) as c)
def result = df.select(colsWithSuffix :+ $"age": _*)
result.show()
// +----+---------+---+
// |id |person |age|
// +----+---------+---+
// |1del|naveendel|24 |
// +----+---------+---+
EDIT: to also accommodate null values, you can wrap the column with coalesce before appending the suffix - replace the like calculating colsWithSuffix with:
val colsWithSuffix = df.columns.dropRight(1)
.map(c => concat(coalesce(col(c), lit("")), lit("del")) as c)

get the distinct elements of an ArrayType column in a spark dataframe

I have a dataframe with 3 columns named id, feat1 and feat2. feat1 and feat2 are in the form of Array of String:
Id, feat1,feat2
------------------
1, ["feat1_1","feat1_2","feat1_3"],[]
2, ["feat1_2"],["feat2_1","feat2_2"]
3,["feat1_4"],["feat2_3"]
I want to get the list of distinct elements inside each feature column, so the output will be:
distinct_feat1,distinct_feat2
-----------------------------
["feat1_1","feat1_2","feat1_3","feat1_4"],["feat2_1","feat2_2","feat2_3]
what is the best way to do this in Scala?
You can use the collect_set to find the distinct values of the corresponding column after applying the explode function on each column to unnest the array element in each cell. Suppose your data frame is called df:
import org.apache.spark.sql.functions._
val distinct_df = df.withColumn("feat1", explode(col("feat1"))).
withColumn("feat2", explode(col("feat2"))).
agg(collect_set("feat1").alias("distinct_feat1"),
collect_set("feat2").alias("distinct_feat2"))
distinct_df.show
+--------------------+--------------------+
| distinct_feat1| distinct_feat2|
+--------------------+--------------------+
|[feat1_1, feat1_2...|[, feat2_1, feat2...|
+--------------------+--------------------+
distinct_df.take(1)
res23: Array[org.apache.spark.sql.Row] = Array([WrappedArray(feat1_1, feat1_2, feat1_3, feat1_4),
WrappedArray(, feat2_1, feat2_2, feat2_3)])
one more solution for spark 2.4+
.withColumn("distinct", array_distinct(concat($"array_col1", $"array_col2")))
beware, if one of columns is null, result will be null
The method provided by Psidom works great, here is a function that does the same given a Dataframe and a list of fields:
def array_unique_values(df, fields):
from pyspark.sql.functions import col, collect_set, explode
from functools import reduce
data = reduce(lambda d, f: d.withColumn(f, explode(col(f))), fields, df)
return data.agg(*[collect_set(f).alias(f + '_distinct') for f in fields])
And then:
data = array_unique_values(df, my_fields)
data.take(1)