Spark dataframe: add a row for every existing row - Scala

I have a dataframe with the following columns:
groupid,unit,height
----------------------
1,in,55
2,in,54
I want to create another dataframe with additional rows where unit=cm and height=height*2.54.
Resulting dataframe:
groupid,unit,height
----------------------
1,in,55
2,in,54
1,cm,139.7
2,cm,137.16
I'm not sure how I can use a Spark UDF and explode here.
Any help is appreciated.
Thanks in advance.

You can create another dataframe with the changes you require using withColumn, and then union both dataframes:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
// height is declared as Double so that the schemas of df and df2 match for the union
val df = Seq(
  (1, "in", 55.0),
  (2, "in", 54.0)
).toDF("groupid", "unit", "height")

// same rows, with the unit replaced and the height converted to centimetres
val df2 = df.withColumn("unit", lit("cm")).withColumn("height", col("height") * 2.54)

df.union(df2).show(false)
You should get:
+-------+----+------+
|groupid|unit|height|
+-------+----+------+
|1 |in |55.0 |
|2 |in |54.0 |
|1 |cm |139.7 |
|2 |cm |137.16|
+-------+----+------+
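Since the question mentions explode, here is a rough sketch of that route as well (an alternative, not the answer above), assuming the same df: build both unit variants for each row as an array of structs, then explode it.
import org.apache.spark.sql.functions._

// one struct per target unit, collected in an array column, then exploded into rows
val exploded = df
  .withColumn("variants", array(
    struct(lit("in").as("unit"), col("height").as("height")),
    struct(lit("cm").as("unit"), (col("height") * 2.54).as("height"))
  ))
  .withColumn("variant", explode(col("variants")))
  .select(col("groupid"), col("variant.unit").as("unit"), col("variant.height").as("height"))

exploded.show(false)
The union answer above is simpler; explode mainly pays off when each row has to fan out into more than two variants.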

Related

How to create a Spark DF as below

I need to create a Scala Spark DF as below. This question may be silly, but I need to know the best approach to creating small structures for testing purposes: one way to create a minimal DF and one to create a minimal RDD.
I've tried the following code so far, without success:
val rdd2 = sc.parallelize(Seq("7","8","9"))
and then creating to DF by
val dfSchema = Seq("col1", "col2", "col3")
and
rdd2.toDF(dfSchema: _*)
Here's a sample DataFrame I'd like to obtain:
c1 c2 c3
1 2 3
4 5 6
abc_spark, here's a sample you can use to easily create DataFrames and RDDs for testing:
import spark.implicits._
val df = Seq(
(1, 2, 3),
(4, 5, 6)
).toDF("c1", "c2", "c3")
df.show(false)
+---+---+---+
|c1 |c2 |c3 |
+---+---+---+
|1 |2 |3 |
|4 |5 |6 |
+---+---+---+
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val rdd: RDD[Row] = df.rdd
rdd.map(_.getAs[Int]("c2")).foreach(println)
Gives
5
2
You are missing a pair of parentheses in the Seq: Seq(("7","8","9")) is one row with three columns, while Seq("7","8","9") is three rows with a single column. Use it as below:
scala> val df = sc.parallelize(Seq(("7","8","9"))).toDF("col1", "col2", "col3")
scala> df.show
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 7| 8| 9|
+----+----+----+
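If your tests need explicit column names and types (rather than the ones inferred from a Seq of tuples), here is a minimal sketch using createDataFrame with an explicit schema, assuming a SparkSession named spark is in scope:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// explicit schema: three nullable integer columns
val schema = StructType(Seq(
  StructField("c1", IntegerType, nullable = true),
  StructField("c2", IntegerType, nullable = true),
  StructField("c3", IntegerType, nullable = true)
))

val rows = Seq(Row(1, 2, 3), Row(4, 5, 6))
val dfWithSchema = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
dfWithSchema.show(false)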

How to subtract a vector from a scalar in Scala?

I have a parquet file which contains two columns (id, features). I want to subtract a constant from the features and divide the result by another constant.
df.withColumn("features", ((df("features")-constant1)/constant2))
but it gives me the error:
requirement failed: The number of columns doesn't match.
Old column names (2): id, features
New column names (1): features
How to solve it?
My Scala Spark code for this is below. One way to operate on the Spark vector datatype is to cast it to a string; a UDF then performs the subtraction and division.
import spark.implicits._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions._
var df = Seq((1, Vectors.dense(35)),
(2, Vectors.dense(45)),
(3, Vectors.dense(4.5073)),
(4, Vectors.dense(56)))
.toDF("id", "features")
df.show()
val constant1 = 10
val constant2 = 2
// subtract val1 from the scalar value and divide by val2, returning a one-element Vector
val performComputation = (s: Double, val1: Int, val2: Int) => {
  Vectors.dense((s - val1) / val2)
}
val performComputationUDF = udf(performComputation)
df.printSchema()
df = df.withColumn("features",
regexp_replace(df.col("features").cast("String"),
"[\\[\\]]", "").cast("Double")
)
df = df.withColumn("features",
performComputationUDF(df.col("features"),
lit(constant1), lit(constant2))
)
df.show(20, false)
// write the result as parquet, overwriting any previous output
df.write
.mode("overwrite")
.parquet("file:///usr/local/spark/dataset/output1/")
Result
+---+----------+
|id |features |
+---+----------+
|1 |[12.5] |
|2 |[17.5] |
|3 |[-2.74635]|
|4 |[23.0] |
+---+----------+
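Casting the vector to a string is not the only option: a UDF can also take the Vector column directly. A sketch of that variant, assuming the original df with the Vector-typed features column and the same constant1 and constant2 as above:
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// apply (x - constant1) / constant2 to every element of the vector, no string cast needed
val subtractAndDivide = udf((v: Vector) =>
  Vectors.dense(v.toArray.map(x => (x - constant1) / constant2))
)

val result = df.withColumn("features", subtractAndDivide(df("features")))
result.show(false)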

Convert an array to custom string format in Spark with Scala

I created a DataFrame as follows:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
(1, List(1,2,3)),
(1, List(5,7,9)),
(2, List(4,5,6)),
(2, List(7,8,9)),
(2, List(10,11,12))
).toDF("id", "list")
val df1 = df.groupBy("id").agg(collect_set($"list").as("col1"))
df1.show(false)
Then I tried to convert the WrappedArray row value to string as follows:
import org.apache.spark.sql.functions._
def arrayToString = udf((arr: collection.mutable.WrappedArray[collection.mutable.WrappedArray[String]]) => arr.flatten.mkString(", "))
val d = df1.withColumn("col1", arrayToString($"col1"))
d: org.apache.spark.sql.DataFrame = [id: int, col1: string]
scala> d.show(false)
+---+----------------------------+
|id |col1 |
+---+----------------------------+
|1 |1, 2, 3, 5, 7, 9 |
|2 |4, 5, 6, 7, 8, 9, 10, 11, 12|
+---+----------------------------+
What I really want is to generate an output like the following:
+---+----------------------------+
|id |col1 |
+---+----------------------------+
|1 |1$2$3, 5$7$9 |
|2 |4$5$6, 7$8$9, 10$11$12 |
+---+----------------------------+
How can I achieve this?
You don't need a udf function; a simple concat_ws should do the trick for you:
import org.apache.spark.sql.functions._
val df1 = df.withColumn("list", concat_ws("$", col("list")))
.groupBy("id")
.agg(concat_ws(", ", collect_set($"list")).as("col1"))
df1.show(false)
which should give you
+---+----------------------+
|id |col1 |
+---+----------------------+
|1 |1$2$3, 5$7$9 |
|2 |7$8$9, 4$5$6, 10$11$12|
+---+----------------------+
As usual, udf functions should be avoided when built-in functions are available, since a udf requires serializing column data to primitive types for the calculation and deserializing the results back into columns.
Even more concisely, you can avoid the withColumn step:
val df1 = df.groupBy("id")
.agg(concat_ws(", ", collect_set(concat_ws("$", col("list")))).as("col1"))
I hope the answer is helpful
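One caveat, visible in the output above: collect_set does not guarantee element order, which is why the groups appear in a different order than in the desired output. If order matters and duplicates are acceptable, a sketch that swaps in collect_list (which tends to preserve encounter order, although Spark does not strictly guarantee it after a shuffle):
val df2 = df.groupBy("id")
  .agg(concat_ws(", ", collect_list(concat_ws("$", col("list")))).as("col1"))
df2.show(false)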

Convert Array of String column to multiple columns in spark scala

I have a dataframe with the following schema:
id : int,
emp_details: Array(String)
Some sample data:
1, Array(empname=xxx,city=yyy,zip=12345)
2, Array(empname=bbb,city=bbb,zip=22345)
This data is in a dataframe, and I need to read emp_details from the array and assign the values to new columns as below, or alternatively split the array into multiple columns named empname, city and zip:
.withColumn("empname", xxx)
.withColumn("city", yyy)
.withColumn("zip", 12345)
Could you please guide me on how to achieve this using Spark 1.6 with Scala?
Really appreciate your help...
Thanks a lot
You can use withColumn and split to get the required data:
df1.withColumn("empname", split($"emp_details" (0), "=")(1))
.withColumn("city", split($"emp_details" (1), "=")(1))
.withColumn("zip", split($"emp_details" (2), "=")(1))
Output:
+---+----------------------------------+-------+----+-----+
|id |emp_details |empname|city|zip |
+---+----------------------------------+-------+----+-----+
|1 |[empname=xxx, city=yyy, zip=12345]|xxx |yyy |12345|
|2 |[empname=bbb, city=bbb, zip=22345]|bbb |bbb |22345|
+---+----------------------------------+-------+----+-----+
UPDATE:
If the data in the array is not in a fixed order, you can use a UDF to convert it to a map and use it as follows:
val getColumnsUDF = udf((details: Seq[String]) => {
val detailsMap = details.map(_.split("=")).map(x => (x(0), x(1))).toMap
(detailsMap("empname"), detailsMap("city"),detailsMap("zip"))
})
Now use the UDF:
df1.withColumn("emp",getColumnsUDF($"emp_details"))
.select($"id", $"emp._1".as("empname"), $"emp._2".as("city"), $"emp._3".as("zip"))
.show(false)
Output:
+---+-------+----+-----+
|id |empname|city|zip  |
+---+-------+----+-----+
|1  |xxx    |yyy |12345|
|2  |bbb    |bbb |22345|
+---+-------+----+-----+
Hope this helps!
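A closely related variant of the UDF (a sketch, not part of the original answer) returns a Map instead of a tuple, so the columns can be pulled out by name with getItem rather than by tuple position; it assumes the same df1 and implicits as above:
import org.apache.spark.sql.functions.udf

// build a key/value map from the "key=value" strings in the array
val getDetailsMapUDF = udf((details: Seq[String]) =>
  details.map(_.split("=")).map(x => (x(0), x(1))).toMap
)

df1.withColumn("emp", getDetailsMapUDF($"emp_details"))
  .select(
    $"id",
    $"emp".getItem("empname").as("empname"),
    $"emp".getItem("city").as("city"),
    $"emp".getItem("zip").as("zip"))
  .show(false)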

How to get columns from dataframe into a list in spark

I have a DataFrame with about 80 columns, and I need to get 12 of them into a collection; either Array or List is fine. I googled a bit and found this:
dataFrame.select("YOUR_COLUMN_NAME").rdd.map(r => r(0)).collect()
The problem is, this works for one column. If I do df.select(col1,col2,col3...).rdd.map.collect(), then it's giving me something like this: Array[[col1,col2,col3]].
What I want is Array[[col1],[col2],[col3]]. Is there any way to do this in Spark?
Thanks in advance.
UPDATE
For example I have a dataframe:
----------
A B C
----------
1 2 3
4 5 6
I need to get the columns into this format:
Array[[1,4],[2,5],[3,6]]
Hope this is clearer... sorry for the confusion.
You can get the columns as separate arrays by doing the following:
scala> df.select("col1", "col2", "col3", "col4").rdd.map(row => (Array(row(0)), Array(row(1)), Array(row(2)), Array(row(3))))
res6: org.apache.spark.rdd.RDD[(Array[Any], Array[Any], Array[Any], Array[Any])] = MapPartitionsRDD[34] at map at <console>:32
An RDD behaves much like an Array, so the structure you need is already above. If you want an RDD[Array[Array[Any]]], you can do:
scala> df.select("col1", "col2", "col3", "col4").rdd.map(row => Array(Array(row(0)), Array(row(1)), Array(row(2)), Array(row(3))))
res7: org.apache.spark.rdd.RDD[Array[Array[Any]]] = MapPartitionsRDD[39] at map at <console>:32
You can proceed the same way for your twelve columns
Updated
Your updated question is clearer. You can use the collect_list function before converting to an RDD, then carry on as before.
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val rdd = df.select(collect_list("col1"), collect_list("col2"), collect_list("col3"), collect_list("col4")).rdd.map(row => Array(row(0), row(1), row(2), row(3)))
rdd: org.apache.spark.rdd.RDD[Array[Any]] = MapPartitionsRDD[41] at map at <console>:36
scala> rdd.map(array => array.map(element => println(element))).collect
WrappedArray(1, 1)
WrappedArray(2, 2)
WrappedArray(3, 3)
WrappedArray(4, 4)
res8: Array[Array[Unit]] = Array(Array((), (), (), ()))
Dataframe only
You can do all of this within the dataframe itself, without converting to an RDD, given a dataframe such as:
scala> df.show(false)
+----+----+----+----+----+----+
|col1|col2|col3|col4|col5|col6|
+----+----+----+----+----+----+
|1 |2 |3 |4 |5 |6 |
|1 |2 |3 |4 |5 |6 |
+----+----+----+----+----+----+
You can simply do the following
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> df.select(array(collect_list("col1"), collect_list("col2"), collect_list("col3"), collect_list("col4")).as("collectedArray")).show(false)
+--------------------------------------------------------------------------------+
|collectedArray |
+--------------------------------------------------------------------------------+
|[WrappedArray(1, 1), WrappedArray(2, 2), WrappedArray(3, 3), WrappedArray(4, 4)]|
+--------------------------------------------------------------------------------+
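If the goal is to end up with the values on the driver rather than in a dataframe, here is a sketch that collects the same collect_list result into a local Array, assuming the data is small enough to bring to the driver:
import org.apache.spark.sql.functions._

// one Seq per column, e.g. Array(Seq(1, 1), Seq(2, 2), Seq(3, 3), Seq(4, 4))
val columnsAsArrays: Array[Seq[Any]] = df
  .select(collect_list("col1"), collect_list("col2"), collect_list("col3"), collect_list("col4"))
  .first()
  .toSeq
  .map(_.asInstanceOf[Seq[Any]])
  .toArray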