Need to transform a dataframe using the DataFrame API instead of RDD - Scala

I have a dataframe with the following data:
loans,MTG,111
loans,MTG,102
loans,CRDS,103
loans,PCL,104
loans,PCL,105
I want to get a result something like this:
loans , MTG:111:102, PCL:104:105 , CRDS:103
I am able to achieve it using RDD transformations:
val data = Seq(("loans","MTG",111),("loans","MTG",102),("loans","CRDS",103),("loans","PCL",104),("loans","PCL",105))
val fd1 = sc.parallelize(data)
val fd2 = fd1.map(x => ((x._1, x._2), x._3.toString))   // key by (col1, col2)
val fd3 = fd2.reduceByKey((a, b) => a + ":" + b)        // e.g. ("loans","MTG") -> "111:102"
val fd4 = fd3.map(x => (x._1._1, x._1._2 + ":" + x._2)) // e.g. ("loans", "MTG:111:102")
val fd5 = fd4.groupByKey()
I want to use the DataFrame/Dataset API, or maybe Spark SQL, to achieve the same result. Could you please help?

Use the .groupBy, collect_list and concat_ws built-in functions from the DataFrame API.
Example:
//sample dataframe
var data = Seq(("loans","MTG",111),("loans","MTG" ,102),("loans","CRDS",103),("loans","PCL",104),("loans","PCL",105)).toDF("col1","col2","col3")
import org.apache.spark.sql.functions._
data.show()
//+-----+----+----+
//| col1|col2|col3|
//+-----+----+----+
//|loans| MTG| 111|
//|loans| MTG| 102|
//|loans|CRDS| 103|
//|loans| PCL| 104|
//|loans| PCL| 105|
//+-----+----+----+
data.groupBy("col1","col2").
agg(concat_ws(":",collect_set("col3")).alias("col3")).
selectExpr("col1","""concat_ws(":",col2,col3) as col2""").
groupBy("col1").
agg(concat_ws(",",collect_list("col2")).alias("col2")).
show(false)
//+-----+--------------------------------+
//|col1 |col2 |
//+-----+--------------------------------+
//|loans|MTG:102:111,CRDS:103,PCL:104:105|
//+-----+--------------------------------+
//collect
data.groupBy("col1","col2").agg(concat_ws(":",collect_set("col3")).alias("col3")).selectExpr("col1","""concat_ws(":",col2,col3) as col2""").groupBy("col1").agg(concat_ws(",",collect_list("col2")).alias("col2")).collect()
//res22: Array[org.apache.spark.sql.Row] = Array([loans,MTG:102:111,CRDS:103,PCL:104:105])
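Since the question also mentions Spark SQL: the same aggregation can be expressed against a temp view. This is only a rough sketch, assuming a Spark 2.x session named spark; the view name loans_tbl is made up here.
//register the sample dataframe as a view and run the equivalent SQL
data.createOrReplaceTempView("loans_tbl")
spark.sql("""
  SELECT col1, concat_ws(",", collect_list(col2)) AS col2
  FROM (
    SELECT col1, concat_ws(":", col2, concat_ws(":", collect_set(CAST(col3 AS STRING)))) AS col2
    FROM loans_tbl
    GROUP BY col1, col2
  ) t
  GROUP BY col1
""").show(false)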

Related

How to Transform a Spark Scala Nested Map within a Map Data Structure?

I want to write a nested data structure consisting of a Map inside another Map using an array of a Scala case class.
The result should transform this dataframe:
+-----+-------+----------+----+
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
|  123|    ITA|1475600500|18.0|
|  123|    ITA|1475600516|19.0|
+-----+-------+----------+----+
into:
+--------------------------------------------------------------------+
|value |
+--------------------------------------------------------------------+
[{"value":123,"attributes":{"ITA":{"1475600500":18,"1475600516":19}}}]
+--------------------------------------------------------------------+
The actualResult dataset below gets me close but the structure isn't quite the same as my expected dataframe.
import java.math.BigDecimal

case class Record(value: Integer, attributes: Map[String, Map[String, BigDecimal]])

val actualResult = df
  .map(r =>
    Array(
      Record(
        r.getAs[Int]("Value"),
        Map(
          r.getAs[String]("Country") ->
            Map(
              r.getAs[String]("Timestamp") -> new BigDecimal(
                r.getAs[Double]("Sum").toString
              )
            )
        )
      )
    )
  )
The Timestamp column in the actualResult dataset doesn't get combined into the same Record row; it creates two separate rows instead.
+----------------------------------------------------+
|value |
+----------------------------------------------------+
[{"value":123,"attributes":{"ITA":{"1475600516":19}}}]
[{"value":123,"attributes":{"ITA":{"1475600500":18}}}]
+----------------------------------------------------+
Using groupBy and collect_list, and creating a combined column with struct, I was able to get a single row, as shown in the output below.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

val mycsv =
  """
    |Value|Country|Timestamp|Sum
    | 123|ITA|1475600500|18.0
    | 123|ITA|1475600516|19.0
  """.stripMargin('|').lines.toList.toDS()

val df: DataFrame = spark.read.option("header", true)
  .option("sep", "|")
  .option("inferSchema", true)
  .csv(mycsv)
df.show

val df1 = df
  .groupBy("Value", "Country")
  .agg(collect_list(struct(col("Country"), col("Timestamp"), col("Sum"))).alias("attributes"))
  .drop("Country")

val json = df1.toJSON // you can save it to a file
json.show(false)
Result: the two input rows shown below are combined into a single JSON row:
+-----+-------+----------+----+
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
|123.0|ITA |1475600500|18.0|
|123.0|ITA |1475600516|19.0|
+-----+-------+----------+----+
+----------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|{"Value":123.0,"attributes":[{"Country":"ITA","Timestamp":1475600500,"Sum":18.0},{"Country":"ITA","Timestamp":1475600516,"Sum":19.0}]}|
+----------------------------------------------------------------------------------------------------------------------------------------------+
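If the exact Map-within-Map shape from the question is needed, a possible follow-up is to group first and then build the case class from the collected structs. This is only a sketch, assuming Spark 2.x with import org.apache.spark.sql.functions._ and spark.implicits._ in scope, and the inferred schema shown above (Value as double, Timestamp as int, Sum as double); it uses Scala's BigDecimal.
case class Record(value: Int, attributes: Map[String, Map[String, BigDecimal]])
val nested = df
  .groupBy("Value", "Country")
  .agg(collect_list(struct(col("Timestamp"), col("Sum"))).alias("ts"))
  .map { row =>
    //rebuild the Timestamp -> Sum pairs collected for this (Value, Country) group
    val pairs = row.getAs[Seq[org.apache.spark.sql.Row]]("ts")
      .map(r => r.getAs[Int]("Timestamp").toString -> BigDecimal(r.getAs[Double]("Sum")))
    Record(row.getAs[Double]("Value").toInt, Map(row.getAs[String]("Country") -> pairs.toMap))
  }
nested.toJSON.show(false)
Grouping before mapping is what keeps both timestamps inside a single Record.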

Convert Map(key-value) into spark scala Data-frame

I want to convert myMap = Map("Col_1" -> "1", "Col_2" -> "2", "Col_3" -> "3")
into a Spark Scala DataFrame, with the keys as column names and the values as column values. I am not
getting the expected result; please check my code and provide a solution.
import scala.collection.mutable.ListBuffer

var finalBufferList = new ListBuffer[String]()
var finalDfColumnList = new ListBuffer[String]()
var myMap: Map[String, String] = Map.empty[String, String]

for ((k, v) <- myMap) {
  println(k + "->" + v)
  finalBufferList += v
  //finalDfColumnList += "\""+k+"\""
  finalDfColumnList += k
}

val dff = Seq(finalBufferList.toSeq).toDF(finalDfColumnList.toList.toString())
dff.show()
My result :
+------------------------+
|List(Test, Rest, Incedo)|
+------------------------+
| [4, 5, 3]|
+------------------------+
Expected result :
+------+-------+-------+
|Col_1 | Col_2 | Col_3 |
+------+-------+-------+
| 4 | 5 | 3 |
+------+-------+-------+
Please give me a suggestion.
If you have defined your Map as
val myMap = Map("Col_1"->"1", "Col_2"->"2", "Col_3"->"3")
then you should create RDD[Row] using the values as
import org.apache.spark.sql.Row
val rdd = sc.parallelize(Seq(Row.fromSeq(myMap.values.toSeq)))
then you create a schema using the keys as
import org.apache.spark.sql.types._
val schema = StructType(myMap.keys.toSeq.map(StructField(_, StringType)))
then finally use createDataFrame function to create the dataframe as
val df = sqlContext.createDataFrame(rdd, schema)
df.show(false)
Finally, you should have
+-----+-----+-----+
|Col_1|Col_2|Col_3|
+-----+-----+-----+
|1 |2 |3 |
+-----+-----+-----+
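For reference, a minimal self-contained sketch of the same approach in Spark 2.x, where SparkSession replaces SQLContext:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.getOrCreate()
val myMap = Map("Col_1" -> "1", "Col_2" -> "2", "Col_3" -> "3")

//one Row holding the map's values, plus a schema built from the map's keys
val rdd = spark.sparkContext.parallelize(Seq(Row.fromSeq(myMap.values.toSeq)))
val schema = StructType(myMap.keys.toSeq.map(StructField(_, StringType)))
spark.createDataFrame(rdd, schema).show(false)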
I hope the answer is helpful.
But remember, all this would be useless if you are working with a small dataset.

Dataframe to RDD[Row] replacing space with nulls

I am converting a Spark DataFrame to RDD[Row] so I can map it to the final schema for writing into a Hive ORC table. I want to convert any space in the input to an actual null, so the Hive table can store an actual null instead of an empty string.
Input DataFrame (a single column with pipe delimited values):
col1
1|2|3||5|6|7|||...|
My code:
inputDF.rdd.
  map { x: Row => x.get(0).asInstanceOf[String].split("\\|", -1) }.
  map { x => Row(nullConverter(x(0)), nullConverter(x(1)), nullConverter(x(2)), ..., nullConverter(x(200))) }

def nullConverter(input: String): String = {
  if (input.trim.length > 0) input.trim
  else null
}
Is there any clean way of doing this rather than calling the nullConverter function 200 times?
Update based on single column:
Going with your approach, I will do something like:
inputDf.rdd.map((row: Row) => {
  val values = row.get(0).asInstanceOf[String].split("\\|").map(nullConverter)
  Row(values: _*) // expand the converted values into individual Row fields
})
Make your nullConverter or any other logic a udf:
import org.apache.spark.sql.functions._
val nullConverter = udf((input: String) => {
if (input.trim.length > 0) input.trim
else null
})
Now, use the UDF on your df and apply it to all columns:
val convertedDf = inputDf.select(inputDf.columns.map(c => nullConverter(col(c)).alias(c)):_*)
Now, you can do your RDD logic.
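For example, a minimal sketch of that hand-off, reusing convertedDf from above:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

//convertedDf already carries real nulls in place of blank strings
val rowRdd: RDD[Row] = convertedDf.rdd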
This would be easier to do using the DataFrame API before converting to an RDD. First, split the data:
val df = Seq(("1|2|3||5|6|7|8||")).toDF("col0") // Example dataframe
val df2 = df.withColumn("col0", split($"col0", "\\|")) // Split on "|"
Then find out the length of the array:
val numCols = df2.first.getAs[Seq[String]](0).length
Now, for each element in the array, use the nullConverter UDF and then assign it to its own column.
val nullConverter = udf((input: String) => {
if (input.trim.length > 0) input.trim
else null
})
val df3 = df2.select((0 until numCols).map(i => nullConverter($"col0".getItem(i)).as("col" + i)): _*)
The result using the example dataframe:
+----+----+----+----+----+----+----+----+----+----+
|col0|col1|col2|col3|col4|col5|col6|col7|col8|col9|
+----+----+----+----+----+----+----+----+----+----+
| 1| 2| 3|null| 5| 6| 7| 8|null|null|
+----+----+----+----+----+----+----+----+----+----+
Now convert it to an RDD or continue using the data as a DataFrame depending on your needs.
There is no point in converting the DataFrame to an RDD; blank values can be replaced with nulls directly through the DataFrame API:
import org.apache.spark.sql.functions._
val df = Seq((1, "foo bar"), (2, "foobar "), (3, " ")).toDF("k", "v")
//turn blank-only values into real nulls, column by column, keeping everything else as-is
val cleaned = df.select(df.columns.map(c =>
  when(trim(col(c).cast("string")) === "", lit(null)).otherwise(col(c)).alias(c)): _*)
cleaned.show()

Spark Dataframe GroupBy and compute Complex aggregate function

Using a Spark DataFrame, I need to compute a percentage using the complex formula below:
Group by "KEY" and calculate "re_pcnt" as (sum(SA) / sum(SA / (PCT / 100))) * 100
For instance, the input DataFrame is
val values1 = List(List("01", "20000", "45.30"), List("01", "30000", "45.30"))
.map(row => (row(0), row(1), row(2)))
val DS1 = values1.toDF("KEY", "SA", "PCT")
DS1.show()
+---+-----+-----+
|KEY| SA| PCT|
+---+-----+-----+
| 01|20000|45.30|
| 01|30000|45.30|
+---+-----+-----+
Expected Result:
+---+--------------+
|KEY|       re_pcnt|
+---+--------------+
| 01|45.30000038505|
+---+--------------+
I have tried to calculate it as below:
val result = DS1.groupBy("KEY").agg(((sum("SA").divide(
sum(
("SA").divide(
("PCT").divide(100)
)
)
)) * 100).as("re_pcnt"))
But I am facing Error:(36, 16) value divide is not a member of String ("SA").divide({
Any suggestions on implementing the above logic?
You can try importing spark.implicits._ and then use $ to refer to a column.
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val result = DS1.groupBy("KEY")
.agg(((sum($"SA").divide(sum(($"SA").divide(($"PCT").divide(100))))) * 100)
.as("re_pcnt"))
Which will give you the requested output.
If you do not want to import spark.implicits._, you can always use col() instead of $.
It is possible to use a string as input to the agg() function with the use of expr(). However, the input string needs to be changed a bit. The following gives exactly the same result as before, but uses a string instead:
val opr = "sum(SA)/(sum(SA/(PCT/100))) * 100"
val df = DS1.groupBy("KEY").agg(expr(opr).as("re_pcnt"))
Note that .as("re_pcnt") needs to be inside the agg() method; it cannot be outside.
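The same expression also runs as plain Spark SQL through a temp view, if that is preferred. This is only a sketch; the view name ds1 is made up and a SparkSession named spark is assumed.
DS1.createOrReplaceTempView("ds1")
spark.sql("SELECT `KEY`, sum(SA) / sum(SA / (PCT / 100)) * 100 AS re_pcnt FROM ds1 GROUP BY `KEY`").show()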
Your code works almost perfectly. You just have to put the '$' symbol in order to specify you're passing a column:
val result = DS1.groupBy($"KEY").agg(((sum($"SA").divide(
  sum(
    ($"SA").divide(
      ($"PCT").divide(100)
    )
  )
)) * 100).as("re_pcnt"))
Here's the output:
result.show()
+---+-------+
|KEY|re_pcnt|
+---+-------+
| 01| 45.3|
+---+-------+

DataFrame: Append column name to rows data

I'm looking for a way to append column names to a DataFrame row's data.
The number of columns can differ from time to time.
I'm on Spark 1.4.1.
I have a dataframe:
Edit: all data is String type only.
+---+----------+
|key| value|
+---+----------+
|foo| bar|
|bar| one, two|
+---+----------+
I'd like to get:
+-------+--------------------+
|key    |               value|
+-------+--------------------+
|key_foo|           value_bar|
|key_bar|value_one, value_two|
+-------+--------------------+
I tried
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val concatColNamesWithElems = udf { seq: Seq[Row] =>
  seq.map { case Row(y: String) => (col + "_" + y) }
}
Save the DataFrame as a table (e.g. dfTable), so that you can write SQL against it.
df.registerTempTable("dfTable")
Create a UDF and register it (I'll assume your value column type is String):
sqlContext.udf.register("prefix", (columnVal: String, prefix: String) =>
columnVal.split(",").map(x => prefix + "_" + x.trim).mkString(", ")
)
Use the UDF in a query:
//prepare the column expressions that call the UDF, each aliased back to its original name
//e.g. prefix(key, "key") AS key -- you can check this representation with the println below
val columns = df.columns.map(col => s"""prefix($col, "$col") AS $col """).mkString(",")
println(columns) //for checking how the columns are framed
val resultDf = sqlContext.sql("SELECT " + columns + " FROM dfTable")
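A DataFrame-only variant of the same idea, without the temp table (a sketch only; prefixUdf here is just the function above wrapped with udf() instead of being registered for SQL):
import org.apache.spark.sql.functions._
val prefixUdf = udf((columnVal: String, prefix: String) =>
  columnVal.split(",").map(x => prefix + "_" + x.trim).mkString(", "))

//apply the prefixing UDF to every column, keeping the original column names
val resultDf2 = df.select(df.columns.map(c => prefixUdf(col(c), lit(c)).as(c)): _*)
resultDf2.show()
This avoids registerTempTable entirely and keeps everything in the DataFrame API.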