How to fetch the constant value from a column in Spark DataFrames - Scala

I have the following dataframe, where certain columns like version and dataSetName are supposedly constant. I am trying to get these constants into variables (version is of type Float and dataSetName is a String).
+---+--------+-----------+
|id |version |dataSetName|
+---+--------+-----------+
|1  |1.0     |employee   |
|2  |1.0     |employee   |
|3  |1.0     |employee   |
|4  |1.0     |employee   |
+---+--------+-----------+
Using the following gives me a Row:
val dataSetName = df.select("dataSetName").distinct.collect()(0)
What is the best way to get dataSetName and version into String and Float variables respectively?

Check the code below. The map calls need import spark.implicits._ in scope for the encoders, and version comes back as a Double (call .toFloat on the result if you need a Float).
version
df
  .select("version")
  .distinct
  .map(_.getAs[Double](0))
  .collect
  .head
dataSetName
df
  .select("dataSetName")
  .distinct
  .map(_.getAs[String](0))
  .collect
  .head
version & dataSetName
df
  .select("version", "dataSetName")
  .distinct
  .map(c => (c.getAs[Double](0), c.getAs[String](1)))
  .collect
  .head
(Double, String) = (1.0,employee) // Output
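If you only need the two scalars, a slightly shorter variant (a sketch along the same lines, not part of the answer above) is to take the first distinct row and read the fields by name:
val row = df.select("version", "dataSetName").distinct.head   // a single Row
val version: Float = row.getAs[Double]("version").toFloat     // assuming the column is DoubleType, as above
val dataSetName: String = row.getAs[String]("dataSetName")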

Related

Convert every value of a dataframe

I need to modify the values of every column of a dataframe so that they are all enclosed in double quotes after the mapping, while the dataframe still retains its original structure with the headers.
I tried mapping the values by changing the rows to sequences, but that loses the headers in the output dataframe.
With this read in as input dataframe:
+------+-------+----+
|prodid|name   |city|
+------+-------+----+
|1     |Harshit|VNS |
|2     |Mohit  |BLR |
|2     |Mohit  |RAO |
|2     |Mohit  |BTR |
|3     |Rohit  |BOM |
|4     |Shobhit|KLK |
+------+-------+----+
I tried the following code:
val columns = df.columns
df.map { row =>
  row.toSeq.map { col => "\"" + col + "\"" }
}.toDF(columns: _*)
But it throws an error because the mapped dataframe has only one column, named value.
This is the actual result (if I remove .toDF(columns:_*)):
+--------------------+
|               value|
+--------------------+
|["1", "Harshit", ...|
|["2", "Mohit", "B...|
|["2", "Mohit", "R...|
|["2", "Mohit", "B...|
|["3", "Rohit", "B...|
|["4", "Shobhit", ...|
+--------------------+
And my expected result is something like:
+------+---------+------+
|prodid|name     |city  |
+------+---------+------+
|"1"   |"Harshit"|"VNS" |
|"2"   |"Mohit"  |"BLR" |
|"2"   |"Mohit"  |"RAO" |
|"2"   |"Mohit"  |"BTR" |
|"3"   |"Rohit"  |"BOM" |
|"4"   |"Shobhit"|"KLK" |
+------+---------+------+
Note: there are only 3 columns in this example, but my original data has many more, so manually typing each one is not an option, especially if the file header changes. How do I get this modified dataframe?
Edit: I need the quotes on all values except the integers, so the output should be something like:
+------+---------+------+
|prodid|name     |city  |
+------+---------+------+
|1     |"Harshit"|"VNS" |
|2     |"Mohit"  |"BLR" |
|2     |"Mohit"  |"RAO" |
|2     |"Mohit"  |"BTR" |
|3     |"Rohit"  |"BOM" |
|4     |"Shobhit"|"KLK" |
+------+---------+------+
It might be easier to use select instead (note the imports for the column functions and schema types):
import org.apache.spark.sql.functions.{col, format_string}
import org.apache.spark.sql.types.{IntegerType, StructField}
val df = Seq((1, "Harshit", "VNS"), (2, "Mohit", "BLR"))
  .toDF("prodid", "name", "city")
df.select(df.schema.fields.map {
  case StructField(name, IntegerType, _, _) => col(name)
  case StructField(name, _, _, _)           => format_string("\"%s\"", col(name)) as name
}: _*).show()
Output:
+------+---------+-----+
|prodid| name| city|
+------+---------+-----+
| 1|"Harshit"|"VNS"|
| 2| "Mohit"|"BLR"|
+------+---------+-----+
Note that there are other numeric types as well, such as LongType and DoubleType, so you might need to handle those too, or alternatively just quote the StringType columns and pass everything else through.
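A sketch of that last variant (same select approach as above, just inverted so that only the string columns are quoted and every other type passes through unchanged):
import org.apache.spark.sql.functions.{col, format_string}
import org.apache.spark.sql.types.{StringType, StructField}
df.select(df.schema.fields.map {
  case StructField(name, StringType, _, _) => format_string("\"%s\"", col(name)) as name
  case StructField(name, _, _, _)          => col(name)   // numeric, date, etc. left as-is
}: _*).show()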

spark 2.3.1 insertinto remove value of fields and change it to null

I just upgraded my spark cluster from 2.2.1 to 2.3.1 in order to enjoy the feature of overwriting specific partitions. see link.
But ....
For some reason, when I test it I get very strange behavior; see the code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

case class MyRow(partitionField: Int, someId: Int, someText: String)

object ExampleForStack2 extends App {
  val sparkConf = new SparkConf()
  sparkConf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
  sparkConf.setMaster(s"local[2]")

  val spark = SparkSession.builder().config(sparkConf).getOrCreate()

  val list1 = List(
    MyRow(1, 1, "someText"),
    MyRow(2, 2, "someText2")
  )
  val list2 = List(
    MyRow(1, 1, "someText modified"),
    MyRow(3, 3, "someText3")
  )

  val df = spark.createDataFrame(list1)
  val df2 = spark.createDataFrame(list2)
  df2.show(false)

  df.write.partitionBy("partitionField").option("path", "/tmp/tables/").saveAsTable("my_table")
  df2.write.mode(SaveMode.Overwrite).insertInto("my_table")

  spark.sql("select * from my_table").show(false)
}
And output:
+--------------+------+-----------------+
|partitionField|someId|someText |
+--------------+------+-----------------+
|1 |1 |someText modified|
|3 |3 |someText3 |
+--------------+------+-----------------+
+------+---------+--------------+
|someId|someText |partitionField|
+------+---------+--------------+
|2 |someText2|2 |
|1 |someText |1 |
|3 |3 |null |
|1 |1 |null |
+------+---------+--------------+
Why do I get those nulls?
It seems that the fields were shifted, but why?
Thanks
OK, found it: insertInto is based on field position. See the documentation:
Unlike saveAsTable, insertInto ignores the column names and just uses position-based resolution. For example:
scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
scala> sql("select * from t1").show
+---+---+
| i| j|
+---+---+
| 5| 6|
| 3| 4|
| 1| 2|
+---+---+
Because it inserts data to an existing table, format or options will
be ignored.
Moreover, I am using dynamic partitioning, and the partition column must appear as the last field. So the solution is to move the dynamic partition column to the end of the dataframe, which in my case means:
df2.select("someId", "someText","partitionField").write.mode(SaveMode.Overwrite).insertInto("my_table")

RDD Key-Value pair with composite value

I have a toy data set for which I need to compute, for each state, the list of its cities and the state's population (the sum of the populations of all the cities in that state).
I want to do this using RDDs, without groupByKey and without joins. In my approach so far I used 2 separate key-value pair RDDs and joined them:
val rdd1 = inputRdd.map(x => (x._1, x._3.toInt))
val rdd2 = inputRdd.map(x => (x._1, List(x._2)))   // wrap the city in a List so reduceByKey can concatenate
val popn_sum = rdd1.reduceByKey(_ + _)
val list_cities = rdd2.reduceByKey(_ ++ _)
popn_sum.join(list_cities).collect()
Is it possible to get the same output with just one key-value pair and without any joins?
I have created a new key-value pair, but I do not know how to proceed with aggregateByKey or reduceByKey on this RDD:
val rdd3 = inputRdd.map(x => (x._1, (List(x._2), x._3)))
I am new to Spark and want to learn the best way to get this output:
Array((B,(12,List(B1, B2))), (A,(6,List(A1, A2, A3))), (C,(8,List(C1, C2))))
Thanks in advance.
If your inputRdd is of type
inputRdd: org.apache.spark.rdd.RDD[(String, String, Int)]
then you can achieve your desired result by simply using one reduceByKey:
inputRdd.map(x => (x._1, (List(x._2), x._3.toInt))).reduceByKey((x, y) => (x._1 ++ y._1, x._2 + y._2))
and you can do it with aggregateByKey as:
inputRdd.map(x => (x._1, (List(x._2), x._3.toInt))).aggregateByKey((List.empty[String], 0))((x, y) => (x._1 ++ y._1, x._2 + y._2), (x, y) => (x._1 ++ y._1, x._2 + y._2))
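For a concrete, self-contained run (a sketch assuming an existing SparkSession named spark and the same toy data used in the DataFrame example below, with each state paired to (population, cities) to match the output format the question asks for):
val inputRdd = spark.sparkContext.parallelize(Seq(
  ("A", "A1", 1), ("B", "B1", 2), ("C", "C1", 3),
  ("A", "A2", 2), ("A", "A3", 3), ("B", "B2", 10), ("C", "C2", 5)))
val result = inputRdd
  .map { case (state, city, pop) => (state, (pop, List(city))) }      // one key-value pair per record
  .reduceByKey { case ((p1, c1), (p2, c2)) => (p1 + p2, c1 ++ c2) }   // sum populations, concatenate city lists
result.collect()
// e.g. Array((B,(12,List(B1, B2))), (A,(6,List(A1, A2, A3))), (C,(8,List(C1, C2))))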
DataFrame way
An even better approach is to use the DataFrame API. You can convert your RDD to a dataframe simply by applying .toDF("state", "city", "population"), which should give you
+-----+----+----------+
|state|city|population|
+-----+----+----------+
|A |A1 |1 |
|B |B1 |2 |
|C |C1 |3 |
|A |A2 |2 |
|A |A3 |3 |
|B |B2 |10 |
|C |C2 |5 |
+-----+----+----------+
After that you can just use groupBy and apply the collect_list and sum built-in aggregation functions:
import org.apache.spark.sql.functions._
inputDf.groupBy("state").agg(collect_list(col("city")).as("cityList"), sum("population").as("sumPopulation"))
which should give you
+-----+------------+-------------+
|state|cityList |sumPopulation|
+-----+------------+-------------+
|B |[B1, B2] |12 |
|C |[C1, C2] |8 |
|A |[A1, A2, A3]|6 |
+-----+------------+-------------+
The Dataset API is almost the same but comes with additional type safety.
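A rough typed sketch of the same aggregation (my own illustration, not part of the answer; it assumes spark.implicits._ is in scope and that the case class is defined at top level so an encoder can be derived):
import org.apache.spark.sql.functions.{collect_list, sum}
case class CityPop(state: String, city: String, population: Int)
inputDf.as[CityPop]
  .groupByKey(_.state)
  .agg(
    collect_list($"city").as[Seq[String]],   // typed column: cities per state
    sum($"population").as[Long])             // typed column: summed population
  .show(false)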

Replace the value of one column from another column in spark dataframe

I have a dataframe like below
+---+------------+----------------------------------------------------------------------+
|id |indexes |arrayString |
+---+------------+----------------------------------------------------------------------+
|2 |1,3 |[WrappedArray(3, Str3), WrappedArray(1, Str1)] |
|1 |2,4,3 |[WrappedArray(2, Str2), WrappedArray(3, Str3), WrappedArray(4, Str4)] |
|0 |1,2,3 |[WrappedArray(1, Str1), WrappedArray(2, Str2), WrappedArray(3, Str3)] |
+---+------------+----------------------------------------------------------------------+
I want to loop through arrayString, taking the first element as the index and the second as the String, and then replace each value in indexes with the String corresponding to that index in arrayString. I want an output like below.
+---+---------------+
|id |replacedString |
+---+---------------+
|2 |Str1,Str3 |
|1 |Str2,Str4,Str3 |
|0 |Str1,Str2,Str3 |
+---+---------------+
I tried using the below udf function.
val replaceIndex = udf((itemIndex: String, arrayString: Seq[Seq[String]]) => {
  val itemIndexArray = itemIndex.split("\\,")
  arrayString.map(i => {
    itemIndexArray.updated(i(0).toInt, i(1))
  })
  itemIndexArray
})
This gives me an error and I am not getting my desired output. Is there another way to achieve this? I can't use explode and join, as I want the indexes replaced with strings without losing the order.
You can create a UDF as below to get the required result: convert the array of arrays into a Map and look each index up as a key in that Map.
val replaceIndex = udf((itemIndex: String, arrayString: Seq[Seq[String]]) => {
  val indexList = itemIndex.split("\\,")
  val array = arrayString.map(x => (x(0) -> x(1))).toMap   // index -> string
  indexList map array mkString ","                         // look up each index and join with commas
})
dataframe.withColumn("arrayString", replaceIndex($"indexes", $"arrayString"))
  .show(false)
Output:
+---+-------+--------------+
|id |indexes|arrayString |
+---+-------+--------------+
|2 |1,3 |Str1,Str3 |
|1 |2,4,3 |Str2,Str4,Str3|
|0 |1,2,3 |Str1,Str2,Str3|
+---+-------+--------------+
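A slightly more defensive variant of the UDF (my own sketch, not part of the answer above): fall back to the raw index when it has no entry in arrayString, instead of throwing a NoSuchElementException.
val replaceIndexSafe = udf((itemIndex: String, arrayString: Seq[Seq[String]]) => {
  val lookup = arrayString.map(x => x(0) -> x(1)).toMap
  itemIndex.split(",").map(i => lookup.getOrElse(i, i)).mkString(",")   // keep unknown indexes as-is
})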
Hope this helps!

how to access the column index for spark dataframe in scala for calculation

I am new to Scala programming. I have worked with R very extensively, but in Scala I am finding it hard to loop over specific columns and perform computations on their values.
Let me explain with the help of an example: I have a final dataframe, arrived at after joining 2 dataframes, and I now need to perform a calculation across pairs of columns, after which we get the spark dataframe below.
How do I refer to the column index in a for-loop to compute the new column values in a spark dataframe in Scala?
Here is one solution:
Input Data:
+---+---+---+---+---+---+---+---+---+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |
+---+---+---+---+---+---+---+---+---+
|24 |74 |74 |21 |66 |65 |100|27 |19 |
+---+---+---+---+---+---+---+---+---+
Zip the columns, dropping the ones without a matching pair:
val oneCols = data.schema.filter(_.name.contains("1")).map(x => x.name).sorted
val twoCols = data.schema.filter(_.name.contains("2")).map(x => x.name).sorted
val cols = oneCols.zip(twoCols)
//cols: Seq[(String, String)] = List((a1,a2), (b1,b2), (c1,c2), (d1,d2))
Use the foldLeft function to dynamically add the difference columns:
import org.apache.spark.sql.functions._
val result = cols.foldLeft(data)((data, c) => data.withColumn(s"Diff_${c._1}",
  (col(c._2) - col(c._1)) / col(c._2)))
Here is the result:
result.show(false)
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |Diff_a1 |Diff_b1|Diff_c1 |Diff_d1 |
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|24 |74 |74 |21 |66 |65 |100|27 |19 |0.6307692307692307|0.26 |-1.7407407407407407|-0.10526315789473684|
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+