Dynamic dataframe with n columns and m rows - scala

Reading data from json(dynamic schema) and i'm loading that to dataframe.
Example Dataframe:
scala> import spark.implicits._
import spark.implicits._
scala> val DF = Seq(
(1, "ABC"),
(2, "DEF"),
(3, "GHIJ")
).toDF("id", "word")
someDF: org.apache.spark.sql.DataFrame = [number: int, word: string]
scala> DF.show
+------+-----+
|id | word|
+------+-----+
| 1| ABC|
| 2| DEF|
| 3| GHIJ|
+------+-----+
Requirement:
Column count and names can be anything. I want to read rows in loop to fetch each column one by one. Need to process that value in subsequent flows. Need both column name and value. I'm using scala.
Python:
for i, j in df.iterrows():
print(i, j)
Need the same functionality in scala and it column name and value should be fetched separtely.
Kindly help.

df.iterrows is not from pyspark, but from pandas. In Spark, you can use foreach :
DF
.foreach{_ match {case Row(id:Int,word:String) => println(id,word)}}
Result :
(2,DEF)
(3,GHIJ)
(1,ABC)
I you don't know the number of columns, you cannot use unapply on Row, then just do :
DF
.foreach(row => println(row))
Result :
[1,ABC]
[2,DEF]
[3,GHIJ]
And operate with row using its methods getAs etc

Related

How to add a new column to my DataFrame such that values of new column are populated by some other function in scala?

myFunc(Row): String = {
//process row
//returns string
}
appendNewCol(inputDF : DataFrame) : DataFrame ={
inputDF.withColumn("newcol",myFunc(Row))
inputDF
}
But no new column got created in my case. My myFunc passes this row to a knowledgebasesession object and that returns a string after firing rules. Can I do it this way? If not, what is the right way? Thanks in advance.
I saw many StackOverflow solutions using expr() sqlfunc(col(udf(x)) and other techniques but here my newcol is not derived directly from existing column.
Dataframe:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val myFunc = (r: Row) => {r.getAs[String]("col1") + "xyz"} // example transformation
val testDf = spark.sparkContext.parallelize(Seq(
(1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")
testDf.show
val rddRes = testDf
.rdd
.map{x =>
val y = myFunc (x)
Row.fromSeq (x.toSeq ++ Seq(y) )
}
val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType =StringType, nullable =false)))
spark.sqlContext.createDataFrame(rddRes, newSchema).show
Results:
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| def|
| 3| ghi|
+---+----+
+---+----+------+
| id|col1| col2|
+---+----+------+
| 1| abc|abcxyz|
| 2| def|defxyz|
| 3| ghi|ghixyz|
+---+----+------+
With Dataset:
case class testData(id: Int, col1: String)
case class transformedData(id: Int, col1: String, col2: String)
val test: Dataset[testData] = List(testData(1, "abc"), testData(2, "def"), testData(3, "ghi")).toDS
val transformedData: Dataset[transformedData] = test
.map { x: testData =>
val newCol = x.col1 + "xyz"
transformedData(x.id, x.col1, newCol)
}
transformedData.show
As you can see datasets is more readable, plus provides strong type casting.
Since I'm unaware of your spark version, providing both solutions here. However if you're using spark v>=1.6, you should look into Datasets. Playing with rdd is fun, but can quickly devolve into longer job runs and a host of other issues that you wont foresee

Spark Dataframe extracting columns based dynamically selected columns

Schema of input dataframe
- employeeKey (int)
- employeeTypeId (string)
- loginDate (string)
- employeeDetailsJson (string)
{"Grade":"100","ValidTill":"2021-12-01","Supervisor":"Alex","Vendor":"technicia","HourlyRate":29}
For Perm employees , some attributes are available and some not. Same for Contracting Employees.
So looking to find an efficient way to build dataframe based on only selected columns, as against transforming all columns and select the ones which I need.
Also please advise this is the best way to extract values from json string based on a key. As the attributes in the string are dynamic, I can not build StructSchema based on it. So using good old get_json_object.
(spark 2.45 and will use spark 3 in future)
val dfSelectColumns=List("Employee-Key", "Employee-Type","Login-Date","cont.Vendor-Name","cont.Hourly-Rate" )
//val dfSelectColumns=List("Employee-Key", "Employee-Type","Login-Date","perm.Level","perm-Validity","perm.Supervisor" )
val resultDF = inputDF.get
.withColumn("Employee-Key", col("employeeKey"))
.withColumn("Employee-Type", when(col("employeeTypeId") === 1, "Permanent")
.when(col("employeeTypeId") === 2, "Contractor")
.otherwise("unknown"))
.withColumn("Login-Date", to_utc_timestamp(to_timestamp(col("loginDate"), "yyyy-MM-dd'T'HH:mm:ss"), ""America/Chicago""))
.withColumn("perm.Level", get_json_object(col("employeeDetailsJson"), "$.Grade"))
.withColumn("perm.Validity", get_json_object(col("employeeDetailsJson"), "$.ValidTill"))
.withColumn("perm.SuperVisor", get_json_object(col("employeeDetailsJson"), "$.Supervisor"))
.withColumn("cont.Vendor-Name", get_json_object(col("employeeDetailsJson"), "$.Vendor"))
.withColumn("cont.Hourly-Rate", get_json_object(col("employeeDetailsJson"), "$.HourlyRate"))
.select(dfSelectColumns.head, dfSelectColumns.tail: _*)
I see that you have 2 schemas, one for Permanent and another for Contractor. You can have 2 schemas.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val schemaBase = new StructType().add("Employee-Key", IntegerType).add("Employee-Type", StringType).add("Login-Date", DateType)
val schemaPerm = schemaBase.add("Level", IntegerType).add("Validity", StringType)// Permanent attributes
val schemaCont = schemaBase.add("Vendor", StringType).add("HourlyRate", DoubleType) // Contractor attributes
Then you can use the 2 schemas to load the data into dataframe.
For Permanent Employee:
val jsonPermDf = Seq( // Construct sample dataframe
(2, """{"Employee-Key":2, "Employee-Type":"Permanent", "Login-Date":"2021-11-01", "Level":3, "Validity":"ok"}""")
, (3, """{"Employee-Key":3, "Employee-Type":"Permanent", "Login-Date":"2020-10-01", "Level":2, "Validity":"ok-yes"}""")
).toDF("key", "raw_json")
val permDf = jsonPermDf.withColumn("data", from_json(col("raw_json"),schemaPerm)).select($"data.*")
permDf.show()
For Contractor:
val jsonContDf = Seq( // Construct sample dataframe
(1, """{"Employee-Key":1, "Employee-Type":"Contractor", "Login-Date":"2021-12-01", "Vendor":"technicia", "HourlyRate":29}""")
, (4, """{"Employee-Key":4, "Employee-Type":"Contractor", "Login-Date":"2019-09-01", "Vendor":"Minis", "HourlyRate":35}""")
).toDF("key", "raw_json")
val contDf = jsonContDf.withColumn("data", from_json(col("raw_json"),schemaCont)).select($"data.*")
contDf.show()
This is the result datafrme for Permanent:
+------------+-------------+----------+-----+--------+
|Employee-Key|Employee-Type|Login-Date|Level|Validity|
+------------+-------------+----------+-----+--------+
| 2| Permanent|2021-11-01| 3| ok|
| 3| Permanent|2020-10-01| 2| ok-yes|
+------------+-------------+----------+-----+--------+
This is the result dataframe for Contractor:
+------------+-------------+----------+---------+----------+
|Employee-Key|Employee-Type|Login-Date| Vendor|HourlyRate|
+------------+-------------+----------+---------+----------+
| 1| Contractor|2021-12-01|technicia| 29.0|
| 4| Contractor|2019-09-01| Minis| 35.0|
+------------+-------------+----------+---------+----------+
If the schema of the JSON in employeeDetailsJson is unstable, you can still parse it into Map(String, String) type using from_json function with schema map<string,string>. Then you can explode the map column and pivot to get keys as columns.
Example:
val df1 = df.withColumn(
"employeeDetails",
from_json(col("employeeDetailsJson"), "map<string,string>")
).select(
col("employeeKey"),
col("employeeTypeId"),
col("loginDate"),
explode("employeeDetails")
).groupBy("employeeKey", "employeeTypeId", "loginDate")
.pivot("key")
.agg(first("value"))
df1.show()
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
//|employeeKey|employeeTypeId|loginDate |Grade|HourlyRate|Supervisor|ValidTill |Vendor |
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+
//|1 |1 |2021-02-05'T'21:28:06|100 |29 |Alex |2021-12-01|technicia|
//+-----------+--------------+---------------------+-----+----------+----------+----------+---------+

Scala filter out rows where any column2 matches column1

Hi Stackoverflow,
I want to remove all rows in a dataframe where column A matches any of the distinct values in column B. I would expect this code block to do exactly that, but it seems to remove values where column B is null as well, which is weird since the filter should only consider column A anyway. How can I fix this code to perform the expected behavior, which is remove all rows in a dataframe where column A matches any of the distinct values in column B.
import spark.implicits._
val df = Seq(
(scala.math.BigDecimal(1) , null),
(scala.math.BigDecimal(2), scala.math.BigDecimal(1)),
(scala.math.BigDecimal(3), scala.math.BigDecimal(4)),
(scala.math.BigDecimal(4), null),
(scala.math.BigDecimal(5), null),
(scala.math.BigDecimal(6), null)
).toDF("A", "B")
// correct, has 1, 4
val to_remove = df
.filter(
df.col("B").isNotNull
).select(
df("B")
).distinct()
// incorrect, returns 2, 3 instead of 2, 3, 5, 6
val final = df.filter(!df.col("A").isin(to_remove.col("B")))
// 4 != 2
assert(4 === final.collect().length)
isin function accepts a list. However, in your code, you're passing Dataset[Row]. As per documentation https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.Column#isin%28scala.collection.Seq%29
it's declared as
def isin(list: Any*): Column
You first need to extract the values into Sequence and then use that in isin function. Please, note that this may have performance implications.
scala> val to_remove = df.filter(df.col("B").isNotNull).select(df("B")).distinct().collect.map(_.getDecimal(0))
to_remove: Array[java.math.BigDecimal] = Array(1.000000000000000000, 4.000000000000000000)
scala> val finaldf = df.filter(!df.col("A").isin(to_remove:_*))
finaldf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [A: decimal(38,18), B: decimal(38,18)]
scala> finaldf.show
+--------------------+--------------------+
| A| B|
+--------------------+--------------------+
|2.000000000000000000|1.000000000000000000|
|3.000000000000000000|4.000000000000000000|
|5.000000000000000000| null|
|6.000000000000000000| null|
+--------------------+--------------------+
Change filter condition !df.col("A").isin(to_remove.col("B")) to !df.col("A").isin(to_remove.collect.map(_.getDecimal(0)):_*)
Check below code.
val finaldf = df
.filter(!df
.col("A")
.isin(to_remove.map(_.getDecimal(0)).collect:_*)
)
scala> finaldf.show
+--------------------+--------------------+
| A| B|
+--------------------+--------------------+
|2.000000000000000000|1.000000000000000000|
|3.000000000000000000|4.000000000000000000|
|5.000000000000000000| null|
|6.000000000000000000| null|
+--------------------+--------------------+

Scala - Fill "null" column with another column

I want to replicate the problem mentioned here in Scala DataFrames. I have tried using the following approaches, to no success so far.
Input
Col1 Col2
A M
B K
null S
Expected Output
Col1 Col2
A M
B K
S <---- S
Approach 1
val output = df.na.fill("A", Seq("col1"))
The fill method does not take a column as the (first) input.
Approach 2
val output = df.where(df.col("col1").isNull)
I cannot find a suitable method to call after I have identified the null values.
Approach 3
val output = df.dtypes.map(column =>
column._2 match {
case "null" => (column._2 -> 0)
}).toMap
I get a StringType error.
I'd use when/otherwise, as shown below:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
("A", "M"), ("B", "K"), (null, "S")
).toDF("Col1", "Col2")
df.withColumn("Col1", when($"Col1".isNull, $"Col2").otherwise($"Col1")).show
// +----+----+
// |Col1|Col2|
// +----+----+
// | A| M|
// | B| K|
// | S| S|
// +----+----+

Replace one dataframe column value with another's value

I have two dataframes (Scala Spark) A and B. When A("id") == B("a_id") I want to update A("value") to B("value"). Since DataFrames have to be recreated I'm assuming I have to do some joins and withColumn calls but I'm not sure how to do this. In SQL it would be a simple update call on a natural join but for some reason this seems difficult in Spark?
Indeed, a left join and a select call would do the trick:
// assuming "spark" is an active SparkSession:
import org.apache.spark.sql.functions._
import spark.implicits._
// some sample data; Notice it's convenient to NAME the dataframes using .as(...)
val A = Seq((1, "a1"), (2, "a2"), (3, "a3")).toDF("id", "value").as("A")
val B = Seq((1, "b1"), (2, "b2")).toDF("a_id", "value").as("B")
// left join + coalesce to "choose" the original value if no match found:
val result = A.join(B, $"A.id" === $"B.a_id", "left")
.select($"id", coalesce($"B.value", $"A.value") as "value")
// result:
// +---+-----+
// | id|value|
// +---+-----+
// | 1| b1|
// | 2| b2|
// | 3| a3|
// +---+-----+
Notice that there's no real "update" here - result is a new DataFrame which you can use (write / count / ...) but the original DataFrames remain unchanged.