Transform a field inside an array column - Scala

I need to transform an array column in my dataframe. The array is called 'cities', has the type Array(City), and I want to put the city name in upper case.
Structure:
val cities: StructField = StructField("cities", ArrayType(CityType), nullable = true)

def CityType: StructType =
  StructType(
    Seq(
      StructField(code, StringType, nullable = true),
      StructField(name, StringType, nullable = true)
    )
  )
Code I tried:
.withColumn(
  newColumn,
  forall(
    col(cities),
    (col: Column) =>
      struct(
        Array(
          col(code),
          upper(col(name))
        ): _*
      )
  )
)
The error says:
cannot resolve 'forall(...

forall is not what you need here: in Spark 3.0+ it is a predicate function that returns a single Boolean for the whole array, and in earlier versions it does not exist at all, which is why you get the cannot resolve error. To rewrite each element, use transform instead:
// sample data
val df = spark.sql("select array(struct('1' as code, 'abc' as name), struct('2' as code, 'def' as name)) cities")

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, struct, transform, upper}

val df2 = df.withColumn(
  "newcol",
  transform(
    col("cities"),
    (c: Column) => struct(c("code"), upper(c("name")))
  )
)
df2.show
+--------------------+--------------------+
| cities| newcol|
+--------------------+--------------------+
|[[1, abc], [2, def]]|[[1, ABC], [2, DEF]]|
+--------------------+--------------------+

Related

Spark create a dataframe from multiple lists/arrays

So, I have two lists in Spark (Scala). They both contain the same number of values. The first list a contains all Strings and the second list b contains all Longs.
a: List[String] = List("a", "b", "c", "d")
b: List[Long] = List(17625182, 17625182, 1059731078, 100)
I also have a schema defined as follows:
val schema2 = StructType(
  Array(
    StructField("check_name", StringType, true),
    StructField("metric", DecimalType(38, 0), true)
  )
)
What is the best way to convert my lists to a single dataframe, that has schema schema2 and the columns are made from a and b respectively?
You can create an RDD[Row] and convert it to a Spark DataFrame with the given schema:
val df = spark.createDataFrame(
  sc.parallelize(a.zip(b).map(x => Row(x._1, BigDecimal(x._2)))),
  schema2
)
df.show
+----------+----------+
|check_name| metric|
+----------+----------+
| a| 17625182|
| b| 17625182|
| c|1059731078|
| d| 100|
+----------+----------+
Using a Dataset:
import spark.implicits._

case class Schema2(a: String, b: Long)

val el = (a zip b).map { case (a, b) => Schema2(a, b) }
val df = spark.createDataset(el).toDF()
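Note that the case-class route gives you columns a and b, with b as a Long rather than the DecimalType(38,0) in schema2. A minimal sketch (my addition, assuming the same a, b and schema2 as above) to rename and cast to the target schema:
// sketch: assumes a: List[String], b: List[Long] and the schema2 shown above
import org.apache.spark.sql.types.DecimalType
import spark.implicits._

val dfTyped = (a zip b)
  .toDF("check_name", "metric")                              // rename the columns
  .withColumn("metric", $"metric".cast(DecimalType(38, 0)))  // Long -> Decimal(38,0)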

How can I apply boolean indexing in a Spark-Scala dataframe?

I have two Spark-Scala dataframes and I need to use one boolean column from one dataframe to filter the second dataframe. Both dataframes have the same number of rows.
In pandas I would do it like this:
import pandas as pd
df1 = pd.DataFrame({"col1": ["A", "B", "A", "C"], "boolean_column": [True, False, True, False]})
df2 = pd.DataFrame({"col1": ["Z", "X", "Y", "W"], "col2": [1, 2, 3, 4]})
filtered_df2 = df2[df1['boolean_column']]
# Expected filtered_df2 should be this:
# df2 = pd.DataFrame({"col1": ["Z", "Y"], "col2": [1, 3]})
How can I do the same operation in Spark-Scala in the most time-efficient way?
My current solution is to add "boolean_column" from df1 to df2, then filter df2 by selecting only the rows with a true value in the newly added column and finally removing "boolean_column" from df2, but I'm not sure it is the best solution.
Any suggestion is appreciated.
Edit:
The expected output is a Spark-Scala dataframe (not a list or a column) with the same schema as the second dataframe, and only the subset of rows from df2 that satisfy the boolean mask from the "boolean_column" of df1.
The schema of df2 presented above is just an example. I'm expecting to receive df2 as a parameter, with any number of columns of different (and not fixed) schemas.
You can zip both DataFrames (via their underlying RDDs) and filter on the resulting tuples:
val ints = sparkSession.sparkContext.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
val bools = sparkSession.sparkContext.parallelize(List(true, false, true, false, true, false, true, false, true, false))
val filtered = ints.zip(bools).filter { case (int, bool) => bool }.map { case (int, bool) => int }
println(filtered.collect().toList) //List(1, 3, 5, 7, 9)
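A minimal sketch of applying the same idea to the actual DataFrames (my addition, assuming df1 and df2 as in the question; note that RDD.zip requires both sides to have the same number of partitions and the same number of elements per partition, which two independently built DataFrames do not always guarantee):
// sketch: assumes df1 has a "boolean_column" and both DataFrames line up row for row
import org.apache.spark.sql.Row

// position of the mask column in df1's schema
val maskIdx = df1.schema.fieldIndex("boolean_column")

val filteredRdd = df2.rdd
  .zip(df1.rdd)                                                // pair each df2 row with its df1 row
  .filter { case (_, maskRow) => maskRow.getBoolean(maskIdx) } // keep rows whose mask is true
  .map { case (row, _) => row }

val filteredDf2 = spark.createDataFrame(filteredRdd, df2.schema)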
I managed to solve this with the following code:
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SparkSession}

val spark = SparkSession.builder().appName(sc.appName).master(sc.master).getOrCreate()
val sqlContext = spark.sqlContext

def addColumnIndex(df: DataFrame, sqlContext: SQLContext) = sqlContext.createDataFrame(
  // Add column index
  df.rdd.zipWithIndex.map { case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex) },
  // Create schema
  StructType(df.schema.fields :+ StructField("columnindex", LongType, nullable = false))
)

import spark.implicits._

val DF1 = Seq(
  ("A", true),
  ("B", false),
  ("A", true),
  ("C", false)
).toDF("col1", "boolean_column")

val DF2 = Seq(
  ("Z", 1),
  ("X", 2),
  ("Y", 3),
  ("W", 4)
).toDF("col_1", "col_2")

// Add index
val DF1WithIndex = addColumnIndex(DF1, sqlContext)
val DF2WithIndex = addColumnIndex(DF2, sqlContext)

// Join
val joinDF = DF2WithIndex
  .join(DF1WithIndex, Seq("columnindex"))
  .drop("columnindex", "col1")

// Filter
val filteredDF2 = joinDF.filter(joinDF("boolean_column")).drop("boolean_column")
The filtered dataframe will be the following:
+-----+-----+
|col_1|col_2|
+-----+-----+
| Z| 1|
| Y| 3|
+-----+-----+

What is the order guarantee when joining two columns of a Spark DataFrame which are processed separately?

I have a dataframe with 3 columns:
date
jsonString1
jsonString2
I want to expand the attributes inside the JSON into columns, so I did something like this:
val json1 = spark.read.json(dataframe.select(col("jsonString1")).rdd.map(_.getString(0)))
val json2 = spark.read.json(dataframe.select(col("jsonString2")).rdd.map(_.getString(0)))
val json1Table = json1.selectExpr("id", "status")
val json2Table = json2.selectExpr("name", "address")
Now I want to put these tables together, so I did the following:
val json1TableWithIndex = addColumnIndex(json1Table)
val json2TableWithIndex = addColumnIndex(json2Table)

var finalResult = json1TableWithIndex
  .join(json2TableWithIndex, Seq("columnindex"))
  .drop("columnindex")

def addColumnIndex(df: DataFrame) = spark.createDataFrame(
  df.rdd.zipWithIndex.map { case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex) },
  StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
)
After sampling a few rows I observe that the rows match exactly as in the source dataframe.
I did not find any information on the order guarantee when joining two columns of a dataframe which are processed separately. Is this the right way to solve my problem? Any help is appreciated.
It is always risky to rely on undocumented behaviours, as your code might not work as you intended, because you only have a partial understanding of it.
You can do the same thing in a much more efficient way without any split-and-join approach: use the from_json function to create two nested columns, then flatten out the nested columns, and finally drop the intermediate JSON string columns and nested columns.
Here is an example of the whole process.
import org.apache.spark.sql.types.{StringType, StructType, StructField}
import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._

val df = Seq(
  ("09-02-2020", "{\"id\":\"01\", \"status\":\"Active\"}", "{\"name\":\"Abdullah\", \"address\":\"Jumeirah\"}"),
  ("10-02-2020", "{\"id\":\"02\", \"status\":\"Dormant\"}", "{\"name\":\"Ali\", \"address\":\"Jebel Ali\"}")
).toDF("date", "jsonString1", "jsonString2")
scala> df.show()
+----------+--------------------+--------------------+
| date| jsonString1| jsonString2|
+----------+--------------------+--------------------+
|09-02-2020|{"id":"01", "stat...|{"name":"Abdullah...|
|10-02-2020|{"id":"02", "stat...|{"name":"Ali", "a...|
+----------+--------------------+--------------------+
val schema1 = StructType(Seq(
  StructField("id", StringType, true),
  StructField("status", StringType, true)
))

val schema2 = StructType(Seq(
  StructField("name", StringType, true),
  StructField("address", StringType, true)
))
val dfFlattened = (df
  .withColumn("jsonData1", from_json(col("jsonString1"), schema1))
  .withColumn("jsonData2", from_json(col("jsonString2"), schema2))
  .withColumn("id", col("jsonData1.id"))
  .withColumn("status", col("jsonData1.status"))
  .withColumn("name", col("jsonData2.name"))
  .withColumn("address", col("jsonData2.address"))
  .drop("jsonString1")
  .drop("jsonString2")
  .drop("jsonData1")
  .drop("jsonData2"))
scala> dfFlattened.show()
+----------+---+-------+--------+---------+
| date| id| status| name| address|
+----------+---+-------+--------+---------+
|09-02-2020| 01| Active|Abdullah| Jumeirah|
|10-02-2020| 02|Dormant| Ali|Jebel Ali|
+----------+---+-------+--------+---------+
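A more concise sketch of the same flattening (my addition, assuming the same df, schema1 and schema2 as above): parse both JSON strings and star-expand the struct fields in a single select.
// sketch: assumes the df, schema1 and schema2 defined above
val dfFlattened2 = df
  .withColumn("jsonData1", from_json(col("jsonString1"), schema1))
  .withColumn("jsonData2", from_json(col("jsonString2"), schema2))
  .selectExpr("date", "jsonData1.*", "jsonData2.*")  // expand both structs into top-level columns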

Creating a Spark Vector Column with createDataFrame

I can make a Spark DataFrame with a vector column with the toDF method.
val dataset = Seq((1.0, org.apache.spark.ml.linalg.Vectors.dense(0.0, 10.0, 0.5))).toDF("id", "userFeatures")
scala> dataset.printSchema()
root
|-- id: double (nullable = false)
|-- userFeatures: vector (nullable = true)
scala> dataset.schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(id,DoubleType,false), StructField(userFeatures,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
I'm not sure how to create a vector column with the createDataFrame method. There isn't a VectorType type in org.apache.spark.sql.types.
This doesn't work:
val rows = spark.sparkContext.parallelize(
  List(
    Row(1.0, Vectors.dense(1.0, 2.0))
  )
)

val schema = List(
  StructField("id", DoubleType, true),
  StructField("features", new org.apache.spark.ml.linalg.VectorUDT, true)
)

val df = spark.createDataFrame(
  rows,
  StructType(schema)
)

df.show()
df.printSchema()
To create a Spark vector column with createDataFrame, you can use the following code:
val rows = spark.sparkContext.parallelize(
  List(
    Row(1.0, org.apache.spark.mllib.linalg.Vectors.dense(1.0, 2.0))
  )
)

val schema = List(
  StructField("id", DoubleType, true),
  StructField("features", new org.apache.spark.mllib.linalg.VectorUDT, true)
)

val df = spark.createDataFrame(
  rows,
  StructType(schema)
)

df.show()
+---+---------+
| id| features|
+---+---------+
|1.0|[1.0,2.0]|
+---+---------+
df.printSchema()
root
|-- id: double (nullable = true)
|-- features: vector (nullable = true)
The actual issue was the incompatible type: org.apache.spark.ml.linalg.Vectors.dense produces a value that is not a valid external type for the vector schema used here, so we have to switch to the mllib package instead of the ml package.
I hope it helps!
Note: I am using Spark v2.3.0. Also, class VectorUDT in package linalg cannot be accessed in package org.apache.spark.ml.linalg.
For reference - https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib
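If you prefer to stay on the newer ml package, a minimal sketch (my addition, not part of the original answer; it assumes Spark 2.x+, where org.apache.spark.ml.linalg.SQLDataTypes exposes the vector type publicly):
// sketch: assumes Spark 2.x+ and an active SparkSession `spark`
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

val rows = spark.sparkContext.parallelize(
  List(Row(1.0, Vectors.dense(1.0, 2.0)))
)

val schema = StructType(List(
  StructField("id", DoubleType, true),
  StructField("features", SQLDataTypes.VectorType, true)  // public handle to the ml vector UDT
))

val df = spark.createDataFrame(rows, schema)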

Scala DataFrame: Explode an array

I am using the Spark libraries in Scala. I have created a DataFrame using
val searchArr = Array(
  StructField("log", IntegerType, true),
  StructField("user", StructType(Array(
    StructField("date", StringType, true),
    StructField("ua", StringType, true),
    StructField("ui", LongType, true))), true),
  StructField("what", StructType(Array(
    StructField("q1", ArrayType(IntegerType, true), true),
    StructField("q2", ArrayType(IntegerType, true), true),
    StructField("sid", StringType, true),
    StructField("url", StringType, true))), true),
  StructField("where", StructType(Array(
    StructField("o1", IntegerType, true),
    StructField("o2", IntegerType, true))), true)
)

val searchSt = new StructType(searchArr)

val searchData = sqlContext.jsonFile(searchPath, searchSt)
I now want to explode the field what.q1, which should contain an array of integers, but the documentation is limited:
http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#explode(java.lang.String,%20java.lang.String,%20scala.Function1,%20scala.reflect.api.TypeTags.TypeTag)
So far I have tried a few things without much luck:
val searchSplit = searchData.explode("q1", "rb")(q1 => q1.getList[Int](0).toArray())
Any ideas/examples of how to use explode on an array?
Did you try a UDF on the field "what"? Something like this could be useful:
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.functions.{col, udf}

val explode = udf {
  (aStr: GenericRowWithSchema) =>
    aStr match {
      case null => ""
      case _    => aStr.getList(0).get(0).toString()
    }
}

val newDF = df.withColumn("newColumn", explode(col("what")))
where:
getList(0) returns "q1" field
get(0) returns the first element of "q1"
I'm not sure but you could try to use getAs[T](fieldName: String) instead of getList(index: Int).
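A sketch along those lines (my own untested variant, not from the original answer): take the struct in as a Row and use getAs to pull out q1.
// sketch: assumes the "what" struct is passed to the UDF as a Row
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

val firstQ1 = udf { (what: Row) =>
  Option(what)
    .flatMap(w => Option(w.getAs[Seq[Int]]("q1")))  // null-safe access to the q1 array
    .flatMap(_.headOption)                          // its first element, if any
    .map(_.toString)
    .getOrElse("")
}

val newDF = df.withColumn("newColumn", firstQ1(col("what")))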
I'm not used to Scala, but in Python/PySpark the array-type column nested within a struct-type field can be exploded as follows. If it works for you, you can convert it to the corresponding Scala representation.
from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, IntegerType, LongType, StringType, StructField, StructType

schema = StructType([
    StructField("log", IntegerType()),
    StructField("user", StructType([
        StructField("date", StringType()),
        StructField("ua", StringType()),
        StructField("ui", LongType())])),
    StructField("what", StructType([
        StructField("q1", ArrayType(IntegerType())),
        StructField("q2", ArrayType(IntegerType())),
        StructField("sid", StringType()),
        StructField("url", StringType())])),
    StructField("where", StructType([
        StructField("o1", IntegerType()),
        StructField("o2", IntegerType())]))
])

data = [((1), ("2022-01-01", "ua", 1), ([1, 2, 3], [6], "sid", "url"), (7, 8))]
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
Output:
+---+-------------------+--------------------------+------+
|log|user |what |where |
+---+-------------------+--------------------------+------+
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|
+---+-------------------+--------------------------+------+
With what.q1 exploded:
df.withColumn("what.q1_exploded", explode(col("what.q1"))).show(truncate=False)
Output:
+---+-------------------+--------------------------+------+----------------+
|log|user |what |where |what.q1_exploded|
+---+-------------------+--------------------------+------+----------------+
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|1 |
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|2 |
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|3 |
+---+-------------------+--------------------------+------+----------------+
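For the Scala side, a minimal sketch of the same idea (my assumption, using the searchData DataFrame with the schema from the question; explode comes from org.apache.spark.sql.functions):
// sketch: assumes searchData was built with the schema shown in the question
import org.apache.spark.sql.functions.{col, explode}

// one output row per element of what.q1, keeping the other columns
val searchExploded = searchData.select(col("*"), explode(col("what.q1")).as("q1_exploded"))
searchExploded.show()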