I have a data quality class that performs checks on a DataFrame. I use methods defined in this class to run the checks (they always return 3-tuples). These methods are called by a UDF that I want to invoke from another DataFrame:
#F.udf(StructType())
def dq_check_wrapper(df, col, _test):
    if _test == 'is_null':
        return Valid_df(df).is_not_null(col).execute()
    elif _test == 'unique':
        return Valid_df(df).is_unique(col).execute()
Say I want to assess the DQ on this df:
df = spark.createDataFrame(
    [
        (None, 128.0, 1), (110, 127.0, 2), (111, 127.0, 3), (111, 127.0, 4),
        (111, 126.0, 5), (111, 127.0, 6), (109, 126.0, 7), (111, 126.0, 1001),
        (114, 126.0, 1003), (115, 83.0, 1064), (116, 127.0, 1066)
    ],
    ['HR', 'maxABP', 'Second']
)
To make it dynamic, I want to use a metadata df:
metadata = sqlContext.sql("select 'HR' as col, 'is_null' as dq_check")
+---+--------+
|col|dq_check|
+---+--------+
| HR| is_null|
+---+--------+
But then, when I try:
metadata \
    .withColumn("valid_dq", dq_check_wrapper(df, metadata.col, metadata.dq_check)) \
    .show()
I get a TypeError:
TypeError: Invalid argument, not a string or column: DataFrame[HR: bigint, maxABP: double, Second: bigint] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
Why?
Because, if I don't declare that df is a DataFrame, it gets inferred as a string:
def dq_check_wrapper(df: DataFrame, col, _test):
Related
I'm performing an aggregation on a DataFrame to calculate percentages. I need to store the sum of each column in a separate variable so that I can use it as the divisor when computing the percentage.
val sumOfCol1 = df.agg(round(sum("col1"),2))
This code computes the sum, but the result is stored as a DataFrame, so it cannot be used for division. Its type is:
sumOfCol1: org.apache.spark.sql.DataFrame = [round(sum(col1), 2): double]
How can I store it as a constant or a Double value so that I can use it in a later stage of the aggregation?
To access the actual value in a DataFrame as a Double, you need to collect the DataFrame to the driver using collect. The function returns an array with all the rows; see the documentation.
Since you have a DataFrame, it will contain Row objects, and you have to use getAs to access the actual underlying values. A more intuitive way is to first convert to a Dataset and then collect:
val sumOfCol1 = df.agg(round(sum("col1"),2)).as[Double].collect()(0)
In this case since you only want a single value, you can also use the first method:
val sumOfCol1 = df.agg(round(sum("col1"),2)).as[Double].first
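For reference, the Row/getAs route mentioned above looks like this (a sketch against the same df, using round and sum from org.apache.spark.sql.functions; the aggregated value is a double, as the printed type shows):

// collect() brings back an Array[Row]; read the single field of the first Row as a Double
val sumOfCol1 = df.agg(round(sum("col1"), 2)).collect()(0).getAs[Double](0)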
First let's create a data frame:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val schema = List(
  StructField("col1", IntegerType, true),
  StructField("col2", IntegerType, true),
  StructField("col3", IntegerType, true)
)

val data = Seq(Row(10, 100, 1000), Row(20, 200, 2000), Row(30, 300, 3000))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 10| 100|1000|
| 20| 200|2000|
| 30| 300|3000|
+----+----+----+
Now we have the data frame.
We can use pattern matching when assigning values to collect the desired results. Since df.first() returns a Row object, we can do something like this:
val cols = df.columns.toList
val sums = cols.map(c => round(sum(c),2).cast("double"))
val Row(sumCol1: Double, sumCol2: Double, sumCol3: Double) = df.groupBy().agg(sums.head, sums.tail:_*).first()
sumCol1: Double = 60.0
sumCol2: Double = 600.0
sumCol3: Double = 6000.0
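To close the loop on the original goal (percentages), the extracted Doubles can be used directly inside column expressions; the col1_pct column below is purely illustrative:

import org.apache.spark.sql.functions.{col, round}

// sumCol1 is a plain Double now, so it can sit inside a column expression
val withPct = df.withColumn("col1_pct", round(col("col1") / sumCol1 * 100, 2))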
I have the following table
DEST_COUNTRY_NAME    ORIGIN_COUNTRY_NAME    count
United States        Romania                   15
United States        Croatia                    1
United States        Ireland                  344
Egypt                United States             15
The table is represented as a Dataset.
scala> dataDS
res187: org.apache.spark.sql.Dataset[FlightData] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
The schema of dataDS is
scala> dataDS.printSchema;
root
|-- DEST_COUNTRY_NAME: string (nullable = true)
|-- ORIGIN_COUNTRY_NAME: string (nullable = true)
|-- count: integer (nullable = true)
I want to sum all the values of the count column. I suppose I can do it using the reduce method of Dataset.
I thought I could do the following, but I got an error:
scala> (dataDS.select(col("count"))).reduce((acc,n)=>acc+n);
<console>:38: error: type mismatch;
found : org.apache.spark.sql.Row
required: String
(dataDS.select(col("count"))).reduce((acc,n)=>acc+n);
^
To make the code work, I had to explicitly specify that count is an Int, even though the schema already defines it as an Int:
scala> (dataDS.select(col("count").as[Int])).reduce((acc,n)=>acc+n);
Why did I have to explicitly specify the type of count? Why didn't Scala's type inference work? In fact, the schema of the intermediate Dataset also shows count as an Int:
dataDS.select(col("count")).printSchema;
root
|-- count: integer (nullable = true)
I think you need to do it another way. I will assume FlightData is a case class with the above schema. The solution is to use map and reduce, as below:
val totalSum = dataDS.map(_.count).reduce(_ + _) // this avoids the error above: map(_.count) gives a Dataset[Int], whereas select(col("count")) gives untyped Rows
Updated: the inference issue is not related to the Dataset itself. When you use select, you end up working on a DataFrame (the same happens with join), which does not have a statically typed schema, so you lose the typing of your case class. For example, the result of select is a DataFrame, not a Dataset, so the element type cannot be inferred:
val x: DataFrame = dataDS.select('count)
val x: Dataset[Int] = dataDS.map(_.count)
Also, from this answer: to get a TypedColumn from a Column, you simply use myCol.as[T].
I did a simple example to reproduce the code and the data:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object EntryMainPoint extends App {
  //val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("SparkSessionZipsExample")
    //.config("spark.sql.warehouse.dir", warehouseLocation)
    .getOrCreate()

  val someData = Seq(
    Row("United States", "Romania", 15),
    Row("United States", "Croatia", 1),
    Row("United States", "Ireland", 344),
    Row("Egypt", "United States", 15)
  )

  val flightDataSchema = List(
    StructField("DEST_COUNTRY_NAME", StringType, true),
    StructField("ORIGIN_COUNTRY_NAME", StringType, true),
    StructField("count", IntegerType, true)
  )

  case class FlightData(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: Int)

  import spark.implicits._

  val dataDS = spark.createDataFrame(
    spark.sparkContext.parallelize(someData),
    StructType(flightDataSchema)
  ).as[FlightData]

  val totalSum = dataDS.map(_.count).reduce(_ + _) // map(_.count) gives a Dataset[Int], avoiding the reduce-on-Row error above

  println("totalSum = " + totalSum)

  dataDS.printSchema()
  dataDS.show()
}
Output below
totalSum = 375
root
|-- DEST_COUNTRY_NAME: string (nullable = true)
|-- ORIGIN_COUNTRY_NAME: string (nullable = true)
|-- count: integer (nullable = true)
+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
| United States| Romania| 15|
| United States| Croatia| 1|
| United States| Ireland| 344|
| Egypt| United States| 15|
+-----------------+-------------------+-----+
Note: you can select from the Dataset in the following way:
val countColumn = dataDS.select('count) //or map(_.count)
You can also have a look at reduceByKey in Spark Dataset.
Just follow the types or look at the compiler messages.
You start with Dataset[FlightData].
You call its select with col("count") as an argument. col(_) returns a Column.
The only variant of Dataset.select that takes a Column returns a DataFrame, which is an alias for Dataset[Row].
There are two variants of Dataset.reduce: one taking a ReduceFunction[T] and the other a (T, T) => T, where T is the type parameter of the Dataset, i.e. Dataset[T]. The (acc, n) => acc + n function is a Scala anonymous function, hence the second variant applies.
Expanded:
(dataDS.select(col("count")): Dataset[Row]).reduce((acc: Row, n: Row) => acc + n): Row
which constrains the function to take two Rows and return a Row.
Row has no + method, so the only option to satisfy
(acc: ???, n: Row) => acc + n
is to make acc a String (you can + an Any onto a String). However, this doesn't satisfy the complete expression, hence the error.
You've already figured out that you can use
dataDS.select(col("count").as[Int]).reduce((acc, n) => acc + n)
where col("count").as[Int] is a TypedColumn[Row, Int] and the corresponding select returns a Dataset[Int].
Similarly, you could use
dataDS.select(col("count")).as[Int].reduce((acc, n) => acc + n)
and
dataDS.toDF.map(_.getAs[Int]("count")).reduce((acc, n) => acc + n)
In all cases,
.reduce((acc, n) => acc + n)
is simply (Int, Int) => Int.
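As an aside, if the goal is just the total rather than exercising reduce, the same sum can be computed through the untyped aggregation API and read out of the single result Row (a sketch, assuming the dataDS above; sum over an integer column yields a Long):

import org.apache.spark.sql.functions.sum

// agg returns a one-row DataFrame; getLong(0) reads its only field
val total: Long = dataDS.agg(sum("count")).first().getLong(0)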
I have a sample dataframe as follows:
val df = Seq((Seq("abc", "cde"), 19, "red, abc"), (Seq("eefg", "efa", "efb"), 192, "efg, efz efz")).toDF("names", "age", "color")
And a user-defined function, as follows, which replaces the "color" column in df with its string length:
def strLength(inputString: String): Long = inputString.size.toLong
I am saving the udf reference for performance as follows:
val strLengthUdf = udf(strLength _)
And when I try to process the udf while performing the select it works if I don't have any other column names:
val x = df.select(strLengthUdf(df("color")))
scala> x.show
+----------+
|UDF(color)|
+----------+
| 8|
| 12|
+----------+
But when I want to pick other columns along with the udf processed column, I get the following error:
scala> val x = df.select("age", strLengthUdf(df("color")))
<console>:27: error: overloaded method value select with alternatives:
[U1, U2](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1], c2: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U2])org.apache.spark.sql.Dataset[(U1, U2)] <and>
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (String, org.apache.spark.sql.Column)
val x = df.select("age", strLengthUdf(df("color")))
^
What am I missing in val x = df.select("age", strLengthUdf(df("color")))?
You cannot mix Strings and Columns in a select statement.
This will work:
df.select(df("age"), strLengthUdf(df("color")))
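Equivalently, every argument can be passed as a Column using col() or, with spark.implicits._ in scope, the $ string interpolator (a sketch of the same select):

import org.apache.spark.sql.functions.col

df.select(col("age"), strLengthUdf(col("color")))
// or, with spark.implicits._ imported:
df.select($"age", strLengthUdf($"color"))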
I want to make changes to a column in the DataFrame. The column is an array of integers. I want to replace elements of the array, taking the index from one array and the replacement element from a third array. Example: I have three columns C1, C2, C3, all arrays. I want to replace elements in C3 as follows:
C3[C2[i]] = C1[i].
I wrote the following UDF:
def UpdateHist2 = udf((CRF_count: Seq[Long], Day: Seq[String], History: Seq[Int]) =>
  for (i <- 0 to Day.length - 1) {
    History.updated(Day(i).toInt - 1, CRF_count(i).toInt)
  }
)
and executed this:
histdate3.withColumn("History2", UpdateHist2(col("CRF_count"), col("Day"), col("History"))).show()
But its returning an error as below:
scala> histdate3.withColumn("History2", UpdateHist2(col("CRF_count"), col("Day"), col("History"))).show()
java.lang.UnsupportedOperationException: Schema for type Unit is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:733)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:671)
at org.apache.spark.sql.functions$.udf(functions.scala:3100)
at UpdateHist2(:25)
... 48 elided
I think I'm returning a different type, perhaps a view type, which is not supported. Please help me figure out how to solve this.
Your for loop returns Unit, hence the error message. You could use for/yield to return values (see the small illustration after the example below), but since the Seq needs to be updated successively, a simple foldLeft works better:
import org.apache.spark.sql.functions._
val df = Seq(
  (Seq(101L, 102L), Seq("1", "2"), Seq(11, 12)),
  (Seq(201L, 202L, 203L), Seq("2", "3"), Seq(21, 22, 23))
).toDF("C1", "C2", "C3")
// +---------------+------+------------+
// |C1 |C2 |C3 |
// +---------------+------+------------+
// |[101, 102] |[1, 2]|[11, 12] |
// |[201, 202, 203]|[2, 3]|[21, 22, 23]|
// +---------------+------+------------+
def updateC3 = udf( (c1: Seq[Long], c2: Seq[String], c3: Seq[Int]) =>
  c2.foldLeft( c3 ){ (acc, i) =>
    val idx = i.toInt - 1
    acc.updated(idx, c1(idx).toInt)
  }
)
df.withColumn("C3", updateC3($"C1", $"C2", $"C3")).show(false)
// +---------------+------+--------------+
// |C1 |C2 |C3 |
// +---------------+------+--------------+
// |[101, 102] |[1, 2]|[101, 102] |
// |[201, 202, 203]|[2, 3]|[21, 202, 203]|
// +---------------+------+--------------+
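To make the Unit point concrete, here is a tiny standalone illustration (not part of the pipeline above): a for loop without yield is just a side-effecting statement, while adding yield turns it back into a collection Spark can map to a SQL type:

// without yield the loop's type is Unit, which has no Spark SQL schema
val noYield: Unit = for (i <- 0 to 2) { i + 1 }

// with yield the same loop produces a Seq[Int], which maps to an array column
val withYield: Seq[Int] = for (i <- 0 to 2) yield i + 1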
I am converting a Spark DataFrame to RDD[Row] so I can map it to the final schema for writing into a Hive ORC table. I want to convert any blank value in the input to an actual null, so that the Hive table stores a real null instead of an empty string.
Input DataFrame (a single column with pipe-delimited values):
col1
1|2|3||5|6|7|||...|
My code:
inputDF.rdd.
  map { x: Row => x.get(0).asInstanceOf[String].split("\\|", -1) }.
  map { x => Row(nullConverter(x(0)), nullConverter(x(1)), nullConverter(x(2)) .... nullConverter(x(200))) }

def nullConverter(input: String): String = {
  if (input.trim.length > 0) input.trim
  else null
}
Is there any clean way of doing this, rather than calling the nullConverter function 200 times?
Update based on single column:
Going with your approach, I will do something like:
inputDf.rdd.map((row: Row) => {
  val values = row.get(0).asInstanceOf[String].split("\\|", -1).map(nullConverter)
  Row(values: _*)
})
Make your nullConverter or any other logic a udf:
import org.apache.spark.sql.functions._
val nullConverter = udf((input: String) => {
  if (input.trim.length > 0) input.trim
  else null
})
Now, use the udf on your df and apply to all columns:
val convertedDf = inputDf.select(inputDf.columns.map(c => nullConverter(col(c)).alias(c)):_*)
Now, you can do your RDD logic.
This would be easier to do using the DataFrame API before converting to an RDD. First, split the data:
val df = Seq(("1|2|3||5|6|7|8||")).toDF("col0") // Example dataframe
val df2 = df.withColumn("col0", split($"col0", "\\|")) // Split on "|"
Then find out the length of the array:
val numCols = df2.first.getAs[Seq[String]](0).length
Now, for each element in the array, use the nullConverter UDF and then assign it to its own column.
val nullConverter = udf((input: String) => {
  if (input.trim.length > 0) input.trim
  else null
})
val df3 = df2.select((0 until numCols).map(i => nullConverter($"col0".getItem(i)).as("col" + i)): _*)
The result using the example dataframe:
+----+----+----+----+----+----+----+----+----+----+
|col0|col1|col2|col3|col4|col5|col6|col7|col8|col9|
+----+----+----+----+----+----+----+----+----+----+
| 1| 2| 3|null| 5| 6| 7| 8|null|null|
+----+----+----+----+----+----+----+----+----+----+
Now convert it to an RDD or continue using the data as a DataFrame depending on your needs.
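For instance, the RDD route from here is a one-liner (a sketch, with df3 as built above):

// each Row now carries col0..col9 with real nulls in place of the blank fields
val asRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = df3.rdd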
There is no point in converting the DataFrame to an RDD:
from pyspark.sql.functions import col, regexp_replace

df = sc.parallelize([
    (1, "foo bar"), (2, "foobar "), (3, " ")
]).toDF(["k", "v"])

df.select([regexp_replace(col(c), " ", "NULL").alias(c) for c in df.columns])