How to find the size of a dataframe in PySpark

How can I replicate this code to get the dataframe size in pyspark?
scala> val df = spark.range(10)
scala> print(spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats)
Statistics(sizeInBytes=80.0 B, hints=none)
What I would like to do is get the sizeInBytes value into a variable.

In Spark 2.4 you can do
df = spark.range(10)
df.createOrReplaceTempView('myView')
spark.sql('explain cost select * from myView').show(truncate=False)
|== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(8)), Statistics(sizeInBytes=80.0 B, hints=none)
In Spark 3.0.0-preview2 you can use explain with the cost mode:
df = spark.range(10)
df.explain(mode='cost')
== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(8)), Statistics(sizeInBytes=80.0 B)

See if this helps.
Read the JSON file source and compute stats such as size in bytes, number of rows, etc. These stats also help Spark make intelligent decisions while optimizing the execution plan. This code should be the same in PySpark too.
/**
* file content
* spark-test-data.json
* --------------------
* {"id":1,"name":"abc1"}
* {"id":2,"name":"abc2"}
* {"id":3,"name":"abc3"}
*/
val fileName = "spark-test-data.json"
val path = getClass.getResource("/" + fileName).getPath
spark.catalog.createTable("df", path, "json")
.show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |abc1|
* |2 |abc2|
* |3 |abc3|
* +---+----+
*/
// Collect only statistics that do not require scanning the whole table (that is, size in bytes).
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)
/**
* +----------+---------+-------+
* |col_name |data_type|comment|
* +----------+---------+-------+
* |Statistics|68 bytes | |
* +----------+---------+-------+
*/
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)
/**
* +----------+----------------+-------+
* |col_name |data_type |comment|
* +----------+----------------+-------+
* |Statistics|68 bytes, 3 rows| |
* +----------+----------------+-------+
*/
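If you need the size as a value in a variable rather than as console output, a minimal sketch (assuming the df table has already been analyzed as above) is to filter the DESCRIBE EXTENDED result and parse the Statistics cell:
// sketch: pull the "Statistics" row produced above and parse the leading byte count
val statsCell = spark.sql("DESCRIBE EXTENDED df")
  .filter(col("col_name") === "Statistics")
  .select("data_type")
  .head()                                        // throws if the table has no statistics yet
  .getString(0)                                  // e.g. "68 bytes, 3 rows"
val sizeInBytes = statsCell.split(" ")(0).toLong // e.g. 68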
More info: the Databricks SQL docs.

Typically, you can access the Scala methods through Py4J. I just tried this in the pyspark shell:
>>> spark._jsparkSession.sessionState().executePlan(df._jdf.queryExecution().logical()).optimizedPlan().stats().sizeInBytes()
716

Related

Get average length of values of a column (from a hive table) in spark along with data types

Task: get the data types of a table (in Hive) and the average length of the values of each column.
I'm trying to do the above task in Spark using Scala.
First, I did
val table = spark.sql("desc table")
The output has three columns, col_name, datatype, comment.
And then, I tried to get only the column values as a comma-separated string.
val col_string = table.select("col_name").rdd.map(i => "avg(length(trim("+i(0).toString+")))").collect.mkString(", ")
Now I can use this string in another query to get the average length of all columns, as given below, but the output dataframe has as many columns as the table, and I don't know how to join it with the table dataframe.
val tbl_length = spark.sql("select " + col_string + " from schema.table")
I've looked at transposing the second dataframe, but that does not look efficient and is hard for me to grasp as a beginner in Spark and Scala.
Is my method above a good/efficient one? If there is a better way, please suggest it.
Even if there is a better way, can you also please explain how I can join two such datasets of rows to columns?
Input table:
col1| col2| col3
Ac| 123| 0
Defg| 23456| 0
Expected output
column_name| data_type| avg_length
col1| String| 3
col2| Int| 4
col3| Int| 1
Try this:
// the column functions and toDF below need these imports (already available in spark-shell)
import org.apache.spark.sql.functions._
import spark.implicits._
val table = spark.catalog.getTable("df")
val df = spark.sql(s"select * from ${table.name}")
df.show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |abc1|
* |2 |abc2|
* |3 |abc3|
* +---+----+
*/
val aggs = df.columns.map(f => avg(length(trim(col(f)))).as(f))
val values = df.agg(aggs.head, aggs.tail: _*).head.getValuesMap[Double](df.columns).values.toSeq
df.schema.map(sf => (sf.name, sf.dataType)).zip(values).map{ case ((name, dt), value) => (name, dt.simpleString, value)}
.toDF("column_name", "data_type", "avg_length")
.show(false)
/**
* +-----------+---------+----------+
* |column_name|data_type|avg_length|
* +-----------+---------+----------+
* |id |bigint |1.0 |
* |name |string |4.0 |
* +-----------+---------+----------+
*/

spark - mimic pyspark asDict() for scala without using case class

PySpark allows you to create a dictionary when a single row is returned from the dataframe, using the approach below.
t=spark.sql("SET").withColumn("rw",expr("row_number() over(order by key)")).collect()[0].asDict()
print(t)
print(t["key"])
print(t["value"])
print(t["rw"])
print("Printing using for comprehension")
[print(t[i]) for i in t ]
Results:
{'key': 'spark.app.id', 'value': 'local-1594577194330', 'rw': 1}
spark.app.id
local-1594577194330
1
Printing using for comprehension
spark.app.id
local-1594577194330
1
I'm trying the same in Scala Spark. It is possible using the case class approach.
case class download(key:String, value:String,rw:Long)
val t=spark.sql("SET").withColumn("rw",expr("row_number() over(order by key)")).as[download].first
println(t)
println(t.key)
println(t.value)
println(t.rw)
Results:
download(spark.app.id,local-1594580739413,1)
spark.app.id
local-1594580739413
1
In the actual problem, I have 200+ columns and don't want to use the case class approach. I'm trying something like the code below to avoid the case class option.
val df =spark.sql("SET").withColumn("rw",expr("row_number() over(order by key)"))
(df.columns).zip(df.take(1)(0))
but I am getting an error:
<console>:28: error: type mismatch;
found : (String, String, Long)
required: Iterator[?]
(df.columns.toIterator).zip(df.take(1)(0))
Is there a way to solve this?
In Scala, there is a method getValuesMap that converts a Row into a Map[columnName: String, columnValue: T].
Try using it as below:
val df =spark.sql("SET").withColumn("rw",expr("row_number() over(order by key)"))
df.show(false)
df.printSchema()
/**
* +----------------------------+-------------------+---+
* |key |value |rw |
* +----------------------------+-------------------+---+
* |spark.app.id |local-1594644271573|1 |
* |spark.app.name |TestSuite |2 |
* |spark.driver.host |192.168.1.3 |3 |
* |spark.driver.port |58420 |4 |
* |spark.executor.id |driver |5 |
* |spark.master |local[2] |6 |
* |spark.sql.shuffle.partitions|2 |7 |
* +----------------------------+-------------------+---+
*
* root
* |-- key: string (nullable = false)
* |-- value: string (nullable = false)
* |-- rw: integer (nullable = true)
*/
val map = df.head().getValuesMap(df.columns)
println(map)
println(map("key"))
println(map("value"))
println(map("rw"))
println("Printing using for comprehension")
map.foreach(println)
/**
* Map(key -> spark.app.id, value -> local-1594644271573, rw -> 1)
* spark.app.id
* local-1594644271573
* 1
* Printing using for comprehension
* (key,spark.app.id)
* (value,local-1594644271573)
* (rw,1)
*/
The problem is that zip is a method on a collection which can only take another collection object which implements IterableOnce, and df.take(1)(0) is a Spark SQL Row, which doesn't fall into that category.
Try converting the row to a Seq using its toSeq method.
df.columns.zip(df.take(1)(0).toSeq)
result:
Array((key,spark.app.id), (value,local-1594577194330), (rw,1))
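To get something closer to asDict(), i.e. an actual Map keyed by column name, a minimal sketch (using the same df as above) just converts those zipped pairs to a Map:
// sketch: turn the (columnName, value) pairs into a Map, mimicking pyspark's asDict()
val t: Map[String, Any] = df.columns.zip(df.take(1)(0).toSeq).toMap
println(t("key"))   // spark.app.id
println(t("value")) // e.g. local-1594577194330
println(t("rw"))    // 1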

Better Alternatives to EXCEPT Spark Scala

I have been told that EXCEPT is a very costly operation and one should always try to avoid using EXCEPT.
My Use Case -
val myFilter = "rollNo='11' AND class='10'"
val rawDataDf = spark.table(<table_name>)
val myFilteredDataframe = rawDataDf.where(myFilter)
val allOthersDataframe = rawDataDf.except(myFilteredDataframe)
But I am confused: in such a use case, what are my alternatives?
Use a left anti join, as below:
val df = spark.range(2).withColumn("name", lit("foo"))
df.show(false)
df.printSchema()
/**
* +---+----+
* |id |name|
* +---+----+
* |0 |foo |
* |1 |foo |
* +---+----+
*
* root
* |-- id: long (nullable = false)
* |-- name: string (nullable = false)
*/
val df2 = df.filter("id=0")
df.join(df2, df.columns.toSeq, "leftanti")
.show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |foo |
* +---+----+
*/
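Applied to the use case in the question, a sketch along the same lines (assuming the rawDataDf and myFilter values shown above) would be:
// sketch: replace except() with a left anti join on all columns
// note: except() also de-duplicates rows, while a left anti join keeps duplicates from the left side
val myFilteredDataframe = rawDataDf.where(myFilter)
val allOthersDataframe = rawDataDf.join(myFilteredDataframe, rawDataDf.columns.toSeq, "leftanti")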

Spark/Scala: use string variable in conditional expressions in DataFrame operations

Let me explain this with an example. Starting with the following dataframe
val df = Seq((1, "CS", 0, Array(0.1, 0.2, 0.4, 0.5)),
(4, "Ed", 0, Array(0.4, 0.8, 0.3, 0.6)),
(7, "CS", 0, Array(0.2, 0.5, 0.4, 0.7)),
(101, "CS", 1, Array(0.5, 0.7, 0.3, 0.8)),
(5, "CS", 1, Array(0.4, 0.2, 0.6, 0.9))).toDF("id", "dept", "test", "array")
df.show()
+---+----+----+--------------------+
| id|dept|test| array|
+---+----+----+--------------------+
| 1| CS| 0|[0.1, 0.2, 0.4, 0.5]|
| 4| Ed| 0|[0.4, 0.8, 0.3, 0.6]|
| 7| CS| 0|[0.2, 0.5, 0.4, 0.7]|
|101| CS| 1|[0.5, 0.7, 0.3, 0.8]|
| 5| CS| 1|[0.4, 0.2, 0.6, 0.9]|
+---+----+----+--------------------+
Consider the following two common operations as examples (but it does not have to be limited to them):
import org.apache.spark.sql.functions._ // for `when`
val dfFilter1 = df.where($"dept" === "CS")
val dfFilter3 = df.withColumn("category", when($"dept" === "CS" && $"id" === 101, 10).otherwise(0))
Now, I have a string variable colName = "dept", and it is required that $"dept" in the previous operations be replaced by colName in some form to achieve the same functionality. I managed to achieve the first one as follows:
val dfFilter2 = df.where(s"${colName} = 'CS'")
But a similar operation fails in the second case:
val dfFilter4 = df.withColumn("category", when(s"${colName} = 'CS'" && $"id" === 101, 10).otherwise(0))
Specifically it gives the following error:
Name: Unknown Error
Message: <console>:35: error: value && is not a member of String
val dfFilter4 = df.withColumn("category", when(s"${colName} = 'CS'" && $"id" === 101, 10).otherwise(0))
My understanding so far is that after I use s"${variable}" to deal with a variable, everything becomes a pure string, and it is difficult to involve logical operations.
So, my questions are:
1. What is the best way to use such a string variable as colName for operations similar to the two I listed above? (I also do not like the solution I have for .where().)
2. Are there any general guidelines for using such a string variable in more general operations beyond the two examples here? (I always feel that it is very case-specific when I deal with string-related operations.)
You can use the expr function as:
val dfFilter4 = df.withColumn("category", when(expr(s"${colName} = 'CS' and id = 101"), 10).otherwise(0))
Reason for the error:
The where function works when given a string query, as follows,
val dfFilter2 = df.where(s"${colName} = 'CS'")
because there are supporting APIs for both String and Column:
/**
* Filters rows using the given condition. This is an alias for filter.
* {{{
* // The following are equivalent:
* peopleDs.filter($"age" > 15)
* peopleDs.where($"age" > 15)
* }}}
*
* @group typedrel
* @since 1.6.0
*/
def where(condition: Column): Dataset[T] = filter(condition)
and
/**
* Filters rows using the given SQL expression.
* {{{
* peopleDs.where("age > 15")
* }}}
*
* @group typedrel
* @since 1.6.0
*/
def where(conditionExpr: String): Dataset[T] = {
filter(Column(sparkSession.sessionState.sqlParser.parseExpression(conditionExpr)))
}
But there is only one API for the when function, supporting only the Column type:
/**
* Evaluates a list of conditions and returns one of multiple possible result expressions.
* If otherwise is not defined at the end, null is returned for unmatched conditions.
*
* {{{
* // Example: encoding gender string column into integer.
*
* // Scala:
* people.select(when(people("gender") === "male", 0)
* .when(people("gender") === "female", 1)
* .otherwise(2))
*
* // Java:
* people.select(when(col("gender").equalTo("male"), 0)
* .when(col("gender").equalTo("female"), 1)
* .otherwise(2))
* }}}
*
* @group normal_funcs
* @since 1.4.0
*/
def when(condition: Column, value: Any): Column = withExpr {
CaseWhen(Seq((condition.expr, lit(value).expr)))
}
So you cannot use a string SQL query with the when function.
The correct way of doing it is as follows:
val dfFilter4 = df.withColumn("category", when(col(s"${colName}") === "CS" && $"id" === 101, 10).otherwise(0))
or, in short, as
val dfFilter4 = df.withColumn("category", when(col(colName) === "CS" && col("id") === 101, 10).otherwise(0))
What is the best way to use such a string variable as colName for operations similar to the two I listed above?
You can use the col function from org.apache.spark.sql.functions:
import org.apache.spark.sql.functions._
val colName = "dept"
For dfFilter2
val dfFilter2 = df.where(col(colName) === "CS")
For dfFilter4
val dfFilter4 = df.withColumn("category", when(col(colName) === "CS" && $"id" === 101, 10).otherwise(0))

Spark Dataframe GroupBy and compute Complex aggregate function

Using a Spark dataframe, I need to compute a percentage using the complex formula below:
Group by "KEY" and calculate "re_pcnt" as ( sum(SA) / sum( SA / (PCT/100) ) ) * 100
For instance, the input dataframe is:
val values1 = List(List("01", "20000", "45.30"), List("01", "30000", "45.30"))
.map(row => (row(0), row(1), row(2)))
val DS1 = values1.toDF("KEY", "SA", "PCT")
DS1.show()
+---+-----+-----+
|KEY| SA| PCT|
+---+-----+-----+
| 01|20000|45.30|
| 01|30000|45.30|
+---+-----+-----+
Expected result:
+---+---------------+
|KEY|        re_pcnt|
+---+---------------+
| 01| 45.30000038505|
+---+---------------+
I have tried to calculate it as below:
val result = DS1.groupBy("KEY").agg(((sum("SA").divide(
sum(
("SA").divide(
("PCT").divide(100)
)
)
)) * 100).as("re_pcnt"))
But I am facing: Error:(36, 16) value divide is not a member of String ("SA").divide({
Any suggestions on implementing the above logic?
You can try importing spark.implicits._ and then use $ to refer to a column.
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val result = DS1.groupBy("KEY")
.agg(((sum($"SA").divide(sum(($"SA").divide(($"PCT").divide(100))))) * 100)
.as("re_pcnt"))
This will give you the requested output.
If you do not want to import, you can always use the col() function instead of $.
It is possible to use a string as input to the agg() function with the use of expr(). However, the input string needs to be changed a bit. The following gives exactly the same result as before, but uses a string instead:
val opr = "sum(SA)/(sum(SA/(PCT/100))) * 100"
val df = DS1.groupBy("KEY").agg(expr(opr).as("re_pcnt"))
Note that .as("re_pcnt") needs to be inside the agg() method; it cannot be outside.
Your code works almost perfectly. You just have to put the '$' symbol in order to specify you're passing a column:
val result = DS1.groupBy($"KEY").agg(((sum($"SA").divide(
sum(
($"SA").divide(
($"PCT").divide(100)
)
)
)) * 100).as("re_pcnt"))
Here's the output:
result.show()
+---+-------+
|KEY|re_pcnt|
+---+-------+
| 01| 45.3|
+---+-------+