I have a DataFrame in which I want to groupBy column A and then find different stats like mean, min, max, std dev and quantiles.
I am able to find min, max and mean using the following code:
df.groupBy("A").agg(min("B"), max("B"), mean("B")).show(50, false)
But I am unable to find the quantiles (0.25, 0.5, 0.75). I tried approxQuantile and percentile, but I get the following error:
error: not found: value approxQuantile
If you have Hive on the classpath, you can use many Hive UDAFs such as percentile_approx and stddev_samp; see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inAggregateFunctions(UDAF)
You can call these functions using callUDF:
import ss.implicits._
import org.apache.spark.sql.functions.callUDF
val df = Seq(1.0,2.0,3.0).toDF("x")
df.groupBy()
.agg(
callUDF("percentile_approx",$"x",lit(0.5)).as("median"),
callUDF("stddev_samp",$"x").as("stdev")
)
.show()
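The question asked for the 0.25, 0.5 and 0.75 quantiles grouped by column A. Hive's percentile_approx also accepts an array of percentages, so all three can come back from a single aggregation. A minimal sketch, assuming Hive support is enabled and the column names A and B from the question:
import org.apache.spark.sql.functions.{array, callUDF, col, lit}

// percentile_approx with an array of percentages returns an array of quantiles per group
df.groupBy("A")
  .agg(
    callUDF("percentile_approx", col("B"), array(lit(0.25), lit(0.5), lit(0.75))).as("quantiles"),
    callUDF("stddev_samp", col("B")).as("stddev")
  )
  .show(false)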
Here is code that I have tested on Spark 3.1:
import org.apache.spark.sql.functions.{lit, percentile_approx}
import spark.implicits._

val simpleData = Seq(("James","Sales","NY",90000,34,10000),
("Michael","Sales","NY",86000,56,20000),
("Robert","Sales","CA",81000,30,23000),
("Maria","Finance","CA",90000,24,23000),
("Raman","Finance","CA",99000,40,24000),
("Scott","Finance","NY",83000,36,19000),
("Jen","Finance","NY",79000,53,15000),
("Jeff","Marketing","CA",80000,25,18000),
("Kumar","Marketing","NY",91000,50,21000)
)
val df = simpleData.toDF("employee_name","department","state","salary","age","bonus")
df.show()
df.groupBy($"department")
.agg(
percentile_approx($"salary",lit(0.5), lit(10000))
)
.show(false)
Output
+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
| James| Sales| NY| 90000| 34|10000|
| Michael| Sales| NY| 86000| 56|20000|
| Robert| Sales| CA| 81000| 30|23000|
| Maria| Finance| CA| 90000| 24|23000|
| Raman| Finance| CA| 99000| 40|24000|
| Scott| Finance| NY| 83000| 36|19000|
| Jen| Finance| NY| 79000| 53|15000|
| Jeff| Marketing| CA| 80000| 25|18000|
| Kumar| Marketing| NY| 91000| 50|21000|
+-------------+----------+-----+------+---+-----+
+----------+-------------------------------------+
|department|percentile_approx(salary, 0.5, 10000)|
+----------+-------------------------------------+
|Sales |86000 |
|Finance |83000 |
|Marketing |80000 |
+----------+-------------------------------------+
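To cover everything the original question asked for (min, max, mean, standard deviation and the quartiles) in a single pass, the Spark 3.1 functions can be combined in one agg; percentile_approx also accepts an array of percentages. A sketch using the simpleData DataFrame above:
import org.apache.spark.sql.functions.{array, lit, max, mean, min, percentile_approx, stddev}

// One aggregation per department covering all of the requested statistics.
df.groupBy($"department")
  .agg(
    min($"salary").as("min_salary"),
    max($"salary").as("max_salary"),
    mean($"salary").as("mean_salary"),
    stddev($"salary").as("stddev_salary"),
    percentile_approx($"salary", array(lit(0.25), lit(0.5), lit(0.75)), lit(10000)).as("salary_quartiles")
  )
  .show(false)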
I have a dataframe. I need to calculate the max length of the string values in a column and print both the value and its length.
I have written the code below, but it outputs only the max length, not the corresponding value.
This question, How to get max length of string column from dataframe using scala?, helped me arrive at the query below.
df.agg(max(length(col("city")))).show()
Use the row_number() window function ordered by length('city) descending, then keep only the row where the row_number column is 1, adding a length('city) column to the dataframe.
Ex:
val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"))
.toDF("city","num","country")
val win=Window.orderBy(length('city).desc)
df.withColumn("str_len",length('city))
.withColumn("rn", row_number().over(win))
.filter('rn===1)
.show(false)
+----+---+-------+-------+---+
|city|num|country|str_len|rn |
+----+---+-------+-------+---+
|ABC |1 |US |3 |1 |
+----+---+-------+-------+---+
Or, in Spark SQL:
df.createOrReplaceTempView("lpl")
spark.sql("select * from (select *,length(city)str_len,row_number() over (order by length(city) desc)rn from lpl)q where q.rn=1")
.show(false)
+----+---+-------+-------+---+
|city|num|country|str_len| rn|
+----+---+-------+-------+---+
| ABC| 1| US| 3| 1|
+----+---+-------+-------+---+
Update:
To find both the min and max length values:
val win_desc=Window.orderBy(length('city).desc)
val win_asc=Window.orderBy(length('city).asc)
df.withColumn("str_len",length('city))
.withColumn("rn", row_number().over(win_desc))
.withColumn("rn1",row_number().over(win_asc))
.filter('rn===1 || 'rn1 === 1)
.show(false)
Result:
+----+---+-------+-------+---+---+
|city|num|country|str_len|rn |rn1|
+----+---+-------+-------+---+---+
|A |1 |US |1 |3 |1 | //min value of string
|ABC |1 |US |3 |1 |3 | //max value of string
+----+---+-------+-------+---+---+
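On Spark 3.0 and later, another option is the max_by/min_by aggregate functions, which return the value of one column for the row holding the maximum/minimum of another expression. A sketch (note that ties on length are resolved arbitrarily here):
import org.apache.spark.sql.functions.{col, expr, length, max, min}

// max_by/min_by pick the city whose length is largest/smallest in a single aggregation.
df.agg(
    expr("max_by(city, length(city))").as("longest_city"),
    max(length(col("city"))).as("max_len"),
    expr("min_by(city, length(city))").as("shortest_city"),
    min(length(col("city"))).as("min_len")
  )
  .show(false)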
In case you have multiple rows that share the same length, the window-function solution won't work, since it keeps only the first row after ordering.
Another way is to create a new column with the length of the string, find its maximum, and filter the data frame on the obtained maximum value.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._
val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"), ("DEF", 2, "US"))
.toDF("city","num","country")
val dfWithLength = df.withColumn("city_length", length($"city")).cache()
dfWithLength.show()
+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
| A| 1| US| 1|
| AB| 1| US| 2|
| ABC| 1| US| 3|
| DEF| 2| US| 3|
+----+---+-------+-----------+
val Row(maxValue: Int) = dfWithLength.agg(max("city_length")).head()
dfWithLength.filter($"city_length" === maxValue).show()
+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
| ABC| 1| US| 3|
| DEF| 2| US| 3|
+----+---+-------+-----------+
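If you prefer the window-function approach but need to keep ties, rank() (unlike row_number()) assigns the same rank to every row tied for the greatest length, so the tied rows survive the filter. A sketch on the same DataFrame:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, length, rank}

// rank() gives all rows tied for the greatest length rank 1, so ties are kept.
val win_len = Window.orderBy(length(col("city")).desc)

df.withColumn("str_len", length(col("city")))
  .withColumn("rnk", rank().over(win_len))
  .filter(col("rnk") === 1)
  .show(false)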
Find the maximum string length of a string column with PySpark:
from pyspark.sql.functions import length, col, max
df2 = df.withColumn("len_Description",length(col("Description"))).groupBy().max("len_Description")
I have two DataFrames with one column each (300 rows each):
df_realite.take(1)
[Row(realite=1.0)]
df_proba_classe_1.take(1)
[Row(probabilite=0.6196931600570679)]
I would like to combine them into one DataFrame with the two columns.
I tried:
_ = spark.createDataFrame([df_realite.rdd, df_proba_classe_1.rdd] ,
schema=StructType([ StructField('realite' , FloatType() ) ,
StructField('probabilite' , FloatType() ) ]))
But
_.take(10)
gives me empty values:
[Row(realite=None, probabilite=None), Row(realite=None, probabilite=None)]
There may be a more concise way (or a way without a join), but you could always just give them both an id and join them like:
from pyspark.sql import functions
df1 = df_realite.withColumn('id', functions.monotonically_increasing_id())
df2 = df_proba_classe_1.withColumn('id', functions.monotonically_increasing_id())
df1.join(df2, on='id').select('realite', 'probabilite')
I think this is what you are looking for. I would only recommend this method if your data is very small, as it is in your case (300 rows), because collect() is not a good practice on large amounts of data; otherwise, go the join route with dummy columns and do a broadcast join so no shuffle occurs.
from pyspark.sql.functions import *
from pyspark.sql.types import *
df1 = spark.range(10).select(col("id").cast("float"))
df2 = spark.range(10).select(col("id").cast("float"))
l1 = df1.rdd.flatMap(lambda x: x).collect()
l2 = df2.rdd.flatMap(lambda x: x).collect()
list_df = zip(l1, l2)
schema=StructType([ StructField('realite', FloatType() ) ,
StructField('probabilite' , FloatType() ) ])
df = spark.createDataFrame(list_df, schema=schema)
df.show()
+-------+-----------+
|realite|probabilite|
+-------+-----------+
| 0.0| 0.0|
| 1.0| 1.0|
| 2.0| 2.0|
| 3.0| 3.0|
| 4.0| 4.0|
| 5.0| 5.0|
| 6.0| 6.0|
| 7.0| 7.0|
| 8.0| 8.0|
| 9.0| 9.0|
+-------+-----------+
I want to implement a function similar to Oracle's LISTAGG function.
Equivalent oracle code is
select KEY,
listagg(CODE, '-') within group (order by DATE) as CODE
from demo_table
group by KEY
Here is my Spark Scala DataFrame implementation, but I am unable to order the values within each group.
Input:
val values = List(List("66", "PL", "2016-11-01"), List("66", "PL", "2016-12-01"),
List("67", "JL", "2016-12-01"), List("67", "JL", "2016-11-01"), List("67", "PL", "2016-10-01"), List("67", "PO", "2016-09-01"), List("67", "JL", "2016-08-01"),
List("68", "PL", "2016-12-01"), List("68", "JO", "2016-11-01"))
.map(row => (row(0), row(1), row(2)))
val df = values.toDF("KEY", "CODE", "DATE")
df.show()
+---+----+----------+
|KEY|CODE| DATE|
+---+----+----------+
| 66| PL|2016-11-01|
| 66| PL|2016-12-01|----- group 1
| 67| JL|2016-12-01|
| 67| JL|2016-11-01|
| 67| PL|2016-10-01|
| 67| PO|2016-09-01|
| 67| JL|2016-08-01|----- group 2
| 68| PL|2016-12-01|
| 68| JO|2016-11-01|----- group 3
+---+----+----------+
UDF implementation:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.udf
val listAgg = udf((xs: Seq[String]) => xs.mkString("-"))
df.groupBy("KEY")
.agg(listAgg(collect_list("CODE")).alias("CODE"))
.show(false)
+---+--------------+
|KEY|CODE |
+---+--------------+
|68 |PL-JO |
|67 |JL-JL-PL-PO-JL|
|66 |PL-PL |
+---+--------------+
Expected output (ordered by DATE):
+---+--------------+
|KEY|CODE |
+---+--------------+
|68 |JO-PL |
|67 |JL-PO-PL-JL-JL|
|66 |PL-PL |
+---+--------------+
Use the struct inbuilt function to combine the CODE and DATE columns, and use that new struct column in the collect_list aggregation function. Then, in the udf function, sort by DATE and collect the CODE values as a '-'-separated string.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

def sortAndStringUdf = udf((codeDate: Seq[Row]) => codeDate.sortBy(row => row.getAs[Long]("DATE")).map(row => row.getAs[String]("CODE")).mkString("-"))

df.withColumn("codeDate", struct(col("CODE"), col("DATE").cast("timestamp").cast("long").as("DATE")))
.groupBy("KEY").agg(sortAndStringUdf(collect_list("codeDate")).as("CODE"))
.show()
which should give you
+---+--------------+
|KEY| CODE|
+---+--------------+
| 68| JO-PL|
| 67|JL-PO-PL-JL-JL|
| 66| PL-PL|
+---+--------------+
I hope the answer is helpful
Update
I am sure this will be faster than using a udf function:
df.withColumn("codeDate", struct(col("DATE").cast("timestamp").cast("long").as("DATE"), col("CODE")))
.groupBy("KEY")
.agg(concat_ws("-", expr("sort_array(collect_list(codeDate)).CODE")).alias("CODE"))
.show(false)
which should give you the same result as above
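For those coming from the Oracle LISTAGG syntax, the same sort_array trick can also be written in Spark SQL. A sketch, assuming the DataFrame is registered as a temporary view named demo_table and using named_struct so the sort key and the collected field are explicit:
df.createOrReplaceTempView("demo_table")

spark.sql("""
  SELECT KEY,
         concat_ws('-', sort_array(collect_list(named_struct('DATE', cast(DATE as timestamp), 'CODE', CODE))).CODE) AS CODE
  FROM demo_table
  GROUP BY KEY
""").show(false)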
I am trying to create a very simple DataFrame with for example 3 columns and 3 rows.
I would like to have something like this:
+------+---+-----+
|nameID|age| Code|
+------+---+-----+
|2123 | 80| 4553|
|65435 | 10| 5454|
+------+---+-----+
How can I create that DataFrame in Scala? (The above is just an example.)
I have the next program:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
object ejemploApp extends App{
val schema = StructType(List(
StructField("name", LongType, true),
StructField("pandas", LongType, true),
StructField("id", LongType, true)))
}
val outputDF = sqlContext.createDataFrame(sc.emptyRDD, schema)
First problem:
It throws an error on the outputDF line: cannot resolve symbol schema.
Second problem:
How can I add the random numbers to the DataFrame?
I would do something like this:
val nRows = 10
import scala.util.Random
val df = (1 to nRows)
.map(_ => (Random.nextLong,Random.nextLong,Random.nextLong))
.toDF("nameID","age","Code")
+--------------------+--------------------+--------------------+
| nameID| age| Code|
+--------------------+--------------------+--------------------+
| 5805854653225159387|-1935762756694500432| 1365584391661863428|
| 4308593891267308529|-1117998169834014611| 366909655761037357|
|-6520321841013405169| 7356990033384276746| 8550003986994046206|
| 6170542655098268888| 1233932617279686622| 7981198094124185898|
|-1561157245868690538| 1971758588103543208| 6200768383342183492|
|-8160793384374349276|-6034724682920319632| 6217989507468659178|
| 4650572689743320451| 4798386671229558363|-4267909744532591495|
| 1769492191639599804| 7162442036876679637|-4756245365203453621|
| 6677455911726550485| 8804868511911711123|-1154102864413343257|
|-2910665375162165247|-7992219570728643493|-3903787317589941578|
+--------------------+--------------------+--------------------+
Of course, the age isn't very realistic, but you can shape your random numbers as you wish (e.g. using Scala's modulo operator and absolute value). For example:
val df = (1 to nRows)
.map(id => (id.toLong,Math.abs(Random.nextLong % 100L),Random.nextLong))
.toDF("nameID","age","Code")
+------+---+--------------------+
|nameID|age| Code|
+------+---+--------------------+
| 1| 17| 7143235115334699492|
| 2| 83|-3863778506510275412|
| 3| 31|-3839786144396379186|
| 4| 40| 8057989112338559775|
| 5| 67| 7601061291211506255|
| 6| 71| 7393782421106239325|
| 7| 43| 28349510524075085|
| 8| 50| 539042255545625624|
| 9| 41|-8654000375112432924|
| 10| 82|-1300111870445007499|
+------+---+--------------------+
EDIT: make sure you have the implicits imported:
Spark 1.6:
import sqlContext.implicits._
Spark 2:
import sparkSession.implicits._
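If you want to keep the explicit schema from the question, you can also build an RDD of Rows with random values and hand it to createDataFrame together with the schema. A sketch, assuming Spark 2.x where the SparkSession is called spark, and with outputDF declared in the same scope as schema so the symbol resolves:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import scala.util.Random

val schema = StructType(List(
  StructField("name", LongType, true),
  StructField("pandas", LongType, true),
  StructField("id", LongType, true)))

// three rows of random longs matching the schema
val rows = (1 to 3).map(_ => Row(Random.nextLong, Random.nextLong, Random.nextLong))
val outputDF = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

outputDF.show()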
I am using Scala and Spark and have a simple dataframe.map to produce the required transformation on the data. However, I need to emit an additional row of data along with the modified original. How can I use dataframe.map to do this?
ex:
dataset from:
id, name, age
1, john, 23
2, peter, 32
if age < 25 default to 25.
dataset to:
id, name, age
1, john, 25
1, john, -23
2, peter, 32
Would a 'UnionAll' handle it?
e.g.
df1 = original dataframe
df2 = transformed df1
df1.unionAll(df2)
EDIT: implementation using unionAll()
import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

val df1=sqlContext.createDataFrame(Seq( (1,"john",23) , (2,"peter",32) )).
toDF( "id","name","age")
def udfTransform= udf[Int,Int] { (age) => if (age<25) 25 else age }
val df2=df1.withColumn("age2", udfTransform($"age")).
where("age!=age2").
drop("age2")
df1.withColumn("age", udfTransform($"age")).
unionAll(df2).
orderBy("id").
show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 25|
| 1| john| 23|
| 2|peter| 32|
+---+-----+---+
Note: the implementation differs a bit from the originally proposed (naive) solution. The devil is always in the detail!
EDIT 2: implementation using a nested array and explode
import org.apache.spark.sql.functions.{explode, udf}
import sqlContext.implicits._

val df1=sqlContext.createDataFrame(Seq( (1,"john",23) , (2,"peter",32) )).
toDF( "id","name","age")
def udfArr= udf[Array[Int],Int] { (age) =>
if (age<25) Array(age,25) else Array(age) }
val df2=df1.withColumn("age", udfArr($"age"))
df2.show()
+---+-----+--------+
| id| name| age|
+---+-----+--------+
| 1| john|[23, 25]|
| 2|peter| [32]|
+---+-----+--------+
df2.withColumn("age",explode($"age") ).show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 23|
| 1| john| 25|
| 2|peter| 32|
+---+-----+---+
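Since the question specifically asks how to do this inside a map-style transformation, a flatMap over a typed Dataset can emit one or two rows per input row. A sketch for Spark 2.x, assuming spark.implicits._ is in scope (it keeps the original age, matching the unionAll answer above rather than the question's negated value):
import spark.implicits._

case class Person(id: Int, name: String, age: Int)

val ds = Seq(Person(1, "john", 23), Person(2, "peter", 32)).toDS()

// emit the capped row, plus the original row whenever the age was changed
val result = ds.flatMap { p =>
  if (p.age < 25) Seq(p.copy(age = 25), p) else Seq(p)
}

result.orderBy("id").show()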