val df = sc.parallelize(Seq((201601, "a"),
(201602, "b"),
(201603, "c"),
(201604, "c"),
(201607, "c"),
(201604, "c"),
(201608, "c"),
(201609, "c"),
(201605, "b"))).toDF("col1", "col2")
I want to get the top 3 values of col1. Can anyone please let me know the best way to do this?
Spark : 1.6.2
Scala : 2.10
You can do it like below.
df.select($"col1").orderBy($"col1".desc).limit(3).show()
You will get
+------+
| col1|
+------+
|201609|
|201608|
|201607|
+------+
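A side note I'm adding (not part of the original answer): the $"col1" syntax and toDF rely on the SQL implicits being in scope. In a standalone Spark 1.6 application that is roughly the following (the spark-shell usually imports it for you):
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._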
You can extract the maxDate first and then filter based on it:
val maxDate = df.agg(max("col1")).first().getAs[Int](0)
// maxDate: Int = 201609
// Subtract three months from a date encoded as yyyyMM
def minusThree(date: Int): Int = {
  var year = date / 100
  var month = date % 100
  if (month <= 3) {
    year -= 1
    month += 9
  } else {
    month -= 3
  }
  year * 100 + month
}
df.filter($"col1" > minusThree(maxDate)).show
+------+----+
| col1|col2|
+------+----+
|201607| c|
|201608| c|
|201609| c|
+------+----+
You can get similar results in one more way, using the RDD top function:
Example:
val data=sc.parallelize(Seq(("maths",52),("english",75),("science",82), ("computer",65),("maths",85))).top(2)
Results:
(science,82)
(maths,85)
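A note I'm adding (not in the original answer): top on a pair RDD uses the tuples' default ordering, i.e. it compares the String key first. If you want the top N by the numeric value instead, you can pass an explicit Ordering, roughly like this:
val topByCount = sc.parallelize(Seq(("maths",52),("english",75),("science",82), ("computer",65),("maths",85))).top(2)(Ordering.by(_._2))
// topByCount: Array[(String, Int)] = Array((maths,85), (science,82))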
Hi, how's it going? Here's a sample dataframe:
val team_df = Seq(("yankees","aaron judge",24),("yankees","giancarlo stanton",20),("yankees","brett gardner",11),("dodgers","cody bellinger",20),("dodgers","jock pederson",10),
("dodgers","justin turner",15)).toDF("team","player","hits")
Say I wanted to return a dataframe for each team, containing the rows of the 2 players with the most hits per team (or the N highest).
So it should return one dataframe for the yankees with aaron judge (24) and giancarlo stanton (20), and one dataframe for the dodgers with cody bellinger (20) and justin turner (15), in this toy example.
Thanks and have a great day!
import scala.collection.mutable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def findMultipleDF(df: DataFrame, NHighest: Int): Map[String, DataFrame] = {
  val map = mutable.Map[String, DataFrame]()
  // Rank players within each team by hits, highest first
  val rankedDF = df.withColumn("Rank", rank().over(Window.partitionBy("team").orderBy($"hits".desc)))
  val teams = df.groupBy("team").count().collect()
  teams.foreach { x =>
    val tempDF = rankedDF.filter($"team" === x.get(0) && col("Rank").leq(NHighest))
    map += (x.get(0).toString -> tempDF)
  }
  map.toMap
}
val output = findMultipleDF(team_df, 2)
output.foreach(x => x._2.show())
+-------+--------------+----+----+
| team| player|hits|Rank|
+-------+--------------+----+----+
|dodgers|cody bellinger| 20| 1|
|dodgers| justin turner| 15| 2|
+-------+--------------+----+----+
+-------+-----------------+----+----+
| team| player|hits|Rank|
+-------+-----------------+----+----+
|yankees| aaron judge| 24| 1|
|yankees|giancarlo stanton| 20| 2|
+-------+-----------------+----+----+
You can try it like the above, though I'm not sure why you want the output split into separate dataframes.
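If separate dataframes aren't actually required, a minimal alternative sketch (my addition, reusing team_df from the question) keeps everything in one ranked DataFrame and simply filters on the rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Rank players within each team by hits and keep the top 2 per team
val topN = team_df.withColumn("Rank", rank().over(Window.partitionBy("team").orderBy($"hits".desc)))
  .filter($"Rank" <= 2)
topN.show()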
So I have a huge data frame which is a combination of individual tables. It has an identifier column at the end which specifies the table number, as shown below:
+----------------------------+
| col1 col2 .... table_num |
+----------------------------+
| x y 1 |
| a b 1 |
| . . . |
| . . . |
| q p 2 |
+----------------------------+
(original table)
I have to split this into multiple small dataframes based on table_num. The number of tables combined to create it is pretty large, so it's not feasible to individually create the disjoint subset dataframes. I was thinking that if I made a for loop iterating over the min to max values of table_num I could achieve this, but I can't seem to do it. Any help is appreciated.
This is what I came up with
for (x < min(table_num) to max(table_num)) {
var df(x)= spark.sql("select * from df1 where state = x")
df(x).collect()
but I don't think the declaration is right.
So essentially what I need are dataframes that look like this:
+-----------------------------+
| col1 col2 ... table_num |
+-----------------------------+
| x y 1 |
| a b 1 |
+-----------------------------+
+------------------------------+
| col1 col2 ... table_num |
+------------------------------+
| xx xy 2 |
| aa bb 2 |
+------------------------------+
+-------------------------------+
| col1 col2 ... table_num |
+-------------------------------+
| xxy yyy 3 |
| aaa bbb 3 |
+-------------------------------+
... and so on ...
(how I would like the Dataframes split)
In Spark, Arrays can hold almost any data type. When declared as vars you can dynamically add and remove elements from them. Below I isolate the table nums into their own array so that I can easily iterate through them. Once they are isolated, I go through a while loop to add each table as a unique element to the DF holder array. To query an element of the array use DFHolderArray(n-1), where n is the position you want (the array itself is 0-indexed, so the first element is DFHolderArray(0)).
//This will turn the distinct table nums into a queriable (this is 100% a word) array
val tableIDArray = inputDF.selectExpr("table_num").distinct.rdd.map(x=>x.mkString.toInt).collect
//Build the iterator
var iterator = 1
//holders for DF and transformation step
var tempDF = spark.sql("select 'foo' as bar")
var interimDF = tempDF
//This will be an array for dataframes
var DFHolderArray : Array[org.apache.spark.sql.DataFrame] = Array(tempDF)
//loop while you have not reached the end of the array
while(iterator<=tableIDArray.length) {
//Call the table that is stored in that location of the array
tempDF = spark.sql("select * from df1 where state = '" + tableIDArray(iterator-1) + "'")
//Fluff
interimDF = tempDF.withColumn("User_Name", lit("Stack_Overflow"))
//If logic to overwrite or append the DF
DFHolderArray = if (iterator==1) {
Array(interimDF)
} else {
DFHolderArray ++ Array(interimDF)
}
iterator = iterator + 1
}
//To query the data
DFHolderArray(0).show(10,false)
DFHolderArray(1).show(10,false)
DFHolderArray(2).show(10,false)
//....
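Note (an assumption I'm adding, not stated in the original answer): the spark.sql calls above query a temp view named df1, so the input dataframe is assumed to have been registered first, e.g.:
inputDF.createOrReplaceTempView("df1")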
The approach is to collect all unique keys and build the respective data frames. I added some functional flavor to it.
Sample dataset:
name,year,country,id
Bayern Munich,2014,Germany,7747
Bayern Munich,2014,Germany,7747
Bayern Munich,2014,Germany,7746
Borussia Dortmund,2014,Germany,7746
Borussia Mönchengladbach,2014,Germany,7746
Schalke 04,2014,Germany,7746
Schalke 04,2014,Germany,7753
Lazio,2014,Germany,7753
Code:
val df = spark.read.format(source = "csv")
.option("header", true)
.option("delimiter", ",")
.option("inferSchema", true)
.load("groupby.dat")
import spark.implicits._
import org.apache.spark.sql.functions.col
import scala.collection.mutable.ListBuffer
//collect data for each key into a data frame
val uniqueIds = df.select("id").distinct().map(x => x.mkString.toInt).collect()
// List buffer to hold separate data frames
var dataframeList: ListBuffer[org.apache.spark.sql.DataFrame] = ListBuffer()
println(uniqueIds.toList)
// filter data
uniqueIds.foreach(x => {
val tempDF = df.filter(col("id") === x)
dataframeList += tempDF
})
//show individual data frames
for (tempDF1 <- dataframeList) {
tempDF1.show()
}
One approach would be to write the DataFrame as partitioned Parquet files and read them back into a Map, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
("a", "b", 1), ("c", "d", 1), ("e", "f", 1),
("g", "h", 2), ("i", "j", 2)
).toDF("c1", "c2", "table_num")
val filePath = "/path/to/parquet/files"
df.write.partitionBy("table_num").parquet(filePath)
val tableNumList = df.select("table_num").distinct.map(_.getAs[Int](0)).collect
// tableNumList: Array[Int] = Array(1, 2)
val dfMap = ( for { n <- tableNumList } yield
(n, spark.read.parquet(s"$filePath/table_num=$n").withColumn("table_num", lit(n)))
).toMap
To access the individual DataFrames from the Map:
dfMap(1).show
// +---+---+---------+
// | c1| c2|table_num|
// +---+---+---------+
// | a| b| 1|
// | c| d| 1|
// | e| f| 1|
// +---+---+---------+
dfMap(2).show
// +---+---+---------+
// | c1| c2|table_num|
// +---+---+---------+
// | g| h| 2|
// | i| j| 2|
// +---+---+---------+
Let's say I have following dataframe:
/*
+---------+--------+----------+--------+
|a |b | c |d |
+---------+--------+----------+--------+
| bob| -1| 5| -1|
| alice| -1| -1| -1|
+---------+--------+----------+--------+
*/
I want to remove columns which only have -1 in all rows (in this case b and d). I found a solution, but when I ran my job I found out it was very inefficient:
private def removeEmptyColumns(df: DataFrame): DataFrame = {
val types = List("IntegerType", "DoubleType", "LongType")
val dTypes: Array[(String, String)] = df.dtypes
dTypes.foldLeft(df)((d, t) => {
val colType = t._2
val colName = t._1
if (types.contains(colType)) {
if (colType.equals("IntegerType")) {
if (d.select(colName).filter(col(colName) =!= -1).take(1).length == 0) d.drop(colName)
else d
} else if (colType.equals("DoubleType")) {
if (d.select(colName).filter(col(colName) =!= -1.0).take(1).length == 0) d.drop(colName)
else d
} else {
if (d.select(colName).filter(col(colName) =!= -1).take(1).length == 0) d.drop(colName)
else d
}
} else {
d
}
})
}
Is there a better solution or way to improve my existing code?
(I think this line val count = d.select(colName).distinct.count is the bottleneck)
I am using Spark 2.2 atm.
Many thanks
Instead of counting the number of distinct values, check whether there exists any value other than -1:
d.select(colName).filter(col(colName) =!= -1).take(1).isEmpty
Another approach
Instead of going through the dataframe n times (once per column), you can try to collect the statistics all at once:
val summary = d.agg(
  max(col1).as(s"${col1}_max"), min(col1).as(s"${col1}_min"),
  max(col2).as(s"${col2}_max"), min(col2).as(s"${col2}_min"),
  ...)
  .first
Then check whether both the min and the max value for a column equal -1.
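A rough sketch of that last step (my addition; it assumes every column of d was included in the agg above and that the checked columns are integers, so adjust getAs for doubles/longs):
// Hypothetical follow-up: drop every column whose min and max are both -1
val colsToDrop = d.columns.filter { c =>
  summary.getAs[Int](s"${c}_min") == -1 && summary.getAs[Int](s"${c}_max") == -1
}
val cleaned = d.drop(colsToDrop: _*)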
I have an RDD containing data like this: (downloadId: String, date: LocalDate, downloadCount: Int). The date and download-id are unique and the download-count is for the date.
What I've been trying to accomplish is to get the number of consecutive days (going backwards from the current date) that a download-id was in the top 100 of all download-ids. So if a given download was in the top 100 today, yesterday and the day before, then its streak would be 3.
In SQL, I guess this could be solved using window functions. I've seen similar questions, like this one: How to add a running count to rows in a 'streak' of consecutive days.
(I'm rather new to Spark and wasn't sure how to map-reduce an RDD to even begin solving a problem like this.)
Some more information: the dates are the last 30 days and there are approximately 4M unique download-ids per day.
I suggest you work with DataFrames, as they are much easier to use than RDDs. Leo's answer is shorter, but I couldn't find where it was filtering for the top 100 downloads, so I decided to post my answer as well. It does not depend on window functions, but it is bounded by the number of days in the past you want to compute streaks over. Since you said you only use the last 30 days' data, that should not be a problem.
As a first step, I wrote some code to generate a DF similar to what you described. You don't need to run this first block (if you do, reduce the number of rows unless you have a cluster to try it on; it's heavy on memory). You can see how to transform the RDD (theData) into a DF (baseData). You should define a schema for it, like I did.
import java.time.LocalDate
import scala.util.Random
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val maxId = 10000
val numRows = 15000000
val lastDate = LocalDate.of(2017, 12, 31)
// Generates the data. As a convenience for working with Dataframes, I converted the dates to epoch days.
val theData = sc.parallelize(1.to(numRows).map{
_ => {
val id = Random.nextInt(maxId)
val nDownloads = Random.nextInt((id / 1000 + 1))
Row(id, lastDate.minusDays(Random.nextInt(30)).toEpochDay, nDownloads)
}
})
//Working with Dataframes is much simpler, so I'll generate a DF named baseData from the RDD
val schema = StructType(
StructField("downloadId", IntegerType, false) ::
StructField("date", LongType, false) ::
StructField("downloadCount", IntegerType, false) :: Nil)
val baseData = sparkSession.sqlContext.createDataFrame(theData, schema)
.groupBy($"downloadId", $"date")
.agg(sum($"downloadCount").as("downloadCount"))
.cache()
Now you have the data you want in a DF called baseData. The next step is to restrict it to the top 100 for each day - you should discard the data you don't need before doing any additional heavy transformations.
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row}
def filterOnlyTopN(data: DataFrame, n: Int = 100): DataFrame = {
// For each day in the data, let's find the cutoff # of downloads to make it into the top N
val getTopNCutoff = udf((downloads: Seq[Long]) => {
val reverseSortedDownloads = downloads.sortBy{- _ }
if (reverseSortedDownloads.length >= n)
reverseSortedDownloads.drop(n - 1).head
else
reverseSortedDownloads.last
})
val topNLimitsByDate = data.groupBy($"date").agg(collect_set($"downloadCount").as("downloads"))
.select($"date", getTopNCutoff($"downloads").as("cutoff"))
// And then, let's throw away the records below the top 100
data.join(topNLimitsByDate, Seq("date"))
.filter($"downloadCount" >= $"cutoff")
.drop("cutoff", "downloadCount")
}
val relevantData = filterOnlyTopN(baseData)
Now that you have the relevantData DF with only the data you need, you can calculate the streak for them. I have left the ids with no streaks as streak 0, you can filter those out by using streaks.filter($"streak" > lit(0)).
def getStreak(df: DataFrame, fromDate: Long): DataFrame = {
val calcStreak = udf((dateList: Seq[Long]) => {
if (!dateList.contains(fromDate))
0
else {
val relevantDates = dateList.sortBy{- _ } // Order the dates descending
.dropWhile(_ != fromDate) // And drop everything until we find the starting day we are interested in
if (relevantDates.length == 1) // If there's only one day left, it's a one day streak
1
else // Otherwise, let's count the streak length (this works if no dates are left, too - but not with only 1 day)
relevantDates.sliding(2) // Take days by pairs
.takeWhile{twoDays => twoDays(1) == twoDays(0) - 1} // While the pair is of consecutive days
.length+1 // And the streak will be the number of consecutive pairs + 1 (the initial day of the streak)
}
})
df.groupBy($"downloadId").agg(collect_list($"date").as("dates")).select($"downloadId", calcStreak($"dates").as("streak"))
}
val streaks = getStreak(relevantData, lastDate.toEpochDay)
streaks.show()
+------------+--------+
| downloadId | streak |
+------------+--------+
| 8086 | 0 |
| 9852 | 0 |
| 7253 | 0 |
| 9376 | 0 |
| 7833 | 0 |
| 9465 | 1 |
| 7880 | 0 |
| 9900 | 1 |
| 7993 | 0 |
| 9427 | 1 |
| 8389 | 1 |
| 8638 | 1 |
| 8592 | 1 |
| 6397 | 0 |
| 7754 | 1 |
| 7982 | 0 |
| 7554 | 0 |
| 6357 | 1 |
| 7340 | 0 |
| 6336 | 0 |
+------------+--------+
And there you have the streaks DF with the data you need.
Using a similar approach in the listed PostgreSQL link, you can apply Window function in Spark as well. Spark's DataFrame API doesn't have encoders for java.time.LocalDate, so you'll need to convert it to, say, java.sql.Date.
Here are the steps: first, transform the RDD into a DataFrame with a supported date format; next, create a UDF to compute the baseDate, which requires a date and a per-id chronological row number (generated using a Window function) as parameters. Another Window function is then applied to calculate the per-id-per-baseDate row number, which is the wanted streak value:
import java.time.LocalDate
val rdd = sc.parallelize(Seq(
(1, LocalDate.parse("2017-12-13"), 2),
(1, LocalDate.parse("2017-12-16"), 1),
(1, LocalDate.parse("2017-12-17"), 1),
(1, LocalDate.parse("2017-12-18"), 2),
(1, LocalDate.parse("2017-12-20"), 1),
(1, LocalDate.parse("2017-12-21"), 3),
(2, LocalDate.parse("2017-12-15"), 2),
(2, LocalDate.parse("2017-12-16"), 1),
(2, LocalDate.parse("2017-12-19"), 1),
(2, LocalDate.parse("2017-12-20"), 1),
(2, LocalDate.parse("2017-12-21"), 2),
(2, LocalDate.parse("2017-12-23"), 1)
))
val df = rdd.map{ case (id, date, count) => (id, java.sql.Date.valueOf(date), count) }.
toDF("downloadId", "date", "downloadCount")
def baseDate = udf( (d: java.sql.Date, n: Long) =>
new java.sql.Date(new java.util.Date(d.getTime).getTime - n * 24 * 60 * 60 * 1000)
)
import org.apache.spark.sql.expressions.Window
val dfStreak = df.withColumn("rowNum", row_number.over(
Window.partitionBy($"downloadId").orderBy($"date")
)
).withColumn(
"baseDate", baseDate($"date", $"rowNum")
).select(
$"downloadId", $"date", $"downloadCount", row_number.over(
Window.partitionBy($"downloadId", $"baseDate").orderBy($"date")
).as("streak")
).orderBy($"downloadId", $"date")
dfStreak.show
+----------+----------+-------------+------+
|downloadId| date|downloadCount|streak|
+----------+----------+-------------+------+
| 1|2017-12-13| 2| 1|
| 1|2017-12-16| 1| 1|
| 1|2017-12-17| 1| 2|
| 1|2017-12-18| 2| 3|
| 1|2017-12-20| 1| 1|
| 1|2017-12-21| 3| 2|
| 2|2017-12-15| 2| 1|
| 2|2017-12-16| 1| 2|
| 2|2017-12-19| 1| 1|
| 2|2017-12-20| 1| 2|
| 2|2017-12-21| 2| 3|
| 2|2017-12-23| 1| 1|
+----------+----------+-------------+------+
I'm trying to find a way to calculate the median for a given dataframe.
val df = sc.parallelize(Seq(("a",1.0),("a",2.0),("a",3.0),("b",6.0), ("b", 8.0))).toDF("col1", "col2")
+----+----+
|col1|col2|
+----+----+
| a| 1.0|
| a| 2.0|
| a| 3.0|
| b| 6.0|
| b| 8.0|
+----+----+
Now I want to do something like this:
df.groupBy("col1").agg(calcmedian("col2"))
the result should look like this:
+----+------+
|col1|median|
+----+------+
| a| 2.0|
| b| 7.0|
+----+------+
Therefore calcmedian() has to be a UDAF, but the problem is that the "evaluate" method of the UDAF only takes a Row, whereas I need the whole table to sort the values and return the median...
// Once all entries for a group are exhausted, spark will evaluate to get the final result
def evaluate(buffer: Row) = {...}
Is this possible somehow? Or is there another nice workaround? I want to stress that I know how to calculate the median on a dataset with "one group", but I don't want to use that algorithm in a "foreach" loop as this is inefficient!
Thank you!
Edit:
That's what I tried so far:
object calcMedian extends UserDefinedAggregateFunction {
// Schema you get as an input
def inputSchema = new StructType().add("col2", DoubleType)
// Schema of the row which is used for aggregation
def bufferSchema = new StructType().add("col2", DoubleType)
// Returned type
def dataType = DoubleType
// Self-explaining
def deterministic = true
// initialize - called once for each group
def initialize(buffer: MutableAggregationBuffer) = {
buffer(0) = 0.0
}
// called for each input record of that group
def update(buffer: MutableAggregationBuffer, input: Row) = {
buffer(0) = input.getDouble(0)
}
// if the function supports partial aggregates, spark might (as an optimization) compute partial results and combine them together
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
buffer1(0) = buffer2.getDouble(0)
}
// Once all entries for a group are exhausted, spark will evaluate to get the final result
def evaluate(buffer: Row) = {
val tile = 50
var median = 0.0
//PROBLEM: buffer is a Row --> I need DataFrame here???
val rdd_sorted = buffer.sortBy(x => x)
val c = rdd_sorted.count()
if (c == 1){
median = rdd_sorted.first()
}else{
val index = rdd_sorted.zipWithIndex().map(_.swap)
val last = c
val n = (tile/ 100d) * (c*1d)
val k = math.floor(n).toLong
val d = n - k
if( k <= 0) {
median = rdd_sorted.first()
}else{
if (k <= c){
median = index.lookup(last - 1).head
}else{
if(k >= c){
median = index.lookup(last - 1).head
}else{
median = index.lookup(k-1).head + d* (index.lookup(k).head - index.lookup(k-1).head)
}
}
}
}
} //end of evaluate
Try this:
import org.apache.spark.sql.functions._
val result = df.groupBy("col1").agg(callUDF("percentile_approx", col("col2"), lit(0.5)))
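A small usage note I'm adding (not from the original answer): you may want to alias the result column, and keep in mind that percentile_approx computes an approximate percentile, so for very small groups the value can differ from the exact interpolated median (e.g. the 7.0 expected for group "b").
val medians = df.groupBy("col1")
  .agg(callUDF("percentile_approx", col("col2"), lit(0.5)).as("median"))
medians.show()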