I have a sample dataframe
df_that_I_have
+---------+---------+-------+
| country | members | some |
+---------+---------+-------+
| India | 50 | 1 |
+---------+---------+-------+
| Japan | 20 | 3 |
+---------+---------+-------+
| India | 20 | 1 |
+---------+---------+-------+
| Japan | 10 | 3 |
+---------+---------+-------+
and I want a dataframe that looks like this
df_that_I_want
+---------+---------+-------+
| country | members | some |
+---------+---------+-------+
| India | 70 | 10 | // 5 * Sum of "some" for India, i.e. (1 + 1)
+---------+---------+-------+
| Japan | 30 | 30 | // 5 * Sum of "some" for Japan, i.e. (3 + 3)
+---------+---------+-------+
The second dataframe has the sum of members and the sum of some multiplied by 5, for each country.
This is what I'm doing to achieve this
val df_that_I_want = df_that_I_have
  .select(df_that_I_have("country"),
    df_that_I_have.groupBy("country").sum("members"),
    5 * df_that_I_have.groupBy("country").sum("some")) // Problem here
But the compiler does not allow me to do this because apparently I can't multiply 5 with a column.
How can I multiply an Integer value with the sum of some for each country?
You can try the lit function.
scala> val df_that_I_have = Seq(("India",50,1),("India",20,1),("Japan",20,3),("Japan",10,3)).toDF("Country","Members","Some")
df_that_I_have: org.apache.spark.sql.DataFrame = [Country: string, Members: int, Some: int]
scala> val df1 = df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))
df1: org.apache.spark.sql.DataFrame = [country: string, sum(members): bigint, ((sum(some),mode=Complete,isDistinct=false) * 5): bigint]
scala> val df_that_I_want= df1.select($"Country",$"sum(Members)".alias("Members"), $"((sum(Some),mode=Complete,isDistinct=false) * 5)".alias("Some"))
df_that_I_want: org.apache.spark.sql.DataFrame = [Country: string, Members: bigint, Some: bigint]
scala> df_that_I_want.show
+-------+-------+----+
|Country|Members|Some|
+-------+-------+----+
| India| 70| 10|
| Japan| 30| 30|
+-------+-------+----+
Please try this:
df_that_I_have.groupBy("country").agg(sum("members"), sum("some") * lit(5))
The lit function creates a column holding a literal value, which is 5 here.
Since you cannot multiply the integer 5 by a Column directly (5 * column does not compile), lit(5) wraps the value in a Column so the multiplication happens between two columns.
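If you prefer cleaner column names without selecting the auto-generated ones afterwards, a minimal sketch (assuming the same df_that_I_have as above) is to alias inside agg:
import org.apache.spark.sql.functions.{lit, sum}
val df_that_I_want = df_that_I_have
  .groupBy("country")
  .agg(
    sum("members").alias("members"),        // total members per country
    (sum("some") * lit(5)).alias("some"))   // 5 * sum of "some" per country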
I have a Spark dataframe consisting of two columns [Employee and Salary] where Salary is in ascending order.
Sample Dataframe:
| Employee |salary |
| -------- | ------|
| Emp1 | 10 |
| Emp2 | 20 |
| Emp3 | 30 |
| Emp4     | 35    |
| Emp5 | 36 |
| Emp6 | 50 |
| Emp7 | 70 |
I want to group the rows such that each group has an aggregated salary of less than 80, and assign a category to each group, something like this: I will keep adding the salaries of the rows until the sum becomes more than 80. As soon as it becomes more than 80, I will assign a new category.
Expected Output:
| Employee |salary | Category|
| -------- | ------|----------
| Emp1 | 10 |A |
| Emp2 | 20 |A |
| Emp3 | 30 |A |
| Emp4     | 35    |B        |
| Emp5 | 36 |B |
| Emp6 | 50 |C |
| Emp7 | 70 |D |
Is there a simple way we can do this in spark scala?
To solve your problem, you can use a custom aggregate function over a window.
First, you need to create your custom aggregate function. An aggregate function is defined by an accumulator (a buffer) that is initialized (zero function), updated when treating a new row (reduce function) or when encountering another accumulator (merge function), and, at the end, returned (finish function).
In your case, the accumulator should keep two pieces of information:
Current category of employees
Sum of salaries of the previous employees belonging to the current category
To store this information, you can use a tuple (Int, Int), whose first element is the current category and whose second element is the sum of salaries of the previous employees in the current category:
You initialize this tuple with (0, 0).
When you encounter a new row, if the sum of the previous salaries and the salary of the current row is over 80, you increment the category and reinitialize the previous salaries' sum to the salary of the current row; otherwise you add the salary of the current row to the previous salaries' sum.
As you will be using a window function, you treat rows sequentially, so you don't need to implement merging with another accumulator.
And at the end, as you only want the category, you return only the first element of the accumulator.
So we get the following aggregator implementation:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
object Labeler extends Aggregator[Int, (Int, Int), Int] {
  override def zero: (Int, Int) = (0, 0)
  override def reduce(catAndSum: (Int, Int), salary: Int): (Int, Int) = {
    // Start a new category when adding this salary pushes the running sum over 80
    if (catAndSum._2 + salary > 80)
      (catAndSum._1 + 1, salary)
    else
      (catAndSum._1, catAndSum._2 + salary)
  }
  override def merge(catAndSum1: (Int, Int), catAndSum2: (Int, Int)): (Int, Int) = {
    throw new NotImplementedError("should be used only over a window function")
  }
  override def finish(catAndSum: (Int, Int)): Int = catAndSum._1
  override def bufferEncoder: Encoder[(Int, Int)] = Encoders.tuple(Encoders.scalaInt, Encoders.scalaInt)
  override def outputEncoder: Encoder[Int] = Encoders.scalaInt
}
Once you have your aggregator, you transform it into a Spark aggregate function using the udaf function.
You then create your window over all dataframe and ordered by salary and apply your spark aggregate function over this window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, udaf}
val labeler = udaf(Labeler)
val window = Window.orderBy("salary")
val result = dataframe.withColumn("category", labeler(col("salary")).over(window))
Using your example as input dataframe, you get the following result dataframe:
+--------+------+--------+
|employee|salary|category|
+--------+------+--------+
|Emp1 |10 |0 |
|Emp2 |20 |0 |
|Emp3 |30 |0 |
|Emp4 |35 |1 |
|Emp5 |36 |1 |
|Emp6 |50 |2 |
|Emp7 |70 |3 |
+--------+------+--------+
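If you want letter categories (A, B, C, D) like in the expected output instead of 0, 1, 2, 3, a follow-up sketch (not part of the original answer, and assuming fewer than 26 categories) is to map the integer onto a character with a small UDF:
import org.apache.spark.sql.functions.{col, udf}
// Hypothetical mapping: 0 -> A, 1 -> B, 2 -> C, ...
val toLetter = udf((i: Int) => ('A' + i).toChar.toString)
val lettered = result.withColumn("category", toLetter(col("category")))
lettered.show(false)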
Consider there are 2 dataframes df1 and df2.
df1 has below data
A | B
-------
1 | m
2 | n
3 | o
df2 has below data
A | B
-------
1 | m
2 | n
3 | p
df1.except(df2) returns
A | B
-------
3 | o
3 | p
How to display the result as below?
df1: 3 | o
df2: 3 | p
As per the API docs, df1.except(df2) returns a new DataFrame containing rows in this frame but not in the other frame, i.e., it will return rows that are in df1 and not in df2. Thus a custom except function that labels the differing rows from both sides could be written as:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
def except(df1: DataFrame, df2: DataFrame): DataFrame = {
  val edf1 = df1.except(df2).withColumn("df", lit("df1"))  // rows only in df1
  val edf2 = df2.except(df1).withColumn("df", lit("df2"))  // rows only in df2
  edf1.union(edf2)
}
//Output
+---+---+---+
| A| B| df|
+---+---+---+
| 3| o|df1|
| 3| p|df2|
+---+---+---+
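A minimal usage sketch, assuming the two frames are built with hypothetical toDF calls matching the tables above (e.g. in spark-shell with spark.implicits._ in scope):
val df1 = Seq((1, "m"), (2, "n"), (3, "o")).toDF("A", "B")
val df2 = Seq((1, "m"), (2, "n"), (3, "p")).toDF("A", "B")
except(df1, df2).show()  // prints the labelled output above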
Creating multiple columns from an array column
Dataframe
Car name | details
Toyota | [[year,2000],[price,20000]]
Audi | [[mpg,22]]
Expected dataframe
Car name | year | price | mpg
Toyota | 2000 | 20000 | null
Audi | null | null | 22
You can try this.
Let's define the data:
scala> val carsDF = Seq(("toyota",Array(("year", 2000), ("price", 100000))), ("Audi", Array(("mpg", 22)))).toDF("car", "details")
carsDF: org.apache.spark.sql.DataFrame = [car: string, details: array<struct<_1:string,_2:int>>]
scala> carsDF.show(false)
+------+-----------------------------+
|car |details |
+------+-----------------------------+
|toyota|[[year,2000], [price,100000]]|
|Audi |[[mpg,22]] |
+------+-----------------------------+
Splitting the data & accessing the values in the data
scala> carsDF.withColumn("split", explode($"details")).withColumn("col", $"split"("_1")).withColumn("val", $"split"("_2")).select("car", "col", "val").show
+------+-----+------+
| car| col| val|
+------+-----+------+
|toyota| year| 2000|
|toyota|price|100000|
| Audi| mpg| 22|
+------+-----+------+
Define the list of columns that are required
scala> val colNames = Seq("mpg", "price", "year", "dummy")
colNames: Seq[String] = List(mpg, price, year, dummy)
Pivoting on the above-defined column names gives the required output.
Passing the new column names as a sequence to pivot makes them a single point of input.
scala> weDF.groupBy("car").pivot("col", colNames).agg(avg($"val")).show
+------+----+--------+------+-----+
| car| mpg| price| year|dummy|
+------+----+--------+------+-----+
|toyota|null|100000.0|2000.0| null|
| Audi|22.0| null| null| null|
+------+----+--------+------+-----+
This seems a more elegant & easy way to achieve the output.
You can do it like this:
import org.apache.spark.sql.functions.{col, lit, when}
val df: DataFrame = Seq(
("toyota",Array(("year", 2000), ("price", 100000))),
("toyota",Array(("year", 2001)))
).toDF("car", "details")
+------+-------------------------------+
|car |details |
+------+-------------------------------+
|toyota|[[year, 2000], [price, 100000]]|
|toyota|[[year, 2001]] |
+------+-------------------------------+
val newdf = df
.withColumn("year", when(col("details")(0)("_1") === lit("year"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
.withColumn("price", when(col("details")(0)("_1") === lit("price"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
.drop("details")
newdf.show()
+------+----+------+
| car|year| price|
+------+----+------+
|toyota|2000|100000|
|toyota|2001| null|
+------+----+------+
In my requirement, I came across a situation where I have to pass two strings from two columns of my dataframe to a function, get back a string result, and store it back in a dataframe.
While passing the values as strings, the function always returns the same value, so the same value is populated in all rows (in my case, PPPPP is populated in all rows).
Is there a way to pass the elements of those two columns from every row and get the result in separate rows?
I am ready to modify my function to accept a DataFrame and return a DataFrame, or accept an Array[String] and return an Array[String], but I don't know how to do that as I am new to programming. Can someone please help me?
Thanks.
def myFunction(key: String, value: String): String = {
  // Do my functions and get back a string value2 and return this value2 string
  value2
}
val DF2 = DF1.select(
    DF1("col1"),
    DF1("col2"),
    DF1("col5"))
  .withColumn("anyName", lit(myFunction(DF1("col3").toString(), DF1("col4").toString())))
/* DF1:
+-----+----+--------+----+----+
|col1 |col2|col3    |col4|col5|
+-----+----+--------+----+----+
|Hello|5   |valueAAA|XXX |123 |
|How  |3   |valueCCC|YYY |111 |
|World|5   |valueDDD|ZZZ |222 |
+-----+----+--------+----+----+
DF2:
+-----+----+----+-------+
|col1 |col2|col5|anyName|
+-----+----+----+-------+
|Hello|5   |123 |PPPPP  |
|How  |3   |111 |PPPPP  |
|World|5   |222 |PPPPP  |
+-----+----+----+-------+
*/
After you define the function, you need to register it as a udf(). The udf() function is available in org.apache.spark.sql.functions. Check this out:
scala> val DF1 = Seq(("Hello",5,"valueAAA","XXX",123),
| ("How",3,"valueCCC","YYY",111),
| ("World",5,"valueDDD","ZZZ",222)
| ).toDF("col1","col2","col3","col4","col5")
DF1: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 3 more fields]
scala> val DF2 = DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5") )
DF2: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]
scala> DF2.show(false)
+-----+----+----+
|col1 |col2|col5|
+-----+----+----+
|Hello|5 |123 |
|How |3 |111 |
|World|5 |222 |
+-----+----+----+
scala> DF1.select("*").show(false)
+-----+----+--------+----+----+
|col1 |col2|col3 |col4|col5|
+-----+----+--------+----+----+
|Hello|5 |valueAAA|XXX |123 |
|How |3 |valueCCC|YYY |111 |
|World|5 |valueDDD|ZZZ |222 |
+-----+----+--------+----+----+
scala> def myConcat(a:String,b:String):String=
| return a + "--" + b
myConcat: (a: String, b: String)String
scala>
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val myConcatUDF = udf(myConcat(_:String,_:String):String)
myConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))
scala> DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5"), myConcatUDF( DF1("col3"), DF1("col4"))).show()
+-----+----+----+---------------+
| col1|col2|col5|UDF(col3, col4)|
+-----+----+----+---------------+
|Hello| 5| 123| valueAAA--XXX|
| How| 3| 111| valueCCC--YYY|
|World| 5| 222| valueDDD--ZZZ|
+-----+----+----+---------------+
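To store the result under a named column such as the anyName from the question, a small sketch (assuming the myConcatUDF defined above; the alias name is just illustrative) is:
val DF2 = DF1.select(DF1("col1"), DF1("col2"), DF1("col5"),
  myConcatUDF(DF1("col3"), DF1("col4")).alias("anyName"))  // name the UDF output column
DF2.show(false)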
I read data from a CSV file, but it doesn't have an index.
I want to add an index column from 1 to the number of rows.
What should I do? Thanks (Scala)
With Scala you can use:
import org.apache.spark.sql.functions._
df.withColumn("id",monotonicallyIncreasingId)
You can refer to this exemple and scala docs.
With Pyspark you can use:
from pyspark.sql.functions import monotonically_increasing_id
df_index = df.select("*").withColumn("id", monotonically_increasing_id())
monotonically_increasing_id - The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
"I want to add a column from 1 to row's number."
Let's say we have the following DF:
+--------+-------------+-------+
| userId | productCode | count |
+--------+-------------+-------+
| 25 | 6001 | 2 |
| 11 | 5001 | 8 |
| 23 | 123 | 5 |
+--------+-------------+-------+
To generate the IDs starting from 1:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.orderBy("count")
val result = df.withColumn("index", row_number().over(w))
This would add an index column ordered by increasing value of count.
+--------+-------------+-------+-------+
| userId | productCode | count | index |
+--------+-------------+-------+-------+
| 25 | 6001 | 2 | 1 |
| 23 | 123 | 5 | 2 |
| 11 | 5001 | 8 | 3 |
+--------+-------------+-------+-------+
How to get a sequential id column id[1, 2, 3, 4...n]:
from pyspark.sql.functions import desc, row_number, monotonically_increasing_id
from pyspark.sql.window import Window
df_with_seq_id = df.withColumn('index_column_name', row_number().over(Window.orderBy(monotonically_increasing_id())) - 1)
Note that row_number() starts at 1, so subtract 1 if you want a 0-indexed column.
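Since the question asks for Scala, a Scala equivalent of the snippet above (a sketch of the same idea; the column name "index" is just illustrative) would be:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}
// row_number over a window ordered by the non-consecutive id gives 1, 2, 3, ... in the dataframe's current order
val dfWithSeqId = df.withColumn("index",
  row_number().over(Window.orderBy(monotonically_increasing_id())))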
NOTE: monotonically_increasing_id on its own doesn't give a sequential number, only an increasing id.
A simple way to do that and ensure the order of the indexes is zipWithIndex, as below.
Sample data:
+-------------------+
| Name|
+-------------------+
| Ram Ghadiyaram|
| Ravichandra|
| ilker|
| nick|
| Naveed|
| Gobinathan SP|
|Sreenivas Venigalla|
| Jackela Kowski|
| Arindam Sengupta|
| Liangpi|
| Omar14|
| anshu kumar|
+-------------------+
package com.example
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row}
/**
 * DistributedDataIndex : Program to index a DataFrame's rows with zipWithIndex
 */
object DistributedDataIndex extends App with Logging {
  val spark = builder
    .master("local[*]")
    .appName(this.getClass.getName)
    .getOrCreate()
  import spark.implicits._
  val df = spark.sparkContext.parallelize(
    Seq("Ram Ghadiyaram", "Ravichandra", "ilker", "nick",
      "Naveed", "Gobinathan SP", "Sreenivas Venigalla", "Jackela Kowski", "Arindam Sengupta", "Liangpi", "Omar14", "anshu kumar"
    )).toDF("Name")
  df.show
  logInfo("addColumnIndex here")
  // Add index now...
  val df1WithIndex = addColumnIndex(df)
    .withColumn("monotonically_increasing_id", monotonically_increasing_id)
  df1WithIndex.show(false)
  /**
   * Add a column index to each row of the dataframe
   */
  def addColumnIndex(df: DataFrame) = {
    spark.sqlContext.createDataFrame(
      df.rdd.zipWithIndex.map {
        case (row, index) => Row.fromSeq(row.toSeq :+ index)
      },
      // Create schema for index column
      StructType(df.schema.fields :+ StructField("index", LongType, false)))
  }
}
Result :
+-------------------+-----+---------------------------+
|Name |index|monotonically_increasing_id|
+-------------------+-----+---------------------------+
|Ram Ghadiyaram |0 |0 |
|Ravichandra |1 |8589934592 |
|ilker |2 |8589934593 |
|nick |3 |17179869184 |
|Naveed |4 |25769803776 |
|Gobinathan SP |5 |25769803777 |
|Sreenivas Venigalla|6 |34359738368 |
|Jackela Kowski |7 |42949672960 |
|Arindam Sengupta |8 |42949672961 |
|Liangpi |9 |51539607552 |
|Omar14 |10 |60129542144 |
|anshu kumar |11 |60129542145 |
+-------------------+-----+---------------------------+
As Ram said, zipWithIndex is better than monotonically_increasing_id if you need consecutive row numbers. Try this (PySpark environment):
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType
new_schema = StructType(**original_dataframe**.schema.fields[:] + [StructField("index", LongType(), False)])
zipped_rdd = **original_dataframe**.rdd.zipWithIndex()
indexed = (zipped_rdd.map(lambda ri: row_with_index(*list(ri[0]) + [ri[1]])).toDF(new_schema))
where original_dataframe is the dataframe you have to add the index to, and row_with_index is the Row definition including the new index column, which you can write as
row_with_index = Row(
"calendar_date"
,"year_week_number"
,"year_period_number"
,"realization"
,"index"
)
Here, calendar_date, year_week_number, year_period_number and realization were the columns of my original dataframe. You can replace the names with the names of your columns. index is the new column name you had to add for the row numbers.
If you require a unique sequence number for each row, I have a slightly different approach: a static column is added and the row number is computed over a window defined on that column.
val srcData = spark.read.option("header","true").csv("/FileStore/sample.csv")
srcData.show(5)
+--------+--------------------+
| Job| Name|
+--------+--------------------+
|Morpheus| HR Specialist|
| Kayla| Lawyer|
| Trisha| Bus Driver|
| Robert|Elementary School...|
| Ober| Judge|
+--------+--------------------+
val srcDataModf = srcData.withColumn("sl_no",lit("1"))
val windowSpecRowNum = Window.partitionBy("sl_no").orderBy("sl_no")
srcDataModf.withColumn("row_num",row_number.over(windowSpecRowNum)).drop("sl_no").select("row_num","Name","Job")show(5)
+-------+--------------------+--------+
|row_num| Name| Job|
+-------+--------------------+--------+
| 1| HR Specialist|Morpheus|
| 2| Lawyer| Kayla|
| 3| Bus Driver| Trisha|
| 4|Elementary School...| Robert|
| 5| Judge| Ober|
+-------+--------------------+--------+
For SparkR:
(Assuming sdf is some sort of spark data frame)
sdf <- withColumn(sdf, "row_id", SparkR:::monotonically_increasing_id())