Groupby and Subtract Spark Scala

I have a dataframe like below:
group value
B 2
B 3
A 5
A 6
Now I need to subtract rows within each group, i.e. 2-3 and 5-6. After the transformation it should look like this:
group value
B -1
A -1
I tried the code below, but it did not solve my case:
val df2 = df1.groupBy("Group").agg(first("Value")-second(col("Value")))

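I guess you're trying to subtract two neighboring values in order. A window with lead does that; the last row of each group has no successor, so its null difference is dropped:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead
val w = Window.partitionBy("group").orderBy("value")
val df2 = df1.select($"group", ($"value" - lead("value", 1).over(w)).as("value")).na.drop()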

This is working for me.
val df2 = df1.groupBy("Group").agg(first("Value").minus(last(col("Value"))))
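Both answers above depend on row order: first and last follow whatever order the rows happen to arrive in, which Spark does not guarantee after a shuffle, and ordering the window by value always subtracts the larger value from the smaller one. A minimal sketch of an order-stable variant, assuming a hypothetical idx column that records the intended row order (it is not part of the original data):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead}

val spark = SparkSession.builder.master("local[1]").appName("group-subtract").getOrCreate()
import spark.implicits._

// idx is an assumed column that fixes the order of rows inside each group
val df1 = Seq(("B", 2, 0), ("B", 3, 1), ("A", 5, 2), ("A", 6, 3)).toDF("group", "value", "idx")

// Within each group, pair every row with the row that follows it by idx
val w = Window.partitionBy("group").orderBy("idx")
val diffs = df1
  .withColumn("next", lead("value", 1).over(w))
  .where(col("next").isNotNull) // the last row of a group has no successor
  .select(col("group"), (col("value") - col("next")).as("value"))

// Collect to a local map so the result can be inspected without relying on row order
val result = diffs.collect().map(r => (r.getString(0), r.getInt(1))).toMap
```

With exactly two rows per group this yields one difference per group; with more rows it yields one difference per neighboring pair, which may or may not be what is wanted.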

Related

Calculate mean for several columns in Spark scala

I'm looking for a way to calculate some statistic e.g. mean over several selected columns in Spark using Scala. Given that data object is my Spark DataFrame, it's easy to calculate a mean for one column only e.g.
data.agg(avg("var1") as "mean var1").show
Also, we can easily calculate a mean cross-tabulated by values of some other columns e.g.:
data.groupBy("category").agg(avg("var1") as "mean_var1").show
But how can we calculate a mean for a List of columns in a DataFrame? I tried running something like this, but it didn't work:
scala> data.select("var1", "var2").mean().show
<console>:44: error: value mean is not a member of org.apache.spark.sql.DataFrame
data.select("var1", "var2").mean().show
^
This is what you need to do
import org.apache.spark.sql.functions._
import spark.implicits._
val data = Seq((1,2,3), (3,4,5), (1,2,4)).toDF("A", "B", "C")
data.select(data.columns.map(mean(_)): _*).show()
Output:
+------------------+------------------+------+
|            avg(A)|            avg(B)|avg(C)|
+------------------+------------------+------+
|1.6666666666666667|2.6666666666666665|   4.0|
+------------------+------------------+------+
This works for selected columns
data.select(Seq("A", "B").map(mean(_)): _*).show()
Output:
+------------------+------------------+
|            avg(A)|            avg(B)|
+------------------+------------------+
|1.6666666666666667|2.6666666666666665|
+------------------+------------------+
Hope this helps!
If you already have the dataset, describe gives summary statistics for a column:
ds.describe("age").show()
Which will print something like this:
summary age
count 10.0
mean 53.3
stddev 11.6
min 18.0
max 92.0

Creating a new column by applying a function in an existing column in PySpark?

Say I have a dataframe
product_id customers
1 [1,2,4]
2 [1,2]
I want to create a new column, say nb_customer by applying the function len on the column customers.
I tried
df = df.select('*', (map(len, df.customers)).alias('nb_customer'))
but it does not work.
What is the correct way to do that?
Use the built-in size function, which returns the number of elements in an array column:
import pyspark.sql.functions as f
df = sc.parallelize([
    [1, [1, 2, 4]],
    [2, [1, 2]]
]).toDF(('product_id', 'customers'))
df.withColumn('nb_customer', f.size(df.customers)).show()

Spark: Computing correlations of a DataFrame with missing values

I currently have a DataFrame of doubles with approximately 20% of the data being null values. I want to calculate the Pearson correlation of one column with every other column and return the columnId's of the top 10 columns in the DataFrame.
I want to filter out nulls using pairwise deletion, similar to R's pairwise.complete.obs option in its Pearson correlation function. That is, if one of the two vectors in any correlation calculation has a null at an index, I want to remove that row from both vectors.
I currently do the following:
val df = ... //my DataFrame
val cols = df.columns
df.registerTempTable("dataset")
val target = "Row1"
val mapped = cols.map {colId =>
val results = sqlContext.sql(s"SELECT ${target}, ${colId} FROM dataset WHERE (${colId} IS NOT NULL AND ${target} IS NOT NULL)")
(results.stat.corr(colId, target) , colId)
}.sortWith(_._1 > _._1).take(11).map(_._2)
This runs very slowly, as every single map iteration is its own job. Is there a way to do this efficiently, perhaps using Statistics.corr in MLlib, as per this SO question (Spark 1.6 Pearson Correlation)?
DataFrame has "na" functions: see the DataFrameNaFunctions API.
They work the same way the DataFrameStatFunctions do.
You can drop the rows containing a null in either of your two columns with the following syntax:
myDataFrame.na.drop("any", Seq(target, colId))
If you want to drop rows containing a null in any column, it is:
myDataFrame.na.drop("any")
By first limiting the DataFrame to the two columns you care about, you can use the second form and avoid the verbosity.
As such your code would become:
val df = ??? //my DataFrame
val cols = df.columns
val target = "Row1"
val mapped = cols.map {colId =>
val resultDF = df.select(target, colId).na.drop("any")
(resultDF.stat.corr(target, colId) , colId)
}.sortWith(_._1 > _._1).take(11).map(_._2)
Hope this helps you.
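The loop above still launches one job per column. A possible way around that, sketched here under assumptions, is to request every correlation in a single agg call using the corr aggregate from org.apache.spark.sql.functions. To my knowledge that aggregate skips any row where either of its two inputs is null, which amounts to pairwise deletion, but this is worth verifying on your Spark version. The column names t, a and b and the sample values are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, corr}

val spark = SparkSession.builder.master("local[1]").appName("pairwise-corr").getOrCreate()
import spark.implicits._

// Hypothetical data: t is the target; a and b each miss a value in a different row
val rows: Seq[(Option[Double], Option[Double], Option[Double])] = Seq(
  (Some(1.0), Some(2.0), Some(3.0)),
  (Some(2.0), Some(4.0), None),
  (Some(3.0), None, Some(1.0)),
  (Some(4.0), Some(8.0), Some(0.0)),
  (None, Some(100.0), Some(5.0))
)
val df = rows.toDF("t", "a", "b")

val target = "t"
val others = df.columns.filter(_ != target)

// One aggregate expression per column, all evaluated in the same job
val exprs = others.map(c => corr(col(target), col(c)).as(c))
val row = df.agg(exprs.head, exprs.tail: _*).first()

// Rank the columns by correlation with the target, highest first
val ranked = others.map(c => (row.getAs[Double](c), c)).sortWith(_._1 > _._1)
val top = ranked.take(10).map(_._2)
```

On the complete pairs, a equals 2 * t and b equals 4 - t, so a should rank first with a correlation near 1 and b last with a correlation near -1.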

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I can see, I want to use this kind of function:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Am I using the function sum incorrectly?
Do I need to use the function map first? And if so, how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
// res1: Int = 19
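Simply apply the sum aggregation to your column. In PySpark, group by nothing so that the whole column is summed (grouping by 'steps' would instead give one sum per distinct value):
df.groupBy().sum('steps').show()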
Follow the Documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link also https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked, but:
df.describe("columnName").show()
gives count, mean, stddev, min and max stats for a column (not the sum). It returns them for all numeric columns if you just call df.describe().show().
Using a Spark SQL query, just in case it helps anyone:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._ // needed for toDF
// name the column "steps" so the SQL below can refer to it
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")
val sum = spark.sql("select sum(steps) as stepsSum from steps").first.getLong(0)
println("steps sum = " + sum) // prints 28

Filter out rows with NaN values for certain column

I have a dataset in which some rows have NaN as an attribute value. This data is loaded into a dataframe, and I would like to use only the rows where all attributes have values. I tried doing it via SQL:
val df_data = sqlContext.sql("SELECT * FROM raw_data WHERE attribute1 != NaN")
I tried several variants on this, but I can't seem to get it working.
Another option would be to transform it to an RDD and then filter it, since filtering this dataframe to check whether an attribute is NaN does not work.
I know you accepted the other answer, but you can do it without the explode (which should perform better than doubling your DataFrame size).
Prior to Spark 1.6, you could use a udf like this, negated to keep only the non-NaN rows:
def isNaNudf = udf[Boolean,Double](d => d.isNaN)
df.filter(!isNaNudf($"value"))
As of Spark 1.6, you can use the built-in SQL function isnan() instead:
df.filter(!isnan($"value"))
Here is some sample code that shows you my way of doing it -
import sqlContext.implicits._
val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")
val df2 = df.explode[Double, Boolean]("value", "isNaN")(d => Seq(d.isNaN))
df will have -
df.show
id value
1 0.5
2 NaN
while doing filter on df2 will give you what you want -
df2.filter($"isNaN" !== true).show
id value isNaN
1 0.5 false
This works:
where isNaN(tau_doc) = false
e.g.
val df_data = sqlContext.sql("SELECT * FROM raw_data where isNaN(attribute1) = false")
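NaN and null are different things in Spark, and the approaches above only test for NaN. A small sketch of the difference, on made-up data: as far as I know, isnan returns false for a null input, so negating it keeps null rows, while na.drop restricted to a column removes rows that are either null or NaN in that column. The column names and values here are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, isnan}

val spark = SparkSession.builder.master("local[1]").appName("nan-filter").getOrCreate()
import spark.implicits._

// Hypothetical data: id 2 holds NaN, id 3 holds null
val rows: Seq[(Int, Option[Double])] = Seq((1, Some(0.5)), (2, Some(Double.NaN)), (3, None))
val df = rows.toDF("id", "value")

// Removes only the NaN row; the null row passes because isnan(null) is false
val withoutNaN = df.filter(!isnan(col("value")))
val idsWithoutNaN = withoutNaN.collect().map(_.getInt(0)).sorted

// Removes rows whose value is null or NaN
val complete = df.na.drop(Seq("value"))
val idsComplete = complete.collect().map(_.getInt(0)).sorted
```

If the goal is to keep only rows where every attribute has a usable value, calling na.drop() with no column list applies the same check to all columns.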