Is there a way of counting approximately after a group by on an sql dataset in Spark? Or more generally, what is the fastest way of group by counting in Spark?

I am not sure this is what you are looking for, but
approx_count_distinct and countDistinct
are what the Spark API provides.
There is no approx_count_groupby.
Example:
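package examples
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
object CountAgg extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN) // keep Spark's own logging out of the output
  // local[*] is an assumption so that the example runs standalone
  val spark = SparkSession.builder.appName(getClass.getName).master("local[*]").getOrCreate()
  import spark.implicits._
  import org.apache.spark.sql.functions._
  val df = Seq(
    // the sample (page, visitor) rows are missing from the source
  ).toDF("Page", "Visitor")
  println("group by and count example")
  df.groupBy($"page").count.show
  println("group by and countDistinct")
  df.select("page", "visitor")
    .groupBy($"page")
    .agg(countDistinct($"visitor")).show
  println("group by and approx_count_distinct")
  df.select("page", "visitor")
    .groupBy($"page")
    .agg(approx_count_distinct($"visitor")).show
}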
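Result:
group by and count example
+-----+-----+
| page|count|
+-----+-----+
|PAGE2|    4|
|PAGE1|    8|
+-----+-----+
group by and countDistinct
+-----+-----------------------+
| page|count(DISTINCT visitor)|
+-----+-----------------------+
|PAGE2|                      2|
|PAGE1|                      3|
+-----+-----------------------+
group by and approx_count_distinct
+-----+------------------------------+
| page|approx_count_distinct(visitor)|
+-----+------------------------------+
|PAGE2|                             2|
|PAGE1|                             3|
+-----+------------------------------+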
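
On the "approximately" part of the question: approx_count_distinct also has an overload that takes a maximum relative standard deviation (rsd), which trades accuracy for speed and memory. The following is only a sketch, meant to be pasted into spark-shell or run as a Scala script with Spark on the classpath. The sample rows, the 0.1 tolerance and the names visits and approx_visitors are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approx_count_distinct

// local[*] is an assumption so that the sketch runs without a cluster
val spark = SparkSession.builder.appName("approx-count-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample data, not the rows from the answer above
val visits = Seq(
  ("PAGE1", "u1"), ("PAGE1", "u2"), ("PAGE1", "u1"),
  ("PAGE2", "u3")
).toDF("page", "visitor")

// rsd = 0.1 allows a larger estimation error than the default in exchange for a smaller sketch
val approx = visits
  .groupBy("page")
  .agg(approx_count_distinct("visitor", 0.1).as("approx_visitors"))

approx.show()
```

A plain groupBy("page").count is exact and usually cheap. The approximate variant tends to pay off mainly for distinct counts over high-cardinality columns.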


How to apply udf on a dataframe and on a column in scala?

I am a beginner in Scala, working in the Scala REPL window in IntelliJ.
I have a sample DataFrame and am trying to write a user-defined function (not a built-in one) to understand how UDFs work.
scala> import org.apache.spark.sql.SparkSession
scala> val spark: SparkSession = SparkSession.builder.appName("elephant").config("spark.master", "local[*]").getOrCreate()
scala> val df = spark.createDataFrame(Seq(("A",1),("B",2),("C",3))).toDF("Letter", "Number")
scala> df.show()
+------+------+
|Letter|Number|
+------+------+
|     A|     1|
|     B|     2|
|     C|     3|
+------+------+
UDF for the DataFrame filter:
scala> def kill_4(n: String): Boolean = {
     |   if (n == "A") { true } else { false } } // please check whether this is correct
I tried:
df.withColumn("new_col", kill_4(col("Letter"))).show() // what is the correct way to call it?
error: type mismatch
I also tried a direct filter.
Desired output:
+------+------+
|Letter|Number|
+------+------+
|     B|     2|
|     C|     3|
+------+------+
You can wrap the function as a UDF and use it in your code as follows:
import org.apache.spark.sql.functions.{col, udf}
def kill_4(n: String): Boolean = {
  if (n == "A") { true } else { false }
}
// udf() turns the plain Scala function into something that accepts Columns
val kill_udf = udf((x: String) => kill_4(x))
df.select(col("Letter"), col("Number"),
  kill_udf(col("Letter")).as("Kill_4")).show(false)
Please look at the Databricks documentation on Scala user-defined functions.
You do not need the Spark session code to create a DataFrame, so I removed it.
Your function had a couple of bugs. Since it is very small, I created an inline one. The udf() call allows the function to be used with DataFrames. The call to register allows it to be used with Spark SQL.
A quick SQL statement shows that the function works.
Last but not least, we need the udf() and col() functions for the last statement to work.
In short, these three snippets solve your problem.
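
As a rough sketch of the three steps this answer describes (wrap with udf(), register for Spark SQL, then use the UDF with col()), something along these lines should work. It is not the answer's original code: the names isA, is_a and letters are hypothetical, and the session setup is included only so that the sketch is self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Session setup included only to keep the sketch runnable on its own
val spark = SparkSession.builder.appName("udf-sketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(("A", 1), ("B", 2), ("C", 3))).toDF("Letter", "Number")

// Inline function: true when the letter is "A"
val isA = (s: String) => s == "A"

// Step 1: wrap it for the DataFrame API and register it for Spark SQL
val isAUdf = udf(isA)
spark.udf.register("is_a", isA)

// Step 2: check the registered function with a quick SQL statement
df.createOrReplaceTempView("letters")
val viaSql = spark.sql("SELECT Letter, Number FROM letters WHERE NOT is_a(Letter)")
viaSql.show()

// Step 3: use the UDF on a column to keep only the rows that are not "A"
val viaDf = df.filter(!isAUdf(col("Letter")))
viaDf.show()
```

For a comparison as simple as this one, a built-in column expression such as df.filter(col("Letter") =!= "A") gives the same rows without a UDF.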

Sum vector columns in spark

I have a DataFrame with multiple columns that contain vectors (the number of vector columns is dynamic). I need to create a new column holding the sum of all the vector columns, and I'm having a hard time getting this done. Here is code to generate the sample dataset I'm testing on.
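import org.apache.spark.ml.feature.VectorAssembler
val temp1 = spark.createDataFrame(Seq(
  (1, 1.0, 0.0, 4.7, 6, 0.0), (2, 1.0, 0.0, 6.8, 6, 0.0),
  (3, 1.0, 1.0, 7.8, 5, 0.0), (4, 0.0, 1.0, 4.1, 7, 0.0),
  (5, 1.0, 0.0, 2.8, 6, 1.0), (6, 1.0, 1.0, 6.1, 5, 0.0),
  (7, 0.0, 1.0, 4.9, 7, 1.0), (8, 1.0, 0.0, 7.3, 6, 0.0)
)).toDF("id", "f1", "f2", "f3", "f4", "label")
val assembler1 = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3")).setOutputCol("vec1")
val temp2 = assembler1.setHandleInvalid("skip").transform(temp1)
val assembler2 = new VectorAssembler()
  .setInputCols(Array("f2", "f3", "f4")).setOutputCol("vec2")
val df = assembler2.setHandleInvalid("skip").transform(temp2)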
This gives me the following dataset
+---+---+---+---+---+-----+-------------+-------------+
| id| f1| f2| f3| f4|label|         vec1|         vec2|
+---+---+---+---+---+-----+-------------+-------------+
|  1|1.0|0.0|4.7|  6|  0.0|[1.0,0.0,4.7]|[0.0,4.7,6.0]|
|  2|1.0|0.0|6.8|  6|  0.0|[1.0,0.0,6.8]|[0.0,6.8,6.0]|
|  3|1.0|1.0|7.8|  5|  0.0|[1.0,1.0,7.8]|[1.0,7.8,5.0]|
|  4|0.0|1.0|4.1|  7|  0.0|[0.0,1.0,4.1]|[1.0,4.1,7.0]|
|  5|1.0|0.0|2.8|  6|  1.0|[1.0,0.0,2.8]|[0.0,2.8,6.0]|
|  6|1.0|1.0|6.1|  5|  0.0|[1.0,1.0,6.1]|[1.0,6.1,5.0]|
|  7|0.0|1.0|4.9|  7|  1.0|[0.0,1.0,4.9]|[1.0,4.9,7.0]|
|  8|1.0|0.0|7.3|  6|  0.0|[1.0,0.0,7.3]|[0.0,7.3,6.0]|
+---+---+---+---+---+-----+-------------+-------------+
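If I needed to take the sum of regular columns, I could do it using something like:
import org.apache.spark.sql.functions.col
// colsToSum stands in for the list of column names, which is missing from the source
df.withColumn("sum", colsToSum.map(col).reduce((c1, c2) => c1 + c2))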
I know I can use Breeze to sum DenseVectors with just the "+" operator:
import breeze.linalg._
val v1 = DenseVector(1,2,3)
val v2 = DenseVector(5,6,7)
v1 + v2
So, the above code gives me the expected vector. But I'm not sure how to apply this to the vector columns and sum vec1 and vec2.
I did try the suggestions mentioned here, but had no luck.
Here's my take, but coded in PySpark. Someone can probably help in translating this to Scala:
from pyspark.ml.linalg import Vectors, VectorUDT
import numpy as np
from pyspark.sql.functions import udf, array
def vector_sum(arr):
    # element-wise sum across the vectors collected in the array column
    return Vectors.dense(np.sum(arr, axis=0))
vector_sum_udf = udf(vector_sum, VectorUDT())
df = df.withColumn('sum', vector_sum_udf(array(['vec1', 'vec2'])))
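
A possible Scala counterpart of the same idea is sketched below: pack the vector columns into an array column and sum them element-wise in a UDF. This is an untested translation under assumptions, not a confirmed solution. The two-row DataFrame is a stand-in for the question's data, the names vectorSum, vecCols and summed are hypothetical, and the sketch assumes every vector in a row has the same length.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, udf}

val spark = SparkSession.builder.appName("vector-sum-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Two-row stand-in for the question's DataFrame (first two rows of vec1 and vec2)
val df = Seq(
  (1, Vectors.dense(1.0, 0.0, 4.7), Vectors.dense(0.0, 4.7, 6.0)),
  (2, Vectors.dense(1.0, 0.0, 6.8), Vectors.dense(0.0, 6.8, 6.0))
).toDF("id", "vec1", "vec2")

// Element-wise sum of all vectors in the array; assumes equal lengths
val vectorSum = udf { (vs: Seq[Vector]) =>
  Vectors.dense(
    vs.map(_.toArray).reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  )
}

// Pick the vector columns dynamically; here by a name prefix, which is an assumption
val vecCols = df.columns.filter(_.startsWith("vec"))
val summed = df.withColumn("sum", vectorSum(array(vecCols.map(col): _*)))

summed.show(false)
```

Because the column list is built at runtime, the same call should cover any number of vector columns, which is what the question asks for.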

How to add header and column to dataframe spark?

I have a DataFrame to which I want to manually add a header and a first column. Here is the DataFrame:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema",true).csv("C:\\gg.csv").cache()
the content of the dataframe
The expected output is
What you want to do is:
df.withColumn("columnName", column) // here "columnName" should be "define" in your case
Now you just need to create said column (this might help).
Here is a solution that depends on Spark 2.4:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.Row
// First off, the DataFrame needs to be loaded with the expected schema
// (master and app name below are taken from the question)
val spark = SparkSession.builder().master("local").appName("my-spark-app").getOrCreate()
val schema = new StructType()
// the .add(...) calls that define the CSV columns are missing from the source
val df = spark.read.format("csv").schema(schema).load("C:\\gg.csv").cache()
val rddWithId = df.rdd.zipWithIndex
// Prepend a "define" column of type String
val newSchema = StructType(Array(StructField("define", StringType, false)) ++ df.schema.fields)
val dfZippedWithId = spark.createDataFrame(rddWithId.map {
  case (row, index) =>
    Row.fromSeq(Array("c" + index) ++ row.toSeq)
}, newSchema)
// Show results
dfZippedWithId.show
| c0| 12| 13| 14|
| c1| 11| 10| 5|
| c2| 3| 2| 45|
This is a mix of the documentation here and this example.
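
For trying the zipWithIndex approach without the CSV file, a self-contained variant might look like the sketch below. The values are the ones shown in the output above. The column names a, b and c are hypothetical, since the real header is not shown, and the names rows and labelled are introduced only for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.appName("prepend-column-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// In-memory stand-in for the CSV; column names are hypothetical
val df = Seq((12, 13, 14), (11, 10, 5), (3, 2, 45)).toDF("a", "b", "c")

// New schema: a non-nullable String column "define" followed by the original fields
val newSchema = StructType(StructField("define", StringType, nullable = false) +: df.schema.fields)

// Pair each row with its index and prepend the label "c<index>"
val rows = df.rdd.zipWithIndex.map { case (row, index) =>
  Row.fromSeq(s"c$index" +: row.toSeq)
}

val labelled = spark.createDataFrame(rows, newSchema)
labelled.show()
```

The labels follow the order of rows in the underlying RDD, so they are only as stable as that ordering.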

Column manipulations in Spark Scala

I am learning to work with Apache Spark (Scala) and still figuring out how things work out here
I am trying to achieve a simple task:
1. Find the max of a column
2. Subtract each value of the column from this max and create a new column
The code I am using is:
import org.apache.spark.sql.functions._
val training = sqlContext.createDataFrame(Seq(
val training_max = training.withColumn("Val_Max",training.groupBy().agg(max("Values"))
val training_max_sub = training_max.withColumn("Subs",training_max.groupBy().agg(col("Val_Max")-col("Values) ))
However I am getting a lot of errors. I am more or less fluent in R and had I been doing the same task my code would have been:
new_data <- training %>%
mutate(Subs= max(Values) - Values)
Here is a solution using window functions. You'll need a HiveContext to use them
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val training = sc.parallelize(Seq(10,13,14,21)).toDF("values")
// "Subs" is the new column name used in the question
training.withColumn("Subs", max($"values").over(Window.partitionBy()) - $"values").show
Which produces the expected output:
| 10| 11|
| 13| 8|
| 14| 7|
| 21| 0|
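
A window is not the only route. As a minimal sketch, assuming a Spark 2.x or later SparkSession instead of the HiveContext used above, the max can be computed once with agg and then subtracted as a literal. The names maxValue and result are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, max}

val spark = SparkSession.builder.appName("max-minus-value-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val training = Seq(10, 13, 14, 21).toDF("values")

// Step 1: compute the global max once and bring it back as a plain Int
val maxValue = training.agg(max("values")).first().getInt(0)

// Step 2: subtract every value from that max in a new column
val result = training.withColumn("subs", lit(maxValue) - col("values"))

result.show()
```

This runs two jobs (one for the max, one for the final result) but avoids moving all rows into a single window partition.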

How to convert a dataframe column to sequence

I have a dataframe as below:
| 4| inhibitori_effect|
| 4| novel_therapeut|
| 4| antiinflammator...|
| 4| promis_approach|
| 4| cell_function|
| 4| cell_line|
| 4| cancer_cell|
I want to create a new dataframe by taking all terms as sequence so that I can use them with Word2vec. That is:
| 4| inhibitori_effect, novel_therapeut,..., cell_line |
As a result I want to apply this sample code as given here:
So far I have tried to convert df to RDD and map it. And then I could not manage to re-convert it to a df.
Thanks in advance.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
val sc = new SparkContext(conf)
val sqlContext: SQLContext = new HiveContext(sc)
val df = sqlContext.load("jdbc",Map(
"url" -> "jdbc:oracle:thin:...",
"dbtable" -> "table"))
You can use the collect_list or collect_set functions:
import org.apache.spark.sql.functions.{collect_list, collect_set}
In Spark < 2.0 these functions require a HiveContext. From Spark 2.0 on they are built-in aggregates, so enabling Hive support on the session builder should not be necessary. See Use collect_list and collect_set in Spark SQL
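
A minimal sketch of how collect_list could be applied here, assuming a recent Spark version with a plain SparkSession. The question's table header is not shown, so the column names id and term are hypothetical, as are the names terms and grouped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder.appName("collect-list-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// A few rows from the question; the column names are assumptions
val terms = Seq(
  (4, "inhibitori_effect"),
  (4, "novel_therapeut"),
  (4, "cell_line")
).toDF("id", "term")

// One row per id, with all of its terms gathered into an array column
val grouped = terms.groupBy("id").agg(collect_list("term").as("terms"))

grouped.show(false)
```

The resulting terms column is an array of strings, which is the input shape Word2Vec expects. collect_set would drop duplicate terms, and neither function guarantees the order of the collected elements.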