Dataframe: how to groupBy/count then order by count in Scala - scala

I have a dataframe that contains thousands of rows. What I'm looking for is to group by and count a column, and then order by that count. What I did is something like this:
import org.apache.spark.sql.hive.HiveContext
val objHive = new HiveContext(sc)
import objHive.implicits._
val df = objHive.sql("select * from db.tb")
val df_count = df.groupBy("id").count().collect()
df_count.sort($"count".asc).show()

You can use sort or orderBy as below:
import org.apache.spark.sql.functions.desc
val df_count = df.groupBy("id").count()
df_count.sort(desc("count")).show(false)
df_count.orderBy($"count".desc).show(false)
Don't use collect() since it brings the data to the driver as an Array.
Hope this helps!
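Putting it together with the HiveContext setup from the question, a minimal end-to-end sketch (keeping the db.tb table name from the question, and ordering ascending as originally intended) could look like this:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.asc
val objHive = new HiveContext(sc)
val df = objHive.sql("select * from db.tb")
// group, count, and order on the cluster; show instead of collect
df.groupBy("id")
  .count()
  .orderBy(asc("count"))
  .show(false)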

// import SparkSession, the entry point to the Spark SQL API
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val pathOfFile = "f:/alarms_files/"
// create the session and hold it in the spark variable
val spark = SparkSession.builder().appName("myApp").getOrCreate()
// read the file; the API below returns a DataFrame of Rows
var df = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", "\t")
  .load("file://" + pathOfFile + "db.tab")
// group by the id column, count the rows per group, and order by that count
df = df.groupBy(df("id")).agg(count("*").as("columnCount")).orderBy("columnCount")
// show the DataFrame (displays only the top 20 rows by default)
df.show
// to show more than 20 rows, e.g. 50:
df.show(50)

Related

loop a sequence of s3 parquet file path with same schema and save in a single dataframe in scala

A seq of S3 locations is given. The only difference between any two locations is the value of the table's partition column.
Each parquet folder has the same schema.
So we need to loop over the sequence of S3 parquet file paths with the same schema and save them in a single dataframe in Scala.
If you have an array with all of the directories you want to import, you can iterate over that array, build a collection of dataframes, and then union them into a single one.
Try something like this.
// build a collection of DataFrames, one per directory
val dataframes = directories.map(dir => spark.read.parquet(dir))
// union them into a single DataFrame
val df_union = dataframes.reduce(_ union _)
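For completeness, a minimal sketch of how the directories collection above might be built; the bucket name, table path, and partition values are illustrative assumptions, not taken from the question:
// hypothetical S3 layout, one folder per partition value
val basePath = "s3a://my-bucket/my-table"
val partitionValues = Seq("2021-01-01", "2021-01-02", "2021-01-03")
val directories = partitionValues.map(v => s"$basePath/dt=$v")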
If you turn on the following options, you can simply load the files recursively.
spark.read.parquet("s3a://path/to/root/")
The options are as follows.
spark.hive.mapred.supports.subdirectories true
spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive true
This can be used as follows:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SparkSession
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("test")
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
val spark = SparkSession.builder.config(conf).getOrCreate()
val df = spark.read.parquet("s3a://path/to/root/")
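Worth noting as an alternative (an assumption about your layout, not something stated in the question): DataFrameReader.parquet accepts multiple paths, so if you already have the list of directories you can read them in one call, and the basePath option lets Spark keep the partition column:
// read all directories in one call; basePath preserves the partition column
val df = spark.read
  .option("basePath", "s3a://path/to/root/")
  .parquet(directories: _*)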

Scala csv file read and display the data in new column

I am new to Scala. I need to read data from a csv file which has two header columns, named Name and Marks; based on the Marks column I want to show the result in a 3rd column, pass or fail (< 35 fail, > 35 pass).
The data looks like this:
Name,Marks
x,10
y,50
z,80
Result should be:
Name Marks Result
x 10 Fail
Y 50 Pass
z 80 Pass
You can read the csv file with a header, then add a column using when and otherwise to give different values depending on the marks.
import org.apache.spark.sql.functions.when
import spark.implicits._
val df = spark.read.option("header", true).csv("/path/to/csv") // read the csv
val df2 = df.withColumn("Result", when($"Marks" < 35, "Fail").otherwise("Pass"))
Alternatively, casting Marks explicitly to an integer:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder.master("local")
  .appName("").config("spark.sql.warehouse.dir", "C:/temp").getOrCreate()
val df = spark.read.option("header", true).csv("file path")
val result = df.withColumn("Result", when(col("Marks").cast("Int") >= 35, "PASS").otherwise("FAIL"))
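With the sample data from the question, the result can be checked with show; a sketch assuming the csv at "file path" contains exactly the three rows above:
result.show()
// +----+-----+------+
// |Name|Marks|Result|
// +----+-----+------+
// |   x|   10|  FAIL|
// |   y|   50|  PASS|
// |   z|   80|  PASS|
// +----+-----+------+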

Spark - UnsupportedOperationException: collect_list is not supported in a window operation

I am using Spark 1.6. I have a dataframe generated from a parquet file with 6 columns. I am trying to group (partitionBy) and order (orderBy) the rows in the dataframe, to later collect those columns into an Array.
I wasn't sure whether these operations were possible in Spark 1.6, but the following answers show how it can be done:
https://stackoverflow.com/a/35529093/1773841 #zero323
https://stackoverflow.com/a/45135012/1773841 #Ramesh Maharjan
Based on those answers I wrote the following code:
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, desc, struct}
val sqlContext: SQLContext = new HiveContext(sc)
val conf = sc.hadoopConfiguration
val dataPath = "/user/today/*/*"
val dfSource: DataFrame = sqlContext.read.format("parquet")
  .option("dateFormat", "DDMONYY")
  .option("timeFormat", "HH24:MI:SS")
  .load(dataPath)
val w = Window.partitionBy("code").orderBy(desc("date"))
val dfCollec = dfSource.withColumn("collected",
  collect_list(struct("col1", "col2", "col3", "col4", "col5", "col6")).over(w))
So I followed the pattern written by Ramesh, and I created the sqlContext based on Hive as zero323 recommended. But I am still getting the following error:
java.lang.UnsupportedOperationException:
'collect_list(struct('col1,'col2,'col3,'col4,'col5,'col6)) is not
supported in a window operation.
at org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:191)
at org.apache.spark.sql.Column.over(Column.scala:1052)
What am I still missing?
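A possible workaround, sketched under the assumption that one collected array per code (rather than one per window row) is acceptable: in Spark 1.6 the DataFrame API rejects collect_list inside a window, but with a HiveContext it works as a regular aggregate. Note that groupBy gives no ordering guarantee inside the list.
import org.apache.spark.sql.functions.{collect_list, struct}
// aggregate per group instead of per window row
val dfGrouped = dfSource
  .groupBy("code")
  .agg(collect_list(struct("col1", "col2", "col3", "col4", "col5", "col6")).as("collected"))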

how to create multiple list from dataframe in spark?

How can I create multiple lists from a dataframe in Spark?
In my case, I want to order MongoDB documents while grouping by a specific key, and create multiple lists grouped on the basis of one key of the schema.
Please help me.
val sparkSession = SparkSession.builder().getOrCreate()
MongoSpark.load[SparkSQL.Character](sparkSession).printSchema()
val characters = MongoSpark.load[SparkSQL.Character](sparkSession)
characters.createOrReplaceTempView("characters")
val sqlstmt = sparkSession.sql("SELECT * FROM characters WHERE site = 'website'")
...
You can do something like this, where key is the column you want to group on:
import org.apache.spark.sql.functions._
val columns = sqlstmt.columns.map(col)
sqlstmt
  .groupBy(key) // key: the grouping column, e.g. col("site")
  .agg(collect_list(struct(columns: _*)).as("data"))
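If the goal is literally to end up with one Scala list per key on the driver, a small follow-up sketch building on the snippet above (the grouping column "site" and the variable names are illustrative assumptions):
import org.apache.spark.sql.Row
// one row per key, each carrying an array of the original rows
val grouped = sqlstmt
  .groupBy(col("site")) // assumed grouping column, adjust to your key
  .agg(collect_list(struct(columns: _*)).as("data"))
// pull each group back to the driver as key -> list of rows
val listsPerKey: Map[String, Seq[Row]] =
  grouped.collect().map(r => r.getString(0) -> r.getSeq[Row](1)).toMap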

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I can see, I want to use this kind of function:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Am I using the sum function wrongly?
Do I need to use the map function first? And if so, how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
Alternatively, if you want to sum all values of one column, you can also work on the DataFrame's underlying RDD and use reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10, 2, 3, 4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_ + _)
// res1: Int = 19
Simply apply the aggregation function sum to your column:
df.groupBy().sum("steps").show()
Follow the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link also: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked, but:
df.describe("columnName").show()
gives count, mean, stddev, min and max stats for that column. I think it returns stats for all columns if you just call df.describe().show().
Using a Spark SQL query... just in case it helps anyone!
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")
// sum over an integer column returns a bigint, hence getLong
val sum = spark.sql("select sum(steps) as stepsSum from steps").first().getLong(0)
println("steps sum = " + sum) // prints 28