how to create multiple list from dataframe in spark? - mongodb

how to create multiple list from dataframe in spark.
In my case, I want to order mongodb documents with grouping specific key. and create multiple list which is grouped on the basis of one key of schema
please help me
sparkSession = SparkSession.builder().getOrCreate()
MongoSpark.load[SparkSQL.Character](sparkSession).printSchema()
val characters = MongoSpark.load[SparkSQL.Character](sparkSession)
characters.createOrReplaceTempView("characters")
val sqlstmt = sparkSession.sql("SELECT * FROM characters WHERE site = 'website'")
...

You can do something like this:
val columns = sqlstmt.columns.map(col)
task1
.groupBy(key)
.agg(collect_list(struct(columns: _*)).as("data"))
Don't forget to import
import org.apache.spark.sql.functions._

Related

loop a sequence of s3 parquet file path with same schema and save in a single dataframe in scala

What is needed to a seq of s3 location is given. The difference for any two location is the partition column value of the table.
Each parquet folder has same schema .
So we need to loop the sequence of s3 parquet file path with same schema and save in a single dataframe in scala along.
If you have an array with all of directories you want to import you can iterate over that array and make a collection of dataframes and then union them into a single one.
Try something like this.
//You have now a collection of dataframes
val dataframes = directories.map(dir =>
spark.read.parquet(dir))
//Let's union them into one
val df_union = dataframes.reduce(_ union _)
If you turn on the options, then simply you can load the files recursively.
spark.read.parquet("s3a://path/to/root/")
The options are as follows.
spark.hive.mapred.supports.subdirectories true
spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive true
This can be used in a way that
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SparkSession
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("test")
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
val spark = SparkSession.builder.config(conf).getOrCreate()
val df = spark.read.parquet("s3a://path/to/root/")

Filter Spark Dataframe with list of values in Scala

I am trying to create a dataframe from hive table using SparkSession like below. Once created I am filtering the rows by a list of Ids.
val myDF = spark.sql("select * from myhivetable")
val someDF = mfiDF.where(mfiDF("id").isin(myList:_*))
Instead of this approach is there a way I can query the hive table as below:
val myDF = spark.sql("select * from myhivetable").where (("id").isin(myList:_*))
When I try like this I am getting a compilation error.
Could someone suggest a best approach for this. Thanks.
You could also do an inner join to remove unwanted ids, something like below may work.
val ids = sc.parallelize(myList).toDF("id")
someDF.join(ids, ids.id === someDF.id)

Dataframe: how to groupBy/count then order by count in Scala

I have a dataframe that contains a thousands of rows, what I'm looking for is to group by and count a column and then order by the out put: what I did is somthing looks like :
import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._
val objHive = new HiveContext(sc)
val df = objHive.sql("select * from db.tb")
val df_count=df.groupBy("id").count().collect()
df_count.sort($"count".asc).show()
You can use sort or orderBy as below
val df_count = df.groupBy("id").count()
df_count.sort(desc("count")).show(false)
df_count.orderBy($"count".desc).show(false)
Don't use collect() since it brings the data to the driver as an Array.
Hope this helps!
//import the SparkSession which is the entry point for spark underlying API to access
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val pathOfFile="f:/alarms_files/"
//create session and hold it in spark variable
val spark=SparkSession.builder().appName("myApp").getOrCreate()
//read the file below API will return DataFrame of Row
var df=spark.read.format("csv").option("header","true").option("delimiter", "\t").load("file://"+pathOfFile+"db.tab")
//groupBY id column and take count of the column and order it by count of the column
df=df.groupBy(df("id")).agg(count("*").as("columnCount")).orderBy("columnCount")
//for projecting the dataFrame it will show only top 20 records
df.show
//for projecting more than 20 records eg:
df.show(50)

remove a column from a dataframe spark

I have a Spark dataframe with a very large number of columns. I want to remove two columns from it to get a new dataframe.
Had there been fewer columns, I could have used the select method in the API like this:
pcomments = pcomments.select(pcomments.col("post_id"),pcomments.col("comment_id"),pcomments.col("comment_message"),pcomments.col("user_name"),pcomments.col("comment_createdtime"));
But since picking columns from a long list is a tedious task, is there a workaround?
Use drop method and withColumnRenamed methods.
Example:
val initialDf= ....
val dfAfterDrop=initialDf.drop("column1").drop("coumn2")
val dfAfterColRename= dfAfterDrop.withColumnRenamed("oldColumnName","new ColumnName")
Try this:
val initialDf = ...
val dfAfterDropCols = initialDf.drop("column1", "coumn2")

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I see I want to use these kind of functions:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Do I use the function sum wrongly?
Do Ι need to use first the function map? and if yes how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
//res1 Int = 19
Simply apply aggregation function, Sum on your column
df.groupby('steps').sum().show()
Follow the Documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link also https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked but:
df.describe().show("columnName")
gives mean, count, stdtev stats on a column. I think it returns on all columns if you just do .show()
Using spark sql query..just incase if it helps anyone!
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import java.util.stream.Collectors
val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF()
df.createOrReplaceTempView("steps")
val sum = spark.sql("select sum(steps) as stepsSum from steps").map(row => row.getAs("stepsSum").asInstanceOf[Long]).collect()(0)
println("steps sum = " + sum) //prints 28