Transform rows to columns in Spark Scala SQL - scala

I have a database table containing user ids and the items they clicked, e.g.
user id, item id
1, 345
1, 78993
1, 784
5, 345
5, 897
15, 454
and I want to transform this data into the following format using Spark SQL (in Scala if possible):
user id, item ids
1, 345, 78993, 784
5, 345, 897
15, 454
Thanks,

A local example:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._

object Main extends App {

  case class Record(user: Int, item: Int)

  val items = List(
    Record(1, 345),
    Record(1, 78993),
    Record(1, 784),
    Record(5, 345),
    Record(5, 897),
    Record(15, 454)
  )

  val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local"))
  val hiveContext = new HiveContext(sc)
  import hiveContext.implicits._
  import hiveContext.sql

  val df = sc.parallelize(items).toDF()
  df.registerTempTable("records")

  sql("SELECT * FROM records").collect().foreach(println)
  sql("SELECT user, collect_set(item) FROM records GROUP BY user").collect().foreach(println)
}
This produces:
[1,ArrayBuffer(78993, 784, 345)]
[5,ArrayBuffer(897, 345)]
[15,ArrayBuffer(454)]

This is a pretty simple groupByKey scenario. If you want to do further processing with the result afterwards, though, I would suggest using one of the more efficient PairRDDFunctions, as groupByKey is inefficient for follow-up queries.
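For instance, here is a minimal sketch (not part of the original answer) that builds the same per-user sets with aggregateByKey, reusing the items list and sc from the snippet above; aggregateByKey combines values map-side, which is usually cheaper than groupByKey when you go on to process the result:
// Sketch: group items per user at the RDD level with map-side combining.
val grouped = sc.parallelize(items)
  .map(r => (r.user, r.item))
  .aggregateByKey(Set.empty[Int])(
    (acc, item) => acc + item,   // add an item to the partition-local set
    (a, b) => a ++ b             // merge sets coming from different partitions
  )
grouped.collect().foreach { case (user, itemIds) =>
  println(s"$user, ${itemIds.mkString(", ")}")
}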

Related

How to get all different records in two different spark rdd

Very, very new to Spark and RDDs, so I hope I explain what I'm after well enough for someone to understand and help :)
I have two very large sets of data, let's say 3 million rows with 50 columns, stored in Hadoop HDFS.
What I would like to do is read both of these into RDDs to take advantage of the parallelism, and return a third RDD that contains all records (from either RDD) that do not match.
Below hopefully helps show what I'm looking to do...
Just trying to find all the differing records in the fastest, most efficient way...
Data is not necessarily in the same order - row 1 of rdd1 may be row 4 of rdd2.
many thanks in advance!!
So... This seems to be doing what I want it to, but it seems far too easy to be correct...
%spark
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
import sqlContext.implicits._
import org.apache.spark.sql._
//create the tab1 DataFrame (kept in a variable called rdd1).
val rdd1 = sqlContext.sql("select * FROM table1").withColumn("source",lit("tab1"))
//create the tab2 DataFrame.
val rdd2 = sqlContext.sql("select * FROM table2").withColumn("source",lit("tab2"))
//create the DataFrame of all mismatched records between table1 and table2.
val rdd3 = rdd1.except(rdd2).unionAll(rdd2.except(rdd1))
//rdd3.printSchema()
//val rdd3 = rdd1.except(rdd2)
//drop the temporary table that was used to create a hive compatible table from the last run.
sqlContext.dropTempTable("table3")
//register the new temporary table.
rdd3.toDF().registerTempTable("table3")
//drop the old compare table.
sqlContext.sql("drop table if exists data_base.compare_table")
//create the new version of the s_asset compare table.
sqlContext.sql("create table data_base.compare_table as select * from table3")
This is the final bit of code I've ended up with so far, which seems to be doing the job - not sure about performance on the full dataset, will keep my fingers crossed...
many thanks to all that took the time to help this poor pleb out :)
P.S. if anyone has a solution with a little more performance I'd love to hear it!
Or if you can see some issue with this that may mean it will return the wrong results.
Load both of your DataFrames as df1 and df2.
Add a "source" column with the literal values "rdd1" and "rdd2" respectively.
Union df1 and df2.
Group by "rowid", "name", "status", "lastupdated" and collect the sources as a set.
Keep only the rows that have a single source; these are the records that differ (as shown in the code below).
import org.apache.spark.sql.functions._

object OuterJoin {

  def main(args: Array[String]): Unit = {

    val spark = Constant.getSparkSess
    import spark.implicits._

    val cols = Array("rowid", "name", "status", "lastupdated")

    val df1 = List(
      ("1-za23f0", "product1", "active", "30-12-2019"),
      ("1-za23f1", "product2", "inactive", "31-12-2019"),
      ("1-za23f2", "product3", "inactive", "01-01-2020"),
      ("1-za23f3", "product4", "inactive", "02-01-2020"),
      ("1-za23f4", "product5", "inactive", "03-01-2020"))
      .toDF(cols: _*)
      .withColumn("source", lit("rdd1"))

    val df2 = List(
      ("1-za23f0", "product1", "active", "30-12-2019"),
      ("1-za23f1", "product2", "active", "31-12-2019"),
      ("1-za23f2", "product3", "active", "01-01-2020"),
      ("1-za23f3", "product1", "inactive", "02-01-2020"),
      ("1-za23f4", "product5", "inactive", "03-01-2020"))
      .toDF(cols: _*)
      .withColumn("source", lit("rdd2"))

    df1.union(df2)
      .groupBy(cols.map(col): _*)
      .agg(collect_set("source").as("sources"))
      .filter(size(col("sources")) === 1)
      .withColumn("from_rdd", explode(col("sources")))
      .drop("sources")
      .show()
  }
}
You can read the data into DataFrames rather than RDDs, and then use union and group by to achieve the result.
Both can also be joined with "full_outer", and then a filter applied that compares the field values on both sides (cols is the same array of column names as above):
val filterCondition = cols
.map(c => (col(s"l.$c") =!= col(s"r.$c") || col(s"l.$c").isNull || col(s"r.$c").isNull))
.reduce((acc, c) => acc || c)
df1.alias("l")
.join(df2.alias("r"), $"l.rowid" === $"r.rowid", "full_outer")
.where(filterCondition)
Output:
+--------+--------+--------+-----------+------+--------+--------+--------+-----------+------+
|rowid |name |status |lastupdated|source|rowid |name |status |lastupdated|source|
+--------+--------+--------+-----------+------+--------+--------+--------+-----------+------+
|1-za23f1|product2|inactive|31-12-2019 |rdd1 |1-za23f1|product2|active |31-12-2019 |rdd2 |
|1-za23f2|product3|inactive|01-01-2020 |rdd1 |1-za23f2|product3|active |01-01-2020 |rdd2 |
|1-za23f3|product4|inactive|02-01-2020 |rdd1 |1-za23f3|product1|inactive|02-01-2020 |rdd2 |
+--------+--------+--------+-----------+------+--------+--------+--------+-----------+------+

Create a single CSV file for each dataframe row

I need to create a single CSV file for each dataframe row.
The following code creates a single CSV with the whole DataFrame's contents:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.SparkConf
import java.sql.Timestamp
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, LongType, DoubleType};
import org.apache.spark.sql.functions._
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
var myDF = sqlContext.sql("select a, b, c from my_table")
val filename = "/tmp/myCSV.csv";
myDF.repartition(1).write.option("header", "true").option("compression", "none").option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSS").csv(filename)
I'd like to create a single CSV for each row
scala> val myDF = sqlContext.sql("select a, b, c from my_table")
scala> val c = myDF.cache.count //Let's say 100 records in total
scala> val newDF = myDF.repartition(c.toInt)
scala> newDF.rdd.getNumPartitions
res34: Int = 100
scala> newDF.write.format("csv").option("header","true").save(<path to write>)
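Another option, sketched here under the assumption that the rows have no natural key: tag each row with a generated id and write with partitionBy, so every row ends up in its own sub-directory containing a single CSV part file (the output path below is hypothetical):
import org.apache.spark.sql.functions.monotonically_increasing_id
// Tag each row with a unique id, then partition the output by that id.
val withId = myDF.withColumn("row_id", monotonically_increasing_id())
withId.write
  .option("header", "true")
  .partitionBy("row_id")
  .csv("/tmp/per_row_csv")   // hypothetical output path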

Dataframe: how to groupBy/count then order by count in Scala

I have a dataframe that contains thousands of rows. What I'm looking for is to group by and count a column, and then order by the output. What I did looks something like this:
import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._
val objHive = new HiveContext(sc)
val df = objHive.sql("select * from db.tb")
val df_count=df.groupBy("id").count().collect()
df_count.sort($"count".asc).show()
You can use sort or orderBy as below
val df_count = df.groupBy("id").count()
df_count.sort(desc("count")).show(false)
df_count.orderBy($"count".desc).show(false)
Don't use collect() since it brings the data to the driver as an Array.
Hope this helps!
//import SparkSession, the entry point to the underlying Spark API
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val pathOfFile = "f:/alarms_files/"
//create the session and hold it in the spark variable
val spark = SparkSession.builder().appName("myApp").getOrCreate()
//read the file; the API below returns a DataFrame of Rows
var df = spark.read.format("csv").option("header", "true").option("delimiter", "\t").load("file://" + pathOfFile + "db.tab")
//group by the id column, count it and order by that count
df = df.groupBy(df("id")).agg(count("*").as("columnCount")).orderBy("columnCount")
//show displays only the top 20 records by default
df.show
//to display more than 20 records, e.g.:
df.show(50)

How to get months, years difference between two dates in Spark SQL

I am getting the error:
org.apache.spark.sql.analysisexception: cannot resolve 'year'
My input data:
1,2012-07-21,2014-04-09
My code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class c (id:Int,start:String,end:String)
val c1 = sc.textFile("date.txt")
val c2 = c1.map(_.split(",")).map(r=>(c(r(0).toInt,r(1).toString,r(2).toString)))
val c3 = c2.toDF();
c3.registerTempTable("c4")
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")
What can I do to resolve the above error?
I have tried the following code, but I got the output in days and I need it in years:
val r = sqlContext.sql("select id,datediff(to_date(end), to_date(start)) AS date from c4")
Please advise me if I can use any function like to_date to get the year difference.
Another simple way is to cast the strings to DateType in Spark SQL and apply the SQL date and time functions on the columns, like the following:
import org.apache.spark.sql.types._
val c4 = c3.select(col("id"),col("start").cast(DateType),col("end").cast(DateType))
c4.withColumn("dateDifference", datediff(col("end"),col("start")))
.withColumn("monthDifference", months_between(col("end"),col("start")))
.withColumn("yearDifference", year(col("end"))-year(col("start")))
.show()
One of the above answers doesn't return the right year when the number of days between the two dates is less than 365. The example below gives the correct year and rounds the month and year differences to 2 decimal places.
Seq(("2019-07-01"),("2019-06-24"),("2019-08-24"),("2018-12-23"),("2018-07-20")).toDF("startDate").select(
col("startDate"),current_date().as("endDate"))
.withColumn("datesDiff", datediff(col("endDate"),col("startDate")))
.withColumn("montsDiff", months_between(col("endDate"),col("startDate")))
.withColumn("montsDiff_round", round(months_between(col("endDate"),col("startDate")),2))
.withColumn("yearsDiff", months_between(col("endDate"),col("startDate"),true).divide(12))
.withColumn("yearsDiff_round", round(months_between(col("endDate"),col("startDate"),true).divide(12),2))
.show()
Outputs:
+----------+----------+---------+-----------+---------------+--------------------+---------------+
| startDate| endDate|datesDiff| montsDiff|montsDiff_round| yearsDiff|yearsDiff_round|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
|2019-07-01|2019-07-24| 23| 0.74193548| 0.74| 0.06182795666666666| 0.06|
|2019-06-24|2019-07-24| 30| 1.0| 1.0| 0.08333333333333333| 0.08|
|2019-08-24|2019-07-24| -31| -1.0| -1.0|-0.08333333333333333| -0.08|
|2018-12-23|2019-07-24| 213| 7.03225806| 7.03| 0.586021505| 0.59|
|2018-07-20|2019-07-24| 369|12.12903226| 12.13| 1.0107526883333333| 1.01|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
You can find a complete working example at the URL below:
https://sparkbyexamples.com/spark-calculate-difference-between-two-dates-in-days-months-and-years/
Hope this helps.
Happy Learning !!
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")
In the above code, "year" is not a column in the DataFrame, i.e. it is not a valid column in the table "c4". That is why the AnalysisException is thrown: the query is invalid because it cannot resolve a "year" column.
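As a sketch of a corrected query against the same "c4" temp table (assuming a Spark version where months_between is available in SQL, i.e. 1.5+), the month difference divided by 12 gives the difference in years:
// Sketch: year difference via months_between/12 instead of the unsupported datediff(year, ...)
val r = sqlContext.sql(
  "select id, months_between(to_date(end), to_date(start)) / 12 as years_diff from c4")
r.show()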
Using a Spark user-defined function (UDF) is a more robust approach. Since datediff only returns the difference in days, I prefer to use my own UDF.
import java.sql.Timestamp
import java.time.Instant
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.functions.{udf, col}
import org.apache.spark.sql.DataFrame

def timeDiff(chronoUnit: ChronoUnit)(dateA: Timestamp, dateB: Timestamp): Long = {
  chronoUnit.between(
    Instant.ofEpochMilli(dateA.getTime),
    Instant.ofEpochMilli(dateB.getTime)
  )
}

def withTimeDiff(dateA: String, dateB: String, colName: String, chronoUnit: ChronoUnit)(df: DataFrame): DataFrame = {
  val timeDiffUDF = udf[Long, Timestamp, Timestamp](timeDiff(chronoUnit))
  df.withColumn(colName, timeDiffUDF(col(dateA), col(dateB)))
}
Then I call it as a DataFrame transformation:
df.transform(withTimeDiff("sleepTime", "wakeupTime", "minutes", ChronoUnit.MINUTES))

How to sum the values of one column of a dataframe in spark/scala

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.
I want to sum the values of each column, for instance the total number of steps on "steps" column.
As far as I can see, I want to use this kind of function:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
But I can't understand how to use the function sum.
When I write the following:
val df = CSV.load(args(0))
val sumSteps = df.sum("steps")
the function sum cannot be resolved.
Am I using the function sum wrongly?
Do I need to use the function map first? And if yes, how?
A simple example would be very helpful! I started writing Scala recently.
You must first import the functions:
import org.apache.spark.sql.functions._
Then you can use them like this:
val df = CSV.load(args(0))
val sumSteps = df.agg(sum("steps")).first.get(0)
You can also cast the result if needed:
val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)
Edit:
For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:
val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first
Edit2:
For dynamically applying the aggregations, the following options are available:
Applying to all numeric columns at once:
df.groupBy().sum()
Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)
//res1 Int = 19
Simply apply the aggregation function sum to your column:
df.groupBy().sum("steps").show()
Follow the documentation: http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
Check out this link as well: https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/
Not sure this was around when this question was asked, but:
df.describe("columnName").show()
gives count, mean, stddev, min and max stats for a column. It returns stats for all numeric columns if you call describe() with no arguments.
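For example, a minimal sketch against the question's df and its "steps" column:
// Summary statistics (count, mean, stddev, min, max) for one column...
df.describe("steps").show()
// ...or for every numeric column at once.
df.describe().show()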
Using a Spark SQL query, just in case it helps anyone!
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
import spark.implicits._   // needed for toDF on the RDD

//name the column "steps" so the SQL below can reference it
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF("steps")
df.createOrReplaceTempView("steps")

val sum = spark.sql("select sum(steps) as stepsSum from steps")
  .map(row => row.getAs[Long]("stepsSum"))
  .collect()(0)

println("steps sum = " + sum) //prints 28