Spark SQL with Scala

I am trying to load a Parquet file with columns storyId1 and publisher1. I want to find all pairs of publishers that publish articles about the same stories. For each publisher pair I need to report the number of co-published stories, where a co-published story is a story published by both publishers. The pairs should be reported in decreasing order of frequency. The solution must conform to the following rules:
1. There should not be any replicated entries like:
NASDAQ, NASDAQ, 1000
2. The same pair should not occur twice in opposite order. Only one of the following should occur:
NASDAQ, Reuters, 1000
Reuters, NASDAQ, 1000
(i.e. it is incorrect to have both of the above two lines in your result)
So far I have tried the following code:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._
val worddocDF = spark.read.parquet("file:///home/user204943816622/t4_story_publishers.parquet")
val worddocDF1 = spark.read.parquet("file:///home/user204943816622/t4_story_publishers.parquet")
worddocDF.cache()
val joinDF = worddocDF.join(worddocDF1, "storyId1").withColumnRenamed("worddocDF.publisher1", "publisher2")
joinDF.filter($"publisher1" !== $"publisher2")
Input format:
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Livemint]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, IFA Magazine]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Moneynews]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, NASDAQ]
[dPhGU51DcrolUIMxbRm0InaHGA2XM, IFA Magazine]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Los Angeles Times]
[dPhGU51DcrolUIMxbRm0InaHGA2XM, NASDAQ]
Required output:
[NASDAQ,IFA Magazine,2]
[Moneynews,Livemint,1]
[Moneynews,IFA Magazine,1]
[NASDAQ,Livemint,1]
[NASDAQ,Los Angeles Times,1]
[Moneynews,Los Angeles Times,1]
[Los Angeles Times,IFA Magazine,1]
[Livemint,IFA Magazine,1]
[NASDAQ,Moneynews,1]
[Los Angeles Times,Livemint,1]

import spark.implicits._

worddocDF.as("a")
  .join(
    worddocDF.as("b"),
    $"a.storyId1" === $"b.storyId1" && $"a.publisher1" =!= $"b.publisher1",
    "inner"
  )
  .select(
    $"a.storyId1".as("storyId"),
    $"a.publisher1".as("publisher1"),
    $"b.publisher1".as("publisher2")
  )
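That join produces the raw publisher pairs per story, but it still emits each pair in both orders and does not count or sort anything. A minimal sketch of one way to finish the job, assuming the column names from the question (worddocDF with storyId1 and publisher1), is to keep only the alphabetically ordered version of each pair, then count distinct co-published stories and sort by frequency:

import org.apache.spark.sql.functions._

// Sketch only: a.publisher1 < b.publisher1 drops self-pairs and mirrored duplicates,
// so each publisher pair survives exactly once.
val pairCounts = worddocDF.as("a")
  .join(worddocDF.as("b"), Seq("storyId1"))
  .filter($"a.publisher1" < $"b.publisher1")
  .select($"storyId1", $"a.publisher1".as("publisher1"), $"b.publisher1".as("publisher2"))
  .groupBy("publisher1", "publisher2")
  .agg(countDistinct("storyId1").as("coPublished"))
  .orderBy($"coPublished".desc)

pairCounts.show(false)

This fixes the within-pair order alphabetically, which satisfies rule 2 even if the required output happens to list some pairs the other way around.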


How to pivot/transpose rows of a column into individual columns in spark-scala without using the pivot method

Please check the image below for a reference to my use case.
You can get the same result without using pivot by adding the columns manually, if you know all the names of the new columns:
import org.apache.spark.sql.functions.{col, when}
dataframe
.withColumn("cheque", when(col("ttype") === "cheque", col("tamt")))
.withColumn("draft", when(col("ttype") === "draft", col("tamt")))
.drop("tamt", "ttype")
Since this solution does not trigger a shuffle, the processing will be faster than using pivot.
It can be generalized if you don't know the names of the new columns. However, in that case you should benchmark to check whether pivot is more performant:
import org.apache.spark.sql.functions.{col, when}
val newColumnNames = dataframe.select("ttype").distinct.collect().map(_.getString(0))
newColumnNames
  .foldLeft(dataframe)((df, columnName) => {
    df.withColumn(columnName, when(col("ttype") === columnName, col("tamt")))
  })
  .drop("tamt", "ttype")
Use the groupBy, pivot & agg functions. Check the code below.
Added inline comments.
scala> df.show(false)
+----------+------+----+
|tdate |ttype |tamt|
+----------+------+----+
|2020-10-15|draft |5000|
|2020-10-18|cheque|7000|
+----------+------+----+
scala> df
  .groupBy($"tdate") // group the data by the tdate column
  .pivot("ttype", Seq("cheque", "draft")) // pivot on ttype; "cheque" and "draft" become the new columns
  .agg(first("tamt")) // take the first tamt value for each cell
  .show(false)
+----------+------+-----+
|tdate |cheque|draft|
+----------+------+-----+
|2020-10-18|7000 |null |
|2020-10-15|null |5000 |
+----------+------+-----+

How to get all different records in two different Spark RDDs

I am very new to Spark and RDDs, so I hope I explain what I'm after well enough for someone to understand and help :)
I have two very large sets of data, let's say 3 million rows with 50 columns each, stored in Hadoop HDFS.
What I would like to do is read both of these into RDDs so that it uses the parallelism, and return a third RDD that contains all records (from either RDD) that do not match.
Below hopefully helps show what I'm looking to do...
I'm just trying to find all the differing records in the fastest, most efficient way...
The data is not necessarily in the same order - row 1 of rdd1 may be row 4 of rdd2.
Many thanks in advance!!
So... this seems to be doing what I want it to, but it seems far too easy to be correct...
%spark
import org.apache.spark.sql.DataFrame
import org.apache.spark.rdd.RDD
import sqlContext.implicits._
import org.apache.spark.sql._
//create the tab1 rdd.
val rdd1 = sqlContext.sql("select * FROM table1").withColumn("source",lit("tab1"))
//create the tab2 rdd.
val rdd2 = sqlContext.sql("select * FROM table2").withColumn("source",lit("tab2"))
//create the rdd of all misaligned records between table1 and the table2.
val rdd3 = rdd1.except(rdd2).unionAll(rdd2.except(rdd1))
//rdd3.printSchema()
//val rdd3 = rdd1.except(rdd2)
//drop the temporary table that was used to create a hive compatible table from the last run.
sqlContext.dropTempTable("table3")
//register the new temporary table.
rdd3.toDF().registerTempTable("table3")
//drop the old compare table.
sqlContext.sql("drop table if exists data_base.compare_table")
//create the new version of the s_asset compare table.
sqlContext.sql("create table data_base.compare_table as select * from table3")
This is the final bit of code I've ended up with so far, and it seems to be doing the job - not sure about performance on the full dataset, I'll keep my fingers crossed...
Many thanks to all who took the time to help this poor pleb out :)
P.S. if anyone has a solution with a little more performance, I'd love to hear it!
Or if you can see some issue with this that might mean it will return the wrong results.
Load both of your DataFrames as df1 and df2
Add a source column with the default values rdd1 and rdd2 respectively
Union df1 and df2
Group by "rowid", "name", "status", "lastupdated" and collect the sources as a set
Filter all rows which have a single source
import org.apache.spark.sql.functions._

object OuterJoin {

  def main(args: Array[String]): Unit = {

    val spark = Constant.getSparkSess
    import spark.implicits._

    val cols = Array("rowid", "name", "status", "lastupdated")

    val df1 = List(
      ("1-za23f0", "product1", "active", "30-12-2019"),
      ("1-za23f1", "product2", "inactive", "31-12-2019"),
      ("1-za23f2", "product3", "inactive", "01-01-2020"),
      ("1-za23f3", "product4", "inactive", "02-01-2020"),
      ("1-za23f4", "product5", "inactive", "03-01-2020"))
      .toDF(cols: _*)
      .withColumn("source", lit("rdd1"))

    val df2 = List(
      ("1-za23f0", "product1", "active", "30-12-2019"),
      ("1-za23f1", "product2", "active", "31-12-2019"),
      ("1-za23f2", "product3", "active", "01-01-2020"),
      ("1-za23f3", "product1", "inactive", "02-01-2020"),
      ("1-za23f4", "product5", "inactive", "03-01-2020"))
      .toDF(cols: _*)
      .withColumn("source", lit("rdd2"))

    df1.union(df2)
      .groupBy(cols.map(col): _*)
      .agg(collect_set("source").as("sources"))
      .filter(size(col("sources")) === 1)
      .withColumn("from_rdd", explode(col("sources")))
      .drop("sources")
      .show()
  }
}
You can instead read the data into DataFrames rather than RDDs, and then use union and groupBy to achieve the result.
Alternatively, both DataFrames can be joined with "full_outer", and then a filter applied that compares the field values on both sides:
val filterCondition = cols
  .map(c => (col(s"l.$c") =!= col(s"r.$c") || col(s"l.$c").isNull || col(s"r.$c").isNull))
  .reduce((acc, c) => acc || c)

df1.alias("l")
  .join(df2.alias("r"), $"l.rowid" === $"r.rowid", "full_outer")
  .where(filterCondition)
Output:
+--------+--------+--------+-----------+------+--------+--------+--------+-----------+------+
|rowid |name |status |lastupdated|source|rowid |name |status |lastupdated|source|
+--------+--------+--------+-----------+------+--------+--------+--------+-----------+------+
|1-za23f1|product2|inactive|31-12-2019 |rdd1 |1-za23f1|product2|active |31-12-2019 |rdd2 |
|1-za23f2|product3|inactive|01-01-2020 |rdd1 |1-za23f2|product3|active |01-01-2020 |rdd2 |
|1-za23f3|product4|inactive|02-01-2020 |rdd1 |1-za23f3|product1|inactive|02-01-2020 |rdd2 |
+--------+--------+--------+-----------+------+--------+--------+--------+-----------+------+

Scala query needed

Hi, I am getting an error with the following piece of code.
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import spark.implicits._
// Define a case class for the input data
case class Article(articleId: Int, title: String, url: String, publisher: String,
category: String, storyId: String, hostname: String, timestamp: String)
// Read the input data
val articles = spark.read.
schema(Encoders.product[Article].schema).
option("delimiter", ",").
csv("hdfs:///user/ashhall1616/bdc_data/t4/news-small.csv").
as[Article]
articles.createOrReplaceTempView("articles")
val writeDf = spark.sql("""SELECT articles.storyId AS storyId1, articles.publisher AS publisher1
FROM articles
GROUP BY storyId
ORDER BY publisher1 ASC""")
Error:
val writeDf = spark.sql("""SELECT articles.storyId AS storyId1, articles.publisher AS publisher1
| FROM articles
| GROUP BY storyId
| ORDER BY publisher1 ASC""")
org.apache.spark.sql.AnalysisException: expression 'articles.`publisher`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Sort [publisher1#36 ASC NULLS FIRST], true
+- Aggregate [storyId#13], [storyId#13 AS storyId1#35, publisher#11 AS publisher1#36]
+- SubqueryAlias articles
+- Relation[articleId#8,title#9,url#10,publisher#11,category#12,storyId#13,hostname#14,timestamp#15] csv
Data set looks like:
articleId | publisher | category | storyId | hostname
1 | Los Angeles Times | B | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com
The goal is to create a list of each story paired with each publisher that wrote at least one article for that story.
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Livemint]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, IFA Magazine]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Moneynews]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, NASDAQ]
[dPhGU51DcrolUIMxbRm0InaHGA2XM, IFA Magazine]
[ddUyU0VZz0BRneMioxUPQVP6sIxvM, Los Angeles Times]
[dPhGU51DcrolUIMxbRm0InaHGA2XM, NASDAQ]
Can someone suggest a code improvement to get the desired output?
The parser and compiler are getting confused: there is no aggregate to go with the GROUP BY. Use DISTINCT on storyId, publisher instead.
Check whether you also need storyId1 in the GROUP BY as well.
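A minimal sketch of that DISTINCT approach, reusing the articles view and column names from the question:

val writeDf = spark.sql("""SELECT DISTINCT articles.storyId AS storyId1, articles.publisher AS publisher1
                           FROM articles
                           ORDER BY publisher1 ASC""")
writeDf.show(false)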

Dataframe: how to groupBy/count then order by count in Scala

I have a dataframe that contains thousands of rows. What I'm looking for is to group by and count a column and then order by the output. What I did looks something like this:
import org.apache.spark.sql.hive.HiveContext
import sqlContext.implicits._
val objHive = new HiveContext(sc)
val df = objHive.sql("select * from db.tb")
val df_count=df.groupBy("id").count().collect()
df_count.sort($"count".asc).show()
You can use sort or orderBy as below
val df_count = df.groupBy("id").count()
df_count.sort(desc("count")).show(false)
df_count.orderBy($"count".desc).show(false)
Don't use collect() since it brings the data to the driver as an Array.
Hope this helps!
// import SparkSession, the entry point to Spark's underlying APIs
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val pathOfFile = "f:/alarms_files/"
// create the session and hold it in the spark variable
val spark = SparkSession.builder().appName("myApp").getOrCreate()
// read the file; this API returns a DataFrame of Rows
var df = spark.read.format("csv").option("header", "true").option("delimiter", "\t").load("file://" + pathOfFile + "db.tab")
// group by the id column, count the rows, and order by that count
df = df.groupBy(df("id")).agg(count("*").as("columnCount")).orderBy("columnCount")
// show displays only the top 20 records by default
df.show
// to show more than 20 records, e.g.:
df.show(50)

How to group by on an epoch timestamp field in Scala Spark

I want to group the records by date, but the date is an epoch timestamp in milliseconds.
Here is the sample data.
date, Col1
1506838074000, a
1506868446000, b
1506868534000, c
1506869064000, a
1506869211000, c
1506871846000, f
1506874462000, g
1506879651000, a
Here is what I'm trying to achieve.
date    Count of records
02-10-2017 4
04-10-2017 3
03-10-2017 5
Here is the code I tried for the group by:
import java.text.SimpleDateFormat
val dateformat:SimpleDateFormat = new SimpleDateFormat("yyyy-MM-dd")
val df = sqlContext.read.csv("<path>")
val result = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
But while executing the code I am getting the exception below.
<console>:30: error: value toLong is not a member of org.apache.spark.sql.ColumnName
val t = df.select("*").groupBy(dateformat.format($"date".toLong)).agg(count("*").alias("cnt")).select("date","cnt")
Please help me to resolve the issue.
You need to convert the date column, which appears to be a long, to a date data type. This can be done using the from_unixtime built-in function. Then it's just groupBy and agg function calls, using the count function.
import org.apache.spark.sql.functions._
def stringDate = udf((date: Long) => new java.text.SimpleDateFormat("dd-MM-yyyy").format(date))
df.withColumn("date", stringDate($"date"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
The answer above uses a udf function, which should be avoided as much as possible, since a udf is a black box to Spark and requires serialization and deserialization of the columns.
Updated
Thanks to #philantrovert for his suggestion to divide by 1000
import org.apache.spark.sql.functions._
df.withColumn("date", from_unixtime($"date"/1000, "yyyy-MM-dd"))
.groupBy("date")
.agg(count("Col1").as("Count of records"))
.show(false)
Both ways work.