I have the below columns in my table [col1, col2, key1, col3, txn_id, dw_last_updated]. Of these, txn_id and key1 are the primary key columns. My dataset can have multiple records for the same (txn_id, key1) combination, and out of those records I need to pick the latest one based on dw_last_updated.
I'm using the logic below. I'm consistently hitting memory issues, and I believe it's partly because of groupByKey()... Is there a better alternative for this?
case class Fact(col1: Int,
                col2: Int,
                key1: String,
                col3: Int,
                txn_id: Double,
                dw_last_updated: Long)

sc.textFile(s3path).map { row =>
  val parts = row.split("\t")
  Fact(parts(0).toInt,
       parts(1).toInt,
       parts(2),
       parts(3).toInt,
       parts(4).toDouble,
       parts(5).toLong)
}.map { t => ((t.txn_id, t.key1), t) }.groupByKey(512).map {
  case ((txn_id, key1), sequence) =>
    val newrecord = sequence.maxBy(_.dw_last_updated)
    (newrecord.col1 + "\t" + newrecord.col2 + "\t" + newrecord.key1 +
      "\t" + newrecord.col3 + "\t" + newrecord.txn_id + "\t" + newrecord.dw_last_updated)
}
Appreciate your thoughts / suggestions...
rdd.groupByKey collects all values per key, which requires enough memory on a single node to hold the entire sequence of values for a key. Its use is discouraged. See [1].
Given that we are interested in only one value per key, max(dw_last_updated), a more memory-efficient way is rdd.reduceByKey, where the reduce function picks the max of the two records for the same key, using that timestamp as the discriminant.
rdd.reduceByKey{case (record1,record2) => max(record1, record2)}
Applied to your case, it should look like this:
case class Fact(...)
object Fact {
def parse(s:String):Fact = ???
def maxByTs(f1:Fact, f2:Fact):Fact = if (f1.dw_last_updated.toLong > f2.dw_last_updated.toLong) f1 else f2
}
val factById = sc.textFile(s3path).map{row => val fact = Fact.parse(row); ((fact.txn_id, fact.key1),fact)}
val maxFactById = factById.reduceByKey(Fact.maxByTs)
Note that I've defined utility operations on the Fact companion object to keep the code tidy. I also advise giving a named variable to each transformation step, or to each logical group of steps; it makes the program more readable.
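For example, parse can reuse the tab-splitting from the question, and the output can be rebuilt as the same tab-separated lines; just a sketch, assuming the field order shown above:
// Sketch of Fact.parse, assuming the question's tab-separated field order
def parse(s: String): Fact = {
  val parts = s.split("\t")
  Fact(parts(0).toInt, parts(1).toInt, parts(2),
       parts(3).toInt, parts(4).toDouble, parts(5).toLong)
}
// Rebuilding the output lines the same way the question does
val output = maxFactById.values.map { f =>
  Seq(f.col1, f.col2, f.key1, f.col3, f.txn_id, f.dw_last_updated).mkString("\t")
}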
I have a dataframe as below, which records the quarter and date during which the same incident occurs to which IDs.
I would like to mark an ID and date if the incident happens in two consecutive quarters. This is how I did it.
import scala.collection.mutable.ArrayBuffer

val arrArray = dtf.collect.map(x => (x(0).toString, x(1).toString, x(2).toString.toInt))
if (arrArray.length > 0) {
  var bufBAQDate = ArrayBuffer[Any]()
  for (a <- 1 to arrArray.length - 1) {
    val (strBAQ1, strDate1, douTime1) = arrArray(a - 1)
    val (strBAQ2, strDate2, douTime2) = arrArray(a)
    if (douTime2 - douTime1 == 15 && strBAQ1 == strBAQ2 && strDate1 == strDate2) {
      bufBAQDate = (strBAQ2, strDate2) +: bufBAQDate
      //println(strBAQ2 + " " + strDate2 + " " + douTime2)
    }
  }
  val vecBAQDate = bufBAQDate.distinct.toVector.reverse
}
Is there a better way of doing it? As the same incident can happen many times to one ID during a single day, it would be better to jump to the next ID and/or date once an ID and a date are marked. I don't want to create nested loops to filter the dataframe.
Note that your current solution misses 20210103, since 1400 - 1345 = 55 as raw integers even though the two times are only 15 minutes apart.
I think this does the trick
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{concat, lag, lit, to_timestamp}
import org.apache.spark.sql.types.LongType
import spark.implicits._

val windowSpec = Window.partitionBy("ID")
  .orderBy("datetime_seconds")

val consecutiveIncidents = dtf.withColumn("raw_datetime", concat($"Date", $"Time"))
  .withColumn("datetime", to_timestamp($"raw_datetime", "yyyyMMddHHmm"))
  .withColumn("datetime_seconds", $"datetime".cast(LongType))
  .withColumn("time_delta", ($"datetime_seconds" - lag($"datetime_seconds", 1).over(windowSpec)) / 60)
  .filter($"time_delta" === lit(15))
  .select("ID", "Date")
  .distinct()
  .collect()
  .map { case Row(id, date) => (id, date) }
  .toList
Basically: convert the datetimes to timestamps, then look for records with the same ID and consecutive times separated by 15 minutes.
This is done by using lag over a window partitioned by ID and ordered by the time.
In order to calculate the time difference, the timestamp is converted to Unix epoch seconds.
If you don't want to count day-crossing incidents, you can add the date to the partitionBy clause of the window, as sketched below.
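A minimal sketch of that variant, assuming the same column names as above:
// Partitioning by both ID and Date so that only same-day incidents are compared
val windowSpecPerDay = Window.partitionBy("ID", "Date")
  .orderBy("datetime_seconds")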
I am trying to save the output of Spark SQL to a path but am not sure what function to use. I want to do this without using Spark dataframes. I tried write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out") but was not successful. Can someone tell me how to do it?
Note: Spark SQL will write the output in multiple files, and I need to ensure that the data is sorted globally across all the files (parts), so all words in part 0 come alphabetically before the words in part 1.
case class Docword(docId: Int, vocabId: Int, count: Int)
case class VocabWord(vocabId: Int, word: String)
// Read the input data
val docwords = spark.read.
schema(Encoders.product[Docword].schema).
option("delimiter", " ").
csv("hdfs:///user/bdc_data/t3/docword.txt").
as[Docword]
val vocab = spark.read.
schema(Encoders.product[VocabWord].schema).
option("delimiter", " ").
csv("hdfs:///user/bdc_data/t3/vocab.txt").
as[VocabWord]
docwords.createOrReplaceTempView("docwords")
vocab.createOrReplaceTempView("vocab")
spark.sql("""SELECT vocab.word AS word1, SUM(count) count1 FROM
docwords INNER JOIN vocab
ON docwords.vocabId = vocab.vocabId
GROUP BY word
ORDER BY count1 DESC""").show(10)
write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out")
// Required to exit the spark-shell
sys.exit(0)
.show() returns Unit (it only prints the result), so you should do something like below:
val writeDf = spark.sql("""SELECT vocab.word AS word1, SUM(count) count1 FROM
docwords INNER JOIN vocab
ON docwords.vocabId = vocab.vocabId
GROUP BY word
ORDER BY count1 DESC""")
writeDf.write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out")
writeDf.show() // this should not be used in prod environment
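If you want a quick sanity check of what was written, one option is simply to read the output directory back; just a sketch, using the same path as in the question:
// Read the written CSV output back and peek at a few rows
spark.read.csv("file:///home/user204943816622/Task_3a-out").show(10)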
I need to maintain a table with results and insert information into it at regular intervals. Since JDBC and Spark have no built-in option for UPSERT, and since I cannot allow the table to be empty while I insert the results, or for them to be duplicated, I built an UPSERT function of my own. The problem is that I have a WrappedArray of Ints in my dataframe and I can't seem to translate it into a Java object that will let me insert it into the PreparedStatement.
The relevant part from my code looks like this:
import java.sql._
val st: PreparedStatement = dbc.prepareStatement("""
INSERT INTO """ + table + """ as tb """ + sliced_columns + """
VALUES"""+"(" + "?, " * (columns.size - 1) + "?)"+"""
ON CONFLICT (id)
DO UPDATE SET """ + column_name + """= CAST (? AS _int4), count_win=?, occurrences=?, "sumOccurrences"=?, win_rate=? Where tb.id=?;
""")
As you can see I tried to write the WrappedArray as a string and then cast it in the SQL code itself, but that feels like a very bad solution.
I wrote this for the input part, doing different actions depending on which column type it is:
for (single_type <- types) {
  single_type._2 match {
    case "IntegerType" => st.setInt(counter + 1, x.getInt(counter))
    case "StringType" => st.setString(counter + 1, x.getString(counter))
    case "DoubleType" => st.setDouble(counter + 1, x.getDouble(counter))
    case "LongType" => st.setLong(counter + 1, x.getLong(counter))
    case _ => st.setArray(counter + 1, x.getList(counter).toArray().asInstanceOf[Array])
  }
}
This returns an error that [Ljava.lang.Object; cannot be cast to java.sql.Array. I'd really appreciate any help!
Array is a type constructor, not a type:
import org.apache.spark.sql.Row
Row(Seq(1, 2, 3)).getList(0).toArray.asInstanceOf[Array[_]]
but toArray (with an explicit type parameter) should be sufficient:
Row(Seq(1, 2, 3)).getList[Int](0).toArray
The problem was eventually solved by using createArrayOf:
st.setArray(counter + 1, conn.createArrayOf("int4", x.getList[Int](4).toArray()))
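For reference, a small sketch of how this can be wrapped into a helper; setIntArray is a hypothetical name, and conn is the java.sql.Connection used to prepare the statement (dbc in the question):
import java.sql.{Connection, PreparedStatement}
import org.apache.spark.sql.Row

// Hypothetical helper: bind an array column from a Spark Row to a JDBC parameter
// via Connection.createArrayOf ("int4" being the Postgres integer element type)
def setIntArray(st: PreparedStatement, conn: Connection, pos: Int, row: Row, col: Int): Unit = {
  val sqlArray = conn.createArrayOf("int4", row.getList[Int](col).toArray())
  st.setArray(pos, sqlArray)
}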
I'm trying to convert a dataframe from long to wide as suggested at How to pivot DataFrame?
However, the SQL seems to misinterpret the countries list as variables from the table. Below are the messages I saw from the console, and the sample data and code from the above link. Does anyone know how to resolve the issue?
Messages from the Scala console:
scala> val myDF1 = sqlc2.sql(query)
org.apache.spark.sql.AnalysisException: cannot resolve 'US' given input columns id, tag, value;
id tag value
1 US 50
1 UK 100
1 Can 125
2 US 75
2 UK 150
2 Can 175
and I want:
id US UK Can
1 50 100 125
2 75 150 175
I can create a list with the values I want to pivot on and then create a string containing the SQL query I need.
val countries = List("US", "UK", "Can")
val numCountries = countries.length - 1
var query = "select *, "
for (i <- 0 to numCountries-1) {
query += "case when tag = " + countries(i) + " then value else 0 end as " + countries(i) + ", "
}
query += "case when tag = " + countries.last + " then value else 0 end as " + countries.last + " from myTable"
myDataFrame.registerTempTable("myTable")
val myDF1 = sqlContext.sql(query)
Country codes are literals and should be enclosed in quotes; otherwise the SQL parser will treat them as column names:
val caseClause = countries.map(
x => s"""CASE WHEN tag = '$x' THEN value ELSE 0 END as $x"""
).mkString(", ")
val aggClause = countries.map(x => s"""SUM($x) AS $x""").mkString(", ")
val query = s"""
SELECT id, $aggClause
FROM (SELECT id, $caseClause FROM myTable) tmp
GROUP BY id"""
sqlContext.sql(query)
The question is why even bother with building SQL strings from scratch? The same thing can be done directly with the DataFrame API:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, sum, when}
import sqlContext.implicits._

def genCase(x: String) = {
  when($"tag" <=> lit(x), $"value").otherwise(0).alias(x)
}

def genAgg(f: Column => Column)(x: String) = f(col(x)).alias(x)

df
  .select($"id" :: countries.map(genCase): _*)
  .groupBy($"id")
  .agg($"id".alias("dummy"), countries.map(genAgg(sum)): _*)
  .drop("dummy")
I have a few tables, let's say 2 for simplicity. I can create them in this way:
...
val tableA = new Table[(Int,Int)]("tableA"){
def a = column[Int]("a")
def b = column[Int]("b")
}
val tableB = new Table[(Int,Int)]("tableB"){
def a = column[Int]("a")
def b = column[Int]("b")
}
I'm going to have a query to retrieve value 'a' from tableA and value 'a' from tableB as a list inside the result for 'a'.
My result should be:
List[(a, List(b))]
So far I have come up to this point in the query:
def createSecondItr(b1: NamedColumn[Int]) = for (
  b2 <- tableB if b2.b === b1
) yield b2.a

val q1 = for (
  a1 <- tableA;
  listB = createSecondItr(a1.b)
) yield (a1.a, listB)
I didn't test the code, so there might be errors in it. My problem is that I cannot retrieve the data from the results.
To understand the question, think of trains and their classes: you search for trains after 12pm, and you need a result set where the result for each train name contains the classes that the train has as a list.
I don't think you can do this directly in ScalaQuery. What I would do is a normal join and then manipulate the result accordingly:
import scala.collection.mutable.{HashMap, Set, MultiMap}
def list2multimap[A, B](list: List[(A, B)]) =
list.foldLeft(new HashMap[A, Set[B]] with MultiMap[A, B]){(acc, pair) => acc.addBinding(pair._1, pair._2)}
val q = for (
a <- tableA
b <- tableB
if (a.b === b.b)
) yield (a.a, b.a)
list2multimap(q.list)
The list2multimap is taken from https://stackoverflow.com/a/7210191/66686
The code was written without the assistance of an IDE, compiler, or similar. Consider the debugging as free training :-)
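If you need the exact List[(a, List(b))] shape from the question, the multimap can be flattened afterwards; a small sketch:
// Turn the MultiMap (a -> Set of b) into the List[(a, List(b))] shape from the question
val grouped: List[(Int, List[Int])] =
  list2multimap(q.list).map { case (a, bs) => (a, bs.toList) }.toList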