Full outer join in RDD scala spark - scala

I have two file below:
file1
0000003 杉山______ 26 F
0000005 崎村______ 50 F
0000007 梶川______ 42 F
file2
0000005 82 79 16 21 80
0000001 46 39 8 5 21
0000004 58 71 20 10 6
0000009 60 89 33 18 6
0000003 30 50 71 36 30
0000007 50 2 33 15 62
Now, I would like to join the rows that have the same value in field 1.
I want something like this:
0000005 崎村______ 50 F 82 79 16 21 80
0000003 杉山______ 26 F 30 50 71 36 30
0000007 梶川______ 42 F 50 2 33 15 62

You can use a DataFrame join instead of joining RDDs; it is simpler. You can refer to my sample code below. Hope it helps.
I am assuming your data is in the same format as shown above. If it is in CSV or any other format, you can skip Step-2 and adjust Step-1 to match the data format. If you need the output in RDD format, use Step-5; otherwise ignore it, as noted in the code snippet.
I have replaced the names (with A______, B______, C______) just for readability.
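For example, if the files were space-delimited CSV (each line containing exactly the fields shown above), Step-1 could load them with column names directly and Step-2 would not be needed. A minimal sketch under that assumption (the path is a placeholder; do the same for file2 with six names):
// a sketch, not the exact code for your files: load space-delimited data directly as columns
val df1Csv = spark.read
  .option("delimiter", " ")
  .csv("<path of file1>")              // columns come back as _c0, _c1, ...
  .toDF("col1", "col2", "col3", "col4") // rename; the name count must match the field count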
//Step1: Load file1 and file2 into corresponding DataFrames in text format
import org.apache.spark.sql.functions._
import spark.implicits._
val df1 = spark.read.format("text").load("<path of file1>")
val df2 = spark.read.format("text").load("<path of file2>")
//Step2: Split the single "value" column into multiple columns to get the join key
val file1 = df1
  .withColumn("col1", split($"value", " ")(0))
  .withColumn("col2", split($"value", " ")(1))
  .withColumn("col3", split($"value", " ")(2))
  .withColumn("col4", split($"value", " ")(3))
  .select("col1", "col2", "col3", "col4")
/*
+-------+-------+----+----+
|col1 |col2 |col3|col4|
+-------+-------+----+----+
|0000003|A______|26 |F |
|0000005|B______|50 |F |
|0000007|C______|42 |F |
+-------+-------+----+----+
*/
val file2 = df2
  .withColumn("col1", split($"value", " ")(0))
  .withColumn("col2", split($"value", " ")(1))
  .withColumn("col3", split($"value", " ")(2))
  .withColumn("col4", split($"value", " ")(3))
  .withColumn("col5", split($"value", " ")(4))
  .withColumn("col6", split($"value", " ")(5))
  .select("col1", "col2", "col3", "col4", "col5", "col6")
/*
+-------+----+----+----+----+----+
|col1 |col2|col3|col4|col5|col6|
+-------+----+----+----+----+----+
|0000005|82 |79 |16 |21 |80 |
|0000001|46 |39 |8 |5 |21 |
|0000004|58 |71 |20 |10 |6 |
|0000009|60 |89 |33 |18 |6 |
|0000003|30 |50 |71 |36 |30 |
|0000007|50 |2 |33 |15 |62 |
+-------+----+----+----+----+----+
*/
//Step3: alias the DataFrames so their columns can be referenced unambiguously; this improves readability
val file01 = file1.as("f1")
val file02 = file2.as("f2")
//Step4: Joining files on Key
file01.join(file02,col("f1.col1") === col("f2.col1"))
/*
+-------+-------+----+----+-------+----+----+----+----+----+
|col1 |col2 |col3|col4|col1 |col2|col3|col4|col5|col6|
+-------+-------+----+----+-------+----+----+----+----+----+
|0000005|B______|50 |F |0000005|82 |79 |16 |21 |80 |
|0000003|A______|26 |F |0000003|30 |50 |71 |36 |30 |
|0000007|C______|42 |F |0000007|50 |2 |33 |15 |62 |
+-------+-------+----+----+-------+----+----+----+----+----+
*/
// Step5: if you want the joined data in RDD format, you can use the command below
file01.join(file02,col("f1.col1") === col("f2.col1")).rdd.collect
/*
Array[org.apache.spark.sql.Row] = Array([0000005,B______,50,F,0000005,82,79,16,21,80], [0000003,A______,26,F,0000003,30,50,71,36,30], [0000007,C______,42,F,0000007,50,2,33,15,62])
*/
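Note that join with only a join condition performs an inner join, which is what the expected output shows. If you really do need a full outer join, as the question title says, you can pass the join type explicitly; a sketch:
// keep rows from both sides even when a key has no match (nulls fill the gaps)
file01.join(file02, col("f1.col1") === col("f2.col1"), "full_outer")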

I found the solution. Here is my code:
val rddPair1 = logData1.map { x =>
  val data = x.split(" ")
  val index = 0
  var value = ""
  val key = data(index)
  for (i <- 0 to data.length - 1) {
    if (i != index) {
      value += data(i) + " "
    }
  }
  (key, value.trim)
}
val rddPair2 = logData2.map { x =>
  val data = x.split(" ")
  val index = 0
  var value = ""
  val key = data(index)
  for (i <- 0 to data.length - 1) {
    if (i != index) {
      value += data(i) + " "
    }
  }
  (key, value.trim)
}
rddPair1.join(rddPair2).collect().foreach { f =>
  println(f._1 + " " + f._2._1 + " " + f._2._2)
}
result:
0000003 杉山______ 26 F 30 50 71 36 30
0000005 崎村______ 50 F 82 79 16 21 80
0000007 梶川______ 42 F 50 2 33 15 62
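If a true full outer join is needed at the RDD level, PairRDDFunctions also provides fullOuterJoin; the joined values come back as Options because either side may be missing for a key. A sketch reusing the pair RDDs above:
rddPair1.fullOuterJoin(rddPair2).collect().foreach { case (key, (left, right)) =>
  // getOrElse("") prints an empty field when one side has no match for the key
  println(key + " " + left.getOrElse("") + " " + right.getOrElse(""))
}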

Related

How to make first line of text file as header and skip second line in spark scala

I am trying to figure out how to use the first line of a text file as the header and skip the second line. So far I have tried this:
scala> val file = spark.sparkContext.textFile("/home/webwerks/Desktop/UseCase-03-March/Temp/temp.out")
file: org.apache.spark.rdd.RDD[String] = /home/webwerks/Desktop/UseCase-03-March/Temp/temp.out MapPartitionsRDD[40] at textFile at <console>:23
scala> val clean = file.flatMap(x=>x.split("\t")).filter(x=> !(x.contains("-")))
clean: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[42] at filter at <console>:25
scala> val df=clean.toDF()
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.show
+--------------------+
| value|
+--------------------+
|time task...|
|03:27:51.199 FCPH...|
|03:27:51.199 PORT...|
|03:27:51.200 PORT...|
|03:27:51.200 PORT...|
|03:27:59.377 PORT...|
|03:27:59.377 PORT...|
|03:27:59.377 FCPH...|
|03:27:59.377 FCPH...|
|03:28:00.468 PORT...|
|03:28:00.468 PORT...|
|03:28:00.469 FCPH...|
|03:28:00.469 FCPH...|
|03:28:01.197 FCPH...|
|03:28:01.197 FCPH...|
|03:28:01.197 PORT...|
|03:28:01.198 PORT...|
|03:28:09.380 PORT...|
|03:28:09.380 PORT...|
|03:28:09.380 FCPH...|
Here I want the first line as the header, and the data should be separated by tabs.
The data looks like this:
time task event port cmd args
--------------------------------------------------------------------------------------
03:27:51.199 FCPH seq 13 28 00300000,00000000,00000591,00020182,00000000
03:27:51.199 PORT Rx 11 0 c0fffffd,00fffffd,0ed10335,00000001
03:27:51.200 PORT Tx 13 40 02fffffd,00fffffd,0ed3ffff,14000000
03:27:51.200 PORT Rx 13 0 c0fffffd,00fffffd,0ed329ae,00000001
03:27:59.377 PORT Rx 15 40 02fffffd,00fffffd,0336ffff,14000000
03:27:59.377 PORT Tx 15 0 c0fffffd,00fffffd,03360ed2,00000001
03:27:59.377 FCPH read 15 40 02fffffd,00fffffd,d0000000,00000000,03360ed2
03:27:59.377 FCPH seq 15 28 22380000,03360ed2,0000052b,0000001c,00000000
03:28:00.468 PORT Rx 13 40 02fffffd,00fffffd,29afffff,14000000
03:28:00.468 PORT Tx 13 0 c0fffffd,00fffffd,29af0ed5,00000001
scala> val ds = spark.read.textFile("data.txt") // available in Spark 2.0 and later
(or)
val ds = spark.sparkContext.textFile("data.txt")
scala> val schemaArr = ds.filter(x => x.contains("time")).collect.mkString.split("\t").toList
scala> val df = ds.filter(x => !x.contains("time") && !x.startsWith("-")) // drop the header and the dashed separator line
  .map(x => {
    val cols = x.split("\t")
    (cols(0), cols(1), cols(2), cols(3), cols(4), cols(5))
  }).toDF(schemaArr: _*)
scala> df.show(false)
+------------+----+-----+----+---+--------------------------------------------+
|time |task|event|port|cmd|args |
+------------+----+-----+----+---+--------------------------------------------+
|03:27:51.199|FCPH|seq |13 |28 |00300000,00000000,00000591,00020182,00000000|
|03:27:51.199|PORT|Rx |11 | 0 |c0fffffd,00fffffd,0ed10335,00000001 |
|03:27:51.200|PORT|Tx |13 |40 |02fffffd,00fffffd,0ed3ffff,14000000 |
|03:27:51.200|PORT|Rx |13 | 0 |c0fffffd,00fffffd,0ed329ae,00000001 |
|03:27:59.377|PORT|Rx |15 |40 |02fffffd,00fffffd,0336ffff,14000000 |
|03:27:59.377|PORT|Tx |15 | 0 |c0fffffd,00fffffd,03360ed2,00000001 |
|03:27:59.377|FCPH|read |15 |40 |02fffffd,00fffffd,d0000000,00000000,03360ed2|
|03:27:59.377|FCPH|seq |15 |28 |22380000,03360ed2,0000052b,0000001c,00000000|
|03:28:00.468|PORT|Rx |13 |40 |02fffffd,00fffffd,29afffff,14000000 |
|03:28:00.468|PORT|Tx |13 | 0 |c0fffffd,00fffffd,29af0ed5,00000001 |
+------------+----+-----+----+---+--------------------------------------------+
Please try something like the above; if you want a typed schema, you can apply a custom schema to it.
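For instance, a custom schema could be applied like the sketch below. It assumes ds is the Dataset[String] read via spark.read.textFile above, and the column types are guesses based on the sample data; adjust them to whatever the real columns contain.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// assumed types for the six tab-separated fields
val schema = StructType(Seq(
  StructField("time", StringType),
  StructField("task", StringType),
  StructField("event", StringType),
  StructField("port", IntegerType),
  StructField("cmd", IntegerType),
  StructField("args", StringType)
))

val rowRdd = ds.filter(x => !x.contains("time") && !x.startsWith("-"))
  .rdd
  .map(_.split("\t"))
  .map(c => Row(c(0), c(1), c(2), c(3).trim.toInt, c(4).trim.toInt, c(5)))

val typedDf = spark.createDataFrame(rowRdd, schema)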

How convert sequential numerical processing of Cassandra table data to parallel in Spark?

We are doing some mathematical modelling on data from a Cassandra table using the spark-cassandra-connector, and the execution is currently sequential to get the output. How do you parallelize this for faster execution?
I'm new to Spark, and I have tried a few things, but I'm unable to understand how to use tabular data in the map, groupBy, and reduceBy functions. If someone can explain (with some code snippets) how to parallelize tabular data, it would be really helpful.
import org.apache.spark.sql.{Row, SparkSession}
import com.datastax.spark.connector._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
class SparkExample(sparkSession: SparkSession, pathToCsv: String) {
  private val sparkContext = sparkSession.sparkContext
  sparkSession.stop()
  val conf = new SparkConf(true)
    .set("spark.cassandra.connection.host", "127.0.0.1")
    .setAppName("cassandra").setMaster("local[*]")
  val sc = new SparkContext(conf)

  def testExample(): Unit = {
    val KNMI_rdd = sc.cassandraTable("dbks1", "knmi_w")
    val Table_count = KNMI_rdd.count()
    val KNMI_idx = KNMI_rdd.zipWithIndex
    val idx_key = KNMI_idx.map { case (k, v) => (v, k) }
    var i = 0
    var n: Int = Table_count.toInt
    println(Table_count)
    for (i <- 1 to n if i < n) {
      println(i)
      val Row = idx_key.lookup(i)
      println(Row)
      val firstRow = Row(0)
      val yyyy_var = firstRow.get[Int]("yyyy")
      val mm_var = firstRow.get[Double]("mm")
      val dd_var = firstRow.get[Double]("dd")
      val dr_var = firstRow.get[Double]("dr")
      val tg_var = firstRow.get[Double]("tg")
      val ug_var = firstRow.get[Double]("ug")
      val loc_var = firstRow.get[String]("loc")
      val pred_factor = (((0.15461 * tg_var) + (0.8954 * ug_var)) / ((0.0000451 * dr_var) + 0.0004487))
      println(yyyy_var, mm_var, dd_var, loc_var)
      println(pred_factor)
    }
  }
}
//test data
// loc | yyyy | mm | dd | dr | tg | ug
//-----+------+----+----+-----+-----+----
// AMS | 2019 | 1 | 1 | 35 | 5 | 84
// AMS | 2019 | 1 | 2 | 76 | 34 | 74
// AMS | 2019 | 1 | 3 | 46 | 33 | 85
// AMS | 2019 | 1 | 4 | 35 | 1 | 84
// AMS | 2019 | 1 | 5 | 29 | 0 | 93
// AMS | 2019 | 1 | 6 | 32 | 25 | 89
// AMS | 2019 | 1 | 7 | 42 | 23 | 89
// AMS | 2019 | 1 | 8 | 68 | 75 | 92
// AMS | 2019 | 1 | 9 | 98 | 42 | 86
// AMS | 2019 | 1 | 10 | 92 | 12 | 76
// AMS | 2019 | 1 | 11 | 66 | 0 | 71
// AMS | 2019 | 1 | 12 | 90 | 56 | 85
// AMS | 2019 | 1 | 13 | 83 | 139 | 90
Edit 1:
I tried using the map function and I'm able to calculate the mathematical computations. How do I add the keys defined by WeatherId in front of these values?
case class Weather(loc: String, yyyy: Int, mm: Int, dd: Int, dr: Double, tg: Double, ug: Double)
case class WeatherId(loc: String, yyyy: Int, mm: Int, dd: Int)

val rows = dataset1
  .map(line => Weather(
    line.getAs[String]("loc"),
    line.getAs[Int]("yyyy"),
    line.getAs[Int]("mm"),
    line.getAs[Int]("dd"),
    line.getAs[Double]("dr"),
    line.getAs[Double]("tg"),
    line.getAs[Double]("ug")
  ))

val pred_factor = rows
  .map(x => ((x.dr * betaz) + (x.tg * betay)) + (x.ug * betaz))
Thanks
TL;DR: Use a DataFrame/Dataset instead of an RDD.
The argument for DataFrames over RDDs is long, but the short of it is that DataFrames and their typed alternative, Datasets, outperform the low-level RDDs.
With the spark-cassandra connector you can configure the input split size, which dictates the partition size in Spark; more partitions means more parallelism.
val lastdf = spark
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "ClusterOne",
    "spark.cassandra.input.split.size_in_mb" -> "48" // smaller size = more partitions
  ))
  .load()
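As a rough sketch of what the parallel version of the question's loop could look like (assuming the options above are pointed at the dbks1.knmi_w table instead of the test.words example), the per-row formula can be written as a column expression so Spark evaluates it across all partitions at once rather than doing one lookup per row:
import org.apache.spark.sql.functions._

// same formula as in the question, applied to every row in parallel
val predicted = lastdf.withColumn(
  "pred_factor",
  (lit(0.15461) * col("tg") + lit(0.8954) * col("ug")) /
    (lit(0.0000451) * col("dr") + lit(0.0004487))
)
predicted.select("loc", "yyyy", "mm", "dd", "pred_factor").show(false)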

How to get count of group by two columns

Below is myDf:
fi_Sk sec_SK END_DATE
89 42 20160122
89 42 20150330
51 43 20140116
51 43 20130616
82 43 20100608
82 43 20160608
Below is my code:
val count = myDf.withColumn("END_DATE", unix_timestamp(col("END_DATE"), dateFormat))
.groupBy(col("sec_SK"),col("fi_Sk"))
.agg(count("sec_SK").as("Visits"), max("END_DATE").as("Recent_Visit"))
.withColumn("Recent_Visit", from_unixtime(col("Recent_Visit"), dateFormat))
I am getting Visits incorrectly; I need to group by (fi_Sk and sec_SK) to count the visits.
the result should be like below :
fi_Sk sec_SK Visits END_DATE
89 42 2 20160122
51 43 2 20140116
82 43 2 20160608
Currently I am getting:
fi_Sk sec_SK Visits END_DATE
89 42 2 20160122
51 43 2 20140116
groupBy and aggregation would collapse all the rows in a group into one row, but the expected output suggests that you want the count populated on each row of the group. A window function is the appropriate solution here:
import org.apache.spark.sql.expressions.Window
def windowSpec = Window.partitionBy("fi_Sk", "sec_SK")
import org.apache.spark.sql.functions._
df.withColumn("Visits", count("fi_Sk").over(windowSpec))
// .sort("fi_Sk", "END_DATE")
// .show(false)
//
// +-----+------+--------+------+
// |fi_Sk|sec_SK|END_DATE|Visits|
// +-----+------+--------+------+
// |51 |42 |20130616|2 |
// |51 |42 |20140116|2 |
// |89 |44 |20100608|1 |
// |89 |42 |20150330|2 |
// |89 |42 |20160122|2 |
// +-----+------+--------+------+
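If you also need the most recent visit per group, as in the expected output, the same window can carry a max aggregation; a sketch reusing windowSpec and the imports from the snippet above (END_DATE in yyyyMMdd form sorts correctly as a string):
val withVisits = df
  .withColumn("Visits", count("fi_Sk").over(windowSpec))
  .withColumn("Recent_Visit", max("END_DATE").over(windowSpec))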

Spark Dataframe Group by having New Indicator Column

I need to group by the "KEY" column and check whether the "TYPE_CODE" column has both "PL" and "JL" values; if so, I need to add an indicator column with "Y", else "N".
Example :
//Input Values
val values = List(
  List("66", "PL"),
  List("67", "JL"), List("67", "PL"), List("67", "PO"),
  List("68", "JL"), List("68", "PO")).map(x => (x(0), x(1)))
import spark.implicits._
//created a dataframe
val cmc = values.toDF("KEY","TYPE_CODE")
cmc.show(false)
------------------------
KEY |TYPE_CODE |
------------------------
66 |PL |
67 |JL |
67 |PL |
67 |PO |
68 |JL |
68 |PO |
-------------------------
Expected Output :
For each "KEY", If it has "TYPE_CODE" has both PL & JL then Y
else N
-----------------------------------------------------
KEY |TYPE_CODE | Indicator
-----------------------------------------------------
66 |PL | N
67 |JL | Y
67 |PL | Y
67 |PO | Y
68 |JL | N
68 |PO | N
---------------------------------------------------
For example,
67 has both PL & JL - So "Y"
66 has only PL - So "N"
68 has only JL - So "N"
One option:
1) collect TYPE_CODE as a list;
2) check whether it contains both of the specific strings;
3) then flatten the list back with explode:
import org.apache.spark.sql.functions._

(cmc.groupBy("KEY")
  .agg(collect_list("TYPE_CODE").as("TYPE_CODE"))
  .withColumn("Indicator",
    when(array_contains($"TYPE_CODE", "PL") && array_contains($"TYPE_CODE", "JL"), "Y").otherwise("N"))
  .withColumn("TYPE_CODE", explode($"TYPE_CODE"))).show
+---+---------+---------+
|KEY|TYPE_CODE|Indicator|
+---+---------+---------+
| 68| JL| N|
| 68| PO| N|
| 67| JL| Y|
| 67| PL| Y|
| 67| PO| Y|
| 66| PL| N|
+---+---------+---------+
Another option:
Group by KEY and use agg to create two separate indicator columns (one for JL and one for PL), then calculate the combined indicator.
Join the result with the original DataFrame.
Altogether:
val indicators = cmc.groupBy("KEY").agg(
  sum(when($"TYPE_CODE" === "PL", 1).otherwise(0)) as "pls",
  sum(when($"TYPE_CODE" === "JL", 1).otherwise(0)) as "jls"
).withColumn("Indicator", when($"pls" > 0 && $"jls" > 0, "Y").otherwise("N"))

val result = cmc.join(indicators, "KEY")
  .select("KEY", "TYPE_CODE", "Indicator")
This might be slower than #Psidom's answer, but might be safer - collect_list might be problematic if you have a huge number of matches for a specific key (that list would have to be stored in a single worker's memory).
EDIT:
In case the input is known to be unique (i.e. JL / PL would only appear once per key, at most), indicators could be created using simple count aggregation, which is (arguably) easier to read:
val indicators = cmc
  .where($"TYPE_CODE".isin("PL", "JL"))
  .groupBy("KEY").count()
  .withColumn("Indicator", when($"count" === 2, "Y").otherwise("N"))

How to create DataFrame from fixed-length text file given field lengths?

I am reading a fixed-position file. The final result of the file is stored in a string. I would like to convert the string into a DataFrame for further processing. Kindly help me with this. Below is what I have:
Input data:
+---------+----------------------+
|PRGREFNBR|value |
+---------+----------------------+
|01 |11 apple TRUE 0.56|
|02 |12 pear FALSE1.34|
|03 |13 raspberry TRUE 2.43|
|04 |14 plum TRUE .31|
|05 |15 cherry TRUE 1.4 |
+---------+----------------------+
Field lengths (data positions): "3,10,5,4"
Expected result with default headers in the DataFrame:
+-----+-----+----------+-----+-----+
|SeqNo|col_0| col_1|col_2|col_3|
+-----+-----+----------+-----+-----+
| 01 | 11 |apple |TRUE | 0.56|
| 02 | 12 |pear |FALSE| 1.34|
| 03 | 13 |raspberry |TRUE | 2.43|
| 04 | 14 |plum |TRUE | 1.31|
| 05 | 15 |cherry |TRUE | 1.4 |
+-----+-----+----------+-----+-----+
Given the fixed-position file (say input.txt):
11 apple TRUE 0.56
12 pear FALSE1.34
13 raspberry TRUE 2.43
14 plum TRUE 1.31
15 cherry TRUE 1.4
and the lengths of the fields in the input file (say lengths):
3,10,5,4
you could create a DataFrame as follows:
// Read the text file as is
// and filter out empty lines
val lines = spark.read.textFile("input.txt").filter(!_.isEmpty)
// define a helper function to do the split per fixed lengths
// Home exercise: should be part of a case class that describes the schema
def parseLinePerFixedLengths(line: String, lengths: Seq[Int]): Seq[String] = {
  lengths.indices.foldLeft((line, Array.empty[String])) { case ((rem, fields), idx) =>
    val len = lengths(idx)
    val fld = rem.take(len)
    (rem.drop(len), fields :+ fld)
  }._2
}
// Split the lines using parseLinePerFixedLengths method
val lengths = Seq(3,10,5,4)
val fields = lines.
  map(parseLinePerFixedLengths(_, lengths)).
  withColumnRenamed("value", "fields") // <-- it'd be unnecessary if a case class were used
scala> fields.show(truncate = false)
+------------------------------+
|fields |
+------------------------------+
|[11 , apple , TRUE , 0.56]|
|[12 , pear , FALSE, 1.34]|
|[13 , raspberry , TRUE , 2.43]|
|[14 , plum , TRUE , 1.31]|
|[15 , cherry , TRUE , 1.4 ]|
+------------------------------+
That's what you may have had already, so let's unroll/destructure the nested sequence of fields into columns:
val answer = lengths.indices.foldLeft(fields) { case (result, idx) =>
  result.withColumn(s"col_$idx", $"fields".getItem(idx))
}
// drop the unnecessary/interim column
scala> answer.drop("fields").show
+-----+----------+-----+-----+
|col_0| col_1|col_2|col_3|
+-----+----------+-----+-----+
| 11 |apple |TRUE | 0.56|
| 12 |pear |FALSE| 1.34|
| 13 |raspberry |TRUE | 2.43|
| 14 |plum |TRUE | 1.31|
| 15 |cherry |TRUE | 1.4 |
+-----+----------+-----+-----+
Done!
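If you also want typed columns instead of padded strings, one possible follow-up (a sketch; the target names are hypothetical and the types are assumptions based on the sample data) is to trim and cast them:
import org.apache.spark.sql.functions.{col, trim}

val typed = answer.drop("fields").select(
  trim(col("col_0")).cast("int").as("id"),        // hypothetical target names
  trim(col("col_1")).as("name"),
  trim(col("col_2")).cast("boolean").as("flag"),  // "TRUE"/"FALSE" cast to true/false
  trim(col("col_3")).cast("double").as("price")
)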