How to map an RDD to another RDD with Scala, in Spark? - scala

I have an RDD:
RDD1 = (big,data), (apache,spark), (scala,language) ...
and I need to combine it with a timestamp
RDD2 = ('2015-01-01 13.00.00')
so that I get
RDD3 = (big, data, 2015-01-01 13.00.00), (apache, spark, 2015-01-01 13.00.00), (scala, language, 2015-01-01 13.00.00)
I wrote a simple map function for this:
RDD3 = RDD1.map(rdd => (rdd, RDD2))
but it does not work, and I don't think it is the right approach.
How can I do this? I am new to Scala and Spark. Thank you.

You can use zip:
val rdd1 = sc.parallelize(("big","data") :: ("apache","spark") :: ("scala","language") :: Nil)
// RDD[(String, String)]
val rdd2 = sc.parallelize(List.fill(3)(new java.util.Date().toString))
// RDD[String]
rdd1.zip(rdd2).map{ case ((a,b),c) => (a,b,c) }.collect()
// Array((big,data,Fri Jul 24 22:25:01 CEST 2015), (apache,spark,Fri Jul 24 22:25:01 CEST 2015), (scala,language,Fri Jul 24 22:25:01 CEST 2015))
If you want the same timestamp for every element of rdd1:
val now = new java.util.Date().toString
rdd1.map{ case (a,b) => (a,b,now) }.collect()
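If RDD2 really is an RDD holding a single timestamp string, as in the question, zip will not help because the two RDDs have different element counts. A minimal sketch (reusing rdd1 from above and a hypothetical one-element rdd2Single) is a cartesian product, which is cheap only because the second RDD has exactly one element:
val rdd2Single = sc.parallelize(Seq("2015-01-01 13.00.00"))
// cartesian pairs every element of rdd1 with the single timestamp
val rdd3 = rdd1.cartesian(rdd2Single).map { case ((a, b), ts) => (a, b, ts) }
rdd3.collect()
// e.g. Array((big,data,2015-01-01 13.00.00), (apache,spark,2015-01-01 13.00.00), (scala,language,2015-01-01 13.00.00))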

Related

Convert RDD[List[AnyRef]] to RDD[List[String, Date, String, String]]

I want to set the return type of an RDD, but it is RDD[List[AnyRef]],
so I am not able to specify the element types directly.
Like,
val rdd2 = rdd1.filter(! _.isEmpty).filter(x => x(0) != null)
I want this to return an RDD[List[String, Date, String, String]], but it remains RDD[List[AnyRef]].
EDIT
rdd1:
List(Sun Jul 31 10:21:53 PDT 2016, pm1, 11, ri1)
List(Mon Aug 01 12:57:09 PDT 2016, pm3, 5, ri1)
List(Mon Aug 01 01:11:16 PDT 2016, pm1, 1, ri2)
This rdd1 is of type RDD[List[AnyRef]].
Now I want rdd2 in this type:
RDD[List[Date, String, Long, String]]
The reason is that I am facing issues with the date column while converting the RDD to a DataFrame using a schema. To deal with that, I first have to fix the RDD type.
That problem's solution is here:
Spark rdd correct date format in scala?
Here is a small example that leads to the same problem (I omitted Date and replaced it with String; that's not the point):
val myRdd = sc.makeRDD(List(
  List[AnyRef]("date 1", "blah2", (11: java.lang.Integer), "baz1"),
  List[AnyRef]("date 2", "blah3", (5: java.lang.Integer), "baz2"),
  List[AnyRef]("date 3", "blah4", (1: java.lang.Integer), "baz3")
))
// myRdd: org.apache.spark.rdd.RDD[List[AnyRef]] = ParallelCollectionRDD[0]
Here is how you can recover the types:
val unmessTypes = myRdd.map {
  case List(a: String, b: String, c: java.lang.Integer, d: String) => (a, b, (c: Int), d)
}
// unmessTypes: org.apache.spark.rdd.RDD[(String, String, Int, String)] = MapPartitionsRDD[1]
You simply apply a partial function that matches lists of length 4 with elements of the specified types and constructs a tuple of the expected type from each one. If your RDD indeed contains only lists of length 4 with the expected types, the partial function will never fail.
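If there is a chance that some lists do not match (wrong length or wrong element types), a hedged variant is RDD.collect with the same partial function, which drops non-matching rows instead of throwing a MatchError:
// RDD.collect(PartialFunction) keeps only the rows the pattern matches
val safeTypes = myRdd.collect {
  case List(a: String, b: String, c: java.lang.Integer, d: String) => (a, b, (c: Int), d)
}
// safeTypes: org.apache.spark.rdd.RDD[(String, String, Int, String)]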
Looking at your Spark rdd correct date format in scala? question, it seems you are having an issue converting your rdd to a dataframe. Tzach has already answered it correctly: convert the java.util.Date to java.sql.Date, and that should solve your issue.
First of all, a List cannot have a separate dataType for each element the way a Tuple can. A List has only one dataType, and if mixed dataTypes are used, the dataType of the list is inferred as Any or AnyRef.
I guess you must have created the data as below:
import java.text.SimpleDateFormat
import java.util.Locale

val list = List(
  List[AnyRef](new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH).parse("Sun Jul 31 10:21:53 PDT 2016"), "pm1", 11L: java.lang.Long, "ri1"),
  List[AnyRef](new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH).parse("Mon Aug 01 12:57:09 PDT 2016"), "pm3", 5L: java.lang.Long, "ri1"),
  List[AnyRef](new SimpleDateFormat("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH).parse("Mon Aug 01 01:11:16 PDT 2016"), "pm1", 1L: java.lang.Long, "ri2")
)
val rdd1 = spark.sparkContext.parallelize(list)
which would give
rdd1: org.apache.spark.rdd.RDD[List[AnyRef]]
but in fact the real element types are [java.util.Date, String, java.lang.Long, String].
And looking at your other question, you must be having a problem converting the rdd to a dataframe with the following schema:
import org.apache.spark.sql.types._

val schema =
  StructType(
    StructField("lotStartDate", DateType, false) ::
    StructField("pm", StringType, false) ::
    StructField("wc", LongType, false) ::
    StructField("ri", StringType, false) :: Nil)
What you can do is use the java.sql.Date api, as answered in your other question, and then create the dataframe as:
import org.apache.spark.sql.Row

val rdd1 = sc.parallelize(list).map(lis => Row.fromSeq(new java.sql.Date(lis.head.asInstanceOf[java.util.Date].getTime) :: lis.tail))
val df = sqlContext.createDataFrame(rdd1, schema)
which should give you
+------------+---+---+---+
|lotStartDate|pm |wc |ri |
+------------+---+---+---+
|2016-07-31 |pm1|11 |ri1|
|2016-08-02 |pm3|5 |ri1|
|2016-08-01 |pm1|1 |ri2|
+------------+---+---+---+
I hope the answer is helpful
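As a quick sanity check (a sketch, assuming the df built above), printing the schema should confirm that lotStartDate now comes through as a date column:
df.printSchema()
// expected, roughly:
// root
//  |-- lotStartDate: date (nullable = false)
//  |-- pm: string (nullable = false)
//  |-- wc: long (nullable = false)
//  |-- ri: string (nullable = false)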

Filtering RDD1 on the basis of RDD2

I have 2 RDDs in the below format:
RDD1
178,1
156,1
23,2
RDD2
34
178
156
Now I want to filter rdd1 on the basis of the values in rdd2, i.e. if 178 is present in rdd1 and also in rdd2, then it should return those tuples from rdd1.
I have tried
val out = reversedl1.filter({ case(x,y) => x.contains(lines)})
where lines is my 2nd rdd and reversedl1 is the first, but it's not working.
I also tried
val abce = reversedl1.subtractByKey(lines)
val defg = reversedl1.subtractByKey(abce)
This is also not working.
Any suggestions?
You can convert rdd2 to key value pairs and then join with rdd1 on the keys:
val rdd1 = sc.parallelize(Seq((178, 1), (156, 1), (23, 2)))
val rdd2 = sc.parallelize(Seq(34, 178, 156))
(rdd1.join(rdd2.distinct().map(k => (k, null)))
// here create a dummy value to form a pair wise RDD so you can join
.map{ case (k, (v1, v2)) => (k, v1) } // drop the dummy value
).collect
// res11: Array[(Int, Int)] = Array((156,1), (178,1))
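If rdd2 is small enough to collect to the driver, an alternative sketch is to broadcast its values as a Set and filter rdd1 directly, which avoids the shuffle a join triggers (same rdd1 and rdd2 as above):
val keys = sc.broadcast(rdd2.collect().toSet)
// keep only pairs whose key appears in the broadcast set
val filtered = rdd1.filter { case (k, _) => keys.value.contains(k) }
filtered.collect()
// Array((178,1), (156,1))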

Spark Get years as array To compare

I have data which contains
+------------+----------+
|BaseFromYear|BaseToYear|
+------------+----------+
| 2013| 2013|
+------------+----------+
I need to check the difference of the two years and compare in another dataframe whether the required year exists in the base years, so I created this query:
val df = DF_WE.filter($"id" === 3 && $"status" === 1)
  .select("BaseFromYear", "BaseToYear")
  .withColumn("diff_YY", $"BaseToYear" - $"BaseFromYear".cast(IntegerType))
  .withColumn("Baseyears", when($"diff_YY" === 0, $"BaseToYear"))
+------------+----------+-------+---------+
|BaseFromYear|BaseToYear|diff_YY|Baseyears|
+------------+----------+-------+---------+
| 2013| 2013| 0| 2013|
+------------+----------+-------+---------+
So I get the above output. But if BaseFromYear is 2014 and BaseToYear is 2017 then the difference will be 3, and I need to get [2014, 2015, 2016, 2017] as Baseyears, so that in the next step, with a required year of say 2016, I can compare it against the base years. I see there is an isin function; will it work?
I have added comments in the code; let me know if you need further explanation.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
// This is a user defined function(udf) which will populate an array of Int from BaseFromYear to BaseToYear
val generateRange: (Int, Int) => Array[Int] = (baseFromYear: Int, baseToYear: Int) => (baseFromYear to baseToYear).toArray
val sqlfunc = udf(generateRange) // Registering the UDF with spark
val df = DF_WE.filter($"id" === 3 && $"status" === 1)
.select("BaseFromYear", "BaseToYear")
.withColumn("diff_YY", $"BaseToYear" - $"BaseFromYear".cast(IntegerType))
.withColumn("Baseyears", sqlfunc($"BaseFromYear", $"BaseToYear")) // using the UDF to populate new columns
df.show()
// Now lets say we are selecting records which has 2016 in the Baseyears
val filteredDf = df.where(array_contains(df("Baseyears"), 2016))
filteredDf.show()
// The array<int> column arrives in the UDF as a Seq of its element type, i.e. Seq[Int]
val isIn: (Int, Seq[Int]) => Boolean = (num: Int, years: Seq[Int]) => years.contains(num)
val sqlIsIn = udf(isIn)
// assumes df also has an integer column "YY" holding the required year to look up
val filteredDfBasedOnAnotherCol = df.filter(sqlIsIn(df("YY"), df("Baseyears")))
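To make the behaviour concrete, here is a small self-contained sketch with dummy rows (not your real DF_WE data; it assumes spark.implicits._ is in scope, which the $ syntax above already requires):
val demo = Seq((2014, 2017), (2013, 2013)).toDF("BaseFromYear", "BaseToYear")
  .withColumn("Baseyears", sqlfunc($"BaseFromYear", $"BaseToYear"))
demo.show(false)
// +------------+----------+------------------------+
// |BaseFromYear|BaseToYear|Baseyears               |
// +------------+----------+------------------------+
// |2014        |2017      |[2014, 2015, 2016, 2017]|
// |2013        |2013      |[2013]                  |
// +------------+----------+------------------------+
demo.where(array_contains($"Baseyears", 2016)).show(false) // keeps only the 2014-2017 row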

Dropping constant columns in a csv file

I would like to drop columns which are constant in a dataframe. Here is what I did, but I see that it takes too much time, especially while writing the dataframe into the csv file. Any help to optimize the code so it takes less time would be appreciated.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.stddev

val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("inferSchema", "true").option("header", "false").csv("D:\\ProcessDataSet\\anis_data\\Set _1Mud Pumps_Merged.csv")
val aggregations = df.drop("DateTime").columns.map(c => stddev(c).as(c))
val df2 = df.agg(aggregations.head, aggregations.tail: _*)
val columnsToKeep: Seq[String] = (df2.first match {
    case r: Row => r.toSeq.toArray.map(_.asInstanceOf[Double])
  }).zip(df.drop("DateTime").columns) // zip with the same columns the stddevs were computed over
  .filter(_._1 != 0) // your special condition is in the filter
  .map(_._2)         // keep just the name of the column
// select columns with stddev != 0
val finalResult = df.select(columnsToKeep.head, columnsToKeep.tail: _*)
finalResult.write.option("header", true).csv("D:\\ProcessDataSet\\dataWithoutConstant\\Set _1Mud Pumps_MergedCleaned.csv")
I think there is not much room left for optimization. You are doing the right thing.
Maybe what you can try is to cache() your dataframe df.
df is used in two separate Spark actions so it is loaded twice.
Try :
...
val df = spark.read.option("inferSchema", "true").option("header", "false").csv("D:\\ProcessDataSet\\anis_data\\Set _1Mud Pumps_Merged.csv")
df.cache()
val aggregations = df.drop("DateTime").columns.map(c => stddev(c).as(c))
...
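If the file is too big to cache comfortably in memory, a hedged variant is to persist with MEMORY_AND_DISK and release the cache once the csv has been written:
import org.apache.spark.storage.StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions to disk if they do not fit in memory
// ... compute aggregations and write finalResult as above ...
df.unpersist()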

How to replace epoch column in DataFrame with date in scala

I am writing a Spark application which receives an Avro record. I am converting that Avro record into a Spark DataFrame (df) object.
The df contains a timestamp attribute which is in seconds (epoch time).
I want to replace the seconds column with a date column.
How can I do that?
My code snippet is :
val df = sqlContext.read.avro("/root/Work/PixelReporting/input_data/pixel.avro")
val pixelGeoOutput = df.groupBy("current_time", "pixel_id", "geo_id", "operation_type", "is_piggyback").count()
pixelGeoOutput.write.json("/tmp/pixelGeo")
"current_time" is in seconds right now. I want to convert it into date.
Since Spark 1.5 there's a built-in sql function called from_unixtime, so you can do:
import org.apache.spark.sql.functions.{col, from_unixtime}

// toDF requires the implicits import (e.g. sqlContext.implicits._)
val df = Seq(Tuple1(1462267668L)).toDF("epoch")
df.withColumn("date", from_unixtime(col("epoch")))
Thanks guys,
I used the withColumn method to solve my problem.
My code snippet is:
import org.apache.spark.sql.functions.udf
import org.joda.time.format.DateTimeFormat

val newdf = df.withColumn("date", epochToDateUDF(df("current_time")))

def epochToDateUDF = udf((current_time: Long) => {
  DateTimeFormat.forPattern("YYYY-MM-dd").print(current_time * 1000)
})
This should give you an idea:
import java.util.Date

// note: toDF needs import sqlContext.implicits._; in Spark 1.x, DataFrame.map returns an RDD
val df = sc.parallelize(List(1462267668L, 1462267672L, 1462267678L)).toDF("current_time")
val dfWithDates = df.map(row => new Date(row.getLong(0) * 1000))
dfWithDates.collect()
Output:
Array[java.util.Date] = Array(Tue May 03 11:27:48 CEST 2016, Tue May 03 11:27:52 CEST 2016, Tue May 03 11:27:58 CEST 2016)
You might also wrap this in a UDF and use withColumn to replace just that single column.