I have two paired RDDs that I joined together on the same key, and now I want to add a new calculated column using two fields from the values part. The type of the joined RDD is:
RDD[((String, Int), Iterable[((String, DateTime, Int,Int), (String, DateTime, String, String))])]
I want to add another field to the new RDD that shows the delta between the two DateTime fields.
How can I do this?
You should be able to do this using map to extend the 2-tuples into 3-tuples, roughly as follows:
joined.map { case (key, values) =>
  val delta = computeDelta(values)
  (key, values, delta)
}
Or, more concisely:
joined.map{ case (k, vs) => (k, vs, computeDelta(vs)) }
Then your computeDelta function can take the two sides of each joined pair, get the second item (the DateTime) from each, and compute the delta using whatever DateTime functions are convenient.
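For example, a minimal computeDelta sketch (this assumes joda-time DateTimes and that the delta you want is between the two timestamps of the first joined pair, in milliseconds; adjust to your actual needs):

import org.joda.time.DateTime

// Hypothetical sketch: take the first joined pair for this key and return the
// absolute gap between its two DateTime fields in milliseconds.
def computeDelta(
    values: Iterable[((String, DateTime, Int, Int), (String, DateTime, String, String))]): Long = {
  val (left, right) = values.head
  math.abs(right._2.getMillis - left._2.getMillis)
}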
If you want your output RDD to still be a paired RDD, then you will need to wrap the new delta field into a tuple, roughly as follows:
joined.mapValues { values =>
  val delta = computeDelta(values)
  (values, delta)
}
which will preserve the original pair RDD keys and give you values of type (Iterable[((String, DateTime, Int, Int), (String, DateTime, String, String))], Long)
(assuming you are calculating deltas of type Long)
Related
Here is the example dataframe,
city, LONG, LAT
city1, 100.30, 50.11
city2, 100.20, 50.16
city3, 100.20, 51
..
We need to calculate the distance between city1 and all other cities, then city2 and all other cities, and so on for each city. A 'distance' function has already been created. In Python we could loop over each line or use a dict.
How can I apply the for-loop or dict approach to a DataFrame?
For example, in Python (not all of the code is shown here):
import copy

citydict = dict()  # city -> (lon, lat), filled in elsewhere
citydict2 = copy.deepcopy(citydict)
for city1, cityinfo1 in citydict.items():
    citydict2.pop(city1)
    for city2, cityinfo2 in citydict2.items():
        s = distancecalc(cityinfo1, cityinfo2)
The crossJoin method does the trick. It returns the Cartesian product of two DataFrames. The idea is to cross the DataFrame with itself.
import org.apache.spark.sql.functions._

df.as("thisDF")
  .crossJoin(df.as("toCompareDF"))
  .filter($"thisDF.city" =!= $"toCompareDF.city")
  .withColumn("distance",
    calculateDistance($"thisDF.LONG", $"thisDF.LAT", $"toCompareDF.LONG", $"toCompareDF.LAT"))
  .show
First of all, we add an alias to our DataFrame so that we can identify it when we perform the join. The next step is to perform the crossJoin against the same DataFrame; note that we also add an alias to this second DataFrame. To drop the rows that pair a city with itself, we filter on the city column.
Finally, we apply a Spark User Defined Function, passing the necessary columns to calculate the distance. This is the declaration of the UDF:
def calculateDistance = udf((lon1: Double, lat1: Double, lon2: Double, lat2: Double) => {
  // add calculation here
})
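For completeness, one possible body for the UDF is the haversine (great-circle) formula; this is only a sketch that returns the distance in kilometres, not necessarily the 'distance' function the question already has:

import org.apache.spark.sql.functions.udf

// Sketch: great-circle (haversine) distance in kilometres between two points.
def calculateDistance = udf((lon1: Double, lat1: Double, lon2: Double, lat2: Double) => {
  val earthRadiusKm = 6371.0
  val dLat = math.toRadians(lat2 - lat1)
  val dLon = math.toRadians(lon2 - lon1)
  val a = math.pow(math.sin(dLat / 2), 2) +
    math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
      math.pow(math.sin(dLon / 2), 2)
  2 * earthRadiusKm * math.asin(math.sqrt(a))
})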
And that's all. Hope it helps.
I'm new to Spark and Scala.
We have an external data source feeding us JSON. This JSON has quotes around all values, including the number and boolean fields, so by the time I get it into my DataFrame all the columns are strings. The end goal is to convert these JSON records into properly typed Parquet files.
There are approximately 100 fields, and I need to change several of the types from string to int, boolean, or bigint (long). Further, each DataFrame we process will only have a subset of these fields, not all of them. So I need to be able to handle subsets of columns for a given DataFrame, compare each column to a known list of column types, and cast certain columns from string to int, bigint, and boolean depending on which columns appear in the DataFrame.
Finally, I need the list of column types to be configurable because we'll have new columns in the future and may want to get rid of or change old ones.
So, here's what I have so far:
import org.apache.spark.sql.types._

// first I convert all column names to lower case
val df = dfIn.toDF(dfIn.columns.map(_.toLowerCase): _*)

// Big mapping to change types
// TODO how would I make this configurable?
// I'd like to drive this list from an external config file.
val dfOut = df.select(
  df.columns.map {
    ///// Boolean
    case a @ "a" => df(a).cast(BooleanType).as(a)
    case b @ "b" => df(b).cast(BooleanType).as(b)
    ///// Integer
    case i @ "i" => df(i).cast(IntegerType).as(i)
    case j @ "j" => df(j).cast(IntegerType).as(j)
    // Bigint to Double
    case x @ "x" => df(x).cast(DoubleType).as(x)
    case y @ "y" => df(y).cast(DoubleType).as(y)
    case other => df(other)
  }: _*
)
Is this a good, efficient way to transform this data into the types I want in Scala?
I could use some advice on how to drive this off an external 'config' file where I could define the column types.
My question evolved into this question. Good answer given there:
Spark 2.2 Scala DataFrame select from string array, catching errors
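For reference, one way to make the cast list configurable is to drive the match from a name-to-type map instead of a hard-coded pattern match. A rough sketch (the map is hard-coded here, but it could just as well be parsed from an external config file):

import org.apache.spark.sql.types._

// Hypothetical column -> type mapping; in practice this could be loaded from a
// config file (e.g. entries like "a=boolean,i=int,x=double") instead of hard-coded.
val columnTypes: Map[String, DataType] = Map(
  "a" -> BooleanType,
  "b" -> BooleanType,
  "i" -> IntegerType,
  "j" -> IntegerType,
  "x" -> DoubleType,
  "y" -> DoubleType
)

val dfOut = df.select(
  df.columns.map { c =>
    columnTypes.get(c) match {
      case Some(t) => df(c).cast(t).as(c)
      case None    => df(c) // leave unknown columns as strings
    }
  }: _*
)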
In Apache Flink, if I join two data sets on one primary key I get a Tuple2 containing the corresponding entry from each of the data sets.
The problem is that applying the map() method to the resulting Tuple2 data set does not really look nice, especially if the entries of both data sets have a high number of fields.
Using tuples in both input data sets gets me some code like this:
var in1: DataSet[(Int, Int, Int, Int, Int)] = /* */
var in2: DataSet[(Int, Int, Int, Int)] = /* */

val out = in1.join(in2).where(0, 1, 2).equalTo(0, 1, 2)
  .map(join => (join._1._1, join._1._2, join._1._3,
                join._1._4, join._1._5, join._2._4))
I would not mind using POJOs or case classes, but I don't see how this would make it better.
Question 1: Is there a nice way to flatten that Tuple2? E.g. using another operator.
Question 2: How do I handle a join of 3 data sets on the same key? It would make the example source even messier.
Thanks for helping.
You can directly apply a join function on each pair of joined elements, for example:
val leftData: DataSet[(String, Int, Int)] = ...
val rightData: DataSet[(String, Int)] = ...

val joined: DataSet[(String, Int, Int)] = leftData
  .join(rightData).where(0).equalTo(0) { (l, r) => (l._1, l._2, l._3 + r._2) }
To answer the second question: Flink handles only binary joins. However, Flink's optimizer can avoid unnecessary shuffles if you give it a hint about the behavior of your function. Forwarded-field annotations tell the optimizer that certain fields (such as the join key) are not modified by your join function, which enables it to reuse existing partitioning and sort orders.
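For example, on the join above the hint could look roughly like this (a sketch against the DataSet API; check the semantic-annotation docs for your Flink version, since the exact method names and field-expression syntax may differ):

// Sketch: declare that the join key (_1) and the second field of the left
// input are forwarded unchanged to the output tuple.
val joined: DataSet[(String, Int, Int)] = leftData
  .join(rightData).where(0).equalTo(0) { (l, r) => (l._1, l._2, l._3 + r._2) }
  .withForwardedFieldsFirst("_1; _2")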
I have created a map like this -
val b = a.map(x => (x(0), x))
Here b is of the type
org.apache.spark.rdd.RDD[(Any, org.apache.spark.sql.Row)]
How can I sort the PairRDD within each key using a field from the value row?
After that I want to run a function that processes all the values for each key in isolation, in the previously sorted order. Is that possible? If yes, can you please give an example?
Is there any consideration needed for partitioning the pair RDD?
Answering only your first question:
val indexToSelect: Int = ??? // points to a sortable type (has an Ordering or is Ordered)
val sorted = rdd.sortBy(pair => pair._2(indexToSelect))
What this does is select the second value of the pair (pair._2) and, from that Row, pick the value at the given index ((indexToSelect), or more verbosely .apply(indexToSelect)) to sort by.
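For the within-key ordering the question also asks about, one possible sketch (assuming the sort field in the Row is an Int) is to group the rows per key and sort each group's values:

// Sketch: RDD[(Any, List[Row])] whose values are sorted by the chosen field.
val sortedWithinKey = rdd
  .groupByKey()
  .mapValues(rows => rows.toList.sortBy(_.getInt(indexToSelect)))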
I am new to Scala and I am trying to do something for a project:
I generated an RDD:
[UserID1, Date1, Value1]
[UserID1, Date2, Value2]
[UserID1, Date3, Value3]
[UserID2, Date1, Value1]
[UserID3, Date1, Value1]
I wish to run a function on this RDD that generates an RDD like this:
[UserID1, FunctionResult1, FunctionResult2]
[UserID1, FunctionResult3, FunctionResult4]
[UserID2, FunctionResult1, FunctionResult2]
The way this function should work is:
1. Group by UserID and sort by date in ascending order (I have already formatted the date as an Int, e.g. 20150225).
2. Take the first date and the second date and find the number of days between them.
3. Take the first value and the second value and find the difference between them.
Record these values as the function results, then proceed to the second and third dates and the second and third values, and so on.
If the input is 5 rows x 3 columns, the result should be 4 rows x 3 columns.
So far, I have tried reduceByKey on the RDD, but it only generates a single row per key in the output. So I was wondering if there is another efficient way to do this, perhaps without looping? My current code looks like this:
// x._1 is the User ID, x._2 is the Date (an Int like 20150225), x._3 is the Value
val basicsearchprofile = basicsearch
  .map(x => (x._1, (x._2, x._3)))
  .reduceByKey((a, b) => funcdiff(a, b))

def funcdiff(a: (Int, Int), b: (Int, Int)): (Int, Int) = {
  // approximate difference in days, treating months as 30 days and years as 365
  val diffdays = (b._1 % 100 - a._1 % 100) +
    ((b._1 / 100) % 100 - (a._1 / 100) % 100) * 30 +
    ((b._1 / 10000) % 100 - (a._1 / 10000) % 100) * 365
  val diffvalue = Math.abs(a._2 - b._2)
  (diffdays, diffvalue)
}
I assume that the returned value from funcdiff reduces the events in pairs and eventually collapses each key to a single row? Is it possible to instead apply funcdiff to the first and second rows, record the answer, then apply it to the second and third rows, and so on, so that the returned result is an RDD of [ID, DateDiff, ValueDiff]?
Thanks in advance
Spark processes rows in parallel. As you have to compute row2 - row1, row3 - row2, and so on, I think you cannot work fully in parallel anymore. So you'll have to forget Spark a bit, use plain Scala, and process a whole user's data on a single node (each user can still be processed in parallel, though). For instance:
// First, group by user with Spark
case class Info(userId: String, date: Int, value: Int)

val infos = List(
  Info("john", 20150221, 10),
  Info("mary", 20150221, 11),
  Info("john", 20150222, 12),
  Info("mary", 20150223, 15),
  Info("john", 20150223, 14),
  Info("john", 20150224, 16),
  Info("john", 20150225, 18),
  Info("mary", 20150225, 17))

val infoRdd = sc.parallelize(infos)
val infoByIdRdd = infoRdd.map(info => (info.userId, info)).groupByKey()

// Then use plain Scala to process each user's data
def infoDeltas(infos: List[Info]) = {
  // Transform [Info1, Info2, Info3] into [(Info1,Info2),(Info2,Info3)]
  val accZero: (Option[Info], List[(Info, Info)]) = (None, List())
  def accInfo(last: Option[Info], list: List[(Info, Info)], info: Info) = {
    last match {
      case None           => (Some(info), list)
      case Some(lastInfo) => (Some(info), list :+ ((lastInfo, info)))
    }
  }
  val infoIntervals = infos.foldLeft(accZero)(
    (acc, info) => accInfo(acc._1, acc._2, info)
  )._2
  // Transform [(Info1,Info2),(Info2,Info3)] into [Info2-Info1,Info3-Info2]
  infoIntervals.map { case (before, after) =>
    Info(after.userId, after.date - before.date, after.value - before.value)
  }
}

val infoDeltasByIdRdd = infoByIdRdd.mapValues(infos => infoDeltas(infos.toList))
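One caveat: groupByKey does not guarantee the order of the grouped values, so it is safer to sort each user's list by date before computing the deltas, for example:

// Sort each user's records by date so that consecutive pairs really are consecutive
val infoDeltasByIdRdd = infoByIdRdd.mapValues(infos => infoDeltas(infos.toList.sortBy(_.date)))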