How to split an RDD into 2 based on column name - scala

I have an RDD with two columns, A and B.
How can I create 2 RDDs out of it?
I have a use case where I take an input RDD, perform some operations, and produce two different outputs (intermediate (column A) and final (column B)) which need to be written to 2 different locations. How can I split them?
Thanks
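One rough, untested sketch: if the input is an RDD of pairs where _1 is column A and _2 is column B, you can project each column with map and cache the parent so the two saves don't recompute it. The RDD name and output paths below are just placeholders.
import org.apache.spark.storage.StorageLevel

// inputRdd: RDD[(String, String)], _1 = column A, _2 = column B (assumed layout)
val cached = inputRdd.persist(StorageLevel.MEMORY_AND_DISK)

val intermediate = cached.map(_._1)   // column A
val finalOutput  = cached.map(_._2)   // column B

// hypothetical output locations
intermediate.saveAsTextFile("/out/intermediate")
finalOutput.saveAsTextFile("/out/final")

cached.unpersist()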

Related

Efficient histogram on multiple columns in one shot

Here's how I calculate a histogram on one column:
val df = spark.read.format("csv").option("header", "true").load("/project/test.csv")
df.map(row => row.getString(2).toDouble).rdd.histogram(10)
I want to calculate histograms on all columns. I can simply repeat the second line (see code above) and call histogram separately on each column. But my concern is that Spark will load data from disk each time I call histogram(), which means that if there are 10 columns, data is loaded 10 times. Is there a more efficient way to do this? How can I calculate histograms on all 10 columns in one shot?
Edit
Here's one way to combine multiple histogram() calls into one expression:
val histograms = {
  val a = df.map(row => row.getString(0).toDouble).rdd.histogram(10)
  val b = df.map(row => row.getString(1).toDouble).rdd.histogram(15)
  (a, b)
}
Does this guarantee that the histograms will be computed with only one pass over the data? Is combining multiple histogram calls into one expression the trick? Or is that even necessary? Doesn't Spark delay evaluation until the result is used in any case, even when separate statements are used?
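Not a definitive answer, but as far as I know histogram() is an action, so each call launches its own job regardless of how the calls are grouped into one expression; putting them in a block doesn't fuse them into one pass. What you can avoid is re-reading the CSV, by caching the DataFrame once. A sketch (the column indices and bucket counts are placeholders):
import spark.implicits._

// Read and parse the CSV once; keep it cached for all subsequent histogram jobs.
val df = spark.read.format("csv").option("header", "true")
  .load("/project/test.csv")
  .cache()

val histA = df.map(_.getString(0).toDouble).rdd.histogram(10)
val histB = df.map(_.getString(1).toDouble).rdd.histogram(15)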

Spark: Searching for matches in large DataSets

I need to count how many values of one of the columns of df1 are present in one of the columns of df2. (I just need the number of matched values)
I wouldn't be asking this question if efficiency wasn't such a big concern:
df1 contains 100,000,000+ records
df2 contains 1,000,000,000+ records
Just an off-the-top-of-my-head idea, in case intersection won't cut it:
For the datatype that is contained in the columns, find two hash functions h1, h2 such that
h1 produces hashes roughly uniformly between 0 and N
h2 produces hashes roughly uniformly between 0 and M
such that M * N is approximately 1B, e.g. M = 10k, N = 100k,
then:
map each entry x from the column from df1 to (h1(x), x)
map each entry x from the column from df2 to (h1(x), x)
group both by h1 into buckets with xs
join on h1 (that's gonna be the nasty shuffle)
then locally, for each pair of buckets (b1, b2) that came from df1 and df2 and had the same h1 hash code, do essentially the same:
compute h2 for all bs from b1 and from b2,
group by the hash code h2
Compare the small sub-sub-buckets that remain by converting everything toSet and computing the intersection directly.
Everything that remains after intersection is present in both df1 and df2, so compute size and sum the results across all partitions.
The idea is to select N so that the buckets of roughly M entries each still comfortably fit on a single node, while at the same time keeping N small enough that the whole application doesn't die on the first shuffle trying to figure out where everything is by sending every key to every other node. For example, using SHA-256 as the "hash code" for h1 wouldn't help much, because the keys would be essentially unique, so you might as well take the original data directly and try to shuffle that. However, if you restrict N to some reasonably small number, e.g. 10k, you obtain a rough approximation of where everything is, so you can then regroup the buckets and start the second stage with h2.
Essentially it's just a rough guess; I didn't test it. It could well be that the built-in intersection is smarter than anything I could possibly come up with.
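For what it's worth, here is a rough, untested sketch of that two-stage idea for RDD[String] keys; h1, h2 and the bucket counts are placeholders, and like the description above it deduplicates values via sets:
import org.apache.spark.rdd.RDD

// crude stand-ins for h1 (N ~ 100k buckets) and h2 (M ~ 10k sub-buckets)
def h1(x: String): Int = math.abs(x.hashCode % 100000)
def h2(x: String): Int = math.abs((x.hashCode >>> 16) % 10000)

def matchedCount(keys1: RDD[String], keys2: RDD[String]): Long = {
  val b1 = keys1.map(x => (h1(x), x))
  val b2 = keys2.map(x => (h1(x), x))

  // the shuffle: bring both sides of each h1 bucket together
  b1.cogroup(b2).map { case (_, (xs1, xs2)) =>
    // locally split each bucket by h2 and intersect the small pieces
    val g1 = xs1.toSet.groupBy(h2)
    val g2 = xs2.toSet.groupBy(h2)
    g1.map { case (k, s1) =>
      g2.get(k).map(s2 => (s1 intersect s2).size.toLong).getOrElse(0L)
    }.sum
  }.sum().toLong
}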

I have two RDDs, how to put them in one (spark, scala) [duplicate]

This question already has answers here:
How to combine two RDD[String]s index-wise?
(2 answers)
Closed 5 years ago.
I have two RDDs. The first has the x coordinates (one column) and the second has the y coordinates (one column). I want the result to be one RDD with one column in the format (x,y). Is there any solution?
For example:
first rdd has: 1,2,3
second rdd has: 4,5,6
The result: (1,4),(2,5),(3,6)
Thanks in advance
The way to combine two RDDs is by using zip, so you could do something like
val coordinates = x.zip(y)
However, the order of the elements is not guaranteed, since in Spark your elements are split into partitions. Ideally you should have a key that identifies each record and perform a join on it instead.
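Also note that zip requires both RDDs to have the same number of partitions and the same number of elements per partition. If there is no natural key, one option (a sketch, assuming both RDDs list their coordinates in a consistent order) is to manufacture an index with zipWithIndex and join on it:
// x: RDD of x coordinates, y: RDD of y coordinates, same length and order (assumed)
val xIndexed = x.zipWithIndex().map { case (v, i) => (i, v) }
val yIndexed = y.zipWithIndex().map { case (v, i) => (i, v) }

// join on the manufactured index, then drop it: RDD of (x, y) pairs
val coordinates = xIndexed.join(yIndexed).values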

How can I convert one column data to a vector using Spark Scala

I am using Spark and Scala to process data. I have one question I couldn't figure out. I have a dataframe with one column:
data
1
2
3
4
5
I want to convert it to a single vector
[1.0,2.0,3.0,4.0,5.0]
How can I implement it? I tried
df.collect().toVector or rdd.foreach, but every time it returns an array of vectors [1.0], [2.0], [3.0], [4.0], [5.0], not one single vector.
This is happening because when you collect a dataframe you get an Array of rows. You need to extract the values from the row objects.
df.collect().map(x => x.getDouble(0)).toVector
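If the column was read as strings (e.g. from a CSV), the same idea works with an explicit conversion; a small variant, assuming the column is named data and parses as Double:
val vec: Vector[Double] =
  df.select("data").collect().map(_.getString(0).toDouble).toVector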

Compare column values in consecutive rows in Scala

I am new to Spark and Scala. I have a situation where I have to compare values of a particular column in a data set, for example:
Source Data
Source Destination Distance
Austin Houston 200
Dallas Houston 400
Kansas Dallas 700
Resultant
Source1 Destination1 Distance1 Source2 Destination2 Distance2 DistDiff
Dallas Houston 400 Kansas Dallas 700 300
As per the situation, I have to compare the distances of consecutive rows, and if the difference is greater than or equal to 300, save the records in the Resultant data set:
700 - 400 = 300
The examples I have encountered use functions that execute on a per-row basis on a particular data set; however, my scenario requires working with consecutive rows.
You mentioned you can sort rows by datetime. So, assuming it's sorted using sortBy or sortByKey to create an ordered rdd, and also assuming you have an even number of rows (so each row has another one to calculate the difference with), you can:
Give each row an index using zipWithIndex.
Split the RDD into two RDDs, one with even-numbered indices and one with odd-numbered indices, by filtering on the index created.
zip the split RDDs together, creating a new RDD of Tuple2 with even-indexed rows on the left and odd-indexed rows on the right.
map the result to calculate the difference between left/right of each row.
This may be done in the following way:
1. Add an index column to your sorted rdd.
2. Make sure the rdd has an even number of rows N.
3. Make an rdd rdd_even1 to contain the even rows with indices [0, N-2].
4. Make an rdd rdd_odd1 to contain the odd rows [1, N-1].
5. Make an rdd rdd_even2 to contain the even rows [2, N-2].
6. Make an rdd rdd_odd2 to contain the odd rows [1, N-3].
7. Repartition rdd_even1 and rdd_odd1 before zipping, because zipping won't work if the two rdd's do not have the same number of elements in all partitions (in PySpark at least). You can do it in memory using collect and parallelize, but most likely you have to write the rdd's to HDFS and re-read them, controlling for the partitioning.
8. Do the same for rdd_even2 and rdd_odd2.
9. zip the rdd's from step 7 into rdd_zip1.
10. zip the rdd's from step 8 into rdd_zip2.
11. Call rdd_zip1.union(rdd_zip2).
12. Now you can call map() on the union to get your "resultant" with the required differences.
Good luck.
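A rough, untested sketch of the same idea that sidesteps the zip/partition-count problem: index the sorted rows once, join each row with its successor by shifting the index, and keep the pairs whose distances differ by at least 300. The case class and field names are just placeholders for your schema.
import org.apache.spark.rdd.RDD

case class Trip(source: String, destination: String, distance: Double)

// sortedRdd: RDD[Trip], already sorted in the order rows should be compared
def consecutiveDiffs(sortedRdd: RDD[Trip]): RDD[(Trip, Trip, Double)] = {
  val indexed = sortedRdd.zipWithIndex().map { case (row, i) => (i, row) }
  // key each row by its predecessor's index so (i, row_i) meets (i, row_{i+1})
  val shifted = indexed.map { case (i, row) => (i - 1, row) }

  indexed.join(shifted).values
    .map { case (a, b) => (a, b, b.distance - a.distance) }
    .filter { case (_, _, diff) => math.abs(diff) >= 300 }
}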