I have two RDDs, how to put them into one (Spark, Scala) [duplicate]

This question already has answers here:
How to combine two RDD[String]s index-wise?
(2 answers)
Closed 5 years ago.
I have two RDDs. The first has the x coordinates (one column) and the second has the y coordinates (one column). I want the result to be one RDD with one column in the format (x, y). Is there any solution?
For example:
The first RDD has: 1, 2, 3
The second RDD has: 4, 5, 6
The result: (1,4), (2,5), (3,6)
Thanks in advance

The way to combine two RDDs index-wise is with zip, so you could do something like:
val coordinates = x.zip(y)
However, the order of the elements is not guaranteed, since Spark splits your elements across partitions; zip also requires both RDDs to have the same number of partitions and the same number of elements per partition. If you cannot rely on that, you should perform a join instead, using a key that identifies each record.
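A minimal sketch of both options, assuming sc is an existing SparkContext and using the sample values from the question:

// zip pairs elements index-wise; both RDDs are built the same way here,
// so the partition layout matches and zip is safe.
val x = sc.parallelize(Seq(1, 2, 3))
val y = sc.parallelize(Seq(4, 5, 6))
val coordinates = x.zip(y)
coordinates.collect().foreach(println) // (1,4), (2,5), (3,6)

// If the partition layouts may differ, key both sides by an explicit
// index and join on it instead; this does not require matching partitions.
val byIndex = x.zipWithIndex().map(_.swap)
  .join(y.zipWithIndex().map(_.swap))
  .values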

Related

Spark StringIndexer consistent output value for given input [duplicate]

This question already has an answer here:
Spark ML StringIndexer Different Labels Training/Testing
(1 answer)
Closed 5 years ago.
Is it possible to use Spark's StringIndexer to consistently return the same output for a given input (i.e., a value labelled 'Apple' will always map to, say, 56.0)?
The use case is indexing multiple DataFrames where not all inputs appear in both, but you want to ensure that the values which do appear in both are converted to the same index.
I'm trying to avoid writing my own String => Number mapping and wondered whether StringIndexer could do this.
After looking some more I came across this similar post:
Spark ML StringIndexer Different Labels Training/Testing
If you save the StringIndexerModel from the first fit and reuse it to transform any further DataFrames, you'll get the same outputs.
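A minimal sketch of that pattern; the DataFrame names (trainDf, otherDf), column names, and save path are illustrative:

import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}

// Fit the indexer once; the fitted model holds the label-to-index mapping.
val indexer = new StringIndexer()
  .setInputCol("fruit")
  .setOutputCol("fruitIndex")
  .setHandleInvalid("keep") // index labels unseen at fit time instead of failing (Spark 2.2+)
val model: StringIndexerModel = indexer.fit(trainDf)

// Persist the fitted model so the exact same mapping can be reloaded later.
model.write.overwrite().save("/models/fruit-indexer")

// Every DataFrame transformed with this model (or a reloaded copy of it)
// maps 'Apple' to the same index.
val reloaded = StringIndexerModel.load("/models/fruit-indexer")
val indexedTrain = model.transform(trainDf)
val indexedOther = reloaded.transform(otherDf)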
I've flagged this post as a duplicate.

How can I convert one column data to a vector using Spark Scala

I am using Spark and Scala to process data, and I have one question I couldn't figure out. I have a DataFrame with a single column:
data
1
2
3
4
5
I want to convert it to a single vector:
[1.0,2.0,3.0,4.0,5.0]
How can I implement it? I tried
df.collect().toVector or rdd.foreach, but every time it returns an array of single-element vectors [1.0], [2.0], [3.0], [4.0], [5.0], not one single vector.
This happens because collecting a DataFrame gives you an Array of Rows; you need to extract the value from each Row object first:
df.collect().map(x => x.getDouble(0)).toVector
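A self-contained sketch, assuming the column is named data as in the question (the session setup is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("column-to-vector").getOrCreate()
import spark.implicits._

val df = Seq(1.0, 2.0, 3.0, 4.0, 5.0).toDF("data")

// collect() returns Array[Row]; pull the double out of each Row, then
// build one Scala Vector on the driver.
val vec: Vector[Double] = df.collect().map(_.getDouble(0)).toVector
// vec: Vector(1.0, 2.0, 3.0, 4.0, 5.0)

If you need a Spark ML vector rather than a Scala collection, org.apache.spark.ml.linalg.Vectors.dense(df.collect().map(_.getDouble(0))) builds one dense vector from the same array. Either way, collect() brings every value to the driver, so this only works when the column fits in driver memory.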

How to split an RDD into 2 based on column name

I have an RDD with two columns, A and B.
How can I create 2 RDDs out of it?
My use case takes an input RDD, performs some operations, and produces two different outputs (intermediate: column A; final: column B) which need to go to 2 different locations. How can I split them?
Thanks
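A minimal sketch of one way to do this, assuming sc is an existing SparkContext, the RDD holds (A, B) pairs, and the output paths are illustrative:

val input = sc.parallelize(Seq(("a1", "b1"), ("a2", "b2")))

// Cache so the upstream computation is not redone for each output.
input.cache()

val rddA = input.map { case (a, _) => a } // column A (intermediate)
val rddB = input.map { case (_, b) => b } // column B (final)

rddA.saveAsTextFile("hdfs:///out/intermediate")
rddB.saveAsTextFile("hdfs:///out/final")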

Compare column values in consecutive rows in Scala

I am new to Spark and Scala. I have a situation where I have to compare the values of a particular column in a data set, for example:
Source Data
Source Destination Distance
Austin Houston 200
Dallas Houston 400
Kansas Dallas 700
Resultant
Source1 Destination1 Distance1 Source2 Destination2 Distance2 DistDiff
Dallas Houston 400 Kansas Dallas 700 300
As per the situation, I have to compare the distances of consecutive rows, and if the difference is greater than or equal to 300, save the records in the Resultant data set; here 700 - 400 = 300.
The examples I have encountered apply functions on a per-row basis over a data set, whereas my scenario has to work with consecutive rows.
You mentioned you can sort the rows by datetime. So, assuming the RDD is sorted using sortBy or sortByKey, and also assuming you have an even number of rows (so each row has a partner to calculate the difference with), you can:
Give each row an index using zipWithIndex.
Split the RDD into two RDDs, one with even-numbered indices and one with odd-numbered indices, by filtering on the index created.
zip the split RDDs together, creating a new RDD of Tuple2 with even-indexed rows on the left and odd-indexed rows on the right.
map the result to calculate the difference between left/right of each row.
This may be done in the following way:
1. Add an index column to your sorted RDD.
2. Make sure the RDD has an even number of rows, N.
3. Make an RDD rdd_even1 containing the even rows with indices [0, N-2].
4. Make an RDD rdd_odd1 containing the odd rows with indices [1, N-1].
5. Make an RDD rdd_even2 containing the even rows with indices [2, N-2].
6. Make an RDD rdd_odd2 containing the odd rows with indices [1, N-3].
7. Repartition rdd_even1 and rdd_odd1 before zipping, because zip won't work if the two RDDs do not have the same number of elements in every partition (in PySpark at least). You can do this in memory using collect and parallelize, but most likely you will have to write the RDDs to HDFS and re-read them, controlling the partitioning.
8. Do the same for rdd_even2 and rdd_odd2.
9. zip the RDDs from step 7 into rdd_zip1.
10. zip the RDDs from step 8 into rdd_zip2.
11. Call rdd_zip1.union(rdd_zip2).
12. Call map() on the union to get your "resultant" with the required differences; a compact sketch of an equivalent implementation follows.
Good luck.
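For reference, a compact Scala sketch of the same idea; instead of splitting into even/odd RDDs and zipping, it keys each row by its index and joins index i with index i + 1, which avoids zip's equal-partition-size requirement. The Trip case class mirrors the question's columns but is otherwise an assumption, as is sc being an existing SparkContext:

case class Trip(source: String, destination: String, distance: Double)

// Assumed input: rows already sorted in the desired order.
val rows = sc.parallelize(Seq(
  Trip("Austin", "Houston", 200.0),
  Trip("Dallas", "Houston", 400.0),
  Trip("Kansas", "Dallas", 700.0)
))

// Give each row a stable index.
val indexed = rows.zipWithIndex().map { case (row, i) => (i, row) }

// Pair each row with its successor: re-key row i as (i + 1, row_i),
// then join with (i + 1, row_{i+1}).
val consecutive = indexed
  .map { case (i, row) => (i + 1, row) }
  .join(indexed)
  .values // (row_i, row_{i+1})

// Keep only pairs whose distance difference is at least 300; for the
// sample data this keeps (Dallas 400, Kansas 700) with a difference of 300.
val resultant = consecutive.filter { case (prev, next) =>
  next.distance - prev.distance >= 300
}

resultant.collect().foreach(println)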

How to sort the rows of a matrix, so that a specified column is sorted afterwards [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
How can I sort a 2-D array in MATLAB with respect to one column?
I want to sort one column of a 2D matrix while keeping each row intact: if the 3rd element of that column is to be swapped with the 1st element, then swap row 3 with row 1, and so on. How can I do this in MATLAB? Thank you.
This does the trick; sortrows sorts the complete rows of A in order of the given column:
B = sortrows(A, column_number);