How to print RowMatrix in Scala/Spark? - scala

How can I view/print to screen a small RowMatrix in Scala?
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val A = new RowMatrix(sparkContext.parallelize(Seq(
  Vectors.dense(1, 2, 3),
  Vectors.dense(4, 5, 6))))

I figured it out; it's just:
A.rows.collect
FYI: beware of the matrix size, since collect() brings all of the rows back to the driver.
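If you want the rows actually printed to the screen one per line (rather than just returned as an Array in the shell), a minimal sketch would be:
// Collect the (small) distributed matrix to the driver and print each row Vector.
A.rows.collect().foreach(println)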

Related

Reformatting Dataframe Containing Array to RowMatrix

I have this dataframe in the following format:
+----------+
|  features|
+----------+
|[1,4,7,10]|
|[2,5,8,11]|
|[3,6,9,12]|
+----------+
Script to create sample dataframe:
from pyspark.mllib.linalg.distributed import IndexedRow

rows2 = sc.parallelize([IndexedRow(0, [1, 4, 7, 10]),
                        IndexedRow(1, [2, 5, 8, 11]),
                        IndexedRow(2, [3, 6, 9, 12])])
rows_df = rows2.toDF()
row_vec = rows_df.drop("index")
row_vec.show()
The features column contains 4 features per row, and there are 3 row ids. I want to convert this data to a RowMatrix, where the columns and rows will be in the following mat format:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
Basically, I want to transpose the dataframe into the new format so that I can run the columnSimilarities() function. I have a much larger dataframe that contains 50 features and 39,000 rows.
Is this what you are trying to do? I hate using collect(), but I don't think it can be avoided here, since you want to reshape/convert a structured object to a matrix ... right?
import numpy as np

X = np.array(row_vec.select("_2").collect()).reshape(-1, 3)
X = sc.parallelize(X)
for i in X.collect(): print(i)
[1 4 7]
[10 2 5]
[8 1 3]
[ 6 9 12]
I figured it out; I used the following:
from pyspark.mllib.linalg.distributed import RowMatrix, CoordinateMatrix, MatrixEntry

# Pull the feature vectors out of the dataframe and build a RowMatrix.
features_rdd = row_vec.select("features").rdd.map(lambda row: row[0])
features_mat = RowMatrix(features_rdd)

# Expand each row into (row, col, value) entries, transpose, and convert back to a RowMatrix.
coordmatrix_features = CoordinateMatrix(
    features_mat.rows.zipWithIndex().flatMap(
        lambda x: [MatrixEntry(x[1], j, v) for j, v in enumerate(x[0])]
    )
)
transposed_rowmatrix_features = coordmatrix_features.transpose().toRowMatrix()
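For the Scala side of this page, a roughly equivalent sketch of the same CoordinateMatrix transpose trick would be the following (mat is a hypothetical existing RowMatrix; the name is only for illustration):
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry, RowMatrix}

// Expand each row of `mat` into (row, col, value) entries, transpose, and convert back.
val entries = mat.rows.zipWithIndex().flatMap { case (vec, i) =>
  vec.toArray.zipWithIndex.map { case (v, j) => MatrixEntry(i, j, v) }
}
val transposed: RowMatrix = new CoordinateMatrix(entries).transpose().toRowMatrix()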

Spark Scala def with yield

In SO 33655920 I came across the below, which is fine.
rdd = sc.parallelize([1, 2, 3, 4], 2)
def f(iterator): yield sum(iterator)
rdd.mapPartitions(f).collect()
In Scala, I cannot seem to write the def in the same shorthand way. What is the equivalent? I have searched and tried, but to no avail.
Thanks in advance.
yield sum(iterator) in Python sums the elements of the iterator. A similar way of doing this in Scala would be:
val rdd = sc.parallelize(Array(1, 2, 3, 4), 2)
rdd.mapPartitions(it => Iterator(it.sum)).collect()
If you want to sum the values in each partition, you can write something like:
val rdd = sc.parallelize(1 to 4, 2)
def f(i: Iterator[Int]) = Iterator(i.sum)
rdd.mapPartitions(f).collect()
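For reference, parallelize(1 to 4, 2) splits the sequence into the partitions [1, 2] and [3, 4], so both versions should return one sum per partition, along these lines (the res number is just illustrative):
scala> rdd.mapPartitions(f).collect()
res0: Array[Int] = Array(3, 7)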

Intersection of Two HashMap (HashMap<Integer,HashSet<Integer>>) RDDs in Scala for Spark

I am working in Scala, programming Spark on a standalone machine (a Windows 10 PC). I am a newbie and don't have experience programming in Scala and Spark, so I will be very thankful for the help.
Problem:
I have a HashMap, hMap1, whose values are HashSets of Integer entries (HashMap<Integer, HashSet<Integer>>). I then store its values (i.e., many HashSet values) in an RDD. The code is as below:
val rdd1 = sc.parallelize(Seq(hMap1.values()))
Now I have another HashMap, hMap2, of the same type, i.e., HashMap<Integer, HashSet<Integer>>. Its values are also stored in an RDD:
val rdd2 = sc.parallelize(Seq(hMap2.values()))
I want to know how I can intersect the values of hMap1 and hMap2.
For example:
Input:
the data in rdd1 = [2, 3], [1, 109], [88, 17]
and the data in rdd2 = [2, 3], [1, 109], [5, 45]
Output:
so the output = [2, 3], [1, 109]
Problem statement
My understanding of your question is the following:
Given two RDDs of type RDD[Set[Integer]], how can I produce an RDD of their common records?
Sample data
Two RDDs generated by
val rdd1 = sc.parallelize(Seq(Set(2, 3), Set(1, 109), Set(88, 17)))
val rdd2 = sc.parallelize(Seq(Set(2, 3), Set(1, 109), Set(5, 45)))
Possible solution
If my understanding of the problem statement is correct, you could use rdd1.intersection(rdd2) if your RDDs are as I thought. This is what I tried in a spark-shell with Spark 2.2.0:
rdd1.intersection(rdd2).collect
which yielded the output:
Array(Set(2, 3), Set(1, 109))
This works because Spark can compare elements of type Set[Integer], but note that this is not generalisable to an arbitrary Set[MyObject] unless you have defined the equality contract of MyObject.
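If hMap1 and hMap2 really are Java HashMaps, as HashMap<Integer, HashSet<Integer>> suggests, a minimal sketch for getting from their values to RDDs of Scala sets before calling intersection might look like this (the conversion step is my assumption, not part of the original answer):
import scala.collection.JavaConverters._

// Turn each java.util.HashSet value into an immutable Scala Set, so that every
// set becomes one RDD record that intersection() can compare by value.
val rdd1 = sc.parallelize(hMap1.values().asScala.toSeq.map(_.asScala.toSet))
val rdd2 = sc.parallelize(hMap2.values().asScala.toSeq.map(_.asScala.toSet))

val common = rdd1.intersection(rdd2)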

How to generate undirect edges by using the element of a Set in Spark

I'm trying to generate some undirected edges using the elements of a Set. For Set(1, 4, 5), for example, the result must be like this:
(1,4)
(1,5)
(4,5)
Any solution will be much appreciated.
Here is a simple example using subsets:
val set = Set(1, 4, 5)
val result = set.subsets(2).map(_.toList).toList
Output:
List(1, 4)
List(1, 5)
List(4, 5)
If you want it as a list of tuples, then you can convert it like this:
result.map(list => (list(0), list(1)))
Output:
(1,4)
(1,5)
(4,5)
Hope this helps!
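As an alternative sketch (not from the original answer), you can skip the intermediate lists and build the tuples directly with combinations on a sorted sequence:
// Sort for a deterministic order, then take every 2-element combination as an edge.
val edges = Set(1, 4, 5).toSeq.sorted.combinations(2).map { case Seq(a, b) => (a, b) }.toList
// edges: List((1,4), (1,5), (4,5))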

Spark Scala - Split columns into multiple rows

Following the question that I posted here:
Spark Mllib - Scala
I have another doubt... Is it possible to transform a dataset like this:
2,1,3
1
3,6,8
Into this:
2,1
2,3
1,3
1
3,6
3,8
6,8
Basically, I want to discover all the relationships between the movies. Is it possible to do this?
My current code is:
val input = sc.textFile("PATH")
val raw = input.lines.map(_.split(",")).toArray
val twoElementArrays = raw.flatMap(_.combinations(2))
val result = twoElementArrays ++ raw.filter(_.length == 1)
Given that input is a multi-line string.
scala> val raw = input.lines.map(_.split(",")).toArray
raw: Array[Array[String]] = Array(Array(2, 1, 3), Array(1), Array(3, 6, 8))
The following approach discards one-element arrays (the 1 in your example).
scala> val twoElementArrays = raw.flatMap(_.combinations(2))
twoElementArrays: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8))
This can be fixed by appending the filtered raw collection.
scala> val result = twoElementArrays ++ raw.filter(_.length == 1)
result: Array[Array[String]] = Array(Array(2, 1), Array(2, 3), Array(1, 3), Array(3, 6), Array(3, 8), Array(6, 8), Array(1))
The order of the combinations is not relevant, I believe.
Update
SparkContext.textFile returns an RDD of lines, so it could be plugged in as:
val raw = rdd.map(_.split(","))
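Putting the update together, a minimal end-to-end sketch that works directly on the RDD (the file path is just a placeholder) might be:
// Read the lines, split each into ids, then emit every 2-element combination
// plus the lines that only contain a single id.
val rdd = sc.textFile("PATH")
val raw = rdd.map(_.split(","))
val result = raw.flatMap(_.combinations(2)) ++ raw.filter(_.length == 1)

result.map(_.mkString(",")).collect().foreach(println)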