Reformatting a DataFrame containing an array to a RowMatrix - PySpark

I have this dataframe in the following format:
+----------+
| features |
+----------+
|[1,4,7,10]|
|[2,5,8,11]|
|[3,6,9,12]|
+----------+
Script to create sample dataframe:
from pyspark.mllib.linalg.distributed import IndexedRow

rows2 = sc.parallelize([
    IndexedRow(0, [1, 4, 7, 10]),
    IndexedRow(1, [2, 5, 8, 11]),
    IndexedRow(2, [3, 6, 9, 12]),
])
rows_df = rows2.toDF()
row_vec = rows_df.drop("index")
row_vec.show()
The features column contains 4 features, and there are 3 row ids. I want to convert this data into a RowMatrix, with the rows and columns arranged as in the following mat example:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
Basically, I want to transpose the dataframe into the new format so that I can run the columnSimilarities() function. I have a much larger dataframe that contains 50 features and 39,000 rows.

Is this what you are trying to do? I hate using collect(), but I don't think it can be avoided here, since you want to reshape/convert a structured object to a matrix ... right?
import numpy as np

# Collect the vectors to the driver, reshape, and re-parallelize
X = np.array(row_vec.select("_2").collect()).reshape(-1, 3)
X = sc.parallelize(X)
for i in X.collect(): print(i)
[1 4 7]
[10 2 5]
[8 11 3]
[ 6 9 12]
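To feed this into the similarity computation from the question, the reshaped RDD could then be wrapped in a RowMatrix (a follow-on sketch, not part of the original answer):
from pyspark.mllib.linalg.distributed import RowMatrix

# Wrap the RDD of rows in a distributed RowMatrix and compute approximate column similarities
mat = RowMatrix(X)
approx = mat.columnSimilarities(0.05)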

I figured it out; I used the following:
from pyspark.mllib.linalg.distributed import RowMatrix

# Pull the feature vectors out of the dataframe as an RDD of vectors
features_rdd = row_vec.select("features").rdd.map(lambda row: row[0])
features_mat = RowMatrix(features_rdd)

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Re-express the matrix as (row, column, value) entries so it can be transposed
coordmatrix_features = CoordinateMatrix(
    features_mat.rows.zipWithIndex().flatMap(
        lambda x: [MatrixEntry(x[1], j, v) for j, v in enumerate(x[0])]
    )
)
transposed_rowmatrix_features = coordmatrix_features.transpose().toRowMatrix()
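With the transposed RowMatrix in hand, the similarity calls from the question should apply directly (a usage sketch; the 0.05 threshold is just the value from the example above):
# columnSimilarities() returns a CoordinateMatrix of upper-triangular cosine similarities
exact = transposed_rowmatrix_features.columnSimilarities()
approx = transposed_rowmatrix_features.columnSimilarities(0.05)
exact.entries.take(5)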

Related

Concat column values in a dataframe

I have a CSV file that looks like the following:
name, state, a, b, c, d, ..., x
Jon, NY, 1, 4, 6, 2, 6
Eric, CA, 5, 3, 1, 5, 6
Chris, LA, 4, 4, 3, 1, 5
and I want the following result (one column with the concatenated fields, including the header names):
concate-fields
"name=Jon, state=NY, a=1, b=4, c=6, d= 2, ... x=6"
"name=Eric, state=CA, a=5, b=3, c=1, d= 5, ... x=6"
"name=Chris, state=LA, a=4, b=4, c=3, d= 1, ... x=5"
There can be many columns from a to x, so these should be appended in a generic way.
I now have:
import org.apache.spark.sql.functions.{concat, lit}
val lp = sample.select(concat(lit("name="), $"name", lit(", state="), $"state"))
display(lp)
But I have trouble adding the same for columns a to x (as this needs to be done in a generic way).
You can dynamically create a SQL expression to concat the columns by mapping over df.columns, as shown below.
import org.apache.spark.sql.functions.{expr, concat, lit, col}

val df = // Read CSV
df.withColumn("concate-fields", expr(s"concat(${df.columns.map(col => s"'$col=', nvl($col,'null'),','").mkString("").dropRight(4)})"))
  .withColumn("concate-fields", concat(lit("\""), col("concate-fields"), lit("\"")))

Adding lists by element in pyspark

I'd like to take an RDD of integer lists and reduce it down to one list. For example...
[1, 2, 3, 4]
[2, 3, 4, 5]
to
[3, 5, 7, 9]
I can do this in Python using the zip function, but I'm not sure how to replicate it in Spark other than doing collect on the object, and I want to keep the data in the RDD.
If all elements in the RDD are of the same length, you can use reduce with zip:
rdd = sc.parallelize([[1,2,3,4],[2,3,4,5]])
rdd.reduce(lambda x, y: [i+j for i, j in zip(x, y)])
# [3, 5, 7, 9]
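If the lists can have different lengths, one workaround (a sketch, not from the answer above, assuming missing positions should count as 0) is to pad with itertools.zip_longest instead of zip:
from itertools import zip_longest

# Sum element-wise, padding the shorter list with zeros
rdd = sc.parallelize([[1, 2, 3, 4], [2, 3, 4, 5, 6]])
rdd.reduce(lambda x, y: [i + j for i, j in zip_longest(x, y, fillvalue=0)])
# [3, 5, 7, 9, 6]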

How to print the arrays in table or dimension format?

Actually, this works:
object Matrixmul extends App {
  val a = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))
  val b = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))
  val c = Array.ofDim[Int](3, 3)
  val sum = Array.ofDim[Int](3, 3)

  println(a.mkString(" "))

  val elements = for {
    row <- a
    ele <- row
  } yield ele

  for (array1 <- elements)
    println(" the 1st matrix array elements are : " + array1)
}
This prints the array in the following format:
the 1st matrix array elements are : 1
the 1st matrix array elements are : 2
the 1st matrix array elements are : 3
the 1st matrix array elements are : 4
the 1st matrix array elements are : 5
the 1st matrix array elements are : 6
the 1st matrix array elements are : 7
the 1st matrix array elements are : 8
the 1st matrix array elements are : 9
But I need it in DIMENSION format:
1 2 3
4 5 6
7 8 9
How about the following:
val a = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))
a.foreach(row => println(row.mkString(" ")))
This will print your dimensional format:
1 2 3
4 5 6
7 8 9
A small variation on the subject:
val a = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))
println(a.map(_.mkString(" ")).mkString("\n"))

Create 2D array and store value into each element of that array in Scala

I am working on a Scala exercise which asks me to create a 2D array of 4 rows and 5 columns and store row index + column index + 5 in each element. I also have to sum the array by rows and then by columns and print the row totals and the column totals. I am so confused, and I only know how to create an empty array:
val matrix = Array.ofDim[Int](4, 5)
Can you teach me how to do the rest of this exercise?
I will not tell you "the rest of the exercise" but I will try to show one way to create a 2D collection, like an array in this case:
val matrix1D = for {
  rowIndex <- (0 until 4).toArray
  colIndex <- (0 until 5).toArray
} yield rowIndex + colIndex + 5
Where
scala> :t matrix1D
Array[Int]
Now the result of this for-comprehension is the 1D version of your 2D array.
EDIT
I can probably give you a few more hints:
scala> (0 to 11).toArray.grouped(4).toArray
res10: Array[Array[Int]] = Array(Array(0, 1, 2, 3), Array(4, 5, 6, 7), Array(8, 9, 10, 11))
scala> .transpose
res11: Array[Array[Int]] = Array(Array(0, 4, 8), Array(1, 5, 9), Array(2, 6, 10), Array(3, 7, 11))
EDIT
After you create matrix2D from matrix1D:
val matrix2D = matrix1D.??????????????????
Where
scala> :t matrix2D
Array[Array[Int]]
To print it out, you could simply use mkString:
scala> matrix2D.map(_.mkString("\t")).mkString("\n")
res32: String =
5 6 7 8 9
6 7 8 9 10
7 8 9 10 11
8 9 10 11 12

How to zip after distinct in PySpark

The following program fails in the zip step.
x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
z = x.distinct()
print(z.zip(y).collect())
The error that is produced depends on whether multiple partitions have been specified or not.
I understand that
the two RDDs [must] have the same number of partitions and the same number of elements in each partition.
What is the best way to work around this restriction?
I have been performing the operation with the following code, but I am hoping to find something more efficient.
def safe_zip(left, right):
    # Key both RDDs by position, join on the index, and restore the original order
    ix_left = left.zipWithIndex().map(lambda row: (row[1], row[0]))
    ix_right = right.zipWithIndex().map(lambda row: (row[1], row[0]))
    return ix_left.join(ix_right).sortByKey().values()
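For example, with the RDDs defined above, the deduplicated RDD can then be zipped with y like this (a usage sketch, not part of the original question):
# Join-based zip that does not require matching partitioning or element counts per partition
print(safe_zip(z, y).collect())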
I think this would be accomplished by using cartesian() on your RDD
import pyspark
x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
x.distinct().cartesian(y.distinct()).collect()