How to zip after distinct in pySpark - pyspark

The following program fails in the zip step.
x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
z = x.distinct()
print(z.zip(y).collect())
The error that is produced depends on whether multiple partitions have been specified or not.
I understand that
the two RDDs [must] have the same number of partitions and the same number of elements in each partition.
What is the best way to work around this restriction?
I have been performing the operation with the following code, but I am hoping to find something more efficient.
def safe_zip(left, right):
    # Key each element by its position so the two RDDs can be joined positionally.
    ix_left = left.zipWithIndex().map(lambda row: (row[1], row[0]))
    ix_right = right.zipWithIndex().map(lambda row: (row[1], row[0]))
    # Join on the index, restore the original order, and drop the index again.
    return ix_left.join(ix_right).sortByKey().values()
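For example, applied to the RDDs above (a quick sketch; distinct() does not guarantee element order, so the exact pairing can vary):
print(safe_zip(z, y).collect())
# e.g. [(1, 1), (2, 2), (3, 3)]; the left-hand order depends on how distinct() shuffled the data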

I think this can be accomplished by using cartesian() on your RDDs:
import pyspark
x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
x.distinct().cartesian(y.distinct()).collect()

Related

pyspark RDD combine consecutive values into tuple

Seems like a simple problem but I can't figure it out.
Given an RDD
Input
[1, 5, 3, 2, 7]
Output
[(1,5), (5,3), (3,2), (2,7)]
I've tried this, but it fails with an obvious error:
rdd.map(lambda x,y: (x,y))
I'm assuming I need a helper function of some sort.
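One way to get there, along the lines of the index-join trick above (a sketch, assuming sc is an active SparkContext): key every element by its position, shift the index by one, and join.
rdd = sc.parallelize([1, 5, 3, 2, 7])
indexed = rdd.zipWithIndex().map(lambda x: (x[1], x[0]))  # (position, value)
shifted = indexed.map(lambda kv: (kv[0] - 1, kv[1]))      # pair each value with the position before it
pairs = indexed.join(shifted).sortByKey().values()
print(pairs.collect())  # [(1, 5), (5, 3), (3, 2), (2, 7)]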

Concat column values in a dataframe

I have a csv file that looks like the following
name, state, a, b, c, d, ..., x
Jon, NY, 1, 4, 6, 2, 6
Eric, CA, 5, 3, 1, 5, 6
Chris, LA, 4, 4, 3, 1, 5
and I want the following result (one column with the concatenated fields, including the header names):
concate-fields
"name=Jon, state=NY, a=1, b=4, c=6, d= 2, ... x=6"
"name=Eric, state=CA, a=5, b=3, c=1, d= 5, ... x=6"
"name=Chris, state=LA, a=4, b=4, c=3, d= 1, ... x=5"
There can be many columns from a to x, so these should be appended in a generic way.
I now have
import org.apache.spark.sql.functions.{concat, lit}
val lp = sample.select(concat(lit("name="), $"name", lit(",state="), $"state"))
display(lp)
But I have trouble adding the same for columns a through x, as this needs to be done in a generic way.
You can dynamically create a SQL expression to concat the columns by mapping over df.columns, as shown below.
import org.apache.spark.sql.functions.{col, concat, expr, lit}

val df = spark.read.option("header", "true").csv("path/to/file.csv") // read the CSV
df.withColumn("concate-fields",
    expr(s"concat(${df.columns.map(c => s"'$c=', nvl($c,'null'),','").mkString("").dropRight(4)})"))
  .withColumn("concate-fields", concat(lit("\""), col("concate-fields"), lit("\"")))
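If you need the same thing from PySpark, a rough equivalent sketch (it assumes df is the DataFrame read from the CSV; coalesce stands in for nvl):
from pyspark.sql import functions as F
# Build a "col=value" piece for every column dynamically, then join them with ", " and wrap in quotes.
pieces = [F.concat(F.lit(c + "="), F.coalesce(F.col(c).cast("string"), F.lit("null"))) for c in df.columns]
lp = df.select(F.concat(F.lit('"'), F.concat_ws(", ", *pieces), F.lit('"')).alias("concate-fields"))
lp.show(truncate=False)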

Reformatting Dataframe Containing Array to RowMatrix

I have this dataframe in the following format:
+----------+
| features |
+----------+
|[1,4,7,10]|
|[2,5,8,11]|
|[3,6,9,12]|
+----------+
Script to create sample dataframe:
from pyspark.mllib.linalg.distributed import IndexedRow

rows2 = sc.parallelize([IndexedRow(0, [1, 4, 7, 10]),
                        IndexedRow(1, [2, 5, 8, 1]),
                        IndexedRow(2, [3, 6, 9, 12])])
rows_df = rows2.toDF()
row_vec = rows_df.drop("index")
row_vec.show()
The features column contains 4 features, and there are 3 row ids. I want to convert this data to a RowMatrix, where the rows and columns are arranged as in mat below:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
Basically, I want to transpose the dataframe into the new format so that I can run the columnSimilarities() function. I have a much larger dataframe that contains 50 features, and 39000 rows.
Is this what you are trying to do? I hate using collect(), but I don't think it can be avoided here, since you want to reshape/convert a structured object to a matrix, right?
import numpy as np

X = np.array(row_vec.select("_2").collect()).reshape(-1, 3)
X = sc.parallelize(X)
for i in X.collect(): print(i)
[1 4 7]
[10 2 5]
[8 1 3]
[ 6 9 12]
I figured it out; I used the following:
from pyspark.mllib.linalg.distributed import RowMatrix, CoordinateMatrix, MatrixEntry

# Pull the feature vectors out of the dataframe as an RDD and wrap them in a RowMatrix.
features_rdd = row_vec.select("features").rdd.map(lambda row: row[0])
features_mat = RowMatrix(features_rdd)

# Go through a CoordinateMatrix so the matrix can be transposed, then convert back to a RowMatrix.
coordmatrix_features = CoordinateMatrix(
    features_mat.rows.zipWithIndex().flatMap(
        lambda x: [MatrixEntry(x[1], j, v) for j, v in enumerate(x[0])]
    )
)
transposed_rowmatrix_features = coordmatrix_features.transpose().toRowMatrix()
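From there the original goal should follow directly; a short sketch of the final step:
# columnSimilarities() now compares the columns of the transposed matrix,
# i.e. the rows of the original dataframe.
exact = transposed_rowmatrix_features.columnSimilarities()
approx = transposed_rowmatrix_features.columnSimilarities(0.05)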

How to remove duplicates from a particular column in Scala by reading a text file

I am new to Scala. I am reading a text file from the local filesystem, and I want to count the duplicate values in a particular column, as in the example below.
Input File:
1,2,3
2,3,4
1,3,4
2,4,5
3,4,5
I need output like this:
Select first column
1->2
2->3
3->1
My program is:
val file=scala.io.Source.fromFile("D:/Files/test.txt").getLines().mkString("\n")
val d=file.groupBy(identity).mapValues(_.size)
println(d)
But I am getting output like this:
Map(-> 5, 4 -> 1, 9 -> 1, 5 -> 3, , -> 12, 1 -> 3, 0 -> 1, 2 -> 5, 3 -> 4)
It's counting all the data, but I want to count the duplicates in a particular column only.
The issue here is that once mkString is called, the individual lines of the file are 'lost': groupBy(identity) then runs over the single characters of one big string, which is why commas and newlines show up in the counts. Another approach is to use toArray instead.
val file = scala.io.Source.fromFile("D:/Files/test.txt")
val lines = file.getLines().toArray
In the above example, lines would be an array of strings:
Array(1,2,3, 2,3,4, 1,3,4, 2,4,5, 3,4,5)
Then, to extract the first column before grouping, you could use something like the slice method on each string:
lines.map(_.slice(0,1)).groupBy(identity).mapValues(_.size)
Also, remember to close the file :)
Full example:
val file = scala.io.Source.fromFile("D:/Files/test.txt")
val lines = file.getLines().toArray
val grouping = lines.map(_.slice(0,1)).groupBy(identity).mapValues(_.size)
file.close
If I understand your question correctly, shouldn't the duplicate counts of the 1st column be (1->2, 2->2, 3->1)?
Here's one approach to get the counts:
// Create a list of split-column arrays
val list = scala.io.Source.
  fromFile("/Users/leo/projects/scala/files/testfile.txt").
  getLines.
  map(_.split(",")).
  toList
list: List[Array[String]] = List(Array(1, 2, 3), Array(2, 3, 4), Array(1, 3, 4), Array(2, 4, 5), Array(3, 4, 5))
// Count duplicates of the 1st split-column
val d = list.
  groupBy(_(0)).
  mapValues(_.size)
d: scala.collection.immutable.Map[String,Int] = Map(2 -> 2, 1 -> 2, 3 -> 1)

Filter matching and non-matching elements into different halves of a tuple

Is there a simple and efficient way to perform the following in Scala?
val elements = List(1, 2, 3, 4, 5, 6)
val (odd, even) = elements.filter(_ % 2 == 0)
I am aware of groupBy, but I would like something that works with a constant number of groups that can be extracted into separate values.
List.partition does what you want:
val (even, odd) = elements.partition(_ % 2 == 0)
Note that it only works for splitting into exactly two groups.