Concat column values in a dataframe - scala

I have a csv file that looks like the following
name, state, a, b, c, d, ..., x
Jon, NY, 1, 4, 6, 2, 6
Eric, CA, 5, 3, 1, 5, 6
Chris,LA, 4, 4, 3, 1, 5
and I want the following result (one column with the concatenated fields, including the header names)
concate-fields
"name=Jon, state=NY, a=1, b=4, c=6, d= 2, ... x=6"
"name=Eric, state=CA, a=5, b=3, c=1, d= 5, ... x=6"
"name=Chris, state=LA, a=4, b=4, c=3, d= 1, ... x=5"
There can be many headers from a to x, so these should be appended in a generic way.
I now have
import org.apache.spark.sql.functions.{concat, lit}
val lp = sample.select(concat(lit("name), $"name", lit(",state="), $"state")
display(lp)
But I have trouble adding the same for columns a to x (as this needs to be done in a generic way).

You can dynamically create a SQL expression that concatenates the columns by mapping over df.columns, as shown below.
import org.apache.spark.sql.functions.{col, concat, expr, lit}

val df = // Read CSV
df.withColumn("concate-fields", expr(s"concat(${df.columns.map(c => s"'$c=', nvl($c,'null'),','").mkString(",").dropRight(4)})"))
  .withColumn("concate-fields", concat(lit("\""), col("concate-fields"), lit("\"")))

Related

Reformatting Dataframe Containing Array to RowMatrix

I have this dataframe in the following format:
+----------+
|  features|
+----------+
|[1,4,7,10]|
|[2,5,8,11]|
|[3,6,9,12]|
+----------+
Script to create sample dataframe:
from pyspark.mllib.linalg.distributed import IndexedRow

rows2 = sc.parallelize([IndexedRow(0, [1, 4, 7, 10]),
                        IndexedRow(1, [2, 5, 8, 1]),
                        IndexedRow(2, [3, 6, 9, 12]),
                        ])
rows_df = rows2.toDF()
row_vec= rows_df.drop("index")
row_vec.show()
The features column contains 4 features, and there are 3 row IDs. I want to convert this data to a RowMatrix, where the columns and rows will be arranged as in the following mat example:
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([(1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12)])
# Convert to RowMatrix
mat = RowMatrix(rows)
# Calculate exact and approximate similarities
exact = mat.columnSimilarities()
approx = mat.columnSimilarities(0.05)
Basically, I want to transpose the dataframe into the new format so that I can run the columnSimilarities() function. I have a much larger dataframe that contains 50 features, and 39000 rows.
Is this what you are trying to do? I hate using collect(), but I don't think it can be avoided here, since you want to reshape/convert a structured object into a matrix ... right?
import numpy as np

X = np.array(row_vec.select("_2").collect()).reshape(-1, 3)
X = sc.parallelize(X)
for i in X.collect(): print(i)
[1 4 7]
[10 2 5]
[8 1 3]
[ 6 9 12]
I figured it out, I used the following:
from pyspark.mllib.linalg.distributed import RowMatrix
features_rdd = row_vec.select("features").rdd.map(lambda row: row[0])
features_mat = RowMatrix(features_rdd)
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
coordmatrix_features = CoordinateMatrix(
    features_mat.rows.zipWithIndex().flatMap(
        lambda x: [MatrixEntry(x[1], j, v) for j, v in enumerate(x[0])]
    )
)
transposed_rowmatrix_features = coordmatrix_features.transpose().toRowMatrix()

How to find duplicates in a list in Scala?

I have a list of unsorted integers and I want to find the elements which are duplicated.
val dup = List(1|1|1|2|3|4|5|5|6|100|101|101|102)
I have to find the list of unique elements and also how many times each element is repeated.
I know I can find it with below code :
val ans2 = dup.groupBy(identity).map(t => (t._1, t._2.size))
But I am not able to split the above list on "|". I tried converting it to a String and then using split, but I got the result below:
L
i
s
t
(
1
0
3
)
I am not sure why I am getting this result.
Reference: How to find duplicates in a list?
The symbol | is a method on Int in Scala; see the API documentation:
|(x: Int): Int
Returns the bitwise OR of this value and x.
So you don't have a List; you have a single Integer (103), which is the result of applying | to all the integers in your intended List.
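A quick REPL check illustrates this (just an illustration of how the bitwise OR evaluates):

scala> val dup = List(1|1|1|2|3|4|5|5|6|100|101|101|102)
dup: List[Int] = List(103)

scala> 1 | 2 | 3 | 4 | 5 | 6 | 100 | 101 | 102
res0: Int = 103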
Your code is fine; if you want to build a proper List, you should separate its elements with commas:
val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
If you want to convert your given String into a List first, you can do:
"1|1|1|2|3|4|5|5|6|100|101|101|102".split("\\|").toList
Even easier, convert the list of duplicates into a set - a set is a data structure that by default does not have any duplicates.
scala> val dup = List(1,1,1,2,3,4,5,5,6,100,101,101,102)
dup: List[Int] = List(1, 1, 1, 2, 3, 4, 5, 5, 6, 100, 101, 101, 102)
scala> val noDup = dup.toSet
res0: scala.collection.immutable.Set[Int] = Set(101, 5, 1, 6, 102, 2, 3, 4, 100)
To count the elements, just call the method size on the resulting set:
scala> noDup.size
res3: Int = 9
Another way to solve the problem:
"1|1|1|2|3|4|5|5|6|100|101|101|102".split("\\|").groupBy(x => x).mapValues(_.size)
res0: scala.collection.immutable.Map[String,Int] = Map(100 -> 1, 4 -> 1, 5 -> 2, 6 -> 1, 1 -> 3, 102 -> 1, 2 -> 1, 101 -> 2, 3 -> 1)

How to remove duplicates from a particular column in Scala by reading a text file

I am new to Scala. I am reading a text file from the local filesystem, and I want to count the duplicates in a particular column, as in the example below.
Input File:
1,2,3
2,3,4
1,3,4
2,4,5
3,4,5
I need output like this:
Select first column
1->2
2->3
3->1
My program is:
val file=scala.io.Source.fromFile("D:/Files/test.txt").getLines().mkString("\n")
val d=file.groupBy(identity).mapValues(_.size)
println(d)
But I am getting output like this:
Map(-> 5, 4 -> 1, 9 -> 1, 5 -> 3, , -> 12, 1 -> 3, 0 -> 1, 2 -> 5, 3 -> 4)
It's counting every character in the whole file, but I want to count duplicates in a particular column only.
The issue here is that once mkString is called, the individual lines of the file are 'lost' (the whole file becomes a single String, so groupBy(identity) groups characters). Another approach could be to use the toArray call instead.
val file = scala.io.Source.fromFile("D:/Files/test.txt")
val lines = file.getLines().toArray
In the above example, lines would be an array of strings:
Array(1,2,3, 2,3,4, 1,3,4, 2,4,5, 3,4,5)
Then, to extract the first column before grouping, you could use something like the slice method on each string (this works here because each value in the first column is a single character):
lines.map(_.slice(0,1)).groupBy(identity).mapValues(_.size)
Also, remember to close the file :)
Full example:
val file = scala.io.Source.fromFile("D:/Files/test.txt")
val lines = file.getLines().toArray
val grouping = lines.map(_.slice(0,1)).groupBy(identity).mapValues(_.size)
file.close
If I understand your question correctly, shouldn't the duplicate counts of the 1st column be (1->2, 2->2, 3->1)?
Here's one approach to get the counts:
// Create a list of split-column arrays
val list = scala.io.Source.
  fromFile("/Users/leo/projects/scala/files/testfile.txt").
  getLines.
  map(_.split(",")).
  toList
list: List[Array[String]] = List(Array(1, 2, 3), Array(2, 3, 4), Array(1, 3, 4), Array(2, 4, 5), Array(3, 4, 5))
// Count duplicates of the 1st split-column
val d = list.
  groupBy(_(0)).
  mapValues(_.size)
d: scala.collection.immutable.Map[String,Int] = Map(2 -> 2, 1 -> 2, 3 -> 1)

How to zip after distinct in PySpark

The following program fails in the zip step.
x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
z = x.distinct()
print z.zip(y).collect()
The error that is produced depends on whether multiple partitions have been specified or not.
I understand that
the two RDDs [must] have the same number of partitions and the same number of elements in each partition.
What is the best way to work around this restriction?
I have been performing the operation with the following code, but I am hoping to find something more efficient.
def safe_zip(left, right):
    ix_left = left.zipWithIndex().map(lambda row: (row[1], row[0]))
    ix_right = right.zipWithIndex().map(lambda row: (row[1], row[0]))
    return ix_left.join(ix_right).sortByKey().values()
I think this could be accomplished by using cartesian() on your RDDs:
import pyspark
x = sc.parallelize([1, 2, 3, 1, 2, 3])
y = sc.parallelize([1, 2, 3])
x.distinct().cartesian(y.distinct()).collect()

Assigning one value to many elements of an array in Scala

I have some experience with the R language and now I want to try the Scala language. In R I can assign one value to many elements of a vector, e.g.
(xs <- 1:10)
#[1] 1 2 3 4 5 6 7 8 9 10
k <- 3
xs[1:k] <- xs[k+1]
xs
# 4 4 4 4 5 6 7 8 9 10
It assigns the value of the (k+1)-th element to all elements with indices from 1 to k. Is it also possible to do this without a loop in Scala (I mean with an Array in Scala)? I know there is a slice method, but it only returns values of an Array; one cannot modify elements of the Array using this method.
What is more, should I use Array or ArrayBuffer if I only want to change the values of elements and do not want to add/remove elements from the collection?
Check out the java.util.Arrays.fill methods.
scala> val xs = (1 to 9).toArray
xs: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> val k = 6
k: Int = 6
scala> java.util.Arrays.fill(xs, 0, k, xs(k))
scala> xs
res10: Array[Int] = Array(7, 7, 7, 7, 7, 7, 7, 8, 9)
For your second question: if you are not resizing the collection but only editing the elements, stick with Array. ArrayBuffer is much like Java's ArrayList; it resizes itself when it needs to, so insertion is amortized constant time, not just constant.
For your first question, I'm not aware of any method in the collections library that would allow you to do this. It's obviously syntactic sugar for looping, so if you really care (do you really find yourself needing to do this often?), you can define an implicit class with a method that loops, and then use that (a rough sketch is included below). Write a comment if you would like to see a fuller example of such code; otherwise try doing it yourself, it's going to be good training.
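For illustration only, here is a rough sketch of such an implicit class; the name fillPrefix is made up for this example and is not part of the standard library:

object ArraySyntax {
  // Hypothetical extension method: assigns `value` to indices 0 until k, in place
  implicit class RichIntArray(val xs: Array[Int]) extends AnyVal {
    def fillPrefix(k: Int, value: Int): Unit = {
      var i = 0
      while (i < k) { xs(i) = value; i += 1 }
    }
  }
}

import ArraySyntax._
val xs = (1 to 10).toArray
xs.fillPrefix(3, xs(3)) // xs is now Array(4, 4, 4, 4, 5, 6, 7, 8, 9, 10)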
Scala has the Range class. You can convert the Range to an Array if you wish.
scala> val n = 10
n: Int = 10
scala> Range(1,n)
res22: scala.collection.immutable.Range = Range(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> res22.toArray
res23: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
ArrayBuffer has constant-time update and would be good for updating values.
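For completeness, a tiny sketch of updating a range of an ArrayBuffer in place (illustrative only):

import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer(1, 2, 3, 4, 5)
(0 until 3).foreach(i => buf(i) = buf(3)) // buf is now ArrayBuffer(4, 4, 4, 4, 5)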