Create separate columns from array column in Spark Dataframe in Scala when array is large [duplicate] - scala

This question already has answers here:
Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)]
(5 answers)
Closed 4 years ago.
I have two columns: one of type Integer and one of type linalg.Vector. I can convert linalg.Vector to array. Each array has 32 elements. I want to convert each element in the array to a column. So the input is like :
column1 column2
(3, 5, 25, ...., 12) 3
(2, 7, 15, ...., 10) 4
(1, 10, 12, ..., 35) 2
Output should be:
column1_1 column1_2 column1_3 ......... column1_32 column 2
3 5 25 ......... 12 3
2 7 15 ......... 10 4
1 1 0 12 ......... 12 2
Except, in my case there are 32 elements in the array. It is too many to use the method in question Convert Array of String column to multiple columns in spark scala
I tried a few ways and none of it worked. What is the right way to do this?
Thanks a lot.

scala> import org.apache.spark.sql.Column
scala> val df = Seq((Array(3,5,25), 3),(Array(2,7,15),4),(Array(1,10,12),2)).toDF("column1", "column2")
df: org.apache.spark.sql.DataFrame = [column1: array<int>, column2: int]
scala> def getColAtIndex(id:Int): Column = col(s"column1")(id).as(s"column1_${id+1}")
getColAtIndex: (id: Int)org.apache.spark.sql.Column
scala> val columns: IndexedSeq[Column] = (0 to 2).map(getColAtIndex) :+ col("column2") //Here, instead of 2, you can give the value of n
columns: IndexedSeq[org.apache.spark.sql.Column] = Vector(column1[0] AS `column1_1`, column1[1] AS `column1_2`, column1[2] AS `column1_3`, column2)
scala> df.select(columns: _*).show
+---------+---------+---------+-------+
|column1_1|column1_2|column1_3|column2|
+---------+---------+---------+-------+
| 3| 5| 25| 3|
| 2| 7| 15| 4|
| 1| 10| 12| 2|
+---------+---------+---------+-------+

This can be done best by writing a UserDefinedFunction like:
val getElementFromVectorUDF = udf(getElementFromVector(_: Vector, _: Int))
def getElementFromVector(vec: Vector, idx: Int) = {
vec(idx)
}
You can use it like this then:
df.select(
getElementFromVectorUDF($"column1", 0) as "column1_0",
...
getElementFromVectorUDF($"column1", n) as "column1_n",
)
I hope this helps.

Related

Scala filter out rows where any column2 matches column1

Hi Stackoverflow,
I want to remove all rows in a dataframe where column A matches any of the distinct values in column B. I would expect this code block to do exactly that, but it seems to remove values where column B is null as well, which is weird since the filter should only consider column A anyway. How can I fix this code to perform the expected behavior, which is remove all rows in a dataframe where column A matches any of the distinct values in column B.
import spark.implicits._
val df = Seq(
(scala.math.BigDecimal(1) , null),
(scala.math.BigDecimal(2), scala.math.BigDecimal(1)),
(scala.math.BigDecimal(3), scala.math.BigDecimal(4)),
(scala.math.BigDecimal(4), null),
(scala.math.BigDecimal(5), null),
(scala.math.BigDecimal(6), null)
).toDF("A", "B")
// correct, has 1, 4
val to_remove = df
.filter(
df.col("B").isNotNull
).select(
df("B")
).distinct()
// incorrect, returns 2, 3 instead of 2, 3, 5, 6
val final = df.filter(!df.col("A").isin(to_remove.col("B")))
// 4 != 2
assert(4 === final.collect().length)
isin function accepts a list. However, in your code, you're passing Dataset[Row]. As per documentation https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.Column#isin%28scala.collection.Seq%29
it's declared as
def isin(list: Any*): Column
You first need to extract the values into Sequence and then use that in isin function. Please, note that this may have performance implications.
scala> val to_remove = df.filter(df.col("B").isNotNull).select(df("B")).distinct().collect.map(_.getDecimal(0))
to_remove: Array[java.math.BigDecimal] = Array(1.000000000000000000, 4.000000000000000000)
scala> val finaldf = df.filter(!df.col("A").isin(to_remove:_*))
finaldf: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [A: decimal(38,18), B: decimal(38,18)]
scala> finaldf.show
+--------------------+--------------------+
| A| B|
+--------------------+--------------------+
|2.000000000000000000|1.000000000000000000|
|3.000000000000000000|4.000000000000000000|
|5.000000000000000000| null|
|6.000000000000000000| null|
+--------------------+--------------------+
Change filter condition !df.col("A").isin(to_remove.col("B")) to !df.col("A").isin(to_remove.collect.map(_.getDecimal(0)):_*)
Check below code.
val finaldf = df
.filter(!df
.col("A")
.isin(to_remove.map(_.getDecimal(0)).collect:_*)
)
scala> finaldf.show
+--------------------+--------------------+
| A| B|
+--------------------+--------------------+
|2.000000000000000000|1.000000000000000000|
|3.000000000000000000|4.000000000000000000|
|5.000000000000000000| null|
|6.000000000000000000| null|
+--------------------+--------------------+

Convert spark dataframe to sequence of sequences and vice versa in Scala [duplicate]

This question already has an answer here:
How to get Array[Seq[String]] from DataFrame?
(1 answer)
Closed 3 years ago.
I have a DataFrame and I want to convert it into a sequence of sequences and vice versa.
Now the thing is, I want to do it dynamically, and write something which runs for DataFrame with any number/type of columns.
In summary, these are the questions:
How to convert Seq[Seq[String]] to a DataFrame?
How to convert DataFrame to Seq[Seq[String]?
How to perform 2 but also make the DataFrame infer the schema and decide column types by itself?
UPDATE 1
This is not a duplicate of this question because in answer to that question solution provided is not dynamic, it works for two columns or how many columns is to be hardcoded. I am trying to find a dynamic solution.
This is how you can dynamically create a dataframe from Seq[Seq[String]]:
scala> val seqOfSeq = Seq(Seq("a","b", "c"),Seq("3","4", "5"))
seqOfSeq: Seq[Seq[String]] = List(List(a, b, c), List(3, 4, 5))
scala> val lengthOfRow = seqOfSeq(0).size
lengthOfRow: Int = 3
scala> val tempDf = sc.parallelize(seqOfSeq).toDF
tempDf: org.apache.spark.sql.DataFrame = [value: array<string>]
scala> val requiredDf = tempDf.select((0 until lengthOfRow).map(i => col("value")(i).alias(s"col$i")): _*)
requiredDf: org.apache.spark.sql.DataFrame = [col0: string, col1: string ... 1 more field]
scala> requiredDf.show
+----+----+----+
|col0|col1|col2|
+----+----+----+
| a| b| c|
| 3| 4| 5|
+----+----+----+
How to convert DataFrame to Seq[Seq[String]:
val newSeqOfSeq = requiredDf.collect().map(row => row.toSeq.map(_.toString).toSeq).toSeq
To use custom column names:
scala> val myCols = Seq("myColA", "myColB", "myColC")
myCols: Seq[String] = List(myColA, myColB, myColC)
scala> val requiredDf = tempDf.select((0 until lengthOfRow).map(i => col("value")(i).alias( myCols(i) )): _*)
requiredDf: org.apache.spark.sql.DataFrame = [myColA: string, myColB: string ... 1 more field]

Spark withColumn - add column using non-Column type variable [duplicate]

This question already has answers here:
How to add a constant column in a Spark DataFrame?
(3 answers)
Closed 4 years ago.
How can I add a column to a data frame from a variable value?
I know that I can create a data frame using .toDF(colName) and that .withColumn is the method to add the column. But, when I try the following, I get a type mismatch error:
val myList = List(1,2,3)
val myArray = Array(1,2,3)
myList.toDF("myList")
.withColumn("myArray", myArray)
Type mismatch, expected: Column, actual: Array[Int]
This compile error is on myArray within the .withColumn call. How can I convert it from an Array[Int] to a Column type?
The error message has exactly what is up, you need to input a column (or a lit()) as the second argument as withColumn()
try this
import org.apache.spark.sql.functions.typedLit
val myList = List(1,2,3)
val myArray = Array(1,2,3)
myList.toDF("myList")
.withColumn("myArray", typedLit(myArray))
:)
Not sure withColumn is what you're actually seeking. You could apply lit() to make myArray conform to the method specs, but the result will be the same array value for every row in the DataFrame:
myList.toDF("myList").withColumn("myArray", lit(myArray)).
show
// +------+---------+
// |myList| myArray|
// +------+---------+
// | 1|[1, 2, 3]|
// | 2|[1, 2, 3]|
// | 3|[1, 2, 3]|
// +------+---------+
If you're trying to merge the two collections column-wise, it's a different transformation from what withColumn offers. In that case you'll need to convert each of them into a DataFrame and combine them via a join.
Now if the elements of the two collections are row-identifying and match each other pair-wise like in your example and you want to join them that way, you can simply join the converted DataFrames:
myList.toDF("myList").join(
myArray.toSeq.toDF("myArray"), $"myList" === $"myArray"
).show
// +------+-------+
// |myList|myArray|
// +------+-------+
// | 1| 1|
// | 2| 2|
// | 3| 3|
// +------+-------+
But in case the two collections have elements that aren't join-able and you simply want to merge them column-wise, you'll need to use compatible row-identifying columns from the two dataframes to join them. And if there isn't such row-identifying columns, one approach would be to create your own rowIds, as in the following example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val df1 = List("a", "b", "c").toDF("myList")
val df2 = Array("x", "y", "z").toSeq.toDF("myArray")
val rdd1 = df1.rdd.zipWithIndex.map{
case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df1withId = spark.createDataFrame( rdd1,
StructType(df1.schema.fields :+ StructField("rowId", LongType, false))
)
val rdd2 = df2.rdd.zipWithIndex.map{
case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df2withId = spark.createDataFrame( rdd2,
StructType(df2.schema.fields :+ StructField("rowId", LongType, false))
)
df1withId.join(df2withId, Seq("rowId")).show
// +-----+------+-------+
// |rowId|myList|myArray|
// +-----+------+-------+
// | 0| a| x|
// | 1| b| y|
// | 2| c| z|
// +-----+------+-------+

Find columns with different values

My dataframe has 120 columns.Suppose my dataframe has the below structure
Id value1 value2 value3
a 10 1983 19
a 20 1983 20
a 10 1983 21
b 10 1984 1
b 10 1984 2
we can see here the id a, value1 have different values(10,20). I have to find columns having the different values for a particular id. Is there any statistical or any other approach in spark to solve this problem?
Expected output
id new_column
a value1,value3
b value3
The following code might be a start of an answer:
val result = log.select("Id","value1","value2","value3").groupBy('Id).agg('Id, countDistinct('value1),countDistinct('value2),countDistinct('value3))
Should do the following:
1)
log.select("Id","value1","value2","value3")
select relevant columns (if you want to take all columns it might be redundant)
2)
groupBy('Id)
group rows with the same ID
3)
agg('Id, countDistinct('value1),countDistinct('value2),countDistinct('value3))
output : ID, and number(count) of unique(distinct) values per ID/specific column
You can do it in several ways, one of them being the distinct method, that is similar to the SQL behaviour. Another one would be the groupBy method, where you have to pass in parameters the name of the columns you want to group (e.g. df.groupBy("Id", "value1")).
Below is an example using the distinct method.
scala> case class Person(name : String, age: Int)
defined class Person
scala> val persons = Seq(Person("test", 10), Person("test", 20), Person("test", 10)).toDF
persons: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> persons.show
+----+---+
|name|age|
+----+---+
|test| 10|
|test| 20|
|test| 10|
+----+---+
scala> persons.select("name", "age").distinct().show
+-----+---+
| name|age|
+-----+---+
| test| 10|
| test| 20|
+-----+---+

Adding a column of rowsums across a list of columns in Spark Dataframe

I have a Spark dataframe with several columns. I want to add a column on to the dataframe that is a sum of a certain number of the columns.
For example, my data looks like this:
ID var1 var2 var3 var4 var5
a 5 7 9 12 13
b 6 4 3 20 17
c 4 9 4 6 9
d 1 2 6 8 1
I want a column added summing the rows for specific columns:
ID var1 var2 var3 var4 var5 sums
a 5 7 9 12 13 46
b 6 4 3 20 17 50
c 4 9 4 6 9 32
d 1 2 6 8 10 27
I know it is possible to add columns together if you know the specific columns to add:
val newdf = df.withColumn("sumofcolumns", df("var1") + df("var2"))
But is it possible to pass a list of column names and add them together? Based off of this answer which is basically what I want but it is using the python API instead of scala (Add column sum as new column in PySpark dataframe) I think something like this would work:
//Select columns to sum
val columnstosum = ("var1", "var2","var3","var4","var5")
// Create new column called sumofcolumns which is sum of all columns listed in columnstosum
val newdf = df.withColumn("sumofcolumns", df.select(columstosum.head, columnstosum.tail: _*).sum)
This throws the error value sum is not a member of org.apache.spark.sql.DataFrame. Is there a way to sum across columns?
Thanks in advance for your help.
You should try the following:
import org.apache.spark.sql.functions._
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val input = sc.parallelize(Seq(
("a", 5, 7, 9, 12, 13),
("b", 6, 4, 3, 20, 17),
("c", 4, 9, 4, 6 , 9),
("d", 1, 2, 6, 8 , 1)
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")
val columnsToSum = List(col("var1"), col("var2"), col("var3"), col("var4"), col("var5"))
val output = input.withColumn("sums", columnsToSum.reduce(_ + _))
output.show()
Then the result is:
+---+----+----+----+----+----+----+
| ID|var1|var2|var3|var4|var5|sums|
+---+----+----+----+----+----+----+
| a| 5| 7| 9| 12| 13| 46|
| b| 6| 4| 3| 20| 17| 50|
| c| 4| 9| 4| 6| 9| 32|
| d| 1| 2| 6| 8| 1| 18|
+---+----+----+----+----+----+----+
Plain and simple:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, col}
def sum_(cols: Column*) = cols.foldLeft(lit(0))(_ + _)
val columnstosum = Seq("var1", "var2", "var3", "var4", "var5").map(col _)
df.select(sum_(columnstosum: _*))
with Python equivalent:
from functools import reduce
from operator import add
from pyspark.sql.functions import lit, col
def sum_(*cols):
return reduce(add, cols, lit(0))
columnstosum = [col(x) for x in ["var1", "var2", "var3", "var4", "var5"]]
select("*", sum_(*columnstosum))
Both will default to NA if there is a missing value in the row. You can use DataFrameNaFunctions.fill or coalesce function to avoid that.
I assume you have a dataframe df. Then you can sum up all cols except your ID col. This is helpful when you have many cols and you don't want to manually mention names of all columns like everyone mentioned above. This post has the same answer.
val sumAll = df.columns.collect{ case x if x != "ID" => col(x) }.reduce(_ + _)
df.withColumn("sum", sumAll)
Here's an elegant solution using python:
NewDF = OldDF.withColumn('sums', sum(OldDF[col] for col in OldDF.columns[1:]))
Hopefully this will influence something similar in Spark ... anyone?.