Scala concatenate Column of Array[String] into single Array[String] - scala

I have a Spark Dataframe (Scala) with an id - (Int) and tokens - (array<string>) column:
id,tokens
0,["a","b","c"]
1,["a","b"]
...
Assuming I am able to retrieve the data via a SparkSession and casting to a case class:
case class Token(id: Int, tokens: Array[String])
After getting a Dataset[Token] object, how do I concatenate all the array of string tokens into a single Array<String> and subsequently perform a count to find the most occuring strings?
Output:
a,2
b,2
c,1
...

You need to explode the token column & take the count after grouping by the individual tokens:
scala> val input = sc.parallelize(List(
(0, Array("a","b","c")),
(1, Array("a","b"))
)).toDF("id","token")
scala> input.withColumn("token_split",explode($"token"))
.groupBy($"token_split")
.agg(count($"id") as "count")
.orderBy($"count".desc)
.show
Output:
+-----------+-----+
|token_split|count|
+-----------+-----+
| b| 2|
| a| 2|
| c| 1|
+-----------+-----+

Related

substring from lastIndexOf in spark scala

I have a column in my dataframe which contains the filename
test_1_1_1_202012010101101
I want to get the string after the lastIndexOf(_)
I tried this and it is working
val timestamp_df =file_name_df.withColumn("timestamp",split(col("filename"),"_").getItem(4))
But I want to make it more generic, so that if in future if the filename can have any number of _ in it, it can split it on the basis of lastIndexOf _
val timestamp_df =file_name_df.withColumn("timestamp", expr("substring(filename, length(filename)-15,17)"))
This also is not generic as the character length can vary.
Can anyone help me in using the lastIndexOf function with withColumn.
You can use element_at function with split to get last element of array.
Example:
df.withColumn("timestamp",element_at(split(col("filename"),"_"),-1)).show(false)
+--------------------------+---------------+
|filename |timestamp |
+--------------------------+---------------+
|test_1_1_1_202012010101101|202012010101101|
+--------------------------+---------------+
You can use substring_index
scala> val df = Seq(("a-b-c", 1),("d-ef-foi",2)).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: string, c2: int]
+--------+---+
| c1| c2|
+--------+---+
| a-b-c| 1|
|d-ef-foi| 2|
+--------+---+
scala> df.withColumn("c3", substring_index(col("c1"), "-", -1)).show
+--------+---+---+
| c1| c2| c3|
+--------+---+---+
| a-b-c| 1| c|
|d-ef-foi| 2|foi|
+--------+---+---+
Per docs: When the last argument "is negative, everything to the right of the final delimiter (counting from the right) is returned"
val timestamp_df =file_name_df.withColumn("timestamp",reverse(split(reverse(col("filename")),"_").getItem(0)))
It's working with this.

Spark withColumn - add column using non-Column type variable [duplicate]

This question already has answers here:
How to add a constant column in a Spark DataFrame?
(3 answers)
Closed 4 years ago.
How can I add a column to a data frame from a variable value?
I know that I can create a data frame using .toDF(colName) and that .withColumn is the method to add the column. But, when I try the following, I get a type mismatch error:
val myList = List(1,2,3)
val myArray = Array(1,2,3)
myList.toDF("myList")
.withColumn("myArray", myArray)
Type mismatch, expected: Column, actual: Array[Int]
This compile error is on myArray within the .withColumn call. How can I convert it from an Array[Int] to a Column type?
The error message has exactly what is up, you need to input a column (or a lit()) as the second argument as withColumn()
try this
import org.apache.spark.sql.functions.typedLit
val myList = List(1,2,3)
val myArray = Array(1,2,3)
myList.toDF("myList")
.withColumn("myArray", typedLit(myArray))
:)
Not sure withColumn is what you're actually seeking. You could apply lit() to make myArray conform to the method specs, but the result will be the same array value for every row in the DataFrame:
myList.toDF("myList").withColumn("myArray", lit(myArray)).
show
// +------+---------+
// |myList| myArray|
// +------+---------+
// | 1|[1, 2, 3]|
// | 2|[1, 2, 3]|
// | 3|[1, 2, 3]|
// +------+---------+
If you're trying to merge the two collections column-wise, it's a different transformation from what withColumn offers. In that case you'll need to convert each of them into a DataFrame and combine them via a join.
Now if the elements of the two collections are row-identifying and match each other pair-wise like in your example and you want to join them that way, you can simply join the converted DataFrames:
myList.toDF("myList").join(
myArray.toSeq.toDF("myArray"), $"myList" === $"myArray"
).show
// +------+-------+
// |myList|myArray|
// +------+-------+
// | 1| 1|
// | 2| 2|
// | 3| 3|
// +------+-------+
But in case the two collections have elements that aren't join-able and you simply want to merge them column-wise, you'll need to use compatible row-identifying columns from the two dataframes to join them. And if there isn't such row-identifying columns, one approach would be to create your own rowIds, as in the following example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val df1 = List("a", "b", "c").toDF("myList")
val df2 = Array("x", "y", "z").toSeq.toDF("myArray")
val rdd1 = df1.rdd.zipWithIndex.map{
case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df1withId = spark.createDataFrame( rdd1,
StructType(df1.schema.fields :+ StructField("rowId", LongType, false))
)
val rdd2 = df2.rdd.zipWithIndex.map{
case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df2withId = spark.createDataFrame( rdd2,
StructType(df2.schema.fields :+ StructField("rowId", LongType, false))
)
df1withId.join(df2withId, Seq("rowId")).show
// +-----+------+-------+
// |rowId|myList|myArray|
// +-----+------+-------+
// | 0| a| x|
// | 1| b| y|
// | 2| c| z|
// +-----+------+-------+

How to compose column name using another column's value for withColumn in Scala Spark

I'm trying to add a new column to a DataFrame. The value of this column is the value of another column whose name depends on other columns from the same DataFrame.
For instance, given this:
+---+---+----+----+
| A| B| A_1| B_2|
+---+---+----+----+
| A| 1| 0.1| 0.3|
| B| 2| 0.2| 0.4|
+---+---+----+----+
I'd like to obtain this:
+---+---+----+----+----+
| A| B| A_1| B_2| C|
+---+---+----+----+----+
| A| 1| 0.1| 0.3| 0.1|
| B| 2| 0.2| 0.4| 0.4|
+---+---+----+----+----+
That is, I added column C whose value came from either column A_1 or B_2. The name of the source column A_1 comes from concatenating the value of columns A and B.
I know that I can add a new column based on another and a constant like this:
df.withColumn("C", $"B" + 1)
I also know that the name of the column can come from a variable like this:
val name = "A_1"
df.withColumn("C", col(name) + 1)
However, what I'd like to do is something like this:
df.withColumn("C", col(s"${col("A")}_${col("B")}"))
Which doesn't work.
NOTE: I'm coding in Scala 2.11 and Spark 2.2.
You can achieve your requirement by writing a udf function. I am suggesting udf, as your requirement is to process dataframe row by row contradicting to inbuilt functions which functions column by column.
But before that you would need array of column names
val columns = df.columns
Then write a udf function as
import org.apache.spark.sql.functions._
def getValue = udf((A: String, B: String, array: mutable.WrappedArray[String]) => array(columns.indexOf(A+"_"+B)))
where
A is the first column value
B is the second column value
array is the Array of all the columns values
Now just call the udf function using withColumn api
df.withColumn("C", getValue($"A", $"B", array(columns.map(col): _*))).show(false)
You should get your desired output dataframe.
You can select from a map. Define map which translates name to column value:
import org.apache.spark.sql.functions.{col, concat_ws, lit, map}
val dataMap = map(
df.columns.diff(Seq("A", "B")).flatMap(c => lit(c) :: col(c) :: Nil): _*
)
df.select(dataMap).show(false)
+---------------------------+
|map(A_1, A_1, B_2, B_2) |
+---------------------------+
|Map(A_1 -> 0.1, B_2 -> 0.3)|
|Map(A_1 -> 0.2, B_2 -> 0.4)|
+---------------------------+
and select from it with apply:
df.withColumn("C", dataMap(concat_ws("_", $"A", $"B"))).show
+---+---+---+---+---+
| A| B|A_1|B_2| C|
+---+---+---+---+---+
| A| 1|0.1|0.3|0.1|
| B| 2|0.2|0.4|0.4|
+---+---+---+---+---+
You can also try mapping, but I suspect it won't perform well with very wide data:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val outputEncoder = RowEncoder(df.schema.add(StructField("C", DoubleType)))
df.map(row => {
val a = row.getAs[String]("A")
val b = row.getAs[String]("B")
val key = s"${a}_${b}"
Row.fromSeq(row.toSeq :+ row.getAs[Double](key))
})(outputEncoder).show
+---+---+---+---+---+
| A| B|A_1|B_2| C|
+---+---+---+---+---+
| A| 1|0.1|0.3|0.1|
| B| 2|0.2|0.4|0.4|
+---+---+---+---+---+
and in general I wouldn't recommend this approach.
If data comes from csv you might consider skipping default csv reader and use custom logic to push column selection directly into parsing process. With pseudocode:
spark.read.text(...).map { line => {
val a = ??? // parse A
val b = ??? // parse B
val c = ??? // find c, based on a and b
(a, b, c)
}}

How to get a value from Dataset and store it in a Scala value?

I have a dataframe which looks like this:
scala> avgsessiontime.show()
+-----------------+
| avg|
+-----------------+
|2.073455735838315|
+-----------------+
I need to store the value 2.073455735838315 in a variable. I tried using
avgsessiontime.collect
but that starts giving me Task not serializable exceptions. So to avoid that I started using foreachPrtition. But I dont know how to extract the value 2.073455735838315 in an array variable.
scala> avgsessiontime.foreachPartition(x => x.foreach(println))
[2.073455735838315]
But when I do this:
avgsessiontime.foreachPartition(x => for (name <- x) name.get(0))
I get a blank/empty result. Even the length returns empty.
avgsessiontime.foreachPartition(x => for (name <- x) name.length)
I know name is of type org.apache.spark.sql.Row then it should return both those results.
You might need:
avgsessiontime.first.getDouble(0)
Here use first to extract the Row object, and .getDouble(0) to extract value from the Row object.
val df = Seq(2.0743).toDF("avg")
df.show
+------+
| avg|
+------+
|2.0743|
+------+
df.first.getDouble(0)
// res6: Double = 2.0743
scala> val df = spark.range(10)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
scala> val variable = df.select("id").as[Long].collect
variable: Array[Long] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Same way you can extract values of any type i.e double,string. You just need to give data type while selecting values from df.
rdd and dataframes/datasets are distributed in nature, and foreach and foreachPartition are executed on executors, transforming dataframe or rdd on executors itself without returning anything. So if you want to return the variable to the driver node then you will have to use collect.
Supposing you have a dataframe as
+-----------------+
|avg |
+-----------------+
|2.073455735838315|
|2.073455735838316|
+-----------------+
doing the following will print all the values, which you can store in a variable too
avgsessiontime.rdd.collect().foreach(x => println(x(0)))
it will print
2.073455735838315
2.073455735838316
Now if you want only the first one then you can do
avgsessiontime.rdd.collect()(0)(0)
which will give you
2.073455735838315
I hope the answer is helpful

What is going wrong with `unionAll` of Spark `DataFrame`?

Using Spark 1.5.0 and given the following code, I expect unionAll to union DataFrames based on their column name. In the code, I'm using some FunSuite for passing in SparkContext sc:
object Entities {
case class A (a: Int, b: Int)
case class B (b: Int, a: Int)
val as = Seq(
A(1,3),
A(2,4)
)
val bs = Seq(
B(5,3),
B(6,4)
)
}
class UnsortedTestSuite extends SparkFunSuite {
configuredUnitTest("The truth test.") { sc =>
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val aDF = sc.parallelize(Entities.as, 4).toDF
val bDF = sc.parallelize(Entities.bs, 4).toDF
aDF.show()
bDF.show()
aDF.unionAll(bDF).show
}
}
Output:
+---+---+
| a| b|
+---+---+
| 1| 3|
| 2| 4|
+---+---+
+---+---+
| b| a|
+---+---+
| 5| 3|
| 6| 4|
+---+---+
+---+---+
| a| b|
+---+---+
| 1| 3|
| 2| 4|
| 5| 3|
| 6| 4|
+---+---+
Why does the result contain intermixed "b" and "a" columns, instead of aligning columns bases on column names? Sounds like a serious bug!?
It doesn't look like a bug at all. What you see is a standard SQL behavior and every major RDMBS, including PostgreSQL, MySQL, Oracle and MS SQL behaves exactly the same. You'll find SQL Fiddle examples linked with names.
To quote PostgreSQL manual:
In order to calculate the union, intersection, or difference of two queries, the two queries must be "union compatible", which means that they return the same number of columns and the corresponding columns have compatible data types
Column names, excluding the first table in the set operation, are simply ignored.
This behavior comes directly form the Relational Algebra where basic building block is a tuple. Since tuples are ordered an union of two sets of tuples is equivalent (ignoring duplicates handling) to the output you get here.
If you want to match using names you can do something like this
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
a.select(columns: _*).unionAll(b.select(columns: _*))
}
To check both names and types it is should be enough to replace columns with:
a.dtypes.toSet.intersect(b.dtypes.toSet).map{case (c, _) => col(c)}.toSeq
This issue is getting fixed in spark2.3. They are adding support of unionByName in the dataset.
https://issues.apache.org/jira/browse/SPARK-21043
no issues/bugs - if you observe your case class B very closely then you will be clear.
Case Class A --> you have mentioned the order (a,b), and
Case Class B --> you have mentioned the order (b,a) ---> this is expected as per order
case class A (a: Int, b: Int)
case class B (b: Int, a: Int)
thanks,
Subbu
Use unionByName:
Excerpt from the documentation:
def unionByName(other: Dataset[T]): Dataset[T]
The difference between this function and union is that this function resolves columns by name (not by position):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.union(df2).show
// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// | 1| 2| 3|
// | 4| 5| 6|
// +----+----+----+
As discussed in SPARK-9813, it seems like as long as the data types and number of columns are the same across frames, the unionAll operation should work. Please see the comments for additional discussion.