How to get the name of a Spark Column as String? - scala

I want to write a method to round a numeric column without doing something like:
df
.select(round($"x",2).as("x"))
Therefore I need to have a reusable column-expression like:
def roundKeepName(c:Column,scale:Int) = round(c,scale).as(c.name)
Unfortunately c.name does not exist, therefore the above code does not compile. I've found a solution for ColumName:
def roundKeepName(c:ColumnName,scale:Int) = round(c,scale).as(c.string.name)
But how can I do that with Column (which is generated if I use col("x") instead of $"x")

Not sure if the question has really been answered. Your function could be implemented like this (toString returns the name of the column):
def roundKeepname(c:Column,scale:Int) = round(c,scale).as(c.toString)
In case you don't like relying on toString, here is a more robust version. You can rely on the underlying expression, cast it to a NamedExpression and take its name.
import org.apache.spark.sql.catalyst.expressions.NamedExpression
def roundKeepname(c:Column,scale:Int) =
c.expr.asInstanceOf[NamedExpression].name
And it works:
scala> spark.range(2).select(roundKeepname('id, 2)).show
+---+
| id|
+---+
| 0|
| 1|
+---+
EDIT
Finally, if that's OK for you to use the name of the column instead of the Column object, you can change the signature of the function and that yields a much simpler implementation:
def roundKeepName(columnName:String, scale:Int) =
round(col(columnName),scale).as(columnName)

Update:
With the solution way given by BlueSheepToken, here is how you can do it dynamically assuming you have all "double" columns.
scala> val df = Seq((1.22,4.34,8.93),(3.44,12.66,17.44),(5.66,9.35,6.54)).toDF("x","y","z")
df: org.apache.spark.sql.DataFrame = [x: double, y: double ... 1 more field]
scala> df.show
+----+-----+-----+
| x| y| z|
+----+-----+-----+
|1.22| 4.34| 8.93|
|3.44|12.66|17.44|
|5.66| 9.35| 6.54|
+----+-----+-----+
scala> df.columns.foldLeft(df)( (acc,p) => (acc.withColumn(p+"_t",round(col(p),1)).drop(p).withColumnRenamed(p+"_t",p))).show
+---+----+----+
| x| y| z|
+---+----+----+
|1.2| 4.3| 8.9|
|3.4|12.7|17.4|
|5.7| 9.4| 6.5|
+---+----+----+
scala>

Related

substring from lastIndexOf in spark scala

I have a column in my dataframe which contains the filename
test_1_1_1_202012010101101
I want to get the string after the lastIndexOf(_)
I tried this and it is working
val timestamp_df =file_name_df.withColumn("timestamp",split(col("filename"),"_").getItem(4))
But I want to make it more generic, so that if in future if the filename can have any number of _ in it, it can split it on the basis of lastIndexOf _
val timestamp_df =file_name_df.withColumn("timestamp", expr("substring(filename, length(filename)-15,17)"))
This also is not generic as the character length can vary.
Can anyone help me in using the lastIndexOf function with withColumn.
You can use element_at function with split to get last element of array.
Example:
df.withColumn("timestamp",element_at(split(col("filename"),"_"),-1)).show(false)
+--------------------------+---------------+
|filename |timestamp |
+--------------------------+---------------+
|test_1_1_1_202012010101101|202012010101101|
+--------------------------+---------------+
You can use substring_index
scala> val df = Seq(("a-b-c", 1),("d-ef-foi",2)).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: string, c2: int]
+--------+---+
| c1| c2|
+--------+---+
| a-b-c| 1|
|d-ef-foi| 2|
+--------+---+
scala> df.withColumn("c3", substring_index(col("c1"), "-", -1)).show
+--------+---+---+
| c1| c2| c3|
+--------+---+---+
| a-b-c| 1| c|
|d-ef-foi| 2|foi|
+--------+---+---+
Per docs: When the last argument "is negative, everything to the right of the final delimiter (counting from the right) is returned"
val timestamp_df =file_name_df.withColumn("timestamp",reverse(split(reverse(col("filename")),"_").getItem(0)))
It's working with this.

How to compose column name using another column's value for withColumn in Scala Spark

I'm trying to add a new column to a DataFrame. The value of this column is the value of another column whose name depends on other columns from the same DataFrame.
For instance, given this:
+---+---+----+----+
| A| B| A_1| B_2|
+---+---+----+----+
| A| 1| 0.1| 0.3|
| B| 2| 0.2| 0.4|
+---+---+----+----+
I'd like to obtain this:
+---+---+----+----+----+
| A| B| A_1| B_2| C|
+---+---+----+----+----+
| A| 1| 0.1| 0.3| 0.1|
| B| 2| 0.2| 0.4| 0.4|
+---+---+----+----+----+
That is, I added column C whose value came from either column A_1 or B_2. The name of the source column A_1 comes from concatenating the value of columns A and B.
I know that I can add a new column based on another and a constant like this:
df.withColumn("C", $"B" + 1)
I also know that the name of the column can come from a variable like this:
val name = "A_1"
df.withColumn("C", col(name) + 1)
However, what I'd like to do is something like this:
df.withColumn("C", col(s"${col("A")}_${col("B")}"))
Which doesn't work.
NOTE: I'm coding in Scala 2.11 and Spark 2.2.
You can achieve your requirement by writing a udf function. I am suggesting udf, as your requirement is to process dataframe row by row contradicting to inbuilt functions which functions column by column.
But before that you would need array of column names
val columns = df.columns
Then write a udf function as
import org.apache.spark.sql.functions._
def getValue = udf((A: String, B: String, array: mutable.WrappedArray[String]) => array(columns.indexOf(A+"_"+B)))
where
A is the first column value
B is the second column value
array is the Array of all the columns values
Now just call the udf function using withColumn api
df.withColumn("C", getValue($"A", $"B", array(columns.map(col): _*))).show(false)
You should get your desired output dataframe.
You can select from a map. Define map which translates name to column value:
import org.apache.spark.sql.functions.{col, concat_ws, lit, map}
val dataMap = map(
df.columns.diff(Seq("A", "B")).flatMap(c => lit(c) :: col(c) :: Nil): _*
)
df.select(dataMap).show(false)
+---------------------------+
|map(A_1, A_1, B_2, B_2) |
+---------------------------+
|Map(A_1 -> 0.1, B_2 -> 0.3)|
|Map(A_1 -> 0.2, B_2 -> 0.4)|
+---------------------------+
and select from it with apply:
df.withColumn("C", dataMap(concat_ws("_", $"A", $"B"))).show
+---+---+---+---+---+
| A| B|A_1|B_2| C|
+---+---+---+---+---+
| A| 1|0.1|0.3|0.1|
| B| 2|0.2|0.4|0.4|
+---+---+---+---+---+
You can also try mapping, but I suspect it won't perform well with very wide data:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val outputEncoder = RowEncoder(df.schema.add(StructField("C", DoubleType)))
df.map(row => {
val a = row.getAs[String]("A")
val b = row.getAs[String]("B")
val key = s"${a}_${b}"
Row.fromSeq(row.toSeq :+ row.getAs[Double](key))
})(outputEncoder).show
+---+---+---+---+---+
| A| B|A_1|B_2| C|
+---+---+---+---+---+
| A| 1|0.1|0.3|0.1|
| B| 2|0.2|0.4|0.4|
+---+---+---+---+---+
and in general I wouldn't recommend this approach.
If data comes from csv you might consider skipping default csv reader and use custom logic to push column selection directly into parsing process. With pseudocode:
spark.read.text(...).map { line => {
val a = ??? // parse A
val b = ??? // parse B
val c = ??? // find c, based on a and b
(a, b, c)
}}

How to get a value from Dataset and store it in a Scala value?

I have a dataframe which looks like this:
scala> avgsessiontime.show()
+-----------------+
| avg|
+-----------------+
|2.073455735838315|
+-----------------+
I need to store the value 2.073455735838315 in a variable. I tried using
avgsessiontime.collect
but that starts giving me Task not serializable exceptions. So to avoid that I started using foreachPrtition. But I dont know how to extract the value 2.073455735838315 in an array variable.
scala> avgsessiontime.foreachPartition(x => x.foreach(println))
[2.073455735838315]
But when I do this:
avgsessiontime.foreachPartition(x => for (name <- x) name.get(0))
I get a blank/empty result. Even the length returns empty.
avgsessiontime.foreachPartition(x => for (name <- x) name.length)
I know name is of type org.apache.spark.sql.Row then it should return both those results.
You might need:
avgsessiontime.first.getDouble(0)
Here use first to extract the Row object, and .getDouble(0) to extract value from the Row object.
val df = Seq(2.0743).toDF("avg")
df.show
+------+
| avg|
+------+
|2.0743|
+------+
df.first.getDouble(0)
// res6: Double = 2.0743
scala> val df = spark.range(10)
df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> df.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
+---+
scala> val variable = df.select("id").as[Long].collect
variable: Array[Long] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Same way you can extract values of any type i.e double,string. You just need to give data type while selecting values from df.
rdd and dataframes/datasets are distributed in nature, and foreach and foreachPartition are executed on executors, transforming dataframe or rdd on executors itself without returning anything. So if you want to return the variable to the driver node then you will have to use collect.
Supposing you have a dataframe as
+-----------------+
|avg |
+-----------------+
|2.073455735838315|
|2.073455735838316|
+-----------------+
doing the following will print all the values, which you can store in a variable too
avgsessiontime.rdd.collect().foreach(x => println(x(0)))
it will print
2.073455735838315
2.073455735838316
Now if you want only the first one then you can do
avgsessiontime.rdd.collect()(0)(0)
which will give you
2.073455735838315
I hope the answer is helpful

groupByKey in Spark dataset

Please help me understand the parameter we pass to groupByKey when it is used on a dataset
scala> val data = spark.read.text("Sample.txt").as[String]
data: org.apache.spark.sql.Dataset[String] = [value: string]
scala> data.flatMap(_.split(" ")).groupByKey(l=>l).count.show
In the above code, please help me understand what (l=>l) means in groupByKey(l=>l).
l =>l says use the whole string(in your case that's every word as you're tokenizing on space) will be used as a key. This way you get all occurrences of each word in same partition and you can count them.
- As you probably seen in other articles, it is preferable to use reduceByKey in this case so you don't need to collect all values for each key in memory before counting.
Always a good place to start is the API Docs:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
def groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]
(Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.
You need a function that derives your key from the dataset's data.
In your example, your function takes the whole string as is and uses it as the key. A different example will be, for a Dataset[String], to use as a key the first 3 characters of your string and not the whole string:
scala> val ds = List("abcdef", "abcd", "cdef", "mnop").toDS
ds: org.apache.spark.sql.Dataset[String] = [value: string]
scala> ds.show
+------+
| value|
+------+
|abcdef|
| abcd|
| cdef|
| mnop|
+------+
scala> ds.groupByKey(l => l.substring(0,3)).keys.show
+-----+
|value|
+-----+
| cde|
| mno|
| abc|
+-----+
group of key "abc" will have 2 values.
Here is the difference on how the key gets transformed vs the (l => l) so you can see better:
scala> ds.groupByKey(l => l.substring(0,3)).count.show
+-----+--------+
|value|count(1)|
+-----+--------+
| cde| 1|
| mno| 1|
| abc| 2|
+-----+--------+
scala> ds.groupByKey(l => l).count.show
+------+--------+
| value|count(1)|
+------+--------+
| abcd| 1|
| cdef| 1|
|abcdef| 1|
| mnop| 1|
+------+--------+

What is going wrong with `unionAll` of Spark `DataFrame`?

Using Spark 1.5.0 and given the following code, I expect unionAll to union DataFrames based on their column name. In the code, I'm using some FunSuite for passing in SparkContext sc:
object Entities {
case class A (a: Int, b: Int)
case class B (b: Int, a: Int)
val as = Seq(
A(1,3),
A(2,4)
)
val bs = Seq(
B(5,3),
B(6,4)
)
}
class UnsortedTestSuite extends SparkFunSuite {
configuredUnitTest("The truth test.") { sc =>
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val aDF = sc.parallelize(Entities.as, 4).toDF
val bDF = sc.parallelize(Entities.bs, 4).toDF
aDF.show()
bDF.show()
aDF.unionAll(bDF).show
}
}
Output:
+---+---+
| a| b|
+---+---+
| 1| 3|
| 2| 4|
+---+---+
+---+---+
| b| a|
+---+---+
| 5| 3|
| 6| 4|
+---+---+
+---+---+
| a| b|
+---+---+
| 1| 3|
| 2| 4|
| 5| 3|
| 6| 4|
+---+---+
Why does the result contain intermixed "b" and "a" columns, instead of aligning columns bases on column names? Sounds like a serious bug!?
It doesn't look like a bug at all. What you see is a standard SQL behavior and every major RDMBS, including PostgreSQL, MySQL, Oracle and MS SQL behaves exactly the same. You'll find SQL Fiddle examples linked with names.
To quote PostgreSQL manual:
In order to calculate the union, intersection, or difference of two queries, the two queries must be "union compatible", which means that they return the same number of columns and the corresponding columns have compatible data types
Column names, excluding the first table in the set operation, are simply ignored.
This behavior comes directly form the Relational Algebra where basic building block is a tuple. Since tuples are ordered an union of two sets of tuples is equivalent (ignoring duplicates handling) to the output you get here.
If you want to match using names you can do something like this
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
def unionByName(a: DataFrame, b: DataFrame): DataFrame = {
val columns = a.columns.toSet.intersect(b.columns.toSet).map(col).toSeq
a.select(columns: _*).unionAll(b.select(columns: _*))
}
To check both names and types it is should be enough to replace columns with:
a.dtypes.toSet.intersect(b.dtypes.toSet).map{case (c, _) => col(c)}.toSeq
This issue is getting fixed in spark2.3. They are adding support of unionByName in the dataset.
https://issues.apache.org/jira/browse/SPARK-21043
no issues/bugs - if you observe your case class B very closely then you will be clear.
Case Class A --> you have mentioned the order (a,b), and
Case Class B --> you have mentioned the order (b,a) ---> this is expected as per order
case class A (a: Int, b: Int)
case class B (b: Int, a: Int)
thanks,
Subbu
Use unionByName:
Excerpt from the documentation:
def unionByName(other: Dataset[T]): Dataset[T]
The difference between this function and union is that this function resolves columns by name (not by position):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.union(df2).show
// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// | 1| 2| 3|
// | 4| 5| 6|
// +----+----+----+
As discussed in SPARK-9813, it seems like as long as the data types and number of columns are the same across frames, the unionAll operation should work. Please see the comments for additional discussion.