How to copy the "first" row of a Spark data frame to another data frame? Why does my minimal example fail? - scala

Basic Problem :
I want to copy the "first row" of a Spark Dataframe sdf to another Spark dataframe sdfEmpty.
I do not understand what goes wrong in the following code.
Hence I am looking for a solution and an explanation of what fails in my minimal example.
A minimal example :
// create a spark data frame
import org.apache.spark.sql._
val sdf = Seq(
(1, "a"),
(12, "b"),
(234, "b")
).toDF("A", "B")
sdf.show()
+---+---+
| A| B|
+---+---+
| 1| a|
| 12| b|
|234| b|
+---+---+
// create an empty spark data frame to store the row
// declare it as var, such that I can change it later
var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
+---+---+
// take the "first" row of sdf as a spark data frame
val row = sdf.limit(1)
// combine the two spark data frames
sdfEmpty = sdfEmpty.union(row)
As row is:
row.show()
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
the expected result for sdfEmpty is:
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
But I get :
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
| 12| b|
+---+---+
Question:
What confused me is the following: by using val row = sdf.limit(1) I thought I had created a permanent, unchangeable, well-defined object, such that when I print it once and then add it to something, I get the same result both times.
Remark: (thanks a lot to Daniel's remarks)
I know that in the distributed world of Spark there is no well-defined notion of "first row". I put it there for simplicity, and I hope that people struggling with something similar will "accidentally" use the term "first".
What I try to achieve is the following: (in a simplified example)
I have a data frame with 2 columns A and B. Column A is partially ordered and column B is totally ordered.
I want to filter the data w.r.t. the columns. So the idea is some kind of divide and conquer: split the data frame into pieces such that within each piece both columns are totally ordered, and then filter as usual (and do the obvious iterations).
To achieve this I need to pick a well-defined row and split the data w.r.t. row.A. But as the minimal example shows, my commands do not produce a well-defined object.
Thanks a lot

Spark is distributed, so the notion of 'first' is not something we can rely on. Depending on the partitioning, we can get a different result when calling limit or first.
To get consistent results, your data has to have an underlying order that we can use. This makes a lot of sense: unless there is a logical ordering to your data, we cannot really say what it means to take the first row.
Assuming you want to take the first row with respect to column A, you can just run orderBy("A").first() (*). However, if more than one row has the same smallest value in column A, there is no guarantee which of them you will get.
(*) I assume the Scala API uses the same names as the Python one, so please correct me if they are named differently.
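A minimal Scala sketch of this suggestion, assuming a local SparkSession and the sample data from the question. The object and method names (FirstRowSketch, smallestByA) are made up for illustration. In the Scala Dataset API the methods are also called orderBy and first, and first() is an action that returns a local Row to the driver, so the value does not change once it has been fetched.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object FirstRowSketch {
  // Hypothetical helper: returns the row with the smallest value in column A.
  // first() is an action, so the result is a plain local Row on the driver.
  def smallestByA(df: DataFrame): Row = df.orderBy("A").first()

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimenting; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("first-row-sketch").getOrCreate()
    import spark.implicits._

    val sdf = Seq((1, "a"), (12, "b"), (234, "b")).toDF("A", "B")
    // The smallest A is unique in this data, so the same row is picked on every run.
    println(smallestByA(sdf))

    spark.stop()
  }
}
```

If column A contains ties for the smallest value, adding further columns to orderBy until the ordering is total is one way to make the choice deterministic.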

@Christian, you can achieve this result by using the take function.
take(num) Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit.
Here is the code snippet:
scala> import org.apache.spark.sql.types._
scala> val sdf = Seq(
(1, "a"),
(12, "b"),
(234, "b")
).toDF("A", "B")
scala> import org.apache.spark.sql._
scala> var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
scala> val first1 = sdf.rdd.take(1)
scala> val first_row = spark.createDataFrame(sc.parallelize(first1), sdf.schema)
scala> sdfEmpty.union(first_row).show
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
For more about the take() and first() functions, see the Spark documentation. Let me know if you have any questions about this.

I am posting this answer as it contains the solution suggested by Daniel. Once I am through the literature provided by mahesh gupta, or have done some more testing, I'll update this answer and remark on the runtimes of the different approaches in "real life".
Basic Problem :
I want to copy the "first row" of a Spark Dataframe sdf to another Spark dataframe sdfEmpty.
In the distributed world of Spark there is no well-defined notion of "first", but something similar can be achieved using orderBy.
A minimal working example :
// create a spark data frame
import org.apache.spark.sql._
val sdf = Seq(
(1, "a"),
(12, "b"),
(234, "b")
).toDF("A", "B")
sdf.show()
+---+---+
| A| B|
+---+---+
| 1| a|
| 12| b|
|234| b|
+---+---+
// create an empty spark data frame to store the row
// declare it as var, such that I can change it later
var sdfEmpty = spark.createDataFrame(sc.emptyRDD[Row], sdf.schema)
sdfEmpty.show()
+---+---+
| A| B|
+---+---+
+---+---+
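// take the "first" row of sdf and collect it to the driver as a local Array[Row]
val rowLocal = sdf.limit(1).collect()
// turn the collected row back into a spark data frame with the same schema
val row = spark.createDataFrame(sc.parallelize(rowLocal), sdf.schema)
// combine the two spark data frames
sdfEmpty = sdfEmpty.union(row)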
The row is:
row.show()
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
and the result for sdfEmpty is:
+---+---+
| A| B|
+---+---+
| 1| a|
+---+---+
Remark: Explanation given by Daniel (see comments above): .limit(n) is a transformation, so it is not evaluated until an action such as show or collect runs. Hence, depending on the context, it can return a different value each time it is evaluated. To use the result of .limit consistently, one can .collect it to the driver and use it as a local variable.

Related

How to use countDistinct using a window function in Spark/Scala?

I need to use a window function that is partitioned by 2 columns, do a distinct count on the 3rd column, and add that as the 4th column. I can do a plain count without any issues, but using a distinct count throws an exception:
org.apache.spark.sql.AnalysisException: Distinct window functions are not supported:
Is there any workaround for this ?
A previous answer suggested two possible techniques: approximate counting and size(collect_set(...)). Both have problems.
If you need an exact count, which is the main reason to use COUNT(DISTINCT ...) in big data, approximate counting will not do. Also, the actual error rates of approximate counting can vary quite significantly for small data.
size(collect_set(...)) may cause a substantial slowdown in processing of big data because it uses a mutable Scala HashSet, which is a pretty slow data structure. In addition, you may occasionally get strange results, e.g., if you run the query over an empty dataframe, because size(null) produces the counterintuitive -1. Spark's native distinct counting runs faster for a number of reasons, the main one being that it doesn't have to produce all the counted data in an array.
The typical approach to solving this problem is with a self-join. You group by whatever columns you need, compute the distinct count or any other aggregate function that cannot be used as a window function, and then join back to your original data.
Use approx_count_distinct, or collect_set together with size, over a window to mimic countDistinct functionality.
Example:
df.show()
//+---+---+---+
//| i| j| k|
//+---+---+---+
//| 1| a| c|
//| 2| b| d|
//| 1| a| c|
//| 2| b| e|
//+---+---+---+
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window.partitionBy("i","j")
df.withColumn("cnt",size(collect_set("k").over(windowSpec))).show()
//or using approx_count_distinct
df.withColumn("cnt",approx_count_distinct("k").over(windowSpec)).show()
//+---+---+---+---+
//| i| j| k|cnt|
//+---+---+---+---+
//| 2| b| d| 2|
//| 2| b| e| 2|
//| 1| a| c| 1| //as c value repeated for 1,a partition
//| 1| a| c| 1|
//+---+---+---+---+
Trying to improve Sim's answer, if you want to do this:
//val newColName: String = ...
//val colToCount: Column = ...
//val aggregatingCols: Seq[Column] = ...
df.withColumn(newColName, countDistinct(colToCount).over(partitionBy(aggregatingCols:_*)))
You must instead do this:
//val aggregatingCols: Seq[String] = ...
df.groupBy(aggregatingCols.head, aggregatingCols.tail:_*)
.agg(countDistinct(colToCount).as(newColName))
.select(newColName, aggregatingCols:_*)
.join(df, usingColumns = aggregatingCols)
The following returns the number of distinct elements in the partition using the dense_rank() function. When we sum the ascending and the descending rank, we always get the total number of distinct elements + 1, so we subtract 1:
dense_rank().over(Window.partitionBy("i").orderBy(c.asc)) + dense_rank().over(Window.partitionBy("i").orderBy(c.desc)) - 1
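A hedged sketch of the dense_rank() idea applied to the i, j, k example from earlier in this thread. The one-liner above partitions by "i" only and leaves the column c unspecified; partitioning by ("i", "j") and ranking by "k" is an assumption made here to match that example, and the names DenseRankDistinctSketch and withDistinctCount are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

object DenseRankDistinctSketch {
  // Hypothetical helper: adds `cnt`, the number of distinct values of `valueCol`
  // within each (i, j) partition, as ascending dense rank + descending dense rank - 1.
  def withDistinctCount(df: DataFrame, valueCol: String): DataFrame = {
    val w = Window.partitionBy("i", "j")
    val asc = dense_rank().over(w.orderBy(col(valueCol).asc))
    val desc = dense_rank().over(w.orderBy(col(valueCol).desc))
    df.withColumn("cnt", asc + desc - 1)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimenting; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("dense-rank-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a", "c"), (2, "b", "d"), (1, "a", "c"), (2, "b", "e")).toDF("i", "j", "k")
    withDistinctCount(df, "k").show()

    spark.stop()
  }
}
```

One difference to keep in mind: dense_rank() assigns a rank to null as well, so a null in the counted column is counted as one more distinct value, whereas countDistinct ignores nulls.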

Exploding column with index

I know that I can "explode" a column of type array like this:
import org.apache.spark.sql._
import org.apache.spark.sql.functions.explode
val explodedDf =
payloadLegsDf.withColumn("legs", explode(payloadLegsDf.col("legs")))
Now I have multiple rows; one for each item in the array.
Is there a way I can "explode with index"? So that there will be a new column that contains the index of the item in the original array?
(I can think of hacks to do this. First make the array field into an array of tuples of the original value and the index. Then do the explode. Then unpack the tuples. But is there a more elegant way?)
If you are using Spark 2.1+, the posexplode function can be used for that:
Creates a new row for each element with position in the given array or map column.
Example:
val df = Seq(
(1L, Array[String]("a", "b")),
(2L, Array[String]("c", "d"))
).toDF("id", "items")
val res = df.select($"id", posexplode($"items"))
This will create two new columns, pos for position/index and col for the extracted value:
+---+---+---+
| id|pos|col|
+---+---+---+
| 1| 0| a|
| 1| 1| b|
| 2| 0| c|
| 2| 1| d|
+---+---+---+
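If the default column names pos and col are not wanted, the two generated columns can be renamed in the same select by passing a sequence of aliases. This is a small sketch under the assumption that the data frame has the id and items columns from the example above; PosexplodeSketch, explodeWithIndex and the alias names index and item are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, posexplode}

object PosexplodeSketch {
  // Hypothetical helper: one output row per array element, with the element's
  // position in `index` and its value in `item` instead of the default pos/col.
  def explodeWithIndex(df: DataFrame): DataFrame =
    df.select(col("id"), posexplode(col("items")).as(Seq("index", "item")))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimenting; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("posexplode-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1L, Array("a", "b")), (2L, Array("c", "d"))).toDF("id", "items")
    explodeWithIndex(df).show()

    spark.stop()
  }
}
```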

Remove all records which are duplicate in spark dataframe

I have a spark dataframe with multiple columns in it. I want to find out and remove rows which have duplicated values in a column (the other columns can be different).
I tried using dropDuplicates(col_name) but it will only drop duplicate entries but still keep one record in the dataframe. What I need is to remove all entries which were initially containing duplicate entries.
I am using Spark 1.6 and Scala 2.10.
I would use window functions for this. Let's say you want to remove duplicate id rows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
df
.withColumn("cnt", count("*").over(Window.partitionBy($"id")))
.where($"cnt"===1).drop($"cnt")
.show()
This can be done by grouping by the column (or columns) to look for duplicates in and then aggregate and filter the results.
Example dataframe df:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 4| 5|
+---+---+
Grouping by the id column to remove its duplicates (the last two rows):
val df2 = df.groupBy("id")
.agg(first($"num").as("num"), count($"id").as("count"))
.filter($"count" === 1)
.select("id", "num")
This will give you:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
+---+---+
Alternatively, it can be done by using a join. It will be slower, but if there are a lot of columns there is no need to use first($"num").as("num") for each one to keep them.
val df2 = df.groupBy("id").agg(count($"id").as("count")).filter($"count" === 1).select("id")
val df3 = df.join(df2, Seq("id"), "inner")
I added a killDuplicates() method to the open source spark-daria library that uses @Raphael Roth's solution. Here's how to use the code:
import com.github.mrpowers.spark.daria.sql.DataFrameExt._
df.killDuplicates(col("id"))
// you can also supply multiple Column arguments
df.killDuplicates(col("id"), col("another_column"))
Here's the code implementation:
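import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

object DataFrameExt {
  implicit class DataFrameMethods(df: DataFrame) {
    // keep only the rows whose combination of `cols` values occurs exactly once
    def killDuplicates(cols: Column*): DataFrame = {
      df
        .withColumn(
          "my_super_secret_count",
          count("*").over(Window.partitionBy(cols: _*))
        )
        .where(col("my_super_secret_count") === 1)
        .drop(col("my_super_secret_count"))
    }
  }
}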
You might want to leverage the spark-daria library to keep this logic out of your codebase.

PySpark difference between pyspark.sql.functions.col and pyspark.sql.functions.lit

I find it hard to understand the difference between these two methods from pyspark.sql.functions as the documentation on PySpark official website is not very informative. For example the following code:
import pyspark.sql.functions as F
print(F.col('col_name'))
print(F.lit('col_name'))
The results are:
Column<b'col_name'>
Column<b'col_name'>
So what is the difference between the two, and when should I use one and not the other?
The doc says:
col:
Returns a Column based on the given column name.
lit:
Creates a Column of literal value
Say if we have a data frame as below:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.types import *
>>> schema = StructType([StructField('A', StringType(), True)])
>>> df = spark.createDataFrame([("a",), ("b",), ("c",)], schema)
>>> df.show()
+---+
| A|
+---+
| a|
| b|
| c|
+---+
If using col to create a new column from A:
>>> df.withColumn("new", F.col("A")).show()
+---+---+
| A|new|
+---+---+
| a| a|
| b| b|
| c| c|
+---+---+
So col grabs an existing column with the given name; F.col("A") is equivalent to df.A or df["A"] here.
If using F.lit("A") to create the column:
>>> df.withColumn("new", F.lit("A")).show()
+---+---+
| A|new|
+---+---+
| a| A|
| b| A|
| c| A|
+---+---+
lit, by contrast, creates a constant column with the given string as its value in every row.
Both of them return a Column object but the content and meaning are different.
To explain in a very succinct manner, col is typically used to refer to an existing column in a DataFrame, as opposed to lit, which is typically used to set the value of a column to a literal.
To illustrate with an example:
Assume I have a DataFrame df containing two columns of IntegerType, col_a and col_b.
If I wanted a column total that is the sum of the two columns:
df.withColumn('total', col('col_a') + col('col_b'))
If instead I wanted a column fixed_val having the value "Hello" for all rows of the DataFrame df:
df.withColumn('fixed_val', lit('Hello'))
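For readers coming from the Scala threads above, the same distinction can be sketched with the Scala API, where the functions are also called col and lit. This assumes a data frame with the integer columns col_a and col_b from the example; the names ColVsLitSketch and addColumns are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object ColVsLitSketch {
  // Hypothetical helper: `total` is computed from existing columns via col,
  // while `fixed_val` is the same literal in every row via lit.
  def addColumns(df: DataFrame): DataFrame =
    df.withColumn("total", col("col_a") + col("col_b"))
      .withColumn("fixed_val", lit("Hello"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimenting; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("col-vs-lit-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 2), (3, 4)).toDF("col_a", "col_b")
    addColumns(df).show()

    spark.stop()
  }
}
```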

Overwrite Spark dataframe schema

LATER EDIT:
Based on this article it seems that Spark cannot edit an RDD or a column. A new one has to be created with the new type and the old one deleted. The for loop and .withColumn method suggested below seem to be the easiest way to get the job done.
ORIGINAL QUESTION:
Is there a simple way (for both human and machine) to convert multiple columns to a different data type?
I tried to define the schema manually, then load the data from a parquet file using this schema and save it to another file but I get "Job aborted."..."Task failed while writing rows" every time and on every DF. Somewhat easy for me, laborious for Spark ... and it does not work.
Another option is using:
df = df.withColumn("new_col", df("old_col").cast(type)).drop("old_col").withColumnRenamed("new_col", "old_col")
A bit more work for me as there are close to 100 columns and, if Spark has to duplicate each column in memory, then that doesn't sound optimal either. Is there an easier way?
Depending on how complicated the casting rules are, you can accomplish what you are asking with this loop:
scala> var df = Seq((1,2),(3,4)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> df.show
+---+---+
| a| b|
+---+---+
| 1| 2|
| 3| 4|
+---+---+
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> df.columns.foreach{c => df = df.withColumn(c, df(c).cast(DoubleType))}
scala> df.show
+---+---+
| a| b|
+---+---+
|1.0|2.0|
|3.0|4.0|
+---+---+
This should be as efficient as any other column operation.
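The same cast can also be written without a var, as a single select over all columns. This is a sketch rather than the answer's method: it assumes every column should get the same target type and that the column names are simple (no dots or other special characters), and the names CastAllColumnsSketch and castAll are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, DoubleType}

object CastAllColumnsSketch {
  // Hypothetical helper: casts every column to `to`, keeping the original names.
  def castAll(df: DataFrame, to: DataType): DataFrame =
    df.select(df.columns.map(c => col(c).cast(to).as(c)): _*)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimenting; in spark-shell `spark` already exists.
    val spark = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 2), (3, 4)).toDF("a", "b")
    val casted = castAll(df, DoubleType)
    casted.printSchema()
    casted.show()

    spark.stop()
  }
}
```

To cast only some columns, or to use different target types, the function passed to map could look up the type per column name and leave the other columns as col(c).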