Passing column information to SparkSession within withColumn - scala

I have a simple df with 2 columns, as shown below,
+------------+---+
|file_name |id |
+------------+---+
|file1.csv |1 |
|file2.csv |2 |
+------------+---+
root
|-- file_name: string (nullable = true)
|-- id: string (nullable = true)
I wish to add a 3rd column with the count() from each file specified in the file_name column.
These are large files, so I wish to go for a Spark-based approach to get the count() from each file.
Assuming originalDF is the above df,
I have tried:
dfWithCounts = originalDF.withColumn("counts", lit(spark.read.csv(lit(col('file_name'))).count))
but this throws an error:
Column is not iterable
Is there a way I can achieve this?
I'm using Spark 2.4.

You can't run a Spark job from within another Spark job. Assuming that the file list is not super huge you can collect originalDF to the driver and spawn individual jobs to count lines from there.
// id is a string in the schema above, hence getString(1); .toDF needs import spark.implicits._
val dfWithCounts = originalDF.collect.map { r =>
  (r.getString(0), r.getString(1), spark.read.csv(r.getString(0)).count)
}.toSeq.toDF("file_name", "id", "count")
Optionally you can use Scala parallel collections to run these jobs in parallel.
val dfWithCounts = originalDF.collect.par.map { r =>
  (r.getString(0), r.getString(1), spark.read.csv(r.getString(0)).count)
}.toSeq.seq.toDF("file_name", "id", "count")
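As an alternative sketch (not from the original answer, and assuming all listed files share the same CSV layout), you could read every file in a single Spark job, tag each row with its source file via input_file_name(), and count per file:
import org.apache.spark.sql.functions.input_file_name

// Note: input_file_name() returns the full path/URI, so the directory prefix may
// need to be stripped before joining back on the bare file_name values.
val paths = originalDF.select("file_name").collect().map(_.getString(0))
val perFileCounts = spark.read.csv(paths: _*)
  .groupBy(input_file_name().as("file_name"))
  .count()
val dfWithCounts2 = originalDF.join(perFileCounts, Seq("file_name"), "left")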

Related

String Functions for Nested Schema in Spark Scala

I am learning Spark with the Scala programming language.
Input file ->
"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}
Schema ->
root
|-- Personal: struct (nullable = true)
| |-- ID: integer (nullable = true)
| |-- Name: array (nullable = true)
| | |-- element: string (containsNull = true)
Operation for output ->
I want to concatenate the strings of the "Name" element,
e.g. abcs|dakjdb
I am reading the file using the DataFrame API.
Please help me with this.
It should be pretty straightforward. If you are working with Spark >= 1.6.0, you can use get_json_object and concat_ws:
import org.apache.spark.sql.functions.{get_json_object, concat_ws}
val df = Seq(
  ("""{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}"""),
  ("""{"Personal":{"ID":3425,"Name":["cfg","woooww"]}}""")
).toDF("data")

df.select(
  concat_ws(
    "-",
    get_json_object($"data", "$.Personal.Name[0]"),
    get_json_object($"data", "$.Personal.Name[1]")
  ).as("FullName")
).show(false)
// +-----------+
// |FullName |
// +-----------+
// |abcs-dakjdb|
// |cfg-woooww |
// +-----------+
With get_json_object we go through the JSON data and extract the two elements of the Name array, which we then concatenate.
There is a built-in function, concat_ws, which should be useful here.
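Since the question mentions reading the file with the DataFrame API, here is a minimal sketch (an assumption, not part of the original answers) that parses the input with spark.read.json, so the Personal struct from the schema above is already available, and applies concat_ws directly to the nested array; "people.json" is a placeholder path:
import org.apache.spark.sql.functions.concat_ws

val parsed = spark.read.json("people.json") // hypothetical path
parsed
  .select(concat_ws("|", $"Personal.Name").as("FullName"))
  .show(false)
// expected for the sample record: abcs|dakjdb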
To extend Alexandros Biratsis' answer: you can first convert Name into an Array[String] before concatenating, to avoid writing out every name position. Querying by position would also fail when a value is null or when only one value exists instead of two.
import org.apache.spark.sql.functions.{get_json_object, concat_ws, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}
val arraySchema = ArrayType(StringType)

val df = Seq(
  ("""{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}"""),
  ("""{"Personal":{"ID":3425,"Name":["cfg","woooww"]}}""")
).toDF("data")

df.select(get_json_object($"data", "$.Personal.Name") as "name")
  .select(from_json($"name", arraySchema) as "name")
  .select(concat_ws("|", $"name"))
  .show(false)

How to create a list of all values of a column in Structured Streaming?

I have a Spark Structured Streaming job which gets records from Kafka (10,000 as maxOffsetsPerTrigger). I get all those records using Spark's readStream method. This dataframe has a column named "key".
I need string(set(all values in the 'key' column)) so I can use this string in a query to Elasticsearch.
I have already tried df.select("key").collect().distinct() but it throws an exception:
collect() is not supported with structured streaming.
Thanks.
EDIT:
DATAFRAME:
+-------+-------------------+----------+
| key| ex|new column|
+-------+-------------------+----------+
| fruits| [mango, apple]| |
|animals| [cat, dog, horse]| |
| human|[ram, shyam, karun]| |
+-------+-------------------+----------+
SCHEMA:
root
|-- key: string (nullable = true)
|-- ex: array (nullable = true)
| |-- element: string (containsNull = true)
|-- new column: string (nullable = true)
STRING I NEED:
'["fruits", "animals", "human"]'
You cannot apply collect on a streaming DataFrame. streamingDf here refers to the DataFrame read from Kafka.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

val query = streamingDf
  .select(col("key").cast(StringType))
  .writeStream
  .format("console")
  .start()

query.awaitTermination()
This will print your data to the console. To write data to an external source, you have to provide an implementation of ForeachWriter, as sketched below. For reference, see the linked example, where data is streamed from Kafka, read by Spark, and eventually written to Cassandra.
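A minimal sketch of such a ForeachWriter (the connection handling and the Elasticsearch calls are placeholders, i.e. assumptions, not part of the original answer):
import org.apache.spark.sql.{ForeachWriter, Row}

val esWriter = new ForeachWriter[Row] {
  def open(partitionId: Long, epochId: Long): Boolean = {
    // open a connection per partition/epoch; return false to skip this partition
    true
  }
  def process(value: Row): Unit = {
    // send value.getString(0) (the "key") to the external system
  }
  def close(errorOrNull: Throwable): Unit = {
    // release the connection
  }
}

streamingDf.select(col("key")).writeStream.foreach(esWriter).start()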
Hope it helps.
For such use case, I'd recommend using foreachBatch operator:
foreachBatch(function: (Dataset[T], Long) ⇒ Unit): DataStreamWriter[T]
Sets the output of the streaming query to be processed using the provided function. This is supported only in micro-batch execution mode (that is, when the trigger is not continuous).
The provided function will be called in every micro-batch with (i) the output rows as a Dataset and (ii) the batch identifier.
The batchId can be used to deduplicate and transactionally write the output (that is, the provided Dataset) to external systems. The output Dataset is guaranteed to be exactly the same for the same batchId (assuming all operations in the query are deterministic).
Quoting the official documentation (with a few modifications): the foreachBatch operation allows you to apply arbitrary operations and custom writing logic on the output of each micro-batch of a streaming query.
And in the same official documentation you can find sample code that shows how you could handle your use case fairly easily.
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // distinct on the DataFrame itself, then collect the keys of this micro-batch
  val keys = batchDF.select("key").distinct().collect().map(_.getString(0))
  val keyString = keys.mkString("[\"", "\", \"", "\"]") // e.g. '["fruits", "animals", "human"]'
  // use keyString in the Elasticsearch query here
}.start()

Filtering a DataFrame on date columns comparison

I am trying to filter a DataFrame by comparing two date columns using Scala and Spark. Based on the filtered DataFrame, calculations run on top to compute new columns.
Simplified my data frame has the following schema:
|-- received_day: date (nullable = true)
|-- finished: int (nullable = true)
On top of that I create two new columns, t_start and t_end, that are used for filtering the DataFrame. They are 20 and 10 days before the original column received_day, respectively:
val dfWithDates = df
  .withColumn("t_end", date_sub(col("received_day"), 10))
  .withColumn("t_start", date_sub(col("received_day"), 20))
I now want to have a new calculated column that indicates, for each row of data, how many rows of the dataframe fall in the t_start to t_end period. I thought I could achieve this the following way:
val dfWithCount = dfWithDates
  .withColumn("cnt", lit(
    dfWithDates.filter(
      $"received_day".lt(col("t_end"))
        && $"received_day".gt(col("t_start"))).count()))
However, this count only returns 0, and I believe that the problem is in the argument that I am passing to lt and gt.
Following this question, Filtering a spark dataframe based on date, I realized that I need to pass a string value. If I try hard-coded values like lt(lit("2018-12-15")), the filtering works. So I tried casting my columns to StringType:
val dfWithDates = df
  .withColumn("t_end", date_sub(col("received_day"), 10).cast(DataTypes.StringType))
  .withColumn("t_start", date_sub(col("received_day"), 20).cast(DataTypes.StringType))
But the filter still returns an empty DataFrame.
I assume that I am not handling the data types correctly.
I am running on Scala 2.11.0 with Spark 2.0.2.
Yes, you are right. For $"received_day".lt(col("t_end")), each received_day value is compared with the current row's t_end value, not with the whole DataFrame. So each time you'll get zero as the count.
You can solve this by writing a simple UDF. Here is how you can solve the issue:
Creating sample input dataset:
import org.apache.spark.sql.{Row, SparkSession}
import java.sql.Date
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (Date.valueOf("2018-10-12"), 1),
  (Date.valueOf("2018-10-13"), 1),
  (Date.valueOf("2018-09-25"), 1),
  (Date.valueOf("2018-10-14"), 1)).toDF("received_day", "finished")

val dfWithDates = df
  .withColumn("t_start", date_sub(col("received_day"), 20))
  .withColumn("t_end", date_sub(col("received_day"), 10))

dfWithDates.show()
+------------+--------+----------+----------+
|received_day|finished| t_start| t_end|
+------------+--------+----------+----------+
| 2018-10-12| 1|2018-09-22|2018-10-02|
| 2018-10-13| 1|2018-09-23|2018-10-03|
| 2018-09-25| 1|2018-09-05|2018-09-15|
| 2018-10-14| 1|2018-09-24|2018-10-04|
+------------+--------+----------+----------+
Here, for 2018-09-25, the desired count is 3 (its received_day falls inside the t_start/t_end window of each of the three other rows).
Generate the output:
val count_udf = udf((received_day: Date) => {
  dfWithDates.filter(col("t_end").gt(s"$received_day") && col("t_start").lt(s"$received_day")).count()
})

val dfWithCount = dfWithDates.withColumn("count", count_udf(col("received_day")))
dfWithCount.show()
+------------+--------+----------+----------+-----+
|received_day|finished| t_start| t_end|count|
+------------+--------+----------+----------+-----+
| 2018-10-12| 1|2018-09-22|2018-10-02| 0|
| 2018-10-13| 1|2018-09-23|2018-10-03| 0|
| 2018-09-25| 1|2018-09-05|2018-09-15| 3|
| 2018-10-14| 1|2018-09-24|2018-10-04| 0|
+------------+--------+----------+----------+-----+
To make the computation faster, I would suggest caching dfWithDates, as the same operation is repeated for each row.
You can format a date value to a string with any pattern using DateTimeFormatter:
import java.time.format.DateTimeFormatter

// here `date` is assumed to be a java.time.LocalDate (e.g. obtained via sqlDate.toLocalDate)
date.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"))
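If you need the string form on DataFrame columns rather than on single values, Spark's built-in date_format function does the equivalent conversion; a small sketch, assuming the dfWithDates built above:
import org.apache.spark.sql.functions.date_format

val dfWithStringDates = dfWithDates
  .withColumn("t_start_str", date_format(col("t_start"), "yyyy-MM-dd"))
  .withColumn("t_end_str", date_format(col("t_end"), "yyyy-MM-dd"))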

How to join datasets with same columns and select one?

I have two Spark dataframes which I am joining and selecting afterwards. I want to select a specific column of one of the Dataframes. But the same column name exists in the other one. Therefore I am getting an Exception for ambiguous column.
I have tried this:
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id", "left").select($"d1.columnName")
and this:
d1.join(d2, d1("id") === d2("id"), "left").select($"d1.columnName")
but it does not work.
Which Spark version are you using? Can you put a sample of your dataframes?
Try this:
val d2prim = d2.withColumnRenamed("columnName", "d2_columnName")
d1.join(d2prim, Seq("id"), "left_outer").select("columnName")
I have two dataframes
val d1 = spark.range(3).withColumn("columnName", lit("d1"))
scala> d1.printSchema
root
|-- id: long (nullable = false)
|-- columnName: string (nullable = false)
val d2 = spark.range(3).withColumn("columnName", lit("d2"))
scala> d2.printSchema
root
|-- id: long (nullable = false)
|-- columnName: string (nullable = false)
which I am joining and selecting afterwards.
I want to select a specific column of one of the Dataframes. But the same column name exists in the other one.
val q1 = d1.as("d1")
  .join(d2.as("d2"), Seq("id"), "left")
  .select("d1.columnName")
scala> q1.show
+----------+
|columnName|
+----------+
| d1|
| d1|
| d1|
+----------+
As you can see it just works.
So, why did it not work for you? Let's analyze each.
// you started very well
d1.as("d1")
  // but here you used $ to reference a column to join on
  // with column references by their aliases
  // that won't work
  .join(d2.as("d2"), $"d1.id" === $"d2.id", "left")
  // same here
  // $ + aliased columns won't work
  .select($"d1.columnName")
PROTIP: Use d1("columnName") to reference a specific column in a dataframe.
The other query was very close to being fine, but...
d1.join(d2, d1("id") === d2("id"), "left") // <-- so far so good!
  .select($"d1.columnName") // <-- that's the issue, i.e. $ + aliased column
This happens because when Spark combines the columns of the two DataFrames it doesn't do any automatic renaming for you. You just need to rename one of the columns before joining; Spark provides a method for this. After the join you can drop the renamed column.
val d2join = d2.withColumnRenamed("id", "join_id")
val joined = d1.join(d2join, $"id" === $"join_id", "left").drop("join_id")

Spark Dataframe of WrappedArray to Dataframe[Vector]

I have a spark Dataframe df with the following schema:
root
|-- features: array (nullable = true)
| |-- element: double (containsNull = false)
I would like to create a new Dataframe where each row will be a Vector of Doubles and expecting to get the following schema:
root
|-- features: vector (nullable = true)
So far I have the following piece of code (influenced by this post: Converting Spark Dataframe(with WrappedArray) to RDD[labelPoint] in scala), but I fear something is wrong with it because it takes a very long time to compute even a reasonable number of rows.
Also, if there are too many rows, the application crashes with a heap space exception.
val clustSet = df.rdd.map(r => {
  val arr = r.getAs[mutable.WrappedArray[Double]]("features")
  val features: Vector = Vectors.dense(arr.toArray)
  features
}).map(Tuple1(_)).toDF()
I suspect that the instruction arr.toArray is not a good Spark practice in this case. Any clarification would be very helpful.
Thank you!
It's because .rdd has to deserialize objects from Spark's internal in-memory format, and that is very time consuming.
It's OK to use .toArray - you are operating at the row level, not collecting everything to the driver node.
You can do this very easily with a UDF:
import org.apache.spark.ml.linalg._
val convertUDF = udf((array: Seq[Double]) => {
  Vectors.dense(array.toArray)
})

val withVector = dataset
  .withColumn("features", convertUDF('features))
Code is from this answer: Convert ArrayType(FloatType,false) to VectorUTD
However, the author of that question didn't ask about the differences.
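As a side note not covered by the original answers (and only applicable if you can upgrade): Spark 3.1+ ships a built-in helper for exactly this conversion, so the UDF becomes unnecessary:
import org.apache.spark.ml.functions.array_to_vector

// array_to_vector converts an array<double> column into an ML Vector column
val withVectorBuiltin = dataset.withColumn("features", array_to_vector('features))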