Multiple Spark DataFrame mutations in a single pipe - scala

Consider a Spark DataFrame df with the following schema:
root
|-- date: timestamp (nullable = true)
|-- customerID: string (nullable = true)
|-- orderID: string (nullable = true)
|-- productID: string (nullable = true)
One column should be cast to a different type; the other string columns should just have their whitespace trimmed.
df.select(
$"date",
df("customerID").cast(IntegerType),
$"orderID",
$"productId")
.withColumn("orderID", trim(col("orderID")))
.withColumn("productID", trim(col("productID")))
The operations seem to require different syntax; casting is done via select, while trim is done via withColumn.
I'm used to R and dplyr where all the above would be handled in a single mutate function, so mixing select and withColumn feels a bit cumbersome.
Is there a cleaner way to do this in a single pipe?

You can use either one. The difference is that withColumn adds a column to the DataFrame (or replaces it if a column with that name already exists), while select keeps only the columns you specify. Choose whichever fits the situation.
The cast can be done using withColumn as follows:
df.withColumn("customerID", $"customerID".cast(IntegerType))
.withColumn("orderID", trim($"orderID"))
.withColumn("productID", trim($"productID"))
Note that you do not need to use withColumn on the date column above.
The trim calls can also be done in a select, as follows. Here as(...) keeps the column names the same:
df.select(
$"date",
$"customerID".cast(IntegerType),
trim($"orderID").as("orderID"),
trim($"productID").as("productID"))

Related

PySpark DataFrame When to use/ not to use Select

According to the PySpark documentation:
A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SQLContext
This means I can use select to show the values of a column. However, I sometimes see these two equivalent forms used instead:
# df is a sample DataFrame with column a
df.a
# or
df['a']
Sometimes select gives me an error where these forms work, and sometimes it is the other way around and I have to use select.
For example, this is a DataFrame from a problem about finding a dog in a given picture:
joined_df.printSchema()
root
|-- folder: string (nullable = true)
|-- filename: string (nullable = true)
|-- width: string (nullable = true)
|-- height: string (nullable = true)
|-- dog_list: array (nullable = true)
| |-- element: string (containsNull = true)
If I want to select the dog details and show 10 rows, this code shows an error:
print(joined_df.dog_list.show(truncate=False))
Traceback (most recent call last):
 File "<stdin>", line 2, in <module>
    print(joined_df.dog_list.show(truncate=False))
TypeError: 'Column' object is not callable
And this code does not:
print(joined_df.select('dog_list').show(truncate=False))
Question 1: When do I have to use select, and when do I have to use df.a or df["a"]?
Question 2: What does the error above, 'Column' object is not callable, mean?
df.col_name returns a Column object, but df.select("col_name") returns another DataFrame.
See the PySpark documentation for details.
The key point is that these two forms return two different kinds of object, which is why print(joined_df.dog_list.show(truncate=False)) gives you the error. A Column object has no .show method, but a DataFrame does. Accessing an unknown attribute such as .show on a Column is treated as a nested field lookup and returns yet another Column, so calling it fails with 'Column' object is not callable.
So when you call a function that takes a Column as input, use df.col_name. If you want to operate at the DataFrame level, use df.select("col_name").

Select spark dataframe column with special character in it using selectExpr

I have a column named Município, with an accent on the letter í.
My selectExpr command fails because of it. Is there a way to fix this? Basically, I have something like the following expression:
.selectExpr("...CAST (Município as string) as Município...")
What I really want is to keep the column under the same name it arrived with, so that I won't have this kind of problem with other tables/files in the future.
How can I make a Spark DataFrame accept accents or other special characters?
You can wrap your column name in backticks. For example, if you had the following schema:
df.printSchema()
#root
# |-- Município: long (nullable = true)
Write the column name that contains the special character wrapped in backticks:
df2 = df.selectExpr("CAST (`Município` as string) as `Município`")
df2.printSchema()
#root
# |-- Município: string (nullable = true)
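To avoid hand-writing backticks for every table or file, the quoting can be generated. The following is a minimal sketch in plain Python that only builds the expression strings; the helper names quote_identifier and cast_expr and the second column name are hypothetical, and the rule that a backtick inside a name is escaped by doubling it is an assumption about how Spark SQL parses quoted identifiers.

```python
def quote_identifier(name: str) -> str:
    """Wrap a column name in backticks for use in a Spark SQL expression.

    Assumption: a backtick inside the name is escaped by doubling it.
    """
    return "`" + name.replace("`", "``") + "`"


def cast_expr(name: str, sql_type: str) -> str:
    """Build a 'CAST(col AS type) AS col' expression that keeps the column name."""
    quoted = quote_identifier(name)
    return f"CAST({quoted} AS {sql_type}) AS {quoted}"


# Only 'Município' comes from the question; the second name is made up.
columns = ["Município", "código postal"]
exprs = [cast_expr(c, "string") for c in columns]
print(exprs)
```

The resulting list could then be passed on as df.selectExpr(*exprs), so that column names with accents or spaces need no special handling at the call site.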

UDF to Concatenate Arrays of Undefined Case Class Buried in a Row Object

I have a dataframe, called sessions, with columns that may change over time. (Edit to Clarify: I do not have a case class for the columns - only a reflected schema.) I will consistently have a uuid and clientId in the outer scope with some other inner and outer scope columns that might constitute a tracking event so ... something like:
root
|-- runtimestamp: long (nullable = true)
|-- clientId: long (nullable = true)
|-- uuid: string (nullable = true)
|-- oldTrackingEvents: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- timestamp: long (nullable = true)
| | |-- actionid: integer (nullable = true)
| | |-- actiontype: string (nullable = true)
| | |-- <tbd ... maps, arrays and other stuff matches sibling> section
...
|-- newTrackingEvents: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- timestamp: long (nullable = true)
| | |-- actionid: integer (nullable = true)
| | |-- actiontype: string (nullable = true)
| | |-- <tbd ... maps, arrays and other stuff matches sibling>
...
I'd like to now merge oldTrackingEvents and newTrackingEvents with a UDF containing these parameters and yet-to-be resolved code logic:
val mergeTEs = udf((oldTEs : Seq[Row], newTEs : Seq[Row]) =>
// do some stuff - figure best way
// - to merge both groups of tracking events
// - remove duplicates tracker events structures
// - limit total tracking events < 500
return result // same type as UDF input params
)
The UDF's return value would be an array of the same structure, holding the concatenation of the two fields.
QUESTION:
My question is how to construct such a UDF: (1) which parameter types to use for the inputs, (2) how to manipulate these collections within a UDF, and (3) how to return a value without a compiler error. I unsuccessfully tested Seq[Row] for the input and output (with val testUDF = udf((trackingEvents : Seq[Row]) => trackingEvents)) and received the error java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported when returning trackingEvents directly. However, I get no error when returning Some(1) instead of trackingEvents. What is the best way to manipulate the collections so that I can concatenate two lists of identical structures, as suggested by the schema above, with the UDF performing the steps listed in the comments? The goal is to use this operation:
sessions.select(mergeTEs('oldTrackingEvents, 'newTrackingEvents).as("cleanTrackingEvents"))
And in each row, ... get back a single array of 'trackingEvents' structure in a memory / speed efficient manner.
SUPPLEMENTAL:
Looking at a question shown to me ... There's a possible hint, if relevancy exists ... Defining a UDF that accepts an Array of objects in a Spark DataFrame? ... To create struct function passed to udf has to return Product type (Tuple* or case class), not Row.
Perhaps ... this other post is relevant / useful.
I think that the question you've linked explains it all, so just to reiterate. When working with udf:
The input representation for a StructType is a weakly typed Row object.
The output type for a StructType has to be a Scala Product (a tuple or case class). You cannot return a Row object.
If this is too much of a burden, you should use a strongly typed Dataset:
val f: T => U
sessions.as[T].map(f): Dataset[U]
where T is an algebraic data type representing Session schema, and U is algebraic data type representing the result.
Alternatively, if your goal is to merge sequences of some arbitrary row structure / schema with some manipulation, this is a more general approach that avoids the partitioning talk:
From the master dataframe, create a dataframe for each trackingEvents section, new and old. For each, select the columns of the exploded 'trackingEvents' section. Save these dataframes as vals named newTE and oldTE.
Create another dataframe whose columns uniquely identify each tracking event in the oldTrackingEvents and newTrackingEvents arrays, such as the uuid, the clientId and the event timestamp. Your pseudo-schema would be:
(uuid: String, clientId : Long, newTE : Seq[Long], oldTE : Seq[Long])
Use a UDF to join the two simple sequences, both Seq[Long], something like this untested example:
val limitEventsUDF = udf { (newTE: Seq[Long], oldTE: Seq[Long], limit: Int, tooOld: Long) => {
(newTE ++ oldTE).filter(_ > tooOld).sortWith(_ > _).distinct.take(limit)
}}
Applying the UDF yields a column of cleaned tracking events, so you now have a very slim dataframe, with the unwanted events removed, to join back to your exploded newTE and oldTE frames after they have been unioned with each other.
Group by as needed thereafter, using collect_list.
Still, this seems like a lot of work, so I'm not sure whether it should be voted "the answer".
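The merge step itself is easy to prototype outside Spark before wrapping it in a UDF. Below is a minimal sketch in plain Python that mirrors the limitEventsUDF logic above (concatenate, drop stale entries, de-duplicate, sort newest first, cap the size). The function name limit_events and the sample timestamps are made up for illustration.

```python
from typing import List


def limit_events(new_te: List[int], old_te: List[int], limit: int, too_old: int) -> List[int]:
    """Merge two timestamp lists, drop stale and duplicate entries, keep the newest `limit`."""
    # The set comprehension filters out stale timestamps and removes duplicates in one pass.
    recent = {ts for ts in list(new_te) + list(old_te) if ts > too_old}
    # Newest first, then cap the result at `limit` entries.
    return sorted(recent, reverse=True)[:limit]


# Made-up timestamps: 103 appears in both lists, 50 is older than the cutoff.
merged = limit_events([105, 103, 101], [103, 99, 50], limit=3, too_old=60)
print(merged)  # [105, 103, 101]
```

Because the function works only on the timestamps, it says nothing about how the full event structs are carried along; that part still needs the join described in the steps above.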

PySpark DataFrames: filter where some value is in array column

I have a DataFrame in PySpark that has a nested array value for one of its fields. I would like to filter the DataFrame where the array contains a certain string. I'm not seeing how I can do that.
The schema looks like this:
root
|-- name: string (nullable = true)
|-- lastName: array (nullable = true)
| |-- element: string (containsNull = false)
I want to return all the rows where upper(name) == 'JOHN' and where the lastName column (the array) contains 'SMITH', with that comparison also being case insensitive (like I did for the name). I found the isin() function on a column value, but that seems to work backwards from what I want. It seems like I need a contains() function on a column value. Does anyone have any ideas for a straightforward way to do this?
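You could consider working on the underlying RDD directly.
def my_filter(row):
    # Keep rows whose name is JOHN and whose lastName array contains SMITH (case-insensitive).
    if row.name.upper() == 'JOHN':
        for it in row.lastName:
            if it.upper() == 'SMITH':
                yield row
                return  # emit each matching row only once
dataframe = dataframe.rdd.flatMap(my_filter).toDF()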
An update in 2019
Spark 2.4.0 introduced higher-order functions such as transform, which can be combined with array_contains (see the official documentation).
The filter can now be written as a SQL expression.
For your problem, it should be
dataframe.filter("upper(name) = 'JOHN' AND array_contains(transform(lastName, x -> upper(x)), 'SMITH')")
This is better than the previous solution, which uses the RDD as a bridge, because DataFrame operations are generally much faster than RDD ones.
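To pin down what the filter is expected to keep, including rows with a null name or null array elements, the predicate can be written out in plain Python and checked on a few sample rows. This is only a sketch of the intended semantics; the function name matches and the sample data are made up.

```python
def matches(name, last_names, first="JOHN", last="SMITH"):
    """Return True when name equals `first` and any last name equals `last`, ignoring case."""
    # Null-safe: a missing name or a missing array never matches.
    if name is None or last_names is None:
        return False
    return name.upper() == first and any(
        ln is not None and ln.upper() == last for ln in last_names
    )


# Made-up rows shaped like (name, lastName).
rows = [
    ("John", ["Smith", "Doe"]),
    ("john", ["smithers"]),
    ("Jane", ["SMITH"]),
    (None, ["Smith"]),
]
kept = [r for r in rows if matches(*r)]
print(kept)  # [('John', ['Smith', 'Doe'])]
```

If the RDD route is preferred, a predicate like this could be passed to dataframe.rdd.filter(lambda r: matches(r["name"], r["lastName"])), though the SQL expression stays inside the DataFrame engine and is usually the better choice.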

Matching two dataframes in scala

I have two RDDs in Scala and converted them to DataFrames.
Now I have two DataFrames. The first, prodUniqueDF, has two columns named prodid and uid and holds the product master data:
scala> prodUniqueDF.printSchema
root
|-- prodid: string (nullable = true)
|-- uid: long (nullable = false)
The second, ratingsDF, has columns named prodid, custid and ratings:
scala> ratingsDF.printSchema
root
|-- prodid: string (nullable = true)
|-- custid: string (nullable = true)
|-- ratings: integer (nullable = false)
I want to join the two and replace ratingsDF.prodid with prodUniqueDF.uid in ratingsDF.
To do this, I first registered them as temp tables:
prodUniqueDF.registerTempTable("prodUniqueDF")
ratingsDF.registerTempTable("ratingsDF")
Then I ran this code:
val testSql = sql("SELECT prodUniqueDF.uid, ratingsDF.custid, ratingsDF.ratings FROM prodUniqueDF, ratingsDF WHERE prodUniqueDF.prodid = ratingsDF.prodid")
But I get this error:
org.apache.spark.sql.AnalysisException: Table not found: prodUniqueDF; line 1 pos 66
Please help! How can I achieve the join? Is there another method to map RDDs instead?
Joining the DataFrames is easy to do.
The format is
DataFrameA.join(DataFrameB)
When you supply a join condition, the default is an inner join, but you can also specify the type of join that you want, and the API has overloads for that.
You can look here for more information:
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.DataFrame
To replace the values in an existing column, you can use the withColumn method from the API or, since the join already brings uid in alongside prodid, simply drop the old column and rename the new one.
It would be something like this (untested), using the DataFrames from the question:
val newDF = ratingsDF.join(prodUniqueDF, "prodid").drop("prodid").withColumnRenamed("uid", "prodid")
I think this might do the trick!