Select spark dataframe column with special character in it using selectExpr - pyspark

I am in a scenario where my columns name is Município with accent on the letter í.
My selectExpr command is failing because of it. Is there a way to fix it? Basically I have something like the following expression:
.selectExpr("...CAST (Município as string) as Município...")
What I really want is to be able to leave the column with the same name that it came, so in the future, I won't have this kind of problem on different tables/files.
How can I make spark dataframe accept accents or other special characters?

You can wrap your column name in backticks. For example, if you had the following schema:
df.printSchema()
#root
# |-- Município: long (nullable = true)
Write the column name with the special character wrapped in backticks:
df2 = df.selectExpr("CAST (`Município` as string) as `Município`")
df2.printSchema()
#root
# |-- Município: string (nullable = true)
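
If the goal is to keep the original names across many different tables and files, you can also build the expressions programmatically instead of typing each one. A minimal PySpark sketch (the sample data and column names here are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample with accented column names
df = spark.createDataFrame([(1, 10)], ["Município", "Região"])

# Backtick-escape every column inside selectExpr so accents, dots and spaces
# are tolerated, while the original names are kept on output.
exprs = ["CAST(`{0}` AS string) AS `{0}`".format(c) for c in df.columns]
df2 = df.selectExpr(*exprs)
df2.printSchema()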

Related

PySpark DataFrame When to use/ not to use Select

Based on the PySpark documentation:
A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SQLContext
Meaning I can use select to show the values of a column; however, I have sometimes seen these two equivalent expressions used instead:
# df is a sample DataFrame with column a
df.a
# or
df['a']
And sometimes when I use select I get an error where these work, and vice versa: sometimes I have to use select instead.
For example, this is a DataFrame for a problem of finding a dog in a given picture:
joined_df.printSchema()
root
|-- folder: string (nullable = true)
|-- filename: string (nullable = true)
|-- width: string (nullable = true)
|-- height: string (nullable = true)
|-- dog_list: array (nullable = true)
| |-- element: string (containsNull = true)
If I want to select the dog details and show 10 rows, this code raises an error:
print(joined_df.dog_list.show(truncate=False))
Traceback (most recent call last):
 File "<stdin>", line 2, in <module>
    print(joined_df.dog_list.show(truncate=False))
TypeError: 'Column' object is not callable
And this does not:
print(joined_df.select('dog_list').show(truncate=False))
Question 1: When do I have to use select, and when do I have to use df.a or df["a"]?
Question 2: What is the meaning of the error above: 'Column' object is not callable?
df.col_name returns a Column object, while df.select("col_name") returns another DataFrame.
See the documentation for details.
The key here is that those two methods return two different objects; that is why print(joined_df.dog_list.show(truncate=False)) gives you the error. The Column object does not have a .show method, but the DataFrame does.
So when you call a function that takes a Column as input, use df.col_name; if you want to operate at the DataFrame level, use df.select("col_name").
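
A minimal sketch of the distinction, with made-up rows mirroring the schema above: Column objects are inputs to DataFrame-level methods such as select, filter, and withColumn, while show belongs to the DataFrame those methods return.

from pyspark.sql import SparkSession
from pyspark.sql.functions import size

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows mirroring the schema in the question
joined_df = spark.createDataFrame(
    [("f1", "img1.jpg", "640", "480", ["beagle", "husky"])],
    ["folder", "filename", "width", "height", "dog_list"],
)

dog_col = joined_df.dog_list           # Column: has no .show(); used as input to other calls
dog_df = joined_df.select("dog_list")  # DataFrame: has .show()

dog_df.show(truncate=False)
joined_df.select(size(dog_col).alias("n_dogs")).show()  # Column passed to a function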

Multiple Spark DataFrame mutations in a single pipe

Consider a Spark DataFrame df with the following schema:
root
|-- date: timestamp (nullable = true)
|-- customerID: string (nullable = true)
|-- orderID: string (nullable = true)
|-- productID: string (nullable = true)
One column should be cast to a different type, and the other columns should just have their whitespace trimmed.
df.select(
    $"date",
    df("customerID").cast(IntegerType),
    $"orderID",
    $"productId")
  .withColumn("orderID", trim(col("orderID")))
  .withColumn("productID", trim(col("productID")))
The operations seem to require different syntax; casting is done via select, while trim is done via withColumn.
I'm used to R and dplyr where all the above would be handled in a single mutate function, so mixing select and withColumn feels a bit cumbersome.
Is there a cleaner way to do this in a single pipe?
You can use either one. The difference is that withColumn will add (or replace if the same name is used) a new column to the dataframe while select will only keep the columns you specified. Depending on the situation, choose one to use.
The cast can be done using withColumn as follows:
df.withColumn("customerID", $"customerID".cast(IntegerType))
.withColumn("orderID", trim($"orderID"))
.withColumn("productID", trim($"productID"))
Note that you do not need to use withColumn on the date column above.
The trim functions can be done in a select as follows; here the column names are kept the same:
df.select(
    $"date",
    $"customerID".cast(IntegerType),
    trim($"orderID").as("orderID"),
    trim($"productID").as("productID"))

PySpark DataFrames: filter where some value is in array column

I have a DataFrame in PySpark that has a nested array value for one of its fields. I would like to filter the DataFrame where the array contains a certain string. I'm not seeing how I can do that.
The schema looks like this:
root
|-- name: string (nullable = true)
|-- lastName: array (nullable = true)
| |-- element: string (containsNull = false)
I want to return all the rows where upper(name) == 'JOHN' and where the lastName column (the array) contains 'SMITH', and the equality there should be case insensitive (like I did for the name). I found the isin() function on a column value, but that seems to work backwards from what I want. It seems like I need a contains() function on a column value. Anyone have any ideas for a straightforward way to do this?
You could consider working on the underlying RDD directly.
def my_filter(row):
    if row.name.upper() == 'JOHN':
        for it in row.lastName:
            if it.upper() == 'SMITH':
                yield row
                break  # yield each matching row only once

dataframe = dataframe.rdd.flatMap(my_filter).toDF()
An update in 2019:
Spark 2.4.0 introduced higher-order SQL functions such as transform (array_contains has been available since earlier releases); see the official documentation.
Now it can be done in the SQL expression language. For your problem, it should be:
dataframe.filter('upper(name) = "JOHN" AND array_contains(transform(lastName, x -> upper(x)), "SMITH")')
It is better than the previous solution using RDD as a bridge, because DataFrame operations are much faster than RDD ones.
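
The same filter can also be written with the DataFrame function API instead of a SQL expression string. A sketch with made-up rows; note that passing a Python lambda to pyspark.sql.functions.transform requires Spark 3.1+, while the SQL-string form above only needs 2.4:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains, col, transform, upper

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows matching the schema in the question
df = spark.createDataFrame(
    [("John", ["Smith", "Doe"]), ("Jane", ["Jones"])],
    ["name", "lastName"],
)

# Upper-case both the name and every array element before comparing
result = df.filter(
    (upper(col("name")) == "JOHN")
    & array_contains(transform(col("lastName"), lambda x: upper(x)), "SMITH")
)
result.show(truncate=False)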

Matching two dataframes in scala

I have two RDDs in Scala and have converted them to DataFrames.
Now I have two DataFrames. The first, prodUniqueDF, has two columns named prodid and uid and holds the master data of the products:
scala> prodUniqueDF.printSchema
root
|-- prodid: string (nullable = true)
|-- uid: long (nullable = false)
The second, ratingsDF, has columns named prodid, custid, and ratings:
scala> ratingsDF.printSchema
root
|-- prodid: string (nullable = true)
|-- custid: string (nullable = true)
|-- ratings: integer (nullable = false)
I want to join the two and replace ratingsDF.prodid with prodUniqueDF.uid in ratingsDF.
To do this, I first registered them as temp tables:
prodUniqueDF.registerTempTable("prodUniqueDF")
ratingsDF.registerTempTable("ratingsDF")
And I run the code
val testSql = sql("SELECT prodUniqueDF.uid, ratingsDF.custid, ratingsDF.ratings FROM prodUniqueDF, ratingsDF WHERE prodUniqueDF.prodid = ratingsDF.prodid")
But the error comes back as:
org.apache.spark.sql.AnalysisException: Table not found: prodUniqueDF; line 1 pos 66
Please help! How can I achieve the join? Is there another method to map RDDs instead?
Joining the DataFrames can easily be achieved. The format is:
DataFrameA.join(DataFrameB)
By default it performs an inner join, but you can also specify the type of join you want; there are APIs for that.
You can look here for more information.
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.DataFrame
For replacing the values in an existing column you can use the withColumn method from the API.
It would be something like this:
val newDF = dfA.withColumn("newColumnName", dfB("columnName")).drop("columnName").withColumnRenamed("newColumnName", "columnName")
I think this might do the trick!
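
Putting both steps together, here is a sketch of the join-and-replace in PySpark with made-up rows (the Scala DataFrame API is analogous):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the two DataFrames in the question
prodUniqueDF = spark.createDataFrame([("p1", 100), ("p2", 200)], ["prodid", "uid"])
ratingsDF = spark.createDataFrame([("p1", "c1", 5), ("p2", "c2", 3)],
                                  ["prodid", "custid", "ratings"])

# Inner join on prodid, then keep uid in place of prodid
result = (ratingsDF.join(prodUniqueDF, on="prodid", how="inner")
                   .select("uid", "custid", "ratings"))
result.show()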

Spark 1.6 apply function to column with dot in name/ How to properly escape colName

To apply a function to a column in Spark, the common way (only way?) seems to be
df.withColumn(colName, myUdf(df.col(colName)))
fine, but I have columns with dots in the name, and to access the column, I need to escape the name with backticks (`).
The problem is: if I use that escaped name, withColumn creates a new column with the escaped name:
df.printSchema
root
|-- raw.hourOfDay: long (nullable = false)
|-- raw.minOfDay: long (nullable = false)
|-- raw.dayOfWeek: long (nullable = false)
|-- raw.sensor2: long (nullable = false)
df = df.withColumn("raw.hourOfDay", df.col("raw.hourOfDay"))
org.apache.spark.sql.AnalysisException: Cannot resolve column name "raw.hourOfDay" among (raw.hourOfDay, raw.minOfDay, raw.dayOfWeek, raw.sensor2);
this works:
df = df.withColumn("`raw.hourOfDay`", df.col("`raw.hourOfDay`"))
df: org.apache.spark.sql.DataFrame = [raw.hourOfDay: bigint, raw.minOfDay: bigint, raw.dayOfWeek: bigint, raw.sensor2: bigint, `raw.hourOfDay`: bigint]
scala> df.printSchema
root
|-- raw.hourOfDay: long (nullable = false)
|-- raw.minOfDay: long (nullable = false)
|-- raw.dayOfWeek: long (nullable = false)
|-- raw.sensor2: long (nullable = false)
|-- `raw.hourOfDay`: long (nullable = false)
But as you see, the schema now has a new column with the escaped name.
If I do the above and attempt to drop the old column with the escaped name, it will drop the old column, but after that any attempt to access the new column results in something like:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "`raw.sensor2`" among (`raw.hourOfDay`, `raw.minOfDay`, `raw.dayOfWeek`, `raw.sensor2`);
as if it now treats the backtick as part of the name and not as an escape character.
So how do I 'replace' my old column with withColumn without changing the name?
(PS: note that my column names are parametric, so I use a loop over the names. I used specific string names here for clarity; the escape sequence really looks like "`" + colName + "`".)
EDIT:
right now the only trick I found was to do:
for (t <- df.columns) {
  if (t.contains(".")) {
    df = df.withColumn("`" + t + "`", myUdf(df.col("`" + t + "`")))
    df = df.drop(df.col("`" + t + "`"))
    df = df.withColumnRenamed("`" + t + "`", t)
  } else {
    df = df.withColumn(t, myUdf(df.col(t)))
  }
}
Not really efficient, I guess...
EDIT:
The documentation states:
def withColumn(colName: String, col: Column): DataFrame
Returns a new DataFrame by adding a column
or replacing the existing column that has the same name.
So replacing a column should not be a problem.
However, as pointed out by @Glennie below, using a new name works fine, so this may be a bug in Spark 1.6.
Thanks for the trick.
df = df.withColumn("`" + t + "`", myUdf(df.col("`" + t + "`")))
df = df.drop(df.col("`" + t + "`"))
df = df.withColumnRenamed("`" + t + "`", t)
It is working fine for me. Looking forward to seeing a better solution. Just a reminder that we will have similar issues with the '#' character too.
I don't believe you can add a column with the same name as an existing column (and why would you?).
df = df.withColumn("raw.hourOfDay", df.col("`raw.hourOfDay`"))
will fail as you point out, but not because the name isn't properly escaped, but because the name is identical to an existing column.
df = df.withColumn("raw.hourOfDay_2", df.col("`raw.hourOfDay`"))
on the other hand will evaluate just fine :)
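
For what it's worth, here is a compact PySpark sketch of the same workaround in recent Spark versions, using a temporary dot-free name so there is no clash with the existing column (the column name and UDF are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with a dot in the column name
df = spark.createDataFrame([(7,)], ["raw.hourOfDay"])
my_udf = udf(lambda v: v + 1, LongType())  # stand-in for the UDF being applied

# Add under a temporary dot-free name, drop the original, rename back
df = (df.withColumn("raw_hourOfDay_tmp", my_udf(col("`raw.hourOfDay`")))
        .drop("raw.hourOfDay")
        .withColumnRenamed("raw_hourOfDay_tmp", "raw.hourOfDay"))
df.printSchema()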