Spark: How to convert a String to multiple columns - scala

I have a dataframe that contains a field item, which is a string holding an array of items:
[{"item":"76CJMX4Y"},{"item":"7PWZVWCG"},{"item":"967NBPMS"},{"item":"72LC5SMF"},{"item":"8N6DW3VD"},{"item":"045QHTU4"},{"item":"0UL4MMSI"}]
root
|-- item: string (nullable = true)
I would like to get item as an array of strings. Can someone let me know if there is an easy way to do this with the built-in from_json?
root
|-- item: array (nullable = true)
So that I end up with just:
["76CJMX4Y", "7PWZVWCG", "967NBPMS", "72LC5SMF", "8N6DW3VD", "045QHTU4", "0UL4MMSI"]
Thanks

Use the Spark built-in function from_json to parse the string, then the higher-order function transform to extract item from the resulting array.
Example
// from_json parses the string into an array of structs; transform then extracts item from each element
import org.apache.spark.sql.functions._
df.selectExpr("""transform(from_json(item,'array<struct<item:string>>'),x->x.item) as item""").show(10,false)
//+----------------------------------------------------------------------+
//|item |
//+----------------------------------------------------------------------+
//|[76CJMX4Y, 7PWZVWCG, 967NBPMS, 72LC5SMF, 8N6DW3VD, 045QHTU4, 0UL4MMSI]|
//+----------------------------------------------------------------------+

You could also use split() on the string, then sort the resulting values with sort_array() (so that the values you're not interested in end up at the top or the bottom), and finally keep just the part you need with slice().
For your reference: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html (even though it's the Java version, it's a concise list of all the available functions).
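For what it's worth, a rough sketch of that idea against the sample row could look like the following. It relies on assumptions that hold only for this particular data: every item code starts with a digit (so, after splitting on the quote character, the codes sort ahead of the JSON punctuation and the literal "item" keys), there are exactly 7 items to slice out, and the original order is not preserved. The from_json/transform answer above is the more robust approach.
import org.apache.spark.sql.functions.expr

// split on the quote character, sort so the item codes come first, then keep the first 7
df.select(expr("""slice(sort_array(split(item, '"')), 1, 7)""").as("item")).show(false)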

Related

How to find whether df column name contains substring in Scala

My df has multiple columns. I want to check whether a column name contains a substring, the way % works in SQL.
I tried the following, but it doesn't seem to work. I don't want to have to give the full name just to find out whether the column exists.
If I can find such a column, I also want to rename it using .withColumnRenamed.
Something like:
if (df.columns.contains("%ABC%" or "%BCD%")) df.withColumnrename("%ABC%" or "%BCD%","ABC123") else println(0)
Maybe you can try this.
The filter selects the columns that need to be updated.
Your renaming logic goes inside the foldLeft(...)(...) call.
foldLeft is a very useful method in Scala; if you want to learn more about it, search for "scala foldLeft example".
Good luck.
df.schema.fieldNames
  .filter(name => name.toUpperCase.contains("ABC") || name.toUpperCase.contains("BCD")) // pick the columns that need renaming
  .foldLeft(df)((acc, name) =>
    acc.withColumnRenamed(name, ("abc_" + name).toLowerCase)) // put your renaming logic here
First, find a column that matches your criteria:
df.columns
.filter(c => c.contains("ABC") || c.contains("BCD"))
.take(1)
This will either return an empty Array[String] if no such column exists or an array with a single element if the column does exist. take(1) is there to make sure that you won't be renaming more than one column using the same new name.
Continuing the previous expression, renaming the column boils down to calling foldLeft, which iterates over the collection chaining its second argument to the "zero" (df in this case):
.foldLeft(df)((ds, c) => ds.withColumnRenamed(c, "ABC123"))
If the array is empty, nothing gets called and the result is the original df.
Here it is in action:
df.printSchema
// root
// |-- AB: integer (nullable = false)
// |-- ABCD: string (nullable = true)
df.columns
.filter(c => c.contains("ABC") || c.contains("BCD"))
.take(1)
.foldLeft(df)(_.withColumnRenamed(_, "ABC123"))
.printSchema
// root
// |-- AB: integer (nullable = false)
// |-- ABC123: string (nullable = true)

PySpark DataFrame: when to use / not to use select

Based on the PySpark documentation:
A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SQLContext
This means I can use select to show the value of a column; however, I have sometimes seen these two equivalent forms used instead:
# df is a sample DataFrame with column a
df.a
# or
df['a']
But sometimes when I use select I get an error where those forms work, and vice versa: sometimes I have to use select.
For example, this is the DataFrame for a "find the dog in a given picture" problem:
joined_df.printSchema()
root
|-- folder: string (nullable = true)
|-- filename: string (nullable = true)
|-- width: string (nullable = true)
|-- height: string (nullable = true)
|-- dog_list: array (nullable = true)
| |-- element: string (containsNull = true)
If I want to select the dog details and show 10 rows, this code shows an error:
print(joined_df.dog_list.show(truncate=False))
Traceback (most recent call last):
 File "<stdin>", line 2, in <module>
    print(joined_df.dog_list.show(truncate=False))
TypeError: 'Column' object is not callable
And this one does not:
print(joined_df.select('dog_list').show(truncate=False))
Question 1: When do I have to use select, and when do I have to use df.a or df["a"]?
Question 2: What does the error above mean: 'Column' object is not callable?
df.col_name returns a Column object, while df.select("col_name") returns another DataFrame (see the documentation).
The key is that those two expressions return two different kinds of objects, and that is why print(joined_df.dog_list.show(truncate=False)) gives you the error: a Column object does not have a .show method, but a DataFrame does.
So when you call a function that takes a Column as input, use df.col_name; when you want to operate at the DataFrame level, use df.select("col_name").

Multiple Spark DataFrame mutations in a single pipe

Consider a Spark DataFrame df with the following schema:
root
|-- date: timestamp (nullable = true)
|-- customerID: string (nullable = true)
|-- orderID: string (nullable = true)
|-- productID: string (nullable = true)
One column should be cast to a different type, while the other columns should just have their whitespace trimmed.
df.select(
    $"date",
    df("customerID").cast(IntegerType),
    $"orderID",
    $"productId")
  .withColumn("orderID", trim(col("orderID")))
  .withColumn("productID", trim(col("productID")))
The operations seem to require different syntax; casting is done via select, while trim is done via withColumn.
I'm used to R and dplyr where all the above would be handled in a single mutate function, so mixing select and withColumn feels a bit cumbersome.
Is there a cleaner way to do this in a single pipe?
You can use either one. The difference is that withColumn will add a new column to the DataFrame (or replace an existing column if the same name is used), while select will only keep the columns you specify. Depending on the situation, choose the one to use.
The cast can be done using withColumn as follows:
df.withColumn("customerID", $"customerID".cast(IntegerType))
.withColumn("orderID", trim($"orderID"))
.withColumn("productID", trim($"productID"))
Note that you do not need to use withColumn on the date column above.
The trim functions can also be done in a select as follows; here the column names are kept the same:
df.select(
    $"date",
    $"customerID".cast(IntegerType),
    trim($"orderID").as("orderID"),
    trim($"productID").as("productID"))

How to extract an array column from spark dataframe [duplicate]

This question already has answers here:
Access Array column in Spark
(2 answers)
Closed 5 years ago.
I have a Spark dataframe with the following type and schema:
>ab
ab: org.apache.spark.sql.DataFrame = [block_number: bigint, collect_list(to): array<string> ... 1 more field]
>ab.printSchema
root
|-- block_number: long (nullable = true)
|-- collect_list(to): array (nullable = true)
| |-- element: string (containsNull = true)
|-- collect_list(from): array (nullable = true)
| |-- element: string (containsNull = true)
I want to simply merge the arrays from these two columns. I have tried to find a simple solution for this online but have not had any luck. Basically my issue comes down to two problems.
First, I know that the solution probably involves the map function, but I have not been able to find any syntax that actually compiles, so for now please accept my best attempt:
ab.rdd.map(
  row => {
    val block = row.getLong(0)
    val array1 = row(1).getAs[Array<string>]
    val array2 = row(2).getAs[Array<string>]
  }
)
Basically, issue number 1 is very simple, and it has been recurring since the day I first started using map in Scala: I can't figure out how to extract an arbitrary field of an arbitrary type from a column. I know that for primitive types you have things like row.getLong(0) etc., but I don't understand how this should be done for things like array types.
I have seen somewhere that something like row.getAs[Array<string>](1) should work, but when I try it I get the error
error: identifier expected but ']' found.
val array1 = row.getAs[Array<string>](1)
As far as I can tell, this is exactly the syntax I have seen in other situations, but I can't tell why it's not working. I think I have also seen some other syntax that looks like row(1).getAs[Type], but I am not sure.
The second issue is: once I can extract these two arrays, what is the best way of merging them? Using the intersect function? Or is there a better approach to this whole process, for example using the brickhouse package?
Any help would be appreciated.
Best,
Paul
You don't need to switch to the RDD API; you can do it with DataFrame UDFs like this:
val mergeArrays = udf((arr1: Seq[String], arr2: Seq[String]) => arr1 ++ arr2)

df
  .withColumn("merged", mergeArrays($"collect_list(from)", $"collect_list(to)"))
  .show()
The above UDF just concatenates the arrays (using the ++ operator); you can also use union or intersect etc., depending on what you want to achieve.
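For instance, if you wanted the elements common to both columns rather than the concatenation, only the body of the UDF changes. A minimal sketch, where the output column name "common" is just illustrative:
import org.apache.spark.sql.functions.udf

// Same pattern as above, but keeping only the elements present in both arrays.
val intersectArrays = udf((arr1: Seq[String], arr2: Seq[String]) => arr1.intersect(arr2))

df.withColumn("common", intersectArrays($"collect_list(from)", $"collect_list(to)")).show()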
Using the RDD API, the solution would look like this:
df.rdd.map(
  row => {
    val block = row.getLong(0)
    val array1 = row.getAs[Seq[String]](1)
    val array2 = row.getAs[Seq[String]](2)
    (block, array1 ++ array2)
  }
).toDF("block", "merged") // back to DataFrames

PySpark DataFrames: filter where some value is in array column

I have a DataFrame in PySpark that has a nested array value for one of its fields. I would like to filter the DataFrame where the array contains a certain string. I'm not seeing how I can do that.
The schema looks like this:
root
|-- name: string (nullable = true)
|-- lastName: array (nullable = true)
| |-- element: string (containsNull = false)
I want to return all the rows where upper(name) == 'JOHN' and where the lastName column (the array) contains 'SMITH'; the comparison there should be case-insensitive, like the one for name. I found the isin() function on a column value, but that seems to work backwards from what I want. It seems like I need a contains() function on a column value. Does anyone have ideas for a straightforward way to do this?
You could consider working on the underlying RDD directly.
def my_filter(row):
    if row.name.upper() == 'JOHN':
        for it in row.lastName:
            if it.upper() == 'SMITH':
                yield row

dataframe = dataframe.rdd.flatMap(my_filter).toDF()
An update in 2019:
Spark 2.4.0 introduced higher-order functions such as transform, which can be combined with array_contains (see the official documentation), so this can now be done directly in a SQL expression.
For your problem, it would be:
dataframe.filter('upper(name) = "JOHN" AND array_contains(transform(lastName, x -> upper(x)), "SMITH")')
This is better than the previous solution that goes through the RDD API, because DataFrame operations are much faster than RDD ones.