DataFrame column names conflict with . (dot) - scala

I have a DataFrame df which has this schema:
root
|-- person.name: string (nullable = true)
|-- person: struct (nullable = true)
| |-- age: long (nullable = true)
| |-- name: string (nullable = true)
When I do df.select("person.name") I obviously fetch the sub-field name from person. How could I select the column person.name?

For a column name that contains a . (dot), you can use the backtick character ` to enclose the column name:
df.select("`person.name`")
This selects the top-level string column person.name: string (nullable = true)
And
df.select("person.name")
This selects the name field nested inside the person struct:
|-- person: struct (nullable = true)
| |-- name: string (nullable = true)
If you have a column name in a variable, you can simply wrap it in backticks:
"`" + columnName + "`"
I hope this was helpful!

To access the column whose name contains a period from SQL, the same backtick quoting applies:
spark.sql("select `person.name` from person_table")
Note: person_table is a temporary view registered from df (e.g. via registerTempTable or createOrReplaceTempView).
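A minimal end-to-end sketch (assuming the df from the question; createOrReplaceTempView is the newer replacement for registerTempTable):
df.createOrReplaceTempView("person_table")

# backticks select the top-level column literally named "person.name"
spark.sql("select `person.name` from person_table").show()

# without backticks the reference resolves to the name field of the person struct,
# just as in the DataFrame API
spark.sql("select person.name from person_table").show()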

My answer provides a working code snippet that illustrates the problem of having dots in column names and explains how you can easily remove dots from column names.
Let's create a DataFrame with some sample data:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("person.name", StringType(), True),
    StructField("person", StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True)]))
])
data = [
    ("charles", Row("chuck", 42)),
    ("larry", Row("chipper", 48))
]
df = spark.createDataFrame(data, schema)
df.show()
+-----------+-------------+
|person.name|       person|
+-----------+-------------+
|    charles|  [chuck, 42]|
|      larry|[chipper, 48]|
+-----------+-------------+
Let's illustrate that selecting person.name returns different results depending on whether backticks are used:
cols = ["person.name", "person", "person.name", "`person.name`"]
df.select(cols).show()
+-------+-------------+-------+-----------+
|   name|       person|   name|person.name|
+-------+-------------+-------+-----------+
|  chuck|  [chuck, 42]|  chuck|    charles|
|chipper|[chipper, 48]|chipper|      larry|
+-------+-------------+-------+-----------+
You definitely don't want to write or maintain code that changes results based on the presence of backticks. It's always better to replace all the dots with underscores when starting the analysis.
clean_df = df.toDF(*(c.replace('.', '_') for c in df.columns))
clean_df.select("person_name", "person.name", "person.age").show()
+-----------+-------+---+
|person_name|   name|age|
+-----------+-------+---+
|    charles|  chuck| 42|
|      larry|chipper| 48|
+-----------+-------+---+
This post explains how and why to avoid dots in PySpark column names in more detail.

Related

How to apply SHA2 to a particular column inside an array of structs in Hive or Spark SQL, dynamically?

I have data in Hive:
id name kyc
1001 smith [pnno:999,ssn:12345,email:ss#mail.com]
When we select these columns, the output is:
1001, smith, [999,12345,ss#mail.com]
I have to apply SHA2 inside this array column, and the output should be:
1001,smith,[999,*****(sha2 masked value), ss#gmail.com]
The output should keep the same array-of-struct format.
I am currently creating a separate view and joining it back. Is there any way to handle this dynamically in a Hive query, or inside Spark/Scala using DataFrames?
Also, can this be done with any Spark/Scala configuration?
Thank you
You can use transform to hash the ssn field in the array of structs with SHA-2:
// sample dataframe
df.show(false)
+----+-----+---------------------------+
|id |name |kyc |
+----+-----+---------------------------+
|1001|smith|[[999, 12345, ss#mail.com]]|
+----+-----+---------------------------+
// sample schema
df.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = false)
// |-- kyc: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- pnno: integer (nullable = false)
// | | |-- ssn: integer (nullable = false)
// | | |-- email: string (nullable = false)
val df2 = df.withColumn(
  "kyc",
  expr("""
    transform(kyc,
      x -> struct(x.pnno pnno, sha2(string(x.ssn), 512) ssn, x.email email)
    )
  """)
)
df2.show(false)
+----+-----+------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |name |kyc |
+----+-----+------------------------------------------------------------------------------------------------------------------------------------------------------+
|1001|smith|[[999, 3627909a29c31381a071ec27f7c9ca97726182aed29a7ddd2e54353322cfb30abb9e3a6df2ac2c20fe23436311d678564d0c8d305930575f60e2d3d048184d79, ss#mail.com]]|
+----+-----+------------------------------------------------------------------------------------------------------------------------------------------------------+
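For reference, the same transform expression can be used from PySpark via expr (a sketch, assuming a DataFrame df with the schema above and Spark 2.4+):
from pyspark.sql.functions import expr

df2 = df.withColumn(
    "kyc",
    expr("transform(kyc, x -> struct(x.pnno as pnno, sha2(string(x.ssn), 512) as ssn, x.email as email))")
)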

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions

I got this error while using the code below to drop a nested column with PySpark. Why is this not working? I tried using a tilde (~) instead of != as the error suggests, but that doesn't work either. So what do you do in that case?
def drop_col(df, struct_nm, delete_struct_child_col_nm):
    fields_to_keep = filter(lambda x: x != delete_struct_child_col_nm,
                            df.select("{}.*".format(struct_nm)).columns)
    fields_to_keep = list(map(lambda x: "{}.{}".format(struct_nm, x), fields_to_keep))
    return df.withColumn(struct_nm, struct(fields_to_keep))
I built a simple example with a struct column and a few dummy columns:
from pyspark import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, lit, col, struct
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.getOrCreate()
sql_context = SQLContext(spark.sparkContext)
schema = StructType([
    StructField('addresses',
                StructType([
                    StructField("state", StringType(), True),
                    StructField("street", StringType(), True),
                    StructField("country", StringType(), True),
                    StructField("code", IntegerType(), True)
                ]))
])
rdd = [({'state': 'pa', 'street': 'market', 'country': 'USA', 'code': 100},),
       ({'state': 'ca', 'street': 'baker', 'country': 'USA', 'code': 101},)]
df = sql_context.createDataFrame(rdd, schema)
df = df.withColumn('id', monotonically_increasing_id())
df = df.withColumn('name', lit('test'))
df.show()
df.printSchema()
Output:
+--------------------+-----------+----+
|           addresses|         id|name|
+--------------------+-----------+----+
|[pa, market, USA,...| 8589934592|test|
|[ca, baker, USA, ...|25769803776|test|
+--------------------+-----------+----+
root
|-- addresses: struct (nullable = true)
| |-- state: string (nullable = true)
| |-- street: string (nullable = true)
| |-- country: string (nullable = true)
| |-- code: integer (nullable = true)
|-- id: long (nullable = false)
|-- name: string (nullable = false)
To drop the whole struct column, you can simply use the drop function:
df2 = df.drop('addresses')
df2.show()
Output:
+-----------+----+
|         id|name|
+-----------+----+
| 8589934592|test|
|25769803776|test|
+-----------+----+
To drop specific fields in a struct column, it's a bit more complicated; there are some similar questions here:
Dropping a nested column from Spark DataFrame
Dropping nested column of Dataframe with PySpark
In any case, I found them to be a bit complicated - my approach would just be to reassign the original column with the subset of struct fields you want to keep:
columns_to_keep = ['country', 'code']
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:
+----------+-----------+----+
| addresses|         id|name|
+----------+-----------+----+
|[USA, 100]| 8589934592|test|
|[USA, 101]|25769803776|test|
+----------+-----------+----+
Alternatively, if you just wanted to specify the columns you want to remove rather than the columns you want to keep:
columns_to_remove = ['country', 'code']
all_columns = df.select("addresses.*").columns
columns_to_keep = [column for column in all_columns if column not in columns_to_remove]  # preserves the original field order
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:
+------------+-----------+----+
|   addresses|         id|name|
+------------+-----------+----+
|[pa, market]| 8589934592|test|
| [ca, baker]|25769803776|test|
+------------+-----------+----+
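If you want this as a reusable function in the spirit of the drop_col helper from the question, a minimal sketch could look like this (drop_struct_field is just an illustrative name):
from pyspark.sql.functions import col, struct

def drop_struct_field(df, struct_nm, field_to_drop):
    # list the struct's sub-fields, minus the one to remove
    fields_to_keep = [f for f in df.select(f"{struct_nm}.*").columns if f != field_to_drop]
    # rebuild the struct column from the remaining sub-fields
    return df.withColumn(
        struct_nm,
        struct(*[col(f"{struct_nm}.{f}").alias(f) for f in fields_to_keep])
    )

df = drop_struct_field(df, 'addresses', 'street')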
Hope this helps!

Join two Spark DataFrames using a nested column and update one of the columns

I am working on a requirement in which I get one small table in the form of a CSV file, as follows:
root
|-- ACCT_NO: string (nullable = true)
|-- SUBID: integer (nullable = true)
|-- MCODE: string (nullable = true)
|-- NewClosedDate: timestamp (nullable = true)
We also have a very big external Hive table in Avro format stored in HDFS, as follows:
root
|-- accountlinks: array (nullable = true)
| |-- account: struct (nullable = true)
| | |-- acctno: string (nullable = true)
| | |-- subid: string (nullable = true)
| | |-- mcode: string (nullable = true)
| | |-- openeddate: string (nullable = true)
| | |-- closeddate: string (nullable = true)
Now, the requirement is to look up the external Hive table based on the three columns from the CSV file: ACCT_NO, SUBID, MCODE. If there is a match, update accountlinks.account.closeddate with NewClosedDate from the CSV file.
I have already written the following code to explode the required columns and join them with the small table, but I am not really sure how to update the closeddate field (currently null for all account holders) with NewClosedDate, because closeddate is a nested column and I cannot easily use withColumn to populate it. In addition, the schema and column names cannot be changed, as these files are linked to an external Hive table.
val df = spark.sql("select * from db.table where archive='201711'")
val ExtractedColumn = df
.coalesce(150)
.withColumn("ACCT_NO", explode($"accountlinks.account.acctno"))
.withColumn("SUBID", explode($"accountlinks.account.acctsubid"))
.withColumn("MCODE", explode($"C.mcode"))
val ReferenceData = spark.read.format("csv")
.option("header","true")
.option("inferSchema","true")
.load("file.csv")
val FinalData = ExtractedColumn.join(ReferenceData, Seq("ACCT_NO","SUBID","MCODE") , "left")
All you need is to explode the accountlinks array and then join the two DataFrames, like this:
val explodedDF = df.withColumn("account", explode($"accountlinks"))
val joinCondition = $"ACCT_NO" === $"account.acctno" && $"SUBID" === $"account.subid" && $"MCODE" === $"account.mcode"
val joinDF = explodedDF.join(ReferenceData, joinCondition, "left")
Now you can rebuild the account struct column as shown below, and use collect_list to get back the array structure:
val FinalData = joinDF
  .withColumn("account",
    struct($"account.acctno", $"account.subid", $"account.mcode",
      $"account.openeddate", $"NewClosedDate".alias("closeddate")
    )
  )
  .groupBy().agg(collect_list($"account").alias("accountlinks"))
The idea is to create a new struct with all the fields from account, except that closeddate now comes from the NewClosedDate column.
If the struct contains many fields, you can build the field list programmatically (for example, with a for-comprehension over joinDF.select("account.*").columns) and keep every field except closeddate, to avoid typing them all out.

PySpark DataFrame: change a string column to an array before using explode

I have a column called event_data in JSON format in my Spark DataFrame; after parsing it with from_json, I get this schema:
root
|-- user_id: string (nullable = true)
|-- event_data: struct (nullable = true)
| |-- af_content_id: string (nullable = true)
| |-- af_currency: string (nullable = true)
| |-- af_order_id: long (nullable = true)
I only need af_content_id from this column. This attribute can be of different formats:
a String
an Integer
a List of Int and Str, e.g. ['ghhjj23','123546',12356]
None (sometimes event_data doesn't contain af_content_id)
I want to use the explode function to return a new row for each element in af_content_id when it is a list. But when I apply it, I get an error:
from pyspark.sql.functions import explode

def get_content_id(column):
    return column.af_content_id

df_transf_1 = df_transf_1.withColumn(
    "products_basket",
    get_content_id(df_transf_1.event_data)
)
df_transf_1 = df_transf_1.withColumn(
    "product_id",
    explode(df_transf_1.products_basket)
)
cannot resolve 'explode(products_basket)' due to data type mismatch: input to function explode should be array or map type, not StringType;
I know the reason: it's because of the different types that the field af_content_id may contain, but I don't know how to resolve it. Using pyspark.sql.functions.array() directly on the column doesn't work, because it becomes an array of arrays and explode will not produce the expected result.
Here is sample code to reproduce the step I'm stuck on:
import pandas as pd

arr = [
    ['b5ad805c-f295-4852-82fc-961a88',12732936],
    ['0FD6955D-484C-4FC8-8C3F-DA7D28',['Gklb38','123655']],
    ['0E3D17EA-BEEF-4931-8104','12909841'],
    ['CC2877D0-A15C-4C0A-AD65-762A35C1',[12645715, 12909837, 12909837]]
]
df = pd.DataFrame(arr, columns=['user_id', 'products_basket'])
df = df[['user_id', 'products_basket']].astype(str)
df_transf_1 = spark.createDataFrame(df)
I'm looking for a way to convert products_basket to a single format, an array, so that when I apply explode, each row will contain one id.
If you are starting with a DataFrame like:
df_transf_1.show(truncate=False)
#+--------------------------------+------------------------------+
#|user_id |products_basket |
#+--------------------------------+------------------------------+
#|b5ad805c-f295-4852-82fc-961a88 |12732936 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |['Gklb38', '123655'] |
#|0E3D17EA-BEEF-4931-8104 |12909841 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|
#+--------------------------------+------------------------------+
where the products_basket column is a StringType:
df.printSchema()
#root
# |-- user_id: string (nullable = true)
# |-- products_basket: string (nullable = true)
You can't call explode on products_basket because it's not an array or map.
One workaround is to remove any leading/trailing square brackets and then split the string on ", " (comma followed by a space). This will convert the string into an array of strings.
from pyspark.sql.functions import col, regexp_replace, split
df_transf_new= df_transf_1.withColumn(
"products_basket",
split(regexp_replace(col("products_basket"), r"(^\[)|(\]$)|(')", ""), ", ")
)
df_transf_new.show(truncate=False)
#+--------------------------------+------------------------------+
#|user_id |products_basket |
#+--------------------------------+------------------------------+
#|b5ad805c-f295-4852-82fc-961a88 |[12732936] |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |
#|0E3D17EA-BEEF-4931-8104 |[12909841] |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|
#+--------------------------------+------------------------------+
The regular expression pattern matches any of the following:
(^\[): An opening square bracket at the start of the string
(\]$): A closing square bracket at the end of the string
('): Any single quote (because your strings are quoted)
and replaces these with an empty string.
This assumes that your data does not contain single quotes or square brackets that you need to keep inside products_basket.
After the split, the schema of the new DataFrame is:
df_transf_new.printSchema()
#root
# |-- user_id: string (nullable = true)
# |-- products_basket: array (nullable = true)
# | |-- element: string (containsNull = true)
Now you can call explode:
from pyspark.sql.functions import explode
df_transf_new.withColumn("product_id", explode("products_basket")).show(truncate=False)
#+--------------------------------+------------------------------+----------+
#|user_id |products_basket |product_id|
#+--------------------------------+------------------------------+----------+
#|b5ad805c-f295-4852-82fc-961a88 |[12732936] |12732936 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |Gklb38 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |123655 |
#|0E3D17EA-BEEF-4931-8104 |[12909841] |12909841 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12645715 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12909837 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12909837 |
#+--------------------------------+------------------------------+----------+

Is it possible to create a StructField of tuple type using PySpark?

I need to create a schema for a DataFrame in Spark. I have no problem creating StructFields with regular types such as StringType and IntegerType. However, I want to create a StructField for a tuple.
I have tried the following:
StructType([
StructField("dst_ip", StringType()),
StructField("port", StringType())
])
However, it throws an error
"list object has no attribute 'name'"
Is it possible to create a StructField for a tuple type?
You can define a StructType inside of a StructField:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField(
        "myTuple",
        StructType([
            StructField("dst_ip", StringType()),
            StructField("port", StringType())
        ])
    )
])
df = sqlCtx.createDataFrame([], schema)
df.printSchema()
#root
# |-- myTuple: struct (nullable = true)
# | |-- dst_ip: string (nullable = true)
# | |-- port: string (nullable = true)
The class StructType, used to define the structure of a DataFrame, is the data type representing a Row, and it consists of a list of StructFields.
To define a tuple datatype for a column (say columnA), you wrap the StructFields of the tuple's elements in a StructType and place that StructType inside a StructField. Note that StructFields need to have names, since they represent columns.
Define tuple StructField as a new StructType:
columnA = StructField('columnA', StructType([
    StructField("dst_ip", StringType()),
    StructField("port", StringType())
]))
Define schema containing columnA and columnB (of type FloatType):
mySchema = StructType([ columnA, StructField("columnB", FloatType())])
Apply schema to dataframe:
data = [{'columnA': ('x', 'y'), 'columnB': 1.0}]
# data = [Row(columnA=('x', 'y'), columnB=1.0)] (needs from pyspark.sql import Row)
df = spark.createDataFrame(data, mySchema)
df.printSchema()
# root
# |-- columnA: struct (nullable = true)
# | |-- dst_ip: string (nullable = true)
# | |-- port: string (nullable = true)
# |-- columnB: float (nullable = true)
Show dataframe:
df.show()
# +-------+-------+
# |columnA|columnB|
# +-------+-------+
# | [x, y]|    1.0|
# +-------+-------+
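If you later need the individual elements, you can pull the nested fields back out with dot notation (a small usage sketch on the df above):
df.select("columnA.dst_ip", "columnA.port").show()
# +------+----+
# |dst_ip|port|
# +------+----+
# |     x|   y|
# +------+----+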
(this is just the longer version of the other answer)