String Functions for Nested Schema in Spark Scala

I am learning Spark in Scala programming language.
Input file ->
"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}
Schema ->
root
|-- Personal: struct (nullable = true)
| |-- ID: integer (nullable = true)
| |-- Name: array (nullable = true)
| | |-- element: string (containsNull = true)
Operation for output ->
I want to concatenate the strings of the "Name" element.
Eg - abcs|dakjdb
I am reading the file using the DataFrame API.
Please help me with this.

It should be pretty straightforward. If you are working with Spark >= 1.6.0 you can use get_json_object and concat_ws:
import org.apache.spark.sql.functions.{get_json_object, concat_ws}

val df = Seq(
  ("""{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}"""),
  ("""{"Personal":{"ID":3425,"Name":["cfg","woooww"]}}""")
).toDF("data")

df.select(
  concat_ws(
    "-",
    get_json_object($"data", "$.Personal.Name[0]"),
    get_json_object($"data", "$.Personal.Name[1]")
  ).as("FullName")
).show(false)
// +-----------+
// |FullName |
// +-----------+
// |abcs-dakjdb|
// |cfg-woooww |
// +-----------+
With get_json_object we go through the JSON data and extract the two elements of the Name array, which we then concatenate.

There is an inbuilt function concat_ws which should be useful here.
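Since the question reads the file with the DataFrame API (so Personal.Name is already an array<string> column), here is a minimal sketch of that idea. It assumes the JSON has been loaded with spark.read.json; the file name is a placeholder, and it relies on concat_ws accepting array-of-string columns as well as plain strings:
import org.apache.spark.sql.functions.concat_ws
import spark.implicits._

// "personal.json" is a hypothetical path standing in for the asker's input file
val parsed = spark.read.json("personal.json")

// concat_ws joins the elements of the array<string> column with the separator
parsed.select(concat_ws("|", $"Personal.Name").as("FullName")).show(false)
// expected: abcs|dakjdb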

To extend @Alexandros Biratsis' answer: you can first convert Name into an Array[String] before concatenating, to avoid writing out every name position. Querying by position would also fail when the value is null or when only one value exists instead of two.
import org.apache.spark.sql.functions.{get_json_object, concat_ws, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}

val arraySchema = ArrayType(StringType)

Seq(
  ("""{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}"""),
  ("""{"Personal":{"ID":3425,"Name":["cfg","woooww"]}}""")
)
  .toDF("data")
  .select(get_json_object($"data", "$.Personal.Name") as "name")
  .select(from_json($"name", arraySchema) as "name")
  .select(concat_ws("|", $"name"))
  .show(false)

Related

Spark Scala Dataframe: How can I apply a custom type to an existing dataframe?

I have a dataframe (dataDF) which contains data like :
firstColumn;secondColumn;thirdColumn
myText;123;2010-08-12 00:00:00
In my case, all of these columns are StringType.
On the other hand, I have another DataFrame (customTypeDF) which can be modified and contains, for some columns, a custom type like:
columnName;customType
secondColumn;IntegerType
thirdColumn;TimestampType
How can I apply dynamically the new types on my dataDF dataframe ?
You can map the column names using the customTypeDF collected as a Seq:
import org.apache.spark.sql.functions.col

// collect the (columnName, customType) rows as Seq[Seq[String]]
val colTypes = customTypeDF.rdd.map(x => x.toSeq.asInstanceOf[Seq[String]]).collect

val result = dataDF.select(
  dataDF.columns.map(c =>
    if (colTypes.map(_(0)).contains(c))
      // e.g. "IntegerType" becomes "integer", which cast() accepts as a type name
      col(c).cast(colTypes.filter(_(0) == c)(0)(1).toLowerCase.replace("type", "")).as(c)
    else col(c)
  ): _*
)
result.show
+-----------+------------+-------------------+
|firstColumn|secondColumn| thirdColumn|
+-----------+------------+-------------------+
| myText| 123|2010-08-12 00:00:00|
+-----------+------------+-------------------+
result.printSchema
root
|-- firstColumn: string (nullable = true)
|-- secondColumn: integer (nullable = true)
|-- thirdColumn: timestamp (nullable = true)
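A slightly more compact variant is also possible; this is just a sketch, assuming customTypeDF has exactly the two string columns (columnName, customType) shown above. It collects the mapping into a Map and folds over withColumn/cast:
import org.apache.spark.sql.functions.col

// assumes customTypeDF has string columns (columnName, customType)
// columnName -> simple type name, e.g. "secondColumn" -> "integer"
val typeMap = customTypeDF.collect()
  .map(r => r.getString(0) -> r.getString(1).trim.toLowerCase.replace("type", ""))
  .toMap

// cast each mapped column in turn; unmapped columns are left untouched
val result2 = typeMap.foldLeft(dataDF) { case (acc, (name, dt)) =>
  acc.withColumn(name, col(name).cast(dt))
}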

Convert array<map<string,string>> type to <string,string> in scala

I am facing a problem in converting a column in my dataframe to string format. The schema of the relevant columns is as follows:
|-- example_code_b: string (nullable = true)
|-- example_code: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
I want to convert example_code to (string,string) format from the current array(map(string,string)).
The input is in the form of [Map(entity -> PER), Map(entity -> PER)] and
I want the output to be in the form of PER,PER
You can either use a UDF in the DataFrame API or use the Dataset API to do it:
import spark.implicits._

df
  .as[Seq[Map[String, String]]]
  .map(s => s.reduce(_ ++ _))
  .toDF("example_code")
  .show()
Note that this does not handle the case of duplicate keys: they are not "merged" but simply overwritten.
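If the goal is literally the PER,PER string from the question and you are on Spark 2.4+, a sketch without a UDF (assuming every map in the array contains an entity key) is to pull the values out with the transform higher-order function and join them with concat_ws:
import org.apache.spark.sql.functions.{concat_ws, expr}

// requires Spark 2.4+; assumes each map has an "entity" key
// transform(...) maps each map in the array to its "entity" value, giving array<string>
df.select(
  concat_ws(",", expr("transform(example_code, m -> m['entity'])")).as("example_code")
).show(false)
// [Map(entity -> PER), Map(entity -> PER)] becomes PER,PER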
You can simply use the explode function on any array column, which will create a separate row for each value of the array.
import org.apache.spark.sql.functions.{col, explode}
val newDF = df.withColumn("mymap", explode(col("example_code")))
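From there, a possible follow-up (a sketch, assuming each exploded map carries an entity key) is to read the value out of each map; note that, unlike the approaches above, this yields one row per map rather than a single joined string:
// getItem looks up a key in the exploded map column
newDF.select(col("mymap").getItem("entity").as("entity")).show()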

PySpark DataFrame change column of string to array before using explode

I have a column called event_data in JSON format in my Spark DataFrame; after parsing it with from_json, I get this schema:
root
|-- user_id: string (nullable = true)
|-- event_data: struct (nullable = true)
| |-- af_content_id: string (nullable = true)
| |-- af_currency: string (nullable = true)
| |-- af_order_id: long (nullable = true)
I only need af_content_id from this column. This attribute can be of different formats:
a String
an Integer
a List of Int and Str. eg ['ghhjj23','123546',12356]
None (sometimes event_data doesn't contain af_content_id)
I want to use the explode function to return a new row for each element in af_content_id when it is of the list format. But when I apply it, I get an error:
from pyspark.sql.functions import explode

def get_content_id(column):
    return column.af_content_id

df_transf_1 = df_transf_1.withColumn(
    "products_basket",
    get_content_id(df_transf_1.event_data)
)
df_transf_1 = df_transf_1.withColumn(
    "product_id",
    explode(df_transf_1.products_basket)
)
cannot resolve 'explode(products_basket)' due to data type mismatch: input to function explode should be array or map type, not StringType;
I know the reason: it's because of the different types that the field af_content_id may contain, but I don't know how to resolve it. Using pyspark.sql.functions.array() directly on the column doesn't work, because it becomes an array of arrays and explode will not produce the expected result.
A sample code to reproduce the step that I'm stuck on:
import pandas as pd

arr = [
    ['b5ad805c-f295-4852-82fc-961a88', 12732936],
    ['0FD6955D-484C-4FC8-8C3F-DA7D28', ['Gklb38', '123655']],
    ['0E3D17EA-BEEF-4931-8104', '12909841'],
    ['CC2877D0-A15C-4C0A-AD65-762A35C1', [12645715, 12909837, 12909837]]
]
df = pd.DataFrame(arr, columns=['user_id', 'products_basket'])
df = df[['user_id', 'products_basket']].astype(str)
df_transf_1 = spark.createDataFrame(df)
I'm looking for a way to convert products_basket to one single format, an array, so that when I apply explode it will contain one id per row.
If you are starting with a DataFrame like:
df_transf_1.show(truncate=False)
#+--------------------------------+------------------------------+
#|user_id |products_basket |
#+--------------------------------+------------------------------+
#|b5ad805c-f295-4852-82fc-961a88 |12732936 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |['Gklb38', '123655'] |
#|0E3D17EA-BEEF-4931-8104 |12909841 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|
#+--------------------------------+------------------------------+
where the products_basket column is a StringType:
df_transf_1.printSchema()
#root
# |-- user_id: string (nullable = true)
# |-- products_basket: string (nullable = true)
You can't call explode on products_basket because it's not an array or map.
One workaround is to remove any leading/trailing square brackets and then split the string on ", " (comma followed by a space). This will convert the string into an array of strings.
from pyspark.sql.functions import col, regexp_replace, split

df_transf_new = df_transf_1.withColumn(
    "products_basket",
    split(regexp_replace(col("products_basket"), r"(^\[)|(\]$)|(')", ""), ", ")
)
df_transf_new.show(truncate=False)
#+--------------------------------+------------------------------+
#|user_id |products_basket |
#+--------------------------------+------------------------------+
#|b5ad805c-f295-4852-82fc-961a88 |[12732936] |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |
#|0E3D17EA-BEEF-4931-8104 |[12909841] |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|
#+--------------------------------+------------------------------+
The regular expression pattern matches any of the following:
(^\[): An opening square bracket at the start of the string
(\]$): A closing square bracket at the end of the string
('): Any single quote (because your strings are quoted)
and replaces these with an empty string.
This assumes that your data does not contain any single quotes or square brackets that need to be kept inside products_basket.
After the split, the schema of the new DataFrame is:
df_transf_new.printSchema()
#root
# |-- user_id: string (nullable = true)
# |-- products_basket: array (nullable = true)
# | |-- element: string (containsNull = true)
Now you can call explode:
from pyspark.sql.functions import explode
df_transf_new.withColumn("product_id", explode("products_basket")).show(truncate=False)
#+--------------------------------+------------------------------+----------+
#|user_id |products_basket |product_id|
#+--------------------------------+------------------------------+----------+
#|b5ad805c-f295-4852-82fc-961a88 |[12732936] |12732936 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |Gklb38 |
#|0FD6955D-484C-4FC8-8C3F-DA7D28 |[Gklb38, 123655] |123655 |
#|0E3D17EA-BEEF-4931-8104 |[12909841] |12909841 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12645715 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12909837 |
#|CC2877D0-A15C-4C0A-AD65-762A35C1|[12645715, 12909837, 12909837]|12909837 |
#+--------------------------------+------------------------------+----------+

Filter a DataFrame based on an element from an array column

I'm working with a dataframe with the following schema:
root
|-- c: long (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I'm trying to filter this dataframe based on an element ["value1", "key1"] in the array column data, i.e. if this element exists in data keep the row, otherwise drop it. I tried
df.filter(col("data").contains(["value1", "key1"]))
but it didn't work. I also tried to
put val f = Array("value1", "key1") and then df.filter(col("data").contains(f)), but that didn't work either.
Any help please?
A straightforward approach would be to use a udf function, as a udf lets you apply logic row by row on primitive datatypes (which is what your requirement suggests: check every key and value of each struct element in the array column data).
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

// udf to check for key1 in key and value1 in value of any struct in the array field
def containsUdf = udf((data: Seq[Row]) => data.exists(row =>
  row.getAs[String]("key") == "key1" && row.getAs[String]("value") == "value1"
))

// calling the udf function in the filter
val filteredDF = df.filter(containsUdf(col("data")))
so filteredDF should be your desired output.
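On Spark 2.4+ you can also avoid the udf entirely with the exists higher-order function; here is a sketch using the same key1/value1 literals:
import org.apache.spark.sql.functions.expr

// requires Spark 2.4+; keep rows where some struct in data has key = 'key1' and value = 'value1'
val filteredDF2 = df.filter(expr("exists(data, x -> x.key = 'key1' and x.value = 'value1')"))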

Change schema of existing dataframe

I want to change the schema of an existing dataframe, but while changing the schema I'm getting an error. Is it possible to change the existing schema of a dataframe?
val customSchema = StructType(
  Array(
    StructField("data_typ", StringType, nullable = false),
    StructField("data_typ", IntegerType, nullable = false),
    StructField("proc_date", IntegerType, nullable = false),
    StructField("cyc_dt", DateType, nullable = false)
  )
)
val readDF=
+------------+--------------------+-----------+--------------------+
|DatatypeCode| Description|monthColNam| timeStampColNam|
+------------+--------------------+-----------+--------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:...|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:...|
+------------+--------------------+-----------+--------------------+
val rows= readDF.rdd
val readDF1 = sparkSession.createDataFrame(rows,customSchema)
expected result
val newdf=
+------------+--------------------+-----------+--------------------+
|data_typ_cd | data_typ_desc|proc_dt | cyc_dt |
+------------+--------------------+-----------+--------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:...|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:...|
+------------+--------------------+-----------+--------------------+
Any help will be appreciated.
You can do something like this to change the datatype from one to another.
I have created a dataframe similar to yours, like below:
import sparkSession.sqlContext.implicits._
import org.apache.spark.sql.types._

var df = Seq(
  ("03099", "Volumetric/Expand...", "201867", "2018-05-31 18:25:00"),
  ("03307", "Elapsed Day Factor", "201867", "2018-05-31 18:25:00")
).toDF("DatatypeCode", "data_typ", "proc_date", "cyc_dt")

df.printSchema()
df.show()
This gives me the following output:
root
|-- DatatypeCode: string (nullable = true)
|-- data_typ: string (nullable = true)
|-- proc_date: string (nullable = true)
|-- cyc_dt: string (nullable = true)
+------------+--------------------+---------+-------------------+
|DatatypeCode| data_typ|proc_date| cyc_dt|
+------------+--------------------+---------+-------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:00|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:00|
+------------+--------------------+---------+-------------------+
If you look at the schema above, all the columns are of type String. Now, to change the column proc_date to Integer type and cyc_dt to Date type, I do the following:
df = df.withColumnRenamed("DatatypeCode", "data_type_code")
df = df.withColumn("proc_date_new", df("proc_date").cast(IntegerType)).drop("proc_date")
df = df.withColumn("cyc_dt_new", df("cyc_dt").cast(DateType)).drop("cyc_dt")
and when you check the schema of this dataframe
df.printSchema()
then it gives the following output with the new column names:
root
|-- data_type_code: string (nullable = true)
|-- data_typ: string (nullable = true)
|-- proc_date_new: integer (nullable = true)
|-- cyc_dt_new: date (nullable = true)
You cannot change the schema like this. The schema object passed to createDataFrame has to match the data, not the other way around:
To parse timestamp data use the corresponding functions, for example as in Better way to convert a string field into timestamp in Spark
To change other types use the cast method, for example as in how to change a Dataframe column from String type to Double type in pyspark
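A minimal sketch of that cast-and-rename approach applied to the readDF from the question, assuming monthColNam holds integer-like strings and timeStampColNam holds parseable timestamps:
import org.apache.spark.sql.functions.col

// assumes the string values can actually be cast to int / timestamp
val newdf = readDF.select(
  col("DatatypeCode").as("data_typ_cd"),
  col("Description").as("data_typ_desc"),
  col("monthColNam").cast("int").as("proc_dt"),
  col("timeStampColNam").cast("timestamp").as("cyc_dt")
)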