Pyspark Edit Schema (json column) [duplicate]

This question already has answers here:
Pyspark: Parse a column of json strings
(7 answers)
Closed 6 months ago.
I have a DataFrame, and its schema looks like this:
root
|-- nro_ot: decimal(12,0) (nullable = true)
|-- json_bcg: string (nullable = true)
The column "json_bcg" is just a string and I need to edit the schema to explore the contents.
function explode() dont work.

The answers at Pyspark: Parse a column of json strings helped me:
from pyspark.sql.functions import from_json, col
json_schema = spark.read.json(df.rdd.map(lambda row: row.json)).schema
df.withColumn('json', from_json(col('json'), json_schema))
In my case I edited it a little bit:
import pyspark.sql.functions as f
df = spark.sql('Select nro_ot, json_bcg from sandbox_did_sio_phernandez.Batch_10')
# Infer the schema of the JSON strings by reading them back through spark.read.json
json_schema = spark.read.json(df.rdd.map(lambda row: row.json_bcg)).schema
# Replace the string column with a struct column parsed using that schema
df = df.withColumn('json_bcg', f.from_json(f.col('json_bcg'), json_schema))
display(df)
df.printSchema()
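For reference, a lighter-weight alternative (Spark 2.4+) is schema_of_json, which infers the schema from a single representative JSON string instead of re-reading the whole column with spark.read.json. A minimal sketch, assuming the first non-null row contains every field of interest (same table and column names as above):

import pyspark.sql.functions as f
df = spark.sql('Select nro_ot, json_bcg from sandbox_did_sio_phernandez.Batch_10')
# Take one representative JSON string; only the fields present in it end up in the schema
sample = df.filter(f.col('json_bcg').isNotNull()).first()['json_bcg']
# schema_of_json builds the schema from that literal sample; from_json parses the column with it
df = df.withColumn('json_bcg', f.from_json('json_bcg', f.schema_of_json(f.lit(sample))))
df.printSchema()

The trade-off: fields missing from the sampled row will not appear in the resulting struct, whereas the spark.read.json pass merges the schema across all rows.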

Related

Convert yyyyMM to end of month date using PySpark

I have a column in a PySpark DataFrame with dates in integer format, e.g. 202203 (yyyyMM). I want to convert that to the end-of-month date, e.g. 2022-03-31. How do I achieve this?
First cast the column to string, then use to_date to get the date, and then last_day.
Example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
data = [{"x": 202203}]
df = spark.createDataFrame(data=data)
df = df.withColumn("date", F.last_day(F.to_date(F.col("x").cast("string"), "yyyyMM")))
df.show(10)
df.printSchema()
Output:
+------+----------+
| x| date|
+------+----------+
|202203|2022-03-31|
+------+----------+
root
|-- x: long (nullable = true)
|-- date: date (nullable = true)

Unable to save decimal value in decimal type in PySpark dataframe

I'm trying to load a JSON into a DataFrame using PySpark. The JSON has a decimal value, and in the schema I have defined that field as DecimalType, but when creating the DataFrame Spark throws the exception TypeError: field pr: DecimalType(3,1) can not accept object 20.0 in type <class 'float'>.
from pyspark.sql.types import StructType, StructField, StringType, DecimalType
r = {'name': 'wellreading', 'pr': 20.0}
distData = sc.parallelize([r])
schema = StructType([StructField('name', StringType(), True), StructField('pr', DecimalType(3, 1), True)])
df = spark.createDataFrame(distData, schema)
df.collect()
Here I have given some sample code, but I am unable to understand how Spark determines that 20.0 is a float and cannot be stored in a decimal type.
One of the quick solutions (not sure if it is the best one) is to read your JSON file directly into a DataFrame and then perform whatever conversion you like, e.g.:
from pyspark.sql.types import DecimalType
from pyspark.sql.functions import col
df1 = spark.read.json("/tmp/test.json")
df2 = df1.select(col('name'),col('pr').cast(DecimalType(3,1)).alias('pr'))
df2.printSchema()
root
|-- name: string (nullable = true)
|-- pr: decimal(3,1) (nullable = true)
OR
df2 = df1.withColumn("pr",df1.pr.cast(DecimalType(3,1)))
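As for why the error occurs: when a schema is enforced at DataFrame creation, PySpark's type verifier only accepts decimal.Decimal objects for a DecimalType field, so a Python float like 20.0 is rejected. If you want to keep the predefined schema instead of casting afterwards, a minimal sketch (same field names and schema as in the question):

from decimal import Decimal
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
    StructField('name', StringType(), True),
    StructField('pr', DecimalType(3, 1), True)
])
# Pass a decimal.Decimal instead of a float so the value is accepted by DecimalType(3,1)
df = spark.createDataFrame([('wellreading', Decimal('20.0'))], schema)
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- pr: decimal(3,1) (nullable = true)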

String Functions for Nested Schema in Spark Scala

I am learning Spark with the Scala programming language.
Input file ->
{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}
Schema ->
root
|-- Personal: struct (nullable = true)
| |-- ID: integer (nullable = true)
| |-- Name: array (nullable = true)
| | |-- element: string (containsNull = true)
Operation for output ->
I want to concatenate the strings of the "Name" element.
E.g. abcs|dakjdb
I am reading the file using the DataFrame API.
Please help me with this.
It should be pretty straightforward. If you are working with Spark >= 1.6.0, you can use get_json_object and concat_ws:
import org.apache.spark.sql.functions.{get_json_object, concat_ws}
import spark.implicits._  // for toDF and the $-column syntax (already in scope in spark-shell)

val df = Seq(
  ("""{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}"""),
  ("""{"Personal":{"ID":3425,"Name":["cfg","woooww"]}}""")
).toDF("data")

df.select(
  concat_ws(
    "-",
    get_json_object($"data", "$.Personal.Name[0]"),
    get_json_object($"data", "$.Personal.Name[1]")
  ).as("FullName")
).show(false)
// +-----------+
// |FullName |
// +-----------+
// |abcs-dakjdb|
// |cfg-woooww |
// +-----------+
With get_json_object we go through the JSON data and extract the two elements of the Name array, which we then concatenate.
There is a built-in function concat_ws which should be useful here.
To extend @Alexandros Biratsis' answer: you can first convert Name into an Array[String] before concatenating, to avoid addressing every name position explicitly. Querying by position would also fail when the value is null or when only one value exists instead of two.
import org.apache.spark.sql.functions.{get_json_object, concat_ws, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}
import spark.implicits._

val arraySchema = ArrayType(StringType)

val df = Seq(
  ("""{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}"""),
  ("""{"Personal":{"ID":3425,"Name":["cfg","woooww"]}}""")
).toDF("data")

df.select(get_json_object($"data", "$.Personal.Name") as "name")
  .select(from_json($"name", arraySchema) as "name")
  .select(concat_ws("|", $"name"))
  .show(false)

How to read decimal data of 38 precision and 18 scale in Scala

I have data of type Decimal(38,16) in an RDBMS. I am importing that data into HDFS (Hadoop) in Parquet file format. After that, I read that Parquet file in Spark:
val df = spark.read.parquet(<path>)
Once the data is loaded into a Spark DataFrame, the datatype of that column is converted to double, and the value of the cnt column is rounded to 14 digits after the decimal point, while I have 16 digits after the decimal point.
Schema:
scala> df.printSchema
root
|-- id: integer (nullable = true)
|-- cnt: double (nullable = true)
To illustrate the issue, take a simple example:
val dt = Array(1,88.2115557137985,223.7658213615901501)
Output:
scala> dt.foreach(println)
1.0
88.2115557137985
223.76582136159016
But here I expect the data as-is, without the values being rounded.
Thanks in advance.
You can predefine your schema to make the high-precision column DecimalType when reading the Parquet file:
import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("cnt", DecimalType(38, 16), true)
))

val df = spark.read.schema(customSchema).parquet("/path/to/parquetfile")

Casting type of columns in a dataframe

My Spark program needs to read a file which contains a matrix of integers. Columns are separated with ",". The number of columns is not the same each time I run the program.
I read the file as a dataframe:
var df = spark.read.csv(originalPath);
but when I print the schema it gives me all the columns as strings.
I convert all columns to integers as below, but after that, when I print the schema of df again, the columns are still strings.
df.columns.foreach(x => df.withColumn(x + "_new", df.col(x).cast(IntegerType))
.drop(x).withColumnRenamed(x + "_new", x));
I appreciate any help to solve the issue of casting.
Thanks.
DataFrames are immutable. Your code creates a new DataFrame for each column and then discards it.
It is best to use map and select:
val newDF = df.select(df.columns.map(c => df.col(c).cast("integer")): _*)
but you could use foldLeft:
df.columns.foldLeft(df)((df, x) => df.withColumn(x , df.col(x).cast("integer")))
or even (please don't) a mutable reference:
var df = Seq(("1", "2", "3")).toDF
df.columns.foreach(x => df = df.withColumn(x , df.col(x).cast("integer")))
Or, since you mentioned the number of columns is not the same each time, you could take the highest possible number of columns and build a schema out of it, with IntegerType as the column type. When loading the file, apply this schema to read your DataFrame columns as integers automatically. No explicit conversion is required in this case.
import org.apache.spark.sql.types._

val csvSchema = StructType(Array(
  StructField("_c0", IntegerType, true),
  StructField("_c1", IntegerType, true),
  StructField("_c2", IntegerType, true),
  StructField("_c3", IntegerType, true)))

val df = spark.read.schema(csvSchema).csv(originalPath)
scala> df.printSchema
root
|-- _c0: integer (nullable = true)
|-- _c1: integer (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: integer (nullable = true)