Show call failing with a grouped and aggregated dataframe - scala

I am trying to use sum after groupBy, like this,
val b = a.groupBy($"key").agg(sum($"value"))
Here the schema of a is of the following type,
|-- key: string (nullable = true)
|-- value: integer (nullable = false)
While the schema of b is of the following type,
|-- key: string (nullable = true)
|-- sum(value): long (nullable = true)
But when I do b.show, I get this error.
cannot assign instance of scala.collection.immutable.List$SerializationProxy
to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_
of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
What could be the reason for this error? I am using Spark 2.3.2 and running the code with an Apache Zeppelin note.

Related

Error while inserting into partitioned hive table for spark scala

I am having hive table with following structure
CREATE TABLE gcganamrswp_work.historical_trend_result(
column_name string,
metric_name string,
current_percentage string,
lower_threshold double,
upper_threshold double,
calc_status string,
final_status string,
support_override string,
dataset_name string,
insert_timestamp string,
appid string,
currentdate string,
indicator map<string,string>)
PARTITIONED BY (
appname string,
year_month int)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY");
I am having spark dataframe with schema
root
|-- metric_name: string (nullable = true)
|-- column_name: string (nullable = true)
|-- Lower_Threshold: double (nullable = true)
|-- Upper_Threshold: double (nullable = true)
|-- Current_Percentage: double (nullable = true)
|-- Calc_Status: string (nullable = false)
|-- Final_Status: string (nullable = false)
|-- support_override: string (nullable = false)
|-- Dataset_Name: string (nullable = false)
|-- insert_timestamp: string (nullable = false)
|-- appId: string (nullable = false)
|-- currentDate: string (nullable = false)
|-- indicator: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = false)
|-- appname: string (nullable = false)
|-- year_month: string (nullable = false)
when i try to insert into hive table using below code it is failing
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
data_df.repartition(1)
.write.mode("append")
.format("hive")
.insertInto(Outputhive_table)
Spark Version : Spark 2.4.0
Error:
ERROR Hive:1987 - Exception when loading partition with parameters
partPath=hdfs://gcgprod/data/work/hive/historical_trend_result/.hive-staging_hive_2021-09-01_04-34-04_254_8783620706620422928-1/-ext-10000/_temporary/0,
table=historical_trend_result, partSpec={appname=, year_month=},
replace=false, listBucketingEnabled=false, isAcid=false,
hasFollowingStatsTask=false
org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:Partition spec is incorrect. {appname=,
year_month=}) at
org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1662)
at
org.apache.hadoop.hive.ql.metadata.Hive.lambda$loadDynamicPartitions$4(Hive.java:1970)
at java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748) Caused by:
MetaException(message:Partition spec is incorrect. {appname=,
year_month=}) at
org.apache.hadoop.hive.metastore.Warehouse.makePartName(Warehouse.java:329)
at
org.apache.hadoop.hive.metastore.Warehouse.makePartPath(Warehouse.java:312)
at
org.apache.hadoop.hive.ql.metadata.Hive.genPartPathFromTable(Hive.java:1751)
at
org.apache.hadoop.hive.ql.metadata.Hive.loadPartitionInternal(Hive.java:1607)
I have specified the partition columns in the last columns of dataframe, so i expect it consider last tow columns as partition columns. I wanted to used the same routine for inserting different tables so i don't want to mention the partition columns explicitly
Just to recap that you are using spark to write data to a hive table with dynamic partitions. So my answer below is based on same, if my understanding is incorrect, please feel free to correct me in comment.
While you have a table that is dynamically partitioned (by app_name and year_month), the spark job doesn't know the partitioning fields in the destination so you will still have to tell your spark job about the partitioning field of the destination table.
Something like this should work
data_df.repartition(1)
.write
.partitionBy("appname", "year_month")
.mode(SaveMode.Append)
.saveAsTable(Outputhive_table)
Make sure that you enable support for dynamic partitions by executing something like
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
Check out this post by Itai Yaffe, this may be handy https://medium.com/nmc-techblog/spark-dynamic-partition-inserts-part-1-5b66a145974f
I think the problem is that some records have appname and year_month as strings. At least this is suggested by
Partition spec is incorrect. {appname=, year_month=}
Make sure partition colums are never empty or null! Also note that the type of year_month is not consistent between the DataFrame and your schema (string/int)

Change Spark Dataframe Array[String] to Array[Double]

I am very new to scala and I have the following issue.
I have a spark dataframe with the following schema:
df.printSchema()
root
|-- word: string (nullable = true)
|-- vector: array (nullable = true)
| |-- element: string (containsNull = true)
I need to convert this to the following schema:
root
|-- word: string (nullable = true)
|-- vector: array (nullable = true)
| |-- element: double (containsNull = true)
I do not want to specify the schema before hand, but instead change the existing one.
I have tried the following
df.withColumn("vector", col("vector").cast("array<element: double>"))
I have also tried converting it into an RDD to use map to change the elements and then turn it back into a dataframe, but I get the following data type Array[WrappedArray] and I am not sure how to handle it.
Using pyspark and numpy, I could do this by df.select("vector").rdd.map(lambda x: numpy.asarray(x)).
Any help would be greatly appreciated.
You're close. Try this code:
val df2 = df.withColumn("vector", col("vector").cast("array<double>"))

Loading data from glue to snowflake

I am trying to run an ETL job on glue where I am extracting data into a spark dataframe from a mongodb into glue and load it into snowflake.
This is the sample schema of the Spark dataframe
|-- login: struct (nullable = true)
| |-- login_attempts: integer (nullable = true)
| |-- last_attempt: timestamp (nullable = true)
|-- name: string (nullable = true)
|-- notifications: struct (nullable = true)
| |-- bot_review_queue: boolean (nullable = true)
| |-- bot_review_queue_web_push: boolean (nullable = true)
| |-- bot_review_queue_web_push_admin: boolean (nullable = true)
| |-- weekly_account_summary: struct (nullable = true)
| | |-- enabled: boolean (nullable = true)
| |-- weekly_summary: struct (nullable = true)
| | |-- enabled: boolean (nullable = true)
| | |-- day: integer (nullable = true)
| | |-- hour: integer (nullable = true)
| | |-- minute: integer (nullable = true)
|-- query: struct (nullable = true)
| |-- email_address: string (nullable = true)
I am trying to load the data into snowflake as it is and struct columns as json payload in snowflake but it throws the following error
An error occurred while calling o81.collectToPython.com.mongodb.spark.exceptions.MongoTypeConversionException:Cannot cast ARRAY into a StructType
I also tried to cast the struct columns into string and load it but it throws more or less the same error
An error occurred while calling o106.save. com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a StructType
Really appreciate if I can get some help on it.
code below for casting and loading.
dynamic_frame = glueContext.create_dynamic_frame.from_options(connection_type="mongodb",
connection_options=read_mongo_options)
user_df_cast = user_df.select(user_df.login.cast(StringType()),'name',user_df.notifications.cast(StringType()))
datasinkusers = user_df_cast.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("dbtable", "users").mode("append").save()
If your users table in Snowflake has the following schema then casting is not required, as the StructType fields of a SparkSQL DataFrame will map to the VARIANT type in Snowflake automatically:
CREATE TABLE users (
login VARIANT
,name STRING
,notifications VARIANT
,query VARIANT
)
Just do the following, no transformations required because the Snowflake Spark Connector understands the data-type and will convert to appropriate JSON representations on its own:
user_df = glueContext.create_dynamic_frame.from_options(
connection_type="mongodb",
connection_options=read_mongo_options
)
user_df
.toDF()
.write
.format(SNOWFLAKE_SOURCE_NAME)
.options(**sfOptions)
.option("dbtable", "users")
.mode("append")
.save()
If you absolutely need to store the StructType fields as plain JSON strings, you'll need to explicitly transform them using the to_json SparkSQL function:
from pyspark.sql.functions import to_json
user_df_cast = user_df.select(
to_json(user_df.login),
user_df.name,
to_json(user_df.notifications)
)
This will store JSON strings as simple VARCHAR types which will not let you leverage Snowflake's semi-structured data storage and querying capabilities directly without a PARSE_JSON step (inefficient).
Consider using the VARIANT approach shown above, which will allow you to perform queries on the fields directly:
SELECT
login:login_attempts
,login:last_attempt
,name
,notifications:weekly_summary.enabled
FROM users

How to pass a dataframe column to scala function

I've written a scala function which will convert time(HH:mm:ss.SSS) to seconds. First it will ignore milliseconds and will take only (HH:mm:ss) and convert into seconds(int). It works fine when testing in spark-shell.
def hoursToSeconds(a: Any): Int = {
val sec = a.toString.split('.')
val fields = sec(0).split(':')
val creationSeconds = fields(0).toInt*3600 + fields(1).toInt*60 + fields(2).toInt
return creationSeconds
}
print(hoursToSeconds("03:51:21.2550000"))
13881
I would need to pass this function to one of the dataframe column(running), which i was trying with the withColumn method, but getting error Type mismatch, expected: column, actual String. Any help would be appreciated, is there a way i can pass the scala function to udf and then use udf in df.withColumn.
df.printSchema
root
|-- vin: string (nullable = true)
|-- BeginOfDay: string (nullable = true)
|-- Timezone: string (nullable = true)
|-- Version: timestamp (nullable = true)
|-- Running: string (nullable = true)
|-- Idling: string (nullable = true)
|-- Stopped: string (nullable = true)
|-- dlLoadDate: string (nullable = false)
sample running column values.
df.withColumn("running", hoursToSeconds(df("Running")
You can create a udf for the hoursToSeconds function by using the following sytax :
val hoursToSecUdf = udf(hoursToSeconds _)
Further to use it on a particular column the following sytax can be used :
df.withColumn("TimeInSeconds",hoursToSecUdf(col("running")))

Spark apply custom schema to a DataFrame

I have a data in Parquet file and want to apply custom schema to it.
My initial data within Parquet is as below,
root
|-- CUST_ID: decimal(9,0) (nullable = true)
|-- INACTV_DT: string (nullable = true)
|-- UPDT_DT: string (nullable = true)
|-- ACTV_DT: string (nullable = true)
|-- PMT_AMT: decimal(9,4) (nullable = true)
|-- CMT_ID: decimal(38,14) (nullable = true)
My custom schema is below,
root
|-- CUST_ID: decimal(38,0) (nullable = false)
|-- INACTV_DT: timestamp (nullable = false)
|-- UPDT_DT: timestamp (nullable = false)
|-- ACTV_DT: timestamp (nullable = true)
|-- PMT_AMT: decimal(19,4) (nullable = true)
|-- CMT_ID: decimal(38,14) (nullable = false)
Below is my code to apply new data-frame to it
val customSchema = getOracleDBSchema(sparkSession, QUERY).schema
val DF_frmOldParkquet = sqlContext_par.read.parquet("src/main/resources/data_0_0_0.parquet")
val rows: RDD[Row] = DF_frmOldParkquet.rdd
val newDataFrame = sparkSession.sqlContext.createDataFrame(rows, tblSchema)
newDataFrame.printSchema()
newDataFrame.show()
I am getting below error, when I perform this operation.
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of timestamp
staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), fromDecimal, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, CUST_ID), DecimalType(38,0)), true) AS CUST_ID#27
There are two main applications of schema in Spark SQL
schema argument passed to schema method of the DataFrameReader which is used to transform data in some formats (primarily plain text files). In this case schema can be used to automatically cast input records.
schema argument passed to createDataFrame (variants which take RDD or List of Rows) of the SparkSession. In this case schema has to conform to the data, and is not used for casting.
None of the above is applicable in your case:
Input is strongly typed, therefore schema, if present, is ignored by the reader.
Schema doesn't match the data, therefore it cannot be used to createDataFrame.
In this scenario you should cast each column to the desired type. Assuming the types are compatible, something like this should work
val newDataFrame = df.schema.fields.foldLeft(df){
(df, s) => df.withColumn(s.name, df(s.name).cast(s.dataType))
}
Depending on the format of the data this might be sufficient or not. For example if fields that should be transformed to timestamps don't use standard formatting, casting won't work, and you'll have to use Spark datetime processing utilities.