I am trying to run an ETL job on AWS Glue where I extract data from MongoDB into a Spark DataFrame and load it into Snowflake.
This is a sample of the Spark DataFrame schema:
|-- login: struct (nullable = true)
| |-- login_attempts: integer (nullable = true)
| |-- last_attempt: timestamp (nullable = true)
|-- name: string (nullable = true)
|-- notifications: struct (nullable = true)
| |-- bot_review_queue: boolean (nullable = true)
| |-- bot_review_queue_web_push: boolean (nullable = true)
| |-- bot_review_queue_web_push_admin: boolean (nullable = true)
| |-- weekly_account_summary: struct (nullable = true)
| | |-- enabled: boolean (nullable = true)
| |-- weekly_summary: struct (nullable = true)
| | |-- enabled: boolean (nullable = true)
| | |-- day: integer (nullable = true)
| | |-- hour: integer (nullable = true)
| | |-- minute: integer (nullable = true)
|-- query: struct (nullable = true)
| |-- email_address: string (nullable = true)
I am trying to load the data into Snowflake as-is, with the struct columns as JSON payloads, but it throws the following error:
An error occurred while calling o81.collectToPython. com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a StructType
I also tried casting the struct columns to string before loading, but it throws more or less the same error:
An error occurred while calling o106.save. com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a StructType
I would really appreciate some help with this.
My code for casting and loading is below:
dynamic_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="mongodb", connection_options=read_mongo_options
)
user_df = dynamic_frame.toDF()  # convert the Glue DynamicFrame to a Spark DataFrame
user_df_cast = user_df.select(
    user_df.login.cast(StringType()), 'name', user_df.notifications.cast(StringType())
)
datasinkusers = (user_df_cast.write.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions)
                 .option("dbtable", "users").mode("append").save())
If your users table in Snowflake has the following schema, then no casting is required, because the StructType fields of a Spark SQL DataFrame map to Snowflake's VARIANT type automatically:
CREATE TABLE users (
login VARIANT
,name STRING
,notifications VARIANT
,query VARIANT
)
Just do the following; no transformations are required, because the Snowflake Spark Connector understands the data types and will convert them to appropriate JSON representations on its own:
user_df = glueContext.create_dynamic_frame.from_options(
connection_type="mongodb",
connection_options=read_mongo_options
)
(
    user_df
    .toDF()
    .write
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**sfOptions)
    .option("dbtable", "users")
    .mode("append")
    .save()
)
If you absolutely need to store the StructType fields as plain JSON strings, you'll need to explicitly transform them using the to_json SparkSQL function:
from pyspark.sql.functions import to_json
user_df_cast = user_df.select(
    to_json(user_df.login).alias("login"),
    user_df.name,
    to_json(user_df.notifications).alias("notifications")
)
This will store the JSON strings as plain VARCHAR values, which will not let you leverage Snowflake's semi-structured storage and querying capabilities directly without an extra PARSE_JSON step (inefficient).
Consider using the VARIANT approach shown above, which will allow you to perform queries on the fields directly:
SELECT
login:login_attempts
,login:last_attempt
,name
,notifications:weekly_summary.enabled
FROM users
I am loading a Dataframe from an external source with the following schema:
|-- A: string (nullable = true)
|-- B: timestamp (nullable = true)
|-- C: long (nullable = true)
|-- METADATA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- M_1: integer (nullable = true)
| | |-- M_2: string (nullable = true)
| | |-- M_3: string (nullable = true)
| | |-- M_4: string (nullable = true)
| | |-- M_5: double (nullable = true)
| | |-- M_6: string (nullable = true)
| | |-- M_7: double (nullable = true)
| | |-- M_8: boolean (nullable = true)
| | |-- M_9: boolean (nullable = true)
|-- E: string (nullable = true)
Now I need to add a new column, METADATA_PARSED, whose type is an array of the following case class:
case class META_DATA_COL(M_1: String, M_2: String, M_3: String, M_10: String)
My approach here, based on examples I have seen, is to create a UDF and pass in the METADATA column. But since it is a complex type, I am having a lot of trouble parsing it.
On top of that, inside the UDF I need to do some string manipulation to populate the "new" field M_10, so I need to access each of the elements in the METADATA column.
What would be the best way to approach this? I attempted to convert the source DataFrame (plus METADATA) to a case class, but that did not work, as it was translated back to Spark WrappedArray types upon entering the UDF.
You can use something like this:
import org.apache.spark.sql.functions._
val tempdf = df.select(
explode( col("METADATA")).as("flat")
)
val processedDf = tempdf.select( col("flat.M_1"),col("flat.M_2"),col("flat.M_3"))
Now write a UDF:
def processudf = udf((col1: Int, col2: String, col3: String) => /* do the processing */)
This should help; I can provide some more help if you can give more details on the processing.
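If you need the parsed result back as a single array column (METADATA_PARSED) rather than exploded rows, another option is a UDF that takes the whole array and rebuilds it from your case class. The sketch below assumes df is your source DataFrame and derives M_10 from M_3 with a placeholder transformation; swap in your real string manipulation:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

case class META_DATA_COL(M_1: String, M_2: String, M_3: String, M_10: String)

// Each element of the METADATA array arrives in the UDF as a Row.
val parseMetadata = udf((metadata: Seq[Row]) =>
  Option(metadata).map(_.map { m =>
    val m1 = Option(m.getAs[Any]("M_1")).map(_.toString).orNull // M_1 is an integer in the source schema
    val m2 = m.getAs[String]("M_2")
    val m3 = m.getAs[String]("M_3")
    val m10 = Option(m3).map(_.toUpperCase).orNull              // placeholder for your real M_10 logic
    META_DATA_COL(m1, m2, m3, m10)
  }).orNull
)

val withParsed = df.withColumn("METADATA_PARSED", parseMetadata(col("METADATA")))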
I have just started learning Spark. I am aware that if we set the inferSchema option to true, the schema is automatically inferred. I am reading a simple CSV file. How do I dynamically infer a schema without specifying any custom schema in my code? The code should be able to build a schema for any incoming dataset.
Is it possible to do so?
I tried using readStream with the format set to csv, skipping the inferSchema option altogether, but it seems I need to provide a schema in any case.
val ds1: DataFrame = spark
  .readStream
  .format("csv")
  .load("/home/vaibha/Downloads/C2ImportCalEventSample.csv") // fails: a schema is required here

ds1.show(2)
You can dynamically infer the schema, but it can get a bit tedious for some CSV files (more reading here). Referring to the CSV file in your code sample, and assuming it is the same as the one here, something like the below will give you what you need:
scala> val df = spark.read.
| option("header", "true").
| option("inferSchema", "true").
| option("timestampFormat","MM/dd/yyyy").
| csv("D:\\texts\\C2ImportCalEventSample.csv")
df: org.apache.spark.sql.DataFrame = [Start Date : timestamp, Start Time: string ... 15 more fields]
scala> df.printSchema
root
|-- Start Date : timestamp (nullable = true)
|-- Start Time: string (nullable = true)
|-- End Date: timestamp (nullable = true)
|-- End Time: string (nullable = true)
|-- Event Title : string (nullable = true)
|-- All Day Event: string (nullable = true)
|-- No End Time: string (nullable = true)
|-- Event Description: string (nullable = true)
|-- Contact : string (nullable = true)
|-- Contact Email: string (nullable = true)
|-- Contact Phone: string (nullable = true)
|-- Location: string (nullable = true)
|-- Category: integer (nullable = true)
|-- Mandatory: string (nullable = true)
|-- Registration: string (nullable = true)
|-- Maximum: integer (nullable = true)
|-- Last Date To Register: timestamp (nullable = true)
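On the readStream part of the question: streaming file sources need an explicit schema unless spark.sql.streaming.schemaInference is enabled, so a common workaround is to infer the schema once with a batch read and reuse it for the stream. A sketch against the same file (path and options assumed from the snippets above):
// One-off batch read just to infer the schema.
val staticDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("timestampFormat", "MM/dd/yyyy")
  .csv("/home/vaibha/Downloads/C2ImportCalEventSample.csv")

// Reuse the inferred schema for the stream; streaming sources typically watch a directory.
val ds1 = spark.readStream
  .schema(staticDf.schema)
  .option("header", "true")
  .option("timestampFormat", "MM/dd/yyyy")
  .csv("/home/vaibha/Downloads/")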
Schema:
|-- c0: string (nullable = true)
|-- c1: struct (nullable = true)
| |-- c2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- orangeID: string (nullable = true)
| | | |-- orangeId: string (nullable = true)
I am trying to flatten the schema above in Spark.
Code:
var df = data.select($"c0",$"c1.*").select($"c0",explode($"c2")).select($"c0",$"col.orangeID", $"col.orangeId")
The flattening code is working fine. The problem is in the last part where the 2 columns differ only by 1 letter (orangeID and orangeId). Hence I am getting this error:
Error:
org.apache.spark.sql.AnalysisException: Ambiguous reference to fields StructField(orangeID,StringType,true), StructField(orangeId,StringType,true);
Any suggestions to avoid this ambiguity will be great.
Turn on the Spark SQL case-sensitivity configuration and try:
spark.sql("set spark.sql.caseSensitive=true")
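With case sensitivity on, the two fields resolve to distinct columns and the flattening select from the question works. A quick sketch reusing the column names from your code (the aliases are optional placeholders, but they help if a downstream sink is case-insensitive):
import org.apache.spark.sql.functions.explode
import spark.implicits._

spark.sql("set spark.sql.caseSensitive=true")

val flattened = data
  .select($"c0", $"c1.*")
  .select($"c0", explode($"c2"))
  .select(
    $"c0",
    $"col.orangeID".as("orange_id_upper"), // renamed so the two columns no longer differ only by case
    $"col.orangeId".as("orange_id_lower")
  )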
I'm currently using Scala Spark for some ETL and have a base DataFrame with the following schema:
|-- round: string (nullable = true)
|-- Id : string (nullable = true)
|-- questions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- bonusQuestions: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- difficulty : string (nullable = true)
| | |-- answerOptions: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- followUpAnswers: array (nullable = true)
| | | |-- element: string (containsNull = true)
|-- school: string (nullable = true)
I only need to perform ETL on rows where the round type is primary (there are two types, primary and secondary). However, I need both types of rows in my final table.
The ETL I'm stuck on should work as follows:
If the tag is non-bonus, then bonusQuestions and difficulty should be set to null.
I'm currently able to access most fields of the DF like
val round = tr.getAs[String]("round")
Next, I'm able to get the questions array using
val questionsArray = tr.getAs[Seq[StructType]]("questions")
and can iterate using for (question <- questionsArray) {...}. However, I cannot access struct fields like question.bonusQuestions or question.tag, which returns the error:
error: value tag is not a member of org.apache.spark.sql.types.StructType
Spark treats a StructType value as a GenericRowWithSchema, or more generally as a Row. So instead of Seq[StructType] you have to use Seq[Row]:
val questionsArray = tr.getAs[Seq[Row]]("questions")
and in the loop for (question <- questionsArray) {...} you can read each Row's fields as:
for (question <- questionsArray) {
val tag = question.getAs[String]("tag")
val bonusQuestions = question.getAs[Seq[String]]("bonusQuestions")
val difficulty = question.getAs[String]("difficulty")
val answerOptions = question.getAs[Seq[String]]("answerOptions")
val followUpAnswers = question.getAs[Seq[String]]("followUpAnswers")
}
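Building on that, if the goal is to null out bonusQuestions and difficulty for non-bonus questions on primary rounds while leaving other rounds untouched, one way is a UDF that rebuilds the array from a case class. This is only a sketch: df stands for your base DataFrame, the field names come from your schema, and the tag check is an assumption you should adjust to your real tag values.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Mirrors the struct inside the questions array, in the same field order.
case class CleanQuestion(
  tag: String,
  bonusQuestions: Seq[String],
  difficulty: String,
  answerOptions: Seq[String],
  followUpAnswers: Seq[String]
)

val cleanQuestions = udf((round: String, questions: Seq[Row]) =>
  Option(questions).map(_.map { q =>
    val tag = q.getAs[String]("tag")
    // Assumption: a question counts as bonus when its tag equals "bonus".
    val stripBonus = round == "primary" && tag != "bonus"
    CleanQuestion(
      tag,
      if (stripBonus) null else q.getAs[Seq[String]]("bonusQuestions"),
      if (stripBonus) null else q.getAs[String]("difficulty"),
      q.getAs[Seq[String]]("answerOptions"),
      q.getAs[Seq[String]]("followUpAnswers")
    )
  }).orNull
)

val cleaned = df.withColumn("questions", cleanQuestions(col("round"), col("questions")))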
I hope the answer is helpful.
I am using Spark to process some data stored in an XML file.
I successfully loaded my data and printed the schema:
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag","elementTag")
.load(myPath+"/myfile.xml")
df.printSchema
This gives me a result that looks like this:
root
|-- _id: string (nullable = true)
|-- _type: string (nullable = true)
|-- creationDate: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _value: string (nullable = true)
|-- lastUpdateDate: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _value: string (nullable = true)
From this data, I want to extract only certain fields, which should be easy with a select. So I run the following:
df.select("_id","creationDate._value","lastUpdateDate._value")
But I get the error:
org.apache.spark.sql.AnalysisException: Ambiguous reference to fields StructField(_VALUE,StringType,true), StructField(_value,StringType,true);
My problem is that Spark SQL is not case sensitive by default, my file contains both a _value and a _VALUE field, and I can't change my input file.
Is there a way to solve this problem with Spark?
spark-xml creates a _VALUE field when an XML tag has no child elements (only text), which can conflict with other fields.
You can change the default _VALUE name by adding an option while reading the XML:
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag","elementTag")
.option("valueTag", "anyName")
.load(myPath+"/myfile.xml")
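With the element text renamed (anyName is just a placeholder), the struct fields no longer differ only by case, so the select from the question resolves. Aliasing avoids ending up with two output columns both named _value:
import org.apache.spark.sql.functions.col

val selected = df.select(
  col("_id"),
  col("creationDate._value").as("creationDate"),
  col("lastUpdateDate._value").as("lastUpdateDate")
)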
Hope this helps!