Snowflake Variant to pyspark array - pyspark

Hello, I have an ARRAY field in Snowflake stored as a VARIANT, and when I read it I get it back as a string in PySpark. How can I convert the string back into an array so that I can apply explode over it?
Below is the VARIANT from Snowflake:
In PySpark I tried splitting the field and casting it to an array, but when I explode the array the values are not the expected strings: they still contain the double quotes and even the square brackets. I wanted output without quotes and square brackets, like a PySpark array field would produce after an explode operation.
df = df.withColumn("genres", split(col("genres"), ",").cast("array<string>"))

If you check the Data Type Mappings (from Snowflake to Spark), you see that the VARIANT datatype is mapped to StringType:
https://docs.snowflake.com/en/user-guide/spark-connector-use.html#from-snowflake-to-spark-sql
This is why you get those quotes and square brackets. I think the solution is to convert the VARIANT to a string explicitly using ARRAY_TO_STRING when querying the table, and then convert the string to an array in Spark:
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", "select ARRAY_TO_STRING(genres, ',') genres from test_v") \
    .load()
df = df.withColumn("genres", split(col("genres"), ",").cast("array<string>"))
df.show()
In my tests, it returns the following output:
+---------------+
| genres|
+---------------+
|[News, Weather]|
+---------------+

Related

Is there any way to change one spark DF's datatype to match with other DF's datatype

I have one spark DF1 with datatype,
string (nullable = true)
integer (nullable = true)
timestamp (nullable = true)
And one more spark DF2 (which I created from Pandas DF)
object
int64
object
Now I need to change the DF2 datatypes to match the DF1 datatypes. Is there any common way to do that? Every time I may get different columns with different datatypes.
Is there any way to assign the DF1 datatypes to some StructType and use that for DF2?
Suppose you have two dataframes, data1_sdf and data2_sdf. You can extract a column's data type from a dataframe's schema with data1_sdf.select('column_name').schema[0].dataType.
Here's an example where the data2_sdf columns are cast using data1_sdf within a select:
data2_sdf. \
    select(*[func.col(c).cast(data1_sdf.select(c).schema[0].dataType) if c in data1_sdf.columns else c for c in data2_sdf.columns])
If you make sure that your first object is a string-like column and your third object is a timestamp-like column, you can try this method:
df2 = spark.createDataFrame(
df2.rdd, schema=df1.schema
)
However, this method is not guaranteed to work; some cases are invalid (e.g. transforming from string to integer), and it might not be good from a performance perspective either. Therefore, you are better off using cast to change the data type of each column.

Pyspark: Identify the arrayType column from the the Struct and call udf to convert array to string

I am creating an accelerator that migrates data from a source to a destination. For example, I will pick up data from an API and migrate the data to CSV. I have faced issues handling ArrayType columns when the data is converted to CSV. I used withColumn with the concat_ws method (i.e., df1 = df.withColumn('films', F.concat_ws(':', F.col("films"))), where films is the ArrayType column) for this conversion and it worked. Now I want this to happen dynamically. I mean, without specifying the column name, is there a way that I can pick from the struct the column names that have ArrayType and then call the udf?
Thank you for your time!
You can get the type of the columns using df.schema. Depending on the type of the column you can apply concat_ws or not:
data = [["test1", "test2", [1,2,3], ["a","b","c"]]]
schema= ["col1", "col2", "arr1", "arr2"]
df = spark.createDataFrame(data, schema)
array_cols = [F.concat_ws(":", c.name).alias(c.name)
              for c in df.schema if isinstance(c.dataType, T.ArrayType)]
other_cols = [F.col(c.name)
              for c in df.schema if not isinstance(c.dataType, T.ArrayType)]
df = df.select(other_cols + array_cols)
Result:
+-----+-----+-----+-----+
| col1| col2| arr1| arr2|
+-----+-----+-----+-----+
|test1|test2|1:2:3|a:b:c|
+-----+-----+-----+-----+

I have a JSON string in my dataframe and I already tried to extract the JSON string columns using PySpark

df = spark.read.json("dbfs:/mnt/evbhaent2blobs", multiLine=True)
df2 = df.select(F.col('body').cast("Struct").getItem('CustomerType').alias('CustomerType'))
display(df)
I am taking a guess that your dataframe has a column "body" which is a json string and you want to parse the json and extract an element from it.
First you need to define or extract the json schema. And then parse json string and extract its elements as column. From the extracted columns, you can select the desired columns.
json_schema = spark.read.json(df.rdd.map(lambda row: row.body)).schema
df2 = df.withColumn('body_json', F.from_json(F.col('body'), json_schema)) \
    .select("body_json.*").select('CustomerType')
display(df2)

How to convert a row from Dataframe to String

I have a dataframe which contains only one row with the column name: source_column in the below format:
forecast_id:bigInt|period:numeric|name:char(50)|location:char(50)
I want to retrieve this value into a String and then split it on the | delimiter (which has to be escaped as \\| in the regex).
First I tried converting the row from the DataFrame into the String by following way so that I can check if the row is converted to String:
val sourceColDataTypes = sourceCols.select("source_columns").rdd.map(x => x.toString()).collect()
When I try to print it with println(sourceColDataTypes) to check the content, I see [Ljava.lang.String;@19bbb216, the default toString of the array rather than its elements.
I couldn't understand the mistake here. Could anyone let me know how I can properly fetch a row from a dataframe and convert it to a String?
You can also try this:
df.show()
//Input data
//+-----------+----------+--------+--------+
//|forecast_id|period |name |location|
//+-----------+----------+--------+--------+
//|1000 |period1000|name1000|loc1000 |
//+-----------+----------+--------+--------+
df.map(_.mkString(",")).show(false)
//Output:
//+--------------------------------+
//|value |
//+--------------------------------+
//|1000,period1000,name1000,loc1000|
//+--------------------------------+
df.rdd.map(_.mkString(",")).collect.foreach(println)
//1000,period1000,name1000,loc1000

Converting Column of Dataframe to Seq[Columns] Scala

I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside agg should be a Seq[Column].
I have a dataframe "expr" containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type and I want to use the values of each row as input to agg, but I am not able to reach them.
Any idea how to give it a try?
If you can change the strings in the sequences column to be valid SQL expressions, then it is possible to solve. Spark provides a function expr that takes a SQL string and converts it into a Column. An example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to a Seq[Column], do:
val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)