I'm trying to format the output of a PySpark UDF. The Python code in the UDF returns something like this:
[ [[1],[.5,.6,.7],"A","B"], [[2],[.1,.3,.9],"A","C"],... ]
I have the following return schema code:
schema_return = st.StructType([
    st.StructField('result', st.StructType([
        st.StructField('rank', st.ArrayType(st.FloatType()), True),
        st.StructField('embedding', st.ArrayType(st.FloatType()), True),
        st.StructField('name', st.StringType(), True),
        st.StructField('value', st.StringType(), True)
    ])),
])
However, this gives me the following error, which according to other answers here corresponds to a wrong output type:
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
Can someone help me write this schema correctly?
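A sketch of a schema that matches that sample output, assuming the UDF returns a list of such records: the field would be an ArrayType of StructType rather than a nested StructType. The PickleException about numpy.core.multiarray._reconstruct usually means numpy scalars or arrays are being returned directly, so converting them to plain Python types inside the UDF is also needed.

import pyspark.sql.types as st

# sketch: the sample output is a list of records, so describe it as an array of structs;
# if the UDF returns the list directly (as the sample suggests), the ArrayType alone
# may be enough without the outer 'result' wrapper
schema_return = st.StructType([
    st.StructField('result', st.ArrayType(st.StructType([
        st.StructField('rank', st.ArrayType(st.FloatType()), True),
        st.StructField('embedding', st.ArrayType(st.FloatType()), True),
        st.StructField('name', st.StringType(), True),
        st.StructField('value', st.StringType(), True)
    ])), True),
])

# inside the UDF, convert numpy values to plain Python types before returning,
# e.g. (hypothetical variable names):
# return [([float(r) for r in rank], [float(e) for e in emb], str(name), str(value))
#         for rank, emb, name, value in records]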
Hello, I have an array field in Snowflake stored as a VARIANT, and when I read it in PySpark it comes back as a String. How can I convert the string back into an array so that I can apply explode over it?
Below is the VARIANT from Snowflake:
In PySpark I tried splitting the field and casting it to an array, but when I explode the array the values are not the expected strings: they still contain double quotes and even the square brackets. I want the output without quotes and square brackets, as a PySpark array field would give after an explode operation.
df = df.withColumn("genres", split(col("genres"), ",").cast("array<string>"))
If you check the Data Type Mappings (from Snowflake to Spark), you see that the VARIANT datatype is mapped to StringType:
https://docs.snowflake.com/en/user-guide/spark-connector-use.html#from-snowflake-to-spark-sql
This is why you get those quotes and square brackets. I think the solution is to convert the variant to a string explicitly using ARRAY_TO_STRING when querying the table, and then convert the string to an array in Spark:
df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
    .options(**sfOptions) \
    .option("query", "select ARRAY_TO_STRING(genres,',') genres from test_v") \
    .load()

df = df.withColumn("genres", split(col("genres"), ",").cast("array<string>"))
df.show()
In my tests, it returns the following output:
+---------------+
| genres|
+---------------+
|[News, Weather]|
+---------------+
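As an alternative sketch (assuming the VARIANT arrives as a JSON array string such as '["News","Weather"]' and the column is named genres as in the question), the string can also be parsed directly with from_json instead of splitting:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# parse the JSON array string into a real array column, then explode it
df = df.withColumn("genres", F.from_json(F.col("genres"), ArrayType(StringType())))
df.select(F.explode("genres").alias("genre")).show()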
I am working on a database where the data is stored in CSV format. The DB looks like the following:
| id | containertype | size |
| --- | --- | --- |
| 1 | CASE | {height=2.01, length=1.07, width=1.22} |
| 2 | PALLET | {height=1.80, length=1.07, width=1.23} |
I want to parse the data inside the size column and create a PySpark dataframe like:
| id | containertype | height | length | width |
| --- | --- | --- | --- | --- |
| 1 | CASE | 2.01 | 1.07 | 1.22 |
| 2 | PALLET | 1.80 | 1.07 | 1.23 |
I tried parsing the string to StructType and MapType, but neither approach is working. Is there any way to do it other than messy string manipulation?
Reproducible data-frame code:
df = spark.createDataFrame(
    [
        ("1", "CASE", "{height=2.01, length=1.07, width=1.22}"),
        ("2", "PALLET", "{height=1.80, length=1.07, width=1.23}"),
    ],
    ["id", "containertype", "size"]
)
df.printSchema()
If the column contains JSON, you can parse it with the function from_json, which takes the column you want to parse, in your case size, and the schema that the parsing should produce, in this case:
schema = StructType([
    StructField("height", FloatType(), True),
    StructField("length", FloatType(), True),
    StructField("width", FloatType(), True)
])
df.withColumn("json", F.from_json(F.col("size"), schema))\
.select(F.col("id"), F.col("containertype"), F.col("json.*"))
Note that the string in the question has unquoted keys and uses = instead of :, so from_json may return nulls on it as-is. Alternatively, use a regex to extract each value:
def getParameter(tag):
    return F.regexp_extract("size", tag + r"=(\d+\.\d+)", 1).cast(FloatType()).alias(tag)

df.select(F.col("id"), F.col("containertype"), getParameter("height"), getParameter("length"), getParameter("width"))
Say I have the below CSV and many more like it.
val csv = sc.parallelize(Array(
  "col1, col2, col3",
  "1, cat, dog",
  "2, bird, bee"))
I would like to apply the below functions to the RDD to convert it to a DataFrame with the desired logic below. I keep running into the error: not found: value DataFrame.
How can I correct this?
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
In most cases I would read CSV files directly as a DataFrame using Spark's core functionality, but I am unable to in this case.
Any/all help is appreciated.
In order not to get error: not found: value DataFrame, you must add the following import:
import org.apache.spark.sql.DataFrame
and your method declaration should be like this:
def udf(fName : RDD[String]): DataFrame = { ...
I'm trying to load data from Elasticsearch to Mongo DB using Spark.
I'm collecting the data from ES into a dataframe and then pushing the DF into Mongo DB.
Now, I have a column named '_id' in my dataframe which has a String value,
i.e., _id: "abcd12345"
I would now like to modify this column value to the Mongo ObjectId type and then push it to MongoDB,
i.e., _id: ObjectId("abcd12345")
I have tried achieving it using the following Spark code, but with no luck in getting what I want.
import spark.implicits._

val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("query", esQuery)
  .option("pushdown", true)
  .option("scroll.size", Config.ES_SCROLL_SIZE)
  .load(Config.ES_RESOURCE)
  .withColumn("_id", $"_metadata".getItem("_id"))
  .drop("_metadata")

df("_id") = new ObjectId(df("_id").toString()) // error
Expect the resultant DataFrame to have '_id' column value as,
_id: ObjectId("abc12345") instead of _id: "abc12345"
Any help is appreciated. I'm really blocked with this issue.
I have a PySpark dataframe, and I wish to get the mean and std for all columns, rename the columns, and cast the type. What is the easiest way to implement this? Currently my code is below:
test_mean = test.groupby('id').agg({'col1': 'mean',
                                    'col2': 'mean',
                                    'col3': 'mean'})

test_std = test.groupby('id').agg({'col1': 'std',
                                   'col2': 'std',
                                   'col3': 'std'})
## rename the columns one by one
## cast decimal to float
May I know how to improve it?
Thanks.
You can try with column expressions:
from pyspark.sql import functions as F

expr1 = F.stddev(F.col('col1').cast('float')).alias('col1')
expr2 = F.stddev(F.col('col2').cast('float')).alias('col2')

test \
    .groupBy('id') \
    .agg(
        expr1,
        expr2
    )
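For the mean/std question itself, a minimal sketch that builds the mean and std expressions for every column in one pass, renaming and casting as it goes (column names taken from the question):

from pyspark.sql import functions as F

cols = ['col1', 'col2', 'col3']
exprs = []
for c in cols:
    # one mean and one std expression per column, cast to float and renamed
    exprs.append(F.mean(F.col(c)).cast('float').alias(f'{c}_mean'))
    exprs.append(F.stddev(F.col(c)).cast('float').alias(f'{c}_std'))

test_stats = test.groupby('id').agg(*exprs)
test_stats.printSchema()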