Cast Issue with AWS Glue 3.0 - PySpark

I'm using Glue 3.0:
from pyspark.sql import functions as f
data = [("Java", "6241499.16943521594684385382059800664452")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF()
df.show()
df.select(f.col("_2").cast("decimal(15,2)")).show()
I get the following result:
+----+--------------------+
|  _1|                  _2|
+----+--------------------+
|Java|6241499.169435215...|
+----+--------------------+
+----+
|  _2|
+----+
|null|
+----+
Locally, with pyspark==3.2.1, the string casts to decimal without any issue, but the Glue job returns null.

The problem is with AWS Glue. To work around it, I truncate the string before doing the cast:
from pyspark.sql import functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def prepareStringDecimal(str_):
    """
    Keep at most five fractional digits of a numeric string.
    :param str_: "1234.123456789"
    :return: "1234.12345"
    """
    arr = str(str_).split(".")
    if len(arr) > 1:
        return arr[0] + "." + arr[1][:5]
    else:
        return str_
# convert function to UDF
convertUDF = udf(lambda z: prepareStringDecimal(z), StringType())
data = [("Java", "6241499.16943521594684385382059800664452")]
df = spark.sparkContext.parallelize(data).toDF()
df.show()
df.select(convertUDF(f.col("_2")).cast("decimal(15,2)")).show()
Output:
+----+--------------------+
|  _1|                  _2|
+----+--------------------+
|Java|6241499.169435215...|
+----+--------------------+
+-----------------------------------+
|CAST(<lambda>(_2) AS DECIMAL(15,2))|
+-----------------------------------+
|                         6241499.17|
+-----------------------------------+
Note: the same truncation can of course be done with Spark SQL functions instead of a UDF.
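A minimal sketch of that idea, assuming the built-in regexp_extract is used to keep at most five fractional digits before the cast. The names truncate_fraction and cast_truncated and the pattern are illustrative, not from the original post; the five-digit cutoff is taken from the UDF above. This has not been verified on Glue itself; the pure-Python helper only mirrors the pattern so it can be checked without a cluster.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

# Optional sign, integer digits, then at most five fractional digits.
# Assumption: plain decimal notation, no exponent or thousands separators.
TRUNCATE_PATTERN = r"^(-?\d+(?:\.\d{0,5})?)"


def truncate_fraction(value):
    """Pure-Python mirror of the pattern, for checking it locally."""
    match = re.match(TRUNCATE_PATTERN, value)
    return match.group(1) if match else ""


def cast_truncated(df, column, target="decimal(15,2)"):
    """Hypothetical helper: same truncation with Spark SQL functions, no UDF."""
    # Imported lazily so the pure-Python part runs without Spark installed.
    from pyspark.sql import functions as f

    truncated = f.regexp_extract(f.col(column), TRUNCATE_PATTERN, 1)
    return df.select(truncated.cast(target).alias(column))


# Local check with the value from the question.
short = truncate_fraction("6241499.16943521594684385382059800664452")
rounded = Decimal(short).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

With a DataFrame like the one above, cast_truncated(df, "_2").show() would be the call to try. Built-in column functions avoid the Python serialization overhead that a UDF brings.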

Related

How to make first row as header in PySpark reading text file as Spark context

This is the DataFrame I get after reading a text file through the Spark context:
+----+---+------+
|  _1| _2|    _3|
+----+---+------+
|name|age|salary|
| sai| 25|  1000|
| bum| 30|  1500|
| che| 40|  null|
+----+---+------+
The DataFrame I need is:
+----+---+------+
|name|age|salary|
+----+---+------+
| sai| 25|  1000|
| bum| 30|  1500|
| che| 40|  null|
+----+---+------+
Here is the code:
## from spark context
df_txt = spark.sparkContext.textFile("/FileStore/tables/simple-2.txt")
df_txt1 = df_txt.map(lambda x: x.split(" "))
ddf = df_txt1.toDF()
ddf.show()
You can use the Spark CSV reader to read a comma-separated file.
To read it as a text file instead, take the first row as the header, split it into a Seq of String, and pass that to the toDF function. Also filter the header row out of the RDD.
Note: the code below is written in Scala; the same steps can be expressed with lambda functions in PySpark.
import spark.implicits._ // provides toDF; already in scope in spark-shell and notebooks
val df = spark.sparkContext.textFile("/FileStore/tables/simple-2.txt")
val header = df.first()
val headerCol: Seq[String] = header.split(",").toList
val filteredRDD = df.filter(x=> x!= header)
val finaldf = filteredRDD.map( _.split(",")).map(w => (w(0),w(1),w(2))).toDF(headerCol: _*)
finaldf.show()
w(0), w(1), w(2): the tuple fixes the number of columns, so it has to match the number of columns in your file.
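The same steps can be sketched in PySpark. This is a minimal version, assuming a single-character separator and a header on the first line; split_header and text_file_to_df are hypothetical names. The first function shows the logic on a plain list so it can be checked locally, the second applies it to an RDD. Every resulting column is a string.

```python
def split_header(lines, sep=" "):
    """Return (column_names, data_rows) from lines whose first entry is the header."""
    header = lines[0]
    columns = header.split(sep)
    rows = [line.split(sep) for line in lines[1:]]
    return columns, rows


def text_file_to_df(spark, path, sep=" "):
    """Hypothetical helper: the same steps on an RDD, mirroring the Scala answer."""
    rdd = spark.sparkContext.textFile(path)
    header = rdd.first()
    columns = header.split(sep)
    # Drop every line equal to the header, then split the remaining lines.
    rows = rdd.filter(lambda line: line != header).map(lambda line: line.split(sep))
    return rows.toDF(columns)


# Local check on lines shaped like the file in the question.
columns, rows = split_header(["name age salary", "sai 25 1000", "bum 30 1500"])
```

Unlike the tuple in the Scala version, splitting into a list does not fix the number of columns in code, but every line still has to yield as many fields as the header.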

base64 decoding of a dataframe

I have an encoded DataFrame and managed to decode it using the following PySpark code. Is there a simple way to add the decoded values as an additional column of the DataFrame itself, in Scala or PySpark?
import base64
import numpy as np
df = spark.read.parquet("file_path")
# base64.decodestring was removed in Python 3.9; b64decode replaces it
encodedColumn = base64.b64decode(df.take(1)[0].column2)
t1 = np.frombuffer(encodedColumn, dtype='<f4')
I looked up multiple similar questions, but couldn't get them to work.
Edit:
Got it working with help from a colleague.
import java.nio.{ByteBuffer, ByteOrder}
import java.util.Base64
import org.apache.spark.sql.functions.{col, udf}
def binaryToFloatArray(stringValue: String): Array[Float] = {
  val t: Array[Byte] = Base64.getDecoder().decode(stringValue)
  val b = ByteBuffer.wrap(t).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer()
  // the decoded buffer is expected to hold at least 2048 floats
  val copy = new Array[Float](2048)
  b.get(copy)
  copy
}
val binaryToFloatArrayUDF = udf(binaryToFloatArray _)
val finalResultDf = dftest.withColumn("myFloatArray", binaryToFloatArrayUDF(col("_2"))).drop("_2")
Spark has built-in base64 and unbase64 functions for this.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=streaming#pyspark.sql.functions.base64
For example:
from pyspark.sql.functions import unbase64, base64
got = spark.createDataFrame([(1, "Jon"), (2, "Danny"), (3, "Tyrion")], ("id", "name"))
got.show()
+---+------+
| id|  name|
+---+------+
|  1|   Jon|
|  2| Danny|
|  3|Tyrion|
+---+------+
encoded_got = got.withColumn('encoded_base64', base64(got.name))
encoded_got.show()
+---+------+--------------+
| id|  name|encoded_base64|
+---+------+--------------+
|  1|   Jon|          Sm9u|
|  2| Danny|      RGFubnk=|
|  3|Tyrion|      VHlyaW9u|
+---+------+--------------+
# cast("string") converts the binary result of unbase64 back to a string
decoded_got = encoded_got.withColumn('decoded_base64', unbase64(encoded_got.encoded_base64).cast("string"))
decoded_got.show()
+---+------+--------------+--------------+
| id|  name|encoded_base64|decoded_base64|
+---+------+--------------+--------------+
|  1|   Jon|          Sm9u|           Jon|
|  2| Danny|      RGFubnk=|         Danny|
|  3|Tyrion|      VHlyaW9u|        Tyrion|
+---+------+--------------+--------------+

Scala: Find the maximum value across each row of a dataframe

For each row of a DataFrame, I would like to extract the maximum value and put it in a new column.
The example code below gives me a DataFrame ('dfmax') of each maximum value:
val donuts = Seq((2.0, 1.50, 3.5), (4.2, 22.3, 10.8), (33.6, 2.50, 7.3))
val df = sparkSession
.createDataFrame(donuts)
.toDF("col1", "col2", "col3")
df.show()
import sparkSession.implicits._
val dfmax = df.map(r => r.getValuesMap[Double](df.schema.fieldNames).map(r => r._2).max)
dfmax.show
This gives me df:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 2.0| 1.5| 3.5|
| 4.2|22.3|10.8|
|33.6| 2.5| 7.3|
+----+----+----+
and dfmax:
+-----+
|value|
+-----+
|  3.5|
| 22.3|
| 33.6|
+-----+
I would like to have these two frames combined in one table preferably using .withColumn or similar in a style like this (which I cannot get to work):
def maxValue(data: DataFrame): DataFrame = {
val dfmax = df.map(r => r.getValuesMap[Double](df.schema.fieldNames).map(r => r._2).max)
dfmax
}
val udfMaxValue = udf(maxValue _)
df.withColumn("max", udfMaxValue(df))
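One way to get the per-row maximum as an extra column, sketched here in PySpark rather than Scala, is the built-in greatest function, which compares columns within a row and needs no UDF. The names row_max and with_row_max are hypothetical; the sketch assumes every column of the DataFrame is numeric and that there are at least two of them. The pure-Python function only mirrors the semantics for a local check.

```python
def row_max(values):
    """Largest non-null value in one row, or None if every value is null."""
    present = [v for v in values if v is not None]
    return max(present) if present else None


def with_row_max(df, target="max"):
    """Hypothetical helper: add the per-row maximum with the built-in greatest."""
    # Imported lazily so the local check runs without Spark installed.
    from pyspark.sql import functions as f

    return df.withColumn(target, f.greatest(*[f.col(c) for c in df.columns]))


# Local check on the rows from the question.
donuts = [(2.0, 1.50, 3.5), (4.2, 22.3, 10.8), (33.6, 2.50, 7.3)]
maxima = [row_max(row) for row in donuts]
```

The attempt above fails because a UDF receives column values for one row at a time, not a whole DataFrame; greatest is also available in org.apache.spark.sql.functions for Scala.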

How to transform a string column of a dataframe into a column of Array[String] with Apache Spark and Scala

I have a DataFrame with a column 'title_from'.
This column contains a sentence, and I want to transform it into an Array[String]. I have tried something like this, but it does not work:
val newDF = df.select("title_from").map(x => x.split("\\\s+")
How can I achieve this? How can I transform a DataFrame of strings into a DataFrame of Array[String]? I want every row of newDF to be an array of the words from the corresponding row of df.
Thanks for any help!
You can use the withColumn function.
import org.apache.spark.sql.functions._
val newDF = df.withColumn("split_title_from", split(col("title_from"), "\\s+"))
.select("split_title_from")
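For comparison, a minimal PySpark sketch of the same withColumn and split approach. The helper names split_words and with_split_column are hypothetical; the whitespace pattern is the one from the answer above, and the pure-Python function mirrors it for a local check.

```python
import re


def split_words(sentence):
    """Split on runs of whitespace, like the pattern used with split above."""
    return re.split(r"\s+", sentence)


def with_split_column(df, source="title_from", target="split_title_from"):
    """Hypothetical helper: add an array-of-words column and keep only that column."""
    # Imported lazily so the local check runs without Spark installed.
    from pyspark.sql import functions as f

    return df.withColumn(target, f.split(f.col(source), r"\s+")).select(target)


# Local check; the double space is treated as a single separator.
words = split_words("the quick  brown fox")
```

A sentence with leading whitespace yields an empty first element, so trimming the column first may be worthwhile.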
You can try the following to get a list of all authors:
scala> val df = Seq((1,"a1,a2,a3"), (2,"a1,a4,a10")).toDF("id","author")
df: org.apache.spark.sql.DataFrame = [id: int, author: string]
scala> df.show()
+---+---------+
| id|   author|
+---+---------+
|  1| a1,a2,a3|
|  2|a1,a4,a10|
+---+---------+
scala> df.select("author").show
+---------+
|   author|
+---------+
| a1,a2,a3|
|a1,a4,a10|
+---------+
scala> df.select("author").flatMap( row => { row.get(0).toString().split(",")}).show()
+-----+
|value|
+-----+
|   a1|
|   a2|
|   a3|
|   a1|
|   a4|
|  a10|
+-----+

how to use Regexp_replace in spark

I am pretty new to Spark and would like to perform an operation on a column of a DataFrame that replaces every comma (,) in the column with a dot (.).
Assume there is a dataframe x and column x4
x4
1,3435
1,6566
-0,34435
I want the output to be as
x4
1.3435
1.6566
-0.34435
The code I am using is
import org.apache.spark.sql.Column
def replace = regexp_replace((x.x4,1,6566:String,1.6566:String)x.x4)
But I get the following error
import org.apache.spark.sql.Column
<console>:1: error: ')' expected but '.' found.
def replace = regexp_replace((train_df.x37,0,160430299:String,0.160430299:String)train_df.x37)
Any help on the syntax, logic or any other suitable way would be much appreciated
Here's a reproducible example, assuming x4 is a string column.
import org.apache.spark.sql.functions.regexp_replace
val df = spark.createDataFrame(Seq(
(1, "1,3435"),
(2, "1,6566"),
(3, "-0,34435"))).toDF("Id", "x4")
The syntax is regexp_replace(str, pattern, replacement), which translates to:
df.withColumn("x4New", regexp_replace(df("x4"), "\\,", ".")).show
+---+--------+--------+
| Id|      x4|   x4New|
+---+--------+--------+
|  1|  1,3435|  1.3435|
|  2|  1,6566|  1.6566|
|  3|-0,34435|-0.34435|
+---+--------+--------+
We could use the map method to do this transformation:
scala> df.map(each => {
  (each.getInt(0), each.getString(1).replaceAll(",", "."))
})
.toDF("Id", "x4")
.show
Output:
+---+--------+
| Id|      x4|
+---+--------+
|  1|  1.3435|
|  2|  1.6566|
|  3|-0.34435|
+---+--------+
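The regexp_replace(str, pattern, replacement) form from the first answer carries over to PySpark almost unchanged. A minimal sketch, with comma_to_dot and with_decimal_points as hypothetical names; the pure-Python function applies the same pattern with the re module so it can be checked without a cluster, on the assumption that this simple pattern behaves the same in Java and Python regex.

```python
import re


def comma_to_dot(value):
    """Replace every comma with a dot, mirroring regexp_replace(col, ",", ".")."""
    return re.sub(",", ".", value)


def with_decimal_points(df, source="x4", target="x4New"):
    """Hypothetical helper: the regexp_replace answer expressed in PySpark."""
    # Imported lazily so the local check runs without Spark installed.
    from pyspark.sql import functions as f

    return df.withColumn(target, f.regexp_replace(f.col(source), ",", "."))


# Local check on the values from the question.
converted = [comma_to_dot(v) for v in ["1,3435", "1,6566", "-0,34435"]]
```

The result is still a string column; adding .cast("double") to the expression would be one way to turn it into a numeric one.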