I have code that converts a DataFrame to a DynamicFrame, and I get this weird error when the return statement executes. Any clues what's going on?
Error:
{AttributeError}'str' object has no attribute '_jvm'
# record is DynamicFrame
def extractCustomFields(record, ctx):
    rec = record.toDF()
    rec = rec.withColumn("lastname", rec["customfields"][0].value)
    rec.show()
    return DynamicFrame.fromDF(rec, ctx, "recordTransform")
fromDF() expects the GlueContext as its second argument, and the error ('str' object has no attribute '_jvm') tells you that ctx is a string here, not a GlueContext. Make sure the caller passes the actual GlueContext object into extractCustomFields rather than its name as a string; the string "recordTransform" in the third position is only the transformation context name.
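For example (a hypothetical call site; glueContext here is whatever name you gave the GlueContext you built from the SparkContext, and dynamic_frame stands in for your input DynamicFrame):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())
# pass the GlueContext object itself, not the string "glueContext"
result = extractCustomFields(dynamic_frame, glueContext)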
I'd like to create a decorator that handles errors inside a pandas_udf. I've made a few attempts with no luck, so I wanted to see if anyone has been successful in doing this.
Below is some initial code I've tried, but it fails. In this example, I'm trying to decorate the function pandas_divide with both pandas_udf and a new decorator to detect errors, return_code.
I'm not sure if my idea is possible given that pandas UDFs require us to define a single return data type (whereas this idea of wrapping the call in a safe call would allow either the output of the function or an exception to be returned in the column). I tried researching whether I could define a new PySpark data type that is the union of a data type, an exception, and None, but did not have any luck; is this possible?
I was also thinking of using a closure to try and get this functionality, but closures are new to me so I'm still looking into this.
from pyspark.sql import types as T
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
# create dataframe for testing
df = spark.range(0, 10).withColumn('id', (F.col('id') / 10).cast('integer')).withColumn('v', F.rand())
columns = ['id', 'v']
vals = [(1, 2), (2, 0), (3, 0)]
new_rows_df = spark.createDataFrame(vals, columns)
df = df.union(new_rows_df)
df.cache()
df.count()
display(df)
class ReturnCode:
    def __init__(self):
        self.pass1 = 'PASS'
        self.fail1 = 'FAIL'

    def __call__(self, fn, *args, **kwargs):
        def inner_func(self, *args, **kwargs):
            try:
                output = fn(**kwargs)
                return_code = self.pass1
            except Exception as ex:
                output = f"{ex}"
                return_code = self.fail1
            return (return_code, output)
        return inner_func

return_code = ReturnCode()

@pandas_udf(T.StructType([T.StructField('return_code', T.StringType()), T.StructField('value', T.IntegerType())]))
@return_code
def pandas_divide(v):
    if v == 0:
        raise
    return 1/v

# pandas_divide(0)[0]
df = df.withColumn('pandas_divide', pandas_divide(F.col('v')))
df.show()
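For what it's worth, one approach that should work is to do the error wrapping inside a single pandas UDF that returns a struct of (return_code, value). This is a minimal sketch, assuming Spark 3.x (where a pandas UDF may return a pandas.DataFrame for a StructType result); safe_call, divide_one_by and result_schema are illustrative names, the imports from the snippet above are reused, and value is widened to a double since 1/v is not an integer:

import pandas as pd

# assumed result schema: a struct of (return_code, value)
result_schema = T.StructType([
    T.StructField('return_code', T.StringType()),
    T.StructField('value', T.DoubleType()),
])

def safe_call(fn):
    # wrap a scalar function so a failure becomes ('FAIL', None) instead of raising
    def wrapper(x):
        try:
            return ('PASS', float(fn(x)))
        except Exception:
            return ('FAIL', None)
    return wrapper

def divide_one_by(x):
    # mirrors the original pandas_divide: raise on zero so the wrapper has something to catch
    if x == 0:
        raise ZeroDivisionError('division by zero')
    return 1 / x

@pandas_udf(result_schema)
def pandas_divide_safe(v: pd.Series) -> pd.DataFrame:
    # apply the wrapped scalar function element-wise and return one struct per row
    rows = [safe_call(divide_one_by)(x) for x in v]
    return pd.DataFrame(rows, columns=['return_code', 'value'])

df = df.withColumn('pandas_divide', pandas_divide_safe(F.col('v')))
df.select('v', 'pandas_divide.return_code', 'pandas_divide.value').show()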
I load data from a database into a Spark DataFrame named DF, and then I need to extract the records whose ID meets a special condition. So I define this function:
def hash_id(id: String): Int = {
  val two_char = id.takeRight(2).toInt
  val hash_result = two_char % 4
  return hash_result
}
Then, I use the function in this query:
DF.filter(hash_id("ID")===3)
But I receive this error:
value === is not a member of Int
DF has an ID column.
Would you please guide me on how to use a custom function in a where/filter clause?
Any help would be really appreciated.
=== can only be used between Column objects. That's why you get the error value === is not a member of Int: the return type of your function hash_id is Int, not Column.
To be able to use your function, you should convert it to a user-defined function (UDF) and apply it to a column object, as follows:
import org.apache.spark.sql.functions.{col, udf}

def hash_id(id: String): Int = {
  val two_char = id.takeRight(2).toInt
  val hash_result = two_char % 4
  return hash_result
}

val hash_id_udf = udf((id: String) => hash_id(id))

DF.filter(hash_id_udf(col("ID")) === 3)
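As a side note, since hash_id is already a String => Int function, you can also lift it into a UDF directly; this is equivalent to the lambda above:

val hash_id_udf = udf(hash_id _)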
I'm trying to take a DataFrame and use it as input to a stored proc in Teradata. Here is the code:
import java.sql.{Connection, DriverManager}

def dfToStoredProc(store_id: String) = {
  var connection: Connection = null
  Class.forName(driver)
  connection = DriverManager.getConnection(url, username, password)
  val statement = connection.prepareCall("CALL DB.STORED_PROC(?);")
  statement.setString(1, store_id)
  statement.execute()
}
val dataFrame = df.toDF()
dataFrame.map(m => dfToStoredProc(m.getLong(0).toString))
However, I'm getting an error. Can anyone help?
I've realised my mistake: I forgot to add a collect() call.
dataFrame.map(m => dfToStoredProc(m.getLong(0).toString)).collect()
An empty result was being passed to the stored proc, which raised the error, because without an action the map transformation is lazy and never actually runs.
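As an aside, dfToStoredProc opens a fresh JDBC connection for every row. A common alternative is foreachPartition: open one connection per partition on the executors and call the stored proc for each row. A rough sketch, reusing driver, url, username and password from the question:

import java.sql.DriverManager
import org.apache.spark.sql.Row

df.toDF().foreachPartition { (rows: Iterator[Row]) =>
  Class.forName(driver)
  val connection = DriverManager.getConnection(url, username, password)
  val statement = connection.prepareCall("CALL DB.STORED_PROC(?);")
  rows.foreach { row =>
    // one call per row, reusing the same connection and prepared call
    statement.setString(1, row.getLong(0).toString)
    statement.execute()
  }
  statement.close()
  connection.close()
}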
Scala/Spark newbie here. I have inherited some old code which I have refactored and have been trying to use to retrieve data from Scylla. The code looks like this:
val TEST_QUERY = s"SELECT user_id FROM test_table WHERE name = ? AND id_type = 'test_type';"
var selectData = List[Row]()
dataRdd.foreachPartition {
  iter => {
    // Build up a cluster that we can connect to
    // Start a session with the cluster by connecting to it.
    val cluster = ScyllaConnector.getCluster(clusterIpString, scyllaPreferredDc, scyllaUsername, scyllaPassword)
    var batchCounter = 0
    val session = cluster.connect(tableConfig.keySpace)
    val preparedStatement: PreparedStatement = session.prepare(TEST_QUERY)
    iter.foreach {
      case (test_name: String) => {
        // Get results
        val testResults = session.execute(preparedStatement.bind(test_name))
        if (testResults != null) {
          val testResult = testResults.one()
          if (testResult != null) {
            val user_id = testResult.getString("user_id")
            selectData ::= Row(user_id, test_name)
          }
        }
      }
    }
    session.close()
    cluster.close()
  }
}
println("Head is =======> ")
println(selectData.head)
The above does not return any data and fails with a NullPointerException because the selectData list is empty, although there is definitely data in there that matches the select statement. I feel like the way I'm doing it is not correct, but I can't figure out what needs to change to fix it, so any help is much appreciated.
PS: The whole idea of using a list to keep the results is so that I can use that list to create a DataFrame. I'd be grateful if you could point me in the right direction here.
If you look at the definition of the foreachPartition function, you will see that by definition it can't return anything, because its return type is void. Moreover, the closure runs on the executors, so appending to the driver-side selectData list there never shows up on the driver.
Anyway, it's a very bad way of querying data from Cassandra/Scylla from Spark. For that there is the Spark Cassandra Connector, which should be able to work with Scylla as well because of the protocol compatibility.
To read a dataframe from Cassandra just do:
spark.read
.format("cassandra")
.option("keyspace", "ksname")
.option("table", "tab")
.load()
The documentation is quite detailed, so just read it.
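Once the table is loaded through the connector, the per-name lookup from the original code becomes an ordinary filter plus join. A rough sketch (namesDF is assumed to hold the lookup names in a name column, and the connection itself is configured via spark.cassandra.connection.host and related options):

import org.apache.spark.sql.functions.col

val scyllaDF = spark.read
  .format("cassandra")
  .option("keyspace", "ksname")
  .option("table", "tab")
  .load()

// equivalent of: SELECT user_id FROM test_table WHERE name = ? AND id_type = 'test_type'
val selectDataDF = scyllaDF
  .filter(col("id_type") === "test_type")
  .join(namesDF, Seq("name"))
  .select("user_id", "name")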
So, I want to do certain operations on my Spark DataFrame, write them to a DB, and create another DataFrame at the end. It looks like this:
import sqlContext.implicits._
val newDF = myDF.mapPartitions(
  iterator => {
    val conn = new DbConnection
    iterator.map(
      row => {
        addRowToBatch(row)
        convertRowToObject(row)
      })
    conn.writeTheBatchToDB()
    conn.close()
  })
  .toDF()
This gives me an error because mapPartitions expects a return type of Iterator[NotInferedR], but here it is Unit. I know this is possible with foreachPartition, but I'd like to do the mapping as well. Doing it separately would be an overhead (an extra Spark job). What should I do?
Thanks!
In most cases, eagerly consuming the iterator will fail the execution, or at least slow the job down. So what I did was check whether the iterator is already empty, and only then do the cleanup routines.
rdd.mapPartitions(itr => {
  val conn = new DbConnection
  itr.map(data => {
    val yourActualResult = // do something with your data and conn here
    if (itr.isEmpty) conn.close // close the connection
    yourActualResult
  })
})
I thought this was a Spark problem at first, but it was actually a Scala one: http://www.scala-lang.org/api/2.12.0/scala/collection/Iterator.html#isEmpty:Boolean
The last expression in the anonymous function implementation must be the return value:
import sqlContext.implicits._

val newDF = myDF.mapPartitions(
  iterator => {
    val conn = new DbConnection
    // using toList to force eager computation - make it happen now when connection is open
    val result = iterator.map(/* the same... */).toList
    conn.writeTheBatchToDB()
    conn.close()
    result.iterator
  }
).toDF()
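For completeness, here is the same answer with the elided map body filled in from the question (still a sketch; DbConnection, addRowToBatch and convertRowToObject are the question's own placeholders):

import sqlContext.implicits._

val newDF = myDF.mapPartitions(
  iterator => {
    val conn = new DbConnection
    // materialize the mapped rows while the connection is still open
    val result = iterator.map(row => {
      addRowToBatch(row)
      convertRowToObject(row)
    }).toList
    conn.writeTheBatchToDB()
    conn.close()
    result.iterator
  }
).toDF()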