I'm trying to take a DataFrame and use its rows as input to a stored procedure in Teradata. Here is the code:
def dfToStoredProc(store_id: String) = {
  var connection: Connection = null
  Class.forName(driver)
  connection = DriverManager.getConnection(url, username, password)
  val statement = connection.prepareCall("CALL DB.STORED_PROC(?);")
  statement.setString(1, store_id)
  statement.execute()
}

val dataFrame = df.toDF()
dataFrame.map(m => dfToStoredProc(m.getLong(0).toString))
However, I'm getting an error. Can anyone help?
I've realised my mistake: I forgot to add a collect() call.

dataFrame.map(m => dfToStoredProc(m.getLong(0).toString)).collect()

Without an action like collect(), the map is never evaluated, so no data was reaching the stored proc and it was raising an error.
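As an aside, since the stored procedure call is purely a side effect, a sketch like the one below (reusing the driver, url, username and password values already assumed in dfToStoredProc) runs the calls with foreachPartition, which is itself an action and opens one JDBC connection per partition instead of one per row:

import java.sql.DriverManager

dataFrame.rdd.foreachPartition { rows =>
  // Connection settings are the values already assumed in the question's code.
  Class.forName(driver)
  val connection = DriverManager.getConnection(url, username, password)
  val statement = connection.prepareCall("CALL DB.STORED_PROC(?);")
  try {
    rows.foreach { row =>
      statement.setString(1, row.getLong(0).toString)
      statement.execute()
    }
  } finally {
    statement.close()
    connection.close()
  }
}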
I'm trying to insert the rows of a DataFrame into a Postgres database and to write the generated primary keys back into this DataFrame.
I'm doing this:
def createConnexionRds():
    host = "..."
    database = "..."
    conn = pg8000.connect(
        user="...",
        host=host,
        database=database,
        password="...",
        ssl_context=True)
    return conn

def insertProcess(r):
    conn = createConnexionRds()
    insertResults = conn.run(r["tmp_query"])
    insertResult = "NOT_INSERTED"
    if len(insertResults) > 0:
        insertResult = insertResults[0][0]
    conn.commit()
    conn.close()
    return insertResult

def insertPerQuery(myDataframe):
    query = sf.lit("insert into tabAAA (colBBB) values ('valueCCC') returning idAAA")
    myDataframe = myDataframe.withColumn("tmp_query", query)
    myDataframe = myDataframe.drop("idAAA")
    rdd = myDataframe.rdd.map(
        lambda x: (*x, insertProcess(x))
    )
    myDataframe = myDataframe.withColumn("idAAA", sf.lit(""))
    myDataframe = sqlContext.createDataFrame(rdd, myDataframe.schema)
    myDataframe = myDataframe.drop("tmp_query")
    return myDataframe

df = insertPerQuery(df)
# df.show(100, False)
The issue is that when I comment out df.show(...) (the last line), the inserts are not processed. And if I call df.show() a second time, the inserts are duplicated.
This is for an AWS Glue job.
Thanks.
This is due to the lazy evaluation nature of Spark. The code only gets executed on the executors once you call an action, in this case .show().
I have code that converts a DataFrame to a DynamicFrame, and I get this weird error when trying to execute the return statement. Any clues what's going on?
Error:
{AttributeError}'str' object has no attribute '_jvm'
# record is a DynamicFrame
def extractCustomFields(record, ctx):
    rec = record.toDF()
    rec = rec.withColumn("lastname", rec["customfields"][0].value)
    rec.show()
    return DynamicFrame.fromDF(rec, ctx, "recordTransform")
fromDF() expects the GlueContext as its second argument, and the 'str' object has no attribute '_jvm' error means that what is being passed as ctx is a string rather than the GlueContext object. Make sure the caller passes the job's actual GlueContext here:
return DynamicFrame.fromDF(rec, ctx, "recordTransform")
Scala/Spark newbie here. I have inherited some old code which I have refactored and have been trying to use to retrieve data from Scylla. The code looks like:
val TEST_QUERY = s"SELECT user_id FROM test_table WHERE name = ? AND id_type = 'test_type';"

var selectData = List[Row]()
dataRdd.foreachPartition {
  iter => {
    // Build up a cluster that we can connect to
    // Start a session with the cluster by connecting to it.
    val cluster = ScyllaConnector.getCluster(clusterIpString, scyllaPreferredDc, scyllaUsername, scyllaPassword)
    var batchCounter = 0
    val session = cluster.connect(tableConfig.keySpace)
    val preparedStatement: PreparedStatement = session.prepare(TEST_QUERY)
    iter.foreach {
      case (test_name: String) => {
        // Get results
        val testResults = session.execute(preparedStatement.bind(test_name))
        if (testResults != null) {
          val testResult = testResults.one()
          if (testResult != null) {
            val user_id = testResult.getString("user_id")
            selectData ::= Row(user_id, test_name)
          }
        }
      }
    }
    session.close()
    cluster.close()
  }
}
println("Head is =======> ")
println(selectData.head)
The above does not return any data and fails with a NullPointerException because the selectData list is empty, although there is definitely data in there that matches the select statement. I feel like the way I'm doing it is not correct, but I can't figure out what needs to change in order to get this fixed, so any help is much appreciated.
PS: The whole idea of me using a list to keep the results is so that I can use that list to create a dataframe. I'd be grateful if you could point me in the right direction here.
If you look at the definition of the foreachPartition function, you will see that by definition it can't return anything, because its return type is void. On top of that, selectData only exists on the driver; the list appended to inside foreachPartition is a deserialized copy in each executor's closure, so the driver-side list stays empty.
Anyway, it's a very bad way of querying data from Cassandra/Scylla from Spark. For that there is the Spark Cassandra Connector, which should work with Scylla as well because of the protocol compatibility.
To read a dataframe from Cassandra just do:
spark.read
  .format("cassandra")
  .option("keyspace", "ksname")
  .option("table", "tab")
  .load()
The documentation is quite detailed, so just read it.
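To get from there to the DataFrame mentioned in the PS, here is a hedged sketch (keyspace and table names are taken from the question; the name/id_type filter columns are assumptions based on TEST_QUERY) that replaces the per-row lookups with a join against the table read through the connector:

import org.apache.spark.sql.functions.col
import spark.implicits._ // for toDF on the RDD of names

// Read the table through the connector, same format/options style as above.
val users = spark.read
  .format("cassandra")
  .option("keyspace", tableConfig.keySpace)
  .option("table", "test_table")
  .load()
  .filter(col("id_type") === "test_type")
  .select("name", "user_id")

// dataRdd holds the names being looked up; the join replaces the manual
// session.execute loop and already yields a DataFrame instead of a List[Row].
val selectDf = dataRdd.toDF("name").join(users, Seq("name"), "inner")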
The following code is causing a java.lang.NullPointerException.
val sqlContext = new SQLContext(sc)

val dataFramePerson = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(CustomSchema1).load("c:\\temp\\test.csv")
val dataFrameAddress = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(CustomSchema2).load("c:\\temp\\test2.csv")

val personData = dataFramePerson.map(data => {
  val addressData = dataFrameAddress.filter(i => i.getAs("ID") == data.getAs("ID"));
  var address: Address = null;
  if (addressData != null) {
    val addressRow = addressData.first;
    address = addressRow.asInstanceOf[Address];
  }
  Person(data.getAs("Name"), data.getAs("Phone"), address)
})
I narrowed it down to the following line, which is causing the exception.
val addressData = dataFrameAddress.filter(i => i.getAs("ID") == data.getAs("ID"));
Can someone point out what the issue is?
Your code has a big structural flaw: you can only refer to dataframes in code that executes on the driver, not in code that runs on the executors. Your code references another dataframe from within a map, which is executed on the executors. See this link: Can I use Spark DataFrame inside regular Spark map operation?
val personData = dataFramePerson.map(data => { // WITHIN A MAP
  val addressData = dataFrameAddress.filter(i => // <--- REFERRING TO OTHER DATAFRAME WITHIN A MAP
    i.getAs("ID") == data.getAs("ID"));
  var address: Address = null;
  if (addressData != null) {
What you want to do instead is a left outer join, then do further processing.
dataFramePerson.join(dataFrameAddress, Seq("ID"), "left_outer")
Note also that when using getAs you want to specify the type, like getAs[String]("ID").
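A hedged sketch of how the join-then-map could look end to end; the Address column names ("Street", "City") are placeholders, since CustomSchema2 isn't shown in the question:

val joined = dataFramePerson.join(dataFrameAddress, Seq("ID"), "left_outer")

val personData = joined.map { row =>
  // "Street" and "City" are assumed column names, not taken from the question.
  val address =
    if (row.isNullAt(row.fieldIndex("Street"))) null
    else Address(row.getAs[String]("Street"), row.getAs[String]("City"))
  Person(row.getAs[String]("Name"), row.getAs[String]("Phone"), address)
}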
The only thing that can be said is that either dataFrameAddress, or i, or data is null. Use your favorite debugging technique to find out which one it actually is, e.g. a debugger, print statements, or logs.
Note that if you see the filter call in the stack trace of your NullPointerException, it would mean that only i or data could be null. On the other hand, if you don't see the filter call, it would rather mean that it is dataFrameAddress that is null.
I am running a query through my HiveContext
Query:
val hiveQuery = s"""SELECT post_domain, post_country, post_geo_city, post_geo_region
  FROM $database.$table
  WHERE year=$year and month=$month and day=$day and hour=$hour and event_event_id='$uniqueIdentifier'"""
val hiveQueryObj: DataFrame = hiveContext.sql(hiveQuery)
Originally, I was extracting each value from the column with:
hiveQueryObj.select(column).collectAsList().get(0).get(0).toString
However, I was told to avoid this because it makes too many connections to Hive. I am pretty new to this area so I'm not sure how to extract the column values efficiently. How can I perform the same logic in a more efficient way?
I plan to implement this in my code:
val arr = Array("post_domain", "post_country", "post_geo_city", "post_geo_region")
arr.foreach(column => {
  // expected Map
  val ex = expected.get(column).get
  val actual = hiveQueryObj.select(column).collectAsList().get(0).get(0).toString
  assert(actual.equals(ex))
})
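A hedged sketch of one way to reduce this to a single Spark job, assuming the query returns exactly one matching row: select all four columns, collect that row once, and read each value from it locally instead of calling collectAsList per column.

val columns = Array("post_domain", "post_country", "post_geo_city", "post_geo_region")

// One select + collect for all four columns; per-column lookups then happen
// on the already-collected Row rather than as separate Spark jobs.
val row = hiveQueryObj.select(columns.head, columns.tail: _*).collect().head

columns.foreach { column =>
  val ex = expected.get(column).get // expected is the Map from the snippet above
  val actual = row.getAs[Any](column).toString
  assert(actual.equals(ex))
}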