Spark Error:- "value foreach is not a member of Object" - scala

The dataframe consists of two columns (s3ObjectName, batchName) and tens of thousands of rows, for example:
s3ObjectName   batchName
a1.json        45
b2.json        45
c3.json        45
d4.json        46
e5.json        46
The objective is to retrieve each object from an S3 bucket and write it to the data lake in parallel, using the details from each row of the dataframe with the foreachPartition() and foreach() functions:
import java.io.{BufferedWriter, File, FileWriter}

import com.amazonaws.auth.{AWSStaticCredentialsProvider, BasicAWSCredentials}
import com.amazonaws.regions.Regions
import com.amazonaws.services.s3.{AmazonS3, AmazonS3ClientBuilder}
import org.apache.commons.io.IOUtils

// s3 connector details defined as an object so it can be serialized and available on all executors in the cluster
object container {
  def getDataSource() = {
    val AccessKey = dbutils.secrets.get(scope = "ADBTEL_Scope", key = "Telematics-TrueMotion-AccessKey-ID")
    val SecretKey = dbutils.secrets.get(scope = "ADBTEL_Scope", key = "Telematics-TrueMotion-AccessKey-Secret")
    val creds = new BasicAWSCredentials(AccessKey, SecretKey)
    val clientRegion: Regions = Regions.US_EAST_1
    AmazonS3ClientBuilder.standard()
      .withRegion(clientRegion)
      .withCredentials(new AWSStaticCredentialsProvider(creds))
      .build()
  }
}

dataframe.foreachPartition(partition => {
  // Initialize s3 connection for each partition
  val client: AmazonS3 = container.getDataSource()
  partition.foreach(row => {
    val s3ObjectName = row.getString(0)
    val batchname = row.getString(1)
    val inputS3Stream = client.getObject("s3bucketname", s3ObjectName).getObjectContent
    val inputS3String = IOUtils.toString(inputS3Stream, "UTF-8")
    val filePath = s"/dbfs/mnt/test/${batchname}/${s3ObjectName}"
    val file = new File(filePath)
    val fileWriter = new FileWriter(file)
    val bw = new BufferedWriter(fileWriter)
    bw.write(inputS3String)
    bw.close()
    fileWriter.close()
  })
})
The above code fails with:
Error: value foreach is not a member of Object

Convert the DataFrame to an RDD before calling foreachPartition. On a DataFrame (Dataset[Row]) the foreachPartition method is overloaded with a Scala (Iterator[T] => Unit) and a Java (ForeachPartitionFunction[T]) variant, so the untyped lambda parameter ends up inferred as Object, which has no foreach method; annotating the parameter type should also resolve it. RDD.foreachPartition takes a plain Iterator[Row] => Unit, so the original loop body compiles unchanged.
dataframe.rdd.foreachPartition(partition => {
})
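For completeness, here is a minimal sketch (not tested) of the loop from the question with only that change applied, keeping the same placeholder bucket name and mount path:

dataframe.rdd.foreachPartition(partition => {
  // One S3 client per partition, created on the executor
  val client: AmazonS3 = container.getDataSource()
  partition.foreach(row => {
    val s3ObjectName = row.getString(0)
    val batchName = row.getString(1)
    val inputS3String = IOUtils.toString(
      client.getObject("s3bucketname", s3ObjectName).getObjectContent, "UTF-8")
    // Write the object to the mounted data lake path
    val bw = new BufferedWriter(new FileWriter(new File(s"/dbfs/mnt/test/${batchName}/${s3ObjectName}")))
    try bw.write(inputS3String) finally bw.close()
  })
})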

Related

dbutils.secrets.get - NoSuchElementException: None.get

The code below executes a 'get' API call to retrieve objects from S3 and write them to the data lake.
The problem arises when I use dbutils.secrets.get to fetch the keys required to establish the connection to S3:
my_dataframe.rdd.foreachPartition(partition => {
  val AccessKey = dbutils.secrets.get(scope = "ADB_Scope", key = "AccessKey-ID")
  val SecretKey = dbutils.secrets.get(scope = "ADB_Scope", key = "AccessKey-Secret")
  val creds = new BasicAWSCredentials(AccessKey, SecretKey)
  val clientRegion: Regions = Regions.US_EAST_1
  val s3client = AmazonS3ClientBuilder.standard()
    .withRegion(clientRegion)
    .withCredentials(new AWSStaticCredentialsProvider(creds))
    .build()
  partition.foreach(x => {
    val objectKey = x.getString(0)
    val i = s3client.getObject(s3bucketName, objectKey).getObjectContent
    val inputS3String = IOUtils.toString(i, "UTF-8")
    val filePath = s"${data_lake_get_path}"
    val file = new File(filePath)
    val fileWriter = new FileWriter(file)
    val bw = new BufferedWriter(fileWriter)
    bw.write(inputS3String)
    bw.close()
    fileWriter.close()
  })
})
The above results in the error:
Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at com.databricks.dbutils_v1.impl.SecretUtilsImpl.sc$lzycompute(SecretUtilsImpl.scala:24)
at com.databricks.dbutils_v1.impl.SecretUtilsImpl.sc(SecretUtilsImpl.scala:24)
at com.databricks.dbutils_v1.impl.SecretUtilsImpl.getSecretManagerClient(SecretUtilsImpl.scala:36)
at com.databricks.dbutils_v1.impl.SecretUtilsImpl.getBytesInternal(SecretUtilsImpl.scala:46)
at com.databricks.dbutils_v1.impl.SecretUtilsImpl.get(SecretUtilsImpl.scala:61)
When the actual values for AccessKey and SecretKey are passed as literals, the above code works fine.
How can this work using dbutils.secrets.get so that the keys are not exposed in the code?
You just need to move the following two lines:
val AccessKey = dbutils.secrets.get(scope = "ADB_Scope", key = "AccessKey-ID")
val SecretKey = dbutils.secrets.get(scope = "ADB_Scope", key = "AccessKey-Secret")
outside of the foreachPartition block. dbutils is only available on the driver, so these calls must run in the driver context; the resulting string values are then captured in the closure and shipped to the worker nodes.
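A minimal sketch of the rearranged code, keeping the same scope, key, and variable names as in the question:

// Secrets are fetched once on the driver; only the resulting strings travel to the executors
val AccessKey = dbutils.secrets.get(scope = "ADB_Scope", key = "AccessKey-ID")
val SecretKey = dbutils.secrets.get(scope = "ADB_Scope", key = "AccessKey-Secret")

my_dataframe.rdd.foreachPartition(partition => {
  val creds = new BasicAWSCredentials(AccessKey, SecretKey)
  val s3client = AmazonS3ClientBuilder.standard()
    .withRegion(Regions.US_EAST_1)
    .withCredentials(new AWSStaticCredentialsProvider(creds))
    .build()
  partition.foreach(x => {
    val inputS3String = IOUtils.toString(
      s3client.getObject(s3bucketName, x.getString(0)).getObjectContent, "UTF-8")
    // ... write inputS3String to the data lake exactly as in the question ...
  })
})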

Performance issue with UDF - is there a better way to solve the transformation? Database write is getting stuck

Input table:

userid    data
123234    {"type":1,"actionData":{"actionType":"Category","id":"1233232","title":"BOSS","config":{"type":"X"}}}

And I need an output table like this:

userid    action
123234    {"type":"Category","template":{"data":"{\"title\":\"BOSS\"}"},"additionalInfo":{"type":1,"config":{"type":"X"}}}
Scala Spark. It is getting stuck during the database write when the UDF is used. I am running:
bin/spark-shell --master local[*] --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.0 --driver-memory 200g
I need a better way to solve it.
import java.io.{File, PrintWriter}

import scala.collection.mutable.HashMap

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, expr, lit, udf}
import org.apache.spark.sql.types.StringType
import org.json4s.jackson.JsonMethods.parse

object testDataMigration extends Serializable {

  def main(cassandra: String): Unit = {
    implicit val spark: SparkSession =
      SparkSession
        .builder()
        .appName("UserLookupMigration")
        .config("spark.master", "local[*]")
        .config("spark.cassandra.connection.host", cassandra)
        .config("spark.cassandra.output.batch.size.rows", "10000")
        .config("spark.cassandra.read.timeoutMS", "60000")
        .getOrCreate()

    val res = time(migrateData()) // time() is a user-defined timing helper (not shown)
    Console.println("Time taken to execute script", res._1)
    spark.stop()
  }

  def migrateData()(implicit spark: SparkSession) {
    val file = new File("validation_count.txt")
    val print_Writer = new PrintWriter(file)
    // Reading data from user_feed table
    val userFeedData = spark.read.format("org.apache.spark.sql.cassandra")
      .option("keyspace", "sunbird").option("table", "TABLE1").load()
    print_Writer.write("User Feed Table records:" + userFeedData.count())

    // Persisting user feed data into memory
    userFeedData.persist()

    val userFeedWithNewUUID = userFeedData
      .withColumn("newId", expr("uuid()"))
      .withColumn("action", myColUpdate(userFeedData("data"),
        userFeedData("createdby"), userFeedData("category")))
    userFeedWithNewUUID.persist()

    val userFeedV2Format = userFeedWithNewUUID.select(
        col("newId"), col("category"), col("createdby"),
        col("createdon"), col("action"), col("expireon"),
        col("priority"), col("status"), col("updatedby"),
        col("updatedon"), col("userid"))
      .withColumnRenamed("newId", "id")
      .withColumn("version", lit("v2").cast(StringType))

    // Persist v2 format data to memory
    userFeedV2Format.persist()
    print_Writer.write("User Feed V2 Format records:" + userFeedV2Format.count())

    userFeedV2Format.write.format("org.apache.spark.sql.cassandra")
      .option("keyspace", "sunbird_notifications")
      .option("table", "TABLE2")
      .mode(SaveMode.Append).save()

    // Remove from memory
    userFeedV2Format.unpersist()
    userFeedData.unpersist()

    print_Writer.close()
  }
  def myColUpdate = udf((data: String, createdby: String, category: String) => {
    val jsonMap = parse(data).values.asInstanceOf[Map[String, Object]]
    val actionDataMap = new HashMap[String, Object]
    val additionalInfo = new HashMap[String, Object]
    val dataTemplate = new HashMap[String, String]
    val templateMap = new HashMap[String, Object]
    val createdByMap = new HashMap[String, Object]
    createdByMap("type") = "System"
    createdByMap("id") = createdby
    var actionType: String = null
    for ((k, v) <- jsonMap) {
      if (k == "actionData") {
        val actionMap = v.asInstanceOf[Map[String, Object]]
        if (actionMap.contains("actionType")) {
          actionType = actionMap("actionType").asInstanceOf[String]
        }
        for ((k1, v1) <- actionMap) {
          if (k1 == "title" || k1 == "description") {
            dataTemplate(k1) = v1.asInstanceOf[String]
          } else {
            additionalInfo(k1) = v1
          }
        }
      } else {
        additionalInfo(k) = v
      }
    }
    val mapper = new ObjectMapper()
    mapper.registerModule(DefaultScalaModule)
    templateMap("data") = mapper.writeValueAsString(dataTemplate)
    templateMap("ver") = "4.4.0"
    templateMap("type") = "JSON"
    actionDataMap("type") = actionType
    actionDataMap("category") = category.asInstanceOf[String]
    actionDataMap("createdBy") = createdByMap
    actionDataMap("template") = templateMap
    actionDataMap("additionalInfo") = additionalInfo
    mapper.writeValueAsString(actionDataMap)
  })
}
It is getting stuck; TABLE1 has 40 million rows.
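For illustration, here is a minimal sketch, assuming the input JSON always has the nested shape of the example row above, of how the same reshaping could be expressed with Spark's built-in from_json/to_json functions on the userFeedData dataframe instead of a row-at-a-time Jackson UDF. It only covers the fields shown in the example and is not a drop-in replacement for the full UDF:

import org.apache.spark.sql.functions.{col, from_json, struct, to_json}
import org.apache.spark.sql.types._

// Schema for the incoming "data" column (only the fields used in the output)
val dataSchema = new StructType()
  .add("type", IntegerType)
  .add("actionData", new StructType()
    .add("actionType", StringType)
    .add("id", StringType)
    .add("title", StringType)
    .add("config", new StructType().add("type", StringType)))

val withAction = userFeedData
  .withColumn("parsed", from_json(col("data"), dataSchema))
  .withColumn("action", to_json(struct(
    col("parsed.actionData.actionType").as("type"),
    // template.data is itself a JSON string, hence the nested to_json
    struct(to_json(struct(col("parsed.actionData.title").as("title"))).as("data")).as("template"),
    struct(col("parsed.type").as("type"),
           col("parsed.actionData.config").as("config")).as("additionalInfo"))))
  .drop("parsed")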

How to speed up log parsing Spark job?

My architecture right now is: AWS ELB writes logs to S3 and sends a message to SQS for further processing by Spark Streaming. It works, but it takes quite a bit of time. I'm new to Spark and Scala, so I just want to make sure I'm not doing something stupid.
val conf = new SparkConf()
  .setAppName("SparrowOrc")
  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .set("spark.speculation", "false")

val sc = new SparkContext(conf)
val streamContext = new StreamingContext(sc, Seconds(1))

val sqs = streamContext.receiverStream(new SQSReceiver("queue")
  .at(Regions.US_EAST_1)
  .withTimeout(5))

// Got 10 messages at a time
val s3Keys = sqs.map(messages => {
  val sqsMsg: JsValue = Json.parse(messages)
  val s3Key = "s3://" +
    Json.stringify(sqsMsg("Records")(0)("s3")("bucket")("name")).replace("\"", "") + "/" +
    Json.stringify(sqsMsg("Records")(0)("s3")("object")("key")).replace("\"", "")
  s3Key
})

val rawLogs: DStream[String] = s3Keys.transform(keys => {
  val fileKeys = keys.collect()
  val files = fileKeys.map(f => {
    sc.textFile(f)
  })
  sc.union(files)
})

val jsonRows = rawLogs.mapPartitions(partition => {
  // Parsing raw log lines to JSON
  val txfm = new LogLine2Json
  val log = Logger.getLogger("parseLog")
  partition.map(line => {
    try {
      txfm.parseLine(line)
    } catch {
      case e: Throwable => { log.info(line); "" }
    }
  }).filter(line => line != "{}")
})

val sqlSession = SparkSession
  .builder()
  .getOrCreate()

// Write to S3
jsonRows.foreachRDD(r => {
  val parsedFormat = new SimpleDateFormat("yyyy-MM-dd/")
  val parsedDate = parsedFormat.format(new java.util.Date())
  val outputPath = "bucket" + parsedDate
  val jsonDf = sqlSession.read.schema(sparrowSchema.schema).json(r)
  jsonDf.write.mode("append").format("orc").option("compression", "zlib").save(outputPath)
})

streamContext.start()
streamContext.awaitTermination()
Here's the DAG, and it seems like everything is merged in the union transformation.
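One observation (a sketch, not a tested fix): each key currently becomes its own sc.textFile RDD before the union. Since textFile accepts a comma-separated list of paths, the collected keys could be read in a single call instead:

val rawLogs: DStream[String] = s3Keys.transform(keys => {
  val fileKeys = keys.collect()
  if (fileKeys.isEmpty) {
    sc.emptyRDD[String]
  } else {
    // One textFile call over all paths avoids building and unioning one RDD per file
    sc.textFile(fileKeys.mkString(","))
  }
})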

Spark DF: Schema for type Unit is not supported

I am new to Scala and Spark and am trying to build on some samples I found. Essentially, I am trying to call a function from within a DataFrame to get the state for a zip code using the Google API.
I have the code working separately, but not together.
Here is the exception from the piece of code that is not working:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type Unit is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:716)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:654)
at org.apache.spark.sql.functions$.udf(functions.scala:2837)
at MovieRatings$.getstate(MovieRatings.scala:51)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:48)
at MovieRatings$$anonfun$4.apply(MovieRatings.scala:47)...
Line 51 starts with def getstate = udf {(zipcode:String)...
...
code:
userDF.createOrReplaceTempView("Users")

// SQL statements can be run by using the sql methods provided by Spark
val zipcodesDF = spark.sql("SELECT distinct zipcode, zipcode as state FROM Users")
// zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
val colNames = zipcodesDF.columns
val cols = colNames.map(cName => zipcodesDF.col(cName))
val theColumn = zipcodesDF("state")
val mappedCols = cols.map(c =>
  if (c.toString() == theColumn.toString()) getstate(c).as("transformed") else c)
val newDF = zipcodesDF.select(mappedCols: _*).show()
}

def getstate = udf { (zipcode: String) => {
  val url = "http://maps.googleapis.com/maps/api/geocode/json?address=" + zipcode
  val result = scala.io.Source.fromURL(url).mkString
  val address = parse(result)
  val shortnames = for {
    JObject(address_components) <- address
    JField("short_name", short_name) <- address_components
  } yield short_name
  val state = shortnames(3)
  //return state.toString()
  val stater = state.toString()
}
}
Thanks for the responses. I think I figured it out; here is the code that works. One thing to note is that the Google API has restrictions, so some valid zip codes don't return state info, but that is not an issue for me.
private def loaduserdata(spark: SparkSession): Unit = {
  import spark.implicits._

  // Create an RDD of User objects from a text file, convert it to a DataFrame
  val userDF = spark.sparkContext
    .textFile("examples/src/main/resources/users.csv")
    .map(_.split("::"))
    .map(attributes => users(attributes(0).trim.toInt, attributes(1), attributes(2).trim.toInt, attributes(3), attributes(4)))
    .toDF()

  // Register the DataFrame as a temporary view
  userDF.createOrReplaceTempView("Users")

  // SQL statements can be run by using the sql methods provided by Spark
  val zipcodesDF = spark.sql("SELECT distinct zipcode, substr(zipcode,1,5) as state FROM Users ORDER BY zipcode desc")
  // zipcodesDF.map(zipcodes => "zipcode: " + zipcodes.getAs[String]("zipcode") + getstate(zipcodes.getAs[String]("zipcode"))).show()
  val colNames = zipcodesDF.columns
  val cols = colNames.map(cName => zipcodesDF.col(cName))
  val theColumn = zipcodesDF("state")
  val mappedCols = cols.map(c =>
    if (c.toString() == theColumn.toString()) getstate(c).as("state") else c)
  val geoDF = zipcodesDF.select(mappedCols: _*) //.show()
  geoDF.createOrReplaceTempView("Geo")
}
val getstate = udf { (zipcode: String) =>
  val url = "http://maps.googleapis.com/maps/api/geocode/json?address=" + zipcode
  val result = scala.io.Source.fromURL(url).mkString
  val address = parse(result)
  val statenm = for {
    JObject(statename) <- address
    JField("types", JArray(types)) <- statename
    JField("short_name", JString(short_name)) <- statename
    if types.toString().equals("List(JString(administrative_area_level_1), JString(political))")
    // if types.head.equals("JString(administrative_area_level_1)")
  } yield short_name
  val str = if (statenm.isEmpty) "N/A" else statenm.head
  str // the String must be the last expression so the UDF's return type is String, not Unit
}
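For context, the root cause of the original error: a Scala block whose last statement is a val definition has type Unit, and Spark cannot derive a schema for Unit when the udf is defined. A minimal sketch (with a made-up zipcode-to-state stand-in) of the failing and working shapes:

import org.apache.spark.sql.functions.udf

// Fails with "Schema for type Unit is not supported": the block ends in a val definition,
// so the lambda's return type is Unit
val badUdf = udf { (zipcode: String) => { val state = zipcode.take(2) } }

// Works: the String is the last expression, so the UDF's result type is String
val goodUdf = udf { (zipcode: String) => zipcode.take(2) }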

Spark Streaming - Broadcast variable - Case Class

My requirement is to enrich the data stream with profile information from an HBase table. I was looking to use a broadcast variable. The whole code is enclosed here.
The output of the HBase reader is as follows.
On the driver node:
HBaseReaderBuilder(org.apache.spark.SparkContext#3c58b102,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List())
On the worker node:
HBaseReaderBuilder(null,hbase_customer_profile,Some(data),WrappedArray(gender, age),None,None,List())
As you can see, it has lost the SparkContext. When I issue the statement val myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3)), I get a NullPointerException:
java.lang.NullPointerException
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toSimpleHBaseRDD(HBaseReaderBuilder.scala:83)
at it.nerdammer.spark.hbase.package$.toSimpleHBaseRDD(package.scala:5)
at it.nerdammer.spark.hbase.HBaseReaderBuilderConversions$class.toHBaseRDD(HBaseReaderBuilder.scala:67)
at it.nerdammer.spark.hbase.package$.toHBaseRDD(package.scala:5)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:34)
at testPartition$$anonfun$main$1$$anonfun$apply$1$$anonfun$apply$2.apply(testPartition.scala:33)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import it.nerdammer.spark.hbase._

object testPartition {

  def main(args: Array[String]): Unit = {

    val sparkMaster = "spark://x.x.x.x:7077"
    val ipaddress = "x.x.x.x:2181" // Zookeeper
    val hadoopHome = "/home/hadoop/software/hadoop-2.6.0"
    val topicname = "new_events_test_topic"

    val mainConf = new SparkConf().setMaster(sparkMaster).setAppName("testingPartition")
    val mainSparkContext = new SparkContext(mainConf)
    val ssc = new StreamingContext(mainSparkContext, Seconds(30))

    val eventsStream = KafkaUtils.createStream(ssc, "x.x.x.x:2181", "receive_rest_events", Map(topicname.toString -> 2))

    val docRdd = mainSparkContext.hbaseTable[(String, Option[String], Option[String])]("hbase_customer_profile")
      .select("gender", "age").inColumnFamily("data")
    println("docRDD from Driver ", docRdd)

    val broadcastedprof = mainSparkContext.broadcast(docRdd)

    eventsStream.foreachRDD(dstream => {
      dstream.foreachPartition(records => {
        println("Broadcasted docRDD - in Worker ", broadcastedprof.value)
        val bcdocRdd = broadcastedprof.value
        records.foreach(record => {
          //val myRdd = bcdocRdd.map(r => Profile(r._1, r._2, r._3))
          //myRdd.foreach(println)
          val Rows = record._2.split("\r\n")
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
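For illustration only, a minimal sketch of the usual pattern, under the assumptions that the profile table fits in driver memory and that Profile is the case class from the commented-out line: an RDD (or an HBase reader builder that carries the SparkContext) cannot be used inside another RDD's closure, so the HBase rows would be collected on the driver and the resulting plain Map broadcast instead:

// Collect the (id, gender, age) tuples on the driver and broadcast a plain Map;
// RDDs themselves cannot be used inside executor-side closures.
val profileMap: Map[String, Profile] = docRdd
  .map { case (id, gender, age) => id -> Profile(id, gender, age) }
  .collect()
  .toMap
val broadcastedProfiles = mainSparkContext.broadcast(profileMap)

eventsStream.foreachRDD(dstream => {
  dstream.foreachPartition(records => {
    val profiles = broadcastedProfiles.value // plain Map, available on the workers
    records.foreach(record => {
      val rows = record._2.split("\r\n")
      // ... look up profiles(customerId) here to enrich each row (customerId is a placeholder) ...
    })
  })
})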