NullPointerException while creating a DataFrame inside foreach() - Scala

I have to read certain files from S3, so I created a CSV file containing the S3 paths of those files. I read that CSV file with the code below:
val listofFilesRDD = sparkSession.read.textFile("s3://"+ file)
This is working fine.
Then I try to read each of those paths and create a DataFrame from it:
listofFilesRDD.foreach(iter => {
  val pathDF = sparkSession.read
    .schema(testSchema)
    .option("header", true)
    .csv("s3://" + iter)
  pathDF.printSchema()
})
However, the above code throws a NullPointerException.
How can I fix it?

You can solve this by collecting the S3 file paths into a local Array on the driver, then iterating over that array and creating each DataFrame there:
val listofFilesRDD = sparkSession.read.textFile("s3://" + file)
// collect() brings the paths to the driver, so the loop below runs on the driver
val listOfPaths = listofFilesRDD.collect()
listOfPaths.foreach(iter => {
  val pathDF = sparkSession.read
    .schema(testSchema)
    .option("header", true)
    .csv("s3://" + iter)
  pathDF.printSchema()
})

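You cannot use an RDD (or the SparkSession) inside an operation on another RDD. The function passed to foreach runs on the executors, where the SparkSession is not usable, which is why you get the NullPointerException. You have to restructure the logic so that the reads happen on the driver.
You can find more about it here: NullPointerException in Scala Spark, appears to be caused be collection type?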
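If all the files share one schema, a variation on the driver-side approach is to pass every collected path to a single read, because DataFrameReader.csv accepts several paths. The sketch below is only an illustration under that assumption: MultiPathRead, readAll and exampleSchema are hypothetical names, and exampleSchema merely stands in for the question's testSchema. Note that it yields one combined DataFrame rather than one DataFrame per file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object MultiPathRead {
  // Hypothetical schema standing in for the question's testSchema.
  val exampleSchema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType)))

  // Runs on the driver: `paths` is a local collection, e.g. the result of collect().
  def readAll(spark: SparkSession, paths: Seq[String]): DataFrame =
    spark.read
      .schema(exampleSchema)
      .option("header", "true")
      .csv(paths: _*) // one read over all paths, producing a single DataFrame
}
```

With the question's names this would be called as MultiPathRead.readAll(sparkSession, listOfPaths.map("s3://" + _)).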

If anyone runs into the same problem with a DataFrame, the same approach works: collect the rows first, then loop over them on the driver.
def parameterjsonParser(queryDF: DataFrame, spark: SparkSession): Unit = {
  queryDF.show()
  val otherDF = queryDF.collect()
  otherDF.foreach { row =>
    row.toSeq.foreach { col =>
      println(col)
      mainJsonParser(col.toString, spark)
    }
  }
}
Thank you, @Sandeep Purohit

Related

How to apply filters on spark scala dataframe view?

I am pasting a snippet here where I am facing issues with the BigQuery read. The "wherePart" covers a large number of records, so the BQ call is invoked again and again. Keeping the filter outside of the BQ read would help. The idea is to first read the "mainTable" from BQ, store it in a Spark view, and then apply the "wherePart" filter to this view in Spark.
["subDate" is a function to subtract one date from another and return the number of days in between]
val Df = getFb(config, mainTable, ds)
def getFb(config: DataFrame, mainTable: String, ds: String): DataFrame = {
  val fb = config.map(row => Target.Pfb(
      row.getAs[String]("m1"),
      row.getAs[String]("m2"),
      row.getAs[Seq[Int]]("days")))
    .collect
  val wherePart = fb.map(x => (x.m1, x.m2, subDate(ds, x.days.max - 1))).
    map(x => s"(idata_${x._1} = '${x._2}' AND ds BETWEEN '${x._3}' AND '${ds}')").
    mkString(" OR ")
  val q = new Q()
  val tempView = "tempView"
  spark.readBigQueryTable(mainTable, wherePart).createOrReplaceTempView(tempView)
  val Df = q.mainTableLogs(tempView)
  Df
}
Could someone please help me here?
Are you using the spark-bigquery-connector? If so, the right syntax is:
spark.read.format("bigquery")
  .load(mainTable)
  .where(wherePart)
  .createOrReplaceTempView(tempView)

Spark UDF with Maxmind Geo Data

I'm trying to use the Maxmind snowplow library to pull out geo data on each IP that I have in a dataframe.
We are using Spark SQL (Spark version 2.1.0) and I created a UDF in the following class:
class UdfDefinitions @Inject() extends Serializable with StrictLogging {
  sparkSession.sparkContext.addFile("s3n://s3-maxmind-db/latest/GeoIPCity.dat")
  val s3Config = configuration.databases.dataWarehouse.s3
  val lruCacheConst = 20000
  val ipLookups = IpLookups(geoFile = Some(SparkFiles.get(s3Config.geoIPFileName)),
    ispFile = None, orgFile = None, domainFile = None, memCache = false, lruCache = lruCacheConst)
  def lookupIP(ip: String): LookupIPResult = {
    val loc: Option[IpLocation] = ipLookups.getFile.performLookups(ip)._1
    loc match {
      case None => LookupIPResult("", "", "")
      case Some(x) => LookupIPResult(Option(x.countryName).getOrElse(""),
        x.city.getOrElse(""), x.regionName.getOrElse(""))
    }
  }
  val lookupIPUDF: UserDefinedFunction = udf(lookupIP _)
}
The intention is to create the pointer to the file (ipLookups) outside the UDF and use it inside, so as not to open the file for each row. This fails with a "Task not serializable" error, and when we call addFile inside the UDF, we get a "too many open files" error (with a large dataset; with a small dataset it works).
This thread shows how to solve the problem using RDDs, but we would like to use Spark SQL: using maxmind geoip in spark serialized
Any thoughts?
Thanks
The problem here is that IpLookups is not Serializable. Yet it performs the lookups from a static file (from what I gathered), so you should be able to fix that. I would advise that you clone the repo and make IpLookups Serializable. Then, to make it work with Spark SQL, wrap everything in a class like you did. Then, in the main Spark job, you can write something as follows:
val IPResolver = new MySerializableIpResolver()
val resolveIP = udf((ip : String) => IPResolver.resolve(ip))
data.withColumn("Result", resolveIP($"IP"))
If you do not have that many distinct IP addresses, there is another solution: you could do everything in the driver.
val ipMap = data.select("IP").distinct.collect
  .map(/* calls to the non serializable IpLookups but that's ok, we are in the driver*/)
  .toMap
val resolveIP = udf((ip : String) => ipMap(ip))
data.withColumn("Result", resolveIP($"IP"))
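If patching the library is not an option, a commonly used alternative (not what the answer above proposes) is to wrap the non-serializable object in a serializable class and hold it in a @transient lazy val. Only the constructor arguments are then serialized with the closure, and the object is rebuilt on first use in each JVM. The sketch below shows the mechanism in plain Scala, without Spark or MaxMind: FakeLookups is a hypothetical stand-in for IpLookups, and SerializableResolver and ResolverDemo are illustrative names.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a non-serializable lookup service such as IpLookups.
class FakeLookups(val dbPath: String) {
  def lookup(ip: String): String = s"$dbPath:$ip"
}

// Serializable wrapper: only dbPath is written out; the lookup object is
// transient and is rebuilt lazily on first use after deserialization.
class SerializableResolver(dbPath: String) extends Serializable {
  @transient private lazy val lookups = new FakeLookups(dbPath)
  def resolve(ip: String): String = lookups.lookup(ip)
}

object ResolverDemo {
  // Java-serialization round trip, roughly what happens to an object
  // captured in a closure when it is shipped to another JVM.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

In a Spark job the wrapper would be created once on the driver and captured by the UDF, in the same way as MySerializableIpResolver above; whether this fits the real IpLookups constructor is an assumption to verify.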

scala java.lang.NullPointerException

The following code is causing java.lang.NullPointerException.
val sqlContext = new SQLContext(sc)
val dataFramePerson = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(CustomSchema1).load("c:\\temp\\test.csv")
val dataFrameAddress = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(CustomSchema2).load("c:\\temp\\test2.csv")
val personData = dataFramePerson.map(data => {
  val addressData = dataFrameAddress.filter(i => i.getAs("ID") == data.getAs("ID"));
  var address:Address = null;
  if (addressData != null) {
    val addressRow = addressData.first;
    address = addressRow.asInstanceOf[Address];
  }
  Person(data.getAs("Name"),data.getAs("Phone"),address)
})
I narrowed it down to the following line, which is causing the exception:
val addressData = dataFrameAddress.filter(i => i.getAs("ID") == data.getAs("ID"));
Can someone point out what the issue is?
Your code has a big structural flaw: you can only refer to DataFrames from code that executes in the driver, not from code that is run by the executors. Your code refers to another DataFrame from within a map, and that map is executed in the executors. See: Can I use Spark DataFrame inside regular Spark map operation?
val personData = dataFramePerson.map(data => { // WITHIN A MAP
  val addressData = dataFrameAddress.filter(i => // <--- REFERRING TO OTHER DATAFRAME WITHIN A MAP
    i.getAs("ID") == data.getAs("ID"));
  var address:Address = null;
  if (addressData != null) {
What you want to do instead is a left outer join, then do further processing.
dataFramePerson.join(dataFrameAddress, Seq("ID"), "left_outer")
Note also that when using getAs you should specify the type, like getAs[String]("ID").
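A minimal sketch of the join-based approach, using small in-memory DataFrames in place of the CSV files. The column names other than ID (Name, Phone, Street) and the JoinSketch object are assumptions made for illustration. A person without an address gets a null in the address column after the left outer join, which is wrapped in an Option here.

```scala
import org.apache.spark.sql.SparkSession

object JoinSketch {
  // Returns (name, phone, optional street) for every person.
  def run(spark: SparkSession): Array[(String, String, Option[String])] = {
    import spark.implicits._ // needed for toDF on a local Seq

    val people = Seq((1, "Ann", "555-0100"), (2, "Bob", "555-0101"))
      .toDF("ID", "Name", "Phone")
    val addresses = Seq((1, "1 Main St")).toDF("ID", "Street")

    // The join runs on the DataFrames; rows are inspected only after collect().
    people.join(addresses, Seq("ID"), "left_outer")
      .collect()
      .map(row => (
        row.getAs[String]("Name"),
        row.getAs[String]("Phone"),
        Option(row.getAs[String]("Street")))) // null becomes None
  }
}
```

For a large result you would keep working on the joined DataFrame instead of collecting it; collect() is used here only to keep the example small.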
The only thing that can be said is that either dataFrameAddress, i, or data is null. Use your favorite debugging technique (debugger, print statements, or logs) to find out which one it is.
Note that if you see the filter call in the stack trace of your NullPointerException, it means that only i or data could be null. On the other hand, if you don't see the filter call, it more likely means that dataFrameAddress is null.

Spark: How to get String value while generating output file

I have two files
--------Student.csv---------
StudentId,City
101,NDLS
102,Mumbai
-------StudentDetails.csv---
StudentId,StudentName,Course
101,ABC,C001
102,XYZ,C002
Requirement
StudentId in the first file should be replaced with StudentName and Course from the second file.
Once replaced I need to generate a new CSV with complete details like
ABC,C001,NDLS
XYZ,C002,Mumbai
Code used
val studentRDD = sc.textFile(file path);
val studentdetailsRDD = sc.textFile(file path);
val studentB = sc.broadcast(studentdetailsRDD.collect)
//Generating CSV
studentRDD.map{student =>
  val name = getName(student.StudentId)
  val course = getCourse(student.StudentId)
  Array(name, course, student.City)
}.mapPartitions{data =>
  val stringWriter = new StringWriter();
  val csvWriter =new CSVWriter(stringWriter);
  csvWriter.writeAll(data.toList)
  Iterator(stringWriter.toString())
}.saveAsTextFile(outputPath)
//Functions defined to get details
def getName(studentId : String) {
  studentB.value.map{stud =>if(studentId == stud.StudentId) stud.StudentName}
}
def getCourse(studentId : String) {
  studentB.value.map{stud =>if(studentId == stud.StudentId) stud.Course}
}
Problem
File gets generated but the values are object representations instead of String value.
How can I get the string values instead of objects ?
As suggested in another answer, Spark's DataFrame API is especially suitable for this, as it easily supports joining two DataFrames, and writing CSV files.
However, if you insist on staying with the RDD API, the main issue with your code is the lookup functions: getName and getCourse effectively do nothing, because their return type is Unit. They are declared with procedure syntax (no = before the body), so whatever the body computes is discarded. In addition, an if without an else has no value for inputs that do not match, so even with an = the map would yield a mix of names and Unit values rather than a single String.
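The effect can be reproduced in plain Scala without Spark. In this sketch, StudentDetail and LookupDemo are hypothetical names; getNameUnit spells out the Unit result type that the question's procedure syntax implies, and getName is one possible corrected form.

```scala
final case class StudentDetail(studentId: String, studentName: String, course: String)

object LookupDemo {
  val details: Seq[StudentDetail] = Seq(
    StudentDetail("101", "ABC", "C001"),
    StudentDetail("102", "XYZ", "C002"))

  // Same meaning as the question's `def getName(studentId: String) { ... }`:
  // the result type is Unit, so the value computed by the body is thrown away.
  def getNameUnit(id: String): Unit = {
    details.map(d => if (id == d.studentId) d.studentName)
  }

  // Declared with `=` and an explicit result type; find returns the first match, if any.
  def getName(id: String): Option[String] =
    details.find(_.studentId == id).map(_.studentName)
}
```

Returning an Option makes the "no such student" case explicit instead of silently producing a non-String value.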
To fix this, it's easier to get rid of them and simplify the lookup by broadcasting a Map:
// better to broadcast a Map instead of an Array, would make lookups more efficient
val studentB = sc.broadcast(studentdetailsRDD.keyBy(_.StudentId).collectAsMap())
// convert to RDD[String] with the wanted formatting
val resultStrings = studentRDD.map { student =>
    val details = studentB.value(student.StudentId)
    Array(details.StudentName, details.Course, student.City)
  }
  .map(_.mkString(",")) // naive CSV writing with no escaping etc., you can also use CSVWriter like you did
// save as text file
resultStrings.saveAsTextFile(outputPath)
Spark has great support for joins and for writing to files. A join takes only one line of code, and so does the write.
Hand-writing that code is error-prone, hard to read, and most likely much slower.
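// toDF on a Seq needs the implicits of an existing SparkSession (assumed here to be named spark)
import spark.implicits._
val df1 = Seq((101, "NDLS"),
  (102, "Mumbai")
).toDF("id", "city")
val df2 = Seq((101, "ABC", "C001"),
  (102, "XYZ", "C002")
).toDF("id", "name", "course")
// select the columns in the order the question asks for: name, course, city
val dfResult = df1.join(df2, "id").select("name", "course", "city")
dfResult.repartition(1).write.csv("hello.csv")
This creates a directory named hello.csv. It contains a single part file with the data, which is the final result.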

How to read file names from column in DataFrame to process using SparkContext.textFile?

I'm new to Spark and I'm stuck on this issue:
From a DataFrame that I have created; called reportesBN, I want to get the value of a field and use it as the path of a text file to read. After that, I want to apply some specific processing to that file.
I have developed this code, but it's not working:
reportesBN.foreach {
  x =>
    val file = x(0)
    val insumo = sc.textFile(s"$file")
    val firstRow = insumo.first.split("\\|", -1)
    // Get values of next rows
    val nextRows = insumo.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
    val dfNextRows = nextRows.map(a => a.split("\\|")).map(x=> BalanzaNextRows(x(0), x(1),
      x(2), x(3), x(4))).toDF()
    val validacionBalanza = new RevisionCampos(sc)
    validacionBalanza.validacionBalanza(firstRow, dfNextRows)
}
The error log indicates that it is because of serialization.
7/06/28 18:55:45 INFO SparkContext: Created broadcast 0 from textFile at ValidacionInsumos.scala:56
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Is this problem caused by the Spark Context (sc) that is inside the foreach?
Is there another way to implement this?
Regards.
You asked a very similar question before, and this is the same issue: you cannot use SparkContext inside an RDD transformation or action. In this case, you use sc.textFile(s"$file") inside reportesBN.foreach, and reportesBN, as you said, is a DataFrame:
From a DataFrame that I have created; called reportesBN
You should rewrite your transformation to take a file from the DataFrame and read it afterwards.
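// This is val file = x(0)
// I assume that the column name is `files`
// (.as[String] needs import spark.implicits._ in scope)
// collect() returns a local Array[String], which supports the foreach used below
val files = reportesBN.select("files").as[String].collect()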
Once you have the collection of files to process, you execute the code in your block.
files.foreach {
x => ...
}