The following code is causing java.lang.NullPointerException.
val sqlContext = new SQLContext(sc)
val dataFramePerson = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(CustomSchema1).load("c:\\temp\\test.csv")
val dataFrameAddress = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(CustomSchema2).load("c:\\temp\\test2.csv")
val personData = dataFramePerson.map(data => {
  val addressData = dataFrameAddress.filter(i => i.getAs("ID") == data.getAs("ID"));
  var address:Address = null;
  if (addressData != null) {
    val addressRow = addressData.first;
    address = addressRow.asInstanceOf[Address];
  }
  Person(data.getAs("Name"),data.getAs("Phone"),address)
})
I narrowed it down to the following line, which is causing the exception.
val addressData = dataFrameAddress.filter(i => i.getAs("ID") == data.getAs("ID"));
Can someone point out what the issue is?
Your code has a major structural flaw: DataFrames can only be referenced from code that runs on the driver, not from code that runs on the executors. Your code refers to another DataFrame from within a map, and the body of a map is executed on the executors. See the related question "Can I use Spark DataFrame inside regular Spark map operation?"
val personData = dataFramePerson.map(data => { // WITHIN A MAP
val addressData = dataFrameAddress.filter(i => // <--- REFERRING TO OTHER DATAFRAME WITHIN A MAP
i.getAs("ID") == data.getAs("ID"));
var address:Address = null;
if (addressData != null) {
What you want to do instead is a left outer join, then do further processing.
dataFramePerson.join(dataFrameAddress, Seq("ID"), "left_outer")
Note also that when using getAs you should specify the type, like getAs[String]("ID").
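The join-then-process approach above can be sketched as follows. This is only a sketch under assumptions: the Address and Person case classes, the Street and City column names, and the sample rows are all made up for illustration, since the original post does not show them. It also goes through .rdd before map, which avoids needing an encoder on newer Spark versions.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case classes: the original post does not show their definitions.
case class Address(street: String, city: String)
case class Person(name: String, phone: String, address: Option[Address])

object JoinSketch {
  def buildPeople(spark: SparkSession): Array[Person] = {
    import spark.implicits._

    // Made-up sample rows standing in for test.csv and test2.csv.
    val dataFramePerson = Seq(("1", "Ann", "555-0100"), ("2", "Bob", "555-0101"))
      .toDF("ID", "Name", "Phone")
    val dataFrameAddress = Seq(("1", "1 Main St", "Springfield"))
      .toDF("ID", "Street", "City")

    // The join is a single distributed operation planned on the driver.
    val joined = dataFramePerson.join(dataFrameAddress, Seq("ID"), "left_outer")

    // Inside the closure only the current Row is used, never another DataFrame.
    joined.rdd.map { row =>
      val address =
        if (row.isNullAt(row.fieldIndex("Street"))) None
        else Some(Address(row.getAs[String]("Street"), row.getAs[String]("City")))
      Person(row.getAs[String]("Name"), row.getAs[String]("Phone"), address)
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
    buildPeople(spark).foreach(println)
    spark.stop()
  }
}
```

People without a matching address row end up with address = None, which replaces the null check from the question.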
The only thing that can be said for certain is that either dataFrameAddress, i, or data is null. Use your favorite debugging technique (a debugger, print statements, or logs) to find out which one it actually is.
Note that if you see the filter call in the stack trace of your NullPointerException, then only i or data could be null. If you don't see the filter call, it is more likely that dataFrameAddress is null.
Related
Scala/Spark newbie here. I have inherited some old code, which I have refactored and am trying to use to retrieve data from Scylla. The code looks like:
val TEST_QUERY = s"SELECT user_id FROM test_table WHERE name = ? AND id_type = 'test_type';"
var selectData = List[Row]()
dataRdd.foreachPartition {
  iter => {
    // Build up a cluster that we can connect to
    // Start a session with the cluster by connecting to it.
    val cluster = ScyllaConnector.getCluster(clusterIpString, scyllaPreferredDc, scyllaUsername, scyllaPassword)
    var batchCounter = 0
    val session = cluster.connect(tableConfig.keySpace)
    val preparedStatement: PreparedStatement = session.prepare(TEST_QUERY)
    iter.foreach {
      case (test_name: String) => {
        // Get results
        val testResults = session.execute(preparedStatement.bind(test_name))
        if (testResults != null){
          val testResult = testResults.one()
          if(testResult != null){
            val user_id = testResult.getString("user_id")
            selectData ::= Row(user_id, test_name)
          }
        }
      }
    }
    session.close()
    cluster.close()
  }
}
println("Head is =======> ")
println(selectData.head)
The above does not return any data and fails with a null pointer exception because the selectData list is empty, although there is definitely data that matches the select statement. I feel like my approach is not correct, but I can't figure out what needs to change to fix it, so any help is much appreciated.
PS: The whole idea of using a list to keep the results is so that I can use that list to create a dataframe. I'd be grateful if you could point me in the right direction here.
If you look at the definition of the foreachPartition function, you will see that it cannot return anything, because its return type is void (Unit in Scala). In addition, the closure runs on the executors, so appending to selectData there changes a serialized copy and is not guaranteed to update the list on the driver.
In any case, this is a very bad way of querying data from Cassandra/Scylla from Spark. The Spark Cassandra Connector exists for that purpose, and it should work with Scylla as well because of the protocol compatibility.
To read a dataframe from Cassandra just do:
spark.read
.format("cassandra")
.option("keyspace", "ksname")
.option("table", "tab")
.load()
The documentation is quite detailed, so it is worth reading.
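To illustrate the first point of this answer (foreachPartition returns nothing, so results cannot come back through it), here is a minimal sketch that returns the per-row results through mapPartitions instead of through a captured driver-side variable. The lookupUserId function is a hypothetical stand-in for the Scylla query in the question, and the names PartitionResultsSketch and collectRows are assumptions as well. The connector-based read shown above remains the recommended route.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object PartitionResultsSketch {
  // Hypothetical stand-in for the per-row Scylla query in the question.
  def lookupUserId(name: String): Option[String] =
    if (name.nonEmpty) Some(s"user-$name") else None

  def collectRows(spark: SparkSession, names: Seq[String]): Array[Row] = {
    val dataRdd = spark.sparkContext.parallelize(names)

    // mapPartitions returns a new RDD, so the results travel back through Spark
    // rather than through a variable captured by the closure.
    val resultRdd = dataRdd.mapPartitions { iter =>
      // Per-partition setup (such as opening a session) would go here. With a real
      // session, materialize the results (e.g. with toList) before closing it,
      // because the iterator is evaluated lazily.
      iter.flatMap(name => lookupUserId(name).map(userId => Row(userId, name)))
    }

    // collect() brings the rows to the driver, where they can be inspected
    // or turned into a DataFrame.
    resultRdd.collect()
  }
}
```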
I am building a dataframe that will have a String column which is actually a struct converted to a JSON string.
val df = df_helper.select(
  lit("some data").as("id"),
  to_json(
    struct(
      col("id"),
      col("type"),
      col("path")
    )
  ).as("content")) // alias the JSON column itself, not the resulting DataFrame
I also built a function, that takes the identifier id:String as a parameter and spits out a string list.
def buildHierarchy(id:String) : List[String] = {
  val check_df = hierarchy_df.select(
    $"parent_id",
    $"id"
  ).where($"id" === id)
  val pathArray = List(id)
  val parentString = check_df.select($"parent_id").first.getString(0)
  if (parentString == null) {
    return pathArray
  }
  else {
    val pathList = buildHierarchy(parentString)
    val finalList: List[String] = pathList ++ pathArray
    return finalList
  }
}
I want to call this function and replace the path column with the result of the function. Is this possible, or is there a workaround?
Thank you in advance!
With the help of the following blog post I built up the hierarchy:
https://www.qubole.com/blog/processing-hierarchical-data-using-spark-graphx-pregel-api/
In my final dataframe I included all the information I need to build the details for each element of the structure, along with the hierarchy path for each element. Then, as a new column, I created a struct with the details for each element.
// Keep id and type as well: the struct below and the later joins refer to them
val content_df = hierarchy_df.select($"id", $"type", $"path")
  .withColumn("content",
    struct(col("id"),
      col("type"),
      col("path")
    ))
I exploded the path to reach each identifier in it while preserving the position order, so that I can join on the content for each level in the path.
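// posexplode produces the columns `pos` and `col`; they can only be referenced
// after the select that generates them
val exploded_df = content_df.select($"*", posexplode($"path"))
  .withColumn("exploded_level", $"pos")
  .withColumn("exploded_id", $"col")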
Then finally, join the content with the path, and aggregate the content into a path that contains all the content for each level.
val level_content_df = exploded_df.as("e")
.join(content_df.as("ouc"), col("e.col") === col("ouc.id"), "left")
.select($"e.*", $"ouc.content".as("content"))
val pathFull_df = level_content_df
.groupBy(col("id").as("id"))
.agg(sort_array(collect_list("content")).as("pathFull"))
And then finally I just put the whole thing together again :)
val content_with_path_df = content_df.as("ou")
.join(pathFull_df.as("opf"), col("ou.id") === col("opf.id"), "left")
.select($"ou.*", $"opf.pathFull".as("pathFull"))
Feel free to reach out if it doesn't make sense, it took a while for me to make it happen! :D
In the following Cassandra code, I am querying a database and expect multiple values. The function takes an id and should return Option[List[M]], where M is my model. I have a function rowToModel(row: Row): MyModel, which takes a row from the ResultSet and converts it into an instance of my model.
My issue is that the List I am returning is always empty even though ResultSet has data. I checked it by adding debug prints in rowToModel
def getRowsByPartitionKeyId(id:I):Option[List[M]] = {
  val whereClause = whereConditions(tablename, id);
  val resultSet = session.execute(whereClause) //resultSet is an iterator
  val it = resultSet.iterator();
  val resultList:List[M] = List();
  if(it.hasNext){
    while(it.hasNext) {
      val item:M = rowToModel(it.next())
      resultList.:+(item)
    }
    Some(resultList) //THIS IS ALWAYS List()
  }
  else
    None
}
I suspect that as resultList is a val, its value is not getting changed in the while loop. I probably should use yield or something else but I don't know what and how.
Solved it by converting Java Iterator to Scala and then using toList
import scala.collection.JavaConverters._

val it = resultSet.iterator()
if (it.hasNext) {
  // Wrap the Java iterator as a Scala iterator, then materialize it as a List
  val resultSetAsList: List[Row] = it.asScala.toList
  val resultSetAsModelList = resultSetAsList.map((row: Row) => rowToModel(row))
  Some(resultSetAsModelList)
} else {
  None
}
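The suspicion in the question is correct: :+ on an immutable List returns a new list and leaves the original untouched, so the result has to be captured or the loop replaced altogether. A minimal sketch of both behaviours, using a java.util.Iterator[String] as a stand-in for the Cassandra ResultSet iterator (the object and method names here are made up for illustration):

```scala
import scala.collection.JavaConverters._

object IteratorToListSketch {
  // Mirrors the question: the list built by :+ is discarded on every iteration.
  def broken(it: java.util.Iterator[String]): List[String] = {
    val resultList: List[String] = List()
    while (it.hasNext) {
      resultList :+ it.next() // creates a new List and throws it away
    }
    resultList // still empty
  }

  // Mirrors the fix: wrap the Java iterator and materialize it in one step.
  def fixed(it: java.util.Iterator[String]): List[String] =
    it.asScala.toList

  def main(args: Array[String]): Unit = {
    val rows = java.util.Arrays.asList("a", "b")
    println(broken(rows.iterator())) // prints List()
    println(fixed(rows.iterator()))  // prints List(a, b)
  }
}
```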
I'm new to Spark and I'm stuck on this issue:
From a DataFrame that I have created; called reportesBN, I want to get the value of a field and use it to read a text file from a specific path, and then process that file.
I have developed this code, but it's not working:
reportesBN.foreach {
  x =>
    val file = x(0)
    val insumo = sc.textFile(s"$file")
    val firstRow = insumo.first.split("\\|", -1)
    // Get values of next rows
    val nextRows = insumo.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
    val dfNextRows = nextRows.map(a => a.split("\\|")).map(x=> BalanzaNextRows(x(0), x(1),
      x(2), x(3), x(4))).toDF()
    val validacionBalanza = new RevisionCampos(sc)
    validacionBalanza.validacionBalanza(firstRow, dfNextRows)
}
The error log indicates that it is because of serialization.
7/06/28 18:55:45 INFO SparkContext: Created broadcast 0 from textFile at ValidacionInsumos.scala:56
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Is this problem caused by the Spark Context (sc) that is inside the foreach?
Is there another way to implement this?
Regards.
You asked a very similar question before, and this is the same issue: you cannot use SparkContext inside an RDD transformation or action. In this case, you use sc.textFile(s"$file") inside reportesBN.foreach, where reportesBN is, as you said, a DataFrame:
From a DataFrame that I have created; called reportesBN
You should rewrite your transformation to take a file from the DataFrame and read it afterwards.
// This replaces `val file = x(0)`
// I assume that the column name is `files`
// collect() returns a local Array[String] on the driver, which supports Scala's foreach
val files = reportesBN.select("files").as[String].collect()
Once you have the collection of files to process, you execute the code in your block.
files.foreach {
x => ...
}
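A minimal sketch of what the driver-side loop could look like, assuming the file paths have already been collected into a local collection. The object name DriverSideLoopSketch and the summarize function are hypothetical, and the per-file work is reduced to splitting the first line and counting the remaining lines, in place of the validation from the question.

```scala
import org.apache.spark.sql.SparkSession

object DriverSideLoopSketch {
  // For each path: (path, first line split on '|', number of remaining lines).
  def summarize(spark: SparkSession, files: Seq[String]): Seq[(String, Array[String], Long)] =
    // `files` is a local collection, so this map runs on the driver,
    // where using the SparkContext is allowed.
    files.map { file =>
      val insumo = spark.sparkContext.textFile(file)
      val firstRow = insumo.first().split("\\|", -1)
      // Skip the first line, which lives in partition 0.
      val nextRows = insumo.mapPartitionsWithIndex { (idx, iter) =>
        if (idx == 0) iter.drop(1) else iter
      }
      (file, firstRow, nextRows.count())
    }
}
```

Each iteration still launches distributed jobs (first and count), but the loop that decides which file to read never leaves the driver.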
I have a function that I want to apply to every row of a .csv file:
def convert(inString: Array[String]) : String = {
  val country = inString(0)
  val sellerId = inString(1)
  val itemID = inString(2)
  try{
    val minidf = sqlContext.read.json( sc.makeRDD(inString(3):: Nil) )
      .withColumn("country", lit(country))
      .withColumn("seller_id", lit(sellerId))
      .withColumn("item_id", lit(itemID))
    val finalString = minidf.toJSON.collect().mkString(",")
    finalString
  } catch{
    case e: Exception =>println("AN EXCEPTION "+inString.mkString(","))
      ("this is an exception "+e+" "+inString.mkString(","))
  }
}
This function transforms an entry of the sort:
CA 112578240 132080411845 [{"id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
Where I have 4 columns, the 4th being a json blob, into
[{"country":"CA", "seller":112578240", "product":112578240, "id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
which is the json object where the first 3 columns have been inserted into the fourth.
Now, this works:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).collect().map(x => convert(x))
or this:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).take(10).map(x => convert(x))
but this does not
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(x => convert(x))
The last one throws a java.lang.NullPointerException.
I included a try/catch clause to see where exactly this is failing, and it's failing for every single row.
What am I doing wrong here?
You cannot use sqlContext or sparkContext inside a Spark map, since those objects exist only on the driver node. Essentially, they are in charge of distributing your tasks to the executors.
You could rewrite the JSON parsing bit using one of these libraries in pure Scala: https://manuel.bernhardt.io/2015/11/06/a-quick-tour-of-json-libraries-in-scala/
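As one possible sketch of such a rewrite, the version below uses json4s, chosen here only as an assumption because it normally ships as a Spark dependency; any of the libraries from the linked article would do. It assumes the fourth column is a JSON array of objects, as in the sample row, and appends the three extra fields to each object under the names used by the withColumn calls in the question.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

object ConvertSketch {
  def convert(inString: Array[String]): String = {
    // Fields taken from the first three columns.
    val extra: List[JField] = List(
      "country" -> JString(inString(0)),
      "seller_id" -> JString(inString(1)),
      "item_id" -> JString(inString(2)))

    // Append the extra fields to every object in the JSON array;
    // anything that is not an object (or not an array) is left unchanged.
    val enriched = parse(inString(3)) match {
      case JArray(items) =>
        JArray(items.map {
          case JObject(fields) => JObject(fields ++ extra)
          case other => other
        })
      case other => other
    }
    compact(render(enriched))
  }
}
```

Because this function touches neither sqlContext nor sc, it can be called inside the RDD map, for example sc.textFile(path_to_file).map(_.split('\t')).map(ConvertSketch.convert).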