I am new to Scala. While running a Spark program I am getting a NullPointerException. Can anyone point me to how to solve this?
val data = spark.read.csv("C:\\File\\Path.csv").rdd

val result = data.map { line =>
    val population = line.getString(10).replaceAll(",", "")
    var popNum = 0L
    if (population.length() > 0)
      popNum = java.lang.Long.parseLong(population)
    (popNum, line.getString(0))
  }
  .sortByKey(false)
  .first()

//spark.sparkContext.parallelize(Seq(result)).saveAsTextFile(args(1))
println("The result is: " + result)
spark.stop
Error message:
Caused by: java.lang.NullPointerException
at com.nfs.WBI.KPI01.HighestUrbanPopulation$$anonfun$1.apply(HighestUrbanPopulation.scala:23)
at com.nfs.WBI.KPI01.HighestUrbanPopulation$$anonfun$1.apply(HighestUrbanPopulation.scala:22)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
I guess that in your input data there is at least one row that does not contain a value in column 10, so that line.getString(10) returns null. When calling replaceAll(",","") on that result, the NullPointerException occurs.
A quick fix would be to wrap the call to getString in an Option:
val population = Option(line.getString(10)).getOrElse("")
This returns the value of column 10 or an empty string if the column is null.
Some care must be taken when parsing the long. Unless you are absolutely sure that the column always contains a number, a NumberFormatException could be thrown.
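A minimal sketch of a null-safe and format-safe version of the map, assuming column 10 holds the population and column 0 the name:

import scala.util.Try

val result = data.map { line =>
    // Treat a null column as an empty string, and fall back to 0 if the
    // remaining text is not a valid number.
    val raw    = Option(line.getString(10)).getOrElse("")
    val popNum = Try(raw.replaceAll(",", "").toLong).getOrElse(0L)
    (popNum, line.getString(0))
  }
  .sortByKey(ascending = false)
  .first()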
In general, you should check the inferSchema option of the CSV reader of Spark and try to avoid parsing the data yourself.
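For example, something along these lines keeps everything in the DataFrame API; the _c0/_c10 column names are Spark's defaults for a headerless CSV and are an assumption about this particular file:

import org.apache.spark.sql.functions.{col, regexp_replace}

val df = spark.read.option("inferSchema", "true").csv("C:\\File\\Path.csv")

val top = df
  .withColumn("population", regexp_replace(col("_c10"), ",", "").cast("long"))
  .filter(col("population").isNotNull)   // drops rows where the value was missing or unparseable
  .orderBy(col("population").desc)
  .select(col("_c0"), col("population"))
  .first()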
In addition to the parsing issues mentioned elsewhere in this post, it seems that you have numbers with embedded commas in your data. This complicates CSV parsing and can cause undesirable behavior, so you may have to sanitize the data even before reading it in Spark.
Also, if you're using Spark 2.0, it's best to use DataFrames/Datasets along with groupBy constructs. See this post: How to deal with null values in spark reduceByKey function?. I suspect you have null values in your sort key as well.
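If the population values are written as quoted fields such as "1,234,567", the CSV reader can also be told about the quoting instead of sanitizing the file by hand (a sketch, assuming double-quoted fields):

val df = spark.read
  .option("quote", "\"")
  .option("escape", "\"")
  .option("inferSchema", "true")
  .csv("C:\\File\\Path.csv")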
Related
Hey, I have the following problem: I'd like to use the Polars apply function on columns with the datatype List.
In most cases this works, but in some cases all lists in the column are empty and the column datatype is List[null]; in that special case the code crashes.
Here is some example code:
import pandas as pd
import polars as pl

df = pl.from_pandas(pd.DataFrame(data=[
    [[]],
    [[]]
], columns=['A']))

df.with_columns(pl.col('A').apply(lambda x: x))
results in
pyo3_runtime.PanicException: Unwrapped panic from Python code
I think the problem could easily be solved by casting the datatype to another List datatype, but I have no idea how to do that.
In polars>=0.13.11 you can:
df = pl.from_pandas(pd.DataFrame(data=[
    [[]],
    [[]]
], columns=['A']))

assert df["A"].cast(pl.List(pl.Int64)).dtype.inner == pl.Int64
assert df["A"].cast(pl.List(int)).dtype.inner == pl.Int64
I am a beginner with Flink streaming.
When reading a file with RowCsvInputFormat, the Row created by the Kryo serializer does not work properly.
The code is below.
val readLocalCsvFile = new RowCsvInputFormat(
  new Path("flink-test/000000_1"),
  Array(Types.STRING, Types.STRING, Types.STRING),
  "\n",
  ","
)

val read = env.readFile(
  readLocalCsvFile,
  "flink-test/000000_1",
  FileProcessingMode.PROCESS_CONTINUOUSLY,
  1000000)

read.print()
env.execute("test")
The contents of the file 000000_1 are as follows.
aa,bb,cc
aaa,bbb,ccc
As a result of debugging, I can see that the values aa, bb, and cc are split correctly. But when those values are put into the Row's fields one by one, a NullPointerException is raised because fields is null.
A debugger screenshot (not reproduced here) shows that the fields of the Row are null.
The code that creates the Row while the above runs is as follows; the KryoSerializer generates the row.
val kryo = new EmptyFlinkScalaKryoInstantiator().newKryo
val row = kryo.newInstance(classOf[Row])
The output error is as follows.
java.lang.NullPointerException
at org.apache.flink.types.Row.setField(Row.java:140)
at org.apache.flink.api.java.io.RowCsvInputFormat.fillRecord(RowCsvInputFormat.java:162)
at org.apache.flink.api.java.io.RowCsvInputFormat.fillRecord(RowCsvInputFormat.java:33)
at org.apache.flink.api.java.io.CsvInputFormat.readRecord(CsvInputFormat.java:113)
at org.apache.flink.api.common.io.DelimitedInputFormat.nextRecord(DelimitedInputFormat.java:551)
at org.apache.flink.api.java.io.CsvInputFormat.nextRecord(CsvInputFormat.java:80)
at org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator.readAndCollectRecord(ContinuousFileReaderOperator.java:387)
at
Could you post the complete code?
Judging from the error stack trace, it may be that the number of fields does not match.
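Not a confirmed diagnosis, just something worth checking alongside the field count: if Flink falls back to Kryo for Row, the instance it builds can lack the backing fields array. Supplying an explicit RowTypeInfo keeps Flink's own Row serializer in play and documents the expected arity (a sketch, assuming the three-column file from the question):

import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.types.Row

// Three types for the three comma-separated columns in 000000_1.
implicit val rowTypeInfo: TypeInformation[Row] =
  new RowTypeInfo(Types.STRING, Types.STRING, Types.STRING)

val read = env.readFile(
  readLocalCsvFile,
  "flink-test/000000_1",
  FileProcessingMode.PROCESS_CONTINUOUSLY,
  1000000)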
I have a file in HDFS containing paths of various other files. Here is the file called file1:
path/of/HDFS/fileA
path/of/HDFS/fileB
path/of/HDFS/fileC
.
.
.
I am using a for loop in Scala Spark as follows to read each line of the above file and process it in another function:
import scala.io.Source
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val lines = Source.fromFile("path/to/file1.txt").getLines.toList

for (i <- lines) {
  i.toString()
  val firstLines = sc.hadoopFile(i, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).flatMap {
    case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
  }
}
When I run the above loop, it runs through without returning any errors and I get the Scala prompt on a new line: scala>
However, when I try to see a few lines of output which should be stored in firstLines, it does not work:
scala> firstLines
<console>:38: error: not found: value firstLines
firstLine
^
What is the problem with the above loop that prevents it from producing the output, even though it runs through without any errors?
Additional info
The function hadoopFile accepts a String path name as its first parameter. That is why I am trying to pass each line of file1 (each line is a path name) as a String in the first parameter i. The flatMap part takes the first line of the file that has been passed to hadoopFile, keeps that alone, and discards all the other lines. So the desired output (firstLines) should be the first line of each of the files whose path names (i) are passed to hadoopFile.
I tried running the function for just a single file, without a loop, and that produces the output:
val firstLines = sc.hadoopFile("path/of/HDFS/fileA", classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).flatMap {
  case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
}
scala> firstLines.take(3)
res27: Array[String] = Array(<?xml version="1.0" encoding="utf-8"?>)
fileA is an XML file, so you can see the resulting first line of that file. So I know the function works fine, it is just a problem with the loop that I am not able to figure out. Please help.
The variable firstLines is defined in the body of the for loop and its scope is therefore limited to this loop. This means you cannot access the variable outside of the loop, and this is why the Scala compiler tells you error: not found: value firstLines.
From your description, I understand you want to collect the first line of every file listed in lines.
The every here can translate into different constructs in Scala. We can use something like the for loop you wrote or, even better, adopt a functional approach and apply a map function to the list of files. In the code below, inside the map I put the code you used in your description, which creates a HadoopRDD and applies flatMap with your function to retrieve the first line of a file.
We then obtain a list of RDD[String] of first lines. At this stage, note that we have not started to do any actual work. To trigger the evaluation of the RDDs and collect the result, we need an additional call to the collect method for each of the RDDs in our list.
// Renamed "lines" to "files" as it is more explicit.
val fileNames = Source.fromFile("path/to/file1.txt").getLines.toList
val firstLinesRDDs = fileNames.map(sc.hadoopFile(_,classOf[TextInputFormat],classOf[LongWritable],classOf[Text]).flatMap {
case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
})
// firstLinesRDDs is a list of RDD[String]. Based on this code, each RDD
// should consist in a single String value. We collect them using RDD#collect:
val firstLines = firstLinesRDDs.map(_.collect)
However, this approach suffers from a flaw which prevents us from benefiting from the advantages Spark can provide.
When we apply the operation inside map to fileNames, we are not working with an RDD, hence the file names are processed sequentially on the driver (the process which hosts your Spark session) and are not part of a parallelizable Spark job. This is equivalent to doing what you wrote in your second block of code, one file name at a time.
To address the problem, what can we do? A good thing to keep in mind when working with Spark is to try to push the declaration of the RDDs as early as possible in our code. Why? Because this allows Spark to parallelize and optimize the work we want to do. Your example could be a textbook illustration of this concept, though an additional complexity here is added by the requirement to manipulate files.
In our present case, we can benefit from the fact that hadoopFile accepts a comma-separated list of files as input. Therefore, instead of sequentially creating an RDD for every file, we create one RDD for all of them:
val firstLinesRDD = sc.hadoopFile(fileNames.mkString(","), classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).flatMap {
  case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
}
And we retrieve our first lines with a single collect:
val firstLines = firstLinesRDD.collect
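As a quick sanity check (assuming every file listed in file1.txt exists and is non-empty), the collected array can simply be printed:

firstLines.foreach(println)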
I have a loop which generates rows in each iteration. My goal is to create a dataframe, with a given schema, that contains just those rows. I have in mind a set of steps to follow, but I am not able to add a new Row to a List[Row] in each loop iteration.
I am trying the following approach:
var listOfRows = List[Row]()

val dfToExtractValues: DataFrame = ???

dfToExtractValues.foreach { x =>
  //Not really important how the variables are generated here,
  //so to simplify, all the rows will have the same values
  var col1 = "firstCol"
  var col2 = "secondCol"
  var col3 = "thirdCol"

  val newRow = RowFactory.create(col1, col2, col3)

  //This step I am not able to do:
  //listOfRows += newRow -> just for strings
  //listOfRows.add(newRow) -> this add doesn't exist; it is addString
  //listOfRows.aggregate(1)(newRow) -> this is not how aggregate works...
}

val rdd = sc.makeRDD[Row](listOfRows)
val dfWithNewRows = sqlContext.createDataFrame(rdd, myOriginalDF.schema)
Can someone tell me what I am doing wrong, or what I could change in my approach to generate a dataframe from the rows I'm generating?
Maybe there is a better way to collect the Rows than a List[Row], but then I would need to convert that other type of collection into a dataframe.
Can someone tell me what I am doing wrong
Closures:
First of all, it looks like you skipped over Understanding Closures in the Programming Guide. Any attempt to modify variables passed with a closure is futile. All you can do is modify a copy, and the changes won't be reflected globally.
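A minimal illustration of the point, using a hypothetical counter instead of the list:

// Each task mutates its own deserialized copy of `counter`;
// the driver-side variable is never updated.
var counter = 0
dfToExtractValues.foreach { _ => counter += 1 }
println(counter)  // still 0 on the driver (local mode may appear to behave differently)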
A variable doesn't make the object mutable:
The following
var listOfRows = List[Row]()
creates a variable. The assigned List is as immutable as it was before. If this weren't inside the Spark context, you could create a new List and reassign:
listOfRows = newRow :: listOfRows
Note that we prepend, not append: you don't want to append to a list in a loop.
Variables with immutable objects are useful when you want to share data (it is a common pattern in Akka, for example), but they don't have many applications in Spark.
Keep things distributed:
Finally never fetch data to the driver just to distribute it again. You should also avoid unnecessary conversions between RDDs and DataFrames. It is best to use DataFrame operators all the way:
dfToExtractValues.select(...)
but if you need something more complex, use map:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
dfToExtractValues.map(x => ...)(RowEncoder(schema))
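A minimal sketch of that map route, assuming a schema of three string columns; the column names and the constant values are placeholders taken from the question:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("col1", StringType),
  StructField("col2", StringType),
  StructField("col3", StringType)
))

// Each input row is mapped to a new Row matching the schema above.
val dfWithNewRows = dfToExtractValues.map { _ =>
  Row("firstCol", "secondCol", "thirdCol")
}(RowEncoder(schema))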
My code is crashing with java.util.NoSuchElementException: next on empty iterator exception.
def myfunction(arr: Array[(Int, (String, Int))]) = {
  val values = arr.sortBy(x => (-x._2._2, x._2._1.head)).toList
...........................
The code is crashing in the first line where I am trying to sort an array.
var arr = Array((1,("kk",1)),(1,("hh",1)),(1,("jj",3)),(1,("pp",3)))
I am trying to sort the array on the basis of the 2nd element of the inner tuple. If there is a tie, the sort should use the first element of the inner tuple.
output - ((1,("pp",3)),(1,("jj",3)),(1,("hh",1)),(1,("kk",1)))
This crashes under some scenarios (normally it works fine), which I guess is due to an empty array.
How can I get rid of this crash, or is there any other elegant way of achieving the same result?
It happens because one of your array items (Int,(String,Int)) contains an empty string.
"".head
leads to
java.util.NoSuchElementException: next on empty iterator
use x._2._1.headOption
val values = (arr.sortBy(x => (-x._2._2, x._2._1)).toList)
Removing head from the statement works. The original crashes because of the empty string in arr:
var arr = Array((1,("kk",1)),(1,("hh",1)),(1,("jj",3)),(1,("pp",3)),(1,("",1)))
I was using MLlib in Spark and got this error. It turned out that I was predicting for a non-existing userID or itemID; ALS generates a prediction matrix (userIDs × itemIDs), and you must make sure that your request is covered by this matrix.