My question is: will quantile_foo and quantile_bar hold the right values in each loop iteration?
Or will quantile_foo and quantile_bar end up holding the values from the last iteration (i = 4, since range(5) runs from 0 to 4) because of Spark's lazy execution, so that I always get a wrong foo_quantile_{i} except for the last one (foo_quantile_4)?
from pyspark.sql import functions as F

df = spark.sql("select * from some_table")

for i in range(5):
    quantile_foo = df.approxQuantile("foo_{}".format(i), [0.25, 0.5, 0.75], 0.05)
    quantile_bar = df.approxQuantile("bar_{}".format(i), [0.25, 0.5, 0.75], 0.05)
    df = df.withColumn(
        "foo_quantile_{}".format(i),
        F.when(F.col("foo_{}".format(i)) > quantile_foo[0], 75)
         .when(F.col("foo_{}".format(i)) > quantile_foo[1], 50)
         # ... ...
    )
    df = df.withColumn(
        "bar_quantile_{}".format(i),
        F.when(F.col("bar_{}".format(i)) > quantile_bar[0], 75)
         .when(F.col("bar_{}".format(i)) > quantile_bar[1], 50)
         # ... ...
    )
I would say the answer to your question is yes, it takes the last value of i to perform the operation, but that has nothing to do with Spark's lazy evaluation.
You are reassigning a Python variable inside a for loop, so whenever you come out of the loop the variable holds the value from the last iteration.
Related
I have a small piece of Scala code using Future. After execution, the code returns a String.
My result comes back like this:
scala.concurrent.Future[List[String]] = Future(Success(List(10)))
Here 10 is my return value. How do I get only the value 10 and store it in another variable?
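As a hedged sketch (not from the original thread), two common ways to get at the value are blocking with Await.result (fine for a quick experiment, usually discouraged in production code) or staying asynchronous and transforming the Future; the fut value below is just a stand-in for the real computation.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val fut: Future[List[String]] = Future(List("10"))   // stands in for your computation

// Option 1: block until the Future completes (simple, but ties up a thread)
val list: List[String] = Await.result(fut, 5.seconds)
val value: String = list.head                         // "10"

// Option 2: stay asynchronous and use the value once it arrives
fut.foreach(l => println(l.head))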
I'd like to get values from a map like this:
user_TopicResponses.put(
"3"+"_"+topicid,
Access.useradVectorMap.getOrElse(
"3"+"_"+topicid,
Access.useradVectorMap.getOrElse("3"+"_"+"0"),
Array(0.0)
)
)
Meaning: if the key is in the map, its value should be returned; otherwise the key "3_0" should be used and that value returned instead.
But the compiler reports:
too many arguments for method getOrElse: (key: String, default: => B1)B
You mixed up the parentheses a bit:
Access.useradVectorMap.getOrElse("3"+"_"+"0"),Array(0.0) should in fact be
Access.useradVectorMap.getOrElse("3"+"_"+"0",Array(0.0))
It should be OK after that!
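For completeness, a sketch of the full call with the parenthesis moved, keeping the user_TopicResponses, Access.useradVectorMap and topicid names from the question:

user_TopicResponses.put(
  "3" + "_" + topicid,
  Access.useradVectorMap.getOrElse(
    "3" + "_" + topicid,
    Access.useradVectorMap.getOrElse("3" + "_" + "0", Array(0.0))
  )
)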
First off, I would suggest, at least for debugging purposes, breaking your one statement into multiple statements. Your problem stems from a missing/misplaced parenthesis, which would be much easier to see if you split your code up.
Secondly, it's good practice to use a variable or function for any repeated code; it makes it far easier to change and maintain (I like to use them for any hard-coded value that might change later as well).
To calculate secondaryValue only if the primaryValue lookup falls through to its default, you can use a lazy val, which is evaluated only when needed:
val primaryKey = "3_" + topicid
val secondaryKey = "3_0"
val secondaryDefaultValue = Array(0.0)
lazy val secondaryValue = Access.useradVectorMap.getOrElse(secondaryKey, secondaryDefaultValue)
val primaryValue = Access.useradVectorMap.getOrElse(primaryKey, secondaryValue)
user_TopicResponses.put(primaryKey, primaryValue)
This code is far easier to read and, more importantly, far easier to debug.
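As a tiny stand-alone illustration of why the lazy val matters here (the map and values below are made up): getOrElse takes its default by name, and a lazy val is not evaluated until first use, so the fallback lookup only runs when the primary key is missing.

lazy val fallback = { println("computing fallback"); 42 }

val hit  = Map("a" -> 1).getOrElse("a", fallback)  // "computing fallback" is never printed
val miss = Map("a" -> 1).getOrElse("b", fallback)  // prints "computing fallback", returns 42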
I am new to Scala. While running a Spark program I am getting a NullPointerException. Can anyone point me to how to solve this?
val data = spark.read.csv("C:\\File\\Path.csv").rdd

val result = data.map { line =>
  val population = line.getString(10).replaceAll(",", "")
  var popNum = 0L
  if (population.length() > 0)
    popNum = Long.parseLong(population)
  (popNum, line.getString(0))
}
.sortByKey(false)
.first()
//spark.sparkContext.parallelize(Seq(result)).saveAsTextFile(args(1))
println("The result is: "+ result)
spark.stop
Error message:
Caused by: java.lang.NullPointerException
at com.nfs.WBI.KPI01.HighestUrbanPopulation$$anonfun$1.apply(HighestUrbanPopulation.scala:23)
at com.nfs.WBI.KPI01.HighestUrbanPopulation$$anonfun$1.apply(HighestUrbanPopulation.scala:22)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
I guess that in your input data there is at least one row that does not contain a value in column 10, so that line.getString(10) returns null. When calling replaceAll(",","") on that result, the NullPointerException occurs.
A quick fix would be to wrap the call to getString in an Option:
val population = Option(line.getString(10)).getOrElse("")
This returns the value of column 10 or an empty string if the column is null.
Some care must be taken when parsing the long. Unless you are absolutely sure that the column always contains a number, a NumberFormatException could be thrown.
In general, you should check the inferSchema option of the CSV reader of Spark and try to avoid parsing the data yourself.
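A hedged sketch combining both suggestions, keeping the data RDD and column positions from the question; scala.util.Try is used here as one possible way to guard the numeric parse:

import scala.util.Try

val result = data.map { line =>
  // getString(10) can return null; fall back to an empty string before replaceAll
  val population = Option(line.getString(10)).getOrElse("").replaceAll(",", "")
  // toLong throws on an empty or malformed value; default to 0L in that case
  val popNum = Try(population.toLong).getOrElse(0L)
  (popNum, line.getString(0))
}
.sortByKey(false)
.first()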
In addition to the parsing issues mentioned elsewhere in this post, it seems that your data contains numbers with embedded commas (thousands separators). This complicates CSV parsing and can cause undesirable behavior; you may have to sanitize the data even before reading it in Spark.
Also, if you're using Spark 2.0, it's best to use DataFrames/Datasets along with groupBy constructs. See this post: How to deal with null values in spark reduceByKey function?. I suspect you have null values in your sort key as well.
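A rough sketch of the same lookup kept entirely in the DataFrame API, along the lines this answer suggests. The column names (population, country) and the header/inferSchema options are assumptions about the file, not details from the question:

import org.apache.spark.sql.functions._

val df = spark.read
  .option("header", "true")        // assumption: the file has a header row
  .option("inferSchema", "true")
  .csv("C:\\File\\Path.csv")

val top = df
  .na.drop(Seq("population"))      // drop rows with a null sort key
  .withColumn("popNum", regexp_replace(col("population"), ",", "").cast("long"))
  .orderBy(desc("popNum"))
  .select("country", "popNum")
  .first()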
I have a loop which generates rows in each iteration. My goal is to create a dataframe, with a given schema, that contains just those rows. I have a set of steps in mind, but I am not able to add a new Row to a List[Row] in each loop iteration.
I am trying the following approach:
var listOfRows = List[Row]()
val dfToExtractValues: DataFrame = ???
dfToExtractValues.foreach { x =>
  // Not really important how the variables are generated here,
  // so to simplify things all the rows will have the same values
  var col1 = "firstCol"
  var col2 = "secondCol"
  var col3 = "thirdCol"
  val newRow = RowFactory.create(col1, col2, col3)

  // This is the step I am not able to do:
  // listOfRows += newRow            -> just for strings
  // listOfRows.add(newRow)          -> this add doesn't exist, it is addString
  // listOfRows.aggregate(1)(newRow) -> this is not how aggregate works...
}
val rdd = sc.makeRDD[RDD](listOfRows)
val dfWithNewRows = sqlContext.createDataFrame(rdd, myOriginalDF.schema)
Can someone tell me what I am doing wrong, or what I could change in my approach to generate a dataframe from the rows I'm generating?
Maybe there is a better way to collect the Rows instead of List[Row]. But then I need to convert that other type of collection into a dataframe.
Can someone tell me what I am doing wrong?
Closures:
First of all, it looks like you skipped over Understanding Closures in the Programming Guide. Any attempt to modify variables passed with a closure is futile: all you can do is modify a copy, and the changes won't be reflected globally.
A variable doesn't make the object mutable:
The following
var listOfRows = List[Row]()
creates a variable. The assigned List is as immutable as it ever was. If this weren't in a Spark closure, you could create a new List and reassign:
listOfRows = newRow :: listOfRows
Note that we prepend, not append: you don't want to append to a list in a loop.
Variables holding immutable objects are useful when you want to share data (it is a common pattern in Akka, for example), but they don't have many applications in Spark.
Keep things distributed:
Finally, never fetch data to the driver just to distribute it again. You should also avoid unnecessary conversions between RDDs and DataFrames. It is best to use DataFrame operators all the way:
dfToExtractValues.select(...)
but if you need something more complex, use map:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
dfToExtractValues.map(x => ...)(RowEncoder(schema))
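As a slightly fuller, hypothetical sketch of the map route, assuming a simple three-string schema like the rows built in the question:

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("col1", StringType),
  StructField("col2", StringType),
  StructField("col3", StringType)
))

val dfWithNewRows = dfToExtractValues.map { x =>
  // derive the new values from x here; constants keep the sketch short
  Row("firstCol", "secondCol", "thirdCol")
}(RowEncoder(schema))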
My code is crashing with a java.util.NoSuchElementException: next on empty iterator exception.
def myfunction(arr : Array[(Int,(String,Int))]) = {
val values = (arr.sortBy(x => (-x._2._2, x._2._1.head)).toList)
...........................
The code is crashing in the first line where I am trying to sort an array.
var arr = Array((1,("kk",1)),(1,("hh",1)),(1,("jj",3)),(1,("pp",3)))
I am trying to sort the array by the second element of the inner tuple, in descending order. If there is a tie, the sort should fall back to the first element of the inner tuple.
output - ((1,("pp",3)),(1,("jj",3)),(1,("hh",1)),(1,("kk",1)))
This crashes in some scenarios (normally it works fine), which I guess is due to an empty array.
How can I get rid of this crash, or is there another elegant way of achieving the same result?
It happens because one of your array items (Int, (String, Int)) contains an empty string.
"".head
leads to
java.util.NoSuchElementException: next on empty iterator
use x._2._1.headOption
val values = (arr.sortBy(x => (-x._2._2, x._2._1)).toList)
Removing head from the statement works; the crash happens because of the empty string in arr:
var arr = Array((1,("kk",1)),(1,("hh",1)),(1,("jj",3)),(1,("pp",3)),(1,("",1)))
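A small sketch of the headOption variant run against the failing input. One assumption worth checking against your requirements: the implicit Ordering[Option[Char]] sorts None before Some, so among equal counts the empty-string entry comes first.

val arr = Array((1, ("kk", 1)), (1, ("hh", 1)), (1, ("jj", 3)), (1, ("pp", 3)), (1, ("", 1)))

// No exception on the empty string; ties on the count are broken by the first character
val values = arr.sortBy(x => (-x._2._2, x._2._1.headOption)).toList
// jj and pp (count 3) come first, then the empty string, hh and kk (count 1)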
I use MLlib in Spark and got this error. It turned out that I was predicting for a non-existing userID or itemID: ALS generates a prediction matrix (userIDs * itemIDs), and you must make sure that your request falls within that matrix.
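A hedged sketch of that check, assuming a trained MatrixFactorizationModel named model and hypothetical userId/productId values:

// Only ask for predictions the factor matrices can actually answer
val knownUsers    = model.userFeatures.keys.collect().toSet
val knownProducts = model.productFeatures.keys.collect().toSet

if (knownUsers.contains(userId) && knownProducts.contains(productId)) {
  val rating = model.predict(userId, productId)
  println(s"Predicted rating: $rating")
} else {
  println("userId or productId was never seen during training; skipping prediction")
}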