Pyspark - update list in udf - pyspark

Is there any possibility to update a list/variable inside an udf?
Let's consider this scenario:
studentsWithNewId = []
udfChangeStudentId = udf(changeStudentId, IntegerType())
def changeStudentId(studentId):
if condition:
newStudentId = computeNewStudentId() // this function is based on studentsWithNewId list contents
studentsWithNewId.append(newStudentId)
return newStudentId
return studentId
studentsDF.select(udfChangeStudentId(studentId))
Is this possible and safe on a cluster environment?
The code above is just an example, therefore maybe it can be re-written in some other, better way.

Related

SparkSQL performance issue with collect method

We are currently facing a performance issue in sparksql written in scala language. Application flow is mentioned below.
Spark application reads a text file from input hdfs directory
Creates a data frame on top of the file using programmatically specifying schema. This dataframe will be an exact replication of the input file kept in memory. Will have around 18 columns in the dataframe
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema)
Creates a filtered dataframe from the first data frame constructed in step 2. This dataframe will contain unique account numbers with the help of distinct keyword.
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
Using the two dataframes constructed in step 2 & 3, we will get all the records which belong to one account number and do some Json parsing logic on top of the filtered data.
var filtrEqpDF =
eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
Finally the json parsed data will be put into Hbase table
Here we are facing performance issues while calling the collect method on top of the data frames. Because collect will fetch all the data into a single node and then do the processing, thus losing the parallel processing benefit.
Also in real scenario there will be 10 billion records of data which we can expect. Hence collecting all those records in to driver node will might crash the program itself due to memory or disk space limitations.
I don't think the take method can be used in our case which will fetch limited number of records at a time. We have to get all the unique account numbers from the whole data and hence I am not sure whether take method, which takes
limited records at a time, will suit our requirements
Appreciate any help to avoid calling collect methods and have some other best practises to follow. Code snippets/suggestions/git links will be very helpful if anyone have had faced similar issues
Code snippet
val eqpSchemaString = "acoountnumber ....."
val eqpSchema = StructType(eqpSchemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, true)));
val eqpRdd = sc.textFile(inputPath)
val eqpRowRdd = eqpRdd.map(_.split(",")).map(eqpRow => Row(eqpRow(0).trim, eqpRow(1).trim, ....)
var eqpDF = sqlContext.createDataFrame(eqpRowRdd, eqpSchema);
var distAccNrsDF = eqpDF.select("accountnumber").distinct().collect()
distAccNrsDF.foreach { data =>
var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'").collect()
var result = new JSONObject()
result.put("jsonSchemaVersion", "1.0")
val firstRowAcc = filtrEqpDF(0)
//Json parsing logic
{
.....
.....
}
}
The approach usually take in this kind of situation is:
Instead of collect, invoke foreachPartition: foreachPartition applies a function to each partition (represented by an Iterator[Row]) of the underlying DataFrame separately (the partition being the atomic unit of parallelism of Spark)
the function will open a connection to HBase (thus making it one per partition) and send all the contained values through this connection
This means the every executor opens a connection (which is not serializable but lives within the boundaries of the function, thus not needing to be sent across the network) and independently sends its contents to HBase, without any need to collect all data on the driver (or any one node, for that matter).
It looks like you are reading a CSV file, so probably something like the following will do the trick:
spark.read.csv(inputPath). // Using DataFrameReader but your way works too
foreachPartition { rows =>
val conn = ??? // Create HBase connection
for (row <- rows) { // Loop over the iterator
val data = parseJson(row) // Your parsing logic
??? // Use 'conn' to save 'data'
}
}
You can ignore collect in your code if you have large set of data.
Collect Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
Also this can cause the driver to run out of memory, though, because collect() fetches the entire RDD/DF to a single machine.
I have just edited your code, which should work for you.
var distAccNrsDF = eqpDF.select("accountnumber").distinct()
distAccNrsDF.foreach { data =>
var filtrEqpDF = eqpDF.where("accountnumber='" + data.getString(0) + "'")
var result = new JSONObject()
result.put("jsonSchemaVersion", "1.0")
val firstRowAcc = filtrEqpDF(0)
//Json parsing logic
{
.....
.....
}
}

Spark - how to handle with lazy evaluation in case of iterative (or recursive) function calls

I have a recursive function that needs to compare the results of the current call to the previous call to figure out whether it has reached a convergence. My function does not contain any action - it only contains map, flatMap, and reduceByKey. Since Spark does not evaluate transformations (until an action is called), my next iteration does not get the proper values to compare for convergence.
Here is a skeleton of the function -
def func1(sc: SparkContext, nodes:RDD[List[Long]], didConverge: Boolean, changeCount: Int) RDD[(Long] = {
if (didConverge)
nodes
else {
val currChangeCount = sc.accumulator(0, "xyz")
val newNodes = performSomeOps(nodes, currChangeCount) // does a few map/flatMap/reduceByKey operations
if (currChangeCount.value == changeCount) {
func1(sc, newNodes, true, currChangeCount.value)
} else {
func1(sc, newNode, false, currChangeCount.value)
}
}
}
performSomeOps only contains map, flatMap, and reduceByKey transformations. Since it does not have any action, the code in performSomeOps does not execute. So my currChangeCount does not get the actual count. What that implies, the condition to check for the convergence (currChangeCount.value == changeCount) is going to be invalid. One way to overcome is to force an action within each iteration by calling a count but that is an unnecessary overhead.
I am wondering what I can do to force an action w/o much overhead or is there another way to address this problem?
I believe there is a very important thing you're missing here:
For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
Because of that accumulators cannot be reliably used for managing control flow and are better suited for job monitoring.
Moreover executing an action is not an unnecessary overhead. If you want to know what is the result of the computation you have to perform it. Unless of course the result is trivial. The cheapest action possible is:
rdd.foreach { case _ => }
but it won't address the problem you have here.
In general iterative computations in Spark can be structured as follows:
def func1(chcekpoinInterval: Int)(sc: SparkContext, nodes:RDD[List[Long]],
didConverge: Boolean, changeCount: Int, iteration: Int) RDD[(Long] = {
if (didConverge) nodes
else {
// Compute and cache new nodes
val newNodes = performSomeOps(nodes, currChangeCount).cache
// Periodically checkpoint to avoid stack overflow
if (iteration % checkpointInterval == 0) newNodes.checkpoint
/* Call a function which computes values
that determines control flow. This execute an action on newNodes.
*/
val changeCount = computeChangeCount(newNodes)
// Unpersist old nodes
nodes.unpersist
func1(checkpointInterval)(
sc, newNodes, currChangeCount.value == changeCount,
currChangeCount.value, iteration + 1
)
}
}
I see that these map/flatMap/reduceByKey transformations are updating an accumulator. Therefore the only way to perform all updates is to execute all these functions and count is the easiest way to achieve that and gives the lowest overhead compared to other ways (cache + count, first or collect).
Previous answers put me on the right track to solve a similar convergence detection problem.
foreach is presented in the docs as:
foreach(func) : Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
It seems like instead of using rdd.foreach() as a cheap action to trigger accumulator increments placed in various transformations, it should be used to do the incrementing itself.
I'm unable to produce a scala example, but here's a basic java version, if it can still help:
// Convergence is reached when two iterations
// return the same number of results
long previousCount = -1;
long currentCount = 0;
while (previousCount != currentCount){
rdd = doSomethingThatUpdatesRdd(rdd);
// Count entries in new rdd with foreach + accumulator
rdd.foreach(tuple -> accumulator.add(1));
// Update helper values
previousCount = currentCount;
currentCount = accumulator.sum();
accumulator.reset();
}
// Convergence is reached

how to update a value in dataframe and drop a row on this basis of a given value in scala

I need to update the value and if the value is zero then drop that row. Here is the snapshot.
val net = sc.accumulator(0.0)
df1.foreach(x=> {net += calculate(df2, x)})
def calculate(df2:DataFrame, x : Row):Double = {
var pro:Double = 0.0
df2.foreach(y => {if(xxx){ do some stuff and update the y.getLong(2) value }
else if(yyy){ do some stuff and update the y.getLong(2) value}
if(y.getLong(2) == 0) {drop this row from df2} })
return pro;
}
Any suggestions? Thanks.
You cannot change the DataFrame or RDD. They are read only for a reason. But you can create a new one and use transformations by all the means available. So when you want to change for example contents of a column in dataframe just add new column with updated contents by using functions like this:
df.withComlumn(...)
DataFrames are immutable, you can not update a value but rather create new DF every time.
Can you reframe your use case, its not very clear what you are trying to achieve with the above snippet (Not able to understand the use of accumulator) ?
You can rather try df2.withColumn(...) and use your udf here.

Get the first elements (take function) of a DStream

I look for a way to retrieve the first elements of a DStream created as:
val dstream = ssc.textFileStream(args(1)).map(x => x.split(",").map(_.toDouble))
Unfortunately, there is no take function (as on RDD) on a dstream //dstream.take(2) !!!
Could someone has any idea on how to do it ?! thanks
You can use transform method in the DStream object then take n elements of the input RDD and save it to a list, then filter the original RDD to be contained in this list. This will return a new DStream contains n elements.
val n = 10
val partOfResult = dstream.transform(rdd => {
val list = rdd.take(n)
rdd.filter(list.contains)
})
partOfResult.print
The previous suggested solution did not compile for me as the take() method returns an Array, which is not serializable thus Spark streaming will fail with a java.io.NotSerializableException.
A simple variation on the previous code that worked for me:
val n = 10
val partOfResult = dstream.transform(rdd => {
rdd.filter(rdd.take(n).toList.contains)
})
partOfResult.print
Sharing a java based solution that is working for me. The idea is to use a custom function, which can send the top row from a sorted RDD.
someData.transform(
rdd ->
{
JavaRDD<CryptoDto> result =
rdd.keyBy(Recommendations.volumeAsKey)
.sortByKey(new CryptoComparator()).values().zipWithIndex()
.map(row ->{
CryptoDto purchaseCrypto = new CryptoDto();
purchaseCrypto.setBuyIndicator(row._2 + 1L);
purchaseCrypto.setName(row._1.getName());
purchaseCrypto.setVolume(row._1.getVolume());
purchaseCrypto.setProfit(row._1.getProfit());
purchaseCrypto.setClose(row._1.getClose());
return purchaseCrypto;
}
).filter(Recommendations.selectTopinSortedRdd);
return result;
}).print();
The custom function selectTopinSortedRdd looks like below:
public static Function<CryptoDto, Boolean> selectTopInSortedRdd = new Function<CryptoDto, Boolean>() {
private static final long serialVersionUID = 1L;
#Override
public Boolean call(CryptoDto value) throws Exception {
if (value.getBuyIndicator() == 1L) {
System.out.println("Value of buyIndicator :" + value.getBuyIndicator());
return true;
}
else {
return false;
}
}
};
It basically compares all incoming elements, and returns true only for the first record from the sorted RDD.
This seems to be always an issue with DStreams as well as regular RDDs.
If you don't want (or can't) to use .take() (especially in DStreams) you can think outside the box here and just use reduce instead. That is a valid function for both DStreams as well as RDD's.
Think about it. If you use reduce like this (Python example):
.reduce( lambda x, y : x)
Then what happens is: For every 2 elements you pass in, always return only the first. So if you have a million elements in your RDD or DStream it will shrink it to one element in the end which is the very first one in your RDD or DStream.
Simple and clean.
However keep in mind that .reduce() does not take order into consideration. However you can easily overcome this with a custom function instead.
Example: Let's assume your data looks like this x = (1, [1,2,3]) and y = (2, [1,2]). A tuple x where the 2nd element is a list. If you are sorting by the longest list for example then your code could look like below maybe (adapt as needed):
def your_reduce(x,y):
if len(x[1]) > len(y[1]):
return x
else:
return y
yourNewRDD = yourOldRDD.reduce(your_reduce)
Accordingly you will get '(1, [1,2,3])' as that has the longer list. There you go!
This has caused me some headaches in the past until I finally tried this. Hopefully this helps.

Way to Extract list of elements from Scala list

I have standard list of objects which is used for the some analysis. The analysis generates a list of Strings and i need to look through the standard list of objects and retrieve objects with same name.
case class TestObj(name:String,positions:List[Int],present:Boolean)
val stdLis:List[TestObj]
//analysis generates a list of strings
var generatedLis:List[String]
//list to save objects found in standard list
val lisBuf = new ListBuffer[TestObj]()
//my current way
generatedLis.foreach{i=>
val temp = stdLis.filter(p=>p.name.equalsIgnoreCase(i))
if(temp.size==1){
lisBuf.append(temp(0))
}
}
Is there any other way to achieve this. Like having an custom indexof method that over rides and looks for the name instead of the whole object or something. I have not tried that approach as i am not sure about it.
stdLis.filter(testObj => generatedLis.exists(_.equalsIgnoreCase(testObj.name)))
use filter to filter elements from 'stdLis' per predicate
use exists to check if 'generatedLis' has a value of ....
Don't use mutable containers to filter sequences.
Naive solution:
val lisBuf =
for {
str <- generatedLis
temp = stdLis.filter(_.name.equalsIgnoreCase(str))
if temp.size == 1
} yield temp(0)
if we discard condition temp.size == 1 (i'm not sure it is legal or not):
val lisBuf = stdLis.filter(s => generatedLis.exists(_.equalsIgnoreCase(s.name)))