PySpark UDF to test conversion from String to Int

I want to apply a UDF that returns the original value only if it can be converted to an int. So far I have tried two functions:
def nb_int(s):
    try:
        val = int(s)
        return s
    except:
        return "ERROR"

def nb_digit(s):
    if (s.isdigit() == True):
        return s
    else:
        return "ERROR"

nb_udf = F.udf(nb_digit, StringType())
df_corrected = df.withColumn("IntValue", nb_udf("nb_value"))
I applied this function to "nb_value", but it's not working:
df_corrected.filter(df_corrected["IntValue"] == "ERROR").select("nb_value").dropDuplicates(subset=["nb_value"]).collect()
The result of the last line should only give me values that are not convertible, but I still have 1, 2, 4, etc.:
[Row(nb_value=u'MS'),
Row(nb_value=u'286'),
Row(nb_value=u'TB'),
Row(nb_value=u'GF'),
Row(nb_value=u'287'),
Row(nb_value=u'MU'),
Row(nb_value=u'170'),
Row(nb_value=u'A9'),
Row(nb_value=u'288'),
Row(nb_value=u'171'),
Row(nb_value=u'333'),....
Any help to fix it is welcome! Thanks.

Both of your UDFs work for me. You'd better use the second one, because raising exceptions can slow Spark down. The == True is redundant in nb_digit, but apart from that it's fine.
First let's create the example dataframe:
df = hc.createDataFrame(sc.parallelize([['MS'],['286'],['TB'],['GF'],['287'],['MU'],['170'],['A9'],['288'],['171'],['333']]), ["nb_value"])
Using any of your UDFs:
df_corrected = df.withColumn("IntValue", nb_udf("nb_value"))
df_corrected.filter(df_corrected["IntValue"] == "ERROR").select("nb_value").dropDuplicates(subset=["nb_value"]).collect()
[Row(nb_value=u'MS'),
Row(nb_value=u'TB'),
Row(nb_value=u'GF'),
Row(nb_value=u'A9'),
Row(nb_value=u'MU')]

Related

Return keyword inside the inline function in Scala

I've heard that you should not use the return keyword in Scala, because it might change the flow of the program, like:
// this will return only 2 because of return keyword
List(1, 2, 3).map(value => return value * 2)
Here is my case: I have a recursive case class and I'm using DFS to do some calculation on it, so the maximum depth could be 5. Here is the model:
case class Line(
  lines: Option[Seq[Line]],
  balls: Option[Seq[Ball]],
  op: Option[String]
)
I'm using a DFS approach to search this recursive model. But at some point, if a special value exists in the data, I want to stop iterating over the remaining data and return the result directly instead. Here is an example:
Line(
  lines = Some(Seq(
    Line(None, Some(Seq(Ball(1), Ball(3))), Some("s")),
    Line(None, Some(Seq(Ball(5), Ball(2))), Some("d")),
    Line(None, Some(Seq(Ball(9))), None)
  )),
  balls = None,
  None
)
In this data, I want to return something like "NOT_OKAY" if I run into Ball(5), which means I don't need to do any operation on Ball(2) or Ball(9) anymore. Otherwise, I will apply a calculation to each Ball(x) with the given operator.
I'm using this sort of DFS method:
def calculate(line: Line) = {
  // the String here is the actual result that I want; the Boolean just records whether there is data that I don't want
  def dfs(line: Line): (String, Boolean) = {
    line.balls.map { _.map { ball =>
      val result = someCalculationOnBall(ball)
      // return keyword here because I don't want to iterate over the remaining balls
      if (result == "NOTREQUIRED") return ("NOT_OKAY", true)
      ("OKAY", false)
    }}.getOrElse(
      line.lines.map { _.map { subLine =>
        val groupResult = dfs(subLine)
        // using return here because I want to return the result directly instead of iterating over the remaining lines
        if (groupResult._2) return ("NOT_OKAY", true)
        ("OKAY", false)
      }}
    )
  }
  // ... rest of the method
}
In this case, I'm using the return keyword in the inline functions, which changes the behaviour of the inner map functions completely. I've read a few things about not using the return keyword in Scala, but I couldn't figure out whether this will create a problem or not, because in my case I don't want to do any calculation once I run into a value that I don't want to see. I also couldn't find a functional way to get rid of the return keyword.
Is there any side effect, like a stack exception etc., to using the return keyword here? I'm always open to alternative ways. Thank you so much!
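For what it's worth, one return-free way to express this kind of short-circuit is exists, which stops at the first matching element. This is only a sketch built on the question's own names (Line, Ball, someCalculationOnBall), and it simplifies the result to the (String, Boolean) pair used above:
def dfs(line: Line): (String, Boolean) = {
  val hitBad =
    line.balls.exists(_.exists(ball => someCalculationOnBall(ball) == "NOTREQUIRED")) ||  // stops at the first bad ball
    line.lines.exists(_.exists(subLine => dfs(subLine)._2))                               // recurses into sub-lines only if needed
  if (hitBad) ("NOT_OKAY", true) else ("OKAY", false)
}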

Optimize a piece of code that uses map action

The following piece of code takes a lot of time on 4Gb of raw data in a cluster:
df.select("type", "user_pk", "item_pk","timestamp")
.withColumn("date",to_date(from_unixtime($"timestamp")))
.filter($"date" > "2018-04-14")
.select("type", "user_pk", "item_pk")
.map {
row => {
val typef = row.get(0).toString
val user = row.get(1).toString
val item = row.get(2).toString
(typef, user, item)
}
}
The output should be of type Dataset[(String,String,String)].
I guess that map part takes a lot of time. Is there any way to optimize this piece of code?
I seriously doubt the map is the problem; nonetheless, I wouldn't use it at all and would go with the standard Dataset converter:
import df.sparkSession.implicits._

df.select("type", "user_pk", "item_pk", "timestamp")
  .withColumn("date", to_date(from_unixtime($"timestamp")))
  .filter($"date" > "2018-04-14")
  .select($"type" cast "string", $"user_pk" cast "string", $"item_pk" cast "string")
  .as[(String, String, String)]
You're creating a date column with Date type and then comparing it with a string?
I'd assume some implicit transformation is happening underneath (for each row while filtering).
Instead, I'd convert that date string to a timestamp and do a numeric comparison (since you're using from_unixtime, I assume the timestamp is stored as System.currentTimeMillis or similar):
val timestamp = some_to_timestamp_func("2018-04-14")

df.select("type", "user_pk", "item_pk", "timestamp")
  .filter($"timestamp" > timestamp)
  // ... etc
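As a rough sketch of that numeric-comparison idea (reusing the column names and implicits import from the answers above, and assuming the timestamp column holds Unix seconds, which is what from_unixtime expects), the cutoff could be computed with Spark's built-in unix_timestamp:
import org.apache.spark.sql.functions.{lit, unix_timestamp}

// cutoff column holding the Unix-seconds value of midnight 2018-04-14
val cutoff = unix_timestamp(lit("2018-04-14"), "yyyy-MM-dd")

df.select("type", "user_pk", "item_pk", "timestamp")
  .filter($"timestamp" > cutoff)  // plain numeric comparison, no per-row date conversion
  .select($"type" cast "string", $"user_pk" cast "string", $"item_pk" cast "string")
  .as[(String, String, String)]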

How to update a value in a dataframe and drop a row on the basis of a given value in Scala

I need to update a value and, if the value is zero, drop that row. Here is the snapshot:
val net = sc.accumulator(0.0)
df1.foreach(x => { net += calculate(df2, x) })

def calculate(df2: DataFrame, x: Row): Double = {
  var pro: Double = 0.0
  df2.foreach(y => {
    if (xxx) { /* do some stuff and update the y.getLong(2) value */ }
    else if (yyy) { /* do some stuff and update the y.getLong(2) value */ }
    if (y.getLong(2) == 0) { /* drop this row from df2 */ }
  })
  return pro
}
Any suggestions? Thanks.
You cannot change a DataFrame or RDD; they are read-only for a reason. But you can create a new one and use all of the available transformations. So when you want to change, for example, the contents of a column in a dataframe, just add a new column with the updated contents using functions like this:
df.withColumn(...)
DataFrames are immutable; you cannot update a value but rather have to create a new DF every time.
Can you reframe your use case? It's not very clear what you are trying to achieve with the above snippet (I'm not able to understand the use of the accumulator).
You can instead try df2.withColumn(...) and use your UDF there.
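To make that concrete, here is a minimal sketch of the withColumn-plus-filter pattern both answers point at; updateValue and the column name "value" are hypothetical stand-ins for the question's "do some stuff" logic:
import org.apache.spark.sql.functions.{col, udf}

// hypothetical update rule standing in for "do some stuff"
val updateValue = udf((v: Long) => if (v > 10) v - 10 else 0L)

val df2Updated = df2
  .withColumn("value", updateValue(col("value")))  // update the column in a new DataFrame
  .filter(col("value") =!= 0)                      // drop rows where the updated value is zero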

Get the first elements (take function) of a DStream

I'm looking for a way to retrieve the first elements of a DStream created as:
val dstream = ssc.textFileStream(args(1)).map(x => x.split(",").map(_.toDouble))
Unfortunately, there is no take function on a DStream (as there is on an RDD): // dstream.take(2) !!!
Does anyone have an idea of how to do it? Thanks.
You can use the transform method on the DStream, take n elements of the input RDD and save them to a list, then filter the original RDD to keep only elements contained in this list. This will return a new DStream containing n elements.
val n = 10
val partOfResult = dstream.transform(rdd => {
  val list = rdd.take(n)
  rdd.filter(list.contains)
})
partOfResult.print
The previously suggested solution did not compile for me, as the take() method returns an Array, which is not serializable, so Spark Streaming will fail with a java.io.NotSerializableException.
A simple variation on the previous code that worked for me:
val n = 10
val partOfResult = dstream.transform(rdd => {
  rdd.filter(rdd.take(n).toList.contains)
})
partOfResult.print
Sharing a Java-based solution that is working for me. The idea is to use a custom function that can select the top row from a sorted RDD.
someData.transform(rdd -> {
    JavaRDD<CryptoDto> result =
        rdd.keyBy(Recommendations.volumeAsKey)
           .sortByKey(new CryptoComparator()).values().zipWithIndex()
           .map(row -> {
               CryptoDto purchaseCrypto = new CryptoDto();
               purchaseCrypto.setBuyIndicator(row._2 + 1L);
               purchaseCrypto.setName(row._1.getName());
               purchaseCrypto.setVolume(row._1.getVolume());
               purchaseCrypto.setProfit(row._1.getProfit());
               purchaseCrypto.setClose(row._1.getClose());
               return purchaseCrypto;
           })
           .filter(Recommendations.selectTopInSortedRdd);
    return result;
}).print();
The custom function selectTopInSortedRdd looks like this:
public static Function<CryptoDto, Boolean> selectTopInSortedRdd = new Function<CryptoDto, Boolean>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Boolean call(CryptoDto value) throws Exception {
        if (value.getBuyIndicator() == 1L) {
            System.out.println("Value of buyIndicator: " + value.getBuyIndicator());
            return true;
        } else {
            return false;
        }
    }
};
It basically compares all incoming elements, and returns true only for the first record from the sorted RDD.
This always seems to be an issue with DStreams as well as regular RDDs. If you don't want (or can't) use .take() (especially with DStreams), you can think outside the box here and just use reduce instead. That is a valid function for both DStreams and RDDs.
Think about it. If you use reduce like this (Python example):
.reduce(lambda x, y: x)
Then what happens is: for every two elements you pass in, it always returns only the first. So if you have a million elements in your RDD or DStream, it will shrink to one element in the end, which is the very first one in your RDD or DStream.
Simple and clean.
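Translated to the Scala DStream from the question, that first-element-wins reduce might look like the sketch below; each batch collapses to a single element:
// keep only the "first" element of each batch via reduce
val firstPerBatch = dstream.reduce((x, y) => x)
firstPerBatch.print()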
Keep in mind, however, that .reduce() does not take order into consideration. You can easily overcome this with a custom function instead.
Example: let's assume your data looks like x = (1, [1,2,3]) and y = (2, [1,2]), i.e. a tuple whose second element is a list. If you are sorting by the longest list, for example, then your code could look like this (adapt as needed):
def your_reduce(x, y):
    if len(x[1]) > len(y[1]):
        return x
    else:
        return y

yourNewRDD = yourOldRDD.reduce(your_reduce)
Accordingly, you will get (1, [1,2,3]), as that has the longer list. There you go!
This has caused me some headaches in the past until I finally tried this. Hopefully this helps.

How does the following Java "continue" code translate to Scala?

for (String stock : allStocks) {
    Quote quote = getQuote(...);
    if (null == quoteLast) {
        continue;
    }
    Price price = quote.getPrice();
    if (null == price) {
        continue;
    }
}
I don't necessarily need a line by line translation, but I'm looking for the "Scala way" to handle this type of problem.
You don't need continue or breakable or anything like that in cases like this: Options and for comprehensions do the trick very nicely:
val stocksWithPrices =
  for {
    stock <- allStocks
    quote <- Option(getQuote(...))
    price <- Option(quote.getPrice())
  } yield (stock, quote, price)
Generally you try to avoid those situations to begin with by filtering before you even start:
val goodStocks = allStocks.view
  .map(stock => (stock, stock.getQuote)).filter(_._2 != null)
  .map { case (stock, quote) => (stock, quote, quote.getPrice) }.filter(_._3 != null)
(This example shows how you'd carry along partial results if you need them.) I've used a view so that results will be computed as needed, instead of creating a bunch of new collections at each step.
Actually, you'd probably have the quotes and such return Options; look around on Stack Overflow for examples of how to use those instead of null return values.
But anyway, if that sort of thing doesn't work so well (e.g. because you are generating too many intermediate results that you need to keep, or you are relying on updating mutable variables and want to keep the evaluation pattern simple so you know what's happening when), and you can't conceive of the problem in a different, possibly more robust way, then you can:
import scala.util.control.Breaks._

for (stock <- allStocks) {
  breakable {
    val quote = getQuote(...)
    if (quoteLast eq null) break
    ...
  }
}
The breakable construct specifies where breaks should take you to. If you put breakable outside a for loop, it works like a standard Java-style break. If you put it inside, it acts like continue.
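For contrast, a sketch with breakable wrapping the loop instead, which gives Java-style break semantics (getQuote(...) is the question's placeholder):
import scala.util.control.Breaks._

breakable {
  for (stock <- allStocks) {
    val quote = getQuote(...)
    if (quote eq null) break  // exits the whole loop, not just this iteration
    // process quote
  }
}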
Of course, if you have a very small number of conditions, you don't need the continue at all; just use the else of the if-statement.
Your control structure here can be mapped very idiomatically into the following for loop, and your code demonstrates the kind of filtering that Scala's for loop was designed for.
for {
  stock <- allStocks.view
  quote = getQuote(...)
  if quoteLast != null
  price = quote.getPrice
  if null != price
} {
  // whatever comes after all of the null tests
}
By the way, Scala will automatically desugar this into the code from Rex Kerr's solution:
val goodStocks = allStocks.view
  .map(stock => (stock, stock.getQuote)).filter(_._2 != null)
  .map { case (stock, quote) => (stock, quote, quote.getPrice) }.filter(_._3 != null)
This solution probably doesn't work in general for all different kinds of more complex flows that might use continue, but it does address a lot of common ones.
If the focus is really on the continue and not on the null handling, just define an inner method (the null handling part is a different idiom in Scala):
def handleStock(stock: String): Unit = {
  val quote = getQuote(...)
  if (null == quoteLast) {
    return
  }
  val price = quote.getPrice()
  if (null == price) {
    return
  }
}

for (stock <- allStocks) {
  handleStock(stock)
}
The simplest way is to embed the skipped-over code in an if with the reverse sense of what you have.
See http://www.scala-lang.org/node/257
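For instance, a sketch of that reversed-condition version of the question's loop (getQuote(...) is again the question's placeholder, and quote rather than quoteLast is tested):
for (stock <- allStocks) {
  val quote = getQuote(...)
  if (quote != null) {            // reversed sense of "if (null == quote) continue"
    val price = quote.getPrice()
    if (price != null) {
      // whatever comes after all of the null tests
    }
  }
}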