How to update a value in a DataFrame and drop the row if the value is zero, in Scala - scala

I need to update the value and, if the value is zero, drop that row. Here is a snapshot:
val net = sc.accumulator(0.0)
df1.foreach(x => { net += calculate(df2, x) })

def calculate(df2: DataFrame, x: Row): Double = {
  var pro: Double = 0.0
  df2.foreach(y => {
    if (xxx) { /* do some stuff and update the y.getLong(2) value */ }
    else if (yyy) { /* do some stuff and update the y.getLong(2) value */ }
    if (y.getLong(2) == 0) { /* drop this row from df2 */ }
  })
  pro
}
Any suggestions? Thanks.

You cannot change a DataFrame or RDD; they are read-only for a reason. But you can create a new one and use transformations by all the means available. So when you want to change, for example, the contents of a column in a DataFrame, just add a new column with the updated contents using functions like this:
df.withColumn(...)
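For instance, a minimal sketch that updates a column and then drops the zero rows; the column name "amount" and the update rule here are hypothetical stand-ins:

import org.apache.spark.sql.functions._

val updated = df2
  .withColumn("amount", when(col("amount") < 0, 0L).otherwise(col("amount") - 1))  // update the value
  .filter(col("amount") =!= 0)  // "dropping" a row just means filtering it out of the new DataFrame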

DataFrames are immutable; you cannot update a value in place, but rather create a new DF every time.
Can you reframe your use case? It's not very clear what you are trying to achieve with the above snippet (I'm not able to understand the use of the accumulator).
You can instead try df2.withColumn(...) and use your UDF there.
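A sketch of that UDF route, where recalc is a hypothetical stand-in for whatever your "do some stuff" logic computes:

import org.apache.spark.sql.functions._

val recalc = udf { (v: Long) => if (v > 100) v / 2 else v }  // made-up update rule

val result = df2
  .withColumn("value", recalc(col("value")))  // update the column
  .filter(col("value") =!= 0)                 // drop the rows whose updated value is zero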

Related

Exclude null columns in an update statement - JOOQ

I have a POJO with fields that can be updated. But sometimes only a few fields need to be updated and the rest are null. How do I write an update statement that ignores the fields that are null? Would it be better to loop through the non-null ones and dynamically add to a set statement, or to use coalesce?
I have the following query:
jooqService.using(txn)
    .update(USER_DETAILS)
    .set(USER_DETAILS.NAME, input.name)
    .set(USER_DETAILS.LAST_NAME, input.lastName)
    .set(USER_DETAILS.COURSES, input.courses)
    .set(USER_DETAILS.SCHOOL, input.school)
    .where(USER_DETAILS.ID.eq(input.id))
    .execute()
Is there a better practice?
I don't know jOOQ, but it looks like you could simply do this (note the safe-call ?. so that null fields are skipped):
val jooq = jooqService.using(txn).update(USER_DETAILS)
input.name?.let { jooq.set(USER_DETAILS.NAME, it) }
input.lastName?.let { jooq.set(USER_DETAILS.LAST_NAME, it) }
// etc...
EDIT: Mapping these fields explicitly as above is clearest in my opinion, but you could do something like this:
val fields = arrayOf(USER_DETAILS.NAME, USER_DETAILS.LAST_NAME)
val values = arrayOf(input.name, input.lastName)
val jooq = jooqService.using(txn).update(USER_DETAILS)
values.forEachIndexed { i, value ->
    value?.let { jooq.set(fields[i], it) }
}
You'd still need to enumerate all the fields and values explicitly and consistently in the arrays for this to work. It seems less readable and more error prone to me.
In Java, it would be something like this:
var jooqQuery = jooqService.using(txn)
    .update(USER_DETAILS);

if (input.name != null) {
    jooqQuery.set(USER_DETAILS.NAME, input.name);
}
if (input.lastName != null) {
    jooqQuery.set(USER_DETAILS.LAST_NAME, input.lastName);
}
// ...
jooqQuery.where(USER_DETAILS.ID.eq(input.id))
    .execute();
Another option, rather than writing this UPDATE statement yourself, is to use an UpdatableRecord:
// Load a POJO into a record using a RecordUnmapper
val r: UserDetailsRecord =
    jooqService.using(txn)
        .newRecord(USER_DETAILS, input)

(0..r.size() - 1).forEach { if (r[it] == null) r.changed(it, false) }
r.update()
You can probably write an extension function to make this available for all jOOQ records, globally, e.g. as r.updateNonNulls().
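In Scala (the language used elsewhere on this page), the analogue of such an extension could be an implicit class; a minimal sketch, assuming only jOOQ's public UpdatableRecord API:

import org.jooq.UpdatableRecord

object JooqExtensions {
  implicit class NonNullUpdate[R <: UpdatableRecord[R]](val r: R) extends AnyVal {
    // Mark every null field as unchanged so jOOQ leaves it out of the UPDATE
    def updateNonNulls(): Int = {
      (0 until r.size()).foreach(i => if (r.get(i) == null) r.changed(i, false))
      r.update()
    }
  }
}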

What input does my user defined function in spark dataframe take in?

I'm trying to combine the two columns "Format Group" and "Format Subgroup" into a single column called Format.
The output in the final Format column should be in the form Format Group:Format Subgroup.
I need to create my own UDF using some given data, but I am not sure why my UDF doesn't like the input I have given it.
This is the first rows of the data I use:
checkoutDF:
BibNumber, ItemBarcode, ItemType, Collection, CallNumber, CheckoutDateTime
1842225, 0010035249209, acbk, namys, MYSTERY ELKINS1999, 05/23/2005 03:20:00 PM
dataDictionaryDF:
Code, Description, Code Type, Format Group, Format Subgroup
acdvd, DVD: Adult/YA, ItemType, Media, Video Disc
Here's how it looks in IntelliJ IDEA.
Updated the code: changed Seq[Seq[String]] to String.
def numberCheckoutRecordsPerFormat(checkoutDF: DataFrame, dataDictionaryDF: DataFrame): DataFrame = {
  val createFeatureVector = udf { (Format_Group: String, Format_Subgroup: String) =>
    dataDictionaryDF.map(x => if (Format_Group.flatten.contains(x)) 1.0 else 0.0) ++ Array(Format_Subgroup)
  }

  checkoutDF
    .na.drop()
    .join(dataDictionaryDF
        .select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType")),
      "ItemType")
    .withColumn("Format", createFeatureVector(dataDictionaryDF("Format_Group"), dataDictionaryDF("Format_Subgroup")))
    .groupBy("ItemBarCode")
    .agg(count("ItemBarCode"))
    .withColumnRenamed("count(ItemBarCode)", "CheckoutCount")
    .select($"Format", $"CheckoutCount")
}
Furthermore, numberCheckoutRecordsPerFormat should return a DataFrame of Format and the number of checkouts for a given item, but I've got this part covered myself.
The data set used is the Seattle Library Checkout Records from Kaggle.
Thanks, people!
Doomdaam, you can try the concat_ws built-in function (always prefer built-in functions when possible). Your code will look like this:
checkoutDF
  .na.drop()
  .join(dataDictionaryDF
      .select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType")),
    "ItemType")
  .withColumn("Format", concat_ws(":", $"Format_Group", $"Format_Subgroup"))
  .groupBy("ItemBarCode")
  .agg(count("ItemBarCode"))
  .withColumnRenamed("count(ItemBarCode)", "CheckoutCount")
  .select($"Format", $"CheckoutCount")
Otherwise, your UDF would have been:
val createFeatureVector = udf { (formatGroup: String, formatSubgroup: String) => Seq(formatGroup, formatSubgroup).mkString(":") }
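Applied the same way as the built-in version (using the column names from the join above):

checkoutDF
  .withColumn("Format", createFeatureVector($"Format_Group", $"Format_Subgroup"))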

Optimize a piece of code that uses map action

The following piece of code takes a lot of time on 4 GB of raw data in a cluster:
df.select("type", "user_pk", "item_pk","timestamp")
.withColumn("date",to_date(from_unixtime($"timestamp")))
.filter($"date" > "2018-04-14")
.select("type", "user_pk", "item_pk")
.map {
row => {
val typef = row.get(0).toString
val user = row.get(1).toString
val item = row.get(2).toString
(typef, user, item)
}
}
The output should be of type Dataset[(String,String,String)].
I guess the map part takes a lot of time. Is there any way to optimize this piece of code?
I seriously doubt the map is the problem; nonetheless, I wouldn't use it at all and would go with the standard Dataset converter:
import df.sparkSession.implicits._

df.select("type", "user_pk", "item_pk", "timestamp")
  .withColumn("date", to_date(from_unixtime($"timestamp")))
  .filter($"date" > "2018-04-14")
  .select($"type" cast "string", $"user_pk" cast "string", $"item_pk" cast "string")
  .as[(String, String, String)]
You're creating a date column with Date type and then comparing it with a string?
I'd assume some implicit conversion is happening underneath (for each row while filtering).
Instead, I'd convert that date string to a timestamp and do an integer comparison (since you're using from_unixtime, I assume the timestamp is stored as System.currentTimeMillis or similar):
val timestamp = some_to_timestamp_func("2018-04-14")

df.select("type", "user_pk", "item_pk", "timestamp")
  .filter($"timestamp" > timestamp)
  // ... etc
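A concrete version of that idea, assuming the timestamp column holds epoch seconds (which is what from_unixtime in the original code implies):

import org.apache.spark.sql.functions._

val cutoff = unix_timestamp(lit("2018-04-14"), "yyyy-MM-dd")  // epoch seconds for the cutoff date

df.select("type", "user_pk", "item_pk", "timestamp")
  .filter($"timestamp" > cutoff)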

Get the first elements (take function) of a DStream

I'm looking for a way to retrieve the first elements of a DStream created as:
val dstream = ssc.textFileStream(args(1)).map(x => x.split(",").map(_.toDouble))
Unfortunately, there is no take function on a DStream as there is on an RDD, so dstream.take(2) does not work!
Does anyone have an idea how to do it? Thanks.
You can use the transform method on the DStream, take n elements of the underlying RDD and save them to a list, then filter the original RDD down to the elements contained in this list. This returns a new DStream containing n elements.
val n = 10
val partOfResult = dstream.transform { rdd =>
  val list = rdd.take(n)
  rdd.filter(list.contains)
}
partOfResult.print()
The previously suggested solution did not compile for me, as the take() method returns an Array, which is not serializable, so Spark Streaming fails with a java.io.NotSerializableException.
A simple variation on the previous code that worked for me:
val n = 10
val partOfResult = dstream.transform { rdd =>
  rdd.filter(rdd.take(n).toList.contains)
}
partOfResult.print()
Sharing a Java-based solution that is working for me. The idea is to use a custom function that selects the top row from a sorted RDD.
someData.transform(rdd -> {
    JavaRDD<CryptoDto> result =
        rdd.keyBy(Recommendations.volumeAsKey)
           .sortByKey(new CryptoComparator())
           .values()
           .zipWithIndex()
           .map(row -> {
               CryptoDto purchaseCrypto = new CryptoDto();
               purchaseCrypto.setBuyIndicator(row._2 + 1L);
               purchaseCrypto.setName(row._1.getName());
               purchaseCrypto.setVolume(row._1.getVolume());
               purchaseCrypto.setProfit(row._1.getProfit());
               purchaseCrypto.setClose(row._1.getClose());
               return purchaseCrypto;
           })
           .filter(Recommendations.selectTopInSortedRdd);
    return result;
}).print();
The custom function selectTopInSortedRdd looks like this:
public static Function<CryptoDto, Boolean> selectTopInSortedRdd = new Function<CryptoDto, Boolean>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Boolean call(CryptoDto value) throws Exception {
        if (value.getBuyIndicator() == 1L) {
            System.out.println("Value of buyIndicator: " + value.getBuyIndicator());
            return true;
        }
        else {
            return false;
        }
    }
};
It basically compares all incoming elements, and returns true only for the first record from the sorted RDD.
This seems to always be an issue with DStreams as well as regular RDDs.
If you don't want to (or can't) use .take() (especially with DStreams), you can think outside the box here and just use reduce instead. That is a valid function for both DStreams and RDDs.
Think about it. If you use reduce like this (Python example):
.reduce(lambda x, y: x)
Then what happens is: for every two elements you pass in, it always returns only the first. So if you have a million elements in your RDD or DStream, it will be shrunk to one element in the end, which is the very first one in your RDD or DStream.
Simple and clean.
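In Scala (the language of the original question), the same trick might look like this sketch, where dstream is the DStream from the question:

// Keep only the first element seen in each batch; the ordering caveat below applies
val firstOnly = dstream.reduce((x, _) => x)
firstOnly.print()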
However, keep in mind that .reduce() does not take order into consideration. You can easily overcome this with a custom function instead.
Example: let's assume your data looks like x = (1, [1,2,3]) and y = (2, [1,2]): a tuple x whose second element is a list. If you are sorting by the longest list, for example, then your code could look like this (adapt as needed):
def your_reduce(x, y):
    if len(x[1]) > len(y[1]):
        return x
    else:
        return y

yourNewRDD = yourOldRDD.reduce(your_reduce)
Accordingly, you will get (1, [1,2,3]), as that has the longer list. There you go!
This caused me some headaches in the past until I finally tried it. Hopefully this helps.

Way to Extract list of elements from Scala list

I have a standard list of objects which is used for some analysis. The analysis generates a list of Strings, and I need to look through the standard list of objects and retrieve the objects with the same names.
case class TestObj(name: String, positions: List[Int], present: Boolean)

val stdLis: List[TestObj]

// analysis generates a list of strings
var generatedLis: List[String]

// list to save objects found in the standard list
val lisBuf = new ListBuffer[TestObj]()

// my current way
generatedLis.foreach { i =>
  val temp = stdLis.filter(p => p.name.equalsIgnoreCase(i))
  if (temp.size == 1) {
    lisBuf.append(temp(0))
  }
}
Is there any other way to achieve this? Something like a custom indexOf method that overrides equality and looks at the name instead of the whole object? I have not tried that approach, as I am not sure about it.
stdLis.filter(testObj => generatedLis.exists(_.equalsIgnoreCase(testObj.name)))
use filter to keep the elements of stdLis that satisfy the predicate
use exists to check whether generatedLis contains a matching name, ignoring case
Don't use mutable containers to filter sequences.
Naive solution:
val lisBuf =
  for {
    str <- generatedLis
    temp = stdLis.filter(_.name.equalsIgnoreCase(str))
    if temp.size == 1
  } yield temp(0)
If we can discard the condition temp.size == 1 (I'm not sure whether your use case allows it), this reduces to:
val lisBuf = stdLis.filter(s => generatedLis.exists(_.equalsIgnoreCase(s.name)))
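A quick sanity check of that one-liner with made-up sample data (all values here are hypothetical):

val stdLis = List(
  TestObj("Alpha", List(1, 2), present = true),
  TestObj("Beta", List(3), present = false)
)
val generatedLis = List("alpha")

val lisBuf = stdLis.filter(s => generatedLis.exists(_.equalsIgnoreCase(s.name)))
// lisBuf: List(TestObj("Alpha", List(1, 2), true))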