collect on a dataframe spark - scala

I wrote this:
df.select(col("colname")).distinct().collect.map(_.toString()).toList
The result is:
List("[2019-06-24]", "[2019-06-22]", "[2019-06-23]")
whereas I want to get:
List("2019-06-24", "2019-06-22", "2019-06-23")
How can I change this, please?

You need to change .map(_.toString()) to .map(_.getAs[String]("colname")). With .map(_.toString()) you are calling org.apache.spark.sql.Row.toString, which is why the output looks like List("[2019-06-24]", "[2019-06-22]", "[2019-06-23]"). The correct way is:
val list = df.select("colname").distinct().collect().map(_.getAs[String]("colname")).toList
Output will be:
List("2019-06-24", "2019-06-22", "2019-06-23")

Sample data:
val df = sc.parallelize(Seq(("2019-06-24"), ("2019-06-22"), ("2019-06-23"))).toDF("cn")
Now select the column, map each row to its first value, then wrap it in quotes to build the string.
df.select("cn").collect().map(x => x(0)).map(x => s""""$x"""".toString)
//res36: Array[String] = Array("2019-06-24", "2019-06-22", "2019-06-23")
(or)
df.select("cn").collect().map(x => x(0)).map(x => s""""$x"""".toString).toList
//res37: List[String] = List("2019-06-24", "2019-06-22", "2019-06-23")

Related

ConstraintSuggestionRunner not taking up columns enclosed with backticks

I am currently importing a dataset from an Excel sheet which has a column name containing a dot, like "abc.xyz".
I went through a couple of Stack Overflow questions, which say that such column names can be referenced by enclosing them in backticks, like "`abc.xyz`". So, I renamed every column name containing a dot to the same name enclosed in backticks:
df.columns.foreach(item => {
  if (item.contains(".")) {
    df.withColumnRenamed(item, s"`$item`")
  }
})
Now when I pass this dataframe inside the ConstraintSuggestionRunner class like this:
val suggestionResult = ConstraintSuggestionRunner()
.onData(df)
.addConstraintRules(Rules.DEFAULT)
.setKLLParameters(KLLParameters(sketchSize = 2048, shrinkingFactor = 0.64, numberOfBuckets = 10))
.run()
I am getting errors like :
ERROR Main: org.apache.spark.sql.AnalysisException: cannot resolve
'`abc.xyz`' given input columns:
How can I resolve this error?
The escaping should be handled inside Deequ, but that issue is still open. What you did here is add the backticks as part of the column names, not escape them.
You can try replacing the dots with another character, such as an underscore _, then pass the dataframe with the renamed columns to the ConstraintSuggestionRunner:
val df1 = df.toDF(df.columns.map(_.replaceAll("[.]+", "_")):_*)
val suggestionResult = ConstraintSuggestionRunner()
.onData(df1)
.addConstraintRules(Rules.DEFAULT)
.setKLLParameters(KLLParameters(sketchSize = 2048, shrinkingFactor = 0.64, numberOfBuckets = 10))
.run()
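For instance, a quick check of the rename step on a toy frame (the column names here are made up, and spark.implicits._ is assumed to be in scope):
val df = Seq((1, "a")).toDF("abc.xyz", "plain")
val df1 = df.toDF(df.columns.map(_.replaceAll("[.]+", "_")): _*)
df1.columns // Array(abc_xyz, plain)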

What input does my user defined function in spark dataframe take in?

I'm trying to combine the two columns "Format Group" and "Format SubGroup" into a single column called Format.
The output in the final Format column should be in the form Format Group:Format Subgroup.
I need to create my own UDF using some given data, but I am not sure why my UDF doesn't like the input I have given it.
This is the first rows of the data I use:
checkoutDF:
BibNumber, ItemBarcode, ItemType, Collection, CallNumber, CheckoutDateTime
1842225, 0010035249209, acbk, namys, MYSTERY ELKINS1999, 05/23/2005 03:20:00 PM
dataDictionaryDF:
Code, Description, Code Type, Format Group, Format Subgroup
acdvd, DVD: Adult/YA, ItemType, Media, Video Disc
Updated the code: changed Seq[Seq[String]] to String.
def numberCheckoutRecordsPerFormat(checkoutDF: DataFrame, dataDictionaryDF: DataFrame): DataFrame = {
  val createFeatureVector = udf { (Format_Group: String, Format_Subgroup: String) =>
    dataDictionaryDF.map(x => if (Format_Group.flatten.contains(x)) 1.0 else 0.0) ++ Array(Format_Subgroup)
  }
  checkoutDF
    .na.drop()
    .join(dataDictionaryDF
      .select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType")), "ItemType")
    .withColumn("Format", createFeatureVector(dataDictionaryDF("Format_Group"), dataDictionaryDF("Format_Subgroup")))
    .groupBy("ItemBarCode")
    .agg(count("ItemBarCode"))
    .withColumnRenamed("count(ItemBarCode)", "CheckoutCount")
    .select($"Format", $"CheckoutCount")
}
Furthermore, numberCheckoutRecordsPerFormat should return a DataFrame of Format and the number of checkouts for a given item, but I've got that part covered myself.
The data set used is the Seattle Library Checkout Records from Kaggle
Thanks, people!
Doomdaam, you can try the concat_ws built-in function (always prefer built-in functions when possible). Your code will look like:
checkoutDF
  .na.drop()
  .join(dataDictionaryDF
    .select($"Format_Group", $"Format_Subgroup", $"Code".as("ItemType")), "ItemType")
  .withColumn("Format", concat_ws(":", $"Format_Group", $"Format_Subgroup"))
  .groupBy("ItemBarCode")
  .agg(count("ItemBarCode"))
  .withColumnRenamed("count(ItemBarCode)", "CheckoutCount")
  .select($"Format", $"CheckoutCount")
Otherwise, your UDF would have been:
val createFeatureVector = udf{(formatGroup:String, formatSubgroup:String) => Seq(formatGroup,formatSubgroup).mkString(":")}
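Either way, a quick sanity check on a toy version of the dictionary row shown above (a sketch, assuming spark.implicits._ is in scope):
import org.apache.spark.sql.functions.concat_ws
Seq(("Media", "Video Disc")).toDF("Format_Group", "Format_Subgroup")
  .withColumn("Format", concat_ws(":", $"Format_Group", $"Format_Subgroup"))
  .show(false)
// +------------+---------------+----------------+
// |Format_Group|Format_Subgroup|Format          |
// +------------+---------------+----------------+
// |Media       |Video Disc     |Media:Video Disc|
// +------------+---------------+----------------+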

How to pass multiParams in scalatra

If I want to read a single parameter in a get request in scalatra I can do it as follows:
get("mypath/:id") {
val id = params("id")
...
}
According to the scalatra documentation I can also use multiParams to get a sequence of parameters:
val ids = multiParams("ids")
But it does not say how the URL should be formed should I wish to pass more than one parameter. So if I wanted to pass multiple ids what is the format for the URL?
I have tried it with ampersands, commas and semi-colons but to no avail: e.g.
../mypath/id1&id2
Check the docs: http://scalatra.org/guides/2.4/http/routes.html
As an example, let’s hit a URL with a GET like this:
/articles/52?foo=uno&bar=dos&baz=three&foo=anotherfoo
Look closely: there are two “foo” keys in there.
Assuming there’s a matching route at /articles/:id, we get the
following results inside the action:
get("/articles/:id") {
params("id") // => "52"
params("foo") // => "uno" (discarding the second "foo" parameter value)
params("unknown") // => generates a NoSuchElementException
params.get("unknown") // => None - this is what Scala does with unknown keys in a Map
multiParams("id") // => Seq("52")
multiParams("foo") // => Seq("uno", "anotherfoo")
multiParams("unknown") // => an empty Seq
}
So you would need to name each param, e.g. /mypath/?ids=id1&ids=id2&ids=id3
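A minimal sketch of the handler side for that query-string style (the route path here is an assumption):
get("/mypath/") {
  val ids: Seq[String] = multiParams("ids") // => Seq("id1", "id2", "id3") for ?ids=id1&ids=id2&ids=id3
  ids.mkString(",")
}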
You can also embed multiple same-name parameters in the path and read them through multiParams:
// http://localhost:8080/articles/id1/id2
get("/articles/:id/:id") {
  println(multiParams("id")) // => Seq("id1", "id2")
}

how to update a value in dataframe and drop a row on this basis of a given value in scala

I need to update a value and, if the value becomes zero, drop that row. Here is the snapshot:
val net = sc.accumulator(0.0)
df1.foreach(x => { net += calculate(df2, x) })

def calculate(df2: DataFrame, x: Row): Double = {
  var pro: Double = 0.0
  df2.foreach(y => {
    if (xxx) { /* do some stuff and update the y.getLong(2) value */ }
    else if (yyy) { /* do some stuff and update the y.getLong(2) value */ }
    if (y.getLong(2) == 0) { /* drop this row from df2 */ }
  })
  return pro
}
Any suggestions? Thanks.
You cannot change a DataFrame or RDD; they are read-only for a reason. But you can create a new one, using any of the available transformations. So when you want to change, for example, the contents of a column in a dataframe, just add a new column with the updated contents using functions like:
df.withColumn(...)
DataFrames are immutable: you cannot update a value in place, but rather create a new DF every time.
Can you reframe your use case? It's not very clear what you are trying to achieve with the above snippet (I'm not able to understand the use of the accumulator).
You can instead try df2.withColumn(...) and use your UDF there.
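To make that concrete, here is a hedged sketch of the transform-then-filter approach; flag and value are placeholder column names, and the when conditions stand in for your xxx/yyy logic:
import org.apache.spark.sql.functions.{col, when}

val updated = df2
  .withColumn("value", when(col("flag") === "xxx", col("value") - 1) // "update" by deriving a new column
    .otherwise(col("value")))
  .filter(col("value") =!= 0) // "drop" the zero rows by filtering them out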

Way to Extract list of elements from Scala list

I have a standard list of objects which is used for some analysis. The analysis generates a list of Strings, and I need to look through the standard list of objects and retrieve the objects with the same names.
case class TestObj(name: String, positions: List[Int], present: Boolean)
val stdLis: List[TestObj]
// analysis generates a list of strings
var generatedLis: List[String]
// list to save objects found in the standard list
val lisBuf = new ListBuffer[TestObj]()
// my current way
generatedLis.foreach { i =>
  val temp = stdLis.filter(p => p.name.equalsIgnoreCase(i))
  if (temp.size == 1) {
    lisBuf.append(temp(0))
  }
}
Is there any other way to achieve this? For example, a custom indexOf method that overrides equality and compares the name instead of the whole object, or something similar. I have not tried that approach as I am not sure about it.
stdLis.filter(testObj => generatedLis.exists(_.equalsIgnoreCase(testObj.name)))
use filter to keep the elements from 'stdLis' that match the predicate
use exists to check whether 'generatedLis' contains a matching name, ignoring case
Don't use mutable containers to filter sequences.
Naive solution:
val lisBuf =
  for {
    str <- generatedLis
    temp = stdLis.filter(_.name.equalsIgnoreCase(str))
    if temp.size == 1
  } yield temp(0)
If we discard the condition temp.size == 1 (I'm not sure whether it is required or not):
val lisBuf = stdLis.filter(s => generatedLis.exists(_.equalsIgnoreCase(s.name)))
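If stdLis is large, a case-insensitive lookup map avoids rescanning it for every generated string (a sketch; it keeps only unambiguous matches, mirroring the temp.size == 1 check above):
val byName = stdLis.groupBy(_.name.toLowerCase)
val lisBuf = generatedLis.flatMap { s =>
  byName.getOrElse(s.toLowerCase, Nil) match {
    case single :: Nil => Some(single) // exactly one object with this name
    case _             => None         // no match or ambiguous: skip
  }
}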