dataframe: how to groupBy/count then filter on count in Scala - scala

Spark 1.4.1
I encounter a situation where grouping by a dataframe, then counting and filtering on the 'count' column raises the exception below
import sqlContext.implicits._
import org.apache.spark.sql._
case class Paf(x:Int)
val myData = Seq(Paf(2), Paf(1), Paf(2))
val df = sc.parallelize(myData, 2).toDF()
Then grouping and filtering:
df.groupBy("x").count()
.filter("count >= 2")
.show()
Throws an exception:
java.lang.RuntimeException: [1.7] failure: ``('' expected but `>=' found count >= 2
Solution:
Renaming the column makes the problem vanish (as I suspect there is no conflict with the interpolated 'count' function'
df.groupBy("x").count()
.withColumnRenamed("count", "n")
.filter("n >= 2")
.show()
So, is that a behavior to expect, a bug or is there a canonical way to go around?
thanks, alex

When you pass a string to the filter function, the string is interpreted as SQL. Count is a SQL keyword and using count as a variable confuses the parser. This is a small bug (you can file a JIRA ticket if you want to).
You can easily avoid this by using a column expression instead of a String:
df.groupBy("x").count()
.filter($"count" >= 2)
.show()

So, is that a behavior to expect, a bug
Truth be told I am not sure. It looks like parser is interpreting count not as a column name but a function and expects following parentheses. Looks like a bug or at least a serious limitation of the parser.
is there a canonical way to go around?
Some options have been already mentioned by Herman and mattinbits so here more SQLish approach from me:
import org.apache.spark.sql.functions.count
df.groupBy("x").agg(count("*").alias("cnt")).where($"cnt" > 2)

I think a solution is to put count in back ticks
.filter("`count` >= 2")
http://mail-archives.us.apache.org/mod_mbox/spark-user/201507.mbox/%3C8E43A71610EAA94A9171F8AFCC44E351B48EDF#fmsmsx124.amr.corp.intel.com%3E

Related

Adding Column In sparkdataframe

Hi I am trying to add one column in my spark dataframe and calculating value based on existing dataframe column. I am writing below code.
val df1=spark.sql("select id,dt1,salary frm dbdt1.tabledt1")
val df2=df1.withColumn("new_date",WHEN (month(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))
IN (01,02,03)) THEN
CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy')))-1,'-'),
substr(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy'))),3,4))
.otherwise(CONCAT(CONCAT(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM- yyyy'))),'-')
,SUBSTR(year(to_date(from_unixtime(unix_timestamp(dt1), 'dd-MM-yyyy')))+1,3,4))))
But it always showing issue error: unclosed character literal. Can someone plase guide me how should i add this new column or modify the existing code.
Incorrect syntax in many places. First I suggest you look at a few spark sql examples online and also the org.apache.spark.sql.functions API documentation because your use of WHEN, CONCAT, IN are all incorrect.
Scala strings are enclosed by double quotes, you appear to be using SQL string syntax.
'dd-MM-yyyy' should be "dd-MM-yyyy"
To reference a column dt1 on DataFrame df1 you can use one of the following:
df1("dt1")
col("dt1") // if you import org.apache.spark.sql.functions.col
$"dt1" // if you import spark.implicits._ locally
For example:
from_unixtime(unix_timestamp(col("dt1")), 'dd-MM- yyyy')

flatten a spark data frame's column values and put it into a variable

Spark version 1.60, Scala version 2.10.5.
I have a spark-sql dataframe df like this,
+-------------------------------------------------+
|addess | attributes |
+-------------------------------------------------+
|1314 44 Avenue | Tours, Mechanics, Shopping |
|115 25th Ave | Restaurant, Mechanics, Brewery|
+-------------------------------------------------+
From this dataframe, I would like values as below,
Tours, Mechanics, Shopping, Brewery
If I do this,
df.select(df("attributes")).collect().foreach(println)
I get,
[Tours, Mechanics, Shopping]
[Restaurant, Mechanics, Brewery]
I thought I could use flatMapinstead found this, so, tried to put this into a variable using,
val allValues = df.withColumn(df("attributes"), explode("attributes"))
but I am getting an error:
error: type mismatch;
found:org.apache.spark.sql.column
required:string
I was thinking if I can get an output using explode I can use distinct to get the unique values after flattening them.
How can I get the desired output?
I strongly recommend you to use spark 2.x version. In Cloudera, when you issue "spark-shell", it launches 1.6.x version.. however, if you issue "spark2-shell", you get the 2.x shell. Check with your admin
But if you need with Spark 1.6 and rdd solution, try this.
import spark.implicits._
import scala.collection.mutable._
val df = Seq(("1314 44 Avenue",Array("Tours", "Mechanics", "Shopping")),
("115 25th Ave",Array("Restaurant", "Mechanics", "Brewery"))).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[mutable.WrappedArray[String]]("attributes") ).distinct().collect.foreach(println)
Results:
Brewery
Shopping
Mechanics
Restaurant
Tours
If the "attribute" column is not an array, but comma separated string, then use the below one which gives you same results
val df = Seq(("1314 44 Avenue","Tours,Mechanics,Shopping"),
("115 25th Ave","Restaurant,Mechanics,Brewery")).toDF("address","attributes")
df.rdd.flatMap( x => x.getAs[String]("attributes").split(",") ).distinct().collect.foreach(println)
The problem is that withColumn expects a String in its first argument (which is the name of the added column), but you're passing it a Column here df.withColumn(df("attributes").
You only need to pass "attributes" as a String.
Additionally, you need to pass a Column to the explode function, but you're passing a String - to make it a column you can use df("columName") or the Scala shorthand $ syntax, $"columnName".
Hope this example can help you.
import org.apache.spark.sql.functions._
val allValues = df.select(explode($"attributes").as("attributes")).distinct
Note that this will only preserve the attributes Column, since you want the distinct elements on that one.

How to remove header by using filter function in spark?

I want to remove header from a file. But, since the file will be split into partitions, I can't just drop the first item. So I was using a filter function to figure it out and here below is the code I am using :
val noHeaderRDD = baseRDD.filter(line=>!line.contains("REPORTDATETIME"));
and the error I am getting says "error not found value line "what could be the issue here with this code?
I don't think anybody answered the obvious, whereby line.contains also possible:
val noHeaderRDD = baseRDD.filter(line => !(line contains("REPORTDATETIME")))
You were nearly there, just a syntax issue, but that is significant of course!
Using textFile as below:
val rdd = sc.textFile(<<path>>)
rdd.filter(x => !x.startsWith(<<"Header Text">>))
Or
In Spark 2.0:
spark.read.option("header","true").csv("filePath")

Using a csv feeder to check a count i.e. convert to Int

I am using Gatling 2.0.0-SNAPSHOT (from a few months ago) and I have a CSV file:
search,min_results
test,1000
testing,50
...
I would, ideally, like to check that the number of results are equal or greater than the min_results column, something like:
.get(...)
.check(
jsonPath("$..results")
.count
.is(>= "${min_results}".toInt)
The last line doesn't work as it probably isn't valid Scala but also"${min_results}".toInt tries to convert ${min_results} rather than its value to an Int.
I would settle for just fixing the conversion toInt problem but this might result in a slightly less robust script. However, I could then add a max_results column and use the .in(xx to yy).
Thanks
In this case, toInt definitely won't work, as it would be called before the EL has been resolved.
However, this code snippet would achieve what you want :
.get(...)
.check(
jsonPath("$...results")
.count
.greaterThanOrEqual(session => session("min_results").as[String].toInt))
Here, using as[Int] works as expected, since the min_results attribute has already been resolved from the session at this point.

Error "java.util.Date does not take parameters" in Squeryl's where "clause"

I'm fairly new to Scala and currently using an ORM called Squeryl against our MySQL database.
What I'm trying to do is looking up plural records that fall within a time range.
For example, in plain SQL, I think it would be something like:
SELECT * FROM records WHERE updated_at >= ? AND updated_at < ?
However, my Scala code to achieve similar behavior as below, gives me an error saying "java.util.Date does not take parameters" at the opening bracket in "from(records)"
def getRecordsBetween(from:java.util.Date, til:java.util.Date):List[Record]
transaction {
from(records)(record =>
where(
record.updatedAt gte from and
record.updatedAt lt til
)
select(record)
).toList
}
}
(where val records = tableRecord
What am I doing wrong here? Thanks a lot in advance.
Method parameters and methods are in the same namespace in Scala, so your from method parameter is "shadowing" the from method on the PrimitiveTypeMode (or CustomTypeMode) object that you brought into scope with a line like the following:
import org.squeryl.PrimitiveTypeMode._
Just change the parameter name to fromDate or start or whatever and you should be fine.