Slick 2 aggregation - how to get a scalar result? - scala

I have a table with an Int column TIME in it:
def time = column[Int]("TIME")
The table is mapped to a custom type. I want to find a maximum time value, i.e. to perform a simple aggregation. The example in the documentation seems easy enough:
val q = coffees.map(_.price)
val q1 = q.min
val q2 = q.max
However, when I do this, the type of q1 and q2 is Column[Option[Int]]. I can call get or getOrElse on this to get a result of type Column[Int] (even this seems somewhat surprising to me - is get a member of Column, or is the value converted from Option[Int] to Int and then wrapped in Column again? Why?), but I am still unable to use the scalar value: when I attempt to assign it to an Int, I get an error message saying:
type mismatch;
found : scala.slick.lifted.Column[Int]
required: Int
How can I get the scalar value from the aggregated query?

My guess is that you are not calling the invoker; that's the reason why you get a Column object. Try this:
val q1 = q.min.run
It should return an Option[Int], and then you can call get or getOrElse on it.
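For completeness, here is a minimal sketch of the whole round trip. It assumes Slick 2.x with the lifted embedding (import yourDriver.simple._), a Database value named db, and a TableQuery called records whose row type exposes the TIME column as time; adjust the names to your own mapping.
db.withSession { implicit session =>
  // .max yields a Column[Option[Int]]; .run executes the query and returns Option[Int]
  val maxTime: Option[Int] = records.map(_.time).max.run
  // the Option is None when the table is empty, hence getOrElse
  val maxAsInt: Int = maxTime.getOrElse(0)
}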

Related

overloaded method value select with alternatives

I'm trying to select several columns and cast all of them, but I receive this error:
"overloaded method value select with alternatives: (col:
String,cols: String*)org.apache.spark.sql.DataFrame (cols:
org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame cannot be
applied to (org.apache.spark.sql.Column, org.apache.spark.sql.Column,
String)"
the code is this:
val result = df.select(
col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.NUMERO_CONTRATTO}").cast(IntegerType),
col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.CODICE_PORTAFOGLIO}").cast(IntegerType),
col(s"${Constant.CS}_exp.${Constant.RATEALE}.${Constant.STORIA_DEL_CONTRATTO}"))
The last part of the error message means that the compiler cannot find a select method whose signature fits your call: select(Column, Column, String).
However, the compiler found 2 possible methods, but they don't fit:
select(col: String, cols: String*)
select(cols: Column*)
(the * means "any number of")
This, I am sure of.
However, I don't understand why you get that error with the code you've given, which is actually select(Column, Column, Column) and fits the select(cols: Column*) signature. For some reason, the compiler considers the last argument to be a String. Maybe some parentheses are wrongly placed.
What I do in such cases, is to split the code to validate types:
val col1: Column = col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.NUMERO_CONTRATTO}").cast(IntegerType)
val col2: Column = col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.CODICE_PORTAFOGLIO}").cast(IntegerType)
val col3: Column = col(s"${Constant.CS}_exp.${Constant.RATEALE}.${Constant.STORIA_DEL_CONTRATTO}")
val result = df.select(col1, col2, col3)
and check that it compiles.
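Another option that keeps the types explicit (a sketch, assuming the same Constant object and DataFrame df as in the question) is to build a Seq[Column] first and expand it with : _*, so that only the select(cols: Column*) overload can apply:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

// every element is forced to be a Column, so a stray String cannot sneak in
val selected: Seq[Column] = Seq(
  col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.NUMERO_CONTRATTO}").cast(IntegerType),
  col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.CODICE_PORTAFOGLIO}").cast(IntegerType),
  col(s"${Constant.CS}_exp.${Constant.RATEALE}.${Constant.STORIA_DEL_CONTRATTO}")
)
val result = df.select(selected: _*)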

Comparing Column Object Values in Spark with Scala

I'm writing methods in Scala that take in Column arguments and return a column. Within them, I'm looking to compare the value of the columns (ranging from integers to dates) using logic similar to the below, but have been encountering an error message.
The lit() is for example purposes only. In truth I'm passing columns from a DataFrame.select() into a method to do computation. I need to compare using those columns.
val test1 = lit(3)
val test2 = lit(4)
if (test1 > test2) {
print("tuff")
}
Error message.
Error : <console>:96: error: type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
if (test1 > test2) {
What is the correct way to compare Column objects in Spark? The column documentation lists the > operator as being valid for comparisons.
Edit: Here's a very contrived example of usage, assuming the columns passed into the function are dates that need to be compared for business reasons with a returned integer value that also has some business significance.
someDataFrame.select(
$"SomeColumn",
computedColumn($"SomeColumn1", $"SomeColumn2").as("MyComputedColumn")
)
Where computedColumn would be
def computedColumn(col1 : Column, col2: Column) : Column = {
val returnCol : Column = lit(0)
if (col1 > col2) {
returnCol = lit(4)
}
}
Except that in actual usage there is a lot more if/else logic that needs to happen in computedColumn, with the final result being a returned Column that will be added to the select's output.
Comparing two Columns produces another Column expression that is only evaluated row by row when the query runs, not a Boolean on the driver, so a plain Scala if cannot branch on it. Use when to do a conditional comparison instead:
someDataFrame.select(
$"SomeColumn",
when($"SomeColumn1" > $"SomeColumn2", 4).otherwise(0).as("MyComputedColumn")
)
If you prefer to write a function:
def computedColumn(col1 : Column, col2: Column) : Column = {
when(col1 > col2, 4).otherwise(0)
}
someDataFrame.select(
$"SomeColumn",
computedColumn($"SomeColumn1", $"SomeColumn2").as("MyComputedColumn")
)
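If more branching is needed (as with the larger if/else logic mentioned in the question), when calls can be chained before the final otherwise. A minimal sketch with placeholder branch values:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.when

// each when adds another branch; otherwise supplies the default,
// and the whole thing is a single Column expression evaluated per row
def computedColumn(col1: Column, col2: Column): Column =
  when(col1 > col2, 4)
    .when(col1 === col2, 2)
    .otherwise(0)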

Spark Cassandra Table Filter in Spark Rdd

I have to filter a Cassandra table in Spark: after getting the data from the table via Spark, I want to apply a filter function on the returned RDD. We don't want to use the where clause in the Cassandra API, which can filter but needs a custom SASI index on the filter column, and that has a disk-overhead issue due to multiple SSTable scans in Cassandra.
for example:
val ct = sc.cassandraTable("keyspace1", "table1")
val fltr = ct.filter(x=x.contains "zz")
table1 fields are :
dirid uuid
filename text
event int
eventtimestamp bigint
fileid int
filetype int
Basically we need to filter the data based on filename matching an arbitrary string. Since the returned RDD is of type
com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD
and filter operations are restricted to the methods of the CassandraRow type, which are shown below.
val ct = sc.cassandraTable("keyspace1", "table1")
scala> ct
res140: com.datastax.spark.connector.rdd.CassandraTableScanRDD[com.datastax.spark.connector.CassandraRow] = CassandraTableScanRDD[171] at RDD at CassandraRDD.scala:19
When I hit Tab after "x." in the filter function below, the shell shows these methods of the CassandraRow class:
scala> ct.filter(x=>x.
columnValues getBooleanOption getDateTime getFloatOption getLongOption getString getUUIDOption length
contains getByte getDateTimeOption getInet getMap getStringOption getVarInt metaData
copy getByteOption getDecimal getInetOption getRaw getTupleValue getVarIntOption nameOf
dataAsString getBytes getDecimalOption getInt getRawCql getTupleValueOption hashCode size
equals getBytesOption getDouble getIntOption getSet getUDTValue indexOf toMap
get getDate getDoubleOption getList getShort getUDTValueOption isNullAt toString
getBoolean getDateOption getFloat getLong getShortOption getUUID iterator
You need to get the string field from the CassandraRow object and then perform the filtering on it. So the code will look as follows:
val fltr = ct.filter(x => x.getString("filename").contains("zz"))
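If the filename column can contain nulls, getString would fail on those rows, so a slightly safer variant (a sketch based on the getStringOption method visible in the tab completion above) is:
val fltr = ct.filter(_.getStringOption("filename").exists(_.contains("zz")))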

How to create a Column expression from collection of column names?

I have a list of strings, which represents the names of various columns I want to add together to make another column:
val myCols = List("col1", "col2", "col3")
I want to convert the list to columns, then add the columns together to make a final column. I've looked for a number of ways to do this, and the closest I can come to the answer is:
df.withColumn("myNewCol", myCols.foldLeft(lit(0))(col(_) + col(_)))
I get a compile error saying it is looking for a string, when all I really want is a column. What's wrong, and how do I fix it?
When I tried it out in spark-shell, it gave an error that says exactly what the problem is and where:
scala> myCols.foldLeft(lit(0))(col(_) + col(_))
<console>:26: error: type mismatch;
found : org.apache.spark.sql.Column
required: String
myCols.foldLeft(lit(0))(col(_) + col(_))
^
Just think of the first pair that is given to foldLeft's function: it is lit(0) of type Column and "col1" of type String. The expression col(_) + col(_) applies col to both arguments, and there is no col function that accepts a Column.
Try reduce instead:
myCols.map(col).reduce(_ + _)
From the official documentation of reduce:
Applies a binary operator to all elements of this collection, going right to left.
the result of inserting op between consecutive elements of this collection, going right to left:
op(x_1, op(x_2, ..., op(x_{n-1}, x_n)...))
where x1, ..., xn are the elements of this collection.
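Putting it back into the original withColumn call (assuming df really has the columns named in myCols):
import org.apache.spark.sql.functions.col

val myCols = List("col1", "col2", "col3")
// map each name to a Column, then fold the Columns into one sum expression
val total = myCols.map(col).reduce(_ + _)
val withSum = df.withColumn("myNewCol", total)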
Here is how you can add columns dynamically based on the column names in a List. When all the columns are numeric, the result is a number. The first argument to foldLeft has the same type as the result, and foldLeft works just as well as reduce here.
val employees = //a dataframe with 2 numeric columns "salary","exp"
val initCol = lit(0)
val cols = Seq("salary","exp")
val col1 = cols.foldLeft(initCol)((x,y) => x + col(y))
employees.select(col1).show()

How to use Spark's PrefixSpan on real-world data (text file or sql)?

I'm trying to use Spark's PrefixSpan algorithm but it is comically difficult to get the data in the right shape to feed to the algo. It feels like a Monty Python skit where the API is actively working to confuse the programmer.
My data is a list of rows, each of which contains a list of text items.
a b c c c d
b c d e
a b
...
I have made this data available in two ways: a SQL table in Hive (where each row has an array of items) and text files where each line contains the items above.
The official example creates a Seq of Array(Array).
If I use sql, I get the following type back:
org.apache.spark.sql.DataFrame = [seq: array<string>]
If I read in text, I get this type:
org.apache.spark.sql.Dataset[Array[String]] = [value: array<string>]
Here is an example of an error I get (if I feed it data from sql):
error: overloaded method value run with alternatives:
[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: org.apache.spark.api.java.JavaRDD[Sequence])org.apache.spark.mllib.fpm.PrefixSpanModel[Item] <and>
[Item](data: org.apache.spark.rdd.RDD[Array[Array[Item]]])(implicit evidence$1: scala.reflect.ClassTag[Item])org.apache.spark.mllib.fpm.PrefixSpanModel[Item]
cannot be applied to (org.apache.spark.sql.DataFrame)
new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5).run( sql("select seq from sequences limit 1000") )
^
Here is an example if I feed it text files:
error: overloaded method value run with alternatives:
[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: org.apache.spark.api.java.JavaRDD[Sequence])org.apache.spark.mllib.fpm.PrefixSpanModel[Item] <and>
[Item](data: org.apache.spark.rdd.RDD[Array[Array[Item]]])(implicit evidence$1: scala.reflect.ClassTag[Item])org.apache.spark.mllib.fpm.PrefixSpanModel[Item]
cannot be applied to (org.apache.spark.sql.Dataset[Array[String]])
new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5).run(textfiles.map( x => x.split("\u0002")).limit(3))
^
I've tried to mold the data by using casting and other unnecessarily complicated logic.
This can't be so hard. Given a list of items (of the very reasonable format described above), how the heck do I feed it to PrefixSpan?
edit:
I'm on spark 2.2.1
Resolved:
A column in the table I was querying had collections in each cell, which caused the returned result to be wrapped in a WrappedArray. I changed my query so the result column contained only a string (via concat_ws). This made it MUCH easier to deal with the type error.
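For anyone hitting the same wall: the usual way to reshape an array<string> column for mllib's PrefixSpan is to drop down to the RDD API and wrap every item in its own Array, since run wants an RDD[Array[Array[Item]]] (each row is a sequence, each inner Array an itemset). A sketch assuming Spark 2.2 and a DataFrame df with a single array<string> column, as returned by the SQL query above:
import org.apache.spark.mllib.fpm.PrefixSpan

// each cell is a Seq[String]; wrap each item in its own Array to form one-element itemsets
val sequences = df.rdd.map(row => row.getSeq[String](0).map(item => Array(item)).toArray)
val model = new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5).run(sequences)
// print the frequent sequential patterns and their frequencies
model.freqSequences.collect().foreach { fs =>
  println(fs.sequence.map(_.mkString("[", ",", "]")).mkString(", ") + " -> freq " + fs.freq)
}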