I'm trying to select several columns and cast all of them, but I receive this error:
"overloaded method value select with alternatives: (col:
String,cols: String*)org.apache.spark.sql.DataFrame (cols:
org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame cannot be
applied to (org.apache.spark.sql.Column, org.apache.spark.sql.Column,
String)"
The code is this:
val result = df.select(
col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.NUMERO_CONTRATTO}").cast(IntegerType),
col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.CODICE_PORTAFOGLIO}").cast(IntegerType),
col(s"${Constant.CS}_exp.${Constant.RATEALE}.${Constant.STORIA_DEL_CONTRATTO}"))
The last part of the error message means that the compiler cannot find a "select" method with a signature that fits your code: select(Column, Column, String).
However, the compiler found 2 possible methods, but they don't fit:
select(col: String, cols: String*)
select(cols: Column*)
(the * means "any number of")
This, I am sure of.
However, I don't understand why you get that error with the code you've given, which is actually select(Column, Column, Column) and fits the select(cols: Column*) API. For some reason, the compiler considers the last argument to be a String. Maybe a parenthesis is misplaced.
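For example (purely hypothetical column names), if the col(...) wrapper around the third argument gets lost because a closing parenthesis ends up in the wrong place, the call becomes select(Column, Column, String) and produces exactly this kind of overload error:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType
// The third argument is a plain String, so neither select overload applies
val bad = df.select(
  col("a").cast(IntegerType),
  col("b").cast(IntegerType),
  "c")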
What I do in such cases is split the code to validate the types (with the needed imports):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType
val col1: Column = col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.NUMERO_CONTRATTO}").cast(IntegerType)
val col2: Column = col(s"${Constant.CS}_exp.${Constant.DATI_CONTRATTO}.${Constant.CODICE_PORTAFOGLIO}").cast(IntegerType)
val col3: Column = col(s"${Constant.CS}_exp.${Constant.RATEALE}.${Constant.STORIA_DEL_CONTRATTO}")
val result = df.select(col1, col2, col3)
and check that it compiles.
Related
If you want to select the first column of a dataframe this can be done:
df.select(df.columns(0))
df.columns(0) returns a string, so by giving the name of the column, the select is able to get the column correctly.
Now, suppose I want to select the first 3 columns of the dataset. This is what I would intuitively do:
df.select(df.columns.split(0,3):_*)
To my understanding, the _* operator passes the array of strings as varargs, so it would be the same as passing (df.columns(0), df.columns(1), df.columns(2)) to the select statement. However, this doesn't work, and it is necessary to do this:
import org.apache.spark.sql.functions.col
df.select(df.columns.split(0,3).map(i => col(i)):_*)
What is going on?
I think in the question you meant slice instead of split.
And as for your question,
df.columns.slice(0,3):_* is meant to be passed to functions with *-parameters (varargs), i.e. if you call select(columns:_*) then there must be a function defined with varargs, e.g. def select(cols: String*).
But there can be only one such function defined; overloading on the vararg element type alone is not possible.
An example of why it's not possible to have two different functions with the same vararg parameter shape (in the REPL, the second definition simply replaces the first):
def select(cols: String*): String = "string"
select() // returns "string"
def select(cols: Column*): Int = 3
select() // now returns 3
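If you instead try to declare both vararg overloads in one class, the compiler rejects the second one, because after erasure the two methods have the same signature (a minimal sketch):
import org.apache.spark.sql.Column
class Demo {
  def select(cols: String*): String = "string"
  // def select(cols: Column*): Int = 3  // does not compile: same type after erasure as the String* overload
}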
And in Spark, that one function is defined not for Strings but for Columns:
def select(cols: Column*)
For Strings, the method is declared like this:
def select(col: String, cols: String*)
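So if you really do want to pass the names as Strings, you can split off the head to match that overload (a small sketch, assuming there is at least one column):
val names = df.columns.slice(0, 3)
df.select(names.head, names.tail: _*)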
I suggest you stick to Columns, as you do now, but with some syntactic sugar:
df.select(df.columns.slice(0,3).map(col):_*)
Or if there's a need to pass column names as Strings, then you can use selectExpr:
df.selectExpr(df.columns.slice(0,3):_*)
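selectExpr also accepts full SQL expressions, so casts can be expressed there too when working with names as Strings (hypothetical column names):
df.selectExpr("cast(a as int) as a", "b")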
I have a dataframe from which I want to select column(s) as a Seq to be used in a Zeppelin select form.
This is how the select form works:
Select form requires
required: Iterable[(Object, String)]
What I have got is
val test_seq = data.select("file", "id").collect().map(x => (x.get(0), x.get(1).toString)).toSeq
Which is of the form
found: Seq[(Any, String)]
And it is not usable in the form. I have not yet figured out how to get the column(s) out of the dataframe in the correct format.
You can try getting a tuple of Object and String from each collected Row, and use toIterable to convert to Iterable[(Object, String)]:
val testIter = data.select("file", "id").collect().map(
x => (x.getAs[Object](0), x.getAs[String](1))
).toIterable
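If I remember the Zeppelin API correctly, this can then be fed straight to the dynamic form (a sketch; the form name here is made up):
val chosen = z.select("file to use", testIter)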
I have a list of strings, which represents the names of various columns I want to add together to make another column:
val myCols = List("col1", "col2", "col3")
I want to convert the list to columns, then add the columns together to make a final column. I've looked for a number of ways to do this, and the closest I can come to the answer is:
df.withColumn("myNewCol", myCols.foldLeft(lit(0))(col(_) + col(_)))
I get a compile error saying it is looking for a String, when all I really want is a Column. What's wrong, and how do I fix it?
When I tried it out in spark-shell, it gave me an error that says exactly what is wrong and where.
scala> myCols.foldLeft(lit(0))(col(_) + col(_))
<console>:26: error: type mismatch;
found : org.apache.spark.sql.Column
required: String
myCols.foldLeft(lit(0))(col(_) + col(_))
^
Just think of the first pair that is given to the function passed to foldLeft: it's going to be lit(0) of type Column and "col1" of type String. There is no col function that accepts a Column.
Try reduce instead:
myCols.map(col).reduce(_ + _)
From the official documentation of reduce:
Applies a binary operator to all elements of this collection, going right to left.
the result of inserting op between consecutive elements of this collection, going right to left:
op(x_1, op(x_2, ..., op(x_{n-1}, x_n)...))
where x_1, ..., x_n are the elements of this collection.
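Putting that back into the original withColumn call would look like this (a minimal sketch; df and myCols are the names from the question):
import org.apache.spark.sql.functions.col
df.withColumn("myNewCol", myCols.map(col).reduce(_ + _))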
Here is how you can add columns dynamically based on the column names in a List. When all the columns are numeric, the result is a number. The first argument to foldLeft (the initial accumulator) has the same type as the result, so foldLeft works just as well as reduce here.
import org.apache.spark.sql.functions.{col, lit}
val employees = ??? // a DataFrame with two numeric columns, "salary" and "exp"
val initCol = lit(0)
val cols = Seq("salary","exp")
val col1 = cols.foldLeft(initCol)((x,y) => x + col(y))
employees.select(col1).show()
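For a quick try-out, a hypothetical employees DataFrame matching those column names could be built like this (requires import spark.implicits._ for toDF):
import spark.implicits._
val employees = Seq((1000, 2), (2000, 5)).toDF("salary", "exp")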
I'm trying to use Spark's PrefixSpan algorithm but it is comically difficult to get the data in the right shape to feed to the algo. It feels like a Monty Python skit where the API is actively working to confuse the programmer.
My data is a list of rows, each of which contains a list of text items.
a b c c c d
b c d e
a b
...
I have made this data available in two ways: a SQL table in Hive (where each row has an array of items) and text files where each line contains the items above.
The official example creates a Seq of Array(Array).
If I use sql, I get the following type back:
org.apache.spark.sql.DataFrame = [seq: array<string>]
If I read in text, I get this type:
org.apache.spark.sql.Dataset[Array[String]] = [value: array<string>]
Here is an example of an error I get (if I feed it data from sql):
error: overloaded method value run with alternatives:
[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: org.apache.spark.api.java.JavaRDD[Sequence])org.apache.spark.mllib.fpm.PrefixSpanModel[Item] <and>
[Item](data: org.apache.spark.rdd.RDD[Array[Array[Item]]])(implicit evidence$1: scala.reflect.ClassTag[Item])org.apache.spark.mllib.fpm.PrefixSpanModel[Item]
cannot be applied to (org.apache.spark.sql.DataFrame)
new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5).run( sql("select seq from sequences limit 1000") )
^
Here is an example if I feed it text files:
error: overloaded method value run with alternatives:
[Item, Itemset <: Iterable[Item], Sequence <: Iterable[Itemset]](data: org.apache.spark.api.java.JavaRDD[Sequence])org.apache.spark.mllib.fpm.PrefixSpanModel[Item] <and>
[Item](data: org.apache.spark.rdd.RDD[Array[Array[Item]]])(implicit evidence$1: scala.reflect.ClassTag[Item])org.apache.spark.mllib.fpm.PrefixSpanModel[Item]
cannot be applied to (org.apache.spark.sql.Dataset[Array[String]])
new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5).run(textfiles.map( x => x.split("\u0002")).limit(3))
^
I've tried to mold the data by using casting and other unnecessarily complicated logic.
This can't be so hard. Given a list of items (of the very reasonable format described above), how the heck do I feed it to PrefixSpan?
Edit:
I'm on Spark 2.2.1
Resolved:
A column in the table I was querying had a collection in each cell, which caused the returned values to come back wrapped in a WrappedArray. I changed my query so the result column contained only a string (using concat_ws). This made it much easier to deal with the type error.
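For reference, one way to reshape the SQL result directly into the RDD[Array[Array[String]]] that the second run overload expects is to pull out the array column and wrap each item in its own single-element itemset (a sketch based on the query from the question, not tested against the original data):
import org.apache.spark.mllib.fpm.PrefixSpan
val sequences = sql("select seq from sequences limit 1000")
  .rdd
  .map(row => row.getSeq[String](0).map(item => Array(item)).toArray)
new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5).run(sequences)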
I have a table with an Int column TIME in it:
def time = column[Int]("TIME")
The table is mapped to a custom type. I want to find a maximum time value, i.e. to perform a simple aggregation. The example in the documentation seems easy enough:
val q = coffees.map(_.price)
val q1 = q.min
val q2 = q.max
However, when I do this, the type of q1 and q2 is Column[Option[Int]]. I can call get or getOrElse on this to get a result of type Column[Int] (even this seems somewhat surprising to me: is get a member of Column, or is the value converted from Option[Int] to Int and then wrapped in a Column again? Why?), but I am unable to use the scalar value; when I attempt to assign it to an Int, I get an error message saying:
type mismatch;
found : scala.slick.lifted.Column[Int]
required: Int
How can I get the scalar value from the aggregated query?
My guess is that you are not calling the invoker; that's why you get a Column object. Try this:
val q1 = q.min.run
That should return an Option[Int], on which you can then call get or getOrElse.
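A minimal end-to-end sketch following this answer (Slick 2.x lifted embedding; the table query name myTable is hypothetical, and an implicit Session is assumed to be in scope):
val maxTime: Int = myTable.map(_.time).max.run.getOrElse(0)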