I am reading a CSV file into dataframe1 and then selecting some of its columns into dataframe2. While selecting the columns for dataframe2 from dataframe1, I want to apply my function to one of the column values. Like this:
import utilities._
val Logs = sqlContext.read
.format("csv")
.option("header", "true")
.load("dbfs:/mnt/records/Logs/2016.07.17/2016.07.17.{*}.csv")
val Log = Logs.select(
"key1",
utility.stringToGuid("username"),
"key2",
"key3",
"startdatetime",
"enddatetime")
display(Log)
So here I am calling utility.stringToGuid("username"), and it is giving me this error:
notebook:5: error: overloaded method value select with alternatives:
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
So I found the answer to my question: I was passing the string "username" to the utility function instead of passing the column value of "username".
So the argument should be utility.stringToGuid($"username"). In Scala, $"" is used to pass the column value, and in Python, col() is used.
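For reference, here is a minimal sketch of the corrected select, assuming utility.stringToGuid accepts a Column and returns a Column (for example because it wraps a UDF). Note that once one argument is a Column, every argument to select must be a Column, since the (String, String*) and (Column*) overloads cannot be mixed:
import spark.implicits._ // provides the $"..." column syntax (pre-imported in Databricks notebooks)

val Log = Logs.select(
  $"key1",
  utility.stringToGuid($"username"), // apply the function to the column value, not the literal string
  $"key2",
  $"key3",
  $"startdatetime",
  $"enddatetime")
display(Log)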
Related
I have a dataframe and a list of columns like this:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.createDataFrame(Seq(("Java", "20000"), ("Python", "100000"))).toDF("language","users_count")
val data_columns = List("language","users_count").map(x=>col(s"$x"))
Why does this work:
df.select(data_columns:_ *).show()
But not this?
df.select($"language", data_columns:_*).show()
Gives the error:
error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
And how do I get it to work so that I can use _* to select all the columns in a list while also specifying some other columns in the select?
Thanks!
Update:
Based on @chinayangyangyong's answer below, this is how I solved it:
df.select( $"language" +: data_columns :_*)
It is because there is no method on DataFrame with the signature select(col: Column, cols: Column*): DataFrame, but there is one with the signature select(cols: Column*): DataFrame, which is why your first example works.
Interestingly, your second example would work if you were selecting by String, since there is a method select(col: String, cols: String*): DataFrame. With data_columns holding the plain column names (a List[String] rather than the mapped Columns), that would be:
df.select(data_columns.head, data_columns.tail: _*).show()
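For reference, a minimal runnable sketch of both working variants, assuming the df, data_columns, and imports from the question (column_names below is just the same name list before it is mapped to Columns):
// Variant 1: prepend the extra Column so everything goes through the Column* overload
df.select($"language" +: data_columns: _*).show()

// Variant 2: use the (String, String*) overload with plain column names
val column_names = List("language", "users_count")
df.select(column_names.head, column_names.tail: _*).show()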
I'm not able to specify a list of columns in the groupBy function along with a window operation. My current code:
val groupCols = List("SINR_Distribution","NE_VERSION","NE_ID","NE_NAME","cNum","EarfcnDl","datetime","circle")
val aggDFrame = dframe.groupBy(groupCols, window($"EVENT_TIME", "60 minutes")).agg(Rule_Agg)
Error:
Multiple markers at this line:
overloaded method value groupBy with alternatives:
  (col1: String,cols: String*)org.apache.spark.sql.RelationalGroupedDataset
  (cols: org.apache.spark.sql.Column*)org.apache.spark.sql.RelationalGroupedDataset
cannot be applied to (List[String], org.apache.spark.sql.Column)
What am I doing wrong?
You are mixing Strings with a Column in the groupBy. The window expression window($"EVENT_TIME", "60 minutes") is correctly interpreted as a Column, but the list of column names also needs to be turned into Columns to match; it is not possible to mix the two types.
What you can do is build a single sequence of Columns and pass it as varargs:
import org.apache.spark.sql.functions.{col, window}

val cols = groupCols.map(col) ++ Seq(window($"EVENT_TIME", "60 minutes"))
val aggDFrame = dframe.groupBy(cols: _*).agg(...)
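A self-contained sketch of the same pattern on toy data (the column names, values, and aggregation below are made up for illustration):
import spark.implicits._
import org.apache.spark.sql.functions.{col, window, avg}

val events = Seq(
  ("NE1", "2021-01-01 10:05:00", 3.5),
  ("NE1", "2021-01-01 10:45:00", 4.0),
  ("NE2", "2021-01-01 10:20:00", 2.5))
  .toDF("NE_ID", "EVENT_TIME", "SINR")
  .withColumn("EVENT_TIME", col("EVENT_TIME").cast("timestamp"))

val demoGroupCols = List("NE_ID")
val demoCols = demoGroupCols.map(col) ++ Seq(window(col("EVENT_TIME"), "60 minutes"))

// one row per NE_ID per 60-minute window
events.groupBy(demoCols: _*).agg(avg("SINR").alias("avg_SINR")).show(false)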
I am a newbie to Azure Spark / Databricks and am trying to access a specific row, e.g. the 10th row, in a dataframe.
This is what I have done in the notebook so far:
1. Read a CSV file into a table
spark.read
.format("csv")
.option("header", "true")
.load("/mnt/training/enb/commonfiles/ramp.csv")
.write
.mode("overwrite")
.saveAsTable("ramp_csv")
2. Create a DataFrame for the "table" ramp_csv
val rampDF = spark.read.table("ramp_csv")
3. Read a specific row
I am using the following logic in Scala
val myRow1st = rampDF.rdd.take(10).last
display(myRow1st)
and it should display the 10th row, but I am getting the following error:
command-2264596624884586:9: error: overloaded method value display with alternatives:
[A](data: Seq[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])Unit <and>
(dataset: org.apache.spark.sql.Dataset[_],streamName: String,trigger: org.apache.spark.sql.streaming.Trigger,checkpointLocation: String)Unit <and>
(model: org.apache.spark.ml.classification.DecisionTreeClassificationModel)Unit <and>
(model: org.apache.spark.ml.regression.DecisionTreeRegressionModel)Unit <and>
(model: org.apache.spark.ml.clustering.KMeansModel)Unit <and>
(model: org.apache.spark.mllib.clustering.KMeansModel)Unit <and>
(documentable: com.databricks.dbutils_v1.WithHelpMethods)Unit
cannot be applied to (org.apache.spark.sql.Row)
display(myRow1st)
^
Could you please share what I am missing here? I tried a few other things, but they didn't work.
Thanks in advance for the help!
Here is the breakdown of what is happening in your code:
rampDF.rdd.take(10) returns Array[Row]
.last returns Row
display() takes a Dataset and you are passing it a Row. You can use .show(10) to display the first 10 rows in tabular form.
Another option is to do display(rampDF.limit(10))
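For example, either of these would work in the notebook (a sketch, assuming rampDF as defined above):
rampDF.show(10)           // prints the first 10 rows as a text table
display(rampDF.limit(10)) // display() accepts a Dataset, so reduce it to 10 rows first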
I'd go with João's answer as well. But if you insist on getting the Nth row as a DataFrame and want to avoid collecting it to the driver node (say, when N is very big), you can do:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = (1 to 100).toDF() // sample data
val cols = df.columns

df
  .limit(10)                                       // keep only the first 10 rows
  .withColumn("id", monotonically_increasing_id()) // attach an increasing id
  .agg(max(struct(("id" +: cols).map(col(_)): _*)).alias("tenth")) // the struct with the largest id is the 10th row
  .select(cols.map(c => col("tenth." + c).alias(c)): _*) // unpack the struct back into the original columns
This will return:
+-----+
|value|
+-----+
| 10|
+-----+
I also go with João Guitana's answer. An alternative to get specifically the 10th record:
val df = (1 to 1000).toDF()
val tenth = df.limit(10).collect.toList.last
tenth: org.apache.spark.sql.Row = [10]
That will return the 10th Row of that df.
Let's say I have my DataFrame, with a given column named "X". I want to understand why the first snippet doesn't work whereas the second one does. To me, the change shouldn't make any difference.
On the one hand, this doesn't work:
val dataDF = sqlContext
.read
.parquet(input_data)
.select(
"XXX", "YYY", "III"
)
.toDF(
"X", "Y", "I"
)
.groupBy(
"X", "Y"
)
.agg(
sum("I").as("sum_I")
)
.orderBy(desc("sum_I"))
.withColumn("f_sum_I", udf((x: Long) => f(x)).apply(dataDF("sum_I")))
.drop("sum_I")
dataDF.show(50, false)
IntelliJ doesn't compile my code, and I get the following error:
Error:(88, 67) recursive value dataDF needs type
.withColumn("f_sum_I", udf((x: Long) => f(x)).apply(dataDF("sum_I")))
On the other hand, it works if I replace the given line with this:
.withColumn("f_sum_I", udf((x: Long) => f(x)).apply(col("sum_I")))
All I did was replace the reference to my DataFrame's column with the more generic col function. I don't understand the difference, and especially why the first approach (using the name of the DataFrame) isn't preferred.
You're trying to use dataDF before you're done defining it: dataDF is the result of the entire expression starting with sqlContext.read and ending with .drop("sum_I"), so you can't use it within that expression.
You can solve this by simply referencing the column without using the DataFrame, e.g. using the col function from org.apache.spark.sql.functions:
.withColumn("f_sum_I", udf((x: Long) => f(x)).apply(col("sum_I")))
For example,
val columns=Array("column1", "column2", "column3")
val df=sc.parallelize(Seq(
(1,"example1", Seq(0,2,5)),
(2,"example2", Seq(1,20,5)))).toDF(columns)
How can I set the column names using a String array?
Is it possible to specify data types inside toDF()?
toDF() takes a repeated parameter of type String, so you can use the _* type annotation to pass a sequence:
val df=sc.parallelize(Seq(
(1,"example1", Seq(0,2,5)),
(2,"example2", Seq(1,20,5)))).toDF(columns: _*)
For more on repeated parameters, see section 4.6.2 of the Scala Language Specification.
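A tiny illustration of the repeated-parameter mechanism itself, independent of Spark (greet is just a made-up example):
def greet(names: String*): String = names.mkString("Hello ", ", ", "!")

val people = Array("Ada", "Grace")
greet("Ada", "Grace") // pass the arguments one by one
greet(people: _*)     // or splat a whole collection with : _*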
val df=sc.parallelize(Seq(
(1,"example1", Seq(0,2,5)),
(2,"example2", Seq(1,20,5)))).toDF("column1", "column2", "column3")
toDF() takes comma-separated Strings.
toDF() is defined in Spark documentation as:
def toDF(colNames: String*): DataFrame
And so you need to turn your array into varargs, as also described here. That means you need to do the following:
val columns=Array("column1", "column2", "column3")
val df=sc.parallelize(Seq(
(1,"example1", Seq(0,2,5)),
(2,"example2", Seq(1,20,5)))).toDF(columns: _*)
(Add : _* to columns in toDF.)