In Spark SQL (perhaps only HiveQL) one can do:
select sex, avg(age) as avg_age
from humans
group by sex
which would result in a DataFrame with columns named "sex" and "avg_age".
How can avg(age) be aliased to "avg_age" without using textual SQL?
Edit:
After zero323 's answer, I need to add the constraint that:
The column-to-be-renamed's name may not be known/guaranteed or even addressable. In textual SQL, using "select EXPR as NAME" removes the requirement to have an intermediate name for EXPR. This is also the case in the example above, where "avg(age)" could get a variety of auto-generated names (which also vary among spark releases and sql-context backends).
Let's suppose human_df is the DataFrame for humans. Since Spark 1.3:
human_df.groupBy("sex").agg(avg("age").alias("avg_age"))
If you prefer to rename a single column, you can use the withColumnRenamed method:
case class Person(name: String, age: Int)
val df = sqlContext.createDataFrame(
  Person("Alice", 2) :: Person("Bob", 5) :: Nil)
df.withColumnRenamed("name", "first_name")
Alternatively, you can use the alias method:
import org.apache.spark.sql.functions.avg
df.select(avg($"age").alias("average_age"))
You can take it further with a small helper:
import org.apache.spark.sql.Column
def normalizeName(c: Column) = {
  val pattern = "\\W+".r
  c.alias(pattern.replaceAllIn(c.toString, "_"))
}
df.select(normalizeName(avg($"age")))
It turns out that def toDF(colNames: String*): DataFrame does exactly that. Pasting from the 2.11.7 documentation:
def toDF(colNames: String*): DataFrame
Returns a new DataFrame with columns renamed. This can be quite
convenient in conversion from a RDD of tuples into a DataFrame
with meaningful names. For example:
val rdd: RDD[(Int, String)] = ...
rdd.toDF()             // this implicit conversion creates a DataFrame
                       // with column name _1 and _2
rdd.toDF("id", "name") // this creates a DataFrame with
                       // column name "id" and "name"
Anonymous columns, such as the one that would be generated by avg(age) without AS avg_age, get automatically assigned names. As you point out in your question, the names are implementation-specific, generated by a naming strategy. If needed, you could write code that sniffs the environment and applies an appropriate discovery-and-rename strategy based on the specific naming scheme. There are not many of them.
In Spark 1.4.1 with HiveContext, the format is "_cN" where N is the position of the anonymous column in the table. In your case, the name would be _c1.
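For instance, a minimal discovery-and-rename sketch (assuming, as in the example, that the aggregate is the only non-key column and therefore the last column of the result) could look like:
import org.apache.spark.sql.functions.avg

val grouped = human_df.groupBy("sex").agg(avg("age"))
// Pick up whatever name the backend generated ("avg(age)", "_c1", ...)
val generatedName = grouped.columns.last
val renamed = grouped.withColumnRenamed(generatedName, "avg_age")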
Related
In Spark v3.0.1 I have a DataFrame of arbitrary schema.
I want to turn that DataFrame of arbitrary schema into a new DataFrame with the same schema and a new column that is the result of a calculation over the data discretely present in each row.
I can safely assume that certain columns of certain types are available for the logical calculation despite the DataFrame being of arbitrary schema.
I have solved this previously by creating a new Dataset[outcome] of two columns:
the KEY from the input DataFrame
the OUTCOME of the calculation
... and then joining that DF back on the initial input to add the new column:
import spark.implicits._ // needed for .toDF on a local Seq

val inputDf = Seq(
  ("1", "input1", "input2"),
  ("2", "anotherInput1", "anotherInput2")
).toDF("key", "logicalInput1", "logicalInput2")
case class outcome(key: String, outcome: String)
val outcomes = inputDf.map(row => {
  val input1 = row.getAs[String]("logicalInput1")
  val input2 = row.getAs[String]("logicalInput2")
  val key = row.getAs[String]("key")
  val result = if (input1 != "") input1 + input2 else input2
  outcome(key, result)
})
val finalDf = inputDf.join(outcomes, Seq("key"))
Is there a more efficient way to map a DataFrame to a new DataFrame with an extra column, given an input DF of arbitrary columns among which we can assume the columns needed for the calculation exist?
I'd like to take the inputDf and map over each row, generating a copy of the row and adding a new column to it with the outcome result, without having to join afterwards...
NOTE that in the example above, a simple solution exists using the Spark API... My calculation is not as simple as concatenating strings together, so the .map or a udf is required for the solution. I'd like to avoid a UDF if possible, though that could work too.
Before answering the exact question about using .map, I think it is worth a brief discussion of using UDFs for this purpose. UDFs were mentioned in the "NOTE" of the question, but not in detail.
When we use .map (or .filter, .flatMap, or any other higher-order function) on any Dataset [1], we are forcing Spark to deserialize the entire row into an object, transform that object with a function, and then serialize the entire object again. This is very expensive.
A UDF is effectively a wrapper around a Scala function that routes values from certain columns to the arguments of the UDF. Because Spark knows which columns the UDF requires and which it does not, we save a lot of serialization (and possibly IO) cost by ignoring the columns the UDF does not use.
In addition, the query optimizer can't really help with .map, but a UDF can be part of a larger plan whose execution cost the optimizer will (in theory) minimize.
I believe that a UDF will usually be better in the kind of scenario put forth in the question. Another smell that indicates UDFs are a good solution is how little code is required compared to other solutions.
import org.apache.spark.sql.functions.{col, udf}

val outcome = udf { (input1: String, input2: String) =>
  if (input1 != "") input1 + input2 else input2
}

inputDf.withColumn("outcome", outcome(col("logicalInput1"), col("logicalInput2")))
Now to answer the question about using .map! To avoid the join, we need to have the result of the .map be a Row that has all the contents of the input row with the output added. Row is effectively a sequence of values with type Any. Spark manipulates these values in a type-safe way by using the schema information from the dataset. If we create a new Row with a new schema, and provide .map with an Encoder for the new schema, Spark will know how to create a new DataFrame for us.
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.StringType

val newSchema = inputDf.schema.add("outcome", StringType)
val newEncoder = RowEncoder(newSchema)

inputDf
  .map { row =>
    val rowWithSchema = row.asInstanceOf[GenericRowWithSchema] // This cast might not always be possible!
    val input1 = row.getAs[String]("logicalInput1")
    val input2 = row.getAs[String]("logicalInput2")
    val result = if (input1 != "") input1 + input2 else input2
    // Attach the extended schema to the new row; the encoder is invariant so we have to cast again.
    new GenericRowWithSchema(rowWithSchema.toSeq.toArray :+ result, newSchema).asInstanceOf[Row]
  }(newEncoder)
  .show()
Not as elegant as the UDFs, but it works in this case. However, I'm not sure that this solution is universal.
[1] DataFrame is just an alias for Dataset[Row]
You should use withColumn with a UDF. I don't see why map should be preferred, and I think it's very difficult to append a column using the DataFrame API.
Alternatively, you can switch to the Dataset API, roughly as sketched below.
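A rough sketch of the Dataset route; the case classes and column names here are assumptions based on the question's example (and spark refers to the active SparkSession), not taken from any actual code:
import spark.implicits._

// Typed views of the input and of the input plus the computed column.
case class Input(key: String, logicalInput1: String, logicalInput2: String)
case class WithOutcome(key: String, logicalInput1: String, logicalInput2: String, outcome: String)

inputDf.as[Input]
  .map { in =>
    val result = if (in.logicalInput1 != "") in.logicalInput1 + in.logicalInput2 else in.logicalInput2
    WithOutcome(in.key, in.logicalInput1, in.logicalInput2, result)
  }
  .show()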
I am trying to perform the following operation:
var test = df.groupBy(keys.map(col(_)): _*).agg(sequence.head, sequence.tail: _*)
I know that the required parameter inside agg should be a Seq[Column].
I then have a dataframe "expr" containing the following:
sequences
count(col("colname1"),"*")
count(col("colname2"),"*")
count(col("colname3"),"*")
count(col("colname4"),"*")
The sequences column is of string type, and I want to use the value in each row as input to agg, but I cannot figure out how to access them.
Any idea of how to give it a try?
If you can change the strings in the sequences column to be valid SQL expressions, then this can be solved. Spark provides a function expr that takes a SQL string and converts it into a column. Example dataframe with working expressions:
val df2 = Seq("sum(case when A like 2 then A end) as A", "count(B) as B").toDF("sequences")
To convert the dataframe to a Seq[Column], do:
import org.apache.spark.sql.functions.expr

val seqs = df2.as[String].collect().map(expr(_))
Then the groupBy and agg:
df.groupBy(...).agg(seqs.head, seqs.tail:_*)
I'm new to Spark/Scala. I have a file, say config, where I specify all the column names.
Config:
Id,
Emp_Name,
Dept,
Address,
Account
I have a dataframe where I select the column names like:
df.select("id","Emp_Name","Dept","Address","Account").show()
Instead of specifying the column names in select, I want to get the column names from the config file, like:
df.select(config-file_column_names).show()
You don't necessarily need the commas in your file if each column is on a different line.
These are the definitions of select:
def select(col: String, cols: String*): DataFrame
def select(cols: org.apache.spark.sql.Column*): DataFrame
We are going to use the second definition here.
import org.apache.spark.sql.functions.col
val colNames = sc.textFile("file").map(_.replaceAll(",", "") ).map(col(_)).collect
// Unpacking the array in `select`
df.select(colNames: _*).show
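If the config file is small and lives on the local filesystem, an alternative (not part of the answer above; the file path and the trimming of blank lines are assumptions) is to read it with plain Scala I/O instead of an RDD:
import scala.io.Source
import org.apache.spark.sql.functions.col

// Read the file locally, strip commas and blank lines, then build the columns.
val colNames = Source.fromFile("config")
  .getLines()
  .map(_.replaceAll(",", "").trim)
  .filter(_.nonEmpty)
  .map(col)
  .toSeq

df.select(colNames: _*).show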
I am trying to pass a list of tuples to a UDF in Scala. I am not sure how exactly to define the datatype for this. I tried to pass it as a whole row but it can't really resolve it. I need to sort the list based on the first element of the tuple and then return n elements. I have tried the following definitions for the UDF:
def udfFilterPath = udf((id: Long, idList: Array[structType[Long, String]] )
def udfFilterPath = udf((id: Long, idList: Array[Tuple2[Long, String]] )
def udfFilterPath = udf((id: Long, idList: Row)
This is what the idList looks like:
[[1234,"Tony"], [2345, "Angela"]]
[[1234,"Tony"], [234545, "Ruby"], [353445, "Ria"]]
This is a dataframe with 100 rows like the above. I call the udf as follows:
testSet.select("id", "idList").withColumn("result", udfFilterPath($"id", $"idList")).show
When I print the schema for the dataframe, it reads it as an array of structs. The idList itself is generated by doing a collect_list over a column of tuples grouped by a key and stored in the dataframe. Any ideas on what I am doing wrong? Thanks!
When defining a UDF, you should use plain Scala types (e.g. Tuples, Primitives...) and not the Spark SQL types (e.g. StructType) as the output types.
As for the input types, this is where it gets tricky (and not too well documented): an array of tuples will actually arrive as a mutable.WrappedArray[Row]. So you'll have to "convert" each Row into a tuple first; then you can do the sorting and return the result.
Lastly, by your description it seems that the id column isn't used at all, so I removed it from the UDF definition, but it can easily be added back.
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val udfFilterPath = udf { idList: mutable.WrappedArray[Row] =>
  // converts the array items into tuples, sorts by the first item and returns the first two tuples:
  idList.map(r => (r.getAs[Long](0), r.getAs[String](1))).sortBy(_._1).take(2)
}

df.withColumn("result", udfFilterPath($"idList")).show(false)
+------+-------------------------------------------+----------------------------+
|id |idList |result |
+------+-------------------------------------------+----------------------------+
|1234 |[[1234,Tony], [2345,Angela]] |[[1234,Tony], [2345,Angela]]|
|234545|[[1234,Tony], [2345454,Ruby], [353445,Ria]]|[[1234,Tony], [353445,Ria]] |
+------+-------------------------------------------+----------------------------+
I am trying to find a good way of doing a Spark select with a List[Column]. I am exploding a column, then passing back all the columns I am interested in along with my exploded column.
var columns = getColumns(x) // Returns a List[Column]
tempDf.select(columns) // trying to pass the List[Column] directly, which doesn't work
I'm trying to find a good way of doing this. I know that if it were a list of strings I could do something like:
val result = dataframe.select(columnNames.head, columnNames.tail: _*)
For Spark 2.0 it seems that you have two options. Both depend on how you manage your columns (Strings or Columns).
Spark code (spark-sql_2.11/org/apache/spark/sql/Dataset.scala):
def select(cols: Column*): DataFrame = withPlan {
Project(cols.map(_.named), logicalPlan)
}
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)
You can see how, internally, Spark converts the head & tail to a list of Columns in order to call select again.
So, in that case, if you want clearer code I would recommend:
If columns: List[String]:
import org.apache.spark.sql.functions.col
df.select(columns.map(col): _*)
Otherwise, if columns: List[Column]:
df.select(columns: _*)