Is it possible to go from an Array[Row] to a DataFrame - scala

If I call collect on a DataFrame, I will get an Array[Row]. But I'm wondering if it possible to go back to a DataFrame from that result or an Array[Row] in general.
For example:
rows = df.select("*").collect()
Is there some way to do something like this:
import df.sparkSession.implicits._
newDF = rows.toDF()

It is possible to provide a List[Row], as long as you provide as schema. Then you can use SparkSession.createDataFrame
def createDataFrame(rows: List[Row], schema: StructType): DataFrame
There is no variant of toDF that can be used here.
In general you should avoid collecting and converting result back to DataFrame.

Related

How do you change schema in a Spark `DataFrame.map()` operation without joins?

In Spark v3.0.1 I have a DataFrame of arbitrary schema.
I want to turn that DataFrame of arbitrary schema into a new DataFrame with the same schema and a new column that is the result of a calculation over the data discretely present in each row.
I can safely assume that certain columns of certain types are available for the logical calculation despite the DataFrame being of arbitrary schema.
I have solved this previously by creating a new Dataset[outcome] of two columns:
the KEY from the input DataFrame
the OUTCOME of the calculation
... and then joining that DF back on the initial input to add the new column:
val inputDf = Seq(
("1", "input1", "input2"),
("2", "anotherInput1", "anotherInput2"),
).asDF("key", "logicalInput1", "logicalInput2")
case class outcome(key: String, outcome: String)
val outcomes = inputDf.map(row => {
val input1 = row.getAs[String]("logicalInput1")
val input2 = row.getAs[String]("logicalInput2")
val key = row.getAs[String]("key")
val result = if (input1 != "") input1 + input2 else input2
outcome(key, result)
})
val finalDf = inputDf.join(outcomes, Seq("key"))
Is there a more efficient way to map a DataFrame to a new DataFrame with an extra column given arbitrary columns on the input DF upon which we can assume some columns exist to make the calculation?
I'd like to take the inputDF and map over each row, generating a copy of the row and adding a new column to it with the outcome result without having to join afterwards...
NOTE that in the example above, a simple solution exists using Spark API... My calculation is not as simple as concatenating strings together, so the .map or a udf is required for the solution. I'd like to avoid UDF if possible, though that could work too.
Before answering exact question about using .map I think it is worth a brief discussion about using UDFs for this purpose. UDFs were mentioned in the "note" of the question but not in detail.
When we use .map (or .filter, .flatMap, and any other higher order function) on any Dataset [1] we are forcing Spark to fully deserialize the entire row into an object, transforming the object with a function, and then serializing the entire object again. This is very expensive.
A UDF is effectively a wrapper around a Scala function that routes values from certain columns to the arguments of the UDF. Therefore, Spark is aware of which columns are required by the UDF and which are not and thus we save a lot of serialization (and possibly IO) costs by ignoring columns that are not used by the UDF.
In addition, the query optimizer can't really help with .map but a UDF can be part of a larger plan that the optimizer will (in theory) minimize the cost of execution.
I believe that a UDF will usually be better in the kind of scenario put forth int the question. Another smell that indicate UDFs are a good solution is how little code is required compared to other solutions.
val outcome = udf { (input1: String, input2: String) =>
if (input1 != "") input1 + input2 else input2
}
inputDf.withColumn("outcome", outcome(col("logicalInput1"), col("logicalInput2")))
Now to answer the question about using .map! To avoid the join, we need to have the result of the .map be a Row that has all the contents of the input row with the output added. Row is effectively a sequence of values with type Any. Spark manipulates these values in a type-safe way by using the schema information from the dataset. If we create a new Row with a new schema, and provide .map with an Encoder for the new schema, Spark will know how to create a new DataFrame for us.
val newSchema = inputDf.schema.add("outcome", StringType)
val newEncoder = RowEncoder(newSchema)
inputDf
.map { row =>
val rowWithSchema = row.asInstanceOf[GenericRowWithSchema] // This cast might not always be possible!
val input1 = row.getAs[String]("logicalInput1")
val input2 = row.getAs[String]("logicalInput2")
val key = row.getAs[String]("key")
val result = if (input1 != "") input1 + input2 else input2
new GenericRowWithSchema(rowWithSchema.toSeq.toArray :+ result, row.schema).asInstanceOf[Row] // Encoder is invariant so we have to cast again.
}(newEncoder)
.show()
Not as elegant as the UDFs, but it works in this case. However, I'm not sure that this solution is universal.
[1] DataFrame is just an alias for Dataset[Row]
You should use withColumn with an UDF. I don't see why map should be preferred, and I think it's very difficult to append a column in DataFrame API
Or you switch to Dataset API

spark scala reducekey dataframe operation

I'm trying to do a count in scala with dataframe. My data has 3 columns and I've already loaded the data and split by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in dataframe, and having some trouble on the syntax
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
spark needs to know the schema of the df
there are many ways to specify the schema, here is one option:
val df = file
.map(line=>line.split("\t"))
.map(l => (l(0), l(1).toInt)) //at this point spark knows the number of columns and their types
.toDF("a", "b") //give the columns names for ease of use
df
.groupby('a)
.count()

How do I convert Array[Row] to RDD[Row]

I have a scenario where I want to convert the result of a dataframe which is in the format Array[Row] to RDD[Row]. I have tried using parallelize, but I don't want to use it as it needs to contain entire data in a single system which is not feasible in production box.
val Bid = spark.sql("select Distinct DeviceId, ButtonName from stb").collect()
val bidrdd = sparkContext.parallelize(Bid)
How do I achieve this? I tried the approach given in this link (How to convert DataFrame to RDD in Scala?), but it didn't work for me.
val bidrdd1 = Bid.map(x => (x(0).toString, x(1).toString)).rdd
It gives an error value rdd is not a member of Array[(String, String)]
The variable Bid which you've created here is not a DataFrame, it is an Array[Row], that's why you can't use .rdd on it. If you want to get an RDD[Row], simply call .rdd on the DataFrame (without calling collect):
val rdd = spark.sql("select Distinct DeviceId, ButtonName from stb").rdd
Your post contains some misconceptions worth noting:
... a dataframe which is in the format Array[Row] ...
Not quite - the Array[Row] is the result of collecting the data from the DataFrame into Driver memory - it's not a DataFrame.
... I don't want to use it as it needs to contain entire data in a single system ...
Note that as soon as you use collect on the DataFrame, you've already collected entire data into a single JVM's memory. So using parallelize is not the issue.

remove a column from a dataframe spark

I have a Spark dataframe with a very large number of columns. I want to remove two columns from it to get a new dataframe.
Had there been fewer columns, I could have used the select method in the API like this:
pcomments = pcomments.select(pcomments.col("post_id"),pcomments.col("comment_id"),pcomments.col("comment_message"),pcomments.col("user_name"),pcomments.col("comment_createdtime"));
But since picking columns from a long list is a tedious task, is there a workaround?
Use drop method and withColumnRenamed methods.
Example:
val initialDf= ....
val dfAfterDrop=initialDf.drop("column1").drop("coumn2")
val dfAfterColRename= dfAfterDrop.withColumnRenamed("oldColumnName","new ColumnName")
Try this:
val initialDf = ...
val dfAfterDropCols = initialDf.drop("column1", "coumn2")

How to display the results brough from column functions using spark/scala like what show() does to dataframe

I just started how to use dataframe and column in Spark/Scala. I know if I want to show something on the screen, I can just do like df.show() for that. But how can I do this to a column. For example,
scala> val dfcol = df.apply("sgan")
dfcol: org.apache.spark.sql.Column = sgan
this can find a column called "sgan" from the dataframe df then give it to dfcol, so dfcol is a column. Then, if I do
scala> abs(dfcol)
res29: org.apache.spark.sql.Column = abs(sgan)
I just got the result shown on the screen like above. How can I show the result of this function on the screen like df.show() does? Or, in other words, how can I know the results of the functions like abs, min and so forth?
You should always use a dataframe, Column objects are not meant to be investigated this way. You can use select to create a dataframe with the column you're interested in, and then use show():
df.select(functions.abs(df("sgan"))).show()