Add mapType column to existing DataFrame - scala

I have a (probably easy and quick) question about DataFrames in Spark with Scala.
I have an existing Spark DataFrame (I work with Scala 2.10.5 and Spark 1.6.3) and would like to add a new column of ArrayType or MapType, but I don't know how to achieve that. I would rather not create multiple columns holding single values; I want to store them in one column. It would shorten my code and make it easier to change.
import org.apache.spark.sql.types.MapType
...
// DataFrame initial creation
val df = ...
// adding new columns
val df_new = df
  .withColumn("new_col1", lit("something_to_add")) // add a literal
  .withColumn("new_col2", MapType("key1" -> "val1", "key2" -> "val2")) // ???

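On Spark 2.2 or later you could use typedLit, which builds a literal column from a Scala collection:
import org.apache.spark.sql.functions.{lit, typedLit}
val df_new = df
  .withColumn("new_col1", lit("something_to_add")) // add a literal
  .withColumn("new_col2", typedLit(Map("key1" -> "val1", "key2" -> "val2"))) // MapType column
Note that typedLit is not available in Spark 1.6.3.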
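Where typedLit is not available, the column can also be assembled from individual literals. The following is a minimal sketch, not a definitive implementation: it uses array and map from org.apache.spark.sql.functions (array should work on Spark 1.6, while map needs 2.0 or later), and the object name, method name and column names are made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, lit, map}

object LiteralCollectionColumns {
  // Adds one ArrayType column and one MapType column built from literals.
  // "new_arr" and "new_map" are hypothetical column names.
  def withArrayAndMap(df: DataFrame): DataFrame =
    df
      .withColumn("new_arr", array(lit("val1"), lit("val2")))
      // map takes alternating key and value columns: k1, v1, k2, v2, ...
      .withColumn("new_map", map(lit("key1"), lit("val1"), lit("key2"), lit("val2")))
}
```

Every row receives the same array and the same map, which mirrors what lit does for a single value.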

Related

spark.createDataFrame () not working with Seq RDD

createDataFrame takes two arguments: an RDD and a schema.
My schema looks like this:
val schemas= StructType(
Seq(
StructField("number",IntegerType,false),
StructField("notation", StringType,false)
)
)
In one case I am able to create a DataFrame from an RDD, as below:
`val data1=Seq(Row(1,"one"),Row(2,"two"))
val rdd=spark.sparkContext.parallelize(data1)
val final_df= spark.createDataFrame(rdd,schemas)`
In the other case, shown below, I am not:
`val data2=Seq((1,"one"),(2,"two"))
val rdd=spark.sparkContext.parallelize(data2)
val final_df= spark.createDataFrame(rdd,schemas)`
What is wrong with data2 that prevents it from becoming a valid RDD for a DataFrame?
I am able to create a DataFrame from data2 with toDF(), but not with createDataFrame:
val data2_DF=Seq((1,"one"),(2,"two")).toDF("number", "notation")
Please help me understand this behaviour.
Is Row mandatory when creating a DataFrame?
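In the second case, just do:
val final_df = spark.createDataFrame(rdd)
The two-argument overload createDataFrame(rdd, schema) expects an RDD[Row], which is why the first case works and the second does not. Your second RDD is an RDD of Tuple2 (which is a Product), so the schema can be derived from the tuple's type and you don't need to specify one. The columns will be named _1 and _2; call toDF("number", "notation") on the result to rename them.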
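If the explicit schema is still wanted for the tuple data, one option is to map each tuple to a Row first. This is a minimal sketch under the assumption of a Spark 2.x SparkSession; the object and method names are hypothetical, and the schema repeats the one from the question.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object CreateDataFrameSketch {
  // Same schema as in the question.
  val schema: StructType = StructType(Seq(
    StructField("number", IntegerType, nullable = false),
    StructField("notation", StringType, nullable = false)
  ))

  // Converts each tuple to a Row so that the (RDD[Row], StructType) overload applies.
  def fromTuples(spark: SparkSession, data: Seq[(Int, String)]): DataFrame = {
    val rowRdd = spark.sparkContext.parallelize(data).map { case (n, s) => Row(n, s) }
    spark.createDataFrame(rowRdd, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("create-df-sketch").getOrCreate()
    val df = fromTuples(spark, Seq((1, "one"), (2, "two")))
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```

So Row is only mandatory for the overload that takes an explicit schema; tuples and case classes go through the single-argument overload or toDF.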

access scala map from dataframe without using UDFs

I have a Spark (version 1.6) Dataframe, and I would like to add a column with a value contained in a Scala Map, this is my simplified code:
val map = Map("VAL1" -> 1, "VAL2" -> 2)
val df2 = df.withColumn("newVal", map(col("key")))
This code doesn't work. I get the following error because the map expects a String key but receives a Column:
found : org.apache.spark.sql.Column
required: String
The only way I have found to do that is with a UDF:
val map = Map("VAL1" -> 1, "VAL2" -> 2)
val myUdf = udf{ value:String => map(value)}
val df2 = df.withColumn("newVal", myUdf($"key"))
I want to avoid UDFs if possible.
Are there any other solutions that use just the DataFrame API (I would also like to avoid converting to an RDD)?
TL;DR Just use udf.
With the version you use (Spark 1.6 according to your comment) there is no solution which doesn't require udf or map over RDD / Dataset.
In later versions you can:
Use the map function (2.0 or later) to create a literal MapType column:
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.lit
val map = functions.map(
  Map("VAL1" -> 1, "VAL2" -> 2)
    .flatMap { case (k, v) => Seq(k, v) }.map(lit).toSeq: _* // alternating key and value literals
)
map($"key")
Use typedLit (2.2 or later) to create a literal MapType column:
val map = functions.typedLit(Map("VAL1" -> 1, "VAL2" -> 2))
map($"key")
Either expression can be used directly as a column, for example df.withColumn("newVal", map($"key")).
Reference: How to add a constant column in a Spark DataFrame?
You could convert the Map to a Dataframe and use a JOIN between this and your existing dataframe. Since the Map dataframe would be very small, it should be a Broadcast Join and avoid the need for a shuffle phase.
Letting Spark know to use a broadcast join is described in this answer: DataFrame join optimization - Broadcast Hash Join
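The join approach could be sketched as follows. This is an illustration under assumptions rather than a definitive implementation: it is written against the Spark 2.x SparkSession API (on 1.6 the implicits would come from a SQLContext instead), and the object name, method name and the "newVal" column name are made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object MapJoinSketch {
  // Turns the Scala Map into a two-column DataFrame and left-joins it on "key".
  // broadcast(...) hints that the small lookup table should be shipped to every executor.
  def withMappedValue(spark: SparkSession, df: DataFrame, lookup: Map[String, Int]): DataFrame = {
    import spark.implicits._
    val lookupDf = lookup.toSeq.toDF("key", "newVal")
    df.join(broadcast(lookupDf), Seq("key"), "left_outer")
  }
}
```

One behavioural difference from the UDF version: keys that are missing from the Map produce a null in newVal here, whereas map(value) inside the UDF would throw for an unknown key.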

spark scala reducekey dataframe operation

I'm trying to do a count in Scala with a DataFrame. My data has 3 columns; I've already loaded it and split each line by tab. With RDDs I would do something like this:
val file1 = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in a DataFrame instead, and I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option:
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(2).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b") // give the columns names for ease of use
df
  .groupBy('a)
  .count() // counts rows per key; use .sum("b") instead to add up the values as reduceByKey(_+_) does
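For comparison, here is a sketch of a DataFrame version that sums the third field per key, which is what reduceByKey(_+_) computes in the question. It assumes a Spark 2.x SparkSession and tab-separated lines with at least three fields; the object name, method name and the column names "key", "value" and "total" are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object GroupSumSketch {
  // Splits each line on tabs, keeps fields 0 and 2, and sums field 2 per key.
  def sumByKey(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.sparkContext.parallelize(lines)
      .map(_.split("\t"))
      .map(fields => (fields(0), fields(2).toInt))
      .toDF("key", "value")
      .groupBy("key")
      .agg(sum("value").as("total")) // summing an integer column yields a long
  }
}
```

Swapping agg(sum(...)) for count() would give the number of rows per key instead of the total.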

remove a column from a dataframe spark

I have a Spark dataframe with a very large number of columns. I want to remove two columns from it to get a new dataframe.
Had there been fewer columns, I could have used the select method in the API like this:
pcomments = pcomments.select(pcomments.col("post_id"),pcomments.col("comment_id"),pcomments.col("comment_message"),pcomments.col("user_name"),pcomments.col("comment_createdtime"));
But since picking columns from a long list is a tedious task, is there a workaround?
Use the drop method (and withColumnRenamed if you also need to rename a column).
Example:
val initialDf= ....
val dfAfterDrop=initialDf.drop("column1").drop("column2")
val dfAfterColRename= dfAfterDrop.withColumnRenamed("oldColumnName","newColumnName")
On Spark 2.0 or later, drop also accepts several column names at once:
val initialDf = ...
val dfAfterDropCols = initialDf.drop("column1", "column2")

Spark Select with a List of Columns Scala

I am trying to find a good way of doing a Spark select with a List[Column]. I am exploding a column and then passing back all the columns I am interested in along with my exploded column.
var columns = getColumns(x) // Returns a List[Column]
tempDf.select(columns) // what I am trying to get to work
I know that if it were a list of strings I could do something like
val result = dataframe.select(columnNames.head, columnNames.tail: _*)
For Spark 2.0 it seems that you have two options. Which one applies depends on how you manage your columns (as Strings or as Columns).
Spark code (spark-sql_2.11/org/apache/spark/sql/Dataset.scala):
def select(cols: Column*): DataFrame = withPlan {
Project(cols.map(_.named), logicalPlan)
}
def select(col: String, cols: String*): DataFrame = select((col +: cols).map(Column(_)) : _*)
You can see that internally Spark converts your head and tail into a sequence of Columns and calls select again.
So, if you want clear code, I would recommend:
If columns: List[String]:
import org.apache.spark.sql.functions.col
df.select(columns.map(col): _*)
Otherwise, if columns: List[Column]:
df.select(columns: _*)
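Applied to the exploding case from the question, the pattern might look like the sketch below. It assumes a Spark 2.x DataFrame with an array column; the object name, method name and parameter names are invented for illustration and are not part of the Spark API.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, explode}

object SelectListSketch {
  // Selects the columns in `keep` plus one exploded array column named `alias`.
  def explodeWith(df: DataFrame, keep: List[Column], arrayCol: String, alias: String): DataFrame = {
    val all: List[Column] = keep :+ explode(col(arrayCol)).as(alias)
    df.select(all: _*) // `: _*` expands the List into the Column* varargs parameter
  }
}
```

Building the full list first and expanding it once keeps the call to select the same regardless of how many columns are kept.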