Add new columns - scala

ErrorHi I am trying to a new column to a Spark. I am trying in a data set where I want to add the percentage made by in all games.
The data set looks like this:
Name, Platform, Year, Genre, Publisher, NA_Sales, EU_Sales, JP_Sales, Other_Sales
val vgdataLines = sc.textFile("hdfs:///user/ashhall1616/bdc_data/t1/vgsales-small.csv")
val vgdata = vgdataLines.map(_.split(";"))
def toPercentage(x: Double): Double = {x * 100} val countPubl = vgdata.map(r => (r(4),1)).reduceByKey(_+_)
val addpercen = countPubl.withColumn("count", toPercentage($"count"/countPubl.count(_._2)))
I used withColumn() to add new column 'count' and expected output to be like:
(Ubisoft,3,15.0)
Can anyone tell whats wrong here?

You cannot use "withColumn" with an RDD.
You could do as follow
val addpercen = countPubl.map({case(key, value) => (key, value, toPercentage(value))})
use map to add a calculated value as new column and convert to a DataFrame if you want
import spark.implicits._
val myDf = addpercen.toDF("key","value","myNewColumn")
myDf.show()
Hope it helps.

You can not use withColumn with an RDD hence convert it to DataFrame as below and then use it
val countPubl : DataFrame = vgdata.map(r => (r(4),1)).reduceByKey(_+_).toDF()
If you still looking to use RDD then just converto it back to RDD once you add the with column as
val javaRdd : JavaRDD[Row] = countPubl.withColumn("...",col("...")).toJavaRDD

Related

Scala RDD get earliest date by group

I have a case class RDD in Scala and need to find the earliest date by each group (patientID).
Here is the input:
patientID date
000000047-01 2008-03-21T21:00:00Z
000000047-01 2007-10-24T19:45:00Z
000000485-01 2011-06-17T21:00:00Z
000000485-01 2006-02-22T18:45:00Z
The expected should be:
patientID date
000000047-01 2007-10-24T19:45:00Z
000000485-01 2006-02-22T18:45:00Z
I tried something like following but didn't work
val out = medication.groupBy(x => x.patientID).sortBy(x => x.date).take(1)
Okay!
So I understand your question correctly you want the top from every record, if that's the case then here I have created the solution.
val dataDF = Seq(
("000000047-01", "2008-03-21T21:00:00Z"),
("000000047-01" , "2007-10-24T19:45:00Z"),
("000000485-01", "2011-06-17T21:00:00Z"),
("000000485-01", "2006-02-22T18:45:00Z"))
import spark.implicits._
val dfWithSchema = dataDF.toDF("patientId", "date")
val winSpec = Window.partitionBy("patientId").orderBy("date")
val rank_df = dfWithSchema.withColumn("rank", rank().over(winSpec)).orderBy(col("patientId"))
val result = rank_df.select(col("patientId"),col("date")).where(col("rank") === 1)
result.show()
Please ignore the steps for creating the DF with the schema if you have already schema defined with your data.

Update date format in spark dataframe for multiple spark columns

I have a Spark dataframe where few columns having a different type of date format.
To handle this I have written below code to keep a consistent type of format for all the date columns.
As the date column date format may get change every time hence I have defined a set of date formats in dt_formats.
def to_timestamp_multiple(s: Column, formats: Seq[String]): Column = {
coalesce(formats.map(fmt => to_timestamp(s, fmt)):_*)
}
val dt_formats= Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
val newDF = df.withColumn("ETD1", date_format(to_timestamp_multiple($"ETD",Seq("dd-MMM-yyyy", dt_formats)).cast("date"), "yyyy-MM-dd")).drop("ETD").withColumnRenamed("ETD1","ETD")
But here I have to create a new column then I have to drop older column then rename the new column.
that make the code unnecessary very clumsy hence I want to get override from this code.
I am trying to implement similar functionality by writing a Scala below function but it is throwing the exception org.apache.spark.sql.catalyst.parser.ParseException:, but I am unable to identify the what change I should made to make it work..
val CleansedData= rawDF.selectExpr(rawDF.columns.map(
x => { x match {
case "ETA" => s"""date_format(to_timestamp_multiple($x, dt_formats).cast("date"), "yyyy-MM-dd") as ETA"""
case _ => x
} } ) : _*)
Hence seeking help.
Thanks in advance.
Create a UDF in order to use with select. The select method takes columns and produces another DataFrame.
Also, instead of using coalesce, it might be more straightforward simply to build a parser that handles all of the formats. You can use DateTimeFormatterBuilder for this.
import java.time.format.DateTimeFormatter
import java.time.format.DateTimeFormatterBuilder
import org.apache.spark.sql.functions.udf
import java.time.LocalDate
import scala.util.Try
import java.sql.Date
val dtFormatStrings:Seq[String] = Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
// use foldLeft with appendOptional method, which for each format,
// returns a new builder with that additional possible format
val initBuilder = new DateTimeFormatterBuilder()
val builder: DateTimeFormatterBuilder = dtFormatStrings.foldLeft(initBuilder)(
(b: DateTimeFormatterBuilder, s:String) => b.appendOptional(DateTimeFormatter.ofPattern(s)))
val formatter = builder.toFormatter()
// Create the UDF, which just takes
// any function returning a sql-compatible type (java.sql.Date, here)
def toTimeStamp2(dateString:String): Date = {
val dateTry: Try[Date] = Try(java.sql.Date.valueOf(LocalDate.parse(dateString, formatter)))
dateTry.toOption.getOrElse(null)
}
val timeConversionUdf = udf(toTimeStamp2 _)
// example DF and new DF
val df = Seq(("05/08/20"), ("2020-04-03"), ("unparseable")).toDF("ETD")
df.select(timeConversionUdf(col("ETD"))).toDF("ETD2").show
Output:
+----------+
| ETD2|
+----------+
|2020-05-08|
|2020-04-03|
| null|
+----------+
Note that unparseable values end up null, as shown.
try withColumn(...) with same name and coalesce as below-
val dt_formats= Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd","MM/dd/yy","dd-MM-yy","dd-MM-yyyy","yyyy/MM/dd","dd/MM/yyyy")
val newDF = df.withColumn("ETD", coalesce(dt_formats.map(fmt => to_date($"ETD", fmt)):_*))

calling a scala method passing each row of a dataframe as input

I have a dataframe which has two columns in it, has been created importing a .txt file.
sample file content::
Sankar Biswas, Played{"94"}
Puja "Kumari" Jha, Didnot
Man Women, null
null,Gay Gentleman
null,null
Created a dataframe importing the above file ::
val a = sc.textFile("file:////Users/sankar.biswas/Desktop/hello.txt")
case class Table(contentName: String, VersionDetails: String)
val b = a.map(_.split(",")).map(p => Table(p(0).trim,p(1).trim)).toDF
Now I have a function defined lets say like this ::
def getFormattedName(contentName : String, VersionDetails:String): Option[String] = {
Option(contentName+titleVersionDesc)
}
Now what I need to do is I have to take each row of the dataframe and call the method getFormattedName passing the 2 arguments of the dataframe's each row.
I tried like this and many others but did not work out ::
val a = b.map((m,n) => getFormattedContentName(m,n))
Looking forward to any suggestion you have for me.
Thanks in advance.
I think you have a structured schema and it can be represented by a dataframe.
Dataframe has support for reading the csv input.
import org.apache.spark.sql.types._
val customSchema = StructType(Array(StructField("contentName", StringType, true),StructField("titleVersionDesc", StringType, true)))
val df = spark.read.schema(customSchema).csv("input.csv")
To call a custom method on dataset, you can create a UDF(User Defined Function).
def getFormattedName(contentName : String, titleVersionDesc:String): Option[String] = {
Option(contentName+titleVersionDesc)
}
val get_formatted_name = udf(getFormattedName _)
df.select(get_formatted_name($"contentName", $"titleVersionDesc"))
Try
val a = b.map(row => getFormattedContentName(row(0),row(1)))
Remember that the rows of a dataframe are their own type, not a tuple or something, and you need to use the correct methodology for referring to their elements.

How to convert RDD[Row] to RDD[String]

I have a DataFrame called source, a table from mysql
val source = sqlContext.read.jdbc(jdbcUrl, "source", connectionProperties)
I have converted it to rdd by
val sourceRdd = source.rdd
but its RDD[Row] I need RDD[String]
to do transformations like
source.map(rec => (rec.split(",")(0).toInt, rec)), .subtractByKey(), etc..
Thank you
You can use Row. mkString(sep: String): String method in a map call like this :
val sourceRdd = source.rdd.map(_.mkString(","))
You can change the "," parameter by whatever you want.
Hope this help you, Best Regards.
What is your schema?
If it's just a String, you can use:
import spark.implicits._
val sourceDS = source.as[String]
val sourceRdd = sourceDS.rdd // will give RDD[String]
Note: use sqlContext instead of spark in Spark 1.6 - spark is a SparkSession, which is a new class in Spark 2.0 and is a new entry point to SQL functionality. It should be used instead of SQLContext in Spark 2.x
You can also create own case classes.
Also you can map rows - here source is of type DataFrame, we use partial function in map function:
val sourceRdd = source.rdd.map { case x : Row => x(0).asInstanceOf[String] }.map(s => s.split(","))

udf spark column names

I need to specify a sequence of columns. If I pass two strings, it works fine
val cols = array("predicted1", "predicted2")
but if I pass a sequence or an array, I get an error:
val cols = array(Seq("predicted1", "predicted2"))
Could you please help me? Many thanks!
You have at least two options here:
Using a Seq[String]:
val columns: Seq[String] = Seq("predicted1", "predicted2")
array(columns.head, columns.tail: _*)
Using a Seq[ColumnName]:
val columns: Seq[ColumnName] = Seq($"predicted1", $"predicted2")
array(columns: _*)
Function signature is def array(colName: String, colNames: String*): Column which means that it takes one string and then one or more strings. If you want to use a sequence, do it like this:
array("predicted1", Seq("predicted2"):_*)
From what I can see in the code, there are a couple of overloaded versions of this function, but neither one takes a Seq directly. So converting it into varargs as described should be the way to go.
You can use Spark's array form def array(cols: Column*): Column where the cols val is defined without using the $ column name notation -- i.e. when you want to have a Seq[ColumnName] type specifically, but created using strings. Here is how to solve that...
import org.apache.spark.sql.ColumnName
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val some_states: Seq[String] = Seq("state_AK","state_AL","state_AR","state_AZ")
val some_state_cols: Seq[ColumnName] = some_states.map(s => symbolToColumn(scala.Symbol(s)))
val some_array = array(some_state_cols: _*)
...using Spark's symbolToColumn method.
or with the ColumnName(s) constructor directly.
val some_array: Seq[ColumnName] = some_states.map(s => new ColumnName(s))