Find columns to select, for spark.read(), from another Dataset - Spark Scala

I have a Dataset[Year] that has the following schema:
case class Year(day: Int, month: Int, Year: Int)
Is there any way to make a collection of the current schema?
I have tried:
println("Print -> "+ds.collect().toList)
But the result was:
Print -> List([01,01,2022], [31,01,2022])
I expected something like:
Print -> List(Year(01,01,2022), Year(31,01,2022))
I know that with a map I can adjust it, but I am trying to create a generic method that accepts any schema, and for this I cannot add a map doing the conversion.
That is my method:
class SchemeList[A]{
def set[A](ds: Dataset[A]): List[A] = {
ds.collect().toList
}
}
The method's return type looks correct at compile time, but running it fails with an error:
val setYears = new SchemeList[Year]
val YearList: List[Year] = setYears.set(df)
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to schemas.Schemas$Year

Based on the additional information in your comment:
"I need this list to use as variables when creating another dataframe via jdbc (I need to make a specific select within postgresql). Is there a more performant way to pass values from a dataframe as parameters in a select?"
Given your initial dataset:
val yearsDS: Dataset[Year] = ???
and that you want to do something like:
val desiredColumns: Array[String] = ???
spark.read.jdbc(..).select(desiredColumns.head, desiredColumns.tail: _*)
You could find the column names of yearsDS by doing:
val desiredColumns: Array[String] = yearsDS.columns
Spark implements columns on top of def schema, which is defined on Dataset: it returns the name of each field in the schema.
You can see the definition of def columns in the Spark source for Dataset.

Maybe you have a DataFrame, not a Dataset.
Try using "as" to convert the DataFrame to a Dataset, like this:
import spark.implicits._
val year = Year(1, 1, 1)
val years = List(year, year)
// Build a DataFrame first, then convert it to a typed Dataset[Year] with as[Year]
val ds = spark
  .sparkContext
  .parallelize(years)
  .toDF("day", "month", "Year")
  .as[Year]
println(ds.collect().toList)
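Why can the signature look right while the run still fails? One plausible reading, offered as a sketch and not as a diagnosis of the original code: generic type parameters are erased on the JVM, so a List[Year] that actually holds row objects is only detected when an element is used as a Year. The plain-Scala example below reproduces that effect without Spark. FakeRow, unsafeList and firstDay are hypothetical stand-ins (FakeRow for Spark's Row, unsafeList for an unchecked conversion), and the case class uses a lowercase year field for readability.

```scala
// Plain-Scala sketch of type erasure; no Spark involved.
// FakeRow is a hypothetical stand-in for org.apache.spark.sql.Row.
final case class FakeRow(values: Seq[Any])
final case class Year(day: Int, month: Int, year: Int)

object ErasureSketch {
  // Compiles for any A, because the element type is not checked at run time.
  def unsafeList[A](rows: List[Any]): List[A] = rows.asInstanceOf[List[A]]

  // The cast to Year happens only here, when an element is read as a Year.
  def firstDay(years: List[Year]): Int = years.head.day

  def main(args: Array[String]): Unit = {
    val rows: List[Any] = List(FakeRow(Seq(1, 1, 2022)))
    val years: List[Year] = unsafeList[Year](rows) // no error yet
    val result = scala.util.Try(firstDay(years))   // fails with ClassCastException
    println(result.isFailure)
  }
}
```

With a real Dataset, converting through as[Year] gives Spark an encoder for Year, so collect() returns actual Year instances instead of rows.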

Related

Spark Scala: Failed to execute user defined function Caused by: SparkException: A master URL must be set in your configuration

I have a dataframe with one column that contains strings, and I need to use a Word2Vec model to add synonyms of each word in a new column.
So I wrote this function:
def expand(word: String): String = {
val model = Word2VecModel.load( "w2vec.model")
val expanded = model.findSynonyms(word,3).rdd.map(_.getString(0)).collect().toList.mkString(" ")
expanded
}
When I call the function, I get
expand("usb")
res51: String = hdmi cable kabel
Cool so far, now I want to put this function in a udf and apply it to an entire dataframe:
import org.apache.spark.sql.functions.udf
val func = udf(expand _)
val exploded_df_w2vec = exploded_df.withColumn("expanded", func($"col"))
The problem appears here:
display(exploded_df_w2vec)
Failed to execute user defined function($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$6261/1609460357: (string) => string)
Caused by: SparkException: A master URL must be set in your configuration
This code is executed in Databricks, with Scala 2.12.
Any idea how to work around this issue?

calling a scala method passing each row of a dataframe as input

I have a dataframe with two columns, created by importing a .txt file.
Sample file content:
Sankar Biswas, Played{"94"}
Puja "Kumari" Jha, Didnot
Man Women, null
null,Gay Gentleman
null,null
I created a dataframe by importing the above file:
val a = sc.textFile("file:////Users/sankar.biswas/Desktop/hello.txt")
case class Table(contentName: String, VersionDetails: String)
val b = a.map(_.split(",")).map(p => Table(p(0).trim,p(1).trim)).toDF
Now I have a function defined, let's say like this:
def getFormattedName(contentName: String, VersionDetails: String): Option[String] = {
  Option(contentName + VersionDetails)
}
What I need to do is take each row of the dataframe and call getFormattedName, passing the row's two column values as arguments.
I tried this and many other variants, but none worked:
val a = b.map((m,n) => getFormattedName(m,n))
I look forward to any suggestions.
Thanks in advance.
I think your data has a structured schema, so it can be represented as a dataframe.
Spark's DataFrameReader supports reading CSV input directly.
import org.apache.spark.sql.types._
val customSchema = StructType(Array(
  StructField("contentName", StringType, true),
  StructField("titleVersionDesc", StringType, true)))
val df = spark.read.schema(customSchema).csv("input.csv")
To call a custom method on each row, you can create a UDF (user-defined function).
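import org.apache.spark.sql.functions.udf
import spark.implicits._ // enables the $"colName" syntax
def getFormattedName(contentName: String, titleVersionDesc: String): Option[String] = {
  Option(contentName + titleVersionDesc)
}
// Wrap the method as a UDF and apply it to the two columns
val get_formatted_name = udf(getFormattedName _)
df.select(get_formatted_name($"contentName", $"titleVersionDesc"))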
Try
val formatted = b.map(row => getFormattedName(row.getString(0), row.getString(1)))
Remember that each row of a dataframe is a Row, not a tuple, so its fields must be read through Row's accessors. row(0) returns Any, which is why the typed getter getString is used here.
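A further option, sketched here under assumptions and not taken from the answers above, is to work with a typed Dataset so that each element is a Table instead of a Row. The object name RowToMethodSketch, the helper formatAll, and the choice to replace None with an empty string are all illustrative. The example assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Same shape as the case class in the question; top-level so Spark can derive an encoder.
final case class Table(contentName: String, VersionDetails: String)

object RowToMethodSketch {
  def getFormattedName(contentName: String, versionDetails: String): Option[String] =
    Option(contentName + versionDetails)

  // Hypothetical helper: applies getFormattedName to every element of a typed Dataset.
  def formatAll(spark: SparkSession, rows: Seq[Table]): Array[String] = {
    import spark.implicits._
    val ds: Dataset[Table] = rows.toDS()
    // Each element is a Table, so fields are accessed by name instead of by index.
    ds.map(t => getFormattedName(t.contentName, t.VersionDetails).getOrElse(""))
      .collect()
  }
}
```

The same map works on an existing dataframe after converting it with as[Table], provided the column names match the case class fields.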

Adding new column using existing one using Spark Scala

Hi, I want to add a new column to each row of a DataFrame, built from the existing columns. I am trying this in Spark Scala like this:
df is a dataframe with a variable number of columns, known only at run time.
// Added new column "docid"
val df_new = appContext.sparkSession.sqlContext.createDataFrame(df.rdd, df.schema.add("docid", DataTypes.StringType))
df_new.map(x => {
  import appContext.sparkSession.implicits._
  // 0 until x.size avoids reading one index past the last field
  val allVals = (0 until x.size).map(x.get(_)).toSeq
  // :+ appends the joined string as a single value
  val values = allVals :+ allVals.mkString("_")
  Row.fromSeq(values)
})
But this gives an error in Eclipse itself:
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
not enough arguments for method map: (implicit evidence$7: org.apache.spark.sql.Encoder[org.apache.spark.sql.Row])org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]. Unspecified value parameter evidence$7.
Please help.
concat_ws from the functions object can help.
This code adds the docid field:
import org.apache.spark.sql.functions.concat_ws
val dfWithDocId = df.withColumn("docid", concat_ws("_", df.columns.map(df.col(_)): _*))
This assumes all columns of df are strings.
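If some columns are not strings, one possible adjustment is to cast each column to string before joining. The following is a minimal sketch under that assumption; DocIdSketch and withDocId are hypothetical names, and column names are assumed to contain no dots or other characters that would need quoting.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws}

object DocIdSketch {
  // Casts every column to string, then joins the values with the separator.
  def withDocId(df: DataFrame, sep: String = "_"): DataFrame = {
    val asStrings = df.columns.map(c => col(c).cast("string"))
    df.withColumn("docid", concat_ws(sep, asStrings: _*))
  }
}
```

Note that concat_ws skips null values, so a row containing nulls gets a shorter docid than a row without them.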

Removing newlines in a DataFrame field with udf function gives TypeTag Error

val trim: String => String = _.trim.replace("[\\r\\n]", "")
def main(args: Array[String]) {
val spark = ... ...
import spark.implicits._
val trimUDF = udf[String,String](trim)
val df = spark.read.json(df_path) ...
val fixed_dblogs_df = df.withColumn("qp_new", trimUDF('qp)) ...
}
When I run this code I get a compile time error:
No TypeTag available for String
This error is reported where I define the udf function. I have no idea why this is happening. I have used udf functions before, but this one produces this error. I am using Spark 2.1.1.
The purpose of the code is to remove all newlines from one of my columns, which is of StringType.
Is there some reason you're using a UDF instead of the regexp_replace builtin?
import org.apache.spark.sql.functions.regexp_replace
val fixed_dblogs_df = df.withColumn("qp_new", regexp_replace('qp, "[\\r\\n]", "")) ...
UDFs are opaque to Spark's optimizer, so they prevent some plan optimizations.
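One detail worth checking when moving between plain Scala strings and Spark's builtin regex replacement: on a Scala String, replace treats its first argument as literal text, while replaceAll treats it as a regular expression. The plain-Scala sketch below illustrates the difference; the object and method names are hypothetical.

```scala
object NewlineStripSketch {
  // replace: searches for the literal six characters [\r\n], not for line breaks.
  def stripLiteral(s: String): String = s.trim.replace("[\\r\\n]", "")

  // replaceAll: interprets [\r\n] as a regex character class matching CR or LF.
  def stripRegex(s: String): String = s.trim.replaceAll("[\\r\\n]", "")

  def main(args: Array[String]): Unit = {
    val input = "line one\r\nline two\n"
    println(stripLiteral(input).contains("\n")) // line break still present
    println(stripRegex(input).contains("\n"))   // line break removed
  }
}
```

So a string function meant to strip newlines with a character class needs replaceAll, not replace.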

udf spark column names

I need to specify a sequence of columns. If I pass two strings, it works fine
val cols = array("predicted1", "predicted2")
but if I pass a sequence or an array, I get an error:
val cols = array(Seq("predicted1", "predicted2"))
Could you please help me? Many thanks!
You have at least two options here:
Using a Seq[String]:
val columns: Seq[String] = Seq("predicted1", "predicted2")
array(columns.head, columns.tail: _*)
Using a Seq[ColumnName]:
val columns: Seq[ColumnName] = Seq($"predicted1", $"predicted2")
array(columns: _*)
The function signature is def array(colName: String, colNames: String*): Column, which means that it takes one string followed by zero or more further strings. If you want to use a sequence, do it like this:
array("predicted1", Seq("predicted2"):_*)
From what I can see in the code, there are a couple of overloaded versions of this function, but neither one takes a Seq directly. So converting it into varargs as described should be the way to go.
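The head/tail idiom is not specific to Spark. As a minimal plain-Scala sketch, the hypothetical function joinNames below has the same parameter shape as array(colName, colNames*), and joinAll shows how a sequence is split to fit it.

```scala
object VarargsSketch {
  // Hypothetical function with the same parameter shape as array(colName, colNames*).
  def joinNames(first: String, rest: String*): String =
    (first +: rest).mkString(",")

  // Splits a non-empty sequence into head and tail to match that shape.
  def joinAll(names: Seq[String]): String = {
    require(names.nonEmpty, "at least one name is required")
    joinNames(names.head, names.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    println(joinAll(Seq("predicted1", "predicted2")))
  }
}
```

The require call makes the one limitation explicit: head fails on an empty sequence, so the idiom needs at least one element.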
You can also use the overload def array(cols: Column*): Column and build the columns from plain strings instead of the $ notation, for example when you specifically want a Seq[ColumnName] created from strings. Here is one way to do that:
import org.apache.spark.sql.ColumnName
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val some_states: Seq[String] = Seq("state_AK","state_AL","state_AR","state_AZ")
val some_state_cols: Seq[ColumnName] = some_states.map(s => symbolToColumn(scala.Symbol(s)))
val some_array = array(some_state_cols: _*)
This uses symbolToColumn, which is brought into scope by the implicits import.
Alternatively, use the ColumnName constructor directly:
val some_state_cols: Seq[ColumnName] = some_states.map(s => new ColumnName(s))
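A shorter variant, given here as a sketch and not part of the answers above, is to map the names through functions.col. It returns a plain Column, which is enough for array(cols: Column*) when a ColumnName is not specifically required. The object and method names are hypothetical, and Spark SQL is assumed to be on the classpath.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array, col}

object ArrayFromNamesSketch {
  // Builds a single array column from any number of column names.
  def arrayOf(names: Seq[String]): Column =
    array(names.map(col): _*)
}
```

For example, ArrayFromNamesSketch.arrayOf(Seq("state_AK", "state_AL")) can be passed to select or withColumn like any other Column.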