I need help converting the Dataset code below into a DataFrame. Any help would be appreciated.
A DataFrame is just a Dataset[Row]. Any schema present in the Dataset is kept in the DataFrame.
Regarding the conversion itself, there is a simple way using the .toDF() method:
val myDs: Dataset[Member] = someDs
val myDf: DataFrame = myDs.toDF()
Note that the conversion in your method signature (Member => ExtMember) seems to require some custom mapping code that .toDF() alone will not do; see the sketch below.
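A minimal sketch of that extra step, assuming Member and ExtMember are case classes and that a toExt conversion exists (all three names are hypothetical stand-ins for your types):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical case classes standing in for the real ones.
case class Member(id: Long, name: String)
case class ExtMember(id: Long, name: String, source: String)

// Hypothetical conversion matching the Member => ExtMember signature.
def toExt(m: Member): ExtMember = ExtMember(m.id, m.name, source = "ext")

val spark = SparkSession.builder.appName("example").master("local[*]").getOrCreate()
import spark.implicits._ // provides encoders for the case classes and toDF()

val myDs: Dataset[Member] = Seq(Member(1, "a"), Member(2, "b")).toDS()
val extDf: DataFrame = myDs.map(toExt).toDF() // convert each element, then drop to a DataFrame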
I am creating a Spark Dataset by reading a CSV file. Further, I need to transform this Dataset[Row] to RDD[Array[String]] to pass it to FPGrowth (Spark MLlib).
val df: DataFrame = spark.read.format("csv").option("header", "true").load("/path/to/csv")
val ds: Dataset[Row] = df.groupBy("user").agg(collect_set("values"))
Now, I need to select the column "values" and transform the resultant dataset to RDD[Array[String]].
val rddS: RDD[String] = ds.select(concat_ws(",", col("values")).as("items")).distinct().rdd.map(_.mkString(","))
val rddArray: RDD[Array[String]] = rddS.map(s => s.trim.split(','))
I tried this approach, but I'm not sure it's the best way. Please suggest an optimal way of achieving this.
One-liner:
val rddArray: RDD[Array[String]] = ds.select("values").as[Array[String]].rdd // requires import spark.implicits._ for the Array[String] encoder
By the way, I'd suggest using the DataFrame-based Spark ML API instead of the RDD-based Spark MLlib, which is now deprecated. You can use org.apache.spark.ml.fpm.FPGrowth, as sketched below.
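A minimal sketch of that route, assuming the aggregation from the question and placeholder support/confidence thresholds (ml.fpm.FPGrowth consumes the array column directly, so no RDD[Array[String]] is needed):

import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_set

val spark = SparkSession.builder.appName("fpgrowth").master("local[*]").getOrCreate()

val df = spark.read.format("csv").option("header", "true").load("/path/to/csv")
val transactions = df.groupBy("user").agg(collect_set("values").as("items"))

val fpgrowth = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.2)    // placeholder threshold
  .setMinConfidence(0.6) // placeholder threshold

val model = fpgrowth.fit(transactions)
model.freqItemsets.show()
model.associationRules.show()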
I ended up using the getSeq approach:
val rddArray: RDD[Array[String]] = ds.select("values").rdd.map(r => r.getSeq[String](0).toArray)
This was more efficient (faster) for my use case.
Why not simply use the code below? You avoid the concat_ws and split operations.
import scala.collection.mutable

val rddS: RDD[Array[String]] = ds.select("values")
  .distinct()
  .rdd.map(r => r.getAs[mutable.WrappedArray[String]](0).toArray)
I have a DataFrame that contains string columns, and I am planning to use it as input for k-means using Spark and Scala. I am converting the string-typed columns of the DataFrame using the method below:
val toDouble = udf[Double, String]( _.toDouble)
val analysisData = dataframe_mysql
  .withColumn("Event", toDouble(dataframe_mysql("event")))
  .withColumn("Execution", toDouble(dataframe_mysql("execution")))
  .withColumn("Info", toDouble(dataframe_mysql("info")))
val assembler = new VectorAssembler()
.setInputCols(Array("execution", "event", "info"))
.setOutputCol("features")
val output = assembler.transform(analysisData)
println(output.select("features", "execution").first())
When I print the analysisData schema, the conversion looks correct, but I am getting an exception: VectorAssembler does not support the StringType type.
This means my values are still strings! How can I convert the values and not only the schema type?
Thanks
Indeed, the VectorAssembler transformer does not take strings, so you need to make sure your input columns are of numeric, boolean, or vector type. Make sure that your udf is doing the right thing, and verify that none of the assembled columns has StringType.
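A quick way to verify, assuming analysisData from the question:

analysisData.printSchema() // the assembled columns should not show up as string
analysisData.dtypes.filter(_._2 == "StringType").foreach(println) // lists any remaining string columns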
To convert a column in a Spark DataFrame to another type, keep it simple and use the cast() function, like so:
import org.apache.spark.sql.types.DoubleType

val analysisData = dataframe_mysql.withColumn("Event", dataframe_mysql("Event").cast(DoubleType))
It should work!
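A minimal sketch applying that cast to all three columns before assembling, assuming dataframe_mysql from the question; note that the cast values reuse the lowercase column names here, so the assembler's input columns are the numeric ones:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.DoubleType

val analysisData = dataframe_mysql
  .withColumn("event", dataframe_mysql("event").cast(DoubleType))
  .withColumn("execution", dataframe_mysql("execution").cast(DoubleType))
  .withColumn("info", dataframe_mysql("info").cast(DoubleType))

val assembler = new VectorAssembler()
  .setInputCols(Array("execution", "event", "info"))
  .setOutputCol("features")

val output = assembler.transform(analysisData)
output.select("features", "execution").show(1)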
If I call collect on a DataFrame, I get an Array[Row]. But I'm wondering if it's possible to go back to a DataFrame from that result, or from an Array[Row] in general.
For example:
rows = df.select("*").collect()
Is there some way to do something like this:
import df.sparkSession.implicits._
newDF = rows.toDF()
It is possible to provide a List[Row], as long as you also provide a schema. Then you can use SparkSession.createDataFrame:
def createDataFrame(rows: java.util.List[Row], schema: StructType): DataFrame
There is no variant of toDF that can be used here.
In general, you should avoid collecting and converting the result back to a DataFrame; if you really need the round trip, a sketch follows.
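A minimal sketch of that round trip, assuming df from the question (the collected rows are wrapped in a java.util.List and paired with the original schema):

import scala.collection.JavaConverters._ // use scala.jdk.CollectionConverters._ on Scala 2.13+
import org.apache.spark.sql.Row

val rows: Array[Row] = df.select("*").collect()

// Rebuild a DataFrame from the collected rows plus the original schema.
val newDF = df.sparkSession.createDataFrame(rows.toList.asJava, df.schema)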
I'm trying to do a count in Scala with a DataFrame. My data has 3 columns, and I've already loaded the data and split it by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in a DataFrame, and I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame. There are many ways to specify the schema; here is one option:
import spark.implicits._ // needed for .toDF

val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(1).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use

df
  .groupBy("a")
  .count()