I need help converting the Dataset code below into a DataFrame. Any help would be appreciated.
A DataFrame is just a Dataset[Row]. Any schema present in the Dataset is kept in the DataFrame.
Regarding the conversion itself, there is a simple way using the .toDF() method:
val myDs: Dataset[Member] = someDs
val myDf: DataFrame = myDs.toDF()
Note that the conversion in your method signature (Member => ExtMember) seems to require some custom mapping code that .toDF() alone will not do; see the sketch below.
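A minimal sketch of that extra step, assuming Member and ExtMember are case classes and that a toExt conversion exists (all three names are hypothetical stand-ins for your types):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical case classes standing in for the real ones.
case class Member(id: Long, name: String)
case class ExtMember(id: Long, name: String, source: String)

// Hypothetical conversion matching the Member => ExtMember signature.
def toExt(m: Member): ExtMember = ExtMember(m.id, m.name, source = "ext")

val spark = SparkSession.builder.appName("example").master("local[*]").getOrCreate()
import spark.implicits._ // provides encoders for the case classes and toDF()

val myDs: Dataset[Member] = Seq(Member(1, "a"), Member(2, "b")).toDS()
val extDf: DataFrame = myDs.map(toExt).toDF() // convert each element, then drop to a DataFrame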
I am creating a Spark Dataset by reading a CSV file. Further, I need to transform this Dataset[Row] to RDD[Array[String]] to pass it to FPGrowth (Spark MLlib).
val df: DataFrame = spark.read.format("csv").option("header", "true").load("/path/to/csv")
val ds: Dataset[Row] = df.groupBy("user").agg(collect_set("values"))
Now, I need to select the column "values" and transform the resultant dataset to RDD[Array[String]].
val rddS: RDD[String] = ds.select(concat_ws(",", col("values")).as("items")).distinct().rdd.map(_.mkString(","))
val rddArray: RDD[Array[String]] = rddS.map(s => s.trim.split(','))
I tried this approach, but I'm not sure it's the best way. Please suggest an optimal way of achieving this.
One-liner:
val rddArray: RDD[Array[String]] = ds.select("values").as[Array[String]].rdd // requires import spark.implicits._ for the Array[String] encoder
By the way, I'd suggest using the DataFrame-based Spark ML API instead of the RDD-based Spark MLlib, which is now deprecated. You can use org.apache.spark.ml.fpm.FPGrowth, as sketched below.
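A minimal sketch of that route, assuming the aggregation from the question and placeholder support/confidence thresholds (ml.fpm.FPGrowth consumes the array column directly, so no RDD[Array[String]] is needed):

import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_set

val spark = SparkSession.builder.appName("fpgrowth").master("local[*]").getOrCreate()

val df = spark.read.format("csv").option("header", "true").load("/path/to/csv")
val transactions = df.groupBy("user").agg(collect_set("values").as("items"))

val fpgrowth = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.2)    // placeholder threshold
  .setMinConfidence(0.6) // placeholder threshold

val model = fpgrowth.fit(transactions)
model.freqItemsets.show()
model.associationRules.show()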
I ended up using the getSeq approach:
val rddArray: RDD[Array[String]] = ds.select("values").rdd.map(r => r.getSeq[String](0).toArray)
This was more efficient (faster) for my use case.
Why not simply use the code below? You avoid the concat_ws and split operations.
import scala.collection.mutable

val rddS: RDD[Array[String]] = ds.select("values")
  .distinct()
  .rdd.map(r => r.getAs[mutable.WrappedArray[String]](0).toArray)
I have a DataFrame that contains string columns, and I am planning to use it as input for k-means using Spark and Scala. I am converting the string-typed columns of the DataFrame using the method below:
val toDouble = udf[Double, String]( _.toDouble)
val analysisData = dataframe_mysql
  .withColumn("Event", toDouble(dataframe_mysql("event")))
  .withColumn("Execution", toDouble(dataframe_mysql("execution")))
  .withColumn("Info", toDouble(dataframe_mysql("info")))
val assembler = new VectorAssembler()
.setInputCols(Array("execution", "event", "info"))
.setOutputCol("features")
val output = assembler.transform(analysisData)
println(output.select("features", "execution").first())
When I print the analysisData schema, the conversion looks correct, but I am getting an exception: VectorAssembler does not support the StringType type.
This means my values are still strings! How can I convert the values and not only the schema type?
Thanks
Indeed, the VectorAssembler transformer does not take strings, so you need to make sure your input columns are of numeric, boolean, or vector type. Make sure that your udf is doing the right thing, and verify that none of the assembled columns has StringType.
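A quick way to verify, assuming analysisData from the question:

analysisData.printSchema() // the assembled columns should not show up as string
analysisData.dtypes.filter(_._2 == "StringType").foreach(println) // lists any remaining string columns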
To convert a column in a Spark DataFrame to another type, keep it simple and use the cast() function, like so:
import org.apache.spark.sql.types.DoubleType

val analysisData = dataframe_mysql.withColumn("Event", dataframe_mysql("Event").cast(DoubleType))
It should work!
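A minimal sketch applying that cast to all three columns before assembling, assuming dataframe_mysql from the question; note that the cast values reuse the lowercase column names here, so the assembler's input columns are the numeric ones:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.DoubleType

val analysisData = dataframe_mysql
  .withColumn("event", dataframe_mysql("event").cast(DoubleType))
  .withColumn("execution", dataframe_mysql("execution").cast(DoubleType))
  .withColumn("info", dataframe_mysql("info").cast(DoubleType))

val assembler = new VectorAssembler()
  .setInputCols(Array("execution", "event", "info"))
  .setOutputCol("features")

val output = assembler.transform(analysisData)
output.select("features", "execution").show(1)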
If I call collect on a DataFrame, I get an Array[Row]. But I'm wondering if it's possible to go back to a DataFrame from that result, or from an Array[Row] in general.
For example:
rows = df.select("*").collect()
Is there some way to do something like this:
import df.sparkSession.implicits._
newDF = rows.toDF()
It is possible to provide a List[Row], as long as you also provide a schema. Then you can use SparkSession.createDataFrame:
def createDataFrame(rows: java.util.List[Row], schema: StructType): DataFrame
There is no variant of toDF that can be used here.
In general, you should avoid collecting and converting the result back to a DataFrame; if you really need the round trip, a sketch follows.
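A minimal sketch of that round trip, assuming df from the question (the collected rows are wrapped in a java.util.List and paired with the original schema):

import scala.collection.JavaConverters._ // use scala.jdk.CollectionConverters._ on Scala 2.13+
import org.apache.spark.sql.Row

val rows: Array[Row] = df.select("*").collect()

// Rebuild a DataFrame from the collected rows plus the original schema.
val newDF = df.sparkSession.createDataFrame(rows.toList.asJava, df.schema)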
I'm trying to do a count in Scala with a DataFrame. My data has 3 columns, and I've already loaded the data and split it by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in a DataFrame, and I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame. There are many ways to specify the schema; here is one option:
import spark.implicits._ // needed for .toDF

val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(1).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use

df
  .groupBy("a")
  .count()