How to convert DataFrame to RDD in Scala?

Can someone please share how one can convert a dataframe to an RDD?

Simply:
val rows: RDD[Row] = df.rdd

Use df.rdd.map(row => ...) if you want to map each Row to a different RDD element. For example
df.rdd.map(row => (row(0), row(1)))
gives you a pair RDD where the first column of the df is the key and the second column of the df is the value.
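As a slightly fuller sketch (the column names id and count and their types here are made up for illustration, not taken from the question), you can also pull typed values out of each Row instead of using positional access:
import org.apache.spark.rdd.RDD

// hypothetical df with a string "id" column and an integer "count" column
val pairs: RDD[(String, Int)] = df.rdd.map { row =>
  (row.getAs[String]("id"), row.getAs[Int]("count"))
}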

I was just looking for this answer and found this post.
Jean's answer is absolutely correct. Adding to it: df.rdd returns an RDD[Row]. I needed to apply split() once I had the RDD, so I had to convert the RDD[Row] to an RDD[String]:
val opt = spark.sql("select tags from cvs").map(x => x.toString()).rdd
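A slightly cleaner sketch of the same idea (the table name cvs and the tags column come from the post; the comma delimiter is an assumption) reads the column value directly instead of going through Row.toString, which wraps the value in square brackets:
import org.apache.spark.rdd.RDD

// assumes the "tags" column holds comma-separated strings
val tagArrays: RDD[Array[String]] = spark.sql("select tags from cvs")
  .rdd
  .map(row => row.getString(0).split(","))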

Related

Best approach to transform Dataset[Row] to RDD[Array[String]] in Spark-Scala?

I am creating a Spark Dataset by reading a CSV file. Further, I need to transform this Dataset[Row] to RDD[Array[String]] for passing it to FPGrowth (Spark MLlib).
val df: DataFrame = spark.read.format("csv").option("header", "true").load("/path/to/csv")
val ds: Dataset[Row] = df.groupBy("user").agg(collect_set("values"))
Now, I need to select the column "values" and transform the resultant dataset to RDD[Array[String]].
val rddS: RDD[String] = ds.select(concat_ws(",", col("values")).as("items")).distinct().rdd.map(_.mkString(","))
val rddArray: RDD[Array[String]] = rddS.map(s => s.trim.split(','))
I tried out this approach, but I'm not sure if it's the best way. Please suggest an optimal way of achieving this.
One-liner:
val rddArray: RDD[Array[String]] = ds.select("values").as[Array[String]].rdd
By the way, I'd suggest using the DataFrame-based Spark ML API instead of the RDD-based Spark MLlib API, which is now in maintenance mode. You can use org.apache.spark.ml.fpm.FPGrowth.
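For reference, a minimal sketch of the DataFrame-based FPGrowth (available since Spark 2.2); the items column name and the threshold values are illustrative, not taken from the question:
import org.apache.spark.ml.fpm.FPGrowth

// assumes ds has an array-typed column named "items"
val fpgrowth = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.2)    // illustrative threshold
  .setMinConfidence(0.6) // illustrative threshold
val model = fpgrowth.fit(ds)
model.freqItemsets.show()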
I ended up using the getSeq approach:
val rddArray: RDD[Array[String]] = ds.select("values").rdd.map(r => r.getSeq[String](0).toArray)
This was more efficient (faster) for my use case.
Why not simply do the following? It avoids the concat_ws and split operations:
import scala.collection.mutable

val rddS: RDD[Array[String]] = ds.select("values")
  .distinct()
  .rdd.map(r => r.getAs[mutable.WrappedArray[String]](0).toArray)

Convert Scala DataFrame to RDD[(Long, Vector)]

I have a DataFrame with two columns: an id and a tf-idf vector (org.apache.spark.mllib.linalg.Vector).
I want to convert this to an RDD[(Long, Vector)] and then convert that to a coordinate matrix.
PS: I can't share the data due to constraints.
I tried df.as[(Long, Vector)], but it didn't work.
You can convert a DataFrame to an RDD[Row] using
val rdd = df.rdd
after which you can restructure the RDD with map, e.g.
val pairs = df.rdd.map(row => (row(0), row(1)))
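To get the typed RDD[(Long, Vector)] the question asks for, a sketch along these lines should work, assuming the id column is a Long at position 0 and the tf-idf column is an mllib Vector at position 1:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// column 0 is assumed to be the Long id, column 1 the mllib Vector
val idVectors: RDD[(Long, Vector)] = df.rdd.map { row =>
  (row.getLong(0), row.getAs[Vector](1))
}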

Spark Dataframe - Add new Column from List[String]

I have a List[String] and want to add these Strings as column names to an existing DataFrame.
Is there a way to do it without iterating over the List? If iterating over the List is the only way, how best can I achieve it?
Must be dumb... I should have tried this before.
Got the answer after a little experimentation:
val test: DataFrame = useCaseTagField_l.foldLeft(ds_segments)((df, tag) => df.withColumn(tag._2, lit(null)))
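Note that tag._2 suggests the list here is actually a list of pairs; with a plain List[String] as described in the question, a minimal sketch (the column names and the null default are illustrative) would be:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// newColumns is the List[String] of column names to add; df is the existing DataFrame
val newColumns = List("colA", "colB", "colC")
val withCols: DataFrame = newColumns.foldLeft(df)((acc, name) => acc.withColumn(name, lit(null)))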

Obtaining one column of an RDD[Array[String]] and converting it to a Dataset/DataFrame

I have a .csv file that I read into an RDD:
val dataH = sc.textFile(filepath).map(line => line.split(",").map(elem => elem.trim))
I would like to iterate over this RDD in order and compare adjacent elements; the comparison depends on only one column of the data structure. Since it is not possible to iterate over an RDD directly, the idea is to first convert that column of the RDD to either a Dataset or a DataFrame.
You can convert an RDD to a Dataset like this (which doesn't work if my structure is RDD[Array[String]]):
val sc = new SparkContext(conf)
val sqc = new SQLContext(sc)
import sqc.implicits._
val lines = sqc.createDataset(dataH)
How do I obtain just the one column that I am interested in from dataH and thereafter create a dataset just from it?
I am using Spark 1.6.0.
You can just map your Array to the desired index, e.g.:
dataH.map(arr => arr(0)).toDF("col1")
Or, more safely (avoiding an exception if the index is out of bounds):
dataH.map(arr => arr.lift(0).orNull).toDF("col1")
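If you want a Dataset rather than a DataFrame, a sketch that reuses the createDataset call from the question (keeping only the first column) would be:
import sqc.implicits._

// dataH is the RDD[Array[String]] from the question; keep only column 0
val firstCol = sqc.createDataset(dataH.map(arr => arr(0)))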

Spark Scala reduceByKey DataFrame operation

I'm trying to do a count in Scala with a DataFrame. My data has 3 columns, and I've already loaded the data and split it by tab. So I want to do something like this:
val file1 = file.map(line => line.split("\t"))
val x = file1.map(line => (line(0), line(2).toInt)).reduceByKey(_ + _, 1)
I want to put the data in a DataFrame, but I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option:
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(2).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b")               // give the columns names for ease of use

df
  .groupBy("a")
  .count()
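If the goal is the same result as the reduceByKey(_ + _) in the question, i.e. summing the values per key rather than counting rows, a sketch of the DataFrame equivalent (using the column names above) would be:
import org.apache.spark.sql.functions.sum

df
  .groupBy("a")
  .agg(sum("b"))
  .show()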