convert scala dataframe to rdd[(Long,Vector)] - scala

I have a dataframe with two columns: id and tfidfvector (an org.apache.spark.mllib.linalg.Vector).
I want to convert this to a rdd[(id,Vector)] and then convert it to a coordinate matrix.
PS: Can't share the data due to constraints.
I tried df.as[(Long, Vector)], but it didn't work.

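You can convert a DataFrame to an RDD[Row] using
val rows = df.rdd
after which you can restructure the RDD with map. Row indices are 0-based, so for a two-column DataFrame with a Long id and a Vector, e.g.
val pairs = df.rdd.map(row => (row.getLong(0), row.getAs[Vector](1)))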
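The question also asks for a coordinate matrix. One possible route, sketched below, is to wrap each (id, vector) pair in an IndexedRow and let IndexedRowMatrix do the conversion. This assumes Spark 2.x or later with spark-mllib on the classpath; the object name, the local SparkSession, and the sample rows are hypothetical stand-ins, since the real data was not shared.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, IndexedRowMatrix}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameToCoordinateMatrix {

  // Pull both columns out of each Row by position (Row indices are 0-based).
  // Assumes column 0 is a Long id and column 1 is an mllib Vector.
  def toPairs(df: DataFrame): RDD[(Long, Vector)] =
    df.rdd.map(row => (row.getLong(0), row.getAs[Vector](1)))

  // Wrap each pair as an IndexedRow, then convert to a CoordinateMatrix.
  def toCoordinateMatrix(pairs: RDD[(Long, Vector)]): CoordinateMatrix =
    new IndexedRowMatrix(pairs.map { case (id, v) => IndexedRow(id, v) })
      .toCoordinateMatrix()

  // Hypothetical sample data standing in for the real id/tfidfvector columns.
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (0L, Vectors.dense(1.0, 0.0, 2.0)),
      (1L, Vectors.sparse(3, Array(1), Array(3.0)))
    ).toDF("id", "tfidfvector")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("df-to-coordinate-matrix")
      .getOrCreate()

    val matrix = toCoordinateMatrix(toPairs(sampleDf(spark)))
    matrix.entries.collect().foreach(println)

    spark.stop()
  }
}
```

The id is used directly as the row index of the matrix, so this only makes sense if the ids are non-negative and reasonably dense.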

Related

Obtaining one column of a RDD[Array[String]] and converting it to dataset/dataframe

I have a .csv file that I read in to a RDD:
val dataH = sc.textFile(filepath).map(line => line.split(",").map(elem => elem.trim))
I would like to iterate over this RDD in order and compare adjacent elements. The comparison depends on only one column of the data structure. It is not possible to iterate over RDDs, so the idea is to first convert that column of the RDD to either a Dataset or a DataFrame.
You can convert an RDD to a Dataset like this (which doesn't work if my structure is RDD[Array[String]]):
val sc = new SparkContext(conf)
val sqc = new SQLContext(sc)
import sqc.implicits._
val lines = sqc.createDataset(dataH)
How do I obtain just the one column that I am interested in from dataH and thereafter create a dataset just from it?
I am using Spark 1.6.0.
You can just map your Array to the desired index, e.g.:
dataH.map(arr => arr(0)).toDF("col1")
Or, more safely (this avoids an exception if the index is out of bounds):
dataH.map(arr => arr.lift(0).orElse(None)).toDF("col1")
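To see what lift does on its own, here is a small plain-Scala sketch with no Spark involved. The object name and the sample rows are made up for illustration. Note that lift already returns an Option, so the trailing .orElse(None) in the line above leaves the value unchanged.

```scala
object LiftSketch {
  // Hypothetical rows, shaped like the output of line.split(",").map(_.trim)
  val rows: Seq[Array[String]] =
    Seq(Array("a", "1"), Array("b"), Array.empty[String])

  // lift(i) yields Some(value) when index i exists and None otherwise,
  // instead of throwing ArrayIndexOutOfBoundsException like arr(i).
  val secondColumn: Seq[Option[String]] = rows.map(_.lift(1))

  // orNull turns a missing value into null, if a plain String is preferred.
  val nullable: Seq[String] = secondColumn.map(_.orNull)

  def main(args: Array[String]): Unit = {
    secondColumn.foreach(println)
  }
}
```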

spark scala reducekey dataframe operation

I'm trying to do a count in Scala with a dataframe. My data has 3 columns, and I've already loaded the data and split it by tab. So I want to do something like this:
val file = file.map(line=>line.split("\t"))
val x = file1.map(line=>(line(0), line(2).toInt)).reduceByKey(_+_,1)
I want to put the data in a dataframe, and I'm having some trouble with the syntax:
val file = file.map(line=>line.split("\t")).toDF
val file.groupby(line(0))
.count()
Can someone help check if this is correct?
Spark needs to know the schema of the DataFrame.
There are many ways to specify the schema; here is one option:
val df = file
  .map(line => line.split("\t"))
  .map(l => (l(0), l(1).toInt)) // at this point Spark knows the number of columns and their types
  .toDF("a", "b") // give the columns names for ease of use
df
  .groupBy('a)
  .count()
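Note that reduceByKey(_+_) in the question sums the values per key, whereas count() counts rows per key. If the sum is what is wanted, a DataFrame equivalent might look like the sketch below. It assumes Spark 2.x or later; the object name, the column names "key", "value" and "total", and the sample lines are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum

object GroupBySumSketch {

  // lines: tab-separated records with three fields; field 0 is the key
  // and field 2 is the integer to sum, mirroring the question's RDD code.
  def sumByKey(spark: SparkSession, lines: Seq[String]): DataFrame = {
    import spark.implicits._
    spark.sparkContext.parallelize(lines)
      .map(_.split("\t"))
      .map(f => (f(0), f(2).toInt))
      .toDF("key", "value")
      .groupBy("key")
      .agg(sum("value").as("total")) // DataFrame counterpart of reduceByKey(_ + _)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("group-by-sum")
      .getOrCreate()

    // Made-up input for illustration only.
    val lines = Seq("a\tx\t1", "a\ty\t3", "b\tz\t5")
    sumByKey(spark, lines).show()

    spark.stop()
  }
}
```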

How do I convert Array[Row] to RDD[Row]

I have a scenario where I want to convert the result of a dataframe, which is in the format Array[Row], to RDD[Row]. I have tried using parallelize, but I don't want to use it because it requires holding the entire data set on a single machine, which is not feasible on a production box.
val Bid = spark.sql("select Distinct DeviceId, ButtonName from stb").collect()
val bidrdd = sparkContext.parallelize(Bid)
How do I achieve this? I tried the approach given in this link (How to convert DataFrame to RDD in Scala?), but it didn't work for me.
val bidrdd1 = Bid.map(x => (x(0).toString, x(1).toString)).rdd
It gives an error: value rdd is not a member of Array[(String, String)]
The variable Bid which you've created here is not a DataFrame; it is an Array[Row], which is why you can't use .rdd on it. If you want to get an RDD[Row], simply call .rdd on the DataFrame (without calling collect):
val rdd = spark.sql("select Distinct DeviceId, ButtonName from stb").rdd
Your post contains some misconceptions worth noting:
"... a dataframe which is in the format Array[Row] ..."
Not quite. The Array[Row] is the result of collecting the data from the DataFrame into driver memory; it's not a DataFrame.
"... I don't want to use it as it needs to contain entire data in a single system ..."
Note that as soon as you call collect on the DataFrame, you've already pulled the entire data set into a single JVM's memory. So parallelize is not the issue; collect is.
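Putting this together, the (String, String) pairs the question was after can be built on the RDD itself, without collect or parallelize. The following is only a sketch, assuming Spark 2.x or later; the object name and the contents of the temporary stb view are invented for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object RowsToPairsSketch {

  // Map each Row to a pair of strings while the data stays distributed.
  // Note: get(i).toString throws a NullPointerException on null values.
  def pairs(df: DataFrame): RDD[(String, String)] =
    df.rdd.map(r => (r.get(0).toString, r.get(1).toString))

  // Registers a hypothetical stand-in for the stb table and runs the query.
  def distinctButtons(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("d1", "ok"), ("d1", "ok"), ("d2", "back"))
      .toDF("DeviceId", "ButtonName")
      .createOrReplaceTempView("stb")
    spark.sql("select distinct DeviceId, ButtonName from stb")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rows-to-pairs")
      .getOrCreate()

    pairs(distinctButtons(spark)).collect().foreach(println)

    spark.stop()
  }
}
```

The collect in main is there only to print the tiny sample; on real data the RDD would be passed on to further transformations instead.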

VectorAssembler does not support the StringType type scala spark convert

I have a dataframe that contains string columns and I am planning to use it as input for k-means using spark and scala. I am converting my string typed columns of the dataframe using the method below:
val toDouble = udf[Double, String]( _.toDouble)
val analysisData = dataframe_mysql.withColumn("Event", toDouble(dataframe_mysql("event"))).withColumn("Execution", toDouble(dataframe_mysql("execution"))).withColumn("Info", toDouble(dataframe_mysql("info")))
val assembler = new VectorAssembler()
.setInputCols(Array("execution", "event", "info"))
.setOutputCol("features")
val output = assembler.transform(analysisData)
println(output.select("features", "execution").first())
When I print the analysisData schema, the conversion is correct, but I am getting an exception: VectorAssembler does not support the StringType type
which means that my values are still strings! How can I convert the values and not only the schema type?
thanks
Indeed, the VectorAssembler Transformer does not take strings. You need to make sure that every input column is of a numeric, boolean, or vector type. Check that your udf is doing the right thing and that none of the input columns still has StringType.
To convert a column in a Spark DataFrame to another type, keep it simple and use the cast() DSL function (DoubleType comes from org.apache.spark.sql.types), like so:
val analysisData = dataframe_mysql.withColumn("Event", dataframe_mysql("Event").cast(DoubleType))
It should work!
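For completeness, here is a sketch that casts all three input columns and then runs the assembler. It assumes Spark 2.x or later with spark-mllib on the classpath; the object name and the one-row sample DataFrame are hypothetical, and only the column names come from the question.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

object CastThenAssembleSketch {
  val featureCols: Array[String] = Array("execution", "event", "info")

  // Replace each string column with its Double-typed version, keeping the name.
  def castToDouble(df: DataFrame): DataFrame =
    featureCols.foldLeft(df) { (acc, name) =>
      acc.withColumn(name, col(name).cast(DoubleType))
    }

  // Cast first, then assemble the now-numeric columns into one vector column.
  def assemble(df: DataFrame): DataFrame =
    new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
      .transform(castToDouble(df))

  // Made-up single row of string-typed values for illustration.
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("1", "2", "3")).toDF("execution", "event", "info")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cast-then-assemble")
      .getOrCreate()

    assemble(sampleDf(spark)).select("features", "execution").show()

    spark.stop()
  }
}
```

Values that cannot be parsed as numbers become null after cast(), so it may be worth checking for nulls before assembling.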

How to convert DataFrame to RDD in Scala?

Can someone please share how one can convert a dataframe to an RDD?
Simply:
val rows: RDD[Row] = df.rdd
Use df.rdd.map(row => ...) if you want to map each row to a different RDD element. (In Spark 1.x, df.map itself returned an RDD; since Spark 2.0 it returns a Dataset, so go through .rdd first.) For example
df.rdd.map(row => (row(0), row(1)))
gives you a paired RDD where the first column of the df is the key and the second column of the df is the value. Row indices are 0-based.
I was just looking for my answer and found this post.
Jean's answer is absolutely correct. Adding to that: "df.rdd" returns an RDD[Row]. I need to apply split() once I get the RDD. For that, we need to convert the RDD[Row] to an RDD[String]:
val opt=spark.sql("select tags from cvs").map(x=>x.toString()).rdd
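One caveat with x.toString(): as far as I know, calling toString on a Row wraps the values in square brackets, so the resulting strings would need trimming before split(). A possible alternative is to read the column with getString, as in the sketch below. It assumes Spark 2.x or later and a single string column; the object name and sample values are made up.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object RowToStringSketch {

  // Read the first column as a String instead of stringifying the whole Row.
  def tags(df: DataFrame): RDD[String] =
    df.rdd.map(_.getString(0))

  // Split each comma-separated tag string into its individual tags.
  def splitTags(df: DataFrame): RDD[Array[String]] =
    tags(df).map(_.split(","))

  // Hypothetical stand-in for "select tags from cvs".
  def sampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq("a,b", "c").toDF("tags")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-to-string")
      .getOrCreate()

    splitTags(sampleDf(spark)).collect().foreach(arr => println(arr.mkString(" | ")))

    spark.stop()
  }
}
```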