How to convert dataframe with multiple columns
I can get an RDD[org.apache.spark.sql.Row], but I need something I can use with org.apache.spark.mllib.fpm.FPGrowth, i.e. an RDD[Array[String]].
How do I convert it?
df.head
org.apache.spark.sql.Row = [blabla,128323,23843,11.23,blabla,null,null,..]
df.printSchema
|-- source: string (nullable = true)
|-- b1: string (nullable = true)
|-- b2: string (nullable = true)
|-- b3: long (nullable = true)
|-- amount: decimal(30,2) (nullable = true)
and so on
Thanks
The question is vague, but in general you can go from an RDD of Row to an RDD of Array by passing through a Seq. The following code takes every column of each Row, converts it to a string, and returns the result as an array.
df.first
res1: org.apache.spark.sql.Row = [blah1,blah2]
df.map { _.toSeq.map {_.toString}.toArray }.first
res2: Array[String] = Array(blah1, blah2)
This alone may not be enough to get it working with MLlib the way you want, since you didn't give much detail, but it's a start.
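To take this one step closer to FPGrowth itself, here is a minimal sketch. It assumes df is the dataframe above; the minSupport and numPartitions values are illustrative only.

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// Build one transaction per Row: drop nulls, stringify, and de-duplicate the items,
// since FPGrowth requires the items of a transaction to be distinct.
// df.rdd is used so this also works on Spark 2.x, where Dataset.map needs an encoder.
val transactions: RDD[Array[String]] =
  df.rdd.map(_.toSeq.filter(_ != null).map(_.toString).distinct.toArray)

val model = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(4)
  .run(transactions)

model.freqItemsets.take(10).foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}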
Related
I am very new to Scala and I have the following issue.
I have a spark dataframe with the following schema:
df.printSchema()
root
|-- word: string (nullable = true)
|-- vector: array (nullable = true)
| |-- element: string (containsNull = true)
I need to convert this to the following schema:
root
|-- word: string (nullable = true)
|-- vector: array (nullable = true)
| |-- element: double (containsNull = true)
I do not want to specify the schema beforehand, but instead change the existing one.
I have tried the following
df.withColumn("vector", col("vector").cast("array<element: double>"))
I have also tried converting it to an RDD, using map to change the elements, and then turning it back into a dataframe, but I end up with the data type Array[WrappedArray] and I am not sure how to handle it.
Using pyspark and numpy, I could do this by df.select("vector").rdd.map(lambda x: numpy.asarray(x)).
Any help would be greatly appreciated.
You're close. Try this code:
val df2 = df.withColumn("vector", col("vector").cast("array<double>"))
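As a self-contained sketch (the sample data is made up), the cast and the resulting schema look like this:

import spark.implicits._
import org.apache.spark.sql.functions.col

// Hypothetical input: a word column plus an array<string> vector column.
val sample = Seq(
  ("alpha", Seq("1.0", "2.5", "3.0")),
  ("beta",  Seq("0.1", "0.2", "0.3"))
).toDF("word", "vector")

// Casting to array<double> converts each element; non-numeric strings become null.
val casted = sample.withColumn("vector", col("vector").cast("array<double>"))

casted.printSchema()
// root
//  |-- word: string (nullable = true)
//  |-- vector: array (nullable = true)
//  |    |-- element: double (containsNull = true)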
I've written a Scala function that converts a time string (HH:mm:ss.SSS) to seconds. It first discards the milliseconds, keeping only HH:mm:ss, and then converts that to seconds (Int). It works fine when tested in spark-shell.
def hoursToSeconds(a: Any): Int = {
  // Drop the milliseconds part, keeping only HH:mm:ss.
  val sec = a.toString.split('.')
  val fields = sec(0).split(':')
  fields(0).toInt * 3600 + fields(1).toInt * 60 + fields(2).toInt
}
print(hoursToSeconds("03:51:21.2550000"))
13881
I need to apply this function to one of the dataframe columns (Running). I was trying to do that with the withColumn method, but I get the error Type mismatch, expected: Column, actual: String. Any help would be appreciated. Is there a way I can wrap the Scala function in a udf and then use that udf in df.withColumn?
df.printSchema
root
|-- vin: string (nullable = true)
|-- BeginOfDay: string (nullable = true)
|-- Timezone: string (nullable = true)
|-- Version: timestamp (nullable = true)
|-- Running: string (nullable = true)
|-- Idling: string (nullable = true)
|-- Stopped: string (nullable = true)
|-- dlLoadDate: string (nullable = false)
sample running column values.
df.withColumn("running", hoursToSeconds(df("Running")))
You can create a udf for the hoursToSeconds function by using the following syntax:
val hoursToSecUdf = udf(hoursToSeconds _)
Then, to use it on a particular column, the following syntax can be used:
df.withColumn("TimeInSeconds",hoursToSecUdf(col("running")))
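For a fuller picture, a sketch with the imports it needs. It assumes Running always holds a non-null HH:mm:ss.SSS string, and narrows the parameter from Any to String (the column's actual type):

import org.apache.spark.sql.functions.{col, udf}

def hoursToSeconds(time: String): Int = {
  // Drop the milliseconds, then split HH:mm:ss into its fields.
  val fields = time.split('.')(0).split(':')
  fields(0).toInt * 3600 + fields(1).toInt * 60 + fields(2).toInt
}

val hoursToSecUdf = udf(hoursToSeconds _)

// Overwrite the existing column, or use a new name such as TimeInSeconds.
val result = df.withColumn("Running", hoursToSecUdf(col("Running")))
result.select("vin", "Running").show(5)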
There are probably at least 10 questions very similar to this, but I still have not found a clear answer.
How can I add a nullable string column to a DataFrame using Scala? I was able to add a column with null values, but its DataType shows as null.
val testDF = myDF.withColumn("newcolumn", when(col("UID") =!= "not", null).otherwise(null))
However, the schema shows
root
|-- UID: string (nullable = true)
|-- IsPartnerInd: string (nullable = true)
|-- newcolumn: null (nullable = true)
I want the new column to be a string: |-- newcolumn: string (nullable = true)
Please don't mark this as a duplicate unless it's really the same question, and in Scala.
Just explicitly cast the null literal to StringType.
scala> val testDF = myDF.withColumn("newcolumn", when(col("UID") =!= "not", lit(null).cast(StringType)).otherwise(lit(null).cast(StringType)))
scala> testDF.printSchema
root
|-- UID: string (nullable = true)
|-- newcolumn: string (nullable = true)
Why do you want a column that is always null? There are several ways to do this; I would prefer the solution with typedLit:
myDF.withColumn("newcolumn", typedLit[String](null))
or for older Spark versions:
myDF.withColumn("newcolumn",lit(null).cast(StringType))
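Putting either variant together with the imports it needs (myDF as in the question), a minimal sketch:

import org.apache.spark.sql.functions.{lit, typedLit}
import org.apache.spark.sql.types.StringType

// Both variants produce a string-typed column that is null for every row.
val withTypedLit = myDF.withColumn("newcolumn", typedLit[String](null))
val withCast     = myDF.withColumn("newcolumn", lit(null).cast(StringType))

withCast.printSchema()
// root
//  |-- UID: string (nullable = true)
//  |-- IsPartnerInd: string (nullable = true)
//  |-- newcolumn: string (nullable = true)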
When I retrieve a dataset in Spark 2 using a select statement, the underlying columns inherit the data types of the queried columns.
val ds1 = spark.sql("select 1 as a, 2 as b, 'abd' as c")
ds1.printSchema()
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- c: string (nullable = false)
Now if I convert this into a case class, it will correctly convert the values, but the underlying schema is still wrong.
case class abc(a: Double, b: Double, c: String)
val ds2 = ds1.as[abc]
ds2.printSchema()
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- c: string (nullable = false)
ds2.collect
res18: Array[abc] = Array(abc(1.0,2.0,abd))
I "SHOULD" be able to specify the encoder to use when I create the second dataset, but scala seems to ignore this parameter (Is this a BUG?):
val abc_enc = org.apache.spark.sql.Encoders.product[abc]
val ds2 = ds1.as[abc](abc_enc)
ds2.printSchema
root
|-- a: integer (nullable = false)
|-- b: integer (nullable = false)
|-- c: string (nullable = false)
So the only way I can see to do this simply, without very complex mapping, is to use createDataset, but that requires a collect on the underlying object, so it's not ideal.
val ds2 = spark.createDataset(ds1.as[abc].collect)
This is an open issue in the Spark API (see ticket SPARK-17694).
So what you need to do is add an extra explicit conversion. Something like this should work:
ds1.as[abc].map(x => x : abc)
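A sketch of that round trip in spark-shell style, reusing the question's case class. The identity map re-serializes each record through the Encoder for abc, so the output schema comes from the case class rather than from the original query:

import spark.implicits._

case class abc(a: Double, b: Double, c: String)

val ds1 = spark.sql("select 1 as a, 2 as b, 'abd' as c")
val ds3 = ds1.as[abc].map(x => x: abc)

ds3.printSchema()
// root
//  |-- a: double (nullable = false)
//  |-- b: double (nullable = false)
//  |-- c: string (nullable = true)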
You can simply use the cast method on the columns:
import sqlContext.implicits._
import org.apache.spark.sql.types.DoubleType
val ds2 = ds1.select($"a".cast(DoubleType), $"b".cast(DoubleType), $"c")
ds2.printSchema()
You should get:
root
|-- a: double (nullable = false)
|-- b: double (nullable = false)
|-- c: string (nullable = false)
You could also cast the columns while selecting with a SQL query, as below:
import spark.implicits._
val ds = Seq((1,2,"abc"),(1,2,"abc")).toDF("a", "b","c").createOrReplaceTempView("temp")
val ds1 = spark.sql("select cast(a as Double) , cast (b as Double), c from temp")
ds1.printSchema()
This gives the schema:
root
|-- a: double (nullable = false)
|-- b: double (nullable = false)
|-- c: string (nullable = true)
Now you can convert it to a Dataset with the case class:
case class abc(a: Double, b: Double, c: String)
val ds2 = ds1.as[abc]
ds2.printSchema()
This now has the required schema:
root
|-- a: double (nullable = false)
|-- b: double (nullable = false)
|-- c: string (nullable = true)
Hope this helps!
OK, I think I've resolved this in a better way.
Instead of using a collect when we create a new dataset, we can just reference the rdd of the dataset.
So instead of
val ds2 = spark.createDataset(ds1.as[abc].collect)
We use:
val ds2 = spark.createDataset(ds1.as[abc].rdd)
ds2.printSchema
root
|-- a: double (nullable = false)
|-- b: double (nullable = false)
|-- c: string (nullable = true)
This keeps the lazy evaluation intact, but allows the new dataset to use the Encoder for the abc case class, and the subsequent schema will reflect this when we use it to create a new table.
I am trying to drop the duplicate columns after a join, retaining the unique columns and only one column from each duplicated pair.
For Example:
Duplicate DataFrame
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- loc: string (nullable = true)
|-- sal: string (nullable = true)
|-- name: string (nullable = true)
|-- loc: string (nullable = true)
|-- sal: string (nullable = true)
After removing duplicates, the output should be
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- loc: string (nullable = true)
|-- sal: string (nullable = true)
Any help will be appreciated.
As Shaido already commented above, you should drop all the columns that are not used in the join before joining, since it is difficult to do afterwards. For example, if loc and sal are not used in the join:
df2.drop("loc", "sal")
or
df1.drop("loc", "sal")
If you are joining on column names (for example id and name), then pass them as a Seq:
df1.join(df2, Seq("id", "name"))
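A small sketch with made-up values, combining both steps, and the schema it produces:

import spark.implicits._

val df1 = Seq(("1", "alice", "ny", "100")).toDF("id", "name", "loc", "sal")
val df2 = Seq(("1", "alice", "sf", "200")).toDF("id", "name", "loc", "sal")

// Joining on a Seq of names keeps a single id and name column; dropping
// loc and sal from one side beforehand avoids the remaining duplicates.
val joined = df1.join(df2.drop("loc", "sal"), Seq("id", "name"))

joined.printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- name: string (nullable = true)
//  |-- loc: string (nullable = true)
//  |-- sal: string (nullable = true)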
I believe that if you go for a generic approach, the code below may help you. Here you do not need to mention the duplicate column names.
First, create an implicit class (a better design approach):
import scala.annotation.tailrec
import org.apache.spark.sql.DataFrame

implicit class DataFrameOperations(df: DataFrame) {
  def dropDuplicateCols(rmvDF: DataFrame): DataFrame = {
    // Column names that occur more than once after the join.
    val cols = df.columns.groupBy(identity).mapValues(_.size).filter(_._2 > 1).keySet.toSeq

    // Drop each duplicated column, resolving it against rmvDF so only that side's copy is removed.
    @tailrec def deleteCol(df: DataFrame, cols: Seq[String]): DataFrame = {
      if (cols.isEmpty) df else deleteCol(df.drop(rmvDF(cols.head)), cols.tail)
    }

    deleteCol(df, cols)
  }
}
To call the method, you can use the following:
val dupDF = rdd1.join(rdd2,"id").dropDuplicateCols(rdd1)
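A quick sketch with made-up data (df1 and df2 stand in for rdd1 and rdd2) to show the result:

import spark.implicits._

val df1 = Seq(("1", "alice", "ny", "100")).toDF("id", "name", "loc", "sal")
val df2 = Seq(("1", "alice", "sf", "200")).toDF("id", "name", "loc", "sal")

// The join keeps one id column; name, loc and sal appear twice until
// dropDuplicateCols removes the copies that come from df1.
val dedupDF = df1.join(df2, "id").dropDuplicateCols(df1)

dedupDF.printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- name: string (nullable = true)
//  |-- loc: string (nullable = true)
//  |-- sal: string (nullable = true)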
// For example
val dataFrame = sparkSession.sql("SELECT .....")
dataFrame.distinct() // since 2.0.0
// or
dataFrame.dropDuplicates()
// or
dataFrame.dropDuplicates(colNames)