Pass RDD in scala function. Output Dataframe - scala

Say I have the CSV below, and many more like it:
val csv = sc.parallelize(Array(
"col1, col2, col3",
"1, cat, dog",
"2, bird, bee"))
I would like to apply the function below to the RDD to convert it to a dataframe with the desired logic. I keep running into the error: not found: value DataFrame
How can I correct this?
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
In most cases I would read CSV files directly as a DataFrame using Spark's core functionality, but I am unable to in this case.
Any/all help is appreciated.

In order not to get error: not found: value DataFrame, you must add the following import:
import org.apache.spark.sql.DataFrame
and your method declaration should be like this:
def udf(fName : RDD[String]): DataFrame = { ...
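For reference, here is a minimal sketch of what the body of such a function could look like. It assumes the first element of the RDD is a header row, that every column is kept as a string, and that an active SparkSession can be obtained with getOrCreate(); the splitting and trimming is only a guess at the desired logic:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
def udf(fName: RDD[String]): DataFrame = {
  val spark = SparkSession.builder().getOrCreate()
  // split each line on commas and trim the surrounding whitespace
  val split = fName.map(_.split(",").map(_.trim))
  // treat the first line as the header and build an all-string schema from it
  val header = split.first()
  val schema = StructType(header.map(name => StructField(name, StringType, nullable = true)))
  // drop the header line and wrap the remaining lines in Rows
  val rows = split.filter(!_.sameElements(header)).map(fields => Row.fromSeq(fields.toSeq))
  spark.createDataFrame(rows, schema)
}
Applied to the csv RDD from the question, udf(csv).show() should print the two data rows under the col1/col2/col3 header.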

Scala _* to select a list of dataframe columns

I have a dataframe and a list of columns like this:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.createDataFrame(Seq(("Java", "20000"), ("Python", "100000"))).toDF("language","users_count")
val data_columns = List("language","users_count").map(x=>col(s"$x"))
Why does this work:
df.select(data_columns:_ *).show()
But not this?
df.select($"language", data_columns:_*).show()
Gives the error:
error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
And how do I get it to work so that I can use _* to select all the columns in a list, while also specifying some other columns in the select?
Thanks!
Update:
Based on @chinayangyangyong's answer below, this is how I solved it:
df.select( $"language" +: data_columns :_*)
It is because there is no method on DataFrame with the signature select(col: Column, cols: Column*): DataFrame, but there is one with the signature select(cols: Column*): DataFrame, which is why your first example works.
Interestingly, your second example would work if you were using Strings to select the columns, since there is a method select(col: String, cols: String*): DataFrame. With a list of the column names as Strings rather than Columns, the head/tail pattern compiles:
df.select(data_columns.head, data_columns.tail:_*).show()

cast string column to decimal in scala dataframe

I have a dataframe (scala)
I am using both pyspark and scala in a notebook
#pyspark
spark.read.csv(output_path + '/dealer', header = True).createOrReplaceTempView('dealer_dl')
%scala
import org.apache.spark.sql.functions._
val df = spark.sql("select * from dealer_dl")
How do I convert a string column (amount) into a decimal in a Scala dataframe?
I tried as below.
%scala
df = df.withColumn("amount", $"amount".cast(DecimalType(9,2)))
But I am getting an error as below:
error: reassignment to val
I am used to pyspark and quite new to Scala. I need to do this in Scala to proceed further. Please let me know. Thanks.
In Scala you can't reassign a reference defined as a val; a val is an immutable reference. If you really need reassignment you can use a var, but the better solution is not to reassign to the same name and instead bind the result to a new val.
For example:
val dfWithDecimalAmount = df.withColumn("amount", $"amount".cast(DecimalType(9,2)))
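Note that the cast also needs DecimalType and the $ interpolator in scope; a self-contained version of the same line, assuming a SparkSession named spark and the df from the question, might look like:
import org.apache.spark.sql.types.DecimalType
import spark.implicits._  // brings the $"..." column syntax into scope
// bind the result to a new val instead of reassigning the df val
val dfWithDecimalAmount = df.withColumn("amount", $"amount".cast(DecimalType(9, 2)))
dfWithDecimalAmount.printSchema()  // amount should now be decimal(9,2)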

Combining two columns, casting two timestamp and selecting from df causes no error, but casting one column to timestamp and selecting causes error

Description
When I try to select a column that is cast to unix_timestamp and then to timestamp from a dataframe, there is an AnalysisException. See the link below.
However, when I combine two columns, cast the combination to unix_timestamp and then to timestamp, and select from the df, there is no error.
Disparate Cases
Error:
How to extract year from a date string?
No Error
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val spark: SparkSession = SparkSession.builder().
appName("myapp").master("local").getOrCreate()
case class Person(id: Int, date: String, time:String)
import spark.implicits._
val mydf: DataFrame = Seq(Person(1,"9/16/13", "11:11:11")).toDF()
//solution.show()
//column modificaton
val datecol: Column = mydf("date")
val timecol: Column = mydf("time")
val newcol: Column = unix_timestamp(concat(datecol,lit(" "),timecol),"MM/dd/yy").cast(TimestampType)
mydf.select(newcol).show()
Results
Expected:
An AnalysisException: can't find unix_timestamp(concat(....)) in mydf
Actual:
+------------------------------------------------------------------+
|CAST(unix_timestamp(concat(date, , time), MM/dd/yy) AS TIMESTAMP)|
+------------------------------------------------------------------+
| 2013-09-16 00:00:...|
These do not seem to be disparate cases. In the erroneous case, you had a new dataframe with a changed column name. See below:
val select_df: DataFrame = mydf.select(unix_timestamp(mydf("date"),"MM/dd/yy").cast(TimestampType))
select_df.select(year($"date")).show()
Here, the select_df dataframe has a changed column name: from date to something like cast(unix_timestamp(mydf("date"),"MM/dd/yy")) as Timestamp.
In the case mentioned above, on the other hand, you are just defining a new column when you write:
val newcol: Column = unix_timestamp(concat(datecol,lit(" "),timecol),"MM/dd/yy").cast(TimestampType)
You then use this column to select from your original dataframe, and so it gives the expected results.
Hope this makes things clearer.
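If the goal is to call year() on the derived dataframe, one option (a sketch reusing the imports and mydf from the code above; the alias name and the aliased_df val are only illustrative) is to alias the cast column so the derived dataframe keeps a predictable column name:
// alias the cast column so the derived dataframe keeps a simple name
val aliased_df: DataFrame = mydf.select(
  unix_timestamp(mydf("date"), "MM/dd/yy").cast(TimestampType).alias("date"))
// the column can now be referenced by name again
aliased_df.select(year($"date")).show()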

how to use createDataFrame to create a pyspark dataframe?

I know this is probably a stupid question. I have the following code:
from pyspark.sql import SparkSession
rows = [1,2,3]
df = SparkSession.createDataFrame(rows)
df.printSchema()
df.show()
But I got an error:
createDataFrame() missing 1 required positional argument: 'data'
I don't understand why this happens because I already supplied 'data', which is the variable rows.
Thanks
You have to create a SparkSession instance using the builder pattern and use it to create the dataframe. See:
https://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
spark= SparkSession.builder.getOrCreate()
Below are the steps to create a pyspark dataframe using createDataFrame.
Create a SparkSession:
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
Create the data and columns:
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
First approach: create the DataFrame from an RDD:
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd).toDF(*columns)
Second approach: create the dataframe directly from the data:
df2 = spark.createDataFrame(data).toDF(*columns)
Try
row = [(1,), (2,), (3,)]
?
If I am not wrong, createDataFrame() takes two arguments: the data and the column names (or schema). The data must be a list of tuples, where each tuple is a row of the dataframe.

Is it possible to go from an Array[Row] to a DataFrame

If I call collect on a DataFrame, I will get an Array[Row]. But I'm wondering if it is possible to go back to a DataFrame from that result, or from an Array[Row] in general.
For example:
rows = df.select("*").collect()
Is there some way to do something like this:
import df.sparkSession.implicits._
newDF = rows.toDF()
It is possible to provide a List[Row], as long as you also provide a schema. Then you can use SparkSession.createDataFrame:
def createDataFrame(rows: java.util.List[Row], schema: StructType): DataFrame
There is no variant of toDF that can be used here.
In general you should avoid collecting and converting result back to DataFrame.
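That said, a minimal sketch of the approach, assuming the original df is still in scope (for its schema and session); note that this variant of createDataFrame expects a java.util.List[Row]:
import scala.collection.JavaConverters._
// collect to the driver: Array[Row]
val rows = df.select("*").collect()
// rebuild a DataFrame from the rows, reusing the original schema
val newDF = df.sparkSession.createDataFrame(rows.toList.asJava, df.schema)
newDF.show()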