Cannot access Spark dataframe methods - scala

In Zeppelin I am using a dataframe created in another paragraph. I display the type of my df variable and get:
res35: String = DataFrame
suggesting it is a DataFrame. But when I try to use select on the df variable I get an error:
<console>:62: error: value select is not a member of Object
Do I have to convert the Object to a DataFrame or something? Can someone tell me what I am missing? TIA!
My code is:
val df = z.get("wds")
df.getClass.getSimpleName
df.select(explode($"filtered").as("value")).groupBy("value").count.show
This gives the following (edited) output:
df: Object = [racist: boolean, contributors:
string, coordinates: string, ...n: Int = 20
res35: String = DataFrame
<console>:62: error: value select is not a member of Object
df.select(explode($"filtered").as("value")).groupBy("value").count.show

Seems I was missing
.asInstanceOf[DataFrame]
i.e.
import org.apache.spark.sql.DataFrame
val df = z.get("wds").asInstanceOf[DataFrame]
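Putting it together, a minimal sketch of the corrected Zeppelin paragraph (assuming the other paragraph stored wds as a DataFrame, and that the SQL implicits for the $ column syntax are already in scope, as in the original code):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.explode

// z.get returns Object, so cast back to the concrete type that was stored
val df = z.get("wds").asInstanceOf[DataFrame]

// select now resolves, because the static type is DataFrame rather than Object
df.select(explode($"filtered").as("value")).groupBy("value").count.show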

Related

Find columns to select, for spark.read(), from another Dataset - Spark Scala

I have a Dataset[Year] that has the following schema:
case class Year(day: Int, month: Int, Year: Int)
Is there any way to make a collection of the current schema?
I have tried:
println("Print -> "+ds.collect().toList)
But the result were:
Print -> List([01,01,2022], [31,01,2022])
I expected something like:
Print -> List(Year(01,01,2022), Year(31,01,2022))
I know I could adjust it with a map, but I am trying to create a generic method that accepts any schema, so I cannot add a map that does the conversion.
That is my method:
class SchemeList[A] {
  def set[A](ds: Dataset[A]): List[A] = {
    ds.collect().toList
  }
}
The method appears to have the correct return signature, but when the job runs it throws an error:
val setYears = new SchemeList[Year]
val YearList: List[Year] = setYears.set(df)
Exception in thread "main" java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to schemas.Schemas$Year
Based on your additional information in your comment:
I need this list to use as variables when creating another dataframe via JDBC (I need to make a specific select within PostgreSQL). Is there a more performant way to pass values from a dataframe as parameters in a select?
Given your initial dataset:
val yearsDS: Dataset[Year] = ???
and that you want to do something like:
val desiredColumns: Array[String] = ???
spark.read.jdbc(..).select(desiredColumns.head, desiredColumns.tail: _*)
You could find the column names of yearsDS by doing:
val desiredColumns: Array[String] = yearsDS.columns
Spark achieves this by using def schema, which is defined on Dataset.
You can see the definition of def columns in the Dataset API; it simply maps over the schema's field names.
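For illustration, a minimal sketch of that approach; the JDBC URL, table name, and connection properties below are placeholders, not values from the question:
import java.util.Properties
import org.apache.spark.sql.{DataFrame, Dataset}

// take the column names from the typed dataset's schema
val desiredColumns: Array[String] = yearsDS.columns

// hypothetical PostgreSQL source; replace url/table/props with real values
val jdbcDF: DataFrame = spark.read
  .jdbc("jdbc:postgresql://host:5432/db", "some_table", new Properties())
  .select(desiredColumns.head, desiredColumns.tail: _*)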
Maybe you got a DataFrame, not a Dataset.
Try to use as to transform the DataFrame to a Dataset, like this:
val year = Year(1,1,1)
val years = Array(year,year).toList
import spark.implicits._
val df = spark.sparkContext
  .parallelize(years)
  .toDF("day", "month", "Year")
  .as[Year]
println(df.collect().toList)

Getting error while trying to add a java date as literal in spark dataFrame

I have defined a variable like this in my Scala notebook.
import java.time.{LocalDate, LocalDateTime, ZoneId, ZoneOffset, Duration}
val fiscalYearStartDate = LocalDate.of(fiscalStartYear,7,1);
I would like to add this as column to my dataFrame.
SomeDF.lit(fiscalYearStartDate ).cast("date").as("fiscalYearStartDate")
This is throwing an error.
java.lang.RuntimeException: Unsupported literal type class java.time.LocalDate 2020-10-01
The Spark SQL DateType equivalent in Scala is java.sql.Date, so the solution could be one of:
val finalDF = SomeDF.withColumn("fiscalYearStartDate", lit(fiscalYearStartDate.toString).cast("Date"))
or
val finalDF = SomeDF.withColumn("fiscalYearStartDate", lit(fiscalYearStartDate.format(DateTimeFormatter.ofPattern("yyyy-MM-dd")).cast("Date"))
or
import java.sql.Date
val finalDF = SomeDF.withColumn("fiscalYearStartDate", lit(Date.valueOf(fiscalYearStartDate)))

unable to store row elements of a dataset, via mapPartitions(), in variables

I am trying to create a Spark Dataset and then, using mapPartitions, access each of its elements and store them in variables. I am using the piece of code below for this:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val df = spark.sql("select col1,col2,col3 from table limit 10")
val schema = StructType(Seq(
  StructField("col1", StringType),
  StructField("col2", StringType),
  StructField("col3", StringType)))
val encoder = RowEncoder(schema)
df.mapPartitions { iterator =>
  val myList = iterator.toList
  myList.map { x =>
    val value1 = x.getString(0)
    val value2 = x.getString(1)
    val value3 = x.getString(2)
  }.iterator
}(encoder)
The error I am getting against this code is:
<console>:39: error: type mismatch;
found : org.apache.spark.sql.catalyst.encoders.ExpressionEncoder[org.apache.spark.sql.Row]
required: org.apache.spark.sql.Encoder[Unit]
val value3 = x.getString(2)}).iterator}} (encoder)
Eventually, I want to store the row elements in variables and perform some operations with them. Not sure what I am missing here. Any help would be highly appreciated!
Actually, there are several problems with your code:
Your map statement has no return value, therefore Unit.
If you return a tuple of Strings from mapPartitions, you don't need a RowEncoder (because you don't return a Row, but a Tuple3, which does not need an encoder because it's a Product).
You can write your code like this:
df
  .mapPartitions { itr => itr.map(x => (x.getString(0), x.getString(1), x.getString(2))) }
  .toDF("col1", "col2", "col3") // convert the Dataset back to a DataFrame and set the desired field names
But you could just use a simple select statement in the DataFrame API; there is no need for mapPartitions here:
df
.select($"col1",$"col2",$"col3")

Convert HadoopRDD to DataFrame

In EMR Spark, I have a HadoopRDD
org.apache.spark.rdd.RDD[(org.apache.hadoop.io.Text, org.apache.hadoop.dynamodb.DynamoDBItemWritable)] = HadoopRDD[0] at hadoopRDD
I want to convert this to DataFrame org.apache.spark.sql.DataFrame.
Does anyone know how to do this?
First convert it to simple types. Let's say your DynamoDBItemWritable has just one string column:
val simple: RDD[(String, String)] = rdd.map {
  case (text, dbwritable) => (text.toString, dbwritable.getString(0))
}
Then you can use toDF to get a DataFrame:
import sqlContext.implicits._
val df: DataFrame = simple.toDF()
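If you want meaningful column names instead of the default _1 and _2, toDF also accepts them (the names below are arbitrary):
val named: DataFrame = simple.toDF("key", "item")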

Return Temporary Spark SQL Table in Scala

First I convert a CSV file to a Spark DataFrame using
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/usr/people.csv")
After that, typing df and hitting return, I can see
res30: org.apache.spark.sql.DataFrame = [name: string, age: string, gender: string, deptID: string, salary: string]
Then I use df.registerTempTable("people") to convert df to a Spark SQL table.
But after that, when I enter people, instead of getting the table I get
<console>:33: error: not found: value people
Is it because people is a temporary table?
Thanks
When you register a temp table using the registerTempTable command you used, it will be available inside your SQLContext.
This means that the following is incorrect and will give you the error you are getting:
scala> people.show
<console>:33: error: not found: value people
To use the temp table, you'll need to call it through your sqlContext. Example:
scala> sqlContext.sql("select * from people")
Note: df.registerTempTable("df") will register a temporary table with the name df corresponding to the DataFrame df you apply the method on.
So persisting df won't persist the table but the DataFrame, even though the SQLContext will be using that DataFrame.
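End to end, a minimal sketch of the flow described above, reusing the CSV load from the question:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/usr/people.csv")

df.registerTempTable("people")

// the temp table is only reachable through the SQLContext it was registered in
sqlContext.sql("select * from people").show()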
The above answer is right for Zeppelin too. If you want to run println to see data, you have to send it back to the driver to see output.
val querystrings = sqlContext.sql("""
  select visitorDMA, visitorIpAddress, visitorState, allRequestKV
  from {redacted}
  limit 1000""")
querystrings.collect.foreach(entry => {
  print(entry.getString(3).toString() + "\n")
})
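A lighter-weight alternative, if the goal is just to inspect a few rows on the driver, is show(), which formats and truncates the output for you:
querystrings.select("allRequestKV").show(10, false)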