HDInsight Spark SQL table saveAsTable does not work - Scala

I want to show data from HDInsight Spark in Tableau. I was following this video, which describes how to connect the two systems and expose the data.
Currently my script is very simple, as shown below:
// csvLines is an RDD of strings, one per line in the CSV file
val csvLines = sc.textFile("wasb://mycontainer@mysparkstorage.blob.core.windows.net/*/*/*/mydata__000000.csv")
// Define a schema
case class MyData(Timestamp: String, TimezoneOffset: String, SystemGuid: String, TagName: String, NumericValue: Double, StringValue: String)
// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
  s => MyData(s(0),
    s(1),
    s(2),
    s(3),
    s(4).toDouble,
    s(5)
  )
).toDF()
// Register as a temporary table called "test_table"
myData.registerTempTable("test_table")
myData.saveAsTable("test_table")
Unfortunately I run into the following error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Table `test_table` already exists.;
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:209)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:198)
I have also tried the following code to overwrite the table if it exists:
import org.apache.spark.sql.SaveMode
myData.saveAsTable("test_table", SaveMode.Overwrite)
but it still gives me an error:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
java.lang.RuntimeException: Tables created with SQLContext must be TEMPORARY. Use a HiveContext instead.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.SparkStrategies$DDLStrategy$.apply(SparkStrategies.scala:416)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
Can someone please help me fix this issue?

I know it was my mistake, but I'll leave this as an answer since it was not readily available in any of the blogs or forum answers I came across. Hopefully it will help someone like me who is starting out with Spark.
I figured out that .toDF() actually creates the DataFrame against the SQLContext, not the HiveContext. So I have now updated my code as below:
// Map the values in the .csv file to the schema
val myData = csvLines.map(s => s.split(",")).filter(s => s(0) != "Timestamp").map(
  s => MyData(s(0),
    s(1),
    s(2),
    s(3),
    s(4).toDouble,
    s(5)
  )
)
// Create a HiveContext-based DataFrame and register it as "mydata_stored"
val myDataFrame = hiveContext.createDataFrame(myData)
myDataFrame.registerTempTable("mydata_stored")
myDataFrame.write.mode(SaveMode.Overwrite).saveAsTable("mydata_stored")
Also make sure that s(4) holds a proper double value, or add a try/catch to handle bad values. I did something like this:
def parseDouble(s: String): Double = try { s.toDouble } catch { case _: Throwable => 0.00 }
parseDouble(s(4))
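For reference, the hiveContext used above is the Hive-backed context rather than the plain sqlContext. A minimal sketch of creating one with the Spark 1.x API used in this answer (in some HDInsight shells and notebooks the pre-created sqlContext may already be a HiveContext) looks like this:
// Create a HiveContext so saveAsTable can create a persistent (non-temporary) table
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._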
Regards
Kiran

Related

How to create an encoder for type Iterator[org.apache.spark.sql.Row]

I am using Spark 2.4.4 in a Databricks notebook.
I have data in a DataFrame which I want to use to update records in a Postgres table.
I am following the approach given in this post: Spark Dataframes UPSERT to Postgres Table.
Here is my code:
import spark.implicits._
val update_query = s"""UPDATE scored_fact.f_learner_assessment_item_response_classifications_test SET is_deleted = ? where f.learner_assigned_item_classification_attempt_sk = ?::uuid AND f.root_org_partition= ?::int"""
changedSectionLearnerDF.coalesce(8).mapPartitions((d) => Iterator(d)).foreach { batch =>
  val dbc: Connection = DriverManager.getConnection(connectionUrl)
  val stmt: PreparedStatement = dbc.prepareStatement(update_query)
  batch.grouped(100).foreach { session =>
    session.foreach { row =>
      stmt.setBoolean( 0, row.getAs[Boolean]("is_deleted") )
      stmt.setString( 1, row.getAs[String]("learner_assigned_item_classification_attempt_sk"))
      stmt.setString( 2, row.getAs[String]("root_org_partition"))
      stmt.addBatch()
    }
    stmt.executeBatch()
  }
  dbc.close()
}
I am getting the error below:
Unable to find encoder for type Iterator[org.apache.spark.sql.Row]. An implicit Encoder[Iterator[org.apache.spark.sql.Row]] is needed to store Iterator[org.apache.spark.sql.Row] instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
changedSectionLearnerDF.coalesce(8).mapPartitions((d) => Iterator(d)).foreach { batch =>
I am sure I am missing something. How can I resolve this error by creating an encoder?
The signature of mapPartitions is
def mapPartitions[U](func: (Iterator[T]) ⇒ Iterator[U])(implicit arg0: Encoder[U]): Dataset[U]
So d in d => Iterator(d) is an Iterator[Row] and the function returns a Dataset[Iterator[Row]] which can't reasonably exist.
I think the mapPartitions call is simply wrong, and .mapPartitions((d) => Iterator(d)).foreach should be replaced by foreachPartition (as @shridharama's comment on the linked answer says).
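For completeness, here is a minimal sketch of that rewrite, reusing changedSectionLearnerDF, connectionUrl and update_query from the question (assumed to be in scope); note that JDBC PreparedStatement parameters are 1-based, unlike the 0-based indices in the question:
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.Row

// foreachPartition hands each partition to the function as an Iterator[Row], so no Encoder is needed
changedSectionLearnerDF.coalesce(8).foreachPartition((batch: Iterator[Row]) => {
  val dbc: Connection = DriverManager.getConnection(connectionUrl)
  val stmt: PreparedStatement = dbc.prepareStatement(update_query)
  batch.grouped(100).foreach { session =>
    session.foreach { row =>
      // PreparedStatement parameter indices start at 1
      stmt.setBoolean(1, row.getAs[Boolean]("is_deleted"))
      stmt.setString(2, row.getAs[String]("learner_assigned_item_classification_attempt_sk"))
      stmt.setString(3, row.getAs[String]("root_org_partition"))
      stmt.addBatch()
    }
    stmt.executeBatch()
  }
  dbc.close()
})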

Spark UDF with Maxmind Geo Data

I'm trying to use the MaxMind Snowplow library to pull out geo data for each IP that I have in a DataFrame.
We are using Spark SQL (Spark version 2.1.0) and I created a UDF in the following class:
class UdfDefinitions @Inject() extends Serializable with StrictLogging {
  sparkSession.sparkContext.addFile("s3n://s3-maxmind-db/latest/GeoIPCity.dat")
  val s3Config = configuration.databases.dataWarehouse.s3
  val lruCacheConst = 20000
  val ipLookups = IpLookups(geoFile = Some(SparkFiles.get(s3Config.geoIPFileName)),
    ispFile = None, orgFile = None, domainFile = None, memCache = false, lruCache = lruCacheConst)

  def lookupIP(ip: String): LookupIPResult = {
    val loc: Option[IpLocation] = ipLookups.getFile.performLookups(ip)._1
    loc match {
      case None => LookupIPResult("", "", "")
      case Some(x) => LookupIPResult(Option(x.countryName).getOrElse(""),
        x.city.getOrElse(""), x.regionName.getOrElse(""))
    }
  }

  val lookupIPUDF: UserDefinedFunction = udf(lookupIP _)
}
The intention is to create the pointer to the file (ipLookups) outside the UDF and use it inside, so as not to open the file on every row. This gives a "Task not serializable" error, and when we call addFile inside the UDF instead, we get a "too many open files" error (on a large dataset; on a small dataset it does work).
This thread shows how to solve the problem using RDDs, but we would like to use Spark SQL: using maxmind geoip in spark serialized
Any thoughts?
Thanks
The problem here is that IpLookups is not Serializable. Yet it makes the lookups from a static file (from what I gathered), so you should be able to fix that. I would advise that you clone the repo and make IpLookups Serializable. Then, to make it work with Spark SQL, wrap everything in a class as you did. Then, in the main Spark job, you can write something like the following:
val IPResolver = new MySerializableIpResolver()
val resolveIP = udf((ip : String) => IPResolver.resolve(ip))
data.withColumn("Result", resolveIP($"IP"))
If you do not have that many distinct IP addresses, there is another solution: you could do everything in the driver.
val ipMap = data.select("IP").distinct.collect
  .map(/* calls to the non-serializable IpLookups, but that's ok, we are in the driver */)
  .toMap
val resolveIP = udf((ip: String) => ipMap(ip))
data.withColumn("Result", resolveIP($"IP"))
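A slightly fuller sketch of that driver-side variant: localLookup below is a hypothetical helper that wraps the non-serializable IpLookups on the driver, and the resulting map is broadcast so executors only ever see plain strings:
// localLookup is a hypothetical driver-side wrapper around the non-serializable IpLookups
val distinctIPs = data.select("IP").distinct.collect().map(_.getString(0))
val ipMap: Map[String, String] = distinctIPs.map(ip => ip -> localLookup(ip)).toMap

// Broadcast the map so each executor gets one read-only copy
val ipMapBc = sparkSession.sparkContext.broadcast(ipMap)
val resolveIP = udf((ip: String) => ipMapBc.value.getOrElse(ip, ""))
data.withColumn("Result", resolveIP($"IP"))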

How to correctly handle Option in Spark/Scala?

I have a method, createDataFrame, which returns an Option[DataFrame]. I then want to 'get' the DataFrame and use it in later code. I'm getting a type mismatch that I can't fix:
val df2: DataFrame = createDataFrame("filename.txt") match {
  case Some(df) => // proceed with pipeline
    df.filter($"activityLabel" > 0)
  case None => println("could not create dataframe")
}
val Array(trainData, testData) = df2.randomSplit(Array(0.5, 0.5), seed = 12345)
I need df2 to be of type DataFrame, otherwise later code won't recognise df2 as a DataFrame, e.g. val Array(trainData, testData) = df2.randomSplit(Array(0.5,0.5),seed = 12345).
However, the case None branch is not of type DataFrame - it returns Unit - so the code won't compile. But if I don't declare the type of df2, the later code won't compile either, because df2 is not recognised as a DataFrame. If someone can suggest a fix that would be helpful - I've been going round in circles with this for some time. Thanks
What you need is a map. If you map over an Option[T] you are saying: "if it's None, do nothing; otherwise transform the content of the Option into something else". In your case that content is the DataFrame itself. So inside the myDFOpt.map() function you can put all your DataFrame transformations, and only at the very end do the pattern matching you did, where you may print something if you have a None.
edit:
val df2: DataFrame = createDataFrame("filename.txt").map(df=>{
val filteredDF=df.filter($"activityLabel" > 0)
val Array(trainData, testData) = filteredDF.randomSplit(Array(0.5,0.5),seed = 12345)})
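If the later code really does need a plain DataFrame for the randomSplit call, one common alternative (a sketch, not part of the original answer) is to fail fast when the Option is empty:
val df2: DataFrame = createDataFrame("filename.txt")
  .getOrElse(sys.error("could not create dataframe"))
val Array(trainData, testData) = df2.randomSplit(Array(0.5, 0.5), seed = 12345)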

How to read file names from column in DataFrame to process using SparkContext.textFile?

I'm new to Spark and I'm stuck with this issue:
From a DataFrame that I have created, called reportesBN, I want to get the value of a field so I can use it to read a text file from a specific path, and then apply a specific process to that file.
I have developed this code, but it's not working:
reportesBN.foreach {
  x =>
    val file = x(0)
    val insumo = sc.textFile(s"$file")
    val firstRow = insumo.first.split("\\|", -1)
    // Get values of next rows
    val nextRows = insumo.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
    val dfNextRows = nextRows.map(a => a.split("\\|")).map(x => BalanzaNextRows(x(0), x(1),
      x(2), x(3), x(4))).toDF()
    val validacionBalanza = new RevisionCampos(sc)
    validacionBalanza.validacionBalanza(firstRow, dfNextRows)
}
The error log indicates that it is because of serialization.
7/06/28 18:55:45 INFO SparkContext: Created broadcast 0 from textFile at ValidacionInsumos.scala:56
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
Is this problem caused by the SparkContext (sc) being used inside the foreach?
Is there another way to implement this?
Regards.
You asked a very similar question before, and it's the same issue - you cannot use SparkContext inside an RDD transformation or action. In this case, you use sc.textFile(s"$file") inside reportesBN.foreach, where reportesBN, as you said, is a DataFrame:
From a DataFrame that I have created; called reportesBN
You should rewrite your code to first take the file names out of the DataFrame and read the files afterwards.
// This replaces val file = x(0) from the original loop
// I assume that the column name is `files`
val files = reportesBN.select("files").as[String].collect
Once you have the collection of files to process, you execute the code in your block.
files.foreach {
x => ...
}
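Putting the pieces together, a minimal sketch (still assuming the column is called files, and reusing the processing steps from the question) might look like this:
val files = reportesBN.select("files").as[String].collect

files.foreach { file =>
  val insumo = sc.textFile(file)
  val firstRow = insumo.first.split("\\|", -1)
  // Drop the header row and process the rest, exactly as in the original code
  val nextRows = insumo.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
  // ... continue with BalanzaNextRows / RevisionCampos as in the question
}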

Spark: Populating Cassandra UDTValue as a DataFrame column

I am trying to create a Cassandra UDT from some columns of my DataFrame. I want to add this UDT column to the DataFrame and save it to a Cassandra table.
My code looks like:
val asUDT = udf((keys: Seq[String], values: Seq[String]) =>
  UDTValue.fromMap(keys.zip(values).filter {
    case (k, null) => false
    case _ => true
  }.toMap))

val keys = array(mapKeys.map(lit): _*)
val values = array(mapValues.map(col): _*)
return df.withColumn("targetColumn", asUDT(keys, values))
However, I am receiving the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Schema for type AnyRef is not supported.
Please let me know how I can save a UDT value as a column in my data frame.
Any pointers on how I can get this to work will be really helpful.
Thanks