The code is shown below:
val sparkSession = DataExtract.getSparkSession("testName")
sparkSession.sql("use hive")
val df1 = sparkSession.sql("SELECT * from test1 ")
df1.createOrReplaceTempView("test1")
df1.createOrReplaceTempView("test1")
val df = sparkSession.sql("SELECT * from test1 ")
df.createOrReplaceTempView("test1")
When I execute this code, the first call to createOrReplaceTempView succeeds, and calling createOrReplaceTempView again on the same DataFrame also succeeds. But when I use another DataFrame to call createOrReplaceTempView with the same name, it reports an error. The error is as follows:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Recursive view `test1` detected (cycle: `test1` -> `test1`)
at org.apache.spark.sql.errors.QueryCompilationErrors$.recursiveViewDetectedError(QueryCompilationErrors.scala:2045)
at org.apache.spark.sql.execution.command.ViewHelper$.checkCyclicViewReference(views.scala:515)
at org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2(views.scala:522)
at org.apache.spark.sql.execution.command.ViewHelper$.$anonfun$checkCyclicViewReference$2$adapted(views.scala:522)
at scala.collection.Iterator.foreach(Iterator.scala:944)
at scala.collection.Iterator.foreach$(Iterator.scala:944)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1432)
This problem has troubled me for a long time. There was no such problem in the version of Spark before 3.1.1; it suddenly appeared after the upgrade. I hope someone can help me. I'm very grateful. Thank you.
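For context, Spark 3.1 changed temporary views to store their SQL text rather than the analyzed plan, so re-creating a view from a DataFrame that was itself built by querying that view now trips the recursive-view check. Below is a minimal sketch of two possible workarounds, assuming the goal is simply to refresh the view's contents; the second relies on the legacy flag mentioned in the Spark 3.1 migration guide, and test1_v2 is just an illustrative name.
// Workaround 1: avoid the self-reference by registering under a new name.
val df = sparkSession.sql("SELECT * FROM test1")
df.createOrReplaceTempView("test1_v2")

// Workaround 2 (per the Spark 3.1 migration guide): restore the pre-3.1
// behaviour of storing the analyzed plan for temporary views.
sparkSession.conf.set("spark.sql.legacy.storeAnalyzedPlanForView", "true")
df.createOrReplaceTempView("test1")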
In my implemented code I get the following error:
error: not found: value transform
.withColumn("min_date", array_min(transform('min_date,
^
I have been unable to resolve this. I already have the following import statements:
import sqlContext.implicits._
import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions._
I'm using Apache Zeppelin to execute this.
Here is the full code for reference and the sample of the dataset I'm using:
1004,bb5469c5|2021-09-19 01:25:30,4f0d-bb6f-43cf552b9bc6|2021-09-25 05:12:32,1954f0f|2021-09-19 01:27:45,4395766ae|2021-09-19 01:29:13,
1018,36ba7a7|2021-09-19 01:33:00,
1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59,77a90a1c97b|2021-09-19 01:34:53,
1022,3623fe40|2021-09-19 01:33:00,
1028,6c77d26c-6fb86|2021-09-19 01:50:50,f0ac93b3df|2021-09-19 01:51:11,
1032,ac55-4be82f28d|2021-09-19 01:54:20,82229689e9da|2021-09-23 01:19:47,
val users = sc.textFile("path to file").map(x=>x.replaceAll("\\(","")).map(x=>x.replaceAll("\\)","")).map(x=>x.replaceFirst(",","*")).toDF("column")
val tempDF = users.withColumn("_tmp", split($"column", "\\*")).select(
$"_tmp".getItem(0).as("col1"),
$"_tmp".getItem(1).as("col2")
)
val output = tempDF.withColumn("min_date", split('col2 , ","))
.withColumn("min_date", array_min(transform('min_date,
c => to_timestamp(regexp_extract(c, "\\|(.*)$", 1)))))
.show(10,false)
There is no method in functions with the signature transform(c: Column, fn: Column => Column) before Spark 3.x, so either your cluster is running an older version or you're importing the wrong object or trying to do something else.
You are probably using a version of Spark older than 3.x, where this Scala DataFrame API transform does not exist. With Spark 3.x your code works fine.
I should note that I could not get it to work with 2.4 either. Not enough time to dig further, but have a look here: Higher Order functions in Spark SQL.
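If the cluster is actually on Spark 2.4, the same higher-order function is still reachable through the SQL expression syntax (available since 2.4) instead of the Scala Column API. A minimal sketch, reusing the tempDF and col2 names from the question; note the doubled backslashes, escaped once for Scala and once for the SQL string literal.
import org.apache.spark.sql.functions.{col, expr, split}

// Spark 2.4+: call the SQL higher-order function `transform` via expr(),
// since the Column-based overload only exists in the 3.x Scala API.
val output = tempDF
  .withColumn("min_date", split(col("col2"), ","))
  .withColumn("min_date",
    expr("array_min(transform(min_date, c -> to_timestamp(regexp_extract(c, '\\\\|(.*)$', 1))))"))
output.show(10, false)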
I want to test a method we have that is formatted something like this:
def extractTable( spark: SparkSession, /* unrelated other parameters */ ): DataFrame = {
// Code before that I want to test
val df = spark.read
.format("jdbc")
.option("url", "URL")
.option("driver", "<Driver>")
.option("fetchsize", "1000")
.option("dbtable", "select * from whatever")
.load()
// Code after that I want to test
}
And I am trying to make stubs of the spark object, and the DataFrameReader objects that the read and option methods return:
val sparkStub = stub[ SparkSession ]
val dataFrameReaderStub = stub[ DataFrameReader ]
( dataFrameReaderStub.format _).when(*).returning( dataFrameReaderStub ) // Works
( dataFrameReaderStub.option _).when(*, *).returning( dataFrameReaderStub ) // Error
( dataFrameReaderStub.load _).when(*).returning( ??? ) // Return a dataframe // Error
( sparkStub.read _).when().returning( dataFrameReaderStub )
But I am getting an error on dataFrameReaderStub.option and dataFrameReaderStub.load that says "Cannot resolve symbol option" and "Cannot resolve symbol load". But these methods definitely exist on the object that spark.read returns.
How can I resolve this error, or is there a better way to mock/test the code I have?
I would suggest you look at this library for testing Spark code: https://github.com/holdenk/spark-testing-base
Mix in this with your test suite: https://github.com/holdenk/spark-testing-base/wiki/SharedSparkContext
...or alternatively, spin up your own SparkSession with a local[2] master and load the test data from csv/parquet/json.
Mocking Spark classes will be quite painful and probably not a success. I am speaking from experience here, both working for a long time with Spark, and maintaining ScalaMock as a library.
You are better off using Spark in your tests, but not against the real datasources.
Instead, load the test data from csv/parquet/json, or programmatically generate it (if it contains timestamps and such).
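A minimal sketch of that second approach, assuming ScalaTest and a small CSV fixture at src/test/resources/whatever.csv (both the test framework and the fixture path are assumptions, not from the original post). Because extractTable already takes the SparkSession as a parameter, the test can hand it a local session instead of a stub:
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class ExtractTableSpec extends AnyFunSuite {

  // A local SparkSession is cheap enough for unit tests; no cluster needed.
  private lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("extractTable-test")
    .getOrCreate()

  test("the logic around the jdbc load behaves as expected") {
    // Instead of the real JDBC source, read a small CSV fixture committed
    // with the tests and feed it to the code under test.
    val input = spark.read
      .option("header", "true")
      .csv("src/test/resources/whatever.csv") // hypothetical fixture path

    // Exercise the "code before" / "code after" here; a useful refactor is to
    // move that logic into a function that takes a DataFrame so it can be
    // tested without touching the JDBC reader at all.
    assert(input.columns.nonEmpty)
  }
}
The design point is the same as above: keep the JDBC read itself thin, and test the surrounding transformation logic against data Spark can load locally.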
I am trying to encrypt a column in my CSV file using a UDF, but I am getting a compilation error. Here is my code:
import org.apache.spark.sql.functions.{col, udf}
val upperUDF1 = udf { str: String => Encryptor.aes(str) }
val rawDF = spark
.read
.format("csv")
.option("header", "true")
.load(inputPath)
rawDF.withColumn("id", upperUDF1("id")).show() //Compilation error.
I am getting the compilation error on the last line. Am I using incorrect syntax? Thanks in advance.
You should pass a Column, not a String. You can reference a column with either of these syntaxes:
$"<columnName>"
col("<columnName>")
So you should try this:
rawDF.withColumn("id", upperUDF1($"id")).show()
or this:
rawDF.withColumn("id", upperUDF1(col("id"))).show()
Personally I like the dollar syntax the most; it seems more elegant to me.
In addition to the answer from SCouto, you could also register your UDF as a Spark SQL function with
spark.udf.register("upperUDF2", upperUDF1)
Your subsequent select expression could then look like this:
rawDF.selectExpr("id", "upperUDF2(id)").show()
I am trying to read a CSV file in Scala using a Dataset, and after that I am performing some operations, but my code is throwing an error.
Below is my code:
final case class AadharData(date:String,
registrar:String,
agency:String,
state:String,
district:String,
subDistrict:String,
pinCode:Int,
gender:String,
age:Int,
aadharGenerated:Int,
rejected:Int,
mobileNo:Double,
email:String)
val spark = SparkSession.builder().appName("GDP").master("local").getOrCreate()
import spark.implicits._
val a = spark.read.option("header", false).csv("D:\\BGH\\Spark\\aadhaar_data.csv").as[AadharData]
val b = a.map(rec=>{
(rec.registrar,1)
}).groupByKey(f=>f._1).collect()
And I am getting below error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`date`' given input columns: [_c0, _c2, _c1, _c3, _c5, _c8, _c9, _c7, _c6, _c11, _c12, _c10, _c4];
Any help is appreciated.
Thanks in advance.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'date' given input columns: [_c0, _c2, _c1, _c3, _c5, _c8, _c9, _c7, _c6, _c11, _c12, _c10, _c4];
The above error occurs because you set the header option to false (.option("header", false)), so Spark generates column names such as _c0, _c1 and so on. But when casting the generated DataFrame to the case class, you used column names different from the ones already generated, hence the error.
Solution
You should tell Spark SQL to use the names from the case class, and also tell it to infer the schema, as shown below:
val columnNames = classOf[AadharData].getDeclaredFields.map(x => x.getName)
val a = spark.read.option("header", false).option("inferSchema", true)
.csv("D:\\BGH\\Spark\\aadhaar_data.csv").toDF(columnNames:_*).as[AadharData]
The above error should go away.
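As a hedged alternative (not part of the original answer), the schema can also be taken directly from the case class's encoder; this avoids relying on getDeclaredFields returning fields in declaration order and supplies the column types without a second pass for inferSchema:
import org.apache.spark.sql.Encoders

// Derive the exact schema (names and types) from the case class itself.
val schema = Encoders.product[AadharData].schema

val a = spark.read
  .option("header", false)
  .schema(schema)
  .csv("D:\\BGH\\Spark\\aadhaar_data.csv")
  .as[AadharData] // the implicit encoder comes from import spark.implicits._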
I have Spark 1.5.0 running on a cluster. I want to use Hive UDFs from ESRI's API. I can use these APIs in a Spark application, but due to some issues in my cluster I am not able to use HiveContext. I want to use existing Hive UDFs in a Spark SQL application.
// val sqlContext = new SQLContext(sc)
// import sqlContext.implicits._
// val hc = new HiveContext(sc)
// hc.sql("create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point'")
// hc.sql("create temporary function ST_Within as 'com.esri.hadoop.hive.ST_Within'")
// hc.sql("create temporary function ST_Polygon as 'com.esri.hadoop.hive.ST_Polygon'")
// val resultDF = hc.sql("select ST_Within(ST_Point(2, 3), ST_Polygon(1,1, 1,4, 4,4, 4,1))")
The above code is for HiveContext, but I want to do a similar thing with plain SQLContext, so I wrote the following:
sqlContext.sql("""create function ST_Point as 'com.esri.hadoopcom.esri.hadoop.hive.ST_Point'""")
But it seems I am getting the same error (see below):
Exception in thread "main" java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier create found
create function ST_Point as 'com.esri.hadoopcom.esri.hadoop.hive.ST_Point'
^
at scala.sys.package$.error(package.scala:27)
I tried to create functions from the existing UDFs, but it seems I need to write a Scala wrapper to call the Java classes. I tried the following:
def ST_Point_Spark = new ST_Point()
sqlContext.udf.register("ST_Point_Spark", ST_Point_Spark _)
def ST_Within_Spark = new ST_Within()
sqlContext.udf.register("ST_Within_Spark", ST_Within_Spark _)
def ST_Polygon_Spark = new ST_Polygon()
sqlContext.udf.register("ST_Polygon_Spark", ST_Polygon_Spark _)
sqlContext.sql("select ST_Within_Spark(ST_Point_Spark(2, 3), ST_Polygon_Spark(1,1, 1,4, 4,4, 4,1))")
But in this case I am getting this error:
Exception in thread "main" scala.reflect.internal.Symbols$CyclicReference: illegal cyclic reference involving object InterfaceAudience
at scala.reflect.internal.Symbols$Symbol$$anonfun$info$3.apply(Symbols.scala:1220)
at scala.reflect.internal.Symbols$Symbol$$anonfun$info$3.apply(Symbols.scala:1218)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
I am just wondering: is there any way to call a Hive/Java UDF without using HiveContext, directly using SQLContext?
Note: This was a helpful post, but it did not match my requirement.