Read data from bigquery with spark scala - scala

I'm trying to read data from BigQuery and print the results. Here's what I tried:
// Initialize Spark session
val spark = SparkSession
.builder
.master("local")
.appName("Word Count")
.config("fs.gs.project.id", "bigquery-public-data")
.config("google.cloud.auth.service.account.enable", "true")
.config("fs.gs.auth.service.account.json.keyfile", "<key_file>")
.getOrCreate()
val macbeth = spark.sql("SELECT * FROM shakespeare WHERE corpus = 'macbeth'").persist()
macbeth.show(100)
But this gives me the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: shakespeare; line 1 pos 14
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'shakespeare' not found in database 'default';
I couldn't find a way to fix this. Please help me to read data from this dataset.

Table or view not found: shakespeare; line 1 pos 14
When BigQuery looks for a table, it looks it up under a project ID and a dataset. In your code I see two possible issues:
projectId - you are using the BigQuery public project bigquery-public-data as your projectId; you need to change this to the correct value for your own project
datasetId - your query doesn't specify the dataset that contains the shakespeare table
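Not part of the original answer, but as a hedged sketch of what the read typically looks like with the spark-bigquery connector on the classpath (the fully qualified table name bigquery-public-data.samples.shakespeare and the credentialsFile option are based on the connector's documented usage; adjust them to your setup):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local")
  .appName("Shakespeare")
  .getOrCreate()

// Read through the connector instead of plain spark.sql; the table is addressed as
// <project>.<dataset>.<table>, so the dataset (samples) is part of the reference.
val shakespeare = spark.read
  .format("bigquery")
  .option("table", "bigquery-public-data.samples.shakespeare")
  .option("credentialsFile", "<key_file>") // service-account JSON key, if not using default credentials
  .load()

shakespeare.filter("corpus = 'macbeth'").show(100)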

Related

How to append an index column to a spark data frame using spark streaming in scala?

I am using something like this:
df.withColumn("idx", monotonically_increasing_id())
But I get an exception as it is NOT SUPPORTED:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Expression(s): monotonically_increasing_id() is not supported with streaming DataFrames/Datasets;;
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:143)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:250)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:316)
Any ideas how to add an index or row number column to a Spark streaming DataFrame in Scala?
Full stacktrace: https://justpaste.it/5bdqr
There are a few operations that cannot appear anywhere in a streaming plan of Spark Structured Streaming, and unfortunately monotonically_increasing_id() is one of them. To double-check that this is why transformed1 fails with the error from your question, here is a reproduction; the check itself lives in UnsupportedOperationChecker in the Spark source code:
import org.apache.spark.sql.functions._

val df = Seq(("one", 1), ("two", 2)).toDF("foo", "bar")
val schema = df.schema
df.write.parquet("/tmp/out")

val input = spark.readStream.format("parquet").schema(schema).load("/tmp/out")
val transformed1 = input.withColumn("id", monotonically_increasing_id())
transformed1.writeStream.format("parquet").option("format", "append")
  .option("path", "/tmp/out2").option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append").start()

import org.apache.spark.sql.expressions.Window

val windowSpecRowNum = Window.partitionBy("foo").orderBy("foo")
val transformed2 = input.withColumn("row_num", row_number.over(windowSpecRowNum))
transformed2.writeStream.format("parquet").option("format", "append")
  .option("path", "/tmp/out2").option("checkpointLocation", "/tmp/checkpoint_path")
  .outputMode("append").start()
I also tried to add indexing with a Window over a column in the DF (transformed2 in the snippet above); it also failed, but with a different error:
"Non-time-based windows are not supported on streaming
DataFrames/Datasets"
You can find all the unsupported-operator checks for Spark Streaming here; it seems the traditional ways of adding an index column in batch Spark don't work in Spark Streaming.
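Not part of the original answer, but a commonly used workaround, sketched under the assumption that you are on Spark 2.4+ where foreachBatch is available: assign the id inside each micro-batch, where the DataFrame is static and monotonically_increasing_id() is allowed. Note the generated ids are only unique within a batch unless you combine them with the batch id, as below.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{concat_ws, lit, monotonically_increasing_id}

input.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Inside foreachBatch the micro-batch is a plain (non-streaming) DataFrame,
    // so batch-only expressions such as monotonically_increasing_id() are allowed.
    batch
      .withColumn("id", concat_ws("_", lit(batchId), monotonically_increasing_id()))
      .write
      .mode("append")
      .parquet("/tmp/out2")
  }
  .option("checkpointLocation", "/tmp/checkpoint_path")
  .start()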

INSERT SPARK DATAFRAME INTO HIVE Managed Acid Table not working, HDP 3.0

I have an issue with inserting a Spark DataFrame into a Hive table. Can anyone please help me out? HDP version 3.1, Spark version 2.3. Thanks in advance.
// ORIGINAL CODE PART
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSessionImpl
import com.hortonworks.hwc.HiveWarehouseSession

val spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()

/*
 Some transformation operations happen here and their output is stored in val result
*/
val result = {
  num_records
    .union(df.transform(profile(heatmap_cols2type)))
}
result.createOrReplaceTempView("out_temp") // create temp view
scala> result.show()
+-----+--------------------+-----------+------------------+------------+-------------------+
| type|              column|      field|             value|       order|               date|
+-----+--------------------+-----------+------------------+------------+-------------------+
|TOTAL|                 all|num_records|               737|           0|2019-12-05 18:10:12|
|  NUM|available_points_...|    present|               737|           0|2019-12-05 18:10:12|
hive.setDatabase("EXAMPLE_DB")
hive.createTable("EXAMPLE_TABLE").ifNotExists()
  .column("`type`", "String").column("`column`", "String").column("`field`", "String")
  .column("`value`", "String").column("`order`", "bigint").column("`date`", "TIMESTAMP")
  .create()
hive.executeUpdate("INSERT INTO TABLE EXAMPLE_DB.EXAMPLE_TABLE SELECT * FROM out_temp")
----- ERROR of original code ----------------
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: SemanticException [Error 10001]: Line 1:86 Table not found 'out_temp'
What I tried as an alternative (since Hive and Spark use independent catalogs, per the HWC documentation on write operations):
spark.sql("SELECT type, column, field, value, order, date FROM out_temp").write.format("HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR").option("table", "wellington_profile").save()
-------ERROR of Alternative Step----------------
java.lang.ClassNotFoundException: Failed to find data source: HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:639)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:241)
... 58 elided
Caused by: java.lang.ClassNotFoundException: HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR.DefaultSource
My questions are:
Instead of saving out_temp as a temp view in Spark, is there any way to directly create the table in Hive?
Is there any way to insert into a Hive table from a Spark DataFrame?
Thank you everyone for your time!
import org.apache.spark.sql.SaveMode

// Save to a path as parquet files:
result.write.save("example_table.parquet")
// Or save directly as a table in the metastore:
result.write.mode(SaveMode.Overwrite).saveAsTable("EXAMPLE_TABLE")
You can read in more detail from here
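On the ClassNotFoundException from the alternative attempt: HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR is a constant, not a string, so it must not be wrapped in quotes. A hedged sketch of the HWC write path, assuming the HWC jar is configured as in HDP 3.x and reusing the database/table names from the question:
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase("EXAMPLE_DB")

// Pass the constant unquoted so Spark resolves the actual connector class.
result.write
  .format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "EXAMPLE_TABLE")
  .save()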

catch exception org.apache.spark.sql.AnalysisException: Table or view not found

I am trying to query a Hive table from Spark Scala code and getting the error below:
catch exception org.apache.spark.sql.AnalysisException: Table or view not found: `databaseName`.`register`; line 1 pos 35;
'Distinct
+- 'Project ['computer_name]
+- 'UnresolvedRelation `databaseName`.`register`
job failed
Here is the code to read the data from Hive:
import org.apache.spark.sql.{SQLContext, SparkSession}
val hc = spark.sqlContext
val dbName = "databaseName"
val tblName = "register"
val HostDF = hc.sql(s"""select distinct computer_name from ${dbName}.${tblName} """)
If I run this through spark-shell, I don't see any issue and I get the data.
If I run the same code as a Spark Scala application (i.e. running the jar in cluster mode), I get the error mentioned above.
Could anyone tell me what I am doing wrong in the code vs. spark-shell?
Thanks,
Bab
Try troubleshooting by looking into the environment or the databases/tables and comparing.
Print the conf values:
sqlContext.getAllConfs.foreach(println _)
or print the database and table names:
sqlContext.tableNames().foreach(println _)
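One more thing worth checking (an assumption on my part, not something visible in your snippet): spark-shell creates a Hive-enabled session for you, but a packaged application has to ask for it explicitly. Without enableHiveSupport() the job uses Spark's in-memory catalog and cannot see Hive databases, which produces exactly this error in cluster mode. A minimal sketch:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ReadRegister")
  .enableHiveSupport() // without this, spark.sql cannot resolve Hive tables in a packaged job
  .getOrCreate()

val hostDF = spark.sql("select distinct computer_name from databaseName.register")
hostDF.show()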

How to use identically named columns of two data frames without giving them aliases

I am getting two data frames from the code below. Each data frame has the same number of columns and the same column names.
data for f2.csv is
c1,c2,c3,c4
k1,i,aa,k
k5,j,ee,l
data for f1.csv is
c1,c2,c3,c4
k1,a,aa,e
k2,b,bb,f
k3,c,cc,g
k4,d,dd,h
I am reading the above two files into the following data frames:
val avro_inp = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("f1.csv")
val del_inp = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("f2.csv")
I am trying to access individual columns with dataframename.columnname, but it throws an SQL exception. Below is the code I am using:
avro_inp.join(del_inp, Seq("c1", "c3"), "outer")
.withColumn("c2",when(del_inp.col(colName="c2").isNotNull,del_inp.col(colName ="c2")).otherwise(avro_inp.col(colName = "c2")))
.withColumn("c4",when(avro_inp.col(colName="c4").isNull,del_inp.col(colName ="c4")).otherwise(avro_inp.col(colName = "c4")))
.drop(del_inp.col(colName="c2")).drop(del_inp.col(colName="c4")).show()
Is there any way I can do this without adding alias names to the columns? I am getting the following error with the above code:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'c4' is ambiguous, could be: c4#3, c4#7.;
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:171)
You can do something like the following, where the original dataframe names aren't changed. The code below is tested in Spark 2.0.
avro_inp.join(del_inp, Seq("c1", "c3"), "outer")
.withColumn("c22",when(del_inp.col(colName="c2").isNotNull, del_inp.col(colName ="c2")).otherwise(avro_inp.col(colName = "c2")))
.withColumn("c44",when(avro_inp.col(colName="c4").isNull,del_inp.col(colName ="c4")).otherwise(avro_inp.col(colName = "c4")))
.drop("c2", "c4")
.select($"c1", $"c22".as("c2"), $"c3", $"c44".as("c4"))
You can do something like this for Spark 1.6:
avro_inp.join(del_inp, Seq("c1", "c3"), "outer")
.withColumn("c2_new",when(del_inp.col(colName="c2").isNotNull, del_inp.col(colName ="c2")).otherwise(avro_inp.col(colName = "c2")))
.withColumn("c4_new",when(avro_inp.col(colName="c4").isNull,del_inp.col(colName ="c4")).otherwise(avro_inp.col(colName = "c4")))
.drop(del_inp.col("c4")).drop(avro_inp.col("c4"))
.drop(del_inp.col("c2")).drop(avro_inp.col("c2"))
.select($"c1", $"c2_new".as("c2"), $"c3", $"c4_new".as("c4"))
.show()
But if you are using Spark 2.0, then please refer to #RameshMaharjan's answer.
I hope it helps!
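Not from either answer, but an alternative sketch that expresses the same logic with coalesce() (first non-null value wins), still referencing the columns through the original DataFrame handles rather than aliases; it assumes the same avro_inp/del_inp frames as above and that the implicits for the $ syntax are in scope:
import org.apache.spark.sql.functions.coalesce
import sqlContext.implicits._

avro_inp.join(del_inp, Seq("c1", "c3"), "outer")
  .select(
    $"c1",
    coalesce(del_inp("c2"), avro_inp("c2")).as("c2"), // take del_inp's value when present
    $"c3",
    coalesce(avro_inp("c4"), del_inp("c4")).as("c4")  // take avro_inp's value when present
  )
  .show()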

How to log malformed rows from Scala Spark DataFrameReader csv

The documentation for the Scala_Spark_DataFrameReader_csv suggests that spark can log the malformed rows detected while reading a .csv file.
- How can one log the malformed rows?
- Can one obtain a val or var containing the malformed rows?
The option from the linked documentation is:
maxMalformedLogPerPartition (default 10): sets the maximum number of malformed rows Spark will log for each partition. Malformed records beyond this number will be ignored
Based on this Databricks example, you need to explicitly add the "_corrupt_record" column to the schema definition when you read in the file. Something like this worked for me in PySpark 2.4.4:
from pyspark.sql.types import *

my_schema = StructType([
    StructField("field1", StringType(), True),
    ...
    StructField("_corrupt_record", StringType(), True)
])

my_data = spark.read.format("csv")\
    .option("path", "/path/to/file.csv")\
    .schema(my_schema)\
    .load()

my_data.count()  # force reading the csv
corrupt_lines = my_data.filter("_corrupt_record is not NULL")
corrupt_lines.take(5)
If you are using Spark 2.3, check the _corrupt_record special column... according to several Spark discussions "it should work", so after the read, filter the rows where that column is non-empty - there you should find your errors. You could also check the input_file_name() SQL function.
If you are using a version lower than 2.3, you should implement a custom read/record solution, because according to my tests _corrupt_record did not work for the CSV data source...
I've expanded on klucar's answer here by loading the csv, making a schema from the non-corrupted records, adding the corrupted record column, using the new schema to load the csv and then looking for corrupted records.
from pyspark.sql.types import StructField, StringType
from pyspark.sql.functions import col
file_path = "/path/to/file"
mode = "PERMISSIVE"
schema = spark.read.options(mode=mode).csv(file_path).schema
schema = schema.add(StructField("_corrupt_record", StringType(), True))
df = spark.read.options(mode=mode).schema(schema).csv(file_path)
df.cache()
df.count()
df.filter(col("_corrupt_record").isNotNull()).show()
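Since the question asks about Scala, here is the same approach translated to Scala (an adaptation of the answer above, not code from it; the file path is a placeholder):
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField}

val filePath = "/path/to/file.csv"
val mode = "PERMISSIVE"

// Infer a schema first, then append the special corrupt-record column to it.
val inferred = spark.read.option("mode", mode).csv(filePath).schema
val schemaWithCorrupt = inferred.add(StructField("_corrupt_record", StringType, nullable = true))

val df = spark.read
  .option("mode", mode)
  .schema(schemaWithCorrupt)
  .csv(filePath)

df.cache() // cache so the filter below sees a consistent view of the corrupt-record column
df.count() // force the read
df.filter(col("_corrupt_record").isNotNull).show(truncate = false)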