I am trying to add or set Hive runtime parameters while submitting a job, but it is not working at all. It works fine when we run the statements directly in the Hive console, but not in spark-shell.
The job runs but nothing gets inserted into the table.
val sqlAgg =
s"""
|set tez.task.resource.memory.mb=5000;
|SET hive.tez.container.size=6656;
|SET hive.tez.java.opts=-Xmx5120m;
|set hive.optimize.ppd=true;
|set hive.execution.engine=tez;
|INSERT INTO table Partition(JobID=$jobId)
|SELECT
|UUID() AS Key,
|a,
|b,
|SUM(dc_1) AS dc
|FROM tablenametask where jobid=$jobId
|GROUP BY
|a,
|b,
|c
""".stripMargin
hive.executeUpdate(sqlAgg)
Try using spark.sql instead!
It should work.
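As a rough sketch of what that could look like (assuming a Hive-enabled SparkSession named spark, that jobId is in scope, and using target_table as a placeholder for the real target table), issue each SET as its own spark.sql call and run the INSERT as a separate statement, since a multi-statement string is not split for you:

// Sketch, not the original code: each SET is issued as its own statement.
// `target_table` is a placeholder name, UUID() is carried over from the
// question, and whether the tez.* settings take effect depends on how the
// query is actually executed.
Seq(
  "SET tez.task.resource.memory.mb=5000",
  "SET hive.tez.container.size=6656",
  "SET hive.tez.java.opts=-Xmx5120m",
  "SET hive.optimize.ppd=true",
  "SET hive.execution.engine=tez"
).foreach(stmt => spark.sql(stmt))

spark.sql(
  s"""
     |INSERT INTO target_table PARTITION (JobID=$jobId)
     |SELECT UUID() AS Key, a, b, SUM(dc_1) AS dc
     |FROM tablenametask WHERE jobid=$jobId
     |GROUP BY a, b, c
   """.stripMargin)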
I have the following expression,
val pageViews = spark.sql(
s"""
|SELECT
| proposal,
| MIN(timestamp) AS timestamp,
| MAX(page_view_after) AS page_view_after
|FROM page_views
|GROUP BY proposal
|""".stripMargin
).createOrReplaceTempView("page_views")
I want to convert it into one that uses the Dataset API:
val pageViews = pageViews.selectExpr("proposal", "MIN(timestamp) AS timestamp", "MAX(page_view_after) AS page_view_after").groupBy("proposal")
The problem is I can't call createOrReplaceTempView on this one - the build fails.
My question is how do I convert the first one into the second one and create a TempView out of that?
You can get rid of the SQL expressions altogether by using Spark SQL's functions
import org.apache.spark.sql.functions._
as below:
pageViews
.groupBy("proposal")
.agg(min("timestamp").as("timestamp"), max("page_view_after").as("page_view_after"))
Considering you have a DataFrame available with the name pageViews, use:
pageViews
.groupBy("proposal")
.agg(expr("min(timestamp) AS timestamp"), expr("max(page_view_after) AS page_view_after"))
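To answer the temp view part of the question: the aggregation returns an ordinary DataFrame, so (as a sketch, assuming pageViews is already defined as a DataFrame with these columns) you can register the result directly:

// Sketch: the aggregated result is itself a DataFrame, so
// createOrReplaceTempView can be called on it.
import org.apache.spark.sql.functions.{min, max}

val aggregated = pageViews
  .groupBy("proposal")
  .agg(min("timestamp").as("timestamp"), max("page_view_after").as("page_view_after"))

aggregated.createOrReplaceTempView("page_views")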
I am receiving a file from an API which has encoded (non-ASCII) character values in 3 columns.
When I read the file using a DataFrame in Spark 1.6:
val CleanData= sqlContext.sql("""SELECT
COL1,
COL2,
COL3
FROM CLEANFRAME
""" )
The encoded value appears like this:
53004, �����������������������������
Can someone please help me fix this, if possible, with Spark 1.6 and Scala?
Spark 1.6, Scala
This can be achieved by using regexp_replace:
import spark.implicits._
import org.apache.spark.sql.functions.regexp_replace

val df = spark.sparkContext.parallelize(List(("503004","d$üíõ$F|'.h*Ë!øì=(.î; ,.¡|®!®","3-2-704"))).toDF("col1","col2","col3")
df.withColumn("col2_new", regexp_replace($"col2", "[^a-zA-Z]", "")).show()
Output:
+------+--------------------+-------+--------+
| col1| col2| col3|col2_new|
+------+--------------------+-------+--------+
|503004|d$üíõ$F|'.h*Ë!øì=...|3-2-704| dFh|
+------+--------------------+-------+--------+
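Since the question asks about Spark 1.6 (where there is no spark session object), a rough equivalent, assuming the data is registered as CLEANFRAME as in the question, could be:

// Sketch for Spark 1.6: the same regexp_replace idea via the SQLContext.
// Assumes a SQLContext/HiveContext named `sqlContext` and a registered
// table CLEANFRAME, as in the question; COL2_clean is a made-up column name.
import org.apache.spark.sql.functions.{col, regexp_replace}

val cleaned = sqlContext.table("CLEANFRAME")
  .withColumn("COL2_clean", regexp_replace(col("COL2"), "[^a-zA-Z]", ""))
cleaned.show()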
I am trying to save a dataframe as an external table which will be queried both with Spark and possibly with Hive, but somehow I cannot query or see any data with Hive. It works fine in Spark.
Here is how to reproduce the problem:
scala> println(spark.conf.get("spark.sql.catalogImplementation"))
hive
scala> spark.conf.set("hive.exec.dynamic.partition", "true")
scala> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
scala> spark.conf.set("spark.sql.sources.bucketing.enabled", true)
scala> spark.conf.set("hive.enforce.bucketing","true")
scala> spark.conf.set("optimize.sort.dynamic.partitionining","true")
scala> spark.conf.set("hive.vectorized.execution.enabled","true")
scala> spark.conf.set("hive.enforce.sorting","true")
scala> spark.conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
scala> spark.conf.set("hive.metastore.uris", "thrift://localhost:9083")
scala> var df = spark.range(20).withColumn("random", round(rand()*90))
df: org.apache.spark.sql.DataFrame = [id: bigint, random: double]
scala> df.head
res19: org.apache.spark.sql.Row = [0,46.0]
scala> df.repartition(10, col("random")).write.mode("overwrite").option("compression", "snappy").option("path", "s3a://company-bucket/dev/hive_confs/").format("orc").bucketBy(10, "random").sortBy("random").saveAsTable("hive_random")
19/08/01 19:26:55 WARN HiveExternalCatalog: Persisting bucketed data source table `default`.`hive_random` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
Here is how I query in hive:
Beeline version 2.3.4-amzn-2 by Apache Hive
0: jdbc:hive2://localhost:10000/default> select * from hive_random;
+------------------+
| hive_random.col |
+------------------+
+------------------+
No rows selected (0.213 seconds)
But it works fine in spark:
scala> spark.sql("SELECT * FROM hive_random").show
+---+------+
| id|random|
+---+------+
| 3| 13.0|
| 15| 13.0|
...
| 8| 46.0|
| 9| 65.0|
+---+------+
There is a warning after your saveAsTable call. That's where the hint lies:
'Persisting bucketed data source table default.hive_random into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.'
The reason is that saveAsTable creates RDD partitions but not Hive partitions; the workaround is to create the table via HQL before calling df.write.saveAsTable.
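A rough sketch of that workaround, assuming the same df, table name and S3 path as in the question (the DDL would need to match your real schema and bucketing requirements):

// Sketch: create the table with Hive-compatible DDL first, then write into it.
// Table name, columns and the S3 location are carried over from the question;
// insertInto uses the format declared in the table DDL.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS hive_random (id BIGINT, random DOUBLE)
  STORED AS ORC
  LOCATION 's3a://company-bucket/dev/hive_confs/'
""")

df.write
  .mode("overwrite")
  .insertInto("hive_random")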
I would suggest trying a couple of things. First, try to set the Hive execution engine to use Spark:
set hive.execution.engine=spark;
Second, try to create an external table in the metastore and then save data to that table.
The semantics of bucketed tables in Spark and Hive are different.
The doc has details of the differences in semantics.
It states that
Data is written to bucketed tables but the output does not adhere with expected
bucketing spec. This leads to incorrect results when one tries to consume the
Spark written bucketed table from Hive.
Workaround: If reading from both engines is the requirement, writes need to happen from Hive.
I am trying to load CSV data into a Hive table using SparkSession.
I want to skip the header row while loading into the Hive table, and setting tblproperties("skip.header.line.count"="1") is not working either.
I am using the following code:
import java.io.File
import org.apache.spark.sql.{SparkSession,Row,SaveMode}
case class Record(key: Int, value: String)
val warehouseLocation=new File("spark-warehouse").getAbsolutePath
val spark=SparkSession.builder().appName("Apache Spark Book Crossing Analysis").config("spark.sql.warehouse.dir",warehouseLocation).enableHiveSupport().getOrCreate()
import spark.implicits._
import spark.sql
//sql("set hive.vectorized.execution.enabled=false")
sql("drop table if exists BookTemp")
sql ("create table BookTemp(ISBN int,BookTitle String,BookAuthor String ,YearOfPublication int,Publisher String,ImageURLS String,ImageURLM String,ImageURLL String)row format delimited fields terminated by ';' ")
sql("alter table BookTemp set TBLPROPERTIES("skip.header.line.count"="1")")
sql("load data local inpath 'BX-Books.csv' into table BookTemp")
sql("select * from BookTemp limit 5").show
Error in console:
res55: org.apache.spark.sql.DataFrame = []
<console>:1: error: ')' expected but '.' found.
sql("alter table BookTemp set TBLPROPERTIES("skip.header.line.count"="1")")
2019-02-20 22:48:09 WARN LazyStruct:151 - Extra bytes detected at the end of the row! Ignoring similar problems.
+----+--------------------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|ISBN| BookTitle| BookAuthor|YearOfPublication| Publisher| ImageURLS| ImageURLM| ImageURLL|
+----+--------------------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
|null| "Book-Title"| "Book-Author"| null| "Publisher"| "Image-URL-S"| "Image-URL-M"| "Image-URL-L"|
|null|"Classical Mythol...|"Mark P. O. Morford"| null|"Oxford Universit...|"http://images.am...|"http://images.am...|"http://images.am...|
|null| "Clara Callan"|"Richard Bruce Wr...| null|"HarperFlamingo C...|"http://images.am...|"http://images.am...|"http://images.am...|
|null|"Decision in Norm...| "Carlo D'Este"| null| "HarperPerennial"|"http://images.am...|"http://images.am...|"http://images.am...|
|null|"Flu: The Story o...| "Gina Bari Kolata"| null|"Farrar Straus Gi...|"http://images.am...|"http://images.am...|"http://images.am...|
+----+--------------------+--------------------+-----------------+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows
As shown in the result, I want to skip the first row of data.
If you are using SQL then the workaround is to add a filter to the SQL (note that the WHERE clause has to come before LIMIT):
sql("select * from BookTemp where BookTitle!='Book-Title' limit 5").show
This Jira is related: https://issues.apache.org/jira/browse/SPARK-11374
Also read this: https://github.com/apache/spark/pull/14638 - you can use the CSV reader option:
spark.read.option("header","true").csv("/data").show
Or remove the header using the shell before loading:
file="myfile.csv"
tail -n +2 "$file" > "$file.tmp" && mv "$file.tmp" "$file"
Another alternative I was trying: using Spark SQL to convert the CSV with a header to Parquet:
val df = spark.sql("select * from schema.table")
df.coalesce(1).write.options(Map("header"->"true","compression"->"snappy")).mode(SaveMode.Overwrite).parquet()
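Putting the reader option together with the Hive write, here is a sketch (assuming the same BX-Books.csv file and a Hive-enabled SparkSession as in the question; note that saveAsTable derives the schema from the CSV rather than from the hand-written DDL):

// Sketch: let the CSV reader drop the header, then write into a Hive table,
// so skip.header.line.count is not needed at all.
val books = spark.read
  .option("header", "true")      // the first line is treated as the header and skipped
  .option("delimiter", ";")
  .csv("BX-Books.csv")

books.write.mode("overwrite").saveAsTable("BookTemp")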
I'm running a Spark application on EMR through the pyspark interactive shell.
I'm trying to connect to a Hive table named content_publisher_events_log, which I know isn't empty (I ran exactly the same query through my Hue console), but when I try to read it through pyspark I get count=0, as follows:
from pyspark.sql import HiveContext
Query=""" select dt
from default.content_publisher_events_log
where dt between '20170415' and '20170419'
"""
hive_context = HiveContext(sc)
user_data = hive_context.sql(Query)
user_data.count()
0 #that's the result
Also, from the console I can see that this table exists:
>>> sqlContext.sql("show tables").show()
+--------+--------------------+-----------+
|database| tableName|isTemporary|
+--------+--------------------+-----------+
| default|content_publisher...| false|
| default| feed_installer_log| false|
| default|keyword_based_ads...| false|
| default|search_providers_log| false|
+--------+--------------------+-----------+
>>> user_data.printSchema()
root
|-- dt: string (nullable = true)
I also checked the Spark history server - it seems like the job that ran the count completed without any errors. Any idea what could be going wrong?
Thanks in advance!
The dt column isn't in datetime format. Either change the column itself to have a datetime format, or change the query itself to cast the string as a timestamp:
Query=""" select dt
from default.content_publisher_events_log
where dt between
unix_timestamp('20170415','yyyyMMdd') and
unix_timestamp('20170419','yyyyMMdd')
"""
It seems like our data team had moved the Parquet file for each partition into a subfolder. They fixed it, and starting from April 25th it works perfectly.
As far as I know, if anyone is facing this issue, try something like this:
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
or this one:
sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive","true")