Results of rdd.count, count via spark sql are the same, but they are different from count with hive sql - scala

I use count to calculate the number of RDD,got 13673153,but after I transfer the rdd to df and insert into hive,and count again,got 13673182,why?
rdd.count
spark.sql("select count(*) from ...").show()
hive sql: select count(*) from ...

This could be caused by a mismatch between data in the underlying files and the metadata registered in hive for that table. Try running:
MSCK REPAIR TABLE tablename;
in hive, and see if the issue is fixed. The command updates the partition information of the table. You can find more info in the documentation here.

During a Spark Action and part of SparkContext, Spark will record which files were in scope for processing. So, if the DAG needs to recover and reprocess that Action, then the same results are gotten. By design.
Hive QL has no such considerations.
UPDATE
As you noted, the other answer did not help in this use case.
So, when Spark processes Hive tables it looks at the list of files that it will use for the Action.
In the case of a failure (node failure, etc.) it will recompute data from the generated DAG. If it needs to go back and re-compute as far as the start of reading from Hive itself, then it will know which files to use - i.e the same files, so that same results are gotten instead of non-deterministic outcomes. E.g. think of partitioning aspects, handy that same results can be recomputed!
It's that simple. It's by design. Hope this helps.

Related

Is there a way for PySpark to give user warning when executing a query on Apache Hive table without specifying partition keys?

We are using Spark SQL with Apache Hive tables (via AWS Glue Data catalog). One problem is that when we execute a Spark SQL query without specifying the partitions to read via the WHERE clause, it gives us/the user no warning about the fact that it will proceed to load all partitions and thus likely time out or fail.
Is there a way to ideally error out, or at least give some warning, when a user executes a Spark SQL query on Apache Hive table without specifying partition keys? It's very easy to forget to do this.
I searched for existing solutions to this and found none, both on Stack Overflow and on the wider internet. I was expecting some configuration option/code that would help me achieve the goal.

Good way to ensure data is ready to be queried while Glue partition is created?

We have queries that run on a schedule every several minutes that join a few different glue tables (via Athena) before coming up with some results. For the table in question, we have Glue Crawlers set up and partitions based on snapshot_date and a couple other columns.
In the query, we grab the latest snapshot_date and use only data from that snapshot_date for the query. The data in S3 gets updated and put into the right folder a few times a day, but it looks like sometimes, if we try to query the data right as the data in S3 is getting updated, we end up with empty results due to the query trying to access the new snapshot_date partition while Glue is still getting the data set up(?)
Is there a built-in way to ensure that our glue partitions are ready before we start querying them? So far, we considered building in artificial time "buffers" in our query around when we expect the snapshot_date partition data to be written and the glue update to be complete, but I'm aware that that's really brittle and depends on the exact timing.

Is really Hive on Tez with ORC performance better than Spark SQL for ETL?

I have little experience in Hive and currently learning Spark with Scala. I am curious to know whether Hive on Tez really faster than SparkSQL. I searched many forums with test results but they have compared older version of Spark and most of them are written in 2015. Summarized main points below
ORC will do the same as parquet in Spark
Tez engine will give better performance like Spark engine
Joins are better/faster in Hive than Spark
I feel like Hortonworks supports more for Hive than Spark and Cloudera vice versa.
sample links :
link1
link2
link3
Initially I thought Spark would be faster than anything because of their in-memory execution. after reading some articles I got Somehow existing Hive also getting improvised with new concepts like Tez, ORC, LLAP etc.
Currently running with PL/SQL Oracle and migrating to big data since volumes are getting increased. My requirements are kind of ETL batch processing and included data details involved in every weekly batch runs. Data will increase widely soon.
Input/lookup data are csv/text formats and updating into tables
Two input tables which has 5 million rows and 30 columns
30 look up tables used to generate each column of output table which contains around 10 million rows and 220 columns.
Multiple joins involved like inner and left outer since many look up tables used.
Kindly please advise which one of below method I should choose for better performance with readability and easy to include minor updates on columns for future production deployment.
Method 1:
Hive on Tez with ORC tables
Python UDF thru TRANSFORM option
Joins with performance tuning like map join
Method 2:
SparkSQL with Parquet format which is converting from text/csv
Scala for UDF
Hope we can perform multiple inner and left outer join in Spark
The best way to implement the solution to your problem as below.
To load the data into the table the spark looks good option to me. You can read the tables from the hive metastore and perform the incremental updates using some kind of windowing functions and register them in hive. While ingesting as data is populated from various lookup table, you are able to write the code in programatical way in scala.
But at the end of the day, there need to be a query engine that is very easy to use. As your spark program register the table with hive, you can use hive.
Hive support three execution engines
Spark
Tez
Mapreduce
Tez is matured, spark is evolving with various commits from Facebook and community.
Business can understand hive very easily as a query engine as it is much more matured in the industry.
In short use spark to process the data for daily processing and register them with hive.
Create business users in hive.

Spark Streaming dropDuplicates

Spark 2.1.1 (scala api) streaming json files from an s3 location.
I want to deduplicate any incoming records based on an ID column (“event_id”) found in the json for every record. I do not care which record is kept, even if duplication of the record is only partial. I am using append mode as the data is merely being enriched/filtered, with no group by/window aggregations, via the spark.sql() method. I then use the append mode to write parquet files to s3.
According to the documentation, I should be able to use dropDuplicates without watermarking in order to deduplicate (obviously this is not effective in long-running production). However, this fails with the error:
User class threw exception: org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets
That error seems odd as I am doing no aggregation (unless dropDuplicates or sparkSQL counts as an aggregation?).
I know that duplicates won’t occur outside 3 days of each other, so I then tried it again by adding a watermark (by using .withWatermark() immediately before the drop duplicates). However, it seems to want to wait until 3 days are up before writing the data. (ie since today is July 24, only data up to the same time on July 21 is written to the output).
As there is no aggregation, I want to write every row immediately after the batch is processed, and simply throw away any rows with an event id that has occurred in the previous 3 days. Is there a simple way to accomplish this?
Thanks
In my case, I used to achieve that in two ways through DStream :
One way:
load tmp_data(contain 3 days unique data, see below)
receive batch_data and do leftOuterJoin with tmp_data
do filter on step2 and output new unique data
update tmp_data with new unique data through step2's result and drop old data(more than 3 days)
save tmp_data on HDFS or whatever
repeat above again and again
Another way:
create a table on mysql and set UNIQUE INDEX on event_id
receive batch_data and just save event_id + event_time + whatever to mysql
mysql will ignore duplicate automatically
Solution we used was a custom implementation of org.apache.spark.sql.execution.streaming.Sink that inserts into a hive table after dropping duplicates within batch and performing a left anti join against the previous few days worth of data in the target hive table.

Spark-scala how to work with HDFS directory partition

To reduce process time I partitioned my data by dates so that I use only required date data (not complete table).So now in HDFS my tables are stored like below
src_tbl //main dir trg_tbl
2016-01-01 //sub dir 2015-12-30
2016-01-02 2015-12-31
2016-01-03 2016-01-01
2016-01-03
Now I want to select min(date) from src_tbl which will be 2016-01-01
and from trg_tbl I want to use data in >= 2016-01-01(src_tbl min(date)) directories which will be2016-01-01 and 2016-01-03 data`
How can select required partitions or date folder from hdfs using Spark-scala ? After completing process I need to overwrite same date directories too.
Details about process:
I want to choose correct window of data (as all other date data in not required) from source and target table..then I want to do join -> lead / lag -> union -> write.
Spark SQL (including the DataFrame/set api's) is kind of funny in the way it handles partitioned tables wrt retaining the existing partitioning info from one transformation/stage to the next.
For the initial loading Spark SQL tends to do a good job on understanding how to retain the underlying partitioning information - if that information were available in the form of the hive metastore metadata for the table.
So .. are these hive tables?
If so - so far so good - you should see the data loaded partition by partition according to the hive partitions.
Will the DataFrame/Dataset remember this nice partitioning already
set up?
Now things get a bit more tricky. The answer depends on whether a shuffle were required or not.
In your case - a simple filter operation - there should not be any need. So once again - you should see the original partitioning preserved and thus good performance. Please verify that the partitioning were indeed retained.
I will mention that if any aggregate functions were invoked then you can be assured your partitioning would be lost. Spark SQL will in that case use a HashPartitioner -inducing a full shuffle.
Update The OP provided more details here: there is lead/lag and join involved. Then he is well advised - from the strictly performance perspective - to avoid the Spark SQL and do the operations manually.
To the OP: the only thing I can suggest at this point is to check
preservesPartitioning=true
were set in your RDD operations. But I am not even sure that capability were exposed by Spark for the lag/lead: please check.