I used count to get the number of rows in an RDD and got 13673153. After converting the RDD to a DataFrame, inserting it into Hive, and counting again, I got 13673182. Why?
rdd.count
spark.sql("select count(*) from ...").show()
hive sql: select count(*) from ...
This could be caused by a mismatch between data in the underlying files and the metadata registered in hive for that table. Try running:
MSCK REPAIR TABLE tablename;
in hive, and see if the issue is fixed. The command updates the partition information of the table. You can find more info in the documentation here.
When a Spark action runs, Spark records which files are in scope for processing. If the DAG has to recover and reprocess that action, it reads the same files and produces the same results. This is by design.
Hive QL has no such considerations.
UPDATE
As you noted, the other answer did not help in this use case.
When Spark processes Hive tables, it looks at the list of files that it will use for the action.
In the case of a failure (node failure, etc.), it recomputes data from the generated DAG. If it needs to go back as far as the initial read from Hive, it knows which files to use, i.e. the same files, so the same results are produced instead of non-deterministic outcomes. Think of partitioning, for example: it is handy that the same results can be recomputed.
It's that simple. It's by design. Hope this helps.
We've been exploring using Glue to transform some JSON data to parquet. One scenario we tried was adding a column to the parquet table. So partition 1 has columns [A] and partition 2 has columns [A,B]. Then we wanted to write further Glue ETL jobs to aggregate the parquet table but the new column was not available. Using glue_context.create_dynamic_frame.from_catalog to load the dynamic frame our new column was never in the schema.
We tried several configurations for our table crawler. Using a single schema for all partitions, single schema for s3 path, schema per partition. We could always see the new column in the Glue table data but it was always null if we queried it from a Glue job using pyspark. The column was in the parquet when we downloaded some samples and available for querying via Athena.
Why are the new columns not available to pyspark?
This turned out to be a spark configuration issue. From the spark docs:
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by
setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or
setting the global SQL option spark.sql.parquet.mergeSchema to true.
We could enable schema merging in two ways.
set the option on the spark session spark.conf.set("spark.sql.parquet.mergeSchema", "true")
set mergeSchema to true in the additional_options when loading the dynamic frame.
source = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="table",
    additional_options={"mergeSchema": "true"}
)
After that the new column was available in the frame's schema.
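To make the effect of schema merging concrete, here is a toy model in plain Python. It is not Spark's or Glue's actual implementation: the helper names merge_schemas and read_rows are made up for illustration, and it assumes that without merging the reader takes its schema from a single partition.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[int]]


def merge_schemas(schemas: List[List[str]]) -> List[str]:
    """Union the column names of all schemas, keeping first-seen order."""
    merged: List[str] = []
    for schema in schemas:
        for column in schema:
            if column not in merged:
                merged.append(column)
    return merged


def read_rows(partitions: List[List[Row]], schema: List[str]) -> List[Row]:
    """Project every row onto `schema`; columns a row lacks become None."""
    return [
        {column: row.get(column) for column in schema}
        for partition in partitions
        for row in partition
    ]


# Hypothetical data: partition 1 has columns [A], partition 2 has [A, B].
partition_1: List[Row] = [{"A": 1}, {"A": 2}]
partition_2: List[Row] = [{"A": 3, "B": 30}]

# Assumption: without merging, the schema comes from one partition only.
unmerged_rows = read_rows([partition_1, partition_2], ["A"])

# With merging, the schema is the union of all partition schemas.
merged_schema = merge_schemas([["A"], ["A", "B"]])
merged_rows = read_rows([partition_1, partition_2], merged_schema)
```

In this model, column B never appears in unmerged_rows, while merged_rows carries B for every row and fills it with None for rows from partition 1.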
Objective:
We're hoping to use the AWS Glue Data Catalog to create a single table for JSON data residing in an S3 bucket, which we would then query and parse via Redshift Spectrum.
Background:
The JSON data is from DynamoDB Streams and is deeply nested. The first level of JSON has a consistent set of elements: Keys, NewImage, OldImage, SequenceNumber, ApproximateCreationDateTime, SizeBytes, and EventName. The only variation is that some records do not have a NewImage and some don't have an OldImage. Below this first level, though, the schema varies widely.
Ideally, we would like to use Glue to only parse this first level of JSON, and basically treat the lower levels as large STRING objects (which we would then parse as needed with Redshift Spectrum). Currently, we're loading the entire record into a single VARCHAR column in Redshift, but the records are nearing the maximum size for a data type in Redshift (maximum VARCHAR length is 65535). As a result, we'd like to perform this first level of parsing before the records hit Redshift.
What we've tried/referenced so far:
Pointing the AWS Glue Crawler to the S3 bucket results in hundreds of tables with a consistent top level schema (the attributes listed above), but varying schemas at deeper levels in the STRUCT elements. We have not found a way to create a Glue ETL Job that would read from all of these tables and load it into a single table.
Creating a table manually has not been fruitful. We tried setting each column to a STRING data type, but the job did not succeed in loading data (presumably since this would involve some conversion from STRUCTs to STRINGs). When setting columns to STRUCT, it requires a defined schema - but this is precisely what varies from one record to another, so we are not able to provide a generic STRUCT schema that works for all the records in question.
The AWS Glue Relationalize transform is intriguing, but not what we're looking for in this scenario (since we want to keep some of the JSON intact, rather than flattening it entirely). Redshift Spectrum supports scalar JSON data as of a couple weeks ago, but this does not work with the nested JSON we're dealing with. Neither of these appear to help with handling the hundreds of tables created by the Glue Crawler.
Question:
How would we use Glue (or some other method) to allow us to parse just the first level of these records - while ignoring the varying schemas below the elements at the top level - so that we can access it from Spectrum or load it physically into Redshift?
I'm new to Glue. I've spent quite a bit of time in the Glue documentation and looking through (the somewhat sparse) info on forums. I could be missing something obvious - or perhaps this is a limitation of Glue in its current form. Any recommendations are welcome.
Thanks!
I'm not sure you can do this with a table definition, but you can accomplish this with an ETL job by using a mapping function to cast the top level values as JSON strings. Documentation: [link]
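import json
from awsglue.transforms import Map

# Mapping function: serialize every top-level value to a JSON string
def flatten(rec):
    for key in rec:
        rec[key] = json.dumps(rec[key])
    return rec

# glueContext is assumed to be the GlueContext already created by the job
old_df = glueContext.create_dynamic_frame.from_options(
    's3',
    {"paths": ['s3://...']},
    "json")

# Apply the mapping function to all DynamicRecords in the DynamicFrame
new_df = Map.apply(frame=old_df, f=flatten)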
From here you have the option of exporting to S3 (perhaps in Parquet or some other columnar format to optimize for querying) or directly into Redshift from my understanding, although I haven't tried it.
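To see what the mapping function does to a single record, here is a small sketch that runs outside Glue on a plain Python dict. The sample record is hypothetical and only loosely shaped like the DynamoDB Streams data in the question; this version also returns a new dict rather than modifying the record in place.

```python
import json


def flatten(rec):
    """Serialize every top-level value to a JSON string; keys are unchanged."""
    return {key: json.dumps(value) for key, value in rec.items()}


# Hypothetical record, for illustration only.
record = {
    "EventName": "MODIFY",
    "SizeBytes": 59,
    "Keys": {"Id": {"N": "101"}},
    "NewImage": {"Id": {"N": "101"}, "Message": {"S": "hello"}},
}

flat = flatten(record)

# Nested values can be parsed back later, when they are actually needed.
keys_again = json.loads(flat["Keys"])
```

One thing to watch for: json.dumps also wraps plain string values in double quotes, so "MODIFY" becomes the six-character content plus two quote characters. If that is unwanted, the function could be changed to serialize only dict and list values.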
This is a limitation of Glue as of now. Have you taken a look at Glue Classifiers? It's the only piece I haven't used yet, but might suit your needs. You can define a JSON path for a field or something like that.
Other than that, Glue Jobs are the way to go. It's Spark in the background, so you can do pretty much everything. Set up a development endpoint and play around with it. I've run into various roadblocks over the last three weeks and decided to forgo all Glue functionality and use only Spark; that way it's both portable and actually works.
One thing you might need to keep in mind when setting up the dev endpoint is that the IAM role must have a path of "/", so you will most probably need to create a separate role manually that has this path. The one automatically created has a path of "/service-role/".
You should add a Glue classifier, preferably with the JSON path $[*].
When you crawl the JSON file in S3, it will read the first line of the file.
You can create a Glue job to load the Data Catalog table for this JSON file into Redshift.
My only problem here is that Redshift Spectrum has problems reading JSON tables in the Data Catalog.
Let me know if you have found a solution.
The procedure I found useful for partially flattening nested JSON:
ApplyMapping for the first level as datasource0;
Explode struct or array objects to get rid of the element level: df1 = datasource0.toDF().select(id, col1, col2, ..., explode(coln).alias(coln)), where explode requires from pyspark.sql.functions import explode;
Select the JSON objects that you would like to keep intact with intact_json = df1.select(id, itct1, itct2, ..., itctm);
Transform df1 back to a DynamicFrame and Relationalize it, dropping the intact columns with dataframe.drop_fields(itct1, itct2, ..., itctm);
Join the relationalized table with the intact table on the 'id' column.
As of 12/20/2018, I was able to manually define a table with first level json fields as columns with type STRING. Then in the glue script the dynamicframe has the column as a string. From there, you can do an Unbox operation of type json on the fields. This will json parse the fields and derive the real schema. Combining Unbox with Filter allows you to loop through and process heterogeneous json schemas from the same input if you can loop through a list of schemas.
However, one word of caution, this is incredibly slow. I think that glue is downloading the source files from s3 during each iteration of the loop. I've been trying to find a way to persist the initial source data but it looks like .toDF derives the schema of the string json fields even if you specify them as glue StringType. I'll add a comment here if I can figure out a solution with better performance.
I have little experience with Hive and am currently learning Spark with Scala. I am curious whether Hive on Tez is really faster than SparkSQL. I searched many forums for test results, but they compare older versions of Spark and most were written in 2015. The main points are summarized below:
ORC will do the same as Parquet in Spark
The Tez engine will give good performance, like the Spark engine
Joins are better/faster in Hive than in Spark
I feel like Hortonworks gives more support to Hive than to Spark, and Cloudera vice versa.
sample links :
link1
link2
link3
Initially I thought Spark would be faster than anything else because of its in-memory execution. After reading some articles, I learned that Hive is also being improved with new concepts like Tez, ORC, LLAP, etc.
We currently run on Oracle with PL/SQL and are migrating to big data because volumes are increasing. My requirements are ETL-style batch processing; the data involved in every weekly batch run is described below. The data will grow considerably soon.
Input/lookup data are in csv/text formats and are loaded into tables
Two input tables with 5 million rows and 30 columns
30 lookup tables used to generate each column of the output table, which contains around 10 million rows and 220 columns
Multiple joins, both inner and left outer, since many lookup tables are used
Please advise which of the methods below I should choose for better performance, readability, and ease of making minor column updates in future production deployments.
Method 1:
Hive on Tez with ORC tables
Python UDF thru TRANSFORM option
Joins with performance tuning like map join
Method 2:
SparkSQL with Parquet format which is converting from text/csv
Scala for UDF
Hope we can perform multiple inner and left outer join in Spark
The best way to implement a solution to your problem is as follows.
For loading the data into the tables, Spark looks like a good option to me. You can read the tables from the Hive metastore, perform the incremental updates using some kind of window functions, and register the results in Hive. Since the data is populated from various lookup tables during ingestion, you can write that logic programmatically in Scala.
But at the end of the day, there needs to be a query engine that is very easy to use. Since your Spark program registers the tables with Hive, you can use Hive.
Hive supports three execution engines:
Spark
Tez
Mapreduce
Tez is mature; Spark is evolving, with various commits from Facebook and the community.
Business users can understand Hive very easily as a query engine, as it is much more mature in the industry.
In short, use Spark for the daily data processing and register the results with Hive.
Create the business users in Hive.
I'm looking for suggestions to approach this problem:
parallel queries using JDBC driver
big (in rows) Postgres table
there is no numeric column to be used as partitionColumn
I would like to read this big table using multiple parallel queries, but there is no obvious numeric column to partition the table on. I thought about using the physical location of the data via CTID, but I'm not sure whether I should follow this path.
The spark-postgres library provides several functions to read/load postgres data. It uses the COPY statement under the hood. As a result it can handle large postgres tables.
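For the CTID idea raised in the question, one possible sketch is to split the table's pages into ranges and give each range to Spark as a separate predicate. The helper name ctid_predicates is hypothetical, and total_pages is assumed to come from an estimate such as relpages in pg_class. The last range is left open-ended because that estimate can be stale.

```python
from typing import List


def ctid_predicates(total_pages: int, num_partitions: int) -> List[str]:
    """Build non-overlapping WHERE clauses that split a table by CTID page range."""
    if total_pages < 0 or num_partitions < 1:
        raise ValueError("total_pages must be >= 0 and num_partitions >= 1")

    # Ceiling division, with at least one page per range.
    step = max(1, -(-total_pages // num_partitions))

    predicates = []
    for i in range(num_partitions):
        lower = i * step
        upper = (i + 1) * step
        if i == num_partitions - 1:
            # Open-ended: also covers pages beyond the estimated page count.
            predicates.append(f"ctid >= '({lower},0)'::tid")
        else:
            predicates.append(
                f"ctid >= '({lower},0)'::tid AND ctid < '({upper},0)'::tid"
            )
    return predicates


# Example: an estimated 1000 pages split across 4 parallel queries.
preds = ctid_predicates(total_pages=1000, num_partitions=4)

# Assumed usage with an existing SparkSession, URL and connection properties:
# df = spark.read.jdbc(url, "big_table", predicates=preds, properties=props)
```

Two caveats apply. CTID values change when rows are updated or the table is rewritten, so this is only safe on a table that is not being modified during the read. And as far as I know, such predicates only avoid a full scan per query on Postgres versions that support TID range scans, so it is worth checking the query plan first.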