I have a PostgreSQL database with ~1000 different tables. I'd like to export all of these tables, and the data inside them, into Parquet files.
To do this, I'm going to read each table into a DataFrame and then store that DataFrame in a Parquet file. Many of the PostgreSQL tables contain user-defined types.
The biggest issue is that I can't manually specify the schema for each DataFrame. Will Apache Spark, in this case, be able to automatically infer the PostgreSQL table schemas and store them appropriately in Parquet format, or is this impossible with Apache Spark so that some other technology must be used for this purpose?
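For concreteness, a minimal sketch of the intended loop (written here in PySpark purely for brevity; the connection details, schema filter, and output path are illustrative, not taken from my actual setup):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pg-to-parquet").getOrCreate()

jdbc_url = "jdbc:postgresql://localhost/sparktest"   # illustrative connection string
props = {"user": "user", "password": "password", "driver": "org.postgresql.Driver"}

# Pull the list of user tables from information_schema via the same JDBC source.
tables_df = spark.read.jdbc(
    jdbc_url,
    "(select table_name from information_schema.tables where table_schema = 'public') t",
    properties=props)

for row in tables_df.collect():
    table = row["table_name"]
    df = spark.read.jdbc(jdbc_url, table, properties=props)
    df.write.mode("overwrite").parquet("/export/{}.parquet".format(table))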
UPDATED
I have created the following PostgreSQL user-defined type, table and records:
create type dimensions as (
  width integer,
  height integer,
  depth integer
);

create table moving_boxes (
  id serial primary key,
  dims dimensions not null
);
insert into moving_boxes (dims) values (row(3,4,5)::dimensions);
insert into moving_boxes (dims) values (row(1,4,2)::dimensions);
insert into moving_boxes (dims) values (row(10,12,777)::dimensions);
I implemented the following Spark application:
// this gives a one-partition Dataset
val opts = Map(
  "url" -> "jdbc:postgresql:sparktest",
  "dbtable" -> "moving_boxes",
  "user" -> "user",
  "password" -> "password")

val df = spark.read
  .format("jdbc")
  .options(opts)
  .load()
df.printSchema()
df.write.mode(SaveMode.Overwrite).format("parquet").save("moving_boxes.parquet")
This is the df.printSchema output:
root
|-- id: integer (nullable = true)
|-- dims: string (nullable = true)
As you can see, Spark infers the dims schema as string rather than as a complex nested type.
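(The composite presumably reaches Spark as its PostgreSQL text form, e.g. "(3,4,5)", which is why the column arrives as a plain string.) If the struct needs to be recovered after loading, one workaround sketch (PySpark here for brevity; field names taken from the dimensions type above) is to split that text back apart:
from pyspark.sql import functions as F

# dims arrives as the composite's text representation, e.g. "(3,4,5)"
parts = F.split(F.regexp_replace(F.col("dims"), r"[()]", ""), ",")
parsed = df.withColumn("dims", F.struct(
    parts.getItem(0).cast("int").alias("width"),
    parts.getItem(1).cast("int").alias("height"),
    parts.getItem(2).cast("int").alias("depth")))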
This is the log information from ParquetWriteSupport:
18/11/06 10:08:52 INFO ParquetWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
  "type" : "struct",
  "fields" : [ {
    "name" : "id",
    "type" : "integer",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "dims",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  } ]
}
and the corresponding Parquet message type:
message spark_schema {
  optional int32 id;
  optional binary dims (UTF8);
}
Could you please explain whether the original complex dims type (defined in PostgreSQL) will be lost in the saved Parquet file?
Related
I am trying to write a PySpark DataFrame to S3 in Hudi Parquet format. Everything is working fine; however, the timestamps are written in binary format. I would like to write them in Hive timestamp format so that I can query the data in Athena.
My PySpark config is as follows:
LOCAL_SPARK_CONF = (
    SparkConf()
    .set(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.2.2,org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.1,org.apache.spark:spark-avro_2.12:3.0.2",
    )
    .set("spark.hadoop.fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.sql.hive.convertMetastoreParquet", "false")
)
The Hudi options are as follows:
hudi_options = {
    "hoodie.table.name": hudi_table,
    "hoodie.datasource.write.recordkey.field": "hash",
    "hoodie.datasource.write.partitionpath.field": "version, date",
    "hoodie.datasource.write.table.name": hudi_table,
    "hoodie.datasource.hive_sync.support_timestamp": "true",
    "hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS",
    "hoodie.index.type": "GLOBAL_BLOOM",  # This is required if we want to ensure we upsert a record, even if the partition changes
    "hoodie.bloom.index.update.partition.path": "true",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "data_timestamp",
    "hoodie.upsert.shuffle.parallelism": 2,
    "hoodie.insert.shuffle.parallelism": 2,
}
From reading the documentation, "hoodie.datasource.hive_sync.support_timestamp": "true" should maintain Hive timestamps and "hoodie.parquet.outputtimestamptype": "TIMESTAMP_MILLIS" should control the output format. However, when I subsequently check the data, it's still a binary timestamp. How can I avoid this?
I write the data as follows:
df.write.format("org.apache.hudi").options(**hudi_options).mode("overwrite").save(basePath)
I have tried converting the data to a string and writing it as a string type, but it still converts to binary.
df = df.withColumn("data_timestamp", col("data_timestamp").cast(StringType()))
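For reference, a sketch of the opposite direction that might be worth trying, assuming the binary values come from Spark's default INT96 Parquet timestamps (this is an assumption, not something the Hudi docs quoted above confirm): cast the column to a real TimestampType and ask Spark itself for TIMESTAMP_MILLIS output.
from pyspark.sql import functions as F

# Assumption: data_timestamp holds a parseable timestamp string; cast it to a
# true TimestampType instead of StringType before handing the frame to Hudi.
df = df.withColumn("data_timestamp", F.to_timestamp("data_timestamp"))

# spark.sql.parquet.outputTimestampType is a standard Spark SQL setting
# (default INT96, which readers often surface as binary); whether the Hudi
# writer honours it on this path is an assumption to verify.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")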
I am getting data from MongoDB using this query:
db.objects.find({ _key: { $in: ["user:130"] } }, { _id: 0, uid: 1, username: 1 }).pretty();
Now I need to get the same data in Spark.
val readConf = ReadConfig(Map("uri" -> host, "database" -> "nodebb", "collection" -> "objects"))
val data = spark.read.mongo(readConf)
This is giving me the complete data from MongoDB.
How can I apply that query as well?
Thanks
If, for example, you just want to filter some records, you can use .filter on your df.
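For instance, a minimal filter sketch mirroring the Mongo query from the question (the _key, uid, and username field names are taken from that query; the exact layout of the loaded DataFrame is an assumption):
from pyspark.sql import functions as F

# Keep the document(s) whose _key is "user:130" and project two fields, mirroring
# db.objects.find({_key: {$in: ["user:130"]}}, {_id: 0, uid: 1, username: 1})
filtered = (df
    .filter(F.col("_key").isin("user:130"))
    .select("uid", "username"))
filtered.show()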
If you want to use SQL queries on the data loaded from Mongo, you can create a temp view from your df and then query it with spark.sql:
df.createOrReplaceTempView("temp")
some_fruit = spark.sql("SELECT type, qty FROM temp WHERE type LIKE '%e%'")
some_fruit.show()
More details in the documentation: MongoDB Spark connector documentation
I'm moving data from one collection to another in a different cluster using Spark. The data's schema is not consistent (I mean that a single collection contains a few schemas with slightly different data types). When I try to read the data with Spark, the sampling is unable to capture all of the data's schemas and throws the error below. (I have a complex schema which I can't specify explicitly, so Spark infers it by sampling.)
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a NullType (value: BsonArray{values=[{ "type" : "GUEST_FEE", "appliesPer" : "GUEST_PER_NIGHT", "description" : null, "minAmount" : 33, "maxAmount" : 33 }]})
I tried reading the collection as an RDD and writing it as an RDD, but the issue persists.
Any help on this!
Thanks.
All of these com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast SOME_TYPE into a NullType errors come from incorrect schema inference. For schema-less data sources such as JSON files or MongoDB, Spark scans a small fraction of the data to determine the types. If a particular field has lots of NULLs, you can get unlucky and the type will be set to NullType.
One thing you can do is increase the number of entries scanned for schema inference.
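A hedged sketch of that first suggestion: if I recall the option name correctly, the MongoDB Spark connector exposes a sampleSize read option controlling how many documents are sampled for inference, so raising it makes it less likely that a sparsely populated field is inferred as NullType (the uri and sample size below are illustrative):
collection = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .option("uri", "mongodb://localhost:27017/mydb.mycoll") \
    .option("sampleSize", 100000) \
    .load()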
Another option is to get the inferred schema first, fix it, and reload the DataFrame with the fixed schema:
import pyspark.sql.types

def fix_spark_schema(schema):
    # Recursively replace every NullType in the inferred schema with StringType
    if schema.__class__ == pyspark.sql.types.StructType:
        return pyspark.sql.types.StructType([fix_spark_schema(f) for f in schema.fields])
    if schema.__class__ == pyspark.sql.types.StructField:
        return pyspark.sql.types.StructField(schema.name, fix_spark_schema(schema.dataType), schema.nullable)
    if schema.__class__ == pyspark.sql.types.NullType:
        return pyspark.sql.types.StringType()
    return schema

collection_schema = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load() \
    .schema

collection = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load(schema=fix_spark_schema(collection_schema))
In my case all problematic fields could be represented with StringType; you can make the logic more complex if needed.
As far as I understood your problem:
* Either Spark incorrectly detected your schema and considered some fields required (nullable = false). In that case, you can still define the schema explicitly and set nullable to true. This would work if your schema was evolving and at some point in the past you added or removed a field while keeping the column type (e.g. a String is always a String and never a Struct or some other completely different type).
* Or your schemas are completely inconsistent, i.e. a String field turned at some point into a Struct or some other completely different type. In that case I don't see any solution other than using the RDD abstraction, working with very permissive types such as Any in Scala (Object in Java), and using isInstanceOf tests to normalize all fields into one common format (a rough sketch follows this list).
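A rough PySpark analogue of that second bullet (the bullet mentions Scala's Any and isInstanceOf; isinstance is the Python counterpart here, and both the fees field and the raw_rdd of plain dicts loaded via the connector's RDD API are illustrative assumptions):
def normalize(doc):
    # Coerce the inconsistent field into one common shape: a list of dicts.
    fees = doc.get("fees")
    if isinstance(fees, list):        # already the "new" shape
        doc["fees"] = fees
    elif isinstance(fees, dict):      # old shape: wrap the single object in a list
        doc["fees"] = [fees]
    else:                             # missing / null / unexpected type
        doc["fees"] = []
    return doc

normalized_rdd = raw_rdd.map(normalize)  # raw_rdd assumed loaded via the connector's RDD API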
Actually, I also see another possible solution, but only if you know which data has which schema. For instance, if you know that data between 2018-01-01 and 2018-02-01 uses schema#1 and the rest uses schema#2, you can write a pipeline that transforms schema#1 into schema#2. Later you could simply union both datasets and apply your transformations to consistently structured values.
Edit:
I've just tried code similar to what you provided and it worked correctly on my local MongoDB instance:
val sc = getSparkContext(Array("mongodb://localhost:27017/test.init_data"))

// Load sample data
import com.mongodb.spark._
import com.mongodb.spark.config.WriteConfig
import org.bson.Document

val docFees =
  """
    | {"fees": null}
    | {"fees": { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ]} }
  """.stripMargin.trim.split("[\\r\\n]+").toSeq

MongoSpark.save(sc.parallelize(docFees.map(Document.parse)))

val rdd = MongoSpark.load(sc)
rdd.saveToMongoDB(WriteConfig(Map("uri" -> "mongodb://localhost:27017/test.new_coll_data", "replaceDocument" -> "true")))
And when I checked the result in MongoDB shell I got:
> coll = db.init_data;
test.init_data
> coll.find();
{ "_id" : ObjectId("5b33d415ea78632ff8452c60"), "fees" : { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ] } }
{ "_id" : ObjectId("5b33d415ea78632ff8452c61"), "fees" : null }
> coll = db.new_coll_data;
test.new_coll_data
> coll.find();
{ "_id" : ObjectId("5b33d415ea78632ff8452c60"), "fees" : { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ] } }
{ "_id" : ObjectId("5b33d415ea78632ff8452c61"), "fees" : null }
I have the following data in table dashboard_data. The column name is mr_data.
{"priority_id": "123", "urgent_problem_id": "111", "important_problem_id": "222"}
{"priority_id": "456", "urgent_problem_id": "", "important_problem_id": "333"}
{"priority_id": "789", "urgent_problem_id": "444", "important_problem_id": ""}
Query:
UPDATE
dashboard_data
SET
mr_data = replace(dashboard_data.mr_data,'urgent_problem_id','urgent_problem_ids')
WHERE
mr_data->>'urgent_problem_id' IS NOT NULL;
Expected result:
{"priority_id": "123", "urgent_problem_ids": {"111"}, "important_problem_ids": {"222"}}
{"priority_id": "456", "urgent_problem_ids": {""}, "important_problem_ids": {"333"}}
{"priority_id": "789", "urgent_problem_ids": {"444"}, "important_problem_ids": {""}}
Is there any way that, during the replace, we can get the {} representation of the data as shown in the expected result?
Assuming Postgres 9.5 or newer.
You can use jsonb_set to add a proper JSON array under a new key, then remove the old key from the JSON (which is the only way to rename a key):
update dashboard_data
set mr_data = jsonb_set(mr_data, '{urgent_problem_ids}',
jsonb_build_array(mr_data -> 'urgent_problem_id'), true)
- 'urgent_problem_id'
where mr_data ? 'urgent_problem_id';
jsonb_build_array(mr_data -> 'urgent_problem_id') creates a proper JSON array containing the (single) value from urgent_problem_id. That value is then stored under the new key urgent_problem_ids, and finally the old key urgent_problem_id is removed using the - operator.
Online example: http://rextester.com/POG52716
If your column is not jsonb (which it should be), then you need to cast the column inside jsonb_set() and cast the result back to json.
I am using Pig to read Avro files and normalize/transform the data before writing it back out. The Avro files have records of the form:
{
  "type" : "record",
  "name" : "KeyValuePair",
  "namespace" : "org.apache.avro.mapreduce",
  "doc" : "A key/value pair",
  "fields" : [ {
    "name" : "key",
    "type" : "string",
    "doc" : "The key"
  }, {
    "name" : "value",
    "type" : {
      "type" : "map",
      "values" : "bytes"
    },
    "doc" : "The value"
  } ]
}
I have used the AvroTools command-line utility in conjunction with jq to dump the first record to JSON:
$ java -jar avro-tools-1.8.1.jar tojson part-m-00000.avro | ./jq --compact-output 'select(.value.pf_v != null)' | head -n 1 | ./jq .
{
  "key": "some-record-uuid",
  "value": {
    "pf_v": "v1\u0003Basic\u0001slcvdr1rw\u001a\u0004v2\u0003DayWatch\u0001slcva2omi\u001a\u0004v3\u0003Performance\u0001slc1vs1v1w1p1g1i\u0004v4\u0003Fundamentals\u0001snlj1erwi\u001a\u0004v5\u0003My Portfolio\u0001svr1dews1b2b3k1k2\u001a\u0004v0\u00035"
  }
}
I run the following Pig commands:
REGISTER avro-1.8.1.jar
REGISTER json-simple-1.1.1.jar
REGISTER piggybank-0.15.0.jar
REGISTER jackson-core-2.8.6.jar
REGISTER jackson-databind-2.8.6.jar
DEFINE AvroLoader org.apache.pig.piggybank.storage.avro.AvroStorage();
AllRecords = LOAD 'part-m-00000.avro'
USING AvroLoader()
AS (key: chararray, value: map[]);
Records = FILTER AllRecords BY value#'pf_v' is not null;
SmallRecords = LIMIT Records 10;
DUMP SmallRecords;
The corresponding output from the last command above is as follows:
...
(some-record-uuid,[pf_v#v03v1Basicslcviv2DayWatchslcva2omiv3Performanceslc1vs1v1w1p1g1i])
...
As you can see, the Unicode characters have been removed from the pf_v value. These characters are actually used as delimiters within the values, so I need them in order to fully parse the records into their desired normalized state. The Unicode characters are clearly present in the encoded .avro file (as demonstrated by dumping the file to JSON). Is anybody aware of a way to get AvroStorage to not strip these characters when loading records?
Thank you!
Update:
I have also performed the same operation using Avro's Python DataFileReader:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open("part-m-00000.avro", "rb"), DatumReader())
for rec in reader:
    if 'some-record-uuid' in rec['key']:
        print rec
        print '--------------------------------------------'
        break
reader.close()
This prints a dict with what look like hex escapes in place of the Unicode characters (which is preferable to removing them entirely):
{u'value': {u'pf_v': 'v0\x033\x04v1\x03Basic\x01slcvi\x1a\x04v2\x03DayWatch\x01slcva2omi\x1a\x04v3\x03Performance\x01slc1vs1v1w1p1g1i\x1a'}, u'key': u'some-record-uuid'}
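For what it's worth, the \xNN escapes in this repr and the \u00NN escapes in the JSON dump refer to the same control characters; Python 2 simply prints them in \xNN form, so the delimiters appear to be intact here. A quick check with a shortened, illustrative sample value:
# \u0003 and \x03 are the same code point; only the printed escape differs.
s = u"v1\u0003Basic\u0001slcvdr1rw\u001a"
print(repr(s))             # u'v1\x03Basic\x01slcvdr1rw\x1a'
print(s.split(u"\u0003"))  # the delimiter survives and can be split on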