I have multiple files in ADLS that I want to combine into a single CSV, but without using Pandas. Is it possible to do this with PySpark?
These files come from an API that returns 225 000 records. I am using this script to convert them to CSV.
You can place all the JSON files in a folder and import them all using: spark.read.option("multiLine", True).json("/path/to/folder").
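If the files already share a flat, consistent schema, the whole job can be a short sketch like the one below (both paths are placeholders; on ADLS they would typically be abfss:// URIs):

# Minimal sketch, assuming the JSON files share one flat, consistent schema.
df = spark.read.option("multiLine", True).json("/path/to/folder")

# coalesce(1) produces a single part file; Spark still writes a directory,
# and the CSV ends up as the part-*.csv file inside it.
(df.coalesce(1)
   .write.mode("overwrite")
   .option("header", True)
   .csv("/path/to/output_csv"))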
However, importing JSON files into a dataframe is a little trickier, as you may not get the desired format; you may have to preprocess the JSON files before import or the Spark dataframe after import.
For example, assume one JSON file per continent:
NA.json
{
  "US": {
    "capital": "Washington, D.C.",
    "population in million": 330
  },
  "Canada": {
    "capital": "Ottawa",
    "population in million": 38
  }
}
EU.json
{
  "England": {
    "capital": "London",
    "population in million": 56
  },
  "France": {
    "capital": "Paris",
    "population in million": 67
  }
}
AUS.json
{
  "Australia": {
    "capital": "Canberra",
    "population in million": 25
  },
  "New Zealand": {
    "capital": "Wellington",
    "population in million": 5
  }
}
These files get imported with the root JSON object keys mapped to columns and the nested JSON data mapped to nested structs:
df = spark.read.option("multiLine", True).json("/content/sample_data/json")
+--------------+------------+------------+-----------+---------------+--------------------+
| Australia| Canada| England| France| New Zealand| US|
+--------------+------------+------------+-----------+---------------+--------------------+
|{Canberra, 25}| null| null| null|{Wellington, 5}| null|
| null|{Ottawa, 38}| null| null| null|{Washington, D.C....|
| null| null|{London, 56}|{Paris, 67}| null| null|
+--------------+------------+------------+-----------+---------------+--------------------+
Depending on the structure of your JSON files, you will have to deal with two things:
Mapping the JSON objects to a generalised schema, to prevent fragmented column names like those above.
Preventing nested data from being mapped as nested structs, as shown with {Canberra, 25} above. You may want to transform this to bring all the data into tabular form, as sketched below.
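As a rough sketch of both points for the toy continent files above (the column and struct field names below come from that example, not from real data), one could unpivot the struct columns into rows and then write a single CSV file:

from functools import reduce
from pyspark.sql import functions as F

df = spark.read.option("multiLine", True).json("/content/sample_data/json")

# One small dataframe per country column, pulling the struct fields out,
# then union them all into a single tabular dataframe.
per_country = [
    df.select(
        F.lit(c).alias("country"),
        df[c]["capital"].alias("capital"),
        df[c]["population in million"].alias("population_in_million"),
    ).where(df[c].isNotNull())
    for c in df.columns
]
tidy = reduce(lambda a, b: a.unionByName(b), per_country)

# coalesce(1) gives a single part-*.csv file inside the output directory.
tidy.coalesce(1).write.mode("overwrite").option("header", True).csv("/path/to/output_csv")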
Related
I have a table in PostgreSQL with the following schema:
Table "public.kc_ds"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+-----------------------+-----------+----------+-----------------------------------+----------+--------------+-------------
id | integer | | not null | nextval('kc_ds_id_seq'::regclass) | plain | |
num | integer | | not null | | plain | |
text | character varying(50) | | not null | | extended | |
Indexes:
"kc_ds_pkey" PRIMARY KEY, btree (id)
Publications:
"dbz_publication"
When I run a Debezium source connector for this table that uses io.confluent.connect.avro.AvroConverter and the Schema Registry, it creates a Schema Registry schema that looks like this (some fields are omitted here):
"fields":[
{
"name":"before",
"type":[
"null",
{
"type":"record",
"name":"Value",
"fields":[
{
"name":"id",
"type":"int"
},
{
"name":"num",
"type":"int"
},
{
"name":"text",
"type":"string"
}
],
"connect.name":"xxx.public.kc_ds.Value"
}
],
"default":null
},
{
"name":"after",
"type":[
"null",
"Value"
],
"default":null
},
]
The messages in my Kafka topic that are produced by Debezium look like this (some fields are omitted):
{
  "before": null,
  "after": {
    "xxx.public.kc_ds.Value": {
      "id": 2,
      "num": 2,
      "text": "text version 1"
    }
  }
}
When I INSERT or UPDATE, "before" is always null, and "after" contains my data; when I DELETE, the inverse holds true: "after" is null and "before" contains the data (although all fields are set to default values).
Question #1: Why does Kafka Connect create a schema with "before" and "after" fields? Why do those fields behave in such a weird way?
Question #2: Is there a built-in way to make Kafka Connect send flat messages to my topics while still using Schema Registry? Please note that the Flatten transform is not what I need: if enabled, I will still have the "before" and "after" fields.
Question #3 (not actually hoping for anything, but maybe someone knows): The necessity to flatten my messages comes from the fact that I need to read the data from my topics using HudiDeltaStreamer, and it seems like this tool expects flat input data. The "before" and "after" fields end up being separate object-like columns in the resulting .parquet files. Does anyone have any idea how HudiDeltaStreamer is supposed to integrate with messages produced by Kafka Connect?
I'm moving data from one collection to another collection in a different cluster using Spark. The data's schema is not consistent (I mean that a single collection contains a few different schemas, with small variations in data types). When I try to read the data with Spark, the sampling is unable to capture all the schemas of the data and throws the error below. (I have a complex schema which I can't specify explicitly; instead, Spark infers it by sampling.)
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast ARRAY into a NullType (value: BsonArray{values=[{ "type" : "GUEST_FEE", "appliesPer" : "GUEST_PER_NIGHT", "description" : null, "minAmount" : 33, "maxAmount" : 33 }]})
I tried reading the collection as an RDD and writing it back as an RDD, but the issue persists.
Any help with this would be appreciated!
Thanks.
All of these com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast SOME_TYPE into a NullType errors come from incorrect schema inference. For schema-less data sources such as JSON files or MongoDB, Spark scans a small fraction of the data to determine the types. If some particular field has lots of NULLs, you can get unlucky and the type will be set to NullType.
One thing you can do is increase the number of entries scanned for schema inference.
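For the MongoDB connector used below, that is (as far as I know) the sampleSize read option; a minimal sketch, with a placeholder URI:

# Sketch: raise the number of documents sampled for schema inference so that
# rarely-populated fields are less likely to be inferred as NullType.
# The URI is a placeholder; "sampleSize" is the 2.x connector's read option.
df = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .option("uri", "mongodb://localhost:27017/db.collection") \
    .option("sampleSize", 100000) \
    .load()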
Another option is to get the inferred schema first, fix it, and reload the dataframe with the fixed schema:
import pyspark.sql.types

# Recursively replace every NullType in the inferred schema with StringType
def fix_spark_schema(schema):
    if schema.__class__ == pyspark.sql.types.StructType:
        return pyspark.sql.types.StructType([fix_spark_schema(f) for f in schema.fields])
    if schema.__class__ == pyspark.sql.types.StructField:
        return pyspark.sql.types.StructField(schema.name, fix_spark_schema(schema.dataType), schema.nullable)
    if schema.__class__ == pyspark.sql.types.NullType:
        return pyspark.sql.types.StringType()
    return schema

# Read once to let Spark infer the schema ...
collection_schema = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load() \
    .schema

# ... then reload with the fixed schema
collection = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .options(...) \
    .load(schema=fix_spark_schema(collection_schema))
In my case, all the problematic fields could be represented with StringType; you may need to make the logic more complex for your data.
As far as I understood your problem:
* either Spark incorrectly detected your schema and considered some fields required (nullable = false); in that case, you can still define the schema explicitly and set nullable to true (see the sketch after this list). This works if your schema was evolving and at some point in the past you added or removed a field while keeping the column types (e.g. a String stays a String and does not become a Struct or some other completely different type),
* or your schemas are completely inconsistent, i.e. a String field turned at some point into a Struct or another completely different type. In that case I don't see another solution than to use the RDD abstraction, work with very permissive types such as Any in Scala (Object in Java), and use isInstanceOf tests to normalise all fields into one common format.
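A sketch of the first option, declaring the schema explicitly with every field nullable (the field names are guessed from the error message in the question, so treat them as placeholders):

from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

# Every field is nullable=True, so documents that omit a field simply
# produce null instead of tripping up schema inference.
fee_type = StructType([
    StructField("type", StringType(), True),
    StructField("appliesPer", StringType(), True),
    StructField("description", StringType(), True),
    StructField("minAmount", IntegerType(), True),
    StructField("maxAmount", IntegerType(), True),
])

explicit_schema = StructType([
    StructField("fees", StructType([
        StructField("main", ArrayType(fee_type), True),
    ]), True),
])

df = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .option("uri", "mongodb://localhost:27017/db.collection") \
    .load(schema=explicit_schema)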
Actually, I also see another possible solution, but only if you know which data has which schema. For instance, if you know that data between 2018-01-01 and 2018-02-01 uses schema#1 and the rest uses schema#2, you can write a pipeline that transforms schema#1 into schema#2. Later you can simply union both datasets and apply your transformations to consistently structured values.
Edit:
I've just tried code similar to what you describe, and it worked correctly against my local MongoDB instance:
// Imports needed by the snippet below
import com.mongodb.spark._
import com.mongodb.spark.config.WriteConfig
import org.bson.Document

val sc = getSparkContext(Array("mongodb://localhost:27017/test.init_data"))

// Load sample data
val docFees =
  """
    | {"fees": null}
    | {"fees": { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ]} }
  """.stripMargin.trim.split("[\\r\\n]+").toSeq
MongoSpark.save(sc.parallelize(docFees.map(Document.parse)))

// Read the collection back and copy it into the target collection
val rdd = MongoSpark.load(sc)
rdd.saveToMongoDB(WriteConfig(Map("uri" -> "mongodb://localhost:27017/test.new_coll_data", "replaceDocument" -> "true")))
And when I checked the result in the MongoDB shell, I got:
> coll = db.init_data;
test.init_data
> coll.find();
{ "_id" : ObjectId("5b33d415ea78632ff8452c60"), "fees" : { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ] } }
{ "_id" : ObjectId("5b33d415ea78632ff8452c61"), "fees" : null }
> coll = db.new_coll_data;
test.new_coll_data
> coll.find();
{ "_id" : ObjectId("5b33d415ea78632ff8452c60"), "fees" : { "main" : [ { "type" : "misc", "appliesPer" : "trip", "description" : null, "minAmount" : 175, "maxAmount" : 175 } ] } }
{ "_id" : ObjectId("5b33d415ea78632ff8452c61"), "fees" : null }
I have a collection col that contains documents like:
{
    '_id': ObjectId(...),
    'type': "a",
    'f1': data1
}
and, in the same collection, documents like:
{
    '_id': ObjectId(...),
    'f2': 222.234,
    'type': "b"
}
The Spark MongoDB connector is not working correctly: it reorders the data into the wrong fields. For example, given:
{
    '_id': ObjectId(...),
    'type': "a",
    'f1': data1
}
{
    '_id': ObjectId(...),
    'f1': data2,
    'type': "a"
}
the resulting RDD will be:
------------------------
| id | f1 | type |
------------------------
| .... | a | data1 |
| .... | data2 | a |
------------------------
Are there any suggestions for working with a polymorphic schema?
Are there any suggestions for working with a polymorphic schema?
(Opinion alert) The best suggestion is not to have one in the first place. It is impossible to maintain in the long term, extremely error-prone, and requires complex compensation on the client side.
What to do if you have one:
You can try using the Aggregation Framework with $project to sanitise the data before it is fetched into Spark. See the Aggregation section of the docs for an example, and the sketch after these options.
Don't try to couple it with a structured format. Use RDDs, fetch the data as plain Python dicts, and deal with the problem manually.
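A sketch of the first approach, assuming the connector's pipeline read option (shown in the Aggregation section of its docs); the field names and the URI are placeholders:

# Sketch: push a $project stage down to MongoDB so every document reaches
# Spark with the same explicitly chosen set of fields.
pipeline = '[{ "$project": { "type": 1, "f1": 1, "f2": 1 } }]'

df = sqlContext.read \
    .format("com.mongodb.spark.sql") \
    .option("uri", "mongodb://localhost:27017/db.col") \
    .option("pipeline", pipeline) \
    .load()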
I am using Pig to read Avro files and normalize/transform the data before writing it back out. The Avro files have records of the form:
{
  "type": "record",
  "name": "KeyValuePair",
  "namespace": "org.apache.avro.mapreduce",
  "doc": "A key/value pair",
  "fields": [
    {
      "name": "key",
      "type": "string",
      "doc": "The key"
    },
    {
      "name": "value",
      "type": {
        "type": "map",
        "values": "bytes"
      },
      "doc": "The value"
    }
  ]
}
I have used the AvroTools command-line utility in conjunction with jq to dump the first record to JSON:
$ java -jar avro-tools-1.8.1.jar tojson part-m-00000.avro | ./jq --compact-output 'select(.value.pf_v != null)' | head -n 1 | ./jq .
{
"key": "some-record-uuid",
"value": {
"pf_v": "v1\u0003Basic\u0001slcvdr1rw\u001a\u0004v2\u0003DayWatch\u0001slcva2omi\u001a\u0004v3\u0003Performance\u0001slc1vs1v1w1p1g1i\u0004v4\u0003Fundamentals\u0001snlj1erwi\u001a\u0004v5\u0003My Portfolio\u0001svr1dews1b2b3k1k2\u001a\u0004v0\u00035"
}
}
I run the following pig commands:
REGISTER avro-1.8.1.jar
REGISTER json-simple-1.1.1.jar
REGISTER piggybank-0.15.0.jar
REGISTER jackson-core-2.8.6.jar
REGISTER jackson-databind-2.8.6.jar
DEFINE AvroLoader org.apache.pig.piggybank.storage.avro.AvroStorage();
AllRecords = LOAD 'part-m-00000.avro'
USING AvroLoader()
AS (key: chararray, value: map[]);
Records = FILTER AllRecords BY value#'pf_v' is not null;
SmallRecords = LIMIT Records 10;
DUMP SmallRecords;
The corresponding record in the output of the last command above is as follows:
...
(some-record-uuid,[pf_v#v03v1Basicslcviv2DayWatchslcva2omiv3Performanceslc1vs1v1w1p1g1i])
...
As you can see, the Unicode characters have been removed from the pf_v value. These characters are actually used as delimiters in the values, so I need them in order to fully parse the records into their desired normalized state. The Unicode characters are clearly present in the encoded .avro file (as demonstrated by dumping the file to JSON). Is anybody aware of a way to get AvroStorage to not strip the Unicode characters when loading records?
Thank you!
Update:
I have also performed the same operation using Avro's python DataFileReader:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
reader = DataFileReader(open("part-m-00000.avro", "rb"), DatumReader())
for rec in reader:
    if 'some-record-uuid' in rec['key']:
        print rec
        print '--------------------------------------------'
        break
reader.close()
This prints a dict with what looks like hex chars substituted for the unicode chars (which is preferable to removing them entirely):
{u'value': {u'pf_v': 'v0\x033\x04v1\x03Basic\x01slcvi\x1a\x04v2\x03DayWatch\x01slcva2omi\x1a\x04v3\x03Performance\x01slc1vs1v1w1p1g1i\x1a'}, u'key': u'some-record-uuid'}
I am trying to import data from MongoDB into a relational DB (SQL Server).
I don't have access to the MongoDB components, so I am querying my collection with the MongoDB Java driver in a tJava component.
I get a List<DBObject>, which I send to a tExtractJSONFields component.
An object in my collection looks like this:
[
  {
    "_id": {
      "$oid": "1564t8re13e4ter86"
    },
    "object": {
      "shop": "shop1",
      "domain": "Divers",
      "sell": [
        {
          "location": {
            "zipCode": "58000",
            "city": "NEVERS"
          },
          "properties": {
            "description": "ddddd!!!!",
            "id": "f1re67897116fre87"
          },
          "employee": [
            {
              "name": "employee1",
              "id": "245975"
            },
            {
              "name": "employee2",
              "id": "458624"
            }
          ],
          "customer": {
            "name": "Customer1",
            "custid": "test_réf"
          }
        }
      ]
    }
  }
]
For a sell, there can be several employees. I have an array of employees, and I want to store the affected employees in another table, so I would have two tables:
Sell
oid | shop | domain | zipCode | ...
1564t8re13e4ter86 | shop1 | Divers | 58000 | ...
Affected employee
employee_id | employee_name | oid
245975 | employee1 | 1564t8re13e4ter86
458624 | employee2 | 1564t8re13e4ter86
So I want to loop over the employee array with a JSONPath query:
"$[*].object.sell[0].employee"
The problem is that, done this way, I can't get the object id. It seems that I can't read an attribute from a parent node when I define my JSONPath query like this.
I also saw that I could do it as shown in the following link:
http://techpoet.blogspot.ro/2014/06/dealing-with-nested-documents-in.html?utm_content=buffer02d59&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
But I don't understand how he gets the object_id at the lower levels.
How can I do this?
My tests with JSONPath failed as well, but I think this must be a bug in the component, because when I query
$..['$oid'] I get back [].
This seems to be the case whenever you try to get a node that is on a higher level than the loop query.