Spark - Recursive reference in Avro schema - scala

I am trying to read an Avro file, but I get the error below when using the following code.
val expected_avro=spark.read.format("avro").load("avro_file_path")
Error:
Found recursive reference in Avro schema, which can not be processed by Spark:
Below is the sample schema:
{
"type" : "record",
"name" : "ABC",
"namespace" : "com.abc.xyzlibrary",
"doc" : "Description Sample",
"fields" : [ {
"name" : "id",
"type" : "int",
"default" : 0
},
{
"name" : "location",
"type" : "string",
"default" : ""
}]
}
Is it because we have "name" twice (once at the root level and once inside the fields section)? Any leads on how to get this fixed?
Thanks in advance.
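Note that having "name" at both the record level and inside each field is normal Avro and is not what triggers this error; Spark raises it when a record's type refers back to itself (for example a field of type "ABC" inside the record "ABC"), which cannot be flattened into a DataFrame schema. As a minimal Scala sketch of one possible workaround, assuming the real schema (unlike the snippet above) contains such a self-referencing field: hand-edit a copy of the schema with the recursive branch removed and pass it to the reader via the avroSchema option.

// Sketch only: prunedSchema is a hand-edited copy of the Avro schema in which
// the assumed self-referencing field has been dropped; Avro schema resolution
// lets the reader ignore writer fields that are missing from the reader schema.
val prunedSchema =
  """{
    |  "type": "record",
    |  "name": "ABC",
    |  "namespace": "com.abc.xyzlibrary",
    |  "fields": [
    |    { "name": "id",       "type": "int",    "default": 0  },
    |    { "name": "location", "type": "string", "default": "" }
    |  ]
    |}""".stripMargin

val expected_avro = spark.read
  .format("avro")
  .option("avroSchema", prunedSchema)  // read with the non-recursive schema
  .load("avro_file_path")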

Related

Find depth of tree in MongoDB documents using 'spring data mongo'

We have documents in the following structure. I am trying to write a service using spring-data-mongo that traverses the documents and prints a report as shown below (in the output).
db.getCollection("tree").find({})
// Result
{
"_id" : ObjectId("5467cswrtr")
"name" : "energy1",
"rootId" : "10",
"parent_id" : null,
"id" : "10"
},
{
"_id" : ObjectId("5467cswrt54r")
"name" : "energy2",
"rootId" : "10",
"parent_id" : 10,
"id" : "20"
},
{
"_id" : ObjectId("54cswrtr")
"name" : "energy3",
"rootId" : "10",
"parent_id" : "20",
"id" : "30"
},
{
"_id" : ObjectId("54cswrtr")
"name" : "energy4",
"rootId" : "10",
"parent_id" : "20",
"id" : "40"
}
We can visualize the above documents as a tree, as shown below. We may have many such trees.
Tree Structure:
We are using Mongo version 3.2.12. I need to print the following report by traversing the documents using spring data mongo:
Name rootId Depth
============================
energy1 10 3
Agriculture 99 36
-- and so on
PS: I searched on Google and found out about $graphLookup, but it was introduced in MongoDB 3.4 and is not available in 3.2.12.
Please help.
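Without $graphLookup, one approach is to fetch the tree documents (for example with MongoTemplate's find/findAll in spring-data-mongo) and compute the depth on the application side. Below is a rough sketch of just the depth calculation, written in Scala for brevity; the TreeNode case class and depthsByRoot helper are names invented for this sketch, the node shape is assumed from the sample documents, and the same traversal translates directly to Java.

// Rough sketch: compute the maximum depth per root from documents already
// fetched from the "tree" collection. TreeNode mirrors the sample documents.
case class TreeNode(id: String, name: String, rootId: String, parentId: Option[String])

def depthsByRoot(nodes: Seq[TreeNode]): Map[String, Int] = {
  // Index children by their parent's id
  val childrenOf: Map[String, Seq[TreeNode]] =
    nodes.filter(_.parentId.isDefined).groupBy(_.parentId.get)

  // Depth of a node = 1 for a leaf, otherwise 1 + deepest child
  def depth(node: TreeNode): Int =
    childrenOf.getOrElse(node.id, Seq.empty) match {
      case Seq()    => 1
      case children => 1 + children.map(depth).max
    }

  // Roots have no parent; report the depth of each tree keyed by rootId
  nodes.filter(_.parentId.isEmpty)
       .map(root => root.rootId -> depth(root))
       .toMap
}

// For the four sample documents above this yields Map("10" -> 3),
// matching the "energy1 10 3" row in the expected report.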

NiFi: Converting a Nested JSON File into CSV

I'm new to NiFi. In my case I have to consume a JSON topic from Kafka
and would like to convert it into a CSV file, where I need to select only a few scalar fields and some nested fields.
I need to do the following things:
Consume topic - Done
JSON to CSV
Include header in the CSV file
Merge into a single file (if it's split)
Give a proper filename with date
I am following the link below:
https://community.hortonworks.com/articles/64069/converting-a-large-json-file-into-csv.html
But I'm not sure if this is the right approach, and I also don't know how to produce a single file.
I'm using NiFi 1.8, and the schema is stored in the Confluent Schema Registry.
Schema sample:
{
"type" : "record",
"name" : "Customer",
"namespace" : "namespace1"
"fields" : [ {
"name" : "header",
"type" : {
"type" : "record",
"name" : "CustomerDetails",
"namespace" : "namespace1"
"fields" : [ {
"name" : "Id",
"type" : "string"
}, {
"name" : "name",
"type" : "string"
}, {
"name" : "age",
"type" : [ "null", "int" ],
"default" : null
}, {
"name" : "comm",
"type" : [ "null", "int" ],
"default" : null
} ]
},
"doc" : ""
}, {
"name" : "data",
"type" : {
"type" : "record",
"name" : "CustomerData"
"fields" : [ {
"name" : "tags",
"type" : {
"type" : "map",
"values" : "string"
}
}, {
"name" : "data",
"type" : [ "null", "bytes" ]
"default" : null
} ]
},
"doc" : ""
} ]
}
Please guide me.
Try ConvertRecord with a JsonTreeReader and CSVRecordSetWriter. The writer can be configured to include the header line, and you won't need to split/merge. UpdateAttribute can be used to set the filename attribute, which is the filename associated with the flow file (used by PutFile, for example).
If ConvertRecord doesn't give you the output you want, can you please elaborate on the difference between what it gives you and what you expect?
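For the date-stamped filename step, UpdateAttribute can set the filename attribute with a NiFi Expression Language value along these lines (the "customers_" prefix is only a placeholder; now() and format() are standard Expression Language functions):

customers_${now():format('yyyy-MM-dd')}.csv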

Upload Data to Druid Incrementally

I need to upload data to an existing datasource, and this has to be done on a daily basis. I guess some changes need to be made in the index file, but I am not able to figure out what. I tried pushing the data with the same datasource name, but the previously loaded data was removed.
Any help would be appreciated.
Here is the ingestion JSON file:
{
"type" : "index",
"spec" : {
"dataSchema" : {
"dataSource" : "mksales",
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "json",
"dimensionsSpec" : {
"dimensions" : ["Address",
"City",
"Contract Name",
"Contract Sub Type",
"Contract Type",
"Customer Name",
"Domain",
"Nation",
"Contract Start End Date",
"Zip",
"Sales Rep Name"
]
},
"timestampSpec" : {
"format" : "auto",
"column" : "time"
}
}
},
"metricsSpec" : [
{ "type" : "count", "name" : "count", "type" : "count" },
{"name" : "Price","type" : "doubleSum","fieldName" : "Price"},
{"name" : "Sales","type" : "doubleSum","fieldName" : "Sales"},
{"name" : "Units","type" : "longSum","fieldName" : "Units"}],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "day",
"queryGranularity" : "none",
"intervals" : ["2000-12-01T00:00:00Z/2030-06-30T00:00:00Z"],
"rollup" : true
}
},
"ioConfig" : {
"type" : "index",
"firehose" : {
"type" : "local",
"baseDir" : "mksales/",
"filter" : "mksales.json"
},
"appendToExisting" : false
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 10000000,
"maxRowsInMemory" : 40000,
"forceExtendableShardSpecs" : true
}
}
}
There are two ways in which you can append/update data in an existing segment:
reindexing and delta ingestion.
With reindexing, you reindex your data every time new data arrives for a particular segment (in your case, a day). For the reindexing you need to supply all the files containing data for that day.
For delta ingestion you need to use an inputSpec of type "multi".
You can refer to the documentation for more details: http://druid.io/docs/latest/ingestion/update-existing-data.html
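If you stay with the native "index" task from the question rather than the Hadoop-based delta ingestion described on that page, the flag that decides whether a new run replaces or extends the interval is appendToExisting in the ioConfig; your spec sets it to false, which would explain why the earlier data disappeared. A sketch with only that flag flipped and everything else copied from the question (the forceExtendableShardSpecs: true already in your tuningConfig is, as far as I recall, what allows the extra segments to be added):

"ioConfig" : {
  "type" : "index",
  "firehose" : {
    "type" : "local",
    "baseDir" : "mksales/",
    "filter" : "mksales.json"
  },
  "appendToExisting" : true
}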

Create Map of Map using JSON Schema in AWS API Model

I want to create a Map variable in the SDK (AWS API), for which I need to write a JSON Schema input/output model.
Help me write the JSON Schema syntax so that I can achieve my objective.
To make just one Map variable:
{
"type" : "object",
"required" : [ "answers" ],
"properties" : {
"answers" : {
"type" : "object",
"additionalProperties": { "type": "string" }
}
},
"title" : "test"
}
To create a Map of Maps, use:
"answers": {
"type" :"object",
"additionalProperties" :
{ "type" : "object",
"additionalProperties" : {"type" : "string"}
}
}
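Putting the two snippets together, the complete model for a map of maps would look like the following (the same model as in the question, with only the additionalProperties nesting changed):

{
  "type" : "object",
  "required" : [ "answers" ],
  "properties" : {
    "answers" : {
      "type" : "object",
      "additionalProperties" : {
        "type" : "object",
        "additionalProperties" : { "type" : "string" }
      }
    }
  },
  "title" : "test"
}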

Using aggregation in MongoDB to select two or more fields of a document

I'm starting to work with MongoDB and I have a question about aggregation. I have documents that use a lot of different fields in different orders. For example:
db.my_collection.insert({ "answers" : [ { "id" : "0", "type" : "text", "value" : "A"}, { "id" : "1", "type" : "text", "value" : "B"}]})
db.my_collection.insert({ "answers" : [ { "id" : "0", "type" : "text", "value" : "C"}, { "id" : "1", "type" : "text", "value" : "A"}]})
I would like to execute a query using "answers.id" together with "answers.value" to obtain a result.
I tried but didn't get the expected result; in my case, I executed the command:
db.my_collection.aggregate({$match: {"answers.id":"0", "answers.value": "A"}})
And the result was both documents, when I expected only:
{ "answers" : [ { "id" : "0", "type" : "text", "value" : "A"}, { "id" : "1", "type" : "text", "value" : "B"}] }
Thank you!!!
You need to use the $elemMatch operator to match a single element of the answers array with both the specified 'id' and 'value'.
Something like this should work:
db.my_collection.aggregate( {
"$match" : {
"answers" {
"$elemMatch" : {
"id" : "0",
"value" : "A"
}
}
}
} )
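If the aggregation pipeline is only being used for this filter, the plain find() form is equivalent; $elemMatch behaves the same way there:

db.my_collection.find( { "answers" : { "$elemMatch" : { "id" : "0", "value" : "A" } } } )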