I'm new to NiFi. I have to consume a JSON topic from Kafka and would like to convert it into a CSV file, selecting only a few scalar fields and some nested fields.
I need to do the following things:
Consume the topic - done
Convert JSON to CSV
Include a header in the CSV file
Merge into a single file (if it's split)
Give a proper filename with the date
I'm following the link below:
https://community.hortonworks.com/articles/64069/converting-a-large-json-file-into-csv.html
But I'm not sure if this is the right approach, and I also don't know how to produce a single file.
I'm using NiFi 1.8, and the schema is stored in the Confluent Schema Registry.
Schema sample:
{
  "type" : "record",
  "name" : "Customer",
  "namespace" : "namespace1",
  "fields" : [ {
    "name" : "header",
    "type" : {
      "type" : "record",
      "name" : "CustomerDetails",
      "namespace" : "namespace1",
      "fields" : [ {
        "name" : "Id",
        "type" : "string"
      }, {
        "name" : "name",
        "type" : "string"
      }, {
        "name" : "age",
        "type" : [ "null", "int" ],
        "default" : null
      }, {
        "name" : "comm",
        "type" : [ "null", "int" ],
        "default" : null
      } ]
    },
    "doc" : ""
  }, {
    "name" : "data",
    "type" : {
      "type" : "record",
      "name" : "CustomerData",
      "fields" : [ {
        "name" : "tags",
        "type" : {
          "type" : "map",
          "values" : "string"
        }
      }, {
        "name" : "data",
        "type" : [ "null", "bytes" ],
        "default" : null
      } ]
    },
    "doc" : ""
  } ]
}
Please guide me.
Try ConvertRecord with a JsonTreeReader and a CSVRecordSetWriter. The writer can be configured to include the header line, and you won't need to split/merge. UpdateAttribute can be used to set the filename attribute, which is the filename associated with the flow file (used by PutFile, for example).
If ConvertRecord doesn't give you the output you want, can you please elaborate on the difference between what it gives you and what you expect?
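For the dated filename, UpdateAttribute could set the filename attribute to an expression along the lines of customer_${now():format('yyyy-MM-dd')}.csv (the customer_ prefix and the date format here are just placeholders); PutFile will then write the flow file under that name.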
Related
I am trying to read an Avro file, but the code below fails with the error shown.
val expected_avro = spark.read.format("avro").load("avro_file_path")
Error:
Found recursive reference in Avro schema, which can not be processed by Spark:
Below is the sample schema:
{
  "type" : "record",
  "name" : "ABC",
  "namespace" : "com.abc.xyzlibrary",
  "doc" : "Description Sample",
  "fields" : [ {
    "name" : "id",
    "type" : "int",
    "default" : 0
  },
  {
    "name" : "location",
    "type" : "string",
    "default" : ""
  } ]
}
Is it because we have "name" twice (once at the root level and once inside the fields section)? Any leads on how to get this fixed?
Thanks in advance.
I need to upload data to an existing datasource, and this has to be done on a daily basis. I guess some changes need to be made in the index file, but I am not able to figure out what. I tried pushing the data with the same datasource name, but the existing data was removed.
Any help would be appreciated.
Here is the ingestion JSON file:
{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "mksales",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "Address",
              "City",
              "Contract Name",
              "Contract Sub Type",
              "Contract Type",
              "Customer Name",
              "Domain",
              "Nation",
              "Contract Start End Date",
              "Zip",
              "Sales Rep Name"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "time"
          }
        }
      },
      "metricsSpec" : [
        { "type" : "count", "name" : "count" },
        { "type" : "doubleSum", "name" : "Price", "fieldName" : "Price" },
        { "type" : "doubleSum", "name" : "Sales", "fieldName" : "Sales" },
        { "type" : "longSum", "name" : "Units", "fieldName" : "Units" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2000-12-01T00:00:00Z/2030-06-30T00:00:00Z"],
        "rollup" : true
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : "mksales/",
        "filter" : "mksales.json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 10000000,
      "maxRowsInMemory" : 40000,
      "forceExtendableShardSpecs" : true
    }
  }
}
There are two ways you can append/update the data in an existing segment: reindexing and delta ingestion.
For reindexing, you need to reindex your data every time new data arrives for a particular segment (in your case, a day), and you must supply all the files that contain data for that day.
For delta ingestion, you need to use an inputSpec of type "multi"; a rough sketch is shown below.
You can refer to the documentation for more details: http://druid.io/docs/latest/ingestion/update-existing-data.html
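Based on that documentation page, the delta-ingestion ioConfig looks roughly like the sketch below. Note that this form belongs to the Hadoop batch task rather than the native "index" task used above, and the interval and paths values here are placeholders you would replace with your own:
"ioConfig" : {
  "type" : "hadoop",
  "inputSpec" : {
    "type" : "multi",
    "children" : [
      {
        "type" : "dataSource",
        "ingestionSpec" : {
          "dataSource" : "mksales",
          "intervals" : ["2019-01-01/2019-01-02"]
        }
      },
      {
        "type" : "static",
        "paths" : "mksales/new_day.json"
      }
    ]
  }
}
The "dataSource" child re-reads the rows already stored for that interval and the "static" child adds the new files, so the day's segment is rebuilt with both the old and the new data.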
I want to create a Map variable in the SDK (AWS API), for which I need to write a JSON Schema input/output model.
Help me write the JSON Schema syntax so that I can achieve my objective.
To make just one Map variable:
{
  "type" : "object",
  "required" : [ "answers" ],
  "properties" : {
    "answers" : {
      "type" : "object",
      "additionalProperties" : { "type" : "string" }
    }
  },
  "title" : "test"
}
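For example, a payload like this (the keys are arbitrary) validates against that model:
{
  "answers" : {
    "q1" : "yes",
    "q2" : "blue"
  }
}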
To create a Map of Maps, use:
"answers": {
"type" :"object",
"additionalProperties" :
{ "type" : "object",
"additionalProperties" : {"type" : "string"}
}
}
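With that definition (substituted for the "answers" property in the model above), the value of answers becomes a map whose values are themselves maps of strings, for example:
{
  "answers" : {
    "q1" : { "en" : "yes", "fr" : "oui" }
  }
}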
How can I specify one-to-many and many-to-one relations in JSON-LD?
For example:
{
  "@context" : {
    "@vocab" : "http://www.schema.org/",
    "@id" : "http://www.example.com/users/Joe",
    "name" : "name",
    "dob" : "birthDate",
    "age" : {
      "@id" : "http://www.example.com/users/Joe#age",
      "@type" : "Number"
    },
    "knows" : ["http://www.example.com/users/Jill", "http://www.example.com/users/James"]
  },
  "name" : "Joe",
  "age" : "24",
  "dob" : "12-Jun-2013"
}
This doesn't parse in the JSON-LD playground.
What is the valid and best way to specify relations like this, either in JSON-LD or using Hydra?
You need to be careful about what you put into the context and what you put into the body of the document. Simply speaking, the context defines the mappings to URLs while the body contains the actual data. Your example should thus look something like this:
{
  "@context" : {
    "@vocab" : "http://www.schema.org/",
    "dob" : "birthDate",
    "age" : {
      "@id" : "http://www.example.com/users/Joe#age",
      "@type" : "Number"
    },
    "knows" : { "@type" : "@id" }
  },
  "@id" : "http://www.example.com/users/Joe",
  "name" : "Joe",
  "age" : "24",
  "dob" : "12-Jun-2013",
  "knows" : [
    "http://www.example.com/users/Jill",
    "http://www.example.com/users/James"
  ]
}
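Expanding this document (for example in the JSON-LD playground) should give roughly the following, with knows pointing at the two related user IRIs, which is how the one-to-many relation is expressed:
[ {
  "@id" : "http://www.example.com/users/Joe",
  "http://www.example.com/users/Joe#age" : [ { "@type" : "http://www.schema.org/Number", "@value" : "24" } ],
  "http://www.schema.org/birthDate" : [ { "@value" : "12-Jun-2013" } ],
  "http://www.schema.org/knows" : [
    { "@id" : "http://www.example.com/users/Jill" },
    { "@id" : "http://www.example.com/users/James" }
  ],
  "http://www.schema.org/name" : [ { "@value" : "Joe" } ]
} ]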
I'm starting to work with MongoDB and I have a question about aggregation. I have documents that use a lot of different fields in different orders. For example:
db.my_collection.insert({ "answers" : [ { "id" : "0", "type" : "text", "value" : "A"}, { "id" : "1", "type" : "text", "value" : "B"}]})
db.my_collection.insert({ "answers" : [ { "id" : "0", "type" : "text", "value" : "C"}, { "id" : "1", "type" : "text", "value" : "A"}]})
I would like to execute a query combining "answers.id" and "answers.value" to obtain a result.
I tried, but didn't get the result I expected. I executed the command:
db.my_collection.aggregate({$match: {"answers.id":"0", "answers.value": "A"}})
The result was both documents, when I expected only this one:
{ "answers" : [ { "id" : "0", "type" : "text", "value" : "A"}, { "id" : "1", "type" : "text", "value" : "B"}] }
Thank you!!!
You need to use the $elemMatch operator to match a single element of the answers array with both the specified 'id' and 'value'.
Something like this should work:
db.my_collection.aggregate([ {
  "$match" : {
    "answers" : {
      "$elemMatch" : {
        "id" : "0",
        "value" : "A"
      }
    }
  }
} ])
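With $elemMatch, both conditions have to be satisfied by the same array element, so only the first document (whose first element has id "0" and value "A") is returned. The plain "answers.id" / "answers.value" form matches as long as each condition is satisfied by some element of the array, which is why both documents came back.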