Not a data file error while reading Avro file - scala

I have a file containing data in Avro format. I would like to read this data into GenericRecord objects (or any other suitable data structure) so that I can send it from Kafka to Spark.
I tried to use DataFileReader, however the result was this error:
Exception in thread "main" Not a data file.
at org.apache.avro.file.DataFileStream.initialize(
Here is the code that produced it:
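import java.io.File
import scala.io.Source
import org.apache.avro.Schema
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
val schema = Source.fromFile(schemaPath).mkString
val parser = new Schema.Parser
val avroSchema = parser.parse(schema)
val avroDataFile = new File(dataPath)
val avroReader = new GenericDatumReader[GenericRecord](avroSchema)
val dataFileReader = new DataFileReader[GenericRecord](avroDataFile, avroReader)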
How can I fix this error?
This is what my Avro schema looks like:
{
"type" : "record",
"namespace" : "input_data",
"name" : "testUser",
"fields" : [
{"name" : "name", "type" : "string", "default": "NONE"},
{"name" : "age", "type" : "int", "default": -1},
{"name" : "phone", "type" : "string", "default" : "NONE"},
{"name" : "city", "type" : "string", "default" : "NONE"},
{"name" : "country", "type" : "string", "default" : "NONE"}
]
}
And this is the data I tried to read (it was generated by this tool):
{
"name" : "O= ~usP3\u0001\bY\u0011k\u0001",
"age" : 585392215,
"phone" : "\u0012\u001F#\u001FH]e\u0015UW\u0000\fo",
"city" : "aWi\u001B'\u000Bh\u00163\u001A_I\u0001\u0001L",
"country" : "]H\u001Dl(n!Sr}oVCH"
}
{
"name" : "\u0011Y~\fV\u001Dv%4\u0006;\u0012",
"age" : -2045540864,
"phone" : "UyOdgny-hA",
"city" : "\u0015f?\u0000\u0015oN{\u0019\u0010\u001D%",
"country" : "eY>c\u0010j\u0002[\u001CdDQ"
}

Well, that data is not Avro, it is JSON. DataFileReader expects a binary Avro container file, which is why it fails with "Not a data file".
If it were binary Avro data, you would not be able to read the file as text without first running the avro-tools.jar tojson action.
If you look at the tool's usage doc, JSON is the default output format:
-j, --json: Encode outputted data in JSON format (default)
To actually get Avro, use the arguments -s schema.avsc -b -o out.avro
There are also other ways to generate test data in Kafka
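A quick way to tell the two cases apart before handing a file to DataFileReader is to look at its first four bytes: an Avro object container file starts with the ASCII characters Obj followed by the byte 0x01, while the JSON output above starts with {. Here is a minimal Python sketch of that check. The helper name looks_like_avro_container and the temporary files are made up for illustration; this is not part of any Avro library.

```python
import os
import tempfile

# First four bytes of an Avro object container file: 'O', 'b', 'j', 0x01.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro_container(path):
    """Return True if the file starts with the Avro container magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == AVRO_MAGIC


def _write_temp(data):
    """Write bytes to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


# A JSON file like the one in the question, and a fake file that only
# carries the magic bytes (not a complete Avro file).
json_path = _write_temp(b'{"name": "x", "age": 1}\n')
magic_path = _write_temp(AVRO_MAGIC + b"not-a-real-avro-body")

json_result = looks_like_avro_container(json_path)
magic_result = looks_like_avro_container(magic_path)

os.remove(json_path)
os.remove(magic_path)
```

If the check returns False for your data file, the "Not a data file" error is expected, and the file needs to be regenerated in binary form.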


Assign SQL schema to Spark DataFrame

I'm converting my team's legacy Redshift SQL code to Spark SQL code. All the Spark examples I've seen define the schema in a non-SQL way using StructType and StructField and I'd prefer to define the schema in SQL, since most of my users know SQL but not Spark.
This is the ugly workaround I'm doing now. Is there a more elegant way that doesn't require defining an empty table just so that I can pull the SQL schema?
create_table_sql = '''
CREATE TABLE public.example (
id LONG,
example VARCHAR(80)
)
'''
schema = spark.sql("DESCRIBE public.example").collect()
s3_data =\
option("delimiter", "|")\
Yes, there is a way to create a schema from a string, although I am not sure it really looks like SQL. You can use:
from pyspark.sql.types import _parse_datatype_string
_parse_datatype_string("id: long, example: string")
This will create the following schema:
Or you may have a complex schema as well:
schema = _parse_datatype_string("customers array<struct<id: long, name: string, address: string>>")
You can check for more examples here
Adding to what has already been said, building a schema (e.g. StructType-based or JSON) is more straightforward in Scala Spark than in PySpark:
> import org.apache.spark.sql.types.StructType
> val s = StructType.fromDDL("customers array<struct<id: long, name: string, address: string>>")
> s
res3: org.apache.spark.sql.types.StructType = StructType(StructField(customers,ArrayType(StructType(StructField(id,LongType,true),StructField(name,StringType,true),StructField(address,StringType,true)),true),true))
> s.prettyJson
res9: String =
{
"type" : "struct",
"fields" : [ {
"name" : "customers",
"type" : {
"type" : "array",
"elementType" : {
"type" : "struct",
"fields" : [ {
"name" : "id",
"type" : "long",
"nullable" : true,
"metadata" : { }
}, {
"name" : "name",
"type" : "string",
"nullable" : true,
"metadata" : { }
}, {
"name" : "address",
"type" : "string",
"nullable" : true,
"metadata" : { }
} ]
},
"containsNull" : true
},
"nullable" : true,
"metadata" : { }
} ]
}

how to transform a nested mongodb table into spark dataframe

I have a nested MongoDB table, and its documents are structured like this:
{
"_id" : "35228334dbd1090f6117c5a0011b56b0",
"brasidas" : [
{
"key" : "buy",
"value" : 859193
}
],
"crawl_time" : NumberLong(1526296211997),
"date" : "2018-05-11",
"id" : "44874f4c8c677087bcd5f829b2843e66",
"initNumber" : 0,
"repurchase" : 0,
"source_url" : "",
"stockCode" : "600020",
"stockName" : "ZYGS",
"type" : "SSE"
}
I want to transform it into a Spark DataFrame and extract the "key" and "value" fields of "brasidas" into separate columns, like this:
initNumber  repurchase  key  value   stockName  type  date
50000       50000       buy  286698  shgf       SSE   2015/3/30
The problem is that the "brasidas" field comes in three forms:
[{ "key" : "buy", "value" : 286698 }]
[{ "value" : 15311500, "key" : "buy_free" }, { "value" : 0, "key" : "buy_limited" }]
[{ "key" : "buy_free" }, { "key" : "buy_limited" }]
So when I define a StructType in Scala, it does not fit every document. I can only read "brasidas" as a single column, and I have not managed to split it by "key". This is what I get:
initNumber  repurchase  brasidas                               stockName  type  date
50000       50000       [{ "key" : "buy", "value" : 286698 }]  shgf       SSE   2015/3/30
This is the code for getting mongodb document:
val readpledge = ReadConfig(Map("uri" -> (mongouri_beehive + ".pledge")))
val pledge = getMongoDB.readCollection(sc, readpledge, "initNumber", "repurchase", "brasidas", "stockName", "type", "date")
  .selectExpr("cast(initNumber as int) initNumber", "cast(repurchase as int) repurchase", "brasidas", "stockName", "type", "date")
If you call df.printSchema(), you will probably see that brasidas has an ArrayType, most likely an array of maps.
So I'd suggest implementing a UDF that takes the array as a parameter and transforms it the way you need.
def arrayProcess(arr: Seq[AnyRef]): Seq[AnyRef] = ???
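To make the intended transformation concrete, here is a rough plain-Python sketch of the logic such a function would need: one output row per array entry, with a missing "key" or "value" turned into None so that all three forms of "brasidas" produce the same columns. The function name explode_brasidas and the sample documents are only for illustration, and this is not Spark or connector API; in Spark the same idea would have to be expressed through a UDF or the built-in array functions.

```python
def explode_brasidas(doc):
    """Return one flat row (dict) per entry of doc["brasidas"].

    A missing "key" or "value" in an entry becomes None, so every row
    has the same set of columns regardless of the entry's form.
    """
    # Copy every top-level field except the nested array itself.
    base = {k: v for k, v in doc.items() if k != "brasidas"}
    # Treat a missing or empty array as a single entry with no key/value.
    entries = doc.get("brasidas") or [{}]
    rows = []
    for entry in entries:
        row = dict(base)
        row["key"] = entry.get("key")
        row["value"] = entry.get("value")
        rows.append(row)
    return rows


# The three forms described in the question (other fields shortened).
form1 = {"stockName": "shgf", "brasidas": [{"key": "buy", "value": 286698}]}
form2 = {"stockName": "shgf", "brasidas": [
    {"value": 15311500, "key": "buy_free"},
    {"value": 0, "key": "buy_limited"},
]}
form3 = {"stockName": "shgf", "brasidas": [
    {"key": "buy_free"},
    {"key": "buy_limited"},
]}

rows1 = explode_brasidas(form1)
rows2 = explode_brasidas(form2)
rows3 = explode_brasidas(form3)
```

Note that the second and third forms yield two rows per document, so the result has more rows than the collection has documents.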

Reading message from Kafka with java.util.List in avro schema

I am trying to read a message from Kafka using a consumer with the following properties
And the schema is
{
"type" : "array",
"items" : {
"type" : "record",
"name" : "MyDto",
"namespace" : "test.dto",
"fields" : [ {
"default" : null,
"name" : "version",
"type" : ["null","string"]
}, {
"default" : null,
"name" : "testName",
"type" : ["null","string"]
}, {
"name" : "keys",
"type" : {"type": "array", "items": "string"},
"java-class" : "java.util.List"
} ]
},
"java-class" : "java.util.List"
}
The object was written to Kafka successfully using this schema. However, on deserialization I get the exception java.lang.NoSuchMethodException: java.util.List.<init>()
Is it possible to use the java.util.List class? I am using Confluent 3.1.2.
A List is an interface; it has no constructor or concrete implementation. Your producer is probably using an ArrayList, and so should the reader schema and the consumer.
Make sure the schema is defined with "java-class" : "java.util.ArrayList".

Kafka stream join with a specific key as input

I have 3 different topics with 3 Avro files in the schema registry. I want to stream these topics, join them together, and write the result into one topic. The problem is that the key I want to join on is different from the key I use when writing the data into each topic.
Let's say we have these 3 Avro files:
{
"type" : "record",
"name" : "Alarm",
"namespace" : "com.kafkastream.schema.avro",
"fields" : [ {
"name" : "alarm_id",
"type" : "string",
"doc" : "Unique identifier of the alarm."
}, {
"name" : "ne_id",
"type" : "string",
"doc" : "Unique identifier of the network element ID that produces the alarm."
}, {
"name" : "start_time",
"type" : "long",
"doc" : "is the timestamp when the alarm was generated."
}, {
"name" : "severity",
"type" : [ "null", "string" ],
"doc" : "The severity field is the default severity associated to the alarm ",
"default" : null
} ]
}
{
"type" : "record",
"name" : "Incident",
"namespace" : "com.kafkastream.schema.avro",
"fields" : [ {
"name" : "incident_id",
"type" : "string",
"doc" : "Unique identifier of the incident."
}, {
"name" : "incident_type",
"type" : [ "null", "string" ],
"doc" : "Categorization of the incident e.g. Network fault, network at risk, customer impact, etc",
"default" : null
}, {
"name" : "alarm_source_id",
"type" : "string",
"doc" : "Respective Alarm"
}, {
"name" : "start_time",
"type" : "long",
"doc" : "is the timestamp when the incident was generated on the node."
}, {
"name" : "ne_id",
"type" : "string",
"doc" : "ID of specific network element."
} ]
}
{
"type" : "record",
"name" : "Maintenance",
"namespace" : "com.kafkastream.schema.avro",
"fields" : [ {
"name" : "maintenance_id",
"type" : "string",
"doc" : "The message number is the unique ID for every maintenance"
}, {
"name" : "ne_id",
"type" : "string",
"doc" : "The NE ID is the network element ID on which the maintenance is done."
}, {
"name" : "start_time",
"type" : "long",
"doc" : "The timestamp when the maintenance start."
}, {
"name" : "end_time",
"type" : "long",
"doc" : "The timestamp when the maintenance start."
} ]
}
I have 3 topics in my Kafka, one for each of these Avro files (let's say alarm_raw, incident_raw, maintenance_raw), and whenever I write into these topics I use ne_id as the key (so the topics are partitioned by ne_id). Now I want to join these 3 topics, get a new record, and write it into a new topic. The problem is that I want to join Alarm and Incident on alarm_id and alarm_source_id, and join Alarm and Maintenance on ne_id. I want to avoid creating a new topic and re-assigning a new key. Is there any way to specify the key while joining?
It depends on what kind of join you want to use.
For a KStream-KStream join, there is currently (v0.10.2 and earlier) no other way than setting a new key (e.g., by using selectKey()) and repartitioning.
For a KStream-KTable join, Kafka 0.10.2 (to be released in the next weeks) contains a new feature called GlobalKTables. This allows you to do a non-key join on the KTable (i.e., a KStream-GlobalKTable join), and thus you do not need to repartition the data in your GlobalKTable.
Note: a KStream-GlobalKTable join has different semantics than a KStream-KTable join. In contrast to the latter, it is not time synchronized, and thus the join is non-deterministic by design with regard to GlobalKTable updates; i.e., there is no guarantee which KStream record will be the first to "see" a GlobalKTable update and thus join with the updated GlobalKTable record.
There are plans to add a KTable-GlobalKTable join, too. This might become available in 0.10.3. There are no plans to add "global" KStream-KStream joins though.
You can get a common key for the join by modifying the existing key.
You can use a KeyValueMapper, which lets you modify the key as well as the value.
You should use it as follows:
val modifiedStream =[String,String](
new KeyValueMapper[String, String,KeyValue[String,String]]{
override def apply(key: String, value: String): KeyValue[String, String] = new KeyValue("modifiedKey", value)
You can apply the above logic to multiple KStream objects so that they share a single key for joining.
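To illustrate what re-keying before a join amounts to, here is a small plain-Python sketch using lists of (key, value) pairs in place of streams. The helper names select_key and inner_join and the sample records are made up for this example; they only mirror the idea behind selectKey() and a join, not the Kafka Streams API, and they ignore join windows and partitioning entirely.

```python
def select_key(records, key_fn):
    """Re-key (key, value) pairs by deriving the new key from the value."""
    return [(key_fn(value), value) for _, value in records]


def inner_join(left, right, joiner):
    """Join two lists of (key, value) pairs on equal keys."""
    index = {}
    for key, value in right:
        index.setdefault(key, []).append(value)
    return [
        (key, joiner(left_value, right_value))
        for key, left_value in left
        for right_value in index.get(key, [])
    ]


# Both inputs are keyed by ne_id, as in the question.
alarms = [
    ("ne-1", {"alarm_id": "a-1", "ne_id": "ne-1", "severity": "major"}),
]
incidents = [
    ("ne-1", {"incident_id": "i-9", "alarm_source_id": "a-1", "ne_id": "ne-1"}),
    ("ne-2", {"incident_id": "i-10", "alarm_source_id": "a-7", "ne_id": "ne-2"}),
]

# Re-key each side by the attribute the join should use.
alarms_by_id = select_key(alarms, lambda v: v["alarm_id"])
incidents_by_alarm = select_key(incidents, lambda v: v["alarm_source_id"])

joined = inner_join(
    alarms_by_id,
    incidents_by_alarm,
    lambda alarm, incident: {
        "alarm_id": alarm["alarm_id"],
        "incident_id": incident["incident_id"],
        "severity": alarm["severity"],
    },
)
```

In a real Kafka Streams topology, changing the key this way is what forces the repartitioning mentioned in the other answer, because records with the same new key have to end up in the same partition before they can be joined.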

using 2 different result sets in mongodb

I'm using groovy with mongodb. I have a result set but need a value from a different grouping of documents. How do I pull that value into the result set I need?
MAIN: Network data
{
"resource_metadata" : {
"name" : "tapd2e75adf-71",
"parameters" : { },
"fref" : null,
"instance_id" : "9f170531-79d0-48ee-b0f7-9bd2788b1cc5"}
}
I need the display_name for the network data result set which is contained in the compute data.
CPU data
{
"resource_id" : "9f170531-79d0-48ee-b0f7-9bd2788b1cc5",
"resource_metadata" : {
"ramdisk_id" : "",
"display_name" : "testinstance0001"}
}
You can see that resource_id and instance_id have the same value. I know there is no relationship I can define, but I am reaching out to see whether anyone has come across this. I'm using the table model to retrieve data for reporting. A Hashtable has been suggested to me, but I don't see how that would work. Somehow, inside the hasNext loop, I need to include the display_name value in the networking data, so that the valid name from the compute data is shown instead of only the GUID.
def docs = meter.find(query).sort(sort).limit(50)\
while (docs.hasNext()) { def doc =\
model.addRow([ doc.get("counter_name"),doc.get("counter_volume"),doc.get("timestamp"),\
as Object[]);}
Full document:
1st set, where I need the network data measure, which has no name, only an id {resource_metadata.instance_id}:
{
"_id" : ObjectId("528812f8be09a32281e137d0"),
"counter_name" : "network.outgoing.packets",
"user_id" : "4d4e43ec79c5497491b23b13644c2a3b",
"timestamp" : ISODate("2013-11-17T00:51:00Z"),
"resource_metadata" : {
"name" : "tap6baab24e-8f",
"parameters" : { },
"fref" : null,
"instance_id" : "a8727a1d-4661-4565-9c0a-511279024a97",
"instance_type" : "50",
"mac" : "fa:16:3e:a3:bf:fc"
},
"source" : "openstack",
"counter_unit" : "packet",
"counter_volume" : 4611911,
"project_id" : "97dc4ca962b040608e7e707dd03f2574",
"message_id" : "54039238-4f22-11e3-8e68-e4115b99a59d",
"counter_type" : "cumulative"
}
2nd set, where I want to grab the name as I get the values {resource_id}:
{
"_id" : ObjectId("5287bc3ebe09a32281dd2594"),
"counter_name" : "cpu",
"user_id" : "4d4e43ec79c5497491b23b13644c2a3b",
"message_signature" :
"timestamp" : ISODate("2013-11-16T18:40:58Z"),
"resource_id" : "a8727a1d-4661-4565-9c0a-511279024a97",
"resource_metadata" : {
"ramdisk_id" : "",
"display_name" : "vmsapng01",
"name" : "instance-000014d4",
"disk_gb" : "",
"availability_zone" : "",
"kernel_id" : "",
"ephemeral_gb" : "",
"host" : "3746d148a76f4e1a8203d7e2378ef48ccad8a714a47e7481ab37bcb6",
"memory_mb" : "",
"instance_type" : "50",
"vcpus" : "",
"root_gb" : "",
"image_ref" : "869be2c0-9480-4239-97ad-df383c6d09bf",
"architecture" : "",
"os_type" : "",
"reservation_id" : ""
},
"source" : "openstack",
"counter_unit" : "ns",
"counter_volume" : NumberLong("724574640000000"),
"project_id" : "97dc4ca962b040608e7e707dd03f2574",
"message_id" : "a240fa5a-4eee-11e3-8e68-e4115b99a59d",
"counter_type" : "cumulative"
}
This is another collection that contains the same value but just thought it would be easier to grab from same collection:
"_id" : "a8727a1d-4661-4565-9c0a-511279024a97",
"metadata" : {
"ramdisk_id" : "",
"display_name" : "vmsapng01",
"name" : "instance-000014d4",
"disk_gb" : "",
"availability_zone" : "",
"kernel_id" : "",
"ephemeral_gb" : "",
"host" : "3746d148a76f4e1a8203d7e2378ef48ccad8a714a47e7481ab37bcb6",
"memory_mb" : "",
"instance_type" : "50",
"vcpus" : "",
"root_gb" : "",
"image_ref" : "869be2c0-9480-4239-97ad-df383c6d09bf",
"architecture" : "",
"os_type" : "",
"reservation_id" : "",
It looks like these data are in 2 different collections; is this correct?
Would you be able to query the CPU data for each "instance_id" ("resource_id")?
Or, if this would cause too many queries to the database (it looks like you limit to 50...), you could use $in with the list of all "instance_id"s.
Either way, you will need to query each collection separately.
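As a rough sketch of the $in approach, the steps would be: collect the instance_id values from the network documents, fetch the matching CPU documents in one query, build a lookup from resource_id to display_name, and use it while filling the table rows. The Python below uses plain lists of dicts as stand-ins for the two query results, with values taken from the documents in the question; the variable names and the exact filter fields are assumptions, and only the {"$in": [...]} filter shape is MongoDB syntax.

```python
# Stand-in for the result of the network query (first result set).
network_docs = [
    {
        "counter_name": "network.outgoing.packets",
        "counter_volume": 4611911,
        "resource_metadata": {
            "name": "tap6baab24e-8f",
            "instance_id": "a8727a1d-4661-4565-9c0a-511279024a97",
        },
    },
]

# 1. Collect the distinct instance ids that need a display name.
instance_ids = sorted(
    {doc["resource_metadata"]["instance_id"] for doc in network_docs}
)

# 2. Filter document for the second query; this is what would be passed
#    to find() on the collection holding the CPU data (assumed fields).
cpu_filter = {"counter_name": "cpu", "resource_id": {"$in": instance_ids}}

# Stand-in for the result of that second query.
cpu_docs = [
    {
        "resource_id": "a8727a1d-4661-4565-9c0a-511279024a97",
        "resource_metadata": {"display_name": "vmsapng01"},
    },
]

# 3. Build a lookup table: resource_id -> display_name.
display_names = {
    doc["resource_id"]: doc["resource_metadata"]["display_name"]
    for doc in cpu_docs
}

# 4. Fill the report rows, adding the name next to the network values.
rows = [
    [
        doc["counter_name"],
        doc["counter_volume"],
        display_names.get(doc["resource_metadata"]["instance_id"]),
    ]
    for doc in network_docs
]
```

This is essentially the Hashtable suggestion from the question: the second query runs once, and the lookup inside the loop is a plain map access rather than another database call.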