Reading an Avro file from scala - scala

I'm trying to read an avro file using scala.
I've extracted the file's schema using avro-tools and saved it to a file, I then try to read it using the following code:
val zibi= scala.io.Source.fromFile("/home/wasabi/schema").mkString
val schema_obj = new Schema.Parser
val schema2 = schema_obj.parse(zibi)
val READER2 = new GenericDatumReader[GenericRecord](schema2)
val myFile = Files.readAllBytes(Paths.get("/tmp/check/CMRF_80_1442744555901-1_1_2_1_1_1_4_10_1.avro"))
val datum = READER2.read(null, DecoderFactory.defaultFactory.createBinaryDecoder(myFile,null))
But I keep hitting IOExceptions as such:
java.io.IOException: Invalid int encoding
at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:145)
at org.apache.avro.io.ValidatingDecoder.readInt(ValidatingDecoder.java:83)
at org.apache.avro.generic.GenericDatumReader.readInt(GenericDatumReader.java:444)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:159)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at org.apache.avro.generic.GenericDatumReader.readArray(GenericDatumReader.java:219)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
When I'm reading the file through avro-tools it reads just fine.
What am I doing wrong?

Try using a DataFileReader instead of using a BinaryDecoder.
While Encoder/Decoders are used for writing and reading raw avros, I suspect that they are choking on the header info found in avro datafiles.
import org.apache.avro.generic.{ GenericDatumReader, GenericRecord }
import org.apache.avro.file.DataFileReader
val zibi= scala.io.Source.fromFile("/home/wasabi/schema").mkString
val schema_obj = new Schema.Parser
val schema2 = schema_obj.parse(zibi)
val READER2 = new GenericDatumReader[GenericRecord](schema2)
val myFile = new File("/tmp/check/CMRF_80_1442744555901-1_1_2_1_1_1_4_10_1.avro")
val dataFileReader = new DataFileReader[GenericRecord](myFile, READER2)
val datum = dataFileReader.next()

Related

Deserializing bytes to GenericRecord / Row

In the aim of using DataSourceV2 for reading some stored Parquet binary files that I have already their Spark schema, I am struggling to find a way to deserialize Parquet stream into GenericRecord / Row(s).
For Avro I found that we can do that using something like:
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.spark.sql.avro.AvroDeserializer
...
val datumReader = new GenericDatumReader[GenericRecord]()
val reader = DataFileReader.openReader(avroStream, datumReader)
val iterator = reader.iterator()
val record = iterator.next()
val row = AvroDeserializer(reader.getSchema, schema).deserialize(record)
where avroStream is a stream of bytes.
Is there some utility classes that can help?
Thanks for your help!

Read data from HDFS

I'm using the FSDataInputStream library to access the data from HDFS
The following is the snippet which I'm using
val fs = FileSystem.get(new java.net.URI(#HDFS_URI),new Configuration())
val stream = fs.open(new Path(#PATH))
val reader = new BufferedReader(new InputStreamReader(stream))
val offset:String = reader.readLine() #Reads the string "5432" stored in the file
Expected output is "5432".
But the actual output is "^#5^#4^#3^#2"
Not able to trim "^#" since they are not considered as characters.Please help with appropriate solution.

Getting error while converting DynamicFrame to a Spark DataFrame using toDF

I stated using AWS Glue to read data using data catalog and GlueContext and transform as per requirement.
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
val sparkSession = glueContext.getSparkSession
// Data Catalog: database and table name
val dbName = "abcdb"
val tblName = "xyzdt_2017_12_05"
// S3 location for output
val outputDir = "s3://output/directory/abc"
// Read data into a DynamicFrame using the Data Catalog metadata
val stGBDyf = glueContext.getCatalogSource(database = dbName, tableName = tblName).getDynamicFrame()
val revisedDF = stGBDyf.toDf() // This line getting error
While executing above code I got following error,
Error : Syntax Error: error: value toDf is not a member of
com.amazonaws.services.glue.DynamicFrame val revisedDF =
stGBDyf.toDf() one error found.
I followed this example to convert DynamicFrame to Spark dataFrame.
Please suggest what will be the best way to resolve this problem
There's a typo. It should work fine with capital F in toDF:
val revisedDF = stGBDyf.toDF()

Using iterated writing in HDFS file by using Spark/Scala

I am learning how to read and write from files in HDFS by using Spark/Scala.
I am unable to write in HDFS file, the file is created, but it's empty.
I don't know how to create a loop for writing in a file.
The code is:
import scala.collection.immutable.Map
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
// Read the adult CSV file
val logFile = "hdfs://zobbi01:9000/input/adult.csv"
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
//val logFile = sc.textFile("hdfs://zobbi01:9000/input/adult.csv")
val headerAndRows = logData.map(line => line.split(",").map(_.trim))
val header = headerAndRows.first
val data = headerAndRows.filter(_(0) != header(0))
val maps = data.map(splits => header.zip(splits).toMap)
val result = maps.filter(map => map("AGE") != "23")
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
If I replace:
result.foreach{println}
Then it works!
but when using the method of (saveAsTextFile), then an error message is thrown as
<console>:76: error: type mismatch;
found : Unit
required: scala.collection.immutable.Map[String,String] => Unit
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
Any help please.
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
This is all what you need to do. You don't need to loop through all the rows.
Hope this helps!
What this does!!!
result.foreach{
result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt")
}
RDD action cannot be triggered from RDD transformations unless special conf set.
Just use result.saveAsTextFile("hdfs://zobbi01:9000/input/test2.txt") to save to HDFS.
I f you need other formats in the file to be written, change in rdd itself before writing.

Writing to a file in Apache Spark

I am writing a Scala code that requires me to write to a file in HDFS.
When I use Filewriter.write on local, it works. The same thing does not work on HDFS.
Upon checking, I found that there are the following options to write in Apache Spark-
RDD.saveAsTextFile and DataFrame.write.format.
My question is: what if I just want to write an int or string to a file in Apache Spark?
Follow up:
I need to write to an output file a header, DataFrame contents and then append some string.
Does sc.parallelize(Seq(<String>)) help?
create RDD with your data (int/string) using Seq: see parallelized-collections for details:
sc.parallelize(Seq(5)) //for writing int (5)
sc.parallelize(Seq("Test String")) // for writing string
val conf = new SparkConf().setAppName("Writing Int to File").setMaster("local")
val sc = new SparkContext(conf)
val intRdd= sc.parallelize(Seq(5))
intRdd.saveAsTextFile("out\\int\\test")
val conf = new SparkConf().setAppName("Writing string to File").setMaster("local")
val sc = new SparkContext(conf)
val stringRdd = sc.parallelize(Seq("Test String"))
stringRdd.saveAsTextFile("out\\string\\test")
Follow up Example: (Tested as below)
val conf = new SparkConf().setAppName("Total Countries having Icon").setMaster("local")
val sc = new SparkContext(conf)
val headerRDD= sc.parallelize(Seq("HEADER"))
//Replace BODY part with your DF
val bodyRDD= sc.parallelize(Seq("BODY"))
val footerRDD = sc.parallelize(Seq("FOOTER"))
//combine all rdds to final
val finalRDD = headerRDD ++ bodyRDD ++ footerRDD
//finalRDD.foreach(line => println(line))
//output to one file
finalRDD.coalesce(1, true).saveAsTextFile("test")
output:
HEADER
BODY
FOOTER
more examples here. . .