How to map one column with other columns in an avro file? - scala

I'm using Spark 2.1.1 and Scala 2.11.8
This question is an extension of one of my earlier questions:
How to identify null fields in a csv file?
The change is that rather than reading the data from a CSV file, I'm now reading the data from an avro file. This is the format of the avro file I'm reading the data from:
var ttime: Long = 0;
var eTime: Long = 0;
var tids: String = "";
var tlevel: Integer = 0;
var tboot: Long = 0;
var rNo: Integer = 0;
var varType: String = "";
var uids: List[TRUEntry] = Nil;
I'm parsing the avro file in a separate class.
I have to map the tids column with every single one of the uids in the same way as mentioned in the accepted answer of the link posted above, except this time from an avro file rather than a well-formatted CSV file. How can I do this?
This is the code I'm trying to do it with:
val avroRow = spark.read.avro(inputString).rdd
val avroParsed = avroRow
  .map(x => new TRParser(x))
  .map((obj: TRParser) => ((obj.tids, obj.uId), 1))
  .reduceByKey(_ + _)
  .saveAsTextFile(outputString)
After obj.tids, each of the uids columns has to be mapped individually to give a final output the same as mentioned in the accepted answer of the above link.
This is how I'm parsing all the uids in the avro file parsing class:
this.uids = Nil
row.getAs[Seq[Row]]("uids")
  .foreach((objRow: Row) =>
    this.uids ::= (new TRUEntry(objRow))
  )
this.uids
  .foreach((obj: TRUEntry) => {
    uInfo += obj.uId + " , " + obj.initM.toString() + " , "
  })
P.S.: I apologise if the question seems dumb, but this is my first encounter with an avro file.

It can be done by applying the same foreach loop over this.uids in the main code:
val avroParsed = avroRow
  .map(x => new TRParser(x))
  .map((obj: TRParser) => {
    val tId = obj.tids.trim
    var retVal: String = ""
    obj.uids
      .foreach((uEntry: TRUEntry) => {
        retVal += tId + "," + uEntry.uId.trim + ":"
      })
    retVal.dropRight(1)
  })
val flattened = avroParsed
  .flatMap(x => x.split(":"))
  .map(y => (y, 1))
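To complete the pipeline in the same shape as the CSV answer, the flattened pairs can then be reduced and written out; a minimal sketch, assuming the flattened RDD and the outputString path from the code above:

// each element of flattened is a "tid,uid" string paired with 1, so a plain word-count works
val counted = flattened.reduceByKey(_ + _)
counted.saveAsTextFile(outputString)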

Related

trimming all string type columns dynamically of dataframe scala spark

Hi, I want to trim only the string-type columns of a DataFrame, since trimming all columns would change the data type of non-string columns to string. I currently have two ways to do it, but I'm looking for a good and efficient method.
First Method
var Countrydf = Seq(("Virat ", 18, "RCB ali shah"), (" Rohit ", 45, "MI "), (" DK", 67, "KKR ")).toDF("captains", "jersey_number", "teams")
Countrydf.show
for (name <- Countrydf.schema) {
  if (name.dataType.toString == "StringType")
    Countrydf = Countrydf.withColumn(name.name, trim(col(name.name)))
}
Second Method
val trimmedDF = Countrydf.columns.foldLeft(Countrydf) { (memoDF, colName) =>
  memoDF.withColumn(colName, trim(col(colName)))
}
val exprs = Countrydf.schema.fields.map { f =>
  if (trimmedDF.schema.fields.contains(f)) col(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}
trimmedDF.select(exprs: _*).printSchema
Both work fine and the output is the same.
Performance-wise, the best solution I found is:
var Countrydf = Seq(("Virat ", 18, "RCB ali shah"), (" Rohit ", 45, "MI "), (" DK", 67, "KKR ")).toDF("captains", "jersey_number", "teams")
Countrydf.show
for (name <- Countrydf.schema) {
  if (name.dataType.toString == "StringType")
    Countrydf = Countrydf.withColumn(name.name, trim(col(name.name)))
}
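If you want the immutability of the foldLeft approach without changing the types of the non-string columns, the two ideas can be combined; a minimal sketch, assuming the same Countrydf as above and the usual spark.implicits._ setup:

import org.apache.spark.sql.functions.{col, trim}
import org.apache.spark.sql.types.StringType

// fold only over the string-typed columns, leaving the other columns (and their types) untouched
val stringCols = Countrydf.schema.fields.filter(_.dataType == StringType).map(_.name)
val trimmedDf = stringCols.foldLeft(Countrydf) { (df, c) =>
  df.withColumn(c, trim(col(c)))
}
trimmedDf.printSchema // jersey_number stays an integer column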

decompress (unzip/extract) util using spark scala

I have customer_input_data.tar.gz in HDFS, which contains data for 10 different tables in CSV format. I need to extract this file to /my/output/path using Spark/Scala.
Please suggest how to extract the customer_input_data.tar.gz file using Spark/Scala.
gzip is not a splittable format in Hadoop. Consequently, the file is not really going to be distributed across the cluster and you don't get any benefit of distributed compute/processing in Hadoop or Spark.
A better approach may be to uncompress the file on the OS and then send the individual files back to Hadoop.
If you still want to uncompress in Scala, you can simply resort to the Java class GZIPInputStream via
new GZIPInputStream(new FileInputStream("your file path"))
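For a tar.gz archive you also need to walk the tar entries after gunzipping; a minimal local sketch, assuming commons-compress is on the classpath, example paths, and that the target directory /tmp/extracted already exists (see the HDFS version in the next answer):

import java.io.{BufferedOutputStream, FileInputStream, FileOutputStream}
import java.util.zip.GZIPInputStream
import org.apache.commons.compress.archivers.tar.{TarArchiveEntry, TarArchiveInputStream}

// gunzip, then iterate over the tar entries and copy each file out
val tarIn = new TarArchiveInputStream(new GZIPInputStream(new FileInputStream("/tmp/customer_input_data.tar.gz")))
var entry: TarArchiveEntry = tarIn.getNextTarEntry
while (entry != null) {
  if (!entry.isDirectory) {
    val out = new BufferedOutputStream(new FileOutputStream("/tmp/extracted/" + entry.getName))
    val buf = new Array[Byte](4096)
    var n = tarIn.read(buf)
    while (n != -1) { out.write(buf, 0, n); n = tarIn.read(buf) }
    out.close()
  }
  entry = tarIn.getNextTarEntry
}
tarIn.close()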
I developed the code below for decompressing the files using Scala. You need to pass the input path, the output path and the Hadoop FileSystem.
/* below method used for processing tar.gz files */
@throws[IOException]
private def processTargz(fullpath: String, houtPath: String, fs: FileSystem): Unit = {
  val path = new Path(fullpath)
  val gzipIn = new GzipCompressorInputStream(fs.open(path))
  try {
    val tarIn = new TarArchiveInputStream(gzipIn)
    try {
      out.println("Tar Name entry :" + FilenameUtils.getName(fullpath))
      val fileName1 = FilenameUtils.getName(fullpath)
      val tarNamesFolder = fileName1.substring(0, fileName1.indexOf('.'))
      out.println("Folder Name : " + tarNamesFolder)
      // note: in Scala an assignment evaluates to Unit, so the Java idiom
      // `while ((entry = in.getNextEntry()) != null)` does not work; read before and at the end of the loop instead
      var entry: TarArchiveEntry = tarIn.getNextTarEntry
      while (entry != null) {
        // the entry name is the csv/tsv file name inside the compressed tar file
        out.println("ENTITY NAME : " + entry.getName)
        if (entry.isDirectory) {
          /* If the entry is a directory, create the directory. */
          val f = new File(entry.getName)
          val created = f.mkdir
          if (!created) {
            out.printf("Unable to create directory '%s', during extraction of archive contents.%n", f.getAbsolutePath)
          }
        } else {
          val targetPath = houtPath + "/" + tarNamesFolder + "/" + entry.getName
          val hdfswritepath = new Path(targetPath)
          val fos = fs.create(hdfswritepath, true)
          val dest = new BufferedOutputStream(fos, BUFFER_SIZE)
          try {
            val data = new Array[Byte](BUFFER_SIZE)
            var count = tarIn.read(data, 0, BUFFER_SIZE)
            while (count != -1) {
              dest.write(data, 0, count)
              count = tarIn.read(data, 0, BUFFER_SIZE)
            }
          } finally {
            if (dest != null) dest.close()
          }
        }
        entry = tarIn.getNextTarEntry
      }
      out.println("Untar completed successfully!")
    } catch {
      case e: IOException =>
        out.println("catch Block")
    } finally {
      out.println("FINAL Block")
      if (tarIn != null) tarIn.close()
    }
  } finally {
    gzipIn.close()
  }
}
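For reference, a hedged sketch of how this method might be called; processTargz is the method above, while the configuration and paths here are only example assumptions:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// assumption: default Hadoop configuration and example HDFS paths
val fs = FileSystem.get(new Configuration())
processTargz("/data/customer_input_data.tar.gz", "/my/output/path", fs)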

ZipInputStream.read in ZipEntry

I am reading a zip file using ZipInputStream. The zip file has 4 CSV files. Some files are written completely, some only partially. Please help me find the issue with the code below. Is there any limit on the buffer size read by the ZipInputStream.read method?
val zis = new ZipInputStream(inputStream)
Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { file =>
  if (!file.isDirectory && file.getName.endsWith(".csv")) {
    val buffer = new Array[Byte](file.getSize.toInt)
    zis.read(buffer)
    val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
    fo.write(buffer)
  }
You have not closed/flushed the files you attempted to write. It should be something like this (assuming Scala syntax, or is this Kotlin/Ceylon?):
val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
try {
  fo.write(buffer)
} finally {
  fo.close
}
Also you should check the read count and read more if necessary, something like this:
var readBytes = 0
while (readBytes < buffer.length) {
  val r = zis.read(buffer, readBytes, buffer.length - readBytes)
  r match {
    case -1 => throw new IllegalStateException("Read terminated before reading everything")
    case _ => readBytes += r
  }
}
PS: In your example there seem to be fewer closing }s than required.
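Putting both fixes together, a minimal sketch of the corrected extraction loop, assuming the same inputStream and target directory as the question and that getSize is known for each entry:

import java.io.FileOutputStream
import java.util.zip.ZipInputStream

val zis = new ZipInputStream(inputStream)
Stream.continually(zis.getNextEntry).takeWhile(_ != null).foreach { file =>
  if (!file.isDirectory && file.getName.endsWith(".csv")) {
    val buffer = new Array[Byte](file.getSize.toInt)
    // keep reading until the whole entry has been consumed
    var readBytes = 0
    while (readBytes < buffer.length) {
      val r = zis.read(buffer, readBytes, buffer.length - readBytes)
      if (r == -1) throw new IllegalStateException("Read terminated before reading everything")
      readBytes += r
    }
    val fo = new FileOutputStream("c:\\temp\\input\\" + file.getName)
    try fo.write(buffer) finally fo.close()
  }
}
zis.close()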

write immutable code for storing data in listBuffer in scala

I have the code below, where I am using a mutable ListBuffer to store files received from a Kafka consumer, and then, when the list size reaches 15, I insert them into Cassandra.
But is there any way to do the same thing using an immutable list?
val filesList = ListBuffer[SystemTextFile]()
storeservSparkService.configFilesTopicInBatch.subscribe.atLeastOnce(Flow[SystemTextFile].mapAsync(4) { file: SystemTextFile =>
  filesList += file
  if (filesList.size == 15) {
    storeServSystemRepository.config.insertFileInBatch(filesList.toList)
    filesList.clear()
  }
  Future(Done)
})
Something along these lines?
Flow[SystemTextFile].grouped(15).mapAsync(4) { files =>
  storeServSystemRepository.config.insertFileInBatch(files)
}
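grouped(15) emits immutable batches, so no shared mutable state is needed at all. mapAsync expects a Future, so if insertFileInBatch is synchronous it may need wrapping; a hedged sketch, assuming the same service objects as the question:

import akka.Done
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

storeservSparkService.configFilesTopicInBatch.subscribe.atLeastOnce(
  Flow[SystemTextFile].grouped(15).mapAsync(4) { files =>
    // files is an immutable Seq of up to 15 elements
    Future {
      storeServSystemRepository.config.insertFileInBatch(files.toList)
      Done
    }
  }
)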
Have you tried using Vector?
var filesList = Vector[SystemTextFile]() // must be a var so it can be rebound
storeservSparkService.configFilesTopicInBatch.subscribe.
  atLeastOnce(Flow[SystemTextFile].mapAsync(4) { file: SystemTextFile =>
    filesList = filesList :+ file
    if (filesList.length == 15) {
      storeServSystemRepository.config.insertFileInBatch(filesList.toList)
      filesList = Vector.empty[SystemTextFile] // reset the batch, mirroring clear() above
    }
    Future(Done)
  })

Scala : How to use variable in for loop outside loop block

How can I create a DataFrame from all my JSON files, when after reading each file I need to add the file name as a field in the DataFrame? It seems the variable declared in the for loop is not recognized outside the loop block. How do I overcome this issue?
for (jsonfilename <- fileArray) {
  var df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
// trying to create temp table from dataframe created in loop
tblLanding.registerTempTable("LandingTable") // ERROR here, can't resolve tblLanding
Thanks in advance,
Hossain
I think you are new to programming itself.
Anyway, here you go.
Basically you declare the variable with its type and initialise it before the loop.
var df: DataFrame = null
for (jsonfilename <- fileArray) {
  df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
df.registerTempTable("LandingTable") // this line no longer errors: df is declared before the loop
Update
OK, you are completely new to programming, even loops.
Suppose fileArray has the values [1.json, 2.json, 3.json, 4.json].
The loop then creates 4 DataFrames, by reading 4 JSON files.
Which one do you want to register as a temp table?
If all of them:
var df: DataFrame = null
var count = 0
for (jsonfilename <- fileArray) {
  df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
  df.registerTempTable(s"LandingTable_$count")
  count += 1 // Scala has no ++ operator
}
And the reason for df being empty before this update is that your fileArray is empty or Spark failed to read that file. Print it and check.
To query any of those registered LandingTables:
val df2 = hivecontext.sql("SELECT * FROM LandingTable_0")
Update
The question has changed to making a single DataFrame from all the JSON files.
var dataFrame: DataFrame = null
for (jsonfilename <- fileArray) {
  val eachDataFrame = hivecontext.read.json(jsonfilename)
  if (dataFrame == null)
    dataFrame = eachDataFrame
  else
    dataFrame = eachDataFrame.unionAll(dataFrame)
}
dataFrame.registerTempTable("LandingTable")
Ensure that fileArray is not empty and that all JSON files in fileArray have the same schema.
// Create a list of dataframes with source file names
val dfList = fileArray.map { filename =>
  hivecontext.read.json(filename)
    .withColumn("source_file_name", lit(filename))
}
// union the dataframes (assuming all have the same schema)
val df = dfList.reduce(_ unionAll _) // or use union in Spark 2.x
// register as table
df.registerTempTable("LandingTable")
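Once registered, the table can be queried like any other temp table; a small sketch, assuming the hivecontext and the source_file_name column from the code above:

// count rows contributed by each source file
val perFile = hivecontext.sql("SELECT source_file_name, COUNT(*) AS cnt FROM LandingTable GROUP BY source_file_name")
perFile.show()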