Scala: How to use a variable from a for loop outside the loop block

How can I create a DataFrame from all my JSON files, when after reading each file I need to add the file name as a field in the DataFrame? It seems a variable declared inside a for loop is not recognized outside the loop. How can I overcome this issue?
for (jsonfilename <- fileArray) {
  var df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
// trying to create temp table from dataframe created in loop
tblLanding.registerTempTable("LandingTable") // ERROR here: cannot resolve tblLanding
Thanks in advance
Hossain

It sounds like you are new to programming itself. Anyway, here you go.
Basically, you declare the variable with its type and initialise it before the loop, so it is still in scope after the loop ends.
var df: DataFrame = null
for (jsonfilename <- fileArray) {
  df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
// df is declared before the loop, so it is visible here (it holds the last file read)
df.registerTempTable("LandingTable")
Update
OK, let's walk through what the loop actually does.
Suppose fileArray has the values [1.json, 2.json, 3.json, 4.json].
The loop then creates 4 DataFrames by reading the 4 JSON files.
Which one do you want to register as a temp table?
If all of them:
var df: DataFrame = null
var count = 0
for (jsonfilename <- fileArray) {
  df = hivecontext.read.json(jsonfilename)
  val tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
  tblLanding.registerTempTable(s"LandingTable_$count") // register each file's DataFrame, keeping source_file_name
  count += 1 // Scala has no ++ operator
}
And the reason df was null before this update is that either your fileArray is empty or Spark failed to read a file. Print fileArray and check.
To query any of the registered landing tables:
val df2 = hivecontext.sql("SELECT * FROM LandingTable_0")
Update
The question has changed to making a single DataFrame from all the JSON files.
var dataFrame: DataFrame = null
for (jsonfilename <- fileArray) {
  val eachDataFrame = hivecontext.read.json(jsonfilename)
  if (dataFrame == null)
    dataFrame = eachDataFrame
  else
    dataFrame = eachDataFrame.unionAll(dataFrame)
}
dataFrame.registerTempTable("LandingTable")
Ensure that fileArray is not empty and that all JSON files in fileArray have the same schema.
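A quick way to check both conditions up front (a minimal sketch reusing hivecontext and fileArray from above; note it infers each file's schema separately, which reads the files once more):
require(fileArray.nonEmpty, "fileArray is empty")
val schemas = fileArray.map(f => hivecontext.read.json(f).schema)
require(schemas.distinct.size == 1, "json files have differing schemas")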

// Create list of dataframes with source-file-names
val dfList = fileArray.map { filename =>
  hivecontext.read.json(filename)
    .withColumn("source_file_name", lit(filename))
}
// union the dataframes (assuming all are same schema)
val df = dfList.reduce(_ unionAll _) // or use union if spark 2.x
// register as table
df.registerTempTable("LandingTable")
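Once registered, the temp table can be queried like any other table, for example to count rows per source file (a usage sketch with hivecontext from above):
val perFile = hivecontext.sql("SELECT source_file_name, COUNT(*) AS cnt FROM LandingTable GROUP BY source_file_name")
perFile.show()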

Related

Looping the scala list in Spark

I have a scala list as below.
partList: ListBuffer(2021-10-01, 2021-10-02, 2021-10-03, 2021-10-04, 2021-10-05, 2021-10-06, 2021-10-07, 2021-10-08)
Currently I'm getting all the data from the source into the DataFrame based on the above dates.
fctExistingDF = ss.read.table(existingTable).filter(s"event_date in ('${partList.mkString("','")}')")
Later I'm doing a few transformations and loading the data into a Delta table. The sample code is below.
fctDF = ss.read.table(existingTable).filter(s"event_date in ('${partList.mkString("','")}')")
if (fctExistingDF.count() > 0) {
  fctDF.createOrReplaceTempView("vw_exist_fct")
  val existingRecordsQuery = getExistingRecordsMergeQuery(azUpdateTS, key)
  ss.sql(existingRecordsQuery)
    .drop("az_insert_ts").drop("az_update_ts")
    .withColumn("az_insert_ts", col("new_az_insert_ts"))
    .withColumn("az_update_ts", col("new_az_update_ts"))
    .drop("new_az_insert_ts").drop("new_az_update_ts")
    .select(mrg_tbl_cols(0), mrg_tbl_cols.slice(1, mrg_tbl_cols.length): _*)
    .coalesce(72 * 2)
    .write.mode("Append").format("delta")
    .insertInto(mergeTable)
  mergedDataDF = ss.read.table(mergeTable).coalesce(72 * 2)
  mergedDataDF.coalesce(72)
    .write.mode("Overwrite").format("delta")
    .insertInto(s"${tgtSchema}.${tgtTbl}")
}
The command below creates a DataFrame based on the filter condition on event_date for the dates present in partList.
fctExistingDF = ss.read.table(existingTable).filter(s"event_date in ('${partList.mkString("','")}')")
Since this creates a DataFrame with a huge amount of data, I want to loop over each date in partList and read the data into the DataFrame one date at a time, instead of filtering on all the dates in partList at once.
I tried below.
var counter = 0
while (counter < partList.length) {
  // pass one date from the list per iteration
  fctExistingDF = ss.read.table(existingTable).filter(s"event_date in ('${partList(counter)}')")
  counter = counter + 1
}
I am new to Scala; maybe we should use foreach here?
Could someone please help. Thank you.
You can use foreach or map, depending on whether you want to return values (map) or not (foreach):
import org.apache.spark.sql.functions.col
val partList = List("2021-10-01", "2021-10-02", "2021-10-03", "2021-10-04", "2021-10-05", "2021-10-06", "2021-10-07", "2021-10-08")
partList.foreach { date =>
  // note: fctExistingDF is overwritten on every iteration, so do the per-date work inside the loop
  fctExistingDF = ss.read.table(existingTable).filter(col("event_date") === date)
}
If you want to return a list of DataFrames, use:
val dfs = partList.map { date =>
  ss.read.table(existingTable).filter(col("event_date") === date)
}
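If the goal is to run the whole merge one date at a time instead of on all dates at once, a rough sketch along these lines (reusing the names from the question; the loop body is where the existing transformation and write steps would go) could be:
partList.foreach { date =>
  val fctExistingDF = ss.read.table(existingTable).filter(col("event_date") === date)
  if (fctExistingDF.count() > 0) {
    // run the createOrReplaceTempView / merge / write steps from above for this single date
  }
}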

How to map one column with other columns in an avro file?

I'm using Spark 2.1.1 and Scala 2.11.8
This question is an extension of one my earlier questions:
How to identify null fields in a csv file?
The change is that rather than reading the data from a CSV file, I'm now reading the data from an Avro file. This is the format of the Avro file I'm reading the data from:
var ttime: Long = 0;
var eTime: Long = 0;
var tids: String = "";
var tlevel: Integer = 0;
var tboot: Long = 0;
var rNo: Integer = 0;
var varType: String = "";
var uids: List[TRUEntry] = Nil;
I'm parsing the avro file in a separate class.
I have to map the tids column with every single one of the uids in the same way as mentioned in the accepted answer of the link posted above, except this time from an Avro file rather than a well-formatted CSV file. How can I do this?
This is the code I'm trying to do it with :
val avroRow = spark.read.avro(inputString).rdd
val avroParsed = avroRow
.map(x => new TRParser(x))
.map((obj: TRParser) => ((obj.tids, obj.uId ),1))
.reduceByKey(_+_)
.saveAsTextFile(outputString)
After obj.tids, each of the uids has to be mapped individually to give a final output the same as in the accepted answer of the link above.
This is how I'm parsing all the uids in the avro file parsing class:
this.uids = Nil
row.getAs[Seq[Row]]("uids")
  .foreach((objRow: Row) =>
    this.uids ::= new TRUEntry(objRow)
  )
this.uids
  .foreach((obj: TRUEntry) => {
    uInfo += obj.uId + " , " + obj.initM.toString() + " , "
  })
P.S.: I apologise if the question seems dumb, but this is my first encounter with an Avro file.
It can be done by applying the same kind of loop over this.uids in the main code:
val avroParsed = avroRow
  .map(x => new TRParser(x))
  .map((obj: TRParser) => {
    val tId = obj.source.trim
    var retVal: String = ""
    obj.uids
      .foreach((u: TRUEntry) => {
        retVal += tId + "," + u.uId.trim + ":"
      })
    retVal.dropRight(1)
  })
val flattened = avroParsed
  .flatMap(x => x.split(":"))
  .map(y => (y, 1))
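The flattened pairs can then be aggregated and written out the same way as in the question's original pipeline (a sketch reusing outputString from the question):
flattened
  .reduceByKey(_ + _)           // count occurrences of each "tid,uId" entry
  .saveAsTextFile(outputString) // write out the counts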

Transforming specific field of the RDD

I am new to Spark. I have a doubt about transforming a specific field of an RDD.
I have a file like below:
2016-11-10T07:01:37|AAA|S16.12|MN-MN/AAA-329044|288364|2|3
2016-11-10T07:01:37|BBB|S16.12|MN-MN/AAA-329044/BBB-1|304660|0|0
2016-11-10T07:01:37|TSB|S16.12|MN-MN/AAA-329044/BBB-1/TSB-1|332164|NA|NA
2016-11-10T07:01:37|RX|S16.12|MN-MN/AAA-329044/BBB-1/TSB-1/RX-1|357181|0|1
And I want output like below: in the third field I want to remove all the characters and keep only the numbers, separated by |.
2016-11-10T07:01:37|AAA|16.12|329044|288364|2|3
2016-11-10T07:01:37|BBB|16.12|329044|1|304660|0|0
2016-11-10T07:01:37|TSB|16.12|329044|1|1|332164|NA|NA
2016-11-10T07:01:37|RX|16.12|329044|1|1|1|357181|0|1
How can I do that?
I tried the code below.
val inputRdd = sc.textFile("file:///home/arun/Desktop/inputcsv.txt")
val result = inputRdd.flatMap(line => line.split("\\|")).collect
def ghi(arr: Array[String]): Array[String] = {
  var outlist = scala.collection.mutable.Buffer[String]()
  for (i <- 0 to arr.length - 1) {
    if (arr(i).matches("(.*)-(.*)")) {
      var io = arr(i)
      var arru = scala.collection.mutable.Buffer[String]()
      if (io.contains("/")) {
        var ki = io.split("/")
        for (st <- 0 to ki.length - 1) {
          var ion = ki(st).split("-")
          arru += ion(1)
        }
        var strui = ""
        for (in <- 0 to arru.length - 1) {
          strui = strui + arru(in) + "|"
        }
        outlist += strui
      } else {
        var ion = arr(i).split("-")
        outlist += ion(1) + "|"
      }
    } else {
      outlist += arr(i)
    }
  }
  return outlist.toArray
}
var output = ghi(result)
val finalrdd = sc.parallelize(output, 1)
finalrdd.collect().foreach(println);
Please help me.
What we need to do is to extract the numbers from that field and add them as new entries to the Array being processed.
Something like this should do:
// use data provided as sample
val dataSample ="""2016-11-10T07:01:37|AAA|S16.12|MN-MN/AAA-329044|288364|2|3
2016-11-10T07:01:37|BBB|S16.12|MN-MN/AAA-329044/BBB-1|304660|0|0
2016-11-10T07:01:37|TSB|S16.12|MN-MN/AAA-329044/BBB-1/TSB-1|332164|NA|NA
2016-11-10T07:01:37|RX|S16.12|MN-MN/AAA-329044/BBB-1/TSB-1/RX-1|357181|0|1""".split('\n')
val data = sparkContext.parallelize(dataSample)
val records= data.map(line=> line.split("\\|"))
// this regex can find and extract the contiguous digits in a mixed string.
val numberExtractor = "\\d+".r.unanchored
// we replace field#3 with the results of the regex
val field3Exploded = records.map{arr => arr.take(3) ++ numberExtractor.findAllIn(arr.drop(3).head) ++ arr.drop(4)}
// Let's visualize the result
field3Exploded.collect.foreach(arr=> println(arr.mkString(",")))
2016-11-10T07:01:37,AAA,S16.12,329044,288364,2,3
2016-11-10T07:01:37,BBB,S16.12,329044,1,304660,0,0
2016-11-10T07:01:37,TSB,S16.12,329044,1,1,332164,NA,NA
2016-11-10T07:01:37,RX,S16.12,329044,1,1,1,357181,0,1
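If the "|"-separated lines from the desired output are needed rather than arrays, a small follow-up sketch:
// join each transformed record back into a "|"-delimited line
val asLines = field3Exploded.map(_.mkString("|"))
asLines.collect.foreach(println)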

java heap space error when converting csv to json but no error with d3.csv()

Platform being used: Apache Zeppelin
Language: scala, javascript
I use d3js to read a csv file of size ~40MB and it works perfectly fine with the below code:
<script type="text/javascript">
d3.csv("test.csv", function(data) {
// data is JSON array. Do something with data;
console.log(data);
});
</script>
Now, the idea is to avoid d3js and instead construct the JSON array in Scala and access this variable in the JavaScript code through z.angularBind(). Both of the code snippets below work for smaller files, but give a Java heap space error for the 40MB CSV file. What I am unable to understand is: if d3.csv() can do the job without any heap space error, why can't these two snippets?
Edited Code 1: reading the CSV with a BufferedReader and building a JSONArray
import java.io.BufferedReader;
import java.io.FileReader;
import org.json._
import scala.io.Source
var br = new BufferedReader(new FileReader("/root/test.csv"))
var contentLine = br.readLine()
var keys = contentLine.split(",")
contentLine = br.readLine()
var ja = new JSONArray()
while (contentLine != null) {
  var splits = contentLine.split(",")
  var jo = new JSONObject()
  for (i <- 0 to splits.length - 1) {
    jo.put(keys(i), splits(i))
  }
  ja.put(jo)
  contentLine = br.readLine()
}
//z.angularBind("ja",ja.toString()) //ja can be accessed now in javascript (EDITED-10/11/15)
Edited Code 2:
I thought the heap space issue might go away if I used Apache Spark to construct the JSON array as in the code below, but this one gives a heap space error too:
def myf(keys: Array[String], value: String): String = {
  var splits = value.split(",")
  var jo = new JSONObject()
  for (i <- 0 to splits.length - 1) {
    jo.put(keys(i), splits(i))
  }
  return jo.toString()
}
val csv = sc.textFile("/root/test.csv")
val firstrow = csv.first
val header = firstrow.split(",")
val data = csv.filter(x => x != firstrow)
var g = data.map(value => myf(header,value)).collect()
// EDITED BELOW 2 LINES-10/11/15
//var ja= g.mkString("[", ",", "]")
//z.angularBind("ja",ja) //ja can be accessed now in javascript
You are creating JSON objects. They are not native to Java/Scala and will therefore take up more space in that environment. What does z.angularBind() really do?
Also, what is the heap size of your JavaScript environment (see https://www.quora.com/What-is-the-maximum-size-of-a-JavaScript-object-in-browser-memory for Chrome) and of your Java environment (see "How is the default Java heap size determined?")?
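To illustrate the point about object overhead, here is a minimal sketch that builds the JSON as plain strings on the executors instead of creating org.json objects (it assumes the CSV values contain no commas, quotes or newlines that would need escaping); the final collect may still need a larger driver heap (spark.driver.memory) for a big result:
val csv = sc.textFile("/root/test.csv")
val firstrow = csv.first
val header = firstrow.split(",")
// build each row's JSON object as a plain string
val jsonLines = csv.filter(_ != firstrow).map { value =>
  header.zip(value.split(","))
    .map { case (k, v) => "\"" + k + "\":\"" + v + "\"" }
    .mkString("{", ",", "}")
}
// concatenate into one JSON array string on the driver
val ja = jsonLines.collect().mkString("[", ",", "]")
// z.angularBind("ja", ja) // as in the question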
Update: Removed the original part of the answer where I misunderstood the question

Apache-Spark: method in foreach doesn't work

I read a file from HDFS which contains x1,x2,y1,y2 representing an Envelope in JTS.
I would like to use that data to build an STRtree in foreach.
val inputData = sc.textFile(inputDataPath).cache()
val strtree = new STRtree
inputData.foreach(line => {
  val array = line.split(",").map(_.toDouble)
  val e = new Envelope(array(0), array(1), array(2), array(3))
  println("envelope is " + e)
  strtree.insert(e, new Rectangle(array(0), array(1), array(2), array(3)))
})
As you can see, I also print the e object.
To my surprise, when I log the size of strtree, it is zero! It seems that the insert method has no effect here.
By the way, if I hard-code some test data line by line, the strtree is built fine.
One more thing: the project is packed into a jar and run in the spark-shell.
So, why does the method inside foreach not work?
You will have to collect() to do this:
inputData.collect().foreach(line => {
... // your code
})
You can also do this (collecting only the computed pairs rather than the raw lines):
val pairs = inputData.map(line => {
  val array = line.split(",").map(_.toDouble)
  val e = new Envelope(array(0), array(1), array(2), array(3))
  println("envelope is " + e)
  (e, new Rectangle(array(0), array(1), array(2), array(3)))
})
pairs.collect().foreach(pair => {
  strtree.insert(pair._1, pair._2)
})
Use .map() instead of .foreach() and reassign the outcome.
foreach does not return the result of the applied function; it can be used for sending data somewhere, storing to a DB, printing, and so on.
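The underlying issue is that foreach runs on the executors, so inserts into the driver-side strtree are never visible on the driver. A compact sketch that follows the map-and-collect advice above (reusing the names from the question):
val strtree = new STRtree
inputData
  .map { line =>
    val a = line.split(",").map(_.toDouble)
    (new Envelope(a(0), a(1), a(2), a(3)), new Rectangle(a(0), a(1), a(2), a(3)))
  }
  .collect() // bring the pairs back to the driver
  .foreach { case (e, r) => strtree.insert(e, r) }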