Transforming a specific field of an RDD - Scala

I am new to Spark. I have a question about transforming a specific field of an RDD.
I have a file like below:
2016-11-10T07:01:37|AAA|S16.12|MN-MN/AAA-329044|288364|2|3
2016-11-10T07:01:37|BBB|S16.12|MN-MN/AAA-329044/BBB-1|304660|0|0
2016-11-10T07:01:37|TSB|S16.12|MN-MN/AAA-329044/BBB-1/TSB-1|332164|NA|NA
2016-11-10T07:01:37|RX|S16.12|MN-MN/AAA-329044/BBB-1/TSB-1/RX-1|357181|0|1
And I want output like below. In the third field I want to remove all characters and keep only the integers, separated by |:
2016-11-10T07:01:37|AAA|16.12|329044|288364|2|3
2016-11-10T07:01:37|BBB|16.12|329044|1|304660|0|0
2016-11-10T07:01:37|TSB|16.12|329044|1|1|332164|NA|NA
2016-11-10T07:01:37|RX|16.12|329044|1|1|1|357181|0|1
How can I do that? I tried the code below:
val inputRdd = sc.textFile("file:///home/arun/Desktop/inputcsv.txt")
val result = inputRdd.flatMap(line => line.split("\\|")).collect

def ghi(arr: Array[String]): Array[String] = {
  var outlist = scala.collection.mutable.Buffer[String]()
  for (i <- 0 to arr.length - 1) {
    if (arr(i).matches("(.*)-(.*)")) {
      var io = arr(i)
      var arru = scala.collection.mutable.Buffer[String]()
      if (io.contains("/")) {
        var ki = io.split("/")
        for (st <- 0 to ki.length - 1) {
          var ion = ki(st).split("-")
          arru += ion(1)
        }
        var strui = ""
        for (in <- 0 to arru.length - 1) {
          strui = strui + arru(in) + "|"
        }
        outlist += strui
      } else {
        var ion = arr(i).split("-")
        outlist += ion(1) + "|"
      }
    } else {
      outlist += arr(i)
    }
  }
  return outlist.toArray
}

var output = ghi(result)
val finalrdd = sc.parallelize(output, 1)
finalrdd.collect().foreach(println)
Please help me.

What we need to do is to extract the numbers from that field and add them as new entries to the Array being processed.
Something like this should do:
// use data provided as sample
val dataSample ="""2016-11-10T07:01:37|AAA|S16.12|MN-MN/AAA-329044|288364|2|3
2016-11-10T07:01:37|BBB|S16.12|MN-MN/AAA-329044/BBB-1|304660|0|0
2016-11-10T07:01:37|TSB|S16.12|MN-MN/AAA-329044/BBB-1/TSB-1|332164|NA|NA
2016-11-10T07:01:37|RX|S16.12|MN-MN/AAA-329044/BBB-1/TSB-1/RX-1|357181|0|1""".split('\n')
val data = sparkContext.parallelize(dataSample)
val records= data.map(line=> line.split("\\|"))
// this regex can find and extract the contiguous digits in a mixed string.
val numberExtractor = "\\d+".r.unanchored
// we replace field#3 with the results of the regex
val field3Exploded = records.map{arr => arr.take(3) ++ numberExtractor.findAllIn(arr.drop(3).head) ++ arr.drop(4)}
// Let's visualize the result
field3Exploded.collect.foreach(arr=> println(arr.mkString(",")))
2016-11-10T07:01:37,AAA,S16.12,329044,288364,2,3
2016-11-10T07:01:37,BBB,S16.12,329044,1,304660,0,0
2016-11-10T07:01:37,TSB,S16.12,329044,1,1,332164,NA,NA
2016-11-10T07:01:37,RX,S16.12,329044,1,1,1,357181,0,1
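If you need the result back as |-delimited lines rather than arrays, a short follow-up sketch could be (the output path is just a hypothetical example):
// join each record back into a "|"-separated line
val asLines = field3Exploded.map(_.mkString("|"))
asLines.collect.foreach(println)
// or write it out; the path below is a placeholder, not from the original question
// asLines.saveAsTextFile("file:///home/arun/Desktop/outputcsv")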

Related

Trimming all string-type columns of a DataFrame dynamically - Scala Spark

Hi, I want to trim only the string-type columns of a DataFrame, as trimming all columns would change the data type of non-string columns to string.
I currently have two ways to do it, but I am looking for a good and efficient method.
First Method
var Countrydf = Seq(("Virat ", 18, "RCB ali shah"), (" Rohit ", 45, "MI "), (" DK", 67, "KKR ")).toDF("captains", "jersey_number", "teams")
Countrydf.show
for (name <- Countrydf.schema) {
if (name.dataType.toString == "StringType")
Countrydf = Countrydf.withColumn(name.name, trim(col(name.name)))
}
Second Method
val trimmedDF = Countrydf.columns.foldLeft(Countrydf) { (memoDF, colName) =>
  memoDF.withColumn(colName, trim(col(colName)))
}
val exprs = Countrydf.schema.fields.map { f =>
  if (trimmedDF.schema.fields.contains(f)) col(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}
trimmedDF.select(exprs: _*).printSchema
Both work fine and the output is the same.
Performance-wise, the best solution I found is:
var Countrydf = Seq(("Virat ",18,"RCB ali shah"),(" Rohit ",45,"MI "),(" DK",67,"KKR ")).toDF("captains","jersey_number","teams")
Countrydf.show
for( name <- Countrydf.schema) {
if(name.dataType.toString=="StringType")
Countrydf= Countrydf.withColumn(name.name, trim(col(name.name)))
}
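Combining the two ideas, a hedged sketch that folds only over the string columns (so non-string columns keep their types; the names stringCols and trimmedDf are illustrative, not from the question) could look like this:
import org.apache.spark.sql.functions.{col, trim}
import org.apache.spark.sql.types.StringType

// collect only the StringType column names, then fold over just those
val stringCols = Countrydf.schema.fields.collect { case f if f.dataType == StringType => f.name }
val trimmedDf = stringCols.foldLeft(Countrydf) { (df, c) =>
  df.withColumn(c, trim(col(c)))
}
trimmedDf.show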

Gatling & Scala : How to split values in loop?

I want to split some values in a loop. I used the split method in a check and it works for me, but there are more than 25 values of two different types.
So I am trying to implement a loop in Scala and struggling.
Consider the following scenario:
import scala.concurrent.duration._
import io.gatling.core.Predef._
import io.gatling.http.Predef._
class testSimulation extends Simulation {

  val httpProtocol = http
    .baseURL("https://website.com")
    .doNotTrackHeader("1")
    .disableCaching

  val uri1 = "https://website.com"

  val scn = scenario("EditAttribute")
    .exec(http("LogIn")
      .post(uri1 + "/web/guest/")
      .headers(headers_0)
    .exec(http("getPopupData")
      .post("/website/getPopupData")
      .check(jsonPath("$.data[0].pid").transform(_.split('#').toSeq).saveAs("pID"))) // Saving split value
    .exec(http("Listing")
      .post("/website/listing")
      .check(jsonPath("$.data[*].AdId").findAll.saveAs("aID")) // All values are collected in a vector
      // .check(jsonPath("$.data[*].AdId").transform(_.split('#').toSeq).saveAs("aID")) // Split method not working for batch
      // .check(jsonPath("$.data[*].AdId").findAll.saveAs("aID")) // To verify the length of the array (vector)
      .check(jsonPath("$.data[0].RcId").findAll.saveAs("rID")))
    .exec(http("UpdatedDataListing")
      .post("/website/search")
      .formParam("entityTypeId", "${pId(0)}") // passing split value, works fine
      .formParam("action_id", "${aId(0)},${aId(1)},${aId(2)},..and so on) // need to pass split values, which is not happening
      .formParam("userId", "${rID}")
    // To verify values on the console (what value I'm getting after splitting)...
    .exec( session => {
      val abc = session("pID").as[Seq[String]]
      val xyz = session("aID").as[Seq[String]]
      println("Separated pId ===> " + abc(0)) // output - first split value
      println("Separated pId ===> " + abc(1)) // split separator
      println("Separated pId ===> " + abc(2)) // second split value
      println("Length ===> " + abc.length)    // output - 3
      println("Length ===> " + xyz.length)    // output - 25
      session
    }
    )
    .exec(http("logOut")
      .get("https://" + uri1 + "/logout")
      .headers(headers_0))

  setUp(scn.inject(atOnceUsers(1))).protocols(httpProtocol)
}
I want to implement a loop which splits all (25) values in the session; I do not want to hard-code them.
I am a newbie to Scala and Gatling as well.
Since it is a session function, the snippet below should give you a direction to continue; use split just like you would in Java:
exec { session =>
  var requestIdValue = new scala.util.Random().nextInt(Integer.MAX_VALUE).toString()
  var length = jobsQue.length
  try {
    var reportElement = jobsQue.pop()
    jobData = reportElement.getData
    xml = Configuration.XML.replaceAll("requestIdValue", requestIdValue)
    println(s"For Request Id : $requestIdValue .Data Value from feeder is : $jobData Current size of jobsQue : $length")
  } catch {
    case e: NoSuchElementException => print("Error")
  }
  session.setAll(
    "xmlRequest" -> xml)
}
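Applied to the 25 saved values above, a hedged sketch of such a session function (the key "aID" comes from the question; "aIDParts" is a hypothetical key chosen here) could be:
// split every value saved by findAll in one session function, no hard-coded indices
exec { session =>
  val allIds = session("aID").as[Seq[String]]        // the vector saved by findAll
  val splitIds = allIds.flatMap(_.split('#').toSeq)  // split each entry and keep all parts
  session.set("aIDParts", splitIds)                  // hypothetical key, usable later as "${aIDParts(0)}"
}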

Scala: How to use a variable from a for loop outside the loop block

How can I create a DataFrame from all my JSON files, when after reading each file I need to add the file name as a field in the DataFrame? It seems the variable in the for loop is not recognized outside the loop. How can I overcome this issue?
for (jsonfilenames <- fileArray) {
var df = hivecontext.read.json(jsonfilename)
var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
// trying to create temp table from dataframe created in loop
tblLanding.registerTempTable("LandingTable") // ERROR here, can't resolved tblLanding
Thanks in advance,
Hossain
I think you are new to programming itself.
Anyway, here you go.
Basically, you declare the variable with its type and initialise it before the loop.
var df: DataFrame = null
for (jsonfilename <- fileArray) {
  df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
df.registerTempTable("LandingTable") // now resolvable: df is declared before the loop
Update
OK, you are completely new to programming, even loops.
Suppose fileArray has the values [1.json, 2.json, 3.json, 4.json].
The loop then actually creates 4 DataFrames by reading the 4 JSON files.
Which one do you want to register as a temp table?
If all of them:
var df: DataFrame = null
var count = 0
for (jsonfilename <- fileArray) {
  df = hivecontext.read.json(jsonfilename)
  var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
  df.registerTempTable(s"LandingTable_$count")
  count += 1 // Scala has no ++ operator
}
And the reason for df being empty before this update is that your fileArray is empty or Spark failed to read the file. Print it and check.
To query any of the registered LandingTables:
val df2 = hivecontext.sql("SELECT * FROM LandingTable_0")
Update
The question has changed to making a single DataFrame from all the JSON files.
var dataFrame: DataFrame = null
for (jsonfilename <- fileArray) {
  val eachDataFrame = hivecontext.read.json(jsonfilename)
  if (dataFrame == null)
    dataFrame = eachDataFrame
  else
    dataFrame = eachDataFrame.unionAll(dataFrame)
}
dataFrame.registerTempTable("LandingTable")
Ensure that fileArray is not empty and that all JSON files in fileArray have the same schema.
// Create a list of dataframes with source-file-names
val dfList = fileArray.map { filename =>
  hivecontext.read.json(filename)
    .withColumn("source_file_name", lit(filename))
}
// union the dataframes (assuming all have the same schema)
val df = dfList.reduce(_ unionAll _) // or use union if spark 2.x
// register as table
df.registerTempTable("LandingTable")
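A quick, illustrative sanity check after registering the table from the snippet above (any query works; this one just counts rows per source file):
// verify that every source file contributed rows
hivecontext.sql("SELECT source_file_name, COUNT(*) FROM LandingTable GROUP BY source_file_name").show()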

How can I reduce the redundancy of the field handling in a feed handler?

I subscribe to a message feed for a number of fields. I need to set the values from the feed on the domain object, and have code like below:
if (map.contains(quoteBidPriceAcronym)) {
  quote.bid.price = Some(map.get(quoteBidPriceAcronym).get.asInstanceOf[Number].doubleValue());
  quote.changed = true;
}
if (map.contains(quoteBidSizeAcronym)) {
  quote.bid.size = Some(sizeMultipler() * map.get(quoteBidSizeAcronym).get.asInstanceOf[Number].intValue());
  quote.changed = true;
}
if (map.contains(quoteBidNumAcronym)) {
  quote.bid.num = Some(map.get(quoteBidNumAcronym).get.asInstanceOf[Number].shortValue());
  quote.changed = true;
}
if (map.contains(quoteAskPriceAcronym)) {
  quote.ask.price = Some(map.get(quoteAskPriceAcronym).get.asInstanceOf[Number].doubleValue());
  quote.changed = true;
}
if (map.contains(quoteAskSizeAcronym)) {
  quote.ask.size = Some(sizeMultipler() * map.get(quoteAskSizeAcronym).get.asInstanceOf[Number].intValue());
  quote.changed = true;
}
if (map.contains(quoteAskNumAcronym)) {
  quote.ask.num = Some(map.get(quoteAskNumAcronym).get.asInstanceOf[Number].shortValue());
  quote.changed = true;
}
if (map.contains(quoteExchTimeAcronym)) {
  quote.exchtime = getExchTime(String.valueOf(map.get(quoteExchTimeAcronym).get));
}
It looks pretty redundant; any suggestions to improve it?
You can do something like:
map.get(quoteBidPriceAcronym).map { item =>
  quote.bid.price = item.map(_.asInstanceOf[Number].doubleValue())
  quote.changed = true
}
Other issues might be better to fix outside this code. E.g. why is map(quoteBidPriceAcronym) storing an Option, if your code assumes it is never going to be None?
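If the inner Option really can never be None, one hedged way to make that explicit (assuming map is a Map[String, Option[Any]], which is only implied by the question's .get.get calls) is to flatten it when reading:
// flatten the nested Option and only act when a value is actually present
map.get(quoteBidPriceAcronym).flatten.foreach { item =>
  quote.bid.price = Some(item.asInstanceOf[Number].doubleValue())
  quote.changed = true
}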
Something like this perhaps?
val handlers = Map[String, Number => Unit](
  quoteBidPriceAcronym -> { n => quote.bid.price = Some(n.doubleValue) },
  quoteBidSizeAcronym  -> { n => quote.bid.size = Some(sizeMultipler() * n.intValue) }
  // etc. ...
)
for {
  (k, handler) <- handlers
  value        <- map.get(k).toSeq
} {
  quote.changed = true
  handler(value.asInstanceOf[Number])
}
Personally, I don't like code changing an object's state (quote), but this is a question about Scala, not functional programming.
That said, I would reverse the way you are using your map keys. Instead of checking whether a value exists in order to perform some action, I'd have a map from your keys to actions and I'd iterate over your map's elements.
E.g. (assuming map is of the type Map[String, Any]):
val actions: Map[String, PartialFunction[Any, Unit]] = Map(
  (quoteBidPriceAcronym, { case n: Number => quote.bid.price = Some(n.doubleValue()) }),
  (quoteBidSizeAcronym, { case n: Number => quote.bid.size = Some(sizeMultipler() * n.intValue()) })
  // ...
)
for ((k, v) <- map; action <- actions.get(k); _ <- action.lift(v))
  quote.changed = true
The for construct here iterates over the map's key-value pairs, then, at the next level of iteration, over the possible action available for the key. If an action is found, which is a partial function, it gets lifted to turn it into a function from Any to Option[Unit]. That way, you can iterate at an additional inner level, so quote.changed = true is only run when the action is defined for v.
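As a small standalone illustration of what lift does here (the values are made up):
val onNumber: PartialFunction[Any, Unit] = { case n: Number => println(n.doubleValue()) }
onNumber.lift(42)      // Some(()) - the case matched, the side effect ran
onNumber.lift("text")  // None     - not defined for a String, nothing ran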

Apache-Spark: method in foreach doesn't work

I read a file from HDFS which contains x1,x2,y1,y2 values representing an envelope in JTS.
I would like to use that data to build an STRtree in foreach.
val inputData = sc.textFile(inputDataPath).cache()
val strtree = new STRtree
inputData.foreach(line => {
  val array = line.split(",").map(_.toDouble)
  val e = new Envelope(array(0), array(1), array(2), array(3))
  println("envelope is " + e)
  strtree.insert(e, new Rectangle(array(0), array(1), array(2), array(3)))
})
As you can see, I also print the e object.
To my surprise, when I log the size of strtree, it is zero! It seems the insert method has no effect here.
By the way, if I hard-code some test data line by line, the strtree is built fine.
One more thing: the project is packaged into a jar and submitted from the spark-shell.
So, why does the method in foreach not work?
You will have to collect() to do this:
inputData.collect().foreach(line => {
... // your code
})
You can do this (to avoid collecting all the raw data):
val pairs = inputData.map(line => {
  val array = line.split(",").map(_.toDouble)
  val e = new Envelope(array(0), array(1), array(2), array(3))
  println("envelope is " + e)
  (e, new Rectangle(array(0), array(1), array(2), array(3)))
})
pairs.collect().foreach(pair => {
  strtree.insert(pair._1, pair._2)
})
Use .map() instead of .foreach() and reassign the outcome.
foreach does not return the outcome of the applied function; it can be used for sending data somewhere, storing to a DB, printing, and so on.
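A minimal sketch of that suggestion, reusing the Envelope/Rectangle pairs from the answer above: map returns a new RDD that you keep, and the mutable tree is only touched on the driver after collect().
// build the pairs on the executors, keep the resulting RDD
val entries = inputData.map { line =>
  val a = line.split(",").map(_.toDouble)
  (new Envelope(a(0), a(1), a(2), a(3)), new Rectangle(a(0), a(1), a(2), a(3)))
}
// insert into the driver-side tree, where the mutation is actually visible
entries.collect().foreach { case (env, rect) => strtree.insert(env, rect) }
println("strtree size after inserting on the driver: " + strtree.size())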