Hive function to split a key-value pair string into two columns - scala

How can I split this data, T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0, into two columns using a Hive function?
For example:
T 32
P 1
A 420
H 60
R 0.30841494477846165
S 0

You can do this with a regex implementation:
import java.util.regex.Pattern

def main(args: Array[String]): Unit = {
  val s = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
  // a capital letter, an underscore, then an integer or decimal value
  val pattern = "[A-Z]_\\d+\\.?\\d*"
  val buff = new StringBuilder
  val r = Pattern.compile(pattern)
  val m = r.matcher(s)
  while (m.find()) {
    buff.append(m.group(0))
    buff.append("\n")
  }
  val output = buff.toString.replaceAll("_", " ")
  println("output:\n" + output)
}
Output:
output:
T 32
P 1
A 420
H 60
R 0.30841494477846165
S 0

If you need to collect the data for further processing, and you're guaranteed it's always paired correctly, you could do something like this.
scala> val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
str: String = T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0
scala> val data = str.split("_").sliding(2,2)
data: Iterator[Array[String]] = non-empty iterator
scala> data.toList // just to see it
res29: List[Array[String]] = List(Array(T, 32), Array(P, 1), Array(A, 420), Array(H, 60), Array(R, 0.30841494477846165), Array(S, 0))
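If you want key/value tuples rather than arrays, a small variation on the same idea (again assuming the pairs are always complete):
scala> val pairs = str.split("_").grouped(2).collect { case Array(k, v) => (k, v) }.toList
pairs: List[(String, String)] = List((T,32), (P,1), (A,420), (H,60), (R,0.30841494477846165), (S,0))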

You can split your string to get an array, zipWithIndex, and filter on the index to get two arrays, col1 and col2, which you can then use for printing:
val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
val tmp = str.split('_').zipWithIndex
val col1 = tmp.filter(p => p._2 % 2 == 0).map(p => p._1)
val col2 = tmp.filter(p => p._2 % 2 != 0).map(p => p._1)
//col1: Array[String] = Array(T, P, A, H, R, S)
//col2: Array[String] = Array(32, 1, 420, 60, ...
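For the printing step, one option is to zip the two columns back together:
col1.zip(col2).foreach { case (k, v) => println(s"$k $v") }
This prints the six "key value" pairs listed at the top of the question.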

Related

Reading a ByteArray returned from reading a binary format

I have this piece of code that reads binary data from a stream:
import java.io.DataInputStream
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets
import scala.collection.mutable

// SarInfoObj is a case class defined elsewhere in the project
def readSARData(ds: DataInputStream): List[SarInfoObj] = {
  var c = 0
  val list = mutable.ListBuffer[SarInfoObj]()
  while (ds.available > 0) {
    // first 16-bit word: secondary-header flag, PID and PCAT
    c = ds.readShort()
    println("c = " + c)
    val Secondary = (c >> 11) & 0x01
    val PID = (c >> 4) & 0x7f
    val PCAT = c & 0xf
    printf("%04x:%d(1)\t%d(65)\t%d(12)\n", c, Secondary, PID, PCAT)
    // second 16-bit word: sequence flags and count
    c = ds.readShort()
    val Sequence = c >> 14
    val Count = c & 0x3f
    // third 16-bit word: data length minus one
    c = ds.readShort()
    val DataLen: Int = c + 1
    printf("%04x: %x(3)\tCount=%02d\tLen=%d(61..65533)\n", c, Sequence, Count, DataLen)
    if ((DataLen + 6) % 4 != 0) {
      System.out.println("DataLen + 6 is not a multiple of 4")
      System.exit(1)
    }
    val bdata = new Array[Byte](DataLen)
    ds.readFully(bdata)
    val str = new String(bdata, StandardCharsets.US_ASCII)
    // first four bytes, read as an unsigned 32-bit value
    val l = ByteBuffer.wrap(bdata).getInt(0) & (-1L >>> 32)
    println("lBE = " + ByteBuffer.wrap(bdata).order(ByteOrder.BIG_ENDIAN).getInt(0))
    println("timestamp = " + l)
    //System.out.println("data= " + str);
    // fields at fixed offsets within the data block
    val BAQ = ByteBuffer.wrap(bdata, 31, 1).get
    val Typ = ByteBuffer.wrap(bdata, 57, 1).get
    val BlockType = Typ >> 4
    val SWATH = ByteBuffer.wrap(bdata, 58, 1).get
    val NQ = ByteBuffer.wrap(bdata, 59, 2).getShort
    println("Swath = " + SWATH)
    list += SarInfoObj(BAQ, BlockType, DataLen, NQ, str.replace('\u0000', '|').substring(62))
  }
  list.toList
}
Now I want to write similar code that processes this data by reading it from the binary file format. How can I do that and advance the index so that I can read the data from the DataFrame?
def processSarFileAsBinaryStream(fileName: String, session: SparkSession): Unit = {
  val fileDf = session.read.format("binaryFile").load(fileName)
  fileDf.map(row => {
    val data = row.getAs[Array[Byte]](0)
    // ... how do I parse `data` here, advancing through it the way readSARData does?
  }).collect().foreach(println)
}
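Since readSARData already consumes a DataInputStream, one approach (a sketch, untested against this data) is to wrap each row's in-memory byte array in a stream and reuse the parser unchanged; the binaryFile source exposes the raw bytes in its content column, and going through .rdd avoids needing an Encoder for SarInfoObj:
import java.io.{ByteArrayInputStream, DataInputStream}
import org.apache.spark.sql.SparkSession

def processSarFileAsBinaryStream(fileName: String, session: SparkSession): Unit = {
  val fileDf = session.read.format("binaryFile").load(fileName)
  val parsed = fileDf
    .select("content")  // the column holding the raw file bytes
    .rdd
    .flatMap { row =>
      val bytes = row.getAs[Array[Byte]]("content")
      // the stream position advances through the array just as it would through a file
      readSARData(new DataInputStream(new ByteArrayInputStream(bytes)))
    }
  parsed.collect().foreach(println)
}
Note that readSARData (and SarInfoObj) must be serializable for this to run on the executors.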

Outlier Elimination in Spark With InterQuartileRange Results in Error

I have the following recursive function that determines outliers using the interquartile range method:
def interQuartileRangeFiltering(df: DataFrame): DataFrame = {
  @scala.annotation.tailrec
  def inner(cols: List[String], acc: DataFrame): DataFrame = cols match {
    case Nil => acc
    case column :: xs =>
      val quantiles = acc.stat.approxQuantile(column, Array(0.25, 0.75), 0.0) // TODO: values should come from config
      println(s"$column ${quantiles.size}")
      val q1 = quantiles(0)
      val q3 = quantiles(1)
      val iqr = q1 - q3
      val lowerRange = q1 - 1.5 * iqr
      val upperRange = q3 + 1.5 * iqr
      val filtered = acc.filter(s"$column < $lowerRange or $column > $upperRange")
      inner(xs, filtered)
  }
  inner(df.columns.toList, df)
}
val outlierDF = interQuartileRangeFiltering(incomingDF)
val outlierDF = interQuartileRangeFiltering(incomingDF)
So basically I'm recursively iterating over the columns and eliminating the outliers. Strangely, it results in an ArrayIndexOutOfBoundsException and prints the following:
housing_median_age 2
inland 2
island 2
population 2
total_bedrooms 2
near_bay 2
near_ocean 2
median_house_value 0
java.lang.ArrayIndexOutOfBoundsException: 0
at inner$1(<console>:75)
at interQuartileRangeFiltering(<console>:83)
... 54 elided
What is wrong with my approach?
For the error itself: the trace shows approxQuantile returning an empty array for median_house_value, so quantiles(0) throws (approxQuantile has nothing to aggregate once a column has no non-null values left). Note also that the original computes iqr as q1 - q3 instead of q3 - q1, and the filter keeps the rows outside the range rather than dropping them.
Here is what I came up with, and it works like a charm:
def outlierEliminator(df: DataFrame, colsToIgnore: List[String])(fn: (String, DataFrame) => (Double, Double)): DataFrame = {
  val ID_COL_NAME = "id"
  val dfWithId = DataFrameUtils.addColumnIndex(spark, df, ID_COL_NAME)
  val dfWithIgnoredCols = dfWithId.drop(colsToIgnore: _*)

  @tailrec
  def inner(
      cols: List[String],
      filterIdSeq: List[Long],
      dfWithId: DataFrame
  ): List[Long] = cols match {
    case Nil => filterIdSeq
    case column :: xs =>
      if (column == ID_COL_NAME) {
        inner(xs, filterIdSeq, dfWithId)
      } else {
        val (lowerBound, upperBound) = fn(column, dfWithId)
        val filteredIds =
          dfWithId
            .filter(s"$column < $lowerBound or $column > $upperBound")
            .select(col(ID_COL_NAME))
            .map(r => r.getLong(0))
            .collect
            .toList
        inner(xs, filteredIds ++ filterIdSeq, dfWithId)
      }
  }

  val filteredIds = inner(dfWithIgnoredCols.columns.toList, List.empty[Long], dfWithIgnoredCols)
  dfWithId.except(dfWithId.filter($"$ID_COL_NAME".isin(filteredIds: _*)))
}
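The bounds function is supplied as fn; here is a minimal sketch of an IQR-based one (iqrBounds is my name for it), guarding against the empty-quantile case that caused the original exception:
import org.apache.spark.sql.DataFrame

// Hypothetical fn: computes (lowerBound, upperBound) for a column via the IQR,
// failing loudly instead of indexing into an empty quantile array.
def iqrBounds(column: String, df: DataFrame): (Double, Double) = {
  val quantiles = df.stat.approxQuantile(column, Array(0.25, 0.75), 0.0)
  require(quantiles.length == 2, s"no quantiles computed for column $column")
  val Array(q1, q3) = quantiles
  val iqr = q3 - q1
  (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
}

val outlierFreeDF = outlierEliminator(incomingDF, Nil)(iqrBounds)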

RDD transformation inside a loop

So I have an RDD[Array[String]] named Adat, and I want to transform it within a loop, getting a new RDD that I can use outside the loop scope. I tried this, but the result is not what I want.
val sharedA = {
for {
i <- 0 to shareA.toInt - 1
j <- 0 to shareA.toInt - 1
} yield {
Adat.map(x => (x(1).toInt, i % shareA.toInt, j % shareA.toInt, x(2)))
}
}
The above code transforms sharedA into an IndexedSeq[RDD[(Int, Int, Int, String)]], and when I try to print it the result is:
MapPartitionsRDD[12] at map at planet.scala:99
MapPartitionsRDD[13] at map at planet.scala:99
and so on.
How can I transform sharedA to RDD[(Int, Int, Int, String)]?
If I do it like this, sharedA has the correct datatype, but I cannot use it outside the scope:
for {
  i <- 0 to shareA.toInt - 1
  j <- 0 to shareA.toInt - 1
} yield {
  val sharedA = Adat.map(x => (x(1).toInt, i % shareA.toInt, j % shareA.toInt, x(2)))
}
I don't exactly understand your description, but flatMap should do the trick:
val rdd = sc.parallelize(Seq(Array("", "0", "foo"), Array("", "1", "bar")))
val n = 2
val result = rdd.flatMap(xs => for {
i <- 0 to n
j <- 0 to n
} yield (xs(1).toInt, i, j, xs(2)))
result.take(5)
// Array[(Int, Int, Int, String)] =
// Array((0,0,0,foo), (0,0,1,foo), (0,0,2,foo), (0,1,0,foo), (0,1,1,foo))
A less common approach would be to call SparkContext.union on the results:
val resultViaUnion = sc.union(for {
i <- 0 to n
j <- 0 to n
} yield rdd.map(xs => (xs(1).toInt, i, j, xs(2))))
resultViaUnion.take(5)
// Array[(Int, Int, Int, String)] =
// Array((0,0,0,foo), (1,0,0,bar), (0,0,1,foo), (1,0,1,bar), (0,0,2,foo))

Scala regex and for comprehension

I am trying to reason about how a for comprehension works, because it is doing something different from what I expect. I read several answers, the most relevant of which is this one: Scala "<-" for comprehension. However, I am still perplexed.
The following code works as expected. It prints lines where the values matched by two different Regexes are not equal (one for the value in a session cookie and another for the value in the GET args, just to give context):
file.getLines().foreach { line =>
val whidSession: String = rWhidSession.findAllMatchIn(line) flatMap {m => m.group(1)} mkString ""
val whidArg: String = rWhidArg.findAllMatchIn(line) flatMap {m => m.group(1)} mkString ""
if(whidSession != whidArg) println(line)
}
The following is the problematic code, which iterates over the letters within the matching strings (the generators draw from m.group(1) mkString "", which is a String, so whidSession and whidArg each bind a single Char), thus printing the line once per combination of differing letters in the two values:
/**
 * This would compare letters, regardless of the use of mkString, even without the flatMap step.
 */
val whidTuples = for {
line <- file.getLines().toList
whidSession <- rWhidSession.findAllMatchIn(line) flatMap {m => m.group(1) mkString ""}
whidArg <- rWhidEOL.findAllMatchIn(line) flatMap {m => m.group(1) mkString ""} if whidArg != whidSession
} yield line
To check that corresponding matches are equal:
scala> val ss = "foo/foo" :: "bar/bar" :: "foo/bar" :: Nil
ss: List[String] = List(foo/foo, bar/bar, foo/bar)
scala> val ra = "(.*)/.*".r ; val rb = ".*/(.*)".r
ra: scala.util.matching.Regex = (.*)/.*
rb: scala.util.matching.Regex = .*/(.*)
scala> for (s <- ss; ra(x) = s; rb(y) = s if x != y) yield s
res0: List[String] = List(foo/bar)
but allow multiple matches on a line:
scala> val ss = "foo/foo" :: "bar/bar" :: "baz/baz foo/bar" :: Nil
ss: List[String] = List(foo/foo, bar/bar, baz/baz foo/bar)
this would still compare the first matches:
scala> val ra = """(\w*)/\w*""".r.unanchored ; val rb = """\w*/(\w*)""".r.unanchored
ra: scala.util.matching.UnanchoredRegex = (\w*)/\w*
rb: scala.util.matching.UnanchoredRegex = \w*/(\w*)
scala> for (s <- ss; ra(x) = s; rb(y) = s if x != y) yield s
res2: List[String] = List()
so compare all matches:
scala> val ra = """(\w*)/\w*""".r ; val rb = """\w*/(\w*)""".r
ra: scala.util.matching.Regex = (\w*)/\w*
rb: scala.util.matching.Regex = \w*/(\w*)
scala> for (s <- ss; ma <- ra findAllMatchIn s; mb <- rb findAllMatchIn s; ra(x) = ma; rb(y) = mb if x != y) yield s
res3: List[String] = List(baz/baz foo/bar, baz/baz foo/bar, baz/baz foo/bar)
or
scala> for (s <- ss; (ma, mb) <- (ra findAllMatchIn s) zip (rb findAllMatchIn s); ra(x) = ma; rb(y) = mb if x != y) yield s
res4: List[String] = List(baz/baz foo/bar)
scala> for (s <- ss; (ra(x), rb(y)) <- (ra findAllMatchIn s) zip (rb findAllMatchIn s) if x != y) yield s
res5: List[String] = List(baz/baz foo/bar)
Here the match ra(x) = ma should not re-evaluate the regex; it should just do ma group 1.
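A quick REPL check of that last point, using the ra defined just above:
scala> val ma = ra.findFirstMatchIn("foo/bar").get
ma: scala.util.matching.Regex.Match = foo/bar

scala> val ra(x) = ma   // extracts the group from the existing Match
x: String = foo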

In Scala, how do I keep track of running totals without using var?

For example, suppose I wish to read in fat, carbs and protein and wish to print the running total of each variable. An imperative style would look like the following:
var totalFat = 0.0
var totalCarbs = 0.0
var totalProtein = 0.0
var lineNumber = 0
for (lineData <- allData) {
totalFat += lineData...
totalCarbs += lineData...
totalProtein += lineData...
lineNumber += 1
printCSV(lineNumber, totalFat, totalCarbs, totalProtein)
}
How would I write the above using only vals?
Use scanLeft.
val zs = allData.scanLeft((0, 0.0, 0.0, 0.0)) { case(r, c) =>
val lineNr = r._1 + 1
val fat = r._2 + c...
val carbs = r._3 + c...
val protein = r._4 + c...
(lineNr, fat, carbs, protein)
}
zs foreach Function.tupled(printCSV)
Recursion: pass the sums from the previous row to a function that adds them to the values from the current row, prints them to CSV, and passes them along to itself...
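A minimal sketch of that idea, assuming each line has already been parsed into a (fat, carbs, protein) triple and reusing the printCSV from the question:
@scala.annotation.tailrec
def totals(lines: List[(Double, Double, Double)],
           lineNumber: Int = 0,
           fat: Double = 0.0, carbs: Double = 0.0, protein: Double = 0.0): Unit =
  lines match {
    case Nil => ()
    case (f, c, p) :: rest =>
      val (tf, tc, tp) = (fat + f, carbs + c, protein + p)
      printCSV(lineNumber + 1, tf, tc, tp)     // print the running totals so far
      totals(rest, lineNumber + 1, tf, tc, tp) // ...and pass them to the next call
  }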
You can transform your data with map and get the total result with sum:
val total = allData map { ... } sum
With scanLeft you get the partial sums at each step:
val steps = allData.scanLeft(0) { case (sum, lineData) => sum + lineData }
val result = steps.last
If you want to create several new values in one iteration step, I would prefer a class which holds the values:
case class X(i: Int, str: String)
object X {
def empty = X(0, "")
}
(1 to 10).scanLeft(X.empty) { case (sum, data) => X(sum.i + data, sum.str + data) }
It's just a jump to the left,
and then a fold to the right /:
class Data(val a: Int, val b: Int, val c: Int)

val list = List(new Data(3, 4, 5), new Data(4, 2, 3),
                new Data(0, 6, 2), new Data(2, 4, 8))

// (z /: list)(op) is symbolic syntax for list.foldLeft(z)(op)
val res = (new Data(0, 0, 0) /: list) ((acc, x) =>
  new Data(acc.a + x.a, acc.b + x.b, acc.c + x.c))