Count filtered records in Scala

As I am new to Scala, this problem might look very basic to everyone.
I have a file called data.txt whose contents look like this:
xxx.lss.yyy23.com-->mailuogwprd23.lss.com,Hub,12689,14.98904563,1549
xxx.lss.yyy33.com-->mailusrhubprd33.lss.com,Outbound,72996,1.673717588,1949
xxx.lss.yyy33.com-->mailuogwprd33.lss.com,Hub,12133,14.9381027,664
xxx.lss.yyy53.com-->mailusrhubprd53.lss.com,Outbound,72996,1.673717588,3071
I want to split each line and find records based on the numbers in the first part (e.g. xxx.lss.yyy23.com).
val data = io.Source.fromFile("data.txt").getLines().map { x => (x.split("-->"))}.map { r => r(0) }.mkString("\n")
which gives me
xxx.lss.yyy23.com
xxx.lss.yyy33.com
xxx.lss.yyy33.com
xxx.lss.yyy53.com
This is how I am trying to count the records that contain an exact value:
data.count { x => x.contains("33")}
How do I get the count of records that do not contain 33?

The following will give you the number of lines that contain "33":
data.split("\n").count(a => a.contains("33"))
The reason the code you have above isn't working is that you need to split data back into an array of strings. Your previous statement actually concatenated the results into a single string, using a newline as the separator, via mkString, so you can't run collection operations like count on it the way you intended.
The following will work for getting the lines that do not contain "33":
data.split("\n").count(a => !a.contains("33"))
You simply need to negate the contains operation in this case.
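As a side note, here is a minimal sketch that skips the mkString / split round trip entirely and counts both cases directly on the lines (assuming data.txt is in the working directory):
val hosts = io.Source.fromFile("data.txt").getLines()
  .map(_.split("-->")(0))   // keep only the part before "-->"
  .toList

val with33    = hosts.count(_.contains("33"))    // lines that contain "33"
val without33 = hosts.count(!_.contains("33"))   // lines that do not contain "33"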

Related

How to pad fields in a file with Scala Spark?

I have a text file and I want to pad its fields as shown in Exp1 and Exp2 below.
What should I do?
This is my input:
a
a a
a a a
a a a a
a a a a a
Exp1. When a record in the file has fewer than n=4 fields, fill the remaining fields with the _ character.
a _ _ _
a a _ _
a a a _
a a a a
a a a a a
Exp2. Same as above, but also delete any fields after the n=4th field when the number of fields in a record exceeds n.
a _ _ _
a a _ _
a a a _
a a a a
a a a a
My code:
val df = spark.read.text("data.txt")
val result = df.columns.foldLeft(df) { (newdf, colname) =>
  newdf.withColumnRenamed(colname, colname.replace("a", "_"))
}
result.show
This resembles a homework-style problem, so I will help guide you based on your provided code and try to lead you on the right path here.
Your current code is only changing the name of the columns. In this case, the column name "value" is being changed to "v_lue".
You want to change the actual records themselves.
First, you want to read this data into an RDD. It can be done with a dataframe, but being able to map on the row strings
instead of Row objects might make this easier to understand conceptually. I'll get you started.
val data = sc.textFile("data.txt")
Data will be an RDD of strings, where each element is a line in the data file.
We're going to want to map this data to some new data, and transform each row.
data.map(row => {
  // transform each row here
})
Inside this map we make some change to row, which is a string. The code inside applies to every string in the RDD.
You will probably want to split the row to get an array of strings, so that you can count how many occurrences
of 'a' there are. Depending on the size of the array, you will want to create a new string and output that from this map.
If there are fewer 'a's than n, you will probably want to create a string with enough '_'s. If there are too many,
you will probably want to return a string with the correct number.
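Putting those pieces together, a minimal sketch (assuming space-separated fields, n = 4, and '_' as the fill character) might look like the following; it implements Exp2, and dropping the take(n) step would give you Exp1:
val n = 4
val data = sc.textFile("data.txt")

val result = data.map { row =>
  val fields = row.split(" ")
  val kept   = fields.take(n)                           // Exp2: drop fields beyond the nth
  val padded = kept ++ Seq.fill(n - kept.length)("_")   // Exp1: pad short records with '_'
  padded.mkString(" ")
}

result.collect().foreach(println)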
Hope this helps.

PySpark list() in withColumn() only works once, then AssertionError: col should be Column

I have a DataFrame with 6 string columns named 'Spclty1'...'Spclty6' and another 6 named 'StartDt1'...'StartDt6'. I want to zip them and collapse them into a single column that looks like this:
[[Spclty1, StartDt1]...[Spclty6, StartDt6]]
I first tried collapsing just the 'Spclty' columns into a list like this:
DF = DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6')))
This worked the first time I executed it, giving me a new column called 'Spclty' containing rows such as ['014', '124', '547', '000', '000', '000'], as expected.
Then, I added a line to my script to do the same thing on a different set of 6 string columns, named 'StartDt1'...'StartDt6':
DF = DF.withColumn('StartDt', list(DF.select('StartDt1', 'StartDt2', 'StartDt3', 'StartDt4', 'StartDt5', 'StartDt6')))
This caused AssertionError: col should be Column.
After I ran out of things to try, I tried the original operation again (as a sanity check):
DF.withColumn('Spclty', list(DF.select('Spclty1', 'Spclty2', 'Spclty3', 'Spclty4', 'Spclty5', 'Spclty6'))).collect()
and got the assertion error as above.
So, it would be good to understand why it only worked the first time, but the main question is: what is the correct way to zip columns into a collection of dict-like elements in Spark?
.withColumn() expects a Column object as its second parameter, and you are supplying a Python list.
Thanks. After reading a number of SO posts I figured out the syntax for passing a set of columns to the col parameter, using struct to create an output column that holds a list of values:
from pyspark.sql.functions import array, col, struct

DF_tmp = DF_tmp.withColumn('specialties', array([
    struct(
        *(col("Spclty{}".format(i)).alias("spclty_code"),
          col("StartDt{}".format(i)).alias("start_date"))
    )
    for i in range(1, 7)
]))
So, the col() and *col() constructs are what I was looking for, while the array([struct(...)]) approach lets me combine the 'Spclty' and 'StartDt' entries into a list of dict-like elements.

Remove white spaces in scala-spark

I have a sample file record like this:
2018-01-1509.05.540000000000001000000751111EMAIL#AAA.BB.CL
The above record is from a fixed-length file, and I want to split it based on the field lengths.
When I split it, I get a list as shown below.
ListBuffer(2018-01-15, 09.05.54, 00000000000010000007, 5, 1111, EMAIL#AAA.BB.CL)
Everything looks fine until now, but I am not sure why an extra space is being added to each field in the list (except for the first field).
Example: my data is "09.05.54", but I am getting " 09.05.54" in the list.
My logic for splitting is shown below:
// Logic to Split the Line based on the lengths
def splitLineBasedOnLengths(line: String, lengths: List[String]): ListBuffer[Any] = {
  var splittedLine = line
  var split = new ListBuffer[Any]()
  for (i <- lengths) yield {
    var c = i.toInt
    var fi = splittedLine.take(c)
    split += fi
    splittedLine = splittedLine.drop(c)
  }
  split
}
The above code takes the line and a List[String] of lengths as input, and returns a ListBuffer[Any] containing the line split according to those lengths.
Can anyone tell me why I am getting an extra space before each field after splitting?
There are no extra spaces in the data. ListBuffer's toString just inserts ", " between the elements when the collection is printed, to make it easier to read.
To prove this try the following code:
split.foreach(s => println(s"\"$s\""))
You will see the following printed:
"2018-01-15"
"09.05.54"
"00000000000010000007"
"5"
"1111"
"EMAIL#AAA.BB.CL"

Find line number in an unstructured file in Scala

Hi guys, I am parsing an unstructured file for some keywords, but I can't seem to easily find the line numbers of the results I am getting.
val filePath:String = "myfile"
val myfile = sc.textFile(filePath);
var ora_temp = myfile.filter(line => line.contains("MyPattern")).collect
ora_temp.length
However, I not only want to find the lines that contain MyPattern, I would also like a tuple of the form (matching line, line number).
Thanks in advance,
You can use zipWithIndex as eliasah pointed out in a comment (probably the most succinct way, using the direct tuple accessor syntax), or do it with pattern matching in the filter:
val matchingLineAndLineNumberTuples = sc.textFile("myfile").zipWithIndex().filter({
  case (line, lineNumber) => line.contains("MyPattern")
}).collect
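Note that zipWithIndex assigns 0-based indexes; if you want 1-based line numbers instead, one small sketch (assuming the same file and pattern) maps the index before collecting:
val oneBasedMatches = sc.textFile("myfile").zipWithIndex()
  .filter { case (line, _) => line.contains("MyPattern") }
  .map { case (line, idx) => (line, idx + 1) }
  .collect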

Need the best way to iterate a file returning batches of lines as XML

I'm looking for the best way to process a file in which, based on the contents, I combine certain lines into XML and return the XML.
e.g. Given
line 1
line 2
line 3
line 4
line 5
I may want the first call to return
<msg>line 1, line 2</msg>
and a subsequent call to return
<msg>line 5, line 4</msg>
skipping line 3 for uninteresting content and exhausting the input stream. (Note: the <msg> tags will always contain contiguous lines but the number and organization of those lines in the XML will vary.) If you'd like some criteria for choosing lines to include in a message, assume odd line #s combine with the following four lines, even line #s combine with the following two lines, mod(10) line #s combine with the following five lines, skip lines that start with '#'.
I was thinking I should implement this as an iterator so I can just do
<root>{ for (m <- messages(inputstream)) yield m }</root>
Is that reasonable? If so, how best to implement it? If not, how best to implement it? :)
Thanks
This answer provided my solution: How do you return an Iterator in Scala?
I tried the following, but there appears to be some sort of buffering issue and lines are skipped between calls to Log.next.
class Log(filename: String) {
  val src = io.Source.fromFile(filename)
  var node: Node = null

  def iterator = new Iterator[Node] {
    def hasNext: Boolean = {
      for (line <- src.getLines()) {
        // ... do stuff ...
        if (null != node) return true
      }
      src.close()
      false
    }
    def next = node
  }
}
There might be a more Scala-like way to do it, and I'd like to see it, but this is my solution to move forward for now.
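One likely cause of the skipped lines is that src.getLines() is called again on every hasNext invocation, so each call wraps the underlying source in a fresh iterator that may buffer ahead. A sketch that grabs the line iterator once and reuses it (the node-building part is left as a placeholder, matching the original) might look like this:
import scala.xml.Node

class Log(filename: String) {
  private val src   = io.Source.fromFile(filename)
  private val lines = src.getLines()   // single shared iterator over the file

  def iterator: Iterator[Node] = new Iterator[Node] {
    private var node: Node = null

    def hasNext: Boolean = {
      while (lines.hasNext && node == null) {
        val line = lines.next()
        // ... build `node` from one or more lines here ...
      }
      if (node == null) src.close()
      node != null
    }

    def next(): Node = {
      val result = node
      node = null   // reset so hasNext advances to the next message
      result
    }
  }
}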