Map Triples to IDs/numbers using ZipWithIndex/ZipWithUniqueID - scala

I asked this question before, but it was unclear, so I have added more explanation in the hope of getting help.
Replace strings with zipWithIndex/zipWithUniqueId
I am trying to map strings to numbers using zipWithIndex or zipWithUniqueId.
Let's say I have data in this format:
("u1",("name", "John Sam"))
("u2",("age", "twinty Four"))
("u3",("name", "sam Blake"))
I want this result
(0,(3,4))
(1,(5,6))
(2,(3,8))
I tried applying zipWithIndex directly to the triples, but each letter was mapped to a number. I want to map each whole string without splitting it.
I then tried extracting the first element of each key-value pair,
so I did:
val first = file.map(line => line._1).distinct()
then applied zipWithIndex:
val z1 = first.zipWithIndex()
and got a result like this:
("u1",0)
("u2",1)
("u3",2)
Now I need to take these IDs and substitute them into my original file. I also need to keep all the distinct IDs in a hash table so that I can look them up later.
Is there any way to do that? Any suggestions?
I hope my question is clear.

You mean something like this?
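val file = List(("u1", ("name", "John Sam")),
                ("u2", ("age", "twinty Four")),
                ("u3", ("name", "sam Blake")))
// Collect every string (first elements, then the pair contents) and deduplicate the whole list
val first = (file.map(line => line._1) ++
  file.flatMap(line => List(line._2._1, line._2._2))).distinct
// Lookup table from each distinct string to its index
val z1: Map[String, Int] = first.zipWithIndex.toMap
// Replace every string in the triples with its ID
file.map { l =>
  (z1(l._1),
    (z1(l._2._1), z1(l._2._2)))
}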
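The question also asks for a way to look the IDs up later. The following is only a minimal sketch on plain Scala collections, not on an RDD; the object name TripleIds and its method names are made up for illustration. It builds the string-to-ID dictionary, encodes the triples, and inverts the dictionary for decoding. On an RDD a similar dictionary could probably be built with zipWithIndex followed by collectAsMap, assuming the distinct strings fit in driver memory.

```scala
object TripleIds {
  type Triple = (String, (String, String))

  // String -> ID dictionary: first elements first, then the pair contents,
  // deduplicated as a whole so every distinct string gets exactly one ID.
  def dictionary(triples: List[Triple]): Map[String, Int] = {
    val strings = triples.map(_._1) ++ triples.flatMap { case (_, (p, o)) => List(p, o) }
    strings.distinct.zipWithIndex.toMap
  }

  // Replace every string in each triple with its ID.
  def encode(triples: List[Triple], ids: Map[String, Int]): List[(Int, (Int, Int))] =
    triples.map { case (s, (p, o)) => (ids(s), (ids(p), ids(o))) }

  // ID -> string table, for turning numbers back into strings later.
  def reverse(ids: Map[String, Int]): Map[Int, String] = ids.map(_.swap)
}
```

With the sample data from the question, the last triple comes out as (2,(3,7)), because "sam Blake" is the eighth distinct string.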

Related

How to output field padding in file Scala spark?

I have a text file, and I want to pad the fields of each record as shown in Exp1 and Exp2 below.
How should I do this?
This is my input:
a
a a
a a a
a a a a
a a a a a
Exp1. When a record in the file has fewer than n=4 fields, fill the remaining fields with the _ character.
a _ _ _
a a _ _
a a a _
a a a a
a a a a a
Exp2. Same as above, but also delete the fields after the fourth when the number of fields in the record exceeds n.
a _ _ _
a a _ _
a a a _
a a a a
a a a a
My code:
val df = spark.read.text("data.txt")
val result = df.columns.foldLeft(df){(newdf, colname) =>
newdf.withColumnRenamed(colname, colname.replace("a", "_"))
}
result.show()
This resembles a homework-style problem, so I will help guide you based on your provided code and try to lead you on the right path here.
Your current code is only changing the name of the columns. In this case, the column name "value" is being changed to "v_lue".
You want to change the actual records themselves.
First, you want to read this data into an RDD. It can be done with a dataframe, but being able to map on the row strings
instead of Row objects might make this easier to understand conceptually. I'll get you started.
val data = sc.textFile("data.txt")
Data will be an RDD of strings, where each element is a line in the data file.
We're going to want to map this data to some new data, and transform each row.
data.map(row => {
// transform each row here
})
Inside this map we make some change to row, which is a string. The code inside applies to every string in the RDD.
You will probably want to split the row to get an array of strings, so that you can count how many occurrences
of 'a' there are. Depending on the size of the array, you will want to create a new string and output that from this map.
If there are fewer 'a's than n, you will probably want to create a string with enough '_'s. If there are too many,
you will probably want to return a string with the correct number.
Hope this helps.
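For readers who want to see where the steps above end up, here is one possible sketch of the per-row transformation in plain Scala. The object name FieldPadding, the function padFields and its truncate flag are hypothetical names chosen for this example, and it assumes fields are separated by single spaces.

```scala
object FieldPadding {
  // Pads a space-separated record with "_" until it has n fields.
  // With truncate = true, fields beyond the n-th are also dropped (Exp2);
  // with truncate = false, longer records are left as they are (Exp1).
  def padFields(row: String, n: Int, truncate: Boolean): String = {
    val fields = row.split(" ").toSeq
    val padded = fields.padTo(n, "_")
    (if (truncate) padded.take(n) else padded).mkString(" ")
  }
}
```

Inside the map shown earlier, this could be called as data.map(row => FieldPadding.padFields(row, 4, truncate = true)).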

How to read with Spark constantly updating HDFS directory and split output to multiple HDFS files based on String (row)?

Scenario in more detail: an HDFS directory is "fed" with new log data for multiple types of bank account activity.
Each row represents a random activity type, and each row (String) contains the text "ActivityType=<TheTypeHere>".
In Spark with Scala, what is the best approach to read the input files in the HDFS directory and write multiple HDFS files, where each ActivityType is written to its own new file?
I adapted my first answer to this statement:
The location of the "key" string is random within the parent String;
the only thing that is guaranteed is that it contains that sub-string,
in this case "ActivityType" followed by some value.
The question is really about this. Here goes:
// SO Question
val rdd = sc.textFile("/FileStore/tables/activitySO.txt")
// The key is the text between "ActivityType=<" (14 characters) and the next ">"
val rdd2 = rdd.map { x =>
  val start = x.indexOfSlice("ActivityType=<") + 14
  (x.slice(start, x.indexOfSlice(">", start)), x)
}
val df = rdd2.toDF("K", "V")
df.write.partitionBy("K").text("SO_QUESTION2")
Input is:
ActivityType=<ACT_001>,34,56,67,89,90
3,4,4,ActivityType=<ACT_002>,A,1,2
ABC,ActivityType=<ACT_0033>
DEF,ActivityType=<ACT_0033>
The output is 3 files, where the key is not "ActivityType=" but rather ACT_001, and so on. The key is not stripped from the data; it is still there in the String. You can change that if you want, as well as the output location and format.
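The key extraction can also be written with a regular expression, which avoids the index arithmetic and handles lines that lack the marker. This is only a sketch on plain Scala collections; the object ActivityKey and its method names are invented for illustration, and it assumes the key never contains a ">" character.

```scala
object ActivityKey {
  // Matches "ActivityType=<...>" anywhere in a line and captures the text between the angle brackets.
  private val Pattern = "ActivityType=<([^>]*)>".r

  // Returns the activity type, or None when the line has no such marker.
  def extract(line: String): Option[String] =
    Pattern.findFirstMatchIn(line).map(_.group(1))

  // Groups lines by activity type; lines without a marker are dropped.
  def groupByActivity(lines: Seq[String]): Map[String, Seq[String]] =
    lines
      .flatMap(line => extract(line).map(key => key -> line))
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2) }
}
```

In the Spark version, extract could stand in for the slice expression inside rdd.map, with some decision made about what to do with lines where it returns None.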
You can use MultipleOutputFormat for this. Convert the RDD into key-value pairs such that ActivityType is the key. Spark will create different files for different keys. You can decide, based on the key, where to place the files and what their names will be.
You can do something like this using RDDs, where I assume you have variable length files, and then convert to DataFrames:
val rdd = sc.textFile("/FileStore/tables/activity.txt")
val rdd2 = rdd.map(_.split(","))
.keyBy(_(0))
val rdd3 = rdd2.map(x => (x._1, x._2.mkString(",")))
val df = rdd3.toDF("K", "V")
//df.show(false)
df.write.partitionBy("K").text("SO_QUESTION")
Input is:
ActivityType=<ACT_001>,34,56,67,89,90
ActivityType=<ACT_002>,A,1,2
ActivityType=<ACT_003>,ABC
I then get 3 files as output, in this case one for each record. It is a bit hard to show because I did it in Databricks.
You can adjust your output format and location, etc. partitionBy is the key here.

find line number in an unstructured file in scala

Hi guys, I am parsing an unstructured file for some keywords, but I can't seem to easily find the line numbers of the results I am getting.
val filePath:String = "myfile"
val myfile = sc.textFile(filePath);
var ora_temp = myfile.filter(line => line.contains("MyPattern")).collect
ora_temp.length
However, I want not only the lines that contain MyPattern but something more like a tuple (MyPattern line, line number).
Thanks in advance,
You can use zipWithIndex, as eliasah pointed out in a comment (probably the most succinct way to do this is with the direct tuple accessor syntax), or like so, using pattern matching in the filter:
val matchingLineAndLineNumberTuples = sc.textFile("myfile").zipWithIndex().filter({
case (line, lineNumber) => line.contains("MyPattern")
}).collect
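To see the shape of the result without a cluster, here is a small sketch on a local collection. Plain Scala sequences also have zipWithIndex, though it yields Int indices where the RDD version yields Long. Indices start at 0, so this sketch adds 1 to get conventional line numbers; that choice, the object name LineNumbers and the method name are assumptions of the example, not part of the answer above.

```scala
object LineNumbers {
  // Pairs each line containing the pattern with its 1-based line number.
  def matchesWithLineNumbers(lines: Seq[String], pattern: String): Seq[(String, Int)] =
    lines.zipWithIndex.collect {
      case (line, index) if line.contains(pattern) => (line, index + 1)
    }
}
```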

_.split(" ") more fields in scala RDD

I'm trying to extract data from an RDD[String] into another RDD[String].
The RDD contains data similar to this:
17.808 15.749 6.649 -0.548 15.9994
I need to multiply the 4th and 5th fields of each row and store the results in a different RDD[String].
I can use the following code to pull out one field:
val ansRDD = rawRDD.map(_.split(" ")(4)).map(_.toFloat)
rawRDD contains the strings.
But I need to pull out both fields into a single RDD, as
-0.548 15.9994
so that I can simply do
val answer = ansRDD.map { case (a, b) => a * b }
You could use (the 4th and 5th fields are at indices 3 and 4):
rawRDD.map(_.split(' ').slice(3, 5).map(_.toFloat).reduce(_ * _).toString)
You could define ansRDD as:
val ansRDD = rawRDD.map { item => val comps = item.split(" "); (comps(3).toFloat, comps(4).toFloat) }
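The index choice is easy to get wrong, so it can help to check it on a single row outside Spark first. A minimal sketch, assuming single-space separators and using Double rather than Float for the arithmetic; FieldProduct and product are names made up for this example.

```scala
object FieldProduct {
  // Multiplies the 4th and 5th space-separated fields of a row.
  // Fields are 0-indexed after split, so these are indices 3 and 4.
  def product(row: String): Double = {
    val fields = row.split(" ")
    fields(3).toDouble * fields(4).toDouble
  }
}
```

The same function could then be passed to rawRDD.map, followed by toString if an RDD[String] is needed.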

Count filtered records in scala

As I am new to Scala, this problem might look very basic to everyone.
I have a file called data.txt whose contents look like this:
xxx.lss.yyy23.com-->mailuogwprd23.lss.com,Hub,12689,14.98904563,1549
xxx.lss.yyy33.com-->mailusrhubprd33.lss.com,Outbound,72996,1.673717588,1949
xxx.lss.yyy33.com-->mailuogwprd33.lss.com,Hub,12133,14.9381027,664
xxx.lss.yyy53.com-->mailusrhubprd53.lss.com,Outbound,72996,1.673717588,3071
I want to split each line and find the records depending on the number in the first part, such as xxx.lss.yyy23.com.
val data = io.Source.fromFile("data.txt").getLines().map { x => (x.split("-->"))}.map { r => r(0) }.mkString("\n")
which gives me
xxx.lss.yyy23.com
xxx.lss.yyy33.com
xxx.lss.yyy33.com
xxx.lss.yyy53.com
This is how I am trying to count the exact value:
data.count { x => x.contains("33")}
How do I get the count of records that do not contain 33?
The following will give you the number of lines that contain "33":
data.split("\n").count(a => a.contains("33"))
The reason your version isn't working is that you need to split data into an array of strings again. Your previous statement concatenates the result into a single string, using newline as the separator, via mkString, so you can't really run collection operations like count on the individual lines.
The following will work for getting the lines that do not contain "33":
data.split("\n").count(a => !a.contains("33"))
You simply need to negate the contains operation in this case.
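An alternative that avoids the round trip through one big string is to keep the host names as a collection from the start. The following is a sketch under that assumption; HostCounts and its method names are hypothetical, and the lines are passed in as a Seq rather than read from data.txt so the example stays self-contained.

```scala
object HostCounts {
  // Keeps the part before "-->" of every line as a sequence, instead of joining with mkString.
  def hosts(lines: Seq[String]): Seq[String] = lines.map(_.split("-->")(0))

  // Returns (number of hosts containing the marker, number that do not).
  def countWithAndWithout(hosts: Seq[String], marker: String): (Int, Int) = {
    val (matching, rest) = hosts.partition(_.contains(marker))
    (matching.size, rest.size)
  }
}
```

When reading from the file, the lines could come from io.Source.fromFile("data.txt").getLines().toList.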