My requirement is to extract the order number from the comment column; the order number always starts with R. It should be added as a new column to the table.
Input data:
code,id,mode,location,status,comment
AS-SD,101,Airways,hyderabad,D,order got delayed R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged
TY-OP,103,Airways,Pune,D,Order number R5463 not received
Expected output:
AS-SD,101,Airways,hyderabad,D,order got delayed R1657,R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged,R7856
TY-OP,103,Airways,Pune,D,Order number R5463 not received,R5463
I have tried it in spark-sql; the query I am using is given below:
val r = sqlContext.sql("select substring(comment, PatIndex('%[0-9]%',comment, length(comment))) as number from A")
However, I'm getting the following error:
org.apache.spark.sql.AnalysisException: undefined function PatIndex; line 0 pos 0
You can use regexp_extract, which has the definition:
def regexp_extract(e: Column, exp: String, groupIdx: Int): Column
(R\\d{4}) means R followed by exactly 4 digits. You can easily accommodate any other case by using a suitable regex:
df.withColumn("orderId", regexp_extract($"comment", "(R\\d{4})" , 1 )).show
+-----+---+-------+---------+------+--------------------+-------+
| code| id| mode| location|status| comment|orderId|
+-----+---+-------+---------+------+--------------------+-------+
|AS-SD|101|Airways|hyderabad| D|order got delayed...| R1657|
|FY-YT|102|Airways| Delhi| ND|R7856 package dam...| R7856|
|TY-OP|103|Airways| Pune| D|Order number R546...| R5463|
+-----+---+-------+---------+------+--------------------+-------+
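If the order numbers can have a variable number of digits, the same call with the pattern (R\\d+), i.e. R followed by one or more digits, works unchanged:
df.withColumn("orderId", regexp_extract($"comment", "(R\\d+)", 1)).show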
You can use a udf function as follows:
import org.apache.spark.sql.functions._
def extractString = udf((comment: String) => comment.split(" ").filter(_.startsWith("R")).head)
df.withColumn("newColumn", extractString($"comment")).show(false)
Here the comment column is split on spaces and the words that start with R are kept; head takes the first such word.
Updated
To ensure that the returned string is an order number, i.e. that it starts with R and the rest of the characters are digits, you can add an additional filter:
import scala.util.Try
def extractString = udf((comment: String) => comment.split(" ").filter(x => x.startsWith("R") && Try(x.substring(1).toDouble).isSuccess).head)
You can edit the filter according to your need.
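Note that head throws a NoSuchElementException if a comment contains no matching word. If that can happen in your data, a safer sketch (returning null, and hence SQL NULL, for rows without an order number) uses find with orNull:
def extractString = udf((comment: String) => comment.split(" ").find(x => x.startsWith("R") && Try(x.substring(1).toDouble).isSuccess).orNull)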
Related
I am using Scala and reading input from the console. I am able to regurgitate the strings that make up each line, but if my input has the following format, how can I access each integer within each line?
2 2
1 2 2
2 1 1
Currently I just regurgitate the input back to the console using
object Main {
  def main(args: Array[String]): Unit = {
    for (ln <- io.Source.stdin.getLines) println(ln)
    // how can I access each individual number within each line?
  }
}
And I need to compile this project like so:
$ scalac main.scala
$ scala Main <input01.txt
2 2
1 2 2
2 1 1
A reasonable algorithm would be:
for each line, split it into words
parse each word into an Int
An implementation of that algorithm:
io.Source.stdin.getLines   // for each line...
  .flatMap(
    _.split("""\s+""")     // split it into words
      .map(_.toInt)        // parse each word into an Int
  )
The result of this expression will be an Iterator[Int]; if you want a Seq, you can call toSeq on that Iterator (if there's a reasonable chance there will be more than 7 or so integers, it's probably worth calling toVector instead). It will blow up with a NumberFormatException if there's a word which isn't an integer. You can handle this a few different ways... if you want to ignore words that aren't integers, you can:
import scala.util.Try
io.Source.stdin.getLines
  .flatMap(
    _.split("""\s+""")
      .flatMap(w => Try(w.toInt).toOption)  // drop words that don't parse as an Int
  )
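If you want a materialized collection rather than an Iterator, call toVector on the result, for example:
val ints = io.Source.stdin.getLines
  .flatMap(_.split("""\s+""").flatMap(w => Try(w.toInt).toOption))
  .toVector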
The following will give you a flat list of numbers.
val integers =
  for {
    line <- io.Source.stdin.getLines
    number <- line.split("""\s+""").map(_.toInt)
  } yield number
As noted above, some care must be taken when parsing the numbers.
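If you need to keep the line structure (each line as its own sequence of numbers) rather than a flat list, a minimal sketch along the same lines:
import scala.util.Try
// one Vector per input line, skipping any word that doesn't parse as an Int
val perLine: Vector[Vector[Int]] = io.Source.stdin.getLines
  .map(_.split("""\s+""").flatMap(w => Try(w.toInt).toOption).toVector)
  .toVector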
I have to read in files from vendors that can potentially get pretty big (multiple GB). These files may have multiple header and footer rows that I want to strip off.
Reading the file in is easy:
val rawData = spark.read
.format("csv")
.option("delimiter","|")
.option("mode","PERMISSIVE")
.schema(schema)
.load("/path/to/file.csv")
I can add a simple row number using monotonically_increasing_id:
val withRN = rawData.withColumn("aIndex",monotonically_increasing_id())
That seems to work fine.
I can easily use that to strip off header rows:
val noHeader = withRN.filter($"aIndex".geq(2))
but how can I strip off footer rows?
I was thinking about getting the max of the index column, and using that as a filter, but I can't make that work.
val MaxRN = withRN.agg(max($"aIndex")).first.toString
val noFooter = noHeader.filter($"aIndex".leq(MaxRN))
That returns no rows, because MaxRN is a string.
If I try to convert it to a long, that fails:
noHeader.filter($"aIndex".leq(MaxRN.toLong))
java.lang.NumberFormatException: For input string: "[100000]"
How can I use that max value in a filter?
Is trying to use monotonically_increasing_id like this even a viable approach? Is it really deterministic?
This happens because first returns a Row. To access the first element of the row you must do:
val MaxRN = withRN.agg(max($"aIndex")).first.getLong(0)
Converting the Row to a string gives you "[100000]", which is not a valid Long, so the conversion fails.
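Putting it together, a sketch of the footer strip, assuming for illustration that the file ends with 2 footer rows (adjust the offset to your data):
val maxRN = withRN.agg(max($"aIndex")).first.getLong(0)
// keep everything except the last 2 rows; note this assumes the ids are
// consecutive, which only holds while the data sits in a single partition --
// monotonically_increasing_id does not produce consecutive ids across partitions
val noFooter = noHeader.filter($"aIndex".leq(maxRN - 2))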
I have a text file, and I want to pad the output fields in the file as in Exp1 and Exp2 below.
What should I do?
This is my input:
a
a a
a a a
a a a a
a a a a a
Exp1. When a record in the file has fewer than n=4 fields, fill the remaining fields with the _ character.
a _ _ _
a a _ _
a a a _
a a a a
a a a a a
Exp2. Same as above, but delete the fields after the n=4th field when the number of fields in a record exceeds n.
a _ _ _
a a _ _
a a a _
a a a a
a a a a
My code:
val df = spark.read.text("data.txt")
val result = df.columns.foldLeft(df) { (newdf, colname) =>
  newdf.withColumnRenamed(colname, colname.replace("a", "_"))
}
result.show
This resembles a homework-style problem, so I will help guide you based on your provided code and try to lead you on the right path here.
Your current code is only changing the name of the columns. In this case, the column name "value" is being changed to "v_lue".
You want to change the actual records themselves.
First, you want to read this data into an RDD. It can be done with a DataFrame, but being able to map on the row strings instead of Row objects might make this easier to understand conceptually. I'll get you started.
val data = sc.textFile("data.txt")
Data will be an RDD of strings, where each element is a line in the data file.
We're going to want to map this data to some new data, and transform each row.
data.map(row => {
  // transform each row here
})
Inside this map we make some change to row, which is a string; the code inside applies to every string in the RDD. You will probably want to split the row to get an array of strings, so that you can count how many occurrences of 'a' there are. Depending on the size of the array, you will want to create a new string and output that from this map. If there are fewer 'a's than n, you will probably want to create a string with enough '_'s; if there are too many, you will probably want to return a string with the correct number, as in the sketch below.
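For completeness, a minimal sketch of that approach, assuming space-separated fields and n=4 (check it against your own requirements):
val n = 4
// Exp1: pad short records out to n fields; padTo leaves longer records untouched
val exp1 = data.map(_.split(" ").padTo(n, "_").mkString(" "))
// Exp2: pad short records and also truncate records longer than n fields
val exp2 = data.map(_.split(" ").padTo(n, "_").take(n).mkString(" "))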
Hope this helps.
I am new to Scala and I want to count the number of strings in a list that start with a particular letter.
For example:
val test1 : List[String] = List("zero","zebra","zenith","tiger","mosquito")
I have defined the above List of Strings and I want to count all the strings which start with "z".
I tried the below code:
scala> test1.count(s => s.charAt(0) == "z")
res7: Int = 0
It gives me a result of 0. I am not sure what I am doing wrong. Please suggest.
Character values are delimited by single quotes. Double quotes are reserved for strings:
val test : List[String] = List("zero","zebra","zenith","tiger","mosquito")
test.count(_.charAt(0) == 'z') // 3: Int
You can simply use filter and take the length of the resulting list:
println(test1.filter(_.startsWith("z")).length)
If you want to ignore case (uppercase or lowercase) you can add .toLowerCase, as in:
println(test1.filter(_.toLowerCase.startsWith("z")).length)
I hope the answer is helpful.
I'm trying to extract data from an RDD[String] into another RDD[String].
The RDD contains data similar to this:
17.808 15.749 6.649 -0.548 15.9994
I need to multiply the 4th and 5th fields of each row and store them in a different RDD[String].
I can use the following code to pull out one field
val ansRDD = rawRDD.map(_.split(" ")(4)).map(_.toFloat)
rawRDD contains the strings.
But I need to pull out both the fields into a single RDD as
-0.548 15.9994
so that I can simply do
val answer = ansRDD.map { case (a, b) => a * b }
You could use:
rawRDD.map(_.split(' ').slice(3, 5).map(_.toFloat).reduce(_ * _).toString)  // the 4th and 5th fields are at indices 3 and 4
You could define ansRDD as:
val ansRDD = rawRDD.map { item => val comps = item.split(" "); (comps(3).toFloat, comps(4).toFloat) }
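With ansRDD holding (Float, Float) pairs, computing and inspecting the products end to end looks like:
val answer = ansRDD.map { case (a, b) => a * b }
answer.collect().foreach(println)  // prints approximately -8.76767 for the sample row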