Using apache spark to iterate over string - scala

For example, we have the string "abcdabcd", and we want to count all the adjacent pairs (e.g. "ab" or "da") that appear in the string.
So how do we do that in Apache Spark?
I ask this because it looks like RDD does not support a sliding function:
rdd.sliding(2).toList
//Count number of pairs in the list
//The first line fails to compile: sliding is not a member of RDD

Apparently it does support sliding via mllib, as shown by zero323 here:
import org.apache.spark.mllib.rdd.RDDFunctions._
val str = "abcdabcd"
val rdd = sc.parallelize(str.toSeq)   // an RDD[Char]
rdd.sliding(2).map(_.mkString).toLocalIterator.foreach(println)
will show
ab
bc
cd
da
ab
bc
cd
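To actually count the pairs rather than just print them, a minimal follow-up sketch (assuming the same rdd and the mllib sliding import above) could use countByValue:
// Count how many times each adjacent pair occurs (returns a Map[String, Long] on the driver).
val pairCounts = rdd.sliding(2).map(_.mkString).countByValue()
pairCounts.toSeq.sortBy(_._1).foreach { case (pair, n) => println(s"$pair -> $n") }
// ab -> 2, bc -> 2, cd -> 2, da -> 1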

Related

To split data into good and bad rows and write to output file using Spark program

I am trying to filter the good and bad rows by counting the number of delimiters in a TSV.gz file and write to separate files in HDFS
I ran the below commands in spark-shell
Spark Version: 1.6.3
val file = sc.textFile("/abc/abc.tsv.gz")
val data = file.map(line => line.split("\t"))
var good = data.filter(a => a.size == 995)
val bad = data.filter(a => a.size < 995)
When I checked the first record the value could be seen in the spark shell
good.first()
But when I try to write to an output file I am seeing the below records,
good.saveAsTextFile("good.tsv")
Output in HDFS (top 2 rows):
[Ljava.lang.String;@1287b635
[Ljava.lang.String;@2ef89922
Could you please let me know how to get the required output file in HDFS.
Thanks!
Your final RDD is of type org.apache.spark.rdd.RDD[Array[String]], which leads to object references being written instead of string values in the write operation.
You should convert each array of strings back into a tab-separated string before saving. Just try:
good.map(item => item.mkString("\t")).saveAsTextFile("goodFile.tsv")
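Putting the question and the fix together, a minimal sketch (output paths are hypothetical) that writes both the good and the bad rows:
val file = sc.textFile("/abc/abc.tsv.gz")
val data = file.map(_.split("\t"))
// Re-join the fields with tabs before saving, so that text rather than object references is written.
data.filter(_.size == 995).map(_.mkString("\t")).saveAsTextFile("/abc/good_rows")
data.filter(_.size < 995).map(_.mkString("\t")).saveAsTextFile("/abc/bad_rows")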

How to read with Spark constantly updating HDFS directory and split output to multiple HDFS files based on String (row)?

Elaborated scenario: an HDFS directory is continuously fed with new log data covering multiple types of bank account activity.
Each row represents a random activity type, and each row (String) contains the text "ActivityType=<TheTypeHere>".
In Spark-Scala, what's the best approach to read the input file/s in the HDFS directory and output multiple HDFS files, where each ActivityType is written to its own new file?
First answer, adapted to the statement:
The location of the "key" string is random within the parent String,
the only thing that is guaranteed is that it contains that sub-string,
in this case "ActivityType" followed by some value.
The question is really about this. Here goes:
// SO Question
val rdd = sc.textFile("/FileStore/tables/activitySO.txt")
val rdd2 = rdd.map(x => (x.slice(x.indexOfSlice("ActivityType=<") + 14, x.indexOfSlice(">", x.indexOfSlice("ActivityType=<") + 14)), x))
val df = rdd2.toDF("K", "V")
df.write.partitionBy("K").text("SO_QUESTION2")
Input is:
ActivityType=<ACT_001>,34,56,67,89,90
3,4,4,ActivityType=<ACT_002>,A,1,2
ABC,ActivityType=<ACT_0033>
DEF,ActivityType=<ACT_0033>
The output is 3 files, where the key is e.g. not ActivityType=, but rather ACT_001, etc. The key data is not stripped; it is still present in the String. You can modify that if you want, as well as the output location and format.
You can use MultipleOutputFormat for this. Convert the RDD into key-value pairs such that the ActivityType is the key. Spark will create different files for different keys. Based on the key you can decide where to place the files and what their names will be.
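A minimal sketch of that approach (reusing the rdd and the slicing logic from the snippet above; the output directory is hypothetical) could subclass Hadoop's MultipleTextOutputFormat and write via saveAsHadoopFile:
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Write each record to a file named after its key, keeping only the value in the output lines.
class ActivityTypeOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.toString
}

// Key each line by its ActivityType value.
val keyed = rdd.map { line =>
  val start = line.indexOfSlice("ActivityType=<") + 14
  (line.slice(start, line.indexOfSlice(">", start)), line)
}

keyed.saveAsHadoopFile(
  "/tmp/by_activity_type",                // hypothetical output directory
  classOf[String], classOf[String],
  classOf[ActivityTypeOutputFormat])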
You can do something like this using RDDs, whereby I assume you have variable-length records, and then convert to a DataFrame:
val rdd = sc.textFile("/FileStore/tables/activity.txt")
val rdd2 = rdd.map(_.split(","))
.keyBy(_(0))
val rdd3 = rdd2.map(x => (x._1, x._2.mkString(",")))
val df = rdd3.toDF("K", "V")
//df.show(false)
df.write.partitionBy("K").text("SO_QUESTION")
Input is:
ActivityType=<ACT_001>,34,56,67,89,90
ActivityType=<ACT_002>,A,1,2
ActivityType=<ACT_003>,ABC
I then get 3 files as output, in this case 1 for each record. It is a bit hard to show here, as I did it in Databricks.
You can adjust your output format and location, etc. partitionBy is the key here.

pyspark convert string array to Map()

I have a string as below in a text file:
ar.txt has 'K1:v1,K2:v2, K3:v3'
I have read this into an RDD and am trying to convert it into MapType(StringType(), StringType()). When I try the code below, it fails with a null type error.
# Say data is in rdd called ar_rdd
ar_rdd1 = ar_rdd.map(lambda x: create_map(x.encode("ascii", "ignore").split(",")))
Please suggest how to convert it into a MapType() column.
I was able to solve it using the approach below.
Read it into an RDD and split the pairs:
[Showing in steps though we can combine]
##File Input format : 'k1:v1,k2:v2,k3:v3'
rdd1 = sc.textFile(file_path)
from pyspark.sql.functions import create_map, col
rdd2 = rdd1.map(lambda x: x.encode("ascii", "ignore").split(","))
rdd3 = rdd2.map(lambda x: (x[0].split(":"), x[1].split(":"), x[2].split(":")))
df = rdd3.toDF()
df = df.withColumn("map_column", create_map(col('_1')[0], col('_1')[1], col('_2')[0], col('_2')[1], col('_3')[0], col('_3')[1]))
If there is a better alternative, or a way to make it dynamic for any number of pairs, please suggest it.

How to extract number from string column?

My requirement is to retrieve the order number, which is in the comment column and always starts with R, and add it as a new column to the table.
Input data:
code,id,mode,location,status,comment
AS-SD,101,Airways,hyderabad,D,order got delayed R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged
TY-OP,103,Airways,Pune,D,Order number R5463 not received
Expected output:
AS-SD,101,Airways,hyderabad,D,order got delayed R1657,R1657
FY-YT,102,Airways,Delhi,ND,R7856 package damaged,R7856
TY-OP,103,Airways,Pune,D,Order number R5463 not received,R5463
I have tried it in spark-sql, the query I am using is given below:
val r = sqlContext.sql("select substring(comment, PatIndex('%[0-9]%',comment, length(comment))) as number from A")
However, I'm getting the following error:
org.apache.spark.sql.AnalysisException: undefined function PatIndex; line 0 pos 0
You can use regexp_extract, which has the definition:
def regexp_extract(e: Column, exp: String, groupIdx: Int): Column
(R\\d{4}) means R followed by 4 digits. You can easily accommodate any other case by using a suitable regex.
df.withColumn("orderId", regexp_extract($"comment", "(R\\d{4})", 1)).show
+-----+---+-------+---------+------+--------------------+-------+
| code| id| mode| location|status| comment|orderId|
+-----+---+-------+---------+------+--------------------+-------+
|AS-SD|101|Airways|hyderabad| D|order got delayed...| R1657|
|FY-YT|102|Airways| Delhi| ND|R7856 package dam...| R7856|
|TY-OP|103|Airways| Pune| D|Order number R546...| R5463|
+-----+---+-------+---------+------+--------------------+-------+
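For order numbers with a variable number of digits, a variant of the same regexp_extract call (assuming the same df) could match R followed by one or more digits:
// Sketch: R followed by any number of digits, instead of exactly four.
df.withColumn("orderId", regexp_extract($"comment", "(R\\d+)", 1)).show(false)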
You can use a udf function as following
import org.apache.spark.sql.functions._
def extractString = udf((comment: String) => comment.split(" ").filter(_.startsWith("R")).head)
df.withColumn("newColumn", extractString($"comment")).show(false)
where the comment column is split on spaces and the words that start with R are kept; head then takes the first word starting with R.
Updated
To ensure that the returned string is an order number starting with R and that the rest of the string is all digits, you can add an additional filter:
import scala.util.Try
def extractString = udf((comment: String) => comment.split(" ").filter(x => x.startsWith("R") && Try(x.substring(1).toDouble).isSuccess).head)
You can edit the filter according to your need.

_.split(" ") more fields in scala RDD

I'm trying to extract data from an RDD[String] into another RDD[String].
The RDD contains data similar to this:
17.808 15.749 6.649 -0.548 15.9994
I need to multiply 4th and 5th fields of each row and store them into a different RDD[string].
I can use the following code to pull out one field
ansRDD = rawRDD.map(_.split(" ")(4)).map(_.toFloat)
rawRDD contains the string.
But I need to pull out both the fields into a single RDD as
-0.548 15.9994
so that I can simply do
answer = ansRDD.map { case (a, b) => a * b }
You could use:
rawRDD.map(_.split(' ').view(3, 5).map(_.toFloat).reduce(_ * _).toString)
You could define ansRDD as:
ansRDD = rawRDD.map { item => val comps = item.split(" "); (comps(3).toFloat, comps(4).toFloat) }
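Putting the pieces together, a minimal end-to-end sketch (assuming rawRDD is an RDD[String] of space-separated numeric fields as above):
// Multiply the 4th and 5th fields (indices 3 and 4) of each row.
val products = rawRDD.map { line =>
  val fields = line.split(" ")
  fields(3).toFloat * fields(4).toFloat
}
products.collect().foreach(println)   // for the sample row: -0.548 * 15.9994 ≈ -8.7677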