RDD skip headers - PySpark

I want to read an RDD that has a header row. I found a similar question, How do I skip a header from CSV files in Spark?, but it's not working for me. The Scala answer there is:
rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}
so I tried
def f(idx, iter):
    if idx == 0:
        iter.drop(1)
    else:
        yield list(iterator)
rdd2 = rdd.mapPartitionsWithIndex(f)
but it says AttributeError: 'generator' object has no attribute 'drop'
any help?

Try something like this:
def f(idx, iter):
    output = []
    for sublist in iter:
        output.append(sublist)
    if idx > 0:
        return output
    else:
        return output[1:]

Related

How to remove the last line from an RDD in Spark (Scala)

I want to remove the last line from an RDD using the .mapPartitionsWithIndex function.
I have tried the code below:
val withoutFooter = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == noOfTotalPartitions) {
    iter.drop(size - 1)
  }
  else iter
}
But I'm not able to get the correct result.
drop drops the first n elements and returns an iterator over the remaining ones.
Read more here: https://stackoverflow.com/a/51792161/6556191
The code below works for me:
val rdd = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9), 4)
val lastPartitionIndex = rdd.getNumPartitions - 1
rdd.mapPartitionsWithIndex { (idx, iter) =>
  var reti = iter
  if (idx == lastPartitionIndex) {
    // materialise only the last partition and drop its final element
    val lastPart = iter.toArray
    reti = lastPart.slice(0, lastPart.length - 1).toIterator
  }
  reti
}
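For a quick sanity check, here is a sketch that reuses the rdd and lastPartitionIndex above; dropRight is an equivalent way to strip the final element:
val withoutLast = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == lastPartitionIndex) iter.toArray.dropRight(1).toIterator
  else iter
}
withoutLast.collect()   // Array(1, 2, 3, 4, 5, 6, 7, 8): the trailing 9 is gone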

How to deal with None output of a function in Scala?

I have the following function:
def getData(spark: SparkSession, indices: Option[String]): Option[DataFrame] = {
  indices.map { ind =>
    spark
      .read
      .format("org.elasticsearch.spark.sql")
      .load(ind)
  }
}
This function returns Option[DataFrame].
Then I want to use this function as follows:
val df = getData(spark, indices)
df.persist(StorageLevel.MEMORY_AND_DISK)
Of course the last two lines of code will not compile, because df might be None. What is the idiomatic way to deal with a None output in Scala?
I would like to throw an exception and stop the program if df is None. Otherwise I want to persist it.
If you do care about the None case, I'd use a simple pattern match here:
df match {
  case None            => throw new RuntimeException()
  case Some(dataFrame) => dataFrame.persist(StorageLevel.MEMORY_AND_DISK)
}
But if you don't care, just use foreach like:
df.foreach { dataFrame =>
  dataFrame.persist(StorageLevel.MEMORY_AND_DISK)
}
Alternatively, getOrElse lets you fail fast and keep a plain DataFrame:
val df = dfOption.getOrElse(throw new Exception("Disaster Strikes"))
df.persist(...)
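The two styles can also be combined in one expression. A minimal sketch, reusing getData, spark, and indices from the question (the exception message is only illustrative):
val df: DataFrame = getData(spark, indices)
  .map(_.persist(StorageLevel.MEMORY_AND_DISK))   // persist only if a DataFrame was loaded
  .getOrElse(throw new RuntimeException("getData returned None, nothing to persist"))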

Replacing data on a condition in an RDD and filtering out unwanted records

I have an RDD which looks like this:
[((String, String, String), (String, String))]
Sample data is like this:
((10,1,a),(x,3))
((10,2,b),(y,5))
((11,2,b),())
((11,3,c),(z,4))
So if the value of the 2nd string inside the key is 2 or 3, replace it with 2-3; if it is 1, or if the record looks like the 3rd one (with an empty value), remove that record.
So the expected output is like this:
((10,2-3,b),(y,5))
((11,2-3,c),(z,4))
Given input data such as
val rdd = spark.sparkContext.parallelize(Seq(
  (("10", "1", "a"), ("x", "3")),
  (("10", "2", "b"), ("y", "5")),
  (("11", "2", "b"), ()),
  (("11", "3", "c"), ("z", "4"))
))
you can do the following to get your desired output:
rdd
  .filter(x => x._1._2 != "1")
  .filter(x => x._2 != ())
  .map { x =>
    if (x._1._2 == "2" || x._1._2 == "3") ((x._1._1, "2-3", x._1._3), x._2)
    else ((x._1._1, x._1._2, x._1._3), x._2)
  }
Your output would be
((10,2-3,b),(y,5))
((11,2-3,c),(z,4))
Thanks to philantrovert for pointing out that it has to be String rather than an Int.
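An equivalent one-pass variant uses RDD.collect with a partial function (a sketch, assuming the same rdd as above): the row with the empty-tuple value and the row whose key contains "1" simply fail to match and are dropped.
// collect with a PartialFunction filters and maps in one step
val cleaned = rdd.collect {
  case ((first, second, third), (v1, v2)) if second == "2" || second == "3" =>
    ((first, "2-3", third), (v1, v2))
}
cleaned.collect().foreach(println)   // ((10,2-3,b),(y,5)) and ((11,2-3,c),(z,4))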

How to correctly handle Option in Spark/Scala?

I have a method, createDataFrame, which returns an Option[DataFrame]. I then want to 'get' the DataFrame and use it in later code. I'm getting a type mismatch that I can't fix:
val df2: DataFrame = createDataFrame("filename.txt") match {
  case Some(df) => // proceed with pipeline
    df.filter($"activityLabel" > 0)
  case None => println("could not create dataframe")
}
val Array(trainData, testData) = df2.randomSplit(Array(0.5, 0.5), seed = 12345)
I need df2 to be of type DataFrame, otherwise later code won't recognise df2 as a DataFrame, e.g. val Array(trainData, testData) = df2.randomSplit(Array(0.5, 0.5), seed = 12345).
However, the case None branch is not of type DataFrame; it returns Unit, so the match won't compile. But if I don't declare the type of df2, the later code won't compile because df2 isn't recognised as a DataFrame. If someone can suggest a fix that would be helpful; I've been going round in circles with this for some time. Thanks
What you need is map. If you map over an Option[T] you are saying: "if it's None, do nothing; otherwise transform the content of the Option into something else". In your case that content is the DataFrame itself. So inside the myDFOpt.map() function you can put all your DataFrame transformations, and only at the end do the pattern match you already wrote, where you may print something if you get a None.
edit:
// map keeps the whole pipeline inside the Option: a None stays None, a Some is transformed
val result = createDataFrame("filename.txt").map { df =>
  val filteredDF = df.filter($"activityLabel" > 0)
  val Array(trainData, testData) = filteredDF.randomSplit(Array(0.5, 0.5), seed = 12345)
  (trainData, testData)
}
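The final unwrapping the answer describes could then look like this (a sketch, reusing result from above; the println bodies are only illustrative):
result match {
  case Some((trainData, testData)) =>
    // continue the pipeline with the two splits
    println(s"train rows: ${trainData.count()}, test rows: ${testData.count()}")
  case None =>
    println("could not create dataframe")
}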

Modifying a List of Strings in Scala

I have an input file that I would like to read as a Scala stream, then modify each record, and then output the file.
My input is as follows -
Name,id,phone-number
abc,1,234567
dcf,2,345334
I want to change the above input as follows -
Name,id,phone-number
testabc,test1,test234567
testdcf,test2,test345334
I am trying to read the file as a Scala stream as follows:
val inputList = Source.fromFile("/test.csv")("ISO-8859-1").getLines
After the above step I get an Iterator[String].
val newList = inputList.map { line =>
  line.split(',').map { s =>
    "test" + s
  }.mkString(",")
}.toList
But the new list is empty.
I am not sure if I can define an empty list and an empty array and then append each modified record to the list.
Any suggestions?
You might want to transform the iterator into a stream
val l = Source.fromFile("test.csv")
  .getLines()
  .toStream
  .tail
  .map { row =>
    row.split(',')
      .map { col =>
        s"test$col"
      }.mkString(",")
  }
l.foreach(println)
testabc,test1,test234567
testdcf,test2,test345334
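If you also need to write the result back out, as the question mentions, the standard library is enough. A small sketch, reusing l from above (the output file name is assumed):
import java.io.PrintWriter

val out = new PrintWriter("test-modified.csv")
try l.foreach(out.println) finally out.close()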
Here's a similar approach that returns a List[Array[String]]. You can use mkString, toString, or similar if you want a String returned.
scala> scala.io.Source.fromFile("data.txt")
.getLines.drop(1)
.map(l => l.split(",").map(x => "test" + x)).toList
res3: List[Array[String]] = List(
Array(testabc, test1, test234567),
Array(testdcf, test2, test345334)
)
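For example, to get each row back as a single comma-separated String rather than an Array, a sketch of the mkString variant the answer mentions (assuming the same data.txt):
import scala.io.Source

val rows: List[String] = Source.fromFile("data.txt")
  .getLines()
  .drop(1)                                          // skip the header row
  .map(_.split(",").map("test" + _).mkString(","))
  .toList
// List(testabc,test1,test234567, testdcf,test2,test345334)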