StringBuilder to RDD - Scala

I have a StringBuilder (sb) with data as below in the Scala IDE:
CellId,Date,Time,MeasType,MeasResult
251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.emergency,0
251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.highPriorityAccess,0
251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.mt-Access,4
Now I want to convert this string into an RDD using Scala. Please help me.
I am using this code, but with no luck. Thanks in advance.
val headerFile = sc.parallelize(sb)
headerFile.collect()

StringBuilder builds a string from a mutable sequence of characters, so whatever you add to the builder is appended to form one single string.
You need to split that single string back into a collection of lines before passing it to SparkContext.
Assuming each line was added with a trailing line feed, you can split the StringBuilder's contents on the line feed and parallelize the result into an RDD:
val headerFile = sc.parallelize(sb.toString.split("\n"))
headerFile.collect()
To visualize the data, you have to print it or save it to a file.
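For example, a quick way to inspect or persist the RDD (the output path below is just a placeholder):
headerFile.collect().foreach(println)
// or write the RDD out as text files
headerFile.saveAsTextFile("file:///tmp/headerFileOutput")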
Now, if you want to convert it to a DataFrame before saving, you can do it as below:
val data = sb.toString.split("\n")
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schema = StructType(data.head.split(",").map(StructField(_, StringType, true)))
val rdd = sc.parallelize(data.tail.map(line => Row.fromSeq(line.split(","))))
spark.createDataFrame(rdd, schema).show(false)
which should give you
+---------+----------+--------+-----------------------------------+----------+
|CellId   |Date      |Time    |MeasType                           |MeasResult|
+---------+----------+--------+-----------------------------------+----------+
|251498240|2016-12-02|20:45:00|RRC.ConnEstabAtt.emergency         |0         |
|251498240|2016-12-02|20:45:00|RRC.ConnEstabAtt.highPriorityAccess|0         |
|251498240|2016-12-02|20:45:00|RRC.ConnEstabAtt.mt-Access         |4         |
+---------+----------+--------+-----------------------------------+----------+
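If you then want to persist the DataFrame rather than just show it, a minimal sketch (the output path is only a placeholder) would be:
spark.createDataFrame(rdd, schema)
  .write
  .option("header", "true")
  .csv("file:///tmp/measResults")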

Related

Create SOAP XML REQUEST from selected dataframe columns in Scala

Is there a way to create an XML SOAP REQUEST by extracting a few columns from each row of a DataFrame? 10 records in a DataFrame means 10 separate SOAP XML REQUESTs.
How would you make the function call using map?
You can do that by applying a map function to the DataFrame.
val df = ... // your DataFrame
df.map(x => convertToSOAP(x)) // convertToSOAP is your function
Here is an example based on your comment; hope you find it useful.
case class emp(id:String,name:String,city:String)
val list = List(emp("1","user1","NY"),emp("2","user2","SFO"))
val rdd = sc.parallelize(list)
val df = rdd.toDF
df.map(x => "<root><name>" + x.getString(1) + "</name><city>"+ x.getString(2) +"</city></root>").show(false)
// Note: x is a type of org.apache.spark.sql.Row
Output will be as follows:
+-----------------------------------------------+
|value                                          |
+-----------------------------------------------+
|<root><name>user1</name><city>NY</city></root> |
|<root><name>user2</name><city>SFO</city></root>|
+-----------------------------------------------+
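To connect this back to the pseudocode above, here is a hedged sketch of what a convertToSOAP helper might look like (the envelope structure and element names are purely illustrative, and the same df and spark.implicits._ are assumed to be in scope):
import org.apache.spark.sql.Row

// hypothetical helper: build a SOAP-style XML string from selected Row columns
def convertToSOAP(row: Row): String =
  s"<soapenv:Envelope><soapenv:Body><name>${row.getString(1)}</name><city>${row.getString(2)}</city></soapenv:Body></soapenv:Envelope>"

df.map(row => convertToSOAP(row)).show(false)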

Check data size in Spark DataFrames

I have the following question.
I am working with the following CSV file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a Spark DataFrame as shown in the code below.
My aim is to check the length and type of each field in the DataFrame against the set of rules below:
col type
job char10
marital char7
I started implementing the check of the length of each field, but I am getting a compilation error:
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
data.map(line => {
  val fields = line.toString.split(";")
  fields(0).size
  fields(1).size
})
The expected output should be:
List(10,10)
As for the check of the types, I don't have any idea how to implement it since we are using DataFrames. Any idea about a function for verifying the data format?
Thanks a lot in advance for your replies.
I see you are trying to use a DataFrame directly, but since there are multiple double quotes, you can read the file as text, remove them, and convert it to a DataFrame as below:
import org.apache.spark.sql.functions._
import spark.implicits._
val raw = spark.read.textFile("path to file ")
.map(_.replaceAll("\"", ""))
val header = raw.first
val data = raw.filter(row => row != header)
.map { r => val x = r.split(";"); (x(0), x(1)) }
.toDF(header.split(";"): _ *)
With data.show(false) you get:
+----------+-------+
|job       |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the sizes, you can use withColumn with the length function and adapt it as you need.
data.withColumn("jobSize", length($"job"))
.withColumn("martialSize", length($"marital"))
.show(false)
Output:
+----------+-------+-------+-----------+
|job       |marital|jobSize|maritalSize|
+----------+-------+-------+-----------+
|management|married|10     |7          |
|technician|single |10     |6          |
+----------+-------+-------+-----------+
All the column types are String.
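If you also want to flag rows that violate the char10/char7 rules from the question, a rough sketch along the same lines (the limits are taken from the question's rule table) could be:
data.withColumn("jobOk", length($"job") <= 10)
  .withColumn("maritalOk", length($"marital") <= 7)
  .show(false)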
Hope this helps!
You are using a DataFrame, so when you use the map method you are processing a Row in your lambda; line is a Row.
Row.toString returns a string representation of the Row, in your case of two StructFields typed as String.
If you want to use map and process your Row, you have to get the values inside the fields manually, with getAs[String] or getString.
Usually when you use DataFrames, you work in column logic as in SQL, using select, where..., or the SQL syntax directly.
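For example, a small sketch of both approaches, assuming the same job/marital DataFrame (data) as above and spark.implicits._ in scope:
// map over Rows and pull the values out manually
data.map(row => (row.getAs[String]("job").length, row.getAs[String]("marital").length)).show(false)

// or stay in column logic
data.select(length($"job") as "jobSize", length($"marital") as "maritalSize").show(false)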

Creating a DataFrame by loading a CSV file using Scala in Spark

But the CSV file contains extra double quotes, which collapses all columns into a single column.
There are four columns, a header, and 2 rows:
"""SlNo"",""Name"",""Age"",""contact"""
"1,""Priya"",78,""Phone"""
"2,""Jhon"",20,""mail"""
val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",",").option("inferSchema","true").load ("bank.csv")
df: org.apache.spark.sql.DataFrame = ["SlNo","Name","Age","contact": string]
What you can do is read the file using sparkContext, replace all " characters with an empty string, and use zipWithIndex() to separate the header from the data, so that a custom schema and a row RDD can be created. Finally, just pass the row RDD and the schema to sqlContext's createDataFrame API:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
//reading the text file, replacing quotes, splitting and finally zipping with index
val rdd = sc.textFile("bank.csv").map(_.replaceAll("\"", "").split(",")).zipWithIndex()
//separating header to form schema
val header = rdd.filter(_._2 == 0).flatMap(_._1).collect()
val schema = StructType(header.map(StructField(_, StringType, true)))
//separating data to form row rdd
val rddData = rdd.filter(_._2 > 0).map(x => Row.fromSeq(x._1))
//creating the dataframe
sqlContext.createDataFrame(rddData, schema).show(false)
You should be getting
+----+-----+---+-------+
|SlNo|Name |Age|contact|
+----+-----+---+-------+
|1   |Priya|78 |Phone  |
|2   |Jhon |20 |mail   |
+----+-----+---+-------+
I hope the answer is helpful

Flag faulty Row in spark dataframe with boolean values

I was trying a hands-on with Spark DataFrames. I have previous knowledge of the Cascading framework, which has a trap mechanism to filter faulty rows (rows with null values) into a separate Tap called a Trap. For those who are unaware: when a faulty row is read from a text file, the framework either drops the bad row from the data or stops the execution. In Apache Spark, I observed that bad rows didn't hinder the execution. That is good, but when it comes to getting business insights from data, the quality of the data does matter!
So, I have a text file with a bunch of rows in it (you may pick any dataset you like), in which a few records contain null values. Now I load the text file into a DataFrame with spark.read.csv. What I want to do is analyze the DataFrame and dynamically create a column named "isMyRowBad": the logic analyzes each row, and if it finds a row with a null value it flags isMyRowBad as true for that row, while rows with no null values get false in isMyRowBad.
Giving you the overview of the incoming and outgoing datasets
INCOMING DATAFRAME
fname,lname,age
will,smith,40
Dwayne,Nunn,36
Aniruddha,Sinha,
Maria,,22
OUTGOING DATAFRAME
fname,lname,age,isMyRowBad
will,smith,40,false
Dwayne,Nunn,36,false
Aniruddha,Sinha,,true
Maria,,22,true
The above method of classifying good and bad rows might seem a little foolish, but it does make sense since I will not need to run the filter operation multiple times. Let us take a look at how.
Suppose I have a DataFrame named inputDf as input and analysedDf: (DataFrame, DataFrame) as the output tuple.
Now, I did try this piece of code:
val analyzedDf: (DataFrame, DataFrame) = (inputDf.filter(_.anyNull),inputDf.filter(!_.anyNull))
This code segregates good and bad rows, I agree, but it has a performance setback since filter runs twice, which means it iterates over the whole dataset two times! (You may counter this point if you feel running filter twice makes sense when considering 50 fields and at least 584,000 rows, that is, 250 MB of data.)
and this as well
val analyzedDf: DataFrame = inputDf.select("*").withColumn("isMyRowBad", <at this point, I am not able to analyze the row>)
The above snippet shows where I am not able to figure out how to sweep the entire row and mark it as bad with a boolean value.
Hope you all understand what I am aiming to achieve. Please ignore any syntactic errors you find in the snippets, since I typed them right here (I will correct them in future edits).
Please give me a hint (a little code snippet or pseudocode will be enough) on how to proceed with the challenge. Please reach out to me if you didn't understand what I intend to do.
Any help will be greatly appreciated. Thanks in advance!
P.S.: There are brilliant people out here in BigData/Spark/Hadoop/Scala etc. Please correct me on any point which I might have written wrongly (conceptually).
The code below gives me a solution, by the way. Please have a look.
package aniruddha.data.quality
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.functions._
/**
* Created by aniruddha on 8/4/17.
*/
object DataQualityCheck extends App {

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val schema: StructType = StructType(List(
    StructField("fname", StringType, nullable = true),
    StructField("lname", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true),
    StructField("pan", StringType, nullable = true),
    StructField("married", StringType, nullable = true)
  ))

  val inputDataFrame: DataFrame = spark
    .read
    .schema(schema)
    .option("header", true)
    .option("delimiter", ",")
    .csv("inputData/infile")

  //inputDataFrame.show()

  val analysedDataFrame: DataFrame = inputDataFrame
    .select("*")
    .withColumn("isRowBad", when($"pan".isNull || $"lname".isNull || $"married".isNull, true).otherwise(false))

  analysedDataFrame.show()
}
input
fname,lname,age,pan,married
aniruddha,sinha,23,0AA22,no
balajee,venkatesh,23,0b96,no
warren,shannon,72,,
wes,borland,63,0b22,yes
Rohan,,32,0a96,no
james,bond,66,007,no
output
+---------+---------+---+-----+-------+--------+
|    fname|    lname|age|  pan|married|isRowBad|
+---------+---------+---+-----+-------+--------+
|aniruddha|    sinha| 23|0AA22|     no|   false|
|  balajee|venkatesh| 23| 0b96|     no|   false|
|   warren|  shannon| 72| null|   null|    true|
|      wes|  borland| 63| 0b22|    yes|   false|
|    Rohan|     null| 32| 0a96|     no|    true|
|    james|     bond| 66|  007|     no|   false|
+---------+---------+---+-----+-------+--------+
The code works fine, but I have a problem with the when function. Can't we just check all the columns without hardcoding them?
As far as I know, you can't do this with the built-in CSV parser. You can get the parser to stop when it hits an error (failFast mode), but you can't have it annotate rows.
However, you could do this with a custom CSV parser that processes the data in a single pass. Unless we want to do some clever type introspection, it is easiest to create a helper class that annotates the structure of the file:
case class CSVColumnDef(colPos: Int, colName: String, colType: String)
val columns = List(CSVColumnDef(0,"fname","String"),CSVColumnDef(1,"lname","String"),CSVColumnDef(2,"age", "Int"))
Next, we need some functions to a) split the input, b) extract data from split data, c) check if row is bad:
import scala.util.Try
def splitToSeq(delimiter: String) = udf[Seq[String],String](_.split(delimiter))
def extractColumnStr(i: Int) = udf[Option[String],Seq[String]](s => Try(Some(s(i))).getOrElse(None))
def extractColumnInt(i: Int) = udf[Option[Int],Seq[String]](s => Try(Some(s(i).toInt)).getOrElse(None))
def isRowBad(delimiter: String) = udf[Boolean,String](s => {
(s.split(delimiter).length != columns.length) || (s.split(delimiter).exists(_.length==0))
})
To use these, we first need to read in the text file (since I don't have it, and to allow people to replicate this answer, I will create an rdd):
val input = sc.parallelize(List(("will,smith,40"),("Dwayne,Nunn,36"),("Aniruddha,Sinha,"),("Maria,,22")))
input.take(5).foreach(println)
Given this input, we can create a dataframe with a single column, the raw line, and add our split column to it:
val delimiter = ","
val raw = "raw"
val delimited = "delimited"
val compDF = input.toDF(raw).withColumn(delimited, splitToSeq(delimiter)(col(raw)))
Finally, we can extract all the columns we previously defined, and check if the rows are bad:
val df = columns.foldLeft(compDF){case (acc,column) => column.colType match {
case "Int" => acc.withColumn(column.colName, extractColumnInt(column.colPos)(col(delimited)))
case _ => acc.withColumn(column.colName, extractColumnStr(column.colPos)(col(delimited)))
}}.
withColumn("isMyRowBad", isRowBad(delimiter)(col(raw))).
drop(raw).drop(delimited)
df.show
df.printSchema
The nice thing about this solution is that the Spark execution planner is smart enough to build all of those .withColumn operations into a single pass (map) over the data, with no shuffling. The annoying thing is that it is a lot more dev work than using a nice shiny csv library, and we need to define the columns somehow. If you wanted to be a bit more clever, you could get the column names from the first line of the file (hint: .mapPartitionsWithIndex) and just parse everything as a string. We also can't define a case class to describe the entire DF, since you have too many columns to approach the solution that way. Hope this helps...
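As a rough illustration of that hint (not part of the original answer; the file path is a placeholder and the header is assumed to be the first line of the file), the column names could be pulled out while still dropping the header in a single pass:
// read the raw file; its first line is assumed to be the header
val lines = sc.textFile("path/to/file.csv")
val columnNames = lines.first.split(delimiter)

// drop the header from the first partition only, keeping the rest of the pipeline single-pass
val dataLines = lines.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}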
This can be done using a udf. Although the answer given by Ben Horsburgh is definitely brilliant, we can do this without getting much into the internal architecture behind DataFrames. The following code can give you an idea:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
/**
* Created by vaijnath on 10/4/17.
*/
object DataQualityCheck extends App {

  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val schema: StructType = StructType(List(
    StructField("fname", StringType, nullable = true),
    StructField("lname", StringType, nullable = true),
    StructField("married", StringType, nullable = true)
  ))

  val inputDataFrame: DataFrame = spark
    .read
    .schema(schema)
    .option("header", false)
    .option("delimiter", ",")
    .csv("hydrograph.engine.spark/testData/inputFiles/delimitedInputFile.txt")

  //inputDataFrame.show()

  def isBad(row: Row): Boolean = {
    row.anyNull
  }

  val simplefun = udf(isBad(_: Row))
  val cols = struct(inputDataFrame.schema.fieldNames.map(e => col(e)): _*)
  // println(cols + "******************") //for debugging

  val analysedDataFrame: DataFrame = inputDataFrame.withColumn("isRowBad", simplefun(cols))

  analysedDataFrame.show
}
Please get back to me if you face any issues. I believe this solution is appropriate, since you seem to be looking for code that uses DataFrames.
Thanks.
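As a further alternative, not taken from either answer above, the null check can also be built dynamically from the column list using only built-in Column functions, which avoids both the udf and the hardcoded column names. A sketch assuming the same inputDataFrame:
import org.apache.spark.sql.functions.col

// build "c1 IS NULL OR c2 IS NULL OR ..." from whatever columns the DataFrame happens to have
val anyNullCol = inputDataFrame.columns.map(c => col(c).isNull).reduce(_ || _)
inputDataFrame.withColumn("isRowBad", anyNullCol).show(false)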

Spark each file to a dataset row

I have many files in a directory, each file containing text spanning multiple lines.
Currently, I use the following code to read all those files into a Spark Dataset (Spark > 2.0):
val ddf = spark.read.text("file:///input/*")
However, this creates a dataset where each row is a line, not a file. I'd like to have each file (as a string) per row in the dataset.
How can I achieve this without iterating over each file and reading it in separately as an RDD?
Use wholeTextFiles() on SparkContext
val rdd: RDD[(String, String)] = spark.sparkContext
.wholeTextFiles("file/path/to/read/as/rdd")
SparkContext.wholeTextFiles lets you read a directory containing
multiple small text files, and returns each of them as (filename,
content) pairs. This is in contrast with textFile, which would return
one record per line in each file.
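If you need the result as a Dataset/DataFrame rather than an RDD, a small follow-up sketch (assuming spark.implicits._ is in scope) would be:
import spark.implicits._

// one row per file: (filename, full file content)
val filesDF = rdd.toDF("filename", "content")
filesDF.show(false)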
An alternative to #mrsrinivas's answer would be to group by input_file_name. Given the structure:
evan#vbox>~/junk/so> find .
.
./d2
./d2/t.txt
./d1
./d1/t.txt
evan#vbox>~/junk/so> cat */*.txt
d1_1
d1_2
d2_1
d2_2
We can collect lists based on the input files like so:
scala> val ddf = spark.read.textFile("file:///home/evan/junk/so/*").
| select($"value", input_file_name as "fName")
ddf: org.apache.spark.sql.DataFrame = [value: string, fName: string]
scala> ddf.show(false)
+-----+----------------------------------+
|value|fName                             |
+-----+----------------------------------+
|d2_1 |file:///home/evan/junk/so/d2/t.txt|
|d2_2 |file:///home/evan/junk/so/d2/t.txt|
|d1_1 |file:///home/evan/junk/so/d1/t.txt|
|d1_2 |file:///home/evan/junk/so/d1/t.txt|
+-----+----------------------------------+
scala> ddf.groupBy("fName").agg(collect_list($"value") as "value").
| drop("fName").show
+------------+
|       value|
+------------+
|[d1_1, d1_2]|
|[d2_1, d2_2]|
+------------+
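If you want each file as one single string rather than an array of lines, you could additionally join the collected lines, for example with concat_ws (a hedged follow-up to the snippet above, joining on "\n" and assuming the spark-shell imports already in scope):
ddf.groupBy("fName")
  .agg(concat_ws("\n", collect_list($"value")) as "value")
  .drop("fName")
  .show(false)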