Creating a DataFrame by loading a CSV file using Scala in Spark

The CSV file has extra double quotes, which causes all the columns to be read as a single column.
There are four columns, a header and 2 rows:
"""SlNo"",""Name"",""Age"",""contact"""
"1,""Priya"",78,""Phone"""
"2,""Jhon"",20,""mail"""
val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",",").option("inferSchema","true").load ("bank.csv")
df: org.apache.spark.sql.DataFrame = ["SlNo","Name","Age","contact": string]

What you can do is read the file with sparkContext, remove every double quote, and use zipWithIndex() to separate the header from the data rows so that a custom schema and a row RDD can be created. Finally, pass the row RDD and the schema to sqlContext's createDataFrame API.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
//reading text file, removing quotes, splitting and finally zipping with index
val rdd = sc.textFile("bank.csv").map(_.replaceAll("\"", "").split(",")).zipWithIndex()
//separating header to form schema
val header = rdd.filter(_._2 == 0).flatMap(_._1).collect()
val schema = StructType(header.map(StructField(_, StringType, true)))
//separating data to form row rdd
val rddData = rdd.filter(_._2 > 0).map(x => Row.fromSeq(x._1))
//creating the dataframe
sqlContext.createDataFrame(rddData, schema).show(false)
You should be getting
+----+-----+---+-------+
|SlNo|Name |Age|contact|
+----+-----+---+-------+
|1   |Priya|78 |Phone  |
|2   |Jhon |20 |mail   |
+----+-----+---+-------+
I hope the answer is helpful
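To see what the replaceAll/split step does to each raw line, here is a minimal plain-Scala sketch that runs without Spark. The names QuoteCleanup and clean are made up for illustration; the sample lines are the ones from the question. It assumes, as the answer does, that no field contains a comma or a quote that should be preserved.

```scala
// Plain-Scala sketch (no Spark required); QuoteCleanup and clean are hypothetical names.
object QuoteCleanup {
  // Same per-line transformation as in the answer: drop every double quote, then split on commas.
  def clean(line: String): Array[String] =
    line.replaceAll("\"", "").split(",")

  def main(args: Array[String]): Unit = {
    // Single quotes stand in for double quotes to keep the literals readable.
    val raw = Seq(
      "'''SlNo'',''Name'',''Age'',''contact'''",
      "'1,''Priya'',78,''Phone'''",
      "'2,''Jhon'',20,''mail'''"
    ).map(_.replace('\'', '"'))

    val header = clean(raw.head)     // column names
    val rows   = raw.tail.map(clean) // data rows, every field still a string

    println(header.mkString("|"))               // SlNo|Name|Age|contact
    rows.foreach(r => println(r.mkString("|"))) // 1|Priya|78|Phone and 2|Jhon|20|mail
  }
}
```

Because every quote is removed before splitting, a value such as "Smith, John" would be split into two fields, so this cleanup only suits files like the sample where the quotes carry no meaning.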

Related

convert dataframe column values and apply SHA2 masking logic

I have a DataFrame that contains the property table and the main table from Hive. I want to remove some columns and then apply masking logic (SHA2) to others.
The property config is read from a PostgreSQL DB as a DataFrame in a Spark/Scala job.
// propertydf is the property DataFrame loaded from the PostgreSQL DB (loading code not shown)
Main Hive table
and the output should be
Can anyone please help me write the code in Spark/Scala? I am unable to get a List[String] from the DataFrame config and pass it to a function.
You can manipulate the column names and select them as appropriate:
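import org.apache.spark.sql.functions.{col, sha2}
// comma-separated column lists from the first row of the property DataFrame
val masking = propertydf.head(1)(0).getAs[String]("maskingcolumns").split(",")
val exclude = propertydf.head(1)(0).getAs[String]("columnstoexclude").split(",")
val result = df.select(
  // hashed columns first, then every column that is neither masked nor excluded
  masking.map(c => sha2(col(c).cast("string"), 256).as(c)) ++
    df.columns.filterNot(c => masking.contains(c) || exclude.contains(c)).map(col)
  : _*
)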
result.show(false)
+----------------------------------------------------------------+----------------------------------------------------------------+---+---+
|a                                                               |b                                                               |c  |d  |
+----------------------------------------------------------------+----------------------------------------------------------------+---+---+
|ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad|6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b|11 |cbc|
+----------------------------------------------------------------+----------------------------------------------------------------+---+---+
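The selection logic in the answer can be checked outside Spark. The sketch below is plain Scala; MaskingSketch, sha256Hex, selectColumns and the column lists in main are hypothetical and only illustrate the idea. It relies on sha2(..., 256) applied to a string column producing the hex SHA-256 of the string's UTF-8 bytes, which matches the output above: the two digests shown are the SHA-256 of "abc" and of "1".

```scala
import java.security.MessageDigest

// Plain-Scala sketch; MaskingSketch, sha256Hex and selectColumns are hypothetical names.
object MaskingSketch {
  // Hex-encoded SHA-256 of the UTF-8 bytes of s.
  def sha256Hex(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes("UTF-8"))
      .map(b => "%02x".format(b))
      .mkString

  // Mirrors the select in the answer: masked columns first, then every column
  // that is neither masked nor excluded. Excluded columns are dropped.
  def selectColumns(all: Seq[String], masking: Seq[String], exclude: Seq[String]): (Seq[String], Seq[String]) =
    (masking, all.filterNot(c => masking.contains(c) || exclude.contains(c)))

  def main(args: Array[String]): Unit = {
    // Assumed config values; the real ones come from the property table.
    val masking = "a,b".split(",").toSeq
    val exclude = "e".split(",").toSeq
    val (toMask, toKeep) = selectColumns(Seq("a", "b", "c", "d", "e"), masking, exclude)

    println(toMask.mkString(",")) // a,b
    println(toKeep.mkString(",")) // c,d
    println(sha256Hex("abc"))     // ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
  }
}
```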

Create SOAP XML REQUEST from selected dataframe columns in Scala

Is there a way to create an XML SOAP REQUEST by extracting a few columns from each row of a dataframe ? 10 records in a dataframe means 10 separate SOAP XML REQUESTs.
How would you make the function call using map now?
You can do that by applying a map function to the dataframe.
// df is your DataFrame and convertToSOAP is your function
df.map(x => convertToSOAP(x))
Here is an example based on your comment; I hope you find it useful.
case class emp(id:String,name:String,city:String)
val list = List(emp("1","user1","NY"),emp("2","user2","SFO"))
val rdd = sc.parallelize(list)
val df = rdd.toDF
df.map(x => "<root><name>" + x.getString(1) + "</name><city>"+ x.getString(2) +"</city></root>").show(false)
// Note: x is a type of org.apache.spark.sql.Row
Output will be as follows :
+--------------------------------------------------+
|value                                             |
+--------------------------------------------------+
|<root><name>user1</name><city>NY</city></root>    |
|<root><name>user2</name><city>SFO</city></root>   |
+--------------------------------------------------+
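The example above builds a bare <root> element. A convertToSOAP function that wraps each record in a SOAP envelope might look like the sketch below. It is plain Scala and only an illustration: SoapSketch, Emp, escapeXml and the <emp> payload layout are assumptions, and a real service will define its own body elements and namespaces. The envelope namespace is the standard SOAP 1.1 one.

```scala
// Plain-Scala sketch; SoapSketch, Emp, escapeXml and the <emp> payload layout are assumptions.
object SoapSketch {
  final case class Emp(id: String, name: String, city: String)

  // Escape the five XML special characters so field values cannot break the markup.
  def escapeXml(s: String): String =
    s.replace("&", "&amp;")
      .replace("<", "&lt;")
      .replace(">", "&gt;")
      .replace("\"", "&quot;")
      .replace("'", "&apos;")

  // Builds one SOAP 1.1 envelope per record.
  def convertToSOAP(e: Emp): String =
    """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">""" +
      "<soapenv:Body>" +
      s"<emp><name>${escapeXml(e.name)}</name><city>${escapeXml(e.city)}</city></emp>" +
      "</soapenv:Body></soapenv:Envelope>"

  def main(args: Array[String]): Unit = {
    val records = List(Emp("1", "user1", "NY"), Emp("2", "R&D", "SFO"))
    // One request string per record.
    records.map(convertToSOAP).foreach(println)
  }
}
```

With a DataFrame whose columns match Emp, something like df.as[Emp].map(SoapSketch.convertToSOAP) should yield one request string per row, assuming spark.implicits._ is in scope.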

Spark dataframe with complex & nested data

I have 3 dataframes currently
Call them dfA, dfB, and dfC
dfA has 3 cols
|Id | Name | Age |
dfB has, say, 5 cols. The 2nd col is an FK reference back to a dfA record.
|Id | AId | Street | City | Zip |
Similarly, dfC has 3 cols, also with a reference back to dfA.
|Id | AId | SomeField |
Using Spark SQL I can do a JOIN across the 3:
%sql
SELECT * FROM dfA
INNER JOIN dfB ON dfA.Id = dfB.AId
INNER JOIN dfC ON dfA.Id = dfC.AId
And I'll get my resultset, but it's been "flattened" as SQL would do with tabular results like this.
I want to load it in to a complex schema like this
val destinationSchema = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)
  .add("age", StringType)
  .add("b",
    new StructType()
      .add("street", DoubleType, true)
      .add("city", StringType, true)
      .add("zip", StringType, true)
  )
  .add("c",
    new StructType()
      .add("somefield", StringType, true)
  )
Any ideas on how to take the results of the SELECT and save them to a DataFrame with the schema specified?
I ultimately want to save the complex StructType, or JSON, and load it into MongoDB using the Mongo Spark Connector.
Or is there a better way to accomplish this from the 3 separate DataFrames (which were originally 3 separate CSV files that were read in)?
Given three DataFrames, loaded in from CSV files, you can do this:
import org.apache.spark.sql.functions._
val destDF = atableDF
  .join(btableDF, atableDF("id") === btableDF("id")).drop(btableDF("id"))
  .join(ctableDF, atableDF("id") === ctableDF("id")).drop(ctableDF("id"))
  .select($"id", $"name", $"age", struct($"street", $"city", $"zip") as "b", struct($"somefield") as "c")
val jsonDestDF = destDF.select(to_json(struct($"*")).as("row"))
which will output:
row
{"id":100,"name":"John","age":"43","b":{"street":"Dark Road","city":"Washington","zip":"98002"},"c":{"somefield":"appples"}}
{"id":101,"name":"Sally","age":"34","b":{"street":"Light Ave","city":"Los Angeles","zip":"90210"},"c":{"somefield":"bananas"}}
{"id":102,"name":"Damian","age":"23","b":{"street":"Short Street","city":"New York","zip":"70701"},"c":{"somefield":"pears"}}
The previous approach works if all the records have a 1:1 relationship.
Here is how you can achieve it for 1:M (hint: use collect_set to group rows):
import org.apache.spark.sql.functions._
val destDF = atableDF
  .join(btableDF, atableDF("id") === btableDF("id")).drop(btableDF("id"))
  .join(ctableDF, atableDF("id") === ctableDF("id")).drop(ctableDF("id"))
  .groupBy($"id", $"name", $"age")
  .agg(collect_set(struct($"street", $"city", $"zip")) as "b", collect_set(struct($"somefield")) as "c")
val jsonDestDF = destDF.select(to_json(struct($"*")).as("row"))
display(jsonDestDF)
which will give you the following output:
row
{"id":102,"name":"Damian","age":"23","b":[{"street":"Short Street","city":"New York","zip":"70701"}],"c":[{"somefield":"pears"},{"somefield":"pineapples"}]}
{"id":100,"name":"John","age":"43","b":[{"street":"Dark Road","city":"Washington","zip":"98002"}],"c":[{"somefield":"appples"}]}
{"id":101,"name":"Sally","age":"34","b":[{"street":"Light Ave","city":"Los Angeles","zip":"90210"}],"c":[{"somefield":"grapes"},{"somefield":"peaches"},{"somefield":"bananas"}]}
Sample data I used, in case anyone wants to play:
atable.csv:
100,"John",43
101,"Sally",34
102,"Damian",23
104,"Rita",14
105,"Mohit",23
btable.csv:
100,"Dark Road","Washington",98002
101,"Light Ave","Los Angeles",90210
102,"Short Street","New York",70701
104,"Long Drive","Buffalo",80345
105,"Circular Quay","Orlando",65403
ctable.csv:
100,"appples"
101,"bananas"
102,"pears"
101,"grapes"
102,"pineapples"
101,"peaches"
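As a rough illustration of what the groupBy plus collect_set step does in the 1:M answer, here is a plain-Scala sketch over the ctable.csv rows above. GroupSketch and its member names are hypothetical, and ordinary Scala collections stand in for the DataFrame.

```scala
// Plain-Scala sketch; GroupSketch and its member names are hypothetical.
object GroupSketch {
  // (id, somefield) pairs taken from ctable.csv
  val cRows: Seq[(Int, String)] = Seq(
    100 -> "appples", 101 -> "bananas", 102 -> "pears",
    101 -> "grapes", 102 -> "pineapples", 101 -> "peaches"
  )

  // Group by id and keep the distinct values per id, similar to groupBy + collect_set.
  val grouped: Map[Int, Set[String]] =
    cRows.groupBy(_._1).map { case (id, rows) => id -> rows.map(_._2).toSet }

  def main(args: Array[String]): Unit =
    // Sort only for stable printing; a Set, like collect_set, has no guaranteed order.
    grouped.toSeq.sortBy(_._1).foreach { case (id, fields) =>
      println(s"$id -> ${fields.toSeq.sorted.mkString(", ")}")
    }
}
```

The set semantics matter in the Spark version: joining both btable and ctable repeats the btable row once per matching ctable row, and collect_set collapses those repeats where collect_list would keep them.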

Stringbuilder to RDD

I have a string builder(sb) with data as below in Scala IDE
CellId,Date,Time,MeasType,MeasResult
251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.emergency,0
251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.highPriorityAccess,0
251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.mt-Access,4
Now I want to convert this string into an RDD using Scala. Please help me.
I am using the code below, but with no luck. Thanks in advance.
val headerFile = sc.parallelize(sb)
headerFile.collect()
StringBuilder builds a string from a mutable sequence of characters, so whatever you append to the builder becomes part of one single string.
You need to split that string back into separate lines before passing it to the SparkContext as a list of strings.
Assuming each line was appended with a trailing line feed, you can split the builder's content on the line feed and turn the result into an RDD:
val headerFile = sc.parallelize(sb.toString.split("\n"))
headerFile.collect()
To inspect the data, you have to print it or save it to a file.
If you want to convert it to a DataFrame before saving, you can proceed as below:
val data = sb.toString.split("\n")
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
//first line is the header; every column is read as a string
val schema = StructType(data.head.split(",").map(StructField(_, StringType, true)))
//remaining lines are the data rows
val rdd = sc.parallelize(data.tail.map(line => Row.fromSeq(line.split(","))))
spark.createDataFrame(rdd, schema).show(false)
which should give you
+---------+----------+--------+-----------------------------------+----------+
|CellId   |Date      |Time    |MeasType                           |MeasResult|
+---------+----------+--------+-----------------------------------+----------+
|251498240|2016-12-02|20:45:00|RRC.ConnEstabAtt.emergency         |0         |
|251498240|2016-12-02|20:45:00|RRC.ConnEstabAtt.highPriorityAccess|0         |
|251498240|2016-12-02|20:45:00|RRC.ConnEstabAtt.mt-Access         |4         |
+---------+----------+--------+-----------------------------------+----------+
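The header/rows split used above can be tried without a Spark session. The sketch below is plain Scala; BuilderSketch and splitHeaderAndRows are hypothetical names, and it assumes the builder holds at least a header line, with lines separated by "\n".

```scala
// Plain-Scala sketch; BuilderSketch and splitHeaderAndRows are hypothetical names.
object BuilderSketch {
  // Returns (header fields, data rows). Blank lines are skipped and a trailing "\r" is trimmed.
  // Assumes at least one non-empty line; lines.head throws on empty input.
  def splitHeaderAndRows(sb: StringBuilder): (Array[String], Array[Array[String]]) = {
    val lines = sb.toString.split("\n").map(_.trim).filter(_.nonEmpty)
    (lines.head.split(","), lines.tail.map(_.split(",")))
  }

  def main(args: Array[String]): Unit = {
    val sb = new StringBuilder
    sb.append("CellId,Date,Time,MeasType,MeasResult\n")
    sb.append("251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.emergency,0\n")
    sb.append("251498240,2016-12-02,20:45:00,RRC.ConnEstabAtt.mt-Access,4\n")

    val (header, rows) = splitHeaderAndRows(sb)
    println(header.mkString(" | "))
    rows.foreach(r => println(r.mkString(" | ")))
  }
}
```

The header array would feed the StructType and each row array would become a Row, as in the answer.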

Create DataFrame / Dataset using Header and Data in two different directories

I am getting the input as CSV files in two directories: the first directory has one file containing the header record, and the second directory has the data files. I want to create a DataFrame/Dataset from them.
One way is to create a case class, split the data files by the delimiter, attach the schema and create the DataFrame.
What I am looking for is a way to read the header file and the data files and create the DataFrame. I saw a solution that uses the Databricks CSV package, but my organization restricts the use of Databricks; below is the code I came across. Can anyone help me with a solution that does not use Databricks?
val headersDF = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("path to headers.csv")
val schema = headersDF.schema
val dataDF = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .schema(schema)
  .load("path to data.csv")
You can do it like this:
val schema = spark
  .read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ",")
  .load("C:\\spark\\programs\\empheaders.csv")
  .schema
val data = spark
  .read
  .format("csv")
  .schema(schema)
  .option("delimiter", ",")
  .load("C:\\spark\\programs\\empdata.csv")
Because your header CSV file contains no data, there is no point in inferring the schema from it.
So just read it to get the field names.
val headerRDD = sc.parallelize(Seq(("Name,Age,Sal"))) //Assume this line is in your Header CSV
val header = headerRDD.flatMap(_.split(",")).collect
//headerRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at parallelize at command-2903591155643047:1
//header: Array[String] = Array(Name, Age, Sal)
Then read the data CSV file.
Either map each line to a case class or a tuple. Convert the data to a DataFrame by passing the header array.
val dataRdd = sc.parallelize(Seq(("Tom,22,500000"),("Rick,40,1000000"))) //Assume these lines are in your data CSV file
val data = dataRdd.map(_.split(",")).map(x => (x(0),x(1).toInt,x(2).toDouble)).toDF(header: _*)
//dataRdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[72] at parallelize at command-2903591155643048:1
//data: org.apache.spark.sql.DataFrame = [Name: string, Age: int ... 1 more field]
Result:
data.show()
+----+---+---------+
|Name|Age| Sal|
+----+---+---------+
| Tom| 22| 500000.0|
|Rick| 40|1000000.0|
+----+---+---------+