Create SOAP XML REQUEST from selected dataframe columns in Scala

Is there a way to create an XML SOAP request by extracting a few columns from each row of a dataframe? Ten records in a dataframe would mean ten separate SOAP XML requests.
How would you make the function call using map?

You can do that by applying a map function to the dataframe.
val df = ...                  // your dataframe
df.map(x => convertToSOAP(x)) // convertToSOAP is your function
Here is an example based on your comment; hope you find it useful.
case class emp(id: String, name: String, city: String)

val list = List(emp("1", "user1", "NY"), emp("2", "user2", "SFO"))
val rdd = sc.parallelize(list)
val df = rdd.toDF

// x is of type org.apache.spark.sql.Row; fields are (0) id, (1) name, (2) city
df.map(x => "<root><name>" + x.getString(1) + "</name><city>" + x.getString(2) + "</city></root>").show(false)
Output will be as follows:
+--------------------------------------------------+
|value |
+--------------------------------------------------+
|<root><name>user1</name><city>NY</city></root> |
|<root><name>user2</name><city>SFO</city></root> |
+--------------------------------------------------+
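For a fuller SOAP payload, the same map can call a helper that wraps the selected columns in a SOAP envelope. A minimal sketch of such a convertToSOAP helper (the envelope structure and element names are assumptions, not taken from the question):

import org.apache.spark.sql.Row

// hypothetical helper: builds one SOAP 1.1 envelope per Row, using columns (1) name and (2) city
def convertToSOAP(x: Row): String =
  s"""<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
     |  <soapenv:Body>
     |    <root>
     |      <name>${x.getString(1)}</name>
     |      <city>${x.getString(2)}</city>
     |    </root>
     |  </soapenv:Body>
     |</soapenv:Envelope>""".stripMargin

val requests = df.map(x => convertToSOAP(x)) // one SOAP XML request per record
requests.show(false)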

Related

How to write a UDF in Spark to map indexes to string labels?

I am using Spark and I have a table with a specific string format in one of the columns, called predictions. The format is always of the form 0=some_probability,1=some_other_probability,2=some_other_probability.
Here are a few sample records from that table:
val table1 = Seq(
  ("0=0.5,1=0.3,2=0.2"),
  ("0=0.6,1=0.2,2=0.2"),
  ("0=0.1,1=0.1,2=0.8")
).toDF("predictions")
table1.show(false)
+-----------------+
|predictions |
+-----------------+
|0=0.5,1=0.3,2=0.2|
|0=0.6,1=0.2,2=0.2|
|0=0.1,1=0.1,2=0.8|
+-----------------+
Now, I also have metadata information about each of these indexes - 0,1,2...n in a separate string. The metadata string looks like -
val metadata = "AA::BB::CC"
I would like to write a UDF in Scala to map these indexes to each element in the string. The output of that UDF should give me a new column which looks like this -
+--------------------+
|labelled_predictions|
+--------------------+
|AA=0.5,BB=0.3,CC=0.2|
|AA=0.6,BB=0.2,CC=0.2|
|AA=0.1,BB=0.1,CC=0.8|
+--------------------+
So, 0 is replaced by AA since AA is the first element in the metadata string, which is always delimited by ::.
How do I write a UDF in Scala/Spark to do this?
val metadata = "AA::BB::CC"
Based on the given data, this should work for you:
import org.apache.spark.sql.functions.{col, udf}

def myUDF(metadata: String) = udf((s: String) => {
  val metadataSplit = metadata.split("::")
  val dataSplit = s.split(",")
  val output = new Array[String](dataSplit.size)
  for (i <- 0 until dataSplit.size) {
    output(i) = metadataSplit(i) + "=" + dataSplit(i).split("=")(1)
  }
  output.mkString(",")
})
table1.withColumn("labelled_predictions", myUDF(metadata)(col("predictions"))).select("labelled_predictions").show(false)
Output:
+--------------------+
|labelled_predictions|
+--------------------+
|AA=0.5,BB=0.3,CC=0.2|
|AA=0.6,BB=0.2,CC=0.2|
|AA=0.1,BB=0.1,CC=0.8|
+--------------------+
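An equivalent, slightly more idiomatic body for the same UDF zips the label split with the prediction split instead of indexing by position; a sketch under the same assumption that both always have matching lengths (myZipUDF is just an illustrative name):

def myZipUDF(metadata: String) = udf((s: String) => {
  // pair each metadata label with the probability part of the corresponding prediction
  metadata.split("::")
    .zip(s.split(","))
    .map { case (label, kv) => label + "=" + kv.split("=")(1) }
    .mkString(",")
})

table1.withColumn("labelled_predictions", myZipUDF(metadata)(col("predictions")))
  .select("labelled_predictions")
  .show(false)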

Check data size in Spark dataframes

I have the following question:
I am working with the following CSV file:
""job"";""marital"""
""management"";""married"""
""technician"";""single"""
I loaded it into a spark dataframe as follows:
My aim is to check the length and type of each field in the dataframe against the set of rules below:
col      type
job      char10
marital  char7
I started implementing the check of the length of each field, but I am getting a compilation error:
val data = spark.read.option("inferSchema", "true").option("header", "true").csv("file:////home/user/Desktop/user/file.csv")
data.map(line => {
  val fields = line.toString.split(";")
  fields(0).size
  fields(1).size
})
The expected output should be:
List(10,10)
As for checking the types, I have no idea how to implement it since we are using dataframes. Is there a function for verifying the data format?
Thanks a lot in advance for your replies.
I see you are trying to use a DataFrame, but since there are multiple double quotes you can read the file as text, remove them, and convert it to a DataFrame as below:
import org.apache.spark.sql.functions._
import spark.implicits._

val raw = spark.read.textFile("path to file")
  .map(_.replaceAll("\"", ""))

val header = raw.first
val data = raw.filter(row => row != header)
  .map { r => val x = r.split(";"); (x(0), x(1)) }
  .toDF(header.split(";"): _*)
With data.show(false) you get:
+----------+-------+
|job |marital|
+----------+-------+
|management|married|
|technician|single |
+----------+-------+
To calculate the size you can use withColumn with the length function and adjust it as you need.
data.withColumn("jobSize", length($"job"))
  .withColumn("maritalSize", length($"marital"))
  .show(false)
Output:
+----------+-------+-------+-----------+
|job       |marital|jobSize|maritalSize|
+----------+-------+-------+-----------+
|management|married|10 |7 |
|technician|single |10 |6 |
+----------+-------+-------+-----------+
All the column types are String.
Hope this helps!
You are using a dataframe, so when you use the map method you are processing a Row in your lambda.
So line is a Row.
Row.toString will return a string representing the Row, which in your case contains two StructFields typed as String.
If you want to use map and process your Row, you have to get the values inside the fields manually, with getString(index) or getAs[String](fieldName).
Usually when you use DataFrames you should work with column logic as in SQL, using select, where, ... or directly the SQL syntax.
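To check both the length and the type against the rules (job char10, marital char7), one possibility is to combine the dataframe's schema with a max(length(...)) aggregation. A sketch only: the rules come from the question, everything else (the rules map, the expected type strings) is an assumption.

import org.apache.spark.sql.functions.{col, length, max}

// expected (maximum length, type) per column, taken from the rules in the question
val rules = Map("job" -> (10, "StringType"), "marital" -> (7, "StringType"))

// one pass over the data to get the maximum observed length of each rule column
val maxLengths = data.select(rules.keys.toSeq.map(c => max(length(col(c))).as(c)): _*).first()

rules.foreach { case (name, (maxLen, expectedType)) =>
  val actualType = data.schema(name).dataType.toString
  val actualLen  = maxLengths.getAs[Int](name)
  println(s"$name: length ok = ${actualLen <= maxLen}, type ok = ${actualType == expectedType}")
}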

Creating a dataframe by loading a CSV file using Scala in Spark

But the CSV file comes with extra double quotes, which results in all columns being read into a single column.
There are four columns, a header and 2 rows:
"""SlNo"",""Name"",""Age"",""contact"""
"1,""Priya"",78,""Phone"""
"2,""Jhon"",20,""mail"""
val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",",").option("inferSchema","true").load ("bank.csv")
df: org.apache.spark.sql.DataFrame = ["SlNo","Name","Age","contact": string]
What you can do is read the file using sparkContext, replace all " characters with an empty string, and use zipWithIndex() to separate the header from the data so that a custom schema and a row RDD can be created. Finally, just use the row RDD and the schema in sqlContext's createDataFrame API.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

//reading the text file, replacing quotes, splitting and finally zipping with index
val rdd = sc.textFile("bank.csv").map(_.replaceAll("\"", "").split(",")).zipWithIndex()

//separating the header to form the schema
val header = rdd.filter(_._2 == 0).flatMap(_._1).collect()
val schema = StructType(header.map(StructField(_, StringType, true)))

//separating the data to form the row rdd
val rddData = rdd.filter(_._2 > 0).map(x => Row.fromSeq(x._1))

//creating the dataframe
sqlContext.createDataFrame(rddData, schema).show(false)
You should be getting
+----+-----+---+-------+
|SlNo|Name |Age|contact|
+----+-----+---+-------+
|1 |Priya|78 |Phone |
|2 |Jhon |20 |mail |
+----+-----+---+-------+
I hope the answer is helpful
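If you are on Spark 2.2 or later (an assumption, not something stated in the question), another option is to strip the quotes in a Dataset[String] and hand it straight to the CSV reader, which keeps the header handling and schema inference:

import spark.implicits._

// strip the stray double quotes, then let the CSV reader parse header and types
val cleaned = spark.read.textFile("bank.csv").map(_.replaceAll("\"", ""))
val df = spark.read.option("header", "true").option("inferSchema", "true").csv(cleaned)
df.show(false)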

Reshape spark data frame of key-value pairs with keys as new columns

I am new to Spark and Scala. Let's say I have a data frame of lists that are key-value pairs. Is there a way to map the id vars of the ids column to new columns?
df.show()
+-------------+----------------------+
|ids          |vals                  |
+-------------+----------------------+
|[id1,id2,id3]|null                  |
|[id2,id5,id6]|[WrappedArray(0,2,4)] |
|[id2,id4,id7]|[WrappedArray(6,8,10)]|
+-------------+----------------------+
Expected output:
+----+----+
|id1 | id2| ...
+----+----+
|null| 0 | ...
|null| 6 | ...
A possible way would be to compute the columns of the new DataFrame and use those columns to construct the rows.
import org.apache.spark.sql.functions._
val data = List((Seq("id1","id2","id3"),None),(Seq("id2","id4","id5"),Some(Seq(2,4,5))),(Seq("id3","id5","id6"),Some(Seq(3,5,6))))
val df = sparkContext.parallelize(data).toDF("ids","values")
val values = df.flatMap {
  case Row(t1: Seq[String], t2: Seq[Int]) => Some((t1 zip t2).toMap)
  case Row(_, null) => None
}
// get the unique names of the columns across the original data
val ids = df.select(explode($"ids")).distinct.collect.map(_.getString(0))
// map the values to the new columns (null when an id is absent from the row)
val transposed = values.map(entry => Row.fromSeq(ids.map(id => entry.getOrElse(id, null))))
// programmatically recreate the target schema with the columns we found in the data
import org.apache.spark.sql.types._
val schema = StructType(ids.map(id => StructField(id, IntegerType, nullable=true)))
// Create the new DataFrame
val transposedDf = sqlContext.createDataFrame(transposed, schema)
This process passes over the data two times, although depending on the backing data source, calculating the column names can be rather cheap.
Also, it goes back and forth between DataFrames and RDDs. I would be interested in seeing a "pure" DataFrame process.
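One candidate for a "pure" DataFrame version, sketched under the assumption of Spark 2.4+ (for element_at) and using a synthetic row id so that each input row keeps its own output row; the ids/values column names follow the example above and pivoted is just an illustrative name:

import org.apache.spark.sql.functions._

val withId = df.withColumn("row_id", monotonically_increasing_id())

// one row per (row_id, position, id), then pivot the ids into columns
val exploded = withId.select($"row_id", posexplode($"ids").as(Seq("pos", "id")), $"values")
val pivoted = exploded
  .groupBy($"row_id")
  .pivot("id")
  .agg(first(element_at($"values", $"pos" + 1)))  // null when values is null or too short
  .drop("row_id")
pivoted.show(false)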

SPARK: What is the most efficient way to take a KV pair and turn it into a typed dataframe

Spark newbie here, attempting to use Spark to do some ETL and having trouble finding a clean way of unifying the data into the destination schema.
I have multiple dataframes with these keys/values in a Spark context (streaming).
Dataframe of long values:
entry         |long
--------------+-------------
alert_refresh |1446668689989
assigned_on   |1446668689777
Dataframe of string values:
entry         |string
--------------+-------------
statusmsg     |alert msg
url           |http:/svr/pth
Dataframe of boolean values:
entry         |boolean
--------------+-------------
led_1         |true
led_2         |true
Dataframe of integer values:
entry         |int
--------------+-------------
id            |789456123
I need to create a unified dataframe based on these where the key is the fieldName and it maintains the type from each source dataframe. It would look something like this:
id       |led_1|led_2|statusmsg|url          |alert_refresh|assigned_on
---------+-----+-----+---------+-------------+-------------+-------------
789456123|true |true |alert msg|http:/svr/pth|1446668689989|1446668689777
What is the most efficient way to do this in Spark?
BTW - I tried doing a matrix transform:
val seq_b = df_booleans.flatMap(row => row.toSeq.map(col => (col, row.toSeq.indexOf(col))))
  .map(v => (v._2, v._1))
  .groupByKey.sortByKey(true)
  .map(_._2)

val b_schema_names = seq_b.first.flatMap(r => Array(r))
val b_schema = StructType(b_schema_names.map(r => StructField(r.toString(), BooleanType, true)))
val b_data = seq_b.zipWithIndex.filter(_._2 == 1).map(_._1).first()
val boolean_df = sparkContext.createDataFrame(b_data, b_schema)
Issue: Takes 12 seconds and .sortByKey(true) does not always sort values last
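Since each of these per-type frames is tiny (a handful of key/value rows per streaming batch, judging by the example), one possible direction is to collect them on the driver and assemble a single typed Row plus schema. A sketch only; the frame names df_ints, df_strings, df_longs and the collectTyped helper are illustrative assumptions, not names from the question:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types._

// hypothetical helper: collect a small two-column key/value frame into (name, value, type) triples
def collectTyped(df: DataFrame, dt: DataType): Seq[(String, Any, DataType)] =
  df.collect().map(r => (r.getString(0), r.get(1), dt)).toSeq

val fields = collectTyped(df_ints, IntegerType) ++
  collectTyped(df_booleans, BooleanType) ++
  collectTyped(df_strings, StringType) ++
  collectTyped(df_longs, LongType)

val schema  = StructType(fields.map { case (name, _, dt) => StructField(name, dt, true) })
val unified = sqlContext.createDataFrame(sc.parallelize(Seq(Row.fromSeq(fields.map(_._2)))), schema)
unified.show(false)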