SPARK: What is the most efficient way to take a KV pair and turn it into a typed dataframe - scala

Spark newbie here. I am attempting to use Spark for some ETL and am having trouble finding a clean way of unifying the data into the destination schema.
I have multiple dataframes with these keys / values in a Spark context (streaming):
Dataframe of long values:
entry---------|long---------
----------------------------
alert_refresh |1446668689989
assigned_on |1446668689777
Dataframe of string values:
entry---------|string-------
----------------------------
statusmsg |alert msg
url |http:/svr/pth
Dataframe of boolean values:
entry---------|boolean------
----------------------------
led_1 |true
led_2 |true
Dataframe of integer values:
entry---------|int----------
----------------------------
id |789456123
I need to create a unified dataframe based on these where the key is the fieldName and it maintains the type from each source dataframe. It would look something like this:
id-------|led_1|led_2|statusmsg|url----------|alert_refresh|assigned_on
-----------------------------------------------------------------------
789456123|true |true |alert msg|http:/svr/pth|1446668689989|1446668689777
What is the most efficient way to do this in Spark?
BTW - I tried doing a matrix transform:
val seq_b = df_booleans.flatMap(row => row.toSeq.map(col => (col, row.toSeq.indexOf(col))))
  .map(v => (v._2, v._1))
  .groupByKey.sortByKey(true)
  .map(_._2)
val b_schema_names = seq_b.first.flatMap(r => Array(r))
val b_schema = StructType(b_schema_names.map(r => StructField(r.toString(), BooleanType, true)))
val b_data = seq_b.zipWithIndex.filter(_._2 == 1).map(_._1).first()
val boolean_df = sparkContext.createDataFrame(b_data, b_schema)
Issue: Takes 12 seconds and .sortByKey(true) does not always sort values last

Related

Apply a transformation to all the columns with the same data type on Spark

I need to apply a transformation to all the Integer columns of my Data Frame before writing a CSV. The transformation consists of changing the type to String and then converting the format to the European one (e.g. 1234567 -> "1234567" -> "1.234.567").
Does Spark have a way to apply this transformation to all the Integer columns? I want it to be generic functionality (because I need to write multiple CSVs) instead of hardcoding the columns to transform for each dataframe.
DataFrame has dtypes method, which returns column names along with their data types: Array[("Column name", "Data Type")].
You can map this array, applying different expressions to each column, based on their data type. And you can then pass this mapped list to the select method:
import spark.implicits._
import org.apache.spark.sql.functions._
val dataSeq = Seq(
(1246984, 993922, "test_1"),
(246984, 993922, "test_2"),
(246984, 993922, "test_3"))
val df = dataSeq.toDF("int_1", "int_2", "str_3")
df.show
+-------+------+------+
|  int_1| int_2| str_3|
+-------+------+------+
|1246984|993922|test_1|
| 246984|993922|test_2|
| 246984|993922|test_3|
+-------+------+------+
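// Build one select expression per column, chosen by the column's data type
val columns =
  df.dtypes.map{
    // integers: format with thousands separators, then swap "," for "."
    case (c, "IntegerType") => regexp_replace(format_number(col(c), 0), ",", ".").as(c)
    // every other column passes through unchanged
    case (c, _) => col(c)
  }
val df2 = df.select(columns:_*)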
df2.show
+---------+-------+------+
|    int_1|  int_2| str_3|
+---------+-------+------+
|1.246.984|993.922|test_1|
|  246.984|993.922|test_2|
|  246.984|993.922|test_3|
+---------+-------+------+
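To check the expected result for a single value outside Spark, the same formatting can be sketched in plain Scala. This is only an illustration: the names EuropeanFormat and format are made up here, and it uses java.text.DecimalFormat instead of Spark's format_number.

```scala
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

object EuropeanFormat {
  // Format an integer with '.' as the thousands separator.
  // A new DecimalFormat is built per call because DecimalFormat is not thread-safe.
  def format(n: Long): String = {
    val symbols = new DecimalFormatSymbols(Locale.ROOT)
    symbols.setGroupingSeparator('.')
    new DecimalFormat("#,##0", symbols).format(n)
  }
}
```

EuropeanFormat.format(1234567L) gives "1.234.567", which matches the example in the question.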

Spark dataframe random sampling based on frequency occurrence in dataframe

Input description
I have a spark job with input dataframe with a column queryId. This queryId is not unique with respect to the dataframe. For example, there are roughly 3M rows in the spark dataframe with 450k distinct query ids.
Problem
I am trying to implement sampling logic and create a new column sampledQueryId which contains randomly sampled query id for each dataframe row by looking up query ids from the aggregate spark dataframe query id set.
Sampling goal
The restriction is that sampled query id shouldn't be equal to input query id.
Sampling should correspond to the frequency of occurrence of each query id in the incoming spark dataframe, i.e. given two query ids q1 and q2 whose ratio of occurrence is 10:1 (q1:q2), q1 should appear approximately 10 times more often in the sampled id column.
Solution tried so far
I have tried to implement this by collecting the query ids into a list and looking up a random element of that list. However, empirical evidence makes me suspect that the logic doesn't work as expected: e.g. I see a specific query id getting sampled 200 times while a query id with similar frequency never gets sampled.
Any suggestions on whether this spark code is expected to work as intended?
val random = new scala.util.Random
val queryIds = data.select($"queryId").map(row => row.getAs[Long](0)).collect()
val sampleQueryId = udf((queryId: Long) => {
val sampledId = queryIds(random.nextInt(queryIds.length))
if (sampledId != queryId) sampledId else null
})
val dataWithSampledIds = data.withColumn("sampledQueryId",sampleQueryId($"queryId"))
I received a response on a different forum and am documenting it here for posterity. The issue is that one Random instance is captured by the udf and shipped to the executors, so every task starts from a copy with the same internal state. As a result, the n-th row of every partition gets the same output.
scala> val random = new scala.util.Random
scala> val getRandom = udf((data: Long) => random.nextInt(10000))
scala> spark.range(0, 12, 1, 4).withColumn("rnd", getRandom($"id")).orderBy($"id").show
+---+----+
| id| rnd|
+---+----+
| 0|6720|
| 1|7667|
| 2|3344|
| 3|6720|
| 4|7667|
| 5|3344|
| 6|6720|
| 7|7667|
| 8|3344|
| 9|6720|
| 10|7667|
| 11|3344|
+---+----+
This df had 4 partitions. The value of rnd for the n-th row of each partition is the same (e.g. id = 1, 4, 7, 10 are the same). The solution is to use Spark's built-in rand() function, as below.
val queryIds = data.select($"queryId").map(row => row.getAs[Long](0)).collect()
// rand is a per-row value in [0, 1) from Spark's rand(), so the index is always in range
val sampleQueryId = udf((queryId: Long, rand: Double) => {
  val sampledId = queryIds(scala.math.floor(rand*queryIds.length).toInt)
  // None becomes null in the resulting column
  if (sampledId != queryId) Some(sampledId) else None
})
val dataWithSampledIds = data.withColumn("sampledQueryId",sampleQueryId($"queryId", rand()))
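The repeated values can be reproduced without a cluster. The sketch below is a plain Scala stand-in for what is described above, not Spark's actual code path: it serializes one Random once and deserializes it once per simulated partition. The names SharedRandomDemo and drawsPerPartition are made up for the illustration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.util.Random

object SharedRandomDemo {
  // Serialize the generator, as happens to objects captured in a closure.
  def serialize(r: Random): Array[Byte] = {
    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(r)
    out.close()
    buf.toByteArray
  }

  // Each call returns an independent copy with the same internal state.
  def deserialize(bytes: Array[Byte]): Random = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[Random] finally in.close()
  }

  // One sequence of draws per simulated partition.
  def drawsPerPartition(partitions: Int, rowsPerPartition: Int): Seq[Seq[Int]] = {
    val bytes = serialize(new Random)
    Seq.fill(partitions) {
      val copy = deserialize(bytes)
      Seq.fill(rowsPerPartition)(copy.nextInt(10000))
    }
  }
}
```

SharedRandomDemo.drawsPerPartition(4, 3) returns four identical sequences of three numbers, the same pattern as in the table above.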

Create SOAP XML REQUEST from selected dataframe columns in Scala

Is there a way to create an XML SOAP request by extracting a few columns from each row of a dataframe? 10 records in a dataframe means 10 separate SOAP XML requests.
How would you make the function call using map now?
You can do that by applying a map function to the dataframe.
// df is your dataframe; convertToSOAP is your own function from Row to String
df.map(x => convertToSOAP(x))
Putting up an example based on your comment, hope you find this useful.
case class emp(id:String,name:String,city:String)
val list = List(emp("1","user1","NY"),emp("2","user2","SFO"))
val rdd = sc.parallelize(list)
val df = rdd.toDF
df.map(x => "<root><name>" + x.getString(1) + "</name><city>"+ x.getString(2) +"</city></root>").show(false)
// Note: x is a type of org.apache.spark.sql.Row
Output will be as follows :
+--------------------------------------------------+
|value |
+--------------------------------------------------+
|<root><name>user1</name><city>NY</city></root> |
|<root><name>user2</name><city>SFO</city></root> |
+--------------------------------------------------+
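One thing the string concatenation above does not handle is a value containing characters such as & or <, which would produce invalid XML. Below is a rough sketch of what a convertToSOAP-style helper could look like with escaping added. The object name SoapRequest, the function names and the employee element are invented for the example; only the SOAP 1.1 envelope namespace is standard.

```scala
object SoapRequest {
  // Escape the five XML special characters so values cannot break the markup.
  def escape(s: String): String = {
    val sb = new StringBuilder
    s.foreach {
      case '&'  => sb.append("&amp;")
      case '<'  => sb.append("&lt;")
      case '>'  => sb.append("&gt;")
      case '"'  => sb.append("&quot;")
      case '\'' => sb.append("&apos;")
      case c    => sb.append(c)
    }
    sb.toString
  }

  // Build one SOAP 1.1 request; the body layout is a placeholder.
  def build(name: String, city: String): String =
    """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">""" +
      "<soapenv:Body><employee>" +
      "<name>" + escape(name) + "</name>" +
      "<city>" + escape(city) + "</city>" +
      "</employee></soapenv:Body></soapenv:Envelope>"
}
```

Inside the map it would be called as df.map(x => SoapRequest.build(x.getString(1), x.getString(2))), giving one request string per row.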

Converting a Dataframe to a scala Mutable map doesn't produce equal number of records

I am new to Scala/spark. I am working on Scala/Spark application that selects a couple of columns from a hive table and then converts it into a Mutable map with the first column being the keys and second column being the values. For example:
+--------+--+
| c1 |c2|
+--------+--+
|Newyork |1 |
| LA |0 |
|Chicago |1 |
+--------+--+
will be converted to scala.collection.mutable.Map(Newyork -> 1, LA -> 0, Chicago -> 1)
Here is my code for the above conversion:
val testDF = hiveContext.sql("select distinct(trim(c1)),trim(c2) from default.table where trim(c1)!=''")
val testMap = scala.collection.mutable.Map(testDF.map(r => (r(0).toString,r(1).toString)).collectAsMap().toSeq: _*)
I have no problem with the conversion. However, when I print the counts of rows in the Dataframe and the size of the Map, I see that they don't match:
println("Map - "+testMap.size+" DataFrame - "+testDF.count)
//Map - 2359806 DataFrame - 2368295
My idea is to convert the Dataframes to collections and perform some comparisons. I am also picking up data from other tables, but they are just single columns, and I have no problem converting them to ArrayBuffer[String]: the counts match.
I don't understand why I am having a problem with testMap. Generally, the row count of the DF and the size of the Map should match, right?
Is it because there are too many records? How do I get the same number of records in the DF into the Map?
Any help would be appreciated. Thank you.
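I believe the mismatch in counts is caused by the elimination of duplicate keys (i.e. city names) in the Map. The DISTINCT in your query applies to the whole (c1, c2) row, so the same c1 can still appear with different c2 values. By design, a Map holds each key only once, so a later entry with the same key replaces the earlier one. For example: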
val testDF = Seq(
("Newyork", 1),
("LA", 0),
("Chicago", 1),
("Newyork", 99)
).toDF("city", "value")
val testMap = scala.collection.mutable.Map(
testDF.rdd.map( r => (r(0).toString, r(1).toString)).
collectAsMap().toSeq: _*
)
// testMap: scala.collection.mutable.Map[String,String] =
// Map(Newyork -> 99, LA -> 0, Chicago -> 1)
You might want to either use a different collection type or include an identifying field in your Map key to make it unique. Depending on your data processing needs, you can also aggregate the data into a Map-like dataframe via groupBy, like below:
testDF.groupBy("city").agg(count("value").as("valueCount"))
In this example, the sum of valueCount should match the original row count.
If you add entries with a duplicate key to your map, the duplicates are automatically removed. So what you should compare is:
println("Map - "+testMap.size+" DataFrame - "+testDF.select($"c1").distinct.count)
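The effect is easy to reproduce with plain collections, without Spark. A small sketch using the sample rows from above (the object name MapDuplicates and its fields are made up for the example) that also lists which keys occur more than once:

```scala
object MapDuplicates {
  // Same sample data as above, as plain key/value pairs
  val rows: Seq[(String, String)] =
    Seq("Newyork" -> "1", "LA" -> "0", "Chicago" -> "1", "Newyork" -> "99")

  // Building a Map keeps one entry per key; the last value for a key wins
  val asMap: scala.collection.mutable.Map[String, String] =
    scala.collection.mutable.Map(rows: _*)

  // Keys that occur more than once, with all of their values
  val duplicated: Map[String, Seq[String]] =
    rows.groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2) }
      .filter { case (_, values) => values.size > 1 }
}
```

Here rows has 4 elements while asMap has 3, and duplicated contains only Newyork with the values 1 and 99.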

Reshape spark data frame of key-value pairs with keys as new columns

I am new to Spark and Scala. Let's say I have a data frame of lists that are key-value pairs. Is there a way to map the ids in the column ids to new columns?
df.show()
+-------------+----------------------+
|ids          |vals                  |
+-------------+----------------------+
|[id1,id2,id3]|null                  |
|[id2,id5,id6]|[WrappedArray(0,2,4)] |
|[id2,id4,id7]|[WrappedArray(6,8,10)]|
+-------------+----------------------+
Expected output:
+----+----+
|id1 | id2| ...
+----+----+
|null| 0 | ...
|null| 6 | ...
A possible way would be to compute the columns of the new DataFrame and use those columns to construct the rows.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import sqlContext.implicits._ // for toDF and the $"..." column syntax
val data = List(
  (Seq("id1","id2","id3"), None),
  (Seq("id2","id4","id5"), Some(Seq(2,4,5))),
  (Seq("id3","id5","id6"), Some(Seq(3,5,6))))
val df = sparkContext.parallelize(data).toDF("ids","values")
// pair each id with its value; rows without values are dropped
val values = df.rdd.flatMap{
  case Row(t1:Seq[String], t2:Seq[Int]) => Some((t1 zip t2).toMap)
  case Row(_, null) => None
}
// get the unique names of the columns across the original data
val ids = df.select(explode($"ids")).distinct.collect.map(_.getString(0))
// map the values to the new columns (the value if present, otherwise null)
val transposed = values.map(entry => Row.fromSeq(ids.map(id => entry.getOrElse(id, null))))
// programmatically recreate the target schema with the columns we found in the data
val schema = StructType(ids.map(id => StructField(id, IntegerType, nullable=true)))
// Create the new DataFrame
val transposedDf = sqlContext.createDataFrame(transposed, schema)
This process will pass through the data twice, although depending on the backing data source, calculating the column names can be rather cheap.
Also, this goes back and forth between DataFrames and RDDs. I would be interested in seeing a "pure" DataFrame process.
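For what it's worth, here is one possible DataFrame-only variant, offered as a sketch and not as a tested drop-in. It assumes Spark 2.4 or later (for map_from_arrays) and a SparkSession; the object name PivotKeyValues and the function widen are made up for the example. It turns each row into a map, explodes the map into key/value rows and pivots the keys back into columns.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, first, map_from_arrays, monotonically_increasing_id}

object PivotKeyValues {
  // Expects an array column "ids" and an array column "values" of equal length.
  // Rows whose "values" is null yield no output row, like the RDD version above.
  def widen(df: DataFrame): DataFrame =
    df.withColumn("rowId", monotonically_increasing_id()) // tag each input row
      .select(col("rowId"), explode(map_from_arrays(col("ids"), col("values")))) // rowId, key, value
      .groupBy("rowId")
      .pivot("key")          // one column per distinct id
      .agg(first("value"))
      .drop("rowId")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pivot-kv").getOrCreate()
    import spark.implicits._
    val df = Seq(
      (Seq("id1", "id2", "id3"), Option.empty[Seq[Int]]),
      (Seq("id2", "id4", "id5"), Option(Seq(2, 4, 5))),
      (Seq("id3", "id5", "id6"), Option(Seq(3, 5, 6)))
    ).toDF("ids", "values")
    widen(df).show()
    spark.stop()
  }
}
```

Two differences from the RDD version: ids that only occur in rows without values (id1 in this sample) do not get a column, and pivot without an explicit list of values still has to scan the data once to find the distinct keys, so it does not remove the second pass.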