Run a custom transformation on string columns - scala

Suppose I have the following dataframe:
var df = Seq(
("2019-09-01", 0.1, 1, "0x0000000000000001", "0x00000001", "True"),
("2019-09-02", 0.2, 2, "0x0000000000000002", "0x00000002", "False"),
("2019-09-03", 0.3, 3, "0x0000000000000003", "0x00000003", "True")
).toDF("Timestamp", "Float", "Integer", "Hex1", "Hex2", "Bool")
I need to run a transformation on the string columns (in this example: Hex1, Hex2 and Bool) and convert them to a numeric value using some custom logic.
The dataframes are generated by reading CSV files whose schema I don't know in advance. All I know is that they contain a Timestamp column as the first column, followed by a variable number of columns which might be numeric (integers or doubles/floats) or these hex and boolean values.
I'm thinking this transformation would need to find all the string columns and for each one, run the transformation that will add a new column to the dataframe with the numerical representation of the string.
In this case, the hex values would be converted to their decimal representation. And the "True", "False" strings would be converted to 1 and 0 respectively.
Back to the simplified example, I should get a df like this:
|Timestamp |Float|Integer|Hex1|Hex2|Bool|
|----------|-----|-------|----|----|----|
|2019-09-01|0.1  |1      |1   |1   |1   |
|2019-09-02|0.2  |2      |2   |2   |0   |
|2019-09-03|0.3  |3      |3   |3   |1   |
All columns would be numeric (integer, float or double), except for the Timestamp.

As per your example, use the following functions.
Use the conv standard function to convert the hex strings to the appropriate numeric type.
conv(num: Column, fromBase: Int, toBase: Int): Column Convert a number in a string column from one base to another.
when(Column condition, Object value):
Evaluates a list of conditions and returns one of multiple possible result expressions.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

val s1 = df.
  withColumn("Hex1", conv(col("Hex1").substr(lit(3), length(col("Hex1"))), 16, 10) cast IntegerType).
  withColumn("Hex2", conv(col("Hex2").substr(lit(3), length(col("Hex2"))), 16, 10) cast IntegerType).
  withColumn("Bool", when(col("Bool") === "True", 1).otherwise(0))

s1.show()
s1.printSchema()
If you want to do the same task dynamically, as per your problem definition, you have to do some extra work.
Create a mapping of each column to its conversion type. This can be abstracted out: you can keep the mapping in an external file and generate the list dynamically by reading it (a sketch of deriving it straight from the DataFrame schema follows the list below).
val list = List(
("Hex", "Hex1"),
("Hex", "Hex2"),
("Bool", "Bool")
)
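As a hedged sketch of my own (not part of the original answer): the same list could be derived from the DataFrame schema at run time, assuming the only string payloads besides Timestamp are 0x-prefixed hex strings and True/False strings, and classifying each string column by peeking at one non-null value.
import org.apache.spark.sql.types.StringType

// Every string column except Timestamp is a candidate for conversion.
val stringCols = df.schema.fields
  .filter(f => f.dataType == StringType && f.name != "Timestamp")
  .map(_.name)

// Classify each candidate by sampling one non-null value (assumption: hex values
// always start with "0x", everything else is a True/False column).
val list: List[(String, String)] = stringCols.toList.map { c =>
  val sample = df.select(c).na.drop().head().getString(0)
  if (sample.startsWith("0x")) ("Hex", c) else ("Bool", c)
}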
Create a converter using pattern matching:
object Helper {
  def convert(columnDetail: (String, String)): Column = {
    columnDetail._1 match {
      case "Hex"  => conv(col(columnDetail._2).substr(lit(3), length(col(columnDetail._2))), 16, 10) cast IntegerType
      case "Bool" => when(col(columnDetail._2) === "True", 1).otherwise(0)
      // your other case
    }
  }
}
You can add all the other cases and their appropriate implementations.
Final solution:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

var df = Seq(
  ("2019-09-01", 0.1, 1, "0x0000000000000001", "0x00000001", "True"),
  ("2019-09-02", 0.2, 2, "0x0000000000000002", "0x00000002", "False"),
  ("2019-09-03", 0.3, 3, "0x0000000000000003", "0x00000003", "True")
).toDF("Timestamp", "Float", "Integer", "Hex1", "Hex2", "Bool")

val list = List(
  ("Hex", "Hex1"),
  ("Hex", "Hex2"),
  ("Bool", "Bool")
)

object Helper {
  def convert(columnDetail: (String, String)): Column = {
    columnDetail._1 match {
      case "Hex"  => conv(col(columnDetail._2).substr(lit(3), length(col(columnDetail._2))), 16, 10) cast IntegerType
      case "Bool" => when(col(columnDetail._2) === "True", 1).otherwise(0)
      // your other case
    }
  }
}

val temp = list.foldLeft(df) { (tempDF, listValue) =>
  tempDF.withColumn(listValue._2, Helper.convert(listValue))
}
temp.show(false)
temp.printSchema()
Result:
+----------+-----+-------+----+----+----+
|Timestamp |Float|Integer|Hex1|Hex2|Bool|
+----------+-----+-------+----+----+----+
|2019-09-01|0.1 |1 |1 |1 |1 |
|2019-09-02|0.2 |2 |2 |2 |0 |
|2019-09-03|0.3 |3 |3 |3 |1 |
+----------+-----+-------+----+----+----+
root
|-- Timestamp: string (nullable = true)
|-- Float: double (nullable = false)
|-- Integer: integer (nullable = false)
|-- Hex1: integer (nullable = true)
|-- Hex2: integer (nullable = true)
|-- Bool: integer (nullable = false)
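One caveat worth hedging on my part: conv returns a string, and the 16-hex-digit Hex1 values can exceed the Int range, so casting that column to LongType may be safer. A minimal variant of the same expression:
import org.apache.spark.sql.types.LongType

// Same conversion as above, but cast to LongType so large 16-hex-digit values don't overflow Int.
val s2 = df.withColumn("Hex1",
  conv(col("Hex1").substr(lit(3), length(col("Hex1"))), 16, 10) cast LongType)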

Below is my Spark code to do this. I have used the conv function of Spark SQL: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.conv
Also, if you want to write logic that dynamically identifies all string columns at run time and performs the conversion, that is only possible if you know exactly what kind of conversion you are going to do.
var df = Seq(
("2019-09-01", 0.1, 1, "0x0000000000000001", "0x00000001", "True"),
("2019-09-02", 0.2, 2, "0x0000000000000002", "0x00000002", "False"),
("2019-09-03", 0.3, 3, "0x0000000000000003", "0x00000003", "True")
).toDF("Timestamp", "Float", "Integer", "Hex1", "Hex2", "Bool")
// df.show
df.createOrReplaceTempView("sourceTable")
val finalDF = spark.sql("""
  select Timestamp,
         Float,
         Integer,
         conv(substr(Hex1, 3), 16, 10) as Hex1,
         conv(substr(Hex2, 3), 16, 10) as Hex2,
         case when Bool = "True" then 1
              when Bool = "False" then 0
              else NULL
         end as Bool
  from sourceTable
""")
finalDF.show
Result :
+----------+-----+-------+----+----+----+
| Timestamp|Float|Integer|Hex1|Hex2|Bool|
+----------+-----+-------+----+----+----+
|2019-09-01| 0.1| 1| 1| 1| 1|
|2019-09-02| 0.2| 2| 2| 2| 0|
|2019-09-03| 0.3| 3| 3| 3| 1|
+----------+-----+-------+----+----+----+
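On the dynamic identification mentioned above, here is a hedged sketch of my own (assuming, as in the question, that the only string payloads besides Timestamp are 0x-prefixed hex and True/False strings): build the select expressions from the schema instead of hard-coding them.
import org.apache.spark.sql.types.StringType

// One select expression per column: pass non-string columns and Timestamp through,
// convert hex strings via conv/substr and booleans via a case expression.
val selectExprs = df.schema.fields.map { f =>
  if (f.dataType != StringType || f.name == "Timestamp") f.name
  else {
    val sample = df.select(f.name).na.drop().head().getString(0)
    if (sample.startsWith("0x"))
      s"cast(conv(substr(`${f.name}`, 3), 16, 10) as long) as `${f.name}`"
    else
      s"case when `${f.name}` = 'True' then 1 else 0 end as `${f.name}`"
  }
}
df.selectExpr(selectExprs: _*).show()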

Related

Add new column of Map Datatype to Spark Dataframe in scala

I'm able to create a new Dataframe with one column having Map datatype.
val inputDF2 = Seq(
(1, "Visa", 1, Map[String, Int]()),
(2, "MC", 2, Map[String, Int]())).toDF("id", "card_type", "number_of_cards", "card_type_details")
scala> inputDF2.show(false)
+---+---------+---------------+-----------------+
|id |card_type|number_of_cards|card_type_details|
+---+---------+---------------+-----------------+
|1 |Visa |1 |[] |
|2 |MC |2 |[] |
+---+---------+---------------+-----------------+
Now I want to create a new column of the same type as card_type_details. I'm trying to use the spark withColumn method to add this new column.
inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").show(false)
+---------+---------+---------------+---------------------+-----+
|person_id|card_type|number_of_cards|card_type_details |tmp |
+---------+---------+---------------+---------------------+-----+
|1 |Visa |1 |[] |null |
|2 |MC |2 |[] |null |
+---------+---------+---------------+---------------------+-----+
When I checked the schema of both columns, they have the same type, but the values behave differently.
scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").printSchema
root
|-- id: integer (nullable = false)
|-- card_type: string (nullable = true)
|-- number_of_cards: integer (nullable = false)
|-- card_type_details: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
|-- tmp: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
I'm not sure if I'm adding the new column correctly. The issue comes when I apply the .isEmpty method on the tmp column: I get a null pointer exception.
scala> def checkValue = udf((card_type_details: Map[String, Int]) => {
| var output_map = Map[String, Int]()
| if (card_type_details.isEmpty) { output_map += 0.toString -> 1 }
| else {output_map = card_type_details }
| output_map
| })
checkValue: org.apache.spark.sql.expressions.UserDefinedFunction
scala> inputDF2.withColumn("value", checkValue(col("card_type_details"))).show(false)
+---+---------+---------------+-----------------+--------+
|id |card_type|number_of_cards|card_type_details|value |
+---+---------+---------------+-----------------+--------+
|1 |Visa |1 |[] |[0 -> 1]|
|2 |MC |2 |[] |[0 -> 1]|
+---+---------+---------------+-----------------+--------+
scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>")
.withColumn("value", checkValue(col("tmp"))).show(false)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$checkValue$1: (map<string,int>) => map<string,int>)
Caused by: java.lang.NullPointerException
at $anonfun$checkValue$1.apply(<console>:28)
at $anonfun$checkValue$1.apply(<console>:26)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:108)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:107)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1063)
How can I add a new column that has the same values as the card_type_details column?
To add the tmp column with the same value as card_type_details, you just do:
inputDF2.withColumn("tmp", col("cart_type_details"))
If you aim to add a column with an empty map and avoid the NullPointerException, the solution is:
inputDF2.withColumn("tmp", typedLit(Map.empty[Int, String]))

Scala Spark: Flatten Array of Key/Value structs

I have an input dataframe which contains an array-typed column. Each entry in the array is a struct consisting of a key (one of about four values) and a value. I want to turn this into a dataframe with one column for each possible key, and nulls where that value is not in the array for that row. Keys are never duplicated in any of the arrays, but they may be out of order or missing.
So far the best I've got is
val wantedCols = df.columns
.filter(_ != arrayCol)
.filter(_ != "col")
val flattened = df
.select((wantedCols.map(col(_)) ++ Seq(explode(col(arrayCol)))):_*)
.groupBy(wantedCols.map(col(_)):_*)
.pivot("col.key")
.agg(first("col.value"))
This does exactly what I want, but it's hideous and I have no idea what the ramifications of grouping on every-column-but-one would be. What's the RIGHT way to do this?
EDIT: Example input/output:
case class testStruct(name : String, number : String)
val dfExampleInput = Seq(
  (0, "KY", Seq(testStruct("A", "45"))),
  (1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))))
  .toDF("index", "state", "entries")
dfExampleInput.show
+-----+-----+------------------+
|index|state| entries|
+-----+-----+------------------+
| 0| KY| [[A, 45]]|
| 1| OR|[[A, 30], [B, 10]]|
+-----+-----+------------------+
val dfExampleOutput = Seq(
(0, "KY", "45", null),
(1, "OR", "30", "10"))
.toDF("index", "state", "A", "B")
.show
+-----+-----+---+----+
|index|state| A| B|
+-----+-----+---+----+
| 0| KY| 45|null|
| 1| OR| 30| 10|
+-----+-----+---+----+
FURTHER EDIT:
I submitted a solution myself (see below) that handles this well so long as you know the keys in advance (in my case I do.) If finding the keys is an issue, another answer holds code to handle that.
Without groupBy, pivot, agg, or first
Please check the code below.
scala> val df = Seq((0, "KY", Seq(("A", "45"))),(1, "OR", Seq(("A", "30"),("B", "10")))).toDF("index", "state", "entries").withColumn("entries",$"entries".cast("array<struct<name:string,number:string>>"))
df: org.apache.spark.sql.DataFrame = [index: int, state: string ... 1 more field]
scala> df.printSchema
root
|-- index: integer (nullable = false)
|-- state: string (nullable = true)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- number: string (nullable = true)
scala> df.show(false)
+-----+-----+------------------+
|index|state|entries |
+-----+-----+------------------+
|0 |KY |[[A, 45]] |
|1 |OR |[[A, 30], [B, 10]]|
+-----+-----+------------------+
scala> val finalDFColumns = df.select(explode($"entries").as("entries")).select("entries.*").select("name").distinct.map(_.getAs[String](0)).orderBy($"value".asc).collect.foldLeft(df.limit(0))((cdf,c) => cdf.withColumn(c,lit(null))).columns
finalDFColumns: Array[String] = Array(index, state, entries, A, B)
scala> val max = df.select(org.apache.spark.sql.functions.max(size($"entries"))).first.getInt(0)
max: Int = 2

scala> val finalDF = df.select($"*" +: (0 until max).map(i => $"entries".getItem(i)("number").as(i.toString)): _*)
finalDF: org.apache.spark.sql.DataFrame = [index: int, state: string ... 3 more fields]
scala> finalDF.show(false)
+-----+-----+------------------+---+----+
|index|state|entries |0 |1 |
+-----+-----+------------------+---+----+
|0 |KY |[[A, 45]] |45 |null|
|1 |OR |[[A, 30], [B, 10]]|30 |10 |
+-----+-----+------------------+---+----+
scala> finalDF.printSchema
root
|-- index: integer (nullable = false)
|-- state: string (nullable = true)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- number: string (nullable = true)
|-- 0: string (nullable = true)
|-- 1: string (nullable = true)
scala> finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf,column) => fdf.withColumnRenamed(column._1,column._2)).show(false)
+-----+-----+------------------+---+----+
|index|state|entries |A |B |
+-----+-----+------------------+---+----+
|0 |KY |[[A, 45]] |45 |null|
|1 |OR |[[A, 30], [B, 10]]|30 |10 |
+-----+-----+------------------+---+----+
scala>
Final Output
scala> finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf,column) => fdf.withColumnRenamed(column._1,column._2)).drop($"entries").show(false)
+-----+-----+---+----+
|index|state|A |B |
+-----+-----+---+----+
|0 |KY |45 |null|
|1 |OR |30 |10 |
+-----+-----+---+----+
I wouldn't worry too much about grouping by several columns, other than potentially making things confusing. In that vein, if there is a simpler, more maintainable way, go for it. Without example input/output, I'm not sure if this gets you where you're trying to go, but maybe it'll be of use:
Seq(Seq("k1" -> "v1", "k2" -> "v2")).toDS() // some basic input based on my understanding of your description
.select(explode($"value")) // flatten the array
.select("col.*") // de-nest the struct
.groupBy("_2") // one row per distinct value
.pivot("_1") // one column per distinct key
.count // or agg(first) if you want the value in each column
.show
+---+----+----+
| _2| k1| k2|
+---+----+----+
| v2|null| 1|
| v1| 1|null|
+---+----+----+
Based on what you've now said, I get the impression that there are many columns like "state" that aren't required for the aggregation, but need to be in the final result.
For reference, if you didn't need to pivot, you could add a struct column with all such fields nested within, then add it to your aggregation, eg: .agg(first($"myStruct"), first($"number")). The main advantage is only having the actual key column(s) referenced in the groupBy. But when using pivot things get a little weird, so we'll set that option aside.
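For what it's worth, a hedged sketch of that struct idea applied to the example input (pivot left aside, as noted); the meta/number names are mine:
import org.apache.spark.sql.functions.{col, explode, first, struct}

// Nest the non-key columns in a struct, group only by the key, and carry the struct with first().
dfExampleInput
  .select(col("index"), struct(col("state")).as("meta"), explode(col("entries")).as("entry"))
  .groupBy("index")
  .agg(first(col("meta")).as("meta"), first(col("entry.number")).as("number"))
  .select(col("index"), col("meta.state").as("state"), col("number"))
  .show()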
In this use case, the simplest way I could come up with involves splitting your dataframe and joining it back together after the aggregation using some rowkey. In this example I am assuming that "index" is suitable for that purpose:
val mehCols = dfExampleInput.columns.filter(_ != "entries").map(col)
val mehDF = dfExampleInput.select(mehCols:_*)
val aggDF = dfExampleInput
.select($"index", explode($"entries").as("entry"))
.select($"index", $"entry.*")
.groupBy("index")
.pivot("name")
.agg(first($"number"))
scala> mehDF.join(aggDF, Seq("index")).show
+-----+-----+---+----+
|index|state| A| B|
+-----+-----+---+----+
| 0| KY| 45|null|
| 1| OR| 30| 10|
+-----+-----+---+----+
I doubt you would see much of a difference in performance, if any. Maybe at the extremes, eg: very many meh columns, or very many pivot columns, or something like that, or maybe nothing at all. Personally, I would test both with decently-sized input, and if there wasn't a significant difference, use whichever one seemed easier to maintain.
Here is another way, based on the assumption that there are no duplicate keys in the entries column, i.e. Seq(testStruct("A", "30"), testStruct("A", "70"), testStruct("B", "10")) will cause an error. The next solution combines the RDD and DataFrame APIs for the implementation:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types.StructType
case class testStruct(name : String, number : String)
val df = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))),
(2, "FL", Seq(testStruct("A", "30"), testStruct("B", "10"), testStruct("C", "20"))),
(3, "TX", Seq(testStruct("B", "60"), testStruct("A", "19"), testStruct("C", "40")))
)
.toDF("index", "state", "entries")
.cache
// get all possible keys from entries i.e Seq[A, B, C]
val finalCols = df.select(explode($"entries").as("entry"))
.select($"entry".getField("name").as("entry_name"))
.distinct
.collect
.map{_.getAs[String]("entry_name")}
.sorted // Attention: we need to retain the order of the columns
// 1. when generating row values and
// 2. when creating the schema
val rdd = df.rdd.map { r =>
  // transform the entries array into a map i.e Map(A -> 30, B -> 10)
  val entriesMap = r.getSeq[Row](2).map{ r => (r.getString(0), r.getString(1)) }.toMap

  // transform finalCols into a map with null values i.e Map(A -> null, B -> null, C -> null)
  val finalColsMap = finalCols.map{ c => (c, null) }.toMap

  // replace null values with those present in the current row by merging the two previous maps
  // Attention: this should retain the order of finalColsMap (the default immutable Map only
  // guarantees insertion order for small maps; consider scala.collection.immutable.ListMap for many keys)
  val merged = finalColsMap ++ entriesMap

  // concatenate the two first row values ["index", "state"] with the values from merged
  val finalValues = Seq(r(0), r(1)) ++ merged.values
  Row.fromSeq(finalValues)
}
val extraCols = finalCols.map{c => s"`${c}` STRING"}
val schema = StructType.fromDDL("`index` INT, `state` STRING," + extraCols.mkString(","))
val finalDf = spark.createDataFrame(rdd, schema)
finalDf.show
// +-----+-----+---+----+----+
// |index|state| A| B| C|
// +-----+-----+---+----+----+
// | 0| KY| 45|null|null|
// | 1| OR| 30| 10|null|
// | 2| FL| 30| 10| 20|
// | 3| TX| 19| 60| 40|
// +-----+-----+---+----+----+
Note: the solution requires one extra action to retrieve the unique keys, although it doesn't cause any shuffling since it is based on narrow transformations only.
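As an aside of my own (assuming Spark 2.4+, where map_from_entries is available): the ordering concern above can be avoided entirely by turning the array into a map column and projecting each key, which also stays in the DataFrame API.
import org.apache.spark.sql.functions.{col, map_from_entries}

// entries is array<struct<name,number>>, so map_from_entries yields map<name, number>;
// keys missing from a row simply project to null.
val withMap = df.withColumn("m", map_from_entries(col("entries")))
withMap.select(
  Seq(col("index"), col("state")) ++ finalCols.map(k => col("m").getItem(k).as(k)): _*
).show()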
I've worked out a solution myself:
def extractFromArray(colName : String, key : String, numKeys : Int, keyName : String) = {
val indexCols = (0 to numKeys-1).map(col(colName).getItem(_))
indexCols.foldLeft(lit(null))((innerCol : Column, indexCol : Column) =>
when(indexCol.isNotNull && (indexCol.getItem(keyName) === key), indexCol)
.otherwise(innerCol))
}
Example:
case class testStruct(name : String, number : String)
val df = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))),
(2, "FL", Seq(testStruct("A", "30"), testStruct("B", "10"), testStruct("C", "20"))),
(3, "TX", Seq(testStruct("B", "60"), testStruct("A", "19"), testStruct("C", "40")))
)
.toDF("index", "state", "entries")
.withColumn("A", extractFromArray("entries", "B", 3, "name"))
.show
which produces:
+-----+-----+--------------------+-------+
|index|state| entries| A|
+-----+-----+--------------------+-------+
| 0| KY| [[A, 45]]| null|
| 1| OR| [[A, 30], [B, 10]]|[B, 10]|
| 2| FL|[[A, 30], [B, 10]...|[B, 10]|
| 3| TX|[[B, 60], [A, 19]...|[B, 60]|
+-----+-----+--------------------+-------+
This solution is a little different from other answers:
It works on only a single key at a time
It requires the key name and number of keys be known in advance
It produces a column of structs, rather than doing the extra step of extracting specific values
It works as a simple column-to-column operation, rather than requiring transformations on the entire DF
It can be evaluated lazily
The first three issues can be handled by calling code, which leaves it somewhat more flexible for cases where you already know the keys or where the structs contain additional values to extract (a sketch of such calling code follows below).
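To make that concrete, here is a hedged sketch of such calling code (base and knownKeys are names I made up; it assumes the keys and a maximum array length of 3 are known, as in the example above):
val base = Seq(
  (0, "KY", Seq(testStruct("A", "45"))),
  (1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))),
  (2, "FL", Seq(testStruct("A", "30"), testStruct("B", "10"), testStruct("C", "20"))),
  (3, "TX", Seq(testStruct("B", "60"), testStruct("A", "19"), testStruct("C", "40")))
).toDF("index", "state", "entries")

// One column per known key, keeping only the "number" field of the matched struct.
val knownKeys = Seq("A", "B", "C")
val wide = knownKeys.foldLeft(base) { (acc, k) =>
  acc.withColumn(k, extractFromArray("entries", k, 3, "name").getField("number"))
}
wide.drop("entries").show()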

create new columns from string column

I have a DataFrame with a string column
val df= Seq(("0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:3")).toDF("col_str")
+--------------------+
| col_str|
+--------------------+
|0003C32C-FC1D-482...|
+--------------------+
The string column is comprised of character sequences separated by whitespace. If a character sequence starts with 0, I want to return the second number and the last number of the sequence. The second number can be any number between 0 and 8.
Array("8,3", "6,1", "7,1", "6,1", "7,1", "8,3")
I then want to transform the array of pairs into 9 columns, with the first number of the pair as the column and the second number as the value. If a number is missing, it will get a value of 0.
For example
val df= Seq(("0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:1")).).toDF("col_str", "col0", "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")
+--------------------+----+----+----+----+----+----+----+----+----+
| col_str|col0|col1|col2|col3|col4|col5|col6|col7|col8|
+--------------------+----+----+----+----+----+----+----+----+----+
|0003C32C-FC1D-482...| 0| 0| 0| 0| 0| 0| 1| 1| 3|
+--------------------+----+----+----+----+----+----+----+----+----+
I don't care if the solution is in either scala or python.
You can do the following (commented for clarity)
//string defining
val str = """0003C32C-FC1D-482F-B543-3CBD7F0A0E36 0,8,1,799,300:3 0,6,1,330,300:1 2,6,1,15861:1 0,7,1,734,300:1 0,6,0,95,300:1 2,7,1,15861:1 0,8,0,134,300:3"""
//string splitting with space
val splittedStr = str.split(" ")
//parsing the splitted string to get the desired format with the second element as key and the last element as value of the elements starting with 0
val parsedStr = (List("col_str" -> splittedStr.head) ++
  splittedStr.tail.filter(_.startsWith("0")).map { value =>
    val splittedValue = value.split("[,:]")
    ("col" + splittedValue(1)) -> splittedValue.last
  }).toMap
//expected header names
val expectedHeader = Seq("col_str", "col0", "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8")
//populating 0 for the missing header names in the parsed string in above step
val missedHeaderWithValue = expectedHeader.diff(parsedStr.keys.toSeq).map((_->"0")).toMap
//combining both the maps
val expectedKeyValues = parsedStr ++ missedHeaderWithValue
//converting to a dataframe
Seq(expectedDF(
  expectedKeyValues(expectedHeader(0)), expectedKeyValues(expectedHeader(1)),
  expectedKeyValues(expectedHeader(2)), expectedKeyValues(expectedHeader(3)),
  expectedKeyValues(expectedHeader(4)), expectedKeyValues(expectedHeader(5)),
  expectedKeyValues(expectedHeader(6)), expectedKeyValues(expectedHeader(7)),
  expectedKeyValues(expectedHeader(8)), expectedKeyValues(expectedHeader(9))
))
  .toDF()
  .show(false)
which should give you
+------------------------------------+----+----+----+----+----+----+----+----+----+
|col_str |col0|col1|col2|col3|col4|col5|col6|col7|col8|
+------------------------------------+----+----+----+----+----+----+----+----+----+
|0003C32C-FC1D-482F-B543-3CBD7F0A0E36|0 |0 |0 |0 |0 |0 |1 |1 |3 |
+------------------------------------+----+----+----+----+----+----+----+----+----+
And of course you would need the expectedDF case class defined outside the scope of the above code:
case class expectedDF(col_str: String, col0: String, col1: String, col2: String, col3: String, col4: String, col5: String, col6: String, col7: String, col8: String)
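A hedged follow-up sketch of my own: the same parsing wrapped in a UDF (parseCounts is a name I made up), so that it applies to every row of the df from the question rather than to one hard-coded string.
import org.apache.spark.sql.functions.{col, udf}

// Parse one col_str value into the nine counts, defaulting missing keys to "0".
val parseCounts = udf { s: String =>
  val parts = s.split(" ")
  val pairs = parts.tail.filter(_.startsWith("0")).map { p =>
    val v = p.split("[,:]")
    (v(1).toInt, v.last)
  }.toMap
  val counts: Seq[String] = (0 to 8).map(i => pairs.getOrElse(i, "0"))
  counts
}

// Turn the returned array into col0 .. col8.
val parsed = (0 to 8).foldLeft(df.withColumn("counts", parseCounts(col("col_str")))) {
  (acc, i) => acc.withColumn(s"col$i", col("counts").getItem(i))
}.drop("counts")
parsed.show(false)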

How to refer to data in Spark SQL using column names?

If I have a file named shades.txt with the following data:
1 | 1 | dark red
2 | 1 | light red
3 | 2 | dark green
4 | 3 | light blue
5 | 3 | sky blue
I can store it in SparkSQL like below:
var shades_part = sc.textFile("shades.txt")
var shades_data = shades_part.map(line => line.split("\\|").map(elem => elem.trim))
shades_data.toDF().createOrReplaceTempView("shades_data")
But when I try to run a query on it
sqlContext.sql("select * from shades_data A where A.shade_name = "dark green")
I get an error:
cannot resolve '`A.shade_name`' given input columns: [A.value];
You can use a case class for your code as
case class Shades(id: Int, colorCode:Int, shadeColor: String)
Then modify your code as
var shades_part = sc.textFile("shades.txt")
var shades_data = shades_part.map(line => line.split("\\|")).map(elem => Shades(elem(0).trim.toInt, elem(1).trim.toInt, elem(2).trim))
val df = shades_data.toDF()
You should have a dataframe as
+---+---------+----------+
|id |colorCode|shadeColor|
+---+---------+----------+
|1 |1 |dark red |
|2 |1 |light red |
|3 |2 |dark green|
|4 |3 |light blue|
|5 |3 |sky blue |
+---+---------+----------+
Now you can use the filter function as
df.filter($"shadeColor" === "dark green").show(false)
which should give you
+---+---------+----------+
|id |colorCode|shadeColor|
+---+---------+----------+
|3 |2 |dark green|
+---+---------+----------+
Using Schema
You can create a schema as
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("colorCode", IntegerType, true),
  StructField("shadeColor", StringType, true)))
and use the schema in sqlContext as
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.schema(schema)
.option("ignoreTrailingWhiteSpace", true)
.option("ignoreLeadingWhiteSpace", true)
.load("shades.txt")
Or you can use createDataFrame function as
import org.apache.spark.sql.Row

var shades_part = sc.textFile("shades.txt")
var shades_data = shades_part.map(line => line.split("\\|").map(_.trim)).map(elem => Row.fromSeq(Seq(elem(0).toInt, elem(1).toInt, elem(2))))
sqlContext.createDataFrame(shades_data, schema).show(false)
Check the schema of the dataframe and you will know the column names:
val dataframe = shades_data.toDF()
dataframe.printSchema()
If no column names are defined while creating a dataframe, Spark assigns default names: for an RDD of tuples they are _1, _2, ..., and for your RDD of string arrays you get a single column named value (which is why the error message mentions A.value).
Or else you can give column names while creating the dataframe, provided each line has first been parsed into a tuple (as in the answer above), like below
val dataframe = shades_data.toDF("col1", "col2", "col3")
dataframe.printSchema()
root
|-- col1: integer (nullable = false)
|-- col2: integer (nullable = false)
|-- col3: string (nullable = true)
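Tying this back to the original query, a hedged sketch (it assumes shades_data has been parsed into an RDD of (Int, Int, String) tuples first, as noted above, so that the three names can be assigned):
val dataframe = shades_data.toDF("col1", "col2", "col3")
dataframe.createOrReplaceTempView("shades_data")
sqlContext.sql("select * from shades_data A where A.col3 = 'dark green'").show(false)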

Sort by date an Array of a Spark DataFrame Column

I have a DataFrame formatted as below:
+---+------------------------------------------------------+
|Id |DateInfos |
+---+------------------------------------------------------+
|B |[[3, 19/06/2012-02.42.01], [4, 17/06/2012-18.22.21]] |
|A |[[1, 15/06/2012-18.22.16], [2, 15/06/2012-09.22.35]] |
|C |[[5, 14/06/2012-05.20.01]] |
+---+------------------------------------------------------+
I would like to sort the elements of the DateInfos column by date, using the timestamp in the second element of each struct:
+---+------------------------------------------------------+
|Id |DateInfos |
+---+------------------------------------------------------+
|B |[[4, 17/06/2012-18.22.21], [3, 19/06/2012-02.42.01]] |
|A |[[2, 15/06/2012-09.22.35], [1, 15/06/2012-18.22.16]] |
|C |[[5, 14/06/2012-05.20.01]] |
+---+------------------------------------------------------+
the schema of my DataFrame is printed as below:
root
|-- C1: string (nullable = true)
|-- C2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: string (nullable = false)
I assume I have to create a UDF which uses a function with the following signature:
def sort_by_date(mouvements : Array[Any]) : Array[Any]
Do you have any idea?
That's indeed a bit tricky: although the UDF's input and output types seem identical, we can't really define it that way, because the input is actually a mutable.WrappedArray[Row] and the output can't use Row or else Spark will fail to decode it into a Row...
So we define a UDF that takes a mutable.WrappedArray[Row] and returns an Array[(Int, String)]:
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
val sortDates = udf { arr: mutable.WrappedArray[Row] =>
  arr.map { case Row(i: Int, s: String) => (i, s) }.sortBy(_._2)
}
val result = input.select($"Id", sortDates($"DateInfos") as "DateInfos")
result.show(truncate = false)
// +---+--------------------------------------------------+
// |Id |DateInfos |
// +---+--------------------------------------------------+
// |B |[[4,17/06/2012-18.22.21], [3,19/06/2012-02.42.01]]|
// |A |[[2,15/06/2012-09.22.35], [1,15/06/2012-18.22.16]]|
// |C |[[5,14/06/2012-05.20.01]] |
// +---+--------------------------------------------------+
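One caveat I would hedge on: sortBy(_._2) compares the raw dd/MM/yyyy-HH.mm.ss strings lexicographically, which only happens to give the right order for this sample (all dates share month and year). Parsing the timestamp makes the sort robust; a sketch of the same UDF with a parsed key, assuming that date format:
import java.text.SimpleDateFormat
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Same UDF shape, but sort on parsed epoch millis instead of the raw string.
val sortDatesParsed = udf { arr: mutable.WrappedArray[Row] =>
  val fmt = new SimpleDateFormat("dd/MM/yyyy-HH.mm.ss")
  arr.map { case Row(i: Int, s: String) => (i, s) }
     .sortBy { case (_, s) => fmt.parse(s).getTime }
}
val resultParsed = input.select($"Id", sortDatesParsed($"DateInfos") as "DateInfos")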