Cannot resolve column (numeric column name) in Spark Dataframe - scala

This is my data:
scala> data.printSchema
root
|-- 1.0: string (nullable = true)
|-- 2.0: string (nullable = true)
|-- 3.0: string (nullable = true)
This doesn't work :(
scala> data.select("2.0").show
Exception:
org.apache.spark.sql.AnalysisException: cannot resolve '`2.0`' given input columns: [1.0, 2.0, 3.0];;
'Project ['2.0]
+- Project [_1#5608 AS 1.0#5615, _2#5609 AS 2.0#5616, _3#5610 AS 3.0#5617]
+- LocalRelation [_1#5608, _2#5609, _3#5610]
...
Try this at home (I'm running on the shell v_2.1.0.5)!
val data = spark.createDataFrame(Seq(
("Hello", ", ", "World!")
)).toDF("1.0", "2.0", "3.0")
data.select("2.0").show

You can use backticks to escape the dot, which is otherwise reserved for accessing fields of struct-type columns:
data.select("`2.0`").show
+---+
|2.0|
+---+
| , |
+---+
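For completeness, here is a minimal sketch of two related options: passing the backtick-quoted name to col, and renaming the columns once so that no escaping is needed afterwards. The local SparkSession setup, the "c" prefix and the underscore replacement are assumptions made for this example, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("dotted-columns").getOrCreate()
import spark.implicits._

val data = Seq(("Hello", ", ", "World!")).toDF("1.0", "2.0", "3.0")

// The backtick-quoted name also works when building a Column with col(...).
val viaCol = data.select(col("`2.0`"))

// Alternatively, rename every column once so later selects need no escaping.
// The "c" prefix and "_" replacement are arbitrary choices for this sketch.
val renamed = data.toDF(data.columns.map(n => "c" + n.replace(".", "_")): _*)
val viaRename = renamed.select("c2_0")
```

Renaming up front is probably the more comfortable choice if the dotted names are used in many places.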

The problem is that a dot in a column name is interpreted as a struct field accessor when selecting from a DataFrame, so the name has to be escaped. You can have a look at this question, which is kind of similar.
// Wrap a column name in backticks so the dot is not parsed as a struct field accessor
def sanitize(input: String): String = s"`$input`"
val data = spark.createDataFrame(Seq(
("Hello", ", ", "World!")
)).toDF("1.0", "2.0", "3.0")
data.select(sanitize("2.0")).show

Related

String Functions for Nested Schema in Spark Scala

I am learning Spark with the Scala programming language.
Input file ->
{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}
Schema ->
root
|-- Personal: struct (nullable = true)
| |-- ID: integer (nullable = true)
| |-- Name: array (nullable = true)
| | |-- element: string (containsNull = true)
Operation for output ->
I want to concatenate the strings in the "Name" array,
e.g. abcs|dakjdb
I am reading the file using the DataFrame API.
Please help me with this.
It should be pretty straightforward. If you are working with Spark >= 1.6.0, you can use get_json_object and concat_ws:
import org.apache.spark.sql.functions.{get_json_object, concat_ws}
val df = Seq(
("""{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}"""),
("""{"Personal":{"ID":3425,"Name":["cfg","woooww"]}}""")
)
.toDF("data")
df.select(
concat_ws(
"-",
get_json_object($"data", "$.Personal.Name[0]"),
get_json_object($"data", "$.Personal.Name[1]")
).as("FullName")
).show(false)
// +-----------+
// |FullName |
// +-----------+
// |abcs-dakjdb|
// |cfg-woooww |
// +-----------+
With get_json_object we go through the JSON data and extract the two elements of the Name array, which we then concatenate.
There is an inbuilt function concat_ws which should be useful here.
To extend @Alexandros Biratsis's answer: you can first convert Name into an array[String] type before concatenating, which avoids writing out every name position. Querying by position is also fragile when the value is null or when only one value exists instead of two.
import org.apache.spark.sql.functions.{get_json_object, concat_ws, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}
val arraySchema = ArrayType(StringType)
val df = Seq(
("""{"Personal":{"ID":3424,"Name":["abcs","dakjdb"]}}"""),
("""{"Personal":{"ID":3425,"Name":["cfg","woooww"]}}""")
)
.toDF("data")
df
.select(get_json_object($"data", "$.Personal.Name") as "name")
.select(from_json($"name", arraySchema) as "name")
.select(concat_ws("|", $"name"))
.show(false)
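If the file is read with the DataFrame JSON reader, so that the column already has the struct schema shown in the question, the JSON functions may not be needed at all. The sketch below assumes that situation and rebuilds the schema by hand; the local SparkSession setup and the second row with a single name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat_ws, struct}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("concat-names").getOrCreate()
import spark.implicits._

// Rebuild the question's schema: Personal: struct<ID: int, Name: array<string>>.
// The second row (a single name) is a hypothetical edge case.
val people = Seq(
  (3424, Seq("abcs", "dakjdb")),
  (3425, Seq("cfg"))
).toDF("ID", "Name").select(struct($"ID", $"Name").as("Personal"))

// concat_ws accepts array<string> columns, so no positional access is needed.
val fullNames = people.select(concat_ws("|", $"Personal.Name").as("FullName"))
```

Because the whole array is passed to concat_ws, rows with one name or with many names are handled the same way.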

Unable to explode() Map[String, Struct] in Spark

Been struggling with this for a while and still can't wrap my head around it.
I'm trying to flatMap (or use .withColumn with explode() instead as it seems easier so I don't lose column names), but I'm always getting the error UDTF expected 2 aliases but got 'name' instead.
I've looked at some similar questions, but none of them shed any light, as their schemas are too simple.
The column of the schema I'm trying to perform flatMap with is the following one...
StructField(CarMake,
StructType(
List(
StructField(
Models,
MapType(
StringType,
StructType(
List(
StructField(Variant, StringType),
StructField(GasOrPetrol, StringType)
)
)
)
)
)
))
What I'm trying to achieve by calling explode() like this...
carsDS
.withColumn("modelsAndVariant", explode($"carmake.models"))
...is to get rows without that nested Map and Struct, so I get as many rows as there are variants.
Example input
(country: Sweden, carMake: Volvo, carMake.Models: {"850": ("T5", "petrol"), "V50": ("T5", "petrol")})
Example output
(country: Sweden, carMake: Volvo, Model: "850", Variant: "T5", GasOrPetrol: "petrol")
(country: Sweden, carMake: Volvo, Model: "V50", Variant: "T5", GasOrPetrol: "petrol")
Basically leaving the nested Map with its inner Struct all in the same level.
Try this:
case class Models(variant:String, gasOrPetrol:String)
case class CarMake(brand:String, models : Map[String, Models] )
case class MyRow(carMake:CarMake)
val df = List(
MyRow(CarMake("volvo",Map(
"850" -> Models("T5","petrol"),
"V50" -> Models("T5","petrol")
)))
).toDF()
df.printSchema()
df.show()
gives
root
|-- carMake: struct (nullable = true)
| |-- brand: string (nullable = true)
| |-- models: map (nullable = true)
| | |-- key: string
| | |-- value: struct (valueContainsNull = true)
| | | |-- variant: string (nullable = true)
| | | |-- gasOrPetrol: string (nullable = true)
+--------------------+
| carMake|
+--------------------+
|[volvo, [850 -> [...|
+--------------------+
Now explode. Note that withColumn does not work because explode on a map returns 2 columns (key and value), so you need to use select:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, explode}
val cols: Array[Column] = df.columns.map(col)
df
.select((cols:+explode($"carMake.models")):_*)
.select((cols:+$"key".as("model"):+$"value.*"):_*)
.show()
gives:
+--------------------+-----+-------+-----------+
| carMake|model|variant|gasOrPetrol|
+--------------------+-----+-------+-----------+
|[volvo, [850 -> [...| 850| T5| petrol|
|[volvo, [850 -> [...| V50| T5| petrol|
+--------------------+-----+-------+-----------+
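The "expected 2 aliases" error comes from giving a single name to a generator that produces two columns. As a rough sketch, the two output columns can also be named explicitly by passing a Seq of aliases to as. The DataFrame below is a simplified stand-in for carsDS built from tuples (so the struct fields are _1 and _2), and the names model and spec are chosen only for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("explode-map").getOrCreate()
import spark.implicits._

// Simplified stand-in for carsDS: the map value is a tuple,
// so its struct fields are named _1 and _2.
val cars = Seq(
  ("Sweden", "Volvo", Map("850" -> ("T5", "petrol"), "V50" -> ("T5", "petrol")))
).toDF("country", "carMake", "models")

// explode on a map yields two columns, so two aliases are supplied.
val flat = cars
  .select($"country", $"carMake", explode($"models").as(Seq("model", "spec")))
  .select($"country", $"carMake", $"model",
    $"spec._1".as("Variant"), $"spec._2".as("GasOrPetrol"))
```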

Convert array<map<string,string>> type to <string,string> in scala

I am facing a problem converting a column in my DataFrame to string format. An example of the DataFrame schema is as follows:
|-- example_code_b: string (nullable = true)
|-- example_code: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
I want to convert example code to (string,string) format from the current array(map(string,string)).
The input is in the form of [Map(entity -> PER), Map(entity -> PER)] and
I want the output to be in the form of PER,PER
You can either write a UDF in the DataFrame API or use the Dataset API to do it:
import spark.implicits._
df
.select($"example_code")
.as[Seq[Map[String,String]]]
.map(s => s.reduce(_ ++ _))
.toDF("example_code")
.show()
Note that this does not handle keys that occur in more than one map: their values are not "merged" but just overwritten.
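To make the overwriting behaviour concrete, here is a small plain-Scala sketch (no Spark needed) on a value shaped like the input in the question. It also shows one possible way to get the PER,PER output by collecting the values instead of merging the maps. joinValues is a hypothetical helper name, not something from the answer above.

```scala
// Sample value in the shape shown in the question.
val exampleCode: Seq[Map[String, String]] =
  Seq(Map("entity" -> "PER"), Map("entity" -> "PER"))

// Merging the maps keeps one entry per key, so the duplicate key collapses.
val merged: Map[String, String] = exampleCode.reduce(_ ++ _)

// Hypothetical helper: collect the values of every map, keeping each occurrence.
def joinValues(maps: Seq[Map[String, String]]): String =
  maps.flatMap(_.values).mkString(",")

val joined: String = joinValues(exampleCode)
```

A function like joinValues could presumably be used in place of the reduce inside the Dataset map, or wrapped in a UDF.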
You can simply use the explode function on any array column, which creates a separate row for each element of the array.
val newDF = df.withColumn("mymap", explode(col("example_code")))

How to convert a column from hex string to long?

I have a DataFrame with Icao column with hex codes that I would like to convert to Long datatype. How could I do this in Spark SQL?
+------+-----+
| Icao|count|
+------+-----+
|471F8D|81350|
|471F58|79634|
|471F56|79112|
|471F86|78177|
|471F8B|75300|
|47340D|75293|
|471F83|74864|
|471F57|73815|
|471F4A|72290|
|471F5F|72133|
|40612C|69676|
+------+-----+
TL;DR Use the conv standard function.
conv(num: Column, fromBase: Int, toBase: Int): Column
Convert a number in a string column from one base to another.
With conv a solution could be as follows:
scala> icao.show
+------+-----+
| Icao|count|
+------+-----+
|471F8D|81350|
|471F58|79634|
|471F56|79112|
|471F86|78177|
|471F8B|75300|
|47340D|75293|
|471F83|74864|
|471F57|73815|
|471F4A|72290|
|471F5F|72133|
|40612C|69676|
+------+-----+
// conv is not available by default unless you're in spark-shell
import org.apache.spark.sql.functions.conv
val s1 = icao.withColumn("conv", conv($"Icao", 16, 10))
scala> s1.show
+------+-----+-------+
| Icao|count| conv|
+------+-----+-------+
|471F8D|81350|4661133|
|471F58|79634|4661080|
|471F56|79112|4661078|
|471F86|78177|4661126|
|471F8B|75300|4661131|
|47340D|75293|4666381|
|471F83|74864|4661123|
|471F57|73815|4661079|
|471F4A|72290|4661066|
|471F5F|72133|4661087|
|40612C|69676|4219180|
+------+-----+-------+
conv always returns a string column, so the converted values are strings even though they look like numbers.
scala> s1.printSchema
root
|-- Icao: string (nullable = true)
|-- count: string (nullable = true)
|-- conv: string (nullable = true)
You can cast the result of conv to a numeric type using another built-in method, cast.
val s2 = icao.withColumn("conv", conv($"Icao", 16, 10) cast "long")
scala> s2.printSchema
root
|-- Icao: string (nullable = true)
|-- count: string (nullable = true)
|-- conv: long (nullable = true)
scala> s2.show
+------+-----+-------+
| Icao|count| conv|
+------+-----+-------+
|471F8D|81350|4661133|
|471F58|79634|4661080|
|471F56|79112|4661078|
|471F86|78177|4661126|
|471F8B|75300|4661131|
|47340D|75293|4666381|
|471F83|74864|4661123|
|471F57|73815|4661079|
|471F4A|72290|4661066|
|471F5F|72133|4661087|
|40612C|69676|4219180|
+------+-----+-------+
You can use Java's hex-to-Long conversion:
java.lang.Long.parseLong(hex.trim(), 16)
All you need is to define a UDF as below:
import org.apache.spark.sql.functions.udf
def hexToLong = udf((hex: String) => java.lang.Long.parseLong(hex.trim(), 16))
And call the UDF using the .withColumn API:
df.withColumn("Icao", hexToLong($"Icao")).show(false)
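One caveat with the UDF above is that parseLong throws on null or malformed input, which would fail the whole job. A more defensive variant of the parsing logic, sketched in plain Scala, might look like this. hexToLongOpt is a hypothetical name, and returning None for bad input is a design choice for this example rather than anything required.

```scala
import scala.util.Try

// Hypothetical helper: None for null, blank, or non-hex input,
// Some(value) for a valid hexadecimal string.
def hexToLongOpt(hex: String): Option[Long] =
  Option(hex)                // guards against null
    .map(_.trim)
    .filter(_.nonEmpty)
    .flatMap(h => Try(java.lang.Long.parseLong(h, 16)).toOption)
```

Wrapped as udf(hexToLongOpt _), a None should show up as null in the resulting column instead of an exception.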

Explode array of structs to columns in Spark

I'd like to explode an array of structs to columns (as defined by the struct fields). E.g.
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = false)
| | |-- name: string (nullable = true)
Should be transformed to
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
I can achieve this with
df
.select(explode($"arr").as("tmp"))
.select($"tmp.*")
How can I do that in a single select statement?
I thought this could work, unfortunately it does not:
df.select(explode($"arr")(".*"))
Exception in thread "main" org.apache.spark.sql.AnalysisException: No such struct field .* in col;
A single-step solution is available only for MapType columns:
val df = Seq(Tuple1(Map((1L, "bar"), (2L, "foo")))).toDF
df.select(explode($"_1") as Seq("foo", "bar")).show
+---+---+
|foo|bar|
+---+---+
| 1|bar|
| 2|foo|
+---+---+
With arrays you can use flatMap:
val df = Seq(Tuple1(Array((1L, "bar"), (2L, "foo")))).toDF
df.as[Seq[(Long, String)]].flatMap(identity)
A single SELECT statement can be written in SQL:
df.createOrReplaceTempView("df")
spark.sql("SELECT x._1, x._2 FROM df LATERAL VIEW explode(_1) t AS x")
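Another option that may fit the single-select requirement is the inline generator function of Spark SQL, which expands an array of structs into one row per element and one column per struct field. The sketch below assumes a Spark version that ships inline (2.0 or later, as far as I know) and uses a made-up DataFrame of tuples, so the resulting columns are named _1 and _2.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("inline-structs").getOrCreate()
import spark.implicits._

// arr: array<struct<_1: bigint, _2: string>>
val df = Seq(Tuple1(Seq((1L, "bar"), (2L, "foo")))).toDF("arr")

// inline(arr) yields one row per array element and one column per struct field.
val flat = df.selectExpr("inline(arr)")
```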