Spark: Convert column of string to an array - scala

How to convert a column that has been read as a string into a column of arrays?
i.e., convert from the schema below:
scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)
+---+---+
|  a|  b|
+---+---+
|  1|2,3|
|  2|4,5|
+---+---+
To:
scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)
+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+
Please share both Scala and Python implementations if possible.
On a related note, how do I take care of this while reading from the file itself?
I have data with ~450 columns, and I want a few of them specified in this format.
Currently I am reading in pyspark as below:
df = spark.read.format('com.databricks.spark.csv').options(
header='true', inferschema='true', delimiter='|').load(input_file)
Thanks.

There are various methods. The best way is to use the split function and cast to array<long>:
import org.apache.spark.sql.functions.{col, split}

data.withColumn("b", split(col("b"), ",").cast("array<long>"))
You can also create a simple UDF to convert the values:
import org.apache.spark.sql.functions.udf

val tolong = udf((value: String) => value.split(",").map(_.toLong))
data.withColumn("newB", tolong(data("b"))).show
Hope this helps!
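For reference, here is a minimal end-to-end sketch (the sample data mirrors the question; a spark-shell session, i.e. a SparkSession named spark with its implicits, is assumed):
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._

// hypothetical sample matching the question's schema: a long and a comma-separated string
val test = Seq((1L, "2,3"), (2L, "4,5")).toDF("a", "b")

// split on "," gives array<string>; the cast turns it into array<long>
val test1 = test.withColumn("b", split(col("b"), ",").cast("array<long>"))

test1.printSchema   // b: array (element: long)
test1.show()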

Using a UDF would give you the exact required schema, like this:
import org.apache.spark.sql.functions.{col, udf}

val toArray = udf((b: String) => b.split(",").map(_.toLong))
val test1 = test.withColumn("b", toArray(col("b")))
It would give you the schema as follows:
scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)
+---+-----+
|  a|    b|
+---+-----+
|  1|[2,3]|
|  2|[4,5]|
+---+-----+
As far as applying the schema on the file read itself is concerned, I think that is a tough task, since CSV has no array type to infer. So, for now, you can apply the transformation after reading test into a DataFrame.
I hope this helps!
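If only a handful of the ~450 columns need this treatment, one option is to fold the conversion over just those columns right after the read. A sketch, not the author's code; rawDf, inputFile and the column list are placeholders:
import org.apache.spark.sql.functions.{col, split}

// placeholder read, mirroring the question's options
val rawDf = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "|").csv(inputFile)

// placeholder: names of the columns holding comma-separated longs
val arrayCols = Seq("b")

val converted = arrayCols.foldLeft(rawDf) { (df, name) =>
  df.withColumn(name, split(col(name), ",").cast("array<long>"))
}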

In Python (PySpark) it would be:
from pyspark.sql.functions import col, split

test = test.withColumn(
    "b",
    split(col("b"), r",\s*").cast("array<long>")
)

Related

Convert Array with nested struct to string column along with other columns from the PySpark DataFrame

This is similar to Pyspark: cast array with nested struct to string, but the accepted answer is not working for my case, so I am asking here.
|-- Col1: string (nullable = true)
|-- Col2: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- Col2Sub: string (nullable = true)
Sample JSON
{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}
This gives the result in a single column:
import pyspark.sql.functions as F

df.selectExpr("EXPLODE(Col2) AS structCol") \
  .select(F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")) \
  .show()
+-------------+
|Col2_concated|
+-------------+
|      foo,bar|
+-------------+
But how do I get a result (DataFrame) like this?
+------+-------------+
|  Col1|Col2_concated|
+------+-------------+
|abc123|      foo,bar|
+------+-------------+
EDIT:
This solution gives the wrong result
df.selectExpr("Col1","EXPLODE(Col2) AS structCol").select("Col1", F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()
+------+-------------+
|  Col1|Col2_concated|
+------+-------------+
|abc123|          foo|
|abc123|          bar|
+------+-------------+
Just avoid the explode and you are already there. All you need is the concat_ws function. This function concatenates multiple string columns, or the elements of an array of strings, with a given separator. See the example below:
from pyspark.sql import functions as F
j = '{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}'
df = spark.read.json(sc.parallelize([j]))
#printSchema tells us the column names we can use with concat_ws
df.printSchema()
Output:
root
 |-- Col1: string (nullable = true)
 |-- Col2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Col2Sub: string (nullable = true)
The column Col2 is an array of structs with the field Col2Sub, and we can use that field name to get the desired result:
bla = df.withColumn('Col2', F.concat_ws(',', df.Col2.Col2Sub))
bla.show()
+------+-------+
| Col1| Col2|
+------+-------+
|abc123|foo,bar|
+------+-------+

Spark filter on generic data generic array when knowing only filter condition

I want to filter a Spark sql.DataFrame, leaving only the wanted array elements, without any knowledge of the whole schema beforehand (I don't want to hardcode it).
Schema:
root
 |-- callstartcelllabel: string (nullable = true)
 |-- calltargetcelllabel: string (nullable = true)
 |-- measurements: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- enodeb: string (nullable = true)
 |    |    |-- label: string (nullable = true)
 |    |    |-- ltecelloid: long (nullable = true)
 |-- networkcode: long (nullable = true)
 |-- ocode: long (nullable = true)
 |-- startcelllabel: string (nullable = true)
 |-- startcelloid: long (nullable = true)
 |-- targetcelllabel: string (nullable = true)
 |-- targetcelloid: long (nullable = true)
 |-- timestamp: long (nullable = true)
I want the whole root row, but only with the particular measurements that pass the filter, and the root must contain at least one measurement after filtering.
I have a dataframe of this root, and I have a dataframe of filtering values (one column).
For example: I would only know that my root contains a measurements array, and that this array contains labels. I want the whole root with all measurements whose labels are in ("label1", "label2").
My last attempt with explode and collect_list leads to: grouping expressions sequence is empty, and 'callstartcelllabel' is not an aggregate function... Is it even possible to generalize such a filtering case? I don't know what such a generic UDAF should look like yet.
I am new to Spark.
EDIT:
The current solution I've come to is:
explode the array -> filter out rows with unwanted array members -> group by everything except the array members -> agg(collect_list(col("measurements")))
Would it be faster to do it with a UDF? I can't figure out how to write a generic UDF that filters a generic array, knowing only the filtering values...
case class Test(a: Int, b: Int) // declared case class to show the above scenario

var df = List((1, 2, Test(1, 2)), (2, 3, Test(3, 4)), (4, 2, Test(5, 6))).toDF("name", "rank", "array")
+----+----+------+
|name|rank| array|
+----+----+------+
| 1| 2|[1, 2]|
| 2| 3|[3, 4]|
| 4| 2|[5, 6]|
+----+----+------+
df.printSchema
// the DataFrame structure looks like this
root
 |-- name: integer (nullable = false)
 |-- rank: integer (nullable = false)
 |-- array: struct (nullable = true)
 |    |-- a: integer (nullable = false)
 |    |-- b: integer (nullable = false)
df.filter(df("array")("a") > 1).show
// after filtering the DataFrame on the specified condition
+----+----+------+
|name|rank| array|
+----+----+------+
| 2| 3|[3, 4]|
| 4| 2|[5, 6]|
+----+----+------+
// The above code helps you understand the scenario.
// For your case, use this piece of code:
df.filter(df("measurements")("label") === "label1" || df("measurements")("label") === "label2").show
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

var df = Seq((1, 2, Array(Test(1, 2), Test(5, 6))), (1, 3, Array(Test(1, 2), Test(5, 3))), (10, 11, Array(Test(1, 6)))).toDF("name", "rank", "array")
df.show
+----+----+----------------+
|name|rank| array|
+----+----+----------------+
| 1| 2|[[1, 2], [5, 6]]|
| 1| 3|[[1, 2], [5, 3]]|
| 10| 11| [[1, 6]]|
+----+----+----------------+
def test = {
  udf((a: scala.collection.mutable.WrappedArray[Row]) => {
    val b = a.toArray.map(x => (x.getInt(0), x.getInt(1)))
    b.filter(y => y._1 > 1)
  })
}

df.withColumn("array", test(df("array"))).show
+----+----+--------+
|name|rank| array|
+----+----+--------+
| 1| 2|[[5, 6]]|
| 1| 3|[[5, 3]]|
| 10| 11| []|
+----+----+--------+
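For the generic case the question asks about (keeping the whole row while dropping array elements whose label is not in a given set), here is a hedged sketch along the same lines. The column and field names follow the question's schema, rootDf is a placeholder for the question's DataFrame, and the labels set is assumed to have been collected from the one-column filter DataFrame beforehand:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, size, udf}

// assumption: the allowed labels were collected from the filter DataFrame
val labels = Set("label1", "label2")

// keep only the struct elements whose `label` field is in the set
// (note: the structs returned by the UDF get default field names _1/_2/_3)
val keepLabels = udf { (ms: Seq[Row]) =>
  ms.filter(r => labels.contains(r.getAs[String]("label")))
    .map(r => (r.getAs[String]("enodeb"), r.getAs[String]("label"), r.getAs[Long]("ltecelloid")))
}

val filtered = rootDf
  .withColumn("measurements", keepLabels(col("measurements")))
  .filter(size(col("measurements")) > 0)   // root must keep at least one measurement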

spark sql cast function creates column with NULLS

I have the following dataframe and schema in Spark
val df = spark.read.options(Map("header" -> "true")).csv("path")
scala> df.show()
+-------+-------+-----+
| user| topic| hits|
+-------+-------+-----+
| om| scala| 120|
| daniel| spark| 80|
|3754978| spark| 1|
+-------+-------+-----+
scala> df.printSchema
root
|-- user: string (nullable = true)
|-- topic: string (nullable = true)
|-- hits: string (nullable = true)
I want to change the column hits to integer
I tried this:
scala> df.createOrReplaceTempView("test")
val dfNew = spark.sql("select *, cast('hist' as integer) as hist2 from test")
scala> dfNew.printSchema
root
|-- user: string (nullable = true)
|-- topic: string (nullable = true)
|-- hits: string (nullable = true)
|-- hist2: integer (nullable = true)
but when I print the DataFrame, the column hist2 is filled with NULLs:
scala> dfNew.show()
+-------+-------+-----+-----+
| user| topic| hits|hist2|
+-------+-------+-----+-----+
| om| scala| 120| null|
| daniel| spark| 80| null|
|3754978| spark| 1| null|
+-------+-------+-----+-----+
I also tried this:
scala> val df2 = df.withColumn("hitsTmp",
         df.hits.cast(IntegerType)).drop("hits").withColumnRenamed("hitsTmp", "hits")
and got this:
<console>:26: error: value hits is not a member of org.apache.spark.sql.DataFrame
Also tried this:
scala> val df2 = df.selectExpr("user", "topic", "cast(hits as int) hits")
and got this:
org.apache.spark.sql.AnalysisException: cannot resolve '`topic`' given input columns: [user, topic, hits]; line 1 pos 0;
'Project [user#0, 'topic, cast('hits as int) AS hits#22]
+- Relation[user#0, topic#1, hits#2] csv
With
scala> val df2 = df.selectExpr("cast(hits as int) hits")
I get a similar error.
Any help will be appreciated. I know this question has been addressed before, but I tried three different approaches (shown here) and none of them works.
Thanks.
How do we make the Spark cast throw an exception instead of generating all the null values?
Do I have to compare the number of null values before and after the cast in order to see whether the cast was actually successful?
The post How to test datatype conversion during casting does that. I wonder if there is a better solution here.
You can cast a column to Integer type in the following ways:
df.withColumn("hits", df("hits").cast("integer"))
Or
data.withColumn("hitsTmp",
data("hits").cast(IntegerType)).drop("hits").
withColumnRenamed("hitsTmp", "hits")
Or
data.selectExpr ("user","topic","cast(hits as int) hits")
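If you also want to detect whether the cast silently turned values into nulls (as asked above), here is a minimal sketch, assuming the df from the question:
import org.apache.spark.sql.functions.col

// count rows where the original string was present but the cast produced null
val casted = df.withColumn("hitsInt", col("hits").cast("int"))
val failed = casted.filter(col("hits").isNotNull && col("hitsInt").isNull).count()
if (failed > 0) sys.error(s"$failed value(s) in 'hits' could not be cast to int")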
I know that this answer probably won't be useful for the OP since it comes with a ~2 year delay. It might however be helpful for someone facing this problem.
Just like you, I had a dataframe with a column of strings which I was trying to cast to integer:
> df.show
+-------+
| id|
+-------+
|4918088|
|4918111|
|4918154|
...
> df.printSchema
root
|-- id: string (nullable = true)
But after doing the cast to IntegerType the only thing I obtained, just as you did, was a column of nulls:
> df.withColumn("test", $"id".cast(IntegerType))
.select("id","test")
.show
+-------+----+
| id|test|
+-------+----+
|4918088|null|
|4918111|null|
|4918154|null|
...
By default, if you try to cast a string that contains non-numeric characters to integer, the cast of the column won't fail, but those values will be set to null, as you can see in the following example:
> val testDf = sc.parallelize(Seq(("1"), ("2"), ("3A") )).toDF("n_str")
> testDf.withColumn("n_int", $"n_str".cast(IntegerType))
.select("n_str","n_int")
.show
+-----+-----+
|n_str|n_int|
+-----+-----+
| 1| 1|
| 2| 2|
| 3A| null|
+-----+-----+
The thing with our initial dataframe is that, at first sight, when we use the show method, we can't see any non-numeric characters. However, if you take a row from your dataframe, you'll see something different:
> df.first
org.apache.spark.sql.Row = [4?9?1?8?0?8?8??]
Why is this happening? You are probably reading a CSV file with an unsupported encoding.
You can solve this by changing the encoding of the file you are reading. If that is not an option, you can also clean each column before doing the type cast. An example:
> val df_cast = df.withColumn("test", regexp_replace($"id", "[^0-9]","").cast(IntegerType))
.select("id","test")
> df_cast.show
+-------+-------+
| id| test|
+-------+-------+
|4918088|4918088|
|4918111|4918111|
|4918154|4918154|
...
> df_cast.printSchema
root
|-- id: string (nullable = true)
|-- test: integer (nullable = true)
Try removing the quotes around hist.
If that does not work, then try trimming the column:
dfNew = spark.sql("select *, cast(trim(hist) as integer) as hist2 from test")
This response is delayed, but I was facing the same issue and this worked, so I thought I'd put it here; it might be of help to someone.
Try declaring the schema explicitly as a StructType. Reading from CSV files with an inferred schema (or one derived from a case class) can give weird data-type errors even though all the data formats are properly specified.
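A minimal sketch of that suggestion (column names follow the question; the path is a placeholder):
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// declare the schema up front instead of relying on inference
val schema = StructType(Seq(
  StructField("user", StringType, nullable = true),
  StructField("topic", StringType, nullable = true),
  StructField("hits", IntegerType, nullable = true)
))

val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("path")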
I had a similar problem where I was casting strings to integers, but I realized I needed to cast them to longs instead. It was hard to realize this at first, since my column's type showed as int when I printed it using
print(df.dtypes)

cast schema of a data frame in Spark and Scala

I want to cast the schema of a dataframe to change the type of some columns
using Spark and Scala.
Specifically, I am trying to use the as[U] function, whose description reads:
"Returns a new Dataset where each record has been mapped on to the specified type.
The method used to map columns depend on the type of U"
In principle this is exactly what I want, but I cannot get it to work.
Here is a simple example taken from
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
// definition of data
val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")
As expected the schema of data is:
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
I would like to cast the column "b" to Double. So I try the following:
import session.implicits._;
println(" --------------------------- Casting using (String Double)")
val data_TupleCast=data.as[(String, Double)]
data_TupleCast.show()
data_TupleCast.printSchema()
println(" --------------------------- Casting using ClassData_Double")
case class ClassData_Double(a: String, b: Double)
val data_ClassCast= data.as[ClassData_Double]
data_ClassCast.show()
data_ClassCast.printSchema()
As I understand the definition of as[U], the new DataFrames should have the following schema:
root
|-- a: string (nullable = true)
|-- b: double (nullable = false)
But the output is
--------------------------- Casting using (String Double)
+---+---+
| a| b|
+---+---+
| a| 1|
| b| 2|
+---+---+
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
--------------------------- Casting using ClassData_Double
+---+---+
| a| b|
+---+---+
| a| 1|
| b| 2|
+---+---+
root
|-- a: string (nullable = true)
|-- b: integer (nullable = false)
which shows that column "b" has not been cast to double.
Any hints on what I am doing wrong?
BTW: I am aware of the previous post "How to change column types in Spark SQL's DataFrame?" (see How to change column types in Spark SQL's DataFrame?). I know I can change the type of columns one at a time, but I am looking for a more general solution that changes the schema of the whole data in one shot (and I am trying to understand Spark in the process).
Well, since functions are chained and Spark does lazy evaluation, it actually does change the schema of the whole data in one shot, even if you write it as changing one column at a time, like this:
import spark.implicits._
import org.apache.spark.sql.types.{DoubleType, StringType}

df.withColumn("x", 'x.cast(DoubleType)).withColumn("y", 'y.cast(StringType))...
As an alternative, I'm thinking you could use map to do your cast in one go, like:
df.map{t => (t._1, t._2.asInstanceOf[Double], t._3.asInstanceOf[], ...)}
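If the goal is to end up with the schema of ClassData_Double, one way is to combine an explicit cast with as[U]. A sketch; the names come from the question, and the SparkSession implicits are assumed to be imported as in the question:
import org.apache.spark.sql.types.DoubleType

// as[U] only re-maps records, it does not rewrite the underlying schema,
// so cast the column first and then map to the target type
val data_ClassCast = data
  .withColumn("b", data("b").cast(DoubleType))
  .as[ClassData_Double]

data_ClassCast.printSchema()   // b: double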

udf Function for DataType casting, Scala

I have the following DataFrame:
df.show()
+---------------+----+
| x| num|
+---------------+----+
|[0.1, 0.2, 0.3]| 0|
|[0.3, 0.1, 0.1]| 1|
|[0.2, 0.1, 0.2]| 2|
+---------------+----+
This DataFrame has the following column data types:
df.printSchema
root
 |-- x: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- num: long (nullable = true)
I am currently trying to convert the array of doubles inside the DataFrame to an array of floats. I do it with the following udf statement:
val toFloat = udf[(val line: Seq[Double]) => line.map(_.toFloat)]
val test = df.withColumn("testX", toFloat(df("x")))
This code is currently not working. Can anybody share with me how to change the array type inside the DataFrame?
What I want is:
df.printSchema
root
 |-- x: array (nullable = true)
 |    |-- element: float (containsNull = true)
 |-- num: long (nullable = true)
This question is based on the question How to change the simple DataType in Spark SQL's DataFrame.
Your udf is declared incorrectly. You should write it as follows:
val toFloat = udf((line: Seq[Double]) => line.map(_.toFloat))
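For completeness, a short usage sketch (assuming the df from the question):
import org.apache.spark.sql.functions.udf

val toFloat = udf((line: Seq[Double]) => line.map(_.toFloat))
val withFloats = df.withColumn("x", toFloat(df("x")))

withFloats.printSchema   // x: array (element: float)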