Creating a Spark Vector Column with createDataFrame - scala

I can create a Spark DataFrame with a vector column using the toDF method:
val dataset = Seq((1.0, org.apache.spark.ml.linalg.Vectors.dense(0.0, 10.0, 0.5))).toDF("id", "userFeatures")
scala> dataset.printSchema()
root
|-- id: double (nullable = false)
|-- userFeatures: vector (nullable = true)
scala> dataset.schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(id,DoubleType,false), StructField(userFeatures,org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7,true))
I'm not sure how to create a vector column with the createDataFrame method. There isn't a VectorType type in org.apache.spark.sql.types.
This doesn't work:
val rows = spark.sparkContext.parallelize(
  List(
    Row(1.0, Vectors.dense(1.0, 2.0))
  )
)
val schema = List(
  StructField("id", DoubleType, true),
  StructField("features", new org.apache.spark.ml.linalg.VectorUDT, true)
)
val df = spark.createDataFrame(
  rows,
  StructType(schema)
)
df.show()
df.printSchema()

To create a Spark vector column with createDataFrame, you can use the following code:
val rows = spark.sparkContext.parallelize(
  List(
    Row(1.0, org.apache.spark.mllib.linalg.Vectors.dense(1.0, 2.0))
  )
)
val schema = List(
  StructField("id", DoubleType, true),
  StructField("features", new org.apache.spark.mllib.linalg.VectorUDT, true)
)
val df = spark.createDataFrame(
  rows,
  StructType(schema)
)
df.show()
+---+---------+
| id| features|
+---+---------+
|1.0|[1.0,2.0]|
+---+---------+
df.printSchema()
root
|-- id: double (nullable = true)
|-- features: vector (nullable = true)
The actual issue was an incompatible type: a vector built with org.apache.spark.ml.linalg.Vectors.dense is not a valid external type for this schema of vector, so we have to switch to the mllib package instead of the ml package.
I hope it helps!
Note: I am using Spark v2.3.0. Also, the VectorUDT class in org.apache.spark.ml.linalg is package-private, so it cannot be instantiated directly (the compiler reports "class VectorUDT in package linalg cannot be accessed in package org.apache.spark.ml.linalg").
For reference - https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib
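If you would rather stay on the newer ml package, another option (a sketch, not taken from the answer above) is to get the vector DataType through org.apache.spark.ml.linalg.SQLDataTypes.VectorType, the public handle to the ml VectorUDT, and keep using ml vectors in the rows:
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Sketch: SQLDataTypes.VectorType exposes the ml VectorUDT as a public DataType,
// so the schema can be paired with org.apache.spark.ml.linalg vectors.
val rows = spark.sparkContext.parallelize(
  List(Row(1.0, Vectors.dense(1.0, 2.0)))
)
val schema = StructType(List(
  StructField("id", DoubleType, nullable = true),
  StructField("features", SQLDataTypes.VectorType, nullable = true)
))
val df = spark.createDataFrame(rows, schema)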

Related

Cannot filter a structure of Strings with spark

I'm trying to filter rows from a dataframe with this structure:
|-- age: integer (nullable = true)
|-- qty: integer (nullable = true)
|-- dates: array (nullable = true)
| |-- element: timestamp (containsNull = true)
For example, in this dataframe I only want the first row:
+---------+------------+------------------------------------------------------------------+
| age | qty |dates |
+---------+------------+------------------------------------------------------------------+
| 54 | 1| [2020-12-31 12:15:20, 2021-12-31 12:15:20] |
| 45 | 1| [2020-12-31 12:15:20, 2018-12-31 12:15:20, 2019-12-31 12:15:20] |
+---------+------------+------------------------------------------------------------------+
Here is my code:
val result = sqlContext
  .table("scores")
result.filter(array_contains(col("dates").cast("string"),
  2021)).show(false)
But I'm getting this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(
due to data type mismatch: Arguments must be an array followed by a value of same type as the array members;
Can anyone help please?
You need to use rlike to check if each array element contains 2021. array_contains checks for an exact match, not a partial match.
result.filter("array_max(transform(dates, x -> string(x) rlike '2021'))").show(false)
You can explode the ArrayType and then do your processing as you want: cast the column to String, then apply your filter:
import java.sql.Timestamp
import java.text.SimpleDateFormat

import org.apache.spark.sql.{Row, SparkSession, functions => f}
import org.apache.spark.sql.types._

val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("SparkByExamples")
  .getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

// Parse "yyyy-MM-dd HH:mm:ss" strings into java.sql.Timestamp values
def convertToTimeStamp(s: String) = {
  val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss") // HH = 24-hour clock
  val parsedDate = dateFormat.parse(s)
  new Timestamp(parsedDate.getTime)
}

val data = Seq(
  Row(54, 1, Array(convertToTimeStamp("2020-12-31 12:15:20"), convertToTimeStamp("2021-12-31 12:15:20"))),
  Row(45, 1, Array(convertToTimeStamp("2020-12-31 12:15:20"), convertToTimeStamp("2018-12-31 12:15:20"), convertToTimeStamp("2019-12-31 12:15:20")))
)
val Schema = StructType(Array(
  StructField("age", IntegerType, nullable = true),
  StructField("qty", IntegerType, nullable = true),
  StructField("dates", ArrayType(TimestampType, containsNull = true), nullable = true)
))
val rdd = spark.sparkContext.parallelize(data)
var df = spark.createDataFrame(rdd, Schema)
df.show()
df.printSchema()

// Explode the dates array, cast each element to String and filter on it
df = df.withColumn("exp", f.explode(f.col("dates")))
df.filter(f.col("exp").cast(StringType).contains("2021")).show()
You can use the exists function to check whether the dates array contains a date in year 2021:
df.filter("exists(dates, x -> year(x) = 2021)").show(false)
//+---+---+------------------------------------------+
//|age|qty|dates |
//+---+---+------------------------------------------+
//|54 |1 |[2020-12-31 12:15:20, 2021-12-31 12:15:20]|
//+---+---+------------------------------------------+
If you want to use array_contains, you need to transform the timestamp elements into their year first:
df.filter("array_contains(transform(dates, x -> year(x)), 2021)").show(false)

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions

I got this error while using the code below to drop a nested column with PySpark. Why is this not working? I tried using a tilde (~) for 'not', as the error suggests, but it doesn't work either. So what do you do in that case?
def drop_col(df, struct_nm, delete_struct_child_col_nm):
    fields_to_keep = filter(lambda x: x != delete_struct_child_col_nm,
                            df.select("{}.*".format(struct_nm)).columns)
    fields_to_keep = list(map(lambda x: "{}.{}".format(struct_nm, x), fields_to_keep))
    return df.withColumn(struct_nm, struct(fields_to_keep))
I built a simple example with a struct column and a few dummy columns:
from pyspark import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, lit, col, struct
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.getOrCreate()
sql_context = SQLContext(spark.sparkContext)
schema = StructType(
    [
        StructField('addresses',
                    StructType(
                        [StructField("state", StringType(), True),
                         StructField("street", StringType(), True),
                         StructField("country", StringType(), True),
                         StructField("code", IntegerType(), True)]
                    ))
    ]
)
rdd = [({'state': 'pa', 'street': 'market', 'country': 'USA', 'code': 100},),
({'state': 'ca', 'street': 'baker', 'country': 'USA', 'code': 101},)]
df = sql_context.createDataFrame(rdd, schema)
df = df.withColumn('id', monotonically_increasing_id())
df = df.withColumn('name', lit('test'))
print(df.show())
print(df.printSchema())
Output:
+--------------------+-----------+----+
| addresses| id|name|
+--------------------+-----------+----+
|[pa, market, USA,...| 8589934592|test|
|[ca, baker, USA, ...|25769803776|test|
+--------------------+-----------+----+
root
|-- addresses: struct (nullable = true)
| |-- state: string (nullable = true)
| |-- street: string (nullable = true)
| |-- country: string (nullable = true)
| |-- code: integer (nullable = true)
|-- id: long (nullable = false)
|-- name: string (nullable = false)
To drop the whole struct column, you can simply use the drop function:
df2 = df.drop('addresses')
print(df2.show())
Output:
+-----------+----+
| id|name|
+-----------+----+
| 8589934592|test|
|25769803776|test|
+-----------+----+
To drop specific fields in a struct column, it's a bit more complicated - there are some other similar questions here:
Dropping a nested column from Spark DataFrame
Dropping nested column of Dataframe with PySpark
In any case, I found them to be a bit complicated - my approach would just be to reassign the original column with the subset of struct fields you want to keep:
columns_to_keep = ['country', 'code']
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:
+----------+-----------+----+
| addresses| id|name|
+----------+-----------+----+
|[USA, 100]| 8589934592|test|
|[USA, 101]|25769803776|test|
+----------+-----------+----+
Alternatively, if you just wanted to specify the columns you want to remove rather than the columns you want to keep:
columns_to_remove = ['country', 'code']
all_columns = df.select("addresses.*").columns
columns_to_keep = list(set(all_columns) - set(columns_to_remove))
df = df.withColumn('addresses', struct(*[f"addresses.{column}" for column in columns_to_keep]))
Output:
+------------+-----------+----+
| addresses| id|name|
+------------+-----------+----+
|[pa, market]| 8589934592|test|
| [ca, baker]|25769803776|test|
+------------+-----------+----+
Hope this helps!

Is it possible to create a StructField of tuple type using PySpark?

I need to create a schema for a DataFrame in Spark. I have no problem creating regular StructFields, such as StringType or IntegerType. However, I want to create a StructField for a tuple.
I have tried the following:
StructType([
    StructField("dst_ip", StringType()),
    StructField("port", StringType())
])
However, it throws an error
"list object has no attribute 'name'"
Is it possible to create a StructField for a tuple type?
You can define a StructType inside of a StructField:
from pyspark.sql.types import StructType, StructField, StringType, FloatType

schema = StructType(
    [
        StructField(
            "myTuple",
            StructType(
                [
                    StructField("dst_ip", StringType()),
                    StructField("port", StringType())
                ]
            )
        )
    ]
)
df = sqlCtx.createDataFrame([], schema)
df.printSchema()
#root
# |-- myTuple: struct (nullable = true)
# | |-- dst_ip: string (nullable = true)
# | |-- port: string (nullable = true)
The StructType class, used to define the structure of a DataFrame, is the data type representing a Row; it consists of a list of StructFields.
To define a tuple data type for a column (say columnA), you wrap a StructType describing the tuple's elements inside a StructField. Note that StructFields need to have names, since they represent columns.
Define the tuple StructField as a nested StructType:
columnA = StructField('columnA', StructType([
    StructField("dst_ip", StringType()),
    StructField("port", StringType())
]))
Define schema containing columnA and columnB (of type FloatType):
mySchema = StructType([ columnA, StructField("columnB", FloatType())])
Apply schema to dataframe:
data =[{'columnA': ('x', 'y'), 'columnB': 1.0}]
# data = [Row(columnA=('x', 'y'), columnB=1.0)] (needs from pyspark.sql import Row)
df = spark.createDataFrame(data, mySchema)
df.printSchema()
# root
# |-- columnA: struct (nullable = true)
# | |-- dst_ip: string (nullable = true)
# | |-- port: string (nullable = true)
# |-- columnB: float (nullable = true)
Show dataframe:
df.show()
# +-------+-------+
# |columnA|columnB|
# +-------+-------+
# | [x, y]| 1.0|
# +-------+-------+
(this is just the longer version of the other answer)

Casting type of columns in a dataframe

My Spark program needs to read a file which contains a matrix of integers. Columns are separated by ",". The number of columns is not the same each time I run the program.
I read the file as a dataframe:
var df = spark.read.csv(originalPath);
but when I print the schema, it shows all the columns as Strings.
I convert all columns to Integers as below, but when I print the schema of df again afterwards, the columns are still Strings.
df.columns.foreach(x => df.withColumn(x + "_new", df.col(x).cast(IntegerType))
.drop(x).withColumnRenamed(x + "_new", x));
I appreciate any help to solve the issue of casting.
Thanks.
DataFrames are immutable. Your code creates a new DataFrame for each column and then discards it.
It is best to use map and select:
val newDF = df.select(df.columns.map(c => df.col(c).cast("integer")): _*)
but you could also use foldLeft:
df.columns.foldLeft(df)((df, x) => df.withColumn(x , df.col(x).cast("integer")))
or even (please don't) a mutable reference:
var df = Seq(("1", "2", "3")).toDF
df.columns.foreach(x => df = df.withColumn(x , df.col(x).cast("integer")))
Or, since as you mentioned the number of columns is not the same each time, you could take the highest possible number of columns and build a schema out of it, with IntegerType as the column type. Apply this schema when loading the file so that the columns are converted from string to integer automatically; no explicit conversion is required in this case.
import org.apache.spark.sql.types._

val csvSchema = StructType(Array(
  StructField("_c0", IntegerType, true),
  StructField("_c1", IntegerType, true),
  StructField("_c2", IntegerType, true),
  StructField("_c3", IntegerType, true)))
val df = spark.read.schema(csvSchema).csv(originalPath)
scala> df.printSchema
root
|-- _c0: integer (nullable = true)
|-- _c1: integer (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: integer (nullable = true)
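Since the column count changes between runs, you could also build that schema programmatically instead of hard-coding it. A rough sketch (not from the answer above; it assumes the first line of the file already has the full set of columns):
import org.apache.spark.sql.types._

// Count the columns on the first line and build an all-integer schema from it.
val firstLine = spark.read.textFile(originalPath).first()
val numCols = firstLine.split(",", -1).length
val csvSchema = StructType((0 until numCols).map(i => StructField(s"_c$i", IntegerType, nullable = true)))
val df = spark.read.schema(csvSchema).csv(originalPath)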

How to automate StructType creation for passing RDD to DataFrame

I want to save an RDD as a Parquet file. To do this, I convert the RDD to a DataFrame using a StructType and then save the DataFrame as a Parquet file:
val aStruct = new StructType(Array(
  StructField("id", StringType, nullable = true),
  StructField("role", StringType, nullable = true)))
val newDF = sqlContext.createDataFrame(filtered, aStruct)
The question is: how can I create aStruct automatically for all columns, assuming that all of them are StringType? Also, what is the meaning of nullable = true? Does it mean that all empty values will be replaced by null?
Why not use the built-in toDF?
scala> val myRDD = sc.parallelize(Seq(("1", "roleA"), ("2", "roleB"), ("3", "roleC")))
myRDD: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[60] at parallelize at <console>:27
scala> val colNames = List("id", "role")
colNames: List[String] = List(id, role)
scala> val myDF = myRDD.toDF(colNames: _*)
myDF: org.apache.spark.sql.DataFrame = [id: string, role: string]
scala> myDF.show
+---+-----+
| id| role|
+---+-----+
| 1|roleA|
| 2|roleB|
| 3|roleC|
+---+-----+
scala> myDF.printSchema
root
|-- id: string (nullable = true)
|-- role: string (nullable = true)
scala> myDF.write.save("myDF.parquet")
The nullable = true simply means that the specified column can contain null values (this is especially relevant for int columns, which would normally not have a null value, since Int has no NA or null).
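If you do need an explicit StructType (for example, to pass an RDD[Row] to createDataFrame), you can also build it from a list of column names instead of writing every StructField by hand. A minimal sketch, assuming all columns are StringType:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build one nullable StringType field per column name.
val colNames = List("id", "role")
val aStruct = StructType(colNames.map(name => StructField(name, StringType, nullable = true)))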