NullPointerException when running CountVectorizer from Scala [duplicate]

I have a column in my Spark DataFrame:
|-- topics_A: array (nullable = true)
| |-- element: string (containsNull = true)
I'm using CountVectorizer on it:
topic_vectorizer_A = CountVectorizer(inputCol="topics_A", outputCol="topics_vec_A")
I get NullPointerExceptions, because sometimes the topics_A column contains null.
Is there a way around this? Filling it with a zero-length array would work OK (although it will blow out the data size quite a lot), but I can't work out how to do a fillna on an array column in PySpark.

Personally I would drop rows with NULL values because there is no useful information there, but you can replace the nulls with empty arrays. First, some imports:
from pyspark.sql.functions import when, col, coalesce, array
You can define an empty array of a specific type as:
fill = array().cast("array<string>")
and combine it with a when clause:
topics_a = when(col("topics_A").isNull(), fill).otherwise(col("topics_A"))
or coalesce:
topics_a = coalesce(col("topics_A"), fill)
and use it as:
df.withColumn("topics_A", topics_a)
so with example data:
df = sc.parallelize([(1, ["a", "b"]), (2, None)]).toDF(["id", "topics_A"])
df_ = df.withColumn("topics_A", topics_a)
topic_vectorizer_A.fit(df_).transform(df_)
the result will be:
+---+--------+-------------------+
| id|topics_A| topics_vec_A|
+---+--------+-------------------+
| 1| [a, b]|(2,[0,1],[1.0,1.0])|
| 2| []| (2,[],[])|
+---+--------+-------------------+
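As a side note, df.fillna (df.na.fill) only accepts numeric, string, and boolean replacement values, which is why the when / coalesce trick above is needed for an array column.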

I had a similar issue; based on the comment above, I used the following syntax to remove the null values before tokenizing:
clean_text_ddf.where(col("title").isNull()).show()
cleaned_text=clean_text_ddf.na.drop(subset=["title"])
cleaned_text.where(col("title").isNull()).show()
cleaned_text.printSchema()
cleaned_text.show(2)
+-----+
|title|
+-----+
+-----+
+-----+
|title|
+-----+
+-----+
root
|-- title: string (nullable = true)
+--------------------+
| title|
+--------------------+
|Mr. Beautiful (Up...|
|House of Ravens (...|
+--------------------+
only showing top 2 rows
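Applied to the original question, the same drop-the-nulls approach would look like this before fitting (a minimal sketch, reusing the column and vectorizer names from the question):
from pyspark.sql.functions import col

# Drop the rows where the array column is null, then fit/transform as usual
df_ = df.na.drop(subset=["topics_A"])
topic_vectorizer_A.fit(df_).transform(df_)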

Related

Parse through each element of an array in pyspark and apply substring

Hi, I have a PySpark dataframe with an array column shown below.
I want to iterate through each element, fetch only the string prior to the hyphen, and create another column.
+------------------------------+
|array_col |
+------------------------------+
|[hello-123, abc-111] |
|[hello-234, def-22, xyz-33] |
|[hiiii-111, def2-333, lmn-222]|
+------------------------------+
Desired output:
+------------------------------+--------------------+
|col1 |new_column |
+------------------------------+--------------------+
|[hello-123, abc-111] |[hello, abc] |
|[hello-234, def-22, xyz-33] |[hello, def, xyz] |
|[hiiii-111, def2-333, lmn-222]|[hiiii, def2, lmn] |
+------------------------------+--------------------+
I am trying something like the below, but I could not work out how to apply a regex/substring inside the UDF.
cust_udf = udf(lambda arr: [x for x in arr],ArrayType(StringType()))
df1.withColumn('new_column', cust_udf(col("col1")))
Can anyone please help with this? Thanks.
From Spark 2.4 onwards, use the transform higher-order function.
Example:
df.show(10,False)
#+---------------------------+
#|array_col |
#+---------------------------+
#|[hello-123, abc-111] |
#|[hello-234, def-22, xyz-33]|
#+---------------------------+
df.printSchema()
#root
# |-- array_col: array (nullable = true)
# | |-- element: string (containsNull = true)
from pyspark.sql.functions import *
df.withColumn("new_column",expr('transform(array_col,x -> split(x,"-")[0])')).\
show()
#+--------------------+-----------------+
#| array_col| new_column|
#+--------------------+-----------------+
#|[hello-123, abc-111]| [hello, abc]|
#|[hello-234, def-2...|[hello, def, xyz]|
#+--------------------+-----------------+
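For reference, the UDF approach from the question also works once the split is applied inside the lambda; a minimal sketch, assuming the column is named array_col as in the sample data above (swap in col1 if that is your actual column name):
from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

# Keep only the part of each element before the first hyphen
cust_udf = udf(lambda arr: [x.split("-")[0] for x in arr] if arr is not None else None,
               ArrayType(StringType()))

df.withColumn("new_column", cust_udf(col("array_col"))).show(truncate=False)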

How can I split a column containing array of some struct into separate columns?

I have the following scenario:
case class attribute(key:String,value:String)
case class entity(id:String,attr:List[attribute])
val entities = List(entity("1",List(attribute("name","sasha"),attribute("home","del"))),
entity("2",List(attribute("home","hyd"))))
val df = entities.toDF()
// df.show
+---+--------------------+
| id| attr|
+---+--------------------+
| 1|[[name,sasha], [d...|
| 2| [[home,hyd]]|
+---+--------------------+
//df.printSchema
root
|-- id: string (nullable = true)
|-- attr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
what I want to produce is:
+---+-----+----+
| id| name|home|
+---+-----+----+
|  1|sasha| del|
|  2| null| hyd|
+---+-----+----+
How do I go about this? I looked at quite a few similar questions on Stack Overflow but couldn't find anything useful.
My main motive is to do a groupBy on the different attributes, so I want to bring the data into the above-mentioned format.
I looked into the explode functionality, but it breaks a list down into separate rows, which I don't want. I want to create more columns from the array of attributes.
Similar things I found:
Spark - convert Map to a single-row DataFrame
Split 1 column into 3 columns in spark scala
Spark dataframe - Split struct column into 2 columns
This can easily be reduced to PySpark converting a column of type 'map' to multiple columns in a dataframe or How to get keys and values from MapType column in SparkSQL DataFrame. First, convert attr to a map<string, string>:
import org.apache.spark.sql.functions.{explode, first, map_from_entries, map_keys}
val dfMap = df.withColumn("attr", map_from_entries($"attr"))
then it's just a matter of finding the unique keys
val keys = dfMap.select(explode(map_keys($"attr"))).as[String].distinct.collect
then selecting from the map
val result = dfMap.select($"id" +: keys.map(key => $"attr"(key) as key): _*)
result.show
+---+-----+----+
| id| name|home|
+---+-----+----+
| 1|sasha| del|
| 2| null| hyd|
+---+-----+----+
A less efficient but more concise variant is to explode and pivot:
val result = df
.select($"id", explode(map_from_entries($"attr")))
.groupBy($"id")
.pivot($"key")
.agg(first($"value"))
result.show
+---+----+-----+
| id|home| name|
+---+----+-----+
| 1| del|sasha|
| 2| hyd| null|
+---+----+-----+
but in practice I'd advise against it.
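For anyone working from PySpark, a rough equivalent of the recommended map-based approach (a sketch, assuming Spark 2.4+ for map_from_entries):
from pyspark.sql.functions import col, explode, map_from_entries, map_keys

# Convert the array<struct<key,value>> column to a map, collect the distinct keys,
# then select one column per key
df_map = df.withColumn("attr", map_from_entries(col("attr")))
keys = [r[0] for r in df_map.select(explode(map_keys(col("attr")))).distinct().collect()]
result = df_map.select("id", *[col("attr")[k].alias(k) for k in keys])
result.show()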

How convert Spark dataframe column from Array[Int] to linalg.Vector?

I have a dataframe, df, that looks like this:
+--------+--------------------+
| user_id| is_following|
+--------+--------------------+
| 1|[2, 3, 4, 5, 6, 7] |
| 2|[20, 30, 40, 50] |
+--------+--------------------+
I can confirm this has the schema:
root
|-- user_id: integer (nullable = true)
|-- is_following: array (nullable = true)
| |-- element: integer (containsNull = true)
I would like to use Spark's ML routines such as LDA to do some machine learning on this, requiring me to convert the is_following column to a linalg.Vector (not a Scala vector). When I try to do this via
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val assembler = new VectorAssembler().setInputCols(Array("is_following")).setOutputCol("features")
val output = assembler.transform(df)
I then get the following error:
java.lang.IllegalArgumentException: Data type ArrayType(IntegerType,true) is not supported.
If I am interpreting that correctly, I take away from it that I need to convert types here from integer to something else. (Double? String?)
My question is what is the best way to convert this array to something that will properly vectorize for the ML pipeline?
EDIT: If it helps, I don't have to structure the dataframe this way. I could instead have it be:
+--------+------------+
| user_id|is_following|
+--------+------------+
| 1| 2|
| 1| 3|
| 1| 4|
| 1| 5|
| 1| 6|
| 1| 7|
| 2| 20|
| ...| ...|
+--------+------------+
A simple solution that both converts the array into a linalg.Vector and converts the integers into doubles at the same time is to use a UDF.
Using your dataframe:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import org.apache.spark.ml.linalg.Vectors

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = spark.createDataFrame(Seq((1, Array(2, 3, 4, 5, 6, 7)), (2, Array(20, 30, 40, 50))))
  .toDF("user_id", "is_following")

val convertToVector = udf((array: Seq[Int]) => {
  Vectors.dense(array.map(_.toDouble).toArray)
})

val df2 = df.withColumn("is_following", convertToVector($"is_following"))
spark.implicits._ is imported here to allow the use of $; col() or ' could be used instead.
Printing the df2 dataframe will give the wanted results:
+-------+-------------------------+
|user_id|is_following |
+-------+-------------------------+
|1 |[2.0,3.0,4.0,5.0,6.0,7.0]|
|2 |[20.0,30.0,40.0,50.0] |
+-------+-------------------------+
schema:
root
|-- user_id: integer (nullable = false)
|-- is_following: vector (nullable = true)
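A side note for PySpark users: Spark 3.1+ also ships a built-in array_to_vector helper that does this conversion directly (a minimal sketch, assuming that version):
from pyspark.ml.functions import array_to_vector
from pyspark.sql.functions import col

# Cast the integer array to doubles, then convert it to an ML vector column
df2 = df.withColumn("is_following",
                    array_to_vector(col("is_following").cast("array<double>")))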
So your initial input might be better suited than your transformed input. Spark's VectorAssembler requires that all of the columns be Doubles, not arrays of doubles. Since different users could follow different numbers of people, your current structure could be good; you just need to convert is_following into a Double. You could in fact do this with Spark's VectorIndexer https://spark.apache.org/docs/2.1.0/ml-features.html#vectorindexer or just do it manually in SQL.
So the tl;dr is: the type error is because Spark's Vectors only support Doubles (this is likely changing for image data in the not-so-distant future, but that isn't a good fit for your use case anyway), and your alternative structure might actually be better suited (the one without the grouping).
You might find the collaborative filtering example in the Spark documentation useful on your further adventures: https://spark.apache.org/docs/latest/ml-collaborative-filtering.html. Good luck and have fun with Spark ML :)
Edit: I noticed you said you're looking to do LDA on the inputs, so let's also look at how to prepare the data for that format. For LDA input you might want to consider using CountVectorizer (see https://spark.apache.org/docs/2.1.0/ml-features.html#countvectorizer).
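A minimal PySpark sketch of that preparation, assuming the dataframe from the question (the Scala API is analogous): CountVectorizer expects an array of strings, so cast the integer IDs first.
from pyspark.ml.feature import CountVectorizer
from pyspark.sql.functions import col

# Cast array<int> to array<string>, then build the term-count vectors LDA expects
df_str = df.withColumn("is_following_str", col("is_following").cast("array<string>"))
cv = CountVectorizer(inputCol="is_following_str", outputCol="features")
features = cv.fit(df_str).transform(df_str)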

spark sql cast function creates column with NULLS

I have the following dataframe and schema in Spark
val df = spark.read.options(Map("header"-> "true")).csv("path")
scala> df.show()
+-------+-------+-----+
| user| topic| hits|
+-------+-------+-----+
| om| scala| 120|
| daniel| spark| 80|
|3754978| spark| 1|
+-------+-------+-----+
scala> df.printSchema
root
|-- user: string (nullable = true)
|-- topic: string (nullable = true)
|-- hits: string (nullable = true)
I want to change the column hits to integer
I tried this:
scala> df.createOrReplaceTempView("test")
val dfNew = spark.sql("select *, cast('hist' as integer) as hist2 from test")
scala> dfNew.printSchema
root
|-- user: string (nullable = true)
|-- topic: string (nullable = true)
|-- hits: string (nullable = true)
|-- hist2: integer (nullable = true)
but when I print the dataframe, the column hist2 is filled with NULLs
scala> dfNew.show()
+-------+-------+-----+-----+
| user| topic| hits|hist2|
+-------+-------+-----+-----+
| om| scala| 120| null|
| daniel| spark| 80| null|
|3754978| spark| 1| null|
+-------+-------+-----+-----+
I also tried this:
scala> val df2 = df.withColumn("hitsTmp",
df.hits.cast(IntegerType)).drop("hits"
).withColumnRenamed("hitsTmp", "hits")
and got this:
<console>:26: error: value hits is not a member of org.apache.spark.sql.DataFrame
Also tried this:
scala> val df2 = df.selectExpr ("user","topic","cast(hits as int) hits")
and got this:
org.apache.spark.sql.AnalysisException: cannot resolve '`topic`' given input columns: [user, topic, hits]; line 1 pos 0;
'Project [user#0, 'topic, cast('hits as int) AS hits#22]
+- Relation[user#0, topic#1, hits#2] csv
with
scala> val df2 = df.selectExpr ("cast(hits as int) hits")
I get a similar error.
Any help will be appreciated. I know this question has been addressed before, but I tried 3 different approaches (shown here) and none of them works.
Thanks.
How do we make the Spark cast throw an exception instead of generating all the null values?
Do I have to calculate the total number of null values before & after the cast in order to see if the cast is actually successful?
This post How to test datatype conversion during casting is doing that. I wonder if there is a better solution here.
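If you are on Spark 3.0 or later, one option is ANSI mode, which makes an invalid cast raise an error instead of silently producing null (a minimal sketch, assuming that version):
# With ANSI mode enabled, casting a non-numeric string to int throws at runtime
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("select cast(hits as int) as hits2 from test").show()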
You can cast a column to integer type in the following ways:
df.withColumn("hits", df("hits").cast("integer"))
Or
data.withColumn("hitsTmp",
data("hits").cast(IntegerType)).drop("hits").
withColumnRenamed("hitsTmp", "hits")
Or
data.selectExpr ("user","topic","cast(hits as int) hits")
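Note that the second variant accesses the column as data("hits") rather than df.hits: the df.hits attribute syntax is PySpark, not Scala, which is why the original attempt failed with value hits is not a member of org.apache.spark.sql.DataFrame.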
I know that this answer probably won't be useful for the OP since it comes with a ~2 year delay. It might however be helpful for someone facing this problem.
Just like you, I had a dataframe with a column of strings which I was trying to cast to integer:
> df.show
+-------+
| id|
+-------+
|4918088|
|4918111|
|4918154|
...
> df.printSchema
root
|-- id: string (nullable = true)
But after doing the cast to IntegerType the only thing I obtained, just as you did, was a column of nulls:
> df.withColumn("test", $"id".cast(IntegerType))
.select("id","test")
.show
+-------+----+
| id|test|
+-------+----+
|4918088|null|
|4918111|null|
|4918154|null|
...
By default, if you try to cast a string that contains non-numeric characters to integer, the cast of the column won't fail, but those values will be set to null, as you can see in the following example:
> val testDf = sc.parallelize(Seq(("1"), ("2"), ("3A") )).toDF("n_str")
> testDf.withColumn("n_int", $"n_str".cast(IntegerType))
.select("n_str","n_int")
.show
+-----+-----+
|n_str|n_int|
+-----+-----+
| 1| 1|
| 2| 2|
| 3A| null|
+-----+-----+
The thing with our initial dataframe is that, at first sight, when we use the show method, we can't see any non-numeric character. However, if you look at a row from your dataframe you'll see something different:
> df.first
org.apache.spark.sql.Row = [4?9?1?8?0?8?8??]
Why is this happening? You are probably reading a CSV file with an unsupported encoding.
You can solve this by changing the encoding of the file you are reading. If that is not an option you can also clean each column before doing the type cast. An example:
> val df_cast = df.withColumn("test", regexp_replace($"id", "[^0-9]","").cast(IntegerType))
.select("id","test")
> df_cast.show
+-------+-------+
| id| test|
+-------+-------+
|4918088|4918088|
|4918111|4918111|
|4918154|4918154|
...
> df_cast.printSchema
root
|-- id: string (nullable = true)
|-- test: integer (nullable = true)
Try removing the quotes around hist (and note that the column is actually named hits, not hist).
If that does not work, then try trimming the column:
val dfNew = spark.sql("select *, cast(trim(hits) as integer) as hist2 from test")
The response is delayed, but I was facing the same issue and this worked, so I thought I'd put it over here; it might be of help to someone.
Try declaring the schema as a StructType. Reading from CSV files and relying on an inferred schema (or a case class) can give weird data-type errors, even though all the data formats are properly specified.
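A minimal PySpark sketch of that idea, assuming the columns from the question (the Scala StructType API is analogous):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare the schema explicitly instead of inferring it
schema = StructType([
    StructField("user", StringType(), True),
    StructField("topic", StringType(), True),
    StructField("hits", IntegerType(), True),
])

df = spark.read.option("header", "true").schema(schema).csv("path")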
I had a similar problem where I was casting strings to integers, but I realized I needed to cast them to longs instead. It was hard to realize this at first, since my column's type showed as int when I printed the types using:
print(df.dtypes)
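A minimal sketch of that cast in PySpark, assuming a string column named id as in the answer above:
from pyspark.sql.functions import col

# Values that overflow 32 bits come back null from an int cast, so cast to long instead
df = df.withColumn("id", col("id").cast("long"))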

Assign label to categorical data in a table in PySpark

I want to assign labels to the categorical numbers in the dataframe below using PySpark SQL.
In the MARRIAGE column, 1 = Married and 2 = UnMarried. In the EDUCATION column, 1 = Grad and 2 = UnderGrad.
Current Dataframe:
+--------+---------+-----+
|MARRIAGE|EDUCATION|Total|
+--------+---------+-----+
| 1| 2| 87|
| 1| 1| 123|
| 2| 2| 3|
| 2| 1| 8|
+--------+---------+-----+
Resulting Dataframe:
+---------+---------+-----+
|MARRIAGE |EDUCATION|Total|
+---------+---------+-----+
|Married |Grad | 87|
|Married |UnderGrad| 123|
|UnMarried|Grad | 3|
|UnMarried|UnderGrad| 8|
+---------+---------+-----+
Is it possible to assign the labels using a single UDF and withColumn()? Is there any way to assign them in a single UDF by passing the whole dataframe and keeping the column names as they are?
I can think of a solution that does the operation on each column using separate UDFs, as below, but I can't figure out whether there's a way to do it together.
from pyspark.sql import functions as F
def assign_marital_names(record):
    if record == 1:
        return "Married"
    elif record == 2:
        return "UnMarried"

def assign_edu_names(record):
    if record == 1:
        return "Grad"
    elif record == 2:
        return "UnderGrad"

assign_marital_udf = F.udf(assign_marital_names)
assign_edu_udf = F.udf(assign_edu_names)

df.withColumn("MARRIAGE", assign_marital_udf("MARRIAGE")).\
    withColumn("EDUCATION", assign_edu_udf("EDUCATION")).show(truncate=False)
One UDF can result in only one column, but that column can be a struct, and the UDF can apply labels to both marriage and education. See the code below:
from pyspark.sql.types import *
from pyspark.sql import Row

udf_result = StructType([StructField('MARRIAGE', StringType()), StructField('EDUCATION', StringType())])

marriage_dict = {1: 'Married', 2: 'UnMarried'}
education_dict = {1: 'Grad', 2: 'UnderGrad'}

def assign_labels(marriage, education):
    return Row(marriage_dict[marriage], education_dict[education])

assign_labels_udf = F.udf(assign_labels, udf_result)

df.withColumn('labels', assign_labels_udf('MARRIAGE', 'EDUCATION')).printSchema()
root
|-- MARRIAGE: long (nullable = true)
|-- EDUCATION: long (nullable = true)
|-- Total: long (nullable = true)
|-- labels: struct (nullable = true)
| |-- MARRIAGE: string (nullable = true)
| |-- EDUCATION: string (nullable = true)
But as you can see, this does not replace the original columns; it just adds a new one. To replace them, you will need to use withColumn twice and then drop the labels column.
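A minimal sketch of that last step, reusing the names defined above:
from pyspark.sql import functions as F

# Overwrite the original columns from the struct, then drop the intermediate column
df_labeled = (df
    .withColumn("labels", assign_labels_udf("MARRIAGE", "EDUCATION"))
    .withColumn("MARRIAGE", F.col("labels.MARRIAGE"))
    .withColumn("EDUCATION", F.col("labels.EDUCATION"))
    .drop("labels"))

df_labeled.show(truncate=False)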