spark error in column type - pyspark

I have a dataframe column called 'SupplierId', typed as a string, that contains mostly digits but also some character strings
(e.g. ['123','456','789',......,'abc']).
I cast this column to string using
from pyspark.sql.types import StringType
df = df.withColumn('SupplierId', df['SupplierId'].cast(StringType()))
Then I check that it is treated as a string using:
df.printSchema()
and I get:
root
|-- SupplierId: string (nullable = true)
But when I try to convert it to Pandas, or just use df.collect(),
I get the following error:
An error occurred while calling o516.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11, servername.ops.somecompany.local, executor 3):
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException:
Exception parsing 'CPD160001' into a IntegerType$ for column "SupplierId":
Unable to deserialize value using com.somecompany.spark.parsers.text.converters.IntegerConverter.
The value being deserialized was: CPD160001
So it seems Spark treats the values of this column as integers.
I have tried using a UDF to force the conversion to string in Python, but it still doesn't work.
Do you have any idea what could cause this error?

Please do share a sample of your actual data, as your issue cannot be reproduced with toy data:
spark.version
# u'2.2.0'
from pyspark.sql import Row
df = spark.createDataFrame([Row(1, 2, '3'),
                            Row(4, 5, 'a'),
                            Row(7, 8, '9')],
                           ['x1', 'x2', 'id'])
df.printSchema()
# root
# |-- x1: long (nullable = true)
# |-- x2: long (nullable = true)
# |-- id: string (nullable = true)
df.collect()
# [Row(x1=1, x2=2, id=u'3'), Row(x1=4, x2=5, id=u'a'), Row(x1=7, x2=8, id=u'9')]
import pandas as pd
df_pandas = df.toPandas()
df_pandas
#    x1  x2 id
# 0   1   2  3
# 1   4   5  a
# 2   7   8  9
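If the data comes from a source that applies its own schema while parsing (the trace mentions com.somecompany.spark.parsers.text.converters.IntegerConverter, which runs when the input is read, before any later .cast()), one thing worth trying is declaring SupplierId as a string in the read schema itself. A minimal sketch, assuming purely for illustration that the input is a CSV file at a hypothetical path:
from pyspark.sql.types import StructType, StructField, StringType
# Hypothetical read schema: SupplierId declared as a string up front,
# so the source parser never tries to turn 'CPD160001' into an integer
schema = StructType([
    StructField("SupplierId", StringType(), True),
    # ... remaining columns of the file would be declared here
])
df = (spark.read
      .schema(schema)
      .option("header", True)
      .csv("/path/to/suppliers.csv"))  # hypothetical path and format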

PySpark DataFrame When to use/ not to use Select

Based on the PySpark documentation:
A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SQLContext
This means I can use select to show the values of a column. However, I have seen that sometimes these two equivalent pieces of code are used instead:
# df is a sample DataFrame with column a
df.a
# or
df['a']
And sometimes when I use select I get an error where these work, and vice versa: sometimes I have to use select.
For example, this is a DataFrame from a problem of finding a dog in a given picture:
joined_df.printSchema()
root
|-- folder: string (nullable = true)
|-- filename: string (nullable = true)
|-- width: string (nullable = true)
|-- height: string (nullable = true)
|-- dog_list: array (nullable = true)
| |-- element: string (containsNull = true)
If I want to select the dog details and show 10 rows, this code throws an error:
print(joined_df.dog_list.show(truncate=False))
Traceback (most recent call last):
 File "<stdin>", line 2, in <module>
    print(joined_df.dog_list.show(truncate=False))
TypeError: 'Column' object is not callable
And this one does not:
print(joined_df.select('dog_list').show(truncate=False))
Question 1: When do I have to use select, and when do I have to use df.a or df["a"]?
Question 2: What is the meaning of the error above: 'Column' object is not callable?
df.col_name returns a Column object, while df.select("col_name") returns another DataFrame.
See this for documentation.
The key here is that those two methods return two different objects, which is why print(joined_df.dog_list.show(truncate=False)) gives you the error: the Column object does not have a .show method, but the DataFrame does.
So when you call a function that takes a Column as input, use df.col_name; if you want to operate at the DataFrame level, use df.select("col_name").
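A minimal sketch of the difference, using a small hypothetical DataFrame shaped like joined_df: a Column is something you pass into functions and expressions, while select always hands back a DataFrame that you can .show():
import pyspark.sql.functions as F
# Hypothetical toy data with the same columns as joined_df
toy_df = spark.createDataFrame(
    [("f1", "img1.jpg", ["beagle", "husky"]),
     ("f2", "img2.jpg", ["poodle"])],
    ["folder", "filename", "dog_list"])
# Column object: feed it to a function such as F.size, then select the result
toy_df.select(F.size(toy_df.dog_list).alias("n_dogs")).show()
# select returns a DataFrame, which does have .show()
toy_df.select("dog_list").show(truncate=False)
# toy_df.dog_list.show()   # fails: a Column has no .show() method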

unable to save decimal value in decimal type in pyspark dataframe

I'm trying to load a JSON record into a dataframe using pyspark. The JSON has a decimal value, and in the schema I have also defined that field as DecimalType, but when creating the data frame, Spark throws an exception: TypeError: field pr: DecimalType(3,1) can not accept object 20.0 in type
from pyspark.sql.types import StructType, StructField, StringType, DecimalType
r = {'name': 'wellreading', 'pr': 20.0}
distData = sc.parallelize([r])
schema = StructType([StructField('name', StringType(), True),
                     StructField('pr', DecimalType(3, 1), True)])
df = spark.createDataFrame(distData, schema)
df.collect()
Here I have given sample code, but I am unable to understand how Spark determines that 20.0 is a float and cannot be stored in a decimal type.
One of the quick solutions (not sure if it is the best one) is to read your JSON file directly into a data frame and then perform whatever conversion you like, e.g.:
from pyspark.sql.types import DecimalType
from pyspark.sql.functions import col
df1 = spark.read.json("/tmp/test.json")
df2 = df1.select(col('name'),col('pr').cast(DecimalType(3,1)).alias('pr'))
df2.printSchema()
root
|-- name: string (nullable = true)
|-- pr: decimal(3,1) (nullable = true)
OR
df2 = df1.withColumn("pr",df1.pr.cast(DecimalType(3,1)))
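As for why Spark rejects 20.0 in the first place: on the Python side DecimalType is represented by decimal.Decimal, and createDataFrame verifies each value against the declared field type, so a plain float is refused rather than silently converted. A minimal sketch of the original approach, with the value wrapped in decimal.Decimal so that it should pass the verification:
from decimal import Decimal
from pyspark.sql.types import StructType, StructField, StringType, DecimalType
# Wrap the value in decimal.Decimal so it matches DecimalType(3,1) during verification
r = {'name': 'wellreading', 'pr': Decimal('20.0')}
schema = StructType([StructField('name', StringType(), True),
                     StructField('pr', DecimalType(3, 1), True)])
df = spark.createDataFrame(sc.parallelize([r]), schema)
df.collect()
# e.g. [Row(name='wellreading', pr=Decimal('20.0'))]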

How to partition a table in scala with the proper name

I have a large DataFrame in Scala (Spark 2.4.0) that looks like this:
+--------------------+--------------------+--------------------+-------------------+--------------+------+
| cookie| updated_score| probability| date_last_score|partition_date|target|
+--------------------+--------------------+--------------------+-------------------+--------------+------+
|00000000000001074780| 0.1110987111481027| 0.27492987342938174|2019-03-29 16:00:00| 2019-04-07_10| 0|
|00000000000001673799| 0.02621894072693878| 0.2029688362968775|2019-03-19 08:00:00| 2019-04-07_10| 0|
|00000000000002147908| 0.18922034021212567| 0.3520678649755828|2019-03-31 19:00:00| 2019-04-09_12| 1|
|00000000000004028302| 0.06803669083452231| 0.23089047208736854|2019-03-25 17:00:00| 2019-04-07_10| 0|
and this schema:
root
|-- cookie: string (nullable = true)
|-- updated_score: double (nullable = true)
|-- probability: double (nullable = true)
|-- date_last_score: string (nullable = true)
|-- partition_date: string (nullable = true)
|-- target: integer (nullable = false)
Then I create a partitioned table and insert the data into database.table_name. But when I go to the Hive database and run show partitions database.table_name, I only get partition_date=0 and partition_date=1, and 0 and 1 are not values from the partition_date column.
I don't know whether I wrote something wrong, whether there are some Scala concepts I don't understand, or whether the dataframe is simply too large.
I've tried different ways to do this, looking at similar questions, such as:
result_df.write.mode(SaveMode.Overwrite).insertInto("table_name")
or
result_df.write.mode(SaveMode.Overwrite).saveAsTable("table_name")
In case it helps, here are some INFO messages from the Scala job.
Looking at these messages, I think my result_df is partitioned properly.
19/07/31 07:53:57 INFO TaskSetManager: Starting task 11.0 in stage 2822.0 (TID 123456, ip-xx-xx-xx.aws.local.somewhere, executor 45, partition 11, PROCESS_LOCAL, 7767 bytes)
19/07/31 07:53:57 INFO TaskSetManager: Starting task 61.0 in stage 2815.0 (TID 123457, ip-xx-xx-xx-xyz.aws.local.somewhere, executor 33, partition 61, NODE_LOCAL, 8095 bytes)
Then it starts saving the partitions as a Vector(0, 1, 2...), but maybe it only saves 0 and 1? I don't really know.
19/07/31 07:56:02 INFO DAGScheduler: Submitting 35 missing tasks from ShuffleMapStage 2967 (MapPartitionsRDD[130590] at insertInto at evaluate_decay_factor.scala:165) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
19/07/31 07:56:02 INFO YarnScheduler: Adding task set 2967.0 with 35 tasks
19/07/31 07:56:02 INFO DAGScheduler: Submitting ShuffleMapStage 2965 (MapPartitionsRDD[130578] at insertInto at evaluate_decay_factor.scala:165), which has no missing parents
My code looks like this:
val createTableSQL = s"""
CREATE TABLE IF NOT EXISTS table_name (
cookie string,
updated_score float,
probability float,
date_last_score string,
target int
)
PARTITIONED BY (partition_date string)
STORED AS PARQUET
TBLPROPERTIES ('PARQUET.COMPRESSION'='SNAPPY')
"""
spark.sql(createTableSQL)
result_df.write.mode(SaveMode.Overwrite).insertInto("table_name")
Given a dataframe like this:
val result = Seq(
(8, "123", 1.2, 0.5, "bat", "2019-04-04_9"),
(64, "451", 3.2, -0.5, "mouse", "2019-04-04_12"),
(-27, "613", 8.2, 1.5, "horse", "2019-04-04_10"),
(-37, "513", 4.33, 2.5, "horse", "2019-04-04_11"),
(45, "516", -3.3, 3.4, "bat", "2019-04-04_10"),
(12, "781", 1.2, 5.5, "horse", "2019-04-04_11")
I want to run show partitions table_name on the Hive command line and get:
partition_date=2019-04-04_9
partition_date=2019-04-04_10
partition_date=2019-04-04_11
partition_date=2019-04-04_12
Instead, my output is:
partition_date=0
partition_date=1
With this simple example it works perfectly, but with my large dataframe I get the previous output.
To change the number of partitions, use repartition(numOfPartitions).
To change the column you partition by when writing, use partitionBy("col").
Example of the two used together: final_df.repartition(40).write.partitionBy("txnDate").mode("append").parquet(destination)
Two helpful hints:
Make your repartition size equal to the number of worker cores for the quickest write/repartition. In this example, I have 10 executors, each with 4 cores (40 cores total), so I set it to 40.
When you are writing to a destination, don't specify anything more than the sub bucket -- let Spark handle the indexing.
Good destination: "s3a://prod/subbucket/"
Bad destination: s"s3a://prod/subbucket/txndate=$txndate"

PySpark - Get the size of each list in group by

I have a massive pyspark dataframe. I need to group by Person and then collect their Budget items into a list, to perform a further calculation.
As an example,
a = [('Bob', 562,"Food", "12 May 2018"), ('Bob',880,"Food","01 June 2018"), ('Bob',380,'Household'," 16 June 2018"), ('Sue',85,'Household'," 16 July 2018"), ('Sue',963,'Household'," 16 Sept 2018")]
df = spark.createDataFrame(a, ["Person", "Amount","Budget", "Date"])
Group By:
import pyspark.sql.functions as F
df_grouped = df.groupby('person').agg(F.collect_list("Budget").alias("data"))
Schema:
root
|-- person: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: string (containsNull = true)
However, I am getting a memory error when I try to apply a UDF to each person. How can I get the size (in megabytes or gigabytes) of each list (data) for each person?
I have done the following, but I am getting nulls:
import sys
from pyspark.sql.types import DoubleType
size_list_udf = F.udf(lambda data: sys.getsizeof(data)/1000, DoubleType())
df_grouped = df_grouped.withColumn("size", size_list_udf("data"))
df_grouped.show()
Output:
+------+--------------------+----+
|person|                data|size|
+------+--------------------+----+
|   Sue|[Household, House...|null|
|   Bob|[Food, Food, Hous...|null|
+------+--------------------+----+
You have just one minor issue with your code: sys.getsizeof() returns the size of an object in bytes as an integer, and you're dividing it by the integer value 1000 to get kilobytes. In Python 2 this division returns an integer, yet you defined your udf to return a DoubleType(). The simple fix is to divide by 1000.0.
import sys
from pyspark.sql.types import DoubleType
size_list_udf = F.udf(lambda data: sys.getsizeof(data)/1000.0, DoubleType())
df_grouped = df_grouped.withColumn("size", size_list_udf("data"))
df_grouped.show(truncate=False)
#+------+-----------------------+-----+
#|person|data                   |size |
#+------+-----------------------+-----+
#|Sue   |[Household, Household] |0.112|
#|Bob   |[Food, Food, Household]|0.12 |
#+------+-----------------------+-----+
I have found that in cases where a udf is returning null, the culprit is very frequently a type mismatch.
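A minimal sketch of that failure mode, on a hypothetical toy column: declaring one return type while the Python function produces another silently yields nulls, whereas a matching declaration works:
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType, IntegerType
toy = spark.createDataFrame([("a",), ("bb",)], ["s"])
# Declared DoubleType, but the lambda returns a Python int -> column of nulls
bad_udf = F.udf(lambda s: len(s), DoubleType())
# Declared type matches the returned value -> works as expected
good_udf = F.udf(lambda s: len(s), IntegerType())
toy.withColumn("bad", bad_udf("s")).withColumn("good", good_udf("s")).show()
# the 'bad' column comes back as null for every row, 'good' holds the lengths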

How to map MongoDB data in Spark for kmeans?

I want to run k-means within Spark on data provided by MongoDB.
I have a working example that runs against a flat file:
from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
sc = SparkContext(appName="KMeansExample")  # SparkContext
data = sc.textFile("/home/mhoeller/kmeans_data.txt")
parsedData = data.map(lambda line: array([int(x) for x in line.split(' ')]))
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")
This is the format of the flat file:
0 0 1
1 1 1
2 2 2
9 9 6
Now I want to replace the flat file with MongoDB:
spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/ycsb.usertable") \
.config("spark.mongodb.output.uri", "mongodb:/127.0.0.1/ycsb.usertable") \
.getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri","mongodb://127.0.0.1/ycsb.usertable").load()
# <<<< Here I am missing the parsing >>>>>
clusters = KMeans.train(parsedData, 2, maxIterations=10, initializationMode="random")
I would like to understand how to map the data from the df so that it can be used as input for kmeans.
The "layout" of the database is:
root
|-- _id: string (nullable = true)
|-- field0: binary (nullable = true)
|-- field1: binary (nullable = true)
|-- field2: binary (nullable = true)
|-- field3: binary (nullable = true)
|-- field4: binary (nullable = true)
|-- field5: binary (nullable = true)
|-- field6: binary (nullable = true)
|-- field7: binary (nullable = true)
|-- field8: binary (nullable = true)
|-- field9: binary (nullable = true)
I would like to understand how to map the data from the df so that it can be used as input for kmeans.
Based on your snippet, I assume that you're using PySpark.
If you look into the clustering.KMeans Python API doc, you can see that the first parameter needs to be an RDD of Vector or convertible sequence types.
After you run the code below, which loads data from MongoDB using the MongoDB Spark Connector:
df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://127.0.0.1/ycsb.usertable") \
    .load()
What you have in df is a DataFrame, so we need to convert it into something convertible to a Vector type.
Since you are using numpy.array in your text file example, we can keep using this array type for the transition to training.
Based on the provided layout, we first need to remove the _id column, as it won't be needed for the clustering training. See also the Vector data type for more information.
With the above information, let's get into it:
# Drop _id column and get RDD representation of the DataFrame
rowRDD = df.drop("_id").rdd
# Convert RDD of Row into RDD of numpy.array
parsedRdd = rowRDD.map(lambda row: array([int(x) for x in row]))
# Feed into KMeans
clusters = KMeans.train(parsedRdd, 2, maxIterations=10, initializationMode="random")
If you would like to keep the boolean values (True/False) instead of integers (1/0), you can remove the int cast, as below:
parsedRdd = rowRDD.map(lambda row: array([x for x in row]))
Putting it all together:
from numpy import array
from pyspark.sql import SparkSession
from pyspark.mllib.clustering import KMeans
spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/ycsb.usertable") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/ycsb.usertable") \
    .getOrCreate()
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
rowRDD = df.drop("_id").rdd
parsedRdd = rowRDD.map(lambda row: array([int(x) for x in row]))
clusters = KMeans.train(parsedRdd, 2, maxIterations=10, initializationMode="random")
clusters.clusterCenters
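As a small usage sketch, once the model is trained you can inspect the centers and assign each parsed record to its nearest cluster with predict:
# Inspect the learned cluster centers (a list of numpy arrays)
for center in clusters.clusterCenters:
    print(center)
# Assign each parsed record to its nearest cluster index
labels = clusters.predict(parsedRdd)
print(labels.take(10))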