Pyspark Dataframe Creation DecimalType issue - pyspark

I am trying to create a PySpark dataframe from a list of dicts and a defined schema for the dataframe. One column in the defined schema is a DecimalType. When I create the dataframe, I get this error:
TypeError: field b: DecimalType(38,18) can not accept object 0.1 in type <class 'float'>
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

test_data = [{"a": "1", "b": 0.1}, {"a": "2", "b": 0.2}]
schema = StructType(
    [
        StructField("a", StringType()),
        StructField("b", DecimalType(38, 18)),
    ]
)
# Create a dataframe
df = spark.createDataFrame(data=test_data, schema=schema)
Could someone help out with this issue? How can I pass DecimalType data in a list?

If you can afford to lose some accuracy, you can change the type to FloatType as Bala suggested.
You can also change to DoubleType if you need more accuracy: FloatType uses 4 bytes per value while DoubleType uses 8 (see here).
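For instance, a minimal sketch of the question's schema switched to DoubleType (reusing the question's test_data; plain Python floats are accepted directly):
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

test_data = [{"a": "1", "b": 0.1}, {"a": "2", "b": 0.2}]
schema = StructType(
    [
        StructField("a", StringType()),
        StructField("b", DoubleType()),  # 8-byte double; accepts Python floats as-is
    ]
)
df = spark.createDataFrame(data=test_data, schema=schema)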
If you need maximum accuracy, you can use Python's Decimal module, which by default carries 28 significant digits:
from pyspark.sql.types import *
from decimal import Decimal

# Pass strings to Decimal so the values stay exactly 0.1 and 0.2;
# Decimal(0.1) would inherit the binary-float rounding error.
test_data = [{"a": "1", "b": Decimal("0.1")}, {"a": "2", "b": Decimal("0.2")}]
schema = StructType(
    [
        StructField("a", StringType()),
        StructField("b", DecimalType(38, 18)),
    ]
)
# Create a dataframe
df = spark.createDataFrame(data=test_data, schema=schema)
If, for example, we run this code:
from pyspark.sql.types import *
from decimal import Decimal

test_data = [
    (1.9868968969869869045652421846, 1.9868968969869869045652421846, Decimal(1.9868968969869869045652421846)),
]
schema = StructType(
    [
        StructField("float_col", FloatType()),
        StructField("double_col", DoubleType()),
        StructField("decimal_col", DecimalType(38, 28)),
    ]
)
# Create a dataframe
df = spark.createDataFrame(data=test_data, schema=schema)
we would see the difference in precision: the float column keeps only about 7 significant digits and the double column about 16, while the decimal column is stored to the full 28-digit scale (note that wrapping a float literal in Decimal still inherits the double's rounding; pass a string to Decimal to keep the typed digits exactly).

Change it to FloatType
from pyspark.sql.types import StructType, StructField, StringType, FloatType

test_data = [{"a": "1", "b": 0.1}, {"a": "2", "b": 0.2}]
schema2 = StructType(
    [
        StructField("a", StringType()),
        StructField("b", FloatType()),
    ]
)
df = spark.createDataFrame(data=test_data, schema=schema2)
df.show()
+---+---+
| a| b|
+---+---+
| 1|0.1|
| 2|0.2|
+---+---+

Related

Create a column for each struct in an array of structs in a PySpark DataFrame

Suppose I have a dataframe where I have an id and a distinct list of keys and values, such as the following:
import pyspark.sql.functions as fun
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, StringType

schema = StructType(
    [
        StructField('id', StringType(), True),
        StructField('columns',
            ArrayType(
                StructType([
                    StructField('key', StringType(), True),
                    StructField('value', IntegerType(), True)
                ])
            )
        )
    ]
)
data = [
    ('1', [('Savas', 5)]),
    ('2', [('Savas', 5), ('Ali', 3)]),
    ('3', [('Savas', 5), ('Ali', 3), ('Ozdemir', 7)])
]
df = spark.createDataFrame(data, schema)
df.show()
For each struct in the array-type column I want to create a column, as follows:
df1 = df\
    .withColumn('temp', fun.explode('columns'))\
    .select('id', 'temp.key', 'temp.value')\
    .groupby('id')\
    .pivot('key')\
    .agg(fun.first(fun.col('value')))\
    .sort('id')
df1.show()
Is there a more efficient way to achieve the same result?
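One possible alternative, sketched here under the assumption of Spark 2.4+ (for map_from_entries and element_at) and that the distinct key names are known up front, avoids the explode/groupBy/pivot shuffle entirely:
import pyspark.sql.functions as fun

keys = ['Savas', 'Ali', 'Ozdemir']  # hypothetical: the known set of keys

df2 = (
    df.withColumn('kv', fun.map_from_entries('columns'))  # array<struct<key,value>> -> map
      .select('id', *[fun.element_at('kv', k).alias(k) for k in keys])
)
df2.show()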

Extract field(s) and value from nested json in PySpark dataframe and Sort it based off of value

How can I write a function in Databricks/Spark that takes an email (or the md5 of the email) and Mon as input and returns the top 5 cities sorted by activityCount in dict format (if there aren't that many cities, return however many matches are found)?
PS: there are more columns in the df for the other days as well ("Tue", "Wed", "Thu", "Fri", "Sat", "Sun") with data in a similar format, but I've only added "Mon" for brevity.
Dataframe
+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|email |Mon |
+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|aaaa#aol.com|{[California]={"[San Francisco]":{"activityCount":4}}, {"[San Diego]":{"activityCount":5}}, {"[San Jose]":{"activityCount":6}}, [New York]={"[New York City]":{"activityCount":1}}, {"[Fairport]":{"activityCount":2}}, {"[Manhattan]":{"activityCount":3}}}|
|bbbb#aol.com|{[Alberta]={"[city1]":{"activityCount":1}}, {"[city2]":{"activityCount":2}}, {"[city3]":{"activityCount":3}}, [New York]={"[New York City]":{"activityCount":7}}, {"[Fairport]":{"activityCount":8}}, {"[Manhattan]":{"activityCount":9}}}|
+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The dataframe schema is as follows:
schema = StructType([
StructField("email", StringType(), True),
StructField("Mon", StringType(), False)
])
Sample code to set it up
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType
if __name__ == "__main__":
    spark = SparkSession.builder.master("local[1]") \
        .appName('SparkByExamples.com') \
        .getOrCreate()

    data2 = [("aaaa#aol.com",
              {
                  "[New York]": "{\"[New York City]\":{\"activityCount\":1}}, {\"[Fairport]\":{\"activityCount\":2}}, "
                                "{\"[Manhattan]\":{\"activityCount\":3}}",
                  "[California]": "{\"[San Francisco]\":{\"activityCount\":4}}, {\"[San Diego]\":{\"activityCount\":5}}, "
                                  "{\"[San Jose]\":{\"activityCount\":6}}"
              }
              )]

    schema = StructType([
        StructField("email", StringType(), True),
        StructField("Mon", StringType(), False)
    ])

    task5DF = spark.createDataFrame(data=data2, schema=schema)
    task5DF.show(truncate=False)
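One way to approach this, sketched below, assumes Mon arrives as the raw string shown in the dataframe above, with every city encoded as a "[City]":{"activityCount":N} fragment; the regex, the top_cities helper, and top_cities_udf are illustrative names, not from the original post:
import re
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, IntegerType

def top_cities(day_str, n=5):
    # Pull every ("City", count) pair out of the raw string and keep the n largest.
    if day_str is None:
        return {}
    pairs = re.findall(r'"\[([^\]]+)\]":\{"activityCount":(\d+)\}', day_str)
    ranked = sorted(pairs, key=lambda kv: int(kv[1]), reverse=True)[:n]
    return {city: int(count) for city, count in ranked}

top_cities_udf = F.udf(top_cities, MapType(StringType(), IntegerType()))

# Look up one email and return its top cities for Monday.
result = task5DF.filter(F.col("email") == "aaaa#aol.com") \
                .select(top_cities_udf(F.col("Mon")).alias("top_cities"))
result.show(truncate=False)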

Manually create a pyspark dataframe

I am trying to manually create a pyspark dataframe given certain data:
row_in = [(1566429545575348), (40.353977), (-111.701859)]
rdd = sc.parallelize(row_in)
schema = StructType(
    [
        StructField("time_epocs", DecimalType(), True),
        StructField("lat", DecimalType(), True),
        StructField("long", DecimalType(), True),
    ]
)
df_in_test = spark.createDataFrame(rdd, schema)
This gives an error when I try to display the dataframe, so I am not sure how to do this.
However, the Spark documentation seems to be a bit convoluted to me, and I got similar errors when I tried to follow those instructions.
Does anyone know how to do this?
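For this specific snippet, a direct fix might look like the sketch below (assumptions: one row of three columns was intended rather than three single-value rows, DecimalType columns need Python Decimal objects, and the default DecimalType() precision of 10 is too small for the epoch value, so a wider precision/scale is chosen here):
from decimal import Decimal
from pyspark.sql.types import StructType, StructField, DecimalType

row_in = [(Decimal(1566429545575348), Decimal("40.353977"), Decimal("-111.701859"))]
schema = StructType(
    [
        StructField("time_epocs", DecimalType(38, 10), True),
        StructField("lat", DecimalType(38, 10), True),
        StructField("long", DecimalType(38, 10), True),
    ]
)
df_in_test = spark.createDataFrame(row_in, schema)
df_in_test.show()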
Simple dataframe creation:
df = spark.createDataFrame(
    [
        (1, "foo"),  # create your data here, be consistent in the types.
        (2, "bar"),
    ],
    ["id", "label"]  # add your column names here
)
df.printSchema()
root
|-- id: long (nullable = true)
|-- label: string (nullable = true)
df.show()
+---+-----+
| id|label|
+---+-----+
| 1| foo|
| 2| bar|
+---+-----+
According to the official docs:
when schema is a list of column names, the type of each column will be inferred from data (example above ↑);
when schema is pyspark.sql.types.DataType or a datatype string, it must match the real data (examples below ↓).
# Example with a datatype string
df = spark.createDataFrame(
    [
        (1, "foo"),  # Add your data here
        (2, "bar"),
    ],
    "id int, label string",  # add column names and types here
)

# Example with pyspark.sql.types
from pyspark.sql import types as T
df = spark.createDataFrame(
    [
        (1, "foo"),  # Add your data here
        (2, "bar"),
    ],
    T.StructType(  # Define the whole schema within a StructType
        [
            T.StructField("id", T.IntegerType(), True),
            T.StructField("label", T.StringType(), True),
        ]
    ),
)
df.printSchema()
root
|-- id: integer (nullable = true) # type is forced to Int
|-- label: string (nullable = true)
Additionally, you can create your dataframe from a Pandas dataframe; the schema will be inferred from the Pandas dataframe's types:
import pandas as pd
import numpy as np
pdf = pd.DataFrame(
{
"col1": [np.random.randint(10) for x in range(10)],
"col2": [np.random.randint(100) for x in range(10)],
}
)
df = spark.createDataFrame(pdf)
df.show()
+----+----+
|col1|col2|
+----+----+
| 6| 4|
| 1| 39|
| 7| 4|
| 7| 95|
| 6| 3|
| 7| 28|
| 2| 26|
| 0| 4|
| 4| 32|
+----+----+
To elaborate/build off of @Steven's answer:
from pyspark.sql.types import StructType, StructField, FloatType, StringType

field = [
    StructField("MULTIPLIER", FloatType(), True),
    StructField("DESCRIPTION", StringType(), True),
]
schema = StructType(field)
multiplier_df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
This will create an empty dataframe.
We can now simply add a row to it:
l = [(2.3, "this is a sample description")]
rdd = sc.parallelize(l)
multiplier_df_temp = spark.createDataFrame(rdd, schema)
multiplier_df = multiplier_df.union(multiplier_df_temp)
This answer demonstrates how to create a PySpark DataFrame with createDataFrame, create_df and toDF.
df = spark.createDataFrame([("joe", 34), ("luisa", 22)], ["first_name", "age"])
df.show()
+----------+---+
|first_name|age|
+----------+---+
| joe| 34|
| luisa| 22|
+----------+---+
You can also pass createDataFrame an RDD and a schema to construct DataFrames with more precision:
from pyspark.sql import Row
from pyspark.sql.types import *
rdd = spark.sparkContext.parallelize([
    Row(name='Allie', age=2),
    Row(name='Sara', age=33),
    Row(name='Grace', age=31)])
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), False)])
df = spark.createDataFrame(rdd, schema)
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
create_df from my Quinn project allows for the best of both worlds - it's concise and fully descriptive:
from pyspark.sql.types import *
from quinn.extensions import *
df = spark.create_df(
[("jose", "a"), ("li", "b"), ("sam", "c")],
[("name", StringType(), True), ("blah", StringType(), True)]
)
df.show()
+----+----+
|name|blah|
+----+----+
|jose| a|
| li| b|
| sam| c|
+----+----+
toDF doesn't offer any advantages over the other approaches:
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([
Row(name='Allie', age=2),
Row(name='Sara', age=33),
Row(name='Grace', age=31)])
df = rdd.toDF()
df.show()
+-----+---+
| name|age|
+-----+---+
|Allie| 2|
| Sara| 33|
|Grace| 31|
+-----+---+
With formatting
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, StringType
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        (1, "foo"),
        (2, "bar"),
    ],
    StructType(
        [
            StructField("id", IntegerType(), False),
            StructField("txt", StringType(), False),
        ]
    ),
)
print(df.dtypes)
df.show()
Extending @Steven's answer:
data = [(i, 'foo') for i in range(1000)] # random data
columns = ['id', 'txt'] # add your columns label here
df = spark.createDataFrame(data, columns)
Note: When schema is a list of column-names, the type of each column will be inferred from data.
If you want to define the schema explicitly, do this:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([StructField("id", IntegerType(), True), StructField("txt", StringType(), True)])
df1 = spark.createDataFrame(data, schema)
Outputs:
>>> df1
DataFrame[id: int, txt: string]
>>> df
DataFrame[id: bigint, txt: string]
For beginners, a full example importing data from a file:
from pyspark.sql import SparkSession
from pyspark.sql.types import (
ShortType,
StringType,
StructType,
StructField,
TimestampType,
)
import os
here = os.path.abspath(os.path.dirname(__file__))
spark = SparkSession.builder.getOrCreate()
schema = StructType(
[
StructField("id", ShortType(), nullable=False),
StructField("string", StringType(), nullable=False),
StructField("datetime", TimestampType(), nullable=False),
]
)
# read file or construct rows manually
df = spark.read.csv(os.path.join(here, "data.csv"), schema=schema, header=True)
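A hypothetical data.csv matching this schema might look like this (illustrative contents only):
id,string,datetime
1,foo,2021-01-01 00:00:00
2,bar,2021-01-02 12:30:00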

PySpark from_json Schema for ArrayType with No Name

I'm trying to use from_json with the following JSON string and need to specify a schema. What schema matches this JSON?
[{"key": "value1"}, {"key": "value2"}]
As a work-around, I'm doing a string concat to turn the JSON into this (i.e. adding an array name).
{ "data": [{"key": "value1"}, {"key": "value2"}] }
Then I can use the following schema. However, it should be possible to specify a schema without changing the original JSON.
schema = StructType([
StructField("data", ArrayType(
StructType([
StructField("key", StringType())
])
))
])
Here is an example:
from pyspark.sql.functions import from_json
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

df = spark.createDataFrame(['[{"key": "value1"}, {"key": "value2"}]'], StringType())
df.show(1, False)
schema = ArrayType(StructType([StructField("key", StringType(), True)]))
df = df.withColumn("json", from_json("value", schema))
df.show()
+--------------------------------------+
|value |
+--------------------------------------+
|[{"key": "value1"}, {"key": "value2"}]|
+--------------------------------------+
+--------------------+--------------------+
| value| json|
+--------------------+--------------------+
|[{"key": "value1"...|[[value1], [value2]]|
+--------------------+--------------------+
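As a follow-up sketch (assuming Spark 2.4+ for the transform higher-order function), the key values can be pulled back out of the parsed array:
# Extract the "key" field from every struct in the parsed json column.
df.selectExpr("value", "transform(json, x -> x.key) AS keys").show(truncate=False)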

Scala DataFrame: Explode an array

I am using the Spark libraries in Scala. I have created a DataFrame using:
val searchArr = Array(
StructField("log",IntegerType,true),
StructField("user", StructType(Array(
StructField("date",StringType,true),
StructField("ua",StringType,true),
StructField("ui",LongType,true))),true),
StructField("what",StructType(Array(
StructField("q1",ArrayType(IntegerType, true),true),
StructField("q2",ArrayType(IntegerType, true),true),
StructField("sid",StringType,true),
StructField("url",StringType,true))),true),
StructField("where",StructType(Array(
StructField("o1",IntegerType,true),
StructField("o2",IntegerType,true))),true)
)
val searchSt = new StructType(searchArr)
val searchData = sqlContext.jsonFile(searchPath, searchSt)
I now want to explode the field what.q1, which should contain an array of integers, but the documentation is limited:
http://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#explode(java.lang.String,%20java.lang.String,%20scala.Function1,%20scala.reflect.api.TypeTags.TypeTag)
So far I have tried a few things without much luck:
val searchSplit = searchData.explode("q1", "rb")(q1 => q1.getList[Int](0).toArray())
Any ideas/examples of how to use explode on an array?
Did you try a UDF on the field "what"? Something like this could be useful:
val explode = udf {
(aStr: GenericRowWithSchema) =>
aStr match {
case null => ""
case _ => aStr.getList(0).get(0).toString()
}
}
val newDF = df.withColumn("newColumn", explode(col("what")))
where:
getList(0) returns the "q1" field
get(0) returns the first element of "q1"
I'm not sure, but you could try to use getAs[T](fieldName: String) instead of getList(index: Int).
I'm not well versed in Scala, but in Python/PySpark the array-type column nested within a struct-type field can be exploded as follows. If it works for you, you can convert it to the corresponding Scala representation.
from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType, IntegerType, LongType, StringType, StructField, StructType
schema = StructType([
StructField("log", IntegerType()),
StructField("user", StructType([
StructField("date", StringType()),
StructField("ua", StringType()),
StructField("ui", LongType())])),
StructField("what", StructType([
StructField("q1", ArrayType(IntegerType())),
StructField("q2", ArrayType(IntegerType())),
StructField("sid", StringType()),
StructField("url", StringType())])),
StructField("where", StructType([
StructField("o1", IntegerType()),
StructField("o2", IntegerType())]))
])
data = [((1), ("2022-01-01","ua",1), ([1,2,3],[6],"sid","url"), (7,8))]
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
Output:
+---+-------------------+--------------------------+------+
|log|user |what |where |
+---+-------------------+--------------------------+------+
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|
+---+-------------------+--------------------------+------+
With what.q1 exploded:
df.withColumn("what.q1_exploded", explode(col("what.q1"))).show(truncate=False)
Output:
+---+-------------------+--------------------------+------+----------------+
|log|user |what |where |what.q1_exploded|
+---+-------------------+--------------------------+------+----------------+
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|1 |
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|2 |
|1 |{2022-01-01, ua, 1}|{[1, 2, 3], [6], sid, url}|{7, 8}|3 |
+---+-------------------+--------------------------+------+----------------+