PySpark UDF On Nested Array - pyspark

I am new to PySpark and I am attempting to create a UDF that will perform an operation on a string. The column I want to apply the UDF to has the type array&lt;struct&lt;structday:string,month:string,year:string&gt;&gt;, where some of the years are numeric and others are written out as words.
A typical row within this "day_dict" column looks like:
[{20, 5, 1997}]
For the time being I just want to print the output before I perform any operations within the UDF.
def isnum(col):
    print(col[0]['year'])
My UDF call:
isnumfunc = udf(isnum, ArrayType(StructType(StringType())))
isnumfunc.select(isnumfunc(col("day_dict"))).show()
But, I get a:
TypeError: 'StringType' object is not iterable
What do I need to change to print the year?

Please acquaint yourself with schemas before you proceed; they are the backbone of the DataFrame API. StructType is a special schema type used to nest a collection of fields, either individually or inside an array or map. There are dedicated ways of handling it, and you do not need a UDF.
Schema
sch = "id integer, col array<struct<structday:string,month:string,year:string>>"
or
myschema = StructType([
    StructField("id", IntegerType(), True),
    StructField("col", ArrayType(StructType([
        StructField("structday", StringType(), True),
        StructField("month", StringType(), True),
        StructField("year", StringType(), True),
    ]))),
])
DataFrame
df = spark.createDataFrame([(1, [('20', '5', '1997')])], myschema)
df = spark.createDataFrame([(1, [('20', '5', '1997')])], sch)
Select First Column in Array Struct
s = df.selectExpr('inline(col)')
s.select(s.columns[0]).show()
+---------+
|structday|
+---------+
|       20|
+---------+

Related

Issue with pyspark df.show

I am reading a gz file in pyspark, creating an RDD and a schema, and then using that RDD to create the DataFrame. But I am not able to see any output.
Here is my code; I am not sure what I am doing wrong.
lines = sc.textFile("x.gz")
parts = lines.map(lambda l: l.split("\t"))
db = parts.map(lambda p: (p[0], int(p[1]), int(p[2]), int(p[3]).strip()))

schema = StructType([
    StructField("word", StringType(), True),
    StructField("count1", IntegerType(), True),
    StructField("count2", IntegerType(), True),
    StructField("count3", IntegerType(), True)])

df = sqlContext.createDataFrame(db, schema)
df.show()

df.createOrReplaceTempView("dftable")
result = sqlContext.sql("SELECT COUNT(*) FROM dftable")
result.show()
Moreover, I also want to calculate the number of rows in my table, which is why I used a SQL query. But whenever I try to call .show(), an error is thrown. What am I doing wrong here?
The data in the gz file is something like.....
A'String' some_number some_number some_number
The some_number values are in string format.
Please guide me on what I am doing wrong.
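No answer is shown here, but one likely culprit (my assumption, not confirmed in the thread) is int(p[3]).strip(): it converts the field to an int first and then calls the string method .strip() on the result, which raises AttributeError only once an action such as .show() forces the map to run. Stripping before converting avoids this:

```python
# Hypothetical single record mimicking one tab-separated line of the gz file
line = "word\t1\t2\t3 "
p = line.split("\t")

# Strip the raw string first, then convert -- not int(p[3]).strip()
row = (p[0], int(p[1]), int(p[2]), int(p[3].strip()))
```

Because Spark transformations are lazy, the bad lambda only blows up at df.show(), which matches the symptom described.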

Spark generate a dataframe from two json columns

I have a dataframe with two columns. Each column contains json.
cola                       | colb
-------------------------- | -------------------
{"name":"Adam", "age": 23} | {"country" : "USA"}

I wish to convert it to:

cola_name | cola_age | colb_country
--------- | -------- | ------------
Adam      | 23       | USA
How do I do this?
The approach I have in mind is: if I can merge both JSON objects into a single JSON object in the original dataframe, I can then obtain the intended result with
spark.read.json(df.select("merged_column").as[String])
But I can't find an easy way of merging two JSON objects into a single JSON object in Spark.
Update: the contents of the JSON are not known beforehand. Looking for a way to auto-detect the schema.
I'm more familiar with pyspark syntax. I think this works:
import pyspark.sql.functions as f
from pyspark.sql.types import *

schema_cola = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])
schema_colb = StructType([
    StructField('country', StringType(), True)
])

df = spark.createDataFrame([('{"name":"Adam", "age": 23}', '{"country" : "USA"}')], ['cola', 'colb'])

display(df
    .withColumn('cola_struct', f.from_json(f.col('cola'), schema_cola))
    .withColumn('colb_struct', f.from_json(f.col('colb'), schema_colb))
    .select(f.col('cola_struct.*'), f.col('colb_struct.*'))
)
The output looks like this:
+----+---+-------+
|name|age|country|
+----+---+-------+
|Adam| 23|    USA|
+----+---+-------+

Get the column names of malformed records while reading a csv file using pyspark

I am reading a csv file using pyspark with predefined schema.
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", FloatType(), True)
])

df = spark.read \
    .schema(schema) \
    .option("header", True) \
    .option("delimiter", ",") \
    .csv(path)
Now in the csv file, there is a float value in col1 and a string value in col3. I need to raise an exception and get the names of these columns (col1, col3), because these columns contain values of a different data type than defined in the schema.
How do I achieve this?
In pyspark versions 2.2 and later you can use columnNameOfCorruptRecord with csv:
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", FloatType(), True),
    StructField("corrupted", StringType(), True),
])

df = spark.read.csv(
    path,
    schema=schema,
    header=True,
    sep=",",
    mode="PERMISSIVE",
    columnNameOfCorruptRecord="corrupted",
)
df.show()
+----+----+----+------------+
|col1|col2|col3|   corrupted|
+----+----+----+------------+
|null|null|null|0.10,123,abc|
+----+----+----+------------+
EDIT: CSV record fields are not independent of one another, so it can't generally be said that one field is corrupt, but others are not. Only the entire record can be corrupt or not corrupt.
For example, suppose we have a comma delimited file with one row and two floating point columns, the Euro values 0,10 and 1,00. The file looks like this:
col1,col2
0,10,1,00
Which field is corrupt?
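To actually name the offending columns, one option (my sketch, beyond what the answer covers) is to validate each raw field of a corrupt record against the declared type on the driver side; here is the idea in plain Python:

```python
# Expected Python types per column, mirroring the IntegerType/StringType/
# FloatType schema above
expected = {"col1": int, "col2": str, "col3": float}

def bad_columns(record, expected):
    """Return the names of fields whose raw text does not parse as the declared type."""
    bad = []
    for (name, typ), raw in zip(expected.items(), record):
        try:
            typ(raw)
        except ValueError:
            bad.append(name)
    return bad

bad_columns(["0.10", "123", "abc"], expected)  # -> ['col1', 'col3']
```

Applied to the raw text in the "corrupted" column (split on the delimiter), this names the columns at fault, subject to the caveat in the EDIT that delimiter errors make the whole record ambiguous.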

Is there a way to make an empty schema with column names from a list?

I am trying to make an empty PySpark dataframe in the case where it didn't exist before. I also have a list of column names. Is it possible to define an empty PySpark dataframe without manual assignment?
I have a list of columns final_columns, which I can use to select a subset of columns from a dataframe. However, in the case when this dataframe doesn't exist, I would like to create an empty dataframe with the same columns in final_columns. I would like to do this without manually assigning the names.
final_columns = ['colA', 'colB', 'colC', 'colD', 'colE']

try:
    sdf = sqlContext.table('test_table')
except:
    print("test_table is empty")
    mySchema = StructType([
        StructField("colA", StringType(), True),
        StructField("colB", StringType(), True),
        StructField("colC", StringType(), True),
        StructField("colD", StringType(), True),
        StructField("colE", DoubleType(), True),
    ])
    sdf = sqlContext.createDataFrame(spark.sparkContext.emptyRDD(), schema=mySchema)

sdf = sdf.select(final_columns)

sqlContext.createDataframe from Row with Schema. pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>

After spending way too much time figuring out why I get the following error
pyspark: TypeError: IntegerType can not accept object in type <type 'unicode'>
while trying to create a dataframe based on Rows and a Schema, I noticed the following:
With a Row inside my rdd called rddRows looking as follows:
Row(a="1", b="2", c=3)
and my dfSchema defined as:
dfSchema = StructType([
    StructField("c", IntegerType(), True),
    StructField("a", StringType(), True),
    StructField("b", StringType(), True)
])
creating a dataframe as follows:
df = sqlContext.createDataFrame(rddRows, dfSchema)
brings the above-mentioned error, because Spark only considers the order of StructFields in the schema and does not match the names of the StructFields against the names of the Row fields.
In other words, in the above example, I noticed that Spark tries to create a dataframe that would look as follows (if there were no TypeError, e.g. if everything were of type String):
+---+---+---+
| c | b | a |
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
is this really expected, or some sort of bug?
EDIT: the rddRows are created along these lines:
def createRows(dic):
    res = Row(a=dic["a"], b=dic["b"], c=int(dic["c"]))
    return res

rddRows = rddDict.map(createRows)
where rddDict is a parsed JSON file.
The constructor of Row sorts the keys if you provide keyword arguments (note: this holds for Spark 2.x; since Spark 3.0 the field order you write is preserved). Take a look at the source code here. When I found out about that, I ended up sorting my schema accordingly before applying it to the dataframe:
sorted_fields = sorted(dfSchema.fields, key=lambda x: x.name)
sorted_schema = StructType(fields=sorted_fields)
df = sqlContext.createDataFrame(rddRows, sorted_schema)