DateType() definition giving Null in PySpark? - pyspark

I have dates in a CSV in big-endian form, like:
YYYYMMDD
When I use simple string types, the data loads correctly, but when I use the DateType() object to define the column, I get nulls for everything. Can I define the date format somewhere, or should Spark infer it automatically?
schema_comments= StructType([
StructField("id", StringType(), True),
StructField("date", DateType(), True),
])

DateType expects Spark's default date format, yyyy-MM-dd (for example 1997-02-28), so a column declared as DateType() in the schema must match that format unless you tell the reader otherwise. If it does not, read the column as a string and convert it afterwards. Below is sample code that converts a YYYYMMDD string into a DateType column in PySpark:
from pyspark.sql.functions import to_date
# to_date parses the string with the given pattern and returns a DateType column
df2 = df.select('date_str', to_date('date_str', 'yyyyMMdd').alias('date'))

The schema looks good to me.
You can tell Spark how to parse dates in the CSV with the dateFormat option.
For example, for YYYYMMDD values:
rc = spark.read.csv('yourCSV.csv', header=False,
dateFormat="yyyyMMdd", schema=schema_comments)
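As a rough illustration of why the order of the pattern letters matters, here is a plain-Python analogy that needs no Spark. Note that Python's strptime uses different directives (%Y%m%d) from Spark's pattern letters (yyyyMMdd), and the helper name parse_or_none is made up for this sketch. It returns None on a mismatch, which is similar in spirit to the nulls seen in the question.

```python
from datetime import date, datetime
from typing import Optional


def parse_or_none(value: str, fmt: str) -> Optional[date]:
    """Parse value with fmt; return None when the text does not fit the format."""
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None


# Year, month, day: matches the YYYYMMDD data.
good = parse_or_none("19970228", "%Y%m%d")
# Year, day, month: "28" cannot be a month, so parsing fails.
bad = parse_or_none("19970228", "%Y%d%m")
print(good, bad)
```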

Related

I have a date column in PySpark in STRING format and I want to typecast it into date format

The date column holds values like the one below:
date column
01-JAN-22
They are strings, and I want to cast them to the date type.
I have tried many approaches using PySpark functions and SQL functions, but the output is always null.
Can anybody help me solve this?
You can use to_date.
The various datetime patterns for formatting and parsing are documented in Spark's "Datetime Patterns for Formatting and Parsing" reference.
df = spark.createDataFrame(data=[["01-JAN-22"]], schema=["date column"])
import pyspark.sql.functions as F
df = df.withColumn("date column", F.to_date("date column", "d-MMM-yy"))
df.printSchema()
[Out]:
root
|-- date column: date (nullable = true)
print(df.schema)
[Out]:
StructType([StructField('date column', DateType(), True)])

PySpark JSON dataframe created with all null values

I have created a dataframe from a JSON file. However, the dataframe is created with the full schema but with all values null. It is a valid JSON file.
df = spark.read.json(path)
When I display the data using df.display(), all I can see is null in the dataframe. Can anyone tell me what the issue could be?
Reading the JSON file without enabling the multiline option might be the cause.
Please go through the sample demonstration below.
My sample json.
[{"id":1,"first_name":"Amara","last_name":"Taplin"},
{"id":2,"first_name":"Gothart","last_name":"McGrill"},
{"id":3,"first_name":"Georgia","last_name":"De Miranda"},
{"id":4,"first_name":"Dukie","last_name":"Arnaud"},
{"id":5,"first_name":"Mellicent","last_name":"Scathard"}]
I got null values when the multiline option was not used.
With multiline enabled I got the proper result.
df = spark.read.option('multiline', True).json('/FileStore/tables/Sample1_json.json')
df.display()
If you also want to supply the schema explicitly, you can do it like this.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([StructField('first_name', StringType(), True),
StructField('id', IntegerType(), True),
StructField('last_name', StringType(), True)])
df = spark.read.schema(schema).option('multiline', True).json('/FileStore/tables/Sample1_json.json')
df.show()
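To see why the multiline option matters, here is a small plain-Python sketch (no Spark involved) that treats each line of a shortened copy of the sample as its own JSON document, which is roughly what a line-delimited JSON reader expects. The function name parse_per_line is invented for this illustration. None of the lines is valid JSON on its own, while the file as a whole parses fine.

```python
import json

# Shortened copy of the sample: one JSON array spread over several lines.
SAMPLE = """[{"id":1,"first_name":"Amara","last_name":"Taplin"},
{"id":2,"first_name":"Gothart","last_name":"McGrill"},
{"id":3,"first_name":"Georgia","last_name":"De Miranda"}]"""


def parse_per_line(text):
    """Treat every line as its own JSON document; None marks a line that fails."""
    results = []
    for line in text.splitlines():
        try:
            results.append(json.loads(line))
        except json.JSONDecodeError:
            results.append(None)
    return results


per_line = parse_per_line(SAMPLE)  # every line fails on its own
whole = json.loads(SAMPLE)         # the complete text is one valid array
print(per_line, len(whole))
```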

Spark generate a dataframe from two json columns

I have a dataframe with two columns. Each column contains JSON.
cola | colb
{"name":"Adam", "age": 23} | {"country" : "USA"}
I wish to convert it to:
cola_name | cola_age | colb_country
Adam | 23 | USA
How do I do this?
The approach I have in mind is: if I can merge both JSON values in the original dataframe into a single JSON object, I can then obtain the intended result with
spark.read.json(df.select("merged_column").as[String])
But I can't find an easy way of merging two JSON objects into a single JSON object in Spark.
Update: The contents of the JSON are not known beforehand. I am looking for a way to auto-detect the schema.
I'm more familiar with pyspark syntax. I think this works:
import pyspark.sql.functions as f
from pyspark.sql.types import *
schema_cola = StructType([
StructField('name', StringType(), True),
StructField('age', IntegerType(), True)
])
schema_colb = StructType([
StructField('country', StringType(), True)
])
df = spark.createDataFrame([('{"name":"Adam", "age": 23}', '{"country" : "USA"}')], ['cola', 'colb'])
display(df
.withColumn('cola_struct', f.from_json(f.col('cola'), schema_cola))
.withColumn('colb_struct', f.from_json(f.col('colb'), schema_colb))
.select(f.col('cola_struct.*'), f.col('colb_struct.*'))
)
The output is one row with the columns name, age and country, holding Adam, 23 and USA.
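For the case in the update, where the JSON keys are not known in advance, the prefix-and-merge idea from the question can be sketched in plain Python. This is only an illustration of the logic on a single row: the function name flatten_json_columns is hypothetical, it handles top-level keys only, and how to apply it across a Spark dataframe is left open.

```python
import json


def flatten_json_columns(row):
    """Parse each column's JSON text and prefix every key with the column name."""
    flat = {}
    for column, raw in row.items():
        for key, value in json.loads(raw).items():
            flat[f"{column}_{key}"] = value
    return flat


# One row of the dataframe from the question, as a plain dict.
row = {"cola": '{"name":"Adam", "age": 23}', "colb": '{"country" : "USA"}'}
flat = flatten_json_columns(row)
print(flat)
```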

Inferschema detecting column as string instead of double from parquet in pyspark

Problem -
I am reading a parquet file in PySpark using Azure Databricks. Some columns contain a lot of nulls alongside decimal values, and these columns are read as string instead of double.
Is there any way of inferring the proper data type in PySpark?
Code -
To read the parquet file -
df_raw_data = sqlContext.read.parquet(data_filename[5:])
The output is a dataframe with more than 100 columns. Most of them should be of type double, but printSchema() shows them as string.
P.S -
The parquet file can have dynamic columns, so defining a struct for the dataframe does not work for me. I used to convert the Spark dataframe to pandas and use convert_objects, but that does not work because the parquet file is huge.
You can define the schema using StructType and then provide this schema in the schema option while loading the data.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
fileSchema = StructType([StructField('atm_id', StringType(), True),
StructField('atm_street_number', IntegerType(), True),
StructField('atm_zipcode', IntegerType(), True),
StructField('atm_lat', DoubleType(), True),
])
# The source format is set with .format(), not .option(); parquet has no header option
df_raw_data = spark.read \
.format("parquet") \
.schema(fileSchema) \
.load(data_filename[5:])
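Since the question says the columns are dynamic, a fixed schema may not be an option. One possible alternative is to work out which string columns hold only numeric text (ignoring nulls) and then cast just those, for example with a column's cast("double") in Spark. The detection step can be sketched in plain Python as below. The names is_float and numeric_columns and the atm_lon column are made up for this sketch, and it assumes a sample of rows is available as dicts.

```python
def is_float(text):
    """Return True when the text can be read as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def numeric_columns(rows):
    """Names of columns whose non-null values all parse as floats."""
    all_numeric = {}
    for row in rows:
        for name, value in row.items():
            if value is None:
                continue  # nulls say nothing about the column's type
            all_numeric[name] = all_numeric.get(name, True) and is_float(value)
    return sorted(name for name, ok in all_numeric.items() if ok)


# Hypothetical sample rows: strings with nulls, as read from the file.
rows = [
    {"atm_id": "A1", "atm_lat": "55.23", "atm_lon": None},
    {"atm_id": "B2", "atm_lat": None, "atm_lon": "9.5"},
]
to_cast = numeric_columns(rows)
print(to_cast)
```

A column that is entirely null never shows up in the result, because there is no evidence of its type.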

How to log malformed rows from Scala Spark DataFrameReader csv

The documentation for the Scala Spark DataFrameReader csv method suggests that Spark can log the malformed rows detected while reading a .csv file.
- How can one log the malformed rows?
- Can one obtain a val or var containing the malformed rows?
The option from the linked documentation is:
maxMalformedLogPerPartition (default 10): sets the maximum number of malformed rows Spark will log for each partition. Malformed records beyond this number will be ignored
Based on this Databricks example, you need to explicitly add the "_corrupt_record" column to the schema definition when you read in the file. Something like this worked for me in PySpark 2.4.4:
from pyspark.sql.types import *
my_schema = StructType([
StructField("field1", StringType(), True),
...
StructField("_corrupt_record", StringType(), True)
])
my_data = spark.read.format("csv")\
.option("path", "/path/to/file.csv")\
.schema(my_schema)\
.load()
my_data.count() # force reading the csv
corrupt_lines = my_data.filter("_corrupt_record is not NULL")
corrupt_lines.take(5)
If you are using Spark 2.3, check the special _corrupt_record column. According to several Spark discussions it "should work", so after the read, filter for the rows where that column is non-empty; those should be your errors. You could also check the input_file_name() SQL function.
If you are using a version lower than 2.3, you should implement a custom read-and-record solution, because according to my tests the _corrupt_record column does not work for the CSV data source.
I've expanded on klucar's answer here by loading the csv, making a schema from the non-corrupted records, adding the corrupted record column, using the new schema to load the csv and then looking for corrupted records.
from pyspark.sql.types import StructField, StringType
from pyspark.sql.functions import col
file_path = "/path/to/file"
mode = "PERMISSIVE"
schema = spark.read.options(mode=mode).csv(file_path).schema
schema = schema.add(StructField("_corrupt_record", StringType(), True))
df = spark.read.options(mode=mode).schema(schema).csv(file_path)
df.cache()
df.count()
df.filter(col("_corrupt_record").isNotNull()).show()
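For the "custom read and record" route mentioned above, a minimal plain-Python sketch of the idea could look like this. It keeps the raw text of every line that does not fit the expected columns, much like a _corrupt_record column would. The schema, the sample lines and the name split_malformed are all hypothetical, and what counts as malformed here (wrong field count or a failed type conversion) is my own simplification that may not match Spark's rules in every version.

```python
import csv

# Hypothetical schema: column name plus a converter for its type.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def split_malformed(lines, schema):
    """Return (good rows as dicts, raw text of the lines that failed)."""
    good, corrupt = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = next(csv.reader([line]))
        try:
            if len(fields) != len(schema):
                raise ValueError("wrong number of fields")
            good.append({name: conv(raw) for (name, conv), raw in zip(schema, fields)})
        except ValueError:
            corrupt.append(line)  # keep the raw line for logging
    return good, corrupt


lines = ["1,alice,30", "2,bob,not_a_number", "3,carol"]
good, corrupt = split_malformed(lines, SCHEMA)
print(good)
print(corrupt)
```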