InferSchema detecting column as string instead of double from parquet in PySpark

Problem -
I am reading a parquet file in PySpark on Azure Databricks. Several columns contain a lot of nulls along with decimal values, and these columns are read as string instead of double.
Is there any way of inferring the proper data type in pyspark?
Code -
To read parquet file -
df_raw_data = sqlContext.read.parquet(data_filename[5:])
The output is a dataframe with more than 100 columns. Most of these columns should be of type double, but printSchema() shows them as string.
P.S -
The parquet file can have dynamic columns, so defining a fixed struct for the dataframe does not work for me. I used to convert the Spark dataframe to pandas and use convert_objects, but that is not feasible here because the parquet file is huge.

You can define the schema using StructType and then provide this schema in the schema option while loading the data.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

fileSchema = StructType([
    StructField('atm_id', StringType(), True),
    StructField('atm_street_number', IntegerType(), True),
    StructField('atm_zipcode', IntegerType(), True),
    StructField('atm_lat', DoubleType(), True),
])

df_raw_data = spark.read \
    .format("parquet") \
    .schema(fileSchema) \
    .load(data_filename[5:])
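If the columns are not known in advance (the P.S. above), one possible workaround, sketched here rather than taken from the original answer, is to read the file as-is and cast the string columns afterwards, the same cast-all-columns idea used in the Scala answer further down:
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

# Read with whatever schema Spark stored in the parquet footer.
df_raw_data = spark.read.parquet(data_filename[5:])

# Cast every string column to double; values that are not numeric become null.
# Narrow the list of columns if only some of them are actually numeric.
string_cols = [f.name for f in df_raw_data.schema.fields if isinstance(f.dataType, StringType)]
df_casted = df_raw_data.select(
    *[col(c).cast("double").alias(c) if c in string_cols else col(c) for c in df_raw_data.columns]
)
df_casted.printSchema()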

Related

can not cast values in spark scala dataframe

I am trying to parse numeric data.
Environment: Databricks, Scala 2.12, Spark 3.1
I have columns that were incorrectly parsed as Strings; the reason is that the numbers are written sometimes with a comma and sometimes with a dot as the decimal separator.
I am trying to first replace all commas with dots, parse the values as floats, create a schema with float types, and recreate the dataframe, but it does not work.
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, FloatType};
import org.apache.spark.sql.{Row, SparkSession}
import sqlContext.implicits._
//temp is a dataframe with data that I included below
val jj = temp.collect().map(row=> Row(row.toSeq.map(it=> if(it==null) {null} else {it.asInstanceOf[String].replace( ",", ".").toFloat }) ))
val schemaa = temp.columns.map(colN=> (StructField(colN, FloatType, true)))
val newDatFrame = spark.createDataFrame(jj,schemaa)
Sample data (CSV):
Podana aktywność,CRP(6 mcy),WBC(6 mcy),SUV (max) w miejscu zapalenia,SUV (max) tła,tumor to background ratio
218,72,"15,2",16,"1,8","8,888888889"
"199,7",200,"16,5","21,5","1,4","15,35714286"
270,42,"11,17","7,6","2,4","3,166666667"
200,226,"29,6",9,"2,8","3,214285714"
200,45,"13,85",17,"2,1","8,095238095"
300,null,"37,8","6,19","2,5","2,476"
290,175,"7,35",9,"2,4","3,75"
279,160,"8,36",13,2,"6,5"
202,24,10,"6,7","2,6","2,576923077"
334,"22,9","8,01",12,"2,4",5
"200,4",null,"25,56",7,"2,4","2,916666667"
198,102,"8,36","7,4","1,8","4,111111111"
"211,6","26,7","10,8","4,2","1,6","2,625"
205,null,null,"9,7","2,07","4,685990338"
326,300,18,14,"2,4","5,833333333"
270,null,null,15,"2,5",6
258,null,null,6,"2,5","2,4"
300,197,"13,5","12,5","2,6","4,807692308"
200,89,"20,9","4,8","1,7","2,823529412"
"201,7",28,null,11,"1,8","6,111111111"
198,9,13,9,2,"4,5"
264,null,"20,3",12,"2,5","4,8"
230,31,"13,3","4,8","1,8","2,666666667"
284,107,"9,92","5,8","1,49","3,89261745"
252,270,null,8,"1,56","5,128205128"
266,null,null,"10,4","1,95","5,333333333"
242,null,null,"14,7",2,"7,35"
259,null,null,"10,01","1,65","6,066666667"
224,null,null,"4,2","1,86","2,258064516"
306,148,10.3,11,1.9,"0,0002488406289"
294,null,5.54,"9,88","1,93","5,119170984"
You can map over the columns using Spark SQL's regexp_replace. collect is not needed and will not perform well. You might also want to use double instead of float, because some entries have many decimal places.
val new_df = df.select(
  df.columns.map(
    c => regexp_replace(col(c), ",", ".").cast("double").as(c)
  ): _*
)
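For reference, a rough PySpark equivalent of the same approach (a sketch, assuming the dataframe is called temp as in the question):
from pyspark.sql.functions import regexp_replace, col

# Replace the decimal comma with a dot, then cast every column to double.
new_df = temp.select(
    *[regexp_replace(col(c), ",", ".").cast("double").alias(c) for c in temp.columns]
)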

SparkDataFrame.dtypes fails if a column has special chars..how to bypass and read the csv and inferschema

Inferring the schema of a Spark dataframe throws an error if the CSV file has a column with special chars.
Test sample
foo.csv
id,comment
1, #Hi
2, Hello
spark = SparkSession.builder.appName("footest").getOrCreate()
df = spark.read.load("foo.csv", format="csv", inferSchema="true", header="true")
print(df.dtypes)
This fails with:
raise ValueError("Could not parse datatype: %s" % json_value)
I found a comment from Dat Tran on inferSchema in the spark-csv package about how to resolve this... can't we still infer the schema before data cleaning?
Use it like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Test').enableHiveSupport().getOrCreate()
df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("test19.csv")
print(df.dtypes)
Output:
[('id', 'int'), ('comment', 'string')]

DateType() definition giving Null in PySpark?

I have dates in a CSV which are big-endian, like:
YYYYMMDD
When I use simple string types, the data loads in correctly but when I used the DateType() object to define the column, I get nulls for everything. Am I able to define the date format somewhere or should Spark infer this automatically?
schema_comments = StructType([
    StructField("id", StringType(), True),
    StructField("date", DateType(), True),
])
DateType expects the standard timestamp format in Spark, so if you are providing it in the schema it should be of the format 1997-02-28 10:30:00. If that's not the case, read the column as a string (using pandas or PySpark) and then convert it into a DateType() column using PySpark. Below is sample code to convert the YYYYMMDD format into DateType in PySpark:
from pyspark.sql.functions import unix_timestamp, from_unixtime
df2 = df.select('date_str', from_unixtime(unix_timestamp('date_str', 'yyyyMMdd')).alias('date'))
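On Spark 2.2+ the same conversion can be done more directly with to_date and an explicit format (a small sketch, not part of the original answer):
from pyspark.sql.functions import to_date

# Parse the YYYYMMDD string straight into a DateType column.
df2 = df.select('date_str', to_date('date_str', 'yyyyMMdd').alias('date'))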
The schema looks good to me.
You can define how spark reads the CSV using dateFormat.
For example:
rc = spark.read.csv('yourCSV.csv', header=False,
                    dateFormat="yyyyMMdd", schema=schema)

How to add a new column to an existing dataframe while also specifying the datatype of it?

I have a dataframe, yearDF, obtained by reading an RDBMS table from Postgres, which I need to ingest into a Hive table on HDFS.
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("numPartitions", 10)
  .load()
Before ingesting it, I have to add a new column, delete_flag, of datatype IntegerType. This column is used to mark, for each primary key, whether the row has been deleted in the source table.
To add a new column to an existing dataframe, I know there is dataFrame.withColumn("del_flag", someoperation), but there is no apparent way to specify the datatype of the new column.
I have written the StructType for the new column as:
val delFlagColumn = StructType(List(StructField("delete_flag", IntegerType, true)))
But I don't understand how to add this column to the existing dataframe yearDF. Could anyone let me know how to add a new column, along with its datatype, to an existing dataframe?
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.IntegerType

df.withColumn("a", lit("1").cast(IntegerType)).show()
Casting is not required if you pass lit(1), because Spark infers an integer type on its own; but if you pass lit("1"), the cast is what converts the string to Int.
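For completeness, a rough PySpark version of the same idea (a sketch, assuming yearDF from the question is available):
from pyspark.sql.functions import lit

# Integer literal: Spark infers an integer type on its own.
df1 = yearDF.withColumn("delete_flag", lit(0))

# String literal: the explicit cast is what makes the column an integer.
df2 = yearDF.withColumn("delete_flag", lit("0").cast("int"))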

How to log malformed rows from Scala Spark DataFrameReader csv

The documentation for the Scala Spark DataFrameReader.csv method suggests that Spark can log the malformed rows detected while reading a .csv file.
- How can one log the malformed rows?
- Can one obtain a val or var containing the malformed rows?
The option from the linked documentation is:
maxMalformedLogPerPartition (default 10): sets the maximum number of malformed rows Spark will log for each partition. Malformed records beyond this number will be ignored
Based on this Databricks example, you need to explicitly add a "_corrupt_record" column to the schema definition when you read in the file. Something like this worked for me in PySpark 2.4.4:
from pyspark.sql.types import *

my_schema = StructType([
    StructField("field1", StringType(), True),
    ...
    StructField("_corrupt_record", StringType(), True)
])

my_data = spark.read.format("csv")\
    .option("path", "/path/to/file.csv")\
    .schema(my_schema)\
    .load()
my_data.count() # force reading the csv
corrupt_lines = my_data.filter("_corrupt_record is not NULL")
corrupt_lines.take(5)
If you are using Spark 2.3+, check the _corrupt_record special column... according to several Spark discussions "it should work", so after the read, filter the rows where that column is non-empty; those should be your errors... you could also check the input_file_name() SQL function.
If you are using a version lower than 2.3, you should implement a custom read-and-record solution, because according to my tests the _corrupt_record column does not work for the CSV data source...
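As a rough illustration of that suggestion (a sketch, not from the original answers; it assumes a dataframe df read with a schema that includes "_corrupt_record"), input_file_name() can be selected alongside the corrupt-record column to see which file each bad row came from:
from pyspark.sql.functions import col, input_file_name

# Keep only rows that failed to parse and show which input file they came from.
bad_rows = (
    df.filter(col("_corrupt_record").isNotNull())
      .select(input_file_name().alias("source_file"), "_corrupt_record")
)
bad_rows.show(truncate=False)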
I've expanded on klucar's answer here by loading the csv, making a schema from the non-corrupted records, adding the corrupted record column, using the new schema to load the csv and then looking for corrupted records.
from pyspark.sql.types import StructField, StringType
from pyspark.sql.functions import col
file_path = "/path/to/file"
mode = "PERMISSIVE"
schema = spark.read.options(mode=mode).csv(file_path).schema
schema = schema.add(StructField("_corrupt_record", StringType(), True))
df = spark.read.options(mode=mode).schema(schema).csv(file_path)
df.cache()   # cache the parsed result before querying _corrupt_record
df.count()   # force the full read
df.filter(col("_corrupt_record").isNotNull()).show()