I am using pyspark and have a situation where I need to compare metadata of 2 parquet files.
Example:-
Parquet 1 Schema is :
1, ID, string
2, address string
3, Date, date
Parquet 2 Schema is :
1, ID, string
2, Date, date
3, address string
This should show me a difference, as col 2 moved to col 3 in parquet 2.
Thanks,
VK
In Spark there isn't a native command to do the comparison of the headers. A solution to your problem could be the following:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df1 = spark.read.parquet('path/to/file1.parquet', header='true')
df2= spark.read.parquet('path/to/file2.parquet', header='true')
df1_headers = df1.columns
df2_headers = df2.columns
# Now in Python you could compare the lists with the headers
# You don't need Spark to compare simple headers :-)
Related
I have few csv files with headers but I found out that some files have different column orders. Is there a way to handle this with Spark where I can define select order for each file so that the master DF doesn't have mismatch where col x might have values from col y?
My current read -
val masterDF = spark.read.option("header", "true").csv(allFiles:_*)
Extract all file names and store into list variable.
Then define schema of with all the columns in it.
iterate through each file using header true, so we are reading each file separately.
unionAll the new dataframe with the existing dataframe.
Example:
file_lst=['<path1>','<path2>']
from pyspark.sql.functions import *
from pyspark.sql.types import *
#define schema for the required columns
schema = StructType([StructField("column1",StringType(),True),StructField("column2",StringType(),True)])
#create an empty dataframe
df=spark.createDataFrame([],schema)
for i in file_lst:
tmp_df=spark.read.option("header","true").csv(i).select("column1","column2")
df=df.unionAll(tmp_df)
#display results
df.show()
Problem -
I am reading a parquet file in pyspark using azure databricks. There are columns which lot of nulls and have decimal values, these columns are read as string instead of double.
Is there any way of inferring the proper data type in pyspark?
Code -
To read parquet file -
df_raw_data = sqlContext.read.parquet(data_filename[5:])
The output of this is a dataframe with more than 100 columns of which most of the columns are of the type double but the printSchema() shows it as string.
P.S -
I have a parquet file which can have dynamic columns hence defining struct for the dataframe does not work for me. I used to convert the spark dataframe to pandas and use convert_objects but that does not work as the parquet file is huge.
You can define the schema using StructType and then provide this schema in the schema option while loading the data.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
fileSchema = StructType([StructField('atm_id', StringType(),True),
StructField('atm_street_number', IntegerType(),True),
StructField('atm_zipcode', IntegerType(),True),
StructField('atm_lat', DoubleType(),True),
])
df_raw_data = spark.read \
.option("header",True) \
.option("format", "parquet") \
.schema(fileSchema) \
.load(data_filename[5:])
I have a dataframe with headers for example outputDF. I now want to take outputDF.columns and create a new dataframe with just one row which contains column names.
I then want to union both these dataframes with option("head=false") which spark can then write to a HDFS.
How do i do that?
below is an example
Val df = spark.read.csv("path")
val newDf = df.columns.toSeq.toDF
val unoindf= df.union(newDf);
df = spark.read.json("dbfs:/mnt/evbhaent2blobs", multiLine=True)
df2 = df.select(F.col('body').cast("Struct").getItem('CustomerType').alias('CustomerType'))
display(df)
my df is
my oupputdf
I am taking a guess that your dataframe has a column "body" which is a json string and you want to parse the json and extract an element from it.
First you need to define or extract the json schema. And then parse json string and extract its elements as column. From the extracted columns, you can select the desired columns.
json_schema = spark.read.json(df.rdd.map(lambda row: row.body)).schema
df2 = df.withColumn('body_json', F.from_json(F.col('body'), json_schema))\
.select("body_json.*").select('CustomerType')
display(df2)
Is it possible to divide DF in two parts using single filter operation.For example
let say df has below records
UID Col
1 a
2 b
3 c
if I do
df1 = df.filter(UID <=> 2)
can I save filtered and non-filtered records in different RDD in single operation
?
df1 can have records where uid = 2
df2 can have records with uid 1 and 3
If you're interested only in saving data you can add an indicator column to the DataFrame:
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
val dfWithInd = df.withColumn("ind", $"uid" <=> 2)
and use it as a partition column for the DataFrameWriter with one of the supported formats (as for 1.6 it is Parquet, text, and JSON):
dfWithInd.write.partitionBy("ind").parquet(...)
It will create two separate directories (ind=false, ind=true) on write.
In general though, it is not possible to yield multiple RDDs or DataFrames from a single transformation. See How to split a RDD into two or more RDDs?