Reading and appending files into a spark dataframe - pyspark

I have created an empty dataframe and started adding to it by reading each file. But one of the files has more columns than the previous ones. How can I select only the columns from the first file for all the other files?
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType
import os, glob

spark = SparkSession.builder \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11") \
    .enableHiveSupport().getOrCreate()

fpath = ''
schema = StructType([])
sc = spark.sparkContext
df_spark = spark.createDataFrame(sc.emptyRDD(), schema)

files = glob.glob(fpath + '*.sas7bdat')
for i, f in enumerate(files):
    if i == 0:
        df = spark.read.format('com.github.saurfang.sas.spark').load(f)
        df_spark = df
    else:
        df = spark.read.format('com.github.saurfang.sas.spark').load(f)
        df_spark = df_spark.union(df)

You can provide your own schema while creating a dataframe.
For example, I have two files, emp1.csv and emp2.csv, with different schemas.
emp1.csv:
id,empname,empsalary
1,Vikrant,55550
emp2.csv:
id,empname,empsalary,age,country
2,Raghav,10000,32,India
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("salary", IntegerType(), True)])

file_path = "file:///home/vikct001/user/vikrant/inputfiles/testfiles/emp*.csv"
df = spark.read.format("com.databricks.spark.csv").option("header", "true").schema(schema).load(file_path)
Specifying a schema not only addresses data type and format issues, it also improves performance, since Spark does not have to infer the schema.
There are other options as well if you need to drop malformed records, but note that this will also drop records containing nulls or records that don't fit the provided schema.
It may also skip records with extra delimiters or junk characters, or an empty file.
.option("mode", "DROPMALFORMED")
FAILFAST mode will throw an exception as soon as it finds a malformed record.
.option("mode", "FAILFAST")
You can also use a map function to select only the elements of your choice and exclude the others while building the dataframe.
df=spark.read.format('com.databricks.spark.csv').option("header", "true").load(file_path).rdd.map(lambda x :(x[0],x[1],x[2])).toDF(["id","name","salary"])
You need to set header to 'true' in both cases, otherwise the CSV header will be included as the first record of your dataframe.
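Putting the pieces together, here is a hedged sketch of the same read with an explicit mode, reusing the schema and file_path defined above:
# Drop rows that do not fit the supplied schema instead of failing the job.
df_dropped = (spark.read.format("com.databricks.spark.csv")
              .option("header", "true")
              .option("mode", "DROPMALFORMED")
              .schema(schema)
              .load(file_path))

# Or fail fast on the first malformed record.
df_strict = (spark.read.format("com.databricks.spark.csv")
             .option("header", "true")
             .option("mode", "FAILFAST")
             .schema(schema)
             .load(file_path))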

You can get the field names from the schema of the first file and then use that array of field names to select the columns from all other files.
val fields = df.schema.fieldNames
You can use the fields array to select the columns from all other datasets. Following is the Scala code for that:
val df = spark.read.format("com.github.saurfang.sas.spark").load(f).select(fields(0), fields.drop(1): _*)
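Since the question is tagged pyspark, here is a rough Python equivalent of the same idea, fitted into the loop from the question (a sketch, not tested against the SAS reader):
# Capture the first file's column names, then select only those columns
# from every subsequent file before the union.
files = glob.glob(fpath + '*.sas7bdat')
df_spark = None
fields = None
for i, f in enumerate(files):
    df = spark.read.format('com.github.saurfang.sas.spark').load(f)
    if i == 0:
        fields = df.columns  # the first file's column names
        df_spark = df
    else:
        df_spark = df_spark.union(df.select(*fields))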

Related

drop all df2.columns from another df (pyspark.sql.dataframe.DataFrame specific)

I have a large DF (pyspark.sql.dataframe.DataFrame) that is the result of multiple joins, plus new columns created by combining inputs from different DFs, including DF2.
I want to drop all DF2 columns from DF after I'm done with the join and with creating the new columns based on DF2's input.
drop() doesn't accept a list, only a string or a Column.
I know that df.drop("col1", "col2", "coln") will work but I'd prefer not to crowd the code (if I can) by listing those 20 columns.
Is there a better way of doing this in pyspark dataframe specifically?
drop_cols = df2.columns
df = df.drop(*drop_cols)
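A minimal end-to-end sketch of where the drop fits; the join key "id", the column "status" and the derived column "flag" are hypothetical, and keeping the join key out of the drop list is optional:
from pyspark.sql import functions as F

# Join, derive a new column from df2's data, then drop everything that came from df2.
joined = df.join(df2, on="id", how="left")
joined = joined.withColumn("flag", F.when(F.col("status") == "active", 1).otherwise(0))  # "status" is a hypothetical df2 column

drop_cols = [c for c in df2.columns if c != "id"]  # keep the join key if you still need it
result = joined.drop(*drop_cols)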

How can I split a dataframe into different dataframes and save each one to a different file?

var df = sparkSession.read
.option("delimiter", delimiter)
.option("header", true) // Use first line of all files as header
// .schema(customSchema)
.option("inferSchema", "true") // Automatically infer data types
.format("csv")
.load(filePath)
df.show()
df.write.partitionBy("outlook").csv("output/weather.csv")
but the output is saved without that column's values, for example:
hot,high,false,yes
cool,normal,true,yes
The expected output for the overcast file is:
overcast,hot,high,false,yes
overcast,cool,normal,true,yes
When you partition your data to write it, Spark creates subfolders following the HDFS partitioning convention. Here you'll get a subfolder for each "outlook" value found in the dataset. All the files in the "outlook=overcast" subdirectory will only contain the records for which the outlook is overcast, so there is no need to store the outlook column in the data: its value would be the same across all the files in a given subdirectory.
When reading the data back through Hive or Spark, for instance, you'll have to specify that the outlook subdirectories are indeed partitions, so that a logical column can be used for projection, grouping, filtering or whatever you want to do.
In Spark this can be expressed by specifying the basePath option:
val df = spark.read.option("basePath", "output/weather.csv").csv("output/weather.csv/*")
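The PySpark equivalent would be along these lines (same paths as above):
# Reading back with basePath makes Spark reconstruct the "outlook" column
# from the partition directory names.
df = (spark.read
      .option("basePath", "output/weather.csv")
      .csv("output/weather.csv/*"))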
If you really need to store the outlook column in each file then maybe partitioning is not what you need.

Create pyspark column from large # of case statements w/ regex

I'm trying to transform a complicated text field into one of ~2000 possible values based on regular expressions and conditions.
Example: if VAL1 in ('3025','4817') and re.match('foo', VAL2) then (123, "GROUP_ABX")
elif ... (repeat about 2000 unique scenarios)
I put this bunch of conditions into a massive PySpark UDF. The problem is that with more than a few hundred conditions, performance grinds to a halt.
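For reference, a pared-down sketch of what such a UDF body might look like; the second branch and the fallback values are purely illustrative:
import re

def FOOTag(VAL1, VAL2):
    # Each branch maps a combination of VAL1 membership and a VAL2 regex to an (id, name) pair.
    if VAL1 in ('3025', '4817') and re.match('foo', VAL2):
        return (123, "GROUP_ABX")
    elif VAL1 in ('5555', '6666') and re.match('bar', VAL2):  # illustrative branch
        return (456, "GROUP_XYZ")
    # ... roughly 2000 more branches ...
    return (0, "UNMATCHED")  # illustrative fallback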
The UDF is registered like so:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import udf

schema = StructType([
    StructField("FOO_ID", IntegerType(), False),
    StructField("FOO_NAME", StringType(), False)])

spark.udf.register("FOOTagging", FOOTag, schema)
test_udf = udf(FOOTag, schema)
The dataframe is updated like this:
df1 = spark.read.csv(file) \
    .toDF(*Fields) \
    .select(*FieldList) \
    .withColumn("FOO_TAG_STRUCT", test_udf('VAL1', 'VAL2'))
When I run with fewer than 200 conditions, the 23k-row file is processed in a couple of seconds. Once I get over 500 or so, it takes forever.
It seems the UDF can't handle large functions. Is there another solution out there?

Reading null values from CSV in Spark with a schema defining nullable = false doesn't behave as expected

When loading a CSV file defining a schema where some fields are marked with nullable = false, I would expect those rows containing null values for the specified columns to be dropped or filtered out of the dataset when also defining a mode of DROPMALFORMED. This may be a misunderstanding on my part on exactly what is considered malformed, but in any case, I'm confused as to how the code continues to work when the schema explicitly defines certain fields as not accepting null values.
The Databricks docs for the CSV reader (I know this functionality has now been rolled into Apache Spark directly, but I can't find documentation for it) imply that the schema should be taken into account when reading values.
https://github.com/databricks/spark-csv#features
DROPMALFORMED: drops lines which have fewer or more tokens than expected or tokens which do not match the schema
Example:
CSV file:
value1,value2,,value4
Spark code (in Scala):
val spark = SparkSession
.builder()
.appName("example")
.master("local")
.getOrCreate()
spark.read
  .format("csv")
  .schema(StructType(Seq(
    StructField("col1", StringType),
    StructField("col2", StringType),
    StructField("col3", StringType, nullable = false),
    StructField("col4", StringType))))
  .option("header", false)
  .option("mode", "DROPMALFORMED")
  .load("path/to/file.csv")
The above code will still include the row containing the null value for col3, and in fact, if I want to filter out records with a null value, I have to do the following:
.filter(row => !row.isNullAt(row.fieldIndex("col3")))
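For completeness, the same read and workaround in PySpark looks roughly like this (a sketch with the same column names; the explicit null filter is still needed because the declared nullability is not enforced at read time, which is exactly the behaviour in question):
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("col1", StringType()),
    StructField("col2", StringType()),
    StructField("col3", StringType(), nullable=False),
    StructField("col4", StringType())])

df = (spark.read
      .format("csv")
      .schema(schema)
      .option("header", False)
      .option("mode", "DROPMALFORMED")
      .load("path/to/file.csv")
      .filter(F.col("col3").isNotNull()))  # drop the rows with a null col3 explicitly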
So my questions are:
1) Was my assumption that DROPMALFORMED mode would drop data not conforming to the schema an invalid one?
2) Have I done something wrong in the way I'm loading the CSV that has resulted in the unexpected behaviour above (or perhaps there's a better, cleaner way of doing what I want)?
My code looks almost identical to the examples on databricks' documentation, and other examples of loading CSVs using Spark found online.
I'm using Spark 2.3.1 with Scala 2.11.8.
[edit]
I have raised this issue on the Apache Spark JIRA: https://issues.apache.org/jira/browse/SPARK-25545

Append columns to existing CSV file in HDFS

I am trying to append columns to an existing CSV file in HDFS.
Script1:
someDF1.repartition(1).write.format("com.databricks.spark.csv").mode("append").option("sep", "\t").option("header","true").save("folder/test_file.csv")
Error:
org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory.
Any suggestions on the mistake would be helpful.
CSV files don't support schema evolution, so basically what you have to do is read the existing data from the target path and then add the new column to that dataframe with some default value.
import org.apache.spark.sql.functions.lit

val oldDF = dfWithExistingData.withColumn("new_col", lit(null).cast("string"))
You can then union or merge this dataframe with the new dataset:
val targetData = oldDF.union(newDF)
You can then write the data back to your target path in overwrite mode:
targetData
.repartition(1)
.write
.format("com.databricks.spark.csv")
.mode("overwrite")
.option("sep", "\t")
.option("header","true")
.save("folder")
Alternative: you can switch to a file format that supports schema evolution, e.g. Parquet, to avoid the above process.