How to create child dataframe from xml file using Pyspark? - pyspark

I have all those supporting libraries in pyspark and I am able to create dataframe for parent-
def xmlReader(root, row, filename):
    df = spark.read.format("com.databricks.spark.xml").options(rowTag=row, rootTag=root).load(filename)
    xref = df.select("genericEntity.entityId", "genericEntity.entityName", "genericEntity.entityType", "genericEntity.inceptionDate", "genericEntity.updateTimestamp", "genericEntity.entityLongName")
    return xref

df1 = xmlReader("BOBML", "entityTransaction", "s3://dev.xml")
df1.head()
But I am unable to create the child dataframe:
def xmlReader(root, row, filename):
    df2 = spark.read.format("com.databricks.spark.xml").options(rowTag=row, rootTag=root).load(filename)
    xref = df2.select("genericEntity.entityDetail", "genericEntity.entityDetialId", "genericEntity.updateTimestamp")
    return xref

df3 = xmlReader("BOBML","s3://dev.xml")
df3.head()
I am not getting any output and I was planning to do union between parent and child dataframe. Any help will be truly appreciated!

After more than 24 hours, I was able to solve the problem. Thanks to everyone who at least looked at my problem.
Solution:
Step 1: Import a couple of libraries
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
Step 2 (Parent): Read the XML file, print the schema, register a temp table, and create the dataframe.
Step 3 (Child): Repeat step 2.
Step 4: Create the final dataframe by joining the child and parent dataframes.
Step 5: Load the data into S3 (write.csv to an S3 path) or a database.

Related

Read files with different column order

I have a few CSV files with headers, but I found out that some files have different column orders. Is there a way to handle this with Spark where I can define the select order for each file, so that the master DF doesn't have a mismatch where col x might have values from col y?
My current read -
val masterDF = spark.read.option("header", "true").csv(allFiles:_*)
Extract all the file names and store them in a list variable.
Then define a schema with all the required columns in it.
Iterate through each file with header true, so each file is read separately.
unionAll the new dataframe with the existing dataframe.
Example:
file_lst = ['<path1>', '<path2>']
from pyspark.sql.types import *

# define schema for the required columns
schema = StructType([StructField("column1", StringType(), True),
                     StructField("column2", StringType(), True)])

# create an empty dataframe
df = spark.createDataFrame([], schema)

for i in file_lst:
    tmp_df = spark.read.option("header", "true").csv(i).select("column1", "column2")
    df = df.unionAll(tmp_df)

# display results
df.show()

loop a sequence of s3 parquet file path with same schema and save in a single dataframe in scala

A sequence of S3 locations is given; the only difference between any two locations is the value of the table's partition column.
Each parquet folder has the same schema.
So I need to loop over the sequence of S3 parquet paths, which share a schema, and save them into a single dataframe in Scala.
If you have an array with all of the directories you want to import, you can iterate over that array, make a collection of dataframes, and then union them into a single one. Try something like this:
// You have now a collection of dataframes
val dataframes = directories.map(dir => spark.read.parquet(dir))

// Let's union them into one
val df_union = dataframes.reduce(_ union _)
Alternatively, if you turn on the right options, you can simply load the files recursively:
spark.read.parquet("s3a://path/to/root/")
The options are as follows.
spark.hive.mapred.supports.subdirectories true
spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive true
This can be used as follows:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SparkSession
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("test")
.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
val spark = SparkSession.builder.config(conf).getOrCreate()
val df = spark.read.parquet("s3a://path/to/root/")

how to use createDataFrame to create a pyspark dataframe?

I know this is probably a stupid question. I have the following code:
from pyspark.sql import SparkSession
rows = [1,2,3]
df = SparkSession.createDataFrame(rows)
df.printSchema()
df.show()
But I got an error:
createDataFrame() missing 1 required positional argument: 'data'
I don't understand why this happens, because I already supplied 'data', which is the variable rows.
Thanks
You have to create a SparkSession instance using the builder pattern and use it for creating the dataframe; check
https://spark.apache.org/docs/2.2.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
spark = SparkSession.builder.getOrCreate()
Below are the steps to create a pyspark dataframe using createDataFrame.
Create a sparksession:
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
Create data and columns:
columns = ["language","users_count"]
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
First approach: creating the DataFrame from an RDD:
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd).toDF(*columns)
Second approach: creating the dataframe directly:
df2 = spark.createDataFrame(data).toDF(*columns)
Try
row = [(1,), (2,), (3,)]
?
If I am not wrong, createDataFrame() takes the data as a list of tuples, where each tuple is a row of the dataframe, and optionally a second list with the column names.

Spark scala copying dataframe column to new dataframe

I have an empty dataframe with schema already created.
In a for loop, I'm trying to add columns from a new dataframe to the existing columns of this dataframe.
k schema - |ID|DATE|REPORTID|SUBMITTEDDATE|
for (data <- 0 to range - 1) {
  val c = df2.select(substring(col("value"), str(data)._2, str(data)._3).alias(str(data)._1)).toDF()
  //c.show()
  k = c.withColumn(str(data)._1, c(str(data)._1))
}
k.show()
But the k dataframe has just one column, when it should have all 4 columns populated with values.
I think the last line in the for loop is replacing the existing columns in the dataframe.
Can somebody help me with this?
Thanks!!
Add your logic and conditions and create a new dataframe:
val dataframe2 = dataframe1.select("A", "B", "C")
Copying a few columns of one dataframe to another is not possible in Spark, although there are a few alternatives to attain the same:
1. Join both dataframes on some join condition.
2. Convert both dataframes to JSON and do an RDD union:
val rdd = df1.toJSON.union(df2.toJSON)
val dfFinal = spark.read.json(rdd)

how to create multiple list from dataframe in spark?

In my case, I want to order MongoDB documents grouped by a specific key, and create multiple lists grouped on the basis of one key of the schema.
Please help me.
val sparkSession = SparkSession.builder().getOrCreate()
MongoSpark.load[SparkSQL.Character](sparkSession).printSchema()
val characters = MongoSpark.load[SparkSQL.Character](sparkSession)
characters.createOrReplaceTempView("characters")
val sqlstmt = sparkSession.sql("SELECT * FROM characters WHERE site = 'website'")
...
You can do something like this:
val columns = sqlstmt.columns.map(col)

task1
  .groupBy(key)
  .agg(collect_list(struct(columns: _*)).as("data"))
Don't forget to import
import org.apache.spark.sql.functions._