How to change the schema of an existing dataframe - scala

Problem statement: I have a CSV file with around 100+ fields. I need to perform transformations on these fields, generate 80+ new fields, and write only these new fields to S3 in Parquet format.
The predefined Parquet schema = the 80+ newly populated fields + some non-populated fields.
Is there any way to pass this predefined Parquet schema while writing data to S3, so that the extra fields are also populated with null data?
A plain select of the new fields will not help, since the predefined schema might have around 120 fields.
Below are sample data and the transformation requirement.
CSV data
aid, productId, ts, orderId
1000,100,1674128580179,edf9929a-f253-487
1001,100,1674128580179,cc41a026-63df-410
1002,100,1674128580179,9732755b-1207-471
1003,100,1674128580179,51125ddd-4129-48a
1001,200,1674128580179,f4917676-b08d-41e
1004,200,1674128580179,dc80559d-16e6-4fa
1005,200,1674128580179,c9b743eb-457b-455
1006,100,1674128580179,e8611141-3e0e-4d5
1002,200,1674128580179,30be34c7-394c-43a
Parquet schema
def getPartitionFieldsSchema() = {
  List(
    Map("name" -> "company", "type" -> "long",
      "nullable" -> true, "metadata" -> Map()),
    Map("name" -> "epoch_day", "type" -> "long",
      "nullable" -> true, "metadata" -> Map()),
    Map("name" -> "account", "type" -> "string",
      "nullable" -> true, "metadata" -> Map())
  )
}

val schemaMap = Map("type" -> "struct",
  "fields" -> getPartitionFieldsSchema)
Simple example:
val dataDf = spark
  .read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("./scripts/input.csv")

dataDf
  .withColumn("company", lit(col("aid") / 100))
  .withColumn("epoch_day", lit(col("ts") / 86400))
  .write // how to write only company, epoch_day, account ?
  .mode("append")
  .csv("/tmp/data2")
The output should have the following columns: company, epoch_day, account

This is how I understand your problem:
you want to read some CSV files and transform them to Parquet in S3.
During the transformation, you need to create 3 new columns based on existing columns in the CSV files.
But since only 2 out of the 3 new columns are calculated, the output shows only two new columns instead of three.
In that case, you can create an external table in Redshift and specify all the columns there. As a result, even if some columns are not fed, they will simply be null in your external table.
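Staying within Spark (rather than Redshift), a minimal sketch of the padding the question asks for: add every field of the predefined schema that is missing from the DataFrame as a typed null column, then select the fields in schema order before writing Parquet. Here targetSchema is assumed to be the predefined schema as a StructType (for example built as sketched earlier), and the output path is a placeholder:
import org.apache.spark.sql.functions.{col, lit}

val transformedDf = dataDf
  .withColumn("company", (col("aid") / 100).cast("long"))
  .withColumn("epoch_day", (col("ts") / 86400).cast("long"))

// add each schema field the DataFrame lacks as a typed null column,
// then select the columns in the order the predefined schema defines
val aligned = targetSchema.fields.foldLeft(transformedDf) { (df, field) =>
  if (df.columns.contains(field.name)) df
  else df.withColumn(field.name, lit(null).cast(field.dataType))
}.select(targetSchema.fieldNames.map(col): _*)

aligned.write.mode("append").parquet("s3://your-bucket/your-prefix") // placeholder path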

Related

How to read parquet files and only keep the files that contain some columns

I have a bunch of parquet files in an S3 bucket. The files contain different columns. I would like to read the files and create a dataframe only from the files that contain certain columns.
For example: let's say I have three columns "name", "city" and "years". Some of my files only contain "name" and "city", others contain "name", "city" and "year". How can I create a dataframe by excluding the files that do not contain the column "year"? I am working with Spark and Scala.
Any help is welcome.
How can I create a dataframe by excluding the files that do not contain the column "year"?
First off, I would advise restructuring the bucket to separate these files based on their schema, or better yet having a process which transforms these "raw" files into a common schema that would be easier to work with.
Working with what you have, starting with some parquet files:
val df1 = List(
  ("a", "b", "c")
).toDF("name", "city", "years")
df1.write.parquet("s3://{bucket}/test/a.parquet")

val df2 = List(
  ("aa", "bb")
).toDF("name", "city")
df2.write.parquet("s3://{bucket}/test/b.parquet")

val df3 = List(
  ("aaa", "bbb", "ccc")
).toDF("name", "city", "year")
df3.write.parquet("s3://{bucket}/test/c.parquet")
We can do:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, input_file_name}

// determine the s3 paths of all the parquet files, to be read independently
// (`schema` here is assumed to be the superset schema covering all columns,
//  so the mixed files can be listed together)
val s3Paths = spark
  .read
  .schema(schema)
  .parquet("s3://{bucket}/test/")
  .withColumn("filename", input_file_name())
  .select("filename")
  .collect()
  .map(_(0).toString)
// individual DataFrames of each parquet file in which the `year` column is present
val dfs = s3Paths.flatMap { path =>
  val df = spark.read.parquet(path)
  if (df.columns.contains("year")) {
    List(df)
  } else {
    List.empty[DataFrame]
  }
}
// take the first DataFrame, and the rest
val (firstDFs, otherDFs) = (dfs.head, dfs.tail)

// combine all of the DataFrames, unioning the rows
otherDFs.foldLeft(firstDFs) {
  case (acc, df) => acc.unionByName(df, allowMissingColumns = true)
}.show()
Note:
In the above example, when creating the test data,
df1.write.parquet("s3://{bucket}/test/a.parquet")
will have created a directory in S3 with the data in a part file, for example:
s3://{bucket}/test/a.parquet/part-0000-blah.parquet
For the purpose of this example I moved the parquet files up a level into the test/ path.
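As a side note, not part of the original answer: if the goal were only to read all of the files into a single DataFrame with the superset of columns (rather than excluding files), Parquet schema merging might be enough; a hedged sketch:
// let Spark merge the differing Parquet schemas into one; rows coming from
// files without `year` simply get year = null, so no files are excluded
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("s3://{bucket}/test/")
merged.printSchema()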

Configuration for a Spark job to write 3,000,000 files as output

I have to generate 3,000,000 files as the output of a Spark job.
I have two input files:
File 1 -> size = 3.3 compressed, no. of records = 13,979,835
File 2 -> size = 1.g compressed, no. of records = 6,170,229
The Spark job does the following:
reads both files and joins them on the common column1 -> DataFrame-A
groups the result of DataFrame-A by column2 -> DataFrame-B
from DataFrame-B, uses array_join on the aggregated column and separates that column's values with the '\n' character -> DataFrame-C
writes the result of DataFrame-C partitioned by column2.
val DF1 = sparkSession.read.json("FILE1") // |ID |isHighway|isRamp|pvId |linkIdx|ffs |length |
val DF2 = sparkSession.read.json("FILE2") // |lId |pid |
val joinExpression = DF1.col("pvId") === DF2.col("lId")
val DFA = DF1.join(DF2, joinExpression, "inner")
  .select(col("ID").as("SCAR"), col("lId"), col("length"), col("ffs"), col("ar"), col("pid"))
  .orderBy("linkIdx")
val DFB = DFA.select(col("SCAR"), concat_ws(",", col("lId"), col("length"), col("ffs"), col("ar"), col("pid")).as("links"))
  .groupBy("SCAR").agg(collect_list("links").as("links"))
val DFC = DFB.select(col("SCAR"), array_join(col("links"), "\n").as("links"))
DFC.write.option("quote", "\u0000").partitionBy("SCAR").mode(SaveMode.Append).csv("/tmp")
After running some tests, I got the idea to run this job in batches (a sketch follows below), like:
query startIdx: 0, endIndex: 100000
query startIdx: 100000, endIndex: 200000
query startIdx: 200000, endIndex: 300000
and so on, until
query startIdx: 2900000, endIndex: 3000000
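A hedged sketch of that batching idea, assuming SCAR is (or can be compared as) a numeric id; the ranges mirror the pseudo-queries above and the output path is the same placeholder as before:
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// write one SCAR range per pass instead of all 3,000,000 partitions at once
val batchSize = 100000L
(0L until 3000000L by batchSize).foreach { startIdx =>
  val endIdx = startIdx + batchSize
  DFC
    .filter(col("SCAR") >= startIdx && col("SCAR") < endIdx)
    .write
    .option("quote", "\u0000")
    .partitionBy("SCAR")
    .mode(SaveMode.Append)
    .csv("/tmp")
}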

Optimal way to save spark sql dataframe to S3 using information stored in them

I have dataframes with data like :
channel eventId1 eventId2 eventTs eventTs2 serialNumber someCode
Web-DTB akefTEdZhXt8EqzLKXNt1Wjg akTEdZhXt8EqzLKXNt1Wjg 1545502751154 1545502766731 4 rfs
Web-DTB 3ycLHHrbEkBJ.piYNyI7u55w 3ycLHHEkBJ.piYNyI7u55w 1545502766247 1545502767800 4 njs
Web-DTB 3ycL4rHHEkBJ.piYNyI7u55w 3ycLHHEkBJ.piYNyI7u55w 1545502766247 1545502767800 4 null
I need to save this data to S3 path looking like :
s3://test/data/ABC/hb/eventTs/[eventTs]/uploadTime_[eventTs2]/*.json.gz
How can I proceed, given that I need to extract data from the partitions to build the S3 path? (The S3 path is a function of eventTs and eventTs2 present in the dataframes.)
df.write.partitionBy("eventTs","eventTs2").format("json").save("s3://test/data/ABC/hb????")
I guess I could iterate over each row in the dataframe, extract the path and save to S3, but I do not want to do that.
Is there any way to group the dataframes by eventTs and eventTs2 and then save them to the full S3 path? Is there something more optimal?
Spark supports partitions similar to what we have in Hive. If the number of distinct values of eventTs and eventTs2 is small, partitioning will be a good way to solve this.
Check the Scala doc for more information on partitionBy.
Example usage:
val someDF = Seq(
  (1, "bat", "marvel"),
  (2, "mouse", "disney"),
  (3, "horse", "animal"),
  (1, "batman", "marvel"),
  (2, "tom", "disney")
).toDF("id", "name", "place")
someDF.write.partitionBy("id", "name").orc("/tmp/somedf")
If you write the dataframe with partitionBy on "id" and "name", the following directory structure will be created.
/tmp/somedf/id=1/name=bat
/tmp/somedf/id=1/name=batman
/tmp/somedf/id=2/name=mouse
/tmp/somedf/id=2/name=tom
/tmp/somedf/id=3/name=horse
The first and second partition columns become directories; all rows where id equals 1 and name is bat are saved under the directory structure /tmp/somedf/id=1/name=bat. The order of the columns passed to partitionBy decides the order of the directories.
In your case, the partitions will be on eventTs and eventTs2.
val someDF = Seq(
  ("Web-DTB", "akefTEdZhXt8EqzLKXNt1Wjg", "akTEdZhXt8EqzLKXNt1Wjg", "1545502751154", "1545502766731", 4, "rfs"),
  ("Web-DTB", "3ycLHHrbEkBJ.piYNyI7u55w", "3ycLHHEkBJ.piYNyI7u55w", "1545502766247", "1545502767800", 4, "njs"),
  ("Web-DTB", "3ycL4rHHEkBJ.piYNyI7u55w", "3ycLHHEkBJ.piYNyI7u55w", "1545502766247", "1545502767800", 4, "null")
).toDF("channel", "eventId1", "eventId2", "eventTs", "eventTs2", "serialNumber", "someCode")
someDF.write.partitionBy("eventTs", "eventTs2").orc("/tmp/someDF")
This creates a directory structure as follows.
/tmp/someDF/eventTs=1545502766247/eventTs2=1545502767800
/tmp/someDF/eventTs=1545502751154/eventTs2=1545502766731
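If the original requirement of gzipped JSON output is kept, a hedged variant of the same idea; note that partitionBy produces eventTs=.../eventTs2=... directories rather than the custom uploadTime_ layout from the question, so matching that exact layout would need a separate renaming step:
// same partitioning, but gzip-compressed JSON as the question asks for
someDF.write
  .partitionBy("eventTs", "eventTs2")
  .option("compression", "gzip")
  .json("s3://test/data/ABC/hb")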

Adding/selecting fields to/from RDD

I have an RDD, let's say dataRdd, with fields like timestamp, url, ...
I want to create a new RDD with a few fields from this dataRdd.
The following code segment creates the new RDD, but timestamp and URL are treated as values and not as field/column names:
var fewfieldsRDD = dataRdd.map(r => ("timestamp" -> r.timestamp, "URL" -> r.url))
However, with the code segment below, one, two, three, arrival, and SFO are treated as column names:
val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")
val numairRdd= sc.makeRDD(Seq(numbers, airports))
Can anyone tell me what I am doing wrong, and how I can create a new RDD with field names mapped to values from another RDD?
You are creating an RDD of tuples, not Map objects. Try:
var fewfieldsRDD = dataRdd.map(r => Map("timestamp" -> r.timestamp, "URL" -> r.url))
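For illustration, a self-contained sketch with a hypothetical case class standing in for dataRdd's record type:
// hypothetical record type; the real dataRdd's fields may differ
case class LogRecord(timestamp: Long, url: String)

val dataRdd = sc.parallelize(Seq(LogRecord(1545502751154L, "/home")))

// each element is now a Map, so "timestamp" and "URL" act as field names
val fewfieldsRDD = dataRdd.map(r => Map("timestamp" -> r.timestamp, "URL" -> r.url))
fewfieldsRDD.collect().foreach(println)
// prints: Map(timestamp -> 1545502751154, URL -> /home)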

Timestamp issue when loading CSV to dataframe

I am trying to load a csv file into a distributed dataframe (ddf) while providing a schema. The ddf gets loaded, but the timestamp column shows only null values. I believe this happens because Spark expects the timestamp in a particular format. So I have two questions:
1) How do I give Spark the format, or make it detect the format (like "MM/dd/yyyy' 'HH:mm:ss")?
2) If 1 is not an option, how do I convert the field (assuming I imported it as String) to timestamp?
For Q2 I have tried the following:
import org.apache.spark.sql.Row
import java.sql.Timestamp

// parse the string in column 3 into a java.sql.Timestamp, keeping null on failure
def getTimestamp(value: Any): Timestamp = {
  val format = new java.text.SimpleDateFormat("MM/dd/yyyy' 'HH:mm:ss")
  Option(value).map(_.toString)
    .flatMap(s => scala.util.Try(new Timestamp(format.parse(s).getTime)).toOption)
    .orNull
}

def convert(row: Row): Row = Row(row(0), row(1), row(2), getTimestamp(row(3)))

val rdd1 = df.map(convert)
val df1 = sqlContext.createDataFrame(rdd1, schema1)
The last step doesn't work, as there are null values which don't let it finish. I get errors like:
java.lang.RuntimeException: Failed to check null bit for primitive long value.
sqlContext.load, however, is able to load the csv without any problems:
val df = sqlContext.load("com.databricks.spark.csv", schema, Map("path" -> "/path/to/file.csv", "header" -> "true"))
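For Q1, on Spark 2.x and later the built-in CSV reader can be told the timestamp format up front; a hedged sketch where the path and schema are placeholders:
// declare the timestamp format when reading, so the column parses directly
val df = spark.read
  .schema(schema)
  .option("header", "true")
  .option("timestampFormat", "MM/dd/yyyy HH:mm:ss")
  .csv("/path/to/file.csv")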