Partitioning by column in Apache Spark to S3 - scala

I have a use-case where we want to read files from S3 that contain JSON. Then, based on a particular JSON node value, we want to group the data and write it to S3.
I am able to read the data, but I can't find a good example of how to partition the data based on a JSON key and then upload it to S3. Can anyone provide an example or point me to a tutorial that can help with this use-case?
This is the schema of my data after creating the dataframe:
root
|-- customer: struct (nullable = true)
| |-- customerId: string (nullable = true)
|-- experiment: string (nullable = true)
|-- expiryTime: long (nullable = true)
|-- partitionKey: string (nullable = true)
|-- programId: string (nullable = true)
|-- score: double (nullable = true)
|-- startTime: long (nullable = true)
|-- targetSets: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- featured: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- data: struct (nullable = true)
| | | | | |-- asinId: string (nullable = true)
| | | | |-- pk: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- reason: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- recommended: array (nullable = true)
| | | |-- element: string (containsNull = true)
I want to partition the data based on a random hash of the customerId column. But when I do this:
df.write.partitionBy("customerId").save("s3/bucket/location/to/save");
it gives this error:
org.apache.spark.sql.AnalysisException: Partition column customerId not found in schema StructType(StructField(customer,StructType(StructField(customerId,StringType,true)),true), StructField(experiment,StringType,true), StructField(expiryTime,LongType,true), StructField(partitionKey,StringType,true), StructField(programId,StringType,true), StructField(score,DoubleType,true), StructField(startTime,LongType,true), StructField(targetSets,ArrayType(StructType(StructField(featured,ArrayType(StructType(StructField(data,StructType(StructField(asinId,StringType,true)),true), StructField(pk,StringType,true), StructField(type,StringType,true)),true),true), StructField(reason,ArrayType(StringType,true),true), StructField(recommended,ArrayType(StringType,true),true)),true),true));
Please let me know how I can access the customerId column.

Let's take an example dataset, sample.json:
{"CUST_ID":"115734","CITY":"San Jose","STATE":"CA","ZIP":"95106"}
{"CUST_ID":"115728","CITY":"Allentown","STATE":"PA","ZIP":"18101"}
{"CUST_ID":"115730","CITY":"Allentown","STATE":"PA","ZIP":"18101"}
{"CUST_ID":"114728","CITY":"San Mateo","STATE":"CA","ZIP":"94401"}
{"CUST_ID":"114726","CITY":"Somerset","STATE":"NJ","ZIP":"8873"}
Now let's start hacking on it with Spark:
val jsonDf = spark.read
.format("json")
.load("path/of/sample.json")
jsonDf.show()
+---------+-------+-----+-----+
| CITY|CUST_ID|STATE| ZIP|
+---------+-------+-----+-----+
| San Jose| 115734| CA|95106|
|Allentown| 115728| PA|18101|
|Allentown| 115730| PA|18101|
|San Mateo| 114728| CA|94401|
| Somerset| 114726| NJ| 8873|
+---------+-------+-----+-----+
Then partition the dataset by the column "ZIP" and write it to S3:
jsonDf.write
.partitionBy("ZIP")
.save("s3/bucket/location/to/save")
// one-liner authentication to S3:
//.save(s"s3n://$accessKey:$secretKey@$bucketName/location/to/save")
Note: In order for this code to run successfully, the S3 access key and secret key have to be configured properly. Check this answer for Spark/Hadoop integration with S3.
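For reference, partitionBy creates one sub-directory per distinct value of the partition column, so with the default output format (Parquet) the write above should produce a layout roughly like this (file names are illustrative):
s3/bucket/location/to/save/ZIP=18101/part-00000-<uuid>.snappy.parquet
s3/bucket/location/to/save/ZIP=8873/part-00000-<uuid>.snappy.parquet
s3/bucket/location/to/save/ZIP=94401/part-00000-<uuid>.snappy.parquet
s3/bucket/location/to/save/ZIP=95106/part-00000-<uuid>.snappy.parquet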
Edit: Resolution for "Partition column customerId not found in schema" (as per the comment):
customerId exists inside the customer struct, so extract customerId into a top-level column first and then partition:
df.withColumn("customerId", $"customer.customerId")
.drop("customer")
.write.partitionBy("customerId")
.save("s3/bucket/location/to/save")

Related

Compare two columns in different dataframes, of types String and Array<string> respectively, in pyspark, without using the explode function

I have two dfs:
df1:
sku category cep seller state
4858 BDU 00000 xefd SP
df2:
depth price sku seller infos_product
6.1 5.60 47347 gaha [{1, 86800000, 86...
For df2 I have the following schema:
|-- depth: double (nullable = true)
|-- sku: string (nullable = true)
|-- price: double (nullable = true)
|-- seller: string (nullable = true)
|-- infos_produt: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- modality_id: integer (nullable = true)
| | |-- cep_coleta_ini: integer (nullable = true)
| | |-- cep_coleta_fim: integer (nullable = true)
| | |-- cep_entrega_ini: integer (nullable = true)
| | |-- cep_entrega_fim: integer (nullable = true)
| | |-- cubage_factor_entrega: double (nullable = true)
| | |-- value_coleta: double (nullable = true)
| | |-- value_entrega: double (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
I need to do a check between these dfs, something like this:
condi = [(df1.seller_id == df2.seller) & (df2.infos_produt.state == df1.state)]
df_finish = (df1\
.join(df2, on = condi ,how='left'))
But it returns an error:
AnalysisException: cannot resolve '(infos_produt.`state` = view.coverage_state)' due to data type mismatch: differing types in '(infos_produt.`state` = view.coverage_state)' (array<string> and string).
Can anyone help me?
PS: I would like to resolve this problem without using 'explode', because I have big data and the explode approach doesn't work for me.
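One way to express that join without explode is Spark's array_contains function, which checks whether a scalar value appears in an array column. A minimal sketch in Scala (the equivalent pyspark.sql.functions.array_contains works the same way), assuming the column names from the schemas above and reading the condition's seller_id as df1's seller column:
import org.apache.spark.sql.functions.array_contains

val dfFinish = df1.join(
  df2,
  df1("seller") === df2("seller") &&
    array_contains(df2("infos_produt.state"), df1("state")),  // true when df1.state appears in the array
  "left"
)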

Selecting fields of structs inside an array inside a dataframe

I have a PySpark dataframe loaded from a 3 GB json.gz file, with the following schema:
root
|-- _id: long (nullable = false)
|-- quote: string (nullable = true)
|-- occurrences: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- articleID: string (nullable = true)
| | |-- title: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- author: string (nullable = true)
| | |-- source: string (nullable = true)
I need to drop the title, author and date fields, or create a new dataFrame that does not include these fields.
So far I've managed to get the following schema:
root
|-- _id: long (nullable = false)
|-- quote: string (nullable = true)
|-- occurrences: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- articleID: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- source: array (nullable = true)
| | | |-- element: string (containsNull = true)
using
df.select(df._id, df.quote,
array(
struct(
col("occurrences.articleID"),
col("occurrences.source")
)
).alias("occurrences"))
But I need a way to keep articleIDs and sources together in the same struct. How can I do this?
Okay, I found something that works:
clean_df = (
    df.withColumn("exploded", explode("occurrences"))
      .drop("occurrences")
      .select(
          df._id,
          df.quote,
          col("exploded.articleID").alias("articleID"),
          col("exploded.source").alias("source"))
      .withColumn("occs", struct(col("articleID"), col("source")))
      .groupBy("_id", "quote")
      .agg(collect_set("occs").alias("occurrences"))
)
But if anyone has a better solution, I'd love to hear it, since this seems very round-about. (And as a side note, collect_set only seems to work with Java 8.)
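One cleaner alternative that avoids explode entirely (my own suggestion, not from the original post): Spark 2.4+ has the transform higher-order function, which can rebuild each array element as a smaller struct in place. In Scala (the same expression string can be used from PySpark via expr):
import org.apache.spark.sql.functions.expr

// keep only articleID and source inside each element of the occurrences array
val cleanDf = df.withColumn(
  "occurrences",
  expr("transform(occurrences, x -> struct(x.articleID as articleID, x.source as source))"))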

How to convert wrapped array to dataset in spark scala?

Hi, I am new to Spark with Scala. I have this structure in a JSON file which I need to convert into a dataset. I am unable to do this because of the nested data.
I tried something like the following, which I got from some post, but it does not work. Can someone please suggest a solution?
spark.read.json(path).map(r=>r.getAs[mutable.WrappedArray[String]]("readings"))
Your JSON format is invalid for Spark to convert into a dataframe. Each JSON record that needs to become a dataframe/dataset row should be on a single line.
So the first step is to read the JSON file and convert it into valid (line-delimited) JSON. You can use the wholeTextFiles API and some replacements.
val rdd = sc.wholeTextFiles("path to your json text file")
val validJson = rdd.map(_._2.replace(" ", "").replace("\n", ""))
The second step is to convert the valid JSON data into a dataframe or dataset. Here I am using a dataframe:
val dataFrame = sqlContext.read.json(validJson)
which should give you
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|did |readings |
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|d7cc92c24be32d5d419af1277289313c|[[WrappedArray([aa1111111111111111c1111111111111112222222222e,AppleiOS,-46,49,ITU++], [09dfs1111111111111c1111111111111112222222222e,AppleiOS,-50,45,ITU++]),1506770544]]|
+--------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
root
|-- did: string (nullable = true)
|-- readings: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- clients: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- cid: string (nullable = true)
| | | | |-- clientOS: string (nullable = true)
| | | | |-- rssi: long (nullable = true)
| | | | |-- snRatio: long (nullable = true)
| | | | |-- ssid: string (nullable = true)
| | |-- ts: long (nullable = true)
Now selecting the WrappedArray is an easy step:
dataFrame.select("readings.clients")
which should give you
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|clients |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[WrappedArray([aa1111111111111111c1111111111111112222222222e,AppleiOS,-46,49,ITU++], [09dfs1111111111111c1111111111111112222222222e,AppleiOS,-50,45,ITU++])]|
+------------------------------------------------------------------------------------------------------------------------------------------------------------+
I hope the answer is helpful
Updated
DataFrames and Datasets are almost the same, except that Datasets are type-safe (they use encoders) and can be optimized better than DataFrames.
Long story short, you can change the dataframe into a dataset by creating case classes. For your case you would need three case classes:
case class client(cid: String, clientOS: String, rssi: Long, snRatio: Long, ssid: String)
case class reading(clients: Array[client], ts: Long)
case class dataset(did: String, readings: Array[reading])
And then cast the dataframe to a dataset as:
val dataSet = sqlContext.read.json(validJson).as[dataset]
You should have a dataset in your hand :)
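One assumption worth making explicit: the .as[dataset] conversion needs the implicit encoders for those case classes in scope, e.g. via the session's implicits import (shown here with a SparkSession named spark):
import spark.implicits._  // provides the Encoders for the case classes above

val dataSet = spark.read.json(validJson).as[dataset]
dataSet.map(d => d.readings.map(_.clients.length).sum).show()  // typed access to the nested fields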
You cannot create a Dataset with the following code:
spark.read.json(path).map(r => r.getAs[WrappedArray[String]]("readings"))
Check the schema of the clients type for the DataFrame created upon reading the JSON:
spark.read.json(path).printSchema
root
|-- did: string (nullable = true)
|-- readings: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- clients: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- cid: string (nullable = true)
| | | | |-- clientOS: string (nullable = true)
| | | | |-- rssi: long (nullable = true)
| | | | |-- snRatio: long (nullable = true)
| | | | |-- ssid: string (nullable = true)
| | |-- ts: long (nullable = true)
You can get the scala.collection.mutable.WrappedArray object with the code below:
spark.read.json(path).first.getAs[WrappedArray[(String,String,Long,Long,String)]]("readings")
If you need to create a dataframe, use the following:
spark.read.json(path).select("readings.clients")
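And if what you ultimately want is one row per client rather than a WrappedArray, a possible follow-up (my own sketch, not part of the original answer) is to explode twice and flatten the struct:
import org.apache.spark.sql.functions.{col, explode}

val clientsDf = spark.read.json(path)
  .select(col("did"), explode(col("readings")).as("reading"))        // one row per reading
  .select(col("did"), explode(col("reading.clients")).as("client"))  // one row per client
  .select(col("did"), col("client.*"))                               // cid, clientOS, rssi, snRatio, ssid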

Spark: pruning nested columns/fields

I have a question about the possibility of pruning nested fields.
I'm developing a data source for a High Energy Physics data format (ROOT).
Below is the schema of a file read with a DataSource that I'm developing:
root
|-- EventAuxiliary: struct (nullable = true)
| |-- processHistoryID_: struct (nullable = true)
| | |-- hash_: string (nullable = true)
| |-- id_: struct (nullable = true)
| | |-- run_: integer (nullable = true)
| | |-- luminosityBlock_: integer (nullable = true)
| | |-- event_: long (nullable = true)
| |-- processGUID_: string (nullable = true)
| |-- time_: struct (nullable = true)
| | |-- timeLow_: integer (nullable = true)
| | |-- timeHigh_: integer (nullable = true)
| |-- luminosityBlock_: integer (nullable = true)
| |-- isRealData_: boolean (nullable = true)
| |-- experimentType_: integer (nullable = true)
| |-- bunchCrossing_: integer (nullable = true)
| |-- orbitNumber_: integer (nullable = true)
| |-- storeNumber_: integer (nullable = true)
The DataSource is here https://github.com/diana-hep/spark-root/blob/master/src/main/scala/org/dianahep/sparkroot/experimental/package.scala#L62
When building a reader using the buildReader method of the FileFormat:
override def buildReaderWithPartitionValues(
sparkSession: SparkSession,
dataSchema: StructType,
partitionSchema: StructType,
requiredSchema: StructType,
filters: Seq[Filter],
options: Map[String, String],
hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
I see that requiredSchema always contains all of the fields/members of the top-level column that is being looked at. Meaning that when I want to select a particular nested field with
df.select("EventAuxiliary.id_.run_")
requiredSchema will again be the full struct for that top-level column ("EventAuxiliary"). I would expect the schema to be something like this:
root
|-- EventAuxiliary: struct...
| |-- id_: struct ...
| | |-- run_: integer
since this is the only schema that has been required by the select statement.
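Expressed as a Spark StructType, the requiredSchema I would expect for df.select("EventAuxiliary.id_.run_") is roughly this (a sketch of my expectation, not what I actually receive):
import org.apache.spark.sql.types._

val expectedRequiredSchema = StructType(Seq(
  StructField("EventAuxiliary", StructType(Seq(
    StructField("id_", StructType(Seq(
      StructField("run_", IntegerType)   // nullable = true by default
    )))
  )))
))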
Basically, I want to know how I can prune nested fields at the data source level. I thought that requiredSchema would contain only the fields coming from the df.select.
I'm trying to see what avro/parquet are doing and found this: https://github.com/apache/spark/pull/14957/files
If there are any suggestions or comments, they would be appreciated!
Thanks!
VK

Spark/Scala: join dataframes when id is nested in an array of structs

I'm using Spark's MLlib DataFrame ALS functionality on Spark 2.2.0. I had to run my userId and itemId fields through a StringIndexer to get things going.
The method recommendForAllUsers returns the following schema:
root
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: long (nullable = true)
| | |-- rating: double (nullable = true)
|-- userIdIndex: string (nullable = true)
This is perfect for my needs (I would love not to flatten it), but I need to replace userIdIndex and itemIdIndex with their actual values.
For userIdIndex this was OK (I couldn't simply reverse it with IndexToString, as the ALS fitting seems to erase the link between index and value):
df.join(df2, df2("userIdIndex")===df("userIdIndex"), "left")
.select(df2("userId"), df("recommendations"))
where df2 looks like this:
+------------------+--------------------+----------+-----------+-----------+
| userId| itemId| rating|userIdIndex|itemIdIndex|
+------------------+--------------------+----------+-----------+-----------+
|glorified-consumer| item-22302| 3.0| 15.0| 4.0|
the result is this schema:
root
|-- userId: string (nullable = true)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: integer (nullable = true)
| | |-- rating: float (nullable = true)
QUESTION: how do I do the same for itemIdIndex, which is inside an array of structs?
You can explode the array so that only the struct remains:
val tempdf2 = df2.withColumn("recommendations", explode('recommendations))
which should leave you with a schema like this:
root
|-- userdId: string (nullable = true)
|-- recommendations: struct (nullable = true)
| |-- itemIdIndex: string (nullable = true)
| |-- rating: string (nullable = true)
Do the same for df (the first dataframe).
Then you can join them as:
tempdf1.join(tempdf2, tempdf1("recommendations.itemIdIndex") === tempdf2("recommendations.itemIdIndex"))
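An alternative sketch that avoids comparing two exploded recommendation structs (my own variation, not the original answer; column names follow the schemas above, and the index columns may need a cast if their types differ): explode df's recommendations, look up the real userId and itemId from df2, then collect the results back into an array of structs.
import org.apache.spark.sql.functions.{col, collect_list, explode, struct}

val exploded = df
  .withColumn("rec", explode(col("recommendations")))
  .select(
    col("userIdIndex"),
    col("rec.itemIdIndex").alias("itemIdIndex"),
    col("rec.rating").alias("rating"))

val withIds = exploded
  .join(df2.select("userIdIndex", "userId").distinct(), Seq("userIdIndex"), "left")
  .join(df2.select("itemIdIndex", "itemId").distinct(), Seq("itemIdIndex"), "left")

val result = withIds
  .groupBy("userId")
  .agg(collect_list(struct(col("itemId"), col("rating"))).alias("recommendations"))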