Exporting Spark DataFrame to S3 - scala

So after certain operations I have some data in a Spark DataFrame, to be specific, org.apache.spark.sql.DataFrame = [_1: string, _2: string ... 1 more field]
Now when I do df.show(), I get the following output, which is expected.
+--------------------+--------------------+--------------------+
| _1| _2| _3|
+--------------------+--------------------+--------------------+
|industry_name_ANZSIC|'industry_name_AN...|.isComplete("indu...|
|industry_name_ANZSIC|'industry_name_AN...|.isContainedIn("i...|
|industry_name_ANZSIC|'industry_name_AN...|.isContainedIn("i...|
| rme_size_grp|'rme_size_grp' is...|.isComplete("rme_...|
| rme_size_grp|'rme_size_grp' ha...|.isContainedIn("r...|
| rme_size_grp|'rme_size_grp' ha...|.isContainedIn("r...|
| year| 'year' is not null| .isComplete("year")|
| year|'year' has type I...|.hasDataType("yea...|
| year|'year' has no neg...|.isNonNegative("y...|
|industry_code_ANZSIC|'industry_code_AN...|.isComplete("indu...|
|industry_code_ANZSIC|'industry_code_AN...|.isContainedIn("i...|
|industry_code_ANZSIC|'industry_code_AN...|.isContainedIn("i...|
| variable|'variable' is not...|.isComplete("vari...|
| variable|'variable' has va...|.isContainedIn("v...|
| unit| 'unit' is not null| .isComplete("unit")|
| unit|'unit' has value ...|.isContainedIn("u...|
| value| 'value' is not null|.isComplete("value")|
+--------------------+--------------------+--------------------+
The problem occurs when I try exporting the dataframe as a csv to my S3 bucket.
The code I have is: df.coalesce(1).write.mode("Append").csv("s3://<my path>")
But the CSV generated at my S3 path is full of gibberish or rich text. Also, the Spark prompt doesn't reappear after execution (meaning execution didn't finish?). Here's a sample screenshot of the generated CSV in my S3:
What am I doing wrong and how do I rectify this?

S3 URI schemes: a short description.
Changing the letter in the URI scheme makes a big difference, because it determines which piece of software is used to interface with S3.
This is the difference between the three:
s3 is a block-based overlay on top of Amazon S3, whereas s3n and s3a are object-based.
s3n supports objects up to 5 GB, while s3a supports objects up to 5 TB and has higher performance. Note that s3a is the successor to s3n.
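In practice that means pointing the write at the s3a connector rather than s3. A minimal sketch of the same write, shown in PySpark for illustration (the Scala call has the same shape), assuming the hadoop-aws module matching your Hadoop version is on the classpath and credentials come from the usual provider chain:
# only the URI scheme changes; <my path> is the placeholder from the question
df.coalesce(1).write.mode("append").csv("s3a://<my path>")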

Related

How to read .dat file using pyspark.sql.session.SparkSession object

I am a newbie to Spark, so please bear with my silly mistakes, if any (open to your suggestions :))
I have created a pyspark.sql.session.SparkSession object using following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
I know that I can read a csv file using spark.read.csv('filepath').
Now, I would like to read a .dat file using that SparkSession object.
My ratings.dat file looks like:
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
My code:
ratings = spark.read.format('dat').load('filepath/ratings', sep='::')
Output:
An error occurred while calling o102.load.
: java.lang.ClassNotFoundException
Expected Output:
+------+-------+------+---------+
|UserID|MovieID|Rating|Timestamp|
+------+-------+------+---------+
| 1| 1193| 5|978300760|
| 1| 661| 3|978302109|
| and.........so.......on.......|
+------+-------+------+---------+
Note: My ratings.dat file does not contain headers and the separator is ::.
Questions:
How can I read the .dat file?
How can I add my custom header like I mentioned in the expected output?
So, how can I achieve my expected output? Where am I making mistakes?
I would love to read your suggestions and answers :)
Would really appreciate long and detailed answers.
You can just use the csv reader with :: as the separator, and provide a schema:
df = spark.read.csv('ratings.dat', sep='::', schema='UserID int, MovieID int, Rating int, Timestamp long')
df.show()
+------+-------+------+---------+
|UserID|MovieID|Rating|Timestamp|
+------+-------+------+---------+
| 1| 1193| 5|978300760|
| 1| 661| 3|978302109|
| 1| 914| 3|978301968|
| 1| 3408| 4|978300275|
+------+-------+------+---------+
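If you prefer not to spell out a full schema, an alternative sketch is to read with the default string columns and rename them with toDF (types then stay as strings unless you cast them or enable inferSchema):
# same file and separator as above; column types are not inferred here
df = spark.read.csv('ratings.dat', sep='::').toDF('UserID', 'MovieID', 'Rating', 'Timestamp')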

Spark: choose default value for MergeSchema fields

I have a parquet file that has an old schema like this:
| name | gender | age |
| Tom | Male | 30 |
And as our schema got updated to:
| name | gender | age | office |
we used mergeSchema when reading from the old parquet files:
val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
But when reading from these old parquet files, I got the following output:
| name | gender | age | office |
| Tom | Male | 30 | null |
which is normal. But I would like to use a default value for office (e.g. "California"), if and only if the field is not present in the old schema. Is that possible?
There is no simple method to put a default value when a column doesn't exist in some parquet files but exists in others.
In the Parquet file format, each parquet file contains the schema definition. By default, when reading parquet, Spark gets the schema from one parquet file. The only effect of the mergeSchema option is that, instead of retrieving the schema from one random parquet file, Spark reads the schemas of all parquet files and merges them.
So you can't put a default value without modifying the parquet files.
The other possible method is to provide your own schema when reading parquets by setting the option .schema() like that:
spark.read.schema(StructType(Array(StructField("name", StringType), ...))).parquet(...)
But in this case, there is no option to set a default value.
So the only remaining solution is to add the column default value manually.
If we have two parquets, first one containing the data with the old schema:
+----+------+---+
|name|gender|age|
+----+------+---+
|Tom |Male |30 |
+----+------+---+
and second one containing the data with the new schema:
+-----+------+---+------+
|name |gender|age|office|
+-----+------+---+------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |null |
+-----+------+---+------+
If you don't mind replacing every null value in the "office" column (including nulls coming from the new data), you can use .na.fill as follows:
spark.read.option("mergeSchema", "true").parquet(path).na.fill("California", Array("office"))
And you get the following result:
+-----+------+---+----------+
|name |gender|age|office |
+-----+------+---+----------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |California|
|Tom |Male |30 |California|
+-----+------+---+----------+
If you want only the old data to get the default value, you have to read each parquet file into a dataframe, add the column with the default value when necessary, and union all the resulting dataframes:
import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
import org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.util.CaseInsensitiveStringMap

ParquetTable("my_table",
  sparkSession = spark,
  options = CaseInsensitiveStringMap.empty(),
  paths = Seq(path),
  userSpecifiedSchema = None,
  fallbackFileFormat = classOf[ParquetFileFormat]
).fileIndex.allFiles().map(file => {
  // read each parquet file into its own dataframe
  val dataframe = spark.read.parquet(file.getPath.toString)
  if (dataframe.columns.contains("office")) {
    dataframe
  } else {
    // old schema: add the missing column with the default value
    dataframe.withColumn("office", lit("California"))
  }
}).reduce(_ unionByName _)
And you get the following result:
+-----+------+---+----------+
|name |gender|age|office |
+-----+------+---+----------+
|Jane |Female|45 |Idaho |
|Roger|Male |22 |null |
|Tom |Male |30 |California|
+-----+------+---+----------+
Note that the whole ParquetTable(...).fileIndex.allFiles() part is only there to retrieve the list of parquet files. It can be simplified if you are on Hadoop or on a local file system.
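For instance, on a local file system the listing can be a plain glob over the part files. A rough sketch of the same loop, written here in PySpark and assuming path points at a local directory of .parquet files (the glob pattern and variable names are illustrative):
import glob
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

parts = []
for f in glob.glob(path + "/*.parquet"):
    part = spark.read.parquet(f)
    if "office" not in part.columns:
        # old schema: add the missing column with the default value
        part = part.withColumn("office", lit("California"))
    parts.append(part)

merged = reduce(DataFrame.unionByName, parts)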

Loading JSON to Spark SQL

I'm doing self-study about JSON with Spark SQL in v2.1 and am using the data from this link:
https://catalog.data.gov/dataset/air-quality-measures-on-the-national-environmental-health-tracking-network
The problem I have is that when I use:
val lines = spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("E:/VW/meta_plus_sample_Data.json")
I get Spark SQL returning all the data as one row.
+--------------------+--------------------+
| data| meta|
+--------------------+--------------------+
|[[row-8eh8_xxkx-u...|[[[[1439474950, t...|
+--------------------+--------------------+
And when I remove:
.option("multiLine", true).option("mode", "PERMISSIVE")
I get the following error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
Is there a way in Spark SQL to get each record from the file as one row in the table?
This is expected behavior, since there is only one record in the file (from the link provided in the question), consisting of meta (an object) and data (an array).
As that one JSON record spans multiple lines, we need to include the multiLine option.
val df = spark.read.option("multiLine",true).option("mode","PERMISSIVE").json("tmp.json")
df.show()
//sample data
//+--------------------+--------------------+
//| data| meta|
//+--------------------+--------------------+
//|[[row-8eh8_xxkx-u...|[[[[1439474950, t...|
//+--------------------+--------------------+
//access meta struct columns
df.select("meta.view.*").show()
//+--------------------+-------------+--------------------+--------------------+----------+--------------------+-----------+-------------+--------------------+--------------------+---------------+----------------+---------+--------------+--------------------+--------------------+----------+----------------+--------+--------------------+----------+------------------------+---------------+----------------+----------------+--------------------+------+--------+-------------+-------------+--------------------+-------+--------------------+---------------+---------+----------------+--------+
//| approvals|averageRating| category| columns| createdAt| description|displayType|downloadCount| flags| grants|hideFromCatalog|hideFromDataJson| id|indexUpdatedAt| metadata| name|newBackend|numberOfComments| oid| owner|provenance|publicationAppendEnabled|publicationDate|publicationGroup|publicationStage| query|rights|rowClass|rowsUpdatedAt|rowsUpdatedBy| tableAuthor|tableId| tags|totalTimesRated|viewCount|viewLastModified|viewType|
//+--------------------+-------------+--------------------+--------------------+----------+--------------------+-----------+-------------+--------------------+--------------------+---------------+----------------+---------+--------------+--------------------+--------------------+----------+----------------+--------+--------------------+----------+------------------------+---------------+----------------+----------------+--------------------+------+--------+-------------+-------------+--------------------+-------+--------------------+---------------+---------+----------------+--------+
//|[[1439474950, tru...| 0|Environmental Hea...|[[, meta_data,, :...|1439381433|The Environmental...| table| 26159|[default, restora...|[[[public], false...| false| false|cjae-szjv| 1528204279|[[table, fatrow, ...|Air Quality Measu...| true| 0|12801487|[Tracking, 94g5-7...| official| false| 1439474950| 3957835| published|[[[true, [2171820...|[read]| | 1439402317| 94g5-7as2|[Tracking, 94g5-7...|3960642|[environmental ha...| 0| 3843| 1528203875| tabular|
//+--------------------+-------------+--------------------+--------------------+----------+--------------------+-----------+-------------+--------------------+--------------------+---------------+----------------+---------+--------------+--------------------+--------------------+----------+----------------+--------+--------------------+----------+------------------------+---------------+----------------+----------------+--------------------+------+--------+-------------+-------------+--------------------+-------+--------------------+---------------+---------+----------------+--------+
//to access data array we need to explode
df.selectExpr("explode(data)").show()
//+--------------------+
//| col|
//+--------------------+
//|[row-8eh8_xxkx-u3...|
//|[row-u2v5_78j5-px...|
//|[row-68zj_7qfn-sx...|
//|[row-8b4d~zt5j~da...|
//|[row-5gee.63td_z6...|
//|[row-tzyx.ssxh_pz...|
//|[row-3yj2_u42c_mr...|
//|[row-va7z.p2v8.7p...|
//|[row-r7kk_e3dm-z2...|
//|[row-bnrc~w34s-4a...|
//|[row-ezrk~m5dc_5n...|
//|[row-nyya.dvnz~c6...|
//|[row-dq3i_wt6d_c6...|
//|[row-u6rc-k3mf-cn...|
//|[row-t9c6-4d4b_r6...|
//|[row-vq6r~mxzj-e6...|
//|[row-vxqn-mrpc~5b...|
//|[row-3akn_5nzm~8v...|
//|[row-ugxn~bhax.a2...|
//|[row-ieav.mdz9-m8...|
//+--------------------+
Load multiple json records:
//json array with two records
import spark.implicits._ // needed for .toDS
spark.read.json(Seq("""
[{"id":1,"name":"a"},
{"id":2,"name":"b"}]
""").toDS).show()
//as we have 2 json objects and loaded as 2 rows
//+---+----+
//| id|name|
//+---+----+
//| 1| a|
//| 2| b|
//+---+----+

parse string of jsons pyspark

I am trying to parse a column containing a list of JSON strings, but even after trying multiple schemas using StructType, StructField, etc., I am just unable to get the schema right.
[{"event":"empCreation","count":"148"},{"event":"jobAssignment","count":"3"},{"event":"locationAssignment","count":"77"}]
[{"event":"empCreation","count":"334"},{"event":"jobAssignment","count":33"},{"event":"locationAssignment","count":"73"}]
[{"event":"empCreation","count":"18"},{"event":"jobAssignment","count":"32"},{"event":"locationAssignment","count":"72"}]
Based on this SO post, I was able to derive the JSON schema, but even after applying the from_json function it still wouldn't work:
Pyspark: Parse a column of json strings
Can you please help?
You can parse the given JSON with the schema definition below and read the JSON as a DataFrame by providing the schema info.
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> dschema = StructType([
...     StructField("event", StringType(), True),
...     StructField("count", StringType(), True)])
>>>
>>>
>>> df = spark.read.json('/<json_file_path>/json_file.json', schema=dschema)
>>>
>>> df.show()
+------------------+-----+
| event|count|
+------------------+-----+
| empCreation| 148|
| jobAssignment| 3|
|locationAssignment| 77|
| empCreation| 334|
| jobAssignment| 33|
|locationAssignment| 73|
| empCreation| 18|
| jobAssignment| 32|
|locationAssignment| 72|
+------------------+-----+
>>>
Contents of the json file:
cat json_file.json
[{"event":"empCreation","count":"148"},{"event":"jobAssignment","count":"3"},{"event":"locationAssignment","count":"77"}]
[{"event":"empCreation","count":"334"},{"event":"jobAssignment","count":"33"},{"event":"locationAssignment","count":"73"}]
[{"event":"empCreation","count":"18"},{"event":"jobAssignment","count":"32"},{"event":"locationAssignment","count":"72"}]
Thanks so much @Lakshmanan, but I had to make just a slight change to the schema:
eventCountSchema = ArrayType(StructType([StructField("event", StringType(),True),StructField("count", StringType(),True)]), True)
and then applied this schema to the dataframe's complex datatype column.
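For reference, a minimal sketch of that last step, assuming the raw strings live in a column named events_json on a dataframe df (both names are hypothetical):
from pyspark.sql.functions import from_json, explode, col
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# same ArrayType schema as above
eventCountSchema = ArrayType(StructType([
    StructField("event", StringType(), True),
    StructField("count", StringType(), True)]), True)

parsed = df.withColumn("events", from_json(col("events_json"), eventCountSchema))
parsed.select(explode("events").alias("e")).select("e.event", "e.count").show()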

How to ignore headers in PySpark when using Athena and AWS Glue Data Catalog

Assume I have a CSV file like this:
"Col1Name", "Col2Name"
"a", "b"
"c", "d"
Assume I issue the following CREATE EXTERNAL TABLE command in Athena:
CREATE EXTERNAL TABLE test.sometable (
col1name string,
col2name string
)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
stored as textfile
location 's3://somebucket/some/path/'
tblproperties("skip.header.line.count"="1")
Then I issue the following SELECT:
SELECT * FROM test.sometable
I expect to get the following:
+----------+----------+
| col1name| col2name|
+----------+----------+
| a| b|
| c| d|
+----------+----------+
...and sure enough, that's exactly what I get.
On an EMR cluster using the AWS Glue metadata catalog in Spark, I issue the following in the pyspark REPL:
a = spark.sql("select * from test.sometable")
a.show()
I expect to receive the same output, but, instead, I get this:
+----------+----------+
| col1name| col2name|
+----------+----------+
| col1name| col2name|
| a| b|
| c| d|
+----------+----------+
Obviously, Athena is honoring the "skip.header.line.count" tblproperty, but PySpark appears to be ignoring it.
How can I get PySpark to ignore this header line, as Athena does?
Either of these two methods should help you:
(1) Set the header row count to be skipped in the parameter:
'skip.header.line.count'='1'
(2) Or, in the select query, use a where clause to filter out that row. Say:
SELECT * FROM test.sometable where col1name <> 'col1name'
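For example, back in the pyspark REPL, method (2) would look like this:
a = spark.sql("select * from test.sometable where col1name <> 'col1name'")
a.show()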