I am a newbie to Spark, so please bear with my silly mistakes if there are any (open to your suggestions :))
I have created a pyspark.sql.session.SparkSession object using following code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
I know that I can read a csv file using spark.read.csv('filepath').
Now, I would like to read .dat file using that SparkSession object.
My ratings.dat file looks like:
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
My code:
ratings = spark.read.format('dat').load('filepath/ratings', sep='::')
Output:
An error occurred while calling o102.load.
: java.lang.ClassNotFoundException
Expected Output:
+------+-------+------+---------+
|UserID|MovieID|Rating|Timestamp|
+------+-------+------+---------+
| 1| 1193| 5|978300760|
| 1| 661| 3|978302109|
| and.........so.......on.......|
+------+-------+------+---------+
Note: My ratings.dat file does not contain headers, and the separator is ::.
Questions:
How can I read .dat file?
How can I add my custom header like I mentioned in Expected output?
So, how can I achieve my expected output? Where am I making mistakes?
I would love to read your suggestions and answers :)
Would really appreciate long and detailed answers.
You can just use the csv reader with :: as the separator, and provide a schema:
df = spark.read.csv('ratings.dat', sep='::', schema='UserID int, MovieID int, Rating int, Timestamp long')
df.show()
+------+-------+------+---------+
|UserID|MovieID|Rating|Timestamp|
+------+-------+------+---------+
| 1| 1193| 5|978300760|
| 1| 661| 3|978302109|
| 1| 914| 3|978301968|
| 1| 3408| 4|978300275|
+------+-------+------+---------+
Related
I am trying to filter out bad records from a CSV file using PySpark. The code snippet is given below:
from pyspark.sql import SparkSession
schema="employee_id int,name string,address string,dept_id int"
spark = SparkSession.builder.appName("TestApp").getOrCreate()
data = spark.read.format("csv").option("header", True).schema(schema).option("badRecordsPath", "/tmp/bad_records").load("/path/to/csv/file")
schema_for_bad_record="path string,record string,reason string"
bad_records_frame=spark.read.schema(schema_for_bad_record).json("/tmp/bad_records")
bad_records_frame.select("reason").show()
The valid dataframe is
+-----------+-------+-------+-------+
|employee_id| name|address|dept_id|
+-----------+-------+-------+-------+
| 1001| Floyd| Delhi| 1|
| 1002| Watson| Mumbai| 2|
| 1004|Thomson|Chennai| 3|
| 1005| Bonila| Pune| 4|
+-----------+-------+-------+-------+
In one of the records, both employee_id and dept_id have incorrect values, but the reason shows only one column's issue:
java.lang.NumberFormatException: For input string: "abc"
Is there any way to show reasons for multiple columns in case of failure?
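In case it helps, here is a minimal sketch (not tested against your data; the path and the _corrupt_record column name are assumptions) of a related approach: keeping the whole raw text of any row that fails parsing via Spark's PERMISSIVE mode and columnNameOfCorruptRecord. Spark still records only the first parse failure as the reason, but this way you can at least see every bad value in the offending record at once.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CorruptRecordDemo").getOrCreate()

# Extra string column to capture the raw text of any row that fails parsing.
schema = "employee_id int, name string, address string, dept_id int, _corrupt_record string"

data = (spark.read.format("csv")
        .option("header", True)
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .schema(schema)
        .load("/path/to/csv/file"))

# Cache first: recent Spark versions refuse queries that reference only the
# internal corrupt-record column of a raw CSV/JSON source.
data.cache()
data.filter(data["_corrupt_record"].isNotNull()).show(truncate=False)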
So after certain operations I have some data in a Spark DataFrame, to be specific, org.apache.spark.sql.DataFrame = [_1: string, _2: string ... 1 more field]
Now when I do df.show(), I get the following output, which is expected.
+--------------------+--------------------+--------------------+
| _1| _2| _3|
+--------------------+--------------------+--------------------+
|industry_name_ANZSIC|'industry_name_AN...|.isComplete("indu...|
|industry_name_ANZSIC|'industry_name_AN...|.isContainedIn("i...|
|industry_name_ANZSIC|'industry_name_AN...|.isContainedIn("i...|
| rme_size_grp|'rme_size_grp' is...|.isComplete("rme_...|
| rme_size_grp|'rme_size_grp' ha...|.isContainedIn("r...|
| rme_size_grp|'rme_size_grp' ha...|.isContainedIn("r...|
| year| 'year' is not null| .isComplete("year")|
| year|'year' has type I...|.hasDataType("yea...|
| year|'year' has no neg...|.isNonNegative("y...|
|industry_code_ANZSIC|'industry_code_AN...|.isComplete("indu...|
|industry_code_ANZSIC|'industry_code_AN...|.isContainedIn("i...|
|industry_code_ANZSIC|'industry_code_AN...|.isContainedIn("i...|
| variable|'variable' is not...|.isComplete("vari...|
| variable|'variable' has va...|.isContainedIn("v...|
| unit| 'unit' is not null| .isComplete("unit")|
| unit|'unit' has value ...|.isContainedIn("u...|
| value| 'value' is not null|.isComplete("value")|
+--------------------+--------------------+--------------------+
The problem occurs when I try exporting the dataframe as a csv to my S3 bucket.
The code I have is : df.coalesce(1).write.mode("Append").csv("s3://<my path>")
But the CSV generated in my S3 path is full of gibberish or rich text. Also, the Spark prompt doesn't reappear after execution (meaning execution didn't finish?). Here's a sample screenshot of the generated CSV in my S3:
What am I doing wrong and how do I rectify this?
S3: a short description.
When you change the letter in the URI scheme, it makes a big difference, because it causes different software to be used to interface with S3.
This is the difference between the three:
s3 is a block-based overlay on top of Amazon S3, whereas s3n and s3a are not; they are object-based.
s3n supports objects up to 5 GB where size is the concern, while s3a supports objects up to 5 TB and has higher performance. Note that s3a is the successor to s3n.
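For example, a minimal sketch of the same write going through the s3a connector instead of s3 (the bucket and prefix below are placeholders, and df is the DataFrame from the question):
# 'df' is the DataFrame from the question; bucket and prefix are placeholders.
(df.coalesce(1)
   .write
   .mode("append")
   .csv("s3a://my-bucket/some/prefix/"))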
I have .log file in ADLS which contain multiple nested Json objects as follows
{"EventType":3735091736,"Timestamp":"2019-03-19","Data":{"Id":"event-c2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"d69b7489","ActionId":"d0e2c3fd"}},"Id":"event-c20b9c7eac0808d6321106d901000000"}
{"EventType":3735091737,"Timestamp":"2019-03-18","Data":{"Id":"event-d2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"f69b7489","ActionId":"d0f2c3fd"}},"Id":"event-d20b9c7eac0808d6321106d901000000"}
{"EventType":3735091738,"Timestamp":"2019-03-17","Data":{"Id":"event-e2","Level":1,"MessageTemplate":"Test1","Properties":{"CorrId":"g69b7489","ActionId":"d0d2c3fd"}},"Id":"event-e20b9c7eac0808d6321106d901000000"}
I need to read the above multiple nested JSON objects in PySpark and convert them to a dataframe as follows:
EventType Timestamp Data.Id ..... Data.Properties.CorrId Data.Properties.ActionId
3735091736 2019-03-19 event-c2 ..... d69b7489 d0e2c3fd
3735091737 2019-03-18 event-d2 ..... f69b7489 d0f2c3fd
3735091738 2019-03-17 event-e2 ..... f69b7489 d0d2c3fd
For the above I am using ADLS and PySpark in Azure Databricks.
Does anyone know a general way to deal with above problem? Thanks!
1. You can read it into an RDD first; it will be read in as strings, one JSON object per line.
2. Convert each JSON string into a native Python datatype using json.loads().
3. Then convert the RDD into a dataframe; toDF() can infer the schema directly.
4. Using the answer from Flatten Spark Dataframe column of map/dictionary into multiple columns, you can explode the Data column into multiple columns, given that your Id column is unique. Note that explode returns key and value columns for each entry in the map type.
5. You can repeat the 4th point to explode the Properties column.
Solution:
import json
from pyspark.sql import functions as F  # needed for F.explode / F.first below

rdd = sc.textFile("demo_files/Test20191023.log")
df = rdd.map(lambda x: json.loads(x)).toDF()
df.show()
# +--------------------+----------+--------------------+----------+
# | Data| EventType| Id| Timestamp|
# +--------------------+----------+--------------------+----------+
# |[MessageTemplate ...|3735091736|event-c20b9c7eac0...|2019-03-19|
# |[MessageTemplate ...|3735091737|event-d20b9c7eac0...|2019-03-18|
# |[MessageTemplate ...|3735091738|event-e20b9c7eac0...|2019-03-17|
# +--------------------+----------+--------------------+----------+
data_exploded = df.select('Id', 'EventType', "Timestamp", F.explode('Data'))\
.groupBy('Id', 'EventType', "Timestamp").pivot('key').agg(F.first('value'))
# There is a duplicate Id column and might cause ambiguity problems
data_exploded.show()
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# | Id| EventType| Timestamp| Id|Level|MessageTemplate| Properties|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# |event-c20b9c7eac0...|3735091736|2019-03-19|event-c2| 2| Test1|{CorrId=d69b7489,...|
# |event-d20b9c7eac0...|3735091737|2019-03-18|event-d2| 2| Test1|{CorrId=f69b7489,...|
# |event-e20b9c7eac0...|3735091738|2019-03-17|event-e2| 1| Test1|{CorrId=g69b7489,...|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
I was able to read the data with the following code.
from pyspark.sql.functions import *
DF = spark.read.json("demo_files/Test20191023.log")
DF.select(col('Id'),col('EventType'),col('Timestamp'),col('Data.Id'),col('Data.Level'),col('Data.MessageTemplate'),
col('Data.Properties.CorrId'),col('Data.Properties.ActionId'))\
.show()
Result:
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
| Id| EventType| Timestamp| Id|Level|MessageTemplate| CorrId|ActionId|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
|event-c20b9c7eac0...|3735091736|2019-03-19|event-c2| 2| Test1|d69b7489|d0e2c3fd|
|event-d20b9c7eac0...|3735091737|2019-03-18|event-d2| 2| Test1|f69b7489|d0f2c3fd|
|event-e20b9c7eac0...|3735091738|2019-03-17|event-e2| 1| Test1|g69b7489|d0d2c3fd|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
I am trying to parse a column containing a list of JSON strings, but even after trying multiple schemas using StructType, StructField, etc., I am just unable to get the schema right at all.
[{"event":"empCreation","count":"148"},{"event":"jobAssignment","count":"3"},{"event":"locationAssignment","count":"77"}]
[{"event":"empCreation","count":"334"},{"event":"jobAssignment","count":33"},{"event":"locationAssignment","count":"73"}]
[{"event":"empCreation","count":"18"},{"event":"jobAssignment","count":"32"},{"event":"locationAssignment","count":"72"}]
Based on this SO post, I was able to derive the JSON schema, but even after applying the from_json function it still wouldn't work:
Pyspark: Parse a column of json strings
Can you please help?
You can parse the given JSON with the schema definition below and read the JSON as a DataFrame, providing the schema info.
>>> dschema = StructType([
... StructField("event", StringType(),True),
... StructField("count", StringType(),True)])
>>>
>>>
>>> df = spark.read.json('/<json_file_path>/json_file.json', schema=dschema)
>>>
>>> df.show()
+------------------+-----+
| event|count|
+------------------+-----+
| empCreation| 148|
| jobAssignment| 3|
|locationAssignment| 77|
| empCreation| 334|
| jobAssignment| 33|
|locationAssignment| 73|
| empCreation| 18|
| jobAssignment| 32|
|locationAssignment| 72|
+------------------+-----+
>>>
Contents of the json file:
cat json_file.json
[{"event":"empCreation","count":"148"},{"event":"jobAssignment","count":"3"},{"event":"locationAssignment","count":"77"}]
[{"event":"empCreation","count":"334"},{"event":"jobAssignment","count":"33"},{"event":"locationAssignment","count":"73"}]
[{"event":"empCreation","count":"18"},{"event":"jobAssignment","count":"32"},{"event":"locationAssignment","count":"72"}]
Thanks so much @Lakshmanan, but I had to make a slight change to the schema:
eventCountSchema = ArrayType(StructType([StructField("event", StringType(), True), StructField("count", StringType(), True)]), True)
and then applied this schema to the DataFrame's complex-datatype column.
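For reference, a minimal sketch of that last step (the DataFrame df and the column name json_col are hypothetical stand-ins, since the original column wasn't named in the question):
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

eventCountSchema = ArrayType(StructType([
    StructField("event", StringType(), True),
    StructField("count", StringType(), True)]), True)

# 'df' and 'json_col' are hypothetical names for the DataFrame and the
# string column that holds the JSON array.
parsed = df.withColumn("parsed", F.from_json(F.col("json_col"), eventCountSchema))
exploded = parsed.select(F.explode("parsed").alias("e")).select("e.event", "e.count")
exploded.show()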
This question already has answers here: How do I convert csv file to rdd (12 answers)
Closed 5 years ago.
I have a CSV file where one of the columns contains values enclosed in double quotes, and those values also contain commas. How do I read this type of column from a CSV into an RDD in Spark using Scala? The double-quoted values should be read as an integer type, as they are amounts like total assets and total debts.
Example records from the CSV are:
Jennifer,7/1/2000,0,,0,,151,11,8,"25,950,816","5,527,524",51,45,45,45,48,50,2,,
John,7/1/2003,0,,"200,000",0,151,25,8,"28,255,719","6,289,723",48,46,46,46,48,50,2,"4,766,127,272",169
I would suggest you read it with SQLContext as a CSV file, as it has well-tested mechanisms and flexible APIs to satisfy your needs.
You can do
val dataframe = sqlContext.read.csv("path to your csv file")
The output would be:
+-----------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
| _c0| _c1|_c2| _c3| _c4| _c5|_c6|_c7|_c8| _c9| _c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17| _c18|_c19|
+-----------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
| Jennifer|7/1/2000| 0|null| 0|null|151| 11| 8|25,950,816|5,527,524| 51| 45| 45| 45| 48| 50| 2| null|null|
| John|7/1/2003| 0|null|200,000| 0|151| 25| 8|28,255,719|6,289,723| 48| 46| 46| 46| 48| 50| 2|4,766,127,272| 169|
+-----------+--------+---+----+-------+----+---+---+---+----------+---------+----+----+----+----+----+----+----+-------------+----+
Now you can change the header names, change the required columns to integers, and do a lot of other things.
You can even convert it to an RDD.
Edited
If you prefer to read into an RDD and stay with RDDs, then:
Read the file with sparkContext as a textFile:
val rdd = sparkContext.textFile("/home/anahcolus/IdeaProjects/scalaTest/src/test/resources/test.csv")
Then split the lines on , while ignoring any , inside double quotes:
rdd.map(line => line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1))
@ibh, this is not Spark- or Scala-specific stuff. In Spark you will read the file the usual way:
val conf = new SparkConf().setAppName("app_name").setMaster("local")
val ctx = new SparkContext(conf)
val file = ctx.textFile("<your file>.csv")
rdd.foreach{line =>
// cleanup code as per regex below
val tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1)
// side effect
val myObject = new MyObject(tokens)
mylist.add(myObject)
}
See this regex also.