Read multiple CSVs with different headers into one single dataframe - pyspark

I have a few CSV files where some files share some columns and others have altogether different columns.
For example, file 1 has the following columns:
['circuitId', 'circuitRef', 'name', 'location', 'country', 'lat', 'lng', 'alt', 'url']
and file 2 has the following columns:
['raceId', 'year', 'round', 'circuitId', 'name', 'date', 'time', 'url']
I want to create a single dataframe that has all of these columns. I wrote the following code hoping that the mapping between the pre-defined schema and the CSV file headers would happen automatically, but it did not work out.
from pyspark.sql.types import StructType, StructField, StringType

sch = StructType([StructField('circuitId', StringType(), True),
                  StructField('year', StringType(), True),
                  StructField('name', StringType(), True),
                  StructField('alt', StringType(), True),
                  StructField('url', StringType(), True),
                  StructField('round', StringType(), True),
                  StructField('lng', StringType(), True),
                  StructField('date', StringType(), True),
                  StructField('circuitRef', StringType(), True),
                  StructField('raceId', StringType(), True),
                  StructField('lat', StringType(), True),
                  StructField('location', StringType(), True),
                  StructField('country', StringType(), True),
                  StructField('time', StringType(), True)])

df = spark.read \
    .option('header', 'true') \
    .schema(sch) \
    .csv('/FileStore/Udemy/Formula_One_Raw/*.csv')
With this code the values end up under the wrong column names.

The CSV reader resolves a user-supplied schema by column position, not by column name, so if you have a file with 8 columns, like:
['raceId', 'year', 'round', 'circuitId', 'name', 'date', 'time', 'url']
and the first 8 fields of the schema you are trying to apply are:
StructField('circuitId', StringType(), True),
StructField('year', StringType(), True),
StructField('name', StringType(), True),
StructField('alt', StringType(), True),
StructField('url', StringType(), True),
StructField('round', StringType(), True),
StructField('lng', StringType(), True),
StructField('date', StringType(), True)
then the raceId values will be read into circuitId, round into name, and so on. You can do the following to resolve this:
Create a separate schema for every distinct file layout, taking into account which columns it actually contains, and add the ones that are not part of that particular file at the end - this way they will still be included in your DataFrame, filled with NULL.
So for the 8-column example file above, you can declare the schema like so:
sch = StructType([StructField('raceId', StringType(), True),
                  StructField('year', StringType(), True),
                  StructField('round', StringType(), True),
                  StructField('circuitId', StringType(), True),
                  StructField('name', StringType(), True),
                  StructField('date', StringType(), True),
                  StructField('time', StringType(), True),
                  StructField('url', StringType(), True),
                  StructField('alt', StringType(), True),
                  StructField('lng', StringType(), True),
                  StructField('circuitRef', StringType(), True),
                  StructField('lat', StringType(), True),
                  StructField('location', StringType(), True),
                  StructField('country', StringType(), True)])
And your DataFrame should look like so:
+------+----+-----+---------+--------------------+----------+--------+--------------------+----+----+----------+----+--------+-------+
|raceId|year|round|circuitId|                name|      date|    time|                 url| alt| lng|circuitRef| lat|location|country|
+------+----+-----+---------+--------------------+----------+--------+--------------------+----+----+----------+----+--------+-------+
|     1|2009|    1|        1|Australian Grand ...|2009-03-29|06:00:00|http://en.wikiped...|null|null|      null|null|    null|   null|
+------+----+-----+---------+--------------------+----------+--------+--------------------+----+----+----------+----+--------+-------+
After all of your data has been read into DataFrames, you will be able to union them by column name, if needed.
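As a complementary approach, if you would rather not maintain a hand-written schema per file layout, you can read each file with its own header and combine the results with unionByName. This is a minimal sketch, assuming Spark 3.1+ (for allowMissingColumns=True); the individual file names are hypothetical:
from functools import reduce

# Hypothetical paths; adjust to your own files.
paths = ['/FileStore/Udemy/Formula_One_Raw/circuits.csv',
         '/FileStore/Udemy/Formula_One_Raw/races.csv']

# Read each file with its own header so columns can later be matched by name.
dfs = [spark.read.option('header', 'true').csv(p) for p in paths]

# unionByName aligns columns by name; allowMissingColumns (Spark 3.1+)
# fills columns absent from one side with NULL.
combined = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), dfs)
combined.show()
Note that allowMissingColumns was only added in Spark 3.1; on older versions you would still need the per-file schemas described above.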

Related

How to check if at least one element of a list is included in a text column?

I have a data frame with a column containing text and a list of keywords. My goal is to build a new column showing if the text column contains at least one of the keywords. Let's look at some mock data:
from pyspark.sql import functions as F

test_data = [('1', 'i like stackoverflow'),
             ('2', 'tomorrow the sun will shine')]
test_df = spark.sparkContext.parallelize(test_data).toDF(['id', 'text'])
With a single keyword ("sun") the solution would be:
test_df.withColumn(
    'text_contains_keyword', F.array_contains(F.split(test_df.text, ' '), 'sun')
).show()
The word "sun" is included in the second row of the text column, but not in the first. Now, let's say I have a list of keywords:
test_keywords = ['sun', 'foo', 'bar']
How can I check, for each of the words in test_keywords, whether it is included in the text column? Unfortunately, if I simply replace "sun" with the list, it leads to this error:
Unsupported literal type class java.util.ArrayList [sun, foo, bar]
You can do that using the built-in rlike function with the following code.
from pyspark.sql import functions

test_df = (test_df.withColumn("text_contains_word",
                              functions.col('text')
                              .rlike(r'(^|\s)(' + '|'.join(test_keywords)
                                     + r')(\s|$)')))
test_df.show()
+---+--------------------+------------------+
| id|                text|text_contains_word|
+---+--------------------+------------------+
|  1|i like stackoverflow|             false|
|  2|tomorrow the sun ...|              true|
+---+--------------------+------------------+
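If you would rather keep the original array-based approach, the literal-type error can also be avoided by building an array column from the keyword list and using arrays_overlap. A minimal sketch, assuming Spark 2.4+ where arrays_overlap is available:
from pyspark.sql import functions as F

# Build an array column from the Python list instead of passing the list
# itself as a literal (which is what triggers the ArrayList error).
keyword_array = F.array(*[F.lit(w) for w in test_keywords])

test_df.withColumn(
    'text_contains_keyword',
    F.arrays_overlap(F.split(test_df.text, ' '), keyword_array)
).show()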

pyspark: filter parquet files with different column structures

I have my parquet data saved in an AWS S3 bucket. The parquet files are partitioned by date and the folder structure looks like this:
MyFolder
|-- date=20210701
|   |-- part-xysdf-snappy.parquet
|-- date=20210702
|   |-- part-fasdf-snappy.parquet
|-- date=20210703
|   |-- part-ghdfg-snappy.parquet
....
....
Please note that the parquet file in date=20210701 (which is the earliest entry) is faulty and is missing two columns:
+-------+-----+
|   name|grade|
+-------+-----+
|Alberto|  100|
| Dakota|   96|
+-------+-----+
The rest of the parquet files are fine and look like this:
+-------+-----+------+--------+
|   name|grade|height|    date|
+-------+-----+------+--------+
|Karolin|  110|   173|20210702|
|  Lucas|   91|   178|20210702|
+-------+-----+------+--------+
If I want to only focus on 'name' and 'grade', I can use the following code to show the results
def check_data(start_date, end_date):
    cols = ['name', 'grade']
    df = spark.read.parquet('path/MyFolder').select(cols)
    df = df.filter(f'date > "{start_date}" and date < "{end_date}"')
    return df
The code above is handy and it works fine. However, now I want to add the 'height' and 'date' columns and ignore date=20210701 (because it is missing two columns). Things get trickier. If I use
def check_data(start_date, end_date):
    cols = ['name', 'grade', 'height', 'date']
    nan = 'Nan'
    df = spark.read.parquet('path/MyFolder').filter(f'height != "{nan}"')
    df = df.filter(f'date > "{start_date}" and date < "{end_date}"')
    df = df.select(cols)
    return df
I get this error:
Cannot resolve 'height' given input columns [name, grade].....
The only solution I have come up with is to loop through all the parquet folders and append the PySpark dataframes one by one, but that would take extra hours.
Also, if I delete date=20210701 the problem goes away, but I just cannot do that.
Can you please share your thoughts? Thanks. 🖖
If the data is missing only for a single row or for a small number of rows, you can replace the null values with the mean/median value of that column.
In this case you can calculate the median of all the heights across the parquet files and then use that value for date=20210701.
This way your data won't be skewed.
The median is also preferred over the mean, because outliers can skew the mean value.
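A rough sketch of that imputation idea, assuming the files are read with mergeSchema so the missing columns come back as NULL; approxQuantile with the 0.5 quantile is used as a practical stand-in for an exact median:
from pyspark.sql import functions as F

# Read everything; mergeSchema makes the columns missing from the faulty
# partition show up as NULL instead of breaking the read.
df = spark.read.option('mergeSchema', 'true').parquet('path/MyFolder')

# Cast height to a numeric type and take the approximate median (0.5 quantile).
df = df.withColumn('height', F.col('height').cast('double'))
median_height = df.approxQuantile('height', [0.5], 0.0)[0]

# Fill the missing heights (e.g. the date=20210701 rows) with that median.
df = df.fillna({'height': median_height})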
Actually, the solution is very simple.
df = (spark.read.format('parquet')
      .option('mergeSchema', 'true')
      .load(path)
      .select('name', 'grade', 'height', 'date'))
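Building on that, here is a hedged sketch of how the original check_data helper might look with schema merging plus a filter that drops the rows from the faulty partition; the path and date bounds are the placeholders from the question:
from pyspark.sql import functions as F

def check_data(start_date, end_date):
    cols = ['name', 'grade', 'height', 'date']
    df = (spark.read
          .option('mergeSchema', 'true')
          .parquet('path/MyFolder'))
    # With mergeSchema, the partition that lacks 'height' (date=20210701)
    # comes back with NULL there, so those rows can simply be filtered out.
    df = df.filter(F.col('height').isNotNull())
    df = df.filter(f'date > "{start_date}" and date < "{end_date}"')
    return df.select(cols)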

How to read multiple nested json objects in one file extract by pyspark to dataframe in Azure databricks?

I have a .log file in ADLS which contains multiple nested JSON objects, as follows:
{"EventType":3735091736,"Timestamp":"2019-03-19","Data":{"Id":"event-c2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"d69b7489","ActionId":"d0e2c3fd"}},"Id":"event-c20b9c7eac0808d6321106d901000000"}
{"EventType":3735091737,"Timestamp":"2019-03-18","Data":{"Id":"event-d2","Level":2,"MessageTemplate":"Test1","Properties":{"CorrId":"f69b7489","ActionId":"d0f2c3fd"}},"Id":"event-d20b9c7eac0808d6321106d901000000"}
{"EventType":3735091738,"Timestamp":"2019-03-17","Data":{"Id":"event-e2","Level":1,"MessageTemplate":"Test1","Properties":{"CorrId":"g69b7489","ActionId":"d0d2c3fd"}},"Id":"event-e20b9c7eac0808d6321106d901000000"}
I need to read these multiple nested JSON objects in pyspark and convert them to a dataframe as follows:
EventType   Timestamp   Data.Id   ...  Data.Properties.CorrId  Data.Properties.ActionId
3735091736  2019-03-19  event-c2  ...  d69b7489                d0e2c3fd
3735091737  2019-03-18  event-d2  ...  f69b7489                d0f2c3fd
3735091738  2019-03-17  event-e2  ...  g69b7489                d0d2c3fd
For the above I am using ADLS and PySpark in Azure Databricks.
Does anyone know a general way to deal with the above problem? Thanks!
1. You can read it into an RDD first. It will be read as a list of strings.
2. You need to convert each JSON string into a native Python datatype using json.loads().
3. Then you can convert the RDD into a dataframe, and it can infer the schema directly using toDF().
4. Using the answer from Flatten Spark Dataframe column of map/dictionary into multiple columns, you can explode the Data column into multiple columns, given that your Id column is going to be unique. Note that explode returns key, value columns for each entry in the map type.
5. You can repeat the 4th point to explode the Properties column.
Solution:
import json
from pyspark.sql import functions as F

rdd = sc.textFile("demo_files/Test20191023.log")
df = rdd.map(lambda x: json.loads(x)).toDF()
df.show()
# +--------------------+----------+--------------------+----------+
# |                Data| EventType|                  Id| Timestamp|
# +--------------------+----------+--------------------+----------+
# |[MessageTemplate ...|3735091736|event-c20b9c7eac0...|2019-03-19|
# |[MessageTemplate ...|3735091737|event-d20b9c7eac0...|2019-03-18|
# |[MessageTemplate ...|3735091738|event-e20b9c7eac0...|2019-03-17|
# +--------------------+----------+--------------------+----------+

data_exploded = df.select('Id', 'EventType', 'Timestamp', F.explode('Data')) \
    .groupBy('Id', 'EventType', 'Timestamp').pivot('key').agg(F.first('value'))
# Note: there is a duplicate Id column (the top-level Id and the nested Data Id),
# which might cause ambiguity problems.
data_exploded.show()
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# |                  Id| EventType| Timestamp|      Id|Level|MessageTemplate|          Properties|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
# |event-c20b9c7eac0...|3735091736|2019-03-19|event-c2|    2|          Test1|{CorrId=d69b7489,...|
# |event-d20b9c7eac0...|3735091737|2019-03-18|event-d2|    2|          Test1|{CorrId=f69b7489,...|
# |event-e20b9c7eac0...|3735091738|2019-03-17|event-e2|    1|          Test1|{CorrId=g69b7489,...|
# +--------------------+----------+----------+--------+-----+---------------+--------------------+
I was able to read the data with the following code.
from pyspark.sql.functions import *

DF = spark.read.json("demo_files/Test20191023.log")
DF.select(col('Id'), col('EventType'), col('Timestamp'), col('Data.Id'),
          col('Data.Level'), col('Data.MessageTemplate'),
          col('Data.Properties.CorrId'), col('Data.Properties.ActionId')) \
  .show()
Result:
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
|                  Id| EventType| Timestamp|      Id|Level|MessageTemplate|  CorrId|ActionId|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
|event-c20b9c7eac0...|3735091736|2019-03-19|event-c2|    2|          Test1|d69b7489|d0e2c3fd|
|event-d20b9c7eac0...|3735091737|2019-03-18|event-d2|    2|          Test1|f69b7489|d0f2c3fd|
|event-e20b9c7eac0...|3735091738|2019-03-17|event-e2|    1|          Test1|g69b7489|d0d2c3fd|
+--------------------+----------+----------+--------+-----+---------------+--------+--------+
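One follow-up worth noting: both results end up with two columns named Id (the top-level one and Data.Id), which can cause ambiguity later on. A minimal sketch of aliasing the nested columns to avoid that; the new column names are just suggestions:
from pyspark.sql.functions import col

DF = spark.read.json("demo_files/Test20191023.log")
flat = DF.select(col('Id').alias('EventId'),
                 col('EventType'),
                 col('Timestamp'),
                 col('Data.Id').alias('DataId'),
                 col('Data.Level').alias('Level'),
                 col('Data.MessageTemplate').alias('MessageTemplate'),
                 col('Data.Properties.CorrId').alias('CorrId'),
                 col('Data.Properties.ActionId').alias('ActionId'))
flat.show()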

Spark dataframe explode column

Every row in the dataframe contains a CSV-formatted string (line) plus another simple string (category). What I'm trying to get in the end is a dataframe composed of the fields extracted from the line string together with the category.
So I proceeded as follows to explode the line string
val df = stream.toDF("line", "category")
  .map(x => x.getString(0))......
In the end I manage to get a new dataframe composed of the line fields, but I can't bring the category over to the new dataframe.
I can't join the new dataframe with the initial one either, since the common field id was not a separate column at first.
Sample of input:
line                           | category
"'1';'daniel';'dan#gmail.com'" | "premium"
Sample of output:
id | name     | email           | category
1  | "daniel" | "dan#gmail.com" | "premium"
Any suggestions, thanks in advance.
If the structure of the strings in the line column is fixed as shown in the question, then the following simple solution should work: the split built-in function is used to split the string into an array, and then the elements of the array are selected and aliased to get the final dataframe.
import org.apache.spark.sql.functions._

df.withColumn("line", split(col("line"), ";"))
  .select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
  .show(false)
which should give you
+---+--------+---------------+--------+
|id |name    |email          |category|
+---+--------+---------------+--------+
|'1'|'daniel'|'dan#gmail.com'|premium |
+---+--------+---------------+--------+
I hope the answer is helpful
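Since most of this page uses PySpark, here is a roughly equivalent sketch in Python, assuming a PySpark DataFrame df with the same line and category columns; like the Scala answer, it leaves the single quotes inside the values untouched:
from pyspark.sql import functions as F

parts = F.split(F.col('line'), ';')
result = df.select(parts.getItem(0).alias('id'),
                   parts.getItem(1).alias('name'),
                   parts.getItem(2).alias('email'),
                   F.col('category'))
result.show(truncate=False)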

How to drop particular records with junk characters (like $, "NA", etc.) from a CSV file using PySpark

I have a CSV file, from which I want to remove records which have particular characters like "$", "NA", "##".
I am not able to figure out any function to drop the records for this scenario.
How can I achieve this?
Hello All,
I tried the code below and it is working fine, but it only handles one particular pattern, and I want to remove multiple occurrences of garbage values (#, ##, ###, $, $$, $$$ and so on), e.g.:
filter_list = ['##', '$']
df = df.filter(df.color.isin(*filter_list) == False)
df.show()
In this example I used a single column, "color", but instead of a single column I want to work with multiple columns (passing an array).
Thanks in advance.
You can accomplish this by using the filter function
(http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.filter)
Here's some example code:
from pyspark.sql import functions as F

# create some test data
df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00", "orange"),
                            (13, "2017-03-15T12:27:18+00:00", "$"),
                            (25, "2017-03-18T11:27:18+00:00", "##")],
                           ["dollars", "timestampGMT", "color"])
df.show()
Here's what the data looks like:
+-------+--------------------+------+
|dollars|        timestampGMT| color|
+-------+--------------------+------+
|     17|2017-03-10T15:27:...|orange|
|     13|2017-03-15T12:27:...|     $|
|     25|2017-03-18T11:27:...|    ##|
+-------+--------------------+------+
You can create a filter list and then filter out the records that match (in this case from the color column):
filter_list = ['##', '$']
df = df.filter(df.color.isin(*filter_list) == False)
df.show()
Here's what the data looks like now:
+-------+--------------------+------+
|dollars|        timestampGMT| color|
+-------+--------------------+------+
|     17|2017-03-10T15:27:...|orange|
+-------+--------------------+------+
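To extend this to multiple columns, as asked in the follow-up above, one option is to build the filter condition column by column and combine the pieces with &. A minimal sketch, where filter_list and cols_to_check are example values and the caller passes in whichever columns need cleaning:
from functools import reduce
from pyspark.sql import functions as F

filter_list = ['##', '$', 'NA']
cols_to_check = ['color', 'timestampGMT']  # whichever columns need cleaning

# Keep a row only if none of the checked columns contains a junk value.
condition = reduce(lambda acc, c: acc & ~F.col(c).isin(filter_list),
                   cols_to_check,
                   F.lit(True))
clean_df = df.filter(condition)
clean_df.show()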