I am trying to import an unstructured CSV from Data Lake storage into Databricks, and I want to read the entire content of this file:
EdgeMaster
Name Value Unit Status Nom. Lower Upper Description
Type A A
Date 1/1/2022 B
Time 0:00:00 A
X 1 m OK 1 2 3 B
Y - A
EdgeMaster
Name Value Unit Status Nom. Lower Upper Description
Type B C
Date 1/1/2022 D
Time 0:00:00 C
X 1 m OK 1 2 3 D
Y - C
1. Method 1: I tried reading the first line as the header
df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load('abfss://xyz/sample.csv')
I get only this:
2. Method 2: I skipped reading the header
No improvement:
3. Method 3: I defined a custom schema
The query returns no result:
If you know the schema ahead of time, it should be possible to read the CSV file and drop the malformed data.
See this as an example:
name_age.csv
Hello
name,age
aj,19
Hello
name,age
test,20
And the code to read this would be:
>>> from pyspark.sql.types import StringType,IntegerType,StructField,StructType
>>> schema=StructType([StructField("name",StringType(),True),StructField("age",IntegerType(),True)])
>>> df=spark.read.csv("name_age.csv",sep=",",mode="DROPMALFORMED",schema=schema)
>>> df.show()
+----+---+
|name|age|
+----+---+
|  aj| 19|
|test| 20|
+----+---+
Other helpful link: Remove first and last row from the text file in pyspark
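Applied to the EdgeMaster file from the question, a minimal sketch might look like the one below. The column names and types are assumptions based on the Name/Value/Unit/Status/Nom./Lower/Upper/Description header, and sep="\t" is only a guess at the delimiter; note that DROPMALFORMED will also discard sparsely filled rows (such as the Type/Date/Time lines) if they do not match the schema.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# assumed schema based on the header row shown in the question
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Value", StringType(), True),
    StructField("Unit", StringType(), True),
    StructField("Status", StringType(), True),
    StructField("Nom", DoubleType(), True),
    StructField("Lower", DoubleType(), True),
    StructField("Upper", DoubleType(), True),
    StructField("Description", StringType(), True),
])

# sep="\t" is an assumption; adjust it to whatever actually delimits the file
df = spark.read.csv("abfss://xyz/sample.csv", sep="\t", mode="DROPMALFORMED", schema=schema)
df.show()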
I am trying to convert a piece of SAS code containing the retain functionality and multiple if-else statements to PySpark. I had no luck when I tried to search for similar answers.
Input Dataset:
+---------+---------+----+
|Prod_Code|     Rate|Rank|
+---------+---------+----+
|ADAMAJ091|1234.0091|   1|
|ADAMAJ091|1222.0001|   2|
|ADAMAJ091|1222.0000|   3|
|BASSDE012|5221.0123|   1|
|BASSDE012|5111.0022|   2|
|BASSDE012|5110.0000|   3|
+---------+---------+----+
I have calculated the rank using df.withColumn("Rank", row_number().over(Window.partitionBy('Prod_Code').orderBy('Rate')))
The Rate value at rank 1 must be replicated to all other rows in the partition (ranks 1 to N).
Expected Output Dataset:
+---------+---------+----+
|Prod_Code|     Rate|Rank|
+---------+---------+----+
|ADAMAJ091|1234.0091|   1|
|ADAMAJ091|1234.0091|   2|
|ADAMAJ091|1234.0091|   3|
|BASSDE012|5221.0123|   1|
|BASSDE012|5221.0123|   2|
|BASSDE012|5221.0123|   3|
+---------+---------+----+
The Rate value present at rank = 1 must be replicated to all other rows in the same partition. This is the retain functionality, and I need help replicating it in PySpark code.
I tried a df.withColumn() approach on individual rows, but I was not able to achieve this functionality in PySpark.
Since you already have the Rank column, you can use the first function to get the first Rate value in a window ordered by Rank.
from pyspark.sql.functions import first
from pyspark.sql.window import Window
df = df.withColumn('Rate', first('Rate').over(Window.partitionBy('Prod_Code').orderBy('Rank')))
df.show()
# +---------+---------+----+
# |Prod_Code|     Rate|Rank|
# +---------+---------+----+
# |BASSDE012|5221.0123|   1|
# |BASSDE012|5221.0123|   2|
# |BASSDE012|5221.0123|   3|
# |ADAMAJ091|1234.0091|   1|
# |ADAMAJ091|1234.0091|   2|
# |ADAMAJ091|1234.0091|   3|
# +---------+---------+----+
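If the Rank column does not exist yet, here is a minimal sketch that computes it first and then carries the rank-1 Rate across each partition; the descending order is an assumption based on the sample data, where rank 1 holds the highest Rate.
from pyspark.sql.functions import row_number, first, col
from pyspark.sql.window import Window

# assumption: rank 1 corresponds to the highest Rate within each Prod_Code
rank_w = Window.partitionBy('Prod_Code').orderBy(col('Rate').desc())
df = df.withColumn('Rank', row_number().over(rank_w))

# then replicate the rank-1 Rate to every row in the partition, as in the answer above
fill_w = Window.partitionBy('Prod_Code').orderBy('Rank')
df = df.withColumn('Rate', first('Rate').over(fill_w))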
I have a dataframe with column FN and a list of a subset of these column values
e.g.
**FN**
ABC
DEF
GHI
JKL
MNO
List:
["GHI","DEF"]
I want to add a column to my dataframe where, if the column value exists in the list, I record its position within the list. That is, my end DF would be:
FN POS
ABC
DEF 1
GHI 0
JKL
MNO
My code is as follows
from pyspark.sql.functions import udf, when, col, lit
from pyspark.sql.types import StringType
l = ["GHI","DEF"]
x = udf(lambda fn, p = l: p.index(fn), StringType())
df = df.withColumn('POS', when(col("FN").isin(l), x(col("FN"))).otherwise(lit('')))
But when running I get a "Job aborted due to stage failure" exception with a series of other exceptions, the only meaningful part being "ValueError: 'JKL' is not in list" (JKL being another value in my DF's FN column that is not in the list).
If instead of "p.index(fn)" I just return "fn", I get the correct column values in my new column; similarly, if I use "p.index("DEF")", I get "1" back. So individually these work. Any ideas why the exceptions?
TIA
EDIT: I have managed to work around this by doing an if-else within the lambda, which almost implies that the lambda is executed before the "isin" check within the withColumn statement.
What I would like to know (other than whether the above is true) is whether anyone has a suggestion for achieving this in a better manner.
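For reference, a minimal sketch of that workaround; the exact guard is an assumption about what the if-else looked like, and pos_udf is just an illustrative name.
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

l = ["GHI", "DEF"]

# guard inside the lambda so p.index() is never evaluated for values outside the list
pos_udf = udf(lambda fn, p=l: str(p.index(fn)) if fn in p else '', StringType())

df = df.withColumn('POS', pos_udf(col("FN")))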
Here is my try: I made a dataframe from the given list (value, position) and left-joined it to the original dataframe, which avoids calling a UDF on values that are not in the list.
from pyspark.sql.functions import *
l = ['GHI','DEF']
m = [(v, i) for i, v in enumerate(l)]
df2 = spark.createDataFrame(m).toDF('FN', 'POS')
df1 = spark.createDataFrame(['ABC', 'DEF', 'GHI', 'JKL', 'MNO'], "string").toDF('FN')
df1.join(df2, ['FN'], 'left').show()
+---+----+
| FN| POS|
+---+----+
|JKL|null|
|MNO|null|
|DEF|   1|
|GHI|   0|
|ABC|null|
+---+----+
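If you prefer the blanks from the question's expected output rather than null, a small hedged follow-up to the join above:
from pyspark.sql import functions as F

# replace null with an empty string, as in the question's expected output
result = df1.join(df2, ['FN'], 'left').withColumn('POS', F.coalesce(F.col('POS').cast('string'), F.lit('')))
result.show()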
I'm running the following code on databricks:
dataToShow = jDataJoined.\
withColumn('id', monotonically_increasing_id()).\
filter(
(jDataJoined.containerNumber == 'SUDU8108536')).\
select(col('id'), col('returnTemperature'), col('supplyTemperature'))
This will give me tabular data like
Now I would like to display a line graph with returnTemperature and supplyTemperature as categories.
As far as I understand, the display method in Databricks wants the category as its second argument, so basically what I should have is something like:
id - temperatureCategory - value
1 - returnTemperature - 25.0
1 - supplyTemperature - 27.0
2 - returnTemperature - 24.0
2 - supplyTemperature - 28.0
How can I transform the dataframe in this way?
I don't know if your format is what the display method is expecting, but you can do this transformation with the SQL functions create_map and explode:
# create an example DataFrame
from pyspark.sql import functions as F
l1 = [(1,25.0,27.0),(2,24.0,28.0)]
df = spark.createDataFrame(l1,['id','returnTemperature','supplyTemperature'])
#creates a map column which contains the values of the returnTemperature and supplyTemperature
df = df.withColumn('mapCol', F.create_map(
F.lit('returnTemperature'),df.returnTemperature
,F.lit('supplyTemperature'),df.supplyTemperature
)
)
#The explode function creates a new row for each element of the map
df = df.select('id',F.explode(df.mapCol).alias('temperatureCategory','value'))
df.show()
Output:
+---+-------------------+-----+
| id|temperatureCategory|value|
+---+-------------------+-----+
|  1|  returnTemperature| 25.0|
|  1|  supplyTemperature| 27.0|
|  2|  returnTemperature| 24.0|
|  2|  supplyTemperature| 28.0|
+---+-------------------+-----+
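If you would rather avoid the intermediate map column, a hedged alternative is the SQL stack function, which unpivots the two temperature columns in a single select:
# rebuild the example DataFrame from the top of the answer above
l1 = [(1, 25.0, 27.0), (2, 24.0, 28.0)]
df = spark.createDataFrame(l1, ['id', 'returnTemperature', 'supplyTemperature'])

# stack(n, label1, col1, label2, col2, ...) emits one row per (label, value) pair
long_df = df.selectExpr(
    "id",
    "stack(2, 'returnTemperature', returnTemperature, "
    "'supplyTemperature', supplyTemperature) as (temperatureCategory, value)"
)
long_df.show()
In Databricks, calling display(long_df) should then let you pick a line chart with temperatureCategory as the series grouping.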
I am a beginner with Spark. Please help me out with a solution.
The CSV file contains text in the form of key:value pairs delimited by a comma, and in some lines the keys (or columns) may be missing.
I have loaded this file into a single column of a dataframe. I want to split these keys out as columns, with the associated values as the data in those columns. When a column is missing in a line, I want to add that column with a dummy (null) value.
Dataframe
+----------------------------------------------------------------+
| _c0 |
+----------------------------------------------------------------+
|name:Pradnya,IP:100.0.0.4, college: SDM, year:2018 |
|name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018 |
+----------------------------------------------------------------+
I want the output in this form
+-------+-----------+-------+--------+----+
|   name|         IP|College|Semester|year|
+-------+-----------+-------+--------+----+
|Pradnya|  100.0.0.4|    SDM|    null|2018|
|    Ram|100.10.10.5|    BVB|      IV|2018|
+-------+-----------+-------+--------+----+
Thanks.
PySpark won't recognize the key:value pairing. One workaround is to convert the file into JSON format and then read the JSON file.
content of raw.txt:
name:Pradnya,IP:100.0.0.4, college: SDM, year:2018
name:Ram, IP:100.10.10.5, college: BVB, semester:IV, year:2018
The following code will create the JSON file (stripping stray spaces so that keys like 'IP' and ' IP' collapse into one column):
import json

with open('raw.json', 'w') as outfile:
    json.dump([dict([s.strip() for s in p.split(':', 1)] for p in l.split(',')) for l in open('raw.txt')], outfile)
Now you can create the PySpark dataframe using the following code:
df = spark.read.format('json').load('raw.json')
If you know all the field names and the keys/values do not contain embedded delimiters, then you can probably convert the key/value lines into Row objects through the RDD's map function.
from pyspark.sql import Row
# assumed you already defined SparkSession named `spark`
sc = spark.sparkContext
# initialize the RDD
rdd = sc.textFile("key-value-file")
# define a list of all field names
columns = ['name', 'IP', 'College', 'Semester', 'year']
# set Row object
def setRow(x):
    # convert the line into key/value tuples; strip spaces and lowercase the keys
    z = dict((k.strip().lower(), v.strip()) for e in x.split(',') for k, v in [e.split(':', 1)])
    # make sure all columns appear in the Row object, filling missing ones with None
    return Row(**dict((c, z.get(c)) for c in map(str.lower, columns)))
# map lines to Row objects and then convert the result to dataframe
rdd.map(setRow).toDF().show()
#+-------+-----------+-------+--------+----+
#|college|         ip|   name|semester|year|
#+-------+-----------+-------+--------+----+
#|    SDM|  100.0.0.4|Pradnya|    null|2018|
#|    BVB|100.10.10.5|    Ram|      IV|2018|
#+-------+-----------+-------+--------+----+
I have a DataFrame like below with three columns:
+---+-----------+-----------+
| id|visit_class|    in_date|
+---+-----------+-----------+
|  1|     Non Hf|24-SEP-2017|
|  1|     Non Hf|23-SEP-2017|
|  1|         Hf|27-SEP-2017|
|  1|     Non Hf|28-SEP-2017|
|  2|     Non Hf|24-SEP-2017|
|  2|         Hf|25-SEP-2017|
+---+-----------+-----------+
I want to group this DataFrame by id, sort each group by the in_date column, and keep only the rows from the first occurrence of Hf onwards. The output will be like below: the first 2 rows are dropped for id = 1 and the first row is dropped for id = 2.
+---+-----------+-----------+
| id|visit_class|    in_date|
+---+-----------+-----------+
|  1|         Hf|27-SEP-2017|
|  1|     Non Hf|28-SEP-2017|
|  2|         Hf|25-SEP-2017|
+---+-----------+-----------+
How will I achieve this in Spark and Scala?
Steps:
1) Create the WindowSpec, order by date and partition by id:
2) Create a cumulative sum as an indicator of whether Hf has appeared, then filter based on that condition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// assumes a SparkSession named `spark` is in scope (needed for the $ column syntax)
import spark.implicits._
val w = Window.partitionBy("id").orderBy(to_date($"in_date", "dd-MMM-yyyy"))
(df.withColumn("rn", sum(when($"visit_class" === "Hf", 1).otherwise(0)).over(w))
.filter($"rn" >= 1).drop("rn").show)
+---+-----------+-----------+
| id|visit_class|    in_date|
+---+-----------+-----------+
|  1|         Hf|27-SEP-2017|
|  1|     Non Hf|28-SEP-2017|
|  2|         Hf|25-SEP-2017|
+---+-----------+-----------+
This uses Spark 2.2.0; to_date with the format signature is a new function in 2.2.0.
If you are using spark < 2.2.0, you can use unix_timestamp in place of to_date:
val w = Window.partitionBy("id").orderBy(unix_timestamp($"in_date", "dd-MMM-yyyy"))
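For readers following the PySpark answers above, a rough equivalent of the same cumulative-sum filter in PySpark (column names assumed to match the Scala example):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id").orderBy(F.to_date("in_date", "dd-MMM-yyyy"))

result = (df.withColumn("rn", F.sum(F.when(F.col("visit_class") == "Hf", 1).otherwise(0)).over(w))
            .filter(F.col("rn") >= 1)
            .drop("rn"))
result.show()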