How to parse nested column for CSV data in Pyspark? - pyspark

I am working on a database where the data is stored in csv format. The DB looks like the following:
id | containertype | size
1  | CASE          | {height=2.01, length=1.07, width=1.22}
2  | PALLET        | {height=1.80, length=1.07, width=1.23}
I want to parse the data inside the size column and create a PySpark dataframe like:
id | containertype | height | length | width
1  | CASE          | 2.01   | 1.07   | 1.22
2  | PALLET        | 1.80   | 1.07   | 1.23
I tried parsing the string to StructType and MapType, but none of the approaches worked. Is there any way to do it other than messy string manipulation?
Reproducible data-frame code:
df = spark.createDataFrame(
    [
        ("1", "CASE", "{height=2.01, length=1.07, width=1.22}"),
        ("2", "PALLET", "{height=1.80, length=1.07, width=1.23}"),
    ],
    ["id", "containertype", "size"]
)
df.printSchema()

If one of the columns is a JSON string, you can parse it with the function from_json, which takes the column you want to parse (size in your case) and the schema the parsing should produce, in this case:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, FloatType

schema = StructType([
    StructField("height", FloatType(), True),
    StructField("length", FloatType(), True),
    StructField("width", FloatType(), True),
])

df.withColumn("json", F.from_json(F.col("size"), schema)) \
  .select(F.col("id"), F.col("containertype"), F.col("json.*"))
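Note that the sample size values use = rather than :, so they are not valid JSON as written. A minimal sketch of a workaround (assuming every value follows the {key=value, ...} pattern) is to normalize the string with regexp_replace before calling from_json:
# Quote the keys and swap '=' for ':' so the string becomes valid JSON,
# e.g. {height=2.01, ...} -> {"height":2.01, ...}
json_str = F.regexp_replace(F.col("size"), r"(\w+)=", '"$1":')
df.withColumn("json", F.from_json(json_str, schema)) \
  .select("id", "containertype", "json.*") \
  .show()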

Alternatively, use a regex to extract each value:
def getParameter(tag):
    # Pull "<tag>=<number>" out of the size string and cast it to a float
    return F.regexp_extract("size", tag + r"=(\d+\.\d+)", 1).cast(FloatType()).alias(tag)

df.select(F.col("id"), F.col("containertype"),
          getParameter("height"), getParameter("length"), getParameter("width"))
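With the sample frame above, this select should produce roughly:
+---+-------------+------+------+-----+
| id|containertype|height|length|width|
+---+-------------+------+------+-----+
|  1|         CASE|  2.01|  1.07| 1.22|
|  2|       PALLET|   1.8|  1.07| 1.23|
+---+-------------+------+------+-----+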

Related

Creating pyspark dataframe from same entries in the list

I have two lists, one is col_name=['col1','col2','col3'] and the other is col_value=['val1', 'val2', 'val3']. I am trying to create a dataframe from the two lists, with col_name supplying the column names. I need the output with 3 columns and 1 row (with the header coming from col_name).
I'm finding it difficult to get a solution for this. Please help.
Construct data directly, and use the createDataFrame method to create a dataframe.
col_name = ['col1', 'col2', 'col3']
col_value = ['val1', 'val2', 'val3']
data = [col_value]
df = spark.createDataFrame(data, col_name)
df.printSchema()
df.show(truncate=False)
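With those two lists, the schema and output should look roughly like:
root
 |-- col1: string (nullable = true)
 |-- col2: string (nullable = true)
 |-- col3: string (nullable = true)

+----+----+----+
|col1|col2|col3|
+----+----+----+
|val1|val2|val3|
+----+----+----+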

Pyspark: Identify the arrayType column from the the Struct and call udf to convert array to string

I am creating an accelerator that migrates data from a source to a destination. For example, I will pick the data from an API and migrate it to CSV. I have faced issues handling ArrayType columns when the data is converted to CSV. I used withColumn with the concat_ws method (i.e., df1 = df.withColumn('films', F.concat_ws(':', F.col("films"))), where films is the ArrayType column) for this conversion, and it worked. Now I want this to happen dynamically. I mean, without specifying the column name, is there a way to pick the column names from the struct that have ArrayType and then call the udf?
Thank you for your time!
You can get the type of the columns using df.schema. Depending on the type of the column you can apply concat_ws or not:
from pyspark.sql import functions as F, types as T

data = [["test1", "test2", [1, 2, 3], ["a", "b", "c"]]]
schema = ["col1", "col2", "arr1", "arr2"]
df = spark.createDataFrame(data, schema)

array_cols = [F.concat_ws(":", c.name).alias(c.name)
              for c in df.schema if isinstance(c.dataType, T.ArrayType)]
other_cols = [F.col(c.name)
              for c in df.schema if not isinstance(c.dataType, T.ArrayType)]

df = df.select(other_cols + array_cols)
Result:
+-----+-----+-----+-----+
| col1| col2| arr1| arr2|
+-----+-----+-----+-----+
|test1|test2|1:2:3|a:b:c|
+-----+-----+-----+-----+

I have a JSON string in my dataframe; I already tried to extract the JSON string columns using PySpark

df = spark.read.json("dbfs:/mnt/evbhaent2blobs", multiLine=True)
df2 = df.select(F.col('body').cast("Struct").getItem('CustomerType').alias('CustomerType'))
display(df)
I am taking a guess that your dataframe has a column "body" which is a json string and you want to parse the json and extract an element from it.
First you need to define or infer the JSON schema. Then parse the JSON string and extract its elements as columns. From the extracted columns, you can select the ones you need.
json_schema = spark.read.json(df.rdd.map(lambda row: row.body)).schema
df2 = df.withColumn('body_json', F.from_json(F.col('body'), json_schema)) \
        .select("body_json.*").select('CustomerType')
display(df2)
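If you would rather not go through the RDD API, a possible alternative (Spark 2.4+, sketched under the same assumption that the JSON column is named body) is to infer the schema from a single sample value with schema_of_json:
# Infer a DDL schema string from one sample row; note this can miss fields
# that only appear in other rows.
sample = df.select("body").first()["body"]
ddl_schema = df.select(F.schema_of_json(F.lit(sample))).first()[0]

df2 = df.withColumn("body_json", F.from_json(F.col("body"), ddl_schema)) \
        .select("body_json.CustomerType")
display(df2)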

Print out types of data frame columns in Spark

I tried using VectorAssembler on my Spark Data Frame and it complained that it didn't support the StringType type. My Data Frame has 2126 columns.
What's the programmatic way to print out all the column types?
df.printSchema() will print the dataframe schema in an easy-to-follow format.
Try:
>>> for name, dtype in df.dtypes:
... print(name, dtype)
or
>>> df.schema
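If the underlying goal is to find which of the 2126 columns VectorAssembler is rejecting (the StringType ones), a quick sketch built on df.dtypes:
>>> # Collect the names of all string-typed columns
>>> string_cols = [name for name, dtype in df.dtypes if dtype == "string"]
>>> print(string_cols)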

Filter out rows with NaN values for certain column

I have a dataset and in some of the rows an attribute value is NaN. This data is loaded into a dataframe and I would like to use only the rows where all attributes have values. I tried doing it via SQL:
val df_data = sqlContext.sql("SELECT * FROM raw_data WHERE attribute1 != NaN")
I tried several variants on this, but I can't seem to get it working.
Another option would be to transform it to an RDD and then filter it, since filtering this dataframe by checking whether an attribute isNaN does not work.
I know you accepted the other answer, but you can do it without the explode (which should perform better, since explode doubles your DataFrame size).
Prior to Spark 1.6, you could use a udf like this:
def isNaNudf = udf[Boolean,Double](d => d.isNaN)
df.filter(isNaNudf($"value"))
As of Spark 1.6, you can now use the built-in SQL function isnan() like this:
df.filter(isnan($"value"))
Here is some sample code that shows you my way of doing it -
import sqlContext.implicits._
val df = sc.parallelize(Seq((1, 0.5), (2, Double.NaN))).toDF("id", "value")
val df2 = df.explode[Double, Boolean]("value", "isNaN")(d => Seq(d.isNaN))
df will have -
df.show
id  value
1   0.5
2   NaN
while filtering df2 will give you what you want -
df2.filter($"isNaN" !== true).show
id  value  isNaN
1   0.5    false
This works:
where isNaN(tau_doc) = false
e.g.
val df_data = sqlContext.sql("SELECT * FROM raw_data where isNaN(attribute1) = false")
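As a side note, since the rest of this page is PySpark: the same built-in exists in pyspark.sql.functions, so a roughly equivalent filter (assuming raw_data is registered as a table, as in the SQL example) would be:
from pyspark.sql import functions as F

# Keep only rows where attribute1 is not NaN
df_data = spark.table("raw_data").filter(~F.isnan(F.col("attribute1")))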