How to create an empty struct column in pyspark?

I'm trying to create an empty struct column in pyspark. For an array this works:
import pyspark.sql.functions as F
df = df.withColumn('newCol', F.array([]))
but the struct equivalent gives me an error:
df = df.withColumn('newCol', F.struct())
I saw a similar question, but it was for Scala rather than pyspark, so it doesn't really help me.

If you know the schema of the struct column, you can use the function from_json as follows:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

struct_schema = StructType([
    StructField('name', StringType(), False),
    StructField('surname', StringType(), False),
])
df = df.withColumn(
    'newCol', F.from_json(F.lit(""), struct_schema)
)

Actually, the array above is not really empty, because it has an empty element.
You should instead consider something like this:
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = df.withColumn('newCol', F.lit(None).cast(T.StructType()))
PS: this is a late conversion of my comment into an answer, as was suggested. I hope it still helps, even though it comes long after the OP's question.
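For reference, a minimal end-to-end sketch of that second approach (my own illustration, assuming an existing SparkSession named spark; the exact schema rendering can vary between Spark versions):
import pyspark.sql.functions as F
import pyspark.sql.types as T

df = spark.createDataFrame([(1,), (2,)], ['id'])
# a null literal cast to an (empty) struct type yields an empty struct column
df = df.withColumn('newCol', F.lit(None).cast(T.StructType()))
df.printSchema()
df.show()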

Related

How to join RDD rows based on a common value in another column (pyspark)

The text file dataset I have:
Benz,25
BMW,27
BMW,25
Land Rover,22
Audi,25
Benz,25
The result I want is:
[((Benz,BMW),2),((Benz,Audi),1),((BMW,Audi),1)]
It basically pairs the cars that share a common value and counts how often each pair occurs together.
My code so far is:
cars= sc.textFile('cars.txt')
carpair= cars.flatMap(lambda x: float(x.split(',')))
carpair.map(lambda x: (x[0], x[1])).groupByKey().collect()
As I'm a beginner, I'm not able to figure it out.
Getting the occurrence count is secondary; I can't even map the values together.
In pyspark, it'd look like this:
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('car_model', StringType(), True),
    StructField('value', IntegerType(), True)
])
df = (
    spark
    .read
    .schema(schema)
    .option('header', False)
    .csv('./cars.txt')
)
(
    df.alias('left')
    .join(df.alias('right'), ['value'], 'left')
    .select('value', f.col('left.car_model').alias('left_car_model'), f.col('right.car_model').alias('right_car_model'))
    .where(f.col('left_car_model') != f.col('right_car_model'))
    .groupBy('left_car_model', 'right_car_model')
    .agg(f.count('*').alias('count'))
    .show()
)
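If you want each unordered pair counted only once (closer to the expected output), one possible tweak, sketched here with the same aliases as above, is to keep a single ordering of every pair before grouping:
import pyspark.sql.functions as f

(
    df.alias('left')
    .join(df.alias('right'), ['value'])
    .where(f.col('left.car_model') < f.col('right.car_model'))  # keeps one ordering of each pair
    .groupBy(f.col('left.car_model').alias('car_a'), f.col('right.car_model').alias('car_b'))
    .agg(f.count('*').alias('count'))
    .show()
)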

pyspark json dataframe created with all null values

I have created a dataframe from a json file. However, the dataframe is created with the full schema but with all values as null. It's a valid json file.
df = spark.read.json(path)
When I display the data using df.display(), all I can see is null in the dataframe. Can anyone tell me what the issue could be?
Reading the json file without enabling the multiline option might be the cause of this.
Please go through the sample demonstration below.
My sample json:
[{"id":1,"first_name":"Amara","last_name":"Taplin"},
{"id":2,"first_name":"Gothart","last_name":"McGrill"},
{"id":3,"first_name":"Georgia","last_name":"De Miranda"},
{"id":4,"first_name":"Dukie","last_name":"Arnaud"},
{"id":5,"first_name":"Mellicent","last_name":"Scathard"}]
I got null values when multiline was not used.
When multiline was enabled, I got the proper result:
df= spark.read.option('multiline', True).json('/FileStore/tables/Sample1_json.json')
df.display()
If you also want to supply the schema externally, you can do it like this:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType

schema = StructType([StructField('first_name', StringType(), True),
                     StructField('id', IntegerType(), True),
                     StructField('last_name', StringType(), True)])
df = spark.read.schema(schema).option('multiline', True).json('/FileStore/tables/Sample1_json.json')
df.show()
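As a side note (a small illustration of my own, not part of the original answer): multiline is needed here because the sample file is one JSON array spread over several lines. If the same records were stored one JSON object per line (JSON Lines), the default reader would handle them without the option; the file path below is hypothetical.
# file contents, one object per line, e.g.
# {"id":1,"first_name":"Amara","last_name":"Taplin"}
# {"id":2,"first_name":"Gothart","last_name":"McGrill"}
df = spark.read.json('/FileStore/tables/Sample1_jsonlines.json')
df.show()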

Spark generate a dataframe from two json columns

I have a dataframe with two columns. Each column contains a json string.

cola                        | colb
{"name":"Adam", "age": 23}  | {"country" : "USA"}
I wish to convert it to:

cola_name | cola_age | colb_country
Adam      | 23       | USA
How do I do this?
The approach I have in mind is: if, in the original dataframe, I can merge both json strings into a single json object, I can then obtain the intended result with
spark.read.json(df.select("merged_column").as[String])
But I can't find an easy way of merging two json objects into a single json object in Spark.
Update: the contents of the json are not known beforehand. I'm looking for a way to auto-detect the schema.
I'm more familiar with pyspark syntax. I think this works:
import pyspark.sql.functions as f
from pyspark.sql.types import *
schema_cola = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])
schema_colb = StructType([
    StructField('country', StringType(), True)
])
df = spark.createDataFrame([('{"name":"Adam", "age": 23}', '{"country" : "USA"}')], ['cola', 'colb'])
display(df
    .withColumn('cola_struct', f.from_json(f.col('cola'), schema_cola))
    .withColumn('colb_struct', f.from_json(f.col('colb'), schema_colb))
    .select(f.col('cola_struct.*'), f.col('colb_struct.*'))
)
The output is a flat dataframe like this:

name | age | country
Adam | 23  | USA
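Regarding the update (schema not known in advance): here is a rough sketch of one way to auto-detect it, using schema_of_json (available in Spark 2.4+) on a sample value from each column. The variable names are my own, and it assumes every row has the same json shape per column:
import pyspark.sql.functions as f

# take one sample value per column to infer a schema from
sample_a = df.select('cola').first()['cola']
sample_b = df.select('colb').first()['colb']

# schema_of_json returns a DDL string such as "STRUCT<age: BIGINT, name: STRING>"
ddl_a = df.select(f.schema_of_json(f.lit(sample_a)).alias('s')).first()['s']
ddl_b = df.select(f.schema_of_json(f.lit(sample_b)).alias('s')).first()['s']

parsed = (
    df.withColumn('cola_struct', f.from_json('cola', ddl_a))
    .withColumn('colb_struct', f.from_json('colb', ddl_b))
    .select('cola_struct.*', 'colb_struct.*')
)
parsed.show()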

Replacing whitespace in all column names in spark Dataframe

I have a Spark dataframe with whitespace in some of the column names, which has to be replaced with underscores.
I know a single column can be renamed using withColumnRenamed() in Spark SQL, but to rename 'n' columns, this function has to be chained 'n' times (to my knowledge).
To automate this, I have tried:
val old_names = df.columns // array of old column names
val new_names = old_names.map { x =>
  if (x.contains(" "))
    x.replaceAll("\\s", "_")
  else x
} // array of new column names with whitespace replaced by underscores
Now, how do I replace df's column names with new_names?
As a best practice, you should prefer expressions and immutability.
You should use val and not var as much as possible.
Thus, it's preferable to use the foldLeft operator in this case:
val newDf = df.columns
  .foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
rather than a mutable var with a for loop:
var newDf = df
for (col <- df.columns) {
  newDf = newDf.withColumnRenamed(col, col.replaceAll("\\s", "_"))
}
You can encapsulate the mutable version in a method so it won't cause too much pollution.
In Python, this can be done by the following code:
# Importing sql types
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import col
# Building a simple dataframe:
schema = StructType([
    StructField("id name", StringType(), True),
    StructField("cities venezuela", StringType(), True)
])
column1 = ['A', 'A', 'B', 'B', 'C', 'B']
column2 = ['Maracaibo', 'Valencia', 'Caracas', 'Barcelona', 'Barquisimeto', 'Merida']
# Dataframe:
df = sqlContext.createDataFrame(list(zip(column1, column2)), schema=schema)
df.show()
exprs = [col(column).alias(column.replace(' ', '_')) for column in df.columns]
df.select(*exprs).show()
You can do the exact same thing in python:
raw_data1 = raw_data
for col in raw_data.columns:
    raw_data1 = raw_data1.withColumnRenamed(col, col.replace(" ", "_"))
In Scala, here is another way of achieving the same:
import org.apache.spark.sql.types._

val df_with_newColumns = spark.createDataFrame(df.rdd,
  StructType(df.schema.map(s => StructField(s.name.replaceAll(" ", "_"),
    s.dataType, s.nullable))))
Hope this helps!
Here is the utility we are using.
def columnsStandardise(df: DataFrame): DataFrame = {
  val dfcolumnsStandardise = df.toDF(df.columns map (_.toLowerCase().trim().replaceAll(" ", "_")): _*)
  dfcolumnsStandardise
}
I also wanted to add this solution, using a regex to collapse any whitespace into underscores:
import re
for each in df.schema.names:
    df = df.withColumnRenamed(each, re.sub(r'\s+', '_', each.strip()))
I have been using the answer given by @kanielc to trim the leading and trailing spaces in the column headers, and that works great when the number of columns is small. I had to load one csv file which had around 600 columns, and execution of the code took a significant amount of time and was not meeting our expectations.
Earlier code:
val finalSourceTable = intermediateSourceTable.columns
  .foldLeft(intermediateSourceTable)((curr, n) => curr.withColumnRenamed(n, n.trim))
Changed code:
val finalSourceTable = intermediateSourceTable
  .toDF(intermediateSourceTable.columns map (_.trim()): _*)
The changed code worked like a charm, and it was also fast compared to the earlier code.
We also maintain immutability by not using var variables.
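For completeness, the same single-pass toDF idea works in pyspark as well (a small sketch of my own, assuming a dataframe named df):
# rename every column in one pass; toDF takes the complete list of new names
df_renamed = df.toDF(*[c.strip().replace(' ', '_') for c in df.columns])
df_renamed.printSchema()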

Pass one dataframe column values to another dataframe filter condition expression + Spark 1.5

I have two input datasets.
The first input dataset is like this:
year,make,model,comment,blank
"2012","Tesla","S","No comment",
1997,Ford,E350,"Go get one now they are going fast",
2015,Chevy,Volt
Second input dataset:
TagId,condition
1997_cars,year = 1997 and model = 'E350'
2012_cars,year=2012 and model ='S'
2015_cars ,year=2015 and model = 'Volt'
Now my requirement is to read the first dataset and, based on the filter conditions in the second dataset, tag the rows of the first dataset by introducing a new column TagId.
So the expected output should look like this:
year,make,model,comment,blank,TagId
"2012","Tesla","S","No comment",2012_cars
1997,Ford,E350,"Go get one now they are going fast",1997_cars
2015,Chevy,Volt, ,2015_cars
I tried this:
val sqlContext = new SQLContext(sc)
val carsSchema = StructType(Seq(
  StructField("year", IntegerType, true),
  StructField("make", StringType, true),
  StructField("model", StringType, true),
  StructField("comment", StringType, true),
  StructField("blank", StringType, true)))

val carTagsSchema = StructType(Seq(
  StructField("TagId", StringType, true),
  StructField("condition", StringType, true)))

val dfcars = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(carsSchema).load("/TestDivya/Spark/cars.csv")
val dftags = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").schema(carTagsSchema).load("/TestDivya/Spark/CarTags.csv")

val Amendeddf = dfcars.withColumn("TagId", dfcars("blank"))
val cdtnval = dftags.select("condition")
val df2 = dfcars.filter(cdtnval)
This fails with:
<console>:35: error: overloaded method value filter with alternatives:
(conditionExpr: String)org.apache.spark.sql.DataFrame <and>
(condition: org.apache.spark.sql.Column)org.apache.spark.sql.DataFrame
cannot be applied to (org.apache.spark.sql.DataFrame)
val df2=dfcars.filter(cdtnval)
Another way I tried:
val col = dftags.col("TagId")
val finaldf = dfcars.withColumn("TagId", col)
This fails with:
org.apache.spark.sql.AnalysisException: resolved attribute(s) TagId#5 missing from comment#3,blank#4,model#2,make#1,year#0 in operator !Project [year#0,make#1,model#2,comment#3,blank#4,TagId#5 AS TagId#8];
The final step would then have been:
finaldf.write.format("com.databricks.spark.csv").option("header", "true").save("/TestDivya/Spark/carswithtags.csv")
I would really appreciate it if somebody could give me pointers on how I can pass the filter condition to the dataframe's filter function, or suggest another solution.
My apologies for such a naive question, as I am new to Scala and Spark.
Thanks
There is no simple solution to this. I think there are two general directions you can go with it (a rough sketch of the first one is at the end of this answer):
Collect the conditions (dftags) to a local list. Then go through it one by one, executing each on the cars (dfcars) as a filter. Use the results to get the desired output.
Collect the conditions (dftags) to a local list. Implement the parsing and evaluation code for them yourself. Go through the cars (dfcars) once, evaluating the ruleset on each line in a map.
If you have a high number of conditions (so you cannot collect them) and a high number of cars, then the situation is very bad. You need to check every car against every condition, so this will be very inefficient. In this case you need to optimize the ruleset first, so it can be evaluated more efficiently. (A decision tree may be a nice solution.)
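A rough sketch of the first direction, in pyspark syntax rather than the OP's Scala (the idea is the same). It assumes the dfcars and dftags frames from the question and uses unionAll, since the question targets Spark 1.5:
import pyspark.sql.functions as F
from functools import reduce

# collect the small conditions table to the driver
conditions = dftags.collect()

# apply each condition string as a SQL filter expression and tag the matching rows;
# rows matching no condition are simply dropped in this sketch
tagged = [
    dfcars.filter(row['condition']).withColumn('TagId', F.lit(row['TagId']))
    for row in conditions
]

# stitch the tagged pieces back together (use .union on Spark 2+)
result = reduce(lambda a, b: a.unionAll(b), tagged)
result.show()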