Rename unnamed columns in Pyspark Dataframe - pyspark

The data is in an Excel file ('.xlsx'). The table header is split across the first two rows. How do I fix this? Is there a way to take the better of the two names for each column and use it as the column name in the header?
I have these rows in source file:
|Unnamed:_0|Unnamed:_1|Unnamed:_2|Unnamed:_3|Unnamed:_4|Year |2018|2018.1|
|Col1 |Col2 |Col3 |Col4 |Col5 |Month|Jul |Aug |
I want to display header for the table as:
|Col1|Col2|Col3|Col4|Col5|Year_Month|2018_07|2018.1_08|
I would be glad if you could help me with a solution, since I am new to PySpark.

You could share more of your code, but I bet the fix is the header option. For CSV:
df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
If it is not CSV, you can specify the column names in a schema. Example with a schema:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
schema = StructType([
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True),
    StructField('languages', ArrayType(StringType()), True),
    StructField('state', StringType(), True),
    StructField('gender', StringType(), True)])
# `data` is your own list of rows matching the schema
df = spark.createDataFrame(data=data, schema=schema)
For CSV files it can also be useful to infer the schema automatically from the file:
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("csvfile.csv")
If you load from Excel, you can use the same options, for example:
.option("header", "true")
When loading Excel, the dataAddress option is also useful: it lets you target a sheet and cell range as you would in Excel, so with some experimenting you can make the header row line up:
.option("dataAddress", "'My Sheet'!B3:C35")
If none of these solutions work, you can promote the first row to the header, but it is a bit complicated. An excellent script and explanation of how to do it, by @desertnaut, is here: https://stackoverflow.com/a/34837299/10972959
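For the "best of the two names" part of the question, here is a minimal sketch in plain Python. It assumes you already have the two header rows as lists; `merge_header_rows` and `MONTHS` are hypothetical names, not part of PySpark, and the rule "use the second row when the first is Unnamed, otherwise join both with an underscore" is just one reading of the desired output.

```python
# Month abbreviations mapped to two-digit numbers (e.g. "Jul" -> "07").
MONTHS = {m: f"{i:02d}" for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def merge_header_rows(top, bottom):
    """Combine two header rows into one list of column names (hypothetical helper)."""
    names = []
    for upper, lower in zip(top, bottom):
        upper, lower = str(upper).strip(), str(lower).strip()
        lower = MONTHS.get(lower, lower)   # "Jul" -> "07", anything else unchanged
        if not upper or upper.startswith("Unnamed"):
            names.append(lower)            # first row carries no real name
        else:
            names.append(f"{upper}_{lower}")
    return names


top = ["Unnamed:_0", "Unnamed:_1", "Unnamed:_2", "Unnamed:_3", "Unnamed:_4",
       "Year ", "2018", "2018.1"]
bottom = ["Col1 ", "Col2 ", "Col3 ", "Col4 ", "Col5 ", "Month", "Jul ", "Aug "]

new_names = merge_header_rows(top, bottom)
print(new_names)
```

If the file was read with the header option, `top` would be `df.columns` and `bottom` would be `list(df.first())`; `df.toDF(*new_names)` then applies the new names. The row that held the second half of the header still has to be dropped from the data separately.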

Related

PySpark JSON dataframe created with all null values

I have created a dataframe from a JSON file. However, the dataframe is created with the full schema but with all values null. It is a valid JSON file.
df = spark.read.json(path)
When I display the data using df.display(), all I can see is null in the dataframe. Can anyone tell me what the issue could be?
Reading the JSON file without enabling multiline might be the cause.
Here is a sample demonstration.
My sample JSON:
[{"id":1,"first_name":"Amara","last_name":"Taplin"},
{"id":2,"first_name":"Gothart","last_name":"McGrill"},
{"id":3,"first_name":"Georgia","last_name":"De Miranda"},
{"id":4,"first_name":"Dukie","last_name":"Arnaud"},
{"id":5,"first_name":"Mellicent","last_name":"Scathard"}]
I got null values when multiline was not used.
When multiline was enabled, I got the proper result.
df= spark.read.option('multiline', True).json('/FileStore/tables/Sample1_json.json')
df.display()
If you also want to supply the schema explicitly, you can do it like this:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, LongType
schema = StructType([
    StructField('first_name', StringType(), True),
    StructField('id', IntegerType(), True),
    StructField('last_name', StringType(), True)])
df = spark.read.schema(schema).option('multiline', True).json('/FileStore/tables/Sample1_json.json')
df.show()
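Some background on why multiline matters: by default Spark's JSON reader expects one complete JSON value per line (JSON Lines), while the sample above is a single array spread over several lines. If you are free to rewrite the file, an alternative is to convert it to JSON Lines first. A minimal sketch in plain Python, where `json_array_to_lines` is a hypothetical helper name:

```python
import json


def json_array_to_lines(text):
    """Turn a JSON array document into JSON Lines text (one object per line)."""
    records = json.loads(text)
    if not isinstance(records, list):
        records = [records]          # a single object becomes a single line
    return "\n".join(json.dumps(record) for record in records)


sample = '''[{"id":1,"first_name":"Amara","last_name":"Taplin"},
{"id":2,"first_name":"Gothart","last_name":"McGrill"}]'''

lines = json_array_to_lines(sample)
print(lines)
```

A file written this way should be readable with a plain spark.read.json(path), without the multiline option.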

Issue with pyspark df.show

I am reading a gz file in PySpark, creating an RDD and a schema, and then using that RDD to create the dataframe. But I am not able to see any output.
Here is my code, I am not sure what I am doing wrong.
lines = sc.textFile("x.gz")
parts = lines.map(lambda l: l.split("\t"))
db = parts.map(lambda p: (p[0], int(p[1]), int(p[2]), int(p[3]).strip()))
schema = StructType([
StructField("word", StringType(), True),
StructField("count1", IntegerType(), True),
StructField("count2", IntegerType(), True),
StructField("count3", IntegerType(), True)])
df = sqlContext.createDataFrame(db, schema)
df.show()
df.createOrReplaceTempView("dftable")
result = sqlContext.sql("SELECT COUNT(*) FROM dftable")
result.show()
I also want to count the rows in my table, which is why I used the SQL query. But whenever I call .show(), an error is thrown. What am I doing wrong here?
The data in the gz file looks something like this:
A'String' some_number some_number some_number
The some_number values are in string format.
Please tell me what I am doing wrong.
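One thing that stands out in the code above is `int(p[3]).strip()`: it calls `.strip()` on an `int`, which has no such method and raises an AttributeError. Since RDD transformations are lazy, that error would only surface once an action such as `.show()` runs. The post does not include the error message, so this may not be the only problem. A minimal sketch of the parsing step in plain Python, with `parse_line` as a hypothetical helper name:

```python
def parse_line(line):
    """Split one tab-separated line into (word, count1, count2, count3)."""
    p = line.split("\t")
    # Strip while the value is still a string, then convert to int.
    return (p[0], int(p[1]), int(p[2]), int(p[3].strip()))


print(parse_line("A'String'\t1\t2\t3 "))
```

In the original code this would replace the two map calls with `db = lines.map(parse_line)`.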

Spark generate a dataframe from two json columns

I have a dataframe with two columns. Each column contains json.
|cola|colb|
|{"name":"Adam", "age": 23}|{"country" : "USA"}|
I wish to convert it to:
|cola_name|cola_age|colb_country|
|Adam|23|USA|
How do I do this?
The approach I have in mind: if I could merge both JSON values in the original dataframe into a single JSON object, I could then obtain the intended result with
spark.read.json(df.select("merged_column").as[String])
But I can't find an easy way of merging two JSON objects into a single JSON object in Spark.
Update: The contents of the JSON are not known beforehand. I am looking for a way to auto-detect the schema.
I'm more familiar with pyspark syntax. I think this works:
import pyspark.sql.functions as f
from pyspark.sql.types import *
schema_cola = StructType([
    StructField('name', StringType(), True),
    StructField('age', IntegerType(), True)
])
schema_colb = StructType([
    StructField('country', StringType(), True)
])
df = spark.createDataFrame([('{"name":"Adam", "age": 23}', '{"country" : "USA"}')], ['cola', 'colb'])
display(df
    .withColumn('cola_struct', f.from_json(f.col('cola'), schema_cola))
    .withColumn('colb_struct', f.from_json(f.col('colb'), schema_colb))
    .select(f.col('cola_struct.*'), f.col('colb_struct.*'))
)
The output is a single row with the columns name, age and country, holding Adam, 23 and USA.
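For the case in the update, where the JSON contents are not known in advance, the merging idea from the question can be sketched in plain Python. `merge_json_columns` is a hypothetical helper; it assumes every non-null cell holds a JSON object and prefixes each key with its column name.

```python
import json


def merge_json_columns(row):
    """Merge the JSON strings of one row into a single JSON object with prefixed keys."""
    merged = {}
    for column, raw in row.items():
        if raw is None:
            continue                      # skip null cells
        for key, value in json.loads(raw).items():
            merged[f"{column}_{key}"] = value
    return json.dumps(merged)


row = {"cola": '{"name":"Adam", "age": 23}', "colb": '{"country" : "USA"}'}
merged = merge_json_columns(row)
print(merged)

# With a live SparkSession `spark` and the DataFrame `df` from above (not run here),
# the reader can then infer the schema from the merged strings:
# flat = spark.read.json(df.rdd.map(lambda r: merge_json_columns(r.asDict())))
```

The commented line relies on spark.read.json accepting an RDD of JSON strings, so the schema is inferred from the data instead of being written by hand.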

Get the column names of malformed records while reading a csv file using pyspark

I am reading a csv file using pyspark with predefined schema.
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", FloatType(), True)
])
df = (spark.sqlContext.read
    .schema(schema)
    .option("header", True)
    .option("delimiter", ",")
    .csv(path))
Now, in the CSV file there is a float value in col1 and a string value in col3. I need to raise an exception and get the names of these columns (col1, col3), because they contain values whose data type differs from the one defined in the schema.
How do I achieve this?
In pyspark versions >2.2 you can use columnNameOfCorruptRecord with csv:
schema = StructType(
    [
        StructField("col1", IntegerType(), True),
        StructField("col2", StringType(), True),
        StructField("col3", FloatType(), True),
        StructField("corrupted", StringType(), True),
    ]
)
df = spark.sqlContext.read.csv(
    path,
    schema=schema,
    header=True,
    sep=",",
    mode="PERMISSIVE",
    columnNameOfCorruptRecord="corrupted",
)
df.show()
+----+----+----+------------+
|col1|col2|col3| corrupted|
+----+----+----+------------+
|null|null|null|0.10,123,abc|
+----+----+----+------------+
EDIT: CSV record fields are not independent of one another, so it can't generally be said that one field is corrupt, but others are not. Only the entire record can be corrupt or not corrupt.
For example, suppose we have a comma delimited file with one row and two floating point columns, the Euro values 0,10 and 1,00. The file looks like this:
col1,col2
0,10,1,00
Which field is corrupt?
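With that caveat in mind, you can still make a best-effort guess from the raw text in the corrupted column when the record splits into exactly as many fields as the schema has columns. A minimal sketch in plain Python: `columns_failing_cast` is a hypothetical helper, and Python's `int` and `float` only approximate Spark's own parsing rules.

```python
import csv


def columns_failing_cast(raw_record, casters, sep=","):
    """Return the names of fields that fail their cast, or None when the
    field count does not match the schema and no field can be blamed."""
    fields = next(csv.reader([raw_record], delimiter=sep))
    if len(fields) != len(casters):
        return None
    bad = []
    for (name, cast), value in zip(casters, fields):
        try:
            cast(value)
        except ValueError:
            bad.append(name)
    return bad


casters = [("col1", int), ("col2", str), ("col3", float)]
print(columns_failing_cast("0.10,123,abc", casters))

# The Euro example: four fields for two columns, so no single field can be named.
two_floats = [("col1", float), ("col2", float)]
print(columns_failing_cast("0,10,1,00", two_floats))
```

For the corrupted row shown in the output above, this names col1 and col3; for the Euro example it returns None.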

Is there a way to make an empty schema with column names from a list?

I am trying to make an empty PySpark dataframe in the case where it didn't exist before. I also have a list of column names. Is it possible to define an empty PySpark dataframe without manual assignment?
I have a list of columns final_columns, which I can use to select a subset of columns from a dataframe. However, in the case when this dataframe doesn't exist, I would like to create an empty dataframe with the same columns in final_columns. I would like to do this without manually assigning the names.
final_columns = ['colA', 'colB', 'colC', 'colD', 'colE']
try:
    sdf = sqlContext.table('test_table')
except:
    print("test_table is empty")
    mySchema = StructType([StructField("colA", StringType(), True),
                           StructField("colB", StringType(), True),
                           StructField("colC", StringType(), True),
                           StructField("colD", StringType(), True),
                           StructField("colE", DoubleType(), True)])
    sdf = sqlContext.createDataFrame(spark.sparkContext.emptyRDD(), schema=mySchema)
sdf = sdf.select(final_columns)
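One way to avoid writing each field by hand is to generate the schema from the list. A minimal sketch, assuming plain column names without spaces or special characters and a default type of string; `ddl_schema` is a hypothetical helper that builds a DDL schema string, which createDataFrame accepts in place of a StructType.

```python
def ddl_schema(columns, types=None, default="string"):
    """Build a DDL schema string such as 'colA string, colB double'."""
    types = types or {}
    return ", ".join(f"{name} {types.get(name, default)}" for name in columns)


final_columns = ['colA', 'colB', 'colC', 'colD', 'colE']
schema_ddl = ddl_schema(final_columns, types={"colE": "double"})
print(schema_ddl)

# With a live SparkSession `spark` (not run here):
# sdf = spark.createDataFrame([], schema=schema_ddl)
```

If every column can be a string, the same idea works with a comprehension over the list: StructType([StructField(c, StringType(), True) for c in final_columns]).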