Issue with pyspark df.show

I am reading a gz file in PySpark, creating an RDD and a schema, and then using that RDD to create the DataFrame. But I am not able to see any output.
Here is my code; I am not sure what I am doing wrong.
lines = sc.textFile("x.gz")
parts = lines.map(lambda l: l.split("\t"))
db = parts.map(lambda p: (p[0], int(p[1]), int(p[2]), int(p[3]).strip()))
schema = StructType([
    StructField("word", StringType(), True),
    StructField("count1", IntegerType(), True),
    StructField("count2", IntegerType(), True),
    StructField("count3", IntegerType(), True)])
df = sqlContext.createDataFrame(db, schema)
df.show()
df.createOrReplaceTempView("dftable")
result = sqlContext.sql("SELECT COUNT(*) FROM dftable")
result.show()
Moreover, I also want to calculate the number of rows in my table, which is why I used the SQL query. But whenever I try to call .show(), an error is thrown. What am I doing wrong here?
The data in the gz file looks something like:
A'String' some_number some_number some_number
The some_number values are in string format.
Please guide me on what I am doing wrong.
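A likely culprit, judging only from the snippet (so treat this as a guess rather than a confirmed fix), is the last map step: int(p[3]).strip() calls .strip() on an int, and because RDD transformations are lazy, the failure only surfaces when df.show() forces evaluation. A minimal sketch of the parsing step with the strip applied before the conversion, assuming the file really is tab-separated:
# Sketch only: strip the whitespace first, then convert to int.
db = parts.map(lambda p: (p[0], int(p[1].strip()), int(p[2].strip()), int(p[3].strip())))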

Related

How to join rdd rows based on common value in another column RDD pyspark

The text file dataset I have:
Benz,25
BMW,27
BMW,25
Land Rover,22
Audi,25
Benz,25
The result I want is:
[((Benz,BMW),2),((Benz,Audi),1),((BMW,Audi),1)]
It basically pairs the cars that share a common value and counts the occurrences together.
My code so far is:
cars= sc.textFile('cars.txt')
carpair= cars.flatMap(lambda x: float(x.split(',')))
carpair.map(lambda x: (x[0], x[1])).groupByKey().collect()
As I'm a beginner, I'm not able to figure it out. Getting the occurrence count is secondary; I can't even map the values together.
In pyspark, it'd look like this:
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField('car_model', StringType(), True),
    StructField('value', IntegerType(), True)
])
df = (
    spark
    .read
    .schema(schema)
    .option('header', False)
    .csv('./cars.txt')
)
(
    df.alias('left')
    .join(df.alias('right'), ['value'], 'left')
    .select(
        'value',
        f.col('left.car_model').alias('left_car_model'),
        f.col('right.car_model').alias('right_car_model'),
    )
    .where(f.col('left_car_model') != f.col('right_car_model'))
    .groupBy('left_car_model', 'right_car_model')
    .agg(f.count('*').alias('count'))
    .show()
)
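Note that the self-join above returns each pairing in both directions, e.g. both (Benz, BMW) and (BMW, Benz). If you want a single row per unordered pair, as in the expected output, one possible refinement (a sketch, not part of the original answer) is to keep only the alphabetically ordered direction:
(
    df.alias('left')
    .join(df.alias('right'), ['value'], 'inner')
    # '<' instead of '!=' keeps each unordered pair only once, e.g. ('Audi', 'Benz') but not ('Benz', 'Audi')
    .where(f.col('left.car_model') < f.col('right.car_model'))
    .select(
        f.col('left.car_model').alias('car_a'),
        f.col('right.car_model').alias('car_b'),
    )
    .groupBy('car_a', 'car_b')
    .agg(f.count('*').alias('pair_count'))
    .show()
)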

Rename unnamed columns in Pyspark Dataframe

The data is in an Excel file, which means the file format is '.xlsx'. The header for the table has been split, more or less, across the first two rows. How do I fix this? Is there a solution that takes the best of the two names for each column and makes that the column header?
I have these rows in source file:
|Unnamed:_0|Unnamed:_1|Unnamed:_2|Unnamed:_3|Unnamed:_4|Year |2018|2018.1|
|Col1 |Col2 |Col3 |Col4 |Col5 |Month|Jul |Aug |
I want to display header for the table as:
|Col1|Col2|Col3|Col4|Col5|Year_Month|2018_07|2018.1_08|
I would be glad if you could help me find a solution for this, since I am new to PySpark.
You can share more of your code, but I bet it is the header option for CSV:
df = spark.read.format("csv").option("header", "true").load("csvfile.csv")
If it is not CSV, you can use a schema and specify the column names in it. Example with a schema:
from pyspark.sql.types import StructType, StructField, StringType, ArrayType

schema = StructType([
    StructField('firstname', StringType(), True),
    StructField('middlename', StringType(), True),
    StructField('lastname', StringType(), True),
    StructField('languages', ArrayType(StringType()), True),
    StructField('state', StringType(), True),
    StructField('gender', StringType(), True)])
# `data` here stands for a list of rows matching the schema above
df = spark.createDataFrame(data=data, schema=schema)
Sometimes, for CSVs, it can also be useful to auto-detect the schema from the file:
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("csvfile.csv")
If you load from Excel, you can also use the above options, for example
.option("header", "true")
When loading Excel, the dataAddress option is also useful: you can target a table/selection just as you would in Excel, so after some experimenting the header will match:
.option("dataAddress", "'My Sheet'!B3:C35")
If none of those solutions works, you can promote your first line to header, but it is a bit more involved. An excellent script and walkthrough by desertnaut is described here: https://stackoverflow.com/a/34837299/10972959

How to use a spark dataframe as a table in a SQL statement

I have a Spark DataFrame in Python. How do I use it in a Spark SQL statement?
For example:
df = spark.createDataFrame(
    data=array_of_table_and_time_tuples,
    schema=StructType([StructField('table_name', StringType(), True),
                       StructField('update_time', TimestampType(), True)]))
# something needs to be added here to make the dataframe reference-able by the SQL below
spark.sql(f"""MERGE INTO {load_tracking_table} t
USING update_datetimes s
ON t.table_name = s.table_name
WHEN MATCHED THEN UPDATE SET t.valid_as_of_date = s.update_time""")
Buried in the Spark documentation:
df.createOrReplaceTempView("the_name_of_the_view")
so for the example above:
df = spark.createDataFrame(
    data=array_of_table_and_time_tuples,
    schema=StructType([StructField('table_name', StringType(), True),
                       StructField('update_time', TimestampType(), True)]))
df.createOrReplaceTempView("update_datetimes")
spark.sql(f"""MERGE INTO {load_tracking_table} t
USING update_datetimes s
ON t.table_name = s.table_name
WHEN MATCHED THEN UPDATE SET t.valid_as_of_date = s.update_time""")
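As a quick sanity check (not part of the original answer), you can confirm the view is registered before running the merge:
spark.sql("SELECT * FROM update_datetimes").show()
Note that a temp view created this way is scoped to the current SparkSession; it is not visible to other sessions.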

Get the column names of malformed records while reading a csv file using pyspark

I am reading a CSV file using PySpark with a predefined schema.
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", FloatType(), True)
])
df = (spark.read
      .schema(schema)
      .option("header", True)
      .option("delimiter", ",")
      .csv(path))
Now, in the CSV file, there is a float value in col1 and a string value in col3. I need to raise an exception and get the names of these columns (col1, col3), because they contain values of a data type different from the one defined in the schema.
How do I achieve this?
In PySpark versions > 2.2 you can use columnNameOfCorruptRecord with CSV:
schema = StructType(
[
StructField("col1", IntegerType(), True),
StructField("col2", StringType(), True),
StructField("col3", FloatType(), True),
StructField("corrupted", StringType(), True),
]
)
df = spark.read.csv(
    path,
    schema=schema,
    header=True,
    sep=",",
    mode="PERMISSIVE",
    columnNameOfCorruptRecord="corrupted",
)
df.show()
+----+----+----+------------+
|col1|col2|col3| corrupted|
+----+----+----+------------+
|null|null|null|0.10,123,abc|
+----+----+----+------------+
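If you actually need to raise an exception when malformed rows are present, as the question asks, one possible approach (a sketch, relying on the corrupted column defined above) is to filter on that column and fail if anything shows up:
from pyspark.sql import functions as f

# Spark requires caching (or saving) the parsed result before running a query
# that references only the internal corrupt-record column.
df = df.cache()

corrupt_rows = df.where(f.col("corrupted").isNotNull())
if corrupt_rows.count() > 0:
    # Spark flags whole records as corrupt, not individual columns (see the note below),
    # so the exact offending column names are not directly available.
    raise ValueError("Malformed records found in " + path)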
EDIT: CSV record fields are not independent of one another, so it can't generally be said that one field is corrupt while the others are not. Only the entire record can be corrupt or not corrupt.
For example, suppose we have a comma-delimited file with one row and two floating-point columns, holding the Euro values 0,10 and 1,00. The file looks like this:
col1,col2
0,10,1,00
Which field is corrupt?

Is there a way to make an empty schema with column names from a list?

I am trying to create an empty PySpark DataFrame in the case where one didn't exist before. I also have a list of column names. Is it possible to define an empty PySpark DataFrame without assigning each column manually?
I have a list of columns, final_columns, which I can use to select a subset of columns from a DataFrame. However, when that DataFrame doesn't exist, I would like to create an empty DataFrame with the same columns as in final_columns, without manually assigning the names.
final_columns = ['colA', 'colB', 'colC', 'colD', 'colE']
try:
    sdf = sqlContext.table('test_table')
except:
    print("test_table is empty")
    mySchema = StructType([StructField("colA", StringType(), True),
                           StructField("colB", StringType(), True),
                           StructField("colC", StringType(), True),
                           StructField("colD", StringType(), True),
                           StructField("colE", DoubleType(), True)])
    sdf = sqlContext.createDataFrame(spark.sparkContext.emptyRDD(), schema=mySchema)
sdf = sdf.select(final_columns)
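One way to avoid assigning the names by hand (a sketch, assuming StringType is an acceptable default type for every column) is to build the schema directly from the list:
from pyspark.sql.types import StructType, StructField, StringType

# Every name in final_columns becomes a nullable string column.
mySchema = StructType([StructField(name, StringType(), True) for name in final_columns])

# An empty DataFrame (zero rows) with exactly those columns.
sdf = spark.createDataFrame([], schema=mySchema)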