I am new to Apache Spark. I created a schema and a DataFrame, and it shows me the result, but the formatting is so messy that I can hardly read the lines. I want to display my result the way pandas does. I attached a screenshot of my DataFrame output, but I don't know how to show the result in pandas format.
Here's my code
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.types import *
from IPython.display import display
import pandas as pd
import gzip
schema = StructType([StructField("crimeid", StringType(), True),
StructField("Month", StringType(), True),
StructField("Reported_by", StringType(), True),
StructField("Falls_within", StringType(), True),
StructField("Longitude", FloatType(), True),
StructField("Latitue", FloatType(), True),
StructField("Location", StringType(), True),
StructField("LSOA_code", StringType(), True),
StructField("LSOA_name", StringType(), True),
StructField("Crime_type", StringType(), True),
StructField("Outcome_type", StringType(), True),
])
df = spark.read.csv("crimes.gz",header=False,schema=schema)
df.printSchema()
PATH = "crimes.gz"
csvfile = spark.read.format("csv")\
.option("header", "false")\
.schema(schema)\
.load(PATH)
df1 =csvfile.show()
It shows the result like below, but I want this data in pandas form.
Thanks
You can try showing the rows vertically, or truncating long values if you like:
df.show(2, vertical=True)
df.show(2, truncate=4, vertical=True)
Please try:
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.types import *
from IPython.display import display
import pandas as pd
import gzip
schema = StructType([StructField("crimeid", StringType(), True),
StructField("Month", StringType(), True),
StructField("Reported_by", StringType(), True),
StructField("Falls_within", StringType(), True),
StructField("Longitude", FloatType(), True),
StructField("Latitue", FloatType(), True),
StructField("Location", StringType(), True),
StructField("LSOA_code", StringType(), True),
StructField("LSOA_name", StringType(), True),
StructField("Crime_type", StringType(), True),
StructField("Outcome_type", StringType(), True),
])
df = spark.read.csv("crimes.gz",header=False,schema=schema)
df.printSchema()
pandasDF = df.toPandas()  # convert the PySpark DataFrame to a pandas DataFrame
print(pandasDF.head())    # print the first 5 rows
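One caveat worth adding here (my note, not part of the answer above): toPandas() collects the entire DataFrame onto the driver, so for a large crimes file it is safer to limit the rows first, for example:
pandasDF = df.limit(20).toPandas()  # bring only the first 20 rows to the driver
display(pandasDF)                   # IPython's display renders it as a pandas table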
Suppose I have a dataframe with an id and a distinct list of keys and values, such as the following:
import pyspark.sql.functions as fun
from pyspark.sql.types import StructType, StructField, ArrayType, IntegerType, StringType
schema = StructType([
    StructField('id', StringType(), True),
    StructField('columns',
        ArrayType(
            StructType([
                StructField('key', StringType(), True),
                StructField('value', IntegerType(), True)
            ])
        )
    )
])
data = [
('1', [('Savas', 5)]),
('2', [('Savas', 5), ('Ali', 3)]),
('3', [('Savas', 5), ('Ali', 3), ('Ozdemir', 7)])
]
df = spark.createDataFrame(data, schema)
df.show()
For each struct in the array-type column I want to create a column, as follows:
df1 = df \
    .withColumn('temp', fun.explode('columns')) \
    .select('id', 'temp.key', 'temp.value') \
    .groupby('id') \
    .pivot('key') \
    .agg(fun.first(fun.col('value'))) \
    .sort('id')
df1.show()
Is there a more efficient way to achieve the same result?
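If the distinct keys are known ahead of time, one alternative that skips the explode and the groupBy/pivot shuffle is to turn the array of structs into a map and pick the keys directly. This is only a sketch under that assumption (keys known in advance, Spark >= 2.4 for map_from_entries):
keys = ['Savas', 'Ali', 'Ozdemir']  # assumed to be known in advance

df2 = df.select(
    'id',
    fun.map_from_entries('columns').alias('m')   # array<struct<key,value>> -> map<key,value>
).select(
    'id',
    *[fun.col('m')[k].alias(k) for k in keys]    # one column per known key
)
df2.show()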
How can I write a function in Databricks/Spark that takes an email (or the MD5 value of an email) and "Mon" as input and returns the top 5 cities sorted by activityCount in dict format (if there aren't that many cities, return however many matches are found)?
PS: there are more columns in the df for other days as well, such as "Tue", "Wed", "Thu", "Fri", "Sat" and "Sun", with data in a similar format, but I've only included "Mon" for brevity.
Dataframe
+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|email |Mon |
+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|aaaa#aol.com|{[California]={"[San Francisco]":{"activityCount":4}}, {"[San Diego]":{"activityCount":5}}, {"[San Jose]":{"activityCount":6}}, [New York]={"[New York City]":{"activityCount":1}}, {"[Fairport]":{"activityCount":2}}, {"[Manhattan]":{"activityCount":3}}}|
|bbbb#aol.com|{[Alberta]={"[city1]":{"activityCount":1}}, {"[city2]":{"activityCount":2}}, {"[city3]":{"activityCount":3}}, [New York]={"[New York City]":{"activityCount":7}}, {"[Fairport]":{"activityCount":8}}, {"[Manhattan]":{"activityCount":9}}}|
+------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The dataframe schema is as follows:
schema = StructType([
    StructField("email", StringType(), True),
    StructField("Mon", StringType(), False)
])
Sample code to set it up
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType
if __name__ == "__main__":
    spark = SparkSession.builder.master("local[1]") \
        .appName('SparkByExamples.com') \
        .getOrCreate()

    data2 = [("aaaa#aol.com",
              {
                  "[New York]": "{\"[New York City]\":{\"activityCount\":1}}, {\"[Fairport]\":{\"activityCount\":2}}, "
                                "{\"[Manhattan]\":{\"activityCount\":3}}",
                  "[California]": "{\"[San Francisco]\":{\"activityCount\":4}}, {\"[San Diego]\":{\"activityCount\":5}}, "
                                  "{\"[San Jose]\":{\"activityCount\":6}}"
              })]

    schema = StructType([
        StructField("email", StringType(), True),
        StructField("Mon", StringType(), False)
    ])

    task5DF = spark.createDataFrame(data=data2, schema=schema)
    task5DF.show(truncate=False)
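The post does not include an accepted answer, but here is a hedged sketch of one approach, assuming Mon really arrives as a string laid out like the sample above: extract the "[City]":{"activityCount":N} pairs with a regex inside a UDF, sort by count, and keep the top N. The function and column names below are illustrative.
import re
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType, IntegerType

def top_cities(mon_str, n=5):
    # Pull out every ("City", count) pair and keep the n largest counts.
    if mon_str is None:
        return {}
    pairs = re.findall(r'"\[([^\]]+)\]":\{"activityCount":(\d+)\}', mon_str)
    ranked = sorted(((city, int(count)) for city, count in pairs), key=lambda x: -x[1])
    return dict(ranked[:n])

top_cities_udf = F.udf(top_cities, MapType(StringType(), IntegerType()))

# Look up one email and attach its top cities for Monday.
result = task5DF.filter(F.col("email") == "aaaa#aol.com") \
                .withColumn("top_mon_cities", top_cities_udf(F.col("Mon")))
result.show(truncate=False)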
I am reading a CSV file using PySpark with a predefined schema.
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", FloatType(), True)
])
df = spark.read \
    .schema(schema) \
    .option("header", True) \
    .option("delimiter", ",") \
    .csv(path)
Now, in the CSV file there is a float value in col1 and a string value in col3. I need to raise an exception and get the names of these columns (col1, col3), because they contain values of a different data type than the one defined in the schema.
How do I achieve this?
In PySpark versions > 2.2 you can use columnNameOfCorruptRecord with the CSV reader:
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", FloatType(), True),
    StructField("corrupted", StringType(), True),
])
df = spark.read.csv(
    path,
    schema=schema,
    header=True,
    sep=",",
    mode="PERMISSIVE",
    columnNameOfCorruptRecord="corrupted",
)
df.show()
+----+----+----+------------+
|col1|col2|col3| corrupted|
+----+----+----+------------+
|null|null|null|0.10,123,abc|
+----+----+----+------------+
EDIT: CSV record fields are not independent of one another, so it can't generally be said that one field is corrupt while the others are not; only the entire record can be corrupt or not corrupt.
For example, suppose we have a comma delimited file with one row and two floating point columns, the Euro values 0,10 and 1,00. The file looks like this:
col1,col2
0,10,1,00
Which field is corrupt?
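If you do need column names rather than whole corrupt records, one hedged workaround (my addition, not part of the answer above) is to read everything as strings first and check which columns produce nulls when cast to their intended types. The target_types mapping below is illustrative:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, StringType, FloatType

target_types = {"col1": IntegerType(), "col2": StringType(), "col3": FloatType()}

raw = spark.read.option("header", True).csv(path)  # every column read as string

bad_columns = []
for name, dtype in target_types.items():
    # A non-null string that becomes null after the cast means the value
    # does not fit the declared type for this column.
    failed = raw.filter(F.col(name).isNotNull() & F.col(name).cast(dtype).isNull()).count()
    if failed > 0:
        bad_columns.append(name)

if bad_columns:
    raise ValueError("Columns with values that don't match the schema: {}".format(bad_columns))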
I am writing a simple CSV to Parquet conversion. The CSV file has a couple of timestamps in it, so I am getting type errors when I try to write.
To work around that, I tried this line to identify the timestamp columns and apply to_timestamp to them:
rdd = sc.textFile("../../../Downloads/test_type.csv").map(lambda line: [to_timestamp(i) if instr(i,"-")==5 else i for i in line.split(",")])
Getting this error:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:/yy/xx/Documents/gg/csv_to_parquet/csv_to_parquet.py", line 55, in <lambda>
rdd = sc.textFile("../../../test/test.csv").map(lambda line: [to_timestamp(i) if (instr(i,"-")==5) else i for i in line.split(",")])
AttributeError: 'NoneType' object has no attribute '_jvm'
Any idea how to pull this off?
==========================================
Version 2
Made some progress today: I am writing the Parquet file now, but when I query the data I get a binary vs. timestamp error:
HIVE_BAD_DATA: Field header__timestamp's type BINARY in parquet is incompatible with type timestamp defined in table schema
I modified the code to use all StringTypes initially and then modified the datatypes in the dataframe.
sc = SparkContext(appName="CSV2Parquet")
sqlContext = SQLContext(sc)
schema = StructType([
    StructField("header__change_seq", StringType(), True),
    StructField("header__change_oper", StringType(), True),
    StructField("header__change_mask", StringType(), True),
    StructField("header__stream_position", StringType(), True),
    StructField("header__operation", StringType(), True),
    StructField("header__transaction_id", StringType(), True),
    StructField("header__timestamp", StringType(), True),
    StructField("l_en_us", StringType(), True),
    StructField("priority", StringType(), True),
    StructField("typecode", StringType(), True),
    StructField("retired", StringType(), True),
    StructField("name", StringType(), True),
    StructField("id", StringType(), True),
    StructField("description", StringType(), True),
    StructField("l_es_ar", StringType(), True),
    StructField("adw_updated_ts", StringType(), True),
    StructField("adw_process_id", StringType(), True)
])
rdd = sc.textFile("../../../Downloads/pctl_jobdatetype.csv").map(lambda line: line.split(","))
df = sqlContext.createDataFrame(rdd, schema)
df2 = df.withColumn('header__timestamp', df['header__timestamp'].cast('timestamp'))
df2 = df.withColumn('adw_updated_ts', df['adw_updated_ts'].cast('timestamp'))
df2 = df.withColumn('priority', df['priority'].cast('double'))
df2 = df.withColumn('id', df['id'].cast('double'))
df2.write.parquet('../../../Downloads/input-parquet')
Sample Data:
"header__change_seq","header__change_oper","header__change_mask","header__stream_position","header__operation","header__transaction_id","header__timestamp","l_en_us","priority","typecode","retired","name","id","description","l_es_ar","adw_updated_ts","adw_process_id"
,"I",,,"IDL",,"1970-01-01 00:00:01.000","Effective Date","10.0","Effective","0","Effective Date","10001.0","Effective Date","Effective Date","2020-02-16 15:45:07.432","fb69d6f6-06fa-4c93-b8d6-bb7c7229ee88"
,"I",,,"IDL",,"1970-01-01 00:00:01.000","Written Date","20.0","Written","0","Written Date","10002.0","Written Date","Written Date","2020-02-16 15:45:07.432","fb69d6f6-06fa-4c93-b8d6-bb7c7229ee88"
,"I",,,"IDL",,"1970-01-01 00:00:01.000","Reference Date","30.0","Reference","0","Reference Date","10003.0","Reference Date","Reference Date","2020-02-16 15:45:07.432","fb69d6f6-06fa-4c93-b8d6-bb7c7229ee88"
After I modified the dataframe name to df2 on lines 3-6 below, it seems to be working fine, and Athena is also returning results.
df = sqlContext.createDataFrame(rdd, schema)
df2 = df.withColumn('header__timestamp', df['header__timestamp'].cast('timestamp'))
df2 = df2.withColumn('adw_updated_ts', df['adw_updated_ts'].cast('timestamp'))
df2 = df2.withColumn('priority', df['priority'].cast('double'))
df2 = df2.withColumn('id', df['id'].cast('double'))
df2.write.parquet('../../../Downloads/input-parquet')
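Worth noting as an alternative (my sketch, not from the post): Spark's CSV reader can handle the header row and the quoted fields itself, which avoids the manual line.split(",") on the RDD (that split would also break if a quoted field ever contained a comma), and the casts can then be applied directly:
from pyspark.sql import functions as F

df = (sqlContext.read
      .option("header", True)   # the sample file has a header row
      .option("quote", '"')     # fields are wrapped in double quotes
      .csv("../../../Downloads/pctl_jobdatetype.csv"))

df2 = (df
       .withColumn("header__timestamp", F.col("header__timestamp").cast("timestamp"))
       .withColumn("adw_updated_ts", F.col("adw_updated_ts").cast("timestamp"))
       .withColumn("priority", F.col("priority").cast("double"))
       .withColumn("id", F.col("id").cast("double")))

df2.write.parquet("../../../Downloads/input-parquet")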
I am trying to create an empty PySpark dataframe in the case where it doesn't already exist. I also have a list of column names. Is it possible to define an empty PySpark dataframe without assigning the schema manually?
I have a list of columns, final_columns, which I can use to select a subset of columns from a dataframe. However, when that dataframe doesn't exist, I would like to create an empty dataframe with the same columns as in final_columns, without assigning the names manually.
final_columns = ['colA', 'colB', 'colC', 'colD', 'colE']
try:
    sdf = sqlContext.table('test_table')
except:
    print("test_table is empty")
    mySchema = StructType([
        StructField("colA", StringType(), True),
        StructField("colB", StringType(), True),
        StructField("colC", StringType(), True),
        StructField("colD", StringType(), True),
        StructField("colE", DoubleType(), True)
    ])
    sdf = sqlContext.createDataFrame(spark.sparkContext.emptyRDD(), schema=mySchema)

sdf = sdf.select(final_columns)
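A hedged sketch of one way to avoid the manual per-column assignment: build the schema directly from the list of names. The assumption here is that defaulting every column to StringType is acceptable (types can be cast later if needed):
from pyspark.sql.types import StructType, StructField, StringType

final_columns = ['colA', 'colB', 'colC', 'colD', 'colE']

# Derive the schema from the list itself instead of writing each field by hand.
schema = StructType([StructField(name, StringType(), True) for name in final_columns])
empty_sdf = spark.createDataFrame([], schema=schema)
empty_sdf.printSchema()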