how to parse datetime.datetime(,,,tzinfo) to spark dataframe - pyspark

I have the following JSON:
{'client_partition_id': 2, 'client_id': 'd7d3e26f-d105-4816-825d-d5858b9cf0d1', 'ingestion_timestamp': datetime.datetime(2020, 4, 10, 4, 3, 19, 654977, tzinfo=<UTC>)}
It's in an ndjson file. I am trying to read it with spark.read.json(), but I get a parse error because PySpark cannot parse the datetime.
Please help me to resolve it.
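The record shown is a Python dict repr, not valid JSON: datetime.datetime(..., tzinfo=<UTC>) cannot be parsed by any JSON reader, which is why spark.read.json() fails. A minimal sketch of one way around it, assuming you control how the ndjson is produced and that a SparkSession named spark already exists (the file name data.ndjson is illustrative): write the timestamp as an ISO-8601 string, then convert it on the Spark side. Depending on your Spark version you may need to pass an explicit format to to_timestamp.
import json
from datetime import datetime, timezone
from pyspark.sql import functions as F

record = {
    'client_partition_id': 2,
    'client_id': 'd7d3e26f-d105-4816-825d-d5858b9cf0d1',
    'ingestion_timestamp': datetime(2020, 4, 10, 4, 3, 19, 654977, tzinfo=timezone.utc),
}

# Write valid ndjson: one JSON object per line, datetimes serialized as ISO-8601 strings.
with open('data.ndjson', 'w') as f:
    f.write(json.dumps(record, default=lambda o: o.isoformat()) + '\n')

# Read it back; the timestamp arrives as a string and is cast to a proper timestamp.
df = spark.read.json('data.ndjson')
df = df.withColumn('ingestion_timestamp', F.to_timestamp('ingestion_timestamp'))
df.printSchema()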

Related

MyPy on PySpark error: "DataFrameLike" has no attribute "values"

I am developing a program in PySpark 3.2.1.
Mypy == 0.950
One of the operations requires transforming the contents of a small DataFrame into a list.
The code is:
result = df.select("col1","col2","col3").toPandas().values.tolist()
I need to convert it to a list because I then broadcast the information, and a PySpark broadcast can't hold a DataFrame.
For this code I get the following mypy error:
error: "DataFrameLike" has no attribute "values"
Is there something I might do to avoid the mypy error?
This is working fine for me.
>>> df=spark.read.option('header','true').csv("C:/Users/pc/Desktop/myfile.txt")
>>> df
DataFrame[col1: string, col2: string, col3: string]
>>> result = df.select("col1","col2","col3").toPandas().values.tolist()
>>> result
[['1', '100', '1001'], ['2', '200', '2002'], ['3', '300', '1421'], ['4', '400', '24214'], ['5', '500', '14141']]
What is Mypy here?
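A minimal sketch of two workarounds that keep mypy quiet, assuming the goal is simply to end up with a plain Python list (the typing.cast call and the collect()-based variant are illustrative options, not the only ones):
from typing import cast
import pandas as pd

# Option 1: tell mypy that the result of toPandas() really is a pandas DataFrame.
pdf = cast(pd.DataFrame, df.select("col1", "col2", "col3").toPandas())
result = pdf.values.tolist()

# Option 2: skip pandas entirely and build the list from collected Rows.
result = [list(row) for row in df.select("col1", "col2", "col3").collect()]
A targeted # type: ignore comment on the original line would also silence the error without changing behaviour.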

How to store bigint array in file in Pyspark?

I have a UDF that returns a bigint array. I want to store that in a file on a PySpark cluster.
Sample Data -
[
Row(Id='ABCD505936', array=[0, 2, 5, 6, 8, 10, 12, 13, 14, 15, 18]),
Row(Id='EFGHI155784', array=[1, 2, 4, 10, 16, 22, 27, 32, 36, 38, 39, 40])
]
I tried saving it like this -
df.write.save("/home/data", format="text", header="true", mode="overwrite")
But it throws an error saying -
py4j.protocol.Py4JJavaError: An error occurred while calling
o101.save. : org.apache.spark.sql.AnalysisException: Text data source
does not support array data type.;
Can anyone please help me?
Try this:
from pyspark.sql import functions as F
df.withColumn(
    "array",
    F.col("array").cast("string")
).write.save(
    "/home/data", format="text", header="true", mode="overwrite"
)
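Note that Spark's text data source only supports a single string column (and has no header option), so if the DataFrame still contains Id alongside array, the write above will be rejected as well. A small sketch of an alternative, assuming csv output is acceptable (the path and column names are taken from the question):
from pyspark.sql import functions as F

df.withColumn("array", F.col("array").cast("string")) \
    .write.save("/home/data", format="csv", header="true", mode="overwrite")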

compare metadata of parquet file using pyspark

I am using pyspark and have a situation where I need to compare metadata of 2 parquet files.
Example:
Parquet 1 schema is:
1, ID, string
2, address, string
3, Date, date
Parquet 2 schema is:
1, ID, string
2, Date, date
3, address, string
This should show me a difference, as col 2 moved to col 3 in parquet 2.
Thanks,
VK
In Spark there isn't a native command to compare headers. A solution to your problem could be the following:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df1 = spark.read.parquet('path/to/file1.parquet')
df2 = spark.read.parquet('path/to/file2.parquet')
df1_headers = df1.columns
df2_headers = df2.columns
# Now in Python you could compare the lists with the headers
# You don't need Spark to compare simple headers :-)
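Continuing from df1_headers and df2_headers above, a short sketch of what that comparison might look like; the positional zip is just one way to surface a column that moved (it assumes you mainly care about names and their order):
# Report positional differences so a reordered column shows up.
if df1_headers != df2_headers:
    for pos, (c1, c2) in enumerate(zip(df1_headers, df2_headers)):
        if c1 != c2:
            print(f"position {pos}: {c1} (parquet 1) vs {c2} (parquet 2)")

# If types matter too, the full schemas can be compared as well.
print(df1.schema == df2.schema)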

Fetch columns based on list in Spark

I have a list List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13) and a dataframe that reads its input from a text file with no headers. I want to fetch the columns mentioned in my list from that dataframe (inputFile). My input file has more than 20 columns, but I want to fetch only the columns mentioned in my list.
val inputFile = spark.read
.format("com.databricks.spark.csv")
.option("inferSchema", "true")
.option("delimiter", "|")
.load("C:\\demo.txt")
You can get the required columns using the following:
val fetchIndex = List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13)
val fetchCols = inputFile.columns.zipWithIndex
  .filter { case (colName, idx) => fetchIndex.contains(idx) }
  .map(x => col(x._1))
inputFile.select(fetchCols: _*)
Basically what it does is: zipWithIndex adds a continuous index to each element of the collection, so you get something like this:
df.columns.zipWithIndex.filter { case (data, idx) => a.contains(idx) }.map(x => col(x._1))
res8: Array[org.apache.spark.sql.Column] = Array(companyid, event, date_time)
And then you can just use the splat operator to pass the generated array as varargs to the select function.
You can use the following steps to get the columns whose indexes you have defined in a list.
You can get the column names by doing the following
val names = df.schema.fieldNames
And you have a list of column indexes as
val list = List(0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13)
Now you can select the column names at the indexes contained in the list:
val selectCols = list.map(x => names(x))
The last step is to select only those columns:
import org.apache.spark.sql.functions.col
val selectedDataFrame = df.select(selectCols.map(col): _*)
You should now have a dataframe containing only the columns whose indexes are mentioned in the list.
Note: the indexes in the list must be valid column indexes of the dataframe.
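Since the rest of this page is PySpark, here is a rough Python equivalent of the second approach (a sketch only; the file path, delimiter, and index list are taken from the question):
from pyspark.sql.functions import col

input_file = spark.read \
    .option("inferSchema", "true") \
    .option("delimiter", "|") \
    .csv("C:/demo.txt")

fetch_index = [0, 1, 2, 3, 4, 5, 6, 7, 10, 8, 13]
names = input_file.columns
selected = input_file.select([col(names[i]) for i in fetch_index])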

pyspark substring and aggregation

I am new to Spark and I've got a CSV file with data like this:
date, accidents, injured
2015/20/03 18:00, 15, 5
2015/20/03 18:30, 25, 4
2015/20/03 21:10, 14, 7
2015/20/02 21:00, 15, 6
I would like to aggregate this data by the hour in which it happened. My idea is to substring the date to 'year/day/month hh' with no minutes, so I can use it as a key. I want the average of accidents and injured for each hour. Maybe there is a different, smarter way with PySpark?
Thanks guys!
Well, it depends on what you're going to do afterwards, I guess.
The simplest way would be to do as you suggest: substring the date string and then aggregate:
data = [('2015/20/03 18:00', 15, 5),
('2015/20/03 18:30', 25, 4),
('2015/20/03 21:10', 14, 7),
('2015/20/02 21:00', 15, 6)]
df = spark.createDataFrame(data, ['date', 'accidents', 'injured'])
df.withColumn('date_hr', df['date'].substr(1, 13)) \
    .groupby('date_hr') \
    .agg({'accidents': 'avg', 'injured': 'avg'}) \
    .show()
If you, however, want to do some more computation later on, you can parse the data to a TimestampType() and then extract the date and hour from that.
import pyspark.sql.types as typ
from pyspark.sql.functions import col, udf
from datetime import datetime
parseString = udf(lambda x: datetime.strptime(x, '%Y/%d/%m %H:%M'), typ.TimestampType())
getDate = udf(lambda x: x.date(), typ.DateType())
getHour = udf(lambda x: int(x.hour), typ.IntegerType())
df.withColumn('date_parsed', parseString(col('date'))) \
.withColumn('date_only', getDate(col('date_parsed'))) \
.withColumn('hour', getHour(col('date_parsed'))) \
.groupby('date_only', 'hour') \
.agg({'accidents': 'avg', 'injured': 'avg'})\
.show()
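On reasonably recent Spark versions the same result can be obtained without Python UDFs, which usually performs better. A sketch using only built-in functions (the format string assumes the year/day/month layout of the sample data):
from pyspark.sql import functions as F

df.withColumn('ts', F.to_timestamp('date', 'yyyy/dd/MM HH:mm')) \
    .withColumn('date_only', F.to_date('ts')) \
    .withColumn('hour', F.hour('ts')) \
    .groupby('date_only', 'hour') \
    .agg({'accidents': 'avg', 'injured': 'avg'}) \
    .show()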