PySpark gets the wrong date when importing from MongoDB

I'm trying to work with a DataFrame populated from MongoDB in PySpark, but the dates I get are shifted back so that they land on the previous day. The field in Mongo looks like this: fechaEnvio 2022-11-03T00:00:00.000+00:00
But when reading it in Spark it looks like this: 2022-11-02 21:00:00
This is the schema for the collection I'm using:
camp_schema = StructType([
    StructField("idCampana", StringType(), True),
    StructField("fechaEnvio", DateType(), True),
    StructField("fechaInicio", DateType(), True),
    StructField("error", StringType(), True),
])
And this is the read call:
campaigns_source = (
    spark.read.format("mongo")
    .option("uri", mongo_str.get_read_str('campanas'))
    .option("schema", camp_schema)
    .load()
)
What am I doing wrong?
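The shift looks like a session time-zone conversion rather than a connector bug: the stored value is UTC midnight, and Spark renders it in the driver's local zone (UTC-3 here), which pushes it onto the previous day. Below is a minimal sketch of one way around it, assuming the dates should be kept as UTC wall-clock values and that mongo_str.get_read_str() is your own helper. Note the schema is passed with .schema() rather than .option("schema", ...), and TimestampType is used because the field carries a time component.
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Render timestamps in UTC instead of the driver's local zone (assumption: UTC is what you want).
spark.conf.set("spark.sql.session.timeZone", "UTC")

camp_schema = StructType([
    StructField("idCampana", StringType(), True),
    StructField("fechaEnvio", TimestampType(), True),   # TimestampType keeps the time component
    StructField("fechaInicio", TimestampType(), True),
    StructField("error", StringType(), True),
])

campaigns_source = (
    spark.read.format("mongo")
    .option("uri", mongo_str.get_read_str("campanas"))
    .schema(camp_schema)   # .schema(), not .option("schema", ...)
    .load()
)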

Related

Issue with pyspark df.show

I am reading a gz file in PySpark, creating an RDD and a schema, and then using that RDD to create the DataFrame. But I am not able to see any output.
Here is my code; I am not sure what I am doing wrong.
lines = sc.textFile("x.gz")
parts = lines.map(lambda l: l.split("\t"))
db = parts.map(lambda p: (p[0], int(p[1]), int(p[2]), int(p[3]).strip()))
schema = StructType([
    StructField("word", StringType(), True),
    StructField("count1", IntegerType(), True),
    StructField("count2", IntegerType(), True),
    StructField("count3", IntegerType(), True)])
df = sqlContext.createDataFrame(db, schema)
df.show()
df.createOrReplaceTempView("dftable")
result = sqlContext.sql("SELECT COUNT(*) FROM dftable")
result.show()
I also want to calculate the number of rows in my table, which is why I used a SQL query. But whenever I try to call .show(), an error is thrown. What am I doing wrong here?
The data in the gz file looks something like:
A'String' some_number some_number some_number
The some_number values are stored as strings.
Please guide me on what I am doing wrong.
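Judging from the snippet, one likely culprit is int(p[3]).strip(): it converts the field to an int and then calls .strip() on it, which fails on every record as soon as show() forces evaluation. A sketch of the corrected mapping, assuming each line really has four tab-separated fields:
lines = sc.textFile("x.gz")
parts = lines.map(lambda l: l.split("\t"))
# strip the string first, then convert: int(p[3].strip()), not int(p[3]).strip()
db = parts.map(lambda p: (p[0], int(p[1]), int(p[2]), int(p[3].strip())))

df = sqlContext.createDataFrame(db, schema)
df.show()
print(df.count())   # row count without the temp-view / SQL round trip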

Apache pyspark pandas

I am new to Apache Spark. I created the schema and DataFrame, and it shows me the result, but the formatting is so messy that I can hardly read the lines. So I want to show my result in pandas format. I attached a screenshot of my DataFrame result, but I don't know how to display it in pandas format.
Here's my code
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.types import *
from IPython.display import display
import pandas as pd
import gzip

schema = StructType([
    StructField("crimeid", StringType(), True),
    StructField("Month", StringType(), True),
    StructField("Reported_by", StringType(), True),
    StructField("Falls_within", StringType(), True),
    StructField("Longitude", FloatType(), True),
    StructField("Latitue", FloatType(), True),
    StructField("Location", StringType(), True),
    StructField("LSOA_code", StringType(), True),
    StructField("LSOA_name", StringType(), True),
    StructField("Crime_type", StringType(), True),
    StructField("Outcome_type", StringType(), True),
])

df = spark.read.csv("crimes.gz", header=False, schema=schema)
df.printSchema()

PATH = "crimes.gz"
csvfile = spark.read.format("csv") \
    .option("header", "false") \
    .schema(schema) \
    .load(PATH)
df1 = csvfile.show()
It shows the result like in the attached screenshot, but I want this data in pandas form.
Thanks
You can try showing the rows vertically, or truncate long values if you like:
df.show(2, vertical=True)
df.show(2, truncate=4, vertical=True)
Please try:
from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.types import *
from IPython.display import display
import pandas as pd
import gzip

schema = StructType([
    StructField("crimeid", StringType(), True),
    StructField("Month", StringType(), True),
    StructField("Reported_by", StringType(), True),
    StructField("Falls_within", StringType(), True),
    StructField("Longitude", FloatType(), True),
    StructField("Latitue", FloatType(), True),
    StructField("Location", StringType(), True),
    StructField("LSOA_code", StringType(), True),
    StructField("LSOA_name", StringType(), True),
    StructField("Crime_type", StringType(), True),
    StructField("Outcome_type", StringType(), True),
])

df = spark.read.csv("crimes.gz", header=False, schema=schema)
df.printSchema()

pandasDF = df.toPandas()  # convert the PySpark DataFrame into a pandas DataFrame
print(pandasDF.head())    # print the first 5 rows
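One caveat worth adding (an assumption about your data size, not something stated in the question): toPandas() collects the entire DataFrame onto the driver, so for a large crimes file it is safer to limit first, for example:
# Convert only a small slice to pandas; a full toPandas() pulls everything onto the driver.
pandasDF = df.limit(20).toPandas()
print(pandasDF.head())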

Get the column names of malformed records while reading a csv file using pyspark

I am reading a CSV file using PySpark with a predefined schema.
schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", FloatType(), True)
])
df = spark.read \
    .schema(schema) \
    .option("header", True) \
    .option("delimiter", ",") \
    .csv(path)
Now, in the CSV file there is a float value in col1 and a string value in col3. I need to raise an exception and get the names of these columns (col1, col3), because they contain values of a different data type than defined in the schema.
How do I achieve this?
In PySpark versions > 2.2 you can use columnNameOfCorruptRecord with csv:
schema = StructType(
    [
        StructField("col1", IntegerType(), True),
        StructField("col2", StringType(), True),
        StructField("col3", FloatType(), True),
        StructField("corrupted", StringType(), True),
    ]
)
df = spark.read.csv(
    path,
    schema=schema,
    header=True,
    sep=",",
    mode="PERMISSIVE",
    columnNameOfCorruptRecord="corrupted",
)
df.show()
+----+----+----+------------+
|col1|col2|col3| corrupted|
+----+----+----+------------+
|null|null|null|0.10,123,abc|
+----+----+----+------------+
EDIT: CSV record fields are not independent of one another, so it can't generally be said that one field is corrupt, but others are not. Only the entire record can be corrupt or not corrupt.
For example, suppose we have a comma delimited file with one row and two floating point columns, the Euro values 0,10 and 1,00. The file looks like this:
col1,col2
0,10,1,00
Which field is corrupt?
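If the goal is simply to fail when any row does not match the schema, a sketch that raises on the presence of corrupt records (without pretending to name the offending columns, for the reason above) could look like the following; the cache() is there because Spark refuses queries that reference only the internal corrupt-record column of an uncached CSV read:
# Sketch: fail fast when any record could not be parsed against the schema.
df.cache()
bad = df.filter(df["corrupted"].isNotNull())
bad_count = bad.count()
if bad_count > 0:
    bad.show(truncate=False)
    raise ValueError("{} malformed record(s) found in {}".format(bad_count, path))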

PySpark TypeErrors

I am writing a simple CSV to Parquet conversion. The CSV file has a couple of timestamp columns in it, so I am getting type errors when I try to write.
To work around that, I tried this line to identify the timestamp columns and apply to_timestamp to them:
rdd = sc.textFile("../../../Downloads/test_type.csv").map(lambda line: [to_timestamp(i) if instr(i,"-")==5 else i for i in line.split(",")])
Getting this error:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:/yy/xx/Documents/gg/csv_to_parquet/csv_to_parquet.py", line 55, in <lambda>
rdd = sc.textFile("../../../test/test.csv").map(lambda line: [to_timestamp(i) if (instr(i,"-")==5) else i for i in line.split(",")])
AttributeError: 'NoneType' object has no attribute '_jvm'
Any idea how to pull this off?
==========================================
Version 2
I made some progress today: I am writing the Parquet file now, but when I query the data I get a binary-vs-timestamp error:
HIVE_BAD_DATA: Field header__timestamp's type BINARY in parquet is incompatible with type timestamp defined in table schema
I modified the code to read every column as StringType initially and then changed the data types in the DataFrame.
sc = SparkContext(appName="CSV2Parquet")
sqlContext = SQLContext(sc)

schema = StructType([
    StructField("header__change_seq", StringType(), True),
    StructField("header__change_oper", StringType(), True),
    StructField("header__change_mask", StringType(), True),
    StructField("header__stream_position", StringType(), True),
    StructField("header__operation", StringType(), True),
    StructField("header__transaction_id", StringType(), True),
    StructField("header__timestamp", StringType(), True),
    StructField("l_en_us", StringType(), True),
    StructField("priority", StringType(), True),
    StructField("typecode", StringType(), True),
    StructField("retired", StringType(), True),
    StructField("name", StringType(), True),
    StructField("id", StringType(), True),
    StructField("description", StringType(), True),
    StructField("l_es_ar", StringType(), True),
    StructField("adw_updated_ts", StringType(), True),
    StructField("adw_process_id", StringType(), True)
])

rdd = sc.textFile("../../../Downloads/pctl_jobdatetype.csv").map(lambda line: line.split(","))
df = sqlContext.createDataFrame(rdd, schema)
df2 = df.withColumn('header__timestamp', df['header__timestamp'].cast('timestamp'))
df2 = df.withColumn('adw_updated_ts', df['adw_updated_ts'].cast('timestamp'))
df2 = df.withColumn('priority', df['priority'].cast('double'))
df2 = df.withColumn('id', df['id'].cast('double'))
df2.write.parquet('../../../Downloads/input-parquet')
Sample Data:
"header__change_seq","header__change_oper","header__change_mask","header__stream_position","header__operation","header__transaction_id","header__timestamp","l_en_us","priority","typecode","retired","name","id","description","l_es_ar","adw_updated_ts","adw_process_id"
,"I",,,"IDL",,"1970-01-01 00:00:01.000","Effective Date","10.0","Effective","0","Effective Date","10001.0","Effective Date","Effective Date","2020-02-16 15:45:07.432","fb69d6f6-06fa-4c93-b8d6-bb7c7229ee88"
,"I",,,"IDL",,"1970-01-01 00:00:01.000","Written Date","20.0","Written","0","Written Date","10002.0","Written Date","Written Date","2020-02-16 15:45:07.432","fb69d6f6-06fa-4c93-b8d6-bb7c7229ee88"
,"I",,,"IDL",,"1970-01-01 00:00:01.000","Reference Date","30.0","Reference","0","Reference Date","10003.0","Reference Date","Reference Date","2020-02-16 15:45:07.432","fb69d6f6-06fa-4c93-b8d6-bb7c7229ee88"
After I modified the DataFrame name to df2 in lines 3-6 below, it seems to be working fine, and Athena is also returning results.
df = sqlContext.createDataFrame(rdd, schema)
df2 = df.withColumn('header__timestamp', df['header__timestamp'].cast('timestamp'))
df2 = df2.withColumn('adw_updated_ts', df['adw_updated_ts'].cast('timestamp'))
df2 = df2.withColumn('priority', df['priority'].cast('double'))
df2 = df2.withColumn('id', df['id'].cast('double'))
df2.write.parquet('../../../Downloads/input-parquet')
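For the original 'NoneType' object has no attribute '_jvm' error: to_timestamp and instr are pyspark.sql.functions that operate on DataFrame Columns through the JVM, so they cannot be called on plain Python strings inside an RDD map lambda running on the workers. A sketch of the same conversion kept entirely in the DataFrame API, assuming a SparkSession named spark is available (the question uses SQLContext) and using the column names from the header shown above; the paths are the placeholders from the question:
from pyspark.sql.functions import col

# Let spark.read.csv handle quoting and the header row, then cast the known columns.
df = spark.read.csv("../../../Downloads/pctl_jobdatetype.csv", header=True, schema=schema)

df2 = (
    df.withColumn("header__timestamp", col("header__timestamp").cast("timestamp"))
      .withColumn("adw_updated_ts", col("adw_updated_ts").cast("timestamp"))
      .withColumn("priority", col("priority").cast("double"))
      .withColumn("id", col("id").cast("double"))
)
df2.write.parquet("../../../Downloads/input-parquet")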

Is there a way to make an empty schema with column names from a list?

I am trying to make an empty PySpark DataFrame for the case where one didn't exist before. I also have a list of column names. Is it possible to define an empty PySpark DataFrame without assigning the columns manually?
I have a list of columns, final_columns, which I can use to select a subset of columns from a DataFrame. However, when that DataFrame doesn't exist, I would like to create an empty DataFrame with the same columns as in final_columns, without manually assigning the names.
final_columns = ['colA', 'colB', 'colC', 'colD', 'colE']

try:
    sdf = sqlContext.table('test_table')
except:
    print("test_table is empty")
    mySchema = StructType([
        StructField("colA", StringType(), True),
        StructField("colB", StringType(), True),
        StructField("colC", StringType(), True),
        StructField("colD", StringType(), True),
        StructField("colE", DoubleType(), True)
    ])
    sdf = sqlContext.createDataFrame(spark.sparkContext.emptyRDD(), schema=mySchema)

sdf = sdf.select(final_columns)
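A sketch of building the schema directly from the list, assuming StringType is an acceptable default when the real column types are unknown:
from pyspark.sql.types import StructType, StructField, StringType

# Build an empty DataFrame whose columns come straight from final_columns.
# Every column defaults to StringType here because the real types are unknown.
mySchema = StructType([StructField(name, StringType(), True) for name in final_columns])
sdf = spark.createDataFrame([], schema=mySchema)
sdf.printSchema()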