Pyspark : "ImportError: cannot import name 'st_makePoint' - postgresql

I am trying to insert some data into a PostgreSQL database using PySpark. One field in the PostgreSQL table is defined with the data type GEOGRAPHY(Point). I have written the PySpark code below to create this field from longitude and latitude:
from pyspark.sql.functions import st_makePoint
df = (Load input file into pyspark dataframe)
df = df.withColumn("Location", st_makePoint(col("Longitude"), col("Latitude")))
The next step is to load the data into PostgreSQL.
But I am getting the error:
"ImportError: cannot import name 'st_makePoint'"
I thought st_makePoint was part of pyspark.sql.functions, so I'm not sure why it is giving an error. Please help.
Also, if there is a better way of populating the GEOGRAPHY(Point) field in PostgreSQL from PySpark, please let me know.

Check the GeoMesa documentation.
Registering user-defined types and functions can be done manually by invoking geomesa_pyspark.init_sql() on the Spark session object:
import geomesa_pyspark
geomesa_pyspark.init_sql(spark)
Then you can use st_makePoint.
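A minimal usage sketch, assuming the GeoMesa Spark JARs are on the cluster and that spark and df are the session and dataframe from the question; after init_sql the function is registered with Spark SQL, so it is called through expr() rather than imported from pyspark.sql.functions:
import geomesa_pyspark
from pyspark.sql.functions import expr

geomesa_pyspark.init_sql(spark)  # registers GeoMesa's st_* SQL functions

# st_makePoint is a registered SQL function, not a pyspark.sql.functions import,
# so call it via expr() on the Longitude/Latitude columns from the question.
df = df.withColumn("Location", expr("st_makePoint(Longitude, Latitude)"))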

Related

PySpark - Read CSV and ignore file header (not using pandas)

I have a problem that I hope you can help me with.
The text file looks like this:
Report Name :
column1,column2,column3
this is row 1,this is row 2, this is row 3
I am leveraging Synapse Notebooks to try to read this file into a dataframe. If I try to read the csv file using spark.read.csv() it thinks that the column name is "Report Name : ", which is obviously incorrect.
I know that the pandas CSV reader has a skiprows parameter, but unfortunately I cannot read the file directly with pandas, as I am getting some strange networking errors. I can, however, convert a PySpark dataframe to a pandas dataframe via df.toPandas().
I'd like to be able to solve this with straight PySpark dataframes.
Surely someone else has encountered this issue! Help!
I have tried every variation of reading files, drop, etc., but the schema has already been defined when the first dataframe was created, with one column (Report Name :).
Not sure what to do now..
Copied answer from similar question: How to skip lines while reading a CSV file as a dataFrame using PySpark?
import csv

# Parse each partition with the csv module, drop the short "Report Name :" line
# and the header row, then convert the remaining rows to a dataframe.
df = sc.textFile("test.csv") \
    .mapPartitions(lambda lines: csv.reader(lines, delimiter=',', quotechar='"')) \
    .filter(lambda line: len(line) >= 2 and line[0] != 'column1') \
    .toDF(['column1', 'column2', 'column3'])
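A hedged alternative that stays in plain PySpark (no pandas), assuming the offending "Report Name :" line is the very first line of the file: drop it with zipWithIndex and let spark.read.csv pick up the real header from the remaining rows (PySpark's CSV reader also accepts an RDD of strings).
# Drop the first physical line, then parse the rest as CSV with a header row.
rdd = spark.sparkContext.textFile("test.csv")
rdd_no_title = (rdd.zipWithIndex()
                   .filter(lambda pair: pair[1] > 0)   # skip line 0 ("Report Name :")
                   .map(lambda pair: pair[0]))
df = spark.read.csv(rdd_no_title, header=True)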
Microsoft got back to me with an answer that worked! When using the pandas CSV reader, the path to the source file needs to go through a blob storage endpoint (not ADLS Gen2). I only had an endpoint with dfs in the URI, not blob. After I added the blob storage endpoint, the pandas reader worked great! Thanks for looking at my thread.
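For reference, a hedged sketch of that pandas route; the storage account and container names are placeholders, and authentication (for example a SAS token) is omitted. The point is the blob.core.windows.net endpoint plus skiprows=1 to drop the title line:
import pandas as pd

# Placeholder URL: note the blob endpoint, not the dfs (ADLS Gen2) endpoint.
pdf = pd.read_csv(
    "https://<storage-account>.blob.core.windows.net/<container>/test.csv",
    skiprows=1,  # skip the "Report Name :" line so the real header is used
)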

how to replace missing values from another column in PySpark?

I want to use the values in t5 to replace some missing values in t4. I searched for code, but it doesn't work for me.
Current:
example of current
Goal:
example of target
df is a dataframe. Code:
pdf = df.toPandas()
from pyspark.sql.functions import coalesce
pdf.withColumn("t4", coalesce(pdf.t4, pdf.t5))
 Error: 'DataFrame' object has no attribute 'withColumn'
Also, I tried the following code previously; it didn't work either.
new_pdf=pdf['t4'].fillna(method='bfill', axis="columns")
Error: No axis named columns for object type
As the error indicates, .withColumn() is not a method of pandas dataframes but of Spark dataframes. Note that when using .toPandas() your pdf becomes a pandas dataframe, so if you want to use .withColumn(), avoid that conversion.
UPDATE:
If pdf is a pandas dataframe you can do:
pdf['t4']=pdf['t4'].fillna(pdf['t5'])
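Alternatively, a short sketch of the same fix without converting to pandas at all, staying on the original Spark dataframe df from the question:
from pyspark.sql.functions import coalesce

# coalesce returns the first non-null value per row, so t5 fills the gaps in t4.
df = df.withColumn("t4", coalesce(df["t4"], df["t5"]))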

PySpark on Zeppelin: unable to export to CSV format?

I'm trying to export the dataframe as a .csv file to an S3 bucket.
Unfortunately it is saving in parquet files.
Can someone please let me know how to export a PySpark dataframe to a .csv file?
I tried the code below:
predictions.select("probability").write.format('csv').csv('s3a://bucketname/output/x1.csv')
It throws this error: CSV data source does not support struct<...,values:array<...>> data type.
I'd appreciate anybody's help.
Note: my Spark setup runs on Zeppelin.
Thanks,
Naseer
probability is a column holding multiple values (an ML vector, which CSV cannot serialize) and needs to be converted to a string before you can save it to CSV. One way to do it is with a UDF (user-defined function):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Join the values into a single bracketed string so the column becomes CSV-friendly.
def string_from_array(input_list):
    return '[' + ','.join(str(item) for item in input_list) + ']'

ats_udf = udf(string_from_array, StringType())
predictions = predictions.withColumn('probability_string', ats_udf(col("probability")))
Then you can save your dataset:
predictions.select("probability_string").write.csv('s3a://bucketname/output/x1.csv')

create_dynamic_frame_from_catalog returning zero results

I'm trying to create a Glue DynamicFrame from an Athena table, but I keep getting an empty frame.
The Athena table is part of my Glue Data Catalog.
The create_dynamic_frame_from_catalog call doesn't raise any error. As a sanity check, I tried loading a random table name and it did complain.
I know the Athena table has data, since querying the exact same table in Athena returns results.
The table is an external, partitioned JSON table on S3.
I'm using PySpark as shown below:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

# Create a DynamicFrame using the 'raw_data' table
raw_data_df = glueContext.create_dynamic_frame.from_catalog(
    database="***",
    table_name="raw_***")

# Print out information about this data; I'm getting zero here
print("Count:", raw_data_df.count())

# Also getting nothing here
raw_data_df.printSchema()
Is anyone facing the same issue? Could this be a permissions issue or a Glue bug, since no errors are raised?
There are several poorly documented features/gotchas in Glue, which is sometimes frustrating.
I would suggest investigating the following configurations of your Glue job:
Does the S3 bucket name have the aws-glue-* prefix?
Put the files in an S3 folder and make sure the crawler table definition points at the folder rather than at the actual file.
I have also written a blog post on LinkedIn about other Glue gotchas, if that helps.
Do you have subfolders under the path your Athena table points to? glueContext.create_dynamic_frame.from_catalog does not recursively read the data. Either put the data in the root of where the table points to, or add additional_options = {"recurse": True} to your from_catalog call, as in the sketch below.
credit: https://stackoverflow.com/a/56873939/5112418
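A minimal sketch of that recursive read, reusing the placeholder database and table names from the question:
# "recurse" tells Glue to also read data files in subfolders under the
# table's S3 location.
raw_data_df = glueContext.create_dynamic_frame.from_catalog(
    database="***",
    table_name="raw_***",
    additional_options={"recurse": True}
)
print("Count:", raw_data_df.count())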

Scala repartition cannot resolve symbol

I am trying to save my dataframe as a parquet file with one partition per day, so I am trying to use the date column. I want to write one file per partition, so I am using repartition($"date"), but I keep getting errors:
I get the errors "cannot resolve symbol repartition" and "value $ is not a member of StringContext" when I use:
DF.repartition($"date")
.write
.mode("append")
.partitionBy("date")
.parquet("s3://file-path/")
I get the error "Type mismatch, expected: Column, actual: String" when I use:
DF.repartition("date")
.write
.mode("append")
.partitionBy("date")
.parquet("s3://file-path/")
However, this works fine without any error.
DF.write.mode("append").partitionBy("date").parquet("s3://file-path/")
Can't we use the date column in repartition? What's wrong here?
To use the $ symbol in place of col(), you need to first import spark.implicits. Here spark is an instance of a SparkSession, hence the import must be done after the SparkSession has been created. A simple example:
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
This import also enables other functionality, such as converting RDDs to DataFrames or Datasets with toDF() and toDS(), respectively.