I'm trying to export a DataFrame as a .csv file to an S3 bucket. Unfortunately it is being saved as Parquet files.
Can someone please let me know how to export a PySpark DataFrame to a .csv file?
I tried the code below:
predictions.select("probability").write.format('csv').csv('s3a://bucketname/output/x1.csv')
it is throwing this error: CSV data source does not support struct<type:tinyint,size:int,indices:array<int>,values:array<double>> data type.
I'd appreciate anybody's help.
Note: my Spark setup is based on Zeppelin.
Thanks,
Naseer
The probability column holds multiple values (an ML vector) and needs to be converted to a string before you can save it to CSV. One way to do it is with a UDF (user-defined function):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Join the vector's values into a single bracketed string, e.g. "[0.1,0.9]"
def string_from_array(input_list):
    return '[' + ','.join(str(item) for item in input_list) + ']'

ats_udf = udf(string_from_array, StringType())
predictions = predictions.withColumn('probability_string', ats_udf(col("probability")))
Then you can save your dataset:
predictions.select("probability_string").write.csv('s3a://bucketname/output/x1.csv')
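If you are on Spark 3.0 or later, a UDF-free sketch of the same idea (assuming probability is an ML vector column, as the error message suggests) is to convert the vector to an array and join its values:
from pyspark.ml.functions import vector_to_array
from pyspark.sql.functions import concat_ws, col

# Convert the ML vector to array<double>, cast to strings, and join with commas
out = (predictions
       .withColumn("prob_array", vector_to_array(col("probability")))
       .withColumn("probability_string",
                   concat_ws(",", col("prob_array").cast("array<string>")))
       .select("probability_string"))

out.write.csv("s3a://bucketname/output/x1.csv")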
I have a problem that I hope you can help me with.
The text file looks like this:
Report Name :
column1,column2,column3
this is row 1,this is row 2, this is row 3
I am leveraging Synapse Notebooks to try to read this file into a dataframe. If I try to read the csv file using spark.read.csv() it thinks that the column name is "Report Name : ", which is obviously incorrect.
I know that the pandas CSV reader has a 'skiprows' parameter, but unfortunately I cannot read the file directly with pandas, as I am getting some strange networking errors. I can, however, convert a PySpark dataframe to a pandas dataframe via df.toPandas().
I'd like to be able to solve this with straight PySpark dataframes.
Surely someone else has encountered this issue! Help!
I have tried every variation of reading the file, dropping rows, etc., but the schema was already defined when the first dataframe was created, with one column (Report Name : ).
Not sure what to do now.
Copied answer from similar question: How to skip lines while reading a CSV file as a dataFrame using PySpark?
import csv

# Parse each partition's lines with the csv module, then drop the
# "Report Name :" line and the header row before building the DataFrame
df = (sc.textFile("test.csv")
      .mapPartitions(lambda lines: csv.reader(lines, delimiter=',', quotechar='"'))
      .filter(lambda row: len(row) >= 2 and row[0] != 'column1')
      .toDF(['column1', 'column2', 'column3']))
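A sketch of an alternative that stays closer to the DataFrame reader (assuming Spark 2.2+, where spark.read.csv also accepts an RDD of strings): drop the first physical line by index and let Spark parse the header from what remains. The file name is the one from the copied answer.
# Skip the first line ("Report Name :") by index, then parse the rest as CSV
rdd = sc.textFile("test.csv")
without_first = (rdd.zipWithIndex()
                    .filter(lambda pair: pair[1] > 0)
                    .map(lambda pair: pair[0]))

df = spark.read.csv(without_first, header=True)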
Microsoft got back to me with an answer that worked! When you use the pandas CSV reader with a path to the source file, it requires an endpoint to blob storage (not ADLS Gen2). I only had an endpoint with dfs in the URI, not blob. After I added the blob storage endpoint, the pandas reader worked great! Thanks for looking at my thread.
I am trying to insert some data into a PostgreSQL database using PySpark. One field in the PostgreSQL table is defined with the data type GEOGRAPHY(Point). I have written the PySpark code below to create this field from longitude and latitude:
from pyspark.sql.functions import st_makePoint
df = ...  # load the input file into a PySpark DataFrame
df = df.withColumn("Location", st_makePoint(col("Longitude"), col("Latitude")))
The next step is to load the data into PostgreSQL, but I am getting this error:
ImportError: cannot import name 'st_makePoint'
I think st_makePoint is part of pyspark.sql.functions. Not sure why it is giving an error. Please help.
Also, if there is a better way of populating the GEOGRAPHY(Point) field in PostgreSQL from PySpark, please let me know.
Check the GeoMesa documentation.
Registering user-defined types and functions can be done manually by invoking geomesa_pyspark.init_sql() on the Spark session object:
import geomesa_pyspark
geomesa_pyspark.init_sql(spark)
Then you can use st_makePoint.
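A minimal end-to-end sketch, assuming geomesa_pyspark is installed and configured for your session (the column names are the ones from the question; st_makePoint is called through Spark SQL once init_sql has registered it):
import geomesa_pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

# Register GeoMesa's spatial types and st_* functions with Spark SQL
geomesa_pyspark.init_sql(spark)

# Build the point geometry from the Longitude/Latitude columns
df = df.withColumn("Location", expr("st_makePoint(Longitude, Latitude)"))
If you would rather avoid GeoMesa, another common option is to write the point as WKT text (e.g. 'POINT(lon lat)') and let PostGIS cast or convert it on the PostgreSQL side.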
I am importing a JSON file into a PySpark dataframe. I have imported the JSON with the following code:
df = sqlContext.read.json("json_file.json").select("item", "attributes")
I want to split attributes from one column to multiple columns.
Here is a sample of the JSON format:
{"item":"item-1","attributes":{"att-a":"att-a-15","att-b":"att-b-10","att-c":"att-c-7"}}
{"item":"item-2","attributes":{"att-a":"att-a-15","att-b":"att-b-10","att-c":"att-c-7"}}
If you want your output to look like this:
+------+--------+--------+-------+
| item| att-a| att-b| att-c|
+------+--------+--------+-------+
|item-1|att-a-15|att-b-10|att-c-7|
|item-2|att-a-15|att-b-10|att-c-7|
+------+--------+--------+-------+
Use
from pyspark.sql import functions as f
df.select('item','attributes.*').show()
This expands the attributes struct, so all the attributes appear as separate columns.
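If you would rather list the attribute columns explicitly (or don't know their names up front), a sketch that pulls the field names from the schema instead of using attributes.* might look like this (the backticks handle the hyphens in the attribute names):
from pyspark.sql.functions import col

# Get the nested field names from the attributes struct, then flatten them
attr_fields = df.schema["attributes"].dataType.fieldNames()
flat = df.select("item", *[col("attributes.`{}`".format(c)).alias(c) for c in attr_fields])
flat.show()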
I want to use the values in t5 to replace some missing values in t4. I searched for code, but it doesn't work for me.
Current:
example of current
Goal:
example of target
df is a DataFrame. Code:
pdf = df.toPandas()
from pyspark.sql.functions import coalesce
pdf.withColumn("t4", coalesce(pdf.t4, pdf.t5))
Error: 'DataFrame' object has no attribute 'withColumn'
I also tried the following code previously; it didn't work either.
new_pdf=pdf['t4'].fillna(method='bfill', axis="columns")
Error: No axis named columns for object type
As the error indicates, .withColumn() is not a method of pandas dataframes but of Spark dataframes. Note that when you call .toPandas(), pdf becomes a pandas dataframe, so if you want to use .withColumn(), avoid that conversion.
UPDATE:
If pdf is a pandas dataframe you can do:
pdf['t4']=pdf['t4'].fillna(pdf['t5'])
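For completeness, a sketch of the same fill staying in PySpark (on the original Spark DataFrame df, with the column names from the question):
from pyspark.sql.functions import coalesce, col

# Fill nulls in t4 with the corresponding value from t5
df = df.withColumn("t4", coalesce(col("t4"), col("t5")))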
I have a scenario where I need to convert an Informatica mapping (source and target: SQL Server) into PySpark code (source: blob file, target: Hive). In the Expression transformation, one column uses the 'reg_extract' function, and I need to convert this for a PySpark dataframe. My final goal is to create the same table in Hive as it is in SQL Server.
What is the replacement for the reg_extract function in PySpark? I am using PySpark 2.
Below is the code from Informatica Expression transformation (for one column variable field):
LTRIM(RTRIM(IIF(instr(v_DATE,'AMENDED')>0,
reg_Extract(DATE,'.*(^\w+\s+[0-9]{2}[,]\s+[0-9]{4}|^\w+\s+[0-9]{1}[,]\s+[0-9]{4}).*'),
reg_Extract(DATE,'.*((\s0?[1-9]|1[012])[./-](0?[1-9]|[12][0-9]|3[01])[./-][0-9]{2,4}|(^0?[1-9]|1[012])[./-](0?[1-9]|[12][0-9]|3[01])[./-][0-9]{2,4}|(0[1-9]|[12][0-9]|3[01])[./-](0?[1-9]|1[012])[./-][0-9]{2,4}|\s\w+\s+(0?[1-9]|[12][0-9]|3[01])[.,](\s+)?[0-9]{4}|^\w+\s+(0?[1-9]|[12][0-9]|3[01])[.,](\s+)?[0-9]{4}|^(19|20)[0-9]{2}|^[0-9]{2}\s+\w+\s+[0-9]{4}|^[0-9]{6}|^(0?[1-9]|[12][0-9]|3[01])\s+\w+[.,]?\s+(19|20)[0-9]{2}|^[0-9]{1,2}[-,/]\w+[-,/][0-9]{2,4}).*'))))
In PySpark, I have loaded the source file into a dataframe and selected the required columns. After that I am unable to proceed.
input_data = spark.read.csv(file_path, header=True)
input_data.createOrReplaceTempView("input_data")

# Select the required columns from the registered view
df_test = "select ACCESSION_NUMBER, DATE, REPORTING_PERSON from input_data"
df = spark.sql(df_test)
I am new to Pyspark/SparkSQL. Please help.
You can use regexp_extract:
from pyspark.sql.functions import regexp_extract, col

df = df.withColumn('New_Column_Name', regexp_extract(col('DATE'), r'.*(^\w+\s+[0-9]{2}[,]\s+[0-9]{4}|^\w+\s+[0-9]{1}[,]\s+[0-9]{4}).*', 1))
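To mirror the full Informatica expression (the IIF on 'AMENDED' plus LTRIM/RTRIM), here is a hedged sketch using when/instr/trim with the two patterns from the mapping; DATE_CLEAN is a hypothetical output column name, and the second pattern is left as a placeholder to paste from the mapping rather than reproduced here:
from pyspark.sql.functions import when, instr, trim, regexp_extract, col

# First pattern from the mapping (used when DATE contains 'AMENDED')
pattern_amended = r'.*(^\w+\s+[0-9]{2}[,]\s+[0-9]{4}|^\w+\s+[0-9]{1}[,]\s+[0-9]{4}).*'
# Second pattern: paste the long date regexp from the mapping here
pattern_other = r'...'

df = df.withColumn(
    "DATE_CLEAN",  # hypothetical output column name
    trim(
        when(instr(col("DATE"), "AMENDED") > 0,
             regexp_extract(col("DATE"), pattern_amended, 1))
        .otherwise(regexp_extract(col("DATE"), pattern_other, 1))
    )
)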