Load text files and store them in a DataFrame using PySpark

I am migrating a Pig script to PySpark and, being new to PySpark, I am stuck at the data loading step.
My Pig script looks like this:
Bag1 = LOAD '/refined/em/em_results/202112/' USING PigStorage('\u1') AS
(PAYER_SHORT: chararray
,SUPER_PAYER_SHORT: chararray
,PAID: double
,AMOUNT: double
);
I want to do something similar in PySpark.
Currently, I have tried this:
df = spark.read.format("csv").load("/refined/em/em_results/202112/*")
I can read the text file with this, but the values come in as a single column instead of being split into separate columns. Please find some sample values below:
+-------------------------------+
|_c0                            |
+-------------------------------+
|AZZCMMETAL2021/1211FGPP7491764 |
|AZZCMMETAL2021/1221HEMP7760484 |
+-------------------------------+
Output should look like this:
_c0   _c1   _c2     _c3 _c4 _c5 _c6 _c7
AZZCM METAL 2021/12 11  FGP P   7   491764
AZZCM METAL 2021/12 11  HEM P   7   760484
Any clue how to achieve this? Thanks!!

Generally, Spark takes the comma (,) as the default separator; in your case you have to provide your own separator, for example a space:
df = spark.read.csv(file_path, sep=' ')

This resolved the issue: instead of "\\u1", I used "\u0001". Please find the answer below.
df = spark.read.option("sep","\u0001").csv("/refined/em/em_results/202112/*")
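If you also want named, typed columns like the Pig schema instead of _c0, _c1, ..., the field list from the LOAD statement can be supplied as a schema when reading. A minimal sketch, assuming the file really contains the four fields declared in the Pig script (the sample output above shows more columns, so adjust the names accordingly):

from pyspark.sql import types as T

# Mirror of the Pig schema from the question (adjust names/types to the actual file)
schema = T.StructType([
    T.StructField("PAYER_SHORT", T.StringType()),
    T.StructField("SUPER_PAYER_SHORT", T.StringType()),
    T.StructField("PAID", T.DoubleType()),
    T.StructField("AMOUNT", T.DoubleType()),
])

df = (
    spark.read
    .option("sep", "\u0001")   # the \u0001 delimiter from the accepted answer
    .schema(schema)
    .csv("/refined/em/em_results/202112/*")
)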

Related

Cast big numbers in a human-readable format

I'm working with Databricks in a notebook.
I have a column with numbers like this: 103503119090884718216391506040
They are in string format. I can print them and read them easily.
For debugging purposes I need to be able to read them. However, I also need to be able to apply the .sort() method to them. Casting them as IntegerType() returns null values, and casting them as double makes them unreadable.
How can I convert them to a human-readable format while .sort() still works? Do I need to create two separate columns?
To make the column sortable, you could convert your column to DecimalType(precision, scale) (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.types.DecimalType.html#pyspark.sql.types.DecimalType). For this data type you can choose the possible value range with its two arguments:
from pyspark.sql import SparkSession, Row, types as T, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    Row(string_column='103503119090884718216391506040'),
    Row(string_column='103503119090884718216391506039'),
    Row(string_column='103503119090884718216391506041'),
    Row(string_column='90'),
])
(
    df
    .withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30, 0)))
    .sort('decimal_column')
    .show(truncate=False)
)
# Output
+------------------------------+------------------------------+
|string_column |decimal_column |
+------------------------------+------------------------------+
|90 |90 |
|103503119090884718216391506039|103503119090884718216391506039|
|103503119090884718216391506040|103503119090884718216391506040|
|103503119090884718216391506041|103503119090884718216391506041|
+------------------------------+------------------------------+
Concerning "human readability" I'm not sure whether that helps, though.
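If readability mostly means digit grouping, one option (just a sketch built on the example above, not something the question strictly requires) is to keep the decimal column for sorting and add a display-only column with thousands separators via format_number:

from pyspark.sql import functions as F, types as T

df_readable = (
    df
    .withColumn('decimal_column', F.col('string_column').cast(T.DecimalType(30, 0)))
    .withColumn('pretty_column', F.format_number('decimal_column', 0))  # string with comma grouping
    .sort('decimal_column')
)
df_readable.show(truncate=False)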

How to save a file with a multi-character delimiter in Spark

I need to save a file delimited by the "|~" characters, but I get an error when I execute the command below. Can I save a file using a multi-character delimiter in Spark?
mydf1.coalesce(1).write.option("compression","none").format("csv").mode("Overwrite").option("delimiter","|~").save("my_hdfs_path")
// Error : pyspark.sql.utils.IllegalArgumentException: u'Delimiter cannot be more than one character: |~'
AFAIK, we are still waiting for an "official" solution, because the issue "support for multiple delimiter in Spark CSV read" is still open, and Spark SQL still relies on univocity-parsers. In univocity's CSV settings, the delimiter can only be a single character, which constrains both the parser (reader) and the generator (writer).
Workarounds
Finding a universally fast and safe way to write as CSV is hard. But depending on your data size and the complexity of the CSV contents (date formats? currency? quoting?), we may find a shortcut. The following are just some, hopefully inspiring, thoughts...
Write to CSV with a special single character (say ⊢), then substitute it with |~.
(This hasn't been benchmarked, but IMO it is likely to be the fastest.)
df.coalesce(1).write.option("compression","none").option("delimiter", "⊢").mode("overwrite").csv("raw-output")
then post-process (ideally locally) with, say, sed:
sed -i '.bak' 's/⊢/\|~/g' raw-output/*.csv
Within PySpark, concatenate each row into a string, then write it as a text file.
(This can be flexible for locality and special needs, with a bit more work.)
d = [{'name': 'Alice', 'age': 1},{'name':'Bob', 'age':3}]
df = spark.createDataFrame(d, "name:string, age:int")
df.show()
#+-----+---+
#| name|age|
#+-----+---+
#|Alice| 1|
#| Bob| 3|
#+-----+---+
from pyspark.sql.functions import udf

@udf("string")
def mkstr(name, age):
    """
    For example, the string field {name} should be quoted with `"`.
    """
    return '"{name}"|~{age}'.format(name=name, age=age)
# unparse a CSV row back to a string
df_unparsed = df.select(mkstr("name", "age").alias("csv_row"))
df_unparsed.show()
#+----------+
#| csv_row|
#+----------+
#|"Alice"|~1|
#| "Bob"|~3|
#+----------+
df_unparsed.coalesce(1).write.option("compression", "none").mode("overwrite").text("output")
numpy.savetxt allows multiple characters as a delimiter, so ...
(numpy has lots of built-ins if you care about the precision of floating-point numbers)
import pandas as pd
import numpy as np
# convert `Spark.DataFrame` to `Pandas.DataFrame`
df_pd = df.toPandas()
# use `numpy.savetxt` to save `Pandas.DataFrame`
np.savetxt("a-long-day.csv", df_pd, delimiter="|~", fmt="%s")
From Spark 3.0 we no longer have this issue, but if you are using an earlier version (newer than Spark 2.3), the following can also be used as a solution: basically, concatenate all columns, fill nulls with blanks, and write the data with the desired delimiter along with the header. This is more generic than hardcoding and also keeps the header.
from pyspark.sql.functions import col, concat_ws

target_row_delimited = "|,"
df = df.select([col(c).cast("string") for c in df.columns])
df = df.na.fill("")
headername = target_row_delimited.join(df.columns)
df = df.withColumn(headername, concat_ws(target_row_delimited, *df.columns))
df.select(df[headername]).write.format("csv").mode(modeval).option("quoteAll", "false").option("quote", "\u0000").option("header", "true").save(tgt_path + "/")
In case we need to read with a multi-character delimiter, the following solution can be used:
source_delimiter = "|_|"
headcal = spark.read.text(source_filename)
headers = headcal.take(1)[0]['value']
header_column = headers.split(source_delimiter)
df = (
    sc.textFile(source_filename)
    .filter(lambda line: line != headers)  # drop the header line from the data rows
    .map(lambda x: x.split(source_delimiter))
    .toDF(header_column)
)
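For completeness: if you are on Spark 3.0 or later (where, as noted above, the single-character limit on the CSV delimiter no longer applies to reads), a multi-character separator can be passed directly to the CSV reader. A sketch, assuming the same source_filename with a "|_|" delimiter and a header row:

df = (
    spark.read
    .option("sep", "|_|")      # multi-character separator, Spark 3.0+ only
    .option("header", "true")
    .csv(source_filename)
)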

How to extract a single (column/row) value from a dataframe using PySpark?

Here's my spark code. It works fine and returns 2517. All I want to do is to print "2517 degrees"...but I'm not sure how to extract that 2517 into a variable. I can only display the dataframe but not extract values from it. Sounds super easy but unfortunately I'm stuck! Any help will be appreciated. Thanks!
df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").option("delimiter", "\t").load("dbfs:/databricks-datasets/power-plant/data")
df.createOrReplaceTempView("MyTable")
df = spark.sql("SELECT COUNT (DISTINCT AP) FROM MyTable")
display(df)
Here is an alternative:
df.first()['column name']
It will give you the desired output, and you can store it in a variable.
I think you're looking for collect. Something like this should get you the value:
df.collect()[0]['count(DISTINCT AP)']
assuming the column name is 'count(DISTINCT AP)'
If you want to extract value in specific row and column:
df.select('column name').collect()[row number][0]
for example df.select('eye color').collect()[20][0]
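Putting these together for the original query, a minimal sketch that prints "2517 degrees" (the ap_count alias is just a hypothetical name to avoid typing 'count(DISTINCT AP)'):

df = spark.sql("SELECT COUNT(DISTINCT AP) AS ap_count FROM MyTable")
degrees = df.first()['ap_count']   # or df.collect()[0][0]
print("{} degrees".format(degrees))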

Spark Scala read CSV which has a comma in the data

My CSV file, which is inside a zip file, has the data below:
"Potter, Jr",Harry,92.32,09/09/2018
John,Williams,78,01/02/1992
I read it using the Spark Scala CSV reader. If I use
.option('quote', '"')
.option('escape', '"')
I do not get a fixed number of columns in the output. For line 1 the output would be 5 columns, and for line 2 it would be 4. The desired output should have 4 columns only. Is there any way to read it as a DataFrame or an RDD?
Thanks,
Ash
For the given input data, I was able to read the data using:
val input = spark.read.csv("input_file.csv")
This gave me a Dataframe with 4 string columns.
Check this.
val df = spark.read.csv("in/potter.txt").toDF("fname","lname","value","dt")
df.show()
+----------+--------+-----+----------+
| fname| lname|value| dt|
+----------+--------+-----+----------+
|Potter, Jr| Harry|92.32|09/09/2018|
| John|Williams| 78|01/02/1992|
+----------+--------+-----+----------+

How can I pretty print a data frame in Hue/Notebook/Scala/Spark?

I am using Spark 2.1 and Scala 2.11 in a HUE 3.12 notebook. I have a dataframe that I can print like this:
df.select("account_id", "auto_pilot").show(2, false)
And the output looks like this:
+--------------------+----------+
|account_id |auto_pilot|
+--------------------+----------+
|00000000000000000000|null |
|00000000000000000002|null |
+--------------------+----------+
only showing top 2 rows
Is there a way of getting the data frame to show as pretty tables (like when I query from Impala or pyspark)?
Impala example of the same query:
You can use the magic function %table; however, this function only works for Datasets, not DataFrames. One option is to convert the DataFrame to a Dataset before printing.
import spark.implicits._
case class Account(account_id: String, auto_pilot: String)
val accountDF = df.select("account_id", "auto_pilot")
val accountDS: Dataset[Account] = accountDF.as[Account]
%table accountDS
Right now this is the only solution I can think of. Better solutions are always welcome; I will update this as soon as I find a more elegant one.
From http://gethue.com/bay-area-bike-share-data-analysis-with-spark-notebook-part-2/
This is what I did
df = sqlContext.sql("select * from my_table")
result = df.limit(5).collect()
%table result