What is the most efficient way to sort one column of a DataFrame, convert it to a list, and assign the first element to a variable in Scala? I tried the following:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, first, regexp_replace}
import org.apache.spark.sql.functions._
println(CONFIG.getString("spark.appName"))
val conf = new SparkConf()
.setAppName(CONFIG.getString("spark.appName"))
.setMaster(CONFIG.getString("spark.master"))
val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
val df = spark.read.format("com.databricks.spark.csv").option("delimiter", ",").load("file.csv")
val dfb=df.sort(desc("_c0"))
val list=df.select(df("_c0")).distinct
but I'm still not able to save the first element as a variable.
Use select, orderBy, map & head.
Assuming column _c0 is of type string; if it is a different type, change the type parameter in _.getAs[<your column datatype>] accordingly.
Check the code below.
scala> import spark.implicits._
import spark.implicits._
scala> val first = df
.select($"_c0")
.orderBy($"_c0".desc)
.map(_.getAs[String](0))
.head
Or
scala> import spark.implicits._
import spark.implicits._
scala> val first = df
.select($"_c0")
.orderBy($"_c0".desc)
.head
.getAs[String](0)
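Note that .head throws a NoSuchElementException when the DataFrame is empty. A minimal sketch (not from the original answer, assuming the same df and string column _c0 as above) that returns an Option[String] instead:
import spark.implicits._
val firstOpt: Option[String] = df
  .select($"_c0")
  .orderBy($"_c0".desc)
  .take(1)                     // Array[Row] with at most one element
  .headOption                  // Option[Row]
  .map(_.getAs[String](0))     // Option[String]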
I am trying to create a DataFrame from a sample record. One of the fields is of DateType, and I am getting an error for the value provided in the DateType field. Please find the code below.
The error is:
TypeError: field date: DateType can not accept object '2019-12-01' in type <class 'str'>
I tried to convert the StringType to DateType using to_date plus some other ways, but I was not able to do so. Please advise.
from pyspark.sql.functions import to_date,col,lit,expr
from pyspark.sql.types import StructType,StructField,IntegerType,DateType,StringType
from pyspark.sql import Row
MySchema = StructType([ StructField("CustomerID",IntegerType(),True),
StructField("Quantity",IntegerType(),True),
StructField("date",DateType(),True)
])
myRow=Row(10,100,"2019-12-01")
mydf=spark.createDataFrame([myRow],MySchema)
display(mydf)
You can use the datetime class to convert the string to a date:
from datetime import datetime
myRow=Row(10,100,datetime.strptime('2019-12-01','%Y-%m-%d'))
mydf=spark.createDataFrame([myRow],MySchema)
mydf.show()
It should work.
What works for me (I'm on Python 3.8.12 and Spark version 3.0.1):
from datetime import datetime
from pyspark.sql.types import DateType, StructType, StructField, IntegerType, Row
from pyspark.sql import SparkSession
MySchema = StructType([ StructField("CustomerID",IntegerType(),True),
StructField("Quantity",IntegerType(),True),
StructField("date",DateType(),True)
])
spark = SparkSession.builder.appName("local").master("local").getOrCreate()
myRow=Row(10,100,datetime(2019, 12, 1))
mydf=spark.createDataFrame([myRow],MySchema)
mydf.show(truncate=False) #I'm not on DataBricks, so I use mydf.show(truncate=False) instead of display
I have a DataFrame tableDS.
In Scala I am able to remove duplicates over primary keys using the following:
import org.apache.spark.sql.expressions.Window.partitionBy
import org.apache.spark.sql.functions.row_number
val window = partitionBy(primaryKeySeq.map(k => tableDS(k)): _*).orderBy(tableDS(mergeCol).desc)
tableDS.withColumn("rn", row_number.over(window)).where($"rn" === 1).drop("rn")
I need to write a similar thing in Python. primaryKeySeq is a list in Python. I tried the first statement like this:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
window = Window.partitionBy(primaryKeySeq).orderBy(tableDS[bdtVersionColumnName].desc())
tableDS1=tableDS.withColumn("rn",rank().over(window))
This does not give me the correct result.
It got solved.
Here is the final conversion:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col
window = Window.partitionBy(primaryKeySeq).orderBy(tableDS[bdtVersionColumnName].desc())
tableDS1 = tableDS.withColumn("rn", row_number().over(window)).where(col("rn") == 1).drop("rn")
Note that row_number (rather than rank) keeps exactly one row per key, since rank assigns the same number to ties.
I have imported the modules below. When I try to load data with sqlCtx.read.format, I get an "IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"" error, but it works well when I use spark.read.format. I see the same behavior when retrieving data from a registered temp table/view. What extra do I need so that I can use sqlCtx.sql instead of spark.sql?
import os
import sys
import pandas as pd
import odbc as pyodbc
import os
import sys
import re
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql.functions import *
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pyspark.sql.functions as func
import matplotlib.patches as mpatches
import time as time
from matplotlib.patches import Rectangle
import datetime
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setMaster("local")
conf.setAppName("AppName")
sqlCtx = SQLContext(sc)
I spent two hours of my life on this one, just to realize I did not need:
sqlCtx = SQLContext(sc)
Just using SQLContext.read.(...) solved it in my case.
I am trying to convert an RDD to a DataFrame in Scala as follows:
val posts = spark.textFile("~/allPosts/part-02064.xml.gz")
import org.apache.spark.SparkContext._
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
val sqlContext = new org.apache.spark.sql.SQLContext(spark)
import sqlContext.implicits._
posts.map(identity).toDF()
When I do this I get the following error.
java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext$implicits$.stringRddToDataFrameHolder(Lorg/apache/spark/rdd/RDD;)Lorg/apache/spark/sql/DataFrameHolder;
I can't for the life of me figure out what I'm doing wrong.
You need to define a schema to convert an RDD to a DataFrame, either via the reflection method or programmatically.
One very important point about DataFrames: a DataFrame is an RDD with a schema. In your case, define a case class and map the values of the file to that class. Hope it helps.
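For example, here is a minimal sketch of the reflection method, reusing the SparkContext (named spark in the question) and SQLContext from above; the Post case class and its single field are made up for illustration:
// Hypothetical case class; its fields become the DataFrame columns
case class Post(body: String)
val sqlContext = new org.apache.spark.sql.SQLContext(spark)
import sqlContext.implicits._
val posts = spark.textFile("~/allPosts/part-02064.xml.gz")   // RDD[String]
val postsDF = posts
  .map(line => Post(line))   // RDD[Post], each element now carries named fields
  .toDF()                    // schema (body: string) is inferred from the case class
postsDF.printSchema()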