'module' object has no attribute 'analyse' when using jieba - pyspark

My PySpark job fails with the error: 'module' object has no attribute 'analyse'. But I have already imported jieba.analyse in the script, and a similar script runs successfully on the VM locally. I am not sure why the job fails.
Part of my code is as follows:
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

import jieba
from jieba import analyse
import pyspark
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)
text_file = sc.textFile("gs://xxx")

def process_uinfo(line):
    line = line.strip()
    line_arr = line.split('\t')
    (title, content) = line_arr
    l_title = jieba.analyse.extract_tags(title, topK=20, withWeight=True)
    return "\t".join([l_title, content])

out_rdd = text_file.map(process_uinfo)
The error "'module' object has no attribute 'analyse'" occurs on the following line:
l_title = jieba.analyse.extract_tags(title, topK=20, withWeight=True)
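One common explanation, and a minimal workaround sketch (a guess based on the symptom, not a confirmed fix): the driver has run from jieba import analyse, but when the mapped function is unpickled on an executor only the top-level jieba package gets imported there, so jieba.analyse is not yet an attribute in the worker process. Importing the submodule inside the function (and making sure jieba is installed on every worker node) avoids that. The tag formatting in the join below is illustrative, since extract_tags with withWeight=True returns (word, weight) tuples:

def process_uinfo(line):
    # Runs on the executor: make sure the analyse submodule is loaded there too.
    import jieba.analyse  # assumes jieba is installed on all worker nodes
    line = line.strip()
    (title, content) = line.split('\t')
    l_title = jieba.analyse.extract_tags(title, topK=20, withWeight=True)
    # extract_tags returns (word, weight) tuples, so format them before joining
    tags = ",".join("%s:%s" % (word, weight) for (word, weight) in l_title)
    return "\t".join([tags, content])

out_rdd = text_file.map(process_uinfo)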

Related

Spark badRecordsPath is not writing records to the Path as expected

I have the following sample CSV data:
id,name,salary
1,"Raju",1000
2,"Gautam",15000
3,"Kishan",30000
4,"Mike",two hundread
The salary field in the last record is corrupted.
I am trying to handle the corrupt record with badRecordsPath as shown in the code below, but it is not working. I am using Spark 3.0.3, Scala 2.12 and Windows 10.
import org.apache.log4j.Logger
import org.apache.log4j.Level
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.ArrayType
object BadDataPathExample extends App {

  Logger.getLogger("org").setLevel(Level.ERROR)

  val sparkConf = new SparkConf()
  sparkConf.set("spark.app.name", "BadDataPathExample")
  sparkConf.set("spark.master", "local[2]")

  val spark = SparkSession.builder()
    .config(sparkConf)
    .getOrCreate()

  val schema_string = "id int, name String, salary int"

  Logger.getLogger(getClass.getName).info(">> Starting to read Data")

  // read CSV
  val badDF = spark.read
    .format("csv")
    .option("header", true)
    .schema(schema_string)
    .option("badRecordsPath", "D:/spark_practice/bad_dir")
    .option("path", "D:/spark_practice/data/bad_emp.csv")
    .load

  badDF.show()
  badDF.printSchema()
}
The output from the above code shows that the corrupt record is still present, with the corrupted salary value set to null, which is the default behavior of "PERMISSIVE" mode. Also, no records are written to the badRecordsPath I specified.
The same code works as expected in Databricks.
What am I doing wrong? Or is badRecordsPath a Databricks-specific feature?
badRecordsPath is a Databricks-specific feature.
We can see the logic in the source code of FailureSafeParser:
class FailureSafeParser[IN](
  // ... (constructor parameters and other members elided)
  def parse(input: IN): Iterator[InternalRow] = {
    try {
      rawParser.apply(input).toIterator.map(row => toResultRow(Some(row), () => null))
    } catch {
      case e: BadRecordException => mode match {
        case PermissiveMode =>
          Iterator(toResultRow(e.partialResult(), e.record))
        case DropMalformedMode =>
          Iterator.empty
        case FailFastMode =>
          throw QueryExecutionErrors.malformedRecordsDetectedInRecordParsingError(e)
      }
    }
  }
}
Hmm, I have an idea for refactoring this code: when the badRecordsPath option is set, force the mode to DropMalformedMode and ignore whatever mode the user set. DropMalformedMode would then catch rows that throw a parse exception, write them to badRecordsPath, and return an empty iterator for them.
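As a practical aside (my own sketch, not the refactor described above): on open-source Spark you can approximate badRecordsPath with PERMISSIVE mode plus the columnNameOfCorruptRecord option, which keeps the raw text of unparseable rows in an extra column that you can write out yourself. A rough PySpark version, reusing the paths from the question; the _corrupt_record column name and the good/bad split are illustrative choices:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[2]").appName("BadDataPathExample").getOrCreate()

# Extra string column in the schema to hold the raw text of rows that fail to
# parse; its name must match columnNameOfCorruptRecord.
schema_string = "id int, name string, salary int, _corrupt_record string"

df = (spark.read
      .format("csv")
      .option("header", True)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema_string)
      .load("D:/spark_practice/data/bad_emp.csv")
      .cache())  # cache first: Spark disallows queries that reference only the corrupt-record column on raw CSV

# Good rows continue through the pipeline; bad rows are written out,
# which is roughly what badRecordsPath does on Databricks.
good_df = df.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")
bad_df = df.filter(F.col("_corrupt_record").isNotNull()).select("_corrupt_record")
bad_df.write.mode("append").text("D:/spark_practice/bad_dir")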

How to convert Dataframe to dynamic frame

I am new to AWS Glue and I am trying to run some transformation processes using PySpark. I successfully ran my ETL, but I am looking for another way of converting a DataFrame to a DynamicFrame.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
# load data from crawler
students = glueContext.create_dynamic_frame.from_catalog(database="example_db", table_name="samp_csv")
# move data into a new variable for transformation
students_trans = students
# convert dynamicframe(students_trans) to dataframe
students_= students_trans.toDF()
# run transformation change column names/ drop columns
students_1= students_.withColumnRenamed("state","County").withColumnRenamed("capital","cap").drop("municipal",'metropolitan')
#students_1.printSchema()
#convert df back to dynamicframe
from awsglue.dynamicframe import DynamicFrame
students_trans = students_trans.fromDF(students_1, glueContext, "students_trans")
#load into s3 bucket
glueContext.write_dynamic_frame.from_options(frame = students_trans,
connection_type = "s3",
connection_options = {"path": "s3://kingb/target/"},
format = "csv")
from awsglue.dynamicframe import DynamicFrame
# fromDF is a classmethod on DynamicFrame: pass the DataFrame, the GlueContext, and a name for the new frame
students_trans = DynamicFrame.fromDF(students_1, glueContext, "students_trans")

Getting workflow runtime properties for AWS Glue workflow in Scala

I am working on an AWS Glue job and I am using Scala to write the code. I need to get the workflow runtime properties. I can do this very easily in Python, but I could not find any sample code or documentation for doing it in Scala.
The equivalent code in Python is as follows.
I would be very grateful if someone could help me with the Scala equivalent.
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
glue_client = boto3.client("glue")
args = getResolvedOptions(sys.argv, ['JOB_NAME','WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])
workflow_name = args['WORKFLOW_NAME']
workflow_run_id = args['WORKFLOW_RUN_ID']
workflow_params = glue_client.get_workflow_run_properties(
    Name=workflow_name, RunId=workflow_run_id)["RunProperties"]
target_database = workflow_params['target_database']
target_s3_location = workflow_params['target_s3_location']
This worked for me.
import com.amazonaws.regions.Regions
import com.amazonaws.services.glue.{AWSGlue, AWSGlueClient}
import com.amazonaws.services.glue.model.GetWorkflowRunPropertiesRequest
import com.amazonaws.services.glue.model.GetWorkflowRunPropertiesResult
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext

object ReadProps {
  def main(sysArgs: Array[String]) {
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME", "WORKFLOW_NAME", "WORKFLOW_RUN_ID").toArray)
    val workflowName = args("WORKFLOW_NAME")
    val workflowId = args("WORKFLOW_RUN_ID")

    val sc: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(sc)
    val sparkSession: SparkSession = glueContext.getSparkSession

    val region = Regions.fromName("your-region-name")
    val glue = AWSGlueClient.builder().withRegion(region).build()

    val req = new GetWorkflowRunPropertiesRequest()
    req.setName(workflowName)
    req.setRunId(workflowId)

    val result = glue.getWorkflowRunProperties(req)
    val resultMap = result.getRunProperties()
    println(resultMap.get("propertykey"))
  }
}

SparkSession and SparkContext initiation in PySpark

I would like to know the PySpark equivalent of the following Scala code. I am using Databricks. I need the same output as below:
To create a new Spark session and output the session id (SparkSession#123d0e8):
val new_spark = spark.newSession()
Output:
new_spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession#123d0e8
To view the SparkContext and output the SparkContext id (SparkContext#2dsdas33):
new_spark.sparkContext
Output:
org.apache.spark.SparkContext = org.apache.spark.SparkContext#2dsdas33
It's very similar. If you already have a session and want to open another one, you can use
my_session = spark.newSession()
print(my_session)
This will produce the new session object that I think you are trying to create:
<pyspark.sql.session.SparkSession object at 0x7fc3bae3f550>
spark is a session object that is already running, because you are using a Databricks notebook.
A SparkSession can also be created as shown in the documentation (http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html):
>>> from pyspark.sql import SparkSession
>>> from pyspark.conf import SparkConf
>>> SparkSession.builder.config(conf=SparkConf())
or
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('FirstSparkApp').getOrCreate()
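A small illustration of my own (not from the answers above), assuming spark is the already-running session in a Databricks notebook: newSession() shares the same SparkContext but keeps its own temporary views and SQL configuration.

new_spark = spark.newSession()

print(new_spark)                # <pyspark.sql.session.SparkSession object at 0x...>
print(new_spark.sparkContext)   # the SparkContext, shared with the original session
print(new_spark.sparkContext is spark.sparkContext)  # True: one context per application

# The sessions are isolated at the SQL level: a temp view registered in one
# session is not visible from the other, even though the context is shared.
spark.range(3).createOrReplaceTempView("numbers")
print([t.name for t in spark.catalog.listTables()])      # ['numbers']
print([t.name for t in new_spark.catalog.listTables()])  # []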

Pyspark Error when UDF is defined outside function that calls it: Method __getnewargs__([]) does not exist

I've seen several questions about this, but I don't seem to understand why I get this error when my UDF is defined outside of the function I'm calling on my DataFrame.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from data.utils import PropertiesGetter
glueContext = GlueContext(SparkContext.getOrCreate())
input_source = glueContext.create_dynamic_frame.from_catalog(database = "db_name", table_name = "input")
input_df = input_source.toDF()
test_df = PropertiesGetter(glueContext).add_subscription_properties(input_df)
Calling PropertiesGetter's add_subscription_properties on my input_df does NOT throw an error when my class looks like this (note the nested UDF):
class PropertiesGetter(object):
    def __init__(self, gc):
        ...

    def add_subscription_properties(self, input_df):
        def _add_subscription_properties(self, subscription_name):
            subscription_mapping = {...}
            return subscription_mapping[subscription_name]

        udf_add_subscription_properties = udf(_add_subscription_properties, StringType())
        return input_df.withColumn("subscription_properties",
                                   udf_add_subscription_properties("subscription_type"))
    ...
But it DOES throw an error (specifically Could not serialize object: Py4JError: An error occurred while calling o116.__getnewargs__. Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist.) when it looks like this:
class PropertiesGetter(object):
    def __init__(self, gc):
        ...

    def _add_subscription_properties(self, subscription_name):
        subscription_mapping = {...}
        return subscription_mapping[subscription_name]

    def add_subscription_properties(self, input_df):
        udf_add_subscription_properties = udf(self._add_subscription_properties, StringType())
        return input_df.withColumn("subscription_properties",
                                   udf_add_subscription_properties("subscription_type"))
    ...
Can someone please explain to me why this is? I'm struggling to understand why it makes a difference. I have a few UDFs in this class, so I want to know a way around nesting them.
P.S. I know I do not need a UDF to create a column that applies a mapping; I just wanted to demonstrate with a simple example.
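My reading of the difference, plus a hedged sketch of a way around nesting (the mapping values below are made up): udf(self._add_subscription_properties, ...) wraps a bound method, and pickling a bound method means pickling self, i.e. the whole PropertiesGetter instance, including the GlueContext/SparkContext it holds. Those Py4J gateway objects cannot be serialized, which surfaces as Method __getnewargs__([]) does not exist. Keeping the UDF body in a staticmethod (or a module-level function) means nothing captures self, so only the plain function and its data are shipped to the executors.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType


class PropertiesGetter(object):
    def __init__(self, gc):
        self._glue_context = gc  # kept on the instance, but never captured by the UDF

    @staticmethod
    def _add_subscription_properties(subscription_name):
        # Plain function: no reference to self, so only this function and the
        # dict get pickled and sent to the executors.
        subscription_mapping = {"basic": "tier-1", "premium": "tier-2"}  # illustrative values
        return subscription_mapping.get(subscription_name)

    def add_subscription_properties(self, input_df):
        udf_add_subscription_properties = udf(
            PropertiesGetter._add_subscription_properties, StringType())
        return input_df.withColumn(
            "subscription_properties",
            udf_add_subscription_properties("subscription_type"))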