I am trying to save the output of Spark SQL to a path but I'm not sure what function to use. I want to do this without using Spark DataFrames. I tried write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out") but was not successful. Can someone tell me how to do it?
Note: Spark SQL will give the output in multiple files. I need to ensure that the data is sorted globally across all the files (parts), so all words in part 0 come alphabetically before the words in part 1.
case class Docword(docId: Int, vocabId: Int, count: Int)
case class VocabWord(vocabId: Int, word: String)
// Read the input data
val docwords = spark.read.
schema(Encoders.product[Docword].schema).
option("delimiter", " ").
csv("hdfs:///user/bdc_data/t3/docword.txt").
as[Docword]
val vocab = spark.read.
schema(Encoders.product[VocabWord].schema).
option("delimiter", " ").
csv("hdfs:///user/bdc_data/t3/vocab.txt").
as[VocabWord]
docwords.createOrReplaceTempView("docwords")
vocab.createOrReplaceTempView("vocab")
spark.sql("""SELECT vocab.word AS word1, SUM(count) count1 FROM
docwords INNER JOIN vocab
ON docwords.vocabId = vocab.vocabId
GROUP BY word
ORDER BY count1 DESC""").show(10)
write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out")
// Required to exit the spark-shell
sys.exit(0)
.show() returns Unit, not a DataFrame, so you should do something like the below:
val writeDf = spark.sql("""SELECT vocab.word AS word1, SUM(count) count1 FROM
docwords INNER JOIN vocab
ON docwords.vocabId = vocab.vocabId
GROUP BY word
ORDER BY count1 DESC""")
writeDf.write.mode("overwrite").csv("file:///home/user204943816622/Task_3a-out")
writeDf.show() // this should not be used in prod environment
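The note in the question also requires the output to be globally sorted across the part files. A minimal sketch, assuming the required order is alphabetical on word1: an orderBy immediately before the write makes Spark range-partition the rows, so in practice part-00000 holds words that sort before everything in part-00001, and so on.
// Re-sort alphabetically just before writing; the global sort range-partitions the data,
// so the part files come out in global alphabetical order.
writeDf
  .orderBy("word1")
  .write
  .mode("overwrite")
  .csv("file:///home/user204943816622/Task_3a-out")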
The code below ran perfectly well on the standalone version of PySpark 2.4 on macOS (Python 3.7) when the input data was small (around 6 GB). However, when I ran the code on an HDInsight cluster (HDI 4.0, i.e. Python 3.5, PySpark 2.4, 4 worker nodes each with 64 cores and 432 GB of RAM, 2 head nodes each with 4 cores and 28 GB of RAM, Data Lake Storage Gen2) with larger input data (169 GB), the last step, writing data to the data lake, took forever to complete (I killed it after 24 hours of execution). Given that HDInsight is not popular in the cloud computing community, I could only find posts complaining about low speed when writing a dataframe to S3. Some suggested repartitioning the dataset, which I did, but it did not help.
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, IntegerType, BooleanType
from pyspark.sql.functions import udf, regexp_extract, collect_set, array_remove, col, size, asc, desc
from pyspark.ml.fpm import FPGrowth
import os
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.5"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.5"
def work(order_path, beer_path, corpus_path, output_path, FREQ_THRESHOLD=1000, LIFT_THRESHOLD=1):
print("Creating Spark Environment...")
spark = SparkSession.builder.appName("Menu").getOrCreate()
print("Spark Environment Created!")
print("Working on Checkpoint1...")
orders = spark.read.csv(order_path)
orders.createOrReplaceTempView("orders")
orders = spark.sql(
"SELECT _c14 AS order_id, _c31 AS in_menu_id, _c32 AS in_menu_name FROM orders"
)
orders.createOrReplaceTempView("orders")
beer = spark.read.csv(
beer_path,
header=True
)
beer.createOrReplaceTempView("beer")
beer = spark.sql(
"""
SELECT
order_id AS beer_order_id,
in_menu_id AS beer_in_menu_id,
'-999' AS beer_in_menu_name
FROM beer
"""
)
beer.createOrReplaceTempView("beer")
orders = spark.sql(
"""
WITH orders_beer AS (
SELECT *
FROM orders
LEFT JOIN beer
ON orders.in_menu_id = beer.beer_in_menu_id
)
SELECT
order_id,
in_menu_id,
CASE
WHEN beer_in_menu_name IS NOT NULL THEN beer_in_menu_name
WHEN beer_in_menu_name IS NULL THEN in_menu_name
END AS menu_name
FROM orders_beer
"""
)
print("Checkpoint1 Completed!")
print("Working on Checkpoint2...")
corpus = spark.read.csv(
corpus_path,
header=True
)
keywords = corpus.select("Food_Name").rdd.flatMap(lambda x: x).collect()
orders = orders.withColumn(
"keyword",
regexp_extract(
"menu_name",
"(?=^|\s)(" + "|".join(keywords) + ")(?=\s|$)",
0
)
)
orders.createOrReplaceTempView("orders")
orders = spark.sql("""
SELECT order_id, in_menu_id, keyword
FROM orders
WHERE keyword != ''
""")
orders.createOrReplaceTempView("orders")
orders = orders.groupBy("order_id").agg(
collect_set("keyword").alias("items")
)
print("Checkpoint2 Completed!")
print("Working on Checkpoint3...")
fpGrowth = FPGrowth(
itemsCol="items",
minSupport=0,
minConfidence=0
)
model = fpGrowth.fit(orders)
print("Checkpoint3 Completed!")
print("Working on Checkpoint4...")
frequency = model.freqItemsets
frequency = frequency.filter(col("freq") > FREQ_THRESHOLD)
frequency = frequency.withColumn(
"items",
array_remove("items", "-999")
)
frequency = frequency.filter(size(col("items")) > 0)
frequency = frequency.orderBy(asc("items"), desc("freq"))
frequency = frequency.dropDuplicates(["items"])
frequency = frequency.withColumn(
"antecedent",
udf(
lambda x: "|".join(sorted(x)), StringType()
)(frequency.items)
)
frequency.createOrReplaceTempView("frequency")
lift = model.associationRules
lift = lift.drop("confidence")
lift = lift.filter(col("lift") > LIFT_THRESHOLD)
lift = lift.filter(
udf(
lambda x: x == ["-999"], BooleanType()
)(lift.consequent)
)
lift = lift.drop("consequent")
lift = lift.withColumn(
"antecedent",
udf(
lambda x: "|".join(sorted(x)), StringType()
)(lift.antecedent)
)
lift.createOrReplaceTempView("lift")
result = spark.sql(
"""
SELECT lift.antecedent, freq AS frequency, lift
FROM lift
INNER JOIN frequency
ON lift.antecedent = frequency.antecedent
"""
)
print("Checkpoint4 Completed!")
print("Writing Result to Data Lake...")
result.repartition(1024).write.mode("overwrite").parquet(output_path)
print("All Done!")
def main():
work(
order_path=169.1 GB of txt,
beer_path=4.9 GB of csv,
corpus_path=210 KB of csv,
output_path="final_result.parquet"
)
if __name__ == "__main__":
main()
I first thought this was caused by the Parquet file format. However, when I tried CSV, I hit the same problem. I tried result.count() to see how many rows the result table has. It took forever to get the row count, just like writing the data to the data lake.
There was a suggestion to use a broadcast hash join instead of the default sort-merge join when a large dataset is joined with a small one. I thought it was worth trying because the smaller samples in the pilot study showed that the row count of frequency is roughly 0.09% of that of lift (see the query below if you have difficulty tracking frequency and lift).
SELECT lift.antecedent, freq AS frequency, lift
FROM lift
INNER JOIN frequency
ON lift.antecedent = frequency.antecedent
With that in mind, I revised my code:
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, IntegerType, BooleanType
from pyspark.sql.functions import udf, regexp_extract, collect_set, array_remove, col, size, asc, desc
from pyspark.ml.fpm import FPGrowth
import os
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.5"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.5"
def work(order_path, beer_path, corpus_path, output_path, FREQ_THRESHOLD=1000, LIFT_THRESHOLD=1):
print("Creating Spark Environment...")
spark = SparkSession.builder.appName("Menu").getOrCreate()
print("Spark Environment Created!")
print("Working on Checkpoint1...")
orders = spark.read.csv(order_path)
orders.createOrReplaceTempView("orders")
orders = spark.sql(
"SELECT _c14 AS order_id, _c31 AS in_menu_id, _c32 AS in_menu_name FROM orders"
)
orders.createOrReplaceTempView("orders")
beer = spark.read.csv(
beer_path,
header=True
)
beer.createOrReplaceTempView("beer")
beer = spark.sql(
"""
SELECT
order_id AS beer_order_id,
in_menu_id AS beer_in_menu_id,
'-999' AS beer_in_menu_name
FROM beer
"""
)
beer.createOrReplaceTempView("beer")
orders = spark.sql(
"""
WITH orders_beer AS (
SELECT *
FROM orders
LEFT JOIN beer
ON orders.in_menu_id = beer.beer_in_menu_id
)
SELECT
order_id,
in_menu_id,
CASE
WHEN beer_in_menu_name IS NOT NULL THEN beer_in_menu_name
WHEN beer_in_menu_name IS NULL THEN in_menu_name
END AS menu_name
FROM orders_beer
"""
)
print("Checkpoint1 Completed!")
print("Working on Checkpoint2...")
corpus = spark.read.csv(
corpus_path,
header=True
)
keywords = corpus.select("Food_Name").rdd.flatMap(lambda x: x).collect()
orders = orders.withColumn(
"keyword",
regexp_extract(
"menu_name",
"(?=^|\s)(" + "|".join(keywords) + ")(?=\s|$)",
0
)
)
orders.createOrReplaceTempView("orders")
orders = spark.sql("""
SELECT order_id, in_menu_id, keyword
FROM orders
WHERE keyword != ''
""")
orders.createOrReplaceTempView("orders")
orders = orders.groupBy("order_id").agg(
collect_set("keyword").alias("items")
)
print("Checkpoint2 Completed!")
print("Working on Checkpoint3...")
fpGrowth = FPGrowth(
itemsCol="items",
minSupport=0,
minConfidence=0
)
model = fpGrowth.fit(orders)
print("Checkpoint3 Completed!")
print("Working on Checkpoint4...")
frequency = model.freqItemsets
frequency = frequency.filter(col("freq") > FREQ_THRESHOLD)
frequency = frequency.withColumn(
"antecedent",
array_remove("items", "-999")
)
frequency = frequency.drop("items")
frequency = frequency.filter(size(col("antecedent")) > 0)
frequency = frequency.orderBy(asc("antecedent"), desc("freq"))
frequency = frequency.dropDuplicates(["antecedent"])
frequency = frequency.withColumn(
"antecedent",
udf(
lambda x: "|".join(sorted(x)), StringType()
)(frequency.antecedent)
)
lift = model.associationRules
lift = lift.drop("confidence")
lift = lift.filter(col("lift") > LIFT_THRESHOLD)
lift = lift.filter(
udf(
lambda x: x == ["-999"], BooleanType()
)(lift.consequent)
)
lift = lift.drop("consequent")
lift = lift.withColumn(
"antecedent",
udf(
lambda x: "|".join(sorted(x)), StringType()
)(lift.antecedent)
)
result = lift.join(
frequency.hint("broadcast"),
["antecedent"],
"inner"
)
print("Checkpoint4 Completed!")
print("Writing Result to Data Lake...")
result.repartition(1024).write.mode("overwrite").parquet(output_path)
print("All Done!")
def main():
work(
order_path=169.1 GB of txt,
beer_path=4.9 GB of csv,
corpus_path=210 KB of csv,
output_path="final_result.parquet"
)
if __name__ == "__main__":
main()
The code ran perfectly well with the same sample data on my macOS machine and, as expected, took less time (34 seconds vs. 26 seconds). Then I decided to run the code on HDInsight with the full datasets. In the last step, writing data to the data lake, the task failed and I was told Job cancelled because SparkContext was shut down. I am rather new to big data and have no idea what this means. Posts on the internet said there could be many reasons behind it. Whatever method I should use, how can I optimize my code so that I get the desired output in the data lake within a bearable amount of time?
I would try several things, ordered by the amount of energy they require:
Check if the ADL storage is in the same region as your HDInsight cluster.
Add calls to df = df.cache() after heavy calculations, or even write the dataframes out to intermediate storage and read them back in between these calculations.
Replace your UDFs with "native" Spark functions, since UDFs are a well-known Spark performance anti-pattern (see the sketch below).
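For example, the two UDFs in the question, joining a sorted array with "|" and testing whether the consequent equals ["-999"], have built-in equivalents. A minimal sketch in Scala, assuming Spark 2.4; the same array_join, sort_array, array and lit functions exist in pyspark.sql.functions under the same names:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, array_join, col, lit, sort_array}

// "|".join(sorted(x)) as a built-in expression instead of a UDF
def antecedentKey(df: DataFrame): DataFrame =
  df.withColumn("antecedent", array_join(sort_array(col("antecedent")), "|"))

// x == ["-999"] as a direct comparison against a literal array instead of a UDF
def keepBeerConsequent(df: DataFrame): DataFrame =
  df.filter(col("consequent") === array(lit("-999")))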
I figured it out after five days of struggle. Here are the approaches I took to optimize the code. Execution time dropped from more than 24 hours to around 10 minutes. Code optimization really matters.
As David Taub below pointed out, use df.cache() after heavy computation or before feeding the data to the model. I used df.cache().count(), since calling .cache() on its own is lazy but the following .count() forces an evaluation of the entire dataset.
Use flashtext instead of regular expressions to extract keywords. This greatly improves code performance.
Be careful with joins / merges. They can get extremely slow due to data skew. Always think about ways to avoid unnecessary joins.
Set minSupport for FPGrowth. This significantly reduces the time spent in model.freqItemsets. A sketch of the caching and minSupport changes is below.
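For reference, a minimal sketch of the caching and minSupport points, written in Scala (the PySpark calls have the same names); the minSupport value is only an illustrative assumption, and orders stands for the prepared DataFrame from the question:
import org.apache.spark.ml.fpm.FPGrowth

// Materialize the prepared orders once so FPGrowth does not recompute the whole lineage.
// cache() alone is lazy; the count() forces evaluation of the full dataset.
val cachedOrders = orders.cache()
cachedOrders.count()

// A non-zero minSupport prunes rare itemsets early instead of enumerating everything.
val fpGrowth = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.001)  // illustrative value, tune for the data
  .setMinConfidence(0.0)

val model = fpGrowth.fit(cachedOrders)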
I have been working with Spark and Scala for the past 2 months and I'm new to this technology. I framed the select columns (with regexp_replace) as a List[String]() and passed it to the DataFrame's select, and it throws a "Cannot resolve" error. Please find below the steps I have followed and tried.
Defining the val:
Defining the column which I would like to identify in the src data frame
val col_name = "region_id"
Defining the column which will be used to replace the src data frame column from ref data frame
val surr_key_col_name = "surrogate_key"
I have created two Data frames as shown below
src_df
region id | region_name | region_code
10001189 | Spain | SP09 8545
10001765 | Africa | AF97 6754
ref_df
region id | surrogate_key
1189 | 2345
1765 | 8978
val src_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("s3://bucket/src_details.csv")
val ref_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("s3://bucket/ref_details.csv")
I iterate through the headers to identify the column for which I need to do a regex match and replace with another DataFrame's column value, and I store the expressions in a List to pass to the DataFrame's select:
val src_header_rec = src_df.columns.toList
//Loop through source file header to identify the region_id and replace it with surrogate_id by doing a pattern match( I don't want to replace the
for (src_header_cols <- src_header_rec) {
if (col_name == src_header_cols) {
src_column_names :+="regexp_replace("+"$"+s""""src.$src_header_cols""""+","+"$"+s""""ref.$src_header_cols""""+","+"$"+s""""ref.$surr_key_col_name""""+")"+".as("+s""""$src_header_cols""""+")"
}
else {
src_column_names :+= "src."+src_header_cols
}
}
After building the select columns in the List[String]() using the for loop above, I pass them to select for the final_df creation:
val final_df = src_df.alias("src").join(ref_df.alias("ref"), src_df(col_name)=== ref_df(col_name),"left_outer").select(src_column_names.head,src_column_names.tail:_*)
If I directly pass the columns to the DataFrame's select without using the List[String](), my regexp_replace substitution works:
val final_df = src_df.alias("src").join(ref_df.alias("ref"), src_df(col_name)=== ref_df(col_name),"left_outer").select(regexp_replace($"src.region_id",$"ref.region_id",$"ref.surrogate_key").as("region_id"))
I'm not sure why it is not working when I pass it as a List[String]().
When I remove the regexp_replace substitution in the for loop and pass the result as a List[String]() to the DataFrame select, it works properly as shown below.
This code works very well with the DataFrame select:
for (src_header_cols <- src_header_rec) {
if (col_name == src_header_cols) {
src_column_names :+= "ref."+surr_key_col_name
}
else {
src_column_names :+= "src."+src_header_cols
}
}
val final_df = src_df.alias("src").join(ref_df.alias("ref"), src_df(col_name)===ref_df(col_name),"left_outer").select(src_column_names.head,src_column_names.tail:_*)
The result/output DataFrame I'm trying to derive is:
final_df
region id | region_name | region_code
1000**2345** | Spain | SP09 8545
1000**8978** | Africa | AF97 6754
So, when I build the DataFrame select in the for loop with regexp_replace as a List[String] and use it, it throws a "Cannot resolve" error.
I tried creating temporary views of the DataFrames and used the same regexp in the select statement against the temporary views. It worked. Please find below the code I tried, which worked.
// This for loop scans through my header list, and for whichever column matches, it frames the regexp for that column. So the region_id from the DataFrame header matches the variable value that I have defined.
for (src_header_cols <- src_header_rec) {
if (col_name == src_header_cols) {
src_column_names :+= "regexp_replace(src."+s"$src_header_cols"+",ref."+s"$ref_col_name"+",ref."+s"$surr_key_col_name"+")"+s" $src_header_cols"
}
else {
src_column_names :+= "src."+src_header_cols
}
}
//Creating Temporary view to apply SQL queries on it
src_df.createOrReplaceTempView("src")
ref_df.createOrReplaceTempView("ref")
//Framing SQL statements to be passed while querying
val selectExpr_1 = "select "+src_column_names.mkString(",")
val selectExpr_2 = selectExpr_1+" from src left outer join ref on(src."+s"$col_name"+" = ref."+s"$ref_col_name"+")"
// Creating a final Data Frame using the SQL statement created
val src_policy_masked_df = spark.sql(s"$selectExpr_2")
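For completeness: the List[String] version fails because select(head, tail: _*) with strings resolves each string as a plain column name, so the whole text "regexp_replace(...)" is looked up as a column and cannot be resolved; strings are only parsed as SQL expressions by spark.sql or selectExpr. An alternative that stays in the DataFrame API is to build a List[Column] instead, using the same Column overload of regexp_replace as the working one-off select above. A minimal sketch, reusing src_df, ref_df, col_name and surr_key_col_name from the question and assuming the join column has the same name on both sides:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, regexp_replace}

// Build Column expressions instead of strings, so select() does not try to
// resolve the literal text "regexp_replace(...)" as a column name.
val selectCols: Seq[Column] = src_df.columns.toSeq.map {
  case c if c == col_name =>
    regexp_replace(col(s"src.$c"), col(s"ref.$c"), col(s"ref.$surr_key_col_name")).as(c)
  case c =>
    col(s"src.$c")
}

val final_df = src_df.alias("src")
  .join(ref_df.alias("ref"), src_df(col_name) === ref_df(col_name), "left_outer")
  .select(selectCols: _*)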
I'm learning Play with Scala and Slick and have run into a problem. The query generated by Slick throws an error when using drop and take (it works fine without drop and take, but I need pagination).
val categories = TableQuery[Categories]
def list(page: Int = 0, pageSize: Int = 10, orderBy: Int = 1)(implicit s: Session): Page[Category] = {
val offset = pageSize * page
val totalRows = count
/* Error here */
val result = categories.drop(offset).take(pageSize).list
/* This works fine */
/* val result = categories.list */
Page(result, page, offset, totalRows)
}
The generated query and error stack trace:
[JdbcSQLException: Syntax error in SQL statement "SELECT X2.""id"", X2.""name"", X2.""description"" FROM (SELECT X3.""id"" AS ""id"", X3.""name"" AS ""name"", X3.""description"" AS ""description"" FROM ""core_category"" X3 FETCH[*] NEXT 10 ROW ONLY) X2 "; expected "RIGHT, LEFT, FULL, INNER, JOIN, CROSS, NATURAL, ,, WHERE, GROUP, HAVING, UNION, MINUS, EXCEPT, INTERSECT, ORDER, LIMIT, FOR, )"; SQL statement:
select x2."id", x2."name", x2."description" from (select x3."id" as "id", x3."name" as "name", x3."description" as "description" from "core_category" x3 fetch next 10 row only) x2 [42001-175]]
Any ideas how I could implement pagination without getting this error?
I was importing the generic JdbcDriver instead of the database-specific driver (H2Driver in my case), which is needed to get the correct SQL dialect.
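A minimal sketch of what that looks like, assuming Slick 2.x (which the implicit Session / .list style suggests) and the core_category table from the error message; the Category case class and column types are assumptions:
import scala.slick.driver.H2Driver.simple._

case class Category(id: Int, name: String, description: String)

// Importing the H2 driver's lifted embedding (instead of the generic JdbcDriver)
// makes drop/take render to paging SQL that H2 actually accepts.
class Categories(tag: Tag) extends Table[Category](tag, "core_category") {
  def id          = column[Int]("id", O.PrimaryKey)
  def name        = column[String]("name")
  def description = column[String]("description")
  def *           = (id, name, description) <> (Category.tupled, Category.unapply)
}

val categories = TableQuery[Categories]

def page(page: Int = 0, pageSize: Int = 10)(implicit s: Session): List[Category] =
  categories.drop(pageSize * page).take(pageSize).list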
I have a few tables, let's say 2 for simplicity. I can create them in this way:
...
val tableA = new Table[(Int,Int)]("tableA"){
def a = column[Int]("a")
def b = column[Int]("b")
}
val tableB = new Table[(Int,Int)]("tableB"){
def a = column[Int]("a")
def b = column[Int]("b")
}
I am going to have a query that retrieves value 'a' from tableA and value 'a' from tableB as a list nested inside the results from 'a'.
My result should be:
List[(a, List(b))]
So far I have come up to this point in the query:
def createSecondItr(b1:NamedColumn[Int]) = for(
b2 <- tableB if b2.b === b1
) yield b2.a
val q1 = for (
a1 <- tableA
listB = createSecondItr(a1.b)
) yield (a1.a , listB)
I didn't test the code, so there might be errors in it. My problem is that I cannot retrieve data from the results.
To understand the question, think of trains and their classes: you search for trains after 12pm and you need a result set where the train name has the classes that the train offers as a list inside the train's result.
I don't think you can do this directly in ScalaQuery. What I would do is a normal join and then manipulate the result accordingly:
import scala.collection.mutable.{HashMap, Set, MultiMap}
def list2multimap[A, B](list: List[(A, B)]) =
list.foldLeft(new HashMap[A, Set[B]] with MultiMap[A, B]){(acc, pair) => acc.addBinding(pair._1, pair._2)}
val q = for (
a <- tableA
b <- tableB
if (a.b === b.b)
) yield (a.a, b.a)
list2multimap(q.list)
The list2multimap is taken from https://stackoverflow.com/a/7210191/66686
The code was written without the assistance of an IDE, compiler or similar. Consider the debugging free training :-)
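If mutable collections are undesirable, the same grouping can be done with immutable groupBy; a small sketch, with hypothetical sample pairs standing in for what q.list would return:
// Hypothetical sample pairs standing in for the (a.a, b.a) rows returned by q.list.
val rows: List[(Int, Int)] = List((1, 10), (1, 11), (2, 20))

// Group the flat join result into a Map from a-values to their list of b-values.
val grouped: Map[Int, List[Int]] =
  rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }

// grouped == Map(1 -> List(10, 11), 2 -> List(20))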