Using Apache Hudi with Python/PySpark

Has anyone used Apache Hudi in a PySpark environment? If so, are there any code samples available?

Here is a working PySpark sample with INSERT, UPDATE, and READ operations:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (
    SparkSession.builder.appName("Hudi_Data_Processing_Framework")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .config(
        "spark.jars.packages",
        "org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.2",
    )
    .getOrCreate()
)

input_df = spark.createDataFrame(
    [
        ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("104", "2015-01-02", "2015-01-01T12:15:00.512679Z"),
        ("105", "2015-01-02", "2015-01-01T13:51:42.248818Z"),
    ],
    ("id", "creation_date", "last_update_time"),
)

hudi_options = {
    # ---------------DATA SOURCE WRITE CONFIGS---------------#
    "hoodie.table.name": "hudi_test",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "last_update_time",
    "hoodie.datasource.write.partitionpath.field": "creation_date",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.upsert.shuffle.parallelism": 1,
    "hoodie.insert.shuffle.parallelism": 1,
    "hoodie.consistency.check.enabled": True,
    "hoodie.index.type": "BLOOM",
    "hoodie.index.bloom.num_entries": 60000,
    "hoodie.index.bloom.fpp": 0.000000001,
    "hoodie.cleaner.commits.retained": 2,
}

# INSERT
(
    input_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi_test")
)

# UPDATE (upsert: same record key "100", new last_update_time)
update_df = input_df.limit(1).withColumn(
    "last_update_time", lit("2016-01-01T13:51:39.340396Z")
)
(
    update_df.write.format("org.apache.hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi_test")
)

# READ
output_df = spark.read.format("org.apache.hudi").load("/tmp/hudi_test/*/*")
output_df.show()
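As a quick sanity check (not part of the original sample), you can confirm that the second write upserted the record with id "100" rather than appending a duplicate row; Hudi adds metadata columns such as _hoodie_commit_time to the table:
# Follow-up check: select the upserted record together with its commit metadata.
(
    output_df
    .filter("id = '100'")
    .select("_hoodie_commit_time", "id", "creation_date", "last_update_time")
    .show(truncate=False)
)
You should see a single row for id 100 whose last_update_time is the 2016 value written by the UPDATE step.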

Related

Pattern match string from column in Spark dataframe

I have a column in a Spark dataframe where I need to extract only the entries containing "xyz" and store them in a new column.
Input (I need only the fields from colB that contain xyz):
colA  colB
A     bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656
B     xyz:4462915,xyz:4462917,xyz:4462918
Required output:
colA  colB                                                    colC
A     bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656   xyz:3089656
B     xyz:4462915,xyz:4462917,xyz:4462918                     xyz:4462915,xyz:4462917,xyz:4462918
I have 100k rows and cannot use groupBy on colA with collect_list. Can you please help me get the required output?
If you are using Spark 2.4+, you can split colB on the comma and use the built-in higher-order functions as SQL expressions:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("A", "bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656"),
  ("B", "xyz:4462915,xyz:4462917,xyz:4462918")
).toDF("colA", "colB")

val newDF = df.withColumn("split", split($"colB", ","))
  .selectExpr("*", "filter(split, x -> x LIKE 'xyz%') filteredB")
  .withColumn("colC", concat_ws(",", $"filteredB"))
  .drop("split", "filteredB")

newDF.show(false)
Output:
+----+-----------------------------------------------------+-----------------------------------+
|colA|colB |colC |
+----+-----------------------------------------------------+-----------------------------------+
|A |bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656|xyz:3089656 |
|B |xyz:4462915,xyz:4462917,xyz:4462918 |xyz:4462915,xyz:4462917,xyz:4462918|
+----+-----------------------------------------------------+-----------------------------------+
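For a PySpark equivalent (not in the original answer, sketched here assuming Spark 2.4+ where the filter higher-order function is available), the same idea looks roughly like this:
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, expr, concat_ws

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [
        ("A", "bid:76563,bid:76589,bid:76591,ms:ms15-097,xyz:3089656"),
        ("B", "xyz:4462915,xyz:4462917,xyz:4462918"),
    ],
    ["colA", "colB"],
)

new_df = (
    df.withColumn("split", split("colB", ","))
    .withColumn("filteredB", expr("filter(split, x -> x LIKE 'xyz%')"))  # keep only xyz:* tokens
    .withColumn("colC", concat_ws(",", "filteredB"))
    .drop("split", "filteredB")
)
new_df.show(truncate=False)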

I am new to Scala (I used to work in PySpark). How can I convert these lines of code into Scala?

from sklearn.preprocessing import LabelEncoder

y_train = train_df['country_destination']
train_df.drop(['country_destination', 'id'], axis=1, inplace=True)
x_train = train_df.values

label_encoder = LabelEncoder()
encoded_y_train = label_encoder.fit_transform(y_train)
In the above code, I was trying to encode the labels and features.
You can do so using StringIndexer:
import org.apache.spark.ml.feature.StringIndexer

val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val indexed = indexer.fit(df).transform(df)
indexed.show()
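For comparison with the PySpark API the asker is coming from, the equivalent (not part of the original answer) would look roughly like this, assuming an active SparkSession named spark:
from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"],
)

# Same label encoding: each distinct category gets a numeric index.
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()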

Efficient way to loop through spark dataframe and determine counts of column values as types

I have a Spark dataframe. I would like to loop through each column and determine the count of data types (int, string, boolean, date) among that column's values. Not the column's overall type, but the count of each value treated as its own type. For example:
col_1 | col_2 | col_3
aaa   |  ...  |  ...
bbb   |  ...  |  ...
14    |  ...  |  ...
16    |  ...  |  ...
true  |  ...  |  ...
So the counts for col_1 would be: string=2, int=2, boolean=1.
Is there a way to do this in Spark? If so, how? Do I need to convert to an RDD and loop through each row?
Here is a rudimentary example. You'll have to pay close attention to your data and the type parsing order. For instance, "1".toDouble will succeed, and perhaps you wanted that to be counted as an int. If you only have the three types in the question, then this code should work out of the box for any number of string columns.
import scala.util.Try
import org.apache.spark.sql.functions.collect_list
import spark.implicits._

val data = spark.createDataset(Seq(
  ("aaa", "1", "true"),
  ("bbb", "bar", "2"),
  ("14", "10", "false"),
  ("16", "baz", "11"),
  ("true", "5", "4")
)).toDF("col_1", "col_2", "col_3")

val cols = data.columns.toSeq

data.flatMap(r => {
  cols.map(c => {
    // Try the narrowest types first; fall back to string.
    val str = r.getAs[String](c)
    if (Try(str.toBoolean).isSuccess) {
      (c, "boolean")
    } else if (Try(str.toInt).isSuccess) {
      (c, "int")
    } else {
      (c, "string")
    }
  })
}).toDF("col", "type")
  .groupBy("col").agg(collect_list("type").as("types"))
  .as[(String, Array[String])]
  .map(r => {
    // Count occurrences of each detected type per column.
    val mp = r._2.groupBy(t => t).mapValues(_.size)
    (r._1, mp)
  }).show(false)
This code results in:
+-----+----------------------------------------+
|_1 |_2 |
+-----+----------------------------------------+
|col_3|Map(boolean -> 2, int -> 3) |
|col_2|Map(int -> 3, string -> 2) |
|col_1|Map(boolean -> 1, int -> 2, string -> 2)|
+-----+----------------------------------------+
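A PySpark sketch of the same idea (not in the original answer) uses explicit pattern checks instead of Scala's Try; the boolean/int rules below are assumptions you may want to adapt to your data:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [
        ("aaa", "1", "true"),
        ("bbb", "bar", "2"),
        ("14", "10", "false"),
        ("16", "baz", "11"),
        ("true", "5", "4"),
    ],
    ["col_1", "col_2", "col_3"],
)

# Classify every value of every column as boolean, int, or string...
classified = None
for c in data.columns:
    typed = data.select(
        F.lit(c).alias("col"),
        F.when(F.lower(F.col(c)).isin("true", "false"), "boolean")
        .when(F.col(c).rlike("^-?[0-9]+$"), "int")
        .otherwise("string")
        .alias("type"),
    )
    classified = typed if classified is None else classified.unionByName(typed)

# ...then count occurrences of each type per column.
classified.groupBy("col").pivot("type").count().show()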

Convert RDD[(String, List[String])] to DataFrame

My RDD is in the format RDD[(String, List[String])], for example:
(abc,List(a,b))
(bcb,List(a,b))
I want to convert it to a DataFrame like below:
col1 col2 col3
abc a b
bcb a b
What is the best approach to do this in Scala?
You first need to extract the elements of your List into a tuple; then you can use toDF on your RDD (Spark implicit conversions need to be imported for this):
import org.apache.spark.rdd.RDD
import spark.implicits._

val rdd: RDD[(String, List[String])] = sc.parallelize(Seq(
  ("abc", List("a", "b")),
  ("bcb", List("a", "b"))
))

val df = rdd
  .map { case (str, list) => (str, list(0), list(1)) }
  .toDF("col1", "col2", "col3")

df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| abc| a| b|
| bcb| a| b|
+----+----+----+
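For reference, a minimal PySpark version of the same conversion (not part of the original answer), assuming an active SparkSession named spark and fixed-length lists:
rdd = spark.sparkContext.parallelize([
    ("abc", ["a", "b"]),
    ("bcb", ["a", "b"]),
])

# Flatten each (key, list) pair into a 3-tuple, then name the columns.
df = rdd.map(lambda kv: (kv[0], kv[1][0], kv[1][1])).toDF(["col1", "col2", "col3"])
df.show()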

Spark Scala: Distance between elements of RDDs

I have 2 RDDs with time series, like:
rdd1.take(5)
[(1, 25.0)
(2, 50.23)
(3, 65.0)
(4, 7.23)
(5, 12.0)]
and
rdd2.take(5)
[(1, 85.0)
(2, 3.23)
(3, 9.0)
(4, 23.23)
(5, 65.0)]
I would like to find the distance between each element of the first RDD and each element of the second, and get the following:
result.take(5)
[((1,1): (25.0-85.0)**2),
((1,2): (25.0 - 3.23)**2),
.....
((1,5): (25.0 - 65.0)**2),
.....
((2,1): (50.23 - 85.0)**2),
.....
((5,5): (12.0 - 65.0)**2),
]
The number of elements can be from 10 000 to billions.
Please, help me.
@Mohit is right: you are looking for the Cartesian product of your two RDDs; then you map over it and compute your distance.
Here is an example:
val rdd1 = sc.parallelize(List((1, 25.0), (2, 50.23), (3, 65.0), (4, 7.23), (5, 12.0)))
val rdd2 = sc.parallelize(List((1, 85.0), (2, 3.23), (3, 9.0), (4, 23.23), (5, 65.0)))
val result = rdd1.cartesian(rdd2).map {
  case ((a, b), (c, d)) => ((a, c), math.pow(b - d, 2))
}
Now, let's see what it looks like:
result.take(10).foreach(println)
# ((1,1),3600.0)
# ((1,2),473.93289999999996)
# ((1,3),256.0)
# ((1,4),3.1328999999999985)
# ((1,5),1600.0)
# ((2,1),1208.9529000000002)
# ((2,2),2209.0)
# ((2,3),1699.9128999999998)
# ((2,4),728.9999999999998)
# ((2,5),218.1529000000001)
What you are looking for is the Cartesian product. This gives you the pairing of each element of RDD1 with each element of RDD2.
Since you are dealing with billion-size dataset, make sure your infrastructure supports it.
A similar question may help you further.
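Since the sample output in the question looks like PySpark, here is a minimal PySpark sketch of the same cartesian-plus-map approach (not part of the original answers), assuming sc is an active SparkContext; keep the size warning above in mind before running it at scale:
rdd1 = sc.parallelize([(1, 25.0), (2, 50.23), (3, 65.0), (4, 7.23), (5, 12.0)])
rdd2 = sc.parallelize([(1, 85.0), (2, 3.23), (3, 9.0), (4, 23.23), (5, 65.0)])

# Pair every element of rdd1 with every element of rdd2, then compute the squared difference.
result = rdd1.cartesian(rdd2).map(
    lambda pair: ((pair[0][0], pair[1][0]), (pair[0][1] - pair[1][1]) ** 2)
)
print(result.take(10))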