Pyspark Usage of Col() Function - pyspark

I am new to spark.I have faced a confusion in which there are multiple ways to access a column of a dataframe. For example:
df=select(df.columnname).show()
other way:
df=select(col("columnname")).show()
There are scenarios where we can not use direct column name and we have to use col() thing. However, I am unable to find those scearios where we have to accesss column name using col() otherwise it will throw the error.

This would be an example that would fail without using col():
df = df.withColumn('col3', when(df.col1 > 2, 5)) \
.withColumn('col4', when(df.col3 < 6, 4))
If you want to access a column that is part of the projection chain, but not of the data frame object yet.
This would work:
df = df.withColumn('col3', when(df.col1 > 2, 5)) \
.withColumn('col4', when(col("col3") < 6, 4))

Related

PyTest Mark some parameters as slow but not others [duplicate]

I have been trying to parameterize my tests using #pytest.mark.parametrize, and I have a marketer #pytest.mark.test("1234"), I use the value from the test marker to do post the results to JIRA. Note the value given for the marker changes for every test_data. Essentially the code looks something like below.
#pytest.mark.foo
#pytest.mark.parametrize(("n", "expected"),[
(1, 2),
(2, 3)])
def test_increment(n, expected):
assert n + 1 == expected
I want to do something like
#pytest.mark.foo
#pytest.mark.parametrize(("n", "expected"), [
(1, 2,#pytest.mark.test("T1")),
(2, 3,#pytest.mark.test("T2"))
])
How to add the marker when using parameterized tests given that the value of the marker is expected to change with each test?
It's explained here in the documentation: https://docs.pytest.org/en/stable/example/markers.html#marking-individual-tests-when-using-parametrize
To show it here as well, it'd be:
#pytest.mark.foo
#pytest.mark.parametrize(("n", "expected"), [
pytest.param(1, 2, marks=pytest.mark.T1),
pytest.param(2, 3, marks=pytest.mark.T2),
(4, 5)
])

How do I split dataframe in pyspark.sql.column.Column and get count of items

I am beginner in pythons and spark.
My Product_Version column looks like: 87.12.0.12. sometimes it has only 3 parts: 89.0.0
I want to split them using separator in different columns and so I did:
cols = split(dfroot_temp["Product_Version"], "\.")
dfroot_staged_temp = dfroot_temp.withColumn("Ver_Major", cols.getItem(0)) \
.withColumn("Ver_Minor", cols.getItem(1)) \
.withColumn("Ver_Build", cols.getItem(3))
Now, sometimes when the last part in not present in dataframe script throws error at line: .withColumn("Ver_Build", cols.getItem(3))
How can I know in advance weather it has 3rd part is present the column so I will do lit(0) in-place? Is there any function to get the count of these column items?
You can use pyspark.sql.functions.size method.
cols.size would return 3 for 89.0.0 and 4 for 89.0.0.1
cols = split(dfroot_temp["Product_Version"], "\.")
dfroot_staged_temp = dfroot_temp.withColumn("Ver_Major", cols.getItem(0)) \
.withColumn("Ver_Minor", cols.getItem(1)) \
.withColumn("Ver_Build", when(size(cols) == 4, cols.getItem(3)).otherwise(lit(0)))

How to read csv in pyspark as different types, or map dataset into two different types

Is there a way to map RDD as
covidRDD = sc.textFile("us-states.csv") \
.map(lambda x: x.split(","))
#reducing states and cases by key
reducedCOVID = covidRDD.reduceByKey(lambda accum, n:accum+n)
print(reducedCOVID.take(1))
The dataset consists of 1 column of states and 1 column of cases. When it's created, it is read as
[[u'Washington', u'1'],...]
Thus, I want to have a column of string and a column of int. I am doing a project on RDD, so I want to avoid using dataframe.. any thoughts?
Thanks!
As the dataset contains key value pair, use groupBykey and aggregate the count.
If you have a dataset like [['WH', 10], ['TX', 5], ['WH', 2], ['IL', 5], ['TX', 6]]
The code below gives this output - [('IL', 5), ('TX', 11), ('WH', 12)]
data.groupByKey().map(lambda row: (row[0], sum(row[1]))).collect()
can use aggregateByKey with UDF. This method requires 3 parameters start location, aggregation function within partition and aggregation function across the partitions
This code also produces the same result as above
def addValues(a,b):
return a+b
data.aggregateByKey(0, addValues, addValues).collect()

Iterate through mixed-Type scala Lists

Using Spark 2.1.1., I have an N-row csv as 'fileInput'
colname datatype elems start end
colA float 10 0 1
colB int 10 0 9
I have successfully made an array of sql.rows ...
val df = spark.read.format("com.databricks.spark.csv").option("header", "true").load(fileInput)
val rowCnt:Int = df.count.toInt
val aryToUse = df.take(rowCnt)
Array[org.apache.spark.sql.Row] = Array([colA,float,10,0,1], [colB,int,10,0,9])
Against those Rows and using my random-value-generator scripts, I have successfully populated an empty ListBuffer[Any] ...
res170: scala.collection.mutable.ListBuffer[Any] = ListBuffer(List(0.24455154, 0.108798146, 0.111522496, 0.44311434, 0.13506883, 0.0655781, 0.8273762, 0.49718297, 0.5322746, 0.8416396), List(1, 9, 3, 4, 2, 3, 8, 7, 4, 6))
Now, I have a mixed-type ListBuffer[Any] with different typed lists.
.
How do iterate through and zip these? [Any] seems to defy mapping/zipping. I need to take N lists generated by the inputFile's definitions, then save them to a csv file. Final output should be:
ColA, ColB
0.24455154, 1
0.108798146, 9
0.111522496, 3
... etc
The inputFile can then be used to create any number of 'colnames's, of any 'datatype' (I have scripts for that), of each type appearing 1::n times, of any number of rows (defined as 'elems'). My random-generating scripts customize the values per 'start' & 'end', but these columns are not relevant for this question).
Given a List[List[Any]], you can "zip" all these lists together using transpose, if you don't mind the result being a list-of-lists instead of a list of Tuples:
val result: Seq[List[Any]] = list.transpose
If you then want to write this into a CSV, you can start by mapping each "row" into a comma-separated String:
val rows: Seq[String] = result.map(_.mkString(","))
(note: I'm ignoring the Apache Spark part, which seems completely irrelevant to this question... the "metadata" is loaded via Spark, but then it's collected into an Array so it becomes irrelevant)
I think the RDD.zipWithUniqueId() or RDD.zipWithIndex() methods can perform what you wanna do.
Please refer to official documentation for more information. hope this help you

Implement a MergeSort like feature in spark with scala

Spark Version 1.2.1
Scala Version 2.10.4
I have 2 SchemaRDD which are associated by a numeric field:
RDD 1: (Big table - about a million records)
[A,3]
[B,4]
[C,5]
[D,7]
[E,8]
RDD 2: (Small table < 100 records so using it as a Broadcast Variable)
[SUM, 2]
[WIN, 6]
[MOM, 7]
[DOM, 9]
[POM, 10]
Result
[C,5, WIN]
[D,7, MOM]
[E,8, DOM]
[E,8, POM]
I want the max(field) from RDD1 which is <= the field from RDD2.
I am trying to approach this using Merge by:
Sorting RDD by a key (sort within a group will have not more than 100 records in that group. In the above example is within a group)
Performing the merge operation similar to mergesort. Here I need to keep a track of the previous value as well to find the max; still I traverse the list only once.
Since there are too may variables here I am getting "Task not serializable" exception. Is this implementation approach Correct? I am trying to avoid the Cartesian Product here. Is there a better way to do it?
Adding the code -
rdd1.groupBy(itm => (itm(2), itm(3))).mapValues( itmorg => {
val miorec = itmorg.toList.sortBy(_(1).toString)
for( r <- 0 to miorec.length) {
for ( q <- 0 to rdd2.value.length) {
if ( (miorec(r)(1).toString > rdd2.value(q).toString && miorec(r-1)(1).toString <= rdd2.value(q).toString && r > 0) || r == miorec.length)
org.apache.spark.sql.Row(miorec(r-1)(0),miorec(r-1)(1),miorec(r-1)(2),miorec(r-1)(3),rdd2.value(q))
}
}
}).collect.foreach(println)
I would not do a global sort. It is an expensive operation for what you need. Finding the maximum is certainly cheaper than getting a global ordering of all values. Instead, do this:
For each partition, build a structure that keeps the max on RDD1 for each row on RDD2. This can be trivially done using mapPartitions and normal scala data structures. You can even use your one-pass merge code here. You should get something like a HashMap(WIN -> (C, 5), MOM -> (D, 7), ...)
Once this is done locally on each executor, merging these resulting data structures should be simple using reduce.
The goal here is to do little to no shuffling an keeping the most complex operation local, since the result size you want is very small (it would be easier in code to just create all valid key/values with RDD1 and RDD2 then aggregateByKey, but less efficient).
As for your exception, you woudl need to show the code, "Task not serializable" usually means you are passing around closures which are not, well, serializable ;-)