I have a dataframe in PySpark where each row looks like the one below (columns Id, dt, S3_path):
123,2016-09-21,[s3://mybucket/1/,s3://mybucket/2/,s3://mybucket/3/]
where I iterate as below:
for path in s3_list:
    for i in range(len(path.S3_path)):
        model(path.S3_path[i].strip())
        print("the looping number " + str(i))
    model1(path.Id.strip(), path.dt.strip())
I need some help on how to do this using flatMap and map. Below is the code I started with:
Slist = lambda s3_list: (y for y in s3_list)
S3path= lambda path: (model(path.S3_path[i].strip()) for i in xrange(len(path.S3_path)))
sc.parallelize(s3_list).flatMap(Slist).flatMap(S3path)
How do I call the function model1 in the above code? Any help would be appreciated.
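A minimal sketch of one possible approach, assuming model and model1 are ordinary Python functions available on the executors (both names come from the question): do the per-path model calls and the per-row model1 call inside one helper, and let flatMap flatten the results.

def process_row(path):
    # one model call per S3 path in this row
    results = [model(p.strip()) for p in path.S3_path]
    # one model1 call per row, using that row's Id and dt
    results.append(model1(path.Id.strip(), path.dt.strip()))
    return results

# flatMap flattens each row's list of results into a single RDD
out = sc.parallelize(s3_list).flatMap(process_row)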
I am trying to translate a PySpark solution into Scala.
Here is the code:
conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c not in ['firstname','middlename','lastname']]
status = (when(df1["id"].isNull(), lit("added"))
    .when(df2["id"].isNull(), lit("deleted"))
    .when(size(array_remove(array(*conditions_), "")) > 0, lit("updated"))
    .otherwise("unchanged"))
For Scala, I am simply trying to use expr instead of * to substitute the conditions_ expression in my when clause, but it is not supported because of the for syntax.
Can you please point me to the right syntax for adding a loop inside a when clause, so that the count of column differences is calculated dynamically?
If you want to unpack an array in Scala, you can use the following syntax:
when(size(array_remove(array(conditions_:_*), "")) > 0, lit("updated"))
Examples of "_*" operator
I want to select all rows from a PySpark df except those where an array column contains a certain value. It works with the code below in the notebook:
<pyspark df>.filter(~exists("<col name>", lambda x: x=="hello"))
But when I write it like this:
cond = '~exists("<col name>", lambda x: x=="hello")'
df = df.filter(cond)
I got the error below:
pyspark.sql.utils.ParseException:
extraneous input 'x' expecting {')', ','}(line 1, pos 32)
I really can't spot any typo. Could someone give me a hint if I missed something?
Thanks, J
To pass the condition in through a variable, it needs to be written as a Spark SQL expression string rather than Python lambda syntax. So it can be modified to:
cond = '!exists(col_name, x -> x == "hello")'
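A short, self-contained sketch of that pattern (the dataframe and the column name words are made up for illustration): filter() accepts a Spark SQL expression string directly, and the higher-order exists function uses SQL lambda syntax (x -> ...) rather than a Python lambda.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ["hello", "a"]), (2, ["b", "c"])], ["id", "words"])

# the whole condition lives in a string that filter() parses as Spark SQL
cond = '!exists(words, x -> x == "hello")'
df.filter(cond).show()   # keeps only the row whose array has no "hello"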
D = # came from numpy.int64 via pandas
E = # came from numpy.int64 via pandas
import pyspark.sql.functions as F
output_df.withColumn("c", F.col("A") - F.log(F.lit(D) - F.lit(E)))
I tried to use multiple lit calls inside a PySpark column operation, but I keep getting errors like
*** AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
But these do work:
D=2
output_df.withColumn("c", F.lit(D))
output_df.withColumn("c", F.lit(2))
Try this
df.withColumn("c", F.col("A") - F.log(F.lit(int(D - E))))
D = int(D)
E = int(E)
Just add these two lines and it will work. The issue is that PySpark's lit doesn't know how to handle numpy.int64.
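A minimal sketch of that fix in context (the dataframe and the values of D and E are made up; the point is casting the NumPy scalars to plain Python ints before they reach lit):

import numpy as np
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
output_df = spark.createDataFrame([(10.0,)], ["A"])

D = np.int64(7)  # stand-ins for values that came out of pandas as numpy.int64
E = np.int64(2)

# lit() only understands plain Python values, so cast the NumPy scalars first
output_df.withColumn("c", F.col("A") - F.log(F.lit(int(D)) - F.lit(int(E)))).show()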
I am learning PySpark and encountered a problem that I can't fix. I followed a video tutorial and copied code from the PySpark documentation to load data for linear regression. The code I got from the documentation was spark.read.format('libsvm').load('file.txt'). (I created a Spark dataframe before this, by the way.) When I run this code in a Jupyter notebook it keeps giving me a Java error, while the person in the video did exactly the same thing and didn't get the error. Can someone help me resolve this issue, please?
Many thanks!
I think I solved this issue by setting "numFeatures" in the option method:
training = spark.read.format('libsvm').option("numFeatures","10").load('sample_linear_regression_data.txt', header=True)
You can use this custom function to read a libsvm file.
from pyspark.sql import Row
from pyspark.ml.linalg import SparseVector
def read_libsvm(filepath, spark_session):
    '''
    A utility function that takes in a libsvm file and turn it to a pyspark dataframe.

    Args:
        filepath (str): The file path to the data file.
        spark_session (object): The SparkSession object to create dataframe.

    Returns:
        A pyspark dataframe that contains the data loaded.
    '''
    with open(filepath, 'r') as f:
        raw_data = [x.split() for x in f.readlines()]

    outcome = [int(x[0]) for x in raw_data]

    index_value_dict = list()
    for row in raw_data:
        index_value_dict.append(dict([(int(x.split(':')[0]), float(x.split(':')[1]))
                                      for x in row[1:]]))

    max_idx = max([max(x.keys()) for x in index_value_dict])

    rows = [
        Row(
            label=outcome[i],
            feat_vector=SparseVector(max_idx + 1, index_value_dict[i])
        )
        for i in range(len(index_value_dict))
    ]

    df = spark_session.createDataFrame(rows)
    return df
Usage:
my_data = read_libsvm(filepath="sample_libsvm_data.txt", spark_session=spark)
You can try to load it via MLUtils:
from pyspark.mllib.util import MLUtils
df = MLUtils.loadLibSVMFile(sc,"data.libsvm",numFeatures=781).toDF()
sc is the Spark context and df is the resulting dataframe.
I'm new to Scala/Spark and am trying to loop over a dataframe's columns and accumulate the results as the loop progresses. The following code works, but it can only print the results to the screen.
traincategory.columns.foreach { x =>
  val test1 = traincategory.select("Id", x)

  import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
  // CODE TO PERFORM ONEHOT TRANSFORMATION
  val encoded = encoder.transform(indexed)

  encoded.show()
}
As val is immutable, I have attempted to append the vectors from this transformation onto another variable, as might be done in R.
//var ended = traincategory.withColumn(x,encoded(0))
I suspect Scala has a more idiomatic way of processing this.
Thank you in advance for your help.
A solution was available at:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/Correlations.scala
If anyone has similar issues with Scala MLlib, there is great example code at:
https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/mllib