So I'm trying to convert my python algorithm to Spark friendly code and I'm having trouble with this one:
indexer = recordlinkage.SortedNeighbourhoodIndex \
(left_on=column1, right_on=column2, window=41)
pairs = indexer.index(df_1,df_2)
It basically compares one column against the other and generates index pairs for those likely to be the same (Record Matching).
My code:
df1 = spark.read.load(*.csv)
df2 = spark.read.load(*.csv)
func_udf = udf(index.indexer) ????
df = df.withColumn('column1',func_udf(df1.column1,df2.column2)) ???
I've been using udf for transformations involving just one dataframe and one column, but how do I run a function that requires two arguments, one column from one dataframe and other from other dataframe? I can't join both dataframes as they have different lengths.
That's not how udf work. UserDefinedFunctions can operate only on data that comes from a single DataFrame
Standard udf on data from a single row.
pandas_udf on data from a single partition or single group.
I can't join both dataframes as they have different lengths.
Join is exactly what you should do (standard or manual broadcast). There is no need for objects to be of the same length - Spark join is a relational join not row-wise merge.
For similarity joins you can use built-in approx join tools:
Efficient string matching in Apache Spark
Optimize Spark job that has to calculate each to each entry similarity and output top N similar items for each
Related
I have two physical nodes that are not synchronised.
Both nodes produce captured data. (Two nodes technology was put in place for resilience).
I am facing following challenge:
nodes produce two identical files (timestamps may not be the same, no unique identifier in order to remove duplicates). Both frames share the same schema.
Is there a way to write in data frame using pyspark something like:
df3= case
when df1.count()<df2.count() then df2,
when df1.count()>df2.count() then df1,
ELSE df1
Resolved following case by defining "comparison" function.
def compare(df1, df2):
if df1.count() > df2.count():
return df1
if df1.count() < df2.count():
return df2
else:
return df1
Seems possibility to work with dataframes as an object can be achieved via functions
Iam trying to do some transformations on the dataset with spark using scala currently using spark sql but want to shift the code to native scala code. i want to know whether to use filter or map, doing some operations like matching the values in column and get a single column after the transformation into a different dataset.
SELECT * FROM TABLE WHERE COLUMN = ''
Used to write something like this earlier in spark sql can someone tell me an alternative way to write the same using map or filter on the dataset, and even which one is much faster when compared.
You can read documentation from Apache Spark website. This is the link to API documentation at https://spark.apache.org/docs/2.3.1/api/scala/index.html#package.
Here is a little example -
val df = sc.parallelize(Seq((1,"ABC"), (2,"DEF"), (3,"GHI"))).toDF("col1","col2")
val df1 = df.filter("col1 > 1")
df1.show()
val df2 = df1.map(x => x.getInt(0) + 3)
df2.show()
If I understand you question correctly, you need to rewrite your SQL query to DataFrame API. Your query reads all columns from table TABLE and filter rows where COLUMN is empty. You can do this with DF in the following way:
spark.read.table("TABLE")
.where($"COLUMN".eqNullSafe(""))
.show(10)
Performance will be the same as in your SQL. Use dataFrame.explain(true) method to understand what Spark will do.
I have a Dataset/Dataframe with a mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column to this dataset of type ml.linalg.Vector to this data set (so I will have both types of Vectors). The reason is I am evaluating few algorithms and some of those expect mllib vector and some expect ml vector. Also, I have to feed o/p of one algorithm to another and each use different types.
Can someone please help me convert mllib.linalg.Vector to ml.linalg.Vector and append a new column to the data set in hand. I tried using MLUtils.convertVectorColumnsToML() inside an UDF and regular functions but not able to get it to working. I am trying to avoid creating a new dataset and then doing inner join and dropping the columns as the data set will be huge eventually and joins are expensive.
You can use the method toML to convert from mllib to ml vector. An UDF and usage example can look like this:
val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) = > {
mllibVec.asML
})
val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
Assuming df to be the original dataframe and the column with the mllib vector to be named mllibVector.
When I am running my spark job (version 2.1.1) on EMR, each run counts a different amount of rows on a dataframe. I first read data from s3 to 4 different dataframes, these counts are always consistent an then after joining the dataframes, the result of the join have different counts. afterwards I also filter the result and that also has a different count on each run. The variations are small, 1-5 rows difference but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using filter method instead of where, but with same results
Any thought?
Thanks
Nir
is it possible that one of the data sources changes over over time?
since impressionsJoined is not cached, spark will reevaluate it from scratch on every action, and that includes reading the data again from the source.
try caching impressionsJoined after the join.
I am reading a csv as a Data Frame by below:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("D:/ModelData.csv")
Then I group by three columns as below which returns a RelationalGroupedDataset
df.groupBy("col1", "col2","col3")
And I want each grouped data frame to be send through the below function
def ModelFunction(daf: DataFrame) = {
//do some calculation
}
For example if I have col1 having 2 unique (0,1) values and col2 having 2 unique values(1,2) and col3 having 3 unique values(1,2,3) Then i would like to pass each combination grouping to the Model function Like for col1=0 ,col2=1,col3=1 I will havea dataframe and I want to pass that to the ModelFunction and so on for each combination of the three columns.
I tried
df.groupBy("col1", "col2","col3").ModelFunction();
But it throw an error.
.
Any help is appreciated.
The short answer is that you cannot do that. You can only do aggregate functions on RelationalGroupedDataset (either ones you write as UDAF or built in ones in org.apache.spark.sql.functions)
The way I see it you have several options:
Option 1: The amount of data for each unique combination is small enough and not skewed too much compared to other combinations.
In this case you can do:
val grouped = df.groupBy("col1", "col2","col3").agg(collect_list(struct(all other columns)))
grouped.as[some case class to represent the data including the combination].map[your own logistic regression function).
Option 2: If the total number of combinations is small enough you can do:
val values: df.select("col1", "col2", "col3").distinct().collect()
and then loop through them creating a new dataframe from each combination by doing a filter.
Option 3: Write your own UDAF
This would probably not be good enough as the data comes in a stream without the ability to do iteration, however, if you have an implemenation of logistic regression which matches you can try to write a UDAF to do this. See for example: How to define and use a User-Defined Aggregate Function in Spark SQL?