How to visualize pyspark ml's LDA or other clustering - pyspark
Is there a simple way to visualize PySpark's LDA (pyspark.ml.clustering.LDA)?
ldamodel.transform(result).show() generates
+--------------------+---+--------------------+--------------------+
| filtered| id| features| topicDistribution|
+--------------------+---+--------------------+--------------------+
| [problem, popul]| 0|(18054,[49,493],[...|[0.03282220322786...|
|[tyler, note, glo...| 1|(18054,[40,52,57,...|[0.00440868073429...|
|[mani, economist,...| 2|(18054,[12,17,25,...|[0.00404065731437...|
|[probabl, correct...| 3|(18054,[0,4,7,21,...|[0.00485107317270...|
|[even, popul, ass...| 4|(18054,[10,12,49,...|[0.00334279689625...|
|[sake, argument, ...| 5|(18054,[1,9,12,61...|[0.00285045818525...|
|[much, tougher, p...| 6|(18054,[27,32,49,...|[0.00485107690380...|
+--------------------+---+--------------------+--------------------+
This notebook helped me visualize PySpark LDA topics; it uses a D3 bubble chart to visualize the clusters. You could also use pyLDAvis for an interactive topic model visualization.
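For pyLDAvis specifically, a minimal sketch could look like the following. This is an assumption-laden outline, not the library's official recipe: it assumes `model` and `transformed` are the fitted LDAModel and the transformed DataFrame from the example below, `cv_model` is a hypothetical fitted CountVectorizerModel that produced the features column, and the data is small enough to collect to the driver.

# Sketch: feeding pyspark LDA output into pyLDAvis (assumed names: model, transformed, cv_model)
import numpy as np
import pyLDAvis

# topicsMatrix() is vocabSize x k; transpose so rows are topics.
# Depending on the optimizer the columns may not sum to 1, so normalize.
topic_term = np.array(model.topicsMatrix().toArray()).T
topic_term = topic_term / topic_term.sum(axis=1, keepdims=True)

# Collect per-document topic distributions and term counts (small data only).
rows = transformed.select("topicDistribution", "features").collect()
doc_topic = np.array([r["topicDistribution"].toArray() for r in rows])
counts = np.array([r["features"].toArray() for r in rows])

vis = pyLDAvis.prepare(
    topic_term_dists=topic_term,
    doc_topic_dists=doc_topic,
    doc_lengths=counts.sum(axis=1),
    vocab=cv_model.vocabulary,      # hypothetical: vocabulary of the CountVectorizer that built features
    term_frequency=counts.sum(axis=0),
)
pyLDAvis.display(vis)  # or pyLDAvis.save_html(vis, "lda.html")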
Here is PySpark code that shows how to work with the topic distribution produced by the model's .transform API on a DataFrame. I am using the Spark LDA example dataset in libsvm format.
# Code to train an LDA model using Spark ML
from pyspark.ml.clustering import LDA
from pyspark.sql.types import DoubleType
from pyspark.sql import functions as F
# Loads data
dataset = spark.read.format("libsvm").load("file:///usr/sample_lda_libsvm_data.txt")
dataset.show(truncate=False)
Example data
+-----+---------------------------------------------------------------+
|label|features |
+-----+---------------------------------------------------------------+
|0.0 |(11,[0,1,2,4,5,6,7,10],[1.0,2.0,6.0,2.0,3.0,1.0,1.0,3.0]) |
|1.0 |(11,[0,1,3,4,7,10],[1.0,3.0,1.0,3.0,2.0,1.0]) |
|2.0 |(11,[0,1,2,5,6,8,9],[1.0,4.0,1.0,4.0,9.0,1.0,2.0]) |
|3.0 |(11,[0,1,3,6,8,9,10],[2.0,1.0,3.0,5.0,2.0,3.0,9.0]) |
|4.0 |(11,[0,1,2,3,4,6,9,10],[3.0,1.0,1.0,9.0,3.0,2.0,1.0,3.0]) |
|5.0 |(11,[0,1,3,4,5,6,7,8,9],[4.0,2.0,3.0,4.0,5.0,1.0,1.0,1.0,4.0]) |
|6.0 |(11,[0,1,3,6,8,9,10],[2.0,1.0,3.0,5.0,2.0,2.0,9.0]) |
|7.0 |(11,[0,1,2,3,4,5,6,9,10],[1.0,1.0,1.0,9.0,2.0,1.0,2.0,1.0,3.0])|
|8.0 |(11,[0,1,3,4,5,6,7],[4.0,4.0,3.0,4.0,2.0,1.0,3.0]) |
|9.0 |(11,[0,1,2,4,6,8,9,10],[2.0,8.0,2.0,3.0,2.0,2.0,7.0,2.0]) |
|10.0 |(11,[0,1,2,3,5,6,9,10],[1.0,1.0,1.0,9.0,2.0,2.0,3.0,3.0]) |
|11.0 |(11,[0,1,4,5,6,7,9],[4.0,1.0,4.0,5.0,1.0,3.0,1.0]) |
+-----+---------------------------------------------------------------+
Train an LDA model
# Train an LDA model
lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)
# Describe topics.
topics = model.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)
The topics described by their top-weighted terms:
+-----+-----------+---------------------------------------------------------------+
|topic|termIndices|termWeights |
+-----+-----------+---------------------------------------------------------------+
|0 |[4, 7, 10] |[0.10782284792565977, 0.09748059037449146, 0.09623493647157101]|
|1 |[1, 6, 9] |[0.16755678146051728, 0.14746675884135615, 0.12291623854765772]|
|2 |[3, 10, 6] |[0.2365737123772152, 0.10497827056720986, 0.0917840535687615] |
|3 |[1, 3, 7] |[0.1015758016249506, 0.09974496621850018, 0.09902599541011434] |
|4 |[9, 10, 3] |[0.10479879348457938, 0.10207370742688827, 0.09818478669740321]|
|5 |[8, 5, 7] |[0.10843493028120557, 0.0970150424500599, 0.09334497822531877] |
|6 |[8, 5, 0] |[0.09874156962344234, 0.09654280831555884, 0.09565956823827508]|
|7 |[9, 4, 7] |[0.11252483000458603, 0.09755087587088286, 0.09643430900592685]|
|8 |[4, 1, 2] |[0.10994283713713536, 0.09410686873447463, 0.0937471573628509] |
|9 |[5, 4, 0] |[0.15265940066441183, 0.14015412109446546, 0.13878634876078264]|
+-----+-----------+---------------------------------------------------------------+
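The describeTopics output only gives term indices. If your features column was built with a CountVectorizer (as in the original question's pipeline), you can map those indices back to words using the fitted model's vocabulary. A small sketch, where cv_model is that fitted CountVectorizerModel (a hypothetical name, since the libsvm example above has no vocabulary):

# Sketch: map termIndices back to words; cv_model is an assumed CountVectorizerModel
from pyspark.sql.types import ArrayType, StringType

vocab = cv_model.vocabulary  # list of terms; position == term index

indices_to_terms = F.udf(lambda indices: [vocab[int(i)] for i in indices],
                         ArrayType(StringType()))

topics.withColumn("terms", indices_to_terms("termIndices")).show(truncate=False)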
View topic distribution for every document
# view topic distribution for every document
transformed = model.transform(dataset)
transformed.show(truncate=False)
+-----+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features |topicDistribution |
+-----+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0.0 |(11,[0,1,2,4,5,6,7,10],[1.0,2.0,6.0,2.0,3.0,1.0,1.0,3.0]) |[0.004830688509084788,0.9563375886321935,0.004924669693727129,0.004830693291141946,0.004830675601199576,0.004830690970098452,0.004830731737552684,0.004830674902568036,0.004830730786933749,0.004922855875500012] |
|1.0 |(11,[0,1,3,4,7,10],[1.0,3.0,1.0,3.0,2.0,1.0]) |[0.008057778755383592,0.3149188541525326,0.00821568856074705,0.008057899973735082,0.00805773202965193,0.00805773219443841,0.00805772753178338,0.008057790266770967,0.008057845264839285,0.6204609512701176] |
|2.0 |(11,[0,1,2,5,6,8,9],[1.0,4.0,1.0,4.0,9.0,1.0,2.0]) |[0.004199741171245032,0.9620401773226402,0.004281469704273017,0.004199769097486346,0.004199807571784884,0.004199819505813106,0.004199835506062414,0.004199781772904878,0.004199800982100323,0.004279797365689855] |
|3.0 |(11,[0,1,3,6,8,9,10],[2.0,1.0,3.0,5.0,2.0,3.0,9.0]) |[0.003714896800546591,0.5070516557688054,0.4631584573147577,0.003714914880264338,0.0037150085177011572,0.003714949896828997,0.0037149846555122436,0.003714886267751718,0.003714909060953893,0.003785336836878225] |
|4.0 |(11,[0,1,2,3,4,6,9,10],[3.0,1.0,1.0,9.0,3.0,2.0,1.0,3.0]) |[0.004024716198633711,0.004348960756766257,0.9633765414688664,0.004024715826289515,0.0040247523412803785,0.004024714760590197,0.004024750967476446,0.004024750137766685,0.004024763598734582,0.004101333943595805] |
|5.0 |(11,[0,1,3,4,5,6,7,8,9],[4.0,2.0,3.0,4.0,5.0,1.0,1.0,1.0,4.0]) |[0.003714916720108325,0.004014106400247752,0.0037876992243613913,0.0037149522531312196,0.0037149927030871474,0.0037149587146134535,0.0037149750439419123,0.0037150099006180567,0.003714963609773339,0.9661934254301174] |
|6.0 |(11,[0,1,3,6,8,9,10],[2.0,1.0,3.0,5.0,2.0,2.0,9.0]) |[0.003863637584067354,0.44120209378688086,0.5278152614977222,0.0038636593932357263,0.003863751204372584,0.0038636970054184935,0.003863731528120536,0.0038636169190041057,0.003863652151710295,0.003936898929468125] |
|7.0 |(11,[0,1,2,3,4,5,6,9,10],[1.0,1.0,1.0,9.0,2.0,1.0,2.0,1.0,3.0])|[0.004390955723890411,0.004745014492795635,0.9600436030532219,0.004390986523517605,0.004391013571891052,0.004390968206875746,0.004391003804300225,0.004390998289212864,0.0043910030406065104,0.004474453293687847] |
|8.0 |(11,[0,1,3,4,5,6,7],[4.0,4.0,3.0,4.0,2.0,1.0,3.0]) |[0.004391082468515706,0.004744799620819518,0.004477230286216996,0.004391179034422902,0.004391083385391976,0.0043911102087152145,0.004391108242443274,0.0043911476110250714,0.0043911508747108575,0.9600401082677386] |
|9.0 |(11,[0,1,2,4,6,8,9,10],[2.0,8.0,2.0,3.0,2.0,2.0,7.0,2.0]) |[0.0033302167739046973,0.9698998050463385,0.0033949933226572675,0.0033302031974203014,0.0033302208173504686,0.003330228671311114,0.0033302277108795157,0.003330230056473623,0.0033302455331591036,0.0033936288705052665]|
|10.0 |(11,[0,1,2,3,5,6,9,10],[1.0,1.0,1.0,9.0,2.0,2.0,3.0,3.0]) |[0.0041998552715806015,0.004538086674649772,0.9617828003374762,0.0041998854155415434,0.004199964563679233,0.004199898040748559,0.004199948969028732,0.004199941207400563,0.004199894377993083,0.004279725141901989] |
|11.0 |(11,[0,1,4,5,6,7,9],[4.0,1.0,4.0,5.0,1.0,3.0,1.0]) |[0.0048305604098789244,0.005219225001032762,0.004924487214200011,0.004830543265675906,0.00483056515654878,0.004830577688731923,0.004830590528195045,0.004830599936989683,0.004830615233900232,0.9560422355648467] |
+-----+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
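A common follow-up is to pick the most likely topic for each document. One way, sketched with a small UDF that takes the argmax of the topicDistribution vector:

# Sketch: dominant (most probable) topic per document
from pyspark.sql.types import IntegerType

dominant_topic = F.udf(lambda v: int(v.toArray().argmax()), IntegerType())

transformed.select("label",
                   dominant_topic("topicDistribution").alias("dominantTopic")).show()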
Schema of the transformed DataFrame
transformed.printSchema()
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
|-- topicDistribution: vector (nullable = true)
As you can see, topicDistribution is a vector (pyspark.ml.linalg.Vector), not an array, so it cannot be indexed with column expressions directly. The helper UDF below extracts the i-th element of a vector as a double.
# Return the i-th element of a vector as a float, or None if the index is invalid
def ith_(v, i):
    try:
        return float(v[i])
    except (ValueError, IndexError):
        return None

ith = F.udf(ith_, DoubleType())
Reformat so that each topic's probability is displayed as a separate column for every document
df = transformed.select(["label"] + [ith("topicDistribution", F.lit(i)).alias('topic_'+str(i)) for i in range(10)] )
df.show(truncate=False)
+-----+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
|label|topic_0 |topic_1 |topic_2 |topic_3 |topic_4 |topic_5 |topic_6 |topic_7 |topic_8 |topic_9 |
+-----+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
|0.0 |0.004830687791450502 |0.9563377999372255 |0.004830652446299898 |0.004830693203685635 |0.004924680975321234 |0.004830690324650106 |0.004830724790894176 |0.004830674545741453 |0.004830728328369402 |0.00492266765636222 |
|1.0 |0.00805777782592821 |0.3150888304586096 |0.008057821375392899 |0.008057900091752447 |0.00821563090347786 |0.008057731378987427 |0.008057716226340182 |0.00805778996991863 |0.008057841440203276 |0.6202909603293896 |
|2.0 |0.004199740539975822 |0.9620403414727842 |0.004199830281319767 |0.004199769011855544 |0.004281446354869374 |0.004199818930938506 |0.004199829456280457 |0.004199781450899189 |0.004199798835689997 |0.00427964366538733 |
|3.0 |0.003714883352496639 |0.39438266523895776 |0.0037149161634889914|0.003714899290148889 |0.5758276298046127 |0.003714939245435922 |0.0037149657297638815|0.003714878209574761 |0.0037148981104253493|0.0037853248550950695|
|4.0 |0.00402472343811409 |0.0043486720544167945|0.0040247584323080295|0.004024726616022349 |0.9633767817635327 |0.004024722506471514 |0.004024749723387701 |0.004024759068339994 |0.00402477228684825 |0.0041013341105585275|
|5.0 |0.0037149161731463167|0.00401410657859215 |0.0037150318186438148|0.003714952190974752 |0.0037876713720541993|0.003714958223027372 |0.003714969707955506 |0.0037150096299263177|0.003714961725756829 |0.9661934225799228 |
|6.0 |0.0038636235465470963|0.32506932380193027 |0.0038636563625666425|0.003863644344443025 |0.6439482136665527 |0.0038636867164242353|0.003863712160357752 |0.003863609226073573 |0.003863641557265962 |0.00393688861783849 |
|7.0 |0.004390963901259502 |0.004744419369141901 |0.004391020228883301 |0.00439099927884862 |0.9600441405838983 |0.004390977425037901 |0.004391002809855065 |0.004391008592998927 |0.004391013090740394 |0.004474454719336111 |
|8.0 |0.004391081853379135 |0.004744865767572997 |0.004391206214702098 |0.004391178993516226 |0.004477132667794462 |0.0043911096593825015|0.0043911019675074445|0.004391147323286589 |0.0043911486798455125|0.960040026873013 |
|9.0 |0.003330216240957084 |0.9698999783457445 |0.00333023738785573 |0.0033302030986131904|0.003394973102900875 |0.0033302280874212362|0.0033302228867079335|0.0033302291785187624|0.0033302391644247616|0.003393472506855918 |
|10.0 |0.004199858865711682 |0.004538534384183169 |0.004199958349762097 |0.004199894260340701 |0.9617823390796781 |0.004199903494953782 |0.0041999446501473445|0.004199945557171458 |0.004199899755712464 |0.004279721602339041 |
|11.0 |0.00483055973980833 |0.005219211145215135 |0.004830592303351509 |0.004830543225945144 |0.004924458988916403 |0.004830577090650675 |0.004830583633398643 |0.004830599625982923 |0.004830612825588896 |0.9560422614211423 |
+-----+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
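If you are on Spark 3.0 or later, the same reshaping can be done without a Python UDF by first converting the vector column to an array column (a sketch; it should match the UDF version above):

# Spark 3.0+: vector_to_array avoids the Python UDF
from pyspark.ml.functions import vector_to_array

df2 = (transformed
       .withColumn("td", vector_to_array("topicDistribution"))
       .select(["label"] + [F.col("td")[i].alias("topic_" + str(i)) for i in range(10)]))
df2.show(truncate=False)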
You can use these results to visualize the topic distribution for each document, or the topics with their top-weighted terms.
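For a quick look on a small dataset, one option (a sketch, assuming matplotlib and pandas are installed and the data fits on the driver) is a stacked bar chart of the per-document topic distribution:

import matplotlib.pyplot as plt

# Collect the per-topic columns to pandas (small data only) and plot them
pdf = df.toPandas().set_index("label")
pdf.plot(kind="bar", stacked=True, figsize=(10, 5))
plt.ylabel("topic probability")
plt.title("Topic distribution per document")
plt.show()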
Related
Convert a column from StringType to Json (object)
Here is some sample data:

val df4 = sc.parallelize(List(
  ("A1", 45, "5", 1, 90),
  ("A2", 60, "1", 1, 120),
  ("A6", 30, "9", 1, 450),
  ("A7", 89, "7", 1, 333),
  ("A7", 89, "4", 1, 320),
  ("A2", 60, "5", 1, 22),
  ("A1", 45, "22", 1, 1)
)).toDF("CID", "age", "children", "marketplace_id", "value")

Thanks to #Shu for this piece of code:

val df5 = df4.selectExpr("CID", """to_json(named_struct("id", children)) as item""", "value", "marketplace_id")

+---+-----------+-----+--------------+
|CID|item       |value|marketplace_id|
+---+-----------+-----+--------------+
|A1 |{"id":"5"} |90   |1             |
|A2 |{"id":"1"} |120  |1             |
|A6 |{"id":"9"} |450  |1             |
|A7 |{"id":"7"} |333  |1             |
|A7 |{"id":"4"} |320  |1             |
|A2 |{"id":"5"} |22   |1             |
|A1 |{"id":"22"}|1    |1             |
+---+-----------+-----+--------------+

When you do df5.dtypes, you get (CID,StringType), (item,StringType), (value,IntegerType), (marketplace_id,IntegerType). The column item is of string type; is there a way it can be of json/object type (if that is a thing)?

EDIT 1: I will describe what I am trying to achieve here; the above two steps remain the same.

val w = Window.partitionBy("CID").orderBy(desc("value"))
val sorted_list = df5.withColumn("item", collect_list("item").over(w)).groupBy("CID").agg(max("item") as "item")

Output:

+---+-------------------------+
|CID|item                     |
+---+-------------------------+
|A6 |[{"id":"9"}]             |
|A2 |[{"id":"1"}, {"id":"5"}] |
|A7 |[{"id":"7"}, {"id":"4"}] |
|A1 |[{"id":"5"}, {"id":"22"}]|
+---+-------------------------+

Now whatever is inside [ ] is a string, which is causing a problem for one of the tools we are using. Pardon me if this is a basic question; I am new to Scala and Spark.
Store json data using struct type, check below code.

scala> dfa
  .withColumn("item_without_json", struct($"cid".as("id")))
  .withColumn("item_as_json", to_json($"item_without_json"))
  .show(false)

+---+-----------+-----+--------------+-----------------+------------+
|CID|item       |value|marketplace_id|item_without_json|item_as_json|
+---+-----------+-----+--------------+-----------------+------------+
|A1 |{"id":"A1"}|90   |1             |[A1]             |{"id":"A1"} |
|A2 |{"id":"A2"}|120  |1             |[A2]             |{"id":"A2"} |
|A6 |{"id":"A6"}|450  |1             |[A6]             |{"id":"A6"} |
|A7 |{"id":"A7"}|333  |1             |[A7]             |{"id":"A7"} |
|A7 |{"id":"A7"}|320  |1             |[A7]             |{"id":"A7"} |
|A2 |{"id":"A2"}|22   |1             |[A2]             |{"id":"A2"} |
|A1 |{"id":"A1"}|1    |1             |[A1]             |{"id":"A1"} |
+---+-----------+-----+--------------+-----------------+------------+
Based on the comment you made, to have the dataset converted to json you would use:

df4
  .select(collect_list(struct($"CID".as("id"))).as("items"))
  .write()
  .json(path)

The output will look like:

{"items":[{"id":"A1"},{"id":"A2"},{"id":"A6"},{"id":"A7"}, ...]}

If you need the result in memory to pass down to a function, instead of write().json(...) use toJSON.
How to convert column values into a single array in scala?
I am trying to convert all columns of my dataframe into single arrays. Is there an operation supported in structured streaming by which we can perform something opposite to "explode"? Any suggestion is much appreciated! I tried collect() and collectAsList(), but they are not supported in streaming.

+---+---------------+----------------+--------+
|row|ADDRESS_TYPE_CD|DISCONTINUE_DATE|param_cd|
+---+---------------+----------------+--------+
|0  |1              |null            |7       |
|2  |6              |null            |1       |
+---+---------------+----------------+--------+

My result should look like:

+-----+---------------+----------------+--------+
|row  |ADDRESS_TYPE_CD|DISCONTINUE_DATE|param_cd|
+-----+---------------+----------------+--------+
|[0,2]|[1,6]          |[null,null]     |[7,2]   |
+-----+---------------+----------------+--------+
You can use collect_list on all your columns, for instance. It would go as follows:

val aggs = df.columns.map(c => collect_list(col(c)) as c)
df.select(aggs :_*).show()

+------+---------------+----------------+--------+
|   row|ADDRESS_TYPE_CD|DISCONTINUE_DATE|param_cd|
+------+---------------+----------------+--------+
|[0, 2]|         [1, 6]|    [null, null]|  [7, 1]|
+------+---------------+----------------+--------+
How can I do map reduce on spark dataframe group by conditional columns?
My spark dataframe looks like this:

+------+------+-------+------+
|userid|useid1|userid2|score |
+------+------+-------+------+
|23    |null  |dsad   |3     |
|11    |44    |null   |4     |
|231   |null  |temp   |5     |
|231   |null  |temp   |2     |
+------+------+-------+------+

I want to do the calculation for each pair of userid and useid1/userid2 (whichever is not null). If it is useid1, I multiply the score by 5; if it is userid2, I multiply the score by 3. Finally, I want to add up all the scores for each pair. The result should be:

+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23    |dsad    |9          |
|11    |44      |20         |
|231   |temp    |21         |
+------+--------+-----------+

How can I do this? For the groupBy part, I know DataFrames have a groupBy function, but I don't know if I can use it conditionally, e.g. if userid1 is null, groupBy(userid, userid2); if userid2 is null, groupBy(userid, useid1). For the calculation part, how do I multiply by 3 or 5 based on the condition?
The below solution will help to solve your problem.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val groupByUserWinFun = Window.partitionBy("userid", "useid1/2")
val finalScoreDF = userDF
  .withColumn("useid1/2", when($"userid1".isNull, $"userid2").otherwise($"userid1"))
  .withColumn("finalscore", when($"userid1".isNull, $"score" * 3).otherwise($"score" * 5))
  .withColumn("finalscore", sum("finalscore").over(groupByUserWinFun))
  .select("userid", "useid1/2", "finalscore").distinct()

Using the when method in Spark SQL, select userid1 or userid2 and multiply the score based on the condition.

Output:

+------+--------+----------+
|userid|useid1/2|finalscore|
+------+--------+----------+
|    11|      44|      20.0|
|    23|    dsad|       9.0|
|   231|    temp|      21.0|
+------+--------+----------+
Group by will work:

val original = Seq(
  (23, null, "dsad", 3),
  (11, "44", null, 4),
  (231, null, "temp", 5),
  (231, null, "temp", 2)
).toDF("userid", "useid1", "userid2", "score")

// action
val result = original
  .withColumn("useid1/2", coalesce($"useid1", $"userid2"))
  .withColumn("score", $"score" * when($"useid1".isNotNull, 5).otherwise(3))
  .groupBy("userid", "useid1/2")
  .agg(sum("score").alias("final score"))

result.show(false)

Output:

+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23    |dsad    |9          |
|231   |temp    |21         |
|11    |44      |20         |
+------+--------+-----------+
coalesce will do the needful:

df.withColumn("userid1/2", coalesce(col("useid1"), col("userid2")))

Basically this function returns the first non-null value in the given order. Documentation: COALESCE(T v1, T v2, ...) - Returns the first v that is not NULL, or NULL if all v's are NULL. It needs an import:

import org.apache.spark.sql.functions.coalesce
Add leading zeros to Columns in a Spark Data Frame [duplicate]
This question already has answers here: Prepend zeros to a value in PySpark (2 answers). Closed 4 years ago.

In short, I'm leveraging spark-xml to do some parsing of XML files. However, using this is removing the leading zeros in all the values I'm interested in. I need the final output, which is a DataFrame, to include the leading zeros, and I can not figure out a way to add leading zeros to the columns I'm interested in.

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "output")
  .option("excludeAttribute", true)
  .option("allowNumericLeadingZeros", true) // including this does not solve the problem
  .load("pathToXmlFile")

Example output that I'm getting:

+------+---+--------------------+
|iD    |val|Code                |
+------+---+--------------------+
|1     |44 |9022070536692784476 |
|2     |66 |-5138930048185086175|
|3     |25 |805582856291361761  |
|4     |17 |-9107885086776983000|
|5     |18 |1993794295881733178 |
|6     |31 |-2867434050463300064|
|7     |88 |-4692317993930338046|
|8     |44 |-4039776869915039812|
|9     |20 |-5786627276152563542|
|10    |12 |7614363703260494022 |
+------+---+--------------------+

Desired output:

+--------+----+--------------------+
|iD      |val |Code                |
+--------+----+--------------------+
|001     |044 |9022070536692784476 |
|002     |066 |-5138930048185086175|
|003     |025 |805582856291361761  |
|004     |017 |-9107885086776983000|
|005     |018 |1993794295881733178 |
|006     |031 |-2867434050463300064|
|007     |088 |-4692317993930338046|
|008     |044 |-4039776869915039812|
|009     |020 |-5786627276152563542|
|0010    |012 |7614363703260494022 |
+--------+----+--------------------+
This solved it for me, thank you all for the help:

val df2 = df
  .withColumn("idLong", format_string("%03d", $"iD"))
You can simply do that by using the concat inbuilt function:

df.withColumn("iD", concat(lit("00"), col("iD")))
  .withColumn("val", concat(lit("0"), col("val")))
Spark scala join RDD between 2 datasets
Suppose I have two datasets as follows:

Dataset 1:

id, name, score
1, Bill, 200
2, Bew, 23
3, Amy, 44
4, Ramond, 68

Dataset 2:

id, message
1, i love Bill
2, i hate Bill
3, Bew go go !
4, Amy is the best
5, Ramond is the wrost
6, Bill go go
7, Bill i love ya
8, Ramond is Bad
9, Amy is great

I want to join the above two datasets and count, for each name in dataset1, how many times it appears in dataset2. The result should be:

Bill, 4
Ramond, 2
..
..

I managed to join both of them together but am not sure how to count how many times each person appears. Any suggestion would be appreciated.

Edited: my join code:

val rdd = sc.textFile("dataset1")
val rdd2 = sc.textFile("dataset2")
val rddPair1 = rdd.map { x =>
  var data = x.split(",")
  new Tuple2(data(0), data(1))
}
val rddPair2 = rdd2.map { x =>
  var data = x.split(",")
  new Tuple2(data(0), data(1))
}

rddPair1.join(rddPair2).collect().foreach(f => {
  println(f._1 + " " + f._2._1 + " " + f._2._2)
})
Using RDDs, achieving the solution you desire would be complex; not so much using dataframes. The first step would be to read the two files you have into dataframes as below:

val df1 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset1")
val df2 = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", true)
  .load("dataset2")

so that you should be having

df1

+---+------+-----+
|id |name  |score|
+---+------+-----+
|1  |Bill  |200  |
|2  |Bew   |23   |
|3  |Amy   |44   |
|4  |Ramond|68   |
+---+------+-----+

df2

+---+-------------------+
|id |message            |
+---+-------------------+
|1  |i love Bill        |
|2  |i hate Bill        |
|3  |Bew go go !        |
|4  |Amy is the best    |
|5  |Ramond is the wrost|
|6  |Bill go go         |
|7  |Bill i love ya     |
|8  |Ramond is Bad      |
|9  |Amy is great       |
+---+-------------------+

join, groupBy and count should give your desired output:

df1.join(df2, df2("message").contains(df1("name")), "left").groupBy("name").count().as("count").show(false)

The final output would be:

+------+-----+
|name  |count|
+------+-----+
|Ramond|2    |
|Bill  |4    |
|Amy   |2    |
|Bew   |1    |
+------+-----+