I have the following dataframe, which consists of the vertices (index) and all the edges of each vertex:
Spark version: 2.4 / GraphFrames 0.6
|index| temp_index|
|364|[16, 28, 169..|
| 18|[18, 19, 45...|
|362|[3, 21, 22,...|
| 64|[39, 64, 211..|
| 82|[35, 43, 46...|
|281|[2, 91, 102...|
I want to create a graph out of it using Spark. The method I used was to explode all the lists and then rename the columns, thus creating all the edges.
from pyspark.sql.functions import explode
from graphframes import GraphFrame
column = ['temp_index']
vertices = df.drop(*column).withColumnRenamed('index', 'id')
df = df.withColumn('temp_index', explode('temp_index'))
edges = df.withColumnRenamed('index', 'src').withColumnRenamed('temp_index', 'dst')
g = GraphFrame(vertices, edges)
For small datasets it works fine, but for large datasets the explode function is pretty slow (each of these lists contains up to 1,000,000 edges). Is there a way to make it more efficient?
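For readers unfamiliar with explode, here is a plain-Python sketch (no Spark involved) of the transformation the code above performs. The sample rows are cut down from the truncated table above, and the name explode_edges is made up for illustration. It only shows why the result has one row per edge, which is the part that grows with list length; it is not a faster replacement.

```python
# Hypothetical sample rows shaped like the (index, temp_index) dataframe above.
adjacency = [
    (364, [16, 28, 169]),
    (18, [18, 19, 45]),
]

def explode_edges(rows):
    """Yield one (src, dst) pair per neighbour, like explode followed by the renames."""
    for src, neighbours in rows:
        for dst in neighbours:
            yield (src, dst)

edges = list(explode_edges(adjacency))    # one entry per edge
vertices = [src for src, _ in adjacency]  # the 'id' column
print(edges)
```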
Is there a way to map RDD as
covidRDD = sc.textFile("us-states.csv") \
.map(lambda x: x.split(","))
#reducing states and cases by key
reducedCOVID = covidRDD.reduceByKey(lambda accum, n:accum+n)
The dataset consists of 1 column of states and 1 column of cases. When it's created, it is read as
[[u'Washington', u'1'],...]
Thus, I want to have a column of strings and a column of ints. I am doing a project on RDDs, so I want to avoid using DataFrames. Any thoughts?
As the dataset contains key-value pairs, use groupByKey and aggregate the counts.
If you have a dataset like [['WH', 10], ['TX', 5], ['WH', 2], ['IL', 5], ['TX', 6]],
the code below gives this output: [('IL', 5), ('TX', 11), ('WH', 12)]
data.groupByKey().map(lambda row: (row[0], sum(row[1]))).collect()
You can also use aggregateByKey with a user-defined function. This method takes three parameters: the initial (zero) value, the aggregation function within a partition, and the aggregation function across partitions.
This code produces the same result as above:
def addValues(a, b):
    return a + b
data.aggregateByKey(0, addValues, addValues).collect()
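The aggregations above assume the values are already numbers, whereas split(",") leaves them as strings. A minimal sketch of the missing parsing step follows, assuming each line has exactly two fields (state, cases) and no header row. The names parse_line and reduce_by_key are hypothetical, and reduce_by_key is only a local stand-in so the example runs without Spark.

```python
def parse_line(line):
    """Split 'state,cases' and cast the count to int."""
    state, cases = line.split(",")
    return (state.strip(), int(cases))

def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey, for illustration only."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

# Made-up sample lines in the same shape as the question's file.
lines = ["Washington,1", "Texas,5", "Washington,2"]
totals = reduce_by_key(map(parse_line, lines), lambda accum, n: accum + n)
print(totals)
```

With Spark, the same parse_line could be passed to map, as in sc.textFile("us-states.csv").map(parse_line).reduceByKey(lambda accum, n: accum + n), so the keys stay strings and the values are summed as ints.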
I have read nearly 100 CSV files into one RDD
rdd = sc.textFile("file:///C:/Users/pinjala/Documents/Python Scripts/Files_1/*.csv")
I want to find the min and max of each column in the RDD (nearly 100 columns).
Can someone suggest how I can find the min and max of the different columns of an RDD?
When I call rdd.collect(), I see the RDD as a list with the column names in the first element and the values of each column in the remaining elements.
It would have been better if you had given some sample data.
Anyway, I simulated some, and here is the code:
new_list = []
list_p = [['John', 19, 1, 9, 20, 68], ['Jack', 3, 2, 5, 12, 99]]  # list of lists
rdd = sc.parallelize(list_p)  # build an RDD
print(rdd.collect())  # [['John', 19, 1, 9, 20, 68], ['Jack', 3, 2, 5, 12, 99]]
for p in list_p:
    header = p[0]
    values = sc.parallelize(p[1:])  # skip the name so only numbers are compared
    min_p = values.min()
    max_p = values.max()
    new_list.append('[{},{},{}]'.format(header, min_p, max_p))
print(new_list)  # ['[John,1,68]', '[Jack,2,99]']
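The loop above reports the min and max within each record. If the goal is one min and max per column, as the question describes, a possible sketch is to carry a (mins, maxs) pair per row and merge pairs element-wise. This assumes the header row has been removed and every remaining field is already numeric. The names to_min_max and merge are hypothetical, and functools.reduce stands in for the RDD so the example runs without Spark.

```python
from functools import reduce

def to_min_max(row):
    """Start each row as its own (mins, maxs) pair."""
    return (list(row), list(row))

def merge(a, b):
    """Combine two (mins, maxs) pairs column by column."""
    mins = [min(x, y) for x, y in zip(a[0], b[0])]
    maxs = [max(x, y) for x, y in zip(a[1], b[1])]
    return (mins, maxs)

# The numeric part of the simulated rows above.
rows = [[19, 1, 9, 20, 68], [3, 2, 5, 12, 99]]
col_mins, col_maxs = reduce(merge, map(to_min_max, rows))
print(col_mins)  # [3, 1, 5, 12, 68]
print(col_maxs)  # [19, 2, 9, 20, 99]
```

On an RDD of numeric rows the same two functions could be used as rdd.map(to_min_max).reduce(merge), which makes a single pass over the data instead of one job per column.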
I have a Class column which can be 1, 2 or 3, and another column Age with some missing data. I want to impute the average Age of each Class group.
I want to do something along these lines:
grouped_data = df.groupBy('Class')
imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
Is there any workaround to that?
Thanks for your time
Using Imputer, you can filter down the dataset to each Class value, impute the mean, and then join them back, since you know ahead of time what the values can be:
from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col
subsets = []
for i in range(1, 4):
    imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
    subset_df = df.filter(col('Class') == i)
    imputed_subset = imputer.fit(subset_df).transform(subset_df)
    subsets.append(imputed_subset)  # keep each imputed subset for the union
# Union them together
# If you only have 3 just do it without a loop
imputed_df = subsets[0].unionByName(subsets[1]).unionByName(subsets[2])
If you don't know ahead of time what the values are, or if they're not easily iterable, you can groupBy, get the average values for each group as a DataFrame, and then coalesce join that back onto your original dataframe.
import pyspark.sql.functions as F
averages = df.groupBy("Class").agg(F.avg("Age").alias("avgAge"))
df_with_avgs = df.join(averages, on="Class")
imputed_df = df_with_avgs.withColumn("imputedAge", F.coalesce("Age", "avgAge"))
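To make the groupBy, join, and coalesce steps concrete, here is a plain-Python sketch of the same logic on a handful of made-up rows (no Spark involved). The function name impute_group_mean and the sample data are hypothetical.

```python
from collections import defaultdict

def impute_group_mean(rows, group_key="Class", value_key="Age"):
    """Return copies of rows with an 'imputedAge' field filled from the group mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[value_key] is not None:  # missing values are left out of the average
            sums[row[group_key]] += row[value_key]
            counts[row[group_key]] += 1
    averages = {g: sums[g] / counts[g] for g in counts}
    result = []
    for row in rows:
        value = row[value_key]
        # Same idea as coalesce: keep the value if present, else use the group average.
        filled = value if value is not None else averages.get(row[group_key])
        result.append({**row, "imputedAge": filled})
    return result

rows = [
    {"Class": 1, "Age": 20.0},
    {"Class": 1, "Age": None},
    {"Class": 1, "Age": 40.0},
    {"Class": 2, "Age": 10.0},
    {"Class": 2, "Age": None},
]
imputed = impute_group_mean(rows)
print(imputed[1]["imputedAge"])  # 30.0
```

A group whose values are all missing has no average, so its rows stay unfilled in this sketch.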
You need to transform your dataframe with the fitted model, then take the average of the filled data:
from pyspark.ml.feature import Imputer
from pyspark.sql import functions as F
imputer = Imputer(inputCols=['Age'], outputCols=['imputed_Age'])
imp_model = imputer.fit(df)
transformed_df = imp_model.transform(df)
transformed_df \
    .groupBy('Class') \
    .agg(F.avg('imputed_Age'))
Using Spark 2.1.1, I have an N-row CSV as 'fileInput':
colname datatype elems start end
colA float 10 0 1
colB int 10 0 9
I have successfully made an array of sql.rows ...
val df = spark.read.format("com.databricks.spark.csv").option("header", "true").load(fileInput)
val rowCnt:Int = df.count.toInt
val aryToUse = df.take(rowCnt)
Array[org.apache.spark.sql.Row] = Array([colA,float,10,0,1], [colB,int,10,0,9])
Against those Rows and using my random-value-generator scripts, I have successfully populated an empty ListBuffer[Any] ...
res170: scala.collection.mutable.ListBuffer[Any] = ListBuffer(List(0.24455154, 0.108798146, 0.111522496, 0.44311434, 0.13506883, 0.0655781, 0.8273762, 0.49718297, 0.5322746, 0.8416396), List(1, 9, 3, 4, 2, 3, 8, 7, 4, 6))
Now, I have a mixed-type ListBuffer[Any] with different typed lists.
How do I iterate through and zip these? [Any] seems to defy mapping/zipping. I need to take the N lists generated from the inputFile's definitions, then save them to a CSV file. The final output should be:
ColA, ColB
0.24455154, 1
0.108798146, 9
0.111522496, 3
... etc
The inputFile can then be used to create any number of 'colname's, of any 'datatype' (I have scripts for that), with each type appearing 1::n times, and with any number of rows (defined as 'elems'). My random-generating scripts customize the values per 'start' and 'end', but these columns are not relevant to this question.
Given a List[List[Any]], you can "zip" all these lists together using transpose, if you don't mind the result being a list-of-lists instead of a list of Tuples:
val result: Seq[List[Any]] = list.transpose
If you then want to write this into a CSV, you can start by mapping each "row" into a comma-separated String:
val rows: Seq[String] = result.map(_.mkString(","))
(note: I'm ignoring the Apache Spark part, which seems completely irrelevant to this question... the "metadata" is loaded via Spark, but then it's collected into an Array so it becomes irrelevant)
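The transpose-then-join idea above can be sketched in Python as follows, where zip(*columns) plays the role of transpose. This is an illustration of the technique in a different language, not the Scala code from the answer. The column values are the first three of each list from the question, and the output goes to an in-memory buffer rather than a file.

```python
import csv
import io

# One list per generated column, mixed types allowed.
columns = [
    [0.24455154, 0.108798146, 0.111522496],
    [1, 9, 3],
]
header = ["ColA", "ColB"]

# zip(*columns) turns a list of columns into a list of rows.
rows = list(zip(*columns))

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that zip stops at the shortest column, so this assumes every column has the same number of elements.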
I think the RDD.zipWithUniqueId() or RDD.zipWithIndex() methods can do what you want.
Please refer to the official documentation for more information. Hope this helps.
I tried the standard Spark HashingTF example on Databricks.
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
val sentenceData = spark.createDataFrame(Seq(
(0, "Hi I heard about Spark"),
(0, "I wish Java could use case classes"),
(1, "Logistic regression models are neat")
)).toDF("label", "sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
I have difficulty understanding the result below.
Please see the image
When numFeatures is 20
If [0,5,9,17] are hash values
and [1,1,1,2] are frequencies.
17 has frequency 2
9 has 3 (it has 2)
13,15 have 1 while they must have 2.
I am probably missing something. I could not find documentation with a detailed explanation.
As mcelikkaya notes, the output frequencies are not what you would expect. This is due to hash collisions for a small number of features, 20 in this case. I have added some words to the input data (for illustration purposes) and upped features to 20,000, and then the correct frequencies are produced:
|label|sentence |words |rawFeatures |
|0 |Hi hi hi hi I i i i i heard heard heard about Spark Spark|[hi, hi, hi, hi, i, i, i, i, i, heard, heard, heard, about, spark, spark]|(20000,[3105,9357,11777,11960,15329],[2.0,3.0,1.0,4.0,5.0]) |
|0 |I i wish Java could use case classes spark |[i, i, wish, java, could, use, case, classes, spark] |(20000,[495,3105,3967,4489,15329,16213,16342,19809],[1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0])|
|1 |Logistic regression models are neat |[logistic, regression, models, are, neat] |(20000,[286,1193,9604,13138,18695],[1.0,1.0,1.0,1.0,1.0]) |
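The collision effect can be reproduced with a small feature-hashing sketch in plain Python. This is an illustration only: it uses zlib.crc32 as a stand-in hash, so the indices will not match Spark's (HashingTF uses MurmurHash3), and the function name hashing_tf is hypothetical.

```python
import zlib
from collections import Counter

def hashing_tf(words, num_features):
    """Map each word to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for word in words:
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(sorted(counts.items()))

words = "hi i heard about spark".split()  # five distinct words

# With fewer buckets than distinct words, at least two words must share a bucket,
# so some bucket reports a frequency that no single word has.
small = hashing_tf(words, 4)

# With many buckets, collisions become unlikely (though never impossible).
large = hashing_tf(words, 1 << 18)

print(small)
print(large)
```

In both cases the counts still add up to the number of tokens; a collision only changes which bucket the counts land in.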
Your guesses are correct:
20 is the vector size
the first list is a list of indices
the second list is a list of values
The leading 0 is just an artifact of the internal representation.
There is nothing more here to learn.
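To make the (size, indices, values) reading concrete, here is a small Python sketch that expands such a triple into a dense list. The helper name to_dense is hypothetical, and the numbers are the ones quoted in the question.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for index, value in zip(indices, values):
        dense[index] = float(value)
    return dense

dense = to_dense(20, [0, 5, 9, 17], [1.0, 1.0, 1.0, 2.0])
print(dense[17])  # 2.0
```

Every position not listed in the indices is zero, which is why only the non-zero entries are stored.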