Reading the second word of every line using Spark Scala

I want to read/print the second word of every line.
input->>people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.
output->>
are
they
are
they

Please check this:
import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Read each line as a single-column DataFrame, then keep the second
// whitespace-separated token of every line.
val myDF = spark.read.text("<path>")
val rdd = myDF.rdd.map(_.mkString("")).map(f => Row(f.split(" ")(1)))
val schema: StructType = (new StructType).add("values", StringType)
val result = spark.createDataFrame(rdd, schema)
result.show()
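As a side note, the same output can presumably also be produced without dropping to the RDD API, by splitting the value column that spark.read.text provides; a rough sketch, not tested against your data:

import org.apache.spark.sql.functions.{col, split}

// Split each line on spaces and keep the element at index 1 (the second word).
// "value" is the default column name produced by spark.read.text.
val result2 = spark.read.text("<path>")
  .select(split(col("value"), " ").getItem(1).as("values"))
result2.show()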

Related

Why the difference when importing a CSV with Spark?

I have this CSV file, payments.csv, and for some particular rows the timestamp changes by itself on import. The first three rows are shown in the attached screenshots for easier understanding.
import spark.implicits._
import org.apache.spark.sql.functions.{col, when, to_date, row_number, date_add, expr}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Import the CSV with header and schema inference
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("payment.csv")

// Keep the rows for one payment and inspect the value in column index 5
val df2 = df.filter($"payment_id" === 21112)
df2.show()
val time_value = df2.collect()(0)(5)
println(time_value)
I am clueless about this as of now.
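One way to narrow this kind of issue down is to read the same file once without inferSchema, so every column stays a string, and compare the raw value with the inferred one. A rough sketch; "payment_ts" is only a placeholder for the actual timestamp column name:

// Read the same file with and without schema inference and compare the raw
// string with the inferred value for the suspect row.
// "payment_ts" is a placeholder column name; substitute the real one.
val rawDf = spark.read.option("header", "true").csv("payment.csv")
val typedDf = spark.read.option("header", "true").option("inferSchema", "true").csv("payment.csv")

rawDf.filter($"payment_id" === 21112).select("payment_ts").show(false)
typedDf.filter($"payment_id" === 21112).select("payment_ts").show(false)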

Spark dataframe join is failing if the key column contains a period (".") at the end

I am getting the exception below when I do a join between two dataframes in Spark (version 1.5, Scala 2.10).
Exception in thread "main" org.apache.spark.sql.AnalysisException: syntax error in attribute name: col1.;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:118)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:182)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:653)
at com.nielsen.buy.integration.commons.Demo$.main(Demo.scala:62)
at com.nielsen.buy.integration.commons.Demo.main(Demo.scala)
The code works fine if the column in the dataframe does not contain a period. Please do help me out.
You can find the code that I am using below.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import com.google.gson.Gson
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row
object Demo {
  lazy val sc: SparkContext = {
    val conf = new SparkConf().setMaster("local")
      .setAppName("demooo")
      .set("spark.driver.allowMultipleContexts", "true")
    new SparkContext(conf)
  }
  sc.setLogLevel("ERROR")
  lazy val sqlcontext = new SQLContext(sc)

  val data = List(Row("a", "b"), Row("v", "b"))
  val dataRdd = sc.parallelize(data)
  val schema = new StructType(Array(StructField("col.1", StringType, true), StructField("col2", StringType, true)))
  val df1 = sqlcontext.createDataFrame(dataRdd, schema)

  val data2 = List(Row("a", "b"), Row("v", "b"))
  val dataRdd2 = sc.parallelize(data2)
  val schema2 = new StructType(Array(StructField("col3", StringType, true), StructField("col4", StringType, true)))
  val df2 = sqlcontext.createDataFrame(dataRdd2, schema2)

  val val1 = "col.1"
  val df3 = df1.join(df2, df1.col(val1).equalTo(df2.col("col3")), "outer").show
}
In general, a period is used to access members of a struct field.
The Spark version you are using (1.5) is relatively old. Several such issues were fixed in later versions, so upgrading might just solve the issue.
That said, you can simply use withColumnRenamed to rename the column to something that does not contain a period before the join.
So you basically do something like this:
val dfTmp = df1.withColumnRenamed(val1, "JOIN_COL")
val df3 = dfTmp.join(df2, dfTmp.col("JOIN_COL").equalTo(df2.col("col3")), "outer")
  .withColumnRenamed("JOIN_COL", val1)
df3.show
By the way, show returns Unit, so in your original code you probably meant df3 to be the join expression without .show and then call df3.show separately.
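For reference, newer Spark versions also let you escape the whole name in backticks so the dot is not parsed as struct-field access; whether that already works on 1.5 I cannot say, so treat this as a sketch:

// Backticks make Spark treat the whole string as a literal column name
// rather than struct-field access (newer versions; may not help on 1.5).
val df3 = df1.join(df2, df1.col("`col.1`").equalTo(df2.col("col3")), "outer")
df3.show()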

One simple Spark program in Scala: println out all the elements in the RDD

I wrote one simple Spark program in Eclipse, and I want to println out all the elements in the RDD:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(List(1, 2, 3, 4, 5))
    data.collect().foreach(println)
    sc.stop()
  }
}
And the result is like this:
<console>:16: error: not found: value sc
val data = sc.parallelize(List(1,2,3,4,5));
I searched and tried more than three solutions but still cannot solve this. Can anyone help me with this? Thanks a lot!
I don't know the exact cause of whatever is troubling you, since you don't mention how you set it all up, but you said that you can run it in spark-shell on Linux, so it's not about the code; it's most likely about the config and setup.
Perhaps my short guide can help you. It's minimalistic, but it's all I had to do in order to get the Spark "hello world" to run in Eclipse.
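As an aside, and independent of the setup question: a SparkConf for a standalone program also needs an application name, otherwise SparkContext creation fails with "An application name must be set in your configuration". A minimal self-contained version would look roughly like this:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // setAppName is required when building the context yourself
    val conf = new SparkConf().setMaster("local").setAppName("WordCount")
    val sc = new SparkContext(conf)
    val data = sc.parallelize(List(1, 2, 3, 4, 5))
    data.collect().foreach(println)
    sc.stop()
  }
}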

How to get file names with spark sc.textFile?

I am reading a directory of files using the following code:
val data = sc.textFile("/mySource/dir1/*")
Now my data RDD contains all rows of all files in the directory (right?).
I now want to add to each row a column with the source file's name. How can I do that?
The other option I tried is wholeTextFiles, but I keep getting out-of-memory exceptions.
5 servers, 24 cores, 24 GB (executor-cores 5, executor-memory 5G).
Any ideas?
You can use this code; I have tested it with Spark 1.4 and 1.5.
It gets the file name from the input split and attaches it to each line via the iterator, using the mapPartitionsWithInputSplit method of NewHadoopRDD.
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text

// An application name is required to create a SparkContext; "filenames" is arbitrary.
val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("filenames"))
val fc = classOf[TextInputFormat]
val kc = classOf[LongWritable]
val vc = classOf[Text]
val path: String = "file:///home/user/test"

// Read through the new Hadoop API so the input split (and thus the source file) is available
val text = sc.newAPIHadoopFile(path, fc, kc, vc, sc.hadoopConfiguration)
val linesWithFileNames = text.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val file = inputSplit.asInstanceOf[FileSplit]
    iterator.map(tup => (file.getPath, tup._2))
  })
linesWithFileNames.foreach(println)
I think it's pretty late to answer this question but I found an easy way to do what you were looking for:
Step 0: from pyspark.sql import functions as F
Step 1: createDataFrame using the RDD as usual. Let's say df
Step 2: Use input_file_name()
df.withColumn("INPUT_FILE", F.input_file_name())
This will add a column with the source file name to your DataFrame.
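The same function exists in the Scala API on newer Spark versions, so an equivalent Scala sketch would be something like:

import org.apache.spark.sql.functions.input_file_name

// input_file_name() returns, for each row, the name of the file it was read from.
val dfWithSource = spark.read.text("/mySource/dir1/*")
  .withColumn("INPUT_FILE", input_file_name())
dfWithSource.show(false)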

getOrElse method not being found in Scala Spark

I am attempting to follow an example in Sandy Ryza's book Advanced Analytics with Spark, coding in IntelliJ. Below I seem to have imported all the right libraries, but why is it not recognizing getOrElse?
Error:(84, 28) value getOrElse is not a member of org.apache.spark.rdd.RDD[String]
bArtistAlias.value.getOrElse(artistID, artistID)
^
Code:
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd._
import org.apache.spark.rdd.PairRDDFunctions
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.recommendation._
val trainData = rawUserArtistData.map { line =>
  val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
  val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
  Rating(userID, finalArtistID, count)
}.cache()
I can only make an assumption, as the code listed is missing pieces, but my guess is that bArtistAlias is supposed to be a Map that SHOULD be broadcast, but isn't.
I went and found the piece of code in Sandy's book and it corroborates my guess. So, you seem to be missing this piece:
val bArtistAlias = sc.broadcast(artistAlias)
I am not even sure what you did without seeing the rest of the code, but it looks like you broadcast an RDD[String], hence the error. This would not even work anyway, as you cannot use one RDD inside another RDD.
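For completeness, a hypothetical sketch of how artistAlias itself would be built before that broadcast call; rawArtistAlias is an assumed RDD[String] of tab-separated id pairs, as in the book, so adjust the parsing to your actual data:

// Hypothetical sketch: collect the aliases into a plain Map on the driver,
// so that what gets broadcast is a Map and getOrElse resolves on it.
// rawArtistAlias is an assumed RDD[String] of tab-separated id pairs.
val artistAlias: scala.collection.Map[Int, Int] = rawArtistAlias.flatMap { line =>
  val tokens = line.split('\t')
  if (tokens(0).isEmpty) None else Some((tokens(0).toInt, tokens(1).toInt))
}.collectAsMap()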