How to filter dynamically in Scala? - scala

I have a raw log file of about 1 TB, with lines like these:
Test X1 SET WARN CATALOG MAP1,MAP2
INFO X2 SET WARN CATALOG MAPX,MAP2,MAP3
I read the log file with Spark in Scala and build an RDD from it.
I need to keep only the lines that contain all of the following:
1. SET
2. INFO
3. CATALOG
I wrote the filter like this:
val filterRdd = rdd.filter(f => f.contains("SET")).filter(f => f.contains("INFO")).filter(f => f.contains("CATALOG"))
Can we do the same if these keywords are held in a list, so that the filter is built dynamically instead of writing one call per keyword? The example uses only three keywords, but in practice there are up to 15.

Something like this could work when you require all words to appear in a line:
val words = Seq("SET", "INFO", "CATALOG")
val filterRdd = rdd.filter(f => words.forall(w => f.contains(w)))
and if you want any:
val filterRdd = rdd.filter(f => words.exists(w => f.contains(w)))
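To see the difference between the two predicates without a cluster, here is a minimal sketch on a plain Scala Seq. The object name, the helper names and the third sample line are made up for illustration; the same lambdas can be passed to rdd.filter.

```scala
object KeywordFilterSketch {
  // Hypothetical sample lines modeled on the log lines in the question.
  val lines: Seq[String] = Seq(
    "Test X1 SET WARN CATALOG MAP1,MAP2",
    "INFO X2 SET WARN CATALOG MAPX,MAP2,MAP3",
    "DEBUG X3 GET"
  )

  val words: Seq[String] = Seq("SET", "INFO", "CATALOG")

  // True only if the line contains every keyword.
  def containsAll(line: String): Boolean = words.forall(w => line.contains(w))

  // True if the line contains at least one keyword.
  def containsAny(line: String): Boolean = words.exists(w => line.contains(w))

  val allMatches: Seq[String] = lines.filter(containsAll)
  val anyMatches: Seq[String] = lines.filter(containsAny)

  def main(args: Array[String]): Unit = {
    println(s"all keywords: $allMatches")
    println(s"any keyword:  $anyMatches")
  }
}
```

Adding a 15th keyword then only means adding one element to words.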

Related

NullPointerException when reading a file with RowCsvInputFormat in Flink

I am a beginner with Flink streaming.
When reading a file with RowCsvInputFormat, the code in which the Kryo serializer creates the Row does not work properly.
The code is below.
val readLocalCsvFile = new RowCsvInputFormat(
  new Path("flink-test/000000_1"),
  Array(Types.STRING, Types.STRING, Types.STRING),
  "\n",
  ","
)
val read = env.readFile(
  readLocalCsvFile,
  "flink-test/000000_1",
  FileProcessingMode.PROCESS_CONTINUOUSLY,
  1000000)
read.print()
env.execute("test")
The contents of the file 000000_1 are as follows.
aa,bb,cc
aaa,bbb,ccc
While debugging, I can see that the line is split correctly into aa, bb, and cc. But when those values are put into the Row's fields one by one, a NullPointerException is raised because the fields are null (the debugger shows the fields of the Row as null).
The code that creates a Row when the above code is executed is as follows. KryoSerializer generates the row.
val kryo = new EmptyFlinkScalaKryoInstantiator().newKryo
val Row = kryo.newInstance(classOf[Row])
The output error is as follows.
java.lang.NullPointerException
at org.apache.flink.types.Row.setField(Row.java:140)
at org.apache.flink.api.java.io.RowCsvInputFormat.fillRecord(RowCsvInputFormat.java:162)
at org.apache.flink.api.java.io.RowCsvInputFormat.fillRecord(RowCsvInputFormat.java:33)
at org.apache.flink.api.java.io.CsvInputFormat.readRecord(CsvInputFormat.java:113)
at org.apache.flink.api.common.io.DelimitedInputFormat.nextRecord(DelimitedInputFormat.java:551)
at org.apache.flink.api.java.io.CsvInputFormat.nextRecord(CsvInputFormat.java:80)
at org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator.readAndCollectRecord(ContinuousFileReaderOperator.java:387)
at
Could you post the complete code?
Judging from the error reported by the task, the cause may be that the number of fields does not match.

How to union multiple dynamic inputs in Palantir Foundry?

I want to union multiple datasets in Palantir Foundry. The dataset names are dynamic, so I cannot pass them to transform_df() statically. Is there a way to take multiple inputs into transform_df dynamically and union all of those dataframes?
I tried looping over the datasets like this:
li = ['dataset1_path', 'dataset2_path']
union_df = None
for p in li:
    @transform_df(
        Output(p + "_output"),
        my_input=Input(p),
    )
    def my_compute_function(my_input):
        return my_input
    if union_df is None:
        union_df = my_compute_function
    else:
        union_df = union_df.union(my_compute_function)
But this doesn't generate the unioned output.
This should work for you with some changes. It is an example of dynamic datasets built from JSON files, so your situation may differ a little. The pattern below generates transforms for a set of JSON files uploaded to a dataset node in the platform, and it should be adaptable to other input file types or to datasets internal to Foundry. Doing a union after this should be a simple matter.
There's some bonus logging going on here as well.
Hope this helps.
from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging
def transform_generator():
    transforms = []
    transf_dict = {}  # enter your dynamic mappings here
    for value in transf_dict:
        @transform(
            out=Output(' path to your output here '.format(val=value)),
            inpt=Input(" path to input here ".format(val=value)),
        )
        def update_set(ctx, inpt, out):
            spark = ctx.spark_session
            sc = spark.sparkContext
            filesystem = list(inpt.filesystem().ls())
            file_dates = []
            # read every JSON file in the input dataset
            for files in filesystem:
                with inpt.filesystem().open(files.path) as fi:
                    data = json.load(fi)
                file_dates.append(data)
            logging.info('info logs:')
            logging.info(file_dates)
            json_object = json.dumps(file_dates)
            df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
            df_2 = df_2.withColumn('upload_date', F.current_date())
            # drop_duplicates returns a new DataFrame, so keep the result
            df_2 = df_2.drop_duplicates()
            out.write_dataframe(df_2)
        # register the transform that was just defined
        transforms.append(update_set)
    return transforms
TRANSFORMS = transform_generator()
So this question breaks down into two questions.
How to handle transforms with programmatic input paths
To handle transforms with programmatic inputs, it is important to remember two things:
1st - Transforms determine their inputs and outputs at CI time. This means you can have Python code that generates transforms, but you cannot read paths from a dataset; they need to be hardcoded in the Python code that generates the transform.
2nd - Your transforms are created once, during the CI execution. This means you can't have an increment or special logic that generates different paths whenever the dataset builds.
With these two premises, like in your example or @jeremy-david-gamet's (ty for the reply, gave you a +1), you can have Python code that generates your paths at CI time.
dataset_paths = ['dataset1_path', 'dataset2_path']
for path in dataset_paths:
    @transform_df(
        Output(f"{path}_output"),
        my_input=Input(path),
    )
    def my_compute_function(my_input):
        return my_input
However, to union them you'll need a second transform that executes the union. It has to take multiple inputs, so you can use *args or **kwargs for this:
import pyspark.sql
dataset_paths = ['dataset1_path', 'dataset2_path']
all_args = [Input(path) for path in dataset_paths]
all_args.append(Output("path/to/unioned_dataset"))
@transform_df(*all_args)
def my_compute_function(*args):
    input_dfs = []
    for arg in args:
        # there are other arguments like ctx in the args list, so we need to check for type. You can also use kwargs for more determinism.
        if isinstance(arg, pyspark.sql.DataFrame):
            input_dfs.append(arg)
    # now that you have your dfs in a list you can union them
    # Note I didn't test this code, but it should be something like this
    ...
How to union datasets with different schemas.
For this part there are plenty of Q&A out there on how to union different dataframes in spark. Here is a short code example copied from https://stackoverflow.com/a/55461824/26004
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row
def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)
    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended
Since inputs and outputs are determined at CI time, we cannot form true dynamic inputs. We will have to somehow point to specific datasets in the code. Assuming the paths of datasets share the same root, the following seems to require minimum maintenance:
from transforms.api import transform_df, Input, Output
from functools import reduce
datasets = [
    'dataset1',
    'dataset2',
    'dataset3',
]
inputs = {f'inp{i}': Input(f'input/folder/path/{x}') for i, x in enumerate(datasets)}
kwargs = {
    **{'output': Output('output/folder/path/unioned_dataset')},
    **inputs
}
@transform_df(**kwargs)
def my_compute_function(**inputs):
    unioned_df = reduce(lambda df1, df2: df1.unionByName(df2), inputs.values())
    return unioned_df
Regarding unions of different schemas, since Spark 3.1 one can use this:
df1.unionByName(df2, allowMissingColumns=True)

Scala Spark loop goes through without any error, but does not produce an output

I have a file in HDFS containing paths of various other files. Here is the file called file1:
path/of/HDFS/fileA
path/of/HDFS/fileB
path/of/HDFS/fileC
.
.
.
I am using a for loop in Scala Spark as follows to read each line of the above file and process it in another function:
val lines = Source.fromFile("path/to/file1.txt").getLines.toList
for (i <- lines) {
  i.toString()
  val firstLines = sc.hadoopFile(i, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).flatMap {
    case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
  }
}
When I run the above loop, it runs through without returning any errors and I get the Scala prompt on a new line: scala>
However, when I try to see a few lines of the output that should be stored in firstLines, it does not work:
scala> firstLines
<console>:38: error: not found: value firstLines
firstLines
^
Why does the loop run through without any errors and yet produce no output?
Additional info
The function hadoopFile accepts a String path name as its first parameter. That is why I am trying to pass each line of file1 (each line is a path name) as a String in the first parameter i. The flatMap functionality is taking the first line of the file that has been passed to hadoopFile and stores that alone and dumps all the other lines. So the desired output (firstLines) should be the first line of all the files that are being passed to hadoopFile through their path names (i).
I tried running the function for just a single file, without a loop, and that produces the output:
val firstLines = sc.hadoopFile("path/of/HDFS/fileA", classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).flatMap {
  case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
}
scala> firstLines.take(3)
res27: Array[String] = Array(<?xml version="1.0" encoding="utf-8"?>)
fileA is an XML file, so you can see the resulting first line of that file. So I know the function works fine, it is just a problem with the loop that I am not able to figure out. Please help.
The variable firstLines is defined in the body of the for loop and its scope is therefore limited to this loop. This means you cannot access the variable outside of the loop, and this is why the Scala compiler tells you error: not found: value firstLines.
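The scoping rule can be seen without Spark. The sketch below uses a made-up firstLineOf function as a stand-in for reading a file, and hypothetical path names; it only illustrates why a val declared in a loop body is not visible afterwards, and how yielding makes the result available.

```scala
object LoopScopeSketch {
  // Hypothetical stand-in for "read the first line of this file".
  def firstLineOf(path: String): String = s"first line of $path"

  val paths: List[String] = List("fileA", "fileB")

  // A val declared inside the loop body disappears when the loop ends:
  //   for (p <- paths) { val firstLines = firstLineOf(p) }
  //   firstLines   // does not compile: not found: value firstLines

  // With yield (or map) the loop becomes an expression, and its result
  // can be bound to a name in the enclosing scope.
  val firstLines: List[String] = for (p <- paths) yield firstLineOf(p)

  def main(args: Array[String]): Unit = firstLines.foreach(println)
}
```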
From your description, I understand that you want to collect the first line of every file listed in lines.
The "every" here can translate into different constructs in Scala. We can use something like the for loop you wrote or, even better, adopt a functional approach and apply a map function to the list of files. In the code below, I put inside the map the code from your description, which creates a HadoopRDD and applies flatMap with your function to retrieve the first line of a file.
We then obtain a list of RDD[String]. At this stage, note that we have not started to do any actual work. To trigger the evaluation of the RDDs and collect the results, we need an additional call to the collect method for each RDD in our list.
// Renamed "lines" to "fileNames" as it is more explicit.
val fileNames = Source.fromFile("path/to/file1.txt").getLines.toList
val firstLinesRDDs = fileNames.map(sc.hadoopFile(_, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).flatMap {
  case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
})
// firstLinesRDDs is a list of RDD[String]. Based on this code, each RDD
// should consist of a single String value. We collect them using RDD#collect:
val firstLines = firstLinesRDDs.map(_.collect)
However, this approach suffers from a flaw that prevents us from benefiting from what Spark can provide.
When we apply the operation in map to fileNames, we are not working with an RDD, so the file names are processed sequentially on the driver (the process that hosts your Spark session) and are not part of a parallelizable Spark job. This is equivalent to running your second block of code one file name at a time.
To address the problem, what can we do? A good thing to keep in mind when working with Spark is to push the declaration of the RDDs as early as possible in the code. Why? Because this allows Spark to parallelize and optimize the work we want to do. Your example could be a textbook illustration of this concept, though the requirement to manipulate files adds some complexity.
In our present case, we can benefit from the fact that hadoopFile accepts a comma-separated list of paths as input. Therefore, instead of sequentially creating an RDD for every file, we create one RDD for all of them:
val firstLinesRDD = sc.hadoopFile(fileNames.mkString(","), classOf[TextInputFormat],classOf[LongWritable],classOf[Text]).flatMap {
case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
}
And we retrieve our first lines with a single collect:
val firstLines = firstLinesRDD.collect

How to mimic the function map.getOrElse on a CSV file

I have a CSV file that represents a Map[String,Int], and I am reading the file as follows:
def convI2N (vkey:Int):String={
  val in = new Scanner("dictionaryNV.csv")
  loop.breakable{
    while (in.hasNext) {
      val nodekey = in.next(',')
      val value = in.next('\n')
      if (value == vkey.toString){
        n=nodekey
        loop.break()}
    }}
  in.close
  n
}
The function returns the String for a given Int. The problem is that I must scan the whole file, and the file is too big, so the procedure is too slow. Someone told me that this is O(n) time complexity and recommended that I move to O(log n). I suppose that the function map.getOrElse is O(log n).
Can someone help me find a way to get better performance from this code?
As an additional comment, the dictionaryNV file is sorted by the Int values.
Maybe I can divide the file by lines, or sets of lines. The CSV has about 167000 tuples [String,Int].
Or, put another way, how do you do some kind of binary search through the CSV in Scala?
If you call the convI2N function many times, the job will definitely be slow, because each call scans the big file. In that case it is better to load the pairs once into an in-memory structure such as a properties object, a HashMap, or a collection of Tuple2, and to change the other code that is eating the memory.
You can try the following, which should be faster than the Scanner approach.
Assuming that your CSV file is comma-separated, like
key1,value1
key2,value2
Source.fromFile can be your solution:
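import scala.io.Source
def convI2N(vkey: Int): String = {
  val source = Source.fromFile("<your path to dictionaryNV.csv>")
  try {
    source.getLines()
      .map(line => line.split(","))
      // column 1 holds the Int value, column 0 the String key
      .find(sline => sline.length > 1 && sline(1).trim == vkey.toString)
      .map(sline => sline(0))
      .getOrElse("not found")
  } finally {
    source.close() // release the file handle
  }
}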
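If the function is called many times, the two ideas raised in this thread (load the file once into a map, or binary-search the rows since they are sorted by the Int column) could look roughly like the sketch below. The sample rows, the object name and the method names are invented for illustration; in practice the rows would come from reading dictionaryNV.csv once.

```scala
import scala.annotation.tailrec

object DictionaryLookupSketch {
  // Hypothetical sample rows in "name,int" form, sorted by the Int column.
  val rows: Vector[String] = Vector("alpha,3", "beta,7", "gamma,12", "delta,20")

  // Parse once into (int, name) pairs; the sort order of the rows is preserved.
  val pairs: Vector[(Int, String)] = rows.map { line =>
    val parts = line.split(",", 2)
    (parts(1).trim.toInt, parts(0))
  }

  // Option 1: build a Map once, so later lookups no longer rescan the data.
  val byValue: Map[Int, String] = pairs.toMap
  def lookupWithMap(vkey: Int): String = byValue.getOrElse(vkey, "not found")

  // Option 2: binary search, which relies on pairs being sorted by the Int.
  def lookupWithBinarySearch(vkey: Int): String = {
    @tailrec
    def go(lo: Int, hi: Int): String =
      if (lo > hi) "not found"
      else {
        val mid = lo + (hi - lo) / 2
        val (value, name) = pairs(mid)
        if (value == vkey) name
        else if (value < vkey) go(mid + 1, hi)
        else go(lo, mid - 1)
      }
    go(0, pairs.length - 1)
  }

  def main(args: Array[String]): Unit = {
    println(lookupWithMap(7))
    println(lookupWithBinarySearch(20))
  }
}
```

With about 167000 rows either structure fits in memory easily, so the file only has to be read once.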

Spark - create RDD of (label, features) pairs from CSV file

I have a CSV file and want to run a simple LinearRegressionWithSGD on the data.
A sample of the data follows (the file has 99 rows in total, including the header); the objective is to predict the y_3 variable:
y_3,x_6,x_7,x_73_1,x_73_2,x_73_3,x_8
2995.3846153846152,17.0,1800.0,0.0,1.0,0.0,12.0
2236.304347826087,17.0,1432.0,1.0,0.0,0.0,12.0
2001.9512195121952,35.0,1432.0,0.0,1.0,0.0,5.0
992.4324324324324,17.0,1430.0,1.0,0.0,0.0,12.0
4386.666666666667,26.0,1430.0,0.0,0.0,1.0,25.0
1335.9036144578313,17.0,1432.0,0.0,1.0,0.0,5.0
1097.560975609756,17.0,1100.0,0.0,1.0,0.0,5.0
3526.6666666666665,26.0,1432.0,0.0,1.0,0.0,12.0
506.8421052631579,17.0,1430.0,1.0,0.0,0.0,5.0
2095.890410958904,35.0,1430.0,1.0,0.0,0.0,12.0
720.0,35.0,1430.0,1.0,0.0,0.0,5.0
2416.5,17.0,1432.0,0.0,0.0,1.0,12.0
3306.6666666666665,35.0,1800.0,0.0,0.0,1.0,12.0
6105.974025974026,35.0,1800.0,1.0,0.0,0.0,25.0
1400.4624277456646,35.0,1800.0,1.0,0.0,0.0,5.0
1414.5454545454545,26.0,1430.0,1.0,0.0,0.0,12.0
5204.68085106383,26.0,1800.0,0.0,0.0,1.0,25.0
1812.2222222222222,17.0,1800.0,1.0,0.0,0.0,12.0
2763.5928143712576,35.0,1100.0,1.0,0.0,0.0,12.0
I already read the data with the following command:
val data = sc.textFile(datadir + "/data_2.csv");
I want to create an RDD of (label, features) pairs with the following command, but it does not work:
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()
So I cannot continue to train a model. Any help?
P.S. I run Spark with Scala IDE on Windows 7 x64.
After a lot of effort I found the solution. The first problem was related to the header row and the second to the mapping function. Here is the complete solution:
//To read the file
val csv = sc.textFile(datadir + "/data_2.csv");
//To find the header
val header = csv.first;
//To remove the header
val data = csv.filter(_ != header);
//To create an RDD of (label, features) pairs
val parsedData = data.map { line =>
  val parts = line.split(',')
  // the first column is the label; the remaining comma-separated columns are the features
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts.tail.map(_.toDouble)))
}.cache()
I hope it saves you some time.
When you read in your file, the first line
y_3,x_6,x_7,x_73_1,x_73_2,x_73_3,x_8
is also read and transformed in your map function, so you end up calling toDouble on y_3. You need to filter out the first row and do the learning on the remaining rows.
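The header handling and the label/feature split can be tried on a plain Scala Seq before involving Spark. The sketch below uses two rows from the sample data; the object name and the tuple representation (instead of LabeledPoint) are choices made only for this illustration.

```scala
object LabeledRowsSketch {
  // Excerpt of the CSV from the question, header included.
  val csv: Seq[String] = Seq(
    "y_3,x_6,x_7,x_73_1,x_73_2,x_73_3,x_8",
    "2416.5,17.0,1432.0,0.0,0.0,1.0,12.0",
    "720.0,35.0,1430.0,1.0,0.0,0.0,5.0"
  )

  val header: String = csv.head

  // Drop the header row, then split each remaining row into
  // (label, features): first column is the label, the rest are features.
  val parsed: Seq[(Double, Array[Double])] =
    csv.filter(_ != header).map { line =>
      val parts = line.split(',').map(_.toDouble)
      (parts.head, parts.tail)
    }

  def main(args: Array[String]): Unit =
    parsed.foreach { case (label, features) =>
      println(s"$label -> ${features.mkString("[", ", ", "]")}")
    }
}
```

Leaving the header in would make toDouble throw a NumberFormatException on "y_3", which is the failure described above.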