Group words based on length using pyspark

I would like to group the data based on the length using pyspark.
a= sc.parallelize(("number","algebra","int","str","raj"))
Expected output is in the form
(("int","str","raj"),("number"),("algebra"))

a= sc.parallelize(("number","algebra","int","str","raj"))
a.collect()
['number', 'algebra', 'int', 'str', 'raj']
Now apply the following steps to get the final output:
# Creating a tuple of the length of the word and the word itself.
a = a.map(lambda x:(len(x),x))
# Grouping by key (the key is the length of the word), then keeping only the grouped values
a = a.groupByKey().mapValues(lambda x:list(x)).map(lambda x:x[1])
a.collect()
[['int', 'str', 'raj'], ['number'], ['algebra']]
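For comparison, the same grouping logic can be sketched in plain Python without Spark. This only illustrates what the map and groupByKey steps compute; the helper name group_by_length is made up for this example. In this sketch the groups come out in order of first appearance of each length, which need not match the order an RDD returns.

```python
from collections import defaultdict


def group_by_length(words):
    """Group words by their length (hypothetical helper, not part of pyspark)."""
    groups = defaultdict(list)
    for word in words:
        # Same key as the RDD version: the length of the word.
        groups[len(word)].append(word)
    # Keep only the grouped values, like .map(lambda x: x[1]) above.
    return list(groups.values())


result = group_by_length(["number", "algebra", "int", "str", "raj"])
print(result)
```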

How to interpret indexof expression and functions in Azure Data Factory

I'm trying to understand the indexOf expression function in Azure Data Factory.
Example
This example finds the starting index value for the "world" substring in the "hello world" string:
indexOf('hello world', 'world')
And returns this result: 6
I'm confused by what is meant by the 'index value' and how the example arrived at the result 6.
Also, using the above example, can someone let me know what would be the answer for the following expression?
#if(greater(indexof(string(pipeline().parameters.Config),'FilenameMask'),0),pipeline().parameters.Config.FilenameMask,'')
Here 'Config' represents a field in a SQL database, holding a value such as:
{"FilenameMask":"accounts*."}
Per the docs:
Return the starting position or index value for a substring. This function is not case-sensitive, and indexes start with the number 0.
hello world
01234567890
      ^
      +--- "world" found starting at position 6
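The zero-based counting can be checked with a small Python sketch. This is only a rough stand-in for the behaviour quoted from the docs (case-insensitive, indexes starting at 0), not ADF itself; the helper name index_of is an assumption made for this example, and it returns -1 when the substring is absent because that is what Python's str.find does.

```python
def index_of(text, search):
    """Zero-based, case-insensitive position of search in text, or -1 if absent.

    Hypothetical helper that mimics the documented behaviour in Python.
    """
    # Lower-casing both sides makes the comparison case-insensitive.
    return text.lower().find(search.lower())


print(index_of("hello world", "world"))  # 'h' is position 0, so 'w' is position 6
print(index_of("hello world", "WORLD"))  # same position, case is ignored
print(index_of("hello world", "xyz"))    # substring not present
```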
Regarding the second part of your question, here's the expression rewritten for a bit of clarity:
#if( greater(indexof(string(pipeline().parameters.Config),'FilenameMask'),0)
,pipeline().parameters.Config.FilenameMask
,'')
which can be read as follows:
if the index of the string "FilenameMask" within x is greater than 0 then
    return x.FilenameMask
else
    return an empty string
where x is pipeline().parameters.Config, which is the value of your "Config" column from the database table. It will hold values such as
{"sparkConfig":{"header":"true"},"FilenameMask":"cashsales*."}
and
{"FilenameMask":"accounts*."}
The ADF expression can also be read as follows:
if the JSON in the Config column contains a "FilenameMask" key then
return the value of the FilenameMask key
else
return an empty string
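The same decision can be sketched in Python, using the two sample Config values above. This is an illustration of the logic only, not of how ADF evaluates the expression; the function name filename_mask is hypothetical, and the Config column is assumed to arrive as a JSON string.

```python
import json


def filename_mask(config_json):
    """Return the FilenameMask value if the key text occurs in the JSON, else ''.

    Hypothetical helper mirroring the if/greater/indexof expression.
    """
    # Equivalent of indexof(string(Config), 'FilenameMask'), case-insensitive.
    position = config_json.lower().find("filenamemask")
    if position > 0:
        # Equivalent of Config.FilenameMask; fall back to '' if the key is missing.
        return json.loads(config_json).get("FilenameMask", "")
    return ""


print(filename_mask('{"sparkConfig":{"header":"true"},"FilenameMask":"cashsales*."}'))
print(filename_mask('{"FilenameMask":"accounts*."}'))
print(repr(filename_mask('{"sparkConfig":{"header":"true"}}')))
```

Note that the test is "greater than 0", which works here because a JSON object always starts with "{", so the key can never be found at position 0.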

Extracting matched using fuzzysearch

I am trying to extract the fields of the matches returned by fuzzysearch:
https://github.com/taleinat/fuzzysearch
The result I get looks like this:
>>> from fuzzysearch import find_near_matches
# search for 'PATTERN' with a maximum Levenshtein Distance of 1
>>> find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
[Match(start=3, end=9, dist=1, matched="PATERN")]
How do I extract 'matched' and 'dist' from the resulting list?
I can't seem to index the output
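One possible approach, sketched under the assumption that each Match object exposes the fields shown in its repr (start, end, dist, matched) as attributes: index into the list first, then read the attributes. So that the sketch runs without the library installed, a namedtuple stand-in is used in place of the real Match class; with fuzzysearch available, the list would come from find_near_matches instead.

```python
from collections import namedtuple

# Stand-in for fuzzysearch's Match objects (an assumption for this sketch).
Match = namedtuple("Match", ["start", "end", "dist", "matched"])

# With the library installed, this list would instead come from:
#   find_near_matches('PATTERN', '---PATERN---', max_l_dist=1)
matches = [Match(start=3, end=9, dist=1, matched="PATERN")]

# The result is a list, so pick an element before reading its fields.
first = matches[0]
print(first.matched, first.dist)

# Or collect the fields from every match at once.
pairs = [(m.matched, m.dist) for m in matches]
print(pairs)
```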

Count filtered records in scala

I am new to Scala, so this problem might look very basic.
I have a file called data.txt with the following contents:
xxx.lss.yyy23.com-->mailuogwprd23.lss.com,Hub,12689,14.98904563,1549
xxx.lss.yyy33.com-->mailusrhubprd33.lss.com,Outbound,72996,1.673717588,1949
xxx.lss.yyy33.com-->mailuogwprd33.lss.com,Hub,12133,14.9381027,664
xxx.lss.yyy53.com-->mailusrhubprd53.lss.com,Outbound,72996,1.673717588,3071
I want to split each line and find the records based on the number in the host name (for example, the 23 in xxx.lss.yyy23.com):
val data = io.Source.fromFile("data.txt").getLines().map { x => (x.split("-->"))}.map { r => r(0) }.mkString("\n")
which gives me
xxx.lss.yyy23.com
xxx.lss.yyy33.com
xxx.lss.yyy33.com
xxx.lss.yyy53.com
This is how I am trying to count the matching records:
data.count { x => x.contains("33")}
How do I get the count of records that do not contain 33?
The following will give you the number of lines that contain "33":
data.split("\n").count(a => a.contains("33"))
The reason your version isn't working is that data is no longer a collection of lines. Your previous statement uses mkString to concatenate the results into a single string with newline as the separator, so count would operate on individual characters rather than on lines. You need to split data back into an array of strings first.
The following will work for getting the lines that do not contain "33":
data.split("\n").count(a => !a.contains("33"))
You simply need to negate the contains operation in this case.

Sphinx SetFilter Not Filtering as I expect

I am using a SetFilter command like this:
$mycategoryids = "345,366,456,444,789,345";
$cl->SetFilter( 'thecatid', array( $mycategoryids ));
However, all of the results I get back have 345 as either their primary or secondary category, so it appears that 345, being the first number in that array, is given most if not all of the weight. Am I doing something wrong? I thought that passing all of those numbers would simply mean that Sphinx only returns items that include one of those numbers in "thecatid", so that if there was an item like this:
[thecatid] => Array
(
[0] => 444
[1] => 552
[2] => 554
[3] => 566
)
Then it should still show up in the results because 444 is in the item's 'thecatid' array and 444 is also in the filter call.
Am I missing something?
Oh, and to make sure my query is correct, within the query I have:
SELECT u.ID,u.Downloads as downloads, CONCAT_WS(',', u.catID, u.CatID1, u.CatID2, u.CatID3) as thecatid, ...
And then down below:
sql_attr_multi = uint thecatid from field;
Thanks!
Craig
$mycategoryids = "345,366,456,444,789,345";
$cl->SetFilter( 'thecatid', array( $mycategoryids ));
That's not a valid call. You need to pass SetFilter an array of numbers, not an array containing a single string.
Either of these is better:
$mycategoryids = array(345,366,456,444,789,345);
$cl->SetFilter( 'thecatid', $mycategoryids);
or
$mycategoryids = "345,366,456,444,789,345";
$cl->SetFilter( 'thecatid', explode(',',$mycategoryids) );

Why can't I add a chart to my Excel spreadsheet with Perl's Spreadsheet::WriteExcel

When creating a chart in a spreadsheet using Spreadsheet::WriteExcel, the file it creates keeps coming up with an error reading
Excel found unreadable content in "Report.xls"
and asks me if I want to recover it. I have worked out that the problem line in the code is where I actually insert the chart, with
$chartworksheet->insert_chart(0, 0, $linegraph, 10, 10);
If I comment out this one line, the data is fine (but of course, there's no chart). The rest of the relevant code is as follows (any variables not defined here are defined earlier in the code, like $lastrow).
printf("Creating\n");
my $chartworksheet = $workbook->add_worksheet('Graph');
my $linegraph = $workbook->add_chart(type => 'line', embedded => 1);
$linegraph->add_series(values => '=Data!$D$2:$D$lastrow', name => 'Column1');
$linegraph->add_series(values => '=Data!$E$2:$E$lastrow', name => 'Column2');
$linegraph->add_series(values => '=Data!$G$2:$G$lastrow', name => 'Column3');
$linegraph->add_series(values => '=Data!$H$2:$H$lastrow', name => 'Column4');
$linegraph->set_x_axis(name => 'x-axis');
$linegraph->set_y_axis(name => 'y-axis');
$linegraph->set_title(name => 'title');
$linegraph->set_legend(position => 'bottom');
$chartworksheet->activate();
$chartworksheet->insert_chart(0, 0, $linegraph, 10, 10);
printf("Finished\n");
I am at a total loss here, and I can't find any answers. Help please!
Looking at the expression:
'=Data!$D$2:$D$lastrow'
Is $lastrow some convention in Spreadsheet::WriteExcel, or is it a variable from your script that should be interpolated into the string? If it's your variable, it won't be interpolated inside single quotes, and you may want to use one of the following instead:
'=Data!$D$2:$D$' . $lastrow
"=Data!\$D\$2:\$D\$$lastrow"
sprintf('=Data!$D$2:$D$%d', $lastrow)
The problem, as mobrule correctly points out, is that the series strings are in single quotes, so $lastrow doesn't get interpolated.
You can avoid these types of issues entirely when programmatically generating chart series strings by using the xl_range_formula() utility function from Spreadsheet::WriteExcel::Utility.
$chart->add_series(
categories => xl_range_formula( 'Sheet1', 1, 9, 0, 0 ),
values => xl_range_formula( 'Sheet1', 1, 9, 1, 1 ),
);
# Which is the same as:
$chart->add_series(
categories => '=Sheet1!$A$2:$A$10',
values => '=Sheet1!$B$2:$B$10',
);
See the following section of the WriteExcel docs for more details: Working with Cell Ranges.