I have a folder that contains many text files. I need to read these files into one RDD and keep the file name together with each word in it.
Example:
doc1.txt :
" hello my name sam "
doc2.txt :
"hello world"
I need to pass the folder path and get results like this:
(hello, doc1), (my, doc1), (world, doc2), ... etc.
I tried this:
val rddWhole = spark.sparkContext.wholeTextFiles("C:/tmp/files/*")
rddWhole.foreach(f=>{
println(f._1+"=>"+f._2)
})
But it treats the whole text of each file as one string. Does anyone have an idea how I can solve this?
Based on my assumptions, you want to extract every word from each file and pair it with the name of the file that contains it. As you mentioned, Spark gives you the whole content of a file as a single string. For example, if this is the file content:
hello
my name is
John Doe
The value you get would be:
val fileString = "hello\nmy name is\nJohn Doe"
Right? So you need to split the string on any run of whitespace, like so:
val wordsSeparated = fileString.split("\\s+") // \\s matches any whitespace character (space, tab, newline), so a separate \\n alternative is not needed
So in the end, you'll need something like this:
rddWhole.foreach { f =>
  // f._1 is the file path, f._2 is the whole content of that file
  f._2.split("\\s+").foreach(word => println(f._1 + " => " + word))
}
This would be the result:
file:/tmp/spark-test/two.txt => and
file:/tmp/spark-test/two.txt => this
file:/tmp/spark-test/two.txt => would
file:/tmp/spark-test/one.txt => so
file:/tmp/spark-test/one.txt => hello
file:/tmp/spark-test/one.txt => my
file:/tmp/spark-test/one.txt => name
file:/tmp/spark-test/one.txt => is
file:/tmp/spark-test/one.txt => John
file:/tmp/spark-test/one.txt => Doe
file:/tmp/spark-test/two.txt => be
file:/tmp/spark-test/two.txt => the
file:/tmp/spark-test/two.txt => second
file:/tmp/spark-test/two.txt => text
file:/tmp/spark-test/two.txt => file
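To get the (word, file) pairs the question asks for, rather than printed lines, the same split can feed a flat-map step. Below is a minimal sketch in plain Python, with a list of (path, content) tuples standing in for the RDD. The helper name word_file_pairs and the sample paths are made up for illustration; in Spark, the per-file logic would presumably go inside a flatMap call.

```python
import os

def word_file_pairs(path, content):
    """Return one (word, doc_name) pair for every word in one file."""
    # Hypothetical helper: strip the directory and the extension from the path.
    doc_name = os.path.splitext(os.path.basename(path))[0]
    # str.split() with no argument splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return [(word, doc_name) for word in content.split()]

# Stand-in for what wholeTextFiles yields: (path, whole content) tuples.
files = [
    ("file:/tmp/files/doc1.txt", " hello my name sam "),
    ("file:/tmp/files/doc2.txt", "hello world"),
]

# Flat-map: apply the helper to each file and flatten the results.
pairs = [pair for path, content in files for pair in word_file_pairs(path, content)]
print(pairs)
```

Splitting with no argument also avoids the empty first token that a regex split produces when the content starts with whitespace, as doc1.txt does in the question.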
I basically have to modify a 2,000+ line PHP file.
I have to replace all the 'null' values with the original parameter name prefixed with a $. Doing it manually, line by line, would take forever. Is there a shortcut or extension I can use to automate this?
This is a small block of the code I have:
'SecuredLoanDetails' => array(
'ExitStrategy' => null,
'ChangeOfUseRequired' => null,
'ProjectRequiresPlanning' => null,
'ProjectPlanningGranted' => null,
'BridgingDetails' => array(
'BridgingPaymentMethod' => 'RolledUp',
'PropertyPreviouslyBridged' => null,
'NumberOfMonthsSincePropertyBridged' => null,
'BridgingLoanPurpose' => 'First_Charge',
'OccupiedByClientOrFamilyMember' => null,
'LimitedCompany' => null,
'BridgingPropertyUse' => null,
'BridgingRefurbishmentType' => null,
'BridgingAdditionalPropertiesTotalValue' => null,
'BridgingAdditionalPropertiesTotalOutstanding' => null,
),
),
This is what I want:
'SecuredLoanDetails' => array(
'ExitStrategy' => $ExitStrategy,
'ChangeOfUseRequired' => $ChangeOfUseRequired,
'ProjectRequiresPlanning' => $ProjectRequiresPlanning,
'ProjectPlanningGranted' => $ProjectPlanningGranted,
'BridgingDetails' => array(
'BridgingPaymentMethod' => 'RolledUp',
'PropertyPreviouslyBridged' => $PropertyPreviouslyBridged,
'NumberOfMonthsSincePropertyBridged' => $NumberOfMonthsSincePropertyBridged,
'BridgingLoanPurpose' => 'First_Charge',
'OccupiedByClientOrFamilyMember' => $OccupiedByClientOrFamilyMember,
'LimitedCompany' => $LimitedCompany,
'BridgingPropertyUse' => $BridgingPropertyUse,
'BridgingRefurbishmentType' => $BridgingRefurbishmentType,
'BridgingAdditionalPropertiesTotalValue' => $BridgingAdditionalPropertiesTotalValue,
'BridgingAdditionalPropertiesTotalOutstanding' => $BridgingAdditionalPropertiesTotalOutstanding,
),
),
You can do this with VSCode's find and replace menu. Enable regex mode, then use '(\w*)' => null as the find string and '$1' => $$$1 as the replace string.
This works by matching a quoted word followed by => null and capturing the word inside the quotes as group 1. The replacement string is then built as '(captured word)' => $(captured word), where $1 stands for the contents of group 1 and $$ inserts a literal $.
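The same capture-and-reuse idea can be tried outside the editor. Here is a minimal sketch using Python's re module on a three-line excerpt of the array from the question. Note that Python writes the backreference as \1 and treats $ in the replacement as a literal character, so no escaping is needed there.

```python
import re

php_source = """'ExitStrategy' => null,
'BridgingPaymentMethod' => 'RolledUp',
'LimitedCompany' => null,"""

# Capture the key between the quotes, then reuse it after a literal "$".
pattern = re.compile(r"'(\w+)' => null")
converted = pattern.sub(r"'\1' => $\1", php_source)
print(converted)
```

Lines whose value is not null, such as 'BridgingPaymentMethod', are left untouched because the pattern does not match them.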
Given your text, you can replace (Ctrl + H) the occurrences of 'identifier' => null with 'identifier' => $identifier using the following regex and replace pattern. Make sure to turn on the use of regular expressions in the Find/Replace box (Alt + R or the button on the right side of the Find text field that looks like .*).
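'([^']*)'(\s*=>\s*)null
'$1'$2$$$1
This captures the identifier without its quotes as group 1 and the => with its surrounding whitespace as group 2, then matches the literal null. The replacement writes the quoted identifier, group 2, and your variable $identifier ($$ inserts a literal $). Keeping the quotes outside group 1 matters: if they were captured, they would be copied into the variable name as well.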
I can't get the date filter to match this message:
"message" => "10.60.1.251\t\"10.60.1.152\"\t2016-12-28\t22:53:50\tPOST\t200\t1014\t0.084\
This is the message as it is displayed on stdout. The log file that the message originates from is tab-separated ("\t").
Any suggestions?
I have tried:
match => ["message", "YYYY-MM-dd HH:mm:ss"]
(the space between the date and the time is a tab)
match => ["message", "YYYY-MM-dd'\t'HH:mm:ss"]
match => ["message", "YYYY-MM-dd\tHH:mm:ss"]
match => ["message", "YYYY-MM-dd..HH:mm:ss"]
match => ["message", "YYYY-MM-dd;HH:mm:ss"]
and several others.
I came up with this solution, though it is not very elegant:
filter {
grok {
match => ["message","%{DATE:extractDate} %{HAPROXYTIME:extractTime}"]
}
mutate {
add_field => {"dateTime" => "20%{extractDate} %{extractTime}"
}
remove_field => ["extractDate", "extractTime"]
}
date {
locale => "en"
match => ["dateTime", "yyyy-MM-dd HH:mm:ss"]
timezone => "Europe/Vienna"
target => "@timestamp"
add_field => { "debug" => "timestampMatched "}
remove_field => ["dateTime"]
}
}
The date{} filter is expecting a pattern that matches the entire string passed in.
The correct flow is to map the fields (as in your grok example), and then send just the date/time fields to the date{} filter.
With tab-separated data, I would look at the csv{} filter, which provides a "separator" parameter that I believe can be set to a tab.
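The "map the fields first, then parse only the date/time text" flow can be sketched outside Logstash. The Python example below is an illustration of that idea, not Logstash behaviour. It uses only the visible part of the message from the question, and the column positions of the date and time are assumptions read off that sample.

```python
from datetime import datetime

# Visible part of the message from the question; the real line continues.
message = '10.60.1.251\t"10.60.1.152"\t2016-12-28\t22:53:50\tPOST\t200\t1014\t0.084'

# Step 1: map the fields by splitting on the tab separator.
fields = message.split("\t")
date_part, time_part = fields[2], fields[3]  # assumed column positions

# Step 2: hand only the date/time text to the parser, so the pattern
# has to match the whole string it is given.
timestamp = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")
print(timestamp.isoformat())
```

Passing the full message to strptime with the same pattern would raise a ValueError, which mirrors why the date filter never matched on the "message" field.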
I need to import my bank-exported transactions (CSV) into GNUcash.
I am almost finished with the Perl script, which uses Finance::QIF.
I parse the CSV and write it out like this:
my $record = {
header => "Type:Bank",
date => $outdatum,
memo => $outtext,
transaction => $outbetrag,
};
$out->header( $record->{header} );
$out->write($record);
....
But my problem is creating a split.
http://finance-qif.sourceforge.net/ says "If the transaction contains splits this will be defined and consist of an array of hash references. With each split potentially having the following values." So I tried this:
my $record = {
header => "Type:Bank",
date => $outdatum,
memo => $outtext,
transaction => $outbetrag,
@splits = (
{
category => "Gesundheit:Arzt:Kind1",
memo => "L",
amount => "-161,66"
},
{
category => "Gesundheit:Arzt:Kind2",
memo => "F",
amount => "-162,66"
}
)
};
This leads to the error:
Unsupported field 'HASH(0x221c9e8)' found in record ignored in file '>_TESTqif.qif' line 22 at convert_bank_CSV.pl line 195.
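Unfortunately, I could not find an example anywhere of creating a split, only of a normal transaction.
Can someone please explain how Finance::QIF can be used to create split transactions?
I know nothing about Finance::QIF, but your @splits code makes no sense. Inside the hash constructor, the assignment just contributes its two hash references as a key and a value, which is why a key like HASH(0x221c9e8) shows up in the error. What you need is a key named splits whose value is an array reference.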
Try this instead:
my $record = {
header => "Type:Bank",
date => $outdatum,
memo => $outtext,
transaction => $outbetrag,
splits => [
{
category => "Gesundheit:Arzt:Kind1",
memo => "L",
amount => "-161,66",
},
{
category => "Gesundheit:Arzt:Kind2",
memo => "F",
amount => "-162,66",
}
],
};
See perldoc perlreftut for more information about references and data structures in Perl.
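To make the shape concrete: the splits entry is a list of records nested inside the transaction record. The sketch below is in Python rather than Perl and is not Finance::QIF's implementation. It only shows how such a nested structure could be flattened into QIF text, assuming the usual QIF line codes (D date, T amount, M memo, S, E and $ for each split's category, memo and amount, ^ as record terminator). The function name to_qif and the date, memo and total are made-up sample values.

```python
def to_qif(record):
    """Flatten one transaction record, including its splits, into QIF text."""
    lines = [
        "!" + record["header"],
        "D" + record["date"],
        "T" + record["transaction"],
        "M" + record["memo"],
    ]
    # Each split contributes its own category (S), memo (E) and amount ($).
    for split in record.get("splits", []):
        lines += ["S" + split["category"], "E" + split["memo"], "$" + split["amount"]]
    lines.append("^")  # end of record
    return "\n".join(lines)

record = {
    "header": "Type:Bank",
    "date": "01/02/2017",        # sample value
    "memo": "Doctor invoices",   # sample value
    "transaction": "-324,32",    # sample value: sum of the two splits
    "splits": [
        {"category": "Gesundheit:Arzt:Kind1", "memo": "L", "amount": "-161,66"},
        {"category": "Gesundheit:Arzt:Kind2", "memo": "F", "amount": "-162,66"},
    ],
}
print(to_qif(record))
```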
I am trying to get the projects whose names begin with a particular word, but I get the following error: "The 'StartsWith' member cannot be used in the expression."
ProjectContext projContext = new ProjectContext(urlPWA);
projContext.Credentials = new SharePointOnlineCredentials(user,passwordSecurity);
projContext.Load(projContext.Projects, c => c.Where(p => p.Name.StartsWith(name, true, new CultureInfo("es-ES"))).IncludeWithDefaultProperties(f => f.Name, f => f.Tasks, f => f.ProjectResources, f => f.Owner.UserId, f => f.CheckedOutBy.UserId));
projContext.ExecuteQuery();
I'm not too familiar with special queries like that. But a quick workaround would probably be to get the whole collection and iterate it afterwards. Hopefully you do not have a million projects in your PWA :)
projContext.Load(projContext.Projects);
projContext.ExecuteQuery();
foreach (PublishedProject pubProj in projContext.Projects)
{
if (pubProj.Name.StartsWith("yourString")) {
// Do something
}
}
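The original query asked for a case-insensitive match, which a plain StartsWith("yourString") does not reproduce. Here is a sketch of the client-side filter in Python, with a plain list of names standing in for the loaded project collection. The function name and the project names are made up, and casefold() is only an approximation of culture-aware comparison, not an equivalent of CultureInfo("es-ES").

```python
def names_starting_with(names, prefix):
    """Return the names that begin with prefix, ignoring case."""
    wanted = prefix.casefold()
    return [name for name in names if name.casefold().startswith(wanted)]

# Stand-in for the names of the projects fetched from the server.
project_names = ["Alpha rollout", "ALPHA pilot", "Beta launch"]
matches = names_starting_with(project_names, "alpha")
print(matches)
```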
I have a relational model where users have managers that are also users. The code below works great and does exactly what it's supposed to, but it only displays the first name of the manager. I'm trying to get it to show both the first name and the last name of the manager.
<%= sf.input :managers, :as => :check_boxes, :member_label => (:firstname) ,:input_html => { :size => 20, :multiple => true}%>
The other field I'm trying to add is :lastname. I cannot figure out how to get :member_label to take both fields.
I figured it out. By using Proc.new, I was able to include both the first name and the last name.
<%= sf.input :managers, :as => :check_boxes, :member_label => Proc.new { |t| h(t.firstname + " " + t.lastname) } ,:input_html => { :size => 20, :multiple => true}%>