I am trying to join two paired RDDs, as per the answer provided here:
Joining two RDD[String] - Spark Scala
I am getting an error:
error: value leftOuterJoin is not a member of org.apache.spark.rdd.RDD[
The code snippet is as below.
val pairRDDTransactions = parsedTransaction.map
{
case ( field3, field4, field5, field6, field7,
field1, field2, udfChar1, udfChar2, udfChar3) =>
((field1, field2), field3, field4, field5,
field6, field7, udfChar1, udfChar2, udfChar3)
}
val pairRDDAccounts = parsedAccounts.map
{
case (field8, field1, field2, field9, field10 ) =>
((field1, field2), field8, field9, field10)
}
val transactionAddrJoin = pairRDDTransactions.leftOuterJoin(pairRDDAccounts).map {
case ((field1, field2), (field3, field4, field5, field6,
field7, udfChar1, udfChar2, udfChar3, field8, field9, field10)) =>
(field1, field2, field3, field4, field5, field6,
field7, udfChar1, udfChar2, udfChar3, field8, field9, field10)
}
In this case, field1 and field2 are my keys, on which I want to perform the join.
Joins are defined for RDD[(K, V)] (an RDD of Tuple2 objects). In your case, however, you have arbitrary tuples (a Tuple9 for the transactions and a Tuple4 for the accounts) - this just cannot work.
You should instead map to (key, value) pairs:
... =>
((field1, field2),
(field3, field4, field5, field6, field7, udfChar1, udfChar2, udfChar3))
and
... =>
((field1, field2), (field8, field9, field10))
respectively.
Given this entity (and these records as examples):
Discount
Amount Percentage
1000 2
5000 4
10000 8
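For reference, here is the lookup I'm after, sketched over an in-memory list (a sketch only, so a P.O. amount of 15000 should give 8):

using System.Linq;

var discounts = new[]
{
    new { Amount = 1000L, Percentage = 2L },
    new { Amount = 5000L, Percentage = 4L },
    new { Amount = 10000L, Percentage = 8L },
};
long poAmount = 15000;

// The highest threshold not exceeding the P.O. amount wins; 0 if none applies.
var percentage = discounts
    .Where(d => d.Amount <= poAmount)
    .OrderByDescending(d => d.Amount)
    .Select(d => d.Percentage)
    .DefaultIfEmpty(0)
    .First();   // 8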
I want to get the percentage to apply to a P.O. amount.
For example, having a P.O. Amount of 15000,
if I use
db.Discount
.Where(d => d.Amount <= PO.Amount)
.OrderByDescending(o => o.Amount)
.Select(s => s.Percentage)
.ToList()
.DefaultIfEmpty(0)
.FirstOrDefault();
I get 8 (correct)
but if I use
db.Discount
.Where(d => d.Amount <= PO.Amount)
.OrderByDescending(o => o.Amount)
.Select(s => s.Percentage)
.DefaultIfEmpty(0)
.FirstOrDefault();
I get 2 (incorrect) and items are not ordered any more.
Am I doing an incorrect use of DefaultIfEmpty?
If you are using Entity Framework 6.x, then this is a known bug:
the workaround is to move the DefaultIfEmpty call to after the ToList, which is arguably better anyway, as there is no need for the replacement of an empty result set to be done in the database.
The following examples were generated with EF 6.1.2 (and captured with SQL Server Profiler against Microsoft SQL Server 2016).
Now... Your "wrong" query:
var res = db.Discounts
.Where(d => d.Amount <= PO.Amount)
.OrderByDescending(o => o.Amount)
.Select(s => s.Percentage)
.DefaultIfEmpty(0)
.FirstOrDefault();
"removes" the OrderBy:
exec sp_executesql N'SELECT
[Limit1].[C1] AS [C1]
FROM ( SELECT TOP (1)
CASE WHEN ([Project1].[C1] IS NULL) THEN cast(0 as bigint) ELSE [Project1].[Percentage] END AS [C1]
FROM ( SELECT 1 AS X ) AS [SingleRowTable1]
LEFT OUTER JOIN (SELECT
[Extent1].[Percentage] AS [Percentage],
cast(1 as tinyint) AS [C1]
FROM [dbo].[Discount] AS [Extent1]
WHERE [Extent1].[Amount] <= @p__linq__0 ) AS [Project1] ON 1 = 1
) AS [Limit1]',N'@p__linq__0 bigint',@p__linq__0=15000
"best" query would be:
var res = db.Discounts
.Where(d => d.Amount <= PO.Amount)
.OrderByDescending(o => o.Amount)
.Select(s => s.Percentage)
.Take(1)
.ToArray()
.DefaultIfEmpty(0)
.First(); // Or Single(), same result but clearer that there is always *one* element
See the Take(1)? It generates a TOP (1):
exec sp_executesql N'SELECT TOP (1)
[Project1].[Percentage] AS [Percentage]
FROM ( SELECT
[Extent1].[Amount] AS [Amount],
[Extent1].[Percentage] AS [Percentage]
FROM [dbo].[Discount] AS [Extent1]
WHERE [Extent1].[Amount] <= @p__linq__0
) AS [Project1]
ORDER BY [Project1].[Amount] DESC',N'@p__linq__0 bigint',@p__linq__0=15000
The ToArray() then moves the remaining evaluation to C#. You could use .FirstOrDefault() with ?? instead of DefaultIfEmpty(), but the result would be ambiguous if Percentage is nullable (is the null returned by FirstOrDefault() there because there are no rows, or because the only row found has Percentage == null? Who knows :-) ). Solving this in the most general case becomes a little more complex:
var res = (db.Discounts
.Where(d => d.Amount <= PO.Amount)
.OrderByDescending(o => o.Amount)
.Select(s => new { s.Percentage })
.FirstOrDefault() ?? new { Percentage = (long)0 }
).Percentage;
Here the (long) in (long)0 should be the data type of Percentage. This query gives:
exec sp_executesql N'SELECT TOP (1)
[Project1].[C1] AS [C1],
[Project1].[Percentage] AS [Percentage]
FROM ( SELECT
[Extent1].[Amount] AS [Amount],
[Extent1].[Percentage] AS [Percentage],
1 AS [C1]
FROM [dbo].[Discount] AS [Extent1]
WHERE [Extent1].[Amount] <= @p__linq__0
) AS [Project1]
ORDER BY [Project1].[Amount] DESC',N'@p__linq__0 bigint',@p__linq__0=15000
Another "worse" variant:
var res = db.Discounts
.Where(d => d.Amount <= PO.Amount)
.OrderByDescending(o => o.Amount)
.Select(s => s.Percentage)
.Take(1)
.DefaultIfEmpty(0)
.First();
that gives an overcomplicated query with two TOP (1):
exec sp_executesql N'SELECT
[Limit2].[C1] AS [C1]
FROM ( SELECT TOP (1)
CASE WHEN ([Project2].[C1] IS NULL) THEN cast(0 as bigint) ELSE [Project2].[Percentage] END AS [C1]
FROM ( SELECT 1 AS X ) AS [SingleRowTable1]
LEFT OUTER JOIN (SELECT TOP (1)
[Project1].[Percentage] AS [Percentage],
cast(1 as tinyint) AS [C1]
FROM ( SELECT
[Extent1].[Amount] AS [Amount],
[Extent1].[Percentage] AS [Percentage]
FROM [dbo].[Discount] AS [Extent1]
WHERE [Extent1].[Amount] <= @p__linq__0
) AS [Project1]
ORDER BY [Project1].[Amount] DESC ) AS [Project2] ON 1 = 1
) AS [Limit2]',N'@p__linq__0 bigint',@p__linq__0=15000
This is normal. For the first statement:
db.Discount.Where(d => d.Amount <= PO.Amount).OrderByDescending(o => o.Amount).Select(s => s.Percentage).ToList().DefaultIfEmpty(0).FirstOrDefault();
you call .ToList() before DefaultIfEmpty(0), which means that when you call .ToList() the statement is translated to SQL as follows:
DECLARE @p0 Int = 15000
SELECT [t0].[Percentage]
FROM [Discount] AS [t0]
WHERE [t0].[Amount] <= @p0
ORDER BY [t0].[Amount] DESC
It is then executed and the results are loaded into memory; after that, the remaining two calls, .DefaultIfEmpty(0).FirstOrDefault(), run on the in-memory data, so the result is what you expected.
But for the second statement:
db.Discount.Where(d => d.Amount <= PO.Amount).OrderByDescending(o => o.Amount).Select(s => s.Percentage).DefaultIfEmpty(0).FirstOrDefault();
you don't call .ToList(), which means the statement won't be executed until it reaches FirstOrDefault(), because DefaultIfEmpty(0) uses deferred execution (you can read its documentation on MSDN).
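As a small LINQ-to-Objects illustration of what deferred execution means here (illustrative only; this is not the EF translation):

using System;
using System.Collections.Generic;
using System.Linq;

class DeferredDemo
{
    static void Main()
    {
        var source = new List<int>();

        // DefaultIfEmpty only composes the query; nothing is evaluated yet.
        var query = source.DefaultIfEmpty(0);

        source.Add(5);

        // Evaluation happens only now, so the element added after the query
        // was composed is visible and 5 is printed, not the default 0.
        Console.WriteLine(query.First());
    }
}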
When it reaches .FirstOrDefault(), the statement is translated to SQL as follows:
DECLARE @p0 Int = 15000
SELECT case when [t2].[test] = 1 then [t2].[Percentage] else [t0].[EMPTY] end AS [value]
FROM (SELECT 0 AS [EMPTY] ) AS [t0]
LEFT OUTER JOIN ( SELECT TOP (1) 1 AS [test], [t1].[Percentage] FROM [Discount] AS [t1] WHERE [t1].[Amount] <= @p0 ) AS [t2] ON 1=1
ORDER BY [t2].[Amount] DESC
This is then executed and loaded into memory, so the result isn't what you expected: the query takes TOP (1) before ordering, so it simply returns the first matching row.
Is it possible to convert the LINQ query below from comprehension (query) syntax to "lambda" (method) syntax, i.e. table1.Where().Select()?
from t1 in table1
from t2 in table2.Where(t2=>t2.Table1ID == t1.ID).DefaultIfEmpty()
select new {t1.C1, t2.C2}
The above query is translated to a left join in SQL without using the ugly join keyword in LINQ.
LINQPad gives this translation:
Table1
.SelectMany (
t1 =>
Table2
.Where (t2 => (t2.Table1ID) == t1.ID)
.DefaultIfEmpty (),
(t1, t2) =>
new
{
C1 = t1.C1,
C2 = t2.C2
}
)
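For comparison, the same left-join shape is often written explicitly with GroupJoin followed by SelectMany; a sketch assuming the same Table1/Table2 sources and column names:

var query = Table1
    .GroupJoin(
        Table2,
        t1 => t1.ID,          // outer key
        t2 => t2.Table1ID,    // inner key
        (t1, t2s) => new { t1, t2s })
    .SelectMany(
        x => x.t2s.DefaultIfEmpty(),
        (x, t2) => new { x.t1.C1, t2.C2 });

As in the translation above, accessing t2.C2 relies on the query provider turning the missing row into a NULL rather than throwing.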
I have a query like this,
https://example.com/_vti_bin/exampleService/exampleService.svc/Categories?
$filter=Products/any(x:x/Status eq toupper('DELETED'))&
$select=ID,Products/Status,Products/Title&
$expand=Products
but it is not filtering the data set based on status = DELETED and returns products whose status is not DELETED, etc.
I looked at the SQL trace and it is generating something like this:
exec sp_executesql N'SELECT
[Project2].[C1] AS [C1],
[Project2].[C2] AS [C2],
[Project2].[C3] AS [C3],
[Project2].[ID] AS [ID],
[Project2].[C4] AS [C4],
[Project2].[C5] AS [C5],
[Project2].[C8] AS [C6],
[Project2].[ID1] AS [ID1],
[Project2].[C6] AS [C7],
[Project2].[C7] AS [C8],
[Project2].[Title] AS [Title],
[Project2].[Status] AS [Status]
FROM ( SELECT
[Extent1].[ID] AS [ID],
1 AS [C1],
N''DataAccess.Product'' AS [C2],
N''ID'' AS [C3],
N''Products'' AS [C4],
N'''' AS [C5],
[Extent2].[ID] AS [ID1],
[Extent2].[Title] AS [Title],
[Extent2].[Status] AS [Status],
CASE WHEN ([Extent2].[ID] IS NULL) THEN CAST(NULL AS varchar(1)) ELSE N''DataAccess.Product'' END AS [C6],
CASE WHEN ([Extent2].[ID] IS NULL) THEN CAST(NULL AS varchar(1)) ELSE N''Title,Status,ID'' END AS [C7],
CASE WHEN ([Extent2].[ID] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C8]
FROM [dbo].[Categories] AS [Extent1]
LEFT OUTER JOIN [dbo].[Products] AS [Extent2] ON [Extent1].[ID] = [Extent2].[ProductID]
WHERE ([Extent1].[ClientID] = @p__linq__0) AND ( EXISTS (SELECT
1 AS [C1]
FROM [dbo].[Products] AS [Extent3]
WHERE ([Extent1].[ID] = [Extent3].[ProductID]) AND (([Extent3].[Status] = (UPPER(N''DELETED''))) OR (([Extent3].[Status] IS NULL) AND (UPPER(N''DELETED'') IS NULL)))
))
) AS [Project2]
ORDER BY [Project2].[ID] ASC, [Project2].[C8] ASC',N'@p__linq__0 int',@p__linq__0=23
Is it correct to use "eq" if I only want products whose status is "DELETED" and nothing else?
Edit
I am using OData V3, with WCF Data Services and EF.
I believe the problem is in the query.
From the URL you are saying something like ...
// get me categories
https://example.com/_vti_bin/exampleService/exampleService.svc/Categories?
// where any product is deleted
$filter=Products/any(x:x/Status eq toupper('DELETED'))&
// return the category id, product status and title
$select=ID,Products/Status,Products/Title&
$expand=Products
In other words, you are filtering categories on the deleted status, not the products within them.
You could add a second filter to handle the product filtering and only return categories and their filtered set of products.
Try something like this instead ...
https://example.com/_vti_bin/exampleService/exampleService.svc/Categories?
$filter=Products/any(x:x/Status eq toupper('DELETED'))&
$select=ID,Products/Status,Products/Title&
$expand=Products/any(p:p/Status eq toupper('DELETED'))
Depending on your situation, it may be best to turn the query around ...
https://example.com/_vti_bin/exampleService/exampleService.svc/Products?
$filter=Status eq toupper('DELETED')&
$select=Category/ID,Status,Title
... by pulling a set of products and their related category IDs you get the same result, but gain the ability to filter those products directly in the base query instead of using a more complex child-collection filter.
As discussed in chat though, this does require a valid OData model where the relationship between Products and Categories is properly defined.
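For what it's worth, a minimal sketch of the kind of EF model that relationship implies (hypothetical class and property names, not the actual DataAccess types):

using System.Collections.Generic;

public class Category
{
    public int ID { get; set; }

    // Navigation used by $expand=Products and the Products/any(...) filter.
    public virtual ICollection<Product> Products { get; set; }
}

public class Product
{
    public int ID { get; set; }
    public string Title { get; set; }
    public string Status { get; set; }

    // Foreign key back to the owning category, so the service can build
    // the Categories -> Products join correctly.
    public int CategoryID { get; set; }
    public virtual Category Category { get; set; }
}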
I am trying to set up an optional many-to-many relationship using EF 6.1 Code First. I have a Partner which can have zero or more fund groups associated with it. I have successfully created the mapping using the following code:
HasMany(t => t.FundGroups)
.WithMany()
.Map(x =>
{
x.MapLeftKey("PartnerId");
x.MapRightKey("FundGroupId");
x.ToTable("PartnerFundGroupMap", "admin");
})
.MapToStoredProcedures(s => s.Insert(i => i.HasName("admin.InsertPartnerFundGroups")
.LeftKeyParameter(p => p.PartnerId, "PartnerId")
.RightKeyParameter(p => p.FundGroupId, "FundGroupId"))
.Delete(i => i.HasName("admin.DeletePartnerFundGroups")
.LeftKeyParameter(p => p.PartnerId, "PartnerId")
.RightKeyParameter(p => p.FundGroupId, "FundGroupId")));
The problem is that the code creates an inner join and I don't get back entities which don't have fund groups associated. Is there a way to force a left join so I can always retrieve partners even if a fund group is not associated with them? Please see the query generated below:
exec sp_executesql N'SELECT
[Project1].[IRSEntityTypeID1] AS [IRSEntityTypeID],
[Project1].[RegionID1] AS [RegionID],
[Project1].[CountryID1] AS [CountryID],
[Project1].[PartnershipLevelId1] AS [PartnershipLevelId],
[Project1].[PartnerID] AS [PartnerID],
[Project1].[C1] AS [C1],
[Project1].[ExternalCode] AS [ExternalCode],
[Project1].[CompanyCode] AS [CompanyCode],
[Project1].[PartnerName] AS [PartnerName],
[Project1].[EIN] AS [EIN],
[Project1].[KitCode] AS [KitCode],
[Project1].[IRSEntityTypeID] AS [IRSEntityTypeID1],
[Project1].[Address1] AS [Address1],
[Project1].[Address2] AS [Address2],
[Project1].[City] AS [City],
[Project1].[IsPartnership] AS [IsPartnership],
[Project1].[SessionId] AS [SessionId],
[Project1].[RegionID] AS [RegionID1],
[Project1].[CountryID] AS [CountryID1],
[Project1].[PostalCode] AS [PostalCode],
[Project1].[PartnershipLevelId] AS [PartnershipLevelId1],
[Project1].[C2] AS [C2],
[Project1].[C4] AS [C3],
[Project1].[C5] AS [C4],
[Project1].[Description] AS [Description],
[Project1].[C6] AS [C5],
[Project1].[C7] AS [C6],
[Project1].[CountryName] AS [CountryName],
[Project1].[C8] AS [C7],
[Project1].[C9] AS [C8],
[Project1].[RegionName] AS [RegionName],
[Project1].[C10] AS [C9],
[Project1].[C11] AS [C10],
[Project1].[IRSEntityTypeName] AS [IRSEntityTypeName],
[Project1].[C12] AS [C11],
[Project1].[C13] AS [C12],
[Project1].[C14] AS [C13],
[Project1].[C15] AS [C14],
[Project1].[C16] AS [C15],
[Project1].[FundGroupID] AS [FundGroupID],
[Project1].[C3] AS [C16],
[Project1].[FundGroupCode] AS [FundGroupCode],
[Project1].[FundGroupName] AS [FundGroupName],
[Project1].[SessionId1] AS [SessionId1],
[Project1].[CreateDate] AS [CreateDate],
[Project1].[ModifiedDate] AS [ModifiedDate]
FROM ( SELECT
[Extent1].[PartnerID] AS [PartnerID],
[Extent1].[ExternalCode] AS [ExternalCode],
[Extent1].[CompanyCode] AS [CompanyCode],
[Extent1].[PartnerName] AS [PartnerName],
[Extent1].[EIN] AS [EIN],
[Extent1].[KitCode] AS [KitCode],
[Extent1].[IRSEntityTypeID] AS [IRSEntityTypeID],
[Extent1].[Address1] AS [Address1],
[Extent1].[Address2] AS [Address2],
[Extent1].[City] AS [City],
[Extent1].[IsPartnership] AS [IsPartnership],
[Extent1].[SessionId] AS [SessionId],
[Extent1].[RegionID] AS [RegionID],
[Extent1].[CountryID] AS [CountryID],
[Extent1].[PostalCode] AS [PostalCode],
[Extent1].[PartnershipLevelId] AS [PartnershipLevelId],
[Extent2].[PartnershipLevelId] AS [PartnershipLevelId1],
[Extent2].[Description] AS [Description],
[Extent3].[CountryID] AS [CountryID1],
[Extent3].[CountryName] AS [CountryName],
[Extent4].[RegionID] AS [RegionID1],
[Extent4].[RegionName] AS [RegionName],
[Extent5].[IRSEntityTypeID] AS [IRSEntityTypeID1],
[Extent5].[IRSEntityTypeName] AS [IRSEntityTypeName],
N''3ed69e78-cbfb-4d5b-a270-85d0d62bb11c'' AS [C1],
N''FundGroups'' AS [C2],
[Join5].[FundGroupID1] AS [FundGroupID],
[Join5].[FundGroupCode] AS [FundGroupCode],
[Join5].[FundGroupName] AS [FundGroupName],
[Join5].[SessionID] AS [SessionId1],
[Join5].[CreateDate] AS [CreateDate],
[Join5].[ModifiedDate] AS [ModifiedDate],
CASE WHEN ([Join5].[PartnerId] IS NULL) THEN CAST(NULL AS varchar(1)) ELSE N''3ed69e78-cbfb-4d5b-a270-85d0d62bb11c'' END AS [C3],
N''PartnershipLevel'' AS [C4],
N''3ed69e78-cbfb-4d5b-a270-85d0d62bb11c'' AS [C5],
N''Country'' AS [C6],
N''3ed69e78-cbfb-4d5b-a270-85d0d62bb11c'' AS [C7],
N''Region'' AS [C8],
N''3ed69e78-cbfb-4d5b-a270-85d0d62bb11c'' AS [C9],
N''IRSEntityType'' AS [C10],
N''3ed69e78-cbfb-4d5b-a270-85d0d62bb11c'' AS [C11],
cast(0 as bit) AS [C12],
cast(0 as bit) AS [C13],
cast(0 as bit) AS [C14],
cast(0 as bit) AS [C15],
CASE WHEN ([Join5].[PartnerId] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C16]
FROM [admin].[vPartner] AS [Extent1]
INNER JOIN [admin].[PartnershipLevel] AS [Extent2] ON [Extent1].[PartnershipLevelId] = [Extent2].[PartnershipLevelId]
INNER JOIN [admin].[Country] AS [Extent3] ON [Extent1].[CountryID] = [Extent3].[CountryID]
INNER JOIN [admin].[Region] AS [Extent4] ON [Extent1].[RegionID] = [Extent4].[RegionID]
INNER JOIN [admin].[IRSEntityType] AS [Extent5] ON [Extent1].[IRSEntityTypeID] = [Extent5].[IRSEntityTypeID]
LEFT OUTER JOIN (SELECT [Extent6].[PartnerId] AS [PartnerId], [Extent7].[FundGroupID] AS [FundGroupID1], [Extent7].[FundGroupCode] AS [FundGroupCode], [Extent7].[FundGroupName] AS [FundGroupName], [Extent7].[SessionID] AS [SessionID], [Extent7].[CreateDate] AS [CreateDate], [Extent7].[ModifiedDate] AS [ModifiedDate]
FROM [admin].[PartnerFundGroupMap] AS [Extent6]
INNER JOIN [admin].[FundGroup] AS [Extent7] ON [Extent6].[FundGroupId] = [Extent7].[FundGroupID] ) AS [Join5] ON [Extent1].[PartnerID] = [Join5].[PartnerId]
WHERE [Extent1].[PartnerID] = @p__linq__0
) AS [Project1]
ORDER BY [Project1].[IRSEntityTypeID1] ASC, [Project1].[RegionID1] ASC, [Project1].[CountryID1] ASC, [Project1].[PartnershipLevelId1] ASC, [Project1].[PartnerID] ASC, [Project1].[C16] ASC',N'@p__linq__0 int',@p__linq__0=1
I have found the issue. It is unrelated to the many-to-many mapping and is the result of a required mapping for another entity. Thank you to Dabblernl for having me post the query; upon further inspection of it I found the issue.
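For anyone hitting the same symptom, a minimal sketch (hypothetical entities, not my actual model) of why a required mapping matters here: EF translates an eagerly loaded required reference into an INNER JOIN, while an optional reference becomes a LEFT OUTER JOIN.

using System.Data.Entity;

public class Country
{
    public int CountryID { get; set; }
    public string CountryName { get; set; }
}

public class Partner
{
    public int PartnerID { get; set; }
    public int? CountryID { get; set; }
    public virtual Country Country { get; set; }
}

public class SampleContext : DbContext
{
    public DbSet<Partner> Partners { get; set; }
    public DbSet<Country> Countries { get; set; }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // Required reference: queries that pull in Country use an INNER JOIN,
        // so partners without a matching Country row are silently dropped.
        // modelBuilder.Entity<Partner>()
        //     .HasRequired(p => p.Country)
        //     .WithMany()
        //     .HasForeignKey(p => p.CountryID);

        // Optional reference: the same queries use a LEFT OUTER JOIN,
        // so partners come back even without an associated Country.
        modelBuilder.Entity<Partner>()
            .HasOptional(p => p.Country)
            .WithMany()
            .HasForeignKey(p => p.CountryID);
    }
}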
A system wraps lines in a log file if they exceed X characters. I am trying to extract various data from the log, but first I need to combine all the split lines so gawk can parse the fields as a single record.
For example:
2012/11/01 field1 field2 field3 field4 fi
eld5 field6 field7
2012/11/03 field1 field2 field3
2012/12/31 field1 field2 field3 field4 fi
eld5 field6 field7 field8 field9 field10
field11 field12 field13
2013/01/10 field1 field2 field3
2013/01/11 field1 field2 field3 field4
I want to return
2012/11/01 field1 field2 field3 field4 field5 field6 field7
2012/11/03 field1 field2 field3
2012/12/31 field1 field2 field3 field4 field5 field6 field7 field8 field9 field10 field11 field12 field13
2013/01/10 field1 field2 field3
2013/01/11 field1 field2 field3 field4
The actual max line length in my case is 130. I'm reluctant to test for that length and use getline to join the next line, in case there is an entry that is exactly 130 chars long.
Once I've cleaned up the log file, I'm also going to want to extract all the relevant events, where "relevant" may involve criteria like:
'foo' is anywhere in any field in the record
field2 ~ /bar|dtn/
if field1 ~ /xyz|abc/ && field98 == "0001"
I'm wondering if I will need to run two successive gawk programs, or if I can combine all of this into one.
I'm a gawk newbie and come from a non-Unix background.
$ awk '{printf "%s%s",($1 ~ "/" ? rs : ""),$0; rs=RS} END{print ""}' file
2012/11/01 field1 field2 field3 field4 field5 field6 field7
2012/11/03 field1 field2 field3
2012/12/31 field1 field2 field3 field4 field5 field6 field7 field8 field9 field10 field11 field12 field13
2013/01/10 field1 field2 field3
2013/01/11 field1 field2 field3 field4
Now that I've noticed you don't actually want to just print the recombined records, here's an alternative that's more amenable to running tests on the recombined record (s in this script):
$ awk 'NR>1 && $1~"/"{print s; s=""} {s=s $0} END{print s}' file
Now with that structure, instead of just printing s you can perform tests on it, for example (note the "foo" in the 3rd record):
$ cat file
2012/11/01 field1 field2 field3 field4 fi
eld5 field6 field7
2012/11/03 field1 field2 field3
2012/12/31 field1 field2 foo field4 fi
eld5 field6 field7 field8 field9 field10
field11 field12 field13
2013/01/10 field1 field2 field3
2013/01/11 field1 field2 field3 field4
$ awk '
function tst(rec, flds,nf,i) {
nf=split(rec,flds)
if (rec ~ "foo") {
print rec
for (i=1;i<=nf;i++)
print "\t",i,flds[i]
}
}
NR>1 && $1~"/" { tst(s); s="" }
{ s=s $0 }
END { tst(s) }
' file
2012/12/31 field1 field2 foo field4 field5 field6 field7 field8 field9 field10 field11 field12 field13
1 2012/12/31
2 field1
3 field2
4 foo
5 field4
6 field5
7 field6
8 field7
9 field8
10 field9
11 field10
12 field11
13 field12
14 field13
gawk '{ gsub( "\n", "" ); printf "%s%s", $0, RT }
END { print }' RS='\n[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]' input
This can be somewhat simplified with:
gawk --re-interval '{ gsub( "\n", "" ); printf "%s%s", $0, RT }
END { print }' RS='\n[0-9]{4}/[0-9]{2}/[0-9]{2}' input
This might work for you (GNU sed):
sed -r ':a;$!N;\#\n[0-9]{4}/[0-9]{2}/[0-9]{2}#!{s/\n//;ta};P;D' file
Here's a slightly bigger Perl solution which also handles the additional filtering (as you tagged this perl as well):
root@virtualdeb:~# cat combine_and_filter.pl
#!/usr/bin/perl -n
if (m!^2\d{3}/\d{2}/\d{2} !){
    # a new record starts: print the previous one if it matches the filter
    print $prevline if $prevline =~ m/field13/;
    $prevline = $_;
}else{
    # continuation of a wrapped line: append it to the current record
    chomp($prevline);
    $prevline .= $_;
}
END { print $prevline if $prevline =~ m/field13/; }   # don't lose the last record
root@virtualdeb:~# perl combine_and_filter.pl < /tmp/in.txt
2012/12/31 field1 field2 field3 field4 field5 field6 field7 field8 field9 field10 field11 field12 field13
this may work for you:
awk --re-interval '/^[0-9]{4}\//&&s{print s;s=""}{s=s""sprintf($0)}END{print s}' file
test with your example:
kent$ echo "2012/11/01 field1 field2 field3 field4 fi
eld5 field6 field7
2012/11/03 field1 field2 field3
2012/12/31 field1 field2 field3 field4 fi
eld5 field6 field7 field8 field9 field10
field11 field12 field13
2013/01/10 field1 field2 field3
2013/01/11 field1 field2 field3 field4"|awk --re-interval '/^[0-9]{4}\//&&s{print s;s=""}{s=s""sprintf($0)}END{print s}'
2012/11/01 field1 field2 field3 field4 field5 field6 field7
2012/11/03 field1 field2 field3
2012/12/31 field1 field2 field3 field4 field5 field6 field7 field8 field9 field10 field11 field12 field13
2013/01/10 field1 field2 field3
2013/01/11 field1 field2 field3 field4