Efficiently counting lines in a file - PowerShell

I'm trying to count the lines in a not-so-small text file (multiple MBs). The answers I found here suggest this:
(Get-Content foo.txt | Measure-Object -Line).Lines
This works, but the performance is poor. I guess the whole file is loaded into memory instead of streaming it line by line.
I created a test program in Java to compare the performance:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.concurrent.TimeUnit;
import java.util.function.ToLongFunction;
import java.util.stream.Stream;
public class LineCounterPerformanceTest {
public static void main(final String... args) {
if (args.length > 0) {
final String path = args[0];
measure(LineCounterPerformanceTest::java, path);
measure(LineCounterPerformanceTest::powershell, path);
} else {
System.err.println("Missing path.");
System.exit(-1);
}
}
private static long java(final String path) throws IOException {
System.out.println("Java");
try (final Stream<String> lines = Files.lines(Paths.get(path))) {
return lines.count();
}
}
private static long powershell(final String path) throws IOException, InterruptedException {
System.out.println("Powershell");
final Process ps = new ProcessBuilder("powershell", String.format("(Get-Content '%s' | Measure-Object -Line).Lines", path)).start();
if (ps.waitFor(1, TimeUnit.MINUTES) && ps.exitValue() == 0) {
try (final Scanner scanner = new Scanner(ps.getInputStream())) {
return scanner.nextLong();
}
}
throw new IOException("Timeout or error.");
}
private static <T, U extends T> void measure(final ExceptionalToLongFunction<T> function, final U value) {
final long start = System.nanoTime();
final long result = function.unchecked().applyAsLong(value);
final long end = System.nanoTime();
System.out.printf("Result: %d%n", result);
System.out.printf("Elapsed time (ms): %,.6f%n%n", (end - start) / 1_000_000.);
}
@FunctionalInterface
private static interface ExceptionalToLongFunction<T> {
long applyAsLong(T value) throws Exception;
default ToLongFunction<T> unchecked() {
return (value) -> {
try {
return applyAsLong(value);
} catch (final Exception ex) {
throw new RuntimeException(ex);
}
};
}
}
}
The plain Java solution is ~ 80 times faster.
Is there a built-in way to do this task with comparable performance? I'm on PowerShell 4.0, if that matters.

See if this isn't faster than your current method:
$count = 0
Get-Content foo.txt -ReadCount 2000 |
foreach { $Count += $_.count }
$count

You can use a StreamReader for this type of thing. I'm not sure how its speed compares to your Java code, but my understanding is that ReadLine only loads a single line at a time.
$StreamReader = New-Object System.IO.StreamReader($File)
$LineCount = 0
while ($StreamReader.ReadLine() -ne $null)
{
$LineCount++
}
$StreamReader.Close()

The switch statement was faster for my GB+ file with 900+ character lines.
$count = 0; switch -File $filepath {default { ++$count }}

Related

How can I force Spark/Hadoop to ignore the .gz extension on a file and read it as uncompressed plain text?

I have code along the lines of:
val lines: RDD[String] = sparkSession.sparkContext.textFile("s3://mybucket/file.gz")
The URL ends in .gz, but this is a result of legacy code. The file is plain text with no compression involved. However, Spark insists on reading it as a gzip file, which obviously fails. How can I make it ignore the extension and simply read the file as text?
Based on this article, I've tried setting the configuration in various places so that it doesn't include the gzip codec, e.g.:
sparkContext.getConf.set("spark.hadoop.io.compression.codecs", classOf[DefaultCodec].getCanonicalName)
This doesn't seem to have any effect.
Since the files are on S3, I can't simply rename them without copying the entire file.
First solution: Shading GzipCodec
The idea is to shadow/shade the GzipCodec defined in the package org.apache.hadoop.io.compress by including this Java file in your own sources and replacing this line:
public String getDefaultExtension() {
return ".gz";
}
with:
public String getDefaultExtension() {
return ".whatever";
}
When building your project, this has the effect of using your definition of GzipCodec instead of the one provided by the dependencies (this is the shadowing of GzipCodec).
This way, when parsing your file, textFile() is forced to fall back to the default codec, since the gzip codec no longer matches your file's name.
The drawback of this solution is that you won't be able to process real gzip files within the same app.
Second solution: Using newAPIHadoopFile with a custom/modified TextInputFormat
You can use newAPIHadoopFile (instead of textFile) with a custom/modified TextInputFormat which forces the use of the DefaultCodec (plain text).
We'll write our own line reader based on the default one (TextInputFormat). The idea is to remove the part of TextInputFormat that detects the .gz extension and would therefore decompress the file before reading it.
Instead of calling sparkContext.textFile,
// plain text file with a .gz extension:
sparkContext.textFile("s3://mybucket/file.gz")
we can use the underlying sparkContext.newAPIHadoopFile which allows us to specify how to read the input:
import org.apache.hadoop.mapreduce.lib.input.FakeGzInputFormat
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
sparkContext
.newAPIHadoopFile(
"s3://mybucket/file.gz",
classOf[FakeGzInputFormat], // This is our custom reader
classOf[LongWritable],
classOf[Text],
new Configuration(sparkContext.hadoopConfiguration)
)
.map { case (_, text) => text.toString }
The usual way of calling newAPIHadoopFile would be with TextInputFormat, which is the part that wraps how the file is read and where the compression codec is chosen based on the file extension.
Let's call our version FakeGzInputFormat and implement it as follows, as an extension of TextInputFormat (this is a Java file; place it in src/main/java/org/apache/hadoop/mapreduce/lib/input so it ends up in the org.apache.hadoop.mapreduce.lib.input package):
package org.apache.hadoop.mapreduce.lib.input;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import com.google.common.base.Charsets;
public class FakeGzInputFormat extends TextInputFormat {
public RecordReader<LongWritable, Text> createRecordReader(
InputSplit split,
TaskAttemptContext context
) {
String delimiter =
context.getConfiguration().get("textinputformat.record.delimiter");
byte[] recordDelimiterBytes = null;
if (null != delimiter)
recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
// Here we use our custom `FakeGzLineRecordReader` instead of
// `LineRecordReader`:
return new FakeGzLineRecordReader(recordDelimiterBytes);
}
@Override
protected boolean isSplitable(JobContext context, Path file) {
return true; // plain text is splittable (as opposed to gzip)
}
}
In fact we have to go one level deeper and also replace the default LineRecordReader (Java) with our own (let's call it FakeGzLineRecordReader).
As it's quite difficult to inherit from LineRecordReader, we can copy it (into src/main/java/org/apache/hadoop/mapreduce/lib/input) and slightly modify and simplify the initialize(InputSplit genericSplit, TaskAttemptContext context) method by forcing the use of the default codec (plain text):
(The only changes compared to the original LineRecordReader are marked with a comment explaining what's happening.)
package org.apache.hadoop.mapreduce.lib.input;
import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Seekable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@InterfaceAudience.LimitedPrivate({"MapReduce", "Pig"})
@InterfaceStability.Evolving
public class FakeGzLineRecordReader extends RecordReader<LongWritable, Text> {
private static final Logger LOG =
LoggerFactory.getLogger(FakeGzLineRecordReader.class);
public static final String MAX_LINE_LENGTH =
"mapreduce.input.linerecordreader.line.maxlength";
private long start;
private long pos;
private long end;
private SplitLineReader in;
private FSDataInputStream fileIn;
private Seekable filePosition;
private int maxLineLength;
private LongWritable key;
private Text value;
private byte[] recordDelimiterBytes;
public FakeGzLineRecordReader(byte[] recordDelimiter) {
this.recordDelimiterBytes = recordDelimiter;
}
// This has been simplified a lot since we don't need to handle compression
// codecs.
public void initialize(
InputSplit genericSplit,
TaskAttemptContext context
) throws IOException {
FileSplit split = (FileSplit) genericSplit;
Configuration job = context.getConfiguration();
this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
start = split.getStart();
end = start + split.getLength();
final Path file = split.getPath();
final FileSystem fs = file.getFileSystem(job);
fileIn = fs.open(file);
fileIn.seek(start);
in = new UncompressedSplitLineReader(
fileIn, job, this.recordDelimiterBytes, split.getLength()
);
filePosition = fileIn;
if (start != 0) {
start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
this.pos = start;
}
// Simplified as input is not compressed:
private int maxBytesToConsume(long pos) {
return (int) Math.max(Math.min(Integer.MAX_VALUE, end - pos), maxLineLength);
}
// Simplified as input is not compressed:
private long getFilePosition() {
return pos;
}
private int skipUtfByteOrderMark() throws IOException {
int newMaxLineLength = (int) Math.min(3L + (long) maxLineLength,
Integer.MAX_VALUE);
int newSize = in.readLine(value, newMaxLineLength, maxBytesToConsume(pos));
pos += newSize;
int textLength = value.getLength();
byte[] textBytes = value.getBytes();
if ((textLength >= 3) && (textBytes[0] == (byte)0xEF) &&
(textBytes[1] == (byte)0xBB) && (textBytes[2] == (byte)0xBF)) {
LOG.info("Found UTF-8 BOM and skipped it");
textLength -= 3;
newSize -= 3;
if (textLength > 0) {
textBytes = value.copyBytes();
value.set(textBytes, 3, textLength);
} else {
value.clear();
}
}
return newSize;
}
public boolean nextKeyValue() throws IOException {
if (key == null) {
key = new LongWritable();
}
key.set(pos);
if (value == null) {
value = new Text();
}
int newSize = 0;
while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
if (pos == 0) {
newSize = skipUtfByteOrderMark();
} else {
newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
pos += newSize;
}
if ((newSize == 0) || (newSize < maxLineLength)) {
break;
}
LOG.info("Skipped line of size " + newSize + " at pos " +
(pos - newSize));
}
if (newSize == 0) {
key = null;
value = null;
return false;
} else {
return true;
}
}
@Override
public LongWritable getCurrentKey() {
return key;
}
@Override
public Text getCurrentValue() {
return value;
}
public float getProgress() {
if (start == end) {
return 0.0f;
} else {
return Math.min(1.0f, (getFilePosition() - start) / (float)(end - start));
}
}
public synchronized void close() throws IOException {
try {
if (in != null) {
in.close();
}
} finally {}
}
}

Streaming in DataMapper in Mule ESB

I need to take data (input.xml) from a file that is 100 MB-200 MB in size and write it into four different files based on some logic.
Input XML:
<?xml version="1.0"?>
<Orders>
<Order><OrderId>1</OrderId><Total>10</Total><Name>jon1</Name></Order>
<Order><OrderId>2</OrderId><Total>20</Total><Name>jon2</Name></Order>
<Order><OrderId>3</OrderId><Total>30</Total><Name>jon3</Name></Order>
<Order><OrderId>4</OrderId><Total>40</Total><Name>jon4</Name></Order>
</Orders>
The logic: if Total is 1-10, write to file1; if Total is 11-20, write to file2; and so on.
Expected output:
1 10 jon1 -->write into file1
2 20 jon2 -->write into file2
3 30 jon3 -->write into file3
4 40 jon4 -->write into file4
I have enabled streaming in the DataMapper (it's under Configuration), but I'm not getting the proper output. The problem is that I'm only getting some of the records, and into only one file; those records should arrive in that file after satisfying the condition.
But if I disable the streaming button in the DataMapper, it works fine. As there are lakhs of records, I must use the streaming option.
Is there any other way to configure the DataMapper with the streaming option enabled?
Please advise. Thanks.
It is difficult to see the problem without a little more detail on what you are doing.
Nevertheless, I think it will probably help you to try another approach.
The DataMapper loads the full XML document into memory even if you activate streaming; it has to, in order to support XPath (it loads the full XML input into a DOM).
So if you cannot afford to load a 200 MB document into memory, you will need a workaround.
What I have done before is create a Java component that transforms the input stream into an iterator with the help of a StAX parser. With a very simple implementation you can code an iterator that pulls from the stream to create the next element (a POJO, a map, a string...). In the Mule flow, after the Java component, you should be able to use a for-each with a choice inside it and apply your logic.
A quick example for your data:
package tests;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import javax.xml.stream.FactoryConfigurationError;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
public class OrdersStreamIterator implements Iterator<Map<String,String>> {
final static Log LOGGER = LogFactory.getLog(OrdersStreamIterator.class);
final InputStream is;
final XMLStreamReader xmlReader;
boolean end = false;
HashMap<String,String> next;
public OrdersStreamIterator(InputStream is)
throws XMLStreamException, FactoryConfigurationError {
this.is = is;
xmlReader = XMLInputFactory.newInstance().createXMLStreamReader(is);
}
protected HashMap<String,String> _next() throws XMLStreamException {
int event;
HashMap<String,String> order = null;
String orderChild = null;
String orderChildValue = null;
while (xmlReader.hasNext()) {
event = xmlReader.getEventType();
if (event == XMLStreamConstants.START_ELEMENT) {
if (order==null) {
if (checkOrder()) {
order = new HashMap<String,String>();
}
}
else {
orderChild = xmlReader.getLocalName();
}
}
else if (event == XMLStreamConstants.END_ELEMENT) {
if (checkOrders()) {
end = true;
return null;
}
else if (checkOrder()) {
xmlReader.next();
return order;
}
else if (order!=null) {
order.put(orderChild, orderChildValue);
orderChild = null;
orderChildValue = null;
}
}
else if (order!=null && orderChild!=null){
switch (event) {
case XMLStreamConstants.SPACE:
case XMLStreamConstants.CHARACTERS:
case XMLStreamConstants.CDATA:
int start = xmlReader.getTextStart();
int length = xmlReader.getTextLength();
if (orderChildValue==null) {
orderChildValue = new String(xmlReader.getTextCharacters(), start, length);
}
else {
orderChildValue += new String(xmlReader.getTextCharacters(), start, length);
}
break;
}
}
xmlReader.next();
}
end = true;
return null;
}
protected boolean checkOrder() {
return "Order".equals(xmlReader.getLocalName());
}
protected boolean checkOrders() {
return "Orders".equals(xmlReader.getLocalName());
}
@Override
public boolean hasNext() {
if (end) {
return false;
}
else if (next==null) {
try {
next = _next();
} catch (XMLStreamException e) {
LOGGER.error(e.getMessage(), e);
end = true;
}
return !end;
}
else {
return true;
}
}
@Override
public Map<String,String> next() {
if (hasNext()) {
final HashMap<String,String> n = next;
next = null;
return n;
}
else {
return null;
}
}
@Override
public void remove() {
throw new RuntimeException("ReadOnly!");
}
// Test
public static String dump(Map<String,String> o) {
String s = "{";
for (Entry<String,String> e : o.entrySet()) {
if (s.length()>1) {
s+=", ";
}
s+= "\"" + e.getKey() + "\" : \"" + e.getValue() + "\"";
}
return s + "}";
}
public static void main(String[] argv) throws XMLStreamException, FactoryConfigurationError {
final InputStream is = OrdersStreamIterator.class.getClassLoader().getResourceAsStream("orders.xml");
final OrdersStreamIterator i = new OrdersStreamIterator(is);
while (i.hasNext()) {
System.out.println(dump(i.next()));
}
}
}
An example flow:
<flow name="testsFlow">
<http:listener config-ref="HTTP_Listener_Configuration" path="/" doc:name="HTTP"/>
<scripting:component doc:name="Groovy">
<scripting:script engine="Groovy"><![CDATA[return tests.OrdersStreamIterator.class.getClassLoader().getResourceAsStream("orders.xml");]]></scripting:script>
</scripting:component>
<set-payload value="#[new tests.OrdersStreamIterator(payload)]" doc:name="Iterator"/>
<foreach doc:name="For Each">
<logger message="#[tests.OrdersStreamIterator.dump(payload)]" level="INFO" doc:name="Logger"/>
</foreach>
</flow>

How to process multi-line input records in Spark

I have each record spread across multiple lines in the input file (a very large file).
For example:
Id: 2
ASIN: 0738700123
title: Test tile for this product
group: Book
salesrank: 168501
similar: 5 0738700811 1567184912 1567182813 0738700514 0738700915
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]
reviews: total: 12 downloaded: 12 avg rating: 4.5
2001-12-16 cutomer: A11NCO6YTE4BTJ rating: 5 votes: 5 helpful: 4
2002-1-7 cutomer: A9CQ3PLRNIR83 rating: 4 votes: 5 helpful: 5
How do I identify and process each multi-line record in Spark?
If the multi-line data has a defined record separator, you can use the Hadoop support for multi-line records, providing the separator through a Hadoop Configuration object. Something like this should do:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "id:")
val dataset = sc.newAPIHadoopFile("/path/to/data", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val data = dataset.map(x=>x._2.toString)
This will provide you with an RDD[String] where each element corresponds to a record. Afterwards you need to parse each record following your application requirements.
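For example, here is a minimal parsing sketch (plain Java, matching the Java used elsewhere in this thread; the helper class name and the choice of returning a map are mine, and the field names come from the sample record above). Note that with textinputformat.record.delimiter the delimiter text itself is not included in each record, so the first line of a record will be the bare Id value.
import java.util.LinkedHashMap;
import java.util.Map;

public final class RecordParser {

    // Splits one multi-line record block into a field -> value map.
    // Lines without a ':' (e.g. the "|Books[...]" category paths) are skipped here;
    // handle them separately if you need them.
    public static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String sample = "Id: 2\nASIN: 0738700123\ntitle: Test tile for this product\ngroup: Book";
        System.out.println(parse(sample));
        // {Id=2, ASIN=0738700123, title=Test tile for this product, group=Book}
    }
}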
I have done this by implementing a custom input format and record reader.
public class ParagraphInputFormat extends TextInputFormat {
@Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) {
return new ParagraphRecordReader();
}
}
public class ParagraphRecordReader extends RecordReader<LongWritable, Text> {
private long end;
private boolean stillInChunk = true;
private LongWritable key = new LongWritable();
private Text value = new Text();
private FSDataInputStream fsin;
private DataOutputBuffer buffer = new DataOutputBuffer();
private byte[] endTag = "\n\r\n".getBytes();
public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
FileSplit split = (FileSplit) inputSplit;
Configuration conf = taskAttemptContext.getConfiguration();
Path path = split.getPath();
FileSystem fs = path.getFileSystem(conf);
fsin = fs.open(path);
long start = split.getStart();
end = split.getStart() + split.getLength();
fsin.seek(start);
if (start != 0) {
readUntilMatch(endTag, false);
}
}
public boolean nextKeyValue() throws IOException {
if (!stillInChunk) return false;
boolean status = readUntilMatch(endTag, true);
value = new Text();
value.set(buffer.getData(), 0, buffer.getLength());
key = new LongWritable(fsin.getPos());
buffer.reset();
if (!status) {
stillInChunk = false;
}
return true;
}
public LongWritable getCurrentKey() throws IOException, InterruptedException {
return key;
}
public Text getCurrentValue() throws IOException, InterruptedException {
return value;
}
public float getProgress() throws IOException, InterruptedException {
return 0;
}
public void close() throws IOException {
fsin.close();
}
private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
int i = 0;
while (true) {
int b = fsin.read();
if (b == -1) return false;
if (withinBlock) buffer.write(b);
if (b == match[i]) {
i++;
if (i >= match.length) {
return fsin.getPos() < end;
}
} else i = 0;
}
}
}
endTag identifies the end of each record.
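For completeness, here is a sketch (not part of the original answer) of how this input format could be plugged into Spark through the Java API; the application name, local master, and input path are placeholders, and ParagraphInputFormat/ParagraphRecordReader are assumed to be compiled onto the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MultiLineRecordsExample {
    public static void main(String[] args) {
        // Local master only for illustration; normally set via spark-submit.
        SparkConf conf = new SparkConf().setAppName("multi-line-records").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each value produced by ParagraphInputFormat is one full multi-line record.
        JavaRDD<String> records = sc
                .newAPIHadoopFile(
                        "/path/to/data",
                        ParagraphInputFormat.class,  // the custom input format defined above
                        LongWritable.class,
                        Text.class,
                        new Configuration(sc.hadoopConfiguration()))
                .values()
                .map(Text::toString);

        records.take(5).forEach(System.out::println);
        sc.stop();
    }
}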

How to pass a parameter from C# to a PowerShell script file?

From the command line I can do:
.\test.ps1 1
How do I pass the parameter when doing this from C#?
I've tried
.AddArgument(1)
.AddParameter("p", 1)
And I have tried passing values in as IEnumerable<object> to .Invoke(), but $p does not get the value.
namespace ConsoleApplication1
{
using System;
using System.Linq;
using System.Management.Automation;
class Program
{
static void Main()
{
// Contents of ps1 file
// param($p)
// "Hello World ${p}"
var script = @".\test.ps1";
PowerShell
.Create()
.AddScript(script)
.Invoke().ToList()
.ForEach(Console.WriteLine);
}
}
}
How's this?
static void Main()
{
string script = @"C:\test.ps1 -arg 'hello world!'";
StringBuilder sb = new StringBuilder();
PowerShell psExec = PowerShell.Create();
psExec.AddScript(script);
psExec.AddCommand("out-string");
Collection<PSObject> results;
Collection<ErrorRecord> errors;
results = psExec.Invoke();
errors = psExec.Streams.Error.ReadAll();
if (errors.Count > 0)
{
foreach (ErrorRecord error in errors)
{
sb.AppendLine(error.ToString());
}
}
else
{
foreach (PSObject result in results)
{
sb.AppendLine(result.ToString());
}
}
Console.WriteLine(sb.ToString());
}
Here's a similar version that passes an instance of a DateTime
static void Main()
{
StringBuilder sb = new StringBuilder();
PowerShell psExec = PowerShell.Create();
psExec.AddCommand(@"C:\Users\d92495j\Desktop\test.ps1");
psExec.AddArgument(DateTime.Now);
Collection<PSObject> results;
Collection<ErrorRecord> errors;
results = psExec.Invoke();
errors = psExec.Streams.Error.ReadAll();
if (errors.Count > 0)
{
foreach (ErrorRecord error in errors)
{
sb.AppendLine(error.ToString());
}
}
else
{
foreach (PSObject result in results)
{
sb.AppendLine(result.ToString());
}
}
Console.WriteLine(sb.ToString());
}
So, to make a long answer short: use AddCommand instead of AddScript.
Another way is to fill the runspace with variables:
public static string RunPs1File(string filePath, Dictionary<string, object> arguments)
{
var result = new StringBuilder();
using (Runspace space = RunspaceFactory.CreateRunspace())
{
space.Open();
foreach (KeyValuePair<string, object> variable in arguments)
{
var key = new string(variable.Key.Where(char.IsLetterOrDigit).ToArray());
space.SessionStateProxy.SetVariable(key, variable.Value);
}
string script = System.IO.File.ReadAllText(filePath);
using (PowerShell ps1 = PowerShell.Create())
{
ps1.Runspace = space;
ps1.AddScript(script);
var psOutput = ps1.Invoke();
var errors = ps1.Streams.Error;
if (errors.Count > 0)
{
var e = errors[0].Exception;
ps1.Streams.ClearStreams();
throw e;
}
foreach (var line in psOutput)
{
if (line != null)
{
result.AppendLine(line.ToString());
}
}
}
}
return result.ToString();
}
Currently, a simpler solution is to read the script as a string first:
PowerShell
.Create()
.AddScript(File.ReadAllText(script)).AddParameter("p", 1)
.Invoke().ToList()
.ForEach(Console.WriteLine);

How do I accept multiple clients in this program?

Through a lot of learning and research, I wrote a server-side program. The problem is that it doesn't accept multiple clients, and I also want to know how to send the output back to the client side instead of displaying it on the server side. Can someone please help me out with the code? This is what I've tried so far:
class Program
{
private static Regex _regex = new Regex("not|http|console|application", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
static void Main(string[] args)
{
{
TcpListener serversocket = new TcpListener(8888);
TcpClient clientsocket = default(TcpClient);
serversocket.Start();
Console.WriteLine(">> Server Started");
clientsocket = serversocket.AcceptTcpClient();
Console.WriteLine("Accept Connection From Client");
try
{
using (var reader = new StreamReader(clientsocket.GetStream()))
{
string line;
int lineNumber = 0;
while (null != (line = reader.ReadLine()))
{
lineNumber += 1;
foreach (Match match in _regex.Matches(line))
{
Console.WriteLine("Line {0} matches {1}", lineNumber, match.Value);
}
}
}
}
catch (Exception ex)
{
Console.Error.WriteLine(ex.ToString());
}
clientsocket.Close();
serversocket.Stop();
Console.WriteLine(" >> exit");
Console.ReadLine();
}
}
}
You could change it to something like this.
static void Main(string[] args)
{
TcpListener serversocket = new TcpListener(8888);
TcpClient clientsocket = default(TcpClient);
serversocket.Start();
Console.WriteLine(">> Server Started");
while(true)
{
clientsocket = serversocket.AcceptTcpClient();
Console.WriteLine("Accept Connection From Client");
LineMatcher lm = new LineMatcher(clientsocket);
Thread thread = new Thread(new ThreadStart(lm.Run));
thread.Start();
Console.WriteLine("Client connected");
}
serversocket.Stop();
Console.WriteLine(" >> exit");
Console.ReadLine();
}
And then have this separate class handle the line matching:
public class LineMatcher{
private static Regex _regex = new Regex("not|http|console|application", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
private TcpClient _client;
public LineMatcher(TcpClient client)
{
_client = client;
}
public void Run()
{
try
{
using (var reader = new StreamReader(_client.GetStream()))
{
string line;
int lineNumber = 0;
while (null != (line = reader.ReadLine()))
{
lineNumber += 1;
foreach (Match match in _regex.Matches(line))
{
Console.WriteLine("Line {0} matches {1}", lineNumber, match.Value);
}
}
}
}
catch (Exception ex)
{
Console.Error.WriteLine(ex.ToString());
}
Console.WriteLine("Closing client");
_client.Close();
}
}
This is purely a proof of concept and in no way stable code ;)
Please note that this doesn't manage the child threads in any way.
There is also no clean way of shutting down the listener because of the while(true) loop.