How can I implement a Beam SQL UDF to split one column into multiple columns?
I have already implemented this as a BigQuery UDF:
CREATE TEMP FUNCTION parseDescription(description STRING)
RETURNS STRUCT<msg STRING, ip STRING, source_region STRING, user_name STRING>
LANGUAGE js AS """
var arr = description.substring(0, description.length - 1).split(",");
var firstIndex = arr[0].indexOf(".");
this.msg = arr[0].substring(0, firstIndex);
this.ip = arr[0].substring(firstIndex + 2).split(": ")[1];
this.source_region = arr[1].split(": ")[1];
this.user_name = arr[2].split(": ")[1];
return this;
""";
INSERT INTO `table1` SELECT parseDescription(event_description).* FROM `table2`;
Does a Beam SQL UDF also support this kind of operation?
I tried to return an object from a Beam UDF, but it seems that Beam SQL doesn't support the object.* syntax. I also tried to return a map or an array but still got an error.
Is there any way to implement the same UDF in Beam?
I tried to use the MapElements method but got an error; it seems that the output row is expected to have the same schema as the input row. Example:
import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.*;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.Row;
public class BeamMain2 {
public static void main(String[] args) {
DirectOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
.as(DirectOptions.class);
Pipeline p = Pipeline.create(options);
// Define the schema for the records.
Schema appSchema = Schema.builder().addStringField("string1").addInt32Field("int1").build();
Row row1 = Row.withSchema(appSchema).addValues("aaa,bbb", 1).build();
Row row2 = Row.withSchema(appSchema).addValues("ccc,ddd", 2).build();
Row row3 = Row.withSchema(appSchema).addValues("ddd,eee", 3).build();
PCollection<Row> inputTable =
PBegin.in(p).apply(Create.of(row1, row2, row3).withRowSchema(appSchema));
Schema newSchema =
Schema.builder()
.addNullableField("string2", Schema.FieldType.STRING)
.addInt32Field("int1")
.addNullableField("string3", Schema.FieldType.STRING)
.build();
PCollection<Row> outputStream = inputTable.apply(
SqlTransform.query(
"SELECT * "
+ "FROM PCOLLECTION where int1 > 1"))
.apply(MapElements.via(
new SimpleFunction<Row, Row>() {
@Override
public Row apply(Row line) {
return Row.withSchema(newSchema).addValues("a", 1, "b").build();
}
}));
p.run().waitUntilFinish();
}
}
Reference: https://beam.apache.org/documentation/dsls/sql/overview/
You can emit 'Row' elements from a transform, which can later be used as a table.
The pipeline would look something like this:
Schema
Schema schema =
Schema.of(Schema.Field.of("f0", FieldType.INT64), Schema.Field.of("f1", FieldType.INT64));
Transform
private static MapElements<Row, Row> rowsToStrings() {
return MapElements.into(TypeDescriptor.of(Row.class))
.via(
row -> Row.withSchema(schema).addValue(1L).addValue(2L).build());
}
Pipeline:
pipeline
.apply(
"SQL Query 1",
SqlTransform.query(<Query string 1>))
.apply("Transform column", rowsToStrings())
.apply(
"SQL Query 2",
SqlTransform.query(<Query string 2>))
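For the original question (splitting one string column into several), a minimal sketch along these lines could look like the following, reusing the inputTable, schema fields and imports from the question's code. The names splitSchema, splitRows, part1 and part2 are just illustrative; the important parts are that the SimpleFunction builds rows against the new schema and that setRowSchema is called on the output so the new rows can be fed into a further SqlTransform.
Schema splitSchema = Schema.builder()
    .addNullableField("part1", Schema.FieldType.STRING)
    .addNullableField("part2", Schema.FieldType.STRING)
    .addInt32Field("int1")
    .build();
PCollection<Row> splitRows = inputTable
    .apply(SqlTransform.query("SELECT * FROM PCOLLECTION WHERE int1 > 1"))
    .apply(MapElements.via(new SimpleFunction<Row, Row>() {
      @Override
      public Row apply(Row row) {
        // split "aaa,bbb" into two columns and carry int1 through
        String[] parts = row.getString("string1").split(",");
        return Row.withSchema(splitSchema)
            .addValues(parts[0], parts.length > 1 ? parts[1] : null, row.getInt32("int1"))
            .build();
      }
    }))
    .setRowSchema(splitSchema); // tells Beam SQL about the new shape
splitRows can then be used as the input table of another SqlTransform.query(...).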
Related
I'm consuming Avro serialized messages from Kafka using the "automatic" deserializer like:
props.put(
ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
"io.confluent.kafka.serializers.KafkaAvroDeserializer"
);
props.put("schema.registry.url", "https://example.com");
This works brilliantly, and is right out of the docs at https://docs.confluent.io/current/schema-registry/serializer-formatter.html#serializer.
The problem I'm facing is that I actually just want to forward these messages, but to do the routing I need some metadata from inside. Some technical constraints mean that I can't feasibly compile in generated class files to use KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG => true, so I am using a regular decoder without being tied to Kafka, specifically just reading the bytes as an Array[Byte] and passing them to a manually constructed deserializer:
var maxSchemasToCache = 1000;
var schemaRegistryURL = "https://example.com/"
var specificDeserializerProps = Map(
"schema.registry.url"
-> schemaRegistryURL,
KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG
-> "false"
);
var client = new CachedSchemaRegistryClient(
schemaRegistryURL,
maxSchemasToCache
);
var deserializer = new KafkaAvroDeserializer(
client,
specificDeserializerProps.asJava
);
The messages are a "container" type, with the really interesting part being one of about 25 types in a union { A, B, C } msg record field:
record Event {
timestamp_ms created_at;
union {
Online,
Offline,
Available,
Unavailable,
...
...Failed,
...Updated
} msg;
}
So I'm successfully reading an Array[Byte] into a record and feeding it into the deserializer like this:
var genericRecord = deserializer.deserialize(topic, consumerRecord.value())
.asInstanceOf[GenericRecord];
var schema = genericRecord.getSchema();
var msgSchema = schema.getField("msg").schema();
The problem, however, is that I can find no way to discern, discriminate or "resolve" the "type" of the msg field through the union:
System.out.printf(
"msg.schema = %s msg.schema.getType = %s\n",
msgSchema.getFullName(),
msgSchema.getType().name());
=> msg.schema = union msg.schema.getType = union
How can I discriminate types in this scenario? The Confluent registry knows; these things have names and "types", even if I'm treating them as GenericRecords.
My goal here is to know that record.msg is of "type" Online | Offline | Available rather than just knowing it's a union.
After having looked into the implementation of the Avro Java library, I think it's safe to say that this is impossible given the current API. I've found the following way of extracting the types while parsing, using a custom GenericDatumReader subclass, but it needs a lot of polishing before I'd use something like this in production code :D
So here's the subclass:
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.io.ResolvingDecoder;
import java.io.IOException;
import java.util.List;
public class CustomReader<D> extends GenericDatumReader<D> {
private final GenericData data;
private Schema actual;
private Schema expected;
private ResolvingDecoder creatorResolver = null;
private final Thread creator;
private List<Schema> unionTypes;
// vvv This is the constructor I've modified, added a list of types
public CustomReader(Schema schema, List<Schema> unionTypes) {
this(schema, schema, GenericData.get());
this.unionTypes = unionTypes;
}
public CustomReader(Schema writer, Schema reader, GenericData data) {
this(data);
this.actual = writer;
this.expected = reader;
}
protected CustomReader(GenericData data) {
this.data = data;
this.creator = Thread.currentThread();
}
protected Object readWithoutConversion(Object old, Schema expected, ResolvingDecoder in) throws IOException {
switch (expected.getType()) {
case RECORD:
return super.readRecord(old, expected, in);
case ENUM:
return super.readEnum(expected, in);
case ARRAY:
return super.readArray(old, expected, in);
case MAP:
return super.readMap(old, expected, in);
case UNION:
// vvv The magic happens here
Schema type = expected.getTypes().get(in.readIndex());
unionTypes.add(type);
return super.read(old, type, in);
case FIXED:
return super.readFixed(old, expected, in);
case STRING:
return super.readString(old, expected, in);
case BYTES:
return super.readBytes(old, expected, in);
case INT:
return super.readInt(old, expected, in);
case LONG:
return in.readLong();
case FLOAT:
return in.readFloat();
case DOUBLE:
return in.readDouble();
case BOOLEAN:
return in.readBoolean();
case NULL:
in.readNull();
return null;
default:
return super.readWithoutConversion(old, expected, in);
}
}
}
I've added comments to the code for the interesting parts, as it's mostly boilerplate.
Then you can use this custom reader like this:
List<Schema> unionTypes = new ArrayList<>();
DatumReader<GenericRecord> datumReader = new CustomReader<GenericRecord>(schema, unionTypes);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(eventFile, datumReader);
GenericRecord event = null;
while (dataFileReader.hasNext()) {
event = dataFileReader.next(event);
}
System.out.println(unionTypes);
This will print, for each union parsed, the type of that union. Note that you'll have to figure out which element of that list is interesting to you depending on how many unions you have in a record, etc.
Not pretty tbh :D
I was able to come up with a single-use solution after a lot of digging:
val records: ConsumerRecords[String, Array[Byte]] = consumer.poll(100);
for (consumerRecord <- asScalaIterator(records.iterator)) {
  var genericRecord = deserializer.deserialize(topic, consumerRecord.value()).asInstanceOf[GenericRecord];
  var msgSchema = genericRecord.get("msg").asInstanceOf[GenericRecord].getSchema();
  System.out.printf("%s \n", msgSchema.getFullName());
}
This prints com.myorg.SomeSchemaFromTheEnum and works perfectly in my use-case.
The confusing thing is that, because of the use of GenericRecord, .get("msg") returns Object, which in the general case I have no way to safely typecast. In this limited case, I know the cast is safe.
In my limited use-case the solution in the 5 lines above is suitable, but for a more general solution the answer https://stackoverflow.com/a/59844401/119669 posted by https://stackoverflow.com/users/124257/fresskoma seems more appropriate.
Whether to use DatumReader or GenericRecord is probably a matter of preference, and of whether the Kafka ecosystem is in mind; with Avro alone I'd probably prefer a DatumReader solution, but in this instance I can live with having Kafka-esque nomenclature in my code.
To retrieve the schema of the value of a field, you can use
new GenericData().induce(genericRecord.get("msg"))
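For example, a minimal sketch (assuming genericRecord is the deserialized Event from the question; induce() derives a schema from a datum, and for a GenericRecord that is the record's own schema, so this only helps because the union members here are records):
Schema msgSchema = new GenericData().induce(genericRecord.get("msg"));
System.out.println(msgSchema.getFullName()); // e.g. com.myorg.Online rather than just "union"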
I want to implement a configurable Kafka stream which reads a row of data and applies a list of transforms, like applying functions to the fields of the record, renaming fields, etc. The stream should be completely configurable, so I can specify which transforms should be applied to which field. I'm using Avro to encode the data as GenericRecords.
My problem is that I also need transforms which create new columns. Instead of overwriting the previous value of a field, they should append a new field to the record. This means the schema of the record changes.
The solution I came up with so far is to iterate over the list of transforms first to figure out which fields I need to add to the schema. I then create a new schema with the old fields and new fields combined.
The list of transforms (there is always a source field which gets passed to the transform method, and the result is then written back to the targetField):
val transforms: List[Transform] = List(
FieldTransform(field = "referrer", targetField = "referrer", method = "mask"),
FieldTransform(field = "name", targetField = "name_clean", method = "replaceUmlauts")
)
case class FieldTransform(field: String, targetField: String, method: String)
method to create the new schema, based on the old schema and the list of transforms
def getExtendedSchema(schema: Schema, transforms: List[Transform]): Schema = {
var newSchema = SchemaBuilder
.builder(schema.getNamespace)
.record(schema.getName)
.fields()
// create new schema with existing fields from schemas and new fields which are created through transforms
val fields = schema.getFields ++ getNewFields(schema, transforms)
fields
.foldLeft(newSchema)((newSchema, field: Schema.Field) => {
newSchema
.name(field.name)
.`type`(field.schema())
.noDefault()
// TODO: find way to differentiate between explicitly set null defaults and fields which have no default
//.withDefault(field.defaultValue())
})
newSchema.endRecord()
}
def getNewFields(schema: Schema, transforms: List[Transform]): List[Schema.Field] = {
transforms
.filter { // only select targetFields which are not in schema
case FieldTransform(field, targetField, method) => schema.getField(targetField) == null
case _ => false
}
.distinct
.map { // create new Field object for each targetField
case FieldTransform(field, targetField, method) =>
val sourceField = schema.getField(field)
new Schema.Field(targetField, sourceField.schema(), sourceField.doc(), sourceField.defaultValue())
}
}
Instantiating a new GenericRecord based on an old record
val extendedSchema = getExtendedSchema(row.getSchema, transforms)
val extendedRow = new GenericData.Record(extendedSchema)
for (field <- row.getSchema.getFields) {
extendedRow.put(field.name, row.get(field.name))
}
I tried to look for other solutions but couldn't find any example which dealt with changing data types. It feels to me like there must be a simpler, cleaner solution for handling changing Avro schemas at runtime. Any ideas are appreciated.
Thanks,
Paul
I have implemented passing dynamic values into an Avro schema and validating the resulting record (including union fields) against that schema.
Example:
RestTemplate template = new RestTemplate();
HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.APPLICATION_JSON);
HttpEntity<String> entity = new HttpEntity<String>(headers);
ResponseEntity<String> response = template.exchange(registryUrl + "/subjects/" + topic + "/versions/" + version, HttpMethod.GET, entity, String.class);
String responseData = response.getBody();
JSONObject jsonObject = new JSONObject(responseData); // the schema registry response
JSONObject jsonObjectResult = new JSONObject(jsonResult); // jsonResult: the JSON string with your field values (e.g. passed from Postman)
String getData = jsonObject.get("schema").toString();
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(getData);
GenericRecord genericRecord = new GenericData.Record(schema);
schema.getFields().stream().forEach(field->{
genericRecord.put(field.name(),jsonObjectResult.get(field.name()));
});
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
boolean isValid = reader.getData().validate(schema, genericRecord);
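Tying the validation idea back to the earlier schema-extension question, a minimal sketch in Java could look like the one below. It assumes extendedSchema and row come from the code above, uses GenericRecordBuilder instead of constructing GenericData.Record directly, and replaceUmlauts stands in for a hypothetical transform method.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
GenericRecordBuilder builder = new GenericRecordBuilder(extendedSchema);
// copy the original fields over
for (Schema.Field field : row.getSchema().getFields()) {
    builder.set(field.name(), row.get(field.name()));
}
// set the newly created field produced by the transform (hypothetical method)
builder.set("name_clean", replaceUmlauts((String) row.get("name")));
GenericRecord extendedRow = builder.build();
// validate the extended record against the extended schema
boolean valid = GenericData.get().validate(extendedSchema, extendedRow);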
I'd like to export AEM reports, such as page activity or component activity reports, to an Excel file.
Is this feature available in AEM, or do I have to write something custom for this?
The closest you will get is a CSV selector that can convert report data to CSV, but even that has limitations (pagination and filters may be ignored depending on the report).
This, AFAIK, is not an OOTB function. There are old posts and blogs out there that show how this can be done either client side (using JS) or server side using CSV writers.
If you are going down the route of writing a custom solution (the most likely outcome), have a look at the CSV text library that is used in the acs-commons user CSV import/export utility; it makes the job really easy and is already a part of AEM.
Hope this helps.
The WriteExcel class simply takes the List collection that is used to populate the JTable object and writes the data to an Excel spreadsheet.
This class uses the Java Excel API. The Java Excel API dependency that is required to work with this API is already in the POM dependencies section.
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Locale;
import jxl.CellView;
import jxl.Workbook;
import jxl.WorkbookSettings;
import jxl.format.UnderlineStyle;
import jxl.write.Formula;
import jxl.write.Label;
import jxl.write.Number;
import jxl.write.WritableCellFormat;
import jxl.write.WritableFont;
import jxl.write.WritableSheet;
import jxl.write.WritableWorkbook;
import jxl.write.WriteException;
import jxl.write.biff.RowsExceededException;
public class WriteExcel {
private WritableCellFormat timesBoldUnderline;
private WritableCellFormat times;
private String inputFile;
public void setOutputFile(String inputFile) {
this.inputFile = inputFile;
}
public int write( List<members> memberList) throws IOException, WriteException {
File file = new File(inputFile);
WorkbookSettings wbSettings = new WorkbookSettings();
wbSettings.setLocale(new Locale("en", "EN"));
WritableWorkbook workbook = Workbook.createWorkbook(file, wbSettings);
workbook.createSheet("Community Report", 0);
WritableSheet excelSheet = workbook.getSheet(0);
createLabel(excelSheet) ;
int size = createContent(excelSheet, memberList);
workbook.write();
workbook.close();
return size ;
}
private void createLabel(WritableSheet sheet)
throws WriteException {
// Let's create a Times font
WritableFont times10pt = new WritableFont(WritableFont.TIMES, 10);
// Define the cell format
times = new WritableCellFormat(times10pt);
// Let's automatically wrap the cells
times.setWrap(true);
// Create a bold font with underlines
WritableFont times10ptBoldUnderline = new WritableFont(WritableFont.TIMES, 10, WritableFont.BOLD, false,
UnderlineStyle.SINGLE);
timesBoldUnderline = new WritableCellFormat(times10ptBoldUnderline);
// Let's automatically wrap the cells
timesBoldUnderline.setWrap(true);
CellView cv = new CellView();
cv.setFormat(times);
cv.setFormat(timesBoldUnderline);
cv.setAutosize(true);
// Write a few headers
addCaption(sheet, 0, 0, "Number");
addCaption(sheet, 1, 0, "Points");
addCaption(sheet, 2, 0, "Name");
addCaption(sheet, 3, 0, "Screen Name");
}
private int createContent(WritableSheet sheet, List<members> memberList) throws WriteException,
RowsExceededException {
int size = memberList.size() ;
// This is where we will add Data from the JCR
for (int i = 0; i < size; i++) {
members mem = (members)memberList.get(i) ;
String number = mem.getNum();
String points = mem.getScore();
String name = mem.getName();
String display = mem.getDisplay();
// First column
addLabel(sheet, 0, i+2, number);
// Second column
addLabel(sheet, 1, i+2, points);
// Third column
addLabel(sheet, 2, i+2, name);
// Fourth column
addLabel(sheet, 3, i+2, display);
}
return size;
}
private void addCaption(WritableSheet sheet, int column, int row, String s)
throws RowsExceededException, WriteException {
Label label;
label = new Label(column, row, s, timesBoldUnderline);
sheet.addCell(label);
}
private void addNumber(WritableSheet sheet, int column, int row,
Integer integer) throws WriteException, RowsExceededException {
Number number;
number = new Number(column, row, integer, times);
sheet.addCell(number);
}
private void addLabel(WritableSheet sheet, int column, int row, String s)
throws WriteException, RowsExceededException {
Label label;
label = new Label(column, row, s, times);
sheet.addCell(label);
}
public int exportExcel( List<members> memberList)
{
try
{
setOutputFile("JCRMembers.xls");
int recs = write( memberList);
return recs ;
}
catch(Exception e)
{
e.printStackTrace();
}
return -1;
}
}
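A minimal usage sketch, assuming memberList has already been populated with the members beans the class expects (fetchMembersFromJcr() is a hypothetical helper standing in for the actual JCR query):
List<members> memberList = fetchMembersFromJcr(); // hypothetical helper
int exported = new WriteExcel().exportExcel(memberList); // writes JCRMembers.xls, returns -1 on failure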
You can follow the steps described here: Adobe Forums
Select your required page to see the component report at http://localhost:4502/etc/reports/compreport.html
Now hit the URL below; it gives you JSON output: http://localhost:4502/etc/reports/compreport/jcr:content/report.data.json
Copy and paste the generated JSON output into the converter below and click on JSON to Excel: http://www.convertcsv.com/json-to-csv.htm
You need to write your own logic:
create a servlet,
construct a table with the data,
and in the response object add the lines below:
response.setContentType("text/csv");
response.setCharacterEncoding("UTF-8");
response.setHeader("Content-Disposition", "attachment; filename=\"" + reportName + ".csv\"");
Cookie cookie = new Cookie("fileDownload", "true");
cookie.setMaxAge(-1);
cookie.setPath("/");
response.addCookie(cookie);
Once you click on the button you will get the report in Excel (CSV) format.
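A minimal sketch of such a servlet (plain javax.servlet here to keep it generic; in AEM you would typically register it as a Sling servlet instead, and buildReportRows() is a hypothetical helper standing in for the actual report query):
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
public class ReportCsvServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException {
        String reportName = "component-activity";
        response.setContentType("text/csv");
        response.setCharacterEncoding("UTF-8");
        response.setHeader("Content-Disposition", "attachment; filename=\"" + reportName + ".csv\"");
        // cookie so a client-side "fileDownload" helper can detect that the download finished
        Cookie cookie = new Cookie("fileDownload", "true");
        cookie.setMaxAge(-1);
        cookie.setPath("/");
        response.addCookie(cookie);
        PrintWriter out = response.getWriter();
        out.println("Page,Component,Count");
        for (String[] row : buildReportRows()) { // hypothetical helper that runs the report query
            out.println(String.join(",", row)); // values assumed not to need CSV escaping in this sketch
        }
    }
    private List<String[]> buildReportRows() {
        // placeholder: query the JCR / report data and return it as CSV rows
        return java.util.Collections.emptyList();
    }
}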
How can I create a DataFrame from all my JSON files, when after reading each file I need to add the file name as a field in the DataFrame? It seems the variable in the for loop is not recognized outside the loop. How can I overcome this issue?
for (jsonfilename <- fileArray) {
var df = hivecontext.read.json(jsonfilename)
var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
// trying to create temp table from dataframe created in loop
tblLanding.registerTempTable("LandingTable") // ERROR here, can't resolve tblLanding
Thanks in advance,
Hossain
I think you are new to programming itself.
Anyway, here you go.
Basically, you declare the variable and initialise it before the loop.
var df:DataFrame = null
for (jsonfilename <- fileArray) {
df = hivecontext.read.json(jsonfilename)
var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
}
df.registerTempTable("LandingTable") // df now resolves here, since it is declared outside the loop
Update
OK, you are completely new to programming, even loops.
Suppose fileArray has the values [1.json, 2.json, 3.json, 4.json].
The loop then actually creates 4 DataFrames, by reading 4 JSON files.
Which one do you want to register as a temp table?
If all of them:
var df: DataFrame = null
var count = 0
for (jsonfilename <- fileArray) {
df = hivecontext.read.json(jsonfilename)
val tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
tblLanding.registerTempTable(s"LandingTable_$count")
count += 1
}
And the reason for df being empty before this update is that your fileArray is empty or Spark failed to read the files. Print it and check.
To query any of the registered landing tables:
val df2 = hiveContext.sql("SELECT * FROM LandingTable_0")
Update
The question has changed to making a single DataFrame from all the JSON files.
var dataFrame:DataFrame = null
for (jsonfilename <- fileArray) {
val eachDataFrame = hivecontext.read.json(jsonfilename)
if(dataFrame == null)
dataFrame = eachDataFrame
else
dataFrame = eachDataFrame.unionAll(dataFrame)
}
dataFrame.registerTempTable("LandingTable")
Ensure that fileArray is not empty and that all JSON files in fileArray have the same schema.
// Create list of dataframes with source-file-names
val dfList = fileArray.map{ filename =>
hivecontext.read.json(filename)
.withColumn("source_file_name", lit(filename))
}
// union the dataframes (assuming all are same schema)
val df = dfList.reduce(_ unionAll _) // or use union if spark 2.x
// register as table
df.registerTempTable("LandingTable")
I'm attempting to parse SQL using the TSql100Parser provided by Microsoft. Right now I'm having a little trouble using it the way it seems to be intended to be used. Also, the lack of documentation doesn't help. (example: http://msdn.microsoft.com/en-us/library/microsoft.data.schema.scriptdom.sql.tsql100parser.aspx )
When I run a simple SELECT statement through the parser it returns a collection of TSqlStatements which contains a SELECT statement.
Trouble is, the TSqlSelect statement doesn't contain attributes such as a WHERE clause, even though the clause is implemented as a class. http://msdn.microsoft.com/en-us/library/microsoft.data.schema.scriptdom.sql.whereclause.aspx
The parser does recognise the WHERE clause as such, looking at the token stream.
So, my question is, am I using the parser correctly? Right now the token stream seems to be the most useful feature of the parser...
My Test project:
public static void Main(string[] args)
{
var parser = new TSql100Parser(false);
IList<ParseError> Errors;
IScriptFragment result = parser.Parse(
new StringReader("Select col from T1 where 1 = 1 group by 1;" +
"select col2 from T2;" +
"select col1 from tbl1 where id in (select id from tbl);"),
out Errors);
var Script = result as TSqlScript;
foreach (var ts in Script.Batches)
{
Console.WriteLine("new batch");
foreach (var st in ts.Statements)
{
IterateStatement(st);
}
}
}
static void IterateStatement(TSqlStatement statement)
{
Console.WriteLine("New Statement");
if (statement is SelectStatement)
{
PrintStatement(statement);
}
}
Yes, you are using the parser correctly.
As Damien_The_Unbeliever points out, within the SelectStatement there is a QueryExpression property which will be a QuerySpecification object for your third select statement (with the WHERE clause).
This represents the 'real' SELECT bit of the query (whereas the outer SelectStatement object you are looking at has just got the 'WITH' clause (for CTEs), 'FOR' clause (for XML), 'ORDER BY' and other bits)
The QuerySpecification object is the object with the FromClauses, WhereClause, GroupByClause etc.
So you can get to your WHERE Clause by using:
((QuerySpecification)((SelectStatement)statement).QueryExpression).WhereClause
which has a SearchCondition property etc. etc.
A quick glance around would indicate that it contains a QueryExpression, which could be a QuerySpecification, which does have the WHERE clause attached to it.
If someone lands here and wants to know how to get all the elements of a select statement, the following code explains that:
QuerySpecification spec = (QuerySpecification)(((SelectStatement)st).QueryExpression);
StringBuilder sb = new StringBuilder();
sb.AppendLine("Select Elements");
foreach (var elm in spec.SelectElements)
sb.Append(((Identifier)((Column)((SelectColumn)elm).Expression).Identifiers[0]).Value);
sb.AppendLine();
sb.AppendLine("From Elements");
foreach (var elm in spec.FromClauses)
sb.Append(((SchemaObjectTableSource)elm).SchemaObject.BaseIdentifier.Value);
sb.AppendLine();
sb.AppendLine("Where Elements");
BinaryExpression binaryexp = (BinaryExpression)spec.WhereClause.SearchCondition;
sb.Append("operator is " + binaryexp.BinaryExpressionType);
if (binaryexp.FirstExpression is Column)
sb.Append(" First exp is " + ((Identifier)((Column)binaryexp.FirstExpression).Identifiers[0]).Value);
if (binaryexp.SecondExpression is Literal)
sb.Append(" Second exp is " + ((Literal)binaryexp.SecondExpression).Value);
I had to split a SELECT statement into pieces. My goal was to COUNT how many records a query would return. My first solution was to build a sub-query such as
SELECT COUNT(*) FROM (select id, name from T where cat='A' order by id) as QUERY
The problem was that in this case the ORDER BY clause raises the error "The ORDER BY clause is not valid in views, inline functions, derived tables, sub-queries, and common table expressions, unless TOP or FOR XML is also specified".
So I built a parser that splits a SELECT statement into fragments using the TSql100Parser class.
using Microsoft.Data.Schema.ScriptDom.Sql;
using Microsoft.Data.Schema.ScriptDom;
using System.IO;
...
public class SelectParser
{
public string Parse(string sqlSelect, out string fields, out string from, out string groupby, out string where, out string having, out string orderby)
{
TSql100Parser parser = new TSql100Parser(false);
TextReader rd = new StringReader(sqlSelect);
IList<ParseError> errors;
var fragments = parser.Parse(rd, out errors);
fields = string.Empty;
from = string.Empty;
groupby = string.Empty;
where = string.Empty;
orderby = string.Empty;
having = string.Empty;
if (errors.Count > 0)
{
var retMessage = string.Empty;
foreach (var error in errors)
{
retMessage += error.Identifier + " - " + error.Message + " - position: " + error.Offset + "; ";
}
return retMessage;
}
try
{
// Extract the query assuming it is a SelectStatement
var query = ((fragments as TSqlScript).Batches[0].Statements[0] as SelectStatement).QueryExpression;
// Constructs the From clause with the optional joins
from = (query as QuerySpecification).FromClauses[0].GetString();
// Extract the where clause
where = (query as QuerySpecification).WhereClause.GetString();
// Get the field list
var fieldList = new List<string>();
foreach (var f in (query as QuerySpecification).SelectElements)
fieldList.Add((f as SelectColumn).GetString());
fields = string.Join(", ", fieldList.ToArray());
// Get The group by clause
groupby = (query as QuerySpecification).GroupByClause.GetString();
// Get the having clause of the query
having = (query as QuerySpecification).HavingClause.GetString();
// Get the order by clause
orderby = ((fragments as TSqlScript).Batches[0].Statements[0] as SelectStatement).OrderByClause.GetString();
}
catch (Exception ex)
{
return ex.ToString();
}
return string.Empty;
}
}
public static class Extension
{
/// <summary>
/// Get a string representing the SQL source fragment
/// </summary>
/// <param name="statement">The SQL Statement to get the string from, can be any derived class</param>
/// <returns>The SQL that represents the object</returns>
public static string GetString(this TSqlFragment statement)
{
string s = string.Empty;
if (statement == null) return string.Empty;
for (int i = statement.FirstTokenIndex; i <= statement.LastTokenIndex; i++)
{
s += statement.ScriptTokenStream[i].Text;
}
return s;
}
}
And to use this class simply:
string fields, from, groupby, where, having, orderby;
SelectParser selectParser = new SelectParser();
var retMessage = selectParser.Parse("SELECT * FROM T where cat='A' Order by Id desc",
out fields, out from, out groupby, out where, out having, out orderby);