- Comments,
- Line Breaks,
- Tabs, and
- Extra Spacing.
First we need to create regular expressions to identify the text to remove from the sql statement. Then we’ll create the VB.Net code to execute the expressions and remove the matches from the sql statement.
Regular Expression Statements
- Use parens () to organize regular expression rules into a logical group known as a set.
- Use brackets [] to define the characters to match.
- Use escaped characters for non-printing character matching.
- Use the pipe | character to match on multiple sets.
- A caret inside the brackets [^] indicates that the characters should be ignored.
- An asterisk * indicates that the previous set or character must be matched if it is found zero or more times.
There is no real magic in the statements that we will be creating today. They are very simple, straight-forward expressions.
Finding Line Breaks and Tabs
[\t\r\n]
As noted above we use brackets to indicate that any of the characters are matches. Since we are matching against non-printable characters we use escaped characters where \t represents a tab, \r is the return code and \n is a new line.
Finding Single Line Comments
--[^\r\n]*
Finding Multiple Line Comments
/\*[\w\W]*?(?=\*/)\*/
We begin by matching on a /*. Since the * has a special meaning (0 or more matches on the previous set) we have to escape it with a backslash. So /\* means match on /*.
Next we match on any alphanumeric character and any non-alphanumeric character. We continue this match for 0 or more times. [\w\W]* So, the first problem we encounter is that our ending characters */ is a non-word character and will be included in the match which for obvious reasons is undesired. That’s where the question mark comes in. The question mark tells the parser to use a lazy match. If the next expression is found then stop the match.
The question mark with the assignment operator tells the engine to match the previous condition if and only if it ends with the specified criteria which in this case is the asterisk and forward slash */. This instruction does not match the selection but does tell the parser when to stop matching.
Combining the Expressions
[\t\r\n]|(--[^\r\n]*)|(/\*[\w\W]*?(?=\*/)\*/)
Testing the Expressions
There is one problem…We are matching against characters contained within the text block. And, this is where things get sticky.
Writing a regular expression that ignores pattern sets while including other pattern sets is complicated. It is at this time that regular expressions get difficult to write and maintain.
Instead of wrestling the regular engine expression into ignoring matches found inside qualified text blocks we can tell the engine to include qualified text blocks in the match. Later while parsing thru the regular engine results we can ignore the text blocks.
Finding Text Blocks
'(''|[^'])*'
('(''|[^'])*')|[\t\r\n]|(--[^\r\n]*)|(/\*[\w\W]*?(?=\*/)\*/)
Now we are ready for a bit of VB.Net spice!
Using the Regular Expressions
In our code sample we execute our regular expression and loop thru the results. When we find a result that is a text block we ignore it. Any other result is replaced with a space.
We loop thru one last time searching again for text blocks and this time for extra spaces ([ ]{2,}). Once again we will ignore the text blocks and remove the extra spaces from the sql statement.
In case you are wondering why we didn’t simply use String.Replace(” “, ” “) to remove the extra spaces it is because this statement must be executed multiple times until String.IndexOf(” “) returns a negative number. That is much slower than a one time call to find all the extra spaces to remove. Additionally, we would have needed to write the logic to detect when the match was found inside a text block. Even for the simple scenario of removing extra spaces the regular expression saved us time and reduced the complexity of the VB.Net code.
Don’t be scared to use regular expressions. They have their place. At the same time don’t be scared to use old-fashioned code either. Use the tool that makes the best sense for the problem!
Public Shared Function ToRaw(ByVal commandText As String) As String Dim regExOptions As RegexOptions = regExOptions.IgnoreCase Or regExOptions.Multiline Dim rawText As String = commandText Dim regExText As String = "('(''|[^'])*')|[\t\r\n]|(--[^\r\n]*)|(/\*[\w\W]*?(?=\*/)\*/)" 'Replace Tab, Carriage Return, Line Feed, Single-row Comments and 'Multi-row Comments with a space when not included inside a text block. Dim patternMatchList As MatchCollection = Regex.Matches(rawText, regExText, regExOptions) For patternIndex As Integer = patternMatchList.Count - 1 To 0 Step -1 With patternMatchList.Item(patternIndex) If Not .Value.StartsWith("'") And Not .Value.EndsWith("'") Then rawText = rawText.Substring(0, .Index) & " " & rawText.Substring(.Index + .Length) End If End With Next 'Remove extra spacing that is not contained inside text qualifers. patternMatchList = Regex.Matches(rawText, "'([^']|'')*'|[ ]{2,}", regExOptions) For patternIndex As Integer = patternMatchList.Count - 1 To 0 Step -1 With patternMatchList.Item(patternIndex) If Not .Value.StartsWith("'") And Not .Value.EndsWith("'") Then rawText = rawText.Substring(0, .Index) & rawText.Substring(.Index + .Length - 1) End If End With Next 'Return value without leading and trailing spaces. Return rawText.Trim End Function
Conclusion
In today’s post we saw how to create a few simple regular expressions to parse SQL statements. We took advantage of the power of regular expressions while using the readability of VB.Net to keep the expressions simple.
I hope that my explanation showing how to construct regular expressions has peaked your interest so that you will want to learn more about regular expressions yourself.
If you are interested in more information about regular expressions I strongly recommend http://www.regular-expressions.info/.
Thank you.
==========================================================================
http://space.scmlife.com/home.php?mod=space&uid=8945&do=blog&quickforward=1&id=626
==========================================================================
public static string ToRaw(string commandText)
{
RegexOptions regExOptions = (RegexOptions.IgnoreCase | RegexOptions.Multiline);
string rawText=commandText;
string regExText = @”(‘(”|[^’])*’)|([\r|\n][\s| ]*[\r|\n])|(–[^\r\n]*)|(/\*[\w\W]*?(?=\*/)\*/)”;
//string regExText = @”(‘(”|[^’])*’)|[\t\r\n]|(–[^\r\n]*)|(/\*[\w\W]*?(?=\*/)\*/)”;
//’Replace Tab, Carriage Return, Line Feed, Single-row Comments and
//’Multi-row Comments with a space when not included inside a text block.
MatchCollection patternMatchList = Regex.Matches(rawText, regExText, regExOptions);
int iSkipLength = 0;
for (int patternIndex = 0; patternIndex < patternMatchList.Count; patternIndex++)
{
if (!patternMatchList[patternIndex].Value.StartsWith("'") && !patternMatchList[patternIndex].Value.EndsWith("'"))
{
rawText = rawText.Substring(0, patternMatchList[patternIndex].Index – iSkipLength) + " " + rawText.Substring(patternMatchList[patternIndex].Index – iSkipLength + patternMatchList[patternIndex].Length);
iSkipLength += (patternMatchList[patternIndex].Length – " ".Length);
}
}
//'Remove extra spacing that is not contained inside text qualifers.
patternMatchList = Regex.Matches(rawText, "'([^']|'')*'|[ ]{2,}", regExOptions);
iSkipLength = 0;
for (int patternIndex = 0; patternIndex < patternMatchList.Count; patternIndex++)
{
if (!patternMatchList[patternIndex].Value.StartsWith("'") && !patternMatchList[patternIndex].Value.EndsWith("'"))
{
rawText = rawText.Substring(0, patternMatchList[patternIndex].Index – iSkipLength)+" " + rawText.Substring(patternMatchList[patternIndex].Index – iSkipLength + patternMatchList[patternIndex].Length);
iSkipLength += (patternMatchList[patternIndex].Length – " ".Length);
}
}
//'Return value without leading and trailing spaces.
return rawText.Trim();
}
Comment by wangxn — July 9, 2011 @ 3:00 am |
What about multiline comments inside multiline comments? what would be a reg exp for that one?
Comment by opv — May 31, 2012 @ 1:29 pm |
Run it thru the parser twice would be the easiest.
Comment by Larry Steinle — October 7, 2012 @ 8:11 am |
Your regex does not remove this line when it’s the first line of the script:
— some comment ‘with literal text inside’
Comment by unknown user — November 14, 2012 @ 2:12 am |
I already found a fix for this: the first “if” statement in your VB.NET code should be replaced by this:
If .Value.StartsWith(“–“) OrElse (Not .Value.StartsWith(“‘”) And Not .Value.EndsWith(“‘”)) Then
Comment by unknown user — November 14, 2012 @ 2:19 am |
Howdy I am so happy I found your weblog, I really found you by error, while I was looking on Aol for something else,
Nonetheless I am here now and would just like to say cheers for a tremendous post
and a all round exciting blog (I also love the theme/design),
I don’t have time to browse it all at the moment but I have saved it and also included your
RSS feeds, so when I have time I will be back to read a lot more, Please
do keep up the great b.
Comment by renting commercial kitchens by the hour — July 29, 2013 @ 3:24 pm |