Larry Steinle

February 9, 2011

Use Regular Expressions to Clean SQL Statements

Filed under: RegEx,SQL Server — Larry Steinle @ 9:00 pm
Tags: ,

When writing programs that use inline SQL statements I prefer to format the sql statements for readability with solid comments in the sql statement. Sometimes the formatting or comments gets in the way during processing and must be removed. In this article I will demonstrate how to remove formatting and comments from a SQL Statement using Regular Expressions.
I consider a raw sql statement to be one that excludes:
  • Comments,
  • Line Breaks,
  • Tabs, and
  • Extra Spacing.

First we need to create regular expressions to identify the text to remove from the sql statement. Then we’ll create the VB.Net code to execute the expressions and remove the matches from the sql statement.

Regular Expression Statements

A quick review of regular expression syntax:
  • Use parens () to organize regular expression rules into a logical group known as a set.
  • Use brackets [] to define the characters to match.
  • Use escaped characters for non-printing character matching.
  • Use the pipe | character to match on multiple sets.
  • A caret inside the brackets [^] indicates that the characters should be ignored.
  • An asterisk * indicates that the previous set or character must be matched if it is found zero or more times.

There is no real magic in the statements that we will be creating today. They are very simple, straight-forward expressions.

Finding Line Breaks and Tabs

The regular expressions for the line breaks and tabs is very simple to create:
[\t\r\n]

As noted above we use brackets to indicate that any of the characters are matches. Since we are matching against non-printable characters we use escaped characters where \t represents a tab, \r is the return code and \n is a new line.

Finding Single Line Comments

In SQL a single line comment begins with two dashes. So that will be very easy to locate with a regular expression. The trick is defining when the expression ends.
--[^\r\n]*
The above regular expression begins the match on two dashes continuing until a return character and new line character are found. Since the comment ends with a line feed we instruct the expression to continue matching until we find a return character and new line character [^\r\n]. The asterisk tells the expression to continue the match 0 or more times until a return character or new line character is found breaking the match.

Finding Multiple Line Comments

The multiple line comment regular expression is the most complicated regular expression that we will create today. SQL multiple line comments begin with a slash and an asterisk /*. Multiple line comments end with an asterisk and a slash */. They can span a single row or multiple rows.
/\*[\w\W]*?(?=\*/)\*/

We begin by matching on a /*. Since the * has a special meaning (0 or more matches on the previous set) we have to escape it with a backslash. So /\* means match on /*.

Next we match on any alphanumeric character and any non-alphanumeric character. We continue this match for 0 or more times. [\w\W]* So, the first problem we encounter is that our ending characters */ is a non-word character and will be included in the match which for obvious reasons is undesired. That’s where the question mark comes in. The question mark tells the parser to use a lazy match. If the next expression is found then stop the match.

The question mark with the assignment operator tells the engine to match the previous condition if and only if it ends with the specified criteria which in this case is the asterisk and forward slash */. This instruction does not match the selection but does tell the parser when to stop matching.

Finally we end the match when an asterisk followed by a forward slash \*/. Remember that the asterisk must be escaped.

Combining the Expressions

It would be nice if we could call the regular expression engine just once. Right now we have to call it three times for each regular expression. So we will combine the previous regular expressions and “or” them together so that any one of the expressions will count as a match within a single statement:
[\t\r\n]|(--[^\r\n]*)|(/\*[\w\W]*?(?=\*/)\*/)

Testing the Expressions

I use RegExPal to test the regular expressions that I write. I find it a very simple and intuitive online interface to test my regular expressions. Using RegExPal we can see that the regular expression will highlight the matches.
Regular Expression to Find Tabs, New Lines and Comments

Regular Expression to Find Tabs, New Lines and Comments

There is one problem…We are matching against characters contained within the text block. And, this is where things get sticky.

Writing a regular expression that ignores pattern sets while including other pattern sets is complicated. It is at this time that regular expressions get difficult to write and maintain.

Instead of wrestling the regular engine expression into ignoring matches found inside qualified text blocks we can tell the engine to include qualified text blocks in the match. Later while parsing thru the regular engine results we can ignore the text blocks.

Finding Text Blocks

The regular expression to find text blocks is basically the same as the one for finding multi-row comments. This time I’ll leave it up to you to interpret and understand the regular expression:
'(''|[^'])*'
We will insert this new expression at the beginning of our combined regular expression block to ensure that it matches first so that it is ignored in the code.
('(''|[^'])*')|[\t\r\n]|(--[^\r\n]*)|(/\*[\w\W]*?(?=\*/)\*/)

Now we are ready for a bit of VB.Net spice!

Using the Regular Expressions

In our code sample we execute our regular expression and loop thru the results. When we find a result that is a text block we ignore it. Any other result is replaced with a space.

We loop thru one last time searching again for text blocks and this time for extra spaces ([ ]{2,}). Once again we will ignore the text blocks and remove the extra spaces from the sql statement.

In case you are wondering why we didn’t simply use String.Replace(”  “, ” “) to remove the extra spaces it is because this statement must be executed multiple times until String.IndexOf(”  “) returns a negative number. That is much slower than a one time call to find all the extra spaces to remove. Additionally, we would have needed to write the logic to detect when the match was found inside a text block. Even for the simple scenario of removing extra spaces the regular expression saved us time and reduced the complexity of the VB.Net code.

Don’t be scared to use regular expressions. They have their place. At the same time don’t be scared to use old-fashioned code either. Use the tool that makes the best sense for the problem!

Public Shared Function ToRaw(ByVal commandText As String) As String
  Dim regExOptions As RegexOptions = regExOptions.IgnoreCase Or regExOptions.Multiline
  Dim rawText As String = commandText
  Dim regExText As String = "('(''|[^'])*')|[\t\r\n]|(--[^\r\n]*)|(/\*[\w\W]*?(?=\*/)\*/)"
  'Replace Tab, Carriage Return, Line Feed, Single-row Comments and
  'Multi-row Comments with a space when not included inside a text block.
  Dim patternMatchList As MatchCollection = Regex.Matches(rawText, regExText, regExOptions)
  For patternIndex As Integer = patternMatchList.Count - 1 To 0 Step -1
    With patternMatchList.Item(patternIndex)
      If Not .Value.StartsWith("'") And Not .Value.EndsWith("'") Then
        rawText = rawText.Substring(0, .Index) & " " & rawText.Substring(.Index + .Length)
      End If
    End With
  Next
  'Remove extra spacing that is not contained inside text qualifers.
  patternMatchList = Regex.Matches(rawText, "'([^']|'')*'|[ ]{2,}", regExOptions)
  For patternIndex As Integer = patternMatchList.Count - 1 To 0 Step -1
    With patternMatchList.Item(patternIndex)
      If Not .Value.StartsWith("'") And Not .Value.EndsWith("'") Then
        rawText = rawText.Substring(0, .Index) & rawText.Substring(.Index + .Length - 1)
      End If
    End With
  Next
  'Return value without leading and trailing spaces.
  Return rawText.Trim
End Function

Conclusion

In today’s post we saw how to create a few simple regular expressions to parse SQL statements. We took advantage of the power of regular expressions while using the readability of VB.Net to keep the expressions simple.

I hope that my explanation showing how to construct regular expressions has peaked your interest so that you will want to learn more about regular expressions yourself.

If you are interested in more information about regular expressions I strongly recommend http://www.regular-expressions.info/.

Advertisement

6 Comments »

  1. Thank you.
    ==========================================================================
    http://space.scmlife.com/home.php?mod=space&uid=8945&do=blog&quickforward=1&id=626
    ==========================================================================
    public static string ToRaw(string commandText)
    {
    RegexOptions regExOptions = (RegexOptions.IgnoreCase | RegexOptions.Multiline);
    string rawText=commandText;
    string regExText = @”(‘(”|[^’])*’)|([\r|\n][\s| ]*[\r|\n])|(–[^\r\n]*)|(/\*[\w\W]*?(?=\*/)\*/)”;
    //string regExText = @”(‘(”|[^’])*’)|[\t\r\n]|(–[^\r\n]*)|(/\*[\w\W]*?(?=\*/)\*/)”;
    //’Replace Tab, Carriage Return, Line Feed, Single-row Comments and
    //’Multi-row Comments with a space when not included inside a text block.

    MatchCollection patternMatchList = Regex.Matches(rawText, regExText, regExOptions);
    int iSkipLength = 0;
    for (int patternIndex = 0; patternIndex < patternMatchList.Count; patternIndex++)
    {
    if (!patternMatchList[patternIndex].Value.StartsWith("'") && !patternMatchList[patternIndex].Value.EndsWith("'"))
    {
    rawText = rawText.Substring(0, patternMatchList[patternIndex].Index – iSkipLength) + " " + rawText.Substring(patternMatchList[patternIndex].Index – iSkipLength + patternMatchList[patternIndex].Length);
    iSkipLength += (patternMatchList[patternIndex].Length – " ".Length);
    }
    }
    //'Remove extra spacing that is not contained inside text qualifers.
    patternMatchList = Regex.Matches(rawText, "'([^']|'')*'|[ ]{2,}", regExOptions);
    iSkipLength = 0;
    for (int patternIndex = 0; patternIndex < patternMatchList.Count; patternIndex++)
    {
    if (!patternMatchList[patternIndex].Value.StartsWith("'") && !patternMatchList[patternIndex].Value.EndsWith("'"))
    {
    rawText = rawText.Substring(0, patternMatchList[patternIndex].Index – iSkipLength)+" " + rawText.Substring(patternMatchList[patternIndex].Index – iSkipLength + patternMatchList[patternIndex].Length);
    iSkipLength += (patternMatchList[patternIndex].Length – " ".Length);
    }
    }
    //'Return value without leading and trailing spaces.
    return rawText.Trim();

    }

    Comment by wangxn — July 9, 2011 @ 3:00 am | Reply

  2. What about multiline comments inside multiline comments? what would be a reg exp for that one?

    Comment by opv — May 31, 2012 @ 1:29 pm | Reply

  3. Your regex does not remove this line when it’s the first line of the script:
    — some comment ‘with literal text inside’

    Comment by unknown user — November 14, 2012 @ 2:12 am | Reply

  4. I already found a fix for this: the first “if” statement in your VB.NET code should be replaced by this:
    If .Value.StartsWith(“–“) OrElse (Not .Value.StartsWith(“‘”) And Not .Value.EndsWith(“‘”)) Then

    Comment by unknown user — November 14, 2012 @ 2:19 am | Reply

  5. Howdy I am so happy I found your weblog, I really found you by error, while I was looking on Aol for something else,
    Nonetheless I am here now and would just like to say cheers for a tremendous post
    and a all round exciting blog (I also love the theme/design),
    I don’t have time to browse it all at the moment but I have saved it and also included your
    RSS feeds, so when I have time I will be back to read a lot more, Please
    do keep up the great b.

    Comment by renting commercial kitchens by the hour — July 29, 2013 @ 3:24 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at WordPress.com.

%d bloggers like this: