Friday, August 22, 2008

SQL Server string filter and replacement function

Often when I need to transmogrify data in SQL Server, I create custom stored procedures or update statement that filters incoming data. For example, a raw data import includes part numbers with letters while the system where this should go only allows 0-9 and “_”.

For a current project, I had to filter a lot of data so I decided to write a reusable function that NOT defines what is NOT allowed but simple the other way around: Define what you want.

I came across a interesting forum discussion that about such an function, but none of the solutions was exactly was I wanted. The main reason was that I wanted the function to be used as a filter and replacement at the same time. I give you a quick example.

Suppose you need a replacement that allows numbers only:

Input is “a12345abc”, replacement character is “*”, return should be “*12345*”. Please note that a single replacement character (“*”) is at the end, although the string ends with three non-numeric characters.

Now the same as a filter: Input is “a12345abc”, replacement character is “” (empty) , return should be “12345”.

I ended up by creating my own function “funcFilterString_ASCII” (roughly based on a solution in the forum by Peter Larsson from Helsingborg, Sweden). Usage is rather simple: @Source is the string that should be filtered, @Filter is the regex like filter and @Replacement is the character any no match should be replaced with (or empty if no matches should be removed).

Some examples:

print dbo.funcFilterString_ASCII('John Doe - 12345 New York _10', '[0-9]', '')
-- returns 1234510

print dbo.funcFilterString_ASCII('John Doe - 12345 New York _10', '[0-9]', '*')

-- returns *12345*10

print dbo.funcFilterString_ASCII('John Doe - 12345 New York _10', '[0-9_]', '')
-- returns 12345_10

print dbo.funcFilterString_ASCII('John Doe - 12345 New York _10', '[23]', '*')
-- returns *23*

print dbo.funcFilterString_ASCII('John Doe - 12345 New York _10', '[^0-9]', '*')
-- returns John Doe - * New York _*

To use it, simply copy the source below to a query window, change to the database where you want to have it and execute the SQL. Enjoy!



Here is the source:


CREATE FUNCTION [dbo].[funcFilterString_ASCII]
(
@Source VARCHAR(max), -- the string that contains characters to be replaced
@Filter VARCHAR(max), -- the regex what should be filterd
@Replacement VARCHAR(1) -- replace anything that does not match @Filter with this char (set to '' to remove non-matches)
)
/*
Example

declare @s varchar(500)
print '--Test string ---'
select @s='John Doe - 12345 New York - YYY_XXX - No 10 USD$' CHAR(13) CHAR(10) 'New Line'
print @s
print '--Results---'
print dbo.funcFilterString_ASCII(@s, '[0-9]', '') -- returns only 12345
print dbo.funcFilterString_ASCII(@s, '[0-9]', '*') -- returns *12345*
print dbo.funcFilterString_ASCII(@s, '[0-9_]', '') -- returns only numbers and "_" = 12345_10
print dbo.funcFilterString_ASCII(@s, '[0-9a-zA-z]', '') -- returns the entire string but without any special chars
print dbo.funcFilterString_ASCII(@s, '[0-9a-zA-z]', '*') -- returns the entire string but any special chars have been replaced
print dbo.funcFilterString_ASCII(@s, '[^' CHAR(13) CHAR(10) ']', '_') -- replace CR/LF with _
print dbo.funcFilterString_ASCII(@s, '[^' CHAR(13) CHAR(10) ']', '') -- remove CR/LF
print dbo.funcFilterString_ASCII(@s, '[e]', '*') -- returns only *e*e*e*e
print dbo.funcFilterString_ASCII(@s, '[^e]', '*') -- entire string, but all "e" are "*"

*/

RETURNS VARCHAR(max)
AS
BEGIN
-- Created by TeX HeX of Xteq System
-- http://www.texhex.info/

DECLARE @Index int
SET @Index = DATALENGTH(@Source)

DECLARE @dest varchar(max)
DECLARE @curr varchar(1)
DECLARE @last_add varchar(1)

SET @DEST=''
SET @last_add=''

-- Pass #1: Replacement
WHILE @Index > 0 BEGIN
SET @curr=SUBSTRING(@Source, @Index, 1)

-- collate is only needed for VARCHAR types
IF @curr COLLATE Latin1_General_BIN LIKE @Filter BEGIN
-- no replacement, copy from source
SET @DEST=@DEST @curr

SET @last_add=@curr
--SET @last_add_ignored=0
END ELSE BEGIN
-- do replacement. If @Replacement is empty, do not add anything
IF @Replacement!='' BEGIN
-- check if the last char we added was @Replacement
-- if so, do not add it again
IF @last_add!=@Replacement BEGIN
SET @DEST=@DEST @Replacement

SET @last_add=@Replacement
END
END
END

SET @Index=@Index-1
END

-- Pass #2: Reverse the string, because we build @Dest from reverse
SET @DEST=Reverse(@Dest)

-- Pass #3: Remove any left over spaces
SET @DEST=RTRIM(LTRIM(@Dest))


RETURN @DEST
END

No comments:

Post a Comment