E-mail protection on your site

by Míša Hájková

Let’s imagine the following situation. You create a new, modern website, and because you want feedback from your potential customers, your e-mail address must not be missing from it. But after some time you receive the first offer for some blue pills, then another, and another, and soon you are doing nothing but deleting spam. That’s because the e-mail addresses on your site are not protected and can easily be harvested by specialized robots called spambots.

What do spambots do? It’s simple: they download the html of your web page and search it for e-mail addresses. Here is a small example of how a spambot might work:

// Requires: using System; using System.IO; using System.Net; using System.Text.RegularExpressions;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("website_url");
request.Method = "GET";
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(response.GetResponseStream()))
{
    // Download the whole page and print everything that looks like an e-mail address.
    string html = reader.ReadToEnd();
    foreach (Match m in Regex.Matches(html, @"\w[\w\.-]+?@(?:[\w-]+\.)+[\w]{2,4}"))
    {
        Console.WriteLine(m.Value);
    }
}

As you see, it’s really very easy to get e-mail addresses from your website. A spambot can harvest the links from your page in a similar way, follow them, and collect e-mails from the whole Internet, except from pages where the e-mails are protected. Let’s take a look at how to do that.

The first question is: where is the best place for the code? Your e-mail protecting code can be placed in an overridden Render method of your page, or in the same method of your page’s base class or master page. But if you want to protect several websites, or non-aspx files such as html, the best way is to use a response filter (Response.Filter) in a new (or maybe an existing) HTTP module. The same HTTP module can be used on as many sites as you want: you simply copy the compiled files into the bin directory of the site and register the module in web.config. If you are really brave and want to protect your whole IIS server, you can install the assembly into the GAC and register it in machine.config. Basic information about HTTP modules can be found here: http://msdn.microsoft.com/en-us/library/zec9k340(v=VS.71).aspx.

Let’s see how to create a custom HTTP module that protects e-mail addresses from being harvested by spambots.

First, we need to create the basic HTTP module functionality:

public class EmailFilter : IHttpModule
{
    public void Init(HttpApplication app)
    {
        // Run our handler after the page has executed, before the output is sent.
        app.PostRequestHandlerExecute += new EventHandler(app_PostRequestHandlerExecute);
    }

    private void app_PostRequestHandlerExecute(object sender, EventArgs e)
    {
        // The response filter is attached here in the next step.
    }

    public void Dispose() { }
}

The Init and Dispose methods must be present in every class that implements the IHttpModule interface. Inside Init, we subscribe a handler to the PostRequestHandlerExecute event. It fires right after the page has been executed and before the output is sent to the client, which is the right moment to modify the generated content. Inside the handler we attach our new e-mail protecting filter to the response stream through Response.Filter.

private void app_PostRequestHandlerExecute(object sender, EventArgs e)
{
    HttpApplication app = (HttpApplication)sender;
    HttpResponse response = app.Response;
    // Available for URL-based conditions; not used in this basic version.
    string rawUrl = app.Request.RawUrl.ToLower();
    // Only html responses get the filter.
    if (String.Equals(response.ContentType, "text/html"))
    {
        response.Filter = new EmailStreamFilter(response.Filter, response.ContentEncoding);
    }
}

As you see, we check the MIME type of the response. It isn’t a good idea to replace strings in images, documents or other non-text files; we want to modify only html. If you want, you can add other conditions. For example, if your website has a private part (e.g. an admin subweb) where e-mail protection is unnecessary or even unwanted, you can check the request URL and filter only the parts you want. Inside the condition you see a still unknown class, EmailStreamFilter. Creating this stream filter is our next step.
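Before we get to the filter itself, here is a rough sketch of what such an extra URL condition could look like. The FilterRules class, the ShouldFilter method and the "/admin/" prefix are made up for this illustration and are not part of the module; the sketch also assumes the application runs at the site root, so that RawUrl starts directly with the excluded path.

```csharp
using System;

public static class FilterRules
{
    // Hypothetical list of path prefixes that should be left untouched.
    private static readonly string[] ExcludedPrefixes = { "/admin/" };

    // Returns true when the response should get the e-mail filter.
    public static bool ShouldFilter(string contentType, string rawUrl)
    {
        // Only html responses are rewritten.
        if (!string.Equals(contentType, "text/html", StringComparison.OrdinalIgnoreCase))
            return false;

        // Skip the private parts of the site.
        foreach (string prefix in ExcludedPrefixes)
        {
            if (rawUrl != null && rawUrl.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
                return false;
        }
        return true;
    }
}
```

In the handler, the condition would then become something like if (FilterRules.ShouldFilter(response.ContentType, app.Request.RawUrl)), where app is the HttpApplication passed in as sender.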

// Requires: using System.IO; using System.Text;
public class EmailStreamFilter : Stream
{
    Stream stream;

    public EmailStreamFilter(Stream sourceStream, Encoding sourceencoding)
    {
        stream = sourceStream;
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        stream.Write(buffer, offset, count);
    }

    public override int Read(byte[] buffer, int offset, int count)
    {
        return stream.Read(buffer, offset, count);
    }
    public override bool CanRead
    {
        get { return stream.CanRead; }
    }
    public override bool CanSeek
    {
        get { return stream.CanSeek; }
    }
    public override bool CanWrite
    {
        get { return stream.CanWrite; }
    }
    // Pass the flush on to the wrapped stream.
    public override void Flush() { stream.Flush(); }
    public override long Length
    {
        get { return stream.Length; }
    }
    public override long Position
    {
        get { return stream.Position; }
        set { }
    }
    public override long Seek(long offset, SeekOrigin origin)
    {
        return 0;
    }
    public override void SetLength(long value) { }
}

As you see, the stream filter is derived from the Stream class, and it must implement its abstract methods and properties. Maybe it looks a little complicated, but we won’t work with all of these members. All we need to do is declare some new members and modify the constructor and the Write method, as you see below.

public class EmailStreamFilter : Stream
{
    Stream stream;
    Encoding encoding;
    string html;

    public EmailStreamFilter(Stream sourceStream, Encoding sourceencoding)
    {
        stream = sourceStream;
        encoding = sourceencoding;
        html = "";
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Collect the response until the closing </html> tag arrives.
        html += encoding.GetString(buffer, offset, count);

        if (html.IndexOf("</html>", StringComparison.OrdinalIgnoreCase) >= 0)
        {
            //secure e-mails
            byte[] data = encoding.GetBytes(html);
            stream.Write(data, 0, data.Length);
            html = ""; // reset, so a later Write does not send the page again
        }
    }

    // ... implementation of abstract members of base Stream object, as shown above ...
}

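As you see in the Write method, we simply collect the response data until we find the end of the html, represented by the </html> tag. This is the weak part of this solution, but the problem is that we don’t know the size of the current page’s response data in advance. Valid html is therefore really necessary: if the closing tag never arrives, the collected page is never written to the response.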
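One more detail of the collecting step: the response arrives in chunks of bytes, and with a multi-byte encoding such as UTF-8 a chunk may end in the middle of a character. The sketch below collects the chunks with a System.Text.Decoder, which remembers an unfinished character between calls, instead of calling Encoding.GetString on each chunk as the code above does. The ChunkCollector class and its members are names invented for this illustration only.

```csharp
using System;
using System.Text;

public sealed class ChunkCollector
{
    // The decoder keeps the bytes of an incomplete character between calls.
    private readonly Decoder decoder;
    private readonly StringBuilder text = new StringBuilder();

    public ChunkCollector(Encoding encoding)
    {
        decoder = encoding.GetDecoder();
    }

    // Decodes one chunk and appends the resulting characters.
    public void Append(byte[] buffer, int offset, int count)
    {
        char[] chars = new char[decoder.GetCharCount(buffer, offset, count)];
        int written = decoder.GetChars(buffer, offset, count, chars, 0);
        text.Append(chars, 0, written);
    }

    // True once the closing tag has been seen (re-scans the text on each call; fine for a sketch).
    public bool IsComplete
    {
        get { return text.ToString().IndexOf("</html>", StringComparison.OrdinalIgnoreCase) >= 0; }
    }

    public override string ToString()
    {
        return text.ToString();
    }
}
```

Inside the filter, Write would call Append(buffer, offset, count) and test IsComplete instead of concatenating strings. Whether this is worth it depends on your pages; for plain ASCII content both approaches give the same result.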

Our EmailStreamFilter is almost complete now. All we need to do is replace the comment //secure e-mails with some code which will do it :) The first thing we need is a way to find e-mails inside the html. You can use, for example, the following regular expression:

\b(\w[\w\.-]+?@(?:[\w-]+\.)+[\w]{2,4})\b

It’s simple and, more importantly, it’s fast. Imagine a very, very large page: you really don’t need a complex expression of the kind used in validators, you need a combination of reliability and great speed. Now, what shall we do with the captured e-mails? We can simply replace dots and at-signs with their text representations, (dot) and (at). After that, all e-mails on our page are secured. There are no e-mail addresses left, only text representations which can be understood by humans, but not by spambots. If that’s not enough, you can use any other replacement, for example "mail – here is at – domain – here is dot – com". Or you can use images, or links to a special page with a CAPTCHA code. There are many ways to fight spambots (almost as many as they know to fight us).

In this example, we use the first method and simply replace dots and at-signs with their text representations.
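To try the replacement on its own, outside of ASP.NET, something like the following small sketch can be used. EmailObfuscator and Obfuscate are names invented for this illustration; the regular expression is the one shown above. Note that the pattern only accepts top-level domains of 2 to 4 characters, so an address ending in a longer domain is left as it is.

```csharp
using System.Text.RegularExpressions;

public static class EmailObfuscator
{
    // The same expression as in the article; group 1 is the whole address.
    private static readonly Regex EmailRegex =
        new Regex(@"\b(\w[\w\.-]+?@(?:[\w-]+\.)+[\w]{2,4})\b");

    // Replaces "@" and "." inside every found address with (at) and (dot).
    public static string Obfuscate(string html)
    {
        return EmailRegex.Replace(html,
            m => m.Groups[1].Value.Replace("@", "(at)").Replace(".", "(dot)"));
    }
}
```

For example, EmailObfuscator.Obfuscate("Write to john.doe@example.com today") gives "Write to john(dot)doe(at)example(dot)com today", while text without an address comes back unchanged.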

But because it’s not very comfortable for visitors to rewrite these encoded e-mails back into real ones, we also use javascript to turn the text representation of our secured e-mails back into the original addresses. Spambots, as far as I know, don’t execute javascript. And why should they? There are many unprotected sites on the Internet, and it’s easier to harvest e-mails from them than to lose time programming a script-executing spambot.

<script type="text/javascript">var body = document.getElementsByTagName('body')[0]; body.innerHTML = body.innerHTML.replace(/\(dot\)/gm, '.').replace(/\(at\)/gm, '@');</script>

This small script converts all of our secured e-mail addresses back to their original format, but only for clients with javascript enabled, which today means the majority of human users. We replace only inside the body of the html, but if you want, you can replace in the head as well, or you can link an external, more sophisticated script. Again, you have many possibilities. Now let’s look at our completed EmailStreamFilter.

public class EmailStreamFilter : Stream
{
    Stream stream;
    Encoding encoding;
    string html;

    bool isEmailFound = false;
    static Regex _regEmail = new Regex(@"\b(\w[\w\.-]+?@(?:[\w-]+\.)+[\w]{2,4})\b", RegexOptions.Compiled);

    public EmailStreamFilter(Stream sourceStream, Encoding sourceencoding)
    {
        stream = sourceStream;
        encoding = sourceencoding;
        html = "";
    }

    public override void Write(byte[] buffer, int offset, int count)
    {
        html += encoding.GetString(buffer, offset, count);

        if (html.IndexOf("</html>", StringComparison.OrdinalIgnoreCase) >= 0)
        {
            // Replace every address with its (at)/(dot) form.
            MatchEvaluator emailEvaluator = new MatchEvaluator(EmailEvaluator);
            html = _regEmail.Replace(html, emailEvaluator);

            // Insert the decoding script only if an e-mail was found and </body> exists.
            int bodyEnd = html.IndexOf("</body>", StringComparison.OrdinalIgnoreCase);
            if (isEmailFound && bodyEnd >= 0) html = html.Insert(bodyEnd, @"<script type=""text/javascript"">var body = document.getElementsByTagName('body')[0]; body.innerHTML = body.innerHTML.replace(/\(dot\)/gm, '.').replace(/\(at\)/gm, '@');</script>");

            byte[] data = encoding.GetBytes(html);
            stream.Write(data, 0, data.Length);
            html = ""; // reset, so a later Write does not send the page again
        }
    }

    private string EmailEvaluator(Match m)
    {
        isEmailFound = true;
        string noScriptEmail = m.Groups[1].Value.Replace("@", "(at)").Replace(".", "(dot)");
        return noScriptEmail;
    }

    // ... implementation of abstract members of base Stream object, as shown above ...
}

And that’s almost all. The last step is to register your new HTTP module in the appropriate section of your web.config, and the work is done.
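The registration could look roughly like the fragment below. The namespace and assembly name (MyCompany.Web) are placeholders; use the real names of your own compiled module.

```xml
<configuration>
  <system.web>
    <httpModules>
      <!-- "MyCompany.Web.EmailFilter, MyCompany.Web" is a placeholder: "full type name, assembly name" -->
      <add name="EmailFilter" type="MyCompany.Web.EmailFilter, MyCompany.Web" />
    </httpModules>
  </system.web>
</configuration>
```

On IIS 7 and later running in integrated pipeline mode, modules are registered under system.webServer/modules instead of system.web/httpModules.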

At the end, before the comments :), I have one small piece of icing on the cake. Maybe protecting e-mails is not enough, and you also want to turn e-mails that are not already inside an <a> tag into links. In that case, all you need is to find the right regular expression. For example this one:

\b(?![^<]*>)(?!.*</head>)(\w[\w\.-]+?@(?:[\w-]+\.)+[\w]{2,4})\b(?![^<]*(<[^a][^<]*)*</a>)

The ugly string above is a regular expression which finds e-mails that are not inside the head, not inside a tag attribute and not inside an <a> tag (nor, of course, inside abbr and other tags beginning with a; yes, it’s not perfect, but it may be sufficient for your purposes). All you need to do then is insert these two lines of code in the right place, and the work is done.

static Regex _regNoLinkEmail = new Regex(@"\b(?![^<]*>)(?!.*</head>)(\w[\w\.-]+?@(?:[\w-]+\.)+[\w]{2,4})\b(?![^<]*(<[^a][^<]*)*</a>)", RegexOptions.Compiled);
html = _regNoLinkEmail.Replace(html, "<a href=\"mailto:$1\">$1</a>");
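If you want to try the expression on its own first, a small self-contained sketch like this one may help. EmailLinker and Linkify are names invented for this illustration. One thing to keep in mind: in .NET the dot does not match a line break unless RegexOptions.Singleline is set, so the (?!.*</head>) check only looks as far as the end of the current line.

```csharp
using System.Text.RegularExpressions;

public static class EmailLinker
{
    // The expression from the article; group 1 captures the address.
    private static readonly Regex NoLinkEmail = new Regex(
        @"\b(?![^<]*>)(?!.*</head>)(\w[\w\.-]+?@(?:[\w-]+\.)+[\w]{2,4})\b(?![^<]*(<[^a][^<]*)*</a>)");

    // Wraps every address that is not already linked in a mailto link.
    public static string Linkify(string html)
    {
        return NoLinkEmail.Replace(html, "<a href=\"mailto:$1\">$1</a>");
    }
}
```

Called with "<p>Mail me at jane@example.org please</p>", Linkify wraps the address in a mailto link; an address that already sits inside an <a> element, or inside its href attribute, is left alone.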

Enjoy it.

