Despamification

Everyone knows how to prevent spam – you ask the bots a question they won’t be very good at. If the odd clever bot or human spammer slips through you can just delete them and ban their IP, no problem. But what happens if you leave spam unmoderated? This is what happened to the old railsforum.com, aside from there being a lack of effective spam prevention in the first place, the spam was left unmoderated and it became overrun with spammers. The question is, how do you get rid of thousands of spam posts when the situation is beyond the powers of manual moderation?

First thoughts

Initially, you might think you can spot a pattern in the spam. Maybe you could write a few clever queries that would match some patterns in the spam? This works for the most blatant spam that you might find. Unfortuantely, most spam does not exhibit obviously searchable patterns. The first approach I tried was to look at the spam and write a list of obviously spammy words, and then rank the posts based on the number of matches of these words found in each post. This had some success, but there are some major problems with this method. Firstly, it gives you a statistically meaningless value – a score of how many hits each post had. This does allow you to rank the posts and then look at the spammiest posts, but since it is not a probability, you can’t reliably make decisions on it. The second problem is the high false postivie rate. Because the “spammy” words are manually identified, you can often find words that in the context of spam appear to be a good indicator of spam, but in other contexts that you perhaps hadn’t considered are perfectly valid and do not indicate spam at all. Another problem with this method is that there will always be things that you miss when identifying keywords and therefore a large quantity of the spam will be undedected.

What you need in a method is something that will thoroughly analyse all the data in each post and then return a probability that the post is spam. You also want this system to have a low false-positive rate and to be adaptable to new types of spam.

Something a bit more fancy

After a bit of research, I decieded to implement a Bayesian classifier to analyse the posts. The idea here is that we split posts into tokens, classify a training set of posts as either spam or ham, and build a database of tokens and the frequency at which they occur in both spam and ham posts. The tokenizer is relatively simple – we only consider alphanumeric characters, hyphens, apostrophes, exclamation marks and pound and dollar symbols to be part of tokens. We also ignore any tokens that consist solely of numbers, as they are relatively useless in classifying spam. Such a tokenizer might look like this:

def tokenize(str)
    # build charset
    a = ('A'..'Z').to_a
    c = a + a.map(&:downcase)
    c += ('0'..'9').to_a
    c += ["!","-","'","$","£"]

    tokens = []
    buff = ""
    str.each_char do |ch|
        if c.include? ch
            buff << ch
        elsif buff.length > 0
            tokens << buff unless buff =~ /\A\d+\Z/
            buff = ""
        end
    end

    tokens << buff if buff.length > 0 and not buff =~ /\A\d+\Z/
    tokens
end

Once the posts can be tokenized, we can then start to train the classifier. I started off by getting my script to select a batch of 20 or so of the most recent posts from the forum database and present them one by one, prompting for a yes or no response as to whether the post was spam. Once we have classified a post from the training set, it can be tokenized, and I store the tokens locally in an SQLite database. As well as keeping track of the number of times each token occurs in either spam or ham posts, it is also important to keep track of the total number of posts that have been reviewed as either spam or ham. This is used later in caclculating probabilities.

Once I had trained the system on a small sample of ~100 posts, I started to get it to generate a bayesian probability on each post to see how accurately it was predicting spam posts. This probability is eventually what we’re going to use to detect and remove spam users and their posts.

In order to generate a spam probability (or “spamicity”) of a post, the spamicity of each token in the post is calculated and the probabilities for each token are combined.

Wikipedia gives the formula for determining the spamicity of a given word (based on Bayes’ theoerem) as follows:

spamicity of word formula

This formula describes the spacimity of a word (probability that a message is spam given the occurence of the word W). P(W|S) is the probabitliy of the word W appearing in spam, P(S) is the probabiltiy of a spam message being chosen randomly from the sample, P(W|H) is the probability of W occuring in a ham message, and P(H) is the probabiltiy of a ham message being chosen randomly from our training sample.

In practice, the code for this might look like so:

total_corpus = spam_count + ham_count
pr_spam = spam_count / total_corpus
pr_ham = ham_count / total_corpus

pr_word_given_spam = spam_occurences / spam_count
pr_word_given_ham = ham_occurences / ham_count
spamicity_word = pr_word_given_spam * pr_spam / (pr_word_given_spam * pr_spam + pr_word_given_ham * pr_ham)
spamicity_word = 0.998 if spamicity_word == 1
spamicity_word = 0.001 if spamicity_word == 0

Here I refer to the corpus as the total body of posts that have been reviewed so far. spam_count and ham_count are the total number of spam and ham messsage in the corpus. spam_occurences and ham_occurences are the number of occurences of our given word in either spam or ham. We also make an assumption here that if a word has only ever been seen by our classifier as spam (thus spamicity of 1) or as ham (spamicity of 0), then we change the probability to be 0.998 or 0.001 respectively, as otherwise it would override the other probabilties when they were combined.

Looping through each token in a post, I assign the spamicity of each token to a hash, using the token as a key. From this hash we can then get 15 tokens with the most significant (extreme) probabilities. This is an essential part of the classifier, as by using the extreme (i.e. very spammy or very hammy) probabilites then only the tokens which are actually useful in classifying the post are taken into account. This makes the resultant probability much more clear cut (i.e. very close to 0 or very close to 1) and also means that neutral words (words that appear equally frequently in both spam and ham) do not affect the final probability. This is done like so, by calculating the distnace of each value from 0.5:

significant = tokens.sort_by { |k, v| (0.5-v).abs }.last(15) # top 15 most significant tokens

We then need to combine the probabilities for these signfiicant tokens, using another formula dervied from Bayes’ theoerem:

combining probabilities

This formula simply says to divide the product of the probabilities by the sum of the product of the probabilities and the product of the inverse (1-p) probabilities. We can combine them in code like so:

spamicities = significant.map { |a| a[1] } # get just the probabilities in an array
product = spamicities.inject(&:*)
inverse_product = spamicities.map { |s| 1-s  }.inject(&:*)
probability = product / (product + inverse_product)

Another important point to consider now is how we deal with tokens that the classifier previously hasn’t seen before (and therefore has no probability for). A useful way of dealing with this problem is to degrade more complex tokens into simpler tokens. For example, matching missing tokens that are in uppe or mixed case can be done by matching such tokens in lowercase. Another method is to remove trailing characters such as exclamation marks or other non-alphumaeric accepted token characters. A method such as the following can generate a simpler token to match if an exact token match is not found:

def make_simple_token(token)
    t = token.downcase
    new_t = t
    (t.length-1).downto(1) do |i|
        ch = t[i]
        if ["!","-","'","$","£"].include?(ch)
            new_t = new_t[0..-2]
        else
            return new_t
        end
    end
    new_t
end

If neither the original or simple token can be matched, then we must make an assumption about the probability. In order to guard against false-postivies, we slightly bias this probability to ham by choosing 0.4. This value (along with many of my other assumptions) is used by Paul Graham in the spam filter described in his article A Plan for Spam.

Finally we can start to see that our training is having an effect. By getting the training script to calculate the spamicity of our post before getting us to rate it, we can get a good idea of how well it is doing, and how much more training might be needed. I ended up training my classifier on ~2000 posts from the forum. However, I was using it to detect spam users with a training sample of only ~1000 posts.

Now that we have a trained classifier comes the fun part; destroying the spam users and all their posts. The way I did this was to select a batch of say 1000 of the most recently registered users. For each user my script analysed their posts, generating a bayesian spam probability for each of them. If the maximum probability for that user’s posts was greater than 50%, then I generated a summary of that user. This included their maximum and average spamicities, list of topic titles, and the full content of three of their posts with the highest spamicities. I then had to review that user, but from this information it was almost always obvious whether the user is a spammer or not. If I mark the user as spam, the system re-marks all their posts as spam, in order to improve the classifier. It then deletes all their posts, topics and finally their account. If however the classifier got this wrong (which happened surprisingly occasionaly), it marks all their content as ham, and adds their username to a whitelist so that they won’t be re-analysed. This method of analysis allowed me to get through the first 7000 users in a few hours, deleting all of their spam posts.

This method, from the first line of code to the last database query took about two days to implement, and removed thousands of spam posts from the forum. With a forum of around 150,000 posts, using a sample of only 1.3%, the classifier learnt to recognize almost all the spam on the forum. After having a poke around on archive.railsforum.com, I now can’t find any obvious spam remaining, but do let me know if you manage to find any!

Links

By

Making a Simple PHP URL Shortener

In this post I do a devlog while developing Shortbase, a really simple URL shortening engine in PHP. To learn more about how I went about developing it, read on. You can download the source on github, or check out a live demo I just deployed to my server. Enjoy!

I’m going to go through step by step how I developed Shortbase.

Let’s start by setting up our database.

CREATE DATABASE shortbase;
USE shortbase;

CREATE TABLE  urls (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY ,
    surl VARCHAR( 12 ) NOT NULL ,
    lurl VARCHAR( 255 ) NOT NULL ,
    added TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ,
    UNIQUE (
        surl
    )
);

CREATE TABLE  clicks (
    id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY ,
    url_id INT NOT NULL ,
    ip VARCHAR( 40 ) NOT NULL ,
    `timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

That should do us for now – the tables are pretty self-explanatory. The urls table will hold all the shortened URLs (surl => short url, and lurl => long url). The clicks table will keep track of all our expansions. Pretty simple.

Now the first thing we’re going to do with PHP is setup our database config file, includes/config.php. Since we’re going to be using PDO for database access, let’s go ahead and give our config file a function that returns a PDO handle for our database:

function pdo_dbh() {
    /* edit the line below for your database */
    $dbh = new PDO('mysql:host=localhost;dbname=shortbase','user','pw');
    $dbh->setAttribute(PDO::ATTR_ERRMODE,PDO::ERRMODE_EXCEPTION);
    return $dbh;
}

Now we’ve got database access sorted, let’s create a really simple homepage:

<?php

require 'includes/config.php';

$stats = true; // whether or not to display stats

try {
	$dbh = pdo_dbh();
	$sth = $dbh->query('SELECT COUNT(*) FROM urls');
	$urls = $sth->fetchColumn();
	$sth = $dbh->query('SELECT COUNT(*) FROM clicks');
	$expansions = $sth->fetchColumn();
	$dbh = NULL;
}
catch (PDOException $e) {
	error_log('PDO exception on index.php: '.$e->getMessage());
	$stats = false;
}

?>

<!DOCTYPE html>
<html>
<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
	<title>Shortbase URL Shortener</title>
</head>
<body>
	<h1>Welcome to Shortbase</h1>

	<?php
	if ($stats) {
		$ustr = 'url' . (($urls == 1) ? '' : 's');
		$estr = 'url' . (($expansions == 1) ? '' : 's');
		echo "<p><b>$urls</b> $ustr so far, and <b>$expansions</b> $estr expanded.</p>";
	} ?>

	<form action="shorten.php" method="post">
		Enter a URL to shorten:<br />
		<input type="text" name="url" size="80" />
		<input type="submit" value="Go!" />
	</form>
</body>
</html>

Nothing too complicated there – we’re retrieving some stats from our DB, and if they were retrieved successfully, we display them to the user. In the real thing you’d probably want to use AJAX for shortening, but here I’m just using a simple HTML form.

Now it’s time to start making the core behind the URL shortener. Let’s get it laid out:

<?php

// database config
require 'includes/config.php'

class Shortbase {
    private function reachable($lurl) {

    }

    private function next_url($url) {

    }

    function shorten($url) {

    }
}

?>

Before we implement these methods, we’re just going to deal with what they do. There are three functions we’ve laid out to start with on this file. reachable() does what it says on the tin, and checks that the page the user asked to shorten returns 200 OK, 301 moved permanently or 302 redirect. We wouldn’t want to waste shortened URLs on broken links! Next, the function next_url() determines the next URL in the sequence that we’re going to use for the URL shortener. The sequence is every possible combination in the character ranges 0-9, A-Z and a-z.

Shortened to demonstrate areas of significance, this is what our sequence should look like:

0, 1, 2.. 7, 8, 9, A, B, C.. X, Y, Z.. a, b, c.. x, y, z.. 00, 01, 02.. 08, 09, 0A, 0B.. 0X, 0Y, 0Z, 0a, 0b, 0c.. 0x, 0y, 0z.. 10, 11, 12 etc.

Python’s itertools.product can create this sequence for us, and you can try it out with this script:

from itertools import product

# use the following ascii ranges:
rngs = [
range(48,58), # 0-9
range(65,91), # A-Z
range(97,123) # a-z
]

# create an array holding those chars:
chars = []
for rng in rngs:
	for i in rng:
		chars.append(chr(i))

# print all 2 letter combinations:
for combo in product(chars,repeat=2):
	print ''.join(url)

Generating all the possibilities in the sequence is fairly trivial, however, generating the next one from any given URL is not quite as easy. I decided to mock it up in Python first, so here’s my Python algorithm for it:

def inc(c):
	if c == '9':
		return 'A'
	if c == 'Z':
		return 'a'
	else:
		return chr(ord(c)+1)

def next_url(url):
	st = list(url)
	l = len(st)
	while l > 0:
		l -= 1
		if st[l] == 'z':
			st[l] = '0'
			if l == 0:
				st.append('0')
		else:
			st[l] = inc(st[l])
			return ''.join(st)
	return ''.join(st)

This isn’t great code, but this is only our mock version, and the PHP version should be a bit neater. Note here that inc() is only called once, and shouldn’t really have it’s own method definition, but when I was working it out, I had it as a separate function in case I would have to increment characters at different points during the method – turns out I don’t, so in the PHP version that will be condensed into the main method body.

This function basically works backwards through the string. It checks if the character on the end is a ‘z’. If it is, then it resets it to ’0′, the first character in our sequence, since ‘z’ is the last character. If it isn’t, it can simply be incremented, and then returned, as once a character in the string has been incremented, it is the next URL in our sequence and can be returned. It will then step back through each character, until one gets incremented and the function returns the string.

When you’ve run out of URLs in a certain number of letters, e.g. zzz, and it gets to the first character in the sequence without having incremented anything, then it will append a 0 on the end, so all of the characters should be 0s with an extra 0 on the end, starting the next batch of URLs at four characters long.

The string will then be returned if all the characters have been looped through.

The inc() method is pretty simple, it’s just allowing us to jump across the character ranges 0-9, A-Z and a-z as if they were all one range. I hope I’ve explained that algorithm well enough, and we’ll come back to that in a minute.

Let’s start implementing our main method here, shorten(), and we’ll implement the others along the way. Now the first thing we’ll need to do, as mentioned above, is to check for reachability. If the link is dead we don’t want to waste short URLs on it. In this method, we’re going to use exceptions to handle any issues in shortening the URL.

function shorten($lurl) {
    if (!$this->reachable($lurl))
        throw new Exception ('URL is unreachable.');

Now let’s use a bit of cURL magic to check if the link is live:

private function reachable($url) {
	$ch = curl_init($url);
	curl_setopt($ch, CURLOPT_NOBODY, true);
	$a = curl_exec($ch);
	$c = curl_getinfo($ch, CURLINFO_HTTP_CODE);
	if ($a && ($c == 200 || $c == 301 || $c == 302)) return true;
	return false;
}

Lovely, reachability => sorted. Now here’s the core part of our shorten() method:

try {
    // check if already shortened
    $dbh = pdo_dbh();
    $sth = $dbh->prepare('SELECT surl FROM urls WHERE lurl = ?');
    $sth->execute(array($lurl));
    $surl = $sth->fetchColumn();
    if (is_string($surl)) return $surl;

    // determine the next shortened url
    $dbh->query('LOCK TABLES urls WRITE');
    $q = 'SELECT surl FROM urls ORDER BY id DESC LIMIT 1';
    $sth = $dbh->query($q);
    $last_surl = $sth->fetchColumn();
    if (!is_string($last_surl)) $surl = '0';
    else $surl = $this->next_url($last_surl);

    $q = 'INSERT INTO urls (surl,lurl) VALUES (?, ?)';
    $sth = $dbh->prepare($q);
    $sth->execute(array($surl,$lurl));
    $dbh->query('UNLOCK TABLES');

    $dbh = NULL;		

    return $surl;
}
catch (PDOException $e) {
    error_log('line '.$e->getLine().' PDO exception while shortening: '.$e->getMessage());
    throw new Exception ('Database Error.'); // << hide details
}

In the first block of code, we check whether we’ve already got that URL shortened in our table. The reason I use is_string() on the result of that query is to verify that it’s some actual data rather than NULL or false. You’re probably thinking why didn’t I just do this:

if (!$last_surl) $surl = '0';

The reason is that PHP evaluates “0″ as 0 in the if statement, and therefore indistinguishable from false or null. Since “0″ is a valid value for the short URL, we’ve got to check that it’s a string instead.

At this point in the method we know that the URL is valid, and that we don’t have it in our database already, so it looks like we’re going to want to generate the next available short URL. The first thing we do in this process is to lock our tables for writing. The last thing we want is for another URL to be added at the same time, as one of the two competing URLs would then be lost because of a uniqueness error on the field surl. We then proceed to get the shortened URL of the newest row in the table. Again, we use the same checking method as before, to check whether we got any rows. If we didn’t, then we use ’0′, the first possible short URL in our sequence. If we got the last short URL, we use the next_url() function to generate the next url. We’ll implement that now:

private function next_url($url) {
    $st = $url;
    $i = strlen($st);
    while ($i > 0) {
        $i -= 1;
        if ($st[$i] == 'z') {
            $st[$i] = '0';
            if ($i == 0) $st .= '0';
        }
        else {
            $c = $st[$i];
            if ($c == '9') $c = 'A';
            else if ($c == 'Z') $c = 'a';
            else $c = chr(ord($c)+1);
            $st[$i] = $c;
            return $st;
        }
    }
    return $st;
}

This method is based on the Python mock version, which I have explained above.

After we have obtained the new shortened URL, we add that and it’s respective long URL to the table, unlock the table, and then return the short URL, bringing an end to our shorten() method. Should a PDO exception occur, we catch it, logging the details, and obscuring them with a new “Database Error” message, which is safe to display to a user, should such an error occur.

Great – we’re done with core.php for now – let’s go and make use of it in shorten.php.

<?php

require 'includes/core.php';

if (!isset($_POST['url'])) header('Location: index.php');
$url = $_POST['url'];

$sb = new Shortbase();
try {
	$surl = $sb->shorten($url);
	$nurl = 'http://'.$_SERVER['SERVER_NAME'].'/'.$surl;
	$msg = 'Your shortened URL: <a href="'.$nurl.'">'.$nurl.'</a>';
}
catch (Exception $e) {
	// no need for log
	// either already logged db error
	// or bad url
	$msg = '<b>Error:</b> '.$e->getMessage();
}

?>

<!DOCTYPE html>
<html>
<head>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
	<title>Shortened URL</title>
</head>
<body>
<h1>Shortened URL</h1>
<?php echo $msg; ?>
</body>
</html>

Nothing too complicated here – that code should be self-explanatory.

Right, on to expansion now. The way that we want our URL shortener to work is that when a URL is requested, we are redirected to the long URL. In order to do that we need to user Apache’s mod_rewrite to re-write any URL that matches a certain pattern to point to our expand.php page. But first, let’s write the expansion code. We’re going to a new method, not surprisingly named expand() to our class in core.php now.

function expand($surl,$log=false) {
	try {
		$dbh = pdo_dbh();
		$sth = $dbh->prepare('SELECT id,lurl FROM urls WHERE surl = ?');
		$sth->execute(array($surl));
		$arr = $sth->fetch();
		if (!is_array($arr)) throw new Exception('404');
		$lurl = $arr['lurl'];
		if (!$log) {
			$dbh = NULL;
			return $lurl;
		}

		$ip = $_SERVER['REMOTE_ADDR'];
		$sth = $dbh->prepare('INSERT INTO clicks (url_id, ip) VALUES (?, ?)');
		$sth->execute(array($arr['id'],$ip));
		$dbh = NULL;
		return $lurl;
	}
	catch (PDOException $e) {
		error_log('PDO exception while expanding: '.$e->getMessage());
		throw new Exception('500');
	}
}

Not much that needs explaining there, we’re just grabbing the long URL and returning it. Whether or not it is logged as a “click” is optional. In expand.php of course we will log it, but for other uses it may not be desirable for the expansion to be logged.

Now we can use that method in the expand.php page:

<?php

require 'includes/core.php';

if (!isset($_GET['s'])) {
	header('Location: index.php');
	exit;
}
$surl = $_GET['s'];

try {
	$sb = new Shortbase();
	$l = $sb->expand($surl,true); // log this as a click
	header ('Location: '.$l);
}
catch (Exception $e) {
	$msg = $e->getMessage();
	if ($msg == 404) {
		header ('HTTP/1.1 404 Page Not Found');
		die('Page Not found');
	}
	else if ($msg == 500) {
		header ('HTTP/1.1 500 Internal Server Error');
		die('Database Error');
	}
}

?>

And that’s it – the back-end of our URL shortener is done!

Now it’s time to implement the mod_rewrite. When we’re doing this, it’s important not to break any of the pages (such as index.php or shorten.php). The following regex should match strings which contain only characters in the ranges we’re using: (^[0-9A-Za-z]+$)

So let’s turn that into a RewriteRule in our .htaccess file:

RewriteEngine On

RewriteRule (^[0-9A-Za-z]+$) expand.php?s=$1

Great – that should now mean that all requests consisting of a shortened URL will act as if they were a parameter to expand.php.

With that, the basic URL shortener is complete! Before I wrap this post up though and post the code on github, there’s one little thing I want to add to Shortbase. On quite a few popular URL shorteners if you add a + to the end of the URL, then it gives you stats about that URL, so let’s implement that now.

First, we’ll create the page stats.php, and then the corresponding RewriteRule to point a short URL with a + on the end to it.

<?php

require 'includes/config.php';

if (!isset($_GET['s'])) {
	header ('Location: index.php');
	exit;
}
$surl = $_GET['s'];

try {
	$dbh = pdo_dbh();
	$sth = $dbh->prepare('SELECT id,lurl FROM urls WHERE surl = ?');
	$sth->execute(array($surl));
	$arr = $sth->fetch();
	if (!is_array($arr)) {
		header ('HTTP/1.1 404 Not Found');
		die('Page not found');
	}
	$id = $arr['id'];
	$lurl = $arr['lurl'];

	$sth = $dbh->prepare('SELECT COUNT(*) FROM clicks WHERE url_id = ?');
	$sth->execute(array($id));
	$clicks = $sth->fetchColumn();

	$sth = $dbh->prepare('SELECT COUNT(DISTINCT ip) FROM clicks WHERE url_id = ?');
	$sth->execute(array($id));
	$uq_clicks = $sth->fetchColumn();
}
catch (PDOException $e) {
	error_log('PDO exception retrieving stats for '.$surl.': '.$e->getMessage());
	header ('HTTP/1.1 500 Internal Server Error');
	die('Database Error');
}

echo "<h1>URL Info</h1>";
echo '<b>URL: </b>'.$lurl.'<br />';
echo '<b>Clicks: </b>'.$clicks.'<br />';
echo '<b>Unique Clicks: </b>'.$uq_clicks;

?>

Once we have this page, we can finally add our regex to look for short URLs with a + on the end:

RewriteEngine On

RewriteRule (^[0-9A-Za-z]+$) expand.php?s=$1
RewriteRule ^([0-9A-za-z]+)\+$ stats.php?s=$1

And our URL shortener is now finished!

I’ve now uploaded Shortbase to github, and just deployed a live demo. Enjoy!

Making Tunes with Python

Recently, I’ve been looking into audio synthesis with Python, and I found this excellent blog post as an introduction to synthesis using numpy and scipy.

That post teaches you the basics of how it works, and does it quite nicely – but I’d like to take that a little further, and go into how to combine notes to make a tune.

First, let’s take a step back, and get that example working. Without the graph plotting code in that blog post, here is the basic code:

from scipy.io.wavfile import write
from numpy import linspace,sin,pi,int16

# tone synthesis
def note(freq, len, amp=1, rate=44100):
 t = linspace(0,len,len*rate)
 data = sin(2*pi*freq*t)*amp
 return data.astype(int16) # two byte integers

# A tone, 2 seconds, 44100 samples per second
tone = note(440,2,amp=10000)

write('440hzAtone.wav',44100,tone) # writing the sound to a file

Now the first thing you’ll notice is that we’ve got some dependencies. We need to get numpy and scipy (if you don’t have those already). You’re going to need two things – both of them require a C compiler (GCC) installed. If you’re on a mac, the easiest way is to just install the Xcode developer tools, which come with GCC. Also, scipy requires a fortran compiler, and they recommend this version for mac.

If you’re not on a mac, just check out their documentation for various operating systems.

Once you have the required compilers, you can just use a package manager like pip or easy_install to install them:

pip install numpy
pip install scipy

Once you’ve got those – you should be able to run that code. Once run, you should see a file named 440hzAtone.wav in the directory you ran the script from. That should be a tone at 440 Hz lasting for 2 seconds.

If that’s it, everything went well!

So we now have a basic tone, at 440 Hz. Now let’s try combining two different tones…

from scipy.io.wavfile import write
from numpy import linspace,sin,pi,int16,concatenate

def note(freq, len, amp=1, rate=44100):
 t = linspace(0,len,len*rate)
 data = sin(2*pi*freq*t)*amp
 return data.astype(int16) # two byte integers

tone1 = note(440,1,amp=10000)
tone2 = note(500,1,amp=10000)
tone = concatenate((tone1,tone2),axis=1)

write('sound.wav',44100,tone) # writing the sound to a file

If everything went well, that should output a sound file called sound.wav with two different tones, each one second long, the first one being at 440 Hz and the second at 500 Hz.

How did that work? Well – we imported numpy’s concatenate function, and used it to concatenate the two arrays which contained our sound waves. We add “axis=1″ onto the end of the function so that numpy knows how to concatenate the arrays.

Great, we can now combine tones! The next obvious step is to make use of our newfound abilities to make a tune…

from scipy.io.wavfile import write
from numpy import linspace,sin,pi,int16,concatenate

# tone synthesis
def note(freq, len, amp=1, rate=44100):
 t = linspace(0,len,len*rate)
 data = sin(2*pi*freq*t)*amp
 return data.astype(int16) # two byte integers

table = {
'c':523.25,
'd':587.33,
'e':659.26,
'f':698.46,
'g':783.99,
'a':880,
'bb':932.33,
'b':987.77,
'c1':1046.5
}

def play(tune,speed):
	global table
	tones = note(0,0)

	for anote in tune:
		length = anote[1]*speed
		if anote[0] == "r":
			ntone = note(100,length,amp=0)
		else:
			hz = table[anote[0]]
			ntone = note(hz,length,amp=10000)

		tones = concatenate((tones,ntone),axis=1)
	return tones

tune = [
['c',0.375],['c',0.125],['d',0.5],['c',0.5],['f',0.5],['e',0.5],['r',0.5],
['c',0.375],['c',0.125],['d',0.5],['c',0.5],['g',0.5],['f',0.5],['r',0.5],
['c',0.375],['c',0.125],['c1',0.5],['a',0.5],['f',0.5],['e',0.5],['d',0.5],['r',0.5],
['bb',0.375],['bb',0.125],['a',0.5],['f',0.5],['g',0.5],['f',0.5]
]

d = play(tune,1)

write('sound.wav',44100,d) # writing the sound to a file

Before I explain how it works, if you’re at all musical, you can try guessing the tune before running it. I decided to use the musical note-name system because I find it is a lot easier to refer to notes with letters than it is with frequency. You can see a table which shows all of the notes on the musical scale and their corresponding frequencies here.

So, what’s going on here?

Well, I’ve basically made a primitive way of representing a tune in Python, using nested lists. Each note in the tune is a list, and contains the note name, and relative duration of that note. The play function then loops through this, generating a tone for each note, and concatenating it into a numpy array which holds all the tones processed so far. It then returns this array, which I write to a file.

Note: with this list, I’m using note names starting at C5 on the actual musical note scale. The reason I use C1 in line three of the tune is to represent C6 (an octave above C5). I decided to use C to represent C5 in this script just for readability purposes, but if you wanted to represent all notes, then you should include the the number with each note.

Let’s step through the method:

def play(tune,speed):
	global table
	tones = note(0,0)

Here we define our method. The tune is our nested list containing the tune data. The speed is a number which all relative durations in our tune will be multiplied by to get the duration for that note. For example, if you run it at a speed of 0.5, it will sound twice the speed of the same tune at a speed of 1.

We then declare the table as a global, so we can use the note lookup table (at global scope) inside our local method scope. In this snippet we also setup an empty numpy array, by creating a note with a frequency and duration of 0, which our method can then append to.

for anote in tune:
    length = anote[1]*speed
    if anote[0] == "r":
        ntone = note(100,length,amp=0)
    else:
        hz = table[anote[0]]
        ntone = note(hz,length,amp=10000)

    tones = concatenate((tones,ntone),axis=1)
return tones

Here, we loop through the tune, processing the notes. The first thing we do, is calculate the length of the note (or rest), by multiplying the relative duration of it by the speed provided in the method call. We then check if the note type is a rest. A rest is a gap in the music, and so if it is a rest, we create a note with an amplitude of 0, and a frequency of 100. The frequency is not important, because we are creating a note with an amplitude of 0, so it won’t be heard. If it isn’t a rest, we then lookup the desired frequency in our table, and assign that to hz. We then create the desired note using the frequency and length we have obtained. I chose an amplitude of 10,000, as that was used in the original example (linked to above), and seemed to work well.

At each iteration of the loop, we concatenate the large tone array (tones), with our new tone (ntone). Once we’ve looped through all the notes, we then return the tones array.

So that’s about it – you should now be able to make tunes with Python! There’s just one little thing that I’d like to add to our play function. Being a musician, I would prefer to be able to specify my speed in BPM, so let’s go ahead and implement that now – it’s not exactly challenging:

def play(tune,bpm):
    speed = 60.0/bpm

That’s just about it!

There you have it – very basic tunes with Python. If you notate a tune playable by my code, feel free to share it in the comments.

Effective password hashing in PHP

Storing all passwords in complete security is impossible. There will always be some way that someone with access to your data will get a password, if they really want to. But your job as a web application developer is to ensure that you make it as difficult as possible.

For a long time, this is how I used to hash a password:

$hash = sha1($password);

That’s bad. OK, not as bad as plain text or md5(), but it’s still bad. Why is it bad? A combination of two things, sha1 is fast, and I’m not salting the password.

Salting your passwords doesn’t just make them taste better, it actually makes them harder to crack. There are two types of salting, a static salt, and a dynamic salt (or nonce).

A static salt is where you have a unique string that you use for your web application, which you then combine with the password to produce a longer string, which can then be hashed. Here is an example of using a static salt:

// config.php
define('ksitekey','reallylongstringofnumbersandletters');

// login.php
include 'config.php';
$password = 'ponies';
$hash = sha1($password.ksitekey);

Let’s say our user decides they want their password to be “ponies”. That’s only 6 characters long, and I can write a python script which will brute force that in about 8 minutes on my MacBook Pro:

from hashlib import sha1
import itertools

target = "9e09207037d10f79ad96ade0c4723a6e22033a60"
cracked = False

chars = []
for i in range(97,123):
	chars.append(chr(i))

length = 0
while not cracked:
	length += 1
	print length," chars..."
	for i in itertools.product(chars,repeat=length):
		s = ''.join(i)
		if sha1(s).hexdigest() == target:
			cracked = True
			print "Solved: "+s
			break

The output of that (with some timing code) is:

1  chars...
2  chars...
3  chars...
4  chars...
5  chars...
6  chars...
Solved: ponies
That took: 533.581736088 seconds

Obviously, this script can also be used to look in an entire table for a matching hash, thus finding all the passwords in the table that are 6 characters or below in around 10 minutes. If the cracker is serious, they might decide to run a script like this on a distributed network of servers, and will be able to solve these kinds of passwords very quickly.

Not only is that password easily brute-forceable, but it’s also highly susceptible to dictionary attacks. Unfortunately, many users do choose simple passwords like this, and so we need to take extra steps to protect them. With our salt, however, the threat of brute-force is pretty much ruled out, considering our resulting string will be 41 characters long. Dictionary attacks will also be useless here, considering our site key in our real web app would be a string of numbers and letters, usually the result of another hashing algorithm.

So it might seem like a static salt is the answer to everything… right? Well, not entirely. Although a static salt can greatly improve your password security, if your attacker somehow manages to find out your static salt, then you’re pretty screwed. They can simply brute-force or dictionary attack in the same time it would take without any salt. That’s why a dynamic salt or a “nonce” is used. This should be generated when the user’s password is set, for example:

$password = 'ponies';
$nonce = md5(uniqid());
$hash = sha1($password.$nonce);
// store the password and nonce for the user

This way, brute force attacks on multiple users will now be useless, because each user has a unique 16 character hash appended to their password.

However, because the nonce is stored with the password, it is likely that if the attacker has the password data, then he also has the nonces, and so attacks on individual users are still possible.

For this reason, the most secure thing you can do is to combine both a static and dynamic salt in your hashing algorithm. I currently do this on my sites, and I use a function based on this popular stack overflow answer:

function hash_password($password, $nonce) {
  return hash_hmac('sha512', $password . $nonce, ksitekey);
}

And so the complete code for creating a new user looks like this:

// config.php
// here I've used an sha256 hash for the key
define('ksitekey','87d4131041a6d0a2702cf9695a671858486c1a38495af5bc33c992062fb58110');

include 'config.php';

$password = 'ponies';
$nonce = md5(uniqid());
$hash = hash_hmac('sha512', $password . $nonce, ksitekey);
// store the hash and the nonce for the user

This produces a 128 character hash, which results from a 16 character nonce, a 64 character unique site key, and your user’s password. That’s pretty secure.

If you have any suggestions as to a better method of keeping passwords safe (or improving my python script!), let me know in the comments.

Where to start?

Where to start? Well, that’s a stupid question…

print "Hello World!"

Welcome to my blog, where I’m going to talk about coding, tech, and anything else that I find interesting. You can find out more about me on the about page.

Oh, and I might occasionally talk about server admin as well, which leads me into the subject of this blog post… WordPress. With the help of this nifty tutorial, I managed to get it up and running in about 20 minutes. The whole process was really quite simple, and everything worked just as expected… almost.

What I’ve found is that when possible, for things such as upgrading plugins and installing themes, using the command-line over WordPress’ GUI system management features is just easier. For example, WordPress provides an auto-update feature for plugins, and since one of the plugins that came with WordPress (akismet) needed updating, I tried to use the auto-update feature, but it then needed my FTP details, and since I use SSH publickey auth for FTP, then I really just gave up trying to do it via the GUI, considering you can do it with three commands:

cd /path/to/wordpress/wp-content/plugins
wget http://downloads.wordpress.org/plugin/akismet.2.5.5.zip
unzip akismet.2.5.5.zip

And.. your plugin is updated. OK, you may want to backup the original first, but even then it’s still only two extra lines.

But so far, running WordPress on my server has been great, really easy, and really powerful – I look forward to posting some hopefully interesting stuff on this blog.