Tuesday, August 19, 2008

SEO Problems and Solutions to improve optimisation within phpBB3

Many people think that SEO is basically using or implementing Human Readable URLs. However, this is not encompassing of “SEO” at all, it is one small part of SEO, and is generally grossly misunderstood. Many people believe that Human Readable URLs are the only way that Search Engines can correctly index a site, or that it is the best method for spiders to index the site. Or that Dynamic URLs somehow hurt your Search Engine Ranking or performance within Search Engines. But this is not the case. Human Readable URLs mainly benefit Search Engines as added Keywords within the page, this only reinforces the keywords already set by the page and topic title, which Search Engines already use for the index.
Search Engines have just as easy of a time indexing dynamic URLs as they will with Human Readable static URLs. The benefit is minimal.

SEO is Dead! Long Live SEO!
The other side of the coin is the people who think that you really don’t need to do any Optimisation (SEO) at all. These are people who don’t think that any form of Optimisation can have any sort of positive impact on your site. They have usually never been in the SEO industry, and instead just hear all the hype from people talking about SEO. They believe that SEO no longer applies to today’s world of the internet. But these people do not truly understand what SEO means or what it is meant to do, and they are talking out of inexperience or ignorance.

This post is meant to address both sides of the argument by giving people a better understanding of what SEO is and it’s place within phpBB3. It also includes problems identified within phpBB3 itself with regards to Search Engine performance and solutions to address these issues.

What is SEO?
SEO (Search Engine Optimisation) is defined as any action or adjustment, also known as Optimisation that you perform on your site for the purpose of improving the volume and quality of traffic to your site or board from search engines by utilising search results for targeted keywords, and can include one or many possible methods. From implementing Human Readable URLs and Keyword targeting, to Marketing, to adding a no-index page or directory to your robots.txt file.
All of this is Search Engine Optimisation (SEO).

How does phpBB3 Handle SEO?
phpBB3, out of the box, has good Search Engine Optimisation (SEO) capabilities. It handles BOT sessions appropriately, and it hides completely useless content such as forms, links to profiles, or links that spiders should not or could not access, also known as “dead links”, among a few other things. Some of these are just meant to improve the performance of spiders indexing the site, from not displaying useless content, which can cut down on the HTTP requests and bandwidth, to showing useless items such as forms. But that’s about the extent of it. There is so much more that one can do to optimise phpBB3 for best Search Engine capability. Many methods that people I believe are not aware of yet.

What is the main issue of SEO within phpBB3?
Within phpBB3, the main issue with Optimisation is duplicate content. No, not the kind of duplicate content that will get you penalised or banned from Google (that’s a whole other post), but the kind of duplicate content that distorts search results and causes slightly higher bandwidth because the spider is indexing and re-indexing the same exact content as separate pages within it’s index. Then the search results for this single page instead display as multiple results of the exact same page, which defeats the purpose of good search results and degrades the effectiveness of those results.

How can this be improved? First the problem must be completely understood. When a spider crawls your board, from the index page, it looks at all the links: There are links to the Categories, Forums, Subforums, and also the last post of that particular forum. Upon entering a forum, it sees a list of links to topics, up to four pages within a single topic, and again the last post within that topic. Upon entering the topic, it will see a whole other page through the print option.

The idea is that we want the spider to index the topic in pages, and to do this, the spider has to see the exact same URL, dynamic or static of the pages for that topic. If it sees different URLs (such as those containing the p -post- variable), it considers the page a completely new page and it indexes it as such. But the last post URL on the index, and category views, as well as the last post within a specific topic shows the spider that there are far more pages to your forum than there really are. Thus causing multiple indexing of the same content.

Secondly, users will post links directly to a post within a topic, this URL may contain parameters other than the forum id, topic id and the start variable, which the spider should only recognising. It may see multiple parameters including: sort key, sort direction, order, session id, post id, print view, and highlighting.

Each new variable thrown in that is different from the last time it indexed this topic and page will mean a completely new page and thus another indexing pass by the spider. Further diluting search results for keywords or content for this page and again repeatedly consuming bandwidth over the same content.

The Solution for Duplicate Content within phpBB3
Now that we understand the problem, what is the solution?
There are two methods that can be used to improve the optimisation with the duplicate content issue within phpBB3.
First, by removing (hiding) the links to the last post within the topics and forums.
Second, by filtering out all parameters except for the topic_id and the start variable for bots only -- and perhaps enforcing the forum_id, but it has to be consistent, one way or another. Remember that every variable added means a new page to the Search Engines including Google. -- This type of change means basically redirecting the page on viewtopic for spiders if the parameters are not those strictly allowed, this kind of change is something that can increase the number of HTTP Requests if overused, but may be necessary to improve the optimisation, and due to the decreased repeated indexing, may result in fewer HTTP Requests, so it may even itself out in the end. But this change should only be performed for spiders and bots, as it would be a detriment to the user attempting to navigate your board if they experience this effect.

These changes will be far more beneficial for Search Engines than any Human Readable URLs change will have. Implementing Human Readable URLs can have a negative impact on this issue if Bots are still allowed to index the dynamic URLs, thus throwing in additional pages to what the Search Engines already see.

Wednesday, May 14, 2008

Performance difference using mysql_real_escape_string versus intval

Someone asked me today what the performance difference would be if they used mysql_real_escape_string() on all of their variable inputs into a database instead of casting to variable type integers for the int inputs.
I created the below bechmark test to demonstrate what performance differences there would be in using each of the possible functions in such a situation.

It’s important to note that all user input variables that are intended to be strings go through the mysql_real_escape_string() function to ensure proper sanitisation for those strings.
But for integers, what about those? should we use the mysql_real_escape_string() function on them, or should we cast the variable type int?
Lets find out what the performance differences are:

PHP Code:
<?php

$link 
mysql_connect('localhost''username''password');
if (!
$link)
{
    die(
'Could not connect: ' mysql_error());
}
echo 
"Connected successfully<br />\n";

$string "12345'?;/FOO/FUM\';Example String =";
$num 500000;
echo 
"benchmark: Begin<br />\n";

$start_time microtime(true);

for (
$i 0$i $num$i++)
{
    
$escaped mysql_real_escape_string($string);
}

$escape_benchmark microtime(true) - $start_time;

$start_time microtime(true);

for (
$i 0$i $num$i++)
{
    
$intval intval($string);
}

$intval_benchmark microtime(true) - $start_time;

$start_time microtime(true);

for (
$i 0$i $num$i++)
{
    
$int = (int) $string;
}

$int_benchmark microtime(true) - $start_time;

echo 
"escape_string time: $escape_benchmark <br />\n";
echo 
"intval time: $intval_benchmark <br />\n";
echo 
"int time: $int_benchmark <br />\n";

echo 
'intval vs escape_string difference: ' . ($escape_benchmark $intval_benchmark) . "<br />\n";
echo 
'int vs escape_string difference: ' . ($escape_benchmark $int_benchmark) . "<br />\n";
echo 
'int vs intval difference: ' . ($intval_benchmark $int_benchmark) . "<br />\n";

echo 
'Percent faster intval is than escape: ' round((- ($intval_benchmark $escape_benchmark)) * 100) . "%<br />\n";
echo 
'Percent faster int is than escape: ' round((- ($int_benchmark $escape_benchmark)) * 100) . "%<br />\n";
echo 
'Percent faster int is than intval: ' round((- ($int_benchmark $intval_benchmark)) * 100) . "%<br />\n";
echo 
"<br />\nThe result of the sanitisation:<br />\n";
echo 
"escape_string: $escaped <br />\n";
echo 
"intval func: $intval <br />\n";
echo 
'cast int: ' $int;
mysql_close($link);

?>

Result:

Connected successfully
benchmark: Begin
escape_string time: 0.697912931442
intval time:  0.349272966385
int time:   0.16778588295

intval vs escape_string difference: 0.348639965057
int vs escape_string difference: 0.530127048492
int vs intval difference:   0.181487083435

Percent faster intval is than escape: 50%
Percent faster int is than escape:  76%
Percent faster int is than intval:  52%

The result of the sanitisation:

escape_string: 12345\'?;/FOO/FUM\\\';Example String = 
intval func: 12345
cast int:  12345

From the benchmark, we can deduce that (int) produces a 70% to 75% performance advantage over using mysql_real_escape_string(); and a 50% performance advantage over using the intval() function.
In addition, if you are using an UPDATE or INSERT SQL query, forcing var type int on integer variables will mean that your query will produce a valid result, whereas an intended int variable that contains a string which goes through the mysql_real_escape_string() function is going to throw an SQL Error if the SQL database is expecting a float or integer value of some sort.

Of course, don’t take my word for it... the above script is provided for you to test yourselves as well. :)
If you want to test the benchmark without making a database connect, substitute the mysql_real_escape_string() with the older mysql_escape_string() function, which does not require a database link to process.
Enjoy!
- Highway of Life

Friday, May 2, 2008

Why Sanitisation and Type Casting is Important in PHP

PHP is a language that is very open to different types of coding.
Many good, many bad. There are always multiple ways of doing something in PHP, that’s one reason it is such a great language.
Everyone has their own way of doing something. The most common problem, however, is lack of proper knowledge of the language. Take someone who has worked on Java or C for 20 years and put them in front of PHP, they will learn the language quickly, but they will not necessarily pick up the true and correct methods and concepts of the language. Amazingly, few do.

In working with many PHP programmers as an instructor, I’ve seen coders from every level, starting from novice to advanced.
From inexperienced beginners to 20-30 years of experience programming C-based languages.
It is an easy language to learn, but a difficult language to master -- properly.
But even as easy as the language is, as popular as the language is, as many people program in it, few code correctly and securely as the language was intended to be programmed in. -- As an instructor, I regularly see code that is vulnerable in one way or another, about 1 in every 3 scripts has a severe vulnerability of some kind (75% of the time, it is an SQL Injection and usually due to poor or no sanitisation methods).
These are mostly due to not understanding typecasting properly, not using it, or not understanding the concepts behind proper coding in PHP, including coding PHP securely and the way it was intended to be used.

The reason that there were 20,000 sites hacked between January and April of 2008 was due to SQL Injections in web applications that hackers exploited to insert code that would inject iframes into every web-based dynamic (ASP or PHP) file on the website.
Some 200,000 plus webpages contained these iframes.
So how do people address this? How do you, as a site owner and coder, secure your site or ensure that the applications or scripts that you use or code are going to be secure and free from exploits?
You code in the way that PHP was intended, and the way that PHP was intended to work with databases.
Sure, you can program PHP in the way that you want to, you can use shortcuts, or do things your way because you feel like it, or you can program in a way that is correct and secure.

One of the most important concepts that I’ve had to drill into the heads of my students has been proper sanitisation of variables AND type casting.
Type casting is equally as important as sanitisation of strings.
PHP is built to be a dynamic language where you can dynamically assign an assortment of type definitions to a variable, it does not require or support explicit type definitions, this means that it is set by the content that is assigned to the variable.

If your Database field is intended to contain a float, you, the programmer must ensure that the data you are placing into that field is a float. If the field is an integer, you must ensure that the variable is an integer. If it is meant to contain a string, you must ensure that the variable type is going to be a string.
User data can be entered as a string, integer, float, or an array.
You cannot insert an array into a string field, a string field into an integer field, or float into an integer field (for example).

When you create a database, you define the fields to be exact database types, exact numeric data types such as Integer, decimal and numeric, approximate numeric data types such as float, real or double, a bit data type (as of MySQL 5), date and time data types, string data types such as char, varchar, text, binary (similar to char, but stores binary byte strings), blob and enum.

When creating a database, you do not create each field to be a string, you create the field with the data type of the information it will be storing -- I hope I don’t have to go into examples about *why* you must do this with databases --.
You then assign data to these database tables through your database interface, most commonly the built-in mysql functions in PHP.
When inserting data, they are meant to contain the data type that you have assigned to the field. If you created an numeric data type field, you need to ensure that the variable is an integer or float (float, double or real) as necessary.
If you do not, you introduce many potential and guaranteed problems into both your code and your database, database errors notwithstanding.

If you insert string data types, you need to ensure that the data that you are inputting is not a resource or array. -- Integer data types are allowed in string data type fields.
That means that even when pulling data, such as $_POST['my_variable'], you cannot just input this straight into your database, obviously, you need to sanitise the variable.
Your method of sanitisation does depend on what data type the field is that you will be inserting it into.
If it is int, force var-type int, if it is a float, you need to force it to be a float or double. Remember, PHP is a dynamic language, so it will change the var-type easily.
The variable may be an array, which you will need to deal with if you are attempting to insert the data into a string data type field.

All these must be must be considered if one is to create a script that is completely secure from Injection. -- If you want to ensure you are secure, create your code the way PHP intended for it to be created. Use proper type-casting.
It will save you many headaches down the road, it is well worth the research anyone would bother to spend on data types and type casting in PHP and your database.

People take database manipulation and interaction much too lightly, and it shows... 20,000 sites hacked in 4 months is too many. Certainly unnecessary and could have been avoided if the programmers had bothered to become familiar with correct sanitisation methods and typecasting.

If you are a user who is unsure about your script, you are unsure if you are using proper sanitisation and type casting methods, talk to programmers who you know are experienced in the language you are working with.
Do some research, on type casting and proper usage of the mysql_real_escape_string() function. If you are not using MySQL, check your DBMS type to see what function it uses to properly sanitise strings.
And last but not least, read Type Casting in PHP on how to properly use typecasting in PHP.
And use Google to research the topic in question, a little bit of research will go a long way to providing you with the necessary knowledge to build secure and stable scripts for your site or your clients.

- Highway of Life