PHP Bulletin Board Home
News About Home
Features of phpBB Test drive phpBB Downloads Support for phpBB The phpBB Community Styles for customising phpBB 3rd party modifications to phpBB

Support Home | Knowledge Base Home | Submit Article | Search Articles | Browse Articles
 Decrease searchtables' size 
Description: How to decrease the size of your database's search tables.
Author: R. U. Serious
Date: Tue Oct 29, 2002 5:19 pm
Type: HowTo
Keywords: search, table, size, database, decrease, functions_post.php, search_stopwords, common, word, wordlist
Category: Improvements
Overview
  1. Introduction
  2. Fix functions_search.php (in 2.0.1 & 2.0.2)
  3. Make a search_stopwords for heavily indexed words.
  4. Delete search_stopwords from your searchtables

  1. Introduction
    The search function of phpBB uses two tables: search_wordlist which has a unique entry for each word that has ever been typed in your forum, and search_wordmatch which keeps track of which posts match which words from the search_wordlist.
    This is great for speed when searching, but has its drawback in the size of the searchtables. If you have more than enough db-space then don't mind reading this HowTo till the end. It will be of no/little relevance to you.. If however you are interested in reducing you db-size this is just for you! However you will need some knowledge in executing queries in your db (e.g. through phpmyadmin).
    With steps I describe in this HowTo I reduced my total db-size between 15-20%. (More specifically about 1 MB for a 3800 posts-forum (6.5MB to 5.4 MB)). Your results may vary greatly. This is not one of those "Loose weight fast"-scams; No, this really(!) works. I know, it sounds really cheep, but you wouldn't guess how many ppl. fall for these lines... *g*

    phpBB already brings functionality to avoid unnecessary growing of the searchtables. For exapmle it will mark 'heavily used' words with a common_word tag in the search_wordlist and stop adding further matches of this word in the search_wordmatch. For me this was only the case for a single word, so it is more like an "emergency exit", seldomly used, but good to have Wink
    Another thing phpBB does (and which I will concentrate on) is: it can use (and already uses, if you have english as default language) a blacklist of words, and all words on this list are not indexed at all, never. This blacklist is stored in a simple .txt file with one word per line (language/lang_esperanto/search_stopwords.txt). A list for the english language is already delivered with the regular download from your preferred phpBB-Dealer (just kidding Mr. Green Only d/l phpBB from this site: www.phpbb.com ! Wink ).

    This HowTo will tell you how to make a little fix so that this list gets actually used. I submitted it as a bug report and guess the fix will be in 2.0.2 (edit: well it's not)
    Further it shows you how to find out the most used (and thereby) indexed words. And finally it provides you with a little script that will remove all search_entries from your list of words in search_stopwords.txt.

  2. Fix functions_search.php (in 2.0.1 & 2.0.2)
    This fix will probably be in 2.0.2 (edit: it is not). It was not in CVS 10. Aug. '02. If you use an older Version, you probably will have to make this fix.
    Open includes/functions_search.php and go to line 108 and 109, you will find
    Code:
       $search_raw_words['text'] = split_words(clean_words('post', $post_text, $stopword_array, $synonym_array));
       $search_raw_words['title'] = split_words(clean_words('post', $post_title, $stopword_array, $synonym_array));
    Please replace with
    Code:
       $search_raw_words['text'] = split_words(clean_words('post', $post_text, $stopwords_array, $synonym_array));
       $search_raw_words['title'] = split_words(clean_words('post', $post_title, $stopwords_array, $synonym_array));
    The change is very subtle, a single 's' per line in the variable $stopwords_array. edit: Geez, looks like I had mixed it up, it's corrected now. Thanks Ashe!
    This change is necessary so phpBB does not index words on the blacklist. Because of the typo the blacklist was always assumed to be empty.

  3. Make a search_stopwords for heavily indexed words.
    If you are using the english language as default, you may possibly skip this point, although I suggest you don't, because words used at different sites vary. At the two sites I am involved with there were a lot of different Top-Used-Words.

    To make the list very effective you don't want it to be unnecessarily long and at the same time include words that are part of the commonly used language. So they are used very often, yet do not indicate any kind of special information, for example " the, they, can, you, although, almost" etc. In German my Top-Used-Words were "der, die, das, ich" (meaning the, the, the, I ). To find out your list of Top-Used-Words you can run this query on your DB (e.g. in phpmyadmin):
    Code:

    SELECT ls.word_id, ls.word_text, COUNT(wm.word_id)  as entries FROM `phpbb_search_wordlist` as ls LEFT JOIN `phpbb_search_wordmatch` as wm ON ls.word_id=wm.word_id GROUP BY wm.word_id ORDER BY entries DESC LIMIT 0,50

    This will return a list of your 50 Top-Used-Words in descending order of their occurence, with three columns:
    • the word_id - this is unique for every word and indicates the relevant entries in the two tables.
    • word_text - this shows you the actual word
    • entries - the number of entries in search_wordmatch for this word.

    Now look at each entry in that list and decide wether you want to keep it in your index or if you are going to delete it and make searching for it impossible. I used all but a handful of the words.

    Save these words in normal text file with one word per line and name that file search_stopwords.txt. Now put this list in the directory: language/lang_esperanto/search_stopwords.txt where esperanto is the language that is set as default-language in your general board-configuration in your AdminCP.

    Now these words will no longer be indexed for the searchfunction.

  4. Delete search_stopwords from your searchtables
    Now you want to remove the thousands of entries of the blacklisted searwords from your searchtables. And after that you should optmize the two tables.I have included an example script, that will do this. It is not pretty, but it works (for me). I have mySQL 3.23 and php 4.1, you might need to make modifications if you have a different setup.
    After you have run the script, you should delete it from the server (although it should not be dangerous to run it twice).

    That's it. Hopefully this works for you. Enjoy Smile

Code:

<?php
//*****  reduce_my_searchtables_with_stopwords.php ****//

define('IN_PHPBB', true);
$phpbb_root_path = './';
include($phpbb_root_path . 'extension.inc');
include($phpbb_root_path . 'common.'.$phpEx);
include($phpbb_root_path . 'includes/functions_search.'.$phpEx);

// Start session management
$userdata = session_pagestart($user_ip, PAGE_SEARCH);
init_userprefs($userdata);
// End session management

$stopwords_array = file($phpbb_root_path . 'language/lang_' . $board_config['default_lang'] . "/search_stopwords.txt");

$liste='';
foreach($stopwords_array as $curr_word)
{
      $liste .= ( ( $liste != '' ) ? ', ' : '' ) ."'".trim($curr_word)."'";
}

   $sql = "SELECT word_id
      FROM " . SEARCH_WORD_TABLE . "
      WHERE word_text IN ($liste)";
   if ( !($result = $db->sql_query($sql)) )
   {
      message_die(GENERAL_ERROR, 'Could not obtain common word list', '', __LINE__, __FILE__, $sql);
   }

   $common_word_id = '';
   while ( $row = $db->sql_fetchrow($result) )
   {
      $common_word_id .= ( ( $common_word_id != '' ) ? ', ' : '' ) . $row['word_id'];
   }

if ($common_word_id=='') message_die(GENERAL_ERROR,'None of the words in the list are in your search_tables.<br>Note: This could also mean the list is empty ;)');
   //echo '>'.trim($curr_word)."<<br>";
   //echo $liste .'<br>'. $common_word_id;
   //exit;

   $sql = "DELETE FROM " . SEARCH_WORD_TABLE . " 
      WHERE word_id IN ($common_word_id)";
   if ( !$db->sql_query($sql) )
   {
      message_die(GENERAL_ERROR, 'Could not delete word match entry', '', __LINE__, __FILE__, $sql);
   }
   $sql = "OPTIMIZE TABLE " . SEARCH_WORD_TABLE;
   if ( !$db->sql_query($sql) )
   {
      message_die(GENERAL_ERROR, 'Could not optimize', '', __LINE__, __FILE__, $sql);
   }

   $sql = "DELETE FROM " . SEARCH_MATCH_TABLE . " 
      WHERE word_id IN ($common_word_id)";
   if ( !$db->sql_query($sql) )
   {
      message_die(GENERAL_ERROR, 'Could not delete word match entry', '', __LINE__, __FILE__, $sql);
   }
   $sql = "OPTIMIZE TABLE " . SEARCH_MATCH_TABLE;
   if ( !$db->sql_query($sql) )
   {
      message_die(GENERAL_ERROR, 'Could not pütimize', '', __LINE__, __FILE__, $sql);
   }
   
   message_die(GENERAL_MESSAGE,'<b>Done!</b><br><br>The following list-entries have been removed from your searchtables:'.$liste);
//echo $liste .'<br>'. $common_word_id;

?>

Disclaimer: You yourself are fully responsible for what you do and wether you follow what some weirdo you never met tells you to. So don't be mad or hold someone else reponsible if somehow, something screws up you your forum while you are trying any of the things described here. Wink

Username: Password:
News | Features | Demo | Downloads | Support | Community | Styles | Mods | Links | Merchandise | About | Home
 © Copyright 2002 The phpBB Group.