Re: Google counts

jim_breen_at_hotmail.com
Date: 03/16/05


Date: 16 Mar 2005 11:08:54 GMT

Paul Blay <ranma@saotome.demon.co.uk> dixit:

>Unfortunately most of that does not apply to Japanese text searches.

The underlying assumptions do though.

>As discussed in various threads before the significant factors in Japanese
>text searching are
>1. Either including hl=ja in the search url or a kana の to make sure
>it is Japanese pages that are included.
>2. Including " marks around dubious / rare words / phrases - or they
>will often be split into two words for the search.
>3. Being aware that _if_ you have hl=ja set then certain search terms
>will be 'merged' as recognized alternative spellings (for example
>ダイヤモンド vs ダイアモンド)

Yes, a page in language-X is parsed and segmented by a language-X parser
before indexing. AFAIK for Japanese Google uses a parser from Basis
Technology, which in turn uses a parse dictionary from Jack
Halpern's company. For some reason I am not surprised to find
ダイヤモンド & ダイアモンド combined.
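
For anyone who wants to compare the two spellings by hand, here is a
minimal sketch of building the search URLs along the lines of points 1
and 2 in the quoted list: hl=ja in the URL, and " marks around the term
so it is not split. The hl=ja setting is from the quoted post; the
endpoint, the q parameter and the function name build_search_url are my
assumptions for illustration. It only builds the URLs and does not fetch
any counts.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed endpoint; the post itself only mentions the hl=ja parameter.
SEARCH_ENDPOINT = "https://www.google.com/search"


def build_search_url(term, exact=True, hl="ja"):
    """Return a search URL for `term` (hypothetical helper).

    exact=True wraps the term in " marks so it is searched as one
    phrase rather than being split into separate words.
    hl="ja" sets the interface language, as in the quoted advice.
    """
    query = f'"{term}"' if exact else term
    # urlencode percent-encodes the UTF-8 bytes and the quote marks.
    return SEARCH_ENDPOINT + "?" + urlencode({"q": query, "hl": hl})


if __name__ == "__main__":
    # The two spellings reported as merged when hl=ja is set.
    for spelling in ("ダイヤモンド", "ダイアモンド"):
        print(build_search_url(spelling))
```

Running the same pair with and without hl=ja (pass hl="en", say) is one
way to check whether the merging of alternative spellings is in effect.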

>I don't say that Altavista is a better search engine - but it seems to
>be more reliable for relative word counts.

I found that a while ago when doing some research on the spread of
丁寧語, 敬語, etc. across the .jp 2nd-level domains. Google may be
very good at getting the most relevant pages to the front, but its
page-hit stats are pretty flakey.

-- 
Jim Breen        http://www.csse.monash.edu.au/~jwb/
Computer Science & Software Engineering,
Monash University, VIC 3800, Australia 
ジム・ブリーン@モナシュ大学