Re: kanjidic parser in Perl?
- From: Ben Bullock <usenet@xxxxxxxxxx>
- Date: Tue, 16 Aug 2005 09:46:45 +0900
Gabor Farkas wrote:
Ben Bullock wrote:
"Gabor Farkas" <gabor@xxxxxxxxxxxxxx> wrote in message news:1f516$42ffa6b0$55d8898b$6438@xxxxxxxxxxxxxxxxxxxxxxxxxxx
Ben Bullock wrote:
David Alexander Ranvig wrote:
Ben Bullock <usenet@xxxxxxxxxx> writes:
| Does anyone know of a parser for Jim Breen's kanjidic written in Perl?
<URL: http://search.cpan.org> is a nice tool for finding all things perl. Maybe you can use the module Lingua::JP::Kanjidic by Simon Cozens?
Thanks for the tip. I had a look at it, and it seems to need some work. Doesn't parse all the fields in the dictionary yet, unfortunately. I'll try editing it up a bit.
sorry, but what are you trying to achieve? i mean, what advanced features do you need from a kanjidic parser?
to me it seems that 2-3 lines of perl (mostly splits) would completely parse the kanjidic database for you...
Thanks for your input.
very funny ;))
the problem is that i can read perl relatively well, but i'm not good enough to be able to write it (i use python to solve text processing problems)
but i know that perl can split a string on whitespace, and that should be enough for your needs (if you know perl).
in python, the code would look approximately like:
d = {}
for line in open('kanjidic.txt'):
    line = line.split()
    kanji = line[0]
    english = []
    readings = []
    for item in line[1:]:
        if not item[0].isalnum():
            if item[0] == '{':
                english.append( item[1:-1] )
            else:
                readings.append(item)
    d[kanji] = (readings,english)
i'm pretty sure that this can be translated line-by-line to perl...
gabor
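One catch with the split-on-whitespace idea is that English meanings in kanjidic are wrapped in braces and may contain spaces. The small Python sketch below illustrates the difference between splitting first and pulling out the brace groups first. The sample line is an abbreviated illustration in the kanjidic layout, not a verbatim copy from the file, and the variable names are made up for this example.

```python
import re

# Illustrative, abbreviated line in the kanjidic layout (an assumption,
# not copied verbatim from the dictionary file).
line = "亜 3021 U4e9c B1 G8 S7 Yya4 Wa ア つ.ぐ {Asia} {rank next}"

# Naive approach: split on whitespace, then look for brace tokens.
# The two-word meaning "{rank next}" ends up as two separate tokens.
naive = [tok for tok in line.split() if tok.startswith("{") or tok.endswith("}")]

# Alternative: extract the brace groups first, then split what is left.
meanings = re.findall(r"\{([^}]*)\}", line)
rest = re.sub(r"\{[^}]*\}", "", line).split()

print(naive)
print(meanings)
print(rest)
```

This is the same order of operations the Perl parser in the reply uses: remove the English entries, then split the remainder.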
This is the parser as it stands:
#!/usr/bin/perl
use strict; use warnings;
package kdic;
%kdic::codes =
(
'W', 'KOREAN',
'Y', 'PINYIN',
'{', 'ENGLISH_START',
'}', 'ENGLISH_END',
# Codes for kanji classification schemes.
'B', 'BUSHU',
'C', 'CLASSIC_RAD',
'U', 'UNICODE',
'G', 'GRADE',
'Q', 'FOUR_CORNER',
'S', 'STROKE_COUNT',
'P', 'SKIP',
# Codes for various books.
'N', 'NELSON',
'V', 'NEW_NELSON',
'L', 'HEISIG',
# The numbers used in P.G. O'Neill's "Japanese Names".
'O', 'ONEILL',
'K', 'GAKKEN',
'E', 'HENSHALL',
'I', 'SPAHN_HADAMITZKY',
# A second 'N' entry here ('N', 'SH_KANJI_KANA') would silently replace
# 'N' => 'NELSON' above, so it is dropped; the 'IN' code is listed
# further down.
# 'M', 'MOROHASHI',
'MP', 'MOROHASHI_PAGE',
'MN', 'MOROHASHI_INDEX',
'H', 'HALPERN',
'F', 'FREQUENCY',
'X', 'CROSS_REF',
# 'Z', 'MIS_REF',
# Book-specific numbers:
# The index numbers used in "Japanese For Busy People" vols I-III,
# published by the AJLT. The codes are the volume.chapter.
'DB', 'BUSY_PEOPLE',
# The index numbers used in "The Kanji Way to Japanese Language Power"
# by Dale Crowley.
'DC', 'KANJI_WAY',
# The index numbers used in the "Kodansha Compact Kanji Guide".
'DG', 'KODANSHA',
# The index numbers used in the 3rd edition of "A Guide To Reading and
# Writing Japanese" edited by Ken Henshall et al.
'DH', 'HENSHALL',
# The index numbers used in the "Kanji in Context" by Nishiguchi and Kono.
'DJ', 'KANJIINCONTEXT',
# The index numbers used by Jack Halpern in his Kanji Learners
# Dictionary, published by Kodansha in 1999. The numbers have been
# provided by Mr Halpern.
'DK', 'HALPERN',
# The index numbers used in P.G. O'Neill's Essential Kanji. The numbers
# have been provided by Glenn Rosenthal.
'DO', 'ONEILL',
# These are the codes developed by Father Joseph De Roo, and published
# in his book "2001 Kanji" (Bojinsha). Fr De Roo has given his
# permission for these codes to be included.
'DR', 'DEROO',
# The index numbers used in the early editions of "A Guide To Reading
# and Writing Japanese" edited by Florence Sakade.
'DS', 'SAKADE',
# The index numbers used in the Tuttle Kanji Cards, compiled by Alexander Kask.
'DT', 'KASK',
'XJ', 'CROSSREF',
'XO', 'CROSSREF',
'XH', 'CROSSREF',
'XI', 'CROSSREF',
'XN', 'NELSONCROSSREF',
'XDR', 'DEROOCROSSREF',
# 'IN' is the Spahn & Hadamitzky "Kanji & Kana" index; SKIP codes use 'P'.
'IN', 'SH_KANJI_KANA',
'T', 'SPECIAL',
'ZPP', 'MISCLASSIFICATIONpp',
'ZRP', 'MISCLASSIFICATIONrp',
'ZSP', 'MISCLASSIFICATIONsp',
'ZBP', 'MISCLASSIFICATIONbp',
'ENGLISH', 'ENGLISH',
);
# Parse one string from kanjidic and return it in an associative array.
#$| = 1;
sub kdic::parse_entry
{
    my $input = $_[0];
    # Remove the English entries first.
    my @english;
    my @japanese;
    my %values;
    while ($input =~ m/(\{[^\}]+\})/)
    {
        # print "$input, $1";
        push (@english, $1);
        $input =~ s/\{[^\}]+\}//;
    }
    my ($kanji, $jiscode, @entries) = split (" ", $input);
    # my $kanji = @entries[0];
    # my $jiscode = @entries[1];
    foreach my $entry (@entries)
    {
        my $found = 0;
        if ($entry =~ m/(^[A-Z]+)(.*)/ )
        {
            # A known code letter (or letters) followed by its value.
            if ($kdic::codes{$1})
            {
                $values{$1} = $2;
                $found = 1;
            }
        }
        elsif ($entry =~ m/([\x80-\xFF]+)/)
        {
            # A run of high-bit bytes is taken to be a Japanese reading.
            push (@japanese, $1);
            $found = 1;
        }
        if ($found == 0)
        {
            print "Mystery entry \"$entry\"\n";
        }
    }
    # if ($values{"O"})
    # {
    #     if ($values{"DO"})
    #     {
    #         if ($values{"O"} ne $values{"DO"})
    #         {
    #             print "Inconsistency: O value ", $values{"O"},
    #                 ", DO value ", $values{"DO"}, "\n";
    #         }
    #     }
    #     else
    #     {
    #         print "Only O value ", $values{'O'}, "\n";
    #     }
    # }
    # elsif ($values{"DO"})
    # {
    #     print "Only DO value ", $values{'DO'}, "\n";
    # }
    # print "$kanji, @english, @japanese \n";
    # Hand the parsed fields back to the caller as a hash.
    return (kanji => $kanji, jis => $jiscode, english => \@english,
            japanese => \@japanese, codes => \%values);
}
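my $kanjidic = "kanjidic";
my $KANJIDIC;
my $count = 0;
open ($KANJIDIC, "<", $kanjidic) || die "Cannot open $kanjidic: $!";
while (<$KANJIDIC>)
{
    # Uncomment the next line to stop after a few entries while testing.
    # $count++;
    if ($count > 4)
    {
        exit (0);
    }
    # print $_;
    # Skip comment lines.
    if ( m/^\#/ )
    {
        next;
    }
    &kdic::parse_entry ("$_");
}
close ($KANJIDIC);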
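
For readers following the Python side of this thread, here is a rough sketch of the same approach: strip the brace-delimited English first, split the rest on whitespace, then classify each token by its leading capital letters. It is not a translation of the Perl above. The `CODES` table is a deliberately shortened, hypothetical subset of the Perl hash, and `parse_entry` and the dictionary keys are names chosen for this illustration only.

```python
import re

# Hypothetical, shortened code table for illustration; the Perl hash
# in the post is the fuller list.
CODES = {
    "U": "UNICODE", "B": "BUSHU", "G": "GRADE", "S": "STROKE_COUNT",
    "Y": "PINYIN", "W": "KOREAN",
    "MN": "MOROHASHI_INDEX", "MP": "MOROHASHI_PAGE",
}

def parse_entry(line):
    """Parse one kanjidic-style line into a dict (a sketch under the
    assumptions stated above, not the author's implementation)."""
    # Pull out the brace-delimited meanings, then drop them from the line.
    english = re.findall(r"\{([^}]*)\}", line)
    fields = re.sub(r"\{[^}]*\}", "", line).split()
    kanji, jis, rest = fields[0], fields[1], fields[2:]
    entry = {"kanji": kanji, "jis": jis, "english": english,
             "readings": [], "codes": {}, "unknown": []}
    for tok in rest:
        m = re.match(r"([A-Z]+)(.*)", tok)
        if m and m.group(1) in CODES:
            # Known code prefix: file the remainder under its name.
            entry["codes"].setdefault(CODES[m.group(1)], []).append(m.group(2))
        elif any(ord(ch) > 127 for ch in tok):
            # Any non-ASCII character is taken to mean a reading.
            entry["readings"].append(tok)
        else:
            entry["unknown"].append(tok)
    return entry

# Illustrative, abbreviated line (not copied verbatim from the file).
sample = "亜 3021 U4e9c B1 G8 S7 Yya4 Wa ア つ.ぐ {Asia} {rank next}"
result = parse_entry(sample)
print(result)
```

Storing each code's values in a list is a design choice of this sketch: it avoids losing data if a code appears more than once on a line, whereas a plain assignment keeps only the last value.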