Re: kanjidic parser in Perl?



Gabor Farkas wrote:
Ben Bullock wrote:


"Gabor Farkas" <gabor@xxxxxxxxxxxxxx> wrote in message news:1f516$42ffa6b0$55d8898b$6438@xxxxxxxxxxxxxxxxxxxxxxxxxxx


Ben Bullock wrote:

David Alexander Ranvig wrote:

Ben Bullock <usenet@xxxxxxxxxx> writes:

| Does anyone know of a parser for Jim Breen's kanjidic written in
| Perl?

<URL: http://search.cpan.org> is a nice tool for finding all things
perl. Maybe you can use the module Lingua::JP::Kanjidic by Simon
Cozens?




Thanks for the tip. I had a look at it, and it seems to need some work. Doesn't parse all the fields in the dictionary yet, unfortunately. I'll try editing it up a bit.




sorry, but what are you trying to achieve?
i mean, what advanced features do you need from a kanjidic parser?

for me it seems that 2-3 lines of perl (mostly splits) would completely parse the kanjidic database for you...



Thanks for your input.


very funny ;))

the problem is that i can read perl relatively well, but i'm not good enough to be able to write it (i use python to solve text processing problems)

but i know that perl can split a string by whitespace, and that should be enough for your needs (if you know perl).

in python, the code would look approximately like:

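import re

d = {}

# kanjidic is distributed in EUC-JP
for line in open('kanjidic.txt', encoding='euc_jp'):
    if line.startswith('#'):
        continue  # skip the header comment
    # meanings sit inside {...} and may contain spaces, so pull them out first
    english = re.findall(r'\{([^}]*)\}', line)
    fields = re.sub(r'\{[^}]*\}', ' ', line).split()
    kanji = fields[0]
    readings = []
    for item in fields[1:]:
        # the JIS code and the info codes start with an ASCII letter or digit;
        # readings (kana) do not
        if not (item[0].isascii() and item[0].isalnum()):
            readings.append(item)
    d[kanji] = (readings, english)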


i'm pretty sure that this can be translated line-by-line to perl...

gabor

This is the parser as it stands:

#!/usr/bin/perl

use strict;
use warnings;

package kdic;

%kdic::codes =
(
 'W', 'KOREAN',
 'Y', 'PINYIN',
 '{', 'ENGLISH_START',
 '}', 'ENGLISH_END',

# Codes for kanji classification schemes.

 'B', 'BUSHU',
 'C', 'CLASSIC_RAD',
 'U', 'UNICODE',
 'G', 'GRADE',
 'Q', 'FOUR_CORNER',
 'S', 'STROKE_COUNT',
 'P', 'SKIP',

# Codes for various books.

 'N', 'NELSON',
 'V', 'NEW_NELSON',
 'L', 'HEISIG',

# The numbers used in P.G. O'Neill's "Japanese Names".

 'O', 'ONEILL',
 'K', 'GAKKEN',
 'E', 'HENSHALL',
 'I', 'SPAHN_HADAMITZKY',
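# ('IN', the Spahn & Hadamitzky "Kanji & Kana" index, is defined further down.
# Listing it here under 'N' would overwrite the NELSON entry above.)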

#    'M', 'MOROHASHI',

 'MP', 'MOROHASHI_PAGE',
 'MN', 'MOROHASHI_INDEX',
 'H', 'HALPERN',
 'F', 'FREQUENCY',

 'X', 'CROSS_REF',
# 'Z', 'MIS_REF',

# Book-specific numbers:

# the index numbers used in "Japanese For Busy People" vols I-III,
# published by the AJLT. The codes are the volume.chapter.

 'DB', 'BUSY_PEOPLE',

# the index numbers used in "The Kanji Way to Japanese Language Power"
# by Dale Crowley.

 'DC', 'KANJI_WAY',
# The index numbers used in the "Kodansha Compact Kanji Guide".

 'DG', 'KODANSHA',

# The index numbers used in the 3rd edition of "A Guide To Reading and
# Writing Japanese" edited by Ken Henshall et al.

 'DH', 'HENSHALL',

# The index numbers used in the "Kanji in Context" by Nishiguchi and Kono.

 'DJ', 'KANJIINCONTEXT',

# The index numbers used by Jack Halpern in his Kanji Learners
# Dictionary, published by Kodansha in 1999. The numbers have been
# provided by Mr Halpern.

 'DK', 'HALPERN',

# The index numbers used in P.G. O'Neill's "Essential Kanji". The
# numbers have been provided by Glenn Rosenthal.

 'DO', 'ONEILL',

# These are the codes developed by Father Joseph De Roo, and published
# in his book "2001 Kanji" (Bojinsha). Fr De Roo has given his
# permission for these codes to be included.

 'DR', 'DEROO',

# The index numbers used in the early editions of "A Guide To Reading
# and Writing Japanese" edited by Florence Sakade.

 'DS', 'SAKADE',

# The index numbers used in the Tuttle Kanji Cards, compiled by Alexander Kask.

 'DT', 'KASK',

 'XJ', 'CROSSREF',
 'XO', 'CROSSREF',
 'XH', 'CROSSREF',
 'XI', 'CROSSREF',
 'XN', 'NELSONCROSSREF',
 'XDR', 'DEROOCROSSREF',
 'IN', 'SH_KANJI_KANA',
 'T', 'SPECIAL',
 'ZPP', 'MISCLASSIFICATIONpp',
 'ZRP', 'MISCLASSIFICATIONrp',
 'ZSP', 'MISCLASSIFICATIONsp',
 'ZBP', 'MISCLASSIFICATIONbp',
 'ENGLISH', 'ENGLISH',
);

# Parse one string from kanjidic and return it in an associative array.

#$| = 1;

sub kdic::parse_entry
{
    my $input = $_[0];
# Remove the English entries first.
    my @english;
    my @japanese;
    my %values;
    while ($input =~ m/(\{[^\}]+\})/)
    {
#        print "$input, $1";
        push (@english, $1);
        $input =~ s/\{[^\}]+\}//;
    }

    (my $kanji, my $jiscode, my @entries) = split (" ", $input);

#    my $kanji = @entries[0];
#    my $jiscode = @entries[1];


    foreach my $entry (@entries)
    {
        my $found = 0;
        if ($entry =~ m/(^[A-Z]+)(.*)/ )
        {
            if ($kdic::codes{$1})
            {
                $values{$1} = $2;
                $found = 1;
            }
        }
        elsif ($entry =~ m/([\x80-\xFF]+)/)
        {
            push (@japanese, $1);
            $found = 1;
        }
        if ($found == 0)
        {
            print "Mystery entry \"$entry\"\n";
        }
    }

#    if ($values{"O"})
#    {
#        if ($values{"DO"})
#        {
#            if ($values{"O"} ne $values{"DO"})
#            {
#                print "Inconsistency: O value ", $values{"O"},
#                    ", DO value ", $values{"DO"}, "\n";
#            }
#        }
#        else
#        {
#            print "Only O value ", $values{'O'}, "\n";
#        }
#    }
#    elsif ($values{"DO"})
#    {
#        print "Only DO value ", $values{'DO'}, "\n";
#    }
#    print "$kanji, @english, @japanese \n";
}

my $kanjidic = "kanjidic";
my $KANJIDIC;
my $count = 0;

open ($KANJIDIC, "<", $kanjidic) || die "Cannot open $kanjidic: $!";

while (<$KANJIDIC>)
{
#    $count++;
    if ($count > 4)
    {
        exit (0);
    }
#    print $_;
    if ( m/^\#/ )
    {
        next;
    }
    &kdic::parse_entry ("$_");
}
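
For comparison with the Python outline earlier in the thread, the table-driven approach of the script above might be sketched roughly as follows. This is only a sketch under assumptions: CODES is a small hypothetical subset of the code table, SAMPLE is a shortened entry written for illustration rather than a line copied from kanjidic, and the function returns a dict instead of printing. Values are kept in lists because a code such as W or Y may appear more than once in an entry.

```python
import re

# Hypothetical subset of the code table; the labels mirror the Perl script.
CODES = {
    "U": "UNICODE", "B": "BUSHU", "G": "GRADE", "S": "STROKE_COUNT",
    "N": "NELSON", "P": "SKIP", "W": "KOREAN", "Y": "PINYIN",
}

# Shortened sample entry in the kanjidic layout, for illustration only.
SAMPLE = "亜 3021 U4e9c B1 G8 S7 N43 P4-7-1 Wa Ya4 ア つ.ぐ {Asia} {rank next}"


def parse_entry(line):
    """Split one kanjidic line into its parts and return them as a dict."""
    # Meanings sit inside {...} and may contain spaces, so take them out first.
    english = re.findall(r"\{([^}]*)\}", line)
    rest = re.sub(r"\{[^}]*\}", " ", line)
    kanji, jiscode, *fields = rest.split()

    values, readings, unknown = {}, [], []
    for field in fields:
        match = re.match(r"([A-Z]+)(.*)", field)
        if match is None:
            # No leading capital letter: treat the field as a reading.
            readings.append(field)
        elif match.group(1) in CODES:
            # Known code: file the remainder of the field under its label.
            values.setdefault(CODES[match.group(1)], []).append(match.group(2))
        else:
            # Unknown code: what the Perl script reports as a "mystery entry".
            unknown.append(field)
    return {"kanji": kanji, "jis": jiscode, "values": values,
            "readings": readings, "english": english, "unknown": unknown}


entry = parse_entry(SAMPLE)
print(entry["kanji"], entry["readings"], entry["english"])
```

Pulling the {...} meanings out before splitting on whitespace is the step that keeps multi-word meanings such as "rank next" in one piece.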