Re: kanjidic parser in Perl?




"John J. Chew, III" <jjchew@xxxxxxxxxxxxxxxx> wrote in message news:eYidnUIo__qk-pzeRVn-qA@xxxxxxxxxxxxx
Ben Bullock wrote:

package kdic;

Once you say this you can lose the subsequent "kdic::"s.

Well, I tried it, but I got lots of error messages.

%kdic::codes =

my (%codes) =

'W', 'KOREAN',

Better style to write:

'W' => 'KOREAN',

That's a good tip. I didn't know about that, but it seems like a good way to avoid errors as well.
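As a minimal sketch of the `=>` style: the "fat comma" acts like an ordinary comma, so the hash gets the same contents, but each key/value pair is visually grouped. The second entry ('Y') is a made-up example added only to show a second pair; it is not taken from the posted code.

```perl
use strict;
use warnings;

# Each line is one key/value pair; => is just a comma that reads better.
my %codes = (
    'W' => 'KOREAN',
    'Y' => 'PINYIN',    # illustrative entry, not from the original post
);

# With one pair per line, a missing value is much easier to spot.
my $pairs = scalar keys %codes;
print "$pairs codes defined\n";
```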


sub kdic::parse_entry

sub parse_entry ($) {

Always use prototypes.

I have no idea how to do so.

I tried a web search and found a page about Perl prototypes but it was totally incomprehensible to me.
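For reference, a small sketch of what the `($)` prototype does, assuming a hypothetical sub named `show_arg` (not part of the posted code). The prototype tells Perl to evaluate the single argument in scalar context, and it only takes effect for calls compiled after the declaration. Calling with a leading `&`, as the posted script does, bypasses the prototype.

```perl
use strict;
use warnings;

# ($) means: exactly one argument, evaluated in scalar context.
sub show_arg ($) {
    my ($input) = @_;
    return $input;
}

my @fields = ('a', 'b', 'c');

# The prototype forces scalar context, so the sub sees the count, 3.
my $with_proto = show_arg(@fields);

# A leading & ignores the prototype, so the sub sees the list ('a', 'b', 'c').
my $bypassed = &show_arg(@fields);

print "$with_proto $bypassed\n";
```

Whether prototypes are worth using everywhere is a matter of taste; the sketch only shows the mechanism.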

    while ($input =~ m/(\{[^\}]+\})/)
    {
#        print "$input, $1";
        push (@english, $1);
        $input =~ s/\{[^\}]+\}//;
    }

More simply:

  push(@english, $1)
    while $input =~ s/({.*?})//;

Sorry, I don't know this regular expression syntax with the question mark; why does this not match all of


{abc}{def}

as one string? Also, don't I need to escape { and }? Obviously the original is meant to get {abc} and {def} as two strings. Also, your regular expression contains { and }, the quoting characters for the English "meanings" in kanjidic, but mine doesn't, so the expressions aren't equivalent.
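A short sketch of the difference, using a made-up sample line. A plain `.*` is greedy and takes as much as it can, while `.*?` is non-greedy and stops at the first point where the rest of the pattern can match, so it ends at the first closing brace. The braces are escaped here as the safer choice; as far as I know, some newer Perl versions warn about or reject an unescaped `{` in certain positions.

```perl
use strict;
use warnings;

my $input = 'T1 {abc}{def}';    # made-up sample line

# Greedy: .* runs to the last closing brace, spanning both groups.
my ($greedy) = $input =~ /(\{.*\})/;

# Non-greedy: .*? stops at the first closing brace.
my ($lazy) = $input =~ /(\{.*?\})/;

# The suggested loop: remove one {...} group per pass and keep it.
my @english;
push(@english, $1) while $input =~ s/(\{.*?\})//;

print "greedy=$greedy lazy=$lazy loop=@english\n";
```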

(my $kanji, my $jiscode, my @entries) = split (" ", $input);

More simply:

  my ($kanji, $jiscode, @entries) = split(' ', $input);
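A minimal sketch of that list assignment, assuming a made-up ASCII line in place of a real kanjidic record. With the special pattern `' '`, `split` skips leading whitespace and splits on runs of whitespace; the first two fields go to the scalars and the array takes the rest.

```perl
use strict;
use warnings;

# Made-up ASCII stand-in for a kanjidic line (real lines start with a kanji).
my $input = "  K 3021 B1 G8 S7\n";

# One "my" covers the whole list; @entries soaks up the remaining fields.
my ($kanji, $jiscode, @entries) = split(' ', $input);

print "kanji=$kanji jis=$jiscode rest=@entries\n";
```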

    foreach my $entry (@entries)
    {
        my $found = 0;
        if ($entry =~ m/(^[A-Z]+)(.*)/ )
        {
            if ($kdic::codes{$1})
            {
                $values{$1} = $2;
                $found = 1;
            }
        }
        elsif ($entry =~ m/([\x80-\xFF]+)/)
        {
            push (@japanese, $1);
            $found = 1;
        }
        if ($found == 0)
        {
            print "Mystery entry \"$entry\"\n";
        }
    }

More simply:

  for my $entry (@entries) {
    if ($entry =~ /^([A-Z]+)(.*)/ && exists $codes{$1})
      { $values{$1} = $2; }
    elsif ($entry =~ /([\x80-\xFF]+)/)
      { push(@japanese, $1); }
    else
      { print "Mystery entry \"$entry\"\n"; }
    }

It is simpler, but the original code was edited down from something more complex, hence the $found.


though I suspect the elsif clause should read

    elsif ($entry =~ /^[\x80-\xFF]+$/)
      { push(@japanese, $entry); }

Maybe. I'm not sure whether there are any strings in kanjidic that mix kana with other things. Actually, I'd probably raise an error at that point just to see what they were.
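One way to find out, sketched with made-up byte strings rather than real kanjidic data: keep the anchored test for entries made up entirely of high-bit bytes, and collect anything that contains high-bit bytes alongside other characters so it can be inspected.

```perl
use strict;
use warnings;

# Made-up byte strings standing in for kanjidic fields.
my @entries = ("\xB0\xA2", "ab\xB0\xA2", "\xA5\xA2");

my (@japanese, @mixed);
for my $entry (@entries) {
    if ($entry =~ /^[\x80-\xFF]+$/) {
        push @japanese, $entry;    # high-bit bytes only
    }
    elsif ($entry =~ /[\x80-\xFF]/) {
        push @mixed, $entry;       # high-bit bytes plus something else
    }
}

printf "%d clean, %d mixed\n", scalar @japanese, scalar @mixed;
```

Replacing the second `push` with a `die` would stop at the first mixed entry instead of collecting them.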


my $kanjidic = "kanjidic";

Normally, before this you would put a

  package main;

to switch back to the default namespace.

Thanks for the tip.
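A minimal sketch of how the two `package` statements interact, using a hypothetical sub named `describe` that is not in the posted script. Names declared after `package kdic;` belong to that package, and after `package main;` they need the `kdic::` qualifier again. Note that `our` creates a package variable, whereas `my` creates a lexical that does not belong to any package.

```perl
use strict;
use warnings;

package kdic;

# "our" declares a package variable: its full name is %kdic::codes.
our %codes = ('W' => 'KOREAN');

# Defined inside package kdic, so its full name is kdic::describe.
sub describe {
    my ($code) = @_;
    return exists $codes{$code} ? $codes{$code} : 'unknown';
}

package main;

# Back in main, names from kdic must be qualified.
my $name  = kdic::describe('W');
my $other = kdic::describe('Q');
my $count = scalar keys %kdic::codes;

print "$name $other $count\n";
```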

my $KANJIDIC;
my $count = 0;

open ($KANJIDIC, $kanjidic) || die;

You can just say

  open (my $KANJIDIC, "<$kanjidic") or die;

in modern versions of Perl.
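A commonly recommended variant is the three-argument form of `open`, which keeps the mode separate from the file name, together with `$!` in the error message. The sketch below writes a small temporary stand-in file (made-up contents) so that it runs without the real kanjidic, then reads it back while skipping comment lines.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny stand-in file so the sketch runs without the real kanjidic.
my ($tmp, $kanjidic) = tempfile(UNLINK => 1);
print {$tmp} "# comment line\nA 3021 {first}\nB 3022 {second}\n";
close $tmp or die "Cannot close $kanjidic: $!";

# Three-argument open with a lexical filehandle; $! says why it failed.
open(my $KANJIDIC, '<', $kanjidic) or die "Cannot open $kanjidic: $!";

my $count = 0;
while (my $line = <$KANJIDIC>) {
    next if $line =~ /^#/;    # skip comment lines
    $count++;
}
close $KANJIDIC;

print "$count entries\n";
```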

    if ( m/^\#/ )
    {
        next;
    }
    &kdic::parse_entry ("$_");

Or just

kdic::parse_entry $_ unless /^#/;

If you think about it, that's an extremely bad idea.

Thanks for the mental exercise,

Since you like mental exercise so much, I'll leave it to you to figure out why I don't agree with your last comment.


