Re: weighting two values to get an overall result



On Nov 20, 10:43 am, "metaperl.com" <metap...@xxxxxxxxx> wrote:
Ok, I am using the Levenshtein edit distance on two fields, name and
address. I am comparing an input name and address against a reference
list of names and addresses to find the best match. Therefore for one
input name/address, I will produce x match vectors, where x is the
number of records in the reference list.

When there is _no_ difference between 2 names, the function returns 0.
When they are completely different, it returns 1. Likewise for two
addresses.

So, for a single match run, I have x vectors, consisting of (n,a),
where both n and a are on the interval [0,1] and I need to find the
best match - I want to write a function to assess the overall
"goodness" of the match as a function of the 2 individual metrics.

Name means much more than address in terms of a match - if the names
are similar and addresses don't match, then it has a high "goodness".
If the addresses are similar and the names don't match, it has low
"goodness"... the address simply adds confidence to the match, while
the name must have a fairly low difference.

Part of the reason for this is the address data is very poorly input
and sometimes is even missing... it is an unreliable source that I
only want to attribute marginal significance to.

So for a match vector (n,a):

(0,0) --- absolute best goodness
(0,1) ---- quite good
(1,0) ---- horrible
(1,1) ---- absolute worst goodness

I have specified the goodness for the 4 corners of the vector space, I
will now break down the unit space by quadrants.

If you horizontally and vertically bisect the square with coords
(0,0), (0,1), (1,0) and (1,1) then you create the 4 quadrants. The
lower left quadrant is highest goodness. Upper left is next, then
lower right, then upper right.

It may be best to simply run a series of if-thens and check which
quadrant the vector (n,a) lies in, but I was hoping a single function
of two variables could output a unique number to determine this.

Standard Euclidean distance in two variables:

d = sqrt(dx^2 + dy^2)

where by dx and dy I mean your distance measures
that vary from 0 to 1 in each dimension.

This is a value that varies from 0 to sqrt(2)
and gives equal weight to both dimensions. You
could divide by sqrt(2) to obtain a [0,1] measure.
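
As a rough Python sketch of that unweighted measure (the function
name is just mine for illustration, nothing standard):

```python
import math

def euclid_distance(dx, dy):
    """Unweighted distance for dx, dy in [0, 1], rescaled to [0, 1]."""
    # The raw Euclidean distance runs from 0 to sqrt(2), so divide
    # by sqrt(2) to bring it back onto [0, 1].
    return math.sqrt(dx ** 2 + dy ** 2) / math.sqrt(2)
```

Note that (1,0) and (0,1) come out identical here, which is exactly
the equal weighting described above.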

You want to apply weights. So just add unequal
multipliers:

d = sqrt( (a*dx)^2 + (b*dy)^2 )

This has a maximum value of sqrt(a^2 + b^2), so
let's normalize:

d = sqrt[ ( (a*dx)^2 + (b*dy)^2 ) / (a^2 + b^2) ]

This will work for any (positive) a and b. So
just choose any values that make sense to you.
Let's see what happens with a weight of 1
for dx and 2 for dy, so dy has twice the weight
of dx.

Consider dx = 0.5, dy = 0.3:

d = sqrt[ ( 0.5^2 + (2*0.3)^2 ) / 5 ]
= 0.349

But when dx = 0.3, dy = 0.5
d = sqrt[ (0.3^2 + (2*0.5)^2)/5 ]
= 0.467

So (0.5, 0.3) scores better (smaller d) than (0.3, 0.5), because
the error in the more heavily weighted dy is smaller.
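
For what it's worth, here is a minimal Python sketch of the weighted
version. The function name, the parameter names wx and wy (standing
in for a and b above) and the sample list of vectors are my own
assumptions for illustration. It recomputes the two values above and
then picks the candidate with the smallest d, since 0 means a perfect
match.

```python
import math

def weighted_distance(dx, dy, wx=1.0, wy=2.0):
    """Weighted distance for dx, dy in [0, 1], normalized to [0, 1].

    wx and wy play the role of a and b in the formula above; both
    must be positive.
    """
    return math.sqrt(((wx * dx) ** 2 + (wy * dy) ** 2) / (wx ** 2 + wy ** 2))

# The two worked examples, with weight 1 on dx and 2 on dy.
d1 = weighted_distance(0.5, 0.3)
d2 = weighted_distance(0.3, 0.5)
print(round(d1, 3), round(d2, 3))

# Hypothetical match vectors (dx, dy); the smallest d is the best match.
candidates = [(0.5, 0.3), (0.3, 0.5), (0.1, 0.9)]
best = min(candidates, key=lambda v: weighted_distance(*v))
print(best)
```

In your case you would presumably put the larger weight on the name
distance, since you say the name matters much more than the address.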

- Randy