Re: weighting two values to get an overall result
- From: Randy Poe <poespam-trap@xxxxxxxxx>
- Date: Tue, 20 Nov 2007 08:15:19 -0800 (PST)
On Nov 20, 10:43 am, "metaperl.com" <metap...@xxxxxxxxx> wrote:
Ok, I am using the Levenshtein edit distance on two fields, name and
address. I am comparing an input name and address against a reference
list of names and addresses to find the best match. Therefore for one
input name/address, I will produce x match vectors, where x is the
number of records in the reference list.
When there is _no_ difference between 2 names, the function returns 0.
When they are completely different, it returns 1. Likewise for two
addresses.
So, for a single match run, I have x vectors, consisting of (n,a),
where both n and a are on the interval [0,1] and I need to find the
best match - I want to write a function to assess the overall
"goodness" of the match as a function of the 2 individual metrics.
Name means much more than address in terms of a match - if the names
are similar and addresses don't match, then it has a high "goodness".
If the addresses are similar and the names don't match, it has low
"goodness"... the address simply adds confidence to the match, while
the name must have a fairly low difference.
Part of the reason for this is the address data is very poorly input
and sometimes is even missing... it is an unreliable source that I
only want to attribute marginal significance to.
So for a match vector (n,a):
(0,0) --- absolute best goodness
(0,1) ---- quite good
(1,0) ---- horrible
(1,1) ---- absolute worst goodness
I have specified the goodness for the 4 corners of the vector space, I
will now break down the unit space by quadrants.
If you horizontally and vertically bisect the square with coords
(0,0), (0,1), (1,0) and (1,1) then you create the 4 quadrants. The
lower left quadrant is highest goodness. Upper left is next, then
lower right, then upper right.
It may be best to simply run a series of if-thens and check which
quadrant the vector (n,a) lies in, but I was hoping a single function
of two variables could output a unique number to determine this.
Standard Euclidean distance in two variables:
d = sqrt(dx^2 + dy^2)
where by dx and dy I mean your distance measures
that vary from 0 to 1 in each dimension.
This is a value that varies from 0 to sqrt(2)
and gives equal weight to both dimensions. You
could divide by sqrt(2) to obtain a [0,1] measure.
You want to apply weights. So just add unequal
multipliers:
d = sqrt( (a*dx)^2 + (b*dy)^2 )
This has a maximum value of sqrt(a^2 + b^2), so
let's normalize:
d = sqrt[ ( (a*dx)^2 + (b*dy)^2 ) / (a^2 + b^2) ]
This will work for any (positive) a and b. So
just choose any values that make sense to you.
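As a minimal sketch, the normalized weighted distance could be written in Python as follows. The function name weighted_distance and its parameter names are illustrative only, and the sketch assumes dx and dy have already been scaled to [0,1].

```python
import math

def weighted_distance(dx, dy, a=1.0, b=1.0):
    """Normalized weighted Euclidean distance, in [0, 1].

    dx, dy: per-field distances, each assumed to lie in [0, 1].
    a, b:   positive weights for dx and dy (illustrative names).
    """
    if a <= 0 or b <= 0:
        raise ValueError("weights must be positive")
    # Dividing by a^2 + b^2 maps the worst case (dx = dy = 1) to exactly 1.
    return math.sqrt(((a * dx) ** 2 + (b * dy) ** 2) / (a ** 2 + b ** 2))
```

Only the ratio a/b matters: scaling both weights by the same factor leaves d unchanged.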
Let's see what happens with a weight of 1
for dx and 2 for dy, so dy has twice the weight
of dx.
Consider dx = 0.5, dy = 0.3:
d = sqrt[ ( 0.5^2 + (2*0.3)^2 ) / 5 ]
= 0.349
But when dx = 0.3, dy = 0.5
d = sqrt[ (0.3^2 + (2*0.5)^2)/5 ]
= 0.467
So (0.5, 0.3) is a noticeably better match (smaller d) than (0.3, 0.5),
because dy carries twice the weight of dx.
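Applied to the original question, where the name distance n should count for more than the address distance a, the same formula can rank candidate match vectors. The sketch below is only an illustration: the function name match_distance and the weights 3 and 1 are assumptions picked for the example, not values from the thread.

```python
import math

def match_distance(n, a, w_name=3.0, w_addr=1.0):
    """Weighted, normalized distance for a match vector (n, a).

    Smaller is better. The default weights are made-up example
    values that let the name dominate the address.
    """
    total = w_name ** 2 + w_addr ** 2
    return math.sqrt(((w_name * n) ** 2 + (w_addr * a) ** 2) / total)

# The four corners of the unit square from the question.
corners = [(1, 1), (1, 0), (0, 1), (0, 0)]

# Sort best match first (smallest distance).
ranked = sorted(corners, key=lambda v: match_distance(*v))
print(ranked)
```

With these weights the corners sort as (0,0), (0,1), (1,0), (1,1), which is the ordering requested in the question.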
- Randy