## Thursday, October 18, 2012

### Proposed additional matching criterion for Strava segments

Strava is brilliant at quickly matching typical point-to-point segments. The speed with which it accomplishes this is simply incredible, with long rides in the segment-saturated San Francisco Bay area getting checked in seconds. It's truly amazing.

Part of their approach to this speed is keeping the segment matching as simple as possible. You need to hit the start and end points, but for points in between, there's some wiggle room.

This is important because GPS errors mean that a perfect match of a segment is impossible. Additionally, riders may make minor deviations from an optimal route. On wide roads, there's of course different lane choices. But consider a climbing segment where a rider turned back briefly during the climb to fetch a dropped water bottle, or took a brief wrong turn before retracing steps and continuing on the specified route. Most people would agree these trajectories should match the segment. The result of this is the first and final points of the segment are granted special status: they must be closely matched, but with intermediate points the match criterion is looser, and not all points in the reference data need to be closely passed.

This usually works find. For example, many climbs are up roads where they represent the quickest, shortest route between the two points. If someone wants to climb, say, 80% of the climb on the road then bush-wack the rest, that's their problem. They're only slowing themselves.

But where it fails miserably is on out-and-back segments. For example, a segment I designed and a favorite part of many of my runs is the Gashouse Cove Pier out-and-back. I run out the pier, around the small structure at the end, then return.

But the problem with such segments is Strava fails to recognize the importance of the turn-around point. To Strava, it's just another point, one which can remain unmatched if a sufficient fraction of the other points are matched. But using that 80% number, it's clear running 80% of the way out an then back will, in the case of minimal GPS error, match 80% of the segment points. If Strava's segment is no more than 80%, credit is given for the segment.

Indeed, that's what happened with the Gashouse Cove Pier segment. Here's the present KOM.

I propose a solution. The key is to introduce a "special point" which must be matched on an equal basis with the start and finish points. The question is how should this point be chosen? Manually by the user at the time of segment creation? That is a possibility, but there's already a LOT of segments out there which could use this treatment, and I doubt Strava wants to add complexity to the segment-creation process. For example users might be tempted to add too many "special points", yielding a large number of false negatives.

In cases like this it's a good policy to not try for perfection. Instead recognize that cases like the Gashouse Cove Pier segment tend mostly to consist of point A, a route to point B maximally distant from point A, then a return path back to point A which is maximally distant from point B. You could have a spiral path to a turn-around then spiral out, but this would be an exceptional case.

So this suggests a simple approach: take the two points which are furthest apart in the defining segment, and declare these as "special points". So then not only must one go from point A to point B, one must also visit the two points which are furthest apart. In our simple out-and-back case, these will likely be the start/finish and the turn-around.

The primary concern with this is if the reference data are of poor quality (but then poor quality data shouldn't be used to define segments). If there is a lot of GPS noise in intermediate points, it may be that a rogue point is more likely to be one of these implicit special points. It would be good to check if the extreme point appears to be a rogue point, for example if it is separated from its neighbors by an extreme relative speed. But this is in the details.

A nice thing about this approach is except in pathological cases it is likely to be purely restrictive: data which would have matched a segment would no longer do so when applying these criteria. Thus the test can be applied after the normal matching algorithm has accepted a match. This makes it relatively efficient: the check would only need to be done after the vast majority of segment candidates in the region had already been eliminated.

Finding the points of maximal separation is non-trivial computationally, but it would need be done only once per segment, then cached. From then on these points could be referenced directly, at least until the segment was possibly edited.

Anyway, easy to propose algorithms like this, but more impressive if I could test them. Unfortunately time for that sort of thing is perpetually limited. And it does nothing for the multi-lap problem, which could nonetheless be solved using an extension of the idea.