To count phones in a custom polygon we need to work out the percentage
of overlap with each known area. This means we need to get each known
area from the database to compare it.
At the moment we do this by running:
- one SQLite query to get the details of all matching areas
- a loop, which performs one SQLite query *per area* to get the polygons
This commit reduces the number of SQLite queries to one, which uses a
`JOIN` to get both the details of the areas and their polygons.
This gives a speed increase of about 25% for a big area like
Lincolnshire.
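A minimal sketch of the single-query approach, using an in-memory
database. The table names, column names and rows here are made up for
illustration and are not the real schema.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE broadcast_areas (
        id TEXT PRIMARY KEY, name TEXT, count_of_phones INTEGER
    );
    CREATE TABLE broadcast_area_polygons (id TEXT PRIMARY KEY, polygons TEXT);
    INSERT INTO broadcast_areas VALUES
        ('area-1', 'Area one', 1000),
        ('area-2', 'Area two', 2000);
    INSERT INTO broadcast_area_polygons VALUES
        ('area-1', '[[[0, 0], [0, 1], [1, 1]]]'),
        ('area-2', '[[[2, 2], [2, 3], [3, 3]]]');
""")


def get_areas_with_polygons(conn, area_ids):
    # One query returns the details and the polygons together, instead of
    # one query for the details followed by one polygon query per area.
    placeholders = ", ".join("?" for _ in area_ids)
    return conn.execute(
        f"""
        SELECT broadcast_areas.id, name, count_of_phones, polygons
        FROM broadcast_areas
        JOIN broadcast_area_polygons
            ON broadcast_area_polygons.id = broadcast_areas.id
        WHERE broadcast_areas.id IN ({placeholders})
        ORDER BY broadcast_areas.id
        """,
        list(area_ids),
    ).fetchall()


rows = get_areas_with_polygons(conn, ["area-1", "area-2"])
```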
By using the simplified polygons instead of the full-resolution ones
we:
- query less data from SQLite
- pass less data around
- give Shapely a less complicated shape to do its calculations on
This makes it faster to calculate how much of each electoral ward a
custom area overlaps.
For the two areas in our tests:
Place represented by custom area | Before | After
---------------------------------|--------|--------
Bristol | 0.07s | 0.02s
Skye | 0.02s | 0.01s
Relates to: https://github.com/alphagov/notifications-govuk-alerts/pull/152
I ran the "create-broadcast-areas-db.py" script to regenerate the
SQLite DB. Existing alerts with the old naming still appear correctly,
and since we don't (yet) store this text in the DB, there's nothing
more to update.
Our current assumption is that the bleed area has the same population
density as the broadcast area.
This is particularly naïve when:
- the bleed area overlaps the sea – no-one lives in the sea
- the broadcast area is a village and the bleed area is the surrounding
countryside
- the broadcast area is adjacent to a densely populated area like a city
We can be smarter about this now that we have a way of determining the
number of phones in an arbitrary area, based on the known areas that we
have population data about.
Calculating the population in an overlap is a slightly more intensive
calculation. So we only do it for areas which are small enough that
it doesn’t slow things down too much. For larger areas we still use the
more naïve algorithm.
This is the only way I can think to stop this shape self-intersecting
without drastically changing its area (i.e. filling the hole in the
donut).
This is the only area in our library which is a genuine donut and
presents this problem.
Some of the polygons in our source data are invalid. An invalid polygon
is one that self-intersects, in other words has a point which causes
the boundary of the shape to cross itself.
This doesn’t cause an exception until we try to perform certain
operations on one of these polygons, like intersecting them with another
polygon. This is why we haven’t spotted that they are invalid until now.
This commit adds checks so that as we import the polygons we make sure
they are valid.
If they are not valid, we can automatically fix them by just looking at
the exterior boundary of the shape, and ignoring any holes created by
self-intersection.
Previously this was hidden away in an anonymous __init__.py file.
I did think about splitting the models into individual files, like
we do with the top-level models for the app. Since the models are
only imported in one place - i.e. are all used together - it didn't
seem worth the hassle, so I've kept them in one file.
The Python rtree library we are using to build RTrees has a dependency
on the C package libspatialindex. This package is not installed on PaaS,
so it’s hard for us to use it.
This commit changes the code to use a library called rtreelib instead.
rtreelib doesn’t have a built in way to serialise the index it builds,
so I’ve had to implement that using pickle.
We want to know how many phones are in a user-supplied polygon, so we
can show the impact of a broadcast, in the same way that we do when
users pick areas from our library.
We already know how many phones are in each electoral ward. But there
are challenges with an arbitrary polygon:
- where it does overlap a ward, the overlap could be partial
- it could overlap more than one ward
- finding out which wards it overlaps by brute force (looping through
all the wards and seeing which ones intersect with our polygon) would
be way too slow to do in real time
Instead we can use a data structure called an R-tree[1] to build an
index which provides a much, much faster way of looking up which
polygons overlap another. We can build this tree in advance and save it
somewhere, which means there’s a lot of computation we don’t need to do
in real time.
The R-tree returns a set of objects (ward IDs) which we can go and look
up in our library of electoral wards. These wards will be the ones that
might have some overlap with our custom polygon.
Once we have this small set of wards which might overlap our polygon, we
can look at the size of the area of overlap (relative to the size of the
whole ward) and multiply that by the known count of phones in that ward
to get an approximation of the count of phones in the overlap area.
Summing these approximations gives an estimate for the whole area of the
custom polygon.
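
The overlap arithmetic can be sketched without any geometry library if
we pretend every shape is an axis-aligned rectangle. That is a big
simplification (real wards are arbitrary polygons), and the ward IDs,
coordinates and phone counts below are made up. The list of candidate
wards stands in for whatever the R-tree lookup returns.

```python
# Each ward is (min_x, min_y, max_x, max_y, count_of_phones).
# Rectangles and counts are invented for illustration.
WARDS = {
    "ward-a": (0, 0, 10, 10, 1000),
    "ward-b": (10, 0, 20, 10, 2000),
    "ward-c": (50, 50, 60, 60, 3000),
}


def area(rect):
    min_x, min_y, max_x, max_y = rect
    # A rectangle with inverted bounds means "no overlap", so its area is 0.
    return max(0, max_x - min_x) * max(0, max_y - min_y)


def intersection(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def estimate_phones(custom_area, candidate_ward_ids):
    total = 0
    for ward_id in candidate_ward_ids:
        *ward_rect, phones_in_ward = WARDS[ward_id]
        # Fraction of the ward covered, multiplied by the ward's known count.
        overlap_fraction = area(intersection(custom_area, ward_rect)) / area(ward_rect)
        total += overlap_fraction * phones_in_ward
    return total


# Covers half of ward-a and half of ward-b, and none of ward-c.
estimate = estimate_phones((5, 0, 15, 10), ["ward-a", "ward-b", "ward-c"])
```

A candidate which turns out not to overlap at all (like `ward-c` here)
contributes nothing, so false positives from the index only cost time.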
1. https://en.wikipedia.org/wiki/R-tree
This allows MNOs to test delivery to multiple non-adjacent cells without
risk of sending a broadcast on the public network. This will also support
testing of multiple polygon geometries in a single message.
Test polygons are all non-UK (northern Finland).
Signed-off-by: Richard Baker <richard.baker@digital.cabinet-office.gov.uk>
This commit makes an abstract base class for broadcast areas, so that
methods and properties which are common between `BroadcastArea`s (those
which come from our library) and `CustomBroadcastArea`s (those supplied
via the API) can be shared.
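
Roughly the shape of the idea, as a sketch. `BaseBroadcastArea` and
`count_of_polygons` are names I’ve made up for illustration; only
`BroadcastArea` and `CustomBroadcastArea` are named above, and their
constructors here are simplified stand-ins.

```python
from abc import ABC, abstractmethod


class BaseBroadcastArea(ABC):
    """Hypothetical base class; the shared property is illustrative."""

    @property
    @abstractmethod
    def polygons(self):
        """Each kind of area supplies its own polygons."""

    @property
    def count_of_polygons(self):
        # Shared behaviour is written once, instead of in both subclasses.
        return len(self.polygons)


class BroadcastArea(BaseBroadcastArea):
    def __init__(self, polygons_from_library):
        self._polygons = polygons_from_library

    @property
    def polygons(self):
        return self._polygons


class CustomBroadcastArea(BaseBroadcastArea):
    def __init__(self, polygons_from_api):
        self._polygons = polygons_from_api

    @property
    def polygons(self):
        return self._polygons
```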
If an area has a `count_of_phones` value of `0` it means we don’t have
data about the population.
This means we can’t do the maths to work out the estimated bleed. So we
should return the default amount of bleed of 1,500m instead, which is
something in between what we’d expect for a built up area and a rural
area.
This prevents us from giving unrealistically large or small bleed
estimates in case we have areas which are more dense or less dense than
the most/least dense areas we currently have.
Also means we don’t have to treat City of London as a special case.
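
A sketch of how a clamp like this could sit alongside the 1,500m default
for areas with no data. The 500m and 5km limits are an assumption taken
from the figures quoted elsewhere in this log, and so is measuring
density in phones per square kilometre. The function name and arguments
are hypothetical.

```python
from math import log10

DEFAULT_BLEED_IN_METRES = 1_500
# Assumed limits; the real bounds may be expressed differently.
MIN_BLEED_IN_METRES = 500
MAX_BLEED_IN_METRES = 5_000


def estimated_bleed_in_metres(count_of_phones, area_in_square_km):
    if not count_of_phones:
        # No population data, so we can't calculate a density at all.
        return DEFAULT_BLEED_IN_METRES
    density = count_of_phones / area_in_square_km
    unclamped = 5_900 - (log10(density) * 1_250)
    # Keep the estimate inside the range we have evidence for.
    return max(MIN_BLEED_IN_METRES, min(MAX_BLEED_IN_METRES, unclamped))
```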
There are basically two kinds of 4G masts:
Frequency | Range | Bandwidth
----------|-------------|----------------------------------
800MHz | Long (5km) | Low (can handle a bit of traffic)
1800MHz | Short (500m) | High (can handle lots of traffic)
The 1800MHz masts are better in terms of how much traffic they can
handle and how fast a connection they provide. But because they have
quite short range, it’s only economical to install them in very built up
areas†.
In more rural areas the 800MHz masts are better because they cover a
wider area, and have enough bandwidth for the lower population density.
The net effect of this is that cell broadcasts in rural areas are likely
to bleed further, because the masts they are being broadcast from are
less precise.
We can use population density as a proxy for how likely an area is to be
covered by 1800MHz masts, and therefore how much bleed we should expect.
So this commit varies the amount of bleed shown based on the population
density.
I came up with the formula based on 3 fixed points:
- The most remote areas (for example the Scottish Highlands) should have
the highest average bleed, estimated at 5km
- A town, like Crewe, should have about the same bleed as we were
estimating before (1.5km) – Pete D thinks this is about right based on
his knowledge of the area around his office in Crewe
- The most built up areas, like London boroughs, could have as little as
500m of bleed
Based on these three figures I came up with the following formula, which
roughly gives the right bleed distance (`b`) for each of their population
densities (`d`):
```
b = 5900 - (log10(d) × 1_250)
```
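
The same formula as a Python sketch, with its inverse so you can see
which density each of the three fixed points corresponds to. The
function names are made up, and the units of `d` are whatever the
formula was fitted against.

```python
from math import log10


def bleed_in_metres(density):
    return 5_900 - (log10(density) * 1_250)


def density_for_bleed(bleed):
    # Rearranging b = 5900 - 1250 * log10(d) gives d = 10 ** ((5900 - b) / 1250)
    return 10 ** ((5_900 - bleed) / 1_250)


for fixed_point in (5_000, 1_500, 500):
    print(f"{fixed_point}m of bleed at a density of about {density_for_bleed(fixed_point):,.0f}")
```

Because the formula is logarithmic, each tenfold increase in density
takes 1,250m off the estimated bleed.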
Plotted on a curve it looks like this:
This is based on averages – remember that the UI shows where is _likely_
to receive the alert, based on bleed, not where it’s _possible_ to
receive the alert.
Here’s what it looks like on the map:
---
†There are some additional subtleties which make this not strictly true:
- The 800MHz masts are also used in built up areas to fill in the gaps
between the areas covered by the 1800MHz masts
- Switching between masts is inefficient, so if you’re moving fast
through a built up area (for example on a train) your phone will only
use the 800MHz masts so that you have to hand off from one mast to
another less often
I emailed the Geography team at the ONS:
> Hi geography team,
>
> I work on GOV.UK Notify, which is a service run by Government Digital Service (part of the Cabinet Office). I was given your email address by [redacted] who’s been helping answer some of my questions on the cross-government Slack.
>
> We’re using some of the boundary datasets from the Open Geography Portal, and mostly they’ve been excellent.
>
> In the abstract, the problem we’re trying to solve is, given a point outside an area, what is the minimum distance to a point within that area. So, for example, if a crow was somewhere in Cardiff, what’s the shortest distance it would have to fly to reach somewhere in the Bristol local authority district?
>
> We’ve noticed some problems with the data that means our calculations would be wrong. We’ve noticed this around Torquay, Norwich and Bristol. Here are some screenshots of Bristol, from the generalised and full resolution boundaries:
>
> The artefacts I’ve highlighted are closer to Cardiff than any actual part of the land area of Bristol. They are either:
> - in the sea
> - land that’s part of North Somerset
>
> I suspect that this is being caused by the process of clipping the actual region of Bristol (which, unusually, extends into the water) to the mean high water line.
>
> I’ve worked around this by filtering out any polygons that are smaller than ~7,500m². It’s a bit hacky because parts of the Scilly Isles start disappearing. That’s not a problem for what I’m working on, but it would be nice to not need the hack.
>
> So my questions would be:
>
> - Is there a better way to remove these artefacts than filtering by area?
> - Is there a plan to remove these artefacts from the data in future releases?
>
> Thanks in advance,
> Chris
They emailed back to say:
> Hi Chris
>
> Thank you for your enquiry.
>
> We have completed the amendments to the LAD MAY 2020 BFC and BGC boundaries as mentioned so you should be able to download them from the portal now.
>
> Hope this helps.
>
> Kind regards
> [redacted]
This commit brings in the files they’ve updated. We still have to do
some filtering (but now at a higher resolution) because they haven’t
fixed Norwich yet. I’ll email them separately about that.
If you’re adding another area to your broadcast it’s likely to be close
to one of the areas you’ve already added.
But we make you start by choosing a library, then you have to find the
local authority again from the long list. This is clunky, and it
interrupts the task the user is trying to complete.
We thought about redirecting you somewhere deep into the hierarchy,
perhaps by sending you to either:
- the parent of the last area you’d chosen
- the common ancestor of all the areas you’d chosen
This approach would however mean you’d need a way to navigate back up
the hierarchy if we’d dropped you in the wrong place. And we don’t have
a pattern for that at the moment.
So instead this commit adds some ‘shortcuts’ to the choose library page,
giving you a choice of all the parents of the areas you’ve currently
selected. In most cases this will be one (unitary authority) or two
(county and district) choices, but it will scale to adding areas from
multiple different authorities.
It does mean an extra click compared to the redirect approach, but this
is still fewer, easier clicks compared to now.
This meant a couple of under-the-hood changes:
- making `BroadcastArea`s hashable so it’s possible to do
`set([BroadcastArea(…), BroadcastArea(…), BroadcastArea(…)])`
- making `BroadcastArea`s aware of which library they live in, so we can
link to the correct _Choose area_ page
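
A sketch of what hashability needs, assuming two areas count as the same
area when they share an ID. The constructor arguments and the example
IDs are hypothetical.

```python
class BroadcastArea:
    def __init__(self, area_id, name, library_id):
        self.id = area_id
        self.name = name
        # Knowing the library lets us build a link back to the right page.
        self.library_id = library_id

    def __eq__(self, other):
        return isinstance(other, BroadcastArea) and self.id == other.id

    def __hash__(self):
        # Must agree with __eq__: equal areas need equal hashes.
        return hash(self.id)


parents = set([
    BroadcastArea("county-1", "County one", "library-1"),
    BroadcastArea("district-1", "District one", "library-1"),
    BroadcastArea("county-1", "County one", "library-1"),
])
```

Putting the parents in a `set` is what removes the duplicates when
several selected areas share the same parent.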
At the moment there are some areas which have:
- a `count_of_phones` value of `None`
- no sub-areas
This is wrong, but until we fix the data the phone counting code needs
to handle this.
This commit:
- adds the `or 0` in the right place (where it will catch these areas
with missing data)
- adds a test which checks these areas, and compares them to other kinds
of areas
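
A sketch of where the `or 0` needs to go, using plain dictionaries as a
stand-in for the real area objects (the dictionary keys are made up):

```python
def count_of_phones(area):
    sub_areas = area.get("sub_areas", [])
    if sub_areas:
        # Areas with sub-areas get their count by summing their children.
        return sum(count_of_phones(sub_area) for sub_area in sub_areas)
    # Leaf areas with missing data have None here, which can't be summed.
    return area["count_of_phones"] or 0


area_with_missing_data = {"count_of_phones": None, "sub_areas": []}
parent = {
    "count_of_phones": None,
    "sub_areas": [area_with_missing_data, {"count_of_phones": 100}],
}
```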
This is a better name for the module because it’s:
- not just constants, there’s a method in here now
- only stuff to do with populations, not other kinds of constants
We need to give people a better feel for the consequences of
broadcasting an alert. We’ve seen in research that some users will
assume it is subscription based, or opt-in, rather than going to every
phone in the area.
I reckon that the most effective way to communicate this is to put some
numbers next to the areas, to give people an idea of how many people
will get alerted.
We can estimate how many phones are in an area by:
- taking the population of all electoral wards in that area
- multiplying it by the percentage of people who own an internet
connected phone[1]
The Office for National Statistics publish both these datasets.
The number of people who own an internet-connected phone varies a lot by
age. Since the population data for each ward is broken down by age we
can factor this in. Simplified, the calculation looks like this:
- take the _Abbey_ ward of _Barking and Dagenham_
- in this ward there are 26 people aged 80
- 40% of people over 65 have an internet-connected phone
- therefore 10 of these 80-year-olds would be likely to receive a
broadcast
- (repeat for all other ages)
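
The worked example above as a sketch. Only the 40% rate for over-65s
comes from this message, so the function deliberately refuses to guess
rates for other ages. The names are hypothetical.

```python
OVER_65_OWNERSHIP_RATE = 0.40


def phone_ownership_rate(age):
    if age > 65:
        return OVER_65_OWNERSHIP_RATE
    # Rates for the other age bands come from the ONS data, not from here.
    raise NotImplementedError("no rate given for this age band")


def estimated_phones(population_by_age):
    return round(
        sum(
            people * phone_ownership_rate(age)
            for age, people in population_by_age.items()
        )
    )


# The 26 people aged 80 in the Abbey ward of Barking and Dagenham.
abbey_ward_aged_80 = {80: 26}
likely_to_receive = estimated_phones(abbey_ward_aged_80)
```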
These numbers won’t be exact, but should be enough to give people a feel
for the severity of what they’re about to do. We can see if they achieve
this aim in user research.
1. This is a proxy for the number of people who are likely to have a 4G
capable phone, because only 4G capable phones will be receiving
broadcasts to begin with
We filter out very small polygons from the original data to remove
glitches. These glitches are caused by trying to subtract the water from
a polygon that includes some land and some water, but using two
different definitions or resolutions of mean high water line.
If we don’t do this then we end up with a bunch of very small polygons
which lie far outside the understood area of a place, causing large
overspill.
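
A sketch of the filtering, assuming each polygon is a list of `(x, y)`
points in a projected coordinate system measured in metres. The area
calculation uses the shoelace formula rather than whatever the real code
uses, and the threshold is passed in rather than hard coded.

```python
def polygon_area(points):
    # Shoelace formula: half the absolute sum of the cross products of
    # each pair of consecutive points (wrapping round to the first point).
    total = 0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += (x1 * y2) - (x2 * y1)
    return abs(total) / 2


def without_small_polygons(polygons, minimum_area):
    return [polygon for polygon in polygons if polygon_area(polygon) >= minimum_area]


main_land_area = [(0, 0), (100, 0), (100, 100), (0, 100)]  # 10,000m²
glitch = [(500, 500), (501, 500), (501, 501), (500, 501)]  # 1m²

kept = without_small_polygons([main_land_area, glitch], minimum_area=7_500)
```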
We need to increase the threshold for this process because we’re still
seeing this problem around Bristol and Norwich.
This does mean we lose a few very small polygons in places like Shetland
and the Scilly Isles, but not in such a way that we would avoid
broadcasting to them (because they’d still be caught by the
simplification and overspill).
To recap the previous commit, in the ward->local authority->county
library we want to return all local authorities and counties. We do this
by excluding anything that doesn't have children.
However, in the countries library, all four countries don't have
children.
I can't think of a generic way to separate these so this just filters
on the library id.
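
Something like this sketch, where the library ID, the dictionary keys
and the direction of the check (skip the filter for the countries
library) are all assumptions for illustration:

```python
# Hypothetical ID; the real library IDs may differ.
COUNTRIES_LIBRARY_ID = "countries"


def areas_to_return(areas, library_id):
    if library_id == COUNTRIES_LIBRARY_ID:
        # Countries have no children, so the usual filter would drop them all.
        return list(areas)
    # Elsewhere, keep only areas with children (local authorities, counties).
    return [area for area in areas if area["children"]]


countries = [{"name": "England", "children": []}, {"name": "Wales", "children": []}]
mixed = [
    {"name": "A county", "children": ["A district"]},
    {"name": "A ward", "children": []},
]
```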