# A 24-hour population distribution dataset based on mobile phone data from Helsinki Metropolitan Area, Finland

### Study area: The Helsinki Metropolitan Area

The dataset covers the Helsinki Metropolitan Area (HMA) in Finland, which consists of four municipalities: Helsinki, Vantaa, Espoo and Kauniainen (Fig. 1). The study area has a population of over 1.1 million inhabitants (1,154,967 on 31.12.2017), which represents roughly one-fifth (21%) of the total Finnish population27. The average population density in the study area based on residential data is roughly 1,500 people/km², being the highest in the inner city of Helsinki, which is located on the peninsula in the southern part of the study area.

Mobile phones are used extensively in the study area. At the end of 2017, the mobile phone penetration rate (mobile subscriptions = SIM cards/100 inhabitants) of Finnish households was 126% with approximately 6,960,000 mobile subscriptions28, which is above the global and the European average rates (103.6% and 120.4%, respectively)29. It is estimated that 89% of 16–89-year-olds own a smartphone in the Finnish capital region30. The results of the survey suggest that there is no significant difference between men and women in terms of phone ownership or use. A survey conducted in 2017 in the study area shows that 69% of 7-year-old children already have their own mobile phone31. At the end of 2018, Elisa Oyj had the largest market share of mobile subscriptions (38%) in Finland, followed by Telia Finland Oyj (33%) and DNA Oyj (28%)32.

### Data processing steps – flowchart

Producing the data required various processing steps. First, we pre-processed the raw data by cleaning, reclassifying and aggregating the data into polygons representing the approximated coverage areas of the operator base stations. Second, we used the pre-processed data as input to estimate the hourly population distribution in the study area by applying a dedicated dasymetric interpolation method to enhance the spatial accuracy of the mobile phone data. We calculated the hourly weekday (Monday–Thursday), Saturday and Sunday population distribution using a network-driven mobile phone dataset defined as High-Speed Packet Access (HSPA) calls (see details below). Friday was left out since the time use patterns of people on Fridays typically deviate from the other non-weekend days. We validated the data against the official population register data representing the residential population and workplace data. Finally, we packaged and visualized the data to provide an understanding of the dynamic population. The steps in the empirical study were conducted mainly using Python for analysis and QGIS for visualizing the results. The workflow of the study is illustrated in Fig. 2.

### Mobile phone data

Network-driven mobile phone data from a two-and-a-half-month study period from late October 2017 until early January 2018, provided by the Elisa Oyj mobile network operator (MNO), were used to map the dynamic population distribution in the study area. More specifically, we use HSPA (High-Speed Packet Access) call data, which are an automatically collected and pre-calculated key performance indicator (KPI) for data transmission by users in the mobile network, based on the standard guidelines released by 3GPP33. Since HSPA data are calculated based on radio network counters, there are no identifiers or links to any mobile device or personal information. As the HSPA data are inherently anonymous, there is no opt-in or opt-out possibility. Thus, all the mobile devices connected to the network are in scope, including foreign mobile devices using roaming services.

The mobile phone data used were passively (automatically) collected and processed by the MNO prior to providing us with the data. First, the MNO aggregated the set of raw counters used to calculate HSPA calls from antenna (cell) level to base station (site) level before calculating the actual KPI for each base station according to the guidelines defined by 3GPP33. The raw counters as well as the data at base station (BS) level had a temporal accuracy of one hour. Second, the coordinates of base stations equipped with multiple directional antennae were approximated by the MNO using the coordinates of the antenna with the maximum X and Y coordinate value. This only has an influence on the spatial accuracy of the BS coordinates when the antennae were not attached to a mast-like cell tower.

Finally, for reasons of business confidentiality, a randomized error of up to ±100 metres was added to BS coordinates in the inner city of Helsinki by the MNO before providing us with the data. That is, each coordinate pair is randomly relocated within the range between −100 metres and +100 metres from the original location. Outside the inner city, the error was set to up to ±200 metres. In general, the spatial accuracy of the data depends on the density of the base station network (highest in the city centre and other densely populated areas, where use rates are highest)1,34. The median theoretical coverage area based on Voronoi polygon modelling in our study area is 0.24 km².

### Content of the raw data

The original dataset contained approximately 3.8 million rows of data and covered all base stations of the given operator in the Uusimaa region in Southern Finland. The original dataset received from the MNO contained six attributes: the hourly count of HSPA calls, the identifiers of the base station and of the data record, geographical (X, Y) coordinates (in the ETRS-TM35FIN coordinate system) and a timestamp with hourly precision (YYYY-MM-DD hh) (Table 1).

To contextualize the HSPA call data: HSPA is a collection of downlink (HSDPA) and uplink (HSUPA) protocols, which allows faster data transmission in a Universal Mobile Telecommunications System (UMTS) cellular network35. In general, radio access bearers (RAB) are responsible for transmitting voice or data in 3G telecommunication networks, but if HSPA is supported by the network, data transfer can be replaced by HSPA bearers when prompted by HSPA call requests35. Thus, the HSPA calls in the dataset include the majority of 3G mobile data transmissions. Data transfer from 4G networks was, however, not available for the study.

### Temporal distribution of the raw data

The HSPA call data show clear temporal patterns at both weekly and daily levels. Regarding the whole study period, a recurring weekly rhythm can be distinguished (Fig. 3). The amount of network activity is relatively similar on weekdays from Monday to Friday and decreases during the weekend, with the lowest rates on Sundays. The weekly pattern is disrupted during the holiday season, with lower mobile phone usage compared to the day-of-the-week average. Examples include Finland’s Independence Day (6.12.), New Year’s Day and Christmas Day. Days with abnormally high values are system biases inherent in the raw dataset.

There is a distinct pattern in the temporal distribution of network activities at the diurnal level as well. On a regular workday (Monday–Thursday), mobile phone data follow a pattern similar to that of people’s activities in the Time Use Survey, with the lowest values during the night, from 00:00 to 05:00, and a more even distribution over the course of the day (Fig. 4).

### Pre-processing of mobile phone data

The mobile phone data were prepared for constructing the dynamic population by filtering, cleaning, manipulating and aggregating the original data (see Fig. 5). We excluded days (n = 3) with abnormal data (Fig. 3) and any hourly values (incorrect or missing data from a base station) that might distort the results. We further cropped the data to the extent of the study area, removed a handful of base stations with no activity during the whole study period (or where two base stations had an identical ID in different locations), and merged a few base stations with identical coordinates. We also filtered out duplicate hour entries caused by the transition to winter time.

After cleaning and enhancing the data, we filtered the data to represent regular workdays (Monday–Thursday) and, separately, each weekend day (Saturday and Sunday), because of their distinctive temporal activity patterns. After the filtering, 62 days were left for further analysis, of which 42 were between Monday and Thursday, 10 were Saturdays and 10 were Sundays. Finally, we aggregated the data to get the median number of HSPA calls during the study period for every base station (BS) for every hour of the day.
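The day-type filtering and median aggregation described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' script: the record layout `(bs_id, timestamp, hspa_calls)`, the function names and the sample values are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def day_type(ts):
    """Return 'workday' (Mon-Thu), 'saturday', 'sunday', or None for Friday."""
    weekday = ts.weekday()  # Monday = 0 ... Sunday = 6
    if weekday <= 3:
        return "workday"
    if weekday == 5:
        return "saturday"
    if weekday == 6:
        return "sunday"
    return None  # Friday is left out of the analysis


def hourly_medians(records, excluded_dates=frozenset()):
    """Median HSPA calls per (base station, day type, hour of day).

    records: iterable of (bs_id, 'YYYY-MM-DD hh', hspa_calls) tuples
    excluded_dates: ISO dates ('YYYY-MM-DD') of days with abnormal data
    """
    groups = defaultdict(list)
    for bs_id, stamp, calls in records:
        ts = datetime.strptime(stamp, "%Y-%m-%d %H")
        kind = day_type(ts)
        if kind is None or ts.date().isoformat() in excluded_dates:
            continue
        groups[(bs_id, kind, ts.hour)].append(calls)
    return {key: median(values) for key, values in groups.items()}


# Hypothetical sample values for one base station at 08:00
records = [
    ("BS1", "2017-11-06 08", 120),  # Monday
    ("BS1", "2017-11-07 08", 100),  # Tuesday
    ("BS1", "2017-11-08 08", 140),  # Wednesday
    ("BS1", "2017-11-10 08", 999),  # Friday, dropped
    ("BS1", "2017-11-11 08", 60),   # Saturday
    ("BS1", "2017-11-12 08", 40),   # Sunday
]
medians = hourly_medians(records)
```

Using the median rather than the mean keeps single outlier hours from dominating the typical value for a base station.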

### Constructing the dynamic population from mobile phone data

To distribute the mobile phone data from the base stations to the statistical grid squares, we used the multi-temporal function-based dasymetric (MFD) interpolation method1 (see Figs. 5 and 6). The MFD method is a dasymetric interpolation method belonging to the same family of areal interpolation methods as areal weighting. However, dasymetric interpolation differs from areal weighting because it uses ancillary data to improve the interpolation of data from existing spatial units (i.e. source zones) to desired spatial units (i.e. target zones). This approach has been considered one of the feasible methods for refining the spatial resolution of population data and has been widely applied in different application fields17,18.

The datasets used for preparing the dynamic population distribution using a dasymetric interpolation method are listed in Table 2.

### Creation of the physical surface layer

In the first stage of the MFD method, land cover and building data were pre-processed and combined to create the physical surface layer, which is a spatial layer representing land use information including a vertical dimension (building volumes). It is used as input data for calculating the likelihood of human presence at the later stages of the MFD1,36. Each feature in the physical surface layer was assigned an activity function type, which enabled us to further link the data with the time use survey data (Table 3).

Regarding the land cover data, we used a country-specific CORINE Land Cover raster dataset (the latest version at the time) with a spatial accuracy of 20 m × 20 m to determine the land cover classes of the study area37. The spatial accuracy of the more widely available Pan-European CORINE Land Cover vector dataset was too coarse (25 ha) for the study purposes. Similarly, the more recent openly available land cover data provided by the National Land Survey of Finland and the Helsinki Region Environmental Services Authority HSY were not applicable because their spatial accuracy was too low. The refined land cover classification enabled us to link land use classes to activity types in the time use data.

To prepare the land cover data for the MFD method, the dataset was transformed into vector format, reclassified and cropped to the extent of the study area. Like Järv et al.1, the land cover data were reclassified from the original classes (n = 48) to five classes based on their activity function types: (1) residential, (2) work, (3) transport, (4) restricted and (5) other (see Bergroth38, p. 58; Fig. 7). To improve the classification, both the international Helsinki-Vantaa airport area (mid-north; Fig. 7) and the Vuosaari cargo harbour area (east) were reclassified from the transport class to the work class, as they are important sites for the workforce due to their work-driven functions.

In terms of the building data, building polygons were extracted from the National Topographic Database39. In total, 160,490 buildings were located in the study area. The building data were cleaned by calculating the area of each building footprint and filtering out buildings with an area below 20 m² (n = 6,860), leaving 153,357 buildings for further analysis. Similarly to Järv et al.1, the buildings were first classified into three types according to their main activity function type: residential, work and other buildings (see Bergroth38, p. 58). Here, non-classified buildings were assumed to have work as the main activity function (i.e. work buildings), given that the dataset has an accurate classification for buildings whose main activity functions are associated with residential and other activity, but not for the work activity function. To further enrich the data and refine the classification, we retrieved additional building information from OpenStreetMap (n = 72,574)40. Using the OpenStreetMap data, the building classification was expanded to also cover the retail and service and the transport activity function types, which could not be extracted from the original building data (see Bergroth38, p. 140; Fig. 8).

Only one activity function type was assigned to each building. We acknowledge the crudeness of the chosen approach, as buildings may have multiple use types either simultaneously or at different times. However, the current level of accuracy is expected to be feasible for the purpose of this study. The final classification of buildings per activity function type is presented in Fig. 8 and Table 4.

The physical surface layer also takes into account the vertical dimension in the likelihood of human presence. To retrieve the vertical dimension, we used information about building footprints, floor area (m²) and floor counts based on national building registers (not available for the city of Kauniainen). The municipal data were further cleaned, combined and joined to the original building dataset. Finally, a geometric union was performed to combine the reclassified building and land cover layers.

### Spatial disaggregation by the source and target zones

After creating the physical surface layer, a geometric union was performed between the physical surface layer, source zone and target zone layers to create the disaggregated physical surface layer: a layer where physical surface layer units are divided into subunits so that each subunit (referred to as s in the formulas below) is assigned both to one unique source zone (j) and one unique target zone (z), see Fig. 6. In general, any spatial division can be used for the source zones and target zones. Voronoi polygons were used to estimate the theoretical coverage areas of base stations (source zones), and 250 m × 250 m statistical grid cells were used as the target zones41. As a result, the study area was divided into 345,917 subunits, each with a designated activity function type and spatial unit type (building or land) as well as floor area. The area of each subunit was recalculated after the overlay operation.

Next, the relative floor area of each subunit was calculated to include the vertical dimension in the interpolation. First, the absolute floor area was assigned to the subunits based on their spatial unit type and activity function type. For subunits with the spatial unit type ‘land’, the geometric area of the subunit was set as the floor area. For subunits with the spatial unit type ‘building’, the floor area was based on openly available building data from the municipalities of Espoo42, Helsinki43 and Vantaa44 containing the building register-based floor areas and floor counts. Using actual floor areas provides a more accurate estimate than the LiDAR-based approach used in Järv et al.1, in which the floor area was estimated from the building height extracted from the digital surface model (DSM).

Where the building register data were not openly available (e.g. in Kauniainen), the floor area was estimated based on the actual or mean floor count and a specific floor area coefficient. The mean floor count was 2 for residential, service and retail buildings, and 1 for others. The floor area coefficient was 0.95 for residential buildings, 0.91 for service and retail buildings, and 0.98 for other buildings. The floor area coefficient was calculated as the median ratio between the actual floor area and the product of the building footprint area and the floor count. Both the mean floor count and the floor area coefficient were calculated separately for buildings of each activity function type. Finally, the relative floor area (RFA) was calculated for each subunit within a source zone, based on Formula 1:

$$RFA_{s}^{j}=\frac{FA_{s}^{j}}{\sum_{s\in j}FA_{s}^{j}}\quad \forall s\in j$$

(1)

where

RFA = relative floor area

FA = floor area

s = spatial subunit

j = source zone

As a result, the sum of the relative floor area of all subunits within one source zone (Voronoi polygon) equals 1. The higher the relative floor area of the subunit, the higher the likelihood that activity is allocated to that subunit.
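As an illustration of the fallback floor area estimate and of Formula 1, the sketch below computes relative floor areas for a few subunits. It is a minimal example under stated assumptions: the dictionary keys (`"residential"`, `"retail_service"`), the subunit record layout and the sample areas are hypothetical, while the mean floor counts and coefficients are the values quoted in the text.

```python
# Values quoted in the text; the dictionary keys are hypothetical labels.
MEAN_FLOOR_COUNT = {"residential": 2, "retail_service": 2}
FLOOR_AREA_COEFF = {"residential": 0.95, "retail_service": 0.91}
OTHER_FLOOR_COUNT = 1   # mean floor count for all other building types
OTHER_COEFF = 0.98      # floor area coefficient for all other building types


def estimate_floor_area(footprint_m2, function_type, floor_count=None):
    """Fallback estimate: footprint area x floor count x floor area coefficient.

    If the actual floor count is unknown, the mean count for the
    activity function type is used instead.
    """
    if floor_count is None:
        floor_count = MEAN_FLOOR_COUNT.get(function_type, OTHER_FLOOR_COUNT)
    coeff = FLOOR_AREA_COEFF.get(function_type, OTHER_COEFF)
    return footprint_m2 * floor_count * coeff


def relative_floor_area(subunits):
    """Formula 1: each subunit's floor area as a share of its source zone total."""
    totals = {}
    for unit in subunits:
        zone = unit["source_zone"]
        totals[zone] = totals.get(zone, 0.0) + unit["floor_area"]
    return {
        unit["id"]: unit["floor_area"] / totals[unit["source_zone"]]
        if totals[unit["source_zone"]] > 0 else 0.0
        for unit in subunits
    }


# Hypothetical subunits: a residential building without register data,
# a land subunit (geometric area used as floor area), and a second zone.
subunits = [
    {"id": "s1", "source_zone": "j1",
     "floor_area": estimate_floor_area(100.0, "residential")},
    {"id": "s2", "source_zone": "j1", "floor_area": 810.0},
    {"id": "s3", "source_zone": "j2", "floor_area": 50.0},
]
rfa = relative_floor_area(subunits)
```

In this example the residential building is estimated at 100 × 2 × 0.95 = 190 m², so it receives 19% of the floor area of source zone `j1` and the land subunit receives the remaining 81%.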

### Integration of the temporal human activity data

In the third phase of the MFD method, time use data were integrated with the physical surface layer to create a probability matrix for allocating the mobile phone data to target zones within each source zone. As a result, each spatial subunit received an hourly likelihood rate of human presence based on its activity function type.

The estimated human presence (EHP) in each subunit was calculated using human activity data based on the latest Finnish time use survey45, conducted in 2009 according to the guidelines for Harmonised European Time Use Surveys (HETUS) issued by Eurostat. The time use survey allows for the calculation of the human activity data for each hour based on the activity location of over 10-year-olds in the HMA (Fig. 9).

To calculate the estimated human presence, we first aggregated the human activity to the hourly level. Second, to join it with the physical surface layer, we reclassified human activity from the survey into the following classes based on the location where the activity was undertaken: 1) residential, 2) work (incl. education), 3) transport, 4) retail and service, 5) unknown and 6) other (such as leisure areas) (see Bergroth38, p. 139).

An hourly probability coefficient (H) was assigned to every hour of the day based on the time use data. In addition, a seasonal probability coefficient (M) was assigned to account for the influence of the season on the distribution of people indoors and outdoors. According to a study conducted by Hussein et al.46, people were found to spend approximately 90% of the day indoors in Helsinki during the winter and spring. As in Järv et al.1, the results are assumed to be suitable for the dasymetric interpolation, as the mobile phone data used for estimating the population distribution were also collected during winter. The seasonal factor was applied to three of the activity function types (residential, work and education, other). Thus, a subunit of the work activity function type would receive a coefficient of 0.9 if the spatial unit type was ‘building’ and a coefficient of 0.1 if the spatial unit type was ‘land’. Subunits with the other activity function types were assigned a factor of 1, except restricted areas, which were assigned a factor of 0. This way, the MFD method prevents population from being allocated to a subunit of a restricted type. Overall, the estimated human presence for every spatial subunit at a given time unit (hour) was calculated using Formula 2:

$$EHP_{s}^{j,t}=\left[H_{a,u}^{t}\times M_{a,u}\right]\times RFA_{s}^{j}$$

(2)

where

EHP = estimated human presence

t = time unit

H = hourly factor

M = seasonal factor

RFA = relative floor area

a = activity function type

u = spatial unit type (building or land)

s = spatial subunit

j = source zone
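The seasonal factor rules and Formula 2 can be sketched as below. This is a minimal illustration, not the published script: the activity function labels, the function names and the hourly factor of 0.4 are hypothetical, and the hourly factor H is passed in as a plain number instead of being looked up from time use survey data.

```python
INDOOR_FACTOR = 0.9   # share of time spent indoors in winter, as quoted in the text
OUTDOOR_FACTOR = 0.1  # complementary outdoor share
# Activity function types to which the seasonal factor applies (assumed labels)
SEASONAL_TYPES = {"residential", "work", "other"}


def seasonal_factor(function_type, unit_type):
    """Seasonal factor M for an activity function type and spatial unit type."""
    if function_type == "restricted":
        return 0.0  # no population may be allocated to restricted areas
    if function_type in SEASONAL_TYPES:
        return INDOOR_FACTOR if unit_type == "building" else OUTDOOR_FACTOR
    return 1.0  # remaining types, e.g. transport or retail and service


def estimated_human_presence(hourly_factor, function_type, unit_type, rfa):
    """Formula 2: EHP = [H x M] x RFA for one subunit and one hour."""
    return hourly_factor * seasonal_factor(function_type, unit_type) * rfa


# Hypothetical hourly factor H = 0.4 for work activity at a given hour
ehp_work_building = estimated_human_presence(0.4, "work", "building", 0.5)
ehp_work_land = estimated_human_presence(0.4, "work", "land", 0.5)
ehp_restricted = estimated_human_presence(0.4, "restricted", "land", 0.5)
```

With these inputs the work building receives 0.4 × 0.9 × 0.5 = 0.18, the work land subunit 0.4 × 0.1 × 0.5 = 0.02, and the restricted subunit 0.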

### Integration of the mobile phone data

In the fourth phase of the dasymetric interpolation, the mobile phone data were integrated into the physical surface layer enriched with hourly and seasonal human activity data. The mobile phone data containing the hourly median number of the different network activities were linked to the physical surface layer based on the BS identifier. First, the mobile phone activity per spatial subunit was normalized by dividing it by the sum of the corresponding values of all spatial subunits in the study area. Hence, the sum of the relative proportion of mobile phone data of all subunits in the study area is 1. The relative proportion of mobile phone data per spatial subunit, as a share of the study area total at a given hour, was calculated using Formula 3:

$$RMP_{s}^{j,t}=\frac{MP_{s}^{j,t}}{\sum_{s\in S}MP_{s}^{j,t}}$$

(3)

where

RMP = relative proportion of mobile phone data

MP = mobile phone data

s = spatial subunit

t = time unit

j = source zone

S = study area

The formula was calculated separately for each of the three day types: regular workday (Monday–Thursday), Saturday and Sunday. Second, the hourly normalized mobile phone data for each day type were multiplied by the hourly estimated human presence to allocate the population to the subunits based on the physical surface layer and time use statistics. The relative observed population was calculated using Formula 4:

$$ROP_{s}^{j,t}=EHP_{s}^{j,t}\times RMP_{s}^{j,t}$$

(4)

where

ROP = relative observed population

EHP = estimated human presence

RMP = relative proportion of mobile phone data

s = spatial subunit

t = time unit

j = source zone
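Formulas 3 and 4 can be illustrated for a single hour as follows. This is a minimal sketch under assumptions: each subunit is taken to carry the median HSPA call count linked to it through its base station identifier, and the subunit names and numeric values are hypothetical.

```python
def relative_mobile_phone(mp):
    """Formula 3: each subunit's mobile phone value as a share of the study area total.

    mp: dict mapping subunit id -> mobile phone value for one hour
    """
    total = sum(mp.values())
    if total == 0:
        raise ValueError("no mobile phone activity in the study area for this hour")
    return {s: value / total for s, value in mp.items()}


def relative_observed_population(ehp, rmp):
    """Formula 4: ROP = EHP x RMP per subunit for one hour."""
    return {s: ehp[s] * rmp[s] for s in ehp}


# Hypothetical values for three subunits at one hour
mp = {"s1": 300, "s2": 100, "s3": 100}
ehp = {"s1": 0.5, "s2": 0.25, "s3": 1.0}

rmp = relative_mobile_phone(mp)
rop = relative_observed_population(ehp, rmp)
```

Here `rmp` sums to 1 over the study area (0.6, 0.2 and 0.2), and the resulting `rop` values are 0.3, 0.05 and 0.2. In a full run the same two steps would be repeated for every hour and for each of the three day types.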

### Spatial aggregation to target zones

In the fifth and final phase of the MFD method, the spatial subunits were aggregated to the statistical 250 m × 250 m grid cells (n = 13,231). The aggregation was performed by dissolving the subunits based on the target zone ID. As a result, each target zone was assigned the sum of the relative observed population of all spatial subunits within the given target zone. The aggregation to the target zones can be summarized as follows (Formula 5):

$$ZROP^{z,t}=\sum_{s\in z}ROP_{s}^{z,t}$$

(5)

where

ZROP = spatially aggregated relative observed population per target zone

ROP = relative observed population

t = time unit

s = spatial subunit

z = target zone

As a final result of the MFD method, three normalized population data layers for each hour of the day were created: regular workday (Monday–Thursday), Saturday and Sunday. After normalization, the sum of all values for each one-hour interval equals 100 (i.e. 100% of the total population). The script used to run the MFD method is based on Järv et al.47 and is openly shared via GitHub: https://github.com/DigitalGeographyLab/mfd-helsinki.
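The aggregation in Formula 5 and the final rescaling to percentages can be sketched for one hour as below. This is a minimal illustration rather than the shared script: the text states only that the hourly values sum to 100, so rescaling by dividing by the hourly total is an assumption, and the subunit ids, target zone ids and values are hypothetical.

```python
from collections import defaultdict


def aggregate_to_target_zones(rop, target_zone_of):
    """Formula 5: sum the relative observed population of subunits per target zone.

    rop: dict mapping subunit id -> ROP for one hour
    target_zone_of: dict mapping subunit id -> target zone (grid cell) id
    """
    zrop = defaultdict(float)
    for subunit, value in rop.items():
        zrop[target_zone_of[subunit]] += value
    return dict(zrop)


def normalise_to_percent(zrop):
    """Rescale zone values so they sum to 100 for the hour (assumed procedure)."""
    total = sum(zrop.values())
    if total == 0:
        raise ValueError("nothing to normalise: all zone values are zero")
    return {zone: 100.0 * value / total for zone, value in zrop.items()}


# Hypothetical ROP values for three subunits falling into two grid cells
rop = {"s1": 0.3, "s2": 0.05, "s3": 0.15}
target_zone_of = {"s1": "z1", "s2": "z1", "s3": "z2"}

zrop = aggregate_to_target_zones(rop, target_zone_of)
share = normalise_to_percent(zrop)
```

In this example grid cell `z1` holds 70% and `z2` holds 30% of the population for the hour, so the layer can be scaled to any known total population by multiplication.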