Classifying Data Heterogeneity within Budget and Spending Open Data

Open data has gained momentum over the past few years, yet published open budget and spending datasets have seen little consumption. Many challenges in consuming open budget and spending data remain open, one of which is the heterogeneity of these datasets. We analyze more than 75 budget and spending datasets released by public administrations at various administrative levels and locations. We select five datasets, then present and illustrate several types of budget and spending heterogeneity. We compare these heterogeneities with two state-of-the-art fiscal data models, the OpenBudgets.eu (OBEU) data model and the Fiscal Data Package (FDP), which are designed specifically for representing budget and spending datasets. The comparison provides hints for both dataset publishers and the technical/research communities that deal with open data in the budget and spending domain.


INTRODUCTION
Many public administrations have published budget and spending data as part of their open data programs. A survey conducted by Open Knowledge International shows that budget datasets rank first as the most published open datasets among other types of datasets (e.g., national statistics, procurement, national laws, administrative boundaries, draft legislation, air quality, national maps, weather forecast, company register, election results, locations, water quality, government spending, and land ownership) [1]. Having a flexible way to publish a dataset simplifies the work of dataset publishers. Unfortunately, this flexibility leads to dataset complexity, which makes the datasets difficult to consume and integrate. In addition, analyzing the published fiscal data requires advanced technical skills [2].
Publishing open data in the domain of budget and spending is often accompanied by different types of classifications. For example, during our analysis we found functional classifications (e.g., elementary education and retirement funds) and administrative classifications (e.g., Department of Education and Office of Retirement Services). A functional classification lists the possible values for items spent/budgeted from the usage perspective. An administrative classification provides the list of offices that manage the budget or spending.
ICEGOV'18, April 2018, Galway, Ireland F. Musyaffa, F. Orlandi, H. Jabeen, M. Vidal
The structures of these classifications are also heterogeneous. The diversity ranges from the level of detail (i.e., whether hierarchies are available within the list) to how the classifications are normalized or attached (e.g., within the dataset or outside of it). Factors that contribute to these heterogeneities include differences in business and budgeting processes, the coverage level of the administration (e.g., supranational vs. municipal), and how projects within the public administration are funded.
In this paper, we classify the heterogeneities of budget and spending datasets and correlate them with two state-of-the-art data models designed specifically for budget and spending datasets. We present lessons learned that could be applied by dataset publishers and the technical/scientific communities. Currently, we do not cover linguistic and metadata heterogeneity.
This paper is organized as follows: Section 2 provides a motivating example, Section 3 briefly presents related work, Section 4 provides an analysis of open fiscal data heterogeneity, Section 5 presents design concerns of the available state-of-the-art data models in the budget and spending domain, Section 6 defines common terms that are used throughout this paper, and Section 7 links the state-of-the-art data models with the enumerated heterogeneities. Section 8 discusses lessons learned for both fiscal data publishers and the technical and scientific communities. Finally, Section 9 concludes this paper.

MOTIVATING EXAMPLE
[...] Figure 1 (b) provides an example of a row taken from the City of Bonn's budget 2017 dataset. Both datasets are published in their native languages (Spanish and German, respectively) and are structured differently. The datasets from the City of Madrid include the description of each classification (Descripcion Centro describes Centro, Descripcion Capitulo describes Capitulo, and Descripcion Economico describes Economico) within the dataset itself. In contrast, the dataset from the City of Bonn does not directly provide the descriptions of its classifications (Profitcenter, Konto, PSP element, Auftrag, Geschäftsbereich, and Version). Additionally, the Bonn datasets are not split into different operational character categories (e.g., income budget vs. expenditure budget), while the Madrid datasets are split by operational category. The operational character category in the Bonn dataset is provided implicitly via the code in the Konto classification as well as the sign of the amount of money (a minus sign for income, positive numbers for expenditure).
Despite these differences, some information in these datasets is relatable, as indicated in Figure 1 (c). For example, the amount of income is provided in the PrCtrHw column in the Bonn datasets and in the Importe column in the Madrid dataset. Konto in the Bonn dataset consists of the operational character classification and the economic classification. In the Madrid dataset, the economic classification is provided as Economico. Profitcenter in the Bonn dataset merges the administrative classification and the functional classification. In the Madrid dataset, the administrative classification and the functional classification are provided as Centro and Capitulo, respectively.
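The correspondences described above can be illustrated with a minimal Python sketch that maps one row from each publisher onto a common record. The column names (PrCtrHw, Konto, Profitcenter, Importe, Economico, Centro, Capitulo) and the sign convention come from the text; the `BudgetLine` class, the two helper functions, and the treatment of a zero amount as expenditure are assumptions made for illustration only and are not part of either dataset or data model.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class BudgetLine:
    """Hypothetical common shape for one row, independent of publisher layout."""
    source: str
    direction: str        # "income" or "expenditure"
    amount: Decimal       # non-negative after normalization
    economic: str         # economic classification code
    administrative: str   # administrative classification code
    functional: str       # functional classification code


def from_madrid(row: dict, direction: str) -> BudgetLine:
    # Madrid splits its datasets by operational category, so the direction
    # is known from the file and is passed in by the caller.
    return BudgetLine(
        source="madrid",
        direction=direction,
        amount=Decimal(row["Importe"]),
        economic=row["Economico"],
        administrative=row["Centro"],
        functional=row["Capitulo"],
    )


def from_bonn(row: dict) -> BudgetLine:
    # Bonn encodes the direction implicitly: a minus sign marks income,
    # a positive number marks expenditure (zero is treated as expenditure
    # here, which is an assumption of this sketch).
    raw = Decimal(row["PrCtrHw"])
    direction = "income" if raw < 0 else "expenditure"
    return BudgetLine(
        source="bonn",
        direction=direction,
        amount=abs(raw),
        # Konto combines operational character and economic classification;
        # it is kept whole here rather than split.
        economic=row["Konto"],
        # Profitcenter merges administrative and functional classification,
        # so the same code is carried in both fields.
        administrative=row["Profitcenter"],
        functional=row["Profitcenter"],
    )
```

Keeping the merged Bonn codes unsplit avoids guessing at an internal code structure that the text does not describe; a real integration would need the publisher's code lists to separate them.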

RELATED WORK
Current work on modeling heterogeneous fiscal datasets has been carried out by OpenBudgets.eu (OBEU) with the OBEU data model [3] and by Open Knowledge International (OKI) with the Fiscal Data Package (FDP) data model [4]. The OBEU data model is currently the state-of-the-art data model for fiscal data and was designed based on previous data models. A survey of 14 data models in the budget and spending domain is elaborated in [5]. OKI has also been working on the OpenSpending project. By September 2017, OpenSpending had collected 2,238 datasets from 76 countries. OpenSpending provides an open-source technology stack to manage fiscal data, including FDP, which is being actively developed by the fiscal and transparency communities to model budget and spending datasets. A dataset in FDP consists of CSV and JSON files, with the CSV file as the core fiscal dataset and the JSON file as the dataset metadata. The JSON file also contains the mapping of the dataset columns into a logical model defined by the FDP specification. Once a dataset has been successfully packaged, it can be visualized using the OpenSpending Viewer tool. Successfully FDP-packaged datasets can also be transformed into the OBEU data model. Data in the OBEU data model that is stored in the semantic data server can be queried using a specific API [6].
Many fiscal datasets can be modeled by FDP or the OBEU data model, depending on how the fiscal datasets are published. Since there is no binding standard followed by public administrations with regard to fiscal data publishing, datasets can be very heterogeneous. A classification of data heterogeneity in relational databases has been provided by Kim and Seo [7]. Their work classified and enumerated general structural heterogeneities of relational databases, including schema and data conflicts. Our work focuses on heterogeneities that occur specifically in the open budget and spending data domain, after surveying 77 datasets from public administrations at various levels.
There are also heterogeneities in terms of accounting standards. Attempts to standardize accounting across different public administrations have been made through several initiatives, such as the International Public Sector Accounting Standards (IPSAS) and the European Public Sector Accounting Standards (EPSAS). In this paper, we limit our scope to the structural, contextual, and syntactical heterogeneities of fiscal datasets and exclude accounting standard heterogeneities.

Datasets Example
We have conducted a comprehensive analysis of 77 heterogeneous budget and spending datasets. The spreadsheet with the detailed analysis is available online. These datasets come from different administrative levels (supranational, national, regional, and municipal). Among the analyzed datasets, we picked the following five, which represent a good sample of the possible heterogeneities in open fiscal datasets within the budget and spending domain. These datasets are:
• Bonn budget datasets (from a private repository). The Bonn datasets are currently obtained privately but licensed as Public Domain. These datasets contain budget data from 2008 to 2024, along with several classifications that are published once and remain valid for years, with occasional updates. The Bonn budget datasets likely have a structure similar to that of most budget datasets from cities within the German state of North Rhine-Westphalia.
• Aragon budget datasets

Heterogeneity Types
This subsection enumerates several types of heterogeneity, illustrated with cases from the datasets mentioned in section 4.2.
Among these datasets, we enumerate several heterogeneities that are also likely to occur in other datasets from different public administrations.

OPENBUDGETS.EU DATA MODEL AND FISCAL DATA PACKAGE
The OBEU data model is modeled after the Data Cube Vocabulary (DCV) [10]. In the context of the semantic web, a vocabulary is a set of defined concepts that can be used to annotate information published in datasets on the web. DCV is a vocabulary recommended by the W3C for publishing multi-dimensional data on the web. Multi-dimensional data include statistics as well as budget and spending data. Publishing datasets in DCV allows linking to related concepts and datasets. The OBEU data model considers the following modeling patterns, which are extracted from [11]:
1. Data Structure Definition (DSD). A DSD is an additional file that provides detailed information regarding every dimension, measure, and attribute available in the datasets. Within the OBEU data model, a DSD is required.
2. Component specification for the budget/spending domain. The OBEU data model specifies different dimensions, attributes, and measures that frequently occur in budget and spending datasets. There are 20 components defined within the OBEU core data model, some of which are abstract components. Abstract components require data maintainers to extend them for more fine-grained modeling.
3. Support for coded dimensions/attributes. Budget and spending datasets are often provided with classifications in the form of an encoded notation along with its description. In the OBEU data model, these classifications are provided as a code list, [...] to define the metadata of the datasets. Some mandatory metadata fields are defined in the OBEU data model.
Fiscal Data Package is another state-of-the-art, evolving data model. FDP consists of the original data in CSV format, accompanied by a JSON file that describes the CSV file. FDP is designed based on the following modeling patterns, which are summarized from [4]:
1. A main dataset/resource and its metadata as core components. The use of CSV and JSON relies on open standards.
2. Self-documenting metadata with progressive requirements. Some metadata are obligatory, while others are recommended or optional.
3. Designed with automated and standardized processing and analysis in mind.
4. Specifying detailed concepts common in budget and spending data (e.g., activity, entity, location). The FDP data model covers basic fiscal concepts, such as administrative and functional classifications, suppliers, amounts, etc.
5. Providing descriptors that define package metadata (name, country code, title, author, license, profiles, granularity, fiscal period), resources (column names and types), and models (the mapping from CSV into FDP-defined logical models such as measures and dimensions).
6. Online analytical processing (OLAP)-based design, which means that the concepts of measures and multiple dimensions are taken into consideration.
7. Specifying some harmonized classifications, such as COFOG [13] by the United Nations and GFSM [14] by the International Monetary Fund. In FDP, non-harmonized classifications can be modeled as well.
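The idea of a descriptor that maps physical CSV columns onto logical measures and dimensions can be sketched as follows. This is a simplified illustration only: the key names in `descriptor` are hypothetical and do not reproduce the normative FDP JSON schema, the column names are borrowed from the Madrid example, and the sample values are invented.

```python
import csv
import io

# Hypothetical, simplified descriptor. The key names are illustrative and
# do NOT reproduce the normative FDP JSON schema.
descriptor = {
    "name": "example-budget",
    "model": {
        "measures": {"amount": {"source": "Importe"}},
        "dimensions": {
            "administrative": {"source": "Centro"},
            "economic": {"source": "Economico"},
        },
    },
}

# Invented sample data standing in for the CSV resource of a package.
CSV_TEXT = "Centro,Economico,Importe\n001,22000,200.50\n002,13000,75.00\n"


def apply_model(csv_text: str, descriptor: dict) -> list[dict]:
    """Rename physical CSV columns to the logical names in the descriptor."""
    model = descriptor["model"]
    mapping = {}
    for group in ("measures", "dimensions"):
        for logical_name, spec in model[group].items():
            mapping[logical_name] = spec["source"]
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        rows.append({logical: record[physical] for logical, physical in mapping.items()})
    return rows


logical_rows = apply_model(CSV_TEXT, descriptor)
```

Separating the mapping from the data in this way is what allows the original CSV file to stay untouched while tools operate on the logical names.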

GLOSSARY
Since some of the terms discussed in this paper are rather technical, this section attempts to provide a common understanding of the definitions used throughout this paper.
• Dimension, classification, and code list. A dimension defines the qualitative element of a budget line [3]. The term dimension corresponds to the definition within the Data Cube Vocabulary (DCV). One particular type of dimension is a classification. The catalog that lists the possible values of a classification is called a code list.
• Row, observation, and budget line. Every row in a tabular file from a budget/spending dataset corresponds to an observation (in DCV terms) or a budget line (in OBEU terms). An observation consists of an observed value (such as the amount of money spent), along with corresponding dimensions (such as the office and the functional usage for which this value is spent) and attributes (such as the currency of the value).
• Measure and amount. The measure defines the value that is available in a particular observation. The measure is a concept in DCV which specifies the fact being observed. In the context of budget and spending, a measure typically represents the amount of money being budgeted or spent within an observation.
• Spending. Spending here defines the actual value that is spent on an item. In this paper, the executed budget is considered similar to spending.
• Budget. The budget contains a list of planned values to be spent with regard to specified dimensions and attributes. Public budgets pass through different budget phases: the draft or proposed budget before it is approved by politicians, the approved budget after it has been agreed upon by the politicians, the revised or adjusted budget for a budget that has been changed with regard to the approved budget, and the executed budget for the actual value paid after the budget has been spent.
• Expenditure. Expenditure refers to the amount of money budgeted to be spent on an item. In this paper, expenditure refers to budgeted money that may or may not have been spent yet, whereas spending refers specifically to budgeted money that has already been spent.
• Income. Income refers to the amount of budgeted money which would flow in as revenue for the corresponding public administration.
• Fiscal datasets. Fiscal datasets refer to any datasets that elaborate the financial management of public administration. Fiscal datasets may include datasets of the budget, spending, procurement, contracts, beneficiaries and so on.
• Resource Description Framework (RDF). RDF is a standard data model published by the World Wide Web Consortium (W3C). Data in RDF are represented as triples. A triple connects two items with a property or predicate, which facilitates data merging despite schema differences. The first item is called the subject and the second item the object. A subject is a Uniform Resource Identifier (URI), which identifies the item and serves as a web link to further information about it. An object can be a URI or a literal. When RDF datasets are linked to other datasets, they form linked data. Linked open data represented as RDF enables linking across datasets from different sources.
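To make the glossary terms concrete, the following minimal sketch represents one observation (budget line) with its measure, dimensions, and attributes, and expresses it as subject-predicate-object triples. The Data Cube namespace and `rdf:type` are the standard W3C URIs; the `http://example.org/budget/` namespace, the `Observation` class, and the property names under it are hypothetical and are not OBEU identifiers. Plain tuples are used instead of an RDF library to keep the example self-contained.

```python
from dataclasses import dataclass, field

QB = "http://purl.org/linked-data/cube#"   # W3C Data Cube Vocabulary namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/budget/"          # hypothetical namespace for this sketch


@dataclass
class Observation:
    """One budget line: a measure plus its dimensions and attributes."""
    identifier: str
    amount: str                                      # the measure
    dimensions: dict = field(default_factory=dict)   # e.g. classification codes
    attributes: dict = field(default_factory=dict)   # e.g. currency


def to_triples(obs: Observation) -> list[tuple[str, str, str]]:
    """Express an observation as (subject, predicate, object) triples."""
    subject = EX + "observation/" + obs.identifier
    triples = [(subject, RDF_TYPE, QB + "Observation")]
    # The measure is attached as a literal (written here as a quoted string).
    triples.append((subject, EX + "amount", f'"{obs.amount}"'))
    # Dimension values point at code-list entries, i.e. at other URIs.
    for name, code in obs.dimensions.items():
        triples.append(
            (subject, EX + "dimension/" + name, EX + "codelist/" + name + "/" + code)
        )
    # Attributes such as the currency are kept as literals in this sketch.
    for name, value in obs.attributes.items():
        triples.append((subject, EX + "attribute/" + name, f'"{value}"'))
    return triples
```

Because dimension values are URIs rather than plain strings, two datasets that refer to the same code-list entry become linked without any further schema alignment.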

LINKING OBEU DATA MODEL AND FDP
The following Table 2 [...]
Over the past few years, the technical and scientific communities have been working to provide sufficient tools and models for handling open budget and spending data. In Table 2 [...]
The choice of a particular stack depends upon the use case of the public administration. If the public administration expects its data to be modeled and consumed in a more flexible, descriptive way and intends it to be analyzed in RDF, then its datasets have to be published in a manner compatible with the OBEU stack. If the dataset publisher is more concerned about easy consumption that does not require much technical skill (albeit with less descriptive power), then the publisher will mostly be interested in publishing data that is compatible with the FDP stack. FDP-packaged datasets can be transformed into the OBEU data model semi-automatically using an ETL pipeline [15]. Table 2 shows that some heterogeneity issues are not yet considered in the design of the data models:
• negative values interpretation (in both the OBEU and FDP stacks)
• multiple source funding (in the FDP stack)
• multiple currencies (in the FDP stack)
• insignificant classification hierarchy (in both stacks)
• nonstandard classification (in the FDP stack)
• harmonized and non-harmonized classification (in both stacks)
• classifications that are published once, and therefore normalized (in the FDP stack)
• classifications that are published periodically (in the OBEU stack)
• optional classification (in the OBEU stack)
• datasets that provide an observation description (in the FDP stack)
• datasets with normalized budget phase, budget direction, and classification (in the FDP stack)
• datasets in formats other than CSV (in the FDP stack)
These known limitations with respect to heterogeneity can be used as evaluation criteria to improve the currently evolving budget and spending data models, as well as the technology stacks that process budget and spending datasets.
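The limitations enumerated above can be captured in a small lookup structure, which a publisher could use as a checklist before choosing a stack. The sketch below only restates the gaps named in the text; the `UNSUPPORTED` dictionary and the `gaps_for` helper are hypothetical names introduced for illustration and are not part of either stack.

```python
# Heterogeneity issues not yet considered in the data model design, as
# stated in the text. Each issue maps to the stack(s) lacking support.
UNSUPPORTED = {
    "negative values interpretation": {"OBEU", "FDP"},
    "multiple source funding": {"FDP"},
    "multiple currency": {"FDP"},
    "insignificant classification hierarchy": {"OBEU", "FDP"},
    "nonstandard classification": {"FDP"},
    "harmonized and non-harmonized classification": {"OBEU", "FDP"},
    "classification published once (normalized)": {"FDP"},
    "classification published periodically": {"OBEU"},
    "optional classification": {"OBEU"},
    "observation description": {"FDP"},
    "normalized budget phase, budget direction, and classification": {"FDP"},
    "datasets other than CSV format": {"FDP"},
}


def gaps_for(stack: str) -> list[str]:
    """Return the issues that the given stack ("OBEU" or "FDP") does not cover."""
    return sorted(issue for issue, stacks in UNSUPPORTED.items() if stack in stacks)


def shared_gaps() -> list[str]:
    """Return the issues that neither stack covers."""
    return sorted(issue for issue, stacks in UNSUPPORTED.items() if len(stacks) == 2)
```

For instance, a publisher whose datasets use several currencies could call `gaps_for("FDP")` and see that this case is listed among the FDP gaps.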

CONCLUSION AND FUTURE WORK
In this paper, we present a list of heterogeneities that appear in open fiscal datasets. The heterogeneities were collected after analyzing datasets from different public administrations. We compared these heterogeneities with the support available in state-of-the-art data models. Lessons learned are provided for both dataset publishers and the scientific/technical communities.
In the future, we would like to extend this work by analyzing heterogeneous, multilingual datasets from different public administrations and by proposing an approach to map related concepts across multilingual budget and spending dataset classifications. From the accounting perspective, considering accounting standard heterogeneities would also be a useful contribution to the open budget and spending communities.