Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset

The Learning Analytics and Knowledge (LAK) Dataset represents an unprecedented corpus which exposes a near-complete collection of bibliographic resources for a specific research discipline, namely the connected areas of Learning Analytics and Educational Data Mining. Covering over five years of scientific literature from the most relevant conferences and journals, the dataset provides Linked Data about bibliographic metadata as well as the full text of the paper body. The latter was enabled through special licensing agreements with ACM for publications not yet available through open access. The dataset has been designed following established Linked Data patterns, reusing established vocabularies and providing links to established schemas and entity coreferences in related datasets. Given its temporal and topic coverage, being a near-complete corpus of research publications of a particular discipline, the dataset facilitates scientometric investigations, for instance into the evolution of a scientific field over time or correlations with other disciplines, as documented by its usage in a wide range of scientific studies and applications.


Introduction
While there exists a wealth of datasets containing bibliographic metadata, such as ACM 1 or DBLP 2 , these usually provide RDF data covering authors, affiliations and publication metadata but, with positive exceptions such as the Semantic Web Journal, usually lack direct access to the content of the publication. This is despite wider calls, for instance at the European level 3 , to publish data and scientific output in machine-readable and open formats to facilitate reuse and interoperability.
1 http://datahub.io/dataset/rkb-explorer-acm
2 http://datahub.io/dataset/l3s-dblp
Such a lack of access to openly licensed and structured research information hinders researchers from carrying out scientometric investigations or deeply investigating the evolution of scientific disciplines, topics or researchers over time. In particular, for the investigation of the inherent dynamics and the evolution of an entire discipline over time, no dedicated corpus exists which (a) provides bibliographic metadata and full text in a structured and machine-processable format such as Linked Data and (b) covers the near-complete output of a particular research community over its entire existence.
This paper describes the Learning Analytics and Knowledge (LAK) Dataset 4 , which represents an unprecedented corpus exposing a near-complete collection of bibliographic resources of the scientific disciplines Learning Analytics (LA) and Educational Data Mining (EDM), covering over five years of scientific literature from the most relevant conferences and journals in these disciplines. Considering the licensing and copyright constraints involved in publishing large amounts of scholarly publications across heterogeneous sources, the LA and EDM discipline is an ideal use case, as it is a young yet quickly evolving community. Scientific outlets here are still limited to a few main conferences and journals, many of which are open access, allowing for the accumulation of a close-to-complete corpus spanning all significant publications in the field.
The dataset provides Linked Data about bibliographic metadata as well as full text for all publications. Publication agreements were reached with ACM for publications not already available as open access. The dataset is published and maintained with support of the LinkedUp project 5 , the Society for Learning Analytics Research 6 (SoLAR), ACM 7 , the L3S Research Center 8 and the Institute for Educational Technology of the National Research Council of Italy 9 (CNR-ITD). Its main goals are (i) facilitating scientific and community analysis of the LA/EDM communities over time, (ii) improving access to scientific literature in said fields, and (iii) providing a general example of open publishing as well as a test-bed for scientometric tools and methods. The use and exploitation of the dataset is actively encouraged by means of the annual LAK Data Challenge, which has led to the emergence of an increasing number of applications and studies. In addition, the methods and vocabularies used for annotating and exposing the data describe general practices for publishing bibliographic data beyond mere metadata.

Related Work
Publishers of bibliographic data, and especially of scientific bibliographies, were early adopters of Semantic Web technologies, possibly because of the strong relationship between the fields of library management and information management and the strong use case for sharing scientific publications and related data. This led to a wealth of datasets and vocabularies in the area: some of the most prominent datasets in the Linked Data cloud today are exposed by organisations such as the British Library (see Linked Open BNB 10 ), as well as by repositories of research outputs (such as DBLP 11 or the Linked Data Platform from Nature 12 ).
This also led to the emergence of vocabularies for bibliographic information. Earlier works include the SwetoDblp ontology [9]; more recent efforts include the BibBase ontology [10], which is also linked to the Bibliographic Ontology BIBO 13 and the Semantic Web for Research Communities 14 (SWRC) ontologies, two other widely used vocabularies. Schema.org and the SPAR ontology suite 15 also offer a wide range of concepts and vocabularies in this context, and the WorldCat Linked Data Vocabulary 16 of the OCLC 17 recommends schema.org types.
The Semantic Web Dog Food (SWDF) 18 initiative, using the SWRC vocabulary, aims at creating a complete Linked Data repository of metadata of papers submitted to conferences associated with the Semantic Web domain. Our endeavour follows a similar approach, collecting publication data from relevant scientific venues in the field of Learning Analytics (even using the same or mapped vocabularies). In contrast to SWDF or other highly related works such as [10], we aim to enable analyses not only of the metadata but also of the actual paper content, by providing access to the full text of papers and linking them to other, complementary sources of data.

The LAK Dataset -Content, Scope and Maintenance
While we also offer regularly updated dumps (RDF/XML, N-Triples and R 19 ), here we specifically discuss the RDF dataset and SPARQL endpoint, accessible as described in Table 1.  (3) and (4), we have negotiated a formal agreement with ACM to publish, share and enable reuse of the data. We are currently in discussions to decide on a suitable licence and will update the data and respective metadata on the website and our entries in dataset registries such as the DataHub accordingly.

Creation, maintenance & sustainability
The knowledge extraction process implemented to transform unstructured publications into structured data is composed of three main steps: (1) transforming PDF to a plain textual representation, (2) preprocessing, clean-up and consolidation of the textual information, and (3) lifting the data into an RDF schema (Section 4.1). Given the inherent differences in the structure of papers across the different venues, the extraction had to be tailored to each publication origin. Additional issues arose from papers not complying entirely with the suggested layout, requiring several improvement iterations. Further details are provided in 0. At this stage the full text has been extracted without further considering its structure, while ongoing work is concerned with further structuring the text body. Literature references are also extracted and made available in order to support scientometrics based on co-citation networks.
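To make steps (2) and (3) of this process more concrete, the following is a minimal sketch and not the pipeline actually used for the dataset. It assumes the PDF has already been converted to plain text (step 1), and the predicate chosen for the text body (bibo:content), the example paper URI and the function names are illustrative assumptions.

```python
import re

DCT_TITLE = "http://purl.org/dc/terms/title"
# Assumption: bibo:content is used here only to illustrate a full-text predicate.
BIBO_CONTENT = "http://purl.org/ontology/bibo/content"


def clean_text(raw: str) -> str:
    """Step 2: undo line-break hyphenation and normalise whitespace."""
    # Join words split across lines, e.g. "facili-\ntates" -> "facilitates".
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse all remaining whitespace runs (including newlines) to one space.
    return re.sub(r"\s+", " ", joined).strip()


def escape_literal(value: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def lift_to_ntriples(paper_uri: str, title: str, body: str) -> list:
    """Step 3: emit one N-Triples line per extracted feature."""
    rows = [(DCT_TITLE, title), (BIBO_CONTENT, body)]
    return [f'<{paper_uri}> <{pred}> "{escape_literal(val)}" .' for pred, val in rows]


# Hypothetical input, standing in for the output of a PDF-to-text tool.
raw_body = "Optimisation of the pipeline facili-\ntates   extraction\nof new papers."
body = clean_text(raw_body)
triples = lift_to_ntriples("http://example.org/paper/1", 'A "sample" paper', body)
for line in triples:
    print(line)
```

Note that this simple de-hyphenation rule would also merge genuine compounds that happen to break at a line end (such as "co-" followed by "citation"), which is one reason why venue-specific tailoring and several improvement iterations can be needed in practice.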
Given the nature of the dataset, new publications are added continuously as these become available, i.e. whenever new proceedings or journal issues of the reflected series are published. Optimisation of the processing pipeline throughout previous years facilitates a straightforward and efficient extraction process for new publications.
The ongoing maintenance of the dataset is carried out as a collaborative activity of all partners including the authors of this paper and their institutions, as well as SoLAR, being one of the central organisations driving the advancement of the LA discipline. Maintenance is not only carried out at the data or instance level, but also with respect to the actual ontology and its alignment with other vocabularies, e.g. by frequently adding new alignments with emerging vocabularies.

Schema
For each publication the following features are extracted: title, authors, keywords, abstract, text body, references, publication venue (journal/conference proceedings). To ensure wide interoperability of the data, we have adopted Linked Data best practices 21 and investigated widely used vocabularies for the annotation of the involved concepts, as discussed in Section 2. Preliminary work in 0 investigated the most frequent schemas, particularly for educational datasets, and additionally Linked Data vocabulary usage statistics 22 have been investigated. While the scope of our data model is not covered by a single vocabulary alone, we have opted for using established vocabularies for each specific type and predicate and included mappings between the chosen vocabularies as well as other overlapping ones. The schema is accessible at http://lak.linkededucation.org/schema/lak.rdf 23 .
The majority of schema elements are based on BIBO, FOAF 24 , SWRC and Schema.org, as reported in Table 2. Since SWRC showed a high overlap with the conceptual model of our dataset, it was used as a starting point and gradually expanded with additional elements to fully represent the data model of the LAK dataset. The choice of vocabulary terms was influenced by the Web-wide adoption and maturity of the used schemas and their overlap with our data model. The combination of terms led to the emergence of new type and predicate mappings, which have been represented as explicit mappings using the predicates owl:equivalentClass and owl:equivalentProperty together with type and property inheritance statements. Mappings rely largely on established recommendations from the vocabulary owners, such as the BIBO/schema.org mappings recommended by schema.org 25 .
21 http://www.w3.org/TR/ld-bp/#VOCABULARIES
22 http://stats.lod2.eu/stats
23 While this URL always refers to the latest version of the schema, current and previous versions are also accessible,
Mappings were evaluated for consistency with the involved schemas (using the HermiT reasoner 26 ). Table 4 provides a general overview of the number of represented entities per type in the LAK dataset. Table 5 summarizes the most frequently populated properties.
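As a rough illustration of how explicit owl:equivalentClass statements give rise to further, implicit mappings, the sketch below serialises a few mappings and groups all terms that become connected through them. The listed pairs are illustrative assumptions and are not copied from the published LAK schema; the grouping is a plain union-find over the statements and is in no way a substitute for the consistency check with a DL reasoner such as HermiT.

```python
OWL_EQ_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"

# Illustrative pairs only; not taken from the LAK schema itself.
MAPPINGS = [
    ("http://swrc.ontoware.org/ontology#Person", "http://xmlns.com/foaf/0.1/Person"),
    ("http://xmlns.com/foaf/0.1/Person", "http://schema.org/Person"),
    ("http://purl.org/ontology/bibo/Article", "http://schema.org/ScholarlyArticle"),
]


def mapping_triples(pairs):
    """Serialise each mapping as one N-Triples statement."""
    return [f"<{a}> <{OWL_EQ_CLASS}> <{b}> ." for a, b in pairs]


def equivalence_groups(pairs):
    """Group terms that are connected by equivalence statements (union-find)."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), set()).add(term)
    # Sort for a deterministic result.
    return sorted(groups.values(), key=lambda group: sorted(group))


groups = equivalence_groups(MAPPINGS)
for group in groups:
    print(sorted(group))
```

In this toy input, swrc:Person and schema:Person end up in the same group although no statement links them directly, which is the kind of emergent cross-vocabulary mapping that has to be checked for conflicts.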

Inter-Dataset Links
While bibliographic metadata is widespread in the LOD graph, our interlinking efforts have focused on co-reference resolution across entities such as authors, publications and organisations. Given that LAK is considered a sub-discipline of Computer Science (CS), we have particularly considered the datasets DBLP and Semantic Web Dog Food. While DBLP allows us to link authors to their corresponding representation in a more exhaustive bibliographic CS knowledge base, Semantic Web Dog Food has been particularly useful for relating equivalent organisations, given its strong overlap with the LAK Dataset with respect to authors' affiliations. All considered datasets complement each other with respect to their schema, i.e. the expressed properties and conceptual model, as well as their population, i.e. the number of distinct entities actually represented within each dataset. While the LAK Dataset has a high depth with respect to the represented properties and features, even including references and the textual body of publications in contrast to most bibliographic databases, it has a fairly narrow scope, focusing entirely on specific CS subjects (Learning Analytics and Educational Data Mining). Coreference resolution of entities, for instance authors, with broader bibliographic knowledge bases provides a more complete view of the work of individual authors or organisations and of the CS community as a whole. Similarly, the LAK Dataset complements existing corpora by (a) enriching the limited metadata with additional properties and (b) containing additional publications not reflected in DBLP or Semantic Web Dog Food, creating a more comprehensive knowledge graph of Computer Science literature as a whole. For instance, LAK publications are not exhaustively represented in DBLP and Semantic Web Dog Food, references and full text are missing in both cases and, in the case of DBLP, affiliations are not reflected as explicit entities.
While the overlap among authors in LAK and Semantic Web Dog Food has been less prominent, the majority of authors could be resolved using DBLP. Such links enable a broader understanding of the general scientific output of LAK researchers. For establishing coreferences, literals (foaf:name, dc:title) of entities in all three datasets have been matched. To improve recall and cater for different representations, some preprocessing was applied to address issues with character codes and distinct naming conventions.
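The literal-based matching with preprocessing described above could look roughly as follows. This is a sketch under assumptions, not the matching procedure used for the dataset: the normalisation rules, the identifiers and the sample names are made up for illustration, and only unambiguous exact matches on the normalised name are kept.

```python
import re
import unicodedata


def normalise_name(name: str) -> str:
    """Bring different spellings of a person name into one comparable form."""
    # Naming conventions: reorder "Last, First" into "First Last".
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    # Character codes: decompose accented characters and drop combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", stripped.lower())
    return " ".join(cleaned.split())


def match_entities(source: dict, target: dict) -> list:
    """Return (source_id, target_id) pairs whose normalised names are equal."""
    index = {}
    for uri, name in target.items():
        index.setdefault(normalise_name(name), []).append(uri)
    links = []
    for uri, name in source.items():
        candidates = index.get(normalise_name(name), [])
        # Keep only unambiguous matches, trading recall for precision.
        if len(candidates) == 1:
            links.append((uri, candidates[0]))
    return links


# Hypothetical foaf:name literals keyed by made-up identifiers.
lak_authors = {"lak:p1": "Müller, Jörg", "lak:p2": "Ana  Garcia"}
dblp_authors = {"dblp:a": "Jorg Muller", "dblp:b": "Ana García", "dblp:c": "A. Garcia"}
links = match_entities(lak_authors, dblp_authors)
print(links)
```

Abbreviated forms such as "A. Garcia" are deliberately left unmatched here; resolving them would raise recall at the cost of possible false links.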
Additional outlinks were created to DBpedia. To allow a more structured retrieval and clustering of publications according to their topic-wise similarity, we have linked keywords, manually provided by paper authors, to their corresponding entities in DBpedia, thereby using DBpedia as a reference vocabulary for paper topic annotations. Keywords, i.e. terms, were disambiguated through state-of-the-art NER (Named Entity Recognition) methods (DBpedia Spotlight). This allows linking keywords such as "educational gaming" to corresponding DBpedia entities such as http://dbpedia.org/resource/Educational_game, an example taken from a particular EDM2014 paper 27 .
The following Figure 2 depicts the links of resolved or enriched LAK entities. With respect to inlinks, the dataset is referenced by the LinkedUp catalog 29 and the majority of its resources are referenced by the Linked Dataset Profiles 30 dataset, further described in [5]. Additional inlinks might have been generated by the works described in [7][8].

Query and exploration
Some example queries 31 which demonstrate the dataset's usefulness with respect to the reported objectives (Section 1) are shown below. The interlinks of the LAK dataset with external datasets support federated queries, combining data about the same entity spread across different sources, for instance papers, authors and properties in LAK, Semantic Web Dog Food and DBLP for one specific academic institution. At the same time, term disambiguation with DBpedia facilitates more precise, entity-based queries, for instance by using disambiguated DBpedia entities when querying for specific topics (Listing 1). The following example shows a federated query executed across the LAK dataset and the DBLP dataset. In this query, the information about a specific paper of the LAK dataset is completed with additional data (DOI, reference to bibsonomy) included in DBLP. Listing 3 shows a query to retrieve influential publications in the LA field by selecting the most cited papers.
27 http://data.linkededucation.org/resource/lak/conference/edm2014/paper/580
28 A high resolution version of this figure is available at: http://lak.linkededucation.org/?page_id=351
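The original listings are not reproduced in this text. As a hedged illustration of the "most cited papers" idea only, the sketch below builds the text of a SPARQL 1.1 aggregate query and evaluates the same aggregation over a handful of in-memory triples. The citation predicate (bibo:cites) is an assumption about how references might be modelled, and the paper identifiers are invented; neither is taken from the dataset or from Listing 3.

```python
from collections import Counter

# Assumption: citations are expressed with bibo:cites; the dataset may differ.
CITES = "http://purl.org/ontology/bibo/cites"


def most_cited_query(limit: int = 10) -> str:
    """SPARQL 1.1 text ranking papers by their number of incoming citations."""
    return (
        "SELECT ?cited (COUNT(?citing) AS ?citations)\n"
        f"WHERE {{ ?citing <{CITES}> ?cited . }}\n"
        "GROUP BY ?cited\n"
        "ORDER BY DESC(?citations)\n"
        f"LIMIT {limit}"
    )


def most_cited(triples, limit: int = 10):
    """The same aggregation over in-memory (subject, predicate, object) tuples."""
    counts = Counter(o for s, p, o in triples if p == CITES)
    return counts.most_common(limit)


# Made-up identifiers standing in for paper URIs.
sample = [
    ("p1", CITES, "p3"),
    ("p2", CITES, "p3"),
    ("p2", CITES, "p1"),
    ("p1", "http://example.org/other", "p2"),
]
ranking = most_cited(sample, limit=2)
print(most_cited_query(limit=2))
print(ranking)
```

The query string could be sent to a SPARQL endpoint with any HTTP or SPARQL client; the in-memory variant is included only so that the logic can be checked without network access.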

Applications, impact & usage
The LAK Dataset has received considerable attention and support from organisations such as SoLAR, which also advertises the dataset for its own purposes 32 . Over the last years, the dataset has become a central resource for researchers in the LA and EDM field and beyond, as documented by a variety of research publications which make use of the data. Including the proceedings [7][8], the authors are already aware of 16 scientific publications 33 which make use of the LAK dataset. While the value of the data for the LA and EDM fields is obvious, the dataset also provides an unprecedented resource for general scientometric investigations, in particular into the evolution of research communities over time, given the almost complete coverage of the entire research corpus of the covered communities.
The dataset also forms the basis of the LAK Data Challenge, organised by the authors and a team of researchers affiliated with SoLAR, LinkedUp 34 and associated organisations as an annual competition (now in its third year). It is co-located with the ACM LAK conferences (LAK2013 35 , LAK2014 36 ) with currently open calls for the 2015 edition, directly supported by the steering board of the LAK conference. While earlier editions of the challenge were held as workshops or tutorials at ACM LAK, the 2015 edition will be embedded into the main conference tracks. Below, we specifically summarise applications and explorations of the dataset developed by third parties as part of the LAK Data Challenge.
The challenge revolves around the overall question of what insights can be gained from analytics on the LAK dataset about the evolution of Learning Analytics as a whole or of individual topics, researchers or organisations, as well as their correlation with other fields. Despite the narrow scope of the data, the variety of the short-listed submissions (so far 13 in total) has been very wide; Figure 3 gives an overview of the involved author origins. While all submissions are notable and in many cases combine features from several categories, we would like to emphasize particularly works which have received recognition beyond the challenge, such as "Cite4Me" [3], or near-complete scientometric environments such as DEKDIV [4] (depicted in Fig. 4). The latter combines a range of features, such as trending topic analysis, co-citation and collaboration analysis with recommendation approaches, for instance to suggest adequate reviewers and experts; Fig. 4 shows the most frequent authors with regard to a specific set of topics.
Next to these applications, the dataset and some of its applications have been endorsed and supported by SoLAR and ACM, where current discussions are geared towards embedding some of the described applications into their more general libraries and platforms. In addition, as a joint activity of the authors and SoLAR, current work aims at expanding the dataset with actual learning analytics research data, i.e. data usually used in the captured publications. The joint vision is to provide a near-complete corpus which offers not just the actual scientific publications in structured formats but also, to a larger extent, the raw research datasets they use. This is meant to further facilitate LA & EDM research and open access to research publications and data in general.

Discussion & Future Work
In this paper, we have presented (a) the LAK Dataset, as a particular resource which enables the exemplary investigation and analysis of the evolution of scientific disciplines and the validation of scientometric methods and tools, and (b) a vocabulary, collection of mappings and linking practices for adoption in similar efforts, towards a wider movement engaging in the publication of open and machine-processable scholarly resources.
While, according to the 5-star classification 39 of LOD and vocabulary use (see also [3]), the LAK Dataset qualifies as a 5-star dataset, there are known shortcomings which the authors are addressing as part of ongoing and future work. The extraction process is not entirely flawless and, depending on the quality of the source PDFs, has in some cases required manual adjustment. Given the particular drawbacks of automated co-reference resolution, we deliberately preferred high precision over recall, to ensure a knowledge graph which is as correct as possible rather than as complete as possible. We are currently looking into more sophisticated entity interlinking methods in order to further increase the linking to related entities in other datasets. In addition, the extraction of references and full text is so far at a preliminary stage, providing both references and text body in a fairly unstructured manner. As part of upcoming releases, references will be extracted in a more structured format, where features are directly lifted into bibliographic metadata properties. Similarly, we are working on providing a more detailed structuring of the text body, applying the Document Components Ontology (DoCO) 40 in order to distinguish different textual components, such as headings, captions or sections.
39 http://www.w3.org/DesignIssues/LinkedData.html
Additional insights were gained from the vocabulary definition process. Given the specific scope of our dataset, covering bibliographic metadata and full text, it has been necessary to combine elements from different, partially overlapping vocabularies. We relied on established vocabularies to represent the different notions involved. Due to cross-vocabulary statements, implicit type and predicate mappings emerged, which were explicitly represented through dedicated mapping statements. Next to these, additional mappings were introduced to ensure wide interoperability of the data. Given the complex relationships emerging from such vocabulary usage, assessing the compliance of newly introduced cross-vocabulary mappings is crucial to eliminate any conflicts. In particular, the evolution of external vocabularies might pose issues, so that continuous monitoring is required to ensure compliance at all times. To this end, the encapsulation of all schema-level statements in our datasets is meant to serve as a starting point for similar efforts, for instance for exposing bibliographic data in other disciplines.
While the LAK Dataset has a fairly well-defined and somewhat narrow scope, covering only literature in a very specific subdiscipline (i.e. LA and EDM), analysis and correlation with bibliographic information in other sources already enables interesting investigations and applications [7] [8]. Given that the actual text body of publications contains substantial information yet is still missing from the majority of bibliographic Linked Data, we would like to encourage work on similar efforts, i.e. the creation of bibliographic datasets containing both metadata and the actual content. In this context, our work provides a set of practices for related efforts in other scientific areas. This would allow a more direct processing and analysis of scientific works across disciplines. Furthermore, applying such approaches to a wider area could contribute to resolving the gap between unstructured and hard-to-process publication formats such as traditional PDFs and structured Linked Data,
40 http://purl.org/spar/doco