Homework – Scientific Argumentation

Assignment 2: Apply the Checklist from Zobel p.49 to “Efficient Graph-Based Document Similarity” by Christian Paul and others.

Hypotheses and Questions

In their work, Christian Paul et al. propose an approach to the efficient computation of document similarity using knowledge graphs. They investigate how document similarity computation can benefit from exploiting hierarchical and transversal relations in such graphs and how the resulting document similarity correlates with that obtained by humans.

As a basis, they consider several hypotheses: (1) semantic graphs contain valuable information about the relationships between entities; (2) graph-based document similarity is more suitable for related-document search than word-distribution-based similarity, because the former leverages the available semantic knowledge; (3) their similarity computation is faster than other graph-based approaches; (4) graph-based similarity measures also work when only a few annotations are available.

However, when considering the contributions given in the abstract, the introduction, and the conclusion, the hypotheses do not seem to be very precise, as they use terms like “efficiently compared to other […] approaches” (end of abstract), “overcome previous […] performance limitations” (end of introduction), and “compete with traditional […] approaches in every aspect” (end of conclusion).

To me, it is also not completely clear whether they intend to convince the reader to use graph-based approaches over distribution-based ones, or whether they want to show that their approach is a lot better than common graph-based approaches. In the introduction they state that distribution-based approaches do not explicitly consider semantic knowledge, which implies that graph-based approaches are better, whereas in the evaluation they conclude that graph-based methods can compete with distribution-based approaches, which gives the impression that they consider distribution-based approaches quite powerful. This argumentation seems a bit inconsistent to me.

The most innovative aspect they propose seems to be the pre-processing step, in which they enrich the document with related semantic elements (Semantic Document Expansion) to avoid graph traversal at search time. They themselves state that this is the main difference from state-of-the-art graph-based approaches: the computational complexity at search time. However, they do not explicitly claim that their approach is novel.

In the section about Semantic Document Expansion (section 2), the authors mention some assumptions they make regarding the knowledge graph. They assume that, for a given annotation, more distant semantic elements are less relevant to the annotation. Furthermore, the more an annotation is connected with a semantic element in the graph, the higher the element is weighted, which also means that it is of greater relevance to the annotation. They also only consider outgoing edges during transversal expansion, assuming that it reduces noise.
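The three assumptions above can be made concrete with a small sketch. This is not the weighting scheme from the paper, which is not reproduced here; the function name `expand_annotation`, the `decay` factor, the depth limit, and the toy graph are all hypothetical choices that merely illustrate distance-based down-weighting, accumulation over multiple connections, and the restriction to outgoing edges.

```python
def expand_annotation(graph, annotation, max_depth=2, decay=0.5):
    """Collect weighted semantic elements around an annotation.

    graph maps each node to the list of nodes its OUTGOING edges point to.
    Assumptions (illustrative, not taken from the paper):
      - each hop multiplies the weight by `decay` (distant = less relevant)
      - weights from several paths to the same element are summed
        (more connections = higher weight)
    """
    weights = {}
    frontier = {annotation: 1.0}
    for _ in range(max_depth):
        next_frontier = {}
        for node, weight in frontier.items():
            # Only outgoing edges are followed; incoming edges are ignored.
            for neighbor in graph.get(node, ()):
                if neighbor == annotation:
                    continue
                next_frontier[neighbor] = next_frontier.get(neighbor, 0.0) + weight * decay
        for node, weight in next_frontier.items():
            weights[node] = weights.get(node, 0.0) + weight
        frontier = next_frontier
    return weights


# Hypothetical toy graph: X points to A, but A has no edge to X.
toy_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "F"],
    "X": ["A"],
}
weights = expand_annotation(toy_graph, "A")
```

In this toy example, `F` (two hops, one path) ends up with a lower weight than `B` (one hop), `D` (two hops, two paths) outweighs `F`, and `X` is never reached because only outgoing edges are followed.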

They do not critically question their own work. They rarely discuss the limitations of their approach and focus only on how they outperform traditional methods. Aspects that may influence the measures are not pointed out, and I missed an explanation of what the correlation measure means and how it is to be interpreted. They mention the “correlation with human notions of document similarity” only once in the abstract and once before the evaluation, but do not describe how they obtained the human notion.

Evidence and Measurement

The authors use experiments as evidence, reporting the quantitative measures Pearson and Spearman correlation, Normalized Discounted Cumulative Gain, and operation time to support their aim of improving graph-based computation of document similarity. However, the operation times are not meaningful, because they are reported without reference to the results of state-of-the-art approaches, making a comparison impossible. In contrast, as Paul et al. state in their paper, the correlation metrics are also used for evaluation in related work, which makes their approach directly comparable to the others. Therefore, these measures seem to be appropriate.
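Since the paper does not explain how these measures are to be read, a minimal sketch of their standard textbook definitions may help. The function names and the example scores below are illustrative assumptions, not data or code from the paper; the NDCG variant shown uses the common `rel / log2(position + 1)` discount, which may differ from the variant the authors used.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the LINEAR relationship between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (monotonic agreement)."""
    return pearson(ranks(x), ranks(y))


def ndcg(relevances, k=None):
    """NDCG for graded relevances listed in the order the system ranked them."""
    k = len(relevances) if k is None else k

    def dcg(rels):
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical example: human similarity ratings vs. system scores for four document pairs.
human_scores = [1.0, 2.5, 4.0, 4.5]
system_scores = [0.1, 0.4, 0.5, 0.9]
r = pearson(human_scores, system_scores)
rho = spearman(human_scores, system_scores)
```

A value near 1 means the system orders (Spearman) or scales (Pearson) document pairs much as the human raters do; NDCG equals 1 only when the system's ranking places the most relevant documents first.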

The proposed method relies on the document being annotated. If such annotations are chosen badly, their approach may fail, which could be one of its weaknesses. They compare the computed document similarity with that obtained by humans. This measure lacks accuracy, as there are many different ways humans may assess document similarity, which makes it to some extent a subjective measure. They also use a benchmark data set for evaluation, which makes it difficult to make predictions about the general benefit of their approach for document similarity computation. The argument that links the evidence to the hypothesis is the comparison of their work to state-of-the-art approaches by using the proposed correlation measure.

In general, I am a little bit suspicious of their overall positive results, because there is not one single measure or data set where their approach does not work at least as well as related approaches. Together with the fact that they do not talk about the limitations or where their approach can be advanced, this makes me lose trust in their work.

Visual Analytics of Cohort Study Data – Abstract

Epidemiology studies health-related conditions in a population to derive disease-specific risk factors. Hence, groups of individuals are observed in terms of various aspects. Analysis of the resulting cohort study data is conventionally performed using strongly hypothesis-driven statistical approaches. Novel insights into complex correlations can be obtained by introducing visual analytics for explorative analysis to derive new hypotheses. In this paper, we describe recent advances in applying visual analytics techniques to the analysis of cohort study data. In addition, we identify analysis tasks that need to be addressed in order to support the epidemiological workflow. We classify the techniques according to their analysis focus and the data types they support. Finally, we identify remaining challenges that need further research.

Homework – Scientific English

Assignment 2: Punctuation Game – Put in the missing punctuation marks.

  • We live in the era of Big Data, with storage and transmission capacity measured not just in terabytes, but in petabytes (where peta- denotes a quadrillion, or a thousand trillion). Data collection is constant and even insidious – with every click and every “like” stored somewhere for something. This book reminds us that data is anything but “raw”, that we shouldn’t think of data as a natural resource, but as a cultural one that needs to be generated, protected, and interpreted. The book’s essays describe eight episodes in the history of data, from the predigital to the digital. Together, they address such issues as the ways that different kinds of data and different domains of inquiry are mutually defining, how data are variously “cooked” in the processes of their collection and use, and conflicts over what can or can’t be “reduced” to data. Contributors discuss the intellectual history of data as a concept, describe early financial modeling and some unusual sources for astronomical data, discover the prehistory of the database in newspaper clippings and index cards, and consider contemporary “dataveillance” of our online habits as well as the complexity of scientific data curation.
  • During succession, ecosystem development occurs; but in the long-term absence of catastrophic disturbance, a decline phase eventually follows. We studied six long-term chronosequences, in Australia, Sweden, Alaska, Hawaii, and New Zealand; for each, the decline phase was associated with a reduction in tree basal area and an increase in the substrate nitrogen-to-phosphorus ratio, indicating increasing phosphorus limitation over time. These changes were often associated with reductions in litter decomposition rates, phosphorus release from litter, and biomass and activity of decomposer microbes. Our findings suggest that the maximal biomass phase reached during succession cannot be maintained in the long-term absence of major disturbance, and that similar patterns of decline occur in forested ecosystems spanning the tropical, temperate, and boreal zones.
  • Facebook’s Graph API is an API for accessing objects and connections in Facebook’s social graph. To give some idea of the enormity of the social graph underlying Facebook, it was recently announced that Facebook has 901 million users – and the social graph consists of many types beyond just users. Until recently, the Graph API provided data to applications in only a JSON format. In 2011, an effort was undertaken to provide the same data in a semantically enriched RDF format containing Linked Data URIs. This was achieved by implementing a flexible and robust translation of the JSON output to a Turtle output. This paper describes the associated design decisions, the resulting Linked Data for objects in the social graph, and known issues.

Homework – Research, References, and Citation

Assignment 2: Find at least 5 important literature references for your student project topic. Write down the full bibliographical record of the reference and argue why you picked it. Keep in mind the quality criteria we discussed in this session.

[1] Martijn D. Steenwijk, Julien Milles, MA. Buchem, JH. Reiber, and Charl P. Botha. Integrated Visual Analysis for Heterogeneous Datasets in Cohort Studies. In IEEE VisWeek Workshop on Visual Analytics in Health Care, Volume 3, 2010.

Steenwijk et al. suggest that visual analytics can be used to investigate population data, in particular when no starting hypothesis is available. They propose a visual analysis framework that supports exploration of cohort study data using multiple coordinated views. They cover multi-timepoint, image, and non-image data and describe how they combine various visual representations and interaction techniques into an analysis workflow. Therefore, they directly address the survey topic.

[2] Henry Völzke, Dietrich Alte, Carsten Oliver Schmidt, Dörte Radke, Roberto Lorbeer, Nele Friedrich, Nicole Aumann, Katharina Lau, Michael Piontek, Gabriele Born, et al. Cohort Profile: The Study of Health in Pomerania. International Journal of Epidemiology, 40(2):294-307, 2011.

This journal paper deals with a cohort study performed in the years 1997 to 2012. One of its main objectives is to investigate the associations among risk factors and clinical diseases. The number of authors from various domains indicates the interdisciplinary importance of the study, and 387 citations show that it is often referred to as an exemplary cohort study.

[3] Robert H. Fletcher, Suzanne W. Fletcher, and Grant S. Fletcher. Clinical Epidemiology: the Essentials. Lippincott Williams & Wilkins, 2012.

3072 citations show that this book is often considered for gaining a fundamental understanding of the field, which is also essential for me as my survey addresses a methodology that is meant to support epidemiologists in their work.

[4] Zhiyuan Zhang, David Gotz, and Adam Perer. Iterative Cohort Analysis and Exploration. Information Visualization, page 1473871614526077, 2014.

Zhang et al. introduce an environment for interactive exploratory analysis of population data that helps to speed up and simplify analysis processes for domain experts. They extensively describe the needs of real-world clinical domain experts and how their system design is tuned to these requirements. This paper contributes to the topic because it introduces a complete system for interactive analysis of cohort study data. They focus less on the visualizations themselves than on a sound, forward-looking system design, which approaches the survey topic from another direction.

[5] Paul Klemm, Steffen Oeltze-Jafra, Kai Lawonn, Katrin Hegenscheid, Henry Völzke, and Bernhard Preim. Interactive Visual Analysis of Image-Centric Cohort Study Data. IEEE Transactions on Visualization and Computer Graphics, 20(12):1673-1682, 2014.

Klemm et al. propose a web-based interactive visual analysis framework that allows for comprehensive investigation and exploration of cohort study data, with a special focus on image-based data. They provide extensive functionality and an evaluation using a high-dimensional data set, which makes the paper highly relevant for a survey. As far as I know, TVCG is a highly regarded journal, which speaks for the quality of the paper.

[6] Paolo Angelelli, Steffen Oeltze-Jafra, Judit Haasz, Cagatay Turkay, Erlend Hodneland, Arvid Lundervold, Astri J. Lundervold, Bernhard Preim, and Helwig Hauser. Interactive Visual Analysis of Heterogeneous Cohort Study Data. IEEE Computer Graphics and Applications, (5):70-82, 2014.

Angelelli et al. contribute to the generation and validation of new hypotheses during analysis of cohort studies. For this purpose, they introduce a cube-based data representation that allows seamless integration of heterogeneous data and describe how they use it for visual analysis. The presented methodology contributes to the analysis of cohort study data using visual analytics and is therefore relevant for the survey paper.

[7] Paul Klemm, Kai Lawonn, Sylvia Glaßer, Uli Niemann, Katrin Hegenscheid, Henry Völzke, and Bernhard Preim. 3D Regression Heat Map Analysis of Population Study Data. IEEE Transactions on Visualization and Computer Graphics, 22(1):81-90, 2016.

Klemm et al. suggest using statistical regression for the analysis of epidemiological data and provide a three-dimensional heat map representation of relationships that allows interactive analysis of large feature sets with respect to a certain target disease. As above, the reputation of the journal indicates a high-quality paper.

Writing Prompt #3

A man jumps off the roof of a 40-story building. As he passes the 28th floor he hears the mobile ringing in his pocket. He regrets having jumped. Why?

The man is a stuntman, preparing for a big role in the latest movie. He is practicing for the key scene, in which he jumps off the building after having lost his wife, dog, and job.

The man calling him on his way down is one of the technicians, who tells him that the delivery of the mattresses that should have been piled up on the street in front of the building is delayed and that the truck with the mattresses will not reach the building in time. He also says that the firefighters they called instead to bring a rescue sheet were not available due to an office outing. Meanwhile, the stuntman has reached the third floor and, in expectation of the final collision, closes his eyes…

Homework – Interstellar Travel

Assignment 1: Read chapter 3 on “Reading and Reviewing” in “Writing for Computer Science”.


Assignment 2: Read the article “Warp Drive Research Key to Interstellar Travel” in the Scientific American Blog. Write a summary for this article (~500 words).

The article “Warp Drive Research Key to Interstellar Travel”, written by Mark Alpert and published in the Scientific American Blog in 2014, deals with the possibilities that successful warp-drive research would offer for interstellar travel and the difficulties of getting there.

Taking the invention of a warp-drive engine in Star Trek as an example, Harold White has initiated a tabletop experiment to investigate the feasibility of creating a real warp-drive engine that would allow overcoming the physical laws that prohibit interstellar travel.

Faced with this immense challenge, other scientists have diverging opinions on how realistic such projects are. Nevertheless, surprisingly many engineers and amateurs believe in the plausibility of interstellar travel, which has led to academic contributions and the foundation of various organizations. The idea of probes exploring interstellar space has gained new significance with the detection of Earthlike planets orbiting stars relatively close to our sun.

A recent achievement towards interstellar travel was made by NASA, whose probe Voyager 1 entered interstellar space in 2012. However, at its current speed it would take Voyager 1 about 70,000 years to reach stars that might harbor habitable planets.

Besides hypothetical technologies like warp-drive, scientists focus on technologies such as fusion power. If the energy could be properly controlled, probes like Voyager 1 could travel in space thousands of times faster. However, up to now, there has not been much success with building a fusion power plant, let alone an engine that could be installed in a spacecraft.

Interstellar dust grains also complicate travel in interstellar space, as they cause significant damage to a probe when hit at millions of miles per hour. Heavy shielding at this point would increase the amount of fuel needed. Slowing down before the destination is reached is another issue, which could be solved by firing the engines in the opposite direction. This would also increase the required load of fuel.

Despite all the challenges and the immaturity of such missions, people hold on to the dream of interstellar travel. This is reflected by recent academic conferences as well as by advocates who even state that the exploration of other star systems is indispensable for humanity’s long-term survival, since a planetary catastrophe could always wipe out a humanity confined to Earth.

Homework – Structure of a Scientific Manuscript

Assignment 1: Read “Getting Started” in “Writing for Computer Science”.

Done. I like Justin Zobel’s style of writing, just to mention the “pleasure to read” once again ;-).

Assignment 2: Find appropriate titles for the given abstracts.

A Survey on Urban Internet of Things: Recent Advances and Challenges

The Internet of Things (IoT) shall be able to incorporate transparently and seamlessly a large number of different and heterogeneous end systems, while providing open access to selected subsets of data for the development of a plethora of digital services. Building a general architecture for the IoT is hence a very complex task, mainly because of the extremely large variety of devices, link layer technologies, and services that may be involved in such a system. In this paper, we focus specifically to an urban IoT system that, while still being quite a broad category, are characterized by their specific application domain. Urban IoTs, in fact, are designed to support the Smart City vision, which aims at exploiting the most advanced communication technologies to support added-value services for the administration of the city and for the citizens. This paper hence provides a comprehensive survey of the enabling technologies, protocols, and architecture for an urban IoT. Furthermore, the paper will present and discuss the technical solutions and best-practice guidelines adopted in the Padova Smart City project, a proof-of-concept deployment of an IoT island in the city of Padova, Italy, performed in collaboration with the city municipality.

Using Multi-Objective Optimization for String Test Case Generation

String test cases are required by many real-world applications to identify defects and security risks. Random Testing (RT) is a low cost and easy to implement testing approach to generate strings. However, its effectiveness is not satisfactory. In this research, black-box string test case generation methods are investigated. Two objective functions are introduced to produce effective test cases. The diversity of the test cases is the first objective, where it can be measured through string distance functions. The second objective is guiding the string length distribution into a Benford distribution based on the hypothesis that the population of strings is right-skewed within its range. When both objectives are applied via a multi-objective optimization algorithm, superior string test sets are produced. An empirical study is performed with several real-world programs indicating that the generated string test cases outperform test cases generated by other methods.
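To make the two objectives in this abstract more tangible, here is a minimal sketch. It assumes Levenshtein (edit) distance as the string distance function and compares the leading digits of the string lengths against Benford's law; both choices, and the names `diversity` and `benford_deviation`, are illustrative assumptions, since the abstract does not say which distance function or which mapping from lengths to a Benford distribution the authors use.

```python
import math
from collections import Counter
from itertools import combinations


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


def diversity(strings):
    """Objective 1 (assumed form): mean pairwise edit distance, higher is more diverse."""
    pairs = list(combinations(strings, 2))
    if not pairs:
        return 0.0
    return sum(levenshtein(a, b) for a, b in pairs) / len(pairs)


def benford_probability(digit):
    """Benford's law: probability that the leading digit equals `digit` (1..9)."""
    return math.log10(1 + 1 / digit)


def benford_deviation(strings):
    """Objective 2 (assumed form): total absolute gap between the observed
    leading-digit frequencies of the string lengths and Benford's law; lower is better."""
    lengths = [len(s) for s in strings if s]
    if not lengths:
        raise ValueError("need at least one non-empty string")
    counts = Counter(int(str(length)[0]) for length in lengths)
    return sum(
        abs(counts.get(d, 0) / len(lengths) - benford_probability(d))
        for d in range(1, 10)
    )
```

A multi-objective optimizer would then search for test sets that maximize `diversity` while minimizing `benford_deviation`; the optimizer itself is outside the scope of this sketch.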

Assignment 3: Choose a topic for your student project.

Here it is: A Survey on Visual Analytics of Cohort Study Data.