Homework #7

Regarding hypotheses and questions

  • What phenomena or properties are being investigated? Why are they of interest?

The possibility of density extrapolation using a newly proposed algorithm is investigated. Density extrapolation is a promising topic in fields where future states of a distribution (e.g. demography: the age structure of Germany in 20 years) or point predictions with a detailed statement about their uncertainty (e.g. economics: prediction of exchange rates) are of interest.

  • Has the aim of the research been articulated? What are the specific hypotheses and research questions? Are these elements convincingly connected to each other?

The aim is to propose a new density extrapolation algorithm and make statements about its properties, abilities and limitations. The research question is to what extent a new approach based on position-extrapolated pseudo points is able to predict densities at future time points. The hypotheses are:

  • The proposed method is able to model changes in the position, the variance and the mixing proportion of Gaussian mixture components, which were used to generate the data.
  • The proposed method makes more accurate predictions than state-of-the-art algorithms.
  • The proposed method is faster than state-of-the-art algorithms.

  • To what extent is the work innovative? Is this reflected in the claims?

The work proposes an algorithm for the task of density extrapolation, a topic that is relevant in practice and for which, until now, only a single algorithm with strong limitations can be found in the literature.

  • What would disprove the hypothesis? Does it have any improbable consequences?
  • (1): A simulation on artificial data generated from a Gaussian mixture model could disprove this hypothesis. The consequence would be the knowledge that the algorithm does not work properly even in this simple setting and is therefore useless for further application and investigation.
  • (2) and (3): These hypotheses could be disproved by experiments in which the goodness of fit of the predictions and the execution time are measured.

  • What are the underlying assumptions? Are they sensible?

Perhaps the most important assumption is that changes in the data occur gradually. For many practical applications, this is a realistic assumption.

Further assumptions are that the data can be represented by a Gaussian mixture model and that changes in the data can be modeled by components moving along polynomial trajectories. It is not critical if these two assumptions do not hold: the algorithm will still work, but the resulting predictions may be less accurate.
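
To illustrate this assumption, here is a minimal sketch (in Python, not the actual algorithm of the thesis) of how a component's position could be extrapolated along a polynomial trajectory fitted to its past positions; the time points and observed means are made-up example values.

    import numpy as np

    # Hypothetical history of one mixture component's mean at time steps t = 0..4
    times = np.array([0, 1, 2, 3, 4], dtype=float)
    means = np.array([0.0, 0.9, 2.1, 2.9, 4.2])  # assumed example positions

    # Fit a low-degree polynomial to the past positions ...
    coeffs = np.polyfit(times, means, deg=2)

    # ... and evaluate it at a future time point to obtain the extrapolated position.
    t_future = 6.0
    predicted_mean = np.polyval(coeffs, t_future)
    print(f"Predicted component mean at t = {t_future}: {predicted_mean:.2f}")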

  • Has the work been critically questioned? Have you satisfied yourself that it is sound science?

Not sure what to answer here…

Regarding evidence and measurement

  • What forms of evidence are to be used? If it is a model or a simulation, what demonstrates that the results have practical validity?

In order to validate hypothesis (1), I performed simulations on different artificial data sets, each focusing on a different type of change. They aim to check that the algorithm works in the expected manner and that further experiments on real-world data are worth performing. As this was the case, I also performed experiments on three different real-world data sets.
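
As a hedged illustration of such an artificial data set (the concrete parameters here are assumptions for the example, not the ones used in the experiments), one could draw samples from a two-component Gaussian mixture whose position, variance, and mixing proportion drift gradually over time:

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_at_time(t, n=500):
        """Draw n points from a two-component mixture whose parameters change with t."""
        weight_1 = 0.3 + 0.05 * t               # mixing proportion of component 1 grows
        mean_1, mean_2 = -2.0 + 0.5 * t, 3.0    # component 1 moves to the right
        std_1, std_2 = 1.0, 1.0 + 0.1 * t       # component 2 spreads out
        from_1 = rng.random(n) < weight_1
        return np.where(from_1,
                        rng.normal(mean_1, std_1, n),
                        rng.normal(mean_2, std_2, n))

    snapshots = [sample_at_time(t) for t in range(5)]  # one batch per time step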

  • How is the evidence to be measured? Are the chosen methods of measurement objective, appropriate, and reasonable?

To quantify the goodness of fit of the predictions, I used two measures: the Monte-Carlo Kullback-Leibler divergence and the mean absolute difference between the true and the predicted density. I also measured the execution times of the different algorithms during the learning phase as well as the application phase.
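
A minimal sketch of how these two measures could be computed, assuming the true and predicted densities are available as callables (true_pdf and pred_pdf are placeholder names, and the two normal densities are just example stand-ins):

    import numpy as np
    from scipy.stats import norm

    true_pdf = norm(loc=0.0, scale=1.0).pdf   # assumed "true" density
    pred_pdf = norm(loc=0.2, scale=1.1).pdf   # assumed predicted density

    # Monte-Carlo estimate of KL(true || predicted): average log-ratio over
    # samples drawn from the true distribution.
    samples = np.random.default_rng(0).normal(0.0, 1.0, 10_000)
    kl_mc = np.mean(np.log(true_pdf(samples) / pred_pdf(samples)))

    # Mean absolute difference between the two densities on an evaluation grid.
    grid = np.linspace(-5.0, 5.0, 1000)
    mad = np.mean(np.abs(true_pdf(grid) - pred_pdf(grid)))

    print(f"Monte-Carlo KL divergence: {kl_mc:.4f}, mean absolute difference: {mad:.4f}")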

  • What are the qualitative aims, and what makes the quantitative measures you have chosen appropriate to those aims?

This seems rather straightforward to me. I compared the true and the predicted density, as the goal of density extrapolation is to make predictions that are as close as possible to reality. I measured the execution times to determine which algorithm is the fastest in learning its model and in making predictions.

  • What compromises or simplifications are inherent in your choice of measure?

I don’t know what to answer here…

  • Will the outcomes be predictive?

?

  • What is the argument that will link the evidence to the hypothesis?
  • If the Monte-Carlo Kullback-Leibler divergence is close to zero in the simulation, the algorithm is able to reflect the changes in the data.
  • If the mean absolute difference between the true and the predicted density is smaller for the proposed method than for the reference algorithms, its predictions are more accurate.
  • If the execution time is lower for the proposed method than for the reference algorithms, it is faster.

  • To what extent will positive results persuasively confirm the hypothesis? Will negative results disprove it?

As the number of data sets used is limited, the results can never fully confirm or disprove the hypotheses.

  • What are the likely weaknesses of or limitations to your approach?

It can hardly model uniform distributions.

Homework – Punctuation Game

  • We live in the era of Big Data, with storage and transmission capacity measured not just in terabytes but in petabytes (where peta- denotes a quadrillion, or a thousand trillion). Data collection is constant and even insidious, with every click and every “like” stored somewhere for something. This book reminds us that data is anything but “raw”; that we shouldn’t think of data as a natural resource but as a cultural one that needs to be generated, protected, and interpreted. The book’s essays describe eight episodes in the history of data from the predigital to the digital. Together they address such issues as the ways that different kinds of data and different domains of inquiry are mutually defining; how data are variously “cooked” in the processes of their collection and use; and conflicts over what can or can’t be “reduced” to data. Contributors discuss the intellectual history of data as a concept, describe early financial modeling and some unusual sources for astronomical data, discover the prehistory of the database in newspaper clippings and index cards, and consider contemporary “dataveillance” of our online habits as well as the complexity of scientific data curation.
  • During succession, ecosystem development occurs, but in the long-term absence of catastrophic disturbance a decline phase eventually follows. We studied six long-term chronosequences in Australia, Sweden, Alaska, Hawaii, and New Zealand; for each, the decline phase was associated with a reduction in tree basal area and an increase in the substrate nitrogen-to-phosphorus ratio, indicating increasing phosphorus limitation over time. These changes were often associated with reductions in litter decomposition rates, phosphorus release from litter, and biomass and activity of decomposer microbes. Our findings suggest that the maximal biomass phase reached during succession cannot be maintained in the long-term absence of major disturbance and that similar patterns of decline occur in forested ecosystems spanning the tropical, temperate, and boreal zones.
  • Facebook’s Graph API is an API for accessing objects and connections in Facebook’s social graph. To give some idea of the enormity of the social graph underlying Facebook, it was recently announced that Facebook has 901 million users, and the social graph consists of many types beyond just users. Until recently, the Graph API provided data to applications in only a JSON format. In 2011, an effort was undertaken to provide the same data in a semantically enriched RDF format containing Linked Data URIs. This was achieved by implementing a flexible and robust translation of the JSON output to a Turtle output. This paper describes the associated design decisions, the resulting Linked Data for objects in the social graph, and known issues.

Tricky task. I wasn’t sure at all.

Homework – Active Learning references

While doing this homework, I realized once again how annoying it is to research a topic in computer science whose name yields thousands of results in the education literature. Nevertheless, I encountered several interesting works for my survey.

 

[1] Settles, Burr. “Active learning literature survey.” Technical report, University of Wisconsin–Madison, 2010.

[2] Settles, Burr. “Active learning.” Synthesis Lectures on Artificial Intelligence and Machine Learning 6.1 (2012): 1-114.

I picked these two references because Burr Settles is one of the best-known researchers in Active Learning and has invested a great deal of work in summarizing the challenges and achievements of this field. His survey is a good starting point for a literature search, because he includes many references to related papers in its different sections.

With our discussion from last week in mind, it is funny that his technical report [1] has been cited 1714 times, while his book [2], containing nearly the same content, has only 248 citations.

[3] Lewis, David D., and William A. Gale. “A sequential algorithm for training text classifiers.” Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval. Springer-Verlag New York, Inc., 1994.

[4] Seung, H. Sebastian, Manfred Opper, and Haim Sompolinsky. “Query by committee.” Proceedings of the fifth annual workshop on Computational learning theory. ACM, 1992.

Lewis and Gale proposed the widely used and very intuitive Active Learning approach of uncertainty sampling. Their algorithm is commonly used as a baseline in other publications; the conference paper has been cited 1523 times. Seung, Opper, and Sompolinsky developed another popular algorithm, this time from a decision-theoretic perspective, which has been cited 980 times.
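
For readers unfamiliar with the approach, here is a hedged sketch of uncertainty sampling in the spirit of Lewis and Gale [3]: from a pool of unlabeled instances, query the one whose predicted class probability is closest to 0.5 in the binary case. The data and the logistic regression model are toy assumptions, not taken from the paper.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_labeled = rng.normal(size=(20, 2))
    y_labeled = (X_labeled[:, 0] > 0).astype(int)   # toy labels
    X_pool = rng.normal(size=(200, 2))              # unlabeled pool

    model = LogisticRegression().fit(X_labeled, y_labeled)
    probs = model.predict_proba(X_pool)[:, 1]

    # The most "uncertain" instance is the one with probability closest to 0.5.
    query_index = np.argmin(np.abs(probs - 0.5))
    print(f"Query instance {query_index} with p(y=1) = {probs[query_index]:.3f}")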

[5] Lomasky, Rachel, et al. “Active class selection.” Machine Learning: ECML 2007. Springer Berlin Heidelberg, 2007. 640-647.

I also picked this paper because Rachel Lomasky was the first to introduce Active Class Selection as a subfield of Active Learning. The paper explains the basic ideas as well as several possible algorithms and their practical application. As Settles’s book only names this subfield without giving details, Lomasky’s paper supplements my list of references.

Summary of Mark Alpert’s article “Warp Drive Research Key to Interstellar Travel”

Though a warp drive sounds like a Star Trek invention to most people, the scientist Harold “Sonny” White is investigating the possibility of developing a warp-drive engine in the real world. The basic idea of this kind of engine is to distort spacetime along a spacecraft’s path. The consequence would be the ability to overcome physical limits and travel faster than light.

Many physicists laugh at White’s idea, and NASA is spending only an apparently negligible fraction of its budget on his project. Nevertheless, the idea of interstellar flight has become more popular since astronomers discovered a few dozen planets outside our solar system with temperature conditions suitable to support life.

With the current state of technology, reaching planets far away from Earth is impossible simply due to the huge amount of time necessary to get there. NASA’s probe Voyager 1, for instance, left the solar system in 2012, yet it would take at least 70,000 years to reach any habitable planet.

To achieve the breakthroughs necessary to make the dream of a successful interstellar mission by the end of the century come true, space enthusiasts have founded organizations like the 100 Year Space project, the Tau Zero Foundation, and Icarus Interstellar. Most of these scientists might be fascinated by White’s idea of a warp drive as well, but they focus their research on less hypothetical technologies. Icarus Interstellar, for example, investigates the use of fusion power in spacecraft, a technology that would allow speeds thousands of times faster than Voyager 1.

Each new idea comes with endless complications. Interstellar dust, for example, though microscopic, might cause severe damage to craft traveling at high speed. Solving this problem with heavier shielding directly raises the next problem: the amount of fuel needed increases.

In times of fiscal belt-tightening and with a long list of more urgent problems, one might wonder why projects like White’s warp drive or Icarus Interstellar’s fusion-power engine are funded. Though these technologies are still far from working even on Earth, experts consider exploring other star systems essential to humanity’s long-term survival. In the case of planetary catastrophes like nuclear wars or pandemics, we can only hope that White and his colleagues will be successful.

Homework #2

Assignment 1:

Check.

Assignment 2:

  1. Technologies for an Urban Internet of Things and their Application in the Padova Smart City project
  2. String test set generation for black-box testing using multi-objective optimization

Assignment 3:

A hard decision; I would like to write about one of the following two topics:

  1. Active Learning: This still rather unknown field of machine learning is, in my opinion, a very fascinating one. It was introduced to me during my search for a software project, and since then I have read a lot about it, implemented many algorithms, and even presented a new approach at a conference last month. I could focus on the writing process, as I already have a good overview of the topic.
  2. Deep Learning: Deep Learning is one of those buzzwords that pop up and promise to make our lives better. It makes me curious to know what is really behind all this. Furthermore, I have been given the task at work of checking out Deep Learning frameworks. If I chose this topic, the focus would be much more on the literature research, as I’m quite new to the field.

The writing process of my bachelor’s thesis – or strictly speaking: how I imagine it to be

Checking my mail, I also found one from Katrin. With curiosity, I opened the blog link to see what our first writer’s workshop homework would be about.

A text about the writing process of my bachelor’s thesis… In fact, I’m currently at the point where I should start writing my thesis. The implementation is nearly finished, I have read some related papers, and I have collected quite a lot of background knowledge for my algorithm that needs to be written down.

I can imagine that I’m going to like the writing part. To think of words that describe my algorithm in a manner that is easy to understand, words that underline the motivation and support a clear structure of the thesis – I think all of this could be fun. I have always liked explaining things, so why shouldn’t I like writing my thesis?

The first difficulty I’m discovering at the moment is a little bit odd – it’s the difficulty of getting started. I can’t explain at all what the problem is. Maybe I’m just waiting for the morning I get up and find myself in the greatest writing mood. At the same time, I catch myself every once in a while wondering whether my thesis will ever reach 40-50 pages. Back in school, I always wrote maybe a third of the pages my fellow students wrote. I hate to write more than necessary, just as I wouldn’t want to read more than necessary. The only part of my thesis I have already written down is the derivation of my algorithm, which seems to me to be the main contribution of the thesis. And it covers not even two pages.

I really hope that I’m going to be satisfied with the result in the end. As Katrin quoted my expectation in the last session: I hope my thesis is going to be “a pleasure to read”.