Homework – Algorithm description

Because mathematical formatting is difficult inside WordPress, I have also put the text into a PDF for a more professional look and feel: homework_describing_algorithm.pdf

Choose a simple algorithm and a standard description of it, such as linear search
in a sorted array. Rewrite the algorithm in prosecode. Repeat the exercise with a
more interesting algorithm, such as heapsort. Now choose an algorithm with an
asymptotic cost analysis. Rewrite the algorithm as literate code, incorporating
the important elements of the analysis into the algorithm’s description.


1. Prosecode – Simple algorithm (“Linear Search”)

LinearSearch(x, L) takes one element x to search for in a list L.

To do so, it compares each element e of L with x and counts all comparisons until either an element equals x or the end of the list L is reached.

    1. (Initialize) Set i ← 1.
    2. (Search) For each element e_i in L (i = 1, …, n):
      (a) If e_i = x: Stop the search.
      (b) Otherwise continue with the next element.
    3. (Return) Return i.
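
To make the prosecode concrete, here is a minimal Python sketch of it (the function name linear_search and the use of a plain Python loop are my own choices, not part of the original description):

    def linear_search(x, L):
        # Count the comparisons made until x is found
        # or the end of L is reached.
        i = 0
        for e in L:
            i += 1           # one comparison of e with x
            if e == x:
                break        # x found: stop the search
        return i

For example, linear_search(7, [3, 7, 1]) returns 2, because two comparisons are needed to find 7.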

2. Prosecode – More interesting algorithm (“Heapsort”)

Heapsort(L) takes one list L and sorts all elements of L. In the first step all elements are arranged into a heap – a binary tree with the property that every parent node is greater than or equal to each of its children (a max-heap). In the second step, a sorted list is created by repeatedly removing the biggest (root) node from the heap and appending it to the list.

    1. (Heapify) Arrange all elements of the given list L into a heap.
      (a) start is assigned the index in L of the last parent node:
      start ← ⌊n/2⌋ − 1 (with n the number of elements and 0-based indices)
      (b) (Sift down) Compare each parent node with its child nodes and establish heap order:
      For each index i from start down to 0, sift down the element at position i. After sifting down the root, all nodes/elements are in heap order.
    2. (Swap elements) Go through all elements of L backwards, starting with the last element, and swap positions so that the last element is transferred to the starting position (and the current maximum to the end).
      (a) Swap the first element with the last element: swap L[0] and L[end]
      (b) Decrease the size of the (unsorted) list L by 1: end ← end − 1
      (c) Restore the heap property by executing 1.(b) (Sift down) on the root.
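
The following compact Python sketch illustrates this prosecode (the helper name sift_down, the 0-based indices, and the in-place sorting are my own choices):

    def heapsort(L):
        n = len(L)

        def sift_down(start, end):
            # Restore the max-heap property for the subtree rooted at
            # index `start`, looking only at indices <= end.
            root = start
            while 2 * root + 1 <= end:
                child = 2 * root + 1
                if child + 1 <= end and L[child] < L[child + 1]:
                    child += 1                      # take the larger child
                if L[root] < L[child]:
                    L[root], L[child] = L[child], L[root]
                    root = child
                else:
                    return

        # 1. (Heapify) start at the last parent node and sift down each node
        for start in range(n // 2 - 1, -1, -1):
            sift_down(start, n - 1)

        # 2. (Swap elements) move the maximum to the end and shrink the heap
        for end in range(n - 1, 0, -1):
            L[0], L[end] = L[end], L[0]
            sift_down(0, end - 1)
        return L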

3. Literate Code – Algorithm with asymptotic cost analysis (“Quicksort”)

Quicksort(L) sorts a given list L by a divide-and-conquer approach. The data is partitioned recursively into subpartitions by comparing all elements with a so-called pivot element. This pivot element is used to put lower and equal values into one sublist and greater values into the other sublist. A well-chosen pivot element is therefore needed for a balanced partitioning, meaning sublists of (roughly) equal size.

The major steps of the algorithm are as follows:

    1. Pick a pivot element from the list L.
    2. Partition the list into two sublists.
    3. Recursively apply step 1 and step 2.

We now examine these steps in detail, including a cost analysis.

    1. Pick a pivot element from the list L.
      In the best case the pivot element is the median of the given list L. Because we do not know the median of an unsorted list, we have to pick some element. Some implementations of Quicksort take the leftmost element of the given list L. The worst case for this choice occurs if L is already sorted (whether ascending or descending). Therefore we choose a random element of L.
      (a) p ← a randomly chosen element of L
    2. Partition the list into two sublists.
      We take the pivot element p to partition all remaining elements into two sublists. If an element is lower than or equal to p, it is stored in the first sublist L1, otherwise in the second sublist L2.

      L1 = { e ∈ L \ {p} | e ≤ p },  L2 = { e ∈ L \ {p} | e > p }

      Therefore we need as many comparisons as there are elements in L: n


    3. Recursively apply step 1 and step 2.
      Recursively start with step 1, using L1 and L2 as input lists. If a sublist contains just one element, this element is already in its correct place in the sequence of (one-element) sublists and the recursion stops. The overall cost of the recursion itself is its depth: with every recursion step new lists, reduced in size by a factor of two, are created, so there are about log2(n) recursion levels. Since, for every recursion level, the comparisons of all sublists of that level add up to the number of comparisons for the complete list L, we get the same count of comparisons for every level: n. Now we can put things together and get our overall costs: O(n · log2(n)) = O(n log n).
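
A short Python sketch of this literate description (the random pivot and the list-building partition follow the text above; building new lists instead of partitioning in place is my own simplification):

    import random

    def quicksort(L):
        # Step 3: the recursion stops for lists of at most one element.
        if len(L) <= 1:
            return L
        # Step 1: pick a random pivot element p.
        p = random.choice(L)
        # Step 2: partition the remaining elements with one comparison each.
        rest = list(L)
        rest.remove(p)                      # remove one occurrence of the pivot
        L1 = [e for e in rest if e <= p]    # lower or equal values
        L2 = [e for e in rest if e > p]     # greater values
        # Recursive calls; with balanced partitions the expected cost is O(n log n).
        return quicksort(L1) + [p] + quicksort(L2)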

Homework – Good Style

  1. Read the text “Fighting for breath” in Tim Skern’s “Writing Scientific English” (p. 50f) and analyze its adherence to the aspects of good style.

The text “Fighting for breath” describes air pollution and its impact on our health.

My overall impression of the style of the text is good. Of course, there are some minor flaws I might change. For example, the following (long) insertion inside brackets:

(ultraviolet light acts on traffic fumes to produce ozone – a vital filter in the stratosphere but a highly irritant substance when breathed at ground level).

This example severely interrupts my reading flow. One better way could be to remove the additional facts introduced by the dash. Another way might be to present this information in complete, non-bracketed sentences.

The other bracketed information is used in a better manner and is quite acceptable for a popular-science magazine/radio(?) article. For a scientific report, some of these other brackets would also have to be shortened or removed completely.

Another remark about the text: the use of exclamation marks should be avoided. At least in a scientific article, this would not pass quality assurance.

Besides these two points (too heavy use of bracketed information and the use of exclamation marks), I like the text because of its general style of writing.

Homework – Scientific Argumentation

2. Apply the Checklist from Zobel p.49 to the research plan of your (future) Master thesis.
Try to answer all questions. If you feel that a question is inappropriate
explain shortly why this is the case.


Because I do not have a definite master thesis topic yet, I did the task with the research question of the given paper.

Regarding hypotheses and questions

• What phenomena or properties are being investigated? Why are they of interest?

The efficiency of graph-based similarity index methods is investigated. The research question is: How can a graph-based similarity index method be made much faster than traditional methods? This question is of interest because graph-based similarity index methods tend to produce results of higher quality in some areas (e.g. document comparison) than currently used word-distribution-based methods, but at lower speed.

• Has the aim of the research been articulated? What are the specific hypotheses and
research questions? Are these elements convincingly connected to each other?

The aim of the research is to show that:
i) the similarity measure provides a significantly higher correlation with human notions of document similarity than comparable measures
ii) this also holds for short documents with few annotations
iii) document similarity can be calculated efficiently compared to other graph-traversal based approaches

These hypotheses are strongly connected to each other and therefore I think the connection is convincing.

• To what extent is the work innovative? Is this reflected in the claims?

The work is innovative because currently used similarity methods are not fast enough for productive use in many areas (e.g. web search tasks).

• What would disprove the hypothesis? Does it have any improbable consequences?

The result that
i) the similarity measure produces a lower correlation with human notions of document similarity than existing word-distribution-based methods,
ii) it shows a much higher error for short documents,
or
iii) the developed similarity method is not more efficient than other graph-traversal-based approaches

… would disprove one (or more) of the hypotheses.

• What are the underlying assumptions? Are they sensible?

The assumption is that the developed similarity method is used in the area of calculating document similarity. I think this is a sensible assumption because these methods are mainly designed for this field (as the name already indicates).

• Has the work been critically questioned? Have you satisfied yourself that it is sound science?

(This can only be answered by the author, but I will try.)
Of course the work has been critically questioned, because the outcome was not certain at the beginning of the research task. It is sound science, because research in this area and answering the research questions will provide an advance in science – especially in the domain of information retrieval.

Regarding evidence and measurement

• What forms of evidence are to be used? If it is a model or a simulation, what demonstrates that the results have practical validity?

Experiments and simulations will generate the evidence for the hypotheses.

• How is the evidence to be measured? Are the chosen methods of measurement objective, appropriate, and reasonable?

Measuring the efficiency of the algorithm/implementation will be objective. The correlation over all documents will also be objective, but the distinction of small documents is not objective; it should be made with an appropriate threshold value for the document size.

• What are the qualitative aims, and what makes the quantitative measures you have chosen appropriate to those aims?

The qualitative aim is to produce a higher correlation compared to currently used techniques. The quantitative measure is the higher efficiency compared to graph-based document models – possibly resulting in an efficiency as high as that of word-distribution-based methods.

• What compromises or simplifications are inherent in your choice of measure?

The documents used for the experiment/simulation are assumed to be representative.

• Will the outcomes be predictive?

Yes – if everything is as claimed, the outcome should be a high correlation and high efficiency. The measured numbers in both disciplines are only vaguely predictive.

• What is the argument that will link the evidence to the hypothesis?

The argument that the simulation/experiment is very similar to everyday usage of the method will link the evidence to the hypothesis.

• To what extent will positive results persuasively confirm the hypothesis? Will negative results disprove it?

Positive results will show ‘how well’ the hypotheses are matched.
Negative results will disprove that the achieved quality and/or efficiency of the developed similarity method fits the criteria. But they will not exclude the general possibility that similar techniques could be good enough to fit all the criteria.

• What are the likely weaknesses of or limitations to your approach?

There are no apparent weaknesses in the targeted approach. One limitation is that the method can only compare text documents and is not applicable to multimedia retrieval (e.g. finding similar images).

Homework – Abstract

  1. Read chapter 4 on “Hypotheses, Questions, and Evidence” in Zobel’s “Writing for Computer Science”. We will discuss that chapter in the next session. [Done]
  2. Write an abstract for your student project paper.

Virtualization is a technology that allows sharing the resources of a physical computer between multiple operating systems (OSs). The concept of Virtual Machines (VMs) was introduced in the 1960s by IBM. One revolutionary step was taken by the FreeBSD project in 2000 with the introduction of FreeBSD Jails. With this work, the concept of operating system level virtualization – one of the two big categories of virtualization technologies – was boosted in the BSD community and later also became more and more popular in the GNU/Linux world.
Nowadays, companies in the field of cloud computing, such as Amazon, use such operating system level virtualization approaches to share their server capacity with customers in a highly dynamic way.
First, we classify the main virtualization technologies into two categories and present a deeper introduction to operating system level virtualization. Second, we show the differences between the two virtualization categories by comparing popular representatives of each category. Finally, we present a summary of the comparison and outline the advantages and disadvantages of *operating system level virtualization*.

Homework – Scientific English (Punctuation Game)

  1. (Re-)Read the sections in Zobel’s chapter 3 on reviewing (p. 30ff) and chapter 13 on “Editing”. We will discuss these chapters in class next time.

[DONE]

  2. Punctuation Game – Put in the missing punctuation marks (, ; -)

(My insertions are intended to be based on American English rules)

  • We live in the era of Big Data, with storage and transmission capacity measured not just in terabytes but in petabytes (where peta- denotes a quadrillion or a thousand trillion). Data collection is constant and even insidious with every click and every “like” stored somewhere for something. This book reminds us that data is anything but “raw”, that we shouldn’t think of data as a natural resource, but as a cultural one that needs to be generated protected, and interpreted. The book’s essays describe eight episodes in the history of data from the predigital to the digital. Together they address such issues as the ways that different kinds of data, and different domains of inquiry are mutually defining, how data are variously “cooked” in the processes of their collection and use – and conflicts over what can, or can’t be “reduced” to data. Contributors discuss the intellectual history of data as a concept, describe early financial modeling, and some unusual sources for astronomical data, discover the prehistory of the database in newspaper clippings and index cards, and consider contemporary “dataveillance” of our online habits, as well as the complexity of scientific data curation.
  • During succession, ecosystem development occurs; but in the long term absence of catastrophic disturbance, a decline phase eventually follows. We studied six long term chronosequences in Australia, Sweden, Alaska, Hawaii, and New Zealand, for each the decline phase was associated with a reduction in tree basal area and an increase in the substrate nitrogen to phosphorus ratio, indicating increasing phosphorus limitation over time. These changes were often associated with reductions in litter decomposition rates, phosphorus release from litter and biomass, and activity of decomposer microbes. Our findings suggest, that the maximal biomass phase reached during succession cannot be maintained in the long term absence of major disturbance and that similar patterns of decline occur in forested ecosystems, spanning the tropical temperate and boreal zones.
  • Facebook’s Graph API is an API for accessing objects and connections in Facebook’s social graph. To give some idea of the enormity of the social graph, underlying Facebook, it was recently announced, that Facebook has 901 million users and the social graph consists of many types (beyond just users). Until recently, the Graph API provided data to applications in only a JSON format. In 2011, an effort was undertaken to provide the same data in a semantically enriched RDF format, containing Linked Data URIs. This was achieved by implementing a flexible and robust translation of the JSON output to a Turtle output. This paper describes the associated design decisions, the resulting Linked Data for objects in the social graph, and known issues.

Homework – Research, References, and Citation

In the following you can see my best five references for my state-of-the-art paper: “Operating system level virtualization technologies”. I found just one interesting book on my topic, which was not interesting enough to be among my best five references, so this list contains only journal and conference papers.


[1] Soltesz, Stephen, et al. “Container-based operating system virtualization: a scalable, high-performance alternative to hypervisors.” ACM SIGOPS Operating Systems Review. Vol. 41. No. 3. ACM, 2007.

Why?
The journal paper introduces the topic of container-based virtualization, which covers the subject of my work in large parts. Furthermore, it shows the differences between this next-generation approach and the previously almost universally used hypervisor virtualization.
Google Scholar shows a citation count of 441, which seems to be a high number in this area.

[2] Kamp, Poul-Henning, and Robert NM Watson. “Jails: Confining the omnipotent root.” Proceedings of the 2nd International SANE Conference. Vol. 43. 2000.

Why?
The conference paper by the developers of “FreeBSD jails” is written with first-hand knowledge of this technology. “FreeBSD jails” itself was a technological game changer and the conceptual ancestor of currently used systems (e.g. lxc).
The paper was submitted to the SANE (System Administration and Network Engineering) conference in 2000. Google Scholar shows a citation count of 338, which is also a good number.

[3] Felter, Wes, et al. “An updated performance comparison of virtual machines and Linux containers.” 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2015.

Why?
This relatively new conference paper compares specific software from both approaches: container-based virtualization and hypervisor-based virtualization. The presented information can be put to very good use in my work.

[4] Rodríguez-Haro, Fernando, et al. “A summary of virtualization techniques.” Procedia Technology 3 (2012): 267-272.

Why?
This journal paper provides a gentle introduction to virtualization techniques and an overview of the main concepts of virtualization. Because it is a good paper for classifying the different approaches to virtualization, I will use this work for retrieving basic information.
Google Scholar shows a citation count of 11.

[5] Rathore, Muhammad Siraj, Markus Hidell, and Peter Sjödin. “KVM vs. LXC: comparing performance and isolation of hardware-assisted virtual routers.” American Journal of Networks and Communications 2.4 (2013): 88-96.

Why?
This journal paper provides a deeper comparison of hypervisor virtualization with the newer container-based virtualization by investigating representative software of each type.
Google Scholar shows a citation count of “only” 5, but because the results of the paper seem to be very interesting for my work, I chose it as part of my “best five references”.

Lecture task: The process of my bachelor thesis

What is the question of my bachelor thesis?

The question of my bachelor thesis is: How can a data structure and data management be designed to make it possible to render an arbitrarily sized multichannel volume dataset at high rendering speed on ordinary consumer hardware?

I think I have to add a short motivation here: medical and biological 3D datasets can be very, very large (>100 GB), and it is something of a computer scientist’s dream that they can be processed smoothly by ordinary computers.

The company I was doing the research for in my bachelor thesis already knew that the goal is partly achievable – and therefore how the question should be answered. The reason was that another software company had already achieved this goal in part. Unfortunately, that company did not offer any details of its implementation or of the research required to get this done.

What was the procedure to resolve the question?

First I looked for related work in the specific domain of my bachelor thesis – medical and biological volume data visualization. Afterwards I widened the search range incrementally to find more interesting papers.
Some of the retrieved papers answered specific questions about my implementation in more detail or brought up further ideas. In the end, caching algorithms, which are needed in many areas of computer science, were especially interesting.
In fact, the main part of my thesis was built on the statements of only a few papers (three or four in number), because they covered my task very well and differed in the details of their approaches.
While creating my theories about what a good data structure and data management should look like, I started writing some code to verify most of the hypotheses I had created during my research phase. Unfortunately, my time management was not good enough to produce an implementation that covers all of my theoretical work.
Sometimes I needed information I could not retrieve during my research phase, so I had to produce it myself by writing some benchmarking tools, e.g. for input/output performance depending on different operating systems, hardware, and file systems. These results were interesting for my experimental implementation, but producing them slowed down my progress.

What are the results of the question?

My result is a detailed concept of a data structure and data management for the special case of volume data visualization, including an experimental implementation that is able to load very large datasets and render them with standard rendering techniques.

Homework – Structure of a Scientific Manuscript

Assignment 2: Find appropriate titles for the following abstracts.

1. Technical Overview: Urban Internet of Things
Alternative: Urban Internet of Things overview with evaluation of Padova Smart City project

2. Black-box string test case generation
Alternative: Generation of superior string test sets for real-world programs

Assignment 3: Pick a topic for your student project

Operating system level virtualization technologies
A state-of-the-art paper about technologies that allow keeping services in separate, restricted environments. The paper should include an in-depth view of lxc, docker, rocket and bsd jails. A comparison with “traditional” virtualization technologies may also be part of the paper.