Zobel’s Checklist

Regarding hypotheses and questions
What phenomena or properties are being investigated? Why are they of interest?

The paper proposes a new approach to the search for related and similar documents. This can be useful for document retrieval and recommendation.
Their approach is graph-based, which usually goes hand in hand with expensive graph operations. The authors claim to circumvent this by applying the needed knowledge from the graph to the documents during a pre-processing step.

Has the aim of the research been articulated? What are the specific hypotheses and research questions? Are these elements convincingly connected to each other?

The authors want to show three things:
Their approach provides higher correlation with human notions of document similarity than comparable measures.
This holds true for short documents with few annotations.
The calculation of document similarity is more efficient than with graph-traversal-based approaches.

To what extent is the work innovative? Is this reflected in the claims?

The work provides a new and better approach to the assessment of relatedness of documents.
Yes, this is reflected in the claims.

What would disprove the hypothesis? Does it have any improbable consequences?

The hypotheses would be disproved by experimental results showing a lower correlation with human notions of document similarity than comparable measures, for normal-length or short documents, or showing that the calculation of document similarity is less efficient than with graph-traversal-based approaches.

What are the underlying assumptions? Are they sensible?

They assume that graph-based methods are not fast enough and that their approach is better than other approaches at finding similar documents. Both assumptions are sensible.

Has the work been critically questioned? Have you satisfied yourself that it is sound science?

I think so.

Regarding evidence and measurement
What forms of evidence are to be used? If it is a model or a simulation, what demonstrates that the results have practical validity?

An experiment using the standard benchmark for multiple-sentence document similarity.

How is the evidence to be measured? Are the chosen methods of measurement objective, appropriate, and reasonable?

They use the Pearson and Spearman correlation and their harmonic mean as well as a quality ranking using “Normalized Discounted Cumulative Gain”.
The authors state that those metrics are used in related work as well. I can’t say for sure whether those metrics are objective and appropriate, but they allow comparison with other work, which seems reasonable.
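To make the correlation measures mentioned above more concrete, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function names and the toy scores are my own assumptions, and I assume that "harmonic mean" refers to the harmonic mean of the Pearson and Spearman coefficients.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def harmonic_mean(r, rho):
    """Harmonic mean of two (positive) correlation coefficients."""
    return 2 * r * rho / (r + rho)


# Hypothetical toy data: human similarity judgements vs. system scores.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
system = [0.1, 0.2, 0.4, 0.8, 1.6]

r = pearson(human, system)
rho = spearman(human, system)
print(r, rho, harmonic_mean(r, rho))
```

In this toy example the system orders the document pairs exactly as the humans do, so the Spearman coefficient is maximal, while the Pearson coefficient is lower because the relationship is not linear. That difference is presumably why both are reported and combined.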

What are the qualitative aims, and what makes the quantitative measures you have chosen appropriate to those aims?

I can’t say if they are appropriate, because I don’t know anything about those measures.

What compromises or simplifications are inherent in your choice of measure?

Because they only want to measure how well relevant documents are discovered, the ranking quality evaluation is applied only to the top elements their approach returns.
Additionally, they use a benchmark that may or may not be representative of real data.
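The restriction to the top elements can be illustrated with a small sketch of Normalized Discounted Cumulative Gain cut off at rank k. This is one common formulation (linear gain, logarithmic discount); I don't know which variant the authors use, and the function names and relevance values are made up for illustration.

```python
from math import log2


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical relevance grades of documents, in the order the system ranked them.
ranking = [3, 2, 0, 1]
print(ndcg_at_k(ranking, 2))  # only the top 2 positions count
print(ndcg_at_k(ranking, 4))  # the misordering at positions 3 and 4 now counts
```

With a cut-off of 2, the misordered documents further down the list do not affect the score at all. That is exactly the simplification: errors below the cut-off are invisible to the measure.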

Will the outcomes be predictive?

I’m not really sure what this question aims at.

What is the argument that will link the evidence to the hypothesis?

Standard measures and standard benchmarks ensure that the results can at least partially confirm or disprove the hypotheses.

To what extent will positive results persuasively confirm the hypothesis? Will negative results disprove it?

As only a benchmark and not real data is used in the experiment, the results cannot conclusively prove or disprove the hypotheses. Nevertheless, they can give an indication one way or the other.

What are the likely weaknesses of or limitations to your approach?

The pre-processing step is mandatory and probably takes time.
