Score Aggregation Techniques in Retrieval Experimentation
Sri Devi Ravana and Alistair Moffat

Department of Computer Science and Software Engineering, The University of Melbourne, Victoria 3010, Australia
{sravana, alistair}@csse.unimelb.edu.au

Abstract

Comparative evaluations of information retrieval systems are based on a number of key premises, including that representative topic sets can be created, that suitable relevance judgements can be generated, and that systems can be sensibly compared based on their aggregate performance over the selected topic set. This paper considers the role of the third of these assumptions – that the performance of a system on a set of topics can be represented by a single overall performance score such as the average, or some other central statistic. In particular, we experiment with score aggregation techniques including the arithmetic mean, the geometric mean, the harmonic mean, and the median. Using past TREC runs we show that an adjusted geometric mean provides more consistent system rankings than the arithmetic mean when a significant fraction of the individual topic scores are close to zero, and that score standardization (Webber et al., SIGIR 2008) achieves the same outcome in a more consistent manner.

Keywords: Retrieval system evaluation, average precision, geometric mean, MAP, GMAP.

1 Introduction

Measurement is an essential precursor to all attempts to improve information retrieval (IR) system effectiveness. But to experimentally measure the effectiveness of an IR system is a non-trivial exercise, and requires that a complex sequence of tasks and computations be carried out. These tasks typically involve:

1. Selecting a representative corpus and a set of topics;
2. Creating appropriate relevance judgements that describe which documents are relevant to which topics;
3. For each of the systems being compared, building a set of runs, one for each of the topics;
4. Selecting one or more effectiveness metrics, and applying them to create a set of per-metric per-topic per-system scores;
5. For each effectiveness metric, comparing the scores obtained by one system against the scores obtained by the other systems, to determine if the difference in behavior between the systems can be assessed as being statistically significant; and then, finally,
6. Writing a paper that describes what the new idea was, and summarizing the measured improvement (or not) that was obtained relative to other previous techniques.

A variety of mechanisms have evolved over the years to perform these tasks. For example, because of the high cost of undertaking relevance judgements, steps 1 and 2 have tended to be carried out via large whole-of-community exercises such as TREC and CLEF. To perform step 3, we might build our own software system, or, to again make use of shared resources, we might choose to modify a public system such as Lemur, Terrier, or Zettair.¹ In the area of effectiveness metrics (step 4), the IR community has converged on a few that are routinely reported and can be evaluated via public software such as trec_eval. These include precision@10 (denoted here as P@10), R-precision, average precision (denoted as AP), normalized discounted cumulative gain (nDCG), rank-biased precision (RBP), and so on. Common tools and agreed techniques for step 5 are also emerging – there is now a clear community expectation that researchers must indicate whether any claimed improvements are statistically significant using an appropriate test (Cormack & Lynam 2007).

Now consider the last stage, denoted step 6 above. We wish to describe the new system in a context that allows the reader to appreciate the aspects of it that will lead to superior performance; and then need to provide empirical evidence that the hypothesized level of performance is attained. And, inevitably, we seek to do all of that within an eight or ten page limit. In support of our new system we describe the (established) corpus and topics that we have used in any training that we did to set parameters for our system and a baseline or reference system; and we describe the (different) corpus and/or topics that we then used to test the two systems with those parameters embedded.

We can also succinctly summarize any statistically significant relationships between the two systems (the step 5 output): the sentence “System New was significantly better than System Old at the 0.05 level for all of X, Y, and Z” (where X, Y, and Z are effectiveness metrics) takes up hardly any space at all in our paper. Yet such a claim is the holy grail of IR research – provided, of course, that System Old is a state-of-the-art reference point; that we had implemented it correctly; and that the experiments do indeed lead to the desired level of significance.

We would then like to provide evidence of the magnitude of the improvement we have obtained. To do that, we add a table of numbers to our paper. And here is the question that is at the heart of this work: what numbers? Two systems (at least) have been compared, over (probably) 50 or more topics, using (say) three effectiveness metrics. The experiments thus give rise to at least 300 per-system, per-topic, per-metric scores, and while we have all seen IR papers with that many numbers in them, including that volume of data is rather less than ideal. Even worse, the effectiveness scores are all derived from the fifty underlying runs, each containing (typically) 1,000 document identifiers. So if we wish to provide sufficient data that follow-up researchers can apply new effectiveness measures to the data, we need to publish 2 × 50 × 1,000 = 10^5 “things” as the output of even the simplest two-system evaluation. (For example, TREC has accumulated the actual system runs lodged by the participating groups in over a decade of experimentation, and these now form an invaluable resource in their own right, and have been used in the experiments leading to this paper.)

Given that it is impossible for the data that was used as the input to our statistical testing to be included in our paper, but desirous of including at least some element of numeric output, we inevitably average the per-topic per-system per-metric effectiveness scores, to obtain a greatly reduced volume of per-system per-metric scores. We then populate a table in our paper with these numbers (perhaps as few as six of them in a one-collection, two-system, three-metric evaluation); use superscript daggers or bold font to indicate the relationships that are statistically significant, and submit our work for refereeing. This approach is now so prevalent that the phrase “mean average precision” has taken on a life of its own, and a table of “MAP” scores is an unavoidable part of every experimental IR paper, as if MAP was an axiomatic measure in its own right.

In this paper we take a pace back from this process, and ask two very simple questions that arise from the discussion above: what does it mean to “average” effectiveness scores? And, if it is indeed a plausible operation, is the arithmetic mean the most sensible way to do it, or are there other methods that should be considered?

¹ See http://www.lemurproject.org/, http://ir.dcs.gla.ac.uk/terrier/, and http://www.seg.rmit.edu.au/zettair/ respectively.

Copyright © 2009, Australian Computer Society, Inc. This paper appeared at the Twentieth Australasian Database Conference (ADC 2009), Wellington, New Zealand, January 2009. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 92, Athman Bouguettaya and Xuemin Lin, Ed. Reproduction for academic, not-for profit purposes permitted provided this text is included.

2 Numeric aggregation

If a set of observations describes some phenomenon, it is natural to seek some kind of gross, or aggregate statistic that summarizes those observations. The simplest of these central tendencies is the arithmetic mean, which, for a set of t observations $\{x_i \mid i \in 1 \ldots t\}$ is computed as:

$$\mathrm{AM} = \frac{1}{t} \sum_{i=1}^{t} x_i .$$

As examples, consider the four possible sets of t = 5 observations that are shown in Table 1. The arithmetic mean of example set S1 is 0.28. One point worth noting in connection with the arithmetic mean is that all of the values should be on the same scale – it is not possible to compute the average of five inches, ten centimeters, and 0.001 miles without first converting them to a common framework. Similarly, it is impossible to average three liters, five centimeters, and four kilograms, because they cannot be converted to common units.

Another key aggregation mechanism is the geometric mean, defined as the t-th root of the product of the t numbers,

$$\mathrm{GM} = \left( \prod_{i=1}^{t} x_i \right)^{1/t} .$$

The geometric mean is more stable than the arithmetic mean, in the sense of being less affected by outlying values. However, when any of the values in a set is zero, the geometric mean over that set is also zero. For IR score aggregation purposes, this introduces the problem that a single “no answers” topic in the topic set might force all of the system scores to be zero. To sidestep this difficulty, Robertson (2006) thus defined an ε-adjusted geometric mean,

$$\epsilon\mathrm{GM} = \exp\left( \frac{1}{t} \sum_{i=1}^{t} \log(x_i + \epsilon) \right) - \epsilon ,$$

where ε is a small positive constant, and the summation, exp, and log functions calculate the product and t-th root of the set of t ε-adjusted values $x_i$ in a non-underflowing manner.

Because it is based on multiplication, in which there is no requirement that the quantities have the same units, it is permissible (although somewhat confusing) to take the geometric mean of, for example, three liters, five centimeters, and four kilograms. In this example, the GM is 3.91 (liters · centimeters · kilograms)^(1/3). Further examples of GM and εGM are shown in Table 1.

A variant of εGM-AP in which ε is applied to AP scores in a thresholding sense rather than in an additive/subtractive sense is now one of the aggregate scores routinely reported by the trec_eval program (see http://trec.nist.gov/trec_eval) using ε = 10^-5 (Voorhees 2005):

$$\epsilon\mathrm{GM}_{\mathrm{trec\_eval}} = \exp\left( \frac{1}{t} \sum_{i=1}^{t} \log \max\{x_i, \epsilon\} \right) .$$

For example, the εGM_trec_eval score for sequence S2 in Table 1 is 0.039, and would be 0.157 with ε = 0.01, the value used in the table. Note that when ε approaches ∞ the additive/subtractive εGM score for a set of numbers (but not the thresholded εGM_trec_eval variant) approaches the AM score for the same set. For this continuity reason, we prefer the εGM-AP additive/subtractive version to the trec_eval version, and have primarily used the former in this paper.

The harmonic mean is another central tendency that is typically used to combine rates, and can also be used as a method for score aggregation. It is defined as the reciprocal of the average of the reciprocals,

$$\mathrm{HM} = \frac{t}{\sum_{i=1}^{t} (1/x_i)} .$$

The harmonic mean is undefined if any of the set values are zero, meaning that it is again convenient to make use of an ε-adjusted version:

$$\epsilon\mathrm{HM} = \frac{t}{\sum_{i=1}^{t} 1/(x_i + \epsilon)} - \epsilon .$$
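The aggregation formulas of this section can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the paper's experiments: the function names are hypothetical, the score lists are the four example sequences of Table 1, and the median is taken from the Python standard library.

```python
import math
from statistics import median

def am(xs):
    """Arithmetic mean."""
    return sum(xs) / len(xs)

def gm(xs):
    """Geometric mean, computed via logs; zero if any value is zero."""
    if any(x == 0 for x in xs):
        return 0.0
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def eps_gm(xs, eps=0.01):
    """Additive/subtractive epsilon-adjusted geometric mean."""
    return math.exp(sum(math.log(x + eps) for x in xs) / len(xs)) - eps

def eps_gm_threshold(xs, eps=1e-5):
    """Thresholded variant: scores below eps are raised to eps."""
    return math.exp(sum(math.log(max(x, eps)) for x in xs) / len(xs))

def hm(xs):
    """Harmonic mean; None (undefined) if any value is zero."""
    if any(x == 0 for x in xs):
        return None
    return len(xs) / sum(1.0 / x for x in xs)

def eps_hm(xs, eps=0.01):
    """Epsilon-adjusted harmonic mean."""
    return len(xs) / sum(1.0 / (x + eps) for x in xs) - eps

# Example score sequences from Table 1 (t = 5 topics each).
TABLE1 = {
    "S1": [0.1, 0.1, 0.3, 0.8, 0.1],
    "S2": [0.0, 0.4, 0.2, 0.4, 0.3],
    "S3": [0.1, 0.5, 0.3, 0.2, 0.2],
    "S4": [0.2, 0.2, 0.3, 0.2, 0.2],
}

for name, xs in TABLE1.items():
    h = hm(xs)
    print(name,
          f"AM={am(xs):.3f}", f"GM={gm(xs):.3f}", f"eGM={eps_gm(xs):.3f}",
          "HM=-" if h is None else f"HM={h:.3f}",
          f"eHM={eps_hm(xs):.3f}", f"MD={median(xs):.3f}")
```

Calling `eps_gm(xs, eps)` with a very large `eps` gives a value close to `am(xs)`, which is the continuity property noted above; the thresholded variant does not share it.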

When ε is large, the εHM score again converges to the AM score.

The fourth and final central tendency explored in this paper is the median, denoted here as MD. The median of a set is the middle value of the set when its values are sorted into numeric order: $x_{(t+1)/2}$ when t is odd, and $(x_{t/2} + x_{t/2+1})/2$ when t is even. The median has the benefit of being relatively unaffected by outliers, but the flip side of this is that it is completely insensitive to changes of value that do not affect the set ordering except when it is the middle values that change.

Table 1 shows the application of these four aggregation methods, plus two ε-adjusted variants, to four example data sets, with the largest value in each column marked with an asterisk. It is apparent that the aggregation techniques have different properties, since the four sequences are placed into different “overall” orderings by the various aggregation techniques.

System   Scores                     AM       GM       εGM      HM       εHM      MD
S1       0.1  0.1  0.3  0.8  0.1    0.280*   0.189    0.192    0.145    0.148    0.100
S2       0.0  0.4  0.2  0.4  0.3    0.260    0.000    0.151    –        0.034    0.300*
S3       0.1  0.5  0.3  0.2  0.2    0.260    0.227*   0.228*   0.197    0.200    0.200
S4       0.2  0.2  0.3  0.2  0.2    0.220    0.217    0.217    0.214*   0.214*   0.200

Table 1: Example sets of values corresponding to different systems applied to t = 5 topics, and their calculated central tendencies. In the εGM and εHM methods, ε = 0.01. The largest value in each column (set in bold in the original) is marked with an asterisk.

There is, of course, no sense in which any of systems S1 to S4 in Table 1 is superior to the others (presuming that the five elements in each sequence can be regarded as paired observations and then a statistical significance test applied), so no answer is possible to the question “which system is better in the sense of having the highest score?” Nevertheless, and despite that patent lack of demonstrable differentiation, as soon as an aggregate score has been computed for the observations generated by some system, it is immediately tempting to then “order” the systems by their aggregate scores – exactly as we have in Table 1 by presenting the sequences in decreasing AM order. In Table 1, system S1 at face value “outperforms” the other three systems by a quite wide margin.

Robertson (2006) provides an insightful discussion of measures and how they apply to AP and other effectiveness scores, including the relationship between AM-AP and GMAP. Our discussion here can be seen as extending Robertson’s evaluation, through the use of experiments in which aggregation methods are used to represent the overall performance of retrieval systems.

In order to understand the effects of using these aggregation methods and their ability to produce consistent system rankings, evaluations were conducted using various TREC collections, and different types of effectiveness measure. Our experiments indicate that εGM handles variability in topic difficulty more consistently than does the usual AM aggregation method, and also better than the median MD and harmonic mean HM methods, when a significant fraction of the individual topic scores are close to zero. Also of considerable interest is that the standardized average precision scores of Webber et al. (2008a) achieve the same outcomes, even when coupled with the standard AM aggregation.

3 Topic hardness

The effectiveness of a retrieval system is gauged as a function of its ability to find relevant documents (Sanderson & Zobel 2005). One of the aims of the recent TREC Robust Track is to improve the consistency of retrieval technology by focusing on poorly performing topics – ones for which most of the participating systems score poorly (Voorhees 2003). The GM-AP aggregation method was introduced as part of this effort, in order to de-emphasize the role of high-scoring topics in system comparisons, and to enhance the relative differences amongst low-scoring topics (Voorhees 2005, Robertson 2006). Note, however, that changing to GM-AP has no effect on the significance or otherwise of any pairwise system comparison, since significance is a function of the elemental effectiveness scores, prior to any summary value being computed. The aggregation mechanism relates purely to the gross statistic that is presented as being the overall score for the system. Mizzaro (2008) makes further observations in this regard.

The importance of good relative performance over all topics, and not just excellent performance on one (which is how system S1 obtains its high AM score in Table 1), and the fact that users remember any delivery of poor results by a system for a topic were also discussed by Mandl et al. (2008). Similarly, Buckley (2004a) points out that the topic variability is the main problem when designing an IR system for all user needs, and that a universal IR system should perform well across a range of topics with varying levels of difficulty.

It has been noted that a key expectation (or rather, hope) arising from any form of IR experimentation is that the system performance results based on one topic or one collection should be able to predict system performance on other topics and other collections (Buckley 2004a, Webber et al. 2008b). It has also been noted (see Buckley (2004b) and Webber et al. (2008a)) that topic variability is at least as great as system variability, and that in the nominal matrix of per-system per-topic scores, there is more commonality of scores across any particular topic than there is across any individual system. Restating this observation another way, the score achieved by a particular system on a given topic tends to be more a function of the topic than of the system.

In a similar vein, Mizzaro & Robertson (2007) argue that GM-AP is a more balanced measure than AM-AP for TREC effectiveness evaluations. This is due to the way in which the arithmetic mean can be influenced by easy topics, for which the system-topic scores are generally high, and bad systems might still get scores that are numerically large. Mizzaro & Robertson also asserted that GM-AP is not overly biased towards the low end of the scale, where the system-topic scores are low, and, equivalently, the topics are hard. Indeed, as was noted by O’Brien & Keane (2007), users prefer strategies and technologies that maximize the amount of information they gain as a function of the interaction cost that they invest.

Observations such as these then raise the question as to how best to measure topic “hardness”. In the TREC 2003 Robust Retrieval Track, difficult topics were defined as being those with a low median AP score and at least one high outlier score (Voorhees 2003). Other definitions include computing (Mizzaro 2008)

$$D_t = 1 - \mathrm{mean}_t , \tag{1}$$

where $\mathrm{mean}_t$ is the average of the system-topic scores for topic t; or computing $D_t = 1 - \max_t$, where $\max_t$ is the maximum score for topic t. For the purposes of the experiment, we took another approach, and defined the difficulty $D_t$ of a topic t to be

$$D_t = \frac{\max_t - \mathrm{mean}_t}{\mathrm{sd}_t} , \tag{2}$$

in which standardized z-scores are calculated in the system-topic matrix (Webber et al. 2008a), and the most difficult topic is deemed to be the one for which the most “surprisingly good” score is obtained by one of the systems, with surprise defined in terms of standard deviations above the mean.

4 Methodology

Our purpose in this investigation was to examine the effect that the choice of score aggregation technique had on the outcomes of experiments, and whether the proposed use of

the εGM adjusted geometric mean (Robertson 2006) could be experimentally supported in any way. To carry out this study, we devised the following experimental methodology. While we have no basis for proving that what we have computed has “real” meaning, we trust that the reader will find the experiment plausible (and interesting), and will agree that the results we have computed are grounded in practice.

We made use of standard TREC resources – collections, matching topics, and corresponding relevance judgements (see http://trec.nist.gov). We also used the official TREC runs, as lodged each year by the participating research groups, and were able to compute retrieval effectiveness scores for each of the submitted systems based on any subset of the topics that we wished to use. Al-Maskari et al. (2008) consider these test collections and support their use in IR experimentation, arguing that they can be used to predict users’ effectiveness successfully.

Each of our main experiments proceeded by:

1. Choosing a random subset containing half of the set of t topics.
2. Extracting the rankings for those t/2 topics from the s available runs.
3. Using the chosen effectiveness metric and the relevance judgements to calculate a set of st/2 per-system per-topic scores.
4. Using the chosen score aggregation technique to compute a set of s per-system scores.
5. Sorting the s systems into decreasing score order, based on the per-system scores.
6. Repeating this process, using the other t/2 topics.
7. Taking the two s-item system orderings, and calculating the similarity between them using a mechanism such as Kendall’s τ (Kendall & Gibbons 1990).
8. Then repeating this entire sequence 10,000 times, so that 10,000 Kendall’s τ scores could be used to represent the self-consistency of the score aggregation technique.

We note that researchers have also applied this methodology in investigations examining other aspects of retrieval performance (Zobel 1998, Sanderson & Zobel 2005).
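The split-half procedure listed above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and the toy score matrix are hypothetical, the per-system per-topic scores are assumed to be already computed (steps 2 and 3), and Kendall's τ is computed directly on the two lists of aggregate scores using the simple tau-a form, which gives the same value as comparing the two sorted orderings when there are no ties.

```python
import random
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's tau-a between two equal-length score lists.
    Tied pairs count as neither concordant nor discordant."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def split_half_consistency(scores, aggregate, repeats=10000, seed=0):
    """scores[s][t] is the score of system s on topic t.
    Returns the mean Kendall's tau over `repeats` random topic splits."""
    rng = random.Random(seed)
    n_topics = len(scores[0])
    topics = list(range(n_topics))
    taus = []
    for _ in range(repeats):
        rng.shuffle(topics)                       # step 1: random halves
        half_a = topics[:n_topics // 2]
        half_b = topics[n_topics // 2:]           # step 6: the other half
        # steps 4-5: one aggregate score per system, per half
        agg_a = [aggregate([row[t] for t in half_a]) for row in scores]
        agg_b = [aggregate([row[t] for t in half_b]) for row in scores]
        taus.append(kendall_tau(agg_a, agg_b))    # step 7
    return sum(taus) / len(taus)                  # step 8 summary

# Toy matrix: 6 hypothetical systems, 10 hypothetical topics.
toy_rng = random.Random(1)
toy = [[toy_rng.random() for _ in range(10)] for _ in range(6)]
print(split_half_consistency(toy, arithmetic_mean, repeats=200))
```

Passing a different `aggregate` function while keeping the same `seed` reuses the same topic splits, so that any consistent difference in the mean τ can be attributed to the aggregation method.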
Part of the purpose of Table 1 was to illustrate the inconsistencies that can arise out of system “orderings” based on aggregate scores, and our experiments in this paper are intended to uncover the extent to which such inconsistencies are an issue in real IR experimentation. If an aggregation computation was “perfect”, and if subsets of topics could be equally balanced (whatever that means), the two system orderings would be the same, and the Kendall’s τ would be 1.0. Variation in aggregation technique, and variations in subset balance, mean that it is unlikely that Kendall’s τ scores of 1.0 can be achieved. But, if the same (large number of) topic subsets are used for all aggregation methods, any consistently-observed difference in Kendall’s τ can be attributed to the aggregation method.

5 Testing aggregation

For the initial experiments, the AP metric was coupled with a range of aggregation techniques. Use of AP in IR experimentation is widespread, and while it is normally regarded as being a “system” metric rather than a “user” one, it does still correspond to a (somewhat contrived) user model (Robertson 2008).

[Figure 1 plot: average Kendall’s τ (vertical axis, 0.6 to 1.0) against ε (horizontal axis, 1e-04 to Inf) for εHM-AP and εGM-AP.]

Figure 1: Average system ordering correlations when εGM-AP and εHM-AP are used as the score aggregation method across topics. In this experiment, the 50 TREC9 Web Track topics were randomly divided into equal-sized subset pairs, and the system rankings generated on those two subsets were compared using Kendall’s τ. When the εGM-AP and εHM-AP parameter ε approaches infinity, the resultant system ordering approaches the ordering generated by AM-AP.

[Figure 2 plot: density (vertical axis, 0 to 15) of Kendall’s τ values (horizontal axis, 0.6 to 1.0) for εGM-AP, εHM-AP, AM-AP, and MD-AP.]

Figure 2: Density distribution of 10,000 Kendall’s τ values for εGM-AP, εHM-AP (both with ε = 0.01), AM-AP, and MD-AP. In each experiment the TREC9 Web Track topics are randomly split into two halves, each system is scored using the topic subsets, and then the two resultant system orderings are compared. Three of the four curves represent density cross-sections corresponding to points in Figure 1. The MD-AP cross-section is additional. The vertical dotted lines on the curves indicate the means of the four density distributions.

5.1 Results using average precision

Figure 1 provides a first illustration of the data collected using the experimental methodology described in the previous section. In this graph εGM-AP and εHM-AP induced system orderings are compared for different values of ε, with the average value of Kendall’s τ for random pairs of query subset-induced system orderings plotted as a function of ε. The line shows the average τ value over 10,000 random splittings of the t = 50 TREC9 topics and s = 105 TREC9 systems, in an experiment designed to answer the very simple question as to whether either εGM-AP or εHM-AP should be preferred to AM-AP, and if so, what range of values of ε is appropriate.

The shape of the curve in Figure 1 makes it clear that when ε is small the average system ordering correlations are higher – that is, that the εGM-AP aggregate score (towards the left end of the graph) places the systems into rankings that are more self-consistent than does the conventional AM-AP summarization at the right-hand end of the graph (when ε → ∞). The εHM-AP method is also better at ordering the systems than is AM-AP provided that midrange values are chosen for ε. For small ε, the quality of the induced system rankings for εHM-AP drops markedly.

A major theme of this paper is that plain averages should be treated with caution, and we should heed our

own advice in this regard. Figure 2 shows the density distribution of the 10,000 τ values at three of the points plotted in Figure 1, showing the variability of the system orderings when εGM-AP and εHM-AP (both with ε = 0.01) and AM-AP are used to aggregate the per-system per-topic scores and generate the system orderings. Also shown as a fourth line is the τ density curve for the system similarities generated using the median AP score as the gross system statistic. The density curves are derived from 10,000 random splittings of the 50 TREC9 topics into two 25-topic subsets. The system orderings produced by GM-AP aggregation are consistently more similar than the system orderings induced by AM-AP for the same topic splits, and suggest that GM-AP is the more stable aggregation technique for this data set. Analysis of the paired differences shows that all of the relationships shown are significant at the p = 0.01 level, that is, that εGM-AP > εHM-AP > AM-AP > MD-AP with high confidence.

5.2 Other collections

Having used the TREC9 Web Track data to confirm that εGM-AP with ε = 0.01 yields more consistent system rankings than does AM-AP, we turned to other TREC data sets in order to determine the extent to which that relationship is a general one. Figure 3 was created via the same experimental methodology as was used for Figure 1, but using the TREC8 Ad-Hoc Track queries and systems (t = 50, s = 129); and the TREC2001 Web Track queries and systems (t = 50, s = 97). Neither of these experiments favors εGM-AP compared to AM-AP, and for the TREC8 data set, εGM-AP with ε = 0.01 is significantly less consistent than AM-AP (p = 0.05).

Figure 4 helps understand why this difference in behavior arises. It shows the distribution of the per-topic per-system AP scores for the three TREC data sets used in our experiments. The TREC9 collection, topics, and judgements combination generates a high number of low scores compared to the other two data sets, and it is these low scores that the εGM method is handling better. For example, in the TREC9 results, 10% of the system-topic scores are zero, and a further 46% are below 0.1, making a total of 56% low scores, shown in the first row of Table 2. For TREC2001 the corresponding rates were 4% and 45%, totalling 49%; and for TREC8 the rates were 3% and 33%, totalling 36%. That is, in the TREC2001 and TREC8 experiments there were fewer low AP values in the score matrix, AM-AP suffers less vulnerability, and there is thus less scope for εGM-AP to be superior.

[Figure 3 plot: average Kendall’s τ (vertical axis, 0.6 to 1.0) against ε (horizontal axis, 1e-04 to Inf) for TREC8 Ad-hoc and TREC2001 Web.]

Figure 3: Average system ordering correlations when εGM-AP and AM-AP (as ε → ∞) are used as the score aggregation method across topics. The two curves for the TREC8 Ad-Hoc Track (t = 50 topics, s = 129 systems) and the TREC2001 Web Track (t = 50 topics, s = 97 systems) can be directly compared with the εGM-AP curve in Figure 1.

[Figure 4 plot: density (vertical axis, 0 to 6) of average precision scores (horizontal axis, 0.0 to 1.0) for TREC9 Web, TREC2001 Web, and TREC8 Ad-hoc.]

Figure 4: Density distribution of AP scores for TREC9 Web Track data (t = 50 topics and s = 105 systems); TREC8 Ad-Hoc Track data (t = 50 topics and s = 129 systems); and TREC2001 Web Track data (t = 50 topics and s = 97 systems).

Metric     % = 0    % ≤ 0.1
AP           10        56
P@10         37        37
nDCG         10        23
RBP0.95      17        52
SP            0         4

Table 2: Proportion of low scores among the TREC9 system-topic combinations when assessed using different effectiveness metrics.

5.3 Effectiveness measures

Figure 5 and Figure 6 show what happens in the TREC9 environment when AP is replaced by P@10, RBP0.95 (see Moffat & Zobel (2009) for a definition of this metric), and nDCG (see Järvelin & Kekäläinen (2002)) as the underlying similarity measure. The score density plot in Figure 6 suggests that εGM-nDCG should be stable as ε is changed, because of the low density of nDCG near-zero scores, and that is what is observed in Figure 5. On the other hand, RBP0.95 has a relatively high density of near-zero scores, and εGM-RBP0.95 is accordingly sensitive to ε, with more consistent system rankings being generated with ε = 0.1 than when ε tends to ∞ and the arithmetic mean is being used. Note also the comparatively poor performance of P@10 – it is relatively immune to the choice of εGM or AM aggregation, but nor is it a terribly good basis for ordering systems.

Webber et al. (2008a) recently introduced a standardized version of AP that we denote here as SP. The critical difference between AP and SP is that a set of t topic means and standard deviations are computed across the st per-system per-topic scores, and each of the st scores is then converted into a z-score with regard to that topic’s statistics:

$$e'_{s,t} = \frac{e_{s,t} - \mathrm{mean}_t}{\mathrm{sd}_t} ,$$

where $e'_{s,t}$ is the z-score corresponding to the average precision effectiveness $e_{s,t}$. Because z-scores are centered on zero, and can be negative as well as positive, Webber et al. further transformed the $e'_{s,t}$ values through the use of the cumulative normal probability distribution. The result is a set of $e''_{s,t}$ scores that lie strictly between zero and one, and which, for each topic, has a mean of 0.5 and a uniform standard deviation.
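The two-stage standardization just described can be sketched as below. This is a hedged illustration, not the implementation of Webber et al.: the function name is hypothetical, the use of the sample standard deviation is an assumption (the text does not say which estimator is used), and a topic on which every system scores identically is mapped to 0.5 purely to avoid division by zero.

```python
from statistics import NormalDist, mean, stdev

def standardize(scores):
    """scores[s][t] is the raw score of system s on topic t.
    Returns a matrix of the same shape holding standardized scores:
    a per-topic z-score pushed through the standard normal CDF."""
    n_systems, n_topics = len(scores), len(scores[0])
    std_normal = NormalDist()          # mean 0, standard deviation 1
    out = [[0.0] * n_topics for _ in range(n_systems)]
    for t in range(n_topics):
        column = [scores[s][t] for s in range(n_systems)]
        mu = mean(column)
        sd = stdev(column)             # assumption: sample standard deviation
        for s in range(n_systems):
            # assumption: a zero-variance topic contributes z = 0
            z = (column[s] - mu) / sd if sd > 0 else 0.0
            out[s][t] = std_normal.cdf(z)
    return out

# Hypothetical AP scores for three systems on two topics.
raw = [[0.0, 0.2],
       [0.5, 0.4],
       [1.0, 0.6]]
for row in standardize(raw):
    print([round(v, 3) for v in row])
```

Because the transformation is monotonic within each topic, the ordering of systems on any single topic is unchanged; what changes is how much each topic contributes once the scores are aggregated.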
The last row of Table 2 shows the rather unique properties of the set of effectiveness scores that result from this two-stage standardization of AP scores to generate SP scores. Given the intention of the standardization process, it is unsurprising (Figure 7) that there is no change in system ranking consistency as ε is changed; and also unsurprising (Figure 8, Table 2) that the fraction of SP scores that are near zero is small. That is, the condition that led to geometric mean average precision being proposed as an alternative to arithmetic mean average precision – the presence of important low-scoring topics in amongst the high-scoring ones – is removed by the standardization process, and there is no benefit in shifting to εGM aggregation. Instead, AM aggregation, with the elimination of the need to select ε, is the preferred approach, because standardization means that all topics are already contributing equally to any assessment in regard to differences in system effectiveness, and no further benefit arises from emphasizing low scores.

Note that standardization is only possible when a set of retrieval systems are being mutually compared, and topic means and standard deviations can be calculated; or when pooled relevance judgements based on a set of previous systems are available, together with the runs that led to those judgements.

Figure 9 draws together the observations we have made. A total of fifteen collection and effectiveness metric couplings are shown, with the percentage of low effectiveness scores for that coupling plotted horizontally, and the extent of the superiority (or inferiority) of εGM as an aggregation method relative to AM plotted vertically. The behavior of the SP metric does not depend on a specific aggregation technique to perform well, and for all three collections used is insensitive to the score averaging mechanism. In the case of the other four metrics the trend is as we have observed already: if there are many low scores, εGM gives more consistent system orderings than does AM.

[Figure 5 plot: average Kendall’s τ (vertical axis, 0.6 to 1.0) against ε (horizontal axis, 1e-04 to Inf) for nDCG, RBP0.95, and P@10.]

Figure 5: Average system ordering correlations when εGM-P@10, εGM-RBP0.95, and εGM-nDCG are used to score the TREC9 systems. When ε approaches infinity the methods converge to AM-P@10, AM-RBP0.95, and AM-nDCG respectively.

[Figure 6 plot: density (vertical axis, 0 to 6) of system-topic scores (horizontal axis, 0.0 to 1.0) for RBP0.95, P@10, and nDCG.]

Figure 6: Density distribution of TREC9 system-topic scores using RBP0.95, P@10, and nDCG (t = 50 topics and s = 105 systems).

[Figure 7 plot: average Kendall’s τ (vertical axis, 0.6 to 1.0) against ε (horizontal axis, 1e-04 to Inf) for εGM-SP and εGM-AP.]

Figure 7: Average system ordering correlations when εGM-SP (based on standardized average precision) is used as the evaluation metric, compared to the εGM-AP combination. The methodology and data set used in this experiment are identical to those used in Figure 1.

[Figure 8 plot: density (vertical axis, 0 to 6) of system-topic scores (horizontal axis, 0.0 to 1.0) for AP and SP.]

Figure 8: Density distribution of AP and SP scores for the TREC9 Web Track data (t = 50 topics and s = 105 systems).

5.4 Query “hardness”

All of the results presented thus far have been based on the average performance of the aggregation methods over large numbers of random splittings of the topics in the query set. As an alternative, we also constructed two particular topic subsets based on the nominal topic difficulty scores defined by Equation 2, to evaluate the extent to which the various aggregation mechanisms were affected by topic difficulty. In undertaking this part of the experimentation, we additionally sought to determine whether the additive/subtractive version of the adjusted geometric mean gave rise to consistent rankings, or whether the trec_eval ε-thresholding version was more consistent.

The two query partitionings were constructed to illustrate extreme behavior. In the first partitioning, the Hard/Easy split, Equation 2 was used to assign a difficulty rating to each of the t = 50 TREC9 topics, and then the highest-scoring 25 were taken into one set, and the lowest 25 topics were taken into the other. The second Middle/Rest partitioning again ranked the topics by difficulty, but then extracted the mid-scoring (most internally consistent) group into one subset, and left the remaining 12 hardest topics and 13 easiest ones in the other set. System orderings based on aggregate score over AP and SP for each pair of sets were then computed, and compared using Kendall’s τ.

The results of these experiments are shown in Table 3. The columns headed “Random” relate to the previous methodology of taking the average τ value over 10,000 random splittings of the 50 topics, and represent the numeric values of some of the points already plotted in the various graphs.

There are a number of trends that can be distilled from Table 3. Looking at the AP halves of the three subtables it is clear that the additive version of εGM is preferable to the εGMtrec_eval version – in eight of the nine cases, εGM detects more similarity between the system orderings,

Aggregation method           AP                              SP
                             Random  Hard/Easy  Middle/Rest  Random  Hard/Easy  Middle/Rest
AM                           0.748   0.694      0.722        0.798   0.704      0.820
εGM, ε = 0.01                0.819   0.779      0.826        0.805   0.712      0.824
εGMtrec_eval, ε = 0.00001    0.798   0.818      0.800        0.806   0.802      0.826

(a) TREC9 data set, t = 50 and s = 105.

Aggregation method           AP                              SP
                             Random  Hard/Easy  Middle/Rest  Random  Hard/Easy  Middle/Rest
AM                           0.693   0.584      0.611        0.741   0.622      0.798
εGM, ε = 0.01                0.729   0.650      0.674        0.740   0.622      0.798
εGMtrec_eval, ε = 0.00001    0.682   0.639      0.567        0.737   0.657      0.816

(b) TREC2001 data set, t = 50 and s = 97.

Aggregation method           AP                              SP
                             Random  Hard/Easy  Middle/Rest  Random  Hard/Easy  Middle/Rest
AM                           0.773   0.719      0.753        0.781   0.739      0.790
εGM, ε = 0.01                0.755   0.732      0.760        0.769   0.714      0.785
εGMtrec_eval, ε = 0.00001    0.681   0.672      0.676        0.768   0.680      0.785

(c) TREC8 data set, t = 50 and s = 129.

Table 3: System ranking correlation coefficients using Kendall’s τ, for two different effectiveness metrics, three different score aggregation methods, three different collections, and three different ways of splitting the topics into two halves. The three subtables are ordered by decreasing percentage of system-topic AP scores that are below 0.1. Note that the Hard/Easy and Middle/Rest splits are both one-off arrangements in each of the three data sets.

[Plots for Figures 9 and 10: horizontal axis "% of scores < 0.1"; vertical axes "GM − AM" (Figure 9) and "GM − GM trec_eval" (Figure 10), each running from −0.10 to 0.15; points for SP, AP, nDCG, P@10, and RBP.]
Figure 9: Comparing the εGM and AM score aggregation methods across three collections and across five effectiveness metrics. The horizontal axis shows the percentage of low effectiveness scores generated by that collection/metric combination; the vertical axis plots the difference in Kendall’s τ score between εGM (with ε = 0.01) and AM.

including in all three of the Random evaluations. Second, looking at the top row of each subtable, the AM-SP combination of effectiveness metric and aggregation method is uniformly better than the AM-AP combination that has dominated IR reporting for more than a decade. The third effect to be noted is that the Hard/Easy pairing, with just two exceptions in connection with the εGMtrec_eval aggregation, is more likely to lead to different system orderings than is the Middle/Rest topics split. And fourth, it also appears that the Middle/Rest split is no less likely to generate different system orderings than a Random split, and hence there is no sense in which it provides a problematic arrangement for the aggregation metrics to handle. This is a slightly surprising outcome, since the Rest group contains topics of widely varying difficulty, at least in terms of Equation 2.
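The Random columns of Table 3 rest on repeated random half-splits of the topic set, with the two resulting system orderings compared using Kendall’s τ. The sketch below illustrates that procedure under stated assumptions: the function names are hypothetical, the τ-a form of the coefficient is used (tied pairs count as neither concordant nor discordant), and the number of splits defaults to far fewer than the 10,000 used in the experiments.

```python
import random
from itertools import combinations
from statistics import mean


def kendall_tau(a, b):
    """Kendall's tau-a between two equal-length lists of system scores."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs


def split_half_tau(scores, aggregate, n_splits=1000, seed=0):
    """Average tau over random half-splits of the topics.

    scores[system][topic] holds per-topic effectiveness scores, and
    aggregate maps a list of topic scores to one system score.
    """
    rng = random.Random(seed)
    topics = list(range(len(scores[0])))
    half = len(topics) // 2
    taus = []
    for _ in range(n_splits):
        rng.shuffle(topics)
        first, second = topics[:half], topics[half:]
        a = [aggregate([row[t] for t in first]) for row in scores]
        b = [aggregate([row[t] for t in second]) for row in scores]
        taus.append(kendall_tau(a, b))
    return mean(taus)


# Hypothetical data: three systems whose ordering is the same on every topic.
toy = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
consistency = split_half_tau(toy, mean, n_splits=20)
```

Passing a different `aggregate` function (for example a geometric mean variant) in place of `mean` gives the corresponding consistency figure for that aggregation method on the same data.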

Figure 10: Comparing the εGM aggregation method and the thresholded GM variant used in the trec_eval program, with other experimental details as for Figure 9. The εGM approach is more self-consistent on all three collections, and for four of the five effectiveness metrics used.

We also explored the use of Equation 1 as a topic difficulty rating, and in results not included here, obtained correlation patterns similar to those shown in Table 3. Finally in this section, to reinforce our contention that trec_eval’s thresholding εGMtrec_eval aggregation method is less reliable than the additive/subtractive εGM version, Figure 10 repeats the “differences between the average Kendall’s τ score” experiment of Figure 9. The three points representing SP on the three different collections are again unaffected by the aggregation method used. But in all of the other twelve cases, εGM generates more consistent system rankings than does εGMtrec_eval. We can see no basis for persisting with the thresholding version of GM-AP provided by trec_eval, and suggest that other authors be similarly wary of using the scores it generates.
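The difference between the two adjusted geometric means compared above can be made concrete with a short sketch. It assumes that the additive/subtractive version adds ε to every score before taking logarithms and subtracts ε from the result, and that the thresholding version replaces any score below ε by ε; the function names are hypothetical, and the default ε values are those listed in Table 3.

```python
import math


def arithmetic_mean(scores):
    return sum(scores) / len(scores)


def eps_gm(scores, eps=0.01):
    """Additive/subtractive adjusted geometric mean (assumed form)."""
    log_mean = sum(math.log(x + eps) for x in scores) / len(scores)
    return math.exp(log_mean) - eps


def eps_gm_threshold(scores, eps=0.00001):
    """Thresholded geometric mean: scores below eps are raised to eps
    (assumed form of the trec_eval variant)."""
    log_mean = sum(math.log(max(x, eps)) for x in scores) / len(scores)
    return math.exp(log_mean)


# Hypothetical system with one failed topic and one perfect topic.
topic_scores = [0.0, 1.0]
am = arithmetic_mean(topic_scores)
gm_additive = eps_gm(topic_scores)
gm_threshold = eps_gm_threshold(topic_scores)
```

On this toy input the arithmetic mean is 0.5, the additive version gives roughly 0.09, and the thresholded version gives roughly 0.003, so a single zero score pulls the thresholded aggregate down far more sharply.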


baseline AM-AP approach. Nor – predictably – is the median an especially good score aggregation technique.

6 Conclusion

Our exploration of AM-AP and εGM-AP has confirmed that for the TREC9 Web data the ε-adjusted geometric mean is a more appropriate score aggregation mechanism than is the arithmetic mean. This appears to be a consequence of the large number of low AP scores (more than 50% are less than 0.1) across the TREC9 systems and topics. Experiments conducted with other TREC data resources confirm that when a collection has a majority of system-topic AP scores that are zero or close to zero, the ε-adjusted geometric mean is a more appropriate score aggregation method than is the arithmetic mean. On the other hand, when there are only a minority of low system-topic scores, the arithmetic mean is resilient, and tends to perform well, where “performs well” is in the sense of system orderings derived from one subset of the topics being similar to the system orderings derived from a different set.

We also experimented with other effectiveness metrics, including P@10, nDCG, RBP0.95, and SP, and observed the same overall outcome – when a large fraction of small effectiveness scores are generated by the metric, it is better to use the εGM aggregation approach. On the other hand, experiments using the standardized precision (SP) metric showed that it anticipates the benefits brought about through the use of the geometric mean, and that the best aggregation rule for it was the standard arithmetic mean. The SP metric has the useful attribute of transforming the effectiveness scores so that, for every topic in the set, the mean score for that topic is 0.5, and the standard deviation is also fixed. Standardizing thus converts all system-topic scores to the same “units”, and allows averaging as a logically correct operation.

To broaden the scope of our investigation we also plan to explore both further aggregation mechanisms, such as the MEDRANK approach of Fagin et al. (2003); and also other rank correlation approaches, including the τAP top-weighted approach of Yilmaz et al. (2008). Top-weighting of the rank correlation scoring mechanism is important if we are more interested in fidelity near the top of each ranking than in (say) the bottom half of the two system orderings being compared.

We conclude by reiterating that our experiments – in which we regard a metric and aggregation technique to be “good” if they yield similar overall system rankings from different topic sets – are founded in practice rather than in theory, and reflect common evaluation custom rather than an underlying principle. We also stress that to be plausible, experimentation should be accompanied by significance testing, and because significance tests are carried out over sets of values, the details of the aggregation technique used to obtain representative gross scores are of somewhat parenthetical interest if significance cannot be asserted. Nevertheless, and even given these caveats, we have found that εGM has a role to play when there are many small score values to be handled, and that AM-SP is the combined metric and aggregation technique that is most strongly self-consistent in terms of score-induced system rankings.

Acknowledgment: This work was supported by the Australian Research Council, by the Government of Malaysia, and by the University of Malaya. We thank the referees for their helpful comments.

[Figure 11 plot: density (vertical axis, 0 to 25) against Pearson’s rho (horizontal axis, 0.70 to 1.00); curves for AM-SP, εGM-AP, AM-AP, MD-AP, and εHM-AP.]
Figure 11: Density distribution of 10,000 Pearson’s ρ values for AM-AP, GM-AP (ε = 0.01), HM-AP (ε = 0.01), MD-AP, and AM-SP, based on 10,000 random splittings of the TREC9 data set (t = 50 and s = 105). The experimental methodology was as for Figure 2.

5.5 Correlation coefficient methods

We have made extensive use of Kendall’s τ in our analysis, starting from the presumption outlined in Section 1 that whether or not significance of any particular pairwise relationships had been established, it was likely that overall system scores would be used to derive system rankings, and thus, that study of score aggregation mechanisms was of merit. Use of Kendall’s τ allowed the “closeness” of pairs of system rankings to then be quantified.

Kendall’s τ takes into account the system ordering that is generated, but not the scores that led to that ordering, meaning that when there are clusters of near-similar scores, modest changes in the scores can lead to more dramatic changes in the correlation coefficient. An alternative metric that is based on scores rather than rankings is the Pearson product-moment coefficient. To verify that the relationships between rankings that have been noted above are not specific to the use of Kendall’s τ, we repeated the experiments that led to Figure 2, using Pearson’s ρ to compare pairs of lists of “system, score” pairs. The results are shown in Figure 11, again using the TREC9 resource.

Broadly speaking, Figure 11 shows the same trends as had been identified using Kendall’s τ. The best method for obtaining a per-system score is the SP metric coupled with the AM averaging technique. Earlier in this paper we noted that because AM relied on addition, it could technically only be applied to values that were on the same scale, a restriction that did not apply to the geometric mean. The superior performance of AM-SP compared to AM-AP can be interpreted as a verification of this observation, since the process of standardizing the AP scores (subtracting the mean, and then dividing by the standard deviation for that topic) renders them into unitless values on a common scale. On the other hand, the pre-standardization AP values have different scales (in the sense of millimeters versus inches), because what is good performance on one topic might be substandard performance on another. In this sense, the process of standardization makes it “right” to then compute the arithmetic mean as the gross statistic for a system. And, even though the fully standardized scores are constrained to the (0, 1) interval, the fact that equality is not possible at either end of the scale means that no matter how good (or bad) a system is on a particular topic, it is possible – at least in theory – for a different system to be better (or worse). That is, there are no bookends in SP that force systems to be considered to be “equal” on very easy or very hard topics.

Note also in Figure 11 the good performance of εGM – by ε-adjusting the scores, and computing a geometric mean, small effectiveness values are allowed to contribute to the final outcome. However use of the harmonic mean is not appropriate, and the HM-AP method is inferior to the

References

Al-Maskari, A., Sanderson, M., Clough, P. & Airio, E. (2008), The good and the bad system: Does the test collection predict users’ effectiveness?, in ‘Proc. 31st Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval’, Singapore, pp. 59–66.

Buckley, C. (2004a), Topic prediction based on comparative retrieval rankings, in ‘Proc. 27th Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval’, Sheffield, England, pp. 506–507.

Buckley, C. (2004b), Why current IR engines fail, in ‘Proc. 27th Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval’, Sheffield, England, pp. 584–585.

Cormack, G. V. & Lynam, T. R. (2007), Validity and power of t-test for comparing MAP and GMAP, in ‘Proc. 30th Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval’, Amsterdam, The Netherlands, pp. 753–754.

Fagin, R., Kumar, R. & Sivakumar, D. (2003), Efficient similarity search and classification via rank aggregation, in ‘Proc. SIGMOD Int. Conf. on Management of Data’, San Diego, CA, pp. 301–312.

Järvelin, K. & Kekäläinen, J. (2002), ‘Cumulated gain-based evaluation of IR techniques’, ACM Transactions on Information Systems 20(4), 422–446.

Kendall, M. & Gibbons, J. D. (1990), Rank Correlation Methods, Oxford University Press, New York.

Mandl, T., Womser-Hacker, C., Nunzio, G. D. & Ferro, N. (2008), How robust are multilingual information retrieval systems?, in ‘Proc. 23rd Ann. ACM Symp. on Applied Computing’, Fortaleza, Ceará, Brazil, pp. 16–20.

Mizzaro, S. (2008), The good, the bad, the difficult, and the easy: Something wrong with information retrieval evaluation?, in ‘Proc. 30th European Conf. on Information Retrieval’, Glasgow, Scotland, pp. 642–646.

Mizzaro, S. & Robertson, S. (2007), HITS hits TREC: Exploring IR evaluation results with network analysis, in ‘Proc. 30th Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval’, Amsterdam, The Netherlands, pp. 479–486.

Moffat, A. & Zobel, J. (2009), ‘Rank-biased precision for measurement of retrieval effectiveness’, ACM Transactions on Information Systems. To appear.

O’Brien, M. & Keane, M. T. (2007), Modeling user behavior using a search-engine, in ‘Proc. 12th Int. Conf. on Intelligent User Interfaces’, Honolulu, Hawaii, USA, pp. 357–360.

Robertson, S. (2006), On GMAP: And other transformations, in ‘Proc. 15th ACM Int. Conf. on Information and Knowledge Management’, Virginia, USA, pp. 78–83.

Robertson, S. (2008), A new interpretation of average precision, in ‘Proc. 31st Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval’, Singapore, pp. 689–690.

Sanderson, M. & Zobel, J. (2005), Information retrieval system evaluation: effort, sensitivity, and reliability, in ‘Proc. 28th Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval’, Salvador, Brazil, pp. 162–169.

Voorhees, E. M. (2003), Overview of the TREC 2003 Robust Retrieval Track, in ‘Proc. 12th Text REtrieval Conference (TREC 2003)’, Gaithersburg, Maryland.

Voorhees, E. M. (2005), Overview of the TREC 2005 Robust Retrieval Track, in ‘Proc. 14th Text REtrieval Conference (TREC 2005)’, Gaithersburg, Maryland.

Webber, W., Moffat, A. & Zobel, J. (2008a), Score standardization for inter-collection comparison of retrieval, in ‘Proc. 31st Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval’, Singapore, pp. 51–58.

Webber, W., Moffat, A., Zobel, J. & Sakai, T. (2008b), Precision-at-ten considered redundant, in ‘Proc. 31st Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval’, Singapore, pp. 695–696.

Yilmaz, E., Aslam, J. A. & Robertson, S. (2008), A new rank correlation coefficient for information retrieval, in ‘Proc. 31st Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval’, Singapore, pp. 587–594.

Zobel, J. (1998), How reliable are the results of large-scale information retrieval experiments?, in ‘Proc. 21st Ann. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval’, Melbourne, Australia, pp. 307–314.

Similar Documents

Free Essay

Insurance Agent

...An insurance agent career can be fulfilling and lucrative. Agents help people protect their homes, businesses and personal belongings. Depending upon what insurance agent career path is taken, agents may also be helping others to protect their families. Insurance agents have unlimited earning potential. But is an insurance agent career the right choice? The description of an insurance agent career has changed over the years, but the essential job description remains the same. Agents have to enjoy working with people in order to have a successful insurance agent career. Insurance agents have to face some very stressed out and unhappy customers in the event that they experience a loss, and it’s up to the agent to run interference with a cheerful and positive attitude. Negotiating skills, motivation, a comfort level with insurance advice and some computer and internet skills are also a necessary part of the insurance agent job description. Independent insurance agency jobs work for multiple insurance carriers, but aren’t officially employed by any of them. The benefits of an independent insurance agent career include a higher earning potential than working for an insurance company. In addition, the freedom of self-employment allows people to decide who they want to work with and where to place insurance. The downside is that a lot of new independent insurance agent careers fizzle out quickly without the marketing and training provided by an agency or insurance carrier. Taking...

Words: 432 - Pages: 2

Premium Essay

Role

...Role of Service Insurance Agents In: Business and Management Role of Service Insurance Agents rRole of an insurance agent / broker Insurance agents and brokers play an important role in marketing life insurance policies. They are the face of the insurance company.Most of the insurance policies world over are sold by insurance agents and brokers. From the insurance company point of they are the marketing and selling agents for their insurance plans. The following are the basic function they should perform. -Provide all the necessary application forms. -Submit application forms to the company. -Arrange for all the medical tests and related formalities. -Provide reminders premiums payments and return receipts. -Should help you make necesary changes in address ,nomination etc. -Help in the process of assignment -Assist you for any loan applications and related formalities -Should help you revive lapsed policies -Assist in claiming death benefits, if required ____----- ROLE OF AN INSURANCE AGENT The role of an insurance agent is to supply a comprehensive policy which will provide adequate protection in the event of a loss on your new home. It should offer coverage for your dwelling, personal property, loss of use, and liability. The amount of insurance should be equal to the replacement value of the dwelling. A bank or mortgage company cannot require insurance in excess of the dwelling replacement cost. The insurance agent can help you to calculate...

Words: 327 - Pages: 2

Free Essay

Going Mobile

...Is there a process in your company or in another company that you can think of that would be improved by making it mobile as described in the article? This is actually a topic that I've been discussing with an insurance agency owner. This agency deals with personal lines as well as commercial lines of insurance and also focuses on health insurance and life insurance. Equipping insurance agents with iPads/iPhones that have multiple apps on them will allow the agent to do a better job in the field. They would be able to access their current client information, write new business, and make changes to existing business/policies (such as add a vehicle or drop a vehicle from a policy); they could accomplish all of this without being in their office. In fact, they could make these changes in a client's home or at a coffee shop with or without their client being present. SIDE NOTE: we also discussed an app for their clients to use that would allow them to access their insurance information, such as ID cards, when premiums are due, the ability to pay the premiums, etc. (this app would be branded around this particular insurance agency). How would it improve the business process? Having the ability to handle things for their clients as they come up would allow the agency and the agents to be a lot more efficient. It could also play a major role in reducing E & O (errors and omissions) claims, this situation occurs if an agent forgets to do something such as add a vehicle to the...

Words: 333 - Pages: 2

Premium Essay

Criminal Law

...Essay CJ 513-Terrorism James R. Myers Kaplan University Professor Stephen Dunker August 8, 2011 Abstract 1. Describe the “hawala system.” What makes it successful among it’s users? 2. Describe “martyrdom.” Does it go hand in hand with being a suicide bomber? Support your answer. Question 1: Describe the “hawala system.” What makes it successful among it’s users? Remittance is a transfer of money by foreign workers, relatives to his or her own countries. Monies sent home or transferred yearly by migrants constitutes the second largest financial influx to many developing countries. Estimates of these types of transactions range from $310 billion (according to the International Fund for Agricultural Development) to the World Banks estimate s of over $450 billion. Once such remittance program, which draws a great deal of attention, is the “hawala system” of remittance and/or payment. “Officials at Interpol and the International Monetary Fund estimates, that transactions within this system range as high a $700 billion dollars yearly” (Jost and Sandhu, 34). So, what is the hawala system? How does it work? What are the advantages of this system of finance? In an article written by Patrick Jost and Hajit Sandhu: “The Hawala Alternative Remittance System and It’s Role in Money Laundering (2010)…“Hawala is an alternative or parallel remittance system. It exists and operates outside of, or parallel to 'traditional' banking or financial channels...

Words: 1628 - Pages: 7

Free Essay

Effect of Hawla

...summary | | | This paper presents a description of the hawala (also referred to as hundi) alternative remittance system. Hawala is an ancient system originating in South Asia; today it is used around the world to conduct legitimate remittances. Like any other remittance system, hawala can, and does, play a role in money laundering. In addition to serving as a 'tutorial' on hawala transaction, this paper will also discuss the way in which hawala is used to facilitate money laundering.   What is hawala? | | | Hawala (1) is an alternative or parallel remittance system. It exists and operates outside of, or parallel to 'traditional' banking or financial channels. It was developed in India, before the introduction of western banking practices, and is currently a major remittance system used around the world. It is but one of several such systems; another well known example is the 'chop', 'chit' or 'flying money' system indigenous to China, and also, used around the world. These systems are often referred to as 'underground banking'; this term is not always correct, as they often operate in the open with complete legitimacy, and these services are often heavily and effectively advertised. The components of hawala that distinguish it from other remittance systems are trust and the extensive use of connections such as family relationships or regional affiliations. Unlike traditional banking or even the 'chop' system, hawala makes minimal (often no) use of any sort of negotiable...

Words: 8522 - Pages: 35

Free Essay

Ahold Corporate Governance

...Plunder of India India now is witnessing not mere corruption, but national plunder. --Brahma Challeny, The Hindu, Dec 6, 2010 ESTIMATE OF DEPOSITS IN SAFE HAVENS Top 5 in the world India - $1456 billion (1.4 Trillion dollars) Russia - $470 billion UK $390 billion Ukraine - $100 billion China $96 billion Note: While these numbers are not substantiated because of secrecy, it does reflect the magnitude. Conservative Estimate by Global Financial Integrity India’s standing per Transparency International India’s Corruption Perception Index: 3.3 Scale of 0 to10 10 (highly clean), 0 (highly corrupt) India The Republic of Scams India, Republic of Scams 2G Spectrum Allocation Fraud, 1.76 lakh crores (40 billion dollars) Classic example of collusion between politicians, industrialists and media with high powered brokers. Indian Government refuses to constitute a joint parliamentary committee of all parties to investigate the scam (that resulted in total deadlock in entire last parliamentary session). India, Republic of Scams 2G Spectrum Allocation Fraud(2010) India Today Jan 3, 2010 No tolerance for corruption or deception? Govt refused Joint Parliamentary committee because based on proportion of MPs there will be majority representatives from other parties who will be able expose those who actually received money and deposited in swiss and other safe havens? Who are they protecting? According to Dr. Swamy, Sonia has taken a big share of...

Words: 3238 - Pages: 13

Free Essay

Computer

...IQRA UNIVERSITY KARACHI CAMPUS Subject: Essentials of Islamic Finance Instructor Yousuf Ibnul Hasan Time Allowed: 60 Minutes Maximum Marks: 40. ------------------------------------------------- Department of Business Administration Quiz No: 2 fall 2013 Name of student:______________________________ ID No: _________ Class Day __________Timing _______________ ____Marks _________ No cutting, rubbing or scratching, no questions A | Name the financial products for the given transactions | | 1 | Venture financing is made under the financing product know as | Musharka | 2 | Capital not be in commodity or in metal currency for financing products known as | Modaraba | 3 | Inflation and holding the scarcity and price hike through financing mode known as | Morabaha | 4 | Financing is not given but assets are given on rents in financing mode known as | Ijarah | 5 | Rs.5 million finance for Capital Assets | Ijarah | 6 | Early you pay less you pay, delay you pay more you pay | Ijarah | 7 | Four partners are not eligible for management fee amount merge with Net Profit | Musharka | 8 | Selling Expense is the part of the cost of the product | Morabaha | 9 | SME concept is drive from the Financing Mode known as | Modaraba | 10 | Capital is structured by contribution sale of equity participation certification known as | Sukuk | 11 | Net Worth divided into equal amount of unit which are sale by owner to avail financing | Musharka...

Words: 893 - Pages: 4

Premium Essay

Article Review on Aml Paper

...COURSE: BUSINESS RESEARCH METHODS (GSM 5114) LECTURER: DR NARESH KUMAR International Anti-Money Laundering Regulation of Alternative Remittance System Why the Current Approach Does Not Work in Developing Countries By: Ashida Mohamed Ibrahim GM04455 Article review Citation Joanna Trautsolt and Jesper Johnson (2012). International anti-money laundering regulation of alternative remittance systems why the current approach does not work in developing countries. Journal of Money Laundering Control, Volume 15, No 4 (2012) 407-420. Introduction The article was written by Ms Joanna Trautsolt and Mr. Jesper Johnson from School of Oriental and African Studies, London and Chr. Michelsen Institute, Bergen, Norway, respectively. The authors conducted comparative case study analysis on the Financial Action Task Force’s (FATF) recommendations in regulating Alternative Remittance Systems (ARS) in Afghanistan and the United Arab Emirates. Based on the article, the results indicated that FATF’s recommendation to regulate informal remittance systems is ineffective in developing countries. I took note of the authors indication that researches conducted on related issues was minimal, coupled with the inherent limitations in the current regulatory approach. Overall, I was of the opinion that the article was very informative and easy to be read. However, some enhancements could be made in several areas: i. In introduction part and in explaining their arguments, the...

Words: 440 - Pages: 2

Premium Essay

Corruption

...In fact, people are surprised to find an honest politician. These corrupt politicians go scot-free, unharmed and unpunished. Leaders like Lal Bahadur Shastri or Sardar Vallabh Bhai Patel are a rare breed now who had very little bank balance at the time of death. The list of scams and scandals in the country is endless. Now Recently Before Start 2010 Commen Wealth Games Corruption is playing major role with commen wealth games. The Bofors payoff scandal of 1986 involved a total amount of Rs 1750 crore in purchase of guns from a Swedish firm for the Army. The Cement scandal of 1982 involved the Chief Minister of Maharashtra, the Sugar Scandal of 1994 involved a Union Minister of State for food, the Urea Scam and of course no one can forget Hawala Scandal of 1991, the Coffin-gate, fodder scam in Bihar or the Stamp scandal which shocked not only...

Words: 734 - Pages: 3

Free Essay

Appraisal

...ASIAN METACENTRE RESEARCH PAPER SERIES no.20 The Social Organization of Remittances: Channelling Remittances from East and Southeast Asia to Bangladesh Md Mizanur Rahman Brenda S.A. Yeoh ASIAN METACENTRE FOR POPULATION AND SUSTAINABLE DEVELOPMENT ANALYSIS HEADQUARTERS AT ASIA RESEARCH INSTITUTE NATIONAL UNIVERSITY of SINGAPORE Md Mizanur Rahman is a Postdoctoral Fellow at Asia Research Institute, National University of Singapore, Singapore. He is a sociologist with particular interests in migration and development, migration and human (in)security, minority migration and migration policy in East and Southeast Asia. He obtained his Ph.D. in Sociology from National University of Singapore, Singapore, and M.A. in Sociology from Aligarh Muslim University, Aligarh, India. Brenda S.A. Yeoh is Professor, Department of Geography, and the Head of Southeast Asian Studies Programme, National University of Singapore. She leads the research cluster on Asian Migrations at the Asia Research Institute and is Principal Investigator of the Asian MetaCentre for Population and Sustainable Development Analysis (funded by the Wellcome Trust, UK) at the Asia Research Institute. She is a social geographer whose main interest in population-related studies lies in migration, family and gender issues. She has in recent years completed, in collaboration with other colleagues, research projects on modes of childcare in Singapore, migrant women as paid domestic labour in the Southeast Asian context...

Words: 15746 - Pages: 63

Premium Essay

Corruption

...In fact, people are surprised to find an honest politician. These corrupt politicians go scot-free, unharmed and unpunished. Leaders like Lal Bahadur Shastri or Sardar Vallabh Bhai Patel are a rare breed now who had very little bank balance at the time of death. The list of scams and scandals in the country is endless. Now Recently Before Start 2010 Commen Wealth Games Corruption is playing major role with commen wealth games. The Bofors payoff scandal of 1986 involved a total amount of Rs 1750 crore in purchase of guns from a Swedish firm for the Army. The Cement scandal of 1982 involved the Chief Minister of Maharashtra, the Sugar Scandal of 1994 involved a Union Minister of State for food, the Urea Scam and of course no one can forget Hawala Scandal of 1991, the Coffin-gate, fodder scam in Bihar or the Stamp scandal which shocked...

Words: 734 - Pages: 3

Free Essay

Migration in Afghanistan

...Migration in Afghanistan 1. Introduction Afghanistan is home to the largest refugee crisis experienced since the inception of the UNHCR. Decades of war have led millions to flee their homes and seek refuge in the neighboring countries of Pakistan and Iran, and for those who were able, further abroad. The number of refugees peaked in 1990 at 6.2 million. The numbers began to decrease in 1992 with the fall of the government, but began to increase again in 1996 with the rise of the Taliban. In 2002, with the fall of the Taliban and the US-led invasion, record numbers of Afghan refugees returned to Afghanistan. An international reconstruction and development initiative began to aid Afghans in rebuilding their country after decades of war. Reports indicate that change is occurring in Afghanistan, but the progress is slow. The Taliban have regained strength in the second half of this decade, and insurgency and instability are rising. Afghanistan continues to be challenged by underdevelopment, lack of infrastructure, few employment opportunities, and widespread poverty. The slow pace of change has led Afghans to continue migrating in order to meet the needs of their families. Today refugee movements are no longer the primary form of Afghan migration. The search for livelihoods is now the primary reason for migration, which occurs through rural-urban migration within Afghanistan or through circular migration patterns as Afghans cross into Pakistan and/or Iran. Afghans utilize their...

Words: 13339 - Pages: 54

Premium Essay

Islamic Contract

...Islamic Contract Law
TYPES OF COMMITMENTS
1. Wa‘d – unilateral promise
• One party binds itself to perform a function for another
• Does not normally create legal obligation
• Legal obligation is created by:
  • Genuine need of the masses
  • Contingent promise
2. Muwaa‘ada – bilateral promise
• Two parties performing two unilateral promises on the same subject
• Use of two unilateral promises can lead to a forward contract, which is impermissible
• Not allowed and non-enforceable according to majority (AAOIFI, IFA and others)
• Some Hanafi/subcontinent scholars allow it provided no other prohibition (excessive gharar or short selling)
3. ‘Aqd – contract
• Promises do not constitute a contract
• Difference between a contract and a bilateral promise is there is no proprietary transfer in bilateral promise
ENFORCEABILITY OF PROMISES
• Islam prohibits rolling 2 contracts into one (safaqah-fi-safaqah).
• Modern financial transactions often need to combine 2 contracts into one, e.g. Hire Purchase.
• As Islam prohibits client signing agreement binding him to 2 contracts at the same time, e.g. rent and purchase, how can a bank structure an Ijara Mortgage where there is a rental element and purchase element?
How can a Bank offer an Islamic Mortgage yet still prevent itself having to hold huge assets on the balance sheet and potentially suffer massive losses on property disposals? Solution is to make client...

Words: 1081 - Pages: 5

Free Essay

Accounting Practice Legislations, Procedures and Policy Report

...Accounting Practice Legislations, Procedures and Policy Report Introduction This report contains a detailed compliance analysis of the Accounting Practice, which undertakes accounting and bookkeeping services for a travel agent. Along with day-to-day travel and tour services, the agent provides community services such as overseas worker sponsorship, payroll management for overseas workers, and money transfer. This report outlines the regulations, the practice's procedures and manuals, and its compliance with AUSTRAC regulations. Procedures Community services policies and procedures This accounting firm has in place policies and procedures that govern and regulate the privacy and confidentiality of client information. This concept applies not only to what you can disclose about your clients or your organisations outside of work, but also to what can be shared in network meetings. What information can be shared with other organisations, who shares it and how this information is given out should be clearly defined in any effective, professional service. It is often incorporated into a worker’s duty statement or job description. This practice has developed written policies and procedures, and staff training, in the following areas: * a confidentiality policy * a clearly defined process for identifying and regularly updating a Community Resource Index so that all workers are aware of what other services are available to refer to * processes...

Words: 1023 - Pages: 5

Free Essay

Airtel Case Study

...same could be any of the following: a. Farming-based labor b. Skilled or semi-skilled labor c. Manual labor of any other form. Also, a migrant for the purpose of this case study is primarily someone from a rural setup who falls in SEC D1, D2 or E1, E2, E3 as per the new categorization. The total internal migrant base as per 2011 stats has swelled to close to 1.14 crore. Total annual remittance per customer is at a staggering Rs 14,600 and the annual potential of the overall remittance market is at a staggering Rs 20000 crore. A gender cut of remittance is as below: What, then, are the ways for a person to transfer or remit? The conventional ways are bank transfer, post-office-based transfers, and online transfer via FINO etc. Hawala and transfer by hand are unconventional means of transferring money. Each of these means has its own set of advantages and inherent challenges/issues. Details as given: **** This case study is confidential and should not be shared in any form **** The most favored and prevalent means is using banks' services to get these transfers done. But at the same time, what we need to realize is that a majority of those transferring cash are unbanked or under-banked. (Unbanked: without a bank account. Under-banked: has a bank account in his/her hometown but has migrated to an urban town.) The next big thing in the remittance industry seems to be a mobile-based remittance service wherein the customer would be able...

Words: 930 - Pages: 4