TL;DR: Ranking items like resumes, companies, or project ideas semantically using LLMs is powerful but tricky. Simple Pointwise “scoring” is often unreliable due to calibration issues and lack of context. Comparative methods (Pairwise, Listwise, Setwise, Tournament) are better. Tournament ranking, especially with ensembling, offers the best scalability and robustness for high-stakes decisions, despite higher implementation complexity. Choose the method based on your needs for accuracy, scale, and tolerance for noise.

I’m working on Paper To Project, which aims to identify useful and relevant projects from the latest scientific papers; after generating many project ideas, there’s a need to rank them. Talking to more people about it, I noticed a broader need for semantic ranking:

  • When I was leading a team of ML engineers, I had to read and triage hundreds of (very good!) resumes to prioritize the ones we should talk to first
  • When I was searching for a new role, I had to rank hundreds of companies based on culture, growth opportunities, and alignment with my goals
  • Business development teams have to rank government tenders based on feasibility, profitability, and strategic alignment
  • Ranking search results, product recommendations, internal documents – the list goes on; there is even a claim that many hard problems, such as vulnerability search, can be reduced to ranking.

Naturally, many turn to AI, specifically Large Language Models (LLMs), hoping to save some time. But simply asking an LLM to “score” resumes often leads to a ranked list that feels… off. Why? Because the way you ask the LLM to perform the ranking matters. People often get disappointed with the first results and drop LLMs as a solution, but the problem is not the LLM, it’s the way we ask it to do the ranking. With a few improvements, LLMs show their usefulness.

The best part is that you can keep teaching them to rank better. After an initial ranking you can look at the top candidates, adjust your ranking criteria, include more candidates, and rerank. For example, you can say “this company is great but doesn’t pay enough based on levels.fyi”, “this company pays well but I don’t want to work in this industry”, etc. Want to try this ranking tool? It’s a work in progress, but you can subscribe for updates :)

Let’s look into each ranking method and how we can improve it. I will use the example of screening resumes, but the same principles apply to other ranking tasks like selecting companies, tenders, or even project ideas.

Method 1: Pointwise Ranking (“The Isolated Scorecard”)

This is the most intuitive method and often the first one implemented. It treats every item (resume, company, tender) in complete isolation.

  1. The Mechanics: You feed the LLM one item at a time along with your criteria (e.g., the Job Description). You ask the LLM to assign numerical scores to that single item based on how well it meets each criterion.
  2. Resume Screening Example:
    • Input: JD + Resume of Candidate ‘Alex’.
    • Prompt: “Score Alex (1-10) on Python, K8s, AWS, YoE, Leadership…”
    • Output: A set of scores like {Python: 9, K8s: 5, AWS: 8, YoE: 8, Leadership: 4}, kept consistent by requesting structured output
  3. Ranking Logic: Repeat for all candidates. Calculate an aggregate score for each candidate (simple average, weighted average). Rank candidates based on this final number.
  4. Insertion of a new element is simple: just score it and merge it into the ranked list (a minimal sketch of this scoring loop follows this list).
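
To make the mechanics concrete, here is a minimal Python sketch of the pointwise loop. The call_llm helper, the criteria names, and the prompt wording are assumptions for this example (swap in your own provider’s client and criteria); it illustrates the pattern, not a production pipeline.

    import json

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in: replace with a call to your LLM provider.
        raise NotImplementedError

    CRITERIA = ["Python", "K8s", "AWS", "YoE", "Leadership"]

    def score_resume(jd: str, resume: str) -> dict:
        prompt = (
            f"Job Description:\n{jd}\n\nResume:\n{resume}\n\n"
            f"Score the candidate 1-10 on each of {CRITERIA}. "
            'Reply with a JSON object only, e.g. {"Python": 7, ...}.'
        )
        # Asking for JSON ("structured output") keeps parsing consistent.
        return json.loads(call_llm(prompt))

    def rank_pointwise(jd: str, resumes: dict[str, str]) -> list[tuple[str, float]]:
        # Score every resume in isolation, then rank by the (unweighted) average.
        scored = {name: score_resume(jd, text) for name, text in resumes.items()}
        averaged = {name: sum(s.values()) / len(s) for name, s in scored.items()}
        return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)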

Pointwise: Pros

  • Simplicity: Very easy to understand and implement.
  • Parallelism: Each item is scored independently, making it highly parallelizable. Compute time scales well with more processing units.
  • Low Latency Per Item: Getting the score for a single item is fast.

Pointwise: Cons

  • Calibration Chaos: LLMs lack a stable internal “ruler.” A score of ‘7’ for Alex’s Python skills is assigned based only on Alex’s resume text. It’s not calibrated against any other candidate or any objective standard. Bob might also get a ‘7’, but their actual proficiency could be vastly different. Scores can fluctuate wildly based on phrasing, resume length, or plain sampling randomness between calls. You cannot reliably compare a ‘7’ for Alex with a ‘7’ for Bob.
  • No Relative Context: The model never sees two candidates side-by-side. Ranking relies entirely on aggregating potentially meaningless absolute scores. Important relative differences are completely ignored.
  • Information Flattening: Averaging scores masks critical trade-offs. Your dream candidate might be a 10/10 on the one skill you absolutely need (K8s) but 5/10 on others, resulting in an average score of, say, 7. Another candidate might be a flat 7/10 across the board. The average score makes them look identical, potentially burying your specialist.

Pointwise ranking is fast and simple for getting individual assessments, but deeply flawed for creating reliable rankings in situations where relative merit and nuanced trade-offs matter. It often leads to noisy, untrustworthy lists for anything complex. (Caveat: okay for very rough pre-filtering based on simple, objective criteria where precision isn’t paramount, not for final decision-making.)


Method 1(a): Pointwise Ranking with Few-Shot Examples (“The Anchored Scorecard”)

Can we improve Pointwise by giving the LLM some examples to “anchor” its scores?

  1. The Mechanics: Similar to Pointwise, but before asking the LLM to score the target item, you provide 1-3 examples of other items that have already been scored according to your criteria.
  2. Resume Screening Example:
    • Input: JD + Resume of Candidate ‘Charlie’ (already scored) + Resume of Candidate ‘Dana’ (already scored) + Resume of Candidate ‘Alex’ (target).

    • Prompt:

      Job Description: [JD Text...]
       
      Example 1:
      Resume: [Charlie's Resume Text...]
      Scores: {Python: 8, K8s: 4, AWS: 7, YoE: 6, Leadership: 3}
       
      Example 2:
      Resume: [Dana's Resume Text...]
      Scores: {Python: 6, K8s: 8, AWS: 5, YoE: 9, Leadership: 7}
       
      Now, please score Candidate Alex based *only* on their resume against the Job Description criteria, using a similar scale and judgment as the examples:
      Resume: [Alex's Resume Text...]
      Scores: ?
    • Output: A set of scores for Alex, hopefully more consistent with the provided examples.

  3. Ranking Logic: Same as Pointwise – aggregate the scores and rank (a minimal prompt-assembly sketch follows this list).
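
As a small illustration, here is how the anchored prompt might be assembled programmatically. The function name and wording are assumptions for this sketch; choosing good anchor examples is the part that actually matters.

    import json

    def build_anchored_prompt(jd: str, examples: list[tuple[str, dict]], target_resume: str) -> str:
        # examples: already-scored (resume_text, scores) pairs used as anchors.
        parts = [f"Job Description:\n{jd}\n"]
        for i, (resume, scores) in enumerate(examples, start=1):
            parts.append(f"Example {i}:\nResume:\n{resume}\nScores: {json.dumps(scores)}\n")
        parts.append(
            "Now score the following candidate against the Job Description, "
            "using the same scale and judgment as the examples. Reply with JSON only.\n"
            f"Resume:\n{target_resume}\nScores:"
        )
        return "\n".join(parts)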

Improved Pointwise: Pros

  • Potential for Consistency: Might improve score consistency if the examples are highly relevant and well-chosen.
  • Scale Alignment: Can help nudge the LLM’s scores towards a desired range or interpretation of the scale (e.g., what constitutes an ‘8’ in K8s).

Improved Pointwise: Cons

  • Example Dependency: The quality of the ranking heavily depends on the quality, relevance, and diversity of the few-shot examples. Bad or unrepresentative examples can worsen the results.
  • Prompt Length: Including full examples significantly increases prompt length and token count, increasing costs and potentially hitting context limits faster.
  • Difficult Calibration: Ensuring examples truly calibrate the model across the wide variety of possible inputs (different resume styles, job types, etc.) remains very difficult.

The Shift: Embracing Comparative Judgement

The fundamental limitation of Pointwise is its isolation. Humans rank better by comparing. Can we make LLMs do the same? Yes, through comparative ranking methods. The core idea shifts from “How good is A?” to “Is A better than B for my needs?”

Method 2: Pairwise Ranking (“The Head-to-Head”)

  1. The Mechanics: You provide the LLM with your criteria (JD) and two items (resumes) at a time. You ask it to determine which of the two is a better fit overall.
  2. Resume Screening Example:
    • Input: JD + Resume A (Alex) + Resume B (Bob).
    • Prompt: “Considering the JD, which candidate is a stronger overall fit: A or B?”
    • Output: A preference, e.g., “B”.
  3. Ranking Logic: Use the LLM’s preference judgments to order the list. This typically involves:
    • Sorting: Integrating the LLM call as the comparison step in algorithms like Heap Sort, Merge Sort, etc. (Requires O(N log N) LLM calls).
    • Full Comparison (Less Common): Comparing all N*(N-1)/2 pairs and aggregating wins (O(N^2) calls - usually too expensive). Insertion of a new element is also expensive.
    • Tournament-based Strategies: The preference output (‘A’ or ‘B’) can also feed rating systems. For ongoing or more dynamic ranking, these pairwise outcomes can update ratings like Elo or Glicko over many comparisons, often structured within a tournament format (Swiss, elimination) to produce a final ranking. This adds another layer of complexity but can be powerful. Insertion needs its own heuristic, such as playing a new item’s first “match” against a well-calibrated item and then continuing against items with the closest ratings for a few iterations. (A minimal comparator sketch follows this list.)
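
Here is a minimal sketch of plugging the pairwise judgment into a standard sort, again assuming a hypothetical call_llm helper; a real implementation would cache comparisons, randomize the A/B order to reduce position bias, and handle ties or malformed answers.

    from functools import cmp_to_key

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in: replace with a call to your LLM provider.
        raise NotImplementedError

    def prefer(jd: str, resume_a: str, resume_b: str) -> int:
        # Return -1 if A is the stronger fit, +1 if B is, so sorted() puts winners first.
        prompt = (
            f"Job Description:\n{jd}\n\nCandidate A:\n{resume_a}\n\nCandidate B:\n{resume_b}\n\n"
            "Considering the JD, which candidate is the stronger overall fit? Answer 'A' or 'B' only."
        )
        return -1 if call_llm(prompt).strip().upper().startswith("A") else 1

    def rank_pairwise(jd: str, resumes: list[str]) -> list[str]:
        # Python's sort performs O(N log N) comparisons; each one is an LLM call here.
        return sorted(resumes, key=cmp_to_key(lambda a, b: prefer(jd, a, b)))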

Pairwise: Pros

  • Direct Comparison: Forces the LLM to weigh the specified criteria and make trade-offs between the two items.
  • Improved Accuracy: Often yields a more accurate relative ordering than Pointwise because it focuses on preference rather than unstable absolute scores.
  • Less Calibration-Sensitive: Whether A is better than B is often more stable than assigning precise scores to A and B individually.

Pairwise: Cons

  • Scalability Bottleneck (Cost/Latency): The number of comparisons needed for sorting grows quickly with N. O(N log N) LLM calls can be very slow and expensive, especially because comparisons in sorting algorithms often have to happen sequentially. As a rough estimate, sorting 200 resumes takes on the order of 200 × log₂ 200 ≈ 1,500 comparisons, which can mean hours of wall-clock time and significant cost.

Method 3: Listwise Ranking (“The Mini-Leaderboard”)

Can we give the LLM more context than just two items at a time?

  1. The Mechanics: Provide the LLM with the criteria (JD) and a small list of k items (e.g., 5-10 resumes). Ask the LLM to rank that specific list from best to worst.
  2. Resume Screening Example:
    • Input: JD + Resumes [C, D, E, F, G].
    • Prompt: “Rank candidates C, D, E, F, G from best fit (#1) to worst fit (#5) for the JD.”
    • Output: A ranked list, e.g., “E > C > G > D > F”.
  3. Ranking Logic: Use the LLM’s generated ranking for the list. To rank N items where N > k, strategies like “sliding windows” (Sun et al., 2023) are needed, often requiring multiple passes over the data to establish a global order (a minimal sliding-window sketch follows this list).
  4. Insertion: how do you build a list for a new item to slot into?
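
A minimal sliding-window sketch in the spirit of Sun et al. (2023), assuming a hypothetical call_llm helper and naive parsing: each pass slides a window of k items from the bottom of the list toward the top, and extra passes tighten the global order.

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in: replace with a call to your LLM provider.
        raise NotImplementedError

    def rank_window(jd: str, window: list[str]) -> list[str]:
        # One Listwise call: rank a small window best-to-worst (parsing kept naive here).
        labels = "\n\n".join(f"[{i}] {resume}" for i, resume in enumerate(window))
        prompt = (
            f"Job Description:\n{jd}\n\nCandidates:\n{labels}\n\n"
            "Rank the candidates from best to worst fit. Reply with indices only, e.g. 2,0,1"
        )
        order = [int(i) for i in call_llm(prompt).split(",")]
        return [window[i] for i in order]

    def sliding_window_rank(jd: str, resumes: list[str], k: int = 10, step: int = 5, passes: int = 2) -> list[str]:
        items = list(resumes)
        for _ in range(passes):
            start = max(len(items) - k, 0)
            while True:
                # Re-rank the current window in place, then slide it toward the front.
                items[start:start + k] = rank_window(jd, items[start:start + k])
                if start == 0:
                    break
                start = max(start - step, 0)
        return items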

Listwise: Pros

  • Richer Comparison Context: The LLM sees k items simultaneously, allowing for more nuanced comparisons within that set than Pairwise.
  • Potentially More Efficient than Pairwise: If k is well-chosen, it might require fewer total LLM calls than a full Pairwise sort, especially with an optimized windowing strategy.

Listwise: Cons

  • Limited Window (k): Still constrained by LLM context length limits. You can only compare a small subset at once.
  • Position Bias: Highly susceptible to the order items appear in the prompt. Items at the beginning or end often receive biased evaluations (“Lost in the Middle”). Shuffling helps but adds complexity.
  • Output Format Brittleness: LLMs might not strictly adhere to the requested output format, requiring robust parsing and error handling.
  • Windowing Strategy Complexity: Designing an effective sliding window approach (size, step, convergence criteria) is challenging and can significantly impact the final ranking quality and efficiency.

Method 4: Setwise Prompting (“The Optimized Comparison”)

This is less a standalone ranking method and more an optimization technique for speeding up comparative sorting, inspired by Zhuang et al. (2023).

  1. The Mechanics: Provide the LLM with the criteria (JD) and a small set of c items (where c > 2, e.g., c=3, 4, or 5). Ask it to identify the single best item from that set.
  2. Resume Screening Example (within a Sort):
    • Context: Imagine a Heap Sort needing to compare a parent node (P) with its children (C1, C2, C3).
    • Input: JD + Resumes [P, C1, C2, C3].
    • Prompt: “From candidates P, C1, C2, C3, identify the single strongest fit for the JD.”
    • Output: The label of the best candidate, e.g., “C2”.
    • Logic: Instead of 3 Pairwise calls (P vs C1, P vs C2, P vs C3), you use 1 Setwise call to find the best child to potentially swap with the parent.
  3. Ranking Logic: This “select best of c” primitive replaces the standard binary comparison within sorting algorithms (Heap Sort, Bubble Sort variations, etc.).
  4. Insertion is discussed in a separate paper (Podolak et al., 2025); a minimal “best of c” heap-sort sketch follows this list.
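
Here is a minimal sketch of the “best of c” primitive inside a heap sort, assuming a hypothetical call_llm helper and a 3-ary heap so each sift-down step is a single Setwise call over a parent and its children. It follows the idea from Zhuang et al. (2023) but is not their implementation.

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in: replace with a call to your LLM provider.
        raise NotImplementedError

    def pick_best(jd: str, candidates: list[str]) -> int:
        # Setwise primitive: one call returning the index of the strongest candidate in the set.
        labels = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
        prompt = (
            f"Job Description:\n{jd}\n\nCandidates:\n{labels}\n\n"
            "Which single candidate is the strongest fit? Reply with the index only."
        )
        return int(call_llm(prompt).strip())

    def _sift_down(jd: str, heap: list[str], i: int, size: int, arity: int = 3) -> None:
        while True:
            children = [c for c in range(arity * i + 1, arity * i + 1 + arity) if c < size]
            if not children:
                return
            nodes = [i] + children
            # One Setwise call replaces len(children) pairwise parent-vs-child calls.
            best = nodes[pick_best(jd, [heap[j] for j in nodes])]
            if best == i:
                return
            heap[i], heap[best] = heap[best], heap[i]
            i = best

    def setwise_heapsort(jd: str, resumes: list[str], arity: int = 3) -> list[str]:
        heap = list(resumes)
        for i in range(len(heap) // arity, -1, -1):  # build a max-heap ("max" = best fit)
            _sift_down(jd, heap, i, len(heap), arity)
        ordered, size = [], len(heap)
        while size > 0:
            heap[0], heap[size - 1] = heap[size - 1], heap[0]  # move current best to the end
            ordered.append(heap[size - 1])
            size -= 1
            _sift_down(jd, heap, 0, size, arity)
        return ordered  # best-to-worst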

Setwise: Pros

  • 🚀 Efficiency Boost: Significantly reduces the number of LLM calls required to perform a comparison-based sort compared to Pairwise, leading to faster ranking and lower costs.
  • Maintains Comparative Nature: Still relies on relative judgments.

Setwise: Cons

  • Optimization, Not a Full Method: It speeds up sorting but doesn’t eliminate the need for a sorting framework.
  • Implementation Effort: Requires modifying sorting algorithm implementations.
  • Optimal c? Performance might depend on finding the right set size c that the LLM can handle reliably.

Method 5: Tournament Ranking (“The Scalable & Robust Gauntlet”)

This method, inspired by sports tournaments and approaches like TourRank (Chen et al., 2024), is designed to handle large scale and maximize the reliability of the final ranking.

  1. The Mechanics: Structure the ranking like a multi-stage competition. Items compete in groups, winners advance, and often multiple independent runs are averaged (ensembled) for robustness.
  2. Resume Screening Example:
    • Stage 1 (Grouping & Parallel Competition): Divide 200 resumes into 20 groups of 10. In parallel, run a Listwise or Setwise prompt within each group to select the top 3 winners (e.g., “Rank these 10”, take top 3; or “Pick best 3 of 10”).
    • Stage 2: Collect the 60 winners (20 groups * 3 winners). Divide them into 6 groups of 10. Repeat the parallel competition to select the top 3 from each.
    • Stage 3 (Finals): Collect the 18 winners. Run a final Listwise ranking on this smaller set.
    • Ensemble for Reliability: Repeat the ENTIRE tournament (Stages 1–3) R times (e.g., R=5). Assign points based on advancement (e.g., +1 for a Stage 1 win, +2 for a Stage 2 win, plus points based on the final rank). Rank candidates by their total accumulated points across all R runs.
  3. Insertion is tricky: how do you form groups for a new item? Do you just split your ranked data into quantiles and slot it into the matching one? (A minimal ensemble-tournament sketch follows this list.)
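
A minimal sketch of the ensembled tournament, assuming a hypothetical top_of_group call (a Listwise/Setwise-style prompt with naive parsing) and identifying candidates by their resume text. Group calls within a stage are independent, so in practice you would run them in parallel.

    import random
    from collections import defaultdict

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in: replace with a call to your LLM provider.
        raise NotImplementedError

    def top_of_group(jd: str, group: list[str], winners: int) -> list[str]:
        # One Listwise/Setwise-style call: pick the `winners` best candidates in this group.
        labels = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(group))
        prompt = (
            f"Job Description:\n{jd}\n\nCandidates:\n{labels}\n\n"
            f"List the indices of the {winners} strongest candidates, best first, comma-separated."
        )
        picked = [int(i) for i in call_llm(prompt).split(",")][:winners]
        return [group[i] for i in picked]

    def tournament_once(jd: str, resumes: list[str], group_size: int = 10, winners_per_group: int = 3) -> dict[str, int]:
        points: dict[str, int] = defaultdict(int)
        stage, stage_num = list(resumes), 1
        while len(stage) > group_size:
            random.shuffle(stage)  # fresh random groups in each run also dilute position bias
            groups = [stage[i:i + group_size] for i in range(0, len(stage), group_size)]
            stage = []
            for group in groups:  # independent groups: run these calls in parallel in practice
                for winner in top_of_group(jd, group, winners_per_group):
                    points[winner] += stage_num  # wins in later stages are worth more
                    stage.append(winner)
            stage_num += 1
        # Finals: rank the remaining small set and award points by final position.
        finalists = top_of_group(jd, stage, len(stage))
        for rank, name in enumerate(finalists):
            points[name] += stage_num + (len(finalists) - rank)
        return points

    def tournament_rank(jd: str, resumes: list[str], runs: int = 5) -> list[str]:
        totals: dict[str, int] = defaultdict(int)
        for _ in range(runs):  # the ensemble: repeat the whole tournament and sum points
            for name, pts in tournament_once(jd, resumes).items():
                totals[name] += pts
        return sorted(totals, key=totals.get, reverse=True)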

Tournament: Pros

  • Scalability: Handles very large N by breaking it down into smaller, manageable stages.
  • Parallelism: Group competitions run simultaneously, drastically reducing wall-clock time compared to sequential sorting.
  • ✨ Robustness (via Ensemble): This is the key advantage. Averaging results over R runs significantly smooths out LLM randomness, position bias, and prompt sensitivity. Candidates who consistently perform well rise to the top. A single bad comparison doesn’t derail a strong candidate. This leads to highly reliable rankings.
  • Tunable Trade-offs: You can balance cost/time vs. quality/reliability by adjusting the number of stages, group sizes, and ensemble runs (R).

Tournament: Cons

  • Implementation Complexity: Requires careful design of the tournament structure (grouping, stages, advancement rules, points). More complex logic than other methods.
  • Design Choices Matter: The effectiveness depends on smart choices for group size, winners per group, etc.

Conclusion: Choose Your Ranking Weapon Wisely

The simple Pointwise “scorecard” is often inadequate for ranking tasks where nuance, trade-offs, and reliability are important. For serious ranking needs – whether screening resumes, finding the right job, or selecting critical tenders – comparative methods are superior.

  • Pairwise establishes the principle but struggles with scale.
  • Listwise adds context but introduces bias and windowing issues.
  • Setwise makes comparative sorting more efficient.
  • Tournament Ranking (especially with ensembling) offers the best combination of scalability, parallelism (speed), and robustness for high-stakes ranking of large item sets.

Investing in a more sophisticated ranking method like Tournament, despite the initial setup complexity, often yields a dramatically higher ROI by delivering reliable results you can trust, saving you valuable time and leading to better decisions.

Comparison of LLM Ranking Techniques

| Feature | Pointwise | Pairwise | Listwise | Setwise (Optimization) | Tournament |
|---|---|---|---|---|---|
| Input per Call | 1 item | 2 items | k items (small list) | c items (small set, c > 2) | n items (in a group stage) |
| Core Idea | Score item in isolation | Compare A vs. B | Rank a short list | Pick best 1 from c | Multi-stage group competition |
| Ranking Logic | Aggregate individual scores | Sort using comparisons / wins | Use LLM output / sliding window | Use within sorting algorithms | Advancement / ensemble points |
| Key Pro | Simple, parallelizable | Direct comparison, better relative accuracy | More context than Pairwise | Faster sorting than Pairwise | Scalable, parallel, robust (ensemble) |
| Key Con | Calibration issues, no relative context | O(N log N) / O(N^2) calls, slow sort | Context limit, position bias | Needs a sorting framework | Complex implementation |
| Best Use Case | Initial rough filtering, simple criteria | Small N, high accuracy needed | Moderate N, where some context helps | Speeding up pairwise-logic sorts | Large N, high stakes, need robustness & reliability |

References

@article{magas2025ranking,
  author       = {Magas, Dmitrii},
  title        = {Semantically rank everything using LLMs (companies, candidates, ideas etc)},
  year         = {2025},
  month        = {04},
  howpublished = {\url{https://eamag.me}},
  url          = {https://eamag.me/2025/Ranking-With-LLMs}
}
Chen, Y., Liu, Q., Zhang, Y., Sun, W., Shi, D., Mao, J., & Yin, D. (2024). TourRank: Utilizing Large Language Models for Documents Ranking with a Tournament-Inspired Strategy. ArXiv, abs/2406.11678. https://api.semanticscholar.org/CorpusID:270560753
Podolak, J., Peric, L., Janicijevic, M., & Petcu, R. (2025). Beyond Reproducibility: Advancing Zero-shot LLM Reranking Efficiency with Setwise Insertion. https://api.semanticscholar.org/CorpusID:277787122
Sun, W., Yan, L., Ma, X., Ren, P., Yin, D., & Ren, Z. (2023). Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent. ArXiv, abs/2304.09542. https://api.semanticscholar.org/CorpusID:258212638
Zhuang, S., Zhuang, H., Koopman, B., & Zuccon, G. (2023). A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models. Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. https://api.semanticscholar.org/CorpusID:264146620