In our examples so far we have only worked with unweighted events, meaning we have filled all the histograms with weight 1. While this is usually enough when working with real data, in the case of Monte Carlo data one usually needs to use weights. To illustrate this point, let’s make a set of new plots where we plot a stack of histograms from the ggH and VBFH samples. We will make a new script “plotMe2.py” which is a slight modification of the original plotting script:

```python
from Plot import Plot
from ROOT import TCanvas, TLegend, kMagenta, kYellow, kBlack

# set ATLAS plot style
from AtlasStyle import setStyle
setStyle()

# draw plots
histNames = ['ditau_m', 'ditau_visM']
flavorNames = ['SF', 'DF']
canvases = []
plots = []
legends = []
for histName in histNames:
    # load individual plots
    plots_ggH = []
    plots_VBFH = []
    for flavorName in flavorNames:
        plots_ggH += [ Plot("histograms.ggH.Alg{}.root".format(flavorName), histName) ]
        plots_VBFH += [ Plot("histograms.VBFH.Alg{}.root".format(flavorName), histName) ]
    # add plots from two samples into a single plot
    plot_ggH = Plot(plots_ggH)
    plot_VBFH = Plot(plots_VBFH)
    # set style
    plot_ggH.setStyleSolid(kMagenta)
    plot_VBFH.setStyleSolid(kYellow)
    # create a THStack
    plot = Plot([plot_ggH, plot_VBFH], True)
    # create a canvas
    c = TCanvas(histName, histName, 800, 600)
    # draw
    plot.draw()
    # Create the same plot, but this time merging the histograms.
    # We will set the "errorbar style" and overlay the THStack
    plotErr = Plot([plot_ggH, plot_VBFH])
    plotErr.setStyleErrorbar(kBlack)
    plotErr.draw("same")
    # draw legend
    legend = TLegend(0.6, 0.9, 0.9, 0.65)
    legend.AddEntry(plot_ggH.hist, "gluon-gluon fusion", "f")
    legend.AddEntry(plot_VBFH.hist, "vector-boson fusion", "f")
    legend.AddEntry(plotErr.hist, "Stat. uncertainty", "f")
    legend.Draw("same")
    # save the canvas
    c.Print("{}.pdf".format(histName))
    # save the instances so they are not deleted by the garbage collector
    canvases += [ c ]
    plots += [ plot, plotErr ]
    legends += [ legend ]
```

- In the original script we merged plots from the “ggH” and “VBFH” samples and stacked the plots from the “SF” and “DF” algorithms. In the new script we do it the other way around.
- Note that we have also updated the plot legend and histogram colours.

When you run this script, you should see plots like this:

> python -i plotMe2.py

The events are unweighted, and the yellow histogram has a larger area because there are more events in the “VBFH” sample than in the “ggH” one.

In reality, the gluon-gluon fusion process has about 10-times larger probability than the vector-boson fusion process, so the plot above does not realistically represent what we expect to see in real data. Furthermore, the number of events generated in the Monte Carlo simulation is arbitrary and unrelated to the number of events we expect in real data. Usually, we try to generate as many events as possible given the available computing resources (MC simulation is computationally very demanding), so we usually have more MC events than real-data events. So in order to create MC plots that show our expectation of what we will see in real data, we need to apply some weights.

In particle physics, the following formula is used to get the expected number of events in data:

Nexp = σ * L * ε,

where σ is the cross-section of the process of interest, L is the integrated luminosity (often called just the luminosity) and ε is the efficiency.

The cross-section variable describes the physics process of interest (e.g. gluon-gluon fusion production of the Higgs boson and its subsequent decay into tau leptons). We could say it is a strangely normalised probability that a particular process happens in the proton-proton collision. To fully describe real data, one needs to add up the cross-sections of all relevant processes that happen in proton-proton collisions. The cross-section does not depend on the experimental setup. It is calculated by theorists and we use it as an input.

Integrated luminosity describes the colliding beams (density and flow of the protons) and the length of the data taking period. It is purely a property of the collider and the experiment and does not depend on the physics process of interest. It is measured in a dedicated measurement and we use it as an input.

Finally, the efficiency tells us what fraction of events is actually detected by our experiment and passes our selection. It depends on the experimental setup but also on the physics process we study, and it has to be estimated using the MC simulation. We can calculate the efficiency as a ratio:

ε = Nreco / Ngen,

where Nreco is the number of events reconstructed by the detector and accepted by our event selection, and Ngen is the total number of generated events. So we can rewrite the formula for the expected number of events as:

Nexp = σ * L * Nreco / Ngen.
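As a quick sanity check, the formula can be evaluated numerically. The cross-section and luminosity below are the ggH values used later in this tutorial; the Nreco and Ngen values are purely hypothetical, chosen only to illustrate the arithmetic:

```python
# Worked example of N_exp = sigma * L * N_reco / N_gen.
# xsec and lumi are the ggH numbers quoted later in this tutorial;
# n_gen and n_reco are made-up values for illustration only.
xsec = 16.3789      # cross-section in pb
lumi = 36074.16     # integrated luminosity in pb^-1
n_gen = 1000000.0   # hypothetical number of generated MC events
n_reco = 150000.0   # hypothetical number of events passing selection

efficiency = n_reco / n_gen
n_exp = xsec * lumi * efficiency
print("efficiency = {:.3f}, expected events = {:.1f}".format(efficiency, n_exp))
```

Note that σ × L alone would give the expected number of events before any detector and selection effects; the efficiency then scales it down.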

In our previous examples we treated all events as unweighted, so we effectively only counted N = Nreco. To get the number of expected events, we have to assign an additional weight to each event:

W = σ * L / Ngen.

In practice, there is usually an additional complication: most MC generators produce events which already carry a certain weight different from 1 (this is because of the way the MC integration is implemented in the generator). So instead of the above formula one has to use an alternative:

W = σ * L * Wgen / ΣWgen,

where Wgen is the weight assigned to the event by the MC generator and ΣWgen is the sum of these weights over all the generated events. Note that Wgen can be different for each event, and therefore the event weight W will differ from event to event.
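The per-event weight formula can be sketched as a small helper function. This is a hypothetical illustration, not part of the tutorial’s code base; the VBFH cross-section and luminosity are taken from later in this tutorial, while the generator weights and ΣWgen are made up:

```python
def event_weight(xsec, lumi, w_gen, sum_w_gen):
    """Per-event MC weight: W = sigma * L * Wgen / sum(Wgen).

    xsec in pb, lumi in pb^-1; w_gen is the generator weight of this
    event, sum_w_gen the sum of generator weights of ALL generated
    events (before any selection).
    """
    return xsec * lumi * w_gen / sum_w_gen

# events with different generator weights get different total weights;
# 500000.0 is a hypothetical sum of generated weights
weights = [event_weight(1.3741, 36074.16, w, 500000.0) for w in (1.0, -1.0, 2.5)]
```

Note that generator weights can even be negative (e.g. in NLO generators), in which case the event enters the histogram with a negative contribution.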

To implement these event weights in our example we need to make some modifications. These modifications depend strongly on the format of our data. If you use n-tuples not produced for the Higgs to tau-tau analysis, you will need to modify the way you calculate the weight, although the general idea stays the same.

Firstly, we need to modify “Data.h” and “Data.cpp” to read the MC event weight. In our n-tuples the variable is called “weight_mc”. So let’s add it to “Data.h”:

```cpp
#ifndef DATA_H
#define DATA_H

#include "TTree.h"
#include "TLorentzVector.h"

class Data{
  public:
    /**
     * @brief Construct a new Data object
     *
     * @param tree - pointer to the TTree (or TChain) class
     */
    Data(TTree* tree);

    /**
     * @brief Tree variables
     */
    Float_t ditau_mmc_mlm_m;
    UInt_t tau_0;
    UInt_t tau_1;
    TLorentzVector *tau_0_p4 = 0;   //MUST SET POINTERS TO 0!
    TLorentzVector *tau_1_p4 = 0;   //MUST SET POINTERS TO 0!
    Double_t weight_mc;

  protected:
    /**
     * @brief pointer to the TTree (or TChain) class
     */
    TTree* m_tree = 0;
};

#endif
```

and to “Data.cpp”:

```cpp
#include "Data.h"

Data::Data(TTree* tree) :
  m_tree(tree)
{
  m_tree->SetBranchAddress("ditau_mmc_mlm_m", &ditau_mmc_mlm_m);
  m_tree->SetBranchAddress("tau_0", &tau_0);
  m_tree->SetBranchAddress("tau_1", &tau_1);
  m_tree->SetBranchAddress("tau_0_p4", &tau_0_p4);
  m_tree->SetBranchAddress("tau_1_p4", &tau_1_p4);
  m_tree->SetBranchAddress("weight_mc", &weight_mc);
}
```

Then we need to extract the sum-of-weights variable. Because our n-tuples have already passed a quite strict selection, we cannot get this number by looping over all events in our n-tuple and adding up the “weight_mc” variables (this would only work if we had an unfiltered n-tuple). Instead, we load this number from the sample’s metadata. If you list the content of one of the example data files, you will see that apart from the NOMINAL tree it also contains the “h_metadata” histogram:

```
> root -l ../ggH.root
root [0]
Attaching file ../ggH.root as _file0...
(TFile *) 0x55dc01a69730
> .ls
TFile**   ../ggH.root
 TFile*   ../ggH.root
  KEY: TTree NOMINAL;1 NOMINAL
  KEY: TH1D h_metadata;1
  KEY: TH1D h_metadata_theory_weights;1
  ...
> h_metadata->Draw()
```

This histogram contains various useful numbers, including the sum of weights for all generated events. If you look at the label of bin no. 8, you’ll see it reads “Σ xAOD weighted events”, which is exactly what we need. **Please keep in mind this particular format of storing metadata is used only in the Higgs->tautau analysis!**

So in order to get the “ΣWgen” part of the event weight formula we just need to pull out the number stored in bin 8:

```
h_metadata->GetBinContent(8);  // in c++ or the ROOT command line
h_metadata.GetBinContent(8)    # in python
```

There is one complication: our samples are split into multiple files, so we need to open all of them in a loop and add up the numbers from all the files before we start the event loop.
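The accumulation logic itself is simple; here is a minimal sketch in which `read_sum_of_weights` is a hypothetical stand-in for opening one ROOT file and reading bin 8 of its “h_metadata” histogram (the real TFile-based implementation appears in “EventLoop.py” later):

```python
def total_sum_of_weights(file_names, read_sum_of_weights):
    """Add up the generated sum-of-weights over all files of a sample.

    read_sum_of_weights(name) is a hypothetical callable that opens one
    file and returns the content of bin 8 of its h_metadata histogram.
    """
    total = 0.0
    for name in file_names:
        total += read_sum_of_weights(name)
    return total
```

This total must be computed over the full sample before the event loop starts, because the per-event weight needs the sample-wide ΣWgen, not a per-file one.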

Getting the cross-section and integrated luminosity is harder. Either they are provided by your adviser (if you are a student) or you have to look them up in your experiment’s documentation webpages. In the case of the ATLAS experiment, the cross-sections are summarised here and the luminosities here. However, you have to be a member of the ATLAS experiment to access these pages. For the two samples we use in our examples I will provide the cross-sections.

So let’s now move to the actual implementation. First, we modify the “Algorithm” class so that it can apply the MC weights, starting with “Algorithm.h”:

```cpp
#ifndef ALGORITHM_H
#define ALGORITHM_H

#include "TString.h"
#include "TH1D.h"

#include "Data.h"
#include "Cut.h"

class Algorithm {
  public:
    /**
     * @brief Construct a new Event Loop object
     */
    Algorithm();

    /**
     * @brief Initialize the algorithm
     *
     * @param data - pointer to the data-access class instance
     */
    void initialize(Data* data);

    /**
     * @brief Execute. Here the stuff happens
     */
    void execute();

    /**
     * @brief Pointer to the histogram class instance defined as public attribute
     */
    TH1D* h_ditau_m = 0;
    TH1D* h_ditau_visM = 0;

    /**
     * @brief Selection cuts
     */
    bool cut_selectSF = false;
    bool cut_selectDF = false;
    Cut cut_maxVisMass;

    /**
     * @brief Other algorithm properties
     */
    bool p_isMC = false;
    double p_sampleWeight = 1.;

  protected:
    /**
     * @brief Apply the selection
     *
     * @return true when event passed the selection.
     */
    bool passedSelection();

    /**
     * @brief Evaluate the event weight
     */
    void calculateWeight();

    /**
     * @brief Fill the histograms
     */
    void fillPlots();

    /**
     * @brief Instance of the data-access class
     */
    Data* m_data = 0;

    /**
     * @brief Total event weight used when filling histograms
     */
    double m_eventWeight;
};

#endif
```

- We have added two new public attributes to the Algorithm class (let’s call them properties):
  - “bool p_isMC”: we will set this property to true/false depending on whether we run over MC or real-data samples.
  - “double p_sampleWeight”: here we set the actual sample weight, i.e. the value of

σ * L / ΣWgen

which is the same for the entire sample.

- We have also added the protected attribute “m_eventWeight”, where we store the full event weight.
- We have added the “calculateWeight” method, which calculates this weight.

In the “Algorithm.cpp” we implement the changes:

```cpp
#include "Algorithm.h"

Algorithm::Algorithm() {
}

void Algorithm::initialize(Data* data) {
  m_data = data;
}

void Algorithm::execute() {
  // apply selection. Exit if didn't pass
  if(!passedSelection()) return;

  // calculate the event weight
  calculateWeight();

  // fill the plots
  fillPlots();
}

bool Algorithm::passedSelection() {
  // select specific flavor combinations
  if(cut_selectSF && m_data->tau_0!=m_data->tau_1) return false;
  if(cut_selectDF && m_data->tau_0==m_data->tau_1) return false;

  // select invariant mass of the tau lepton visible decay products
  if(cut_maxVisMass.enabled() &&
     cut_maxVisMass < (*(m_data->tau_0_p4) + *(m_data->tau_1_p4)).M()) return false;

  // passed all the cuts
  return true;
}

void Algorithm::calculateWeight() {
  // apply the sample weight
  m_eventWeight = p_sampleWeight;

  // apply the MC weight, if requested
  if(p_isMC) {
    m_eventWeight *= m_data->weight_mc;
  }
}

void Algorithm::fillPlots() {
  // here we fill the histograms. We protect the code against the empty pointers.
  if(h_ditau_m) h_ditau_m->Fill( m_data->ditau_mmc_mlm_m, m_eventWeight);
  if(h_ditau_visM) h_ditau_visM->Fill( (*(m_data->tau_0_p4) + *(m_data->tau_1_p4)).M(), m_eventWeight);
}
```

- In the new method “calculateWeight” we take the “p_sampleWeight” set from the outside and multiply it by the “m_data->weight_mc” event weight. The resulting event weight is stored in the protected attribute “m_eventWeight”.
- We modified the “fillPlots” method so that “m_eventWeight” is used as a second parameter in the “Fill” of the histograms.
- Note that the MC weight is applied only if the property “p_isMC” is set to true. This is because for real data we do not want to apply the event weight.

**Don’t forget to recompile the c++ code.**

Finally, we have to update the python classes that configure our Algorithm class. Because the cross-section and sum-of-weights numbers are tied to the actual MC sample we process, the new code will go into “EventLoop.py” and “Samples.py”, while we leave “Algorithm.py” unchanged.

Firstly, let’s deal with the sum-of-weights variable. Modify the “EventLoop.py” file:

```python
from ROOT import EventLoop as EventLoopCpp
from ROOT import TFile

class EventLoop(object):
    """ Wrapper around the c++ class event loop """

    def __init__(self, name):
        # set the event loop name
        self.name = name
        # now we can create instance of the c++ class EventLoop
        self.eventLoop = EventLoopCpp()
        # set the tree name attribute
        self.eventLoop.treeName = "NOMINAL"
        # keep the list of python Algorithm class instances
        self.algs = []
        # sample meta-data
        self.isMC = False
        self.xSection = 1.
        self.luminosity = 1.
        self.sumOfWeightsGenerated = 0

    def addAlgorithm(self, algorithm):
        """ add an algorithm into this event loop """
        algorithm.setSumw2()
        self.algs += [ algorithm ]
        self.eventLoop.algorithms.push_back(algorithm.alg)

    def addAlgorithms(self, algorithms):
        """ add multiple algorithms into this event loop """
        for alg in algorithms:
            self.addAlgorithm(alg)

    def execute(self):
        """ initialize and execute the event loop """
        # load sumOfWeights and initialize algorithms
        if self.isMC:
            self.loadSumOfWeightsGenerated()
            self.setSampleWeight()
        self.eventLoop.initialize()
        self.eventLoop.execute()

    def save(self):
        """ save histograms from all algorithms """
        for alg in self.algs:
            alg.save(self.name)

    def loadSumOfWeightsGenerated(self):
        """ Open all input files and adds up the sum of weights
            for all generated events. Only for MC. """
        if self.isMC:
            self.sumOfWeightsGenerated = 0
            for inputFile in self.eventLoop.inputFiles:
                # open each file and load the h_metadata histogram
                f = TFile.Open(str(inputFile))
                h_metadata = f.Get("h_metadata")
                # the generated number of events is stored in the 8th bin in the Higgs samples
                self.sumOfWeightsGenerated += h_metadata.GetBinContent(8)

    def setSampleWeight(self):
        """ Calculate the sample weight and pass to the algorithms. Only for MC """
        if self.isMC:
            sampleWeight = self.xSection * self.luminosity / self.sumOfWeightsGenerated
            for alg in self.algs:
                alg.alg.p_isMC = True
                alg.alg.p_sampleWeight = sampleWeight
```

- In the constructor we have added new meta-data variables which characterize the sample.
- “self.isMC” will be set to true/false in the derived class depending on whether we run over MC sample or real data
- “self.xSection” will be initialised in the derived class with the concrete sample’s cross-section.
- “self.luminosity” will be initialised in the derived class with the dataset luminosity.
- “self.sumOfWeightsGenerated” will be extracted from the input files.

- The new method “loadSumOfWeightsGenerated” opens all the input files one by one, loads the “h_metadata” histogram and adds up the content of bin number 8. Note that when we open the ROOT file

  TFile.Open(str(inputFile))

  we have to cast the “inputFile” variable to the python “str” type. This is because in this particular case “inputFile” is of ROOT’s “TString” type. Normally, if we had a pure python implementation, we would not need to do this.
- We added a new method “setSampleWeight” where we calculate the sample weight and pass it to the c++ Algorithm class instances. This is done only if the sample is MC.
- Finally, we call these methods from “execute” so that things are set up automatically.

Finally, we just set appropriate meta-data values in “Samples.py”:

```python
from EventLoop import EventLoop

# -----------------------------------
class SampleGGH(EventLoop):
    """ event loop over the gluon-gluon fusion sample """

    def __init__(self):
        # call the inherited constructor
        EventLoop.__init__(self, "ggH")
        # add the ggH samples into the event loop
        self.eventLoop.inputFiles.push_back('../ggH.root')
        # set sample meta-data
        self.isMC = True
        self.xSection = 16.3789     # pb
        self.luminosity = 36074.16  # pb^-1

# -----------------------------------
class SampleVBFH(EventLoop):
    """ event loop over the vector boson fusion sample """

    def __init__(self):
        # call the inherited constructor
        EventLoop.__init__(self, "VBFH")
        # add the vbfH samples into the event loop
        self.eventLoop.inputFiles.push_back('../vbfH.root')
        # set sample meta-data
        self.isMC = True
        self.xSection = 1.3741      # pb
        self.luminosity = 36074.16  # pb^-1
```

- The cross-sections are given in picobarns (1 pb = 10^−40 m^2).
- The luminosity is in inverse picobarns and corresponds to the 2015+2016 ATLAS data period.
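To get a feeling for these numbers, σ × L gives the expected event count before any detector and selection efficiencies. A quick check with the values above (a sketch for illustration, not part of the analysis code):

```python
# sigma * L = expected number of events before detector and selection
# efficiencies, using the cross-sections and luminosity set in Samples.py
lumi = 36074.16                          # pb^-1, 2015+2016 ATLAS data
xsec = {"ggH": 16.3789, "VBFH": 1.3741}  # pb

for sample, x in xsec.items():
    print("{}: {:.0f} events before efficiency".format(sample, x * lumi))

ratio = xsec["ggH"] / xsec["VBFH"]
```

The cross-section ratio comes out close to 12, consistent with the earlier statement that gluon-gluon fusion is roughly ten times more probable than vector-boson fusion.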

Let’s run the code to see how the plot has changed:

```
> python runMe.py
> python -i plotMe2.py
```

You should see plots like this:

When MC weights are included, the histogram areas correspond more closely to what one would expect from looking at the process cross-sections. Also, the total number of events is reduced significantly and becomes closer to the actual number of Higgs events expected in data.

Usually, the MC event weight is not the only weight that has to be applied to MC events to get realistic modelling of data. In modern experiments like ATLAS, many data-driven corrections are used to improve the modelling and reduce the measurement’s dependence on the simulation. However, it is beyond the scope of this tutorial to go into such details.