Analysis tutorial #3: Accessing data

In the previous example we created an event loop class that can open data files and run through all the events. Now let’s add the part where we actually access the data and read some variables.

First, let’s learn how to see which objects are stored in a ROOT file and which variables are stored in a tree. We will use ROOT’s command line for this. You can load a ROOT file into the ROOT environment by running root with the file path given as a command-line argument. For example, using our test files from the Higgs analysis:

> root ../ggH.root 

Attaching file ../ggH.root...
(TFile *) 0x56497ffd4d50

ROOT informs you that it opened the file and that a pointer to the TFile instance (which represents the ROOT file in memory) is available in the variable called “_file0”. We can list the file content using the TFile method “ls”. In our example, you’ll see something like this:

> _file0->ls()

TFile** ../ggH.root 
 TFile* ../ggH.root 
  KEY: TTree NOMINAL;1 NOMINAL
  KEY: TH1D cutflow_muon_NOMINAL;1 cutflow_muon_NOMINAL
  KEY: TH1D cutflow_ele_NOMINAL;1 cutflow_ele_NOMINAL
  KEY: TH1D cutflow_pho_NOMINAL;1 cutflow_pho_NOMINAL
  KEY: TH1D cutflow_tau_NOMINAL;1 cutflow_tau_NOMINAL
  KEY: TH1D cutflow_mc_hs_jet_NOMINAL;1 cutflow_mc_hs_jet_NOMINAL
  KEY: TH1D cutflow_mc_pileup_jet_NOMINAL;1 cutflow_mc_pileup_jet_NOMINAL
  KEY: TH1D cutflow_HSM_common;1 Number of accepted events
  KEY: TH1D h_metadata;1 
  KEY: TH1D h_metadata_theory_weights;1 

You see the list of objects stored in the file together with their types. Note that there is an object of type “TTree” called “NOMINAL”, which we loaded in the previous example using the “TChain” class. Apart from it, there are other objects stored in the file (of type “TH1D”); we will come back to them later.

NOTE: the same printout can be obtained with the ROOT shortcut command (the dot is part of the command!):

> .ls

Let’s have a look at the TTree NOMINAL. There are many ways you can browse its content and we will try some of them.

Using TBrowser class

ROOT has a graphical user interface embodied in the TBrowser class. To launch it, you just need to create an instance of this class from ROOT’s command line:

> TBrowser b

You will get a window like this:

[Screenshot: the TBrowser window]

In the left panel you see the file we have opened. You can expand the list to see the objects stored in the file and then you can expand the TTree object NOMINAL to see the variables stored in the tree:

[Screenshot: TBrowser with the TTree NOMINAL expanded to show its variables]

You can even make simple plots using TBrowser. Try to double-click on a variable (e.g. “ditau_mmc_mlm_m”) and you will get a histogram of the reconstructed Higgs mass:

[Screenshot: TBrowser showing the histogram of ditau_mmc_mlm_m]

While TBrowser is useful for quickly checking the file content, it is impractical if you need to copy the names of many variables or find out their types.

TTree::Show or TTree::Print methods

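Calling the “Show” method from the ROOT command line gives you a list of all variables in the tree together with their current values, but without their types. The “Print” method gives you more information, including a type code after each variable name (e.g. “/I” for a signed and “/i” for an unsigned 32-bit integer).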

> NOMINAL->Show()

 HLT_2e17_lhvloose_nod0_L12EM15VHI = 0
 HTXS_Njets_pTjet25 = 0
 HTXS_Njets_pTjet30 = 0
 HTXS_Stage0_Category = 0
 HTXS_Stage1_1_Category_pTjet25GeV = 0
 HTXS_Stage1_1_Category_pTjet30GeV = 0
 HTXS_Stage1_1_Fine_Category_pTjet25GeV = 0
 HTXS_Stage1_1_Fine_Category_pTjet30GeV = 0
 HTXS_Stage1_Category_pTjet25GeV = 0
 HTXS_Stage1_Category_pTjet30GeV = 0
 HTXS_errorMode = 0
 HTXS_prodMode = 0
 NOMINAL_pileup_combined_weight = 0
 NOMINAL_pileup_random_lb_number = 0
 NOMINAL_pileup_random_run_number = 0
 PRW_DATASF_1down_pileup_combined_weight = 0
 PRW_DATASF_1up_pileup_combined_weight = 0
 boson_0_truth_p4 = NULL
 boson_0_truth_pdgId = 0
 boson_0_truth_q = 0
 boson_0_truth_status = 0
...

> NOMINAL->Print()

******************************************************************************
*Tree :NOMINAL : NOMINAL *
*Entries : 19785 : Total = 87411639 bytes File Size = 40373126 *
* : : Tree compression factor = 2.16 *
******************************************************************************
*Br 0 :HLT_2e17_lhvloose_nod0_L12EM15VHI : *
* | HLT_2e17_lhvloose_nod0_L12EM15VHI/i *
*Entries : 19785 : Total Size= 91611 bytes File Size = 12741 *
*Baskets : 98 : Basket Size= 1024 bytes Compression= 7.00 *
*............................................................................*
*Br 1 :HTXS_Njets_pTjet25 : HTXS_Njets_pTjet25/I *
*Entries : 19785 : Total Size= 90081 bytes File Size = 24508 *
*Baskets : 98 : Basket Size= 1024 bytes Compression= 3.58 *
*............................................................................*
*Br 2 :HTXS_Njets_pTjet30 : HTXS_Njets_pTjet30/I *
*Entries : 19785 : Total Size= 90081 bytes File Size = 23839 *
*Baskets : 98 : Basket Size= 1024 bytes Compression= 3.68 *
*............................................................................*

...
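
If you only need the names and types in a form that is easy to copy, the information shown by “Print” can also be pulled out programmatically. The macro below is a minimal sketch, not part of the original tutorial: the macro name “listLeaves” and its default arguments are assumptions, and it needs a ROOT installation to run. It opens the file, fetches the tree and prints the C++ type and name of every leaf.

```cpp
// listLeaves.C: print "type name" for every leaf of a tree.
// Sketch only; the macro name and the default arguments are assumptions.
#include <iostream>
#include "TFile.h"
#include "TLeaf.h"
#include "TObjArray.h"
#include "TTree.h"

void listLeaves(const char* fileName = "../ggH.root",
                const char* treeName = "NOMINAL") {
  // open the file read-only; check that it really opened
  TFile* file = TFile::Open(fileName);
  if (!file || file->IsZombie()) {
    std::cerr << "Cannot open " << fileName << std::endl;
    return;
  }

  // fetch the tree by name; the pointer stays null if it is not found
  TTree* tree = nullptr;
  file->GetObject(treeName, tree);
  if (!tree) {
    std::cerr << "No tree called " << treeName << std::endl;
    delete file;
    return;
  }

  // every leaf knows its own name and the C++ type of its values
  TIter next(tree->GetListOfLeaves());
  while (TObject* obj = next()) {
    TLeaf* leaf = static_cast<TLeaf*>(obj);
    std::cout << leaf->GetTypeName() << " " << leaf->GetName() << std::endl;
  }

  delete file;  // closes the file and deletes the tree it owns
}
```

Saved as “listLeaves.C”, it can be run with “root listLeaves.C”, because ROOT executes the function that has the same name as the macro file.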

Using MakeClass method

In my experience, the easiest way to access the list of variables inside the tree is using the MakeClass method of the TTree class. Let’s give it a try:

> NOMINAL->MakeClass("tree")

Info in <TTreePlayer::MakeClass>: Files: tree.h and tree.C generated from TTree: NOMINAL
(int) 0

The command has created two files, tree.h and tree.C. We do not care about the .C file but let’s have a look at the header file: tree.h

You see that at the beginning of the file there is a list of all variables including their types, in a convenient form ready to be copied into your own code.

Reading data from TTree

Now that we know how to get the list of variables stored in the tree we can go ahead and write our own code that will access the data. First, let’s make a skeleton class, e.g. called “Data”. The header file “Data.h”:

#ifndef DATA_H
#define DATA_H

#include "TTree.h"

class Data{
 public: 
  /**
  * @brief Construct a new Data object
  * 
  * @param tree - pointer to the TTree (or TChain) class
  */
  Data(TTree* tree);

 protected:

  /**
  * @brief pointer to the TTree (or TChain) class
  */
  TTree* m_tree = 0;

};

#endif

And the source “Data.cpp”:

#include "Data.h"

 Data::Data(TTree* tree) : m_tree(tree) {

}

A couple of remarks:

  • In this code we use the TTree class, while in the event loop we used TChain. Because TChain inherits from TTree, a pointer to a TChain instance can always be assigned to a TTree pointer, but not the other way around:
    TChain* chain = new TChain("NOMINAL");
    TTree* tree = chain; // this works
    TTree* tree2 = new TTree();
    chain = tree2; // this does not compile
  • This time we have a constructor that takes one parameter: a pointer to the TTree (or TChain) instance. This pointer is stored in the class attribute m_tree. Apart from this, the constructor is empty (for now).
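
The pointer conversion rule from the first remark can be tried out without ROOT. The following is a minimal sketch with hypothetical stand-in classes “BaseTree” and “DerivedChain” (they are not the ROOT classes and only mimic the inheritance relation between TTree and TChain):

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for TTree and TChain (not the ROOT classes).
class BaseTree {
 public:
  virtual ~BaseTree() = default;
  virtual std::string kind() const { return "tree"; }
};

class DerivedChain : public BaseTree {
 public:
  std::string kind() const override { return "chain"; }
};

// Like the Data constructor, this function only asks for the base class.
std::string describe(BaseTree* tree) { return tree->kind(); }

std::string upcastExample() {
  DerivedChain chain;

  // derived-to-base: implicit and always safe
  BaseTree* tree = &chain;

  // base-to-derived: the next line would not compile
  // DerivedChain* wrong = tree;

  // if the derived type is really needed, it has to be requested explicitly;
  // dynamic_cast returns a null pointer when the object is not a chain
  DerivedChain* back = dynamic_cast<DerivedChain*>(tree);
  assert(back == &chain);

  return describe(tree);  // the virtual call reaches the chain's version
}
```

Calling upcastExample() returns "chain": the function receives a base-class pointer but still talks to the derived object. This is why a class written against TTree can be handed a TChain.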

To compile this new class, you have to add it to the “Makefile” and “Linkdef.h” files. You have now learned enough to do it yourself :-)

Now that we have the skeleton ready and checked that it compiles, we can add variables we want to read from the ROOT file. The variables are read from the TTrees like this:

// first we create C++ variables of the correct type
Float_t someFloatVar;
Int_t someIntVar;
TLorentzVector *someFourMomentumVar = 0;

// then we pass their addresses to the tree
tree->SetBranchAddress("someFloatVar", &someFloatVar);
tree->SetBranchAddress("someIntVar", &someIntVar);
tree->SetBranchAddress("someFourMomentumVar", &someFourMomentumVar);

// event loop
for(int i=0; i<tree->GetEntries(); ++i) {
  // here the values from the file get copied into memory
  tree->GetEntry(i);

  // now we can work with the variables, e.g. print them
  std::cout << someFloatVar << std::endl;
}
  • It is important that the types of the variables defined in our C++ code are the same as the types stored in the file. This is where the “tree.h” file comes in handy: we can copy the variable declarations from there and be sure the types are correct.
  • The SetBranchAddress method takes the name of the variable as stored in the tree as its first parameter and a pointer to the C++ variable where we want the value stored as its second. The “&” symbol means that the address of the variable in memory is passed as the parameter, not the value itself.
  • The name of the C++ variable and the name of the variable in the file do not have to be the same. However, it is good practice to keep them identical for consistency.
  • For more complicated types like TLorentzVector it is better to define the C++ variable itself as a pointer (note the * in the someFourMomentumVar declaration). It is important to set the pointer to 0, otherwise you get a segmentation violation error when running the code!
  • Data files usually contain more variables than are needed for your analysis. Do not read all of them, as that is inefficient. It is always better to access only the variables that are actually needed.
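
To make the remarks about addresses and null pointers more concrete, here is a toy model of the mechanism in plain C++. It is only a sketch under stated assumptions: “ToyTree” and “ToyVector” are hypothetical stand-ins, not ROOT classes, and the real TTree is far more general. The point is the pattern: the tree remembers where the caller wants the values, GetEntry copies them there, and for an object branch a null pointer tells the tree to allocate the object itself.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy four-momentum; a stand-in for TLorentzVector, not the ROOT class.
struct ToyVector {
  double px, py, pz, e;
};

// Toy tree with one float branch and one object branch.
class ToyTree {
 public:
  // store one event
  void Fill(float mass, const ToyVector& p4) {
    m_masses.push_back(mass);
    m_p4s.push_back(p4);
  }

  // remember WHERE the caller wants the values; nothing is copied yet
  void SetMassAddress(float* address) { m_massAddress = address; }
  void SetP4Address(ToyVector** address) { m_p4Address = address; }

  int GetEntries() const { return static_cast<int>(m_masses.size()); }

  // copy the values of event i into the registered addresses
  void GetEntry(int i) {
    assert(i >= 0 && i < GetEntries());
    if (m_massAddress) *m_massAddress = m_masses[i];
    if (m_p4Address) {
      // a null pointer means "please allocate the object for me";
      // an uninitialized pointer would be written through and crash
      if (*m_p4Address == nullptr) {
        m_owned.reset(new ToyVector());
        *m_p4Address = m_owned.get();
      }
      **m_p4Address = m_p4s[i];
    }
  }

 private:
  std::vector<float> m_masses;
  std::vector<ToyVector> m_p4s;
  float* m_massAddress = nullptr;
  ToyVector** m_p4Address = nullptr;
  std::unique_ptr<ToyVector> m_owned;  // object allocated on the caller's behalf
};

// Fill three events, read them back and sum the energies.
double sumOfEnergies() {
  ToyTree tree;
  tree.Fill(100.f, ToyVector{1.0, 0.0, 0.0, 10.0});
  tree.Fill(125.f, ToyVector{0.0, 2.0, 0.0, 20.0});
  tree.Fill(150.f, ToyVector{0.0, 0.0, 3.0, 30.0});

  float mass = 0;
  ToyVector* p4 = nullptr;  // must start as a null pointer
  tree.SetMassAddress(&mass);
  tree.SetP4Address(&p4);

  double sum = 0;
  for (int i = 0; i < tree.GetEntries(); ++i) {
    tree.GetEntry(i);  // mass and *p4 now hold the values of event i
    sum += p4->e;
  }
  return sum;
}
```

Calling sumOfEnergies() returns 60 (10 + 20 + 30). If p4 were left uninitialized instead of being set to a null pointer, GetEntry would write to a random memory location, which is the toy equivalent of the segmentation violation mentioned above.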

Now that we know how things work, let’s implement some real variables in our “Data” class. The example ntuple contains pre-processed data from the H->tautau Monte Carlo simulation. It contains hundreds of variables. We will not try to explain their meaning (it is not important for the exercise), and we will of course not try to read all of them, only a few:

  • Reconstructed Higgs boson mass. It is stored in a variable called ditau_mmc_mlm_m (complicated name :-) )
  • Type of the leading and sub-leading lepton: tau_0 and tau_1
  • Momentum of the leading and sub-leading lepton: tau_0_p4 and tau_1_p4

Look up the declaration of these variables in the “tree.h” file and copy them into the public section of the Data class declaration:

#ifndef DATA_H
#define DATA_H

#include "TTree.h"
#include "TLorentzVector.h"

class Data{
 public: 
  /**
  * @brief Construct a new Data object
  * 
  * @param tree - pointer to the TTree (or TChain) class
  */
  Data(TTree* tree);

  /**
  * @brief Tree variables
  */
  Float_t ditau_mmc_mlm_m;
  UInt_t tau_0;
  UInt_t tau_1;
  TLorentzVector *tau_0_p4 = 0; //MUST SET POINTERS TO 0!
  TLorentzVector *tau_1_p4 = 0; //MUST SET POINTERS TO 0!

 protected:

  /**
  * @brief pointer to the TTree (or TChain) class
  */
  TTree* m_tree = 0;

};

#endif
  • Note that we have added the #include "TLorentzVector.h" directive for the TLorentzVector class. Otherwise the code would not compile, because the compiler would not know what this type is. The simple types (float, int, uint, etc.) do not need include statements.
  • We have declared the variables in the public section of the class so that we can access them from outside the class.
  • Note that for the TLorentzVector variables the pointer must be initialised to 0. This is important, otherwise you get a segmentation fault.

We also need to add the “SetBranchAddress” calls. These go into the constructor in the “Data.cpp” file. Since the variables are defined as attributes of the Data class, they are accessible from within any of the class’s methods, including the constructor. So we can simply do:

#include "Data.h"

Data::Data(TTree* tree) : m_tree(tree) {
 m_tree->SetBranchAddress("ditau_mmc_mlm_m", &ditau_mmc_mlm_m);
 m_tree->SetBranchAddress("tau_0", &tau_0);
 m_tree->SetBranchAddress("tau_1", &tau_1);
 m_tree->SetBranchAddress("tau_0_p4", &tau_0_p4);
 m_tree->SetBranchAddress("tau_1_p4", &tau_1_p4);
}

Now that we have our simple data access class, let’s integrate it into the event loop. We need to include “Data.h” in “EventLoop.h” and add a member attribute “Data* m_data” (EventLoop.h):

#ifndef EVENTLOOP_H
#define EVENTLOOP_H

#include <vector>
#include "TString.h"
#include "TChain.h"
#include "Data.h"

class EventLoop {
 public: 
  /**
  * @brief Construct a new Event Loop object
  */
  EventLoop();

  /**
  * @brief Initialize the event loop
  */
  void initialize();

  /**
  * @brief Execute the event loop
  */
  void execute();

  /**
  * @brief list of input ROOT file names
  */
  std::vector<TString> inputFiles;

  /**
  * @brief Name of the TTree instance. Must be same in all files
  */
  TString treeName;

 protected:

  /**
  * @brief Instance of the TChain class used to read the data 
  */
  TChain* m_chain = 0; // pointer is initialized to zero

  /**
  * @brief Instance of the data-access class
  */
  Data* m_data = 0;

};

#endif

In the source file we need to create an instance of the Data class and access it in the event loop (EventLoop.cpp):

#include "EventLoop.h"
#include <iostream>
#include <stdexcept>

EventLoop::EventLoop() {
 // nothing to do here
}

void EventLoop::initialize() {
 // create an instance of the TChain class
 m_chain = new TChain(treeName);

 // loop through the input files and add them to the chain
 for(auto inputFile : inputFiles) {
  m_chain->Add(inputFile);
  std::cout << "Added file: " << inputFile << std::endl;
 }

 // create an instance of the Data class. Here the variables
 // are linked with the tree using the SetBranchAddress method
 m_data = new Data(m_chain);
}

void EventLoop::execute() {
 // sanity check. m_chain must not be zero
 if(!m_chain) {
  throw std::runtime_error("Calling execute while the event loop was not initialized.");
 }

 // here we do the actual event loop
 for(int i=0; i<m_chain->GetEntries(); ++i) {
  // event number printout
  if(i%1000==0) {
   std::cout << "Event " << i << std::endl;
  }

  // read the data for i-th event
  m_chain->GetEntry(i);

  // now we can work with the variables. Let's for example print Higgs mass 
  // but only for every 1000th event
  if(i%1000==0) {
   std::cout << "m = " << m_data->ditau_mmc_mlm_m << std::endl;
  }
 }
}
  • Note that we use class inheritance as mentioned above. When we create the instance of the Data class, we pass the pointer to the TChain instance into a constructor that expects a TTree pointer. Everything compiles and runs just fine because TChain inherits from TTree.
  • In the last block we can access the attribute ditau_mmc_mlm_m because it was defined as public. This is not how things are usually done in C++ (it breaks so-called encapsulation), but it is the simplest implementation, so we will use it.
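
For comparison, this is roughly what the encapsulated alternative looks like. It is only a sketch: the class name “EncapsulatedData” and the method name “higgsMass” are hypothetical, and the value is set by hand instead of being read from a tree so that the example stands on its own. The tutorial keeps the simpler public attributes.

```cpp
#include <cassert>

// Hypothetical variant of the data-access class with one private variable.
class EncapsulatedData {
 public:
  // In a real class this is where SetBranchAddress would be called with
  // &m_ditau_mmc_mlm_m; here the value is simply set by hand.
  explicit EncapsulatedData(float mass) : m_ditau_mmc_mlm_m(mass) {}

  // read-only access: callers can look at the value but not overwrite it
  float higgsMass() const { return m_ditau_mmc_mlm_m; }

 private:
  float m_ditau_mmc_mlm_m;
};

float readMass() {
  EncapsulatedData data(125.0f);
  // data.m_ditau_mmc_mlm_m = 0;  // would not compile: the member is private
  assert(data.higgsMass() > 0);
  return data.higgsMass();
}
```

The price of this design is one accessor method per variable, which is why the public attributes are a reasonable shortcut for a small exercise.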

 

Recompile as usual.

> make clean
> make

Run the example using the python script “runMe.py”:

> python runMe.py

Added file: ../ggH.root
Event 0
m = 117.665
Event 1000
m = 164.447
Event 2000
m = 61.6836
Event 3000
m = 53.453
Event 4000
m = 114.276
Event 5000
m = 198.691
Event 6000
m = 38.7263
Event 7000

Now we have a simple example that actually reads some data from the file. However, at the moment it does not do anything useful with them :-). In the next example, we will show how to create histograms and fill them with values read from the file.
