Dataset Open Access
Amram, Oz;
Anzalone, Luca;
Birk, Joschka;
Faroughy, Darius A.;
Hallin, Anna;
Kasieczka, Gregor;
Krämer, Michael;
Pang, Ian;
Reyes-Gonzalez, Humberto;
Shih, David
<?xml version='1.0' encoding='utf-8'?> <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:creator>Amram, Oz</dc:creator> <dc:creator>Anzalone, Luca</dc:creator> <dc:creator>Birk, Joschka</dc:creator> <dc:creator>Faroughy, Darius A.</dc:creator> <dc:creator>Hallin, Anna</dc:creator> <dc:creator>Kasieczka, Gregor</dc:creator> <dc:creator>Krämer, Michael</dc:creator> <dc:creator>Pang, Ian</dc:creator> <dc:creator>Reyes-Gonzalez, Humberto</dc:creator> <dc:creator>Shih, David</dc:creator> <dc:date>2024-12-13</dc:date> <dc:description>This dataset contains approximately 180 million boosted jets, derived from open data collected by the CMS experiment at the Large Hadron Collider (LHC) in 2016 (specifically the JetHT datastream) and presented in a format suitable for machine learning (ML) applications. A detailed description of the dataset and how it was produced can be found in the companion paper, arXiv:2412.10504. For each jet we store its transverse momentum (p_T), pseudorapidity (eta), and azimuthal angle (phi). We also store its mass, groomed with the softdrop algorithm as computed within the CMS reconstruction. Up to 150 constituents of the jet are stored. For each constituent, its 4-momentum is stored in the format (p_x, p_y, p_z, E). We additionally store its transverse impact parameter (d_0) and longitudinal impact parameter (d_z) with their uncertainties, the charge of the candidate, its particle ID (PID) in the PDG format (note that neutral hadrons are assigned PID=130, that of the neutral kaon K_L^0, while positively/negatively charged hadrons are assigned PID=211, that of the charged pion), and its weight from the PUPPI algorithm.
We also include additional jet substructure quantities computed within the CMS reconstruction: the number of constituents in the jet, N-subjettiness variables, several jet-tagging observables from the CMS implementation of ParticleNet, and a regression of the jet mass from ParticleNet.

Jets are stored in HDF5 (h5) format with 4 keys:

'event_info', shape (N_jets, 3): [Run Number, LumiBlock, Event Number]

'jet_kinematics', shape (N_jets, 4): [pt, eta, phi, softdrop mass]

'PFCands', shape (N_jets, 150, 11): zero-padded list of up to 150 PF candidates inside the jet. Info for each candidate: [px, py, pz, E, d0, d0Err, dz, dzErr, charge, PDG ID, PUPPI weight]

'jet_tagging', shape (N_jets, 13): tagging info/scores for the AK8 jet. Info for each jet: [nConstituents, tau1, tau2, tau3, tau4, ParticleNet H4q vs QCD, ParticleNet Hbb vs QCD, ParticleNet Hcc vs QCD, ParticleNet QCD score, ParticleNet T vs QCD, ParticleNet W vs QCD, ParticleNet Z vs QCD, ParticleNet regressed mass]

The code that was used to create Aspen Open Jets from CMS open data can be found at https://github.com/OzAmram/AOJProcessing, and the code used for the OmniJet-alpha model and its training can be found at https://github.com/uhh-pd-ml/omnijet_alpha.
</dc:description> <dc:identifier>https://www.fdr.uni-hamburg.de/record/16505</dc:identifier> <dc:identifier>10.25592/uhhfdm.16505</dc:identifier> <dc:identifier>oai:fdr.uni-hamburg.de:16505</dc:identifier> <dc:relation>doi:10.25592/uhhfdm.16504</dc:relation> <dc:rights>info:eu-repo/semantics/openAccess</dc:rights> <dc:rights>https://creativecommons.org/licenses/by/4.0/legalcode</dc:rights> <dc:subject>Machine learning</dc:subject> <dc:subject>Foundation models</dc:subject> <dc:subject>Particle physics</dc:subject> <dc:subject>Collider physics</dc:subject> <dc:subject>LHC</dc:subject> <dc:subject>Open data</dc:subject> <dc:subject>Large dataset</dc:subject> <dc:subject>Jet physics</dc:subject> <dc:subject>Point clouds</dc:subject> <dc:subject>Jet tagging</dc:subject> <dc:subject>Boosted jets</dc:subject> <dc:title>Aspen Open Jets: a real-world ML-ready dataset for jet physics</dc:title> <dc:type>info:eu-repo/semantics/other</dc:type> <dc:type>dataset</dc:type> </oai_dc:dc>
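
The HDF5 layout listed in the description (four keys, with 'PFCands' zero padded to 150 candidates of 11 values each) can be explored with a short Python sketch. This is only an illustration under stated assumptions, not the authors' code: the column labels in `PF_COLUMNS` are names chosen here to follow the documented order, the file path passed to `read_block` is hypothetical, and a padded slot is assumed to have all 11 entries equal to zero. The demo runs on a small synthetic array, so no download is needed.

```python
import numpy as np

# Labels chosen here for the 11 'PFCands' columns, in the documented order.
PF_COLUMNS = ["px", "py", "pz", "E", "d0", "d0Err", "dz", "dzErr",
              "charge", "pdgId", "puppiWeight"]
COL = {name: i for i, name in enumerate(PF_COLUMNS)}
KEYS = ("event_info", "jet_kinematics", "PFCands", "jet_tagging")


def read_block(path, start=0, stop=1000):
    """Read jets [start, stop) for all four keys from one file.

    `path` is a hypothetical local file name. h5py is imported lazily so
    the helpers below remain usable without it.
    """
    import h5py
    with h5py.File(path, "r") as f:
        return {key: f[key][start:stop] for key in KEYS}


def constituent_mask(pfcands):
    """Boolean mask of shape (N_jets, 150): True for real candidates.

    Assumption: a padded slot has all 11 entries equal to zero.
    """
    return np.any(pfcands != 0.0, axis=-1)


def constituent_pt_eta_phi(pfcands):
    """Convert (px, py, pz) of each candidate to (pt, eta, phi)."""
    px = pfcands[..., COL["px"]]
    py = pfcands[..., COL["py"]]
    pz = pfcands[..., COL["pz"]]
    pt = np.hypot(px, py)
    phi = np.arctan2(py, px)
    # eta = asinh(pz / pt); padded slots (pt == 0) are left at eta = 0.
    ratio = np.divide(pz, pt, out=np.zeros_like(pt), where=pt > 0)
    eta = np.arcsinh(ratio)
    return pt, eta, phi


# Synthetic stand-in with the documented shape (N_jets, 150, 11).
demo = np.zeros((2, 150, 11), dtype=np.float32)
demo[0, 0, :4] = [3.0, 4.0, 0.0, 5.0]                        # jet 0: one candidate
demo[1, :2, :4] = [[1.0, 0.0, 0.0, 1.0], [0.0, 2.0, 0.0, 2.0]]  # jet 1: two

mask = constituent_mask(demo)
pt, eta, phi = constituent_pt_eta_phi(demo)
print("candidates per jet:", mask.sum(axis=1))
print("pt of first candidate in jet 0:", pt[0, 0])
```

With a downloaded file, `read_block("some_file.h5")["PFCands"]` could be passed to the same helpers. Reading in slices rather than whole datasets keeps memory use bounded, which matters for a dataset of this size.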