Use Case #114
Use case ATLAS: ATLAS001
This issue describes the ATLAS use case ATLAS001.
ATLAS open data replication, augmentation, bookkeeping and validation
A series of exercises (data production, replication and documentation) before and during the DAC in November 2021. They include the creation of datasets for realistic end-user analysis examples using the current open-access resources at http://opendata.atlas.cern/
The creation of instances to simulate the usage by several users.
- At CERN OpenStack, LAPP and personal computers
The usage of multiple sites (RSEs) in the Datalake by adding artificially created data.
- We will prepare several extra files and replicate them across several RSEs.
- We will not replicate every file to every RSE ("100% x # RSEs"). Instead, we plan to have two (2) replicas, which are later distributed across several RSEs.
- All those files combined are meant to total ~10 TB.
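The replica plan above can be sketched as a simple placement table. This is only an illustration: the RSE names and file names are hypothetical, and it assumes that the two replicas apply to each file and that spreading them evenly over all pairs of RSEs is acceptable.

```python
from collections import Counter
from itertools import combinations, cycle

# Hypothetical RSE names, for illustration only.
RSES = ["RSE-A", "RSE-B", "RSE-C", "RSE-D"]
REPLICAS_PER_FILE = 2  # assumption: two replicas of each file


def assign_replicas(file_names, rses=RSES, copies=REPLICAS_PER_FILE):
    """Map each file to `copies` distinct RSEs, cycling through all RSE combinations."""
    rse_groups = cycle(combinations(rses, copies))
    return {name: next(rse_groups) for name in file_names}


# Hypothetical names for the artificially created files.
files = [f"augmented_{i:04d}.root" for i in range(12)]
placement = assign_replicas(files)

# Count how many replicas land on each RSE.
load = Counter(rse for group in placement.values() for rse in group)
print(dict(load))
```

With 12 files and 4 RSEs this stores 24 replicas instead of the 48 that full replication to every RSE would need.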
We will reach that data volume (at most ~10 TB) by creating and using artificially multiplied data.
- Use the current ATLAS Open Data (AOD) datasets.
- This augmentation is done using ROOT.
We will use that augmented data to run the analysis examples (see ATLAS002 in "Related issues" below).
Create bigger ROOT files (current largest files are ~2.5GB)
We will need to
- Create and test a series of scripts that will automatically do the data augmentation, upload, and replication.
- Create clear instructions for users/computers that can be part of the challenge.
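A minimal dry-run sketch of the upload and replication part of such scripts is shown here. It only builds the command lines and does not run them. The scope, RSE names and file paths are hypothetical, the ROOT augmentation step is left out, and the sketch assumes the classic `rucio upload` and `rucio add-rule` CLI syntax is available on the client.

```python
import shlex

SCOPE = "ATLAS_OD_TEST"          # hypothetical scope name
UPLOAD_RSE = "RSE-A"             # hypothetical RSE receiving the first copy
RULE_EXPRESSION = "RSE-B|RSE-C"  # hypothetical RSE expression for the second copy


def upload_command(path, scope=SCOPE, rse=UPLOAD_RSE):
    """Command that uploads one local file to the upload RSE."""
    return ["rucio", "upload", "--rse", rse, "--scope", scope, path]


def rule_command(name, copies=1, scope=SCOPE, expression=RULE_EXPRESSION):
    """Command that requests `copies` extra replicas on RSEs matching the expression."""
    return ["rucio", "add-rule", f"{scope}:{name}", str(copies), expression]


def plan(paths):
    """Return the ordered list of commands for all files: upload first, then rule."""
    commands = []
    for path in paths:
        name = path.rsplit("/", 1)[-1]  # the file name is used as the DID name
        commands.append(upload_command(path))
        commands.append(rule_command(name))
    return commands


commands = plan(["out/augmented_0000.root", "out/augmented_0001.root"])
for cmd in commands:
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to review the plan before any data is moved; a real script would pass each list to `subprocess.run` and record failures.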
AOD public datasets
- Original set of ~300 GB (in ~1000 files) already hosted in the Datalake.
- The "multiplied data" is still to be created and integrated.
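As a rough sizing aid, the multiplication factor can be estimated from the numbers quoted in this issue. The sketch assumes decimal units (1 TB = 1000 GB) and that the ~10 TB target counts both replicas; if the target refers to unique data only, set `REPLICAS = 1`.

```python
import math

ORIGINAL_TOTAL_GB = 300   # from this issue: original set of ~300 GB
TARGET_TOTAL_GB = 10_000  # from this issue: ~10 TB (decimal units assumed)
REPLICAS = 2              # assumption: the target volume includes both replicas


def multiplication_factor(original_gb, target_gb, replicas):
    """Largest whole number of copies of the original set that stays within the target."""
    unique_budget_gb = target_gb / replicas
    return math.floor(unique_budget_gb / original_gb)


factor = multiplication_factor(ORIGINAL_TOTAL_GB, TARGET_TOTAL_GB, REPLICAS)
unique_gb = factor * ORIGINAL_TOTAL_GB
stored_gb = unique_gb * REPLICAS

print(f"factor={factor}, unique={unique_gb} GB, stored={stored_gb} GB")
```

Under these assumptions the original set can be multiplied about 16 times, giving ~4.8 TB of unique data and ~9.6 TB stored across the two replicas.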
RUCIO instances/CLI clients on several client machines (e.g. 3 or 4).
Arturo Sánchez Pineda
- Data is successfully stored.
- Data is successfully transferred/replicated among several RSEs.
- Basic metadata is stored.
- Users can discover data using the ESCAPE-RUCIO instance.
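The replication criterion above could be checked with a small helper like the following. The input format (file name mapped to the list of RSEs holding a replica) is a hypothetical summary that would have to be built from the actual replica listing, and the file and RSE names are illustrative.

```python
def under_replicated(replicas_by_file, expected=2):
    """Return the files stored on fewer than `expected` distinct RSEs, sorted by name."""
    return sorted(
        name
        for name, rses in replicas_by_file.items()
        if len(set(rses)) < expected
    )


# Hypothetical listing: file name -> RSEs that currently hold a replica.
listing = {
    "augmented_0000.root": ["RSE-A", "RSE-B"],
    "augmented_0001.root": ["RSE-A"],
    "augmented_0002.root": [],
}

missing = under_replicated(listing)
print(missing)
```

An empty result would mean every file has reached the planned number of replicas; anything else is a candidate for the failure and bottleneck report.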
Things to test
- Feasibility of replicating the samples and transferring them between RSEs.
- Locating the datasets using the RUCIO CLI and the Jupyter RUCIO extension.
- Reporting failures or bottlenecks.
- Cleaning procedure of the data from the Datalake.
Creation, replication, user analysis usage and deletion of datasets.