.. _Dataset_and_Preparation:

Machine Learning Force Fields (MLFF): Comprehensive Dataset Generation Guide
============================================================================

This tutorial provides a comprehensive guide to generating a robust and diverse dataset for training **Machine Learning Force Fields (MLFF)** for **Quantum Dots (QDs)**. The workflow integrates several computational techniques to ensure extensive coverage of the QD system’s configurational and chemical space.

Workflow Overview
-----------------

- **Ab-initio Molecular Dynamics (AIMD) Simulations (DFT-based)**  
  The process begins with AIMD simulations based on **Density Functional Theory (DFT)**. This step generates realistic atomic configurations and force data by accurately modeling the **QD system's dynamic behavior** at the atomic scale. The AIMD simulations are run using the **CP2K** quantum chemistry package.

- **Enhanced Sampling via Principal Component Analysis (PCA)**  
  To **broaden** the configurational space explored by AIMD, **PCA** is applied to the molecular dynamics trajectory. This statistical method identifies dominant modes of structural variation, enabling the generation of new, **diverse configurations** that capture essential system dynamics.

- **High-Accuracy DFT Calculations on New Samples**  
  The newly generated configurations are then evaluated with **high-precision single-point DFT calculations** using the **QMflows** package, a library that automates input generation for CP2K-based DFT calculations. This step **computes accurate energy and force data**, enriching the dataset for effective MLFF training.

- **Dataset Preparation for Machine Learning Models**  
  The final step involves organizing the collected data into a **machine-learning-friendly format** for further processing.

By systematically combining **AIMD simulations, PCA-enhanced sampling, and high-accuracy DFT calculations**, this workflow ensures the development of a high-quality dataset.

Step 1: Running Ab-initio Molecular Dynamics (AIMD) Simulation
--------------------------------------------------------------

Objective
~~~~~~~~~

Obtain initial atomic configurations and force data necessary for **MLFF development** by simulating realistic **QD system behavior**.

Simulation Setup
~~~~~~~~~~~~~~~~

- Start by **preparing a QD model** (see relevant tutorial) and **relaxing its geometry**.

  - Typically, a **2–3 nm QD passivated with Cl atoms**, which charge-balance the system, is sufficient.

- Perform an **AIMD simulation** using an **NVT ensemble** at **300 K** (or another desired temperature).
- Set the simulation duration to **5–10 picoseconds** to explore the system's **potential energy landscape**.  
- Extract **2000 structural snapshots** uniformly throughout the simulation to capture diverse atomic configurations.  

CP2K Input Configuration
~~~~~~~~~~~~~~~~~~~~~~~~

Add the following ``&PRINT`` settings to the ``&MOTION`` section of the **CP2K input** to write positions and forces at every MD step. Adjust the **printing frequency** (``MD 1``) according to your needs.

.. code-block:: text

   &MOTION
     &PRINT
       &TRAJECTORY
         FORMAT XYZ
         UNIT angstrom
         &EACH
           MD 1
         &END EACH
       &END TRAJECTORY
       &FORCES
         &EACH
           MD 1
         &END EACH
       &END FORCES
     &END PRINT
   &END MOTION

Considerations for Quantum Dots (QDs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- For **quantum dots such as CdSe passivated with halogens**, atomic motion is comparatively slow, so use a **timestep of 2–4 fs** to explore the configurational space efficiently.
- **Equilibrate** the system with **1000 NVT steps** before running a **production simulation** of **2000 frames** (for example, an **8 ps** simulation with a **4 fs** timestep).
- **Record positions and forces at every step** to ensure a **detailed dataset** for subsequent analysis.
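
The numbers above are linked by simple arithmetic: the production length is the number of frames times the timestep. The short sketch below makes that explicit; the helper name ``md_plan`` and its arguments are illustrative and not part of CP2K.

.. code-block:: python

   def md_plan(n_frames, timestep_fs, n_equil_steps=1000, print_every=1):
       """Return (production_ps, equilibration_ps, total_steps) for an MD run."""
       # Number of MD steps needed to print n_frames snapshots
       production_steps = n_frames * print_every
       production_ps = production_steps * timestep_fs / 1000.0
       equilibration_ps = n_equil_steps * timestep_fs / 1000.0
       return production_ps, equilibration_ps, n_equil_steps + production_steps


   # 2000 frames at 4 fs, printed every step, after 1000 equilibration steps
   prod_ps, equil_ps, total_steps = md_plan(n_frames=2000, timestep_fs=4.0)
   print(f"production: {prod_ps} ps, equilibration: {equil_ps} ps, steps: {total_steps}")

With a 2 fs timestep the same 2000 frames cover only 4 ps, so the timestep and the frame count should be chosen together.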

Step 2: Enhancing AIMD Data with Principal Component Analysis (PCA)
-------------------------------------------------------------------

**Objective:**  
Broaden the sampled configuration space by applying PCA to the AIMD data and generating additional, diverse structures.

Principal Component Analysis (PCA) in Dataset Generation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Principal Component Analysis (PCA)** is a statistical technique used to reduce the dimensionality of data by transforming it into a new set of variables called **principal components**.  

In the context of dataset generation for **MLFF**, PCA helps identify the most significant variations in atomic configurations by analyzing:

- **Energies**
- **Atomic positions**
- **Forces**
- **RMSD (Root-Mean-Square Deviation)**
- **SOAP (Smooth Overlap of Atomic Positions) descriptors**

By projecting the data onto the principal components, PCA effectively reveals directions of **maximum variance**, enabling the creation of **diverse and representative configurations** that capture essential structural dynamics. This approach ensures that the generated dataset spans the most relevant regions of the chemical space.
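
As a rough illustration of the idea (not the implementation used in ``generate_mlff_dataset.py``), the sketch below runs PCA on flattened Cartesian coordinates with NumPy and displaces each frame along the leading principal components. The function name ``pca_perturb`` and the ``scale`` argument are assumptions made for this example; the real script may use different descriptors and scaling rules.

.. code-block:: python

   import numpy as np


   def pca_perturb(frames, n_components=3, scale=0.4, seed=0):
       """Displace each frame along the leading principal components.

       frames: array of shape (n_frames, n_atoms, 3).
       """
       rng = np.random.default_rng(seed)
       n_frames, n_atoms, _ = frames.shape
       X = frames.reshape(n_frames, -1)
       mean = X.mean(axis=0)
       # SVD of the centred data: the rows of vt are the principal directions
       _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
       # Standard deviation of the trajectory along each retained component
       std = s[:n_components] / np.sqrt(n_frames - 1)
       # Random amplitudes, proportional to the observed spread along each component
       coeffs = rng.normal(size=(n_frames, n_components)) * std * scale
       new = X + coeffs @ vt[:n_components]
       return new.reshape(n_frames, n_atoms, 3)


   # Synthetic stand-in for an AIMD trajectory: 50 frames of 5 atoms
   frames = np.random.default_rng(1).normal(size=(50, 5, 3))
   new_frames = pca_perturb(frames)
   print(new_frames.shape)

Because the displacements are built only from the retained components, the new structures stay within the dominant modes of motion seen in the trajectory.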

You can download the script `generate_mlff_dataset.py` from
https://github.com/nlesc-nano/MLFF_QD/tree/main/src/mlff_qd/preprocessing

Then run the script with the following command:

.. code-block:: bash

   generate_mlff_dataset.py input.yaml

The script reads the AIMD positions and forces from the files specified in the YAML input. An example YAML configuration file is provided below.

Example YAML Configuration for the Script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: yaml

   pos_file: "mean_md-pos-1.xyz"
   frc_file: "mean_md-frc-1.xyz"
   scaling_factor: 0.4
   scaling_surf: 0.6
   scaling_core: 0.4
   max_random_displacement: 0.15
   surface_atom_types:
     - "In"
     - "P"
     - "Cl"
   clustering_method: "KMeans"
   num_clusters: 100
   num_samples_pca: 1200
   num_samples_pca_surface: 600
   num_samples_randomization: 200
   SOAP:
     species: ["In", "P", "Cl"]
     r_cut: 12.0
     n_max: 7
     l_max: 3
     sigma: 0.1
     periodic: False
     sparse: False

Structure Generation Breakdown
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **1200 Structures from PCA Sampling**  
  These structures are generated by perturbing configurations along the **principal components** derived from the **entire atomic system**. This enhances the dataset by exploring high-variance directions in the molecular dynamics trajectory.

- **600 Structures with Surface-Specific PCA Sampling**  
  Here, PCA is applied **specifically to surface atoms** (the species listed under ``surface_atom_types``), which are **more dynamic** than core atoms. Applying **larger displacements** to these atoms, reflecting their **natural mobility**, ensures that the **surface chemistry** is well represented.

- **200 Structures from Random Sampling**  
  Random displacements are applied **uniformly** across selected structures to introduce **additional diversity** and help avoid biases in the sampled configurations.
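
The random-sampling step can be pictured with the minimal sketch below. It assumes that each Cartesian component is shifted by a uniform random amount bounded by ``max_random_displacement`` (taken here to be in Å); the actual script may draw or bound the displacements differently, and the helper ``random_displace`` is hypothetical.

.. code-block:: python

   import random


   def random_displace(positions, max_displacement=0.15, seed=0):
       """Shift every coordinate by a uniform random amount in [-max, +max]."""
       rng = random.Random(seed)
       return [
           tuple(c + rng.uniform(-max_displacement, max_displacement) for c in atom)
           for atom in positions
       ]


   # Two atoms with illustrative coordinates in Angstrom
   structure = [(0.0, 0.0, 0.0), (2.5, 0.0, 0.0)]
   for atom in random_displace(structure):
       print(atom)

Fixing the seed keeps the generated structures reproducible between runs.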

Detailed Explanation of YAML Input Keywords
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **pos_file**  
  Path to the `.xyz` file containing **atomic positions** from the AIMD simulation.

  - This file serves as the input for PCA analysis.

- **frc_file**  
  Path to the `.xyz` file containing **corresponding atomic forces**.

  - These forces are used to evaluate **structural dynamics**.

- **max_random_displacement**  
  The **maximum displacement** applied in the **random sampling step**.

- **surface_atom_types**  
  A list of **atomic species** (e.g., `"In"`, `"Cl"`) considered as **surface atoms**.

  - These atoms are **more prone to movement** and are treated differently during **PCA sampling**.

- **clustering_method**  
  The algorithm used for **clustering structures** in the **PCA space**.

  - Here, `KMeans` is used to **group similar configurations** and **sample representative ones**.

- **num_clusters**  
  The **number of clusters** to create in **PCA space** for **diversity sampling**.

  - Each cluster provides **representative structures**.

- **num_samples_pca**  
  Number of structures generated by **perturbing configurations along PCA components** applied to the **entire system (core + surface)**.

- **num_samples_pca_surface**  
  Number of structures generated by applying **PCA perturbations** specifically to **surface atoms**, allowing them **greater freedom to move**.

- **num_samples_randomization**  
  Number of **randomly perturbed structures** added to the dataset to increase **diversity**.

The **SOAP** block configures the **Smooth Overlap of Atomic Positions** descriptor, which expands the local atomic neighborhood density in a basis of orthogonal radial functions and spherical harmonics. Its keywords are:

- **species**: the chemical elements in your model; adjust to your system.
- **r_cut**: cutoff radius of the local neighbouring environment.
- **n_max**: maximum number of radial basis functions (RBF).
- **l_max**: maximum degree of the spherical harmonics.
- **sigma**: width of the Gaussian smearing.


Output Files and Visualization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- **Generated Structures:**  
  The script outputs a `training_dataset.xyz` file containing:

  - 1200 PCA-sampled structures
  - 600 surface-PCA structures
  - 200 randomized structures

- **PCA Plots:**  
  Visualizations illustrate the distribution of the sampled structures in PCA space, providing insights into the configurational diversity achieved compared to the reference AIMD trajectory.

By combining **PCA-driven sampling, surface-specific perturbations, and randomization**, this approach ensures a well-balanced dataset that thoroughly explores the system's **chemical and configurational space**.
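
Before moving on, it can be useful to confirm that the generated file really holds the expected 1200 + 600 + 200 = 2000 frames. The sketch below counts frames in a concatenated XYZ file using only the standard library; ``count_xyz_frames`` is an illustrative helper, and it assumes the plain XYZ layout (atom count, comment line, one line per atom).

.. code-block:: python

   def count_xyz_frames(lines):
       """Count frames in a concatenated XYZ file given as an iterable of lines."""
       lines = [line.rstrip("\n") for line in lines]
       i, n_frames = 0, 0
       while i < len(lines):
           if not lines[i].strip():
               i += 1  # tolerate blank lines between frames
               continue
           n_atoms = int(lines[i].split()[0])
           if i + 2 + n_atoms > len(lines):
               raise ValueError(f"frame {n_frames} is truncated")
           i += 2 + n_atoms  # atom-count line + comment line + atom lines
           n_frames += 1
       return n_frames


   # Small in-memory example with two frames of two atoms each
   sample = "2\nframe 0\nH 0 0 0\nH 0 0 0.74\n2\nframe 1\nH 0 0 0\nH 0 0 0.80\n"
   print(count_xyz_frames(sample.splitlines()))

For a real file, pass the open file handle instead, for example ``count_xyz_frames(open("training_dataset.xyz"))``.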

Step 3: High-Accuracy DFT Calculations on the Generated Structures
------------------------------------------------------------------

In this step, the **2000 structures** generated in the previous step using the **enhanced sampling process** will be computed at the **Density Functional Theory (DFT)** level of theory.  
This process enables the calculation of **energy** and **force** data for these new configurations, which will be **added to the starting AIMD dataset**.

Detailed Workflow
~~~~~~~~~~~~~~~~~

1. **Organize the Working Directory**

   - Create a new folder to run the **DFT calculations**.
   - Copy the file `dataset_2000.xyz` (which contains the **2000 sampled structures**) into this folder.
   - Copy the `train.yaml` configuration file, which guides the **DFT calculation setup**, into the same directory.

   **Example Commands:**

   .. code-block:: bash

      mkdir DFT_Calculations
      cp dataset_2000.xyz DFT_Calculations/
      cp train.yaml DFT_Calculations/
      cd DFT_Calculations/

2. **Configure the `train.yaml` Input File**

   Open the `train.yaml` file and adjust the settings according to your **computational environment**.

Example YAML Configuration for the Script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The `train.yaml` file contains various parameters necessary for **DFT calculations**, including:

- **DFT functional** (e.g., PBE, B3LYP)  
- **Basis set**  
- **Convergence criteria**  
- **HPC-specific settings** (e.g., number of cores, memory allocation)  

**Example Configuration:**

.. code-block:: yaml

   workflow:
       distribute_single_points

   project_name: PbSe_Cl
   calculate_guesses: "all"
   active_space: [100, 100]
   path_traj_xyz: "dataset_2000.xyz"
   path_hdf5: "CdSe_Cl.hdf5"
   scratch_path: "cp2k_chunks"
   workdir: "."
   blocks: 5

   job_scheduler:
       free_format: "
           #!/bin/bash \n
           #SBATCH --job-name=PbSe_cl_single_point_cal \n
           #SBATCH --time=24:00:00 \n
           #SBATCH --nodes=2 \n
           #SBATCH --ntasks-per-node=112 \n
           module load cp2k/2024.1\n"

   cp2k_general_settings:
       path_basis: "cp2k_basis"
       basis_file_name: "BASIS_MOLOPT"
       potential_file_name: "GTH_POTENTIALS"
       basis: "DZVP-MOLOPT-SR-GTH"
       potential: "GTH-PBE"
       cell_parameters: 49.0
       periodic: none
       executable: cp2k.popt
       wfn_restart_file_name: "scf.wfn"

   cp2k_settings_main:
       specific:
           template: "train_main"  

   cp2k_settings_guess:
       specific:
           template: "train_guess"  

Detailed Explanation of YAML Input Keywords
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- **workflow:**  
  Defines the overall workflow. Here, ``"distribute_single_points"`` is used for **single-point energy and force calculations**.

- **project_name:**  
  Specifies the **name of the project**, e.g., ``"PbSe_Cl"``, which will be used for organizing output files.

- **calculate_guesses:**  
  Determines whether **initial wavefunction guesses** should be computed for each frame.

  - ``"all"`` means **all frames** will undergo:
  
    1. **Orbital Transformation (OT) calculations** to obtain an efficient initial guess.
    2. A **main calculation**, which then **fully diagonalizes the Fock matrix**.

- **active_space:**  
  Defines the **number of active molecular orbitals** whose **coefficients and energies** will be stored in the **HDF5 file**.

- **path_traj_xyz:**  
  Path to the ``dataset_2000.xyz`` file containing the **generated atomic structures**. This file **stacks all generated frames**, which will be computed using **DFT**.

- **path_hdf5:**  
  Path to an existing **HDF5 database**, which stores **DFT-derived properties**, including:

  - **Atomic structures (XYZ format)**
  - **Molecular Orbital (MO) coefficients**
  - **Other relevant electronic structure data**

- **scratch_path:**  
  Specifies the **temporary directory** where **DFT calculations** will be executed.

- **workdir:**  
  Specifies the **working directory**, where all **calculation results** will be stored.

- **blocks:**  
  Defines the **number of blocks** into which the **original dataset** will be split. Splitting keeps each job at a manageable size when running large-scale DFT calculations.

- **job_scheduler:**  
  Contains **HPC job submission settings**, including:

  .. code-block:: bash

     #SBATCH --job-name=PbSe_cl_single_point_cal
     #SBATCH --time=24:00:00
     #SBATCH --nodes=2
     #SBATCH --ntasks-per-node=112
     module load cp2k/2024.1

  - ``#SBATCH --job-name``: Specifies the **job name**.
  - ``#SBATCH --time``: Maximum **runtime allocation**.
  - ``#SBATCH --nodes``: Number of **compute nodes** requested.
  - ``#SBATCH --ntasks-per-node``: Number of **tasks per node**.
  - ``module load cp2k/2024.1``: Loads the **CP2K module** on the **HPC system**.

- **cp2k_general_settings:**  
  Contains **general CP2K input settings**, including:

  - ``basis_file_name``: Name of the **basis set file** (here the **MOLOPT** basis sets).
  - ``potential_file_name``: Name of the file containing the **GTH pseudopotentials**.
  - ``cell_parameters``: Defines the **simulation box size** (e.g., **49.0 Å**).
  - ``periodic``: Specifies **boundary conditions** (``none`` for **isolated QDs**).
  - ``executable``: Points to the **CP2K binary** (e.g., ``cp2k.popt``).

- **cp2k_settings_main:**  
  Specifies the **main CP2K input template**, based on the **PBE functional** (``"train_main"`` in the example above).

- **cp2k_settings_guess:**  
  Specifies the **wavefunction guess template**, also based on the **PBE functional** (``"train_guess"`` in the example above).


3. **Run the QMflows Job Distribution Script**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   Use **QMflows** and **nano-qmflows** to automate the **DFT calculations**. Assuming both are already installed, launch the job distribution script:

   .. code-block:: bash

      distribute_jobs.py -i train.yaml

   **Process Explanation:**

   - The script **splits the dataset** into **5 folders** (or more or fewer, depending on the ``blocks`` setting) to parallelize calculations.
   - In each folder, it generates the necessary **input files**, **Slurm job scripts**, and setup for the **DFT calculations**.  
   - The `input.yaml` generated is a **pre-processed YAML file** containing **all keywords, including default ones**, that will be used to generate the **CP2K input files**.  
   - It is always recommended to **check this file** before running the job to ensure all settings are correct.
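
To get a feeling for how many structures end up in each folder, the sketch below divides the frame indices as evenly as possible. This is only an assumption about the splitting rule: the way nano-qmflows distributes a remainder may differ, and ``split_into_blocks`` is a hypothetical helper.

.. code-block:: python

   def split_into_blocks(n_frames, n_blocks):
       """Return one (start, stop) index range per block, as evenly as possible."""
       base, extra = divmod(n_frames, n_blocks)
       ranges, start = [], 0
       for block in range(n_blocks):
           # The first `extra` blocks take one additional frame
           size = base + (1 if block < extra else 0)
           ranges.append((start, start + size))
           start += size
       return ranges


   # 2000 structures split with blocks: 5
   for index, (start, stop) in enumerate(split_into_blocks(2000, 5)):
       print(f"block {index}: frames {start}..{stop - 1} ({stop - start} structures)")

With 2000 structures and ``blocks: 5``, each job handles 400 single-point calculations; doubling ``blocks`` halves that number.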

4. **Submit the Jobs to the HPC Cluster**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   Navigate into each **generated folder** and submit the job to the HPC queue:

   .. code-block:: bash

      cd chunk_1/
      sbatch launch.sh
      cd ../chunk_2/
      sbatch launch.sh
      # Repeat for all chunks

   *Tip:* You can **increase the number of chunks** to reduce the computational load per job and **speed up the calculations**, depending on the available **HPC resources**.
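
When there are many chunks, the submission can be scripted. The sketch below loops over the ``chunk_*`` folders and calls ``sbatch launch.sh`` in each one; the function ``submit_all`` and its ``dry_run`` switch are illustrative additions, and the folder and script names are taken from the commands above.

.. code-block:: python

   import subprocess
   from pathlib import Path


   def submit_all(root=".", script="launch.sh", dry_run=True):
       """Run `sbatch <script>` in every chunk_* folder under root."""
       submitted = []
       for chunk in sorted(Path(root).glob("chunk_*")):
           if not (chunk / script).is_file():
               continue  # skip folders without a job script
           if not dry_run:
               # Submit from inside the chunk folder, as in the manual commands
               subprocess.run(["sbatch", script], cwd=chunk, check=True)
           submitted.append(chunk.name)
       return submitted


   # Dry run: only list the folders that would be submitted
   print(submit_all(dry_run=True))

Set ``dry_run=False`` once the listed folders look correct.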

5. **Handling Interrupted Jobs**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   If any calculation is **interrupted** (e.g., due to **wall-time limits**), simply **resubmit** the job in the affected chunk folder.
   **QMflows** manages restarts, so **only incomplete calculations** are resumed:

   .. code-block:: bash

      sbatch launch.sh

Key Points to Consider
^^^^^^^^^^^^^^^^^^^^^^

- **Parallelization:** Adjust the **number of chunks** for optimal performance on your HPC system. More chunks with fewer structures can speed up computations.  
- **Resource Management:** Customize the Slurm scripts (`launch.sh`) as needed for your **HPC environment**.
- **Automatic Restart:** **QMflows** handles restarts smoothly, allowing you to **resume incomplete jobs** without manual intervention.

----

Step 4: Convert All DFT Structures to ML-Ready Format
-----------------------------------------------------

1. **Extract Structures from Chunk Folders**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   Once the **DFT calculations** in each chunk are completed, download the `extract.py` script and run it inside each folder:

   .. code-block:: bash

      python extract.py

   **Process Explanation:**

   - The script scans the folder for **output files** containing **positions** and **forces**.
   - It **identifies redundant structures** that arise when **failed** calculations had to be **restarted**.
   - The most relevant output files are:

     * `positions_hartree_n.xyz`
     * `forces_hartree_n.xyz`  

     where `n` is the chunk number.

   **Merge all chunk outputs into a single file** (assuming **5 chunks** were generated):

   .. code-block:: bash

      cat positions_hartree_0.xyz positions_hartree_1.xyz positions_hartree_2.xyz \
          positions_hartree_3.xyz positions_hartree_4.xyz > positions_hartree_final.xyz 

      cat forces_hartree_0.xyz forces_hartree_1.xyz forces_hartree_2.xyz \
          forces_hartree_3.xyz forces_hartree_4.xyz > forces_hartree_final.xyz
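
If the number of chunks varies, the same merge can be done without listing every file by hand. The sketch below concatenates all ``<prefix>_<n>.xyz`` files found in one folder in numeric order of ``n``; ``merge_chunks`` is a hypothetical helper, and it assumes the per-chunk files have already been collected in a single directory, as the ``cat`` commands above do.

.. code-block:: python

   import re
   from pathlib import Path


   def merge_chunks(folder, prefix, output):
       """Concatenate <prefix>_<n>.xyz files in numeric order of n."""
       pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)\.xyz$")
       found = []
       for path in Path(folder).iterdir():
           match = pattern.match(path.name)
           if match:
               found.append((int(match.group(1)), path))
       found.sort()  # numeric order, so chunk 10 comes after chunk 9
       with open(output, "w") as out:
           for _, path in found:
               text = path.read_text()
               if text:
                   # Make sure consecutive files do not run together
                   out.write(text if text.endswith("\n") else text + "\n")
       return [path.name for _, path in found]


   # Example usage (uncomment when the chunk files are present):
   # merge_chunks(".", "positions_hartree", "positions_hartree_final.xyz")
   # merge_chunks(".", "forces_hartree", "forces_hartree_final.xyz")

Merging positions and forces with the same function keeps the frame order identical in both files.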

2. **Merge Extracted Structures with MD Structures**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   Now, merge these **DFT-calculated structures** with those obtained from the **initial MD simulation**:

   .. code-block:: bash

      cat mean_md-pos-1.xyz positions_hartree_final.xyz > merged_positions.xyz 
      cat mean_md-frc-1.xyz forces_hartree_final.xyz > merged_forces.xyz

3. **Convert All DFT Structures to ML-Ready Format**
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   Use the `compact_xyz.py` script to generate a **single XYZ file** ready for **ML training**.
   The script ensures:

   - **All frames computed with DFT** are included.
   - The file contains **energies, positions, and forces**.
   - Unit conversion is performed:

     * **Energies** → **eV**
     * **Positions** → **Ångström** (already in this format)
     * **Forces** → **eV/Ångström** (preferred for ML training)

   Run the script with:

   .. code-block:: bash

      python compact_input.py --pos merged_positions.xyz --frc merged_forces.xyz
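
The unit conversion itself is a pair of constant factors. The sketch below assumes the DFT output is in atomic units (energies in Hartree, forces in Hartree/Bohr), as the ``hartree`` file names suggest; the constant and function names are illustrative and not taken from the conversion script.

.. code-block:: python

   # CODATA 2018 values
   HARTREE_TO_EV = 27.211386245988
   BOHR_TO_ANGSTROM = 0.529177210903
   # Forces are energy per length, so divide by the length factor
   HARTREE_PER_BOHR_TO_EV_PER_ANGSTROM = HARTREE_TO_EV / BOHR_TO_ANGSTROM


   def energy_to_ev(energy_hartree):
       """Convert an energy from Hartree to eV."""
       return energy_hartree * HARTREE_TO_EV


   def force_to_ev_per_angstrom(force_au):
       """Convert a force vector from Hartree/Bohr to eV/Angstrom."""
       return tuple(c * HARTREE_PER_BOHR_TO_EV_PER_ANGSTROM for c in force_au)


   print(energy_to_ev(-1.0))
   print(force_to_ev_per_angstrom((0.01, 0.0, -0.02)))

Checking one frame by hand against these factors is a quick way to confirm that the converted file uses the units your ML code expects.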

   Use `consolidate.py` to select subsets of structures suitable for ML training:

   .. code-block:: bash

      python consolidate.py input.yaml

   An example input YAML file:

   .. code-block:: yaml

      dataset:
         input_file: "dataset_pos_frc_ev.xyz"
         output_prefix: "consolidated_dataset"
         sizes: [500, 1000, 2000, 4000]
         # Subset counts (number of structures from each method)
         subset_counts:
            MD: 2533
            PCA: 1200
            PCA_Surface: 600
            Random: 200
            contamination: 0.05
      SOAP:
         species: ["In", "P", "Cl"]
         r_cut: 12.0
         n_max: 7
         l_max: 3
         sigma: 0.1
         periodic: False
         sparse: False

Detailed Explanation of YAML Input Keywords
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- ``input_file``: specifies the input file name.
- ``output_prefix``: specifies the prefix of the output files.
- ``sizes``: the subset sizes (number of structures) to create.

Subset counts:

- ``MD``: structures obtained from Molecular Dynamics (MD) simulation. Adjust according to your data.
- ``PCA``: structures obtained from Principal Component Analysis (PCA).
- ``PCA_Surface``: surface-focused structures from PCA sampling.
- ``Random``: randomly selected structures for additional diversity.
- ``contamination``: fraction of outliers removed by Isolation Forest. 
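
A quick consistency check on these numbers can catch typos before running the script. The sketch below sums the per-method counts and flags any requested size larger than the total; ``check_subset_config`` is a hypothetical helper, and whether `consolidate.py` performs a similar check itself is not stated here.

.. code-block:: python

   def check_subset_config(subset_counts, sizes):
       """Return (total_structures, sizes_that_exceed_total)."""
       # `contamination` is a fraction, not a structure count, so leave it out
       counts = {k: v for k, v in subset_counts.items() if k != "contamination"}
       total = sum(counts.values())
       too_large = [size for size in sizes if size > total]
       return total, too_large


   subset_counts = {
       "MD": 2533,
       "PCA": 1200,
       "PCA_Surface": 600,
       "Random": 200,
       "contamination": 0.05,
   }
   total, too_large = check_subset_config(subset_counts, [500, 1000, 2000, 4000])
   print(total, too_large)

Note that outlier removal reduces the number of available structures, so the largest requested size should leave some margin below the total.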

The output files contain:

* `consolidated_dataset`: a subset of the dataset with the most diverse structures (preferred for ML training).
* `MD_random_dataset`: random structures picked from MD data.
* `random_dataset`: random structures from the whole dataset.

Choose the subset suited to your method and convert the corresponding `xyz` file to `npz` format using:

  .. code-block:: bash

       python xyztonpz.py consolidated_dataset_1000.xyz