The Pipeline

If you have a galaxy catalog (either of parametric origin or from a simulation), an `EmissionModel <../emission_models/emission_models.rst>`__, and a set of instruments you want observables for, you can easily write a pipeline to generate the observations you want using the Synthesizer UI. However, let’s say you have a new catalog you want to run the same analysis on, or a whole different set of instruments you want to use. You could modify your old pipeline or write a whole new one, but that’s a lot of work and boilerplate.

This is where the Pipeline shines. Instead of writing a bespoke pipeline, you can use the Pipeline class: a high-level interface that generates observations for a given catalog, emission model, and set of instruments. All you need to do is set up the Pipeline object, attach the galaxies, and pass the instruments to the observable methods you want, along with any necessary parameters. Possible observables include:

  • Spectra.

  • Emission Lines.

  • Photometry.

  • Images (with or without PSF convolution/noise).

  • Spectral data cubes (IFUs).

  • Instrument-specific spectroscopy.

The Pipeline will generate all the requested observations for all (compatible) instruments and galaxies, before writing them out to a standardised HDF5 format.

As a bonus, the abstraction into the Pipeline class allows for easy parallelization of the analysis, not only over local threads but also distributed over MPI.

The Pipeline workflow

When working with a Pipeline there are 4 distinct phases:

  • Instantiating the Pipeline object with the global properties and attaching galaxies.

  • Signalling which observables to generate with which properties and instruments.

  • Running the Pipeline.

  • Writing the outputs.

We cover each of these phases in detail below.

Setting up a Pipeline object

Before we instantiate a pipeline we need to define its “dependencies”. These are an emission model, a set of instruments, and, importantly, some galaxies to “observe”.

Defining an emission model

The EmissionModel defines the emissions we’ll generate, including their origin and any reprocessing they undergo. For more details see the EmissionModel docs.

For demonstration, we’ll use a simple premade IntrinsicEmission model which defines the intrinsic stellar emission (i.e. stellar emission without any ISM dust reprocessing).

[1]:
from synthesizer.emission_models import IntrinsicEmission
from synthesizer.grid import Grid

# Get the grid
grid_name = "test_grid"
grid = Grid(grid_name)

model = IntrinsicEmission(grid, fesc=0.1)
model.set_per_particle(True)  # we want per particle emissions

Defining the instruments

We don’t need any instruments if all we want are emission lines or spectra at the spectral resolution of the Grid. However, to get anything more sophisticated we need Instruments that define the technical specifications of the observations we want to generate. For a full breakdown see the instrumentation docs.

Here we’ll define a simple set of instruments including a subset of NIRCam filters (capable of imaging with a 1 kpc resolution), a set of UVJ top hat filters (only capable of photometry), a premade JWST MIRI instrument, and a completely arbitrary spectrometer and IFU. We’ll pass these explicitly to the observable methods below to associate them with the observations we want to generate.

[2]:
import numpy as np
from unyt import angstrom, kpc

from synthesizer.instruments import JWSTMIRI, UVJ, FilterCollection, Instrument

# Get the filters
lam = np.linspace(10**3, 10**5, 1000) * angstrom
webb_filters = FilterCollection(
    filter_codes=[
        f"JWST/NIRCam.{f}"
        for f in ["F090W", "F150W", "F200W", "F277W", "F356W", "F444W"]
    ],
    new_lam=lam,
)
uvj_filters = UVJ(new_lam=lam)

# Instantiate the instruments
webb_inst = Instrument("JWST.NIRCam", filters=webb_filters, resolution=1 * kpc)
miri_inst = JWSTMIRI()
uvj_inst = Instrument("UVJ", filters=uvj_filters)
spectrometer = Instrument("SpecInst", lam=np.logspace(3, 4, 100) * angstrom)
ifu = Instrument(
    "IFU", lam=np.logspace(3, 4, 100) * angstrom, resolution=1 * kpc
)
instruments = webb_inst + miri_inst + uvj_inst + spectrometer + ifu

print(instruments)
+-------------------------------------------------------------------------------------------------+
|                                      INSTRUMENT COLLECTION                                      |
+-------------------+-----------------------------------------------------------------------------+
| Attribute         | Value                                                                       |
+-------------------+-----------------------------------------------------------------------------+
| all_filters       | <synthesizer.instruments.filters.FilterCollection object at 0x7fa660d01d20> |
+-------------------+-----------------------------------------------------------------------------+
| ninstruments      | 5                                                                           |
+-------------------+-----------------------------------------------------------------------------+
| instrument_labels | [JWST.NIRCam, JWST.MIRI, UVJ, SpecInst, IFU,]                               |
+-------------------+-----------------------------------------------------------------------------+
| instruments       | JWST.NIRCam: Instrument                                                     |
|                   | JWST.MIRI: JWSTMIRI                                                         |
|                   | UVJ: Instrument                                                             |
|                   | SpecInst: Instrument                                                        |
|                   | IFU: Instrument                                                             |
+-------------------+-----------------------------------------------------------------------------+

Loading galaxies

You can load galaxies however you want, but for this example we’ll load some CAMELS galaxies using the load_data module.

[3]:
from synthesizer import TEST_DATA_DIR
from synthesizer.load_data.load_camels import load_CAMELS_IllustrisTNG

# Create galaxy object
galaxies = load_CAMELS_IllustrisTNG(
    TEST_DATA_DIR,
    snap_name="camels_snap.hdf5",
    group_name="camels_subhalo.hdf5",
    physical=True,
)
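
As a quick sanity check we can confirm how many Galaxy objects came back (a trivial sketch, assuming only that the loader returns a list):

# load_CAMELS_IllustrisTNG returns a list of Galaxy objects
print(f"Loaded {len(galaxies)} galaxies")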

Instantiating the Pipeline object

We now have all the ingredients we need to instantiate a Pipeline object. All we need to do is pass the emission model into the Pipeline object alongside the number of threads we want to use during the analysis (in this notebook we’ll only use 1 for such a small handful of galaxies); the galaxies and instruments are attached in the steps that follow.

[4]:
from synthesizer.pipeline import Pipeline

pipeline = Pipeline(
    emission_model=model,
    nthreads=1,
    verbose=1,
)

                         ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⡀⠒⠒⠦⣄⡀⠀⠀⠀⠀⠀⠀⠀
                         ⠀⠀⠀⠀⠀⢀⣤⣶⡾⠿⠿⠿⠿⣿⣿⣶⣦⣄⠙⠷⣤⡀⠀⠀⠀⠀
                         ⠀⠀⠀⣠⡾⠛⠉⠀⠀⠀⠀⠀⠀⠀⠈⠙⠻⣿⣷⣄⠘⢿⡄⠀⠀⠀
                         ⠀⢀⡾⠋⠀⠀⠀⠀⠀⠀⠀⠀⠐⠂⠠⢄⡀⠈⢿⣿⣧⠈⢿⡄⠀⠀
                         ⢀⠏⠀⠀⠀⢀⠄⣀⣴⣾⠿⠛⠛⠛⠷⣦⡙⢦⠀⢻⣿⡆⠘⡇⠀⠀
                         ⠀⠀⠀+-+-+-+-+-+-+-+-+-+-+-+⡇⠀⠀
                         ⠀⠀⠀|S|Y|N|T|H|E|S|I|Z|E|R|⠃⠀⠀
                         ⠀⠀⢰+-+-+-+-+-+-+-+-+-+-+-+⠀⠀⠀
                         ⠀⠀⢸⡇⠸⣿⣷⠀⢳⡈⢿⣦⣀⣀⣀⣠⣴⣾⠟⠁⠀⠀⠀⠀⢀⡎
                         ⠀⠀⠘⣷⠀⢻⣿⣧⠀⠙⠢⠌⢉⣛⠛⠋⠉⠀⠀⠀⠀⠀⠀⣠⠎⠀
                         ⠀⠀⠀⠹⣧⡀⠻⣿⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⡾⠃⠀⠀
                         ⠀⠀⠀⠀⠈⠻⣤⡈⠻⢿⣿⣷⣦⣤⣤⣤⣤⣤⣴⡾⠛⠉⠀⠀⠀⠀
                         ⠀⠀⠀⠀⠀⠀⠈⠙⠶⢤⣈⣉⠛⠛⠛⠛⠋⠉⠀⠀⠀⠀⠀⠀⠀⠀
                         ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀


[00000.00]: Root emission model: intrinsic
[00000.00]: EmissionModel contains 10 individual models.
[00000.00]: EmissionModels split by emitter:
[00000.00]:    - galaxy:     0
[00000.00]:    - stellar:    10
[00000.00]:    - blackhole:  0
[00000.00]: EmissionModels split by operation type:
[00000.00]:    - extraction:   4
[00000.00]:    - combination:  3
[00000.00]:    - attenuating:  3
[00000.00]:    - generation:   0

Notice that we got a log out of the Pipeline object detailing the basic setup. The Pipeline will automatically output logging information to the console, but this can be suppressed by passing verbose=0, which limits the output to saying hello, goodbye, and any errors that occur.
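
For example, a near-silent Pipeline would look like this (a sketch reusing the model defined above):

# verbose=0 limits output to the greeting, goodbye, and any errors
quiet_pipeline = Pipeline(emission_model=model, nthreads=1, verbose=0)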

Adding analysis functions

We could just run the analysis now and get whatever predefined outputs we want. However, we can also add our own analysis functions to the Pipeline object. These functions will be run on each galaxy in the catalog and can be used to generate any additional outputs we want. Importantly, these functions are run after all other analysis has finished, so they can make use of any outputs generated by the Pipeline object. They are also run in the order they were added, so each can access anything derived by a previous analysis function.

Any extra analysis functions must obey the following rules:

  • It must calculate the “result” for a single galaxy at a time.

  • The function’s first argument must be the galaxy to calculate for.

  • It can take any number of additional arguments and keyword arguments.

  • It must either:

    • Return an array of values or a scalar, such that np.array(<list of results>) is a valid operation. In other words, the results once combined for all galaxies should be an array of shape (n_galaxies, <result shape>).

    • Return a dictionary where each result at the “leaves” of the dictionary structure is an array of values or a scalar, such that np.array(<list of results>) is a valid operation. In other words, the dictionary of results once combined for all galaxies should be a dictionary with an array of shape (n_galaxies, <result shape>) at each “leaf”.

Below we’ll define an analysis function to compute stellar mass radii of each galaxy.

[5]:
def get_stellar_mass_radius(gal, fracs):
    """Compute the half mass radii.

    Args:
        gal (Galaxy):
            The galaxy to compute the half mass radii for.
        fracs (list of float):
            The mass fractions at which to compute radii.
    """
    result = {}
    for frac in fracs:
        result[str(frac).replace(".", "p")] = gal.stars.get_attr_radius(
            "current_masses", frac
        )
    return result
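
Since the function only needs a galaxy and the fracs argument, it can be sanity checked directly on one of the galaxies loaded above before being registered:

# Try the function on a single galaxy (just the half mass radius here)
print(get_stellar_mass_radius(galaxies[0], fracs=[0.5]))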

To add this to the Pipeline we need to pass it along with a string defining the key under which the results will be stored in the HDF5 file and the fracs argument it requires.

[6]:
pipeline.add_analysis_func(
    get_stellar_mass_radius,
    result_key="Stars/HalfMassRadius",
    fracs=(0.2, 0.5, 0.8),
)
[00000.01]: Added analysis function: Stars/HalfMassRadius

This can also be done with simple lambda functions to include galaxy attributes in the output. For instance, the redshift.

[7]:
pipeline.add_analysis_func(lambda gal: gal.redshift, result_key="Redshift")
[00000.01]: Added analysis function: Redshift

Using the Pipeline Object

To run the analysis we attach the galaxies to the Pipeline and then call the various observable generation methods to signal which observables we want to generate (including any necessary arguments and/or instruments for each). This approach lets you explicitly control the observables you generate, with a single line of code for each, and determines what is eventually written out to the HDF5 file.

Attaching the galaxies

To include the galaxies we simply pass the list of Galaxy objects we defined above to the add_galaxies method. Note that this method is “destructive”, i.e. calling it multiple times will overwrite the galaxies previously attached.

[8]:
pipeline.add_galaxies(galaxies)
[00000.02]: Galaxies memory footprint: 0.44 MB
[00000.02]: Adding 10 galaxies took 1.743 ms.

Signalling the observables

Now that we have the galaxies we can set up which observables we want. We do this by calling the various observable generation methods on the Pipeline object, followed by the run method to perform the analysis.

Spectra

We’ll start with the spectra.

[9]:
pipeline.get_spectra()

If we want fluxes, we can pass an astropy.cosmology object to the get_observed_spectra method to get the spectra in the observer frame.

[10]:
from astropy.cosmology import Planck18 as cosmo

pipeline.get_observed_spectra(cosmo)

Line Emission

Next we’ll generate the emission lines. Here we can specify exactly which emission lines to generate by line ID. We’ll just generate all lines offered by the Grid.

[11]:
pipeline.get_lines(line_ids=grid.available_lines)
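
If you only need a subset of lines, any list of valid line IDs works. For example (an illustrative alternative, not run here):

# Only generate the first 10 lines the grid offers
pipeline.get_lines(line_ids=grid.available_lines[:10])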

Again, to get observed fluxes we can pass an astropy.cosmology object to the get_observed_lines method.

[12]:
pipeline.get_observed_lines(cosmo)

Photometry

Next, the photometry, where we need to pass the instruments defining the filters we want to apply, along with the labels of the emissions to apply them to. Unlike the previous methods, which didn’t take instruments, these can be called multiple times with different instrument and label combinations to further control the generated observables.

Here we’ll generate intrinsic rest frame luminosities for the UVJ top hats, transmitted and intrinsic observed fluxes for the NIRCam filters, and transmitted fluxes for the UVJ top hat filters.

[13]:
pipeline.get_photometry_luminosities(uvj_inst, labels="intrinsic")
pipeline.get_photometry_fluxes(
    webb_inst, cosmo=cosmo, labels=["transmitted", "intrinsic"]
)
pipeline.get_photometry_fluxes(uvj_inst, cosmo=cosmo, labels="transmitted")

Imaging

Now we’ll generate the images. Again, these are split into luminosity and flux flavours. Just like the photometry methods, these can be called multiple times with different instrument, label, and parameter combinations. These parameters include the field of view of the image, the type of imaging (smoothed or histogram), and (if working in angular coordinates) a cosmology. If we are doing “smoothed” imaging (the default), where each particle is smoothed over its SPH kernel, we need to pass the kernel array, which we extract here.

Note that had we defined instruments with PSFs and/or noise, these would be applied automatically during the imaging process based on the instrument properties.
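
For instance, a PSF-equipped variant of the NIRCam instrument might look something like the hypothetical sketch below (the flat placeholder PSF and the exact keyword are illustrative; see the instrumentation docs for the precise arguments):

# Hypothetical: attach a placeholder PSF per filter so the imaging
# methods convolve with it automatically
fake_psf = np.ones((25, 25)) / 25**2
webb_psf_inst = Instrument(
    "JWST.NIRCam.PSF",
    filters=webb_filters,
    resolution=1 * kpc,
    psfs={f: fake_psf for f in webb_filters.filter_codes},
)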

[14]:
from unyt import arcsecond

from synthesizer.kernel_functions import Kernel

# Get the SPH kernel
sph_kernel = Kernel()
kernel = sph_kernel.get_kernel()

pipeline.get_images_luminosity(
    webb_inst,
    fov=10 * kpc,
    kernel=kernel,
    labels="intrinsic",
)
pipeline.get_images_flux(
    webb_inst,
    fov=10 * kpc,
    kernel=kernel,
    labels=["transmitted", "intrinsic"],
)

pipeline.get_images_flux(
    miri_inst,
    fov=5.2 * arcsecond,
    kernel=kernel,
    cosmo=cosmo,
    labels="transmitted",
)

Spectroscopy (resolved/unresolved)

Similarly, we can generate both resolved and unresolved spectroscopy by passing the instruments and parameters, as well as the labels to act on (again in luminosity and flux flavours), to the data cube and spectroscopy methods. Though not demonstrated here, these methods can also be called multiple times with different configurations.

[15]:
# Get the resolved spectroscopy
pipeline.get_data_cubes_lnu(
    ifu,
    labels="intrinsic",
    kernel=kernel,
    fov=10 * kpc,
)
pipeline.get_data_cubes_fnu(
    ifu,
    labels="intrinsic",
    fov=5.2 * arcsecond,
    kernel=kernel,
    cosmo=cosmo,
)

# Get the unresolved spectroscopy
pipeline.get_spectroscopy_lnu(
    spectrometer,
    labels="intrinsic",
)
pipeline.get_spectroscopy_fnu(
    spectrometer,
    labels="intrinsic",
)

Running the Pipeline

Now that every observable we want has been signalled for generation, we can run the pipeline. This will generate all the observables galaxy by galaxy, removing each galaxy from memory once it has been processed to reduce memory usage. Where possible, the whole process is automatically parallelized over the number of threads we set when instantiating the Pipeline object.

To run everything we simply call the run method.

[16]:
pipeline.run()
[00000.64]: Using 5 instruments.
[00000.64]: Instruments have 57 filters in total.
[00000.64]: Included instruments: UVJ, JWST.NIRCam, JWST.MIRI, IFU, SpecInst
[00000.64]: Instruments split by capability:
[00000.64]:    - photometry:             3
[00000.64]:    - spectroscopy:           2
[00000.64]:    - imaging:                2
[00000.64]:    - resolved spectroscopy:  1
[00000.65]: Pipeline memory footprint (MB): 484.3591785430908
[00000.65]: Running the pipeline...
[00000.65]: +------------+------------+------------+------------+------------+
[00000.65]: |    Galaxy #|      Nstars|        Ngas|         Nbh|      dt (s)|
[00000.65]: +------------+------------+------------+------------+------------+
[00003.04]: |           0|         278|          64|        None|        2.40|
[00010.79]: |           1|         656|           0|        None|        7.74|
[00014.21]: |           2|         441|          25|        None|        3.41|
[00023.35]: |           3|         726|           0|        None|        9.12|
[00023.65]: |           4|           8|         180|        None|        0.28|
[00024.04]: |           5|          26|           0|        None|        0.39|
[00030.82]: |           6|         569|           0|        None|        6.78|
[00032.35]: |           7|         165|           0|        None|        1.52|
[00033.67]: |           8|         149|           0|        None|        1.31|
[00039.43]: |           9|         461|           0|        None|        5.76|
[00039.45]: +------------+------------+------------+------------+------------+
[00039.45]: Computing 90 Lnu Spectra took 17.48 s
[00039.45]: Computing 90 Fnu Spectra took 2.25 s
[00039.45]: Computing 220 Luminosities took 4.74 s
[00039.45]: Computing 350 Fluxes took 10.76 s
[00039.45]: Computing 19350 Emission Line Luminosities took 1.35 s
[00039.45]: Computing 19350 Emission Line Fluxes took 83.64 ms
[00039.45]: Computing 220 Luminosity Images took 85.60 ms
[00039.45]: Computing 480 Flux Images took 405.20 ms
[00039.45]: Computing 10 Lnu Data Cubes took 374.25 ms
[00039.45]: Computing 10 Fnu Data Cubes took 372.44 ms
[00039.45]: Computing 10 Spectroscopy Lnu took 392.83 ms
[00039.45]: Computing 10 Spectroscopy Fnu took 387.72 ms
[00039.45]: Running 20 extra analyses took 7.39 ms
[00039.45]: Unpacking results took 10.21 ms
[00039.45]: Computing Stars/HalfMassRadius took 7.29 ms
[00039.45]: Computing Redshift took 0.02 ms
[00039.45]: Cleaning outputs took 4.394 ms.

Writing out the data

Finally, we write the data out to an HDF5 file. This file will contain all the observables we generated, as well as any additional analysis we ran. The file is structured to mirror the structure of Synthesizer objects, with each galaxy being a group, each component a subgroup, and each individual observable a dataset (or a set of subgroups with the observables as datasets at their leaves, in the case of a dictionary attribute).

To write out the data we just pass the desired output path to the write method.

Note that we are also passing verbose=0 to silence the dataset timings for these docs. Otherwise, we would get timings for the writing of individual datasets. In the wild these timings are useful, but here they’d just bloat the demo.

[17]:
pipeline.write("output.hdf5", verbose=0)
[00039.92]: Writing data took 459.317 ms.
[00039.92]: Total synthesis took 39.922 s.
[00039.92]: Goodbye!

Below is a view into the HDF5 file produced by the above pipeline (as shown by H5forest).

Pipeline HDF5 Example
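
You can also explore the file programmatically; a minimal sketch with h5py (the group names simply mirror the Synthesizer object structure described above):

import h5py

# Print the path of every group and dataset in the output file
with h5py.File("output.hdf5", "r") as hdf:
    hdf.visit(print)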

Putting it all together

Here is what the pipeline would look like without all the descriptive fluff…

[18]:
# Create galaxy object
galaxies = load_CAMELS_IllustrisTNG(
    TEST_DATA_DIR,
    snap_name="camels_snap.hdf5",
    group_name="camels_subhalo.hdf5",
    physical=True,
)

pipeline = Pipeline(model, report_memory=True)
pipeline.add_analysis_func(
    get_stellar_mass_radius,
    result_key="Stars/HalfMassRadius",
    fracs=(0.2, 0.5, 0.8),
)
pipeline.add_analysis_func(lambda gal: gal.redshift, result_key="Redshift")
pipeline.add_galaxies(galaxies)
pipeline.get_spectra()
pipeline.get_observed_spectra(cosmo)
pipeline.get_lines(line_ids=grid.available_lines)
pipeline.get_observed_lines(cosmo)
pipeline.get_photometry_luminosities(uvj_inst, webb_inst, miri_inst)
pipeline.get_photometry_fluxes(webb_inst)
pipeline.get_images_luminosity(
    webb_inst, fov=10 * kpc, kernel=kernel, labels="intrinsic"
)
pipeline.get_images_flux(
    webb_inst, fov=10 * kpc, kernel=kernel, labels="intrinsic"
)
pipeline.get_data_cubes_lnu(
    ifu, labels="intrinsic", kernel=kernel, fov=10 * kpc
)
pipeline.get_data_cubes_fnu(
    ifu, labels="intrinsic", kernel=kernel, fov=5.2 * arcsecond, cosmo=cosmo
)
pipeline.get_spectroscopy_lnu(spectrometer, labels="intrinsic")
pipeline.get_spectroscopy_fnu(spectrometer, labels="intrinsic")
pipeline.run()
pipeline.write("output.hdf5", verbose=0)

                         ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⡀⠒⠒⠦⣄⡀⠀⠀⠀⠀⠀⠀⠀
                         ⠀⠀⠀⠀⠀⢀⣤⣶⡾⠿⠿⠿⠿⣿⣿⣶⣦⣄⠙⠷⣤⡀⠀⠀⠀⠀
                         ⠀⠀⠀⣠⡾⠛⠉⠀⠀⠀⠀⠀⠀⠀⠈⠙⠻⣿⣷⣄⠘⢿⡄⠀⠀⠀
                         ⠀⢀⡾⠋⠀⠀⠀⠀⠀⠀⠀⠀⠐⠂⠠⢄⡀⠈⢿⣿⣧⠈⢿⡄⠀⠀
                         ⢀⠏⠀⠀⠀⢀⠄⣀⣴⣾⠿⠛⠛⠛⠷⣦⡙⢦⠀⢻⣿⡆⠘⡇⠀⠀
                         ⠀⠀⠀+-+-+-+-+-+-+-+-+-+-+-+⡇⠀⠀
                         ⠀⠀⠀|S|Y|N|T|H|E|S|I|Z|E|R|⠃⠀⠀
                         ⠀⠀⢰+-+-+-+-+-+-+-+-+-+-+-+⠀⠀⠀
                         ⠀⠀⢸⡇⠸⣿⣷⠀⢳⡈⢿⣦⣀⣀⣀⣠⣴⣾⠟⠁⠀⠀⠀⠀⢀⡎
                         ⠀⠀⠘⣷⠀⢻⣿⣧⠀⠙⠢⠌⢉⣛⠛⠋⠉⠀⠀⠀⠀⠀⠀⣠⠎⠀
                         ⠀⠀⠀⠹⣧⡀⠻⣿⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⡾⠃⠀⠀
                         ⠀⠀⠀⠀⠈⠻⣤⡈⠻⢿⣿⣷⣦⣤⣤⣤⣤⣤⣴⡾⠛⠉⠀⠀⠀⠀
                         ⠀⠀⠀⠀⠀⠀⠈⠙⠶⢤⣈⣉⠛⠛⠛⠛⠋⠉⠀⠀⠀⠀⠀⠀⠀⠀
                         ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀


[00000.00]: Root emission model: intrinsic
[00000.00]: EmissionModel contains 10 individual models.
[00000.00]: EmissionModels split by emitter:
[00000.00]:    - galaxy:     0
[00000.00]:    - stellar:    10
[00000.00]:    - blackhole:  0
[00000.00]: EmissionModels split by operation type:
[00000.00]:    - extraction:   4
[00000.00]:    - combination:  3
[00000.00]:    - attenuating:  3
[00000.00]:    - generation:   0
[00000.00]: Added analysis function: Stars/HalfMassRadius
[00000.00]: Added analysis function: Redshift
[00000.00]: Galaxies memory footprint: 0.44 MB
[00000.00]: Adding 10 galaxies took 1.053 ms.
[00000.00]: Using 5 instruments.
[00000.00]: Instruments have 66 filters in total.
[00000.00]: Included instruments: JWST.MIRI, UVJ, JWST.NIRCam, IFU, SpecInst
[00000.00]: Instruments split by capability:
[00000.00]:    - photometry:             3
[00000.00]:    - spectroscopy:           2
[00000.00]:    - imaging:                2
[00000.00]:    - resolved spectroscopy:  1
[00000.01]: Pipeline memory footprint (MB): 485.92222023010254
[00000.01]: Running the pipeline...
[00000.01]: +------------+------------+------------+------------+------------+------------------------+------------------------+------------------------+
[00000.01]: |    Galaxy #|      Nstars|        Ngas|         Nbh|      dt (s)|   Memory footprint (MB)|      Galaxy memory (MB)|     Results memory (MB)|
[00000.01]: +------------+------------+------------+------------+------------+------------------------+------------------------+------------------------+
[00005.71]: |           0|         278|          64|        None|        5.70|                  487.38|                    0.40|                    1.50|
[00032.45]: |           1|         656|           0|        None|       26.72|                  488.73|                    0.33|                    2.93|
[00041.67]: |           2|         441|          25|        None|        9.19|                  490.10|                    0.28|                    4.35|
[00069.54]: |           3|         726|           0|        None|       27.85|                  491.45|                    0.20|                    5.77|
[00070.29]: |           4|           8|         180|        None|        0.72|                  492.87|                    0.18|                    7.21|
[00071.21]: |           5|          26|           0|        None|        0.91|                  494.28|                    0.17|                    8.64|
[00094.95]: |           6|         569|           0|        None|       23.72|                  495.65|                    0.10|                   10.06|
[00098.67]: |           7|         165|           0|        None|        3.70|                  497.05|                    0.08|                   11.48|
[00101.93]: |           8|         149|           0|        None|        3.24|                  498.48|                    0.06|                   12.94|
[00120.96]: |           9|         461|           0|        None|       19.02|                  499.85|                    0.00|                   14.36|
[00120.97]: +------------+------------+------------+------------+------------+------------------------+------------------------+------------------------+
[00120.97]: Computing 90 Lnu Spectra took 17.52 s
[00120.97]: Computing 90 Fnu Spectra took 2.14 s
[00120.97]: Computing 1980 Luminosities took 1.34 mins
[00120.97]: Computing 1980 Fluxes took 17.53 s
[00120.97]: Computing 19350 Emission Line Luminosities took 1.27 s
[00120.97]: Computing 19350 Emission Line Fluxes took 82.56 ms
[00120.97]: Computing 220 Luminosity Images took 77.23 ms
[00120.97]: Computing 220 Flux Images took 74.38 ms
[00120.97]: Computing 10 Lnu Data Cubes took 360.39 ms
[00120.97]: Computing 10 Fnu Data Cubes took 362.78 ms
[00120.97]: Computing 10 Spectroscopy Lnu took 386.66 ms
[00120.97]: Computing 10 Spectroscopy Fnu took 377.88 ms
[00120.97]: Running 20 extra analyses took 6.87 ms
[00120.97]: Unpacking results took 9.17 ms
[00120.97]: Computing Stars/HalfMassRadius took 6.77 ms
[00120.97]: Computing Redshift took 0.02 ms
[00120.98]: Cleaning outputs took 8.537 ms.
[00121.71]: Writing data took 726.400 ms.
[00121.71]: Total synthesis took 2.029 mins.
[00121.71]: Goodbye!

Note that we set report_memory=True when we instantiated the Pipeline here. This causes the pipeline to probe the memory usage after each galaxy is processed and report it. While this is extremely useful information for debugging, it is also extremely expensive to compute and is thus turned off by default.

Running a subset of observables

If you only want to generate a subset of observables then it’s as simple as only calling the methods for those observables. However, some observables depend on others.

For instance, to generate observed fluxes you need to have already generated observer frame spectra. If you signal a “downstream” observable without signalling its “upstream” dependency, the Pipeline will automatically generate the upstream observable for you, but it will not be written out to the HDF5 file.

The only difference is that you must supply the method you are calling with the arguments required to generate the upstream observable. Don’t worry though: the Pipeline will automatically tell you what is missing if you forget something.

We demonstrate this below by selecting only the observed fluxes and observed emission lines.

[19]:
# Create galaxy object
galaxies = load_CAMELS_IllustrisTNG(
    TEST_DATA_DIR,
    snap_name="camels_snap.hdf5",
    group_name="camels_subhalo.hdf5",
    physical=True,
)
# Set up the pipeline
pipeline = Pipeline(model)
pipeline.add_galaxies(galaxies)
pipeline.get_observed_lines(cosmo, line_ids=grid.available_lines)
pipeline.get_photometry_fluxes(webb_inst, cosmo=cosmo)

                         ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⡀⠒⠒⠦⣄⡀⠀⠀⠀⠀⠀⠀⠀
                         ⠀⠀⠀⠀⠀⢀⣤⣶⡾⠿⠿⠿⠿⣿⣿⣶⣦⣄⠙⠷⣤⡀⠀⠀⠀⠀
                         ⠀⠀⠀⣠⡾⠛⠉⠀⠀⠀⠀⠀⠀⠀⠈⠙⠻⣿⣷⣄⠘⢿⡄⠀⠀⠀
                         ⠀⢀⡾⠋⠀⠀⠀⠀⠀⠀⠀⠀⠐⠂⠠⢄⡀⠈⢿⣿⣧⠈⢿⡄⠀⠀
                         ⢀⠏⠀⠀⠀⢀⠄⣀⣴⣾⠿⠛⠛⠛⠷⣦⡙⢦⠀⢻⣿⡆⠘⡇⠀⠀
                         ⠀⠀⠀+-+-+-+-+-+-+-+-+-+-+-+⡇⠀⠀
                         ⠀⠀⠀|S|Y|N|T|H|E|S|I|Z|E|R|⠃⠀⠀
                         ⠀⠀⢰+-+-+-+-+-+-+-+-+-+-+-+⠀⠀⠀
                         ⠀⠀⢸⡇⠸⣿⣷⠀⢳⡈⢿⣦⣀⣀⣀⣠⣴⣾⠟⠁⠀⠀⠀⠀⢀⡎
                         ⠀⠀⠘⣷⠀⢻⣿⣧⠀⠙⠢⠌⢉⣛⠛⠋⠉⠀⠀⠀⠀⠀⠀⣠⠎⠀
                         ⠀⠀⠀⠹⣧⡀⠻⣿⣷⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⡾⠃⠀⠀
                         ⠀⠀⠀⠀⠈⠻⣤⡈⠻⢿⣿⣷⣦⣤⣤⣤⣤⣤⣴⡾⠛⠉⠀⠀⠀⠀
                         ⠀⠀⠀⠀⠀⠀⠈⠙⠶⢤⣈⣉⠛⠛⠛⠛⠋⠉⠀⠀⠀⠀⠀⠀⠀⠀
                         ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀


[00000.00]: Root emission model: intrinsic
[00000.00]: EmissionModel contains 10 individual models.
[00000.00]: EmissionModels split by emitter:
[00000.00]:    - galaxy:     0
[00000.00]:    - stellar:    10
[00000.00]:    - blackhole:  0
[00000.00]: EmissionModels split by operation type:
[00000.00]:    - extraction:   4
[00000.00]:    - combination:  3
[00000.00]:    - attenuating:  3
[00000.00]:    - generation:   0
[00000.00]: Galaxies memory footprint: 0.44 MB
[00000.00]: Adding 10 galaxies took 1.030 ms.

If we want to see which observables we have signalled for generation we can call the report_operations method, which prints every possible operation alongside whether it will be computed and whether it will be written out to the HDF5 file.

[20]:
pipeline.report_operations()
[00000.00]: ------------------------------------------------------------
[00000.00]:                                      Compute?         Write?
[00000.00]: Line of Sight Optical Depths            False            N/A
[00000.00]: SFZH                                    False          False
[00000.00]: SFH                                     False          False
[00000.00]: Lnu Spectra                              True          False
[00000.00]: Fnu Spectra                              True          False
[00000.00]: Photometric Luminosities                False          False
[00000.00]: Photometric Fluxes                       True           True
[00000.00]: Emission Line Luminosities               True          False
[00000.00]: Emission Line Fluxes                     True           True
[00000.00]: Luminosity Images                       False          False
[00000.00]: Flux Images                             False          False
[00000.00]: Lnu Data Cubes                          False          False
[00000.00]: Fnu Data Cubes                          False          False
[00000.00]: Spectroscopy Lnu                        False          False
[00000.00]: Spectroscopy Fnu                        False          False
[00000.00]: ------------------------------------------------------------

Note that here we can see all the possible operations we could have signalled for generation, but only the observed fluxes and observed emission lines will be written out to the HDF5 file, since these are the getters we called.

Hybrid parallelism with MPI

Above we demonstrated how to run a pipeline using only local shared memory parallelism. We can also use mpi4py to combine this shared memory parallelism with distribution of the analysis across multiple nodes (hence “hybrid parallelism”).

Instantiating a Pipeline when using MPI

To make use of MPI we only need to make a couple of changes to how we run the pipeline. The first is simply that we need to pass the comm object to the Pipeline object when we instantiate it.

from mpi4py import MPI

pipeline = Pipeline(
    emission_model=model,
    nthreads=4,
    verbose=1,
    comm=MPI.COMM_WORLD,
)

Here verbose=1 will mean only rank 0 will output logging information. If you want all ranks to output logging information you should set verbose=2. verbose=0 will silence all outputs apart from the greeting, total timing and errors as before.

Adding galaxies with MPI

We also need to partition the galaxies across the ranks before we attach them to a Pipeline. For now we provide no mechanism for this; it is entirely up to you how to split galaxies across the ranks. The important thing is that you only pass a rank’s own subset of galaxies to add_galaxies, as in the sketch below.
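
A minimal sketch of one possible partitioning (a simple stride over the full catalog; any scheme works as long as the ranks hold disjoint subsets):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank takes every size-th galaxy, offset by its rank number
my_galaxies = galaxies[rank::size]
pipeline.add_galaxies(my_galaxies)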

Writing out results with MPI

When running a distributed Pipeline you have several options for writing out the data. Regardless of which approach is used, the process to write the outputs is the same as in the shared memory version shown above (i.e. we call the write method). We detail each of these options below.

Collective I/O [WIP]

If you have installed h5py with parallel HDF5, it’s possible to write collectively to a single HDF5 file. A Pipeline will detect whether parallel h5py is available and will automatically choose this option if possible.
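
You can check whether your h5py build supports this; a minimal check:

import h5py

# True if h5py was built against parallel HDF5
print(h5py.get_config().mpi)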

Individual I/O

When collective I/O operations aren’t available we produce a file per MPI rank. This is the most efficient method, since communicating the results to a single rank for output is not only extremely time consuming but can also lead to communication errors when the outputs are sufficiently large.

Once the rank files have been written out, we provide two options for combining them into a single file (sketched after the list below); note that working with the rank files directly is also entirely possible.

  1. Combination into a single physical file: calling combine_files will copy all the data from each rank file into a single file before deleting each individual rank file. This is clean with regard to what is left behind, but is extremely time consuming.

  2. Combination into a single virtual file: calling combine_files_virtual will make a single file containing virtual datasets that link to all the rank data. This is far more efficient and gives the same interaction behaviour as the copy option (i.e. a single file to interact with), but does mean all the rank files must be kept alongside the virtual file.
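
Assuming combine_files and combine_files_virtual are called on the Pipeline object after writing (as the names above suggest), the virtual option would look something like:

pipeline.write("output.hdf5")

# Stitch the per-rank files into one virtual file
# (the rank files must be kept alongside it)
pipeline.combine_files_virtual()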