Machine Learning at High Redshift
By Mitchell T. Dennis
Introduction
From the furthest reaches of our vast universe, light travels to us across billions of years, stretched - redshifted - by the expansion of space along the way. Accurately measuring this redshift is crucial to understanding the universe’s structure, also known as “the cosmic web.” In this post I’ll walk you through how scientists measure redshift, where some of the challenges lie, and how machine learning can help identify where current measurements go wrong.
A video created from a simulation of the universe using the GADGET-2 simulation code.
Credit: Springel et al. (Virgo Consortium), Max-Planck-Institute for Astrophysics
When we think about using AI in science, we often imagine replacing an existing process or procedure with something more “black box” that we hope will perform better. Weather forecasting is a good example: detailed physical models built on complex formulas and theories compete against machine learning models that are given the same task but only have access to historical weather data, not the underlying physics. However, my recent article, “Identifying Catastrophic Outlier Photometric Redshift Estimates in the COSMOS Field with Machine Learning Methods,” used machine learning in a different way. But first, a bit of background.
What is redshift?
Redshift measures how much the light from a distant object has stretched—shifted toward the red end of the spectrum—as it travels through expanding space. The shift is similar to the way an ambulance’s siren drops in pitch as it moves away from you - the Doppler effect, but with light instead of sound. The redshift of an object is directly related to how far away it is from Earth.
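In numbers, the redshift z is just the fractional stretch in wavelength: z = (λ_observed - λ_rest) / λ_rest. Here is a minimal sketch; the H-alpha rest wavelength is real, but the observed wavelength is invented for illustration:

```python
# Redshift from the stretch in wavelength: z = (lambda_observed - lambda_rest) / lambda_rest.
# Wavelengths are in angstroms.

def redshift(lambda_observed: float, lambda_rest: float) -> float:
    """Return the redshift z implied by an observed vs. rest wavelength."""
    return (lambda_observed - lambda_rest) / lambda_rest

# H-alpha has a rest wavelength of 6562.8 angstroms. If we observed it at
# 13125.6 angstroms, its wavelength would have doubled, giving z = 1.
z = redshift(13125.6, 6562.8)
print(z)  # 1.0
```

A redshift of 1 means every wavelength in the spectrum has been stretched to twice its rest value.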
Spectroscopic Redshift
We measure redshift using two primary methods. The first, spectroscopic redshift, involves breaking light down into a spectrum using a spectroscope. If you think back to your high school science class, you may remember looking at the different spectra of gas lamps. Every element in the universe emits a unique spectrum at very specific wavelengths (or colors). Take a moment to look at the examples below.
These are the hydrogen spectrum lines visible to the eye. The red line is H-alpha, an extremely important emission line in astronomy.
Image Source: Wikipedia - Emission Spectrum Hydrogen
These are the iron spectrum lines visible to the eye. Each one of these lines is distinct from the hydrogen lines.
Image Source: Wikipedia - Emission Spectrum Iron
Each emission line in the spectrum of each element and molecule is distinct, acting like a fingerprint that we can use to determine what objects are made of. This is the essence of spectroscopy and is used in many fields including medicine, forensics, chemistry, geology, and astronomy to name a few.
The spectra above are at what astronomers call “rest wavelength” - no redshift. When a spectrum is redshifted, the spacing between the emission lines changes. To get a value for the redshift, astronomers shift the measured spectrum from our instruments until it matches the spectrum we would expect to see for an object at rest wavelength. Because spectra are noisy and objects are made of many elements (and therefore just as many overlapping spectra), astronomers take the redshift value that fits best, typically using a goodness-of-fit metric such as chi-squared.
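As a toy sketch of this best-fit procedure (not the actual pipeline used in the paper), we can shift a made-up single-line template through a grid of trial redshifts and keep the one that minimizes chi-squared:

```python
import numpy as np

# Toy chi-squared redshift fitting: try a grid of redshifts and keep the one
# whose shifted template best matches the observed spectrum. The template and
# all flux values are invented for illustration.

rest_wavelengths = np.linspace(4000.0, 7000.0, 600)  # angstroms

def template_flux(wavelengths):
    """Rest-frame template: flat continuum plus one Gaussian emission line (H-alpha)."""
    return 1.0 + 5.0 * np.exp(-0.5 * ((wavelengths - 6562.8) / 10.0) ** 2)

def chi_squared(observed_flux, model_flux, sigma=0.1):
    return np.sum(((observed_flux - model_flux) / sigma) ** 2)

def fit_redshift(obs_wavelengths, obs_flux, z_grid):
    """Return the trial redshift minimizing chi-squared against the template."""
    best_z, best_chi2 = None, np.inf
    for z in z_grid:
        # De-redshift the observed wavelengths back to the rest frame,
        # then evaluate the template there.
        model = template_flux(obs_wavelengths / (1.0 + z))
        chi2 = chi_squared(obs_flux, model)
        if chi2 < best_chi2:
            best_z, best_chi2 = z, chi2
    return best_z

# Fake an observed spectrum at z = 0.30: every wavelength is stretched by 1.3.
true_z = 0.30
obs_wavelengths = rest_wavelengths * (1.0 + true_z)
obs_flux = template_flux(obs_wavelengths / (1.0 + true_z))

z_grid = np.arange(0.0, 1.0, 0.01)
print(round(fit_redshift(obs_wavelengths, obs_flux, z_grid), 2))  # 0.3
```

Real fits work the same way in spirit, but with many templates, realistic noise, and much finer redshift grids.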
Because the light gets spread into so many wavelengths, it takes longer for the detector to collect enough photons to achieve a good signal. This is in contrast to the other technique used to measure redshift - photometric redshift.
Photometric Redshift
Photometric redshift is a technique where photometric data, rather than spectroscopic data, is used to calculate the redshift. Photometric data also measures light, but instead of spreading the light out over many wavelengths, it collects light into a few broad bands. For example, where spectroscopic data might distinguish between many different shades of green, photometric data lumps all of those shades into a single green band. This makes it easier to collect enough photons to get a good signal, but there is now less information, and the “fingerprinting” method above cannot be used. Instead, astronomers use galaxies with spectroscopically known redshifts to train models that estimate redshift from photometric data alone. This has historically been done with a technique known as template fitting, but machine learning is making inroads in this area.
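To make the loss of information concrete, here is a small sketch (the band edges and the spectrum are invented, not a real filter set) of how photometry collapses a finely sampled spectrum into a few broad-band numbers:

```python
import numpy as np

# Photometry collapses a finely sampled spectrum into a few broad bands by
# summing the flux inside each band. Band edges and fluxes are made up.

wavelengths = np.linspace(4000.0, 9000.0, 1000)  # angstroms
flux = 1.0 + 4.0 * np.exp(-0.5 * ((wavelengths - 6562.8) / 15.0) ** 2)

bands = {"g": (4000.0, 5500.0), "r": (5500.0, 7000.0), "i": (7000.0, 9000.0)}

photometry = {}
for name, (lo, hi) in bands.items():
    in_band = (wavelengths >= lo) & (wavelengths < hi)
    # One number per band: the emission line's exact wavelength is lost,
    # but it still brightens whichever band it falls into.
    photometry[name] = flux[in_band].sum()

print(photometry)
```

The emission line at 6562.8 angstroms brightens the “r” band relative to a flat continuum, but its precise wavelength - the key to the fingerprinting method - is gone.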
A Helpful Analogy: Rain Buckets
How can we intuitively think about the difference between spectroscopy and photometry? Instead of thinking about photons and light, let us use raindrops as an analogy. Imagine our goal is to collect rainwater, but in order to get a good measurement, we need at least 100 drops of water in a bucket. Spectroscopy is akin to placing many small buckets on the ground: we get much more detailed information about the rainfall in different spots, but it takes much longer for every bucket to reach 100 raindrops. Photometry is akin to placing a few very large buckets on the ground: each bucket reaches 100 raindrops much more quickly, but the information about where the rain fell is less detailed. The detail provided by spectroscopy comes at the cost of significantly longer telescope exposure times (how long we expose the camera to a single galaxy), which is why spectra cannot possibly be taken for the billions of objects in our night sky. Photometric data is much easier to obtain, and photometric redshifts are therefore far more readily available, but they can be less accurate than spectroscopic redshifts. It is here that my research takes a role: the two methods can sometimes yield widely discrepant results, leading to what astronomers call “catastrophic outliers.”
Why It Matters: Catastrophic Outliers
When scientists compare photometric redshifts to their spectroscopic counterparts, sometimes there is a major discrepancy between the two. While definitions vary between research articles, when the difference between the spectroscopic redshift and the photometric redshift is substantial, the objects are called “catastrophic outliers.” In my work, I specifically investigate objects whose spectroscopic redshift is significantly larger than the reported photometric redshift. The figure below highlights these objects in red.
The y-axis here is the LePhare photometric redshift, and the x-axis is the spectroscopic redshift (Dennis et al. 2024).
The photometric redshift for these objects under-predicts the spectroscopic redshift. If this happens for objects whose spectroscopic redshift has not been measured, the photometric data might under-report the number of objects at the highest redshifts. This study did not include the opposite case (objects whose photometric redshift over-predicts the spectroscopic redshift), but there the photometric data might over-report the number of objects at high redshifts.
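As noted above, the exact definition varies between papers, but a common convention (used here purely as an illustration, not necessarily the cut used in the paper) flags an object when |z_spec - z_phot| / (1 + z_spec) exceeds a threshold such as 0.15:

```python
import numpy as np

# A common catastrophic-outlier criterion (thresholds vary between papers):
# flag an object when |z_spec - z_phot| / (1 + z_spec) > threshold.

def catastrophic_outliers(z_spec, z_phot, threshold=0.15):
    z_spec = np.asarray(z_spec, dtype=float)
    z_phot = np.asarray(z_phot, dtype=float)
    return np.abs(z_spec - z_phot) / (1.0 + z_spec) > threshold

# Toy values: the third object's photometric redshift badly under-predicts
# its spectroscopic redshift.
z_spec = np.array([0.50, 1.20, 3.00])
z_phot = np.array([0.52, 1.10, 0.40])
print(catastrophic_outliers(z_spec, z_phot))  # [False False  True]
```

Dividing by (1 + z_spec) means the tolerance scales with redshift, so a fixed absolute error matters less for more distant objects.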
Our ML Approach
We used machine learning, specifically deep neural networks, to attempt to identify the objects highlighted in red in the figure above. Our network was reasonably successful, identifying 33%-55% of the catastrophic outliers with a false positive rate of 0. The low false positive rate is crucial: in a highly imbalanced dataset (one where one class vastly outnumbers the other), false positives can easily outnumber true positives.
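A quick back-of-the-envelope sketch (the counts below are invented, not the paper’s dataset) shows why even a small false positive rate is dangerous when outliers are rare:

```python
# Class imbalance in a nutshell: with few true outliers, even a small false
# positive rate swamps the true detections. All counts are invented.

def evaluate(n_outliers, n_normal, recall, false_positive_rate):
    """Return (true positives, false positives) for the given rates."""
    true_positives = recall * n_outliers
    false_positives = false_positive_rate * n_normal
    return true_positives, false_positives

# 100 catastrophic outliers hidden among 10,000 normal objects.
# A 2% false positive rate already produces 200 false alarms - four times
# the 50 real outliers a 50%-recall classifier recovers.
print(evaluate(100, 10_000, recall=0.50, false_positive_rate=0.02))  # (50.0, 200.0)

# At a false positive rate of 0, every flagged object is a real outlier.
print(evaluate(100, 10_000, recall=0.50, false_positive_rate=0.0))   # (50.0, 0.0)
```

This is why a classifier with modest recall but zero false positives can still be far more useful than one with higher recall and a “small” false positive rate.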
What did we discover?
When we applied the algorithm to new data where no spectroscopic redshifts were available, we found that the number of catastrophic outliers in photometric datasets may currently be underestimated by as much as a factor of 5-10; more studies are needed to determine the full extent of the problem. Further, any study that relies substantially on photometric redshifts, such as studies of extragalactic structure, may see substantial differences once the catastrophic outliers are corrected.
Going Forward
Our future studies will focus on increasing the accuracy of the network and adding the other half of the catastrophic outliers to our study (those whose photometric redshift over-predicts the spectroscopic redshift). From there we can build a complete catalogue and follow up with spectroscopic observations to confirm our results. Once these results are confirmed, we can use the data to make photometric redshifts more accurate. In the era of large photometric sky surveys, the accuracy of photometric redshifts will be essential to understanding the structure of our universe.
Where is the article?
Article Link: https://iopscience.iop.org/article/10.3847/1538-4357/adbe62#apjadbe62f2
Citation: Mitchell T. Dennis et al. 2025 ApJ 983 173, DOI: 10.3847/1538-4357/adbe62