LHAASO - LLM for Astrophysics Seminar Note

Why LLM? The tasks of LHAASO that call for AI solutions

Background Reduction

  • The gamma-ray background of LHAASO mainly consists of misclassified cosmic rays. At present, the LHAASO Collaboration reduces background events by applying threshold cuts. The traditional method has to solve a multi-dimensional optimization problem in a 5-dimensional parameter space for each event, which is not only time-consuming but also demands substantial computing resources.
  • For KM2A, distinguishing gammas from protons is easier: gamma-ray air showers produce mostly electron-positron pairs, while proton showers are muon-rich. This is much harder for WCDA, which is insensitive to muons, so there events are more often categorized by their morphology. Because a proton cascade produces more secondary particles moving in random directions, it is more laterally extended than a gamma event. Measuring the dispersion of secondary particles around the core position, both proton and gamma events show Gaussian-like distributions, with a larger mean for protons. By setting a threshold and discarding all events with dispersion larger than it, high classification purity is achieved at the cost of lower selection efficiency.
  • The goal of using AI tools is to reduce the cosmic-ray background by a factor of 2-5. This may not be a great improvement for bright sources like the Crab, but for faint targets, e.g. at flux levels around 0.01 Crab, reducing the background is crucial for enhancing the detection significance.
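The dispersion-cut trade-off above can be made concrete with a toy calculation. All means, widths, and thresholds below are assumed for illustration, not LHAASO values; in the Gaussian limit significance scales as S/√B, so a cut with signal efficiency ε and background survival f improves significance by ε/√f:

```python
import math

def gaussian_cdf(x, mu, sigma):
    """P(X < x) for a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical dispersion distributions (arbitrary units):
# gamma events are compact, proton events more extended.
MU_G, SIG_G = 1.0, 0.3   # assumed gamma mean / width
MU_P, SIG_P = 2.0, 0.5   # assumed proton mean / width

def cut_performance(threshold):
    """Keep events with dispersion below `threshold`."""
    eff_gamma = gaussian_cdf(threshold, MU_G, SIG_G)    # signal efficiency
    surv_proton = gaussian_cdf(threshold, MU_P, SIG_P)  # background survival
    return eff_gamma, surv_proton

def significance_gain(threshold):
    """Gaussian-limit gain in S/sqrt(B) relative to no cut."""
    eff, surv = cut_performance(threshold)
    return eff / math.sqrt(surv)

for thr in (1.2, 1.5, 1.8):
    eff, surv = cut_performance(thr)
    print(f"thr={thr}: gamma eff={eff:.2f}, proton survival={surv:.3f}, "
          f"significance gain={significance_gain(thr):.2f}")
```

Note how even a cut that keeps a few percent of protons can more than double the significance of a faint source, which is why a factor-2-5 background reduction matters mainly for 0.01-Crab targets.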

Particle Categorization

  • There are 26 different categories of cosmic-ray primaries, from H to Fe. In principle these events have different locations of the cascade maximum, so they could be classified from WFCTA observations. In practice this is nearly impossible because of the low quality of WFCTA imaging and the dependence on cascade location and observation angle.

Noise Reduction

  • Noise is common for ED, MD, and WCD alike. Assuming that the secondary particles of a single event arrive on a flat surface in (x, y, t) space, noise outside this surface can be rejected. However, this will inevitably remove some signal hits, and noise lying on the surface remains indistinguishable.
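The flat-surface idea can be sketched as fitting a plane t = a·x + b·y + c to the hit times and rejecting hits with large time residuals. The geometry, timing jitter, and cut level below are an assumed toy, not real detector data:

```python
import numpy as np

def fit_shower_front(x, y, t):
    """Least-squares fit of a plane t = a*x + b*y + c to hit times."""
    A = np.column_stack([x, y, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(A, t, rcond=None)
    residuals = t - A @ coeffs
    return coeffs, residuals

def reject_noise(x, y, t, n_sigma=3.0):
    """Flag hits whose time residual from the front exceeds n_sigma * RMS."""
    _, res = fit_shower_front(x, y, t)
    rms = np.std(res)
    return np.abs(res) <= n_sigma * rms  # True = keep

# Toy event: planar front plus timing jitter, with one noise hit injected.
rng = np.random.default_rng(0)
x = rng.uniform(-100, 100, 50)
y = rng.uniform(-100, 100, 50)
t = 0.02 * x - 0.01 * y + 5.0 + rng.normal(0, 0.5, 50)
t[0] += 50.0  # an accidental (noise) hit far off the shower front
keep = reject_noise(x, y, t)
print("noise hit kept?", keep[0], "| hits kept:", keep.sum(), "/", len(keep))
```

The sketch also shows the limitation stated above: a noise hit whose time happens to fall on the fitted plane would pass the cut, and a real hit in the tail of the timing jitter can be cut away.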

Event Reconstruction

  • Event reconstruction depends heavily on air-shower simulations. A detector array can only sample the signal at discrete points. If AI tools can perform a kind of "interpolation", but informed by prior knowledge from simulations, the reconstruction error can be reduced.
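The "interpolation with simulation priors" idea can be sketched as template fitting: detectors sample a lateral profile at discrete positions, and a simulation-motivated model fills in between stations to recover the core position. Everything below (the exponential profile, the 30 m scale, the array geometry) is an assumed toy, not the actual LHAASO reconstruction:

```python
import numpy as np

# Assumed lateral scale (m); in practice this prior comes from simulations.
R0 = 30.0

def expected_signal(core, det_xy, amp):
    """Toy lateral distribution S(r) = amp * exp(-r / R0)."""
    r = np.linalg.norm(det_xy - core, axis=1)
    return amp * np.exp(-r / R0)

def fit_core(det_xy, signals, grid):
    """Grid search for the core position minimizing squared residuals."""
    best, best_cost = None, np.inf
    for cx in grid:
        for cy in grid:
            core = np.array([cx, cy])
            shape = expected_signal(core, det_xy, amp=1.0)
            # Best-fit amplitude solved analytically by linear least squares.
            amp = (shape @ signals) / (shape @ shape)
            cost = np.sum((signals - amp * shape) ** 2)
            if cost < best_cost:
                best, best_cost = core, cost
    return best

# Toy array: detectors on a 15 m grid, true core at (12, -7) m.
xs = np.arange(-60, 61, 15.0)
det_xy = np.array([(x, y) for x in xs for y in xs])
true_core = np.array([12.0, -7.0])
signals = expected_signal(true_core, det_xy, amp=100.0)
core_hat = fit_core(det_xy, signals, grid=np.arange(-60, 61, 1.0))
print("reconstructed core:", core_hat)
```

The model lets the core land between stations even though the array sampling is 15 m; an ML reconstruction plays the same role with a far richer prior learned from simulated showers.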

Multiple Source Analysis

  • Near the Galactic plane, sources are dense and overlapping, which poses a challenge to source identification. AI tools could help analyze such regions, resolving the contribution from each source.
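A minimal sketch of resolving overlapping sources, assuming the source positions are known (e.g. from a catalog) and the PSF is Gaussian: with one PSF template per source, the flux contributions separate by linear least squares. All numbers are illustrative:

```python
import numpy as np

SIGMA_PSF = 0.3  # assumed PSF width (degrees)

def psf(x, x0):
    """Gaussian point-spread function centered at x0."""
    return np.exp(-0.5 * ((x - x0) / SIGMA_PSF) ** 2)

# 1-D sky coordinate with two sources 0.4 deg apart, overlapping heavily.
x = np.linspace(-2, 2, 400)
positions = [-0.2, 0.2]
true_flux = np.array([3.0, 1.0])
profile = sum(f * psf(x, p) for f, p in zip(true_flux, positions))

# Design matrix: one PSF template per source; solve for the fluxes.
A = np.column_stack([psf(x, p) for p in positions])
flux_hat, *_ = np.linalg.lstsq(A, profile, rcond=None)
print("recovered fluxes:", flux_hat)
```

Real crowded-field analysis adds noise, diffuse emission, and unknown or extended source morphologies, which is where learned models could go beyond this linear decomposition.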

Zhejiang Lab LLM Application to Astronomy

Scientific Models

Difficulty: supervised training data are scarce.

For biological macromolecules, machine-learning predictions and cryo-EM measurements are comparable in accuracy, so one can first use AI to assess plausibility and then perform the precise analysis.

Why build scientific models

"A New Golden Age of Discovery"

  • Help scientists acquire knowledge
  • Generate and annotate large datasets
  • Simulate and accelerate complex experiments
  • Model complex systems
  • Search large parameter spaces

GPT - ~1e12 parameters

  • Composition of ChatGPT: GPT + Chat
  • GPT is an unsupervised model
    • Even noisy data, once encoded, can help training
    • GPT's task is to encode (compress) the data
  • Chat is supervised learning (instruction following)
  • Base models and CPT, SFT, ...
    • Base models (GPT, 021, LLaMA, etc.) compress and encode data
    • SFT yields an instruction model
    • Or GPT + CPT + SFT, where CPT builds a domain model (fine-tuning the base model on specialized data) optimized for a particular field
  • In general, a base model's training data does not include the most specialized material (corpora contain large amounts of biomedical text but little high-energy physics, for example); adding domain corpora to the training produces a domain model. Base models are generally broad.

Why large models are large

  • Ample compute - GPUs
  • Massive data, data-driven - tokens (on the scale of billions to trillions, B-T)
  • Built on the Transformer model

  • Scientific foundation model + geoscience + astronomy (GeoGPT, AstroOne) capability transfer + scientific workspace

Scientific foundation models

Versed in both the sciences and the humanities, such a model could serve as an assistant (the brain) for future scientists, with tools developed as its hands and feet.

  1. Propose solutions and turn problem handling into workflows
  2. AI scientist: see a problem, pose a problem

1+N model (one base model, multiple domains) -> supporting domain-expert models (mixture-of-experts, MoE)

Scientific foundation models - data

  1. Data scale: general corpora reach ~1e4 B tokens, while scientific (text) data are scarce (2e5 papers yield only 130B tokens)
  2. Data variety: cross-disciplinary, multi-modal, multi-scale
  3. Data alignment: different data modalities must be inter-convertible

Scientific foundation models - reasoning ability

Enhanced multi-modal data alignment, knowledge graphs, the Q* algorithm, verification with external tools (simulators, code interpreters, etc.)

GeoGPT - a geoscience language model

Assists geoscientists with auxiliary research such as statistical data analysis and knowledge organization, and extracts data from papers.

300B tokens of domain corpus, built on the 021 scientific foundation model, Qwen, etc.

Astronomy large models

An astronomy large language model + scientific models such as a stellar-spectrum model

One language model + one image-text interpretation model + other scientific models: 1+1+X

It likewise faces the problem of outputs being hard to validate.

Solar-flare activity prediction (which seems transferable to improving sensitivity) and feature extraction (cross-modal reconstruction between solar optical imaging and magnetic-field imaging, Hao Qi's group)