# docker-compose.yml - keep in root folder of your github actions repo

services:

  scraper:
    image: cfurl/stg4_24hr_scraper:v1
    container_name: scrape24
    volumes:
      - st4_data:/home/data

  wgrib2:
    image: cfurl/sondngyn_wgrib2:v1
    container_name: wgrib2
    command: "wgrib2_commands.sh"
    depends_on:
      scraper:
        condition: service_completed_successfully
    volumes:
      - st4_data:/srv/
      - st4_data:/opt/

  parqs3:
    image: cfurl/stg4_24hr_parq_s3:v1
    container_name: parqs3
    depends_on:
      wgrib2:
        condition: service_completed_successfully
    volumes:
      - st4_data:/home/data
    env_file:
      - .env

volumes:
  st4_data:
Backend
The Details
Delivery of Stage IV data
Stage IV data are packaged as GRIB2 files and are available as 1-hr, 6-hr, and 24-hr products. The 6- and 24-hour files are summed from the one-hour files. Presently, this tutorial works with the 24-hr file and scrapes data once a day. It would not be difficult to amend the code and start the workflow every hour.
Stage IV data are available in near real-time on NOMADS at :55 past the hour. For example, hourly rainfall ending at 2:00 (representing 1:00-2:00) is available at 2:55. Twenty-four hour data run from 12:00 to 12:00 UTC (available at 12:55 UTC). Data for the last 14 days are kept on NOMADS, after which they are cycled off. The real-time NOMADS site is available here. Note that there are 01h, 06h, and 24h files for CONUS, pr (Puerto Rico), and ak (Alaska).
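Putting those pieces together, the daily file this workflow grabs sits at a URL of the following form (the date here is hypothetical; the directory and file naming follow the scraper code shown later):

https://nomads.ncep.noaa.gov/pub/data/nccf/com/pcpanl/prod/pcpanl.20250918/st4_conus.2025091812.24h.grb2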
After the near real-time file drop, an additional rerun at 30h after valid time (18:55Z) is available to supplement 24-hr mosaics if RFCs need to update their QPEs or make changes. Personal communication with the WGRFC indicates that this is done very infrequently.
Since at least 2012, when I started tinkering with archived Stage IV data, the historical datasets were kept at data.eol.ucar. As of summer 2025, the data have been moved to rda.ucar.edu. At this new archive location, Stage IV data are tarred together in monthly increments. The data appear to become available around the 20th of the following month (for example, August data available ~September 20th).
I intend to add a page to this website describing my manual methods for processing Stage IV data. For now, this page is focused on the real-time workflow.
GRIB2 data format
As previously described, Stage IV data are stored in the GRIB2 binary format. Stage IV precipitation was packaged in the GRIB1 format until July 2020 and currently resides in the GRIB2 format. The processing procedures and wgrib utilities are completely different between the two formats. When I get to the description of building a historical archive at a site, I'll have to cover GRIB1 processing.
Container Orchestration
The workflow is accomplished by linking containers together with a Docker Compose orchestration and tripping off the series of containers with a GitHub Actions cron job.
Container orchestration and GitHub Actions workflows are located in the following repo: Container Orchestration
The containers are managed by the docker-compose.yml shown below, which describes when to spin up each container, how to manage storage, environment variables, etc. The docker-compose.yml is stored in the root folder of the github repo shown above.
The docker-compose.yml gets triggered each day via GitHub Actions, described later. Each of the individual container images is stored on Docker Hub at: hub.docker.cfurl. These should all be public. The workflow for processing the Stage IV data is contained in three individual containers described below.
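If you want to exercise the stack by hand before wiring up the automation, the same compose file can be run from the repo root; this assumes a local .env file holding the AWS variables the parqs3 service expects:

docker compose up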
Individual Containers
The Dockerfile, code, and explicit PowerShell prompts to create each Docker image and load it onto Docker Hub are contained in the following repo: Docker Images-Containers
Before I get to the scripts held in each individual container, I want to take a quick look at how images are constructed. You simply give Docker a set of instructions in your 'Dockerfile' and then build it through PowerShell commands. After it's built, you can upload it to your repository. A sketch of the build-and-push commands follows the Dockerfile below.
# Dockerfile
FROM rocker/r-ver:4.2.2

RUN mkdir -p /home
RUN mkdir -p /home/code
RUN mkdir -p /home/data

WORKDIR /home

COPY /code/write_parq_2_s3.R /home/code/write_parq_2_s3.R
COPY /code/install_packages.R /home/code/install_packages.R
COPY /data/texas_buffer_spatial_join.csv /home/data/texas_buffer_spatial_join.csv

RUN Rscript /home/code/install_packages.R

CMD Rscript /home/code/write_parq_2_s3.R
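The exact PowerShell prompts are in the repo linked above; a minimal sketch of the build/login/push sequence for this image, assuming the cfurl/stg4_24hr_parq_s3:v1 tag used in docker-compose.yml, would look something like:

# build the image from the Dockerfile in the current folder
docker build -t cfurl/stg4_24hr_parq_s3:v1 .
# authenticate, then push to Docker Hub
docker login
docker push cfurl/stg4_24hr_parq_s3:v1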
Below are brief explanations of what each container accomplishes.
Container 1 - Scraper
This code takes the current system date and time, visits the appropriate NOMADS url based on that date and time, downloads the 24-hr GRIB2 conus file at that url, writes a shell script that is used in the next container to inject prompts into wgrib2.exe, and stores both the shell script and the GRIB2 file in a shared storage volume (controlled through docker-compose.yml) that all three containers are linked to.
library(dplyr)
library(stringr)
library(rvest)

# create function to print out UTC time with now_utc()
now_utc <- function() {
  now <- Sys.time()
  attr(now, "tzone") <- "UTC"
  now
}

# create character string of the hour
hour_char <- str_sub(as.character(now_utc()), start = 12, end = 13)
# create numeric hour
hour_num <- as.numeric(hour_char)

# create character string for date
date_char <- str_sub(as.character(now_utc()), start = 1, end = 10) %>% str_remove_all("-")
# create dateclass date with utc timezone
now_utc_date <- as.Date(now_utc(), tz = "UTC")

# read nomads stg4 html page using date from now_utc()
stg4_http_page <- read_html(paste0("https://nomads.ncep.noaa.gov/pub/data/nccf/com/pcpanl/prod/pcpanl.", date_char, "/"))

# find only the conus files that end with 24h.grb2
grib2_available <- stg4_http_page %>%
  html_elements("a") %>%
  html_text() %>%
  str_subset("conus") %>%
  str_subset("24h.grb2$")

# create path to download
source_path <- paste0("https://nomads.ncep.noaa.gov/pub/data/nccf/com/pcpanl/prod/pcpanl.", date_char, "//", tail(grib2_available, n = 1))

# create download destination
destination_path <- paste0("/home/data/", tail(grib2_available, n = 1))

# download the file
download.file(source_path, destination_path, method = "libcurl")

# Write your shell file to communicate with the wgrib2 container
txt <- paste("wgrib2", tail(grib2_available, n = 1), "-csv", str_replace(tail(grib2_available, n = 1), ".grb2", ".txt"))
writeLines(txt, paste0("/home/data", "/wgrib2_commands.sh"))
Container 2 - DeGRIB your file
This code simply executes the wgrib2.exe application, which takes the data out of the binary format and dumps it as text. That works in this case because the only thing wrapped in this GRIB2 file is Stage IV rainfall; many meteorological GRIBs contain many variables. The .sh shell file written in Container 1 contains the instructions that are given to the wgrib2.exe application. This is controlled through docker-compose.yml, namely: command: "wgrib2_commands.sh". The text output, which is a bunch of lat/lons and rain values, is stored in the same shared volume as the output files from Container 1.
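For reference, the shell file written by Container 1 contains a single wgrib2 line; for a hypothetical valid date it would read:

wgrib2 st4_conus.2025091812.24h.grb2 -csv st4_conus.2025091812.24h.txt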
I didn't build wgrib2.exe into an image myself. There are plenty of images available on Docker Hub where someone has already done this. I've been using the image located here: sondngyn/wgrib2:latest. Below I show the docker commands to copy that image into my repository so I'm not vulnerable to changes in the sondngyn hub repo.
# Simple pull, retag, push:

# pull upstream
docker pull sondngyn/wgrib2:latest

# retag to your namespace
docker tag sondngyn/wgrib2:latest cfurl/sondngyn_wgrib2:v1

# login and push
docker login
docker push cfurl/sondngyn_wgrib2:v1
# Test it with docker-compose.yml:

services:

  scraper:
    image: cfurl/stg4_24hr_scraper:v1
    container_name: scrape24
    networks:
      - some_name
    volumes:
      - st4_data:/home/data

  wgrib2:
    image: cfurl/sondngyn_wgrib2:v1
    container_name: wgrib2
    command: "wgrib2_commands.sh"
    depends_on:
      scraper:
        condition: service_completed_successfully
    volumes:
      - st4_data:/srv/
      - st4_data:/opt/

networks:
  some_name:
    external:
      name: st4_net

volumes:
  st4_data:
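One note on that test file: because some_name is declared as an external network, compose will not create it for you; under that assumption you would create it once before bringing the stack up, for example:

# create the external network, then run the test stack
docker network create st4_net
docker compose up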
Container 3 - Write a .parquet file to an S3 bucket
This Docker container connects to your "stg4-texas-24hr" AWS S3 bucket, tidies the GRIB2 csv output, clips it to the area of interest (Texas), and then writes .parquet files that are partitioned by (year, month, day). A note: you would probably have to partition by hour if you do this hourly, because you can't append to an existing .parquet file, owing to how it is written on disk. Your partitions have to be separate when you are doing this in an automated fashion.
library("aws.s3")
library("arrow")
library("dplyr")
library("lubridate")
library("tidyr")
library("readr")
library("stringr")
# make sure you can connect to your bucket and open SubTreeFileSystem
<- s3_bucket("stg4-texas-24hr")
bucket
# list everything in your bucket in a recursive manner
$ls(recursive = TRUE)
bucket
# identify path where you will be writing the .parq files
<- bucket$path("")
s3_path
<-read_csv("/home/data/texas_buffer_spatial_join.csv")
aoi_texas_buffer
# list files that start with st4 and ends with .txt
= list.files("/home/data", pattern = "^st4_conus.*.txt$",full.names=FALSE)
raw_grib2_text
for (h in raw_grib2_text) {
<- h |>
name str_replace("st4_conus.", "t") |>
str_replace(".24h.txt","")
<-read_csv(paste0("/home/data/",h), col_names=FALSE) %>%
aa#aa<-read_csv(h, col_names=FALSE) %>%
setNames(c("x1","x2","x3","x4","center_lon","center_lat",name)) %>%
select(-x1,-x2,-x3,-x4)
# joins by "center_lon", "center_lat"
<- left_join(aoi_texas_buffer,aa,by=NULL)%>%
bbpivot_longer(!1:5, names_to = "time", values_to = "rain_mm") %>%
mutate(time = ymd_h(str_sub(time,2,11))) %>%
mutate (year = year(time), month = month(time), day = day(time), hour = hour(time)) %>%
relocate(rain_mm, .after = last_col())
}
|>
bbgroup_by(year,month,day) |>
write_dataset(path = s3_path,
format = "parquet")
Github Actions
GitHub Actions is a CI/CD platform built into GitHub that runs workflows defined in YAML on events such as pushes, pull requests, or schedules, and it is what makes this automation work. The docker-compose.yml is fired by compose-workflow.yml, which has to be held in a folder called '.github/workflows'. AWS credentials are managed through GitHub as repository secrets and are injected at runtime through the actions .yml. The yml is shown below:
name: stg4-texas-24hr-backend-actions

on:
  schedule:
    - cron: '45 13 * * *'   # Runs daily at 13:45 UTC
  workflow_dispatch:        # Also allow manual triggering

jobs:
  run-pipeline:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Install Docker using Docker's official script
        run: |
          curl -fsSL https://get.docker.com -o get-docker.sh
          sudo sh get-docker.sh

      - name: Install Docker Compose
        run: |
          sudo curl -L "https://github.com/docker/compose/releases/download/v2.27.0/docker-compose-linux-x86_64" -o /usr/local/bin/docker-compose
          sudo chmod +x /usr/local/bin/docker-compose
          docker-compose --version

      - name: Create .env file with AWS credentials
        run: |
          echo "AWS_ACCESS_KEY_ID=${{ secrets.AWS_ACCESS_KEY_ID }}" >> .env
          echo "AWS_SECRET_ACCESS_KEY=${{ secrets.AWS_SECRET_ACCESS_KEY }}" >> .env
          echo "AWS_REGION=${{ secrets.AWS_REGION }}" >> .env

      - name: Run Docker Compose (with AWS env)
        run: docker-compose up
        continue-on-error: false
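If you ever switch to the hourly files mentioned earlier, the scraper logic would need to target the 01h products, and the only change on the Actions side would be the schedule; a hypothetical hourly trigger might look like:

on:
  schedule:
    - cron: '58 * * * *'   # hypothetical: run every hour, shortly after the :55 NOMADS drop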
Github repositories and this webpage
I started a github organization called 'stg4-texas-24hr-ga'; this allows me to segregate polished work from my messy personal github repository. Below are links to pertinent repositories.