Global Earthquake Machine Learning Dataset: Machine Learning Asset Aggregation of the PDE (MLAAPDE)
Dates
Publication Date
2022-04-04
Start Date
2013-01-01
End Date
2021-01-01
Citation
Cole H. M. and W. L. Yeck, 2022, Global Earthquake Machine Learning Dataset: Machine Learning Asset Aggregation of the PDE (MLAAPDE): U.S. Geological Survey data release, doi:10.5066/P96FABIB
Summary
The Machine Learning Asset Aggregation of the PDE (MLAAPDE) is a waveform archive, feature-labeled catalog, and Python module that together provide a routine way to gather high-quality input data to train machine learning models. While all the data provided are already publicly available, MLAAPDE packages them in a format that lets a user prepare input for common machine learning frameworks in a few lines of code. Most of the features that are part of the MLAAPDE dataset are selected from the Preliminary Determination of Epicenters (PDE), the official earthquake catalog of the USGS National Earthquake Information Center (NEIC). The PDE aims to provide a complete catalog of source characterization estimates for earthquakes of roughly M4.5 and larger worldwide and M2.5 and larger within the United States, but it often also contains smaller events. The PDE is a static catalog of human-reviewed source estimates. The review process involves multiple seismic analysts verifying the quality of the published events. Some label data do not originate from the PDE and are either derived from inventory metadata or computed from associated waveforms. All waveform data included in the MLAAPDE archives are sourced from IRIS web services and accessed using the ObsPy client.
MLAAPDE data consist of year-month file pairs for each month from July 2013 to December 2020. Each file pair is a comma-separated value (CSV) label catalog and a hierarchical data format version 5 (HDF5) waveform archive. Each CSV catalog contains a header with names for each column and one row for each trace. The CSV files are sparsely populated, with empty cells where a value is not provided for that trace. Each HDF5 archive contains a single group named 'data' that holds a few attributes and many datasets keyed by their unique 'trace_id', which links the 3-component waveforms to their label data in the CSV catalog. The important included attributes are 'resample_hz', 'pre_event_sec', and 'post_event_sec', which allow a user to reconstruct the time domain and arrival time for each dataset in the group. Also included are the attributes 'created_utc' and 'mlaapde_version', which denote when the file was generated and what version of the MLAAPDE Python module was used.
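As a minimal sketch of how the three attributes can be used to rebuild the time domain, the following pure-Python example computes sample times relative to the phase arrival. It assumes that the first sample sits exactly 'pre_event_sec' before the arrival and that samples are evenly spaced at 'resample_hz'; the function names are hypothetical and are not part of the 'neic-mlaapde' module. In practice the attribute values would be read from the 'data' group of an archive (for example through the attrs mapping of an h5py group) and the sample count would be taken from the waveform array itself.

```python
def time_axis(n_samples, resample_hz, pre_event_sec):
    """Return sample times in seconds relative to the phase arrival (t = 0).

    Assumption: the first sample lies pre_event_sec before the arrival.
    """
    return [i / resample_hz - pre_event_sec for i in range(n_samples)]


def arrival_index(resample_hz, pre_event_sec):
    """Return the index of the sample nearest to the phase arrival."""
    return round(pre_event_sec * resample_hz)


# Values described in this summary: 40 Hz, 60 s before and 60 s after.
resample_hz, pre_event_sec, post_event_sec = 40.0, 60.0, 60.0

# Assumed length for illustration; use the real array length in practice.
n_samples = int((pre_event_sec + post_event_sec) * resample_hz)

t = time_axis(n_samples, resample_hz, pre_event_sec)
idx = arrival_index(resample_hz, pre_event_sec)
```

With these values, t[idx] falls at zero, so the arrival is at the midpoint of the window.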
A user can download the data and access them with any programming language capable of loading HDF5 files. The Python module 'neic-mlaapde' is also provided; it simplifies the task of selecting traces from the MLAAPDE catalogs and loading their waveform data.
During the generation of the dataset, waveforms were returned from the Incorporated Research Institutions for Seismology (IRIS) with 1 s of buffer time, resampled to 40 Hz, demeaned, detrended, and trimmed to 60 s before and 60 s after the phase arrival. Each waveform in the HDF5 archive is 120 s in length, resampled to 40 Hz, and centered on the first P, Pn, or Pg phase arrival. If desired, a user can modify these parameters and generate the dataset themselves with the 'neic-mlaapde' Python module.
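The demean, detrend, and trim steps above can be illustrated with a short pure-Python sketch. This is not the code used to generate the dataset: the function names are hypothetical, resampling is omitted, the 1 s buffer is assumed to apply on each side of the window, and the input is a synthetic ramp rather than real waveform data.

```python
def demean(x):
    """Subtract the mean from every sample."""
    m = sum(x) / len(x)
    return [v - m for v in x]


def detrend_linear(x):
    """Subtract the least-squares straight line from the samples."""
    n = len(x)
    t_mean = (n - 1) / 2
    x_mean = sum(x) / n
    sxx = sum((i - t_mean) ** 2 for i in range(n))
    sxy = sum((i - t_mean) * (v - x_mean) for i, v in enumerate(x))
    slope = sxy / sxx if sxx else 0.0
    return [v - (x_mean + slope * (i - t_mean)) for i, v in enumerate(x)]


def trim_around(x, hz, arrival_sec, pre_sec, post_sec):
    """Keep samples from pre_sec before to post_sec after the arrival.

    arrival_sec is measured in seconds from the start of the trace.
    """
    start = round((arrival_sec - pre_sec) * hz)
    stop = round((arrival_sec + post_sec) * hz)
    if start < 0 or stop > len(x):
        raise ValueError("trace too short for the requested window")
    return x[start:stop]


# Synthetic 40 Hz trace, 122 s long: an offset plus a linear drift.
hz = 40
raw = [5.0 + 0.01 * i for i in range(122 * hz)]

# Arrival assumed at 61 s, leaving 1 s of buffer on each side.
window = trim_around(detrend_linear(demean(raw)), hz, 61, 60, 60)
```

Because the synthetic input is a pure offset plus drift, the processed window is 120 s long (4800 samples at 40 Hz) and numerically close to zero throughout.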
All latitude/longitude coordinates are given with respect to the WGS84 reference system and all geodetic calculations are done using the WGS84 ellipsoid.
The CSV file contains the following labels:
phase_id (str) - The unique ID for each row, formatted like "EVENTID_NET.STA.CHA.LOC_PHASE".
waves_id (str) - The event-station pair ID, formatted like "EVENTID_NET.STA.CHA.LOC".
event_id (str) - The PDE event ID.
nscl_code (str) - The NSCL dotcode for the network, station, channel, location.
network (str) - Network code.
station (str) - Station code.
channel (str) - SEED format channel code. The last character is always '*'.
location (str) - Location tag, or '--' if absent.
station_latitude (float) - Station latitude, per IRIS metadata at creation time of the dataset.
station_longitude (float) - Station longitude, per IRIS metadata at creation time of the dataset.
station_elevation_m (float) - Station elevation, per IRIS metadata at creation time of the dataset.
chan_order_3c (str) - Three-character order of channels, almost always ENZ or 12Z.
chan_azimuth_1 (float) - Azimuth in degrees of the first channel's orientation.
chan_azimuth_2 (float) - Azimuth in degrees of the second channel's orientation.
chan_azimuth_Z (float) - Azimuth in degrees of the vertical channel's orientation.
source_type (str) - Tag describing source type, per PDE (currently, all are "earthquake" type events).
source_origin_time (str) - Origin time in UTC, like: "2013-07-23 00:23:14.260000+00:00".
source_latitude (float) - Epicenter latitude, per PDE.
source_longitude (float) - Epicenter longitude, per PDE.
source_depth_km (float) - Hypocenter depth, per PDE.
source_magnitude (float) - Preferred magnitude, per PDE.
source_magnitude_type (str) - Preferred magnitude type, per PDE.
source_magnitude_author (str) - Preferred magnitude contributor, per PDE.
focal_np1_strike (float) - 1st nodal plane strike in degrees.
focal_np1_dip (float) - 1st nodal plane dip in degrees.
focal_np1_rake (float) - 1st nodal plane rake in degrees.
focal_np2_strike (float) - 2nd nodal plane strike in degrees.
focal_np2_dip (float) - 2nd nodal plane dip in degrees.
focal_np2_rake (float) - 2nd nodal plane rake in degrees.
moment_Mrr (float) - Moment tensor rr value.
moment_Mtt (float) - Moment tensor tt value.
moment_Mpp (float) - Moment tensor pp value.
moment_Mrt (float) - Moment tensor rt value.
moment_Mrp (float) - Moment tensor rp value.
moment_Mtp (float) - Moment tensor tp value.
moment_scalar_moment (float) - Seismic moment in Nm.
stf_type (str) - Source time function type. Currently always "triangle".
stf_duration_sec (float) - Source time function duration in seconds.
stf_rise_sec (float) - Time in seconds to peak of source time function.
stf_decay_sec (float) - Time in seconds of decay from peak of source time function.
phase_time (str) - Arrival time in UTC, like: "2013-07-23 00:23:14.260000+00:00".
phase_hint (str) - Tagged phase type from IASPEI standard phase list.
phase_author (str) - Phase tag author agency, almost always "us".
phase_status (str) - Flag denoting how pick was made. Either "automatic" or "manual".
phase_arrival_time_weight (float) - Weight of arrival time, per PDE.
phase_arrival_time_residual (float) - Residual time in seconds, per PDE.
phase_travel_sec (float) - Arrival time minus origin time in seconds.
phase_analyst_id (int) - Anonymous ID of the human analyst who made the manual pick.
source_distance_deg (float) - Source-receiver distance in degrees, per geodetic calculation.
source_distance_km (float) - Source-receiver distance in kilometers, converted from source_distance_deg.
source_azimuth_deg (float) - Source-receiver azimuth in degrees, per geodetic calculation.
source_back_azimuth_deg (float) - Source-receiver back azimuth in degrees, per geodetic calculation.
source_takeoff_deg (float) - Takeoff angle in degrees, per PDE.
snr_db (float) - Signal-to-noise ratio in decibels, computed during dataset generation.
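The labels above can be read with any CSV parser. The following sketch uses only the Python standard library to handle the sparse (empty) cells, select rows, split a 'phase_id' into its parts, and recompute the travel time from the two timestamps. The two sample rows are invented for illustration and carry only a subset of the columns; the helper names are hypothetical and are not part of the 'neic-mlaapde' module. The ID split assumes that event IDs and phase names contain no underscores.

```python
import csv
import io
from datetime import datetime

# Columns converted to float in this sketch (a subset of the float labels).
FLOAT_COLS = {"source_magnitude", "snr_db"}

# Invented rows for illustration only; real catalogs have many more columns.
SAMPLE_CSV = (
    "phase_id,source_origin_time,phase_time,phase_status,"
    "source_magnitude,snr_db\n"
    "ev0001_XX.AAA.BH*.00_P,2013-07-23 00:23:14.260000+00:00,"
    "2013-07-23 00:24:04.760000+00:00,manual,5.1,12.5\n"
    "ev0002_XX.BBB.HH*.--_Pn,2013-07-23 01:00:00.000000+00:00,"
    "2013-07-23 01:00:20.000000+00:00,automatic,,\n"
)


def parse_row(row):
    """Map empty cells to None and convert the listed columns to float."""
    out = {}
    for key, value in row.items():
        if value == "":
            out[key] = None
        elif key in FLOAT_COLS:
            out[key] = float(value)
        else:
            out[key] = value
    return out


def split_phase_id(phase_id):
    """Split "EVENTID_NET.STA.CHA.LOC_PHASE" into its three parts."""
    event_id, rest = phase_id.split("_", 1)
    nscl_code, phase = rest.rsplit("_", 1)
    return event_id, nscl_code, phase


def travel_seconds(row):
    """Arrival time minus origin time, in seconds."""
    origin = datetime.fromisoformat(row["source_origin_time"])
    arrival = datetime.fromisoformat(row["phase_time"])
    return (arrival - origin).total_seconds()


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]

# Keep manual picks with a known SNR of at least 10 dB.
selected = [
    r for r in rows
    if r["phase_status"] == "manual"
    and r["snr_db"] is not None
    and r["snr_db"] >= 10.0
]
```

Checking for None before comparing matters here because a missing 'snr_db' cell would otherwise raise an error in the numeric comparison.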
The contents of this dataset may evolve over time as new label data are added and as new PDE data become available. Any updates to this dataset will be noted in the update log.