BigQuery Dataset
The Asteroid Institute maintains a BigQuery instance of the Small Bodies Node (SBN) replica of the Minor Planet Center database. The SBN provides replication services for the MPC database (see https://sbnmpc.astro.umd.edu/MPC_database/replication-info.shtml), and the Asteroid Institute maintains a BigQuery instance of this replica.
The dataset is available through Google Cloud’s Analytics Hub:
Dataset Scale and Storage
The MPC dataset is substantial in size, which contributes to both its power and the importance of careful query optimization. For example, the primary observations table public_obs_sbn contains:
Row Count: Over 479 million observations (479,055,827 rows)
Logical Size: 181.44 GB of data
Physical Storage: - Current: 28.82 GB (compressed) - Total: 411.68 GB (including time travel and system snapshots)
This scale enables powerful analyses but also means that full table scans can quickly consume significant query resources. Understanding these numbers is crucial for:
Query Performance: The large row count means indexes and query optimization are essential
Cost Management: Scanning the full observations table processes ~181 GB of data (~$0.90 at standard pricing). However, you rarely need all columns.
Dataset Access
To access the dataset:
Create a Google Cloud Platform account if you don’t have one
Visit the Analytics Hub listing using the link above
Subscribe to the dataset
Note the dataset ID from your subscription - you’ll need this to initialize the client
After subscribing, you’ll receive one dataset ID that you’ll use to initialize the BigQueryMPCClient:
from mpcq.client import BigQueryMPCClient
client = BigQueryMPCClient(dataset_id="your_subscribed_main_dataset_id")
Dataset Overview
The main MPC dataset is a complete, real-time replica of the MPC’s database containing all core tables:
public_obs_sbn: Primary observations tablepublic_mpc_orbits: Orbital elements and uncertaintiespublic_neocp_*: Near-Earth Object Confirmation Page datapublic_current_identifications: Current object identificationspublic_numbered_identifications: Numbered asteroid identificationspublic_obs_alterations_*: History of observation modifications
This dataset is updated in real-time as changes occur in the MPC database.
Key Tables
public_obs_sbn
The primary observations table containing all asteroid observations:
SELECT *
FROM `your-dataset-id.asteroid_institute_mpc_replica.public_obs_sbn`
WHERE provid = '2013 RR165'
LIMIT 5
- Key columns:
obsid: Unique observation identifierprovid: Provisional designationobstime: Observation timestampra,dec: Position in degreesmag: Magnitudeband: Filter bandstn: Observatory codesubmission_id: Submission identifier
public_current_identifications
Links between different designations for the same object:
SELECT *
FROM `your-dataset-id.asteroid_institute_mpc_replica.public_current_identifications`
WHERE unpacked_secondary_provisional_designation = '2013 RR165'
- Key columns:
unpacked_primary_provisional_designationunpacked_secondary_provisional_designationpermid
public_numbered_identifications
Information about numbered asteroids:
SELECT *
FROM `your-dataset-id.asteroid_institute_mpc_replica.public_numbered_identifications`
WHERE permid = '12345'
- Key columns:
permid: Permanent identifierunpacked_primary_provisional_designation
public_orbits
Orbital elements for objects:
SELECT *
FROM `your-dataset-id.asteroid_institute_mpc_replica.public_orbits`
WHERE provid = '2013 RR165'
ORDER BY epoch DESC
LIMIT 1
- Key columns:
provid: Provisional designationepoch: Epoch of orbital elementsa,e,i: Semi-major axis, eccentricity, inclinationom,w,ma: Longitude of ascending node, argument of perihelion, mean anomaly
Performance Optimization
The dataset includes several performance optimizations:
Table Layout: The canonical
public_obs_sbntable uses partitioning and clustering in the provider dataset for efficient query pruning.-- Query: Count observations for specific observatories SELECT stn, COUNT(obsid) FROM `your_dataset.public_obs_sbn` WHERE stn in ("W68", "T08", "T05", "M22") GROUP BY stn;
- Query Best Practices:
Filter by station and time windows whenever possible
Filter on indexed columns when possible
Use
LIMITto test queries before running on full dataset
Example Queries
Find all observations of an object:
SELECT
obstime,
ra,
dec,
mag,
band,
stn
FROM `your-dataset-id.asteroid_institute_mpc_replica.public_obs_sbn`
WHERE provid = '2013 RR165'
ORDER BY obstime ASC
Find objects with multiple designations:
WITH object_ids AS (
SELECT
unpacked_primary_provisional_designation,
unpacked_secondary_provisional_designation,
permid
FROM `your-dataset-id.asteroid_institute_mpc_replica.public_current_identifications`
WHERE unpacked_secondary_provisional_designation = '2013 RR165'
)
SELECT DISTINCT
o.obstime,
o.ra,
o.dec,
o.provid,
i.unpacked_primary_provisional_designation
FROM `your-dataset-id.asteroid_institute_mpc_replica.public_obs_sbn` o
JOIN object_ids i
ON o.provid = i.unpacked_secondary_provisional_designation
OR o.provid = i.unpacked_primary_provisional_designation
ORDER BY o.obstime ASC
Pricing and Free Tier
BigQuery offers a free tier and a pay-as-you-go pricing model. Note that your free monthly 1TB of analysis credits are maintained on a paid plan.
- Free Tier (Monthly):
1 TB of query processing
10 GB of active storage
- Standard Pricing:
Query pricing: $6.25 per TB of data processed
Storage pricing: $0.02 per GB per month for active storage
To manage costs effectively:
Use the BigQuery Console to estimate query costs before running them
Consider setting up billing alerts and quotas
- Use query optimization techniques:
Select specific columns instead of
SELECT *Use
LIMITto test queriesFilter early in queries to reduce data processed
Cache frequently accessed results locally
You can estimate query costs programmatically by setting up a dry run:
from google.cloud import bigquery
# Configure a dry run
job_config = bigquery.QueryJobConfig(dry_run=True)
# Your query
query = "SELECT * FROM `your_dataset.public_obs_sbn`"
# Get bytes that would be processed
query_job = client.query(query, job_config=job_config)
bytes_processed = query_job.total_bytes_processed
# Estimate cost ($5.00 per TB)
estimated_cost_usd = (bytes_processed / 1e12) * 5.00
To manage BigQuery costs effectively, it’s important to understand the scale of the data:
- Query Cost Examples:
Full scan of observations table (181.44 GB): ~$0.90
Scanning 10% of the table: ~$0.09
Monthly free tier (1 TB) could process the full table ~5.5 times