Introduction

The CMS Synthetic Patient Data OMOP (named cms_synthetic_patient_data_omop in Google BigQuery) is a synthetic patient dataset in the OMOP Common Data Model v5.2, originally released by the CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010. It is intended to be used for research and educational purposes.

Objective

Analyzing procedures, patients, providers, payers, and drugs using SQL. Figure out the most frequently used drugs related to the coronary artery bypass graft (CABG)

Import libraries

# Import the necessary libraries
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd
import db_dtypes

# pandas settings to increase readability
pd.options.display.max_seq_items = 1000
pd.options.display.max_rows = 1000 
pd.options.display.max_columns = 200
pd.options.display.max_colwidth = 200

Connect to BigQuery

# Set up your Google Cloud credentials
credentials = service_account.Credentials.from_service_account_file('../assets/creds/google-creds-cms-claims.json')

# Initialize the BigQuery client
client = bigquery.Client(credentials=credentials)

Procedure

Statistics of Procedure Occurrence

query = """
    SELECT 
        COUNT(DISTINCT provider_id) cnt_provider,
        COUNT(DISTINCT person_id) cnt_patient,
        MIN(procedure_dat) from_date,
        MAX(procedure_dat) to_date,
    FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence`
"""

df = client.query(query).to_dataframe()
df
cnt_provider cnt_patient from_date to_date
0 902021 1979484 2007-11-27 2010-12-31

During 2007/11/27-2010/12/31, a total of 902,021 providers and 1,979,484 people are involved in the dataset.

Top 10 procedure by number of patients

query = """
    SELECT
        domain_id,
        concept_name,
        COUNT(Distinct person_id) cnt_patients
    FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
    JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` c
        ON p.procedure_concept_id = c.concept_id
    GROUP BY 1,2
    ORDER BY 3 DESC
    LIMIT 10
"""

df = client.query(query).to_dataframe()
df
domain_id concept_name cnt_patients
0 Procedure Office or other outpatient visit for the evaluation and management of an established patient, which requires at least 2 of these 3 key components: An expanded problem focused history; An expanded ... 1718733
1 Procedure Other diagnostic procedures on lymphatic structures 1689905
2 Procedure Collection of venous blood by venipuncture 1674947
3 Procedure Office or other outpatient visit for the evaluation and management of an established patient, which requires at least 2 of these 3 key components: A detailed history; A detailed examination; Medic... 1652934
4 Procedure Biopsy of lymphatic structure 1625521
5 Procedure Biopsy of mouth, unspecified structure 1562999
6 Metadata No matching concept 1487928
7 Procedure Long-term drug therapy 1303593
8 Procedure Office or other outpatient visit for the evaluation and management of an established patient, which requires at least 2 of these 3 key components: A problem focused history; A problem focused exam... 1271717
9 Procedure Subsequent hospital care, per day, for the evaluation and management of a patient, which requires at least 2 of these 3 key components: An expanded problem focused interval history; An expanded pr... 1199980

The top 1 procedure is Office or other outpatient visit related. What about more specific procedures? For example, procedures like Coronary artery bypass

Coronary Artery Bypass Grafting (CABG) Procedures

query = """
    SELECT
        concept_id,
        concept_name,
        vocabulary_id,
        COUNT(Distinct person_id) cnt_patients
    FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
    JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` c
        ON p.procedure_concept_id = c.concept_id
    WHERE LOWER(concept_name) LIKE '%coronary artery bypass%'
    GROUP BY 1,2,3
    ORDER BY 4 DESC
    LIMIT 10
"""

df = client.query(query).to_dataframe()
df
concept_id concept_name vocabulary_id cnt_patients
0 2107231 Coronary artery bypass, using arterial graft(s); single arterial graft CPT4 15077
1 2107214 Endoscopy, surgical, including video-assisted harvest of vein(s) for coronary artery bypass procedure (List separately in addition to code for primary procedure) CPT4 8104
2 2107223 Coronary artery bypass, using venous graft(s) and arterial graft(s); 2 venous grafts (List separately in addition to code for primary procedure) CPT4 5199
3 2100873 Anesthesia for direct coronary artery bypass grafting; with pump oxygenator CPT4 5189
4 2107224 Coronary artery bypass, using venous graft(s) and arterial graft(s); 3 venous grafts (List separately in addition to code for primary procedure) CPT4 3670
5 2107222 Coronary artery bypass, using venous graft(s) and arterial graft(s); single vein graft (List separately in addition to code for primary procedure) CPT4 2787
6 2100872 Anesthesia for direct coronary artery bypass grafting; without pump oxygenator CPT4 2270
7 2107230 Reoperation, coronary artery bypass procedure or valve procedure, more than 1 month after original operation (List separately in addition to code for primary procedure) CPT4 1659
8 2001514 Single internal mammary-coronary artery bypass ICD9Proc 1471
9 2107226 Coronary artery bypass, using venous graft(s) and arterial graft(s); 4 venous grafts (List separately in addition to code for primary procedure) CPT4 1264

There are many procedures related to CABG. I’m going to select the concept_id 2107231 as it has the most patients.

Patient

Age Range

cte_age = """
/* Get birth date of patients */

WITH person_bdate AS (
  SELECT
    person_id,
    CAST(CONCAT(year_of_birth,'-',month_of_birth,'-',day_of_birth) AS date) AS birth_date
  FROM `bigquery-public-data.cms_synthetic_patient_data_omop.person`
),

/* Get age of patients */

procedure_person_age AS (
  SELECT
    p.person_id,
    CAST(AVG(DATE_DIFF(p.procedure_dat, b.birth_date, year)) AS INT) AS age_at_procedure
  FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` AS p
  JOIN person_bdate b
    ON p.person_id = b.person_id
  GROUP BY 1
)
"""

query = cte_age + \
"""
SELECT
    COUNT(DISTINCT person_id) cnt_patients, 
    MIN(age_at_procedure) min_age, 
    APPROX_QUANTILES(age_at_procedure, 100)[OFFSET(24)] as p25_age,
    APPROX_QUANTILES(age_at_procedure, 100)[OFFSET(49)] as p50_age,
    APPROX_QUANTILES(age_at_procedure, 100)[OFFSET(74)] as p75_age,
    MAX(age_at_procedure) max_age
FROM procedure_person_age

"""

df = client.query(query).to_dataframe()
df
cnt_patients min_age p25_age p50_age p75_age max_age
0 1979484 25 67 73 81 101

75% of patients are older than 67 because it’s a Medicare (for the eld population 65+ years old) dataset.

Gender/Ethnicity Ratio

cte_gender_eth = """
WITH person_gender AS (
    SELECT 
        p.person_id, 
        c.concept_name AS gender, 
        ethnicity_concept_id
    FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
    JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.person` pe
        ON pe.person_id = p.person_id
    JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` c
        ON c.concept_id = pe.gender_concept_id
),
person_eth AS (
    SELECT 
        g.person_id, 
        c.concept_name AS ethnicity
    FROM person_gender g
    JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` c
        ON c.concept_id = g.ethnicity_concept_id
)
"""

gender_query = cte_gender_eth +\
"""
    SELECT
        gender,
        cnt,
        ROUND(cnt/(SUM(cnt) OVER ())*100, 2) percent_patients
    FROM (
    SELECT 
        gender, 
        COUNT(DISTINCT person_id) AS cnt
    FROM person_gender
    GROUP BY gender
    ) tmp
"""

# get count of patients by gender
df_gender = client.query(gender_query).to_dataframe()
df_gender
gender cnt percent_patients
0 MALE 850727 42.98
1 FEMALE 1128757 57.02
eth_query = cte_gender_eth +\
"""
    SELECT
        ethnicity,
        cnt,
        ROUND(cnt/(SUM(cnt) OVER ())*100, 2) percent_patients
    FROM (
    SELECT 
        ethnicity, 
        COUNT(DISTINCT person_id) AS cnt
    FROM person_eth
    GROUP BY ethnicity
    ) tmp
"""

# get count of patients by gender
df_eth = client.query(eth_query).to_dataframe()
df_eth
ethnicity cnt percent_patients
0 Not Hispanic or Latino 1935032 97.75
1 Hispanic or Latino 44452 2.25

Patients that Visit Multiple Providers

Patients with complex medical conditions may visit multiple providers.

query = """
SELECT 
    person_id,
    COUNT(DISTINCT provider_id) cnt_providers
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
"""

df = client.query(query).to_dataframe()
df
person_id cnt_providers
0 54123 246
1 1986824 232
2 281743 230
3 415512 230
4 935946 230
5 1864049 226
6 2275872 225
7 111683 224
8 1553227 223
9 1891594 220

The count of providers seems too large to be true. That’s probably because it’s a synthetic dataset.

Provider

Top 10 Providers by Average Daily Patient Count

query = """
WITH patients_daily AS (
    SELECT 
        npi, 
        procedure_dat,
        COUNT(DISTINCT person_id) cnt_patients
    FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
    JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.provider` pro
        ON pro.provider_id = p.provider_id
    GROUP BY 1,2
    ORDER BY 3 DESC
)

SELECT
    npi,
    CAST(CEILING(AVG(cnt_patients)) AS INT) avg_daily_cnt_patients
FROM patients_daily d
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10

"""

df = client.query(query).to_dataframe()
df
npi avg_daily_cnt_patients
0 0888901330 502
1 8207899456 365
2 9979265126 333
3 1973668724 287
4 2052541087 280
5 1052105939 250
6 7961400834 230
7 2866295111 218
8 0941734431 212
9 3552796390 210

The dataset does not provide further details such as provider names, but considering the number of patients per day, the top 10 providers are probably hospitals.

CABG Providers

We’ve known the concept_id for CABG is 2107231.

query = """
    SELECT 
        npi,
        COUNT(DISTINCT p.person_id) cabg_cnt_patients
    FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
    JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.provider` pro
        ON p.provider_id = pro.provider_id
    WHERE p.procedure_concept_id = 2107231
    GROUP BY 1
    ORDER BY 2 DESC
"""

df = client.query(query).to_dataframe()
df
npi cabg_cnt_patients
0 0888901330 86
1 9979265126 78
2 8207899456 67
3 2052541087 58
4 1973668724 52
... ... ...
12849 5326588845 1
12850 6529848843 1
12851 0211677309 1
12852 8731816368 1
12853 2994512661 1

12854 rows × 2 columns

Among 902021 providers, 12855 (1.4%) of them can provide CABG service. The provider with npi 0888901330 provided that service to 86 patients.

Payer

Which payer plan has the most beneficiaries? What’s the corresponding averge age of beneficiaries?

query = """
WITH payer_cnt AS (
    SELECT
        plan_source_value,
        COUNT(DISTINCT p.person_id) cnt_beneficiaries
    FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
    JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.payer_plan_period` pa
        ON p.person_id = pa.person_id
    GROUP BY 1
)

SELECT 
    plan_source_value, 
    cnt_beneficiaries,
    ROUND(cnt_beneficiaries/(SUM(cnt_beneficiaries) OVER ())*100, 2) percent_beneficiaries
FROM payer_cnt
ORDER BY percent_beneficiaries DESC
"""

df = client.query(query).to_dataframe()
df
plan_source_value cnt_beneficiaries percent_beneficiaries
0 Medicare Part B 1976387 29.96
1 Medicare Part A 1972066 29.89
2 Medicare Part D 1905775 28.89
3 HMO 742521 11.26

Drug

Top 10 Drugs by Frequency

query = """
SELECT 
    concept_name,
    COUNT(DISTINCT d.visit_occurrence_id) cnt_visits
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.drug_exposure` d
JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` c
ON c.concept_id = d.drug_concept_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
"""

df = client.query(query).to_dataframe()
df
concept_name cnt_visits
0 Influenza virus vaccine, trivalent (IIV3), split virus, 0.5 mL dosage, for intramuscular use 1423956
1 No matching concept 1105382
2 Epoetin Alfa 488914
3 Sodium Chloride Injectable Solution 415049
4 Vitamin B 12 1 MG 358829
5 Dexamethasone 1 MG 327797
6 paricalcitol Injectable Solution 314188
7 Midazolam 301279
8 Triamcinolone 283925
9 Ondansetron 275889

The drug with most visits is Influenza virus vaccine.

query = """
    SELECT 
        concept_name AS drug,
        COUNT(DISTINCT drug_exposure_id) cabg_cnt_visits
    FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
    JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.drug_exposure` d
        ON p.visit_occurrence_id = d.visit_occurrence_id
    JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` c
        ON c.concept_id = d.drug_concept_id
    WHERE p.procedure_concept_id = 2107231
    GROUP BY 1
    ORDER BY 2 DESC
"""

df = client.query(query).to_dataframe()
df
drug cabg_cnt_visits
0 Influenza virus vaccine, trivalent (IIV3), split virus, 0.5 mL dosage, for intramuscular use 112
1 Adenosine Triphosphate 95
2 regadenoson 85
3 Sodium Chloride Injectable Solution 71
4 Dipyridamole Injectable Solution 57
5 Triamcinolone 52
6 No matching concept 46
7 Vitamin B 12 1 MG 46
8 Dexamethasone 1 MG 45
9 methylprednisolone acetate 40 MG 29
10 Aminophylline 250 MG 28
11 Methylprednisolone 28
12 Pneumococcal polysaccharide vaccine, 23-valent (PPSV23), adult or immunosuppressed patient dosage, when administered to individuals 2 years or older, for subcutaneous or intramuscular use 15
13 darbepoetin alfa 13
14 Dobutamine Injectable Solution 13
15 Ketorolac 12
16 tetanus toxoid vaccine, inactivated 11
17 Promethazine Hydrochloride 50 MG 11
18 Betamethasone 11
19 Ceftriaxone 250 MG Injection 11
20 diphtheria toxoid vaccine, inactivated 11
21 Granisetron Injectable Solution 8
22 heparin 8
23 Epoetin Alfa 6
24 Perflutren 6
25 rituximab Injection 6
26 Influenza virus vaccine, trivalent (IIV3), split virus, preservative free, 0.5 mL dosage, for intramuscular use 6
27 Morphine Sulfate 10 MG 5
28 Leucovorin Injectable Solution 5
29 Midazolam 5
30 Lidocaine 10 MG 5
31 docetaxel Injectable Solution 4
32 Diphenhydramine Citrate 50 MG 4
33 Testosterone 4
34 sargramostim Injectable Solution 3
35 Estradiol Injectable Solution [Depo-estradiol] 3
36 hyaluronate 3
37 Leuprolide 3
38 Atropine 3
39 Tetanus and diphtheria toxoids adsorbed (Td), preservative free, when administered to individuals 7 years or older, for intramuscular use 3
40 Fentanyl 0.1 MG 3
41 fulvestrant Prefilled Syringe 3
42 Ranitidine Injectable Solution 2
43 pegfilgrastim Prefilled Syringe 2
44 Cyclophosphamide 2
45 Theophylline Injectable Solution 2
46 bevacizumab Injection 2
47 abatacept 2
48 Levalbuterol 2
49 cetuximab Injection 2
50 Filgrastim 2
51 Meperidine Hydrochloride 100 MG 2
52 Gentamicin Sulfate (USP) 2
53 zoledronic acid Injectable Solution [Zometa] 2
54 Atropine Sulfate 0.3 MG 2
55 Zoster (shingles) vaccine (HZV), live, for subcutaneous injection 2
56 Nalbuphine 2
57 Doxorubicin Injectable Solution 1
58 Budesonide 1
59 Sodium ferric gluconate complex Injectable Solution 1
60 hyaluronate Prefilled Syringe [Orthovisc] 1
61 Epinephrine 1
62 Ibandronate 1
63 Testosterone 100 MG 1
64 Histrelin acetate 0.00208 MG/HR Drug Implant [Vantas] 1
65 Cefotaxime 1000 MG Injection 1
66 Tetanus, diphtheria toxoids and acellular pertussis vaccine (Tdap), when administered to individuals 7 years or older, for intramuscular use 1
67 Immunoglobulin G Injectable Solution 1
68 Doxorubicin 1
69 Lincomycin Injectable Solution 1
70 Lorazepam 2 MG 1
71 Mepivacaine 1
72 Ipratropium 1
73 Magnesium Sulfate 1
74 Glucose 1
75 Ondansetron 1
76 alteplase Injectable Solution 1
77 Kanamycin Injectable Solution 1
78 BCG, Live, Tice Strain 1
79 Carboplatin Injectable Solution 1
80 bortezomib Injectable Solution 1

Interestingly, the top 1 drug for Coronary Artery Bypass Grafting is Influenza virus vaccine, why? After do some research, I found National Library of Medicine (NIH) reported there is a significant benefit of vaccination against influenza in patients hospitalized due to an acute coronary event.

Therefore, when the drug is a influenza virus vaccine, we should realize that the patient’s illness is not necessarily flu.