Analyzing CMS Synthetic Patient Data using SQL
Introduction
The CMS Synthetic Patient Data OMOP (named cms_synthetic_patient_data_omop
in Google BigQuery) is a synthetic patient dataset in the OMOP Common Data Model v5.2, originally released by the CMS and accessed via BigQuery. The dataset includes 24 tables and records for 2 million synthetic patients from 2008 to 2010. It is intended to be used for research and educational purposes.
Objective
Analyzing procedures, patients, providers, payers, and drugs using SQL. Figure out the most frequently used drugs related to the coronary artery bypass graft (CABG)
Import libraries
# Import the necessary libraries
from google.cloud import bigquery
from google.oauth2 import service_account
import pandas as pd
import db_dtypes
# pandas settings to increase readability
pd.options.display.max_seq_items = 1000
pd.options.display.max_rows = 1000
pd.options.display.max_columns = 200
pd.options.display.max_colwidth = 200
Connect to BigQuery
# Set up your Google Cloud credentials
credentials = service_account.Credentials.from_service_account_file('../assets/creds/google-creds-cms-claims.json')
# Initialize the BigQuery client
client = bigquery.Client(credentials=credentials)
Procedure
Statistics of Procedure Occurrence
query = """
SELECT
COUNT(DISTINCT provider_id) cnt_provider,
COUNT(DISTINCT person_id) cnt_patient,
MIN(procedure_dat) from_date,
MAX(procedure_dat) to_date,
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence`
"""
df = client.query(query).to_dataframe()
df
cnt_provider | cnt_patient | from_date | to_date | |
---|---|---|---|---|
0 | 902021 | 1979484 | 2007-11-27 | 2010-12-31 |
During 2007/11/27-2010/12/31, a total of 902,021 providers and 1,979,484 people are involved in the dataset.
Top 10 procedure by number of patients
query = """
SELECT
domain_id,
concept_name,
COUNT(Distinct person_id) cnt_patients
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` c
ON p.procedure_concept_id = c.concept_id
GROUP BY 1,2
ORDER BY 3 DESC
LIMIT 10
"""
df = client.query(query).to_dataframe()
df
domain_id | concept_name | cnt_patients | |
---|---|---|---|
0 | Procedure | Office or other outpatient visit for the evaluation and management of an established patient, which requires at least 2 of these 3 key components: An expanded problem focused history; An expanded ... | 1718733 |
1 | Procedure | Other diagnostic procedures on lymphatic structures | 1689905 |
2 | Procedure | Collection of venous blood by venipuncture | 1674947 |
3 | Procedure | Office or other outpatient visit for the evaluation and management of an established patient, which requires at least 2 of these 3 key components: A detailed history; A detailed examination; Medic... | 1652934 |
4 | Procedure | Biopsy of lymphatic structure | 1625521 |
5 | Procedure | Biopsy of mouth, unspecified structure | 1562999 |
6 | Metadata | No matching concept | 1487928 |
7 | Procedure | Long-term drug therapy | 1303593 |
8 | Procedure | Office or other outpatient visit for the evaluation and management of an established patient, which requires at least 2 of these 3 key components: A problem focused history; A problem focused exam... | 1271717 |
9 | Procedure | Subsequent hospital care, per day, for the evaluation and management of a patient, which requires at least 2 of these 3 key components: An expanded problem focused interval history; An expanded pr... | 1199980 |
The top 1 procedure is Office or other outpatient visit related. What about more specific procedures? For example, procedures like Coronary artery bypass
Coronary Artery Bypass Grafting (CABG) Procedures
query = """
SELECT
concept_id,
concept_name,
vocabulary_id,
COUNT(Distinct person_id) cnt_patients
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` c
ON p.procedure_concept_id = c.concept_id
WHERE LOWER(concept_name) LIKE '%coronary artery bypass%'
GROUP BY 1,2,3
ORDER BY 4 DESC
LIMIT 10
"""
df = client.query(query).to_dataframe()
df
concept_id | concept_name | vocabulary_id | cnt_patients | |
---|---|---|---|---|
0 | 2107231 | Coronary artery bypass, using arterial graft(s); single arterial graft | CPT4 | 15077 |
1 | 2107214 | Endoscopy, surgical, including video-assisted harvest of vein(s) for coronary artery bypass procedure (List separately in addition to code for primary procedure) | CPT4 | 8104 |
2 | 2107223 | Coronary artery bypass, using venous graft(s) and arterial graft(s); 2 venous grafts (List separately in addition to code for primary procedure) | CPT4 | 5199 |
3 | 2100873 | Anesthesia for direct coronary artery bypass grafting; with pump oxygenator | CPT4 | 5189 |
4 | 2107224 | Coronary artery bypass, using venous graft(s) and arterial graft(s); 3 venous grafts (List separately in addition to code for primary procedure) | CPT4 | 3670 |
5 | 2107222 | Coronary artery bypass, using venous graft(s) and arterial graft(s); single vein graft (List separately in addition to code for primary procedure) | CPT4 | 2787 |
6 | 2100872 | Anesthesia for direct coronary artery bypass grafting; without pump oxygenator | CPT4 | 2270 |
7 | 2107230 | Reoperation, coronary artery bypass procedure or valve procedure, more than 1 month after original operation (List separately in addition to code for primary procedure) | CPT4 | 1659 |
8 | 2001514 | Single internal mammary-coronary artery bypass | ICD9Proc | 1471 |
9 | 2107226 | Coronary artery bypass, using venous graft(s) and arterial graft(s); 4 venous grafts (List separately in addition to code for primary procedure) | CPT4 | 1264 |
There are many procedures related to CABG. I’m going to select the concept_id 2107231 as it has the most patients.
Patient
Age Range
cte_age = """
/* Get birth date of patients */
WITH person_bdate AS (
SELECT
person_id,
CAST(CONCAT(year_of_birth,'-',month_of_birth,'-',day_of_birth) AS date) AS birth_date
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.person`
),
/* Get age of patients */
procedure_person_age AS (
SELECT
p.person_id,
CAST(AVG(DATE_DIFF(p.procedure_dat, b.birth_date, year)) AS INT) AS age_at_procedure
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` AS p
JOIN person_bdate b
ON p.person_id = b.person_id
GROUP BY 1
)
"""
query = cte_age + \
"""
SELECT
COUNT(DISTINCT person_id) cnt_patients,
MIN(age_at_procedure) min_age,
APPROX_QUANTILES(age_at_procedure, 100)[OFFSET(24)] as p25_age,
APPROX_QUANTILES(age_at_procedure, 100)[OFFSET(49)] as p50_age,
APPROX_QUANTILES(age_at_procedure, 100)[OFFSET(74)] as p75_age,
MAX(age_at_procedure) max_age
FROM procedure_person_age
"""
df = client.query(query).to_dataframe()
df
cnt_patients | min_age | p25_age | p50_age | p75_age | max_age | |
---|---|---|---|---|---|---|
0 | 1979484 | 25 | 67 | 73 | 81 | 101 |
75% of patients are older than 67 because it’s a Medicare (for the eld population 65+ years old) dataset.
Gender/Ethnicity Ratio
cte_gender_eth = """
WITH person_gender AS (
SELECT
p.person_id,
c.concept_name AS gender,
ethnicity_concept_id
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.person` pe
ON pe.person_id = p.person_id
JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` c
ON c.concept_id = pe.gender_concept_id
),
person_eth AS (
SELECT
g.person_id,
c.concept_name AS ethnicity
FROM person_gender g
JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` c
ON c.concept_id = g.ethnicity_concept_id
)
"""
gender_query = cte_gender_eth +\
"""
SELECT
gender,
cnt,
ROUND(cnt/(SUM(cnt) OVER ())*100, 2) percent_patients
FROM (
SELECT
gender,
COUNT(DISTINCT person_id) AS cnt
FROM person_gender
GROUP BY gender
) tmp
"""
# get count of patients by gender
df_gender = client.query(gender_query).to_dataframe()
df_gender
gender | cnt | percent_patients | |
---|---|---|---|
0 | MALE | 850727 | 42.98 |
1 | FEMALE | 1128757 | 57.02 |
eth_query = cte_gender_eth +\
"""
SELECT
ethnicity,
cnt,
ROUND(cnt/(SUM(cnt) OVER ())*100, 2) percent_patients
FROM (
SELECT
ethnicity,
COUNT(DISTINCT person_id) AS cnt
FROM person_eth
GROUP BY ethnicity
) tmp
"""
# get count of patients by gender
df_eth = client.query(eth_query).to_dataframe()
df_eth
ethnicity | cnt | percent_patients | |
---|---|---|---|
0 | Not Hispanic or Latino | 1935032 | 97.75 |
1 | Hispanic or Latino | 44452 | 2.25 |
Patients that Visit Multiple Providers
Patients with complex medical conditions may visit multiple providers.
query = """
SELECT
person_id,
COUNT(DISTINCT provider_id) cnt_providers
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
"""
df = client.query(query).to_dataframe()
df
person_id | cnt_providers | |
---|---|---|
0 | 54123 | 246 |
1 | 1986824 | 232 |
2 | 281743 | 230 |
3 | 415512 | 230 |
4 | 935946 | 230 |
5 | 1864049 | 226 |
6 | 2275872 | 225 |
7 | 111683 | 224 |
8 | 1553227 | 223 |
9 | 1891594 | 220 |
The count of providers seems too large to be true. That’s probably because it’s a synthetic dataset.
Provider
Top 10 Providers by Average Daily Patient Count
query = """
WITH patients_daily AS (
SELECT
npi,
procedure_dat,
COUNT(DISTINCT person_id) cnt_patients
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.provider` pro
ON pro.provider_id = p.provider_id
GROUP BY 1,2
ORDER BY 3 DESC
)
SELECT
npi,
CAST(CEILING(AVG(cnt_patients)) AS INT) avg_daily_cnt_patients
FROM patients_daily d
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
"""
df = client.query(query).to_dataframe()
df
npi | avg_daily_cnt_patients | |
---|---|---|
0 | 0888901330 | 502 |
1 | 8207899456 | 365 |
2 | 9979265126 | 333 |
3 | 1973668724 | 287 |
4 | 2052541087 | 280 |
5 | 1052105939 | 250 |
6 | 7961400834 | 230 |
7 | 2866295111 | 218 |
8 | 0941734431 | 212 |
9 | 3552796390 | 210 |
The dataset does not provide further details such as provider names, but considering the number of patients per day, the top 10 providers are probably hospitals.
CABG Providers
We’ve known the concept_id
for CABG is 2107231.
query = """
SELECT
npi,
COUNT(DISTINCT p.person_id) cabg_cnt_patients
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.provider` pro
ON p.provider_id = pro.provider_id
WHERE p.procedure_concept_id = 2107231
GROUP BY 1
ORDER BY 2 DESC
"""
df = client.query(query).to_dataframe()
df
npi | cabg_cnt_patients | |
---|---|---|
0 | 0888901330 | 86 |
1 | 9979265126 | 78 |
2 | 8207899456 | 67 |
3 | 2052541087 | 58 |
4 | 1973668724 | 52 |
... | ... | ... |
12849 | 5326588845 | 1 |
12850 | 6529848843 | 1 |
12851 | 0211677309 | 1 |
12852 | 8731816368 | 1 |
12853 | 2994512661 | 1 |
12854 rows × 2 columns
Among 902021 providers, 12855 (1.4%) of them can provide CABG service. The provider with npi 0888901330 provided that service to 86 patients.
Payer
Which payer plan has the most beneficiaries? What’s the corresponding averge age of beneficiaries?
query = """
WITH payer_cnt AS (
SELECT
plan_source_value,
COUNT(DISTINCT p.person_id) cnt_beneficiaries
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.payer_plan_period` pa
ON p.person_id = pa.person_id
GROUP BY 1
)
SELECT
plan_source_value,
cnt_beneficiaries,
ROUND(cnt_beneficiaries/(SUM(cnt_beneficiaries) OVER ())*100, 2) percent_beneficiaries
FROM payer_cnt
ORDER BY percent_beneficiaries DESC
"""
df = client.query(query).to_dataframe()
df
plan_source_value | cnt_beneficiaries | percent_beneficiaries | |
---|---|---|---|
0 | Medicare Part B | 1976387 | 29.96 |
1 | Medicare Part A | 1972066 | 29.89 |
2 | Medicare Part D | 1905775 | 28.89 |
3 | HMO | 742521 | 11.26 |
Drug
Top 10 Drugs by Frequency
query = """
SELECT
concept_name,
COUNT(DISTINCT d.visit_occurrence_id) cnt_visits
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.drug_exposure` d
JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` c
ON c.concept_id = d.drug_concept_id
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
"""
df = client.query(query).to_dataframe()
df
concept_name | cnt_visits | |
---|---|---|
0 | Influenza virus vaccine, trivalent (IIV3), split virus, 0.5 mL dosage, for intramuscular use | 1423956 |
1 | No matching concept | 1105382 |
2 | Epoetin Alfa | 488914 |
3 | Sodium Chloride Injectable Solution | 415049 |
4 | Vitamin B 12 1 MG | 358829 |
5 | Dexamethasone 1 MG | 327797 |
6 | paricalcitol Injectable Solution | 314188 |
7 | Midazolam | 301279 |
8 | Triamcinolone | 283925 |
9 | Ondansetron | 275889 |
The drug with most visits is Influenza virus vaccine.
Drugs related to CABG
query = """
SELECT
concept_name AS drug,
COUNT(DISTINCT drug_exposure_id) cabg_cnt_visits
FROM `bigquery-public-data.cms_synthetic_patient_data_omop.procedure_occurrence` p
JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.drug_exposure` d
ON p.visit_occurrence_id = d.visit_occurrence_id
JOIN `bigquery-public-data.cms_synthetic_patient_data_omop.concept` c
ON c.concept_id = d.drug_concept_id
WHERE p.procedure_concept_id = 2107231
GROUP BY 1
ORDER BY 2 DESC
"""
df = client.query(query).to_dataframe()
df
drug | cabg_cnt_visits | |
---|---|---|
0 | Influenza virus vaccine, trivalent (IIV3), split virus, 0.5 mL dosage, for intramuscular use | 112 |
1 | Adenosine Triphosphate | 95 |
2 | regadenoson | 85 |
3 | Sodium Chloride Injectable Solution | 71 |
4 | Dipyridamole Injectable Solution | 57 |
5 | Triamcinolone | 52 |
6 | No matching concept | 46 |
7 | Vitamin B 12 1 MG | 46 |
8 | Dexamethasone 1 MG | 45 |
9 | methylprednisolone acetate 40 MG | 29 |
10 | Aminophylline 250 MG | 28 |
11 | Methylprednisolone | 28 |
12 | Pneumococcal polysaccharide vaccine, 23-valent (PPSV23), adult or immunosuppressed patient dosage, when administered to individuals 2 years or older, for subcutaneous or intramuscular use | 15 |
13 | darbepoetin alfa | 13 |
14 | Dobutamine Injectable Solution | 13 |
15 | Ketorolac | 12 |
16 | tetanus toxoid vaccine, inactivated | 11 |
17 | Promethazine Hydrochloride 50 MG | 11 |
18 | Betamethasone | 11 |
19 | Ceftriaxone 250 MG Injection | 11 |
20 | diphtheria toxoid vaccine, inactivated | 11 |
21 | Granisetron Injectable Solution | 8 |
22 | heparin | 8 |
23 | Epoetin Alfa | 6 |
24 | Perflutren | 6 |
25 | rituximab Injection | 6 |
26 | Influenza virus vaccine, trivalent (IIV3), split virus, preservative free, 0.5 mL dosage, for intramuscular use | 6 |
27 | Morphine Sulfate 10 MG | 5 |
28 | Leucovorin Injectable Solution | 5 |
29 | Midazolam | 5 |
30 | Lidocaine 10 MG | 5 |
31 | docetaxel Injectable Solution | 4 |
32 | Diphenhydramine Citrate 50 MG | 4 |
33 | Testosterone | 4 |
34 | sargramostim Injectable Solution | 3 |
35 | Estradiol Injectable Solution [Depo-estradiol] | 3 |
36 | hyaluronate | 3 |
37 | Leuprolide | 3 |
38 | Atropine | 3 |
39 | Tetanus and diphtheria toxoids adsorbed (Td), preservative free, when administered to individuals 7 years or older, for intramuscular use | 3 |
40 | Fentanyl 0.1 MG | 3 |
41 | fulvestrant Prefilled Syringe | 3 |
42 | Ranitidine Injectable Solution | 2 |
43 | pegfilgrastim Prefilled Syringe | 2 |
44 | Cyclophosphamide | 2 |
45 | Theophylline Injectable Solution | 2 |
46 | bevacizumab Injection | 2 |
47 | abatacept | 2 |
48 | Levalbuterol | 2 |
49 | cetuximab Injection | 2 |
50 | Filgrastim | 2 |
51 | Meperidine Hydrochloride 100 MG | 2 |
52 | Gentamicin Sulfate (USP) | 2 |
53 | zoledronic acid Injectable Solution [Zometa] | 2 |
54 | Atropine Sulfate 0.3 MG | 2 |
55 | Zoster (shingles) vaccine (HZV), live, for subcutaneous injection | 2 |
56 | Nalbuphine | 2 |
57 | Doxorubicin Injectable Solution | 1 |
58 | Budesonide | 1 |
59 | Sodium ferric gluconate complex Injectable Solution | 1 |
60 | hyaluronate Prefilled Syringe [Orthovisc] | 1 |
61 | Epinephrine | 1 |
62 | Ibandronate | 1 |
63 | Testosterone 100 MG | 1 |
64 | Histrelin acetate 0.00208 MG/HR Drug Implant [Vantas] | 1 |
65 | Cefotaxime 1000 MG Injection | 1 |
66 | Tetanus, diphtheria toxoids and acellular pertussis vaccine (Tdap), when administered to individuals 7 years or older, for intramuscular use | 1 |
67 | Immunoglobulin G Injectable Solution | 1 |
68 | Doxorubicin | 1 |
69 | Lincomycin Injectable Solution | 1 |
70 | Lorazepam 2 MG | 1 |
71 | Mepivacaine | 1 |
72 | Ipratropium | 1 |
73 | Magnesium Sulfate | 1 |
74 | Glucose | 1 |
75 | Ondansetron | 1 |
76 | alteplase Injectable Solution | 1 |
77 | Kanamycin Injectable Solution | 1 |
78 | BCG, Live, Tice Strain | 1 |
79 | Carboplatin Injectable Solution | 1 |
80 | bortezomib Injectable Solution | 1 |
Interestingly, the top 1 drug for Coronary Artery Bypass Grafting is Influenza virus vaccine, why? After do some research, I found National Library of Medicine (NIH) reported there is a significant benefit of vaccination against influenza in patients hospitalized due to an acute coronary event.
Therefore, when the drug is a influenza virus vaccine, we should realize that the patient’s illness is not necessarily flu.