
Fuzzy matching is a probabilistic record linkage technique used to link two datasets when no perfectly identical identifier exists in both. Fuzzy matching between the charge data and the EEO-1 reports is necessary because a unique numeric identifier is not consistently available. The EEO-1 data contain a unique unit number; however, this unit number is not required during the intake process for filing a charge and is thus missing for most charges. For example, only 17% of pregnancy discrimination charges have a valid EEO-1 unit number in the charge database. Fuzzy matching instead links the charge data and the EEO-1 reports using the firm name and address available in both datasets. In this report, we match pregnancy charges to EEO-1 reports in order to compare workplaces charged with pregnancy discrimination to those not charged with pregnancy discrimination. This appendix details the process for matching pregnancy discrimination charges to the EEO-1 employer reports.

Before proceeding with the matching process, it is important to note that not all charges are expected to match in the EEO-1 reports data. First, EEO-1 reports are only required for firms with 100 or more employees (or 50 employees for federal contractors with a contract of at least $50,000), so charges against smaller firms will likely not find matches in the EEO-1 reports data. Additionally, the EEO-1 reports only cover private employers, so charges against public employers will not match.

The charge data contain a rough categorical variable for the number of employees: 2% of pregnancy discrimination charges are filed against employers with 15 or fewer employees, 42% against employers with between 15 and 100 employees, and 42% against employers with 101 or more employees (13% of charges are missing information on employer size). The charge data also contain a rough categorical variable indicating the institution type: about 7% of charges are made against public employers. Based on these variables, we expect about half of the charges to match in the EEO-1 data. This match rate aligns with previous research matching the charge and EEO-1 data.

We proceed with the fuzzy matching process in three steps: (1) direct matching for charges with a valid EEO-1 unit number; (2) standardizing names and addresses; and (3) fuzzy matching. We then assess how the matched sample compares to the charges that we expected to match (based on size and institution type) but did not match.

 

Data Sources

The EEO-1 establishment-level reports are matched to the charge data to obtain establishment-level characteristics of workplaces charged with pregnancy discrimination.

 

EEOC Charge Data

The charge data comprise workplace discrimination charges from the Equal Employment Opportunity Commission (EEOC). The charges can be filed directly with the EEOC or with one of the state or local Fair Employment Practices Agencies (FEPAs). The data include all workplace charges filed between fiscal years 2012 and 2016 with the EEOC and FEPAs that have agreements with the EEOC to share the processing of charges.

The charge data are derived from the EEOC’s case processing software and originally were in four data files: “allegations”, “charging party”, “respondents”, and “charges”. All data files contain a consistent unique charge identification number which allowed the four files to be merged into a single analysis file.

Data on each charge include the employer information (address, industry, and establishment size); the charging party's basic demographics (age, race, national origin, and sex); the basis for the charge—the protected class, such as sex, sexual orientation, gender identity, race, or national origin; the issue charged—the action or policy alleged to be discriminatory (the type of discrimination that took place, such as promotion, harassment, or discharge); the processing of the charge; and the outcome of the charge.

 

EEO-1 Reports

The EEOC collects annual data for private sector employers on the EEO-1 survey. The EEOC has been collecting these data since 1966, two years after it was authorized to do so by the U.S. Congress in Title VII of the 1964 Civil Rights Act. Title VII instructed the EEOC to monitor progress toward an equal opportunity society. Employers with 100 or more employees and federal contractors with 50 or more employees and a contract of at least $50,000 are required to submit an EEO-1 report. The EEO-1 data include establishment-level records of the employer's name and address, industry, federal contractor status, and employment totals by race, sex, and occupation.

 

 

Direct Matching

First, we merge charges with a valid (non-missing) EEO-1 unit number by year. Table 9 shows the number of charges with a valid EEO-1 unit number, the number of matches made, the number of valid matches,  and the number of charges that remain unmatched for each year and overall.

 

Table 9: Direct Matching Results by Year
  Total Charges Valid Unit Number Matches Valid Matches Total Unmatched
2012 5,690 1,054 834 743 4,947
2013 5,435 988 777 692 4,743
2014 5,244 987 733 656 4,588
2015 5,277 886 671 610 4,667
2016 5,101 722 527 469 4,632
Total 26,747 4,637 3,542 3,170 23,577

 

Overall, only 4,637 of the total 26,747 pregnancy discrimination charges (17%) had a valid EEO-1 number. Of the 4,637 charges with a valid EEO-1 unit number, 3,542 matched with an establishment in the EEO-1 database for that year. To ensure the validity of these matches (i.e., that the EEO-1 number was correctly assigned in the charge data), we keep only those matches in which the three-digit zip codes match in the charge data and EEO-1 report data ("valid matches"). The three-digit zip code, rather than the full five-digit zip code, is used because the final digits of a zip code are error-prone: the person reporting the zip code may have made a typing error. Of the 3,542 EEO-1 unit matches, 3,170 (90%) are valid matches. After this direct matching process, 23,577 charges (88%) remained unmatched.
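The direct-match-and-validate step can be sketched in Python. This is an illustrative analogue, not the authors' Stata code; the field names (`unit_number`, `year`, `zip`) are hypothetical.

```python
# Illustrative sketch of round-1 direct matching with 3-digit ZIP validation.
def direct_match(charges, eeo1):
    """Link charges to EEO-1 reports on (unit number, year), keeping only
    matches whose first three ZIP digits agree ("valid matches")."""
    # Index EEO-1 establishments by (unit_number, year) for O(1) lookup.
    index = {(r["unit_number"], r["year"]): r for r in eeo1}
    valid, unmatched = [], []
    for charge in charges:
        rec = index.get((charge.get("unit_number"), charge["year"]))
        # Compare only the 3-digit ZIP prefix; the final digits are
        # more error-prone, so they are ignored in the validity check.
        if rec and charge["zip"][:3] == rec["zip"][:3]:
            valid.append((charge, rec))
        else:
            unmatched.append(charge)
    return valid, unmatched
```

Charges with a missing unit number, no same-year EEO-1 record, or a ZIP-prefix conflict all fall into the unmatched pile and proceed to the later rounds.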

Because we perform this process by year, it is possible that some matches are missed because the firm did not submit its EEO-1 form(s) in that year. As such, we next take all charges that had a non-missing EEO-1 unit number but did not match an EEO-1 establishment in their given year and perform a direct match against all other EEO-1 years, keeping the year closest to the charge (for example, if a 2012 charge matched in the 2013 and 2014 EEO-1 years, we keep the 2013 match). This process matches 369 additional charges. Table 10 shows the additional matches picked up for each charge year and overall. At the end of the two rounds of direct matching, we matched 3,539 charges and 23,208 charges (87%) remained unmatched.
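The round-2 nearest-year fallback can be sketched similarly. Again this is an illustrative analogue; `eeo1_by_year`, mapping each EEO-1 year to a `{unit_number: record}` index, is a hypothetical structure.

```python
# Illustrative sketch of the round-2 cross-year direct match.
def nearest_year_match(charge, eeo1_by_year):
    """For a charge that failed to match in its own year, search the other
    EEO-1 years and keep the match from the year closest to the charge."""
    candidates = []
    for year, index in eeo1_by_year.items():
        if year == charge["year"]:
            continue  # round 1 already tried the charge's own year
        if charge["unit_number"] in index:
            candidates.append((abs(year - charge["year"]), year))
    if not candidates:
        return None
    # Smallest year gap wins, e.g. a 2012 charge matching in both the 2013
    # and 2014 EEO-1 years keeps the 2013 match (earlier year breaks ties).
    gap, best_year = min(candidates)
    return eeo1_by_year[best_year][charge["unit_number"]]
```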

 

Table 10: Direct Matching Results by Year, Round 2
  Unmatched Round 1 Matched in Other Years Total Direct Match Total Unmatched
2012 220 51 794 4,896
2013 211 41 733 4,702
2014 254 78 734 4,510
2015 215 96 706 4,571
2016 195 103 572 4,529
Total 1,095 369 3,539 23,208

 

Name and Address Standardization

We next move on to the fuzzy matching process for all 23,208 charges that did not have a direct match. Before the fuzzy matching can be implemented, company names and addresses must be standardized. To do so, we implement the user-written Stata commands stnd_compname and stnd_address for standardizing company names and addresses, written by Wasi and Flaaen.1 This process helps remove inconsistencies in name formats. Specifically, these commands use rule-based pattern files to break down the company names and addresses into sub-parts. In addition to these standardization commands, we implement our own customized standardization routine based on the specifics of the data.
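The flavor of this rule-based cleaning can be illustrated with a toy Python analogue. The actual pattern files behind stnd_compname are far more extensive than the small suffix map assumed here.

```python
import re

# Toy analogue of rule-based company-name standardization; the suffix map
# is illustrative, not the real stnd_compname pattern file.
SUFFIXES = {"incorporated": "inc", "corporation": "corp",
            "company": "co", "limited": "ltd"}

def standardize_name(name):
    """Lowercase, strip punctuation, collapse whitespace, and map common
    legal-suffix variants to a single canonical form."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(SUFFIXES.get(t, t) for t in tokens)
```

After this step, variants such as "Acme, Incorporated" and "ACME Inc." collapse to the same string, which is what makes the subsequent similarity scoring meaningful.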

 

Fuzzy Matching

To perform the fuzzy matching, we use the Stata command reclink2, an extension of reclink.2 We match the charge data to the EEO-1 reports using the standardized parent company name and establishment address, blocking on the three-digit zip code. We use the parent company name because establishment names of multi-establishment firms often abbreviate the full company name and are likely to contain extra information (for example, a branch number) that could lower the quality of the match. For a single-establishment firm, the parent company name is the same as the establishment name.
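The blocking-and-scoring logic can be sketched in Python with a generic string-similarity measure standing in for reclink2's internal matcher. This is an illustrative analogue only: the field names and the equal-weight averaging of name and address are assumptions, not reclink2's actual formula.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Generic string similarity in [0, 1]; a stand-in for reclink2's
    pair-similarity scoring (1 = exact match)."""
    return SequenceMatcher(None, a, b).ratio()

def fuzzy_match(charge, eeo1_records, threshold=0.6):
    """Block on the 3-digit ZIP prefix, score candidates on standardized
    name and address, and keep the best pair at or above the threshold."""
    block = [r for r in eeo1_records if r["zip"][:3] == charge["zip"][:3]]
    best, best_score = None, 0.0
    for rec in block:
        # Equal-weight average of name and address similarity (assumed).
        score = (similarity(charge["name"], rec["name"])
                 + similarity(charge["address"], rec["address"])) / 2
        if score > best_score:
            best, best_score = rec, score
    if best_score >= threshold:
        return best, best_score
    return None, best_score
```

Blocking on the ZIP prefix keeps the comparison set small: only establishments in the same three-digit ZIP area are ever scored against a given charge.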

To assess the performance of the fuzzy matching routine, we first perform the fuzzy match on the set of direct matches from stage 1. This allows us to assess how the fuzzy match routine performs on data known to be matches from the direct matching process.

 

Table 11: Test of Fuzzy Match Using Direct Matched Sample
  Total Charges Fuzzy Match Exact Match Unmatched Avg rlsc
2012 794 603 318 191 0.95
2013 733 559 305 174 0.96
2014 734 558 301 176 0.95
2015 706 547 309 159 0.96
2016 572 433 233 139 0.94
Total 3,539 2,700 1,466 839 0.95

Table 11 shows the number of fuzzy, exact, and unmatched charges by year and overall. An exact match indicates that the standardized establishment name and address in the charge data exactly matched the establishment name and address in the EEO-1 data. The fuzzy matching procedure produces a pair-similarity score (the rlsc) that indicates the strength of the match (where 1 = exact match). Records with a pair similarity score of at least .6 are considered "fuzzy matches"; records with a score below .6 are considered unmatched. Column 6 of Table 11 also shows the average pair similarity score. Overall, about 76% of the direct-match charges are fuzzy matches (54% of these are exact matches) and 24% are unmatched in the fuzzy matching process. The average pair similarity score (rlsc) is .95 (median .99), which indicates that the "fuzzy matches" are likely to be true matches.

Next, we perform the fuzzy match routine on the unmatched sample—the records that were not matched in the direct match process. Table 12 shows the number of exact, fuzzy, and unmatched charges by year and overall for the charges that were unmatched during the direct matching process. As Table 12 shows, the number of exact matches is low: overall, only 8% of charges were exactly matched in the EEO-1 data. The number of fuzzy matches is large (78% of charges are fuzzy matches); however, the average pair similarity score is only .74. This implies that many of these fuzzy matches are weak matches (lower rlsc scores) and therefore less likely to be true matches.

 

Table 12: Unweighted Fuzzy Match Results
  Total Charges Fuzzy Match Exact Match Unmatched Avg rlsc
2012 4,896 3,666 367 1,230 0.74
2013 4,702 3,584 402 1,118 0.74
2014 4,510 3,492 350 1,018 0.74
2015 4,571 3,600 373 971 0.74
2016 4,529 3,721 410 808 0.75
Total 23,208 18,063 1,902 5,145 0.74

 

To account for the large number of weak matches, we perform the fuzzy matching procedure again using the weighting options available in the reclink2 command.3 The wmatch and wnomatch options allow different weights to be applied to the various matching variables. Weights must be greater than or equal to 1 and are typically between 1 and 20. Specifically, the wmatch option specifies the weight given to a match on each variable in the matching variable list. The weights reflect the relative likelihood that a match on the variable indicates a true observation match. Larger weights are applied to variables that better predict a true match. For example, a name variable will typically have a larger weight (as it is more likely to predict a true match), while a variable like city will have a smaller weight, since duplicates are expected.4

Similarly, the wnomatch option specifies the weight given to a mismatch on each variable in the matching variable list. These weights reflect the relative likelihood that a mismatch on a variable indicates that the observations do not match. A small weight indicates that mismatches are expected even if the observation is a true match. For example, a telephone number would commonly have a small wnomatch weight because of changes over time, multiple phone numbers per entity, or data entry errors.
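The weighting logic can be sketched as a simplified Fellegi-Sunter-style score. This is not reclink2's exact formula; the weights and field names below are illustrative assumptions.

```python
# Simplified Fellegi-Sunter-style score in the spirit of wmatch/wnomatch;
# NOT reclink2's exact formula. Weights and field names are illustrative.
FIELDS = [("name", 10, 8), ("address", 8, 6), ("zip", 1, 8)]

def weighted_score(charge, candidate, agree):
    """Add the match weight when a field agrees, subtract the nomatch
    weight when it disagrees, then rescale the sum to [0, 1]."""
    total, lo, hi = 0.0, 0.0, 0.0
    for field, wmatch, wnomatch in FIELDS:
        hi += wmatch          # best possible contribution
        lo -= wnomatch        # worst possible contribution
        total += wmatch if agree(charge[field], candidate[field]) else -wnomatch
    return (total - lo) / (hi - lo)
```

Under this kind of scheme, a field with a large wmatch pulls the score up sharply when it agrees, while a field with a large wnomatch punishes disagreement, which is why raising the weights shrinks the pool of accepted fuzzy matches.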

There are no exact rules for what weights should be applied. In results not shown, we experiment with different weighting techniques. Generally, larger weights reduce the number of “fuzzy matches” (and thus increase the number of unmatched observations) while smaller weights increase the number of “fuzzy matches”. Table 13 shows results from our preferred weighting scheme. We weight on firm name, firm address, and zip code.

 

Table 13: Weighted Fuzzy Match Results
  Total Charges Fuzzy Match Exact Match Unmatched Avg rlsc
2012 4,896 1,277 367 3,619 0.89
2013 4,702 1,275 402 3,427 0.90
2014 4,510 1,224 350 3,286 0.90
2015 4,571 1,240 373 3,331 0.90
2016 4,529 1,423 410 3,106 0.90
Total 23,208 6,439 1,902 16,769 0.90

Note: Fuzzy match results use weights on firm name, address, and zip code, defined as wmatch(10 8 1) and wnomatch(8 6 8).

As Table 13 shows, with this weighting scheme the number of fuzzy matches is reduced, but the pair similarity score increases, indicating that although fewer matches were made, the matches are more likely to be true. This method produced 6,437 matches.

We consider all fuzzy matches with a pair similarity score of .95 or larger to be true matches (a non-systematic review of these charges indicated that those with a pair similarity score of .95 or larger were true matches). Because we use the npairs(2) option in the fuzzy matching routine, which retains the top two potential matches, we manually review charges with multiple potential matches and determine the correct match. This created a sample of 3,563 charges. We then added back the 3,539 direct matches for a final sample of 7,102 charges.
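The candidate-resolution step—accept a charge's top candidate automatically when it alone clears the .95 cutoff, otherwise set it aside—can be sketched as follows. This is illustrative; in practice, charges with two plausible candidates were resolved by hand rather than by code.

```python
def accept_matches(pairs, threshold=0.95):
    """Resolve npairs(2)-style candidate lists: auto-accept a charge's top
    candidate when it alone clears the threshold; queue charges with two
    plausible candidates for manual review; drop the rest as unmatched."""
    by_charge = {}
    for charge_id, candidate, score in pairs:
        by_charge.setdefault(charge_id, []).append((score, candidate))
    auto, review = [], []
    for charge_id, cands in by_charge.items():
        cands.sort(key=lambda sc: sc[0], reverse=True)
        top_score, top = cands[0]
        if top_score < threshold:
            continue  # below the .95 cutoff: treated as unmatched
        if len(cands) > 1 and cands[1][0] >= threshold:
            review.append((charge_id, cands))  # ambiguous: resolve by hand
        else:
            auto.append((charge_id, top))
    return auto, review
```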

 

Matched Sample Vs. Expected Matches

The sample from this process is below the expected match rate (we expected to match around half of the charges); however, as long as it is a random sample of the potential matches, the smaller sample will not bias the results. Of the charges we expected to match (private employers with at least 100 employees), there is no reason to suspect that the charges matched in this sample differ from those that did not match (i.e., the matched charges are a random sample of the charges that could have matched). In the tables below, we compare the matched charges in the sample to the charges we expected to match (based on size and institution type) but did not. As Tables 14-17 show, the matched sample and the expected-but-unmatched sample are very similar. The one exception is that our matched sample is more likely to have non-missing industry information.

Table 14: Variable Means: Matched Sample vs. Expected Matches
  Matched Sample Expected Matches
Found Cause 0.024 0.028
Got Benefit 0.196 0.203
White 0.326 0.333
Observations 7,102 8,216

 

Table 15: Receiving District Office: Matched Sample vs. Expected Matches
  Matched Sample Expected Matches
Atlanta District Office 4.44 4.00
Birmingham District Office 3.27 2.58
Charlotte District Office 5.27 4.89
Chicago District Office 10.03 7.96
Dallas District Office 4.67 6.33
Houston District Office 3.25 3.16
Indianapolis District Office 7.81 8.12
Los Angeles District Office 4.18 3.58
Memphis District Office 3.13 2.54
Miami District Office 6.24 7.07
New York District Office 10.63 13.84
Office of The Chair 8.70 9.94
Philadelphia District Office 9.14 7.70
Phoenix District Office 5.49 6.57
San Francisco District Office 6.52 6.04
St. Louis District Office 5.51 3.82
Washington Field Office 1.73 1.84
Observations 7,102 8,216

 

Table 16: Charge Year: Matched Sample vs. Expected Matches
  Matched Sample Expected Matches
2012 20.50 21.34
2013 20.70 19.72
2014 19.59 20.17
2015 19.74 19.43
2016 19.47 19.35
Observations 7,102 8,216

 

Table 17: Industry: Matched Sample vs. Expected Matches
  Matched Sample Expected Matches
Accommodation and Food Services 4.48 4.03
Admin. Support & Waste Remediation Services 5.70 3.13
Agriculture, Forestry, Fishing and Hunting 0.20 0.13
Arts, Entertainment, and Recreation 0.84 0.37
Construction 0.70 0.32
Educational Services 0.51 1.01
Finance and Insurance 5.00 1.92
Health Care and Social Assistance 17.46 6.73
Information 2.39 0.63
Management of Companies and Enterprises 0.94 0.29
Manufacturing 7.29 2.04
Mining, Quarrying, and Oil and Gas Extraction 0.31 0.11
Other Services (except Public Administration) 1.15 0.96
Professional, Scientific, and Technical Services 3.79 1.07
Public Administration 0.25 0.51
Real Estate and Rental and Leasing 0.89 0.84
Retail Trade 10.67 4.33
Transportation & Warehousing 2.18 0.96
Utilities 0.24 0.15
Wholesale Trade 1.80 0.55
Missing 33.19 69.91
Total 100.00 100.00
Observations 7,102 8,216

 

 

Missing Industry Appendix

This appendix compares pregnancy discrimination charges that are missing industry information to those not missing industry information. We examine whether charges missing industry information are systematically related to other variables. Table 18 displays the means of key variables used in this analysis for pregnancy discrimination charges that are missing and not missing industry information. Table 19 shows the duration of pregnancy discrimination for charges that are missing and not missing industry information. As Tables 18 and 19 show, charges missing industry information are quite similar to those not missing industry information.

Tables 20-23 explore additional variables that are not central to the analysis in this paper. Though the two samples are generally similar, there is some variation across employer size, district office, and time. Smaller establishments are more likely to be missing industry information (Table 20). The Dallas, New York, and San Francisco EEOC district offices, as well as state FEPA agencies, are also more likely to be missing industry information. In contrast, the Charlotte, Chicago, Indianapolis, and Los Angeles EEOC district offices have substantially lower rates of missing industry information (Table 21). There is also a secular trend toward lower rates of missing industry information, which hopefully will continue into the future (Table 23).

 

Table 18: Variable Means: Charges Missing Industry vs. Charges Not Missing Industry
  Charges Missing Industry Charges Not Missing Industry
Found Cause 0.03 0.03
Got Benefit 0.20 0.19
White 0.32 0.39
Total Number of Issues 1.89 1.97
Total Number of Bases 1.78 1.90
Represented by Counsel 0.19 0.19
Average Monetary Benefit 15,128.69 21,734.22
Observations 16,314 10,433

 

Table 19: Duration of Pregnancy Discrimination: Charges Missing Industry vs. Charges Not Missing Industry
  Charges Missing Industry Charges Not Missing Industry
1 Day 53.26 52.78
2 Days - 2 Weeks 6.32 5.64
2 Weeks - 2 Months 12.97 12.70
3 - 6 Months 16.46 17.02
7 Months - 1 Year 7.45 7.70
More than a Year 3.54 4.16
Observations 28,183 18,497
Note: The number of observations in this table is larger than in other tables because this table examines all individual allegations of pregnancy discrimination, rather than charges that contain a pregnancy allegation.

 

Table 20: Number of Employees: Charges Missing Industry vs. Charges Not Missing Industry
  Charges Missing Industry Charges Not Missing Industry
Under 15 Employees 2.92 0.65
15 - 100 Employees 47.82 33.96
101 - 200 Employees 7.05 9.17
201 - 500 Employees 7.07 12.38
501+ Employees 18.73 33.97
Unknown Number of Employees 15.72 4.78
Missing 0.68 5.08
Observations 16,314 10,433

 

Table 21: Receiving District Office: Charges Missing Industry vs. Charges Not Missing Industry
  Charges Missing Industry Charges Not Missing Industry
Atlanta District Office 4.19 3.19
Birmingham District Office 2.41 3.72
Charlotte District Office 3.92 6.23
Chicago District Office 4.73 13.85
Dallas District Office 6.34 3.04
Houston District Office 3.52 2.26
Indianapolis District Office 4.96 10.18
Los Angeles District Office 2.92 5.41
Memphis District Office 3.15 2.86
Miami District Office 10.46 9.07
New York District Office 15.20 4.88
Office of The Chair 8.71 8.87
Philadelphia District Office 6.94 9.20
Phoenix District Office 5.87 4.71
San Francisco District Office 10.70 5.45
St. Louis District Office 4.39 5.99
Washington Field Office 1.61 1.10
Observations 16,314 10,433

 

Table 22: Charge Filed With EEOC or FEPA: Charges Missing Industry vs. Charges Not Missing Industry



 

  Charges Missing Industry Charges Not Missing Industry
EEOC 63.82 72.28
FEPA 36.18 27.72
Observations 16,314 10,433

 

Table 23: Year Charge Filed: Charges Missing Industry vs. Charges Not Missing Industry
  Charges Missing Industry Charges Not Missing Industry
2012 19.93 23.37
2013 19.84 21.08
2014 19.37 19.98
2015 20.20 18.99
2016 20.66 16.59
Observations 16,314 10,433

 


Notes
1. Nada Wasi and Aaron Flaaen. "Record linkage using Stata: Preprocessing, linking, and reviewing utilities." The Stata Journal 15.3 (2015): 672-697.
2. Ibid.
3. Ibid.
4. Ibid.