Fuzzy matching is a probabilistic record linkage technique used to link two datasets that share no perfectly identical identifier. Fuzzy matching between the charge data and the EEO-1 reports is necessary because a unique numeric identifier is not consistently available. The EEO-1 data contain a unique unit number; however, this unit number is not required during the intake process for filing a charge and is thus missing for most charges. For example, only 17% of pregnancy discrimination charges have a valid EEO-1 unit number in the charge database. Fuzzy matching allows the charge data and the EEO-1 reports to be linked using the firm name and address available in both datasets. In this report, we match pregnancy charges to EEO-1 reports in order to compare workplaces charged with pregnancy discrimination to those not charged with pregnancy discrimination. This appendix details the process for matching pregnancy discrimination charges to the EEO-1 employer reports.
Before proceeding with the matching process, it is important to note that not all charges are expected to match in the EEO-1 reports data. First, EEO-1 reports are only required for firms with 100 or more employees (or 50 employees for federal contractors with a contract of at least $50,000), so charges against smaller firms will likely not find matches in the EEO-1 reports data. Additionally, the EEO-1 reports only cover private employers, so charges against public employers will not match.
The charge data contain a rough categorical variable for the number of employees: 2% of pregnancy discrimination charges are filed against employers with 15 or fewer employees, 42% are filed against employers with between 15 and 100 employees, and 42% are filed against employers with 101 or more employees (13% of charges are missing information on employer size). The charge data also contain a rough categorical variable indicating the institution type: about 7% of charges are made against public employers. Based on these variables, we expect about half of the charges to match in the EEO-1 data. This match rate aligns with previous research matching the charge and EEO-1 data.
We proceed with the fuzzy matching process in 3 steps: (1) direct matching for charges with a valid EEO-1 unit number, (2) standardization of names and addresses, and (3) fuzzy matching. We then assess how the matched sample compares to the charges that we expected to match (based on size and institution type) but that did not match.
Data Sources
The EEO-1 establishment-level reports are matched to the charge data to obtain establishment-level characteristics of workplaces charged with pregnancy discrimination.
EEOC Charge Data
The charge data comprise workplace discrimination charges from the Equal Employment Opportunity Commission (EEOC). The charges can be filed directly with the EEOC or with one of the state or local Fair Employment Practices Agencies (FEPAs). The data include all workplace charges filed between fiscal years 2012 and 2016 with the EEOC and FEPAs that have agreements with the EEOC to share the processing of charges.
The charge data are derived from the EEOC’s case processing software and originally were in four data files: “allegations”, “charging party”, “respondents”, and “charges”. All data files contain a consistent unique charge identification number which allowed the four files to be merged into a single analysis file.
Data on each charge include the employer information (address, industry, and establishment size); the charging party's basic demographics (age, race, national origin, and sex); the basis for the charge (the protected class, such as sex, sexual orientation, gender identity, race, or national origin); the issue charged (the action or policy alleged to be discriminatory, such as promotion, harassment, or discharge); the processing of the charge; and the outcome of the charge.
EEO-1 Reports
The EEOC collects annual data for private sector employers on the EEO-1 survey. The EEOC has been collecting these data since 1966, two years after it was authorized to do so by the U.S. Congress in Title VII of the 1964 Civil Rights Act. Title VII instructed the EEOC to monitor progress toward an equal opportunity society. Employers with 100 or more employees and federal contractors with 50 or more employees and a contract of at least $50,000 are required to submit an EEO-1 report. The EEO-1 data include establishment-level records of the employer's name and address, industry, federal contractor status, and employment totals by race, sex, and occupation.
Direct Matching
First, we merge charges with a valid (non-missing) EEO-1 unit number by year. Table 9 shows the number of charges with a valid EEO-1 unit number, the number of matches made, the number of valid matches, and the number of charges that remain unmatched for each year and overall.
Table 9: Direct Matching Results by Year
| Year | Total Charges | Valid Unit Number | Matches | Valid Matches | Total Unmatched |
|---|---|---|---|---|---|
| 2012 | 5,690 | 1,054 | 834 | 743 | 4,947 |
| 2013 | 5,435 | 988 | 777 | 692 | 4,743 |
| 2014 | 5,244 | 987 | 733 | 656 | 4,588 |
| 2015 | 5,277 | 886 | 671 | 610 | 4,667 |
| 2016 | 5,101 | 722 | 527 | 469 | 4,632 |
| Total | 26,747 | 4,637 | 3,542 | 3,170 | 23,577 |
Overall, only 4,637 of the total 26,747 pregnancy discrimination charges (17%) had a valid EEO-1 unit number. Of the 4,637 charges with a valid EEO-1 unit number, 3,542 matched with an establishment in the EEO-1 database for that year. To ensure the validity of these matches (i.e., that the EEO-1 unit number was correctly assigned in the charge data), we keep only those matches in which the three-digit zip code in the charge data matches the three-digit zip code in the EEO-1 report data (“valid matches”). The three-digit zip code, rather than the full five-digit zip code, is used because the final digits of a zip code are often error prone and the person reporting the zip code may have made a typing error. Of the 3,542 EEO-1 unit matches, 3,170 (90%) are valid matches. After this direct matching process, 23,577 charges (88%) remained unmatched.
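The direct match and its zip code check can be illustrated with a minimal Python sketch. This is not the routine used for the report; the field names (`charge_id`, `unit_number`, `year`, `zip`) and the status labels are hypothetical, and the sketch assumes at most one EEO-1 record per unit number and year.

```python
def direct_match(charges, eeo1_reports):
    """Link charges to EEO-1 records on (unit number, year), then check the 3-digit zip."""
    # Hypothetical layout: one EEO-1 record per unit number per year.
    index = {(r["unit_number"], r["year"]): r for r in eeo1_reports}
    results = []
    for c in charges:
        unit = c.get("unit_number")
        if not unit:
            # No valid unit number on the charge: cannot be directly matched.
            results.append((c["charge_id"], "no_unit_number"))
            continue
        rec = index.get((unit, c["year"]))
        if rec is None:
            results.append((c["charge_id"], "unmatched"))
        elif c["zip"][:3] == rec["zip"][:3]:
            # Only the first three digits are compared, as described in the text.
            results.append((c["charge_id"], "valid_match"))
        else:
            results.append((c["charge_id"], "invalid_match"))
    return results


# Made-up records for illustration only.
charges = [
    {"charge_id": "A", "unit_number": "U1", "year": 2012, "zip": "01003"},
    {"charge_id": "B", "unit_number": "U2", "year": 2012, "zip": "90210"},
    {"charge_id": "C", "unit_number": None, "year": 2013, "zip": "10001"},
    {"charge_id": "D", "unit_number": "U9", "year": 2014, "zip": "60601"},
]
eeo1_reports = [
    {"unit_number": "U1", "year": 2012, "zip": "01002"},
    {"unit_number": "U2", "year": 2012, "zip": "10001"},
]
statuses = dict(direct_match(charges, eeo1_reports))
```

In this toy example, charge A is kept even though its five-digit zip code differs from the EEO-1 record, because the first three digits agree; charge B matches on unit number but is dropped by the zip code check.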
Because we perform this process by year, it is possible that some matches are missed because the firm did not submit its EEO-1 form(s) in that year. As such, we next take all charges that had a non-missing EEO-1 unit number but did not match an EEO-1 establishment in the charge year and perform a direct match against all other EEO-1 years, keeping the year closest to the charge (for example, if a charge in 2012 matched in the 2013 and 2014 EEO-1 years, we keep the 2013 match). This process matches 369 additional charges. Table 10 shows the additional matches picked up for each charge year and overall. At the end of the two rounds of direct matching, we had matched 3,539 charges, and 23,208 charges (87%) remained unmatched.
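The "closest year" rule in the second round can be sketched as follows. This is an illustrative Python fragment, not the authors' code; the function name and the `eeo1_years_by_unit` lookup are hypothetical, and the tie-breaking rule is an assumption because the text does not state one.

```python
def closest_year_match(charge_year, unit_number, eeo1_years_by_unit):
    """Return the EEO-1 reporting year nearest to the charge year, or None."""
    # Only years other than the charge year are considered in this second round.
    candidates = [
        y for y in eeo1_years_by_unit.get(unit_number, ()) if y != charge_year
    ]
    if not candidates:
        return None
    # Assumption: ties (one year before vs. one year after) go to the earlier year.
    return min(candidates, key=lambda y: (abs(y - charge_year), y))


# Made-up lookup: unit number -> years in which an EEO-1 report exists.
eeo1_years_by_unit = {"U7": [2013, 2014], "U8": [2016]}

example = closest_year_match(2012, "U7", eeo1_years_by_unit)
```

For the example in the text, a 2012 charge whose unit number appears in the 2013 and 2014 EEO-1 years is assigned the 2013 record.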
Table 10: Direct Matching Results by Year Round 2
| Year | Unmatched Round 1 | Matched in Other Years | Total Direct Match | Total Unmatched |
|---|---|---|---|---|
| 2012 | 220 | 51 | 794 | 4,896 |
| 2013 | 211 | 41 | 733 | 4,702 |
| 2014 | 254 | 78 | 734 | 4,510 |
| 2015 | 215 | 96 | 706 | 4,571 |
| 2016 | 195 | 103 | 572 | 4,529 |
| Total | 1,095 | 369 | 3,539 | 23,208 |
Name and Address Standardization
We next move on to the fuzzy matching process for all 23,208 charges that did not have a direct match. Before the fuzzy matching can be implemented, company names and addresses must be standardized. To standardize names and addresses, we use the user-written Stata commands stnd_compname and stnd_address, written by Wasi and Flaaen.¹ This process helps remove inconsistencies in name formats. Specifically, these commands use rule-based pattern files to break down the company names and addresses into sub-parts. In addition to these standardization commands, we implement our own customized standardizing routine based on the specifics of the data.
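To give a sense of what rule-based standardization does, the following Python sketch normalizes case, punctuation, and a few common tokens. It is only an illustration of the idea: it does not reproduce the pattern files used by the Stata commands, and the `ABBREVIATIONS` table and `standardize` function are hypothetical.

```python
import re

# Hypothetical abbreviation table; real pattern files are far more extensive.
ABBREVIATIONS = {
    "INCORPORATED": "INC",
    "CORPORATION": "CORP",
    "COMPANY": "CO",
    "LIMITED": "LTD",
    "STREET": "ST",
    "AVENUE": "AVE",
    "BOULEVARD": "BLVD",
    "SUITE": "STE",
}


def standardize(text):
    """Uppercase, strip punctuation, collapse whitespace, abbreviate common tokens."""
    text = text.upper().replace("&", " AND ")
    # Replace anything that is not a letter, digit, or space with a space.
    text = re.sub(r"[^A-Z0-9 ]", " ", text)
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


name_a = standardize("Acme Widgets, Incorporated")
name_b = standardize("acme widgets inc.")
address = standardize("123 Main Street, Suite 4")
```

After standardization, the two spellings of the made-up firm name become identical, which is what allows the later matching step to treat them as the same string.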
Fuzzy Matching
To perform the fuzzy matching, we use the Stata command reclink2, an extension of reclink.² We match the charge data to the EEO-1 reports using the standardized parent company name and establishment address, blocking on the three-digit zip code. We use the parent company name because establishment names of multi-establishment firms often abbreviate the full company name and are likely to contain extra information (for example, a branch number) that could lower the quality of the match. In the case of a single-establishment firm, the parent company name is the same as the establishment name.
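Blocking and similarity scoring can be sketched in Python as below. This is a stand-in, not reclink2: it uses `difflib.SequenceMatcher` from the standard library as the string similarity measure and a plain average of the name and address similarities, both of which are assumptions made for illustration. The record fields and function names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for the score reclink2 computes."""
    return SequenceMatcher(None, a, b).ratio()


def block_by_zip3(records):
    """Group EEO-1 records by three-digit zip so only same-block pairs are compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["zip"][:3]].append(rec)
    return blocks


def fuzzy_candidates(charge, blocks, top_n=2):
    """Score every EEO-1 record in the charge's zip block and keep the best top_n."""
    scored = []
    for rec in blocks.get(charge["zip"][:3], []):
        score = (
            similarity(charge["name"], rec["name"])
            + similarity(charge["address"], rec["address"])
        ) / 2
        scored.append((round(score, 3), rec["unit_number"]))
    return sorted(scored, reverse=True)[:top_n]


# Made-up, already standardized records.
eeo1 = [
    {"unit_number": "U1", "name": "ACME WIDGETS INC", "address": "123 MAIN ST", "zip": "01002"},
    {"unit_number": "U2", "name": "ACME WIDGET CO", "address": "125 MAIN ST", "zip": "01003"},
    {"unit_number": "U3", "name": "ACME WIDGETS INC", "address": "123 MAIN ST", "zip": "90210"},
]
charge = {"name": "ACME WIDGETS INC", "address": "123 MAIN ST", "zip": "01007"}
candidates = fuzzy_candidates(charge, block_by_zip3(eeo1))
```

Record U3 is identical in name and address but sits in a different three-digit zip block, so it is never compared with the charge. Blocking trades a small risk of missed matches for far fewer comparisons.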
To assess the performance of the fuzzy matching routine, we first perform the fuzzy match on the set of direct matches from stage 1. This allows us to assess how the fuzzy match routine performs on data known to be matches from the direct matching process.
Table 11: Test of Fuzzy Match Using Direct Matched Sample
| Year | Total Charges | Fuzzy Match | Exact Match | Unmatched | Avg rlsc |
|---|---|---|---|---|---|
| 2012 | 794 | 603 | 318 | 191 | 0.95 |
| 2013 | 733 | 559 | 305 | 174 | 0.96 |
| 2014 | 734 | 558 | 301 | 176 | 0.95 |
| 2015 | 706 | 547 | 309 | 159 | 0.96 |
| 2016 | 572 | 433 | 233 | 139 | 0.94 |
| Total | 3,539 | 2,700 | 1,466 | 839 | 0.95 |
Table 11 shows the number of fuzzy, exact, and unmatched charges by year and overall. An exact match indicates that the standardized establishment name and address in the charge data exactly matched the establishment name and address in the EEO-1 data. The fuzzy matching procedure produces a pair similarity score (the rlsc), which indicates the strength of the match (where 1 = exact match). Records with a pair similarity score of at least .6 are considered “fuzzy matches”; records with a pair similarity score below .6 are considered unmatched. Column 6 of Table 11 shows the average pair similarity score. Overall, 76% of the direct match charges are fuzzy matches (about 54% of these fuzzy matches, or 41% of all direct match charges, are exact matches), and 24% are unmatched in the fuzzy matching process. The average pair similarity score (rlsc) is .95 (median is .99), which indicates that the “fuzzy matches” are likely to be true matches.
Next, we perform the fuzzy match routine on the unmatched sample, that is, the records that were not directly matched in the direct match process. Table 12 shows the number of exact, fuzzy, and unmatched charges, by year and overall, for the charges that were unmatched during the direct matching process. As Table 12 shows, the number of exact matches is low: overall, only 8% of charges were exactly matched in the EEO-1 data. The number of fuzzy matches is large (78% of charges are fuzzy matches); however, the average pair similarity score is only .74. This implies that many of these fuzzy matches are weak matches (lower rlsc) and therefore less likely to be true matches.
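The classification rule and the overall shares for the direct match test sample can be checked with a short Python sketch. The `classify` function is a hypothetical helper written for illustration; the counts are the totals row of Table 11, where exact matches are counted within the fuzzy matches (fuzzy plus unmatched equals total charges).

```python
FUZZY_THRESHOLD = 0.6  # minimum pair similarity score for a fuzzy match (from the text)


def classify(rlsc):
    """Label a candidate pair by its pair similarity score."""
    if rlsc is None or rlsc < FUZZY_THRESHOLD:
        return "unmatched"
    # A score of 1 denotes an exact match; in Table 11 these are also fuzzy matches.
    return "exact" if rlsc == 1 else "fuzzy"


# Totals row of Table 11.
total, fuzzy, exact, unmatched = 3539, 2700, 1466, 839

shares = {
    "fuzzy_of_total": round(fuzzy / total, 2),
    "unmatched_of_total": round(unmatched / total, 2),
    "exact_of_fuzzy": round(exact / fuzzy, 2),
    "exact_of_total": round(exact / total, 2),
}
```

The computed shares show that the 54% figure for exact matches is relative to the 2,700 fuzzy matches; relative to all 3,539 direct match charges, exact matches are about 41%.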
Table 12: Unweighted Fuzzy Match Results
| Year | Total Charges | Fuzzy Match | Exact Match | Unmatched | Avg rlsc |
|---|---|---|---|---|---|
| 2012 | 4,896 | 3,666 | 367 | 1,230 | 0.74 |
| 2013 | 4,702 | 3,584 | 402 | 1,118 | 0.74 |
| 2014 | 4,510 | 3,492 | 350 | 1,018 | 0.74 |
| 2015 | 4,571 | 3,600 | 373 | 971 | 0.74 |
| 2016 | 4,529 | 3,721 | 410 | 808 | 0.75 |
| Total | 23,208 | 18,063 | 1,902 | 5,145 | 0.74 |
To account for the large number of weak matches, we perform the fuzzy matching procedure again using the weighting options available in the reclink2 command.³ The wmatch and wnomatch options allow different weights to be applied to the various matching variables. Weights must be greater than or equal to 1 and are typically between 1 and 20. Specifically, the wmatch option specifies the weights given to matches for each variable in the matching variable list. The weights reflect the relative likelihood that a variable match indicates a true observation match. Larger weights are applied to variables that are more likely to predict a true match. For example, a name variable will typically have a larger weight (as it is more likely to predict a true match), while a variable like city will have a smaller weight since duplicates are expected.⁴
Similarly, the wnomatch option specifies the weights given to mismatches for each variable in the matching variable list. These weights reflect the relative likelihood that a mismatch on a variable indicates that the observations do not match. A small weight indicates that mismatches are expected even if the observation is a true match. For example, telephone number would commonly have a small wnomatch weight because of changes over time, multiple phone numbers per entity, or data entry errors.
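The intuition behind match and non-match weights can be illustrated with a small Python sketch. The scoring formula below is a hypothetical one chosen for clarity and is not the formula reclink2 uses internally: each variable earns its match weight in proportion to its similarity and is penalized by its non-match weight in proportion to its dissimilarity, and the total is rescaled to the 0 to 1 range. The weights are those reported with Table 13 for firm name, address, and zip code.

```python
def weighted_score(sims, wmatch, wnomatch):
    """Combine per-variable similarities (each in [0, 1]) into one score in [0, 1].

    Hypothetical formula for illustration; not reclink2's internal computation.
    """
    raw = sum(
        wm * s - wn * (1 - s) for s, wm, wn in zip(sims, wmatch, wnomatch)
    )
    lowest, highest = -sum(wnomatch), sum(wmatch)
    return (raw - lowest) / (highest - lowest)


# Weights for (firm name, address, zip code), as reported with Table 13.
WMATCH = (10, 8, 1)
WNOMATCH = (8, 6, 8)

perfect = weighted_score((1, 1, 1), WMATCH, WNOMATCH)
name_off = weighted_score((0.5, 1, 1), WMATCH, WNOMATCH)
address_off = weighted_score((1, 0.5, 1), WMATCH, WNOMATCH)
```

Under this illustrative formula, a half-matching name lowers the score more than a half-matching address, because the name carries the larger weights.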
There are no exact rules for what weights should be applied. In results not shown, we experiment with different weighting techniques. Generally, larger weights reduce the number of “fuzzy matches” (and thus increase the number of unmatched observations) while smaller weights increase the number of “fuzzy matches”. Table 13 shows results from our preferred weighting scheme. We weight on firm name, firm address, and zip code.
Table 13: Weighted Fuzzy Match Results
| Year | Total Charges | Fuzzy Match | Exact Match | Unmatched | Avg rlsc |
|---|---|---|---|---|---|
| 2012 | 4,896 | 1,277 | 367 | 3,619 | 0.89 |
| 2013 | 4,702 | 1,275 | 402 | 3,427 | 0.90 |
| 2014 | 4,510 | 1,224 | 350 | 3,286 | 0.90 |
| 2015 | 4,571 | 1,240 | 373 | 3,331 | 0.90 |
| 2016 | 4,529 | 1,423 | 410 | 3,106 | 0.90 |
| Total | 23,208 | 6,439 | 1,902 | 16,769 | 0.90 |
Note: Fuzzy match results use weights on firm name, address, and zip code, defined as wmatch(10 8 1) wnomatch(8 6 8).
As Table 13 shows, with this weighting scheme the number of fuzzy matches is reduced but the average pair similarity score increases, indicating that although fewer matches were made, the matches are more likely to be true. This method produced 6,439 matches.
We consider all fuzzy matches with a pair similarity score of .95 or larger to be true matches (a non-systematic review of these charges indicated that those with a pair similarity score of .95 or larger were true matches). Because we use the npairs(2) option in the fuzzy matching routine, which allows the top 2 potential matches to be retained, we manually review the charges with multiple potential matches and determine the correct match. This created a sample of 3,563 charges. We then added back the 3,539 direct matches for a final sample of 7,102 charges.
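The final selection step can be sketched in Python as follows. The `select_match` helper and its status labels are hypothetical; in particular, sending a charge to manual review only when both retained candidates score .95 or larger is an assumption, since the text does not spell out exactly which multiple-candidate cases were reviewed by hand.

```python
TRUE_MATCH_THRESHOLD = 0.95  # pair similarity score treated as a true match (from the text)


def select_match(candidates):
    """Pick a match from up to two (rlsc, unit_number) candidates for one charge."""
    strong = sorted(
        (c for c in candidates if c[0] >= TRUE_MATCH_THRESHOLD), reverse=True
    )
    if not strong:
        return ("unmatched", None)
    if len(strong) == 1:
        return ("accepted", strong[0][1])
    # Assumption: more than one strong candidate requires a manual decision.
    return ("manual_review", [unit for _, unit in strong])


# Final sample size: fuzzy matches kept plus direct matches (counts from the text).
fuzzy_kept, direct_matches = 3563, 3539
final_sample = fuzzy_kept + direct_matches
```

For example, `select_match([(0.97, "U1"), (0.80, "U2")])` accepts U1 outright, while two candidates scoring .99 and .96 would be flagged for manual review.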
Matched Sample Vs. Expected Matches
The sample from this process is below the expected match rate (we expected to match around half of the charges); however, as long as it is a random sample of the potential matches, the smaller sample will not bias the results. Of the charges we expected to match (private employers with at least 100 employees), there is no reason to suspect that the charges that are matched in this sample are different from those that did not match (i.e., the charges that matched are a random sample of the charges that could have matched). In the tables below, we compare the matched charges in the sample to the charges we expected to match (based on size and institution type) but that did not match in the sample. As Tables 14-17 show, the matched sample and the expected (but unmatched) sample are very similar. The one exception is that charges in our matched sample are more likely to have non-missing industry information.
Table 14: Variable Means Matched Sample vs. Expected Matches
| | Matched Sample | Expected Matches |
|---|---|---|
| Found Cause | 0.024 | 0.028 |
| Got Benefit | 0.196 | 0.203 |
| White | 0.326 | 0.333 |
| Observations | 7,102 | 8,216 |
Table 15: Receiving District Office Matched Sample vs. Expected Matches
| | Matched Sample | Expected Matches |
|---|---|---|
| Atlanta District Office | 4.44 | 4.00 |
| Birmingham District Office | 3.27 | 2.58 |
| Charlotte District Office | 5.27 | 4.89 |
| Chicago District Office | 10.03 | 7.96 |
| Dallas District Office | 4.67 | 6.33 |
| Houston District Office | 3.25 | 3.16 |
| Indianapolis District Office | 7.81 | 8.12 |
| Los Angeles District Office | 4.18 | 3.58 |
| Memphis District Office | 3.13 | 2.54 |
| Miami District Office | 6.24 | 7.07 |
| New York District Office | 10.63 | 13.84 |
| Office of The Chair | 8.70 | 9.94 |
| Philadelphia District Office | 9.14 | 7.70 |
| Phoenix District Office | 5.49 | 6.57 |
| San Francisco District Office | 6.52 | 6.04 |
| St. Louis District Office | 5.51 | 3.82 |
| Washington Field Office | 1.73 | 1.84 |
| Observations | 7,102 | 8,216 |
Table 16: Charge Year Matched Sample vs. Expected Matches
| | Matched Sample | Expected Matches |
|---|---|---|
| 2012 | 20.50 | 21.34 |
| 2013 | 20.70 | 19.72 |
| 2014 | 19.59 | 20.17 |
| 2015 | 19.74 | 19.43 |
| 2016 | 19.47 | 19.35 |
| Observations | 7,102 | 8,216 |
Table 17: Industry Matched Sample vs. Expected Matches
| | Matched Sample | Expected Matches |
|---|---|---|
| Accommodation and Food Services | 4.48 | 4.03 |
| Admin. Support & Waste Remediation Services | 5.70 | 3.13 |
| Agriculture, Forestry, Fishing and Hunting | 0.20 | 0.13 |
| Arts, Entertainment, and Recreation | 0.84 | 0.37 |
| Construction | 0.70 | 0.32 |
| Educational Services | 0.51 | 1.01 |
| Finance and Insurance | 5.00 | 1.92 |
| Health Care and Social Assistance | 17.46 | 6.73 |
| Information | 2.39 | 0.63 |
| Management of Companies and Enterprises | 0.94 | 0.29 |
| Manufacturing | 7.29 | 2.04 |
| Mining, Quarrying, and Oil and Gas Extraction | 0.31 | 0.11 |
| Other Services (except Public Administration) | 1.15 | 0.96 |
| Professional, Scientific, and Technical Services | 3.79 | 1.07 |
| Public Administration | 0.25 | 0.51 |
| Real Estate and Rental and Leasing | 0.89 | 0.84 |
| Retail Trade | 10.67 | 4.33 |
| Transportation & Warehousing | 2.18 | 0.96 |
| Utilities | 0.24 | 0.15 |
| Wholesale Trade | 1.80 | 0.55 |
| Missing | 33.19 | 69.91 |
| Total | 100.00 | 100.00 |
| Observations | 7,102 | 8,216 |
Missing Industry Appendix
This appendix compares pregnancy discrimination charges that are missing industry information to those not missing industry information. We examine whether charges missing industry information are systematically related to other variables. Table 18 displays the means of key variables used in this analysis for pregnancy discrimination charges that are missing and not missing industry information. Table 19 shows the duration of pregnancy discrimination for charges that are missing and not missing industry information. As Tables 18 and 19 show, charges missing industry information are quite similar to those not missing industry information.
Tables 20-23 explore additional variables that are not central to the analysis in this paper. Though the two samples are generally similar, there is some variation across employer size, district office, and time. Smaller establishments are more likely to be missing industry information (Table 20). Charges from the Dallas, New York, and San Francisco EEOC district offices, as well as from state FEPA agencies, are also more likely to be missing industry information. In contrast, the Charlotte, Chicago, Indianapolis, and Los Angeles EEOC district offices have substantially lower rates of missing industry data (Table 21). There is also a secular trend toward lower rates of missing industry data, which hopefully will continue into the future (Table 23).
Table 18: Variable Means: Charges Missing Industry vs. Charges Not Missing Industry
| | Charges Missing Industry | Charges Not Missing Industry |
|---|---|---|
| Found Cause | 0.03 | 0.03 |
| Got Benefit | 0.20 | 0.19 |
| White | 0.32 | 0.39 |
| Total Number of Issues | 1.89 | 1.97 |
| Total Number of Bases | 1.78 | 1.90 |
| Represented by Counsel | 0.19 | 0.19 |
| Average Monetary Benefit | 15,128.69 | 21,734.22 |
| Observations | 16,314 | 10,433 |
Table 19: Duration of Pregnancy Discrimination: Charges Missing Industry vs. Charges Not Missing Industry
| | Charges Missing Industry | Charges Not Missing Industry |
|---|---|---|
| 1 Day | 53.26 | 52.78 |
| 2 Days - 2 Weeks | 6.32 | 5.64 |
| 2 Weeks - 2 Months | 12.97 | 12.70 |
| 3 - 6 Months | 16.46 | 17.02 |
| 7 Months - 1 Year | 7.45 | 7.70 |
| More than a year | 3.54 | 4.16 |
| Observations | 28,183 | 18,497 |
Note: The number of observations in this table is larger than in the other tables because this table examines all individual allegations of pregnancy discrimination, rather than charges that contain a pregnancy allegation.
Table 20: Number of Employees: Charges Missing Industry vs. Charges Not Missing Industry
| | Charges Missing Industry | Charges Not Missing Industry |
|---|---|---|
| Under 15 Employees | 2.92 | 0.65 |
| 15 - 100 Employees | 47.82 | 33.96 |
| 101 - 200 Employees | 7.05 | 9.17 |
| 201 - 500 Employees | 7.07 | 12.38 |
| 501+ Employees | 18.73 | 33.97 |
| Unknown Number Of Employees | 15.72 | 4.78 |
| Missing | 0.68 | 5.08 |
| Observations | 16,314 | 10,433 |
Table 21: Receiving District Office: Charges Missing Industry vs. Charges Not Missing Industry
| | Charges Missing Industry | Charges Not Missing Industry |
|---|---|---|
| Atlanta District Office | 4.19 | 3.19 |
| Birmingham District Office | 2.41 | 3.72 |
| Charlotte District Office | 3.92 | 6.23 |
| Chicago District Office | 4.73 | 13.85 |
| Dallas District Office | 6.34 | 3.04 |
| Houston District Office | 3.52 | 2.26 |
| Indianapolis District Office | 4.96 | 10.18 |
| Los Angeles District Office | 2.92 | 5.41 |
| Memphis District Office | 3.15 | 2.86 |
| Miami District Office | 10.46 | 9.07 |
| New York District Office | 15.20 | 4.88 |
| Office of The Chair | 8.71 | 8.87 |
| Philadelphia District Office | 6.94 | 9.20 |
| Phoenix District Office | 5.87 | 4.71 |
| San Francisco District Office | 10.70 | 5.45 |
| St. Louis District Office | 4.39 | 5.99 |
| Washington Field Office | 1.61 | 1.10 |
| Observations | 16,314 | 10,433 |
Table 22: Charge Filed With EEOC or FEPA: Charges Missing Industry vs. Charges Not Missing Industry
| | Charges Missing Industry | Charges Not Missing Industry |
|---|---|---|
| EEOC | 63.82 | 72.28 |
| FEPA | 36.18 | 27.72 |
| Observations | 16,314 | 10,433 |
Table 23: Year Charge Filed: Charges Missing Industry vs. Charges Not Missing Industry
| | Charges Missing Industry | Charges Not Missing Industry |
|---|---|---|
| 2012 | 19.93 | 23.37 |
| 2013 | 19.84 | 21.08 |
| 2014 | 19.37 | 19.98 |
| 2015 | 20.20 | 18.99 |
| 2016 | 20.66 | 16.59 |
| Observations | 16,314 | 10,433 |
1 Nada Wasi and Aaron Flaaen. “Record linkage using Stata: Preprocessing, linking, and reviewing utilities”. In: The Stata Journal 15.3 (2015), pp. 672–697.
2 Ibid.
3 Ibid.
4 Ibid.