With the increasing use of US for breast lesions, the ACR introduced the BI-RADS classification for US in 2003 to establish a lingua franca and provide clinicians with more accurate lesion descriptions (7). The BI-RADS classification for mammography has been in use since 1993. While many studies have focused on interobserver agreement for mammography, studies on the agreement of the US BI-RADS lexicon are few, and most were published relatively early after the lexicon was introduced. We aimed to add our experience after ten years of worldwide use of the US BI-RADS lexicon. In our study, we evaluated both intraobserver and interobserver agreement for the BI-RADS US classification. Our intraobserver agreement varied between substantial and almost perfect, while interobserver agreement varied between fair and substantial. These results are compatible with many of the studies on agreement variability of the BI-RADS US classification (2, 4, 9, 10, 12). Lazarus et al. (12) published the first study on interobserver agreement for BI-RADS US in 2006; in that study, interobserver agreement ranged from fair to substantial, both for the evaluation of lesion features and for the final BI-RADS category. The κ values of interobserver variability for the sonographic BI-RADS descriptors in previous studies and in our study are shown in Table 5.
Table 5. Interobserver Variability for Previous Studies Evaluating the Sonographic BI-RADS Descriptors
| Descriptor & final assessment | Our study κ | Lazarus et al. (12) κ | Berg et al. (13) κ | Park et al. (2) κ | Lee et al. (10) κ | Abdullah et al. (4) κ |
|---|---|---|---|---|---|---|
| Shape | 0.45 | 0.66 | 0.62 | 0.42 | 0.49 | 0.64 |
| Orientation | 0.66 | 0.61 | 0.72 | 0.61 | 0.56 | 0.70 |
| Margin | 0.33 | 0.40 | 0.67 | 0.32 | 0.33 | 0.36 |
| Lesion boundary | 0.56 | 0.69 | 0.36 | 0.55 | 0.59 | 0.48 |
| Echo pattern | 0.41 | 0.29 | 0.25 | 0.36 | 0.37 | 0.58 |
| Posterior feature | 0.54 | 0.40 | 0.38 | 0.53 | 0.49 | 0.47 |
| Final category | 0.35 | 0.28 | 0.52 | 0.49 | 0.53 | 0.30 |
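The agreement labels used throughout this discussion (slight, fair, moderate, substantial, almost perfect) correspond to the Landis–Koch interpretation of Cohen's κ. As a minimal illustrative sketch, not taken from any of the cited studies (the function names and the example ratings are hypothetical), κ for two observers rating the same lesions can be computed as follows:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters over the same set of lesions."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed proportion of lesions on which the two raters agree
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement from each rater's marginal category frequencies
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(ratings_a) | set(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    return (observed - expected) / (1 - expected)

def landis_koch(kappa):
    """Map a kappa value to its Landis-Koch agreement label."""
    if kappa < 0.00: return "poor"
    if kappa <= 0.20: return "slight"
    if kappa <= 0.40: return "fair"
    if kappa <= 0.60: return "moderate"
    if kappa <= 0.80: return "substantial"
    return "almost perfect"

# Hypothetical final BI-RADS assignments for ten lesions by two observers
obs1 = ["3", "4a", "4b", "5", "3", "4a", "4c", "5", "3", "4b"]
obs2 = ["3", "4b", "4b", "5", "4a", "4a", "4b", "5", "3", "4c"]
k = cohens_kappa(obs1, obs2)
print(round(k, 2), landis_koch(k))  # → 0.49 moderate
```

Note that this example treats the BI-RADS subcategories as nominal labels; an observer pair choosing 4a versus 4b is penalized as fully as one choosing 3 versus 5, which is one reason subcategorizing category 4 depresses κ.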
In our study, the interobserver agreement for shape in the sonographic BI-RADS lexicon was moderate, similar to the studies of Park et al. (2) and Lee et al. (10). In contrast, in the studies by Lazarus et al. (12), Abdullah et al. (4), and Berg et al. (13), the interobserver agreement for shape was substantial, with the highest value found by Abdullah et al. (4) (κ=0.64). In that study, when the lesions were grouped by size into <0.7 cm and >0.7 cm, the interobserver agreement for the small lesions (<0.7 cm) was similar to ours (κ=0.48). In our study, all lesions were non-palpable and their mean diameter was smaller than 1 cm; in small lesions, sonographic descriptors such as shape and margin are especially difficult to evaluate.
In our study, the highest agreement was for orientation, which was substantial, similar to other studies (2, 4, 12, 13). The higher agreement for orientation can be explained by the fact that describing a lesion as parallel or non-parallel is easier than evaluating features that involve more parameters (2).
The highest agreement for lesion boundary was found in the study of Lazarus et al. (12), while the lowest was detected in the study performed by Berg et al. (13); the other studies (2, 4, 10) reported moderate agreement, similar to ours.
The interobserver agreement for echo pattern was fair in several studies (2, 10, 12, 13). Abdullah et al. (4) detected the highest agreement (κ=0.58), which, like ours, was moderate. This shows that the observers had difficulty with this categorization; however, echo pattern is not considered an important criterion for distinguishing malignant from benign lesions (14).
In our study, interobserver agreement for margin was fair, similar to other studies (2, 4, 10, 12). Margin is among the most important parameters in assigning the final BI-RADS category and making the biopsy decision; however, the sonographic BI-RADS lexicon defines five margin subgroups (circumscribed, indistinct, angular, microlobulated, and spiculated), and it is very difficult to choose only one of them using static images. In the prospective study conducted by Berg et al. (13), the agreement for margin was the highest of all the studies (κ=0.67), because in that study the margin descriptor had only two alternatives: circumscribed and non-circumscribed. Deciding whether or not a lesion has a circumscribed margin is clearly easier, whereas choosing a subgroup for a non-circumscribed margin is difficult. In practice this is not a problem, because each of the four non-circumscribed descriptors counts as a suspicious finding, so the final assessment is not affected (10).
The interobserver agreement for the final BI-RADS category in our study was fair, similar to the studies performed by Lazarus et al. (12), Abdullah et al. (4), and Lai et al. (5). According to Abdullah et al. (4), the low agreement results from subcategorizing BI-RADS 4 into 4a, 4b, and 4c; when category 4 was evaluated as a whole, interobserver reproducibility increased to moderate (κ=0.47). In our study, the agreement for BI-RADS 4 as a whole was likewise moderate, but decreased to fair when it was subcategorized into 4a, 4b, and 4c. This indicates that the subcategorization of BI-RADS 4 lesions, which cover a wide range, is not clearly defined.
In the studies carried out by Park et al. (2) and Berg et al. (13), the agreement for the final BI-RADS category was moderate, higher than in our study. The reason for the higher final BI-RADS agreement in the study of Berg et al. (13) was the non-homogeneous distribution of patients: of the 88 patients, 42 were categorized as BI-RADS 1 or 2, 41 as category 3, and only 5 as category 4a, 4b, 4c, or 5, so very few lesions fell into the subcategories that are difficult for observers to assign. The main reason for the high final BI-RADS agreement in the study of Park et al. (2) is that agreement was evaluated without subcategorizing category 4.
In our study, despite the fair interobserver agreement for the final BI-RADS category overall, agreement was substantial for BI-RADS category 5. This suggests that the observers reached consensus in predicting malignant lesions, but their opinions on probably benign (BI-RADS 3) and suspicious (BI-RADS 4) lesions varied.
Our study showed a higher level of intraobserver agreement than interobserver agreement. Our intraobserver agreement results are similar to or better than those reported in the literature (2, 9, 10). The κ values for intraobserver variability of the sonographic BI-RADS descriptors in previous studies and in our study are presented in Table 6.
Table 6. Intraobserver Variability for Previous Studies Evaluating the Sonographic BI-RADS Descriptors

| Descriptor & final assessment | Our study κ | Park et al. (2) κ | Lee et al. (10) κ | Calas et al. (9) κ |
|---|---|---|---|---|
| Shape | 0.85-0.91 | 0.73 | 0.56-0.72 | - |
| Orientation | 0.84-0.94 | 0.68 | 0.75-0.83 | - |
| Margin | 0.71-0.83 | 0.64 | 0.53-0.61 | - |
| Lesion boundary | 0.71-0.94 | 0.68 | 0.56-0.85 | - |
| Echo pattern | 0.68-0.71 | 0.65 | 0.67-0.81 | - |
| Posterior feature | 0.79-0.94 | 0.64 | 0.67-0.82 | - |
| Final category | 0.64-0.83 | 0.74 | 0.72-0.79 | 0.37-0.75 |
The intraobserver agreement in the study of Park et al. (2) was substantial, both for the lesion descriptors and for the final BI-RADS category. In the study of Lee et al. (10), intraobserver agreement for the lesion descriptors varied from moderate to almost perfect, and agreement for the final BI-RADS category was substantial. In the study performed by Calas et al. (9), only the final BI-RADS category was evaluated, and agreement ranged from fair to substantial.
Our study had several limitations. First, BI-RADS category 2 and 3 lesions were excluded because only patients who underwent excisional biopsy after guide-wire localization were included; since the radiologists knew that only biopsied patients were included, they may have evaluated the lesions more cautiously. Second, the observers evaluated only static images of the lesions, whereas real-time US evaluation is performed in routine practice. Third, the study was based on the performance of radiologists experienced in breast sonography. Inconsistencies and errors in the use of BI-RADS terminology among our observers may explain why interobserver agreement was lower than intraobserver agreement.
In conclusion, our results demonstrated that each observer was self-consistent in interpreting the US BI-RADS classification, while interobserver agreement was relatively poor. Although ten years have passed since the sonographic BI-RADS lexicon was described, it has partially failed to provide a consensus among our observers. We believe that giving radiologists feedback on the pathological results of the lesions they have described may improve classification accuracy. In addition, further training and periodic performance evaluations would probably help to achieve better agreement among radiologists.