In recruitment processes, manually reviewing résumés is a highly time-consuming job. To reduce the cost of these reviews, Information Extraction tasks have been introduced to extract the structure of the document and the personal information it contains. However, there is no consensus on a standard résumé structure: each résumé has its own distinctive layout, number of columns, and text properties, which makes accurate extraction highly challenging. This study addresses one part of this problem. We focus on estimating the number of columns in a résumé because, in our experience with downstream processing, knowing the number of columns facilitates the separation of the main sections of the résumé and hence the analysis of the finer subsections. We employ the coordinates of the text blocks that make up a résumé and hypothesize that these coordinates carry information on the number of columns. We define the problem in a clustering context and propose novel clustering approaches dedicated to finding the number of columns in a résumé by separating the text block coordinates. The experiments are conducted on a dataset of résumés from real applicants in two languages: Turkish and English. The results reveal that hybrid approaches that use the intermediate methods perform better than the individual methods. Furthermore, these findings could be extended to any unstructured textual data in any language and document format.
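
To make the idea of estimating the column count from text block coordinates concrete, the following is a minimal sketch and not the clustering approaches proposed in the study. It assumes each text block is given as a bounding box `(x0, y0, x1, y1)` and groups the left edges with a simple one-dimensional gap rule. The function name `estimate_columns` and the parameters `min_gap_ratio` and `min_blocks` are hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Hypothetical input format: (x0, y0, x1, y1) bounding box of one text block.
Block = Tuple[float, float, float, float]


def estimate_columns(
    blocks: List[Block],
    page_width: float,
    min_gap_ratio: float = 0.15,  # assumed threshold, not taken from the study
    min_blocks: int = 2,          # assumed minimum cluster size
) -> int:
    """Estimate the number of columns by gap-based clustering of left edges."""
    if not blocks:
        return 0

    # Sort the left x-coordinates so that clusters are contiguous runs.
    lefts = sorted(block[0] for block in blocks)
    min_gap = min_gap_ratio * page_width

    # Start a new cluster whenever the jump to the next left edge is large.
    clusters = [[lefts[0]]]
    for x in lefts[1:]:
        if x - clusters[-1][-1] > min_gap:
            clusters.append([x])
        else:
            clusters[-1].append(x)

    # Ignore tiny clusters, such as a single indented or stray block.
    significant = [c for c in clusters if len(c) >= min_blocks]
    return max(1, len(significant))


two_column = [
    (50, 100, 280, 120), (52, 130, 280, 150), (51, 160, 280, 180),
    (320, 100, 560, 120), (322, 130, 560, 150), (318, 160, 560, 180),
]
one_column = [(50, 100, 560, 120), (55, 130, 560, 150), (60, 160, 560, 180)]

print(estimate_columns(two_column, page_width=600))
print(estimate_columns(one_column, page_width=600))
```

A fixed gap threshold is the simplest possible choice. It is sensitive to indentation and to centered headers, which is one reason a dedicated or hybrid clustering method may be preferable in practice.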
Primary Language | English |
---|---|
Subjects | Computing Applications in Life Sciences |
Journal Section | Information and Computing Sciences |
Authors | |
Publication Date | March 26, 2025 |
Submission Date | February 10, 2025 |
Acceptance Date | March 12, 2025 |
Published in Issue | Year 2025 Volume: 12 Issue: 1 |