The author column was exported to a raw two-column text file (6.8 MB, 342,991 lines, in id|author format).
We made multiple regular-expression passes over the file to strip out as much non-author text as possible and to split multi-author lines into individual records. Common text patterns were identified and removed, reducing the file to a regular form of ID \t singleName.
Author data | Example Expression
Herrich-Schaeffer [1856 (Pls. 1853)] | [A-Z]\w+-[A-Z]\w+ *\[[^\]]+\]$
Guérin-Ménéville 1842 | ^[A-Z]\w+-[A-Z]\w+ *\d+$
Henderson & Bartsch 1920 | (.*) *& *(.*) *\d+
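The passes above can be sketched roughly as follows. This is a minimal illustration in Python, not the original script: the function name `split_authors`, the capture groups, the lazy quantifiers, and the order in which the patterns are tried are all assumptions added for the example.

```python
import re

# Patterns adapted from the example expressions in the table.
# Capture groups and anchors are assumptions added for this sketch.
BRACKETED = re.compile(r"^([A-Z]\w+-[A-Z]\w+) *\[[^\]]+\]$")
HYPHEN_YEAR = re.compile(r"^([A-Z]\w+-[A-Z]\w+) *\d+$")
TWO_AUTHORS = re.compile(r"^(.*?) *& *(.*?) *\d+$")


def split_authors(record_id, text):
    """Return a list of (id, single_name) records for one input line."""
    text = text.strip()
    # Hyphenated surname followed by a bracketed date or a bare year.
    for pattern in (BRACKETED, HYPHEN_YEAR):
        match = pattern.match(text)
        if match:
            return [(record_id, match.group(1))]
    # Two authors joined by "&" and followed by a year: one record each.
    match = TWO_AUTHORS.match(text)
    if match:
        return [(record_id, match.group(1)), (record_id, match.group(2))]
    # Anything unrecognised is passed through unchanged.
    return [(record_id, text)]


if __name__ == "__main__":
    for rec_id, name in split_authors("17", "Henderson & Bartsch 1920"):
        print(f"{rec_id}\t{name}")
```

Each returned pair corresponds to one ID \t singleName line in the cleaned file; a real run would need many more patterns than the three shown here.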
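The resulting file (5.9 MB, 373,340 lines; smaller in size but with more lines, since non-author text was stripped and multi-author lines were split) was loaded into a MySQL table, and the table at left was generated with:
select author, count(author) as cnt from authors group by author having count(author) > 100 order by cnt desc;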
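For readers who want to try the counting step without a MySQL server, the same query can be run against an in-memory SQLite database. This is only a sketch: SQLite stands in for MySQL, and the function name `frequent_authors`, the column types, and the `min_count` parameter are assumptions made for the example.

```python
import sqlite3


def frequent_authors(rows, min_count=100):
    """Count records per author and keep authors with more than min_count.

    rows is an iterable of (id, author) pairs, e.g. parsed from the
    tab-separated file. SQLite is used here in place of MySQL.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE authors (id TEXT, author TEXT)")
    conn.executemany("INSERT INTO authors VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT author, COUNT(author) AS cnt FROM authors "
        "GROUP BY author HAVING COUNT(author) > ? ORDER BY cnt DESC",
        (min_count,),
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    # Hypothetical placeholder records, not data from the real file.
    sample = [("1", "AuthorA")] * 3 + [("2", "AuthorB")] * 2 + [("3", "AuthorC")]
    print(frequent_authors(sample, min_count=1))
```

The threshold is passed as a query parameter so that the cut-off of 100 used above can be changed without editing the SQL.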