I developed an algorithm to interpret the Bible in its original language. I started by writing a Perl script that parses the BHS Hebrew text, removes vowel points, and identifies every word used at least 50 times in the Bible. I also removed stop words (i.e., common irrelevant words such as ani [I], at/atah [you], mah [what], etc.). Keep in mind that in Hebrew some articles & prepositions are prefixes rather than distinct words (e.g., "land" = aretz, "the land" = haaretz, "in the land" = bearetz). The final list included 560 Hebrew words. I calculated the relative frequency of each word (i.e., how often the word is used compared to the other 559 words), then standardized the values. The result was 560 numeric variables, each representing a sufficiently common and sufficiently relevant Hebrew word.
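The preprocessing described above can be sketched in a few lines of Python. This is only an illustration, not the original Perl script: the function names, the choice to compute frequencies per chapter, and the use of the population standard deviation are all assumptions, and Hebrew prefix handling is not attempted.

```python
import unicodedata
from collections import Counter
from statistics import mean, pstdev

def strip_points(word):
    """Drop combining marks (vowel points, cantillation) from a Hebrew word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def select_vocabulary(chapters, min_count=50, stop_words=frozenset()):
    """Words occurring at least min_count times overall, minus stop words."""
    totals = Counter(strip_points(t) for tokens in chapters for t in tokens)
    return sorted(w for w, n in totals.items()
                  if n >= min_count and w not in stop_words)

def relative_frequencies(chapters, vocabulary):
    """Per chapter: each word's share among all vocabulary-word tokens."""
    vocab_set = set(vocabulary)
    rows = []
    for tokens in chapters:
        counts = Counter(t for t in map(strip_points, tokens) if t in vocab_set)
        total = sum(counts.values()) or 1  # avoid dividing by zero
        rows.append([counts[w] / total for w in vocabulary])
    return rows

def standardize(rows):
    """Z-score each column (assumed method); constant columns become zeros."""
    out_cols = []
    for col in zip(*rows):
        mu, sigma = mean(col), pstdev(col)
        out_cols.append([(x - mu) / sigma if sigma else 0.0 for x in col])
    return [list(r) for r in zip(*out_cols)]
```

Here `chapters` is assumed to be a list of token lists, one per chapter. With the real text, `select_vocabulary` would be called with `min_count=50` and a hand-built stop-word set, and the standardized rows would then have one column per retained word.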
560 variables is too many to easily work with, so I used Principal Component Analysis to reduce it to a few manageable variables, each of which was a linear combination of the standardized relative frequencies of all 560 words. To understand what the principal components mean, I plotted chapters of books of the Bible with obvious/known genres: History (e.g., 1 & 2 Chronicles), Prophecy (e.g., Isaiah), and Wisdom Literature (e.g., Proverbs). Each dot on the graph represents a chapter where the genre of the book (though not necessarily of the chapter) is known.
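For readers who want to see the mechanics, here is a minimal PCA sketch using NumPy's singular value decomposition. The function name `pca` and its return values are hypothetical, and the software actually used for the analysis is not stated; the sketch assumes the input rows are chapters and the columns are the standardized word frequencies.

```python
import numpy as np

def pca(X, n_components):
    """Project rows of X onto the top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]           # each row holds one weight per word
    scores = Xc @ components.T               # PC1, PC2, ... for every chapter
    explained = S**2 / np.sum(S**2)          # share of variance per component
    return scores, components, explained[:n_components]
```

Each row of `components` is the set of weights that turns the 560 word frequencies into one principal component, and `scores[:, 0]` and `scores[:, 1]` are the values that would be plotted as PC1 and PC2. The sign of a component is arbitrary, so a different implementation may flip an axis.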
The first two principal components do an excellent job of separating the books of different genres! The first (PC1) seems to indicate how historical vs. poetic a chapter is. The lowest value (-14.8) is for 2 Chronicles 27, a very historical chapter detailing the reign of King Jotham. The highest value (4.5) is for Psalm 21, a very poetic song. PC2 measures another dimension that (at least in theory) is not related to how historical/poetic a book is. It does a great job of distinguishing between prophecy and wisdom literature. The big outlier among the prophetic books (red triangle on the left side of the blue cluster, at PC1=-9.3, PC2=0.9) happens to be Jeremiah 52, which is a very historical chapter despite being in a prophetic book.
PC1 and PC2 were also calculated for entire books and for chapters/books of unknown genres. Those can be plotted on the same graph to visualize how similar they are to the known genres. For example:
For a more quantitative genre classification, I built a Logistic Regression model using the first 6 principal components. The model estimates the probability that a writing belongs to one of the three broad genres, assuming those are the only three options. As an example, I applied it to each chapter of Genesis and plotted the output below:
According to the model, chapters 1, 3, and 15 are by far the least historical, which might disappoint some who interpret Genesis 1 as a scientific or historical narrative. The biggest outlier, however, is chapter 15, which the model thought was very likely prophetic. Indeed, chapter 15 is about God's covenant with Abram and includes several prophecies about the future.
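The genre model can be approximated with a small multinomial (softmax) logistic regression trained by gradient descent. This is a sketch under assumptions: the post does not say whether the model was multinomial or one-vs-rest, nor how it was fitted, and the names `fit_softmax` and `predict_proba` as well as the learning rate, epoch count, and L2 penalty are illustrative choices.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def add_bias(X):
    """Append a column of ones so the model learns an intercept."""
    X = np.asarray(X, dtype=float)
    return np.hstack([X, np.ones((len(X), 1))])

def fit_softmax(X, y, n_classes, lr=0.1, epochs=2000, l2=1e-3):
    """Fit softmax regression with batch gradient descent and an L2 penalty."""
    Xb = add_bias(X)
    Y = np.eye(n_classes)[np.asarray(y)]     # one-hot genre labels
    W = np.zeros((Xb.shape[1], n_classes))
    for _ in range(epochs):
        P = softmax(Xb @ W)
        grad = Xb.T @ (P - Y) / len(Xb) + l2 * W
        W -= lr * grad
    return W

def predict_proba(W, X):
    """Genre probabilities for each row; every row sums to 1."""
    return softmax(add_bias(X) @ W)
```

With `X` holding the first 6 principal components of the chapters of known genre and `y` holding integer genre labels (for example 0 = history, 1 = prophecy, 2 = wisdom), `predict_proba` applied to the Genesis chapters would give three probabilities per chapter. Because the probabilities are forced to sum to 1, a chapter that fits none of the three genres still gets assigned to them.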
[Figure: K-Means Clusters, Hebrew Bible]
My plan (if I ever get enough free time) is to set up a web page where anyone can easily get the classification values for each chapter of each book. We may never get to a point where computers and algorithms can accurately interpret the Bible for us, but they certainly can be helpful.