Saturday, March 8, 2014

Interpreting the Hebrew Bible with Artificial Intelligence

I often hear the question "Do you interpret the Bible literally or figuratively?" The answer is "both" and "neither", mostly "neither". The Bible contains different genres of writing, which should be interpreted accordingly. They include history, prophecy, poetry/songs, stories, and wisdom literature, to name just a few. Identifying the genre is important, but it can be very subjective, and it is difficult without understanding the original language. Machine learning can help with both problems.

I developed an algorithm to interpret the Bible in its original language. I started by writing a Perl script that parses the BHS Hebrew text, removes vowel points, and identifies every word used at least 50 times in the Bible. I also removed stop words (i.e., common irrelevant words such as ani [I], at/atah [you], mah [what], etc.). Keep in mind that in Hebrew some articles & prepositions are prefixes rather than distinct words (e.g., "land" = aretz, "the land" = haaretz, "in the land" = bearetz). The final list included 560 Hebrew words. I calculated the relative frequency of each word (i.e., how often the word is used compared to the other 559 words), then standardized the values. The result was 560 numeric variables, each representing a sufficiently common and sufficiently relevant Hebrew word.
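
The preprocessing steps above can be sketched roughly as follows. This is a Python illustration, not the original Perl script (which is not shown here). All function names are made up for the example, and computing relative frequencies per chapter is an assumption. It strips the Unicode vowel points and accents (U+0591 through U+05C7), counts words, keeps those above a minimum count, and converts the counts to standardized relative frequencies.

```python
import re
import statistics
from collections import Counter

def tokenize(text):
    """Split pointed Hebrew text into consonant-only words."""
    text = text.replace("\u05BE", " ")  # maqaf joins words, so treat it as a space
    # Remove vowel points, accents, and the other marks in this Unicode range.
    text = re.sub(r"[\u0591-\u05BD\u05BF-\u05C7]", "", text)
    return re.findall(r"[\u05D0-\u05EA]+", text)  # runs of Hebrew letters

def build_vocabulary(chapters, min_count=50, stop_words=frozenset()):
    """Words used at least min_count times overall, minus stop words.

    `chapters` maps a chapter label to its list of words (hypothetical layout).
    """
    totals = Counter(word for words in chapters.values() for word in words)
    return sorted(w for w, n in totals.items()
                  if n >= min_count and w not in stop_words)

def relative_frequencies(words, vocabulary):
    """Share of each vocabulary word among all vocabulary words in one chapter."""
    counts = Counter(words)
    total = sum(counts[w] for w in vocabulary)
    return [counts[w] / total if total else 0.0 for w in vocabulary]

def standardize(rows):
    """Convert each column to z-scores (mean 0, standard deviation 1)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]
```

Note that prefixed forms such as haaretz and bearetz stay distinct from aretz in this sketch, since it does no morphological analysis.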

560 variables are too many to work with easily, so I used Principal Component Analysis to reduce them to a few manageable variables, each of which was a linear combination of the standardized relative frequencies of all 560 words. To understand what the principal components mean, I plotted chapters of books of the Bible with obvious/known genres: History (e.g., 1 & 2 Chronicles), Prophecy (e.g., Isaiah), and Wisdom Literature (e.g., Proverbs). Each dot on the graph represents a chapter where the genre of the book (though not necessarily of the chapter) is known.
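
To make the idea of a principal component as a linear combination concrete, here is a minimal Python sketch of PCA using power iteration with deflation on the covariance matrix. It is only an illustration under my own assumptions: the function names are invented, and it is not the software used for the analysis described here.

```python
import math
import random

def covariance_matrix(rows):
    """Column means and sample covariance matrix of a list of equal-length rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    return means, cov

def top_eigenvector(matrix, iterations=500, seed=0):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left to explain
            break
        v = [x / norm for x in w]
    eigenvalue = sum(vi * sum(m * x for m, x in zip(row, v))
                     for vi, row in zip(v, matrix))
    return eigenvalue, v

def pca(rows, n_components=2):
    """Return column means and a list of (variance, loadings) pairs."""
    means, cov = covariance_matrix(rows)
    p = len(cov)
    components = []
    for _ in range(n_components):
        eigenvalue, v = top_eigenvector(cov)
        components.append((eigenvalue, v))
        # Deflation: remove the variance already explained by this component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    return means, components

def project(row, means, components):
    """Score one row on each component: a weighted sum of its centered values."""
    return [sum((x - m) * w for x, m, w in zip(row, means, loadings))
            for _, loadings in components]
```

The same `project` call works for chapters that were not used to fit the components, as long as they were standardized with the same means and standard deviations. With 560 variables, pure Python like this would be slow; a numerical library would be the practical choice.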

The first two principal components do an excellent job of separating the books of different genres! The first (PC1) seems to indicate how historical vs. poetic a chapter is. The lowest value (-14.8) is for 2 Chronicles 27, a very historical chapter detailing the reign of King Jotham. The highest value (4.5) is for Psalm 21, a very poetic song. PC2 is uncorrelated with PC1 by construction, so it measures another dimension that (at least in theory) is not related to how historical/poetic a book is. It does a great job of distinguishing between prophecy and wisdom literature. The big outlier among the prophetic books (red triangle on the left side of the blue cluster, at PC1=-9.3, PC2=0.9) happens to be Jeremiah 52, which is a very historical chapter despite being in a prophetic book.

PC1 and PC2 were also calculated for entire books and for chapters/books of unknown genres. Those can be plotted on the same graph to visualize how similar they are to the known genres. For example:

For a more quantitative genre classification, I built a Logistic Regression model using the first 6 principal components. The model estimates the probability that a writing belongs to one of the three broad genres, assuming those are the only three options. As an example, I applied it to each chapter of Genesis and plotted the output below:

According to the model, chapters 1, 3, and 15 are by far the least historical, which might disappoint some who interpret Genesis 1 as a scientific or historical narrative. The biggest outlier, however, is chapter 15, which the model rated as very likely prophetic. Indeed, chapter 15 is about God's covenant with Abram and includes several prophecies about the future.
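
For readers curious how such a three-genre classifier produces probabilities, here is a minimal Python sketch of multinomial (softmax) logistic regression trained by gradient descent. The multinomial form, the learning rate, the small L2 penalty, and every name in the code are assumptions made for illustration; the actual model may have been fit differently.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, x):
    """Genre probabilities for one row of principal component scores.

    weights[k] is [intercept, w1, w2, ...] for class k.
    """
    return softmax([w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
                    for w in weights])

def fit_softmax(X, y, n_classes, learning_rate=0.1, epochs=2000, l2=0.01):
    """Fit class weights by batch gradient descent on the cross-entropy loss."""
    n_features = len(X[0])
    weights = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grads = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for x, label in zip(X, y):
            probs = predict_proba(weights, x)
            for k in range(n_classes):
                error = probs[k] - (1.0 if k == label else 0.0)
                grads[k][0] += error
                for j, xj in enumerate(x):
                    grads[k][j + 1] += error * xj
        for k in range(n_classes):
            for j in range(n_features + 1):
                penalty = l2 * weights[k][j] if j else 0.0  # intercept not penalized
                weights[k][j] -= learning_rate * (grads[k][j] / len(X) + penalty)
    return weights
```

In this sketch, `X` would hold the first six principal component scores for each chapter of known genre and `y` the genre labels coded as 0, 1, and 2. Because the probabilities are forced to sum to 1, a chapter that fits none of the three genres still gets assigned to one of them.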

Figure: K-Means Clusters, Hebrew Bible

Classification into these broad genres is only the beginning. If other genres, writing styles, authors, topics, etc. can be identified, another model could easily be built to classify writings according to those, using the same principal components calculated here. If none of those things are known, cluster analysis can be used to identify writings that have various features in common (see the K-Means clusters figure).
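
As a rough illustration of the cluster analysis idea, here is a minimal Python sketch of k-means (Lloyd's algorithm) applied to rows of principal component scores. The function name, the random initialization, and the fixed seed are assumptions made for the example, not a description of how the figure was produced.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Group points into k clusters; returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(iterations):
        # Assign each point to its nearest centroid (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:  # converged: nothing moved
            break
        assignments = new_assignments
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids
```

Here `points` would be one row of principal component scores per chapter. The number of clusters `k` has to be chosen in advance, and different starting points can give different groupings, so it is worth trying several seeds.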

My plan (if I ever get enough free time) is to set up a web page where anyone can easily get the classification values for each chapter of each book. We may never get to a point where computers and algorithms can accurately interpret the Bible for us, but they certainly can be helpful.


  1. I'd like to see a bigger version of the K-Means clustering. I certainly found this an interesting exercise, especially the effective separation of genres using the first two PCs!

  2. A bigger version of the figure, that is.

  3. Pardon the late comment. A separate discussion elsewhere fortunately reminded me to re-read and revisit this intensely thought-provoking entry. I learn at least as much about statistical applications from your posts as about theology, if not more. This is a fascinating frontier in sorting some Old Testament writings for interpretive purposes using PCA.

    I know you're concentrating on the Hebrew Bible, but this approach could similarly be used to examine New Testament books in their native language, such as the Gospels, letters of Paul, Revelation, etc., for similar classification directed at more astute interpretation. If I 1) were more confident in my statistical prowess, 2) were good enough at the requisite software, which I don't have anyway, and 3) could live two lives in parallel (smiley icon), I'd give it a go. If you ever decide to do this, please give me a heads-up...I'd love to see the distributions!

    1. Thanks very much, Roger! Interesting idea. I'm sure the same technique could be used for the New Testament, although the sample size would be a lot smaller. I'll definitely let you know if I ever do that. I'd be interested to see those distributions as well.
