How Hard is "The Count of Monte Cristo"? A Text Analysis of 16 Classic French Books
As a long-time French learner, delving into French literature has forever been on my to-do list - in part for the personal enrichment, though mostly to surprise the in-laws with my sudden non-uselessness at Trivial Pursuit, FR Edition. Given that my previous analysis of Spanish books led to me getting more into Spanish literature, I figured it'd be a good idea to do the same thing for French.
I mean, I still won't be great at it, but at least landing on Arts & Literature won't make me shudder as much.
I'll be looking at 16 books, written from 1666 to 2010, across a variety of metrics: word count, text difficulty, vocabulary, and verb tense usage. The books are the following:
Le Petit Prince (The Little Prince) - by Antoine de Saint-Exupéry
L'Etranger (The Stranger) - by Albert Camus
La Nausée (Nausea) - by Jean-Paul Sartre
Le Misanthrope ou l'Atrabilaire amoureux (The Misanthrope, or the Cantankerous Lover) - by Molière
La Condition Humaine (Man's Fate) - by André Malraux
Les Misérables - by Victor Hugo
Bel-Ami - by Guy de Maupassant
Le Père Goriot - by Honoré de Balzac
Madame Bovary - by Gustave Flaubert
Le Comte De Monte-Cristo (The Count of Monte Cristo) - by Alexandre Dumas
Les Liaisons Dangereuses (Dangerous Liaisons) - by Pierre Choderlos de Laclos
Candide, ou l'Optimisme (Candide: or, The Optimist) - by Voltaire
Voyage Au Centre De La Terre (Journey to the Center of the Earth) - by Jules Verne
Les Particules Elementaires (Atomised) - by Michel Houellebecq
A La Recherche Du Temps Perdu (In Search of Lost Time) - by Marcel Proust
La Carte Et Le Territoire (The Map and the Territory) - by Michel Houellebecq
Word Count/How Long They Take to Read
The first thing to note is that a few of these books are impressively long - in fact, Proust's "A La Recherche Du Temps Perdu" is approximately twice as long as War and Peace! While Proust's tome is something of an outlier, there are a couple of other books that will take the casual reader a month or two, though the majority can be read in under 10 hours, assuming a reading speed of 200 words per minute:
The y-axis shows the number of words for each book.
Something to note is that if you're not a native speaker of French, you'll likely take a bit longer to get through the text - in my reading of The Count of Monte Cristo, for example, it's taken me a good 20 hours to get through a third of the book. So, your mileage may vary with these estimates.
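The reading-time estimate is just the word count divided by the reading speed. Here's a minimal sketch of that arithmetic; the function name and the word counts in the loop are made up for illustration and are not the actual figures for any book on the list.

```python
def reading_hours(word_count: int, words_per_minute: int = 200) -> float:
    """Estimated reading time in hours at a constant reading speed."""
    return word_count / words_per_minute / 60


# Hypothetical word counts, chosen only to illustrate the arithmetic.
for words in (15_000, 120_000, 450_000):
    print(f"{words:>9,} words -> {reading_hours(words):5.1f} h")
```

Lowering `words_per_minute` is a simple way to model a non-native reader's slower pace.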
Ranking by Difficulty
For my Spanish book analysis I used the Spanish-specific "Fernandez Huerta Rating" to determine difficulty, while also crunching some numbers on metrics like unique word count and unique words per sentence to see how they correlated with it. This time around I'm using the well-known "Flesch Reading Ease" test, which works for a variety of languages, including French.
So, which books are the hardest?
As expected, Le Petit Prince is far and away the easiest read, followed by L'Etranger (I've read this and can confirm that it's not particularly taxing). Le Comte de Monte-Cristo is roughly at the half-way mark, with La Carte et Le Territoire just edging out A La Recherche du Temps Perdu, which is something of a surprise to me given the latter's formidable reputation. Still, I think this has given more or less good results, even if I'd personally shift things around a bit.
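For reference, the Flesch Reading Ease score is a simple function of average sentence length and average syllables per word, with higher scores meaning easier text. Here's a minimal sketch, assuming the word, sentence and syllable counts are already available. The default constants are those of the original English formula; language-specific adaptations (French included) use different values, so they're left as parameters. The function name and arguments are illustrative and are not textstat's API.

```python
def flesch_reading_ease(words, sentences, syllables,
                        base=206.835, sentence_weight=1.015, syllable_weight=84.6):
    """Flesch Reading Ease from raw counts.

    Defaults are the original English constants (an assumption here);
    pass other values for a language-specific variant.
    """
    avg_sentence_length = words / sentences
    avg_syllables_per_word = syllables / words
    return (base
            - sentence_weight * avg_sentence_length
            - syllable_weight * avg_syllables_per_word)


# Same text length and syllable count, split into shorter sentences.
long_sentences = flesch_reading_ease(words=100, sentences=10, syllables=150)
short_sentences = flesch_reading_ease(words=100, sentences=20, syllables=150)
print(round(long_sentences, 3), round(short_sentences, 3))
```

Shorter sentences and shorter words both push the score up, which fits Le Petit Prince landing at the easy end of the ranking.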
Percentage Vocab by CEFR Level
For this next section I had to make a couple of assumptions/leaps of logic, but I think the results are both interesting and mostly true to life. Those assumptions are:
You can assign a specific number of words to a CEFR level, as described here, here and in a few other places, but admittedly nothing that I found in actual studies or academic papers. The idea is that you start with 500 words to reach the lower threshold of level A1, and then double the word count for each subsequent level - 1000 words for A2, 2000 for B1, etc, until you reach 16k words for C2.
This particular frequency list is accurate - it is, at the very least, the most detailed (and longest: 142k individual words & 47k unique lemmas) frequency list I've ever seen, and provides different frequencies depending on the data source, either films or books. FluentU likes it too.
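The two assumptions above can be sketched as a rank-to-level lookup plus a tally of how much of a text falls into each band. This is a minimal sketch: `BANDS`, `level_for_rank`, `composition` and the toy ranks are all hypothetical names and values, and the "unlisted" bucket for words missing from the frequency list is my own assumption rather than something the analysis specifies.

```python
from collections import Counter

# Highest 1-based frequency rank in each band, following the doubling rule.
BANDS = [("A0", 500), ("A1", 1000), ("A2", 2000),
         ("B1", 4000), ("B2", 8000), ("C1", 16000)]


def level_for_rank(rank):
    """Map a 1-based frequency rank to a band; None means 'not in the list'."""
    if rank is None:
        return "unlisted"
    for level, upper in BANDS:
        if rank <= upper:
            return level
    return "C2"


def composition(lemmas, rank_of):
    """Percentage of a text's tokens that fall into each band."""
    counts = Counter(level_for_rank(rank_of.get(lemma)) for lemma in lemmas)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {level: 100 * n / total for level, n in counts.items()}


# Toy frequency ranks and a toy "text" (both made up for illustration).
rank_of = {"de": 1, "la": 2, "et": 3, "école": 700, "renard": 5000}
lemmas = ["de", "la", "et", "école", "renard", "de", "la", "de"]
print(composition(lemmas, rank_of))
```

Because a handful of very frequent words repeat constantly, the lowest band dominates the token count even in this tiny example.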
Anyway, I wanted to see what each book was composed of per CEFR level to see what, if any, differences there might be:
Sorry for the wonky presentation of numbers, it was the best I could do.
The results I found here were so similar that I thought I had done something wrong - every book on the list is between 60% and 70% 'A0' type words, i.e., the absolute base of the language, without which it's either impossible or impractically difficult to write coherent sentences - words like "a", "and", "the" etc. in English. It does make sense that the majority of any given book's text would be composed of such words, I just didn't think the result would be so...uniform.
This does not, of course, mean that you'd be able to understand 60-70% of these books if you only have the beginnings of an A1 level - the remaining 30-40% of the words beyond this level are the difference between these two sentences:
'Yesterday I took a _ on the _ and _ a _ _'
'Yesterday I took a stroll on the waterfront and witnessed a magnificent sunset'
You might understand 60% of the words (61.5% to be precise), but essentially 0% of the meaning that the sentence actually conveys.
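The 61.5% figure is simply 8 known words out of 13. A quick check, where the `known` set stands in for the basic words that survive in the blanked-out version of the sentence:

```python
sentence = ("Yesterday I took a stroll on the waterfront "
            "and witnessed a magnificent sunset")

# The words left visible in the blanked-out version of the sentence.
known = {"yesterday", "i", "took", "a", "on", "the", "and"}

tokens = sentence.lower().split()
understood = [t for t in tokens if t in known]
coverage = 100 * len(understood) / len(tokens)

print(f"{len(understood)}/{len(tokens)} words = {coverage:.1f}%")  # 8/13 words = 61.5%
```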
CEFR Exposure Rate
A similar metric to the one above is the question of how much exposure to the words that correspond to each CEFR level you'd get with each book. For example: there are 2 thousand words in the B1 level (words like 'perte', 'chocolat' and 'mouton' are in there - 'loss', 'chocolate' and 'sheep' respectively) - when you read The Count of Monte Cristo, for instance, you will see 81% of these B1 words in some form.
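In other words, the exposure rate is the share of a level's word list that turns up at least once in the book. A minimal sketch, assuming the book has already been reduced to lemmas; the function name, the four-word "band" and the tiny "book" are hypothetical stand-ins for the real data.

```python
def exposure_rate(book_lemmas, band_lemmas):
    """Percentage of a band's lemmas that appear at least once in the book."""
    band = set(band_lemmas)
    if not band:
        return 0.0
    seen = band & set(book_lemmas)
    return 100 * len(seen) / len(band)


# Toy data: a four-word "band" and a tiny "book" (both made up).
b1_band = ["perte", "chocolat", "mouton", "charbon"]
book = ["le", "mouton", "manger", "le", "chocolat", "mouton"]
print(exposure_rate(book, b1_band))  # 50.0
```

Unlike the composition percentages, repeats don't count here: a word seen once contributes as much as a word seen a thousand times.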
Here are some examples of words by each level to give you an idea:
A0 (0-500 most frequent words) - 'de', 'la', 'et' ('of', 'the', 'and')
A1 (500-1000) - 'aujourd'hui', 'école', 'pluie' ('today', 'school', 'rain')
A2 (1000-2000) - 'faim', 'nord', 'maladie' ('hunger', 'north', 'disease/sickness')
B1 (2000-4000) - 'nettoyer', 'espérance', 'charbon' ('to clean', 'hope', 'coal')
B2 (4000-8000) - 'chevaucher', 'paresse', 'renard' ('to overlap', 'laziness', 'fox')
C1 (8000-16000) - 'renfrogné', 'attrouper', 'jongleur' ('sullen/sulky', 'to draw a crowd', 'juggler')
C2 (16000+) - 'dégonflé', 'jambonneau', 'tisonnier' ('deflated', 'knuckle of ham', 'fire poker')
Anyway, here's what each book covers for each level:
If you read 'A La Recherche' you'll have seen roughly half (46.7%) of the French language's 32k most frequently used words.
This chart, in a way, illustrates the difficulty of reaching the higher levels of a given language. B2 level words (I say B2 because it's often thought of as being the start of fluency) in these books, as in life, occur far less frequently than words at lower levels - how often do you come across words like 'guidon', 'trimbaler' or 'parrain' ('handlebar', 'to cart/lug around', and 'godfather')? The less we see, hear or say a word, the less likely we are to remember it. Plus there are so many more of these harder-to-remember words than in B1, which has more words than A2, etc. This, to me, highlights that reaching a B2+ level is indeed an accomplishment.
On the other hand, seeing these figures is also strangely comforting - a good number of these books contain much of the vocabulary needed to reach the lower threshold of the B2/C1 levels (4000 and 8000 words respectively) - even if they don't appear often, at least they're there, so at the very least your brain will register them unconsciously.
So go ahead and read lots of French literature, it counts as learning.
OK, so this is definitely something to put in the "interesting" rather than "useful" column, but still - I was curious to see what, if any, change had occurred over time in the usage of some of French's tenses: namely, the passé simple, the subjonctif présent, and the subjonctif imparfait, respectively known as the "writer's tense", the "French learner's nightmare tense" and the "no longer used tense".
For this, I decided to limit myself to just 5 of the most commonly used verbs ("être", "avoir", "aller", "faire" and "pouvoir" - "to be", "to have", "to go", "to do/make" and "to be able to", respectively) as a gauge for how often these tenses are used, so naturally you'll find these tenses more often than I indicate here, but I think this still gives an idea of the overall picture. Note that I've put the books in order of publishing date, from 1666 (Le Misanthrope) to 2010 (La Carte Et Le Territoire):
The most noticeable trend is the sharp reduction in usage of the "subjonctif imparfait" from "La Condition Humaine" (published 1933) onwards, which makes sense - I'd never come across it at any point while I was learning French actively, and my partner, who is French, didn't recognise the forms of this tense that I showed her. Interestingly, it is still used in other Romance languages, such as Spanish and Italian.
I also wonder if the subjonctif présent became less frequently used after the end of the 18th century - it seems to me that it drops off from Les Liaisons Dangereuses (1782) onward.
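For the curious, here's one crude way counts like these could be gathered: match each word against a hand-written table of conjugated forms. This is a sketch under stated assumptions and not necessarily how the figures above were produced. It covers "être" only, the table and function names are hypothetical, and it can't tell ambiguous forms apart ("soit" is also a conjunction, and "sois", "soyons" and "soyez" double as imperatives), which a morphological tagger would handle better.

```python
import re
from collections import Counter

# Hand-written forms of "être" only; the other four verbs would each
# need their own table. Ambiguous forms are counted naively.
ETRE_FORMS = {
    "passé simple": {"fus", "fut", "fûmes", "fûtes", "furent"},
    "subjonctif présent": {"sois", "soit", "soyons", "soyez", "soient"},
    "subjonctif imparfait": {"fusse", "fusses", "fût",
                             "fussions", "fussiez", "fussent"},
}


def count_tenses(text):
    """Count occurrences of each tense's forms in a text."""
    # Runs of letters (accented ones included); apostrophes split words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter()
    for tense, forms in ETRE_FORMS.items():
        counts[tense] = sum(1 for word in words if word in forms)
    return counts


sample = "Il fut un temps où je voulais qu'il fût roi, bien qu'il soit trop tard."
print(count_tenses(sample))
```

Dividing each count by the book's total word count would make books of very different lengths comparable.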
Comparing the text of these books with the various CEFR levels really hammered home the idea that books are a fantastic way of helping you get to the next level of your language learning journey, particularly when you're at around B2 level and can piece together unknown words from context. It's given me much more enthusiasm to really plunge into French literature & explore all its timeless stories, improve my French, and not be the last person picked for French trivia night.
spaCy was very useful in allowing me to convert every word into its associated "lemma" so I could get that "vocab by level" data, and ftfy and textstat were once again (I used them in the Spanish analysis) a great help in converting the text to the right format for analysis.
Over to You
Want to see a particular French book added to this analysis to see how it fares? Any other metrics I should include? Other critiques/comments? Let me know!