How Hard is Don Quixote? A Difficulty Analysis of 19 Popular Spanish Books
Earlier in the year, I decided to put my (middling, maybe a low B2 on a good day) Spanish through its paces, and picked up a book called La Sombra Del Viento (The Shadow of the Wind) - a big hit in the Spanish-speaking market, and likely a fantastic read in whatever language you read it in.
While I did enjoy reading it, I found myself stopping to look up words a little bit too often for my liking - 2668 unique times, to be exact. The total number was actually even higher, as I looked up some words more than once.
A typical paragraph.
The book was throwing a bunch of new words at me, and my Spanish vocabulary was too limited to really get into a good flow. Still, I powered through - it was a bit painful and more than a little slow, but the intriguing story carried me through to the end.
This got me thinking: what other Spanish books out there - be they modern best sellers, older classics, or even kids' books - are currently beyond my ability to understand? What can I read and enjoy now, what books should be reserved for later in my Spanish journey, and what would I struggle to even begin to comprehend, even after years of study?
To answer these questions, I thought it'd be fun to put together a Spanish text analysis tool, and analyse a bunch of books on my Spanish reading list. Yes, my definition of fun is likely different from yours.
Hours and hours of fun.
I analysed 19 books in total, ranging from a couple of graded readers (books for foreign language learners) all the way to the original Spanish novel, Don Quixote, mixed in with a bunch of European/Latin American classics, children's books, and Spanish translations of English/Portuguese/French books. The full list is as follows:
Short Stories in Spanish - (An intermediate graded reader by Olly Richards)
Tsunami - (A C2 graded reader by Paco Ardit)
El Juego Del Angel
Como Agua Para Chocolate
La Casa En Mango Street
El Alquimista (The Alchemist)
El Prisionero Del Cielo
Cien Anos De Soledad (One Hundred Years of Solitude)
Harry Potter Y La Piedra Filosofal (Harry Potter & The Philosopher's Stone)
El Principito (The Little Prince)
La Casa De Los Espiritus
El Amor En Los Tiempos Del Colera (Love in the Time of Cholera)
Corazon Tan Blanco
La Sombra Del Viento (The Shadow of The Wind)
Harry Potter Y El Caliz De Fuego (Harry Potter & The Goblet of Fire)
Cronica De Una Muerte Anunciada
Measuring Text Difficulty
To measure difficulty, I used the "Fernandez Huerta" score, a Spanish text readability formula similar to the "Flesch Reading Ease" score commonly used for English. It normally runs on a 100-0 scale (100 is very easy, 0 is very hard), which, for the purposes of this analysis, I decided to reverse to make it more intuitive (100 is very hard, 0 is very easy).
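A minimal sketch of how a score like this can be computed from raw counts. It assumes one commonly cited form of the Fernandez Huerta formula (206.84 - 0.60 x syllables per 100 words - 1.02 x words per sentence); variants of the formula exist and libraries may differ, so treat the constants as an assumption. The function names and the "100 minus score" reversal are illustrative too, not necessarily what the analysis tool did.

```python
def fernandez_huerta(total_words, total_sentences, total_syllables):
    """Readability score where higher means easier (assumed formula variant)."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    syllables_per_100_words = 100.0 * total_syllables / total_words
    words_per_sentence = total_words / total_sentences
    return 206.84 - 0.60 * syllables_per_100_words - 1.02 * words_per_sentence


def reversed_difficulty(score):
    """Flip the scale so that higher means harder (assumed: 100 - score)."""
    return 100.0 - score


if __name__ == "__main__":
    # Hypothetical counts for a short passage, purely for illustration.
    score = fernandez_huerta(total_words=100, total_sentences=5, total_syllables=200)
    print(score, reversed_difficulty(score))
```

Counting syllables in Spanish is the fiddly part, which is why leaning on a library for that step is the easier route.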
Bear in mind that this isn't a perfect measure of difficulty, and different text preparation methods seem to generate somewhat different results (which I tried to mitigate as much as possible).
That said, having read at least some of each one of these books, plus a handful cover-to-cover, I think it's done a pretty good job of quantifying how hard these books are to read, with only 1 real head-scratcher of a result (that being "El Juego Del Angel" scoring as lower difficulty than every other book, minus the graded readers, which is...odd).
Anyway, here's what the analysis tool spat out:
Unsurprisingly, the 2 graded readers were the least difficult, while Don Quixote was hardest by quite a margin, with La Sombra Del Viento coming in 3rd.
Apart from just calculating raw difficulty, I also wanted to see whether my hunch of "lots of unique words = hard" was on track, in addition to a few other metrics. I picked the following:
Words Per Sentence
Unique Word Density
Unique Words Per Sentence
Unique Word Length
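For reference, here is a rough sketch of how metrics like these could be computed with nothing but the standard library. The definitions are my assumptions (for example, density as unique words divided by total words, and sentences split on ".", "!", "?" and "…"), and the name text_metrics is hypothetical; a real tool may tokenize differently and get somewhat different numbers.

```python
import re

# Runs of letters (including accented ones); digits and punctuation are skipped.
WORD_RE = re.compile(r"[^\W\d_]+")
# Naive sentence boundary: one or more closing punctuation marks.
SENTENCE_END_RE = re.compile(r"[.!?…]+")


def text_metrics(text):
    """Return simple per-text metrics; inflected forms count as separate words."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    # Keep only chunks that actually contain a word.
    sentences = [s for s in SENTENCE_END_RE.split(text) if WORD_RE.search(s)]
    if not words or not sentences:
        raise ValueError("text contains no words")
    unique_words = set(words)
    return {
        "word_count": len(words),
        "unique_word_count": len(unique_words),
        "words_per_sentence": len(words) / len(sentences),
        "unique_word_density": len(unique_words) / len(words),
        "unique_words_per_sentence": len(unique_words) / len(sentences),
        "word_length": sum(len(w) for w in words) / len(words),
        "unique_word_length": sum(len(w) for w in unique_words) / len(unique_words),
    }


if __name__ == "__main__":
    print(text_metrics("Hago la cena. Ayer hice la cena."))
```

Because there is no lemmatization, "hago" and "hice" are counted as two different unique words here.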
My initial thoughts were that 1) Unique Word Count, and 2) Unique Word Density would correlate particularly well with difficulty. Of these, the first wasn't far off, and the second was pretty decisively wrong:
It turns out that "Total Words per Sentence" was the most highly correlated metric (.798, with 1 being perfectly correlated), followed by "Word Count" and "Unique Word Count", at .721 and .711 respectively. "Unique Word Density" was actually negatively correlated with difficulty, at -.447, which surprised me.
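A quick sketch of how a correlation figure like these can be calculated. I'm assuming the Pearson correlation coefficient here (the usual choice when 1 means "perfectly correlated"); the function name and the toy numbers are made up for illustration and are not the data from this analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values only: one metric per book versus its difficulty score.
    metric = [1, 2, 3, 4]
    difficulty = [2, 4, 6, 8]
    print(pearson(metric, difficulty))
```

A value near 1 means the metric rises with difficulty, a value near -1 means it falls as difficulty rises, and a value near 0 means there is little linear relationship.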
While European Spanish and Latin American Spanish are both part of the same language, there are enough differences between them, in terms of both grammar and vocabulary, that I was curious to see whether they'd have a visible impact on difficulty score ("N/A" here refers to books that are translations from other languages, mostly English):
European Spanish seems to be a bit tougher (44.6 average difficulty, compared to 39.78 and 39 for Latin America and N/A respectively), though if we exclude the behemoth of an outlier that is Don Quixote, we get the following:
Still slightly tougher (down to 41.45 from 44.6), but not significantly so, and as my sample size is tiny I doubt it holds as a general rule.
Note - To differentiate between regions, I had to do some additional research to find the author's home country, and made the assumption that this was the style of Spanish that they used when writing their book(s), so it's not based on anything inherent to the text that the analysis tool picked up on (though I'd wager that this would be possible - anything with "computadora" in it would be a good hint that the text is Latin American, for example).
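To give a flavour of what a text-based guess might look like, here is a deliberately crude sketch. The marker lists are my own tiny, hypothetical examples ("computadora" versus "ordenador", plus "vosotros"), the function name is made up, and a handful of words is nowhere near enough to classify a real book reliably.

```python
import re

# Hypothetical, very incomplete marker lists; hints, not proof.
LATIN_AMERICAN_MARKERS = {"computadora", "computadoras"}
EUROPEAN_MARKERS = {"ordenador", "ordenadores", "vosotros", "vosotras"}

WORD_RE = re.compile(r"[^\W\d_]+")


def guess_region(text):
    """Guess the variety of Spanish by counting marker words in the text."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    latin_american_hits = sum(w in LATIN_AMERICAN_MARKERS for w in words)
    european_hits = sum(w in EUROPEAN_MARKERS for w in words)
    if latin_american_hits > european_hits:
        return "latin_american"
    if european_hits > latin_american_hits:
        return "european"
    return "unknown"  # no markers found, or a tie
```

Translations muddy the water further, since a translated book takes on whichever variety its translator writes in.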
Gabriel García Márquez vs Carlos Ruiz Zafón
These 2 literary heavy hitters are likely the names you'll see most often when it comes to Spanish literature - Márquez for his timeless classics "Cien Anos De Soledad" and "El Amor En Los Tiempos Del Colera" (among others), and Zafón for his more recently published, widely translated, best-selling "El Cementerio de los Libros Olvidados" (Cemetery of Forgotten Books) series.
Both are absolute masters of story-telling, and if you're a Spanish learner, reaching the level needed to read their books is a reward in itself.
So, how do these 2 authors' arguably most famous books stack up difficulty-wise?
Zafón seems to have the wider spread, with "El Juego Del Angel" (whose difficulty, as I already mentioned, is underestimated in this analysis, but still directionally correct within this particular context) being the easiest book to dip into, and his "La Sombra Del Viento" being hardest by a decent margin.
Márquez's easiest read is "Cronica De Una Muerte Anunciada" - which, word for word, isn't markedly less difficult than "Cien Anos" or "El Amor" in my opinion, but it is a pretty quick read, and so I think it's a good intro to Márquez's work.
Word Length: Barely a Factor, Doesn't Vary Much Between Books
Something that I thought would be a better predictor of difficulty - as well as have a lot more variation - was the length of words, both unique words and total words. While Unique Word Length varied a bit (between 6.5 and 8.17 letters), Word Length was pretty dang similar across the board, from a minimum of 4.06 letters to a max of 4.74 letters.
While Unique Word Length does show some correlation with difficulty (.433, a bit above the average correlation of .329 for this analysis), Word Length generally appears to be a rubbish predictor, with a correlation of -.260.
Wanna Improve Your Vocab and/or Like Pain?
As I mentioned previously, I felt that my lack of vocabulary held me back from really getting into La Sombra Del Viento. That said, I picked up a ton of new words and phrases - I can now look at the highlights of that book and understand about a third of them, thanks to my compulsive habit of putting every interesting word into Anki and reviewing them until they're deeply lodged in my brain.
For those who might also enjoy putting themselves through such a masochistic exercise, here is a list of the books in order of unique word count:
Don Quixote comes out on top again, with La Sombra Del Viento coming out 4th, and the charming little kids' book "El Principito" (The Little Prince) having the fewest unique words.
Bear in mind that conjugations of the same word family count as unique words - so "hago" (I do) and "hice" (I did) will count as 2 different words, for example.
My takeaway from this little experiment is this: while there's a diminishing return on picking up the harder books along the difficulty spectrum (looking up 7 words per sentence would dampen the enthusiasm of even the most dedicated learner), the usefulness of going down the difficulty scale doesn't seem to diminish in the same way.
As an example, for this blog post I read through and finished both of the graded readers, as well as El Principito - all of which are on the easier side - and while I may not have drastically increased my Spanish vocabulary, I got into a flow with these books that I believe transferred into my speaking, resulting in slightly (but perceptibly) more fluid, free-flowing speech.
Picking a less difficult book allows you to read more quickly and easily, reinforcing more grammar structures and words per minute spent reading. It's a different learning focus, but I'd argue that it's no less valid than exposing yourself to a barrage of new vocab, provided that the content is interesting.
Consequently, I'll be selecting a slightly easier Spanish book the next time around. Don Quixote and his windmill jousting adventures can wait.
The ftfy and textstat Python libraries were very helpful in converting and analysing the text. ftfy was particularly helpful: it saved me from having to wrestle with encoding/decoding Spanish characters for another dozen hours.
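To show the kind of encoding mess involved, here is a standard-library-only illustration of one common case: UTF-8 text that was wrongly decoded as Latin-1, so "ñ" turns into two junk characters. This is not how ftfy works internally and it only handles that single case; the function name is hypothetical, and a dedicated library copes with far more variations.

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return the text unchanged otherwise."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this particular kind of damage, so leave the text alone.
        return text


if __name__ == "__main__":
    # Simulate the damage, then repair it.
    garbled = "Cien Años De Soledad".encode("utf-8").decode("latin-1")
    print(garbled)
    print(repair_mojibake(garbled))
```

Text that was never damaged falls through the except branch and comes back untouched, which keeps the function safe to run over a whole book.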
Over to You
Want to see a particular Spanish book added to this analysis to see how it fares? Any other metrics I should include? Let me know!