Are there any language detection tools for assigning language to music data?

Music is a matter of taste, and some of us have, how should I put it, different ideas of what is good music and what is trash that should never have seen the light of day. For a few years now, I have been a huge fan of Chinese Hip Hop and Rap (哈狗帮龙井说唱 and 龍胆紫) and of Chinese pop, such as 周杰伦, 張震嶽 (who also happens to be a rapper) and, being a romantic, 光良. Earlier, I mostly played bands like The Smiths. So, in a way I am, according to Seth Stephens-Davidowitz, an outlier. Indeed, his study of Spotify data showed that

“[…] tastes are formed between the ages of 11 and 14, while an average man’s music tastes are virtually cemented between the ages of 13 and 16.”

That being said, it is always interesting to see which types of songs have been popular over time, regardless of one’s taste, or to confirm (solely to yourself) that you are one of the very few to have good taste in music. When I was younger, there wasn’t very much you could do about it unless you subscribed to a wide range of specialized monthly magazines, sat in your cellar with your Commodore 64 and built databases stored on datasette (for those who have no clue about what a datasette is, go there) that were impossible to visualize. As fun as that may seem (to the nostalgic), this was neither cheap nor accessible to everyone. So, in other words, NO! It wasn’t better back in the day. Today, it’s an entirely different story. Spotify makes much of its data available to the general public via APIs. I strongly recommend that people interested in playing with their data and creating personal apps take a look at their services, especially Spotify for Developers. It’s neat, easy to work with and enabled me to access all my personal data in minutes.

So, as I wrote above, visualizing billboard data in the past was a drag, partly because accessing global and historical data was almost out of the question. Even if you managed to get hold of that data (without ruining yourself to the point of not being able to afford a computer to use it), tools to visualize it intelligibly simply did not exist. In this blog post, I want to present a few of the myriad ways to visualize data from a Spotify dataset that was released a few months ago and created by Yamac Eren Ay.

Just to get a feel for what the dataset contains, here is the list of features:

  • id (Id of track generated by Spotify): the code retrieved when you copy the URI to share a track (e.g. spotify:track:0F02KChKwbcQ3tk4q1YxLH)

Numerical:

  • acousticness (Ranges from 0 to 1)
  • danceability (Ranges from 0 to 1)
  • energy (Ranges from 0 to 1)
  • duration_ms (Integer typically ranging from 200k to 300k)
  • instrumentalness (Ranges from 0 to 1)
  • valence (Ranges from 0 to 1)
  • popularity (Ranges from 0 to 100)
  • tempo (Float typically ranging from 50 to 150)
  • liveness (Ranges from 0 to 1)
  • loudness (Float typically ranging from -60 to 0)
  • speechiness (Ranges from 0 to 1)
  • year (Ranges from 1921 to 2020)
  • mode (0 = Minor, 1 = Major)
  • explicit (0 = No explicit content, 1 = Explicit content)

Categorical:

  • key (All keys on octave encoded as values ranging from 0 to 11, starting on C as 0, C# as 1 and so on…)
  • artists (List of artists mentioned)
  • release_date (Date of release mostly in yyyy-mm-dd format, however precision of date may vary)
  • name (Name of the song)

The dataset was created to build a recommendation engine for new songs based on similarities with songs previously listened to. In this blog post, we are not going to use it that way. We are, however, going to add some information to this dataframe and make minor changes to facilitate the creation of visualizations. First of all, we can observe that the artist names are given as a list, e.g.

import pandas as pd

Spotify = pd.read_csv(r"C:/Users/......./data.csv")
Spotify['artists']
0                              ['Carl Woitschach']
1         ['Robert Schumann', 'Vladimir Horowitz']
2                          ['Seweryn Goszczyński']
3                             ['Francisco Canaro']
4         ['Frédéric Chopin', 'Vladimir Horowitz']
                            ...                   
169904                      ['DripReport', 'Tyga']
169905          ['Leon Bridges', 'Terrace Martin']
169906                       ['Kygo', 'Oh Wonder']
169907               ['Cash Cash', 'Andy Grammer']
169908                          ['Ingrid Andress']
Name: artists, Length: 169909, dtype: object

This is a little inconvenient, because we might want to look at individual artists when visualizing the data, so a first step is to get rid of all unnecessary characters. While we’re at it, we can also rename the feature “name” to “Title”, since it refers to a piece’s title.

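# Strip the leading "['" and the trailing "']" left over from the list representation
Spotify['artists'] = Spotify['artists'].str.lstrip("['").str.rstrip("']")
# regex=False makes the separator a literal string, whatever the pandas version
Spotify['artists'] = Spotify['artists'].str.replace("', '", ",", regex=False)
Spotify = Spotify.rename(columns={'name': 'Title'})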

Which returns

0                           Carl Woitschach
1         Robert Schumann,Vladimir Horowitz
2                       Seweryn Goszczyński
3                          Francisco Canaro
4         Frédéric Chopin,Vladimir Horowitz
                        ...                
169904                      DripReport,Tyga
169905          Leon Bridges,Terrace Martin
169906                       Kygo,Oh Wonder
169907               Cash Cash,Andy Grammer
169908                       Ingrid Andress
Name: artists, Length: 169909, dtype: object
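
Stripping characters works for the rows shown above, but it relies on every name being wrapped in single quotes. A minimal sketch of an alternative, assuming each raw artists cell is a valid Python list literal, is to parse the string with literal_eval from the ast module and join the names afterwards. The helper name parse_artists and the second test string are made up for illustration.

```python
from ast import literal_eval

def parse_artists(raw):
    """Turn a string like "['A', 'B']" into the list ['A', 'B'].

    Assumption: the cell is a valid Python list literal. Anything that
    cannot be parsed is returned as a one-element list with the raw text.
    """
    try:
        parsed = literal_eval(raw)
    except (ValueError, SyntaxError):
        return [raw]
    if isinstance(parsed, (list, tuple)):
        return list(parsed)
    return [str(parsed)]

# A row from the dataset, and a made-up row whose name contains an apostrophe
print(",".join(parse_artists("['Robert Schumann', 'Vladimir Horowitz']")))
print(parse_artists('["Guns N\' Roses"]'))
```

On the dataframe, this could be applied with Spotify['artists'].map(parse_artists) before joining the names with a comma.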

Adding a couple of language features: Can it be done in a satisfactory way?

Now, taking a closer look at both titles and artist names, one will discover that many languages and nations are represented in this dataset. Comparing, for instance, the popularity of songs (on Spotify) might then be problematic, especially if one considers that Spotify is not available in some countries such as China. People there use 酷狗音乐 (Kugou, which I use as soon as I am in China) or QQ音乐 (QQ Music), among others, both owned by Tencent and both with the lyrics feature that Spotify unfortunately still hasn’t launched. So, let’s try to append language information to the data. One idea is to try to match artist names with a language. However, this is not always optimal, since many Asian artists might go by “Western” names. A couple of examples are:

  1. 陶喆 (Táozhé), a romantic pop singer who also goes by the name David Tao (which is the name found in the dataset).
  2. 光良 (Guāngliáng), THE romantic pop artist, best known for his absolutely brilliant songs 童话 (Fairy Tale) and 第一次 (First Time), and also known as Michael Wong, which is the name given in the dataset.
  3. 王力宏 (Wánglìhóng), aka Wang Leehom, an ABC (American-born Chinese).

However, a way to maximize the chances of discovering them in the data is to also look for Chinese titles because, even though their names might be given in Western form, their songs are almost always in Chinese. Song titles for David Tao are, for instance, 就是愛你 (I just love you) and 愛我還是他 (Do you love me or him?), and a couple of titles for Wang Leehom are 唯一 (The only one) and 你不知道的事 (The things that you don’t know).

I was actually asking myself why no one who had been working on this dataset had come up with the idea of appending a language feature. I mean, I cannot have been the only one to realize that comparing the popularity of two artists (however global the world might be) cannot be done unless you consider 1) the market availability of the artist’s work and 2) given this, how many potential listeners that artist might have (i.e. popularity could be weighted in some way). So, I checked the Spotify Web Developer API to see if that feature was available to the public. The answer is “No”. Market availability is a feature that can be extracted, but it doesn’t really help our purposes. Since I thought it really would be a neat thing to have, I gave myself the task of doing it. Little did I know what it would imply.

First attempt: Langdetect

In order to do so, we’ll use the library langdetect in the following way: we first define a function, try_detect, that is applied to the “artists” and “Title” columns and returns the two-letter ISO 639-1 code for a language, or None if no language is found. But a two-letter code can be hard to interpret if the language is not common, so we then use the library pycountry to turn the two-letter code into the full language name. Note that “ZH-CN” is a language code with a region suffix rather than a plain two-letter code, so its name is returned as None and needs to be manually replaced by “Chinese”.

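import numpy as np
from langdetect import detect
import pycountry

def try_detect(cell):
    # detect() raises an exception when it finds nothing to work with in the text
    try:
        detected_lang = detect(cell)
    except Exception:
        detected_lang = None
    return detected_lang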

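Spotify['Artist language'] = Spotify['artists'].apply(try_detect)
Spotify['Artist language'] = Spotify['Artist language'].str.upper()

Spotify['Title language'] = Spotify['Title'].apply(try_detect)
Spotify['Title language'] = Spotify['Title language'].str.upper()

# Assumption: Spotify_w_l, which is used below but never defined in this post,
# is a copy of the dataframe that still carries the raw language codes
Spotify_w_l = Spotify.copy()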

Artist_Languages = Spotify_w_l['Artist language'].unique()
LANGA = []
for lang in Artist_Languages:
    try:
        Lang = pycountry.languages.get(alpha_2=lang).name
    except:
        Lang = None
    LANGA.append(Lang)
    
Title_Languages = Spotify_w_l['Title language'].unique()
LANGT = []
for lang in Title_Languages:
    try:
        Lang = pycountry.languages.get(alpha_2=lang).name
    except:
        Lang = None
    LANGT.append(Lang)

d1 = {'Artist_Language':Artist_Languages, 'Artist_Language_name':LANGA}
ART_LANGUAGE_NAMES = pd.DataFrame(d1)
ART_LANGUAGE_NAMES['Artist_Language_name'] = np.where(ART_LANGUAGE_NAMES['Artist_Language'] == 'ZH-CN', 'Chinese', ART_LANGUAGE_NAMES['Artist_Language_name'])

d2 = {'Title_Language':Title_Languages, 'Title_Language_name':LANGT}
TITLE_LANGUAGE_NAMES = pd.DataFrame(d2)
TITLE_LANGUAGE_NAMES['Title_Language_name'] = np.where(TITLE_LANGUAGE_NAMES['Title_Language'] == 'ZH-CN', 'Chinese', TITLE_LANGUAGE_NAMES['Title_Language_name'])

Spotify_Art = Spotify_w_l.merge(ART_LANGUAGE_NAMES, right_on = 'Artist_Language', left_on = 'Artist language', how ='left')
Spotify = Spotify_Art.merge(TITLE_LANGUAGE_NAMES, right_on = 'Title_Language', left_on = 'Title language', how ='left')
Spotify = Spotify.drop(['Title language', 'Artist language'], axis = 1)
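
The reason “ZH-CN” needs a manual replacement is that it is a language code plus a region. A hedged sketch of how that special case could be generalized is to cut the tag at the first hyphen before the lookup. The helper name base_language_code is hypothetical, and its result would still have to be passed to pycountry as in the code above.

```python
def base_language_code(tag):
    """Return the primary language part of a tag, e.g. 'ZH-CN' -> 'zh'.

    Assumption: tags look like 'KO', 'ZH-CN' or 'zh_TW'; None stays None.
    """
    if not tag:
        return None
    return tag.replace("_", "-").split("-")[0].lower()

# The codes that show up in this post, plus a missing value
for tag in ["ZH-CN", "ZH-TW", "KO", "HR", None]:
    print(tag, "->", base_language_code(tag))
```

The price of this shortcut is that the distinction between ZH-CN and ZH-TW is lost, which may or may not matter for a given visualization.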

Just to make sure that we got this right, let’s do a little check. Let’s look at the subset

Spotify_check = Spotify_w_l[Spotify_w_l['artists'].str.contains('Jay Chou')]

Spotify_check['Title'].unique()

array(['黑色幽默', '愛在西元前', '回到過去', '最長的電影', '半島鐵盒', '園遊會', '髮如雪', '聽媽媽的話',
'以父之名', '退後', '彩虹', '東風破', '藉口', '千里之外', '我不配', '蒲公英的約定', '屋頂',
'暗號', '不能說的秘密', '給我一首歌的時間', '軌跡', '妳聽得到', '黑色毛衣', '菊花台', '告白氣球',
'世界末日', '上海一九四三', '她的睫毛', '止戰之殤', '外婆', '伊斯坦堡', '分裂', '最後的戰役',
'斷了的弦', 'Mojito', '晴天', '安靜', '七里香', '擱淺', '龍捲風', '簡單愛', '一路向北',
'珊瑚海', '星晴', '可愛女人', '夜曲', '青花瓷', '稻香', '開不了口', '楓', '說好的幸福呢'],
      dtype=object)

Spotify_check['Title language'].unique()

array(['ZH-CN', 'KO', 'ZH-TW', 'HR'], dtype=object)

As you can see from the above list, this doesn’t look particularly promising. Apart from ZH-CN and ZH-TW (Jay Chou is not from Mainland China, so it makes sense to recognize titles written in traditional Chinese as Taiwanese), we also have titles recognized as Korean (KO) and ... Croatian (HR). So, it seems to be a no-go there.

Another big issue with langdetect is speed. The Spotify dataset contains fewer than 170 000 rows, and yet it took langdetect 21 hours to match titles and artists to languages.
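
Part of that run time comes from calling the detector once per row, even though many artist names (and some titles) repeat. A sketch of one way to cut the number of calls, assuming one result per distinct string is acceptable, is to detect each distinct value once and look the result up afterwards. The names detect_unique and fake_detect are made up here; fake_detect only stands in for try_detect so that the example runs without langdetect.

```python
def detect_unique(values, detect_fn):
    """Apply detect_fn once per distinct value and map the results back."""
    cache = {}
    results = []
    for value in values:
        if value not in cache:
            cache[value] = detect_fn(value)
        results.append(cache[value])
    return results

calls = []

def fake_detect(text):
    # Stand-in for try_detect: records each call and returns a dummy code
    calls.append(text)
    return "xx"

artists = ["Jay Chou", "Jay Chou", "Kygo,Oh Wonder", "Jay Chou"]
languages = detect_unique(artists, fake_detect)
print(languages)   # one result per row
print(len(calls))  # number of detector calls actually made
```

With pandas, the same idea amounts to building a dict over Spotify['artists'].unique() and passing it to .map(). How much time this saves depends on how many values repeat, which I have not measured.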

Second attempt: polyglot

polyglot is often described as one of the best language detection packages available. So why didn’t I go for that one directly? Well, simply because it is not easy to install on Windows. Actually, it took me about 3 hours to get it right. Using

pip install polyglot

will return something like this

File "C:\.......\anaconda3\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4941: character maps to <undefined>

Downloading polyglot and installing it manually doesn’t help and returns the same error. A little research will lead you to the realization that you need to download two wheels:

  1. PyICU.whl
  2. PyCLD2.whl

and install them manually using pip. The first thing is to determine which wheel to use. To do so, you need to know which Python version you are using. I use Python 3.6. Go to https://www.lfd.uci.edu/~gohlke/pythonlibs/ to find the right wheel. cp refers to the CPython version you are using; in my case I would choose cp36 and amd64.

PyICU‑2.3.1‑cp27‑cp27m‑win32.whl
PyICU‑2.3.1‑cp27‑cp27m‑win_amd64.whl
PyICU‑2.3.1‑cp35‑cp35m‑win32.whl
PyICU‑2.3.1‑cp35‑cp35m‑win_amd64.whl
PyICU‑2.3.1‑cp36‑cp36m‑win32.whl
PyICU‑2.3.1‑cp36‑cp36m‑win_amd64.whl
PyICU‑2.3.1‑cp37‑cp37m‑win32.whl
PyICU‑2.3.1‑cp37‑cp37m‑win_amd64.whl

Download the right version and save it in the folder where all your Python libraries are. If you do not know where that is, choose one library you know is installed, e.g. pandas, and do the following:

import pandas
print(pandas.__path__)
#returns
['C:\\Users\\.....\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\pandas']

Once both PyICU and PyCLD2 are installed, you are good to go. To detect languages, import the following modules:

import polyglot
from polyglot.detect import Detector

Example

Let’s see what Detector returns:

French = "Moi je n'était rien et voila qu'aujourd'hui, je suis le gardier de ses nuits"
detector = Detector(French)
print(detector.language)

name: franska     code: fr       confidence:  98.0 read bytes:  1010

Chinese = """忘了有多久
 再没听到你
 对我说你 最爱的故事

我想了很久
 我开始慌了
 是不是我又做错了什么
 你哭着对我说
 童话里都是骗人的"""

detector = Detector(Chinese)
print(detector.language)

name: kinesiska   code: zh       confidence:  99.0 read bytes:  1937

So, it seems to be working perfectly, as it returns the name of the language (here in Swedish, as my PC is set up in Swedish) and the confidence with which the language was detected. Note that it can also detect several languages in the same string.

Chinese_eng = """忘了有多久再没听到你 对我说你 最爱的故事 我想了很久 我开始慌了 是不是我又做错了什么 你哭着对我说 童话里都是骗人的 I forgot how long it's been
Since I last heard you Tell me your favorite story
I have thought for a long time
I'm starting to panic
Wondering if I've done something wrong again
"""

detector = Detector(Chinese_eng)
print(detector.language)

name: engelska    code: en       confidence:  51.0 read bytes:  1093
name: kinesiska   code: zh       confidence:  48.0 read bytes:  1937
name: un          code: un       confidence:   0.0 read bytes:     0

So, let’s apply this to our dataset. As we said, we want to be able to append a language to both artist names and titles. One thing to note is that polyglot returns not only the language but also the confidence level with which that language is associated with the text it read. We want to separate those, which gives the following piece of code for the artist part:

import icu

Spotify['artists'] = Spotify['artists'].astype(str)
Spotify['Artist origin with reliability'] = Spotify['artists'].apply(lambda x: Detector(x, quiet=True))
Spotify['Artist Language'] = Spotify['Artist origin with reliability'].apply(lambda x: icu.Locale.getDisplayName(x.language.locale))
Spotify['Artist Language Confidence'] = Spotify['Artist origin with reliability'].apply( lambda x: x.language.confidence)

At first, everything looked OK, but as a sanity check, I took a subset for the artist “Jay Chou” and polyglot actually returned this:

       artists Artist Language
6141   Jay Chou      klingonska
6204   Jay Chou      klingonska
6314   Jay Chou      klingonska
6892   Jay Chou      klingonska
14344  Jay Chou      klingonska
14587  Jay Chou      klingonska
14644  Jay Chou      klingonska
14762  Jay Chou      klingonska
22499  Jay Chou      klingonska
22802  Jay Chou      klingonska

Now, as much as I like Star Trek, you really have to wonder how that association was even possible. You’d think that the first possible association of the word “Chou” would be with the French language (cabbage) before you even look to Klingon. In other words, that was somewhat of a drag. So, let’s see how polyglot performs on Chinese titles. It did actually do well on a longer text, as we saw in the examples.

Spotify['Title'] = Spotify['Title'].astype(str)
Spotify['Title origin with reliability'] = Spotify['Title'].apply(lambda x: Detector(x, quiet=True))
Spotify['Title Language'] = Spotify['Title origin with reliability'].apply(lambda x: icu.Locale.getDisplayName(x.language.locale))
Spotify['Title Language Confidence'] = Spotify['Title origin with reliability'].apply( lambda x: x.language.confidence)

The result was unfortunately the same thing. Klingon!

What I came to conclude was that there was a major difference between the examples I fed the different language detection packages and the content of the artist name and title columns. The only real difference was length. If polyglot and langdetect are not fed enough text, they perform rather poorly.

BUT! Google Translate does very well on very short pieces of text AND even asks whether the text should be translated from a language that it has apparently detected. Could this be used in an easy way?

Third attempt: googletrans

Installing googletrans is thankfully an easier task than installing polyglot. So far so good ... until I hit the first speed bump, before even trying to detect a language:

import googletrans
from googletrans import Translator
translator = Translator()
translator.translate('我是法国人.')  #I am French

returned

AttributeError                            Traceback (most recent call last)
<ipython-input-209-8b5d0763567e> in <module>
      1 from googletrans import Translator
      2 translator = Translator()
----> 3 translator.translate('我是法国人.')

~\AppData\Local\Continuum\anaconda3\lib\site-packages\googletrans\client.py in translate(self, text, dest, src, **kwargs)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\googletrans\client.py in _translate(self, text, dest, src, override)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\googletrans\gtoken.py in do(self, text)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\googletrans\gtoken.py in _update(self)

AttributeError: 'NoneType' object has no attribute 'group'

Now, there are hundreds of pages on GitHub and Stack Overflow discussing the issue. I changed code in both the gtoken.py and client.py files, with no luck. This seemed to be an unresolved issue, and I was just about to give up hope of ever doing anything valuable when I found a very satisfying solution, namely a new module: google_trans_new

Fourth and last attempt: google_trans_new

google_trans_new is a completely new package released by lushan88a in November 2020. As the name indicates, it builds on Google’s translation engine. Installing it was a walk in the park, so that simply made me happy. The first step was to make sure it didn’t fail on the same thing as googletrans.

import google_trans_new
from google_trans_new import google_translator  
  
detector = google_translator()  
detect_result = detector.detect(u"""首先感谢我的父母他们对我的关爱每粉每一秒对我包容的心态
感谢他们对我无微不至的培养 让快乐与温馨陪伴我的成长 我感谢我的老师对我的教导 感谢他们教我人生怎样去起跑""")

detect_result
['zh-CN', 'chinese (simplified)']

So, it did well on longer text. What about short ones? Just as fine. I then decided to apply it to the Spotify dataset.

detector = google_translator()  

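def detect_language(x):
    # Named detect_language so that it does not shadow langdetect's detect() imported earlier
    try:
        detected_language = detector.detect(x)
    except Exception:
        detected_language = None
    return detected_language

Spotify['Title language'] = Spotify['Title'].apply(detect_language)
Spotify['Artist name language'] = Spotify['artists'].apply(detect_language)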

We previously tested outputs for the artist Jay Chou and got neither the Title language nor the Artist name language right. This time, we have

        artists  Title                 Title language Artist name language
6141   Jay Chou   黑色幽默  [zh-CN, chinese (simplified)]         [ar, arabic]
6204   Jay Chou  愛在西元前  [zh-CN, chinese (simplified)]         [ar, arabic]
6314   Jay Chou   回到過去  [zh-CN, chinese (simplified)]         [ar, arabic]
6892   Jay Chou  最長的電影  [zh-CN, chinese (simplified)]         [ar, arabic]
14344  Jay Chou   半島鐵盒  [zh-CN, chinese (simplified)]         [ar, arabic]
14587  Jay Chou    園遊會  [zh-CN, chinese (simplified)]         [ar, arabic]
14644  Jay Chou    髮如雪  [zh-CN, chinese (simplified)]         [ar, arabic]
14762  Jay Chou  聽媽媽的話  [zh-CN, chinese (simplified)]         [ar, arabic]
22499  Jay Chou   以父之名  [zh-CN, chinese (simplified)]         [ar, arabic]
22802  Jay Chou     退後  [zh-CN, chinese (simplified)]         [ar, arabic]

As mentioned previously, artist names are often not representative of their origin, so classifying Jay Chou as Arabic is not a big issue. However, the title language is spot on this time! Still, working with columns of lists is not optimal, so before doing anything else, we need to make new columns containing the language code (ISO code) and the full language name. The best way to do this is to use literal_eval from the ast module (this assumes that the lists are stored as strings, e.g. after saving the dataframe to CSV and reading it back):

from ast import literal_eval

c = ['Artist name language ISO','Artist name language FULL']
Spotify[c] = pd.DataFrame(Spotify['Artist name language'].map(literal_eval).tolist())

d = ['Title language ISO','Title language FULL']
Spotify[d] = pd.DataFrame(Spotify['Title language'].map(literal_eval).tolist())
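
literal_eval only accepts strings, so the two lines above fail if a cell still holds a real list or the None that is stored when detection fails. A minimal sketch of a more forgiving splitter, assuming each cell is a [code, name] pair, its string form, or a missing value (split_detection is a hypothetical helper, not part of any library):

```python
from ast import literal_eval

def split_detection(cell):
    """Return (code, full_name) from a detection result, or (None, None)."""
    if isinstance(cell, str):
        try:
            cell = literal_eval(cell)
        except (ValueError, SyntaxError):
            return (None, None)
    if isinstance(cell, (list, tuple)) and len(cell) == 2:
        return (cell[0], cell[1])
    # None, NaN or anything else unexpected
    return (None, None)

print(split_detection("['zh-CN', 'chinese (simplified)']"))
print(split_detection(["ar", "arabic"]))
print(split_detection(None))
```

The resulting tuples can be fed to pd.DataFrame(...) in the same way as the lists above.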

which produces (I removed columns for better visuals)

As you can see, it is not optimal. In the first row, the title Fangdanguillo was returned as Chinese, but is obviously Spanish (as is rightly pointed out in the artist name language column). This is due to the fact that Google Translate even detects Chinese in pinyin form (romanization of Chinese characters), and fang, fan, dang, dan, gui and lo (actually, 咯 luò) are all sounds in Chinese. This in itself is a terrible sign, because google_trans_new reading ANYTHING that resembles pinyin as Chinese results in titles that aren’t obviously English being assigned to Chinese. I discovered that when trying to compute title-language-specific rankings of songs by doing the following:

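# Dense rank of popularity within each title language (ties share a rank)
Spotify['rank'] = Spotify.groupby('Title language FULL')['popularity'].rank(method='dense', ascending=True)
# transform keeps the original row index, so the result can be assigned back directly
Spotify['Language Specific Popularity'] = Spotify.groupby('Title language FULL')['rank'].transform(lambda x: 1 + 100*(x - x.min())/(x.max() - x.min()))
Spotify['Language Specific Popularity'] = Spotify['Language Specific Popularity'].round(0)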
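
To see what the rescaling does, here is a small worked example in plain Python rather than pandas. It assumes one language group with made-up popularity values and applies the same dense ranking and the same formula, 1 + 100*(x - min)/(max - min). The function names are illustrative only.

```python
def dense_rank(values):
    """Rank values ascending; ties share a rank and no rank is skipped."""
    order = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]

def rescale(ranks):
    """Apply 1 + 100 * (x - min) / (max - min) to a list of ranks."""
    lo, hi = min(ranks), max(ranks)
    if hi == lo:
        # A group with a single rank would divide by zero; return None instead
        return [None for _ in ranks]
    return [round(1 + 100 * (r - lo) / (hi - lo)) for r in ranks]

popularity = [10, 50, 50, 90]   # made-up values for one language group
ranks = dense_rank(popularity)
print(ranks)           # [1, 2, 2, 3]
print(rescale(ranks))  # [1, 51, 51, 101]
```

Note that with this formula the top song of each group gets 101 rather than 100, and a group containing a single popularity value has no defined score (pandas gives NaN there).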

And looking specifically at the Chinese subset, I found that:

As you can see, the sounds chan, lie, cha sin, bang, you (a tough one there!!), la and tu are all classified as Chinese, which in turn makes Lady Gaga, One Direction and the rapper Polo G all Chinese. Furthermore, and here I have no explanation, digits are seen as Chinese characters.
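
One cheap sanity check for this failure mode, sketched here under the assumption that a title labelled Chinese should contain at least one Chinese character, is to look at the Unicode names of the letters in the title. This is a heuristic added for illustration, not something google_trans_new provides, and the function names and the last two test titles are made up.

```python
import unicodedata

def scripts_in(text):
    """Return the set of scripts used by the letters in text.

    The script is approximated by the first word of each character's
    Unicode name, e.g. 'CJK', 'LATIN' or 'HANGUL'. Digits, spaces and
    punctuation are ignored.
    """
    scripts = set()
    for char in text:
        if char.isalpha():
            name = unicodedata.name(char, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def plausibly_chinese(title):
    """True if the title contains at least one CJK ideograph."""
    return "CJK" in scripts_in(title)

for title in ["黑色幽默", "Mojito", "Bang Bang", "1999"]:
    print(title, scripts_in(title), plausibly_chinese(title))
```

CJK ideographs are shared with Japanese, so this check can only reject a Chinese label for a title; it cannot confirm one.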

Conclusion

This blog post was originally intended as a preliminary part of a visualization of Spotify data with various modules. To make this interesting enough and worth reading, I had the intention of adding more data to the dataset. Given the amount of data from artists worldwide, I thought that assigning languages would add value to my work. Little did I know that it would be an almost impossible task. The fact that this had not been done before should have made me suspicious enough not to go down that route, but I did it because I had to. If nothing else, it gave me, and hopefully you as well, an insight into the different language detection modules available in Python.

The lesson learned? If anything can be learned from this experience, it is the fact that most language detection packages demand (at least for now) a sufficient amount of text to perform sufficiently well. As I pointed out earlier, they perform quite well on longer texts that contain very clear sentences, even if languages are mixed. However, many languages share similarities and in some cases, if words are not mixed with other words that are clearly unique to a language, confusion can occur.

I also hope that it gave some of you a few recommendations on Chinese pop and rap artists; it is always fun to discover new things! Here is a playlist: BlogPostRecommendation

Have I given up my plans for language-based visualization of Spotify data? No. But what I will do is make lists of artists from different countries and filter them out of the dataset I have. This way, there will be no doubt about how a language is assigned to the data.

Until then! Listen to music and enjoy yourselves!
