Friday, July 14, 2006

Very Easy Blogger Categories [Internationalisation]

As can be seen in the comments to my previous posts, I have been under quite some pressure to support multiple character sets in my Very Easy Blogger Categories system.

I now try to automatically detect the character set that you are using when I crawl your page. This allows me to output correctly escaped characters in the category lists that I place on your page.
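
A minimal sketch of what such detection could look like, assuming the page advertises its character set in a meta tag. The function name detect_charset and the UTF-8 fallback are hypothetical choices for illustration, not the actual crawler code:

```php
<?php
// Hypothetical helper: guess a page's character set from its HTML.
// Assumes the charset is declared in a <meta> tag; falls back to $default.
function detect_charset(string $html, string $default = 'UTF-8'): string
{
    // Matches <meta http-equiv="Content-Type" content="text/html; charset=...">
    // as well as the shorter <meta charset="..."> form.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

$page = '<html><head><meta http-equiv="Content-Type" '
      . 'content="text/html; charset=iso-8859-7"></head></html>';
echo detect_charset($page), "\n"; // ISO-8859-7
```

A real crawler would probably also look at the HTTP Content-Type header, which this sketch ignores.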

Aside: The PHP htmlentities function supports only a limited number of character sets. As a workaround, I use the following code to correctly escape text in (hopefully) any character set: htmlentities(mb_convert_encoding($text, 'UTF-8', $charset), $quotestyle, 'UTF-8').
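
The one-liner above can be wrapped in a small helper. This is only a sketch: the name escape_for_charset is made up for illustration, and defaulting the quote style to ENT_QUOTES is an assumption rather than what the live script necessarily does:

```php
<?php
// Hypothetical wrapper around the one-liner from the aside.
// $charset is the character set detected for the source page.
function escape_for_charset(string $text, string $charset, int $quoteStyle = ENT_QUOTES): string
{
    // Step 1: convert the raw bytes from the page's charset to UTF-8.
    $utf8 = mb_convert_encoding($text, 'UTF-8', $charset);

    // Step 2: let htmlentities work on UTF-8, which it does support.
    return htmlentities($utf8, $quoteStyle, 'UTF-8');
}

// "\xED" is the single ISO-8859-1 byte for "í".
echo escape_for_charset("Pol\xEDtica", 'ISO-8859-1'), "\n"; // Pol&iacute;tica
```

Note that htmlentities only replaces characters that have a named HTML entity. Anything else is passed through as UTF-8 bytes, so the result is only safe to drop into a page that is itself served as UTF-8.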

This is an entirely server-side change and does not require any changes to your templates or blog posts. Please let me know how it works by adding comments to this post.



  • Hello David.

    With the changes you made it is a bit better, but not all characters appear correctly.

    Is there any problem with the ISO-8859-7 Greek encoding? This is the encoding I use!

    Does your script support windows-1253 encoding or not?

    Thank you again in advance!

    By Blogger Jimbo, At 7:23 pm  

  • It should support all encodings (well, all of those supported by the PHP mb_convert_encoding function), so I don't know why some of your characters are having problems. Are you able to try switching to windows-1253 to see if that helps? (You will need to allow 24 hours for my crawler to pick up on the change, though.)

    By Blogger David Nicholson, At 8:55 pm  

  • I will try and tell!

    By Blogger Jimbo, At 12:01 pm  

  • David, I tried Windows-1253 encoding but it is only worse!

    At least with ISO-8859-7 only the characters with an accent have problems.

    For example:

    ε is ok, έ is not ok!

    Do you have any idea to help?

    By Blogger Jimbo, At 6:52 pm  

  • Did you leave it for 24 hours after changing to Windows 1253? It might be that your blog was advertising Windows 1253 but my script was still sending ISO-8859-7 because it hadn't recrawled your page since the change.

    By Blogger David Nicholson, At 7:33 pm  

  • I tried it in a second blog I have, where I do all my testing.

    I changed the encoding of the blog to windows-1253 and then made a new post in Greek.

    By Blogger Jimbo, At 8:17 pm  

  • Dimitris, I am really stuck. Basically the mb_convert_encoding function is not treating your characters correctly. (I do not do any of the hard work on that front.) The only way around it that I can see is for me to actually write a function to do your character conversions, but I really don't have time to do that for every possible character set that may have problems. If you want to pursue the problem further, you could try asking around in PHP forums how people correctly escape HTML characters in your character set. If you discover any working code, let me know and I could implement it for you... Sorry that I cannot be of much more help.

    By Blogger David Nicholson, At 9:21 pm  

  • David thank you for your response. I will take a look around and see if there is any solution.

    If I find something I will let you know.

    Thank you!

    P.S.: Keep in mind that here in Greece no one has yet managed to have Blogger categories, so if you/we find a solution, most of the Greek blogosphere will use your code.

    By Blogger Jimbo, At 9:26 pm  

  • David, also take a look at this:

    html entities with character set iso-8859-7

    You may find something useful...

    By Blogger Jimbo, At 9:30 pm  

  • The workaround that is described as the solution to that bug is actually the code that I am using, so that is not going to help :(

    Could you try using UTF-8 as the encoding for your blog? I think that should support Greek characters and have better support by my code.

    By Blogger David Nicholson, At 10:32 pm  

  • Unfortunately I tried this, but when I do it all the letters show up as a "?".

    So I will go on searching hoping to find a solution! Thank you David!

    If you come up with any idea tell me!

    By Blogger Jimbo, At 10:46 pm  

  • I don't see why categorization is done server side; it's totally doable client side. Check this

    By Anonymous Anonymous, At 7:52 pm  

  • eslam, your suggestion may well work for people who are paranoid about having a 3rd party control display of their category list. However it does have some quite serious disadvantages:

    Each page load has some quite hefty JavaScript in it (although this could be cut down by linking to a script file instead).

    Each page load also involves downloading and processing the blog's entire atom.xml file. I am not quite sure of the details of how this file works, but either it will not store all posts for a blog or it will be massive.

    Posts have to be in non-standard HTML (because of the category tag).

    By Blogger David Nicholson, At 11:20 pm  

  • Hi David, I am using your script and it worked well until one week ago. I now have problems with certain category names (Ateísmo, Política); I think the accent is the cause of these problems. I don't know if you can look at my blog and confirm it. I like to write my blog with orthographic correctness, and the accent is part of that. Could you do something for the Spanish bloggers??? :) Thanks.

    By Blogger Israel F.F., At 7:17 am  

  • Israel F.F., I have made some changes that should solve the problem that you were having. Let me know if I have broken anything else!

    By Blogger David Nicholson, At 7:38 pm  

  • Thank you so much David. I have to wait until the unwanted category "Pol" goes away, but I think everything is now working properly. Thanks again.

    By Blogger Israel F.F., At 12:48 am  

