Translating HTML files

ProZ.com Translation Article Knowledgebase


By sylver | Published 03/12/2004 | Localization and Globalization
Quicklink: http://hin.proz.com/doc/143
Author: sylver, Hong Kong, English to French translator

Translating Web sites

Today, being able to translate HTML is crucial, for obvious reasons, and just about every translator will accept HTML files. Yet, although it's not politically correct to mention this here, the truth is that many translators don't know enough about HTML and web sites to do a professional job.

There are LOTS of good HTML tutorials around, but they are all intended for wannabe webmasters or even professional webmasters, and they skip important issues a translator should be aware of. I hope this article fills the gap and helps you do a better job.

If you are already familiar with HTML, keyword handling and style sheets, go straight to “How to translate HTML” for more on preparing an HTML file for translation and doing the translation itself.

HTML issues
(Basic and not so basic)

What is HTML and how does it work? HTML stands for HyperText Markup Language. Hypertext is text characterized by the presence of links. Take a book: you read from the beginning and move toward the end. With hypertext, you can get to the information you are looking for immediately by clicking on links.

An HTML file is a simple text file with an “htm” or “html” extension. Try the following experiment: take a simple text file, “whatever.txt”, and rename it to “whatever.htm”. Double-click on it and it will display in your default web browser. You will note that there are no links. There is no bold, no underlining, no tables, no pictures and not even paragraph breaks.

HTML is the “language” that you use to tell the browser (Internet Explorer, Netscape, Mozilla, Opera...) how the page should be displayed and what it should do in different situations (when the user clicks on a link, the browser finds the linked page and displays it, for instance). To do that, it uses “markup”. A markup element, or tag, is a small piece of code that provides this information. In HTML, tags are made of a “<” sign, some code and a “>” sign. Case is not important.

For instance, “<b>” tells the browser that whatever follows that tag should be displayed in bold. Now, unless you want everything to be displayed in bold, there must be another tag to tell the browser where it should stop displaying the text in bold. That tag is “</b>”. Note the “/” sign. The tag triggering the bold display (<b>) is called an opening tag. The tag canceling the action of the opening tag (</b>) is called a closing tag. There are tags for just about every formatting option: italics, underline, color, size… You will find them very easily on the net.

There are other types of tags in an HTML document. For instance, there are tags describing the structure of the page and its general behavior. An HTML page is usually structured as follows:

<HTML> (To tell the browser that this page is in HTML)
<HEAD> (Header. Contains information about the page that will not be displayed, but can nevertheless influence the display.)
</HEAD> (Closes the “<head>” tag. Most tags should be opened and closed.)
<BODY> (The actual page. This is what you see when you open the page in the browser)
</BODY> (Closing tag for <body>)
</HTML> (Closing tag for <html>)

You need not change the structure tags when you translate.

Another type of tag is the Meta tag. Meta tags are located in the header and give information about the page, such as keywords, a description of the page, the author and the copyright, used mostly by search engines. You will need to translate the contents of some of these tags. Bearing in mind that these tags are mostly intended for search engines, you have to translate the keywords and description using the words that people will actually use to find the web site. It’s not a matter of just translating them literally.

You have to think a little bit about which terms are applicable to the page and will be the most popular. You are likely to find misspellings in the Meta tags. They are there on purpose, so that people who misspell their search terms in the search engine find the page anyway. If so, misspell too. Google listed the misspellings it found for “Britney Spears”. There are hundreds, and they have been searched for by thousands of people, so misspellings of popular search terms can add up to significant traffic.

If you find well-thought-out descriptions and several typos in the Meta tags, be extra careful, for this is evidence that your customer has attempted some search engine optimization, and perhaps paid a lot of money to do so. Don’t ruin it.
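Before translating, it can help to pull the keywords and description out of the header so that you can look at them side by side. The following is only a minimal sketch in Python, using the standard html.parser module; the class name MetaCollector and the sample header are invented for this illustration and are not part of any CAT tool.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects the content of the keywords and description Meta tags."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("keywords", "description"):
            self.found[name] = attributes.get("content") or ""


# Hypothetical page header, invented for this example.
sample_page = """<html><head>
<meta name="keywords" content="blue widget, blue widgets, blu widget">
<meta name="description" content="Hand-made blue widgets at low prices.">
</head><body></body></html>"""

collector = MetaCollector()
collector.feed(sample_page)
for name, content in collector.found.items():
    print(name, "->", content)
```

Note the deliberate misspelling “blu widget” in the sample keywords: this is the kind of entry that should be carried over as a misspelling rather than corrected.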

There is one other important item in the Meta tags: the charset. It tells the browser which character set is used in the page. If you translate from a language whose character encoding is different from that of your target language, you may have to change the encoding for the page to display properly. Here is what that Meta tag looks like:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

The TITLE tag, <title>, sits in the header, and its content shows in the title bar of the web browser when you display the page. THIS is the single most important piece of text in your web page. Why? Because search engines value it above everything else when they analyze the page. “Welcome to Whatever.inc” is probably the most stupid title you can come up with. A title should contain the keywords that will be used to find the page. If the page talks about blue widgets, the title should have “blue widget” in it! Now, of course, you are translating. That means you have to follow the original web page, and if the original title is “Welcome to Whatever.inc”, then keep it. But if you can see that the author has put some thought into the title to include keywords in a specific sequence, give it some thought yourself.

Links. In HTML, a link looks like this:

<a href="http://www.website.com" title="Good web site"> Web Site </a>

“a” stands for “anchor”, and “href” tells the browser where that “anchor” points (here, “http://www.website.com”). “title” gives a title for the link, so that when you pass the mouse over the link, a small note will display (“Good web site”, in this example). You have to translate it. “Web Site” is the text of the link. You may or may not have to translate it. “</a>” is the closing tag.

Images. Although you see images in web pages, they are not really inside the HTML document. It’s a simple text file, right? In fact, you have a tag that tells the web browser where the picture is stored and how to display it (what size, with or without a border, where on the screen…). The image tag is <img src="http://www.website.com/image.jpg" alt="Picture of a blue widget">. It has no closing tag. You should not change the image tag except for the content of the “alt” attribute. “Alt” stands for “alternate text”.

In the early days of the Internet, many browsers were not able to display pictures, or it was too slow, so many users disabled the pictures to surf faster. To enable those users to understand what picture should be there, the alt text is displayed instead. In some browsers, the alt text also shows when you move the mouse over the image, even if the image is displayed. You have to translate it.

The “alt” and “title” attributes are usually loaded with keywords for the search engines. If this is the case, make sure that the translation is, too.

HTML has evolved a lot from the first version. Nowadays, a web designer can decide exactly the size of the text, create styles (a concept similar to styles in a word processor – more on that later), set the position and so on. But in the early days, HTML was much more frugal.

The web was used for text. You had a series of tags to identify the document’s hierarchy, called the “heading tags”: <h1>, <h2>, <h3>… and their closing tags, </h1>, </h2>, </h3>. H1 is the main heading. It's big, bold, often too big, in fact. H2 is a secondary heading, slightly smaller. H3 is smaller again... You get the idea.

Although there are much better ways in current HTML to arrange the display, the H tags have remained and are used by search engines when they analyze a page, the rationale being that if a word is in a heading, it is more relevant to the page content. This is the main reason why many web sites still use those tags even if that means a little bit more work. As a translator, these tags tell you that you are translating a heading, and its position in the document's hierarchy.

They are also a warning that you have to pay attention to the words inside these tags. Exactly: keywords. Usually, you will see the same keywords used in the H tags and in the “keywords” Meta tag. Make sure that you use the same keywords. Search engines analyze, amongst other things, the number of times a specific keyword appears compared to the total number of words in the page, and where it appears. Try to keep the same proportion as in the original document, and if a keyword is in a heading, make sure your translation keeps a keyword in that same heading.
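The “proportion” mentioned above is simply the number of times the keyword appears divided by the total number of words. A minimal Python sketch of that ratio follows; the function name keyword_density and the two sample sentences are invented for the illustration, and the text is assumed to have had its tags removed already. Real search engines weigh many more factors than this.

```python
import re


def keyword_density(text, keyword):
    """Return occurrences of keyword divided by the total word count."""
    words = re.findall(r"\w+", text.lower())
    keyword_words = re.findall(r"\w+", keyword.lower())
    if not words or not keyword_words:
        return 0.0
    size = len(keyword_words)
    # Slide a window over the text and count exact matches of the keyword.
    hits = sum(
        1
        for i in range(len(words) - size + 1)
        if words[i:i + size] == keyword_words
    )
    return hits / len(words)


# Hypothetical source and target sentences.
source = "Blue widgets for sale. Our blue widgets are the best widgets."
target = "Widgets bleus à vendre. Nos widgets bleus sont les meilleurs widgets."

print(round(keyword_density(source, "blue widgets"), 2))
print(round(keyword_density(target, "widgets bleus"), 2))
```

If the two figures are far apart, the translation has probably diluted or overloaded the keyword compared to the original.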

For the same reason, HTML contains a number of redundant tags, like <b> and <strong>, and old ones that you almost never see anymore, like “<big>” (self-explanatory, I think). Look out for these. It is too easy to concentrate on the “standard” <b>, <i>... and forget to handle those old things. You may need to move them, too.

Next, styles and style sheets. A “style” is a series of attributes defined in advance, either in the header of the document, or in a separate file called a style sheet.

To understand styles, you need to understand what problem they solve:

Suppose you want the big titles in your web site to be bold, blue, and centered. In good old HTML, you would write:

<h1><b><center><font color="blue">Title 1</font></center></b></h1>
<h1><b><center><font color="blue">Title 2</font></center></b></h1>
<h1><b><center><font color="blue">Title 3</font></center></b></h1>
<h1><b><center><font color="blue">Title 4</font></center></b></h1>
...
<h1><b><center><font color="blue">Title 356</font></center></b></h1>

Pretty clumsy, isn’t it? And that's just a few simple attributes. The solution is to define a style with all these specifications (it’s bold, it’s blue, it's centered) and give it a name, e.g. bbc (for Bold Blue Centered; just an example, a style is normally named so that one easily remembers what it is). Then, you don't need to write it all out every time. In the header of the page, you write:

<style type="text/css">
<!--
.bbc {
text-align: center;
font-weight: bold;
color: blue;
}
-->
</style>

Then, anytime you have a title, you write:

<h1 class="bbc">Title 1</h1>
<h1 class="bbc">Title 2</h1>
<h1 class="bbc">Title 3</h1>

But the best part is that if, after all is done, you decide that it would be nicer in red, or that italics would be cool, you don’t have to look all over the document and change every tag. You simply change one word in the style definition and every instance changes at once. This not only saves a lot of time when you design the page, but also makes the page smaller, and thus faster to load.

Now, if you want to use a style in several pages, or even the whole site, you have to copy the same styles in the header of each page. Not too smart. The solution was to write all the styles in a separate file, called a style sheet, then to link each page to the style sheet. That way, you write the styles only one time, and in each page, you have a link in the header that looks like this:

<link href="/stylesheet.css" rel="stylesheet" type="text/css">

A style sheet file’s extension is “*.css”. Now, as a translator, this is relatively important to know because it determines how the text will be displayed and where. The same page can look completely different with and without the style sheet. With experience, you can look at the source code and “see” the page (No, this ain’t the Matrix yet ;-). That helps a lot, because you don’t need to check out the page in the browser every few minutes.

Anyway, this should cover the basic HTML you need to translate. When you get a bit more time, pick one of the many HTML tutorials on the Web and learn about tables and frames.

How to translate HTML

There are two reliable, proven methods and many wrong methods. Amongst the wrong methods, the most popular are:
• Opening the HTML file in Word, working there and using “Save as a web page”. This changes the code and turns it into a complete mess that is twice the size of the original page, causes no end of display issues and is about as popular with search engines as a dead cat at a wedding. If you want to hear a knowledgeable customer scream, go ahead.
• Translating in other WYSIWYG (What You See Is What You Get) editors. They usually mess up the code as well, although I don’t know of any as bad as Word in that respect, save perhaps FrontPage. Dreamweaver is an exception to that rule, but a costly one if you are simply translating.
• Using translation software that hides the tags. That can be very attractive for beginners, but if you understood the section above properly, you will see why this is not a good solution at all. An example of such software is Catscraddle. It is very smooth but will cause problems because you don't know what is what, and sentences are cut midway if the page uses formatting. If it did the job correctly, I would be the first to use it because I love the interface and it's very fast. Unfortunately, the basic concept is VERY flawed and if you want to do a professional job, just don’t.

The correct methods include:
• Opening the page in an HTML editor, preferably one that supports color coding of the tags. There are many free ones. I like AceHTML very much, but it's far from the only one available. Either way, translate the text and move the tags as needed. For example:

English: John’s <i>girlfriend</i> is quite cute.
French: La <i>petite amie</i> de John est plutôt mignonne.

As you can see, you have to decide where the tags should be in the target language.
Working that way can be a pain, but if you know your code and are careful, the output will be irreproachable. However, you must stay very alert not to forget or erase tags by mistake.
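Since forgetting or erasing a tag is the main risk of this method, a quick automated comparison of the tags in the source and in the translation can serve as a safety net. The Python sketch below is only an illustration: the function names tag_names and tag_differences are invented here, and the check compares tag names only, so it will not notice a tag that is present but in the wrong place.

```python
import re
from collections import Counter

# Captures the optional "/" and the tag name; attributes are ignored.
TAG_PATTERN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tag_names(text):
    """Count every opening and closing tag in text, by name."""
    return Counter(
        f"<{slash}{name.lower()}>" for slash, name in TAG_PATTERN.findall(text)
    )


def tag_differences(source, target):
    """Return (missing, extra): tags lost from or added to the target."""
    source_tags = tag_names(source)
    target_tags = tag_names(target)
    missing = list((source_tags - target_tags).elements())
    extra = list((target_tags - source_tags).elements())
    return missing, extra


source = "John's <i>girlfriend</i> is quite cute."
good = "La <i>petite amie</i> de John est plutôt mignonne."
bad = "La <i>petite amie de John est plutôt mignonne."

print(tag_differences(source, good))  # nothing missing, nothing extra
print(tag_differences(source, bad))   # the closing </i> is reported missing
```

Attributes are left out of the comparison on purpose, because “alt” and “title” values are supposed to change in the translation.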

• Preparing the file, then using a CAT like Wordfast or Trados to translate it, then restoring the HTML format. Not all CATs work the same way, but remember that professional handling of web site translation *requires* quick access to the tags. The ability to move, edit or delete tags is not optional; it’s a must. With Trados, you can also use TagEditor, although you may miss the flexibility that comes with working in Word. Moving or deleting tags can be quite clumsy in TagEditor.

Preparing the text for translation:
1. What are tagged files?

What do I mean by “Preparing the text for translation”? For translation purposes, there are 2 types of tags:

• Tags that you may need to move or edit and that are/could be located in the middle of a segment

• Tags that you will almost never change and that are not (or should not be) in the middle of a segment

Overall, there are very few tags that you may need to delete during the translation process.

“Preparing files” means modifying the files so that they can be translated easily using a CAT. What follows is a description of a file prepared for Wordfast/Trados, a “tagged file” in translator lingo. Since Trados is/was widely used, most professional CATs can handle this type of file, with more or less success. However, if you own and use another CAT (SDLX, DV,…), please check your CAT's documentation. As you will use a CAT to work on the tagged file, I assume that you are familiar with the basic concepts. (If not, please read the following pages of this web site before going further: “What are CATs?” and “First translation”)

A tagged file is an RTF file containing the source code (meaning tags + text) of the original HTML file. The tags are identified using 2 styles: tw4winInternal and tw4winExternal. Without getting into details, the tw4winInternal style is red, and the tw4winExternal style is light grey. Whenever you receive a file with tags in red and grey, it’s almost a given that the file has been tagged. Although the handling is very similar, beware that HTML files are not the only tagged files: many more exotic formats are tagged for use with CATs, like SGML, XML, QuarkXPress, FrameMaker, etc.

All tags are protected against deletion by default, to keep you from deleting one by mistake. Tags that you may need to move, like <b> (bold), are in tw4winInternal, “internal” because they will be included in the segment you have to translate. They are in red. Tags that you don't need to change or to be concerned about during the translation process, like <p> (paragraph mark) or <body>, are in tw4winExternal and are in grey. A tag in tw4winExternal style will end a segment automatically.

Here is an example:

Correct (<b> and </b> in tw4winInternal, </p> in tw4winExternal): You are learning to translate <b>Web Sites</b></p>Bla bla bla

By now, you should know that “Web Sites” is in bold, and that the </p> shows the end of a paragraph. When you open that sentence with Wordfast (or Trados), the segment will end just after the </b>, although there is no period, because </p> is in tw4winExternal style.

Incorrect (<b> in tw4winExternal): You are learning to translate <b>Web Sites</b></p>Bla bla bla

(The segment would stop right after “translate”).

Incorrect (</p> in tw4winInternal): You are learning to translate <b>Web Sites</b></p>Bla bla bla

(The segment would include everything).

Incorrect (no tag styles applied): You are learning to translate <b>Web Sites</b></p>bla bla bla

(The segment would include everything and the tags are not protected).
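To make the segmentation rule concrete, here is a toy Python model of it: external tags end a segment, internal tags stay inside it. This is not how Wordfast or Trados are implemented. The set EXTERNAL_TAGS and the function names are assumptions made for the illustration, and the sketch ignores the normal end-of-sentence rules (periods and so on) that a real CAT also applies.

```python
import re

# Assumption: a small, illustrative set of tags treated as "external".
EXTERNAL_TAGS = {"p", "body", "html", "head", "h1", "h2", "h3", "br"}

# The capturing group makes re.split() keep the tags in its result.
TAG_PATTERN = re.compile(r"(</?[a-zA-Z][^>]*>)")


def is_external(tag):
    """Tell whether a tag such as </p> belongs to the external set."""
    name = re.match(r"</?([a-zA-Z][a-zA-Z0-9]*)", tag).group(1).lower()
    return name in EXTERNAL_TAGS


def split_segments(source):
    """Split source code into segments, ending one at each external tag."""
    segments, current = [], ""
    for piece in TAG_PATTERN.split(source):
        if TAG_PATTERN.fullmatch(piece) and is_external(piece):
            if current.strip():
                segments.append(current.strip())
            current = ""
        else:
            # Plain text and internal tags both stay in the segment.
            current += piece
    if current.strip():
        segments.append(current.strip())
    return segments


line = "You are learning to translate <b>Web Sites</b></p>Bla bla bla"
for segment in split_segments(line):
    print(segment)
```

With the sample line, the first segment keeps its <b> and </b> tags and stops at </p>, which matches the “Correct” case above.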

2. Tagging an HTML file?

If you open the source code of virtually any HTML file, you will see there are a LOT of tags, so changing the styles manually is just not workable. You need to use another program to tag (prepare) the file. It’s rather easy to do for HTML and other relatively common formats like XML and SGML. My personal preference goes to a program called Rainbow (freeware). There are other possibilities, like +Tools (also freeware).

The process is rather simple and well explained in the documentation of both programs, so I won’t belabor it. In Rainbow (once installed), you click on “Add”, select the HTML files you need to prepare, go to the Tools menu, select “Prepare for translation”, fill out the needed options, and under the “Package” tab, select where the tagged files should be created.

Some of it may look complex, but frankly it’s a no-brainer when all you have to do is prepare an HTML file.

Find your files, open the RTF file in Word, and you are ready to translate.

3. Translating a tagged file.

This depends on your CAT. In Wordfast, start the translation as usual, with your TM and glossaries, the lock bolt on the door, gaffer tape across the neighbor’s kid’s mouth, Mozart playing (or AC/DC, your call), … whatever your set-up usually is when you translate. ;-)

Tags in tw4winInternal are treated as placeables. You can select them in the source segment using “Ctrl + Alt + Left/Right”, and “Ctrl + Alt + Down” will copy the selected tag into the target segment at the insertion point. Type your translation in the target and bring down the tags at the appropriate points in the target sentence.

Use the tags to work out what the text will look like, and do not hesitate to refer to the original HTML file when in doubt. As explained before, keep keywords in mind and balance the text to match the original’s proportions as closely as possible. (Of course, if the page is not meant for the general public but for an intranet, that becomes much less important.)

Please refer to the “tagged files” section of your Wordfast manual. In summary, you have to make sure that you do not forget tags (Wordfast has settings to remind you), and that you keep the internal tags in the tw4winInternal style and the translatable text in whatever style was originally used.

Example:

You are translating an <b>HTML</b> file!
Vous êtes en train de traduire un fichier <b>HTML</b> !

4. Done. Now what?

When your translation is done and the file cleaned (meaning all source segments and segment delimiters have been deleted), you have a nice… RTF file. If neither the source nor the target language requires Unicode and you do not have special characters in the file, save it as txt (or copy all the code into Notepad) and change the extension to “*.htm” or “*.html”. If you use a language that requires Unicode (Chinese, Japanese, Russian, Thai,...), save the file with the appropriate encoding and modify the charset information in the file header to reflect the new encoding (e.g. UTF-8). See the HTML links to find out more about encodings and file formats.
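If you would rather not do the re-saving by hand, the two operations (saving with a new encoding and updating the charset declaration) can be scripted. The following Python sketch is one possible way to do it, not a feature of any CAT. The function names set_charset and convert_file are invented here, and the pattern assumes that the first “charset=” in the file is the one in the Meta tag.

```python
import re

# Matches "charset=" followed by an encoding name such as iso-8859-1.
CHARSET_PATTERN = re.compile(r"(charset\s*=\s*)[\w-]+", re.IGNORECASE)


def set_charset(html_text, new_charset):
    """Replace the first charset value declared in the page."""
    return CHARSET_PATTERN.sub(
        lambda match: match.group(1) + new_charset, html_text, count=1
    )


def convert_file(source_path, target_path, old_encoding, new_encoding):
    """Re-save a page in a new encoding and update its charset declaration."""
    # newline="" keeps the line endings of the original file untouched.
    with open(source_path, encoding=old_encoding, newline="") as source_file:
        text = source_file.read()
    with open(target_path, "w", encoding=new_encoding, newline="") as target_file:
        target_file.write(set_charset(text, new_encoding))


meta = '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
print(set_charset(meta, "utf-8"))
```

A call such as convert_file("page.txt", "page.htm", "iso-8859-1", "utf-8") would then produce the final file; both file names are placeholders for your own.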

If you have respected the tags, the file should look about right in the browser. However, the translation is seldom the same length as the original text, so you may have to make a few adjustments to make it fit nicely. If you are lucky, everything can stay the same.

You are through. I hope this information will help you tackle HTML files in a professional manner and feel confident with them. As you can see, there is nothing really hard in HTML files, but they do require some extra attention. If it's HTML, it's not just text.

At times, the client wants you to translate the text with no consideration for the HTML or a potential use on the net. That’s all right. If so, skip everything and ask the client to provide a regular *.doc file, or open the HTML in Word and save it as *.doc.

Good luck. ;-)

Sylver

*This article is courtesy of www.your-translations.com. You can find more articles there on CATs, Word, ...



Articles are copyright © ProZ.com, 1999-2025, except where otherwise indicated. All rights reserved.
Content may not be republished without the consent of ProZ.com.