Looking for a tool
Thread poster: Brandis (X)
Brandis (X)
Local time: 13:23
English to German
+ ...
Sep 23, 2004

Hi all! I am searching for a tool with which the complete (source) content of a website can be extracted; the format is of course .html. I have various websites here (automobiles, medical, etc.), and I thought a tool like this would be wonderful, especially for going about pre-planned TMs and developing the target content in the course of time. I shall appreciate all help.
Regards,
Brandis


 
Judy Rojas  Identity Verified
Chile
Local time: 07:23
Spanish to English
+ ...
Try WebReaper Sep 23, 2004

Hi:
Try WebReaper. You can download it at http://www.webreaper.net/download.html
Regards,
Ricardo


 
Brandis (X)
Local time: 13:23
English to German
+ ...
TOPIC STARTER
I know WebReaper Sep 23, 2004

Ricardo Martinez de la Torre wrote:

Hi:
Try WebReaper. You can download it at http://www.webreaper.net/download.html
Regards,
Ricardo
Hi! I know this tool already and am using others, but what I am searching for is a tool with a source-terminology extraction function that works across multiple webpages pertaining to one topic or product, with a view to building professional TMs. But thank you. A closer description is Trados TagEditor, where one can extract terminology from multiple bilingual files; I am in search of something similar, only as a separate tool.
brandis

[Edited at 2004-09-23 01:13]


 
Luciano Monteiro  Identity Verified
Brazil
Local time: 14:23
English to Portuguese
+ ...
Fusion Sep 23, 2004

Hello Brandis

You might like to try Fusion. It has a terminology feature that I think would suit your needs.

Best regards,

Luciano Monteiro


 
Marc P (X)  Identity Verified
Local time: 13:23
German to English
+ ...
Website retrieval and translation Sep 23, 2004

Here's one way of doing it:

First, retrieve the web site with wget. For example, if you want to retrieve the OmegaT web site at www.omegat.org/omegat/omegat.html, you enter:

wget http://www.omegat.org/omegat/omegat.html -r -p

on the command line. The -r option causes folders to be saved recursively (i.e. sub-folders will be saved); the -p option causes any files needed for complete display of the pages to be saved.

Then you create a new project in OmegaT and place all the files you have downloaded in the /source folder of that project exactly as you downloaded them, i.e. with the same folder structure. (You can of course create the empty project first, then on the command line, switch to the /source folder, and then download the web site into it directly.) When you have finished translating the html files in OmegaT, compiling the project in OmegaT will reproduce the structure with the translated files in the /target folder.
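For example, the download-directly-into-the-project variant might look like this on the command line (a rough sketch; the project folder name here is made up for illustration):

# after creating the empty project in OmegaT, change to its /source folder
cd ~/omegat-projects/website/source
# download the site into it: -r for recursion, -p for page requisites
wget http://www.omegat.org/omegat/omegat.html -r -p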

Get wget from:

http://wget.sunsite.dk/

and OmegaT (latest version 1.4.3 is just out, September 2004) from:

http://sourceforge.net/projects/omegat

wget and OmegaT both run on Linux and Windows.

Marc


 
Brandis (X)
Local time: 13:23
English to German
+ ...
TOPIC STARTER
Thank you Sep 23, 2004

Luciano Monteiro wrote:

Hello Brandis

You might like to try Fusion. It has a terminology feature that I think would suit your needs.

Best regards,

Luciano Monteiro
But Fusion doesn't cover the website localisation aspect directly; one would need further instrumentation to reproduce a target website mirroring the source. Additionally, Fusion limits term extraction to .doc files only. One could certainly convert .html files to .doc files and process them further, but the work involved is not feasible if one does it industrially. For large documents or multiple documents, Fusion in that sense is probably the best there is.
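(For what it's worth, the tag-stripping part of such a conversion can be roughed out on the command line; this is only a sketch that produces plain text, which a word processor would still have to save as .doc for Fusion:)

# crude batch conversion: strip the HTML tags from every page
for f in *.html; do
    sed 's/<[^>]*>/ /g' "$f" > "${f%.html}.txt"
done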
Rgds,
Brandis


 
Piotr Bienkowski  Identity Verified
Poland
Local time: 13:23
English to Polish
+ ...
Try SDLX Sep 23, 2004

But Fusion doesn't cover the website localisation aspect directly; one would need further instrumentation to reproduce a target website mirroring the source. Additionally, Fusion limits term extraction to .doc files only. One could certainly convert .html files to .doc files and process them further, but the work involved is not feasible if one does it industrially. For large documents or multiple documents, Fusion in that sense is probably the best there is.
Rgds,
Brandis


SDLX can handle web formats: HTML and HTML-like files (this week I was translating chunks of HTML files with incomplete HTML code, which its web formats filter accepted happily), and many other formats, including XML and SGML, as well as RC and some programming language files.

It will not download a website for you, but other than that it can handle translation of tagged files pretty well.

And until Sept. 30 it is available at half price.

For more information go to http://www.sdl.com/intltransday

HTH

Piotr


 
Brandis (X)
Local time: 13:23
English to German
+ ...
TOPIC STARTER
I have SDLX Sep 23, 2004

syntaxpb wrote:

But Fusion doesn't cover the website localisation aspect directly; one would need further instrumentation to reproduce a target website mirroring the source. Additionally, Fusion limits term extraction to .doc files only. One could certainly convert .html files to .doc files and process them further, but the work involved is not feasible if one does it industrially. For large documents or multiple documents, Fusion in that sense is probably the best there is.
Rgds,
Brandis


SDLX can handle web formats: HTML and HTML-like files (this week I was translating chunks of HTML files with incomplete HTML code, which its web formats filter accepted happily), and many other formats, including XML and SGML, as well as RC and some programming language files.

It will not download a website for you, but other than that it can handle translation of tagged files pretty well.

And until Sept. 30 it is available at half price.

For more information go to http://www.sdl.com/intltransday

HTH

Piotr
I was probably not clear in my posting. I was in fact looking for a free/shareware tool solely for the purpose of extracting single-word web content. If you know of any, I shall be thankful for all help. Regards, brandis


 
Piotr Bienkowski  Identity Verified
Poland
Local time: 13:23
English to Polish
+ ...
Terminology lists? Sep 24, 2004

Brandis wrote:

Piotr
I was probably not clear in my posting. I was in fact looking for a free/shareware tool solely for the purpose of extracting single-word web content. If you know of any, I shall be thankful for all help. Regards, brandis

Do you mean websites that contain terminology lists from different areas? If so, I don't think there is a universal tool for this specific task, because these lists can be in different formats, e.g. an HTML table, separate paragraphs, or lists (ordered and unordered).
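For instance, each format would need its own ad-hoc extraction, something like the following rough one-liners (glossary.html is a made-up file name, GNU grep is assumed, and nested tags would break them):

# pull the contents of table cells, then of list items, dropping the tags
grep -o '<td>[^<]*</td>' glossary.html | sed 's/<[^>]*>//g'
grep -o '<li>[^<]*</li>' glossary.html | sed 's/<[^>]*>//g'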

Piotr


 
Brandis (X)
Local time: 13:23
English to German
+ ...
TOPIC STARTER
I do not mean that Sep 24, 2004

syntaxpb wrote:

Brandis wrote:

Piotr
I was probably not clear in my posting. I was in fact looking for a free/shareware tool solely for the purpose of extracting single-word web content. If you know of any, I shall be thankful for all help. Regards, brandis


Do you mean websites that contain terminology lists from different areas? If so, I don't think there is a universal tool for this specific task, because these lists can be in different formats, e.g. an HTML table, separate paragraphs, or lists (ordered and unordered).

Piotr

Hi! Again a small correction. This could be any website. Take metal-working websites, for example: there may be anywhere from 100 to a few thousand of them, and all use some standard terminology in their product presentations or descriptions on the web. If one could extract that type of content so as to build a monolingual glossary initially, then switch to the target-language websites and compare, one would have a field-specific glossary, I guess. It is that kind of a tool I am looking for. So far, Fusion (which doesn't process .html files) offers a wonderful term extraction facility based on the files fed to it, whereas other tools actually require you to do the translation in order to generate a TM. My search is hence two-fold: term extraction (monolingual) using a functionality like the one in Fusion, but extracting from websites. In my case, the outsourcer either indicates the website or sends me the website for local processing, and I start with Trados, as I cannot process these sites directly in Fusion despite its term extraction ability. Sometimes my outsourcer gives me a TM covering 5-10% of the file and fights over the price. Another point is that most web content is a global publication (see KudoZ; mostly you see web references), so the idea, I guess, is obvious now.
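(A minimal sketch of the kind of monolingual extraction I mean, assuming the pages have already been downloaded, e.g. with wget as suggested above; the tag stripping is crude and the cut-off of 100 candidates is arbitrary:)

# strip tags, split the text into lowercase words, count and rank them
cat *.html | sed 's/<[^>]*>/ /g' | tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -n 100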
Regards,
Brandis


 

