How to import DGT memories in WFP 6.5?
Thread poster: Hans Lenting
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Nov 15, 2021

How can I import DGT memories in WFP 6.5? Doesn't WFP accept the language codes?

Screenshot 2021-11-15 at 10.26.07


 
Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 02:43
Member (2006)
English to Afrikaans
+ ...
Where is the file? Nov 15, 2021

German Dutch Engineering Translation wrote:
How can I import DGT memories in WFP 6.5? Doesn't WFP accept the language codes?

Could you share the link to the DGT memory you're trying to import?


 
Milan Condak
Milan Condak  Identity Verified
Local time: 02:43
English to Czech
JRC = Joint Research Centre Nov 15, 2021

Samuel Murray wrote:

German Dutch Engineering Translation wrote:
How can I import DGT memories in WFP 6.5? Doesn't WFP accept the language codes?

Could you share the link to the DGT memory you're trying to import?


https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory

DGT-TM-release 2021

Size = size of my downloaded files

Vol_2020_1.zip 127MB = 129 830 415 B

Vol_2020_2.zip 126MB = 128 665 329 B

Vol_2020_3.zip 112MB = 114 775 491 B

Vol_2020_4.zip 111MB = 113 422 381 B

Vol_2020_5.zip 107MB = 109 838 898 B

Total size 583MB

http://optima.jrc.it/Resources/DGT-TM-2021/Vol_2020_1.zip

Etc. 1 to 5.
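
For convenience, the five download links can be generated from the pattern above. This is only a small sketch: it assumes that volumes 2 to 5 follow the same naming as the Vol_2020_1.zip link, and the sizes are the rounded MB figures listed in this post.

```python
# Base URL taken from the link above; the numbering 1..5 is assumed
# to follow the same pattern for all volumes.
BASE_URL = "http://optima.jrc.it/Resources/DGT-TM-2021/"

urls = [f"{BASE_URL}Vol_2020_{n}.zip" for n in range(1, 6)]

# Rounded sizes in MB as listed in the post, used to check the total.
sizes_mb = [127, 126, 112, 111, 107]
total_mb = sum(sizes_mb)

for url, size in zip(urls, sizes_mb):
    print(f"{url}  (~{size} MB)")
print(f"Total: ~{total_mb} MB")
```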

https://www.proz.com/forum/czech/353644-dgt_tmx_za_rok_2020_je_konečně_ke_stažení.html

Cheers,

Milan


 
Multiverse Solutions s.r.o. (X)
Multiverse Solutions s.r.o. (X)
Local time: 02:43
Polish to English
+ ...
Slightly off-topic but not really Nov 15, 2021

To verify the reliability of the resources, I extracted the file 22020D0799 from the fifth volume of the archive into the EN-FR pair. Here is what we get:

1. The TMX file is segmented into paragraphs corresponding with the original document layout.
This approach is good for archival purposes, but has limited usability for translation needs.
We prefer sentence-to-sentence translation memory files. One, they are easier to search for a specific text string. Two, the search process itself is less demanding on the system. Three, single-sentence matches displayed in the side panels of CAT programs are easier on the eye and mind. Four, learning from paragraph-sized chunks of text is not really comfortable for the translator.

2. In general, the whole text appears to be nicely segmented. Until you check it.

When you download the source (PDF) documents and put them side-by-side, the following discrepancies will come up:
Missing:
– numbering of document units (sections, annexes)
– document unit names (article, annexes)
– whole segments; nicely moved by the generator to the end of the file – the problem is that these were taken from various places within the document

The whole DGT archive is a machine product, so there is no reason to expect human perfection from it. However, these limitations may suggest the existence of other discrepancies or, possibly, errors in the alignments.

Apart from this, the source (EU) documents occasionally have linguistic or alignment errors, from simple typos to complex rephrasing that disables text mirroring across languages. Some of these problems are derived from human actions (no in-depth correction), some are technical and impossible to bypass.
On top of this, we came across a number of supposedly EU-approved aligned translations with blatant bias of the translating persons. These included political colouring, offensive naming of parties (not present in the source), reframing circumstances, adding and removing texts, and more. Amazing.

Conclusions:
Automated alignment will always quickly produce huge volumes of resources, but their quality is not to be trusted (irrespective of the source). In other words, suggestions extracted from such resources by CAT systems need to be re-examined during the process.
We have found this limitation counter-productive. The mental effort (distraction) and extra time spent on re-reading inserts do not justify the ‘profit’ of getting seemingly free TMs.

Solution:
We have designed a quick workflow to align texts as and when needed. Partly manual alignment retains text attributes, produces accurate aligned segments, reduces TM garbage, and gives the aligners the opportunity to learn something more than clicking or selecting from prefabricated chunks of text. The net result is that our human translation speed increased to about 8,000 words per hour, with real-time human correction integrated in the process.

Personally, I would not even bother to extract / convert / generate TMs from sources like DGT (which is a great archive, but nothing more): unknown reliability of the alignment + loss of time for flooding TMs with whatever comes in = no, no.

Also, TM files get corrupted, so adding huge unchecked volumes of something may not be healthy. In the example above, 466 EN segments are ‘aligned’ with 438 FR segments, so 28 segments are missing, shuffled or omitted. Multiply that by the number of documents in the 2020 files (2,445) and you get almost 70,000 unreliable segments. If you like excitement, that may be a good start.
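
A check like the one described can be roughed out in a few lines of Python. The sketch below counts non-empty segments per language code in a TMX file, so a mismatch such as 466 versus 438 shows up immediately. It assumes the language code sits in either a `lang` or an `xml:lang` attribute on each `<tuv>`; the function name and the sample data are made up for illustration and are not taken from the DGT files.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def count_segments_by_language(tmx_source):
    """Count non-empty <seg> elements per language code.

    tmx_source may be a file path or a binary file-like object.
    """
    counts = Counter()
    # iterparse streams the file, so large TMX files are not loaded at once
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        for tuv in elem.iter("tuv"):
            lang = tuv.get(XML_LANG) or tuv.get("lang") or "?"
            seg = tuv.find("seg")
            if seg is not None and "".join(seg.itertext()).strip():
                counts[lang.upper()] += 1
        elem.clear()  # drop the children of this translation unit
    return counts


# Tiny made-up sample: the second unit has no French side.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header/><body>
<tu><tuv lang="EN-GB"><seg>One</seg></tuv><tuv lang="FR-FR"><seg>Un</seg></tuv></tu>
<tu><tuv lang="EN-GB"><seg>Two</seg></tuv></tu>
</body></tmx>"""

counts = count_segments_by_language(io.BytesIO(sample))
gap = counts["EN-GB"] - counts["FR-FR"]
print(dict(counts), "difference:", gap)

# The extrapolation from the post: 28 segments per document, 2,445 documents.
estimated_unreliable = 28 * 2445
```

The extrapolation assumes every document has the same gap as the one sample, so it is an order-of-magnitude guess rather than a measurement.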


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
@Samuel Nov 15, 2021

Samuel Murray wrote:

German Dutch Engineering Translation wrote:
How can I import DGT memories in WFP 6.5? Doesn't WFP accept the language codes?

Could you share the link to the DGT memory you're trying to import?


Here it is:

https://www.dropbox.com/s/m3zyh0mjtlbd7hf/dgt2007.tmx?dl=0


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
@Samuel Nov 16, 2021

Did you have any luck with importing the file?

 
Milan Condak
Milan Condak  Identity Verified
Local time: 02:43
English to Czech
I put your TMX in OmegaT project Nov 16, 2021

German Dutch Engineering Translation wrote:

Here it is:

https://www.dropbox.com/s/m3zyh0mjtlbd7hf/dgt2007.tmx?dl=0


Hi Hans,

I created an OmegaT project with the languages DE-DE and NL-NL, and OmegaT can read the data. I translated one file with 40 segments and saw fuzzy matches.

Milan


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
@Milan Nov 16, 2021

Thanks Milan, but I wanted specifically to know about WFP, since I wanted to do a comparison with the DGT in WFP and CTE.

OT: Coincidentally, I played a little with oT yesterday evening. Am I correct that one only has to place the DGT TMX files in the dedicated TMX folder? So there's actually no import, is there?

I was also wondering if oT has some kind of database system for large TMX files. (But probably we should discuss this in a separate thread in the appropriate forum ...)


[Edited at 2021-11-17 01:24 GMT]


 
Milan Condak
Milan Condak  Identity Verified
Local time: 02:43
English to Czech
OmegaT reads TMX Nov 16, 2021

German Dutch Engineering Translation wrote:

Thanks Milan, but I wanted specifically to know about WFP, since I want to do a comparison with the DGT in WFP and CTE.


In WFP you need to import the TMX into two project databases, DE-NL and NL-DE, if you translate in both directions.

German Dutch Engineering Translation wrote:

OT: Coincidentally, I played a little with oT yesterday evening. Am I correct that one only has to place the DGT TMX files in the dedicated TMX folder? So there's actually no import, is there?

I was also wondering if oT has some kind of database system for large TMX files. (But probably we should discuss this in a separate thread in the appropriate forum ...)


No import = no lost time and no waiting.

I am using OmegaT for 64-bit Windows, so I can allocate 10 GB to Java and OmegaT.
There are databases for the 32-bit DGT-OmegaT. I do not use them.

On-premise MT: WFP can use Fiskmo and OmegaT can use OpusCAT. The model for DE-NL is 300 MB; your DE-NL TMX is 294 MB.
All segments in my test file were translated with OpusCAT, but only half of the segments had any fuzzy matches.

I am sorry, I mainly use OmegaT. But I promote Fiskmo for Wordfast as an alternative local MT.

Milan

[Edited at 2021-11-16 10:15 GMT]


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Info from support Nov 16, 2021

A kind employee of Wordfast's Support wrote to me:

We cater to translators. What you are doing is industrial-strength, agency-level, or research-level. It's like using a VW Beetle to move 100, rather than 4, people.




He also gave me some further instructions (convert to tab delimited) to proceed. Will be testing that route later this week.
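
For anyone who wants to try the same route, here is a rough Python sketch of a TMX to tab-delimited conversion. It is not the procedure Support described, and the exact column layout Wordfast expects is not covered here: the sketch simply writes one plain "source TAB target" line per unit. It assumes the language codes are in a `lang` or `xml:lang` attribute (DE-DE and NL-NL, as Milan used for this file); the function names and sample data are invented.

```python
import io
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def _clean(text):
    # Collapse tabs and line breaks so each pair stays on a single line.
    return " ".join(text.split())


def tmx_to_tab_delimited(tmx_source, out, src_lang, tgt_lang):
    """Write 'source<TAB>target' lines for one direction; return the count."""
    src_lang, tgt_lang = src_lang.upper(), tgt_lang.upper()
    written = 0
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        texts = {}
        for tuv in elem.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").upper()
            seg = tuv.find("seg")
            if seg is not None:
                texts[lang] = _clean("".join(seg.itertext()))
        src, tgt = texts.get(src_lang), texts.get(tgt_lang)
        if src and tgt:  # skip units that lack either side
            out.write(f"{src}\t{tgt}\n")
            written += 1
        elem.clear()  # drop the children of this translation unit
    return written


# Made-up sample: one complete pair, one unit without a Dutch side.
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header/><body>
<tu><tuv lang="DE-DE"><seg>Guten
Tag</seg></tuv><tuv lang="NL-NL"><seg>Goedendag</seg></tuv></tu>
<tu><tuv lang="DE-DE"><seg>Nur Deutsch</seg></tuv></tu>
</body></tmx>"""

demo_out = io.StringIO()
demo_count = tmx_to_tab_delimited(io.BytesIO(sample), demo_out, "DE-DE", "NL-NL")
print(demo_count, repr(demo_out.getvalue()))
```

Swapping the last two arguments gives the reverse direction, which is what you would need for the second (NL-DE) database Milan mentioned.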


 
Milan Condak
Milan Condak  Identity Verified
Local time: 02:43
English to Czech
Wordfast Server Nov 16, 2021

Milan Condak wrote:

I am sorry, I mainly use OmegaT.


Before I started using OmegaT, I used Wordfast Server for big TMs.

Google:

What does Wordfast do?

Is Wordfast free?

What is WF Server?

https://www.wordfast.net/?go=wfserver

Overview

Wordfast Server (WFS) is an efficient solution to centralize large Translation Memories and share them over the Internet. It is an agile and nimble self-contained application: a single Windows executable file that does not require any third-party DBMS. It deploys in a few seconds.
--
You can import more than just the one DGT TMX from 2007 into the database.

Milan


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Concordancing only Nov 18, 2021

Multiverse Solutions s.r.o. wrote:

1. The TMX file is segmented into paragraphs corresponding with the original document layout.
This approach is good for archival purposes, but has limited usability for translation needs.


In my opinion (and because of the restrictions you pictured), the DGT is only useful for concordancing on source, target or both. Not for fuzzy matching.


 
Philippe Locquet
Philippe Locquet  Identity Verified
Portugal
Local time: 01:43
English to French
+ ...
To infinity and beyond Nov 18, 2021

German Dutch Engineering Translation wrote:
In my opinion (and because of the restrictions you pictured), the DGT is only useful for concordancing on source, target or both. Not for fuzzy matching.


Importing this into Pro 6 can be done; how depends on the size of the TM you want to have.
Before anything else, consider how powerful your machine is: you need at least a quad-core CPU and 8 GB of RAM to start thinking about processing this. The benefit of Wordfast Pro and Wordfast Server over OmegaT is that your TM will be indexed. This means faster searches and faster fuzzy matching, which is very important with big TMs.
A word on the DGT contents: it has been improving over the years, but you will find misaligned segments, repetitions, etc. Ideally you would want to clean it, but this will be complicated due to the size of the TMs.

Process:

I would suggest first converting the TM to txt before import. This will help reduce RAM use, since txt is a lot lighter than tmx, and both Wordfast Server and Wordfast Pro 6 support importing TMs in txt. Convert to txt using Wordfast Converter, which is free, fast and effective: http://wordfast.net/zip/WfConverter.zip
In Pro 6: create a local TM and import your first chunk. Then, once your next chunk is ready, import it into the same TM, making sure you have selected “Import in an existing TM” and “If Tu already exists, overwrite existing TU”. This will get rid of duplicates. If you import your TMs chronologically from oldest to newest, overwriting with the newest version will ensure you get the best TU possible.
Wordfast Pro 6 will need time, processing power and RAM to perform this task, so let it run without using the computer for anything else. Once the process is finished, accessing entries in the TM uses very few resources thanks to indexing; it’s very fast.
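
To illustrate why the import order matters, here is a toy Python model of the overwrite option. It is not how Wordfast works internally: it assumes that two TUs count as "the same" when their source text is identical, and the function name and sample pairs are invented.

```python
def merge_chunks(chunks):
    """Merge (source, target) pairs from chunks given oldest first.

    A later pair with the same source text replaces the earlier one,
    which mimics an "overwrite existing TU" import.
    """
    tm = {}
    for chunk in chunks:
        for source, target in chunk:
            tm[source] = target  # newest translation wins
    return tm


# Invented example: the newer chunk revises one translation.
older = [("Guten Tag", "Goede dag"), ("Danke", "Dank u")]
newer = [("Guten Tag", "Goedendag")]

merged = merge_chunks([older, newer])
print(merged)
```

Importing in the opposite order would leave the older translation in place, which is why the oldest-to-newest order is recommended.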

Size considerations:

I wouldn’t plan a local TM bigger than 1 million TUs in Pro 6. The size of the TM on disk can be big (slightly bigger than the tmx).
If you want to go beyond 1 million TUs, use Wordfast Server instead (it’s free for freelancers). First, watch my video to see how to set this up: https://youtu.be/3WtVV2PlaS8 If you want your TM to use little disk space, note that Wordfast Server needs a lot less space for a TM than Pro 6.
Then import your TM incrementally as described for Pro (in WFS, select the TM and click on append). Importing and indexing in WFS is very fast. There is no limit to the size of the TM as long as you have enough RAM to index it upon import; I’ve run test TMs on WFS that were several million TUs in size!
These kinds of processes can be tiresome. This is among the services I offer to my clients, so if someone wants the result without the trouble, send me a message.

For those hearing about DGT for the first time, here is a video explaining how to get it, with direct download links below the video for sections I cleaned a little: https://youtu.be/wVeU9NKEYjM

Have fun!

