How to clean a Multiterm termbase from repetitive entries?
Thread poster: Mushu
Mushu
Local time: 09:51
Bulgarian to French
Mar 15, 2010

Dear colleagues,

It is certainly my own deficiency, but I can't find a way to do this except manually, and I have neither the courage nor the time to manually delete all the repetitions from a bilingual glossary of more than 18,000 entries.

When I imported several quite large MS Excel glossaries into it, I never imagined it wouldn't filter out identical entries the way Workbench does with identical TUs. So I ran the import, and then it was too late: there I was with some 18,000 entries, maybe two-thirds of which are repetitions.

There should be a smarter way to delete the repetitions than doing it manually (or a way to import the files into a new glossary without getting all the repetitions in the DB), but I couldn't find any helpful hints in the Multiterm online help.
Maybe it has something to do with creating a proper filter, but I couldn't work out from the help entry how to create such a filter...

My Multiterm is the 2009 version, the one that came with Studio last year. I still haven't installed SP2.

I will be very thankful for any working suggestions.

[Edited at 2010-03-15 12:27 GMT]


 
István Hirsch  Identity Verified
Local time: 08:51
English to Hungarian
Just a question... Mar 16, 2010

Open the XML file in Excel as an XML list. Are the source and target terms below each other in the same column? If so, I think I know how to go on.

 
FarkasAndras  Identity Verified
Local time: 08:51
English to Hungarian
Nope Mar 16, 2010

Mushu wrote:

There should be a smarter way to delete the repetitions than doing it manually (or a way to import the files into a new glossary without getting all the repetitions in the DB), but I couldn't find any helpful hints in the Multiterm online help.


I don't think Multiterm offers a smart solution, although if it does, I would love to hear about it.
I would filter the source file(s) with Excel (data/filters) and import into a new termbase.


 
Daniel García
English to Spanish
Export and import again? Mar 16, 2010

You could try exporting the data to XML and importing into a new termbase using the option "Synchronise on index field".

Then you can either choose to merge the duplicate entries or exclude the entries which already exist.

The problem with doing this is that the merge or exclusion is based on the index field you choose, so you may end up merging (or excluding) valid duplicates.

For instance, if your glossary includes separate entries for "party=parti" (meaning "political party") and "party=fête" (meaning "feast"), they will either be merged together (not good for a concept-based termbase), or the second entry will not be imported and you will have to import it in a separate step, adding it as a new entry.

If you don't have any such legitimate duplicates, it should work.

Daniel
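
The risk described above can be illustrated outside MultiTerm. The following is a minimal Python sketch, not MultiTerm's own logic: the sample entries and both function names are invented for illustration. It contrasts keeping one entry per source term (roughly what excluding on a single index field amounts to) with keeping one entry per source/target pair.

```python
# Hypothetical sample data: two distinct concepts share the source term "party".
entries = [
    ("party", "parti"),  # political party
    ("party", "fête"),   # feast
    ("party", "parti"),  # a true repetition of the first entry
]

def dedupe_on_source(rows):
    """Keep the first entry per source term, as if excluding on one index field."""
    first_seen = {}
    for source, target in rows:
        first_seen.setdefault(source, (source, target))
    return list(first_seen.values())

def dedupe_on_pair(rows):
    """Keep the first entry per (source, target) pair, so homonyms survive."""
    return list(dict.fromkeys(rows))

print(dedupe_on_source(entries))  # [('party', 'parti')]
print(dedupe_on_pair(entries))    # [('party', 'parti'), ('party', 'fête')]
```

Keying on the source term alone silently drops "party=fête", while keying on the pair removes only the true repetition.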


 
Mushu
Local time: 09:51
Bulgarian to French
TOPIC STARTER
Thank you all Mar 17, 2010

Thank you all!
So if there is no smart solution, I'll have to come up with something else, like working on the source Excel files.
Merging in Multiterm does not remove the, say, 4 entries merged into one. You have to remove them manually, one by one, confirming each time that you do want to remove the selected entry.
But I could try, when I find some spare time, to work on the Excel files before importing them to a new database.

István, I couldn't open the .xml file in Excel. Either it is far too big or there is some problem with it: after several minutes of waiting, it opened with an error and showed only one line (the first).

Well, I must be a naive greenhorn, not so familiar with databases, but how on earth can one manage a database without some automated way to eliminate repetitions? I don't understand the developers' idea...
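
When the XML export is too large for Excel, a streaming parser can still read it entry by entry. The sketch below is only an illustration in Python. It assumes the export uses elements named conceptGrp, languageGrp, language (with a lang attribute), termGrp and term, as in the small hypothetical sample embedded in the code; check your own file and adjust the names if they differ. It counts entries whose full set of terms has already been seen.

```python
import io
import xml.etree.ElementTree as ET

# ASSUMPTION: a hypothetical export layout; real element names may differ.
SAMPLE = """<mtf>
  <conceptGrp><concept>1</concept>
    <languageGrp><language lang="BG"/><termGrp><term>къща</term></termGrp></languageGrp>
    <languageGrp><language lang="FR"/><termGrp><term>maison</term></termGrp></languageGrp>
  </conceptGrp>
  <conceptGrp><concept>2</concept>
    <languageGrp><language lang="BG"/><termGrp><term>куче</term></termGrp></languageGrp>
    <languageGrp><language lang="FR"/><termGrp><term>chien</term></termGrp></languageGrp>
  </conceptGrp>
  <conceptGrp><concept>3</concept>
    <languageGrp><language lang="BG"/><termGrp><term>къща</term></termGrp></languageGrp>
    <languageGrp><language lang="FR"/><termGrp><term>maison</term></termGrp></languageGrp>
  </conceptGrp>
</mtf>"""

def entry_key(concept):
    """Build a hashable key from every (language, term) pair in one entry."""
    pairs = []
    for lang_grp in concept.findall("languageGrp"):
        lang_el = lang_grp.find("language")
        lang = lang_el.get("lang", "") if lang_el is not None else ""
        for term in lang_grp.iter("term"):
            pairs.append((lang, (term.text or "").strip()))
    return tuple(sorted(pairs))

def count_duplicates(xml_source):
    """Return (total entries, entries repeating an earlier one)."""
    seen = set()
    total = duplicates = 0
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "conceptGrp":
            total += 1
            key = entry_key(elem)
            if key in seen:
                duplicates += 1
            seen.add(key)
            elem.clear()  # discard the entry's children to keep memory use low
    return total, duplicates

print(count_duplicates(io.BytesIO(SAMPLE.encode("utf-8"))))  # (3, 1)
```

The function also accepts a file name instead of the in-memory sample, so the whole export never has to be loaded at once.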


 
FarkasAndras  Identity Verified
Local time: 08:51
English to Hungarian
In Excel Mar 17, 2010

Mushu wrote:

But I could try, when I find some spare time, to work on the Excel files before importing them to a new database.


Yes, Multiterm should offer this functionality (it probably does, in some weird, unfathomable and contorted way), but at least doing it in Excel is not too difficult:

http://office.microsoft.com/en-us/excel/HA010346261033.aspx

AFAIK if you select 2 columns, only the content of those will be taken into account, i.e. rows with differences in other columns will still be filtered out. Alternatively, you can select the whole table to keep all rows that have at least one cell anywhere that differs from other rows.
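
The difference between the two selections can be shown with a small Python sketch. It does not reproduce Excel's filter itself; the column names and rows are invented, and the function simply keeps the first row for each distinct combination of the chosen columns.

```python
# Hypothetical glossary rows with an extra "note" column.
rows = [
    {"source": "party", "target": "parti", "note": "politics"},
    {"source": "party", "target": "parti", "note": "imported 2009"},
    {"source": "party", "target": "fête", "note": ""},
]

def unique_rows(rows, columns):
    """Keep the first row for each distinct combination of the given columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[column] for column in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

by_pair = unique_rows(rows, ["source", "target"])
by_whole_row = unique_rows(rows, ["source", "target", "note"])
print(len(by_pair))       # 2
print(len(by_whole_row))  # 3
```

Comparing only the two term columns removes the second row even though its note differs; comparing whole rows keeps all three.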


 
kimjasper  Identity Verified
Denmark
Local time: 08:51
Member (2006)
English to Danish
Export in tab-delimited format Mar 17, 2010

This may work if you are familiar with the data functions in Excel:

If you export in tab-delimited format instead of XML, you get a .txt file with tabs as field separators. If you pull this file into an open Excel session, it usually creates 5-10 columns in a nice and tidy format. If your termbase is purely bilingual, one of these columns is the source term and another is the target term. Delete all other columns and insert a first row with the names of the languages. Use the advanced filter function in Excel to remove duplicate rows (click on Data, then on Advanced next to the Filter icon, then on Unique records only, then delete these rows). Save as .xls and then create a new termbase from this .xls file with Multiterm Convert.
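
For those less comfortable with Excel's data functions, roughly the same clean-up can be done with a short script. This is a sketch in Python under stated assumptions: the column positions, the language names and the sample rows are hypothetical, and a real export may use a different layout or encoding.

```python
import csv
import io

def dedupe_export(infile, outfile, source_col, target_col, languages):
    """Copy unique (source, target) pairs from a tab-delimited export.

    Returns the number of pairs written, not counting the header row.
    """
    # QUOTE_NONE: treat quotation marks inside terms as ordinary characters.
    reader = csv.reader(infile, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.writer(outfile, delimiter="\t", lineterminator="\n")
    writer.writerow(languages)  # first row: the names of the languages
    seen = set()
    written = 0
    for fields in reader:
        if len(fields) <= max(source_col, target_col):
            continue  # row too short to hold both terms
        pair = (fields[source_col].strip(), fields[target_col].strip())
        if not all(pair) or pair in seen:
            continue  # empty term or repetition
        seen.add(pair)
        writer.writerow(pair)
        written += 1
    return written

# Hypothetical export: entry number, Bulgarian term, French term.
sample = "1\tкъща\tmaison\n2\tкуче\tchien\n3\tкъща\tmaison\n"
result = io.StringIO()
count = dedupe_export(io.StringIO(sample), result, 1, 2, ["Bulgarian", "French"])
print(count)  # 2
print(result.getvalue())
```

For a real export, open the input and output files with the encoding your export actually uses and pass them in place of the in-memory sample. The result can then be opened in Excel and saved as .xls for Multiterm Convert.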

I saw that in SP2 you can include a column with term id numbers in the .xls files. This should make the task of updating termbases easier, since MT will just replace the term if the id number already exists in the termbase. This requires, though, that you or your client can implement a term id numbering system.
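
The replace-by-id behaviour described here can be pictured as a dictionary keyed on the term id. This is only a conceptual sketch in Python: it does not call MultiTerm, and the ids and terms are invented.

```python
def import_with_ids(termbase, rows):
    """Add rows to a dict keyed on term id; an existing id is overwritten.

    Returns (number of new ids added, number of existing ids replaced).
    """
    added = replaced = 0
    for term_id, source, target in rows:
        if term_id in termbase:
            replaced += 1
        else:
            added += 1
        termbase[term_id] = (source, target)
    return added, replaced

# Hypothetical termbase and a second import that repeats id 1.
termbase = {1: ("party", "parti")}
outcome = import_with_ids(termbase, [(1, "party", "parti politique"), (2, "party", "fête")])
print(outcome)        # (1, 1)
print(len(termbase))  # 2
```

Because the id decides what counts as the same entry, importing the same file twice does not grow the termbase.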

It would be nice if SDL could develop an easy way to remove duplicates from a termbase. The issue probably is that the hierarchical and very flexible xml-based data structure in MT makes it difficult to develop a generic tool doing that. But if SDL could develop a tool that works on bilingual term bases only, that would probably solve the issue for 99% of the users.


 
Mushu
Local time: 09:51
Bulgarian to French
TOPIC STARTER
Will be trying this solution Mar 17, 2010

Thank you, kimjasper!

I will try the solution you suggest. I'm not so familiar with the data functions in Excel, but you have described the procedure quite clearly and I'll go for it.

The thing is, one of the languages in the termbase is French, and when I tried the other day to export it in tab-delimited format, it lost some of the diacritics (I think the apostrophe was replaced with a code), so I stopped trying to fix things through this format.
But I'll try anyway, as soon as I can spare time.

Thank you again!
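
If the "code" that replaced the apostrophe is an HTML/XML character reference such as &#39; (an assumption, since the thread does not show the actual text), it can be converted back before removing duplicates. A minimal Python sketch with invented sample strings:

```python
import html

def restore_characters(text):
    """Turn character references such as &#39; or &eacute; back into characters."""
    return html.unescape(text)

print(restore_characters("l&#39;&eacute;cole"))  # l'école
print(restore_characters("l&apos;eau"))          # l'eau
```

If the export shows a different kind of code, this function will leave it unchanged and the cause is more likely the encoding chosen when opening the file.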


 
FarkasAndras  Identity Verified
Local time: 08:51
English to Hungarian
Careful with tab delimited export Mar 17, 2010

kimjasper wrote:

This may work if you are familiar with the data functions in Excel:

If you export in tab-delimited format instead of XML, you get a .txt file with tabs as field separators. If you pull this file into an open Excel session, it usually creates 5-10 columns in a nice and tidy format. If your termbase is purely bilingual, one of these columns is the source term and another is the target term. Delete all other columns and insert a first row with the names of the languages. Use the advanced filter function in Excel to remove duplicate rows (click on Data, then on Advanced next to the Filter icon, then on Unique records only, then delete these rows). Save as .xls and then create a new termbase from this .xls file with Multiterm Convert.

I saw that in SP2 you can include a column with term id numbers in the .xls files. This should make the task of updating termbases easier, since MT will just replace the term if the id number already exists in the termbase. This requires, though, that you or your client can implement a term id numbering system.

It would be nice if SDL could develop an easy way to remove duplicates from a termbase. The issue probably is that the hierarchical and very flexible xml-based data structure in MT makes it difficult to develop a generic tool doing that. But if SDL could develop a tool that works on bilingual term bases only, that would probably solve the issue for 99% of the users.


... because it doesn't work unless all entries include a term in all languages and all other fields are also filled in.

See http://www.proz.com/forum/sdl_trados_support/160039-converting_multilingual_termbase_into_excel_document.html
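
One way to check an export before trusting it is to look for rows whose number of fields differs from the rest. The Python sketch below assumes, without confirmation from the thread, that the problem shows up as rows with fewer tab-separated fields when a term or field is missing; the sample lines are invented.

```python
from collections import Counter

def ragged_rows(lines):
    """Return 1-based line numbers whose field count differs from the most common one."""
    rows = [
        (number, line.rstrip("\r\n").split("\t"))
        for number, line in enumerate(lines, start=1)
        if line.strip()
    ]
    if not rows:
        return []
    expected = Counter(len(fields) for _, fields in rows).most_common(1)[0][0]
    return [number for number, fields in rows if len(fields) != expected]

# Hypothetical export in which the second entry has no French term.
sample_lines = ["1\tкъща\tmaison\n", "2\tкуче\n", "3\tкотка\tchat\n"]
print(ragged_rows(sample_lines))  # [2]
```

An empty result does not prove the export is sound, but any line numbers it reports are worth inspecting before deleting columns in Excel.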


 
István Hirsch  Identity Verified
Local time: 08:51
English to Hungarian
Multiterm solution Mar 18, 2010

The main point of this solution is that it is not necessary to delete any part of the termbase under discussion, because you can use filters in Multiterm to switch any separately added part of the "final" termbase on or off at any time.

Open Multiterm and open the termbase you want to modify (just that one). In the upper row of boxes with drop-down menus (source language, target language, etc.) there is one called Flags layout. Click the arrow in it and select Full layout from the drop-down menu. Now the main window shows the creation date and the name of the creator of the given term, besides the source and target terms. By clicking on the several occurrences of the same source term in the source term list in the left column, you can identify when the different portions of data were added to the termbase. You will need these dates in the next step.

Now, still in Multiterm, go to Termbase/Termbase catalogue/Filter tab and click Create, then Next, and give the new filter a name. In the next window, select the "Simple filter" radio button. In the following "Filter definition" window, choose Source/Term/"Created on" from "Termbase fields", select "Equal to" in the "Condition" drop-down menu, and type the date on which the data you want to filter was added into the "Value" box. Then complete the process.

Now go back to the Multiterm main window, click the arrow in the last box (Filters), select the new filter, then click the filter symbol in front of the box and select "Displays matching entries only" from the drop-down menu.

Only the terms added that day will be displayed (and searched) in the left column. By creating several filters, you can exclude several parts of the "final" termbase from the search.

You can use the same filters when importing or exporting the termbase, with the same effect.

Hope this helps.
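
The idea behind the filter can be modelled in a few lines of Python. This is a conceptual sketch only: it does not use MultiTerm, and the entries, dates and function names are invented. It groups entries by creation date, which corresponds to the separate imports, and then selects one batch the way a "Created on" "Equal to" filter would.

```python
from datetime import date

# Hypothetical entries: the same glossary was imported on two different days.
entries = [
    {"source": "къща", "target": "maison", "created_on": date(2010, 3, 1)},
    {"source": "куче", "target": "chien", "created_on": date(2010, 3, 1)},
    {"source": "къща", "target": "maison", "created_on": date(2010, 3, 8)},
]

def batches_by_date(entries):
    """Group entries by creation date, one group per import run."""
    batches = {}
    for entry in entries:
        batches.setdefault(entry["created_on"], []).append(entry)
    return batches

def created_on(entries, day):
    """Keep only the entries created on the given day."""
    return [entry for entry in entries if entry["created_on"] == day]

print(sorted(batches_by_date(entries)))            # the two import dates
print(len(created_on(entries, date(2010, 3, 1))))  # 2
```

This only helps when the repetitions come from whole repeated imports; repetitions inside a single import share the same date and stay together.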


 
Mushu
Local time: 09:51
Bulgarian to French
TOPIC STARTER
thank you! Mar 18, 2010

Thank you so much, István!

You are so kind to explain things in such a clear and detailed way! Thank you for giving of your time to do this.


 
Lyudmila Dyankova
Russian Federation
Local time: 09:51
Russian to English
Simple Solution Found Oct 18, 2017

This is an old post, but it still appears first in Google when searching for a way to delete duplicates from a MultiTerm termbase.
Since SDL MultiTerm 2017 still does not offer this functionality in a simple way, I am sharing this lovely solution:
http://noradiaz.blogspot.ru/2016/01/removing-duplicates-from-multiterm.html


 

