
D3 Study: Definition of an objective evaluation method for assessing the minimal quality of digital video and audio sources required to provide simultaneous interpretation

Final Report

Authors: Thomas Sporer, Jens-Oliver Fischer, Judith Liebetrau, Daniel Fröhlich, Sebastian Schneider, Sven Kämpf
Fraunhofer IDMT

Date: December 1st, 2010
Version: 1.7

This report was commissioned and financed by the Commission of the European Communities. The views expressed herein are those of the Consultant, and do not represent an official view of the Commission.

Executive Summary

The progress in communication technology is changing the work environments of many professions, providing new possibilities and reducing the work load for the individual. On the other hand, these new technologies, if applied wrongly, might also be harmful.

This report is the result of a study of audio-visual systems for interpreters. The study was designed to update the requirements for on-site interpretation and to specify requirements for teleconferencing (videoconferencing).

The report consists of the following parts:

- A discussion of the most prominent technical factors defining the work environment for interpreters.
- The specifications, report, results and conclusions of perceptual experiments conducted with professional interpreters. These perceptual experiments are used to find, generate, and evaluate audio-visual stimuli below and above the acceptance threshold for interpreters.
- A discussion of possibilities for measurement schemes measuring the audio-visual quality of interpretation systems. This includes a verification of the measurement schemes based on the audio-visual stimuli tested beforehand in the perceptual experiments.
- A description of how to specify the audio-visual components for new conference sites based on perceptual measurements. Perceptual measurements can be applied to any audio and video coding scheme known today or in the future; therefore no explicit codec and bit rate recommendations are necessary. The description also includes details on how to apply the measurement schemes during commissioning.
- An overview of how the existing standards for interpretation systems can be improved to respect the possibilities of new technologies while preserving an adequate work environment for interpreters.

Table of Contents

Scope
1. Introduction
2. Report of Phase 1: Perceptual Experiments
  2.1 Production of Speech Stimuli
    2.1.1 Hardware used
    2.1.2 Language and Text Selection
    2.1.3 Report of Recording Sessions
    2.1.4 Results of Recording Sessions
  2.2 Production of Stimuli with different Quality
    2.2.1 Video Conditions
    2.2.2 Audio Conditions
    2.2.3 Transmission Parameters
    2.2.4 Pretest – Selection of Conditions for Perceptual Test
  2.3 Conditions used in Perceptual Test
  2.4 Test Procedure
    2.4.1 Specification of Test Procedure
    2.4.2 Setup for Monitor Tests
    2.4.3 Setup for Projector Tests
    2.4.4 Randomization of Stimuli
  2.5 Perceptual Test
    2.5.1 Listeners
    2.5.2 Report of the perceptual tests
    2.5.3 Interpreters' comments
  2.6 Results of the Perceptual Test
    2.6.1 Audio Coding
    2.6.2 Audio Bandwidth
    2.6.3 Room Acoustics
    2.6.4 Background Noise
    2.6.5 Video Coding & Resolution
    2.6.6 Video Contrast
    2.6.7 Audio-Visual Latency (Monitor)
    2.6.8 Audio-Visual Latency (Projector)
3. Measurement of Quality and Definition of Requirements
  3.1 Measurement of Acoustic Environment
  3.2 Measurement of Audio Quality
  3.3 Measurement of Video Quality
  3.4 Measurement of Audio-Visual Latency
  3.5 Measurement Results
    3.5.1 Audio Coding
    3.5.2 Audio Bandwidth
    3.5.3 Room Acoustics
    3.5.4 Background Noise
    3.5.5 Video Coding & Resolution
    3.5.6 Video Contrast
    3.5.7 Audio-Visual Latency (Monitor)
    3.5.8 Audio-Visual Latency (Projector)
  3.6 Comparison Subjective-Objective Results
    3.6.1 Audio Coding
    3.6.2 Audio Bandwidth
    3.6.3 Room Acoustics
    3.6.4 Visual Environment
    3.6.5 Background Noise
    3.6.6 Video Coding & Resolution
    3.6.7 Video Contrast
    3.6.8 Audio-Visual Latency (Monitor)
    3.6.9 Audio-Visual Latency (Projector)
  3.7 Summary of Requirements
4. Detailed Specification of Requirements and Measurement Procedures
  4.1 Audio Quality
  4.2 Video Quality (Monitor)
  4.3 Video Quality (Projector)
  4.4 Audio-visual coherence – lip sync
  4.5 Recordings of test samples
5. Current Standards for simultaneous Interpretation
6. Conclusions
7. Team members & roles in the team
References

Scope

The progress in communication technology is changing the work environments of many professions, providing new possibilities and reducing the work load for the individual. On the other hand, these new technologies pose possible risks to the users.

This study explores various aspects of audio-visual systems for interpreters. Its target is to create a scientifically sound list of requirements for environments that are adequate for simultaneous interpretation.

These requirements apply in particular to typical on-site interpretation, where a monitor in the interpretation booth or a projector visible through the window of the interpreters' booth is in use. They are also valid for teleconferencing (videoconferencing) situations. In both cases this includes the quality requirements for (bit-reduced) encoding of audio and video.

This list of requirements might lead to standards, giving architects and planners of conference sites guidance for their work.

1. Introduction

This study presents the definition of an objective evaluation method for assessing the minimal quality of digital video and audio sources required to provide (extempore) simultaneous interpretation utilizing technological aids such as projectors, displays inside booths, and digital (audio) interpretation systems.

The main working steps are:

- Phase 1: Realization of perceptual experiments with experienced simultaneous interpreters to collect subjective video and audio data about the audiovisual coherence acceptance threshold of interpreters. Experiments concerning general perceptual acceptance thresholds have been carried out in previous studies conducted by the EC. The knowledge obtained in these perceptual experiments is taken into account in this phase.
  New perceptual experiments are necessary because the audio data of these previous studies has not been stored for further use, e.g., the calibration of measurement systems (see Phase 2).
- Phase 2: Application of standardized objective evaluation methods for audio, video, and audiovisual coherence on this newly collected data, and adaptation of these methods where necessary.

The detailed working plan, divided into five topics, is illustrated in this document. A test schedule and detailed requirements are provided to enable acquisition of the facilities, equipment, and personnel resources required on-site in Brussels.
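Phase 2 applies standardized objective evaluation methods to the newly collected data. Purely as a generic illustration of how such an objective metric operates (not the method adopted in this study, which is specified in section 3), the sketch below computes PSNR, the peak signal-to-noise ratio between a reference video frame and a degraded one; the frames here are synthetic:

```python
import numpy as np

def psnr(ref, deg, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference frame and a
    degraded frame; higher values mean the frames are more similar."""
    mse = np.mean((ref.astype(np.float64) - deg.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# Synthetic 8-bit luma frames: the "degraded" frame adds small uniform noise.
rng = np.random.default_rng(1)
ref = rng.integers(0, 256, size=(720, 1280), dtype=np.uint8)
deg = np.clip(ref.astype(np.int16) + rng.integers(-2, 3, size=ref.shape),
              0, 255).astype(np.uint8)
print(round(psnr(ref, deg), 1))  # roughly 45 dB for this noise level
```

A full-reference metric like this needs the undistorted source; the study's measurement setting, where reference recordings exist for every condition, fits that requirement.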

2. Report of Phase 1: Perceptual Experiments

2.1 Production of Speech Stimuli

In total 8 speakers have been recorded, i.e. one male and one female speaker for each of the 4 languages (for details about the selected languages, see below). Speakers for the English, German and Polish languages were recorded in the Breydel Auditorium, Brussels. The Hungarian speakers were recorded in an ITU-R BS.1116-1 [BS.1116] compliant room at Fraunhofer IDMT, Ilmenau.

2.1.1 Hardware used

All video assets were recorded with a Panasonic HDC-SD707 High Definition Camcorder (consumer/prosumer grade).

Resolution   1920 x 1080
Frame rate   50i
Codec        MPEG-4 AVC/H.264
Bit rate     17 Mbps (VBR)
Profile      AVCHD HA

Table 1: Parameters of the HDC-SD707 Camcorder

Being a consumer grade camera, it automatically adjusts white level and contrast, relieving the user from doing so manually. As a result, recordings made with this device tend to adapt rather well to different lighting situations, so it was not necessary to manually adjust the recording equipment at either recording site. The camera significantly reduced the visible shadows in the faces caused by sub-optimal lighting in the conference room. Note that a professional video camera would need more manual intervention. In general, any lighting situation casting a shadow on the face of the person to be recorded should be avoided.

All audio recordings were made using an Edirol R4-Pro HD recorder (professional grade) and a dpa 4011 microphone with cardioid capsule (professional grade) at 96 kHz and 24 bit. One channel was recorded (mono).

Figure 1: On-axis Frequency Response of the dpa 4011 Microphone
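The signal-to-noise ratio of a recording made with such a setup can be estimated by comparing the RMS level of a speech-active segment with the RMS level of a pause containing only background noise. A minimal sketch with synthetic stand-in signals (a real measurement would use segments cut from the actual recordings):

```python
import numpy as np

def rms_db(x):
    """RMS level of a signal block in dB relative to full scale."""
    return 20.0 * np.log10(np.sqrt(np.mean(np.square(x))))

def estimate_snr(speech_block, noise_block):
    """SNR in dB: level of a speech-active block minus the noise floor."""
    return rms_db(speech_block) - rms_db(noise_block)

# Synthetic stand-ins: a 220 Hz tone over a noise floor about 17 dB below it.
fs = 48000
rng = np.random.default_rng(0)
t = np.arange(fs) / fs
noise = 0.01 * rng.standard_normal(fs)
speech = 0.1 * np.sin(2 * np.pi * 220 * t) + noise
print(round(estimate_snr(speech, noise), 1))
```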

The SNR of the audio recording for the Hungarian speakers is ca. 33 dB. The SNR of the audio recordings for all other speakers is ca. 22 dB, due to background noise in the Breydel Auditorium mainly caused by air conditioning and reverberation.

2.1.2 Language and Text Selection

The languages have been selected to cover different regions of Europe and different spectral-temporal behaviour: Slavic languages use a lot of high-frequency fricatives that are sensitive to low-pass filtering. Germanic languages have a non-smooth temporal structure (i.e. plosives) which is sensitive to time smearing in audio coding, and German in particular proved to be among the most critical languages in audio coding in the past. English was chosen for being the main working language in the EU. Uralic languages, like Finnish, Hungarian, and Estonian, contain more similar-sounding vowels than other European languages; small deviations in the reproduction of vowels can therefore reduce the perceived quality of these languages. (1)

(1) As can be seen in the results of the perceptual experiments, each of these languages has distinct requirements on the audio-visual system.

Fraunhofer IDMT decided to use standardized phonetically balanced sentences as far as available. In cases where no or not enough phonetically balanced sentences were available, unbalanced sentences were used.

English:
The Harvard Psychoacoustic Sentences are a closed set of 100 sentences originally developed to evaluate word intelligibility in whole, meaningful sentences. The sentences were chosen considering the various segmental phonemes of the English language and their representation in accordance with their frequency of occurrence. Lists 1 to 12 [IEEE] were used for the recording.

German:
For the German sentences the "Oldenburger Satztest (OLSA)" was used. This test consists of 40 lists with 30 sentences each. The structure of the sentences is always: noun, verb, numeral, adjective, and object. The distribution of phonemes in the test lists equals the mean phoneme distribution of the German language [Wagner et al. 1999]. Lists 1 to 7 were used for the recording.

Polish:
The Polish test set contains 100 sentences in total. The first 30 sentences were provided by the European Commission, Directorate General for Interpretation. These sentences are from the "CEGLEX DTD for Polish" released by the Department of Computer Linguistics and Artificial Intelligence, Adam Mickiewicz University, Poznan, Poland. Because this set consists of an insufficient number of sentences, Fraunhofer prepared an additional list with 70 sentences. These sentences were taken from poems and daily literature.

Hungarian:
For the Hungarian language no phonetically balanced material was available. The European Commission, Directorate General for Interpretation provided a list with 25 sentences. In addition 35 sentences were selected by a native speaker. These sentences are taken from the book "Veronika meg akar halni" by Paulo Coelho.

2.1.3 Report of Recording Sessions

The recording of the audio and video assets was divided into two sessions. Session one was conducted on 12th July 2010 at the Breydel Auditorium in Brussels. In a former meeting in Brussels this room was chosen because of its acoustics. In contrast to the former meeting, there was clearly audible background noise from the air conditioning on 12th July. However, a certain amount of background noise is not to be considered unwanted: audio which was not recorded in a studio-like environment is much more demanding when being encoded.

No additional lights were installed for the video recordings, i.e., the on-site lighting was used. In the rooms the recordings were taken in, all lights were installed in the ceiling. This configuration causes the noses of the speakers to produce a slight shadow on their mouths. In total, the chosen recording setup (background noise, no additional light) corresponds to the in-situ situation of current conference rooms.

The speakers were young university graduates who were working as trainees at the European Commission at that time. In total 8 volunteer speakers were recorded in Brussels, one female and one male native speaker for each language. One speaker at a time read the provided sentences. The speakers were seated in one of the audience seats, as the noise caused by the air conditioning was considered to be too loud on the podium.

The recordings of the Hungarian speakers were not approved by the European Commission and had to be redone. (2) The second recording session of Hungarian speakers was conducted at Fraunhofer IDMT in Ilmenau in an ITU-R BS.1116-1 compliant room. Each recording started with a clap marker, which was recorded by the camera and the HD audio recorder.

(2) The first recordings of Hungarian stimuli done at Brussels had used other word lists taken from the Holy Bible. These recordings were discarded because the language was regarded as too archaic and unusual for interpreters.

2.1.4 Results of Recording Sessions

After a screening of the recorded assets, it turned out that some sentences could not be used due to too much noise (paper rustling, breathing directly into the microphone, noise caused by mobile phones, mispronunciation, etc.).

The following table lists the total number of sentences per speaker which were available for the assessment after screening (the individual per-speaker figures are not legible in this transcription).

Table 2: Sentences per speaker suitable for further processing

2.2 Production of Stimuli with different Quality

Audio and video data were recorded separately and were also post-processed separately, i.e. the recorded audio and video assets were modified in postproduction to create the conditions/stimuli for the assessment.

To ensure synchronicity of audio and video, a reference audio and video file was created for each speaker. In these files audio and video matched with an accuracy of 1.5 ms. This was achieved by aligning the clap markers recorded by the camera (audio) and the HD recorder.

2.2.1 Video Conditions

Video material was processed using Adobe Premiere Pro CS4. Different resolutions (Table 4), bit rates, and levels of contrast (Table 5) were created. For the test conditions all videos were encoded using the H.264 codec with the parameters listed in Table 3.

Pass               1-Pass
Bit rate mode      VBR
Keyframe interval  25

Table 3: Parameters of H.264

Resolution    Bit rate
1280 x 720    10 Mbps
1280 x 720    6 Mbps
1280 x 720    3 Mbps
1280 x 720    2 Mbps
1024 x 576    3 Mbps
1024 x 576    2 Mbps
1024 x 576    1 Mbps
640 x 360     3 Mbps
640 x 360     2 Mbps
640 x 360     1 Mbps

Table 4: Different resolutions considered for test items

Contrast  Level
C1        100 %
C2        50 %
C3        25 %

Table 5: Different contrasts considered for test items
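The clap-marker alignment described above can be sketched as a cross-correlation between the camera's audio track and the HD recorder's track: the lag at the correlation peak is the offset to compensate. This is a hypothetical illustration (at a reduced sample rate so the brute-force correlation stays cheap), not the tool actually used in the study:

```python
import numpy as np

def align_offset_seconds(ref, other, fs):
    """Time by which `other` lags `ref`, found at the cross-correlation
    peak. Positive result: `other` starts later and must be advanced."""
    corr = np.correlate(other, ref, mode="full")
    lag = int(np.argmax(corr)) - (len(ref) - 1)
    return lag / fs

# Synthetic clap: a single impulse at 0.50 s in the reference track and
# at 0.52 s in the other track, i.e. a 20 ms offset to correct.
fs = 8000
ref = np.zeros(fs)
other = np.zeros(fs)
ref[int(0.50 * fs)] = 1.0
other[int(0.52 * fs)] = 1.0
print(align_offset_seconds(ref, other, fs) * 1000.0)  # -> 20.0 (ms)
```

On full-length 48/96 kHz material an FFT-based correlation would be used instead of `np.correlate`, but the peak-picking logic is the same.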

2.2.2 Audio Conditions

The recorded audio material was processed using Adobe Audition 1.0.

To investigate the impairments caused by background noise, sounds of a person typing on a keyboard were added to the speakers' signals. The actual signal-to-noise ratio (SNR) depends on the individual sound pressure level of each speaker.

Condition  SNR
SNR1       ca. 15 dB
SNR2       ca. 18 dB

Table 6: Parameters of signal to noise ratios

The room impulse response (IR) of the Breydel Auditorium was used to model highly reverberant conditions. Reverberation was created by convolving the unaltered audio recordings of all speakers with the room impulse response, from which the direct sound had been eliminated. In addition to the original room impulse response (length of IR: ca. 1500 ms), a second version was created in which the late/diffuse part was also eliminated (length of IR: 200 ms); its late-energy level is given in brackets as additional information in Table 7. The resulting reverberation and the original recording were mixed into one single audio stimulus.

Impulse Response   T30      Direct / Early / Late
IR1 (1500 ms)      300 ms   -1 dB / -17 dB / -34 dB
IR2 (200 ms)       180 ms   -1 dB / -18 dB / (-41 dB)

Table 7: Impulse responses considered for audio-only test items
Levels as RMS dB full scale. Early: 0 ms – 150 ms. Late: 150 ms – 450 ms.

The bandwidth of the original recording was reduced using FFT filters as shown in Table 8.

Filter  Bandwidth
B1      30 Hz – 7,500 Hz
B2      100 Hz – 12,500 Hz
B3      30 Hz – 15,000 Hz
B4      20 Hz – 20,000 Hz
B5      30 Hz – 12,500 Hz

Table 8: Parameters of band limitations

As displayed in Table 9, two different profiles of the MPEG-4 AAC codec were used at different bit rates. For the assessment of audio quality, the encoded audio data was decoded beforehand and saved as 48 kHz PCM WAV files. Encoding/decoding was done

using the Fraunhofer IIS ISO/MPEG-4 Enhanced Low Delay AAC Fast Encoder (Version 2.4.7).

Profile         Bit rate  Encoding Quality
Low Complexity  32 kbps   Very High
Low Complexity  48 kbps   Very High
Low Complexity  64 kbps   Very High
Low Complexity  96 kbps   Very High
Low Delay       32 kbps   Very High
Low Delay       48 kbps   Very High
Low Delay       64 kbps   Very High
Low Delay       96 kbps   Very High

Table 9: Parameters of audio codecs

From former assessments it is known that audio-video coherency is crucial. For that reason several conditions were created in which video and audio are out of sync. The delay conditions (cf. Table 10) were not created in postproduction, but with the playback software.

Condition  Delay
D -120     120 ms (audio after video)
D -100     100 ms (audio after video)
D -80      80 ms (audio after video)
D 20       20 ms (audio ahead of video)
D 30       30 ms (audio ahead of video)
D 40       40 ms (audio ahead of video)
D 50       50 ms (audio ahead of video)
D 60       60 ms (audio ahead of video)
D 80       80 ms (audio ahead of video)
D 100      100 ms (audio ahead of video)

Table 10: Delays taken into consideration for test conditions

2.2.3 Transmission Parameters

Errors within audio and video signals in the digital section of video conferences can be separated into two categories: errors introduced by coding (bit reduction systems) and errors caused by the transmission line. In the past, especially in times of analog transmission, there was a major difference between short-distance ("in-house") and long-distance transmission. With digital transmission and bit-rate-reduced audio and video coding, the error structure of the actual transmission line is (partly) compensated by error correction and error concealment algorithms in the video conferencing systems.
As a result, the part of this study concerning transmission parameters focuses mainly on the different types of error concealment: in situations of severe transmission errors, some systems might mute, others might create moderate and plausible (but unintelligible) sounds, and still others might create nasty, loud distortions. Looking at currently available video conferencing systems, it was found that the audio stream is usually very well protected and the only distortion occurring on the audio section is muting. On the video section two different distortions were observed: frame freezes and inserted black frames. Tests with frame freezes and inserted black frames with up to 5 frames (100 ms) distorted were used in the pretest.

2.2.4 Pretest – Selection of Conditions for Perceptual Test

A pretest conducted at Fraunhofer IDMT showed that the differences between some of the conditions are very small and almost undetectable even under controlled conditions at Fraunhofer IDMT. This is particularly relevant for the audio coding conditions. Very small impairments cannot be assessed by the chosen test procedure, so the number of audio coding conditions was reduced to six. When assessing the audio-video coherency conditions, the participants of the pretest reported that they could not notice any delay for D20 and D30, and that the delay introduced by D40 was hardly perceptible. For that reason, D30 and D50 were left out of the final perceptual test. In addition to the Initial Report, D60, D80, and D100 were added. Concerning transmission errors it was observed that, due to the strong error protection on the audio channel, only very long drop-outs of the transmission line cause perceptual effects. Freezing or blackening of video frames was found to be almost imperceptible for fewer than 5 frames. It is assumed that a transmission channel with such long distortions is completely broken and that perceptual experiments make no sense for such conditions.
However, the perceptual measurement schemes defined in section 3 are able to detect such distortions.

2.3 Conditions used in Perceptual Test

Audio Coding (Monitor)
     Video condition (unchanged)         Audio condition
AC1  1280 x 720, 10 Mbps, Contrast 100%  MPEG-4 AAC LD 32 kbps
AC2  1280 x 720, 10 Mbps, Contrast 100%  MPEG-4 AAC LD 64 kbps
AC3  1280 x 720, 10 Mbps, Contrast 100%  MPEG-4 AAC LD 96 kbps
AC4  1280 x 720, 10 Mbps, Contrast 100%  MPEG-4 AAC LC 32 kbps
AC5  1280 x 720, 10 Mbps, Contrast 100%  MPEG-4 AAC LC 48 kbps
AC6  1280 x 720, 10 Mbps, Contrast 100%  MPEG-4 AAC LC 64 kbps

Audio Bandwidth (Monitor)
     Video condition (unchanged)         Audio condition     Coding
BW1  1280 x 720, 10 Mbps, Contrast 100%  30 Hz – 7,500 Hz    AC6
BW2  1280 x 720, 10 Mbps, Contrast 100%  100 Hz – 12,500 Hz  AC6
BW3  1280 x 720, 10 Mbps, Contrast 100%  30 Hz – 15,000 Hz   AC6
BW4  1280 x 720, 10 Mbps, Contrast 100%  30 Hz – 12,500 Hz   AC6
BW5  1280 x 720, 10 Mbps, Contrast 100%  30 Hz – 7,500 Hz    AC1

Room Acoustics (Monitor)
     Video condition (unchanged)         Audio condition     Coding
RA1  1280 x 720, 10 Mbps, Contrast 100%  Impulse Response 1  AC6
RA2  1280 x 720, 10 Mbps, Contrast 100%  Impulse Response 1  AC2
RA3  1280 x 720, 10 Mbps, Contrast 100%  Impulse Response 2  AC6
RA4  1280 x 720, 10 Mbps, Contrast 100%  Impulse Response 2  AC2

Background Noise (Monitor)
     Video condition (unchanged)         Audio condition  Coding
BN1  1280 x 720, 10 Mbps, Contrast 100%  SNR 15 dB        AC6
BN2  1280 x 720, 10 Mbps, Contrast 100%  SNR 15 dB        AC2
BN3  1280 x 720, 10 Mbps, Contrast 100%  SNR 18 dB        AC6
BN4  1280 x 720, 10 Mbps, Contrast 100%  SNR 18 dB        AC2

Video Coding & Resolution (Monitor)
     Video condition
VC1  1280 x 720, 6 Mbps, Contrast 100%
VC2  1280 x 720, 3 Mbps, Contrast 100%
VC3  1280 x 720, 2 Mbps, Contrast 100%
VC4  1024 x 576, 3 Mbps, Contrast 100%
VC5  1024 x 576, 2 Mbps, Contrast 100%
VC6  1024 x 576, 1 Mbps, Contrast 100%
VC7  640 x 360, 3 Mbps, Contrast 100%
VC8  640 x 360, 2 Mbps, Contrast 100%
VC9  640 x 360, 1 Mbps, Contrast 100%
