Statistical comparison of Japanese V&L Datasets.
| Dataset | Type | # Images | # Texts | Avg. # Chars | Vocabulary Size |
| --- | --- | --- | --- | --- | --- |
| Caption | | | | | |
| STAIR Captions | Human-annotated | 123,287 | 616,435 | 23.80 | 30,195 |
| MS COCO Translation | Machine-translated | 123,287 | 616,767 | 22.41 | 32,960 |
| DEJIMA-Cap-Simple (Ours) | Alt | 3,884,632 | 3,884,632 | 18.21 | 336,924 |
| DEJIMA-Cap-Refined (Ours) | Alt + LLM | 3,884,629 | 3,884,629 | 38.03 | 314,900 |
| DEJIMA-Cap-Detection (Ours) | Detection + LLM | 3,884,632 | 3,884,632 | 49.55 | 30,674 |
| DEJIMA-Cap-All (Ours) | Alt + Detection + LLM | 3,884,632 | 3,884,632 | 79.62 | 287,434 |
| VQA | | | | | |
| Japanese Visual Genome | Human-annotated | 99,208 | 793,664 | 19.50 | 20,797 |
| GQA Translation | Machine-translated | 71,067 | 3,999,765 | 22.58 | 11,856 |
| DEJIMA-VQA-Refined (Ours) | Alt + LLM | 3,875,343 | 3,875,343 | 56.62 | 321,720 |
| DEJIMA-VQA-Detection (Ours) | Detection + LLM | 3,883,943 | 3,883,943 | 77.00 | 31,929 |
| DEJIMA-VQA-All (Ours) | Alt + Detection + LLM | 3,882,892 | 3,882,892 | 108.86 | 278,860 |
To examine DEJIMA’s representational coverage relative to existing datasets, we analyzed 2D feature
distributions by applying PCA to CLIP image embeddings. All datasets were jointly projected into a shared
2D space and discretized on a common $60 \times 60$ grid (with 2% padding) to obtain per-dataset probability maps
$p_d(i,j)$.
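The discretization step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the points have already been projected into the shared 2D PCA space, and it interprets "2% padding" as widening the joint range by 2% on each side. The function name `shared_grid_maps` and its parameters are hypothetical.

```python
from collections import Counter


def shared_grid_maps(points_by_dataset, bins=60, pad=0.02):
    """Discretize 2D points from several datasets on one common grid.

    points_by_dataset maps a dataset name to a list of (x, y) pairs that are
    assumed to be already projected into the shared 2D PCA space.
    Returns {name: {(i, j): probability}}, storing only occupied bins.
    """
    all_points = [pt for pts in points_by_dataset.values() for pt in pts]
    xs = [x for x, _ in all_points]
    ys = [y for _, y in all_points]

    def padded_range(values):
        # Assumption: "2% padding" widens the joint range by 2% on each side.
        lo, hi = min(values), max(values)
        margin = (hi - lo) * pad
        return lo - margin, hi + margin

    def to_bin(value, lo, hi):
        if hi == lo:  # degenerate axis: everything falls in bin 0
            return 0
        index = int((value - lo) / (hi - lo) * bins)
        return min(max(index, 0), bins - 1)  # clamp the upper edge into the last bin

    # The grid limits come from ALL datasets so that bins are comparable.
    x_lo, x_hi = padded_range(xs)
    y_lo, y_hi = padded_range(ys)

    maps = {}
    for name, pts in points_by_dataset.items():
        counts = Counter(
            (to_bin(x, x_lo, x_hi), to_bin(y, y_lo, y_hi)) for x, y in pts
        )
        total = sum(counts.values())
        maps[name] = {b: c / total for b, c in counts.items()}
    return maps
```

Computing the grid limits from the union of all datasets, rather than per dataset, is what makes bin $(i,j)$ refer to the same region of the embedding space for every map.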
For each dataset, we computed two measures: (1) the asymmetric coverage rate $\mathrm{Coverage}(P \mid Q) = \sum_{b \in \mathrm{occ}(Q)} p_P(b)$, where $\mathrm{occ}(Q)$ is the set of bins occupied by $Q$, quantifying how
much probability mass of $P$ lies in bins occupied by $Q$; and (2) the bidirectional KL
divergences $\mathrm{KL}(P \,\|\, Q)$ and $\mathrm{KL}(Q \,\|\, P)$ (with
$\varepsilon = 10^{-12}$) to capture overlap and distributional divergence in the shared space.
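To make the two measures concrete, here is a minimal sketch that operates on sparse probability maps of the form `{(i, j): probability}`. The text does not say exactly how ε enters the KL computation; the version below assumes it is added to both numerator and denominator inside the logarithm, which is one common choice. The function names are hypothetical.

```python
import math


def coverage(p, q):
    """Coverage(P|Q): probability mass of P that lies in bins occupied by Q."""
    return sum(mass for b, mass in p.items() if q.get(b, 0.0) > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(P||Q) over the grid bins.

    Assumption: eps is added inside the log so that bins where Q is empty
    give a large but finite penalty. Bins where P is zero contribute nothing.
    """
    return sum(
        mass * math.log((mass + eps) / (q.get(b, 0.0) + eps))
        for b, mass in p.items()
        if mass > 0.0
    )


# Toy maps: P spreads over two bins, Q occupies only one of them.
p = {(0, 0): 0.5, (1, 1): 0.5}
q = {(0, 0): 1.0}
print(coverage(p, q), coverage(q, p))
print(kl_divergence(p, q), kl_divergence(q, p))
```

In the toy example only half of P's mass falls in Q's single bin, while all of Q's mass falls in bins P occupies, so both measures are asymmetric: KL(P||Q) is much larger than KL(Q||P) because P puts mass where Q has none.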
Using the domestic dataset recruit-jp as the reference target, DEJIMA achieved the highest
coverage, Coverage(target|DEJIMA)=0.785, exceeding Japanese Visual Genome
(0.435), STAIR Captions (0.430), MS COCO (0.406), and GQA (0.342). This indicates that the bins occupied by DEJIMA contain about 79%
of the probability mass of the real Japanese imagery in recruit-jp. Conversely, for Coverage(dataset|target), Japanese Visual Genome was highest (0.534), followed
by MS COCO (0.502) and STAIR (0.492), while DEJIMA was much lower (0.192): most of DEJIMA's mass lies outside the bins occupied by recruit-jp, suggesting broader support beyond
the domestic domain.
KL divergences showed a consistent pattern: KL(recruit-jp||DEJIMA)=6.03 was the
lowest, compared with Japanese Visual Genome (≈12.2), STAIR (≈12.3), MS COCO (≈12.8), and GQA (≈14.2). In the reverse direction, KL(DEJIMA||recruit-jp)=16.4, indicating that DEJIMA includes regions not present in
recruit-jp.
PCA projection of CLIP image embeddings: DEJIMA covers the domestic
region (recruit-jp) and extends smoothly toward broader global contexts.
We evaluated caption and VQA quality via pairwise human comparisons (150 samples per dataset, randomized
order, 80 crowd workers). We inserted control items for quality control and removed inconsistent annotations;
the final sample sizes (n) are shown in each table.
Caption metrics: Japaneseness of image, Japaneseness of text, Naturalness of text, Image–text consistency,
Coverage, Expressiveness. VQA metrics: Japaneseness (image/text), Naturalness of text, Q–A relevance,
Q–image consistency, Answer correctness.
Significance was tested with two-sided binomial tests against 50%, using the 5% level with Holm–Bonferroni
correction. A * indicates significance at 5%.
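The testing procedure can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' code: the exact two-sided p-value is computed by summing the probabilities of all outcomes no more likely than the observed one, and the family of tests over which the Holm–Bonferroni correction is applied is not specified in the text, so the caller chooses it. The function names are hypothetical.

```python
from math import comb


def binom_two_sided_p(k, n, p0=0.5):
    """Exact two-sided binomial p-value for k successes out of n trials.

    Sums the probabilities of all outcomes that are no more likely than
    the observed count under the null success rate p0.
    """
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(x for x in pmf if x <= observed * (1 + 1e-7))
    return min(1.0, p_value)


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans (in input order) marking rejected hypotheses.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject
```

To apply this to a table cell, the preference rate must first be converted back to a count: for example, 82.86% with n = 105 corresponds to 87 of 105 judgments, so the p-value would be `binom_two_sided_p(87, 105)`.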
Caption: Pairwise preference of DEJIMA-Cap-All vs. baselines.
| Compared Dataset | n | Japaneseness of image | Japaneseness of text | Naturalness of text | Image-text consistency | Coverage | Expressiveness |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MS COCO Translation | 105 | 82.86* | 87.62* | 86.67* | 20.00* | 39.05* | 92.38* |
| STAIR Captions | 135 | 74.07* | 77.78* | 62.22* | 20.74* | 43.70 | 68.89* |
| DEJIMA-Cap-Refined | 105 | -- | 86.67* | 64.76* | 52.38 | 61.90* | 91.43* |
| DEJIMA-Cap-Detection | 135 | -- | 76.30* | 65.93* | 62.96* | 70.37* | 81.48* |
VQA: Pairwise preference of DEJIMA-VQA-All vs. baselines.
| Compared Dataset | n | Japaneseness of image | Japaneseness of text | Naturalness of text | Q-A relevance | Q-Image consistency | Answer correctness |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GQA Translation | 105 | 91.43* | 92.38* | 89.52* | 41.90 | 51.43 | 41.90 |
| Japanese Visual Genome | 135 | 92.59* | 87.41* | 78.52* | 31.85* | 38.52* | 34.81* |
| DEJIMA-VQA-Refined | 90 | -- | 76.67* | 74.44* | 57.78 | 56.67 | 57.78 |
| DEJIMA-VQA-Detection | 120 | -- | 79.17* | 72.50* | 61.67* | 65.83* | 63.33* |