Wen Ying, Yeonsu Kim, Adil Rahman, Erzhen Hu, Geehyuk Lee, Seongkook Heo
Redirected Pinch: Efficient and Comfortable Bare-Hand Interaction for 2D Windows in VR
(Abstract) Virtual Reality (VR) offers portable and flexible workspaces. However, enabling efficient and comfortable interactions without external input devices remains challenging. We propose leveraging redirected input to enable comfortable and touch-like interaction for quick and intuitive control. Our design study revealed that while touch interaction performs well with direct input, its performance degrades significantly under input redirection. In contrast, using pinch improves redirected input by providing self-haptic feedback and reducing input dimensionality, thereby compensating for spatial discrepancies. Based on these findings, we introduce Redirected Pinch, a bare-hand interaction technique that combines input redirection with pinch confirmation. It creates a virtual plane at waist height, remapping hand movements on the plane to a vertical window, with pinch gestures used for confirmation. A user study demonstrated that Redirected Pinch achieves a strong balance of accuracy, efficiency, comfort, and sense of agency across fundamental interactions.
(Introduction) Virtual Reality (VR) is emerging as a promising platform for productivity tasks, offering portable, extendable, and distraction-free workspaces that adapt to user needs and contexts [29, 39, 59, 66, 70]. Commercial headsets like Meta Quest 3 and Apple Vision Pro already support desktop-like workflows while enabling large, reconfigurable virtual screens accessible anywhere. Despite the three-dimensional nature of VR environments, many productivity applications continue to rely on two-dimensional (2D) interfaces, such as media editing, document review, and web browsing, as 2D layouts remain familiar, efficient, and cognitively lightweight [12, 13, 24, 28, 36, 75]. However, supporting efficient and comfortable interaction with such 2D windows in VR remains a critical challenge.
Common VR interaction methods, such as using handheld controllers, often lack both precision and comfort during prolonged use [71]. External devices like mice [31], keyboards [30, 40, 42], and tablets [12, 13, 28] can provide more accurate control but at the cost of portability, while also requiring physical surfaces for support. With hand tracking integrated into commercial headsets, bare-hand interaction offers a natural, lightweight, and always-available alternative [43, 45], capable of enabling more dexterous input than device-constrained methods [48, 58]. However, bare-hand interaction lacks tangibility and stability, making it inefficient for precise input [34, 49], and it induces fatigue when arms are kept elevated, known as the “Gorilla Arm” effect [9, 32].
To reduce fatigue and extend reach, researchers have explored input remapping, which manipulates the spatial relationship between real and virtual hands. The Go-Go technique [64] introduced nonlinear reach extension, and subsequent work has amplified small arm movements for large-window interaction [52] or redirected near-body hand motions onto far-away windows for more comfortable input [16]. While effective for ergonomics, input remapping can introduce visual–motor mismatch, disrupting proprioception and making mid-air touch less reliable [41, 54]. Complementary work improves input redirection using physical surfaces or custom haptic devices [27, 54], but such approaches reduce mobility and are not always available for everyday VR use.
Self-haptic gestures offer an appealing lightweight alternative [25, 33, 61, 86]. Pinch provides tactile confirmation through thumb-finger contact and supports robust gesture recognition [18, 62, 79], leading to wide adoption in commercial systems [6, 55]. However, the performance benefits of pinch are not consistent across contexts. Pinch can be slower, more error-prone, and physically demanding in direct mid-air selection [14, 21, 57], yet has shown benefits in indirect interactions such as text entry [26, 41]. We hypothesize that pinch becomes more beneficial in redirected mid-air interaction than in direct interaction for two reasons: (1) its immediate tactile feedback without external devices, compensating for the uncertainty introduced by spatial remapping, and (2) its reduced input dimensionality, as 3D positioning is transformed into 2D motion with a binary pinch confirmation gesture, mitigating accuracy issues caused by depth perception challenges in redirected spaces.
In this paper, we present Redirected Pinch, a novel bare-hand interaction technique that combines pinch with input remapping to enable efficient and comfortable interaction with 2D windows in VR (Figure 1). Redirected Pinch creates a tilted virtual plane at waist height, decoupling the visual workspace from the control space for more ergonomic interaction [4, 16]. Hand movements relative to this plane are remapped in both position and orientation to the application window, while pinches provide self-haptic confirmation. This design reduces fatigue from elevated postures and compensates for spatial inaccuracies caused by remapping. We developed Redirected Pinch through a design study that explored different input mappings (i.e., direct vs. redirected) and confirmation gestures (i.e., touch vs. pinch). Our findings revealed that while touch worked effectively for direct input, its performance degraded significantly under redirected input. In contrast, pinch enhanced redirected input across accuracy, efficiency, and sense of agency, supporting our hypothesis about the value of pinch in remapped interactions.
Finally, we evaluated the performance of Redirected Pinch against commonly used VR interaction techniques, including direct pinch, gaze pinch, and handray pointer, in both simple selection tasks and more complex docking tasks. The results showed that Redirected Pinch provided the best overall balance of comfort, efficiency, and sense of agency in both tasks. In addition, Redirected Pinch was consistently preferred because it required less effort and offered easier control, particularly during prolonged, complex interactions involving continuous and multi-touch input.
(Conclusion) In this work, we investigated mid-air bare-hand interaction techniques for discrete, continuous, and multi-touch inputs on 2D windows in VR. Through design studies comparing input mapping and confirmation gestures, we propose Redirected Pinch, a novel interaction technique that allows users to perform pinch gestures with comfortable postures while enabling efficient interactions by hand redirection and input remapping. We compared Redirected Pinch with three commonly used methods: Direct Pinch, Gaze Pinch, and Handray Pointer. The comparative evaluation showed that Redirected Pinch provides a strong balance of speed, accuracy, agency, and comfort across both selection and docking tasks. While we tested Redirected Pinch in a controlled study, it can be flexible and adapted to more realistic applications. Future work could explore similar techniques in real-world productivity tasks involving windows of varying sizes, distances, orientations, and multi-window setups.
Gorilla Arm Effect
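Wearing a VR headset (HMD), a user can open multiple 2D windows in virtual space to set up a virtual office.
Without a mouse, repeatedly reaching out to touch a screen floating far away in mid-air quickly fatigues the shoulder and arm; this is known as the gorilla arm effect.
To reduce the gorilla arm effect, input redirection lets the hand move comfortably at waist height while its motion is mapped onto the screen. However, because the hand's actual position differs from where the eyes see it, precise touch input becomes difficult.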
Redirected Pinch
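To address the gorilla arm effect, Redirected Pinch combines interaction at a comfortable hand position (the waist) with the most intuitive hand gesture, the pinch.
Assume an invisible virtual plane at the user's waist height. When the user moves a hand at that height, its trajectory is mapped to a pointer on the vertical window in front of the eyes. Because the interaction happens at waist height, the arm does not need to be held up, which substantially reduces arm fatigue.
Instead of pressing the screen with a finger (touch), a pinch gesture that brings the thumb and index finger together serves as the click and confirmation signal. Because a pinch produces the physical sensation of one's own fingers touching (self-haptic feedback), input stays much more accurate and stable even when the seen and actual hand positions differ.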
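The remapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Plane` type, the `remap` and `is_pinching` functions, the plane sizes, and the 1.5 cm pinch threshold are all hypothetical, and the sketch remaps position only (the paper also remaps orientation).

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


@dataclass(frozen=True)
class Plane:
    """A rectangle in 3D: one corner plus two perpendicular unit axes (hypothetical type)."""
    origin: Vec3   # lower-left corner, in metres
    u_axis: Vec3   # unit vector along the rectangle's width
    v_axis: Vec3   # unit vector along the rectangle's height
    width: float
    height: float


def remap(hand: Vec3, control: Plane, window: Plane) -> Vec3:
    """Map a hand position near the control plane to a cursor position on the window."""
    rel = _sub(hand, control.origin)
    # Project onto the plane's axes. The offset along the plane normal is dropped,
    # so the 3D hand position collapses to a 2D coordinate in [0, 1] x [0, 1].
    s = min(max(_dot(rel, control.u_axis) / control.width, 0.0), 1.0)
    t = min(max(_dot(rel, control.v_axis) / control.height, 0.0), 1.0)
    return tuple(
        window.origin[i]
        + s * window.width * window.u_axis[i]
        + t * window.height * window.v_axis[i]
        for i in range(3)
    )


def is_pinching(thumb_tip: Vec3, index_tip: Vec3, threshold: float = 0.015) -> bool:
    """Binary confirmation: thumb and index fingertips closer than an assumed threshold."""
    return math.dist(thumb_tip, index_tip) < threshold
```

A tilted control plane is expressed simply by choosing tilted `u_axis` and `v_axis` vectors. Dropping the component along the plane normal is what reduces the input from 3D positioning to 2D motion plus a binary pinch.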
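Touch vs. pinch
Direct touch, pressing the screen right in front of the eyes, is fast and intuitive, but its accuracy drops sharply under redirection, which distorts position. In contrast, a pinch between the thumb and index finger gives self-haptic feedback from finger contact and reduces input dimensionality, which compensates for the loss of accuracy even when position is distorted.
In comparative experiments against gaze pinch (look, then pinch), a ray pointer, and other techniques, Redirected Pinch showed the best balance of speed, accuracy, comfort, and sense of agency (the feeling of being fully in control of the screen). Users also preferred it for long, complex, continuous tasks such as dragging and multi-touch, because of its low fatigue and ease of use.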
Adil Rahman, Wen Ying, Md Aashikur Rahman Azim, Michelle Annett, Seongkook Heo
"It Feels Like I am Invited to Communicate": Mediating Ad-Hoc Bystander-VR User Interruptions Through Proactive Proxies
(Abstract) As VR expands into public spaces, new challenges emerge around spontaneous interactions between bystanders and unfamiliar VR users. While current VR systems often prioritize user awareness of their physical surroundings, they overlook the social dynamics affecting nearby bystanders. We conducted a deception-based study (N=80) examining how interface availability influences bystanders’ comfort, confidence, and hesitation when interrupting VR users. We compared traditional static interruption interfaces (e.g., button on screen) with a proactive proxy that actively approached bystanders upon detecting interruption intent. Static interfaces, due to insufficient cueing, frequently caused bystander discomfort, leading to hesitant physical interruptions or complete communication avoidance. In contrast, the proactive proxy implicitly conveyed social permission, significantly enhancing bystanders’ comfort and confidence. Our findings provide empirical insights into how bystanders assess availability and initiate interruptions with unfamiliar VR users in shared spaces, offering design implications for VR systems that support bystander agency and comfort during these interactions.
(Introduction) Human interruptions in shared spaces are inevitable, whether it is asking a colleague for their input on a shared project [80] or asking a fellow passenger to move [81]. In these moments, gaining someone’s attention is essential. However, interruptions become more challenging when individuals are deeply immersed in activities that render their primary communication channels unavailable (e.g., listening to music occupies their hearing, reading occupies their vision, etc.). In such cases, secondary communication channels like peripheral vision and ambient awareness serve as subtle cues for initiating interaction.
When individuals are immersed in virtual reality (VR), this dynamic becomes especially complex because their audiovisual immersion effectively limits the secondary use of these primary communication channels [19, 28, 40, 53]. Several alternative modalities have thus been proposed to manage interruptions in VR [50]. Modern VR headsets, such as the Meta Quest and Apple Vision Pro, incorporate open-ear audio technology that enables users to hear verbal interactions. However, this technology can reduce immersion and may not align with all users’ preferences [50]. Additionally, such technology can be problematic in shared spaces, as sound leakage can disturb nearby individuals and make VR users self-conscious about audio spillover [50, 63]. Users already seek full auditory isolation when completing focused work in public environments by using noise-canceling headphones [46, 55], and as portable VR headsets become productivity tools, extending this need for isolation to the visual domain is a natural progression. Both the Apple Vision Pro and Meta Quest now include dedicated travel modes for use on airplanes and trains [4, 41], explicitly supporting fully immersive use in shared spaces. Moreover, airlines have also begun providing VR headsets to passengers as part of in-flight entertainment services [16], while academic and public libraries increasingly offer VR headset lending programs and dedicated VR spaces [2, 34]. These developments signal that encounters between unfamiliar individuals (i.e., one immersed in VR and the other needing to interrupt them) will become increasingly common. Critically, because bystander awareness features within headsets remain user-controlled, bystanders have no guaranteed means of initiating contact when VR users opt for full immersion. Touch, while potentially the most direct way to interrupt a VR user, is often socially inappropriate in public settings, where interactions among strangers are governed by implicit boundaries [50, 53].
As VR headsets transition beyond private environments into shared spaces where bystanders may need to interrupt VR users, developing socially appropriate and effective interruption mechanisms becomes increasingly critical [1, 14, 62, 72, 81].
One approach to this challenge has been to enhance VR users’ awareness of their surroundings. Previous work has explored embedding awareness systems within VR headsets to help users avoid physical accidents and maintain spatial boundaries [22, 40, 42, 49, 51, 53, 58, 73, 84]. Such systems, however, cannot adequately replicate the dynamics of face-to-face interruptions and the subtle social cues that initiators (i.e., bystanders) carefully weigh when determining whether an interruption is appropriate [44]. By shifting control from initiators to VR users, awareness systems invert the natural negotiation process that governs interpersonal interruptions. Beyond immersion trade-offs and the risk of information overload in public spaces, embedded awareness systems may also prematurely redirect a VR user’s attention to bystanders, triggering unwanted face engagements [20]. Although awareness systems can dynamically adjust how they deliver information by withholding alerts until a verbal interaction is attempted [52], such techniques overlook how bystanders assume others’ unavailability and make decisions not to engage [76, 80]. In crowded public spaces, VR users may even disable these systems to avoid constant notifications about nearby bystanders, especially when they do not anticipate needing to interact with others.
Although limited, prior work has explored bystander-initiated interruption interfaces, such as HTC’s Knock Knock feature [71] and a physical doorbell peripheral [81], which offer explicit mechanisms for bystanders to attract a VR user’s attention. However, these interfaces typically assume that bystanders are aware of their existence, limiting their effectiveness in public spaces where interruptions are often spontaneous and involve unacquainted individuals. While interruptions between acquaintances may be more frequent, stranger-to-stranger interactions represent the most challenging case due to the absence of established social rapport and the heightened psychological cost of initiating contact [12]. Solutions effective in such demanding contexts are also likely to generalize to less demanding scenarios. While prior research underscored the importance of observing spontaneous interactions between unfamiliar individuals [50], little is known about how such encounters unfold when unacquainted VR users share physical spaces. Yet, insights into these dynamics are essential for designing VR headsets that integrate more seamlessly into public environments. In this work, we investigate how bystanders naturally navigate spontaneous interactions with unfamiliar VR users in shared environments.
As studying such unplanned encounters at scale is challenging due to their situational constraints, we conducted a deception-based study that recreated these constraints in a controlled setting. Unlike prior work that relied on solicited interruptions [17, 23, 50], acquainted participants [50, 53], or anecdotal reports [53], our study placed 80 unsuspecting bystanders in scenarios requiring urgent interaction with an unfamiliar VR user in a shared space. Our study had two interruption interface conditions. The baseline condition, experienced by 40 participants, used a typical desktop VR setup with a static interruption interface (e.g., doorbell peripheral). For the second condition, we hypothesized that explicitly visible, bystander-initiated interfaces could improve interruption experiences. Inspired by public display research [27, 30], we created a robot-based interruption interface, i.e., a proactive proxy that served as a design probe to examine whether explicit interface discoverability impacted bystander experiences. This deliberately extreme intervention was intended to isolate the effect of proactive presentation on bystander comfort and behavior. This condition was experienced by an additional 40 participants.
During our study, we found that, similar to public kiosk research [30], the participants who encountered the baseline condition suffered from the first-click problem [30], where they failed to notice or use the static interface. In contrast, the participants who encountered the proactive proxy reported increased comfort and engagement. While participants preferred mediated interaction over direct interruption, the static interface was largely ignored. These findings underscore the importance of interface visibility and design in shared VR spaces, offering actionable insights for future VR headset development that better supports both users and bystanders.
This research contributes: (1) Empirical evidence that bystanders feel significant discomfort during spontaneous interactions with unfamiliar VR users in shared spaces, with key insights revealing how physical boundaries, perceived safety, and social dynamics influence their interruption behaviors. (2) Identification of key limitations in existing static interruption interfaces, showing how privacy concerns and low situational awareness hinder their discoverability and usability during spontaneous interactions. (3) Design implications for bystander-aware VR interfaces that support comfortable and effective spontaneous interactions, emphasizing the need to explicitly integrate bystander perspectives into VR system design.
(Conclusion) As VR systems increasingly extend beyond private spaces, understanding and supporting interactions between VR users and bystanders becomes essential. Our research provides empirical evidence that bystanders often feel significant discomfort when interrupting unfamiliar VR users and that traditional static interruption interfaces frequently go unnoticed during spontaneous encounters. Our proactive proxy interface, on the other hand, demonstrated that explicitly discoverable interfaces can significantly improve bystander comfort and interaction quality by signaling social permission to initiate contact. These findings suggest that as VR systems increasingly move into shared and public environments, their design must also consider the needs of nearby bystanders. In particular, incorporating interfaces that clearly communicate availability and guide bystander attention can help trigger interruptions in a comfortable and socially appropriate way.
We advocate for integrating bystander-centric design into VR systems, demonstrating that giving bystanders agency in managing interruptions enhances comfort and engagement. These interfaces can complement existing awareness systems by offering explicit communication channels without compromising immersion. We outline key design considerations for such systems and suggest that future research explore how to effectively combine these approaches across diverse contexts. As VR adoption expands into offices, transit systems, and public venues, this integration will be vital for creating socially attuned VR experiences that respect the needs of all participants.
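Why is "It Feels Like" needed?
In a public place such as an airport, an exhibition, or a library lobby, someone is immersed in virtual reality wearing an HMD that completely covers their eyes.
A passerby (bystander) urgently needs to speak to this VR user, for example to ask for directions or to say "You need to move from here."
However, because the VR user's eyes are covered, the bystander cannot tell when it is appropriate to speak, and tapping a stranger on the body is very awkward.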
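The robot assistant (proactive proxy) in "It Feels Like"
To address this problem, a very small wheeled mobile robot was placed next to the VR user.
The top of the robot has a smartphone-like interface (a screen and a button), and the base houses wheels, motors, and a Raspberry Pi (see Appendix C).
What "detects intent and proactively approaches" means in practice
Intent detection: Sensors such as the robot or nearby cameras pick up not the VR user but a bystander who is lingering nearby or looking at and approaching the VR user, i.e., they detect the behavior pattern of someone who wants to talk.
Robot departs: Once the bystander's intent to talk is detected, the robot, which had been standing still next to the VR user, glides over to the bystander.
Mediation and on-screen prompt: On arriving in front of the bystander, the robot's screen shows a message such as "The VR user is playing a casual game, so it is fine to talk to them. Press this button to send the VR user a notification."
Interaction: When the bystander presses the button on the robot's screen, a notification such as "Someone nearby has requested a conversation" appears safely in the VR user's headset, and the VR user takes off the headset or turns on the external camera to make eye contact and start the conversation.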
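The four steps above read naturally as a small state machine. The sketch below is a hypothetical illustration of that reading, not the study's actual robot software: the state and event names are invented here, and the `BYSTANDER_LEFT` reset is an added assumption that the notes do not describe.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()         # robot waits next to the VR user
    APPROACHING = auto()  # robot drives toward the bystander
    PROMPTING = auto()    # robot shows the availability message and button
    NOTIFIED = auto()     # VR user has been notified in the headset


class Event(Enum):
    INTENT_DETECTED = auto()  # sensors see a bystander who seems to want to talk
    ARRIVED = auto()          # robot has reached the bystander
    BUTTON_PRESSED = auto()   # bystander pressed the on-screen button
    BYSTANDER_LEFT = auto()   # assumption: bystander walked away, so reset


# Allowed transitions; any (state, event) pair not listed leaves the state unchanged.
TRANSITIONS = {
    (State.IDLE, Event.INTENT_DETECTED): State.APPROACHING,
    (State.APPROACHING, Event.ARRIVED): State.PROMPTING,
    (State.PROMPTING, Event.BUTTON_PRESSED): State.NOTIFIED,
    (State.APPROACHING, Event.BYSTANDER_LEFT): State.IDLE,
    (State.PROMPTING, Event.BYSTANDER_LEFT): State.IDLE,
}


def step(state: State, event: Event) -> State:
    """Advance the proxy by one event."""
    return TRANSITIONS.get((state, event), state)


def run(events: list[Event], state: State = State.IDLE) -> State:
    """Feed a sequence of events through the state machine and return the final state."""
    for event in events:
        state = step(state, event)
    return state
```

Keeping the transitions in a table makes the key property easy to see: the VR user is only notified after the robot has approached and the bystander has explicitly pressed the button.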
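Interactions that arise when VR moves into public spaces, and conclusions
People feel considerable psychological discomfort when speaking to an unfamiliar VR user.
Existing static, passive interfaces, such as a bell mounted on a wall, are ignored; people do not even glance at them.
A proactive proxy, where the system announces its presence first and approaches, signals social permission ("it is okay to communicate now") to the bystander, which makes communicating more comfortable.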
Erzhen Hu, Frederik Brudy, David Ledo, George Fitzmaurice, Fraser Anderson
PrevizWhiz: Combining Rough 3D Scenes and 2D Video to Guide Generative Video Previsualization
(Abstract) In pre-production, filmmakers and 3D animation experts must rapidly prototype ideas to explore a film’s possibilities before full-scale production, yet conventional approaches involve trade-offs in efficiency and expressiveness. Hand-drawn storyboards often lack spatial precision needed for complex cinematography, while 3D previsualization demands expertise and high-quality rigged assets. To address this gap, we present PrevizWhiz, a system that leverages rough 3D scenes in combination with generative image and video models to create stylized video previews. The workflow integrates frame-level image restyling with adjustable resemblance, time-based editing through motion paths or external video inputs, and refinement into high-fidelity video clips. A study with filmmakers demonstrates that our system lowers technical barriers for filmmakers, accelerates creative iteration, and effectively bridges the communication gap, while also surfacing challenges of continuity, authorship, and ethical consideration in AI-assisted filmmaking.
(Introduction) Previsualization (previz) is a central practice in filmmaking, enabling directors and creative teams to explore the visual and narrative structure of a scene before production [54]. By creating early visualizations, filmmakers can test ideas for camera angles, blocking, pacing, and emotional beats without the expense of full-scale sets, actors, or detailed assets. Beyond its role as a creative sketching tool, previz also functions as a collaborative artifact, helping directors, cinematographers, production designers, and other stakeholders align around a shared vision [2, 4, 18].
Despite its importance, existing approaches force filmmakers to make trade-offs between speed, fidelity, and control. Storyboards and moodboards are quick and expressive, allowing for early exploration and communication of creative intent [16, 44]. However, these are static, offering limited spatial and temporal representation: they cannot adequately represent motion or timing, making it difficult to visualize complex shots or sequences. 3D previz tools, on the other hand, allow filmmakers to compose scenes, experiment with camera blocking, and ensure continuity across shots [12, 34, 54]. However, these tools require high-fidelity 3D assets, rigging, and animation expertise [40]. Many existing 3D previz tools also fail to convey fine-grained nuances like emotional beats and micro-actions.
Recent advances in generative AI can accelerate previsualization by producing images or videos directly from textual prompts, allowing filmmakers to quickly generate outputs with a compelling visual style [17]. Yet, they pose new challenges. Text-to-image and text-to-video models often struggle with temporal consistency, making coherent motion across frames challenging [36]. They also lack spatial grounding: precise placement of objects and camera, blocking, and continuity are difficult to control. As a result, current approaches using generative AI risk producing highly polished-looking results that are disconnected from the filmmaker’s intended structure. Filmmakers need a lightweight and flexible approach that combines the spatial grounding of 3D tools with the expressive richness of generative video tools.
We present PrevizWhiz, a system that allows filmmakers to rapidly explore and visualize their shots by combining rough 3D scene blocking for timing and spatial structure, 2D video references for detailed character motion, and generative stylization guided by images and text. Filmmakers begin by arranging rough 3D proxies to establish prop positions, character movement, as well as camera paths (Figure 1a). They can restyle frames from their 3D scenes to experiment with different aesthetic styles, ranging from strict adherence to loose reinterpretation of their compositions (Figure 1a). Finally, PrevizWhiz allows filmmakers to specify three levels of motion fidelity: (1) coarse motion from 3D blocking, (2) stylized animations that combine motion from 3D blocking with the restyled frame, and (3) control-video animation that augments the stylized animation with 2D reference videos for detailed character motion (Figure 1b). These frames of scene composition and time-based elements can guide the video generation of final outputs (Figure 1c), shaping style, lighting, composition, and movement in ways that balance the structural consistency of 3D blocking with the expressiveness of 2D generative tools to create previsualization for film.
Our contributions are: (1) PrevizWhiz, a system that combines rough 3D blocking, frame stylization, and granular animation control to enable lightweight yet expressive previsualization, and (2) findings from a user study with filmmakers and 3D artists showing how the system enables rapid ideation during previsualization and probes their thoughts on generative tools for pre-production.
(Conclusion) We presented PrevizWhiz, a system that combines rough 3D scene blocking, detailed character motion, and video stylization through generative AI to support flexible, rapid previsualization. Through a user study with filmmakers, we found that the system enabled lightweight scene setup, iterative refinement, and expressive authoring across modalities. Our findings suggest that AI-assisted previz can augment creative practice, lowering barriers for independent creators to communicate their creative intent. At the same time, issues of latency, consistency, and fear of displacement highlight the need for careful future design.
Previz
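Previz is the essential step of visualizing camera angles, actor blocking, timing, emotional beats, and so on before a film or animation goes into full production. It lets the production team align on a shared vision.
Existing previz approaches include hand-drawn storyboards, 3D previz software, and existing text-based generative AI. Their trade-offs are as follows.
Hand-drawn storyboards: Fast and effective for conveying emotion, but as still images they are limited in expressing complex camera movement or spatially precise timing.
3D previz software: Gives full control over space and camera paths, but requires detailed 3D assets (characters, backgrounds, etc.), rigging, and animation expertise, i.e., the barrier to entry is high and the work is time-consuming.
Existing text-based generative AI: Quickly produces plausible, attractive video, but frame-to-frame consistency is poor and it is difficult to place objects exactly where the director wants or to control camera movement, i.e., it lacks grounding.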
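Lack of grounding
Grounding means anchoring a model firmly to real physical rules or real-world data (coordinates, structure, physical laws, etc.). A lack of grounding means the AI ignores concrete real-world references (position, size, physical quantities) and freely imagines a plausible-looking image instead.
An example of a text-based AI lacking grounding: given the prompt "An Americano cup sits on a cafe table, and the camera moves to the right," a text-to-video AI such as Sora generates a stylish, attractive video.
From the standpoint of the precise control a director wants, however, the following problems arise:
No position control: Even when asked to "place the Americano cup exactly 15 cm to the left of the center of the table," the AI does not understand the physical coordinate "15 cm to the left" and puts the cup somewhere else.
Physical consistency breaks down: While the camera moves to the right, the chair behind the table suddenly grows in size or the cup's handle changes direction, so consistency across frames collapses.
In other words, an abstract text command alone cannot ground the AI in the rules of real 3D space or in exact numeric positions.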
PrevizWhiz
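PrevizWhiz combines roughly placed 3D proxies (spatial guide), ordinary 2D video (motion guide), and generative AI (styling).
Rough 3D blocking: Instead of detailed assets, rough proxies such as boxes and triangular shapes are used to set only positions, camera paths, and timing, i.e., to build the spatial skeleton.
2D video reference: Fine character motion and emotional expression come from video shot casually with an ordinary 2D camera, i.e., the costly 3D animation step is skipped.
Generative AI styling: The AI model uses the 3D skeleton and the 2D video as guidance to render a high-fidelity video clip in the visual style the director wants (photorealistic, animated, etc.), i.e., it gains both the spatial control of 3D and the expressiveness and speed of generative AI.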
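One way to picture how these three ingredients stack is through the three levels of motion fidelity listed in the paper's introduction. The sketch below is purely illustrative: the `MotionFidelity` enum, the `guidance_inputs` function, and the string labels are hypothetical names chosen here, not part of PrevizWhiz's actual interface.

```python
from enum import Enum


class MotionFidelity(Enum):
    COARSE = 1         # (1) coarse motion from 3D blocking only
    STYLIZED = 2       # (2) 3D blocking motion combined with the restyled frame
    CONTROL_VIDEO = 3  # (3) stylized animation plus a 2D reference video


def guidance_inputs(level: MotionFidelity) -> list[str]:
    """Return the (hypothetical) guidance signals used at a given fidelity level."""
    inputs = ["3d_blocking_motion"]  # every level starts from the rough 3D blocking
    if level.value >= MotionFidelity.STYLIZED.value:
        inputs.append("restyled_frame")
    if level is MotionFidelity.CONTROL_VIDEO:
        inputs.append("2d_reference_video")
    return inputs
```

Each level only adds guidance on top of the previous one, so the spatial structure from 3D blocking is present at every level.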
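Results of a user study with filmmakers and 3D artists:
The system lowered technical barriers while accelerating creative iteration, the process of quickly trying out ideas. Independent filmmakers could also visualize their vision without large costs.
However, the study also surfaced challenges: continuity, a chronic problem of generative AI; authorship; and ethical considerations (such as concerns about replacing human labor).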
Adil Rahman, Koichiro Ninuma, Aakar Gupta
DataSpeck: An AI-Driven Human-in-the-Loop System for Automating Transformations in Data Conversion Workflows
(Abstract) In data-driven systems, integrating disparate data sources becomes challenging when incoming data does not conform to the system’s specifications. Despite advances in automated schema matching systems, data integration tasks involving complex semantic interrelationships still require users to manually identify and define transformations between datasets, which can be cognitively demanding and time-consuming. We present DataSpeck, an end-to-end system that automates the conversion of disparate data sources to fit any pre-existing data specification. DataSpeck employs an AI-driven human-in-the-loop design, using LLMs to analyze semantic relationships and generate step-by-step transformation pipelines autonomously, while only requesting user attention to resolve semantic ambiguities. In our technical evaluation, DataSpeck successfully automated ~86% of varied data transformations while generating interpretable strategies with confidence scores and targeted clarification requests. In a user study (N=12), participants completed data conversion tasks ~53% faster with significantly reduced cognitive load using DataSpeck compared to Microsoft Excel with Copilot.
(Introduction) Data-driven systems are often built around specific data formats and structures optimized for their intended use cases. However, the growing variety of data sources poses a major challenge, as incoming data frequently deviates from expected formats and must be adapted before it can be utilized [58]. For example, a healthcare analytics platform designed to process standardized patient records may struggle to integrate external research datasets with differing schemas [11]. Integrating new data sources within established data architectures requires manually designing elaborate transformation pipelines which map incoming data to match the specification format [113]. This process consumes time and effort and must be repeated every time a new data source is introduced, creating significant bottlenecks in data integration workflows.
Reconciling different data formats has been a longstanding research goal spanning decades in data management and information systems, with schema matching and mapping techniques being the primary approaches to enable interoperability between heterogeneous data formats [85, 98]. While these systems excel at finding match candidates between source and target schemas, they often leave the semantic relationships and necessary transformations between matched attributes for users to define manually [6, 10, 13, 98, 126]. In data workflows, this transformation step remains a significant bottleneck, requiring detailed preparation before data can be utilized [75]. To address this, various interaction techniques seek to simplify the process by reducing manual coding requirements. Programming-by-example [8, 20, 44, 49, 62, 106] and Programming-by-demonstration [47, 64] systems have proven particularly effective in eliminating the need to manually write transformation code and allow users to automate data formatting through demonstrated examples. Natural language interfaces [56] have further enabled users to describe transformation requirements in plain language, though they often require precise phrasing to avoid ambiguity [66, 109, 123]. Despite these advancements, users must still understand data structures, interpret relationships, and devise appropriate transformations—a process that remains cognitively demanding for complex or unfamiliar datasets. While human insight remains crucial for exploratory analysis [105], scenarios with predetermined data structures could benefit from more nuanced automation approaches that reduce unnecessary overhead when adapting diverse sources to existing specifications.
In this paper, we present DataSpeck, an AI-driven system designed to convert a dataset into a prescribed specification. Unlike previous approaches, DataSpeck does not require users to provide examples or describe the transformation process explicitly. Instead, it leverages LLMs to analyze the semantic relationships between the new data source and the pre-existing data specification, and uses this understanding to automatically generate both transformation strategies and the corresponding data conversion scripts. However, data conversions can involve ambiguities that require additional context. To address such scenarios, DataSpeck incorporates a human-in-the-loop design, classifying the need for human input based on system confidence, and prompting users to provide additional context when necessary.
To evaluate the effectiveness of our human-in-the-loop system, we first conducted a technical evaluation to understand the boundaries of our system’s automation capabilities. We tested DataSpeck against 43 isolated transformation scenarios and 5 complex real-world data conversion scenarios. Our system was able to successfully automate ~86% of the data transformation operations and yielded appropriate system confidence scores and clarification requests. Our technical evaluation also highlighted transformation scenarios where the automation capabilities struggled. Using these insights, we designed a user study with 12 participants who had prior data integration experience to measure the system’s impact on user performance and effort. As a baseline, we compared DataSpeck against Microsoft Excel with Copilot, which represented the counterpart human-driven, AI-in-the-loop approach where users infer the semantic relationships on their own, and then use natural language interactions to design the transformations. Participants achieved significantly higher performance and efficiency in the given data conversion tasks, completing tasks 53% faster on average using DataSpeck, and reported significantly lower levels of mental demand, effort, and frustration on the NASA-TLX scale. They found DataSpeck’s ability to automatically infer transformations from source-specification pairs highly usable and valuable for their professional settings, noting that such an interaction could transform the tedious task of manually analyzing datasets and writing migration scripts into simply reviewing system-generated strategies and answering clarifications, thereby significantly reducing manual effort and cognitive load.
We summarize our contributions as follows: (1) The design of DataSpeck, an end-to-end system that automates the transformation of disparate data sources into preexisting data specifications through an AI-driven, human-in-the-loop approach. (2) Technical findings demonstrating that DataSpeck successfully automates a comprehensive set of data transformation operations across diverse scenarios, while generating appropriate confidence scores and targeted clarification requests when human input is needed. (3) User performance results showing that DataSpeck’s AI-driven, human-in-the-loop approach enables users to complete data conversion tasks more efficiently while significantly reducing cognitive load across all NASA-TLX dimensions compared to human-driven AI-in-the-loop approaches.
(Conclusion) Data conversions often require significant manual effort to align diverse data sources to specific structural requirements. Traditional methods do not directly solve this problem, and using existing tools requires extensive manual effort; in contrast, DataSpeck employs an end-to-end AI-driven human-in-the-loop approach that understands the data, strategizes the conversion pipeline, and executes the transformation steps in a highly transparent manner, employing a tiered confidence-based human intervention mechanism when needed. A technical evaluation shows that DataSpeck was able to handle a large number of transformation tasks without human intervention, automate a large part of the pipeline for real-world end-to-end conversions, and successfully surface the need for human intervention for the rest. Our user study demonstrates that DataSpeck’s AI-driven human-in-the-loop approach is significantly more efficient in terms of time and effort compared to a familiar human-driven AI-assisted method. While fully human-driven methods offer complete control, they are often time-consuming and cognitively demanding. In contrast, a human-in-the-loop approach, where the AI automates the majority of transformations and requests user input for clarification or uncertainties, demonstrated potential to significantly enhance efficiency and reduce cognitive overhead for such data tasks.
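Data-driven systems and formats/structures (specifications)
Data-driven systems each have their own format or specification, optimized for their intended purpose.
However, when new data arrives from various external sources, structural and semantic mismatches with the existing specification arise constantly.
Resolving these mismatches requires manually designing a pipeline that maps and converts the data (data conversion) to fit the existing system. This consumes considerable time and effort, i.e., it is a bottleneck in data integration workflows.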
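Trade-offs of existing techniques
Automatic schema matching systems: they are good at finding links (match candidates) between the fields (columns) of the source schema and the target schema, but the semantic relationships (how exactly the data should be merged, split, unit-converted, and so on) and the actual transformation logic still require manual work (coding and definition) by the user.
Programming by example (PBE) and demonstration-based systems: showing examples produces the code automatically, which reduces manual coding, but the user must fully understand the data structure and construct accurate input/output examples themselves, so the cognitive demand remains.
Natural language interfaces: they perform data transformations from verbal commands, but the user must give instructions in extremely precise natural-language phrasing (prompt engineering), so efficiency drops when the dataset is complex.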
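The unbridged semantic gap in data conversion
Bridging the semantic gap means having the reasoning to recognize that dataset field names (e.g., buyer, customer) or data shapes (e.g., a full-address string versus separate postal-code/city fields) that differ on the surface carry essentially the same meaning, and to connect them.
Limited semantic understanding in existing AI: general-purpose AI assistants are good at producing single-row formulas or helping with simple formatting. However, they cannot autonomously generate the higher-level relationships and mapping strategies among data scattered across multiple tables (e.g., classifying packaging conditions by product weight, applying exchange rates), i.e., the user must lead the transformation instructions from start to finish, so work stalls on an unfamiliar or very large dataset even with AI available.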
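DataSpeck and human-in-the-loop
With DataSpeck, the user does not need to provide examples or issue step-by-step instructions: given a pair consisting of a source dataset and a target specification, the AI proactively generates the data transformation scripts and pipeline on its own. It adopts an AI-driven human-in-the-loop design, asking the human questions only at failure points where the meaning is ambiguous.
Analysing dataset descriptors (data summary analysis): since a large dataset cannot be fed to the LLM in full, the system extracts structural features of the data (samples, missing values, statistics) and then uses the LLM to autonomously generate compact semantic descriptors capturing each column's meaning, data type, unit, and format.
Establishing semantic relationships: the system compares the descriptors of the source and the target specification and formulates, on its own, a set of natural-language transformation strategies describing how each column of the target specification can be derived from the source data.
Tiered confidence-based human intervention: when ambiguities arise while the AI is autonomously drafting strategies, it classifies them by confidence into three tiers: confident, assuming, and insufficient. For assuming, it forms an assumption and then asks for confirmation; only for insufficient does it send the user a clarification request to resolve the ambiguity.
Dynamic step/hybrid code generation: based on the established strategies, the system builds executable Python pipeline code one step at a time. Where a rule is hard to express precisely, it takes a hybrid approach and inserts a function (ask_llm) that calls the LLM per row to complete the pipeline.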
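The first and third components above can be illustrated with a minimal sketch. It is not DataSpeck's implementation: the names profile_column, Strategy, and route, the fields of the profile, and the action labels returned by route are all hypothetical, chosen only to show what a compact structural profile and a three-tier routing rule might look like. The LLM calls themselves (descriptor generation and ask_llm) are deliberately left out.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Dict, List, Optional


def profile_column(name: str, values: List[Any], n_samples: int = 3) -> Dict[str, Any]:
    """Summarise one column so that a compact profile, not the full data, is sent to an LLM."""
    # Treat None and empty strings as missing values (an assumption of this sketch).
    present = [v for v in values if v is not None and v != ""]
    profile: Dict[str, Any] = {
        "name": name,
        "samples": present[:n_samples],
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }
    numeric = [v for v in present if isinstance(v, (int, float)) and not isinstance(v, bool)]
    # Add basic statistics only when every present value is numeric.
    if present and len(numeric) == len(present):
        profile.update(min=min(numeric), max=max(numeric), mean=mean(numeric))
    return profile


@dataclass
class Strategy:
    """A natural-language transformation strategy for one target column (hypothetical shape)."""
    target_column: str
    description: str
    tier: str  # "confident", "assuming", or "insufficient"
    note: Optional[str] = None  # the assumption made, or the question to ask


def route(strategy: Strategy) -> str:
    """Map a confidence tier to the interaction it triggers."""
    if strategy.tier == "confident":
        return "apply"  # proceed without asking the user
    if strategy.tier == "assuming":
        return "confirm_assumption"  # state the assumption, ask for confirmation
    if strategy.tier == "insufficient":
        return "request_clarification"  # ask the user for the missing context
    raise ValueError(f"unknown tier: {strategy.tier!r}")


if __name__ == "__main__":
    print(profile_column("weight_kg", [1.2, None, 3.4, 0.8]))
    s = Strategy("price_usd", "multiply price by an exchange rate", "assuming",
                 note="source prices are assumed to be in KRW")
    print(route(s))
```

Keeping the routing rule separate from strategy generation means the decision of when to interrupt the user can be inspected and tested without calling an LLM.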
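Results of the technical evaluation and user study
Technical performance validation: automation was tested on 43 isolated transformation scenarios and 5 complex real-world Kaggle dataset pairs. About 86% to 90% of all data transformation operations were fully automated without human intervention, and the confidence-based questioning mechanism was also shown to work accurately. One limitation was found: the system fails to notice implicit rule patterns, such as data filtering or sorting, that are not visible on the surface.
User study results (vs. MS Excel with Copilot): the experiment used as its baseline a human-driven, AI-assisted approach (Excel + Copilot), in which the user must work out the data relationships and issue commands to the AI.
Task efficiency: participants using DataSpeck completed tasks 53% faster on average.
Reduced workload: on NASA-TLX (a workload assessment measure), cognitive load decreased significantly across all dimensions, including mental demand, effort, and frustration. Instead of analyzing the dataset directly and writing migration scripts, users shifted to reviewing the strategies drafted by the AI (acting as reviewers) and answering only the ambiguous questions.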
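NASA-TLX (NASA Task Load Index)
A widely used questionnaire-based assessment tool developed by NASA to measure how mentally taxing a task was for the person performing it (mental workload).
Rather than asking a blanket question such as "Is this system easy to use?", it breaks a person's cognitive strain and stress into six specific dimensions and scores each one (0 to 100).
Mental Demand: How much mental activity was required? (thinking, calculating, remembering, etc.)
Physical Demand: How much physical activity was required? (pushing and pulling, pinch manipulation, typing, etc.)
Temporal Demand: How much time pressure or hurry was felt?
Performance: How successful do I feel I was in accomplishing the goal?
Effort: How hard did I have to work, mentally and physically, to achieve the desired performance?
Frustration: How irritated, stressed, or discouraged did I feel during the task?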
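As a minimal sketch of how the six ratings might be combined into one score, the following computes their unweighted mean, a variant commonly called Raw TLX. The notes do not say which scoring variant the study used, so this choice is an assumption, as are the function name raw_tlx and the dictionary keys. The sketch also assumes every rating, including Performance, is already oriented so that a higher value means a higher workload.

```python
from statistics import mean
from typing import Dict

# Keys for the six NASA-TLX subscales (names are illustrative).
DIMENSIONS = ("mental", "physical", "temporal", "performance", "effort", "frustration")


def raw_tlx(ratings: Dict[str, float]) -> float:
    """Unweighted mean of the six subscale ratings (each 0-100, higher = more workload)."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    for d in DIMENSIONS:
        if not 0 <= ratings[d] <= 100:
            raise ValueError(f"{d} rating out of range: {ratings[d]}")
    return float(mean(ratings[d] for d in DIMENSIONS))


if __name__ == "__main__":
    participant = {"mental": 60, "physical": 10, "temporal": 40,
                   "performance": 20, "effort": 50, "frustration": 30}
    print(raw_tlx(participant))
```

Comparing two conditions, such as the two tools in the study, would then amount to comparing these per-participant scores, or the individual subscale ratings, across conditions.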